Analytical DBs

AWS Glue

AWS Glue is a serverless data integration service that makes it easy for analytics users to discover, prepare, move, and integrate data from multiple sources. You can use it for analytics, machine learning, and application development. It also includes additional productivity and data ops tooling for authoring, running jobs, and implementing business workflows.
With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog. You can visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes. Also, you can immediately search and query cataloged data using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
AWS Glue consolidates major data integration capabilities into a single service, including data discovery, modern ETL, cleansing, transforming, and centralized cataloging. It's also serverless, so there's no infrastructure to manage, and it supports ETL, ELT, and streaming workloads in one service.
AWS Glue also makes it easy to integrate data across your architecture. It integrates with AWS analytics services and Amazon S3 data lakes, and its integration interfaces and job-authoring tools are approachable for everyone from developers to business users, with tailored solutions for varied technical skill sets.
[Image: ETL, ELT & Streaming in one service]

Characteristics

Automatic Data Discovery and Profiling:
AWS Glue automatically discovers and profiles your data and registers it in the AWS Glue Data Catalog.
It recommends and generates ETL code to transform source data into target schemas.
Managed Apache Spark Environment:
Runs ETL jobs on a fully managed, scale-out Apache Spark environment.
Loads data into its destination efficiently.
Orchestration and Monitoring:
Allows setup, orchestration, and monitoring of complex data flows.
Provides a flexible scheduler for dependency resolution, job monitoring, and retries.
Easy-to-use Interface:
Create and run ETL jobs with a few clicks in the AWS Management Console.
Simply point AWS Glue to your data stored on AWS for automatic discovery and cataloging.
Components:
Data Catalog: Central metadata repository storing table definitions and schemas.
ETL Engine: Automatically generates Scala or Python code for ETL transformations.
Scheduler: Handles dependency resolution, job monitoring, and retries.
Automated Data Processing:
Automates discovery, categorization, cleaning, enriching, and moving data.
Reduces undifferentiated heavy lifting, allowing more time for data analysis.
Machine Learning Transform:
Provides a Machine Learning Transform called FindMatches for deduplication and finding matching records.
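Putting a few of these pieces together, a Spark ETL job is ultimately just a job definition handed to the AWS Glue API. Below is a minimal sketch of such a definition via boto3; the job name, script path, and role ARN are placeholders, not values from this document:

```python
def build_glue_job_config(name: str, script_location: str, role_arn: str) -> dict:
    """Build a payload for glue.create_job() defining a Spark ETL job.

    All identifiers passed in are placeholders for illustration.
    """
    return {
        "Name": name,
        "Role": role_arn,                       # IAM role the job assumes
        "Command": {
            "Name": "glueetl",                  # the managed Spark ETL engine
            "ScriptLocation": script_location,  # S3 path to the generated or custom script
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": 2,
        "MaxRetries": 1,                        # the scheduler handles retries
    }

config = build_glue_job_config(
    "demo-etl-job",
    "s3://example-bucket/scripts/transform.py",
    "arn:aws:iam::123456789012:role/GlueServiceRole",
)
# With AWS credentials configured, the job would be created with:
# import boto3
# boto3.client("glue").create_job(**config)
```

The boto3 call itself is commented out so the sketch stays runnable without an AWS account; only the payload shape is illustrated.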

Data sources and destinations

AWS Glue for Spark can read data from and write data to multiple systems and databases, including:
Amazon S3
Amazon DynamoDB
Amazon Redshift
Amazon Relational Database Service (Amazon RDS)
Third-party JDBC-accessible databases
MongoDB and Amazon DocumentDB (with MongoDB compatibility)
Other marketplace connectors and Apache Spark plugins
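In a Glue Spark script, each of these sources is addressed through `create_dynamic_frame.from_options` with a `connection_type` and matching `connection_options`. A sketch of the option shapes for two common cases follows; the paths, URL, and credentials are placeholders:

```python
def s3_read_options(paths: list, fmt: str = "parquet") -> dict:
    """Options for reading files from Amazon S3."""
    return {
        "connection_type": "s3",
        "connection_options": {"paths": paths},
        "format": fmt,
    }

def jdbc_read_options(url: str, table: str, user: str, password: str) -> dict:
    """Options for a JDBC-accessible database (e.g. Amazon RDS)."""
    return {
        "connection_type": "jdbc",
        "connection_options": {
            "url": url,            # e.g. jdbc:postgresql://host:5432/db
            "dbtable": table,
            "user": user,
            "password": password,
        },
    }

s3_opts = s3_read_options(["s3://example-bucket/input/"])
# Inside a Glue job, these would be passed to:
# glue_context.create_dynamic_frame.from_options(**s3_opts)
```

Swapping the source is then a matter of swapping the options dict, which is why one script can serve many of the systems listed above.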

Data streams

AWS Glue for Spark can stream data from the following systems:
Amazon Kinesis Data Streams
Apache Kafka
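A streaming source is described the same way, with connection options. Here is a hedged sketch for Kinesis; the stream ARN is a placeholder, and the option names follow the Kinesis connection options documented for Glue streaming jobs:

```python
# Placeholder stream ARN; option names assume Glue's Kinesis streaming connector.
kinesis_options = {
    "connection_type": "kinesis",
    "connection_options": {
        "streamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/demo-stream",
        "classification": "json",          # record format of the stream
        "startingPosition": "TRIM_HORIZON",  # read from the oldest available record
        "inferSchema": "true",
    },
}
# In a streaming Glue job, this would feed:
# glue_context.create_data_frame.from_options(**kinesis_options)
```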

AWS Glue features

AWS Glue features fall into three major categories:
Discover and organize data
Transform, prepare, and clean data for analysis
Build and monitor data pipelines

Discover and organize data

Unify and search across multiple data stores – Store, index, and search across multiple data sources and sinks by cataloging all your data in AWS.
Automatically discover data – Use AWS Glue crawlers to automatically infer schema information and integrate it into your AWS Glue Data Catalog.
Manage schemas and permissions – Validate and control access to your databases and tables.
Connect to a wide variety of data sources – Tap into multiple data sources, both on premises and on AWS, using AWS Glue connections to build your data lake.

Transform, prepare, and clean data for analysis

Visually transform data with a job canvas interface – Define your ETL process in the visual job editor and automatically generate the code to extract, transform, and load your data.
Build complex ETL pipelines with simple job scheduling – Invoke AWS Glue jobs on a schedule, on demand, or based on an event.
Clean and transform streaming data in transit – Consume data continuously, cleaning and transforming it in transit so it's available for analysis in your target data store within seconds.
Deduplicate and cleanse data with built-in machine learning – Clean and prepare your data for analysis without becoming a machine learning expert by using the FindMatches feature. This feature deduplicates and finds records that are imperfect matches for each other.
Built-in job notebooks – AWS Glue job notebooks provide serverless notebooks with minimal setup in AWS Glue so you can get started quickly.
Edit, debug, and test ETL code – With AWS Glue interactive sessions, you can interactively explore and prepare data. You can explore, experiment on, and process data interactively using the IDE or notebook of your choice.
Define, detect, and remediate sensitive data – AWS Glue sensitive data detection lets you define, identify, and process sensitive data in your data pipeline and in your data lake.
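The FindMatches feature mentioned above is configured as an ML transform through the Glue API. A hedged sketch of the payload shape follows; the database, table, key column, and role are placeholders, and the parameter names assume boto3's `create_ml_transform`:

```python
def build_find_matches_transform(database: str, table: str,
                                 key_column: str, role_arn: str) -> dict:
    """Payload sketch for glue.create_ml_transform() using FindMatches.

    All names are placeholders. PrecisionRecallTradeoff (0.0-1.0) biases the
    model toward precision (fewer false matches) or recall (fewer missed ones).
    """
    return {
        "Name": "dedupe-customers",
        "Role": role_arn,
        "InputRecordTables": [{"DatabaseName": database, "TableName": table}],
        "Parameters": {
            "TransformType": "FIND_MATCHES",
            "FindMatchesParameters": {
                "PrimaryKeyColumnName": key_column,
                "PrecisionRecallTradeoff": 0.9,
            },
        },
    }

transform = build_find_matches_transform(
    "sales_db", "customers", "customer_id",
    "arn:aws:iam::123456789012:role/GlueRole",
)
# boto3.client("glue").create_ml_transform(**transform)  # requires AWS credentials
```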

Build and monitor data pipelines

Automatically scale based on workload – Dynamically scale resources up and down based on workload. This assigns workers to jobs only when needed.
Automate jobs with event-based triggers – Start crawlers or AWS Glue jobs with event-based triggers, and design a chain of dependent jobs and crawlers.
Run and monitor jobs – Run AWS Glue jobs with your choice of engine, Spark or Ray. Monitor them with automated monitoring tools, AWS Glue job run insights, and AWS CloudTrail. Improve your monitoring of Spark-backed jobs with the Apache Spark UI.
Define workflows for ETL and integration activities – Define workflows for ETL and integration activities for multiple crawlers, jobs, and triggers.
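Scheduled and event-based invocation is handled through triggers. As a sketch, a scheduled trigger payload for boto3's `create_trigger` might look like this (the job name and cron expression are placeholders):

```python
def build_scheduled_trigger(job_name: str, cron: str) -> dict:
    """Payload sketch for glue.create_trigger() that runs a job on a schedule."""
    return {
        "Name": f"{job_name}-nightly",
        "Type": "SCHEDULED",                 # alternatives: ON_DEMAND, CONDITIONAL (event-based)
        "Schedule": cron,                    # e.g. cron(0 2 * * ? *) = 02:00 UTC daily
        "Actions": [{"JobName": job_name}],  # jobs (or crawlers) to start
        "StartOnCreation": True,
    }

trigger = build_scheduled_trigger("demo-etl-job", "cron(0 2 * * ? *)")
# boto3.client("glue").create_trigger(**trigger)  # requires AWS credentials
```

A `CONDITIONAL` trigger with a predicate on a predecessor job's state is how chains of dependent jobs and crawlers are built.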

AWS Glue simplifies and automates the data preparation process, making it easier to perform analytics tasks efficiently and effectively.

AWS Glue Crawlers

AWS Glue also lets you set up crawlers that can scan data in all kinds of repositories, classify it, extract schema information from it, and store the metadata automatically in the AWS Glue Data Catalog. The AWS Glue Data Catalog can then be used to guide ETL operations.
Populating Data Catalog:
Crawlers populate the AWS Glue Data Catalog with tables.
A single crawler run can crawl multiple data stores.
On completion, the crawler creates or updates tables in the Data Catalog.
Schema Inference:
Crawlers connect to source or target data stores.
They use classifiers to determine the schema of the data.
They create metadata in the AWS Glue Data Catalog for use in ETL job authoring.
Scheduling and Triggers:
Run crawlers on a schedule, on-demand, or based on an event to keep metadata up-to-date.
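Putting the crawler pieces together, a sketch of a `create_crawler` payload might look as follows; the bucket path, role, database name, and schedule are placeholders:

```python
def build_crawler_config(name: str, role_arn: str,
                         database: str, s3_path: str) -> dict:
    """Payload sketch for glue.create_crawler() targeting an S3 prefix."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,               # catalog database receiving the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Schedule": "cron(0 3 * * ? *)",        # daily run to keep metadata up to date
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",  # update tables on schema change
            "DeleteBehavior": "LOG",                 # log rather than drop removed objects
        },
    }

crawler = build_crawler_config(
    "raw-data-crawler",
    "arn:aws:iam::123456789012:role/GlueRole",
    "raw_db",
    "s3://example-bucket/raw/",
)
# boto3.client("glue").create_crawler(**crawler)  # requires AWS credentials
```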