AWS Glue

Simple, flexible, and cost-effective ETL

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You point AWS Glue to your data stored on AWS, and AWS Glue discovers it and stores the associated metadata (e.g., table definitions and schemas) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL.


Benefits

Less hassle

AWS Glue is integrated across a wide range of AWS services, meaning less hassle for you when onboarding. AWS Glue natively supports data stored in Amazon Aurora and all other Amazon RDS engines, Amazon Redshift, and Amazon S3, as well as common database engines and databases in your Virtual Private Cloud (Amazon VPC) running on Amazon EC2.

Cost effective

AWS Glue is serverless. There is no infrastructure to provision or manage. AWS Glue handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. You pay only for the resources used while your jobs are running.

More power

AWS Glue automates much of the effort in building, maintaining, and running ETL jobs. AWS Glue crawls your data sources, identifies data formats, and suggests schemas and transformations. AWS Glue automatically generates the code to execute your data transformations and loading processes.

How it works


Step 1: Build your Data Catalog

First, use the AWS Management Console to register your data sources. AWS Glue will crawl your data sources and construct your Data Catalog using pre-built classifiers for many popular source formats and data types, including JSON, CSV, Parquet, and more.
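Registering a data source can also be scripted rather than done in the Console. The following is a minimal sketch, assuming the boto3 Glue client and its `create_crawler` and `start_crawler` calls; the crawler name, IAM role ARN, database name, and S3 path are hypothetical placeholders, not values from this page.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Return keyword arguments for the Glue CreateCrawler API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def register_and_crawl(glue_client, request):
    """Create the crawler, then start it so it populates the Data Catalog."""
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])


# Hypothetical values for illustration only.
request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)

# With AWS credentials configured, one would run:
#   import boto3
#   register_and_crawl(boto3.client("glue"), request)
```

Passing the client in as an argument keeps the request-building logic testable without an AWS account.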

Step 2: Generate and Edit Transformations

Next, select a data source and data target. AWS Glue will generate ETL code in Scala or Python to extract data from the source, transform the data to match the target schema, and load it into the target. You can edit, debug, and test this code via the Console, in your favorite IDE, or in any notebook.
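To make "transform the data to match the target schema" concrete, here is a small plain-Python illustration of a field mapping: each source field is renamed and cast to the type the target expects. This is not the code AWS Glue generates and it does not use the Glue libraries; the field names, types, and the `apply_mapping` helper are hypothetical.

```python
# Each mapping is (source_field, target_field, target_type).
# Plain Python for illustration only, not Glue-generated code.
MAPPINGS = [
    ("id", "customer_id", int),
    ("name", "customer_name", str),
    ("total", "order_total", float),
]


def apply_mapping(record, mappings):
    """Rename and cast the fields of one record to match the target schema."""
    out = {}
    for source, target, cast in mappings:
        value = record.get(source)
        # Missing or null source values stay null in the target.
        out[target] = cast(value) if value is not None else None
    return out


source_rows = [
    {"id": "1", "name": "Ana", "total": "19.90"},
    {"id": "2", "name": "Raj", "total": None},
]
target_rows = [apply_mapping(row, MAPPINGS) for row in source_rows]
print(target_rows)
```

A generated job would apply the same idea at scale on Apache Spark rather than row by row in a list.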

Step 3: Schedule and Run Your Jobs

AWS Glue makes it easy to schedule recurring ETL jobs, chain multiple jobs together, or invoke jobs on-demand from other services like AWS Lambda. AWS Glue manages the dependencies between your jobs, automatically scales underlying resources, and retries jobs if they fail.
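Scheduling and chaining can be expressed as triggers. The sketch below builds the keyword arguments one might pass to the boto3 Glue client's `create_trigger` call: one trigger that runs a job on a cron schedule, and one that starts a second job after the first succeeds. The trigger names, job names, and schedule are hypothetical.

```python
def scheduled_trigger(name, job_name, cron):
    """Keyword arguments for CreateTrigger that run a job on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def chained_trigger(name, upstream_job, downstream_job):
    """Keyword arguments that start one job after another job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


# Hypothetical names; the cron expression means 02:00 UTC every day.
nightly = scheduled_trigger("nightly-load", "load-sales", "cron(0 2 * * ? *)")
then_report = chained_trigger("after-load", "load-sales", "build-report")

# With AWS credentials configured, one would run:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_trigger(**nightly)
#   glue.create_trigger(**then_report)
```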

Visit the AWS Glue features page, or refer to our product documentation to learn more.

Use cases

Queries Against an Amazon S3 Data Lake

Data lakes are an increasingly popular way to store and analyze both structured and unstructured data. If you want to build your own custom Amazon S3 data lake, AWS Glue can make all your data immediately available for analytics without moving the data.

To build a secure data lake in days, learn more about AWS Lake Formation.


Analyze Log Data in Your Data Warehouse

Prepare your clickstream or process log data for analytics by cleaning, normalizing, and enriching your data sets using AWS Glue. AWS Glue generates the schema for your semi-structured data, creates ETL code to transform, flatten, and enrich your data, and loads your data warehouse on a recurring basis.


Unified View of Your Data Across Multiple Data Stores

You can use the AWS Glue Data Catalog to quickly discover and search across multiple AWS data sets without moving the data. Once the data is cataloged, it is immediately available for search and query using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
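Discovery can also be done programmatically. As a minimal sketch, assuming the boto3 Glue client's `get_tables` call and its `NextToken` pagination, the hypothetical helper below lists every table name registered in one Data Catalog database.

```python
def list_table_names(glue_client, database):
    """Collect every table name in a Data Catalog database, following NextToken."""
    names = []
    kwargs = {"DatabaseName": database}
    while True:
        page = glue_client.get_tables(**kwargs)
        names.extend(table["Name"] for table in page.get("TableList", []))
        token = page.get("NextToken")
        if not token:
            return names
        # Ask for the next page of results on the following call.
        kwargs["NextToken"] = token


# With AWS credentials configured, one would run (database name is hypothetical):
#   import boto3
#   print(list_table_names(boto3.client("glue"), "sales_db"))
```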


Event-driven ETL Pipelines

AWS Glue can run your ETL jobs in response to an event, such as the arrival of a new data set. For example, you can use an AWS Lambda function to trigger your ETL jobs as soon as new data becomes available in Amazon S3. You can also register the new data set in the AWS Glue Data Catalog as part of your ETL jobs.
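One way to wire this up is a Lambda function subscribed to S3 object-created notifications. The sketch below assumes the standard S3 event notification layout and the boto3 Glue client's `start_job_run` call; the job name and the `--source_bucket` and `--source_key` job arguments are hypothetical, and the job script would need to read them itself.

```python
from urllib.parse import unquote_plus

JOB_NAME = "process-new-data"  # hypothetical Glue job name


def start_jobs_for_event(event, glue_client, job_name=JOB_NAME):
    """Start one Glue job run per S3 object listed in the event notification."""
    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=job_name,
            # Hypothetical job arguments passed through to the job script.
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        run_ids.append(response["JobRunId"])
    return run_ids


def lambda_handler(event, context):
    """Lambda entry point; boto3 is provided by the Lambda Python runtime."""
    import boto3

    return start_jobs_for_event(event, boto3.client("glue"))
```

Keeping the logic in `start_jobs_for_event` and injecting the client makes the handler easy to exercise locally with a stub.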


Get started with AWS


Sign up for an AWS account

Instantly get access to the AWS Free Tier.

Learn with 10-minute Tutorials

Explore and learn with simple tutorials.

Start building with AWS

Begin building with step-by-step guides to help you launch your AWS project.
