Integrated data catalog

The AWS Glue Data Catalog is your persistent metadata store for all your data assets, regardless of where they are located. The Data Catalog contains table definitions, job definitions, and other control information to help you manage your AWS Glue environment. It automatically computes statistics and registers partitions to make queries against your data efficient and cost-effective. It also maintains a comprehensive schema version history so you can understand how your data has changed over time.
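
As a rough illustration of reading table metadata back out of the Data Catalog, the sketch below summarizes one table definition. It assumes a client object shaped like the boto3 Glue client (for example `boto3.client("glue")`) and its `get_table` call. The function name `describe_table` and the database and table names in the comments are hypothetical.

```python
def describe_table(glue, database, table):
    """Summarize one Data Catalog table definition.

    `glue` is assumed to behave like boto3.client("glue"); it is passed in
    so this sketch does not need AWS credentials to be imported or tested.
    """
    response = glue.get_table(DatabaseName=database, Name=table)
    definition = response["Table"]
    storage = definition.get("StorageDescriptor", {})
    return {
        # Where the underlying data lives, e.g. an S3 prefix
        "location": storage.get("Location"),
        # Column names and types
        "columns": [(c["Name"], c.get("Type")) for c in storage.get("Columns", [])],
        # Partition keys are stored separately from regular columns
        "partition_keys": [k["Name"] for k in definition.get("PartitionKeys", [])],
    }


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   print(describe_table(boto3.client("glue"), "sales_db", "orders"))
```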

Automatic schema discovery

AWS Glue crawlers connect to your source or target data store, progress through a prioritized list of classifiers to determine the schema for your data, and then create metadata in your AWS Glue Data Catalog. The metadata is stored in tables in your Data Catalog and used in the authoring process of your ETL jobs. You can run crawlers on a schedule, on-demand, or trigger them based on an event to ensure that your metadata is up to date.
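
A minimal sketch of defining a scheduled crawler over an S3 path follows. It assumes a client shaped like the boto3 Glue client and its `create_crawler` and `start_crawler` calls. The helper names, the role ARN, the bucket path, and the cron expression in the comments are illustrative assumptions, not values from this page.

```python
def crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for a Glue CreateCrawler call."""
    request = {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes
        "DatabaseName": database,    # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        # Glue schedules use a six-field cron expression, e.g. "cron(0 2 * * ? *)"
        request["Schedule"] = schedule
    return request


def create_and_start_crawler(glue, request):
    """Create the crawler, then run it once on demand.

    `glue` is assumed to behave like boto3.client("glue").
    """
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   req = crawler_request(
#       "orders-crawler",
#       "arn:aws:iam::123456789012:role/ExampleGlueRole",
#       "sales_db",
#       "s3://example-bucket/orders/",
#       schedule="cron(0 2 * * ? *)",
#   )
#   create_and_start_crawler(boto3.client("glue"), req)
```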

Code generation

AWS Glue automatically generates the code to extract, transform, and load your data. Simply point AWS Glue to your data source and target, and AWS Glue creates ETL scripts to transform, flatten, and enrich your data. The code is generated in Scala or Python and written for Apache Spark.

Clean and deduplicate data

AWS Glue helps clean and prepare your data for analysis by providing a machine learning transform called FindMatches for deduplication and finding matching records. For example, use FindMatches to find duplicate records in your database of restaurants, such as when one record lists “Joe's Pizza” at “121 Main St.” and another shows “Joseph's Pizzeria” at “121 Main”. You don't need to know anything about machine learning to do this. FindMatches asks you to label sets of records as either “matching” or “not matching”. The system then learns your criteria for calling a pair of records a “match” and builds an ML transform that you can use to find duplicate records within a database or matching records across two databases.
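
The sketch below shows one way a FindMatches transform might be defined programmatically, assuming the boto3 Glue client's `create_ml_transform` call. The helper name, the default tradeoff value, and the table and column names in the comments are hypothetical. Labeling the example records, as described above, is a separate step that this sketch does not cover.

```python
def find_matches_request(name, role_arn, database, table, primary_key,
                         precision_recall=0.5):
    """Build keyword arguments for a Glue CreateMLTransform call (FindMatches)."""
    if not 0.0 <= precision_recall <= 1.0:
        raise ValueError("precision_recall must be between 0.0 and 1.0")
    return {
        "Name": name,
        "Role": role_arn,
        # Catalog table holding the records to deduplicate
        "InputRecordTables": [{"DatabaseName": database, "TableName": table}],
        "Parameters": {
            "TransformType": "FIND_MATCHES",
            "FindMatchesParameters": {
                # Column that uniquely identifies each record
                "PrimaryKeyColumnName": primary_key,
                # Values nearer 1.0 favor precision, nearer 0.0 favor recall
                "PrecisionRecallTradeoff": precision_recall,
            },
        },
    }


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   req = find_matches_request(
#       "restaurant-dedupe",
#       "arn:aws:iam::123456789012:role/ExampleGlueRole",
#       "places_db", "restaurants", "restaurant_id",
#   )
#   boto3.client("glue").create_ml_transform(**req)
```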

Development endpoints

If you choose to interactively develop your ETL code, AWS Glue provides development endpoints for you to edit, debug, and test the code it generates for you. You can use your favorite IDE or notebook. You can write custom readers, writers, or transformations and import them into your AWS Glue ETL jobs as custom libraries. You can also use and share code with other developers in our GitHub repository.

Flexible job scheduler

AWS Glue jobs can be invoked on a schedule, on-demand, or based on an event. You can start multiple jobs in parallel or specify dependencies across jobs to build complex ETL pipelines. AWS Glue will handle all inter-job dependencies, filter bad data, and retry jobs if they fail. All logs and notifications are pushed to Amazon CloudWatch so you can monitor and get alerts from a central service.
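
To make the scheduling and dependency options concrete, here is a hedged sketch that builds two trigger definitions: one that starts a job on a cron schedule, and one that starts a downstream job only after upstream jobs succeed. It assumes the boto3 Glue client's `create_trigger` call. The helper names and the job names in the comments are hypothetical.

```python
def scheduled_trigger(name, job_name, cron):
    """Trigger definition that starts one job on a cron schedule."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": cron,                    # e.g. "cron(0 2 * * ? *)"
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def dependent_trigger(name, upstream_jobs, downstream_job):
    """Trigger definition that starts a job after all upstream jobs succeed."""
    predicate = {
        "Conditions": [
            {"LogicalOperator": "EQUALS", "JobName": job, "State": "SUCCEEDED"}
            for job in upstream_jobs
        ]
    }
    if len(upstream_jobs) > 1:
        # Require every upstream job to succeed, not just one of them
        predicate["Logical"] = "AND"
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": predicate,
        "Actions": [{"JobName": downstream_job}],
        "StartOnCreation": True,
    }


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_trigger(**scheduled_trigger("nightly", "extract-orders", "cron(0 2 * * ? *)"))
#   glue.create_trigger(**dependent_trigger("after-extract", ["extract-orders"], "load-orders"))
```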

Serverless streaming ETL

Serverless streaming ETL in AWS Glue makes it easy to set up continuous ingestion pipelines that prepare streaming data on the fly and make it available for analysis in seconds. These jobs can consume data from streaming sources like Amazon Kinesis and Apache Kafka, clean and transform those data streams in-flight, and continuously load the results into Amazon S3 data lakes, data warehouses, and other data stores. Use this feature to process event data like IoT event streams, clickstreams, and network logs. AWS Glue streaming ETL jobs can enrich and aggregate data, join batch and streaming sources, and run a variety of complex analytics and machine learning operations.
