AWS Glue is a serverless data integration service that makes it easy to prepare data for analytics, machine learning, and application development. AWS Glue provides all the capabilities needed for data integration, so you can gain insights and put your data to use in minutes instead of months. With AWS Glue, there is no infrastructure to set up or manage. You pay only for the resources consumed while your jobs are running.
Discover and search across all your AWS data sets
The AWS Glue Data Catalog is your persistent metadata store for all your data assets, regardless of where they are located. The Data Catalog contains table definitions, job definitions, schemas, and other control information to help you manage your AWS Glue environment. It automatically computes statistics and registers partitions to make queries against your data efficient and cost-effective. It also maintains a comprehensive schema version history so you can understand how your data has changed over time.
Automatic schema discovery
AWS Glue crawlers connect to your source or target data store, progress through a prioritized list of classifiers to determine the schema for your data, and then create metadata in your AWS Glue Data Catalog. The metadata is stored in tables in your Data Catalog and used in the authoring process of your ETL jobs. You can run crawlers on a schedule, on demand, or trigger them based on an event to ensure that your metadata is up to date.
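To make the "prioritized list of classifiers" idea concrete, here is a minimal sketch in plain Python. It is not the crawler's implementation: the classifier functions, the `CLASSIFIERS` list, and the `infer_schema` helper are hypothetical names invented for illustration, and the type inference is deliberately crude. It only shows the control flow, where classifiers are tried in priority order and the first one that recognizes the data supplies the schema.

```python
import csv
import json
from typing import Callable, Optional

# Hypothetical classifier signature: return a schema (column -> type name)
# when the sample is recognized, or None when it is not.
Classifier = Callable[[str], Optional[dict]]


def classify_json(sample: str) -> Optional[dict]:
    """Recognize a single JSON object and infer a type name per key."""
    try:
        record = json.loads(sample)
    except ValueError:
        return None
    if not isinstance(record, dict):
        return None
    return {key: type(value).__name__ for key, value in record.items()}


def classify_csv(sample: str) -> Optional[dict]:
    """Recognize comma-separated text with a header row (all columns as str)."""
    rows = list(csv.reader(sample.splitlines()))
    if len(rows) < 2 or len(rows[0]) < 2:
        return None
    if any(len(row) != len(rows[0]) for row in rows[1:]):
        return None
    return {name: "str" for name in rows[0]}


# Priority order matters: earlier classifiers win.
CLASSIFIERS = [("json", classify_json), ("csv", classify_csv)]


def infer_schema(sample: str, classifiers=CLASSIFIERS):
    """Try each classifier in order; the first match determines the schema."""
    for name, classifier in classifiers:
        schema = classifier(sample)
        if schema is not None:
            return name, schema
    return "UNKNOWN", {}
```

For example, `infer_schema("id,name\n1,a\n2,b")` falls through the JSON classifier and is recognized by the CSV one, while text that no classifier recognizes comes back labeled `UNKNOWN`.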
Manage and enforce schemas for data streams
AWS Glue Schema Registry, a serverless feature of AWS Glue, enables you to validate and control the evolution of streaming data using registered Apache Avro schemas, at no additional charge. Through Apache-licensed serializers and deserializers, the Schema Registry integrates with Java applications developed for Apache Kafka, Amazon Managed Streaming for Apache Kafka (MSK), Amazon Kinesis Data Streams, Apache Flink, Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda. When data streaming applications are integrated with the Schema Registry, you can improve data quality and safeguard against unexpected changes using compatibility checks that govern schema evolution. Additionally, you can create or update AWS Glue tables and partitions using schemas stored within the registry.
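The compatibility checks mentioned above can be illustrated with a simplified sketch. The function below is a hypothetical helper, not the Schema Registry's algorithm: it assumes flat Avro-style record schemas expressed as Python dictionaries, and it only approximates a backward-compatibility rule (a reader using the new schema must still be able to read data written with the old one). Real Avro resolution also covers type promotions, unions, and nested records, which are omitted here.

```python
def backward_incompatibilities(old_schema: dict, new_schema: dict) -> list:
    """Return reasons the new schema cannot read data written with the old one.

    Simplified rules (an assumption of this sketch):
      * a field added in the new schema must declare a default value
      * a field present in both schemas must keep the same type
      * removing a field is allowed
    """
    old_fields = {field["name"]: field for field in old_schema["fields"]}
    problems = []
    for field in new_schema["fields"]:
        previous = old_fields.get(field["name"])
        if previous is None:
            if "default" not in field:
                problems.append(f"new field '{field['name']}' has no default")
        elif previous["type"] != field["type"]:
            problems.append(
                f"field '{field['name']}' changed type "
                f"from {previous['type']!r} to {field['type']!r}"
            )
    return problems


# Illustrative schemas; the record and field names are made up.
ORDER_V1 = {
    "type": "record",
    "name": "Order",
    "fields": [{"name": "id", "type": "string"}],
}
ORDER_V2_OK = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "note", "type": "string", "default": ""},
    ],
}
ORDER_V2_BAD = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}
```

An empty list means the change passes this simplified check. `ORDER_V2_BAD` is rejected because a reader expecting `amount` would find no value and no default in older records.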
Visually transform data with a drag-and-drop interface
AWS Glue Studio allows you to author highly scalable ETL jobs for distributed processing without becoming an Apache Spark expert. Define your ETL process in the drag-and-drop job editor and AWS Glue automatically generates the code to extract, transform, and load your data. The code is generated in Scala or Python and written for Apache Spark.
Build complex ETL pipelines with simple job scheduling
AWS Glue jobs can be invoked on a schedule, on-demand, or based on an event. You can start multiple jobs in parallel or specify dependencies across jobs to build complex ETL pipelines. AWS Glue will handle all inter-job dependencies, filter bad data, and retry jobs if they fail. All logs and notifications are pushed to Amazon CloudWatch so you can monitor and get alerts from a central service.
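The dependency handling and retry behavior described above can be pictured with a small local sketch. This is not how AWS Glue schedules jobs internally. It runs jobs sequentially rather than in parallel, and `run_pipeline`, the job names, and the `run_job` callback are hypothetical. It only shows the two ideas: a job starts after the jobs it depends on, and a failed job is retried a bounded number of times.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(dependencies, run_job, max_attempts=2):
    """Run jobs in dependency order, retrying each failed job.

    dependencies: maps a job name to the set of jobs it depends on.
    run_job:      callable(job_name, attempt) -> bool, True on success.
    """
    completed = []
    for job in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_attempts + 1):
            if run_job(job, attempt):
                completed.append(job)
                break
        else:
            # Every attempt failed: stop so downstream jobs never run.
            raise RuntimeError(f"{job} failed after {max_attempts} attempts")
    return completed


# Illustrative pipeline: extract -> clean -> load.
PIPELINE = {
    "extract_orders": set(),
    "clean_orders": {"extract_orders"},
    "load_orders": {"clean_orders"},
}


def flaky_runner(job, attempt):
    """Stand-in job runner: 'clean_orders' fails once, then succeeds."""
    return not (job == "clean_orders" and attempt == 1)
```

With `flaky_runner`, `run_pipeline(PIPELINE, flaky_runner)` retries `clean_orders` once and then continues to `load_orders`. A job that never succeeds raises an error, so its downstream jobs do not start.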
Clean and transform streaming data in-flight
Serverless streaming ETL jobs in AWS Glue continuously consume data from streaming sources including Amazon Kinesis and Amazon MSK, clean and transform it in-flight, and make it available for analysis in seconds in your target data store. Use this feature to process event data like IoT event streams, clickstreams, and network logs. AWS Glue streaming ETL jobs can enrich and aggregate data, join batch and streaming sources, and run a variety of complex analytics and machine learning operations.
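As a rough picture of what "clean and transform in-flight" followed by aggregation can mean, the sketch below processes a batch of clickstream events in plain Python. It is an illustration under assumptions, not AWS Glue streaming code: the event shape (`ts` in seconds, `page` as a string), the function names, and the 60-second window are invented for the example.

```python
from collections import Counter


def clean(events):
    """Drop malformed events and normalize the page field."""
    for event in events:
        ts, page = event.get("ts"), event.get("page")
        if isinstance(ts, (int, float)) and isinstance(page, str) and page.strip():
            yield {"ts": ts, "page": page.strip().lower()}


def count_by_window(events, window_seconds=60):
    """Count page views per (window start, page) using fixed-size windows."""
    counts = Counter()
    for event in clean(events):
        window_start = int(event["ts"] // window_seconds) * window_seconds
        counts[(window_start, event["page"])] += 1
    return dict(counts)
```

Events with a missing page or a non-numeric timestamp are discarded before aggregation, so one bad record does not affect the counts for its window.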
Combine and replicate data across multiple data stores using SQL
AWS Glue Elastic Views enables you to create views over data stored in multiple types of AWS data stores, and materialize the views in a target data store of your choice. You can use AWS Glue Elastic Views to create materialized views by writing queries in PartiQL. PartiQL is an open source SQL-compatible query language that you can use to query and manipulate data, regardless of whether the data has a tabular or a flexible, document-like structure. You can interactively write PartiQL queries using the query editor in the AWS Management Console or issue queries through the API or CLI.
AWS Glue Elastic Views supports Amazon DynamoDB as a source (with support for Amazon Aurora and Amazon RDS to follow), and Amazon Redshift, Amazon OpenSearch Service (successor to Amazon Elasticsearch Service), and Amazon S3 as targets (with support for Amazon Aurora, Amazon RDS, and Amazon DynamoDB to follow). You can speed up development time by sharing your materialized views with other users for use in their applications. AWS Glue Elastic Views monitors for changes to data in your source data stores continuously, and provides updates to your target data stores automatically.
Deduplicate and cleanse data with built-in machine learning
AWS Glue helps you clean and prepare your data for analysis without requiring you to become a machine learning expert. Its FindMatches feature deduplicates your data by finding records that are imperfect matches of each other. For example, use FindMatches to find duplicate records in your database of restaurants, such as when one record lists “Joe's Pizza” at “121 Main St.” and another shows a “Joseph's Pizzeria” at “121 Main”. FindMatches asks you only to label sets of records as either “matching” or “not matching.” The system then learns your criteria for calling a pair of records a “match” and builds an ETL job that you can use to find duplicate records within a database or matching records across two databases.
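To show what an "imperfect match" looks like in practice, here is a small stand-in using string similarity from the Python standard library. It is not the FindMatches algorithm: FindMatches learns its criteria from the records you label, whereas this sketch uses a fixed, hand-picked threshold. The function names, the record layout, the third sample record, and the `0.6` threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_duplicates(records, threshold=0.6):
    """Return index pairs whose name AND address are both similar enough."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            left, right = records[i], records[j]
            score = min(
                similarity(left["name"], right["name"]),
                similarity(left["address"], right["address"]),
            )
            if score >= threshold:
                pairs.append((i, j))
    return pairs


RESTAURANTS = [
    {"name": "Joe's Pizza", "address": "121 Main St."},
    {"name": "Joseph's Pizzeria", "address": "121 Main"},
    {"name": "Golden Dragon", "address": "9 Oak Ave"},  # made-up third record
]
```

On this sample, only the two pizza records are paired. A fixed threshold like this is brittle on real data, which is the gap that learning from labeled examples is meant to close.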
Edit, debug, and test ETL code with development endpoints
If you choose to interactively develop your ETL code, AWS Glue provides development endpoints for you to edit, debug, and test the code it generates for you. You can use your favorite IDE or notebook. You can write custom readers, writers, or transformations and import them into your AWS Glue ETL jobs as custom libraries. You can also use and share code with other developers in our GitHub repository.
Normalize data without code using a visual interface
AWS Glue DataBrew provides an interactive, point-and-click visual interface for users like data analysts and data scientists to clean and normalize data without writing code. You can easily visualize, clean, and normalize data directly from your data lake, data warehouses, and databases, including Amazon S3, Amazon Redshift, Amazon Aurora, and Amazon RDS. You can choose from over 250 built-in transformations to combine, pivot, and transpose the data, and automate data preparation tasks by applying saved transformations directly to the new incoming data.