AWS Glue is a serverless data preparation service that makes it easy for data engineers, extract-transform-load (ETL) developers, data analysts, and data scientists to extract, clean, enrich, normalize, and load data. AWS Glue reduces the time it takes to start analyzing your data from months to minutes.
Data preparation is a critical but challenging process. To get data ready for analysis, you first extract data from various sources. You then clean it, transform it into the required format, and load it into databases, data warehouses, and data lakes for further analysis. These tasks are often performed by different groups with different tools.
AWS Glue provides you with both visual and code-based interfaces to make data preparation easy. Data engineers and ETL developers can use AWS Glue Studio to create, run, and monitor ETL workflows with a few clicks. Data analysts and data scientists can use AWS Glue DataBrew to visually clean up and normalize data without writing code.
Prepare data faster
AWS Glue provides integrated tools for all your users to simplify data preparation for analytics and machine learning. Different groups across your organization can work together to prepare data, including extraction, cleaning, normalization, loading, and running scalable ETL workflows. This way, you reduce the time it takes to start analyzing your data from months to minutes.
Automate at scale
AWS Glue automates much of the effort required for data preparation. AWS Glue crawls your data sources, identifies data formats, and suggests schemas to store your data. It automatically generates the code to run your data transformations and loading processes. You can use AWS Glue to easily run and manage thousands of ETL jobs to efficiently prepare petabytes of data for analytics and machine learning.
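As a sketch of what that automation looks like in code, the snippet below builds the configuration for a Glue crawler that scans an S3 path, infers schemas, and registers tables in a catalog database on a nightly schedule. The crawler, role, database, bucket, and schedule are all hypothetical placeholders, not values from this page.

```python
def crawler_config(name, role_arn, database, s3_path):
    """Build the keyword arguments for glue.create_crawler()."""
    return {
        "Name": name,
        "Role": role_arn,                               # IAM role the crawler assumes
        "DatabaseName": database,                       # catalog database for discovered tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},  # where the crawler looks for data
        "Schedule": "cron(0 2 * * ? *)",                # re-crawl nightly at 02:00 UTC
    }

# Usage (requires boto3 and AWS credentials):
#   import boto3
#   glue = boto3.client("glue")
#   cfg = crawler_config("sales-crawler",
#                        "arn:aws:iam::123456789012:role/GlueCrawlerRole",
#                        "sales_db", "s3://example-bucket/raw/sales/")
#   glue.create_crawler(**cfg)
#   glue.start_crawler(Name=cfg["Name"])   # discovered schemas land in the Data Catalog
```

Once the crawler has run, the inferred tables appear in the AWS Glue Data Catalog without any hand-written schema definitions.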
No servers to manage
AWS Glue runs Apache Spark and Python in a serverless environment. There is no infrastructure to manage, and AWS Glue provisions, configures, and scales the resources required to run your data preparation jobs. You pay only for the resources your jobs use while running.
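The pay-per-run model shows up directly in how a job is started: you declare the worker type and count in the run request, and Glue provisions them only for the duration of the run. The helper below assembles such a request; the job name and parameters are hypothetical.

```python
def job_run_args(job_name, worker_type="G.1X", workers=10, params=None):
    """Build the keyword arguments for glue.start_job_run()."""
    request = {
        "JobName": job_name,
        "WorkerType": worker_type,   # e.g. G.1X or G.2X; sizes each Spark worker
        "NumberOfWorkers": workers,  # provisioned only while the job runs
    }
    if params:
        # Glue job parameters are passed as "--name" argument strings.
        request["Arguments"] = {f"--{k}": v for k, v in params.items()}
    return request

# Usage (requires boto3 and AWS credentials):
#   import boto3
#   glue = boto3.client("glue")
#   run = glue.start_job_run(**job_run_args("nightly-etl",
#                                           params={"target_db": "sales_db"}))
```

There is no cluster to size ahead of time; the workers exist only between the start and end of the run.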
Unified view of your data across multiple data stores
You can use the AWS Glue Data Catalog to quickly discover and search across multiple AWS data sets without moving the data. Once the data is cataloged, it is immediately available for search and query using Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
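For example, once a table is registered in the Data Catalog, Amazon Athena can query it in place by naming the catalog database in the query request. The sketch below builds such a request; the database, table, and results bucket are hypothetical.

```python
def athena_query_request(database, sql, output_s3):
    """Build the keyword arguments for athena.start_query_execution()."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},       # Data Catalog database
        "ResultConfiguration": {"OutputLocation": output_s3},  # where Athena writes results
    }

# Usage (requires boto3 and AWS credentials):
#   import boto3
#   athena = boto3.client("athena")
#   req = athena_query_request(
#       "sales_db",
#       "SELECT region, SUM(amount) FROM sales GROUP BY region",
#       "s3://example-bucket/athena-results/")
#   execution = athena.start_query_execution(**req)
```

No data is copied or moved; Athena reads the underlying files wherever the catalog says they live.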
Event-driven ETL pipelines
AWS Glue can run your ETL jobs as new data arrives. For example, you can use an AWS Lambda function to trigger your ETL jobs to run as soon as new data becomes available in Amazon S3. You can also register this new dataset in the AWS Glue Data Catalog as part of your ETL jobs.
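A minimal sketch of that pattern: a Lambda function subscribed to S3 event notifications parses the event payload and starts a Glue job run for each new object. The job name and argument key are hypothetical.

```python
def extract_s3_objects(event):
    """Pull (bucket, key) pairs out of an S3 event notification payload."""
    return [
        (rec["s3"]["bucket"]["name"], rec["s3"]["object"]["key"])
        for rec in event.get("Records", [])
    ]

def handler(event, context):
    """Lambda entry point (boto3 and credentials are available in the Lambda runtime)."""
    import boto3
    glue = boto3.client("glue")
    for bucket, key in extract_s3_objects(event):
        glue.start_job_run(
            JobName="new-data-etl",                              # hypothetical Glue job
            Arguments={"--input_path": f"s3://{bucket}/{key}"},  # pass the new object in
        )
```

The same job can then write the dataset's table definition back to the Data Catalog so downstream queries see it immediately.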
Big data ETL without coding
AWS Glue Studio makes it easy to visually create, run, and monitor AWS Glue ETL jobs. You can compose ETL jobs that move and transform data and run them on AWS Glue. You can then use the AWS Glue Studio job run dashboard to monitor ETL execution and ensure that your jobs are operating as intended.
Self-service visual data preparation
AWS Glue DataBrew enables you to explore and experiment with data directly from your data lake, data warehouses, and databases, including Amazon S3, Amazon Redshift, AWS Lake Formation, Amazon Aurora, and Amazon RDS. You can choose from over 250 prebuilt transformations in AWS Glue DataBrew to automate data preparation tasks, such as filtering anomalies, standardizing formats, and correcting invalid values. After the data is prepared, you can immediately use it for analytics and machine learning.
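A DataBrew recipe is essentially an ordered list of steps, each naming one of those prebuilt transformations plus its parameters. The sketch below shows the step shape; the operation names and columns are illustrative assumptions, so check the DataBrew recipe action reference for the exact catalog of operations.

```python
def recipe_step(operation, **parameters):
    """One entry in the Steps list passed to databrew.create_recipe()."""
    return {"Action": {"Operation": operation, "Parameters": dict(parameters)}}

# Hypothetical two-step recipe: standardize a format, then filter duplicates.
steps = [
    recipe_step("UPPER_CASE", sourceColumn="country"),
    recipe_step("REMOVE_DUPLICATES", sourceColumn="order_id"),
]

# Usage (requires boto3 and AWS credentials):
#   import boto3
#   databrew = boto3.client("databrew")
#   databrew.create_recipe(Name="clean-orders", Steps=steps)
```

In practice you would build such a recipe interactively in the DataBrew console rather than by hand; the point here is only that the visual steps map to a simple, inspectable structure.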