Q: What is AWS Glue?
AWS Glue is a fully-managed, pay-as-you-go, extract, transform, and load (ETL) service that automates the time-consuming steps of data preparation for analytics. AWS Glue automatically discovers and profiles your data via the Glue Data Catalog, recommends and generates ETL code to transform your source data into target schemas, and runs the ETL jobs on a fully managed, scale-out Apache Spark environment to load your data into its destination. It also allows you to setup, orchestrate, and monitor complex data flows.
Q: How do I get started with AWS Glue?
To start using AWS Glue, simply sign into the AWS Management Console and navigate to “Glue” under the “Analytics” category. You can follow one of our guided tutorials that will walk you through an example use case for AWS Glue. You can also find sample ETL code in our GitHub repository under AWS Labs.
Q. What are the main components of AWS Glue?
AWS Glue consists of a Data Catalog which is a central metadata repository, an ETL engine that can automatically generate Scala or Python code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. Together, these automate much of the undifferentiated heavy lifting involved with discovering, categorizing, cleaning, enriching, and moving data, so you can spend more time analyzing your data.
Q: When should I use AWS Glue?
You should use AWS Glue to discover properties of the data you own, transform it, and prepare it for analytics. Glue can automatically discover both structured and semi-structured data stored in your data lake on Amazon S3, data warehouse in Amazon Redshift, and various databases running on AWS. It provides a unified view of your data via the Glue Data Catalog that is available for ETL, querying and reporting using services like Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. Glue automatically generates Scala or Python code for your ETL jobs that you can further customize using tools you are already familiar with. AWS Glue is serverless, so there are no compute resources to configure and manage.
Q: What data sources does AWS Glue support?
AWS Glue natively supports data stored in Amazon Aurora, Amazon RDS for MySQL, Amazon RDS for Oracle, Amazon RDS for PostgreSQL, Amazon RDS for SQL Server, Amazon Redshift, DynamoDB and Amazon S3, as well as MySQL, Oracle, Microsoft SQL Server, and PostgreSQL databases in your Virtual Private Cloud (Amazon VPC) running on Amazon EC2. AWS Glue also supports (in beta) data streams from Amazon MSK, Amazon Kinesis Data Streams, and Apache Kafka.
You can also write custom Scala or Python code and import custom libraries and Jar files into your AWS Glue ETL jobs to access data sources not natively supported by AWS Glue. For more details on importing custom libraries, refer to our documentation.
Q: How does AWS Glue relate to AWS Lake Formation?
A: Lake Formation leverages a shared infrastructure with AWS Glue, including console controls, ETL code creation and job monitoring, a common data catalog, and a serverless architecture. While AWS Glue is still focused on these types of functions, Lake Formation encompasses all AWS Glue features AND provides additional capabilities designed to help build, secure, and manage a data lake. See the AWS Lake Formation pages for more details.
AWS Glue Data Catalog
Q: What is the AWS Glue Data Catalog?
The AWS Glue Data Catalog is a central repository to store structural and operational metadata for all your data assets. For a given data set, you can store its table definition, physical location, add business relevant attributes, as well as track how this data has changed over time.
The AWS Glue Data Catalog is Apache Hive Metastore compatible and is a drop-in replacement for the Apache Hive Metastore for Big Data applications running on Amazon EMR. For more information on setting up your EMR cluster to use AWS Glue Data Catalog as an Apache Hive Metastore, click here.
The AWS Glue Data Catalog also provides out-of-box integration with Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. Once you add your table definitions to the Glue Data Catalog, they are available for ETL and also readily available for querying in Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum so that you can have a common view of your data between these services.
Q: How do I get my metadata into the AWS Glue Data Catalog?
AWS Glue provides a number of ways to populate metadata into the AWS Glue Data Catalog. Glue crawlers scan various data stores you own to automatically infer schemas and partition structure and populate the Glue Data Catalog with corresponding table definitions and statistics. You can also schedule crawlers to run periodically so that your metadata is always up-to-date and in-sync with the underlying data. Alternately, you can add and update table details manually by using the AWS Glue Console or by calling the API. You can also run Hive DDL statements via the Amazon Athena Console or a Hive client on an Amazon EMR cluster. Finally, if you already have a persistent Apache Hive Metastore, you can perform a bulk import of that metadata into the AWS Glue Data Catalog by using our import script.
Q: What are AWS Glue crawlers?
An AWS Glue crawler connects to a data store, progresses through a prioritized list of classifiers to extract the schema of your data and other statistics, and then populates the Glue Data Catalog with this metadata. Crawlers can run periodically to detect the availability of new data as well as changes to existing data, including table definition changes. Crawlers automatically add new tables, new partitions to existing table, and new versions of table definitions. You can customize Glue crawlers to classify your own file types.
Q: How do I import data from my existing Apache Hive Metastore to the AWS Glue Data Catalog?
You simply run an ETL job that reads from your Apache Hive Metastore, exports the data to an intermediate format in Amazon S3, and then imports that data into the AWS Glue Data Catalog.
Q: Do I need to maintain my Apache Hive Metastore if I am storing my metadata in the AWS Glue Data Catalog?
No. AWS Glue Data Catalog is Apache Hive Metastore compatible. You can point to the Glue Data Catalog endpoint and use it as an Apache Hive Metastore replacement. For more information on how to configure your cluster to use AWS Glue Data Catalog as an Apache Hive Metastore, please read our documentation here.
Q. If I am already using Amazon Athena or Amazon Redshift Spectrum and have tables in Amazon Athena’s internal data catalog, how can I start using the AWS Glue Data Catalog as my common metadata repository?
Before you can start using AWS Glue Data Catalog as a common metadata repository between Amazon Athena, Amazon Redshift Spectrum, and AWS Glue, you must upgrade your Amazon Athena data catalog to AWS Glue Data Catalog. The steps required for the upgrade are detailed here.
Extract, transform, and load (ETL)
Q: What programming language can I use to write my ETL code for AWS Glue?
You can use either Scala or Python.
Q: How can I customize the ETL code generated by AWS Glue?
AWS Glue’s ETL script recommendation system generates Scala or Python code. It leverages Glue’s custom ETL library to simplify access to data sources as well as manage job execution. You can find more details about the library in our documentation. You can write ETL code using AWS Glue’s custom library or write arbitrary code in Scala or Python by using inline editing via the AWS Glue Console script editor, downloading the auto-generated code, and editing it in your own IDE. You can also start with one of the many samples hosted in our Github repository and customize that code.
Q: Can I import custom libraries as part of my ETL script?
Yes. You can import custom Python libraries and Jar files into your AWS Glue ETL job. For more details, please check our documentation here.
Q: Can I bring my own code?
Yes. You can write your own code using AWS Glue’s ETL library, or write your own Scala or Python code and upload it to a Glue ETL job. For more details, please check our documentation here.
Q: How can I develop my ETL code using my own IDE?
You can create and connect to development endpoints that offer ways to connect your notebooks and IDEs.
Q: How can I build end-to-end ETL workflow using multiple jobs in AWS Glue?
In addition to the ETL library and code generation, AWS Glue provides a robust set of orchestration features that allow you to manage dependencies between multiple jobs to build end-to-end ETL workflows. AWS Glue ETL jobs can either be triggered on a schedule or on a job completion event. Multiple jobs can be triggered in parallel or sequentially by triggering them on a job completion event. You can also trigger one or more Glue jobs from an external source such as an AWS Lambda function.
Q: How does AWS Glue monitor dependencies?
AWS Glue manages dependencies between two or more jobs or dependencies on external events using triggers. Triggers can watch one or more jobs as well as invoke one or more jobs. You can either have a scheduled trigger that invokes jobs periodically, an on-demand trigger, or a job completion trigger.
Q: How does AWS Glue handle errors?
AWS Glue monitors job event metrics and errors, and pushes all notifications to Amazon CloudWatch. With Amazon CloudWatch, you can configure a host of actions that can be triggered based on specific notifications from AWS Glue. For example, if you get an error or a success notification from Glue, you can trigger an AWS Lambda function. Glue also provides default retry behavior that will retry all failures three times before sending out an error notification.
Q: Can I run my existing ETL jobs with AWS Glue?
Yes. You can run your existing Scala or Python code on AWS Glue. Simply upload the code to Amazon S3 and create one or more jobs that use that code. You can reuse the same code across multiple jobs by pointing them to the same code location on Amazon S3.
Q: How can I use AWS Glue to ETL streaming data?
AWS Glue supports ETL on streams from Amazon Kinesis Data Streams, Apache Kafka, and Amazon MSK. Add the stream to the Glue Data Catalog and then choose it as the data source when setting up your AWS Glue job.
Q: Do I have to use both AWS Glue Data Catalog and Glue ETL to use the service?
No. While we do believe that using both the AWS Glue Data Catalog and ETL provides an end-to-end ETL experience, you can use either one of them independently without using the other.
Q: When should I use AWS Glue Streaming and when should I use Amazon Kinesis Data Analytics?
Both AWS Glue and Amazon Kinesis Data Analytics can be used to process streaming data. AWS Glue is recommended when your use cases are primarily ETL and when you want to run jobs on a serverless Apache Spark-based platform. Amazon Kinesis Data Analytics is recommended when your use cases are primarily analytics and when you want to run jobs on a serverless Apache Flink-based platform.
Streaming ETL in AWS Glue enables advanced ETL on streaming data using the same serverless, pay-as-you-go platform that you currently use for your batch jobs. AWS Glue generates customizable ETL code to prepare your data while in flight and has built-in functionality to process streaming data that is semi-structured or has an evolving schema. Use Glue to apply both its built-in and Spark-native transforms to data streams and load them into your data lake or data warehouse.
Amazon Kinesis Data Analytics enables you to build sophisticated streaming applications to analyze streaming data in real time. It provides a serverless Apache Flink runtime that automatically scales without servers and durably saves application state. Use Amazon Kinesis Data Analytics for real-time analytics and more general stream data processing.
Q: When should I use AWS Glue and when should I use Amazon Kinesis Data Firehose?
Both AWS Glue and Amazon Kinesis Data Firehose can be used for streaming ETL. AWS Glue is recommended for complex ETL, including joining streams, and partitioning the output in Amazon S3 based on the data content. Amazon Kinesis Data Firehose is recommended when your use cases focus on data delivery and preparing data to be processed after it is delivered.
Streaming ETL in AWS Glue enables advanced ETL on streaming data using the same serverless, pay-as-you-go platform that you currently use for your batch jobs. AWS Glue generates customizable ETL code to prepare your data while in flight and has built-in functionality to process streaming data that is semi-structured or has an evolving schema. Use Glue to apply complex transforms to data streams, enrich records with information from other streams and persistent data stores, and then load records into your data lake or data warehouse.
Streaming ETL in Amazon Kinesis Data Firehose enables you to easily capture, transform, and deliver streaming data. Amazon Kinesis Data Firehose provides ETL capabilities including serverless data transformation through AWS Lambda and format conversion from JSON to Parquet. It includes ETL capabilities that are designed to make data easier to process after delivery, but does not include the advanced ETL capabilities that AWS Glue supports.
Clean and deduplicate data
Q: What kind of problems does the FindMatches ML Transform solve?
FindMatches generally solves Record Linkage and Data Deduplication problems. Deduplication is what you have to do when you are trying to identify records in a database which are conceptually “the same”, but for which you have separate records. This problem is trivial if duplicate records can be identified by a unique key (for instance if products can be uniquely identified by a UPC Code), but becomes very challenging when you have to do a “fuzzy match”.
Record linkage is basically the same problem as data deduplication under the hood, but this term usually means that you are doing a “fuzzy join” of two databases that do not share a unique key rather than deduplicating a single database. As an example, consider the problem of matching a large database of customers to a small database of known fraudsters. FindMatches can be used on both record linkage and deduplication problems.
For instance, AWS Glue's FindMatches ML Transform can help you with the following problems:
Linking patient records between hospitals so that doctors have more background information and are better able to treat patients by using FindMatches on separate databases that both contain common fields such as name, birthday, home address, phone number, etc.
Deduplicating a database of movies containing columns like “title”, “plot synopsis”, “year of release”, “run time”, and “cast”. For instance, the same movie might be variously identified as “Star Wars”, “Star Wars: A New Hope”, and “Star Wars: Episode IV—A New Hope (Special Edition)”.
Automatically group all related products together in your storefront by identifying equivalent items in an apparel product catalog where you want to define “equivalent” to mean that they are the same ignoring differences in size and color. Hence “Levi 501 Blue Jeans, size 34x34” is defined to be the same as “Levi 501 Jeans--black, Size 32x31”.
Q: How does AWS Glue deduplicate my data?
AWS Glue's FindMatches ML Transform makes it easy to find and link records that refer to the same entity but don’t share a reliable identifier. Before FindMatches, developers would commonly solve data-matching problems deterministically, by writing huge numbers of hand-tuned rules. FindMatches uses machine learning algorithms behind the scenes to learn how to match records according to each developer's own business criteria. FindMatches first identifies records for the customer to label as to whether they match or do not match and then uses machine learning to create an ML Transform. Customers can then execute this Transform on their database to find matching records or they can ask FindMatches to give them additional records to label to push their ML Transform to higher levels of accuracy.
Q: What are ML Transforms?
ML Transforms provide a destination for creating and managing machine-learned transforms. Once created and trained, these ML Transforms can then be executed in standard AWS Glue scripts. Customers select a particular algorithm (for example, the FindMatches ML Transform) and input datasets and training examples, and the tuning parameters needed by that algorithm. AWS Glue uses those inputs to build an ML Transform that can be incorporated into a normal ETL Job workflow.
Q: How do ML Transforms work?
AWS Glue includes specialized ML-based dataset transformation algorithms customers can use to create their own ML Transforms. These include record de-duplication and match finding.
Customers start by navigating to the ML Transforms tab in the console (or using the ML Transforms service endpoints or accessing ML Transforms training via CLI) to create their first ML transform model. The ML Transforms tab provides a user-friendly view for management of user transforms. ML Transforms require distinct workflow requirements from other transforms, including the need for separate training, parameter tuning, and execution workflows; the need for estimating the quality metrics of generated transformations; and the need to manage and collect additional truth labels for training and active learning.
To create an ML transform via the console, customers first select the transform type (such as Record Deduplication or Record Matching) and provide the appropriate data sources previously discovered in Data Catalog. Depending on the transform, customers may then be asked to provide ground truth label data for training or additional parameters. Customers can monitor the status of their training jobs and view quality metrics for each transform. (Quality metrics are reported using a hold-out set of the customer-provided label data.)
Once satisfied with the performance, customers can promote ML Transforms models for use in production. ML Transforms can then be used during ETL workflows, both in code autogenerated by the service and in user-defined scripts submitted with other jobs, similar to pre-built transforms offered in other AWS Glue libraries.
Q: Can I see a presentation on using AWS Glue (and AWS Lake Formation) to find matches and deduplicate records?
A: Yes, the full recording of the AWS Online Tech Talk, "Fuzzy Matching and Deduplicating Data with ML Transforms for AWS Lake Formation" is available here.
AWS Product Integrations
Q: When should I use AWS Glue vs. AWS Data Pipeline?
AWS Glue provides a managed ETL service that runs on a serverless Apache Spark environment. This allows you to focus on your ETL job and not worry about configuring and managing the underlying compute resources. AWS Glue takes a data first approach and allows you to focus on the data properties and data manipulation to transform the data to a form where you can derive business insights. It provides an integrated data catalog that makes metadata available for ETL as well as querying via Amazon Athena and Amazon Redshift Spectrum.
AWS Data Pipeline provides a managed orchestration service that gives you greater flexibility in terms of the execution environment, access and control over the compute resources that run your code, as well as the code itself that does data processing. AWS Data Pipeline launches compute resources in your account allowing you direct access to the Amazon EC2 instances or Amazon EMR clusters.
Furthermore, AWS Glue ETL jobs are Scala or Python based. If your use case requires you to use an engine other than Apache Spark or if you want to run a heterogeneous set of jobs that run on a variety of engines like Hive, Pig, etc., then AWS Data Pipeline would be a better choice.
Q: When should I use AWS Glue vs. Amazon EMR?
AWS Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. AWS Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs. Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.
Q: When should I use AWS Glue vs AWS Database Migration Service?
AWS Database Migration Service (DMS) helps you migrate databases to AWS easily and securely. For use cases which require a database migration from on-premises to AWS or database replication between on-premises sources and sources on AWS, we recommend you use AWS DMS. Once your data is in AWS, you can use AWS Glue to move and transform data from your data source into another database or data warehouse, such as Amazon Redshift.
Q: When should I use AWS Glue vs AWS Batch?
AWS Batch enables you to easily and efficiently run any batch computing job on AWS regardless of the nature of the job. AWS Batch creates and manages the compute resources in your AWS account, giving you full control and visibility into the resources being used. AWS Glue is a fully-managed ETL service that provides a serverless Apache Spark environment to run your ETL jobs. For your ETL use cases, we recommend you explore using AWS Glue. For other batch oriented use cases, including some ETL use cases, AWS Batch might be a better fit.
Q: When should I use AWS Glue vs Amazon Kinesis Data Analytics?
Amazon Kinesis Data Analytics allows you to run standard SQL queries on your incoming data stream. You can specify a destination like Amazon S3 to write your results. Once your data is available in your target data source, you can kick off an AWS Glue ETL job to do further transform your data and prepare it for additional analytics and reporting.
Pricing and billing
Q: How am I charged for AWS Glue?
You will pay a simple monthly fee, above the AWS Glue Data Catalog free tier, for storing and accessing the metadata in the AWS Glue Data Catalog. Additionally, you will pay an hourly rate, billed per second, for the ETL job and crawler run, with a 10-minute minimum for each. If you choose to use a development endpoint to interactively develop your ETL code, you will pay an hourly rate, billed per second, for the time your development endpoint is provisioned, with a 10-minute minimum. For more details, please refer our pricing page.
Q: When does billing for my AWS Glue jobs begin and end?
Billing commences as soon as the job is scheduled for execution and continues until the entire job completes. With AWS Glue, you only pay for the time for which your job runs and not for the environment provisioning or shutdown time.
Security and availability
Q: How does AWS Glue keep my data secure?
We provide server side encryption for data at rest and SSL for data in motion.
Q: What are the service limits associated with AWS Glue?
Please refer our documentation to learn more about service limits.
Q: What regions is AWS Glue in?
Please refer to the AWS Region Table for details of AWS Glue service availability by region.
Q: How many DPUs (Data Processing Units) are allocated to the development endpoint?
A development endpoint is provisioned with 5 DPUs by default. You can configure a development endpoint with a minimum of 2 DPUs and a maximun of 5 DPUs.
Q: How do I scale the size and performance of my AWS Glue ETL jobs?
You can simply specify the number of DPUs (Data Processing Units) you want to allocate to your ETL job. A Glue ETL job requires a minimum of 2 DPUs. By default, AWS Glue allocates 10 DPUs to each ETL job.
Q: How do I monitor the execution of my AWS Glue jobs?
The AWS Glue provides status of each job and pushes all notifications to Amazon CloudWatch. You can setup SNS notifications via CloudWatch actions to be informed of job failures or completions.
Service Level Agreement
Q: What does the AWS Glue SLA guarantee?
Our AWS Glue SLA guarantees a Monthly Uptime Percentage of at least 99.9% for AWS Glue.
Q: How do I know if I qualify for a SLA Service Credit?
You are eligible for a SLA credit for AWS Glue under the AWS Glue SLA if more than one Availability Zone in which you are running a task, within the same region has a Monthly Uptime Percentage of less than 99.9% during any monthly billing cycle.
For full details on all of the terms and conditions of the SLA, as well as details on how to submit a claim, please see the AWS Glue SLA details page.