AWS Database Blog

How to extract, transform, and load data for analytic processing using AWS Glue (Part 2)

 

One of the biggest challenges enterprises face is setting up and maintaining a reliable extract, transform, and load (ETL) process to extract value and insight from data. Traditional ETL tools are complex to use, and can take months to implement, test, and deploy. After the ETL jobs are built, maintaining them can be painful because data formats and schemas change frequently and new data sources need to be added all the time.

AWS Glue automates much of the undifferentiated heavy lifting involved with discovering, categorizing, cleaning, enriching, and moving data, so you can spend more time analyzing your data. AWS Glue automatically crawls your data sources, identifies data formats, and then suggests schemas and transformations. This means that you don’t have to spend time hand-coding data flows.

AWS Glue is designed to simplify the tasks of moving and transforming your datasets for analysis. It’s a serverless, fully managed service built on top of the popular Apache Spark execution framework.

In part 2 of this two-part migration blog series, we build an AWS CloudFormation stack. We use this stack to show you how AWS Glue extracts, transforms, and loads data to and from an Amazon Aurora MySQL database. We use Amazon Aurora MySQL as the source and Amazon Simple Storage Service (Amazon S3) as the target for AWS Glue. We also provide a scenario where we show you how to build a centralized data lake in Amazon S3 for easy querying and reporting by using Amazon Athena. You can also use Amazon Redshift as a data target for building a data warehouse strategy. This post provides a framework to get started with AWS Glue that you can customize as needed.

AWS Glue has three core components:

  • Data Catalog – Serves as the central metadata repository. Tables and databases are objects in the AWS Glue Data Catalog. They contain metadata; they don’t contain data from a data store.
  • Crawler – Discovers your data and associated metadata from various data sources (source or target) such as S3, Amazon RDS, Amazon Redshift, and so on. Crawlers help automatically build your Data Catalog and keep it up-to-date as you get new data and as your data evolves.
  • ETL Job – The business logic that is required to perform data processing. An ETL job is composed of a transformation script, data sources, and data targets.

AWS Glue also provides the necessary scheduling, alerting, and triggering features to run your jobs as part of a wider data processing workflow.
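
To make these components concrete, here is a short, hedged boto3 sketch (not part of the walkthrough) that lists whatever databases and tables currently exist in your Data Catalog. After the crawler in this walkthrough runs, the hrdb database and its glue_-prefixed tables would show up in this listing.

    # A minimal boto3 sketch: list the databases and tables that crawlers have
    # registered in the AWS Glue Data Catalog. Tables hold only metadata; the
    # data itself stays in the underlying data store.
    import boto3

    glue = boto3.client("glue", region_name="us-east-2")  # or us-west-2

    for database in glue.get_databases()["DatabaseList"]:
        print("Database:", database["Name"])
        paginator = glue.get_paginator("get_tables")
        for page in paginator.paginate(DatabaseName=database["Name"]):
            for table in page["TableList"]:
                location = table.get("StorageDescriptor", {}).get("Location", "")
                print("  Table:", table["Name"], location)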

We use a CloudFormation script that builds on resources deployed in part 1 of this blog series, such as your virtual private cloud (VPC) and Amazon Aurora MySQL database. This script lets us show you the features and functionality of AWS Glue. The CloudFormation stack deployed in this blog post creates the four core components needed for doing ETL with AWS Glue:

  • An AWS Glue connection to the Amazon Aurora MySQL database called HRDATA
  • An AWS Glue crawler that allows you to crawl the HRDATA database
  • An AWS Glue database in the Data Catalog to store crawler results
  • An AWS Glue job to transform your data by merging two tables and creating a new table

With an AWS Glue job, you can also write complex data transformations. For the purposes of this blog post, we use a simple transformation. This simple transformation shows a typical scenario of taking data from a source, transforming it externally, and storing it for analytic processing to gain business insights. The transformed data that AWS Glue loads into S3 becomes a source for building your data lake, which you can query using Amazon Athena.

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Amazon Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Many customers just want to easily query their data without worrying about running and managing clusters, and Amazon Athena helps accomplish this.

The following illustration shows the AWS Glue components and the architecture of the exercise.

Prerequisites

To get started with this AWS Glue exercise, take the following steps:

  • Complete the steps in part 1 of this blog series.
  • This script works only in the US West (Oregon) or US East (Ohio) Region. Make sure to use the same AWS Region as in part 1.
  • We reuse the following components from the CloudFormation stack deployed in part 1 (a boto3 sketch after this list shows one way to confirm that the part 1 stack and its outputs are in place):
    • Amazon Aurora MySQL-compatible HRDATA database
    • VPC, security group, and subnet associated with this database
    • VPC endpoint for S3
  • The CloudFormation script creates an AWS Glue IAM role—a mandatory role that AWS Glue can assume to access the necessary resources like Amazon RDS and S3.
  • The script also creates an AWS Glue connection, database, crawler, and job for the walkthrough.
  • The scripts for the AWS Glue Job are stored in S3.
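
If you want to confirm that the part 1 stack and its outputs are available before launching this stack, the following is a minimal boto3 sketch. The stack name shown is a placeholder; use the name you chose in part 1.

    # A hedged sketch: confirm that the part 1 stack exists and list its outputs,
    # which the part 2 stack reads as input values.
    import boto3

    cfn = boto3.client("cloudformation", region_name="us-east-2")  # or us-west-2
    stack = cfn.describe_stacks(StackName="part1-migration-stack")["Stacks"][0]  # placeholder name

    print("Stack status:", stack["StackStatus"])
    for output in stack.get("Outputs", []):
        print(output["OutputKey"], "=", output["OutputValue"])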

Launch the stack

Note: Some of the resources deployed by the stack incur costs as long as they’re in use.

To get started deploying the CloudFormation template, take these steps:

  1. Choose the Launch Stack button. It automatically launches the AWS CloudFormation service in your AWS account with a template. You’re prompted to sign in if needed.
  2. Go to the same AWS Region (Ohio or Oregon) that you used to create the stack in part 1. Choose Next to proceed.
  3. Specify the name of the CloudFormation stack that was deployed in part 1 of this blog series. Note: The MigrationStackName input parameter must be the stack name of the CloudFormation template that was deployed in part 1. CloudFormation automatically extracts the values from the output of the previous stack to use in this new stack.
  4. On the Review page, acknowledge that CloudFormation creates AWS Identity and Access Management (IAM) resources as a result of launching the stack. Choose Create.
  5. To see the stack creation progress, choose Refresh and then select the stack to see the launch events.
    The stack creation takes a few minutes to finish. When the stack is successfully launched, the status changes from CREATE_IN_PROGRESS to CREATE_COMPLETE. The Output section also has information about the resources deployed by CloudFormation.
  6. In the AWS Management Console, navigate to AWS Glue. In the left navigation pane under Data Catalog, choose Databases. You should see that a new database called hrdb has been added by the CloudFormation stack. This database stores information such as database, table, partition, and column definitions. We use the AWS Glue crawler to populate the Data Catalog in later steps.
  7. In the left navigation pane under Data Catalog, choose Connections. You should see a new RDS connection called rds-aurora-blog-conn. This connection was created by CloudFormation.
  8. Choose rds-aurora-blog-conn to look at the connection details. The connection uses the JDBC string for the Amazon Aurora MySQL database, along with the VPC ID, subnet ID, and security group ID taken from the Aurora MySQL settings, as shown following.
  9. In the left navigation pane, choose Crawlers. You should see that a new crawler has been created by the CloudFormation stack. The Tables updated and Tables added columns each have the value 0. This is because the crawler hasn’t run yet to crawl the data source.
  10. Choose rds-aurora-blog-crawler to look at the details. We use the rds-aurora-blog-conn connection, the hrdb AWS Glue Data Catalog database, and an include path of HRDATA/% to crawl all the tables in the HRDATA database. The crawler uses prebuilt classifiers for many popular source formats. The crawled tables are prefixed with glue_ in the Data Catalog.
  11.  Choose rds-aurora-blog-crawler from the list, and choose Run crawler to start the crawler. The crawler runs for a few minutes.
  12.  When it’s finished, you should see the tables updated and added to the AWS Glue Data Catalog.
  13. In the left navigation pane, choose Databases and then choose the hrdb database. This database was created by the CloudFormation stack, and the crawler that ran in the previous step added the tables to this database in the AWS Glue Data Catalog.
  14.  In the left navigation pane, navigate to the ETL section and choose Jobs. You should see the job created by the CloudFormation stack.
  15. Select this job to see details such as the location of the PySpark script in an S3 bucket and the script associated with the ETL job. The job creates an AWS Glue DynamicFrame for each of two tables, glue_hrdata_employees and glue_hrdata_departments, from the hrdb database in the Data Catalog.
  16. The script performs a Join transformation on the DEPARTMENT_ID column of the two DynamicFrames, creates a third DynamicFrame called employees_department, and writes the result to a destination S3 bucket. Note: In the CloudFormation snippet following, you can see how we pass the destination S3 bucket to the job through the DefaultArguments property (--destination) and extract it in the script using sys.argv. A hedged sketch of such a script appears after this procedure.
    "GlueJob": {
          "Type": "AWS::Glue::Job",
          "DependsOn": ["AWSGlueRole", "AWSGlueconnectionRDS", "S3Bucket"],
          "Properties": {
            "Role": {
              "Ref": "AWSGlueRole"
            },
            "Connections": {
              "Connections": [
                {
                  "Ref": "AWSGlueconnectionRDS"
                }
              ]
            },
            "Description": "Job created with CloudFormation using existing script",
            "Command": {
              "Name": "glueetl",
              "ScriptLocation": {
                "Fn::If" : [
                  "UseProdCondition",
                  "s3://blog-scripts-glueetl/glueetl/blogetls3.py",
                  "s3://blog-scripts-gluetl-west2/glueetl/blogetls3.py"
                  ]}
            },
            "DefaultArguments": {
    
              "--job-bookmark-option": "job-bookmark-enable",
              "--destination": {"Ref": "S3Bucket"}
            },
            "ExecutionProperty": {
              "MaxConcurrentRuns": 2
            },
            "MaxRetries": 0,
            "Name": {
              "Fn::Join": [
                "-",
                [
                  {
                    "Ref": "AWS::StackName"
                  },
                  "blog-etljob"
                ]
              ]
            },
            "AllocatedCapacity": "10"
          }
        }
      }

  17. Choose the job, and for Action, choose Run job. You can change advanced parameters like the number of data processing units (DPUs) and maximum concurrency, if needed.
  18. Check the status of the job in the bottom panel. It shows how long the job has been running and provides access to the Logs and Error logs in Amazon CloudWatch. The job runs for a few minutes at the default 10 DPUs.
  19. When the job is complete, the status changes from RUNNING to SUCCEEDED.
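
For reference, the following is a hedged PySpark sketch of the kind of script step 16 describes. It uses the table names from this walkthrough and a guessed output layout; it is not the exact script that the stack deploys from S3.

    # A hedged sketch of a Glue ETL script along the lines described in step 16.
    # Table and column names follow the walkthrough; the output path and format
    # are assumptions.
    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.transforms import Join
    from pyspark.context import SparkContext

    # "--destination" is supplied through the job's DefaultArguments (see the
    # CloudFormation snippet above); getResolvedOptions reads it from sys.argv.
    args = getResolvedOptions(sys.argv, ["destination"])
    destination_bucket = args["destination"]

    glue_context = GlueContext(SparkContext.getOrCreate())

    # One DynamicFrame per source table in the hrdb Data Catalog database.
    employees = glue_context.create_dynamic_frame.from_catalog(
        database="hrdb", table_name="glue_hrdata_employees")
    departments = glue_context.create_dynamic_frame.from_catalog(
        database="hrdb", table_name="glue_hrdata_departments")

    # Join the two frames on the department ID column (the column case may
    # differ depending on how the crawler catalogs the source).
    employees_department = Join.apply(
        employees, departments, "department_id", "department_id")

    # Write the joined result to the destination S3 bucket for Athena to query.
    glue_context.write_dynamic_frame.from_options(
        frame=employees_department,
        connection_type="s3",
        connection_options={"path": "s3://{}/".format(destination_bucket)},
        format="json")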

Use Athena to query the result in S3

Next, we use Amazon Athena to run SQL queries on the data that AWS Glue wrote to Amazon S3.

  1. In the console, navigate to the Amazon Athena service. You can see that Amazon Athena automatically recognizes the hrdb database created in AWS Glue. Amazon Athena uses the same Data Catalog that AWS Glue does.
  2. In the left navigation pane for Athena, under hrdb, choose Create table, and then choose Automatically (AWS Glue crawler).
  3. You’re prompted with a notification about being redirected to the AWS Glue console. Choose Continue, and you go to the AWS Glue console to create a new crawler.
  4. In the AWS Glue console, provide a crawler name and choose Continue.
  5. Make sure that S3 is the type for the data store, and choose the folder icon to select the destination bucket. This bucket should be the one used by AWS Glue to write the result of the joined DynamicFrame to S3. Choose Next to continue.
  6. On the next page, when asked whether to add another data store, choose No and then choose Continue.
  7. In the Choose an IAM role section, choose an existing IAM role. For IAM role, choose the IAM role already created for the AWS Glue service by the AWS CloudFormation stack. Choose Next to continue.
  8. In the Create a schedule for this crawler section, choose the frequency as Run on demand. When the data in S3 is constantly changing, running the crawler periodically helps capture the changes in the AWS Glue Data Catalog automatically. Choose Next to continue.
  9. In the crawler’s output section, select hrdb as the database in the AWS Glue Data Catalog to store the result. You can add a prefix like s3_ to distinguish S3-specific tables from other tables in the AWS Glue Data Catalog. Choose Next to continue. (A boto3 sketch of an equivalent crawler definition follows this procedure.)
  10. Review the settings, and choose Finish to create the crawler.
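
If you prefer to define this S3 crawler programmatically instead of through the console, the following is a minimal boto3 sketch using the same settings. The crawler name, role ARN, and bucket path are placeholders; the walkthrough creates the role and bucket for you.

    # A hedged boto3 sketch of an equivalent crawler definition. The console
    # steps above achieve the same result.
    import boto3

    glue = boto3.client("glue", region_name="us-east-2")  # or us-west-2

    glue.create_crawler(
        Name="s3-hrdata-crawler",                             # placeholder name
        Role="arn:aws:iam::123456789012:role/my-glue-role",   # placeholder role ARN
        DatabaseName="hrdb",                                  # Data Catalog database
        TablePrefix="s3_",                                    # distinguishes S3 tables
        Targets={"S3Targets": [{"Path": "s3://my-destination-bucket/"}]},  # placeholder bucket
    )
    glue.start_crawler(Name="s3-hrdata-crawler")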

Now that you’ve created your crawler, you can run it:

  1. Run the AWS Glue crawler to crawl S3 and add the new table to the hrdb Data Catalog database. The crawler runs for about a minute.
  2. When it’s done, you can see that a new table has been added to the AWS Glue Data Catalog.
  3. Navigate back to the Amazon Athena console, and choose the refresh icon. You should see the new S3 table in Athena for querying.
  4. Use the query editor to try queries such as the following. Make sure to replace the table name with the one shown in the Amazon Athena navigation pane as needed. (A boto3 sketch for running the same query programmatically follows these queries.)
    select * from s3_destinationbucketforhrdata;
    select * from s3_destinationbucketforhrdata where department_name = 'Sales' and commission_pct >= 0.30;
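
The same queries can also be run outside the console. Here is a hedged boto3 sketch; the table name matches the walkthrough, and the S3 output location for Athena query results is a placeholder.

    # A hedged boto3 sketch for running the query programmatically. Athena
    # requires an S3 output location for query results.
    import boto3

    athena = boto3.client("athena", region_name="us-east-2")  # or us-west-2

    response = athena.start_query_execution(
        QueryString="SELECT * FROM s3_destinationbucketforhrdata "
                    "WHERE department_name = 'Sales' AND commission_pct >= 0.30",
        QueryExecutionContext={"Database": "hrdb"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},  # placeholder bucket
    )
    print("Query execution ID:", response["QueryExecutionId"])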

At this stage, you have successfully used AWS Glue to crawl, transform, and load the data to S3. You’ve also used Amazon Athena to run ad hoc SQL queries on the result. This is a common pattern: performing ETL to build a data lake in S3 and then using Amazon Athena to run SQL queries.

Perform cleanup steps

In this blog post, the CloudFormation script created an IAM service role specific to AWS Glue, plus a Data Catalog database, connection, crawler, and job. As part of the AWS Glue crawler and job execution, elastic network interfaces were also created in the VPC. These connect to Amazon RDS over private IP addresses. When you have completed the walkthrough, you can delete these resources to prevent any additional costs.

To delete the resources created in part 2 of this blog series, take the following steps:

  1. In the console, navigate to S3 and delete the contents of the destination bucket that was used in the AWS Glue job. The S3 bucket needs to be empty before AWS CloudFormation can delete items in the next steps.
  2. In the console, navigate to the AWS Glue Crawlers section, and then select and delete the crawler you created to crawl the destination S3 bucket.
  3. In the console, navigate to CloudFormation.
  4. Choose the stack you created for this blog post, and for Actions, choose Delete Stack. You get a warning message. Choose Yes, Delete to proceed.
    The deletion process takes a few minutes to complete. It deletes all the resources created by the stack. During the process, the status shows DELETE_IN_PROGRESS. When the deletion is successful, the stack disappears from the CloudFormation list.
  5. In the console, navigate to Amazon EC2, and choose Network Interfaces from the left navigation pane. From the list of network interfaces, select only the interfaces that have the following description (a boto3 sketch for finding these interfaces programmatically follows these steps):
    Attached to Glue using role: arn:aws:iam::<accountid>:role/<stackname>-AWSGlueServiceRole 

    Choose Detach if they have in-use status. You get a warning; choose Yes, Detach.

  6. Once the network interfaces are detached, choose them again and choose Delete. You get a warning. Choose Yes, Delete to proceed. Make sure that the interfaces that you chose are deleted successfully (disappear from the list).
    At this point, all the resources deployed as part of this blog post are successfully deleted. Follow the cleanup instructions in part 1 of the migration blog series to delete resources created in part 1.
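
To locate these interfaces programmatically rather than scanning the console list, here is a minimal boto3 sketch that filters on the description shown in step 5. Detaching and deleting them still follows the console steps above.

    # A hedged boto3 sketch for locating the network interfaces created for AWS Glue.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-2")  # or us-west-2

    response = ec2.describe_network_interfaces(
        Filters=[{"Name": "description", "Values": ["Attached to Glue using role:*"]}]
    )
    for eni in response["NetworkInterfaces"]:
        print(eni["NetworkInterfaceId"], eni["Status"], eni["Description"])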

Conclusion

In this blog post, we used a simple walkthrough in the console to show the key aspects of using AWS Glue. With these features, you can understand your data, prepare your data, and load your data in S3 for querying through Amazon Athena.

With AWS Glue, you get a consistent and reliable way to automate and manage ETL tasks. You can further work with the CloudFormation template provided in this blog to test and automate your ETL workflow.


About the authors

Sona Rajamani is a solutions architect at AWS. She lives in the San Francisco Bay Area and helps customers architect and optimize applications on AWS. In her spare time, she enjoys hiking and traveling.

Ballu Singh is a solutions architect at AWS. He lives in the San Francisco Bay Area and helps customers architect and optimize applications on AWS. In his spare time, he enjoys reading and spending time with his family.