Building an AWS Glue ETL pipeline locally without an AWS account
This blog was last reviewed May, 2022.
If you’re new to AWS Glue and looking to understand its transformation capabilities without incurring an added expense, or if you’re simply wondering if AWS Glue ETL is the right tool for your use case and want a holistic view of AWS Glue ETL functions, then please continue reading. In this post, we walk you through several AWS Glue ETL functions with supporting examples, using a local PySpark shell in a containerized environment with no AWS artifact dependency. If you’re already familiar with AWS Glue and Apache Spark, you can use this solution as a quick cheat sheet for AWS Glue PySpark validations.
You don’t need an AWS account to follow along with this walkthrough. We use small example datasets for our use case and go through the transformations of several AWS Glue ETL PySpark functions:
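- ApplyMapping
- Filter
- Map
- SelectFields
- Join
- DropFields
- Relationalize
- SelectFromCollection
- RenameField
- ResolveChoice
- Unbox
- Unnest
- DropNullFields
- SplitFields
- SplitRows
- Spigot
- Write Dynamic Frame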
This post introduces the transformation capabilities of AWS Glue and offers insight into possible uses of the supported functions. The goal is to get up and running with AWS Glue ETL functions in the shortest possible time, at no cost and without any AWS environment dependency.
To follow along, you should have the following resources:
- Basic programming experience
- Basic Python and Spark knowledge (not required but good to have)
- A desktop or workstation with Docker installed and running
If you prefer to set up the environment locally outside of a Docker container, you can follow the instructions provided in the GitHub repo, which hosts libraries used in AWS Glue. These libraries extend Apache Spark with additional data types and operations for ETL workflows.
Setting up resources
For this post, we use the amazon/aws-glue-libs:glue_libs_1.0.0_image_01 image from Docker Hub. This image has only been tested for the AWS Glue 1.0 Spark shell (PySpark). It also supports Jupyter and Zeppelin notebooks and a CLI interpreter.
To set up the environment locally, refer to the blog post Developing AWS Glue ETL jobs locally using a container.
To get started, enter the following import statements in the PySpark shell. We import GlueContext, which wraps the Spark SQLContext, thereby providing mechanisms to interact with Apache Spark:
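A minimal sketch of this setup, assuming the shell runs inside the aws-glue-libs container, where the awsglue package is available. The helper name make_glue_context is hypothetical, and the imports sit inside the function only so that the snippet still loads on a machine without the Glue libraries.

```python
def make_glue_context():
    """Return a GlueContext that wraps the shell's SparkContext."""
    # Deferred imports: both packages ship with the Glue container image.
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    sc = SparkContext.getOrCreate()  # reuses the shell's context if one exists
    return GlueContext(sc)

# In the PySpark shell:
#   glueContext = make_glue_context()
#   spark = glueContext.spark_session
```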
We first generate a Spark DataFrame consisting of dummy data of an order list for a fictional company. We process the data using AWS Glue PySpark functions.
Enter the following code into the shell:
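As a sketch of such an order list: the column names and every value below are illustrative assumptions, not the company's real data. The helper build_orders_df is hypothetical and expects the shell's SparkSession (usually available as spark).

```python
# Hypothetical order rows; integers become long columns, text becomes string columns.
ORDER_COLUMNS = ["order_id", "customer_id", "essential_item", "timestamp", "zipcode"]
ORDER_ROWS = [
    (1005, 623, "YES", 1418901234, "75091"),
    (1006, 547, "NO", 1418901256, "75034"),
    (1007, 823, "YES", 1418901300, "75023"),
    (1008, 912, "NO", 1418901400, "82091"),
]

def build_orders_df(spark):
    """Build a Spark DataFrame from the rows above, inferring column types."""
    return spark.createDataFrame(ORDER_ROWS, schema=ORDER_COLUMNS)

# In the PySpark shell:
#   df_orders = build_orders_df(spark)
#   df_orders.show()
```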
The .show() command lets us view the DataFrame in the shell:
A DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially. Instead, AWS Glue computes a schema on the fly when required. We convert the df_orders DataFrame into a DynamicFrame.
Enter the following code in the shell:
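One way to do the conversion is DynamicFrame.fromDF, which takes the DataFrame, the GlueContext, and a name for the new frame. The wrapper name to_dynamic_frame and the frame name used in the usage comment are assumptions.

```python
def to_dynamic_frame(df, glue_context, name):
    """Convert a Spark DataFrame into an AWS Glue DynamicFrame."""
    # Deferred import: awsglue is only available where the Glue libraries are installed.
    from awsglue.dynamicframe import DynamicFrame

    return DynamicFrame.fromDF(df, glue_context, name)

# In the PySpark shell:
#   dyf_orders = to_dynamic_frame(df_orders, glueContext, "dyf_orders")
#   dyf_orders.printSchema()
```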
Now that we have our DynamicFrame, we can start working with the dataset using AWS Glue transform functions.
The columns in our data might be in different formats, and you may want to change their respective names.
ApplyMapping is the best option for changing the names and formatting all the columns collectively. For our dataset, we change some of the columns to String format to save storage space later. We also shorten a column name to zip. See the following code:
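A sketch of an ApplyMapping call, assuming an input frame whose order_id, customer_id, and timestamp columns are long and whose essential_item and zipcode columns are string. Those column names and types are assumptions made for illustration. Each mapping is a tuple of source column, source type, target column, and target type.

```python
# Hypothetical mappings: two columns are cast to string and zipcode is renamed to zip.
ORDER_MAPPINGS = [
    ("order_id", "long", "order_id", "string"),
    ("customer_id", "long", "customer_id", "string"),
    ("essential_item", "string", "essential_item", "string"),
    ("timestamp", "long", "timestamp", "long"),
    ("zipcode", "string", "zip", "string"),
]

def apply_order_mappings(dyf):
    """Rename and cast columns in one pass; columns not listed are dropped."""
    from awsglue.transforms import ApplyMapping  # needs the Glue libraries

    return ApplyMapping.apply(frame=dyf, mappings=ORDER_MAPPINGS)

# In the PySpark shell:
#   dyf_mapped = apply_order_mappings(dyf_orders)
#   dyf_mapped.printSchema()
```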
We now want to prioritize our order delivery for essential items. We can achieve that using the Filter function:
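Filter takes a function that receives one record and returns True to keep it. The sketch below assumes a string field named essential_item holding YES or NO; both the field name and its values are illustrative.

```python
def is_essential(rec):
    """Return True for records flagged as essential (assumed field and value)."""
    return rec["essential_item"] == "YES"

def filter_essential(dyf):
    """Keep only the essential orders."""
    from awsglue.transforms import Filter  # needs the Glue libraries

    return Filter.apply(frame=dyf, f=is_essential)

# In the PySpark shell:
#   dyf_essential = filter_essential(dyf_mapped)
#   dyf_essential.toDF().show()
```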
Map allows us to apply a transformation to each record of a DynamicFrame. For our case, we want to target a certain zip code for next day air shipping. We implement a simple next_day_air function and pass it to the DynamicFrame:
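A possible shape for such a function, assuming a string zip field; the target zip code is a made-up value. Records that do not match are returned unchanged, so the new field appears only where it applies.

```python
TARGET_ZIP = "75034"  # hypothetical zip code chosen for illustration

def next_day_air(rec):
    """Flag a record for next day air shipping when its zip matches the target."""
    if rec["zip"] == TARGET_ZIP:
        rec["next_day_air"] = True
    return rec

def apply_next_day_air(dyf):
    """Run next_day_air over every record of the frame."""
    from awsglue.transforms import Map  # needs the Glue libraries

    return Map.apply(frame=dyf, f=next_day_air)

# In the PySpark shell:
#   dyf_map = apply_next_day_air(dyf_essential)
```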
To ship essential orders to the appropriate addresses, we need customer data. We demonstrate this by generating a custom JSON dataset consisting of zip codes and customer addresses. In this use case, this data represents the customer data of the company that we want to join later on.
We generate JSON strings consisting of customer data and use the Spark json function to convert them to a JSON structure (enter each jsonStr variable one at a time in case the terminal errors out):
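A sketch of this step with two made-up customer records. The field names (zip, customers, id, address) and all values are assumptions; the nested customers array is there so that later flattening has something to work on.

```python
import json

# Hypothetical customer data keyed by zip code.
CUSTOMER_JSON = [
    json.dumps({"zip": "75091", "customers": [{"id": 623, "address": "108 Park Street, TX"}]}),
    json.dumps({"zip": "75023", "customers": [{"id": 823, "address": "99 Lake Road, TX"}]}),
]

def build_customers_df(spark):
    """Parse the JSON strings into a DataFrame, one row per string."""
    rdd = spark.sparkContext.parallelize(CUSTOMER_JSON)
    return spark.read.json(rdd)

# In the PySpark shell:
#   df_customers = build_customers_df(spark)
#   df_customers.printSchema()
```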
To convert the DataFrame back to a DynamicFrame to continue with our operations, enter the following code:
To join with the order list, we don’t need all the columns, so we use the SelectFields function to shortlist the columns we need. In our use case, we need only the zip code column, but we can add more columns because the paths argument accepts a list:
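A minimal sketch, assuming the zip code column is named zip. The helper name select_paths is hypothetical.

```python
SELECTED_PATHS = ["zip"]  # add more column names here as needed

def select_paths(dyf, paths=SELECTED_PATHS):
    """Return a frame containing only the listed columns."""
    from awsglue.transforms import SelectFields  # needs the Glue libraries

    return SelectFields.apply(frame=dyf, paths=list(paths))

# In the PySpark shell:
#   dyf_zip_only = select_paths(dyf_essential)
```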
The Join function is straightforward and manages duplicate columns. Both datasets had a column named zip, so AWS Glue added a period (.) to one of the duplicate column names to avoid errors:
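Join.apply takes the two frames and one list of key columns for each. The sketch assumes both frames carry a zip column of the same type; the helper name is hypothetical.

```python
JOIN_KEYS = ["zip"]  # assumed key column, present in both frames

def join_on_zip(dyf_left, dyf_right):
    """Equijoin the two frames on their zip columns."""
    from awsglue.transforms import Join  # needs the Glue libraries

    return Join.apply(frame1=dyf_left, frame2=dyf_right, keys1=JOIN_KEYS, keys2=JOIN_KEYS)

# In the PySpark shell:
#   dyf_join = join_on_zip(dyf_zip_only, dyf_customers)
#   dyf_join.printSchema()
```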
Because we don’t need two columns with the same name, we can use DropFields to drop one or multiple columns all at once. The backticks (`) around .zip inside the function call are needed because the column name contains a period (.):
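A sketch of the call, with the duplicate column written as `.zip` inside backticks so the period is read as part of the name rather than as a path separator.

```python
DUPLICATE_PATHS = ["`.zip`"]  # backticks protect the period in the column name

def drop_duplicate_zip(dyf):
    """Drop the duplicate zip column left over from the join."""
    from awsglue.transforms import DropFields  # needs the Glue libraries

    return DropFields.apply(frame=dyf, paths=DUPLICATE_PATHS)

# In the PySpark shell:
#   dyf_dropped = drop_duplicate_zip(dyf_join)
```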
The Relationalize function can flatten nested structures and create multiple DynamicFrames. Our customer column from the previous operation is a nested structure, and Relationalize can convert it into multiple flattened DynamicFrames:
We can’t run .show() on the result yet because it’s a collection of DynamicFrames. We first need to check which keys are present. See the following code:
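The two steps could look like the sketch below. The staging path and the root name are assumptions: Relationalize wants a staging location, and the keys of the resulting collection are derived from the name passed in.

```python
def relationalize(dyf, staging_path="/tmp/glue_staging", root_name="root"):
    """Flatten nested fields into a collection of DynamicFrames."""
    from awsglue.transforms import Relationalize  # needs the Glue libraries

    return Relationalize.apply(frame=dyf, staging_path=staging_path, name=root_name)

def collection_keys(dfc):
    """List the names of the frames held in a collection, sorted for readability."""
    return sorted(dfc.keys())

# In the PySpark shell:
#   dfc_flat = relationalize(dyf_dropped)
#   collection_keys(dfc_flat)
```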
In the follow-up function in the next section, we show how to pick the DynamicFrame from a collection of multiple DynamicFrames.
The SelectFromCollection function allows us to retrieve a specific DynamicFrame from a collection of DynamicFrames. For this use case, we retrieve both DynamicFrames from the previous operation using this function.
To retrieve the first DynamicFrame, enter the following code:
To retrieve the second DynamicFrame, enter the following code:
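Both retrievals use the same call with a different key. The key names in the usage comment are assumptions; use whatever keys() reports for your collection.

```python
def pick_frame(dfc, key):
    """Return the DynamicFrame stored under the given key."""
    from awsglue.transforms import SelectFromCollection  # needs the Glue libraries

    return SelectFromCollection.apply(dfc=dfc, key=key)

# In the PySpark shell (hypothetical keys):
#   dyf_root = pick_frame(dfc_flat, "root")
#   dyf_customers_flat = pick_frame(dfc_flat, "root_customers")
#   dyf_customers_flat.toDF().show()
```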
The second DynamicFrame we retrieved from the previous operation introduces a period (.) into our column names, which also become very lengthy. We can change that using the RenameField function.
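A sketch of renaming such a column with the RenameField transform. The column name in the usage comment is an assumed example, and quote_path is a hypothetical helper that adds the backticks a dotted name needs.

```python
def quote_path(name):
    """Wrap a column name in backticks when it contains a period."""
    return "`{}`".format(name) if "." in name else name

def rename_column(dyf, old_name, new_name):
    """Rename one column, quoting the old name if it contains a period."""
    from awsglue.transforms import RenameField  # needs the Glue libraries

    return RenameField.apply(frame=dyf, old_name=quote_path(old_name), new_name=new_name)

# In the PySpark shell (hypothetical column name):
#   dyf_renamed = rename_column(dyf_customers_flat, "customers.val.address", "address")
```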
ResolveChoice can gracefully handle column type ambiguities. For more information about the full capabilities of ResolveChoice, see the GitHub repo.
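As one illustration, a column holding mixed types can be cast to a single type through the specs argument, a list of (column, action) pairs. The column name and target type below are assumptions.

```python
# Hypothetical spec: force every value in customer_id to long.
CHOICE_SPECS = [("customer_id", "cast:long")]

def resolve_choices(dyf, specs=CHOICE_SPECS):
    """Resolve ambiguous column types according to the given specs."""
    from awsglue.transforms import ResolveChoice  # needs the Glue libraries

    return ResolveChoice.apply(frame=dyf, specs=list(specs))

# In the PySpark shell:
#   dyf_resolved = resolve_choices(dyf_renamed)
```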
We generate another dataset to demonstrate a few other functions. In this use case, the company’s warehouse inventory data is in a nested JSON structure, which is initially in a String format. See the following code:
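A sketch of such a dataset, with the inventory stored as a JSON string in a column named data. Warehouse names, items, and quantities are all made up; pears is null everywhere to represent an item that is out of stock.

```python
import json

WAREHOUSE_COLUMNS = ["warehouse_loc", "data"]
# Hypothetical inventory; the second element of each row is a JSON string.
WAREHOUSE_ROWS = [
    ("TX_WAREHOUSE", json.dumps({"strawberry": 220, "pineapple": 560, "pears": None})),
    ("CA_WAREHOUSE", json.dumps({"strawberry": 34, "pineapple": 123, "pears": None})),
]

def build_warehouse_df(spark):
    """Build a DataFrame with two string columns from the rows above."""
    return spark.createDataFrame(WAREHOUSE_ROWS, schema=WAREHOUSE_COLUMNS)

# In the PySpark shell:
#   df_warehouse = build_warehouse_df(spark)
#   df_warehouse.printSchema()
```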
We use Unbox to extract JSON from String format for the new data. Compare the preceding printSchema() output with the following code:
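A sketch of the Unbox call, assuming the JSON string lives in a column named data.

```python
def unbox_json(dyf, path="data"):
    """Parse the JSON string stored at `path` into a nested structure."""
    from awsglue.transforms import Unbox  # needs the Glue libraries

    return Unbox.apply(frame=dyf, path=path, format="json")

# In the PySpark shell:
#   dyf_unbox = unbox_json(dyf_warehouse)
#   dyf_unbox.printSchema()
```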
Unnest allows us to flatten a single DynamicFrame to a more relational table format. We apply Unnest to the nested structure from the previous operation and flatten it:
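In the awsglue library this transform is exposed as the UnnestFrame class. A minimal sketch, with a hypothetical wrapper name:

```python
def unnest(dyf):
    """Flatten nested fields into top-level columns with dotted names."""
    from awsglue.transforms import UnnestFrame  # needs the Glue libraries

    return UnnestFrame.apply(frame=dyf)

# In the PySpark shell:
#   dyf_unnest = unnest(dyf_unbox)
#   dyf_unnest.printSchema()
```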
The DropNullFields function makes it easy to drop columns in which every value is null. Our warehouse data indicates that the warehouse is out of pears, so that column can be dropped. We apply the DropNullFields function on the DynamicFrame, which automatically identifies the columns with null values and drops them:
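The call needs only the frame. The wrapper name is hypothetical.

```python
def drop_null_fields(dyf):
    """Drop every column whose values are all null."""
    from awsglue.transforms import DropNullFields  # needs the Glue libraries

    return DropNullFields.apply(frame=dyf)

# In the PySpark shell:
#   dyf_no_nulls = drop_null_fields(dyf_unnest)
#   dyf_no_nulls.printSchema()
```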
SplitFields allows us to split a DynamicFrame into two. The function takes the field names for the first DynamicFrame that we want to generate, followed by the names of the two DynamicFrames:
For the first DynamicFrame, see the following code:
For the second DynamicFrame, see the following code:
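A sketch of the split and of viewing each half. The column name and the two frame names are assumptions; the result is a collection, so each frame is selected by its name.

```python
FIRST_FRAME_PATHS = ["warehouse_loc"]  # hypothetical columns for the first frame

def split_fields(dyf, paths=FIRST_FRAME_PATHS, name1="first", name2="second"):
    """Put `paths` in the frame called name1 and all other columns in name2."""
    from awsglue.transforms import SplitFields  # needs the Glue libraries

    return SplitFields.apply(frame=dyf, paths=list(paths), name1=name1, name2=name2)

# In the PySpark shell:
#   dfc_fields = split_fields(dyf_no_nulls)
#   dfc_fields.select("first").toDF().show()
#   dfc_fields.select("second").toDF().show()
```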
SplitRows allows us to split our dataset into two DynamicFrames: one with the rows whose counts fall within a specific range, and one with the remaining rows:
For the first DynamicFrame, see the following code:
For the second DynamicFrame, see the following code:
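SplitRows describes the range with a comparison dictionary that maps a column to operators and bounds. In this sketch the column name stock and both bounds are made up; rows that satisfy every comparison go to the first frame and the rest to the second.

```python
# Hypothetical range: strictly more than 100 and strictly less than 200.
ROW_RANGE = {"stock": {">": 100, "<": 200}}

def split_rows(dyf, comparison=ROW_RANGE, name1="in_range", name2="out_of_range"):
    """Split rows by whether they satisfy all comparisons."""
    from awsglue.transforms import SplitRows  # needs the Glue libraries

    return SplitRows.apply(frame=dyf, comparison_dict=comparison, name1=name1, name2=name2)

# In the PySpark shell:
#   dfc_rows = split_rows(dyf_no_nulls)
#   dfc_rows.select("in_range").toDF().show()
#   dfc_rows.select("out_of_range").toDF().show()
```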
Spigot allows you to write a sample dataset to a destination during transformation. For our use case, we write the top 10 records locally:
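A sketch using the topk option to cap the number of sampled records. The output path in the usage comment is an assumption; Spigot writes the sample as a side effect and returns the input frame, so it can sit in the middle of a chain of transforms.

```python
def write_sample(dyf, path, top_k=10):
    """Write up to top_k records to `path` and pass the frame through."""
    from awsglue.transforms import Spigot  # needs the Glue libraries

    return Spigot.apply(frame=dyf, path=path, options={"topk": top_k})

# In the PySpark shell (hypothetical local path):
#   dyf_same = write_sample(dyf_no_nulls, "/tmp/glue_spigot_sample")
```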
Depending on your local environment configuration, Spigot may run into errors. Alternatively, you can use an AWS Glue endpoint or an AWS Glue ETL job to run this function.
The write_dynamic_frame function writes a DynamicFrame using the specified connection and format. For our use case, we write locally (we use a connection_type of S3 with a POSIX path argument in connection_options, which allows writing to local storage):
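A sketch of the write through from_options. The output path and the json format are assumptions; other formats that AWS Glue supports can be passed the same way.

```python
def write_locally(glue_context, dyf, path, fmt="json"):
    """Write the frame to a local directory through the S3 connection type."""
    return glue_context.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        connection_options={"path": path},  # a POSIX path writes to local storage
        format=fmt,
    )

# In the PySpark shell (hypothetical output directory):
#   write_locally(glueContext, dyf_no_nulls, "/tmp/glue_output")
```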
This article discussed the PySpark ETL capabilities of AWS Glue. A good next step is to test further with an AWS Glue development endpoint or to add jobs directly in AWS Glue. For more information, see General Information about Programming AWS Glue ETL Scripts.
About the Authors
Adnan Alvee is a Big Data Architect for AWS ProServe Remote Consulting Services. He helps build solutions for customers leveraging their data and AWS services. Outside of AWS, he enjoys playing badminton and drinking chai.
Imtiaz (Taz) Sayed is the World Wide Tech Leader for Data Analytics at AWS. He is an ardent data engineer and relishes connecting with the data analytics community.