AWS Big Data Blog
Building an AWS Glue ETL pipeline locally without an AWS account
This blog was last reviewed in May 2022.
If you’re new to AWS Glue and looking to understand its transformation capabilities without incurring an added expense, or if you’re simply wondering if AWS Glue ETL is the right tool for your use case and want a holistic view of AWS Glue ETL functions, then please continue reading. In this post, we walk you through several AWS Glue ETL functions with supporting examples, using a local PySpark shell in a containerized environment with no AWS artifact dependency. If you’re already familiar with AWS Glue and Apache Spark, you can use this solution as a quick cheat sheet for AWS Glue PySpark validations.
You don’t need an AWS account to follow along with this walkthrough. We use small example datasets for our use case and go through the transformations of several AWS Glue ETL PySpark functions: ApplyMapping, Filter, SplitRows, SelectFields, Join, DropFields, Relationalize, SelectFromCollection, RenameField, Unbox, Unnest, DropNullFields, SplitFields, Spigot, and Write Dynamic Frame.
This post introduces the transformation capabilities of AWS Glue and offers insight into possible uses of the supported functions. The goal is to get up and running with AWS Glue ETL functions in the shortest possible time, at no cost and without any AWS environment dependency.
Prerequisites
To follow along, you should have the following resources:
- Basic programming experience
- Basic Python and Spark knowledge (not required but good to have)
- A desktop or workstation with Docker installed and running
If you prefer to set up the environment locally outside of a Docker container, you can follow the instructions provided in the GitHub repo, which hosts libraries used in AWS Glue. These libraries extend Apache Spark with additional data types and operations for ETL workflows.
Setting up resources
For this post, we use the amazon/aws-glue-libs:glue_libs_1.0.0_image_01 image from Docker Hub. This image has only been tested with the AWS Glue 1.0 Spark shell (PySpark). The image also supports Jupyter and Zeppelin notebooks and a CLI interpreter.
Please refer to the blog Developing AWS Glue ETL jobs locally using a container to set up the environment locally.
Importing GlueContext
To get started, enter the following import statements in the PySpark shell. We import GlueContext, which wraps the Spark SQLContext, thereby providing mechanisms to interact with Apache Spark:
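A minimal setup looks something like the following sketch (the variable names glueContext and spark are our own choices):

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *
from awsglue.dynamicframe import DynamicFrame

# Wrap the shell's SparkContext in a GlueContext and grab a SparkSession
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
```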
Dataset 1
We first generate a Spark DataFrame consisting of dummy order data for a fictional company. We process the data using AWS Glue PySpark functions.
Enter the following code into the shell:
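A sketch of such a dataset follows; the column names (order_id, item_name, quantity, zipcode, essential_item) and values are made up for illustration, and every value starts out as a string so we can demonstrate type conversion later:

```python
# Hypothetical order data for a fictional company
order_list = [
    ["1005", "Soap",        "12", "10001", "yes"],
    ["1006", "Towels",      "3",  "10001", "no"],
    ["1007", "Thermometer", "5",  "11001", "yes"],
    ["1008", "Television",  "1",  "11001", "no"],
    ["1009", "Sanitizer",   "25", "98109", "yes"],
]

order_schema = ["order_id", "item_name", "quantity", "zipcode", "essential_item"]

df_orders = spark.createDataFrame(order_list, schema=order_schema)
```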
The following .show() command allows us to view the DataFrame in the shell:
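For example, with the df_orders name from the sketch above:

```python
df_orders.show()
```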
DynamicFrame
A DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially. Instead, AWS Glue computes a schema on the fly when required. We convert the df_orders DataFrame into a DynamicFrame.
Enter the following code in the shell:
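One way to write the conversion, using the DynamicFrame.fromDF helper (the name dyf_orders is our choice):

```python
# Convert the Spark DataFrame into an AWS Glue DynamicFrame
dyf_orders = DynamicFrame.fromDF(df_orders, glueContext, "dyf_orders")
```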
Now that we have our DynamicFrame, we can start working with the datasets using AWS Glue transform functions.
ApplyMapping
The columns in our data might be in different formats, and you may want to change their respective names. ApplyMapping is the best option for changing the names and formats of all the columns collectively. For our dataset, we change some of the columns from String to Long format to save storage space later. We also shorten the column zipcode to zip. See the following code:
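A sketch against the hypothetical columns above; each mapping tuple is (source name, source type, target name, target type):

```python
dyf_applyMapping = ApplyMapping.apply(
    frame=dyf_orders,
    mappings=[
        ("order_id",       "string", "order_id",       "long"),
        ("item_name",      "string", "item_name",      "string"),
        ("quantity",       "string", "quantity",       "long"),
        ("zipcode",        "string", "zip",            "long"),  # shorten zipcode to zip
        ("essential_item", "string", "essential_item", "string"),
    ],
)
dyf_applyMapping.printSchema()
```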
Filter
We now want to prioritize our order delivery for essential items. We can achieve that using the Filter function:
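A sketch, assuming the essential_item flag from the hypothetical dataset above:

```python
# Keep only the orders flagged as essential
dyf_filter = Filter.apply(
    frame=dyf_applyMapping,
    f=lambda x: x["essential_item"] == "yes",
)
dyf_filter.show()
```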
Map
Map allows us to apply a transformation to each record of a DynamicFrame. For our case, we want to target a certain zip code for next-day air shipping. We implement a simple next_day_air function and pass it to the DynamicFrame:
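Something along these lines; the zip code value and the new shipping field are our own choices:

```python
# Flag orders going to a particular (made-up) zip code for next-day air
def next_day_air(rec):
    if rec["zip"] == 11001:
        rec["shipping"] = "next_day_air"
    return rec

dyf_mapped = Map.apply(frame=dyf_filter, f=next_day_air)
dyf_mapped.show()
```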
Dataset 2
To ship essential orders to the appropriate addresses, we need customer data. We demonstrate this by generating a custom JSON dataset consisting of zip codes and customer addresses. In this use case, the data represents the company’s customer records, which we join with the order data later on.
We generate JSON strings consisting of customer data and use the Spark json function to convert them to a JSON structure (enter each jsonStr variable one at a time in case the terminal errors out):
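A sketch of what the customer JSON might look like; the zip values line up with the hypothetical order data, and the nested customers field gives Relationalize something to flatten later:

```python
jsonStr1 = '{"zip": 11001, "customers": [{"name": "Alice", "address": "123 Main St"}, {"name": "Bob", "address": "456 Oak Ave"}]}'
jsonStr2 = '{"zip": 98109, "customers": [{"name": "Carol", "address": "789 Pine Rd"}]}'

# Parse the JSON strings into a Spark DataFrame
df_customers = spark.read.json(spark.sparkContext.parallelize([jsonStr1, jsonStr2]))
df_customers.show()
```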
To convert the DataFrame back to a DynamicFrame to continue with our operations, enter the following code:
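For example:

```python
dyf_customers = DynamicFrame.fromDF(df_customers, glueContext, "dyf_customers")
```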
SelectFields
To join with the order list, we don’t need all the columns, so we use the SelectFields function to shortlist the columns we need. In our use case, we need the zip code column, but we can add more columns because the paths argument accepts a list:
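A sketch using the hypothetical customer frame from above:

```python
# Keep only the fields we need for the join
dyf_selectFields = SelectFields.apply(
    frame=dyf_customers,
    paths=["zip", "customers"],
)
```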
Join
The Join function is straightforward and manages duplicate columns. We had two columns named zip, one from each dataset. AWS Glue added a period (.) to one of the duplicate column names to avoid errors:
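A sketch, joining the mapped order frame with the trimmed-down customer frame on zip:

```python
dyf_join = Join.apply(dyf_mapped, dyf_selectFields, "zip", "zip")
dyf_join.printSchema()  # one of the duplicate zip columns appears as `.zip`
```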
DropFields
Because we don’t need two columns with the same name, we can use DropFields to drop one or multiple columns all at once. The backticks (`) around .zip inside the function call are needed because the column name contains a period (.):
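For example:

```python
# Drop the duplicate column that AWS Glue renamed to .zip
dyf_dropfields = DropFields.apply(
    frame=dyf_join,
    paths=["`.zip`"],
)
```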
Relationalize
The Relationalize function can flatten nested structures and create multiple DynamicFrames. Our customer column from the previous operation is a nested structure, and Relationalize can convert it into multiple flattened DynamicFrames:
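A sketch; the staging path is any writable local directory, and the root name is our own choice:

```python
# Flatten the nested customers structure into a collection of DynamicFrames
dfc = Relationalize.apply(
    frame=dyf_dropfields,
    staging_path="/tmp/glue-relationalize/",
    name="root",
)
```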
We can’t run a .show() on the result yet because it’s a collection of DynamicFrames. We first need to check which keys are present. See the following code:
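For example (the generated key names depend on the nested field names in your data):

```python
dfc.keys()  # e.g. dict_keys(['root', 'root_customers'])
```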
In the follow-up function in the next section, we show how to pick the DynamicFrame from a collection of multiple DynamicFrames.
SelectFromCollection
The SelectFromCollection function allows us to retrieve a specific DynamicFrame from a collection of DynamicFrames. For this use case, we retrieve both DynamicFrames from the previous operation using this function.
To retrieve the first DynamicFrame, enter the following code:
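A sketch, assuming the root key from the Relationalize call above:

```python
dyf_root = SelectFromCollection.apply(dfc=dfc, key="root")
dyf_root.show()
```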
To retrieve the second DynamicFrame, enter the following code:
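And, assuming the generated key for the flattened array is root_customers:

```python
dyf_root_customers = SelectFromCollection.apply(dfc=dfc, key="root_customers")
dyf_root_customers.show()
```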
RenameField
The second DynamicFrame we retrieved from the previous operation has a period (.) in its column names, and the names are quite lengthy. We can change that using the RenameField function:
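A sketch; the old column name below is what Relationalize typically generates for our hypothetical customers array, but the exact name depends on your data (note the backticks around the dotted name):

```python
dyf_renamed = RenameField.apply(
    frame=dyf_root_customers,
    old_name="`customers.val.address`",
    new_name="address",
)
dyf_renamed.show()
```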
ResolveChoice
ResolveChoice can gracefully handle column type ambiguities. For more information about the full capabilities of ResolveChoice, see the GitHub repo.
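As a minimal illustration (the column name and target type here are hypothetical), ResolveChoice can force an ambiguous column to a single type:

```python
# Cast quantity to long wherever its type is ambiguous
dyf_resolved = ResolveChoice.apply(
    frame=dyf_applyMapping,
    specs=[("quantity", "cast:long")],
)
```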
Dataset 3
We generate another dataset to demonstrate a few other functions. In this use case, the company’s warehouse inventory data is in a nested JSON structure, which is initially in a String format. See the following code:
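A sketch of such a dataset; the warehouse names and fruit counts are made up, and the JSON payload is deliberately stored as a plain string (note the all-null pears field, which we use later with DropNullFields):

```python
warehouse_inventory_list = [
    ["TX_WAREHOUSE", '{"strawberry": 10, "pineapple": 1, "mango": 5, "pears": null}'],
    ["CA_WAREHOUSE", '{"strawberry": 34, "pineapple": 12, "mango": 0, "pears": null}'],
]

df_warehouse = spark.createDataFrame(warehouse_inventory_list, ["warehouse_loc", "data"])
dyf_warehouse = DynamicFrame.fromDF(df_warehouse, glueContext, "dyf_warehouse")
dyf_warehouse.printSchema()  # the data column shows up as a plain string
```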
Unbox
We use Unbox to extract JSON from String format for the new data. Compare the preceding printSchema() output with the following code:
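A sketch, unboxing the string column from the warehouse frame above:

```python
# Parse the JSON held in the data column into a nested structure
dyf_unbox = Unbox.apply(frame=dyf_warehouse, path="data", format="json")
dyf_unbox.printSchema()
```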
Unnest
Unnest allows us to flatten a single DynamicFrame into a more relational table format. We apply Unnest to the nested structure from the previous operation and flatten it:
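A sketch; in awsglue.transforms the transform class is UnnestFrame (the equivalent DynamicFrame method is unnest()):

```python
# Flatten the nested structure into top-level columns such as data.strawberry
dyf_unnest = UnnestFrame.apply(frame=dyf_unbox)
dyf_unnest.printSchema()
```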
DropNullFields
The DropNullFields function makes it easy to drop columns whose values are all null. Our warehouse data indicates that it’s out of pears, so that column can be dropped. We apply the DropNullFields function to the DynamicFrame, which automatically identifies the columns with null values and drops them:
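For example:

```python
# The all-null pears field disappears here
dyf_dropNullfields = DropNullFields.apply(frame=dyf_unnest)
dyf_dropNullfields.show()
```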
SplitFields
SplitFields allows us to split a DynamicFrame into two. The function takes the field names we want in the first DynamicFrame, followed by the names of the two resulting DynamicFrames:
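A sketch; the paths below come from our hypothetical warehouse data (backticks are again needed for names containing a period), and name1/name2 label the two resulting frames:

```python
dfc_split = SplitFields.apply(
    frame=dyf_dropNullfields,
    paths=["warehouse_loc", "`data.strawberry`"],
    name1="dyf_split1",
    name2="dyf_split2",
)
```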
For the first DynamicFrame, see the following code:
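A sketch, picking the first frame out of the collection:

```python
dyf_split1 = SelectFromCollection.apply(dfc=dfc_split, key="dyf_split1")
dyf_split1.show()
```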
For the second DynamicFrame, see the following code:
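And the second:

```python
dyf_split2 = SelectFromCollection.apply(dfc=dfc_split, key="dyf_split2")
dyf_split2.show()
```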
SplitRows
SplitRows allows us to filter our dataset within a specific range of counts and split it into two DynamicFrames:
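A sketch; the column and the thresholds in comparison_dict are our own choices. Rows that satisfy the comparison land in the first frame, and the rest land in the second:

```python
dfc_splitrows = SplitRows.apply(
    frame=dyf_dropNullfields,
    comparison_dict={"`data.strawberry`": {">": 10, "<": 100}},
    name1="high_count",
    name2="low_count",
)
```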
For the first DynamicFrame, see the following code:
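A sketch, using the high_count key from the call above:

```python
dyf_high = SelectFromCollection.apply(dfc=dfc_splitrows, key="high_count")
dyf_high.show()
```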
For the second DynamicFrame, see the following code:
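And the low_count frame:

```python
dyf_low = SelectFromCollection.apply(dfc=dfc_splitrows, key="low_count")
dyf_low.show()
```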
Spigot
Spigot allows you to write a sample dataset to a destination during transformation. For our use case, we write the top 10 records locally:
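A sketch; the local path and the topk option value are our own choices:

```python
# Write a sample of up to 10 records to a local path as JSON
dyf_spigot = Spigot.apply(
    frame=dyf_applyMapping,
    path="/tmp/glue-spigot/",
    options={"topk": 10},
)
```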
Depending on your local environment configuration, Spigot may run into errors. Alternatively, you can use an AWS Glue endpoint or an AWS Glue ETL job to run this function.
Write Dynamic Frame
The write_dynamic_frame function writes a DynamicFrame using the specified connection and format. For our use case, we write locally (we use a connection_type of S3 with a POSIX path argument in connection_options, which allows writing to local storage):
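A sketch; the output path and format are our own choices:

```python
# Write the DynamicFrame to a local POSIX path as JSON
glueContext.write_dynamic_frame.from_options(
    frame=dyf_dropNullfields,
    connection_type="s3",
    connection_options={"path": "/tmp/glue-output/"},
    format="json",
)
```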
Conclusion
This article discussed the PySpark ETL capabilities of AWS Glue. Further testing with an AWS Glue development endpoint, or directly adding jobs in AWS Glue, is a good next step to take your learning forward. For more information, see General Information about Programming AWS Glue ETL Scripts.
About the Authors
Adnan Alvee is a Big Data Architect for AWS ProServe Remote Consulting Services. He helps build solutions for customers leveraging their data and AWS services. Outside of AWS, he enjoys playing badminton and drinking chai.
Imtiaz (Taz) Sayed is the World Wide Tech Leader for Data Analytics at AWS. He is an ardent data engineer and relishes connecting with the data analytics community.