Building Python modules from a wheel for Spark ETL workloads using AWS Glue 2.0

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. AWS Glue 2.0 features an upgraded infrastructure for running Apache Spark ETL jobs in AWS Glue with reduced startup times. With reduced startup delay time and lower minimum billing duration, overall jobs complete faster, enabling you to run micro-batching and time-sensitive workloads more cost-effectively. To use this feature with your AWS Glue Spark ETL jobs, choose 2.0 for the AWS Glue version when creating your jobs.

AWS Glue 2.0 also lets you provide additional Python modules at the job level. You can use the --additional-python-modules option with a list of comma-separated Python modules to add a new module or change the version of an existing module. AWS Glue uses the Python Package Installer (pip3) to install the additional modules. You can pass additional options specified by the --python-modules-installer-option to pip3 to install the modules. Any incompatibly or limitations from pip3 apply. AWS Glue supports Python modules out of the box. For more information, see Running Spark ETL Jobs with Reduced Startup Times.

In this post, we go through the steps needed to create an AWS Glue Spark ETL job with the new capability to install or upgrade Python modules from a wheel file, from a PyPI repository, or from an Amazon Simple Storage Service (Amazon S3) bucket. We discuss approaches to install additional python modules for an AWS Glue Spark ETL job from a PyPI repository or from a wheel file on Amazon S3 in a VPC with and without internet access.

Setting up an AWS Glue job in a VPC with internet access

To set up your AWS Glue job in a VPC with internet access, you have two options:

Install Python modules from a PyPI repository
Install Python modules using a wheel file on Amazon S3

To setup an Internet Gateway and attach to a VPC, please refer the documentation here.

The following diagram illustrates the final architecture.

Installing Python modules from a PyPI repository

You can create an AWS Glue Spark ETL job with job parameters --additional-python-modules and --python-modules-installer-option to install a new Python module or update an existing Python module from a PyPI repository.

The following screenshot shows the Amazon CloudWatch logs for the job.

The AWS Glue job successfully uninstalled the previous version of scikit-learn and installed the provided version. We can also see that the nltk requirement was already satisfied.

Installing Python modules using a wheel file from Amazon S3

To install a new Python module or update an existing Python module using a wheel file from Amazon S3, create an AWS Glue Spark ETL job with job parameters --additional-python-modules and --python-modules-installer-option.

The following screenshot shows the CloudWatch logs for the job.

The AWS Glue job successfully installed the psutil Python module using a wheel file from Amazon S3.

Setting up an AWS Glue job in a VPC without internet access

In this section, we discuss the steps to set up an AWS Glue job in a VPC without internet access. The following diagram illustrates this architecture.

Setting up a VPC and a VPC endpoint for Amazon S3

As our first step, we will set up a VPC.

Create a VPC with at least one private subnet, and make sure that DNS hostnames are enabled.

For more information about creating a private VPC, see VPC with a private subnet only and AWS Site-to-Site VPN access.

Create an Amazon S3 endpoint. During the setup, associate the endpoint with the route table of your private subnet.

For more information about creating an Amazon S3 endpoint, see Amazon VPC Endpoints for Amazon S3.

Setting up an S3 bucket for Python repository

You now configure your S3 bucket for your Python repository.

Create an S3 bucket.
Configure the bucket to host a static website for Python repository.

You want to qualify that the S3 bucket holds the Python packages and acts as a repository. For more information, see Enabling website hosting.

Record the Amazon S3 website endpoint.
Configure the bucket policy with restricted access to a specific Amazon VPC (AWS Glue VPC).

Creating a Python repository on Amazon S3

To create your Python repository on Amazon S3, complete the following steps:

If you haven’t already, install Docker for Linux, Windows, or macOS on your computer.

Create a modules_to_install.txt file with required Python modules and their versions. For example, see the following code:

psutil==5.7.2
scikit-learn==0.23.0
scikit-learn==0.23.1
scikit-learn==0.23.2
geopy==2.0.0
Shapely==1.7.1
googleads==25.0.0
nltk==3.5

Create a script.sh file with the following code:

#!/bin/bash
# install required lib python3.7 and gcc
yum -y install gcc python3-devel python3
# create the virtual environment
python3.7 -m venv wheel-env
# activate the virtual environment
source wheel-env/bin/activate
# install wheel package for creating wheel files
pip install wheel
# create folder for package and cache
mkdir wheelhouse cache
# run pip command on cache location
cd cache
for f in $(cat ../modules_to_install.txt); do pip wheel $f -w ../wheelhouse; done
cd ..
# create the index.html file
cd wheelhouse
INDEXFILE="<html><head><title>Links</title></head><body><h1>Links</h1>"
for f in *.whl; do INDEXFILE+="<a href='$f'>$f</a><br>"; done
INDEXFILE+="</body></html>"
echo "$INDEXFILE" > index.html
cd ..
# cleanup environment
deactivate
rm -rf cache wheel-env
# exit the docker container
exit

Create a wheelhouse using the following Docker command:

docker run -v "$PWD":/tmp amazonlinux:latest /bin/bash -c "cd /tmp;sh script.sh"

The expected outcome looks like the following:

|- modules_to_install.txt
|- script.sh
|- wheelhouse/
  |- PyYAML-5.3.1-cp37-cp37m-linux_x86_64.whl
  |- psutil-5.7.2-cp37-cp37m-linux_x86_64.whl
  |- scikit_learn-0.23.0-cp37-cp37m-manylinux1_x86_64.whl
  |- scikit_learn-0.23.1-cp37-cp37m-manylinux1_x86_64.whl
  ....
  |- index.html

Copy the wheelhouse directory into the S3 bucket using following code:

S3_BUCKET="MY-PYTHON-REPO-BUCKET"
S3_GLUE_SCRIPT_BUCKET="MY-SCRIPT-BUCKET"
aws s3 cp wheelhouse/ "s3://$S3_BUCKET/wheelhouse/" --recursive --profile default

For more information, see Named profiles.

Creating an AWS Glue connection

To enable AWS Glue to access resources inside your VPC, you must provide additional VPC-specific configuration information that includes VPC subnet IDs and security group IDs. For instructions, see Creating the Connection to Amazon S3.

Test if the AWS Glue connection to the S3 bucket MY-PYTHON-REPO-BUCKET is working properly. For instructions, see Testing an AWS Glue Connection.

The following screenshot shows the message that your connection is successful.

Creating an AWS Glue Spark ETL job with an AWS Glue connection

Finally, create an AWS Glue Spark ETL job with job parameters --additional-python-modules and --python-modules-installer-option to install a new Python module or update the existing Python module using Amazon S3 as the Python repository.

The following code is an example job parameter:

{
"--additional-python-modules" : "psutil==5.7.2,scikit-learn==0.23.1,geopy==2.0.0,Shapely==1.7.1,googleads==25.0.0,nltk==3.5",
"--python-modules-installer-option" : "--no-index --find-links=http://MY-BUCKET.s3-website-us-east-1.amazonaws.com/wheelhouse --trusted-host MY-BUCKET.s3-website-us-east-1.amazonaws.com"
}

For this use case, we create a sample S3 bucket, a VPC, and an AWS Glue ETL Spark job in the US East (N. Virginia) Region, us-east-1.

To view the CloudWatch logs for the job, complete the following steps:

Choose your AWS Glue job.
Select the run ID.
Choose Error logs.
Select the driver log stream for that run ID.
Check the status of the pip installation step.

The logs show that the AWS Glue job successfully installed all the Python modules and its dependencies from the Amazon S3 PyPI repository using Amazon S3 static web hosting.

Limitation: It is currently not supported to install a python module with a C binding that relies on a native library (compiled) from a rpm package that is not available at runtime.

Summary

In this post, you learned how to configure AWS Glue Spark ETL jobs to install additional Python modules and its dependencies in an environment that has access to internet and in a secure environment that doesn’t have access to the internet.

About the Authors

Rumeshkrishnan Mohan is a Big Data Consultant with Amazon Web Services. He works with Global Customers in building their data lakes.

Krithivasan Balasubramaniyan is Senior Consultant at Amazon Web Services. He enables global enterprise customers in their digital transformation journey and helps architect cloud native solutions.