How do I use external Python libraries in an AWS Glue job?

Last updated: 2020-04-29

I want to use an external Python library in an AWS Glue job.

Short Description

To use an external library in an Apache Spark ETL job:

1.    Package the library files in a .zip file (unless the library is contained in a single .py file).

2.    Upload the package to Amazon Simple Storage Service (Amazon S3).

3.    Use the library in a job or job run.

Resolution

The following is an example of how to use an external library in a Spark ETL job. If you want to use an external library in a Python shell job, follow the steps at Providing Your Own Python Library.

1.    Decide whether to build the library package (boto3 in this example) for Python 2 or Python 3. Be sure that the AWS Glue version that you're using supports the Python version that you choose for the library. AWS Glue version 1.0 supports Python 2 and Python 3. For more information, see AWS Glue Versions.

Note: Libraries and extension modules for Spark jobs must be written in pure Python. Libraries such as pandas, which depend on compiled C extensions, aren't supported.
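One way to check a library against this restriction is to look for compiled extension files in its installed directory. The following is a minimal sketch, not an official AWS Glue check: the function name, the suffix list, and the throwaway directories are all assumptions made for illustration.

```python
import os
import tempfile

# File suffixes that usually indicate a compiled extension module (assumed list).
COMPILED_SUFFIXES = (".so", ".pyd", ".dll", ".dylib")


def find_compiled_files(package_dir):
    """Return the relative paths of compiled files found under package_dir."""
    found = []
    for root, _dirs, files in os.walk(package_dir):
        for name in files:
            if name.endswith(COMPILED_SUFFIXES):
                full_path = os.path.join(root, name)
                found.append(os.path.relpath(full_path, package_dir))
    return sorted(found)


# Demonstration with two throwaway packages (hypothetical names).
work_dir = tempfile.mkdtemp()
pure_dir = os.path.join(work_dir, "purelib_example")
native_dir = os.path.join(work_dir, "native_example")
os.makedirs(pure_dir)
os.makedirs(native_dir)
open(os.path.join(pure_dir, "__init__.py"), "w").close()
open(os.path.join(native_dir, "__init__.py"), "w").close()
open(os.path.join(native_dir, "_speedups.so"), "w").close()

pure_result = find_compiled_files(pure_dir)
native_result = find_compiled_files(native_dir)
print(pure_result)
print(native_result)
```

An empty result suggests that the package is pure Python, although it doesn't prove that every dependency of the package is.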

2.    Launch an Amazon Elastic Compute Cloud (Amazon EC2) Linux instance.

3.    Connect to the Linux instance using SSH.

4.    Run the following commands to install pip and boto3. For more information, see Quickstart in the Boto3 documentation.

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
sudo python get-pip.py
sudo pip install boto3

5.    Confirm the location of the Python site-packages directory:

python -m site

The output lists the module search paths, including the site-packages directory. Example:

/usr/lib/python3.6/site-packages
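As a cross-check, the same directory can be read from within Python. This is a minimal sketch that assumes the standard library sysconfig module reports the directory that pip installs pure-Python packages into on the instance. The function name is illustrative.

```python
import sysconfig


def site_packages_dir():
    """Return the directory where pure-Python packages are installed."""
    # "purelib" is the sysconfig key for the pure-Python install location.
    return sysconfig.get_paths()["purelib"]


print(site_packages_dir())
```

Run the script with the same interpreter that pip used, because each Python installation has its own site-packages directory.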

6.    Package the external library files in a .zip file (unless the library is contained in a single .py file). The .zip file must include an __init__.py file, and the package directory must be at the root of the archive. The __init__.py file can be empty. For more information, see Packages in the Python documentation. Example:

cd /usr/lib/python3.6/site-packages
sudo zip -r -X "/home/ec2-user/site-packages.zip" *
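The layout requirement (package directory at the root of the archive, with an __init__.py file inside it) can also be sketched with the standard library zipfile module. Unlike the zip command above, which archives everything in site-packages, this sketch archives a single package. The function name and the throwaway package "mylib" are hypothetical.

```python
import os
import tempfile
import zipfile


def zip_package(package_dir, zip_path):
    """Zip one package directory so that it sits at the root of the archive."""
    package_dir = os.path.abspath(package_dir)
    if not os.path.isfile(os.path.join(package_dir, "__init__.py")):
        raise ValueError("missing __init__.py in " + package_dir)
    parent = os.path.dirname(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(package_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to the parent directory so that
                # archive entries start with "<package>/".
                archive.write(full_path, os.path.relpath(full_path, parent))
    return zip_path


# Demonstration with a throwaway package named "mylib" (hypothetical).
work_dir = tempfile.mkdtemp()
package_dir = os.path.join(work_dir, "mylib")
os.makedirs(package_dir)
open(os.path.join(package_dir, "__init__.py"), "w").close()
with open(os.path.join(package_dir, "helpers.py"), "w") as handle:
    handle.write("def greet():\n    return 'hello'\n")

archive_path = zip_package(package_dir, os.path.join(work_dir, "mylib.zip"))
with zipfile.ZipFile(archive_path) as archive:
    names = sorted(archive.namelist())
print(names)
```

Listing the archive entries before uploading is a quick way to confirm that the package directory is at the root and not nested under a longer path.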

7.    Upload the package to Amazon S3:

aws s3 cp /home/ec2-user/site-packages.zip s3://awsexamplebucket/
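The upload can also be done from Python. The following sketch assumes that boto3 is installed and that AWS credentials are configured on the instance. It takes the Amazon S3 client as a parameter, and the function name is illustrative. The bucket name and file path are the example values from this article.

```python
def upload_package(s3_client, local_path, bucket, key):
    """Upload the .zip file and return its s3:// URI."""
    # upload_file is the boto3 S3 client method that uploads a local file.
    s3_client.upload_file(local_path, bucket, key)
    return "s3://{}/{}".format(bucket, key)


# Usage on the instance (requires boto3 and AWS credentials):
#   import boto3
#   upload_package(
#       boto3.client("s3"),
#       "/home/ec2-user/site-packages.zip",
#       "awsexamplebucket",
#       "site-packages.zip",
#   )
```

The returned URI is the value that the job later needs as its library path.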

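8.    Use the library in a job or job run. In the job properties, enter the Amazon S3 path of the package for Python library path, or pass the path in the --extra-py-files job parameter when you start a job run.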
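For a single job run, the library path can be supplied through the --extra-py-files special job parameter. The following is a minimal sketch, assuming that a boto3 AWS Glue client is passed in and that the job already exists. The function names and the job name in the usage comment are hypothetical.

```python
def extra_py_files_arguments(library_paths):
    """Build job arguments that point AWS Glue at one or more libraries."""
    # Multiple paths are passed as one comma-separated string without spaces.
    return {"--extra-py-files": ",".join(library_paths)}


def start_run_with_libraries(glue_client, job_name, library_paths):
    """Start a job run that loads the given libraries and return the run ID."""
    response = glue_client.start_job_run(
        JobName=job_name,
        Arguments=extra_py_files_arguments(library_paths),
    )
    return response["JobRunId"]


# Usage (requires boto3 and AWS credentials; "my-etl-job" is hypothetical):
#   import boto3
#   start_run_with_libraries(
#       boto3.client("glue"),
#       "my-etl-job",
#       ["s3://awsexamplebucket/site-packages.zip"],
#   )
```

Arguments passed this way apply only to that job run. To make the library available to every run, set the path in the job definition instead.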

To use an external library in a development endpoint:

1.    Package the library and upload the file to Amazon S3, as explained previously.

2.    Create the development endpoint. For Python library path, enter the Amazon S3 path for the package. For more information, see Loading Python Libraries in a Development Endpoint.
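The development endpoint can also be created programmatically. This sketch assumes a boto3 AWS Glue client and an existing IAM role for the endpoint. The function name, the endpoint name, and the role ARN placeholder in the usage comment are hypothetical.

```python
def create_endpoint_with_library(glue_client, endpoint_name, role_arn, library_path):
    """Create a development endpoint that loads the library at library_path."""
    response = glue_client.create_dev_endpoint(
        EndpointName=endpoint_name,
        RoleArn=role_arn,
        # Amazon S3 path of the packaged library, as uploaded earlier.
        ExtraPythonLibsS3Path=library_path,
    )
    return response["EndpointName"]


# Usage (requires boto3 and AWS credentials; names are hypothetical):
#   import boto3
#   create_endpoint_with_library(
#       boto3.client("glue"),
#       "my-dev-endpoint",
#       "<ARN of the IAM role for the endpoint>",
#       "s3://awsexamplebucket/site-packages.zip",
#   )
```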

