How do I use external Python libraries in an AWS Glue job?
Last updated: 2020-04-29
I want to use an external Python library in an AWS Glue job.
To use an external library in an Apache Spark ETL job:
1. Package the library files in a .zip file (unless the library is contained in a single .py file).
2. Upload the package to Amazon Simple Storage Service (Amazon S3).
3. Use the library in a job or job run.
1. Create a Python 2 or Python 3 library for boto3. Be sure that the AWS Glue version that you're using supports the Python version that you choose for the library. AWS Glue version 1.0 supports Python 2 and Python 3. For more information, see AWS Glue Versions.
Note: Libraries and extension modules for Spark jobs must be written in Python. Libraries that rely on C extensions, such as pandas, aren't supported.
2. Launch an Amazon Elastic Compute Cloud (Amazon EC2) Linux instance.
3. Connect to the instance.
4. Run the following commands to install pip and boto3. For more information, see Quickstart.
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
sudo python get-pip.py
sudo pip install boto3
5. Confirm the location of the Python site-packages directory:
python -m site
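The same locations can also be read programmatically with the standard-library site module; a minimal sketch that mirrors the relevant part of the `python -m site` output (standard library only, no other assumptions):

```python
import site
import sys

# Print the interpreter version and its site-packages directories,
# which is where pip-installed libraries such as boto3 end up.
print("Python", sys.version.split()[0])
for path in site.getsitepackages():
    print(path)
```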
6. Package the external library files in a .zip file (unless the library is contained in a single .py file). The .zip file must include an __init__.py file, and the package directory must be at the root of the archive. The __init__.py file can be empty. For more information, see Packages in the Python documentation. Example:
cd /usr/lib/python3.6/site-packages
sudo zip -r -X "/home/ec2-user/site-packages.zip" *
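To check that an archive has the layout AWS Glue expects (the package directory, with its __init__.py, at the root of the .zip), the structure can be reproduced and inspected with the standard-library zipfile module. This is a sketch only; the package name mylib and the temporary paths are placeholders:

```python
import os
import tempfile
import zipfile

# Build a tiny example package: a "mylib" directory containing an
# empty __init__.py, which is the minimum Python requires for a package.
tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "mylib")  # "mylib" is a placeholder package name
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()

# Zip it so the package directory sits at the root of the archive,
# matching what the `zip -r -X ... *` command above produces.
zip_path = os.path.join(tmp, "site-packages.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.write(pkg, arcname="mylib")
    zf.write(os.path.join(pkg, "__init__.py"), arcname="mylib/__init__.py")

# Inspect the archive: 'mylib/__init__.py' should appear at the root.
with zipfile.ZipFile(zip_path) as zf:
    names = zf.namelist()
print(names)
```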
7. Upload the package to Amazon S3:
aws s3 cp /home/ec2-user/site-packages.zip s3://awsexamplebucket/
To use an external library in a development endpoint:
1. Package the library and upload the file to Amazon S3, as explained previously.
2. Create the development endpoint. For Python library path, enter the Amazon S3 path for the package. For more information, see Loading Python Libraries in a Development Endpoint.
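For a Spark ETL job (as opposed to a development endpoint), the uploaded .zip is typically passed through the --extra-py-files special job parameter. A minimal sketch of where that Amazon S3 path goes when defining a job with boto3; the job name, IAM role, and script location below are illustrative placeholders, not values from this article:

```python
# Placeholder names: "example-job", the role, and the script path are illustrative.
job_args = {
    "Name": "example-job",
    "Role": "AWSGlueServiceRoleExample",
    "Command": {
        "Name": "glueetl",  # Spark ETL job type
        "ScriptLocation": "s3://awsexamplebucket/script.py",
    },
    # AWS Glue adds the .zip files listed in --extra-py-files to the
    # job's Python path, so the packaged library can be imported.
    "DefaultArguments": {
        "--extra-py-files": "s3://awsexamplebucket/site-packages.zip"
    },
}
print(job_args["DefaultArguments"]["--extra-py-files"])

# With AWS credentials configured, the job could then be created with:
# import boto3
# boto3.client("glue").create_job(**job_args)
```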