How do I use external Python libraries in an AWS Glue job?
Last updated: 2020-04-29
I want to use an external Python library in an AWS Glue job.
To use an external library in an Apache Spark ETL job:
1. Package the library files in a .zip file (unless the library is contained in a single .py file).
2. Upload the package to Amazon Simple Storage Service (Amazon S3).
3. Use the library in a job or job run.
1. Create a Python 2 or Python 3 library for boto3. Be sure that the AWS Glue version that you're using supports the Python version that you choose for the library. AWS Glue version 1.0 supports Python 2 and Python 3. For more information, see AWS Glue Versions.
Note: Libraries and extension modules for Spark jobs must be written in Python. Libraries that rely on C extensions, such as pandas, aren't supported.
2. Launch an Amazon Elastic Compute Cloud (Amazon EC2) Linux instance.
3. Connect to the instance.
4. Run the following commands to install pip and boto3. For more information, see Quickstart.
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
sudo python get-pip.py
sudo pip install boto3
5. Confirm the location of the Python site-packages directory:
python -m site
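The same locations can also be read programmatically with the standard-library site module; a minimal sketch that mirrors the relevant part of the `python -m site` output (standard library only, no other assumptions):

```python
import site
import sys

# Print the interpreter version and its site-packages directories,
# which is where pip-installed libraries such as boto3 end up.
print("Python", sys.version.split()[0])
for path in site.getsitepackages():
    print(path)
```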
6. Package the external library files in a .zip file (unless the library is contained in a single .py file). The .zip file must include an __init__.py file, and the package directory must be at the root of the archive. The __init__.py file can be empty. For more information, see Packages in the Python documentation. Example:
cd /usr/lib/python3.6/site-packages
sudo zip -r -X "/home/ec2-user/site-packages.zip" *
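To check that an archive has the layout AWS Glue expects (the package directory, with its __init__.py, at the root of the .zip), the structure can be reproduced and inspected with the standard-library zipfile module. This is a sketch only; the package name mylib and the temporary paths are placeholders:

```python
import os
import tempfile
import zipfile

# Build a tiny example package: a "mylib" directory containing an
# empty __init__.py, which is the minimum Python requires for a package.
tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "mylib")  # "mylib" is a placeholder package name
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()

# Zip it so the package directory sits at the root of the archive,
# matching what the `zip -r -X ... *` command above produces.
zip_path = os.path.join(tmp, "site-packages.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.write(pkg, arcname="mylib")
    zf.write(os.path.join(pkg, "__init__.py"), arcname="mylib/__init__.py")

# Inspect the archive: 'mylib/__init__.py' should appear at the root.
with zipfile.ZipFile(zip_path) as zf:
    names = zf.namelist()
print(names)
```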
7. Upload the package to Amazon S3:
aws s3 cp /home/ec2-user/site-packages.zip s3://awsexamplebucket/
To use an external library in a development endpoint:
1. Package the library and upload the file to Amazon S3, as explained previously.
2. Create the development endpoint. For Python library path, enter the Amazon S3 path for the package. For more information, see Loading Python Libraries in a Development Endpoint.
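For a Spark ETL job (as opposed to a development endpoint), the uploaded .zip is typically passed through the --extra-py-files special job parameter. A minimal sketch of where that Amazon S3 path goes when defining a job with boto3; the job name, IAM role, and script location below are illustrative placeholders, not values from this article:

```python
# Placeholder names: "example-job", the role, and the script path are illustrative.
job_args = {
    "Name": "example-job",
    "Role": "AWSGlueServiceRoleExample",
    "Command": {
        "Name": "glueetl",  # Spark ETL job type
        "ScriptLocation": "s3://awsexamplebucket/script.py",
    },
    # AWS Glue adds the .zip files listed in --extra-py-files to the
    # job's Python path, so the packaged library can be imported.
    "DefaultArguments": {
        "--extra-py-files": "s3://awsexamplebucket/site-packages.zip"
    },
}
print(job_args["DefaultArguments"]["--extra-py-files"])

# With AWS credentials configured, the job could then be created with:
# import boto3
# boto3.client("glue").create_job(**job_args)
```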