How do I use external Python libraries in an AWS Glue job?
Last updated: 2019-12-09
I want to use an external Python library in an AWS Glue job.
AWS Glue supports two job types: Apache Spark and Python shell.
Note: Libraries and extension modules for Spark jobs must be written in pure Python. Libraries that rely on C extensions, such as pandas, are not supported.
To use an external library in a Spark ETL job:
- Package the library files in a .zip file (unless the library is contained in a single .py file). The .zip file must include an __init__.py file, and the package directory must be at the root of the archive.
- Upload the package to Amazon Simple Storage Service (Amazon S3).
- Use the library in a job or job run by passing the Amazon S3 path of the package in the --extra-py-files job parameter (the Python library path field in the console).
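The steps above can be sketched in Python. This is a minimal, hedged example: the package name mylib, the bucket, role, and script names are all hypothetical placeholders, and the boto3 calls (s3.upload_file, glue.create_job) require valid AWS credentials. The packaging helper keeps the package directory at the root of the archive, as Glue requires.

```python
import zipfile
from pathlib import Path


def package_library(package_dir: str, zip_path: str) -> str:
    """Zip a Python package so the package directory (containing
    __init__.py) sits at the root of the archive, e.g. mylib/__init__.py."""
    root = Path(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Archive names start with the package directory itself.
                arcname = f"{root.name}/{path.relative_to(root).as_posix()}"
                zf.write(path, arcname)
    return zip_path


def upload_and_create_job(zip_path, bucket, key, job_name, role_arn, script_location):
    """Upload the archive to Amazon S3, then create a Spark ETL job that
    loads it through the --extra-py-files special parameter.
    Requires AWS credentials; all names here are placeholders."""
    import boto3  # AWS SDK for Python

    boto3.client("s3").upload_file(zip_path, bucket, key)
    boto3.client("glue").create_job(
        Name=job_name,
        Role=role_arn,
        Command={"Name": "glueetl", "ScriptLocation": script_location},
        DefaultArguments={"--extra-py-files": f"s3://{bucket}/{key}"},
    )
```

If the job already exists, the same S3 path can instead be set per run in the job-run arguments.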
To use an external library in a development endpoint:
- Package the library and upload the file to Amazon S3, as explained previously.
- Create the development endpoint. For Python library path, enter the Amazon S3 path for the package. For more information, see Loading Python Libraries in a Development Endpoint.
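As a sketch, the console's Python library path field corresponds to the ExtraPythonLibsS3Path parameter of the CreateDevEndpoint API. The endpoint name, role, and S3 path below are hypothetical, and the actual call requires AWS credentials:

```python
def dev_endpoint_request(name: str, role_arn: str, libs_s3_path: str) -> dict:
    """Build a CreateDevEndpoint request that points the endpoint's
    Python library path at the packaged archive on Amazon S3.
    ExtraPythonLibsS3Path is the API counterpart of the console's
    "Python library path" field."""
    return {
        "EndpointName": name,
        "RoleArn": role_arn,
        "ExtraPythonLibsS3Path": libs_s3_path,
    }


# To create the endpoint (requires AWS credentials):
# import boto3
# boto3.client("glue").create_dev_endpoint(**dev_endpoint_request(
#     "my-endpoint", "arn:aws:iam::123456789012:role/GlueRole",
#     "s3://my-bucket/libs/mylib.zip"))
```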
Python shell jobs
- Python shell jobs support a variety of libraries, including pandas. For a complete list, see Supported Libraries for Python Shell Jobs.
- Libraries must be packaged in an .egg archive. If you try to use a .zip file, you get this error: "ImportError: No module named module_name."
- For more information and an example of a Python shell job, see Providing Your Own Python Library.
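Putting the pieces together: an .egg can be built with setuptools (python setup.py bdist_egg) and uploaded to Amazon S3, and the Python shell job then receives its path through the same --extra-py-files parameter. The sketch below only builds the CreateJob request; the job name, role, and S3 paths are hypothetical, and the actual call requires AWS credentials:

```python
def python_shell_job_request(job_name: str, role_arn: str,
                             script_location: str, egg_s3_path: str) -> dict:
    """Build a CreateJob request for a Python shell job that loads an
    .egg archive from Amazon S3. Python shell jobs use the Command name
    "pythonshell" (Spark jobs use "glueetl"), but the library path is
    passed through the same --extra-py-files special parameter."""
    return {
        "Name": job_name,
        "Role": role_arn,
        "Command": {"Name": "pythonshell", "ScriptLocation": script_location},
        "DefaultArguments": {"--extra-py-files": egg_s3_path},
    }


# To create the job (requires AWS credentials):
# import boto3
# boto3.client("glue").create_job(**python_shell_job_request(
#     "my-shell-job", "arn:aws:iam::123456789012:role/GlueRole",
#     "s3://my-bucket/scripts/job.py", "s3://my-bucket/libs/mylib-0.1-py3.6.egg"))
```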