How do I use external Python libraries in an AWS Glue job?

Last updated: 2019-12-09

I want to use an external Python library in an AWS Glue job.

Resolution

AWS Glue supports two job types: Apache Spark and Python shell.

Spark jobs

Note: Libraries and extension modules for Spark jobs must be written in pure Python. Libraries that rely on C extensions, such as pandas, are not supported.

To use an external library in a Spark ETL job:

  1. Package the library files in a .zip file (unless the library consists of a single .py file). The .zip file must include an __init__.py file, and the package directory must be at the root of the archive.
  2. Upload the package to Amazon Simple Storage Service (Amazon S3).
  3. Use the library in a job or job run by setting the --extra-py-files special parameter to the Amazon S3 path of the .zip file, as shown in the sketch after this list.
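
The following is a minimal sketch of steps 1-3 using boto3. The archive layout and all names (my_library, my-bucket, MyGlueServiceRole, the script path) are placeholders; substitute your own values.

import boto3

# Expected .zip layout, with the package directory at the root:
#   my_library.zip
#     my_library/
#       __init__.py
#       helpers.py

# Step 2: upload the package to Amazon S3
s3 = boto3.client("s3")
s3.upload_file("my_library.zip", "my-bucket", "libs/my_library.zip")

# Step 3: create a Spark ETL job that loads the package
glue = boto3.client("glue")
glue.create_job(
    Name="spark-job-with-external-lib",
    Role="MyGlueServiceRole",
    Command={
        "Name": "glueetl",  # Spark ETL job type
        "ScriptLocation": "s3://my-bucket/scripts/my_script.py",
    },
    DefaultArguments={
        # Makes the .zip available to the job's Python interpreter
        "--extra-py-files": "s3://my-bucket/libs/my_library.zip",
    },
)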

To use an external library in a development endpoint:

  1. Package the library and upload the file to Amazon S3, as explained previously.
  2. Create the development endpoint. For Python library path, enter the Amazon S3 path for the package (a sketch using boto3 follows this list). For more information, see Loading Python Libraries in a Development Endpoint.
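
As a rough sketch, you can also create the endpoint with boto3. The endpoint name, role ARN, and S3 path below are placeholders; the ExtraPythonLibsS3Path parameter corresponds to the Python library path field in the console.

import boto3

glue = boto3.client("glue")

glue.create_dev_endpoint(
    EndpointName="dev-endpoint-with-external-lib",  # placeholder name
    RoleArn="arn:aws:iam::123456789012:role/MyGlueServiceRole",  # placeholder
    # Same setting as "Python library path" in the console
    ExtraPythonLibsS3Path="s3://my-bucket/libs/my_library.zip",
)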

Python shell jobs

  • Python shell jobs support a variety of libraries, including pandas. For a complete list, see Supported Libraries for Python Shell Jobs.
  • Libraries must be packaged in an .egg archive. If you try to use a .zip file, you get this error: "ImportError: No module named module_name." For building an .egg with setuptools, see the sketch after this list.
  • For more information and an example of a Python shell job, see Providing Your Own Python Library.
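
To build the .egg, you can use setuptools, as in the following rough sketch. The package name my_library and its directory layout are assumptions; adjust them to match your library.

# setup.py -- assumes your code lives in a my_library/ package
# directory (with an __init__.py) next to this file
from setuptools import setup, find_packages

setup(
    name="my_library",
    version="0.1",
    packages=find_packages(),
)

Running python setup.py bdist_egg creates an .egg file in the dist/ directory. Upload that file to Amazon S3, and then reference its S3 path in the job's Python library path (or the --extra-py-files parameter).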
