How do I use external Python libraries in my AWS Glue 2.0 ETL job?


I want to use external Python libraries in an AWS Glue 2.0 extract, transform, and load (ETL) job.

Resolution

With AWS Glue version 2.0, you can install additional Python modules or different versions at the job level. To add a new module or change the version of an existing module, use the --additional-python-modules job parameter key with a value containing a list of comma-separated Python modules. This allows your AWS Glue 2.0 ETL job to install the additional modules using the Python package installer (pip3).

To install an additional Python module for your AWS Glue job:

  1. Open the AWS Glue console.
  2. In the navigation pane, choose Jobs.
  3. Select the job where you want to add the Python module.
  4. Choose Actions, and then choose Edit job.
  5. Expand the Security configuration, script libraries, and job parameters (optional) section.
  6. Under Job parameters, do the following:
    For Key, enter --additional-python-modules.
    For Value, enter pymysql==1.0.2, s3://aws-glue-add-modules/nltk-3.6.2-py3-none-any.whl.
  7. Choose Save.

These steps provide an example for installing two different modules:

  • PyMySQL through the internet
  • Natural Language Toolkit (NLTK) from a wheel file on Amazon Simple Storage Service (Amazon S3)
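
The value entered in step 6 can also be assembled programmatically. The following is a minimal sketch, not an official AWS helper: the function name build_additional_modules_argument is hypothetical, and joining the entries with commas and no surrounding spaces is an assumption made here to keep the value unambiguous.

```python
from typing import Dict, Iterable


def build_additional_modules_argument(modules: Iterable[str]) -> Dict[str, str]:
    """Return a job-parameter mapping for --additional-python-modules.

    Each entry is either a pip requirement (for example "pymysql==1.0.2")
    or the Amazon S3 path of a wheel file. This helper is hypothetical and
    only builds the string; it does not call AWS.
    """
    # Drop surrounding whitespace and skip empty entries.
    cleaned = [m.strip() for m in modules if m and m.strip()]
    if not cleaned:
        raise ValueError("at least one module or wheel path is required")
    # Assumption: a comma-separated value without spaces.
    return {"--additional-python-modules": ",".join(cleaned)}


job_parameters = build_additional_modules_argument(
    [
        "pymysql==1.0.2",
        "s3://aws-glue-add-modules/nltk-3.6.2-py3-none-any.whl",
    ]
)
print(job_parameters["--additional-python-modules"])
```

A mapping like this corresponds to the Key and Value pair entered under Job parameters in the console.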

Installing a new module or updating an existing module requires downloading module-related dependencies. This means that you must have internet access to complete either of these tasks. If you don't have internet access, then see Building Python modules from a wheel for Spark ETL workloads using AWS Glue 2.0.

For the list of additional Python modules that are already provided in AWS Glue 2.0, see Python modules already provided in AWS Glue version 2.0.

Libraries and extension modules written in C are also supported by AWS Glue 2.0 with the --additional-python-modules option. However, a subset of Python modules, such as spacy and grpc, require root permissions to install. Without root permissions, the compilation of these modules fails during installation. AWS Glue doesn't provide root access during package installation. The solution is to precompile the binaries into a wheel compatible with AWS Glue and install that wheel.
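
Part of a wheel's compatibility can be read from its file name, which follows the standard wheel naming convention of distribution, version, optional build tag, Python tag, ABI tag, and platform tag. The following sketch splits a file name into those parts so that a pure-Python wheel can be told apart from a compiled one. The names WheelTags, parse_wheel_filename, and is_pure_python are hypothetical, and the check doesn't prove that a wheel installs on AWS Glue.

```python
from typing import NamedTuple, Optional


class WheelTags(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(path: str) -> WheelTags:
    """Split a wheel file name (or an S3 path ending in one) into its tags."""
    name = path.rsplit("/", 1)[-1]  # keep only the file name
    if not name.endswith(".whl"):
        raise ValueError(f"not a wheel file: {name}")
    parts = name[: -len(".whl")].split("-")
    if len(parts) == 5:
        dist, version, py, abi, plat = parts
        build = None
    elif len(parts) == 6:
        dist, version, build, py, abi, plat = parts
    else:
        raise ValueError(f"unexpected wheel file name: {name}")
    return WheelTags(dist, version, build, py, abi, plat)


def is_pure_python(tags: WheelTags) -> bool:
    """A wheel tagged none/any contains no compiled, platform-specific code."""
    return tags.abi_tag == "none" and tags.platform_tag == "any"


compiled = parse_wheel_filename("grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl")
pure = parse_wheel_filename("s3://aws-glue-add-modules/nltk-3.6.2-py3-none-any.whl")
print(compiled.python_tag, compiled.platform_tag, is_pure_python(compiled))
print(pure.python_tag, pure.platform_tag, is_pure_python(pure))
```

For a compiled wheel, the Python tag and platform tag must match the interpreter and architecture of the environment where the wheel is installed.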

To compile a library in a C-based language, the compiler must be compatible with the target operating system and processor architecture. If the library is compiled against a different operating system or processor architecture, then the wheel isn't installed in AWS Glue. Because AWS Glue is a managed service, cluster access isn't available to develop these dependencies. To precompile the C-based Python module that requires root permissions, do the following:

Note: These steps provide an example for installing the grpcio module.

1.    Launch an Amazon Elastic Compute Cloud (Amazon EC2) Linux instance with enough volume space for your libraries.

2.    Install Docker on the instance, set up non-sudo access, and then start Docker.

sudo yum install docker -y
sudo usermod -a -G docker ec2-user
sudo service docker start

3.    Create a file named dockerfile_grpcio, and then copy the following into the file:

# Base for AWS Glue
FROM amazonlinux
RUN yum update -y
RUN yum install shadow-utils.x86_64 -y
RUN yum install -y java-1.8.0-openjdk.x86_64
RUN yum install -y python3
RUN yum install -y cython doxygen numpy scipy gcc autoconf automake libtool zlib-devel openssl-devel maven wget protobuf-compiler cmake make gcc-c++
# Additional components needed for grpcio
WORKDIR /root
RUN yum install python3-devel -y
RUN yum install python-devel -y
RUN pip3 install wheel
# Install grpcio and related modules
RUN pip3 install Cython
RUN pip3 install cmake scikit-build
RUN pip3 install grpcio
# Create a directory for the wheel
RUN mkdir wheel_dir
# Create the wheel
RUN pip3 wheel grpcio -w wheel_dir

4.    Restart the Docker daemon, and then run docker build to build an image from your Dockerfile:

$ sudo service docker restart
$ docker build -f dockerfile_grpcio .

When the docker build command completes, a success message displays with your Docker image ID. For example, "Successfully built 1111222233334444." Note the Docker image ID to use in the next step.

5.    Extract the wheel file from the Docker container. Run the following commands to extract the .whl file:

# Get the docker image ID
$ docker image ls

# Run the container
$ docker run -dit 1111222233334444

# Verify the location of the wheel file and retrieve the name of the wheel file
$ docker exec -t -i 5555666677778888 ls /root/wheel_dir/

# Copy the wheel out of docker to EC2
$ docker cp 5555666677778888:/root/wheel_dir/doc-example-wheel .

Be sure to replace the following values in the preceding commands:

  • 1111222233334444 with the Docker image ID
  • 5555666677778888 with the container ID
  • doc-example-wheel with the name of the generated wheel file

6.    Upload the wheel to Amazon S3 by running the following commands:

aws s3 cp doc-example-wheel s3://path/to/wheel/
aws s3 cp grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl s3://aws-glue-add-modules/grpcio/

Be sure to replace the following values in the preceding commands:

  • doc-example-wheel with the name of the generated wheel file
  • grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl with the name of the Python package file

7.    For the AWS Glue ETL job, in the AWS Glue console, under Job parameters, do the following:
    For Key, enter --additional-python-modules.
    For Value, enter s3://aws-glue-add-modules/grpcio/grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl.
Note: Be sure to replace grpcio-1.32.0-cp37-cp37m-linux_x86_64.whl with the name of the Python package file.
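
After the job parameter is set, a short check at the top of the job script can confirm that the added modules are importable before the ETL logic runs. This is a minimal sketch that uses only the Python standard library. The function name missing_modules is hypothetical. Note that the import name can differ from the package name: the grpcio package is imported as grpc, and PyMySQL is imported as pymysql.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(import_names: Iterable[str]) -> List[str]:
    """Return the top-level import names that can't be found.

    find_spec locates a module without importing it, so the check is cheap.
    Pass import names (for example "grpc"), not package names ("grpcio").
    """
    return [name for name in import_names if importlib.util.find_spec(name) is None]


# "json" ships with Python; the second name is deliberately nonexistent.
not_found = missing_modules(["json", "no_such_module_for_glue_example"])
print(not_found)
```

In a job script, the list would instead hold the import names of the modules passed in --additional-python-modules, and a non-empty result could be raised as an error early.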

Important: AWS Glue versions 0.9 and 1.0 don't support Python modules written in C. To install an external Python library in AWS Glue 0.9 and 1.0, see How do I use external Python libraries in my AWS Glue 1.0 or 0.9 ETL job?


Related information

Using Python libraries with AWS Glue

AWS OFFICIAL, updated 2 years ago