AWS Big Data Blog
Access private code repositories for installing Python dependencies on Amazon MWAA
Customers who use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) often need Python dependencies that are hosted in private code repositories. Many customers opt for public network access mode for its ease of use and ability to make outbound internet requests, all while maintaining secure access. However, private code repositories may not be accessible over the internet. It’s also a best practice to install Python dependencies only where they are needed. You can use Amazon MWAA startup scripts to selectively install the Python dependencies required for running code on workers, while avoiding issues due to web server restrictions.
This post demonstrates a method to selectively install Python dependencies based on the Amazon MWAA component type (web server, scheduler, or worker) from a Git repository only accessible from your virtual private cloud (VPC).
Solution overview
This solution focuses on using a private Git repository to selectively install Python dependencies, although you can use the same pattern demonstrated in this post with private Python package indexes such as AWS CodeArtifact. For more information, refer to Amazon MWAA with AWS CodeArtifact for Python dependencies.
The Amazon MWAA architecture allows you to choose a web server access mode to control whether the web server is accessible from the internet or only from your VPC. You can also control whether your workers, scheduler, and web servers have access to the internet through your customer VPC configuration. In this post, we demonstrate an environment such as the one shown in the following diagram, where the environment is using public network access mode for the web servers, and the Apache Airflow workers and schedulers don’t have a route to the internet from your VPC.
There are four possible networking configurations for an Amazon MWAA environment:
- Public routing and public web server access mode
- Private routing and public web server access mode (pictured in the preceding diagram)
- Public routing and private web server access mode
- Private routing and private web server access mode
We focus on one networking configuration for this post, but the fundamental concepts are applicable for any networking configuration.
The solution we walk through relies on the fact that Amazon MWAA runs a startup script (startup.sh) during startup on every individual Apache Airflow component (worker, scheduler, and web server) before installing requirements (requirements.txt) and initializing the Apache Airflow process. This startup script is used to set an environment variable, which is then referenced in the requirements.txt file to selectively install libraries.
The following steps allow us to accomplish this:
- Create and install the startup script (startup.sh) in the Amazon MWAA environment. This script sets the environment variable for selectively installing dependencies.
- Create and install global Python dependencies (requirements.txt) in the Amazon MWAA environment. This file contains the global dependencies required by all Amazon MWAA components.
- Create and install component-specific Python dependencies in the Amazon MWAA environment. This step involves creating separate requirements files for each component type (worker, scheduler, web server) to selectively install the necessary dependencies.
Prerequisites
For this walkthrough, you should have the following prerequisites:
- An AWS account
- An Amazon MWAA environment deployed with public access mode for the web server
- Versioning enabled for your Amazon MWAA environment’s Amazon Simple Storage Service (Amazon S3) bucket
- Amazon CloudWatch logging enabled at the INFO level for worker and web server
- A Git repository accessible from within your VPC
Additionally, we upload a sample Python package to the Git repository:
Create and install the startup script in the Amazon MWAA environment
Create the startup.sh file using the following example code:
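A minimal sketch of such a startup script, written here as a heredoc that creates startup.sh locally. It assumes the MWAA_AIRFLOW_COMPONENT variable that Amazon MWAA exposes to startup scripts (scheduler, worker, or webserver). The variable name EXTENDED_REQUIREMENTS is an illustrative choice for this sketch, not a name required by Amazon MWAA.

```shell
# Write startup.sh; the quoted 'EOF' keeps ${...} literal so it is
# expanded when MWAA runs the script, not when this file is created.
cat > startup.sh <<'EOF'
#!/bin/sh
# MWAA_AIRFLOW_COMPONENT is set by Amazon MWAA to scheduler, worker, or webserver.
echo "Apache Airflow component: ${MWAA_AIRFLOW_COMPONENT:-unknown}"

# EXTENDED_REQUIREMENTS is an assumed name; it selects the
# component-specific requirements file for this container.
case "${MWAA_AIRFLOW_COMPONENT:-}" in
  webserver) export EXTENDED_REQUIREMENTS="webserver_reqs.txt" ;;
  worker)    export EXTENDED_REQUIREMENTS="worker_reqs.txt" ;;
  scheduler) export EXTENDED_REQUIREMENTS="scheduler_reqs.txt" ;;
  *)         echo "Unrecognized component; EXTENDED_REQUIREMENTS left unset" ;;
esac

echo "EXTENDED_REQUIREMENTS=${EXTENDED_REQUIREMENTS:-}"
EOF
```

The two echo statements are there only so that the selected value appears in the component's console log.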
Upload startup.sh to the S3 bucket for your Amazon MWAA environment:
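One way to do the upload with the AWS CLI is sketched below. The bucket name my-mwaa-bucket is a placeholder, and the sketch assumes the environment's startup script path is configured as startup.sh at the root of the bucket.

```shell
# Placeholder bucket name; replace with your environment's S3 bucket.
MWAA_BUCKET="my-mwaa-bucket"

# Copy the local startup.sh to the root of the environment bucket.
upload_startup_script() {
  aws s3 cp startup.sh "s3://${MWAA_BUCKET}/startup.sh"
}
```

Run upload_startup_script from the directory that contains startup.sh.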
Browse the CloudWatch log streams for your workers and view the worker_console log. Notice the startup script is now running and setting the environment variable.
Create and install global Python dependencies in the Amazon MWAA environment
Your requirements file must include a --constraint statement to make sure the packages listed in your requirements are compatible with the version of Apache Airflow you are using. The statement beginning with -r references the environment variable you set in your startup.sh script based on the component type.
The following code is an example of the requirements.txt file:
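A sketch of what this file might contain, again created through a heredoc. It assumes Apache Airflow 2.8.1 on Python 3.11 for the constraints URL (substitute your environment's versions), the illustrative variable name EXTENDED_REQUIREMENTS set by the startup script, and that the DAGs folder is available at /usr/local/airflow/dags on each component. pip expands ${VAR} references inside requirements files, which is what makes the -r line component-specific.

```shell
# The quoted 'EOF' keeps ${EXTENDED_REQUIREMENTS} literal in the file;
# pip expands it at install time from the environment set by startup.sh.
cat > requirements.txt <<'EOF'
--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.8.1/constraints-3.11.txt"
-r /usr/local/airflow/dags/${EXTENDED_REQUIREMENTS}
EOF
```

Any dependencies that every component needs can be listed after these two lines.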
Upload the requirements.txt file to the Amazon MWAA environment S3 bucket:
Create and install component-specific Python dependencies in the Amazon MWAA environment
For this example, we want to install the Python package scrapy on workers and schedulers from our private Git repository. We also want to install pprintpp on the web server from the public Python package index. To accomplish that, we need to create the following files (we provide example code):
- webserver_reqs.txt
- worker_reqs.txt
- scheduler_reqs.txt
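The three files could look like the following sketch. The Git URL git.example.internal/mwaa/scrapy.git is a hypothetical stand-in for your private repository, and the `name @ git+https://...` form is pip's direct-reference syntax for installing from version control.

```shell
# Web server: a package from the public index only.
cat > webserver_reqs.txt <<'EOF'
pprintpp
EOF

# Worker: scrapy from the private Git repository (hypothetical URL).
cat > worker_reqs.txt <<'EOF'
scrapy @ git+https://git.example.internal/mwaa/scrapy.git
EOF

# Scheduler: same dependency as the worker in this example.
cp worker_reqs.txt scheduler_reqs.txt
```

Keeping the private repository out of webserver_reqs.txt means the web server never attempts to reach a host it has no route to.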
Upload webserver_reqs.txt, scheduler_reqs.txt, and worker_reqs.txt to the DAGs folder for the Amazon MWAA environment:
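A possible way to script this upload with the AWS CLI, assuming a placeholder bucket named my-mwaa-bucket and a DAGs folder configured at the dags/ prefix:

```shell
# Placeholder bucket name; replace with your environment's S3 bucket.
MWAA_BUCKET="my-mwaa-bucket"

# Copy each component-specific requirements file into the DAGs prefix,
# stopping at the first failed upload.
upload_component_requirements() {
  for f in webserver_reqs.txt scheduler_reqs.txt worker_reqs.txt; do
    aws s3 cp "$f" "s3://${MWAA_BUCKET}/dags/$f" || return 1
  done
}
```

Run upload_component_requirements from the directory that contains the three files.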
Update the environment for the new requirements file and observe the results
Get the latest object version for the requirements file:
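With versioning enabled on the bucket, the latest version ID can be read with the AWS CLI. The sketch below assumes a placeholder bucket name and that requirements.txt is the only object key starting with that prefix.

```shell
# Placeholder bucket name; replace with your environment's S3 bucket.
MWAA_BUCKET="my-mwaa-bucket"

# Print the version ID of the current (latest) requirements.txt object.
latest_requirements_version() {
  aws s3api list-object-versions \
    --bucket "${MWAA_BUCKET}" \
    --prefix requirements.txt \
    --query 'Versions[?IsLatest].[VersionId]' \
    --output text
}
```

The output can be captured for the next step, for example with `version=$(latest_requirements_version)`.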
Update the Amazon MWAA environment to use the new requirements.txt file:
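The update can be sketched with the AWS CLI as follows. The environment name my-mwaa-environment is a placeholder, and the function expects the S3 object version ID of requirements.txt as its first argument.

```shell
# Placeholder environment name; replace with your Amazon MWAA environment.
MWAA_ENV_NAME="my-mwaa-environment"

# Point the environment at a specific version of requirements.txt.
# $1: S3 object version ID of the requirements file.
update_requirements_version() {
  aws mwaa update-environment \
    --name "${MWAA_ENV_NAME}" \
    --requirements-s3-path requirements.txt \
    --requirements-s3-object-version "$1"
}
```

Updating the requirements version triggers an environment update, so expect the environment to be unavailable for changes until the update completes.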
Browse the CloudWatch log streams for your workers and view the requirements_install log. Notice that the component-specific requirements file is now being resolved and its dependencies installed.
Conclusion
In this post, we demonstrated a method to selectively install Python dependencies based on the Amazon MWAA component type (web server, scheduler, or worker) from a Git repository only accessible from your VPC.
We hope this post provided you with a better understanding of how startup scripts and Python dependency management work in an Amazon MWAA environment. You can implement other variations and configurations using the concepts outlined in this post, depending on your specific network setup and requirements.
About the Author
Tim Wilhoit is a Sr. Solutions Architect for the Department of Defense at AWS. Tim has over 20 years of enterprise IT experience. His areas of interest are serverless computing and ML/AI. In his spare time, Tim enjoys spending time at the lake and rooting on the Oklahoma State Cowboys. Go Pokes!