AWS Open Source Blog
Amazon MWAA with AWS CodeArtifact for Python dependencies
This post was written by Dzenan Softic and Sam Dengler.
Many organizations rely on Apache Airflow, an open source project, to orchestrate their data pipelines. In 2020, Amazon Web Services (AWS) released Amazon Managed Workflows for Apache Airflow (Amazon MWAA), which lets engineers focus on business solutions rather than on running and maintaining infrastructure for Airflow. Apache Airflow is written in Python, letting developers use its rich ecosystem of libraries or even write their own.
Development teams commonly create in-house libraries and host them in private repositories. AWS CodeArtifact is a fully managed software artifact repository service that makes it easier to store, publish, and share packages securely. CodeArtifact can also connect to a public repository, such as PyPI, so that you can consume open source libraries.
In this post, we demonstrate how to use a CodeArtifact repository with Apache Airflow. We focus on Amazon MWAA, but the same approach can be applied to self-hosted Apache Airflow on AWS.
Solution overview
Amazon MWAA is deployed to private subnets across two Availability Zones. In this example solution, Amazon MWAA has no internet access and uses VPC endpoints to communicate with other AWS services. Amazon MWAA fetches directed acyclic graphs (DAGs) and a requirements file from an Amazon Simple Storage Service (Amazon S3) bucket. It connects to an AWS CodeArtifact private repository to install the required Python packages. This repository is configured with an external connection to the public PyPI repository, which lets it fetch open source packages.
To connect to CodeArtifact, the index-url is constructed from the repository URL and an authorization token. Because a CodeArtifact authorization token is valid for a maximum of 12 hours, we need a way to refresh the token automatically. We use an AWS Lambda function to obtain a new authorization token and update the index-url, and trigger it to run every 10 hours using Amazon CloudWatch Events. During initial infrastructure provisioning, the Lambda function is invoked via an AWS CloudFormation custom resource.
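The token refresh described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Lambda code from the project: the function names, the S3 object key, and the choice to write the pip option to an S3 object are hypothetical. The two client arguments are expected to behave like the boto3 `codeartifact` and `s3` clients.

```python
from urllib.parse import urlsplit, urlunsplit


def build_index_url(repository_endpoint: str, auth_token: str) -> str:
    """Embed the token in the repository endpoint and point pip at /simple/."""
    parts = urlsplit(repository_endpoint)
    # CodeArtifact expects the fixed user name "aws" with the token as password.
    netloc = f"aws:{auth_token}@{parts.netloc}"
    path = parts.path.rstrip("/") + "/simple/"
    return urlunsplit((parts.scheme, netloc, path, "", ""))


def refresh_index_url(codeartifact, s3, *, domain, domain_owner, repository,
                      bucket, key):
    """Fetch a fresh token and store the resulting pip option in S3.

    `bucket` and `key` are assumptions made for this sketch.
    """
    token = codeartifact.get_authorization_token(
        domain=domain,
        domainOwner=domain_owner,
        durationSeconds=43200,  # 12 hours, the maximum token lifetime
    )["authorizationToken"]
    endpoint = codeartifact.get_repository_endpoint(
        domain=domain,
        domainOwner=domain_owner,
        repository=repository,
        format="pypi",
    )["repositoryEndpoint"]
    index_url = build_index_url(endpoint, token)
    # One line that pip can read from a requirements-style file.
    body = f"--index-url {index_url}\n".encode("utf-8")
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return index_url
```

In a Lambda handler, the clients would typically be created with `boto3.client("codeartifact")` and `boto3.client("s3")`. Running the refresh every 10 hours leaves a two-hour margin before the 12-hour token expires.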
This architecture does not require Amazon MWAA to have access to the public internet to fetch libraries from PyPI, so we don’t need to provision a pair of NAT gateways in our VPC. It also means that we can use a single private repository for both in-house and public open source libraries.
Walkthrough
You can deploy this solution from a local machine.
Prerequisites
- An AWS account
- npm package manager
- AWS Command Line Interface (AWS CLI)
- AWS Cloud Development Kit (AWS CDK) version 1.102.0
- Python version 3.6 or higher
Project setup and deployment
To get started, clone the GitHub repository to a local machine:
This repository contains multiple projects, so we must navigate to the correct folder:
Create a Python virtual environment:
This rule will create a virtual environment in infra/venv and install all required dependencies for the project. Before we can deploy, we must set environment variables in .env for AWS CDK. Edit the .env file with an AWS Region of your choice and a unique Amazon S3 bucket name:
You can choose between two supported versions of Apache Airflow on Amazon MWAA: 1.10.12 or 2.0.2.
We are now ready to deploy. To do that, run:
The AWS CDK CLI will ask for permission to deploy specific resources, so acknowledge by typing y in your terminal and pressing Enter. Deployment can take up to 30 minutes. You can track the deployment status via the CLI or in the AWS Management Console.
Once deployment has finished, we can check whether the provisioned Amazon MWAA environment successfully connected to the CodeArtifact repository to install the packages listed in requirements.txt.
If you look more closely at requirements.txt, the first line points to codeartifact.txt, which should contain the correct --index-url for a private PyPI repository in CodeArtifact. It tells pip to install packages from the CodeArtifact repository (in this case, the numpy library). The Lambda function generated the --index-url during the deployment phase, and will update it with a new authorization token every 10 hours:
Navigate to Amazon MWAA in the AWS Management Console and open the mwaa_codeartifact_env environment that we provisioned. We will now inspect Airflow scheduler logs to confirm that it connected to the CodeArtifact repository to install numpy. Navigate to Monitoring and open the Airflow scheduler log group.
From the scheduler logs, we can observe that it connected to the CodeArtifact repository with the authorization token to download and install numpy. You can also open the Airflow UI from the AWS Management Console and try to run example_dag, which prints the numpy array.
You can also navigate to CodeArtifact to verify that the numpy package has been fetched and is available in the repository.
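The same check could also be done programmatically. The sketch below is only an illustration: the function name is hypothetical, and the `codeartifact` argument is assumed to behave like the boto3 `codeartifact` client, whose `list_package_versions` call returns the versions stored for a package.

```python
def package_versions(codeartifact, *, domain, domain_owner, repository, package):
    """Return every version of `package` stored in the repository."""
    request = {
        "domain": domain,
        "domainOwner": domain_owner,
        "repository": repository,
        "format": "pypi",
        "package": package,
    }
    versions = []
    while True:
        page = codeartifact.list_package_versions(**request)
        versions.extend(item["version"] for item in page.get("versions", []))
        next_token = page.get("nextToken")
        if not next_token:
            return versions
        # Ask for the next page of results.
        request["nextToken"] = next_token
```

Calling this with `package="numpy"` after the environment has been created should return a non-empty list if the package was pulled through the external connection. If the package has never been requested, the real client is expected to raise an error rather than return an empty page, so a caller may want to handle that case.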
Add new Python dependencies
To install additional Python dependencies in an Amazon MWAA environment, update requirements.txt. For the changes to take effect, you must upload requirements.txt to the Amazon S3 bucket and update the Amazon MWAA environment to use the new file version. You can do this in the AWS Management Console or via the AWS CLI.
Add a library of your choice and run the following to upload requirements.txt to Amazon S3:
To get requirements.txt versions, run:
Finally, update the Amazon MWAA environment with the latest version:
If you build your own Python packages, you can publish those to the same CodeArtifact repository and update the Amazon MWAA environment as a part of a release pipeline.
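As a rough sketch of how such a pipeline step might script the upload and environment update described above, consider the following Python function. It is an assumption-laden illustration, not part of the project: the function name is hypothetical, the `s3` and `mwaa` arguments are assumed to behave like the boto3 `s3` and `mwaa` clients, and the S3 bucket is assumed to have versioning enabled.

```python
def publish_requirements(s3, mwaa, *, bucket, key, environment_name,
                         requirements_text):
    """Upload requirements.txt and point the environment at the new version."""
    response = s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=requirements_text.encode("utf-8"),
    )
    # On a versioned bucket, the upload response carries the new version ID.
    version_id = response.get("VersionId")
    if version_id is None:
        raise RuntimeError("No VersionId returned; is bucket versioning enabled?")
    mwaa.update_environment(
        Name=environment_name,
        RequirementsS3ObjectVersion=version_id,
    )
    return version_id
```

This variant reads the version ID from the upload response instead of listing object versions in a separate step. With real clients it might be called with `environment_name="mwaa_codeartifact_env"` and the key under which requirements.txt is stored in your bucket.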
Cleaning up
Once you are finished exploring this solution, you can clean up the account to avoid unnecessary cost. To delete all resources associated with this blog post, run the following command:
Conclusion
In this post, we demonstrated how to integrate Amazon MWAA with AWS CodeArtifact for Python dependencies.
We created a private CodeArtifact repository that can be used for both in-house and public libraries. We also experimented with VPC endpoints, AWS Lambda, and Amazon CloudWatch Events.
Finally, we deployed the infrastructure with AWS CDK.
You can find the source code from this post on GitHub and use it as a basis to build your own solution. If you have any questions or suggestions, please comment on the blog or open an issue in the GitHub repository.