Migrating from self-managed Apache Airflow to Amazon Managed Workflows for Apache Airflow (MWAA)
This post was written by Tomas Christ, Solution Architect at eprimo GmbH.
eprimo GmbH is a wholly owned subsidiary of E.ON SE, situated near Frankfurt, Germany. It represents the largest purely green-energy supplier in Germany with some 1.7 million customers. Currently, eprimo has a staff of approximately 160 people. We have been using Amazon Web Services (AWS) since 2016 and run multiple production workloads in more than 40 AWS Control Tower-managed AWS Accounts.
In this blog post, I will describe how we migrated from self-managed Apache Airflow to Amazon Managed Workflows for Apache Airflow (MWAA). Then I will explain how we upgraded to Apache Airflow 2.x. This should provide a guideline with the necessary steps to accomplish the migration.
With only 2 of 160 employees taking care of the entire DevOps and data engineering environment at eprimo, automation has always been key. This was a strong driver for seeking a powerful workflow management platform. We chose Apache Airflow, which we deployed using a self-managed approach back in 2019.
Apache Airflow is an open source workflow management platform that enables us to develop, manage, schedule, and monitor workflows as directed acyclic graphs (DAGs) that are defined in Python source files. An Airflow DAG defines a workflow; at eprimo, these are mostly data pipelines, and each can consist of several steps that are called Tasks in Apache Airflow.
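To make the DAG idea concrete, here is a toy sketch that uses only the Python standard library (graphlib, Python 3.9 or later) rather than the Airflow API. The task names extract, transform, and load are hypothetical. The point is only that declared dependencies determine a valid execution order and that a cyclic definition is rejected.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies):
    """Return one valid run order for tasks.

    `dependencies` maps each task name to the set of tasks that must
    finish before it can start.
    """
    return list(TopologicalSorter(dependencies).static_order())


def is_acyclic(dependencies):
    """Return True if the task graph is a valid DAG (contains no cycle)."""
    try:
        execution_order(dependencies)
    except CycleError:
        return False
    return True


# Hypothetical three-step data pipeline.
PIPELINE = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
```

Calling execution_order(PIPELINE) yields the tasks in dependency order. Airflow applies the same principle when it schedules the Tasks of a DAG.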
Because the availability of workflow orchestration and management platforms is critical, we had to take the necessary steps to ensure the availability of Airflow (for example, by configuring Amazon EC2 Auto Scaling groups). With that in mind, learning that Amazon Managed Workflows for Apache Airflow (MWAA) was announced in late 2020 sounded like great news to us.
The following diagram shows the Apache Airflow environment before migrating to Amazon MWAA. While the classical Apache Airflow on Amazon Elastic Compute Cloud (Amazon EC2) used the Apache Airflow version 1.15, Amazon MWAA uses v1.10.12. Nonetheless, we did not experience any compatibility issues between the different versions.
After we began experimenting with Amazon Managed Workflows for Apache Airflow, the following benefits convinced us to deploy Amazon MWAA:
- Compatibility: Amazon MWAA is the same open source Apache Airflow that one can download. DAGs developed for self-managed Apache Airflow also run on Amazon MWAA.
- Setup, patching, and upgrades: Creating an Apache Airflow environment is simpler. Specify the information that the managed service needs to set up the environment (for example, with an AWS CloudFormation template) and AWS takes care of the rest. With AWS managing the environment, MWAA also significantly simplifies the complexity of patching and upgrading Apache Airflow.
- Availability: Amazon MWAA monitors the workers in its environment, and as demand increases or decreases, Amazon MWAA adds additional worker containers or removes those that are free, respectively.
- Security: With self-managed Apache Airflow, administrators must manage the SSH private keys that allow access to the Amazon EC2 instances where Apache Airflow is installed. Amazon MWAA is integrated with AWS Identity and Access Management (IAM), including role-based authentication and authorization for access to the Airflow user interface. Because our AWS organization is linked to our Azure Active Directory (Azure AD) thanks to AWS Single Sign-On (AWS SSO), we can centrally manage the access to Amazon MWAA and its Amazon CloudWatch dashboards by adding users to Azure AD groups whose structure and policies are actively monitored.
- Monitoring: Before migrating to Amazon MWAA, we monitored our environment using Grafana for the Airflow logs in addition to Amazon CloudWatch for the metrics of the Amazon EC2 instances. With MWAA’s integration with Amazon CloudWatch, we were able to free up resources, unify monitoring, and monitor DAG failures and successes in CloudWatch dashboards.
Migrating to MWAA
eprimo’s approach to creating a new MWAA environment
The following deployment guide lists the steps that we at eprimo took to create the new Amazon MWAA environment. The goal was to replace our three different Apache Airflow environments: one for DAG development/testing, one for DAG production, and one optional environment for testing new Apache Airflow features.
At eprimo, whenever infrastructure with more than one environment is supposed to be set up, we have a policy of using infrastructure as code (IaC) methods. Whenever multiple executions of a step cannot be ruled out, we also strive to implement it programmatically. Thus, the resources were created using CloudFormation stacks. The actual migration process, including testing dozens of production Airflow DAGs, took about a week for two employees.
All steps, including the CloudFormation templates, also can be found on GitHub.
Step 1. Create all necessary resources for Amazon MWAA
First, we needed to create all necessary resources for the MWAA environment. At eprimo, that included the following:
- An Amazon Simple Storage Service (Amazon S3) bucket that contains the Airflow DAGs, hooks, and sensors
- An AWS CodePipeline pipeline that pulls files from an AWS CodeCommit git repository and uploads them to Amazon S3
- An AWS Lambda function that zips the custom hooks and sensors and uploads them to Amazon S3
- IAM roles for MWAA, AWS CodePipeline, and Lambda
- A security group for the MWAA resources
The CloudFormation YAML file for this step is on GitHub.
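The zipping step performed by the Lambda function in the list above can be sketched with the standard library alone. This is a minimal illustration, not the function from the GitHub repository: the function name zip_plugins and the choice to skip __pycache__ folders are assumptions.

```python
import zipfile
from pathlib import Path


def zip_plugins(plugins_dir, zip_path):
    """Pack every file below `plugins_dir` into a ZIP archive at `zip_path`.

    Paths inside the archive are relative to `plugins_dir`, so the archive
    root mirrors the plugins folder. `zip_path` should lie outside
    `plugins_dir`, otherwise the archive would try to include itself.
    Returns the list of archive member names.
    """
    plugins_dir = Path(plugins_dir)
    members = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(plugins_dir.rglob("*")):
            # Skip directories and compiled bytecode caches (an assumption
            # made for this sketch; adjust as needed).
            if not path.is_file() or "__pycache__" in path.parts:
                continue
            arcname = path.relative_to(plugins_dir).as_posix()
            archive.write(path, arcname)
            members.append(arcname)
    return members
```

Inside a Lambda function, the archive would typically be written to /tmp before being uploaded to the Amazon S3 bucket.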
Step 2. Adjust the KMS key policy
The disks of the MWAA resources were to be encrypted at rest with AWS Key Management Service (AWS KMS). To allow encrypted CloudWatch Logs, this KMS key had to be identical to the key that is used to encrypt data stored in the Amazon S3 bucket mentioned in step 1. To achieve this, the KMS key policy had to be changed accordingly:
AWS_ACCOUNT_ID must be replaced with the concrete value for the target environment.
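As a minimal sketch of what such a key policy addition could look like, the following helper builds a statement that follows the pattern AWS documents for letting CloudWatch Logs use a KMS key. It is not necessarily the exact policy we deployed. The Sid value, the function names, and the example Region are assumptions.

```python
import copy
import json


def cloudwatch_logs_statement(account_id, region):
    """Build a KMS key policy statement that grants CloudWatch Logs access.

    Follows the generally documented pattern for encrypting log groups
    with a customer managed key; the Sid is an arbitrary label.
    """
    return {
        "Sid": "AllowCloudWatchLogs",
        "Effect": "Allow",
        "Principal": {"Service": f"logs.{region}.amazonaws.com"},
        "Action": [
            "kms:Encrypt*",
            "kms:Decrypt*",
            "kms:ReEncrypt*",
            "kms:GenerateDataKey*",
            "kms:Describe*",
        ],
        "Resource": "*",
        "Condition": {
            "ArnLike": {
                "kms:EncryptionContext:aws:logs:arn": (
                    f"arn:aws:logs:{region}:{account_id}:*"
                )
            }
        },
    }


def with_statement(policy, statement):
    """Return a copy of `policy` with `statement` added.

    An existing statement with the same Sid is replaced, so the helper
    can be applied repeatedly without creating duplicates.
    """
    updated = copy.deepcopy(policy)
    kept = [s for s in updated.get("Statement", []) if s.get("Sid") != statement["Sid"]]
    updated["Statement"] = kept + [statement]
    return updated


def to_json(policy):
    """Serialize the policy for use in a template or an API call."""
    return json.dumps(policy, indent=2)
```

For example, with_statement(existing_policy, cloudwatch_logs_statement("AWS_ACCOUNT_ID", "eu-central-1")) returns the extended policy, where the Region is only a placeholder.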
Step 3. Create the actual MWAA environment
Finally, we created the MWAA environment with Apache Airflow version 1.10.12. Resource creation took about 20 minutes.
The CloudFormation YAML file for this step is on GitHub.
Update the MWAA environment
Update DAGs and insert new DAGs
Both updating DAGs and inserting new DAGs are done automatically. After the Python source file that represents an Airflow DAG is pushed to a specific git repository hosted on AWS CodeCommit, the CodePipeline will pull the newly pushed file and store it on a specific path on an Amazon S3 bucket, like s3://mybucketname/dags/dagname.py. Amazon MWAA automatically synchronizes the content of the path to its own directory every 30 seconds.
To update or add custom plugins (for example, hooks or sensors) or requirements to the MWAA environment, a manual update of the environment is necessary. This can be done by initiating the following steps:
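The target layout in Amazon S3 can be illustrated with a short sketch. At eprimo the upload is done by CodePipeline, so this is only an assumed manual equivalent. The function names are hypothetical, and s3_client is expected to be a boto3 S3 client (or any object with a compatible upload_file method).

```python
from pathlib import PurePosixPath


def dag_s3_key(repo_path, dags_prefix="dags"):
    """Map a DAG file in the repository to its object key below the DAGs prefix."""
    name = PurePosixPath(repo_path).name
    if not name.endswith(".py"):
        raise ValueError(f"not a Python DAG file: {repo_path}")
    return f"{dags_prefix}/{name}"


def upload_dag(s3_client, bucket, local_path):
    """Upload one DAG file and return the resulting S3 URI.

    `s3_client` is assumed to behave like boto3.client("s3").
    """
    key = dag_s3_key(local_path)
    s3_client.upload_file(Filename=local_path, Bucket=bucket, Key=key)
    return f"s3://{bucket}/{key}"
```

With a real client, upload_dag(boto3.client("s3"), "mybucketname", "dagname.py") would place the file at the path mentioned above.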
- The new hooks must be pushed to the specific repository. Because changes to hooks can involve several files, we have learned that, when taking new plugins to production, copying the entire plugins folder can make sense if the development git repository differs from the production git repository and a git merge is not possible.
- git push starts the AWS CodePipeline that calls a Lambda function, which generates a ZIP file of the plugins folder and saves it to the Amazon S3 location that contains the DAG plugins. Once this is done successfully, you can begin to update the MWAA environment.
- In the AWS Management Console, select the right MWAA environment and choose Edit.
- The corresponding version can be selected at Plugins file. By default, MWAA always uses the previously deployed version when updating, so the latest version must be selected here manually. Paying attention to the timestamp of the file is recommended. Because it may take a moment for the pushed code to end up in the Amazon S3 bucket, make sure that the latest version possible in MWAA is the latest version that you just pushed. If the timestamp doesn’t match, you must wait a couple of minutes and refresh the Edit page. In Figure 4, the field with the to-be-adjusted timestamp is outlined in blue.
- For updating the requirements version, repeat the preceding steps.
- You can skip the following settings and choose Save to have the MWAA environment updated. This can take up to 15 minutes.
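The console steps above could also be scripted. The following is a hedged sketch rather than the procedure we used: it assumes boto3-style clients for Amazon S3 and Amazon MWAA are passed in, and the function names and the default key plugins.zip are hypothetical. It selects the newest object version by timestamp, which mirrors the manual timestamp check.

```python
def latest_version_id(s3_client, bucket, key):
    """Return the VersionId of the most recently modified version of `key`.

    Uses a single list_object_versions call; for keys with very many
    versions, pagination would be needed (not handled in this sketch).
    """
    response = s3_client.list_object_versions(Bucket=bucket, Prefix=key)
    # The prefix may also match other keys, so filter for an exact match.
    versions = [v for v in response.get("Versions", []) if v["Key"] == key]
    if not versions:
        raise LookupError(f"no versions found for s3://{bucket}/{key}")
    newest = max(versions, key=lambda v: v["LastModified"])
    return newest["VersionId"]


def update_plugins(mwaa_client, s3_client, env_name, bucket, plugins_key="plugins.zip"):
    """Point the MWAA environment at the newest version of the plugins ZIP.

    Returns the version ID that was selected. The environment update
    itself runs asynchronously and can take several minutes.
    """
    version_id = latest_version_id(s3_client, bucket, plugins_key)
    mwaa_client.update_environment(
        Name=env_name,
        PluginsS3Path=plugins_key,
        PluginsS3ObjectVersion=version_id,
    )
    return version_id
```

With real clients, this would be called as update_plugins(boto3.client("mwaa"), boto3.client("s3"), "my-environment", "mybucketname"), where the environment and bucket names are placeholders. The requirements file can be handled analogously with the RequirementsS3Path and RequirementsS3ObjectVersion parameters.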
Migrating to Apache Airflow 2.x
Once Amazon Managed Workflows for Apache Airflow supported Apache Airflow 2.0, we were eager to upgrade because it includes new features that both enhance operability, such as task groups and the new UI, and allow high availability of the Apache Airflow Scheduler.
Apart from migrating to an Apache Airflow 2.0 environment, hooks and DAGs also must be changed for a successful migration. Now let’s walk through changes implemented in hooks and DAGs to migrate successfully to Apache Airflow 2.0.
Step 1. Create necessary resources for Amazon MWAA v2
Directly upgrading MWAA from version 1.10.12 to version 2.0 is not possible. Therefore, a new MWAA environment must be created with all the necessary resources. Review the previous steps to achieve that.
Step 2. Adjust the hooks
Adjust folder structure and importing the hooks
Loading hooks as Apache Airflow plugins is deprecated from version 2.0. The hooks can now be imported directly from the corresponding Airflow plugins folder. Furthermore, the distinction between hook and sensor locations is no longer made in Airflow 2.0. Inserting a custom hook into an Airflow plugin also is not necessary. The new, simplified folder structure compared to the old folder structure looks like the following:
- Old folder structure:
- New folder structure:
Hooks can now be imported with the following statement:
Adjust the hook import statements
Importing custom hooks has changed. In Apache Airflow version 1.10.12, a custom hook was imported with
from hooks.awshooks.v02.biccSSM import custom_hook_name
In Apache Airflow version 2.0.2, the import statement becomes
from awshooks_v03.biccSSM import custom_hook_name
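The direct import works because Airflow puts the plugins folder on Python's module search path. The following self-contained sketch imitates that mechanism outside of Airflow. The folder layout is inferred from the import statement and is an assumption, and the hook body is a stand-in, not our real hook.

```python
import importlib
import sys
from pathlib import Path


def make_demo_plugins(root):
    """Create an assumed plugins layout: <root>/awshooks_v03/biccSSM.py."""
    package = Path(root) / "awshooks_v03"
    package.mkdir(parents=True)
    (package / "__init__.py").write_text("")
    (package / "biccSSM.py").write_text(
        "class custom_hook_name:\n"
        "    # Stand-in for a real custom hook.\n"
        "    def get_value(self):\n"
        "        return 'demo'\n"
    )
    return str(root)


def import_from_plugins(plugins_dir, module_name, attribute):
    """Import `attribute` from `module_name` with `plugins_dir` on sys.path.

    Equivalent to `from <module_name> import <attribute>` once the
    plugins folder is importable.
    """
    if plugins_dir not in sys.path:
        sys.path.insert(0, plugins_dir)
    importlib.invalidate_caches()
    module = importlib.import_module(module_name)
    return getattr(module, attribute)
```

After make_demo_plugins has created the folder, import_from_plugins(folder, "awshooks_v03.biccSSM", "custom_hook_name") returns the hook class, just as the version 2.0.2 import statement does inside a DAG.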
Adjust the DAG import statements
A long list of import statements is available in the Apache Airflow versions on Amazon Managed Workflows for Apache Airflow (MWAA) documentation.
Imports in version 1.10.12:
Imports in version 2.0.2:
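As an illustration of the kind of change involved, the sketch below rewrites a few well-known core module paths that were renamed between Airflow 1.10 and 2.0. The mapping is deliberately small and is not the list of imports from our DAGs. The helper names are hypothetical, and the line-based approach does not handle multi-line or aliased module imports specially.

```python
import re

# Old Airflow 1.10 module path -> Airflow 2.0 module path (small excerpt).
RENAMED_MODULES = {
    "airflow.operators.bash_operator": "airflow.operators.bash",
    "airflow.operators.python_operator": "airflow.operators.python",
    "airflow.operators.dummy_operator": "airflow.operators.dummy",
    "airflow.hooks.base_hook": "airflow.hooks.base",
    "airflow.sensors.base_sensor_operator": "airflow.sensors.base",
}

# Matches the module path directly after "from" or "import".
_IMPORT_PATTERN = re.compile(r"^(\s*(?:from|import)\s+)([\w.]+)")


def rewrite_import_line(line):
    """Replace a renamed module path in a single import line, if present."""
    match = _IMPORT_PATTERN.match(line)
    if not match:
        return line
    replacement = RENAMED_MODULES.get(match.group(2))
    if replacement is None:
        return line
    return line[: match.start(2)] + replacement + line[match.end(2):]


def rewrite_imports(source):
    """Apply rewrite_import_line to every line of a DAG source string."""
    return "\n".join(rewrite_import_line(line) for line in source.split("\n"))
```

A script like this can give a first pass over many DAG files. Each rewritten file should still be reviewed and tested in a development environment.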
Following are highlights from what we learned and recommendations based on our migration to Amazon MWAA in February 2021 and our upgrade to Apache Airflow 2.0.2 on MWAA in June 2021.
- Migrating the DAGs is trivial. The automatic synchronization pushes the DAGs to the MWAA environment and quickly makes the DAGs available in the user interface.
- Migrating custom hooks and sensors was more difficult because the former hooks and sensors could not simply be loaded in the same way from a specific directory into the Amazon MWAA environment. Instead, the files had to be saved as a ZIP file, which means an additional step for us that is run by a Lambda function.
- Debugging the environment, in addition to hooks and sensors, can take time because the Amazon MWAA environment is blocked for up to 15 minutes while updating. Moreover, introducing new hooks and sensors can result in an error that causes all custom modules to fail to load. Thus, testing custom sensors and hooks in a development environment is essential.
- Installing Python modules is intuitive. Store a requirements.txt file with the desired modules on Amazon S3 and select this file when updating Amazon MWAA.
- When encrypting Amazon MWAA, if CloudWatch is supposed to be enabled for the Amazon MWAA environment, ensure that the key policy is expanded to include CloudWatch Logs.
- Managing the authentication process has become simpler because once you are logged in to the right AWS account, logging into the Airflow user interface does not require additional credentials. If your IT security approves, we recommend deploying with the WebserverAccessMode PUBLIC_ONLY, as accessing the UI from outside the network will be easier, especially since port forwarding to the Amazon MWAA UI with SSO is not possible.
In this blog post, I explained why we migrated to Amazon Managed Workflows for Apache Airflow (MWAA), steps we took as part of that migration, and lessons we learned along the way. Resources covered in this article are available in our GitHub repository.
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.