
Introducing Apache Airflow version 2.6.3 support on Amazon MWAA

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that makes it simple to set up and operate end-to-end data pipelines in the cloud. Trusted across various industries, Amazon MWAA helps organizations like Siemens, ENGIE, and Choice Hotels International enhance and scale their business workflows, while significantly improving security and reducing infrastructure management overhead.

Today, we are announcing the availability of Apache Airflow version 2.6.3 environments. If you’re currently running Apache Airflow version 2.x, you can seamlessly upgrade to v2.6.3 using in-place version upgrades, thereby retaining your workflow run history and environment configurations.

In this post, we delve into some of the new features and capabilities of Apache Airflow v2.6.3 and how you can set up or upgrade your Amazon MWAA environment to accommodate this version as you orchestrate your workflows in the cloud at scale.

New feature: Notifiers

Airflow now gives you an efficient way to create reusable and standardized notifications to handle systemic errors and failures. Notifiers introduce a new object in Airflow, designed to be an extensible layer for adding notifications to DAGs. This framework can send messages to external systems when a task instance or an individual DAG run changes its state. You can build notification logic from a new base object and call it directly from your DAG files. The BaseNotifier is an abstract class that provides a basic structure for sending notifications in Airflow using the various on_*_callback parameters. It is intended to be extended and customized by providers for their specific needs.

Using this framework, you can build custom notification logic directly within your DAG files. For instance, notifications can be sent through email, Slack, or Amazon Simple Notification Service (Amazon SNS) based on the state of a DAG (on_failure, on_success, and so on). You can also create your own custom notifier that updates an API or posts a file to your storage system of choice.
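
As a minimal sketch, assuming you want to publish alerts to Amazon SNS, the following hypothetical SnsNotifier extends BaseNotifier; the class name, topic ARN, and message are illustrative assumptions, not a built-in notifier.

import boto3
from airflow.notifications.basenotifier import BaseNotifier

class SnsNotifier(BaseNotifier):
    """Hypothetical notifier that publishes state-change alerts to an Amazon SNS topic."""

    template_fields = ("message",)

    def __init__(self, topic_arn: str, message: str):
        self.topic_arn = topic_arn
        self.message = message

    def notify(self, context):
        # The callback context identifies the task instance whose state changed
        subject = f"Airflow alert: {context['task_instance'].task_id}"
        boto3.client("sns").publish(
            TopicArn=self.topic_arn,
            Subject=subject,
            Message=self.message,
        )

You can then attach it at the DAG or task level through any of the callback parameters, for example on_failure_callback=SnsNotifier(topic_arn="arn:aws:sns:us-east-1:111122223333:alerts", message="A task failed").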

For details on how to create and use a notifier, refer to Creating a notifier.

New feature: Managing tasks stuck in a queued state

Apache Airflow v2.6.3 brings a significant improvement to address the long-standing issue of tasks getting stuck in the queued state when using the CeleryExecutor. In a typical Apache Airflow workflow, tasks progress through a lifecycle, moving from the scheduled state to the queued state, and eventually to the running state. However, tasks can occasionally remain in the queued state longer than expected due to communication issues among the scheduler, the executor, and the worker. In Amazon MWAA, customers have experienced such tasks being queued for up to 12 hours due to the way the service utilizes the native integration of Amazon Simple Queue Service (Amazon SQS) with the CeleryExecutor.

To mitigate this issue, Apache Airflow v2.6.3 introduced a mechanism that checks the Airflow database for tasks that have remained in the queued state beyond a specified timeout, defaulting to 600 seconds. This default can be modified using the environment configuration parameter scheduler.task_queued_timeout. The system then retries such tasks if retries are still available or fails them otherwise, ensuring that your data pipelines continue to function smoothly.

Notably, this update deprecates the previously used celery.stalled_task_timeout and celery.task_adoption_timeout settings, and consolidates their functionalities into a single configuration, scheduler.task_queued_timeout. This enables more effective management of tasks that remain in the queued state. Operators can also configure scheduler.task_queued_timeout_check_interval, which controls how frequently the system checks for tasks that have stayed in the queued state beyond the defined timeout.
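
On Amazon MWAA, you apply these settings as Airflow configuration overrides on the environment. As a minimal sketch, the following call uses the AWS SDK for Python (Boto3) to set both options; the environment name and the timeout values are hypothetical and should be replaced with your own.

import boto3

mwaa = boto3.client("mwaa")
mwaa.update_environment(
    Name="my-airflow-environment",  # hypothetical environment name
    AirflowConfigurationOptions={
        "scheduler.task_queued_timeout": "900",  # retry or fail tasks queued longer than 15 minutes
        "scheduler.task_queued_timeout_check_interval": "120",  # check for stuck tasks every 2 minutes
    },
)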

For details on how to use task_queued_timeout, refer to the official Airflow documentation.

New feature: A new continuous timetable and support for continuous schedule

With prior versions of Airflow, to run a DAG continuously in a loop, you had to use the TriggerDagRunOperator to rerun the DAG after the last task finished. With Apache Airflow v2.6.3, you can now run a DAG continuously with a predefined timetable, which simplifies scheduling for continual DAG runs. The new ContinuousTimetable construct creates one continuous DAG run, respecting start_date and end_date, with each new run starting as soon as the previous run has completed, regardless of whether the previous run succeeded or failed. Using a continuous timetable is especially useful when sensors are used to wait for highly irregular events from external data tools.

You can limit the degree of parallelism with the max_active_runs parameter to ensure that only one DAG run is active at any given time:

from datetime import datetime
from airflow.decorators import dag

@dag(
    start_date=datetime(2023, 5, 9),
    schedule="@continuous",  # start a new run as soon as the previous one completes
    max_active_runs=1,  # required when using the continuous schedule
    catchup=False,
)
def my_continuous_dag():
    ...
my_continuous_dag()

New feature: Trigger the DAG UI extension with flexible user form concept

Prior to Apache Airflow v2.6.3, you could provide parameters in a JSON structure through the Airflow UI for custom workflow runs. You had to model, check, and understand the JSON and enter parameters manually, without the option to validate them before triggering a workflow. With Apache Airflow v2.6.3, when you choose Trigger DAG w/ config, a trigger UI form is rendered based on the predefined DAG Params. For your ad hoc, testing, or custom runs, this simplifies the DAG’s parameter entry. If the DAG has no parameters defined, a JSON entry mask is shown. The form elements can be defined with the Param class, and its attributes determine how each form field is displayed.

For an example DAG, a trigger form is generated automatically from its DAG Params.
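
As a minimal sketch, assuming hypothetical parameter names and bounds, the following DAG defines two Params from which the trigger UI derives validated form fields:

from datetime import datetime
from airflow.decorators import dag
from airflow.models.param import Param

@dag(
    start_date=datetime(2023, 5, 9),
    schedule=None,
    params={
        # Each Param becomes a validated field in the trigger form
        "username": Param("world", type="string", title="Username"),
        "retries": Param(3, type="integer", minimum=0, maximum=10),
    },
)
def param_form_dag():
    ...
param_form_dag()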

Set up a new Apache Airflow v2.6.3 environment

You can set up a new Apache Airflow v2.6.3 environment in your account and preferred Region using the AWS Management Console, API, or AWS Command Line Interface (AWS CLI). If you’re adopting infrastructure as code (IaC), you can automate the setup using either AWS CloudFormation, the AWS Cloud Development Kit (AWS CDK), or Terraform scripts.
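
If you work with the API directly, the following Boto3 sketch creates a v2.6.3 environment; the name, ARNs, S3 path, and network IDs are hypothetical placeholders for your own resources.

import boto3

mwaa = boto3.client("mwaa")
mwaa.create_environment(
    Name="my-airflow-263-environment",  # hypothetical environment name
    AirflowVersion="2.6.3",
    ExecutionRoleArn="arn:aws:iam::111122223333:role/my-mwaa-execution-role",
    SourceBucketArn="arn:aws:s3:::my-mwaa-dags-bucket",
    DagS3Path="dags",
    NetworkConfiguration={
        "SubnetIds": ["subnet-1111aaaa", "subnet-2222bbbb"],
        "SecurityGroupIds": ["sg-3333cccc"],
    },
)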

When you have successfully created an Apache Airflow v2.6.3 environment in Amazon MWAA, the following packages are automatically installed on the scheduler and worker nodes along with other provider packages:

apache-airflow-providers-amazon==8.2.0

python==3.10.8

For a complete list of provider packages installed, refer to Apache Airflow provider packages installed on Amazon MWAA environments.

Upgrade from older versions of Apache Airflow to Apache Airflow v2.6.3

You can perform in-place version upgrades of your existing Amazon MWAA environments to update your older Apache Airflow v2.x-based environments to v2.6.3. To learn more about in-place version upgrades, refer to Upgrading the Apache Airflow version or Introducing in-place version upgrades with Amazon MWAA.
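
As a brief sketch, an in-place upgrade can also be initiated through the API; the environment name below is hypothetical, and the same change is available in the console and the AWS CLI.

import boto3

mwaa = boto3.client("mwaa")
mwaa.update_environment(
    Name="my-airflow-environment",  # hypothetical existing v2.x environment
    AirflowVersion="2.6.3",  # target version for the in-place upgrade
)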

Conclusion

In this post, we talked about some of the new features of Apache Airflow v2.6.3 and how you can get started using them in Amazon MWAA. Try out these new features, such as notifiers and continuous timetables, along with the other enhancements, to improve your data orchestration pipelines.

For additional details and code examples on Amazon MWAA, visit the Amazon MWAA User Guide and the Amazon MWAA examples GitHub repo.

Apache, Apache Airflow, and Airflow are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.


About the Authors

Hernan Garcia is a Senior Solutions Architect at AWS, based out of Amsterdam, working in the Financial Services Industry since 2018. He specializes in application modernization and supports his customers in the adoption of cloud operating models and serverless technologies.

Parnab Basak is a Solutions Architect and a Serverless Specialist at AWS. He specializes in creating new cloud-native solutions using modern software development practices like serverless, DevOps, and analytics. Parnab works closely in the analytics and integration services space, helping customers adopt AWS services for their workflow orchestration needs.

Shubham Mehta is an experienced product manager with over eight years of experience and a proven track record of delivering successful products. In his current role as a Senior Product Manager at AWS, he oversees Amazon Managed Workflows for Apache Airflow (Amazon MWAA) and spearheads the Apache Airflow open-source contributions to further enhance the product’s functionality.