United Airlines implements enterprise-wide resilience program with AWS

This post is co-authored with Jenny Zhou, Principal Enterprise Architect at United Airlines.

In this post, we explore how United Airlines implemented an enterprise-wide resilience program using Amazon Web Services (AWS).

United Airlines, a major U.S. airline headquartered in Chicago, Illinois, announced its United Next plan in 2021. United Next is the airline’s plan to improve its network and enhance the customer experience. As the company transitions hundreds of applications to AWS and modernizes its critical digital systems, it must ensure 100% availability of these business-critical applications.

To meet the business requirement of a lower recovery time objective (RTO), application teams began designing multi-Region application architectures. United Airlines teams identified the need for a flexible, repeatable, and robust platform that could scale quickly. As application teams modernized on AWS, the United Airlines Platform Engineering, Database Administration (DBA), and application teams managed complex and time-consuming failover runbooks and procedures. They often relied on manual failover processes requiring human intervention. These processes were inefficient and error-prone, potentially causing downtime and disrupting critical business services.

To address these challenges, the United Airlines leadership tasked the Enterprise Architecture (EA) team with building a more robust, repeatable, and automated solution.

Rapid recovery solution

In April 2023, the EA team began rolling out the Rapid Recovery solution. Rapid Recovery is a central platform developed to enable rapid cross-Region recovery capabilities for critical applications hosted on AWS. This platform automates common recovery steps such as (1) switching applications between Regions by using Amazon Application Recovery Controller (ARC), (2) automating database failover tasks like promoting a secondary DB cluster to be the primary DB cluster, and (3) providing templates to create an observability dashboard. Rapid Recovery aims to provide enhanced business continuity and disaster recovery (BCDR) protection compared to the high availability provided in a single AWS Region. To date, more than 70 business-critical services running on AWS use this platform.

Diagram of an AWS-based application recovery system. It shows the flow from an authorized admin through monitoring tools, central resilience AWS account, and application AWS account. The process includes application and database failover, event bus, step functions for notifications and failovers, and connections to various teams and AWS services like DocumentDB and Aurora. The system incorporates chaos engineering and recovery dashboard components.

Figure 1: Architecture for Rapid Recovery Solution

A small team of five people built the Rapid Recovery solution in less than six months.

The above architecture has the following key features:

  1. Automated recovery: Rapid Recovery automates common recovery steps by using Amazon Application Recovery Controller (ARC) to switch application traffic routing. It also triggers managed failover and switchover of the database, which includes changing the custom DNS endpoint using Amazon Route 53. With ARC, Rapid Recovery gains insight into whether resources in the failover Region are prepared for recovery, and it can trigger failover recovery for applications across multiple AWS Regions.
  2. Flexible usage scenarios: The platform supports various enterprise use cases, including:
    1. Incident recovery
    2. Major application version releases or upgrades
    3. Component level chaos testing
    4. Scheduled failovers
    5. Fully automated failovers triggered by alarms (less common but supported)
  3. Easy-to-use workflow: Authorized team members initiate failovers through a simple custom workflow interface. [see Figure 3 below]
  4. Comprehensive monitoring: The solution provides a standard monitoring dashboard which integrates with AWS services like Step Functions, AWS Lambda, and Amazon EventBridge for detailed execution tracking. This dashboard provides an enterprise-wide view for United Airlines leadership and a per-application view for individual application owners. [see Figure 2 below]
  5. Automated notifications: Support teams receive email notifications throughout the failover process, ensuring clear communication and coordination.
  6. Customizable: While providing a standardized foundation, Rapid Recovery allows application teams to customize the platform for specific requirements.
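
To make the automated recovery feature more concrete, the following is a minimal sketch of how a Region switch could be expressed as a set of routing control state changes. It is not United Airlines' implementation. The `RegionControl` class, the `build_routing_updates` function, and the one-routing-control-per-Region layout are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RegionControl:
    """Hypothetical pairing of an AWS Region with its ARC routing control."""
    region: str
    routing_control_arn: str


def build_routing_updates(
    controls: List[RegionControl], target_region: str
) -> List[Dict[str, str]]:
    """Return state entries that turn the target Region On and all others Off."""
    if target_region not in {c.region for c in controls}:
        raise ValueError(f"no routing control registered for {target_region}")
    return [
        {
            "RoutingControlArn": c.routing_control_arn,
            "RoutingControlState": "On" if c.region == target_region else "Off",
        }
        for c in controls
    ]


if __name__ == "__main__":
    # Placeholder ARNs; real routing control ARNs come from your ARC cluster.
    controls = [
        RegionControl("us-east-1", "arn:placeholder:east"),
        RegionControl("us-west-2", "arn:placeholder:west"),
    ]
    for entry in build_routing_updates(controls, "us-west-2"):
        print(entry)
```

The returned entries follow the shape that the boto3 `route53-recovery-cluster` client accepts in `update_routing_control_states(UpdateRoutingControlStateEntries=...)`. Keeping the payload construction separate from the API call makes the switch logic easy to test before any traffic is moved.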

Figure 2: Enterprise-wide dashboard with failover history

Implementing automation allowed United Airlines to expedite the recovery process and reduce the recovery time objective (RTO) during service impairments.

Besides providing Rapid Recovery, the EA team also standardized application onboarding by providing detailed guidance on disaster recovery design, performing Well-Architected reviews, and supplying starter kits (documentation and code) with example runbooks for standard application architecture patterns. These runbooks outline the preparation steps, recovery procedures, and post-failover testing requirements.

Before an application moves to production, application teams are required to perform a full application failover drill into another Region. This mandatory step validates the failover runbook and builds confidence in the team’s ability to execute a failover when needed. The EA team leads sessions with application teams, providing guidance and training to ensure the success of this initiative.

Recovery process

Most application architectures at United Airlines rely on human intervention to trigger cross-Region failover processes. Switching between Regions typically involves deliberate human assessment and decision-making. This approach prioritizes human oversight and control over automated failover mechanisms based on observability signals. This human-in-the-loop approach ensures careful consideration of potential impacts before executing a Regional failover, maintaining a balance between system resilience and operational control.

United Airlines has a well-defined event management process to handle critical service disruptions. This process includes an incident management team, application owners, and senior leadership to assess the impact and define next steps.

Failover process

  1. Incident detection: Observability tools detect an impairment to a critical business service, and an incident call is started.
  2. Assessment: During the call, business leaders, application owners, and operations teams evaluate the situation and its impact on the business. They also determine whether a failover is necessary to quickly mitigate the negative impact, which gives application teams time to find the root cause of the issue.
  3. Decision-making: Teams may opt to fail over only specific components, such as service tiers or AWS services like databases.
  4. Execution: Authorized application owners use a custom workflow to start and manage the failover process.
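
The human-in-the-loop steps above can be sketched as a simple gate that refuses to plan a failover unless it was approved during assessment and requested by an authorized owner. This is an illustrative sketch only. `FailoverRequest`, `plan_failover`, the component names, and the notification wording are hypothetical and do not describe the actual custom workflow.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class FailoverRequest:
    """Hypothetical request raised from the workflow interface."""
    requester: str
    application: str
    components: List[str]            # for example ["service-tier", "database"]
    approved_on_incident_call: bool  # outcome of the assessment step


def plan_failover(request: FailoverRequest, authorized_owners: Set[str]) -> List[str]:
    """Return the ordered steps for an approved, authorized failover."""
    if not request.approved_on_incident_call:
        raise PermissionError("failover was not approved during assessment")
    if request.requester not in authorized_owners:
        raise PermissionError(f"{request.requester} is not an authorized owner")
    if not request.components:
        raise ValueError("at least one component must be selected")

    # Notifications bracket the component failovers so support teams stay informed.
    steps = [f"notify: failover of {request.application} started"]
    steps += [f"failover: {component}" for component in request.components]
    steps.append(f"notify: failover of {request.application} completed")
    return steps


if __name__ == "__main__":
    request = FailoverRequest("alice", "booking", ["database"], True)
    for step in plan_failover(request, authorized_owners={"alice"}):
        print(step)
```

Putting the approval and authorization checks ahead of any action mirrors the choice described above: automation executes the failover, but people decide when it happens.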

Figure 3: Workflow interface provided to the application team

Resilience at scale

Resilience is a continuous process. Periodically evaluating and practicing your disaster recovery plan is essential to ensure its effectiveness and to build confidence in its implementation when needed.

To understand its enterprise-wide resilience posture, United Airlines decided to capture and monitor automated and manual process signals (including monitoring of failure modes) in an operational dashboard called the Application Reliability Dashboard (ARD). ARD is a custom application with a dedicated software development team. Its goal is to enhance customer satisfaction by ensuring that applications meet high standards of quality and dependability.

ARD serves as a comprehensive overview of an application’s health and reliability. It provides a unified interface where each application service is assigned a reliability score, with a target pass criterion set at 80% or higher. This reliability score is calculated using United Airlines-specific metrics that Gartner, a leading research and advisory company, has reviewed and endorsed. The scoring model is based on a customized service reliability engineering framework, specifically tailored to meet United Airlines’ unique needs and requirements.
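
The actual ARD metrics and weights are not published in this post. As a rough illustration of how a score with an 80% pass criterion might be computed, the sketch below uses a weighted average. The metric names, the weights, and the weighted-average formula itself are assumptions, not the ARD scoring model.

```python
from typing import Dict

PASS_THRESHOLD = 80.0  # percent; the pass criterion stated above


def reliability_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-metric scores (each 0-100), as a percentage."""
    if set(metrics) != set(weights):
        raise ValueError("metrics and weights must cover the same names")
    if any(not 0 <= value <= 100 for value in metrics.values()):
        raise ValueError("each metric score must be between 0 and 100")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(metrics[name] * weights[name] for name in metrics) / total_weight


def passes(score: float) -> bool:
    """True when the score meets the 80% target."""
    return score >= PASS_THRESHOLD


if __name__ == "__main__":
    # Hypothetical metric names and weights, for illustration only.
    metrics = {"failover_drill": 100, "monitoring_coverage": 80, "runbook_quality": 50}
    weights = {"failover_drill": 2, "monitoring_coverage": 1, "runbook_quality": 1}
    score = reliability_score(metrics, weights)
    print(score, passes(score))
```

With these sample inputs the score is (100×2 + 80×1 + 50×1) / 4 = 82.5, which clears the 80% target even though one metric is weak. A real scoring model may treat that case differently.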

Figure 4: Reliability score metrics

ARD serves three primary functions:

  1. Measurement: It quantifies the reliability, production readiness, and overall health of United Airlines applications.
  2. Visibility: ARD provides clear insights into well-defined critical metrics.
  3. Progress Tracking: It allows application teams to monitor improvements and changes.

By focusing on these areas, ARD enables application teams to deliver services that are reliable (consistently performing as expected), stable (resistant to unexpected failures or downtime), and high-performing (operating efficiently and responsively).

Figure 5: ARD Dashboard view

Cost optimization initiatives

Striving for shorter recovery time objectives (RTO) and recovery point objectives (RPO) typically leads to increased costs in both resource allocation and operational complexity. As such, it’s advisable to select RTO and RPO targets that strike an optimal balance between recovery capabilities and cost-effectiveness for your specific workload.

When United Airlines’ application teams initially explored multi-Region deployment, their primary worry was a potential doubling of application costs. To mitigate this concern, it’s essential to select the most appropriate disaster recovery (DR) strategy for each application, as this plays a pivotal role in managing overall application cost.
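
One way to reason about this trade-off is to pick the least expensive DR strategy that still meets a workload's RTO and RPO targets. The sketch below uses the four DR strategies described in AWS disaster recovery guidance (backup and restore, pilot light, warm standby, and multi-site active/active). The RTO, RPO, and relative cost figures are placeholders chosen for illustration, not measurements or AWS guidance, and `cheapest_strategy` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DrStrategy:
    name: str
    rto_minutes: float    # recovery time the strategy can typically achieve
    rpo_minutes: float    # data-loss window the strategy can typically achieve
    relative_cost: float  # 1.0 = single-Region baseline


# Placeholder figures for illustration only; measure your own workloads.
STRATEGIES = (
    DrStrategy("backup and restore", 1440, 1440, 1.1),
    DrStrategy("pilot light", 60, 15, 1.3),
    DrStrategy("warm standby", 15, 5, 1.6),
    DrStrategy("multi-site active/active", 1, 1, 2.0),
)


def cheapest_strategy(
    target_rto_minutes: float,
    target_rpo_minutes: float,
    strategies: Sequence[DrStrategy] = STRATEGIES,
) -> Optional[DrStrategy]:
    """Return the lowest-cost strategy meeting both targets, or None."""
    candidates = [
        s for s in strategies
        if s.rto_minutes <= target_rto_minutes and s.rpo_minutes <= target_rpo_minutes
    ]
    return min(candidates, key=lambda s: s.relative_cost, default=None)


if __name__ == "__main__":
    choice = cheapest_strategy(target_rto_minutes=30, target_rpo_minutes=10)
    print(choice.name if choice else "no strategy meets the targets")
```

Under these placeholder numbers, only the tightest targets require a full second copy of the application, which is why the doubling of cost that teams feared applies to a subset of workloads rather than to all of them.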

To further maintain cost-effectiveness, United Airlines implemented:

  1. Resource optimization: United Airlines shares Amazon Application Recovery Controller (ARC) clusters across multiple AWS accounts using AWS Resource Access Manager (AWS RAM). A key advantage of sharing a cluster across AWS accounts is that the total cost of running a single cluster is spread across several teams. By adopting this strategy, United Airlines reduces the overall number of clusters required, thereby achieving application resiliency more economically.

Figure 6: Application Recovery Controller cluster sharing using AWS RAM

  2. Real-time cost tracking: Application teams have access to a Harness Cloud Cost Management dashboard for monitoring costs.
  3. FinOps hackathons: Regular hackathon-style events benchmark application performance and identify new cost-saving opportunities.
  4. Integration of hackathon outputs: Cost-optimization techniques and lessons learned from FinOps hackathons are scaled into a repeatable deployment pipeline that all teams use.
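
The arithmetic behind sharing a cluster in the resource optimization item can be sketched as follows. The hourly rate below is a placeholder, not the actual ARC price, and both function names are hypothetical. Check current AWS pricing before using real figures.

```python
def shared_cluster_cost(hourly_rate: float, hours: float, sharing_accounts: int) -> float:
    """Cost attributed to each account when one cluster is shared evenly."""
    if sharing_accounts < 1:
        raise ValueError("at least one account must use the cluster")
    return hourly_rate * hours / sharing_accounts


def savings_from_sharing(hourly_rate: float, hours: float, sharing_accounts: int) -> float:
    """Total saved compared with every account running its own cluster."""
    if sharing_accounts < 1:
        raise ValueError("at least one account must use the cluster")
    dedicated_total = hourly_rate * hours * sharing_accounts
    shared_total = hourly_rate * hours
    return dedicated_total - shared_total


if __name__ == "__main__":
    # Placeholder rate of 2.0 per hour over a 720-hour month, shared by 10 accounts.
    print(shared_cluster_cost(2.0, 720, 10))
    print(savings_from_sharing(2.0, 720, 10))
```

Because a cluster costs the same however many accounts use it, each additional account that shares it lowers the per-account share.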

Summary

United Airlines has improved its operational resilience by implementing a comprehensive, enterprise-wide program on AWS. These initiatives have enhanced the reliability of the airline’s critical applications. To date, the program has shown impressive results, with over 1,000 successful cross-Region application failovers and over 400 automated database failovers. The airline also achieved a notable 7% reduction in MTTR in 2024, which led to a 5% increase in Net Promoter Score (NPS) in Q3 2024 compared to 2023. These accomplishments highlight United Airlines’ commitment to robust, uninterrupted service delivery and illustrate the effectiveness of its cloud-based resilience strategy.

Further reading

AWS Well-Architected Framework – Reliability Pillar
AWS Multi-Region Fundamentals whitepaper
Disaster Recovery (DR) Architecture on AWS, Architecture Blog series
AWS Cloud Resilience
AWS Multi-Region Capabilities

About the authors

Hemal Jani

Hemal Jani is a Solutions Architect with Amazon Web Services (AWS) based out of Chicago, IL. His area of focus is Enterprise Migrations & Resilience. He has 20+ years of technology leadership experience and currently works with Travel & Hospitality customers.

Jenny Zhou

Jenny Zhou is a Chicago-based Principal Enterprise Architect at United Airlines. She has 20+ years of experience in the airline industry and 10+ years leading enterprise architecture initiatives. She specializes in application architecture, cloud migration & resilience, and enterprise governance.