AWS Architecture Blog

Enhance the resilience of critical workloads by architecting with multiple AWS Regions

In this post, we will share how you can use a multi-Region architecture to achieve higher resilience on Amazon Web Services (AWS). This approach relies on first operating a workload across multiple Availability Zones within an AWS Region before expanding to multiple Regions for even higher resilience. Within a Region there are multiple Availability Zones, which are physically separated by many miles but still close enough together (60 miles or less) to allow single-digit millisecond latency between them. Each Availability Zone consists of one or more data centers, each housed in its own facility with its own redundant networking, connectivity, and power.

Availability Zones provide fundamental building blocks that can help you achieve your resilience goals. Zonal services let you specify which Availability Zone a resource is in, such as an Amazon Elastic Compute Cloud (Amazon EC2) instance. If you deploy redundant replicas of your application's resources in each Availability Zone, you gain strong resilience to infrastructure events that impact any one Availability Zone.
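To make this zonal placement concrete, the following minimal sketch (not from the original post) uses the AWS SDK for Python (boto3) to launch one replica in each Availability Zone of a Region. The AMI ID and instance type are hypothetical placeholders, and in practice you would more likely use an Auto Scaling group that spans multiple zones.

```python
# Minimal sketch, assuming boto3 and hypothetical AMI/instance values:
# place one redundant replica of an application instance in each
# Availability Zone so the loss of any single zone leaves the others serving.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Discover the Availability Zones in the Region (excluding Local Zones).
azs = [
    az["ZoneName"]
    for az in ec2.describe_availability_zones(
        Filters=[{"Name": "zone-type", "Values": ["availability-zone"]}]
    )["AvailabilityZones"]
]

# Launch one replica per Availability Zone.
for az in azs:
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # hypothetical AMI
        InstanceType="t3.micro",           # hypothetical instance type
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": az},
    )
```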

A multi-Region approach is a reliable way to achieve a bounded recovery time for critical applications in the rare event that a service failure in a Region impacts your application. Each Region has strict logical and physical separation from other Regions. This purposeful design helps keep service and infrastructure disruptions in one Region from affecting another Region. This unique property of Regions can be used to build multi-Region applications with predictable fault domains.

While a multi-Region approach can improve your application's resilience to failures, building and operating such an application is challenging. It takes deliberate work to take advantage of the isolation between Regions without undermining that isolation at the application level. For example, if you fail over an application between Regions, you need to maintain strict separation between your application stacks in each Region, be aware of all the application dependencies, and fail over all parts of the application together. This kind of system requires planning and coordination among many engineering and business teams, especially with a complex, microservices-based architecture that can have several dependencies between applications.

If you replicate data between Regions asynchronously, be aware of the risk that not all of your data has been replicated to the standby Region when you fail over. Because copying data between Regions takes a finite amount of time, data might be out of sync between the primary and standby Regions.

If you instead use a database that replicates synchronously across Regions to support applications running in more than one Region concurrently, you avoid out-of-sync data when starting your application in the new Region. However, this introduces higher write latency into your application, because writes must commit in more than one Region and the Regions can be hundreds or thousands of miles apart. This latency characteristic needs to be accounted for in your application design. In addition, synchronous replication can increase the chance of correlated failures, because writes must be committed to more than one Region to succeed. To tolerate an impairment within one Region, you need to form a quorum for writes, which typically means running your database in three Regions and requiring a quorum of two out of three.
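As an illustration of that quorum behavior, here is a minimal sketch of a two-out-of-three write quorum across three Regions. The write_to_region function and Region names are hypothetical stand-ins for whatever regional write your database client actually performs.

```python
# Illustrative sketch of a 2-of-3 write quorum across three Regions.
# write_to_region() is a hypothetical stand-in for a regional database write.
from concurrent.futures import ThreadPoolExecutor

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]
QUORUM = 2  # a write succeeds once a majority of Regions acknowledge it


def write_to_region(region: str, key: str, value: str) -> bool:
    """Hypothetical regional write; returns True if the Region acknowledged."""
    ...  # replace with your database client call
    return True


def quorum_write(key: str, value: str) -> bool:
    # Attempt the write in all Regions in parallel; every write pays the
    # cross-Region round-trip latency described above.
    with ThreadPoolExecutor(max_workers=len(REGIONS)) as pool:
        acks = list(pool.map(lambda r: write_to_region(r, key, value), REGIONS))
    # The write commits only if a majority (2 of 3) of Regions acknowledged,
    # so the application tolerates the impairment of any single Region.
    return sum(acks) >= QUORUM
```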

Finally, you need to practice failover and simulate Region impairments to know that the process works when you need it. Regularly rotating your application between Regions to practice failover is a substantial time and resource investment, but it's a recommended practice if you plan to build a multi-Region application.
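One common way to rehearse such a failover, assuming your traffic is routed through DNS, is to shift weights between Regional endpoints using Amazon Route 53 weighted records. The sketch below is illustrative only; the hosted zone ID, record name, and endpoint DNS names are hypothetical placeholders.

```python
# Hedged sketch: shift traffic from the primary to the standby Region during a
# failover drill by updating Route 53 weighted records. All identifiers below
# (hosted zone, record name, endpoint DNS names) are hypothetical placeholders.
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0123456789EXAMPLE"  # hypothetical hosted zone
RECORD_NAME = "app.example.com."


def set_region_weights(primary_weight: int, standby_weight: int) -> None:
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "Failover drill: shift traffic between Regions",
            "Changes": [
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": RECORD_NAME,
                        "Type": "CNAME",
                        "SetIdentifier": "primary-us-east-1",
                        "Weight": primary_weight,
                        "TTL": 60,
                        "ResourceRecords": [
                            {"Value": "primary-alb.us-east-1.example.com"}
                        ],
                    },
                },
                {
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": RECORD_NAME,
                        "Type": "CNAME",
                        "SetIdentifier": "standby-us-west-2",
                        "Weight": standby_weight,
                        "TTL": 60,
                        "ResourceRecords": [
                            {"Value": "standby-alb.us-west-2.example.com"}
                        ],
                    },
                },
            ],
        },
    )


# During the drill, send all traffic to the standby Region...
set_region_weights(primary_weight=0, standby_weight=100)
# ...and shift it back once the exercise is complete.
# set_region_weights(primary_weight=100, standby_weight=0)
```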

Given these additional considerations, a multi-AZ architecture within a single Region is the right approach for most AWS customers building and operating resiliently in the cloud. It helps mitigate most infrastructure failures, which are usually contained within an Availability Zone. A multi-Region approach is most common in the following scenarios.

Meet regulatory and compliance requirements and enhance disaster recovery capabilities

Regulated industries such as financial services, healthcare, and life sciences can require that applications be multi-Region. Healthcare providers and pharmaceutical companies, for example, often deploy electronic health records (EHR), clinical trial management systems, and other applications across multiple Regions for enhanced data redundancy, disaster recovery, and compliance with regional data privacy regulations (such as HIPAA in the US or GDPR in the EU). Epic on AWS is typically deployed across multiple Availability Zones and multiple Regions to increase the resilience of customers' EHR and integrated application environments, making full use of the resources and geographic diversity of the AWS Cloud.

Banks and financial institutions, including Fidelity and Vanguard, also deploy many of their core trading and investment platforms and customer-facing applications across multiple Regions for enhanced business continuity and compliance with local data protection regulations.

Achieve a bounded recovery time to support highly available business-critical workloads

With growing demand for always-on applications and services, companies increasingly rely on cloud-based services and infrastructure for day-to-day operations and business continuity. While a single Region supports highly available and resilient applications, distributing workloads across multiple Regions enables a bounded recovery time in the rare event of a disruption to the application. The physical and logical separation of Regions provides a well-defined fault isolation boundary that you can use to create predictable fault domains for your applications. If the application experiences issues in one Region, the workload can continue operating in another Region, which minimizes downtime for customers and users.

Streaming platforms such as Netflix, NBCUniversal, and Disney, for example, deploy their content delivery networks (CDNs) and video streaming infrastructure across multiple Regions to provide a seamless media experience for their customers. Similarly, video gaming companies often deploy their infrastructure across multiple Regions to offer lower-latency gaming experiences for players worldwide.

Automotive companies such as Honda deploy their connected vehicle platforms across multiple Regions to scale globally. They use geo-location routing that identifies the closest broker the vehicle should communicate with based on customer-configured rules that govern how vehicles connect to the cloud infrastructure. This allows them to reliably connect millions of vehicles to the cloud while supporting high availability.
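As a simplified illustration of this pattern (an assumption about the general approach, not Honda's actual implementation), the following sketch maps a vehicle's location to a Regional broker endpoint using customer-configured rules; all endpoints and country codes are hypothetical.

```python
# Illustrative sketch of geo-location routing: pick the closest Regional
# broker endpoint for a vehicle based on customer-configured rules.
# Endpoints and country codes are hypothetical placeholders.
ROUTING_RULES = {
    # customer-configured rules: country code -> Regional broker endpoint
    "US": "mqtt-broker.us-east-1.example.com",
    "DE": "mqtt-broker.eu-central-1.example.com",
    "JP": "mqtt-broker.ap-northeast-1.example.com",
}
DEFAULT_BROKER = "mqtt-broker.us-east-1.example.com"


def broker_for_vehicle(country_code: str) -> str:
    """Return the broker endpoint a vehicle in the given country should use."""
    return ROUTING_RULES.get(country_code, DEFAULT_BROKER)


print(broker_for_vehicle("DE"))  # -> mqtt-broker.eu-central-1.example.com
```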

Conclusion

No matter the industry or scenario, AWS is the definitive choice for organizations that want to build and run highly available, resilient applications in the cloud, with resilience built into its infrastructure, operational models, and comprehensive capabilities across Regions. To learn how to choose between the different options for building resilience into your application, see the Well-Architected reliability pillar, and for a detailed framework for choosing multi-Region, see AWS Multi-Region Fundamentals.


John Formento

John Formento, Jr. is a Principal Product Manager in the Resilience Infrastructure and Solutions organization at AWS. He helps customers achieve their resilience goals while operating on AWS by building recovery tools and focusing on internal AWS resilience initiatives.