AWS Architecture Blog

Disaster Recovery (DR) Architecture on AWS, Part IV: Multi-site Active/Active

In my first blog post of this series, I introduced you to four strategies for disaster recovery (DR). My subsequent posts shared details on the backup and restore, pilot light, and warm standby active/passive strategies.

In this post, you’ll learn how to implement an active/active strategy to run your workload and serve requests in two or more distinct sites. Like other DR strategies, this enables your workload to remain available despite disaster events such as natural disasters, technical failures, or human actions.

DR strategies: Multi-site active/active

As we know from our now familiar DR strategies diagram (Figure 1), the multi-site active/active strategy will give you the lowest RTO (recovery time objective) and RPO (recovery point objective). However, this must be weighed against the potential cost and complexity of operating active stacks in multiple sites.

DR strategies

Figure 1. DR strategies

Implementing multi-site active/active

The architecture in Figure 2 shows you how to use AWS Regions as your active sites, creating a multi-Region active/active architecture. Only two Regions are shown, which is common, but more may be used. Each Region hosts a highly available, multi-Availability Zone (AZ) workload stack. In each Region, data is replicated live between the data stores and also backed up. This protects against disasters that include data deletion or corruption, since the data backup can be restored to the last known good state.

Multi-site active/active DR strategy

Figure 2. Multi-site active/active DR strategy

Traffic routing

Each regional stack serves production traffic. How you implement traffic routing determines which Region will receive a given request. Figure 2 shows Amazon Route 53, a highly available and scalable cloud Domain Name System (DNS), used for routing. Route 53 offers multiple routing policies. For example, the geolocation or latency routing policies are good choices for active/active deployments. For geolocation routing, you configure which Region a request goes to based on the origin location of the request. For latency routing, AWS automatically sends requests to the Region that provides the shortest round-trip time.

Your data governance strategy helps inform which routing policy to use. Geolocation routing lets you distribute requests in a deterministic way. This allows you to keep data for certain users within a specific Region, or you can control where write operations are routed to prevent contention. If optimizing for performance is your top priority, then latency routing is a good choice.

Read/write patterns

Read local/write local pattern

The Region to which a request is routed is called the “local Region” for that request. To maintain low latencies and reduce the potential for network error, serve all read and write requests from the local Region of your multi-Region active/active architecture.

I use Amazon DynamoDB for the example architecture in Figure 2. DynamoDB global tables replicate a table to multiple Regions. Writes to the table in any Region are replicated to other Regions within a second. This makes it a good choice when using the read local/write local pattern. However, there is the possibility of write contention if updates are made to the same item in different Regions at about the same time. To help ensure eventual consistency, DynamoDB global tables use a last writer wins reconciliation between concurrent updates. In this case, the data written by the first writer is lost. If your application cannot handle this and you require strong consistency, use another write pattern to avoid write contention.

Read local/write global pattern

With a write global pattern, you choose a Region to be the global write Region and only accept writes in that Region. DynamoDB global tables are still an excellent choice for replicating data globally; however, you must ensure that locally received write requests are re-directed to the global write Region.

Amazon Aurora is another good choice. When deployed as an Aurora global database, a primary cluster is deployed to your global write Region, and read-only instances (Aurora Replicas) are deployed to other AWS Regions. Data is replicated to these read-only instances with typical latency of under a second. Aurora global database write forwarding (available using Aurora MySQL-Compatible Edition) allows Aurora Replicas in the secondary cluster to forward write operations to the primary cluster in the global write Region. This way, you can treat the read-only replicas in all your Regions as if they were read/write capable. Using write forwarding, the request travels over the AWS network and not the public internet, reducing latency.

Amazon ElastiCache for Redis also can replicate data across Regions. For example, to store session data, you write to your global write Region and use Global Datastore to ensure that this data is available to be read from other Regions.

Read local/write partitioned pattern

For write-heavy workloads with users located around the world, your application may not be suited to incur the round trip to the global write Region with every write. Consider using a write partitioned pattern to mitigate this. With this pattern, each item or record is assigned a home Region. This can be done based on the Region it was first written to. Or it can be based on a partition key in the record (such as user ID) by pre-assigning a home Region for each value of this key. As shown in Figure 3, records for this user are assigned to the left AWS Region as their home Region. The goal is to try to map records to a home Region close to where most write requests will originate.

Read local/write partitioned pattern for multi-site active/active DR strategy

Figure 3. Read local/write partitioned pattern for multi-site active/active DR strategy

When the user in Figure 3 travels away from home, they will read local, but writes will be routed back to their home Region. Usually writes will not incur long round trips as they are expected to typically come from near the home Region. Since writes are accepted in all Regions (for records homed to that respective Region), DynamoDB global tables, which accept writes in all Regions, are a good choice here also.

Failover

With a multi-Region active/active strategy, if your workload cannot operate in a Region, failover will route traffic away from the impacted Region to healthy Region(s). You can accomplish this with Route 53 by updating the DNS records. Make sure you set TTL (time to live) on these records low enough so that DNS resolvers will reflect your changes quickly enough to meet your RTO targets. Alternatively, you can use AWS Global Accelerator for routing and failover. It does not rely on DNS. Global Accelerator gives you two static IP addresses. You then configure which Regions user traffic goes to based on traffic dials and weights you set.

If you’re using a write global pattern and the impacted Region is the global write Region, then a new Region needs to be promoted to be the new global write Region. If you’re using a write partitioned pattern, your workload must repartition so that the records homed in the impacted Region are assigned to one of the remaining Regions. Using write local, all Regions can accept writes. With no changes needed to the data storage layer, this pattern can have the fastest (near zero) RTO.

Conclusion

Consider the multi-site active/active strategy for your workload if you need DR with the quickest recovery time (lowest RTO) and least data loss (lowest RPO). Implementing it across Regions (multi-Region) is a good option if you are looking for the most separation and complete independence of your sites, or if you need to provide low latency access to the workload from users in globally diverse locations.

Also consider the trade-offs. Implementing and operating this strategy, particularly using multi-Region, can be more complicated and more expensive, than other DR strategies. When implementing multi-Region active/active in AWS, you have access to resources to choose the routing policy and the read/write pattern that is right for your workload.

Related information

Seth Eliot

Seth Eliot

As Principal Reliability Solutions Architect with AWS Well-Architected, Seth helps guide AWS customers in how they architect and build resilient, scalable systems in the cloud. He draws on 10 years of experience in multiple engineering roles across the consumer side of Amazon.com, where as Principal Solutions Architect he worked hands-on with engineers to optimize how they use AWS for the services that power Amazon.com. Previously he was Principal Engineer for Amazon Fresh and International Technologies. Seth joined Amazon in 2005 where soon after, he helped develop the technology that would become Prime Video. You can follow Seth on twitter @setheliot, or on LinkedIn at https://www.linkedin.com/in/setheliot/.