AWS Database Blog
Best practices for Amazon DynamoDB Global Tables – Part 2: Failover strategies
In Part 1 of this series, we covered the groundwork for Regional resilience with Amazon DynamoDB global tables: how replication works, what RPO and RTO mean for your workload, and the operational preparation that separates a controlled failover from a scramble. In Part 3, we show you how to validate your failover strategy using AWS Fault Injection Service (FIS).
Now it’s 2 AM. The page has arrived. A service is impaired in a Region, and your application must start serving traffic from a different Region.
In this post, we cover the two primary failover strategies for DynamoDB global tables, the tradeoffs between them, and the operational considerations to be aware of during and after a failover.
Failover strategies
The core question is straightforward: when a service is impaired in a Region, how does your application start using a different Region? There are two primary approaches, each with different RTO characteristics and operational complexity. Both rely on the monitoring signals we established in Part 1, specifically ReplicationLatency, SystemErrors, and synthetic canaries, as inputs to their failover decisions.
Strategy 1: Amazon Route 53 Application Recovery Controller (ARC)
For mission-critical workloads that require coordinated, multi-service failover, Amazon Route 53 Application Recovery Controller (ARC) provides the most robust solution. ARC is purpose-built for multi-Region recovery, and its architecture is designed so that the components you depend on during a failover are themselves highly available, even when an entire Region is impaired.
ARC’s multi-Region recovery capabilities are built around several core components that work together.
Region switch (recommended)
Region switch is ARC’s recommended capability for orchestrating large-scale, complex recovery tasks across multiple AWS accounts. Region switch is built around the concept of a plan containing workflows and execution blocks that run in parallel or in sequence to complete a recovery.
You can trigger a Region switch plan manually, or automate it by associating the plan with an Amazon CloudWatch alarm. For example, you might configure a composite alarm that monitors your application’s 5xx error rate, DynamoDB SystemErrors, ReplicationLatency, and SuccessfulRequestLatency across your primary Region. When the composite alarm enters the ALARM state, Region switch automatically executes the recovery plan, coordinating the failover of DynamoDB traffic, compute resources, and dependent services in the sequence you defined.
Region switch supports both active-passive (failover and failback) and active-active (shift-away and return) configurations, and provides dashboards for real-time visibility into the recovery process.
For most teams adopting ARC, Region switch is the right starting point. It provides the coordination and automation needed for multi-service failover while reducing the manual steps required during an incident.
Routing controls
Routing controls are straightforward on/off switches that you can use to redirect client traffic from one Regional replica to another. Each routing control is associated with an Amazon Route 53 health check, which is tied to a DNS failover record fronting your application in each Region. When you flip a routing control from On to Off, Route 53 marks the corresponding health check as unhealthy, and DNS failover redirects traffic to the healthy Region.
The key design decision behind routing controls is that they operate on an extremely reliable data plane hosted across five Regional endpoints in a dedicated cluster. This means you can update routing control states even if the Region you’re failing away from is completely unavailable. AWS recommends using the data plane API to update routing control states during an actual incident, and choosing one of the five cluster endpoints at random with retry logic across all five.
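The endpoint selection described above can be sketched as follows. This is a minimal illustration, not an official client: the function name and the `attempt` callback are hypothetical. With boto3, `attempt` would typically build a `route53-recovery-cluster` client for the given endpoint and call `update_routing_control_state`.

```python
import random
from typing import Callable, Optional, Sequence


def update_with_endpoint_failover(
    endpoints: Sequence[str],
    attempt: Callable[[str], None],
    rng: Optional[random.Random] = None,
) -> str:
    """Try each cluster endpoint in random order until one accepts the update.

    `attempt` is a hypothetical callback that performs the routing control
    update against one endpoint and raises on failure.
    Returns the endpoint that succeeded; raises RuntimeError if all fail.
    """
    order = list(endpoints)
    (rng or random).shuffle(order)  # random start avoids always hitting the same endpoint
    errors = []
    for endpoint in order:
        try:
            attempt(endpoint)
            return endpoint
        except Exception as exc:  # real code should catch specific client errors
            errors.append(f"{endpoint}: {exc}")
    raise RuntimeError("all cluster endpoints failed: " + "; ".join(errors))
```

Keeping the endpoint list and routing control ARNs in local configuration means this loop does not depend on the ARC control plane during an incident.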
Safety rules
Safety rules are guardrails that prevent dangerous routing control state changes during high-stress incidents. For example, you can define a rule that prevents all Regions from being disabled simultaneously, or a rule that requires at least one Region to remain active at all times. During a 2 AM incident, when your team is under pressure and making rapid decisions, safety rules act as a backstop against operator error. If a safety rule blocks an update that you’ve determined is correct, you can override it, but the override is explicit and auditable.
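To make the guardrail concrete, here is a small local model of an "at least one Region stays on" rule. ARC evaluates real safety rules on the service side. The function names and the dictionary of control states below are assumptions made for illustration only.

```python
from typing import Dict


def satisfies_at_least(states: Dict[str, bool], threshold: int) -> bool:
    """True if at least `threshold` routing controls are On (True)."""
    return sum(states.values()) >= threshold


def apply_change(
    current: Dict[str, bool], control: str, new_state: bool, threshold: int = 1
) -> Dict[str, bool]:
    """Return the proposed states, or raise if the change would break the rule."""
    proposed = {**current, control: new_state}
    if not satisfies_at_least(proposed, threshold):
        raise ValueError(
            f"change rejected: fewer than {threshold} routing control(s) would be On"
        )
    return proposed
```

With `{"us-east-1": True, "us-west-2": False}`, turning off `us-east-1` is rejected, while turning on `us-west-2` first and then turning off `us-east-1` is allowed.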
Readiness checks
Readiness checks continuously monitor your application’s resources across Regions, auditing things like AWS resource quotas, capacity settings, and network routing policies. Their purpose is to verify, on an ongoing basis, that your standby replica matches your production replica in configuration and capacity. If your failover Region’s DynamoDB table has lower provisioned capacity than production, or if an Amazon Virtual Private Cloud (Amazon VPC) endpoint is missing, readiness checks surface that drift before you need to fail over. Note that readiness checks are designed for ongoing monitoring, not as a trigger for failover during an active incident.
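The kind of drift a readiness check looks for can be illustrated with a simple comparison. This sketch is not how ARC implements readiness checks. It assumes you have already collected numeric settings for each Region into plain dictionaries, and the function name is hypothetical.

```python
from typing import Dict, Optional, Tuple


def find_capacity_drift(
    primary: Dict[str, int], standby: Dict[str, int]
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return settings where the standby value is missing or lower than primary.

    The result maps each drifting setting to (primary_value, standby_value).
    """
    drift = {}
    for setting, primary_value in primary.items():
        standby_value = standby.get(setting)
        if standby_value is None or standby_value < primary_value:
            drift[setting] = (primary_value, standby_value)
    return drift
```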
RTO and tradeoffs
The RTO with ARC is seconds to low minutes, depending on DNS TTL propagation. Routing control state changes take effect within seconds, and Route 53 immediately marks the corresponding health checks as healthy or unhealthy. However, clients still must pick up the DNS change, so the effective RTO depends on your TTL settings (60–120 seconds is recommended). The key advantage of ARC over other failover approaches is the reliability of the failover mechanism itself: the extremely reliable data plane means you can execute the failover even when an entire Region is impaired, and safety rules prevent operator error under pressure.
ARC requires upfront investment in modeling your application (defining recovery groups, cells, and readiness checks), configuring routing controls and safety rules, and maintaining long-lived IAM credentials specifically for disaster recovery tasks. AWS recommends keeping these credentials in an on-premises physical safe or virtual vault, separate from your normal federated access, so they’re accessible even if your identity provider is impaired. This investment pays off the first time you must fail over under pressure and the guardrails prevent an operator mistake.
Operational recommended practices
A few operational recommended practices from the ARC documentation are worth highlighting:
- Set DNS TTLs to 60 or 120 seconds for records involved in failover.
- Bookmark or hard-code your five Regional cluster endpoints and routing control ARNs so you can access them even if the ARC control plane is unavailable.
- Limit the time clients stay connected to your endpoints (the default Application Load Balancer keepalive of 3,600 seconds is too long for fast recovery; consider lowering it to 300 seconds).
- Test failover regularly with ARC to verify that your structures are aligned with the correct resources in your stack.
For teams with mature operational practices and mission-critical workloads with strict RTO SLAs, ARC is the right tool.
Strategy 2: DNS-based failover with Amazon Route 53
For teams looking for a more straightforward starting point, Route 53 health checks combined with a failover routing policy provide a way to route traffic away from an impaired Region.
You configure Route 53 health checks to monitor your regional application endpoints, such as Amazon API Gateway, Elastic Load Balancing (ELB), or a custom health check endpoint. Then you use a failover routing policy so that Route 53 directs traffic to the primary Region under normal conditions and automatically routes to the secondary Region when the health check fails. Because both Regions have a global table replica, the secondary Region can immediately serve reads and accept writes.
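As a sketch of the failover routing policy, the following builds the `ChangeBatch` you could pass to the Route 53 `change_resource_record_sets` API. The record names, targets, `SetIdentifier` values, and the helper function itself are placeholders chosen for illustration. CNAME records are assumed here, and alias records would need a different structure.

```python
from typing import Optional


def failover_change_batch(
    name: str,
    primary_target: str,
    secondary_target: str,
    primary_health_check_id: str,
    ttl: int = 60,
) -> dict:
    """Build a ChangeBatch with PRIMARY and SECONDARY failover CNAME records."""

    def record(role: str, target: str, health_check_id: Optional[str] = None) -> dict:
        rrset = {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": f"{role.lower()}-region",  # placeholder identifier
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
        }
        if health_check_id:
            rrset["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Changes": [
            record("PRIMARY", primary_target, primary_health_check_id),
            record("SECONDARY", secondary_target),
        ]
    }
```

Only the primary record carries the health check in this sketch, so traffic moves to the secondary record when that check reports unhealthy.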
To make this approach respond to real application health signals rather than simple endpoint reachability, you can use a Route 53 health check that monitors a CloudWatch alarm. For example, you might create a CloudWatch alarm that fires when your API Gateway’s 5XXError rate exceeds 5% for three consecutive 1-minute evaluation periods, or when your DynamoDB SystemErrors metric rises above zero for a sustained period of time. You then associate that alarm with a Route 53 health check using the CLOUDWATCH_METRIC type. When the alarm enters the ALARM state, Route 53 marks the health check as unhealthy and triggers the DNS failover automatically.
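The alarm and health check in this example could be parameterized roughly as follows. The helper names and the alarm and API names are hypothetical. The dictionaries follow the shape of the CloudWatch `put_metric_alarm` parameters and the Route 53 `HealthCheckConfig`, and assume a REST API that publishes the `5XXError` metric with an `ApiName` dimension.

```python
def api_5xx_alarm_params(alarm_name: str, api_name: str) -> dict:
    """Parameters for an alarm on a 5XX error rate above 5% for 3 x 1 minute."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/ApiGateway",
        "MetricName": "5XXError",
        "Dimensions": [{"Name": "ApiName", "Value": api_name}],
        "Statistic": "Average",  # the average of 5XXError is the error rate
        "Period": 60,  # 1-minute evaluation periods
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,  # all three periods must breach
        "Threshold": 0.05,  # 5%
        "ComparisonOperator": "GreaterThanThreshold",
    }


def alarm_health_check_config(alarm_name: str, alarm_region: str) -> dict:
    """HealthCheckConfig for a Route 53 health check tied to a CloudWatch alarm."""
    return {
        "Type": "CLOUDWATCH_METRIC",
        "AlarmIdentifier": {"Region": alarm_region, "Name": alarm_name},
        # Assumption: keep the last known status if the metric stops reporting.
        "InsufficientDataHealthStatus": "LastKnownStatus",
    }
```

The choice of `LastKnownStatus` is a judgment call. Depending on your workload, treating missing data as unhealthy might be the safer default.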
This gives you automated failover driven by the health signals that actually matter to your application, not only whether a TCP connection succeeds. You can combine multiple CloudWatch alarms into a composite alarm to trigger failover only when several conditions are met simultaneously, reducing the risk of false positives. For example, a composite alarm that requires elevated 5xx error rates, increased DynamoDB SystemErrors, and sustained ReplicationLatency spikes before triggering failover is more reliable than any single signal alone.
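A composite alarm that requires all of its child alarms to be in ALARM uses a rule expression built from `ALARM("name")` terms joined with `AND`. The helper below is a minimal sketch that generates such a rule. The child alarm names are placeholders, and the resulting string would be passed as `AlarmRule` to the CloudWatch `put_composite_alarm` API.

```python
from typing import Sequence


def all_in_alarm_rule(alarm_names: Sequence[str]) -> str:
    """Build a composite alarm rule that fires only when every child is in ALARM."""
    if not alarm_names:
        raise ValueError("at least one child alarm name is required")
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


# Placeholder child alarm names for the three signals discussed above.
rule = all_in_alarm_rule(
    ["api-5xx-rate", "dynamodb-system-errors", "dynamodb-replication-latency"]
)
```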
RTO considerations
The RTO for this approach depends on several factors. CloudWatch alarms themselves might take minutes to evaluate and fire, depending on your evaluation periods and datapoints-to-alarm configuration. Once the alarm fires, DNS TTL propagation adds additional delay. Even with low TTLs (60 seconds is common), client-side DNS caching and resolver caching can extend the actual failover time. Some clients and resolvers might not honor TTLs at all, which means a portion of your traffic might continue hitting the impaired Region longer than expected. In practice, expect the end-to-end failover time to be several minutes when accounting for alarm evaluation, DNS propagation, and client caching.
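A rough way to reason about this is to add up the delays. The function below is a back-of-the-envelope estimate under simplifying assumptions: it ignores health check propagation inside Route 53 and treats client caching as a single extra term that you supply yourself.

```python
def estimate_failover_seconds(
    period_s: int,
    datapoints_to_alarm: int,
    dns_ttl_s: int,
    client_cache_s: int = 0,
) -> int:
    """Estimate end-to-end failover time as alarm delay + DNS TTL + client caching."""
    alarm_delay_s = period_s * datapoints_to_alarm  # time for the alarm to fire
    return alarm_delay_s + dns_ttl_s + client_cache_s


# Three 1-minute datapoints plus a 60-second TTL: 180 + 60 = 240 seconds.
estimate = estimate_failover_seconds(period_s=60, datapoints_to_alarm=3, dns_ttl_s=60)
```

With these inputs the estimate is about four minutes before any client-side caching is counted, which is consistent with planning for several minutes of degraded service.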
This approach works well for straightforward architectures where several minutes of degraded service is acceptable, and for teams that want a managed, infrastructure-level solution with minimal application code changes. Make sure your health checks validate actual application functionality, not only endpoint reachability, and set DNS TTLs as low as practical.
Comparing the approaches
| Criterion | Route 53 ARC | DNS-based (Route 53) |
| RTO | Seconds to low minutes | Several minutes |
| Complexity | High | Low |
| Application changes | Minimal (infrastructure) | None/minimal |
| Coordination | Multi-service | Single service |
| Best for | Mission-critical workloads | Straightforward architectures |
For mission-critical workloads, we recommend starting with ARC, using Region switch as the primary failover mechanism. For teams earlier in their resilience journey, DNS-based failover with Route 53 is a practical starting point that you can evolve toward ARC as your requirements mature.
What to expect during and after failover
Even with a solid failover strategy, there are operational realities that you must be prepared for.
Reading stale data after failover
Immediately after failing over with multi-Region eventual consistency (MREC), your application might read items from the new Region that haven’t received the latest writes from the failed Region. This is an inherent consequence of asynchronous replication: writes that were still in flight when the disruption began might not have reached the new Region yet, so your application must handle records that don’t reflect the most recent state.
Design your operations to be idempotent, include version attributes on items, and avoid assuming strong consistency across Regions unless you’re using multi-Region strong consistency (MRSC). For example, if your application processes an order and then immediately reads it back from the failover Region, the read might return stale data. A version check or conditional write can protect against acting on outdated state.
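One way to express a version check is a conditional update that succeeds only if the stored version matches the version the application last read. The sketch below builds parameters in the shape expected by the DynamoDB `UpdateItem` API. The attribute names `status` and `version`, the table name, and the helper function are assumptions for illustration.

```python
def versioned_update_params(
    table_name: str, key: dict, new_status: str, expected_version: int
) -> dict:
    """Build UpdateItem parameters that apply only if the version is unchanged."""
    return {
        "TableName": table_name,
        "Key": key,
        # Set the new status and bump the version in the same write.
        "UpdateExpression": "SET #s = :status, #v = :next",
        # Reject the write if another writer has already changed the item.
        "ConditionExpression": "#v = :expected",
        "ExpressionAttributeNames": {"#s": "status", "#v": "version"},
        "ExpressionAttributeValues": {
            ":status": {"S": new_status},
            ":expected": {"N": str(expected_version)},
            ":next": {"N": str(expected_version + 1)},
        },
    }
```

If the condition fails, DynamoDB rejects the write with a `ConditionalCheckFailedException`, and the application can re-read the item and decide whether to retry.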
Conflict resolution surprises
Last-writer-wins works well in most cases, but it can produce unexpected results with concurrent writes. If two Regions update the same item at nearly the same time, one write silently “loses”. For most workloads this is fine because the writes are seconds apart and the latest value is the correct one. But for workloads where every write matters, such as financial ledgers or inventory counters, you must be more deliberate. Consider using conditional writes, designing your data model to avoid cross-Region conflicts on the same item, or using MRSC to remove the conflict window entirely.
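The inventory counter case can be shown with a toy model. This is a simulation of the general last-writer-wins idea, not of DynamoDB internals: the `Write` class and the explicit timestamps are assumptions used only to show how one concurrent update is lost.

```python
from dataclasses import dataclass


@dataclass
class Write:
    region: str
    timestamp_ms: int
    item: dict


def last_writer_wins(a: Write, b: Write) -> Write:
    """Keep the write with the later timestamp and discard the other."""
    return a if a.timestamp_ms >= b.timestamp_ms else b


# Both Regions read stock = 10 and each sells one unit at nearly the same time.
east = Write("us-east-1", 1_000, {"sku": "A1", "stock": 9})
west = Write("us-west-2", 1_005, {"sku": "A1", "stock": 9})
resolved = last_writer_wins(east, west)
# Two units were sold, but the surviving item records only one decrement.
```

The reconciled item is internally valid, which is why the loss is silent: nothing fails, but the counter is off by one.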
High replication latency after failover
After a Region recovers, you might see elevated replication latency caused by a backlog of writes that accumulated during the disruption. The ReplicationLatency alarms and synthetic canaries from Part 1 will help you track this. Monitor the metric and allow time for the backlog to drain. If latency exceeds 5 minutes and isn’t trending downward, contact AWS Support.
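The escalation rule in the last sentence can be written down as a simple check over recent `ReplicationLatency` samples, which CloudWatch reports in milliseconds. This is a sketch: the function name is hypothetical, and comparing the newest sample with the oldest is only one simple way to judge the trend.

```python
from typing import Sequence

FIVE_MINUTES_MS = 5 * 60 * 1000


def should_escalate(
    latency_ms_samples: Sequence[float], threshold_ms: float = FIVE_MINUTES_MS
) -> bool:
    """Return True if latency is above the threshold and not trending downward.

    Samples are ordered oldest first. With fewer than two samples there is no
    trend to judge, so the function returns False.
    """
    if len(latency_ms_samples) < 2:
        return False
    latest = latency_ms_samples[-1]
    trending_down = latest < latency_ms_samples[0]
    return latest > threshold_ms and not trending_down
```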
Throttling in the target Region
If you encounter ThrottlingException errors in the target Region, the cause is almost always insufficient read or write capacity for the redirected traffic. For on-demand tables, these errors should be transient as capacity auto-scales. For provisioned tables, increase the provisioned capacity immediately using the DynamoDB auto scaling APIs as described in Part 1. This is why preparing capacity in advance matters so much.
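For provisioned tables managed by auto scaling, one way to raise capacity quickly is to raise the minimum capacity on the scalable targets. The sketch below builds parameters in the shape of the Application Auto Scaling `register_scalable_target` API. The helper name and the capacity numbers you pass in are assumptions, and each dictionary would be sent to the API in the target Region.

```python
from typing import List


def raise_capacity_floor_params(
    table_name: str, min_read: int, max_read: int, min_write: int, max_write: int
) -> List[dict]:
    """Build scalable-target parameters that raise a table's capacity floor."""

    def target(dimension: str, minimum: int, maximum: int) -> dict:
        if minimum > maximum:
            raise ValueError(f"{dimension}: minimum exceeds maximum")
        return {
            "ServiceNamespace": "dynamodb",
            "ResourceId": f"table/{table_name}",
            "ScalableDimension": f"dynamodb:table:{dimension}",
            "MinCapacity": minimum,
            "MaxCapacity": maximum,
        }

    return [
        target("ReadCapacityUnits", min_read, max_read),
        target("WriteCapacityUnits", min_write, max_write),
    ]
```

Global secondary indexes have their own scalable dimensions and would need the same treatment if they also serve the redirected traffic.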
Connectivity failures
If your application cannot connect to the target Region, check your IAM policies to confirm they allow DynamoDB access in the target Region. If you’re using Amazon VPC endpoints for private connectivity, verify that the endpoints are configured in the failover Region as well. Security group and network ACL rules are another common source of connectivity failures that only surface during an actual failover.
Data inconsistency
If you observe data inconsistency after failover, this is expected behavior when using the eventual consistency model. In-flight writes from the failed Region might not have been replicated before the disruption. Implement application-level consistency checks, use conditional writes with version numbers, and design your application to converge to a correct state rather than assuming immediate consistency.
Conclusion
Failing over with DynamoDB global tables isn’t only about having replicas in multiple Regions. It’s about choosing the right failover strategy for your workload, understanding the tradeoffs, and building the operational confidence to execute when it matters.
Route 53 ARC provides the most robust, coordinated failover for mission-critical workloads, with Region switch as the recommended mechanism for orchestrating recovery across services. The extremely reliable data plane makes sure the failover mechanism itself remains available even during a Regional impairment.
DNS-based failover with Route 53 offers a more straightforward starting point with minimal application changes, though end-to-end failover time will be several minutes when accounting for alarm evaluation and DNS propagation. For workloads that require zero data loss, MRSC provides an RPO of zero.
Whichever strategy you choose, the preparation matters as much as the mechanism. If you haven’t already, start with Part 1 of this series to make sure your infrastructure is ready. Then validate your failover end to end using the FIS experiment walkthrough in Part 3.
Get started
Your next step depends on where you are today. If you have no failover mechanism, start with DNS-based failover using Route 53: a health check backed by a composite CloudWatch alarm monitoring SystemErrors and ReplicationLatency, paired with a failover routing policy. If you already have DNS-based failover, move to Route 53 Application Recovery Controller and configure a Region switch plan.
Either way, don’t wait for an incident to find out whether your failover works. Schedule a GameDay using AWS Fault Injection Service to simulate a Regional disruption and run your failover end to end. For a step-by-step guide on running FIS experiments against DynamoDB global tables, including both MRSC and MREC configurations, see Part 3 of this series.
If you haven’t set up the monitoring and capacity preparation covered in Part 1, start there first. None of this works without it.
For deeper guidance, see Using DynamoDB global tables and the Reliability Pillar of the AWS Well-Architected Framework.