AWS Database Blog
Best practices for Amazon DynamoDB Global Tables – Part 1: Operational readiness
Imagine you’re running a global ecommerce platform. You have 2 million active users spread across North America, Europe, and Asia-Pacific. It’s Black Friday. Orders are flowing at 5,000 requests per second. Your customers expect a seamless experience no matter where they are or what’s happening behind the scenes.
Running a business at a global scale means you must prepare for the unexpected. An event can cause your workload in an AWS Region to run in a degraded state, so your data layer must be ready when it does. For many organizations, this is why they adopt Amazon DynamoDB global tables.
DynamoDB global tables provide fully managed, active-active replication across multiple AWS Regions and accounts. But having multi-Region replication is only half the story. The other half is knowing that your infrastructure is ready before things go wrong.
This is Part 1 of a series on best practices for DynamoDB global tables. Unless otherwise noted, this post discusses global tables with multi-Region eventual consistency (MREC), the default replication mode. We call out multi-Region strong consistency (MRSC) where the guidance differs. In this post, we focus on preparation: understanding how replication works, what your resilience posture looks like, and the operational groundwork that separates a controlled failover from a scramble. In Part 2, we cover failover strategies and what to do when that 2 AM page arrives.
How global tables replication works
Before we talk about preparation, let’s establish a shared understanding of how global tables replicate data. DynamoDB global tables use an active-active architecture. Every replica table in every Region can accept both reads and writes. When you write an item to one Region, DynamoDB replicates that change to all other replica Regions. How that replication works depends on which consistency mode you use.
MREC replication
With the default multi-Region eventual consistency (MREC) model, writes are replicated asynchronously. There are a few characteristics of this model that directly influence how you plan for resilience. First, MREC uses a last-writer-wins conflict resolution strategy based on item-level timestamps. If the same item is updated in two Regions at roughly the same time, the write with the latest timestamp takes precedence.
Second, under normal conditions, changes typically replicate across Regions within a second or less, though this can vary based on item size, write volume, and the physical distance between replicas. Third, reads from a replica Region are eventually consistent with respect to writes made in other Regions. A read in eu-west-1 might not immediately reflect a write that just occurred in us-east-1. This model gives you low-latency local reads and writes in each Region, but the eventual consistency model has direct implications for how you plan your resilience strategy, particularly around data loss tolerance.
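To make the last-writer-wins behavior concrete, here is a minimal, purely illustrative model in Python. It is not how DynamoDB is implemented internally; the `ItemVersion` class, the `resolve_last_writer_wins` function, and the tie-break on Region name are assumptions made only so the sketch is self-contained and deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemVersion:
    """One version of an item as written in a given Region (illustrative model)."""
    value: str
    timestamp_ms: int  # write timestamp used for conflict resolution
    region: str


def resolve_last_writer_wins(a: ItemVersion, b: ItemVersion) -> ItemVersion:
    """Return the version with the later timestamp.

    Breaking ties on Region name is an assumption of this sketch; the post
    does not describe how identical timestamps are handled.
    """
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


# The same cart item is updated in two Regions 40 ms apart.
us = ItemVersion(value="qty=1", timestamp_ms=1_700_000_000_000, region="us-east-1")
eu = ItemVersion(value="qty=3", timestamp_ms=1_700_000_000_040, region="eu-west-1")

winner = resolve_last_writer_wins(us, eu)
print(winner.region, winner.value)  # the later write (eu-west-1) survives
```

The practical consequence is that the earlier write is silently overwritten, so workloads that update the same item from multiple Regions should be designed with that outcome in mind.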
MRSC replication
With multi-Region strong consistency (MRSC), writes are replicated synchronously. Item changes are synchronously replicated to at least one other Region before the write operation returns a successful response.
Unlike MREC’s last-writer-wins approach, writes in MRSC are evaluated against the latest write from any Region, and concurrent writes to the same item from different Regions might result in conflicts. MRSC supports strongly consistent reads from any active Region (by setting ConsistentRead=true), giving you the confidence that a read always reflects the most recent committed write. Eventually consistent reads remain the default.
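As a small sketch of what a strongly consistent read looks like in code, the following helper wraps the `GetItem` call with `ConsistentRead=True`. The function name `read_latest`, the table name, and the key shape in the usage note are hypothetical; the client is passed in so that the snippet doesn’t assume a particular Region or credential setup.

```python
from typing import Any, Optional


def read_latest(dynamodb_client: Any, table_name: str, key: dict) -> Optional[dict]:
    """Read an item with a strongly consistent read.

    dynamodb_client is expected to be a boto3 DynamoDB client (or anything
    exposing a compatible get_item method). Returns None if the item is absent.
    """
    response = dynamodb_client.get_item(
        TableName=table_name,
        Key=key,
        ConsistentRead=True,  # the default is False (eventually consistent)
    )
    return response.get("Item")
```

With boto3 installed, a call might look like `read_latest(boto3.client("dynamodb", region_name="us-east-2"), "orders", {"order_id": {"S": "o-123"}})`, where the Region, table, and key are placeholders.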
Understanding RPO and RTO with global tables
Two metrics define your resilience posture during a failover.
- Recovery Point Objective (RPO) represents how much data you can afford to lose, expressed as a time window. An RPO of 1 second means that you can tolerate losing, at most, the last 1 second of writes.
- Recovery Time Objective (RTO) represents how quickly your application must recover. An RTO of 60 seconds means that users should be back to normal within a minute of a failure.
Multi-Region eventual consistency (MREC)
With MREC, writes are replicated to other Regions asynchronously after they’re accepted in the source Region. If an impairment occurs before those writes reach the failover Region, some recent writes might be missing after failover. From an RPO perspective, MREC doesn’t provide zero RPO because acknowledged writes might still be in transit when the failure occurs. The ReplicationLatency Amazon CloudWatch metric helps monitor replication health, but it should be treated as a directional signal rather than an exact measure of failover data loss. For many workloads, this is an acceptable tradeoff.
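To get a feel for the size of that tradeoff, a back-of-the-envelope estimate multiplies the write rate by the replication lag at the moment of failure. The sketch below does that arithmetic. The function name is hypothetical, and the 5,000 writes per second figure is an assumption that borrows the request rate from the opening scenario; treat the result as a rough order of magnitude, not a measurement.

```python
def estimate_writes_at_risk(writes_per_second: float, replication_lag_ms: float) -> float:
    """Rough number of acknowledged writes still in transit at a given lag."""
    return writes_per_second * (replication_lag_ms / 1000.0)


# Assumed steady write rate of 5,000 writes per second at several lag values.
for lag_ms in (500, 1_000, 5_000):
    at_risk = estimate_writes_at_risk(5_000, lag_ms)
    print(f"lag={lag_ms} ms -> roughly {at_risk:,.0f} writes in transit")
```

Running the same arithmetic with your own peak write rate and your observed worst-case lag gives a more honest RPO discussion than quoting the typical sub-second figure.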
Multi-Region strong consistency (MRSC)
Item changes in an MRSC global table replica are synchronously replicated to at least one other Region before the write operation returns a successful response. This means MRSC provides zero RPO. No committed write is ever lost during a failover. MRSC supports two configurations: three active Regions, or two active Regions with a witness Region that participates in replication but doesn’t serve reads or writes.
The trade-off is latency. Synchronous cross-Region replication adds round-trip time to every write. For latency-sensitive workloads that can tolerate a small RPO window, MREC remains the right choice. We recommend MRSC for workloads where zero data loss is a non-negotiable requirement: financial transactions, inventory systems, and regulatory compliance scenarios.
Regardless of whether you use MREC or MRSC, your RTO depends entirely on which failover strategy you choose. We cover the three primary approaches and their RTO characteristics in Part 2 of this series.
Monitoring and observability
You can’t respond to a disruption effectively if you don’t know that something is wrong. Monitoring is the foundation of preparation, and it deserves attention well before any incident occurs.
ReplicationLatency
The ReplicationLatency CloudWatch metric is available for MREC global tables and tracks the time (in milliseconds) for items to replicate from one Region to another. It’s your primary indicator of replication health and your best proxy for RPO under the eventual consistency model.
The ReplicationLatency metric is emitted per Region pair. If your global table has replicas in us-east-1, us-west-2, and eu-west-1, then CloudWatch in us-east-1 will show two separate ReplicationLatency metrics: one for replication to us-west-2 and one for replication to eu-west-1. Set up alarms on each pair independently, because latency varies significantly based on the physical distance between Regions.
We recommend setting up alarms on this metric with two thresholds: a warning at sustained latency above 3,000 ms for 5 minutes, and a critical alarm at sustained latency above 5,000 ms for 3 minutes. These are starting points; tune them based on your workload’s baseline. The warning gives your team time to investigate before the situation becomes urgent. Keep in mind that elevated replication latency isn’t always caused by a Regional impairment. Sustained throttling due to under-provisioned capacity can also increase replication lag, so investigate your table’s throttling metrics alongside ReplicationLatency before concluding there is a Regional issue.
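One way to express these thresholds in code is to generate the arguments for CloudWatch’s `PutMetricAlarm` API, one warning and one critical alarm per receiving Region. The sketch below assumes 1-minute periods, the `Average` statistic, and a hypothetical alarm naming scheme; none of those choices come from the post, so adjust them to your conventions.

```python
from typing import Any, Dict, List

# Starting thresholds from the text: (severity, threshold in ms, minutes sustained).
THRESHOLDS = [("warning", 3_000, 5), ("critical", 5_000, 3)]


def build_replication_alarms(table_name: str, receiving_regions: List[str]) -> List[Dict[str, Any]]:
    """Build put_metric_alarm arguments for each (receiving Region, severity) pair."""
    alarms = []
    for region in receiving_regions:
        for severity, threshold_ms, minutes in THRESHOLDS:
            alarms.append({
                "AlarmName": f"{table_name}-replication-latency-{region}-{severity}",
                "Namespace": "AWS/DynamoDB",
                "MetricName": "ReplicationLatency",
                "Dimensions": [
                    {"Name": "TableName", "Value": table_name},
                    {"Name": "ReceivingRegion", "Value": region},
                ],
                "Statistic": "Average",       # assumption; a percentile may suit you better
                "Period": 60,                 # one datapoint per minute
                "EvaluationPeriods": minutes,  # all periods must breach to alarm
                "Threshold": float(threshold_ms),
                "ComparisonOperator": "GreaterThanThreshold",
            })
    return alarms


def create_alarms(cloudwatch_client: Any, alarms: List[Dict[str, Any]]) -> None:
    """Create or update each alarm using a boto3 CloudWatch client."""
    for alarm in alarms:
        cloudwatch_client.put_metric_alarm(**alarm)
```

For the three-Region example above, `build_replication_alarms("orders", ["us-west-2", "eu-west-1"])` yields four alarm definitions to create with a CloudWatch client in us-east-1; the table name is a placeholder.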
Keep in mind that a replica can show ACTIVE status even during Regional disruptions. Replica status alone isn’t sufficient. Monitoring replication latency gives you the real-time signal that you need.
This metric isn’t available for MRSC. For MRSC global tables, monitor the latency of your strongly consistent read and write API calls to assess Region health. Elevated latencies or timeouts on these operations indicate a potential Regional impairment.
SystemErrors
The SystemErrors CloudWatch metric tracks the number of requests that result in an HTTP 500 error from DynamoDB. While occasional system errors can occur during normal operation, a sustained increase is a strong indicator of degradation.
The right alarm threshold depends on your throughput. A table handling 1 million requests per second will naturally see more transient system errors than one handling 5 requests per second, so absolute error counts aren’t meaningful on their own. Instead, we recommend alarming on the SystemErrors rate as a percentage of total requests. A warning at a sustained error rate above 0.5% over a 5-minute period, and a critical alarm above 1% over 3 minutes, is a reasonable starting point, but you should tune these thresholds based on your workload’s baseline. The threshold should be lenient because transient system errors aren’t uncommon and don’t necessarily indicate a Regional issue. What you’re looking for is a pattern of elevated errors that, combined with other signals like rising ReplicationLatency, paints a picture of degradation worth acting on.
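A rate-based alarm can be sketched with CloudWatch metric math. The example below builds `PutMetricAlarm` arguments for a single operation and rests on several assumptions that are not from the post: it approximates total requests as `SystemErrors` plus the `SampleCount` of `SuccessfulRequestLatency` (which ignores throttled and client-error requests), it alarms per operation, and the function and alarm names are hypothetical.

```python
from typing import Any, Dict


def _metric(metric_id: str, name: str, stat: str, table_name: str, operation: str) -> Dict[str, Any]:
    """One input metric for the math expression (not alarmed on directly)."""
    return {
        "Id": metric_id,
        "ReturnData": False,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/DynamoDB",
                "MetricName": name,
                "Dimensions": [
                    {"Name": "TableName", "Value": table_name},
                    {"Name": "Operation", "Value": operation},
                ],
            },
            "Period": 60,
            "Stat": stat,
        },
    }


def build_error_rate_alarm(table_name: str, operation: str, threshold_percent: float,
                           minutes: int, severity: str) -> Dict[str, Any]:
    """Build put_metric_alarm arguments for a SystemErrors percentage alarm."""
    return {
        "AlarmName": f"{table_name}-{operation}-system-error-rate-{severity}",
        "Metrics": [
            _metric("errors", "SystemErrors", "Sum", table_name, operation),
            _metric("ok", "SuccessfulRequestLatency", "SampleCount", table_name, operation),
            {
                "Id": "rate",
                # FILL treats minutes with no reported errors as zero errors.
                "Expression": "100 * FILL(errors, 0) / (FILL(errors, 0) + ok)",
                "Label": "SystemErrors rate (%)",
                "ReturnData": True,
            },
        ],
        "EvaluationPeriods": minutes,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

Passing `build_error_rate_alarm("orders", "GetItem", 0.5, 5, "warning")` to a CloudWatch client’s `put_metric_alarm(**...)` would create the warning alarm. How missing data should be treated (for example, when no successful requests are recorded at all) is left at the CloudWatch default here and is worth deciding deliberately.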
For MREC, when combined with ReplicationLatency alarms, SystemErrors gives you a second, independent signal for Regional health. For MRSC, SystemErrors and strongly consistent read/write latency serve as your primary health indicators.
This is particularly useful when building composite alarms that drive automated failover decisions, a topic we cover in Part 2.
Synthetic canaries
For the most proactive monitoring, deploy synthetic canaries that continuously validate cross-Region replication. These canaries write a known item to the source Region and then poll the target Region until the item appears, measuring the actual end-to-end replication time from your application’s perspective. The following Python script demonstrates a basic replication canary that writes an item, polls for it in the target Region, and publishes the measured lag as a custom CloudWatch metric:
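The script itself is not reproduced in this copy of the post, so what follows is a minimal sketch of such a canary rather than the original. It assumes a table whose partition key is a string attribute named `pk`, a TTL attribute named `expires_at`, and a custom metric namespace and name; all of these, along with the function names, are hypothetical. The clients are boto3 DynamoDB and CloudWatch clients passed in by the caller.

```python
import time
import uuid
from typing import Any, Optional


def measure_replication_lag_ms(source_client: Any, target_client: Any, table_name: str,
                               timeout_s: float = 30.0, poll_interval_s: float = 0.05,
                               ttl_s: int = 3600) -> Optional[float]:
    """Write a canary item in the source Region and time its arrival in the target.

    Returns the observed lag in milliseconds (including the write call and
    polling overhead), or None if the item did not appear within timeout_s.
    """
    key = {"pk": {"S": f"canary-{uuid.uuid4()}"}}  # assumed key schema
    item = dict(key)
    item["expires_at"] = {"N": str(int(time.time()) + ttl_s)}  # assumed TTL attribute

    start = time.monotonic()
    source_client.put_item(TableName=table_name, Item=item)

    deadline = start + timeout_s
    while time.monotonic() < deadline:
        response = target_client.get_item(TableName=table_name, Key=key, ConsistentRead=True)
        if "Item" in response:
            return (time.monotonic() - start) * 1000.0
        time.sleep(poll_interval_s)
    return None


def publish_lag(cloudwatch_client: Any, table_name: str, source_region: str,
                target_region: str, lag_ms: float) -> None:
    """Publish the measured lag as a custom CloudWatch metric."""
    cloudwatch_client.put_metric_data(
        Namespace="Custom/GlobalTableCanary",  # hypothetical namespace
        MetricData=[{
            "MetricName": "ObservedReplicationLag",  # hypothetical metric name
            "Dimensions": [
                {"Name": "TableName", "Value": table_name},
                {"Name": "SourceRegion", "Value": source_region},
                {"Name": "TargetRegion", "Value": target_region},
            ],
            "Value": lag_ms,
            "Unit": "Milliseconds",
        }],
    )
```

A scheduler of your choice would call `measure_replication_lag_ms` on an interval and publish the result. A `None` return (the item never arrived in time) is itself a signal and deserves its own handling, such as publishing the timeout value or a separate failure metric.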
This gives you a near-real-time, application-level view of replication health, independent of CloudWatch metrics. It’s the kind of signal that tells you something is wrong before the dashboards do. To prevent canary items from accumulating over time, enable Time to Live (TTL) on the table and include a TTL attribute on each canary item.
An important consideration is where your monitoring infrastructure runs. If your canaries, alarms, and failover decision logic all run in your primary Region, they won’t be available when that Region is impaired, exactly when you need them most. Deploy your synthetic canaries and composite alarms from a separate Region so that you can detect and respond to an impairment without depending on the impaired Region itself.
Preparing your infrastructure
Having monitoring in place tells you when something is wrong. But the teams that recover fastest are the ones who prepared their infrastructure before the incident, not during it.
Map your workload dependencies
DynamoDB is rarely the only service that your workload depends on. Before you can confidently fail over, you must understand the full set of dependencies: compute, networking, authentication, caching, messaging, and any other services that your application requires. Each dependency must be available and correctly configured in your failover Region. A DynamoDB replica that’s healthy doesn’t help if your application can’t reach it. Document these dependencies as part of your failover runbook and verify them during GameDays.
Verify that your replicas are healthy
Before you can fail over your application to another Region, you must know that the Region is ready to accept traffic. Get in the habit of periodically verifying that your replica tables are in an ACTIVE state:
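A minimal sketch of that check in Python, assuming a boto3 DynamoDB client is passed in, might look like the following. The function name `find_unready_replicas` and the message format are hypothetical; the fields it reads (`TableStatus`, `Replicas`, `ReplicaStatus`, `RegionName`) come from the `DescribeTable` response.

```python
from typing import Any, List


def find_unready_replicas(dynamodb_client: Any, table_name: str) -> List[str]:
    """Return a list of problems; an empty list means everything reports ACTIVE."""
    table = dynamodb_client.describe_table(TableName=table_name)["Table"]
    problems = []

    # Status of the table in the Region the client is pointed at.
    if table.get("TableStatus") != "ACTIVE":
        problems.append(f"table status is {table.get('TableStatus')}")

    # Status of each replica listed in the response.
    for replica in table.get("Replicas", []):
        status = replica.get("ReplicaStatus")
        if status != "ACTIVE":
            problems.append(f"replica {replica.get('RegionName')} is {status}")
    return problems
```

Running this on a schedule from each Region, and treating any non-empty result as something to investigate, turns the habit into an automated check.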
In the output, check that TableStatus is ACTIVE and that each entry in the Replicas array shows ReplicaStatus: ACTIVE. If a replica is in a different state, such as CREATING, UPDATING, REPLICATION_NOT_AUTHORIZED, or INACCESSIBLE_ENCRYPTION_CREDENTIALS, it’s not ready to serve as a failover target. Understanding why a replica is in a non-active state is important. For example, an UPDATING state could indicate an in-progress settings synchronization or a scaling event, and the remediation differs depending on the cause.
Confirm that the capacity is ready in your failover Region
Before diving into capacity modes and auto scaling, verify that your AWS Service Quotas are consistent across all replica Regions. A Region with lower DynamoDB quotas than your primary can become a bottleneck during failover, even if your capacity configuration is correct. Check and align these quotas well in advance of any anticipated peak event.
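As a partial illustration, the DynamoDB `DescribeLimits` API returns the account-level and table-level provisioned capacity quotas for a Region, which makes a cross-Region comparison straightforward. The sketch below covers only those four values (other quotas would need the Service Quotas console or API), and the function name is hypothetical. It takes a mapping of Region name to a boto3 DynamoDB client for that Region.

```python
from typing import Any, Dict

LIMIT_KEYS = (
    "AccountMaxReadCapacityUnits",
    "AccountMaxWriteCapacityUnits",
    "TableMaxReadCapacityUnits",
    "TableMaxWriteCapacityUnits",
)


def find_limit_mismatches(clients_by_region: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Return the limits whose values differ across Regions.

    Result shape: {limit name: {region: value}}. Empty means all Regions agree.
    """
    limits = {region: client.describe_limits() for region, client in clients_by_region.items()}
    mismatches = {}
    for limit_name in LIMIT_KEYS:
        values = {region: response.get(limit_name) for region, response in limits.items()}
        if len(set(values.values())) > 1:
            mismatches[limit_name] = values
    return mismatches
```

Any entry in the result points to a Region whose quota should be raised before it is relied on as a failover target.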
One of the most common gaps that we see is a failover Region provisioned only for its normal read traffic, a reasonable cost optimization during steady state. But when the primary Region is impaired and that replica must suddenly absorb the redirected traffic, it starts throttling within seconds.
It’s important to understand that with global tables, every replica already handles full production write traffic as part of normal replication. The concern is primarily around read capacity. If your application shifts all reads to the failover Region, that Region must be provisioned accordingly.
If your tables use on-demand capacity mode, you’re in better shape because capacity scales automatically. However, if you’re using provisioned capacity mode, you must verify that your failover Region can handle production-level traffic before you need it to.
The key metric to check isn’t what’s currently provisioned, but what your auto scaling upper bound allows. Current provisioned capacity reflects what auto scaling has settled on based on recent traffic, and it can change at any time. What matters for failover readiness is whether auto scaling can scale high enough to absorb production traffic. Check your auto scaling configuration:
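A minimal sketch of that check, assuming a boto3 DynamoDB client, uses the `DescribeTableReplicaAutoScaling` API and compares each replica’s `MaximumUnits` with the peak capacity you need. The function name and the required-capacity parameters are hypothetical, and the sketch looks only at the base table, not at global secondary indexes.

```python
from typing import Any, List, Optional, Tuple


def find_replicas_below_peak(dynamodb_client: Any, table_name: str,
                             required_read_units: int,
                             required_write_units: int) -> List[Tuple[str, str, Optional[int]]]:
    """Return (region, 'read' or 'write', MaximumUnits) for each shortfall.

    MaximumUnits is None when no auto scaling maximum is reported for that
    replica, which is flagged so it can be reviewed by hand.
    """
    description = dynamodb_client.describe_table_replica_auto_scaling(
        TableName=table_name
    )["TableAutoScalingDescription"]

    findings = []
    for replica in description.get("Replicas", []):
        region = replica.get("RegionName")
        checks = (
            ("read", "ReplicaProvisionedReadCapacityAutoScalingSettings", required_read_units),
            ("write", "ReplicaProvisionedWriteCapacityAutoScalingSettings", required_write_units),
        )
        for kind, settings_key, required in checks:
            maximum = replica.get(settings_key, {}).get("MaximumUnits")
            if maximum is None or maximum < required:
                findings.append((region, kind, maximum))
    return findings
```

An empty result means every replica’s auto scaling ceiling is at or above the capacity you specified.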
In the output, review the MinimumUnits and MaximumUnits for both read and write capacity in your failover Region. If the MaximumUnits is lower than your primary Region’s peak provisioned capacity, auto scaling will reach a ceiling during failover and you will see throttling.
There are two approaches to address this, depending on your situation:
For planned events or anticipated risk periods, temporarily raise the MinimumUnits in your failover Region to match the primary Region’s current provisioned capacity. This pre-warms the capacity so it’s immediately available during failover, rather than waiting minutes for auto scaling to react to a sudden traffic surge. You can lower it back after the event.
For ongoing readiness, ensure the MaximumUnits across all replica Regions are consistent and high enough to handle full production load. This way, even if you don’t pre-warm, auto scaling has room to scale up.
For global tables, use the DynamoDB auto scaling APIs (not the UpdateTable API or the AWS Application Auto Scaling APIs directly) to adjust these bounds. Updates made to throughput through UpdateTable can be overridden by auto scaling. The following command shows the recommended approach:
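The original command is not reproduced in this copy of the post. As a stand-in, the following Python sketch uses the `UpdateTableReplicaAutoScaling` API through a boto3 DynamoDB client to set the read capacity bounds for one replica Region. The function names and the example values are hypothetical; sending both `MinimumUnits` and `MaximumUnits` together is a choice made here for clarity.

```python
from typing import Any, Dict


def build_read_bounds_update(table_name: str, region: str,
                             minimum_units: int, maximum_units: int) -> Dict[str, Any]:
    """Build update_table_replica_auto_scaling arguments for one replica's read bounds."""
    if minimum_units > maximum_units:
        raise ValueError("minimum_units must not exceed maximum_units")
    return {
        "TableName": table_name,
        "ReplicaUpdates": [{
            "RegionName": region,
            "ReplicaProvisionedReadCapacityAutoScalingUpdate": {
                "MinimumUnits": minimum_units,
                "MaximumUnits": maximum_units,
            },
        }],
    }


def apply_read_bounds(dynamodb_client: Any, table_name: str, region: str,
                      minimum_units: int, maximum_units: int) -> Dict[str, Any]:
    """Send the update and return the API response."""
    request = build_read_bounds_update(table_name, region, minimum_units, maximum_units)
    return dynamodb_client.update_table_replica_auto_scaling(**request)
```

To pre-warm a failover Region before a planned event, you might call `apply_read_bounds(client, "orders", "us-west-2", 4000, 8000)` with your own table, Region, and capacity figures, and then lower the minimum again after the event.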
Global tables synchronize certain settings, including auto scaling configuration, across replicas. This means read capacity settings are also synchronized unless you have crafted a custom auto scaling policy to lower read capacity in non-primary Regions. Be aware of this behavior when planning your capacity strategy.
Consider switching to on-demand mode in advance of a planned event. This gives all replicas identical capacity behavior and removes the need to manage auto scaling across Regions during a high-stress incident.
Avoid control plane operations during disruptions
This is critical and often overlooked. During a Regional disruption:
- Don’t make structural changes to your tables.
- Don’t add or remove Global Secondary Indexes.
- Don’t add or remove global table replicas.
- Don’t modify the capacity mode (switching between provisioned and on-demand).
- Don’t update table tags or TTL settings.
These are control plane operations that require coordination across replicas. Global tables synchronize settings across all replicas by default, so setting changes aren’t permitted when one of the replicas is inaccessible. They can only be made when all replicas are healthy. Data plane operations (your reads and writes) are safe and expected after failover to a healthy Region. But structural changes should wait until the all-clear. This is why the best practice is to have your standby infrastructure fully configured and ready before an impairment occurs, rather than attempting to create or modify resources during an incident.
Build your pre-event checklist and runbook
To strengthen your resilience posture, we recommend creating a failover runbook that documents your failover procedure, including pre-decided thresholds and criteria for when to initiate failover. The decision to fail over is ultimately a business decision, unique to each customer’s requirements and risk tolerance. Having those decisions made in advance, not during a 2 AM incident, is what separates prepared teams from reactive ones.
As part of that runbook, maintain a pre-event checklist that you review before each planned event, peak traffic period, or when AWS communicates a potential disruption through the AWS Health Dashboard:
- Identify all global tables with replicas in the affected Region
- Verify replica status in alternate Regions (ACTIVE state)
- Check ReplicationLatency metrics to confirm replication is current
- Verify that provisioned capacity or on-demand mode is appropriate in the target Region
- Confirm that service quotas are consistent across replica Regions
- Review CloudWatch alarms for DynamoDB tables
- Document current application endpoint configuration
- Confirm that your operations team is available and familiar with the failover runbook
- Establish a communication channel with stakeholders
This might seem like overhead, but consider that airline pilots run a pre-flight checklist before every takeoff, no matter how experienced they are. The checklist exists not because pilots don’t know how to fly, but because high-stress situations are exactly when steps get skipped. The same principle applies here. During an actual incident, having this list already completed can be the difference between a 5-minute failover and a 45-minute scramble.
Common preparation pitfalls
Even with the best intentions, there are traps that catch teams off guard during preparation.
Replication lag spikes under load
During high write volumes (think flash sales or batch imports), replication latency can spike beyond the typical subsecond range. If an impairment occurs during one of these spikes, your RPO window is larger than expected. This doesn’t mean that your resilience strategy is broken, but it does mean that you must be aware of it. Monitor the ReplicationLatency CloudWatch metric closely during peak traffic periods, and factor these spikes into your RPO calculations. If your business requires a hard zero-RPO guarantee regardless of load, DynamoDB multi-Region strong consistency (MRSC) is the answer.
Capacity planning failures
If your failover Region is provisioned for normal read traffic and must suddenly handle full production read traffic, requests will be throttled. Auto scaling helps, but it takes minutes to react, which is too slow for a sudden failover surge. There is a real cost to maintaining higher capacity in your failover Region, and the right balance depends on your workload and risk tolerance. Consider using on-demand capacity mode for replica tables, which scales automatically without requiring you to maintain over-provisioned capacity. If you use provisioned mode, verify that your auto scaling MaximumUnits can handle full production load, even if your MinimumUnits are set lower during steady state. The key is that your failover Region can scale to meet demand when it needs to, not that it’s running at full capacity at all times.
Never testing failover
Failover is like a fire drill. If you’ve never practiced it, you won’t execute it smoothly when it matters. Run regular GameDays where you simulate Regional impairments and practice your failover runbook. You can use AWS Fault Injection Service (FIS) to inject faults and test your resilience posture continuously. Identify gaps in automation, documentation, and team readiness before a real incident forces you to. We’ve worked with customers who discovered during a GameDay that their failover Region’s security groups blocked the application from connecting to DynamoDB, a configuration drift that went unnoticed for months. Better to find that on a Tuesday afternoon than during a Saturday night disruption.
More generally, a failover strategy must be exercised regularly. Checking these items isn’t enough; perform an actual failover at a regular cadence.
Conclusion
Preparing for Regional impairments with DynamoDB global tables isn’t only about having replicas in multiple Regions. It’s about understanding your replication model, knowing your RPO and RTO requirements, monitoring replication health continuously, and building the operational muscle to act decisively when it matters.
Your data will still be available after a single-Region impairment. That’s the promise of global tables. But the speed and smoothness of your recovery depends entirely on the preparation you do today.
In Part 2 of this series, we cover the failover strategies themselves: DNS-based failover with Amazon Route 53, application-level circuit breakers, and Amazon Route 53 Application Recovery Controller (ARC). We walk through the tradeoffs, the implementation details, and the operational considerations for each approach.
To learn more
If you’re running DynamoDB global tables today, start by checking your ReplicationLatency metric in CloudWatch. If you don’t have alarms set up, create them now. Then run describe-table-replica-auto-scaling on your failover Region and compare the MaximumUnits to your primary Region’s peak traffic. If there’s a gap, close it before your next peak event.
If you don’t have a failover runbook yet, use the checklist in this post as your starting point. Write down the thresholds that would trigger a failover for your workload, get your team to review them, and schedule a GameDay to test it. AWS Fault Injection Service can help you simulate Regional impairments in a controlled way.
For a deeper dive into global tables configuration and design, see the best practices for DynamoDB global tables and the Reliability Pillar of the AWS Well-Architected Framework. If zero RPO is a requirement for your workload, explore multi-Region strong consistency (MRSC).
In Part 2, we put this preparation to work with three failover strategies.