AWS Database Blog
Best practices for Amazon DynamoDB Global Tables – Part 3: Validating regional resilience with AWS Fault Injection Service
This post is Part 3 of our series on best practices for Amazon DynamoDB global tables. In Part 1, we discussed how to prepare your application for regional outages. In Part 2, we covered failover strategies for regional disruptions.
In this post, we show you how to use AWS Fault Injection Service (AWS FIS) to validate that your application handles regional disruptions the way you expect, by running controlled experiments against your DynamoDB global tables. We cover both multi-Region strong consistency (MRSC) and multi-Region eventually consistent (MREC) global tables, because AWS FIS works differently with each.
Building a resilient multi-Region application involves two distinct challenges: designing for failure, and proving your design works. The first two parts of this series addressed the design. This post addresses the proof.
AWS Fault Injection Service
AWS Fault Injection Service provides the controls and guardrails that you need to run experiments on your AWS workloads, such as automatically rolling back or stopping an experiment if specific conditions are met. You define experiment templates that specify which resources to target, what fault to inject, and how long the experiment should last. AWS FIS handles the injection with built-in safety guardrails, including stop conditions that automatically halt an experiment if it exceeds your defined impact threshold.
Why test with AWS FIS
DynamoDB global tables, whether MRSC or MREC, give you multi-Region replication with high availability. Steady-state replication, however, does not tell you how your application behaves when replication is disrupted. Questions like these are difficult to answer without testing:
- When DynamoDB returns errors during a regional disruption, does your application fail over or crash?
- Does your application detect a disruption that may require failover actions early enough to act on it?
- Can your failover mechanism shift traffic within your target recovery time objective (RTO)?
- After the disruption resolves, DynamoDB replication might still be catching up in the recovered Region – does your application recover without manual intervention?
AWS FIS experiments let you answer these questions with evidence instead of assumptions.
How AWS FIS experiments work with MRSC global tables
When you run an AWS FIS experiment using the aws:dynamodb:global-table-pause-replication action against an MRSC global table, it pauses replication between the experiment Region and all other Regions. For example, consider an MRSC global table with replicas in US West (Oregon) (us-west-2), US East (N. Virginia) (us-east-1), and US East (Ohio) (us-east-2). If you run an experiment with us-east-2 as the experiment Region, replication in and out of us-east-2 is affected, while us-west-2 and us-east-1 will continue normal operations.
During the experiment, the isolated replica behaves differently depending on the type of operation:
- Eventually consistent reads – Permitted, continue to work against the isolated replica.
- Strongly consistent reads – Return HTTP 500 errors, because cross-Region consensus cannot be reached.
- Writes – Return HTTP 500 errors, because cross-Region replication is paused.
- Control plane actions (UpdateTable, and so on) – Blocked table-wide in all replica Regions.
These induced errors are tracked by a dedicated Amazon CloudWatch metric: FaultInjectionServiceInducedErrors. This metric counts the simulated HTTP 500 errors returned during the experiment and during the post-experiment catchup period, broken down by TableName and Operation. Note that these induced errors also increment the SystemErrors metric with equal counts, so existing alarms on SystemErrors will fire during experiments. The FaultInjectionServiceInducedErrors metric helps you distinguish experiment-induced errors from organic service issues.
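As a rough sketch of how you might pull this metric, the following Python builds the arguments for a CloudWatch GetMetricStatistics call. The AWS/DynamoDB namespace, the table name Orders, the time window, and the helper name induced_error_query are assumptions for illustration. The API call itself is left as a comment so that the snippet runs without credentials.

```python
from datetime import datetime, timedelta, timezone


def induced_error_query(table_name, operation, start, end, period_seconds=60):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    return {
        "Namespace": "AWS/DynamoDB",  # assumption: the standard DynamoDB namespace
        "MetricName": "FaultInjectionServiceInducedErrors",
        "Dimensions": [
            {"Name": "TableName", "Value": table_name},
            {"Name": "Operation", "Value": operation},
        ],
        "StartTime": start,
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Sum"],
    }


# Hypothetical 15-minute window around an experiment.
end = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
query = induced_error_query("Orders", "PutItem", end - timedelta(minutes=15), end)

# With boto3 installed and credentials configured, the query could be run as:
#   import boto3
#   cloudwatch = boto3.client("cloudwatch", region_name="us-east-2")
#   datapoints = cloudwatch.get_metric_statistics(**query)["Datapoints"]
```

Running the same query with MetricName set to SystemErrors should let you compare the two counts side by side.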
IAM permissions for MRSC experiments
MRSC experiments use the aws:dynamodb:global-table-pause-replication action, which requires the dynamodb:InjectError IAM permission. This permission must be granted with Resource: "*" because it cannot be scoped to a specific table ARN. The experiment role also needs permissions to manage resource policies (used internally by AWS FIS for setup and teardown) and tag:GetResources if you target tables by tags.
How AWS FIS works with MREC global tables
For MREC global tables, AWS FIS uses a resource-policy-based approach. When you run the same aws:dynamodb:global-table-pause-replication action against an MREC table, AWS FIS dynamically attaches deny statements to the table’s resource policy that block the DynamoDB replication service-linked role (SLR) from performing replication operations. Specifically:
- A deny statement is added to the table policy blocking GetItem, PutItem, UpdateItem, DeleteItem, DescribeTable, UpdateTable, Scan, DescribeTimeToLive, and UpdateTimeToLive for the replication SLR.
- A deny statement is added to the stream policy blocking GetRecords, DescribeStream, and GetShardIterator for the replication SLR.
Both statements include a time-bound DateLessThan condition. This is a built-in safety mechanism: the deny statements apply only until the expected end time of the experiment.
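To make the time-bound condition concrete, here is an illustrative sketch of what a stream deny statement of this shape might look like, together with a small function that mimics how a DateLessThan check behaves. This is not the exact statement that AWS FIS writes. The account ID, the service-linked role ARN, the stream ARN, and both helper names are assumptions for illustration.

```python
from datetime import datetime, timezone

# Hypothetical principal: the DynamoDB replication service-linked role
# in a placeholder account (111122223333).
REPLICATION_SLR_ARN = (
    "arn:aws:iam::111122223333:role/aws-service-role/"
    "replication.dynamodb.amazonaws.com/AWSServiceRoleForDynamoDBReplication"
)
STREAM_ACTIONS = ["GetRecords", "DescribeStream", "GetShardIterator"]
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def stream_deny_statement(stream_arn, end_time):
    """Illustrative deny statement; end_time must be a UTC datetime."""
    return {
        "Effect": "Deny",
        "Principal": {"AWS": REPLICATION_SLR_ARN},
        "Action": [f"dynamodb:{name}" for name in STREAM_ACTIONS],
        "Resource": stream_arn,
        "Condition": {
            "DateLessThan": {"aws:CurrentTime": end_time.strftime(TIME_FORMAT)}
        },
    }


def deny_is_active(statement, now):
    """Mimic DateLessThan: the deny applies only while now is before the end time."""
    end = statement["Condition"]["DateLessThan"]["aws:CurrentTime"]
    end_time = datetime.strptime(end, TIME_FORMAT).replace(tzinfo=timezone.utc)
    return now < end_time
```

Because the condition is evaluated against the current time, a statement like this stops denying requests once the end time passes, even if nothing removes it.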
The key difference from MRSC: your application does not see errors. Reads and writes continue to succeed locally in all Regions. What stops is the background replication: writes in one Region are no longer propagated to other Regions, nor are writes from other Regions propagated to the isolated one. This means:
- Your application continues to function normally from the user’s perspective.
- Data written during the experiment diverges across Regions.
- The ReplicationLatency CloudWatch metric is no longer emitted.
- After the experiment ends, AWS FIS removes the deny statements and replication catches up.
If the target table doesn’t already have a resource policy, AWS FIS creates one for the duration of the experiment and deletes it when the experiment completes. If the table already has a resource policy, AWS FIS inserts only the deny statements and removes them at the end, leaving your existing policy unchanged.
IAM permissions for MREC experiments
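MREC experiments use a resource-policy-based mechanism and do not require the dynamodb:InjectError permission. The experiment role still needs permissions to manage resource policies, plus tag:GetResources if you target tables by tags.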
Prerequisites
Before you begin, make sure you have:
- A DynamoDB global table (MRSC or MREC) with replicas in multiple Regions. For this walkthrough, we use an MRSC table in us-west-2 (primary), us-east-1, and us-east-2.
- An IAM role for AWS FIS experiments.
- The AWS Command Line Interface (AWS CLI) v2 installed and configured.
- A CloudWatch dashboard or monitoring solution in place. Without it, you won’t be able to measure the impact of your experiments.
Important: AWS FIS carries out real actions on real AWS resources. If this is your first AWS FIS experiment, we strongly recommend starting in a pre-production or test environment. Build confidence with your experiment design and monitoring before running against production tables.
Step 1: Define steady state
Before creating your experiment, define what “normal” looks like for your application and what state you expect during the experiment. Write these expectations as measurable evaluation criteria.
For example:
- “If replication is paused in us-east-2 for 10 minutes, the error rate seen by end users should not exceed 0.1%, because our application fails over to us-west-2 within 30 seconds.”
- “If strongly consistent reads fail in us-east-2, our application should fall back to eventually consistent reads and maintain p99 latency under 50ms for read operations in that Region.”
To establish your baseline, record the following metrics during normal operation for at least 24 hours:
- SuccessfulRequestLatency (p50, p99) per operation.
- SystemErrors (Sum). Should be near zero.
- Your application-level error rate and latency metrics.
You’ll compare these baselines against the same metrics during and after the experiment to evaluate your criteria.
Step 2: Create the AWS FIS experiment role
Create an IAM role that AWS FIS can assume to run experiments against your global table. The role needs permissions to manage resource policies (used internally by AWS FIS) and the dynamodb:InjectError permission specific to MRSC tables. Note that dynamodb:InjectError requires Resource: "*" – it cannot be scoped to a specific table ARN.
Experiment role policy:
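A minimal sketch of such a policy, built in Python from the permissions named in this step. The specific resource-policy actions, the inclusion of dynamodb:DescribeTable, the stream ARN pattern, and the helper name build_experiment_role_policy are assumptions; check them against the AWS FIS documentation before use.

```python
import json


def build_experiment_role_policy(table_arns, mrsc=True, target_by_tags=False):
    """Assemble an illustrative IAM policy document for the experiment role."""
    resources = []
    for arn in table_arns:
        # Assumption: cover both the table and its streams.
        resources.extend([arn, f"{arn}/stream/*"])
    statements = [
        {
            "Sid": "ManageResourcePolicies",
            "Effect": "Allow",
            "Action": [
                "dynamodb:GetResourcePolicy",
                "dynamodb:PutResourcePolicy",
                "dynamodb:DeleteResourcePolicy",
                "dynamodb:DescribeTable",
            ],
            "Resource": resources,
        }
    ]
    if mrsc:
        # dynamodb:InjectError cannot be scoped to a table ARN.
        statements.append(
            {
                "Sid": "InjectError",
                "Effect": "Allow",
                "Action": "dynamodb:InjectError",
                "Resource": "*",
            }
        )
    if target_by_tags:
        statements.append(
            {
                "Sid": "ResolveTags",
                "Effect": "Allow",
                "Action": "tag:GetResources",
                "Resource": "*",
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


# Placeholder account ID and table name.
policy = build_experiment_role_policy(
    ["arn:aws:dynamodb:us-east-2:111122223333:table/Orders"]
)
print(json.dumps(policy, indent=2))
```

Passing mrsc=False drops the dynamodb:InjectError statement, which matches the MREC case.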
Trust policy:
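A minimal sketch of a trust policy that lets the AWS FIS service principal assume the role. The optional condition keys narrow the trust to experiments in one account and Region. The account ID, the Region, and the helper name build_trust_policy are placeholders for illustration.

```python
import json


def build_trust_policy(account_id=None, region=None):
    """Build an illustrative trust policy for the AWS FIS service principal."""
    statement = {
        "Effect": "Allow",
        "Principal": {"Service": "fis.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }
    if account_id and region:
        # Optional hardening: restrict which experiments can use the role.
        statement["Condition"] = {
            "StringEquals": {"aws:SourceAccount": account_id},
            "ArnLike": {
                "aws:SourceArn": f"arn:aws:fis:{region}:{account_id}:experiment/*"
            },
        }
    return {"Version": "2012-10-17", "Statement": [statement]}


trust_policy = build_trust_policy("111122223333", "us-east-2")
print(json.dumps(trust_policy, indent=2))
```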
Step 3: Create an experiment template
The following command creates an experiment template that pauses replication for 10 minutes. Note the stop condition, a CloudWatch alarm that automatically halts the experiment if your application’s error rate exceeds an acceptable threshold:
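The same request can be sketched in Python as a dictionary for the AWS FIS CreateExperimentTemplate API. The resource type aws:dynamodb:global-table, the action target key Tables, and every ARN and name below are assumptions or placeholders; confirm them against the AWS FIS actions reference.

```python
import uuid


def build_experiment_template(table_arn, role_arn, alarm_arn, duration="PT10M"):
    """Build an illustrative request body for create_experiment_template."""
    return {
        "clientToken": str(uuid.uuid4()),
        "description": "Pause global table replication in the experiment Region",
        "roleArn": role_arn,
        # Stop condition: halt the experiment if this alarm goes into ALARM.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "targets": {
            "GlobalTable": {
                "resourceType": "aws:dynamodb:global-table",  # assumption
                "resourceArns": [table_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "PauseReplication": {
                "actionId": "aws:dynamodb:global-table-pause-replication",
                "parameters": {"duration": duration},  # ISO 8601, 10 minutes
                "targets": {"Tables": "GlobalTable"},  # assumption: target key
            }
        },
        "tags": {"Name": "pause-replication-us-east-2"},
    }


template = build_experiment_template(
    table_arn="arn:aws:dynamodb:us-east-2:111122223333:table/Orders",
    role_arn="arn:aws:iam::111122223333:role/FisExperimentRole",
    alarm_arn="arn:aws:cloudwatch:us-east-2:111122223333:alarm:AppErrorRate",
)

# With boto3 installed and credentials configured:
#   import boto3
#   fis = boto3.client("fis", region_name="us-east-2")
#   template_id = fis.create_experiment_template(**template)["experimentTemplate"]["id"]
```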
You can also target tables by tags instead of ARNs, which is useful when you want to run experiments across multiple tables that share a common tag (for example, Environment: staging). To do this, replace resourceArns with resourceTags in your target definition:
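As a small sketch of that swap, the following helper takes a target definition and returns a copy that selects by tag. The helper name target_by_tags and the sample target are hypothetical, and the resource type shown is an assumption.

```python
def target_by_tags(target, tags):
    """Return a copy of a target definition that uses resourceTags instead of resourceArns."""
    updated = {key: value for key, value in target.items() if key != "resourceArns"}
    updated["resourceTags"] = dict(tags)
    return updated


arn_target = {
    "resourceType": "aws:dynamodb:global-table",  # assumption
    "resourceArns": ["arn:aws:dynamodb:us-east-2:111122223333:table/Orders"],
    "selectionMode": "ALL",
}
tag_target = target_by_tags(arn_target, {"Environment": "staging"})
```

A target definition carries either resourceArns or resourceTags, not both, which is why the helper drops the ARN list.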
Step 4: Run the experiment
Start the experiment:
Monitor the experiment status:
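Both steps can be sketched in Python as one function that starts an experiment and polls its status until it reaches a terminal state. The function expects a boto3 AWS FIS client; the function name, the template ID, and the polling interval are assumptions for illustration.

```python
import time
import uuid

TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def run_experiment(fis_client, template_id, poll_seconds=30, sleep=time.sleep):
    """Start an experiment from a template and poll until it finishes."""
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    while True:
        state = fis_client.get_experiment(id=experiment_id)["experiment"]["state"]
        print(f"{experiment_id}: {state['status']}")
        if state["status"] in TERMINAL_STATES:
            return experiment_id, state["status"]
        sleep(poll_seconds)


# With boto3 installed and credentials configured (template ID is a placeholder):
#   import boto3
#   fis = boto3.client("fis", region_name="us-east-2")
#   run_experiment(fis, "EXT-placeholder")
```

A stopped status usually means that a stop condition fired or someone halted the experiment manually, so treat it differently from completed when you evaluate results.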
While the experiment runs, observe your CloudWatch dashboard and your application’s behavior in all replica Regions, not only the isolated one.
During the experiment:
For MRSC tables:
- FaultInjectionServiceInducedErrors should show a steady count of induced 500 errors in the isolated Region, broken down by operation (GetItem, PutItem, Query, and so on).
- SystemErrors will increase by the same count. Existing alarms on this metric will fire.
- Eventually consistent reads should continue succeeding in the isolated Region. Verify this in your SuccessfulRequestLatency SampleCount, filtered by operation.
- Strongly consistent reads and writes in the isolated Region should return HTTP 500 errors.
- Healthy Regions should continue operating normally. Verify with SuccessfulRequestLatency in those Regions.
- If you have regional failover configured (for example, through Amazon Route 53 health checks), verify that traffic shifts to a healthy Region within your target RTO.
For MREC tables:
- Your application should continue operating normally in all Regions. Reads and writes succeed locally.
- Monitor ReplicationLatency between healthy Regions to confirm that they continue replicating with each other.
- The ReplicationLatency metric into the experiment Region is no longer emitted.
- Your application logs should show no errors. The replication pause is invisible to your application.
After the experiment ends:
For MRSC tables:
- FaultInjectionServiceInducedErrors may continue briefly during the catchup period as DynamoDB resumes replication and converges.
- SuccessfulRequestLatency SampleCount should recover to pre-experiment levels in the previously isolated Region.
- Watch ThrottledRequests for retry storms as clients reconnect.
For MREC tables:
- The ReplicationLatency metric into the experiment Region should start emitting again and decrease as the replication backlog drains.
- Verify that data written to healthy Regions during the experiment is now readable in the previously isolated Region, and vice versa.
For both table types:
- Confirm your application returns to steady state without manual intervention.
- Control plane operations should succeed again.
Step 5: Evaluate results
After the experiment, return to the criteria you defined in Step 1 and evaluate them against the data you collected. Ask these questions:
Did your application handle the 500 errors gracefully? Check your application logs for unhandled exceptions. If errors propagated to end users, consider adjusting your SDK retry configuration or adding application-level retry logic with circuit breakers.
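As one possible shape for the circuit breaker mentioned above, here is a minimal sketch. The class, its thresholds, and the cool-down period are illustrative assumptions, not a DynamoDB or SDK feature.

```python
import time


class CircuitBreaker:
    """Open after repeated failures, then allow a trial call after a cool-down."""

    def __init__(self, failure_threshold=5, reset_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        """Return True if a call to the Region should be attempted."""
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the cool-down.
            self.opened_at = self.clock()
```

A caller would check allow() before each DynamoDB request to the local Region, route to a healthy Region when it returns False, and call record_success() or record_failure() based on the outcome.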
Did eventually consistent reads provide a useful fallback? During the experiment, eventually consistent reads continued to work against the isolated replica. If your application can tolerate stale reads for certain operations, consider implementing a fallback that switches from ConsistentRead=True to ConsistentRead=False when strongly consistent reads return 500 errors. This can maintain partial availability during a regional disruption while you shift traffic to a healthy Region.
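The fallback described above can be sketched as follows. The function expects a boto3 DynamoDB client, and it treats any exception that carries an HTTP 5xx status in its response metadata as a server-side error, which is how botocore client errors expose the status code. The function names and the return shape are assumptions for illustration.

```python
def is_server_error(exc):
    """Return True if the exception carries an HTTP 5xx status code."""
    response = getattr(exc, "response", None) or {}
    status = response.get("ResponseMetadata", {}).get("HTTPStatusCode")
    return status is not None and status >= 500


def get_item_with_fallback(client, table_name, key):
    """Try a strongly consistent read, then fall back to an eventually consistent one.

    Returns (item, was_strongly_consistent).
    """
    try:
        response = client.get_item(TableName=table_name, Key=key, ConsistentRead=True)
        return response.get("Item"), True
    except Exception as exc:
        if not is_server_error(exc):
            raise  # validation errors, access denied, and so on
    # The strongly consistent read failed server-side; accept a possibly stale read.
    response = client.get_item(TableName=table_name, Key=key, ConsistentRead=False)
    return response.get("Item"), False
```

Returning the consistency flag lets callers decide whether a possibly stale item is acceptable for the operation at hand.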
Did your failover mechanism activate? If you’re using Route 53 health checks or a custom failover mechanism, measure the time from the first induced error to the traffic shift. Compare this against your target Recovery Time Objective (RTO).
Did your alerts fire? If you have CloudWatch alarms on error metrics, verify they triggered within your expected timeframe. Note that for MRSC tables, AWS FIS-induced errors increment both the FaultInjectionServiceInducedErrors metric and the SystemErrors metric with equal counts. This means existing alarms on SystemErrors will fire during AWS FIS experiments. If you want to distinguish between AWS FIS-induced and organic errors, use the FaultInjectionServiceInducedErrors metric. Consider creating alarms on application-level error rates as well, because those capture the end-user impact regardless of the error source.
Did recovery complete cleanly? After the experiment, verify that your application returned to normal operation without manual intervention. Check for any data inconsistencies that might have resulted from writes during the experiment.
If the results didn’t match your evaluation criteria, that’s still a successful experiment: you’ve identified a gap in your resilience before it affected your customers.
Best practices
Based on our experience working with customers running AWS FIS experiments against DynamoDB global tables, we recommend the following:
Start small, iterate, and automate. Begin with short experiments (10-15 minutes) against a single table in a test environment. As you build confidence, increase the duration, test in staging, and eventually run experiments in production during business hours when your team is available to observe. Once you’ve established a baseline, save your experiment templates and run them regularly. Many teams incorporate AWS FIS experiments into their monthly or quarterly operational readiness reviews. Resilience is not a one-time achievement. It requires ongoing validation as your application evolves.
Design experiments around your real risks. Review past incidents and near-misses to prioritize which failure scenarios to test first. Experiments that challenge assumptions (“we’re confident our retry logic handles 500s”) often reveal the most valuable findings. Consider scenario-based testing that combines the DynamoDB replication pause with AWS FIS actions for other AWS services, so that you test more realistic failure scenarios. For example, the Cross-Region Connectivity scenario blocks application network traffic, S3, DynamoDB, and other services.
Monitor the recovery phase, including catchup. Don’t stop observing when the experiment ends. The post-experiment catchup period, when DynamoDB resumes replication and processes any pending writes, can reveal issues with your application’s recovery logic. The FaultInjectionServiceInducedErrors metric tracks errors during this period as well. Verify that your application returns to steady state without manual intervention before declaring the experiment a success.
Enable AWS FIS experiment logging. Experiment logging is disabled by default. Enable it to capture detailed information about your experiment as it runs.
Conclusion
In Parts 1 and 2 of this series, we covered designing your application for regional resilience with DynamoDB global tables. In this post, we walked through validating that design using AWS Fault Injection Service. For MRSC tables, AWS FIS produces the same HTTP 500 errors your application would see during a real disruption. For MREC tables, AWS FIS pauses background replication while reads and writes continue to succeed locally. We recommend starting with a single table and a short duration, then expanding scope as you gain confidence. See the AWS FIS planning guide for more on scoping experiments safely, and Fault Injection Testing in the DynamoDB developer guide for further details.