AWS Storage Blog
Test and build application resilience using Amazon EBS latency injection
As businesses strive to build highly available applications, they must prevent disruptions that can lead to downtime and revenue loss. Robust monitoring systems help identify failures proactively, but chaos engineering has emerged as a systematic approach to building resilient systems by uncovering potential issues before they become outages.
Chaos engineering is especially critical for storage infrastructure such as Amazon Elastic Block Store (Amazon EBS), a high-performance block storage offering for performance-intensive and latency-sensitive applications, which business-critical workloads such as databases, big data analytics, and media applications depend on. Elevated I/O latency can lead to disruptions in these applications and can impact their end users; thus, organizations perform chaos engineering experiments to simulate storage slowdown. This enables them to understand how their applications respond to these disruptions and allows them to implement preventive measures to maintain high application availability.
In this post, we walk through how you can use the Amazon EBS latency injection action in AWS Fault Injection Service (AWS FIS) to introduce high I/O latency on your EBS volumes to replicate real-world conditions. We discuss how you can use this capability to observe application behavior when your volumes experience high latency. We also discuss how you can tune your monitoring with Amazon CloudWatch alarms to alert and take automated recovery actions that can help build resiliency for your application. Customers can use this new action to test and build confidence that their systems respond as intended when experiencing slowdown in their application storage layer.
Solution overview
AWS FIS enables controlled testing of Amazon EBS faults through systematic fault injection experiments. This service facilitates the simulation of different types of disruptions, such as stalled I/O and high I/O latency, within a managed testing environment. AWS FIS uses these controlled testing capabilities to help improve application resilience by enabling proactive identification and remediation of potential vulnerabilities in Amazon EBS based infrastructure before they can cause widespread impact in production environments.
Customers can use the Amazon EBS latency injection action in AWS FIS to test how their applications react to unexpected variations in volume I/O latency and understand the potential impact on the different layers of their application stack and end users. To quickly get started with your latency experiments, you can use the preconfigured latency scenario templates available through the Amazon EBS and AWS FIS consoles. These scenarios replicate various latency patterns that can be observed in real-world cases. You can adapt tests to your application requirements by customizing the scenario templates from the AWS FIS scenario library or creating your own templates from scratch to meet your specific testing needs. Disruptions in your workload can occur even when only a small portion of your I/O operations experiences performance degradation, which results in high tail latencies. You can use the Amazon EBS latency action to reproduce these latency outliers and understand how your application behaves in these situations. The following summaries briefly discuss each of the four scenarios that simulate this behavior and help test different types of slowdown on your volumes.
Scenario 1 – EBS: Sustained latency
This scenario presents a latency pattern where you observe a persistent latency of 500 ms on 50% of your read I/O operations and 100% of your write I/O operations for the duration of the experiment. You can use this scenario to establish performance baselines for your applications and determine when latency deviates from acceptable levels for an extended period, indicating a potential problem.
Scenario 2 – EBS: Intermittent latency
This scenario presents a latency pattern where you observe sharp, short-lived spikes of up to 30 seconds on 0.1% of your I/O operations, with periods of recovery in between the latency spikes. You can use this scenario to test extreme sporadic tail latencies. Furthermore, you can see how these brief, inconsistent delays can impact the performance of your application and cause disruptions to your end users due to their unpredictable pattern.
Scenario 3 – EBS: Increasing latency
This scenario presents a latency pattern where you observe a gradual increase in latency from 50 ms up to 15 seconds on 10% of your read and 25% of your write I/O operations during the experiment. You can use this scenario to test if your system can detect issues early on. Furthermore, you can proactively take steps to resolve the issues before they significantly impact your application performance and result in downtime for users.
Scenario 4 – EBS: Decreasing latency
This scenario presents a latency pattern where you observe a gradual decrease in latency from 20 seconds down to 40 ms on 10% of your read and write I/O operations during the experiment. You can use this scenario to determine at what point your system stops alarming and how your applications respond to this form of degradation so that you can test your failback mechanisms.
In this walkthrough, we explore these latency scenarios using an example test workload. In these tests, we are running a random workload with a 16 KiB I/O size for both reads and writes on a gp3 volume configured with 16,000 IOPS and 1,000 MB/s of throughput. We also demonstrate how you can run your own custom experiment. Moreover, we discuss how to effectively monitor these latencies and proactively implement necessary recovery mechanisms to identify and remediate application performance degradation before it becomes a user-facing issue.
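For reference, the following is a minimal sketch of how you could generate a comparable workload with fio. The device path, queue depth, read/write mix, and runtime are illustrative assumptions rather than the exact configuration used in our tests.

# Sketch: random 16 KiB read/write workload against an attached EBS volume.
# Assumptions: the volume is attached as /dev/nvme1n1, a 50/50 read/write mix,
# and a 15-minute runtime; adjust these to match your own test. Note that writing
# directly to the raw device destroys any existing data on it.
sudo fio --name=ebs-latency-test \
  --filename=/dev/nvme1n1 \
  --rw=randrw --rwmixread=50 \
  --bs=16k \
  --ioengine=libaio --iodepth=32 --direct=1 \
  --time_based --runtime=900 \
  --group_reporting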
Prerequisites
You must set up an AWS account with sufficient permissions to use Amazon EBS, AWS FIS, and Amazon CloudWatch. AWS FIS experiments also need an AWS Identity and Access Management (IAM) role; you can have AWS FIS create one for you when you configure the experiment, or you can use an existing role. AWS FIS conducts real disruptions on real AWS resources in your system. Before you run experiments in production, we strongly recommend that you complete a planning phase and run the experiments in a pre-production/test environment.
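If you prefer to create the experiment role yourself, the following is a minimal sketch of creating a role that the AWS FIS service can assume. The role name is a placeholder, and you still need to attach a permissions policy that allows the Amazon EBS fault actions you plan to run.

# Sketch: create an IAM role that AWS FIS can assume for experiments.
# The role name is a placeholder; attach permissions for your EBS fault actions separately.
cat > fis-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "fis.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role \
  --role-name fis-ebs-latency-experiment-role \
  --assume-role-policy-document file://fis-trust-policy.json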
Walkthrough
To get started with latency injection experiments, you must follow these steps.
Step 1: Log in to the AWS Management Console and navigate to the Amazon EC2 service console
- Log in to the AWS Management Console and choose the appropriate AWS Region.
- Navigate to the Amazon Elastic Compute Cloud (Amazon EC2) console.
- To change the Region, use the Region selector in the upper-right corner of the page.
Step 2: Choose Volumes under Elastic Block Store from the left navigation pane
- Choose Volumes from the left navigation pane under Elastic Block Store. Choose the volume IDs that you want to use for this experiment, and choose Actions. Then, choose Resiliency testing > Inject volume I/O latency.

Step 3: Test for various Amazon EBS latency scenarios listed under Experiment settings and monitor latency using CloudWatch
In this step, we walk through how you can run each of the four scenarios described in the Solution overview and observe the latency experienced on the EBS volume using CloudWatch metrics.
Scenario 1: Sustained
1.1. Choose Sustained from the Experiment settings.
1.2. Under EBS volumes, make sure that the specified volume ID is correct.
1.3. You can choose an existing IAM role that has the necessary permissions to perform this specific AWS Resilience Hub scenario for Amazon EBS, or you can create a new role.
1.4. Note the Pricing estimation to run this experiment. You are charged based on the duration during which an action is active. This experiment runs for a duration of 15 minutes.
1.5. Choose Start experiment.
You can monitor this through the CloudWatch metrics named VolumeAvgReadLatency and VolumeAvgWriteLatency, which are per-minute metrics that show average I/O latency. You can observe an average read latency of 250 ms for read I/O operations and an average write latency of 500 ms for write I/O operations, as defined in Scenario 1 for sustained latency in the Solution overview. The average read latency is roughly half of the injected 500 ms because only 50% of the read operations were injected with latency, as opposed to 100% of write operations.
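As a sketch, you can also pull these metrics with the AWS CLI. The volume ID and time window below are placeholders, and we assume that the metrics are published in the AWS/EBS namespace with a VolumeId dimension.

# Sketch: retrieve the per-minute average read latency for the volume under test.
# Assumptions: AWS/EBS namespace with a VolumeId dimension; replace the volume ID
# and time window with your own values.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name VolumeAvgReadLatency \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time 2025-01-01T00:00:00Z \
  --end-time 2025-01-01T00:30:00Z \
  --period 60 \
  --statistics Average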

Scenario 2: Intermittent
2.1. Choose Intermittent from the Experiment settings.
2.2. Then, follow steps 1.2 to 1.5 in Scenario 1 to begin this experiment.
The latency you injected is on only a very small percentage of I/O operations (0.1%); thus, the observed average latency is much lower than the spike values defined in the experiment.
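A rough back-of-the-envelope calculation (illustrative only) shows why: even if each impacted I/O were delayed by the full 30 seconds, injecting the delay on only 0.1% of operations adds on the order of tens of milliseconds to the per-minute average latency.

# Illustrative arithmetic: worst-case contribution of the spikes to the average latency,
# assuming 0.1% (1 in 1,000) of I/Os are each delayed by the full 30,000 ms.
echo "added average latency <= $(( 30000 / 1000 )) ms"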

Scenario 3: Increasing
3.1. Choose Increasing from the Experiment settings.
3.2. Then, follow steps 1.2 to 1.5 in Scenario 1 to begin this experiment.
When you look at the CloudWatch metrics, you can see that the average read latency increased from 5 ms to 100 ms to 2200 ms at its peak, while the write latency increased from 15 ms to 125 ms to 3500 ms at its peak. The read and write averages differ because you specified different percentages of read (10%) and write (25%) I/O operations to be impacted in the experiment.

Scenario 4: Decreasing
4.1. Choose Decreasing from the Experiment settings.
4.2. Then, follow steps 1.2 to 1.5 in Scenario 1 to begin this experiment.
When you look at the CloudWatch metrics, you can see that the average read latency decreased from its peak of approximately 2500 ms to 470 ms to 5 ms.

Step 4: Create a custom experiment
As different applications have different sensitivity to latency, you can also create custom experiments to tailor your tests to your specific application needs. The following is an example of a custom experiment; an example AWS CLI sketch that creates the same template programmatically follows these console steps.
- From the AWS Resilience Hub, under Resilience testing, choose Experiment templates from the left pane and choose Create experiment template.

- Provide a Description and Name – optional. Choose This AWS account under experiment type. Then, choose Next.
- Under Specify actions and target, provide the following configurations:
3.1. Under the Actions tab, choose Add action, then choose EBS > aws:ebs:volume:io:latency.
3.2. Name: Provide the name of your choice for this custom experiment
3.3. Description – optional: Optionally, provide a description for this action.
3.4. Start after – optional: If you want this action to start only after certain other actions, then you can choose them from the dropdown. This is not required for this experiment, so leave it empty.
3.5. Target: Leave this as default.
3.6. Action parameters:
- Duration: This is the time to wait before the added I/O latency is removed. Set this to 15 minutes.
- I/O latency mode: Choose the type of I/O operations on which you want to inject latency. Choose Read only from the dropdown.
- Read I/O latency milliseconds – optional: This is the minimum read latency to set for I/O operations, in milliseconds. Set this to 100.
- Read I/O percentage – optional: This is the percentage of read I/O operations to impair. The default is 100 percent. Set this to 50%.
3.7. Choose Save.

3.8. Under the Targets tab > Add target.
3.9. Name: Provide a name for this target.
3.10. Resource type: Choose aws:ec2:ebs-volume.
3.11. Target method: Choose Resource IDs.
3.12. Under Resource IDs: Choose the volume ID on which you want to run this test. Select mode: All.
3.13. Choose Save.

- For Experiment options, under the Empty target resolution mode dropdown, choose a resolution mode (Fail or Skip). Then, choose Next.

- On the Configure service access page, choose either Create a new role for the experiment template or Use an existing IAM role. In this post, choose to create a new role. Then, choose Next.

- On the Configure optional settings page, leave everything as default and choose Next.
- Similarly, on the Review and create page, verify all of the configurations, scroll down, and choose Create experiment template. Type in “create”, and choose Create experiment template again to confirm.
- To begin the experiment, choose Start experiment.

- If you want to provide Experiment tags, choose Add new tag and provide a tag value. Then, choose Start experiment and confirm by typing in “start”.
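The same custom template can also be created programmatically. The following is a minimal sketch using the AWS CLI. The volume ARN, role ARN, and account ID are placeholders, and the latency-specific parameter keys shown here only mirror the console field labels from the steps above; confirm the exact parameter names and the action's target name in the AWS FIS action reference for aws:ebs:volume:io:latency before using this.

# Sketch: create the custom experiment template with the AWS CLI.
# Assumptions: placeholder ARNs and account ID; the latency-specific parameter keys
# below mirror the console labels and should be verified against the AWS FIS
# documentation for aws:ebs:volume:io:latency.
cat > ebs-latency-template.json <<'EOF'
{
  "description": "Custom EBS read latency experiment",
  "targets": {
    "testVolume": {
      "resourceType": "aws:ec2:ebs-volume",
      "resourceArns": ["arn:aws:ec2:us-east-1:111122223333:volume/vol-0123456789abcdef0"],
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "injectReadLatency": {
      "actionId": "aws:ebs:volume:io:latency",
      "parameters": {
        "duration": "PT15M",
        "ioLatencyMode": "read-only",
        "readIOLatencyMilliseconds": "100",
        "readIOPercentage": "50"
      },
      "targets": { "Volumes": "testVolume" }
    }
  },
  "stopConditions": [{ "source": "none" }],
  "roleArn": "arn:aws:iam::111122223333:role/fis-ebs-latency-experiment-role"
}
EOF

aws fis create-experiment-template --cli-input-json file://ebs-latency-template.json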

Similar to the previous experiments, this can be monitored with the CloudWatch metrics VolumeAvgReadLatency and VolumeAvgWriteLatency. The experiment template configuration of 100 ms latency for 50% of read I/Os corresponds to an observed average read latency of approximately 50 ms, which is visible in the following CloudWatch metric dashboard. We only specified read latency in the experiment, so the write latency remains within the expected values for the gp3 volume, as shown in the following figure.

CloudWatch helps you monitor latency at a per-minute granularity. However, if you want higher resolution metrics, then you can use the Amazon EBS detailed performance statistics vended by the NVMe block device to get per-second latency values and track latency outliers. You can access these metrics directly from the instance. For more information, go to the steps listed under Accessing the statistics. We can observe these for the custom latency experiment that we are running. The volume on which the test is being run is attached as /dev/nvme1n1. To get the NVMe detailed metrics, connect to the EC2 instance and run the ebsnvme script during the experiment, as shown in the following:
[root@ip-172-31-9-9 ec2-user]# sudo ./ebsnvme stats /dev/nvme1n1
Total Ops
Read: 1513326
Write: 1574077
Total Bytes
Read: 24794405888
Write: 39753392128
Total Time (us)
Read: 14831626216
Write: 1617670919
EBS Volume Performance Exceeded (us)
IOPS: 14339971
Throughput: 0
EC2 Instance EBS Performance Exceeded (us)
IOPS: 0
Throughput: 278165
Queue Length (point in time): 1
Read IO Latency Histogram (us)
Number of bins: 28
=================================
Lower Upper IO Count
=================================
[0 - 1 ] => 0
[1 - 2 ] => 0
[2 - 4 ] => 0
[4 - 8 ] => 0
[8 - 16 ] => 0
[16 - 32 ] => 0
[32 - 64 ] => 0
[64 - 128 ] => 0
[128 - 256 ] => 0
[256 - 512 ] => 87091
[512 - 1024 ] => 1281637
[1024 - 2048 ] => 5223
[2048 - 4096 ] => 278
[4096 - 8192 ] => 46
[8192 - 16384 ] => 6
[16384 - 32768 ] => 8
[32768 - 65536 ] => 1
[65536 - 131072 ] => 1390361
[131072 - 262144 ] => 0
[262144 - 524288 ] => 0
[524288 - 1048576 ] => 0
[1048576 - 2097152 ] => 0
[2097152 - 4194304 ] => 0
[4194304 - 8388608 ] => 0
[8388608 - 16777216] => 0
[16777216 - 33554432] => 0
[33554432 - 67108864] => 0
[67108864 - 18446744073709551615] => 0
Write IO Latency Histogram (us)
Number of bins: 28
=================================
Lower Upper IO Count
=================================
[0 - 1 ] => 0
[1 - 2 ] => 0
[2 - 4 ] => 0
[4 - 8 ] => 0
[8 - 16 ] => 0
[16 - 32 ] => 0
[32 - 64 ] => 0
[64 - 128 ] => 0
[128 - 256 ] => 48
[256 - 512 ] => 472
[512 - 1024 ] => 1486682
[1024 - 2048 ] => 29298
[2048 - 4096 ] => 57455
[4096 - 8192 ] => 119
[8192 - 16384 ] => 3
[16384 - 32768 ] => 0
[32768 - 65536 ] => 0
[65536 - 131072 ] => 0
[131072 - 262144 ] => 0
[262144 - 524288 ] => 0
[524288 - 1048576 ] => 0
[1048576 - 2097152 ] => 0
[2097152 - 4194304 ] => 0
[4194304 - 8388608 ] => 0
[8388608 - 16777216] => 0
[16777216 - 33554432] => 0
[33554432 - 67108864] => 0
[67108864 - 18446744073709551615] => 0
Looking through the Read IO latency histogram, we can see that 1390361 I/Os had latency between 65536 microseconds (approximately 65 milliseconds) and 131072 microseconds (approximately 131 milliseconds). Therefore, roughly 50% of reads had a latency coinciding with our defined configuration of 100 milliseconds for the experiment. This test also had write operations on the volume, but because we did not include latency injection for writes in the experiment configuration, we do not see any similar high latencies in the Write IO latency histogram, and all write operations fall in the expected performance buckets for the volume. You can use the latency histograms to track the outlier latency values for your volume. Moreover, you can use the latency injection action to experiment with different levels of latency to simulate how your applications react to various types of outlier latencies, such as high P99.9, P99, and P90.
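As a quick sanity check, you can compute that share directly from the ebsnvme output. The following rough sketch assumes the output format shown above (bucket lines containing “=>”) and the same device path.

# Rough sketch: compute the share of read I/Os that landed in the 65536-131072 us bucket.
# Assumes the ebsnvme output format shown above and the same device path.
sudo ./ebsnvme stats /dev/nvme1n1 \
  | sed -n '/Read IO Latency Histogram/,/Write IO Latency Histogram/p' \
  | awk -F'=>' '/=>/ { total += $2; if ($0 ~ /65536 - 131072/) hit = $2 }
                END  { printf "share of reads in the 65-131 ms bucket: %.1f%%\n", 100 * hit / total }'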
Using the latency injection action in AWS FIS, you can identify the latency thresholds that impact your various applications. Some applications can be highly latency sensitive, while others can withstand high latencies. Therefore, you can set up the right alarms to alert when the underlying volumes of your workloads experience different degrees of degradation. This can help you take proactive actions to mitigate large-scale impact. For example, you can use the new action to determine the right values to set for your CloudWatch alarms to indicate performance degradation that can potentially impact your application availability. Then, you can set up your systems to take automated recovery actions based on these alarms such as failing over to a secondary volume. You can also use AWS FIS to configure stop conditions for your experiment based on CloudWatch alarms. Therefore, your experiment automatically ends when your alarm threshold is reached.
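For example, the following sketch creates a CloudWatch alarm that fires when the per-minute average read latency of a volume stays above a threshold for several minutes. The alarm name, threshold, volume ID, and SNS topic are placeholders that you should tune to your own application, and we assume the metric is published in the AWS/EBS namespace with a VolumeId dimension.

# Sketch: alarm when average read latency stays above 200 ms for 3 consecutive minutes.
# The alarm name, threshold, volume ID, and SNS topic ARN are placeholders; we assume
# VolumeAvgReadLatency is published in milliseconds in the AWS/EBS namespace.
aws cloudwatch put-metric-alarm \
  --alarm-name ebs-read-latency-high \
  --namespace AWS/EBS \
  --metric-name VolumeAvgReadLatency \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --statistic Average \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 200 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:111122223333:ebs-latency-alerts

You can then reference the same alarm as a stop condition in your AWS FIS experiment template so that the experiment ends automatically when the alarm fires.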
Cleaning up
If you no longer need the testing setup, then delete the CloudWatch alarm. Furthermore, if you created any new EBS volumes specifically for this testing setup, delete them to avoid incurring further charges.
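If you used the alarm and volume placeholders from the earlier examples, the cleanup can also be done from the CLI, as in the following sketch.

# Sketch: remove the example alarm and the test volume (placeholder names and IDs from
# the examples above). Only delete the volume if it was created solely for this test.
aws cloudwatch delete-alarms --alarm-names ebs-read-latency-high
aws ec2 delete-volume --volume-id vol-0123456789abcdef0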
Conclusion
Your applications can experience latency due to various reasons, and they must withstand these disruptions no matter what the underlying cause is. The Amazon EBS volume I/O latency injection action in AWS FIS helps you test your system’s response to increases in Amazon EBS latency. You can use this tool to simulate different latency scenarios, set appropriate timeouts and alarms, and verify recovery procedures. For comprehensive testing, you can combine it with the pause volume I/O action to test the different types of storage disruptions that can impact your applications. Furthermore, you can integrate these tests into your game days and chaos engineering experiments to build confidence that your applications are in fact resilient to the failure modes that you’ve designed them to withstand. We also recommend incorporating these tests into your continuous automated testing routine to maintain continuous system resilience.
Thank you for reading this post. Leave any comments in the comments section.