Conducting chaos engineering experiments on Amazon EBS using AWS Fault Injection Simulator

As distributed systems grow more complex, anticipating disruptions becomes more challenging. Conventional techniques that verify known conditions through unit or integration testing leave gaps around component failures, which can result in expensive outages. Chaos engineering is a disciplined approach to uncovering failures before they become outages. By proactively observing how a system responds to a fault, you can compare what you expect to happen with what actually happens and implement fixes to avoid potential disruption. You can simulate real-world disruptive events and learn how to build more resilient systems by using AWS Fault Injection Service (AWS FIS), a fully managed service that provides built-in safety mechanisms for running controlled experiments on AWS.

Among the components of a resilient system, storage is fundamental for any enterprise application. Many business-critical applications such as SAP, Oracle, databases, big-data analytics engines, and media workflows require dedicated, low-latency, persistent block storage. An I/O disruption in the storage component will impact those applications and may result in revenue losses. It is important to understand how an application responds to a storage fault. What happens if the storage service that supports your mission-critical applications stops responding to input/output (I/O) operations and causes operating system timeouts? Chaos engineering experiments can be used to answer such questions.

In this blog post, we walk you through using the Pause I/O action in AWS FIS to simulate an unresponsive Amazon Elastic Block Store (Amazon EBS) volume. We cover two experiment examples: a simple one-time example and a more advanced example that creates a reusable experiment template. The Pause I/O action reproduces the real-world signals that occur when an EBS volume is not responding to I/O operations. This lets you test the entire application stack, observe operating system and application configurations, improve monitoring, and tune the application architecture to improve resiliency.

Experiments overview

In this blog post, we will walk through the steps to conduct two experiments on EBS volumes:

Experiment 1: Running an experiment on a single EBS volume. We will run the experiment on a single EBS volume and review the volume metrics in Amazon CloudWatch to verify the experiment.
Experiment 2: Running an experiment on multiple Amazon EBS volumes by using a sample application architecture scenario.

In chaos engineering practice, an experiment moves through five key phases: identifying the steady state of the application, defining a hypothesis, running the experiment, verifying the experiment results, and making necessary improvements based on those results.

This image shows five stages of chaos engineering.

For illustration purposes, we will use the following example:

Steady State: We can define steady state as some measurable output of an application that indicates normal behavior. For example, we have an application hosted on Amazon EC2 instances across more than one Availability Zone, within an Auto Scaling group. We front the EC2 instances with a Network Load Balancer with defined health checks to make sure the instances are healthy before requests are routed to them. We use Amazon Route 53 as the DNS service to connect user requests to the infrastructure.

This image shows sample application architecture for experiment scenario 2.

Hypothesis: A pause of I/O on Amazon EBS volumes running in a single Availability Zone will not disrupt our application.

Run Experiment: Trigger Pause I/O action for EBS Volume.

Verify: Confirm or discard the hypothesis by looking at the KPIs of the application (for example, CloudWatch metrics and alarms, application logs, and Route 53 health checks).

Improvement: Based on the experiment results, you can then implement necessary fixes. Please note, for the scope of this blog post, we do not cover improvement steps for the example in this experiment.

Walkthrough for Experiment 1: Running experiment on a single EBS volume

Complete the following steps to deploy the experiment:

1. Select the EBS volume you want to perform Pause I/O action on

2. Specify the experiment duration

3. Observe the state of the experiment

4. Verify experiment results by reviewing CloudWatch metrics

Step 1: Select the EBS volume to perform Pause I/O action

1. Select the EBS volume from EC2 console.

2. Choose Actions and select Fault Injection from the dropdown menu. The first option Pause volume I/O allows you to directly run an experiment on the EBS volume.

This image shows available EBS volumes in the EC2 Management Console, where you can inject the Pause volume I/O action on any volume of your choice.

Step 2: Specify experiment duration

1. After choosing Pause volume I/O, you will be redirected to the AWS FIS console.

2. Specify the Duration for which you want to run the experiment. To allow AWS FIS to run the experiment, you may select either Default role, which creates a default service role for your account, or Use an existing service role. Please refer to the FIS documentation to learn more about the IAM policy and permissions required to use the FIS actions for Amazon EBS.

This image shows a sample experiment template with a target EBS volume, experiment duration, and service access role.

3. The default duration for an experiment is 10 minutes. However, you may change this value to suit your requirements. Please note that the cost of AWS FIS actions is calculated from the time an action starts until it stops. For more information, review FIS pricing.

4. Select Pause volume I/O and start the experiment. Once you start the experiment, AWS FIS automatically creates an experiment template on your behalf.

This image shows a warning message that pausing I/O causes the EBS volume to become impaired. To confirm that you want to start the experiment, enter start in the field and start the experiment.
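The template that the console creates on your behalf can also be expressed in code. The following is a minimal sketch, not the exact template the console generates: it builds the keyword arguments for the FIS CreateExperimentTemplate API with one target and one aws:ebs:pause-volume-io action. The function names, the action key pause-io, and the ARNs in the usage note are illustrative assumptions.

```python
import uuid


def minutes_to_iso8601(minutes: int) -> str:
    """Convert whole minutes to an ISO 8601 duration string such as PT10M."""
    if minutes < 1:
        raise ValueError("duration must be at least 1 minute")
    return f"PT{minutes}M"


def build_pause_io_template(volume_arn: str, role_arn: str, minutes: int = 10) -> dict:
    """Build keyword arguments for CreateExperimentTemplate (single volume)."""
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "description": "Pause I/O on a single EBS volume",
        "roleArn": role_arn,
        # No guardrail alarm in this simple example.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "Volumes-Target-1": {
                "resourceType": "aws:ec2:ebs-volume",
                "resourceArns": [volume_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "pause-io": {  # hypothetical action name
                "actionId": "aws:ebs:pause-volume-io",
                "parameters": {"duration": minutes_to_iso8601(minutes)},
                # "Volumes" is the target key this action expects.
                "targets": {"Volumes": "Volumes-Target-1"},
            }
        },
    }
```

With boto3 installed and credentials configured, the result could be passed to the service as boto3.client("fis").create_experiment_template(**kwargs), using your own volume and role ARNs.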

Step 3: Observe the state of experiment

1. Observe the various stages of the experiment under the AWS FIS Experiments section. The state of the experiment will go from initiating, running, to eventually completed.

This image shows that the experiment is currently in running state.
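If you prefer to follow the experiment state from a script rather than the console, a polling loop is one option. The sketch below assumes a boto3 FIS client (or any object with a compatible get_experiment method) is passed in; the function name, the polling interval, and the set of terminal states are our assumptions and should be checked against the current FIS API reference.

```python
import time

# States after which the experiment no longer changes (assumed list).
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def wait_for_experiment(fis_client, experiment_id, poll_seconds=15.0, max_polls=120):
    """Poll GetExperiment until a terminal state; return the distinct states seen."""
    seen = []
    for _ in range(max_polls):
        response = fis_client.get_experiment(id=experiment_id)
        status = response["experiment"]["state"]["status"]
        # Record each state once, in the order it was first observed.
        if not seen or seen[-1] != status:
            seen.append(status)
        if status in TERMINAL_STATES:
            return seen
        time.sleep(poll_seconds)
    raise TimeoutError(f"experiment {experiment_id} did not finish in time")
```

For a successful run, the returned list would typically read initiating, running, completed, matching the stages shown in the console.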

Step 4: Verify experiment results by reviewing CloudWatch metrics

1. Please ensure that your volume is receiving I/O requests. Optionally, you can generate I/O requests on the volume using a workload simulation tool such as flexible I/O tester (fio). For more information, please review the fio installation instructions.

Here is a sample command that reads sequentially from the EBS volume in 128 KB blocks while keeping up to 32 I/O requests in flight (--iodepth=32). Connect to the EC2 instance and run the following command in a terminal window, replacing <device> with the device name of the volume.

fio --filename=/dev/<device> --rw=read --bs=128k --iodepth=32 --ioengine=libaio --direct=1 --name=simulate-read-requests

2. Review CloudWatch metrics to verify that the volume is no longer processing I/O.

This image shows EBS volume CloudWatch metrics to confirm that the volume is no longer processing I/O.

In the preceding graph, the CloudWatch metrics VolumeReadOps and VolumeWriteOps are 0, indicating that the volume is no longer processing I/O. If you drive I/O to a volume that has I/O paused, the queue length metric VolumeQueueLength will be a non-zero value. Together, these metrics confirm that the volume was not responding to I/O requests and that the experiment ran successfully.
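The same check can be scripted. The following sketch is a simplified illustration: it builds queries for the CloudWatch GetMetricData API for the three metrics discussed above and applies the rule "zero completed operations while the queue is non-empty". The function names are ours, and the check assumes the three series are already aligned by timestamp.

```python
def build_volume_metric_queries(volume_id: str, period: int = 60) -> list:
    """Build MetricDataQueries for the EBS metrics used to verify paused I/O."""
    stats = {
        "VolumeReadOps": "Sum",
        "VolumeWriteOps": "Sum",
        "VolumeQueueLength": "Average",
    }
    return [
        {
            "Id": name.lower(),  # query IDs must start with a lowercase letter
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EBS",
                    "MetricName": name,
                    "Dimensions": [{"Name": "VolumeId", "Value": volume_id}],
                },
                "Period": period,
                "Stat": stat,
            },
            "ReturnData": True,
        }
        for name, stat in stats.items()
    ]


def io_was_paused(read_ops, write_ops, queue_length) -> bool:
    """True if any period shows no completed I/O while requests were queued."""
    return any(
        r == 0 and w == 0 and q > 0
        for r, w, q in zip(read_ops, write_ops, queue_length)
    )
```

The query list could be passed as MetricDataQueries to the CloudWatch get_metric_data call in boto3, together with a StartTime and EndTime that cover the experiment window.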

Walkthrough for Experiment 2: Running experiment on multiple Amazon EBS volumes by using a sample application architecture scenario

In the following section, we will demonstrate how to run experiments on multiple EBS volumes hosting our sample application.

As defined in the preceding Experiments overview section, our application is hosted on Amazon Elastic Compute Cloud (Amazon EC2) instances across more than one Availability Zone. Our hypothesis is that a pause of I/O on Amazon EBS volumes running in a single Availability Zone will not disrupt our application. We will run an experiment that simulates an I/O failure using FIS and, depending on the experiment results, either accept or reject the hypothesis.

For the purpose of this walkthrough, we assume that you already have a sample application to run the experiment against. If not, however, you may choose to deploy a sample test application using the preceding architecture diagram.

Complete the following steps to deploy the experiment:

1. Complete the FIS template pre-requisites by adding Description, Name, Action and Target

2. Select the AWS IAM Role for the experiment template

3. Specify stop conditions and configure logs (Optional)

4. Create the experiment template

5. Run the FIS experiment

6. Observe various states of the experiment

7. Review application logs to assess how your application responded

Step 1: Complete the FIS template by adding Description, Name, Action and Target

1. Navigate to the AWS FIS console to create an experiment template. Select Create experiment template.

This image shows AWS FIS Console.

2. Optionally, you can enter a Description and Name for the experiment.

This image shows FIS template pre-requisite section for Description and Name.

3. Configure the Actions section of the template. An action is an activity that FIS performs on an AWS resource during an experiment. FIS provides a set of pre-configured actions based on the AWS resource type. Each action runs for a specified duration during an experiment, or until you stop the experiment.

Under the Actions section, select Add action to get the New action window.

This image shows FIS template section to configure Action.

Enter a Name and a Description for the new action. Select “aws:ebs:pause-volume-io” for Action type. Start after is an optional setting that allows you to specify an action that should precede the one you are currently configuring. Next, specify the Duration for the action and save it.

This image shows FIS template section with sample Action configured.

4. Configure the Targets section of the template. Note that a target has been automatically created with the name Volumes-Target-1. Edit the Volumes-Target-1 target to select the EBS volumes on which you want to simulate the Pause I/O action.

This image shows FIS template section to configure Target.

If you want to run the experiment on specific EBS volumes, you may select Resource IDs for Target method, and then select the EBS volumes from the dropdown. However, for the purpose of this experiment, we will select Resource tags, filters and parameters as the Target method and specify the name of one of the Availability Zones where the EC2 instances are running.

This image shows FIS template section to edit Target information.
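For readers who define templates in code, the target selection described above might look like the following sketch. It assumes that the aws:ec2:ebs-volume resource type accepts an availabilityZoneIdentifier parameter and that volumes are pre-selected by a resource tag; please verify both against the FIS targets documentation. The tag key and value are hypothetical.

```python
def build_az_volume_target(availability_zone: str, tag_key: str, tag_value: str) -> dict:
    """Build a FIS target that selects tagged EBS volumes in one Availability Zone."""
    return {
        "resourceType": "aws:ec2:ebs-volume",
        # Only volumes carrying this tag are candidates (hypothetical tag).
        "resourceTags": {tag_key: tag_value},
        # Assumed parameter name: narrows candidates to a single AZ.
        "parameters": {"availabilityZoneIdentifier": availability_zone},
        "selectionMode": "ALL",
    }


def build_pause_io_action(target_name: str, duration: str) -> dict:
    """Build the Pause I/O action that runs against the named target."""
    return {
        "actionId": "aws:ebs:pause-volume-io",
        "parameters": {"duration": duration},  # ISO 8601, for example PT5M
        "targets": {"Volumes": target_name},
    }


targets = {"Volumes-Target-1": build_az_volume_target("us-east-1a", "ChaosReady", "true")}
actions = {"pause-io": build_pause_io_action("Volumes-Target-1", "PT5M")}
```

Selecting by tag plus Availability Zone, rather than by resource ID, keeps the template reusable when an Auto Scaling group replaces instances and their volumes.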

Step 2: Select the AWS IAM role for the experiment template

1. You can either create a new role or use an existing IAM role with the required permissions to run the experiment.

This image shows FIS template section to choose service access role.
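If you create the role yourself, it needs a trust policy that lets AWS FIS assume it and a permissions policy for the Pause I/O action. The sketch below shows one plausible shape for both documents. The permission names ec2:DescribeVolumes and ec2:PauseVolumeIO reflect our understanding of the FIS actions reference and should be confirmed there; the function names are ours.

```python
import json


def fis_trust_policy() -> dict:
    """Trust policy allowing the FIS service to assume the experiment role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "fis.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def pause_io_permissions_policy() -> dict:
    """Permissions assumed to be required by aws:ebs:pause-volume-io."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Describe calls do not support resource-level restrictions.
            {"Effect": "Allow", "Action": "ec2:DescribeVolumes", "Resource": "*"},
            {
                "Effect": "Allow",
                "Action": "ec2:PauseVolumeIO",
                "Resource": "arn:aws:ec2:*:*:volume/*",
            },
        ],
    }


# JSON text suitable for pasting into the IAM console or passing to an SDK call.
trust_json = json.dumps(fis_trust_policy(), indent=2)
permissions_json = json.dumps(pause_io_permissions_policy(), indent=2)
```

If the target uses resource tags, the role may need additional permissions for tag lookups; the FIS documentation lists the exact set.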

Step 3 (Optional): Specify stop conditions and configure logs

AWS FIS provides the controls and guardrails for you to run experiments safely on your AWS workloads. A stop condition is a mechanism to stop an experiment if it reaches a threshold that you define as an Amazon CloudWatch alarm. If a stop condition is triggered while the experiment is running, then AWS FIS stops the experiment.

This image shows FIS template section to configure Stop conditions.

For this experiment, let’s assume our application is I/O intensive and sensitive to latency. A higher queue length on all volumes is not acceptable as it may potentially impact the end user experience for our application. Therefore, we will define a stop condition as an alarm that gets triggered if the queue length of all volumes hosting our application is higher than 10.

Create a new CloudWatch alarm with a custom metric

1. Navigate to CloudWatch in another tab and select Create alarm.

CloudWatch interface for creating alarms

2. Click on Select Metric to specify the metric and condition.

Interface to specify metric and condition for the alarm

3. In the search bar, type in your volume IDs for which you want to create an alarm and select Per-Volume Metrics.

Per-volume metrics interface

4. Select the VolumeQueueLength metrics for the volumes and then select Graphed metrics from the panel. In the following illustration, we have two volumes.

This image shows the VolumeQueueLength metrics selected for the two volumes.

5. Click on the Add math dropdown menu and then select Start with empty expression.

Add math drop down menu

6. In the Details column, select Edit math expression and type “IF(m1>10 AND m2>10, 1, 0)”, where m1 is VolumeQueueLength for Volume1, m2 is VolumeQueueLength for Volume2. Refer to CloudWatch metric math to learn more about supported functions.

This image shows the edit math expression window.

7. Select only the custom expression and click on Select metric in the bottom right-hand corner.

Image that shows custom expression is selected
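The expression above covers two volumes. If your application uses more, the same pattern extends with additional AND terms. The following sketch generates the expression string for any number of metric IDs and mirrors its logic locally so you can reason about when it returns 1; the function names are ours and the limit of 10 comes from the stop condition defined earlier.

```python
def build_queue_length_expression(metric_ids, limit: int = 10) -> str:
    """Build a CloudWatch metric math expression that is 1 only when all queues exceed limit."""
    if not metric_ids:
        raise ValueError("at least one metric ID is required")
    condition = " AND ".join(f"{metric_id}>{limit}" for metric_id in metric_ids)
    return f"IF({condition}, 1, 0)"


def evaluate_locally(queue_lengths, limit: int = 10) -> int:
    """Mirror the expression for one data point per volume (for reasoning only)."""
    return 1 if all(q > limit for q in queue_lengths) else 0


expression = build_queue_length_expression(["m1", "m2"])
```

Because the terms are joined with AND, a single healthy volume keeps the result at 0, which matches the intent that the alarm triggers only when all volumes hosting the application are affected.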

Specify metric and conditions

1. Under Graph, select the Period dropdown and select 1 minute as the granularity of the data points on which the alarm is monitored.

Specify metric and conditions for graph

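2. Under Conditions, select the threshold type as Static and the alarm condition as Greater/Equal. Because the math expression returns 1 only when every volume exceeds the queue length limit and 0 otherwise, set the threshold value to 1. How quickly the alarm triggers is controlled by the number of data points that must breach the threshold. For example, if you would like to be notified when all volumes meet the condition for 2 continuous minutes, set Datapoints to alarm to 2 out of 2 under Additional configuration. Select Next.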

This is where you set conditions to be notified.

3. Configure actions to trigger Amazon SNS notifications. Under Configure actions, select In alarm, select Create new topic, and add an email address to receive notifications.

This is where you configure actions to trigger Amazon SNS notifications.

4. Specify an alarm name as EBSPauseIO_alarm and select Next. Preview all the options and select Create Alarm.

This image shows how to add alarm name and description.
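The console steps above can be approximated with the CloudWatch PutMetricAlarm API. The sketch below builds the request for any number of volumes. Because the math expression only ever evaluates to 0 or 1, it uses a threshold of 1 and expresses "two continuous minutes" through EvaluationPeriods and DatapointsToAlarm. The function name, the label, and the missing-data treatment are our assumptions.

```python
def build_stop_alarm(volume_ids, topic_arn: str, limit: int = 10, minutes: int = 2) -> dict:
    """Build PutMetricAlarm arguments: alarm when every volume queue exceeds limit."""
    metrics = []
    metric_ids = []
    for index, volume_id in enumerate(volume_ids, start=1):
        metric_id = f"m{index}"
        metric_ids.append(metric_id)
        metrics.append({
            "Id": metric_id,
            "ReturnData": False,  # inputs to the expression only
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EBS",
                    "MetricName": "VolumeQueueLength",
                    "Dimensions": [{"Name": "VolumeId", "Value": volume_id}],
                },
                "Period": 60,  # 1-minute granularity
                "Stat": "Average",
            },
        })
    condition = " AND ".join(f"{m}>{limit}" for m in metric_ids)
    metrics.append({
        "Id": "e1",
        "Expression": f"IF({condition}, 1, 0)",
        "Label": "AllVolumesQueued",  # hypothetical label
        "ReturnData": True,  # the alarm evaluates this series
    })
    return {
        "AlarmName": "EBSPauseIO_alarm",
        "Metrics": metrics,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Threshold": 1.0,
        "EvaluationPeriods": minutes,
        "DatapointsToAlarm": minutes,
        "TreatMissingData": "notBreaching",  # assumption: gaps are not failures
        "AlarmActions": [topic_arn],
    }
```

The returned dictionary could be passed to boto3.client("cloudwatch").put_metric_alarm(**kwargs) with the ARN of your own SNS topic.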

5. Go back to the AWS FIS console and specify EBSPauseIO_alarm as the stop condition. AWS FIS also allows you to send experiment log data to an Amazon S3 bucket or Amazon CloudWatch Logs.

This is where to specify EBSPauseIO_alarm as stop condition
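In an experiment template defined through the API, the stop condition and log destination are two small structures. The following sketch shows how they might be built for a CloudWatch alarm and a CloudWatch Logs log group. The function name and the ARNs in the example are placeholders, and the log schema version should be checked against the FIS documentation.

```python
def build_guardrails(alarm_arn: str, log_group_arn: str) -> dict:
    """Build the stop condition and logging sections of a FIS experiment template."""
    return {
        # The experiment stops if this alarm enters the ALARM state.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        # Send experiment logs to CloudWatch Logs (an S3 destination is the alternative).
        "logConfiguration": {
            "logSchemaVersion": 2,
            "cloudWatchLogsConfiguration": {"logGroupArn": log_group_arn},
        },
    }


guardrails = build_guardrails(
    "arn:aws:cloudwatch:us-east-1:111122223333:alarm:EBSPauseIO_alarm",
    "arn:aws:logs:us-east-1:111122223333:log-group:/example/fis-experiments:*",
)
```

These keys can be merged into the keyword arguments for create_experiment_template alongside the targets, actions, and role.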

Step 4: Create the experiment template

1. (Optional) Configure Tags by specifying a tag key and tag value. The tags that you add are applied to your experiment template, and not the experiments using the template.

We will wrap up the process by selecting Create experiment template.

This image shows FIS template section to tag experiment template.

2. If the entries are correct, the template is created and we will get a success message.

This image shows that the experiment template is successfully created.

Step 5: Run the AWS FIS experiment

1. Select Start experiment. We will get another warning asking us to confirm that we really want to start this experiment. Confirm by entering “start” and click on Start experiment.

This image shows a warning message to confirm if you really want to run the experiment.
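Starting the experiment from a script is a single API call. The sketch below assumes a boto3 FIS client (or a compatible object) is passed in and returns the ID of the new experiment; the function name and the template ID in the usage note are hypothetical.

```python
import uuid


def start_experiment(fis_client, template_id: str) -> str:
    """Start an experiment from a template and return the experiment ID."""
    response = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token for safe retries
        experimentTemplateId=template_id,
    )
    return response["experiment"]["id"]
```

For example, start_experiment(boto3.client("fis"), "EXT123example") would start a run from a template with that ID. Note that the API does not ask for the typed confirmation that the console requires, so treat the call itself as the confirmation.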

Step 6: Observe various states of the experiment

1. The state of the experiment will go from initiating, running, to eventually completed.

This image shows that the experiment is currently in running state.

2. The experiment is now complete.

This image shows that the experiment has completed.

Step 7: Review application logs to assess how your application responded

1. You can review EBS volume CloudWatch metrics (as demonstrated in Step 4 of Experiment 1) to confirm that the volumes in the specified Availability Zone were not processing I/O during the experiment. This helps you confirm that the experiment ran successfully.

2. Review the KPIs of the application to understand how your application responded to the EBS I/O failure in a single Availability Zone. You can review CloudWatch metrics, application logs, Route 53 health checks, or any third-party monitoring tools integrated with your application to understand its behavior. Check whether your recovery workflow initiated as expected.

3. Based on the result, you can either accept or reject your hypothesis. If you reject your hypothesis, you should review your application architecture and make the necessary changes to improve the resiliency of your application.
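The accept-or-reject decision can be made explicit by writing the hypothesis as a check over KPI samples collected during the experiment window. The sketch below is purely illustrative: the KPI fields and thresholds are hypothetical and should be replaced with whatever defines steady state for your application.

```python
from dataclasses import dataclass


@dataclass
class KpiSample:
    """One observation taken while the experiment was running (hypothetical fields)."""
    healthy_targets: int  # targets passing load balancer health checks
    error_rate: float     # fraction of failed user requests, 0.0 to 1.0


def hypothesis_holds(samples, min_healthy: int = 1, max_error_rate: float = 0.01) -> bool:
    """Accept the hypothesis only if every sample stayed within steady-state bounds."""
    if not samples:
        raise ValueError("no KPI samples collected; the result is inconclusive")
    return all(
        s.healthy_targets >= min_healthy and s.error_rate <= max_error_rate
        for s in samples
    )
```

Treating an empty sample set as an error rather than a pass avoids accepting the hypothesis when monitoring itself failed during the experiment.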

Cleaning up

If you created any AWS resources for the preceding experiments, you can terminate the resources to optimize costs. Please follow the steps in the Delete an Amazon EBS volume or Terminate an EC2 instance documentation as applicable. You can also delete the AWS FIS experiment template by following the steps in the Delete an experiment template documentation.
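Part of the cleanup can also be scripted. The following sketch deletes an experiment template and the stop condition alarm, assuming boto3 FIS and CloudWatch clients (or compatible objects) are passed in. The function name is ours, and the sketch does not delete EBS volumes or EC2 instances.

```python
def clean_up_experiment(fis_client, cloudwatch_client, template_id: str, alarm_name: str) -> None:
    """Delete the FIS experiment template and the CloudWatch alarm used as stop condition."""
    # Remove the template so it cannot be started again by mistake.
    fis_client.delete_experiment_template(id=template_id)
    # Remove the alarm that served as the stop condition.
    cloudwatch_client.delete_alarms(AlarmNames=[alarm_name])
```

For example, clean_up_experiment(boto3.client("fis"), boto3.client("cloudwatch"), "EXT123example", "EBSPauseIO_alarm"), where the template ID is a placeholder for your own.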

Conclusion

In this blog post, we covered the importance of chaos engineering and how it can be used in practice to simulate storage faults. This helps you assess system resiliency against storage disruptions. The sample application scenario in the more advanced Experiment 2 provides you a holistic process for simulating the unresponsive state of EBS volumes in your applications and defining guardrails to safely run such experiments. You are then able to utilize observability tools to understand how your applications respond to paused I/O and identify ways to further improve their resilience. To get started with AWS Fault Injection Service for Amazon EBS, refer to the service documentation.

Thank you for reading this blog. If you have any comments or questions, please leave them in the comments section.