AWS Cloud Operations Blog

Bootstrap your chaos engineering journey with AWS Fault Injection Service Scenarios Library

Ensuring the reliability and resilience of applications is crucial for maintaining business continuity, delivering a superior customer experience, and staying compliant with industry regulations.

As defined in the AWS Well-Architected Framework Reliability Pillar, testing is an important part of ensuring reliability. Chaos engineering is a powerful way not only to test how your systems handle failure conditions, but also to uncover unknowns before they manifest in production.

In this post, you will learn how the AWS Fault Injection Service (FIS) Scenarios Library can make your chaos engineering journey easier. The FIS Scenarios Library, introduced in 2023, provides pre-built experiments to test your application’s resilience, saving you the time required to create custom fault injection workflows.

Each scenario in the library comes with a detailed description of how the experiment will run against your application. You can find the full list of FIS Scenarios Library experiments in the AWS FIS documentation.

Today, we will focus on the AZ Availability: Power Interruption scenario from the FIS Scenarios Library. We will walk through how to set up and run the experiment, and how to use the FIS Scenarios Library’s features, like shared parameters and targeting tags, to efficiently create and manage your chaos engineering experiments.

By the end of this blog, you will know how to use the FIS Scenarios Library to jumpstart your chaos engineering. By proactively testing the resilience of your AWS-based applications, you can ensure that they can handle real-world failures. To follow along, deploy the sample application used in this post, as described in the “Bring your own AWS Account” section of the Fault Injection Service (FIS) workshop V2.

AZ Availability: Power Interruption Scenario

The AZ Availability: Power Interruption scenario from the AWS FIS Scenarios Library lets you simulate a power interruption event in a specific Availability Zone (AZ). This type of event can impact regional and zonal AWS services differently.

Regional AWS services, such as Amazon S3 and Amazon DynamoDB, are designed to be highly resilient to individual AZ impairments. These services leverage multiple AZs within a Region, providing a level of abstraction from the underlying fault domains.

You don’t need to do anything for the service to keep running when there’s an AZ impairment. Regional AWS services will automatically detect the issue and apply necessary mitigation actions. However, you may see brief performance degradation as the service mitigates the issue.

For example, your application might experience some DNS resolution issues when connecting to regional services. This is because Amazon Route 53 (the AWS DNS service) removes the impaired zonal endpoint from the resolution list when health checks fail.

In contrast, zonal services like Amazon EC2 instances in the affected AZ will become unavailable. Auto Scaling groups will try to recover the impaired instances, but this may initially fail as Auto Scaling takes time to acknowledge the AZ impairment. Once it does, it will exclude the affected AZ and recover the instances using the remaining healthy AZs.

EC2 instances that are not in an Auto Scaling group will remain unavailable for the duration of the AZ impairment. When the AZ recovers, these instances will come back online, but a small percentage may have unresponsive Amazon EBS volumes.

Application architecture

To showcase the FIS Scenarios Library, you will use a sample application deployed across three AZs. It uses services such as:

  • Amazon EC2
  • Amazon EKS
  • Amazon ECS
  • Amazon RDS
  • Amazon VPC
  • AWS Lambda
  • Amazon DynamoDB
  • Amazon S3

You can see the detailed architecture in Figure 1.

The image is an architectural diagram depicting a pet adoption application hosted on the AWS Cloud. It shows a Virtual Private Cloud (VPC) with two Availability Zones, each containing infrastructure components such as web servers, application servers, and databases. The components are labeled as “Pet Adoption Web”, “Pet Search API”, “Pet Payment API”, and “Pet Adoption Database”. The diagram also illustrates the flow of requests from users to the different components within the VPC.

Figure 1. Application Architecture

Note: This sample application is available in the AWS Samples repository, and you can deploy it in your AWS account.

Simulating an Availability Zone Power Impairment

Creating a chaos engineering experiment to simulate an AZ impairment for this application manually would be complex. You’d need to consider the impact on each service, define the corresponding fault actions, and ensure that they work together to accurately replicate the AZ impairment scenario.

But with the AZ Availability: Power Interruption scenario from the FIS Scenarios Library, the process is much simpler. The pre-defined experiment can streamline the setup and execution, as shown in Figure 2.

This diagram shows the Pet Adoption application deployed across two Availability Zones within an AWS Virtual Private Cloud (VPC), with the FIS scenario, AZ impairment, and FIS experiment stages shown across the top. Red cross icons mark the resources impaired in the affected Availability Zone, including the Pet Adoption Web interface, Pet Search API, Pet Payment API, and Pet Adoption Database, illustrating how the experiment simulates a power interruption so you can observe the application’s behavior and recovery mechanisms.

Figure 2. AWS Fault Injection Service AZ Power Impairment

This approach lets you focus on observing the AZ impairment’s impact, rather than defining complex experiment workflows.

  1. Creating an experiment using the AZ Availability: Power Interruption scenario
    1. To get started, go to the AWS Resilience Hub and select the AZ Availability: Power Interruption scenario from the Scenario Library. Choose the Create template with scenario button, as shown in Figure 3.
      A screenshot of the AWS Resilience Hub’s Scenario Library, showing various fault injection scenario options such as AZ Availability: Power Interruption, Cross-Region: Connectivity, and different types of EC2 and EKS stress tests for CPU, disk, memory, and network. There is a “Create template with scenario” button in the top right corner.

      Figure 3. Scenario Library

    2. On the next page, define the experiment target. If your application is deployed in a single AWS account, select This AWS Account: 123456789 and choose Confirm. The FIS Scenarios Library also supports applications deployed across multiple AWS accounts, which you can select through the provided option.
  2. Selecting shared parameters
    One of the key benefits of the FIS Scenarios Library is the use of shared parameters. Parameters like affectedAz and affectedRolesForInsufficientCapacityException let you define experiment details in one place, rather than specifying them multiple times. For example, if you created the experiment manually without the FIS Scenarios Library, you’d need to set the affected Availability Zone more than five times. With the affectedAz shared parameter, you define the impacted AZ just once, and the scenario applies it consistently across all the relevant resources and actions. Similarly, the affectedRolesForInsufficientCapacityException parameter lets you specify the IAM roles that may be impacted by the AZ impairment. This reduces the need for repetitive data entry. A short sketch after Figure 4 shows one way to check how the shared AZ value propagates into the generated template.
    A screenshot showing scenario parameters selection for an “AZ Impairment: Power Interruption” scenario in AWS Fault Injection Service. It allows specifying the affected Availability Zone “us-west-2a” and roles for which EC2 instances may face insufficient capacity exceptions during the scenario, including a role titled “FisServerless-FISDummyRoleForASG”.

    Figure 4. Shared parameters
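
    Once the scenario has generated the experiment template, you can confirm this propagation with the AWS SDK. The following is a minimal sketch using boto3; the Region, template ID, and AZ value are placeholders you would replace with your own.

```python
import json

import boto3

# Placeholder values: replace with your Region, the template ID generated by
# the scenario, and the AZ you supplied in the affectedAz shared parameter.
fis = boto3.client("fis", region_name="us-west-2")
template_id = "EXT1a2b3c4d5e6f7"
affected_az = "us-west-2a"

# Fetch the generated experiment template and count how many times the shared
# AZ value appears across its actions and targets.
template = fis.get_experiment_template(id=template_id)["experimentTemplate"]
occurrences = json.dumps(template, default=str).count(affected_az)
print(f"'{affected_az}' appears {occurrences} times in the generated template")
```

    Without the shared parameter, each of those occurrences would be a separate value you would have to keep in sync by hand.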

  3. Providing Target Tags parameters
    Another way the FIS Scenarios Library simplifies chaos engineering is with Targeting Tags parameters. These let you mark the resources you want impacted by the experiment using a specific key-value pair, like AzImpairmentPower:Ready, as shown in Figure 5. A short sketch after the list below shows one way to apply this tag to your resources.
    A screenshot showing advanced parameters for targeting tags in a scenario. It displays multiple rows with input fields labeled “key” and “value” for specifying tag names and values to target different AWS resources like EBS volumes, subnets, Auto Scaling groups, EC2 instances, and ElastiCache Redis clusters. The key field for all rows is pre-populated with “AzImpairmentPower” and the value field is set to “Ready”.

    Figure 5. Targeting Tags example

    This approach serves two purposes:

    • It creates a fault isolation boundary, ensuring only the tagged resources are affected by the experiment. The other resources in the affected AZ remain untouched.
    • Like the shared parameters, you only need to provide the targeting tag information in one place. This reduces manual data entry and improves the overall experience.
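
    Here is a minimal sketch of applying that tag to EC2-backed resources (instances, EBS volumes, and subnets) with boto3; the Region and resource IDs are placeholders. Other resource types used by the scenario, such as Auto Scaling groups and ElastiCache clusters, are tagged through their own service APIs.

```python
import boto3

# Placeholder Region and resource IDs: replace with the instances, EBS
# volumes, and subnets you want the experiment to target.
ec2 = boto3.client("ec2", region_name="us-west-2")
resource_ids = [
    "i-0123456789abcdef0",       # EC2 instance
    "vol-0123456789abcdef0",     # EBS volume
    "subnet-0123456789abcdef0",  # subnet
]

# Apply the targeting tag that the scenario's target definitions look for.
ec2.create_tags(
    Resources=resource_ids,
    Tags=[{"Key": "AzImpairmentPower", "Value": "Ready"}],
)
print(f"Tagged {len(resource_ids)} resources with AzImpairmentPower=Ready")
```
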
  4. Providing Durations parameters
    The AZ Availability: Power Interruption scenario has three distinct phases:

    1. Initial Impairment: This is the starting point, where you begin to notice that something is happening, such as requests being routed to a stale IP address in the affected AZ.
    2. Simulated Impairment: This is the phase where the zonal resources in the affected AZ are impaired and unavailable. During this time, you won’t be able to add capacity in the affected AZ.
    3. Recovery Phase: In this final phase, some faults are still being injected, but the resources start returning to a steady state.

    You can control the duration of each of these phases using the Durations parameter, as shown in Figure 6. The dnsImpactDuration represents the initial phase, outageDuration represents the simulated impairment phase, and recoveryDuration represents the recovery phase.
    A screenshot showing advanced parameters for durations in a scenario analysis. It includes input fields to specify the duration in minutes for DNS Impact, outage, and recovery phases, with values of 2 minutes, 30 minutes, and 30 minutes respectively.

    Figure 6. Impact duration

    It’s important to align these duration parameters with your application’s Recovery Time Objective (RTO). This ensures you can accurately observe how your application behaves during the different stages of the power interruption event. The sketch below shows a simple way to sanity-check the configured durations against your RTO.
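
    As a simple illustration (plain Python, not part of FIS), the snippet below sums the phase durations from Figure 6 and compares the outage window against a hypothetical RTO. The exact end-to-end runtime depends on how the generated actions are sequenced, as shown in the Figure 8 timeline.

```python
# Phase durations from Figure 6, in minutes; adjust to match your experiment.
dns_impact_duration = 2
outage_duration = 30
recovery_duration = 30

# Hypothetical Recovery Time Objective for the application, in minutes.
rto_minutes = 15

total = dns_impact_duration + outage_duration + recovery_duration
print(f"The configured phases add up to roughly {total} minutes")

# The outage phase should be comfortably longer than the RTO, so you can
# observe whether the application actually recovers within its objective.
if outage_duration <= rto_minutes:
    print("Warning: the outage phase is shorter than the RTO, so you may not "
          "observe a full recovery cycle.")
else:
    print(f"The {outage_duration}-minute outage leaves "
          f"{outage_duration - rto_minutes} minutes beyond the "
          f"{rto_minutes}-minute RTO.")
```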

  5. Running the Experiment and Observing the Impact
    Once you’ve created the experiment template using the FIS Scenarios Library, you will see that it has added multiple fault actions based on the parameters you specified, as shown in Figure 7.
    The image shows the list of fault actions added by the AZ Availability: Power Interruption scenario. It includes actions like failover for RDS database clusters, pausing Auto Scaling groups and ElastiCache clusters due to insufficient instance capacity, disrupting network connectivity, stopping EC2 instances and Auto Scaling instances, and pausing EBS volume I/O operations. The actions are ordered in a sequence with specified durations.

    Figure 7. Multiple fault actions added by the AZ Availability: Power Interruption scenario

    Based on the configured Duration parameters, you will be able to observe the timeline of how the experiment unfolds, as illustrated in Figure 8.

    The image shows a timeline depicting the sequence and duration of events that occur during the power interruption scenario. The timeline is divided into multiple rows, with each row representing a specific event or action. The events are listed on the left, and their durations are shown as horizontal bars extending to the right, with a time scale at the bottom ranging from 0 seconds to 64 minutes. The timeline illustrates how long each of the fault actions runs, along with the total duration of the experiment.

    Figure 8. Timeline for the AZ Availability: Power Interruption scenario

    Before running the experiment, you should define a hypothesis that you want to verify or learn from. For example, your hypothesis might be: “With 20 requests per second, the Largest Contentful Paint P90 will be less than 3 seconds, even if the power in one AZ is interrupted for 30 minutes, and user requests will shift to the other AZ”. By using a tool like Amazon CloudWatch RUM to monitor user experience metrics, such as Largest Contentful Paint, along with other application-level metrics, you can observe your application’s behavior during the different phases of the experiment and validate your hypothesis, as shown on the Amazon CloudWatch dashboard in Figure 9.

    Line graph showing application behavior during an availability zone power interruption scenario. The graph displays steady state traffic levels, a simulated power interruption causing a traffic drop to near zero and negative user experience, traffic shifting to another availability zone during the interruption, and traffic returning to steady state levels once the interrupted zone recovers. Timeline covers the full scenario duration.

    Figure 9. Observing application behavior during the AZ Availability: Power Interruption scenario

    In the case of the sample application, the results showed that only 3% of users experienced longer-than-usual page load times, which was within the threshold, confirming the initial hypothesis.
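
    If you prefer to drive the experiment outside the console, for example from a CI pipeline, the generated template can also be started and monitored with the AWS SDK. This is a minimal sketch, assuming the placeholder Region and template ID are replaced with your own; it starts the experiment and polls its state until FIS reports a terminal status.

```python
import time
import uuid

import boto3

# Placeholder Region and template ID: replace with your own values.
fis = boto3.client("fis", region_name="us-west-2")
template_id = "EXT1a2b3c4d5e6f7"

# Start the experiment; the client token makes the request idempotent.
experiment = fis.start_experiment(
    clientToken=str(uuid.uuid4()),
    experimentTemplateId=template_id,
)["experiment"]
experiment_id = experiment["id"]
print(f"Started experiment {experiment_id}")

# Poll the experiment until FIS reports a terminal status.
while True:
    state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
    print(f"Status: {state['status']} - {state.get('reason', '')}")
    if state["status"] in ("completed", "stopped", "failed"):
        break
    time.sleep(60)
```

    While the experiment runs, keep your CloudWatch dashboards, such as the RUM metrics in Figure 9, alongside the experiment status so you can correlate each phase with the observed user experience.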

Clean Up

After you have run the experiment using the AZ Availability: Power Interruption scenario, delete the resources created for this blog by following the clean-up instructions to avoid incurring costs.

Conclusion

In this post, we’ve explored how the AWS Fault Injection Service (FIS) Scenarios Library can make your chaos engineering efforts easier. By focusing on the AZ Availability: Power Interruption scenario, we showed the benefits of using pre-made experiments to test your AWS application’s resilience.

The FIS Scenarios Library has many other experiments beyond the one we covered. We encourage you to explore the full library and the Fault Injection Service (FIS) workshop V2. This will help you expand your chaos engineering skills and build more reliable, fault-tolerant systems.

When running FIS experiments, also consider security. FIS provides strong security features, including IAM-based access control, AWS Config integration, stop conditions, and safety levers, to ensure safe and controlled chaos engineering.

By using the FIS Scenarios Library, you can seamlessly add chaos engineering to your application development. This leads to more resilient and high-performing AWS solutions.

Saurabh Kumar

Saurabh Kumar is a Senior Solutions Architect based out of North Carolina, USA. He is passionate about helping customers solve their business challenges and technical problems, from migration to modernization and optimization. Outside of work, he spends time with his family watching TV, gardening, and enjoying outdoor activities.

Vladislav Nedosekin

Vladislav Nedosekin is a Principal Solutions Architect based out of London, UK, with over 20 years of experience designing and implementing mission-critical services and applications. At Amazon Web Services, he guides leading financial institutions in architecting innovative, cloud-native solutions with a focus on resilience and chaos engineering. Vladislav has extensive expertise helping customers leverage cutting-edge cloud technologies, including serverless and generative AI, to build highly reliable, scalable solutions.