AWS Cloud Operations Blog

Bootstrap your chaos engineering journey with AWS Fault Injection Service Scenarios Library

Ensuring the reliability and resilience of applications is crucial for maintaining business continuity, delivering a superior customer experience, and staying compliant with industry regulations.

As defined in the AWS Well-Architected Framework Reliability Pillar, testing is an important part of ensuring reliability. Chaos engineering is a powerful way not only to test how your systems handle failure conditions, but also to uncover unknowns before they manifest in production.

In this post, you will learn how the AWS Fault Injection Service (FIS) Scenarios Library can make your chaos engineering journey easier. The FIS Scenarios Library, introduced in 2023, provides pre-built experiments to test your application’s resilience, saving you the time required to create custom fault injection workflows.

Each scenario in the library comes with a detailed description of how the experiment will run against your application. You can find the full list of FIS Scenarios Library experiments in the AWS FIS documentation.

Today, we will focus on the AZ Availability: Power Interruption scenario from the FIS Scenarios Library. We will walk through how to set up and run the experiment, and how to use the FIS Scenarios Library’s features, like shared parameters and targeting tags, to efficiently create and manage your chaos engineering experiments.

By the end of this blog, you will know how to use the FIS Scenarios Library to jumpstart your chaos engineering. By proactively testing the resilience of your AWS-based applications, you can ensure that they can handle real-world failures. To follow along, deploy the sample application used in this post, as described in the “Bring your own AWS Account” section of the Fault Injection Service (FIS) workshop V2.

AZ Availability: Power Interruption Scenario

The AZ Availability: Power Interruption scenario from the AWS FIS Scenarios Library lets you simulate a power interruption event in a specific Availability Zone (AZ). This type of event can impact regional and zonal AWS services differently.

Regional AWS services, such as Amazon S3 and Amazon DynamoDB, are designed to be highly resilient to individual AZ impairments. These services leverage multiple AZs within a Region, providing a level of abstraction from the underlying fault domains.

You don’t need to do anything for the service to keep running when there’s an AZ impairment. Regional AWS services will automatically detect the issue and apply necessary mitigation actions. However, you may see brief performance degradation as the service mitigates the issue.

For example, your application might experience some DNS resolution issues when connecting to regional services. This is because Amazon Route 53 (the AWS DNS service) removes the impaired zonal endpoint from the resolution list when health checks fail.

In contrast, zonal services like Amazon EC2 instances in the affected AZ will become unavailable. Auto Scaling groups will try to recover the impaired instances, but this may initially fail as Auto Scaling takes time to acknowledge the AZ impairment. Once it does, it will exclude the affected AZ and recover the instances using the remaining healthy AZs.

EC2 instances that are not in an Auto Scaling group will remain unavailable for the duration of the AZ impairment. When the AZ recovers, these instances will come back online, but a small percentage may have unresponsive Amazon EBS volumes.

Application architecture

To showcase the FIS Scenarios Library, you will use a sample application deployed across three AZs. It uses services such as:

  • Amazon EC2
  • Amazon EKS
  • Amazon ECS
  • Amazon RDS
  • Amazon VPC
  • AWS Lambda
  • Amazon DynamoDB
  • Amazon S3

You can see the detailed architecture in Figure 1.

The image is an architectural diagram depicting a pet adoption application hosted on the AWS Cloud. It shows a Virtual Private Cloud (VPC) with two Availability Zones, each containing infrastructure components such as web servers, application servers, and databases. The components are labeled as “Pet Adoption Web”, “Pet Search API”, “Pet Payment API”, and “Pet Adoption Database”. The diagram also illustrates the flow of requests from users to the different components within the VPC.

Figure 1. Application Architecture

Note: This sample application is available in the AWS Samples repository, and you can deploy it in your AWS account.

Simulating an Availability Zone Power Impairment

Creating a chaos engineering experiment to simulate an AZ impairment for this application manually would be complex. You’d need to consider the impact on each service, define the corresponding fault actions, and ensure that they work together to accurately replicate the AZ impairment scenario.

But with the AZ Availability: Power Interruption scenario from the FIS Scenarios Library, the process is much simpler. The pre-defined experiment can streamline the setup and execution, as shown in Figure 2.

This diagram shows the Pet Adoption application deployed across two Availability Zones within an AWS Virtual Private Cloud (VPC), with the FIS scenario, AZ impairment, and FIS experiment stages shown across the top. Red cross icons mark the resources impaired in the affected Availability Zone, including the Pet Adoption Web interface, Pet Search API, Pet Payment API, and Pet Adoption Database, illustrating how the experiment simulates a power interruption so you can observe the application’s behavior and recovery mechanisms.

Figure 2. AWS Fault Injection Service AZ Power Impairment

This approach lets you focus on observing the AZ impairment’s impact, rather than defining complex experiment workflows.

  1. Creating an experiment using the AZ Availability: Power Interruption scenario
    1. To get started, go to the AWS Resilience Hub and select the AZ Availability: Power Interruption scenario from the Scenario Library. Choose the Create template with scenario button, as shown in Figure 3.
      A screenshot of the AWS Resilience Hub’s Scenario Library, showing various fault injection scenario options such as AZ Availability: Power Interruption, Cross-Region: Connectivity, and different types of EC2 and EKS stress tests for CPU, disk, memory, and network. There is a “Create template with scenario” button in the top right corner.

      Figure 3. Scenario Library

    2. On the next page, define the experiment target. If your application is deployed in a single AWS account, select This AWS Account: 123456789 and choose Confirm. The FIS Scenarios Library also supports applications deployed across multiple AWS accounts, which you can select through the provided option.
  2. Selecting shared parameters
    One of the key benefits of the FIS Scenarios Library is the use of shared parameters. Parameters like affectedAz and affectedRolesForInsufficientCapacityException let you define experiment details in one place, rather than specifying them multiple times. For example, if you created the experiment manually without the FIS Scenarios Library, you’d need to set the affected Availability Zone more than five times. With the affectedAz shared parameter, you define the impacted AZ just once, and the scenario applies it consistently across all the relevant resources and actions. Similarly, the affectedRolesForInsufficientCapacityException parameter lets you specify the IAM roles that may be impacted by the AZ impairment. This reduces the need for repetitive data entry. A short sketch after Figure 4 shows one way to check how the shared AZ value propagates into the generated template.
    A screenshot showing scenario parameters selection for an “AZ Impairment: Power Interruption” scenario in AWS Fault Injection Service. It allows specifying the affected Availability Zone “us-west-2a” and roles for which EC2 instances may face insufficient capacity exceptions during the scenario, including a role titled “FisServerless-FISDummyRoleForASG”.

    Figure 4. Shared parameters
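
    Once the scenario has generated the experiment template, you can confirm this propagation with the AWS SDK. The following is a minimal sketch using boto3; the Region, template ID, and AZ value are placeholders you would replace with your own.

```python
import json

import boto3

# Placeholder values: replace with your Region, the template ID generated by
# the scenario, and the AZ you supplied in the affectedAz shared parameter.
fis = boto3.client("fis", region_name="us-west-2")
template_id = "EXT1a2b3c4d5e6f7"
affected_az = "us-west-2a"

# Fetch the generated experiment template and count how many times the shared
# AZ value appears across its actions and targets.
template = fis.get_experiment_template(id=template_id)["experimentTemplate"]
occurrences = json.dumps(template, default=str).count(affected_az)
print(f"'{affected_az}' appears {occurrences} times in the generated template")
```

    Without the shared parameter, each of those occurrences would be a separate value you would have to keep in sync by hand.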

  3. Providing Target Tags parameters
    Another way the FIS Scenarios Library simplifies chaos engineering is with Targeting Tags parameters. These let you mark the resources you want impacted by the experiment using a specific key-value pair, like AzImpairmentPower:Ready, as shown in Figure 5. A short sketch after the list below shows one way to apply this tag to your resources.
    A screenshot showing advanced parameters for targeting tags in a scenario. It displays multiple rows with input fields labeled “key” and “value” for specifying tag names and values to target different AWS resources like EBS volumes, subnets, Auto Scaling groups, EC2 instances, and ElastiCache Redis clusters. The key field for all rows is pre-populated with “AzImpairmentPower” and the value field is set to “Ready”.

    Figure 5. Targeting Tags example

    This approach serves two purposes:

    • It creates a fault isolation boundary, ensuring only the tagged resources are affected by the experiment. The other resources in the affected AZ remain untouched.
    • Like the shared parameters, you only need to provide the targeting tag information in one place. This reduces manual data entry and improves the overall experience.
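
    Here is a minimal sketch of applying that tag to EC2-backed resources (instances, EBS volumes, and subnets) with boto3; the Region and resource IDs are placeholders. Other resource types used by the scenario, such as Auto Scaling groups and ElastiCache clusters, are tagged through their own service APIs.

```python
import boto3

# Placeholder Region and resource IDs: replace with the instances, EBS
# volumes, and subnets you want the experiment to target.
ec2 = boto3.client("ec2", region_name="us-west-2")
resource_ids = [
    "i-0123456789abcdef0",       # EC2 instance
    "vol-0123456789abcdef0",     # EBS volume
    "subnet-0123456789abcdef0",  # subnet
]

# Apply the targeting tag that the scenario's target definitions look for.
ec2.create_tags(
    Resources=resource_ids,
    Tags=[{"Key": "AzImpairmentPower", "Value": "Ready"}],
)
print(f"Tagged {len(resource_ids)} resources with AzImpairmentPower=Ready")
```
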
  4. Providing Durations parameters
    The AZ Availability: Power Interruption scenario has three distinct phases:

    1. Initial Impairment: This is the starting point, where you begin to notice that something is happening, such as requests being routed to a stale IP address in the affected AZ.
    2. Simulated Impairment: This is the phase where the zonal resources in the affected AZ are impaired and unavailable. During this time, you won’t be able to add capacity in the affected AZ.
    3. Recovery Phase: In this final phase, some faults are still being injected, but the resources start returning to a steady state.

    You can control the duration of each of these phases using the Durations parameter, as shown in Figure 6. The dnsImpactDuration represents the initial phase, outageDuration represents the simulated impairment phase, and recoveryDuration represents the recovery phase.
    A screenshot showing advanced parameters for durations in a scenario analysis. It includes input fields to specify the duration in minutes for DNS Impact, outage, and recovery phases, with values of 2 minutes, 30 minutes, and 30 minutes respectively.

    Figure 6. Impact duration

    It’s important to align these duration parameters with your application’s Recovery Time Objective (RTO). This ensures you can accurately observe how your application behaves during the different stages of the power interruption event. The sketch below shows a simple way to sanity-check the configured durations against your RTO.
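
    As a simple illustration (plain Python, not part of FIS), the snippet below sums the phase durations from Figure 6 and compares the outage window against a hypothetical RTO. The exact end-to-end runtime depends on how the generated actions are sequenced, as shown in the Figure 8 timeline.

```python
# Phase durations from Figure 6, in minutes; adjust to match your experiment.
dns_impact_duration = 2
outage_duration = 30
recovery_duration = 30

# Hypothetical Recovery Time Objective for the application, in minutes.
rto_minutes = 15

total = dns_impact_duration + outage_duration + recovery_duration
print(f"The configured phases add up to roughly {total} minutes")

# The outage phase should be comfortably longer than the RTO, so you can
# observe whether the application actually recovers within its objective.
if outage_duration <= rto_minutes:
    print("Warning: the outage phase is shorter than the RTO, so you may not "
          "observe a full recovery cycle.")
else:
    print(f"The {outage_duration}-minute outage leaves "
          f"{outage_duration - rto_minutes} minutes beyond the "
          f"{rto_minutes}-minute RTO.")
```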

  5. Running the Experiment and Observing the Impact
    Once you’ve created the experiment template using the FIS Scenarios Library, you will see that it has added multiple fault actions based on the parameters you specified, as shown in Figure 7.
    The image shows the list of fault actions added by the AZ Availability: Power Interruption scenario. It includes actions like failover for RDS database clusters, pausing Auto Scaling groups and ElastiCache clusters due to insufficient instance capacity, disrupting network connectivity, stopping EC2 instances and Auto Scaling instances, and pausing EBS volume I/O operations. The actions are ordered in a sequence with specified durations.

    Figure 7. Multiple fault actions added by the AZ Availability: Power Interruption scenario

    Based on the configured Duration parameters, you will be able to observe the timeline of how the experiment unfolds, as illustrated in Figure 8.

    The image shows a timeline depicting the sequence and duration of events that occur during the power interruption scenario. The timeline is divided into multiple rows, with each row representing a specific event or action. The events are listed on the left, and their durations are shown as horizontal bars extending to the right, with a time scale at the bottom ranging from 0 seconds to 64 minutes. The timeline illustrates how long each of the fault actions runs, along with the total duration of the experiment.

    Figure 8. Timeline for the AZ Availability: Power Interruption scenario

    Before running the experiment, you should define a hypothesis that you want to verify or learn from. For example, your hypothesis might be: “With 20 requests per second, the Largest Contentful Paint P90 will be less than 3 seconds, even if the power in one AZ is interrupted for 30 minutes, and user requests will shift to the other AZ”. By using a tool like Amazon CloudWatch RUM to monitor user experience metrics, such as Largest Contentful Paint, along with other application-level metrics, you can observe your application’s behavior during the different phases of the experiment and validate your hypothesis, as shown on the Amazon CloudWatch dashboard in Figure 9.

    Line graph showing application behavior during an availability zone power interruption scenario. The graph displays steady state traffic levels, a simulated power interruption causing a traffic drop to near zero and negative user experience, traffic shifting to another availability zone during the interruption, and traffic returning to steady state levels once the interrupted zone recovers. Timeline covers the full scenario duration.

    Figure 9. Observing application behavior during the AZ Availability: Power Interruption scenario

    In the case of the sample application, the results showed that only 3% of users experienced longer-than-usual page load times, which was within the threshold, confirming the initial hypothesis.
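
    If you prefer to drive the experiment outside the console, for example from a CI pipeline, the generated template can also be started and monitored with the AWS SDK. This is a minimal sketch, assuming the placeholder Region and template ID are replaced with your own; it starts the experiment and polls its state until FIS reports a terminal status.

```python
import time
import uuid

import boto3

# Placeholder Region and template ID: replace with your own values.
fis = boto3.client("fis", region_name="us-west-2")
template_id = "EXT1a2b3c4d5e6f7"

# Start the experiment; the client token makes the request idempotent.
experiment = fis.start_experiment(
    clientToken=str(uuid.uuid4()),
    experimentTemplateId=template_id,
)["experiment"]
experiment_id = experiment["id"]
print(f"Started experiment {experiment_id}")

# Poll the experiment until FIS reports a terminal status.
while True:
    state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
    print(f"Status: {state['status']} - {state.get('reason', '')}")
    if state["status"] in ("completed", "stopped", "failed"):
        break
    time.sleep(60)
```

    While the experiment runs, keep your CloudWatch dashboards, such as the RUM metrics in Figure 9, alongside the experiment status so you can correlate each phase with the observed user experience.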

Clean Up

After you have run the experiment using the AZ Availability: Power Interruption scenario, delete the resources created for this blog by following the clean-up instructions to avoid incurring costs.

Conclusion

In this post, we’ve explored how the AWS Fault Injection Service (FIS) Scenarios Library can make your chaos engineering efforts easier. By focusing on the AZ Availability: Power Interruption scenario, we showed the benefits of using pre-made experiments to test your AWS application’s resilience.

The FIS Scenarios Library has many other experiments beyond the one we covered. We encourage you to explore the full library and the Fault Injection Service (FIS) workshop V2. This will help you expand your chaos engineering skills and build more reliable, fault-tolerant systems.

When running FIS experiments, also consider security. FIS provides strong security features, including IAM-based access control, AWS Config integration, stop conditions, and safety levers, to ensure safe and controlled chaos engineering.

By using the FIS Scenarios Library, you can seamlessly add chaos engineering to your application development. This leads to more resilient and high-performing AWS solutions.

Saurabh Kumar

Saurabh Kumar is a Senior Solutions Architect based out of North Carolina, USA. He is passionate about helping customers solve their business challenges and technical problems, from migration to modernization and optimization. Outside of work, he spends time with his family watching TV, gardening, and enjoying outdoor activities.

Vladislav Nedosekin

Vladislav Nedosekin is a Principal Solutions Architect based out of London, UK, with over 20 years of experience designing and implementing mission-critical services and applications. At Amazon Web Services, he guides leading financial institutions in architecting innovative, cloud-native solutions with a focus on resilience and chaos engineering. Vladislav has extensive expertise helping customers leverage cutting-edge cloud technologies, including serverless and generative AI, to build highly reliable, scalable solutions.