AWS Architecture Blog

Chaos Testing with AWS Fault Injection Simulator and AWS CodePipeline

The COVID-19 pandemic has proven to be the largest stress test of our technology infrastructures in generations. A meteoric increase in internet consumption followed, due in large part to working and schooling from home. The chaotic, early months of the pandemic have clearly demonstrated the value of resiliency in production. How can we better prepare our critical systems for these global events in the future? A more modern approach to testing and validating your application architecture is needed. Chaos engineering has emerged as an innovative approach to solving these types of challenges.

This blog post shows an architecture pattern for automating chaos testing as part of your continuous integration/continuous delivery (CI/CD) process. By automating chaos experiments inside CI/CD pipelines, you can test complex risks and modeled failure scenarios against application environments with every deployment. Application teams can use the results of these experiments to prioritize improvements in their architecture. These results will give your team the confidence they need to operate in an unpredictable production environment.

AWS Fault Injection Simulator (FIS) is a managed service that enables you to perform fault injection experiments on your AWS workloads. Fault injection is based on the principles of chaos engineering. These experiments stress an application by creating disruptive events so that you can observe how your application responds. You can then use this information to improve the performance and resiliency of your applications. With AWS FIS, you set up and run experiments that help you create the real-world conditions needed to uncover application issues.

AWS CodePipeline is a fully managed continuous delivery service for fast and reliable application and infrastructure updates. You can use AWS CodePipeline to model and automate your software release processes. Automating your build, test, and release process allows you to quickly test each code change. You can ensure the quality of your application or infrastructure code by running each change through your staging and release process.

Continuous chaos testing

Figure 1. High-level architecture pattern for automating chaos engineering


Create FIS experiments

Begin by creating an FIS experiment template, configuring one or more actions (an action set) to run against the target resources of the application architecture. Here we have created an action to stop the Amazon EC2 instances in our Amazon Elastic Container Service (ECS) cluster, identified by a tag. Target resources can be identified by resource IDs, filters, or tags. We can also set action parameters, such as a duration, and specify whether an action runs before or after other actions. Additionally, you can set up Amazon CloudWatch alarms as stop conditions that halt an experiment once a particular threshold or boundary has been reached. In Increase your e-commerce website reliability using chaos engineering and AWS Fault Injection Simulator, Bastien Leblanc shares how to set up CloudWatch metric thresholds as stop conditions for experiments.
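As a rough illustration of the template described above, the following Python sketch builds a request body that could be passed to the `create_experiment_template` operation of the AWS FIS API through boto3. The function name, the target and action labels (`ecsInstances`, `stopInstances`), the five-minute restart duration, and the tag key and value are assumptions made for this example; they do not come from the template shown in the figure.

```python
import uuid


def build_stop_instances_template(role_arn, tag_key, tag_value, alarm_arn=None):
    """Build a request body for the FIS create_experiment_template call.

    All names and values here are illustrative assumptions, not the
    configuration used in the original post.
    """
    # Use a CloudWatch alarm as the stop condition when one is supplied;
    # otherwise declare explicitly that there is no stop condition.
    if alarm_arn:
        stop_conditions = [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}]
    else:
        stop_conditions = [{"source": "none"}]

    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "description": "Stop tagged EC2 instances in the ECS cluster",
        "roleArn": role_arn,  # role that FIS assumes to run the actions
        "stopConditions": stop_conditions,
        "targets": {
            # Hypothetical target label; instances are selected by tag.
            "ecsInstances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "ALL",
            }
        },
        "actions": {
            # Hypothetical action label; stops the targeted instances and
            # restarts them after five minutes (ISO 8601 duration).
            "stopInstances": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "ecsInstances"},
            }
        },
    }


# Example usage (requires boto3 and AWS credentials, so it is left commented):
# import boto3
# template = build_stop_instances_template(role_arn, "Cluster", "demo")
# boto3.client("fis").create_experiment_template(**template)
```

Keeping the request body in a plain function makes it easy to review and version alongside the pipeline definition, although the console workflow shown in the figure achieves the same result.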

Figure 2. AWS FIS experiment template


Author AWS Lambda to initiate FIS experiments

Create or attach an IAM role with FIS permissions to the Lambda function in the permissions section of its configuration. To start a specific FIS experiment, we pass the experimentTemplateId parameter in our Lambda code. Refer to the AWS FIS API Reference when writing your Lambda code. To integrate the Lambda function into your pipeline, you can create a new AWS CodePipeline pipeline or use an existing one. Add a new pipeline stage after the deployment stage; this stage invokes our Lambda function, which launches the FIS experiment.
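A minimal sketch of such a Lambda function might look like the following. It is not the code shown in the figure: the helper name `run_experiment_stage` and the placeholder template ID are assumptions for illustration. It starts the experiment with the FIS `start_experiment` operation and then reports the outcome back to CodePipeline, which a Lambda function invoked from a pipeline action is expected to do.

```python
import uuid

# Placeholder only; replace with the ID of your own experiment template.
TEMPLATE_ID = "REPLACE_WITH_TEMPLATE_ID"


def run_experiment_stage(event, fis_client, pipeline_client, template_id):
    """Start an FIS experiment and report the result to CodePipeline."""
    # CodePipeline passes the job ID in the invocation event.
    job_id = event["CodePipeline.job"]["id"]
    try:
        response = fis_client.start_experiment(
            clientToken=str(uuid.uuid4()),  # idempotency token
            experimentTemplateId=template_id,
        )
        experiment_id = response["experiment"]["id"]
        pipeline_client.put_job_success_result(jobId=job_id)
        return {"experimentId": experiment_id}
    except Exception as error:
        # Tell the pipeline that this stage failed so the run does not hang.
        pipeline_client.put_job_failure_result(
            jobId=job_id,
            failureDetails={"type": "JobFailed", "message": str(error)[:500]},
        )
        return {"experimentId": None}


def lambda_handler(event, context):
    # boto3 is imported lazily so the helper above can be exercised
    # locally with stand-in clients.
    import boto3

    return run_experiment_stage(
        event, boto3.client("fis"), boto3.client("codepipeline"), TEMPLATE_ID
    )
```

Note that this sketch marks the pipeline job as successful as soon as the experiment has started; it does not wait for the experiment to finish. Whether the stage should wait for completion is a design choice that depends on how long your experiments run.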

Figure 3. AWS Lambda function initiating AWS FIS experiment


Figure 4. AWS CodePipeline with AWS FIS experiment stage


The experimentTemplateId parameter can also be stored as a key/value environment variable in your Lambda function configuration. This is useful because it allows you to change your FIS experiment template without adjusting your function code. You can reuse the same Lambda function code across multiple environments on the way to production by injecting a different experimentTemplateId in each.
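One way to read the template ID from the function configuration is sketched below. The variable name `EXPERIMENT_TEMPLATE_ID` is an assumption chosen for this example; use whatever key you define in your own Lambda configuration.

```python
import os

# Hypothetical environment variable name, chosen for this example.
TEMPLATE_ID_VARIABLE = "EXPERIMENT_TEMPLATE_ID"


def resolve_template_id(environ=os.environ):
    """Return the experiment template ID configured for this environment."""
    template_id = environ.get(TEMPLATE_ID_VARIABLE, "").strip()
    if not template_id:
        # Fail fast with a clear message rather than calling FIS with an
        # empty template ID.
        raise RuntimeError(
            f"Environment variable {TEMPLATE_ID_VARIABLE} is not set"
        )
    return template_id
```

Failing early when the variable is missing makes a misconfigured environment visible in the pipeline run instead of surfacing later as an opaque API error.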

Verify FIS experiment results on deployed application

By continuously performing fault injection post-deployment in AWS CodePipeline, you learn about complex failure conditions that you must address. A notification rule can start user experience and availability testing on your application while the FIS experiment runs. In AWS CodePipeline, notification rules can target an Amazon Simple Notification Service (SNS) topic or an AWS Chatbot integration. If you want to automate experience testing on the candidate application while FIS experiments are running, you can use CloudWatch Synthetics.
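To make the idea of availability testing during an experiment concrete, here is a small sketch that probes an application endpoint while an experiment is still in progress and reports the fraction of successful probes. It is a simplified stand-in for a proper canary and is not part of the original architecture; the function names, the polling interval, and the probe logic are all assumptions for illustration. The experiment status is read with the FIS `get_experiment` operation.

```python
import time
import urllib.request

# Experiment states after which no further probing is useful.
TERMINAL_STATES = {"completed", "stopped", "failed"}


def http_probe(url, timeout=3.0):
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # Covers connection errors, timeouts, and HTTP error responses.
        return False


def measure_availability(fis_client, experiment_id, probe,
                         interval_seconds=5.0, max_checks=120):
    """Probe the application until the experiment reaches a terminal state.

    Returns the fraction of successful probes, or None if the experiment
    had already finished before the first probe.
    """
    attempts = 0
    successes = 0
    for _ in range(max_checks):
        experiment = fis_client.get_experiment(id=experiment_id)["experiment"]
        if experiment["state"]["status"] in TERMINAL_STATES:
            break
        attempts += 1
        if probe():
            successes += 1
        time.sleep(interval_seconds)
    return successes / attempts if attempts else None


# Example usage (requires boto3, credentials, and a reachable endpoint):
# import boto3
# ratio = measure_availability(
#     boto3.client("fis"), experiment_id,
#     lambda: http_probe("https://example.com/health"),
# )
```

A function like this could be invoked by a subscriber to the notification topic; for richer user experience checks, CloudWatch Synthetics canaries are likely a better fit than a hand-written probe.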

Figure 5. AWS CodePipeline notification rule setting


Summary

Using AWS CodePipeline to automate chaos engineering experiments on application architecture with AWS FIS is straightforward. Following are some benefits from automating fault injection testing in our CI/CD pipelines:

  • Our team can achieve a higher degree of confidence in meeting the resiliency requirements of our application. We take a more modern approach to testing by automating this experimentation inside our existing CI/CD process with AWS CodePipeline.
  • We know more about the unknown risks to our application. All testing results we receive provide benefit and learning opportunities for our team. We use these results to understand what we do well, where we need to improve, or what we are willing to tolerate based on our application requirements.
  • Continuously evaluating our architectural fitness inside CI/CD allows our team to validate the impact each feature or component iteration has on the resiliency of our application.

However, the value of automating chaos testing is not limited to finding, fixing, or documenting the risks that surface in our application. Additional confidence is gained through constantly validating your operational practices, such as alerts and alarms, monitoring, and notifications.

FIS gives you a controlled and repeatable way to reproduce the conditions needed to fine-tune your operational procedures and runbooks. Automating this testing inside a CI/CD pipeline ensures a nearly continuous feedback loop for these operational practices.

Matt Chastain


Matt Chastain is an AWS Solutions Architect based out of Chester County, PA. As a 20-year veteran of the tech industry, Matt enjoys designing architecture solutions that reduce complexity and drive business value. Outside of work, he enjoys playing golf, hydroponic gardening, and spending time with his wife Jessica and daughters Emma, Olivia, and Sophia.

Jennifer Moran


Jennifer Moran is an AWS Solutions Architect based out of New York City. She has a diverse background, having worked in many technical disciplines including Software Development, Agile Leadership, and DevOps. She enjoys helping customers design creative solutions for technical challenges. Traveling and being outdoors with her family are some of her favorite pastimes.

Pavankumar Kasani


Pavankumar Kasani is an AWS Solutions Architect based out of New York City. He is passionate about helping customers design scalable, well-architected, and modernized solutions on the AWS Cloud. Outside of work, he loves spending time with his family, playing cricket and table tennis, and testing out new recipes in the kitchen.