AWS Cloud Operations Blog

How to use Resilience Hub’s Fault Injection Experiments to test application’s resilience

In this post, you’ll learn how to use AWS Fault Injection Simulator (AWS FIS) and AWS Resilience Hub to test the resilience of a simple serverless application. Resilience Hub lets you define, validate, and track the resiliency of your AWS application.

Resilience Hub integrates with AWS FIS, a chaos engineering service, to provide fault-injection simulations of real-world failures. These include network errors, application processing errors, or too many open connections to a database. You use these tests to validate that the application recovers within the resilience target objective.

AWS recommends thorough testing to verify an application’s resiliency against different reasons for an outage. These outages could include internal application issues or availability of Regional AWS services.

AWS FIS tests allow for the following:

  • Injection of various failure events for supported AWS services and resources
  • Verification that existing alarms can detect an outage or critical issue
  • Verification that recovery procedures, or Standard Operating Procedures (SOPs), correctly recover the application from the outage

Let’s walk through setting up FIS for a sample serverless application.

Step 1: Deploy sample application and prerequisites

The sample provided is a simple serverless application with the following architecture.

Figure 1 Sample serverless architecture


  1. Amazon EventBridge Rule – Configured to run every minute to invoke the SQS Lambda function
  2. SQS Lambda Function – Once invoked, sends a sample message to the Amazon Simple Queue Service (Amazon SQS) queue
  3. Amazon SQS queue – Once a message is received, it will invoke a separate Processing Lambda Function
  4. Processing Lambda Function – Accepts messages from Amazon SQS queue but doesn’t take any additional action
  5. CloudWatch Alarm – Used as a Stop Condition in Resilience Hub to minimize impact to application*
  6. Amazon Simple Notification Service (SNS) Topic – Notifications for the Resilience Hub alarms that a later step creates

*All tests for Resilience Hub and AWS FIS utilize existing AWS resources and will impact the target application and correlated traffic. If the application is serving Production traffic, set a Stop Condition to minimize the impact on your application and stop the FIS experiment(s).
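The post does not show the code behind the two Lambda functions, so the following is only a minimal Python sketch of what they might look like. The handler names, the `QUEUE_URL` environment variable, and the message payload are assumptions made for illustration. The SQS client can be passed in, which lets the sketch run without AWS credentials.

```python
import json
import os


def build_sample_message(sent_at: str) -> str:
    """Return the JSON body for one sample message (hypothetical payload)."""
    return json.dumps({"source": "sample-app", "sent_at": sent_at})


def sqs_sender_handler(event, context, sqs_client=None):
    """Invoked by the EventBridge rule; sends one sample message to the queue."""
    if sqs_client is None:
        import boto3  # imported lazily so the sketch loads without boto3

        sqs_client = boto3.client("sqs")
    # Scheduled EventBridge events carry a "time" field with the trigger time.
    body = build_sample_message(event.get("time", ""))
    return sqs_client.send_message(
        QueueUrl=os.environ["QUEUE_URL"],  # assumed environment variable
        MessageBody=body,
    )


def processing_handler(event, context):
    """Invoked by the SQS event source; accepts messages and does nothing else."""
    records = event.get("Records", [])
    return {"received": len(records)}
```

When the processing handler returns without an error, Lambda deletes the batch from the queue on your behalf. That delete call is what the experiment configured later in this post interferes with.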

Deploy the sample application using the AWS Management Console

In your account, launch the AWS CloudFormation template by choosing the following Launch Stack button. It will take approximately 10 minutes for the CloudFormation stack to complete.
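If you prefer to script the deployment rather than use the console, a sketch along these lines could work. It assumes you know the template's S3 URL (the post only provides it through the Launch Stack button), that `cfn_client` is a boto3 CloudFormation client, and that the template creates IAM resources and therefore needs the capabilities shown. The stack name and helper names are hypothetical.

```python
from typing import Dict, List, Optional


def to_cfn_parameters(values: Optional[Dict[str, str]]) -> List[dict]:
    """Convert a plain dict into the Parameters list that create_stack expects."""
    return [
        {"ParameterKey": key, "ParameterValue": value}
        for key, value in (values or {}).items()
    ]


def deploy_stack(cfn_client, stack_name: str, template_url: str,
                 parameters: Optional[Dict[str, str]] = None) -> str:
    """Create the stack and block until CloudFormation reports completion."""
    cfn_client.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        Parameters=to_cfn_parameters(parameters),
        # Assumption: the sample template creates IAM roles for its Lambda functions.
        Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
    )
    cfn_client.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return stack_name


# Usage (requires AWS credentials and the real template URL):
#   import boto3
#   deploy_stack(boto3.client("cloudformation"), "sample-serverless-app", template_url)
```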

 

 

Step 2: Add your application to Resilience Hub

After launching the sample serverless application, you can add the CloudFormation stack as an application to Resilience Hub.

  1. In the AWS Console, go to Resilience Hub in the Region where you deployed the CloudFormation stack.
  2. Select Add Application.
  3. Select CloudFormation stacks.
  4. Select the CloudFormation stack deployed in Step 1, and add a name for your application.
  5. Select all supported resources within the CloudFormation stack.
  6. Create a resilience hub policy, and select Select a policy based on a suggested policy.
    1. For our tests, enter critical-application as the name, and select Critical Application, which sets a default Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Your target objectives will vary based on your application’s requirements and could require creating a custom policy.
  7. Once created, select the new critical-application resilience hub policy.
  8. Select Publish to create the application in Resilience Hub.
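The console steps above can also be approximated with the Resilience Hub API. The following sketch assumes `rh` is a boto3 `resiliencehub` client. Unlike the console flow, it creates a custom policy instead of picking the suggested Critical Application policy, and the RTO/RPO values are placeholders, not the console defaults. The helper names are hypothetical.

```python
import time


def build_policy(rto_secs: int, rpo_secs: int) -> dict:
    """Apply one RTO/RPO pair to each disruption type (placeholder values)."""
    target = {"rtoInSecs": rto_secs, "rpoInSecs": rpo_secs}
    return {kind: dict(target) for kind in ("Software", "Hardware", "AZ")}


def register_app(rh, app_name: str, stack_arn: str,
                 policy_name: str = "critical-application",
                 rto_secs: int = 3600, rpo_secs: int = 3600,
                 poll_seconds: float = 5, max_polls: int = 60) -> str:
    """Create a policy and an app, import the stack's resources, then publish."""
    policy_arn = rh.create_resiliency_policy(
        policyName=policy_name,
        tier="Critical",
        policy=build_policy(rto_secs, rpo_secs),
    )["policy"]["policyArn"]

    app_arn = rh.create_app(name=app_name, policyArn=policy_arn)["app"]["appArn"]

    # The import runs asynchronously, so poll until it finishes before publishing.
    rh.import_resources_to_draft_app_version(appArn=app_arn, sourceArns=[stack_arn])
    for _ in range(max_polls):
        status = rh.describe_draft_app_version_resources_import_status(
            appArn=app_arn
        )["status"]
        if status == "Success":
            break
        if status == "Failed":
            raise RuntimeError("Resource import failed for " + app_arn)
        time.sleep(poll_seconds)
    else:
        raise TimeoutError("Resource import did not finish in time")

    rh.publish_app_version(appArn=app_arn)
    return app_arn


# Usage (requires AWS credentials):
#   import boto3
#   register_app(boto3.client("resiliencehub"), "sample-app", stack_arn)
```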

Now that the application is available in Resilience Hub, you can run an assessment on your application, which produces recommendations to validate and improve its resiliency. Once the assessment is complete, you can create Operational recommendations in the form of Alarms, SOPs, and AWS FIS Experiment Templates. You’ll focus specifically on Alarms and AWS FIS Experiments for this example.
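For reference, running the assessment programmatically might look like the following sketch. It assumes `rh` is a boto3 `resiliencehub` client and that the published version of the app is addressed as `release`. The function name, assessment name, and polling values are hypothetical.

```python
import time


def run_assessment(rh, app_arn: str, assessment_name: str,
                   poll_seconds: float = 10, max_polls: int = 60):
    """Start an assessment of the published app version and wait for the result."""
    assessment_arn = rh.start_app_assessment(
        appArn=app_arn,
        appVersion="release",  # assumption: assess the published version
        assessmentName=assessment_name,
    )["assessment"]["assessmentArn"]

    for _ in range(max_polls):
        status = rh.describe_app_assessment(
            assessmentArn=assessment_arn
        )["assessment"]["assessmentStatus"]
        if status in ("Success", "Failed"):
            return assessment_arn, status
        time.sleep(poll_seconds)
    raise TimeoutError("Assessment did not finish in time")


# Usage (requires AWS credentials):
#   import boto3
#   arn, status = run_assessment(boto3.client("resiliencehub"), app_arn, "first-run")
```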

Step 3: Configure alarms and FIS Experiment in Resilience Hub

AWS Resilience Hub alarms monitor the resources and components of the application configured and assessed in the previous steps. These alarms will vary based on supported resources within the application. In this example, we’ll utilize the AWSResilienceHub-SQSApproximateAgeOfOldestMessageMaximumAlarm_2020-11-26 alarm. This step is a prerequisite for the FIS experiment that we’ll configure, and the alarm alerts when messages remain in the Amazon SQS queue without being processed.

  1. In the AWS Console, go to the application that you previously created in Resilience Hub.
  2. Select Set up recommendations, and navigate to Operational Recommendations, and then the Alarms tab.
  3. Enter and select the AWSResilienceHub-SQSApproximateAgeOfOldestMessageMaximumAlarm_2020-11-26 alarm.
    1. Note: An alarm exists for each supported AWS resource. If there are two Amazon SQS queues, there will be two distinct Resilience Hub alarms.
  4. Once the alarm is selected, select Create CloudFormation template.
  5. Access the templates that you created through an Amazon Simple Storage Service (Amazon S3) URL. To do so:
    1. In Templates, open the alarm template recommendation.
    2. In Templates S3 Path, navigate to the list of S3 bucket objects, and select the .json file under the Amazon S3/alarm prefix.
    3. Copy the Amazon S3 URI for the .json alarm object, and then launch the CloudFormation template.
  6. There is a CloudFormation parameter for an Amazon SNS topic, in which you can enter either the Amazon SNS topic provisioned in the sample application or a new Amazon SNS topic that you have created.
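If you launch the alarm template through the CloudFormation API rather than the console, note that `create_stack` takes an HTTPS template URL, so the copied S3 URI needs converting first. The sketch below builds the request under that assumption. The name of the SNS topic parameter is not given in this post, so it is passed in as `sns_parameter_key` and should be read from the generated template. The function names are hypothetical.

```python
from urllib.parse import quote


def s3_uri_to_https_url(s3_uri: str, region: str) -> str:
    """Turn s3://bucket/key into a virtual-hosted-style HTTPS URL."""
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError("Expected an s3:// URI, got: " + s3_uri)
    bucket, _, key = s3_uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError("S3 URI must contain both a bucket and a key")
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


def alarm_stack_request(stack_name: str, template_s3_uri: str, region: str,
                        sns_topic_arn: str, sns_parameter_key: str) -> dict:
    """Build keyword arguments for CloudFormation create_stack."""
    return {
        "StackName": stack_name,
        "TemplateURL": s3_uri_to_https_url(template_s3_uri, region),
        "Parameters": [
            # The parameter key comes from the generated template, not from this sketch.
            {"ParameterKey": sns_parameter_key, "ParameterValue": sns_topic_arn},
        ],
    }


# Usage (requires AWS credentials):
#   import boto3
#   request = alarm_stack_request(name, s3_uri, region, topic_arn, parameter_key)
#   boto3.client("cloudformation").create_stack(**request)
```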

AWS Fault Injection Simulator Experiments will simulate outages using AWS Systems Manager Automation documents for supported AWS resources within your application. The FIS experiments utilize CloudWatch alarms to understand when an application is experiencing issues. The FIS experiment template you’ll create is for AWSResilienceHub-BlockSQSDeleteMessageTest_2021-03-09, which will block messages from being cleared from the Amazon SQS queue in the sample application. Running this FIS experiment will cause the age of the oldest message in the Amazon SQS queue to rise well above the AverageDurationToProcessSentMessage, thus impacting the application and breaching thresholds.
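To watch this effect yourself, you could query the queue's `ApproximateAgeOfOldestMessage` metric while the experiment runs. The sketch below builds a `get_metric_statistics` request for a boto3 CloudWatch client and checks the returned datapoints against a threshold. The queue name, the time window, and the threshold are values you supply, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def oldest_message_age_request(queue_name: str, end_time: datetime,
                               minutes: int = 10) -> dict:
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    return {
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "StartTime": end_time - timedelta(minutes=minutes),
        "EndTime": end_time,
        "Period": 60,
        "Statistics": ["Maximum"],
    }


def breaches(datapoints: list, threshold_seconds: float) -> bool:
    """Return True if any datapoint's maximum age exceeds the threshold."""
    return any(point["Maximum"] > threshold_seconds for point in datapoints)


# Usage (requires AWS credentials):
#   import boto3
#   request = oldest_message_age_request("my-queue", datetime.now(timezone.utc))
#   points = boto3.client("cloudwatch").get_metric_statistics(**request)["Datapoints"]
#   print(breaches(points, threshold_seconds=300))
```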

  1. In the AWS Console, go to the application that you previously created in Resilience Hub.
  2. Select Set up recommendations, and navigate to Operational Recommendations, and then the Fault injection experiment templates tab.
  3. Enter and select the AWSResilienceHub-BlockSQSDeleteMessageTest_2021-03-09 FIS experiment template.*
  4. Once the FIS experiment template is selected, select Create CloudFormation template.
  5. Access the templates that you created through an Amazon S3 URL. To do so:
    1. In Templates, open the FIS experiment template recommendation.
    2. In Templates S3 Path, open the link to see the list of all of the objects in your Amazon S3 bucket, and select the .json file under the Amazon S3/test prefix.
    3. Copy the Amazon S3 URI for the .json FIS object, and then launch the CloudFormation template.
  6. There is a CloudFormation parameter for a CanaryAlarm, which is a separate CloudWatch alarm outside of Resilience Hub that will stop the FIS experiment.**

*Each FIS experiment template is specific to an AWS resource. For example, if an application has two Amazon SQS queues, there will be two distinct FIS experiments.

**FIS experiments are conducted on existing AWS resources, which can impact traffic and functionality for your application. The CanaryAlarm threshold, if breached, will stop FIS experiments in progress. This CanaryAlarm should be one specific to your application.
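What makes a good CanaryAlarm depends on your application. As one possible choice for the sample application, the sketch below builds a `put_metric_alarm` request that fires when the Processing Lambda function reports errors. The alarm name, function name, and threshold are assumptions, and a different metric may suit your workload better.

```python
def canary_alarm_request(alarm_name: str, function_name: str,
                         error_threshold: int = 0) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": alarm_name,
        "AlarmDescription": "Stops FIS experiments if the function starts failing",
        "Namespace": "AWS/Lambda",
        "MetricName": "Errors",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": error_threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Minutes without invocations should not trip the stop condition.
        "TreatMissingData": "notBreaching",
    }


# Usage (requires AWS credentials):
#   import boto3
#   request = canary_alarm_request("sample-app-canary", "processing-function")
#   boto3.client("cloudwatch").put_metric_alarm(**request)
```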

Step 4: Run a FIS Experiment

Now that you have configured your Resilience Hub Alarm and FIS Experiment, you are ready to assess your application.

  1. In the AWS Console, go to the application that you previously created in Resilience Hub.
  2. Select the FIS experiment template that you previously created, make sure that it refers to the appropriate ARN, and select Run Experiments.
  3. To track progress for your FIS experiment, go to Systems Manager and select Automation.
  4. Select the Systems Manager document corresponding to the FIS experiment, which is arn:aws:ssm:<region>::document/AWSResilienceHub-BlockSQSDeleteMessageTest_2021-03-09. This operation will provide a step-by-step list of required actions.
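The same run-and-track flow can be sketched against the AWS FIS and Systems Manager APIs. This starts the experiment directly in AWS FIS, which is an assumption on our part: the console flow in Resilience Hub may record results differently. It expects boto3 `fis` and `ssm` clients, an experiment template ID from your account, and a document name prefix that matches how the automation appears in your account. The function names are hypothetical.

```python
import time
import uuid


def start_experiment(fis, template_id: str) -> str:
    """Start an experiment from an existing FIS experiment template."""
    response = fis.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token
        experimentTemplateId=template_id,
    )
    return response["experiment"]["id"]


def wait_for_experiment(fis, experiment_id: str,
                        poll_seconds: float = 15, max_polls: int = 80) -> str:
    """Poll until the experiment reaches a terminal state and return that state."""
    for _ in range(max_polls):
        status = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
        if status in ("completed", "stopped", "failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("Experiment did not finish in time")


def latest_automation_steps(ssm, document_name_prefix: str) -> list:
    """Return (step name, step status) pairs for the most recent matching automation."""
    executions = ssm.describe_automation_executions(
        Filters=[{"Key": "DocumentNamePrefix", "Values": [document_name_prefix]}]
    )["AutomationExecutionMetadataList"]
    if not executions:
        return []
    latest = max(executions, key=lambda item: item["ExecutionStartTime"])
    detail = ssm.get_automation_execution(
        AutomationExecutionId=latest["AutomationExecutionId"]
    )["AutomationExecution"]
    return [
        (step["StepName"], step["StepStatus"])
        for step in detail.get("StepExecutions", [])
    ]
```

A stop condition such as the CanaryAlarm ends the experiment in the `stopped` state, so treat that result differently from `completed` when you interpret a run.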

You’ve successfully configured and run the FIS experiment, and Resilience Hub is tracking the results. You can make changes as needed to your application, re-run your FIS experiment, and create new FIS experiments and corresponding Resilience Hub alarms.

Cleanup

If you deployed the sample application provided in this post, make sure that you clean up those resources, or you may incur costs for resources deployed using the CloudFormation template.

  1. In the AWS Console, go to CloudFormation.
  2. Select the stacks deployed as part of this post, and delete them:
    1. Sample Serverless Application Stack
    2. Resilience Hub Alarm
    3. FIS experiment
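Cleanup can also be scripted. The sketch below deletes the stacks one at a time with a boto3 CloudFormation client and waits for each deletion to finish. The stack names are placeholders for whatever names you chose, and removing the FIS experiment and alarm stacks before the sample application is our assumption about a safe order.

```python
def delete_stacks(cfn_client, stack_names: list) -> list:
    """Delete each stack in order, waiting for one to finish before the next."""
    deleted = []
    for name in stack_names:
        cfn_client.delete_stack(StackName=name)
        cfn_client.get_waiter("stack_delete_complete").wait(StackName=name)
        deleted.append(name)
    return deleted


# Usage (requires AWS credentials; the stack names below are placeholders):
#   import boto3
#   delete_stacks(
#       boto3.client("cloudformation"),
#       ["fis-experiment-stack", "resilience-hub-alarm-stack", "sample-serverless-app"],
#   )
```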

Conclusion

This post covered how to set up Resilience Hub and FIS experiments for a sample serverless application. Each application is unique and must have its own configurations and tests. Using Resilience Hub with the FIS integration spares you from building a manual solution to continually track, validate, and test your critical application’s resiliency. It is also important that your tests have actionable outcomes that improve your application’s resiliency until it meets an acceptable RTO/RPO.

Now that you understand how to use Resilience Hub and FIS experiments, try creating a Resilience Hub app and deploying a FIS experiment on your own sample application to get more familiar with them. Refer to the Resilience Hub supported resources to check which resources are supported within your application.

You can also take a look at how you can build resilient well-architected workloads using AWS Resilience Hub.

Author:

Jonathan Nguyen

Jonathan Nguyen is a Shared Delivery Team Senior Security Consultant at AWS. His background is in AWS Security with a focus on Threat Detection and Incident Response. He helps enterprise customers develop a comprehensive AWS Security strategy, deploy security solutions at scale, and train customers on AWS Security best practices.