AWS Cloud Operations Blog

Automate Standard Operating Procedures (SOPs) execution with AWS Resilience Hub

AWS Resilience Hub is a central location in the AWS Management Console for you to manage and improve the resilience posture of your applications on AWS. AWS Resilience Hub enables you to define your resilience goals, assess your resilience posture against those goals, and implement recommendations for improvement based on the AWS Well-Architected Framework.

AWS Resilience Hub provides both resiliency and operational recommendations. Operational recommendations consist of Amazon CloudWatch alarms, Standard Operating Procedures (SOPs) that use AWS Systems Manager documents, and chaos experiments that use AWS Fault Injection Service (FIS).

An SOP is a prescriptive set of steps designed to efficiently recover your application in the event of a service disruption or alarm. Not having SOPs for operators to follow when they receive alert notifications is a common anti-pattern described in the AWS Well-Architected Framework – Reliability Pillar. Automating alarm processing can improve system resiliency by taking corrective action automatically, running defined SOPs, and reducing manual, error-prone intervention. The SOPs provided with AWS Resilience Hub are templates that you can customize to define your own SOPs.

In this blog post, we walk through how to automate and test the execution of SOPs in response to events and incidents, based on the AWS Resilience Hub operational recommendation templates. You can use this approach in your CI/CD pipeline to continuously test whether you can detect and recover from those disruptions.

To introduce conditions that would require running SOPs, we can use chaos engineering practices with AWS FIS. AWS FIS allows you to run experiments with a clearly defined scope and with safety mechanisms that roll back the experiment if it introduces unexpected turbulence.

Prerequisites

The example used in this blog post has a number of prerequisites.

  • A workload architecture with EC2 instances in an AWS Auto Scaling Group (see Figure 1 for an example)
  • The AWS Cloud Development Kit (AWS CDK), see Getting started with the AWS CDK
  • Define and assess the workload architecture you deployed in your AWS account using AWS Resilience Hub. For more information on enabling AWS Resilience Hub, see this blog.

Architectural overview

Figure 1 – The example architecture we are running our experiment against in this blog post

Workflow overview

Figure 2 – The workflow from the user starting the experiment to automatic SOP execution and alarm remediation

Automating solutions

Although AWS Resilience Hub offers recommendations on alarms, SOPs, and FIS experiments, it is the customer’s responsibility to test that these operational recommendations have been implemented successfully. For more information about the shared responsibility model with AWS Resilience Hub, see the blog post, Shared Responsibility with AWS Resilience Hub.

We recommend automating critical resource recovery. In this blog post, we will walk through an Amazon EventBridge automation to run an implemented SOP when a specific alarm state is reached. We will test this automation using an FIS experiment.

Chaos engineering is an advanced mode of resilience experimentation, involving automated experiments within a continuous resilience pipeline. The key principle is to “fail fast” – catching and addressing resilience problems as early as possible, before they reach production. Integrating chaos experiments into the continuous resilience workflow enables a proactive and iterative approach to resilience experimentation, ensuring resilience is an integral part of the development process.

Our architecture is an application running on Amazon Elastic Compute Cloud (Amazon EC2) instances in an Auto Scaling Group (ASG), backed by an Amazon Relational Database Service (Amazon RDS) database, as shown in Figure 1.

For this example, we will be automating a response for high CPU Utilization. Let’s consider a use case where this could be useful:

A customer with an e-commerce web application has configured the web server ASG with min/desired = 1 and max = 2 and the scaling policy is configured by average CPU utilization.
In the case of a sudden spike in user requests e.g., a high season event, the ASG has reached its max capacity of 2, but it is not enough as new users are still trying to connect to the customer’s application and are unable to do so.

There is a time gap until the customer’s on-call team can investigate the issue and decide whether to raise the ASG’s max value. During this time, new users cannot connect, resulting in financial and/or reputational impact on the business. Automating the SOP fills this gap: the ASG is scaled out as soon as the alarm fires, while the alarm still alerts the customer’s team to investigate further.

Implementing Operational Recommendations

We have implemented AWS Resilience Hub operational recommendations for this automation in all three operational recommendation areas:

Two alarms

  • AWSResilienceHub-SyntheticCanaryInRegionAlarm_2021-04-01
  • AWSResilienceHub-AsgHighCpuUtilizationAlarm_2020-07-13

One SOP

  • AWSResilienceHub-ScaleOutAsgSOP_2020-07-01

One FIS experiment

  • AWSResilienceHub-InjectCpuLoadInAsgTest_2021-09-22

For more information about implementing operational recommendations, see the blog post, monitor and improve your application resiliency with AWS Resilience Hub.

For the automation, we use Amazon EventBridge. We created the AWS CloudFormation template below to provision the required resources. This automation initiates the SOP “AWSResilienceHub-ScaleOutAsgSOP_2020-07-01” when the composite alarm “AsgMaxCapacityReachedAndAsgHighCPUAlarm” enters the “In alarm” state.

AWSTemplateFormatVersion: '2010-09-09'
Description: CloudFormation template for EventBridge rule 'arh-alarm-asg-cpu-triggered'
Parameters:
  AlarmTriggerArn:
    Type: String
    Description: Arn of the Alarm that will trigger this Event
  SSMTemplateAssumeRole:
    Type: String
    Description: An ARN of the role that SSM is going to assume
  SSMTemplateASGName:
    Type: String
    Description: Auto scaling group name (for the SSM Template)

Resources:
  AmazonEventBridgeInvokeStartAutomationExecutionPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: Policy for the Amazon EventBridge Invoke Start Automation Execution
      ManagedPolicyName: !Join ['-', ['AWSResilienceHub-EventBridge_Automation_Policy', !Select [4, !Split ['-', !Select [2, !Split ['/', !Ref "AWS::StackId"]]]]]]
      Path: '/service-role/'
      PolicyDocument:
        !Sub '{ "Version": "2012-10-17", "Statement": [ { "Action": "ssm:StartAutomationExecution", "Effect": "Allow", "Resource": [ "arn:${AWS::Partition}:ssm:${AWS::Region}:*:automation-definition/AWSResilienceHub-ScaleOutAsgSOP_2020-07-01:$DEFAULT" ] }, { "Effect": "Allow", "Action": [ "iam:PassRole" ], "Resource": "${SSMTemplateAssumeRole}", "Condition": { "StringLikeIfExists": { "iam:PassedToService": "ssm.amazonaws.com" } } } ] }'
  AmazonEventBridgeInvokeStartAutomationExecution:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Join ['-', ['AWSResilienceHub-EventBridge_Automation', !Select [4, !Split ['-', !Select [2, !Split ['/', !Ref "AWS::StackId"]]]]]]
      Description: Amazon EventBridge Invoke Start Automation Execution Role
      AssumeRolePolicyDocument:
        Statement:
          - Action: sts:AssumeRole
            Effect: Allow
            Principal:
              Service: events.amazonaws.com
        Version: "2012-10-17"
      MaxSessionDuration: 3600
      Path: '/service-role/'
      ManagedPolicyArns:
        - !Ref AmazonEventBridgeInvokeStartAutomationExecutionPolicy

  EventRuleArhSop:
    Type: AWS::Events::Rule
    Properties:
      EventBusName: default
      EventPattern:
        source:
          - aws.cloudwatch
        detail-type:
          - CloudWatch Alarm State Change
        detail:
          alarmName:
            - !Ref CloudWatchCompositeAlarmAsgMaxCapacityReachedAndAsgHighCPUAlarm
          state:
            value:
              - ALARM
      Name: !Join ['-', ['arh-alarm-asg-cpu-automation', !Select [4, !Split ['-', !Select [2, !Split ['/', !Ref "AWS::StackId"]]]]]]
      State: ENABLED
      Targets:
        - Id: Id5b81de31-a5ef-42e2-90de-1fc8348b3229
          Arn:
            !Sub "arn:${AWS::Partition}:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/AWSResilienceHub-ScaleOutAsgSOP_2020-07-01"
          RoleArn:
            !GetAtt AmazonEventBridgeInvokeStartAutomationExecution.Arn
          Input:
            !Sub '{"Dryrun":["false"],"AutoScalingGroupName":["${SSMTemplateASGName}"],"AutomationAssumeRole":["${SSMTemplateAssumeRole}"]}'
  CloudWatchAlarmAsgMaxCapacityReached:
    UpdateReplacePolicy: "Retain"
    Type: "AWS::CloudWatch::Alarm"
    Properties:
      ComparisonOperator: "GreaterThanThreshold"
      TreatMissingData: "missing"
      ActionsEnabled: true
      Metrics:
      - Label: "AsgMaxCapacityReached"
        Id: "e1"
        ReturnData: true
        Expression: "IF(m1 >= m2, 1, 0)"
      - ReturnData: false
        MetricStat:
          Period: 120
          Metric:
            MetricName: "GroupInServiceInstances"
            Dimensions:
            - Value: !Ref SSMTemplateASGName
              Name: "AutoScalingGroupName"
            Namespace: "AWS/AutoScaling"
          Stat: "Average"
        Id: "m1"
      - ReturnData: false
        MetricStat:
          Period: 120
          Metric:
            MetricName: "GroupMaxSize"
            Dimensions:
            - Value: !Ref SSMTemplateASGName
              Name: "AutoScalingGroupName"
            Namespace: "AWS/AutoScaling"
          Stat: "Average"
        Id: "m2"
      AlarmName: !Join ['-', ['ARH-AsgMaxCapacityReached', !Select [4, !Split ['-', !Select [2, !Split ['/', !Ref "AWS::StackId"]]]]]]
      EvaluationPeriods: 1
      DatapointsToAlarm: 1
      Threshold: 0
  CloudWatchCompositeAlarmAsgMaxCapacityReachedAndAsgHighCPUAlarm:
    UpdateReplacePolicy: "Retain"
    Type: "AWS::CloudWatch::CompositeAlarm"
    Properties:
      ActionsEnabled: true
      AlarmName: !Join ['-', ['ARH-AsgMaxCapacityReachedAndAsgHighCPUAlarm', !Select [4, !Split ['-', !Select [2, !Split ['/', !Ref "AWS::StackId"]]]]]]
      AlarmRule: !Sub 'ALARM("${CloudWatchAlarmAsgMaxCapacityReached}") AND ALARM("${AlarmTriggerArn}")'

CodeBlock 1 – AWS CloudFormation template for the Amazon EventBridge automation setup
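To make the rule's filtering behavior concrete, the following TypeScript sketch mimics, in a simplified way, how the EventPattern in the template above selects CloudWatch alarm state change events. It is not the EventBridge implementation: the `matchesPattern` and `alarmEvent` helpers, the type names, and the sample alarm name are assumptions for illustration, and only exact-value matching on nested fields is modeled.

```typescript
// Simplified model of EventBridge exact-value matching (illustration only).
// A pattern leaf is a list of allowed values; nested objects are matched recursively.
type PatternNode = { [key: string]: PatternNode | string[] };
type EventNode = { [key: string]: unknown };

function matchesPattern(pattern: PatternNode, event: EventNode): boolean {
  return Object.keys(pattern).every((key) => {
    const expected = pattern[key];
    const actual = event[key];
    if (Array.isArray(expected)) {
      // Leaf: the event value must equal one of the listed values.
      return typeof actual === 'string' && expected.indexOf(actual) !== -1;
    }
    // Nested object: the event must contain a matching nested object.
    return typeof actual === 'object' && actual !== null
      && matchesPattern(expected, actual as EventNode);
  });
}

// Hypothetical composite alarm name; the real name carries a stack-specific suffix.
const compositeAlarmName = 'ARH-AsgMaxCapacityReachedAndAsgHighCPUAlarm-example';

// Same shape as the EventPattern of the EventRuleArhSop resource.
const rulePattern: PatternNode = {
  source: ['aws.cloudwatch'],
  'detail-type': ['CloudWatch Alarm State Change'],
  detail: {
    alarmName: [compositeAlarmName],
    state: { value: ['ALARM'] },
  },
};

// Builds a minimal alarm state change event containing only the matched fields.
function alarmEvent(alarmName: string, stateValue: string): EventNode {
  return {
    source: 'aws.cloudwatch',
    'detail-type': 'CloudWatch Alarm State Change',
    detail: { alarmName: alarmName, state: { value: stateValue } },
  };
}
```

In this model, only an event for the composite alarm entering the ALARM state matches; a return to OK, or a state change of any other alarm, does not start the SOP.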

We need to prepare, test, and measure our SOPs in advance to ensure timely recovery in the event of an operational outage. To do this, we can use FIS experiments. In this scenario, we are using an AWS Resilience Hub recommended SOP, and can also use an FIS experiment that has been recommended by AWS Resilience Hub to test the hypothesized outcomes of running this SOP. In this use case, it will also test the automation of running the SOP via the invocation of an Amazon EventBridge rule when the autoscaling capacity has reached the maximum.

Hypothesis

High CPU utilization across the EC2 instances is not expected to have a detrimental effect on our application’s performance due to Amazon EC2 Auto Scaling and the implemented automation of the SOP. The web application should remain accessible, and customers are unlikely to experience any disruption in service.

Once all of the alarms, SOPs, FIS experiment and the Amazon EventBridge rule have been implemented, we can run the experiment to check our automation. The hypothesis of this experiment is that we should see the following:

  1. FIS experiment performs a CPU load injection into the Auto Scaling Group.
  2. The AWSResilienceHub-AsgHighCpuUtilizationAlarm CloudWatch Alarm should change state to “In alarm”.
  3. Autoscaling will activate and start a new instance to manage the load.
  4. The FIS experiment performs another CPU load injection into the Auto Scaling Group.
  5. Amazon EventBridge will process this event and start the AWSResilienceHub-ScaleOutAsgSOP_2020-07-01 SOP.
  6. The SOP will scale out the Auto Scaling Group to add an additional EC2 instance.
  7. Both the experiment and the SOP will complete successfully.
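The expected sequence can be sketched as a toy model of the Auto Scaling Group. This is only an illustration of the hypothesis, not of how Amazon EC2 Auto Scaling or the SOP are implemented: the names are hypothetical, each CPU load injection is assumed to request exactly one additional instance, and the SOP is assumed to raise both max and desired capacity by one.

```typescript
// Toy model of the Auto Scaling Group used in the hypothesis (illustration only).
interface AsgState {
  min: number;
  desired: number;
  max: number;
}

interface ExperimentOutcome {
  asg: AsgState;
  sopStarted: boolean;
}

// Assumption: one CPU load injection makes the scaling policy ask for one more instance.
// If the group is already at max capacity, the composite alarm fires and the SOP
// raises both max and desired capacity by one (assumed behavior of the scale-out SOP).
function runCpuLoadExperiment(asg: AsgState): ExperimentOutcome {
  if (asg.desired < asg.max) {
    // Regular dynamic scaling handles the load; the SOP is not needed.
    return { asg: { min: asg.min, desired: asg.desired + 1, max: asg.max }, sopStarted: false };
  }
  // Max capacity reached: the automation starts the SOP.
  return { asg: { min: asg.min, desired: asg.desired + 1, max: asg.max + 1 }, sopStarted: true };
}

// Starting point from the use case: min/desired = 1 and max = 2.
const initialAsg: AsgState = { min: 1, desired: 1, max: 2 };
const firstRun = runCpuLoadExperiment(initialAsg);
const secondRun = runCpuLoadExperiment(firstRun.asg);
```

Under these assumptions, the first run ends with two instances and no SOP execution, and the second run ends with the SOP started and both desired and max capacity at three.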

Pre-checks

Before we start, let’s check the values of our Auto Scaling Group and the number of instances we have running for our application in the EC2 section of the AWS Management Console.

Figure 3 – Original Auto Scaling Group capacity values

Figure 4 – Original running EC2 instances, numbering 1

Running the experiment

We will now start the AWSResilienceHub-InjectCpuLoadInAsgTest_2021-09-22 AWS Fault Injection Service (FIS) experiment recommended by AWS Resilience Hub to test the hypothesis above.

Figure 5 – A running FIS experiment.

We see our AWSResilienceHub-AsgHighCpuUtilizationAlarm alarm move to “In alarm” in the CloudWatch console, which shows us that CPU utilization has moved over the set threshold. This triggers the dynamic scaling in the Auto Scaling Group, and we can see there are now two instances running in the autoscaling group.

Figure 6 – CloudWatch Alarm state changes.

Figure 7 – 2 running EC2 instances.

Figure 8 – New ASG values.

The experiment has finished; we now have two instances running and the alarm is in the “OK” state.

If we start the experiment again, we see our CloudWatch alarm move to “In alarm” in the CloudWatch console, showing that CPU utilization has exceeded the set threshold. This time, the second alarm, “ARH-AsgMaxCapacityReached”, is also in the “In alarm” state, indicating that the Auto Scaling Group has reached its max capacity. This leads us to check whether our Amazon EventBridge rule has run correctly. The rule is based on the composite alarm, also shown in Figure 9, which combines the two previously mentioned alarms.

Figure 9 – CloudWatch Alarm state changes (2nd experiment)

Figure 10 – Amazon EventBridge rule successfully triggered.
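The way the two alarms combine into the composite alarm can be sketched as follows. The sketch mirrors the metric math expression IF(m1 >= m2, 1, 0), the threshold of 0, and the AND rule from the CloudFormation template in this post; the function and parameter names are hypothetical, and evaluation periods and missing-data handling are left out.

```typescript
// Mirrors the metric math expression IF(m1 >= m2, 1, 0), where
// m1 = GroupInServiceInstances and m2 = GroupMaxSize.
function maxCapacityReached(inServiceInstances: number, groupMaxSize: number): boolean {
  const expressionValue = inServiceInstances >= groupMaxSize ? 1 : 0;
  // ComparisonOperator: GreaterThanThreshold with Threshold: 0
  return expressionValue > 0;
}

// Mirrors the composite rule: ALARM(max capacity alarm) AND ALARM(high CPU alarm).
function compositeAlarmInAlarm(
  inServiceInstances: number,
  groupMaxSize: number,
  highCpuAlarmInAlarm: boolean
): boolean {
  return maxCapacityReached(inServiceInstances, groupMaxSize) && highCpuAlarmInAlarm;
}
```

With one instance in service and a max size of two, high CPU alone does not put the composite alarm into the alarm state; with two instances in service it does, which is the condition the EventBridge rule reacts to.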

Verifying the Results

In the monitoring tab of the Amazon EventBridge console, we can see that our rule was triggered and invoked successfully. The rule should then target and automatically run our AWSResilienceHub-ScaleOutAsgSOP_2020-07-01 SOP.

We can see that our SOP has completed successfully in the Automation feature of AWS Systems Manager (SSM). Without the Amazon EventBridge automation, running this SOP would be a manual step to remediate the FIS high-CPU experiment.

Figure 11 – The SOP has successfully run, after we ran the FIS experiment twice.

Let’s check the Auto Scaling Group itself for the new values, and also how many EC2 instances we currently have running.

Figure 12 – The new ASG capacity values.

Figure 13 – The additional EC2 instance bringing the total number up to 3.

As you can see, the values on our Auto Scaling Group have increased for Desired capacity and Maximum capacity. This has also led to the Auto Scaling Group adding an additional instance to our running instances for the application as expected. We can also see this in the Auto Scaling Group events: the capacity was increased once by an Auto Scaling Group alarm and once by the SOP.

Figure 14 – Auto scaling Group Events.

We can also look at our CloudWatch alarm history to see what actions and state changes have occurred. It’s important to check that the state has moved from “OK” to “In alarm” as expected, and that the alarm has moved back to “OK” after the SOP has run.

Figure 15 – CloudWatch Alarm state changes during the experiment from “OK” to “In alarm” and back to “OK” (left) and the number of instances and max capacity of the Auto Scaling Group (right).
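The alarm history check described above can also be expressed as a small function, for example to automate this verification step. This is a minimal sketch: the `recoveredAfterAlarm` name and the plain array of states are assumptions, and retrieving the history from CloudWatch is not shown.

```typescript
type AlarmState = 'OK' | 'ALARM' | 'INSUFFICIENT_DATA';

// Returns true if the history (oldest state first) contains an OK state,
// later an ALARM state, and after that a return to OK.
function recoveredAfterAlarm(history: AlarmState[]): boolean {
  let sawOk = false;
  let sawAlarmAfterOk = false;
  for (let i = 0; i < history.length; i++) {
    const state = history[i];
    if (state === 'OK' && sawAlarmAfterOk) {
      return true; // The alarm moved back to OK after having been in alarm.
    }
    if (state === 'OK') {
      sawOk = true;
    } else if (state === 'ALARM' && sawOk) {
      sawAlarmAfterOk = true;
    }
  }
  return false;
}
```

A history of OK, ALARM, OK passes the check, while a history that ends in ALARM does not, which would indicate that the SOP did not remediate the condition.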

Let’s go back to our FIS experiment to make sure that it has completed successfully, so that we can close out our experiment and fully confirm our hypothesis.

Figure 16 – Completed Experiment viewed in AWS Resilience Hub.

Validation

We can now validate against our original hypothesis:

FIS experiment performs a CPU load injection into the ASG

  1. We see the successful running of the FIS Experiments (Figure 11).
  2. We can check Amazon CloudWatch Alarms have triggered and that the Alarm State has changed (Figure 15).

The ASGHighCPUUtilization CloudWatch Alarm should change state to “In alarm”

  1. We can check Amazon CloudWatch Alarms have triggered and that the Alarm State has changed (Figure 15).

Amazon EventBridge will process this event and start the ScaleOutAsg SOP

  1. The Amazon EventBridge rule has been run (Figure 10).

The SOP will scale out the Auto Scaling Group in case max capacity is reached to add an additional EC2 instance.

We will check both the automation we are implementing using our AWS CloudFormation stack that implements our Amazon EventBridge rule and the success of the SOP in making the required changes in line with our hypothesis.

  1. Automation of the SOP being run can be seen in the SSM Document completing without manual intervention (Figure 11).
  2. The Auto Scaling Group and EC2 instance count have the expected results (Figures 12, 13 and 14).

Both the experiment and the SOP will complete successfully

  1. SOP completion and FIS experiment completion can be checked (Figures 16 and 11).

Running in a CI/CD pipeline

If you want to run this in your CI/CD pipeline, you can create an AWS Step Functions state machine that orchestrates all of this. The state diagram and the steps of the state machine are shown below:

Figure 17 – State machine.

  1. First, create the automation described above.
  2. Then, wait until the automation is deployed.
  3. If deployment is successful, start the FIS experiment.
  4. The automation via Amazon EventBridge will start the SOP upon alarm and Auto Scaling Group max capacity and mitigate the problem.
  5. Upon error, an Amazon Simple Notification Service (Amazon SNS) message will be sent, and the workflow will fail.
  6. If the experiment finishes without error, success is reported.
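The polling decisions behind these steps can be sketched as plain functions. The status values are the ones used in the Choice states of the state machine definition in this post; the function names and return labels are hypothetical and only illustrate the branching, not the Step Functions runtime.

```typescript
type ExperimentNextStep = 'wait' | 'proceed' | 'notifyAndFail';
type StackNextStep = 'wait' | 'startExperiment' | 'deleteStackAndFail';

// Mirrors the StackDeploymentStatus Choice state for the CloudFormation stack:
// keep waiting while creation is in progress, continue once it is complete.
function nextStepForStack(stackStatus: string): StackNextStep {
  if (stackStatus === 'REVIEW_IN_PROGRESS' || stackStatus === 'CREATE_IN_PROGRESS') {
    return 'wait';
  }
  return stackStatus === 'CREATE_COMPLETE' ? 'startExperiment' : 'deleteStackAndFail';
}

// Mirrors the ExperimentStatus Choice state: keep polling while the experiment is
// pending, initiating, or running; proceed on completed; otherwise report an error.
function nextStepForExperiment(status: string): ExperimentNextStep {
  const inProgress = ['pending', 'initiating', 'running'];
  if (inProgress.indexOf(status) !== -1) {
    return 'wait';
  }
  return status === 'completed' ? 'proceed' : 'notifyAndFail';
}
```

Any status that is neither in progress nor completed falls through to the error path, which matches the Default branches of the Choice states.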

The following AWS Cloud Development Kit (AWS CDK) code creates this AWS Step Functions state machine.

import * as cdk from 'aws-cdk-lib';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as stepfunctions from 'aws-cdk-lib/aws-stepfunctions';

export interface ArhBlogTestImportStackProps extends cdk.StackProps {
}

export class ArhBlogTestImportStack extends cdk.Stack {
  public constructor(scope: cdk.App, id: string, props: ArhBlogTestImportStackProps = {}) {
    super(scope, id, props);

    const iamRoleStepFunctionsRole = new iam.CfnRole(this, 'StepFunctionsRole', {
      path: '/service-role/',
      maxSessionDuration: 3600,
      roleName: 'arh-blog-StepFunctions-role-' + id,
      policies: [
        {
          policyDocument: {
            Version: '2012-10-17',
            Statement: [
              {
                Resource: '*',
                Action: [
                  'cloudformation:CreateStack',
                  'cloudformation:DeleteStack',
                  'cloudformation:DescribeStacks',
                ],
                Effect: 'Allow',
              },
            ],
          },
          policyName: 'cloudformation-permissions',
        },
        {
          policyDocument: {
            Version: '2012-10-17',
            Statement: [
              {
                Resource: '*',
                Action: [
                  'cloudwatch:DescribeAlarms',
                ],
                Effect: 'Allow',
              },
            ],
          },
          policyName: 'cloudwatch-permissions',
        },
        {
          policyDocument: {
            Version: '2012-10-17',
            Statement: [
              {
                Resource: '*',
                Action: [
                  'events:DescribeRule',
                  'events:DeleteRule',
                  'events:PutRule',
                  'events:PutTargets',
                  'events:RemoveTargets',
                ],
                Effect: 'Allow',
              },
            ],
          },
          policyName: 'eventbridge-permissions',
        },
        {
          policyDocument: {
            Version: '2012-10-17',
            Statement: [
              {
                Resource: '*',
                Action: [
                  'fis:StartExperiment',
                  'fis:GetExperiment',
                ],
                Effect: 'Allow'
              },
            ],
          },
          policyName: 'fis-permissions',
        },
        {
          policyDocument: {
            Version: '2012-10-17',
            Statement: [
              {
                Resource: '*',
                Action: [
                  'iam:CreatePolicy',
                  'iam:GetRole',
                  'iam:DetachRolePolicy',
                  'iam:GetPolicy',
                  'iam:CreateRole',
                  'iam:DeleteRole',
                  'iam:AttachRolePolicy',
                  'iam:PutRolePolicy',
                  'iam:PassRole',
                  'iam:ListPolicyVersions',
                  'iam:DeletePolicy',
                ],
                Effect: 'Allow'
              },
            ],
          },
          policyName: 'iam-permissions',
        },
        {
          policyDocument: {
            Version: '2012-10-17',
            Statement: [
              {
                Resource: '*',
                Action: 's3:GetObject',
                Effect: 'Allow',
              },
            ],
          },
          policyName: 's3-permissions',
        },
        {
          policyDocument: {
            Version: '2012-10-17',
            Statement: [
              {
                Resource: '*',
                Action: "sns:Publish",
                Effect: "Allow",
              },
            ],
          },
          policyName: 'sns-permissions',
        },
      ],
      assumeRolePolicyDocument: {
        Version: '2012-10-17',
        Statement: [
          {
            Action: 'sts:AssumeRole',
            Effect: 'Allow',
            Principal: {
              Service: 'states.amazonaws.com',
            },
          },
        ],
      },
    });
    iamRoleStepFunctionsRole.cfnOptions.deletionPolicy = cdk.CfnDeletionPolicy.RETAIN;

    const stateMachine = new stepfunctions.CfnStateMachine(this, 'StepFunctionsStateMachine', {
      // Amazon States Language definition. A template literal allows line breaks between states.
      definitionString: `{ "Comment": "A description of my state machine", "StartAt": "CreateAutomationStack", "States": {
        "CreateAutomationStack": { "Type": "Task", "Parameters": { "StackName": "arh-blog-automation", "TemplateURL.$": "$.input.S3UrlToCloudformationStack", "Capabilities": [ "CAPABILITY_NAMED_IAM", "CAPABILITY_AUTO_EXPAND" ], "Parameters": [ { "ParameterKey": "AlarmTriggerArn", "ParameterValue.$": "$.input.AlarmTriggerArn" }, { "ParameterKey": "SSMTemplateAssumeRole", "ParameterValue.$": "$.input.SSMTemplateAssumeRole" }, { "ParameterKey": "SSMTemplateASGName", "ParameterValue.$": "$.input.SSMTemplateASGName" } ] }, "Resource": "arn:aws:states:::aws-sdk:cloudformation:createStack", "Next": "WaitForStackToBeReady", "Catch": [ { "ErrorEquals": [ "States.ALL" ], "Next": "DeleteAutomationStackOnFail" } ] },
        "WaitForStackToBeReady": { "Type": "Wait", "Seconds": 5, "Next": "DescribeStacks" },
        "DescribeStacks": { "Type": "Task", "Next": "StackDeploymentStatus", "Parameters": { "StackName.$": "States.ArrayGetItem(States.StringSplit($.StackId, '/'), 1)" }, "Resource": "arn:aws:states:::aws-sdk:cloudformation:describeStacks", "OutputPath": "$.Stacks[0]", "Catch": [ { "ErrorEquals": [ "States.ALL" ], "Next": "DeleteAutomationStackOnFail" } ] },
        "StackDeploymentStatus": { "Type": "Choice", "Choices": [ { "Or": [ { "Variable": "$.StackStatus", "StringEquals": "REVIEW_IN_PROGRESS" }, { "Variable": "$.StackStatus", "StringEquals": "CREATE_IN_PROGRESS" } ], "Next": "WaitForStackToBeReady" }, { "Variable": "$.StackStatus", "StringEquals": "CREATE_COMPLETE", "Next": "StartExperiment" } ], "Default": "DeleteAutomationStackOnFail" },
        "StartExperiment": { "Type": "Task", "Next": "WaitForExperimentToFinish", "Parameters": { "ClientToken.$": "States.UUID()", "ExperimentTemplateId.$": "$$.Execution.Input.input.ExperimentTemplateId" }, "Resource": "arn:aws:states:::aws-sdk:fis:startExperiment", "ResultPath": "$.Result" },
        "WaitForExperimentToFinish": { "Type": "Wait", "Seconds": 5, "Next": "GetExperiment" },
        "GetExperiment": { "Type": "Task", "Next": "ExperimentStatus", "Parameters": { "Id.$": "$.Result.Experiment.Id" }, "Resource": "arn:aws:states:::aws-sdk:fis:getExperiment", "ResultPath": "$.Result" },
        "ExperimentStatus": { "Type": "Choice", "Choices": [ { "Or": [ { "Variable": "$.Result.Experiment.State.Status", "StringEquals": "pending" }, { "Variable": "$.Result.Experiment.State.Status", "StringEquals": "initiating" }, { "Variable": "$.Result.Experiment.State.Status", "StringEquals": "running" } ], "Next": "WaitForExperimentToFinish" }, { "Variable": "$.Result.Experiment.State.Status", "StringEquals": "completed", "Next": "Wait" } ], "Default": "SNSPublishOnError" },
        "Wait": { "Type": "Wait", "Seconds": 20, "Next": "StartExperimentAgain" },
        "StartExperimentAgain": { "Type": "Task", "Next": "WaitForExperimentToFinishAgain", "Parameters": { "ClientToken.$": "States.UUID()", "ExperimentTemplateId.$": "$$.Execution.Input.input.ExperimentTemplateId" }, "Resource": "arn:aws:states:::aws-sdk:fis:startExperiment", "ResultPath": "$.Result" },
        "WaitForExperimentToFinishAgain": { "Type": "Wait", "Seconds": 5, "Next": "GetExperimentAgain" },
        "GetExperimentAgain": { "Type": "Task", "Next": "ExperimentStatusAgain", "Parameters": { "Id.$": "$.Result.Experiment.Id" }, "Resource": "arn:aws:states:::aws-sdk:fis:getExperiment", "ResultPath": "$.Result" },
        "ExperimentStatusAgain": { "Type": "Choice", "Choices": [ { "Or": [ { "Variable": "$.Result.Experiment.State.Status", "StringEquals": "pending" }, { "Variable": "$.Result.Experiment.State.Status", "StringEquals": "initiating" }, { "Variable": "$.Result.Experiment.State.Status", "StringEquals": "running" } ], "Next": "WaitForExperimentToFinishAgain" }, { "Variable": "$.Result.Experiment.State.Status", "StringEquals": "completed", "Next": "DeleteAutomationStack" } ], "Default": "SNSPublishOnError" },
        "SNSPublishOnError": { "Type": "Task", "Resource": "arn:aws:states:::sns:publish", "Parameters": { "TopicArn.$": "$$.Execution.Input.input.SnsTopic", "Message.$": "$" }, "Next": "DeleteAutomationStackOnFail" },
        "DeleteAutomationStackOnFail": { "Type": "Task", "Parameters": { "StackName": "arh-blog-automation" }, "Resource": "arn:aws:states:::aws-sdk:cloudformation:deleteStack", "Next": "Fail" },
        "Fail": { "Type": "Fail" },
        "DeleteAutomationStack": { "Type": "Task", "Parameters": { "StackName.$": "States.ArrayGetItem(States.StringSplit($.StackId, '/'), 1)" }, "Resource": "arn:aws:states:::aws-sdk:cloudformation:deleteStack", "Next": "Success" },
        "Success": { "Type": "Succeed" } } }`,

      loggingConfiguration: {
        includeExecutionData: false,
        level: 'OFF',
      },
      stateMachineName: 'arh-blog-statemachine-' + id,
      roleArn: iamRoleStepFunctionsRole.attrArn,
      tags: [
      ],
      stateMachineType: 'STANDARD',
      tracingConfiguration: {
        enabled: false,
      },
    });
    stateMachine.cfnOptions.deletionPolicy = cdk.CfnDeletionPolicy.RETAIN;
  }
}

CodeBlock 2 – AWS CDK stack that creates the AWS Step Function
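Both the state machine and the CloudFormation template derive values from the stack ID. The state machine uses States.ArrayGetItem(States.StringSplit($.StackId, '/'), 1) to obtain the stack name, and the template uses nested !Select and !Split calls to obtain a short suffix for resource names. The following sketch reproduces both derivations in TypeScript; the function names and the example stack ID (with placeholder Region, account ID, and UUID) are assumptions for illustration.

```typescript
// Rough equivalent of States.ArrayGetItem(States.StringSplit(stackId, '/'), 1).
// A stack ID has the form arn:...:stack/<stack-name>/<uuid>.
function stackNameFromStackId(stackId: string): string {
  const parts = stackId.split('/');
  if (parts.length < 3) {
    throw new Error('Unexpected stack ID format: ' + stackId);
  }
  return parts[1];
}

// Rough equivalent of !Select [4, !Split ['-', !Select [2, !Split ['/', StackId]]]]:
// the last of the five dash-separated groups of the UUID.
function resourceSuffixFromStackId(stackId: string): string {
  const parts = stackId.split('/');
  if (parts.length < 3) {
    throw new Error('Unexpected stack ID format: ' + stackId);
  }
  const uuidGroups = parts[2].split('-');
  if (uuidGroups.length < 5) {
    throw new Error('Unexpected UUID format in stack ID: ' + stackId);
  }
  return uuidGroups[4];
}

// Hypothetical stack ID with placeholder Region, account ID, and UUID.
const exampleStackId =
  'arn:aws:cloudformation:eu-central-1:111122223333:stack/arh-blog-automation/0a1b2c3d-1111-2222-3333-444455556666';
```

For the example stack ID, the stack name is arh-blog-automation and the suffix is 444455556666, so the EventBridge rule would be named arh-alarm-asg-cpu-automation-444455556666.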

To run the state machine that will be created with the AWS CDK code above, you’ll need to define some inputs:

  • AlarmTriggerArn – the ARN of the AsgHighCpuUtilizationAlarm that was created from Resilience Hub proposed alarms.
  • SSMTemplateAssumeRole – the ARN of the AWSResilienceHubAsgScaleOutAssumeRole created with the SOP.
  • SSMTemplateASGName – the AutoScalingGroup name (Name not ARN)
  • ExperimentTemplateId – the Id of the FIS experiment that should be run (in our case AsgScaleOut).
  • SnsTopic – SNS topic to send messages to, in case the experiment fails.
  • S3UrlToCloudformationStack – URL of the AWS CloudFormation template file in an Amazon Simple Storage Service (Amazon S3) bucket. The AWS CloudFormation template from CodeBlock 1 above needs to be stored in an S3 bucket.

Typically, the inputs look like the example below. Update these values so that the state machine runs correctly in your environment.

{
  "input": {
    "AlarmTriggerArn": "arn:aws:cloudwatch:<region>:<accountid>:alarm:AWSResilienceHub-AsgHighCpuUtilizationAlarm-2020-07-13_arh-demo_arh-lab-workload-AutoScalingGroup-oYSKLDR6Vg21",
    "SSMTemplateAssumeRole": "arn:aws:iam::<accountid>:role/arh-sop-AWSResilienceHubAsgScaleOutAssumeRole-qWqL13hCgexP",
    "SSMTemplateASGName": "arh-lab-workload-AutoScalingGroup-oYSKLDR6Vg21",
    "ExperimentTemplateId": "EXT9Au6P89tSQXa",
    "SnsTopic": "arn of the topics",
    "S3UrlToCloudformationStack": "https://<bucketname>.s3.<region>.amazonaws.com/arh-eventbridge.yml"
  }
}

CodeBlock 3 – State machine inputs that require updating
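Because a wrong input only surfaces once the state machine is running, it can help to check the input shape before starting an execution. The sketch below types the input shown above and applies a few basic plausibility checks. The interface name, the `validateInput` function, the specific checks, and all example values (Region, account ID, names) are assumptions for illustration; the checks do not verify that the referenced resources exist.

```typescript
// Shape of the state machine input shown above.
interface StateMachineInput {
  input: {
    AlarmTriggerArn: string;
    SSMTemplateAssumeRole: string;
    SSMTemplateASGName: string;
    ExperimentTemplateId: string;
    SnsTopic: string;
    S3UrlToCloudformationStack: string;
  };
}

// Returns a list of problems; an empty list means the basic checks passed.
function validateInput(candidate: StateMachineInput): string[] {
  const i = candidate.input;
  const problems: string[] = [];
  if (i.AlarmTriggerArn.indexOf('arn:') !== 0 || i.AlarmTriggerArn.indexOf(':alarm:') === -1) {
    problems.push('AlarmTriggerArn should be a CloudWatch alarm ARN');
  }
  if (i.SSMTemplateAssumeRole.indexOf('arn:') !== 0 || i.SSMTemplateAssumeRole.indexOf(':role/') === -1) {
    problems.push('SSMTemplateAssumeRole should be an IAM role ARN');
  }
  if (i.SSMTemplateASGName.length === 0 || i.SSMTemplateASGName.indexOf('arn:') === 0) {
    problems.push('SSMTemplateASGName should be the group name, not an ARN');
  }
  if (i.ExperimentTemplateId.length === 0) {
    problems.push('ExperimentTemplateId should not be empty');
  }
  if (i.SnsTopic.indexOf('arn:') !== 0) {
    problems.push('SnsTopic should be an SNS topic ARN');
  }
  if (i.S3UrlToCloudformationStack.indexOf('https://') !== 0) {
    problems.push('S3UrlToCloudformationStack should be an https:// URL');
  }
  return problems;
}

// Hypothetical example values with placeholder Region and account ID.
const exampleInput: StateMachineInput = {
  input: {
    AlarmTriggerArn: 'arn:aws:cloudwatch:eu-central-1:111122223333:alarm:example-high-cpu-alarm',
    SSMTemplateAssumeRole: 'arn:aws:iam::111122223333:role/example-sop-assume-role',
    SSMTemplateASGName: 'example-workload-AutoScalingGroup',
    ExperimentTemplateId: 'EXTexample',
    SnsTopic: 'arn:aws:sns:eu-central-1:111122223333:example-topic',
    S3UrlToCloudformationStack: 'https://example-bucket.s3.eu-central-1.amazonaws.com/arh-eventbridge.yml',
  },
};
```

Calling validateInput(exampleInput) returns an empty list, while an Auto Scaling Group ARN in SSMTemplateASGName or an s3:// URL would each be reported as a problem.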

Now that we have created the AWS Step Functions state machine, we can integrate it into our pipeline. The blog post Continually assessing application resilience with AWS Resilience Hub and AWS CodePipeline shows how you can trigger a state machine from AWS CodePipeline.

Conclusion

By automating responses to well-understood and defined events, you can focus your engineers on more productive tasks. This can also enable you to meet your resiliency goals, for example, by improving Mean Time To Resolution (MTTR) and reducing on-call fatigue for your engineers.

Depending on the frequency of your releases and the duration of your deployment CI/CD pipeline, you should evaluate the scope and duration of your chaos pipeline. Generally, your Fault Injection Service experiments necessitate extended run times to ensure adequate interactions, data, and experiment conditions in your workload. To avoid slowing down developers, these experiments should be run in later stages of CI/CD pipelines or even in their own dedicated pipeline. The AWS Resilience Hub recommendations can serve as a starting point, regardless of whether you employ your regular deployment CI/CD pipeline or a dedicated “chaos pipeline.”

Tom Reichenbach

Tom is a Solutions Architect at AWS based in Vienna, Austria. After completing his studies in control systems and robotics, he worked in the industry before transitioning to the financial sector. At AWS, Tom particularly enjoys talking with customers about resilience and discussing resilience solutions.

Jamie Ibbs

Jamie Ibbs is a Solutions Architect with AWS, where he helps customers to operate at scale, with a particular interest in management, governance, and resilience.