AWS Developer Blog

Handling Errors, Retries, and adding Alerting to Step Function State Machine Executions

AWS Step Functions allows you to coordinate and stitch together multiple AWS services into a serverless workflow. Step Functions state machines are defined in Amazon States Language, a JSON-based configuration format. When it comes to running a state machine in production, operational features such as retrying failed steps, alerting on failures, and capturing error messages appropriately are critical. Retries limit the human intervention needed. In the event of an ultimate failure, alerting on what happened is key to solving the problem moving forward.

Amazon SNS (Simple Notification Service) is a highly durable, scalable, secure, and fully managed pub/sub messaging service. As a best practice, it is advisable to keep communication between microservices decoupled in distributed and serverless architectures. With Amazon SNS you can integrate across your microservices and with various other AWS services, such as CloudWatch Events, and send notifications to various subscribers.

In this post, you will learn how to add error handling and retry logic to your Step Function State Machine. You will also learn how to set up CloudWatch Rules to alert via email if a State Machine execution fails completely after retrying. You will create an example State Machine that runs various calculations on a given set of input numbers. The State Machine will consist of a collection of Lambda functions that are invoked and stitched together to produce various results.

To follow along with the code below, please navigate to this GitHub link.

Architecture

sf_alert_architecture

  1. The user invokes the AWS Step Function State Machine.
  2. The AWS Step Function State Machine invokes the consecutive Lambda functions defined in the states language definition.
  3. In the event of a State Machine execution failure, the Amazon CloudWatch Event Rule is triggered.
  4. The Amazon CloudWatch Event Rule sends a message to the SNS topic to alert on the failure event.
  5. The Amazon SNS email subscriber is alerted of the execution failure event.

Creating a Step Function State Machine

To get started, navigate to the AWS Console and paste in the following Step Function States Language definition:

{
  "Comment": "CalculationStateMachine",
  "StartAt": "CleanInput",
  "States": {
    "CleanInput": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "CleanInput",
        "Payload": {
          "input.$": "$"
        }
      },
      "Next": "Multiply"
    },
    "Multiply": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "Multiply",
        "Payload": {
          "input.$": "$.Payload"
        }
      },
      "Next": "Choice"
    },
    "Choice": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.Payload.result",
          "NumericGreaterThanEquals": 20,
          "Next": "Subtract"
        }
      ],
      "Default": "Notify"
    },
    "Subtract": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "Subtract",
        "Payload": {
          "input.$": "$.Payload"
        }
      },
      "End": true
    },
    "Notify": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:0123456789:CalculateNotify",
        "Message.$": "$$",
        "Subject": "Failed Test"
      },
      "End": true
    }
  }
}

Using the AWS Toolkit plugin for Visual Studio Code, we can visualize and manage Step Functions. The AWS Toolkit for VSCode also allows you to create, manage, deploy, download, and visualize Step Function State Machines without ever needing to leave the IDE (check out this blog post on working with StateMachines in your VSCode IDE).

state_machine_vscode

 

Explanation of Step Function States

  1. After starting the execution of the State Machine, run the CleanInput step, which cleans the input and shapes it for consumption by the subsequent steps.
  2. Take the output from the CleanInput step and pass it to the Multiply step and run the Multiply Lambda Function using the input.
  3. The next step is a Choice step. It continues on to the Subtract step if the value of $.Payload.result is greater than or equal to the numeric value of 20. If this condition is not true, we will publish a failure event to an SNS topic and stop the execution of the State Machine.
  4. If we continue on in the pipeline, we will invoke the final step of Subtract, which is a Lambda function that performs subtraction against the input numbers.
  5. The State Machine execution will complete after the above is done running.
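
To make the branching in step 3 concrete, here is a minimal Python sketch of the decision the Choice state makes. It is not part of the State Machine itself: the function name next_state and the sample result values are assumptions made for illustration, and the sketch ignores the edge cases (missing path, non-numeric value) that Step Functions handles on its own.

```python
from typing import Any, Dict

THRESHOLD = 20  # mirrors "NumericGreaterThanEquals": 20 in the Choice state


def next_state(state_input: Dict[str, Any]) -> str:
    """Return the name of the state the Choice step would transition to.

    state_input stands in for the output of the Multiply task, where the
    Lambda return value sits under the "Payload" key.
    """
    result = state_input["Payload"]["result"]
    if result >= THRESHOLD:
        return "Subtract"
    return "Notify"  # the "Default" branch of the Choice state


if __name__ == "__main__":
    # Illustrative values only; the real result depends on the Multiply Lambda.
    print(next_state({"Payload": {"result": 6000}}))
    print(next_state({"Payload": {"result": 12}}))
```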

Retry Logic and Error Handling

The above definition of your Step Function State Machine will work well, but it does not account for error conditions. Any one of the steps along the way may produce a failure event, which will cause the state machine execution to end on its first attempt. You can incorporate retry logic into the state machine so that it can recover on its own when these events occur.

The following are some possible failure events that may occur:

  1. State Machine Definition Issues.
  2. Task Failures due to exceptions thrown in a Lambda Function.
  3. Transient or Networking Issues.
  4. A task has surpassed its timeout threshold.
  5. Privileges are not set appropriately for a task to execute.

Refined States Language Definition

The following is a refined version of the states language definition from above. This new definition includes error handling, retries, and additional parameters:

{
  "Comment": "CalculationStateMachine",
  "StartAt": "CleanInput",
  "States": {
    "CleanInput": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Retry": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "IntervalSeconds": 3,
          "MaxAttempts": 2,
          "BackoffRate": 1.5
        }
      ],
      "Parameters": {
        "FunctionName": "CleanInput",
        "Payload": {
          "input.$": "$"
        }
      },
      "Next": "Multiply"
    },
    "Multiply": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Retry": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "IntervalSeconds": 3,
          "MaxAttempts": 2,
          "BackoffRate": 1.5
        }
      ],
      "Parameters": {
        "FunctionName": "Multiply",
        "Payload": {
          "input.$": "$.Payload"
        }
      },
      "Next": "Choice"
    },
    "Choice": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.Payload.result",
          "NumericGreaterThanEquals": 20,
          "Next": "Subtract"
        }
      ],
      "Default": "Notify"
    },
    "Subtract": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Retry": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "IntervalSeconds": 3,
          "MaxAttempts": 2,
          "BackoffRate": 1.5
        }
      ],
      "Parameters": {
        "FunctionName": "Subtract",
        "Payload": {
          "input.$": "$.Payload"
        }
      },
      "End": true
    },
    "Notify": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Retry": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "IntervalSeconds": 3,
          "MaxAttempts": 2,
          "BackoffRate": 1.5
        }
      ],
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:0123456789:CalculateNotify",
        "Message.$": "$$",
        "Subject": "Failed Test"
      },
      "End": true
    }
  }
}

Retry Definition Block

You will notice that at each step in the State Machine you have incorporated a Retry statement. This statement includes a list of conditions that should trigger a retry in a step, along with specific parameters that go along with the retry. To break down the different parameters in the Retry block:

  1. ErrorEquals – this key expects a list of error names that should trigger a retry. In our example, we use the blanket error name States.ALL, a wildcard that matches almost any error that occurs during the execution of the step. The More Granular Error Conditions section below covers how to match specific errors instead.
  2. IntervalSeconds – the value of this key is the number of seconds to wait to attempt a retry after the first failure occurs. For example, if our step fails, the state machine will wait for 3 seconds before attempting the first execution retry.
  3. MaxAttempts – this value signifies how many times the State Machine should attempt a retry. In this case, we have set the number of retries equal to 2. This means that the state machine execution will attempt a retry up to 2 times and will fail after the 3rd failure occurs.
  4. BackoffRate – the value of this key signifies the multiplier by which the retry interval (IntervalSeconds) increases after each retry attempt. For example, the first retry attempt will wait 3 seconds, and the second retry attempt will wait 4.5 seconds.
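
The arithmetic behind IntervalSeconds, MaxAttempts, and BackoffRate can be sketched in a few lines of Python. The helper name retry_delays is an assumption made for this illustration; it only reproduces the calculation described above and does not call Step Functions.

```python
from typing import List


def retry_delays(interval_seconds: float, max_attempts: int, backoff_rate: float) -> List[float]:
    """Return the wait (in seconds) before each retry attempt.

    The first retry waits interval_seconds; every later retry waits the
    previous wait multiplied by backoff_rate.
    """
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


if __name__ == "__main__":
    delays = retry_delays(interval_seconds=3, max_attempts=2, backoff_rate=1.5)
    print(delays)           # waits before the 1st and 2nd retry
    print(1 + len(delays))  # total invocations: the original attempt plus the retries
```

With the values used in this post, the step is invoked at most three times: once initially, and then again after waits of 3 and 4.5 seconds.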

More Granular Error Conditions

In the example above, you set the error conditions array equal to States.ALL. This is a great way to capture a broad range of error conditions and roughly translates into catching a general exception in a programming language such as Java or Python. There are times, however, when you want to retry only on specific error conditions. It is possible to specify multiple error conditions in a Retry block. The following are some of the specific error conditions that you can also capture:

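  1. States.Runtime – This error occurs when the execution hits an exception that it could not process at runtime, such as applying InputPath or OutputPath to a null JSON payload. Note that this error is not retriable and is not matched by States.ALL; it always causes the execution to fail.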
  2. States.Timeout – This error condition occurs when the execution of the step surpasses its timeout threshold.
  3. States.TaskFailed – This error condition occurs when a Task state has failed during execution (an example is a Lambda function returning a failure on execution).
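
As a sketch of what a more granular Retry block could look like, the Python snippet below builds one with a dedicated retrier for States.Timeout followed by a States.ALL fallback, and prints it as JSON that could be pasted into a Task state. The interval and attempt values for the timeout retrier are illustrative assumptions, not recommendations. The small helper reflects the Amazon States Language rule that States.ALL must appear alone in its ErrorEquals list and only in the last retrier.

```python
import json
from typing import Any, Dict, List

# Illustrative values: retry timeouts once after a longer wait, and fall back
# to the retry settings used elsewhere in this post for everything else.
GRANULAR_RETRY: List[Dict[str, Any]] = [
    {
        "ErrorEquals": ["States.Timeout"],
        "IntervalSeconds": 10,
        "MaxAttempts": 1,
        "BackoffRate": 2.0,
    },
    {
        "ErrorEquals": ["States.ALL"],
        "IntervalSeconds": 3,
        "MaxAttempts": 2,
        "BackoffRate": 1.5,
    },
]


def states_all_is_valid(retry: List[Dict[str, Any]]) -> bool:
    """Check that States.ALL, if used, stands alone and is in the last retrier."""
    for index, retrier in enumerate(retry):
        errors = retrier["ErrorEquals"]
        if "States.ALL" in errors:
            if len(errors) != 1 or index != len(retry) - 1:
                return False
    return True


if __name__ == "__main__":
    assert states_all_is_valid(GRANULAR_RETRY)
    print(json.dumps({"Retry": GRANULAR_RETRY}, indent=2))
```

Retriers are evaluated in order, so the more specific error names go first and the wildcard goes last.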

Please refer to the following Error Handling Documentation for further information:

Retry Example

You will now run an execution of your Step Function state machine. In this specific execution, you will see firsthand how the Retry block can be used to retry an execution on error. Navigate to the AWS console and search for Step Functions. Once on the Step Functions page, select the appropriate Calculation State Machine.

retry_1

After selecting the appropriate state machine, click on Start execution.

In this example, you will provide the following input JSON to the execution:

{
    "input": "10 20 30"
}

Please see the execution below:

execute_state_machine

After clicking Start execution, you will see the following in the Execution event history:

event_history

Take note of the TaskFailed events: 10, 13, 16.

Drilling down into the details of one of the failure events, you will see that an error was thrown during the execution of the Multiply Lambda function.

custom_error

Since you set the MaxAttempts value to 2, you will see that the State Machine attempted to run the Multiply Lambda function 3 times: the original attempt plus 2 retries. This shows that the retry logic you have incorporated into the state machine is working as expected.
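
If you prefer to check this outside the console, the same count can be derived from the execution history. The sketch below is a minimal example that filters a list of history events for TaskFailed entries. The SAMPLE_EVENTS list is hand-written for illustration and only mirrors the event IDs mentioned above; a real list would come from the Step Functions GetExecutionHistory API, and the surrounding event types shown here are assumptions.

```python
from typing import Any, Dict, List

MAX_ATTEMPTS = 2  # the "MaxAttempts" value from the Retry block


def failed_task_events(events: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return only the TaskFailed entries from an execution history."""
    return [event for event in events if event.get("type") == "TaskFailed"]


# Hand-written sample shaped like the "events" list of an execution history.
SAMPLE_EVENTS: List[Dict[str, Any]] = [
    {"id": 8, "type": "TaskScheduled"},
    {"id": 9, "type": "TaskStarted"},
    {"id": 10, "type": "TaskFailed"},
    {"id": 11, "type": "TaskScheduled"},
    {"id": 12, "type": "TaskStarted"},
    {"id": 13, "type": "TaskFailed"},
    {"id": 14, "type": "TaskScheduled"},
    {"id": 15, "type": "TaskStarted"},
    {"id": 16, "type": "TaskFailed"},
]

if __name__ == "__main__":
    failures = failed_task_events(SAMPLE_EVENTS)
    print([event["id"] for event in failures])
    # One original attempt plus MaxAttempts retries.
    print(len(failures) == 1 + MAX_ATTEMPTS)
```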

State Machine Failure Alerting using CloudWatch Rules

In the event that your State Machine execution fails completely, you should be notified. By combining an Amazon CloudWatch Event Rule with an Amazon SNS Topic, you can receive a notification whenever an execution fails after its retries are exhausted.

Creating the SNS Topic and Topic Subscriber(s)

Navigate to the AWS console and search for SNS. Once on the SNS service you will see the following Dashboard page:

cw_rules

Select Topics on the left and click Create Topic. You will then be redirected to the following page:

sns_notifications

 

You will call the topic CalculationStateMachineFailureTopic. This is where failure events will be sent.

Creating the SNS Topic Subscriber(s)

Create an Amazon SNS email topic subscriber to be notified. On the same Amazon SNS page, click on Subscriptions and then click on Create subscription. On this page you will see the following:

sns_subscriber

 

You need to provide the following details:

  1. Topic ARN – this is the ARN of the SNS topic you created in the previous step.
  2. Protocol – select Email so that messages published to the topic are delivered by email.
  3. Endpoint – this is the email address that should be subscribed to the SNS topic.

Click on Create Subscription and then shortly after this the subscriber will receive a confirmation email.
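
The same two console steps (create the topic, then add an email subscriber) can also be scripted. The following is a minimal sketch, assuming you pass in an SNS client such as the one returned by boto3.client("sns"). The function name create_failure_topic and the example email address are assumptions made for illustration; the client is passed in as a parameter so the snippet itself has no AWS dependency.

```python
from typing import Any


def create_failure_topic(sns_client: Any, topic_name: str, email: str) -> str:
    """Create the SNS topic, subscribe an email endpoint, and return the topic ARN.

    sns_client is expected to behave like a boto3 SNS client.
    """
    topic_arn = sns_client.create_topic(Name=topic_name)["TopicArn"]
    # The subscriber still has to confirm the subscription from the email they receive.
    sns_client.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint=email)
    return topic_arn


# Example usage (requires boto3 and AWS credentials, so it is left commented out):
#
#   import boto3
#   arn = create_failure_topic(
#       boto3.client("sns"),
#       "CalculationStateMachineFailureTopic",
#       "you@example.com",  # placeholder address
#   )
```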

SNS Topic Access Policy

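To allow CloudWatch Events to publish to your new SNS topic, you must modify the topic's Access Policy so that the events.amazonaws.com service principal is allowed to publish to it. Add the following statement to the Statement array of the SNS access policy, replacing ${account-id} with your AWS account ID: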

{
    "Sid": "TrustCWEToPublishEventsToMyTopic",
    "Effect": "Allow",
    "Principal": {
        "Service": "events.amazonaws.com"
    },
    "Action": "sns:Publish",
    "Resource": "arn:aws:sns:us-east-1:${account-id}:CalculationStateMachineFailureTopic"
}
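
Because the block above is a single statement rather than a full policy document, it helps to see how it fits into one. The sketch below appends the statement to an existing policy dictionary without touching the statements that are already there. The helper names and the 12-digit account ID used in the usage example are placeholders chosen for illustration.

```python
import copy
import json
from typing import Any, Dict

SID = "TrustCWEToPublishEventsToMyTopic"


def events_publish_statement(topic_arn: str) -> Dict[str, Any]:
    """Build the statement that lets CloudWatch Events publish to the topic."""
    return {
        "Sid": SID,
        "Effect": "Allow",
        "Principal": {"Service": "events.amazonaws.com"},
        "Action": "sns:Publish",
        "Resource": topic_arn,
    }


def allow_events_to_publish(policy: Dict[str, Any], topic_arn: str) -> Dict[str, Any]:
    """Return a copy of the policy with the statement added (once)."""
    updated = copy.deepcopy(policy)
    statements = updated.setdefault("Statement", [])
    if not any(statement.get("Sid") == SID for statement in statements):
        statements.append(events_publish_statement(topic_arn))
    return updated


if __name__ == "__main__":
    # Placeholder account ID; substitute your own.
    arn = "arn:aws:sns:us-east-1:123456789012:CalculationStateMachineFailureTopic"
    existing = {"Version": "2012-10-17", "Statement": []}
    print(json.dumps(allow_events_to_publish(existing, arn), indent=4))
```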

The Access Policy will look similar to the following:

access_policy_sns

SNS Topic Subscriber Confirmation Email

The email subscriber will receive an email similar to the following:

email_subscriber

Click on Confirm subscription to confirm the subscription and start receiving notifications.

CloudWatch Rules

Now that you have set up your SNS topic and SNS Topic Subscriber, you can set up the CloudWatch Rules to monitor your State Machine executions. Navigate to the AWS Console and search for CloudWatch. Once on the CloudWatch page, click on Rules on the left of the page.

cw_rules

Follow these steps to set up the CloudWatch Rule:

  1. Select Event Pattern for the Event Source.
  2. Service Name = Step Functions
  3. Event Type = Step Functions Execution Status Change
  4. Select Specific status(es) and select FAILED
  5. Add the specific ARN of the State Machine in the next field.
  6. Under Targets select SNS topic and enter in the Topic name.
  7. Click Configure details and on the next page provide a Name for the Rule.
  8. Click on Create rule and you are good to go.
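
Behind the scenes, the selections in steps 1 through 5 produce an event pattern. The sketch below builds the pattern those selections should correspond to and prints it as JSON, which can be handy if you later want to define the rule outside the console. The function name and the state machine ARN in the usage example are placeholders assumed for illustration.

```python
import json
from typing import Any, Dict


def failed_execution_pattern(state_machine_arn: str) -> Dict[str, Any]:
    """Build an event pattern matching FAILED executions of one state machine."""
    return {
        "source": ["aws.states"],
        "detail-type": ["Step Functions Execution Status Change"],
        "detail": {
            "status": ["FAILED"],
            "stateMachineArn": [state_machine_arn],
        },
    }


if __name__ == "__main__":
    # Placeholder ARN; substitute the ARN of your own State Machine.
    arn = "arn:aws:states:us-east-1:123456789012:stateMachine:CalculationStateMachine"
    print(json.dumps(failed_execution_pattern(arn), indent=2))
```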

Demo Execution and Notification

You will now navigate back to Step Functions and re-run your failed execution. After the execution fails, you will receive a notification email telling you that the specific execution has failed:

example_execution_sm_1

Infrastructure Cleanup

In order to clean up the produced infrastructure above, please do the following in the AWS Console:

  1. Navigate to Step Functions and delete the State Machine you created above.
  2. Navigate to SNS and delete the SNS topic you created above.
  3. Navigate to CloudWatch Event Rules and delete the rule you created above.
  4. Navigate to Lambda and delete any corresponding AWS Lambda functions you created for the State Machine.

This will ensure that you no longer get charged after running the experiments from above.

Conclusion

In this post, you learned how to create an AWS Step Function State Machine with and without error handling. You also learned how to set up Amazon CloudWatch Event rules that will send you an email notification from an Amazon SNS Topic if your AWS Step Function State Machine fails.

 

About the authors

noyce_author.jpg

Matt Noyce

Matt Noyce is a Senior Application Architect in Professional Services at Amazon Web Services. He primarily works with Health Care and Life Sciences customers to architect, design, automate, and build solutions on AWS for their business needs.

viyoma_author.jpg

Viyoma Sachdeva

Viyoma Sachdeva is a DevOps Consultant at Amazon Web Services, supporting global customers on their journey to the cloud. Outside of work, she enjoys watching series and spending time with her family.