AWS Cloud Operations Blog

Introducing AWS Fault Injection Service Actions to Inject Chaos in Lambda functions

Usage of serverless technology in regulated industries like financial services is growing. This growth demands robust resilience validation. Chaos engineering has become crucial for ensuring that serverless applications are reliable and available. By purposefully injecting failures and stresses into serverless components, teams can uncover hidden weaknesses and validate the fault tolerance of their systems.

Previously, AWS customers had two options to run chaos experiments for AWS Lambda:

  1. Modify code and use runtime-specific libraries.
  2. Inject faults using self-managed Lambda extensions leveraging the Runtime API proxy pattern.

Both methods required developer involvement and code management to integrate with the AWS Fault Injection Service (AWS FIS).

On the 30th of October 2024, we launched AWS FIS actions targeting Lambda functions. The native integration leverages the proxy pattern but relieves customers of managing the extension themselves. This increases the simplicity and reusability of AWS FIS experiment templates while reducing the management burden for serverless chaos experiments.

This blog post explains how this new approach works, and how it can be used to run chaos experiments targeting Lambda functions at scale.

Overview

AWS FIS now injects faults into AWS Lambda functions using a “chaos” Lambda extension that runs as a separate process within the Lambda execution environment, intercepting invocations before they reach the runtime. As described in a previous blog post, Automating chaos experiments with AWS Fault Injection Service and AWS Lambda, the extension hooks into the function invocation request and response lifecycle. The AWS-managed extension simplifies the process by allowing you to define Lambda actions and targets directly in your experiment template.

This integration makes it easier to run experiments while providing better timing control. It reduces operational overhead by eliminating the need to manage the Lambda extension yourself.

How it works

To explain and demonstrate this new functionality, a sample serverless application will be used throughout this blog post. This application is built using Amazon API Gateway, AWS Lambda, and Amazon DynamoDB. It provides simple Create, Read, Update, and Delete (CRUD) functionality to manage orders. The code for this demo environment is available at the aws-fis-actions-for-lambda GitHub repository.

A diagram showing a workflow in AWS with four icons: Consumer, PUT/GET/UPDATE Payload, API Gateway, Lambda Function, and DynamoDB table. The flow starts from the Consumer, passes through the API Gateway, then to the Lambda Function, and finally interacts with the DynamoDB table, suggesting a serverless architecture for handling API requests and data storage.

To use AWS FIS to inject faults into the AWS Lambda functions, you need two things:

  1. Add the FIS managed extension to your Lambda function and configure it via environment variables
  2. Create an FIS experiment template specifying the actions and targeting the Lambda function

Once you start the experiment, AWS FIS will write the experiment configuration to an Amazon Simple Storage Service (Amazon S3) bucket in your AWS account. The AWS FIS managed extension reads the configuration from Amazon S3 and will inject any faults depending on what is defined in the experiment template. The following sections will further explain this mechanism and how you can configure it correctly.

A diagram illustrating the AWS Fault Injection Service workflow. It shows a consumer triggering API Gateway, which invokes a Lambda function. The Extension in Lambda function checks for an active Experiment using the FIS. If an experiment is active, the injected fault is sent to the target AWS service or DynamoDB table through the AWS Fault Injection Service.

AWS FIS Lambda Actions

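With the launch of FIS Lambda actions, the following three actions are available to you in FIS experiment templates:

  1. aws:lambda:invocation-add-delay: delays the start of the function by a configured number of milliseconds.
  2. aws:lambda:invocation-error: marks the invocation as failed.
  3. aws:lambda:invocation-http-integration-response: returns a configured HTTP status code in place of the function's normal response.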

See AWS FIS Actions reference for a more detailed description of these actions.

Target selection

In the FIS experiment, you can select target Lambda functions using their Amazon Resource Names (ARNs) or by tags applied to target functions.

The image shows two options for the "Target method" in AWS Fault Injection Service: "Resource IDs" and "Resource tags, filters and parameters". The options are presented in separate boxes, with the "Resource IDs" option on the left and "Resource tags, filters and parameters" on the right.

AWS FIS allows precise control over Lambda function fault injection through qualifiers. By specifying a qualifier, you can target specific versions or aliases of Lambda functions. When the qualifier is left empty, FIS targets all invocations regardless of version or alias. When you include a version number or alias, FIS affects only invocations made to the AWS Lambda ARNs with that specific version or alias. This also applies to $LATEST.

Text field showing the "functionQualifier" parameter in the AWS Fault Injection Service configuration. This parameter sets the qualifier for AWS Lambda functions that will be impaired if the AWS FIS extension is attached to it. The default value is "null" which indicates any version of the function will be impaired.
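The qualifier rule described above can be sketched in a few lines of Python. This is only an illustration of the targeting semantics, not how AWS FIS implements them. The function names are hypothetical, and the sketch assumes that an ARN without a qualifier refers to $LATEST.

```python
from typing import Optional


def qualifier_of(invoked_arn: str) -> str:
    """Return the version or alias in a Lambda ARN, or "$LATEST" if none is given."""
    # Expected shape: arn:aws:lambda:<region>:<account>:function:<name>[:<qualifier>]
    parts = invoked_arn.split(":")
    return parts[7] if len(parts) > 7 else "$LATEST"


def is_targeted(invoked_arn: str, function_qualifier: Optional[str]) -> bool:
    """Decide whether an invocation falls within the experiment target."""
    if not function_qualifier:
        # Empty qualifier: every version and alias is in scope.
        return True
    return qualifier_of(invoked_arn) == function_qualifier
```

For example, with functionQualifier set to prod, an invocation of the prod alias is in scope while an invocation of version 2 is not.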

AWS FIS managed extension

The AWS FIS managed extension is a crucial component for serverless chaos engineering, implemented as an AWS Lambda layer. AWS provides this layer publicly through separate AWS-managed accounts for each supported Region. You can access the ARNs for ARM and x86 versions via documentation and AWS Systems Manager Parameter Store, enabling easy and secure integration with your Lambda functions using IaC, CI/CD pipelines, or manual processes.

The image shows the AWS Lambda function overview page, displaying details of a function named "FisLambdaAPIs-ApiJavaCreateItemFunctionJava2f59691-t28qElSvDCIE". The page provides options to throttle the function, copy its Amazon Resource Name (ARN), view actions, export to the Application Composer, and download. There is a diagram showing the function's architecture, and a table listing the single layer used by the function, which is "aws-fis-extension-x86_64" with version 9 compatible with the x86_64 architecture.

The extension’s effectiveness stems from its integration with the Lambda execution environment lifecycle. It initializes before the function runtime and subsequent invocations, allowing it to check for ongoing FIS experiments and configurations prior to the first function invocation.

To minimize performance impact, the extension polls Amazon S3 for configurations asynchronously. It fetches and caches the configuration during initialization, then refreshes the cache in parallel with function executions, so polling adds minimal time to each invocation. Additionally, the extension employs an adaptive polling strategy that changes the polling interval based on experiment status.

Slow-Polling

When no experiment is running, the extension initiates Slow-Polling.

A diagram showing the AWS Lambda function execution lifecycle, including initialization, invocation, and shutdown phases. The timeline indicates a 60-second "Slow-Pull" timeout period between invocations. Related icons depict a bucket representing the file system bucket and a decision diamond for container reuse.

This mode uses a longer interval (default 60 seconds) to minimize operational overhead during normal Lambda execution. Lambda invocations continue unaffected, while the extension remains ready for potential experiments.

Fast-Polling

When the FIS Extension for Lambda detects an active experiment, it initiates the Fast-Polling timer and starts injecting faults defined in the FIS experiment configuration:

The image depicts the AWS Fault Injection Service workflow. It shows the sequence of various stages involved, including Extension Init, Runtime Init, Function Init, Invoke Fault stages, and Runtime Shutdown and Extension Shutdown stages. A Fast-Poll Timer of 20 seconds is also represented. The workflow branches out based on a Yes/No decision point after the Experiment Active stage.

The Fast-Polling timer interval is fixed at 20 seconds.

The shorter interval ensures:

  1. Prompt fault injection based on the experiment configuration
  2. Rapid return to normal operation when the experiment ends

This dual-mode strategy balances minimal impact during regular operations with high responsiveness during active experiments, enhancing the efficiency of serverless chaos engineering using AWS FIS.
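To make the two polling modes concrete, here is a minimal Python sketch of how the interval could be chosen after each poll. The constants come from the intervals stated above (60 seconds by default for Slow-Polling, 20 seconds for Fast-Polling). The function names are hypothetical and the real extension's internals may differ.

```python
from typing import Iterable, List

SLOW_POLL_SECONDS = 60  # default interval while no experiment is running
FAST_POLL_SECONDS = 20  # fixed interval while an experiment is active


def next_poll_interval(experiment_active: bool) -> int:
    """Pick the wait before the next configuration poll."""
    return FAST_POLL_SECONDS if experiment_active else SLOW_POLL_SECONDS


def poll_times(observations: Iterable[bool]) -> List[int]:
    """Given the experiment state seen at each poll, return when each poll happens.

    Times are offsets in seconds from the first poll.
    """
    times, now = [], 0
    for active in observations:
        times.append(now)
        now += next_poll_interval(active)
    return times
```

Calling poll_times([False, False, True, True, False]) returns [0, 60, 120, 140, 160]: once the extension sees an active experiment it checks again after 20 seconds, and it falls back to the longer interval after seeing that the experiment has ended.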

Fault Injection Decision

At the beginning of each invocation, the extension makes a probability determination. The result is cached for the duration of the invocation and used to decide whether each configured fault action applies, based on that action's invocation percentage. The invocation percentage can be understood as the likelihood of an action taking effect on a single invocation. The fault actions are evaluated in order, each either before or after the function handler runs.

A diagram showing four steps for AWS Lambda function invocation in the context of fault injection. The first step is "Wait before jumping into runtime (latency)," the second is "Don't jump into runtime" with options for an error or HTTP integration response. The third step is the actual "Runtime" action. The fourth step is "Return error" or "Replace function outputs" with options for an error or HTTP integration response.

Each action is evaluated against the configuration that is cached at the time of evaluation. Because the extension polls the configuration asynchronously, the cached state may change during a function invocation. This means that if an experiment injects faults both at the beginning and at the end of an invocation (e.g. aws:lambda:invocation-add-delay and aws:lambda:invocation-http-integration-response), one may be applied while the other is not.
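One plausible reading of the per-invocation probability determination is sketched below in Python: a single random value is drawn once per invocation and compared against each action's invocation percentage. This is an assumption for illustration only. The post does not spell out the exact algorithm, and the select_actions function and its roll parameter are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def select_actions(
    actions: List[Tuple[str, float]], roll: Optional[float] = None
) -> List[str]:
    """Return the action IDs that take effect for one invocation.

    actions: (action_id, invocation_percentage) pairs in evaluation order.
    roll: value in [0, 100) drawn once per invocation; pass it explicitly
          to make the outcome reproducible.
    """
    if roll is None:
        roll = random.random() * 100.0  # drawn once, reused for every action
    return [action_id for action_id, percentage in actions if roll < percentage]


configured = [
    ("aws:lambda:invocation-add-delay", 100),
    ("aws:lambda:invocation-http-integration-response", 50),
]
```

With this model, an action configured at 100% always applies and an action at 0% never does. A roll of 75 against the configured list selects only the delay action, while a roll of 10 selects both.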

Lambda function configuration

After the AWS FIS extension has been added to the AWS Lambda function via a layer, the following environment variables need to be set:

AWS_FIS_CONFIGURATION_LOCATION

This is set to the S3 bucket ARN which AWS FIS uses for configuration distribution (including ‘/’ delimiter and optional key name prefix), e.g. arn:aws:s3:::my-config-distribution-bucket/ or arn:aws:s3:::my-config-distribution-bucket/FisConfigs/. This variable value tells the extension where to get the fault configurations from. AWS FIS will query this value from the targeted Lambda functions to determine where to write the fault configuration.

If any qualifiers are specified in the target selection, AWS FIS will inspect the AWS_FIS_CONFIGURATION_LOCATION value for each explicitly specified version or alias and will write to the respective locations. If no qualifier is specified, FIS will only check the configuration of $LATEST to determine the write location. Because environment variables are version-specific, a version that was not included in the target list will not receive fault configurations, even if it has an AWS_FIS_CONFIGURATION_LOCATION value set.

AWS_LAMBDA_EXEC_WRAPPER

Set the AWS_LAMBDA_EXEC_WRAPPER environment variable to /opt/aws-fis-bootstrap.

Warning
This variable should be set only after the FIS Lambda layer has been added, and it should be unset before the layer is removed. Setting this variable without the layer installed will result in 500 errors for any function invocation.

AWS_FIS_EXTENSION_METRICS

Set the AWS_FIS_EXTENSION_METRICS environment variable to all if you’d like to emit Embedded Metric Format (EMF) logs. By default, the extension does not emit EMF logs, and AWS_FIS_EXTENSION_METRICS defaults to none.
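The three variables can be assembled and sanity-checked before they are applied to a function. The following Python sketch assumes the location format described above (an S3 ARN with a '/' delimiter and an optional key prefix). The helper names are hypothetical, and the validation is a convenience rather than something AWS FIS requires you to run.

```python
from typing import Dict, Tuple

S3_ARN_PREFIX = "arn:aws:s3:::"


def parse_configuration_location(location: str) -> Tuple[str, str]:
    """Split an AWS_FIS_CONFIGURATION_LOCATION value into (bucket, key_prefix)."""
    if not location.startswith(S3_ARN_PREFIX):
        raise ValueError("expected an S3 ARN such as arn:aws:s3:::my-bucket/")
    bucket, delimiter, key_prefix = location[len(S3_ARN_PREFIX):].partition("/")
    if not bucket or not delimiter:
        raise ValueError("expected '<bucket>/' followed by an optional key prefix")
    return bucket, key_prefix


def fis_environment(location: str, emit_metrics: bool = False) -> Dict[str, str]:
    """Build the environment variables for a function that already has the layer."""
    parse_configuration_location(location)  # fail early on a malformed value
    return {
        "AWS_FIS_CONFIGURATION_LOCATION": location,
        "AWS_LAMBDA_EXEC_WRAPPER": "/opt/aws-fis-bootstrap",
        "AWS_FIS_EXTENSION_METRICS": "all" if emit_metrics else "none",
    }
```

Because AWS_LAMBDA_EXEC_WRAPPER must not be set without the layer, apply the resulting dictionary in the same deployment step that attaches the layer, or in a later one.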

Experimenting

The aws-fis-actions-for-lambda GitHub repository, which contains the sample CRUD application, provides experiment templates that use the AWS FIS Lambda actions. Please refer to the README in the repository to get started.

Experiment 1

The experiment template Lambda Latency Injection Fault introduces two seconds of startup latency for 100% of the invocations for a duration of 10 minutes. This helps you understand how the serverless CRUD API behaves when all invocations have higher than normal startup latency. The experiment uses tag-based target selection, where every Lambda function with the tag FISExperimentReady = Yes is selected.

The image shows a configuration section from the AWS Fault Injection Service, specifically the "Actions" tab. It displays an action sequence named "instanceActions / aws:lambda:invocation-add-delay (10 min)" with details like the target being "TargetTaggedLambda", a duration of "PT10M" (10 minutes), an invocation percentage of 100%, and a startup delay of 2000 milliseconds for injecting faults into AWS Lambda functions.

This screenshot displays the "Targets" section of the AWS Fault Injection Service interface. It shows a single target named "TargetTaggedLambda" which is an AWS Lambda function resource. The selection mode is set to "ALL" and the resource has a tag "FISExperimentReady=Yes" applied to it.
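For readers who prefer to see the template as data, the settings shown in the two screenshots can be written as a Python dictionary. Treat this as a sketch: the values (PT10M, 100, 2000, the tag, and the action and target names) come from the screenshots, but the key names reflect my understanding of the FIS experiment template schema and should be checked against the AWS FIS Actions reference. The role ARN argument and the function name are hypothetical, and the template in the sample repository may differ.

```python
def latency_experiment_template(role_arn: str) -> dict:
    """Describe Experiment 1 as an experiment template document (assumed schema)."""
    return {
        "description": "Lambda Latency Injection Fault",
        "roleArn": role_arn,  # IAM role that FIS assumes; supply your own
        "targets": {
            "TargetTaggedLambda": {
                "resourceType": "aws:lambda:function",
                "resourceTags": {"FISExperimentReady": "Yes"},
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "instanceActions": {
                "actionId": "aws:lambda:invocation-add-delay",
                "parameters": {
                    "duration": "PT10M",                 # 10 minutes
                    "invocationPercentage": "100",       # every invocation
                    "startupDelayMilliseconds": "2000",  # two seconds
                },
                "targets": {"Functions": "TargetTaggedLambda"},
            }
        },
        # No automatic stop condition in this sketch; consider a CloudWatch alarm.
        "stopConditions": [{"source": "none"}],
    }
```

A dictionary of this shape could then be passed as keyword arguments to an SDK call that creates the template. That step is not shown here because it requires AWS credentials and an existing role.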

To measure the impact of fault injection, we will use a load testing solution called Artillery. It sends requests for creating and fetching an order. The Artillery metrics are published to Amazon CloudWatch. Artillery metrics representing the client-side behavior, API Gateway metrics, and Lambda metrics are preconfigured in a single observability dashboard.
If you’ve deployed the sample application, you can navigate to CloudWatch in the AWS Console and choose Dashboards to see that the function invocations are impacted by an additional two seconds of latency.

A dashboard displaying various metrics and graphs related to Lambda function performance, API Gateway usage, errors, and latency over a time period. The graphs show data points for steady state and impairment scenarios, including request counts, durations, latency, and error rates across multiple AWS Lambda functions.

Experiment 2

To understand the behavior of this API when an impairment prevents invocations from succeeding, we created the Modify Function Output Fault experiment template. The action in this template injects HTTP status code 500 for 100% of AWS Lambda invocations for 10 minutes.

A screenshot from the AWS Fault Injection Service showing configuration options for an "instanceActions / aws:lambda:invocation-http-integration-response" action that modifies an HTTP integration response. The options include setting a 500 status code, 10-minute duration, 100% invocation percentage, and preventing execution.

Note that with the preventExecution parameter set to true, the AWS Lambda function is still invoked and returns the configured statusCode, but the handler does not perform its intended work (updating the Amazon DynamoDB table).
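The effect of preventExecution can be modelled with a small Python wrapper around a handler. This is a simplified illustration of the behavior described above, not the extension's code. The wrapper name and the shape of the returned response are assumptions.

```python
from typing import Any, Callable, Dict


def invoke_with_response_fault(
    handler: Callable[[Dict[str, Any]], Any],
    event: Dict[str, Any],
    status_code: int,
    prevent_execution: bool,
) -> Dict[str, Any]:
    """Model an invocation while the HTTP integration response fault is active."""
    if not prevent_execution:
        # The handler still runs, so its side effects (such as a DynamoDB write)
        # happen, but its output is replaced below.
        handler(event)
    return {"statusCode": status_code}


writes = []  # stands in for the DynamoDB table in this sketch


def create_order(event: Dict[str, Any]) -> Dict[str, Any]:
    writes.append(event["orderId"])
    return {"statusCode": 201}
```

With prevent_execution set to True the caller receives status code 500 and writes stays empty. With False the caller still receives 500, but the order has been written. That second case is worth testing separately, because clients that retry on a 500 may create duplicates.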

This image displays multiple line graphs and metric visualizations from the AWS Lambda Chaos Dashboard, showing various operational metrics over time for an application using API Gateway and AWS Lambda functions. The metrics include request counts, latency, errors, and invocation durations across different components of the system, with data points plotted at two specific time intervals (17:30 and 17:45) indicating steady-state and impairment scenarios.

Conclusion

AWS FIS Lambda actions make it easy to get started with chaos engineering and to verify the resilience of your workload regularly through fault injection experiments. They enable you to follow AWS Well-Architected Framework best practices without changing code or involving developers.

This blog post provided an overview of this new capability and how it can be used to strengthen the resilience of your serverless applications. If you have deployed the sample application to follow along, don’t forget to clean up using the guidance in the repository README file.

Explore the AWS FIS Workshop. This comprehensive guide walks you through setting up serverless chaos experiments, as well as other fault injection scenarios across various AWS services.

Vladislav Nedosekin

Vladislav Nedosekin is a Principal Solutions Architect with over 20 years of experience designing and implementing mission-critical services and applications based out of London, UK. At Amazon Web Services, he guided leading financial institutions in architecting innovative, cloud-native solutions with a focus on resilience and chaos engineering. Vladislav has extensive expertise helping customers leverage cutting-edge cloud technologies, including serverless and generative AI, to build highly reliable, scalable solutions.

Saurabh Kumar

Saurabh Kumar is a Senior Solutions Architect based out of North Carolina, USA. He is passionate about helping customers solve their business challenges and technical problems from migration to modernization and optimization. Outside of work, he spends time with his family, watching TV, gardening, and enjoying outdoor activities.

André Stoll

André Stoll is a Solutions Architect at Amazon Web Services (AWS) where he helps customers in Switzerland to leverage the full potential of the cloud. He has a background in Software Engineering, DevOps & SRE and currently focuses on container, SaaS and serverless technologies.