AWS Cloud Operations Blog
Introducing AWS Fault Injection Service Actions to Inject Chaos in Lambda functions
Usage of serverless technology in regulated industries like financial services is growing. This growth demands robust resilience validation. Chaos engineering for Serverless has become crucial for ensuring reliable and available serverless applications. By purposefully injecting failures and stresses into serverless components, teams can uncover hidden weaknesses and validate the fault tolerance of their systems.
Previously, AWS customers had two options to run chaos experiments for AWS Lambda:
- Modify code and use runtime-specific libraries.
- Inject faults using self-managed Lambda extensions leveraging the Runtime API proxy pattern.
Both methods required developer involvement and code management to integrate with the AWS Fault Injection Service (AWS FIS).
On the 30th of October 2024, we launched AWS FIS actions targeting Lambda functions. The native integration leverages the proxy pattern but relieves customers of managing the extension themselves. This increases the simplicity and reusability of AWS FIS experiment templates while reducing the management burden for serverless chaos experiments.
This blog post explains how this new approach works, and how it can be used to run chaos experiments targeting Lambda functions at scale.
Overview
AWS FIS now injects faults into AWS Lambda functions using a “chaos” Lambda extension that runs as a separate process within the Lambda execution environment, intercepting invocations before they reach the runtime. As described in a previous blog post, Automating chaos experiments with AWS Fault Injection Service and AWS Lambda, the extension hooks into the function invocation request and response lifecycle. The AWS-managed extension simplifies the process by allowing you to define Lambda actions and targets directly in your experiment template.
This integration makes it easier to run experiments while providing better timing control. It reduces operational overhead by eliminating the need to manage the Lambda extension yourself.
How it works
To explain and demonstrate this new functionality, a sample serverless application will be used throughout this blog post. This application is built using Amazon API Gateway, AWS Lambda, and Amazon DynamoDB. It provides simple Create, Read, Update, and Delete (CRUD) functionality to manage orders. The code for this demo environment is available at the aws-fis-actions-for-lambda GitHub repository.
To use AWS FIS to inject faults into the AWS Lambda functions, you need two things:
- Add the FIS managed extension to your Lambda function and configure it via environment variables
- Create an FIS experiment template specifying the actions and targeting the Lambda function
Once you start the experiment, AWS FIS will write the experiment configuration to an Amazon Simple Storage Service (Amazon S3) bucket in your AWS account. The AWS FIS managed extension reads the configuration from Amazon S3 and will inject any faults depending on what is defined in the experiment template. The following sections will further explain this mechanism and how you can configure it correctly.
AWS FIS Lambda Actions
With the launch of FIS Lambda actions, the following three actions are available to you in the FIS experiment templates:
- Add start delay: aws:lambda:invocation-add-delay
- Modify integration response: aws:lambda:invocation-http-integration-response
- Enforce invocation errors: aws:lambda:invocation-error
See AWS FIS Actions reference for a more detailed description of these actions.
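As a rough illustration of how these action IDs appear in an experiment template, the following Python sketch builds one entry for the template's actions map. The helper name build_action, the short kind labels, and the "Functions" target key are assumptions made for this example; please confirm the exact field and parameter names in the AWS FIS Actions reference before relying on them.

```python
# Action IDs are taken from the list above. Everything else in this sketch
# (helper name, short labels, the "Functions" target key) is an assumption.
LAMBDA_ACTION_IDS = {
    "add_delay": "aws:lambda:invocation-add-delay",
    "http_response": "aws:lambda:invocation-http-integration-response",
    "error": "aws:lambda:invocation-error",
}


def build_action(kind, target_name, parameters):
    """Return one entry for the 'actions' map of an experiment template."""
    if kind not in LAMBDA_ACTION_IDS:
        raise ValueError(f"unknown Lambda action kind: {kind}")
    return {
        "actionId": LAMBDA_ACTION_IDS[kind],
        # Copy the parameters so the caller's dict is not shared.
        "parameters": dict(parameters),
        # Maps the action to a named target defined elsewhere in the template.
        "targets": {"Functions": target_name},
    }
```

For example, build_action("error", "orderFunctions", {"duration": "PT5M"}) yields a dictionary that could be placed under a hypothetical action name in the template.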
Target selection
In the FIS experiment, you can select target Lambda functions using their Amazon Resource Names (ARNs) or by tags applied to target functions.
AWS FIS allows precise control over Lambda function fault injection through qualifiers. By specifying a qualifier, you can target specific versions or aliases of Lambda functions. When left empty, FIS targets all invocations regardless of version or alias. Including a version number or alias restricts FIS to affect only invocations made to the AWS Lambda ARNs with that specific version or alias, including the $LATEST alias.
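The two selection styles and the optional qualifier can be sketched as a small helper that produces a target definition. This is a minimal sketch, not the definitive template schema: the helper name build_target and the parameter name functionQualifier are assumptions to verify against the AWS FIS documentation.

```python
def build_target(arns=None, tags=None, qualifier=None):
    """Build a Lambda target definition selected by ARNs or by tags."""
    # Exactly one selection style must be used.
    if bool(arns) == bool(tags):
        raise ValueError("specify either arns or tags, but not both")
    target = {
        "resourceType": "aws:lambda:function",
        "selectionMode": "ALL",
    }
    if arns:
        target["resourceArns"] = list(arns)
    else:
        target["resourceTags"] = dict(tags)
    if qualifier:
        # Assumed parameter name; omitting it targets all invocations
        # regardless of version or alias.
        target["parameters"] = {"functionQualifier": qualifier}
    return target
```

Leaving qualifier unset mirrors the "empty qualifier" case described above, while passing a version number or alias name restricts the target to that qualifier.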
AWS FIS managed extension
The AWS FIS managed extension is a crucial component for serverless chaos engineering, implemented as an AWS Lambda layer. AWS provides this layer publicly through separate AWS-managed accounts for each supported Region. You can access the ARNs for ARM and x86 versions via documentation and AWS Systems Manager Parameter Store, enabling easy and secure integration with your Lambda functions using IaC, CI/CD pipelines, or manual processes.
The extension’s effectiveness stems from its integration with the Lambda execution environment lifecycle. It initializes before the function runtime and subsequent invocations, allowing it to check for ongoing FIS experiments and configurations prior to the first function invocation.
To minimize performance impact, the extension uses asynchronous polling from Amazon S3 for configurations, caching them during initialization and parallel to function executions. This asynchronous approach ensures minimal impact on function execution duration. Additionally, the extension employs an adaptive polling strategy to optimize its operation based on experiment status.
Slow-Polling
When no experiment is running, the extension initiates Slow-Polling.
This mode uses a longer interval (default 60 seconds) to minimize operational overhead during normal Lambda execution. Lambda invocations continue unaffected, while the extension remains ready for potential experiments.
Fast-Polling
When the FIS Extension for Lambda detects an active experiment, it initiates the Fast-Polling timer and starts injecting the faults defined in the FIS experiment configuration. The Fast-Polling timer interval is fixed at 20 seconds.
The shorter interval ensures:
- Prompt fault injection based on the experiment configuration
- Rapid return to normal operation when the experiment ends
This dual-mode strategy balances minimal impact during regular operations with high responsiveness during active experiments, enhancing the efficiency of serverless chaos engineering using AWS FIS.
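To make the dual-mode behavior concrete, here is a small simulation of when polls would occur around an experiment window. It is a simplified model rather than the extension's actual implementation: the function names are hypothetical, and it assumes the next interval is chosen purely from whether an experiment is active at poll time.

```python
SLOW_POLL_SECONDS = 60  # default slow interval mentioned above
FAST_POLL_SECONDS = 20  # fixed fast interval mentioned above


def next_poll_interval(experiment_active, slow=SLOW_POLL_SECONDS):
    """Pick the wait before the next poll based on experiment status."""
    return FAST_POLL_SECONDS if experiment_active else slow


def poll_times(experiment_start, experiment_end, horizon, slow=SLOW_POLL_SECONDS):
    """Simulate poll timestamps (in seconds) from 0 up to the horizon."""
    t, times = 0, []
    while t <= horizon:
        times.append(t)
        # The experiment window is treated as [start, end).
        active = experiment_start <= t < experiment_end
        t += next_poll_interval(active, slow)
    return times
```

In this model, an experiment that starts at second 30 is only noticed at the next slow poll (second 60), after which polls arrive every 20 seconds until the experiment ends.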
Fault Injection Decision
At the beginning of each invocation, the extension makes a probability determination. The result is cached and used to decide whether each individual fault action applies, taking into account the configured actions and their invocation percentages. The invocation percentage can be understood as the likelihood of an action taking effect on a single invocation. Each fault action is evaluated in order, either before or after the function handler runs.
The evaluation is based on the currently cached configuration state. Because the extension polls the configuration asynchronously, that state may change during a function invocation. This means that if an experiment injects faults at both the beginning and the end of an invocation (e.g. aws:lambda:invocation-add-delay and aws:lambda:invocation-http-integration-response), it is possible that one is applied while the other is not.
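One possible reading of this decision logic is a single random draw per invocation that is then compared against each action's invocation percentage. The sketch below illustrates that reading only; the extension's internal mechanism is not documented here, and the function name and the dictionary shape are assumptions.

```python
import random


def decide_faults(actions, rng=random):
    """Return the IDs of the actions that apply to one invocation."""
    # One draw per invocation, in [0, 100); reused for every action.
    roll = rng.random() * 100
    return [
        action["actionId"]
        for action in actions  # evaluated in configured order
        if roll < action["invocationPercentage"]
    ]
```

Under this assumption, an action configured at 100 percent always applies and one at 0 percent never does. A side effect of reusing the draw is that an action with a lower percentage would only fire on invocations where every action with a higher percentage also fires.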
Lambda function configuration
After the AWS FIS extension has been added to the AWS Lambda function via a layer, the following environment variables need to be set:
AWS_FIS_CONFIGURATION_LOCATION
This is set to the ARN of the S3 bucket that AWS FIS uses for configuration distribution (including the ‘/’ delimiter and an optional key name prefix), e.g. arn:aws:s3:::my-config-distribution-bucket/ or arn:aws:s3:::my-config-distribution-bucket/FisConfigs/. This value tells the extension where to get the fault configurations from. AWS FIS will query this value from the targeted Lambda functions to determine where to write the fault configuration.
If any qualifiers are specified in the target selection, AWS FIS inspects the AWS_FIS_CONFIGURATION_LOCATION value for each explicitly specified version or alias and writes to the respective locations. If no qualifier is specified, FIS only checks the configuration of $LATEST to determine the write location. Because environment variables are version-specific, versions that have an AWS_FIS_CONFIGURATION_LOCATION value but were not included in the target list will not receive fault configurations.
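The expected shape of the AWS_FIS_CONFIGURATION_LOCATION value can be checked with a small parser before deploying. This is a minimal sketch with a hypothetical function name; it only handles the arn:aws:s3::: form shown in the examples above and does not cover other AWS partitions.

```python
S3_ARN_PREFIX = "arn:aws:s3:::"


def parse_config_location(arn):
    """Split a configuration location ARN into (bucket, key_prefix)."""
    if not arn.startswith(S3_ARN_PREFIX):
        raise ValueError("expected an S3 bucket ARN")
    bucket, delimiter, key_prefix = arn[len(S3_ARN_PREFIX):].partition("/")
    # The '/' delimiter is required even when no key prefix is used.
    if not bucket or not delimiter:
        raise ValueError("ARN must contain a bucket name followed by '/'")
    return bucket, key_prefix
```

With the examples from this section, the first value parses to an empty key prefix and the second to the prefix FisConfigs/.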
AWS_LAMBDA_EXEC_WRAPPER
Set the AWS_LAMBDA_EXEC_WRAPPER environment variable to /opt/aws-fis-bootstrap.
Warning
This variable should be set only after the FIS Lambda layer has been added, and unset before the layer is removed. Setting this variable without the layer installed will result in 500 errors for any function invocation.
AWS_FIS_EXTENSION_METRICS
Set the AWS_FIS_EXTENSION_METRICS environment variable to all if you’d like to emit embedded metric format (EMF) logs. By default, the extension does not emit EMF logs, and AWS_FIS_EXTENSION_METRICS defaults to none.
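The rules for these three variables can be captured in a simple pre-deployment check. The function below is a hypothetical helper, not part of any AWS tooling; it takes the function's environment variables as a dictionary and a flag indicating whether the FIS layer is attached, and returns a list of human-readable problems.

```python
WRAPPER_PATH = "/opt/aws-fis-bootstrap"


def check_fis_environment(env, has_fis_layer):
    """Return a list of configuration problems (empty if none found)."""
    problems = []
    wrapper = env.get("AWS_LAMBDA_EXEC_WRAPPER")
    if not has_fis_layer:
        # The wrapper must never point at the bootstrap without the layer.
        if wrapper == WRAPPER_PATH:
            problems.append("wrapper set without the FIS layer: invocations will fail")
        return problems
    if wrapper != WRAPPER_PATH:
        problems.append(f"AWS_LAMBDA_EXEC_WRAPPER should be {WRAPPER_PATH}")
    location = env.get("AWS_FIS_CONFIGURATION_LOCATION", "")
    if not location.startswith("arn:aws:s3:::"):
        problems.append("AWS_FIS_CONFIGURATION_LOCATION should be an S3 bucket ARN")
    # The variable is optional and defaults to "none".
    if env.get("AWS_FIS_EXTENSION_METRICS", "none") not in {"all", "none"}:
        problems.append("AWS_FIS_EXTENSION_METRICS should be 'all' or 'none'")
    return problems
```

A check like this could run in a CI/CD pipeline to catch the case described in the warning above, where the wrapper variable is left in place after the layer is removed.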
Experimenting
The aws-fis-actions-for-lambda GitHub repository, which contains the sample CRUD application, provides experiment templates that use the AWS FIS Lambda actions. Please refer to the README in the repository to get started.
Experiment 1
The experiment template Lambda Latency Injection Fault introduces two seconds of startup latency for 100% of the invocations for a duration of 10 minutes. This helps to understand how the serverless CRUD API behaves when all invocations have higher-than-normal startup latency. The experiment uses tag-based target selection, where every Lambda function with the tag FISExperimentReady = Yes is selected.
To measure the impact of fault injection, we will use a load testing solution called Artillery. It sends requests for creating and fetching an order. The Artillery metrics are published to Amazon CloudWatch. Artillery metrics representing the client-side behavior, API Gateway metrics, and Lambda metrics are preconfigured in a single observability dashboard.
If you’ve deployed the sample application, you can navigate to CloudWatch in the AWS Console and then choose Dashboards to see that the function invocations are impacted by an additional two seconds of latency.
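The targets and actions portion of such a template might look roughly like the dictionary built below. This is a sketch under assumptions, not the template shipped in the repository: the target and action names (taggedFunctions, addStartupLatency) are made up, and the parameter names duration, invocationPercentage, and startupDelayMilliseconds should be verified against the AWS FIS Actions reference.

```python
def latency_experiment_fragment(delay_seconds=2, minutes=10, percentage=100):
    """Build the targets/actions part of a latency injection template."""
    return {
        "targets": {
            "taggedFunctions": {  # hypothetical target name
                "resourceType": "aws:lambda:function",
                "resourceTags": {"FISExperimentReady": "Yes"},
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "addStartupLatency": {  # hypothetical action name
                "actionId": "aws:lambda:invocation-add-delay",
                "parameters": {
                    # Parameter values are passed as strings.
                    "duration": f"PT{minutes}M",  # ISO 8601 duration
                    "invocationPercentage": str(percentage),
                    "startupDelayMilliseconds": str(delay_seconds * 1000),
                },
                "targets": {"Functions": "taggedFunctions"},
            }
        },
    }
```

Calling the function with its defaults reproduces the values described for Experiment 1: two seconds of delay, 100% of invocations, for 10 minutes.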
Experiment 2
To understand the behavior of this API when an impairment affects the success of the execution, we created the Modify Function Output Fault experiment template. The action in this template injects HTTP status code 500 for 100% of AWS Lambda invocations for 10 minutes.
Note that with the preventExecution parameter set to true, the AWS Lambda function is still invoked and returns the configured statusCode, but it does not perform its intended work (updating the Amazon DynamoDB table).
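The effect of preventExecution can be illustrated with a toy model in which a list stands in for the DynamoDB table. This is a behavioral sketch only: the function names are hypothetical, and the branch for preventExecution set to false (handler runs, then only the status code is replaced) is an assumption about how the response is modified.

```python
def invoke_with_response_fault(handler, event, status_code=500, prevent_execution=True):
    """Model an invocation with the integration response fault active."""
    if prevent_execution:
        # The handler is skipped, so its side effects never happen.
        return {"statusCode": status_code}
    # Assumed behavior: the handler runs, then the status code is replaced.
    response = dict(handler(event))
    response["statusCode"] = status_code
    return response


orders_table = []  # stand-in for the DynamoDB table


def create_order(event):
    """Toy handler that records an order and reports success."""
    orders_table.append(event["orderId"])
    return {"statusCode": 201, "body": "created"}
```

In this model, a client sees status code 500 in both cases, but only with prevent_execution set to False does the order end up in orders_table, which is why the two settings exercise different failure paths in the caller.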
Conclusion
AWS FIS Lambda actions make it easy to get started and to ensure that the resilience of the workload is regularly verified by applying chaos engineering in the form of fault injection experiments. They enable you to follow AWS Well-Architected Framework best practices without needing to change code or involve developers.
This blog post provided an overview of this new capability and how it can be used to strengthen the resilience of your serverless applications. If you have deployed the sample application to follow along, don’t forget to clean up using the guidance in the repository README file.
Explore the AWS FIS Workshop. This comprehensive guide walks you through setting up serverless chaos experiments, as well as other fault injection scenarios across various AWS services.