AWS Fault Injection Service features

Overview

AWS Fault Injection Service (FIS) is a fully managed service for running fault injection experiments. It helps teams discover an application’s weaknesses at scale so that they can improve its performance, observability, and resilience. The supported fault injection actions are listed in the FIS documentation.

Simple setup

AWS Fault Injection Service makes it easy to start building and running fault injection experiments without installing any agents. Fully managed fault injection actions cover disruptions such as stopping an instance, throttling an API, and failing over a database. Fault Injection Service integrates with Amazon CloudWatch, so you can use your existing metrics to monitor experiments.
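As a rough illustration of how small such an experiment definition can be, the sketch below builds the body of an experiment template that stops one EC2 instance and restarts it after five minutes. The field names follow the FIS `CreateExperimentTemplate` API as we understand it; the function name, account ID, instance ID, and role name are hypothetical placeholders, and the block only constructs and prints the definition without calling AWS.

```python
import json


def build_stop_instance_template(instance_arn: str, role_arn: str) -> dict:
    """Return a dict shaped like an FIS experiment template (illustrative only)."""
    return {
        "description": "Stop one EC2 instance, then restart it after 5 minutes",
        # IAM role that FIS assumes to run the experiment (placeholder value below).
        "roleArn": role_arn,
        # No automatic stop condition in this minimal sketch.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceArns": [instance_arn],
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                # ISO 8601 duration: restart the instance after 5 minutes.
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                # Maps the action's "Instances" target to the target defined above.
                "targets": {"Instances": "oneInstance"},
            }
        },
    }


# Hypothetical ARNs, for illustration only.
template = build_stop_instance_template(
    "arn:aws:ec2:us-east-1:111122223333:instance/i-0123456789abcdef0",
    "arn:aws:iam::111122223333:role/hypothetical-fis-role",
)
print(json.dumps(template, indent=2))
```

With an AWS SDK installed and credentials configured, a dictionary like this could be passed as keyword arguments to the SDK's create-experiment-template call (in boto3, `create_experiment_template` on the `fis` client); that step is left out here so the sketch runs anywhere.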

Run real-world scenarios

Scenarios define events or conditions that you can apply to test the resilience of your applications, such as an AZ power interruption or a cross-Region connectivity interruption. Scenarios are created and owned by AWS. They minimize undifferentiated heavy lifting by providing pre-defined targets and fault actions (e.g., gradually increasing CPU load from 90% to 100% for Amazon EC2 instances) for possible application impairments.

Scenarios are provided through the Scenario Library in the FIS console and are run using an FIS experiment template. To run an experiment using a scenario, select the scenario from the library, copy it to your experiment template, and specify your application details. Each scenario includes a detailed description and suggested metrics for measuring your application’s response during the experiment, helping you improve the resilience posture of your applications over time. The supported scenarios are listed in the FIS documentation.

Fine-grained safety controls

When running experiments in live environments, there’s a risk of unintended impact. To provide guardrails and keep your fault injection experiments under control, AWS Fault Injection Service lets you target resources by environment, application, and other dimensions using tags. For example, you could increase CPU utilization on 10% of your instances with the tag “environment”:“prod”. Fault Injection Service also lets you set rules, based on Amazon CloudWatch alarms or other tools, that stop an experiment. For example, an experiment can be set to stop before completion if a web page’s response time rises above an acceptable level.
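The two guardrails described above, tag-based targeting and an alarm-based stop rule, can be sketched as fragments of an experiment template. The field names (`resourceTags`, `selectionMode`, `stopConditions`) follow the FIS experiment template format as we understand it; the function name, target name, and alarm ARN are hypothetical, and the CPU stress action itself is omitted.

```python
def build_guardrails(tag_key: str, tag_value: str, percent: int, alarm_arn: str):
    """Return (targets, stop_conditions) fragments for an FIS experiment template."""
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")

    # Select only a percentage of the EC2 instances carrying the given tag.
    targets = {
        "taggedInstances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {tag_key: tag_value},
            "selectionMode": f"PERCENT({percent})",
        }
    }

    # Stop the experiment when the referenced CloudWatch alarm goes into ALARM,
    # for example an alarm on page response time exceeding a threshold.
    stop_conditions = [{"source": "aws:cloudwatch:alarm", "value": alarm_arn}]
    return targets, stop_conditions


# Hypothetical alarm ARN, for illustration only.
targets, stop_conditions = build_guardrails(
    "environment",
    "prod",
    10,
    "arn:aws:cloudwatch:us-east-1:111122223333:alarm:hypothetical-latency-alarm",
)
print(targets["taggedInstances"]["selectionMode"])
```

Keeping the percentage check in code is a design choice of this sketch rather than an FIS requirement; it simply rejects an out-of-range value before the template is ever submitted.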