Scaling AWS Fault Injection Service Across Your Organization Using Account Controls

AWS Fault Injection Service (FIS) empowers you to adopt chaos engineering at scale within your AWS environment. Chaos engineering injects real-world, controlled failures into a system to verify resilience and reliability, ultimately improving the customer experience. This proactive, resilience-focused approach increases your confidence in a system’s ability to respond to adverse conditions in production. You can use AWS FIS experiments to inject controlled failures, such as an Availability Zone (AZ) power interruption or regional connectivity interruption, to learn how your application responds to disruptive events.

When injecting network faults as part of an AZ power interruption or Region isolation experiment, you need permissions to make temporary changes to the network during the experiment, such as creating or deleting network ACLs (for example, ec2:CreateNetworkAcl). This can be challenging because AWS customers typically follow a centralized networking model, in which network accounts and network services are owned and operated by a dedicated networking team. That model may prevent you from running network actions in your environments.

In this three-part series, you will learn how to define safety guardrails via Service Control Policies (SCPs) and AWS Identity and Access Management (IAM) permissions that enable your application teams to run FIS experiments in a controlled way without compromising network integrity, whether in a single AWS account, across multiple AWS accounts, or across multiple Regions.

Scaling AWS FIS in AWS Organizations

AWS Organizations helps you organize your AWS accounts into a hierarchical structure for centralized management using Organizational Units (OUs). SCPs are organizational policies that manage the maximum available permissions for IAM users and roles within your AWS organization. Within each member account, IAM controls access to AWS Fault Injection Service (FIS) and its specific components.

Diagram A shows a basic example of an AWS Organizations structure that includes a root OU, a Sandbox OU, and a Workloads OU containing AWS accounts with running workloads.

Diagram A: A basic AWS Organizations structure showing the hierarchy of Service Control Policies and the owners referenced in this post.

Preparing for AWS FIS Experiments

To inject controlled fault actions into your environment, you begin by creating an experiment. Experiments are defined using experiment templates, which specify the fault actions to be injected and can be reused for continuous experimentation. These experiments should always be run with safety in mind. This is why FIS includes safety mechanisms such as safety levers, stop conditions, and permission controls to manage the experiment’s potential impact. You must assign an IAM role to each experiment to grant the permissions required for executing the fault actions defined in your template. To scale AWS Fault Injection Service experiments across the AWS organization, we recommend standardizing the IAM roles used for creating experiment templates and for running experiments, based on their scope of impact.

Standardizing AWS FIS Access

For the purpose of this blog post, we will use the personas in diagram A that have the following responsibilities:

Account Admin: The Account Admin is responsible for creating, managing, and assigning SCPs at the OU level. The SCPs allow-list or deny-list the permissions that can be granted in AWS accounts.
IAM Admin: The IAM Admin is responsible for creating IAM roles and IAM policies for resource access in AWS accounts. The IAM policies must stay within the permissions that the SCPs allow.

The Account Admin creates an SCP that allows FIS in specific accounts and another SCP that permits the actions required for specific FIS experiments. SCPs do not grant permissions themselves; they set the upper bound on what IAM policies in an account can grant. To adhere to the principle of least privilege, the Account Admin needs to ensure that only the permissions required to carry out the experiment are permitted. The SCP should be applied at the OU level for accounts where FIS is approved. The IAM Admin creates one role for creating experiment templates and another role for running specific experiments in FIS-approved accounts. Both roles can use AWS managed policies for FIS or custom policies. The two roles can be standardized across your AWS organization for easier management. For the purpose of this blog series, we will refer to the following roles:

AWS-FIS-Experiment-Orchestrator role: Only allowed to create, delete, or update FIS experiment templates. This role is assumed by end users who are working with the AWS FIS service.
AWS-FIS-Experiment-Executor role: Only allowed to execute certain actions required to inject faults into your workload and is assigned directly to experiments. This role should only be assumed by the AWS FIS service.
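
The two roles described above can be sketched as IAM policy documents. The following is a minimal illustration in Python, not the policies from the series' repository: the function names, the 111122223333 account ID, and the aws:SourceAccount condition are assumptions added for this example. The trust policy lets only the FIS service principal (fis.amazonaws.com) assume the executor role, and the permissions policy limits the orchestrator role to managing experiment templates.

```python
import json

# Service principal that AWS FIS uses when it assumes an experiment role.
FIS_SERVICE_PRINCIPAL = "fis.amazonaws.com"


def executor_trust_policy(account_id: str) -> dict:
    """Trust policy for the AWS-FIS-Experiment-Executor role.

    Only the FIS service may assume the role. The aws:SourceAccount
    condition is an assumption of this sketch; it narrows the trust to
    experiments that run in the given account.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": FIS_SERVICE_PRINCIPAL},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"aws:SourceAccount": account_id}
                },
            }
        ],
    }


def orchestrator_permissions_policy() -> dict:
    """Permissions policy for the AWS-FIS-Experiment-Orchestrator role.

    Limited to creating, updating, and deleting experiment templates.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ManageExperimentTemplates",
                "Effect": "Allow",
                "Action": [
                    "fis:CreateExperimentTemplate",
                    "fis:UpdateExperimentTemplate",
                    "fis:DeleteExperimentTemplate",
                ],
                "Resource": "*",
            }
        ],
    }


# 111122223333 is a placeholder account ID used only for illustration.
print(json.dumps(executor_trust_policy("111122223333"), indent=2))
print(json.dumps(orchestrator_permissions_policy(), indent=2))
```

In practice, a role that creates templates referencing the executor role will likely also need iam:PassRole on that role; that statement is left out here to keep the sketch aligned with the role description above.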

The SCPs in place at the OU level act as guardrails that prevent the AWS-FIS-Experiment-Executor role from being misused and limit it to the network-level changes needed for experimentation. Since the role has permissions to change or create resources in AWS accounts, it should only be assumed by AWS FIS. By establishing guardrails via SCPs, your organization can securely adopt chaos engineering. Let’s dive deep into a real-world scenario, the centralized networking model, to show how standardizing AWS FIS roles and guardrails leads to a successful FIS experiment.

Real-world Scenario

A centralized networking model helps simplify connection management, provides a single traffic inspection point, and reduces management overhead for easier scalability. With this centralized networking approach, we will manage AWS Fault Injection Service permissions through SCPs. We will use this approach as an example of how you can standardize AWS FIS consumption and scale its implementation across your organization. In the following example, we will demonstrate how to implement the AWS-FIS-Experiment-Executor role. The AWS-FIS-Experiment-Orchestrator role will be used in part two for multi-account strategies.

In the centralized networking model, only network admins are allowed to make network-level changes, which prevents unauthorized changes from outside the networking team. Therefore, if a developer wants to run the aws:network:disrupt-connectivity action against their application to simulate AZ disruptions, the call by FIS may fail because the permission to make those temporary calls to the EC2 API is missing or denied.

The Account Admin should create an SCP at the OU level that reserves the network actions FIS needs for the AWS-FIS-Experiment-Executor role by denying them to every other principal:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "AllowAccessToASpecificRole",
			"Effect": "Deny",
			"Action": [
				"ec2:CreateNetworkAcl",
				"ec2:CreateNetworkAclEntry",
				"ec2:DeleteNetworkAcl",
				"ec2:CreateTags",
				"ec2:DescribeNetworkAcls",
				"ec2:DescribeManagedPrefixLists",
				"ec2:DescribeSubnets",
				"ec2:DescribeVpcs",
				"ec2:ReplaceNetworkAclAssociation",
				"ec2:GetManagedPrefixListEntries"
			],
			"Resource": "*",
			"Condition": {
				"ArnNotEquals": {
					"aws:PrincipalArn": [
						"arn:aws:iam::*:role/AWS-FIS-Experiment-Executor"
					]
				}
			}
		}
	]
}
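
To make the effect of the Deny statement with ArnNotEquals concrete, here is a simplified Python sketch of how this one SCP statement evaluates a request. It is an illustration only, not the IAM policy engine: the function names and the sample principal ARNs are hypothetical, and ARN matching is approximated by comparing each colon-delimited segment with a wildcard match.

```python
from fnmatch import fnmatchcase

# The principal pattern exempted from the Deny statement in the SCP above.
EXEMPT_ROLE_PATTERN = "arn:aws:iam::*:role/AWS-FIS-Experiment-Executor"

# Actions listed in the SCP, lowercased because IAM action names are
# matched case-insensitively.
GUARDED_ACTIONS = {
    action.lower()
    for action in [
        "ec2:CreateNetworkAcl",
        "ec2:CreateNetworkAclEntry",
        "ec2:DeleteNetworkAcl",
        "ec2:CreateTags",
        "ec2:DescribeNetworkAcls",
        "ec2:DescribeManagedPrefixLists",
        "ec2:DescribeSubnets",
        "ec2:DescribeVpcs",
        "ec2:ReplaceNetworkAclAssociation",
        "ec2:GetManagedPrefixListEntries",
    ]
}


def arn_matches(pattern: str, arn: str) -> bool:
    """Approximate ARN matching: compare the six colon-delimited segments."""
    pattern_parts = pattern.split(":", 5)
    arn_parts = arn.split(":", 5)
    if len(pattern_parts) != 6 or len(arn_parts) != 6:
        return False
    return all(fnmatchcase(a, p) for p, a in zip(pattern_parts, arn_parts))


def scp_denies(principal_arn: str, action: str) -> bool:
    """Return True if the SCP statement explicitly denies the request."""
    if action.lower() not in GUARDED_ACTIONS:
        return False  # The statement does not cover this action.
    # ArnNotEquals: the Deny applies when the principal does NOT match.
    return not arn_matches(EXEMPT_ROLE_PATTERN, principal_arn)


# Hypothetical principals in a placeholder account.
executor = "arn:aws:iam::111122223333:role/AWS-FIS-Experiment-Executor"
developer = "arn:aws:iam::111122223333:role/Developer"

print(scp_denies(executor, "ec2:CreateNetworkAcl"))   # executor is exempt
print(scp_denies(developer, "ec2:CreateNetworkAcl"))  # everyone else is denied
```

Note that a result of False here only means the SCP does not block the call. The executor role still needs an IAM policy that allows these actions, because SCPs never grant permissions on their own.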

By implementing standardized AWS Fault Injection Service roles and Service Control Policies, organizations can create a secure and controlled environment for chaos engineering experiments. These SCPs act as guardrails, ensuring that network-level changes can only be made by specific, authorized roles through the FIS service, thereby preventing unauthorized modifications. Visit our GitHub repo for details.

Now that you have created the AWS-FIS-Experiment-Executor role, account owners can run network disruption experiments using the aws:network:disrupt-connectivity action. This action lets you simulate network connectivity issues by temporarily blocking traffic between subnets or Availability Zones. For detailed steps and best practices on running network disruption experiments, see Network Disruption Actions in the AWS FIS documentation.
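
As a rough illustration of what such an experiment could look like, the Python sketch below assembles an experiment template body for the aws:network:disrupt-connectivity action. The function name, the FisTarget tag, the target and action names, the account ID, and the alarm ARN are all hypothetical values chosen for this example. The dictionary follows the general shape of the FIS CreateExperimentTemplate request, so check the field names against the FIS API reference before using it.

```python
import json


def disrupt_connectivity_template(
    account_id: str, alarm_arn: str, duration: str = "PT5M"
) -> dict:
    """Build a template body that disrupts connectivity for tagged subnets."""
    return {
        "description": "Disrupt connectivity for tagged application subnets",
        # The standardized executor role that FIS assumes for the experiment.
        "roleArn": (
            f"arn:aws:iam::{account_id}:role/AWS-FIS-Experiment-Executor"
        ),
        "targets": {
            "AppSubnets": {
                "resourceType": "aws:ec2:subnet",
                # Hypothetical tag used to opt subnets in to experiments.
                "resourceTags": {"FisTarget": "true"},
                "selectionMode": "ALL",
            }
        },
        "actions": {
            "DisruptConnectivity": {
                "actionId": "aws:network:disrupt-connectivity",
                # duration is an ISO 8601 duration, e.g. PT5M = five minutes.
                "parameters": {"scope": "all", "duration": duration},
                "targets": {"Subnets": "AppSubnets"},
            }
        },
        # Stop the experiment if the referenced CloudWatch alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }


# Placeholder account ID and alarm ARN, for illustration only.
template = disrupt_connectivity_template(
    "111122223333",
    "arn:aws:cloudwatch:us-east-1:111122223333:alarm:app-error-rate",
)
print(json.dumps(template, indent=2))
```

Selecting targets by tag rather than by subnet ID keeps the template reusable across accounts that follow the same tagging convention, which fits the goal of standardizing experiments across the organization.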

Conclusion

In this post, you learned how to establish SCPs to scale AWS FIS and the adoption of chaos engineering in your AWS organization. Standardizing your AWS FIS IAM roles allows teams throughout your organization to adopt chaos engineering quickly. As your organization continues to grow, this standardization of roles will be crucial to controlling experimentation and resilience testing of your workloads.

In part two of the series, we will expand upon the topic of role standardization and offer best practices for conducting multi-account AWS FIS experiments.

AWS Cloud Operations Blog
