AWS Cloud Operations Blog

Scaling AWS Fault Injection Service Across Your Organization Using Account Controls

AWS Fault Injection Service (FIS) empowers you to adopt chaos engineering at scale within your AWS environment. Chaos engineering injects real-world, controlled failures into a system to verify resilience and reliability, ultimately improving the customer experience. This proactive, resilience-focused approach increases your confidence in a system’s ability to respond to adverse conditions in production. You can use AWS FIS experiments to inject controlled failures, such as an Availability Zone (AZ) power interruption or regional connectivity interruption, to learn how your application responds to disruptive events.

When injecting network faults as part of an AZ power interruption or Region isolation experiment, you need permissions to make temporary changes to the network during the experiment, such as creating or deleting network ACLs (for example, ec2:CreateNetworkAcl). This can be challenging because AWS customers typically follow a centralized networking model, in which network accounts and network services are owned and operated by a dedicated networking team. That model may prevent you from running network actions in your environments.

In this three-part series, you will learn how to define safety guardrails via Service Control Policies (SCPs) and AWS Identity and Access Management (IAM) permissions that enable your application teams to run FIS experiments in a controlled way without compromising network integrity, whether in a single AWS account, across multiple AWS accounts, or across multiple Regions.

Scaling AWS FIS in AWS Organizations

AWS Organizations helps you organize your AWS accounts into a hierarchical structure for centralized management using Organizational Units (OUs). SCPs are organization policies that set the maximum available permissions for IAM users and roles within your AWS organization; they do not grant permissions on their own. Within each member account, IAM controls access to AWS FIS and its specific components.

Diagram A shows a basic example of an AWS Organizations structure that includes a root OU, a Sandbox OU, and a Workloads OU containing AWS accounts with running workloads.

Diagram A: A basic AWS Organizations structure with the hierarchy of Service Control Policies and the owners referenced in this blog post.


Preparing for AWS FIS Experiments

To inject controlled fault actions into your environment, you begin by creating an experiment. Experiments are defined using experiment templates that specify the fault actions to be injected and can be reused for continuous experimentation. These experiments should always be run with safety in mind. This is why FIS includes safety mechanisms such as safety levers, stop conditions, and permission controls to manage the experiment’s potential impact. You must assign an IAM role to each experiment to grant the permissions required for executing the fault actions defined in your template. To scale AWS FIS experiments across your AWS organization, we recommend standardizing the IAM roles used for creating experiment templates and running experiments based on their scope of impact.

Standardizing AWS FIS Access

For the purpose of this blog post, we will use the personas in Diagram A, which have the following responsibilities:

  • Account Admin: The Account Admin is responsible for creating, managing and assigning SCPs at the OU level. The SCPs allow-list or deny-list certain permissions to be granted in AWS accounts.
  • IAM Admin: The IAM Admin is responsible for creating IAM roles and IAM policies for resource access in AWS accounts. The IAM policies must stay within the permissions that the SCPs allow.

The Account Admin creates one SCP that allows FIS in specific accounts and another SCP that permits the actions required for specific FIS experiments. To adhere to the principle of least privilege, the SCPs should permit only the actions required to carry out the experiment. The SCP should be applied at the OU level for the accounts where FIS is approved. The IAM Admin creates one role for creating experiment templates and another role for specific experiments in FIS-approved accounts. Both roles can use AWS managed FIS policies or custom policies. The two roles can be standardized across your AWS organization for easier management. For the purpose of this blog series, we will refer to the following roles:

  • AWS-FIS-Experiment-Orchestrator role: Only allowed to create, delete, or update FIS experiment templates. This role is assumed by end users who are working with the AWS FIS service.
  • AWS-FIS-Experiment-Executor role: Only allowed to execute certain actions required to inject faults into your workload and is assigned directly to experiments. This role should only be assumed by the AWS FIS service.
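The requirement that only the AWS FIS service assumes the AWS-FIS-Experiment-Executor role is typically expressed in the role's trust policy. The following is a minimal sketch rather than the policy from this series: the account ID 111122223333 is a placeholder, and the aws:SourceAccount condition is an optional hardening assumption that limits the role to experiments in that account.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowOnlyFisToAssumeRole",
      "Effect": "Allow",
      "Principal": {
        "Service": "fis.amazonaws.com"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "111122223333"
        }
      }
    }
  ]
}
```

Because no IAM user or other role appears as a principal, end users cannot assume this role directly; they work through the AWS-FIS-Experiment-Orchestrator role instead.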

The SCPs in place at the OU level act as guardrails that prevent the AWS-FIS-Experiment-Executor role from being misused and ensure that it makes only the network-level changes needed for experimentation. Because the role has permissions to change or create resources in AWS accounts, it should only be assumed by AWS FIS. By establishing guardrails via SCPs, your organization can securely adopt chaos engineering. Let’s dive into a real-world scenario, the centralized networking model, to show how standardizing AWS FIS roles and guardrails leads to a successful FIS experiment.

Real-world Scenario

A centralized networking model helps simplify connection management, provides a single traffic inspection point, and reduces management overhead for easier scalability. With this centralized networking approach, we will manage AWS FIS permissions through SCPs. We will use this approach as an example of how you can standardize AWS FIS consumption and scale its implementation across your organization. In the following example, we will demonstrate how to implement the AWS-FIS-Experiment-Executor role. The AWS-FIS-Experiment-Orchestrator role will be used in part two for multi-account strategies.

In the centralized networking model, only network admins are allowed to make network-level changes, which prevents unauthorized changes from outside the networking team. Therefore, if a developer wants to run the aws:network:disrupt-connectivity action against their application to simulate AZ disruptions, the call by FIS may fail because the permission to make those temporary calls to the EC2 API is denied.

The Account Admin should create an SCP at the OU level that denies the required network actions to every principal except the AWS-FIS-Experiment-Executor role, so that only this role can run the needed FIS actions:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "AllowAccessToASpecificRole",
			"Effect": "Deny",
			"Action": [
				"ec2:CreateNetworkAcl",
				"ec2:CreateNetworkAclEntry",
				"ec2:DeleteNetworkAcl",
				"ec2:CreateTags",
				"ec2:DescribeNetworkAcls",
				"ec2:DescribeManagedPrefixLists",
				"ec2:DescribeSubnets",
				"ec2:DescribeVpcs",
				"ec2:ReplaceNetworkAclAssociation",
				"ec2:GetManagedPrefixListEntries"
			],
			"Resource": "*",
			"Condition": {
				"ArnNotEquals": {
					"aws:PrincipalArn": [
						"arn:aws:iam::*:role/AWS-FIS-Experiment-Executor"
					]
				}
			}
		}
	]
}
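Because an SCP only sets the maximum available permissions, the AWS-FIS-Experiment-Executor role still needs an IAM policy that grants these actions in each member account. The following is a simplified sketch that mirrors the action list in the SCP above; the Sid is illustrative, and a production policy would normally narrow the resources and add tag conditions, as the AWS managed policy AWSFaultInjectionSimulatorNetworkAccess does.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowNetworkDisruptionActions",
      "Effect": "Allow",
      "Action": [
        "ec2:CreateNetworkAcl",
        "ec2:CreateNetworkAclEntry",
        "ec2:DeleteNetworkAcl",
        "ec2:CreateTags",
        "ec2:DescribeNetworkAcls",
        "ec2:DescribeManagedPrefixLists",
        "ec2:DescribeSubnets",
        "ec2:DescribeVpcs",
        "ec2:ReplaceNetworkAclAssociation",
        "ec2:GetManagedPrefixListEntries"
      ],
      "Resource": "*"
    }
  ]
}
```

A call succeeds only when both layers agree: the IAM policy on the role allows the action, and no SCP in the account's OU hierarchy denies it for that principal.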

By implementing standardized AWS FIS roles and Service Control Policies, organizations can create a secure and controlled environment for chaos engineering experiments. These SCPs act as guardrails, ensuring that network-level changes can only be made by specific, authorized roles through the FIS service, thereby preventing unauthorized modifications. Visit our GitHub repo for details.

Now that you have created the AWS-FIS-Experiment-Executor role, account owners can run network disruption experiments using the aws:network:disrupt-connectivity action. This action lets you simulate network connectivity issues by temporarily blocking traffic between subnets or Availability Zones. For detailed steps and best practices on running network disruption experiments, see Network Disruption Actions in the AWS FIS documentation.
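As an illustration, an experiment template that uses this action and the standardized role might look like the following. This is a sketch under stated assumptions, not a template from this series: the account ID 111122223333, the FisTarget tag used to select subnets, the five-minute duration, and the target and action names are all placeholders, and the empty stop condition should be replaced with a CloudWatch alarm for real workloads.

```json
{
  "description": "Disrupt connectivity for tagged subnets for five minutes",
  "targets": {
    "tagged-subnets": {
      "resourceType": "aws:ec2:subnet",
      "resourceTags": {
        "FisTarget": "true"
      },
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "disrupt-connectivity": {
      "actionId": "aws:network:disrupt-connectivity",
      "parameters": {
        "duration": "PT5M",
        "scope": "all"
      },
      "targets": {
        "Subnets": "tagged-subnets"
      }
    }
  },
  "stopConditions": [
    {
      "source": "none"
    }
  ],
  "roleArn": "arn:aws:iam::111122223333:role/AWS-FIS-Experiment-Executor"
}
```

The roleArn field is where the standardized executor role is attached, so the same template shape can be reused across accounts by changing only the account ID.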

Conclusion

In this blog post, you learned how to establish SCPs to scale AWS FIS and the adoption of chaos engineering in your AWS organization. Standardizing your AWS FIS IAM roles allows teams throughout your organization to adopt chaos engineering quickly. As your organization continues to grow, the standardization of roles will be crucial to successfully controlling experimentation and testing of your workload’s resiliency.

In part two of the series, we will expand upon the topic of role standardization and offer best practices for conducting multi-account AWS FIS experiments.

About the authors

Dylan Reed

Dylan Reed is a Solutions Architect at AWS. He has a passion for helping customers build resilient, secure and innovative solutions on AWS. In his current role, he works across industries to help solve complex business challenges through AWS services. Outside of work he enjoys traveling and playing whatever sports he can.

Isael Pimentel

Isael Pimentel is an Enterprise Support Lead and Chaos Engineering SME at AWS with over 15 years of experience in developing and managing complex infrastructures, IT Transformation, Resilience, and Security. He also holds several certifications including AWS Solution Architect, AWS Network Specialty, AWS Security Specialty, MSCA, and CCNA.

Venkata Moparthi

Venkata Moparthi is a Senior Solutions Architect who specializes in cloud migrations, generative AI, and secure architecture for financial services and other industries. He combines technical expertise with customer-focused strategies to accelerate digital transformation and drive business outcomes through optimized cloud solutions.

Jason Brown

Jason Brown is a Senior Technical Account Manager at AWS, where he serves as a subject matter expert in Resilience, Disaster Recovery, and Chaos Engineering. With over 10 years of diverse technical experience, he has developed a passion for building resilient systems and helping customers define and scale their resilience practices through a comprehensive people, process, and technology approach.

Satish Kumar

Satish is a Sr. Technical Account Manager at AWS and a member of the Resilience TFC focusing on Chaos Engineering. Over the past 25 years, he has worked in different roles, leading teams in software development, consulting, and IT. His experience in industry verticals like Media & Entertainment, High-Tech, Finance, and now Healthcare & Life Sciences has given him a deep understanding of the various facets of the software industry. In his current role, he helps Healthcare & Life Sciences customers design and operate their platforms resiliently, cost efficiently, and at scale on AWS.