Containers
AWS Fault Injection Simulator supports chaos engineering experiments on Amazon EKS Pods
Introduction
Chaos engineering is the discipline of verifying the resilience of your application architecture to identify unforeseen risks, address weaknesses, and ultimately improve confidence in the reliability of your application. In this blog, we demonstrate how to automate running chaos engineering experiments using the new features in AWS Fault Injection Simulator (AWS FIS) to target Amazon Elastic Kubernetes Service (Amazon EKS) Pods.
AWS Fault Injection Simulator
AWS FIS is a fully managed service for running chaos engineering experiments to test and verify the resilience of your applications. FIS gives you the ability to inject faults into the underlying compute, network, and storage resources. This includes stopping Amazon Elastic Compute Cloud (Amazon EC2) instances, adding latency to network traffic, and pausing I/O operations on Amazon Elastic Block Store (Amazon EBS) volumes.
FIS supports a range of AWS services including Amazon EKS. In addition to individual AWS services, FIS also injects Amazon EC2 control plane level faults (e.g., API errors and API throttling) to test a wide array of failure scenarios to build confidence in the reliability of your application.
New FIS EKS Pod actions
FIS has added seven new fault injection actions that target EKS Pods. The new EKS Pod actions give you specific control to inject faults into EKS Pods without requiring installation of any agents, extra libraries, or orchestration software.
With the new actions, you can evaluate how your application performs under load by applying CPU, memory, or I/O stress to targeted Pods and containers. A variety of network failures can be injected, including adding latency to network traffic and dropping all or a percentage of network packets. Pods can also be terminated to evaluate application resiliency. The new actions are listed in the following table:
| Action Identifier | Description |
| --- | --- |
| aws:eks:pod-cpu-stress | Adds CPU stress to one or more Pods. Configurable parameters include the duration of the action, the target CPU load percentage, and the number of CPU stressors. |
| aws:eks:pod-delete | Terminates one or more Pods. Configurable parameters include the grace period in seconds for the Pod to shut down. |
| aws:eks:pod-io-stress | Adds I/O stress to one or more Pods. Configurable parameters include the duration of the action, the percentage of free space on the file system to use during the action, and the number of mixed I/O stressors. |
| aws:eks:pod-memory-stress | Adds memory stress to one or more Pods. Configurable parameters include the duration of the action, the percentage of memory to use during the action, and the number of memory stressors. |
| aws:eks:pod-network-blackhole-port | Drops network traffic for the specified protocol and port. Configurable parameters include the duration of the action, the protocol (TCP or UDP), the port number, and the traffic type (ingress or egress). |
| aws:eks:pod-network-latency | Adds latency to network traffic from a list of sources. Configurable parameters include the duration of the action, the network latency, jitter, the network interface, and the list of network sources. |
| aws:eks:pod-network-packet-loss | Drops a percentage of network packets from a list of sources. Configurable parameters include the duration of the action, the percentage of packet loss, the network interface, and the list of sources. |
All actions allow you to choose the target EKS cluster and namespace. You target the Pods using either a label selector or by the name of the Deployment or Pod. Optionally, you can configure the action to target a specific Availability Zone.
Architecture overview
The previous architecture diagram describes in detail the steps that take place when you start an experiment using the new EKS Pod actions:
1. FIS calls AWS Identity and Access Management (IAM) to assume the IAM role configured in the experiment template.
2. FIS calls the EKS cluster's Kubernetes API to create the FIS Experiment Pod in the target namespace.
3. The FIS Experiment Pod runs the experiment inside the target EKS cluster, coordinates with the FIS service during the experiment, and reports the status and any errors.
4. The FIS Experiment Pod is assigned the FIS Service Account, which has Kubernetes Role Based Access Control (RBAC) permissions scoped to the target namespace to execute the fault action. For every action except aws:eks:pod-delete, the FIS Experiment Pod creates an ephemeral container in the target Pod.
5. The ephemeral container performs the fault action, such as adding CPU stress, in the target Pod.
When the experiment is over, the ephemeral container is stopped and FIS terminates the FIS Experiment Pod. During each step of the fault experiment, the FIS Experiment Pod reports back to the service and can optionally send logs to Amazon Simple Storage Service (Amazon S3) or Amazon CloudWatch Logs.
The aws:eks:pod-delete action doesn't require an ephemeral container. When the fault is injected, the FIS Experiment Pod calls the Kubernetes API to terminate the target Pod, and Step 5 isn't required.
Principles of chaos engineering
In the walkthrough, we use the steps outlined in the Principles of Chaos Engineering to perform the experiment:
- Define steady state
- Form a hypothesis
- Run the experiment
- Review findings
Defining steady state for an application is often tied to a business metric. In this post, we'll measure the client latency of a sample application. Our hypothesis is that our application can handle high CPU load without adversely affecting client latency. We'll run an experiment using the new FIS EKS Pod action aws:eks:pod-cpu-stress to inject CPU load into the application. Then, we'll review our findings to see if our hypothesis was disproved.
Walkthrough
Let's walk through how to set up and configure FIS to run a chaos engineering experiment on EKS:
- Install a sample application
- Configure FIS permissions
- a. Create the FIS experiment AWS IAM role
- b. Configure FIS EKS authentication
- c. Configure FIS EKS authorization
- Create a FIS experiment template
- Perform the chaos engineering experiment using FIS
Prerequisites
You need the following to complete the steps in this post:
- An Amazon EKS cluster with Metrics Server installed
- The cluster needs at least three CPU cores of capacity available (i.e., three medium worker nodes)
- If you are familiar with Terraform, you can use Amazon EKS Blueprints for Terraform
- If you are familiar with CDK, you can use Amazon EKS Blueprints for CDK
- eksctl
- AWS Command Line Interface (AWS CLI) version 2
- kubectl
- Apache Bench (ab)
- Note: ab is installed on macOS by default
- To install on Amazon Linux, you can run yum install httpd-tools
- To install on Ubuntu, you can run apt-get install apache2-utils
- To install on Windows, you can download the binary here
Before we get started, set a few environment variables that are specific to your environment. Replace <AWS_REGION> and <CLUSTER_NAME> with your own values below.
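For example:

```bash
# Replace the placeholders with your own Region and cluster name
export AWS_REGION=<AWS_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
```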
1. Install sample application
We chose Apache Tomcat as the sample application for this post. The Kubernetes manifest file below configures each Pod with a CPU request of one CPU core. It also includes a Kubernetes Service of type LoadBalancer to expose the sample application to the Internet. This lets you test the application's performance from your local machine during the chaos engineering experiment.
Run the following command in your terminal to install the sample application in the blog namespace.
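The original manifest isn't reproduced here, but a minimal sketch along these lines matches the description above. The image tag, replica count, and the Deployment and Service names are assumptions; later commands in this post assume the Service is named tomcat.

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: blog
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tomcat
  namespace: blog
spec:
  replicas: 3                    # three Pods, one CPU core requested each
  selector:
    matchLabels:
      app: tomcat
  template:
    metadata:
      labels:
        app: tomcat              # label used later by the FIS target selector
    spec:
      containers:
        - name: tomcat
          image: tomcat:10.1     # illustrative image tag
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  name: tomcat
  namespace: blog
spec:
  type: LoadBalancer             # exposes the sample application to the Internet
  selector:
    app: tomcat
  ports:
    - port: 80
      targetPort: 8080
EOF
```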
You can validate that all the Tomcat Pods are running with kubectl -n blog get pod. If you have any Pending Pods, you have exhausted the CPU capacity of your cluster. You can increase the number of worker nodes to add CPU capacity to the cluster.
2. Configure FIS permissions
FIS requires permissions to run chaos engineering experiments in your EKS cluster. This includes a combination of IAM permissions and Kubernetes RBAC permissions:
a. Create the FIS experiment IAM role
The FIS experiment IAM role grants FIS permission to query the target EKS cluster and write logs to CloudWatch Logs. Run the following commands to create the role.
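A rough sketch of what those commands could look like follows. The role name fis-eks-experiment-role and the inline policy are illustrative; the exact permissions should come from the FIS documentation for the EKS Pod actions.

```bash
# Trust policy that lets the FIS service assume the role
cat > fis-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "fis.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# "fis-eks-experiment-role" is an illustrative name
aws iam create-role \
  --role-name fis-eks-experiment-role \
  --assume-role-policy-document file://fis-trust-policy.json

# Illustrative inline policy: describe the target cluster and allow FIS to
# deliver experiment logs to CloudWatch Logs; scope this down for production
cat > fis-experiment-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["eks:DescribeCluster"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogDelivery",
        "logs:PutResourcePolicy",
        "logs:DescribeResourcePolicies",
        "logs:DescribeLogGroups"
      ],
      "Resource": "*"
    }
  ]
}
EOF

aws iam put-role-policy \
  --role-name fis-eks-experiment-role \
  --policy-name fis-eks-experiment-policy \
  --policy-document file://fis-experiment-policy.json
```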
b. Configure FIS EKS authentication
The FIS experiment IAM role also serves as the Kubernetes identity of the FIS service. In EKS, this mapping of an IAM role to a Kubernetes identity is configured in the aws-auth ConfigMap. Run the following command to map the FIS experiment IAM role to the Kubernetes user named fis-experiment.
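One way to create this mapping is with eksctl, assuming the illustrative role name from the previous step:

```bash
# Look up the ARN of the role created in step 2a
FIS_ROLE_ARN=$(aws iam get-role \
  --role-name fis-eks-experiment-role \
  --query 'Role.Arn' --output text)

# Map the role to the Kubernetes user "fis-experiment" in the aws-auth ConfigMap
eksctl create iamidentitymapping \
  --cluster $CLUSTER_NAME \
  --region $AWS_REGION \
  --arn $FIS_ROLE_ARN \
  --username fis-experiment
```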
c. Configure FIS EKS authorization
Kubernetes RBAC assigns permissions to the fis-experiment user that enable the FIS service to create the FIS Experiment Pod when the experiment starts. Permissions are also assigned to a Service Account that the FIS Experiment Pod uses to run commands in the Kubernetes cluster during the experiment.
Run the commands below to configure Kubernetes RBAC for the FIS experiment in the blog namespace.
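A simplified sketch of the RBAC objects is shown below. The Service Account and Role names are illustrative and the rule set is an approximation; follow the FIS documentation for the authoritative permissions.

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fis-experiment-sa
  namespace: blog
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: fis-experiment-role
  namespace: blog
rules:
  # Approximate permissions the FIS Experiment Pod needs in the target namespace
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "list", "watch", "delete", "deletecollection"]
  - apiGroups: [""]
    resources: ["pods/ephemeralcontainers"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["configmaps", "serviceaccounts"]
    verbs: ["get", "list"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: fis-experiment-binding
  namespace: blog
subjects:
  - kind: User
    name: fis-experiment
    apiGroup: rbac.authorization.k8s.io
  - kind: ServiceAccount
    name: fis-experiment-sa
    namespace: blog
roleRef:
  kind: Role
  name: fis-experiment-role
  apiGroup: rbac.authorization.k8s.io
EOF
```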
3. Create a FIS experiment template
A FIS experiment template defines actions and targets. The experiment template for this post is configured as follows:
- CPU Stress Action for 10 minutes with a load percentage of 80%
- EKS Pod Target with a label selector of app=tomcat to target our sample application
- Logging configured to send to CloudWatch Logs and the fis-eks-blog log group
- An example stop condition (in production you should set a stop condition based on your business requirements)
Run the following commands to create the CloudWatch Logs log group, the CloudWatch alarm, and the FIS experiment template.
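The following sketch shows one way these resources could be created. The resource names, the placeholder alarm, and the template parameter values are illustrative; check them against the FIS EKS Pod action reference before use.

```bash
# Log group used by the experiment template's log configuration
aws logs create-log-group --log-group-name fis-eks-blog

# Placeholder CloudWatch alarm used only as an example stop condition
aws cloudwatch put-metric-alarm \
  --alarm-name fis-eks-blog-stop-condition \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --statistic Average \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 99 \
  --comparison-operator GreaterThanThreshold

# Gather the values the template references
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
CLUSTER_ARN=$(aws eks describe-cluster --name $CLUSTER_NAME \
  --query 'cluster.arn' --output text)
FIS_ROLE_ARN=$(aws iam get-role --role-name fis-eks-experiment-role \
  --query 'Role.Arn' --output text)
ALARM_ARN=$(aws cloudwatch describe-alarms \
  --alarm-names fis-eks-blog-stop-condition \
  --query 'MetricAlarms[0].AlarmArn' --output text)

# Illustrative template: 10-minute CPU stress at 80% on Pods labeled app=tomcat.
# Parameter names are a sketch; confirm them in the FIS action documentation.
cat > experiment-template.json <<EOF
{
  "description": "CPU stress on the Tomcat sample application",
  "roleArn": "$FIS_ROLE_ARN",
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm", "value": "$ALARM_ARN" }
  ],
  "targets": {
    "tomcat-pods": {
      "resourceType": "aws:eks:pod",
      "parameters": {
        "clusterIdentifier": "$CLUSTER_ARN",
        "namespace": "blog",
        "selectorType": "labelSelector",
        "selectorValue": "app=tomcat"
      },
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "cpu-stress": {
      "actionId": "aws:eks:pod-cpu-stress",
      "parameters": {
        "duration": "PT10M",
        "percent": "80",
        "kubernetesServiceAccount": "fis-experiment-sa"
      },
      "targets": { "Pods": "tomcat-pods" }
    }
  },
  "logConfiguration": {
    "logSchemaVersion": 2,
    "cloudWatchLogsConfiguration": {
      "logGroupArn": "arn:aws:logs:$AWS_REGION:$ACCOUNT_ID:log-group:fis-eks-blog"
    }
  },
  "tags": { "Name": "fis-eks-blog" }
}
EOF

aws fis create-experiment-template --cli-input-json file://experiment-template.json
```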
4. Perform the chaos engineering experiment using FIS
With the sample application installed, permissions configured, and the experiment template created, we follow the four steps from the principles of chaos engineering to perform the experiment.
a. Define steady state
The first step in performing a chaos engineering experiment is to define a measure of the system that represents its steady state, the normal behavior of our application. In our example, we'll use client latency to access the sample application as our measure of steady state. Run the commands below to measure latency using ab.
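Assuming the Service name from the sample manifest sketch earlier, the commands could look like this:

```bash
# External hostname of the sample application's LoadBalancer Service
# (may take a few minutes to become available after the Service is created)
TOMCAT_URL=http://$(kubectl -n blog get svc tomcat \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')/

# 1,000 requests from 10 concurrent clients
ab -n 1000 -c 10 $TOMCAT_URL
```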
An example of the command output is included below. You’ll have different latency numbers depending on a number of factors, including the physical distance between the client and your chosen AWS Region.
In our environment, the p95 latency is 117 ms. This means that 95% of client requests are served within 117 ms. We’ll use that as our steady state in this experiment.
b. Form a hypothesis
For this experiment, we’ll hypothesize that as we perform a CPU stress test of the application, our p95 latency stays below 150 ms.
We choose 150 ms because it is commonly used as a limit for acceptable web application performance. Latency above 150 ms is considered not acceptable for purposes of this post.
c. Run the experiment
When you are ready to start the experiment, run the commands below.
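For example, looking up the template created earlier by its Name tag (an assumption carried over from the template sketch above):

```bash
# Find the experiment template by its Name tag
EXPERIMENT_TEMPLATE_ID=$(aws fis list-experiment-templates \
  --query "experimentTemplates[?tags.Name=='fis-eks-blog'].id" \
  --output text)

# Start the chaos engineering experiment
aws fis start-experiment --experiment-template-id $EXPERIMENT_TEMPLATE_ID
```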
To verify the experiment is running, let’s view the CPU load of the Tomcat Pods.
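With the Metrics Server from the prerequisites installed, kubectl top reports per-Pod CPU usage:

```bash
# Per-Pod CPU and memory usage in the blog namespace
kubectl -n blog top pod
```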
Some example output is included below.
As you can see, there is a heavy CPU load on the Tomcat Pods, which indicates that the experiment has begun. If you want to inspect the ephemeral container that's adding CPU stress to a Pod, you can run kubectl -n blog get pod <pod-name> -o yaml to see the full Pod configuration.
Now let’s measure the latency of the sample application while it’s under load from the experiment.
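This is the same ab measurement used to establish the steady state, using the URL variable set earlier:

```bash
# Same measurement as before, now while the CPU stress action is running
ab -n 1000 -c 10 $TOMCAT_URL
```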
Example output is included below. Your numbers will be different.
d. Review findings
The p95 latency measurement during the experiment is 191 ms, which disproves our hypothesis from step 4b that latency wouldn't exceed our maximum acceptable limit of 150 ms. The measured latency is significantly higher than our hypothesis predicted and indicates a weakness in our architecture.
An important part of reviewing the findings of a chaos engineering experiment is looking for ways to improve the resilience of the workload design. In our example, one possible improvement is to introduce dynamic workload scaling using Kubernetes Horizontal Pod Autoscaling (HPA). HPA can scale up the number of Pods in response to CPU load, which can result in lower overall latency for clients.
Our next step in this process would be to make improvements, like implementing HPA. We could then run the experiment again until we can no longer disprove our hypothesis. This would increase our confidence in the resilience of our application’s architecture.
Cleaning up
The FIS Experiment Pod and the ephemeral containers created during the experiment were already terminated by FIS when the experiment finished. To clean up the environment, you can remove the sample application, IAM identity, and Kubernetes RBAC permissions we created earlier. Run the following commands to clean up:
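A sketch of the cleanup, using the illustrative names from the earlier steps:

```bash
# Sample application, RBAC objects, and namespace
kubectl delete namespace blog

# aws-auth mapping for the experiment role
eksctl delete iamidentitymapping \
  --cluster $CLUSTER_NAME \
  --region $AWS_REGION \
  --arn $FIS_ROLE_ARN

# FIS experiment template, stop-condition alarm, log group, and IAM role
aws fis delete-experiment-template --id $EXPERIMENT_TEMPLATE_ID
aws cloudwatch delete-alarms --alarm-names fis-eks-blog-stop-condition
aws logs delete-log-group --log-group-name fis-eks-blog
aws iam delete-role-policy --role-name fis-eks-experiment-role \
  --policy-name fis-eks-experiment-policy
aws iam delete-role --role-name fis-eks-experiment-role
```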
Conclusion
In this post, we reviewed the seven new Pod actions and walked you through the process of running a chaos engineering experiment with AWS FIS using the new CPU stress action. With the release of these new Pod actions, customers can now use AWS FIS to run chaos engineering experiments that target workloads on Amazon EKS.
Regularly running chaos engineering experiments to verify the resilience of your system is a recommended best practice for operating business-critical workloads. It is important to develop a practice of continually experimenting on your system to identify weaknesses in your architecture before they impact the availability of your business operations. To learn more, please review the Test resiliency using chaos engineering section of the AWS Well-Architected Framework's Reliability Pillar.