Containers
AWS Fault Injection Simulator supports chaos engineering experiments on Amazon EKS Pods
Introduction
Chaos engineering is the discipline of verifying the resilience of your application architecture to identify unforeseen risks, address weaknesses, and ultimately improve confidence in the reliability of your application. In this blog, we demonstrate how to automate running chaos engineering experiments using the new features in AWS Fault Injection Simulator (AWS FIS) to target Amazon Elastic Kubernetes Service (Amazon EKS) Pods.
AWS Fault Injection Simulator
AWS FIS is a fully managed service for running chaos engineering experiments to test and verify the resilience of your applications. FIS gives you the ability to inject faults into the underlying compute, network, and storage resources. This includes stopping Amazon Elastic Compute Cloud (Amazon EC2) instances, adding latency to network traffic, and pausing I/O operations on Amazon Elastic Block Store (Amazon EBS) volumes.
FIS supports a range of AWS services including Amazon EKS. In addition to individual AWS services, FIS also injects Amazon EC2 control plane level faults (e.g., API errors and API throttling) to test a wide array of failure scenarios to build confidence in the reliability of your application.
New FIS EKS Pod actions
FIS has added seven new fault injection actions that target EKS Pods. The new EKS Pod actions give you specific control to inject faults into EKS Pods without requiring installation of any agents, extra libraries, or orchestration software.
With the new actions, you can evaluate how your application performs under load by applying CPU, memory, or I/O stress to targeted Pods and containers. A variety of network failures can be injected, including adding latency to network traffic and dropping all or a percentage of network packets. Pods can also be terminated to evaluate application resiliency. The new actions are listed in the following table:
| Action Identifier | Description |
| --- | --- |
| aws:eks:pod-cpu-stress | Adds CPU stress to one or more Pods. Configurable parameters include the duration of the action, the target CPU load percentage, and the number of CPU stressors. |
| aws:eks:pod-delete | Terminates one or more Pods. Configurable parameters include the grace period in seconds for the Pod to shut down. |
| aws:eks:pod-io-stress | Adds I/O stress to one or more Pods. Configurable parameters include the duration of the action, the percentage of free space on the file system to use during the action, and the number of mixed I/O stressors. |
| aws:eks:pod-memory-stress | Adds memory stress to one or more Pods. Configurable parameters include the duration of the action, the percentage of memory to use during the action, and the number of memory stressors. |
| aws:eks:pod-network-blackhole-port | Drops network traffic for the specified protocol and port. Configurable parameters include the duration of the action, the protocol (TCP or UDP), the port number, and the traffic type (ingress or egress). |
| aws:eks:pod-network-latency | Adds latency to network traffic from a list of sources. Configurable parameters include the duration of the action, the network latency, jitter, the network interface, and the list of network sources. |
| aws:eks:pod-network-packet-loss | Drops a percentage of network packets from a list of sources. Configurable parameters include the duration of the action, the percentage of packet loss, the network interface, and the list of sources. |
All actions allow you to choose the target EKS cluster and namespace. You target the Pods using either a label selector or by the name of the Deployment or Pod. Optionally, you can configure the action to target a specific Availability Zone.
Architecture overview
The previous architecture diagram describes in detail the steps that take place when you start an experiment using the new EKS Pod actions:
1. FIS calls AWS Identity and Access Management (IAM) to assume the IAM role configured in the experiment template.
2. FIS calls the EKS cluster's Kubernetes API to create the FIS Experiment Pod in the target namespace.
3. The FIS Experiment Pod runs the experiment inside the target EKS cluster, coordinates with the FIS service during the experiment, and reports the status and any errors.
4. The FIS Experiment Pod is assigned the FIS Service Account, which has Kubernetes Role Based Access Control (RBAC) permissions scoped to the target namespace to execute the fault action. For every action except aws:eks:pod-delete, the FIS Experiment Pod creates an ephemeral container in the target Pod.
5. The ephemeral container performs the fault action, such as adding CPU stress, in the target Pod.
When the experiment is over, the ephemeral container is stopped and FIS terminates the FIS Experiment Pod. During each step of the fault experiment, the FIS Experiment Pod reports back to the service and can optionally send logs to Amazon Simple Storage Service (Amazon S3) or Amazon CloudWatch Logs.
The aws:eks:pod-delete action doesn't require an ephemeral container. When the fault is injected, the FIS Experiment Pod calls the Kubernetes API to terminate the target Pod, and Step 5 isn't required.
Principles of chaos engineering
In the walkthrough, we use the steps outlined in the Principles of Chaos Engineering to perform the experiment:
- Define steady state
- Form a hypothesis
- Run the experiment
- Review findings
Defining steady state for an application is often tied to a business metric. In this post, we'll measure the client latency of a sample application. Our hypothesis is that our application can handle high CPU load without adversely affecting client latency. We'll run an experiment using the new FIS EKS Pod action aws:eks:pod-cpu-stress to inject CPU load into the application. Then, we'll review our findings to see if our hypothesis was disproved.
Walkthrough
Let's walk through how to set up and configure FIS to run a chaos engineering experiment on EKS:
- Install a sample application
- Configure FIS permissions
- a. Create the FIS experiment AWS IAM role
- b. Configure FIS EKS authentication
- c. Configure FIS EKS authorization
- Create a FIS experiment template
- Perform the chaos engineering experiment using FIS
Prerequisites
You need the following to complete the steps in this post:
- An Amazon EKS cluster with Metrics Server installed
- The cluster needs at least three CPU cores of capacity available (i.e., three medium worker nodes)
- If you are familiar with Terraform, you can use Amazon EKS Blueprints for Terraform
- If you are familiar with CDK, you can use Amazon EKS Blueprints for CDK
- eksctl
- AWS Command Line Interface (AWS CLI) version 2
- kubectl
- Apache Bench (ab)
- Note: ab is installed on macOS by default
- To install on Amazon Linux, you can run yum install httpd-tools
- To install on Ubuntu, you can run apt-get install apache2-utils
- To install on Windows, you can download the binary here
Before we get started, set a few environment variables that are specific to your environment. Replace <AWS_REGION> and <CLUSTER_NAME> with your own values below.
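For example:

```bash
# Replace the placeholders with your own Region and cluster name
export AWS_REGION=<AWS_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
```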
1. Install sample application
We chose Apache Tomcat as the sample application for this post. The Kubernetes manifest file below configures each Pod with a CPU request of one CPU core. It also includes a Kubernetes Service of type LoadBalancer to expose the sample application to the Internet. This lets you test the application's performance from your local machine during the chaos engineering experiment.
Run the following command in your terminal to install the sample application in the blog namespace.
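The original manifest isn't reproduced here, but a minimal sketch along these lines matches the description above. The image tag, replica count, and the Deployment and Service names are assumptions; later commands in this post assume the Service is named tomcat.

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: blog
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tomcat
  namespace: blog
spec:
  replicas: 3                    # three Pods, one CPU core requested each
  selector:
    matchLabels:
      app: tomcat
  template:
    metadata:
      labels:
        app: tomcat              # label used later by the FIS target selector
    spec:
      containers:
        - name: tomcat
          image: tomcat:10.1     # illustrative image tag
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  name: tomcat
  namespace: blog
spec:
  type: LoadBalancer             # exposes the sample application to the Internet
  selector:
    app: tomcat
  ports:
    - port: 80
      targetPort: 8080
EOF
```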
You can validate that all the Tomcat Pods are running with kubectl -n blog get pod. If you have any Pending Pods, you have exhausted the CPU capacity of your cluster. You can increase the number of worker nodes to add CPU capacity to the cluster.
2. Configure FIS permissions
FIS requires permissions to run chaos engineering experiments in your EKS cluster. This includes a combination of IAM permissions and Kubernetes RBAC permissions:
a. Create the FIS experiment IAM role
The FIS experiment IAM role grants FIS permission to query the target EKS cluster and write logs to CloudWatch Logs. Run the following commands to create the role.
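A rough sketch of what those commands could look like follows. The role name fis-eks-experiment-role and the inline policy are illustrative; the exact permissions should come from the FIS documentation for the EKS Pod actions.

```bash
# Trust policy that lets the FIS service assume the role
cat > fis-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "fis.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# "fis-eks-experiment-role" is an illustrative name
aws iam create-role \
  --role-name fis-eks-experiment-role \
  --assume-role-policy-document file://fis-trust-policy.json

# Illustrative inline policy: describe the target cluster and allow FIS to
# deliver experiment logs to CloudWatch Logs; scope this down for production
cat > fis-experiment-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["eks:DescribeCluster"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogDelivery",
        "logs:PutResourcePolicy",
        "logs:DescribeResourcePolicies",
        "logs:DescribeLogGroups"
      ],
      "Resource": "*"
    }
  ]
}
EOF

aws iam put-role-policy \
  --role-name fis-eks-experiment-role \
  --policy-name fis-eks-experiment-policy \
  --policy-document file://fis-experiment-policy.json
```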
b. Configure FIS EKS authentication
The FIS experiment IAM role also serves as the Kubernetes identity of the FIS service. In EKS, this mapping of an IAM role to a Kubernetes identity is configured in the aws-auth ConfigMap. Run the following command to map the FIS experiment IAM role to the Kubernetes user named fis-experiment.
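One way to create this mapping is with eksctl, assuming the illustrative role name from the previous step:

```bash
# Look up the ARN of the role created in step 2a
FIS_ROLE_ARN=$(aws iam get-role \
  --role-name fis-eks-experiment-role \
  --query 'Role.Arn' --output text)

# Map the role to the Kubernetes user "fis-experiment" in the aws-auth ConfigMap
eksctl create iamidentitymapping \
  --cluster $CLUSTER_NAME \
  --region $AWS_REGION \
  --arn $FIS_ROLE_ARN \
  --username fis-experiment
```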
c. Configure FIS EKS authorization
Kubernetes RBAC assigns permissions to the fis-experiment user that enable the FIS service to create the FIS Experiment Pod when the experiment starts. Permissions are also assigned to a Service Account that the FIS Experiment Pod uses to run commands in the Kubernetes cluster during the experiment.
Run the commands below to configure Kubernetes RBAC for the FIS experiment in the blog namespace.
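A simplified sketch of the RBAC objects is shown below. The Service Account and Role names are illustrative and the rule set is an approximation; follow the FIS documentation for the authoritative permissions.

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fis-experiment-sa
  namespace: blog
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: fis-experiment-role
  namespace: blog
rules:
  # Approximate permissions the FIS Experiment Pod needs in the target namespace
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "get", "list", "watch", "delete", "deletecollection"]
  - apiGroups: [""]
    resources: ["pods/ephemeralcontainers"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["configmaps", "serviceaccounts"]
    verbs: ["get", "list"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: fis-experiment-binding
  namespace: blog
subjects:
  - kind: User
    name: fis-experiment
    apiGroup: rbac.authorization.k8s.io
  - kind: ServiceAccount
    name: fis-experiment-sa
    namespace: blog
roleRef:
  kind: Role
  name: fis-experiment-role
  apiGroup: rbac.authorization.k8s.io
EOF
```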
3. Create a FIS experiment template
A FIS experiment template defines actions and targets. The experiment template for this post is configured as follows:
- CPU Stress Action for 10 minutes with a load percentage of 80%
- EKS Pod Target with a label selector of app=tomcat to target our sample application
- Logging configured to send to CloudWatch Logs and the fis-eks-blog log group
- An example stop condition (in production you should set a stop condition based on your business requirements)
Run the following commands to create the CloudWatch Logs log group, the CloudWatch alarm, and the FIS experiment template.
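The following sketch shows one way these resources could be created. The resource names, the placeholder alarm, and the template parameter values are illustrative; check them against the FIS EKS Pod action reference before use.

```bash
# Log group used by the experiment template's log configuration
aws logs create-log-group --log-group-name fis-eks-blog

# Placeholder CloudWatch alarm used only as an example stop condition
aws cloudwatch put-metric-alarm \
  --alarm-name fis-eks-blog-stop-condition \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --statistic Average \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 99 \
  --comparison-operator GreaterThanThreshold

# Gather the values the template references
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
CLUSTER_ARN=$(aws eks describe-cluster --name $CLUSTER_NAME \
  --query 'cluster.arn' --output text)
FIS_ROLE_ARN=$(aws iam get-role --role-name fis-eks-experiment-role \
  --query 'Role.Arn' --output text)
ALARM_ARN=$(aws cloudwatch describe-alarms \
  --alarm-names fis-eks-blog-stop-condition \
  --query 'MetricAlarms[0].AlarmArn' --output text)

# Illustrative template: 10-minute CPU stress at 80% on Pods labeled app=tomcat.
# Parameter names are a sketch; confirm them in the FIS action documentation.
cat > experiment-template.json <<EOF
{
  "description": "CPU stress on the Tomcat sample application",
  "roleArn": "$FIS_ROLE_ARN",
  "stopConditions": [
    { "source": "aws:cloudwatch:alarm", "value": "$ALARM_ARN" }
  ],
  "targets": {
    "tomcat-pods": {
      "resourceType": "aws:eks:pod",
      "parameters": {
        "clusterIdentifier": "$CLUSTER_ARN",
        "namespace": "blog",
        "selectorType": "labelSelector",
        "selectorValue": "app=tomcat"
      },
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "cpu-stress": {
      "actionId": "aws:eks:pod-cpu-stress",
      "parameters": {
        "duration": "PT10M",
        "percent": "80",
        "kubernetesServiceAccount": "fis-experiment-sa"
      },
      "targets": { "Pods": "tomcat-pods" }
    }
  },
  "logConfiguration": {
    "logSchemaVersion": 2,
    "cloudWatchLogsConfiguration": {
      "logGroupArn": "arn:aws:logs:$AWS_REGION:$ACCOUNT_ID:log-group:fis-eks-blog"
    }
  },
  "tags": { "Name": "fis-eks-blog" }
}
EOF

aws fis create-experiment-template --cli-input-json file://experiment-template.json
```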
4. Perform the chaos engineering experiment using FIS
With the sample application installed, permissions configured, and the experiment template created, we follow the four steps from the principles of chaos engineering to perform the experiment.
a. Define steady state
The first step in performing a chaos engineering experiment is to define a measure of the system that represents its steady state, the normal behavior of our application. In our example, we'll use client latency to access the sample application as our measure of steady state. Run the commands below to measure latency using ab.
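Assuming the Service name from the sample manifest sketch earlier, the commands could look like this:

```bash
# External hostname of the sample application's LoadBalancer Service
# (may take a few minutes to become available after the Service is created)
TOMCAT_URL=http://$(kubectl -n blog get svc tomcat \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')/

# 1,000 requests from 10 concurrent clients
ab -n 1000 -c 10 $TOMCAT_URL
```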
An example of the command output is included below. You’ll have different latency numbers depending on a number of factors, including the physical distance between the client and your chosen AWS Region.
In our environment, the p95 latency is 117 ms. This means that 95% of client requests are served within 117 ms. We’ll use that as our steady state in this experiment.
b. Form a hypothesis
For this experiment, we’ll hypothesize that as we perform a CPU stress test of the application, our p95 latency stays below 150 ms.
We choose 150 ms because it is commonly used as a limit for acceptable web application performance. Latency above 150 ms is considered not acceptable for purposes of this post.
c. Run the experiment
When you are ready to start the experiment, run the commands below.
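For example, looking up the template created earlier by its Name tag (an assumption carried over from the template sketch above):

```bash
# Find the experiment template by its Name tag
EXPERIMENT_TEMPLATE_ID=$(aws fis list-experiment-templates \
  --query "experimentTemplates[?tags.Name=='fis-eks-blog'].id" \
  --output text)

# Start the chaos engineering experiment
aws fis start-experiment --experiment-template-id $EXPERIMENT_TEMPLATE_ID
```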
To verify the experiment is running, let’s view the CPU load of the Tomcat Pods.
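With the Metrics Server from the prerequisites installed, kubectl top reports per-Pod CPU usage:

```bash
# Per-Pod CPU and memory usage in the blog namespace
kubectl -n blog top pod
```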
Some example output is included below.
As you can see, there is a heavy CPU load on the Tomcat Pods, which indicates that the experiment has begun. If you want to inspect the ephemeral container that's adding CPU stress to a Pod, you can run kubectl -n blog get pod <pod-name> -o yaml to see the full Pod configuration.
Now let’s measure the latency of the sample application while it’s under load from the experiment.
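This is the same ab measurement used to establish the steady state, using the URL variable set earlier:

```bash
# Same measurement as before, now while the CPU stress action is running
ab -n 1000 -c 10 $TOMCAT_URL
```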
Example output is included below. Your numbers will be different.
d. Review findings
The p95 latency measurement during the experiment is 191 ms, which disproves our hypothesis from step 4b that latency wouldn't exceed our maximum acceptable limit of 150 ms. The measured latency is significantly higher than our hypothesis predicted and indicates a weakness in our architecture.
An important part of reviewing the findings of a chaos engineering experiment is looking for ways to improve the resilience of the workload design. In our example, one possible improvement is to introduce dynamic workload scaling using Kubernetes Horizontal Pod Autoscaling (HPA). HPA can scale up the number of Pods in response to CPU load, which can result in lower overall latency for clients.
Our next step in this process would be to make improvements, like implementing HPA. We could then run the experiment again until we can no longer disprove our hypothesis. This would increase our confidence in the resilience of our application’s architecture.
Cleaning up
The FIS Experiment Pod and the ephemeral containers created during the experiment were already terminated by FIS when the experiment finished. To clean up the environment, you can remove the sample application, IAM identity, and Kubernetes RBAC permissions we created earlier. Run the following commands to clean up:
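A sketch of the cleanup, using the illustrative names from the earlier steps:

```bash
# Sample application, RBAC objects, and namespace
kubectl delete namespace blog

# aws-auth mapping for the experiment role
eksctl delete iamidentitymapping \
  --cluster $CLUSTER_NAME \
  --region $AWS_REGION \
  --arn $FIS_ROLE_ARN

# FIS experiment template, stop-condition alarm, log group, and IAM role
aws fis delete-experiment-template --id $EXPERIMENT_TEMPLATE_ID
aws cloudwatch delete-alarms --alarm-names fis-eks-blog-stop-condition
aws logs delete-log-group --log-group-name fis-eks-blog
aws iam delete-role-policy --role-name fis-eks-experiment-role \
  --policy-name fis-eks-experiment-policy
aws iam delete-role --role-name fis-eks-experiment-role
```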
Conclusion
In this post, we reviewed the seven new Pod actions and walked you through the process of running a chaos engineering experiment with AWS FIS using the new CPU stress action. With the release of these new Pod actions, customers can now use AWS FIS to run chaos engineering experiments that target workloads on Amazon EKS.
Regularly running chaos engineering experiments to verify the resilience of your system is a recommended best practice for operating business-critical workloads. It is important to develop a practice of continually experimenting on your system to identify weaknesses in your architecture before they impact the availability of your business operations. To learn more, please review the Test resiliency using chaos engineering section of the AWS Well-Architected Framework's Reliability Pillar.