
Chaos Engineering with LitmusChaos on Amazon EKS

Introduction

Organizations are embracing microservices-based architectures by refactoring large monolith applications into smaller, independent, and loosely coupled services. These independent services are faster to deploy and scale, enabling organizations to innovate and deliver faster.

However, as applications grow, these microservices present their own challenges. For example, as you deploy tens, hundreds, or even thousands of microservices, operational tasks such as distributed tracing, debugging, testing, and dependency mapping become challenging. In a microservices-based architecture, a failure caused by network latency, disk failure, server failure, or a downstream service error can render an application unusable.

Despite this, testing for system-level failure scenarios is often overlooked, and some organizations find it hard to make such testing part of their software development life cycle. To address these challenges, organizations are increasingly practicing Chaos Engineering to test the reliability and performance of distributed systems.

According to Principles of Chaos Engineering, “Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” Chaos Engineering takes a deliberate approach of injecting failure in a controlled environment using well-planned experiments and helps engineers find weaknesses in systems before they become an outage.

Practicing Chaos Engineering

The idea of Chaos Engineering is not to break things but to identify and understand systemic weaknesses. This is achieved by running controlled chaos experiments. According to Principles of Chaos Engineering, a chaos experiment should:

  1. Define a measurable steady state of the system that indicates normal behavior as the baseline (a probe sketch follows this list).
  2. Develop a hypothesis that this state will continue in both the control group and the experimental group.
  3. Introduce scenarios that reflect real-world events, such as server crashes, network latencies, hardware failures, and so on.
  4. Attempt to invalidate the hypothesis by noting differences in behavior between control and experimental groups after chaos is introduced.
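
For example, the steady state from step 1 can be expressed as a measurable probe rather than a vague expectation. The snippet below is a minimal sketch in bash, assuming a hypothetical application endpoint in APP_URL; a success rate that drops below your baseline after chaos is introduced would invalidate the hypothesis:

# Hypothetical steady-state probe: measure the HTTP success rate
# of an application endpoint over 100 requests.
APP_URL="http://my-app.example.com/healthz" # assumption: replace with your endpoint

success=0
for i in $(seq 1 100); do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$APP_URL")
  [ "$code" = "200" ] && success=$((success + 1))
done

echo "Success rate: ${success}%" # compare against the steady-state baseline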

LitmusChaos Architecture

LitmusChaos is a cloud-native Chaos Engineering framework for Kubernetes. It is built using the Kubernetes Operator framework. A Kubernetes Operator is a software extension to Kubernetes that makes use of custom resource definitions (CRDs) to manage applications and their components.

The Litmus Chaos Operator helps reconcile the state of ChaosEngine, a custom resource that holds the chaos intent specified by a developer or DevOps engineer against a particular stateless or stateful Kubernetes deployment. The operator performs specific actions upon creation of the ChaosEngine, its primary resource. The operator also defines a secondary resource (the engine runner pod), which it creates and manages in order to implement the reconcile functions.

Litmus takes a cloud-native approach to create, manage, and monitor chaos. Chaos is orchestrated using the following Kubernetes CRDs:

  • ChaosEngine: A resource to link a Kubernetes application or Kubernetes node to a ChaosExperiment. ChaosEngine is watched by the Litmus ChaosOperator, which then invokes ChaosExperiments (a minimal ChaosEngine example follows this list).
  • ChaosExperiment: A resource to group the configuration parameters of a chaos experiment. ChaosExperiment CRs are created by the operator when experiments are invoked by ChaosEngine.
  • ChaosResult: A resource to hold the results of a ChaosExperiment.
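
To make these relationships concrete, here is a minimal ChaosEngine sketch. It is illustrative only: the target labels, the namespace, and the litmus-admin service account are assumptions, and the referenced pod-delete ChaosExperiment must already be installed in the cluster:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  engineState: "active"
  # Link the engine to the target application by namespace, label, and kind
  appinfo:
    appns: default
    applabel: app=nginx
    appkind: deployment
  # Service account with permissions to run the experiment (assumption)
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Desired chaos duration in seconds
            - name: TOTAL_CHAOS_DURATION
              value: "30"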

Getting started

We will create an Amazon EKS cluster with managed nodes. We’ll then install LitmusChaos and a demo application. Then, we will install chaos experiments to be run on the demo application and observe the behavior.

Create EKS cluster

You will need the following to complete the tutorial (a quick version check follows the list):

  • AWS CLI
  • eksctl
  • kubectl
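
If these tools are already installed, the following commands should all succeed:

aws --version
eksctl version
kubectl version --client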

Create a new EKS cluster using eksctl:

export ACCOUNT_ID=$(aws sts get-caller-identity --output text --query Account)
export AWS_REGION=us-east-1 #change as per your region of choice

cat <<EOF > cluster.yaml
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks-litmus-demo
  region: ${AWS_REGION}
  version: "1.21"
managedNodeGroups:
  - instanceType: m5.large
    amiFamily: AmazonLinux2
    name: eks-litmus-demo-ng
    desiredCapacity: 2
    minSize: 2
    maxSize: 4
EOF

eksctl create cluster -f cluster.yaml
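
Cluster creation takes several minutes. Once eksctl finishes (it updates your kubeconfig automatically), verify that the two managed nodes are in Ready status:

kubectl get nodes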

Install Helm

curl -sSL https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash

Verify Helm installation using the command below and confirm that you are using Helm version v3.X:

helm version --short

Install LitmusChaos

Let’s install LitmusChaos on an Amazon EKS cluster using a Helm chart. The Helm chart will install the needed CRDs, service account configuration, and ChaosCenter.

Add the Litmus Helm repository using the command below:

helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/

Confirm that you have the Litmus-related Helm charts:

helm search repo litmuschaos

The output should list the Litmus-related charts, such as litmuschaos/litmus.

Create a namespace to install LitmusChaos.

kubectl create ns litmus

By default, the Litmus Helm chart creates NodePort services. Let's change the backend service type to ClusterIP and the front-end service type to LoadBalancer so that we can access the Litmus ChaosCenter using a load balancer.

cat <<EOF > override-litmus.yaml
portal:
  server:
    service:
      type: ClusterIP
  frontend:
    service:
      type: LoadBalancer
EOF

helm install chaos litmuschaos/litmus --namespace=litmus -f override-litmus.yaml

Verify that LitmusChaos is running:

kubectl get pods -n litmus

You should see the Litmus frontend, server, and database pods in Running status. Next, check the services in the litmus namespace and capture the load balancer endpoint of the frontend service:

kubectl get svc -n litmus

export LITMUS_FRONTEND_SERVICE=$(kubectl get svc chaos-litmus-frontend-service -n litmus --output jsonpath='{.status.loadBalancer.ingress[0].hostname}:{.spec.ports[0].port}')

echo "Litmus ChaosCenter is available at http://$LITMUS_FRONTEND_SERVICE"

The output should look like below:

➜ echo "Litmus ChaosCenter is available at http://$LITMUS_FRONTEND_SERVICE"
Litmus ChaosCenter is available at http://xxxxxxxxxxxxxx-xxxxxx7948.us-east-1.elb.amazonaws.com:9091

Access Litmus ChaosCenter UI using the URL given above and sign in using the default username “admin” and password “litmus.”

After successful sign-in, you should see the welcome dashboard. Click on the ChaosAgents link from the left-hand navigation.

A ChaosAgent represents the target cluster where Chaos would be injected via Litmus. Confirm that Self-Agent is in Active status. Note: It may take a couple of minutes for the Self-Agent to become active.

Confirm the agent installation by running the command below.

kubectl get pods -n litmus

The output should show the Self-Agent pods, such as the chaos operator, subscriber, and workflow controller, in Running status.

Verify that LitmusChaos CRDs are created:

kubectl get crds | grep chaos

The response should list the chaosengines, chaosexperiments, and chaosresults CRDs.

Verify that LitmusChaos API resources are created:

kubectl api-resources | grep chaos

The response should include the ChaosEngine, ChaosExperiment, and ChaosResult API resources.

Now that we have installed LitmusChaos on the EKS cluster, let's install a demo application to perform some chaos experiments on.

Install demo application

Let’s deploy nginx on our cluster using the manifest below so that we can run our chaos experiments against it. Save the manifest as nginx.yaml and apply it.

cat <<EOF > nginx.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 500m
            memory: 512Mi
EOF
kubectl apply -f nginx.yaml

Verify that the nginx pod is running by executing the command below.

kubectl get pods

Chaos Experiments

Litmus ChaosHub is a public repository where LitmusChaos community members publish their chaos experiments, such as pod-delete, node-drain, node-cpu-hog, and so on. In this demo walkthrough, we will perform the pod-autoscaler experiment from Litmus ChaosHub to test cluster auto scaling on an Amazon EKS cluster.

Experiment: Pod Autoscaler

The intent of the pod-autoscaler experiment is to check whether nodes can accommodate an increased number of replicas for a deployment. The experiment can also be used to check the cluster auto-scaling feature.

Hypothesis: Amazon EKS cluster should auto scale when cluster capacity is insufficient to run the pods.

A chaos experiment can be launched from the Litmus ChaosCenter UI by creating a workflow. Navigate to Litmus ChaosCenter, select Litmus Workflows in the left-hand navigation, and then select the Schedule a workflow button to create a workflow.

Select the Self-Agent radio button on the Schedule a new Litmus workflow page and select Next.

Choose Create a new workflow using the experiments from ChaosHubs and leave the Litmus ChaosHub selected from the dropdown.

Enter a name for your workflow on the next screen.

Let’s add the experiments in the next step. Select Add a new experiment; then search for autoscaler and select the generic/pod-autoscaler radio button.

Let’s edit the experiment and change some parameters. Choose the Edit icon:

Accept the default values in the General, Target Application, and Define the steady state for this application sections. In the Tune Experiment section, set the TOTAL_CHAOS_DURATION to 180 and REPLICA_COUNT to 10. TOTAL_CHAOS_DURATION sets the desired chaos duration in seconds and REPLICA_COUNT is the number of replicas to scale during the experiment. Select Finish.
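
Under the hood, these UI tunables map to environment variables in the experiment's spec. The relevant fragment of the generated manifest should look roughly like the sketch below (illustrative, not the full workflow definition):

experiments:
  - name: pod-autoscaler
    spec:
      components:
        env:
          # Desired chaos duration in seconds
          - name: TOTAL_CHAOS_DURATION
            value: "180"
          # Number of replicas to scale the deployment to
          - name: REPLICA_COUNT
            value: "10"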

Then, choose Next and accept the defaults for reliability score and schedule the experiment to run now. Finally, select Finish to run the chaos experiment.

The chaos experiment is now scheduled to run, and you can watch its status by clicking on the workflow.

From the ChaosResults, you will see that the experiment failed because there was no capacity in the cluster to run 10 replicas.
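
You can also inspect the verdict from the command line. For workflows run by the Self-Agent, ChaosResult resources are created in the litmus namespace; the exact resource name, derived from the engine and experiment names, will vary:

kubectl get chaosresults -n litmus

# Describe a specific result to see the verdict and the failed step
# (<engine-name> is a placeholder; copy the name from the previous output)
kubectl describe chaosresult <engine-name>-pod-autoscaler -n litmus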

Install Cluster Autoscaler

Cluster Autoscaler for AWS provides integration with Auto Scaling groups. Cluster Autoscaler will attempt to determine the CPU, memory, and GPU resources provided by an Auto Scaling group based on the instance type specified in its launch configuration or launch template.

Create an IAM OIDC identity provider for your cluster with the following command.

eksctl utils associate-iam-oidc-provider --cluster eks-litmus-demo --approve

Create an IAM policy and role

Create an IAM policy that grants the permissions the Cluster Autoscaler requires. You will attach it to an IAM role in the next step.

cat <<EOF > cluster-autoscaler-policy.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeLaunchConfigurations",
                "autoscaling:DescribeTags",
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup",
                "ec2:DescribeLaunchTemplateVersions"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}
EOF

aws iam create-policy \
    --policy-name AmazonEKSClusterAutoscalerPolicy \
    --policy-document file://cluster-autoscaler-policy.json

Create an IAM role and attach an IAM policy to it using eksctl.

eksctl create iamserviceaccount \
    --cluster=eks-litmus-demo \
    --namespace=kube-system \
    --name=cluster-autoscaler \
    --attach-policy-arn="arn:aws:iam::$ACCOUNT_ID:policy/AmazonEKSClusterAutoscalerPolicy" \
    --override-existing-serviceaccounts \
    --approve

Make sure your service account is annotated with the ARN of the IAM role.

kubectl describe sa cluster-autoscaler -n kube-system
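
The output should include an eks.amazonaws.com/role-arn annotation referencing the role that eksctl created, along these lines (the exact role name will differ):

Name:        cluster-autoscaler
Namespace:   kube-system
Annotations: eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/eksctl-eks-litmus-demo-addon-iamserviceaccount-...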

Deploy the Cluster Autoscaler

Download the Cluster Autoscaler manifest.

curl -o cluster-autoscaler-autodiscover.yaml https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

Edit the downloaded file to replace <YOUR CLUSTER NAME> with the cluster name (eks-litmus-demo) and add the following two lines.

- --balance-similar-node-groups
- --skip-nodes-with-system-pods=false

The edited section should look like the following:

command:
  - ./cluster-autoscaler
  - --v=4
  - --stderrthreshold=info
  - --cloud-provider=aws
  - --skip-nodes-with-local-storage=false
  - --expander=least-waste
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/eks-litmus-demo
  - --balance-similar-node-groups
  - --skip-nodes-with-system-pods=false
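
If you prefer to script the cluster-name substitution instead of editing the file by hand, a sed one-liner like the following should work (the two extra flags still need to be added manually):

sed -i.bak 's/<YOUR CLUSTER NAME>/eks-litmus-demo/g' cluster-autoscaler-autodiscover.yaml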

Apply the manifest file to the cluster.

kubectl apply -f cluster-autoscaler-autodiscover.yaml

Patch the deployment to add the cluster-autoscaler.kubernetes.io/safe-to-evict annotation to the Cluster Autoscaler pods with the following command.

kubectl patch deployment cluster-autoscaler \
-n kube-system \
-p '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"}}}}}'

Find the latest Cluster Autoscaler version that matches the Kubernetes major and minor versions of your cluster. For example, if the Kubernetes version of your cluster is 1.21, find the latest Cluster Autoscaler release that begins with 1.21. Record the semantic version number (1.21.n) for that release to use in the next step.

export K8S_VERSION=$(kubectl version --short | grep 'Server Version:' | sed 's/[^0-9.]*\([0-9.]*\).*/\1/' | cut -d. -f1,2)
export AUTOSCALER_VERSION=$(curl -s "https://api.github.com/repos/kubernetes/autoscaler/releases" | grep '"tag_name":' | grep -m1 ${K8S_VERSION} | sed 's/[^0-9.]*\([0-9.]*\).*/\1/')
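
Confirm that a matching release was found before updating the image:

echo "Kubernetes: ${K8S_VERSION}, Cluster Autoscaler: ${AUTOSCALER_VERSION}"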

Set the Cluster Autoscaler image tag to the version that was exported in the previous step with the following command.

kubectl set image deployment cluster-autoscaler \
-n kube-system \
cluster-autoscaler=registry.k8s.io/autoscaling/cluster-autoscaler:${AUTOSCALER_VERSION}
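
You can wait for the new image to roll out before checking the logs:

kubectl -n kube-system rollout status deployment/cluster-autoscaler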

After you have deployed the Cluster Autoscaler, you can view the logs and verify that it’s monitoring your cluster load.

View your Cluster Autoscaler logs with the following command.

kubectl -n kube-system logs -f deployment.apps/cluster-autoscaler

Now that we have deployed the Cluster Autoscaler, let’s rerun the same experiment by navigating to Litmus Workflows, then the Schedules tab. Select the three dots menu icon for the workflow and select Rerun Schedule.

This time, the Cluster Autoscaler will add additional nodes to the cluster, and the experiment will pass, which proves our hypothesis.

Experiment Conclusion

Scaling the pods triggered the Cluster Autoscaler because of insufficient cluster capacity; a new node was added to the cluster, and the pods were successfully provisioned.

Next steps

In the walkthrough above, we saw how to get started with Chaos Engineering using LitmusChaos on an Amazon EKS cluster. There are additional experiments, such as pod-delete, node-drain, and node-cpu-hog, that you can integrate with a CI/CD pipeline to perform Chaos Engineering. LitmusChaos also supports GitOps and advanced chaos workflows using Chaos Workflows.

Praveen Nerellapalli

Praveen Nerellapalli is a Senior Solutions Architect at Amazon Web Services (AWS). Praveen's areas of interest include containers, serverless technologies and DevSecOps. He is based out of Charlotte, North Carolina and occasionally tweets from @pnerellapalli.