Chaos Engineering with LitmusChaos on Amazon EKS
Introduction
Organizations are embracing microservices-based architectures by refactoring large monolith applications into smaller, independent, and loosely coupled services. These independent services are faster to deploy and scale, enabling organizations to innovate and deliver faster.
However, as the application grows, these microservices present their own challenges. For example, as you deploy tens, hundreds, or thousands of microservices, operational tasks such as distributed tracing, debugging, testing, dependency mapping, and so on become challenging. In a microservices-based architecture, a failure caused by network latency, disk failure, server failure, a downstream service error, and so on could render an application unusable.
Despite this, testing for system-level failure scenarios is often overlooked, and some organizations find it hard to make such testing part of their software development life cycle. To address these challenges, organizations are increasingly practicing Chaos Engineering to test the reliability and performance of distributed systems.
According to Principles of Chaos Engineering, “Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” Chaos Engineering takes a deliberate approach of injecting failure in a controlled environment using well-planned experiments and helps engineers find weaknesses in systems before they become an outage.
Practicing Chaos Engineering
The idea of Chaos Engineering is not to break things but to identify and understand systemic weaknesses. It can be achieved by doing controlled chaos experiments. According to Principles of Chaos Engineering, a chaos experiment should:
- Define a measurable steady state of the system that indicates normal behavior as the baseline.
- Develop a hypothesis that this state will continue in both the control group and the experimental group.
- Introduce scenarios that reflect real-world events, such as server crashes, network latencies, hardware failures, and so on.
- Attempt to invalidate the hypothesis by noting differences in behavior between control and experimental groups after chaos is introduced.
LitmusChaos Architecture
LitmusChaos is a cloud-native Chaos Engineering framework for Kubernetes. It is built using the Kubernetes Operator framework. A Kubernetes Operator is a software extension to Kubernetes that makes use of custom resource definitions (CRDs) to manage applications and their components.
The Litmus Chaos Operator reconciles the state of the ChaosEngine, a custom resource that holds the chaos intent specified by a developer or DevOps engineer against a particular stateless or stateful Kubernetes deployment. The operator performs specific actions upon creation of the ChaosEngine, its primary resource. It also defines a secondary resource (the engine runner pod), which it creates and manages in order to implement the reconcile functions.
Litmus takes a cloud-native approach to create, manage, and monitor chaos. Chaos is orchestrated using the following Kubernetes CRDs:
- ChaosEngine: A resource to link a Kubernetes application or Kubernetes node to a ChaosExperiment. ChaosEngine is watched by the Litmus ChaosOperator, which then invokes ChaosExperiments.
- ChaosExperiment: A resource to group the configuration parameters of a chaos experiment. ChaosExperiment CRs are created by the operator when experiments are invoked by ChaosEngine.
- ChaosResult: A resource to hold the results of a ChaosExperiment.
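To make these relationships concrete, below is a minimal ChaosEngine sketch. The names, namespace, target label, and chaosServiceAccount are illustrative, and it assumes the referenced pod-delete ChaosExperiment and its RBAC are already installed in the target namespace:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos            # illustrative name
  namespace: default
spec:
  engineState: "active"        # set to "stop" to halt chaos
  appinfo:
    appns: "default"           # namespace of the target application
    applabel: "app=nginx"      # label selector of the target deployment
    appkind: "deployment"
  chaosServiceAccount: pod-delete-sa   # assumed service account with experiment permissions
  experiments:
    - name: pod-delete         # references an installed ChaosExperiment CR
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"

When such a ChaosEngine is applied, the Chaos Operator launches the runner pod, executes the referenced experiment against the selected workload, and records the verdict in a ChaosResult.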
Getting started
We will create an Amazon EKS cluster with managed nodes. We’ll then install LitmusChaos and a demo application. Then, we will install chaos experiments to be run on the demo application and observe the behavior.
Create EKS cluster
You will need the AWS CLI, eksctl, and kubectl installed and configured to complete the tutorial.
Create a new EKS cluster using eksctl:
export ACCOUNT_ID=$(aws sts get-caller-identity --output text --query Account)
export AWS_REGION=us-east-1 #change as per your region of choice
cat <<EOF > cluster.yaml
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: eks-litmus-demo
  region: ${AWS_REGION}
  version: "1.21"
managedNodeGroups:
  - instanceType: m5.large
    amiFamily: AmazonLinux2
    name: eks-litmus-demo-ng
    desiredCapacity: 2
    minSize: 2
    maxSize: 4
EOF
eksctl create cluster -f cluster.yaml
Install Helm
curl -sSL https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash
Verify Helm installation using the command below and confirm that you are using Helm version v3.X:
helm version --short
Install LitmusChaos
Let’s install LitmusChaos on an Amazon EKS cluster using a Helm chart. The Helm chart will install the needed CRDs, service account configuration, and ChaosCenter.
Add the Litmus Helm repository using the command below:
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
Confirm that you have the Litmus-related Helm charts:
helm search repo litmuschaos
The output should list the Litmus charts, including litmuschaos/litmus, which we will install next.
Create a namespace to install LitmusChaos.
kubectl create ns litmus
By default, Litmus Helm chart creates NodePort services. Let’s change the backend service type to ClusterIP and front-end service type to LoadBalancer, so we can access the Litmus ChaosCenter using a load balancer.
cat <<EOF > override-litmus.yaml
portal:
  server:
    service:
      type: ClusterIP
  frontend:
    service:
      type: LoadBalancer
EOF
helm install chaos litmuschaos/litmus --namespace=litmus -f override-litmus.yaml
Verify that LitmusChaos is running:
kubectl get pods -n litmus
You should see the Litmus pods in Running state. Next, check the services created in the litmus namespace and construct the ChaosCenter URL from the frontend LoadBalancer endpoint:
kubectl get svc -n litmus
export LITMUS_FRONTEND_SERVICE=`kubectl get svc chaos-litmus-frontend-service -n litmus --output jsonpath='{.status.loadBalancer.ingress[0].hostname}:{.spec.ports[0].port}'`
echo "Litmus ChaosCenter is available at http://$LITMUS_FRONTEND_SERVICE"
The output should look like below:
$ echo "Litmus ChaosCenter is available at http://$LITMUS_FRONTEND_SERVICE"
Litmus ChaosCenter is available at http://xxxxxxxxxxxxxx-xxxxxx7948.us-east-1.elb.amazonaws.com:9091
Access Litmus ChaosCenter UI using the URL given above and sign in using the default username “admin” and password “litmus.”
After successful sign-in, you should see the welcome dashboard. Click on the ChaosAgents link from the left-hand navigation.
A ChaosAgent represents the target cluster where Chaos would be injected via Litmus. Confirm that Self-Agent is in Active status. Note: It may take a couple of minutes for the Self-Agent to become active.
Confirm the agent installation by running the command below.
kubectl get pods -n litmus
The output should now show the agent pods in Running state alongside the Litmus pods.
Verify that LitmusChaos CRDs are created:
kubectl get crds | grep chaos
You should see a response similar to the one below showing chaosengines, chaosexperiments, and chaosresults.
Verify that LitmusChaos API resources are created:
kubectl api-resources | grep chaos
You should see the chaos-related API resources, such as chaosengines, chaosexperiments, and chaosresults.
Now that we installed LitmusChaos on the EKS cluster, let’s install a demo application to perform some chaos experiments on.
Install demo application
Let’s deploy nginx on our cluster using the manifest below to run our chaos experiments on it. Save the manifest as nginx.yaml and apply it.
cat <<EOF > nginx.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - image: nginx
          name: nginx
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 500m
              memory: 512Mi
EOF
kubectl apply -f nginx.yaml
Verify that the nginx pod is running by executing the command below.
kubectl get pods
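If you are scripting these steps, you can also block until the pod reports ready before moving on. This is a small sketch that relies on the app=nginx label from the manifest above:

# Wait up to two minutes for the nginx pod to become Ready
kubectl wait --for=condition=Ready pod -l app=nginx --timeout=120s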
Chaos Experiments
Litmus ChaosHub is a public repository where LitmusChaos community members publish their chaos experiments, such as pod-delete, node-drain, node-cpu-hog, and so on. In this demo walkthrough, we will perform the pod-autoscaler experiment from Litmus ChaosHub to test cluster auto scaling on an Amazon EKS cluster.
Experiment: Pod Autoscaler
The intent of the pod auto scaler experiment is to check the ability of nodes to accommodate the number of replicas for a deployment. Additionally, the experiment can also be used to check the cluster auto-scaling feature.
Hypothesis: Amazon EKS cluster should auto scale when cluster capacity is insufficient to run the pods.
A chaos experiment can be launched from the Litmus ChaosCenter UI by creating a workflow. Navigate to Litmus ChaosCenter, select Litmus Workflows in the left-hand navigation, and then select the Schedule a workflow button to create a workflow.
Select the Self-Agent radio button on the Schedule a new Litmus workflow page and select Next.
Choose Create a new workflow using the experiments from ChaosHubs and leave the Litmus ChaosHub selected from the dropdown.
Enter a name for your workflow on the next screen.
Let’s add the experiments in the next step. Select Add a new experiment; then search for autoscaler and select the generic/pod-autoscaler radio button.
Let’s edit the experiment and change some parameters. Choose the Edit icon:
Accept the default values in the General, Target Application, and Define the steady state for this application sections. In the Tune Experiment section, set the TOTAL_CHAOS_DURATION to 180 and REPLICA_COUNT to 10. TOTAL_CHAOS_DURATION sets the desired chaos duration in seconds and REPLICA_COUNT is the number of replicas to scale during the experiment. Select Finish.
Then, choose Next and accept the defaults for reliability score and schedule the experiment to run now. Finally, select Finish to run the chaos experiment.
The chaos experiment is now scheduled to run and you can look at the status by clicking on the workflow.
From the ChaosResults, you will see that the experiment failed because there was no capacity in the cluster to run 10 replicas.
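In addition to the ChaosCenter UI, the outcome can be inspected from the CLI. The commands below are a sketch, assuming the Self-Agent runs the experiment in the litmus namespace:

# List the ChaosResults created by the experiment run
kubectl get chaosresults -n litmus

# Inspect the verdict (Pass/Fail) and failure details of a specific result
kubectl describe chaosresult <chaosresult-name> -n litmus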
Install Cluster Autoscaler
Cluster Autoscaler for AWS provides integration with Auto Scaling groups. Cluster Autoscaler will attempt to determine the CPU, memory, and GPU resources provided by an Auto Scaling group based on the instance type specified in its launch configuration or launch template.
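Auto-discovery (used later via the --node-group-auto-discovery flag) relies on the node group’s Auto Scaling group carrying the cluster-autoscaler tags; Auto Scaling groups created for EKS managed node groups are normally tagged automatically. As an optional sanity check, the AWS CLI sketch below lists groups carrying those tags:

# List Auto Scaling groups tagged for Cluster Autoscaler auto-discovery
aws autoscaling describe-tags \
    --filters "Name=key,Values=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/eks-litmus-demo" \
    --query "Tags[].{AutoScalingGroup:ResourceId,Tag:Key}" \
    --output table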
Create an IAM OIDC identity provider for your cluster with the following command.
eksctl utils associate-iam-oidc-provider --cluster eks-litmus-demo --approve
Create an IAM policy and role
Create an IAM policy that grants the permissions that the Cluster Autoscaler requires to use an IAM role.
cat <<EOF > cluster-autoscaler-policy.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeLaunchConfigurations",
                "autoscaling:DescribeTags",
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup",
                "ec2:DescribeLaunchTemplateVersions"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}
EOF
aws iam create-policy \
--policy-name AmazonEKSClusterAutoscalerPolicy \
--policy-document file://cluster-autoscaler-policy.json
Create an IAM role and attach an IAM policy to it using eksctl.
eksctl create iamserviceaccount \
--cluster=eks-litmus-demo \
--namespace=kube-system \
--name=cluster-autoscaler \
--attach-policy-arn="arn:aws:iam::$ACCOUNT_ID:policy/AmazonEKSClusterAutoscalerPolicy" \
--override-existing-serviceaccounts \
--approve
Make sure your service account with the ARN of the IAM role is annotated.
kubectl describe sa cluster-autoscaler -n kube-system
Deploy the Cluster Autoscaler
Download the Cluster Autoscaler manifest.
curl -o cluster-autoscaler-autodiscover.yaml https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
Edit the downloaded file to replace <YOUR CLUSTER NAME> with the cluster name (eks-litmus-demo) and add the following two lines.
- --balance-similar-node-groups
- --skip-nodes-with-system-pods=false
The edited section should look like the following:
command:
- ./cluster-autoscaler
- --v=4
- --stderrthreshold=info
- --cloud-provider=aws
- --skip-nodes-with-local-storage=false
- --expander=least-waste
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/eks-litmus-demo
- --balance-similar-node-groups
- --skip-nodes-with-system-pods=false
Apply the manifest file to the cluster.
kubectl apply -f cluster-autoscaler-autodiscover.yaml
Patch the deployment to add the cluster-autoscaler.kubernetes.io/safe-to-evict annotation to the Cluster Autoscaler pods with the following command.
kubectl patch deployment cluster-autoscaler \
-n kube-system \
-p '{"spec":{"template":{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"}}}}}'
Find the latest Cluster Autoscaler version that matches the Kubernetes major and minor versions of your cluster. For example, if the Kubernetes version of your cluster is 1.21, find the latest Cluster Autoscaler release that begins with 1.21. Record the semantic version number (1.21.n) for that release to use in the next step.
export K8S_VERSION=$(kubectl version --short | grep 'Server Version:' | sed 's/[^0-9.]*\([0-9.]*\).*/\1/' | cut -d. -f1,2)
export AUTOSCALER_VERSION=$(curl -s "https://api.github.com/repos/kubernetes/autoscaler/releases" | grep '"tag_name":' | grep -m1 ${K8S_VERSION} | sed 's/[^0-9.]*\([0-9.]*\).*/\1/')
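Optionally, confirm that a version was resolved before updating the image; an empty value would indicate that the lookup did not match your cluster’s Kubernetes version:

# Sanity check the detected versions before patching the deployment
echo "Kubernetes version: ${K8S_VERSION}, Cluster Autoscaler version: ${AUTOSCALER_VERSION}"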
Set the Cluster Autoscaler image tag to the version that was exported in the previous step with the following command.
kubectl set image deployment cluster-autoscaler \
-n kube-system \
cluster-autoscaler=registry.k8s.io/autoscaling/cluster-autoscaler:${AUTOSCALER_VERSION}
After you have deployed the Cluster Autoscaler, you can view the logs and verify that it’s monitoring your cluster load.
View your Cluster Autoscaler logs with the following command.
kubectl -n kube-system logs -f deployment.apps/cluster-autoscaler
Now that we have deployed the Cluster Autoscaler, let’s rerun the same experiment by navigating to Litmus Workflows, then the Schedules tab. Select the three dots menu icon for the workflow and select Rerun Schedule.
This time, the Cluster Autoscaler will add additional nodes to the cluster, and the experiment will pass, which proves our hypothesis.
Experiment Conclusion
Scaling up the pods triggered the Cluster Autoscaler as a result of insufficient capacity; a new node was added to the cluster, and the pods were successfully provisioned.
Next steps
From the above walkthrough, we saw how to get started with Chaos Engineering using LitmusChaos on an Amazon EKS cluster. There are additional experiments, such as pod-delete, node-drain, node-cpu-hog, and so on, that you can integrate with a CI/CD pipeline to perform Chaos Engineering. LitmusChaos also supports GitOps and advanced chaos orchestration using Chaos Workflows.