AWS Architecture Blog

Simulating Kubernetes-workload AZ failures with AWS Fault Injection Simulator

In highly distributed systems, it is crucial to ensure that applications function correctly even during infrastructure failures. One common infrastructure failure scenario is when an entire Availability Zone (AZ) becomes unavailable. Applications are often deployed across multiple AZs to ensure high availability and fault tolerance in cloud environments such as Amazon Web Services (AWS).

Kubernetes helps manage and deploy applications across multiple nodes and AZs, but it can be difficult to test how your applications will behave during an AZ failure. This is where fault injection comes in. AWS Fault Injection Simulator (AWS FIS) is a service that intentionally injects faults or failures into a system to test its resilience. In this blog post, we explore how to use AWS FIS to simulate an AZ failure for Kubernetes workloads.

Solution overview

To ensure that Kubernetes cluster workloads are architected to handle failures, you must test their resilience by simulating real-world failure scenarios. Kubernetes allows you to deploy workloads across multiple AZs to handle failures, but it’s still important to test how your system behaves during AZ failures. To do this, we use a product details microservice, run it with autoscaling managed by Karpenter, and test how the system responds to varying traffic levels.

This blog post explores a load test that mimics the behavior of hundreds of users accessing the service concurrently, to simulate a realistic failure scenario. The test uses AWS FIS to disrupt network connectivity and simulate an AZ failure in a controlled manner. We will use Karpenter as the autoscaler. Karpenter automatically adjusts the size of a cluster based on the resource requirements of the running workloads.

Figure 1 demonstrates the solution’s architecture.

Figure 1. Architecture flow for microservices to simulate a realistic failure scenario

Prerequisites

Install the following utilities on a Linux-based host machine, which can be an Amazon Elastic Compute Cloud (Amazon EC2) instance, an AWS Cloud9 instance, or a local machine with access to your AWS account (these are the tools used by the commands in this post):

  • AWS Command Line Interface (AWS CLI)
  • Node.js and npm
  • AWS CDK Toolkit
  • kubectl
  • Helm
  • git
  • jq

Setting up a microservice environment

This blog post consists of two major parts: Bootstrap and experiment. The bootstrap section provides step-by-step instructions for:

  • Creating and deploying a sample microservice
  • Creating an AWS IAM role for the FIS service
  • Creating an FIS experiment template

By following these bootstrap instructions, you can set up your own environment to test how autoscaling performs in Kubernetes during failures.

In the experiment section, we showcase how the system behaves with Karpenter.

Let’s start by setting a few environment variables using the following code:

export FIS_ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
export FIS_AWS_REGION=us-west-2
export FIS_CLUSTER_NAME="fis-simulation-cluster"

Next, clone the sample repository, which contains the code for this solution:

git clone https://github.com/aws-samples/containers-blog-maelstrom.git
cd ./containers-blog-maelstrom/fis-simulation-blog

Step 1. Bootstrap the environment

This solution uses Amazon EKS Blueprints for AWS Cloud Development Kit (AWS CDK) to provision our Amazon EKS cluster.

The first step of any AWS CDK deployment is bootstrapping the environment. cdk bootstrap is an AWS CDK Toolkit command that prepares an environment (a combination of an AWS account and AWS Region) with the resources AWS CDK requires to perform deployments into that environment.

Run the following commands to install the Node.js dependencies and bootstrap your environment for deploying the solution:

npm install
cdk bootstrap aws://$FIS_ACCOUNT_ID/$FIS_AWS_REGION

We’ll use Amazon EKS Blueprints for CDK to create an Amazon EKS cluster and deploy add-ons. This stack deploys the following add-ons into the cluster:

  • AWS Load Balancer Controller
  • AWS VPC CNI
  • CoreDNS
  • kube-proxy

Step 2. Create an Amazon EKS cluster

Run the following commands to deploy the Amazon EKS cluster:

npm install
cdk deploy "*" --require-approval never

Deployment takes approximately 20 to 30 minutes, after which you will have a fully functioning Amazon EKS cluster in your account. The end of the CDK output looks similar to the following:

fis-simulation-cluster

Deployment time: 1378.09s

Copy and run the aws eks update-kubeconfig ... command from the output section to gain access to your Amazon EKS cluster using kubectl.
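If you want to confirm that the cluster and the add-ons from Step 1 are healthy before proceeding, a quick check with kubectl is shown below (the deployment and daemonset names assume the EKS Blueprints defaults in the kube-system namespace):

# Confirm worker nodes have registered with the cluster
kubectl get nodes -o wide

# Check the add-on deployments and daemonsets in kube-system
kubectl get deployment -n kube-system aws-load-balancer-controller coredns
kubectl get daemonset -n kube-system aws-node kube-proxy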

Step 3. Deploy a microservice to Amazon EKS

Use the code from the following GitHub repository and deploy it using Helm:

git clone https://github.com/aws-containers/eks-app-mesh-polyglot-demo.git
helm install workshop eks-app-mesh-polyglot-demo/workshop/helm-chart/

Note: You are not restricted to this sample application; if you have your own microservices, you can use those instead. This command deploys the following three microservices (see the verification command after the list):

  1. Frontend-node as the UI to the product catalog application
  2. Catalog detail backend
  3. Product catalog backend
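To confirm that the chart deployed successfully, list the pods in the workshop namespace (the namespace used by this chart):

kubectl get pods -n workshop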

To test resiliency, let’s take the proddetail backend microservice as an example. By using Kubernetes topology spread constraints, you can ask the scheduler to spread pods across Availability Zones to limit the blast radius of an outage. Patch the deployment with a topologySpreadConstraint as follows.

cat <<EOF > patch-file.yaml
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: "topology.kubernetes.io/zone"
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: proddetail
EOF
kubectl patch deployment proddetail -n workshop --patch-file patch-file.yaml
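To verify that the constraint is taking effect, you can check which node (and therefore which AZ) each proddetail pod landed on; the label selector below assumes the app: proddetail label used in the patch:

# List the proddetail pods together with the nodes they are scheduled on
kubectl get pods -n workshop -l app=proddetail -o wide

# Show each node's Availability Zone label to confirm the pods are spread across zones
kubectl get nodes -L topology.kubernetes.io/zone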

When you check the status of the service as follows, you will see that proddetail is of type ClusterIP, which is accessible only within the cluster. To access it from outside the cluster, perform the steps that follow.

kubectl get service proddetail -n workshop
NAME         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
proddetail   ClusterIP   10.100.168.219   <none>        3000/TCP   11m

Create ingress class

cat <<EOF | kubectl create -f -
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: aws-alb
spec:
  controller: ingress.k8s.aws/alb  
EOF

Create ingress resource

cat <<EOF | kubectl create -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: workshop
  name: proddtl-ingress
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: aws-alb
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: proddetail
            port:
              number: 3000
EOF

After this, your web URL is ready:

kubectl get ingress -n workshop
NAME              CLASS     HOSTS   ADDRESS                                                                   PORTS   AGE
proddtl-ingress   aws-alb   *       k8s-workshop-proddtli-166014b35f-354421654.us-west-1.elb.amazonaws.com   80      14s

Test the connectivity from your browser:

Figure 2. Testing the connectivity from your browser
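You can also check connectivity from the command line. The following sketch captures the ALB hostname from the ingress status and sends a test request to it:

# Capture the ALB hostname from the ingress status
ALB_URL=$(kubectl get ingress proddtl-ingress -n workshop -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')

# Send a test request to the product detail service
curl -s "http://${ALB_URL}/" | head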

Step 4. Create an IAM Role for AWS FIS

Before running an AWS FIS experiment, create an IAM role. First, create a trust policy and use it to create the role as shown here:

cat > fis-trust-policy.json << EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                  "fis.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
EOF
aws iam create-role --role-name my-fis-role --assume-role-policy-document file://fis-trust-policy.json

Next, attach the required AWS FIS managed policies to the role:

aws iam attach-role-policy --role-name my-fis-role --policy-arn  arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorNetworkAccess
aws iam attach-role-policy --role-name my-fis-role --policy-arn  arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
aws iam attach-role-policy --role-name my-fis-role --policy-arn  arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorEKSAccess
aws iam attach-role-policy --role-name my-fis-role --policy-arn  arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorEC2Access
aws iam attach-role-policy --role-name my-fis-role --policy-arn  arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorSSMAccess
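Before moving on, you can confirm that the role exists and the policies are attached:

# Verify the role and its attached policies
aws iam get-role --role-name my-fis-role --query 'Role.Arn' --output text
aws iam list-attached-role-policies --role-name my-fis-role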

Step 5: Create an AWS FIS experiment

Use AWS FIS to create an experiment that disrupts network connectivity, using the following experiment template with the IAM role created in the previous step:

Figure 3. Experiment template
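If you prefer to define the experiment template from the AWS CLI instead of the console, the following is a minimal sketch of what it might look like. It assumes the aws:network:disrupt-connectivity action targeting subnets tagged with a hypothetical AZFailureTarget=true tag; adjust the target selection and duration to match the subnets of the AZ you want to disrupt, and refer to the AWS FIS documentation for the full template schema.

# Hypothetical experiment template: the resourceTags filter below is a placeholder
cat > fis-az-experiment.json << EOF
{
  "description": "Disrupt network connectivity to simulate an AZ failure",
  "roleArn": "arn:aws:iam::${FIS_ACCOUNT_ID}:role/my-fis-role",
  "stopConditions": [{ "source": "none" }],
  "targets": {
    "az-subnets": {
      "resourceType": "aws:ec2:subnet",
      "resourceTags": { "AZFailureTarget": "true" },
      "selectionMode": "ALL"
    }
  },
  "actions": {
    "disrupt-connectivity": {
      "actionId": "aws:network:disrupt-connectivity",
      "parameters": { "duration": "PT10M", "scope": "all" },
      "targets": { "Subnets": "az-subnets" }
    }
  },
  "tags": { "Name": "fis-az-failure-simulation" }
}
EOF
aws fis create-experiment-template --cli-input-json file://fis-az-experiment.json --region $FIS_AWS_REGION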

Step 6. Failure simulation with AWS FIS on Karpenter

We’ve completed the microservice setup, made it accessible from the internet, and created an AWS FIS template to simulate failures. Now let’s observe how the system behaves with Karpenter as the autoscaler.

With the microservice available within the Amazon EKS cluster, we’ll use Locust to simulate user behavior, with a total of 100 users accessing the URL concurrently.
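As a rough sketch, this load can be driven headless from the command line as shown below; locustfile.py is a hypothetical test script that issues GET requests against the service, and ALB_URL is the ingress hostname captured earlier:

# Run Locust headless: 100 users, spawning 10 per second, for 10 minutes
# (locustfile.py is a placeholder for your own Locust test script)
locust -f locustfile.py --headless -u 100 -r 10 --run-time 10m --host "http://${ALB_URL}"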

In the following experiments, Run 1 has 100 users accessing the service without any system disruption. In Run 2, we use AWS FIS to disrupt network connectivity. Measuring the user impact and comparing it to the results from the first run provides insight into how the system responds to failures and how it can be improved for greater reliability and performance.

To perform this experiment, select your experiment template, choose Start experiment, and then enter start in the confirmation field.

With 12 replicas of the proddetail microservice currently running, we start the experiment with Karpenter as the autoscaler. As shown in the following figures, the cluster uses a combination of t3.small instances and the “C” and “M” instance families allowed by Karpenter’s provisioner configuration.
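If you prefer the CLI over the console, you can start the experiment and watch the cluster react with commands like the following (substitute the experiment template ID that AWS FIS assigned when you created the template):

# Start the FIS experiment (replace <template-id> with your experiment template ID)
aws fis start-experiment --experiment-template-id <template-id> --region $FIS_AWS_REGION

# Watch the proddetail pods reschedule and Karpenter provision replacement nodes
kubectl get pods -n workshop -l app=proddetail -w
kubectl get nodes -L topology.kubernetes.io/zone -w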

Figure 4. Experiment results

Figure 5. Dynamic node usage with cluster

Cleanup

Use the following commands to clean up your experiment environment.

#delete Ingress resources
kubectl delete ingress proddtl-ingress -n workshop
kubectl delete ingressclass aws-alb

#delete IAM resources
aws iam delete-role --role-name my-fis-role

#delete FIS resources
fis_template_id=$(aws fis list-experiment-templates --region $FIS_AWS_REGION | jq -r '.experimentTemplates[0].id')
aws fis delete-experiment-template --id $fis_template_id --region $FIS_AWS_REGION

#delete application resources and cluster
helm uninstall workshop
cdk destroy

Conclusion

This blog post showed how to use AWS FIS to simulate AZ failures for Kubernetes workloads running on Amazon EKS.

While this experiment provides valuable insights into the performance and reliability of Kubernetes workloads in the face of failures, it’s important to acknowledge that it is not a true test of an AZ becoming unavailable. In a real-world scenario, an AZ failure can have a cascading effect, potentially impacting other services that your workloads depend on. However, simulating an AZ failure in a controlled environment helps you better understand how your Kubernetes cluster and applications will behave in an actual failure scenario. This knowledge can help you identify and address issues before they occur in production, ensuring that your applications remain highly available and resilient.

In summary, this experiment provides good insights into the performance and resilience of Kubernetes workloads. It is not a perfect representation of a real-world AZ failure, but by leveraging tools such as AWS FIS and carefully configuring autoscaling policies, you can take proactive steps to optimize performance and ensure high availability for critical applications.

Siva Guruvareddiar

Siva Guruvareddiar is a Senior Solutions Architect at AWS where he is passionate about helping customers architect highly available systems. He helps speed cloud-native adoption journeys by modernizing platform infrastructure and internal architecture using microservices, containerization, observability, service mesh areas, and cloud migration. Connect on LinkedIn at: linkedin.com/in/sguruvar

Elamaran Shanmugam

Elamaran (Ela) Shanmugam is a Sr. Container Specialist Solutions Architect with AWS. Ela is a Container, Observability and Multi-Account Architecture SME and helps customers design and build scalable, secure and optimized container workloads on AWS. His passion is building and automating infrastructure to allow customers to focus more on their business. He is based out of Tampa, Florida and you can reach him on twitter @IamElaShan.

Re Alvarez-Parmar

In his role as Containers Specialist Solutions Architect at Amazon Web Services, Re advises engineering teams with modernizing and building distributed services in the cloud. Prior to joining AWS, he spent more than 15 years as Enterprise and Software Architect. He is based out of Seattle. Connect on LinkedIn at: linkedin.com/in/realvarez/