AWS Cloud Operations Blog

Enhance Amazon EKS Containerized Application Resilience with AWS Resilience Hub

Building and managing resilient, microservice-based containerized applications in a distributed environment is hard; maintaining and operating them is even harder. Even though containerized applications running on Amazon Elastic Kubernetes Service (Amazon EKS) take advantage of the performance, scale, reliability, and availability of AWS infrastructure, we need to understand that failures will occur and that we should always be prepared for them.

The AWS Well-Architected Framework defines resilience as having “the capability to recover when stressed by load (more requests for service), attacks (either accidental through a bug, or deliberate through intention), and failure of any component in the workload’s components.” It is typically measured by two metrics: Recovery Time Objective (RTO), the time it takes to recover from a failure, and Recovery Point Objective (RPO), the maximum window of time in which data might be lost after an incident. Depending on your business and application, these can be measured in seconds, minutes, hours, or days.

It’s important to ensure your containerized applications have been developed using good resiliency design principles. In November 2021 we launched AWS Resilience Hub, a service that provides a central place to help organizations define, validate, and track the resilience of AWS native applications by analyzing the services that make up an application. We are excited to now announce Amazon EKS as our newest supported service under the AWS Resilience Hub umbrella.

In this post, we will show you how to proactively improve the resilience of modern containerized workloads running on Amazon EKS with AWS Resilience Hub. As part of the post, we will deploy an Amazon EKS cluster that hosts a sample microservice-based application named sock-shop. After we deploy the sock-shop application, we will discover the resources in the application by adding the Amazon EKS cluster and the cluster namespace to AWS Resilience Hub. We will then run a resiliency assessment to indicate whether the application hosted on Amazon EKS is resilient. The estimated resiliency will be benchmarked against the target RPO and RTO metrics, which are defined in a resiliency policy. Lastly, we will dive into the resiliency assessment report generated by AWS Resilience Hub.

Solution overview

The following diagram depicts the architecture of the solution deployed as part of this blog.

Figure 1: Architecture Diagram

The solution in the blog includes the following services:

  • AWS Resilience Hub
  • Amazon EKS
  • AWS IAM
  • AWS Cloud9 (optional)

Prerequisites

  1. An AWS account with admin privileges: For this blog, we will assume you already have an AWS account with admin privileges.
  2. Command line tools: Install the latest versions of the AWS CLI, aws-iam-authenticator, kubectl, and eksctl on your IDE workstation. You also have the option to create a Cloud9 environment in AWS and install these CLIs there.

Complete these prerequisites before deploying the solution. You can use either AWS Cloud9 or an IDE of your choice.
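Before moving on, it can help to confirm that the command line tools listed above are available on your PATH. The following is a minimal sketch and not part of the original walkthrough; check_tools is a hypothetical helper name, and it only checks that each tool exists, not that it is the latest version.

```shell
# Report which of the given command line tools are missing from PATH.
# Returns 0 when every tool is found, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage for this post:
#   check_tools aws aws-iam-authenticator kubectl eksctl
```

If any tool is reported as missing, install it before you continue with the cluster setup.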

Deploy EKS Cluster and Sample Application

Step 1: Create an EKS Cluster

To set up your workspace and get started with this post, open your favorite browser on your Mac, Linux, or Windows workstation.

  • Follow this tutorial to deploy an EKS cluster to use with this blog.

After you create your Amazon EKS cluster, configure your kubeconfig file using the AWS CLI so that you can connect to the cluster with the kubectl command line. The following update-kubeconfig command creates a kubeconfig entry for your cluster. To verify that the cluster is up and reachable, run any kubectl get command, such as kubectl get nodes.

aws eks update-kubeconfig --region us-east-2 --name eks-resilience-cluster
kubectl get nodes

Step 2: Deploy sample application on Amazon EKS Cluster

Next, we deploy our sample application on the Amazon EKS cluster.

  • Clone the sock-shop application repository into the working directory of your IDE, then change to the directory that contains the deployment manifests. Open complete-demo.yaml in your favorite editor, change the Service type of the front-end microservice from NodePort to LoadBalancer, and then deploy the application by running kubectl apply.
git clone https://github.com/microservices-demo/microservices-demo.git
cd ./microservices-demo/deploy/kubernetes
kubectl apply -f complete-demo.yaml
  • Verify that the sock-shop application is up and running, for example by listing the pods in the sock-shop namespace. You should see output similar to the following.

Figure 2: sock-shop application status
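The manifest edit and the status check in this step can also be scripted. The sketch below is an illustration under stated assumptions, not the original authors' commands: use_load_balancer and all_pods_running are hypothetical helper names, the sed expression assumes that the front-end Service is the only one declared as NodePort in complete-demo.yaml, and the status check assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...).

```shell
# Switch the Service type from NodePort to LoadBalancer in a manifest file.
# Assumption: the front-end Service is the only NodePort Service in the file.
use_load_balancer() {
  manifest=$1
  sed 's/type: NodePort/type: LoadBalancer/' "$manifest" > "$manifest.tmp" &&
    mv "$manifest.tmp" "$manifest"
}

# Read `kubectl get pods --no-headers` output on stdin and succeed only if
# at least one pod is listed and every pod reports STATUS (column 3) Running.
all_pods_running() {
  awk '{ total++ }
       $3 != "Running" { bad++; print "not running: " $1 }
       END { exit (total == 0 || bad > 0) ? 1 : 0 }'
}

# Example usage (run the first line before kubectl apply):
#   use_load_balancer complete-demo.yaml
#   kubectl get pods -n sock-shop --no-headers | all_pods_running
```

Pods can take a few minutes to start, so a failing check right after kubectl apply does not necessarily indicate a problem.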

Step 3: Allow AWS Resilience Hub access to the EKS cluster

Amazon EKS cluster access using AWS Identity and Access Management (IAM) entities is enabled by the AWS IAM Authenticator for Kubernetes, which runs on the Amazon EKS control plane. The IAM authenticator gets its configuration information from the aws-auth ConfigMap. For more information, see Enabling IAM user and role access to your cluster – Amazon EKS.

AWS Resilience Hub queries resources inside your Amazon EKS cluster by assuming an IAM role in your account. This IAM role is mapped to a Kubernetes group, which grants the permissions required to assess the Amazon EKS cluster.

Figure 3: IAM Process Flow

The following steps grant AWS Resilience Hub the permissions required to discover resources inside your Amazon EKS cluster.

  • Create an IAM role named AwsResilienceHubAssessmentEKSAccessRole.

This role will be assumed by AWS Resilience Hub when importing and assessing your application. It will be mapped to a Kubernetes group that enables AWS Resilience Hub to assess our Amazon EKS cluster.

In AWS, we manage access by creating policies and attaching them to IAM identities (users, groups of users, or roles) or AWS resources. To define the trust policy for the role and create the role, run the following commands:

export ACCOUNT_ID=$(aws sts get-caller-identity --output text --query Account)
export POLICY=$(echo -n '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"AWS":"arn:aws:iam::'; echo -n "$ACCOUNT_ID"; echo -n ':root"},"Action":"sts:AssumeRole","Condition":{}}]}')
aws iam create-role \
--role-name AwsResilienceHubAssessmentEKSAccessRole \
--description="AWS Resilience Hub read only role (for AWS IAM Authenticator for Kubernetes)." \
--assume-role-policy-document "$POLICY"
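The echo chain above works, but it silently produces a malformed principal if ACCOUNT_ID is empty (for example, when the aws sts call fails). As a hedged alternative, the sketch below builds the same trust policy with printf and rejects anything that is not a 12-digit account ID; build_trust_policy is a hypothetical helper name, not part of the AWS CLI.

```shell
# Print a trust policy that lets principals in the given account assume the role.
# Fails early if the account ID is not exactly 12 digits.
build_trust_policy() {
  account_id=$1
  case $account_id in
    ''|*[!0-9]*)
      echo "invalid account id: '$account_id'" >&2
      return 1
      ;;
  esac
  if [ "${#account_id}" -ne 12 ]; then
    echo "invalid account id: '$account_id'" >&2
    return 1
  fi
  printf '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"AWS":"arn:aws:iam::%s:root"},"Action":"sts:AssumeRole","Condition":{}}]}\n' "$account_id"
}

# Example usage, as a drop-in for the POLICY export above:
#   POLICY=$(build_trust_policy "$ACCOUNT_ID") || exit 1
```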
  • Create a Resilience Hub ClusterRole and ClusterRoleBinding (or a Role and RoleBinding).

To grant AWS Resilience Hub read access across all namespaces, create the required ClusterRole and ClusterRoleBinding by running the following command.

Note: In your production environment, scope this access to a particular namespace and follow the principle of least privilege by creating a Role and RoleBinding instead.

cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: resilience-hub-eks-access-cluster-role
rules:
  - apiGroups:
      - ""
    resources:
      - pods
      - replicationcontrollers
      - nodes
    verbs:
      - get
      - list
  - apiGroups:
      - apps
    resources:
      - deployments
      - replicasets
    verbs:
      - get
      - list
  - apiGroups:
      - policy
    resources:
      - poddisruptionbudgets
    verbs:
      - get
      - list
  - apiGroups:
      - autoscaling.k8s.io
    resources:
      - verticalpodautoscalers
    verbs:
      - get
      - list
  - apiGroups:
      - autoscaling
    resources:
      - horizontalpodautoscalers
    verbs:
      - get
      - list
  - apiGroups:
      - karpenter.sh
    resources:
      - provisioners
    verbs:
      - get
      - list
  - apiGroups:
      - karpenter.k8s.aws
    resources:
      - awsnodetemplates
    verbs:
      - get
      - list

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: resilience-hub-eks-access-cluster-role-binding
subjects:
  - kind: Group
    name: resilience-hub-eks-access-group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: resilience-hub-eks-access-cluster-role
  apiGroup: rbac.authorization.k8s.io
---  
EOF
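For the namespace-scoped variant mentioned in the note above, a Role and RoleBinding could look roughly like the following. This is a sketch under assumptions rather than a verified configuration: render_namespaced_rbac and the names resilience-hub-eks-access-role and resilience-hub-eks-access-role-binding are hypothetical, and cluster-scoped resources (nodes and the Karpenter objects) are left out because a Role cannot grant access to them, which may reduce what AWS Resilience Hub is able to discover.

```shell
# Print a namespace-scoped Role and RoleBinding for the given namespace.
# Cluster-scoped resources from the ClusterRole above are intentionally omitted.
render_namespaced_rbac() {
  namespace=$1
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: resilience-hub-eks-access-role
  namespace: ${namespace}
rules:
  - apiGroups: [""]
    resources: ["pods", "replicationcontrollers"]
    verbs: ["get", "list"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["get", "list"]
  - apiGroups: ["autoscaling"]
    resources: ["horizontalpodautoscalers"]
    verbs: ["get", "list"]
  - apiGroups: ["autoscaling.k8s.io"]
    resources: ["verticalpodautoscalers"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: resilience-hub-eks-access-role-binding
  namespace: ${namespace}
subjects:
  - kind: Group
    name: resilience-hub-eks-access-group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: resilience-hub-eks-access-role
  apiGroup: rbac.authorization.k8s.io
EOF
}

# Example usage:
#   render_namespaced_rbac sock-shop | kubectl apply -f -
```

The RoleBinding reuses the resilience-hub-eks-access-group group name, so the IAM identity mapping in this post stays the same whichever variant you choose.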

Then create a mapping between the IAM role AwsResilienceHubAssessmentEKSAccessRole and the Kubernetes group resilience-hub-eks-access-group, which grants the IAM role permission to access resources inside the Amazon EKS cluster.

eksctl create iamidentitymapping \
 --cluster eks-resilience-cluster \
 --region=us-east-2 \
 --arn arn:aws:iam::"$ACCOUNT_ID":role/AwsResilienceHubAssessmentEKSAccessRole \
 --group resilience-hub-eks-access-group \
 --username AwsResilienceHubAssessmentEKSAccessRole
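To confirm that the mapping exists, you can list the identity mappings with eksctl get iamidentitymapping and look for the role. The helper below is a small sketch that automates that check; has_identity_mapping is a hypothetical name, and it assumes that the role ARN and the group appear on the same row of the eksctl output.

```shell
# Succeed if stdin (the identity mapping table) contains a row that
# mentions both the given role ARN and the given Kubernetes group.
has_identity_mapping() {
  role_arn=$1
  group=$2
  grep -F -- "$role_arn" | grep -qF -- "$group"
}

# Example usage:
#   eksctl get iamidentitymapping --cluster eks-resilience-cluster --region us-east-2 |
#     has_identity_mapping \
#       "arn:aws:iam::${ACCOUNT_ID}:role/AwsResilienceHubAssessmentEKSAccessRole" \
#       resilience-hub-eks-access-group && echo "mapping found"
```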
  • Create an IAM role named AwsResilienceHubPeriodicAssessmentRole.

To allow AWS Resilience Hub to perform scheduled assessments, we must enable the IAM roles and permissions required to activate the daily assessment. With scheduled assessments, AWS Resilience Hub assesses your application daily. Assessments use the latest imported application to assess existing resources and their configuration changes.

For more information, see AWS Resilience Hub Scheduled Assessment Role.

Running your first Resiliency assessment

The AWS Resilience Hub assessment uses best practices from the AWS Well-Architected Framework to analyze the components of an application and uncover potential resilience weaknesses. These weaknesses may be caused by incomplete infrastructure setup, misconfiguration, or architecture drift.

Follow the steps below to run a resiliency assessment of the sock-shop application deployed on Amazon EKS above.

Step 1: Add Amazon EKS Cluster and Sample Application to AWS Resilience Hub

Enter the following values (see the screenshot below):

    • Application Name: sock-shop
    • Description: Sock-Shop Application Hosted on EKS
    • How is this application managed? Select EKS Only
    • Add EKS clusters
      • Select EKS Clusters: select the eks-resilience-cluster
      • Cross account or region: Leave blank. Specify the EKS cluster ARN here only if your EKS cluster is in a different account, a different Region, or both.

Figure 4: Add EKS to AWS Resilience Hub

      • Add namespaces to each EKS cluster: select the eks-resilience-cluster and click Update Namespaces

Figure 5: Update EKS Namespace to AWS Resilience Hub

    • Under Add namespace, enter sock-shop, check the box to use the namespace, and click Save.
    • Under Scheduled assessment, check the option that enables the required IAM roles and permissions, and then click Next.

NOTE: AWS Resilience Hub can run a daily assessment of your application. You can turn off this setting and run assessments manually on your own schedule. When the setting is enabled, the daily assessment schedule begins only after the application has been manually assessed successfully for the first time, and only if the AwsResilienceHubPeriodicAssessmentRole IAM role has been created. Scheduled assessment is optional. For this blog, the required role was added in the steps above.

  • After a few minutes, the supported resources from the sock-shop EKS cluster are listed. You can select specific resource types to include in or exclude from your assessment. For this blog, we will leave the defaults. Click Next.

NOTE: At the time of this writing, AWS Resilience Hub supports discovering only Deployment, ReplicaSet, and Pod resources. Other Kubernetes resources will be supported in a future release.

  • Under Select policy, click Create resiliency policy.
  • Under Create resiliency policy, select the following options and values for the purpose of this blog post:
    • Choose a creation method: Select a policy based on a suggested policy
    • Policy name: sock-shop-foundational-core
    • Suggested resiliency policies: Foundational Core Service
    • Choose Create
    • Select the policy and choose Next
    • Review the configuration on the next page and choose Publish

Step 2: Run Resiliency assessment

  • Under Applications in AWS Resilience Hub, click your application, sock-shop.
  • You can create and run a resiliency assessment in a couple of ways. You can either:
    • Click the Assessments tab and then Run new resiliency assessment
    • OR click Assess resiliency

Figure 6: AWS Resilience Hub Workflow

  • Give the report a name, for example sock-shop-res-assess, and then click Run.
  • The assessment is listed with the status “Pending”. When you refresh the list, the status changes to “In Progress”. The assessment takes a few minutes to finish and reach the status “Success”.
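Instead of refreshing the console, the status transitions described above can be polled from a script. The following is a sketch, not a documented procedure: wait_for_assessment, get_status, and ASSESSMENT_ARN are hypothetical names, and the example usage assumes that aws resiliencehub describe-app-assessment exposes the status at assessment.assessmentStatus with the values Pending, InProgress, Success, and Failed.

```shell
# Poll a status command until it prints Success or Failed.
# $1: seconds to wait between polls
# remaining arguments: a command that prints the current assessment status
# Returns 0 on Success, 1 on Failed, 2 if the status command itself fails.
wait_for_assessment() {
  interval=$1
  shift
  while :; do
    state=$("$@") || return 2
    echo "assessment status: $state"
    case $state in
      Success) return 0 ;;
      Failed)  return 1 ;;
    esac
    sleep "$interval"
  done
}

# Example usage (ASSESSMENT_ARN is a placeholder for your assessment's ARN):
#   get_status() {
#     aws resiliencehub describe-app-assessment \
#       --assessment-arn "$ASSESSMENT_ARN" \
#       --query 'assessment.assessmentStatus' --output text
#   }
#   wait_for_assessment 30 get_status
```

Passing the status command as an argument keeps the polling loop independent of the AWS CLI, so the loop can be exercised with a stub before pointing it at a real assessment.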

Reviewing your first Resiliency assessment and recommendations

The Resiliency assessment provides an overview of the assessment report. AWS Resilience Hub lists each disruption type and the associated application component. It also lists your actual RTO and RPO policies and determines whether the application component can achieve the policy goals.
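Conceptually, a policy is met when the estimated RTO and RPO are both within their targets. The sketch below illustrates that comparison only; it is a simplification and not how AWS Resilience Hub computes its results, which are evaluated per disruption type and AppComponent. policy_status is a hypothetical helper, and all values are in seconds.

```shell
# Compare estimated RTO/RPO with the policy targets (all values in seconds).
# Prints "Policy met" only when both estimates are within their targets.
policy_status() {
  est_rto=$1 target_rto=$2 est_rpo=$3 target_rpo=$4
  if [ "$est_rto" -le "$target_rto" ] && [ "$est_rpo" -le "$target_rpo" ]; then
    echo "Policy met"
  else
    echo "Policy breached"
  fi
}

# Example usage with made-up numbers (estimated RTO of 2 hours, target of 1 hour):
#   policy_status 7200 3600 300 900
```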

To review your assessment, follow the steps below:

  • Wait until the assessment status changes to “Success”.
  • Next to the assessment name, you will see either “Policy met” or “Policy breached”. If you followed the instructions in this blog, it will be “Policy breached”. Click the sock-shop-res-assess report to open it.

Figure 7: AWS Resilience Hub assessment

  • The report is divided into three main sections (tabs): Results, Resiliency recommendations, and Operational recommendations. The Results tab summarizes the estimated RTO and RPO against the targeted values. It also provides detailed descriptions for each disruption type (application, infrastructure, Availability Zone, and Region).

Figure 8: AWS Resilience Hub Recommendations

As shown above, AWS Resilience Hub has identified 14 breaches each across Infrastructure, Availability Zone, and Region.

  • Let's expand the Infrastructure breaches. Toggle the Infrastructure tab.
  • Click the Estimated RTO for the top AppComponent. A pop-up text explains in detail the reason for the AppComponent breach. Feel free to explore the other AppComponents in this list.

Figure 9: AWS Resilience Hub AppComponent Recommendations

  • Now that we have looked at the breaches, let's look at the Resiliency recommendations to fix the policy breaches. Resiliency recommendations evaluate application components and recommend changes optimized for RTO and RPO, for cost, and for minimal changes.
  • Click the Resiliency recommendations tab.

Figure 10: AWS Resilience Hub Resiliency Recommendations

  • Under AppComponents, select the top component. You will see the benefits of fixing the AppComponent. For this selection, you will see “Optimize for Cost, minimal changes and Best Region RTO/RPO” as the benefits. The recommendation also suggests changes that bring the component into compliance with the policy. In this example, 8 changes are recommended to address the application readiness.

Figure 11: AWS Resilience Hub Resiliency Recommendations for EKS resources

Cleanup

When you’re done testing, delete the resources you created so that you’re no longer billed for them. To clean everything, follow these steps:

  • Remove the application from AWS Resilience Hub:
    • Go to the AWS Resilience Hub console → click Applications → select “sock-shop” → click “Actions” → Delete

Figure 12: AWS Resilience Hub Resiliency Recommendations for EKS resources

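  • Remove the sample application and the AWS Resilience Hub ClusterRole and ClusterRoleBinding from the Amazon EKS cluster by running the following commands in your terminal (adjust the path if you cloned the repository somewhere other than your home directory):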
cd ~/microservices-demo/deploy/kubernetes
kubectl delete -f complete-demo.yaml
kubectl delete clusterrolebinding resilience-hub-eks-access-cluster-role-binding
kubectl delete clusterrole resilience-hub-eks-access-cluster-role
  • Delete the Amazon EKS cluster and the AWS IAM roles you created from the AWS Management Console.
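If you prefer the command line to the console for the last step, the deletions can be sketched as follows. This is an illustration under assumptions: run and cleanup_cli are hypothetical helpers, the eksctl delete cluster call assumes the cluster was created with eksctl, and AwsResilienceHubPeriodicAssessmentRole is not covered because any policies attached to it must be removed before the role can be deleted. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    echo "+ $*"
  fi
}

# Remove the identity mapping, the access role, and the cluster.
# Assumption: the cluster was created with eksctl.
cleanup_cli() {
  cluster=$1 region=$2 account_id=$3
  role=AwsResilienceHubAssessmentEKSAccessRole
  run eksctl delete iamidentitymapping --cluster "$cluster" --region "$region" \
    --arn "arn:aws:iam::${account_id}:role/${role}"
  run aws iam delete-role --role-name "$role"
  run eksctl delete cluster --name "$cluster" --region "$region"
}

# Example usage (prints the commands; prefix with DRY_RUN=0 to execute them):
#   cleanup_cli eks-resilience-cluster us-east-2 "$ACCOUNT_ID"
```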

Summary

In this post, we looked at how to enhance the resilience of containerized applications on Amazon EKS with AWS Resilience Hub. We deployed a sample application on Amazon EKS and created a resiliency assessment for this application using AWS Resilience Hub. We reviewed the results of the assessment against the target RPO and RTO metrics defined in the resiliency policy. In the next post, we will demonstrate how you can assess other Kubernetes resources, such as StatefulSets, DaemonSets, Jobs, Services, Ingress, and ClusterAutoscaler, using AWS Resilience Hub to uncover potential resilience weaknesses. Stay tuned!

About the Authors

Imtranur Rahman

Imtranur Rahman is an experienced Sr. Solutions Architect on the WWPS team with 14+ years of experience. Imtranur works with large AWS Global SI partners and helps them build their cloud strategy and drive broad adoption of Amazon’s cloud computing platform. Imtranur specializes in Containers, Dev/SecOps, GitOps, microservices-based applications, hybrid application solutions, and application modernization, and loves innovating on behalf of his customers. He is highly customer obsessed and takes pride in providing the best solutions through his extensive expertise.

Anurag Jain

Anurag Jain is a Global Solutions Architect at AWS based out of Palo Alto, CA, supporting worldwide partners. He has two decades of broad technology experience in innovation and prototyping solutions, with core expertise in the AWS Cloud platform and architecting microservices. He primarily drives application modernization journeys on AWS Cloud, builds Cloud Center of Excellence practices, and serves as an advisory consultant to the Office of the CTO for worldwide high-tech customers and system integrators.