AWS Cloud Operations & Migrations Blog

Enhancing observability with a managed monitoring solution for Amazon EKS

Introduction

Keeping a watchful eye on your Kubernetes infrastructure is crucial for ensuring optimal performance, identifying bottlenecks, and troubleshooting issues promptly. In the ever-evolving world of cloud-native applications, Amazon Elastic Kubernetes Service (EKS) has emerged as a popular choice for deploying and managing containerized workloads. However, monitoring Kubernetes clusters can be challenging because of their complexity, which is why AWS launched Amazon CloudWatch Container Insights to simplify the process. Imagine having a monitoring solution tailored specifically for your EKS clusters, built on open source tooling, that delivers real-time insights into the health and performance of your Kubernetes environment. With such a solution, you can monitor a Kubernetes cluster’s real-time state to quickly identify issues or bottlenecks, spot problems like memory leaks in individual containers through container-level metrics, and visually analyze across different cluster layers. With the combined power of Amazon Managed Grafana and Amazon Managed Service for Prometheus, you can now deploy an AWS-supported solution for monitoring EKS infrastructure.

With this solution, you can deploy a fully-managed Prometheus backend to collect and store metrics from your EKS cluster, while leveraging the intuitive visualization capabilities of Amazon Managed Grafana. A set of preconfigured dashboards will provide you with a holistic view of the health, performance, and resource utilization of your cluster. Whether you’re managing a small development cluster or a large-scale production environment, these dashboards offer actionable insights. From assessing the overall cluster health to monitoring the control and data planes, you’ll have a comprehensive understanding of your Kubernetes ecosystem. Additionally, you can dive deeper into workload performance across namespaces, track resource usage (CPU, memory, disk, and network), and identify potential bottlenecks before they escalate. In the following sections, we’ll explore the power of this AWS-managed solution, guiding you through the process of deploying and utilizing the pre-built CloudFormation template. Get ready to unlock a new level of visibility and control over your Amazon EKS infrastructure, empowering you to make informed decisions and optimize your Kubernetes environment for optimal performance.

Prerequisites

You will need the following resources and tools to deploy the solution:

Solution Overview

This AWS-managed solution offers a comprehensive monitoring framework for your Amazon Elastic Kubernetes Service (EKS) clusters. The solution empowers you with anticipatory capabilities, enabling you to drive intelligent scheduling decisions based on historical usage tracking, plan for future resource demands by analyzing current utilization data, and identify potential issues early by monitoring resource consumption trends at the namespace level. On the corrective front, you can quickly troubleshoot and reduce mean time to detection (MTTD) of issues across infrastructure and workloads using the pre-configured troubleshooting dashboard. With this AWS-managed solution tailored for Amazon EKS clusters, you gain comprehensive monitoring and observability capabilities. Stay ahead of performance bottlenecks, optimize resource utilization, and maintain a healthy and efficient Kubernetes environment through deep insights into your cluster’s health, performance, and resource usage.

To use this solution, you need an EKS cluster, an Amazon Managed Service for Prometheus workspace, and an Amazon Managed Grafana workspace. The first four steps below cover setting up these prerequisites. Then we deploy the CloudFormation stack to deploy the solution and visualize the results. Finally, we look at the costs involved and the cleanup section.


Fig 1. Data Flow diagram

Step 1: Setup the environment variables and artifacts

export CLUSTER_NAME=eks-cluster 
export AWS_REGION=<REGION> 
export ACCOUNT_ID=$(aws sts get-caller-identity | jq -r ".Account")
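The later steps interpolate these variables into heredocs and CLI calls, so it is worth failing fast if any of them is empty. Here is an optional, minimal guard (not part of the original steps; example values are shown so the snippet is self-contained):

```shell
# Example values so the snippet is self-contained; in practice these come
# from the Step 1 exports above.
export CLUSTER_NAME=eks-cluster
export AWS_REGION=us-west-2
export ACCOUNT_ID=123456789012

# Fail fast if any variable the later steps interpolate is empty.
for v in CLUSTER_NAME AWS_REGION ACCOUNT_ID; do
  val=$(eval "printf '%s' \"\$$v\"")
  if [ -z "$val" ]; then
    echo "environment variable $v is not set" >&2
    exit 1
  fi
done
echo "all variables set"
```

An empty `ACCOUNT_ID` (for example, from missing credentials) would otherwise silently produce malformed ARNs further down.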

Step 2: Create an Amazon EKS Cluster

An Amazon EKS cluster can be created using the eksctl command line tool, which provides a simple way to get started with sensible defaults for basic cluster creation, as shown below.

cat << EOF > eks-cluster-config.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${CLUSTER_NAME}
  region: ${AWS_REGION}
  version: "1.29"
iam:
  withOIDC: true  
managedNodeGroups:
  - name: managed-ng
    minSize: 1
    maxSize: 2
accessConfig:
  authenticationMode: API_AND_CONFIG_MAP
iamIdentityMappings:
  - arn: arn:aws:iam::${ACCOUNT_ID}:role/Administrator
    groups:
      - system:masters
    username: admin
    noDuplicateARNs: true # prevents shadowing of ARNs
vpc:
  clusterEndpoints:
    privateAccess: true
    publicAccess: true    
EOF

eksctl create cluster -f eks-cluster-config.yaml

Let’s create an IAM role with access to the cluster and store the results in environment variables.

cat << EOF > trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "eks.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Sid": "Statement1",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::${ACCOUNT_ID}:root"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
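Because the heredoc above is unquoted, the shell expands ${ACCOUNT_ID} at write time. An optional sanity check (a sketch; it recreates just the principal line with an example account ID so the snippet is self-contained) catches the case where the variable was empty:

```shell
# If ACCOUNT_ID was empty when the heredoc ran, the principal collapses to
# "arn:aws:iam:::root". Recreate that line with an example account ID, then
# grep for the collapsed form.
ACCOUNT_ID=123456789012
printf '{"AWS": "arn:aws:iam::%s:root"}\n' "$ACCOUNT_ID" > /tmp/trust-check.json

if grep -q 'iam:::root' /tmp/trust-check.json; then
  echo "ACCOUNT_ID was empty when the policy was written" >&2
  exit 1
fi
echo "trust policy principal looks expanded"
```

In practice you would run the grep against the real trust-policy.json before calling `aws iam create-role`.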
aws iam create-role --role-name EKSClusterAdminRole --assume-role-policy-document file://trust-policy.json
ROLE_ARN=$(aws iam get-role --role-name EKSClusterAdminRole --query Role.Arn --output text)
VPC_ID=$(aws eks describe-cluster --name $CLUSTER_NAME --query cluster.resourcesVpcConfig.vpcId --output text --region $AWS_REGION)
SECURITY_GROUP_ID=$(aws eks describe-cluster --name $CLUSTER_NAME --query cluster.resourcesVpcConfig.clusterSecurityGroupId --output text --region $AWS_REGION)
SUBNET_IDS=$(aws eks describe-cluster --name $CLUSTER_NAME --query cluster.resourcesVpcConfig.subnetIds --output text --region $AWS_REGION)
OIDC_URL=$(aws eks describe-cluster --name $CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text --region $AWS_REGION)
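Note that `--output text` returns the subnet IDs separated by whitespace. If the CloudFormation template declares EKSClusterSubnetIds as a comma-delimited list (an assumption; check the template you download in Step 5), you would convert the value first. A minimal sketch with an example value:

```shell
# `--output text` separates IDs with spaces/tabs; convert to a
# comma-separated string if the template parameter expects one.
SUBNET_IDS="$(printf 'subnet-0a1 subnet-0b2\tsubnet-0c3')"   # example value
SUBNET_IDS_CSV=$(printf '%s' "$SUBNET_IDS" | tr -s ' \t' ',')
echo "$SUBNET_IDS_CSV"   # subnet-0a1,subnet-0b2,subnet-0c3
```

`tr -s` maps both spaces and tabs to commas and squeezes any repeats into a single separator.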

Let’s create an access entry for the above IAM role and grant it cluster admin access via the AmazonEKSClusterAdminPolicy:

aws eks create-access-entry --cluster-name $CLUSTER_NAME --principal-arn $ROLE_ARN --region $AWS_REGION
aws eks associate-access-policy --cluster-name $CLUSTER_NAME --principal-arn $ROLE_ARN --access-scope type=cluster --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy --region $AWS_REGION

Step 3: Create Amazon Managed Service for Prometheus Workspace

The `aws amp create-workspace` command creates an Amazon Managed Service for Prometheus workspace with the alias `AMP-EKS` in the specified AWS Region. Workspaces provide isolated environments for storing Prometheus metrics. The workspace is created with default settings, which can be customized later if needed. The call returns the ID of the newly created workspace; this ID is required for sending metric data to the workspace from applications, as well as for allowing other services to access the data.

aws amp create-workspace --alias AMP-EKS --region $AWS_REGION
AMP_WS_ID=$(aws amp list-workspaces --region $AWS_REGION | jq -r ".workspaces[0].workspaceId")
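The `workspaces[0]` index assumes this is the only AMP workspace in the account. A more defensive pattern selects the workspace by the alias created above; a sketch using an example response shape (the real input would be the `aws amp list-workspaces` output):

```shell
# Example list-workspaces response shape; in real use, pipe the AWS CLI
# output into jq instead of this literal.
LIST_JSON='{"workspaces":[{"alias":"other","workspaceId":"ws-1111"},{"alias":"AMP-EKS","workspaceId":"ws-2222"}]}'
AMP_WS_ID=$(printf '%s' "$LIST_JSON" | jq -r '.workspaces[] | select(.alias=="AMP-EKS") | .workspaceId')
echo "$AMP_WS_ID"   # ws-2222
```

This way an unrelated, older workspace in the account cannot be picked up by accident.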

Step 4: Create Amazon Managed Grafana workspace

Create an Amazon Managed Grafana workspace compatible with Grafana version 9 by following the instructions here. You can also choose to assign users the “admin” role in the workspace. Let’s get the Grafana workspace ID using the following command:

AMG_WS_ID=$(aws grafana list-workspaces --region $AWS_REGION --query "workspaces[0].id" --output text)

Create an API key with ADMIN access for calling Grafana HTTP APIs using these instructions and store it in the AMG_API_KEY variable. Then store it in the Systems Manager Parameter Store as below:

aws ssm put-parameter --name "/eks-infra-monitoring-accelerator/grafana-api-key" \
    --type "SecureString" \
    --value $AMG_API_KEY \
    --region $AWS_REGION

Step 5: Deploy the solution using CloudFormation

Create an S3 bucket, fetch the solution files from the GitHub repo, and upload them to S3 using the following commands:

aws s3 mb s3://<s3-bucket>
git clone --no-checkout https://github.com/aws-observability/observability-best-practices.git
cd observability-best-practices
git sparse-checkout init --cone
git sparse-checkout set solutions/oss/eks-infra/v1.0.0/iac
aws s3 cp solutions/oss/eks-infra/v1.0.0/iac s3://<s3-bucket> --recursive

The uploaded files in S3 look like the following. Note the URL of eks-monitoring-cfn-template.json, as we will need it in the next steps.

Fig 2. S3 bucket showing the Solution files

You can provision the solution using CloudFormation via the CLI like so:

aws cloudformation deploy --stack-name amg-solution \
    --region $AWS_REGION \
    --capabilities CAPABILITY_IAM \
    --template-file solutions/oss/eks-infra/v1.0.0/iac/eks-monitoring-cfn-template.json \
    --parameter-overrides \
    AMGWorkspaceEndpoint=$AMG_WS_ID.grafana-workspace.$AWS_REGION.amazonaws.com \
    AMPWorkspaceId=$AMP_WS_ID \
    EKSClusterAdminRoleARN=$ROLE_ARN \
    EKSClusterName=$CLUSTER_NAME \
    EKSClusterOIDCEndpoint=$OIDC_URL \
    EKSClusterSecurityGroupId=$SECURITY_GROUP_ID \
    EKSClusterSubnetIds=$SUBNET_IDS \
    EKSClusterVpcId=$VPC_ID \
    S3BucketName=<s3-bucket> \
    S3BucketRegion=$AWS_REGION

The other option is to use the AWS Console: go to CloudFormation → Create Stack and provide the values to create the resources, as shown below:

Fig 3. CloudFormation screen showing sample values

Creating the stack takes around 20 minutes to complete. After the stack creation is complete, you must configure the Amazon EKS cluster to allow access from the newly created scraper. You can get the scraper ID from your EKS cluster’s Observability tab. Use this ID and follow these instructions to configure your Amazon EKS cluster for managed scraping.

Step 6: Solution overview

Once the steps are completed, log in to your Amazon Managed Grafana workspace. Under Dashboards, you should see various dashboards under “EKS Infrastructure Monitoring”, as shown below. These cover both infrastructure and workload-related views.

Fig 4. Amazon Managed Grafana dashboards

The Cluster dashboard under Compute Resources shows various metrics related to the cluster, as below. As you can see, CPU utilization is low since few workloads are running.

Fig 5. Amazon Managed Grafana Dashboard showing Cluster view

The Namespace (workload) dashboard provides similar information. You can think of this as parallel to what you might be viewing in CloudWatch Container Insights’ Namespace view.

Fig 6. Amazon Managed Grafana Dashboard showing Namespace view

The same applies to the workload view:

Fig 7. Amazon Managed Grafana Dashboard showing workloads view

You also get control plane views, such as the kube-apiserver view below, which shows the advanced kube-apiserver metrics.

Fig 8. Amazon Managed Grafana Dashboard showing advanced kube-apiserver view

There is also a kube-apiserver troubleshooting view, shown below, which is helpful when troubleshooting your cluster.

Fig 9. Amazon Managed Grafana Dashboard showing troubleshooting kube-apiserver view

There is a kubelet dashboard view as well:

Fig 10. Amazon Managed Grafana Dashboard showing Kubelet view

And last but not least, the Node dashboard view, shown below, displays CPU and load average. Again, since few workloads are running, the charts do not show much variation. These dashboards track a total of 88 metrics, and the full list of metrics is documented here.

Fig 11. Amazon Managed Grafana Dashboard showing Nodes view

Using the solution for performance monitoring

Let us deploy a workload and run a load test to see the anticipatory capabilities. For this, we launch a Java application consisting of a Kubernetes Deployment and Service, using the Amazon Corretto JDK:

kubectl create deployment sample-deploy --image=public.ecr.aws/m8u2y2m7/gravitonjava:vcorreto --replicas=3
deployment.apps/sample-deploy created
kubectl expose deployment sample-deploy --name=sample-svc --port=8080 --target-port=8080 --type=ClusterIP
service/sample-svc exposed

Now let us stress-test this deployment using the wrk tool, as shown below. This spins up 64 threads creating 2,048 connections for a period of 15 minutes, targeting the service we created in the previous step.

cat << EOF > job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: wrk-job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: wrk
        image: ruslanys/wrk:ubuntu
        command: ["/usr/local/bin/wrk", "-t64", "-c2048", "-d900s", "http://sample-svc.default.svc.cluster.local:8080"]
EOF
kubectl apply -f job.yaml
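As a quick sanity check on the numbers quoted above, they follow directly from the wrk flags in the Job (`-t64 -c2048 -d900s`):

```shell
# Derived from the wrk flags: 64 threads, 2048 connections, 900 seconds.
THREADS=64; CONNECTIONS=2048; DURATION_S=900
echo "$((CONNECTIONS / THREADS)) connections per thread"   # 32 connections per thread
echo "$((DURATION_S / 60)) minute test"                    # 15 minute test
```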

After this, you should see the CPU and load average spiking in the Nodes dashboard, as below:

Fig 12. Amazon Managed Grafana Dashboard Node view with CPU utilization

Similarly, the Cluster dashboard shows the high CPU utilization, as below.

Fig 13. Amazon Managed Grafana Dashboard Cluster view with CPU utilization

Cleanup

Use the following commands to delete resources created during this post:

aws cloudformation delete-stack --stack-name amg-solution --region $AWS_REGION
aws ssm delete-parameter --name "/eks-infra-monitoring-accelerator/grafana-api-key" --region $AWS_REGION
aws grafana delete-workspace --workspace-id $AMG_WS_ID --region $AWS_REGION
aws amp delete-workspace --workspace-id $AMP_WS_ID --region $AWS_REGION
aws iam delete-role --role-name EKSClusterAdminRole
aws s3 rb s3://<s3-bucket> --force
eksctl delete cluster --name $CLUSTER_NAME

Costs

This solution leverages AWS managed services, including Amazon Managed Grafana and Amazon Managed Service for Prometheus, to provide comprehensive monitoring and observability for your Amazon EKS clusters. While these services offer convenience and ease of use, it’s important to note that you will incur standard usage charges. These charges include costs associated with Amazon Managed Grafana workspace access by users, as well as metric ingestion and storage within Amazon Managed Service for Prometheus. The number of metrics ingested, and consequently the associated costs, will depend on the configuration and usage of your Amazon EKS cluster. You can monitor the ingestion and storage metrics through CloudWatch, as detailed in the Amazon Managed Service for Prometheus User Guide. Additionally, AWS provides a pricing calculator to help estimate the costs based on the number of nodes in your EKS cluster, which directly impacts the metric ingestion volume.
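To make the relationship between cluster size and ingestion volume concrete, here is a back-of-envelope sketch. Every number below is an assumption for illustration, not AWS pricing; actual series counts depend on your cluster and scrape configuration:

```shell
# Illustrative only: all inputs are assumptions, not AWS pricing data.
NODES=2                 # nodes in the example cluster
SERIES_PER_NODE=2000    # assumed active time series per node
SCRAPE_INTERVAL_S=30    # assumed scrape interval in seconds
SECONDS_PER_MONTH=$((30 * 24 * 3600))
SAMPLES_PER_MONTH=$((NODES * SERIES_PER_NODE * SECONDS_PER_MONTH / SCRAPE_INTERVAL_S))
echo "$SAMPLES_PER_MONTH samples ingested per month"   # 345600000 samples ingested per month
```

Plugging your own node count and scrape interval into the AWS pricing calculator refines this estimate.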

Conclusion

The AWS-managed solution for monitoring Amazon EKS clusters with Amazon Managed Grafana and Amazon Managed Service for Prometheus offers a comprehensive and streamlined approach to gaining deep insights into your Kubernetes infrastructure. By leveraging pre-configured dashboards and automated metric collection, you can effortlessly monitor the health and performance of your control and data planes, workloads, and resource utilization across namespaces. This solution empowers you with both anticipatory and corrective capabilities, enabling you to stay ahead of potential issues, optimize resource allocation, and troubleshoot problems quickly and effectively.

Throughout this walkthrough, you’ve learned how to set up the necessary components, including an EKS cluster, Managed Prometheus workspace, and Managed Grafana workspace. You’ve also deployed the CloudFormation template, which orchestrates the integration of these services, providing you with a unified monitoring solution tailored for your Amazon EKS environment. With the ability to visualize and analyze a wide range of metrics, from cluster-level metrics to workload-specific insights, you can make informed decisions, ensure optimal performance, and maintain a healthy and efficient Kubernetes ecosystem.

We’re looking forward to hearing from you about how we can improve this solution: for example, by adding support for logs, alerts, and traces; monitoring a fleet of EKS clusters; correlating telemetry; additional ways to provision the solution (for example, Terraform); and anything else that comes to mind.

To learn more about AWS Observability, see the following references:
AWS Observability Best Practices Guide
One Observability Workshop
Terraform AWS Observability Accelerator
CDK AWS Observability Accelerator

About the authors

Siva Guruvareddiar

Siva Guruvareddiar is a Senior Solutions Architect at AWS where he is passionate about helping customers architect highly available systems. He helps speed cloud-native adoption journeys by modernizing platform infrastructure and internal architecture using microservices, containerization, observability, service mesh areas, and cloud migration. Connect on LinkedIn at: linkedin.com/in/sguruvar.

Michael Hausenblas

Michael works in the AWS open source observability service team where he is a Solution Engineering Lead and owns the AWS Distro for OpenTelemetry (ADOT) from the product side.