Automate monitoring for your Amazon EKS cluster using CloudWatch Container Insights
Are you looking for a monitoring solution for your Amazon Elastic Kubernetes Service (Amazon EKS) cluster that helps you achieve scalability, reduce errors, and save time and manual effort? For example, consider an Amazon EKS cluster environment that’s configured with multiple worker nodes, each with one or more Kubernetes pods. Teams with similar environments often struggle to find the most efficient way to set up monitoring and alerting capabilities for both system-level pods and worker node performance metrics.
In this article, we present an event-driven automation solution for monitoring your Amazon EKS cluster using Amazon CloudWatch Container Insights metrics, Terraform, AWS CloudFormation, and other AWS services and resources. Our solution targets environments where EKS worker nodes are configured to scale out and in according to the demand on your workloads. CloudWatch alarms are created and deleted dynamically in response to those scaling events. Our approach is based on best practices for Amazon EKS observability, an essential component for understanding and monitoring the health of your Amazon EKS environment.
About this blog post | |
Time to read | ~15 min |
Time to complete | ~1.5 hours |
Cost to complete | Using AWS services may incur costs. See AWS service documentation for details. |
Learning level | Intermediate (200) |
AWS services | Amazon CloudWatch, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Simple Notification Service (Amazon SNS), Amazon Simple Storage Service (Amazon S3), AWS CloudFormation, AWS Identity and Access Management (IAM), AWS Lambda |
Understanding the dynamic nature of EKS nodes on the Amazon EKS cluster
Unlike components such as system/application pods and DaemonSets, which remain static on the cluster until they’re updated, dynamic components change in response to a variety of factors such as workload demand, upgrades, and patching.
For example, EKS nodes are managed by an Amazon EC2 Auto Scaling group on the Amazon EKS cluster and fluctuate in size when workload demands rise and fall. EKS node instances belong to EKS node groups, and each cluster typically contains one or more of these groups. During a scale-out event, a new node joins the cluster; during a scale-in event, redundant nodes are deleted. Other dynamic components and events such as Amazon EKS upgrades and node patching can also change the number of EKS nodes in an Amazon EKS cluster.
Configuring Amazon CloudWatch alarms is crucial for monitoring EKS nodes, especially given their dynamic nature. Our solution directly addresses this challenge by ensuring that CloudWatch alarms are configured automatically, allowing for efficient monitoring of EKS nodes despite their fluctuating numbers.
Architectural overview
Our solution deploys the following infrastructure (see Figure 1):
- An Amazon EKS cluster and CloudWatch Observability EKS add-on deployed using CloudFormation templates.
- CloudWatch static alarms, configured with Terraform.
- Dynamic alarms configured for Amazon EKS workloads using AWS Lambda, Amazon SNS, Amazon EventBridge, and Amazon S3.
Static alarms flow
In this flow, Terraform creates the alarms that trigger email notifications when a threshold is breached.
- The CloudWatch (Kubernetes) DaemonSet collects the metrics for Amazon EKS clusters and sends the data to CloudWatch Container Insights. The DaemonSet ensures that the CloudWatch agent runs on and collects data from each worker node in the cluster. For a list of extracted metrics, see Amazon EKS and Kubernetes Container Insights.
- The alarm configurations for the CloudWatch Container Insights metrics are defined in a Terraform file named terraform.tfvars, located in the root path of our solution’s GitHub repository. This file is used as a reference for creating, deleting, or updating predefined alarms. You can also add more alarms to this file based on specific use cases. You will clone this repository in the walkthrough section of this article.
- Terraform creates the CloudWatch alarms specified in the terraform.tfvars file and configures Amazon SNS as an endpoint to trigger email notifications.
- When the defined threshold for an alarm is breached, an alert notification event is triggered and sent to the Amazon SNS service.
- The Amazon SNS service sends the alarm notifications via email to the designated subscribers, for example your operations team.
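To make the static flow more concrete, the following Python sketch builds the kind of parameter set that a single CloudWatch alarm on a Container Insights metric needs, whichever tool creates it. It is an illustration only and is not taken from the repository: the alarm name format, period, evaluation periods, threshold, and the SNS topic ARN are assumptions, and the actual definitions live in terraform.tfvars.

```python
from typing import Any


def build_static_alarm(
    cluster_name: str,
    metric_name: str,
    threshold: float,
    sns_topic_arn: str,
) -> dict[str, Any]:
    """Build keyword arguments for the CloudWatch PutMetricAlarm API.

    The name format, period, and evaluation settings are illustrative
    assumptions, not the values used in the repository's terraform.tfvars.
    """
    return {
        "AlarmName": f"{cluster_name}-{metric_name}-high",
        "Namespace": "ContainerInsights",  # namespace used by Container Insights
        "MetricName": metric_name,
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
        "Statistic": "Average",
        "Period": 300,  # seconds per datapoint (assumed)
        "EvaluationPeriods": 2,  # consecutive breaching periods (assumed)
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notify the SNS topic on ALARM
    }


# Placeholder cluster name, threshold, and topic ARN for illustration.
params = build_static_alarm(
    cluster_name="demo-cluster",
    metric_name="node_cpu_utilization",
    threshold=80.0,
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:eks-alarms",
)
print(params["AlarmName"])
```

With boto3 installed and credentials configured, a dictionary like this could be passed to `boto3.client("cloudwatch").put_metric_alarm(**params)`. In our solution, Terraform performs the equivalent step.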
Dynamic alarms flow
In this flow, alarms are generated in response to node-scaling activities that occur within the cluster.
- The node auto scaler continuously evaluates the scaling requirements of EKS worker nodes and submits the events to the Amazon EC2 Auto Scaling group.
- An Amazon EventBridge rule monitors scaling activities and captures EC2 Instance Launch Successful and EC2 Instance Terminate Successful events once they are received by the Amazon EC2 Auto Scaling group.
- When an event is matched according to the EventBridge rule, the event triggers a Lambda function that creates and deletes CloudWatch alarms.
- The Lambda function evaluates whether the event is a scale-out or scale-in event and either creates or deletes the CloudWatch alarms for the associated nodes.
- For scale-out events, the Lambda function creates CloudWatch alarms for the corresponding EKS worker nodes by retrieving the defined alarm attributes from a file named alarm_list_inputs.json, which is stored in an Amazon S3 bucket. For scale-in events, the Lambda function deletes the CloudWatch alarms associated with the terminated worker nodes.
- The CloudWatch alarm sends the creation/deletion status to Amazon SNS.
- Amazon SNS sends alarm notifications via email to the designated subscribers, for example the cloud administrative team.
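The dynamic flow can be sketched in Python as a handler that branches on the event's detail-type. This is a minimal sketch under stated assumptions, not the Lambda function from the repository: the alarm naming prefix, the template schema (standing in for alarm_list_inputs.json), and the in-memory client are hypothetical. Container Insights node metrics are also published with ClusterName and NodeName dimensions, which a real handler would need to supply; they are omitted here for brevity.

```python
from typing import Any

LAUNCH = "EC2 Instance Launch Successful"
TERMINATE = "EC2 Instance Terminate Successful"

# EventBridge event pattern that matches the two Auto Scaling events.
EVENT_PATTERN = {"source": ["aws.autoscaling"], "detail-type": [LAUNCH, TERMINATE]}


def alarm_prefix(instance_id: str) -> str:
    """Hypothetical naming convention shared by all alarms of one node."""
    return f"eks-node-{instance_id}-"


def handle_scaling_event(
    event: dict[str, Any], cloudwatch: Any, alarm_templates: list[dict[str, Any]]
) -> str:
    """Create alarms on scale-out and delete them on scale-in."""
    instance_id = event["detail"]["EC2InstanceId"]
    prefix = alarm_prefix(instance_id)
    if event["detail-type"] == LAUNCH:
        for template in alarm_templates:
            # Templates must not define AlarmName or Dimensions themselves.
            cloudwatch.put_metric_alarm(
                AlarmName=prefix + template["MetricName"],
                Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
                **template,
            )
        return "created"
    if event["detail-type"] == TERMINATE:
        found = cloudwatch.describe_alarms(AlarmNamePrefix=prefix)
        names = [alarm["AlarmName"] for alarm in found["MetricAlarms"]]
        if names:
            cloudwatch.delete_alarms(AlarmNames=names)
        return "deleted"
    return "ignored"


class InMemoryCloudWatch:
    """Stand-in for the boto3 client so the sketch runs without AWS access."""

    def __init__(self) -> None:
        self.alarms: dict[str, dict[str, Any]] = {}

    def put_metric_alarm(self, **kwargs: Any) -> None:
        self.alarms[kwargs["AlarmName"]] = kwargs

    def describe_alarms(self, AlarmNamePrefix: str) -> dict[str, Any]:
        matches = [a for n, a in self.alarms.items() if n.startswith(AlarmNamePrefix)]
        return {"MetricAlarms": matches}

    def delete_alarms(self, AlarmNames: list[str]) -> None:
        for name in AlarmNames:
            self.alarms.pop(name, None)


# Assumed template schema: PutMetricAlarm arguments minus name and dimensions.
templates = [{
    "Namespace": "ContainerInsights", "MetricName": "node_cpu_utilization",
    "Statistic": "Average", "Period": 300, "EvaluationPeriods": 2,
    "Threshold": 80.0, "ComparisonOperator": "GreaterThanThreshold",
}]
client = InMemoryCloudWatch()
launch = {"detail-type": LAUNCH, "detail": {"EC2InstanceId": "i-0123456789abcdef0"}}
terminate = {"detail-type": TERMINATE, "detail": {"EC2InstanceId": "i-0123456789abcdef0"}}
print(handle_scaling_event(launch, client, templates), sorted(client.alarms))
print(handle_scaling_event(terminate, client, templates), sorted(client.alarms))
```

Deriving every alarm name from the instance ID is one possible way to let the scale-in branch find a node's alarms with a single prefix query, without keeping separate state.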
Prerequisites
- An active AWS account.
- A machine running Amazon Linux or macOS. On macOS, Z shell (Zsh) is preferred.
- An AWS user/role with sufficient permissions to provision resources using Terraform.
- Terraform CLI v1.7.5.
- AWS Command Line Interface (AWS CLI) v2.11.1.
- kubectl v1.28.8-eks-ae9a62a (must be compatible with Amazon EKS v1.28).
Walkthrough
Step 1: Deploy the infrastructure and set up your environment
Deploy the Amazon EKS infrastructure and CloudWatch alarms using a combination of CloudFormation and Terraform.
- Run the following commands to clone the eks-automated-monitoring GitHub repository:
git clone https://github.com/aws-samples/eks-automated-monitoring.git
cd eks-automated-monitoring
- Open the ./scripts/deploy.sh file and update the variables that are passed to the CloudFormation templates when provisioning the infrastructure. The following variables are required:
- SNS_EMAIL: Email address for receiving alarms and notifications.
- TF_ROLE: The Amazon Resource Name (ARN) of the IAM role with permission to launch resources into your AWS account.
- (Optional) Locate the terraform.tfvars file at the root path of the repository, and either modify predefined alarms or add new ones.
- (Optional) Locate the alarm_list_inputs.json file in the files folder at the root path of the repository, and include required alarms for Amazon EKS node-level monitoring. For demonstration purposes, we have included two predefined alarms in this file.
- Run the following command to deploy the Amazon Virtual Private Cloud (Amazon VPC) and the Amazon EKS cluster in the us-east-1 Region. If you need to change the Region, update the parameter in the ./scripts/deploy.sh file.
./scripts/deploy.sh -o apply
Important: Make a note of the name of the Lambda function that’s printed in the output after running this command. You will use this Lambda function in a later step.
Amazon SNS sends a subscription confirmation message to the email address provided in the SNS_EMAIL parameter in the ./scripts/deploy.sh file. To confirm the subscription, open the email you received from Amazon SNS and choose Confirm subscription. A web page opens and displays a subscription confirmation with your subscription ID.
Now you’re ready to test and verify the CloudWatch alarm configuration.
Step 2 (Conditional): Configure dynamic alarms for existing EKS nodes
If worker nodes already exist in your environment before deploying our solution, run the following command to trigger the Lambda function.
aws lambda invoke --function-name <Lambda function name> --invocation-type RequestResponse output
For all existing EC2 instances, the Lambda function creates the CloudWatch alarms specified in the list of alarms in the terraform.tfvars file, which was uploaded to an Amazon S3 bucket in .json format during deployment.
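If you prefer to trigger the function from a script, the response can be checked as sketched below. This is a hedged example rather than part of the solution: the sample payload is invented, and the function's real return value is not documented in this article.

```python
import io
import json
from typing import Any


def summarize_invocation(response: dict[str, Any]) -> dict[str, Any]:
    """Reduce a Lambda Invoke response to status, error flag, and payload."""
    raw = response["Payload"].read()  # boto3 returns a streaming, file-like body
    text = raw.decode("utf-8") if raw else ""
    try:
        payload = json.loads(text) if text else None
    except json.JSONDecodeError:
        payload = text  # keep non-JSON output as plain text
    return {
        "status_code": response["StatusCode"],
        # FunctionError is present only when the function itself failed.
        "failed": "FunctionError" in response,
        "payload": payload,
    }


# Stand-in for a real response so the sketch runs without AWS credentials.
# The payload content is invented for illustration.
sample = {"StatusCode": 200, "Payload": io.BytesIO(b'{"alarms_created": 4}')}
summary = summarize_invocation(sample)
print(summary)
```

With boto3 installed, a real response can be obtained from `boto3.client("lambda").invoke(FunctionName="<Lambda function name>", InvocationType="RequestResponse")` and passed to `summarize_invocation`.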
Step 3: Verify CloudWatch alarms for static components
- Sign in to your AWS account.
- Open the CloudWatch console.
- Using the navigation bar on the left, open the Alarms page.
- Verify the predefined and new alarm configurations from the list of alarms in the terraform.tfvars file in your cloned GitHub repository.
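The same verification can be scripted. The following sketch compares a list of expected alarm names against DescribeAlarms output and reports any that are missing. The alarm names and pages shown are invented; substitute the names defined in your terraform.tfvars file.

```python
from typing import Any, Iterable


def find_missing_alarms(
    expected_names: Iterable[str],
    pages: Iterable[dict[str, Any]],
) -> list[str]:
    """Return expected alarm names that do not appear in DescribeAlarms output.

    `pages` is an iterable of DescribeAlarms response pages, each holding a
    "MetricAlarms" list, as produced by a boto3 paginator.
    """
    existing = {
        alarm["AlarmName"]
        for page in pages
        for alarm in page.get("MetricAlarms", [])
    }
    return sorted(set(expected_names) - existing)


# Invented alarm names and response pages, for illustration only.
expected = ["pod-cpu-high", "pod-memory-high", "node-count-low"]
pages = [
    {"MetricAlarms": [{"AlarmName": "pod-cpu-high"}]},
    {"MetricAlarms": [{"AlarmName": "node-count-low"}]},
]
print(find_missing_alarms(expected, pages))
```

Against a live account, the pages could come from `boto3.client("cloudwatch").get_paginator("describe_alarms").paginate()`.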
Step 4: Set up alarms for Auto Scaling events
Perform these steps using a role with adequate permissions.
- Simulate scaling events by increasing or decreasing the value of the NUM_WORKER_NODES variable in the deploy.sh file (scaling out or in).
- To apply the updates to the deploy.sh file, run this command:
./scripts/deploy.sh -o apply
- Sign in to your AWS account.
- Open the CloudWatch console.
- Using the navigation bar on the left, open the Alarms page to confirm the updates.
When instances in the Auto Scaling group are launched or terminated, the corresponding alarms are created or deleted, and email alerts are sent to the email addresses subscribed to the Amazon SNS topics configured for notifications.
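After a scaling test, one way to confirm that the alarms track the nodes is to compare the running instance IDs with the alarms' dimensions. The sketch below assumes that each dynamic alarm carries an InstanceId dimension; that is an assumption about the alarm definitions, and the instance IDs and alarm names are invented.

```python
from typing import Any, Iterable


def reconcile_node_alarms(
    instance_ids: Iterable[str],
    metric_alarms: Iterable[dict[str, Any]],
) -> tuple[list[str], list[str]]:
    """Compare running node instance IDs with alarms keyed by InstanceId.

    Returns (instances without any alarm, alarm names for unknown instances).
    Alarms that have no InstanceId dimension are ignored.
    """
    running = set(instance_ids)
    covered: set[str] = set()
    orphaned: list[str] = []
    for alarm in metric_alarms:
        dims = {d["Name"]: d["Value"] for d in alarm.get("Dimensions", [])}
        instance_id = dims.get("InstanceId")
        if instance_id is None:
            continue  # cluster-level alarm, not tied to a node
        if instance_id in running:
            covered.add(instance_id)
        else:
            orphaned.append(alarm["AlarmName"])
    return sorted(running - covered), sorted(orphaned)


# Invented instance IDs and alarms, for illustration only.
instances = ["i-aaa", "i-bbb"]
alarms = [
    {"AlarmName": "eks-node-i-aaa-cpu",
     "Dimensions": [{"Name": "InstanceId", "Value": "i-aaa"}]},
    {"AlarmName": "eks-node-i-old-cpu",
     "Dimensions": [{"Name": "InstanceId", "Value": "i-old"}]},
    {"AlarmName": "cluster-failed-nodes",
     "Dimensions": [{"Name": "ClusterName", "Value": "demo"}]},
]
uncovered, orphaned = reconcile_node_alarms(instances, alarms)
print(uncovered, orphaned)
```

An empty result on both sides would indicate that scale-out created alarms for every new node and scale-in removed the alarms of every terminated node.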
Cleaning up resources
Perform these steps to clean up your environment and avoid unexpected costs.
- To delete the provisioned infrastructure, run this command:
./scripts/deploy.sh -o destroy
- Delete the dynamic alarms that you created for the EKS worker nodes. For instructions, see Edit or delete a CloudWatch alarm.
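If many dynamic alarms remain, deleting them by script may be easier than using the console. The CloudWatch DeleteAlarms API accepts up to 100 alarm names per call, so the names need to be split into batches. The sketch below shows only the batching; the alarm names are invented.

```python
from typing import Iterator, Sequence

MAX_ALARMS_PER_DELETE = 100  # DeleteAlarms accepts up to 100 names per call


def delete_batches(
    names: Sequence[str], size: int = MAX_ALARMS_PER_DELETE
) -> Iterator[list[str]]:
    """Yield consecutive slices of `names`, each at most `size` long."""
    for start in range(0, len(names), size):
        yield list(names[start:start + size])


# Invented alarm names, for illustration only.
names = [f"eks-node-alarm-{i}" for i in range(250)]
batch_sizes = [len(batch) for batch in delete_batches(names)]
print(batch_sizes)
```

With boto3, each batch could then be passed to `boto3.client("cloudwatch").delete_alarms(AlarmNames=batch)`. Review the list of names before deleting, because the operation cannot be undone.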
Troubleshooting
If you experience alarm creation or deletion failures, or if you don’t receive notification emails, try the following troubleshooting steps:
- If you experience failures when creating or deleting CloudWatch alarms, open the Amazon CloudWatch log group of the Lambda function and check the message details. For information about accessing the logs using the Lambda console, see Viewing queries on the CloudWatch Logs console.
- Use Lambda monitoring functions in the Lambda console to access metrics and graphs such as Error count and success rate (%) and Invocations. For details, see Monitoring functions on the Lambda console.
- Update your Amazon EKS Amazon Machine Image (AMI) to the latest version by updating the image ID in the /scripts/eks-infra.yaml file for the EksAmiIds parameter. For details, see Retrieving Amazon EKS optimized Amazon Linux AMI IDs. Note: Our solution uses amazon-linux-2 as the AMI type.
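The log check in the first troubleshooting step can also be scripted. The sketch below builds the arguments for the CloudWatch Logs FilterLogEvents API so that only recent error messages are returned. It assumes that the function writes to Lambda's default log group (/aws/lambda/ followed by the function name) and that failures are logged with the term ERROR; the function name shown is a placeholder.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Optional


def build_error_log_query(
    function_name: str,
    minutes: int = 60,
    now: Optional[datetime] = None,
) -> dict[str, Any]:
    """Build FilterLogEvents arguments for recent error messages."""
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(minutes=minutes)
    return {
        "logGroupName": f"/aws/lambda/{function_name}",  # default Lambda log group
        "filterPattern": "ERROR",  # match log events containing the term ERROR
        "startTime": int(start.timestamp() * 1000),  # epoch milliseconds
    }


# Placeholder function name and a fixed clock so the result is reproducible.
fixed_now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
query = build_error_log_query("my-alarm-function", minutes=60, now=fixed_now)
print(query)
```

With boto3, the result could be passed to `boto3.client("logs").filter_log_events(**query)`, and the messages are then available under the `events` key of the response.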
Conclusion
Congratulations! You now have a working solution for monitoring your Amazon EKS cluster environment based on automation and best practices for Amazon EKS observability. As a next step, we encourage you to learn more about CloudWatch Container Insights metrics by visiting these resources:
If you have feedback about this blog post, use the Comments section below.