Managing Kubernetes control plane events in Amazon EKS
Amazon Elastic Kubernetes Service (Amazon EKS) helps customers move their container-based workloads to the AWS Cloud. Amazon EKS manages the Kubernetes control plane so customers don’t need to worry about scaling and maintaining Kubernetes components, such as etcd and application programming interface (API) servers. As a declarative and reconciling system, Kubernetes publishes various events to keep users informed of activities in the cluster, such as spinning up and tearing down pods, deployments, namespaces, and more. Amazon EKS keeps the Kubernetes upstream default event time to live (TTL) of 60 minutes, which can’t be changed. This default strikes a good balance between storing enough history for immediate troubleshooting and keeping the etcd database from filling up, which would risk degrading API server performance.
Events in any distributed management system carry a lot of useful information. In the world of container orchestration, and especially with Kubernetes, events are specific objects that show what is happening inside a pod, namespace, or container. There are use cases where it is useful to have a history of Kubernetes events beyond the default 60-minute window. In this post, we also cover how to filter Kubernetes events (for example, only pod events or only node events) before they are exported to Amazon CloudWatch, because not every type of Kubernetes resource needs a longer retention period. Readers can take this example and modify it based on their requirements.
As mentioned, the Amazon EKS event TTL is set to the upstream default of 60 minutes. Some customers have shown interest in increasing this value to keep a longer history for debugging purposes. However, Amazon EKS does not allow this setting to be changed, because events filling up the etcd database may cause stability and performance issues. Once an event is older than 60 minutes, it is purged from etcd. In this post, we provide a solution to capture events beyond 60 minutes, using Amazon CloudWatch as the example destination. We also cover an example use case for examining events stored in CloudWatch.
You need the following to complete the walkthrough:
- An AWS account
- AWS CLI v2
- Knowledge of Docker
- Basic Kubernetes knowledge (Pods, events, namespaces, and deployments)
Create an Amazon EKS cluster
Let’s start by setting a few environment variables:
Create a cluster using
Creating a cluster can take up to 10 minutes. When the cluster is ready, proceed to the next steps.
Manage control plane events
Once the Amazon EKS cluster is up and running, it’s time to manage the control plane events. This post provides various examples to show how to manage control plane events with Amazon EKS. We achieve this by creating a Kubernetes deployment whose pod tracks activities from the Amazon EKS control plane and persists the events in CloudWatch. All the source code is provided for users to try and modify based on their needs (such as changing the Python-based control plane events application, the types of events, and so on). The source code and its dependencies are containerized using a Dockerfile, and the resulting container image is pushed to a public repository (Amazon Elastic Container Registry (Amazon ECR) in this case). In this post, the container image from the public Amazon ECR repository is deployed to the Kubernetes cluster. As stated above, this solution can be customized to selected events of interest, if needed.
The solution architecture is shown in the following diagram:
Based on the previous diagram, the sequence of steps required to achieve the result is the following:
- Get the source code from GitHub.
- Set up Container Insights with Fluent Bit, so that the control plane events captured from the data plane are automatically pushed to CloudWatch. Fluent Bit is a lightweight and scalable log aggregator and processor that pushes the container logs to CloudWatch. CloudWatch Container Insights allows you to explore, analyze, and visualize your container logs.
- Deploy the control plane events application to the Kubernetes/Amazon EKS cluster with the container image from the public Amazon ECR repo, using the provided Helm chart. This deploys the necessary cluster role, cluster role binding, and deployment.
- Perform operations on the Amazon EKS cluster. The deployed control plane events pod starts to capture the events from the Amazon EKS control plane.
- The logs of the deployed control plane events pod are pushed to CloudWatch through Container Insights.
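The cluster role in the deployment step only needs read access to events. The following is a minimal sketch of what such a role and binding could look like, written as Python dictionaries and serialized to JSON (which kubectl apply -f accepts). The role name event-watcher-reader and the default namespace are assumptions for illustration; the actual manifests are defined by the Helm chart in the GitHub repo and may differ.

```python
import json

# Hypothetical values for illustration; the Helm chart in the repo defines its own.
ROLE_NAME = "event-watcher-reader"
NAMESPACE = "default"


def build_cluster_role(name: str) -> dict:
    """Return a ClusterRole that allows read-only access to events."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [
            {
                # "" is the core API group; events.k8s.io is the newer events API group.
                "apiGroups": ["", "events.k8s.io"],
                "resources": ["events"],
                "verbs": ["get", "list", "watch"],
            }
        ],
    }


def build_cluster_role_binding(name: str, namespace: str) -> dict:
    """Bind the ClusterRole to the default service account of `namespace`."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": name},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": name,
        },
        "subjects": [
            {"kind": "ServiceAccount", "name": "default", "namespace": namespace}
        ],
    }


def render_manifests() -> str:
    """Render both objects as a single JSON List document."""
    items = [
        build_cluster_role(ROLE_NAME),
        build_cluster_role_binding(ROLE_NAME, NAMESPACE),
    ]
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2)
```

Printing render_manifests() to a file gives a document that can be reviewed before applying it. Limiting the verbs to get, list, and watch keeps the application read-only.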
Let’s work on these step by step.
Step 1: Get the source code from GitHub.
All the source code is available in this GitHub repo.
As explained in the source code, we connect to the Kubernetes API server and watch for events. The events include pod, namespace, node, service, and persistent volume claim events. The Python script prints the events, and this output is eventually pushed to CloudWatch by Container Insights (which we discuss later). If you want to change the Python script to suit your needs, you can containerize it with the Dockerfile, as explained in the Readme file in the GitHub repo. You have the flexibility to customize this solution to selected events of interest.
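A minimal sketch of this watch-and-filter step is shown below. It is not the repo's event_watcher script: the names WATCHED_KINDS, should_export, to_log_line, and watch_events are hypothetical, and the sketch assumes the official kubernetes Python client is installed and that the pod's service account can read events. Narrowing WATCHED_KINDS is one way to export only the events of interest.

```python
import json

# Kinds named in this post; trim this set to export fewer event types.
WATCHED_KINDS = {"Pod", "Namespace", "Node", "Service", "PersistentVolumeClaim"}


def should_export(kind: str, watched=WATCHED_KINDS) -> bool:
    """Return True if events for this kind of object should be exported."""
    return kind in watched


def to_log_line(kind, namespace, name, reason, message) -> str:
    """Serialize one event as a single JSON line written to stdout."""
    return json.dumps(
        {
            "kind": kind,
            "namespace": namespace,
            "name": name,
            "reason": reason,
            "message": message,
        },
        sort_keys=True,
    )


def watch_events() -> None:
    """Stream events from the API server and print the ones we care about."""
    # Imported here so the helpers above work without the client installed.
    from kubernetes import client, config, watch

    config.load_incluster_config()  # use config.load_kube_config() outside a cluster
    v1 = client.CoreV1Api()
    for item in watch.Watch().stream(v1.list_event_for_all_namespaces):
        event = item["object"]
        obj = event.involved_object
        if should_export(obj.kind):
            line = to_log_line(
                obj.kind, obj.namespace, obj.name, event.reason, event.message
            )
            print(line, flush=True)
```

Calling watch_events() from a pod in the cluster starts the stream. Writing one JSON object per line keeps each event on a single line of the container log.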
How does this deployment capture the events under the hood?
As you can see from the event_watcher Python script, the control plane events application (running as a pod) loads the Kubernetes configuration and then checks for events every 60 minutes by default. You can change this in the Helm chart deployment in the GitHub repo. The script captures the control plane events and prints them, and Container Insights pushes them to CloudWatch, which serves as the persistence layer. The high-level flow of this approach is given in the following diagram:
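The hourly lookup can be sketched as a polling loop. This is an illustration under assumptions rather than the repo's implementation: poll_events, fetch_from_cluster, fetch, and emit are hypothetical names, and de-duplicating on the event's UID and resource version is an assumed design choice so that an event still present at the next poll is not logged twice.

```python
import time
from typing import Callable, Iterable, Optional

POLL_INTERVAL_SECONDS = 60 * 60  # one hour, matching the 60-minute event TTL


def poll_events(
    fetch: Callable[[], Iterable[dict]],
    emit: Callable[[dict], None],
    interval: float = POLL_INTERVAL_SECONDS,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch() every `interval` seconds and emit each event version once.

    Returns the number of events emitted. max_polls=None runs forever.
    """
    seen = set()
    emitted = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        for event in fetch():
            # The same event can appear in consecutive polls; the UID plus
            # resource version identifies one specific version of it.
            key = (event["uid"], event["resource_version"])
            if key not in seen:
                seen.add(key)
                emit(event)
                emitted += 1
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval)
    return emitted


def fetch_from_cluster() -> list:
    """List current events with the kubernetes client (assumed to be installed)."""
    from kubernetes import client, config

    config.load_incluster_config()
    events = client.CoreV1Api().list_event_for_all_namespaces().items
    return [
        {
            "uid": e.metadata.uid,
            "resource_version": e.metadata.resource_version,
            "reason": e.reason,
            "message": e.message,
        }
        for e in events
    ]
```

Inside the cluster, poll_events(fetch_from_cluster, print) would run indefinitely. The seen set grows over time in this sketch; a long-running version would need to evict old keys.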
Step 2: Create container insights.
CloudWatch Container Insights is used here to collect, aggregate, and summarize metrics and logs from containerized applications. Container Insights, which is available for Amazon EKS, collects performance data at every layer of the performance stack. Follow this link to enable Container Insights with Fluent Bit.
Step 3: Deploy the container image to the Kubernetes cluster and validate the deployment.
Let’s deploy the control plane events application by using the following Helm command. As stated in the previous section, this creates the necessary cluster role, cluster role binding, and deployment to ensure the events are accessible to the default service account. It also pulls the container image from the public Amazon ECR repo and deploys it to the Kubernetes/Amazon EKS cluster:
event_watcher should now be running and collecting control plane events every hour. Let’s validate the deployment:
The typical output is similar to the following:
Step 4: Perform operations on the cluster.
Let’s perform some operations on the cluster to see whether the resulting events are persisted (we verify this in Step 5 through CloudWatch).
Let’s create some
Let’s expose one of the deployed pods as a service:
Let’s delete one of the pods as well:
Let’s try to deploy a mega-pod that cannot be scheduled due to resource constraints. Try the following command to see the CPU counts on the cluster nodes:
Here we use three t3.small instances with two vCPUs each, and the output is shown in the following code. Note that the node names in your output might be different:
Now try to deploy a resource-heavy pod that requests five vCPUs, as provided in the following code. This pod stays in a perpetual pending state due to insufficient resources. We can check why it failed in the next step:
As a final step, let’s try some taints and tolerations. First, let’s taint all the nodes with key1=value1. Copy the output of the following command and run it from your terminal.
Let’s deploy an intolerable pod that doesn’t tolerate the above taints. This pod stays in a perpetual pending state:
The typical status of all the pods is similar to the following:
Step 5: Verify control plane events with container insights through CloudWatch.
Now it’s time to verify that the events from the operations in Step 4 (and others from the cluster) persisted. Under the hood, as previously discussed, we achieve this by using Fluent Bit with Container Insights to send the logs to Amazon CloudWatch.
Log in to the AWS console and go to Amazon CloudWatch. Select Log groups. There should be three log groups for this cluster (application, dataplane, and host, under /aws/containerinsights/<cluster-name>), as shown in the following diagram.
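For readers who prefer to query these log groups from code rather than the console, here is a hedged sketch that uses the boto3 CloudWatch Logs filter_log_events paginator. The function names log_group_names and find_events are hypothetical, and the sketch assumes boto3 is installed and that AWS credentials and a Region are configured.

```python
LOG_GROUP_SUFFIXES = ("application", "dataplane", "host")


def log_group_names(cluster_name: str) -> list:
    """Return the three Container Insights log group names for a cluster."""
    return [
        f"/aws/containerinsights/{cluster_name}/{suffix}"
        for suffix in LOG_GROUP_SUFFIXES
    ]


def find_events(logs_client, cluster_name: str, term: str, start_ms: int = 0) -> list:
    """Return messages in the application log group that contain `term`.

    `logs_client` is a CloudWatch Logs client, for example boto3.client("logs").
    `start_ms` is the earliest timestamp to search, in epoch milliseconds.
    """
    paginator = logs_client.get_paginator("filter_log_events")
    pages = paginator.paginate(
        logGroupName=f"/aws/containerinsights/{cluster_name}/application",
        filterPattern=f'"{term}"',  # a quoted term matches that exact string
        startTime=start_ms,
    )
    return [event["message"] for page in pages for event in page.get("events", [])]
```

For example, find_events(boto3.client("logs"), "<cluster-name>", "FailedScheduling") would return the matching log messages. Passing the client in as a parameter keeps the function easy to exercise with a stub.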
Check the container logs under the application log group to verify that the events persisted. Find the cpe-xx link to see the control plane events application in action. In the following diagram, we see that the nginx pod creation and deletion events are captured.
Also, we see some node events as shown in the following diagram. In a typical real-world scenario, this could be a scale-down event where the node is removed, as provided in the example.
As we discussed in the previous step, the mega-pod could not deploy due to resource constraints and the intolerable-pod could not deploy due to the taints. You can search for FailedScheduling in the text box to see these in action as shown in the following diagram.
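Both scheduling failures follow from simple checks. The sketch below is a simplified model for illustration, not the scheduler's actual logic: parse_cpu, fits, and tolerates are hypothetical helpers, the model treats a node's full vCPU count as available (real nodes reserve some CPU for system components), it only covers tolerations that match by key and value, and the NoSchedule taint effect is an assumption because the post only names key1=value1.

```python
def parse_cpu(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ("2", "0.5", "500m") to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def fits(request: str, node_cpus: list) -> bool:
    """A pod must fit on a single node; CPU is not pooled across nodes."""
    need = parse_cpu(request)
    return any(parse_cpu(cpu) >= need for cpu in node_cpus)


def tolerates(taint: dict, tolerations: list) -> bool:
    """True if any toleration matches the taint's key, value, and effect.

    A toleration without an effect is treated as matching every effect.
    """
    return any(
        t.get("key") == taint["key"]
        and t.get("value") == taint["value"]
        and t.get("effect", taint["effect"]) == taint["effect"]
        for t in tolerations
    )


nodes = ["2", "2", "2"]  # three t3.small nodes with two vCPUs each
taint = {"key": "key1", "value": "value1", "effect": "NoSchedule"}  # effect assumed

print(fits("5", nodes))      # mega-pod: False, no single node has five vCPUs
print(tolerates(taint, []))  # pod without a matching toleration: False
```

Even though the three nodes have six vCPUs in total, the five-vCPU request fails because no single node can hold it.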
We can see the service event as well. This is the service that was exposed in Step 4.
Also, some namespace and persistent volume claim events are captured here. For example, if you install Prometheus, you can see events like the ones in the following example.
To test how the control plane events application’s event persistence performs when hundreds of pods are running, we deploy 100 nginx pods in addition to the existing workloads. As you can see from the following Grafana snapshot, there are more than 160 pods running in this example Amazon EKS cluster.
In the pod log of our control plane events application, we see the last nginx pod event, as shown in the following code:
We get the same information from CloudWatch Container Insights as shown in the following diagram.
Since the payload is very small, this event capture mechanism works even with hundreds of pods running in the system.
To avoid incurring future charges, delete the resources:
In this post, we showed you how to capture Amazon EKS control plane events without being affected by the event TTL limitation of 60 minutes. We accomplished this by creating a custom deployment that queries the API server for events every hour. As a result, the load on the API server is minimal and the events are persisted in CloudWatch.