Containers
Monitoring Windows pods with Prometheus and Grafana
This post was co-authored by Cezar Guimarães, Sr. Software Engineer, VTEX
Introduction
Customers across the globe are increasingly adopting Amazon Elastic Kubernetes Service (Amazon EKS) to run their Windows workloads. Many customers have found that refactoring existing Windows-based applications into an open-source environment, while ideal, is a very complex task. It needs investments that usually don’t immediately translate into cost savings, which makes refactoring hard to justify within a yearly IT budget. However, re-platforming existing, critical Windows-based applications into Windows containers makes sense from both a cost-saving and a modernization perspective.
Tools such as App2Container (A2C) have made application re-platforming easy. However, for successful day two operations, customers should consider certain infrastructure transformations, such as logging, monitoring, and tracing. As part of achieving full Windows containers observability on AWS, in 2022 we published a Containers post on how customers can use an AWS-managed Windows fluent-bit container image to centralize Windows pod logs in different destinations.
Prometheus and Grafana form one of the most popular monitoring stacks for Kubernetes-based workloads. Therefore, today we are publishing a post focusing on how customers can centralize Windows pod metrics using Amazon Managed Service for Prometheus and Amazon Managed Grafana.
Solution overview
This post walks you through how to set up Windows Exporter (a Prometheus exporter for Windows) as a Kubernetes daemonset, and how to use PromQL (Prometheus Query Language) queries to enrich windows-exporter container metrics by merging them with kube-state-metrics (KSM). This lets you extend existing Linux-based Kubernetes monitoring to support Windows-based workloads.
Image 1. Solution workflow
- Amazon Managed Service for Prometheus scrapes Windows node/container metrics, such as CPU, Memory, Disk, and Network usage from the Windows Exporter HostProcess DaemonSet.
- Amazon Managed Service for Prometheus scrapes KSM to map pod and container names to their container ID.
- Amazon Managed Grafana provides the ability to create monitoring dashboards from the collected metrics using Amazon Managed Service for Prometheus as the data source.
Prerequisites
The following prerequisites are required to continue with this post:
- An Amazon EKS cluster with Windows nodes up and running. See this step-by-step
- Amazon Managed Service for Prometheus with Amazon EKS ingestion properly set up. See this step-by-step
- Amazon Managed Grafana fully integrated with Amazon Managed Service for Prometheus. See this step-by-step
This post’s prerequisites use AWS-managed services such as Amazon Managed Service for Prometheus with managed-collector and Amazon Managed Grafana. However, this post also applies to self-managed Prometheus, Grafana, and ADOT/Prom-server agents.
Walkthrough
The following steps implement the solution described previously.
1. Install KSM
We now install KSM, a simple service that listens to the Kubernetes API server and generates metrics about the state of its objects. We must collect KSM metrics to map pod and container names to their container ID.
1.1 Enter the following command to install KSM:
2. Create a Windows Exporter daemonset
First, looking closely at the daemonset configuration, we set hostProcess: true in the securityContext. This means the container process has access to the host network namespace, storage, and devices, allowing us to fetch metrics for all the containers running on the host by reading built-in Windows metrics.
The second part is the initContainer, where we set up the host firewall to allow incoming traffic on TCP/9182 so that Amazon Managed Service for Prometheus can scrape the host. In the third part, we create a ConfigMap to inject windows-exporter configurations and mount it to the windows-exporter pod.
2.1 Create a file containing the following code and save it as windows-exporter.yaml :
If you have any taints in the Windows nodes, then make sure you add the tolerations in the Daemonset configuration.
This solution uses a public, open-source Prometheus container image. It is your responsibility to perform security due diligence.
2.2 Create the Kubernetes Namespace, Daemonset and ConfigMap. Enter the following command:
2.3 Check if the Daemonset pods are running. Enter the following command:
2.4 Once the pods are in the running status, you can check if they are accepting connections on port 9182. Enter the following command:
2.5 You should see the windows-exporter pod listening on port 9182, which is the port scraped by Amazon Managed Service for Prometheus.
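As a complementary check, the reachability of port 9182 can also be probed from any machine that has network access to the Windows node. The following is a minimal sketch, not the command used in this post: the helper name is_port_open and the example address in the comment are hypothetical, and it only confirms that a TCP connection can be opened, not that valid metrics are served.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Hypothetical usage (10.0.0.15 stands in for a Windows node IP address):
#   is_port_open("10.0.0.15", 9182)
```

A False result usually points at the host firewall rule from the initContainer, a security group, or a pod that is not yet running.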
3. Visualizing Windows pods metrics in Amazon Managed Grafana
Assuming you are already familiar with Grafana, you can create panels that are relevant for your day two operations. The following table lists PromQL queries that merge the Windows container metrics scraped by Prometheus with KSM data, mapping each container to its pod. The rate-based queries calculate their values over a two-minute window.
Make sure you are selecting the right data source when creating panels. In this post, we are using Amazon Managed Service for Prometheus as a data source.
Metric | Query | Unit |
CPU | kube_pod_container_info{} * on(container_id) group_left avg by (container_id) (rate(windows_container_cpu_usage_seconds_total{}[2m])) * 1000 | custom: milliCPU |
Memory | kube_pod_container_info{} * on(container_id) group_left avg by (container_id) (windows_container_memory_usage_private_working_set_bytes{}) | bytes |
Network (sent) | kube_pod_container_info{} * on(container_id) group_left avg by (container_id) (rate(windows_container_network_transmit_bytes_total{}[2m])) | bytes/sec |
Network (received) | kube_pod_container_info{} * on(container_id) group_left avg by (container_id) (rate(windows_container_network_receive_bytes_total{}[2m])) | bytes/sec |
Disk (written) | kube_pod_container_info{} * on(container_id) group_left avg by (container_id) (rate(windows_container_storage_write_size_bytes_total{}[2m])) | bytes/sec |
Disk (read) | kube_pod_container_info{} * on(container_id) group_left avg by (container_id) (rate(windows_container_storage_read_size_bytes_total{}[2m])) | bytes/sec |
Check the Windows Exporter GitHub repository for a complete list of Windows containers metrics exported.
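To make the join in the preceding queries easier to follow, here is a minimal Python sketch that imitates what `kube_pod_container_info * on(container_id) group_left avg by (container_id) (...)` does. It is an illustration under simplifying assumptions, not how Prometheus is implemented: the function name, the sample labels, and the container IDs are all hypothetical.

```python
from collections import defaultdict


def join_on_container_id(info_series, usage_series):
    """Imitate: info * on(container_id) group_left avg by (container_id) (usage).

    info_series:  list of label dicts from kube_pod_container_info (value is 1).
    usage_series: list of (labels, value) pairs from a windows_container_* metric.
    """
    # avg by (container_id): collapse usage samples to one value per container ID.
    grouped = defaultdict(list)
    for labels, value in usage_series:
        grouped[labels["container_id"]].append(value)
    averaged = {cid: sum(vals) / len(vals) for cid, vals in grouped.items()}

    # on(container_id) group_left: match on container_id only and keep the
    # labels of the left-hand (KSM) side, such as pod and container names.
    result = []
    for labels in info_series:
        cid = labels["container_id"]
        if cid in averaged:  # series without a match on both sides are dropped
            result.append((dict(labels), 1 * averaged[cid]))
    return result


# Hypothetical sample data.
info = [
    {"pod": "web-1", "container": "iis", "container_id": "containerd://aaa"},
    {"pod": "web-2", "container": "iis", "container_id": "containerd://bbb"},
]
usage = [
    ({"container_id": "containerd://aaa", "instance": "node-a"}, 0.25),
    ({"container_id": "containerd://aaa", "instance": "node-a"}, 0.75),
    ({"container_id": "containerd://ccc", "instance": "node-b"}, 1.0),
]
joined = join_on_container_id(info, usage)
```

Because kube_pod_container_info always has the value 1, the multiplication leaves the usage value unchanged and only contributes the pod and container labels.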
For example, the following query panel charts total CPU usage in milliCPU at the pod level. To do so, you need to create a custom legend with the value pod. Furthermore, it is essential to set the Units in the panel to the ones in the preceding table.
Image 3. Grafana query panel
The milliCPU query generates the following panel:
Image 4. Windows Pods – milliCPU
The CPU query measures CPU seconds consumed per second, multiplied by 1000 to match Kubernetes milliCPUs. This allows you to quickly and easily identify whether a pod needs its CPU requests and limits right-sized. A CPU second refers to one second on a CPU. This is the amount of time in seconds your CPU spends actively running a process, as opposed to the elapsed time.
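The arithmetic behind the milliCPU value can be sketched with two samples of a CPU-seconds counter. This is a simplified average rate under the assumption that the counter did not reset between samples. The PromQL rate() function additionally handles counter resets and extrapolation, and the function name and sample numbers here are hypothetical.

```python
def millicpu(cpu_seconds_start: float, cpu_seconds_end: float,
             elapsed_seconds: float) -> float:
    """Average CPU usage in milliCPU between two counter samples."""
    # CPU seconds consumed per second of wall-clock time (1.0 == one full core).
    cores_used = (cpu_seconds_end - cpu_seconds_start) / elapsed_seconds
    # Kubernetes expresses CPU in thousandths of a core.
    return cores_used * 1000


# A container whose counter grew from 120 to 150 CPU seconds over a
# two-minute window used a quarter of a core on average.
usage = millicpu(120.0, 150.0, 120.0)
```

In this example the result is 250 milliCPU, which can be compared directly against a pod CPU request such as 250m.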
4. Visualizing Windows nodes metrics in Amazon Managed Grafana
Visualizing Windows node metrics is as crucial as visualizing Windows pod metrics. The following table lists PromQL queries that retrieve the data scraped by Prometheus for each Windows node. The rate-based queries calculate their values over a two-minute window.
Metric | Query | Unit |
CPU | sum by (instance) (rate(windows_cpu_time_total{mode!="idle"}[2m])) / count by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m])) | Percent (0.0-1.0) |
Memory | (1 - windows_os_physical_memory_free_bytes{} / windows_cs_physical_memory_bytes{}) | Percent (0.0-1.0) |
Network (sent) | rate(windows_net_bytes_sent_total{}[2m]) | bytes/sec |
Network (received) | rate(windows_net_bytes_received_total{}[2m]) | bytes/sec |
Disk (written) | sum by (instance) (rate(windows_physical_disk_write_bytes_total{}[2m])) | bytes/sec |
Disk (read) | sum by (instance) (rate(windows_physical_disk_read_bytes_total{}[2m])) | bytes/sec |
Check the Windows Exporter GitHub repository for a complete list of Windows nodes metrics exported.
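The node-level CPU and Memory queries both produce a fraction between 0 and 1. The following sketch reproduces that arithmetic in Python for a single node, assuming the per-core, per-mode rates have already been computed. The function names and sample values are hypothetical and are not part of Windows Exporter.

```python
def node_cpu_utilization(rates: dict) -> float:
    """rates maps (core, mode) to the per-second rate of windows_cpu_time_total."""
    # Numerator: time spent in every mode except idle, summed over all cores.
    busy = sum(rate for (_core, mode), rate in rates.items() if mode != "idle")
    # Denominator: each core reports exactly one idle series, so counting
    # the idle series yields the number of cores.
    cores = sum(1 for (_core, mode) in rates if mode == "idle")
    return busy / cores


def memory_used_fraction(free_bytes: float, total_bytes: float) -> float:
    """Fraction of physical memory in use: 1 - free / total."""
    return 1 - free_bytes / total_bytes


# Hypothetical two-core node: core 0 is half busy, core 1 is fully idle.
sample_rates = {
    ("0", "idle"): 0.5, ("0", "user"): 0.25, ("0", "privileged"): 0.25,
    ("1", "idle"): 1.0, ("1", "user"): 0.0, ("1", "privileged"): 0.0,
}
cpu = node_cpu_utilization(sample_rates)
mem = memory_used_fraction(4 * 1024**3, 16 * 1024**3)
```

Here the node is 25% busy on CPU and uses 75% of its memory, which is why both panels read best with a percent unit on a 0.0 to 1.0 scale.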
For example, the following query panel charts the total CPU usage percentage at the node level. To do so, you must create a custom legend with the value node. Furthermore, it is essential to set the Units in the panel to the ones in the preceding table.
Image 5. Grafana query panel
The Memory query generates the following panel:
Image 6. Windows nodes memory percent usage panel
Conclusion
This post covered how to successfully deploy Windows Exporter as a daemonset using the hostProcess container mode. Then, we covered which Windows Exporter and KSM metrics should be used to build a proper Grafana monitoring dashboard. You can also use these metrics to add panels to an existing Grafana dashboard, such as when an Amazon EKS cluster with a mixed data plane is deployed.
In addition, see the best practices for running Windows containers on Amazon EKS in the Amazon EKS Best Practice guide.