Monitoring Amazon EKS on AWS Fargate using Prometheus and Grafana
At AWS, we are continually looking to improve customer experience by reducing complexity. Our customers want to spend more time solving business problems and less time maintaining infrastructure. Two years ago, we launched Amazon EKS to make it easy for you to operate Kubernetes clusters. And last year, at re:Invent 2019, we announced support for EKS on Fargate. This feature allows you to run Kubernetes pods without creating and managing EC2 instances.
Customers often ask, “Can I monitor my pods running on Fargate using Prometheus?”
Yes, you can use Prometheus to monitor pods running on Fargate. We’ll take a more in-depth look at how pods run on Fargate and where the relevant Prometheus metrics originate.
Running Kubernetes pods on Fargate
AWS Fargate is a serverless compute engine for containers that works with both Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS). When you run your Kubernetes workload on Fargate, you don't need to provision and manage servers: Fargate provides the right amount of compute to run your containers, so you avoid having to size EC2 instances for your workload. Fargate allows you to specify, and pay for only, the resources your application needs. You declare how much vCPU and memory your application pod needs, and Fargate runs it. To right-size your pods, you can use tools like right-size-guide and the Goldilocks vertical-pod-autoscaler. Scaling pods horizontally is also easier with Fargate: as the horizontal pod autoscaler creates new replicas, Fargate creates nodes for the new pods.
EKS allows you to choose where you obtain compute capacity from (EC2 or Fargate) on a per-pod basis. You can either run all the pods in a namespace on Fargate or specify a label for pods you want to run on Fargate. You can have a cluster where some pods run on EC2 while others run on Fargate.
You can declare which pods run on Fargate using the Fargate profile. Along with specifying which Kubernetes namespaces and labels should require Fargate capacity, you can also define from which subnets the pods will get their IP address.
As an additional benefit, pods on Fargate get their own VM-isolated environment, which means your pods do not share kernel, CPU, memory, or network interface with any other pods. Fargate runs Kubernetes processes like `kubelet`, `kube-proxy`, and `containerd` along with the pod. If you have five pods running on Fargate and you run `kubectl get nodes`, you will see five worker nodes, one for each pod. If your EKS cluster also runs `kube-system` pods on Fargate, then you will see nodes for CoreDNS as well.
When pods are scheduled on Fargate, the vCPU and memory reservations within the pod specification determine how much vCPU and memory to provision for the pod.
- The maximum `requests` out of any init containers is used to determine the init vCPU and memory requirements.
- The `requests` for all long-running containers are added up to determine the long-running vCPU and memory requirements.
- The larger of the above two values is chosen for the vCPU and memory `requests` to use for your pod.
- Fargate adds 256 MB to each pod's memory reservation for the required Kubernetes components (`kubelet`, `kube-proxy`, and `containerd`).
- Fargate rounds up to the compute configuration that most closely matches the sum of vCPU and memory `requests` in order to ensure pods always have the resources that they need to run.
- If you do not specify a vCPU and memory combination, then the smallest available combination is used (0.25 vCPU and 0.5 GB memory).
Consider declaring vCPU and memory requests irrespective of whether you use Fargate or EC2. It will enable Kubernetes to ensure that at least the requested resources for each pod are available on the compute resource.
Monitoring CPU and memory usage percentage
Defining vCPU and memory `requests` for pods running on Fargate will also help you correctly monitor the CPU and memory usage percentage in Fargate. The formula used to calculate CPU and memory usage percent varies by Grafana dashboard. For example, some Grafana dashboards calculate a pod's memory usage percent like this:
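A representative PromQL expression of this form, dividing a pod's working set by the node's total memory (the exact snippet from the original post isn't preserved here; metric and label names are illustrative):

```
sum(container_memory_working_set_bytes{pod="my-pod"}) by (pod)
  / sum(machine_memory_bytes) * 100
```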
This formula will yield incorrect memory usage in Fargate since, as explained above, a pod's resource usage is limited to the sum of vCPU and memory `requests` declared in its containers. In Fargate, a pod's resource usage should be calculated not against the Fargate node's CPU and memory but against the containers' declared `requests`, like this:
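A representative PromQL expression of this corrected form, dividing usage by the containers' declared requests (again, metric and label names are illustrative; the original snippet isn't preserved):

```
sum(container_memory_working_set_bytes{pod="my-pod"}) by (pod)
  / sum(kube_pod_container_resource_requests_memory_bytes{pod="my-pod"}) by (pod)
  * 100
```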
This formula will help you monitor a pod's compute resources and identify when your containers' resource `requests` should be adjusted. The values you declare in the resource `requests` tell Fargate how much CPU and memory should be allocated to the pod. If you notice your pod's memory and CPU usage is constantly nearing the values declared in its resource `requests`, then it may be time to review the requested resources.
Given the way Fargate allocates resources, a pod will get the total memory requested in its containers (the largest individual init container or the sum of all the long-running containers, whichever is more) + 256 MB, rounded up to the next Fargate configuration (see the table below). For example, if you request 3.5 GB of memory, Fargate will allocate 4 GB: 3.5 GB + 256 MB, rounded up. If you don't declare any value for memory, then Fargate will allocate 0.5 GB. The containers in your pod will be able to use all the available memory unless you specify a memory `limit` in your containers.
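The allocation rule above can be sketched in a few lines. This is a simplified illustration, not AWS's implementation: the real list of supported Fargate configurations depends on the vCPU setting, and only a few sample memory sizes are listed here.

```python
# Sketch of Fargate's memory allocation rule (simplified; sample sizes only).
FARGATE_MEMORY_GB = [0.5, 1, 2, 3, 4, 5, 6, 7, 8]  # illustrative configurations

def fargate_allocated_memory_gb(init_requests_gb, container_requests_gb):
    """Return the memory Fargate would allocate for a pod, in GB."""
    # Larger of: the biggest init container request, or the sum of
    # requests across all long-running containers.
    requested = max(max(init_requests_gb, default=0), sum(container_requests_gb))
    if requested == 0:
        return 0.5  # smallest available combination
    requested += 0.25  # add 256 MB for kubelet, kube-proxy, and containerd
    # Round up to the closest matching configuration.
    return next(size for size in FARGATE_MEMORY_GB if size >= requested)

# 3.5 GB requested -> 3.5 + 0.25 = 3.75 -> rounds up to 4 GB
print(fargate_allocated_memory_gb([], [3.5]))  # → 4
```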
You can determine the size of the Fargate node that runs a pod by checking the pod’s annotations:
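For example (pod name is a placeholder; `CapacityProvisioned` is the annotation EKS adds to Fargate pods at the time of writing):

```shell
# Show the Fargate configuration selected for a pod
kubectl describe pod <pod-name> | grep CapacityProvisioned
```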
Having understood what to measure, let’s now explore how it can be measured.
Prometheus architecture
Prometheus is a time-series-based, open-source systems monitoring tool originally built at SoundCloud. Prometheus joined the Cloud Native Computing Foundation in 2016 as its second hosted project, after Kubernetes.
Prometheus collects metrics via a pull model over HTTP. In Kubernetes, Prometheus can automatically discover targets using the Kubernetes API; targets can be pods, daemon sets, nodes, etc. A typical Prometheus installation in Kubernetes includes these components:
- Prometheus server
- Node exporter
- Push gateway
- Alert manager
- kube-state-metrics
In Kubernetes, the Prometheus server runs as a pod that is responsible for scraping metrics from metrics endpoints.
Node exporter runs as a daemon set and is responsible for collecting metrics of the host it runs on. Most of these metrics are low-level operating system metrics like vCPU, memory, network, disk (of the host machine, not containers), and hardware statistics. These metrics are inaccessible to Fargate customers since AWS is responsible for the health of the host machine.
To measure a pod’s performance on Fargate, we need metrics like vCPU, memory usage, and network transfers. Prometheus collects these metrics from two sources: cAdvisor and kube-state-metrics.
cAdvisor
cAdvisor (short for container advisor) analyzes and exposes resource usage and performance data from running containers on a node. In Kubernetes, cAdvisor runs as part of the kubelet binary. You can use `kubectl` to view the metrics generated by cAdvisor:
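One way to do this is to proxy the request through the API server (the node name is a placeholder; list your nodes first):

```shell
# List node names, then pull that node's cAdvisor endpoint
kubectl get nodes
kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics/cadvisor
```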
cAdvisor provides node and pod usage statistics that are useful in understanding how a pod is using its resources. Here are the metrics it exposes:
cAdvisor also exposes the total CPU and memory of the node. For example, I scheduled a pod on Fargate and requested 200m vCPU.
As reflected in the metrics, the Fargate node that runs my pod has 2 vCPUs and 4 GiB RAM. This can be a bit confusing: even though the node has 2 vCPUs and 4 GiB RAM, my pod is limited to 200m vCPU (or 0.25 vCPU and 0.5 GiB RAM if no requests are configured). I am billed for the resources that the pod uses, not for the rest of the unused capacity on the Fargate node.
Once Fargate uses Firecracker microVMs, the compute resources of the microVM will closely match the requirements of the pod running on Fargate. Until then, you should expect to see unused capacity on your Fargate nodes even though you are not responsible for its cost.
Most Grafana dashboards intended for pod monitoring use the following metrics generated by cAdvisor:
- container_cpu_usage_seconds_total
- container_memory_usage_bytes
- container_network_*_bytes_total
While some Grafana dashboards for monitoring pod usage are based on cAdvisor metrics only, others combine metrics from other sources like kube-state-metrics.
Not all cAdvisor metrics are populated on Fargate-backed nodes. For example, the `container_cpu_load_average_10s` metric is not populated because it requires the `--enable_load_reader` flag, which is not enabled on Fargate. Here are the metrics that are unavailable in Fargate:
- container_cpu_load_average_10s
- container_fs_io_current
- container_fs_io_time_seconds_total
- container_fs_io_time_weighted_seconds_total
- container_memory_failcnt
- container_memory_swap (because there’s no swap in Fargate)
- container_spec_memory_reservation_limit_bytes
- container_spec_memory_swap_limit_bytes
- container_start_time_seconds
kube-state-metrics
kube-state-metrics is an open source project that is responsible for listening to the Kubernetes API server and generating metrics. It creates a Kubernetes service and exposes metrics in Prometheus text format. It collects metrics for the following resources:
- nodes
- pods
- replicationcontrollers
- services
- endpoints
- namespaces
- limitranges
- resourcequotas
- persistentvolumes
- persistentvolumeclaims
- configmaps
- secrets
It creates a service that listens on port 8080, and you can use kubectl to see all the metrics it exposes.
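For example (the service name and namespace depend on how kube-state-metrics was installed; `kube-system` is an assumption here):

```shell
# Forward the kube-state-metrics service locally, then scrape it
kubectl port-forward -n kube-system svc/kube-state-metrics 8080:8080 &
curl http://localhost:8080/metrics
```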
Pod monitoring Grafana dashboards generally use kube-state-metrics to determine a pod's compute resource requests and limits. Here are some relevant metrics:
- kube_pod_container_resource_requests — The amount of a resource requested by a container.
- kube_pod_container_resource_requests_cpu_cores — The number of CPU cores requested by a container.
- kube_pod_container_resource_limits_cpu_cores — The limit on CPU cores to be used by a container. CPU limits are ignored in Fargate.
- kube_pod_container_resource_requests_memory_bytes — The number of memory bytes requested by a container.
- kube_pod_container_resource_limits_memory_bytes — The limit on memory bytes to be used by a container. Memory limits are ignored in Fargate.
Prometheus gives us the complete picture by combining data collected from cAdvisor and kube-state-metrics. Let’s review some helpful Grafana dashboards for monitoring pods running on Fargate.
Tutorial
The tutorial walks you through setting up Prometheus and Grafana. If you already use Prometheus and Grafana, you can skip the tutorial.
We will create an EKS cluster and install Prometheus and Grafana. Prometheus includes a local time-series database to store metrics chronologically. It's a best practice to use persistent volumes with Prometheus so its data is not lost during container restarts. In EKS, you can create persistent volumes that are backed by Amazon EBS, Amazon EFS, or Amazon FSx for Lustre. Prometheus currently doesn't support NFS file systems, so we cannot use EFS-backed persistent volumes and must rely on EBS volumes to provide persistence for Prometheus.
Fargate supports stateful workloads through its integration with EFS. Since Prometheus requires EBS volumes, our Kubernetes cluster will need an EC2 node group to run Prometheus with an EBS persistent volume. In the cluster we create during this tutorial, all pods in the `prometheus` and `kube-system` namespaces will run on EC2. Pods in the `default` namespace will run on Fargate.
Before we can schedule pods on Fargate, we must define a Fargate profile which specifies what pods should use Fargate as they are launched.
The Fargate profile allows an administrator to declare which pods run on Fargate. This declaration is done through the profile’s selectors. Each profile can have up to five selectors that contain a namespace and optional labels. You must define a namespace for every selector. The label field consists of multiple optional key-value pairs. Pods that match a selector (by matching a namespace for the selector and all of the labels specified in the selector) are scheduled on Fargate. If a namespace selector is defined without any labels, Amazon EKS will attempt to schedule all pods that run in that namespace onto Fargate using the profile. If a to-be-scheduled pod matches any of the selectors in the Fargate profile, then that pod is scheduled on Fargate.
You can create a Fargate profile using `eksctl` for your existing EKS cluster. In this tutorial, we will use `eksctl` to create a new EKS cluster with a Fargate profile. All pods in the `default` namespace will run on Fargate.
Architecture
Build the cluster:
If you don't have an EKS cluster, you can use `eksctl` to create one. You can use this template to create a cluster that schedules all pods in the `default` namespace on Fargate:
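A minimal `eksctl` cluster config of this shape (the original template isn't preserved here; cluster name, region, and node group sizing are illustrative):

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: fargate-prometheus   # illustrative name
  region: us-west-2

# EC2 node group for workloads that need EBS volumes (Prometheus)
nodeGroups:
  - name: ng-1
    instanceType: m5.large
    desiredCapacity: 2

# Schedule all pods in the default namespace on Fargate
fargateProfiles:
  - name: fp-default
    selectors:
      - namespace: default
```

Save this as `cluster.yaml` and run `eksctl create cluster -f cluster.yaml`.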
Add Helm repositories
We will use Helm to install the components required to run Prometheus. If you don’t have Helm installed, please see Using Helm with Amazon EKS. Add these repositories:
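The current homes of these charts are the prometheus-community and grafana repositories (the original post may have referenced the then-current `stable` repository):

```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
```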
Install Prometheus
Before we can install Prometheus, we need to install the EBS CSI driver, which allows Kubernetes applications to use EBS volumes for stateful workloads. Install the EBS CSI driver Helm chart:
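One way to do this (release name and namespace are illustrative):

```shell
helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm repo update
helm install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system
```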
Create a namespace for Prometheus:
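```shell
kubectl create namespace prometheus
```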
Prometheus needs two EBS-backed persistent volumes: one for prometheus-server and another for prometheus-alertmanager. Because EBS volumes are accessible within a single Availability Zone (AZ), we must determine the AZs in which our cluster’s worker nodes exist. We will configure the storage class to create EBS volumes in the same AZ as the node.
Get the Availability Zone of one of the worker nodes:
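For example, using the well-known topology label (dots in the label key must be escaped in jsonpath):

```shell
# Availability Zone of the first worker node
kubectl get nodes \
  -o jsonpath='{.items[0].metadata.labels.topology\.kubernetes\.io/zone}'
```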
Create a storage class:
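A storage class of this shape pins EBS volumes to one AZ via the EBS CSI driver's topology key (the storage class name is illustrative; replace the zone with the one from the previous step):

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: prometheus
provisioner: ebs.csi.aws.com
parameters:
  type: gp2
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.ebs.csi.aws.com/zone
        values:
          - us-west-2a   # replace with your node's AZ
```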
Download the Helm values file for Prometheus:
Install Prometheus:
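An install of this shape (the values filename is illustrative and refers to the file downloaded in the previous step):

```shell
helm install prometheus prometheus-community/prometheus \
  --namespace prometheus \
  -f prometheus_values.yaml
```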
Verify that Prometheus pods are running:
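```shell
kubectl get pods -n prometheus
```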
You should see the prometheus-server and prometheus-alertmanager pods, among others, in the Running state.
The chart creates two persistent volume claims: an 8Gi volume for the `prometheus-server` pod and a 2Gi volume for `prometheus-alertmanager`.
The Helm chart also creates these Kubernetes services:
The prometheus-server service provides Prometheus’ API and user interface. This is the service that Grafana will use to fetch Prometheus metrics.
Install Grafana
Download the values file for the Grafana Helm chart:
Install Grafana:
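An install of this shape (the values filename is illustrative and refers to the file downloaded in the previous step):

```shell
helm install grafana grafana/grafana \
  --namespace prometheus \
  -f grafana_values.yaml
```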
The chart creates a 5Gi persistent volume for Grafana and a `LoadBalancer` service for the UI. Get the load balancer's DNS name:
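```shell
kubectl get svc grafana -n prometheus \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```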
Open Grafana UI in your browser:
Log in using the username `admin`. You can use kubectl to retrieve the password, which is stored as a Kubernetes secret:
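The Grafana Helm chart stores the admin password in a secret named after the release (here assumed to be `grafana`):

```shell
kubectl get secret grafana -n prometheus \
  -o jsonpath='{.data.admin-password}' | base64 --decode
```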
Install the sample dashboard `7249`. This dashboard gives a cluster-level overview of the workloads deployed, based on Prometheus metrics.
If you already have pods running on Fargate, you will see them in the dashboard. If you don’t have any existing workloads, you can deploy a sample application:
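Any workload with resource requests will do; for example, this illustrative deployment (names and values are placeholders, not the sample from the original post):

```shell
# Create a deployment in the default namespace (scheduled on Fargate)
kubectl create deployment nginx --image=nginx --replicas=3 -n default
# Declare requests so usage percentages in the dashboards are meaningful
kubectl set resources deployment nginx -n default \
  --requests=cpu=200m,memory=256Mi
```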
Once the application is running, its pods appear in the dashboard. With this setup, we can monitor pod memory usage and, similarly, calculate CPU usage.
We have created Grafana dashboard `12421` to track CPU and memory usage against `requests`.
The formula it uses for calculating CPU usage is:
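The original snippet isn't preserved in this copy; a representative PromQL expression of this shape, dividing the CPU usage rate by the requested cores:

```
sum(rate(container_cpu_usage_seconds_total{pod!=""}[5m])) by (pod)
  / sum(kube_pod_container_resource_requests_cpu_cores) by (pod)
  * 100
```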
The formula for calculating memory usage is:
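Again as a representative PromQL expression, dividing the working set by the requested memory:

```
sum(container_memory_working_set_bytes{pod!=""}) by (pod)
  / sum(kube_pod_container_resource_requests_memory_bytes) by (pod)
  * 100
```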
The syntax has been modified for legibility.
The current version of the dashboard doesn’t consider initContainers’ requests. This is because kube-state-metrics doesn’t expose resources requested by initContainers.
The `requests` metric in the graph will be absent if none of the long-running containers request any resources. The `requests` metric should not be confused with the total CPU and memory the pod has at its disposal; the pod's CPU and memory are determined by the Fargate configuration calculated for the pod, as explained above.
Here are some common metrics used in pod monitoring dashboards, along with the source of each metric:
- kube_pod_info [kube-state-metrics]
- kube_pod_status_phase [kube-state-metrics]
- kube_pod_container_status_restarts_total [kube-state-metrics]
- CPU
- container_cpu_usage_seconds_total [cAdvisor]
- kube_pod_container_resource_requests_cpu_cores [kube-state-metrics]
- Memory
- container_memory_working_set_bytes [cAdvisor]
- kube_pod_container_resource_requests_memory_bytes [kube-state-metrics]
- kube_pod_container_resource_limits_memory_bytes [kube-state-metrics]
- Network
- container_network_transmit_bytes_total [cAdvisor]
- container_network_receive_bytes_total [cAdvisor]
Conclusion
As demonstrated, the inability to run node-exporter as a DaemonSet in Fargate doesn’t impede the ability to monitor Kubernetes workloads running on Fargate. Metrics provided by cAdvisor and kube-state-metrics are sufficient for monitoring pods on Fargate.
With Fargate, it's important to declare resource `requests` in your containers. Otherwise, you will get the default Fargate configuration, and you won't be able to measure the performance of your applications correctly.
You may also like Michael Fischer’s Grafana dashboard to monitor EKS control plane performance.
Further reading
Using Prometheus Metrics in Amazon CloudWatch
EKS Workshop — Deploy Prometheus and Grafana tutorial