Monitoring Amazon EKS on AWS Fargate using Prometheus and Grafana

At AWS, we are constantly looking to improve customer experience by reducing complexity. Our customers want to spend more time solving business problems and less time maintaining infrastructure. Two years ago, we launched Amazon EKS to make it easy for you to operate Kubernetes clusters. And last year, at re:Invent 2019, we announced support for EKS on Fargate. This feature allows you to run Kubernetes pods without creating and managing EC2 instances.

Customers often ask, “Can I monitor my pods running on Fargate using Prometheus?”

Yes, you can use Prometheus to monitor pods running on Fargate. In this post, I am going to take a deeper look at how pods run on Fargate and where the relevant Prometheus metrics originate.

Running Kubernetes pods on Fargate

AWS Fargate is a serverless compute engine for containers that works with both Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS). When you run your Kubernetes workload on Fargate, you don’t need to provision and manage servers. Fargate allocates the right amount of compute to run your containers, so you avoid having to size EC2 instances for your workload, and you specify and pay for only the resources your application needs. You just need to know how much vCPU and memory your application pod needs, and Fargate will run it. To right-size your pods, you can use tools like right-size-guide and the Goldilocks vertical-pod-autoscaler. Scaling pods horizontally is also easier with Fargate: as the horizontal pod autoscaler creates new replicas, Fargate creates nodes for the new pods.

EKS allows you to choose where you obtain compute capacity (EC2 or Fargate) on a per-pod basis. You can either tell Kubernetes to run all the pods in a namespace on Fargate, or you can use labels to select the pods that you want to run on Fargate. You can have a cluster where some pods run on EC2 while others run on Fargate.

You declare which pods run on Fargate using a Fargate profile. Along with specifying which Kubernetes namespaces and labels require Fargate capacity, you can also define the subnets from which the pods will get their IP addresses.

As an additional benefit, each pod on Fargate gets its own VM-isolated environment, which means your pods do not share a kernel, CPU, memory, or network interface with any other pods. Fargate runs Kubernetes components like kubelet, kube-proxy, and containerd alongside each pod. If you have five pods running on Fargate and you run kubectl get nodes, you will see five worker nodes, one for each pod. If your cluster also runs kube-system pods on Fargate, you will see nodes for CoreDNS as well.

When pods are scheduled on Fargate, the vCPU and memory reservations within the pod specification determine how much vCPU and memory to provision for the pod.

  • The maximum request across all Init containers determines the Init vCPU and memory requirements.
  • The requests of all long-running containers are added up to determine the long-running vCPU and memory requirements.
  • The larger of the two values above is chosen as the vCPU and memory request for your pod.
  • Fargate adds 256 MB to each pod’s memory reservation for the required Kubernetes components (kubelet, kube-proxy, and containerd).
  • Fargate rounds up to the compute configuration that most closely matches the sum of vCPU and memory requests in order to ensure pods always have the resources that they need to run.
  • If you do not specify a vCPU and memory combination, then the smallest available combination is used (.25 vCPU and 0.5 GB memory).
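
To make the sizing arithmetic concrete, here is a minimal sketch of a hypothetical pod spec (the names and images are illustrative only):

apiVersion: v1
kind: Pod
metadata:
  name: sizing-demo
spec:
  initContainers:
    - name: init-migrate
      image: busybox
      resources:
        requests:
          cpu: 500m
          memory: 1Gi    # Init requirement: 0.5 vCPU, 1 GiB
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: 250m
          memory: 1Gi
    - name: sidecar
      image: busybox
      resources:
        requests:
          cpu: 250m
          memory: 1Gi

The long-running containers add up to 0.5 vCPU and 2 GiB of memory, while the Init requirement is 0.5 vCPU and 1 GiB. Fargate takes the larger value for each resource (0.5 vCPU, 2 GiB), adds 256 MB of memory for the Kubernetes components, and rounds up to the next available Fargate configuration.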

Consider declaring vCPU and memory requests irrespective of whether you use Fargate or EC2. It will enable Kubernetes to ensure that at least the requested resources for each pod are available on the compute resource.

Monitoring CPU and memory usage percentage

Defining vCPU and memory requests for pods running on Fargate will also help you correctly monitor the CPU and memory usage percentage in Fargate. The formula used for the calculation of CPU and memory used percent varies by Grafana dashboard. For example, some Grafana dashboards calculate a pod’s memory used percent like this:

Pod's memory used percentage =
  (memory used by all the containers in the pod /
   total memory of the worker node) * 100

This formula will yield incorrect memory usage in Fargate since, as explained above, a pod’s resources are provisioned based on the vCPU and memory requests declared in its containers, not on the capacity of the underlying node. In Fargate, a pod’s resource usage should not be calculated against the Fargate node’s CPU and memory, but against the containers’ declared requests, like this:

Pod's memory used percentage =
  (memory used by all the containers in the pod /
   max(sum of memory requested by the long-running containers in the pod,
       the highest memory requested by any init container in the pod)) * 100

This formula will help you monitor your pod’s compute resources and identify when your containers’ resource requests should be adjusted. The values you declare in the resource requests tell Fargate how much CPU and memory to allocate to the pod. If you notice that your pod’s memory and CPU usage is consistently nearing the values you’ve declared in the resource requests, then it may be time to review the requested resources.
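
Expressed in PromQL, using metrics discussed later in this post, the calculation might look like the sketch below. It ignores init-container requests (which kube-state-metrics does not expose) and the 256 MB Fargate overhead:

100 *
  sum by (namespace, pod) (container_memory_working_set_bytes{container!=""})
/
  sum by (namespace, pod) (kube_pod_container_resource_requests{resource="memory"})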

Given the way Fargate allocates resources, a pod gets the total memory requested in its containers (individual Init containers or the sum of all the long-running containers, whichever is more) plus 256 MB, rounded up to the next Fargate configuration. For example, if you request 3.5 GB of memory, Fargate will allocate 4 GB of memory: 3.5 GB + 256 MB, rounded up. If you don’t declare any value for memory, then Fargate will allocate 0.5 GB. The containers in your pod will be able to use all the available memory unless you specify a memory limit in your containers.

Having understood what to measure, let’s now explore how it can be measured.

Prometheus architecture

Prometheus is a time-series-based, open source systems monitoring tool originally built at SoundCloud. Prometheus joined the Cloud Native Computing Foundation in 2016 as the second hosted project, after Kubernetes. So it comes as no surprise that Prometheus works seamlessly with Kubernetes.

Prometheus collects metrics via a pull model over HTTP. In Kubernetes, Prometheus can automatically discover targets using the Kubernetes API; targets can be pods, DaemonSets, nodes, and more. A typical Prometheus installation in Kubernetes includes these components:

  • Prometheus server
  • Node exporter
  • Push gateway
  • Alert manager
  • kube-state-metrics (installed by default if you use stable/prometheus helm chart)
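
For illustration, here is a minimal sketch of a scrape configuration that uses Kubernetes service discovery. The stable/prometheus chart ships a much more complete configuration; this fragment only shows the general shape:

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod            # discover every pod via the Kubernetes API
    relabel_configs:
      # keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true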

In Kubernetes, the Prometheus server runs as a pod that is responsible for scraping metrics from metrics endpoints.

Node exporter runs as a DaemonSet and is responsible for collecting metrics of the host it runs on. Most of these are low-level operating system metrics: vCPU, memory, network, and disk statistics of the host machine (not of containers), hardware statistics, and so on. These metrics are inaccessible to Fargate customers since AWS is responsible for the health of the host machine.

To measure the performance of a pod running on Fargate, we need metrics like vCPU, memory usage, and network transfers. Prometheus collects these metrics from two sources: cAdvisor and kube-state-metrics.

cAdvisor

cAdvisor (short for container advisor) analyzes and exposes resource usage and performance data from running containers on a node. In Kubernetes, cAdvisor runs as part of the Kubelet binary. You can use kubectl to view the metrics generated by cAdvisor:

kubectl get --raw /api/v1/nodes/[NAME-OF-A-NODE]/proxy/metrics/cadvisor

cAdvisor provides Node and pod usage statistics that are useful in understanding how a pod is using its resources. Here are the metrics it exposes:

cadvisor_version_info  -- A metric with a constant '1' value labeled by kernel version, OS version, docker version, cadvisor version & cadvisor revision.
container_cpu_load_average_10s  -- Value of container cpu load average over the last 10 seconds.
container_cpu_system_seconds_total  -- Cumulative system cpu time consumed in seconds.
*container_cpu_usage_seconds_total  -- Cumulative cpu time consumed in seconds.*
container_cpu_user_seconds_total  -- Cumulative user cpu time consumed in seconds.
container_fs_inodes_free  -- Number of available Inodes
container_fs_inodes_total  -- Number of Inodes
container_fs_io_current  -- Number of I/Os currently in progress
container_fs_io_time_seconds_total  -- Cumulative count of seconds spent doing I/Os
container_fs_io_time_weighted_seconds_total  -- Cumulative weighted I/O time in seconds
container_fs_limit_bytes  -- Number of bytes that can be consumed by the container on this filesystem.
container_fs_read_seconds_total  -- Cumulative count of seconds spent reading
container_fs_reads_bytes_total  -- Cumulative count of bytes read
container_fs_reads_merged_total  -- Cumulative count of reads merged
container_fs_reads_total  -- Cumulative count of reads completed
container_fs_sector_reads_total  -- Cumulative count of sector reads completed
container_fs_sector_writes_total  -- Cumulative count of sector writes completed
container_fs_usage_bytes  -- Number of bytes that are consumed by the container on this filesystem.
container_fs_write_seconds_total  -- Cumulative count of seconds spent writing
container_fs_writes_bytes_total  -- Cumulative count of bytes written
container_fs_writes_merged_total  -- Cumulative count of writes merged
container_fs_writes_total  -- Cumulative count of writes completed
container_last_seen  -- Last time a container was seen by the exporter
container_memory_cache  -- Number of bytes of page cache memory.
container_memory_failcnt  -- Number of memory usage hits limits
container_memory_failures_total  -- Cumulative count of memory allocation failures.
container_memory_mapped_file  -- Size of memory mapped files in bytes.
*container_memory_max_usage_bytes  -- Maximum memory usage recorded in bytes*
container_memory_rss  -- Size of RSS in bytes.
container_memory_swap  -- Container swap usage in bytes.
container_memory_usage_bytes  -- Current memory usage in bytes, including all memory regardless of when it was accessed
container_memory_working_set_bytes  -- Current working set in bytes.
*container_network_receive_bytes_total  -- Cumulative count of bytes received*
container_network_receive_errors_total  -- Cumulative count of errors encountered while receiving
container_network_receive_packets_dropped_total  -- Cumulative count of packets dropped while receiving
container_network_receive_packets_total  -- Cumulative count of packets received
*container_network_transmit_bytes_total  -- Cumulative count of bytes transmitted*
container_network_transmit_errors_total  -- Cumulative count of errors encountered while transmitting
container_network_transmit_packets_dropped_total  -- Cumulative count of packets dropped while transmitting
container_network_transmit_packets_total  -- Cumulative count of packets transmitted
container_scrape_error  -- 1 if there was an error while getting container metrics, 0 otherwise
container_spec_cpu_period  -- CPU period of the container.
container_spec_cpu_shares  -- CPU share of the container.
container_spec_memory_limit_bytes  -- Memory limit for the container.
container_spec_memory_reservation_limit_bytes  -- Memory reservation limit for the container.
container_spec_memory_swap_limit_bytes  -- Memory swap limit for the container.
container_start_time_seconds  -- Start time of the container since unix epoch in seconds.
container_tasks_state  -- Number of tasks in given state
machine_cpu_cores  -- Number of CPU cores on the machine.
machine_memory_bytes  -- Amount of memory installed on the machine.

cAdvisor also exposes the total CPU and memory of the node. For example, I scheduled a pod on Fargate and requested 200m vCPU.

kubectl get --raw /api/v1/nodes/fargate-ip-192-168-102-240.us-east-2.compute.internal/proxy/metrics/cadvisor
...
machine_cpu_cores 2
machine_memory_bytes 4.134506496e+09

As reflected in the metrics, the Fargate node that runs my pod has 2 vCPUs and 4GiB RAM. This can be a bit confusing.

Even though the node has 2 vCPUs and 4 GiB RAM, my pod is limited to its requested 200m vCPU (or to 0.25 vCPU and 0.5 GB RAM if no requests are configured). I am billed for the resources allocated to the pod, not for the rest of the unused capacity on the Fargate node.

Once Fargate uses Firecracker microVMs, the compute resources of the microVM will closely match the requirements of the pod running on Fargate. Until then, you should expect to see unused capacity on your Fargate nodes even though you are not responsible for its cost.

Most Grafana dashboards intended for pod monitoring use the following metrics generated by cAdvisor:

  • container_cpu_usage_seconds_total
  • container_memory_usage_bytes
  • container_network_*_bytes_total

While some Grafana dashboards for monitoring pod usage are based on cAdvisor metrics only, others combine metrics from other sources like kube-state-metrics.

kube-state-metrics

kube-state-metrics is an open source project that is responsible for listening to the Kubernetes API server and generating metrics. It creates a Kubernetes service and exposes metrics in Prometheus text format. It collects metrics for the following resources:

  • nodes
  • pods
  • replicationcontrollers
  • services
  • endpoints
  • namespaces
  • limitranges
  • resourcequotas
  • persistentvolumes
  • persistentvolumeclaims
  • configmaps
  • secrets

It creates a service that listens on port 8080, and you can use kubectl to see all the metrics it exposes:

kubectl get --raw /api/v1/namespaces/prometheus/services/prometheus-kube-state-metrics:8080/proxy/metrics
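
For example, you can filter the output for just the request metrics (piping through grep; the service path is the same one used above):

kubectl get --raw /api/v1/namespaces/prometheus/services/prometheus-kube-state-metrics:8080/proxy/metrics \
  | grep kube_pod_container_resource_requests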

Pod monitoring Grafana dashboards generally use kube-state-metrics to determine pod’s compute resource requests and limits. Here are some relevant metrics:

  • kube_pod_container_resource_requests — The resources requested by a container.
  • kube_pod_container_resource_requests_cpu_cores — The number of CPU cores requested by a container.
  • kube_pod_container_resource_limits_cpu_cores — The limit on CPU cores to be used by a container. CPU limits are ignored in Fargate.
  • kube_pod_container_resource_requests_memory_bytes — The number of memory bytes requested by a container.
  • kube_pod_container_resource_limits_memory_bytes — The limit on memory to be used by a container, in bytes. Memory limits are ignored in Fargate.

Prometheus gives us the complete picture by combining data collected from cAdvisor and kube-state-metrics. Let’s review some helpful Grafana dashboards for monitoring pods running on Fargate.

Tutorial

I am going to walk you through setting up Prometheus and Grafana. If you already use Prometheus and Grafana, you can skip the tutorial.

I will create an EKS cluster and install Prometheus and Grafana. The cluster will need a worker node backed by EC2 since Prometheus requires a persistent volume to store data and EKS on Fargate currently doesn’t support persistent storage. All the pods in the prometheus namespace will run on EC2.

Before I can schedule pods on Fargate, I have to define a Fargate profile which specifies which pods should use Fargate when they are launched.

The Fargate profile allows an administrator to declare which pods run on Fargate. This declaration is done through the profile’s selectors. Each profile can have up to five selectors that contain a namespace and optional labels. You must define a namespace for every selector. The label field consists of multiple optional key-value pairs. Pods that match a selector (by matching a namespace for the selector and all of the labels specified in the selector) are scheduled on Fargate. If a namespace selector is defined without any labels, Amazon EKS will attempt to schedule all pods that run in that namespace onto Fargate using the profile. If a to-be-scheduled pod matches any of the selectors in the Fargate profile, then that pod is scheduled on Fargate.

You can create a Fargate profile using eksctl for your existing EKS cluster. In this tutorial, I will use eksctl to create a new EKS cluster with a Fargate profile. All pods defined in the default and kube-system namespaces will run on Fargate.
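
For reference, a Fargate profile can also be declared in an eksctl config file. Here is a minimal sketch (the cluster name and region are placeholders):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-east-2
fargateProfiles:
  - name: fp-default
    selectors:
      # pods in these namespaces are scheduled on Fargate
      - namespace: default
      - namespace: kube-system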

Architecture

Building the cluster:

If you don’t have an EKS cluster, you can use eksctl to create one. You can use eksctl to create a cluster that runs all pods in the default and kube-system namespaces on Fargate:

eksctl create cluster --fargate

You can also use eksctl to create a node group that will be needed to run Prometheus.

eksctl create nodegroup --cluster=<clusterName> 
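
You can optionally size the node group explicitly; for example (the flags below are one reasonable choice, not a requirement):

eksctl create nodegroup --cluster=<clusterName> \
  --name=prometheus-ng --node-type=m5.large --nodes=1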

Installing Prometheus

First I will create a namespace for Prometheus:

kubectl create namespace prometheus

I use Helm to install Prometheus using the stable/prometheus chart. If you don’t have Helm installed, please see Using Helm with Amazon EKS.

Create a file named prometheus-storageclass.yaml with the following content:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: prometheus    # StorageClass is cluster-scoped, so no namespace is needed
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
reclaimPolicy: Retain
mountOptions:
  - debug

Create the storage class:

kubectl apply -f prometheus-storageclass.yaml

Next I will create two persistent volume claims:

  1. Prometheus-Server – 16.0 GiB
  2. Prometheus-alertmanager – 4.0 GiB
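
These sizes come from the Helm values file downloaded below. A sketch of the relevant chart settings (assumed here for illustration; the downloaded file is authoritative):

server:
  persistentVolume:
    storageClass: prometheus
    size: 16Gi
alertmanager:
  persistentVolume:
    storageClass: prometheus
    size: 4Gi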

Download the prometheus_values.yml file:

wget https://raw.githubusercontent.com/jonnalagadda35153/EKS-Fargate/master/EKS_Fargate_Monitoring/Monitoring/prometheus_values.yml

Install Prometheus:

helm install prometheus -f prometheus_values.yml \
stable/prometheus --namespace prometheus

Once the installation completes, Prometheus runs as a ClusterIP type service.
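
You can verify that everything is running (the names assume the release and namespace used above):

kubectl get pods --namespace prometheus
kubectl get svc --namespace prometheus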

Installing Grafana

I will use Helm to install Grafana. Download grafana-values.yaml:

wget https://raw.githubusercontent.com/jonnalagadda35153/EKS-Fargate/master/EKS_Fargate_Monitoring/Monitoring/grafana-values.yaml

Install Grafana.

helm install grafana -f grafana-values.yaml \
stable/grafana --namespace prometheus

It will install Grafana and create a LoadBalancer service.

You can access Grafana with the DNS name of the Load Balancer.
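
You can retrieve the DNS name with a command like this:

kubectl get svc --namespace prometheus grafana \
  -o jsonpath="{.status.loadBalancer.ingress[0].hostname}" ; echo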

Username: admin
The password can be retrieved with the following command:

kubectl get secret --namespace prometheus grafana \
 -o jsonpath="{.data.admin-password}" | \
 base64 --decode ; echo

Install sample dashboard 7249.

This dashboard gives a cluster level overview of the workloads deployed based on Prometheus metrics.

If you already have pods running on Fargate, you will see them in the dashboard. If you don’t have any pods, you can create some like this:

Deploy a sample application with the command:

kubectl apply -f https://github.com/jonnalagadda35153/EKS-Fargate/raw/master/EKS_Fargate_Monitoring/Monitoring/sampleapp.yaml

The result should be this:

service/appf created
deployment.apps/appf created
ingress.extensions/appf created
horizontalpodautoscaler.autoscaling/appf created

$ kubectl get pods
NAME                   READY   STATUS    RESTARTS   AGE
appf-5cc9c4655-gfm8r   0/1     Pending   0          7s
appf-5cc9c4655-nk97x   0/1     Pending   0          7s
appf-5cc9c4655-vtwpn   0/1     Pending   0          7s
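
After a minute or so, the pods should transition to Running, and each one should get its own Fargate node, which you can confirm with:

kubectl get nodes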

With this setup, we can monitor pod memory usage.

Similarly, we can calculate the CPU usage.

We have created Grafana Dashboard 12421 to track CPU and memory usage against requests.

The formula it uses for calculating CPU usage is:

sum(rate(container_cpu_usage_seconds_total[5m]))
/ sum(kube_pod_container_resource_requests{resource="cpu"}) * 100

The formula for calculating memory usage is:

sum(container_memory_working_set_bytes) 
/ sum(kube_pod_container_resource_requests{resource="memory"}+262144000) * 100

The syntax has been modified for legibility.
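
A per-pod variant of the CPU query might look like the following sketch (assuming the kube-state-metrics label scheme shown above; this is not the exact dashboard query):

100 *
  sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
/
  sum by (namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})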

The current version of the dashboard doesn’t consider initContainers’ requests. This is because kube-state-metrics doesn’t expose resources requested by initContainers.

The requests metric in the graph will be absent if none of the long-running containers request any resources. The request metric should not be confused with the total CPU and memory the pod has at its disposal; the pod’s CPU and memory are determined by its calculated Fargate configuration, as explained above.

Here are some common metrics used in pod monitoring dashboard and the source of the metric:

  • kube_pod_info [kube-state-metrics]
  • kube_pod_status_phase [kube-state-metrics]
  • kube_pod_container_status_restarts_total [kube-state-metrics]
  • CPU
    • container_cpu_usage_seconds_total [cAdvisor]
    • kube_pod_container_resource_requests_cpu_cores [kube-state-metrics]
  • Memory
    • container_memory_working_set_bytes [cAdvisor]
    • kube_pod_container_resource_requests_memory_bytes [kube-state-metrics]
    • kube_pod_container_resource_limits_memory_bytes [kube-state-metrics]
  • Network
    • container_network_transmit_bytes_total [cAdvisor]
    • container_network_receive_bytes_total [cAdvisor]

Conclusion

As demonstrated, the inability to run node-exporter as a DaemonSet in Fargate doesn’t impede the ability to monitor Kubernetes workloads running on Fargate. Metrics provided by cAdvisor and kube-state-metrics are sufficient for monitoring pods on Fargate.

With Fargate, it’s important to specify resource requests in your containers; otherwise, you will get the default Fargate configuration and you won’t be able to measure the performance of your applications correctly.

You may also like Michael Fischer’s Grafana dashboard to monitor EKS control plane performance.

Further reading

Using Prometheus Metrics in Amazon CloudWatch
EKS Workshop — Deploy Prometheus and Grafana tutorial

Re Alvarez-Parmar

Re Alvarez-Parmar is a Container Specialist Solutions Architect at Amazon Web Services. He helps customers use AWS container services to design scalable and secure applications. He is based out of New York and uses Twitter sparingly: @realz

Jaswanth Kumar Jonnalagadda

Jaswanth Kumar is an Application Architect at Amazon Web Services. He helps AWS customers use AWS container services to design scalable and secure applications. He is based out of New York.