Introducing Amazon CloudWatch Container Insights for Amazon EKS Fargate using AWS Distro for OpenTelemetry
Introduction
Amazon CloudWatch Container Insights helps customers collect, aggregate, and summarize metrics and logs from containerized applications and microservices. Metrics data is collected as performance log events using the embedded metric format. These performance log events use a structured JSON schema that enables high-cardinality data to be ingested and stored at scale. From this data, CloudWatch creates aggregated metrics at the cluster, node, pod, task, and service level as CloudWatch metrics. The metrics that Container Insights collects are available in CloudWatch automatic dashboards.
AWS Distro for OpenTelemetry (ADOT) is a secure, AWS-supported distribution of the OpenTelemetry project. With ADOT, users can instrument their applications just once to send correlated metrics and traces to multiple monitoring solutions. With the recent preview launch of ADOT support for CloudWatch Container Insights, customers can collect system metrics such as CPU, memory, disk, and network usage from Amazon EKS and Kubernetes clusters running on Amazon Elastic Compute Cloud (Amazon EC2), providing the same experience as the Amazon CloudWatch agent. The ADOT Collector is now available in preview with support for CloudWatch Container Insights for Amazon EKS Fargate. Customers can now collect container and pod metrics, such as CPU and memory utilization, for pods deployed to an Amazon EKS cluster running on AWS Fargate and view them in CloudWatch dashboards without any changes to their existing CloudWatch Container Insights experience. This also enables customers to determine whether to scale capacity up or down in response to traffic and thereby save costs.
In this blog post, we will discuss the design of components in an ADOT Collector pipeline that enables the collection of Container Insights metrics from EKS Fargate workloads. Then, we will demonstrate how to configure and deploy an ADOT Collector to collect system metrics from workloads deployed to an EKS Fargate cluster and send them to CloudWatch.
Design of Container Insights support in ADOT Collector for EKS Fargate
The ADOT Collector is built around the concept of a pipeline, which comprises three key types of components: receivers, processors, and exporters. A receiver is how data gets into the collector; it accepts data in a specified format, translates it into the collector's internal format, and passes it to the processors and exporters defined in the pipeline. It can be pull or push based. A processor is an optional component that performs tasks such as batching, filtering, and transforming data between the point it is received and the point it is exported. An exporter determines the destination to which the metrics, logs, or traces are sent. The collector architecture allows multiple such pipelines to be defined via YAML configuration. The following diagram illustrates the pipeline components in an ADOT Collector instance deployed to EKS Fargate.
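As a minimal sketch, a pipeline is wired together in the collector's YAML configuration by declaring components and then referencing them under service.pipelines. The abbreviated example below uses the same receiver and exporter as the full configuration shown later in this post, which adds many more processors:
receivers:
  prometheus:
    config:
      scrape_configs: []
processors:
  batch:
    timeout: 60s
exporters:
  awsemf:
    region: YOUR-AWS-REGION
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [awsemf]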
The kubelet on a worker node in a Kubernetes cluster exposes resource metrics such as CPU, memory, disk, and network usage at the /metrics/cadvisor endpoint. However, in the EKS Fargate networking architecture, a pod is not allowed to reach the kubelet on its worker node directly. Hence, the ADOT Collector calls the Kubernetes API server to proxy the connection to the kubelet on a worker node and collects the kubelet's cAdvisor metrics for the workloads on that node. These metrics are made available in Prometheus format, so the collector uses an instance of the Prometheus Receiver as a drop-in replacement for a Prometheus server and scrapes them from the Kubernetes API server endpoint. Using Kubernetes service discovery, the receiver can discover all the worker nodes in an EKS cluster; hence, a single instance of the ADOT Collector suffices to collect resource metrics from all the nodes in a cluster.
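For reference, the same proxied endpoint that the receiver scrapes can be queried manually through the API server, for example (substitute an actual node name for the placeholder):
kubectl get --raw /api/v1/nodes/NODE_NAME/proxy/metrics/cadvisor
This returns that node's cAdvisor metrics in Prometheus exposition format.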
The metrics then go through a series of processors that perform filtering, renaming, data aggregation and conversion, and so on. The following processors are used in the pipeline of the ADOT Collector instance for EKS Fargate illustrated above.
- Filter Processor to include or exclude metrics based on their name.
- Metrics Transform Processor to rename metrics as well as perform aggregations on metrics across labels.
- Cumulative to Delta Processor to convert cumulative sum metrics to delta sums.
- Delta to Rate Processor to convert delta sum metrics to rate metrics; the resulting rate is reported as a gauge (see the worked example after this list).
- Metrics Generation Processor to create new metrics using existing metrics.
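To make the delta and rate conversions concrete, consider pod_cpu_usage_seconds_total, which is derived from the cumulative counter container_cpu_usage_seconds_total. If two successive one-minute scrapes report cumulative values of 100.0 and 160.0 CPU-seconds, the Cumulative to Delta Processor produces a delta of 60.0, and the Delta to Rate Processor divides it by the 60-second interval to yield a rate of 1.0, that is, one full core. The Metrics Generation Processor then scales this value by 1000 so that pod_cpu_usage_total is reported in millicores (1000m in this example).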
The final component in the pipeline is the AWS CloudWatch EMF Exporter, which converts the metrics to embedded metric format (EMF) and sends them directly to CloudWatch Logs using the PutLogEvents API. The following metrics are sent to CloudWatch by the ADOT Collector for each of the workloads running on EKS Fargate.
- pod_cpu_utilization_over_pod_limit
- pod_cpu_usage_total
- pod_cpu_limit
- pod_memory_utilization_over_pod_limit
- pod_memory_working_set
- pod_memory_limit
- pod_network_rx_bytes
- pod_network_tx_bytes
Each metric will be associated with the following dimension sets and collected under the CloudWatch namespace named ContainerInsights.
- ClusterName, LaunchType
- ClusterName, Namespace, LaunchType
- ClusterName, Namespace, PodName, LaunchType
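Once the collector described in the next section is deployed and metrics are flowing, one way to confirm that the metrics have been created under these dimensions is the AWS CLI; for example, substituting your cluster name:
aws cloudwatch list-metrics --namespace ContainerInsights --metric-name pod_cpu_utilization_over_pod_limit --dimensions Name=ClusterName,Value=YOUR-EKS-CLUSTER-NAME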
Deploying ADOT Collector to EKS Fargate
Let’s get into the details of installing the ADOT Collector in an EKS Fargate cluster and then collecting metrics data from workloads. The following is a list of prerequisites for installing the ADOT Collector.
- An EKS cluster that supports Kubernetes version 1.18 or higher. You may create the EKS cluster using one of the approaches outlined here.
- When your cluster creates pods on AWS Fargate, the components that run on the Fargate infrastructure must make calls to AWS APIs on your behalf. This is so that they can execute actions such as pulling container images from Amazon ECR. The EKS pod execution role provides the IAM permissions to do this. Create a Fargate pod execution role as outlined here.
- Before you can schedule pods running on Fargate, you must define a Fargate profile that specifies which pods should use Fargate when they are launched. For the implementation in this blog, we create two Fargate profiles as outlined here. The first Fargate profile is named fargate-container-insights, specifying the namespace fargate-container-insights. The second one is named applications, specifying the namespace golang.
- The ADOT Collector requires IAM permissions to send performance log events to CloudWatch. This is done by associating a Kubernetes service account with an IAM role using the IAM Roles for Service Accounts (IRSA) feature supported by EKS. The IAM role should be associated with the AWS-managed policy CloudWatchAgentServerPolicy. The helper script shown below may be used, after substituting the CLUSTER_NAME and REGION variables, to create an IAM role named EKS-Fargate-ADOT-ServiceAccount-Role that is granted these permissions and is associated with the adot-collector Kubernetes service account.
#!/bin/bash
CLUSTER_NAME=YOUR-EKS-CLUSTER-NAME
REGION=YOUR-EKS-CLUSTER-REGION
SERVICE_ACCOUNT_NAMESPACE=fargate-container-insights
SERVICE_ACCOUNT_NAME=adot-collector
SERVICE_ACCOUNT_IAM_ROLE=EKS-Fargate-ADOT-ServiceAccount-Role
SERVICE_ACCOUNT_IAM_POLICY=arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
eksctl utils associate-iam-oidc-provider \
--cluster=$CLUSTER_NAME \
--approve
eksctl create iamserviceaccount \
--cluster=$CLUSTER_NAME \
--region=$REGION \
--name=$SERVICE_ACCOUNT_NAME \
--namespace=$SERVICE_ACCOUNT_NAMESPACE \
--role-name=$SERVICE_ACCOUNT_IAM_ROLE \
--attach-policy-arn=$SERVICE_ACCOUNT_IAM_POLICY \
--approve
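After the script completes, an optional sanity check is to confirm that the service account exists and carries the IAM role annotation added by IRSA:
kubectl describe serviceaccount adot-collector -n fargate-container-insights
The output should include an eks.amazonaws.com/role-arn annotation referencing the EKS-Fargate-ADOT-ServiceAccount-Role role.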
Next, deploy the ADOT Collector as a Kubernetes StatefulSet using the following deployment manifest after replacing the placeholder variables YOUR-EKS-CLUSTER-NAME and YOUR-AWS-REGION in the manifest with the names of your EKS cluster and AWS Region respectively.
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: adotcol-admin-role
rules:
- apiGroups: [""]
resources:
- nodes
- nodes/proxy
- nodes/metrics
- services
- endpoints
- pods
- pods/proxy
verbs: ["get", "list", "watch"]
- nonResourceURLs: [ "/metrics/cadvisor"]
verbs: ["get", "list", "watch"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
name: adotcol-admin-role-binding
subjects:
- kind: ServiceAccount
name: adot-collector
namespace: fargate-container-insights
roleRef:
kind: ClusterRole
name: adotcol-admin-role
apiGroup: rbac.authorization.k8s.io
# collector configuration section
# update `ClusterName=YOUR-EKS-CLUSTER-NAME` in the env variable OTEL_RESOURCE_ATTRIBUTES
# update `region=YOUR-AWS-REGION` in the emfexporter with the name of the AWS Region where you want to collect Container Insights metrics.
---
apiVersion: v1
kind: ConfigMap
metadata:
name: adot-collector-config
namespace: fargate-container-insights
labels:
app: aws-adot
component: adot-collector-config
data:
adot-collector-config: |
receivers:
prometheus:
config:
global:
scrape_interval: 1m
scrape_timeout: 40s
scrape_configs:
- job_name: 'kubelets-cadvisor-metrics'
sample_limit: 10000
scheme: https
kubernetes_sd_configs:
- role: node
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
# Only for Kubernetes ^1.7.3.
# See: https://github.com/prometheus/prometheus/issues/2916
- target_label: __address__
# Changes the address to Kube API server's default address and port
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
                # Changes the default metrics path to kubelet's proxy cadvisor metrics endpoint
replacement: /api/v1/nodes/$${1}/proxy/metrics/cadvisor
metric_relabel_configs:
# extract readable container/pod name from id field
- action: replace
source_labels: [id]
regex: '^/machine\.slice/machine-rkt\\x2d([^\\]+)\\.+/([^/]+)\.service$'
target_label: rkt_container_name
replacement: '$${2}-$${1}'
- action: replace
source_labels: [id]
regex: '^/system\.slice/(.+)\.service$'
target_label: systemd_service_name
replacement: '$${1}'
processors:
# rename labels which apply to all metrics and are used in metricstransform/rename processor
metricstransform/label_1:
transforms:
- include: .*
match_type: regexp
action: update
operations:
- action: update_label
label: name
new_label: container_id
- action: update_label
label: kubernetes_io_hostname
new_label: NodeName
- action: update_label
label: eks_amazonaws_com_compute_type
new_label: LaunchType
# rename container and pod metrics which we care about.
# container metrics are renamed to `new_container_*` to differentiate them with unused container metrics
metricstransform/rename:
transforms:
- include: container_spec_cpu_quota
new_name: new_container_cpu_limit_raw
action: insert
match_type: regexp
experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
- include: container_spec_cpu_shares
new_name: new_container_cpu_request
action: insert
match_type: regexp
experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
- include: container_cpu_usage_seconds_total
new_name: new_container_cpu_usage_seconds_total
action: insert
match_type: regexp
experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
- include: container_spec_memory_limit_bytes
new_name: new_container_memory_limit
action: insert
match_type: regexp
experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
- include: container_memory_cache
new_name: new_container_memory_cache
action: insert
match_type: regexp
experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
- include: container_memory_max_usage_bytes
new_name: new_container_memory_max_usage
action: insert
match_type: regexp
experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
- include: container_memory_usage_bytes
new_name: new_container_memory_usage
action: insert
match_type: regexp
experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
- include: container_memory_working_set_bytes
new_name: new_container_memory_working_set
action: insert
match_type: regexp
experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
- include: container_memory_rss
new_name: new_container_memory_rss
action: insert
match_type: regexp
experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
- include: container_memory_swap
new_name: new_container_memory_swap
action: insert
match_type: regexp
experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
- include: container_memory_failcnt
new_name: new_container_memory_failcnt
action: insert
match_type: regexp
experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
- include: container_memory_failures_total
new_name: new_container_memory_hierarchical_pgfault
action: insert
match_type: regexp
experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "hierarchy"}
- include: container_memory_failures_total
new_name: new_container_memory_hierarchical_pgmajfault
action: insert
match_type: regexp
experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "hierarchy"}
- include: container_memory_failures_total
new_name: new_container_memory_pgfault
action: insert
match_type: regexp
experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "container"}
- include: container_memory_failures_total
new_name: new_container_memory_pgmajfault
action: insert
match_type: regexp
experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "container"}
- include: container_fs_limit_bytes
new_name: new_container_filesystem_capacity
action: insert
match_type: regexp
experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
- include: container_fs_usage_bytes
new_name: new_container_filesystem_usage
action: insert
match_type: regexp
experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
# POD LEVEL METRICS
- include: container_spec_cpu_quota
new_name: pod_cpu_limit_raw
action: insert
match_type: regexp
experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
- include: container_spec_cpu_shares
new_name: pod_cpu_request
action: insert
match_type: regexp
experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
- include: container_cpu_usage_seconds_total
new_name: pod_cpu_usage_seconds_total
action: insert
match_type: regexp
experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
- include: container_spec_memory_limit_bytes
new_name: pod_memory_limit
action: insert
match_type: regexp
experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
- include: container_memory_cache
new_name: pod_memory_cache
action: insert
match_type: regexp
experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
- include: container_memory_max_usage_bytes
new_name: pod_memory_max_usage
action: insert
match_type: regexp
experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
- include: container_memory_usage_bytes
new_name: pod_memory_usage
action: insert
match_type: regexp
experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
- include: container_memory_working_set_bytes
new_name: pod_memory_working_set
action: insert
match_type: regexp
experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
- include: container_memory_rss
new_name: pod_memory_rss
action: insert
match_type: regexp
experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
- include: container_memory_swap
new_name: pod_memory_swap
action: insert
match_type: regexp
experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
- include: container_memory_failcnt
new_name: pod_memory_failcnt
action: insert
match_type: regexp
experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
- include: container_memory_failures_total
new_name: pod_memory_hierarchical_pgfault
action: insert
match_type: regexp
experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "hierarchy"}
- include: container_memory_failures_total
new_name: pod_memory_hierarchical_pgmajfault
action: insert
match_type: regexp
experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "hierarchy"}
- include: container_memory_failures_total
new_name: pod_memory_pgfault
action: insert
match_type: regexp
experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "container"}
- include: container_memory_failures_total
new_name: pod_memory_pgmajfault
action: insert
match_type: regexp
experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "container"}
- include: container_network_receive_bytes_total
new_name: pod_network_rx_bytes
action: insert
match_type: regexp
experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
- include: container_network_receive_packets_dropped_total
new_name: pod_network_rx_dropped
action: insert
match_type: regexp
experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
- include: container_network_receive_errors_total
new_name: pod_network_rx_errors
action: insert
match_type: regexp
experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
- include: container_network_receive_packets_total
new_name: pod_network_rx_packets
action: insert
match_type: regexp
experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
- include: container_network_transmit_bytes_total
new_name: pod_network_tx_bytes
action: insert
match_type: regexp
experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
- include: container_network_transmit_packets_dropped_total
new_name: pod_network_tx_dropped
action: insert
match_type: regexp
experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
- include: container_network_transmit_errors_total
new_name: pod_network_tx_errors
action: insert
match_type: regexp
experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
- include: container_network_transmit_packets_total
new_name: pod_network_tx_packets
action: insert
match_type: regexp
experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
# filter out only renamed metrics which we care about
filter:
metrics:
include:
match_type: regexp
metric_names:
- new_container_.*
- pod_.*
# convert cumulative sum datapoints to delta
cumulativetodelta:
metrics:
- new_container_cpu_usage_seconds_total
- pod_cpu_usage_seconds_total
- pod_memory_pgfault
- pod_memory_pgmajfault
- pod_memory_hierarchical_pgfault
- pod_memory_hierarchical_pgmajfault
- pod_network_rx_bytes
- pod_network_rx_dropped
- pod_network_rx_errors
- pod_network_rx_packets
- pod_network_tx_bytes
- pod_network_tx_dropped
- pod_network_tx_errors
- pod_network_tx_packets
- new_container_memory_pgfault
- new_container_memory_pgmajfault
- new_container_memory_hierarchical_pgfault
- new_container_memory_hierarchical_pgmajfault
# convert delta to rate
deltatorate:
metrics:
- new_container_cpu_usage_seconds_total
- pod_cpu_usage_seconds_total
- pod_memory_pgfault
- pod_memory_pgmajfault
- pod_memory_hierarchical_pgfault
- pod_memory_hierarchical_pgmajfault
- pod_network_rx_bytes
- pod_network_rx_dropped
- pod_network_rx_errors
- pod_network_rx_packets
- pod_network_tx_bytes
- pod_network_tx_dropped
- pod_network_tx_errors
- pod_network_tx_packets
- new_container_memory_pgfault
- new_container_memory_pgmajfault
- new_container_memory_hierarchical_pgfault
- new_container_memory_hierarchical_pgmajfault
experimental_metricsgeneration/1:
rules:
- name: pod_network_total_bytes
unit: Bytes/Second
type: calculate
metric1: pod_network_rx_bytes
metric2: pod_network_tx_bytes
operation: add
- name: pod_memory_utilization_over_pod_limit
unit: Percent
type: calculate
metric1: pod_memory_working_set
metric2: pod_memory_limit
operation: percent
- name: pod_cpu_usage_total
unit: Millicore
type: scale
metric1: pod_cpu_usage_seconds_total
operation: multiply
# core to millicore: multiply by 1000
# millicore seconds to millicore nanoseconds: multiply by 10^9
scale_by: 1000
- name: pod_cpu_limit
unit: Millicore
type: scale
metric1: pod_cpu_limit_raw
operation: divide
scale_by: 100
experimental_metricsgeneration/2:
rules:
- name: pod_cpu_utilization_over_pod_limit
type: calculate
unit: Percent
metric1: pod_cpu_usage_total
metric2: pod_cpu_limit
operation: percent
# add `Type` and rename metrics and labels
metricstransform/label_2:
transforms:
- include: pod_.*
match_type: regexp
action: update
operations:
- action: add_label
new_label: Type
new_value: "Pod"
- include: new_container_.*
match_type: regexp
action: update
operations:
- action: add_label
new_label: Type
new_value: Container
- include: .*
match_type: regexp
action: update
operations:
- action: update_label
label: namespace
new_label: Namespace
- action: update_label
label: pod
new_label: PodName
- include: ^new_container_(.*)$$
match_type: regexp
action: update
new_name: container_$$1
# add cluster name from env variable and EKS metadata
resourcedetection:
detectors: [env, eks]
batch:
timeout: 60s
# only pod level metrics in metrics format, details in https://aws-otel.github.io/docs/getting-started/container-insights/eks-fargate
exporters:
awsemf:
log_group_name: '/aws/containerinsights/{ClusterName}/performance'
log_stream_name: '{PodName}'
namespace: 'ContainerInsights'
region: YOUR-AWS-REGION
resource_to_telemetry_conversion:
enabled: true
eks_fargate_container_insights_enabled: true
parse_json_encoded_attr_values: ["kubernetes"]
dimension_rollup_option: NoDimensionRollup
metric_declarations:
- dimensions: [ [ClusterName, LaunchType], [ClusterName, Namespace, LaunchType], [ClusterName, Namespace, PodName, LaunchType]]
metric_name_selectors:
- pod_cpu_utilization_over_pod_limit
- pod_cpu_usage_total
- pod_cpu_limit
- pod_memory_utilization_over_pod_limit
- pod_memory_working_set
- pod_memory_limit
- pod_network_rx_bytes
- pod_network_tx_bytes
extensions:
health_check:
service:
pipelines:
metrics:
receivers: [prometheus]
processors: [metricstransform/label_1, resourcedetection, metricstransform/rename, filter, cumulativetodelta, deltatorate, experimental_metricsgeneration/1, experimental_metricsgeneration/2, metricstransform/label_2, batch]
exporters: [awsemf]
extensions: [health_check]
# configure the service and the collector as a StatefulSet
---
apiVersion: v1
kind: Service
metadata:
name: adot-collector-service
namespace: fargate-container-insights
labels:
app: aws-adot
component: adot-collector
spec:
ports:
- name: metrics # default endpoint for querying metrics.
port: 8888
selector:
component: adot-collector
type: ClusterIP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: adot-collector
namespace: fargate-container-insights
labels:
app: aws-adot
component: adot-collector
spec:
selector:
matchLabels:
app: aws-adot
component: adot-collector
serviceName: adot-collector-service
template:
metadata:
labels:
app: aws-adot
component: adot-collector
spec:
serviceAccountName: adot-collector
securityContext:
fsGroup: 65534
containers:
- image: amazon/aws-otel-collector:v0.15.1
name: adot-collector
imagePullPolicy: Always
command:
- "/awscollector"
- "--config=/conf/adot-collector-config.yaml"
env:
- name: OTEL_RESOURCE_ATTRIBUTES
value: "ClusterName=YOUR-EKS-CLUSTER-NAME"
resources:
limits:
cpu: 1
memory: 2Gi
requests:
cpu: 1
memory: 2Gi
volumeMounts:
- name: adot-collector-config-volume
mountPath: /conf
volumes:
- configMap:
name: adot-collector-config
items:
- key: adot-collector-config
path: adot-collector-config.yaml
name: adot-collector-config-volume
---
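Assuming the manifest above is saved to a file such as adot-collector.yaml (the file name is arbitrary), and that the fargate-container-insights namespace exists (create it with kubectl create namespace fargate-container-insights if the earlier eksctl command did not create it), apply it with:
kubectl apply -f adot-collector.yaml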
Deploy a sample stateless workload to the cluster with the following deployment manifest.
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: webapp
namespace: golang
spec:
replicas: 2
selector:
matchLabels:
app: webapp
role: webapp-service
template:
metadata:
labels:
app: webapp
role: webapp-service
spec:
containers:
- name: go
image: public.ecr.aws/awsvijisarathy/prometheus-webapp:latest
imagePullPolicy: Always
resources:
requests:
cpu: "256m"
memory: "512Mi"
limits:
cpu: "256m"
memory: "512Mi"
Both of the workloads deployed above target namespaces associated with a Fargate profile, and hence they will be scheduled to run on Fargate. It may take a couple of minutes for a Fargate worker node to be provisioned for each of these workloads and for the pods to reach a Ready status. Executing the command kubectl get nodes -l eks.amazonaws.com/compute-type=fargate should now list the Fargate worker nodes, whose names carry the prefix fargate. Verify that the ADOT Collector and the workload pods are all running using the following commands:
- kubectl get pods -n fargate-container-insights
- kubectl get pods -n golang
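If the collector pod is running but no metrics appear in CloudWatch, inspecting the collector logs is a useful first troubleshooting step:
kubectl logs statefulset/adot-collector -n fargate-container-insights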
Visualizing EKS Fargate resource metrics using CloudWatch Container Insights
The performance log events for the workloads can be found under the log group named /aws/containerinsights/CLUSTER_NAME/performance. A separate log stream is created for each pod running on Fargate.
The following is a representative example of the JSON data, in embedded metric format, contained in one of these log events; it identifies the data as pertaining to the metrics pod_cpu_usage_total and pod_cpu_utilization_over_pod_limit.
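The sketch below is abridged and its values are hypothetical; the exact set of fields emitted by the exporter may differ, and the additional dimension sets are omitted for brevity.
{
  "_aws": {
    "Timestamp": 1637000000000,
    "CloudWatchMetrics": [
      {
        "Namespace": "ContainerInsights",
        "Dimensions": [["ClusterName", "Namespace", "PodName", "LaunchType"]],
        "Metrics": [
          { "Name": "pod_cpu_usage_total" },
          { "Name": "pod_cpu_utilization_over_pod_limit", "Unit": "Percent" }
        ]
      }
    ]
  },
  "ClusterName": "YOUR-EKS-CLUSTER-NAME",
  "Namespace": "golang",
  "PodName": "webapp-5d87f8c6d-abcde",
  "LaunchType": "fargate",
  "Type": "Pod",
  "pod_cpu_usage_total": 4.7,
  "pod_cpu_utilization_over_pod_limit": 1.8
}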
The same metric, pod_cpu_utilization_over_pod_limit, can also be graphed in the CloudWatch metrics console.
The metrics may also be visualized using the prebuilt Container Insights dashboards, which display data at the cluster, node, namespace, service, and pod level, including a cluster-level view of the EKS Fargate metrics.
Concluding remarks
This blog post presented an overview of the design of the ADOT Collector for EKS Fargate with support for CloudWatch Container Insights and demonstrated how to deploy it and collect metrics from workloads on an EKS Fargate cluster. A single collector instance was able to discover all the worker nodes in an EKS cluster through Kubernetes service discovery and to collect metrics from them by using the Kubernetes API server as a proxy for the kubelet on each worker node. EKS customers can now collect system metrics such as CPU, memory, disk, and network usage from workloads deployed to an EKS Fargate cluster and visualize them in CloudWatch dashboards, providing the same experience as the CloudWatch agent.