Container Insights for Amazon EKS Support AWS Distro for OpenTelemetry Collector

CloudWatch Container Insights collects, aggregates, and summarizes metrics from your containerized applications and microservices. Metrics are collected as log events using the embedded metric format (EMF), which enables high-cardinality data to be ingested and stored in designated CloudWatch log groups at scale. Amazon CloudWatch then creates aggregated CloudWatch metrics from the received EMF data and presents them in CloudWatch automatic dashboards.
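To make the idea of "metrics as log events" more concrete, here is a minimal Python sketch that builds one EMF log event. The "_aws" metadata layout follows the published EMF specification; the function name build_emf_event and the sample dimension and metric values are our own assumptions for illustration, not output from the collector.

```python
import json
import time


def build_emf_event(namespace, dimensions, metrics, timestamp_ms=None):
    """Build one embedded metric format (EMF) log event as a JSON string.

    dimensions: dict of dimension name -> value
    metrics:    dict of metric name -> (value, unit)
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    event = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # A single dimension set listing the dimension names.
                    "Dimensions": [sorted(dimensions)],
                    "Metrics": [
                        {"Name": name, "Unit": unit}
                        for name, (_, unit) in metrics.items()
                    ],
                }
            ],
        }
    }
    # Dimension values and metric values sit at the top level of the event.
    event.update(dimensions)
    event.update({name: value for name, (value, _) in metrics.items()})
    return json.dumps(event)


# Illustrative values only.
print(build_emf_event(
    "ContainerInsights",
    {"ClusterName": "eks-otel-v1", "NodeName": "example-node"},
    {"node_cpu_utilization": (12.5, "Percent")},
    timestamp_ms=1628455529505,
))
```

Because the metric definitions travel inside the log event, extra high-cardinality fields can be stored alongside them in the log group without each one becoming a CloudWatch metric.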

AWS Distro for OpenTelemetry (ADOT) Collector v0.11.0 is now available with Container Insights for Amazon EKS. With this support, we introduced a receiver called AWS Container Insights Receiver that collects infrastructure metrics for many resources, including CPU, memory, disk, and network, for different components of an EKS cluster. Together with the CloudWatch EMF Exporter, this receiver provides the same CloudWatch Container Insights experience for an EKS cluster. In other words, you can take full advantage of the ADOT Collector without any change in your existing Container Insights experience.

In this blog post, we’ll discuss the design of the AWS Container Insights Receiver and demonstrate how to use ADOT Collector to collect infrastructure metrics for Container Insights use cases.

Design of Container Insights support in ADOT Collector

The following diagram shows the architecture of the AWS Container Insights Receiver and how it works with other components to support Container Insights in Amazon EKS.

AWS Container Insights Receiver architecture diagram

Metric processing pipeline and its components

To collect infrastructure metrics for Amazon EKS Container Insights, the ADOT Collector is deployed as a DaemonSet, which means there is exactly one collector pod per worker node, as shown in the preceding diagram. This DaemonSet approach is supported only by Amazon EKS on EC2 worker nodes. The metric processing pipeline of the ADOT Collector is composed of three main parts, in order:

  • AWS Container Insights Receiver for collecting data from various sources
  • OpenTelemetry Processors for processing and manipulating existing metrics
  • CloudWatch EMF Exporter for pushing the metrics to a specific backend
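As a rough illustration of this ordering, the following Python sketch chains a receiver, a list of processors, and an exporter. The collector itself is written in Go; every name here (run_pipeline, fake_receiver, add_cluster_name) is hypothetical and exists only to show how data flows through the three stages.

```python
def run_pipeline(receive, processors, export):
    """Pass metrics from a receiver through each processor, then export them."""
    metrics = list(receive())
    for process in processors:
        metrics = process(metrics)
    export(metrics)
    return metrics


# Toy stages, for illustration only.
def fake_receiver():
    yield {"name": "node_cpu_utilization", "value": 12.5}
    yield {"name": "node_memory_utilization", "value": 40.0}


def add_cluster_name(metrics):
    # A processor returns a new list rather than mutating its input.
    return [{**m, "ClusterName": "eks-otel-v1"} for m in metrics]


exported = []
run_pipeline(fake_receiver, [add_cluster_name], exported.extend)
print(exported)
```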

The AWS Container Insights Receiver, which is the first component of the pipeline, collects container-related data from two main sources:

  • The cadvisor component embeds a customized cAdvisor (Container Advisor) library that pulls information from the container stats API and provides the resource usage and performance characteristics of running containers. We use a customized cAdvisor to collect certain metrics, which are categorized by infrastructure layer: node, node filesystem, node disk I/O, node network, pod, pod network, container, and container filesystem.
  • The k8sapiserver component is responsible for collecting cluster-level metrics from the Kubernetes API server. To ensure that only one ADOT Collector from the DaemonSet collects cluster-level metrics, k8sapiserver integrates with the Kubernetes client, which supports the leader election API. It uses a Kubernetes ConfigMap resource as a lock primitive and ensures that only one collector holds the lock and acts as the leader.
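The lock-based leader election can be pictured with a small, single-process Python sketch. It is a simplification under stated assumptions: InMemoryLock stands in for the ConfigMap lock record, the lease length is an arbitrary value, and a real cluster relies on the Kubernetes API server to make lock updates atomic. None of these names come from the collector's code.

```python
class InMemoryLock:
    """Stand-in for the ConfigMap lock record (hypothetical, in-memory only)."""

    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

    def try_acquire_or_renew(self, candidate, lease_seconds, now):
        """Take the lock if it is free, already ours, or its lease has expired."""
        if self.holder in (None, candidate) or now >= self.expires_at:
            self.holder = candidate
            self.expires_at = now + lease_seconds
            return True
        return False


def collect_cycle(lock, collector_id, now, lease_seconds=15.0):
    """Return which metric scopes this collector gathers in this cycle."""
    scopes = ["node"]  # every collector gathers metrics for its own node
    if lock.try_acquire_or_renew(collector_id, lease_seconds, now):
        scopes.append("cluster")  # only the leader gathers cluster-level metrics
    return scopes


lock = InMemoryLock()
print(collect_cycle(lock, "collector-a", now=0.0))
print(collect_cycle(lock, "collector-b", now=1.0))
```

If the leader stops renewing its lease, another collector takes over on a later cycle, so cluster-level metrics keep flowing from exactly one pod at a time.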

In addition to the two main data sources, the Receiver also gets additional metadata through the host and store components, which are also embedded in it. The host component gathers information about EKS worker nodes: the CPU and memory capacity of the worker from the proc file system, the instance ID and type from the EC2 metadata endpoint, the relevant Auto Scaling group information through EC2 APIs, and so on. The store component acts as an information store where all collected metadata is kept.

The information gathered through the host and store components is used by both cadvisor and k8sapiserver to enrich collected metrics for Amazon EKS Container Insights. In addition, cadvisor requires extra metadata to enrich pod and container metrics. For this reason, the Container Insights Receiver must also interact with the kubelet to list and cache all pod objects running on the same node, and it queries the Kubernetes API server to get the relevant service name for pods. Only cadvisor requires this information.
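The enrichment step can be sketched in Python as follows. This is only an illustration of the idea: the function enrich_pod_metric, the pod_id key, and the shapes of the pod cache and service lookup are assumptions we made up, not the receiver's actual data structures.

```python
def enrich_pod_metric(metric, host_info, pod_cache, service_by_pod):
    """Return a copy of a raw pod metric decorated with node and pod metadata.

    host_info:      node-level fields (instance ID, instance type, and so on)
    pod_cache:      pod objects cached per node, keyed by pod ID
    service_by_pod: (namespace, pod name) -> service name
    """
    enriched = dict(metric)
    enriched.update(host_info)
    pod = pod_cache.get(metric["pod_id"])
    if pod is not None:
        enriched["PodName"] = pod["name"]
        enriched["Namespace"] = pod["namespace"]
        service = service_by_pod.get((pod["namespace"], pod["name"]))
        if service:
            enriched["Service"] = service
    return enriched


# Illustrative inputs only.
host = {"InstanceId": "i-01234abcdef", "InstanceType": "m5.large"}
pods = {"uid-1": {"name": "web-0", "namespace": "default"}}
services = {("default", "web-0"): "web"}
print(enrich_pod_metric({"pod_id": "uid-1", "pod_cpu_utilization": 1.5},
                        host, pods, services))
```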

The second component of the pipeline is the set of OpenTelemetry Processors. They are optional but important, as they are responsible for pre-processing data before it is exported. Processors modify attributes and help ensure that data makes it through a pipeline successfully by retrying in case of a failure. Many OpenTelemetry Processors are available.

However, in our default configuration, we enable the Batch processor to improve the throughput of CloudWatch EMF log requests. With this Batch processor setting, the collected OpenTelemetry metrics data is batched in memory until it reaches either the timeout threshold (30s by default) or the batch size threshold (8192 items by default). You may need to add other processors supported by the ADOT Collector to satisfy your business and application requirements.
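The "flush on size or on timeout" behavior can be sketched in a few lines of Python. This is not the Batch processor's code: the class name Batcher, the tick method, and the choice to count buffered items are assumptions of this sketch, and the default thresholds simply reuse the numbers quoted above.

```python
import time


class Batcher:
    """Buffer items; flush when the batch is full or the timeout has elapsed."""

    def __init__(self, flush, max_items=8192, timeout_seconds=30.0,
                 clock=time.monotonic):
        self._flush = flush
        self._max_items = max_items
        self._timeout = timeout_seconds
        self._clock = clock
        self._items = []
        self._started = None  # when the first item of the current batch arrived

    def add(self, item):
        if not self._items:
            self._started = self._clock()
        self._items.append(item)
        if len(self._items) >= self._max_items:
            self.flush()

    def tick(self):
        """Call periodically; sends a partial batch once the timeout passes."""
        if self._items and self._clock() - self._started >= self._timeout:
            self.flush()

    def flush(self):
        if self._items:
            batch, self._items = self._items, []
            self._flush(batch)
```

Sending fewer, larger requests is what improves throughput; the timeout bounds how long a metric can wait in memory when traffic is low.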

The third and last component of the pipeline is the CloudWatch EMF Exporter (awsemf), which sends the metrics to the CloudWatch backend as EMF logs. The configuration for the awsemf exporter uses two main placeholders, {ClusterName} and {NodeName}, in the log group and log stream names. They are replaced dynamically with the name of your cluster and the name of the node on which the ADOT Collector is running.
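The placeholder substitution amounts to a simple string replacement, which the Python sketch below mimics. The function resolve_placeholders is hypothetical; the log group template matches the log group name shown later in this post, and the node name is taken from the sample log event.

```python
def resolve_placeholders(template, cluster_name, node_name):
    """Replace the {ClusterName} and {NodeName} placeholders in a name template."""
    return (template
            .replace("{ClusterName}", cluster_name)
            .replace("{NodeName}", node_name))


node = "ip-192-168-17-16.us-east-2.compute.internal"
log_group = resolve_placeholders(
    "/aws/containerinsights/{ClusterName}/performance", "eks-otel-v1", node)
log_stream = resolve_placeholders("{NodeName}", "eks-otel-v1", node)
print(log_group)
print(log_stream)
```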

Getting started

To use AWS OTel Collector to collect infrastructure metrics for an EKS cluster, you must first make sure all the prerequisites are satisfied.

Then you can deploy AWS OTel Collector as a DaemonSet to the cluster by entering the following command:

curl https://raw.githubusercontent.com/aws-observability/aws-otel-collector/main/deployment-template/eks/otel-container-insights-infra.yaml |
kubectl apply -f -

After you run the command to deploy AWS OTel Collector, a new namespace called “aws-otel-eks” is created, and all related Kubernetes objects are created within this namespace rather than the default namespace.

Verifying the installation:

You can run the following command to confirm that AWS OTel Collector is running successfully:

kubectl get pods -l name=aws-otel-eks-ci -n aws-otel-eks

Afterwards, you can also review the pod logs (replace aws-otel-eks-ci-8djzp with the name of one of your own pods) and compare them with the following expected output:

kubectl logs aws-otel-eks-ci-8djzp -n aws-otel-eks

Expected Output At Pod Level:

2021/08/08 00:18:19 AWS OTel Collector version: v0.11.0
2021/08/08 00:18:19 find no extra config, skip it, err: open /opt/aws/aws-otel-collector/etc/extracfg.txt: no such file or directory
2021-08-08T00:18:19.975Z info service/collector.go:262 Starting aws-otel-collector... {"Version": "v0.11.0", "NumCPU": 2}
2021-08-08T00:18:19.975Z info service/collector.go:170 Setting up own telemetry...
2021-08-08T00:18:19.976Z info service/telemetry.go:99 Serving Prometheus metrics {"address": ":8888", "level": 0, "service.instance.id": "ee7ecd58-340c-4166-b693-9fd688c27d60"}
2021-08-08T00:18:19.976Z info service/collector.go:205 Loading configuration...
2021-08-08T00:18:19.979Z info service/collector.go:221 Applying configuration...
...
2021-08-08T00:18:19.981Z info service/service.go:137 Starting extensions...
2021-08-08T00:18:19.981Z info builder/extensions_builder.go:53 Extension is starting... {"kind": "extension", "name": "health_check"}
2021-08-08T00:18:19.981Z info healthcheckextension/healthcheckextension.go:41 Starting health_check extension {"kind": "extension", "name": "health_check", "config": {"Port":0,"TCPAddr":{"Endpoint":"0.0.0.0:13133"}}}
2021-08-08T00:18:19.981Z info builder/extensions_builder.go:59 Extension started. {"kind": "extension", "name": "health_check"}
2021-08-08T00:18:19.981Z info service/service.go:182 Starting exporters...
...
2021-08-08T00:18:19.981Z info service/service.go:187 Starting processors...
...
2021-08-08T00:18:19.981Z info service/service.go:192 Starting receivers...
2021-08-08T00:18:19.983Z info host/ec2metadata.go:72 Fetch instance id and type from ec2 metadata {"kind": "receiver", "name": "awscontainerinsightreceiver"}
W0808 00:18:20.332742 1 manager.go:288] Could not configure a source for OOM detection, disabling OOM events: open /dev/kmsg: no such file or directory
2021-08-08T00:18:20.692Z info builder/receivers_builder.go:75 Receiver started. {"kind": "receiver", "name": "awscontainerinsightreceiver"}
2021-08-08T00:18:20.692Z info healthcheck/handler.go:129 Health Check state change {"kind": "extension", "name": "health_check", "status": "ready"}
2021-08-08T00:18:20.692Z info service/collector.go:182 Everything is ready. Begin running and processing data.
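If you prefer to check the logs programmatically rather than by eye, a small Python helper along these lines can look for the two messages that matter most in the output above. The function name collector_is_ready is our own, and the marker strings are copied from the sample output, so they may differ in other collector versions.

```python
def collector_is_ready(log_text):
    """Return True if the logs show the receiver started and the collector is ready."""
    required = (
        "Everything is ready. Begin running and processing data.",
        '"name": "awscontainerinsightreceiver"',
    )
    return all(marker in log_text for marker in required)


sample = (
    '2021-08-08T00:18:20.692Z info builder/receivers_builder.go:75 Receiver started. '
    '{"kind": "receiver", "name": "awscontainerinsightreceiver"}\n'
    '2021-08-08T00:18:20.692Z info service/collector.go:182 '
    'Everything is ready. Begin running and processing data.\n'
)
print(collector_is_ready(sample))
```

The log text could come from saving the output of the kubectl logs command to a file and reading it back.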

Important note: If you encounter a “not authorized” error such as the following, ensure that your EKS worker nodes have the required “CloudWatchAgentServerPolicy” IAM policy attached.

2021-08-08T18:58:59.667Z error awsemfexporter@v0.29.1-0.20210630203112-81d57601b1bc/cwlog_client.go:117 cwlog_client: Error occurs in PutLogEvents {"kind": "exporter", "name": "awsemf", "error": "AccessDeniedException: User: arn:aws:sts:: is not authorized to perform: logs:PutLogEvents on resource: arn:aws:logs:us-east-2:log-group:}

After the installation:

After you’ve deployed the collector and verified your setup, the AWS OTel Collector creates a log group named /aws/containerinsights/{your-cluster}/performance and begins sending performance log events to it. Each collector pod publishes logs to a log stream named after its worker node. In the following screenshot, three log streams are present under the log group /aws/containerinsights/eks-otel-v1/performance, and each corresponds to one worker node:

Log streams present under log group

The following is an example of performance log events captured within those log streams:

{
  "AutoScalingGroupName": "eks-e2bd9188-e94a-3339-af3b-bf09b661ba5f",
  "ClusterName": "eks-otel-v1",
  "InstanceId": "i-01234abcdef",
  "InstanceType": "m5.large",
  "NodeName": "ip-192-168-17-16.us-east-2.compute.internal",
  "Sources": [
    "cadvisor",
    "calculated"
  ],
  "Timestamp": "1628455529505",
  "Type": "NodeNet",
  "Version": "0",
  "interface": "eth0",
  "kubernetes": {
    "host": "ip-192-168-17-16.us-east-2.compute.internal"
  },
  "node_interface_network_rx_bytes": 5392.51738408303,
  "node_interface_network_rx_dropped": 0,
  "node_interface_network_rx_errors": 0,
  "node_interface_network_rx_packets": 16.09165306162301,
  "node_interface_network_total_bytes": 8418.50657774278,
  "node_interface_network_tx_bytes": 3025.9891936597514,
  "node_interface_network_tx_dropped": 0,
  "node_interface_network_tx_errors": 0,
  "node_interface_network_tx_packets": 16.668710292316455
}
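To show how such an event can be read back, the following Python sketch parses a trimmed copy of the event above. In this sample, the total bytes value appears to be the sum of the received and transmitted values, and the Timestamp field is a string of epoch milliseconds; the helper names are ours, and the relationship is an observation about this one sample rather than a documented guarantee.

```python
import json
import math
from datetime import datetime, timezone

SAMPLE_EVENT = """
{
  "ClusterName": "eks-otel-v1",
  "Timestamp": "1628455529505",
  "Type": "NodeNet",
  "node_interface_network_rx_bytes": 5392.51738408303,
  "node_interface_network_total_bytes": 8418.50657774278,
  "node_interface_network_tx_bytes": 3025.9891936597514
}
"""


def event_time(event):
    """Convert the Timestamp field (epoch milliseconds as a string) to UTC."""
    return datetime.fromtimestamp(int(event["Timestamp"]) / 1000, tz=timezone.utc)


def total_matches_rx_plus_tx(event, rel_tol=1e-9):
    """Check whether total bytes equals received plus transmitted bytes."""
    rx = event["node_interface_network_rx_bytes"]
    tx = event["node_interface_network_tx_bytes"]
    total = event["node_interface_network_total_bytes"]
    return math.isclose(rx + tx, total, rel_tol=rel_tol)


event = json.loads(SAMPLE_EVENT)
print(event_time(event).isoformat())
print(total_matches_rx_plus_tx(event))
```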

Amazon EKS Container Insights monitoring dashboards and metrics

With CloudWatch Container Insights using the AWS Distro for OpenTelemetry (ADOT) Collector, you can capture detailed metrics about your containerized workloads running on EKS clusters without additional configuration. The following are some of the metrics collected by default:

  • At worker node level:
    • node_cpu_utilization
    • node_memory_utilization
    • node_network_total_bytes
    • node_cpu_reserved_capacity
    • node_memory_reserved_capacity
    • node_number_of_running_pods
    • node_number_of_running_containers
  • At pod level:
    • pod_cpu_utilization
    • pod_memory_utilization
    • pod_network_rx_bytes
    • pod_network_tx_bytes
    • pod_cpu_utilization_over_pod_limit
    • pod_memory_utilization_over_pod_limit
    • pod_cpu_reserved_capacity
    • pod_memory_reserved_capacity
    • pod_number_of_container_restarts
  • Other relevant metrics:
    • cluster_node_count
    • cluster_failed_node_count
    • service_number_of_running_pods
    • node_filesystem_utilization

The metrics that Container Insights collects are also available in CloudWatch automatic dashboards, grouped by resource type, such as EKS Clusters, EKS Namespaces, EKS Nodes, EKS Services, and EKS Pods. The following is a screenshot of the pod-level metrics for a cluster named eks-otel-v1:

Pod-level metrics for cluster

During initial setup, note that it may take a few minutes for the collected metrics to be processed and appear in the Container Insights dashboard.

Summary

In this blog post, we went through the AWS Distro for OpenTelemetry (ADOT) Collector integration with Container Insights for Amazon EKS. We covered architecture details, important components, and the installation and setup verification steps. To learn more about AWS observability features on Amazon CloudWatch and AWS X-Ray, see our One Observability Demo workshop.

Ugur Kira

Ugur Kira is a Principal Specialist Technical Account Manager (STAM) - Containers based out of Dublin, Ireland. He joined AWS 10 years ago, has been a containers enthusiast for over 6 years, and is passionate about helping AWS users design modern container-based applications on AWS services. Ugur actively works with Amazon EKS, Amazon ECS, and AppMesh services and conducts proactive operational reviews around those services. He also has a special interest in improving observability capabilities in container-based applications.

Ping Xiang

Ping Xiang is a Software Engineer at AWS CloudWatch. He is one of the major contributors to the Observability and Monitoring projects. Ping has been working in the software industry for more than 5 years on various enterprise analytics and monitoring solutions.

Min Xia

Min Xia is a Sr. Software Engineer at AWS CloudWatch. Currently, he is the lead engineer on the team that contributes to Observability and Monitoring projects in AWS and open source communities. Min has more than a decade of experience as an engineer and has delivered many successful products in the monitoring, telecom, and eCommerce industries. He is also interested in AI and machine learning technologies that can make everything autonomous.