Containers

Cost savings by customizing metrics sent by Container Insights in Amazon EKS

AWS Distro for OpenTelemetry (ADOT) is an AWS-provided distribution of the OpenTelemetry project. The ADOT Collector receives and exports data from multiple sources and destinations. Amazon CloudWatch Container Insights now supports ADOT for Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon Elastic Container Service (Amazon ECS). This will enable customers to perform advanced configurations, such as customizing metrics that are sent to CloudWatch. The follow is a diagram of how ADOT Collector architecture looks for Amazon EKS:

ADOT Collector Pipeline

ADOT Collector Pipeline

As you can see in the previous graph, the ADOT Collector pipeline starts by using a receiver to collect metrics – in this case it is the Container Insights Receiver. It then uses Processors to transform or filter collected metrics. Finally, the ADOT Collector uses exporters to send metrics to different destinations. In this example, we are using the AWS EMF Exporter, which will convert processed metrics to CloudWatch embedded metric format logs. In this blog post, we will show you how to reduce CloudWatch Insight-associated costs by customizing metrics collected by the Container Insights receiver in the ADOT Collector for Amazon EKS clusters.

With the default configuration, the Container Insights receiver collects the complete set of metrics as defined by the receiver documentation. The number of metrics and dimensions collected is high, and for large clusters this will significantly increase the costs for metric ingestion and storage. We are going to demonstrate two different approaches that you can use to configure the ADOT Collector to send only metrics that bring value to you.

In this blog, we will explain how to configure the ADOT Collector for an Amazon EKS cluster, but note that metric customization using the approaches shown in this blog post are also applicable when using the ADOT Collector in Amazon ECS. Be aware that metric names for Amazon ECS are different, as shown in this documentation.

Installing ADOT Collector in EKS

To get Container Insights receiver collecting infrastructure data, you need to install the ADOT Collector as a daemonset within your Amazon EKS cluster. In this section, we describe the steps required to configure ADOT Collector in Amazon EKS.

Set up IAM role for service account

To improve security for the ADOT Collector agent, you need to enable IAM roles for your service account. This is so you can assign IAM permissions to the ADOT Collector pod.

export CLUSTER_NAME=<eks-cluster-name>
export AWS_REGION=<e.g. us-east-1>
export AWS_ACCOUNT_ID=<AWS account ID>
  • Enable IAM OIDC provider
eksctl utils associate-iam-oidc-provider --region=$AWS_REGION \
    --cluster=$CLUSTER_NAME \
    --approve
  • Create IAM policy for the Collector
 cat << EOF > AWSDistroOpenTelemetryPolicy.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
        "Effect": "Allow",
        "Action": [
            "logs:PutLogEvents",
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:DescribeLogStreams",
            "logs:DescribeLogGroups",
            "xray:PutTraceSegments",
            "xray:PutTelemetryRecords",
            "xray:GetSamplingRules",
            "xray:GetSamplingTargets",
            "xray:GetSamplingStatisticSummaries",
            "cloudwatch:PutMetricData",
            "ec2:DescribeVolumes",
            "ec2:DescribeTags",
            "ssm:GetParameters"
        ],
        "Resource": "*"
        }
    ]
}
EOF

aws iam create-policy \
    --policy-name AWSDistroOpenTelemetryPolicy \
    --policy-document file://AWSDistroOpenTelemetryPolicy.json 
  • Download ADOT Collector Kubernetes manifest
curl -s -O https://raw.githubusercontent.com/aws-observability/aws-otel-collector/main/deployment-template/eks/otel-container-insights-infra.yaml
  • Install Kubernetes manifest
kubectl apply -f otel-container-insights-infra.yaml
  • Create IAM role and configure IRSA for the Collector
eksctl create iamserviceaccount \
    --name aws-otel-sa \
    --namespace aws-otel-eks \
    --cluster ${CLUSTER_NAME} \
    --attach-policy-arn arn:aws:iam::${AWS_ACCOUNT_ID}:policy/AWSDistroOpenTelemetryPolicy \
    --approve \
    --override-existing-serviceaccounts
  • Restart ADOT Collector pods to start using IRSA
kubectl delete pods -n aws-otel-eks -l name=aws-otel-eks-ci

Now you have the ADOT Collector installed in your Amazon EKS cluster.Let’s review the two available approaches that you have to customize metrics sent by the Collector.

Option 1: Filter metrics using processors

This approach involves the introduction of OpenTelemetry processors to filter out metrics or attributes to reduce the size of EMF logs. In this section, we will demonstrate the basic usage of two processors. For more detailed information about these processors, you can refer to this documentation.

Filter processor

Filter processor is part of the AWS OpenTelemetry distribution. It can be used as part of the metrics collection pipeline to filter out unwanted metrics.  For example, suppose that you want Container Insights to only collect pod-level metrics (with name prefix pod_) excluding those for networking, with name prefix pod_network. You can add the filter processor into the pipeline by editing the Kubernetes manifest otel-container-insights-infra.yaml downloaded in the previous Installing ADOT Collector in EKS section. Then modify ConfigMap named otel-agent-conf to include filter processors as follows:

extensions:
  health_check:
  
receivers:
  awscontainerinsightreceiver: 

processors:
  # filter processors example
  filter/include:
    # any names NOT matching filters are excluded from remainder of pipeline
    metrics:
      include:
        match_type: regexp
        metric_names:
          # re2 regexp patterns
          - ^pod_.*
  filter/exclude:
    # any names matching filters are excluded from remainder of pipeline
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - ^pod_network.*
  
  batch/metrics:
    timeout: 60s

exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    log_stream_name: '{NodeName}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources, kubernetes]
    metric_declarations:
      # node metrics
      - dimensions: [[NodeName, InstanceId, ClusterName]]
        metric_name_selectors:
          - node_cpu_utilization
          - node_memory_utilization
          - node_network_total_bytes
          - node_cpu_reserved_capacity
          - node_memory_reserved_capacity
          - node_number_of_running_pods
          - node_number_of_running_containers
      - dimensions: [[ClusterName]]
        metric_name_selectors:
          - node_cpu_utilization
          - node_memory_utilization
          - node_network_total_bytes
          - node_cpu_reserved_capacity
          - node_memory_reserved_capacity
          - node_number_of_running_pods
          - node_number_of_running_containers
          - node_cpu_usage_total
          - node_cpu_limit
          - node_memory_working_set
          - node_memory_limit
   
      # pod metrics
      - dimensions: [[PodName, Namespace, ClusterName], [Service, Namespace, ClusterName], [Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization
          - pod_network_rx_bytes
          - pod_network_tx_bytes
          - pod_cpu_utilization_over_pod_limit
          - pod_memory_utilization_over_pod_limit
      - dimensions: [[PodName, Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - pod_cpu_reserved_capacity
          - pod_memory_reserved_capacity
      - dimensions: [[PodName, Namespace, ClusterName]]
        metric_name_selectors:
          - pod_number_of_container_restarts
          - container_cpu_limit
          - container_cpu_request
          - container_cpu_utilization
          - container_memory_limit
          - container_memory_request
          - container_memory_utilization
          - container_memory_working_set
    
      # cluster metrics
      - dimensions: [[ClusterName]]
        metric_name_selectors:
          - cluster_node_count
          - cluster_failed_node_count
    
      # service metrics
      - dimensions: [[Service, Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - service_number_of_running_pods
    
      # node fs metrics
      - dimensions: [[NodeName, InstanceId, ClusterName], [ClusterName]]
        metric_name_selectors:
          - node_filesystem_utilization
    
      # namespace metrics
      - dimensions: [[Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - namespace_number_of_running_pods

service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      # Add filter processors to the pipeline
      processors: [filter/include, filter/exclude, batch/metrics]
      exporters: [awsemf]
      
extensions: [health_check]

Resource processor

Resource processor is also built into the AWS OpenTelemetry Distro and can be used to remove unwanted metric attributes. For example, if you want to remove the Kubernetes and Sources fields from the EMF logs, you can add the resource processor to the pipeline:

extensions:
  health_check:
  
receivers:
  awscontainerinsightreceiver: 

processors:
  filter/include:
    # any names NOT matching filters are excluded from remainder of pipeline
    metrics:
      include:
        match_type: regexp
        metric_names:
          # re2 regexp patterns
          - ^pod_.*
  filter/exclude:
    # any names matching filters are excluded from remainder of pipeline
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - ^pod_network.*
  # resource processors example
  resource:
    attributes:
    - key: Sources
      action: delete
    - key: kubernetes
      action: delete
      
  batch/metrics:
    timeout: 60s

exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    log_stream_name: '{NodeName}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources, kubernetes]
    metric_declarations:
      # node metrics
      - dimensions: [[NodeName, InstanceId, ClusterName]]
        metric_name_selectors:
          - node_cpu_utilization
          - node_memory_utilization
          - node_network_total_bytes
          - node_cpu_reserved_capacity
          - node_memory_reserved_capacity
          - node_number_of_running_pods
          - node_number_of_running_containers
      - dimensions: [[ClusterName]]
        metric_name_selectors:
          - node_cpu_utilization
          - node_memory_utilization
          - node_network_total_bytes
          - node_cpu_reserved_capacity
          - node_memory_reserved_capacity
          - node_number_of_running_pods
          - node_number_of_running_containers
          - node_cpu_usage_total
          - node_cpu_limit
          - node_memory_working_set
          - node_memory_limit
   
      # pod metrics
      - dimensions: [[PodName, Namespace, ClusterName], [Service, Namespace, ClusterName], [Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization
          - pod_network_rx_bytes
          - pod_network_tx_bytes
          - pod_cpu_utilization_over_pod_limit
          - pod_memory_utilization_over_pod_limit
      - dimensions: [[PodName, Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - pod_cpu_reserved_capacity
          - pod_memory_reserved_capacity
      - dimensions: [[PodName, Namespace, ClusterName]]
        metric_name_selectors:
          - pod_number_of_container_restarts
          - container_cpu_limit
          - container_cpu_request
          - container_cpu_utilization
          - container_memory_limit
          - container_memory_request
          - container_memory_utilization
          - container_memory_working_set
    
      # cluster metrics
      - dimensions: [[ClusterName]]
        metric_name_selectors:
          - cluster_node_count
          - cluster_failed_node_count
    
      # service metrics
      - dimensions: [[Service, Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - service_number_of_running_pods
    
      # node fs metrics
      - dimensions: [[NodeName, InstanceId, ClusterName], [ClusterName]]
        metric_name_selectors:
          - node_filesystem_utilization
    
      # namespace metrics
      - dimensions: [[Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - namespace_number_of_running_pods

service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      # Add resource processor to the pipeline
      processors: [filter/include, filter/exclude, resource, batch/metrics]
      exporters: [awsemf]
      
extensions: [health_check]

The processor approach is more generic for the ADOT Collector and it is the only way to customize metrics that can be sent to different destinations. Also, as customization and filtering happens in an early stage of the pipeline, it is efficient and can manage high volume of metrics with minimal impact on performance.

Option 2: Customize metrics and dimensions

In this approach, instead of using OpenTelemetry processors, you will configure the CloudWatch EMF exporter to generate only the set of metrics that you want to send to CloudWatch Logs. The metric_declaration section of CloudWatch EMF exporter configuration can be used to define the set of metrics and dimensions that you want to export. For example, you can keep only pod metrics from the default configuration.  This metric_declaration section will look like the following:

extensions:
  health_check:

receivers:
  awscontainerinsightreceiver:

processors:
  batch/metrics:
    timeout: 60s

exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    log_stream_name: '{NodeName}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources, kubernetes]
    # Customized metric declaration section
    metric_declarations:
      # pod metrics
      - dimensions: [[PodName, Namespace, ClusterName], [Service, Namespace, ClusterName], [Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization
          - pod_network_rx_bytes
          - pod_network_tx_bytes
          - pod_cpu_utilization_over_pod_limit
          - pod_memory_utilization_over_pod_limit

service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      processors: [batch/metrics]
      exporters: [awsemf]

extensions: [health_check]

To reduce the number of metrics, you can keep the dimension set only [PodName, Namespace, ClusterName] if you do not care about others:

extensions:
  health_check:

receivers:
  awscontainerinsightreceiver:

processors:
  batch/metrics:
    timeout: 60s

exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    log_stream_name: '{NodeName}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources, kubernetes]
    metric_declarations:
      # pod metrics
      - dimensions: [[PodName, Namespace, ClusterName]] # Reduce exported dimensions
        metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization
          - pod_network_rx_bytes
          - pod_network_tx_bytes
          - pod_cpu_utilization_over_pod_limit
          - pod_memory_utilization_over_pod_limit

service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      processors: [batch/metrics]
      exporters: [awsemf]

extensions: [health_check]

Additionally, if you want to ignore the pod network metrics, you can delete the metrics pod_network_rx_bytes and pod_network_tx_bytes. Suppose that you are interested in the dimension PodName. You can add it to the dimension set [PodName, Namespace, ClusterName].  With the previous customization, the final metric_declarations will look like the following:

extensions:
  health_check:

receivers:
  awscontainerinsightreceiver:

processors:
  batch/metrics:
    timeout: 60s

exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    log_stream_name: '{NodeName}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources, kubernetes]
    metric_declarations:
      # reduce pod metrics by removing network metrics
      - dimensions: [[PodName, Namespace, ClusterName]]
        metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization
          - pod_cpu_utilization_over_pod_limit
          - pod_memory_utilization_over_pod_limit

service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      processors: [batch/metrics]
      exporters: [awsemf]

extensions: [health_check]

This configuration will produce and stream the following four metrics within single dimension [PodName, Namespace, ClusterName] rather than 55 different metrics for multiple dimensions in the default configuration:

  • pod_cpu_utilization
  • pod_memory_utilization
  • pod_cpu_utilization_over_pod_limit
  • pod_memory_utilization_over_pod_limit

With this configuration, you will only send the metrics that you are interested in rather than all the metrics configured by default. As a result, you will be able to decrease metric ingestion cost for Container Insights considerably. Having this flexibility will provide Container Insights costumers with high level of control over metrics being exported.

Customizing metrics by modifying the awsemf exporter configuration is also highly flexible, and you can customize both the metrics that you want to send and their dimensions. Note that this is only applicable to logs that are sent to CloudWatch.

Conclusions

The two approaches demonstrated in this blog post are not mutually exclusive with each other. In fact, they both can be combined for a high degree of flexibility in customizing metrics that we want ingested into our monitoring system. We use this approach to decrease costs associated with metric storage and processing, as show in following graph.

Daily cost for CloudWatch service using ADOT Collector with different configurations.

Cost Explorer

In the preceding Cost Explorer graph, we can see daily cost associated with CloudWatch using different configurations on the ADOT Collector in a small EKS cluster (20 Worker nodes, 220 pods). Aug 15th shows CloudWatch bill using ADOT Collector with the default configuration. On Aug 16th, we have used the Customize EMF exporter approach and can see about 30% cost savings. On Aug 17th, we used the Processors approach, which achieves about 45% costs saving.

You must consider the trade-offs of customizing metrics sent by Container Insights as you will be able to decrease monitoring costs by sacrificing visibility of the monitored cluster. But also, the built-in dashboard provided by Container Insights within the AWS Console can be impacted by customized metrics as you can select not sending metrics and dimensions used by the dashboard.

AWS Distro for OpenTelemetry support for Container Insights metrics on Amazon EKS and Amazon ECS is available now and you can start using it today. To learn more about ADOT, read the official documentation and check out the Container Insights for EKS Support AWS Distro for OpenTelemetry Collector blog post.

This is an open source project and welcomes your pull requests! We will be tracking the upstream repository and plan to release a fresh version of the toolkit monthly. If you need feedback/review for AWS related components, feel free to tag us on GitHub PR/Issues. You can also open issues in ADOT repo directly if you have questions.

Juan Mesa

Juan Mesa

Juan Mesa is a Specialist Technical Account Manager for Container Technologies at Amazon Web Services (AWS). He work with AWS enterprise customers helping them to optimize containerized workloads in intensive production environment. He loves automation and bring application from Dev to Ops in a secure and reliable environment.

Ping Xiang

Ping Xiang

Ping Xiang is a Software Engineer at AWS CloudWatch. He is the one of the major contributors to the Observability and Monitoring projects. Ping has been working in software industry for more than 5 years with various enterprise analytics and monitoring solutions.

Ugur KIRA

Ugur KIRA

Ugur KIRA is a Sr. Specialist Technical Account Manager (STAM) - Containers based out of Dublin, Ireland. He joined AWS 7 years back, a containers enthusiast over 3 years and passionate about helping AWS customers to design modern container based applications on AWS services. Ugur is actively working with EKS, ECS, AppMesh services and conducts proactive operational reviews around those services. He also has special interest in improving observability capabilities in containers based applications .

Min Xia

Min Xia

Min Xia is a Sr. Software Engineer at AWS CloudWatch. Currently, he is the lead engineer in the team who is contributing to Observability and Monitoring projects in AWS and Open Source communities. Min has more than a decade experience as an accomplished engineer who has delivered many successful products in Monitoring, Telecom and eCommerce industries. He is also interested in AI and machine learning technologies that make “everything can be autonomous”.