Cost savings by customizing metrics sent by Container Insights in Amazon EKS

This article is a translation of Cost savings by customizing metrics sent by Container Insights in Amazon EKS (original published on December 17, 2021).

AWS Distro for OpenTelemetry (ADOT) is the AWS-supported distribution of the OpenTelemetry project. The ADOT Collector receives data from multiple sources and exports it. Amazon CloudWatch Container Insights now supports ADOT for Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon Elastic Container Service (Amazon ECS), which enables advanced configuration such as customizing the metrics sent to CloudWatch. The following figure shows the architecture of the ADOT Collector on Amazon EKS.

ADOT Collector pipeline

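As the figure shows, the ADOT Collector pipeline starts with a receiver that collects metrics (in this case, the Container Insights receiver). Processors then transform or filter the collected metrics. Finally, the ADOT Collector uses exporters to send the metrics to various destinations. This example uses the AWS EMF exporter, which converts the processed metrics into logs in the CloudWatch embedded metric format (EMF). This blog post shows how to reduce the costs associated with CloudWatch Container Insights by customizing the metrics collected by the Container Insights receiver of the ADOT Collector in an Amazon EKS cluster.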

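With the default configuration, the Container Insights receiver collects every metric defined in the receiver documentation. That is a large number of metrics and dimensions, and on large clusters it increases the cost of metric ingestion and storage. This post therefore introduces two different approaches for configuring the ADOT Collector to send only the metrics that are valuable to you.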

Although this post shows how to configure the ADOT Collector on an Amazon EKS cluster, the same approaches to customizing metrics also apply to Amazon ECS. Note, however, that the metric names differ for Amazon ECS, as described in the documentation.

Install the ADOT Collector on EKS

To collect infrastructure data with the Container Insights receiver, the ADOT Collector must be installed on the Amazon EKS cluster as a DaemonSet. This section walks through the steps to set up the ADOT Collector on Amazon EKS.

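Set up IAM roles for service accounts

To improve the security of the ADOT Collector, enable IAM roles for service accounts (IRSA). IRSA lets you assign IAM permissions to the ADOT Collector Pods. Start by exporting the environment variables used in the commands that follow.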

export CLUSTER_NAME=<eks-cluster-name>
export AWS_REGION=<e.g. us-east-1>
export AWS_ACCOUNT_ID=<AWS account ID>

Enable the IAM OIDC provider

eksctl utils associate-iam-oidc-provider --region=$AWS_REGION \
    --cluster=$CLUSTER_NAME \
    --approve

Create the IAM policy for the Collector

cat << EOF > AWSDistroOpenTelemetryPolicy.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
        "Effect": "Allow",
        "Action": [
            "logs:PutLogEvents",
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:DescribeLogStreams",
            "logs:DescribeLogGroups",
            "xray:PutTraceSegments",
            "xray:PutTelemetryRecords",
            "xray:GetSamplingRules",
            "xray:GetSamplingTargets",
            "xray:GetSamplingStatisticSummaries",
            "cloudwatch:PutMetricData",
            "ec2:DescribeVolumes",
            "ec2:DescribeTags",
            "ssm:GetParameters"
        ],
        "Resource": "*"
        }
    ]
}
EOF

aws iam create-policy \
    --policy-name AWSDistroOpenTelemetryPolicy \
    --policy-document file://AWSDistroOpenTelemetryPolicy.json

Download the Kubernetes manifest for the ADOT Collector

curl -s -O https://raw.githubusercontent.com/aws-observability/aws-otel-collector/main/deployment-template/eks/otel-container-insights-infra.yaml

Apply the Kubernetes manifest

kubectl apply -f otel-container-insights-infra.yaml

Create an IAM role and configure IRSA for the Collector

eksctl create iamserviceaccount \
    --name aws-otel-sa \
    --namespace aws-otel-eks \
    --cluster ${CLUSTER_NAME} \
    --attach-policy-arn arn:aws:iam::${AWS_ACCOUNT_ID}:policy/AWSDistroOpenTelemetryPolicy \
    --approve \
    --override-existing-serviceaccounts

Restart the ADOT Collector Pods so that they use IRSA

kubectl delete pods -n aws-otel-eks -l name=aws-otel-eks-ci

The ADOT Collector is now installed on the Amazon EKS cluster. Next, let's look at the two approaches to customizing the metrics the Collector sends.

Option 1: Filter metrics using processors

This approach introduces OpenTelemetry processors that filter metrics and attributes to reduce the size of the EMF logs. This section covers basic usage of two processors; for details, see the documentation for each processor.

Filter processor

The filter processor is included in the AWS Distro for OpenTelemetry and can filter out unwanted metrics as part of the metrics collection pipeline. For example, suppose you want Container Insights to collect only Pod-level metrics (those with the pod_ prefix) while excluding the Pod network metrics, which have the pod_network prefix. To add the filter processors to the pipeline, edit the Kubernetes manifest named otel-container-insights-infra.yaml that you downloaded earlier when installing the ADOT Collector: in that manifest, edit the ConfigMap named otel-agent-conf to include the filter processors as shown below.

extensions:
  health_check:

receivers:
  awscontainerinsightreceiver: 

processors:
  # filter processor example
  filter/include:
    # metric names that do not match the filter are dropped from the rest of the pipeline
    metrics:
      include:
        match_type: regexp
        metric_names:
          # re2 regexp patterns
          - ^pod_.*
  filter/exclude:
    # metric names that match the filter are dropped from the rest of the pipeline
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - ^pod_network.*

  batch/metrics:
    timeout: 60s

exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    log_stream_name: '{NodeName}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources, kubernetes]
    metric_declarations:
      # node metrics
      - dimensions: [[NodeName, InstanceId, ClusterName]]
        metric_name_selectors:
          - node_cpu_utilization
          - node_memory_utilization
          - node_network_total_bytes
          - node_cpu_reserved_capacity
          - node_memory_reserved_capacity
          - node_number_of_running_pods
          - node_number_of_running_containers
      - dimensions: [[ClusterName]]
        metric_name_selectors:
          - node_cpu_utilization
          - node_memory_utilization
          - node_network_total_bytes
          - node_cpu_reserved_capacity
          - node_memory_reserved_capacity
          - node_number_of_running_pods
          - node_number_of_running_containers
          - node_cpu_usage_total
          - node_cpu_limit
          - node_memory_working_set
          - node_memory_limit
 
      # pod metrics
      - dimensions: [[PodName, Namespace, ClusterName], [Service, Namespace, ClusterName], [Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization
          - pod_network_rx_bytes
          - pod_network_tx_bytes
          - pod_cpu_utilization_over_pod_limit
          - pod_memory_utilization_over_pod_limit
      - dimensions: [[PodName, Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - pod_cpu_reserved_capacity
          - pod_memory_reserved_capacity
      - dimensions: [[PodName, Namespace, ClusterName]]
        metric_name_selectors:
          - pod_number_of_container_restarts
          - container_cpu_limit
          - container_cpu_request
          - container_cpu_utilization
          - container_memory_limit
          - container_memory_request
          - container_memory_utilization
          - container_memory_working_set

      # cluster metrics
      - dimensions: [[ClusterName]]
        metric_name_selectors:
          - cluster_node_count
          - cluster_failed_node_count

      # service metrics
      - dimensions: [[Service, Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - service_number_of_running_pods

      # node fs metrics
      - dimensions: [[NodeName, InstanceId, ClusterName], [ClusterName]]
        metric_name_selectors:
          - node_filesystem_utilization

      # namespace metrics
      - dimensions: [[Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - namespace_number_of_running_pods

service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      # add the filter processors to the pipeline
      processors: [filter/include, filter/exclude, batch/metrics]
      exporters: [awsemf]

  extensions: [health_check]
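
To see what the two filter processors keep, the include and exclude patterns can be tried against a few metric names outside the Collector. The following is only a rough sketch: it uses `grep -E` (POSIX extended regular expressions) rather than the RE2 engine the Collector uses, which should behave the same way for these simple anchored patterns, and the sample list of names is an illustrative subset taken from the configuration above.

```shell
#!/bin/sh
# Illustrative subset of metric names from the metric_declarations above.
metrics="node_cpu_utilization
pod_cpu_utilization
pod_memory_utilization
pod_network_rx_bytes
pod_network_tx_bytes
cluster_node_count"

# filter/include: keep only names matching ^pod_.*
# filter/exclude: then drop names matching ^pod_network.*
kept=$(printf '%s\n' "$metrics" | grep -E '^pod_.*' | grep -Ev '^pod_network.*')

printf '%s\n' "$kept"
```

Of the six sample names, only pod_cpu_utilization and pod_memory_utilization remain, which mirrors the order of the processors in the pipeline: include first, then exclude.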

Resource processor

The resource processor is also built into the AWS Distro for OpenTelemetry and can be used to remove unwanted metric attributes. For example, to remove the kubernetes and Sources fields from the EMF logs, add the resource processor to the pipeline.

extensions:
  health_check:

receivers:
  awscontainerinsightreceiver: 

processors:
  filter/include:
    # metric names that do not match the filter are dropped from the rest of the pipeline
    metrics:
      include:
        match_type: regexp
        metric_names:
          # re2 regexp patterns
          - ^pod_.*
  filter/exclude:
    # metric names that match the filter are dropped from the rest of the pipeline
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - ^pod_network.*
  # resource processor example
  resource:
    attributes:
    - key: Sources
      action: delete
    - key: kubernetes
      action: delete

  batch/metrics:
    timeout: 60s

exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    log_stream_name: '{NodeName}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources, kubernetes]
    metric_declarations:
      # node metrics
      - dimensions: [[NodeName, InstanceId, ClusterName]]
        metric_name_selectors:
          - node_cpu_utilization
          - node_memory_utilization
          - node_network_total_bytes
          - node_cpu_reserved_capacity
          - node_memory_reserved_capacity
          - node_number_of_running_pods
          - node_number_of_running_containers
      - dimensions: [[ClusterName]]
        metric_name_selectors:
          - node_cpu_utilization
          - node_memory_utilization
          - node_network_total_bytes
          - node_cpu_reserved_capacity
          - node_memory_reserved_capacity
          - node_number_of_running_pods
          - node_number_of_running_containers
          - node_cpu_usage_total
          - node_cpu_limit
          - node_memory_working_set
          - node_memory_limit
 
      # pod metrics
      - dimensions: [[PodName, Namespace, ClusterName], [Service, Namespace, ClusterName], [Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization
          - pod_network_rx_bytes
          - pod_network_tx_bytes
          - pod_cpu_utilization_over_pod_limit
          - pod_memory_utilization_over_pod_limit
      - dimensions: [[PodName, Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - pod_cpu_reserved_capacity
          - pod_memory_reserved_capacity
      - dimensions: [[PodName, Namespace, ClusterName]]
        metric_name_selectors:
          - pod_number_of_container_restarts
          - container_cpu_limit
          - container_cpu_request
          - container_cpu_utilization
          - container_memory_limit
          - container_memory_request
          - container_memory_utilization
          - container_memory_working_set

      # cluster metrics
      - dimensions: [[ClusterName]]
        metric_name_selectors:
          - cluster_node_count
          - cluster_failed_node_count

      # service metrics
      - dimensions: [[Service, Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - service_number_of_running_pods

      # node fs metrics
      - dimensions: [[NodeName, InstanceId, ClusterName], [ClusterName]]
        metric_name_selectors:
          - node_filesystem_utilization

      # namespace metrics
      - dimensions: [[Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - namespace_number_of_running_pods

service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      # add the resource processor to the pipeline
      processors: [filter/include, filter/exclude, resource, batch/metrics]
      exporters: [awsemf]

  extensions: [health_check]

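The processor-based approach is the more general one for the ADOT Collector, and it is the only way to customize metrics that may be sent to different destinations. Because customization and filtering happen early in the pipeline, it is also efficient and can handle large volumes of metrics with minimal performance impact.

Option 2: Customize metrics and dimensions

Instead of using OpenTelemetry processors, this approach configures the CloudWatch EMF exporter to generate only the metrics you want to send to CloudWatch Logs. The metric_declarations section of the CloudWatch EMF exporter configuration defines the metrics and dimensions to export. For example, you can keep only the Pod metrics from the default configuration. The metric_declarations section can then be configured as follows.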

extensions:
  health_check:

receivers:
  awscontainerinsightreceiver:

processors:
  batch/metrics:
    timeout: 60s

exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    log_stream_name: '{NodeName}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources, kubernetes]
    # Customized metric declaration section
    metric_declarations:
      # pod metrics
      - dimensions: [[PodName, Namespace, ClusterName], [Service, Namespace, ClusterName], [Namespace, ClusterName], [ClusterName]]
        metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization
          - pod_network_rx_bytes
          - pod_network_tx_bytes
          - pod_cpu_utilization_over_pod_limit
          - pod_memory_utilization_over_pod_limit

service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      processors: [batch/metrics]
      exporters: [awsemf]

  extensions: [health_check]

If you do not need the other dimension sets, you can reduce the number of metrics further by keeping only the [PodName, Namespace, ClusterName] dimension set.

extensions:
  health_check:

receivers:
  awscontainerinsightreceiver:

processors:
  batch/metrics:
    timeout: 60s

exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    log_stream_name: '{NodeName}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources, kubernetes]
    metric_declarations:
      # pod metrics
      - dimensions: [[PodName, Namespace, ClusterName]] # Reduce exported dimensions
        metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization
          - pod_network_rx_bytes
          - pod_network_tx_bytes
          - pod_cpu_utilization_over_pod_limit
          - pod_memory_utilization_over_pod_limit

service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      processors: [batch/metrics]
      exporters: [awsemf]

  extensions: [health_check]

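In addition, if you want to ignore the Pod network metrics, you can remove the pod_network_rx_bytes and pod_network_tx_bytes metrics. If you are interested in the PodName dimension, keep it in the configured dimension set, [PodName, Namespace, ClusterName]. Combined with the customizations above, the final metric_declarations looks like this.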

extensions:
  health_check:

receivers:
  awscontainerinsightreceiver:

processors:
  batch/metrics:
    timeout: 60s

exporters:
  awsemf:
    namespace: ContainerInsights
    log_group_name: '/aws/containerinsights/{ClusterName}/performance'
    log_stream_name: '{NodeName}'
    resource_to_telemetry_conversion:
      enabled: true
    dimension_rollup_option: NoDimensionRollup
    parse_json_encoded_attr_values: [Sources, kubernetes]
    metric_declarations:
      # reduce pod metrics by removing network metrics
      - dimensions: [[PodName, Namespace, ClusterName]]
        metric_name_selectors:
          - pod_cpu_utilization
          - pod_memory_utilization
          - pod_cpu_utilization_over_pod_limit
          - pod_memory_utilization_over_pod_limit

service:
  pipelines:
    metrics:
      receivers: [awscontainerinsightreceiver]
      processors: [batch/metrics]
      exporters: [awsemf]

  extensions: [health_check]

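Whereas the default configuration generates 55 different metrics across multiple dimension sets, this configuration generates and streams only the following four metrics, under the single dimension set [PodName, Namespace, ClusterName].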

pod_cpu_utilization
pod_memory_utilization
pod_cpu_utilization_over_pod_limit
pod_memory_utilization_over_pod_limit

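With this configuration, the Collector sends only the metrics you care about instead of every metric configured by default. As a result, you can significantly reduce the metric ingestion cost of Container Insights. This flexibility gives Container Insights customers a high level of control over the metrics they export.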
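
As a rough way to compare the three Pod-only configurations above, one can count the combinations of metric name and dimension set that each metric_declarations entry asks the exporter to emit. This is a back-of-the-envelope sketch, not a billing formula: the helper `combos` is hypothetical, and the number of metrics CloudWatch actually stores also depends on how many distinct Pods, Services, and Namespaces exist in the cluster.

```shell
#!/bin/sh
# combos DIMENSION_SETS METRIC_NAMES
# Hypothetical helper: name/dimension-set combinations for one declaration.
combos() {
    echo $(( $1 * $2 ))
}

# Pod metrics only: 4 dimension sets, 6 metric names.
pod_only=$(combos 4 6)
# Only [PodName, Namespace, ClusterName]: 1 dimension set, 6 metric names.
single_set=$(combos 1 6)
# Network metrics removed as well: 1 dimension set, 4 metric names.
final=$(combos 1 4)

echo "pod metrics only:        $pod_only"
echo "single dimension set:    $single_set"
echo "without network metrics: $final"
```

Under these assumptions the three configurations yield 24, 6, and 4 combinations respectively, which illustrates why trimming dimension sets can matter as much as trimming metric names.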

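Customizing metrics by changing the awsemf exporter configuration is also highly flexible: you can customize both the metrics you send and their dimensions. Note, however, that it applies only to the logs sent to CloudWatch.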

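Summary

The two approaches presented in this post are not mutually exclusive. In fact, combining them gives you a great deal of flexibility in customizing the metrics you ingest into your monitoring system. As the graph below shows, we use this approach to reduce the costs associated with storing and processing metrics.

Cost Explorer

The Cost Explorer graph above shows the CloudWatch costs for a small EKS cluster (20 worker nodes, 220 Pods) under different ADOT Collector configurations. August 15 shows the CloudWatch bill with the ADOT Collector in its default configuration. On August 16, using the approach of customizing the EMF exporter, costs dropped by about 30%. On August 17, the processor-based approach achieved a cost reduction of about 45%.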

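When customizing the metrics sent by Container Insights, you must weigh the trade-off: you reduce monitoring cost at the expense of visibility into the monitored cluster. The built-in Container Insights dashboards in the AWS console may also be affected, because customization can drop metrics that the dashboards use.

AWS Distro for OpenTelemetry support for Container Insights metrics on Amazon EKS and Amazon ECS is available now, and you can start using it today. To learn more about ADOT, read the official documentation and see the blog post announcing that Container Insights for Amazon EKS supports the AWS Distro for OpenTelemetry Collector.

This project is open source, and pull requests are welcome! We track the upstream repository and plan to release a new version of the toolkit every month. If you need feedback or a review for AWS-related components, feel free to tag us on GitHub PRs and issues. If you have any questions, please open an issue in the ADOT repository.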
