Containers

Metrics and traces collection from Amazon ECS using AWS Distro for OpenTelemetry with dynamic service discovery

An earlier blog published last year (Part 1 in the series), Metrics collection from Amazon ECS using Amazon Managed Service for Prometheus, demonstrated how to deploy Prometheus server on an Amazon ECS cluster, dynamically discover the services to collect metrics from, and send metrics to Amazon Managed Service for Prometheus for subsequent query and visualization. It employed a custom service discovery mechanism that leveraged the integration between Amazon ECS and AWS Cloud Map to enable Prometheus server to discover its scraping targets.

AWS Distro for OpenTelemetry (ADOT) is a secure, AWS-supported distribution of the OpenTelemetry project. ADOT Collector is an AWS-supported version of the upstream OpenTelemetry Collector that is fully compatible with AWS computing platforms such as Amazon EKS and Amazon ECS. With ADOT, users can collect telemetry data such as metrics, traces, and logs from their applications and send them to AWS managed services such as Amazon CloudWatch, Amazon Managed Service for Prometheus, and AWS X-Ray, as well as other supported third-party monitoring backends.

ADOT can be enabled for Prometheus metrics collection from workloads deployed to an Amazon ECS cluster, as documented in Basic ECS Configuration for AMP. This deployment model uses the sidecar approach, where an ADOT Collector is deployed alongside the application container within each ECS task in a cluster. To collect application metrics, the collector uses statically specified targets in a Prometheus scrape configuration, which is the canonical way of defining scrape targets. The sidecar approach is employed to collect application traces as well.

This blog (Part 2 in the series) builds upon the solution that was presented in Part 1 and demonstrates how to employ a single instance of an ADOT Collector to collect X-Ray traces and Prometheus metrics from Amazon ECS services that were dynamically discovered using AWS Cloud Map. This approach reduces the deployment footprint in target environments. This is similar to the deployment model most commonly used on platforms such as Kubernetes, which have built-in service discovery capabilities.

Source code

The source code for the solution outlined in this blog, as well as artifacts needed for deploying the resources to an Amazon ECS cluster, can be downloaded from the GitHub repository.

Solution architecture overview

The figure below shows the proposed solution architecture. The steps to implement it are outlined as follows:

  • Set up an Amazon ECS cluster on Amazon EC2 or AWS Fargate and enable service discovery by creating service registries in AWS Cloud Map.
  • Deploy an instance of ADOT Collector to the cluster. The collector has a metrics pipeline that comprises a Prometheus Receiver and an AWS Prometheus Remote Write Exporter as shown in the figure. This enables it to collect Prometheus metrics from workloads and send them to a workspace in Amazon Managed Service for Prometheus.
  • Deploy a sidecar application alongside the ADOT Collector to help discover the services registered in AWS Cloud Map and dynamically update the scrape configurations used by the Prometheus Receiver.
  • Deploy application services to the cluster and register them with a service registry in AWS Cloud Map. The current implementation uses a stateless web application that is instrumented with Prometheus Go client library as a representative workload. This application exposes a Counter named http_requests_total and a Histogram named request_duration_milliseconds.
  • Optionally, deploy an instance of ECS Exporter alongside the application container in order to expose task-level system metrics in addition to custom application metrics.
  • Visualize metrics data using Amazon Managed Grafana.
  • Deploy application services instrumented with the X-Ray SDK and send trace data to the ADOT Collector instance. The collector has a traces pipeline, shown in the figure, that comprises an AWS X-Ray Receiver and an AWS X-Ray Exporter, enabling it to collect trace segments and send them to AWS X-Ray.
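As an illustration of what the collector scrapes, the following stdlib-only Python sketch renders the two application metrics mentioned above in the Prometheus text exposition format that a /metrics endpoint serves. The actual sample application uses the Prometheus Go client library; the rendering logic, bucket boundaries, and values here are illustrative.

```python
def render_exposition(requests_total, latencies_ms, buckets=(500, 1000, 2500, 5000)):
    """Render a counter and a histogram in Prometheus text exposition format.

    Mimics the shape of the metrics the sample web application exposes:
    http_requests_total (Counter) and request_duration_milliseconds (Histogram).
    """
    lines = [
        "# TYPE http_requests_total counter",
        f"http_requests_total {requests_total}",
        "# TYPE request_duration_milliseconds histogram",
    ]
    # Histogram buckets are cumulative: each bucket counts all observations
    # less than or equal to its upper bound (the "le" label).
    for bound in buckets:
        cumulative = sum(1 for v in latencies_ms if v <= bound)
        lines.append(f'request_duration_milliseconds_bucket{{le="{bound}"}} {cumulative}')
    lines.append(f'request_duration_milliseconds_bucket{{le="+Inf"}} {len(latencies_ms)}')
    lines.append(f"request_duration_milliseconds_sum {sum(latencies_ms)}")
    lines.append(f"request_duration_milliseconds_count {len(latencies_ms)}")
    return "\n".join(lines) + "\n"
```

The cumulative bucket counts are what make the latency-distribution queries shown later in this post possible.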

Diagram-1

Enabling Amazon ECS service discovery with AWS Cloud Map

AWS Cloud Map is a fully managed service that you can use to register services deployed to an Amazon ECS cluster. Subsequently, client applications internal or external to the cluster may discover them using DNS queries or API calls, referencing the services by their logical names. When tasks are launched using an Amazon ECS service, they can be automatically registered in a service registry within a namespace in AWS Cloud Map. Please refer to the documentation for more details on Service Discovery concepts and components.

The implementation in this blog makes use of a private namespace based on DNS that is visible only inside the Amazon VPC where the cluster resides. Under the hood, AWS Cloud Map uses Route 53 and creates DNS records in a private hosted zone for each registered task. The AWS CLI commands used for setting up the relevant AWS Cloud Map resources can be found in the script cloudmap.sh.

Deploying ADOT Collector to Amazon ECS

The ECS task definition used for deploying the ADOT Collector along with the service discovery sidecar can be downloaded from GitHub. Here are the salient aspects of deploying the ADOT Collector to an Amazon ECS cluster.

  1. The ADOT Collector pipeline configurations can be provided through an environment variable named AOT_CONFIG_CONTENT. The current implementation sets the value of this variable using a parameter named otel-collector-config stored in AWS Systems Manager Parameter Store, as seen in the following JSON snippet of the task definition. The complete collector configuration read from the Parameter Store is on GitHub.
    {
       "secrets":[
          {
             "name":"AOT_CONFIG_CONTENT",
             "valueFrom":"arn:aws:ssm:us-east-1:123456789000:parameter/otel-collector-config"
          }
       ]
    }
    
  2. The Prometheus Receiver used in the ADOT Collector pipeline is designed to be, at a minimum, a drop-in replacement for a Prometheus server and supports the scrape configuration documented here. The scrape configuration used by the Prometheus Receiver is shown below in the YAML snippet of the collector's pipeline configuration. The current implementation leverages Prometheus's HTTP-based service discovery, which provides a generic way to configure static targets and serves as an interface to plug in custom service discovery mechanisms. The discovery source, in this case, is a sidecar that exposes an HTTP endpoint that is periodically invoked by the Prometheus Receiver to get a list of scrape targets. This is similar to Prometheus's file-based discovery but removes some of its limitations, such as the ADOT Collector container having to share a file system with the sidecar. It also allows discovery sources to be hosted outside the cluster, as long as they are reachable by the ADOT Collector over HTTP.
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
    scrape_configs:
      - job_name: ecs_services
        http_sd_configs:
          - url: http://localhost:9001/prometheus-targets  
            refresh_interval: 60s
    
  3. To get visibility into task-level system metrics such as CPU, memory, and network usage, an instance of ECS Exporter may be deployed as a sidecar alongside each application container. The ECS container agent injects an environment variable named ECS_CONTAINER_METADATA_URI_V4 into each container, referred to as the task metadata endpoint, which provides various task metadata and Docker stats to the container. The ECS Exporter sidecar reads this data and exports it as Prometheus metrics on port 9779. The following snippet is the JSON definition of this container:
    {
       "name":"ecs-exporter",
       "image":"public.ecr.aws/awsvijisarathy/ecs-exporter:1.2",
       "portMappings":[
          {
             "containerPort":9779,
             "protocol":"tcp"
          }
       ]
    }
    
    
  4. The following snippet is the JSON definition of the service discovery sidecar. The current implementation is based on the same Go application that was employed in Part 1, with relevant changes to make it support the discovery of targets over the HTTP protocol.
    {
       "name":"config-reloader",
       "image":"public.ecr.aws/awsvijisarathy/prometheus-sdconfig-reloader:4.0",
       "cpu":128,
       "memory":128,
       "environment":[
          {
             "name":"SERVICE_DISCOVERY_MODE",
             "value":"HTTP_BASED"
          },
          {
              "name":"DISCOVERY_NAMESPACES_PARAMETER_NAME",
              "value":"ECS-Namespaces"
          }         
       ]
    }
    

    The sidecar is configured to read a list of AWS Cloud Map namespaces as comma-separated values from the parameter ECS-Namespaces in Systems Manager Parameter Store. It exposes an HTTP endpoint, :9001/prometheus-targets, that periodically receives a GET request from the Prometheus Receiver. The sidecar responds with the HTTP header Content-Type: application/json and a JSON body that contains a list of scrape targets in the same format used for providing a list of static targets. A representative example of this HTTP response is shown below. Note that the targets in this example include both the application (port 3000) and the ECS Exporter (port 9779) containers.

    [
       {
          "targets":[
             "10.10.101.12:3000"
          ],
          "labels":{
             "__metrics_path__":"/metrics",
             "cluster":"ecs-adot-cluster",
             "instance":"10.10.101.12",
             "service":"WebAppService",
             "namespace": "ecs-services",         
             "taskdefinition":"WebAppTask",
             "taskid":"0dc5c05b56df432ab6ff2f52f4019000"
          }
       },
       {
          "targets":[
             "10.10.101.12:9779"
          ],
          "labels":{
             "__metrics_path__":"/metrics",
             "cluster":"ecs-adot-cluster",
             "instance":"10.10.101.12",
             "service":"WebAppService",
             "namespace": "ecs-services",                  
             "taskdefinition":"WebAppTask",
             "taskid":"0dc5c05b56df432ab6ff2f52f4019000"
          }
       }
    ] 
    
  5. For secure ingestion of metrics into Amazon Managed Service for Prometheus, the HTTP requests must be signed using the AWS Signature Version 4 signing process. Since version 2.26.0, Prometheus natively supports authentication based on AWS Signature Version 4. This support is also built into the AWS Prometheus Remote Write Exporter used in the ADOT Collector pipeline. Therefore, there is no need to rely on a proxy sidecar container, such as the AWS SigV4 container. However, the collector must still be able to present credentials associated with an IAM role that has permissions to perform read/write operations against a workspace in Amazon Managed Service for Prometheus. It also needs permissions to access parameters in Systems Manager Parameter Store and service registries in AWS Cloud Map. The current implementation uses an ECS Task Role to grant these permissions to the task that encapsulates the ADOT Collector container. Refer to the script iam.sh for details on the IAM role and policies.
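To make the discovery contract concrete, the following stdlib-only Python sketch mimics the sidecar's /prometheus-targets endpoint: it serves a JSON list of scrape targets with the Content-Type: application/json header that HTTP-based service discovery requires. The target addresses and labels are illustrative; the actual sidecar is the Go application referenced above, which builds this list from AWS Cloud Map.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative targets; the real sidecar builds this list dynamically
# from the services registered in AWS Cloud Map.
TARGETS = [
    {
        "targets": ["10.10.101.12:3000"],
        "labels": {
            "__metrics_path__": "/metrics",
            "cluster": "ecs-adot-cluster",
            "service": "WebAppService",
        },
    }
]

class PrometheusTargetsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/prometheus-targets":
            self.send_error(404)
            return
        body = json.dumps(TARGETS).encode("utf-8")
        self.send_response(200)
        # http_sd_configs requires this content type on responses.
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the sketch quiet
        pass

# To run standalone:
# HTTPServer(("0.0.0.0", 9001), PrometheusTargetsHandler).serve_forever()
```

Pointing the `http_sd_configs` URL from the collector configuration at this endpoint would yield the single illustrative target above on every refresh.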

Metrics collection with Amazon Managed Service for Prometheus

The set of services deployed to the cluster is shown below in a view of the ECS console. The ADOTService service manages a single task instance comprising the ADOT Collector and service discovery containers. The WebAppService service manages two tasks, each comprising the sample web application and ECS Exporter containers.

UI screenshot of Services tab

A view of the AWS Cloud Map Console shown below displays the two tasks of the WebAppService service registered in the AWS Cloud Map service registry identified by the private DNS name webapp-svc.ecs-service. Note that the ports and request paths exposed by the application and the sidecar containers for scraping custom application metrics and ECS task-level metrics, respectively, are made known via the tags associated with the service registry.

UI screenshot of Service information tab

The metrics ingested into the Amazon Managed Service for Prometheus workspace are visualized using Amazon Managed Grafana, which is a fully managed service that enables you to query, correlate, and visualize operational metrics, logs, and traces from multiple sources. Refer to the documentation on how to add an Amazon Managed Service for Prometheus workspace as a data source to Amazon Managed Grafana. The figure below shows a visualization of the metrics collected from the web application as well as a task-level system metric, namely CPU usage, collected from the ECS Exporter.

  1. HTTP Request Rate chart shows the rate of requests processed by the service, computed as: sum(rate(http_requests_total[5m]))
  2. Average Response Latency chart shows average request processing latency, computed as: sum(rate(request_duration_milliseconds_sum[5m])) / sum(rate(request_duration_milliseconds_count[5m]))
  3. Response Latency Histogram chart shows the percentage of requests served within a specified threshold, computed as: sum(rate(request_duration_milliseconds_bucket{le="BUCKET_VALUE"}[5m])) / sum(rate(request_duration_milliseconds_count[5m])). The histogram is configured to count observations falling into particular buckets of values, namely, 500, 1,000, 2,500, and 5,000 milliseconds.
  4. CPU Utilization chart shows the percentage CPU usage by the web application container in each task, computed as: ecs_cpu_percent{container="webapp"}
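The arithmetic behind the Response Latency Histogram chart reduces to dividing a cumulative bucket count by the total observation count. A small Python illustration, with made-up counts for the buckets configured above:

```python
def fraction_within(bucket_counts, total_count, threshold_ms):
    """Fraction of requests served within threshold_ms, mirroring the query
    sum(rate(..._bucket{le="T"}[5m])) / sum(rate(..._count[5m])).

    bucket_counts maps each upper bound ("le") to its cumulative count,
    i.e. the number of observations less than or equal to that bound.
    """
    return bucket_counts[threshold_ms] / total_count

# Illustrative cumulative counts for the 500/1000/2500/5000 ms buckets.
counts = {500: 80, 1000: 95, 2500: 99, 5000: 100}
share = fraction_within(counts, 100, 1000)  # 0.95, i.e. 95% within 1 second
```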

Graph showing HTTP Request Rate per second

Graph showing Avg Response Latency in seconds

Graph showing Response latency distribution

graph showing CPU utilization percentages

Traces collection with AWS X-Ray

The AWS X-Ray Receiver in the traces pipeline accepts segments in X-Ray segment format, which enables it to process segments sent by microservices instrumented with the X-Ray SDK. By default, this receiver listens for traffic on UDP port 2000, which is exposed by the task definition used for deploying the ADOT Collector. The ADOTService service is registered in the AWS Cloud Map service registry identified by the private DNS name adot-collector-svc.ecs-service. With this setup, application services in the cluster that are instrumented with the X-Ray SDK can be configured with the environment variable AWS_XRAY_DAEMON_ADDRESS set to adot-collector-svc.ecs-service:2000. The trace segments they emit are received by the AWS X-Ray Receiver in the ADOT Collector pipeline and then forwarded to AWS X-Ray by the exporter.
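Under the hood, each segment travels to the collector as a UDP datagram consisting of a daemon-protocol header, a newline, and the segment document. The following stdlib-only Python sketch frames and sends such a datagram by hand; the segment fields are illustrative, and real services would use the X-Ray SDK rather than doing this themselves.

```python
import json
import os
import socket
import time

# The X-Ray SDK reads the daemon address from this variable; in this setup it
# points at the ADOT Collector's X-Ray receiver behind the Cloud Map DNS name.
DAEMON = os.environ.get("AWS_XRAY_DAEMON_ADDRESS", "adot-collector-svc.ecs-service:2000")

def build_datagram(segment: dict) -> bytes:
    """Frame a segment the way the X-Ray daemon protocol expects:
    a JSON header, a newline, then the segment document."""
    header = {"format": "json", "version": 1}
    return json.dumps(header).encode() + b"\n" + json.dumps(segment).encode()

def send_segment(segment: dict, daemon: str = DAEMON) -> None:
    """Send one segment to the collector over UDP (fire and forget)."""
    host, port = daemon.rsplit(":", 1)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(build_datagram(segment), (host, int(port)))

# A minimal, illustrative segment document.
example_segment = {
    "name": "webapp",
    "id": os.urandom(8).hex(),  # 16 hex characters
    "trace_id": f"1-{int(time.time()):08x}-{os.urandom(12).hex()}",
    "start_time": time.time() - 0.05,
    "end_time": time.time(),
}
# send_segment(example_segment)  # works only where the Cloud Map name resolves
```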

Sending metrics and traces to other monitoring services

The ADOT Collector deployment approach outlined above is agnostic about the monitoring backend to which the metrics and traces are sent. Customers will have to use the appropriate OpenTelemetry supported exporter in the ADOT Collector configuration that is read from Systems Manager Parameter Store. For example, the complete collector configuration that uses the AWS CloudWatch EMF Exporter in the metrics pipeline for sending the metrics to Amazon CloudWatch is available on GitHub.

After storing this configuration using the same parameter named otel-collector-config in Systems Manager Parameter Store and restarting the ADOTService service, the collector will convert the metrics data into performance log events with embedded metric format (EMF) and then send it directly to a CloudWatch log group. From this data, CloudWatch will create an aggregated custom metric http_requests_total that is made available under the CloudWatch metrics namespace ECS/ContainerInsights with the dimensions ClusterName, SdNamespaceName, SdServiceName, and SdTaskID per the exporter configuration settings.
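To make the embedded metric format concrete, here is a stdlib-only Python sketch that builds such a performance log event by hand for http_requests_total with those four dimensions. The exporter emits these events automatically, and the exact event it produces may differ in details; this only illustrates the general EMF shape.

```python
import json
import time

def build_emf_event(value, cluster, sd_namespace, sd_service, sd_task_id):
    """Build a CloudWatch embedded metric format (EMF) log event that defines
    http_requests_total under the ECS/ContainerInsights metrics namespace."""
    dimensions = {
        "ClusterName": cluster,
        "SdNamespaceName": sd_namespace,
        "SdServiceName": sd_service,
        "SdTaskID": sd_task_id,
    }
    event = {
        # The _aws block tells CloudWatch which top-level fields to extract
        # as metrics and which to treat as dimensions.
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": "ECS/ContainerInsights",
                    "Dimensions": [list(dimensions)],
                    "Metrics": [{"Name": "http_requests_total", "Unit": "Count"}],
                }
            ],
        },
        # Metric value and dimension values live at the top level.
        "http_requests_total": value,
        **dimensions,
    }
    return json.dumps(event)
```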

Conclusion

This blog post presented a solution for deploying a single instance of an ADOT Collector to an Amazon ECS cluster and collecting Prometheus metrics from workloads that were dynamically discovered through the integration between Amazon ECS and AWS Cloud Map. The solution also scales well for metrics collection from large ECS deployments that may comprise hundreds of workloads: if necessary, multiple ADOT Collector instances can be launched, each configured with a mutually exclusive set of AWS Cloud Map namespaces, so that each instance collects metrics only from the scrape targets discovered under its namespaces. Though this strategy may not result in a perfectly balanced distribution of scrape targets across the collector instances, it can serve as an efficient sharding strategy for large-scale deployments. The deployment model also allows AWS X-Ray traces to be collected from application services using a single ADOT Collector instance instead of deploying the X-Ray daemon as a sidecar alongside every application container.

The solution presented in this blog also used the ECS Exporter, which was deployed as a sidecar with the application container in order to collect ECS task-level system metrics. Customers can optionally deploy Prometheus Node Exporter to collect system metrics from every EC2 container instance in the cluster. Steps for deploying the Node Exporter are outlined in Part 1 of this series. This approach can be used in conjunction with other monitoring backends that are supported as an exporter in an OpenTelemetry Collector pipeline.

Viji Sarathy

Viji Sarathy is a Principal Containers Specialist SA at Amazon Web Services. He is a software technology leader with 20+ years of experience in building large-scale, distributed software systems. His professional journey began as a research engineer in high performance computing in the area of Computational Fluid Dynamics. From CFD to Cloud Computing, his career has spanned several business verticals, all along with an emphasis on design & development of applications using scalable architectures. He has been building solutions with AWS services for about 10 years. His current interests are in the areas of container services, serverless technologies and machine learning. He has an educational background in Aerospace Engineering, earning his Ph.D from The University of Texas, Austin. He is an avid runner, hiker and cyclist.

Imaya Kumar Jagannathan

Imaya is a Principal Solution Architect focused on AWS observability tools, including Amazon CloudWatch, AWS X-Ray, Amazon Managed Service for Prometheus, Amazon Managed Grafana, and AWS Distro for OpenTelemetry. He is passionate about monitoring and observability and has a strong application development and architecture background. He likes working on distributed systems and is excited to talk about microservice architecture design. He loves programming in C# and working with containers and serverless technologies. LinkedIn: /imaya.