Adding metrics and traces to your application on Amazon EKS with AWS Distro for OpenTelemetry, AWS X-Ray and Amazon CloudWatch

In order to make a system observable, it must be instrumented. This means that code to emit traces, metrics and logs must be added to the application either manually, with libraries, or with automatic instrumentation agents. Once deployed, the application sends this telemetry data to the respective backend. There are a number of observability backends available, and the way code is instrumented varies from solution to solution.

In the past, this meant that there was no standardized data format available for sending data to an observability backend. Additionally, if you chose to switch observability backends, you had to re-instrument your code and configure new agents to be able to emit telemetry data to the new destination of your choice.

The OpenTelemetry (OTEL) project’s goal is to provide a set of standardized SDKs, APIs, and tools for ingesting, transforming, and sending data to an observability backend. AWS Distro for OpenTelemetry (ADOT) is a secure, production-ready, AWS-supported distribution of the OpenTelemetry project. With AWS Distro for OpenTelemetry, you can instrument your applications just once to send correlated metrics and traces to multiple monitoring solutions. AWS Distro for OpenTelemetry consists of SDKs, auto-instrumentation agents, collectors and exporters to send data to backend services.

In this blog post, we will introduce a sample application written in Python, the PetAdoptionsHistory microservice, to demonstrate how to add distributed tracing and metrics to your applications using OpenTelemetry Python client SDKs. We will explain how you can use AWS Distro for OpenTelemetry (ADOT) to send the traces to AWS X-Ray and the metrics to Amazon CloudWatch. Amazon CloudWatch is a monitoring and observability service. Amazon CloudWatch provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, and optimize resource utilization. Amazon CloudWatch collects monitoring and operational data in the form of logs, metrics, and events.

In this blog, we will use Amazon CloudWatch ServiceLens, one of Amazon CloudWatch’s capabilities, to provide us with a visual representation of the components that make up our application and how they are connected. Additionally, we can quickly drill down from the map nodes into the related metrics, logs and traces. In particular, we will leverage CloudWatch ServiceLens to view the application architecture before and after adding instrumentation code to the PetAdoptionsHistory microservice and to review relevant metrics regarding the success and error rates of our service.

We will also highlight how to leverage CloudWatch Logs Insights to interactively search and analyze log data generated by our application, and we will create an Amazon CloudWatch dashboard to group relevant application metrics into a single view.

The full code and step-by-step instructions are available directly in the One Observability Workshop. This blog post will highlight the most important parts and explain the relevant concepts on how to instrument the PetAdoptionsHistory microservice using OpenTelemetry Python client SDKs.

Architecture overview

The PetAdoptionsHistory microservice is written in Python and runs in a container on Amazon Elastic Kubernetes Service (EKS). The AWS Distro for OpenTelemetry (ADOT) collector is deployed in the same cluster and receives traces from the application. The collector is also configured to periodically scrape metrics from the application’s /metrics endpoint using HTTP.

The AWS Distro for OpenTelemetry (ADOT) collector is configured to publish traces to AWS X-Ray and send metrics to Amazon CloudWatch.

Later in this blog, we will elaborate on the OpenTelemetry collector configuration to explain how the collector obtains its metrics and traces from the application and which services it publishes the collected data to.

Figure 1: AWS Distro for OpenTelemetry

Solution Walkthrough

Application overview

We will focus on a microservice called PetAdoptionsHistory. This microservice is part of a larger application, the PetAdoptions application, a web application that can be used to adopt pets. Each time a pet is adopted, a transaction is recorded in an Amazon Aurora PostgreSQL database.

The PetAdoptionsHistory microservice exposes APIs to query the transaction details and also clean up the historic data. Calls to this new service are made from the PetSite front-end service. Another service, the traffic generator, simulates human interactions with the PetSite website by periodically making calls to the front-end. The calls to the front-end in turn result in calls to the PetAdoptionsHistory service, causing the service to be called to either return the recorded list of adoptions, or to clear the list of transactions from the database.

The PetAdoptionsHistory application uses:

Flask to handle incoming requests from an AWS Application Load Balancer to the application
psycopg2 to handle connectivity to the associated Amazon Aurora PostgreSQL database

The PetAdoptionsHistory application is deployed on an Amazon Elastic Kubernetes Service (EKS) cluster. Here’s an overview of the Amazon CloudWatch service map before adding instrumentation to the service. The PetAdoptionsHistory does not yet appear on this diagram.

Service Map showing end-to-end view of the PetAdoptions microservices architecture, before adding instrumentation to the service.

Figure 2: CloudWatch Service Map

Adding distributed tracing to the PetAdoptionsHistory microservice

Distributed tracing gives you deep visibility into requests as they travel across services and their backends (databases, external services) by using correlation IDs. To get started with the OpenTelemetry SDK for Python, we have to import the OpenTelemetry tracing libraries and follow a few initialization steps. Let’s break these steps down.

The OpenTelemetry SDK needs a pipeline to define how traces and metrics flow through the application. Currently, AWS X-Ray requires a specific format for trace IDs. For the Python SDK, this is handled by the AwsXRayIdGenerator. The tracing pipeline looks like the following (import statements omitted for brevity).

# Setup AWS X-Ray propagator
set_global_textmap(AwsXRayPropagator())

# Setup AWS EKS resource detector
resource = get_aggregated_resources(
    [
        AwsEksResourceDetector(),
    ]
)

# Setup tracer provider with the X-Ray ID generator
tracer_provider = TracerProvider(resource=resource, id_generator=AwsXRayIdGenerator())
processor = BatchSpanProcessor(OTLPSpanExporter())
tracer_provider.add_span_processor(processor)

# Sets the global default tracer provider
trace.set_tracer_provider(tracer_provider)

# Creates a tracer from the global tracer provider
tracer = trace.get_tracer(__name__)
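
To illustrate why a dedicated ID generator is involved: AWS X-Ray expects the first 32 bits of the 128-bit trace ID to encode the start time as Unix epoch seconds, with the remaining 96 bits random. The following is a minimal standard-library sketch of that layout, not the implementation used by AwsXRayIdGenerator; the function names are hypothetical.

```python
import os
import time


def generate_xray_compatible_trace_id(now=None):
    """Return a 128-bit trace ID whose top 32 bits are the epoch time in seconds."""
    epoch_seconds = int(time.time() if now is None else now)
    random_part = int.from_bytes(os.urandom(12), "big")  # 96 random bits
    return (epoch_seconds << 96) | random_part


def to_xray_format(trace_id):
    """Render a 128-bit trace ID in the X-Ray notation 1-<8 hex>-<24 hex>."""
    hex_id = format(trace_id, "032x")
    return f"1-{hex_id[:8]}-{hex_id[8:]}"


# A fixed timestamp is used here only to make the time prefix reproducible.
trace_id = generate_xray_compatible_trace_id(now=1_700_000_000)
xray_id = to_xray_format(trace_id)
print(xray_id)
```

The time prefix is what lets X-Ray reject or index traces by age, which is why a purely random OpenTelemetry trace ID is not sufficient.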

To actually generate trace data, we need to instrument portions of the application. The following snippet captures incoming HTTP requests from the Application Load Balancer.

from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

To capture database transactions, you can instrument the psycopg2 library.

from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor

Psycopg2Instrumentor().instrument()

To attach a service name to your captured traces, you can use a resource with the service name attribute. In AWS X-Ray, your service will appear under the name set below.

resource = Resource(attributes={
    SERVICE_NAME: "PetAdoptionsHistory"
})

With the instrumentation above, you automatically have visibility into database and HTTP transactions. Additionally, with custom spans, you can instrument any portion of your application. For example, the following snippet creates a span called transactions_delete for the DELETE HTTP calls that clean up the adoptions history database table.

@app.route('/petadoptionshistory/api/home/transactions', methods=['DELETE'])
def transactions_delete():
    with tracer.start_as_current_span("transactions_delete") as transactions_span:
        repository.delete_transaction_history(db)
        return jsonify(success=True)

You can find the full code snippet in the GitHub repository of the Observability Workshop.

Adding custom metrics to the PetAdoptionsHistory microservice

Instrumentation makes troubleshooting easier and metrics give a reliable way to see how your service is operating. With metrics, you can create alarms and get notified of anomalies based on predefined thresholds. More than 100 AWS services publish metrics to Amazon CloudWatch automatically, at no additional cost, to give you insights about their usage. For example, when using AWS Application Load Balancer, you get Amazon CloudWatch metrics like HTTPCode_Target_2XX_Count, which gives you the number of 2XX HTTP response codes generated by the ALB targets.

To better understand your application, you can emit custom metrics that are based on the application’s business logic and create alerts based on relevant business criteria. One popular and effortless way to achieve that is through Prometheus. Prometheus is an open-source, metrics-based monitoring system. It has a simple yet powerful data model and a query language that lets you analyze how your application and infrastructure are performing.

OpenTelemetry and Prometheus provide libraries that generate custom metrics with minimal effort, such as the number of HTTP response codes broken down by endpoint, and that let you add your own business metrics. The OpenTelemetry SDK setup for metrics looks like the following (import statements omitted for brevity).

# Setup metrics
reader = PrometheusMetricReader()
meter_provider = MeterProvider(resource=resource, metric_readers=[reader])

# Sets the global default meter provider
metrics.set_meter_provider(meter_provider)

# Creates a meter from the global meter provider
meter = metrics.get_meter(__name__)

# Flask exporter
from prometheus_flask_exporter import PrometheusMetrics

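# This exposes the /metrics HTTP endpoint. The variable is not named `metrics`
# so that it does not shadow the OpenTelemetry `metrics` module used above.
flask_metrics = PrometheusMetrics(app, group_by='endpoint')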

# Create a business metric with a counter for the number of GET calls
transactions_get_counter = meter.create_counter(
    "transactions_get.count",
    description="The number of times the transactions_get endpoint has been called",
)

In addition to the default Prometheus metrics, in our application, we chose to track the total number of business transactions using the transactions_get_counter variable.

@app.route('/petadoptionshistory/api/home/transactions', methods=['GET'])
def transactions_get():
    transactions_get_counter.add(1) # increment count
    transactions = repository.list_transaction_history(db)
    return jsonify(transactions)
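
The instrument names used in the application contain dots, while Prometheus metric names cannot, and counters are conventionally exposed with a `_total` suffix. The sketch below shows a simplified version of that naming convention, under the assumption that no unit suffix is involved; `prometheus_name` is a hypothetical helper, not a function of the OpenTelemetry Prometheus exporter.

```python
import re


def prometheus_name(otel_name, is_counter=False):
    """Map an OpenTelemetry instrument name to a Prometheus-style metric name."""
    # Prometheus metric names may only contain letters, digits, "_" and ":".
    sanitized = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    # Monotonic counters are conventionally exposed with a "_total" suffix.
    if is_counter and not sanitized.endswith("_total"):
        sanitized += "_total"
    return sanitized


counter_name = prometheus_name("transactions_get.count", is_counter=True)
gauge_name = prometheus_name("transactions_history.count")
print(counter_name, gauge_name)
```

This is why the counter created as transactions_get.count is referred to as transactions_get_count_total once it has been scraped.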

Metrics can be of different types in Prometheus. Counter metrics are used for measurements that only increase, which means their value can only go up. The only exception is a restart of the process that owns the counter, after which the counter starts again from zero. Gauges, on the other hand, are used for measurements that can arbitrarily increase or decrease. Examples of gauges are temperature, CPU utilization, memory usage, the size of a queue and so on. In Python, gauges can be defined with a callback function, which returns the value of the gauge at the time it is invoked.

def transactions_history_callback(result):
    count = repository.count_transaction_history(db)
    yield Observation(count)

meter.create_observable_gauge(
    name="transactions_history.count",
    description="The number of items in the transactions history",
    callbacks=[transactions_history_callback])
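
The difference between the two metric types can be sketched with plain Python classes. This is a conceptual illustration only; the `Counter` and `ObservableGauge` classes below are hypothetical and are not the OpenTelemetry or Prometheus implementations.

```python
class Counter:
    """Monotonic counter: the value can only go up."""

    def __init__(self):
        self.value = 0

    def add(self, amount):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class ObservableGauge:
    """Gauge whose value is read from a callback each time it is observed."""

    def __init__(self, callback):
        self._callback = callback

    def observe(self):
        return self._callback()


queue = ["a", "b", "c"]
queue_size = ObservableGauge(lambda: len(queue))
requests_served = Counter()

requests_served.add(1)
size_before = queue_size.observe()  # reads the current queue length
queue.pop()
size_after = queue_size.observe()   # the gauge follows the queue as it shrinks
```

The counter keeps its own state and only accepts increments, while the gauge holds no state and simply reports whatever the callback returns at observation time.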

AWS Distro for OpenTelemetry

In this example, we used the AWS Distro for OpenTelemetry Collector to collect traces and metrics and send them to AWS X-Ray and Amazon CloudWatch. This is achieved using the OpenTelemetry configuration components. Once configured, these components must be enabled via pipelines, which define the data flow within the OpenTelemetry Collector. In the sections below, we will explain the pipeline for our application, which uses three components:

Receivers: a receiver, which can be push or pull based, is how data gets into the Collector.
Processors: processors are run on data between being received and being exported.
Exporters: an exporter, which can be push or pull based, is how you send data to one or more backends/destinations.
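
The relationship between the three component types can be sketched as a chain of plain functions. This is a conceptual model under simplifying assumptions, not Collector code; the function names and the `cluster` label value are hypothetical.

```python
def receive():
    """Receiver: brings telemetry data into the pipeline."""
    return [
        {
            "name": "transactions_get_count_total",
            "value": 3.0,
            "labels": {"container_name": "pethistory"},
        }
    ]


def add_cluster_label(points):
    """Processor: transforms data between receiving and exporting."""
    return [
        {**point, "labels": {**point["labels"], "cluster": "demo-cluster"}}
        for point in points
    ]


exported = []


def export(points):
    """Exporter: hands the data to a backend (here, an in-memory list)."""
    exported.extend(points)


def run_pipeline(receiver, processors, exporter):
    """Pass data from the receiver through each processor to the exporter."""
    data = receiver()
    for processor in processors:
        data = processor(data)
    exporter(data)


run_pipeline(receive, [add_cluster_label], export)
```

Processors are optional: calling `run_pipeline(receive, [], export)` would forward the received data unchanged.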

Configuration for AWS X-Ray

AWS X-Ray provides a complete view of requests as they travel through your application and visualizes data across payloads, functions, traces, services, APIs. With AWS X-Ray you can analyze your distributed traces and understand your overall system. Learn more about AWS X-Ray in the CloudWatch ServiceLens Map section of the workshop.

This receiver configuration expects the application to send trace data to one of the endpoints below using gRPC or HTTP.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

Sending traces to AWS X-Ray is configured with the awsxray exporter defined below. Check out more advanced configuration options such as AWS Region or proxy in the Getting Started with X-Ray Receiver in AWS OpenTelemetry Collector section of the AWS Distro for OpenTelemetry (ADOT) website.

exporters:
  awsxray:

Configuration for Amazon CloudWatch metrics

In our example, the AWS Distro for OpenTelemetry Collector collects Prometheus metrics from the application and sends them to Amazon CloudWatch metrics.

With the receiver configuration below, the AWS Distro for OpenTelemetry collector scrapes the dedicated Prometheus metrics path with an HTTP call every 20 seconds (see the instrumentation section above). As AWS Distro for OpenTelemetry supports Prometheus configurations, we use the service discovery mechanisms to collect environment information such as the Kubernetes container and pod name.

receivers:
  prometheus:
    config:
      global:
        scrape_interval: 20s
        scrape_timeout: 10s
      scrape_configs:
        - job_name: "otel-collector"
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_container_port_number]
              action: keep
              regex: '^8080$'
            - source_labels: [ __meta_kubernetes_pod_container_name ]
              action: keep
              regex: '^pethistory$'
            - source_labels: [ __meta_kubernetes_pod_name ]
              action: replace
              target_label: pod_name
            - source_labels: [ __meta_kubernetes_pod_container_name ]
              action: replace
              target_label: container_name
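
To show what the relabeling rules do to a discovered target, the following sketch emulates the `keep` and `replace` actions in plain Python. It assumes that `keep` rules carry a `regex`, that regular expressions must match the whole value, and that `replace` with default settings copies the source value into the target label. The function `apply_relabel_rules` and the pod name are hypothetical, and only these two actions are covered.

```python
import re


def apply_relabel_rules(labels, rules):
    """Return the relabeled label set, or None if the target is dropped."""
    labels = dict(labels)
    for rule in rules:
        # Values of several source labels are joined with ";" by default.
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        if rule["action"] == "keep":
            if re.fullmatch(rule["regex"], value) is None:
                return None  # target does not match, so it is not scraped
        elif rule["action"] == "replace":
            labels[rule["target_label"]] = value
    return labels


rules = [
    {"source_labels": ["__meta_kubernetes_pod_container_port_number"],
     "action": "keep", "regex": "^8080$"},
    {"source_labels": ["__meta_kubernetes_pod_container_name"],
     "action": "keep", "regex": "^pethistory$"},
    {"source_labels": ["__meta_kubernetes_pod_name"],
     "action": "replace", "target_label": "pod_name"},
    {"source_labels": ["__meta_kubernetes_pod_container_name"],
     "action": "replace", "target_label": "container_name"},
]

discovered = {
    "__meta_kubernetes_pod_container_port_number": "8080",
    "__meta_kubernetes_pod_container_name": "pethistory",
    "__meta_kubernetes_pod_name": "pethistory-abc123",  # hypothetical pod name
}
kept = apply_relabel_rules(discovered, rules)
dropped = apply_relabel_rules(
    {**discovered, "__meta_kubernetes_pod_container_name": "sidecar"}, rules
)
```

Only containers named pethistory that expose port 8080 survive the two `keep` rules; the two `replace` rules then copy the pod and container names into labels that later become CloudWatch dimensions.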

To send these metrics to Amazon CloudWatch, we have configured the awsemf exporter which uses CloudWatch embedded metric format (EMF). EMF is a JSON specification used to instruct Amazon CloudWatch Logs to automatically extract metric values embedded in structured log events. This allows Prometheus-format metrics to be transformed into Amazon CloudWatch metrics.

In this snippet, we show how these metrics will be created under the PetAdoptionsHistory namespace (container for metrics) on Amazon CloudWatch metrics. The transactions_get_count_total metric will be associated with two dimensions, pod_name and container_name.

exporters:
  awsemf:
    namespace: "PetAdoptionsHistory"
    metric_declarations:
      - dimensions: [ [ pod_name, container_name ] ]
        metric_name_selectors:
          - "^transactions_get_count_total$"
          - "^transactions_history_count$"
          - "^process_.*"
        label_matchers:
          - label_names:
              - container_name
            regex: ^pethistory$
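
For illustration, an embedded metric format log event for one of these metrics could be assembled roughly as in the sketch below. The structure follows the EMF `_aws` metadata layout; the helper `build_emf_event`, the pod name and the metric value are hypothetical, and the events actually written by the awsemf exporter may contain additional fields.

```python
import json
import time


def build_emf_event(namespace, dimensions, metric_name, value, unit="Count"):
    """Build a log event that CloudWatch Logs can turn into a metric."""
    event = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [list(dimensions.keys())],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The metric value is stored as a top-level field named after the metric.
        metric_name: value,
    }
    # Dimension values are stored as top-level fields as well.
    event.update(dimensions)
    return event


event = build_emf_event(
    namespace="PetAdoptionsHistory",
    dimensions={"pod_name": "pethistory-abc123", "container_name": "pethistory"},
    metric_name="transactions_get_count_total",
    value=42,
)
print(json.dumps(event, indent=2))
```

Because the event is an ordinary JSON log line, the same data is available both as a CloudWatch metric and as a searchable log record.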

Pipeline definition

To tie everything together, the OpenTelemetry configuration needs a pipeline under the service definition. For our example it looks as follows.

service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [awsxray]
        metrics:
          receivers: [prometheus]
          exporters: [awsemf]
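
A pipeline can only reference components that are defined elsewhere in the configuration. The following sketch checks that rule on a dictionary mirroring the configuration above. It is a simplified illustration, not the Collector's own validation logic, and `find_undefined_components` is a hypothetical helper.

```python
def find_undefined_components(config):
    """List pipeline references that have no matching component definition."""
    problems = []
    for pipeline_name, pipeline in config["service"]["pipelines"].items():
        for kind in ("receivers", "processors", "exporters"):
            defined = config.get(kind, {})
            for component in pipeline.get(kind, []):
                if component not in defined:
                    problems.append(
                        f"{pipeline_name}: {kind[:-1]} '{component}' is not defined"
                    )
    return problems


config = {
    "receivers": {"otlp": {}, "prometheus": {}},
    "exporters": {"awsxray": {}, "awsemf": {}},
    "service": {
        "pipelines": {
            "traces": {"receivers": ["otlp"], "exporters": ["awsxray"]},
            "metrics": {"receivers": ["prometheus"], "exporters": ["awsemf"]},
        }
    },
}
problems = find_undefined_components(config)
print(problems)
```

A mismatch between a pipeline entry and the top-level keys, such as a misspelled exporter name, shows up as an entry in the returned list.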

The entire OpenTelemetry Collector configuration can be found in the link provided.

Results

With instrumentation enabled within the application and deployed alongside the AWS Distro for OpenTelemetry collector, tracing data will flow to AWS X-Ray and metrics to Amazon CloudWatch.

CloudWatch Service Map

The CloudWatch Service Map displays your service endpoints and resources as “nodes” and highlights the traffic, latency, and errors for each node and its connections. You can choose a node to see detailed insights about the correlated metrics, logs, and traces associated with that part of the service. The end-to-end view of your application helps you to pinpoint performance bottlenecks and identify impacted users more efficiently. This enables you to investigate problems and their effect on the application.

Here is the updated service map with the PetAdoptionsHistory node and its connections.

Amazon CloudWatch Service Map after adding instrumentation to the service, showing the PetAdoptionsHistory node and its connections.

Figure 3: Updated Amazon CloudWatch Service Map

Highlighting the PetAdoptionsHistory service in the Service Map will reveal its connections to other entities.

Highlighting the PetAdoptionsHistory service in the Service Map showing connections to other entities

Figure 4: Highlighting PetAdoptionsHistory service in the Service Map

Selecting the PetAdoptionsHistory node on the map allows us to view relevant metrics such as latency, requests and faults associated with this service, alongside a node map with the service and its connections.

Selecting the PetAdoptionsHistory service in the Service Map showing metrics on Latency, Requests and Fault (5xx) on the service

Figure 5: Selecting PetAdoptionsHistory service in the Service Map

When we select a trace, we can see not only the transactions_delete span defined above, but also the origin of the transaction all the way from the front-end website (PetSite).

Trace Map for transactions_delete showing the origin of the transaction all the way from the front-end website (PetSite).

Figure 6: Trace Map for transactions_delete

Amazon CloudWatch Logs Insights

Amazon CloudWatch Logs Insights enables you to interactively search and analyze log data in Amazon CloudWatch Logs. This can be useful for troubleshooting: for example, investigating node-level metric spikes could provide further insights into task-level errors. Because the Amazon CloudWatch embedded metric format (EMF) events are written to Amazon CloudWatch Logs, we can use Amazon CloudWatch Logs Insights to investigate when and how often a metric has exceeded a given threshold. In our example, our query filters for transactions_history_count > 90.

CloudWatch Logs Insights query showing a histogram and a list of transactions matching the query filter transactions_history_count > 90.

Figure 7: Amazon CloudWatch Logs Insights
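
The logic behind the query in Figure 7 (filter on a threshold, then count matches per time bin to build the histogram) can be emulated locally on a handful of EMF-style events. The sketch below is an approximation in plain Python rather than a Logs Insights query; the sample events, the five-minute bin size and `count_exceedances` are hypothetical.

```python
from collections import Counter

THRESHOLD = 90
BIN_SECONDS = 300  # five-minute bins for the histogram


def count_exceedances(events, field="transactions_history_count"):
    """Count events whose metric exceeds THRESHOLD, grouped by time bin."""
    bins = Counter()
    for event in events:
        value = event.get(field)
        if value is not None and value > THRESHOLD:
            seconds = event["_aws"]["Timestamp"] // 1000
            bins[seconds - seconds % BIN_SECONDS] += 1
    return dict(bins)


events = [
    {"_aws": {"Timestamp": 1_700_000_000_000}, "transactions_history_count": 95},
    {"_aws": {"Timestamp": 1_700_000_060_000}, "transactions_history_count": 80},
    {"_aws": {"Timestamp": 1_700_000_120_000}, "transactions_history_count": 120},
    {"_aws": {"Timestamp": 1_700_000_400_000}, "transactions_history_count": 91},
]
histogram = count_exceedances(events)
print(histogram)
```

Each key of the resulting dictionary is the start of a time bin in epoch seconds, and each value is the number of events in that bin that exceeded the threshold.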

Amazon CloudWatch Metrics

The Amazon CloudWatch metrics explorer organizes all metrics collected from the application inside the PetAdoptionsHistory namespace.

Figure 8: Amazon CloudWatch metrics explorer

Now that we have both trace and metric data, we can create Amazon CloudWatch dashboards to have centralized visibility into how the application is performing.

Amazon CloudWatch dashboards showing custom application metrics.

Figure 9: Amazon CloudWatch dashboards

Conclusion

AWS Distro for OpenTelemetry offers multiple possibilities to manage your observability data. In this post, we have shown you how to use OpenTelemetry client SDKs to instrument your applications. We have configured AWS Distro for OpenTelemetry collector on Amazon EKS to send the application traces to AWS X-Ray and the application metrics to Amazon CloudWatch. With this setup, you can correlate the metrics, logs and traces for your application using Amazon CloudWatch ServiceLens, an interactive map visualization service.

The One Observability Workshop lets you experiment with many AWS observability services. You can use the workshop in an AWS-led event with an account provisioned for you. You can also run it in your own account at your own pace. To run the PetAdoptionsHistory microservice yourself or explore other Amazon CloudWatch features such as Contributor Insights, Logs Insights and more, review the respective sections of the One Observability Workshop.

AWS Cloud Operations & Migrations Blog