AWS Lambda metrics support for Amazon Managed Service for Prometheus now available in AWS Distro for OpenTelemetry
In this blog post, intern engineers Karen Xu and Kelvin Lo describe how they added metric support to the OpenTelemetry and AWS Distro for OpenTelemetry Lambda layers, and built and tested the metric pipeline to generate, collect, and export application metrics from AWS Lambda to Amazon Managed Service for Prometheus (AMP).
The demand for observability and monitoring of user applications is growing exponentially due to the rising complexity of software systems. Observability provides developers with detailed insights to analyze and understand the performance of their applications. Application metrics in particular are a key factor in observability and can be used to measure an application’s performance, productivity, and workloads.
This growing demand for observability has led to the development of related technologies, such as OpenTelemetry (OTEL), a popular open source community project that provides a set of components, including APIs and SDKs, for robust and portable telemetry of cloud-native software. OpenTelemetry can be used to instrument, generate, collect, and export telemetry data as metrics, logs, or traces. This telemetry data can then be exported to various backends for analysis.
Additionally, the open source Prometheus monitoring framework can be used to scrape, query, and manage metric data. The Prometheus monitoring and alerting toolkit can also easily be integrated with Grafana to visualize metrics.
As OpenTelemetry is more widely adopted and becomes the open standard for telemetry, providing compatibility between OpenTelemetry and Prometheus is crucial. OTEL currently supports the full metric pipeline, from instrumenting an application and generating metrics to exporting them to Prometheus for analysis, and there is ongoing work in OpenTelemetry to ensure full compatibility with Prometheus. You can join the OpenTelemetry Prometheus workgroup to participate in these discussions.
AWS Lambda is a serverless compute service that runs code in response to events and automatically manages the computing resources required by that code. Because serverless compute services cannot natively run an instance of the OpenTelemetry Collector, the pipeline to collect and export application metrics from AWS Lambda does not inherently work. Luckily, AWS Lambda allows the creation of Lambda layers that can be deployed and used to augment Lambda functions. The developers of OpenTelemetry leveraged this feature to develop OpenTelemetry Lambda layers that enable the tracing pipeline with OpenTelemetry and AWS X-Ray, a backend solution for monitoring traces.
To support generating, collecting, and exporting application metrics from AWS Lambda to Prometheus, we extended the OpenTelemetry Lambda layer to ensure end-to-end support for the metric pipeline. In addition to supporting Prometheus, a separate layer is also able to support exporting metrics to Amazon Managed Service for Prometheus (AMP).
Distributions of OpenTelemetry Lambda layers
Currently, the OpenTelemetry Collector is packaged for AWS Lambda in two ways: the vendor-agnostic OTEL Collector Lambda layer and the managed AWS Distro for OpenTelemetry (ADOT) Collector Lambda layer, as shown in Figure 1. The ADOT Lambda layer serves as a downstream distribution of the OTEL Lambda layer and is packaged with prebuilt configurations to work out of the box for AWS services and platforms.
The Lambda layer components in each distribution are maintained separately. The OpenTelemetry Collector is used for the upstream OpenTelemetry Lambda layer, whereas the AWS Distro for OpenTelemetry Collector (ADOT Collector) is used in the AWS Distro for OpenTelemetry (ADOT) Lambda layer. The ADOT Collector layer takes the implementation of the upstream version and patches it with AWS service-supported Collector components.
Both layers currently include a pipeline for traces. To broaden the usefulness of these layers, we set out to add a metrics pipeline to Prometheus as well.
Initial evaluation and design considerations
After various design discussions with other engineers, we narrowed down our pipeline to support a popular customer use case in which customers are instrumenting their Java applications using the OpenTelemetry Java API and the Java agent to send metrics to the Prometheus-based monitoring service of their choice.
The pipeline starts with an AWS Lambda function instrumented with the OpenTelemetry Java Metrics API, which generates metrics and sends them to the OpenTelemetry Collector Lambda layer. The Collector layer then exports them to a Prometheus remote write service endpoint (Figure 2). In this scenario, the generated metrics that are sent to the Collector Lambda layer are OTLP formatted. This option gives customers more flexibility to configure their backends and choose which monitoring service to use.
One consideration in our approach was keeping the layers lightweight because the increase in size causes an increase in cold start time. Additionally, a user wanting to gain observability of their applications may also want to generate traces in conjunction with metrics. Having a separate layer while trying to accomplish both these tasks would greatly increase the initialization time and cause fragmentation in the user experience. With these considerations in mind, extending the existing layers rather than creating a new one specifically for metrics was an obvious choice.
The AWS Lambda layers that currently exist are a standalone Collector layer and individual Lambda layers, one for each of the supported languages of AWS Lambda, each of which has the Collector and the OpenTelemetry language SDK built in. We use the combined layer with both the OpenTelemetry Metrics SDK and the OpenTelemetry Collector because the metric pipeline uses the OpenTelemetry Metrics API to export metrics to the Collector layer.
Challenges of Lambda
A challenge of exporting metrics from Lambda is working around a “Lambda freeze,” which occurs when Lambda freezes the execution environment to conserve compute resources and reduce billing costs for the user. Until the environment is unfrozen, metrics may be stuck in the Lambda environment, and users may not see them until their function is invoked again. Timely delivery of metrics is a major factor in observability because users rely on near-real-time data and can set alarms based on the metrics received. Metrics that are delayed or lost because of a Lambda freeze therefore give an unreliable picture of the application.
To prevent metrics from being delayed or lost, we chose to support only the use case in which an OpenTelemetry Metrics library is used. The Metrics SDK is able to detect the state of the Lambda function and then calls ForceFlush() to push all the remaining metrics from the buffer out to the backend before Lambda freezes its execution environment. At the time of this initial implementation of the metrics pipeline, the ForceFlush() method was not yet implemented in many of the language Metrics libraries, so this initial iteration only supports Lambda functions written in Java, coupled with the usage of the OpenTelemetry Java Agent Lambda layer.
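To make the flush-before-freeze idea concrete, here is a minimal sketch in plain Java. It does not use the OpenTelemetry SDK: the class names BufferedMetricRecorder and FlushingHandler and the metric name requests_handled are hypothetical, and the exporter is a simple callback standing in for the real export path. It only illustrates the pattern of draining the buffer at the end of every invocation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical in-memory metric buffer; not part of the OpenTelemetry SDK. */
class BufferedMetricRecorder {
    private final List<String> buffer = new ArrayList<>();
    private final Consumer<List<String>> exporter;

    BufferedMetricRecorder(Consumer<List<String>> exporter) {
        this.exporter = exporter;
    }

    /** Queues a data point; nothing leaves the process yet. */
    synchronized void record(String name, double value) {
        buffer.add(name + "=" + value);
    }

    /** Hands every buffered point to the exporter and empties the buffer. */
    synchronized void forceFlush() {
        if (buffer.isEmpty()) {
            return;
        }
        exporter.accept(new ArrayList<>(buffer));
        buffer.clear();
    }

    /** Number of points still waiting to be exported. */
    synchronized int pending() {
        return buffer.size();
    }
}

/** Hypothetical handler wrapper that flushes at the end of every invocation. */
class FlushingHandler {
    private final BufferedMetricRecorder recorder;

    FlushingHandler(BufferedMetricRecorder recorder) {
        this.recorder = recorder;
    }

    String handle(String event) {
        try {
            recorder.record("requests_handled", 1.0);
            return "processed " + event;
        } finally {
            // Runs even if the business logic throws, so no points remain
            // in the buffer when the execution environment is frozen.
            recorder.forceFlush();
        }
    }
}

class FlushDemo {
    public static void main(String[] args) {
        List<String> exported = new ArrayList<>();
        BufferedMetricRecorder recorder = new BufferedMetricRecorder(exported::addAll);
        FlushingHandler handler = new FlushingHandler(recorder);
        System.out.println(handler.handle("event-1"));
        System.out.println("exported=" + exported + " pending=" + recorder.pending());
    }
}
```

Placing the flush in a finally block means the buffer is drained on both the success and the failure path, which is the property that matters when the environment can be frozen immediately after the handler returns.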
Another challenge that we needed to address was the Lambda layer footprint size. Lambda layers have size limitations that we needed to keep in mind. This led us to include the Prometheus remote write (PRW) exporter but exclude the Prometheus receiver. With this combination, the Collector layer can still ingest OTLP metrics and export them in Prometheus format via the PRW exporter.
To enable the metrics pipeline, we needed to add components that are able to export metrics, connected through the metric pipeline shown in Figure 3:
The upstream and downstream Lambda layers currently include a selection of receivers and exporters, which are used to enable the pipeline for receiving and exporting traces.
To enable the metrics pipeline in the Lambda layers, the three main components that we can leverage from the existing OpenTelemetry Collector components are as follows:
- OTLP receiver: The OpenTelemetry Protocol (OTLP) receiver is one of the receivers in the OpenTelemetry Collector. The OTLP receiver receives telemetry data via gRPC using OTLP, the protocol that defines how telemetry data is encoded, transported, and delivered. This component is already being used in both layers for the trace pipeline.
- Prometheus Remote Write Exporter: The Prometheus Remote Write (PRW) Exporter is a component in the OpenTelemetry Collector that converts OTLP-formatted metrics into a Prometheus-compatible time series format. The PRW exporter then pushes these metrics via an HTTP POST request from the OpenTelemetry Collector to a Prometheus push endpoint.
- AWS Prometheus Remote Write Exporter: This component builds on the regular Prometheus Remote Write Exporter and adds AWS Signature Version 4 (SigV4) signing, which authenticates AWS API requests sent over HTTP.
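The conversion step that the PRW exporter performs can be pictured with a small sketch. This is a simplified illustration in plain Java, not the Collector's actual conversion code: the class PromSeriesSketch, its methods, and the sample metric and attribute names are hypothetical. It relies only on the Prometheus naming rules, under which metric names match [a-zA-Z_:][a-zA-Z0-9_:]* and label names match [a-zA-Z_][a-zA-Z0-9_]*, so dotted OpenTelemetry names have to be rewritten.

```java
import java.util.Map;
import java.util.TreeMap;

/** Simplified, hypothetical illustration; not the Collector's conversion code. */
class PromSeriesSketch {

    /** Replaces characters Prometheus does not allow in metric names with '_'. */
    static String sanitizeMetricName(String name) {
        String cleaned = name.replaceAll("[^a-zA-Z0-9_:]", "_");
        // Names may not start with a digit, so prefix an underscore if needed.
        boolean startsWithDigit = !cleaned.isEmpty() && Character.isDigit(cleaned.charAt(0));
        return startsWithDigit ? "_" + cleaned : cleaned;
    }

    /** Label names follow the same rule, except that ':' is not allowed. */
    static String sanitizeLabelName(String name) {
        String cleaned = name.replaceAll("[^a-zA-Z0-9_]", "_");
        boolean startsWithDigit = !cleaned.isEmpty() && Character.isDigit(cleaned.charAt(0));
        return startsWithDigit ? "_" + cleaned : cleaned;
    }

    /**
     * Renders one data point as a Prometheus-style series identifier plus value.
     * Label values are written as-is; escaping is omitted to keep the sketch short.
     */
    static String toSeries(String metricName, Map<String, String> attributes, double value) {
        // Sort labels so the same attributes always produce the same series string.
        Map<String, String> labels = new TreeMap<>();
        attributes.forEach((key, val) -> labels.put(sanitizeLabelName(key), val));

        StringBuilder sb = new StringBuilder(sanitizeMetricName(metricName)).append('{');
        String separator = "";
        for (Map.Entry<String, String> entry : labels.entrySet()) {
            sb.append(separator)
              .append(entry.getKey()).append("=\"").append(entry.getValue()).append('"');
            separator = ",";
        }
        return sb.append("} ").append(value).toString();
    }

    public static void main(String[] args) {
        System.out.println(toSeries("lambda.invocations",
                Map.of("faas.name", "demo-fn", "cloud.region", "us-west-2"), 3));
    }
}
```

The real exporter handles more than naming (for example, different metric types and timestamps), so this sketch should be read only as an illustration of why OTLP names and attributes need translating before they can be written to a Prometheus backend.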
Lambda layer components
After evaluating the components that we needed for our pipeline, we examined existing components to determine what needed to be added to each layer to enable the metric pipeline to be sent to a Prometheus backend.
To verify the functionality and performance of the AWS Lambda Layer extensions, we tested the Lambda Layers using end-to-end and integration tests and added these tests to the existing CI/CD workflows of the repository.
Verify with end-to-end testing
The first goal of our tests was to check that the metric was being sent through the full metric pipeline. We tested the functionality of the pipeline in each deployed AWS Lambda Region. Once the metric reached the backend (AMP), we verified that the received metric matched what was sent by the metric load generator, using a custom-written AMP metric validator and a Prometheus metric template.
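The idea of checking a received metric against a template can be sketched as follows. This is a hypothetical illustration in plain Java rather than the validator described above: the class MetricTemplateCheck, its validate method, and the metric and label names are assumptions made for the example, and the step of querying AMP for the received sample is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of template-based validation; not the validator described above. */
class MetricTemplateCheck {

    /**
     * Compares a received sample against the expected name and labels.
     * Returns human-readable mismatches; an empty list means the sample matches.
     */
    static List<String> validate(String expectedName, Map<String, String> expectedLabels,
                                 String actualName, Map<String, String> actualLabels) {
        List<String> problems = new ArrayList<>();
        if (!expectedName.equals(actualName)) {
            problems.add("name: expected " + expectedName + ", got " + actualName);
        }
        // Extra labels on the received sample are tolerated;
        // missing or different ones are reported.
        expectedLabels.forEach((key, want) -> {
            String got = actualLabels.get(key);
            if (!want.equals(got)) {
                problems.add("label " + key + ": expected " + want + ", got " + got);
            }
        });
        return problems;
    }

    public static void main(String[] args) {
        Map<String, String> template = Map.of("function", "demo-fn");
        Map<String, String> received = Map.of("function", "demo-fn", "region", "us-west-2");
        System.out.println(validate("requests_handled", template, "requests_handled", received));
    }
}
```

Returning a list of mismatches instead of a boolean makes a failed test run easier to diagnose, because the CI log shows which name or label differed.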
The final step was to run soak tests to check that the performance of AWS Lambda still fell within acceptable limits, by measuring infrastructure metrics emitted, such as CPU usage, memory, and initialization time (Figure 5).
In addition to the integration tests that we submitted to the aws-otel-lambda repository, we also performed load testing on the backend to ensure that AMP received every metric it was sent, within its limits (outlined in the AWS documentation).
We wanted to test the ranges of concurrency that would be supported by this pipeline. To do so, we created a variety of workspaces in AMP and used Hey, a tool for HTTP load generation, to send a varying number of metrics through our pipeline to AMP. We tested using the maximum Lambda concurrency limit of 1,000 Lambda functions and generated up to 70,000 metric samples per second to AMP.
After testing was complete, we were able to verify that the pipeline does not drop data for customers sending high volumes of metrics to AMP through these Lambda layers, as long as the volume stays within the posted limits of AMP without requesting a quota increase.
Usage and demo
Now that the layers have their respective components required to enable the metric pipeline from AWS Lambda to Prometheus, we can deploy one of these layers along with a sample Lambda function instrumented with the OpenTelemetry Java Metrics API.
When our Lambda function is invoked with the deployed Lambda layer, we can visualize any metrics sent to Amazon Managed Service for Prometheus (AMP) using Grafana, as shown in Figure 6. By seeing metrics in Grafana, we confirmed that our Lambda layers are functioning as intended. Users can find more information on how to visualize these metrics using Amazon Managed Grafana in the “Amazon Managed Grafana – Getting Started” blog post.
Check out our pull requests within the following projects to learn more about our process:
- Add AWS PRW to Lambda Components (aws-otel-lambda PR #584)
- Switch around endpoints used for trace/metric validation (aws-otel-lambda PR #129)
- Only provision AMP resources in supported regions for java-awssdk-agent sample app (aws-otel-lambda PR #130)
- Fix Canary Workflow (aws-otel-lambda PR #134)
- Add Prometheus Remote Write Exporter to Collector Lambda Layer Components (opentelemetry-lambda PR #119)
- Add upstream sample application (opentelemetry-lambda PR #127)
- Updates to Java AWS SDK sample app Terraform Configuration to Support a Custom Collector Configuration File (opentelemetry-lambda PR #132)
- Update CI workflows and Terraform configurations for Java AWS SDK Sample App (opentelemetry-lambda PR #128)
With the continued development of the OpenTelemetry project and metrics pipeline, more customers will be able to use this Lambda pipeline and the power of Amazon Managed Service for Prometheus to monitor and observe their AWS Lambda application metrics.