Building a Prometheus Remote Write Exporter for the OpenTelemetry Python SDK
In this post, AWS intern engineers Azfaar Qureshi and Shovnik Bhattacharya talk about their experience building the OpenTelemetry Prometheus Remote Write Exporter for Python. They share their experiences in tackling challenges they faced while building this tool, which is used for sending metrics to Prometheus protocol-based service endpoints.
As software deployments become increasingly complex, the ability to understand our applications and infrastructure becomes vitally important. This ability can be achieved through observability, which deals with understanding the internal state of a system based on its external outputs. This process goes beyond simple monitoring. Whereas monitoring allows us to view the overall health of our systems, observability provides the granular, high-fidelity data required to gain an in-depth understanding of them.
Prometheus is one such open source observability tool focused on the collection of metrics data. Since its launch, Prometheus has become extremely popular and has seen widespread adoption in the industry. However, because Prometheus has its own data exposition format, it often locks users into its ecosystem, preventing them from working with other observability services that are available. This need for flexibility and cross-compatibility within the observability space is what sparked the creation of the open source OpenTelemetry project.
OpenTelemetry aims to create a new vendor-agnostic, open standard for telemetry data, eliminating vendor lock-in and providing developers with complete freedom to set up their observability pipelines however they wish. The OpenTelemetry project is part of the Cloud Native Computing Foundation (CNCF) and is the second most popular project after Kubernetes. Many of the major players in the observability space (including Amazon Web Services [AWS], Google, Microsoft, Splunk, DataDog, New Relic, and Dynatrace) collaborate on OpenTelemetry.
To facilitate widespread adoption of OpenTelemetry, the project recognizes the need for full compatibility with Prometheus so that its users can have a seamless transition into instrumenting their apps with OpenTelemetry. This is where our project, and the subject of this blog post, fits in. We created a Prometheus Remote Write (RW) Exporter in the OpenTelemetry Python SDK to allow users to export metrics to their existing RW-integrated back ends, such as Cortex, Thanos, and InfluxDB. With our RW exporter, users can use the Python SDK to push metrics straight to their back end without needing to run a middle-tier service.
Before our project, two distinct pipelines existed for exporting OpenTelemetry Protocol (OTLP) metrics to a RW-integrated back end.
The first pipeline takes advantage of the OpenTelemetry Collector Prometheus RW Exporter. OTLP metrics are first pushed from the Python SDK to the OpenTelemetry Collector. The Collector’s RW exporter then converts the OTLP metrics into TimeSeries and pushes them to the back-end service.
The second pipeline involves exporting data using the OpenTelemetry Python Prometheus exporter. This is considered a “pull” exporter because it converts OTLP metrics into the Prometheus data exposition format and exposes them at the /metrics endpoint. This endpoint is then periodically scraped by a Prometheus server, which converts the metrics into the TimeSeries data format and finally pushes them to the back-end service.
The existing Python pull exporter was created to facilitate exporting data to Prometheus-integrated back-end services by avoiding having to go through the OpenTelemetry Collector. It is, however, not a complete solution. Many customers value the advantages offered by push-based exporters. For example, using the RW Exporter removes the need to run additional services just to scrape metrics, potentially reducing infrastructure costs of your metrics pipeline.
Large-scale customers who care about high availability may prefer the RW Exporter because it is easier to replicate traffic across different data ingestion endpoints with a push-based exporter. Similarly, customers with strict network security requirements often restrict the amount of incoming traffic to their services and prefer outgoing traffic instead. This approach rules out having another service scrape your services for metrics and requires these customers to use a RW Exporter.
In short, to ensure that OpenTelemetry can support all possible customer use cases and to provide full feature parity with Prometheus, we created a Prometheus Remote Write Exporter in the Python SDK. This introduces a new pipeline for exporting metrics, in which the Python SDK converts the metrics into TimeSeries itself and pushes them straight to the back-end service, with no Collector or Prometheus server in between.
Before building the Prometheus RW Exporter, we first evaluated whether the Python SDK was compliant with the OpenTelemetry Metrics Specification. The OpenTelemetry Specification contains requirements and guidelines to which all the OpenTelemetry SDKs must adhere. It can be thought of as the blueprint for each SDK. However, the specification is currently under development because the OpenTelemetry project has not yet reached General Availability (GA). Consequently, the specification quickly evolves based on feedback from users, open source contributors, and vendors. Because the Python Metrics SDK was written during an early iteration of the metrics specification, we noticed a few compliance gaps in the SDK implementation. We took this opportunity to contribute back to the upstream community by making the Python Metrics SDK compliant with the specification.
We opened an issue (python:#1307) upstream detailing all our findings with suggested changes and their rationale. As discussion around issues in OpenTelemetry happens in Gitter channels and weekly Special Interest Group (SIG) meetings, we started conversations there to get buy-in from the community around our proposed changes. Once we achieved consensus around our changes, we opened Pull Requests with the fixes (python:#1372, python:#1373, python:#1367, python-contrib:#192).
Once the metrics SDK was compliant, we jumped into implementing our exporter. Before writing any code, we needed to write thorough and detailed documents outlining what we proposed to build. This is standard for most projects at AWS, and there are four main documents that need to be written: Requirements, Design, Implementation, and Testing.
A requirements document contains all the analyses of the features one needs to support, along with assumptions, non-goals, and data requirements. This work is then followed up with a design document that delves deeper into the structure of your software (in this case, the exporter) and details of each component. Finally, implementation and testing documents outline the specific details in turning your design into code. We reviewed each document thoroughly with senior engineers internally before sharing them for review with upstream maintainers. After receiving approvals both internally and externally, we were ready to start coding our exporter.
Metrics export pipeline
Before diving into the role of our exporter, let’s first walk through the steps in the overarching metrics export pipeline:
- The user instantiates the pipeline with a user-defined collection interval.
- The controller is responsible for orchestrating the entire pipeline.
- In every cycle, the controller collects raw metrics from the accumulator.
- These raw metrics are then processed and converted into ExportRecords by applying the relevant aggregators to convert the raw data into a usable form.
- Finally, the controller calls the exporter to send the data to its final location.
Our work with the Remote Write Exporter lies in this final stage, shown in green in the following diagram.
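To make the steps above concrete, here is a minimal sketch of one collection cycle. Every name in it (`Accumulator`, `ExportRecord`, `process`, `ListExporter`, `run_cycle`) is a hypothetical stand-in chosen for illustration, not the SDK's actual classes, and the timer that a real controller would use to repeat the cycle at the collection interval is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Labels = Tuple[Tuple[str, str], ...]


@dataclass
class ExportRecord:
    """Hypothetical stand-in for an aggregated record handed to the exporter."""
    name: str
    labels: Labels
    value: float


@dataclass
class Accumulator:
    """Holds raw measurements recorded since the last collection cycle."""
    raw: List[Tuple[str, Labels, float]] = field(default_factory=list)

    def record(self, name: str, labels: Dict[str, str], value: float) -> None:
        self.raw.append((name, tuple(sorted(labels.items())), value))

    def collect(self) -> List[Tuple[str, Labels, float]]:
        # Hand over the raw measurements and start a fresh cycle.
        raw, self.raw = self.raw, []
        return raw


def process(raw, aggregator=sum) -> List[ExportRecord]:
    """Group raw measurements by (name, labels) and aggregate each group."""
    groups: Dict[Tuple[str, Labels], List[float]] = {}
    for name, labels, value in raw:
        groups.setdefault((name, labels), []).append(value)
    return [
        ExportRecord(name, labels, aggregator(values))
        for (name, labels), values in groups.items()
    ]


class ListExporter:
    """Toy exporter that keeps records in memory instead of sending them."""

    def __init__(self) -> None:
        self.exported: List[ExportRecord] = []

    def export(self, records: List[ExportRecord]) -> bool:
        self.exported.extend(records)
        return True


def run_cycle(accumulator: Accumulator, exporter: ListExporter) -> bool:
    """One controller tick: collect raw metrics, process them, export them."""
    return exporter.export(process(accumulator.collect()))
```

In this sketch the exporter is the only component that knows where the data ends up, which mirrors why a new back end only needs a new exporter rather than a new pipeline.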
Prometheus Remote Write Exporter functionality
Now let’s look at how our exporter behaves. During every collection cycle, the controller calls the exporter once with a new set of records to be exported. The Prometheus remote write exporter iterates through the records and converts them into time series format based on each record’s internal OTLP aggregation type. For example, a histogram aggregation is converted into multiple time series, with one time series for each bucket. These time series representing the same aggregation are linked together by their common timestamp and labels.
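The histogram case can be sketched as follows. This is an illustration under assumptions rather than the exporter's actual code: the `TimeSeries` class and `histogram_to_time_series` function are hypothetical, and the naming scheme (a `_bucket` suffix, an `le` label, cumulative counts) follows the usual Prometheus histogram convention, which the exporter may or may not match exactly.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TimeSeries:
    """Hypothetical, simplified time series: sorted labels plus one sample."""
    labels: Tuple[Tuple[str, str], ...]
    value: float
    timestamp_ms: int


def histogram_to_time_series(
    name: str,
    labels: Dict[str, str],
    bucket_counts: Dict[float, int],
    timestamp_ms: int,
) -> List[TimeSeries]:
    """Convert one histogram into one time series per bucket.

    bucket_counts maps each bucket's upper bound to the number of
    observations that fell into that bucket alone. The emitted values are
    cumulative, as is conventional for Prometheus histograms (assumption).
    """
    series = []
    running_total = 0
    for bound in sorted(bucket_counts):
        running_total += bucket_counts[bound]
        le = "+Inf" if bound == float("inf") else str(bound)
        # Every bucket shares the original labels and the same timestamp,
        # which is what ties the series back to a single aggregation.
        all_labels = dict(labels, __name__=f"{name}_bucket", le=le)
        series.append(
            TimeSeries(
                labels=tuple(sorted(all_labels.items())),
                value=float(running_total),
                timestamp_ms=timestamp_ms,
            )
        )
    return series
```

A histogram with three buckets therefore yields three time series that differ only in their `le` label and value.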
Once the data has been converted, it is serialized into a Protocol Buffer message, compressed using Snappy, and sent to the back-end service of choice in an HTTP request. Finally, based on the response from the server, the exporter returns whether the metric data was exported successfully.
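As a rough sketch of that last step, the function below builds (but does not send) a remote write request using only the standard library. The function name and its `compress` parameter are assumptions made for illustration: in practice the compressor would likely be `snappy.compress` from the third-party python-snappy package, and `serialized` would be the bytes of the serialized Protocol Buffer write request. The header values are the ones given in the Prometheus remote write specification.

```python
import urllib.request
from typing import Callable


def build_remote_write_request(
    endpoint: str,
    serialized: bytes,
    compress: Callable[[bytes], bytes],
) -> urllib.request.Request:
    """Wrap an already-serialized message in a remote write HTTP request.

    compress is injected so the sketch stays free of third-party
    dependencies; a real exporter would pass a Snappy compressor here.
    """
    body = compress(serialized)
    headers = {
        "Content-Encoding": "snappy",
        "Content-Type": "application/x-protobuf",
        "X-Prometheus-Remote-Write-Version": "0.1.0",
    }
    return urllib.request.Request(
        endpoint, data=body, headers=headers, method="POST"
    )
```

Sending the request and mapping the HTTP status to a success or failure result is left out here, since that part depends on the HTTP client in use.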
An important aspect for any component that sends requests over a network is authentication. We made sure to include all common forms of authentication to allow users to keep their metric data secure. The Prometheus RW requests may be configured using a series of optional parameters in the exporter’s constructor.
Basic authentication and TLS certificate file paths can be passed in directly, while bearer tokens or any other form of authentication may be added using headers. In addition, users may set a timeout for requests to ensure they do not block for a long period when using shorter collection cycles. Below is an example of the exporter being initialized with various configuration options.
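As a minimal, hypothetical sketch of how such options might fit together (this is not the exporter's real constructor, and the class name, field names, endpoint, and file path are all assumptions), consider a small options container that turns basic authentication into an `Authorization` header and passes any extra headers through untouched:

```python
import base64
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RemoteWriteOptions:
    """Hypothetical container for the options described above."""
    endpoint: str
    basic_auth: Optional[Dict[str, str]] = None  # {"username": ..., "password": ...}
    headers: Dict[str, str] = field(default_factory=dict)
    tls_config: Optional[Dict[str, str]] = None  # certificate file paths
    timeout: float = 30.0  # seconds before a request is abandoned

    def request_headers(self) -> Dict[str, str]:
        """Merge user headers with a Basic auth header, if configured."""
        result = dict(self.headers)
        if self.basic_auth is not None:
            credentials = (
                f"{self.basic_auth['username']}:{self.basic_auth['password']}"
            )
            token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
            result["Authorization"] = f"Basic {token}"
        return result


# Example initialization with placeholder values.
options = RemoteWriteOptions(
    endpoint="https://metrics.example.com/api/v1/push",
    basic_auth={"username": "user", "password": "pass"},
    headers={"X-Scope-OrgID": "tenant-1"},
    tls_config={"ca_file": "/path/to/ca.pem"},
    timeout=5,
)
```

A bearer token would take the other route: leave `basic_auth` unset and supply the full `Authorization` value through `headers`.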
Error handling was an interesting area for this project. For parameter validation, throwing exceptions is appropriate to signal incorrect usage of the constructor. However, during export cycles, if an error occurs, throwing an exception would terminate flow for the entire pipeline. In many cases, especially when sending requests over a network, temporary errors may occur. To work around this, we decided to log error messages when export fails. If data systematically fails to be exported, the user can check the logs to debug the issue; however, if a single data point is missing, export will continue seamlessly.
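The split between the two kinds of errors could look roughly like this. The class, the `ExportResult` enum, and the injected `send` callable are hypothetical names used only for this sketch; what matters is that the constructor raises on bad input while `export` logs and returns a failure result instead of raising.

```python
import logging
from enum import Enum
from typing import Callable

logger = logging.getLogger(__name__)


class ExportResult(Enum):
    """Hypothetical result type reported back to the controller."""
    SUCCESS = 0
    FAILURE = 1


class SketchExporter:
    def __init__(
        self,
        endpoint: str,
        send: Callable[[str, bytes, float], None],
        timeout: float = 30,
    ) -> None:
        # Incorrect usage of the constructor is a programming error,
        # so it is reported immediately with an exception.
        if not endpoint:
            raise ValueError("endpoint must be a non-empty string")
        if timeout <= 0:
            raise ValueError("timeout must be greater than zero")
        self.endpoint = endpoint
        self.timeout = timeout
        self._send = send

    def export(self, payload: bytes) -> ExportResult:
        # Failures during an export cycle may be temporary, so they are
        # logged and reported rather than allowed to stop the pipeline.
        try:
            self._send(self.endpoint, payload, self.timeout)
        except Exception as err:
            logger.error("Export to %s failed: %s", self.endpoint, err)
            return ExportResult.FAILURE
        return ExportResult.SUCCESS
```

Because `export` never raises, the next collection cycle simply calls it again with fresh records.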
To ensure our exporter would be reliable and easy to maintain, we followed test-driven development and added end-to-end integration tests. To write our unit tests, we used Python’s unittest module, which made setting up the tests, running them, and mocking external dependencies straightforward. For our integration tests, we ran an instance of Cortex, which is a common Prometheus-integrated back end that stores time series data. We first generated metric data, then pushed it to Cortex using our exporter, and finally queried the database to ensure the data we sent was received successfully and had the structure we expected.
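To give a flavor of that unit-testing style, here is a small, self-contained sketch using `unittest` and `unittest.mock`. The `export` function under test is a hypothetical simplification written for this example, not the exporter's real code; the mock stands in for the network call so no back end is needed.

```python
import unittest
from unittest import mock


def export(records, send):
    """Hypothetical export function: send records, report success as a bool."""
    try:
        response = send(records)
    except ConnectionError:
        return False
    return response.ok


class TestExport(unittest.TestCase):
    def test_successful_response_is_reported_as_success(self):
        # The mocked sender returns an object whose .ok attribute is True.
        send = mock.Mock(return_value=mock.Mock(ok=True))
        self.assertTrue(export(["record"], send))
        send.assert_called_once_with(["record"])

    def test_network_error_is_reported_as_failure(self):
        # side_effect makes the mocked sender raise instead of returning.
        send = mock.Mock(side_effect=ConnectionError("unreachable"))
        self.assertFalse(export(["record"], send))


if __name__ == "__main__":
    unittest.main()
```

Injecting the sender as a parameter is one simple way to make the network boundary mockable; patching the HTTP client with `mock.patch` would be another.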
We included a sample app upstream that used every instrument with every aggregator so we could provide a detailed example to end users for using our exporter. The README provides details of how to set up our sample app to generate metrics, set up Cortex as the Remote Write endpoint, and use Grafana to visualize the data all in a dockerized environment. However, the bare minimum required to set up the Python SDK with our exporter is as follows:
Now we have set up the metrics collection pipeline with our Remote Write Exporter and set the export interval to 1 second. This means that every second the OpenTelemetry Python SDK will collect metrics and export them to the designated back end. The screenshot of Grafana from our sample app below shows a successful metrics export.
Designing and implementing the Prometheus Remote Write Exporter for the OpenTelemetry Python SDK was a great learning opportunity. We learned how to work with the open source community to develop high-quality, well-documented, and reliable code. We also now understand the importance of a rigorous development workflow to organize ideas, make well-founded decisions, and submit proposals upstream.
We thank our mentors and maintainers of the OpenTelemetry Python library, Leighton Chen and Alex Boten, as well as our manager, Alolita Sharma, who were great to work with and who provided us with guidance and detailed feedback during multiple cycles of reviews. Overall, working with the OpenTelemetry community has been amazing, and we encourage others interested in observability to get involved in the project and contribute.
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.