AWS adds observability metrics to the OpenTelemetry C++ library
In this post, three AWS interns—Brandon Kimberly, Ankit Bhargava, and Hudson Humphries—describe their first engineering contributions to the popular open source observability project OpenTelemetry.
Recently we made contributions to OpenTelemetry that included the metrics collection and processing functionality for the C++ library. These metrics are collected from instrumented applications and infrastructure. They allow users to monitor the health of their services, improve performance, and detect anomalies. In this post we explain how the new metrics pipeline works, how it might be used in code, and the lessons we learned along the way.
We identified the C++ API and SDK as a possible area of contribution, in particular the Metrics API and SDK functionality, which did not exist at the time. We saw this as a substantial contribution opportunity that aligned with our team’s development skill set and interests.
Given that we were implementing this portion of the library from scratch, we reviewed the universal OpenTelemetry specification to understand the metrics pipeline requirements. This added another layer of complexity to our task because the specification itself was still under development, so we were constantly aiming at a moving target. We translated these requirements into a set of design documents that we iteratively reviewed with the OpenTelemetry community.
Open source projects are unique in that, rather than a single central figure making decisions, there is generally a community process for decision-making. For example, all of our reviews, from initial project designs to the final code, received critiques from OpenTelemetry contributors across many organizations. Though design documents are perpetually evolving, we began implementing once we felt our proposal was sufficiently detailed and comprehensive.
The OpenTelemetry specification requires an API and an SDK for its metrics architecture. The API defines how to capture metric data, while the SDK processes, queries, and exports it. A user can inject our API elements into their application with no compilation issues; however, the API on its own will not be able to generate any useful metric data. After a user installs the SDK, the library aggregates, filters, and distributes API-captured data to any number of visualization backend services. We like to think of the API as feeding data into the SDK pipeline.
The API consists of three major components: metric instruments, meter, and meter provider classes. Metric instruments are what a user injects into their code at strategic locations to capture data of interest. The meter is responsible for creating these instruments and managing them in an internal registry. The meter also provides a single endpoint for collecting data from all operational metric instruments. Finally, the meter provider creates a global meter instance and allows users to specify certain aspects of the pipeline.
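The relationship between these three API components can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the class and method names below (Counter, Meter::new_counter, Meter::collect, MeterProvider::get_meter) are hypothetical stand-ins chosen for this example and are not the opentelemetry-cpp signatures.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical, simplified stand-ins; not the opentelemetry-cpp classes.

// A metric instrument: placed in application code to capture data.
class Counter {
public:
  explicit Counter(std::string name) : name_(std::move(name)) {}
  void add(long value) {
    assert(value >= 0 && "a counter only increases");
    total_ += value;
  }
  long total() const { return total_; }
  const std::string &name() const { return name_; }

private:
  std::string name_;
  long total_ = 0;
};

// The meter creates instruments and keeps them in an internal registry.
class Meter {
public:
  // Creates the counter on first request; later requests for the same
  // name return the already registered instrument.
  std::shared_ptr<Counter> new_counter(const std::string &name) {
    auto it = registry_.find(name);
    if (it != registry_.end()) return it->second;
    auto counter = std::make_shared<Counter>(name);
    registry_.emplace(name, counter);
    return counter;
  }

  // Single endpoint that reads the current value of every instrument.
  std::map<std::string, long> collect() const {
    std::map<std::string, long> snapshot;
    for (const auto &entry : registry_) {
      snapshot[entry.first] = entry.second->total();
    }
    return snapshot;
  }

private:
  std::map<std::string, std::shared_ptr<Counter>> registry_;
};

// The meter provider hands out one global meter instance.
class MeterProvider {
public:
  static Meter &get_meter() {
    static Meter meter;
    return meter;
  }
};
```

In this sketch, application code would call MeterProvider::get_meter().new_counter("requests") once, call add() wherever a request is handled, and leave collection to whatever reads collect().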
The Metrics SDK has five components that turn captured raw data into insightful metrics:
- Controller: The Controller oversees and manages all SDK elements.
- Accumulator: Users start the pipeline from the Controller, which then queries the Accumulator for data at a user-specified interval.
- Aggregators: When queried, the Accumulator (the SDK version of the meter) will loop over all the aggregators in its registry and collect their current state. Note that each metric instrument has its own aggregator to combine captured data.
- Processor: The Accumulator then batches these states and sends data to the Processor for filtering. We implemented a default processor that acts as a pass-through; users can easily define their own and plug them into the pipeline.
- Exporter: The Processor will send data back to the Controller, which hands it off to the Exporter at the conclusion of the pipeline. We can see how the metric instruments capture data as it weaves its way through numerous components before finally reaching the user.
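The flow through these five components can be sketched as follows. This is a minimal, single-threaded illustration, not the SDK implementation: all type names are hypothetical, the only aggregator shown is a sum, and the tick() method stands in for one collection interval that the real Controller would trigger on a timer.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for the SDK components.
struct Record {
  std::string name;
  long value;
};

// Combines the values captured by one instrument.
class SumAggregator {
public:
  void update(long value) { current_ += value; }
  // Returns the value accumulated so far and resets for the next interval.
  long checkpoint() {
    long out = current_;
    current_ = 0;
    return out;
  }

private:
  long current_ = 0;
};

// Owns one aggregator per instrument and batches their states on request.
class Accumulator {
public:
  SumAggregator &aggregator_for(const std::string &instrument) {
    return aggregators_[instrument];
  }
  std::vector<Record> collect() {
    std::vector<Record> batch;
    for (auto &entry : aggregators_) {
      batch.push_back({entry.first, entry.second.checkpoint()});
    }
    return batch;
  }

private:
  std::map<std::string, SumAggregator> aggregators_;
};

using Processor = std::function<std::vector<Record>(std::vector<Record>)>;
using Exporter = std::function<void(const std::vector<Record> &)>;

// Default processor in this sketch: returns the batch unchanged.
std::vector<Record> pass_through(std::vector<Record> batch) { return batch; }

// Drives one pass: Accumulator -> Processor -> back here -> Exporter.
class Controller {
public:
  Controller(Accumulator &accumulator, Processor processor, Exporter exporter)
      : accumulator_(accumulator),
        processor_(std::move(processor)),
        exporter_(std::move(exporter)) {
    assert(processor_ && exporter_);  // both stages must be callable
  }
  void tick() { exporter_(processor_(accumulator_.collect())); }

private:
  Accumulator &accumulator_;
  Processor processor_;
  Exporter exporter_;
};
```

Because the processor and exporter are plain callables here, swapping in a filtering processor or a different exporter only changes the arguments passed to the Controller constructor.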
We also implemented an OStreamExporter for the metrics and tracing pipelines. These Exporters are comparable to stdout in other OpenTelemetry projects. Exporters are initialized with an ostream to which the user wants to send all metric or tracing data. We originally only planned to implement stdout functionality, but we decided to handle all streams with stdout being the default. This was a simple change for a lot of additional functionality.
The OStreamMetricsExporter implements only one function, Export(), which takes in a vector of records that contain the name, description, labels, and aggregator from an instrument. The function then sends the name, description, and labels to the ostream. Aggregators are templatized and held within a variant, making the process of sending them to the ostream more complicated. First, we must find out what type the variant is holding, which we can determine by using holds_alternative. We then pass the variant to a template function that unpacks the aggregator. Once we have the aggregator, we can check what kind of aggregator it is, then send the relevant information to the ostream based on its kind.
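The unpacking step described above can be sketched with the C++17 standard library. This is an assumption-laden illustration: it uses std::variant and std::holds_alternative rather than whatever variant type the library uses, and the SumAggregator, Record, print_aggregator, and export_records names are hypothetical, with a single aggregator kind over two value types.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <variant>
#include <vector>

// Hypothetical templated aggregator; the real library has several kinds.
template <typename T>
struct SumAggregator {
  T sum;
};

using AggregatorVariant =
    std::variant<SumAggregator<int>, SumAggregator<double>>;

// Hypothetical record layout: name, description, labels, aggregator.
struct Record {
  std::string name;
  std::string description;
  std::string labels;
  AggregatorVariant aggregator;
};

// Unpacks the aggregator of value type T from the variant and prints it.
template <typename T>
void print_aggregator(std::ostream &out, const AggregatorVariant &held) {
  assert(std::holds_alternative<SumAggregator<T>>(held));
  const auto &aggregator = std::get<SumAggregator<T>>(held);
  out << "  sum: " << aggregator.sum << "\n";
}

// Sends every record to the stream chosen by the caller.
void export_records(std::ostream &out, const std::vector<Record> &records) {
  for (const auto &record : records) {
    out << "name: " << record.name << "\n"
        << "  description: " << record.description << "\n"
        << "  labels: " << record.labels << "\n";
    // Find out which alternative is held, then dispatch to the template.
    if (std::holds_alternative<SumAggregator<int>>(record.aggregator)) {
      print_aggregator<int>(out, record.aggregator);
    } else if (std::holds_alternative<SumAggregator<double>>(record.aggregator)) {
      print_aggregator<double>(out, record.aggregator);
    }
  }
}

// Convenience wrapper showing that any ostream works, not only std::cout.
std::string export_to_string(const std::vector<Record> &records) {
  std::ostringstream buffer;
  export_records(buffer, records);
  return buffer.str();
}
```

Passing std::cout to export_records would give the stdout behavior, while export_to_string shows the same function writing into an in-memory stream.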
The OStreamSpanExporter is the equivalent of the OStreamMetricsExporter for the tracing pipeline. Implemented in the C++ project, the tracing pipeline functions like the metrics pipeline.
The SpanProcessor automatically sends spans to the Exporter when their memory is deleted. The SpanExporter receives a batch of spans and sends their basic information to the ostream it was initialized with. This includes the name, trace id, span id, parent span id, start time, duration, description, and status. Each span also has attributes, which are held in a map of string to variant and, because the map holds a variant, we must unpack each value. After we unpack the variant, we can send the data it holds to the ostream and conclude exporting.
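Printing an attribute map of string to variant can be sketched as follows. This version uses std::visit from the C++17 standard library as one possible way to unpack each value; the SpanData struct, the AttributeValue alternatives, and the function names are assumptions made for this example, and only the span name and attributes are shown.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <variant>

// Hypothetical attribute value: a small set of printable alternatives.
using AttributeValue = std::variant<std::int64_t, double, std::string>;

// Hypothetical, trimmed-down span: only the name and attributes.
struct SpanData {
  std::string name;
  std::map<std::string, AttributeValue> attributes;
};

// Writes the span name, then each attribute on its own indented line.
void export_span(std::ostream &out, const SpanData &span) {
  out << "name: " << span.name << "\n";
  for (const auto &[key, value] : span.attributes) {
    assert(!value.valueless_by_exception());
    out << "  " << key << ": ";
    // std::visit calls the lambda with whichever alternative is held.
    std::visit([&out](const auto &held) { out << held; }, value);
    out << "\n";
  }
}

// Convenience wrapper that captures the output in a string.
std::string span_to_string(const SpanData &span) {
  std::ostringstream buffer;
  export_span(buffer, span);
  return buffer.str();
}
```

A generic lambda works here because every alternative can be streamed with operator<<; alternatives that need different formatting would call for separate overloads.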
With this information, users can monitor the health of their system, detect anomalies, and improve performance. These metrics are an improvement over logs because they are intelligently aggregated and automatically processed. Users can spend more time analyzing the insights instead of tediously extracting them from thousands of lines of logs.
We can examine how a user might actually instrument their code and capture labeled data in the following diagram:
Throughout this project, we learned a great deal about developing scalable code and working with a large community of open source contributors. As interns, we are used to writing code the minute we see a specification, but this approach fails miserably as the project grows in complexity. We found that it is crucial to plan nearly every aspect of a library’s architecture before writing a single line of code.
Additionally, we embraced the tenets of open source design, which center on transparency and debate: Document and discuss everything, no exceptions. Overall, working on OpenTelemetry was a terrific experience, and we are interested in continuing to work in the open source community.
- opentelemetry-cpp: Learn more about OpenTelemetry observability with metrics functionality.
- open-o11y/docs repository: Our Metrics SDK, API, and Exporter design documents.
- OpenTelemetry project