AWS Open Source Blog

AWS adds observability metrics to the OpenTelemetry C++ library

In this post, three AWS interns—Brandon Kimberly, Ankit Bhargava, and Hudson Humphries—describe their first engineering contributions to the popular open source observability project OpenTelemetry.

Recently we made contributions to OpenTelemetry that included the metrics collection and processing functionality for the C++ library. These metrics are collected from instrumented applications and infrastructure. They allow users to monitor the health of their services, improve performance, and detect anomalies. In this post we explain how the new metrics pipeline works, how it might be used in code, and the lessons we learned along the way.

OpenTelemetry

OpenTelemetry is an open source observability framework for collecting telemetry data. Its mission is to develop an open, industry-wide standard for telemetry data, and to provide reference implementations with universal tools that support metrics, tracing, and logs. OpenTelemetry currently supports Python, Golang, JavaScript, Erlang, Java, .NET, PHP, Rust, C++, Ruby, and Swift.

We found that the C++ API and SDK were a possible area of contribution, in particular the Metrics API and SDK functionality, which did not exist at the time. We saw this as a substantial contribution opportunity that aligned with our team’s development skill set and interests.

Given that we were implementing this portion of the library from scratch, we reviewed the universal OpenTelemetry specification to understand the metrics pipeline requirements. This added another layer of complexity to our task because the specification itself remains under development. This meant we were constantly aiming at a moving target. We translated these requirements into a set of design documents that we iteratively reviewed with the OpenTelemetry community.

Open source projects are unique in that, rather than a single central figure making decisions, there is generally a community process for decision-making. For example, all of our reviews, from initial project designs to the final code, received critiques from OpenTelemetry contributors across many organizations. Though design documents are perpetually evolving, we began implementing once we felt our proposal was sufficiently detailed and comprehensive.

OpenTelemetry API & SDK Structure

The OpenTelemetry specification requires an API and an SDK for its metrics architecture. The API defines how to capture metric data, while the SDK processes, queries, and exports it. A user can inject our API elements into their application with no compilation issues; however, the API on its own will not be able to generate any useful metric data. After a user installs the SDK, the library aggregates, filters, and distributes API-captured data to any number of visualization backend services. We like to think of the API as feeding data into the SDK pipeline.

The API consists of three major components: metric instruments, meter, and meter provider classes. Metric instruments are what a user injects into their code at strategic locations to capture data of interest. The meter is responsible for creating these instruments and managing them in an internal registry. The meter also provides a single endpoint for collecting data from all operational metric instruments. Finally, the meter provider creates a global meter instance and allows users to specify certain aspects of the pipeline.
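To make the relationship between these three components concrete, here is a minimal sketch. The class names and methods (`Counter`, `Meter::NewCounter`, `Meter::Collect`, `MeterProvider::GetMeter`) are simplified, hypothetical stand-ins chosen for illustration; they are not the actual OpenTelemetry C++ API.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical, simplified stand-ins; NOT the real OpenTelemetry C++ classes.

// A metric instrument: placed in application code to capture values.
class Counter {
public:
  explicit Counter(std::string name) : name_(std::move(name)) {}
  void Add(long value) { total_ += value; }
  long Total() const { return total_; }
  const std::string &Name() const { return name_; }

private:
  std::string name_;
  long total_ = 0;
};

// The meter creates instruments and tracks them in an internal registry.
class Meter {
public:
  std::shared_ptr<Counter> NewCounter(const std::string &name) {
    auto it = registry_.find(name);
    if (it != registry_.end()) return it->second;  // reuse an existing instrument
    auto counter = std::make_shared<Counter>(name);
    registry_[name] = counter;
    return counter;
  }

  // Single endpoint for collecting data from every registered instrument.
  std::map<std::string, long> Collect() const {
    std::map<std::string, long> snapshot;
    for (const auto &entry : registry_) {
      snapshot[entry.first] = entry.second->Total();
    }
    return snapshot;
  }

private:
  std::map<std::string, std::shared_ptr<Counter>> registry_;
};

// The meter provider hands out one shared meter instance.
class MeterProvider {
public:
  static std::shared_ptr<Meter> GetMeter() {
    static std::shared_ptr<Meter> meter = std::make_shared<Meter>();
    return meter;
  }
};
```

In this sketch, application code would call `MeterProvider::GetMeter()`, create a counter with `NewCounter("requests")`, and call `Add` wherever a request is handled; `Collect()` then returns a snapshot of every instrument the meter created.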

Figure: Metrics SDK data path

Metrics SDK

The Metrics SDK has five components that turn captured raw data into insightful metrics:

  • Controller: The Controller oversees and manages all SDK elements.
  • Accumulator: Users start the pipeline from the Controller, which then queries the Accumulator for data at a user-specified interval.
  • Aggregators: When queried, the Accumulator (the SDK version of the meter) will loop over all the aggregators in its registry and collect their current state. Note that each metric instrument has its own aggregator to combine captured data.
  • Processor: The Accumulator then batches these states and sends data to the Processor for filtering. We implemented a default processor that acts as a pass-through; users can easily define their own and plug them into the pipeline.
  • Exporter: The Processor will send data back to the Controller, which hands it off to the Exporter at the conclusion of the pipeline. In this way, data captured by the metric instruments passes through each of these components before finally reaching the user.
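The data path described in the list above can be sketched roughly as follows. All types here (`Record`, `SumAggregator`, `Accumulator`, `Controller`, `PassThrough`) are hypothetical simplifications for illustration, not the SDK's real classes, and the sketch replaces the user-specified collection interval with a single `Tick()` call that runs the pipeline once.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified types; NOT the real OpenTelemetry C++ SDK.
struct Record {
  std::string name;
  long value;
};

// Aggregator: combines the raw values captured by one instrument (here, a sum).
class SumAggregator {
public:
  void Update(long value) { sum_ += value; }
  long Current() const { return sum_; }

private:
  long sum_ = 0;
};

// Accumulator: loops over the aggregators in its registry and batches their state.
class Accumulator {
public:
  SumAggregator &Instrument(const std::string &name) { return aggregators_[name]; }

  std::vector<Record> Collect() const {
    std::vector<Record> batch;
    for (const auto &entry : aggregators_) {
      batch.push_back(Record{entry.first, entry.second.Current()});
    }
    return batch;
  }

private:
  std::map<std::string, SumAggregator> aggregators_;
};

// Processor and Exporter are modeled as plain callables in this sketch.
using Processor = std::function<std::vector<Record>(std::vector<Record>)>;
using Exporter = std::function<void(const std::vector<Record> &)>;

// Default processor: passes the batch through unchanged.
inline std::vector<Record> PassThrough(std::vector<Record> batch) { return batch; }

// Controller: queries the Accumulator, runs the Processor, hands off to the Exporter.
class Controller {
public:
  Controller(Accumulator &accumulator, Processor processor, Exporter exporter)
      : accumulator_(accumulator),
        processor_(std::move(processor)),
        exporter_(std::move(exporter)) {}

  // One pass through the pipeline; a real controller would repeat this on a timer.
  void Tick() { exporter_(processor_(accumulator_.Collect())); }

private:
  Accumulator &accumulator_;
  Processor processor_;
  Exporter exporter_;
};
```

Modeling the Processor as a callable keeps the sketch short while still showing the plug-in point: a user-defined filter can replace `PassThrough` without touching the other components.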

OStreamExporter

We also implemented an OStreamExporter for the metrics and tracing pipelines. These Exporters are comparable to the stdout exporters in other OpenTelemetry projects. Each Exporter is initialized with the ostream to which the user wants to send all metric or tracing data. We originally planned to implement only stdout functionality, but we decided to handle all streams, with stdout being the default. This was a simple change that added a lot of functionality.

The OStreamMetricsExporter implements only one function, Export(), which takes in a vector of records that contain the name, description, labels, and aggregator from an instrument. The function then sends the name, description, and labels to the ostream. Aggregators are templatized and held within a variant, making the process of sending them to the ostream more complicated. First, we must find out which type the variant is holding, which we can do by using holds_alternative. We then send the variant to a template function that unpacks the aggregator. Once we have the aggregator, we can check what kind of aggregator it is, then send the relevant information to the ostream based on its kind.

OStreamSpanExporter

The OStreamSpanExporter is the equivalent of the OStreamMetricsExporter for the tracing pipeline. The tracing pipeline in the C++ project functions much like the metrics pipeline.

The SpanProcessor automatically sends spans to the Exporter when their memory is freed. The SpanExporter receives a span of spans and sends their basic information to the ostream it was initialized with. This includes the name, trace id, span id, parent span id, start time, duration, description, and status. Each span also has attributes, which are held in a map of string to variant; because the map holds variants, we must unpack each value. After we unpack the variant, we can send the data it holds to the ostream and conclude exporting.

With this information, users can monitor the health of their system, detect anomalies, and improve performance. These metrics are an improvement over logs because they are intelligently aggregated and automatically processed. Users can spend more time analyzing the insights instead of tediously extracting them from thousands of lines of logs.

We can examine how a user might actually instrument their code and capture labeled data in the following code snippet:

Figure: Instrument use code snippet
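The general idea of capturing labeled data can be sketched as follows. This is a hypothetical example written for illustration only: `LabeledCounter`, `Labels`, and `HandleRequest` are invented names, and the real OpenTelemetry C++ instrument API differs.

```cpp
#include <map>
#include <string>

// Hypothetical label set and instrument; NOT the real OpenTelemetry C++ API.
using Labels = std::map<std::string, std::string>;

class LabeledCounter {
public:
  // Record a value under a specific set of labels.
  void Add(long value, const Labels &labels) { totals_[labels] += value; }

  // Total recorded so far for exactly this label set (0 if never seen).
  long Total(const Labels &labels) const {
    auto it = totals_.find(labels);
    return it == totals_.end() ? 0 : it->second;
  }

private:
  std::map<Labels, long> totals_;
};

// Application code: the instrument is called at the point of interest.
inline void HandleRequest(LabeledCounter &requests, const std::string &route, bool ok) {
  requests.Add(1, {{"route", route}, {"status", ok ? "ok" : "error"}});
}
```

Keeping a separate total per label set is what lets a backend later break a single metric down by route or by status.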

Throughout this project, we learned a great deal about developing scalable code and working with a large community of open source contributors. As interns, we are used to writing code the minute we see a specification, but this approach fails miserably as the project grows in complexity. We found that it is crucial to plan nearly every aspect of a library’s architecture before writing a single line of code.

Additionally, we embraced the tenets of open source design, which center on transparency and debate: Document and discuss everything, no exceptions. Overall, working on OpenTelemetry was a terrific experience, and we are interested in continuing to work in the open source community.

About the authors

Brandon Kimberly

Brandon Kimberly is a senior at Ohio University, currently interning as a software developer at AWS. He is interested in machine learning, observability, and all things Rust.

Ankit Bhargava

Ankit Bhargava is a senior at the University of Michigan, majoring in computer science and business, and interning as a software engineer at AWS. When he’s not working on OpenTelemetry, Ankit likes to read about cybersecurity and machine learning.

Hudson Humphries

Hudson Humphries is a senior at Texas A&M University, majoring in computer science with a minor in statistics, and interning as a software engineer at AWS. He’s interested in data analytics.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

Alolita Sharma

Alolita is a senior manager at AWS where she leads open source observability engineering and collaboration for OpenTelemetry, Prometheus, Cortex, and Grafana. Alolita is co-chair of the CNCF Technical Advisory Group for Observability, a member of the OpenTelemetry Governance Committee, and a board director of the Unicode Consortium. She contributes to open standards at OpenTelemetry, Unicode, and W3C. She has served on the boards of the OSI and SFLC.in. Alolita has led engineering teams at Wikipedia, Twitter, PayPal, and IBM. Two decades of doing open source continue to inspire her. You can find her on Twitter @alolita.