What is distributed tracing?
Distributed tracing is the practice of observing requests as they flow through a distributed system. A modern microservices architecture often has many small, independent components that constantly communicate and exchange data through APIs to do complex work. With distributed tracing, developers can trace, or visually follow, a request path across different microservices. This visibility helps them troubleshoot errors, fix bugs, and resolve performance issues.
What are the benefits of distributed tracing?
Software developers can implement distributed tracing systems in almost any cloud-native environment, as well as record distributed traces that the cloud applications generate. Moreover, tracing tools support numerous programming languages and software stacks, allowing software teams to monitor and collect performance data for different applications on the same platform.
Development teams use distributed tracing to improve observability, as well as solve performance issues that conventional software debugging and monitoring tools can’t help with.
The following are more benefits of distributed tracing.
Accelerate software troubleshooting
Modern applications rely on numerous microservices to exchange data and fulfill service requests across distributed systems. Troubleshooting performance issues in a microservice-based architecture is significantly more challenging than in a monolithic software application. Unlike in a monolithic application, the root cause of a specific software problem might not be apparent, because the overlapping and complex interactions between multiple software modules can make issues difficult to diagnose.
With distributed tracing, software teams can monitor data that passes through complex paths connecting various microservices and data storage. Using distributed tracing tools, software teams track requests and visualize data propagation paths with precision. Software teams can solve performance issues promptly and minimize service disruptions.
Improve developer collaboration
Several developers are often involved in building a cloud application, with each responsible for one or several microservices. The software development process slows down if developers cannot trace the data that microservices exchange. With distributed tracing systems, developers can collaborate by sharing telemetry data, such as logs and traces, for every service request a microservice makes. Developers can then respond accurately to bugs and other software issues discovered during testing and production.
Reduce time to market
Organizations that deploy distributed tracing platforms can streamline and accelerate their efforts to release software applications to end users. Software teams review distributed traces to gain insights that help them speed up software development, minimize development costs, understand user behavior, and improve market readiness.
What are the different types of distributed tracing?
Software teams use distributed tracing tools to monitor, analyze, and optimize applications. The following are the main types of tracing they use.
Code tracing
Code tracing is a process that inspects the flow of source code in an application as it performs a specific function. It helps developers understand the logical flow of the code and identify unknown issues. For example, developers use code tracing to validate that the service request has invoked steps to query a database. If a software function fails to respond, the tracing system collects the appropriate error status and draws attention to the response time.
Program tracing
Program tracing is a method wherein developers can examine the addresses of instructions and variables called by an active application. When a software application runs, it processes each line of code that resides in a specific allocated memory space. The application also processes variables stored in the machine memory. Inspecting changes in program and data memories is challenging without an automated tool. With program tracing, software teams can diagnose deep-rooted performance issues like memory overflow, excessive resource consumption, and blocking logic operations.
End-to-end tracing
With end-to-end tracing, development teams can track data transformation along the service request path. When an application initiates a request, it sends data to other software components for further processing. Developers use tracing tools to track and compile the changes that critical data undergoes from end to end. This gives an application-centric view of requests flowing through the application.
How does end-to-end distributed tracing work in microservices architecture?
When using applications, users initiate service requests and different application components process the request.
Consider a user making a ticket booking in an online movie booking application. The user enters their contact details, movie details, and payment information and chooses Book Now. A request is created that goes to:
- Microservice A that validates the user-entered data.
- Microservice B that takes the data from A and creates a record in the customer database.
- Microservice C that takes the data from B and validates payment.
- Microservice D that takes the data from C, allocates a seat, and generates movie ticket data.
- Microservice E that takes the data from D and creates a formatted ticket PDF file.
A response containing the ticket PDF is then returned up the chain of microservices from E to D to C to B to A, until it eventually reaches the user. This example is simple. In practice, a request often passes through several dozen microservices and even chains of third-party software components outside the application, which makes the process increasingly complex.
Distributed tracing systems track these interactions of service requests with other microservices and software components in the distributed computing environment. A distributed trace represents the timeline and all the actions that occur between request generation and response receipt. Software teams use the trace to follow data movement through multiple microservices that the initial request interacts with.
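As a rough illustration of the booking example, the following Python sketch models the chain of microservices A through E as nested function calls and records a timeline of events for one request. The service labels, the logical clock, and the `call_service` helper are hypothetical names chosen for this sketch. A real tracing system would record wall-clock timestamps across separate processes.

```python
from itertools import count

# Hypothetical logical clock so the output is deterministic; a real tracer
# would use wall-clock timestamps.
_clock = count(1)

# The "trace" for one request: an ordered list of (tick, service, event).
timeline = []

# Labels mirror the booking example; requests flow A -> B -> C -> D -> E.
CHAIN = ["A:validate", "B:save-customer", "C:payment", "D:allocate-seat", "E:render-pdf"]


def call_service(index, payload):
    """Run service CHAIN[index], call the next service, then return the response."""
    name = CHAIN[index]
    timeline.append((next(_clock), name, "request received"))
    if index + 1 < len(CHAIN):
        payload = call_service(index + 1, payload)  # downstream call
    else:
        payload = {**payload, "ticket": "ticket.pdf"}  # E produces the ticket
    timeline.append((next(_clock), name, "response sent"))
    return payload


result = call_service(0, {"movie": "Example Movie", "seats": 1})

for tick, service, event in timeline:
    print(f"t={tick:02d}  {service:<16} {event}")
```

The printed timeline shows the request descending from A to E and the response returning from E to A, which is the sequence a distributed trace captures for this request.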
Span
When processing a service request, an application might take several actions. These actions are represented as spans in distributed tracing. For example, a span might be an API call, user authentication, or enabling storage access. If a single request results in several actions, the initial (or parent) span may branch into several child spans. These nested layers of parent and child spans form a continuous logical representation of steps taken to accomplish the service request.
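The parent and child relationship can be pictured as a tree. The following minimal Python sketch assumes a hypothetical `Span` class with only a name, a parent, and a list of children. Real tracing libraries store more fields, such as timestamps and attributes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical span record: one action taken while serving a request."""
    name: str
    # Excluded from repr/compare to avoid walking the tree in both directions.
    parent: Optional["Span"] = field(default=None, repr=False, compare=False)
    children: List["Span"] = field(default_factory=list)

    def child(self, name: str) -> "Span":
        """Create a child span nested under this one."""
        span = Span(name=name, parent=self)
        self.children.append(span)
        return span


def render(span: Span, depth: int = 0) -> List[str]:
    """Return the span tree as indented lines, parents before children."""
    lines = ["  " * depth + span.name]
    for c in span.children:
        lines.extend(render(c, depth + 1))
    return lines


# The parent span branches into child spans; one child has its own child.
root = Span("handle service request")
auth = root.child("user authentication")
api = root.child("API call")
api.child("enable storage access")

print("\n".join(render(root)))
```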
Trace ID
The distributed tracing system assigns a unique ID to every request in order to track it. Each span inherits the same trace ID from the original request it belongs to. Spans are also tagged with a unique span ID that helps the tracing system consolidate the metadata, logs, and metrics it collects.
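The following Python sketch shows one way trace and span IDs might be generated and passed between services. The header layout is modeled on the W3C Trace Context `traceparent` header (version, 32-hex-character trace ID, 16-hex-character parent span ID, flags), which is an assumption of this sketch and is not something the tools described here are required to use. The function names `inject` and `extract` are also illustrative.

```python
import secrets


def new_trace_id() -> str:
    """Random 16-byte ID shared by every span in one request (32 hex chars)."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """Random 8-byte ID unique to a single span (16 hex chars)."""
    return secrets.token_hex(8)


def inject(trace_id: str, span_id: str) -> dict:
    """Build outgoing headers that carry the trace context downstream.

    Assumed layout: version "00", trace ID, calling span ID, flags "01" (sampled).
    """
    return {"traceparent": f"00-{trace_id}-{span_id}-01"}


def extract(headers: dict) -> tuple:
    """Read the trace ID and the caller's span ID from incoming headers."""
    _version, trace_id, parent_span_id, _flags = headers["traceparent"].split("-")
    return trace_id, parent_span_id


# Service A starts the trace and calls service B.
trace_id = new_trace_id()
span_a = new_span_id()
headers = inject(trace_id, span_a)

# Service B inherits the trace ID and creates its own span ID.
incoming_trace_id, parent_id = extract(headers)
span_b = new_span_id()

print("same trace:", incoming_trace_id == trace_id)
print("B's parent is A:", parent_id == span_a)
```

Because every span carries the same trace ID, a backend can later group spans from different services into one trace, and the parent span ID lets it rebuild the parent and child structure.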
Metric collection
As the request passes through different microservices, each span records metrics that give developers deep and precise insight into the software's behavior. You can collect error rate, timestamp, response time, and other metadata with the spans. After the trace completes an entire cycle, the distributed tracing tool consolidates all the data collected.
For example, an API call is evaluated with response time, error status, and breakdown of secondary functions fulfilled by multiple third-party services. The tracing tool turns the data into visual forms, highlighting key indicators and performance summaries. This way, site reliability engineers can rapidly identify errors, inspect critical data elements, and collaborate with development teams to remediate performance issues and ensure compliance with Service Level Agreements (SLAs).
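To make the consolidation step concrete, the following Python sketch summarizes a handful of finished spans into per-service call counts, error counts, and total time. The span tuples and the `summarize` function are hypothetical. A real tracing backend would compute many more indicators.

```python
from collections import defaultdict

# Hypothetical finished spans for one trace: (service, start_ms, end_ms, error).
spans = [
    ("A", 0, 120, False),
    ("B", 10, 40, False),
    ("C", 45, 110, True),
    ("C", 112, 118, False),
    ("D", 50, 70, False),
]


def summarize(spans):
    """Consolidate spans into per-service totals, the longest span, and error rate."""
    per_service = defaultdict(lambda: {"calls": 0, "errors": 0, "total_ms": 0})
    for service, start, end, error in spans:
        stats = per_service[service]
        stats["calls"] += 1
        stats["errors"] += int(error)
        stats["total_ms"] += end - start
    longest = max(spans, key=lambda s: s[2] - s[1])
    error_rate = sum(1 for s in spans if s[3]) / len(spans)
    return dict(per_service), longest[0], error_rate


summary, longest_service, error_rate = summarize(spans)

for service, stats in sorted(summary.items()):
    print(service, stats)
print("longest span belongs to:", longest_service)
print("error rate:", error_rate)
```

In this sketch the longest span belongs to service A because A's span encloses the downstream calls, so a visualization would typically show it as the outermost bar.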
What are distributed tracing standards?
Distributed tracing standards provide a common framework and software tools for developers. These standards define how to monitor, visualize, and analyze service requests in modern application environments. By standardizing the distributed tracing workflow, software teams can instrument request tracing without being subjected to vendor lock-in.
The following sections describe standards introduced to enable interoperability when performing distributed tracing.
OpenTracing
OpenTracing is an open source distributed tracing standard developed as a Cloud Native Computing Foundation (CNCF) project. OpenTracing focuses on enabling developers to generate traces with an instrumentation API. This allows developers to generate distributed traces from different parts of the code base, library, or other dependencies.
OpenCensus
OpenCensus consists of multi-language libraries capable of extracting software metrics and sending them to backend systems for analysis. Developers can use the provided API to manage how traces are generated and collected. Unlike OpenTracing, developers work with OpenCensus from a single project repository instead of individual code bases and libraries.
OpenTelemetry
OpenTelemetry unifies OpenTracing and OpenCensus. It combines the best features of both standards to provide a comprehensive distributed tracing framework. OpenTelemetry provides software development kits, APIs, libraries, and other instrumentation tools that make distributed tracing easier to implement.
What’s the difference between distributed tracing and logging?
Logging is the practice of recording specific events that occur when an application runs. Logging tools collect timestamped events, such as system errors, user interactions, and communication statuses, to help development teams detect system anomalies. Generally, there are two types of logging:
- Centralized logging collects all recorded activities and stores them in a single location.
- Distributed logging stores log files in separate locations on the cloud.
Both logging methods provide a static overview of incidents that shows developers what happened in the application. In contrast, distributed tracing provides an audit trail that clarifies why an incident occurred by correlating the telemetry data collected throughout the lifetime of a service request. Distributed tracing may use logging and other data collection methods to trace a specific service request.
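The difference can be sketched in a few lines of Python. The log records below are hypothetical, and the `trace_id` field is an assumption about how services might tag their log entries. Grouping the records by that field turns a flat list of events into a per-request story.

```python
from collections import defaultdict

# Hypothetical log records from several services, in arrival order.
# The "trace_id" field is an assumed tag that links a record to one request.
logs = [
    {"ts": 3, "service": "C", "trace_id": "t1", "msg": "payment declined"},
    {"ts": 1, "service": "A", "trace_id": "t1", "msg": "input validated"},
    {"ts": 2, "service": "A", "trace_id": "t2", "msg": "input validated"},
    {"ts": 2, "service": "B", "trace_id": "t1", "msg": "customer record created"},
    {"ts": 4, "service": "B", "trace_id": "t2", "msg": "customer record created"},
]


def by_trace(records):
    """Group log records by trace ID, ordered by timestamp within each trace."""
    grouped = defaultdict(list)
    for record in sorted(records, key=lambda r: r["ts"]):
        grouped[record["trace_id"]].append(f'{record["service"]}: {record["msg"]}')
    return dict(grouped)


stories = by_trace(logs)

for trace_id, events in sorted(stories.items()):
    print(trace_id, "->", " | ".join(events))
```

Read as plain logs, the records only show that a payment was declined at some point. Grouped by request, they show which request failed and what happened before the failure.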
What are the challenges of distributed tracing?
Distributed tracing has simplified developers' efforts to diagnose, debug, and fix software issues. Even so, software teams must be mindful of the following challenges when choosing tracing tools.
Manual instrumentation
Some tracing tools require software teams to manually instrument their code to generate the necessary traces. When developers modify code to trace requests, they risk introducing coding errors that affect production releases. Moreover, the lack of automation complicates tracing, which can result in delays and inaccurate data collection.
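As an illustration of what manual instrumentation can involve, the following Python sketch wraps a function in a hypothetical `traced` decorator that records a span for each call. The decorator, the `recorded_spans` list, and the `validate_payment` function are invented for this sketch and do not represent any particular tool's API.

```python
import functools
import time

# Stand-in for a tracing backend; a real tool would export spans elsewhere.
recorded_spans = []


def traced(name):
    """Decorator that records a span (name, duration, error) around a function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            error = None
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                error = type(exc).__name__
                raise  # instrumentation must not swallow application errors
            finally:
                recorded_spans.append({
                    "name": name,
                    "duration_s": time.perf_counter() - start,
                    "error": error,
                })
        return wrapper
    return decorator


@traced("validate-payment")
def validate_payment(amount):
    if amount <= 0:
        raise ValueError("amount must be positive")
    return True


validate_payment(25)
try:
    validate_payment(0)
except ValueError:
    pass

for span in recorded_spans:
    print(span["name"], "error:", span["error"])
```

Even this small wrapper has to get details right, such as re-raising exceptions and recording the span on both success and failure, which is where the risk of instrumentation bugs comes from.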
Limited frontend coverage
Developers may not be able to gain complete oversight of performance issues if their tracing tools are restricted to backend analysis. In some cases, the distributed tracing system only starts collecting data when the first backend service receives the request. This means developers cannot detect and inspect issues arising from frontend services during the corresponding user session.
Random sampling
Some tools don't allow software teams to prioritize which requests to trace, limiting observability to randomly sampled traces. With limited sample data, organizations need additional software troubleshooting approaches to capture major issues that escape the tracing tool.
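The following Python sketch contrasts the two approaches under simple assumptions. `head_sample` keeps each trace with a fixed probability before its outcome is known, while `prioritized_sample` applies a hypothetical rule that keeps every trace that failed or exceeded a latency threshold. The trace records, function names, and threshold are all illustrative.

```python
import random


def head_sample(trace_ids, rate, seed=0):
    """Random sampling: keep each trace with probability `rate`,
    decided without looking at how the request turned out."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    return [t for t in trace_ids if rng.random() < rate]


def prioritized_sample(traces, latency_threshold_ms):
    """Prioritized sampling (hypothetical rule): keep every trace that
    errored or was at least as slow as the threshold."""
    return [
        t["id"]
        for t in traces
        if t["error"] or t["latency_ms"] >= latency_threshold_ms
    ]


# Ten hypothetical traces: t3 is slow and t7 failed; the rest are healthy.
traces = [
    {"id": f"t{i}", "error": i == 7, "latency_ms": 900 if i == 3 else 50}
    for i in range(10)
]
ids = [t["id"] for t in traces]

random_kept = head_sample(ids, rate=0.2)
priority_kept = prioritized_sample(traces, latency_threshold_ms=500)

print("random sample kept:", random_kept)
print("prioritized sample kept:", priority_kept)
```

The random sample may or may not include the slow or failed traces, depending on the draw. The prioritized rule always keeps them, at the cost of having to inspect each trace before deciding.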
How can AWS help with your distributed tracing requirements?
AWS X-Ray is a distributed tracing platform that helps software developers trace user requests and identify bottlenecks in their cloud applications. Organizations use X-Ray to visualize application metrics and improve workload availability. With AWS X-Ray you can:
- Integrate with all applications running on Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), AWS Lambda, and AWS Elastic Beanstalk.
- Set the appropriate sampling rate to provide end-to-end visibility for the traces.
- Visualize aggregated data with a service map, displaying key metrics such as latency and failure rates.
Get started with distributed tracing on AWS by creating an account today.