
Ship and visualize your Istio virtual service traces with AWS X-Ray

Distributed tracing is a mechanism to derive runtime performance insights into a distributed system by tracking how requests flow through its components and capturing key performance indicators at call sites. It surfaces dynamic service dependency information along with traffic health information, helping service owners and cluster operators quickly visualize the application landscape, identify performance bottlenecks, and proactively address service issues.

However, despite the benefits that a tracing platform brings to the table, standing up a self-managed, production-ready stack for distributed tracing involves a good amount of engineering effort. At a minimum, a distributed tracing setup commonly involves the following steps:

  • Instrument application code, either manually or with the help of an instrumentation library, to propagate trace-related headers across inter-service calls.
  • Package and configure a tracing library to export traces from the application to a collector endpoint.
  • Deploy a collector that can receive the traces and forward them to a persistence layer.
  • Deploy a data store for trace persistence.
  • Deploy API and UI components to allow operators to query the persisted traces and view visualizations that surface actionable insights.

While application developers must always instrument their code to propagate the tracing context through headers, a service mesh like Istio makes their lives easier by embedding exporters for popular tracing backends in the service proxies, which reside alongside the application process and intercept all network communication. The proxies can be configured to automatically publish spans for all incoming and outgoing requests. The benefits of this approach are:

  • The application is responsible only for propagating the right context; the service mesh proxies take care of publishing the traces to the configured collector endpoint.
  • The service mesh administrator can apply the trace agent-related configuration in a single place as part of the service mesh control plane setup.

Even then, standing up the persistence and UI visualization layers for a popular open source tracing backend like OpenZipkin requires deploying and fine-tuning a data store and Spark dependencies, which consumes a lot of DevOps bandwidth.

Customers migrating from self-managed deployments to AWS often look for ways to offload much of this undifferentiated engineering heavy lifting by moving to a fully managed solution. For distributed tracing, AWS X-Ray provides customers a fully managed experience, freeing DevOps teams to focus on innovation rather than busy work.

In addition to distributed tracing, any non-trivial production system also has to deal with other telemetry concerns like metrics and log collection. Often these solutions have their own agent/collector implementations, leading to a proliferation of agents and collectors deployed in the operating environment, each with its own specifications and protocol requirements that may not be easily ported to other backends. Think of a production environment where metrics are ingested by Prometheus, logs are shipped to Elasticsearch, and traces are sent to OpenZipkin. To introduce CloudWatch Logs based EMF custom metrics ingestion, yet another agent needs to be added to the already complex mix. This is the core problem that the OpenTelemetry project, an open source initiative under the Cloud Native Computing Foundation (CNCF), attempts to solve by providing a standardized specification and reference implementations for telemetry data generation, processing, and ingestion. The OpenTelemetry Collector subproject provides a single, unified, vendor-agnostic collector architecture that can receive standardized metrics, logs, and traces from multiple sources, convert/enrich the received data, and export it to multiple backends, reducing deployment complexity.

This post shows how customers can migrate their existing OpenZipkin-instrumented applications running on Istio to AWS and enjoy the managed experience of AWS X-Ray. The solution leverages the flexibility offered by the OpenTelemetry Collector project to convert Zipkin-formatted spans to AWS X-Ray segments.

High-level architecture

Time to read: 12 minutes
Time to complete: 45 minutes
Cost to complete (estimated): $0.686
Learning level: Expert (400)
Services used:

  • Amazon Elastic Kubernetes Service (EKS)
  • Amazon Elastic Container Registry (ECR)
  • AWS X-Ray
  • Amazon EC2
  • AWS CloudFormation
  • AWS IAM

Overview of solution

The collector architecture comprises pluggable receivers, processors, and exporters that can be combined to form pipelines. A metric/trace event enters the pipeline through a receiver, goes through zero or more processors, and finally exits the pipeline through one or more exporters. The receivers implement the wire protocol of the source data format. For example, the zipkin receiver in the opentelemetry-collector project exposes an endpoint to decode JSON-formatted Zipkin v1 or v2 spans into the internal OpenTelemetry representation. The exporters, on the other hand, implement the wire transport protocol for the target backend. The collector repository hosts exporters for common open source backends like Zipkin, Jaeger, Prometheus, Kafka, and others. It also provides exporters for open formats like OpenCensus and OTLP. The opentelemetry-collector-contrib repository is a companion repository for commercial vendor implementations. For example, the awsxrayexporter module in the Contrib repo translates the internal trace spans from the pipeline to AWS X-Ray segment documents and posts the documents to a configured AWS X-Ray endpoint.

Collector pipeline architecture

The architecture allows new components to be linked into the collector binary by referencing vendor implementations from the Contrib project. A collector image can be deployed as a standalone Kubernetes deployment with one or more pipeline configurations connecting any number of receivers with any number of backends.

The solution involves deploying a custom collector binary in an EKS cluster where Istio will be set up to publish Zipkin spans to the collector endpoint. The collector will be configured with a single pipeline definition to ingest the spans and export to AWS X-Ray. A logging exporter will also be configured to easily view the internal representation of the spans in the collector log.

Custom Collector Pipeline Configuration

The Zipkin tracer built into Istio proxy as of this writing (Istio version 1.7.4) does not conform to the OpenTelemetry semantic conventions and thus requires translation of some of the span attributes to OpenTelemetry resource attributes so that AWS X-Ray can correctly identify the attributes. The pipeline used in the solution sets up the zipkin receiver module from the opentelemetry-collector repository and the awsxrayexporter module from the opentelemetry-collector-contrib repository. The opentelemetry-collector-contrib repository was forked and a new istio-build branch was created in the forked repository to add and test the translations. The collector binary build project maintained in the otelcol-custom-istio-awsxray repository pulls in these custom translations from the forked repository.

Under the hood

Collector Contrib customizations

The trace ID mapping logic in the awsxrayexporter module has been updated by introducing an in-memory cache for the Trace ID field. AWS X-Ray requires the trace_id in segment documents to embed the epoch time, in seconds, at which the trace started. The Zipkin tracer in Istio, on the other hand, expects the traceId field to be a randomly generated, 128-bit, unique identifier. The default handling checks whether the high-order 32 bits fall within the allowed epoch window and maps them to the epoch section of the X-Ray trace ID. If the high-order 32 bits are outside the allowed window, the exporter drops the span. The enhancement takes a trace ID whose epoch falls outside the valid window and updates it with an epoch calculated from the start time of the segment. This epoch is also cached for later segments in the same trace. If the segment start time is missing, the current epoch is taken and cached for later lookup. The TTL of the cache defaults to 60 seconds but can be configured through the exporter configuration, as is done in the demo deployment.

Trace ID Epoch Validation Flow - Before

Trace ID Epoch Validation Flow - After
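The following is a minimal Go sketch of the epoch-normalization idea described above. It deliberately uses plain types instead of the collector's internal data model, and the names (epochCache, toXRayTraceID) are illustrative rather than the actual identifiers in the forked awsxrayexporter; only the X-Ray trace ID format (1-&lt;epoch&gt;-&lt;96-bit random&gt;) and the fallback/caching behavior described above are taken from the post.

    // Illustrative sketch (not the forked exporter's actual code) of fixing up
    // a randomly generated 128-bit trace ID so that its high-order 32 bits carry
    // a plausible epoch, as AWS X-Ray requires.
    package main

    import (
        "encoding/binary"
        "fmt"
        "sync"
        "time"
    )

    // epochCache is a minimal TTL cache keyed by the original trace ID, so that
    // later spans of the same trace map to the same X-Ray trace ID.
    type epochCache struct {
        mu      sync.Mutex
        ttl     time.Duration
        entries map[[16]byte]cachedEpoch
    }

    type cachedEpoch struct {
        epoch   uint32
        expires time.Time
    }

    func newEpochCache(ttl time.Duration) *epochCache {
        return &epochCache{ttl: ttl, entries: make(map[[16]byte]cachedEpoch)}
    }

    func (c *epochCache) lookupOrStore(traceID [16]byte, epoch uint32) uint32 {
        c.mu.Lock()
        defer c.mu.Unlock()
        if e, ok := c.entries[traceID]; ok && time.Now().Before(e.expires) {
            return e.epoch
        }
        c.entries[traceID] = cachedEpoch{epoch: epoch, expires: time.Now().Add(c.ttl)}
        return epoch
    }

    // toXRayTraceID renders an X-Ray style trace ID ("1-<epoch hex>-<96-bit random hex>").
    // If the embedded epoch is in the future or older than maxAge, it is replaced with
    // an epoch derived from the segment start time (or "now" when the start time is
    // missing) and cached for subsequent spans of the same trace.
    func toXRayTraceID(traceID [16]byte, segmentStart time.Time, cache *epochCache, maxAge time.Duration) string {
        epoch := binary.BigEndian.Uint32(traceID[0:4])
        now := time.Now()
        if t := time.Unix(int64(epoch), 0); t.After(now) || t.Before(now.Add(-maxAge)) {
            fallback := segmentStart
            if fallback.IsZero() {
                fallback = now
            }
            epoch = cache.lookupOrStore(traceID, uint32(fallback.Unix()))
        }
        return fmt.Sprintf("1-%08x-%x", epoch, traceID[4:16])
    }

    func main() {
        cache := newEpochCache(60 * time.Second)
        var id [16]byte
        copy(id[:], []byte{0xde, 0xad, 0xbe, 0xef, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12})
        // 0xdeadbeef decodes to a far-future epoch, so the sketch falls back to the
        // segment start time and caches it for the rest of the trace.
        fmt.Println(toXRayTraceID(id, time.Now(), cache, 30*24*time.Hour))
    }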

The configuration of the exporter has been updated to allow setting a cache provider. The current implementation uses a simple TTL-based, in-process, thread-safe cache, which mandates a single instance of the collector to avoid cache sync issues. To allow multiple collector instances to load balance telemetry ingestion in a large cluster, the implementation can be extended to introduce a clustered cache, either in-process or external, such as Amazon ElastiCache for Redis.

Exporter Configuration
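To make that extension point concrete, here is a hypothetical cache-provider interface in Go. Neither the interface nor its method names come from the forked exporter; it only illustrates how the in-process TTL cache could be swapped for a clustered implementation without changing the trace ID mapping logic.

    // Hypothetical abstraction over the trace-ID-to-epoch cache. A single-instance
    // deployment can back it with the in-process TTL cache, while a multi-instance
    // deployment could plug in a clustered implementation (for example, one backed
    // by Amazon ElastiCache for Redis).
    package cache

    import "time"

    type EpochCache interface {
        // Get returns the cached epoch for the original trace ID, if present and not expired.
        Get(traceID [16]byte) (epoch uint32, ok bool)
        // Put stores the epoch chosen for the trace ID with the given TTL.
        Put(traceID [16]byte, epoch uint32, ttl time.Duration)
    }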

The next set of changes relates to the upstream_cluster attribute of a span. The spans generated by the Istio proxy always have the span kind set to CLIENT, irrespective of the direction of traffic. This confuses the exporter and the X-Ray service, since they expect spans for incoming traffic to have the span kind set to SERVER. (This span kind behavior can potentially confuse other backends as well, and the custom collector allows working around such bugs until fixes can be released upstream.) In our case, this is resolved by introspecting the upstream_cluster attribute. This attribute captures whether the span is for incoming or outgoing traffic, along with other details like the service port and the remote or local service FQDN. For example, the upstream_cluster attribute value for egress traffic from a client to an upstream service will look like outbound|9080||productpage.default.svc.cluster.local. On the server side, the corresponding upstream_cluster attribute value for the ingress traffic will look like inbound|9080|http|productpage.default.svc.cluster.local.

 

Span kind translation
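The translation itself boils down to looking at the prefix of the upstream_cluster value. The Go snippet below is a simplified illustration of that logic using plain strings and a stand-in SpanKind type; it is not the collector's internal API, and only the attribute format shown above is taken from Istio.

    // Illustrative sketch (not the actual exporter code) of inferring the span
    // kind from the Istio proxy's upstream_cluster attribute, whose value looks
    // like "inbound|9080|http|productpage.default.svc.cluster.local" for ingress
    // traffic and "outbound|9080||productpage.default.svc.cluster.local" for egress.
    package main

    import (
        "fmt"
        "strings"
    )

    type SpanKind int

    const (
        SpanKindUnspecified SpanKind = iota
        SpanKindServer
        SpanKindClient
    )

    // kindFromUpstreamCluster overrides the CLIENT kind reported by the proxy
    // when the upstream_cluster attribute shows the traffic was inbound.
    func kindFromUpstreamCluster(upstreamCluster string, reported SpanKind) SpanKind {
        switch strings.ToLower(strings.SplitN(upstreamCluster, "|", 2)[0]) {
        case "inbound":
            return SpanKindServer
        case "outbound":
            return SpanKindClient
        default:
            return reported
        }
    }

    func main() {
        fmt.Println(kindFromUpstreamCluster("inbound|9080|http|productpage.default.svc.cluster.local", SpanKindClient) == SpanKindServer)  // prints true
        fmt.Println(kindFromUpstreamCluster("outbound|9080||productpage.default.svc.cluster.local", SpanKindClient) == SpanKindClient)     // prints true
    }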

For more details about the changes, please refer to the istio-build branch of the forked repository.

Collector builder

The custom collector binary is built through the otelcol-custom-istio-awsxray project, which is modeled on the instructions available here. The project pulls in the awsxrayexporter module from the collector Contrib repo, but instead of directly referencing the upstream version of the module, it replaces the module reference with a locally checked-out version from the istio-build branch.

Local replace in go.mod
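A replace directive of roughly the following shape points Go module resolution at the local copy that the build script checks out. The local path shown is an assumption for illustration only; refer to the go.mod in the otelcol-custom-istio-awsxray repository for the exact entry.

    // go.mod (excerpt, illustrative): redirect the awsxrayexporter module to the
    // locally copied istio-build sources instead of the upstream release.
    replace github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsxrayexporter => ./exporter/awsxrayexporter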

Add exporter in components.go
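A components.go along these lines registers the customized exporter next to the collector's default factories. Package paths and types have shifted between collector releases, so treat this as a sketch of the pattern rather than a copy of the repository's file; the generated main.go is what actually calls components().

    // Sketch of a components.go in the spirit of the custom collector builder
    // instructions. Exact package paths and types vary between collector
    // releases; the generated main.go builds the collector application from
    // the factories returned here.
    package main

    import (
        "go.opentelemetry.io/collector/component"
        "go.opentelemetry.io/collector/service/defaultcomponents"

        "github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsxrayexporter"
    )

    func components() (component.Factories, error) {
        // Start from the factories bundled with the core collector
        // (including the zipkin receiver and the logging exporter).
        factories, err := defaultcomponents.Components()
        if err != nil {
            return component.Factories{}, err
        }

        // Register the locally replaced AWS X-Ray exporter factory.
        xray := awsxrayexporter.NewFactory()
        factories.Exporters[xray.Type()] = xray

        return factories, nil
    }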

Once the image is built, tagged, and pushed to a remote registry of your choice, deploy the collector, install Istio, and deploy a sample application to see the traces flowing into AWS X-Ray.

Tutorial

Prerequisites

For this tutorial, you should have the following prerequisites:

  • An AWS account
  • An IAM user
  • A workstation with
    • AWS CLI v2 to interact with AWS services like ECR and EKS
    • git to clone the solution repository
    • Docker engine to build the collector image
    • kubectl compatible with the server version to interact with the EKS control plane and deploy the collector and Istio service mesh
  • The tutorial assumes basic familiarity with Istio service mesh, distributed tracing concepts, and AWS X-Ray

Create an EKS cluster

In this section, we are going to set up IAM and create an EKS cluster.

  1. Log in to IAM console.
  2. Create a role that will be used to launch the CloudFormation template. The role will be granted admin access to the EKS cluster that will be created by the CloudFormation template. For this demo, a role named admin with AdministratorAccess permissions is created. Note that the Require MFA option is enabled for the new role. Create role: enter account ID. Create role: attach permission policy
  3. Once the role is created, update your CLI profile (~/.aws/credentials) to refer to the new role. If MFA is required, also add the MFA device serial of the IAM user that will be used to assume the role. IAM role configuration in CLI credentials file
  4. Assume the new role from the console. Switch role dropdown. Select account and role. Selected role displayed in nav bar
  5. Clone the otelcol-custom-istio-awsxray repository in a local workspace.
    cd ~
    mkdir -p github.com/iamsouravin
    cd github.com/iamsouravin
    git clone https://github.com/iamsouravin/otelcol-custom-istio-awsxray.git
  6. Navigate to the CloudFormation console.
  7. Choose Create Stack. CloudFormation create stack button
  8. On the Create stack page, choose Upload a template file in the Specify template section and click the Choose file button. Create stack page
  9. Select the docs/examples/base-cfn.yaml template file from the cloned workspace. Choose template file
  10. Enter a name for the stack. For this demo, I have used otelcol-custom-istio-awsxray-demo. Review and/or update the parameters. Stack details page
  11. On the Configure stack options page, leave the default values.
  12. On the Review page, review the selections and acknowledge that the stack may create IAM resources. Stack review page: acknowledge IAM resources creation
  13. Click the Create stack button to kick off the stack creation.
  14. This will be a good time to take a break and grab a warm cup of your favorite beverage.
  15. Once the stack is created, switch to the terminal and update your local kubeconfig file to connect to the cluster.
    aws eks update-kubeconfig --name tracing-cluster --region us-east-1 --profile admin-role

    Update local kubeconfig

  16. Verify that you are able to connect to the cluster and the nodes are ready.
    kubectl cluster-info
    kubectl get nodes

    Check cluster access

Build the custom collector binary

In this step, we are going to build the custom collector binary.

  1. The build script (bin/build.sh) clones the forked collector Contrib repo locally and copies the awsxrayexporter module into the project directory before kicking off the Docker build. Launch the build with the tag (-t) argument. Build script help. Build output. The time taken for the build depends on multiple factors like internet bandwidth, build machine cores, volume type (SSD or HDD), etc.

Install and test

In this step, we are going to install the custom collector binary, Istio, and a sample app to test the end-to-end integration.

  1. Open the generated manifest file (docs/examples/otelcol-custom-istio-awsxray-manifest.yaml) to verify that the image tag refers to the newly built image.
  2. Log in to the remote Docker registry and push the image. Push image to ECR
  3. The sampling rate for traces generated by the proxies is defaulted to 100 (that is, 100 percent) for demo purposes. The configuration can be found in docs/examples/tracing.yaml. This is not meant for production deployments; typically, operators will set the sampling rate to a low value like 1% in production. Set trace sampling rate
  4. Run the installer script (bin/install-components.sh) to
    • install the collector in the cluster
    • download and install the 1.7.4 release of Istio
    • label the default namespace to enable automatic proxy injection
    • install and expose the Bookinfo app from the Istio samples directory.
  5. Get the gateway URL of /productpage from the script output. Launch install script
  6. Open the product page URL in a browser and refresh a number of times. You should notice that the review section switches across three different versions. Also, try logging in and out with dummy user credentials and repeat for a while.
  7. If the page does not load in the browser, then the ELB for the ingress gateway may still be registering the target EC2 instances. Either wait for a few minutes or track the registration progress in the Load Balancers view in the EC2 console. ELB target status unhealthyELB target status healthy
  8. Navigate to the AWS X-Ray console to review the service map visualization. AWS X-Ray service map visualization
  9. Open the traces pane to drill down into the individual segments. AWS X-Ray tracesAWS X-Ray trace details
  10. You can also exercise the API endpoints exposed by the Bookinfo app using a client like Postman or plain curl and see the trace links update in the service map.
    1. http://{{ISTIO_INGRESS_HOST}}/api/v1/products
    2. http://{{ISTIO_INGRESS_HOST}}/api/v1/products/0
    3. http://{{ISTIO_INGRESS_HOST}}/api/v1/products/0/reviews
    4. http://{{ISTIO_INGRESS_HOST}}/api/v1/products/0/ratings

Updated service map visualization

Cleaning up

To avoid incurring future charges, delete all the resources.

  1. Run the uninstallation script (bin/uninstall-components.sh) to uninstall the components installed by the installer script.
  2. ECR will not allow deletion of a non-empty repository. Delete the image that was pushed to the otelcol-custom-istio-awsxray repository.
    aws ecr batch-delete-image --repository-name otelcol-custom-istio-awsxray --image-ids imageTag=0.1.0 --region us-east-1
  3. Finally, clean up the remaining resources by triggering Delete stack on the CloudFormation console.

Conclusion

This post gives customers an additional option to leverage AWS X-Ray as a managed backend for traces generated from an Istio service mesh on AWS. The solution demonstrates how the OpenTelemetry Collector project opens up new integration possibilities and gives DevOps teams full flexibility to choose and customize the parts based on their organization's specific requirements. As the community moves toward a standards-based approach to collecting telemetry data, we hope to see better alignment among different products to support more seamless integration, obviating the need for much of the customization demonstrated in this blog. We are excited to showcase this integration and can't wait to see how customers will leverage and extend this solution.