AWS Cloud Operations & Migrations Blog

Actionable Insights based on anomaly detection in AWS X-Ray

Today, we launched the public preview of X-Ray Insights, a new feature of AWS X-Ray that uses anomaly detection to create actionable insights about anomalies in your application. AWS X-Ray helps developers analyze and debug distributed applications. With this launch, you can proactively identify issues in your applications caused by increases in the fault rate, determine the root cause of an issue, and understand its impact on your end users. You can also view the incident timeline to understand when the issue started and how it progressed.

Developers spend a considerable amount of time searching through application logs, service logs, metrics, and traces to understand performance bottlenecks and pinpoint their root causes. A common follow-up task is to correlate data collected across metrics, logs, and traces to identify which end users are impacted. This comes with its own challenges: the data has to be mined and analyzed after the event has occurred. X-Ray Insights automatically analyzes attributes such as HTTP status code, URL, and other resource and service parameters across sets of traces, and provides actionable insights that help developers identify the root cause and address the issue in minutes. Developers can use these insights to answer questions like “What is the underlying issue?”, “Which service is the cause and how does it affect other services?”, and “Which customers are impacted and by how much?” without having to mine and manually analyze large data sets.

X-Ray Insights Feature Set and Use Cases

Let’s walk through some of the key features in AWS X-Ray Insights.

List of Insights 

This section lists all active and closed insights, so you can quickly identify which insights need attention, the current impact of each issue, and how long it has been going on. It also provides tools to filter and isolate past issues for triage analysis and for creating cause-of-error reports. Insights are generated per X-Ray Group, which makes it easy to identify issues originating from a subsection of an application.

X-Ray Insights main screen showing findings from traces
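The same list can also be retrieved programmatically. The sketch below is a minimal example, assuming the boto3 X-Ray client exposes a `get_insight_summaries` operation that returns an `InsightSummaries` list with `InsightId`, `State`, `StartTime`, and `Summary` fields. Because the Insights APIs are in preview, treat those field names as assumptions. The helper names `fetch_insight_summaries` and `active_insights_by_age` are hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta, timezone


def fetch_insight_summaries(group_name, hours=24):
    """Fetch the first page of insight summaries for one X-Ray group.

    Assumes boto3 is installed and AWS credentials are configured.
    Pagination via NextToken is omitted to keep the sketch short.
    """
    import boto3  # imported here so the helper below works without boto3

    client = boto3.client("xray")
    end = datetime.now(timezone.utc)
    response = client.get_insight_summaries(
        GroupName=group_name,
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        States=["ACTIVE", "CLOSED"],
    )
    return response.get("InsightSummaries", [])


def active_insights_by_age(summaries, now):
    """Return (insight_id, age, summary) rows for active insights, oldest first."""
    rows = [
        (s["InsightId"], now - s["StartTime"], s.get("Summary", ""))
        for s in summaries
        if s.get("State") == "ACTIVE"
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Sorting by age puts the longest-running active insights first, which mirrors the "how long has this been going on" question the list view is meant to answer.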

Insight Overview

Insight overview screenshot

Summary

When you select an insight to triage, you see the Insight overview tab, which provides a summary of the insight. Here you can quickly see the root cause of the issue, the services that are impacted or exhibiting anomalous behavior, and the percentage of requests affected, both for the root cause service and for the entire X-Ray Group. Using this information, you can decide whether you need to dive deeper into this insight or whether it is not a cause for concern. You can also analyze the insight further in X-Ray Analytics by clicking the “Analyze Insight” button. There you can view traces related to the incident and determine the business metrics associated with the issue, along with other parameters such as user agent, client IP, HTTP method, and URL.

Anomalous Services

In this section, you can view the top anomalous services and determine why anomalies are triggered. Using these graphs, you can view the incident window (total time from when we detect an issue until the issue is closed), and determine when the actual fault rate breaches the prediction bands.

Anomalous services graph showing service behavior over a period of time
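To make the idea of a fault rate "breaching the prediction bands" concrete, here is a small sketch that scans a time series and reports the spans where the actual value sits above the upper band. This is not X-Ray's detection algorithm, which this post does not describe. The data shape, the band values, and the function name `breach_windows` are all hypothetical.

```python
def breach_windows(points):
    """Find contiguous runs where the actual fault rate exceeds the upper band.

    `points` is a time-ordered list of (timestamp, actual, upper_band) tuples.
    Returns a list of (first_timestamp, last_timestamp) pairs, one per run.
    """
    windows = []
    start = last = None
    for timestamp, actual, upper_band in points:
        if actual > upper_band:
            if start is None:
                start = timestamp  # a new run begins here
            last = timestamp
        elif start is not None:
            windows.append((start, last))  # the run ended at the previous point
            start = last = None
    if start is not None:  # the series ended while still above the band
        windows.append((start, last))
    return windows
```

Each returned pair corresponds to one stretch of the graph where the plotted fault rate would be drawn outside the band.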

Root Cause

The Root cause section shows the incident map with all the services impacted by this anomaly. On the map, you can see the service identified as the probable root cause of the issue, along with any anomalous services, which are tagged with “Anomaly”. You can use the incident map to perform service impact analysis and visualize how services are impacted during an incident. You can also dive deep into the problem by analyzing the traces in X-Ray Analytics: click the “View root cause details” link in the Root cause section.

Service graph showing potential root cause identified by Insights

Client Impact

The Client Impact graph helps you understand the end-user experience for the total duration of the Insight. This graph provides information about the percentage of the requests for the X-Ray Group that resulted in error, fault, throttle, or success during the incident.

Client Impact graph showing the user experience during the insight period
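The percentages in this graph are simple shares of the total request count. As a worked illustration, the sketch below converts per-outcome request counts into percentages. The counts and the function name `impact_percentages` are hypothetical, not taken from an X-Ray response.

```python
def impact_percentages(counts):
    """Convert request counts per outcome into percentages of all requests.

    `counts` maps an outcome name to a request count, for example
    {"success": 900, "error": 50, "fault": 40, "throttle": 10}.
    """
    total = sum(counts.values())
    if total == 0:
        # No traffic in the window: report zero for every outcome.
        return {outcome: 0.0 for outcome in counts}
    return {outcome: 100.0 * n / total for outcome, n in counts.items()}
```

With the example counts in the docstring, 1,000 requests in total, faults account for 4 percent of requests and throttles for 1 percent.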

Inspect

The Inspect tab can be used to understand the progression of the incident from the point it started to the time it was closed. In X-Ray Insights, we continuously monitor the ongoing incident and periodically create events indicating the current state of the Insight.

Insights Timeline: The Insights timeline provides a set of events from the time the issue started, through how it regressed or improved, to the time it was closed. This lets you see events at various time intervals, along with the impact on requests to the root cause service and across the entire X-Ray Group. You can select individual events in the timeline to refresh the incident map and see which services were involved in the issue and how severely each was impacted.

Insights timeline showing sequence of insights
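The timeline can be approximated outside the console as well. The sketch below assumes the boto3 X-Ray client exposes `get_insight_events`, returning an `InsightEvents` list in which each event carries an `EventTime` and a `ClientRequestImpactStatistics` dictionary with `FaultCount` and `TotalCount`. These names reflect the preview API as understood here and may change. The helpers `fetch_insight_events`, `fault_percent`, and `timeline` are hypothetical.

```python
def fetch_insight_events(insight_id):
    """Fetch the first page of events for one insight.

    Assumes boto3 is installed and AWS credentials are configured.
    """
    import boto3  # imported here so the helpers below work without boto3

    client = boto3.client("xray")
    return client.get_insight_events(InsightId=insight_id).get("InsightEvents", [])


def fault_percent(stats):
    """Percentage of requests that faulted, or 0.0 when there was no traffic."""
    total = stats.get("TotalCount", 0)
    return 100.0 * stats.get("FaultCount", 0) / total if total else 0.0


def timeline(events):
    """Return (event_time, group_fault_pct, trend) rows in chronological order.

    `trend` compares each event with the previous one and is one of
    "started", "regressed", "improved", or "unchanged".
    """
    rows = []
    previous = None
    for event in sorted(events, key=lambda e: e["EventTime"]):
        pct = fault_percent(event.get("ClientRequestImpactStatistics", {}))
        if previous is None:
            trend = "started"
        elif pct > previous:
            trend = "regressed"
        elif pct < previous:
            trend = "improved"
        else:
            trend = "unchanged"
        rows.append((event["EventTime"], pct, trend))
        previous = pct
    return rows
```

Comparing each event against the one before it is one simple way to express "regressed or improved"; the console may well use a different definition.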

Getting started with X-Ray Insights

To get started with AWS X-Ray Insights, visit the AWS Management Console for X-Ray. No additional instrumentation is needed to use X-Ray Insights. Once your application is instrumented with the X-Ray SDK, you can start using X-Ray Insights by enabling it on your X-Ray Groups, including the Default group. AWS X-Ray then runs the anomaly detection algorithm on incoming traces to generate insights.

Enabling X-Ray Insights when creating a group
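Enabling Insights on a group can also be scripted. The following is a minimal sketch, assuming the boto3 X-Ray client's `create_group` operation accepts an `InsightsConfiguration` parameter with an `InsightsEnabled` flag. Since the APIs are in preview, verify the parameter names against the current documentation. The group name, the filter expression, and the helper names are hypothetical.

```python
def insights_group_request(group_name, filter_expression=None):
    """Build the keyword arguments for creating a group with Insights enabled."""
    kwargs = {
        "GroupName": group_name,
        # Assumed parameter shape; check the current API reference.
        "InsightsConfiguration": {"InsightsEnabled": True},
    }
    if filter_expression:
        kwargs["FilterExpression"] = filter_expression
    return kwargs


def create_group_with_insights(group_name, filter_expression=None):
    """Create an X-Ray group with Insights enabled.

    Assumes boto3 is installed and AWS credentials are configured.
    """
    import boto3  # imported here so the builder above works without boto3

    client = boto3.client("xray")
    return client.create_group(**insights_group_request(group_name, filter_expression))
```

For example, `create_group_with_insights("checkout", 'service("api.example.com")')` would create a hypothetical group scoped to one service. For a group that already exists, including the Default group, the same `InsightsConfiguration` should be accepted by the client's `update_group` operation.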

The X-Ray Insights functionality is available globally in all commercial regions. Visit our pricing page to learn about the cost of using X-Ray Insights.

AWS X-Ray Insights APIs for these new features are in preview and subject to change before general availability.

About the Authors

Nizar Tyrewalla is a Sr. Product Manager at AWS focused on monitoring distributed applications built using microservices architecture. He currently leads the distributed tracing service, AWS X-Ray, and the ingestion of observability data using open source tools and frameworks like OpenTelemetry.

Imaya Kumar Jagannathan is a Senior Solution Architect focused on Amazon CloudWatch and AWS X-Ray. He is passionate about monitoring and observability and has a strong application development and architecture background. He likes working on distributed systems and is excited to talk about microservice architecture design. He loves programming in C# and working with containers and serverless technologies.