## AWS Big Data Blog

# Using Random Cut Forests for real-time anomaly detection in Amazon OpenSearch Service

**September 8, 2021**: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details.

Anomaly detection is a rich field of machine learning. Many mathematical and statistical techniques have been used to discover outliers in data, and as a result, many algorithms have been developed for performing anomaly detection in a computational setting. In this post, we take a close look at the output and accuracy of the anomaly detection feature available in Amazon OpenSearch Service and Open Distro for Elasticsearch, and provide insight as to why we chose Random Cut Forests (RCF) as the core anomaly detection algorithm. In particular, we:

- Discuss the goals of anomaly detection
- Share how to use the RCF algorithm to detect anomalies and why we chose RCF for this tool
- Interpret the output of anomaly detection for Elasticsearch
- Compare the results of the anomaly detector to commonly used methods

## What is anomaly detection?

Human beings have excellent intuition and can detect when something is out of order. Often, an anomaly or outlier can appear so obvious you just “know it when you see it.” However, you can’t base computational approaches to anomaly detection on such intuition; you must found them on mathematical definitions of anomalies.

Mathematical definitions of an anomaly vary, but they typically address the notion of separation from normal observations. This separation can manifest in several ways. One common definition is “a data point lying in a low-density region.” As you track a data source, such as total bytes transferred from a particular IP address, number of logins on a given website, or number of sales per minute of a particular product, the raw values describe some probability or density distribution. A high-density region in this value distribution is an area of the domain where a data point is highly likely to exist. A low-density region is where data tends not to appear. For more information, see Anomaly detection: A survey.

For example, the following image shows two-dimensional data with a contour map indicating the density of the data in that region.

The data point in the bottom-right corner of the image occurs in a low-density region and, therefore, is considered anomalous. This doesn’t necessarily mean that an anomaly is something bad. Rather, under this definition, you can describe an anomaly as behavior that rarely occurs or is outside the normal scope of behavior.
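To make the low-density definition concrete, the following is a minimal sketch that estimates density with a Gaussian kernel density estimate. The function name `gaussian_kde`, the bandwidth, and the sample points are all illustrative assumptions; this is not how the anomaly detection feature estimates density.

```python
import math

def gaussian_kde(points, query, bandwidth=1.0):
    """Estimate density at `query` as the average of isotropic Gaussian kernels
    centered on each observed point. Illustrative only."""
    dim = len(query)
    norm = (2 * math.pi * bandwidth ** 2) ** (dim / 2)
    total = 0.0
    for p in points:
        sq_dist = sum((a - b) ** 2 for a, b in zip(p, query))
        total += math.exp(-sq_dist / (2 * bandwidth ** 2))
    return total / (len(points) * norm)

# Hypothetical two-dimensional observations clustered near the origin
observed = [(0.0, 0.0), (0.5, -0.2), (-0.3, 0.4), (0.2, 0.1), (-0.1, -0.5)]

dense_query = (0.1, 0.0)    # inside the cluster
sparse_query = (6.0, -6.0)  # far from the cluster, in a low-density region

density_dense = gaussian_kde(observed, dense_query)
density_sparse = gaussian_kde(observed, sparse_query)
```

Under this definition, `sparse_query` is the anomalous point because its estimated density is many orders of magnitude lower than that of `dense_query`.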

## Random Cut Forests and anomaly thresholding

The algorithmic core of the anomaly detection feature consists of two main components:

- An RCF model for estimating the density of an input data stream
- A thresholding model for determining if a point should be labeled as anomalous

You can use the RCF algorithm to summarize a data stream, including efficiently estimating its data density, and convert the data into *anomaly scores.* Anomaly scores are positive real numbers such that the larger the number, the more anomalous the data point. For more information, see Real Time Anomaly Detection in Open Distro for Elasticsearch.
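The intuition behind random cuts can be illustrated with a toy example: a point in a low-density region tends to be separated from the rest of the data by fewer random cuts. The sketch below uses average isolation depth as a simplified stand-in for an anomaly score. This is an assumption made for illustration; the actual RCF algorithm computes its score differently and maintains its trees over a stream. All function names and data here are hypothetical.

```python
import random

def isolation_depth(points, target, rng, depth=0, max_depth=50):
    """Count the random cuts needed to separate `target` from the other points."""
    if len(points) <= 1 or depth >= max_depth:
        return depth
    dims = len(target)
    lows = [min(p[d] for p in points) for d in range(dims)]
    ranges = [max(p[d] for p in points) - lows[d] for d in range(dims)]
    if sum(ranges) == 0:
        return depth  # all remaining points are identical
    # Pick the cut dimension with probability proportional to its range,
    # then pick the cut position uniformly within that range.
    d = rng.choices(range(dims), weights=ranges)[0]
    cut = rng.uniform(lows[d], lows[d] + ranges[d])
    # Keep only the side of the cut that contains the target.
    same_side = [p for p in points if (p[d] <= cut) == (target[d] <= cut)]
    return isolation_depth(same_side, target, rng, depth + 1, max_depth)

def average_depth(points, target, trees=100, seed=0):
    """Average the isolation depth over many independent random cut sequences."""
    rng = random.Random(seed)
    return sum(isolation_depth(points, target, rng) for _ in range(trees)) / trees

# Hypothetical data: a dense cluster plus one far-away point
data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, -8.0)
data = cluster + [outlier]

outlier_depth = average_depth(data, outlier)
typical_depth = average_depth(data, cluster[0])
```

In this toy setting, a smaller average depth corresponds to a more anomalous point, so `outlier_depth` comes out well below `typical_depth`.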

We chose RCF for this plugin for several reasons:

- **Streaming context** – Elasticsearch feature queries are streaming, in that the anomaly detector only receives each new feature aggregate one at a time.
- **Expensive queries** – Especially on a large cluster, each feature query may be costly in CPU and memory resources. This limits the amount of historical data we can obtain for model training and initialization.
- **Customer hardware** – Our anomaly detection plugin runs on the same hardware as our customers’ Elasticsearch cluster. Therefore, we must be mindful of our plugin’s CPU and memory impact.
- **Scalable** – It is preferred if you can distribute the work required to determine anomalous data across the nodes in the cluster.
- **Unlabeled data** – Even for training purposes, we don’t have access to labeled data. Therefore, the algorithm must be unsupervised.

Based on these constraints and performance results from internal and publicly available benchmarks across many data domains, we chose the RCF algorithm for computing anomaly scores in data streams.

But this raises the question: How large must an anomaly score be before we declare the corresponding data point an anomaly? The anomaly detector uses a thresholding model to answer this question. This thresholding model combines information from the anomaly scores observed thus far and certain mathematical properties of RCFs. This hybrid approach allows the model to make anomaly predictions with a low false positive rate when relatively little data has been observed, and to adapt effectively to the data in the long run. The model constructs an efficient sketch of the anomaly score distribution using the KLL Quantile Sketch algorithm. For more information, see Optimal Quantile Approximation in Streams.
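As a rough illustration of score thresholding, the following sketch flags a score when it exceeds a high quantile of the scores seen so far. It makes several simplifying assumptions: it stores every score exactly in a sorted list rather than using a KLL sketch, it ignores the RCF-specific properties the real thresholding model also uses, and the class name, the 0.99 quantile, and the `min_samples` warm-up are hypothetical choices.

```python
import bisect

class QuantileThreshold:
    """Track observed anomaly scores and flag those above a chosen quantile.

    Assumption: every score is stored exactly. The thresholding model described
    in this post approximates the distribution with a KLL sketch instead.
    """

    def __init__(self, quantile=0.99, min_samples=30):
        self.quantile = quantile
        self.min_samples = min_samples
        self.scores = []  # kept sorted

    def threshold(self):
        """Return the current score threshold, or None during warm-up."""
        if len(self.scores) < self.min_samples:
            return None
        index = min(int(self.quantile * len(self.scores)), len(self.scores) - 1)
        return self.scores[index]

    def update(self, score):
        """Return True if `score` exceeds the current threshold, then record it."""
        current = self.threshold()
        is_anomaly = current is not None and score > current
        bisect.insort(self.scores, score)
        return is_anomaly

# Hypothetical score stream: a stable baseline followed by one large score
detector = QuantileThreshold(quantile=0.99, min_samples=30)
baseline_flags = [detector.update(1.0 + 0.01 * (i % 10)) for i in range(100)]
spike_flag = detector.update(5.0)
```

Here none of the baseline scores are flagged, while the final score of 5.0 is. Storing every score costs memory proportional to the stream length, which is the cost a quantile sketch is designed to avoid.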

## Understanding the output

The anomaly detector outputs two values: an *anomaly grade* and a *confidence score.* The anomaly grade is a measurement of the severity of an anomaly on a scale from zero to one. A zero anomaly grade indicates that the corresponding data point is normal. Any non-zero grade means that the anomaly score output by RCF exceeds the calculated score threshold, and therefore indicates the presence of an anomaly. Using the mathematical definition introduced at the beginning of this post, the grade is inversely related to the anomaly’s density; that is, the rarer the event, the higher the corresponding anomaly grade.

The confidence score is a measurement of the probability that the anomaly detection model correctly reports an anomaly within the algorithm’s inherent error bounds. We derive the model confidence from three sources:

- A statistical measurement of whether the RCF model has observed enough data. As the RCF model observes more data, this source of confidence approaches 100%.
- A confidence upper bound that comes from the approximation made by the distribution sketch in the thresholding model. The KLL algorithm can only predict the score threshold within particular error bounds with a certain probability.
- A confidence measurement attributed to each node of the Elasticsearch cluster. If a node is lost, the corresponding model data is also lost, which leads to a temporary confidence drop.

## The NYC Taxi dataset

We demonstrate the effectiveness of the new anomaly detector feature on the New York City taxi passenger dataset. This data contains 6 months of taxi ridership volume from New York City aggregated into 30-minute windows. Thankfully, the dataset comes with labels in the form of *anomaly windows*, which indicate a period of time when an anomalous event is known to occur. Example known events in this dataset are the New York City marathon, where taxi ridership uncharacteristically spiked shortly after the event ended, and the January 2015 North American blizzard, when at one point the city ordered all non-essential vehicles off the streets, which resulted in a significant drop in taxi ridership.
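For readers who want to picture the aggregation step, here is a minimal sketch of bucketing individual events into 30-minute windows. The function names and the pickup timestamps are hypothetical, and the sketch assumes a window length that divides 60 minutes evenly; it is not the procedure used to build the dataset.

```python
from collections import Counter
from datetime import datetime, timedelta

def bucket_start(ts, minutes=30):
    """Round a timestamp down to the start of its window.
    Assumes `minutes` divides 60 evenly."""
    return ts - timedelta(
        minutes=ts.minute % minutes,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )

def ridership_per_window(pickup_times, minutes=30):
    """Count events per window, keyed by window start time."""
    return Counter(bucket_start(ts, minutes) for ts in pickup_times)

# Hypothetical pickup timestamps
pickups = [
    datetime(2014, 11, 2, 9, 5),
    datetime(2014, 11, 2, 9, 29, 59),
    datetime(2014, 11, 2, 9, 30),
    datetime(2014, 11, 2, 10, 15),
]
counts = ridership_per_window(pickups)
```

Each key in `counts` is the start of a 30-minute window, and each value is the number of events that fell inside it.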

We compare the results of our anomaly detector to two common approaches for detecting anomalies: a rules-based approach and the Gaussian distribution method.

### Rules-based approach

In a rules-based approach, you mark a data point as anomalous if it exceeds a preset, human-specified boundary. This approach requires significant domain knowledge of the incoming data and can break down if the data has any upward or downward trends.

The following graph is a plot of the NYC taxi dataset with known anomalous event periods indicated by shaded regions. The model’s anomaly detection output is shown in red below the taxi ridership values. A set of human labelers received the first month of data (1,500 data points) to define anomaly detection rules. Half of the participants responded by stating they didn’t have sufficient information to confidently define such rules. From the remaining responses, the consolidated rules for anomalies are either that the value is equal to or greater than 30,000, or the value is below 20,000 for 150 points (about three days).

For this particular range of data, the human-defined rules perform reasonably well. However, this approach to anomaly detection doesn’t scale well and may require a large amount of training data before a human can set reasonable thresholds that don’t suffer from a high false positive or high false negative rate. Additionally, as mentioned earlier, if this data develops an upward or downward trend, our team of annotators needs to revisit these constant-value thresholds.
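The consolidated rules can be expressed in a few lines. The sketch below assumes that “below 20,000 for 150 points” means 150 consecutive points, which is our reading rather than something the labelers specified; the function name and the sample series are hypothetical.

```python
def rule_based_flags(values, high=30_000, low=20_000, low_run=150):
    """Flag a point if it is at or above `high`, or if it extends a run of at
    least `low_run` consecutive points below `low` (assumed interpretation)."""
    flags = []
    run = 0  # length of the current run of values below `low`
    for v in values:
        run = run + 1 if v < low else 0
        flags.append(v >= high or run >= low_run)
    return flags

# Hypothetical ridership values: one spike, then a long stretch of low volume
series = [25_000] * 10 + [31_000] + [25_000] * 10 + [15_000] * 160
flags = rule_based_flags(series)
```

In this series, the spike at index 10 is flagged immediately, while the low-volume stretch is flagged only once it has lasted 150 points. The constants are hard-coded, which is exactly why a trend in the data forces the rules to be revisited.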

### Gaussian distribution method

A second common approach is to fit a Gaussian distribution to the data and define an anomaly as any value that is more than three standard deviations away from the mean. To improve the model’s ability to adapt to new information, the distribution is typically fit on a sliding window of the observations. Here, we determine the mean and standard deviation from the 1,500 most recent data points and use these to make predictions on the current value. See the following graph.

The Gaussian model detects the clear ridership spike at the marathon but isn’t robust enough to capture the other anomalies. In general, such a model can’t capture certain kinds of temporal anomalies where, for example, there’s a sudden spike in the data or other change in behavior that is still within the normal range of values.
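The sliding-window Gaussian rule described in this section can be sketched as follows. The function name and the sample series are hypothetical, and the mean and standard deviation are recomputed from scratch at every step for clarity rather than efficiency.

```python
import math
from collections import deque

def gaussian_flags(values, window=1500, num_std=3.0):
    """Flag a value if it lies more than `num_std` standard deviations from the
    mean of the preceding `window` values."""
    recent = deque(maxlen=window)
    flags = []
    for v in values:
        if len(recent) >= 2:
            mean = sum(recent) / len(recent)
            var = sum((x - mean) ** 2 for x in recent) / (len(recent) - 1)
            std = math.sqrt(var)
            flags.append(std > 0 and abs(v - mean) > num_std * std)
        else:
            flags.append(False)  # not enough history to estimate a spread
        recent.append(v)
    return flags

# Hypothetical series with a small window so the effect is easy to see
series = [10, 12, 11, 9, 10, 11, 12, 10, 9, 11, 40, 10]
flags = gaussian_flags(series, window=10)
```

Only the value 40 is flagged. A value that changes behavior but stays within three standard deviations of the recent mean would pass unnoticed, which matches the limitation noted above.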

### Anomaly detection tool

Finally, we look at the anomaly detection results from the anomaly detection tool. The taxi data is streamed into an RCF model that estimates the density of the data in real time. The RCF sends these anomaly scores to the thresholding model, which decides whether the corresponding data point is anomalous. If so, the model reports the severity of the anomaly in the anomaly grade. See the following graph.

Five of the seven known anomalous events are successfully detected with zero false positives. Furthermore, with our definition of anomaly grade, we can indicate which anomalies are more severe than others. For example, the NYC Marathon spike is much more severe than those of Labor Day and New Year’s Eve. Based on the definition of an anomaly in terms of data density, the behavior observed at the NYC Marathon lies in a very low-density region, whereas, by the time we see the New Year’s Eve spike, this kind of behavior is still rare, but less rare than before.

## Summary

In this post, you learned about the goals of anomaly detection and explored the details of the model and output of the anomaly detection feature, now available in Amazon OpenSearch Service and Open Distro for Elasticsearch. We also compared the results of the anomaly detection tool to two common models and observed considerable performance improvement.

### About the Authors

**Chris Swierczewski** is an applied scientist in Amazon AI. He enjoys hiking and backpacking with his family.

**Lai Jiang** is a software engineer working on machine learning and Elasticsearch at Amazon Web Services. His primary interests are algorithms and math. He is an active contributor to Open Distro for Elasticsearch.