AWS Open Source Blog
Introducing real-time anomaly detection in Open Distro for Elasticsearch
Real-time streaming applications are growing rapidly across industries such as finance, health, information technology, retail, and the Internet of Things (IoT). Organizations depend on log analytics solutions to detect aberrations in the data and identify critical situations. Examples include finding fraudulent behavior in financial transactions, discovering suspicious IP addresses accessing privileged resources, and identifying systems that cause delays during transaction processing. Most traditional analytical tools rely on pre-configured statistical thresholds to identify anomalies. However, these tools are not well suited for streaming applications, which typically exhibit dynamic patterns that require the anomaly detector to continuously process the changing data and output a decision in real time.
Open sourcing anomaly detection and Random Cut Forest
Today, we are excited to announce the preview release of our machine learning-based anomaly detection plugins for Open Distro for Elasticsearch. The anomaly detection feature was developed in collaboration between the Amazon Elasticsearch Service and AWS Machine Learning teams. In addition to open sourcing anomaly detection as part of Open Distro for Elasticsearch, we’re also open sourcing the underlying Random Cut Forest (RCF) libraries for the benefit of the greater data science community. RCF is focused on streaming use cases and has been proven in production use. Access to these libraries provides visibility into how anomaly decisions are made, and enables our users and the data science community to use, collaborate on, and contribute to RCF.
Open Distro for Elasticsearch anomaly detection has been designed to provide value to all developers and operators, regardless of their machine learning expertise. In Kibana, visualizations provide context on which data points contributed to an anomaly and why the event is an anomaly, and allow users to dive deep into the specific log data behind it. The plugin is also integrated with Open Distro for Elasticsearch Alerting to trigger notifications as the detector identifies anomalies.
Because Elasticsearch is used to index high volumes of data in a distributed fashion, we felt it essential to design the anomaly detection feature to be lightweight and highly reactive to changes in cluster resources while minimizing impact on application workloads. We have achieved this by distributing the computation of anomaly models across Elasticsearch nodes, which allows the system’s performance to scale with the cluster without requiring dedicated machine learning nodes.
Open Distro for Elasticsearch anomaly detection leverages Random Cut Forest (RCF), a proven algorithm built on years of academic research and used by AWS in multiple service offerings. RCF is an unsupervised algorithm for detecting anomalous data points within a data set. While many algorithms support batch-based techniques that periodically analyze data in time-based windows, RCF detects anomalies on live data and helps to identify issues as they evolve in real time. RCF works by constructing multiple decision trees on recency-based samples of data. It can incrementally update the samples and the trees every time a new input is added, without having to reconstruct the trees from scratch. This allows the algorithm to adapt to evolving distributions. RCF stores segments of behavior, also known as shingles, and checks for departures from previous patterns to flag them as anomalous, without making prior assumptions about the application. Because it detects anomalies in real-time streaming data and is domain-agnostic, RCF is a great algorithm for a wide range of log analytics applications.
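To make the ideas of shingles and recency-based samples more concrete, here is a minimal Python sketch. It is not the RCF library's code or API. The names `shingles` and `RecencyBiasedSample`, and the parameters `capacity` and `accept_prob`, are hypothetical and chosen only for illustration, and the sampling rule is a deliberately simplified stand-in for whatever sampler the library actually uses. The sketch shows only how a stream can be turned into overlapping segments and how a fixed-size sample can favor recent data; it does not build trees or compute anomaly scores.

```python
import random
from collections import deque
from typing import Iterable, Iterator, List, Tuple

Shingle = Tuple[float, ...]


def shingles(stream: Iterable[float], size: int) -> Iterator[Shingle]:
    """Yield overlapping windows of `size` consecutive values from a stream."""
    if size < 1:
        raise ValueError("size must be at least 1")
    window: deque = deque(maxlen=size)
    for value in stream:
        window.append(value)  # the oldest value drops out once the window is full
        if len(window) == size:
            yield tuple(window)


class RecencyBiasedSample:
    """Fixed-capacity sample that favors recent items.

    Illustrative assumption only: once the sample is full, each new item is
    accepted with probability `accept_prob` and overwrites a uniformly random
    stored item, so an older item's chance of still being present shrinks
    geometrically with every later arrival.
    """

    def __init__(self, capacity: int, accept_prob: float = 0.5, seed: int = 0):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.accept_prob = accept_prob
        self.items: List[Shingle] = []
        self._rng = random.Random(seed)  # seeded so runs are repeatable

    def add(self, item: Shingle) -> None:
        if len(self.items) < self.capacity:
            self.items.append(item)
        elif self._rng.random() < self.accept_prob:
            # Update in place: replace one stored item instead of rebuilding.
            self.items[self._rng.randrange(self.capacity)] = item


# Usage: shingle a short stream and feed the segments into the sample.
sample = RecencyBiasedSample(capacity=2)
for segment in shingles([1.0, 2.0, 3.0, 4.0, 5.0], size=3):
    sample.add(segment)
```

Each shingle becomes one multi-dimensional point, so a departure from an earlier pattern shows up as a point that sits far from the stored ones. Updating the sample in place, one item at a time, mirrors the incremental updates described above.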
For a deep dive into the Open Distro for Elasticsearch anomaly detection system design and RCF library, please read Real-time Anomaly Detection in Open Distro for Elasticsearch and Random Cut Forests.
Join the community and contribute to the project
Open Distro for Elasticsearch remains focused on driving innovation with value-added features to ensure that our community has an option that is fully open source. As our machine learning and engineering teams continue to explore and grow this area, we invite you to engage with us, share your use cases, and collaborate with us in this innovation.