Preprocess logs for anomaly detection in Amazon OpenSearch

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details.

Amazon OpenSearch Service supports real-time anomaly detection, which uses machine learning (ML) to proactively detect anomalies in real-time streaming data. When used to analyze application logs, it can detect anomalies such as unusually high error rates or sudden changes in the number of requests. For example, a sudden increase in the number of food delivery orders from a particular area could be due to weather changes or due to a technical glitch experienced by users from that area. The detection of such an anomaly can facilitate quick investigation and remediation of the situation.

The anomaly detection feature of Amazon OpenSearch Service uses the Random Cut Forest algorithm. This is an unsupervised algorithm that constructs decision trees from numeric input data points in order to detect outliers in the data. These outliers are regarded as anomalies. To detect anomalies in logs, we have to convert the text-based log files into numeric values so that they can be interpreted by this algorithm. In ML terminology, such conversion is commonly referred to as data preprocessing. There are several methods of data preprocessing. In this post, I explain some of these methods that are appropriate for logs.

To implement the methods described in this post, you need a log aggregation pipeline that ingests log files into an Amazon OpenSearch Service domain. For information about ingesting Apache web logs, see Send Apache Web Logs to Amazon OpenSearch Service with Kinesis Firehose. For a similar method for ingesting and analyzing Amazon Simple Storage Service (Amazon S3) server access logs, see Analyzing Amazon S3 server access logs using Amazon OpenSearch Service.

Now, let’s discuss some data preprocessing methods that we can use when dealing with complex structures within log files.

Log lines to JSON documents

Although log files are plain text, the log messages usually have some structure, with one log entry per line. As shown in the following image, a single line in a log file can be parsed and stored in an Amazon OpenSearch Service index as a document with multiple fields. This image is an example of how an entry in an Amazon S3 access log can be converted into a JSON document.

Although you can ingest JSON documents such as the one in the preceding image into Amazon OpenSearch Service as is, some of the text fields require further preprocessing before you can use them for anomaly detection.
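The parsing step described in this section could be sketched as follows. This is a minimal illustration, not a complete parser: the regular expression covers only the leading fields of an S3-style access log line, the sample line is made up for this example (with placeholder owner and requester values), and the function name parse_log_line is hypothetical.

```python
import re

# Pattern for the leading fields of an S3-style access log line.
# Assumption: fields are space-separated, the timestamp is in square
# brackets, and the request URI is in double quotes.
LOG_PATTERN = re.compile(
    r'(?P<bucket_owner>\S+) (?P<bucket>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'(?P<remote_ip>\S+) (?P<requester>\S+) (?P<request_id>\S+) '
    r'(?P<operation>\S+) (?P<key>\S+) "(?P<request_uri>[^"]*)" '
    r'(?P<http_status>\d{3}) (?P<error_code>\S+) (?P<bytes_sent>\S+)'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    doc = match.groupdict()
    doc["http_status"] = int(doc["http_status"])
    # "-" marks a missing value in the log; keep it as a string in that case.
    if doc["bytes_sent"].isdigit():
        doc["bytes_sent"] = int(doc["bytes_sent"])
    return doc

# Made-up sample line for illustration.
sample_line = (
    'EXAMPLEOWNER awsexamplebucket1 [06/Feb/2019:00:00:38 +0000] 192.0.2.3 '
    'EXAMPLEREQUESTER 3E57427F3EXAMPLE REST.GET.VERSIONING - '
    '"GET /awsexamplebucket1?versioning HTTP/1.1" 200 - 113 - 7 - "-" '
    '"S3Console/0.4" -'
)
document = parse_log_line(sample_line)
```

The resulting dictionary can be serialized to JSON and indexed. Lines that don't match the pattern return None so that the pipeline can skip or quarantine them.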

Text fields with nominal values

Let’s assume your application receives mostly GET requests and a much smaller number of POST requests. According to an OWASP security recommendation, it’s also advisable to disable TRACE and TRACK request methods because these can be misused for cross-site tracing. If you want to detect when unusual HTTP requests appear in your server logs, or when there is a sudden spike in the number of HTTP requests with methods that are normally a minority, you could do so by using the request_uri or operation fields in the preceding JSON document. These fields contain the HTTP request method, but you have to extract it and convert it into a numeric format that can be used for anomaly detection.

These are fields that have only a handful of different values, and those values don’t have any particular sequential order. If we simply convert HTTP methods to an ordered list of numbers, like GET = 1, POST = 2, and so on, we might confuse the anomaly detection algorithm into thinking that POST is somehow greater than GET, or that GET + GET equals POST. A better way to preprocess such fields is one-hot encoding. The idea is to convert the single text field into multiple binary fields, one for every possible value of the original text field. In our example, the result of this one-hot encoding is a set of nine binary fields. If the value of the field in the original log is HEAD, only the HEAD field in the preprocessed data has value 1, and all other fields are zero. The following table shows some examples.

Original Log Message	Preprocessed into multiple one-hot encoded fields
HTTP Request Method	GET	HEAD	POST	PUT	DELETE	CONNECT	OPTIONS	TRACE	PATCH
GET	1	0	0	0	0	0	0	0	0
POST	0	0	1	0	0	0	0	0	0
OPTIONS	0	0	0	0	0	0	1	0	0

The Amazon OpenSearch Service anomaly detection feature can then process these generated fields to detect changes in the pattern of HTTP requests received by your application, for example an unusually high number of DELETE requests.
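The one-hot encoding of HTTP methods described in this section could look like the following sketch. It assumes that the method is the first space-separated token of the request_uri field. The function name http_method_one_hot and the http_method_ field prefix are choices made for this illustration.

```python
# The nine HTTP methods used as one-hot encoded fields in this post.
HTTP_METHODS = ("GET", "HEAD", "POST", "PUT", "DELETE",
                "CONNECT", "OPTIONS", "TRACE", "PATCH")

def http_method_one_hot(request_uri):
    """Return nine 0/1 fields, one per HTTP method, for a request line.

    Assumption: the method is the first space-separated token,
    as in "GET /path HTTP/1.1". An unrecognized method (or "-")
    yields all zeros.
    """
    method = request_uri.split(" ", 1)[0].upper()
    return {
        "http_method_" + name.lower(): int(name == method)
        for name in HTTP_METHODS
    }

fields = http_method_one_hot("GET /awsexamplebucket1?versioning HTTP/1.1")
```

Returning a dictionary keyed by field name makes it easy to merge the generated fields into the log document, for example with log_entry.update(fields).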

Text fields with a large number of nominal values

Many log files contain HTTP response codes, error codes, or some other type of numeric codes. These codes don’t have any particular order, but the number of possible values is quite large. In such cases, one-hot encoding alone isn’t suitable because it can cause an explosion in the number of fields in the preprocessed data.

Take for example the HTTP response codes. The values are unordered, meaning that there is no particular reason for 200 being OK and 400 being Bad Request. 200 + 200 != 400 as far as HTTP response codes go. However, the number of possible values is quite large—more than 60. If we use the one-hot encoding technique, we end up creating more than 60 fields out of this 1 field, and it quickly becomes unmanageable.

However, based on our knowledge of HTTP status codes, we know that these codes are by definition binned into five ranges. Codes in the range 100–199 are informational responses, codes 200–299 indicate successful completion of the request, 300–399 are redirections, 400–499 are client errors, and 500–599 are server errors. We can take advantage of this knowledge and reduce the original values to five values, one for each range (1xx, 2xx, 3xx, 4xx and 5xx). Now this set of five possible values is easier to deal with. The values are purely nominal. Therefore, we can additionally one-hot encode these values as described in the previous section. The result after this binning and one-hot encoding process is something like the following table.

Original Log Message	Preprocessed into multiple fields after binning and one-hot encoding
HTTP Response Status Code	1xx	2xx	3xx	4xx	5xx
100 (Continue)	1	0	0	0	0
101 (Switching Protocols)	1	0	0	0	0
200 (OK)	0	1	0	0	0
202 (Accepted)	0	1	0	0	0
301 (Moved Permanently)	0	0	1	0	0
304 (Not Modified)	0	0	1	0	0
400 (Bad Request)	0	0	0	1	0
401 (Unauthorized)	0	0	0	1	0
404 (Not Found)	0	0	0	1	0
500 (Internal Server Error)	0	0	0	0	1
502 (Bad Gateway)	0	0	0	0	1
503 (Service Unavailable)	0	0	0	0	1

This preprocessed data is now suitable for use in anomaly detection. Spikes in 4xx errors or drops in 2xx responses might be especially important to detect.

The following Python code snippet shows how you can bin and one-hot encode HTTP response status codes:

def http_status_bin_one_hot_encoding(http_status):
    # returns a one-hot encoding based on the HTTP response status bin
    # bins are: 1xx, 2xx, 3xx, 4xx, 5xx
    if 100 <= http_status <= 199: # informational responses
        return (1, 0, 0, 0, 0)
    elif 200 <= http_status <= 299: # successful responses
        return (0, 1, 0, 0, 0)
    elif 300 <= http_status <= 399: # redirects
        return (0, 0, 1, 0, 0)
    elif 400 <= http_status <= 499: # client errors
        return (0, 0, 0, 1, 0)
    elif 500 <= http_status <= 599: # server errors
        return (0, 0, 0, 0, 1)
    # anything outside 100-599 falls in no bin, so all fields stay zero
    return (0, 0, 0, 0, 0)

# example values; in a real pipeline these come from the parsed log line
timestamp = "06/Feb/2019:00:00:38 +0000"
status = 200

http_1xx, http_2xx, http_3xx, http_4xx, http_5xx = http_status_bin_one_hot_encoding(status)

log_entry = {
    'timestamp': timestamp,
    'bucket': "somebucket",
    'key': "somekey",
    'operation': "REST.GET.VERSIONING",
    'request_uri': "GET /awsexamplebucket1?versioning HTTP/1.1",
    'status_code': status,
    'http_1xx': http_1xx,
    'http_2xx': http_2xx,
    'http_3xx': http_3xx,
    'http_4xx': http_4xx,
    'http_5xx': http_5xx,
    'error_code': "-",
    'bytes_sent': 113,
    'object_size': 0
}

Text fields with ordinal values

Some text fields in log files contain values that have a relative sequence. For example, a log level field might contain values like TRACE, DEBUG, INFO, WARN, ERROR, and FATAL. This is a sequence of increasing severity of the log message. As shown in the following table, these string values can be converted to numeric values in a way that retains this relative sequence.

Log Level (Original Log Message)	Preprocessed Log Level
TRACE	1
DEBUG	2
INFO	3
WARN	4
ERROR	5
FATAL	6
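
A minimal sketch of this ordinal mapping follows. The numeric values are the ones from the preceding table. The function name encode_log_level, and the choice to return None for unrecognized levels, are assumptions made for this illustration.

```python
# Ordinal values from the table above: higher means more severe.
LOG_LEVELS = {"TRACE": 1, "DEBUG": 2, "INFO": 3, "WARN": 4, "ERROR": 5, "FATAL": 6}

def encode_log_level(level):
    """Map a log level string to its ordinal value; None if not recognized."""
    return LOG_LEVELS.get(level.strip().upper())

warn_value = encode_log_level("warn")
```

Because the mapping preserves order, comparisons such as encode_log_level("FATAL") > encode_log_level("ERROR") remain meaningful after preprocessing.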

IP addresses

Log files often contain IP addresses, which can take a very large number of distinct values, and it doesn’t make sense to bin these values into numeric ranges the way we binned HTTP status codes earlier. However, these IP addresses might be of interest from a geolocation perspective. It might be important to detect an anomaly if an application starts getting accessed from an unusual geographic location. If geographic information like country or city code isn’t directly available in the logs, you can get this information by geolocating the IP addresses using third-party services. Effectively, this is a process of binning the large number of IP addresses into a considerably smaller number of country or city codes. Although these country and city codes are still nominal values, they can be used with the cardinality aggregation of Amazon OpenSearch Service.
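
The following sketch illustrates the idea of binning IP addresses into country codes. It does not call a real geolocation service. The lookup table, the numeric codes, and the function name country_code_for_ip are all hypothetical, and the networks are reserved documentation ranges. In practice, you would replace the table with the geolocation database or service of your choice.

```python
import ipaddress

# Hypothetical lookup table for illustration only: documentation
# networks mapped to made-up numeric country codes.
NETWORK_TO_COUNTRY = {
    ipaddress.ip_network("192.0.2.0/24"): 100,
    ipaddress.ip_network("198.51.100.0/24"): 200,
    ipaddress.ip_network("203.0.113.0/24"): 300,
}
UNKNOWN_COUNTRY = 0  # assumed placeholder for addresses we cannot locate

def country_code_for_ip(remote_ip):
    """Bin an IP address string into a numeric country code."""
    try:
        address = ipaddress.ip_address(remote_ip)
    except ValueError:
        # Not a valid IP address (for example "-" in the log).
        return UNKNOWN_COUNTRY
    for network, code in NETWORK_TO_COUNTRY.items():
        if address in network:
            return code
    return UNKNOWN_COUNTRY

country_code = country_code_for_ip("192.0.2.3")
```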

After we apply these preprocessing techniques to our example Amazon S3 server access logs, we get the resulting JSON log data:

{
    "bucket_owner": "", //string
    "bucket": "awsexamplebucket1", //string
    "timestamp": "06/Feb/2019:00:00:38 +0000",
    "remote_ip": "192.0.2.3", //string
    "country_code": 100, //numeric field generated during pre-processing
    "requester": "", //string
    "request_id": "3E57427F3EXAMPLE",
    "operation": "REST.GET.VERSIONING",
    "key": "-",
    "request_uri": "GET /awsexamplebucket1?versioning HTTP/1.1",
    "http_method_get": 1, //nine one-hot encoded fields generated during pre-processing
    "http_method_post": 0,
    "http_method_put": 0,
    "http_method_delete": 0,
    "http_method_head": 0,
    "http_method_connect": 0,
    "http_method_options": 0,
    "http_method_trace": 0,
    "http_method_patch": 0,
    "http_status": 200,
    "http_1xx": 0, //five one-hot encoded fields generated during pre-processing
    "http_2xx": 1,
    "http_3xx": 0,
    "http_4xx": 0,
    "http_5xx": 0,
    "error_code": "-",
    "bytes_sent": 113,
    "object_size": "-",
    "total_time": 7,
    "turn_around_time": "-",
    "referer": "-",
    "user_agent": "S3Console/0.4",
    "version_id": "-",
    "host_id": "", //string
    "signature_version": "SigV2",
    "cipher_suite": "ECDHE-RSA-AES128-GCM-SHA256",
    "authentication_type": "AuthHeader",
    "host_header": "awsexamplebucket1.s3.us-west-1.amazonaws.com",
    "tls_version": "TLSV1.1"
}

This data can now be ingested and indexed into an Amazon OpenSearch Service domain. After you set up the log preprocessing pipeline, the next thing to configure is an anomaly detector. Amazon OpenSearch Service anomaly detection allows you to specify up to five features (fields in your data) in a single anomaly detector. This means the anomaly detector can learn patterns in data based on the values of up to five fields.

Aggregations

You must specify an appropriate aggregation function for each feature. This is because the anomaly detector aggregates the values of all documents ingested in each detector interval to produce a single aggregate value, and then that value is used as the input to the algorithm that automatically learns the patterns in data. The following diagram depicts this process.
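
To make the per-interval aggregation concrete, here is a small local simulation of a sum aggregation. It only imitates the idea and is not how the service is implemented. The function name aggregate_sum, the one-minute interval, and the sample documents are made up for this illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone

def aggregate_sum(documents, field, interval_seconds):
    """Sum `field` over all documents that fall in each interval.

    Returns a dict mapping the interval start (epoch seconds) to the sum.
    """
    totals = defaultdict(int)
    for doc in documents:
        epoch = int(doc["timestamp"].timestamp())
        interval_start = epoch - (epoch % interval_seconds)
        totals[interval_start] += doc[field]
    return dict(sorted(totals.items()))

def ts(minute, second):
    """Helper that builds a UTC timestamp on 6 Feb 2019, hour 00."""
    return datetime(2019, 2, 6, 0, minute, second, tzinfo=timezone.utc)

# Made-up preprocessed documents with a one-hot encoded http_4xx field.
documents = [
    {"timestamp": ts(0, 5), "http_4xx": 0},
    {"timestamp": ts(0, 38), "http_4xx": 1},
    {"timestamp": ts(1, 10), "http_4xx": 1},
    {"timestamp": ts(1, 20), "http_4xx": 1},
    {"timestamp": ts(1, 59), "http_4xx": 0},
]
per_interval = aggregate_sum(documents, "http_4xx", interval_seconds=60)
```

With a one-hot encoded field, the sum per interval is simply the number of log entries in that bin, which is why sum pairs naturally with one-hot encoding.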

After you configure the right features and corresponding aggregation functions, the anomaly detector starts to initialize. After processing a sufficient amount of data, the detector enters the running state.

To help you get started with anomaly detection on your own logs, the following table shows the preprocessing techniques and aggregation functions that might make sense for some common log fields.

Log Field Name	Preprocessing	Aggregation
HTTP response status code	Binning into 1xx to 5xx ranges, then one-hot encoding	sum
Client IP address	IP geolocation to a country or city code	cardinality
Log Message Level (INFO, WARN, ERROR, FATAL, etc.)	One-hot encoding	sum
Error or Exception names	Map to numeric codes; additionally bin and one-hot encode if there are a large number of possible values	cardinality if using a single numeric code field; sum if using one-hot encodings
Object Size / Bytes Sent / Content-Length	None, use numeric value itself	min, max, average
To monitor general traffic levels, you can use any numeric field like response code or bytes sent to count the number of log entries per detector interval	None, use numeric value itself	count (value_count) – simply counts the number of documents that have a value for this field
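
As a rough sketch, the features and aggregations from the preceding table could be assembled into a detector definition like the following. The body layout (feature_attributes, aggregation_query, detection_interval) reflects my understanding of the anomaly detection create-detector API, so verify it against the documentation for your domain version. The detector name, index pattern, interval, and helper function are made up for this illustration.

```python
import json

def feature(field, aggregation):
    """Build one feature entry that applies `aggregation` to `field`."""
    name = f"{field}_{aggregation}"
    return {
        "feature_name": name,
        "feature_enabled": True,
        "aggregation_query": {name: {aggregation: {"field": field}}},
    }

# Hypothetical detector definition; names and values are assumptions.
detector_body = {
    "name": "s3-access-log-detector",
    "description": "Detects unusual error rates and client locations",
    "time_field": "timestamp",
    "indices": ["s3-access-logs*"],
    "feature_attributes": [
        feature("http_4xx", "sum"),
        feature("http_5xx", "sum"),
        feature("country_code", "cardinality"),
    ],
    "detection_interval": {"period": {"interval": 10, "unit": "Minutes"}},
}

request_json = json.dumps(detector_body, indent=2)
```

This example uses three features, which stays within the limit of five features per detector mentioned earlier.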

Conclusion

IT teams can use the anomaly detection feature of Amazon OpenSearch Service to implement proactive monitoring and alerting for applications and infrastructure logs. Anyone with basic scripting or programming skills should be able to implement the log preprocessing techniques discussed in this post—you don’t need to have in-depth knowledge of ML or data science. The anomaly detection feature is available in Amazon OpenSearch Service domains running Elasticsearch version 7.4 or later. To get started, see Anomaly detection in Amazon OpenSearch Service.

About the Author

Kapil Pendse is a Senior Solutions Architect with Amazon Web Services (Singapore) and has over 15 years of experience building technology solutions across multiple domains such as cloud computing, embedded systems, and machine learning. In his free time, Kapil likes to bike along Singapore’s coastal parks and enjoys the occasional company of otters.