AWS Big Data Blog

Using pipes to explore, discover and find data in Amazon OpenSearch Service with Piped Processing Language

System developers, DevOps engineers, support engineers, site reliability engineers (SREs), and IT managers make sure that the underlying infrastructure powering the applications and systems within an organization is available, reliable, secure, and scalable. To achieve these goals, you need to perform a fast and deep analysis on the underlying logs, monitoring, and observability data. Amazon OpenSearch Service is a popular choice to store and analyze such data. However, extracting insights from OpenSearch isn’t easy. Although Query DSL (the language used to query data stored in OpenSearch) is powerful, it has a steep learning curve, and wasn’t designed as a human interface to easily create one-time queries and explore user data.

In this post, we discuss the newly supported Piped Processing Language (PPL) feature, powered by Open Distro for Elasticsearch, which enables you to form complex queries and quickly explore and discover data with the help of pipes.

What is Piped Processing Language?

Piped Processing Language is powered by Open Distro for Elasticsearch, an Apache 2.0-licensed distribution of Elasticsearch. PPL enables you to explore, discover, and find data stored in Elasticsearch, using a set of commands delimited by pipes ( | ).

Pipes allow you to combine two or more commands as a chain, such that the output of one command acts as an input for the next command, very similar to Unix pipes. With PPL, you can now search for keywords and feed the results from the command on the left of the pipe to the command on the right of the pipe, effectively creating a command pipeline.
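The chaining idea can be sketched in a few lines of Python. This is only a conceptual illustration of how output flows from one stage to the next, not how the PPL engine is implemented; the helper names (pipeline, where, fields, head) and the sample records are hypothetical.

```python
from functools import reduce


def where(predicate):
    """Stage that keeps only the rows for which predicate(row) is true."""
    return lambda rows: [row for row in rows if predicate(row)]


def fields(*names):
    """Stage that keeps only the named fields of each row."""
    return lambda rows: [{name: row[name] for name in names} for row in rows]


def head(n=10):
    """Stage that keeps the first n rows."""
    return lambda rows: rows[:n]


def pipeline(source, *stages):
    """Feed the output of each stage into the next, left to right."""
    return reduce(lambda rows, stage: stage(rows), stages, source)


# Hypothetical sample data standing in for an index.
accounts = [
    {"account_number": 1, "firstname": "Amber", "age": 32},
    {"account_number": 6, "firstname": "Hattie", "age": 36},
    {"account_number": 13, "firstname": "Nanette", "age": 28},
]

# Conceptually: source=accounts | where age > 30 | fields firstname | head 1
result = pipeline(
    accounts,
    where(lambda row: row["age"] > 30),
    fields("firstname"),
    head(1),
)
print(result)
```

Each stage only sees what the previous stage produced, which is why the order of commands in a pipeline matters.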

Use case

As an illustration, consider a use case where you want to find the hosts that are responding with HTTP 404 (Not Found) and HTTP 503 (Service Unavailable) errors, count the error responses per host and response code, and sort the results by impact, with the highest counts first.

Using Query DSL

When you use Query DSL, the query looks similar to the following code:

GET kibana_sample_data_logs/_search
{"from":0,"size":0,"timeout":"1m","query":{"bool":{"should":[{"term":{"response.keyword":{"value":"404","boost":1}}},{"term":{"response.keyword":{"value":"503","boost":1}}}],"adjust_pure_negative":true,"boost":1}},"sort":[{"_doc":{"order":"asc"}}],"aggregations":{"composite_buckets":{"composite":{"size":1000,"sources":[{"host":{"terms":{"field":"host.keyword","missing_bucket":true,"order":"asc"}}},{"response":{"terms":{"field":"response.keyword","missing_bucket":true,"order":"asc"}}}]},"aggregations":{"request_count":{"value_count":{"field":"request.keyword"}},"sales_bucket_sort":{"bucket_sort":{"sort":[{"request_count":{"order":"desc"}}],"size":10}}}}}}

The following screenshot shows the query results.

 

Using PPL

You can replace the entire DSL query with a single PPL command:

source = kibana_sample_data_logs | where response='404' or response='503' | stats count(request) as request_count by host, response | sort -request_count

The following screenshot shows the query results.
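Outside of Kibana, the same query can be sent over HTTP. The following is a minimal sketch that only builds the request, assuming the Open Distro PPL plugin endpoint is _opendistro/_ppl and accepts a JSON body with a query field; check the documentation for your domain's version before relying on that path. The domain URL and the build_ppl_request helper are hypothetical, and authentication (for example, request signing or basic auth) is deliberately left out.

```python
import json
import urllib.request


def build_ppl_request(endpoint, ppl_query):
    """Build (but do not send) an HTTP POST request carrying a PPL query."""
    body = json.dumps({"query": ppl_query}).encode("utf-8")
    return urllib.request.Request(
        # Assumed plugin path; verify it for your domain's version.
        url=endpoint.rstrip("/") + "/_opendistro/_ppl",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


query = (
    "source = kibana_sample_data_logs "
    "| where response='404' or response='503' "
    "| stats count(request) as request_count by host, response "
    "| sort -request_count"
)

# Hypothetical domain endpoint; replace with your own.
request = build_ppl_request("https://my-domain.example.com", query)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

To actually run the query, you would pass the request to urllib.request.urlopen (or an HTTP client of your choice) after adding whatever credentials your domain requires.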

Commands and functions supported by PPL

PPL supports a comprehensive set of commands, including search, where, fields, rename, dedup, sort, stats, eval, head, top, and rare. These commands are read-only requests to process data and return results. The following table summarizes the purpose of each command.

Command: search
What it does: Retrieves documents from the index. The search keyword can be omitted.
Example: source=accounts;
Result: Retrieves all documents from the accounts index.

Command: fields
What it does: Keeps or removes fields from the search result.
Example: source=accounts | fields account_number, firstname, lastname;
Result: Gets the account_number, firstname, and lastname fields from the search result.

Command: dedup
What it does: Removes duplicate documents, as defined by a field, from the search result.
Example: source=accounts | dedup gender | fields account_number, gender;
Result: Removes duplicate documents with the same gender.

Command: stats
What it does: Aggregates the search results using sum, count, min, max, and avg.
Example: source=accounts | stats avg(age);
Result: Calculates the average age of all accounts.

Command: eval
What it does: Evaluates an expression and appends its result to the search result.
Example: search source=accounts | eval doubleAge = age * 2 | fields age, doubleAge;
Result: Creates a new doubleAge field for each document, equal to age * 2.

Command: head
What it does: Returns the first N results in the specified search order.
Example: search source=accounts | fields firstname, age | head;
Result: Fetches the first 10 results.

Command: top
What it does: Finds the most common values of all fields in the field list.
Example: search source=accounts | top gender;
Result: Finds the most common values of gender.

Command: rare
What it does: Finds the least common values of all fields in the field list.
Example: search source=accounts | rare gender;
Result: Finds the least common values of gender.

Command: where
What it does: Filters the search result.
Example: search source=accounts | where account_number=1 or gender="F" | fields account_number, gender;
Result: Gets the documents from the accounts index where account_number is 1 or gender is F.

Command: rename
What it does: Renames one or more fields in the search result.
Example: search source=accounts | rename account_number as acc | fields acc;
Result: Renames the account_number field as acc.

Command: sort
What it does: Sorts results by a specified field.
Example: search source=accounts | sort age | fields account_number, age;
Result: Sorts all documents by the age field in ascending order.
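To make the semantics of a few of these commands concrete, here is a rough Python emulation over in-memory records. It is an approximation for intuition only: the sample data and the function names are hypothetical, the code is not part of PPL, and details such as tie-breaking and default result limits may differ from the real commands.

```python
from collections import Counter

# Hypothetical sample data standing in for the accounts index.
accounts = [
    {"account_number": 1, "gender": "M", "age": 32},
    {"account_number": 6, "gender": "M", "age": 36},
    {"account_number": 13, "gender": "F", "age": 28},
    {"account_number": 18, "gender": "M", "age": 33},
]


def dedup(rows, field):
    """Roughly `dedup <field>`: keep the first row seen for each distinct value."""
    seen = set()
    kept = []
    for row in rows:
        if row[field] not in seen:
            seen.add(row[field])
            kept.append(row)
    return kept


def top(rows, field, n=10):
    """Roughly `top <field>`: values of field, most frequent first."""
    counts = Counter(row[field] for row in rows)
    return [value for value, _ in counts.most_common(n)]


def rare(rows, field, n=10):
    """Roughly `rare <field>`: values of field, least frequent first."""
    counts = Counter(row[field] for row in rows)
    ordered = sorted(counts.items(), key=lambda item: item[1])
    return [value for value, _ in ordered[:n]]


def stats_avg(rows, field):
    """Roughly `stats avg(<field>)`: arithmetic mean of field over all rows."""
    values = [row[field] for row in rows]
    return sum(values) / len(values)


print(dedup(accounts, "gender"))
print(top(accounts, "gender"))
print(rare(accounts, "gender"))
print(stats_avg(accounts, "age"))
```

Note that top and rare are the same frequency count read from opposite ends, which is why they share most of their logic here.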

PPL also supports date-time, mathematical, string, aggregate, and trigonometric functions, as well as operators and expressions.

Summary

Piped Processing Language, powered by Open Distro for Elasticsearch, has a comprehensive set of commands and functions that enable you to quickly begin extracting insights from your data in Elasticsearch. It’s supported on all Amazon OpenSearch Service domains running Elasticsearch 7.9 or later. The Query Workbench in Kibana also supports PPL in addition to SQL. For more information, see Piped Processing Language.


About the Author

Viraj Phanse is a product management leader at Amazon Web Services for Search Services/Analytics. An avid foodie, he loves trying cuisines from around the globe. In his free time, he loves to play his keyboard and travel.