AWS Big Data Blog

Matching your Ingestion Strategy with your OpenSearch Query Patterns

Choosing the right indexing strategy for your Amazon OpenSearch Service clusters helps deliver low-latency, accurate results while maintaining efficiency. If your access patterns require complex or resource-intensive queries, that is often a sign to re-evaluate how your data is indexed.

In this post, we demonstrate how you can create a custom index analyzer in OpenSearch to implement autocomplete functionality efficiently by using the Edge n-gram tokenizer to match prefix queries without using wildcards.

What is an index analyzer?

Index analyzers are used to analyze text fields during ingestion of a document. The analyzer outputs the terms that are stored in the index and matched against queries at search time.

By default, OpenSearch indexes your data using the standard index analyzer. The standard index analyzer splits tokens on spaces, converts tokens to lowercase, and removes most punctuation. For some use cases (like log analytics), the standard index analyzer might be all you need.

Standard Index Analyzer

Let’s look at what the standard index analyzer does. We’ll use the _analyze API to test how the standard index analyzer tokenizes the sentence “Standard Index Analyzer.”

Note: You can run all the commands in this post using Dev Tools in OpenSearch Dashboards.

GET /_analyze
{
  "analyzer": "standard",
  "text": "Standard Index Analyzer."
}
#========
#Results
#========
{
  "tokens": [
    {
      "token": "standard",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "index",
      "start_offset": 9,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "analyzer",
      "start_offset": 15,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

Notice how each word was lowercased and the period (punctuation) was removed.

Creating your own index analyzer

OpenSearch offers a large number of built-in analyzers that you can use for different access patterns. It also lets you build your own custom analyzer, configured for your specific search needs. In the following example, we are going to configure a custom analyzer that returns partial word matches for a list of addresses. The analyzer is specifically designed for autocomplete functionality, enabling end users to quickly find addresses without having to type out (or remember) an entire address. Autocomplete allows OpenSearch to effectively complete the search term based on matched prefixes.

First, create an index called standard_index_test:

PUT standard_index_test
{
  "mappings": {
    "properties": {
      "text_entry": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}

Specifying the analyzer as standard is not required because the standard analyzer is the default analyzer.
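
For example, the following mapping (using a hypothetical index name, standard_index_default) omits the analyzer setting entirely and should behave the same way, because OpenSearch falls back to the standard analyzer for text fields:

PUT standard_index_default
{
  "mappings": {
    "properties": {
      "text_entry": {
        "type": "text"
      }
    }
  }
}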

To test, bulk add some data to the standard_index_test index that we created.

POST _bulk
{"index":{"_index":"standard_index_test"}} 
{"text_entry": "123 Amazon Street Seattle, Wa 12345 "} 
{"index":{"_index":"standard_index_test"}}
{"text_entry": "456 OpenSearch Drive Anytown, Ny 78910"}
{"index":{"_index":"standard_index_test"}}
{"text_entry": "789 Palm way Ocean Ave, Ca 33345"}
{"index":{"_index":"standard_index_test"}}
{"text_entry": "987 Openworld Street, Tx 48981"}

Query this data using the text “ope”.

GET standard_index_test/_search
{
  "query": {
    "match": {
      "text_entry": {
        "query": "ope"
      }
    }
  }
}
#========
#Results
#========
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": n`ull,
    "hits": [] # No matches 
  }
}

When searching for the term “ope”, we don’t get any matches. To see why, we can dive a little deeper into the standard index analyzer and see how our text is being tokenized. Test the standard index analyzer with the address “456 OpenSearch Drive Anytown, Ny 78910”.

POST standard_index_test/_analyze
{
  "analyzer": "standard",
  "text": "456 OpenSearch Drive Anytown, Ny 78910"
}
#========
#Results
#========
  "tokens":
      "456" 
      "opensearch" 
      "drive" 
      "anytown"
      "ny" 
      "78910"

The standard index analyzer has tokenized the address into individual terms: 456, opensearch, drive, and so on. That means unless you search for a complete token (like 456 or opensearch), queries for o, op, ope, and even open won't yield any results. One option is to use wildcards while still using the standard index analyzer for indexing:

GET standard_index_test/_search
{
  "query": {
    "wildcard": {
      "text_entry": "ope*"
    }
  }
}

The wildcard query would match “456 OpenSearch Drive Anytown, Ny 78910”, but wildcard queries can be resource intensive and slow. Querying for ope* causes OpenSearch to iterate over each term in the index, bypassing the optimizations of inverted index lookups, which results in higher memory usage and slower performance. To improve query execution and the search experience, we can use an index analyzer that better suits our access patterns.

Edge n-gram

The Edge n-gram tokenizer helps you find partial matches and avoids the use of wildcards by tokenizing prefixes of a single word. For example, the input word coffee is expanded into all its prefixes: c, co, cof, and so on. It can limit the prefixes to those between a minimum (min_gram) and maximum (max_gram) length. So with min_gram=3 and max_gram=5, it will expand “coffee” to cof, coff, and coffe.
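
If you want to see this expansion yourself before building an index, the _analyze API accepts an inline tokenizer definition. The following request is a quick sketch of the coffee example above, using an edge_ngram tokenizer with a min_gram of 3 and a max_gram of 5:

POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 3,
    "max_gram": 5
  },
  "text": "coffee"
}
#========
#Results (abbreviated to tokens only)
#========
#  "cof", "coff", "coffe"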

Create a new index called custom_index with our own custom index analyzer that uses Edge n-grams. Set the minimum token length (min_gram) to 3 characters and the maximum token length (max_gram) to 20 characters. The min_gram and max_gram settings control the minimum and maximum lengths of the generated tokens, respectively.

Select min_gram and max_gram based on your access patterns. In this example, we're searching for the term “ope”, so we don't need to set the minimum length to anything less than 3 because we're not searching for terms like o or op. Setting the min_gram too low can lead to high latency. Likewise, we don't need to set the maximum length to anything greater than 20 because no individual token in our addresses exceeds 20 characters. Setting the maximum length to 20 gives us room to spare in case we eventually ingest an address with a longer token. Note that the index we are creating here is specifically for autocomplete functionality and is likely unnecessary for a general search index.

PUT custom_index
{
  "mappings": {
    "properties": {
      "text_entry": {
        "type": "text",
        "analyzer": "autocomplete",         
        "search_analyzer": "standard"       
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "edge_ngram_filter"
          ]
        }
      }
    }
  }
}

In the above code, we created an index called custom_index with a custom analyzer named autocomplete. The analyzer performs the following:

  • It uses the standard tokenizer to split text into tokens
  • A lowercase filter is applied to lowercase all the tokens
  • The edge_ngram_filter then breaks each token into prefixes between the minimum (min_gram) and maximum (max_gram) lengths

The search analyzer is configured to use the standard analyzer to reduce the query processing required at search time. Because our custom analyzer already split the text into edge n-grams at ingestion, there is no need to repeat that work when searching. Test how the custom analyzer analyzes the text Lexington Avenue:

GET custom_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "Lexington Avenue"
}
#========
#Results
#========
# Minimum token length is 3, so we won't see l or le
    "tokens": 
        "lex"  
        "lexi"  
        "lexin"  
        "lexing" 
        "lexingt" 
        "lexingto"    
        "lexington" 
        "ave"        
        "aven" 
        "avenu" 
        "avenue"

Notice how the tokens are lowercase and now support partial matches. Now that we’ve seen how our analyzer tokenizes our text, bulk add some data:

POST _bulk
{"index":{"_index":"custom_index"}} 
{"text_entry": "123 Amazon Street Seattle, Wa 12345 "} 
{"index":{"_index":"custom_index"}}
{"text_entry": "456 OpenSearch Drive Anytown, Ny 78910"}
{"index":{"_index":"custom_index"}}
{"text_entry": "789 Palm way Ocean Ave, Ca 33345"}
{"index":{"_index":"custom_index"}}
{"text_entry": "987 Openworld Street, Tx 48981"}

And test!

GET custom_index/_search
{
  "query": {
    "match": {
      "text_entry": {
        "query": "ope" 
      }
    }
  }
}
#========
#Results
#========
 "hits": [
      {
        "_index": "custom_index",
        "_id": "aYCEIJgB4vgFQw3LmByc",
        "_score": 0.9733556,
        "_source": {
          "text_entry": "456 OpenSearch Drive Anytown, Ny 78910"
        }
      },
      {
        "_index": "custom_index",
        "_id": "a4CEIJgB4vgFQw3LmByc",
        "_score": 0.4095239,
        "_source": {
          "text_entry": "987 Openworld Street, Tx 48981"
        }
      }
    ]


You have configured a custom edge n-gram analyzer to find partial word matches within our list of addresses.

Note that there is a tradeoff between using non-standard index analyzers and writing compute-intensive queries. Analyzers can affect indexing throughput and increase the overall index size, especially if used inefficiently. For example, when creating the custom_index, the search analyzer was set to use the standard analyzer. Using n-grams for analysis at both ingestion and search time would have impacted cluster performance unnecessarily. Additionally, we set the min_gram and max_gram to values that matched our access patterns, ensuring we didn't create more n-grams than we needed for our search use case. This allowed us to gain the benefits of optimizing search without impacting our ingestion throughput.
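
If you want to observe this tradeoff for yourself, one way is to compare the document counts and on-disk sizes of the two indexes created in this post using the _cat indices API (with only four small documents the difference is negligible, but it grows with data volume):

GET _cat/indices/standard_index_test,custom_index?v&h=index,docs.count,store.size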

Conclusion

In this post, we changed how OpenSearch indexed our data to simplify and speed up autocomplete queries. In our case, using Edge n-grams allowed OpenSearch to match parts of an address and return precise results without resorting to a wildcard query that would compromise cluster performance.

It’s always important to test your cluster before deploying in a production environment. Understanding your access patterns is essential to optimizing your cluster from both an indexing and searching perspective. Use the guidelines in this post as a starting point. Confirm your access patterns before creating an index, then begin experimenting with different index analyzers in a test environment to see how they can simplify your queries and improve overall cluster performance. For more reading on general OpenSearch cluster optimization techniques, refer to the Get started with Amazon OpenSearch Service: T-shirt-size your domain post.


About the authors

Rakan Kandah

Rakan is a Solutions Architect at AWS. In his free time, Rakan enjoys playing guitar and reading.