How do I troubleshoot search latency spikes in my Amazon OpenSearch Service cluster?

I have search latency spikes in my Amazon OpenSearch Service cluster.

Short description

For search requests, OpenSearch Service calculates the round trip time as follows:

Round trip = Time the query spends in the query phase + Time in the fetch phase + Time spent in the queue + Network latency

The SearchLatency metric on Amazon CloudWatch gives you the time that the query spent in the query phase.
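
As a quick check on the cluster-side portion of this round trip, you can also compare the latency that your client observes with the took field that every search response includes. The following request uses a hypothetical index named my-index; took reports, in milliseconds, the time that the cluster spent processing the search:

GET /my-index/_search
{
  "query": {
    "match_all": {}
  }
}

If took is small but the client-observed round trip is much longer, then the extra time is likely spent in the network or in the client rather than in the cluster.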

To troubleshoot search latency spikes in your OpenSearch Service cluster, there are multiple steps that you can take:

  • Check for insufficient resources provisioned on the cluster
  • Check for search rejections using the ThreadpoolSearchRejected metric in CloudWatch
  • Use the search slow logs API and the profile API
  • Resolve any 504 gateway timeout errors

Resolution

Check for insufficient resources provisioned on the cluster

If you have insufficient resources provisioned on your cluster, then you might experience search latency spikes. Use the following best practices to make sure that you have sufficient resources provisioned.

1.    Review the CPUUtilization and JVMMemoryPressure metrics of the cluster using CloudWatch. Confirm that they're within the recommended thresholds. For more information, see Recommended CloudWatch alarms for Amazon OpenSearch Service.

2.    Use the node stats API to get node level statistics on your cluster:

GET /_nodes/stats

In the output, check the following sections: caches, fielddata, and jvm. To track changes over time, run this API multiple times with some delay between calls, and then compare the outputs.

3.    OpenSearch Service uses multiple caches to improve its performance and the response time of requests:

  • The file-system cache, or page cache, that exists on the operating system level
  • The shard level request cache and query cache that both exist on the OpenSearch Service level

Review the node stats API output for cache evictions. A high number of cache evictions in the output means that the cache size is inadequate to serve the request. To reduce your evictions, use bigger nodes with more memory.

To view specific cache information with the node stats API, use the following requests. The first request returns statistics for the shard-level request cache, and the second returns statistics for the query cache:

GET /_nodes/stats/indices/request_cache?human

GET /_nodes/stats/indices/query_cache?human

For more information on OpenSearch caches, see Elasticsearch caching deep dive: Boosting query speed one cache at a time on the Elastic website.

For steps to clear the various caches, see Clear index or data stream cache on the OpenSearch website.
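
For example, the following hypothetical requests clear the request cache, and then the query and fielddata caches, for an index named my-index (the index name is a placeholder):

POST /my-index/_cache/clear?request=true

POST /my-index/_cache/clear?query=true&fielddata=true

Because OpenSearch must rebuild a cleared cache, clearing caches can temporarily increase latency. Use this as a troubleshooting step rather than a routine fix.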

4.    Performing aggregations on fields that contain highly unique values might increase heap usage. If aggregation queries are already in use, then search operations use fielddata. Fielddata also sorts and accesses the field values in scripts. Fielddata evictions depend on the indices.fielddata.cache.size setting, which accounts for 20% of the JVM heap space. When the cache exceeds this limit, evictions start.

To view the fielddata cache, use this request:

GET /_nodes/stats/indices/fielddata?human

For more information on troubleshooting high JVM memory, see How do I troubleshoot high JVM memory pressure on my Amazon OpenSearch Service cluster?
To troubleshoot high CPU utilization, see How do I troubleshoot high CPU utilization on my Amazon OpenSearch Service cluster?

Check for search rejections using the ThreadpoolSearchRejected metric in CloudWatch

To check for search rejections using CloudWatch, follow the steps in How do I resolve search or write rejections in Amazon OpenSearch Service?
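
You can also check for rejections directly on the cluster. As a rough sketch, the cat thread pool API lists the active, queued, and rejected task counts for the search thread pool on each node:

GET /_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected

A rejected count that keeps growing between calls indicates that the search queue is full and that the node is rejecting search requests.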

Use search slow logs to identify long running queries

To identify both long running queries and the time that a query spent on a particular shard, use slow logs. You can set thresholds for both the query phase and the fetch phase for each index. For more information on how to set up slow logs, see Viewing Amazon OpenSearch Service slow logs. For a detailed breakdown of the time that your query spends in the query phase, set "profile":true for your search query.
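
For example, the following requests show one way to set slow log thresholds on an index and to run a profiled query. The index name my-index, the threshold values, and the match query are placeholders for illustration:

PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}

GET /my-index/_search
{
  "profile": true,
  "query": {
    "match": {
      "title": "example"
    }
  }
}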

Note: If you set the threshold for logging to a very low value, your JVM memory pressure might increase. This might lead to more frequent garbage collection that then increases CPU utilization and adds to cluster latency. Logging more queries might also increase your costs. A long output of the profile API also adds significant overhead to any search queries.

Resolve any 504 gateway timeout errors

From the application logs of your OpenSearch Service cluster, you can see specific HTTP error codes for individual requests. For more information on resolving HTTP 504 gateway timeout errors, see How can I prevent HTTP 504 gateway timeout errors in Amazon OpenSearch Service?

Note: You must activate error logs to identify specific HTTP error codes. For more information about HTTP error codes, see Viewing Amazon OpenSearch Service error logs.

Other factors that can cause high search latency

There are a number of other factors that can cause high search latency. Use the following tips to further troubleshoot high search latency:

  • Frequent or long running garbage collection activity might cause search performance issues. Garbage collection activity might pause threads and increase search latency. For more information, see A heap of trouble: Managing Elasticsearch's managed heap on the Elastic website.
  • Provisioned IOPS (or i3 instances with local NVMe instance store) might help you remove any Amazon Elastic Block Store (Amazon EBS) bottleneck. In most cases, you don't need them. Before you move to i3 instances, it's a best practice to test the performance of i3 nodes against r5 nodes.
  • A cluster with too many shards might increase resource utilization, even when the cluster is inactive. Too many shards slow down query performance. Although increasing the replica shard count can help you achieve faster searches, don't go beyond 1,000 shards on a given node. Also, make sure that shard sizes are between 10 GiB and 50 GiB. Ideally, keep the number of shards on a node at or below 20 times the JVM heap size in GiB.
  • Too many segments or too many deleted documents might affect search performance. To improve performance, use force merge on read-only indices. Also, increase the refresh interval on active indices, if possible (see the examples after this list). For more information, see Lucene's handling of deleted documents on the Elastic website.
  • If your cluster is in a Virtual Private Cloud (VPC), then it's a best practice to run your applications within the same VPC.
  • Use UltraWarm nodes for read-only data and hot data nodes for data that you actively index and query. Hot storage provides the fastest possible performance for indexing and searching new data. However, UltraWarm nodes are a cost-effective way to store large amounts of read-only data on your cluster. For indices that you don't write to and that don't require high query performance, UltraWarm offers significantly lower costs per GiB of data.
  • Determine if your workload benefits from having the data that you're searching for available on all nodes. Some applications benefit from this approach, especially if there are few indices on your cluster. To do this, increase the number of replica shards.
    Note: This might add to indexing latency. Also, you might need additional Amazon EBS storage to accommodate the replicas that you add. This increases your EBS volume costs.
  • Search as few fields as possible, and avoid scripts and wildcard queries. For more information, see Tune for search speed on the Elastic website.
  • For indices with many shards, use custom routing to help speed up searches. Custom routing makes sure that you query only the shards that hold your data, rather than broadcast the request to all shards (see the routing example after this list). For more information, see Customizing your document routing on the Elastic website.
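
The following requests sketch a few of the preceding tips. The index name my-index, the 30-second refresh interval, and the routing value user-123 are placeholders that you should adapt to your own workload. The first request force merges a read-only index, the second increases the refresh interval on an active index, and the third routes a search to only the shards that hold the routing value user-123:

POST /my-index/_forcemerge?max_num_segments=1

PUT /my-index/_settings
{
  "index.refresh_interval": "30s"
}

GET /my-index/_search?routing=user-123
{
  "query": {
    "match": {
      "user_id": "user-123"
    }
  }
}

Routing a search helps only if the documents were indexed with the same routing value.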

Related information

Recommended CloudWatch alarms for Amazon OpenSearch Service
