How do I troubleshoot search latency spikes in my Amazon Elasticsearch Service cluster?

Last updated: 2021-04-05

I am experiencing search latency spikes in my Amazon Elasticsearch Service (Amazon ES) cluster. How can I troubleshoot and resolve search latency spikes?

Short description

For search requests, the round trip time is calculated as follows:

Round trip = Time the query spends in the query phase + time in the fetch phase + time spent in the queue + network latency

The SearchLatency metric on Amazon CloudWatch gives you the time the query has spent in the query phase.

There are a number of troubleshooting steps you can take to resolve search latency spikes in an Amazon ES cluster, including:

  • Check for insufficient resources provisioned on the cluster
  • Check for search rejections using the ThreadpoolSearchRejected metric in CloudWatch
  • Use Search Slow Logs and Profile API
  • Resolve any 504 Gateway Timeout errors

Resolution

Check for insufficient resources provisioned on the cluster

You can experience search latency spikes if you have insufficient resources provisioned on the Amazon ES cluster. Use the following best practices to ensure that you have sufficient resources provisioned.

1.    Review the CPUUtilization and JVMMemoryPressure metrics of the cluster using Amazon CloudWatch to confirm that they are within the recommended thresholds. For more information, see Recommended CloudWatch alarms.

2.    Use the Node Stats API to get node level statistics on your cluster:

GET /_nodes/stats

In the output, check the following sections: caches, fielddata, and jvm. Run this API multiple times with some delay to compare the outputs.
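For example, the following requests narrow the Node Stats output to the JVM and cache-related sections (these are standard Elasticsearch endpoints; the human parameter only formats byte values for readability):

GET /_nodes/stats/jvm
GET /_nodes/stats/indices/fielddata,query_cache,request_cache?human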

3.    Amazon ES uses the file system cache to serve search requests faster. Review the Node Stats API output for cache evictions. A high number of cache evictions in the output means that the cache size is inadequate to serve the request. In this case, consider using bigger nodes with more memory.

4.    Performing aggregations on fields that contain highly unique values can increase heap usage. If aggregation queries are in use, then search operations use fielddata. Fielddata is also used when sorting and when accessing field values in scripts. Fielddata evictions depend on the indices.fielddata.cache.size setting, which accounts for 20% of the JVM heap space. Evictions start when the cache size is exceeded.

    For more information on troubleshooting high JVM memory, see How do I troubleshoot high JVM memory pressure on my Amazon ES cluster?

    To troubleshoot high CPU utilization, see How do I troubleshoot high CPU utilization on my Amazon ES cluster?
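    If you suspect that fielddata is contributing to heap pressure, the cat fielddata API shows how much fielddata memory each field uses per node (a standard Elasticsearch API; the output is empty if no fielddata has been loaded yet):

GET /_cat/fielddata?v&h=node,field,size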

Check for search rejections using the ThreadpoolSearchRejected metric in CloudWatch

To check for search rejections using CloudWatch, follow the steps in How do I resolve search or write rejections in Amazon ES?

Use Search Slow Logs to identify long running queries

Use slow logs to identify both long running queries and the time that a query spent on a particular shard. You can set thresholds for the query phase and the fetch phase for each index. For more information on setting up slow logs, see Viewing Amazon ES slow logs. Be sure to set "profile":true for your search query to get a detailed breakdown of the time spent by your query in the query phase.
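As an illustration, the following requests set example warning thresholds for the query and fetch phases on a hypothetical index named my-index (choose thresholds that suit your workload), and then run a profiled search against it; the field name and query value are also hypothetical:

PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}

GET /my-index/_search
{
  "profile": true,
  "query": {
    "match": { "message": "timeout" }
  }
}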

Note: If you set the threshold for logging to a very low value, your JVM memory pressure can increase. This can lead to more frequent garbage collection, which increases CPU utilization and adds to latency on the cluster. Logging more queries can also increase your costs. The output of the Profile API can be long, adding significant overhead to any search queries.

Resolve any 504 Gateway Timeout errors

From the Application Logs of your Amazon ES cluster, you can see specific HTTP error codes for individual requests. For more information on resolving HTTP 504 Gateway Timeout errors, see How can I prevent HTTP 504 gateway timeout errors in Amazon ES?

Note: You must enable error logs to identify specific HTTP error codes. For more information about HTTP error codes, see Viewing Amazon ES error logs.

Other factors that can cause high search latency

There are a number of other factors that can cause high search latency. Use the following tips to further troubleshoot high search latency:

  • Frequent or long running garbage collection activity can cause search performance issues. Garbage collection activity can pause threads and increase search latency. For more information, see Managing Elasticsearch's managed heap on the Elasticsearch website.
  • Provisioned IOPS (or i3 instances) might help you remove any Amazon Elastic Block Store (Amazon EBS) bottleneck. In most cases, you will not need them. It's a best practice to test the performance of i3 nodes against r5 nodes before moving directly to i3.
  • A cluster with too many shards can increase resource utilization, even when the cluster is inactive. Too many shards slow down query performance. Although increasing the replica shard count can help you achieve faster searches, make sure that you don't go beyond 1,000 shards on a given node. Also, make sure that the shard sizes are between 10 GiB and 50 GiB. Ideally, the maximum number of shards on a node should be 20 times the JVM heap size (in GiB).
  • Too many segments or too many deleted documents can affect search performance. Using force merge on read-only indices can help in this case. If your use case allows it, increase the refresh interval on the active indices (see the force merge and refresh interval example after this list). For more information, see Lucene's handling of deleted documents on the Elasticsearch website.
  • If your cluster is in a VPC, consider running your applications within the same VPC.
  • Consider using UltraWarm nodes or hot data nodes for read-only data. Hot storage provides the fastest possible performance for indexing and searching new data. However, UltraWarm nodes are a cost-effective way to store large amounts of read-only data on Amazon ES. For indices that you are not actively writing to and don't need the same performance from, UltraWarm offers significantly lower costs per GiB of data.
  • Test your workload to see if it benefits from having the data that you are searching for available on all nodes. Some applications benefit from this approach, especially if there are few indices on your cluster. To do this, increase the number of replica shards. Keep in mind that this can add to indexing latency. Also, you might need additional Amazon EBS storage to accommodate the replicas that you are adding. This will increase your EBS volume costs.
  • Search as few fields as possible, and avoid scripts and wildcard queries. For more information, see Tune for search speed on the Elasticsearch website.
  • For indices with many shards, you can use custom routing to help speed up searches. Custom routing ensures that only the shards holding your data are queried, instead of broadcasting the request to all the shards of the index. For more information, see Customizing your document routing on the Elasticsearch website.
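To illustrate the force merge and refresh interval suggestions above, the following requests use a hypothetical read-only index named old-index and a hypothetical actively written index named my-index; the values shown are examples only, not recommendations:

POST /old-index/_forcemerge?max_num_segments=1

PUT /my-index/_settings
{
  "index.refresh_interval": "30s"
}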
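For custom routing, here is a minimal sketch: documents indexed with a routing value are stored on a single shard, and a search that passes the same routing value queries only that shard. The index name, field, and routing value below are hypothetical:

PUT /my-index/_doc/1?routing=user1
{
  "user": "user1",
  "message": "example document"
}

GET /my-index/_search?routing=user1
{
  "query": {
    "match": { "user": "user1" }
  }
}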