How do I troubleshoot high CPU utilization on my Amazon OpenSearch Service cluster?

Last updated: 2021-08-05

My data nodes are showing high CPU usage on my Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) cluster. How do I troubleshoot this?

Short description

It's a best practice to keep CPU utilization at a level that leaves OpenSearch Service enough resources to perform its tasks. A cluster that consistently runs at high CPU utilization can degrade cluster performance. When your cluster is overloaded, OpenSearch Service stops responding, resulting in request timeouts.

To troubleshoot high CPU utilization on your cluster, consider the following approaches:

  • Use the nodes hot threads API. (For more information, see Nodes hot threads API on the Elasticsearch website.)
  • Check the write operation or bulk API thread pool. (For more information, see Bulk API on the Elasticsearch website.)
  • Check the search thread pool. (For more information, see Thread pools on the Elasticsearch website.)
  • Check the Apache Lucene merge thread pool. (For more information, see Merge on the Elasticsearch website.)

Resolution

Use the nodes hot threads API

If there are constant CPU spikes in your OpenSearch Service cluster, then use the nodes hot threads API. The nodes hot threads API acts as a task manager, showing you the breakdown of all resource-intensive threads that are running on your cluster.

Here's an example output of the nodes hot threads API:

GET _nodes/hot_threads

100.0% (131ms out of 500ms) cpu usage by thread 'opensearch[xxx][search][T#62]'
   10/10 snapshots sharing following 10 elements
     sun.misc.Unsafe.park(Native Method)
     java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
     java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:737)
     java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
     java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1269)
     org.opensearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:162)
     java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
     java.lang.Thread.run(Thread.java:745)

Note: The nodes hot threads output lists information for each node. The length of your output depends on how many nodes are running in your OpenSearch Service cluster.

Additionally, use the cat nodes API to view the current breakdown of resource utilization. You can identify the nodes with the highest CPU utilization by sorting the output with the following command:

GET _cat/nodes?v&s=cpu:desc

The last column in your output displays your node name. For more information, see cat nodes API on the Elasticsearch website.

Then, pass the relevant node name to the hot threads API:

GET _nodes/<node-name>/hot_threads

For more information, see hot threads API on the Elasticsearch website.

The nodes hot threads output looks like the following:

<percentage> of cpu usage by thread 'opensearch[<nodeName>][<thread-name>]'

The thread name indicates which OpenSearch Service processes are consuming high CPU.

Check the write operation or bulk API thread pool

A 429 error in OpenSearch Service can indicate that your cluster is handling too many bulk indexing requests. During constant CPU spikes, your cluster rejects bulk indexing requests with this error.

The write thread pool handles indexing requests, which include Bulk API operations. To confirm whether your cluster is handling too many bulk indexing requests, check the IndexingRate metric in Amazon CloudWatch.
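
You can also check whether individual nodes are queuing or rejecting write requests. For example, the following cat thread pool call lists the active, queued, and rejected tasks in each node's write thread pool (the column selection shown here is one reasonable choice; adjust it as needed):

GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected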

If your cluster is handling too many bulk indexing requests, then consider the following approaches:

  • Reduce the number of bulk requests on your cluster.
  • Reduce the size of each bulk request, so that your nodes can process them more efficiently.
  • If Logstash is used to push data into your OpenSearch Service cluster, then reduce the batch size or the number of workers (see the example after this list).
  • If your cluster's ingestion rate slows down, then scale your cluster (either horizontally or vertically). To scale your cluster, increase the number of data nodes or move to a larger instance type so that OpenSearch Service can process the incoming requests.
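
If you push data with Logstash, the batch size and worker count are pipeline settings. As an illustration, a logstash.yml snippet that limits both might look like the following; the values are examples only, so tune them for your workload:

pipeline.workers: 2
pipeline.batch.size: 125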

Check the search thread pool

A search thread pool that consumes high CPU indicates that search queries are overwhelming your OpenSearch Service cluster. Your cluster can be overwhelmed by a single long-running query, or by an overall increase in the number of queries that it performs.
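
As a starting point, you can check each node's search thread pool for queued or rejected tasks. For example:

GET _cat/thread_pool/search?v&h=node_name,name,active,queue,rejected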

To check whether a single query is increasing your CPU usage, use the task management API. For example:

GET _tasks?actions=*search&detailed

The task management API fetches all active search queries that are running on your cluster. For more information, see Task management API on the Elasticsearch website.

Here's an example output:

{
  "nodes": {
    "U4M_p_x2Rg6YqLujeInPOw": {
      "name": "U4M_p_x",
      "roles": [
        "data",
        "ingest"
      ],
      "tasks": {
        "U4M_p_x2Rg6YqLujeInPOw:53506997": {
          "node": "U4M_p_x2Rg6YqLujeInPOw",
          "id": 53506997,
          "type": "transport",
          "action": "indices:data/read/search",
          "description": """indices[*], types[], search_type[QUERY_THEN_FETCH], source[{"size":10000,"query":{"match_all":{"boost":1.0}}}]""",
          "start_time_in_millis": 1541423217801,
          "running_time_in_nanos": 1549433628,
          "cancellable": true,
          "headers": {}
        }
      }
    }
  }
}

Check the description field to identify which particular query is being run. The running_time_in_nanos field indicates the amount of time a query has been running. To decrease your CPU usage, cancel the search query that is consuming high CPU. The task management API also supports a _cancel call.

Note: Make sure to record the task ID from your output to cancel a particular task. In this example, the task ID is "U4M_p_x2Rg6YqLujeInPOw:53506997".

Here's an example of a task management POST call:

POST _tasks/U4M_p_x2Rg6YqLujeInPOw:53506997/_cancel

The task management POST call marks the task as "cancelled", releasing any dependent AWS resources. If you have multiple queries running on your cluster, then use the POST call to cancel queries one at a time, until your cluster returns to a normal state. It's also a best practice to set a proper timeout value in the query body to prevent high CPU spikes. (For more information, see Request body search parameters on the Elasticsearch website.) To verify that the number of active queries has decreased, check the SearchRate metric in Amazon CloudWatch.

Note: Canceling all active search queries at the same time in your OpenSearch Service cluster can cause errors on the client application side.
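
To apply the timeout best practice noted previously, you can set the timeout directly in the search request body. In this sketch, the index name my-index and the 60-second value are placeholders; choose values that fit your workload:

POST my-index/_search
{
  "timeout": "60s",
  "query": {
    "match_all": {}
  }
}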

Check the Apache Lucene merge thread pool

OpenSearch Service uses Apache Lucene for indexing and searching documents on your cluster. Apache Lucene runs merge operations to reduce the effective number of segments needed for each shard and to remove any deleted documents. This process is run whenever new segments are created in a shard.

If you observe that an Apache Lucene merge thread operation is impacting CPU usage, then increase the refresh_interval setting on your OpenSearch Service cluster's indices. A higher refresh_interval slows down segment creation on your cluster.
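
For example, you can raise the refresh_interval for a write-heavy index with the update index settings API. The index name and the 30-second value below are illustrative; choose a value that matches how fresh your search results need to be:

PUT my-index/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}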

Note: A cluster that's migrating indices to UltraWarm storage can increase your CPU utilization. An UltraWarm migration usually involves a force merge API operation, which can be CPU-intensive.

To check for any UltraWarm migrations, use the following command:

GET _ultrawarm/migration/_status?v