How do I troubleshoot a circuit breaker exception in Amazon OpenSearch Service?

Last updated: 2021-09-01

I'm trying to send data to my Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) cluster. However, I receive a circuit breaking exception error that states that my data is too large. Why is this happening and how do I resolve this?

Short description

When a request reaches OpenSearch Service nodes, circuit breakers estimate the amount of memory needed to load the required data. OpenSearch Service then compares the estimated size with the configured heap size limit. If the estimated size of your data is greater than the available heap size, the request is terminated and a CircuitBreakerException is thrown to prevent overloading the node.

OpenSearch Service uses the following circuit breakers to prevent JVM OutOfMemoryError exceptions:

  • Request
  • Fielddata
  • In flight requests
  • Accounting
  • Parent

Note: It's important to know which of these five circuit breakers raised the exception because each circuit breaker has its own tuning needs. For more information about circuit breaker types, see Circuit breaker settings on the Elasticsearch website.

To obtain the current memory usage per node and per breaker, use the following command:

GET _nodes/stats/breaker
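
The response to this command can be long on clusters with many nodes. The following Python sketch shows one way to rank breakers by how close they are to their limits. It assumes the response has already been parsed into a dictionary and that each breaker entry carries the limit_size_in_bytes, estimated_size_in_bytes, and tripped fields used by the Elasticsearch nodes stats format. The sample data and the summarize_breakers name are hypothetical.

```python
def summarize_breakers(stats):
    """Return (node_name, breaker, used_fraction, tripped) rows from a parsed
    `GET _nodes/stats/breaker` response, highest usage first."""
    rows = []
    for node_id, node in stats.get("nodes", {}).items():
        name = node.get("name", node_id)
        for breaker, info in node.get("breakers", {}).items():
            # Field names are assumed from the nodes stats response format.
            limit = info.get("limit_size_in_bytes", 0)
            used = info.get("estimated_size_in_bytes", 0)
            fraction = used / limit if limit > 0 else 0.0
            rows.append((name, breaker, fraction, info.get("tripped", 0)))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Hypothetical, heavily trimmed response used only for illustration.
sample = {
    "nodes": {
        "node-a": {
            "name": "data-1",
            "breakers": {
                "parent": {
                    "limit_size_in_bytes": 1000,
                    "estimated_size_in_bytes": 900,
                    "tripped": 2,
                },
                "fielddata": {
                    "limit_size_in_bytes": 400,
                    "estimated_size_in_bytes": 100,
                    "tripped": 0,
                },
            },
        }
    }
}

rows = summarize_breakers(sample)
for name, breaker, fraction, tripped in rows:
    print(f"{name} {breaker}: {fraction:.0%} of limit, tripped {tripped} times")
```

A breaker that sits near its limit or has a nonzero tripped count is the one to investigate first.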

Also, note that circuit breakers are only a best-effort mechanism. While circuit breakers provide some resiliency against overloading a node, you might still receive an OutOfMemoryError. Circuit breakers can track memory only if it is explicitly reserved, so estimating the exact memory usage upfront isn't always possible. For example, if you have a small heap, the relative overhead of untracked memory is larger. For more information about circuit breakers and node resiliency, see Improving node resiliency with the real memory circuit breaker on the Elasticsearch website.

To avoid overloading your data nodes, follow the tips provided in the Troubleshooting high JVM memory pressure section.

Resolution

Circuit breaker exception

If you're using Elasticsearch version 7.x and higher with 16 GB of heap, you receive the following error when the circuit breaker limit is reached:

"error": {
    "root_cause": [
        {
            "type": "circuit_breaking_exception",
            "reason": "[parent] Data too large, data for [<http_request>] would be [16355096754/15.2gb], which is larger than the limit of [16213167308/15gb], real usage: [15283269136/14.2gb], new bytes reserved: [1071827618/1022.1mb]"
        }
    ]
}

This example output indicates that the data to be processed is too large for the parent circuit breaker to handle. The parent circuit breaker (a circuit breaker type) is responsible for the overall memory usage of your cluster. When a parent circuit breaker exception occurs, the total memory used across all circuit breakers has exceeded the set limit. A parent breaker throws an exception when the cluster exceeds 95% of 16 GB, which is 15.2 GB of heap.

You can verify this logic with the values from the example output. Subtract "real usage: [15283269136/14.2gb]" from "limit of [16213167308/15gb]". The result (929898172 bytes) shows that the node had only about 0.87 GB of heap available below the breaker limit when the request came in. However, the request needed about 1 GB of memory ("new bytes reserved: [1071827618/1022.1mb]"), which would have raised heap usage to 15.2 GB, above the 15 GB limit. As a result, the circuit breaker trips.
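
The same arithmetic can be checked with a few lines of Python. This is only a sketch that reuses the byte values from the example error message; the variable names are illustrative.

```python
MIB = 1024 ** 2

# Byte values copied from the example error message.
limit = 16213167308
real_usage = 15283269136
new_bytes_reserved = 1071827618

# Heap usage if the request were admitted ("would be" in the message).
would_be = real_usage + new_bytes_reserved

# Room left under the breaker limit before the request arrived.
headroom = limit - real_usage

# How far the request would push heap usage past the limit.
shortfall = would_be - limit

print(f"would be:  {would_be} bytes")
print(f"headroom:  {headroom} bytes ({headroom / MIB:.1f} MiB)")
print(f"shortfall: {shortfall} bytes ({shortfall / MIB:.1f} MiB)")
print("breaker trips" if would_be > limit else "request admitted")
```

The computed would_be value matches the "would be" figure in the message, which confirms that it is the sum of real usage and new bytes reserved.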

The circuit breaker exception message can be interpreted like this:

  • data for [<http_request>]: A client sends an HTTP request to a node in your cluster. OpenSearch Service either processes the request locally or passes it on to another node for additional processing.
  • would be [#]: Estimated heap usage if the request is processed.
  • limit of [#]: Current circuit breaker limit.
  • real usage: Actual usage of the JVM heap.
  • new bytes reserved: Memory that must be reserved to process the request.
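
If you need to pull these fields out of many log lines, a regular expression can do it. The following Python sketch assumes the message follows the exact wording of the example above; other versions or breaker types might format the message differently, so treat the pattern and the parse_reason name as illustrative.

```python
import re

# Pattern modeled on the example message only; not an official format.
PATTERN = re.compile(
    r"\[(?P<breaker>[^\]]+)\] Data too large, data for \[(?P<label>[^\]]*)\] "
    r"would be \[(?P<would_be>\d+)/[^\]]*\], "
    r"which is larger than the limit of \[(?P<limit>\d+)/[^\]]*\]"
    r"(?:, real usage: \[(?P<real_usage>\d+)/[^\]]*\]"
    r", new bytes reserved: \[(?P<new_bytes>\d+)/[^\]]*\])?"
)


def parse_reason(reason):
    """Return the breaker name, label, and byte counts, or None if no match."""
    match = PATTERN.search(reason)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("would_be", "limit", "real_usage", "new_bytes"):
        if fields[key] is not None:
            fields[key] = int(fields[key])
    return fields


reason = (
    "[parent] Data too large, data for [<http_request>] would be "
    "[16355096754/15.2gb], which is larger than the limit of "
    "[16213167308/15gb], real usage: [15283269136/14.2gb], "
    "new bytes reserved: [1071827618/1022.1mb]"
)
parsed = parse_reason(reason)
print(parsed)
```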

JVM memory pressure

A circuit breaking exception is often caused by high JVM memory pressure. JVM memory pressure refers to the percentage of Java heap that is used for all data nodes in your cluster. Check the JVMMemoryPressure metric in Amazon CloudWatch to determine current usage.

Note: The JVM heap size of a data node is set to half the size of the node's physical memory (RAM), up to a maximum of 32 GB. For example, if the physical memory (RAM) is 128 GB per node, the heap size is still 32 GB.
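
As a rough illustration of this rule, the following Python sketch computes the heap size for a given amount of RAM and the parent breaker limit described earlier (95% of heap). The function names are hypothetical, and the 95% figure is taken from the example in this article rather than read from a cluster.

```python
MAX_HEAP_GB = 32          # maximum heap size stated in the note above
PARENT_LIMIT_RATIO = 0.95  # parent breaker threshold from the earlier example


def heap_size_gb(ram_gb):
    """Heap is half of physical memory, capped at 32 GB."""
    return min(ram_gb / 2, MAX_HEAP_GB)


def parent_breaker_limit_gb(ram_gb):
    """Approximate parent breaker limit for a node with the given RAM."""
    return PARENT_LIMIT_RATIO * heap_size_gb(ram_gb)


for ram in (32, 64, 128):
    print(f"{ram} GB RAM: heap {heap_size_gb(ram)} GB, "
          f"parent limit {parent_breaker_limit_gb(ram):.1f} GB")
```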

High JVM memory pressure can be caused by the following:

  • Increase in the number of requests to the cluster. Check the IndexRate and SearchRate metrics in Amazon CloudWatch to determine your current load.
  • Aggregations, wildcards, and wide time ranges in your queries.
  • Unbalanced shard allocation across nodes or too many shards in a cluster.
  • Index mapping explosions.
  • Using the fielddata data structure to query data. Fielddata can consume a large amount of heap space, and remains in the heap for the lifetime of a segment. As a result, JVM memory pressure remains high on the cluster when fielddata is used. For more information, see Fielddata on the Elasticsearch website.

Troubleshooting high JVM memory pressure

To resolve high JVM memory pressure, try the following tips:

  • Reduce incoming traffic to your cluster, especially if you have a heavy workload.
  • Consider scaling the cluster to obtain more JVM memory to support your workload.
  • If cluster scaling isn't possible, try reducing the number of shards by deleting old or unused indices. Because shard metadata is stored in memory, reducing the number of shards can reduce overall memory usage.
  • Enable slow logs to identify faulty requests.
    Note: Before enabling configuration changes, verify that JVM memory pressure is below 85%. This way, you can avoid additional overhead to existing resources.
  • Optimize search and indexing requests, and choose the correct number of shards. For more information about indexing and shard count, see Get started with Amazon OpenSearch Service: How many shards do I need?
  • Disable and avoid using fielddata. By default, fielddata is set to "false" on a text field unless it's explicitly defined as otherwise in index mappings.
  • Change your index mapping type to keyword, using the reindex API or the create or update index template API. You can use the keyword type as an alternative for performing aggregations and sorting on text fields.
  • Avoid aggregating on text fields to prevent increases in fielddata. When you use more fielddata, more heap space is consumed. Use the cluster stats API operation to check your fielddata usage.
  • Clear the fielddata cache with the following API call:
    POST /index_name/_cache/clear?fielddata=true (index-level cache)
    POST */_cache/clear?fielddata=true (cluster-level cache)

Warning: If you clear the fielddata cache, any in-progress queries might be disrupted.
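
To illustrate the keyword mapping tip from the list above, the following Python sketch builds a request body that maps fields as keyword. It assumes the typeless mapping format used by Elasticsearch 7.x. The field names and the keyword_mapping helper are hypothetical; you would send the resulting JSON when creating the destination index and then copy the data with the reindex API.

```python
import json


def keyword_mapping(field_names):
    """Build a create-index body that maps each given field as keyword."""
    return {
        "mappings": {
            "properties": {name: {"type": "keyword"} for name in field_names}
        }
    }


# Hypothetical field names used only for illustration.
body = keyword_mapping(["status", "customer_id"])
print(json.dumps(body, indent=2))
```

Because keyword fields use on-disk doc values for aggregations and sorting, they avoid loading fielddata into the heap.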

For more information, see How do I troubleshoot high JVM memory pressure on my Amazon OpenSearch Service cluster?