Why is my Amazon OpenSearch Service cluster in a red or yellow status?

My Amazon OpenSearch Service cluster is in a red or yellow cluster status.

Short description

The Monitoring tab in your OpenSearch Service console indicates the status of the least healthy index in your cluster. A red cluster status doesn't mean that your cluster is down. Rather, it indicates that at least one primary shard and its replicas aren't allocated to a node. A yellow cluster status means that the primary shards for all indices are allocated to nodes in your cluster, but one or more replica shards aren't allocated to any node.
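
You can also check the cluster status yourself with the cluster health API. For example, using the same domain-endpoint placeholder as the commands later in this article:

$ curl -XGET 'domain-endpoint/_cluster/health?pretty'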

Note: Don't reconfigure your domain until you first resolve the red cluster status. If you try to reconfigure your domain when it's in a red cluster status, then it could get stuck in a "Processing" state. For more information about clusters stuck in a "Processing" state, see Why is my OpenSearch Service domain stuck in the "Processing" state?

For the reasons why your cluster can enter a red status, see Red cluster status.

Note: In some cases, you might be able to delete and then restore the index from an automated snapshot to resolve your red cluster status.

For the reasons why your cluster can enter a yellow status, see Yellow cluster status.

Note: If your yellow cluster status doesn't resolve on its own, then identify and troubleshoot the root cause. To resolve the status, you can update the index settings or manually reroute the unassigned shards. To prevent a yellow cluster status, apply the Cluster health best practices.

Resolution

Identify the cause for your unassigned shards

You can use either an AWS Systems Manager Automation runbook or manual curl commands to identify the cause for your unassigned shards.

Use the automation runbook

Navigate to the AWSSupport-TroubleshootOpenSearchRedYellowCluster runbook in the AWS Systems Manager console. Then, follow the instructions in the runbook to configure and run the automation.
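
If you prefer the AWS CLI, you can also start the runbook with a command similar to the following sketch. The runbook's required input parameters, such as your domain's name, aren't shown here; check the runbook's input parameters in the Systems Manager console:

$ aws ssm start-automation-execution \
    --document-name "AWSSupport-TroubleshootOpenSearchRedYellowCluster"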

Use curl commands

To identify the unassigned shards, perform the following steps:

  1. List the unassigned shards:

    $ curl -XGET 'domain-endpoint/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
  2. Retrieve the details for why the shard is unassigned:

    $ curl -XGET 'domain-endpoint/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d'{
         "index": "<index name>",
         "shard": <shardId>,
         "primary": <true or false>
    }'
  3. (Optional) For a red cluster status, delete the affected indices, and then identify and address the root cause:

    $ curl -XDELETE 'domain-endpoint/<index names>'
  4. Then, identify the available snapshot repositories so that you can restore your indices from a snapshot:

    $ curl -XGET 'domain-endpoint/_snapshot?pretty'
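
The _snapshot endpoint lists your snapshot repositories. For OpenSearch Service domains, automated snapshots are typically stored in a repository named cs-automated or cs-automated-enc. The following sketch shows how you might list the snapshots in a repository and then restore an index from one of them; the repository, snapshot, and index names are placeholders:

$ curl -XGET 'domain-endpoint/_snapshot/<repository>/_all?pretty'
$ curl -XPOST 'domain-endpoint/_snapshot/<repository>/<snapshot>/_restore' -H 'Content-Type: application/json' -d'{
     "indices": "<index name>"
}'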

For yellow cluster status, address the root cause so that your shards are assigned.

Troubleshoot your red or yellow cluster status

Not enough nodes to allocate to the shards

A replica shard is never assigned to the same node as its primary shard. A single-node cluster with replica shards always initializes with a yellow cluster status because there is no other node to which OpenSearch Service can assign a replica.

There is also a default limit of 1,000 for the cluster.max_shards_per_node setting in OpenSearch Service versions 7.x and later. It's a best practice to keep the cluster.max_shards_per_node setting at this default value. If you set shard allocation filters to control how OpenSearch Service allocates shards, then a shard can become unassigned because there aren't enough filtered nodes. To prevent this node shortage, increase your node count, and make sure that the number of replicas for every primary shard is less than the number of data nodes. You can also reduce the number of replica shards. For more information, see Sizing OpenSearch Service domains and Demystifying OpenSearch Service shard allocation.
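
To verify the replica configuration, you can list each index's primary and replica shard counts with the _cat/indices API. For example:

$ curl -XGET 'domain-endpoint/_cat/indices?v&h=index,health,pri,rep'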

Low disk space or disk skew

If there isn't enough disk space, then your cluster can enter a red or yellow health status. There must be enough disk space to accommodate shards before OpenSearch Service distributes the shards.

To check how much storage space is available for each node in your cluster, use the following syntax:

$ curl -XGET 'domain-endpoint/_cat/allocation?v'

For more information about storage space issues, see How do I troubleshoot low storage space in my OpenSearch Service domain?

Heavy disk skew can also lead to low storage space on some data nodes. If you decide to reallocate any shards, then the shards can become unassigned while they're redistributed. To resolve this issue, see How do I rebalance the uneven shard distribution in my OpenSearch Service cluster?

The disk-based shard allocation settings can also lead to unassigned shards. For example, if the cluster.routing.allocation.disk.watermark.low setting is set to 50 GB, then that amount of disk space must be available on a node before OpenSearch Service allocates shards to it.

To check the current disk-based shard allocation settings, use the following syntax:

$ curl -XGET 'domain-endpoint/_cluster/settings?include_defaults=true&flat_settings=true'

To resolve your disk space issues, consider the following approaches:

  • Delete any unwanted indices for yellow and red clusters.
  • Delete red indices for red clusters. To identify them first, see the example after this list.
  • Scale up the Amazon EBS volume.
  • Add more data nodes.
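
For example, to list only the red indices before you delete them, you can filter the _cat/indices output by health status:

$ curl -XGET 'domain-endpoint/_cat/indices?v&health=red'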

Note: Don't make any configuration changes to your cluster when it's in a red health status. If you try to reconfigure your domain when it's in a red cluster status, then it could get stuck in a "Processing" state.

High JVM memory pressure

Every shard allocation uses CPU, heap space, and disk and network resources. Consistently high JVM memory pressure can cause shard allocation to fail. For example, if JVM memory pressure exceeds 95%, then the parent circuit breaker is triggered. The allocation thread is then cancelled, which leaves shards unassigned.

To resolve this issue, reduce the JVM memory pressure level first. After your JVM memory pressure has been reduced, consider these additional tips to bring your cluster back to a green health status:

  • Increase the shard allocation retry value from the default of "5" to a higher value.
  • Deactivate and activate the replica shard.
  • Manually retry the unassigned shards, as shown in the reroute example later in this section.

Example API call to increase the retry value:

PUT <index-name>/_settings
{
  "index.allocation.max_retries" : <value>
}
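
After you address the root cause, you can manually retry allocation of the failed shards with the cluster reroute API and the retry_failed flag. For example:

$ curl -XPOST 'domain-endpoint/_cluster/reroute?retry_failed=true'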

For more information about how to reduce JVM memory pressure, see How do I troubleshoot high JVM memory pressure on my OpenSearch Service cluster?

Node failure

When your cluster experiences a node failure, the shards that were allocated to that node become unassigned. If an index has no replica shards, then even a single node failure can cause a red health status. Use two replica shards and a Multi-AZ deployment to protect your cluster against data loss from hardware failures.

If all your indices have a replica shard, then a single node failure can cause your cluster to temporarily enter a yellow health status. In this case, OpenSearch Service recovers automatically as soon as the node is healthy again or the shards are assigned to a new node.

Check your Amazon CloudWatch metrics to confirm node failures. For more information about how to identify a node failure, see Failed cluster nodes.
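
You can also list the nodes that the cluster currently recognizes and compare the count to the number of nodes that you configured. For example:

$ curl -XGET 'domain-endpoint/_cat/nodes?v'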

Note: It's also a best practice to assign one replica shard for each index or to use dedicated primary nodes and activate zone awareness. For more information, see Coping with failure on the Elasticsearch website.

Exceeded the maximum number of retries

In OpenSearch Service, shard allocation must not exceed the maximum time limit (5,000 ms) or the maximum number of retries (5). If your cluster reaches either threshold, then you must manually launch a shard allocation. To manually launch a shard allocation, deactivate and then reactivate the replica shards for your indices.

A configuration change on your cluster can also launch shard allocation. For more information about shard allocation, see Every shard deserves a home on the Elasticsearch website.

Note: It's not a best practice to manually launch shard allocation if your cluster has a heavy workload. If you remove all your replicas from an index, then the index must rely only on primary shards. When a node fails, your cluster then enters a red health status because the primary shards are left unassigned.

To deactivate a replica shard, update the number_of_replicas value to "0":

$ curl -XPUT 'domain-endpoint/<indexname>/_settings' -H 'Content-Type: application/json' -d'{
     "index" : {
          "number_of_replicas" : 0
     }
}'

Also, check to make sure the index.auto_expand_replicas setting is set to "false". When your cluster returns to a green status, you can set the index.number_of_replicas value back to the desired value to launch allocation for replica shards. If the shard allocation is successful, then your cluster will enter a green health status.
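
For example, the following call reactivates one replica shard for each primary shard in the index:

$ curl -XPUT 'domain-endpoint/<indexname>/_settings' -H 'Content-Type: application/json' -d'{
     "index" : {
          "number_of_replicas" : 1
     }
}'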

Cluster health best practices

To resolve your yellow or red cluster status, consider the following best practices:

  • Set a recommended Amazon CloudWatch alarm for AutomatedSnapshotFailure. With the alarm, you can make sure that a snapshot is available to restore your indices when your cluster enters a red status. A sample AWS CLI command follows this list.
  • If your cluster is under a sustained heavy workload, then scale your cluster. For more information, see How can I scale up an OpenSearch Service domain?
  • Monitor your disk usage, JVM memory pressure, and CPU usage and make sure they don't exceed set thresholds. For more information, see Recommended CloudWatch alarms and Cluster metrics.
  • Make sure all primary shards have replica shards to protect against node failures.
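
For example, the following AWS CLI sketch creates the recommended AutomatedSnapshotFailure alarm. The alarm name, domain name, account ID, and Amazon SNS topic ARN are placeholders:

$ aws cloudwatch put-metric-alarm \
    --alarm-name <alarm-name> \
    --namespace AWS/ES \
    --metric-name AutomatedSnapshotFailure \
    --dimensions Name=DomainName,Value=<domain-name> Name=ClientId,Value=<account-id> \
    --statistic Maximum \
    --period 60 \
    --evaluation-periods 1 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions <sns-topic-arn>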

For more information, see Operational best practices for Amazon OpenSearch Service.

2 Comments

Exceeded the maximum number of retries

When we reach this state, decreasing the replica count to zero can impact a running system if the data size is large.

So instead, after solving the error that prevented the shard from being allocated, we should increase the max retries count:

PUT /<yellow-index-name>/_settings
{
     "index.allocation.max_retries": 10
}
replied a year ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS MODERATOR
replied a year ago