Why is my Amazon OpenSearch Service cluster in red or yellow status?

Last updated: 2021-07-30

My Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) cluster is in red or yellow cluster status. Why is this happening?

Short description

The Monitoring tab in your OpenSearch Service console indicates the status of the least healthy index in your cluster. A red cluster status doesn't mean that your cluster is down. Rather, it indicates that at least one primary shard and its replicas aren't allocated to a node. A yellow cluster status means that the primary shards for all indices are allocated to nodes in your cluster, but the replica shards for at least one index aren't allocated to any node.

Note: Don't reconfigure your domain until you first resolve the red cluster status. If you try to reconfigure your domain when it is in red cluster status, it could get stuck in a "Processing" state. For more information about clusters stuck in a "Processing" state, see Why is my Amazon OpenSearch Service domain stuck in the "Processing" state?

Your cluster can enter red status for the following reasons:

  • Multiple data node failures
  • Using a corrupt or red shard for an index
  • High JVM memory pressure or CPU utilization
  • Low disk space or disk skew

Note: In some cases, you might be able to resolve your red cluster status by deleting and then restoring the index from an automated snapshot.

Your cluster can enter yellow health status for the following reasons:

  • Creation of a new index
  • Not enough nodes to allocate to the shards or disk skew
  • High JVM memory pressure
  • Single node failure
  • Exceeded the maximum number of shard allocation retries

Note: If your yellow cluster status doesn't resolve itself, identify and troubleshoot the root cause. You can then resolve the status by updating the index settings or by manually rerouting the unassigned shards. To prevent yellow cluster status, apply the Cluster health best practices.

Resolution

Identifying the reason for your unassigned shards

To identify the unassigned shards, perform the following steps:

1.    List the unassigned shards:

$ curl -XGET 'domain-endpoint/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED

2.    Retrieve the details for why the shard is unassigned:

$ curl -XGET 'domain-endpoint/_cluster/allocation/explain?pretty' -H 'Content-Type:application/json' -d'{"index": "<index name>", "shard": <shardId>, "primary":<true or false>}'

3.    (Optional) For red cluster status, delete the indices of concern and identify and address the root cause:

$ curl -XDELETE 'domain-endpoint/<index names>'

Then, list your snapshot repositories so that you can identify the available snapshots and restore your indices from a snapshot:

$ curl -XGET 'domain-endpoint/_snapshot?pretty'

For yellow cluster status, address the root cause so that your shards are assigned.
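The steps above can be sketched as a few small shell helpers. This is a minimal sketch, not an official tool: the function names are hypothetical, ENDPOINT stands in for your domain endpoint, and cs-automated is assumed to be the repository that holds your automated snapshots (check the repository list from the previous command for the actual name).

```shell
# Hypothetical helpers; none of these names are part of OpenSearch Service.

# Step 1: count UNASSIGNED shards per reason.
# stdin: _cat/shards rows with columns index, shard, prirep, state, unassigned.reason
count_unassigned_by_reason() {
  awk '$4 == "UNASSIGNED" { count[$5]++ }
       END { for (reason in count) print count[reason], reason }' |
    sort -k1,1nr -k2,2
}

# Step 2: print one allocation-explain request body per UNASSIGNED shard.
explain_bodies() {
  awk '$4 == "UNASSIGNED" {
         primary = ($3 == "p") ? "true" : "false"
         printf "{\"index\": \"%s\", \"shard\": %s, \"primary\": %s}\n", $1, $2, primary
       }'
}

# Step 3: build the restore path and request body for a snapshot restore.
restore_path() {   # $1 = repository (assumed), $2 = snapshot name
  printf '/_snapshot/%s/%s/_restore\n' "$1" "$2"
}
restore_body() {   # $1 = comma-separated index names
  printf '{"indices": "%s"}\n' "$1"
}

# Example usage (left as comments so that nothing runs against your domain by accident):
# ENDPOINT='domain-endpoint'
# curl -s -XGET "$ENDPOINT/_cat/shards?h=index,shard,prirep,state,unassigned.reason" > shards.txt
# count_unassigned_by_reason < shards.txt
# explain_bodies < shards.txt | while IFS= read -r body; do
#   curl -s -XGET "$ENDPOINT/_cluster/allocation/explain?pretty" \
#     -H 'Content-Type: application/json' -d "$body"
# done
# curl -s -XGET "$ENDPOINT/_snapshot/cs-automated/_all?pretty"   # list snapshots
# curl -s -XPOST "$ENDPOINT$(restore_path cs-automated '<snapshot name>')" \
#   -H 'Content-Type: application/json' -d "$(restore_body '<index names>')"
```

Grouping the unassigned shards by reason first can show whether one root cause (for example, NODE_LEFT) accounts for most of them before you run the more detailed explain call on each shard.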

Troubleshooting your red or yellow cluster status

Not enough nodes to allocate to the shards

A replica shard will not be assigned to the same node as its primary shard. A single node cluster with replica shards always initializes with yellow cluster status. Single node clusters are initialized this way because there are no other available nodes to which OpenSearch Service can assign a replica.

There is also a default limit of "1000" for the cluster.max_shards_per_node setting for OpenSearch Service versions 7.x and later. It's a best practice to keep the cluster.max_shards_per_node setting at the default value of "1000". If you set shard allocation filters to control how OpenSearch Service allocates shards, then a shard can become unassigned when too few nodes match the filter.

To prevent this node shortage, increase your node count. Make sure that the number of replicas for every primary shard is less than the number of data nodes. You can also reduce the number of replica shards. For more information, see Sizing OpenSearch Service domains and Demystifying OpenSearch Service shard allocation.
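The two sizing rules in this section come down to simple arithmetic. The following is a minimal sketch with hypothetical function names. It assumes that the cluster-wide shard ceiling is the per-node limit multiplied by the number of data nodes.

```shell
# Hypothetical helpers for the sizing rules described above.

# Succeeds when every replica can be placed on a node other than its primary.
# $1 = number of data nodes, $2 = replicas per primary shard
replicas_fit() {
  [ "$2" -lt "$1" ]
}

# Prints the assumed cluster-wide shard ceiling.
# $1 = number of data nodes, $2 = cluster.max_shards_per_node (defaults to 1000)
max_cluster_shards() {
  echo $(( $1 * ${2:-1000} ))
}

# Example usage:
# replicas_fit 3 2 && echo "replicas can be allocated"
# replicas_fit 1 1 || echo "single node: replicas stay unassigned (yellow status)"
# max_cluster_shards 3
```

The second usage line reproduces the single node case: with one data node and one replica, the replica has no other node to go to.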

Low disk space or disk skew

If there isn't enough disk space, your cluster can enter red or yellow health status. There must be enough disk space to accommodate shards before OpenSearch Service distributes the shards.

To check how much storage space is available for each node in your cluster, use the following syntax:

$ curl -XGET 'domain-endpoint/_cat/allocation?v'

For more information about storage space issues, see How do I troubleshoot low storage space in my Amazon OpenSearch Service domain?
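To spot the nodes that are running short on space, you can ask _cat/allocation for only the node and disk.percent columns and filter the rows. The following is a minimal sketch: the function name is hypothetical, ENDPOINT stands in for your domain endpoint, and the default threshold of 85 is only an illustrative value.

```shell
# Hypothetical helper: print nodes whose disk usage is at or above a threshold.
# stdin: _cat/allocation rows with columns node, disk.percent
# $1 = threshold in percent (defaults to 85, an illustrative value)
nodes_over_disk_threshold() {
  # Rows without a numeric percentage (such as the UNASSIGNED row) are skipped.
  awk -v limit="${1:-85}" \
    '$2 ~ /^[0-9]+$/ && $2 + 0 >= limit + 0 { print $1, $2 "%" }'
}

# Example usage:
# curl -s -XGET "$ENDPOINT/_cat/allocation?h=node,disk.percent" | nodes_over_disk_threshold 80
```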

Heavy disk skew can also lead to low storage space issues for some data nodes. If you decide to re-allocate any shards, the shards can become unassigned during the shard distribution. To resolve this issue, see How do I rebalance the uneven shard distribution in my Amazon OpenSearch Service cluster?

The disk-based shard allocation settings can also lead to unassigned shards. For example, if the cluster.routing.allocation.disk.watermark.low setting is set to 50 GB, then a node must have at least that amount of free disk space before shards can be allocated to it.

To check the current disk-based shard allocation settings, use the following syntax:

$ curl -XGET 'domain-endpoint/_cluster/settings?include_defaults=true&flat_settings=true'
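The settings response is long, so it can help to keep only the disk watermark lines. The following is a minimal sketch: the function name is hypothetical, ENDPOINT stands in for your domain endpoint, and the filter assumes that you add the pretty parameter so that each setting is printed on its own line.

```shell
# Hypothetical helper: keep only the disk watermark settings.
# stdin: pretty-printed, flat cluster settings (one "key" : "value" pair per line)
disk_watermarks() {
  grep -F '"cluster.routing.allocation.disk.watermark' |
    sed -e 's/^[[:space:]]*//' -e 's/,$//'
}

# Example usage (the URL is quoted so that the shell does not interpret "&"):
# curl -s -XGET "$ENDPOINT/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | disk_watermarks
```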

To resolve your disk space issues, consider the following approaches:

  • Delete any unwanted indices.
  • Scale up the EBS volume.
  • Add more data nodes.

High JVM memory pressure

Every shard allocation uses CPU, heap space, and disk and network resources. Consistently high levels of JVM memory pressure can cause a shard allocation to fail. For example, if JVM memory pressure exceeds 95%, then the parent circuit breaker is triggered. The allocation thread is then canceled, which leaves the shards unassigned.

To resolve this issue, reduce the JVM memory pressure level first. After your JVM memory pressure has been reduced, consider these additional tips to bring your cluster back to green health status:

  • Increase the shard retry value from the default of "5" to a higher value.
  • Disable and then re-enable the replica shards.
  • Manually retry the unassigned shards.
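The first and third tips can be sketched as follows. This is a sketch under assumptions, not a confirmed OpenSearch Service procedure: it uses the Elasticsearch index setting index.allocation.max_retries and the retry_failed parameter of the _cluster/reroute API, and it assumes that your domain version permits both calls. The function name is hypothetical, and ENDPOINT stands in for your domain endpoint.

```shell
# Hypothetical helper: JSON body that sets the per-index allocation retry limit.
# $1 = new retry limit, for example 10
max_retries_body() {
  printf '{"index.allocation.max_retries": %s}\n' "$1"
}

# Example usage:
# Raise the retry limit for one index (assumes that the setting is accepted by your domain):
# curl -XPUT "$ENDPOINT/<indexname>/_settings" \
#   -H 'Content-Type: application/json' -d "$(max_retries_body 10)"
#
# Ask the cluster to retry shards whose allocation failed
# (assumes that your domain permits the _cluster/reroute API):
# curl -XPOST "$ENDPOINT/_cluster/reroute?retry_failed=true"
```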

For more information about reducing your JVM memory pressure, see How do I troubleshoot high JVM memory pressure on my Amazon OpenSearch Service cluster?

Node failure

When your cluster experiences a node failure, the shards that are allocated to that node become unassigned. When there are no replica shards available for a given index, even a single node failure can cause red health status. Having two replica shards and a Multi-AZ deployment protects your cluster against data loss during a hardware failure.

If all your indices have a replica shard, a single node failure can cause your cluster to temporarily enter yellow health status. In that case, OpenSearch Service recovers automatically as soon as the node is healthy again or the shards are assigned to a new node.

You can confirm any node failures by checking your Amazon CloudWatch metrics. For more information about identifying a node failure, see Failed cluster nodes.

Note: It's also a best practice to assign one replica shard for each index or to use dedicated master nodes and enable zone awareness. For more information, see Coping with failure on the Elasticsearch website.

Exceeded the maximum number of retries

In OpenSearch Service, your cluster must not exceed the maximum time limit (5,000 ms) or the number of retries (5) for shard allocation. If your cluster has reached the maximum thresholds, you must manually trigger a shard allocation. To manually trigger a shard allocation, disable and re-enable the replica shards for your indices.

A configuration change on your cluster can also trigger shard allocation. However, avoid making any configuration changes to your cluster when it is in red health status. For more information about shard allocation, see Every shard deserves a home on the Elasticsearch website.

Note: It's not a best practice to manually trigger shard allocation if your cluster has a heavy workload. If you remove all your replicas from an index, the index must rely only on primary shards. When a node fails, your cluster then enters red health status because primary shards are left unassigned.

To disable a replica shard, update the number_of_replicas value to "0":

$ curl -XPUT 'domain-endpoint/<indexname>/_settings' -H 'Content-Type: application/json' -d'
{
  "index" : {
    "number_of_replicas" : 0
  }
}'

Also, check to make sure the index.auto_expand_replicas setting is set to "false". When your cluster returns to green health status, you can set the index.number_of_replicas value back to the desired value to trigger allocation for replica shards. If the shard allocation is successful, your cluster will enter green health status.
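The disable and re-enable cycle, together with the index.auto_expand_replicas check, can be sketched as follows. The function names are hypothetical, ENDPOINT stands in for your domain endpoint, and the replica count of 1 in the usage lines is only an example of a desired value.

```shell
# Hypothetical helper: JSON body that sets the replica count for an index.
# $1 = number of replicas (0 to disable, your desired value to re-enable)
replicas_body() {
  printf '{"index": {"number_of_replicas": %s}}\n' "$1"
}

# Hypothetical helper: keep only the index.auto_expand_replicas line.
# stdin: pretty-printed, flat index settings (one "key" : "value" pair per line)
auto_expand_setting() {
  grep -F '"index.auto_expand_replicas"' |
    sed -e 's/^[[:space:]]*//' -e 's/,$//'
}

# Example usage:
# curl -s -XGET "$ENDPOINT/<indexname>/_settings?include_defaults=true&flat_settings=true&pretty" | auto_expand_setting
# curl -XPUT "$ENDPOINT/<indexname>/_settings" \
#   -H 'Content-Type: application/json' -d "$(replicas_body 0)"
# After the cluster returns to green health status:
# curl -XPUT "$ENDPOINT/<indexname>/_settings" \
#   -H 'Content-Type: application/json' -d "$(replicas_body 1)"
```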

Cluster health best practices

To resolve your yellow or red cluster status, consider the following best practices:

For more information about OpenSearch Service best practices, see Amazon OpenSearch Service best practices.

