Why is my Amazon OpenSearch Service domain stuck in the "Processing" state?

Last updated: 2021-08-05

My Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) cluster is stuck in the "Processing" state. Why is this happening, and how can I prevent it?

Short description

Your OpenSearch Service cluster enters the "Processing" state when it's in the middle of a configuration change. The cluster can get stuck in the "Processing" state if either of the following situations occurs:

  • A new set of data nodes fails to launch.
  • Shard migration to the new set of data nodes is unsuccessful.

If you initiate a configuration change, the domain state changes to "Processing" while OpenSearch Service creates a new environment. In the new environment, OpenSearch Service launches a new set of applicable nodes (such as data, master, or UltraWarm). After the migration completes, the older nodes are terminated.
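
To check whether your domain is still in the middle of a change, you can describe the domain with the AWS CLI and review the Processing flag. This is a minimal sketch; my-domain is a placeholder for your domain name.

aws opensearch describe-domain --domain-name my-domain --query 'DomainStatus.Processing'

A return value of true means that a configuration change is still in progress.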

Resolution

A new set of data nodes failed to launch

If you make additional configuration changes to your cluster before the first change completes, then your cluster can get stuck. Check for any ongoing blue/green deployments by reviewing the total number of nodes in Amazon CloudWatch. If you observe a higher node count than expected, then a blue/green deployment is likely in progress.
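
For example, you can list the nodes that are currently part of the cluster and compare the count to your configured node count. This is a sketch; ENDPOINT is a placeholder for your domain endpoint.

curl -X GET "ENDPOINT/_cat/nodes?v"

During a blue/green deployment, this output can temporarily include both the old and the new sets of nodes.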

Use the following API calls to retrieve more information about additional nodes and the shard migration process:

GET /_cluster/health?pretty
GET /_cat/recovery?pretty

If you're using an Amazon Virtual Private Cloud (Amazon VPC) domain, make sure that you have enough free IP addresses in your subnet. If your subnet doesn't have enough free IP addresses, then the launch of new nodes fails. As a result, your cluster gets stuck in the "Processing" state. For more information, see Reserving IP addresses in a VPC subnet.
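
To check how many free IP addresses a subnet has, you can describe the subnet with the AWS CLI. This is a sketch; subnet-0abc123 is a placeholder for the subnet ID that your domain uses.

aws ec2 describe-subnets --subnet-ids subnet-0abc123 --query 'Subnets[0].AvailableIpAddressCount'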

If your OpenSearch Service domain uses encryption at rest, make sure that the AWS KMS key that it uses still exists in your AWS account before making a configuration change. If you accidentally deleted the AWS KMS key, then the cluster can get stuck in the "Processing" state.
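
You can confirm that the key still exists and is enabled with the AWS KMS API. This is a sketch; replace the key ID with the key that your domain uses for encryption at rest.

aws kms describe-key --key-id 1234abcd-12ab-34cd-56ef-1234567890ab --query 'KeyMetadata.KeyState'

A key state other than Enabled (for example, PendingDeletion) can prevent the configuration change from completing.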

Your cluster can also get stuck for the following reasons:

  • An overloaded master node with too many pending tasks or high CPU and JVM memory pressure levels. Use the cat pending tasks API to check for any pending tasks (see the example after this list). You can also check the MasterCPUUtilization and MasterJVMMemoryPressure metrics in Amazon CloudWatch.
  • The prerequisites for Amazon Cognito authentication for OpenSearch Dashboards weren't met. If you configured Amazon Cognito for OpenSearch Dashboards authentication, then make sure that you met the authentication prerequisites. For example, OpenSearch Service must have the Amazon Cognito user pool, Amazon Cognito identity pool, and AWS Identity and Access Management (IAM) role set up with the correct permissions. The default name for this role is CognitoAccessForAmazonOpenSearch (with the AmazonESCognitoAccess policy attached).
    Note: If you created a custom IAM role, make sure that your role has the same permissions as CognitoAccessForAmazonOpenSearch.
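
The following commands sketch two of these checks: listing pending master tasks and verifying that the default Cognito access role exists. ENDPOINT is a placeholder for your domain endpoint, and the role name assumes that you use the default CognitoAccessForAmazonOpenSearch role.

curl -X GET "ENDPOINT/_cat/pending_tasks?v"
aws iam get-role --role-name CognitoAccessForAmazonOpenSearch

The first command lists cluster-level tasks that the master node hasn't processed yet. The second command confirms that the Cognito access role still exists in your account.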

Shard migration to the new set of data nodes is unsuccessful

A shard migration (from the old set to the new set of data nodes) might be unsuccessful for the following reasons:

  • Your OpenSearch Service cluster is currently in red health status. If your cluster is in red health status, then troubleshoot your red cluster status so that your cluster is in a healthy state.
    Note: It's a best practice to configure your cluster when it's in a healthy state.
  • Nodes are out of service because of a heavy processing load caused by high JVM memory pressure and CPU usage. To resolve this issue, reduce or temporarily stop network traffic to the cluster so that it can return to a healthy state. Otherwise, your blue/green deployment process might time out and require manual intervention.
  • Because of internal hardware failures, the shards on old data nodes can get stuck during a migration. (Note: Depending on the hardware issue, your cluster might not recover automatically.) If your cluster doesn't recover automatically, then OpenSearch Service runs self-healing scripts to return the nodes to a healthy state. The loss of a node's root volume can prevent OpenSearch Service from responding; an Auto Scaling group then automatically replaces the faulty node. If the Amazon EBS volume that is attached to a node goes down, then manual intervention is required to replace the volume. To help identify which shards are still operating from an older set of nodes, use the cat allocation, cat nodes, or cat shards API (see the examples after this list).
  • A stuck shard relocation caused by insufficient free storage in the new set of nodes. This issue occurs when there is new data coming into the cluster during a blue/green deployment process.
    Note: A blue/green deployment isn't triggered if OpenSearch Service detects less space than is required for a successful data migration.
  • A stuck shard relocation caused by shards that are pinned to an older set of nodes. To make sure that shards aren't pinned to any nodes before a configuration change is made, check the index settings. Or, check whether your cluster has a write block caused by high JVM memory pressure or low disk space.
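
The following calls sketch how to see where shards are currently allocated. ENDPOINT is a placeholder for your domain endpoint.

curl -X GET "ENDPOINT/_cat/allocation?v"
curl -X GET "ENDPOINT/_cat/nodes?v"
curl -X GET "ENDPOINT/_cat/shards?v"

Comparing the node names in these outputs shows whether any shards are still assigned to the older set of data nodes.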

To identify which index shards are stuck and the corresponding index settings, use the following commands:

curl -X GET "ENDPOINT/_cluster/allocation/explain?pretty"
curl -X GET "ENDPOINT/INDEX_NAME/_settings?pretty"

In your index settings, check whether either of the following settings appears:

{
    "index.routing.allocation.require._name": "NODE_NAME"
}

(or)

{
    "index.blocks.write": true
}

If you observe "index.routing.allocation.require._name": "NODE_NAME" in your index settings, then remove the setting like this:

curl -X PUT "ENDPOINT/INDEX_NAME/_settings?pretty" -H 'Content-Type: application/json' -d '
{
"index.routing.allocation.require._name": null
}'

For more information, see Index-level shard allocation filtering on the Elasticsearch website.

If you observe "index.blocks.write": true in your index settings, then your cluster has a write block. The write block is likely caused by high JVM memory pressure or low disk space. Make sure to address these issues before implementing any other troubleshooting tips. For more information about troubleshooting this exception, see ClusterBlockException.

Note: If your cluster is stuck in the "Processing" state for more than 24 hours, then your cluster needs manual intervention. Also, if you haven't made any configuration changes but the node count is higher than expected, then a software patch might be in progress.
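
You can check whether a service software update is in progress by describing the domain's service software options. This is a sketch; my-domain is a placeholder for your domain name.

aws opensearch describe-domain --domain-name my-domain --query 'DomainStatus.ServiceSoftwareOptions'

An UpdateStatus value such as IN_PROGRESS indicates that a software update, rather than one of your own configuration changes, is running.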

