Why is my Amazon Elasticsearch Service domain stuck in the "Processing" state?

Last updated: 2021-03-08

My Amazon Elasticsearch Service (Amazon ES) cluster is stuck in the "Processing" state. Why is this happening and how can I prevent this?

Short description

Your Amazon ES cluster enters the "Processing" state when it is in the middle of a configuration change. It can get stuck in the "Processing" state if either of the following situations occurs:

  • A new set of data nodes fails to launch.
  • Shard migration to the new set of data nodes is unsuccessful.

If you initiate a configuration change, the domain state changes to "Processing" while Amazon ES creates a new environment. In the new environment, Amazon ES launches a new set of applicable nodes (data, master, or UltraWarm). After the migration completes, the older nodes are terminated.

Resolution

A new set of data nodes failed to launch

If you make another configuration change before the first change completes, your cluster can get stuck in the "Processing" state. Check for any ongoing blue/green deployments in your Elasticsearch cluster by reviewing the total number of nodes in Amazon CloudWatch. If the node count is higher than expected, then a blue/green deployment is likely in progress.
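For example, you can retrieve the Nodes metric (in the AWS/ES namespace) with the AWS CLI. In the following sample call, DOMAIN_NAME, ACCOUNT_ID, and the time window are placeholders; replace them with your own values:

# DOMAIN_NAME and ACCOUNT_ID are placeholders for your domain name and AWS account ID.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ES \
  --metric-name Nodes \
  --dimensions Name=DomainName,Value=DOMAIN_NAME Name=ClientId,Value=ACCOUNT_ID \
  --start-time 2021-03-08T00:00:00Z \
  --end-time 2021-03-08T01:00:00Z \
  --period 300 \
  --statistics Maximum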

Use the following API calls to retrieve more information about the additional nodes and the shard migration process:

GET /_cluster/health?pretty
GET /_cat/recovery?pretty
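You can also run these checks from the command line. The following curl calls use the same ENDPOINT placeholder as the other examples in this article; replace it with your domain endpoint:

curl -X GET "ENDPOINT/_cluster/health?pretty"
curl -X GET "ENDPOINT/_cat/recovery?pretty"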

If your domain is in an Amazon Virtual Private Cloud (Amazon VPC), make sure that your subnet has enough free IP addresses. If the subnet doesn't have enough free IP addresses, the new nodes fail to launch, and your cluster gets stuck in the "Processing" state. For more information, see Reserving IP addresses in a VPC subnet.
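To see how many free IP addresses remain in a subnet, you can describe the subnet with the AWS CLI. SUBNET_ID is a placeholder for your subnet ID:

# SUBNET_ID is a placeholder for your subnet ID.
aws ec2 describe-subnets \
  --subnet-ids SUBNET_ID \
  --query "Subnets[].AvailableIpAddressCount"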

If you have an encrypted Amazon ES domain, make sure that your AWS Key Management Service (AWS KMS) key exists in your AWS account before making a configuration change. If the KMS key was accidentally deleted, then the cluster can get stuck in the "Processing" state.
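To confirm that the key still exists and is enabled, you can describe it with the AWS CLI. KEY_ID is a placeholder for your key ID or key ARN:

# KEY_ID is a placeholder for your KMS key ID or ARN.
aws kms describe-key \
  --key-id KEY_ID \
  --query "KeyMetadata.KeyState"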

Your Elasticsearch cluster can also get stuck for the following reasons:

  • An overloaded master node with too many pending tasks or high CPU and JVM memory pressure levels. Use the cat pending tasks API to check for any pending tasks (see the example after this list). You can also check the MasterCPUUtilization and MasterJVMMemoryPressure metrics in Amazon CloudWatch.
  • The authentication prerequisites for Amazon Cognito for Kibana haven't been met. If you've configured Amazon Cognito authentication for Kibana, make sure you've met the authentication prerequisites. For example, Amazon ES must have the Amazon Cognito user pool, Amazon Cognito identity pool, and AWS Identity and Access Management (IAM) role configured with the correct permissions. The default name for this role is CognitoAccessforAmazonES (with the AmazonESCognitoAccess policy attached).
    Note: If you created a custom IAM role, make sure your role has the same permissions as CognitoAccessforAmazonES.
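To check for pending tasks from the command line, you can call the cat pending tasks API against your domain endpoint (ENDPOINT is a placeholder):

curl -X GET "ENDPOINT/_cat/pending_tasks?v"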

Shard migration to the new set of data nodes is unsuccessful

A shard migration (from the old set to the new set of data nodes) might be unsuccessful for the following reasons:

  • Your Elasticsearch cluster is currently in red health status. If your cluster is in red health status, troubleshoot your red cluster status and bring it back to a healthy state before making a configuration change.
  • Nodes are out of service because of a heavy processing load caused by high JVM memory pressure and CPU usage. To resolve this issue, reduce your network traffic to the cluster or stop the network traffic entirely, bringing the cluster back to a healthy state. Otherwise, your blue/green deployment process might time out, requiring manual intervention.
  • Shards on the old data nodes can get stuck during migration because of an internal hardware failure. If you're experiencing an internal hardware failure, the nodes on the faulty hardware must be replaced. If a node loses its root volume, Amazon ES stops responding on that node, and an Auto Scaling group automatically replaces the node. If the Amazon Elastic Block Store (Amazon EBS) volume attached to a node fails, manual intervention is required: the EBS volume can't be replaced on its own, so the entire node must be replaced. To identify which shards are still operating from the older set of nodes, use the cat allocation, cat nodes, and cat shards APIs (see the examples after this list).
  • A stuck shard relocation caused by insufficient free storage in the new set of nodes. This issue occurs when there is new data coming into the cluster during a blue/green deployment process.
    Note: A blue/green deployment isn't triggered if Amazon ES detects less space than is required for a successful data migration.
  • A stuck shard relocation caused by shards that are pinned to an older set of nodes. To make sure shards aren't pinned to any nodes before a configuration change is made, check the index setting. Or, check to see if your cluster has a write block that is caused by high JVM memory pressure or low disk space.
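To see how shards and disk usage are distributed across your nodes, you can run the cat APIs mentioned above against your domain endpoint (ENDPOINT is a placeholder):

curl -X GET "ENDPOINT/_cat/allocation?v"
curl -X GET "ENDPOINT/_cat/nodes?v"
curl -X GET "ENDPOINT/_cat/shards?v"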

To identify which index shards are stuck and the corresponding index settings, use the following commands:

curl -X GET "ENDPOINT/_cluster/allocation/explain?pretty"
curl -X GET "ENDPOINT/INDEX_NAME/_settings?pretty"

In your index settings, check to see if either of the following settings appears:

"index.routing.allocation.require._name": "NODE_NAME"
"index.blocks.write": true

If you observe "index.routing.allocation.require._name": "NODE_NAME" in your index settings, remove the setting like this:

curl -X PUT "ENDPOINT/INDEX_NAME/_settings?pretty" -H 'Content-Type: application/json' -d '
{
  "index.routing.allocation.require._name": null
}'

For more information, see Index-level shard allocation filtering on the Elasticsearch website.

If you observe "index.blocks.write": true in your index settings, then your cluster has a write block. The write block is likely caused by high JVM memory pressure or low disk space. Make sure to address these issues before implementing any other troubleshooting tips. For more information about troubleshooting this exception, see ClusterBlockException.

Note: If your cluster is stuck in the "Processing" state for more than 24 hours, your cluster needs manual intervention. Also, if you haven't made any configuration changes but the node count is higher than expected, a software patch might be in progress.

