Why is my Amazon OpenSearch Service domain stuck in the "Processing" state?

My Amazon OpenSearch Service cluster is stuck in the "Processing" state.

Short description

Your OpenSearch Service cluster enters the "Processing" state when it's in the middle of a configuration change. The cluster can get stuck in the "Processing" state if any of the following occurs:

  • A validation check failed with errors.
  • A new set of resources failed to launch.
  • Shard migration to the new set of data nodes isn't complete.
  • The old set of resources failed to terminate.

When you initiate a configuration change, the domain state changes to "Processing" while OpenSearch Service creates a new environment. In the new environment, OpenSearch Service launches a new set of applicable nodes, such as data nodes, dedicated primary nodes, or UltraWarm nodes. After the migration completes, the older nodes are terminated.

You can view the progress of the configuration change in the console under Domain status. You can also monitor the progress of a configuration change using the DescribeDomainChangeProgress API. For more information, see Stages of a configuration change.
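As a rough illustration, the following Python sketch reads the stages of a configuration change through the DescribeDomainChangeProgress API with boto3. The response field names used here (ChangeProgressStatus, ChangeProgressStages, Name, Status) are assumptions about the response shape, and the helper names are hypothetical. Confirm both against the API reference before you rely on them.

```python
def summarize_change_progress(response):
    """Return one readable line per stage from a
    DescribeDomainChangeProgress response dictionary."""
    status = response.get("ChangeProgressStatus", {})
    lines = [f"Overall status: {status.get('Status', 'UNKNOWN')}"]
    for stage in status.get("ChangeProgressStages", []):
        lines.append(f"{stage.get('Name', '?')}: {stage.get('Status', '?')}")
    return lines


def fetch_change_progress(domain_name, region_name=None):
    """Call the API for the given domain (requires boto3 and AWS credentials)."""
    # boto3 is imported lazily so the summarizer works without it installed.
    import boto3

    client = boto3.client("opensearch", region_name=region_name)
    return client.describe_domain_change_progress(DomainName=domain_name)
```

For example, summarize_change_progress(fetch_change_progress("my-domain")) returns the overall status followed by one line per stage, where "my-domain" is a placeholder domain name.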

Resolution

A validation check has failed with errors

When you initiate a configuration change or perform an OpenSearch Service domain version upgrade, OpenSearch Service first performs a series of validation checks. The validation checks make sure that your domain is eligible for the change or upgrade. A domain can get stuck in the "Processing" state when validation checks fail with errors. There are several reasons why a validation check can fail. To resolve this issue, see Troubleshooting validation errors. Follow the troubleshooting steps associated with the validation errors, and then retry your configuration change.

A new set of resources has failed to launch

If you submit additional configuration changes to your cluster before the first configuration change completes, your cluster can get stuck. After you submit a configuration change, wait until it completes before you submit another one.

The conditions that your domain passes in the Validation stage must remain valid for the duration of the configuration change. After your configuration passes the Validation stage, avoid modifying resources that your domain depends on. For example, don't deactivate the AWS Key Management Service (AWS KMS) key used for encryption.

Your domain can also get stuck if it encounters a ClusterBlockException error. This can happen because of a lack of available storage space or high JVM memory pressure. For more information and troubleshooting, see ClusterBlockException.
Note: You can check the FreeStorageSpace, MasterCPUUtilization, and MasterJVMMemoryPressure metrics in Amazon CloudWatch.
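The following Python sketch shows one way to prepare CloudWatch requests for the three metrics named in the note. It assumes that the domain's metrics are published in the AWS/ES namespace with the DomainName and ClientId (AWS account ID) dimensions. The statistic chosen for each metric, the 5-minute period, and the helper name are illustrative choices, not recommendations from this article.

```python
from datetime import datetime, timedelta, timezone

# Metrics named in the note, with the statistic requested for each (assumed).
METRICS = {
    "FreeStorageSpace": "Minimum",
    "MasterCPUUtilization": "Maximum",
    "MasterJVMMemoryPressure": "Maximum",
}


def build_metric_requests(domain_name, account_id, hours=3, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics calls."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    return [
        {
            "Namespace": "AWS/ES",
            "MetricName": name,
            "Dimensions": [
                {"Name": "DomainName", "Value": domain_name},
                {"Name": "ClientId", "Value": account_id},
            ],
            "StartTime": start,
            "EndTime": end,
            "Period": 300,  # seconds per datapoint
            "Statistics": [stat],
        }
        for name, stat in METRICS.items()
    ]
```

You can pass each dictionary to boto3.client("cloudwatch").get_metric_statistics(**request) and inspect the returned datapoints.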

Shard migration to the new set of data nodes hasn't completed

After the new resources are created by OpenSearch Service, the shard migration from the old set of data nodes to the new set begins. This stage can take several minutes to several hours depending on the cluster load and size.

To monitor the current migration of shards between the old nodes and the new nodes, run the following API call:

GET /<DOMAIN_ENDPOINT>/_cat/recovery?active_only=true
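If you add format=json to the request above, the response is a JSON list that is easier to process in a script. The following Python sketch groups the shards that are still recovering by their target node. The field names (index, shard, stage, target_node, bytes_percent) are assumed to match the cat recovery columns, and the helper name is hypothetical.

```python
def summarize_recoveries(rows):
    """Group in-flight recoveries by target node.

    `rows` is the parsed JSON list returned by _cat/recovery with format=json.
    """
    by_target = {}
    for row in rows:
        if row.get("stage") == "done":
            continue  # finished recoveries are no longer in flight
        target = row.get("target_node", "unknown")
        entry = (
            f"{row.get('index')}[{row.get('shard')}] "
            f"{row.get('stage')} {row.get('bytes_percent')}"
        )
        by_target.setdefault(target, []).append(entry)
    return by_target
```

A node that keeps the same entries at the same percentage across several calls can indicate a migration that isn't progressing.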

The shard migration might be unsuccessful for the following reasons:

  • Your OpenSearch Service cluster is currently in red health status. If your cluster is in red health status, then troubleshoot your red cluster status until your cluster is in a healthy state. For more information, see Why is my Amazon OpenSearch Service cluster in a red or yellow status?
  • Your cluster is overloaded and can't allocate resources to handle the shard migration. A cluster with high CPU and JVM pressure might get overloaded. Monitor the CloudWatch JVMMemoryPressure and CPUUtilization metrics. For more information, see Viewing metrics in CloudWatch.
  • There is a lack of free storage space in the new set of nodes. This issue occurs when there is new data coming into the cluster during the blue/green deployment process. This issue can also occur when old nodes have large shards that can't be allocated to the new nodes.

To see the size of your shards, use the cat shards API on the Elasticsearch website.
To see the number of shards assigned to each node, use the cat allocation API on the Elasticsearch website.
To find the reason why some shards can't be assigned to the new nodes, use the cluster allocation explain API on the Elasticsearch website.
If you have old indices that you no longer need, you can use the delete index API on the Elasticsearch website to free up storage.
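To find the shards that are most likely to cause storage problems, you can sort the cat shards output by size. The following Python sketch assumes that you called the cat shards API with format=json and bytes=b, so that the store column is a plain byte count. The helper name is hypothetical.

```python
def largest_shards(rows, top=5):
    """Return the `top` largest shards as (index, shard, node, size_bytes)."""
    sized = []
    for row in rows:
        store = row.get("store")
        if store is None:
            continue  # unassigned shards report no store size
        sized.append((row["index"], row["shard"], row.get("node"), int(store)))
    # Largest shards first.
    return sorted(sized, key=lambda item: item[3], reverse=True)[:top]
```

Compare the largest sizes with the free space that the cat allocation API reports for the new nodes.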

If your shard can't be assigned to a node because it exceeded the maximum number of retries, then you can retry the allocation. Increase the "index.allocation.max_retries" index setting associated with that shard using the following API call:

PUT /<DOMAIN_ENDPOINT>/<INDEX_NAME>/_settings
{
    "index.allocation.max_retries" : 10
}

Note: By default, the cluster attempts to allocate a shard a maximum of 5 times in a row.

  • Because of internal hardware failures, the shards on old data nodes can get stuck during a migration.
    Note: Depending on your hardware issue, OpenSearch Service runs self-healing scripts to return the nodes to a healthy state.
  • Shards are pinned to an older set of nodes, which causes a stuck shard relocation. To make sure that shards aren't pinned to any nodes, check the index settings. Also check whether your cluster has a ClusterBlockException error.

To identify the shards that can't be allocated to the new nodes and the corresponding index settings, use the following commands:

GET /<DOMAIN_ENDPOINT>/_cluster/allocation/explain?pretty
GET /<DOMAIN_ENDPOINT>/<INDEX_NAME>/_settings?pretty
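The cluster allocation explain response can be long. The following Python sketch pulls out only the deciders that rejected the allocation on each node. It assumes the response contains a node_allocation_decisions list whose entries carry node_name and deciders fields, and the helper name is hypothetical.

```python
def blocking_deciders(explain):
    """Map each node name to the explanations of deciders that answered NO."""
    blocked = {}
    for node in explain.get("node_allocation_decisions", []):
        reasons = [
            decider.get("explanation", "")
            for decider in node.get("deciders", [])
            if decider.get("decision") == "NO"
        ]
        if reasons:
            blocked[node.get("node_name", "unknown")] = reasons
    return blocked
```

An explanation that mentions an allocation filter or a disk watermark usually points to a pinned index or to low storage on the new nodes.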

Using the get index settings API on the Elasticsearch website, check to see if either of these settings appears:

{
    "index.routing.allocation.require._name": "<NODE_NAME>"
}

or

{
    "index.blocks.write": true
}

If you find "index.routing.allocation.require._name": "<NODE_NAME>" in your index settings, then reset that setting using the following API call:

PUT /<DOMAIN_ENDPOINT>/<INDEX_NAME>/_settings
{
    "index.routing.allocation.require._name": null
}

For more information, see Index-level shard allocation filtering on the Elasticsearch website.

If you observe "index.blocks.write": true in your index settings, then your index has a write block. This write block issue might be caused by a ClusterBlockException error. For more information, see How do I resolve the 403 "index_create_block_exception" or "cluster_block_exception" error in OpenSearch Service?
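The settings check described above can be sketched in Python as follows. The sketch assumes the default nested form of the get index settings response, where each index name maps to a settings object, and it accepts the write block value as either a Boolean or the string "true". The helper name is hypothetical.

```python
def find_blocking_settings(settings_response):
    """Return {index_name: [findings]} for pinned or write-blocked indices."""
    findings = {}
    for index_name, body in settings_response.items():
        index_settings = body.get("settings", {}).get("index", {})
        issues = []
        # index.routing.allocation.require._name pins shards to a named node.
        pinned = (
            index_settings.get("routing", {})
            .get("allocation", {})
            .get("require", {})
            .get("_name")
        )
        if pinned:
            issues.append(f"pinned to node {pinned}")
        # index.blocks.write marks the index as write-blocked.
        write_block = index_settings.get("blocks", {}).get("write")
        if str(write_block).lower() == "true":
            issues.append("write block enabled")
        if issues:
            findings[index_name] = issues
    return findings
```

Indices that don't appear in the result have neither setting.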

Best practices

To avoid getting your OpenSearch Service cluster stuck in the "Processing" state, follow these best practices:

  • Make sure that your cluster can support the blue/green deployment process before submitting a configuration change.
  • Submit a dry-run of your changes before submitting configuration changes.
  • Make sure that your cluster isn't overloaded.
  • Avoid submitting several configuration changes simultaneously.
  • Consider submitting a configuration change during low traffic hours.
  • Monitor the progress of your configuration change.
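As one possible way to submit a dry run, the following Python sketch calls UpdateDomainConfig through boto3 with the DryRun flag and checks whether the change would need a blue/green deployment. The example change (an instance type update), the DryRunResults and DeploymentType response fields, and the "Blue/Green" value are assumptions. Verify them against the API reference. The helper names are hypothetical.

```python
def needs_blue_green(dry_run_response):
    """Return True when the dry run reports a blue/green deployment."""
    results = dry_run_response.get("DryRunResults", {})
    return results.get("DeploymentType") == "Blue/Green"


def dry_run_instance_change(domain_name, instance_type):
    """Validate an instance type change without applying it."""
    # Lazy import keeps needs_blue_green usable without boto3 installed.
    import boto3

    client = boto3.client("opensearch")
    return client.update_domain_config(
        DomainName=domain_name,
        ClusterConfig={"InstanceType": instance_type},
        DryRun=True,  # validate only; no change is applied
    )
```

If needs_blue_green returns True for the response, plan the change for low traffic hours and confirm that the cluster isn't overloaded first.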

Note: Contact AWS Support if any of the following occurs:

  • Your cluster gets stuck in the "Processing" state for more than 24 hours.
  • Your domain is stuck in the "Deleting older resources" stage.
AWS OFFICIAL
Updated 8 months ago