Why is my Amazon OpenSearch Service domain upgrade taking so long?
Last updated: 2021-07-30
I'm trying to upgrade my Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) cluster, but the upgrade is taking a long time. Why is this happening and how can I better monitor my cluster upgrade status in OpenSearch Service?
When you make a configuration change in OpenSearch Service, the service uses a blue/green deployment process. A blue/green deployment runs two production environments: one is live and the other is idle. After the update is applied to the idle environment, traffic is switched over to it. For OpenSearch Service, a new environment is created during a domain update, and users are routed to the new environment after the update is complete. This approach minimizes downtime and preserves the original environment in case the deployment is unsuccessful.
To better monitor your cluster upgrade status in OpenSearch Service, monitor your blue/green deployment process at each stage:
- Creation of new nodes
- Data migration
- Removal of old nodes
Retrieving all cluster snapshots and node IDs
Before a migration, OpenSearch Service takes an automated snapshot of your cluster after the cluster passes the eligibility test. While the snapshot is in progress, the progress status might show "null" or 0%. After the snapshot is taken, the percentage value is updated. The time it takes to complete a snapshot varies with the amount of data stored. Because snapshots are taken incrementally, your snapshot can take longer to complete if your data has changed significantly since the previous automated snapshot.
The following _snapshot request retrieves all currently running snapshots with detailed status information:
GET _snapshot/_status
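As a rough illustration, the response of a _snapshot/_status request can be reduced to a per-snapshot completion percentage. The sketch below is Python and only parses a response body; fetching it (with whatever authentication your domain requires) is left out. The function name and the sample body are hypothetical, and the code assumes the response carries a shards_stats object with done and total counts.

```python
import json

def snapshot_progress(status_body: str):
    """Return (snapshot, state, percent_done) for each running snapshot."""
    summary = []
    for snap in json.loads(status_body).get("snapshots", []):
        stats = snap.get("shards_stats", {})
        total = stats.get("total", 0)
        done = stats.get("done", 0)
        # With no shard totals yet, report None instead of a misleading 0%.
        percent = round(100.0 * done / total, 1) if total else None
        summary.append((snap.get("snapshot"), snap.get("state"), percent))
    return summary

# Hypothetical response body, trimmed to the fields used above.
SAMPLE_STATUS = json.dumps({"snapshots": [
    {"snapshot": "example-snapshot", "repository": "example-repository",
     "state": "STARTED",
     "shards_stats": {"done": 3, "failed": 0, "total": 12}},
]})

if __name__ == "__main__":
    for name, state, percent in snapshot_progress(SAMPLE_STATUS):
        print(name, state, percent)
```

An empty snapshots list means that no snapshot is running at the moment of the request.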
For more information about the snapshot APIs, see Monitor snapshot and restore progress on the Elasticsearch website.
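To retrieve all currently running snapshots in your cluster, use the current parameter, where <repository> stands for the name of your snapshot repository:
GET _snapshot/<repository>/_current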
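A minimal Python sketch of the same idea: build the request path for the _current parameter and pull the names of in-progress snapshots out of the response. The repository name in the example is a placeholder, both function names are hypothetical, and the code assumes each entry in the response has snapshot and state fields.

```python
import json
from urllib.parse import quote

def current_snapshots_path(repository: str) -> str:
    """Build the request path that lists running snapshots in one repository."""
    return f"/_snapshot/{quote(repository, safe='')}/_current"

def running_snapshot_names(body: str):
    """Return the names of snapshots whose state is IN_PROGRESS."""
    snapshots = json.loads(body).get("snapshots", [])
    return [s["snapshot"] for s in snapshots if s.get("state") == "IN_PROGRESS"]

# Hypothetical response body, trimmed to the fields used above.
SAMPLE_CURRENT = json.dumps({"snapshots": [
    {"snapshot": "example-snapshot", "state": "IN_PROGRESS"},
]})

if __name__ == "__main__":
    print(current_snapshots_path("example-repository"))
    print(running_snapshot_names(SAMPLE_CURRENT))
```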
To obtain the IDs of all data nodes, use the cat nodes API:
GET _cat/nodes?v&h=id,name,node.role
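If you request the cat nodes output as JSON (for example, by adding format=json and h=id,name,node.role to the request), the data node IDs can be extracted with a few lines of Python. This is a sketch under assumptions: the function name and sample rows are hypothetical, and it treats any node whose node.role string contains "d" as a data node.

```python
import json

def data_node_ids(cat_nodes_body: str):
    """Return the sorted IDs of nodes whose role string includes the data role."""
    rows = json.loads(cat_nodes_body)
    # Assumption: the abbreviated role string contains "d" for data nodes.
    return sorted(row["id"] for row in rows if "d" in row.get("node.role", ""))

# Hypothetical JSON output of the cat nodes API.
SAMPLE_NODES = json.dumps([
    {"id": "aaaa", "name": "node-1", "node.role": "dir"},
    {"id": "bbbb", "name": "node-2", "node.role": "dir"},
    {"id": "cccc", "name": "leader-1", "node.role": "m"},
])

if __name__ == "__main__":
    print(data_node_ids(SAMPLE_NODES))
```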
You can use the node IDs to identify which nodes are old and which are new. An increasing number of shards on the new nodes indicates that the migration is progressing smoothly. Eventually, all the shards move to the new nodes and the old nodes are empty.
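One way to tell old nodes from new ones is to record the node IDs before the upgrade starts and compare them with the IDs seen later. The Python sketch below assumes you saved such a baseline yourself; the function name and the IDs are hypothetical.

```python
def classify_nodes(baseline_ids, current_ids):
    """Split node IDs into old (still present), new, and already removed."""
    baseline, current = set(baseline_ids), set(current_ids)
    return {
        "old": sorted(baseline & current),      # present before and now
        "new": sorted(current - baseline),      # appeared during the upgrade
        "removed": sorted(baseline - current),  # old nodes already gone
    }

if __name__ == "__main__":
    before = ["aaaa", "bbbb"]
    during = ["aaaa", "bbbb", "cccc", "dddd"]
    print(classify_nodes(before, during))
```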
Monitoring the blue/green deployment process
When your cluster enters the blue/green deployment process, the new nodes (in the green environment) appear. The shards are then migrated from the old nodes (in the blue environment) to the new nodes. After the data migration or shard reallocation is complete, the old environment is torn down.
You can monitor the blue/green deployment process in its three stages: creation of new nodes, data migration, and removal of old nodes.
Stage 1: Creation of new nodes
You can monitor the Nodes cluster metric in Amazon CloudWatch to obtain the node count. Or, you can use the cat nodes API to list all the nodes in your cluster:
GET _cat/nodes?v
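To turn the cat nodes listing into a node count, you can count the rows of the text output. The following Python sketch assumes the output was requested with the v parameter, so the first line is a header; the function names and the sample text are hypothetical.

```python
def count_nodes(cat_nodes_text: str, has_header: bool = True) -> int:
    """Count node rows in plain-text cat nodes output."""
    lines = [line for line in cat_nodes_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0) if has_header else len(lines)

def new_nodes_present(current_count: int, configured_count: int) -> bool:
    """True when the cluster has more nodes than the configured count."""
    return current_count > configured_count

# Hypothetical cat nodes output with a header row.
SAMPLE_TEXT = """id   name   node.role
aaaa node-1 dir
bbbb node-2 dir
cccc node-3 dir
"""

if __name__ == "__main__":
    count = count_nodes(SAMPLE_TEXT)
    print(count, new_nodes_present(count, configured_count=2))
```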
Because you're only updating your cluster version, this stage is complete as soon as the node count increases to include both the old and the new nodes. Afterwards, you might see your OpenSearch Service domain return to the "Active" state (after being in the "Processing" state). For clusters with dedicated leader nodes, the node count increases to the sum of the old and new nodes. The leader nodes of the older configuration then shut down, and the node count decreases by the number of leader nodes. For example, the node count of an OpenSearch Service cluster with three dedicated leader nodes decreases by three.
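The node counts described above can be worked out ahead of time. This Python sketch only restates that arithmetic and assumes that the new environment has the same number of data and dedicated leader nodes as the old one; the function name and the example numbers are hypothetical.

```python
def expected_node_counts(data_nodes: int, dedicated_leader_nodes: int = 0):
    """Estimate node counts during stage 1 of a version-only upgrade."""
    configured = data_nodes + dedicated_leader_nodes
    # Assumption: old and new environments have identical node counts.
    peak = 2 * configured
    after_old_leaders_leave = peak - dedicated_leader_nodes
    return {
        "configured": configured,
        "peak": peak,
        "after_old_leaders_leave": after_old_leaders_leave,
    }

if __name__ == "__main__":
    # Example: six data nodes and three dedicated leader nodes.
    print(expected_node_counts(6, 3))
```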
Stage 2: Data migration
As soon as the first stage is complete, the shard migration begins. During the data migration, the shard count for older nodes decreases, and the shard count for newer nodes increases. You can use the cat allocation API to see how many shards are allocated to each node:
GET _cat/allocation?v
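To put a number on the migration, you can compute the share of shards that already sit on the new nodes. The Python sketch below assumes the cat allocation output was requested as JSON (format=json), that each row has node and shards fields, and that you already know the names of the new nodes. Note that cat allocation reports node names, not node IDs. The function names and sample rows are hypothetical.

```python
import json

def shards_per_node(cat_allocation_body: str):
    """Map each node name to its shard count."""
    return {row["node"]: int(row["shards"])
            for row in json.loads(cat_allocation_body)}

def migrated_fraction(counts, new_nodes):
    """Fraction of assigned shards on the new nodes, or None if no shards."""
    # Skip the UNASSIGNED row, which is not a real node.
    assigned = {n: c for n, c in counts.items() if n != "UNASSIGNED"}
    total = sum(assigned.values())
    if total == 0:
        return None
    return sum(c for n, c in assigned.items() if n in new_nodes) / total

# Hypothetical JSON output of the cat allocation API.
SAMPLE_ALLOCATION = json.dumps([
    {"node": "old-1", "shards": "2"}, {"node": "old-2", "shards": "1"},
    {"node": "new-1", "shards": "5"}, {"node": "new-2", "shards": "4"},
])

if __name__ == "__main__":
    counts = shards_per_node(SAMPLE_ALLOCATION)
    print(migrated_fraction(counts, {"new-1", "new-2"}))
```

A value that climbs toward 1.0 across repeated requests matches the pattern described above: shard counts fall on the older nodes and rise on the newer ones.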
For more information, see cat allocation on the Elasticsearch website.
Stage 3: Removal of old nodes
After all the shards are migrated to the new nodes, older nodes are removed from your cluster. The node count then returns to the original node count that you configured. At this stage, the blue/green deployment and update process are complete.
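A simple completion check follows from this description: the node count is back to the configured value and none of the old node IDs remain. The Python sketch below assumes you recorded the old node IDs before the upgrade; the function name and the IDs are hypothetical.

```python
def deployment_complete(configured_count, current_ids, old_ids) -> bool:
    """True when only new nodes remain and the count matches the configuration."""
    current = set(current_ids)
    return len(current) == configured_count and current.isdisjoint(old_ids)

if __name__ == "__main__":
    old = ["aaaa", "bbbb"]
    print(deployment_complete(2, ["aaaa", "bbbb", "cccc", "dddd"], old))
    print(deployment_complete(2, ["cccc", "dddd"], old))
```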