How do I resolve a "Prior snapshot operation has not yet completed" error while upgrading my Amazon Elasticsearch Service cluster?

Last updated: 2020-09-10

I want to upgrade my Amazon Elasticsearch Service (Amazon ES) cluster, but it fails with a "Prior snapshot operation has not yet completed" error message. How do I fix this?

Short description

When you upgrade an Amazon ES cluster, the upgrade can fail for the following reasons:

  • Snapshot is already in progress.
  • Snapshot in progress is stuck.
  • Snapshot in progress has a cluster in red status.
  • Snapshot timeout or failure.

For more information about Amazon ES upgrade failures, see Troubleshooting an upgrade.

Resolution

Snapshot is already in progress

Note: A "Prior snapshot operation has not yet completed" error message indicates that a snapshot is already in progress.

For encrypted domains, use the following syntax to check whether an automated snapshot is in progress:

curl -X GET 'https://es_endpoint/_snapshot/cs-automated-enc/_status'

For unencrypted domains, use the following syntax to check whether an automated snapshot is in progress:

curl -X GET 'https://es_endpoint/_snapshot/cs-automated/_status'

If there are no running snapshots, the following output appears:

{
  "snapshots" : [ ]
}

The empty brackets indicate that you can safely perform an upgrade. If Amazon ES is unable to check whether a snapshot is in progress, then upgrades can fail.
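The empty-list check can also be scripted before you start an upgrade. The following is a minimal sketch, not an official AWS procedure: it assumes an unencrypted domain (substitute cs-automated-enc for encrypted domains), and es_endpoint is a placeholder for your domain endpoint.

```shell
#!/bin/bash
# Sketch: check the automated snapshot repository before upgrading.
# "es_endpoint" is a placeholder for your domain endpoint; encrypted
# domains use the cs-automated-enc repository instead of cs-automated.

# Succeeds when the _status response contains an empty "snapshots" list.
no_snapshot_running() {
  grep -q '"snapshots" *: *\[ *\]'
}

if curl -s --max-time 10 "https://es_endpoint/_snapshot/cs-automated/_status" |
   no_snapshot_running; then
  echo "No snapshot in progress; safe to start the upgrade."
else
  echo "A snapshot is still running (or the check failed); wait and retry."
fi
```

You could wrap the check in a loop with a sleep to wait for the snapshot to finish before proceeding.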

Snapshot in progress is stuck

Use the following command syntax to check the start and end times of your hourly snapshots:

curl -X GET 'https://es_endpoint/_cat/snapshots/cs-automated?v&s=id'

Then, print only the start times by piping the cURL output to the awk command:

curl -s -X GET 'https://es_endpoint/_cat/snapshots/cs-automated?v&s=id' | awk '{print $4}'

The output indicates the times that the hourly snapshots occurred. For example, this output indicates that the snapshots run around the 51st minute of each hour:

22:51:11
23:51:18
00:51:19
01:51:14
02:51:16
03:51:18
04:51:16
05:51:11

Then, check your domain's upgrade eligibility. Use the snapshot status API to confirm that the snapshot completed; the API returns an empty snapshot list when the snapshot is done. Note that snapshot times can change after configuration changes are made, so don't use them to plan scheduled jobs.

If the current status is in progress and it doesn't change for a while, the snapshot is likely stuck. The same applies to aborted snapshots, which can prevent other snapshots from being taken. If the cluster is in red status or there is a write block, clear the status or block to resolve the failure.

Important: Don't run the upgrade eligibility check until the snapshot has completed.
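As a rough way to spot a stuck snapshot, you can compare each snapshot's start epoch against the current time. This is a sketch, not an official check: the one-hour threshold is an assumption based on the hourly automated snapshot schedule, and es_endpoint is a placeholder for your domain endpoint.

```shell
#!/bin/bash
# Sketch: flag automated snapshots that have been IN_PROGRESS for more
# than an hour. The one-hour threshold is an assumption based on the
# hourly snapshot schedule; "es_endpoint" is a placeholder endpoint.
now=$(date +%s)
curl -s "https://es_endpoint/_cat/snapshots/cs-automated?h=id,status,start_epoch" |
  awk -v now="$now" '$2 == "IN_PROGRESS" && now - $3 > 3600 {
    print $1 " has been running for " int((now - $3) / 60) " minutes - possibly stuck"
  }'
```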

Snapshot in progress has a cluster in red status

To list only the repository names registered to your domain, use the following syntax:

curl -XGET "https://es_endpoint/_cat/repositories?v&h=id"

To list the repository names, types, and other settings registered to your domain, use the following syntax:

curl -XGET "https://es_endpoint/_snapshot?pretty"
curl -XGET "https://es_endpoint/_cluster/state/metadata"

Check if you can list snapshots in each of the repositories, excluding the cs-automated or cs-automated-enc repositories. If you have several repositories, use a bash script like this:

#!/bin/bash
# List the snapshots in every repository except the cs-automated repositories.
repos=$(curl -s "https://es_endpoint/_cat/repositories" | grep -v "cs-automated" | awk '{print $1}')

for i in $repos; do
  echo "Snapshots in ... : $i" >> /tmp/snapshot
  curl -s -XGET "https://es_endpoint/_cat/snapshots/$i" >> /tmp/snapshot
  echo "done..."
done

Important: Stuck snapshots cannot be manually deleted in the cs-automated or cs-automated-enc repository.

To view the output in the /tmp/snapshot file, use the following syntax:

cat /tmp/snapshot

The command returns a response like this:

Snapshots in ... : snapshot-manual-repo
axa_snapshot-1557497454881 SUCCESS 1557639400 05:36:40 1557639405 05:36:45  4.6s  7 31 0 31
2019-05-15                 SUCCESS 1560503610 09:13:30 1560503622 09:13:42 11.8s  4 16 0 16
epoch_test                 SUCCESS 1569151317 11:21:57 1569151335 11:22:15 18.1s 15 56 0 56

The following error message indicates that the Amazon Simple Storage Service (Amazon S3) bucket that was registered as a snapshot repository was already deleted:

Snapshots in ... : snapshot-manual-repo
{"error":{"root_cause":[{"type":"repository_exception","reason":"[snapshot-manual-repo] could not read repository data from index blob"}],"type":"repository_exception","reason":"[snapshot-manual-repo] could not read repository data from index blob","caused_by":{"type":"i_o_exception","reason":"Exception when listing blobs by prefix [index-]","caused_by":{"type":"a_w_s_security_token_service_exception","reason":"a_w_s_security_token_service_exception: User: arn:aws:sts::999999999999:assumed-role/cp-sts-grant-role/swift-us-east-1-prod-666666666666 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::666666666666:policy/my-manual-es-snapshot-creator-policy (Service: AWSSecurityTokenService; Status Code: 403; Error Code: AccessDenied; Request ID: 6b9374fx-11xy-11yz-ff66-918z9bb08193)"}}},"status":500}

Verify that the S3 bucket that backed the manual snapshot repository no longer exists:

aws s3 ls | grep -i "snapshot-manual-repo"

Note: Replace snapshot-manual-repo with your bucket name.

Then, delete the repository from Amazon ES:

curl -XDELETE "https://es_endpoint/_snapshot/snapshot-example-manual-repo"

Snapshot timeout or failure

Check whether you can take a manual snapshot. If you get a Can't take manual snapshot error, call the _cat/snapshots API:

curl -XGET "https://es_endpoint/_cat/snapshots/s3_repository"

Note: Replace s3_repository with the name of your snapshot repository.

This command shows how long the current snapshot has been running. If the duration seems reasonable, wait for the snapshot to complete, and then try again.

Then, check the health status of your cluster:

curl -XGET "https://es_endpoint/_cluster/health?pretty"

If the cluster is red, clear the red cluster first. If it is relocating or initializing shards, wait for the process to complete before configuring any access policies. Note that shard reallocation can significantly strain the computing resources of your cluster. For more information about troubleshooting a red cluster, see Red cluster status.
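To see which shards are keeping the cluster red, the standard _cat/shards and cluster allocation explain APIs can help. The following is a minimal sketch; es_endpoint is a placeholder for your domain endpoint, and the allocation explain API is available on Elasticsearch 5.x and later.

```shell
#!/bin/bash
# Sketch: list only the unassigned shards (plus the header row) and ask
# the cluster why a shard cannot be allocated. "es_endpoint" is a
# placeholder for your domain endpoint.

# Keep the header line and any UNASSIGNED rows from a _cat/shards listing.
unassigned_only() {
  awk 'NR == 1 || /UNASSIGNED/'
}

curl -s "https://es_endpoint/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason" |
  unassigned_only

# With no request body, allocation explain describes the first unassigned
# shard that the cluster finds (Elasticsearch 5.x and later):
curl -s -XGET "https://es_endpoint/_cluster/allocation/explain?pretty"
```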