Why is my Amazon Aurora DB cluster clone, snapshot restore, or point-in-time restore taking so long?

Last updated: 2020-10-21

I am performing a cluster clone, snapshot restore, or point-in-time restore operation on my Amazon Aurora cluster. Why is this restore taking so long, and how can I resolve the issue?

Short description

Amazon Aurora's continuous backup and restore techniques are designed to minimize variation in restore times and to help the cluster's storage volume reach full performance as soon as the cluster becomes available. Long restore times are usually caused by long-running write transactions that were open on the source database when the backup was taken.

Resolution

Note: If you receive errors when running AWS Command Line Interface (AWS CLI) commands, make sure that you’re using the most recent version of the AWS CLI.

Amazon Aurora backs up your cluster volume's changes automatically and continuously, and retains the backups for the length of your backup retention period. Because backup is continuous, you can restore your data to a new cluster at any point in time within the specified retention period, without a lengthy binlog roll-forward process. And because the restore creates a new cluster, there is no performance impact on, or interruption to, your original database.
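For example, you can start a point-in-time restore with the AWS CLI. The following is a minimal sketch; the cluster identifiers and the timestamp are placeholders that you replace with your own values:

    # Restore the source cluster to a new cluster at a specific time within the retention period
    aws rds restore-db-cluster-to-point-in-time \
        --source-db-cluster-identifier my-source-cluster \
        --db-cluster-identifier my-restored-cluster \
        --restore-to-time 2020-10-20T23:45:00Z

    # Or pass --use-latest-restorable-time instead of --restore-to-time
    # to restore to the latest restorable time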

When you initiate a clone, snapshot restore, or point-in-time restore from the console, Amazon RDS calls the following APIs on your behalf:

- RestoreDBClusterFromSnapshot (for a snapshot restore) or RestoreDBClusterToPointInTime (for a point-in-time restore or clone) – creates the new cluster.
- CreateDBInstance – creates the DB instance in the new cluster.

When the cluster restore step completes, the cluster changes to the Available state. You can check the cluster state by refreshing the console or by using the AWS CLI.
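For example, the following AWS CLI command returns the cluster status so that you can confirm when it reaches available (the cluster identifier is a placeholder):

    aws rds describe-db-clusters \
        --db-cluster-identifier my-restored-cluster \
        --query 'DBClusters[0].Status' \
        --output text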

The instance creation process only starts when the cluster is Available. This happens in two stages: setting up the instance configuration, and database crash recovery.

You can check whether instance setup has finished by looking for the MySQL error log file, even while the instance is still in the Creating status. If the error log file is available to download, the instance is set up and the engine is performing crash recovery. The error log file, along with Amazon CloudWatch metrics, is also the best resource for checking the progress of crash recovery.
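For example, the following AWS CLI commands list the log files that the instance has published and download the error log. The instance identifier is a placeholder, and error/mysql-error.log is the typical Aurora MySQL error log name; use the name returned by describe-db-log-files:

    # List the log files that the instance has published so far
    aws rds describe-db-log-files \
        --db-instance-identifier my-restored-instance

    # Download the error log to follow crash recovery progress
    aws rds download-db-log-file-portion \
        --db-instance-identifier my-restored-instance \
        --log-file-name error/mysql-error.log \
        --output text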

Note: If you perform a restore operation with the AWS CLI or API, the restore call creates only the cluster. Make sure that you also invoke CreateDBInstance to create the DB instance, because this step is not automatic.
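As a sketch, the call looks like the following. The identifiers, instance class, and engine value are placeholders; match the engine and instance class to your cluster:

    aws rds create-db-instance \
        --db-instance-identifier my-restored-instance \
        --db-cluster-identifier my-restored-cluster \
        --db-instance-class db.r5.large \
        --engine aurora-mysql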

Check for long-running write operations on the source database

The best way to prevent a long crash recovery is to make sure that no long-running write operations are active on the source database at the time of the snapshot, point-in-time restore, or clone. Any long-running DCL, DDL, or DML statements (open write transactions) can lengthen the time it takes for the restored database to become available.
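Before you take the snapshot, clone, or restore point, you can check the source database for open transactions. The following query against information_schema.innodb_trx lists each open InnoDB transaction, how long it has been running, and how many rows it has modified. The endpoint and user name are placeholders:

    mysql -h my-source-cluster.cluster-xxxxxxxxxxxx.us-east-1.rds.amazonaws.com -u admin -p -e "
      SELECT trx_id,
             trx_started,
             TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS seconds_running,
             trx_rows_modified,
             trx_query
      FROM information_schema.innodb_trx
      ORDER BY trx_started;"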

For example, if you enable the binary log for an Aurora cluster, recovery takes longer. This is because InnoDB automatically checks the logs, rolls the database forward to the present, and then rolls back any transactions that were uncommitted at the time of the recovery. For more information on InnoDB crash recovery, see InnoDB recovery.

When the instance finishes the creation and recovery processes, the cluster and instance are ready to accept incoming connections.

Note: Aurora doesn't require the binary log. It's a best practice to disable it unless it is required. For cross-Region replication, consider Aurora global databases instead, which also don't require binary logs.
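For example, if your cluster uses a custom DB cluster parameter group (the group name below is a placeholder), you can check whether binary logging is enabled and turn it off by setting the binlog_format parameter to OFF. This parameter is static, so the change takes effect only after you reboot the writer instance:

    # Check the current binlog_format value (OFF means binary logging is disabled)
    aws rds describe-db-cluster-parameters \
        --db-cluster-parameter-group-name my-cluster-params \
        --query "Parameters[?ParameterName=='binlog_format'].[ParameterName,ParameterValue]"

    # Disable binary logging; the change applies after the writer instance is rebooted
    aws rds modify-db-cluster-parameter-group \
        --db-cluster-parameter-group-name my-cluster-params \
        --parameters "ParameterName=binlog_format,ParameterValue=OFF,ApplyMethod=pending-reboot"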