Why is my RDS for SQL Server instance stuck in the rebooting state?

Last updated: 2022-09-27

Why is it taking a long time to reboot my Amazon Relational Database Service (Amazon RDS) for Microsoft SQL Server instance?

Short description

Rebooting your DB instance might take more time than expected if there are a large number of ongoing transactions on the instance. Or, rebooting might appear to be stuck if there are a high number of virtual log files (VLFs) on the instance that are slowing down the reboot.

To resolve these issues, do one or both of the following:

  • Check for ongoing queries on the DB instance.
  • Reduce the number of VLFs on the DB instance.

Resolution

Check for ongoing queries on the DB instance

Rebooting stops all ongoing transactions and SQL Server runs recovery while starting the instance. SQL Server performs roll forward and rollback of the transactions to take the database to a consistent state.

You can see the time taken by this recovery process in the SQL Server error log. The log entry includes the time consumed in each recovery phase. In the following log entry example, X is the amount of time taken by SQL Server in each phase and in complete recovery. If you have large ongoing transactions, the reboot might take a long time.

Recovery completed for database <<DB_NAME>> (database ID <<id of database>>) in X second(s) (analysis X ms, redo X ms, undo X ms [system undo X ms, regular undo X ms].) This is an informational message only. No user action is required.

To resolve this issue, reduce the number of ongoing queries. Use the following command to check if there are any active data modification transactions on the database.

SELECT r.session_id,
       r.start_time,
       r.status,
       r.cpu_time,
       r.total_elapsed_time,
       st.TEXT AS batch_text
FROM sys.dm_exec_requests AS r
     CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
     CROSS APPLY sys.dm_exec_query_plan(r.plan_handle) AS qp
     where session_id <> @@SPID
ORDER BY cpu_time DESC;

The command output provides information about the sessions running on your database along with start time and SQL text. If there are queries that are still running, then allow the queries to complete before performing a reboot.

Reduce the number of VLFs on the DB instance

Rebooting might take a long time due to a high number of VLFs on the instance. An excessive number of VLFs might accumulate from an initially small transaction log that was grown (manually or automatically) in very small increments.

Before starting the recovery phases during reboot, SQL Server scans all of the VLFs serially. This phase is called the discovery phase. If a large number of VLFs are found, the following notification appears in the error log:

Database %ls has more than %d virtual log files which is excessive. Too many virtual log files can cause long startup and backup times. Consider shrinking the log and using a different growth increment to reduce the number of virtual log files.

To reduce the number of VLFs, do the following:

Note: It's a best practice to perform these steps in slow or non-business hours.

  1. Log in to RDS for SQL Server instance from Microsoft SQL Server Management Studio as the master user.
  2. Run DBCC SQLPERF(LOGSPACE) query to check log file utilization.
  3. Shrink the log file of the database that received the high VLFs notification in the error log.
  4. Expand the transaction log file to an appropriate usage size all at once. This avoids automatic growth that causes a high number of VLFs.

Note: Shrinking the log file reduces the number of VLFs. Performing a one-time expansion creates a limited amount of VLFs required per the VLF creation criteria. For more information, see Virtual Log Files (VLFs) in the Microsoft SQL docs.