Why did my Amazon RDS DB instance restart, recover, or failover?

Last updated: 2022-07-19

I want to know the root cause for the restart, recover, or failover of my Amazon Relational Database Service (Amazon RDS) DB instance.

Short description

The Amazon RDS database instance automatically performs a restart under the following conditions:

  • The primary Availability Zone loses availability, or the instance is under excessive workload because of a performance bottleneck and resource contention.
  • There is an underlying infrastructure issue with the primary instance, such as a loss of network connectivity, a compute unit issue, or a storage issue.
  • The DB instance class type is changed as part of the DB instance vertical scaling activity.
  • The underlying host of the RDS DB instance is undergoing software patching during a specific maintenance window. For more information, see Maintaining a DB instance and Upgrading a DB instance engine version.
  • You initiated a manual reboot of the DB instance using the options Reboot or Reboot with failover.

When the DB instance shows potential issues and fails to respond to RDS health checks, RDS automatically initiates a Single-AZ recovery for the Single-AZ deployment and a Multi-AZ failover for the Multi-AZ deployment. Then, the DB instance is restarted so that you can resume database operations as quickly as possible without administrative intervention.

Resolution

To identify the cause of the outage, check the following logs and metrics for your RDS DB instance.

Amazon RDS events

To identify the root cause of an unplanned outage in your instance, view all the Amazon RDS events for the last 24 hours. By default, event timestamps are recorded in UTC. To store events for a longer time, send the Amazon RDS events to Amazon CloudWatch Events. For more information, see Creating a rule that triggers on an Amazon RDS event. When your instance restarts, you see one of the following messages in RDS event notifications:

  • The RDS instance was modified by customer: This RDS event message indicates that the failover was initiated by an RDS instance modification.
  • Applying modification to database instance class: This RDS event message indicates that the DB instance class type is changed.
    • Single-AZ deployments become unavailable for a few minutes during this scaling operation.
    • Multi-AZ deployments are unavailable only for the time that it takes the instance to fail over, which is usually about 60 seconds. This is because the standby database is upgraded first, and then a failover to the newly sized database occurs. Then, your database is restarted, and the engine performs recovery to make sure that your database remains in a consistent state.
  • The user requested a failover of the DB instance: This message indicates that you initiated a manual reboot of the DB instance using the option Reboot or Reboot with failover.
  • The primary host of the RDS Multi-AZ instance is unhealthy: This reason indicates a transient underlying hardware issue that led to the loss of communication to the primary instance. This issue might have rendered the instance unhealthy because the RDS monitoring system couldn't communicate with the RDS instance for performing the health checks.
  • The primary host of the RDS Multi-AZ instance is unreachable due to loss of network connectivity: This reason indicates that the Multi-AZ failover and database instance restart were caused by a transient network issue that affected the primary host of your Multi-AZ deployment. The internal monitoring system detected this issue and initiated a failover.
  • The RDS Multi-AZ primary instance is busy and unresponsive, the Multi-AZ instance activation started, or the Multi-AZ instance activation completed: The event log shows these messages under the following situations:
    • The primary DB instance is unresponsive.
    • Excessive memory consumption in the database caused a memory crunch that prevented the RDS monitoring system from contacting the underlying host. As a proactive measure, the RDS monitoring system restarts the database.
    • The DB instance experienced intermittent network issues with the underlying host.
    • The instance experienced a heavy database load. In this case, you might notice spikes in the CloudWatch metrics CPUUtilization and DatabaseConnections, and in the IOPS and throughput metrics. You might also notice depletion of FreeableMemory.
  • Database instance patched: This message indicates that the DB instance underwent a minor version upgrade during a maintenance window because the setting Auto minor version upgrade is enabled on the instance.
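
The event messages listed above can also be pulled and grouped programmatically. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) is installed and credentials are configured. The cause labels and the function names classify_event and fetch_recent_events are illustrative and are not official RDS terminology. The message fragments are taken from the list above and might not match every event text exactly.

```python
from datetime import datetime, timedelta, timezone

# Message fragments from the list above, mapped to a short cause label.
# The labels are illustrative assumptions, not official RDS terminology.
CAUSE_BY_FRAGMENT = {
    "modified by customer": "instance modification",
    "applying modification to database instance class": "instance class change",
    "user requested a failover": "manual reboot with failover",
    "is unhealthy": "underlying host issue",
    "loss of network connectivity": "network issue",
    "busy and unresponsive": "database load or unresponsive primary",
    "database instance patched": "maintenance patching",
}


def classify_event(message):
    """Return a cause label for an RDS event message, or 'unclassified'."""
    lowered = message.lower()
    for fragment, cause in CAUSE_BY_FRAGMENT.items():
        if fragment in lowered:
            return cause
    return "unclassified"


def fetch_recent_events(db_instance_id, hours=24):
    """Return (date, message, cause) tuples for the last `hours` hours."""
    import boto3  # imported lazily so classify_event works without the SDK

    rds = boto3.client("rds")
    end = datetime.now(timezone.utc)
    pages = rds.get_paginator("describe_events").paginate(
        SourceIdentifier=db_instance_id,
        SourceType="db-instance",
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
    )
    return [
        (event["Date"], event["Message"], classify_event(event["Message"]))
        for page in pages
        for event in page["Events"]
    ]
```

For example, calling fetch_recent_events with your DB instance identifier returns one tuple per event, so that you can see at a glance whether a failover coincided with a modification, a patch, or a host issue.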

CloudWatch metrics

View the CloudWatch metrics for your Amazon RDS instance to check whether a database load issue caused the outage. For more information, see Monitoring Amazon RDS metrics with Amazon CloudWatch. Check for spikes (or, for FreeableMemory, a sharp drop) in the following key metrics that indicate the availability and health status of your RDS instance:

  • DatabaseConnections
  • CPUUtilization
  • FreeableMemory
  • WriteIOPS
  • ReadIOPS
  • ReadThroughput
  • WriteThroughput
  • DiskQueueDepth
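
To check these metrics around the time of the outage without opening the console, a sketch like the following might help. It assumes boto3 and configured credentials. The outlier rule (a value more than three times the median of the window, or less than one third of it for FreeableMemory) is an arbitrary illustration, not an AWS recommendation, and the function names are hypothetical.

```python
from statistics import median

# Metrics from the list above. FreeableMemory is checked for drops instead.
SPIKE_METRICS = [
    "DatabaseConnections", "CPUUtilization", "WriteIOPS", "ReadIOPS",
    "ReadThroughput", "WriteThroughput", "DiskQueueDepth",
]


def find_outliers(datapoints, stat="Maximum", factor=3.0, low=False):
    """Flag datapoints far from the median of the series.

    With low=False, flag values above factor x median (spikes).
    With low=True, flag values below median / factor (drops).
    """
    values = [point[stat] for point in datapoints]
    if not values:
        return []
    baseline = median(values)
    if low:
        flagged = [p for p in datapoints if p[stat] < baseline / factor]
    else:
        flagged = [p for p in datapoints if p[stat] > baseline * factor]
    return sorted(flagged, key=lambda p: p["Timestamp"])


def fetch_datapoints(db_instance_id, metric_name, start, end,
                     period=60, stat="Maximum"):
    """Fetch one statistic for one AWS/RDS metric.

    A single call returns at most 1,440 datapoints, so keep
    (end - start) / period within that limit.
    """
    import boto3  # imported lazily so find_outliers works without the SDK

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric_name,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        StartTime=start,
        EndTime=end,
        Period=period,
        Statistics=[stat],
    )
    return response["Datapoints"]
```

For FreeableMemory, fetch the Minimum statistic and call find_outliers with stat="Minimum" and low=True, because the signal of interest is a drop and not a spike.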

Enhanced Monitoring

Amazon RDS delivers metrics from Enhanced Monitoring into your Amazon CloudWatch Logs account. This provides metrics in real time for the operating system that your DB instance runs on. You can view all the system metrics and process information for your DB instances on the console.

You can set the granularity for the Enhanced Monitoring feature to 1, 5, 10, 15, 30, or 60 seconds.

To turn on Enhanced Monitoring for your Amazon RDS instance, see Setting up and enabling Enhanced Monitoring.
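
As an alternative to the console, the granularity can be set through the API. The following sketch assumes boto3, configured credentials, and an existing IAM role that allows RDS to publish Enhanced Monitoring metrics to CloudWatch Logs. The role ARN is supplied by you, and the function names are hypothetical.

```python
# Granularities, in seconds, listed in this section.
VALID_INTERVALS = (1, 5, 10, 15, 30, 60)


def build_monitoring_request(db_instance_id, interval_seconds, role_arn):
    """Build the arguments for modify_db_instance, validating the interval."""
    if interval_seconds not in VALID_INTERVALS:
        raise ValueError(
            f"interval must be one of {VALID_INTERVALS}, got {interval_seconds}"
        )
    return {
        "DBInstanceIdentifier": db_instance_id,
        "MonitoringInterval": interval_seconds,
        "MonitoringRoleArn": role_arn,
        "ApplyImmediately": True,
    }


def enable_enhanced_monitoring(db_instance_id, interval_seconds, role_arn):
    """Turn on Enhanced Monitoring at the requested granularity."""
    import boto3  # imported lazily so the builder works without the SDK

    rds = boto3.client("rds")
    request = build_monitoring_request(db_instance_id, interval_seconds, role_arn)
    return rds.modify_db_instance(**request)
```

Validating the interval locally gives a clear error before any API call is made.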

Performance Insights

The Performance Insights dashboard contains information related to database performance that can help you analyze and troubleshoot performance issues. You can also identify queries and wait events that consume excessive resources on your DB instance. Performance Insights collects data at the database level and displays the data on the Performance Insights dashboard. For more information, see Monitoring DB load with Performance Insights on Amazon RDS. When the increase in resource consumption comes from the application side, use the Support ID from your Performance Insights dashboard to match it to the corresponding query. It's a best practice to use this information, with guidance from your DBA, to tune the query and optimize your workload. To display the Support ID, complete the following steps:

  1. Open the Amazon RDS console.
  2. In the navigation pane, choose Performance Insights.
  3. On the Performance Insights page, select your DB instance. You can view the Performance Insights dashboard for this DB instance.
  4. Select the time frame when the issue occurred.
  5. Choose the Top SQL tab.
  6. Choose the settings icon, and then turn on Support ID.
  7. Choose Save.
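
The top SQL statements for the same time frame can also be retrieved through the Performance Insights API. The following is a sketch only, assuming boto3, configured credentials, and Performance Insights turned on for the instance. The identifier must be the instance's resource ID (the value that starts with "db-"), not the DB instance name. The dimension names db.sql.id and db.sql.statement reflect my understanding of the API response and should be verified against your own output. The function names are hypothetical.

```python
def rank_sql(keys, top_n=5):
    """Sort dimension keys by total load and return (sql_id, load, statement)."""
    ranked = sorted(keys, key=lambda key: key["Total"], reverse=True)[:top_n]
    return [
        (
            key["Dimensions"].get("db.sql.id"),
            key["Total"],
            key["Dimensions"].get("db.sql.statement"),
        )
        for key in ranked
    ]


def fetch_top_sql(resource_id, start, end, limit=10):
    """Return the SQL statements that contributed most to DB load."""
    import boto3  # imported lazily so rank_sql works without the SDK

    pi = boto3.client("pi")
    response = pi.describe_dimension_keys(
        ServiceType="RDS",
        Identifier=resource_id,  # for example "db-ABCDEFGHIJKL" (placeholder)
        StartTime=start,
        EndTime=end,
        Metric="db.load.avg",
        GroupBy={"Group": "db.sql", "Limit": limit},
    )
    return rank_sql(response["Keys"], top_n=limit)
```

Pass the start and end of the outage window as timezone-aware datetime values to narrow the result to the statements that were active when the issue occurred.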

RDS database logs

To troubleshoot the cause of the outage for your Amazon RDS DB instance, you can view, download, or watch database log files using the Amazon RDS console or Amazon RDS API operations. You can also query the database log files that are loaded into database tables. For more information, see Monitoring Amazon RDS log files.
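
The following sketch shows one way to download a log file through the API and scan it for lines that hint at a restart. It assumes boto3 and configured credentials. The keyword list is an illustrative guess, because the exact log wording depends on the database engine, and the function names are hypothetical.

```python
import re

# Illustrative keywords only. Adjust them for your database engine.
KEYWORDS = ("shutdown", "crash", "out of memory", "recovery", "restart")


def find_suspect_lines(log_text, keywords=KEYWORDS):
    """Return log lines that contain any keyword, ignoring case."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [line for line in log_text.splitlines() if pattern.search(line)]


def list_log_files(db_instance_id):
    """Return the names of the log files available for the instance."""
    import boto3  # imported lazily so find_suspect_lines works without the SDK

    rds = boto3.client("rds")
    pages = rds.get_paginator("describe_db_log_files").paginate(
        DBInstanceIdentifier=db_instance_id
    )
    return [f["LogFileName"] for page in pages for f in page["DescribeDBLogFiles"]]


def download_log(db_instance_id, log_file_name):
    """Download a whole log file by following the pagination marker."""
    import boto3

    rds = boto3.client("rds")
    chunks, marker = [], "0"  # "0" starts at the beginning of the file
    while True:
        response = rds.download_db_log_file_portion(
            DBInstanceIdentifier=db_instance_id,
            LogFileName=log_file_name,
            Marker=marker,
        )
        chunks.append(response.get("LogFileData") or "")
        if not response.get("AdditionalDataPending"):
            break
        marker = response["Marker"]
    return "".join(chunks)
```

For example, find_suspect_lines(download_log(db_instance_id, name)) for each name returned by list_log_files gives a short list of lines to compare against the event timestamps.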

Keep the following best practices in mind when dealing with RDS instance outages:

  • Enable Multi-AZ deployment on your instance to reduce downtime during an outage. With a Multi-AZ deployment, RDS automatically provisions and maintains either one synchronous standby replica in a different Availability Zone or two readable standbys. For more information, see Amazon RDS Multi-AZ.
  • Adjust the DB instance maintenance window according to your preference. The DB instance is unavailable during this time only if the system changes, such as a change in DB instance class, are being applied and require an outage, and for only the minimum amount of time required to make the necessary changes. For more information, see Maintaining a DB instance. If you don’t want your instances to go through automatic minor version upgrades, you can turn off this option. For more information, see Automatically upgrading the minor engine version.
  • Be sure that you have enough resources allocated to your database to run queries. With Amazon RDS, the amount of resources allocated depends on the instance type. Also, certain queries, such as stored procedures, might consume an unbounded amount of memory. Therefore, if the instance restarts frequently due to lack of resources, consider scaling up your database instance class to keep up with the increasing demands of your applications.
  • To avoid instance throttling, configure Amazon CloudWatch alarms on RDS key metrics that indicate the availability and health status of your RDS instances. For example, you can set a CloudWatch alarm on the FreeableMemory metric so that you receive a notification when memory usage reaches 95%, that is, when free memory drops to 5% of the instance memory. It's a best practice to keep at least 5% of the instance memory free. For more information, see How can I filter Enhanced Monitoring CloudWatch logs to generate automated custom metrics for Amazon RDS?
  • To be notified whenever there is a failover on your RDS instance, subscribe to Amazon RDS event notifications. For more information, see How do I create an Amazon RDS event subscription?
  • To optimize database performance, make sure that your queries are properly tuned. Otherwise, you might experience performance issues and extended wait times.
  • To troubleshoot load on CPU, memory, or any other resource, see How can I troubleshoot high CPU utilization for Amazon RDS or Amazon Aurora PostgreSQL?
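
Two of the practices above, the FreeableMemory alarm and the failover notification, can be set up with a short script. The following is a minimal sketch, assuming boto3, configured credentials, and an existing Amazon SNS topic. FreeableMemory is reported in bytes, so the 5% guidance is converted to a byte threshold from the memory size of your instance class, which you supply. The alarm name, subscription name, evaluation settings, and function names are all hypothetical choices.

```python
def free_memory_threshold_bytes(instance_memory_gib, free_percent=5):
    """Convert 'keep N% of memory free' into a FreeableMemory value in bytes."""
    return instance_memory_gib * 1024**3 * free_percent // 100


def build_memory_alarm(db_instance_id, instance_memory_gib, sns_topic_arn):
    """Arguments for put_metric_alarm: notify when free memory is 5% or less."""
    return {
        "AlarmName": f"{db_instance_id}-low-freeable-memory",
        "Namespace": "AWS/RDS",
        "MetricName": "FreeableMemory",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",
        "Period": 300,           # seconds per evaluation period (assumed)
        "EvaluationPeriods": 3,  # consecutive breaching periods (assumed)
        "Threshold": float(free_memory_threshold_bytes(instance_memory_gib)),
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def build_failover_subscription(db_instance_id, sns_topic_arn):
    """Arguments for create_event_subscription covering outage-related events."""
    return {
        "SubscriptionName": f"{db_instance_id}-failover-events",
        "SnsTopicArn": sns_topic_arn,
        "SourceType": "db-instance",
        "SourceIds": [db_instance_id],
        "EventCategories": ["failover", "failure", "availability", "recovery"],
        "Enabled": True,
    }


def apply_notifications(db_instance_id, instance_memory_gib, sns_topic_arn):
    """Create the alarm and the event subscription."""
    import boto3  # imported lazily so the builders work without the SDK

    boto3.client("cloudwatch").put_metric_alarm(
        **build_memory_alarm(db_instance_id, instance_memory_gib, sns_topic_arn)
    )
    boto3.client("rds").create_event_subscription(
        **build_failover_subscription(db_instance_id, sns_topic_arn)
    )
```

For an instance class with 16 GiB of memory, the threshold works out to about 0.8 GiB of free memory. Tune the period and evaluation count to how quickly you want to be notified.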