How do I perform the root cause analysis for a Multi-AZ failover and restart of my Amazon RDS instance?
Last updated: 2021-09-29
I want to know the root cause for the Multi-AZ failover and restart of my Amazon Relational Database Service (Amazon RDS) instance.
When you use Multi-AZ deployment for your database instance, Amazon RDS creates a primary DB instance in one Availability Zone that's associated with a subnet. Then, RDS creates a standby DB instance in a different Availability Zone that's associated with a different subnet. For more information, see High availability (Multi-AZ) for Amazon RDS.
Amazon RDS detects and automatically recovers from the most common failure scenarios for Multi-AZ deployments so that you can resume database operations as quickly as possible without administrative intervention. If you enabled the Multi-AZ configuration for your database instance, then Amazon RDS automatically switches to a standby replica in another Availability Zone in the event of a planned or unplanned outage of your DB instance. Amazon RDS automatically performs a failover in the event of any of the following:
- Loss of availability in primary Availability Zone
- Loss of network connectivity to primary
- Compute unit failure on primary
- Storage failure on primary
Check logs and metrics
Check the following to identify the root cause of the outage:
Events: To identify the root cause of an unplanned outage in your instance, view all the Amazon RDS events from the last 24 hours. Event timestamps are recorded in UTC/GMT by default. To retain events for a longer time, send the Amazon RDS events to Amazon CloudWatch Events. For more information, see Creating a rule that triggers on an Amazon RDS event.
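As a minimal sketch of pulling those events programmatically, the following Python helper calls the RDS `describe_events` API through a client object that you pass in (for example, one created with `boto3.client("rds")`). The function name `recent_rds_events` and the instance identifier `my-db-instance` are placeholders chosen for illustration, not names from this article.

```python
def recent_rds_events(rds_client, db_instance_id, minutes=24 * 60):
    """Return (timestamp, message) pairs for one DB instance, oldest first."""
    kwargs = {
        "SourceIdentifier": db_instance_id,
        "SourceType": "db-instance",
        "Duration": minutes,  # look-back window, in minutes
    }
    events = []
    while True:
        page = rds_client.describe_events(**kwargs)
        for event in page.get("Events", []):
            events.append((event["Date"], event["Message"]))
        marker = page.get("Marker")
        if not marker:
            break
        kwargs["Marker"] = marker  # request the next page of results
    return sorted(events)


# Example usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   for when, message in recent_rds_events(boto3.client("rds"), "my-db-instance"):
#       print(when.isoformat(), message)
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before you run it against a real account.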
CloudWatch metrics: View the CloudWatch metrics for your Amazon RDS instance to check whether a database load issue caused the outage. For more information, see Viewing Amazon RDS metrics and dimensions.
View the following metrics and check for throttling:
- Write Latency
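To inspect a metric such as WriteLatency around the time of the outage, one option is the CloudWatch `get_metric_statistics` API. The sketch below only builds the request arguments and picks the highest datapoint from a response. The helper names, the three-hour window, and the 60-second period are assumptions made for illustration.

```python
from datetime import timedelta


def metric_request(db_instance_id, metric_name, end_time, hours=3, period=60):
    """Build get_metric_statistics arguments for one AWS/RDS metric."""
    return {
        "Namespace": "AWS/RDS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "StartTime": end_time - timedelta(hours=hours),
        "EndTime": end_time,
        "Period": period,  # seconds per datapoint
        "Statistics": ["Average", "Maximum"],
    }


def peak_datapoint(datapoints):
    """Return the datapoint with the highest Maximum, or None if there are none."""
    return max(datapoints, key=lambda point: point["Maximum"], default=None)


# Example usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   from datetime import datetime, timezone
#   cloudwatch = boto3.client("cloudwatch")
#   request = metric_request("my-db-instance", "WriteLatency", datetime.now(timezone.utc))
#   response = cloudwatch.get_metric_statistics(**request)
#   print(peak_datapoint(response["Datapoints"]))
```

Set `end_time` to shortly after the failover timestamp from the event log so that the window covers the period leading up to the outage.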
Enhanced Monitoring: Amazon RDS delivers metrics from Enhanced Monitoring into your Amazon CloudWatch Logs account. This provides metrics in real time for the operating system (OS) that your DB instance runs on. You can view all the system metrics and process information for your DB instances on the console.
You can set the granularity for the Enhanced Monitoring feature to 1, 5, 10, 15, 30, or 60 seconds.
To turn on Enhanced Monitoring for your Amazon RDS instance, see Setting up and enabling Enhanced Monitoring.
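If you prefer to script this setting, the RDS `modify_db_instance` API accepts `MonitoringInterval` and `MonitoringRoleArn` parameters. The following sketch validates the granularity against the values listed above and builds the arguments. The helper name and the role ARN in the usage comment are hypothetical placeholders.

```python
VALID_INTERVALS = (1, 5, 10, 15, 30, 60)  # Enhanced Monitoring granularity values


def enhanced_monitoring_args(db_instance_id, interval, role_arn):
    """Build modify_db_instance arguments that turn on Enhanced Monitoring."""
    if interval not in VALID_INTERVALS:
        raise ValueError(f"interval must be one of {VALID_INTERVALS}, got {interval!r}")
    return {
        "DBInstanceIdentifier": db_instance_id,
        "MonitoringInterval": interval,
        "MonitoringRoleArn": role_arn,  # IAM role that lets RDS publish the metrics
        "ApplyImmediately": True,
    }


# Example usage (assumes boto3 is installed and AWS credentials are configured;
# the role ARN is a placeholder):
#   import boto3
#   args = enhanced_monitoring_args(
#       "my-db-instance", 5, "arn:aws:iam::123456789012:role/my-monitoring-role")
#   boto3.client("rds").modify_db_instance(**args)
```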
Performance Insights: With the Performance Insights dashboard, you can visualize the database load and filter the load by waits, SQL statements, hosts, or users. The dashboard contains information related to database performance that can help you to analyze and troubleshoot performance issues. After turning on the Performance Insights feature for your DB instance, you can view information about the database load on the main dashboard page.
To view the Performance Insights dashboard for your instance, do the following:
- Open the Amazon RDS console.
- In the navigation pane, choose Performance Insights.
- On the Performance Insights page, select your DB instance.
You can view the Performance Insights dashboard for this DB instance.
If you turned on Performance Insights for your instance, then you can also view the dashboard by choosing the Sessions item in the list of DB instances.
For more information, see Opening the Performance Insights dashboard.
Logs & Events: To troubleshoot the cause of the outage for your Amazon RDS for Oracle DB instance, view the alert logs located in the Logs & Events tab of your instance.
Identify the causes for the outage
The most common failover reasons in the event log in a Multi-AZ environment are the following:
- The primary host of the RDS Multi-AZ instance is unhealthy: This reason indicates a transient underlying hardware issue that led to the loss of communication to the primary instance. This issue might have rendered the instance unhealthy, because the RDS monitoring system couldn't communicate with the RDS instance to perform the health checks.
- The primary host of the RDS Multi-AZ instance is unreachable due to loss of network connectivity: This reason indicates that the Multi-AZ failover was caused by a transient network issue that affected the primary host of your Multi-AZ deployment. The internal monitoring system detected this issue and proactively initiated a failover.
- The RDS Multi-AZ primary instance is busy and unresponsive, The Multi-AZ instance activation started, or The Multi-AZ instance activation completed: The event log shows these messages in the following situations:
  - The primary DB instance is unresponsive.
  - A memory crunch in the database prevented the RDS monitoring system from contacting the underlying host.
  - The DB instance experienced intermittent network issues with the underlying host.
  - The instance experienced a heavy database load. In this case, you might notice spikes in CPUUtilization and DatabaseConnections and depletion of FreeableMemory.
  Note: To avoid failover and restart of your RDS instances due to database overload, configure the memory parameters on the database instance appropriately.
- The storage volume underlying the primary host of the RDS Multi-AZ instance experienced a failure: This message indicates that the underlying storage hardware experienced an issue that led to elevated latency on the Amazon Elastic Block Store (Amazon EBS) volume. The primary host detected a performance degradation and entered a failed state. As a proactive measure, the monitoring system initiated a failover to the secondary instance.
- The RDS instance was modified by customer: This message indicates that the failover was initiated by an RDS instance modification.
- The user requested a failover of the DB instance: This message indicates that you rebooted the instance and chose Reboot with failover.
For more information, see Failover process for Amazon RDS.
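When you review many events, it can help to map each message to one of the causes listed above. The following sketch does a simple case-insensitive substring match. The fragments are taken from the wording in this article, so treat them as assumptions and adjust them if the messages in your event log are phrased differently.

```python
# (message fragment, likely cause) pairs, based on the reasons listed above.
FAILOVER_CAUSES = [
    ("unreachable due to loss of network connectivity",
     "transient network issue on the primary host"),
    ("primary host of the rds multi-az instance is unhealthy",
     "transient hardware issue on the primary host"),
    ("busy and unresponsive",
     "unresponsive primary, possibly from memory pressure or heavy load"),
    ("storage volume underlying the primary host",
     "storage issue that raised EBS volume latency"),
    ("modified by customer",
     "failover initiated by an instance modification"),
    ("user requested a failover",
     "reboot with failover requested by the user"),
]


def classify_failover(message):
    """Return a likely cause for an RDS event message, or None if unrecognized."""
    text = message.lower()
    for fragment, cause in FAILOVER_CAUSES:
        if fragment in text:
            return cause
    return None


if __name__ == "__main__":
    sample = "The user requested a failover of the DB instance."
    print(classify_failover(sample))
```

A result of `None` means the message matched none of the listed reasons and needs a manual look.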
Note: To be notified whenever there is a failover on your RDS instance, subscribe to Amazon RDS event notifications. For more information, see How do I create an Amazon RDS event subscription?
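As a sketch of setting up such a notification from code, the RDS `create_event_subscription` API takes an Amazon SNS topic, a source type, and a list of event categories. The helper below builds arguments for a subscription limited to the failover category. The helper name, subscription name, and topic ARN are hypothetical placeholders.

```python
def failover_subscription_args(name, sns_topic_arn, db_instance_ids):
    """Build create_event_subscription arguments for failover events only."""
    return {
        "SubscriptionName": name,
        "SnsTopicArn": sns_topic_arn,
        "SourceType": "db-instance",
        "EventCategories": ["failover"],
        "SourceIds": list(db_instance_ids),  # DB instances to watch
        "Enabled": True,
    }


# Example usage (assumes boto3 is installed, AWS credentials are configured,
# and the SNS topic already exists; the ARN is a placeholder):
#   import boto3
#   args = failover_subscription_args(
#       "my-failover-alerts",
#       "arn:aws:sns:us-east-1:123456789012:my-topic",
#       ["my-db-instance"])
#   boto3.client("rds").create_event_subscription(**args)
```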