Why did my Amazon Redshift cluster reboot outside of the maintenance window?

Last updated: 2020-08-19

My Amazon Redshift cluster restarted outside the maintenance window. Why did my cluster reboot?

Short description

An Amazon Redshift cluster is restarted outside of the maintenance window for the following reasons:

  • An issue with your Amazon Redshift cluster was detected.
  • A faulty node in the cluster was replaced.

To be notified about any cluster reboots outside of your maintenance window, create an event notification for your Amazon Redshift cluster.
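As a rough illustration, an event subscription can be created with the AWS SDK for Python (boto3) through the Redshift client's create_event_subscription call. The sketch below is a minimal example under stated assumptions: the helper names, the subscription name, the SNS topic ARN, and the cluster identifier are all hypothetical placeholders. No event category filter is set, so the subscription is not limited to particular categories; narrow it only after confirming which categories your reboot events use.

```python
def build_subscription_request(subscription_name, sns_topic_arn, cluster_ids):
    """Build keyword arguments for the Redshift CreateEventSubscription call.

    Hypothetical helper: EventCategories and Severity are left out on purpose,
    so the subscription is not restricted to particular categories or severities.
    """
    return {
        "SubscriptionName": subscription_name,
        "SnsTopicArn": sns_topic_arn,
        "SourceType": "cluster",          # subscribe to cluster-level events
        "SourceIds": list(cluster_ids),   # cluster identifiers to watch
        "Enabled": True,
    }


def create_reboot_subscription(client, subscription_name, sns_topic_arn, cluster_ids):
    """Create the subscription with a Redshift client supplied by the caller."""
    request = build_subscription_request(subscription_name, sns_topic_arn, cluster_ids)
    return client.create_event_subscription(**request)


# Example usage (assumes boto3 is installed and AWS credentials are configured;
# every value below is a placeholder):
#
#   import boto3
#   redshift = boto3.client("redshift")
#   create_reboot_subscription(
#       redshift,
#       "cluster-reboot-alerts",
#       "arn:aws:sns:us-east-1:123456789012:redshift-alerts",
#       ["my-cluster"],
#   )
```

The client is passed in as an argument so that the request can be inspected or tested without calling AWS.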

Resolution

An issue with your Amazon Redshift cluster was detected

Here are some common issues that can trigger a cluster reboot:

  • An out-of-memory (OOM) error on the leader node: A query that runs on a cluster after the cluster is upgraded to a newer version can cause an OOM exception, which triggers a cluster reboot. To resolve this, consider rolling back the patch or failed patch.
  • An OOM error resulting from an older driver version: If you use an older driver version and your cluster reboots frequently, download the latest JDBC driver version. Test the new driver version in your development environment before you use it in production.

A faulty node in the Amazon Redshift cluster was replaced

Each Amazon Redshift node runs on a separate Amazon Elastic Compute Cloud (Amazon EC2) instance. A failed node is an instance that fails to respond to any heartbeat signals sent during the monitoring process. Heartbeat signals periodically monitor the availability of compute nodes in your Amazon Redshift cluster.

These automated health checks try to recover the Amazon Redshift cluster when an issue is detected. When Amazon Redshift detects any hardware issues or failures, nodes are automatically replaced in the following maintenance window. Note that in some cases, faulty nodes must be replaced immediately to ensure the proper functioning of your cluster.

Here are some of the common causes of failed cluster nodes:

  • EC2 instance failure: When the underlying hardware of an EC2 instance is found to be faulty, the faulty node is then replaced to restore cluster performance. EC2 tags the underlying hardware as faulty if there is a lack of response or failure to pass any automated health checks.
  • Node replacement due to a faulty disk drive of a node: When an issue is detected with the disk on a node, Amazon Redshift either replaces the disk or restarts the node. If the Amazon Redshift cluster fails to recover, the node is replaced or scheduled to be replaced.
  • Internode communication failure: If there is a communication failure between the nodes, the control messages aren't received by a particular node at the specified time. Internode communication failures are caused by an intermittent network connection issue or an issue with the underlying host.
  • Discovery timeout: An automatic node replacement is triggered if a node or cluster can't be reached within the specified time.
  • Out-of-memory (OOM) exception: Heavy load on a particular node can cause OOM issues, triggering a node replacement.

Creating Amazon Redshift event notifications

To identify the cause of your cluster reboot, create an Amazon Redshift event notification that subscribes to cluster reboot events. The event notification also notifies you if the source was configured.
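Recent events can also be reviewed after a reboot has already happened. The following sketch uses the boto3 Redshift client's describe_events call to collect events for one cluster and then keeps the ones whose message mentions a reboot. The helper names and the keyword list are hypothetical, and matching on the message text is an assumption about how reboot events are worded, so adjust the keywords to the messages you actually see.

```python
def fetch_recent_events(client, cluster_id, hours=24):
    """Collect events for one cluster from the last `hours` hours."""
    request = {
        "SourceIdentifier": cluster_id,
        "SourceType": "cluster",
        "Duration": hours * 60,  # DescribeEvents expects the duration in minutes
    }
    events = []
    while True:
        page = client.describe_events(**request)
        events.extend(page.get("Events", []))
        marker = page.get("Marker")
        if not marker:
            return events
        request["Marker"] = marker  # request the next page of results


def filter_reboot_events(events, keywords=("reboot", "restart")):
    """Keep events whose message contains one of the keywords (case-insensitive).

    The keyword match is an assumption, not a documented event filter.
    """
    return [
        event
        for event in events
        if any(word in event.get("Message", "").lower() for word in keywords)
    ]


# Example usage (assumes boto3 is installed and AWS credentials are configured;
# "my-cluster" is a placeholder):
#
#   import boto3
#   redshift = boto3.client("redshift")
#   for event in filter_reboot_events(fetch_recent_events(redshift, "my-cluster")):
#       print(event["Date"], event["Message"])
```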