Why did my Amazon Redshift cluster reboot outside of the maintenance window?

Last updated: 2022-10-27

My Amazon Redshift cluster restarted outside the maintenance window. Why did my cluster reboot?

Short description

An Amazon Redshift cluster is restarted outside of the maintenance window for the following reasons:

  • An issue with your Amazon Redshift cluster was detected.
  • A faulty node in the cluster was replaced.

To be notified about any cluster reboots outside of your maintenance window, create an event notification for your Amazon Redshift cluster.

Resolution

An issue with your Amazon Redshift cluster was detected

Here are some common issues that can trigger a cluster reboot:

  • An out-of-memory (OOM) error on the leader node: A query that runs on a cluster that was upgraded to a newer version can cause an OOM exception, which initiates a cluster reboot. To resolve this, consider rolling back the patch or the failed patch.
  • An OOM error resulting from an older driver version: If you're using an older driver version and your cluster is experiencing frequent reboots, download the latest JDBC driver version. It's a best practice to test the driver version in your development environment before you use it in production.
  • Health check query failure: Amazon Redshift constantly monitors the availability of its components. When a health check fails, Amazon Redshift initiates a restart to bring the cluster back to a healthy state as soon as possible, which reduces the amount of downtime.

Prevent health check query failures

The most common health check failures happen when the cluster has long-running open transactions. When Amazon Redshift cleans up the memory associated with long-running transactions, that process can cause the cluster to lock up. To prevent these situations, it's a best practice to monitor unclosed transactions using the following queries.

For long-open connections, run the following example query:

select s.process as process_id,
       c.remotehost || ':' || c.remoteport as remote_address,
       s.user_name as username,
       s.db_name,
       s.starttime as session_start_time,
       i.starttime as start_query_time,
       datediff(s,i.starttime,getdate())%86400/3600||' hrs '||
       datediff(s,i.starttime,getdate())%3600/60||' mins '||
       datediff(s,i.starttime,getdate())%60||' secs ' as running_query_time,
       i.text as query
from stv_sessions s
left join pg_user u on u.usename = s.user_name
left join stl_connection_log c
          on c.pid = s.process
          and c.event = 'authenticated'
left join stv_inflight i
          on u.usesysid = i.userid
          and s.process = i.pid
where username <> 'rdsdb'
order by session_start_time desc;

For long-open transactions, run the following example query:

select *,
       datediff(s,txn_start,getdate())/86400||' days '||
       datediff(s,txn_start,getdate())%86400/3600||' hrs '||
       datediff(s,txn_start,getdate())%3600/60||' mins '||
       datediff(s,txn_start,getdate())%60||' secs'
from svv_transactions
where lockable_object_type='transactionid'
  and pid<>pg_backend_pid()
order by 3;
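
For routine monitoring, it can help to narrow the result to transactions that have been open longer than a chosen threshold. The following is a minimal sketch, assuming a one-hour threshold. The value 3600 is an illustrative assumption, not an AWS recommendation, so adjust it to your workload.

```sql
-- Sketch: list transactions that have been open for more than one hour.
-- The 3600-second threshold is an assumed, illustrative value.
select xid,
       pid,
       txn_owner,
       txn_db,
       txn_start,
       datediff(s, txn_start, getdate()) as open_seconds
from svv_transactions
where lockable_object_type = 'transactionid'
  and pid <> pg_backend_pid()                        -- skip the current session
  and datediff(s, txn_start, getdate()) > 3600       -- open longer than 1 hour
order by txn_start;                                  -- oldest transactions first
```

The xid and pid values in the output identify the transaction to review and the session that owns it.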

After you have this information, you can review the transactions that are still open by running the following query:

select * from svl_statementtext where xid = <xid> order by starttime, sequence;

To terminate idle sessions and free up the connections, use the PG_TERMINATE_BACKEND function.
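
As a minimal sketch of that step, the following statement terminates a single session by its process ID. The value 12345 is a placeholder assumption. Replace it with a process_id (or pid) value returned by the monitoring queries above. Note that terminating a session forces any transaction that is still running in it to roll back, so confirm that you have the right session first.

```sql
-- Sketch only: 12345 is a placeholder process ID, not a value from your cluster.
-- Use a process_id / pid value returned by the session or transaction queries.
select pg_terminate_backend(12345);
```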

A faulty node in the Amazon Redshift cluster was replaced

Each Amazon Redshift node runs on a separate Amazon Elastic Compute Cloud (Amazon EC2) instance. A failed node is an instance that fails to respond to any heartbeat signals sent during the monitoring process. Heartbeat signals periodically monitor the availability of compute nodes in your Amazon Redshift cluster.

These automated health checks try to recover the Amazon Redshift cluster when an issue is detected. When Amazon Redshift detects any hardware issues or failures, nodes are automatically replaced in the following maintenance window. Note that in some cases, faulty nodes must be replaced immediately to make sure that your cluster is performing properly.

Here are some of the common causes of failed cluster nodes:

  • EC2 instance failure: When the underlying hardware of an EC2 instance is found to be faulty, the faulty node is then replaced to restore cluster performance. EC2 tags the underlying hardware as faulty if there is a lack of response or failure to pass any automated health checks.
  • Node replacement due to a faulty disk drive of a node: When an issue is detected with the disk on a node, Amazon Redshift either replaces the disk or restarts the node. If the Amazon Redshift cluster fails to recover, the node is replaced or scheduled to be replaced.
  • Internode communication failure: If there is a communication failure between the nodes, then the control messages aren't received by a particular node at the specified time. Internode communication failures are caused by an intermittent network connection issue or an issue with the underlying host.
  • Discovery timeout: An automatic node replacement is triggered if a node or cluster can't be reached within the specified time.
  • Out-of-memory (OOM) exception: Heavy load on a particular node can cause OOM issues, triggering a node replacement.

Creating Amazon Redshift event notifications

To identify the cause of your cluster reboot, create an Amazon Redshift event notification and subscribe to cluster reboot events. The event notification also notifies you if the source was configured.