Why am I experiencing intermittent connectivity issues with my Amazon Redshift cluster?
Last updated: 2020-09-21
I'm experiencing intermittent connectivity issues when I try to connect to my Amazon Redshift cluster. Why is this happening and how do I troubleshoot this?
Intermittent connectivity issues in your Amazon Redshift cluster are caused by the following:
- Restricted access for a particular IP address or CIDR block
- Maintenance window updates
- Node failures or scheduled administration tasks
- Encryption key rotations
- Too many active network connections
- High CPU utilization of the leader node
- Client-side connection issues
Restricted access for a particular IP address or CIDR block
Check to see if there is restricted access for a particular IP address or CIDR block in your security group. Because of DHCP configuration, your client IP address can change, which can cause connectivity issues. Additionally, if you aren't using elastic IP addresses for your Amazon Redshift cluster, the AWS managed IP of your cluster nodes might change. For example, your IP address can change when you delete your cluster and then recreate it from a snapshot, or when you resume a paused cluster.
Note: Public IP addresses are rotated when the Amazon Redshift cluster is deleted and recreated. Private IP addresses change whenever nodes are replaced.
To resolve any network restrictions, consider the following approaches:
- If your application is caching the public IP address behind a cluster endpoint, be sure to use this endpoint for your Amazon Redshift connection. To be sure of stability and security in your network connection, avoid using a DNS cache for your connection.
- It's a best practice to use an elastic IP address for your Amazon Redshift cluster. An elastic IP address allows you to change your underlying configuration without affecting the IP address that clients use to connect to your cluster. This approach can be helpful for situations such as recovery after a failure. This approach is helpful if you are recovering a cluster after a failure. For more information, see Managing clusters in a VPC.
- If you're using a private IP address to connect to a leader node or compute node, be sure to use the new IP address. For example, if you performed SSH ingestion or have an EMR configuration that uses the compute node, update your settings with the new IP address. A new private IP address is granted to new nodes after a node replacement.
Maintenance window updates
Check the maintenance window for your Amazon Redshift cluster. During a maintenance window, your Amazon Redshift cluster is unable to process read or write operations. If a maintenance event is scheduled for a given week, it starts during the assigned 30-minute maintenance window. While Amazon Redshift is performing maintenance, any queries or other operations that are in progress are shut down. You can change the scheduled maintenance window from the Amazon Redshift console.
Node failures or scheduled administration tasks
From the Amazon Redshift console, check the Events tab for any node failures or scheduled administration tasks (such as a cluster resize or reboot).
If there is a hardware failure, Amazon Redshift might be unavailable for a short period, which can result in failed queries. When a query fails, you see an Events description such as the following:
"A hardware issue was detected on Amazon Redshift cluster [cluster name]. A replacement request was initiated at [time]."
Or, if an account administrator scheduled a restart or resize operation on your Amazon Redshift cluster, intermittent connectivity issues can occur. Your Events description then indicates the following:
"Cluster [cluster name] began restart at [time]." "Cluster [cluster name] completed restart at [time]."
Encryption key rotations
If your encryption key is enabled and the encryption key is being rotated, then your Amazon Redshift cluster is unavailable during this time. As a result, you receive the following error message:
"pg_query(): Query failed: SSL SYSCALL error: EOF detected"
The frequency of your key rotation depends on your environment's policies for data security and standards. Rotate the keys as often as needed or whenever the encrypted key might be compromised. Also, be sure to have a key management plan that supports both your security and cluster availability needs.
Too many active connections
In Amazon Redshift, all connections to your cluster are sent to the leader node, and there is a maximum limit for active connections. The maximum limit that your Amazon Redshift cluster can support is determined by node type (instead of node count).
When there are too many active connections in your Amazon Redshift cluster, you receive the following error:
"[Amazon](500310) Invalid operation: connection limit "500" exceeded for non-bootstrap users"
If you notice a spike in your database connections, there might be a number of idle connections in your Amazon Redshift cluster. To check the number of idle connections, run the following SQL query:
trim(a.user_name) as user_name, a.usesysid, a.starttime, datediff(s,a.starttime,sysdate) as session_dur, b.last_end, datediff(s,case when b.last_end is not null then b.last_end else a.starttime end,sysdate) idle_dur FROM (select starttime,process,u.usesysid,user_name from stv_sessions s, pg_user u where s.user_name = u.usename and u.usesysid>1 and process NOT IN (select pid from stv_inflight where userid>1 union select pid from stv_recents where status != 'Done' and userid>1) ) a LEFT OUTER JOIN (select userid,pid,max(endtime) as last_end from svl_statementtext where userid>1 and sequence=0 group by 1,2) b ON a.usesysid = b.userid AND a.process = b.pid WHERE (b.last_end > a.starttime OR b.last_end is null) ORDER BY idle_dur;
The output looks like this example:
process | user_name | usesysid | starttime | session_dur | last_end | idle_dur ---------+------------+----------+---------------------+-------------+----------+---------- 14684 | myuser | 100 | 2020-06-04 07:02:36 | 6 | | 6 (1 row)
When the idle connections are identified, the connection can be shut down using the following command syntax:
The output looks like this example:
pg_terminate_backend ---------------------- 1 (1 row)
High CPU utilization of the leader node
All clients connect to an Amazon Redshift cluster using a leader node. High CPU utilization of the leader node can result in intermittent connection issues.
If you try to connect to your Amazon Redshift cluster and the leader node is consuming high CPU, you receive the following error message:
"Error setting/closing connection"
To confirm whether your leader node has reached high CPU utilization, check the CPUUtilization metric in Amazon CloudWatch. For more information, see Amazon Redshift metrics.
Client-side connection issues
Check for a connection issue between the client (such as Workbench/J or PostgreSQL) and server (your Amazon Redshift cluster). A client-side connection reset might occur, if your client is trying to send a request from a port that has been released. As a result, the connection reset can cause intermittent connection issues.
To prevent these client-side connection issues, consider the following approaches:
- Use the keepalive feature in Amazon Redshift to check that the connection between the client and server are operating correctly. The keepalive feature also helps to prevent any connection links from being broken. To check or configure the values for keepalive, see Change TCP/IP timeout settings and Change DSN timeout settings.
- Check the maximum transition unit (MTU) if your queries appear to be running but hang in the SQL client tool. Sometimes, the queries fail to appear in Amazon Redshift because of a packet drop. A packet drop occurs when there are different MTU sizes in the network paths between two IP hosts. For more information about how to manage packet drop issues, see Queries appear to hang and sometimes fail to reach the cluster.