How can I troubleshoot connectivity issues when using NAT gateway on my private VPC?

Last updated: 2022-04-12

I'm using a NAT gateway to connect instances in a private subnet of my Amazon Virtual Private Cloud (Amazon VPC) to the internet. The instances have intermittent connection issues. How can I troubleshoot this?

Short description

Private subnet resources might experience intermittent connection timeouts for the following reasons:

  • Network access control list (ACL) rules.
  • ErrorPortAllocation error on the NAT gateway.
  • Client instance port exhaustion.

Private subnet resources might experience a sudden connection drop for the following reason:

  • IdleTimeoutCount error to release capacity.

Private subnet resources might experience slowness for the following reason:

  • Bandwidth limitation per NAT gateway.

Resolution

Private subnet resources are experiencing intermittent connection timeouts

Network ACL rules

Confirm that the network ACL associated with the public subnet where the NAT gateway resides allows traffic on the ephemeral port range (1024-65535). If the network ACL allows only a subset of the ephemeral port range and the instances in the private subnet use a source port outside of that subset, then the traffic is dropped. For more information on how to configure network ACLs, see Recommended network ACL rules for a VPC with public and private subnets (NAT).
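The coverage check described above can be sketched in a few lines of Python. This is a simplified illustration, not an AWS tool: the function name uncovered_ports and the plain (start, end) tuples are hypothetical, and the sketch ignores rule numbers, deny rules, protocols, and CIDR matching, all of which a real network ACL evaluation depends on.

```python
EPHEMERAL_RANGE = (1024, 65535)  # port range named in this article


def uncovered_ports(allowed_ranges, required=EPHEMERAL_RANGE):
    """Return the sub-ranges of `required` that no allowed range covers.

    `allowed_ranges` is a list of inclusive (start, end) port tuples,
    a hypothetical stand-in for the port ranges of the allow rules.
    """
    low, high = required
    gaps = []
    next_needed = low  # lowest required port not yet known to be covered
    for start, end in sorted(allowed_ranges):
        if end < next_needed:
            continue  # range lies entirely below what is still needed
        if start > high:
            break  # remaining ranges lie entirely above the required range
        if start > next_needed:
            gaps.append((next_needed, start - 1))
        next_needed = max(next_needed, end + 1)
        if next_needed > high:
            break
    if next_needed <= high:
        gaps.append((next_needed, high))
    return gaps


if __name__ == "__main__":
    # An ACL that allows only 32768-61000 leaves two gaps in 1024-65535.
    print(uncovered_ports([(32768, 61000)]))
```

An empty result means the allow rules cover the full ephemeral range. Any returned gap is a range of source ports for which traffic could be dropped.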

ErrorPortAllocation error on the NAT gateway

For more information on this error, see How do I resolve the ErrorPortAllocation error on my NAT gateway?

Client instance port exhaustion

Check if the client instances in the private subnet have reached their operating system-level connection limits. To see the number of active connections, run the netstat command:

Linux:

netstat -ano | grep ESTABLISHED | wc -l
netstat -ano | grep TIME_WAIT | wc -l

Windows:

netstat -ano | find /i /c "estab"
netstat -ano | find /i /c "TIME_WAIT"

If the preceding commands return a value near the size of the allowed local port range (the source ports available for client connections), then you might have port exhaustion.
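As a rough illustration of that comparison, the following Python sketch counts ESTABLISHED and TIME_WAIT entries in captured netstat output and compares the total with the size of the local port range. The function names, the 80 percent threshold, and the default range of 32768-60999 (the usual Linux default) are assumptions chosen for the example, not values from this article. Like the commands above, the count includes inbound connections, so treat the result as a coarse signal.

```python
from collections import Counter

COUNTED_STATES = ("ESTABLISHED", "TIME_WAIT")


def count_tcp_states(netstat_output):
    """Count TCP lines per state in the text printed by `netstat -ano`."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Keep only TCP rows; skip headers, UDP rows, and blank lines.
        if not fields or not fields[0].lower().startswith("tcp"):
            continue
        for field in fields:
            if field in COUNTED_STATES:
                counts[field] += 1
                break
    return counts


def near_exhaustion(counts, port_range=(32768, 60999), threshold=0.8):
    """Return True if counted connections reach `threshold` of the range size.

    `port_range` and `threshold` are assumptions; adjust them to the host.
    """
    available = port_range[1] - port_range[0] + 1
    in_use = counts["ESTABLISHED"] + counts["TIME_WAIT"]
    return in_use >= threshold * available


if __name__ == "__main__":
    import subprocess

    output = subprocess.run(
        ["netstat", "-ano"], capture_output=True, text=True
    ).stdout
    states = count_tcp_states(output)
    print(dict(states), "near exhaustion:", near_exhaustion(states))
```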

To reduce port exhaustion, do the following:

  • Resolve any application-level issues that drain the available connections.
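  • Increase the operating system local (ephemeral) port range. On Linux, for example, run the following command as root. To keep the change across reboots, also add the setting net.ipv4.ip_local_port_range = 1025 61000 to /etc/sysctl.conf.
sysctl -w net.ipv4.ip_local_port_range="1025 61000"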
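To see how much headroom the wider range adds, the following Python sketch computes the number of ports in a range written in the same "low high" form as the kernel setting. The helper name port_range_size is hypothetical, and the comparison value 32768 60999 is the common Linux default, which is an assumption about your instances.

```python
def port_range_size(setting):
    """Return the number of ports in a 'low high' range string (inclusive)."""
    low, high = (int(part) for part in setting.split())
    if low > high:
        raise ValueError("low port must not exceed high port")
    return high - low + 1


if __name__ == "__main__":
    print(port_range_size("32768 60999"))  # common Linux default
    print(port_range_size("1025 61000"))   # range used in this article
```

On Linux, the current value can be read from /proc/sys/net/ipv4/ip_local_port_range and passed to the same helper.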

Private subnet resources are experiencing sudden connection drops

IdleTimeoutCount error to release capacity

If a connection that's using a NAT gateway is idle for 350 seconds or more, then the connection times out and you see a spike in the IdleTimeoutCount metric. After a connection times out, the NAT gateway returns an RST packet to any resource behind the NAT gateway that attempts to continue the connection. The NAT gateway doesn't send a FIN packet.

Workaround for the IdleTimeoutCount error:

  • Use the IdleTimeoutCount metric in Amazon CloudWatch to monitor for increases in idle connections. Configure CloudWatch Contributor Insights to get visibility on the top contributors of clients with processes in the Idle state.
  • Close idle connections from clients to release capacity.
  • Initiate more traffic over the connection.
  • Turn on TCP keepalive on the instance with a value less than 350 seconds.
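For applications that create their own sockets, the keepalive workaround above might look like the following Python sketch. The function name enable_keepalive and the default values (300 seconds idle, 30 seconds between probes, 3 probes) are assumptions for illustration. The TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT options are platform-specific, so the sketch sets them only where Python exposes them.

```python
import socket

NAT_IDLE_TIMEOUT_SECONDS = 350  # idle timeout stated in this article


def enable_keepalive(sock, idle_seconds=300, interval_seconds=30, probes=3):
    """Turn on TCP keepalive so probes start before the NAT idle timeout."""
    if idle_seconds >= NAT_IDLE_TIMEOUT_SECONDS:
        raise ValueError("idle_seconds must be less than 350")
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options below are not available on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_seconds)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


if __name__ == "__main__":
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(client)
    print(client.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
    client.close()
```

Sockets that a library or driver opens on your behalf are not affected by this function. For those, use the library's own keepalive setting or the operating system defaults.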

Private subnet resources are experiencing slowness

Bandwidth limitations on the NAT gateway

  • A NAT gateway supports 5 Gbps of bandwidth and automatically scales up to 45 Gbps. If the combined network throughput of all the instances behind the NAT gateway reaches or exceeds 45 Gbps, then traffic slows.
  • Using CloudWatch metrics, bandwidth is calculated as: (BytesOutToDestination + BytesOutToSource + BytesInFromDestination + BytesInFromSource) * 8 / time period in seconds.
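The bandwidth formula above can be worked through with a short Python sketch. The function name is hypothetical. The sketch assumes that each argument is the total byte count (the Sum statistic) of the corresponding CloudWatch metric over the same period, and that 1 Gbps means 10^9 bits per second.

```python
def nat_gateway_bandwidth_gbps(
    bytes_out_to_destination,
    bytes_out_to_source,
    bytes_in_from_destination,
    bytes_in_from_source,
    period_seconds,
):
    """Apply the article's formula and return the result in Gbps."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    total_bytes = (
        bytes_out_to_destination
        + bytes_out_to_source
        + bytes_in_from_destination
        + bytes_in_from_source
    )
    bits_per_second = total_bytes * 8 / period_seconds
    return bits_per_second / 1e9  # assumes 1 Gbps = 10**9 bits per second


if __name__ == "__main__":
    # Example values: four metrics of 7.5 GB each over a 60-second period.
    gbps = nat_gateway_bandwidth_gbps(7.5e9, 7.5e9, 7.5e9, 7.5e9, 60)
    print(f"{gbps:.2f} Gbps")
```

In this example the four metrics total 30 GB over 60 seconds, which is 240 gigabits in 60 seconds, or 4 Gbps.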

Workaround for bandwidth limitations per NAT gateway:

If you need more than 45 Gbps of bandwidth, then you can split the resources between multiple subnets and create multiple NAT gateways.