How do I resolve HTTP 504 errors in Amazon EKS?

Last updated: 2020-02-17

I get HTTP 504 errors when I connect to my Kubernetes service through a Classic Load Balancer or Application Load Balancer in Amazon Elastic Kubernetes Service (Amazon EKS).

Short Description

Your HTTP 504 errors could be caused by the following:

  • The load balancer established a connection to the target, but the target didn't respond before the idle timeout period elapsed. By default, the idle timeout for the Classic Load Balancer and Application Load Balancer is 60 seconds.
  • The load balancer failed to establish a connection to the backend target before the connection timeout expired (10 seconds).
  • The network access control list (ACL) for the subnet doesn't allow traffic from the targets to the load balancer nodes on the ephemeral ports (1024-65535).

Resolution

Verify that your load balancer’s idle timeout is set correctly

1.    Review the Amazon CloudWatch metrics for your Classic Load Balancer or Application Load Balancer.

If the Latency (Classic Load Balancer) or TargetResponseTime (Application Load Balancer) data points are equal to your currently configured load balancer idle timeout value, and there are data points in the HTTPCode_ELB_5XX metric, then at least one request has timed out.
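For example, you can pull recent Latency statistics for a Classic Load Balancer with the AWS CLI. The load balancer name and time range below are placeholders; substitute your own values:

```shell
# Placeholder load balancer name and time range; adjust for your environment.
aws cloudwatch get-metric-statistics \
    --namespace AWS/ELB \
    --metric-name Latency \
    --dimensions Name=LoadBalancerName,Value=my-load-balancer \
    --start-time 2020-02-17T00:00:00Z \
    --end-time 2020-02-17T01:00:00Z \
    --period 60 \
    --statistics Maximum
```

If the Maximum values returned match your configured idle timeout, requests are likely timing out at the load balancer.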

2.    Modify the idle timeout for your load balancer so that the HTTP request can complete within the idle timeout period, or configure your application to respond more quickly.

To modify the idle timeout for your Classic Load Balancer, update the service definition to include the service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout annotation. For an example, see Other ELB annotations.
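As a minimal sketch, a Service manifest with this annotation might look like the following. The service name, selector, ports, and the 120-second timeout are illustrative values:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service                # illustrative name
  annotations:
    # Raise the Classic Load Balancer idle timeout to 120 seconds (example value)
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "120"
spec:
  type: LoadBalancer
  selector:
    app: my-app                   # illustrative selector
  ports:
    - port: 80
      targetPort: 8080
```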

To modify the idle timeout for your Application Load Balancer, update the Ingress definition to include the alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds annotation. For an example, see Ingress annotations.
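As a sketch, an Ingress manifest with this annotation might look like the following. The ingress name, backend service, and the 120-second timeout are illustrative values, and the manifest assumes the ALB ingress controller is installed in your cluster:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress                # illustrative name
  annotations:
    kubernetes.io/ingress.class: alb
    # Raise the Application Load Balancer idle timeout to 120 seconds (example value)
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=120
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service  # illustrative backend
                port:
                  number: 80
```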

Verify that your backend instances have no backend connection errors

If a backend instance closes a TCP connection to the load balancer before the load balancer has reached its idle timeout value, then the load balancer could fail to fulfill the request.

1.    Review the CloudWatch BackendConnectionErrors metric for your Classic Load Balancer, or the target group's TargetConnectionErrorCount metric for your Application Load Balancer.

2.    Enable keep-alive settings on your backend worker nodes or pods, and set the keep-alive timeout to a value greater than the load balancer’s idle timeout.

To see if the keep-alive timeout is less than the idle timeout, check the keep-alive values in your pods or on your worker nodes. See the following examples:

For pods:

$ kubectl exec your-pod-name -- sysctl \
    net.ipv4.tcp_keepalive_time \
    net.ipv4.tcp_keepalive_intvl \
    net.ipv4.tcp_keepalive_probes

For nodes:

$ sysctl \
    net.ipv4.tcp_keepalive_time \
    net.ipv4.tcp_keepalive_intvl \
    net.ipv4.tcp_keepalive_probes
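As a sketch, the following script compares the node's keep-alive time against the load balancer's idle timeout. The 60-second idle timeout is an assumed example value (the default); use your load balancer's actual setting:

```shell
#!/bin/sh
# Compare the node's TCP keep-alive time with the load balancer idle timeout.
# IDLE_TIMEOUT is an assumed example value; use your load balancer's actual setting.
IDLE_TIMEOUT=60
KEEPALIVE_TIME=$(cat /proc/sys/net/ipv4/tcp_keepalive_time)
if [ "$KEEPALIVE_TIME" -le "$IDLE_TIMEOUT" ]; then
    echo "tcp_keepalive_time (${KEEPALIVE_TIME}s) is not greater than the idle timeout (${IDLE_TIMEOUT}s)"
else
    echo "tcp_keepalive_time (${KEEPALIVE_TIME}s) is greater than the idle timeout (${IDLE_TIMEOUT}s)"
fi
```

If the keep-alive time is not greater than the idle timeout, the backend can close the connection before the load balancer does, which can surface as an HTTP 504.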

Verify that your backend targets can receive traffic from the load balancer over the ephemeral port range

You must configure your security groups and network ACLs to allow traffic to flow between the load balancer and the backend targets. Depending on the load balancer's target type, these targets can be instances or IP addresses (for example, pod IPs).

To configure the security groups for ephemeral port access, allow traffic between the security group of your worker nodes and pods and the security group of your load balancer over the ephemeral port range. For more information, see Working with Security Groups and Adding and Deleting Rules.
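As an illustration, the following AWS CLI command adds an inbound rule to a worker node security group that allows TCP traffic from the load balancer's security group over the ephemeral port range. Both security group IDs are placeholders:

```shell
# Placeholder security group IDs; replace with your worker node
# and load balancer security groups.
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 1024-65535 \
    --source-group sg-0fedcba9876543210
```

Because security groups are stateful, the return traffic is allowed automatically; network ACLs are stateless, so they must explicitly allow the ephemeral port range in both directions.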