How do I resolve a failed health check for a load balancer in Amazon EKS?

Last updated: 2021-12-13

My load balancer keeps failing the health check in my Amazon Elastic Kubernetes Service (Amazon EKS) cluster.

Short description

To troubleshoot health check issues with the load balancer in your Amazon EKS cluster, complete the steps in the following sections:

  • Check the status of the pod
  • Check the pod and service label selectors
  • Check for missing endpoints
  • Check the service traffic policy and cluster security groups for issues with Application Load Balancers
  • Verify that your service is configured for targetPort
  • Verify that your AWS Load Balancer Controller has the correct permissions
  • Check the ingress annotations for issues with Application Load Balancers
  • Check the Kubernetes Service annotations for issues with Network Load Balancers
  • Manually test a health check
  • Check the networking
  • Restart the kube-proxy

Resolution

Check the status of the pod

Check if the pod is in Running status and the containers in the pods are ready:

$ kubectl get pod -n YOUR_NAMESPACE

Note: Replace YOUR_NAMESPACE with your Kubernetes namespace.

Example output:

NAME                           READY   STATUS    RESTARTS   AGE
podname                        1/1     Running   0          16s

Note: If the application container in the pod isn't running, then the target doesn't answer the load balancer health check, and the check fails.
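A readiness probe on the container makes a failing in-pod check visible in the pod's READY column. The following is a minimal sketch of a container spec fragment; the container name, image, and probe path are hypothetical examples, not values from your cluster:

```yaml
# Fragment of a pod spec: the pod is marked Ready only when the probe succeeds
containers:
- name: app                # hypothetical container name
  image: nginx:1.21        # example image
  ports:
  - containerPort: 80
  readinessProbe:          # probe the same path that the load balancer checks
    httpGet:
      path: /              # replace with your application's health check path
      port: 80
    initialDelaySeconds: 5
    periodSeconds: 10
```

If the probe fails, kubectl describe pod shows the probe failure events, which often point to the same root cause as the load balancer health check failure.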

Check the pod and service label selectors

For pod labels, run the following command:

$ kubectl get pod -n YOUR_NAMESPACE --show-labels

Example output:

NAME                           READY   STATUS    RESTARTS   AGE     LABELS
alb-instance-6cc5cd9b9-prnxw   1/1     Running   0          2d19h   app=alb-instance,pod-template-hash=6cc5cd9b9

To verify that your Kubernetes Service is using the pod labels, run the following command to check that its output matches the pod labels:

$ kubectl get svc SERVICE_NAME -n YOUR_NAMESPACE -o=jsonpath='{.spec.selector}{"\n"}'

Note: Replace SERVICE_NAME with your Kubernetes Service and YOUR_NAMESPACE with your Kubernetes namespace.

Example output:

{"app":"alb-instance"}
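In manifest form, the pod template labels and the Service selector must match. The following sketch uses the names from the example output above:

```yaml
# Deployment: the pod template carries the label
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alb-instance
spec:
  selector:
    matchLabels:
      app: alb-instance
  template:
    metadata:
      labels:
        app: alb-instance     # must match the Service selector below
    spec:
      containers:
      - name: app             # hypothetical container name
        image: nginx          # example image
        ports:
        - containerPort: 80
---
# Service: the selector must match the pod labels above
apiVersion: v1
kind: Service
metadata:
  name: alb-instance
spec:
  type: NodePort
  selector:
    app: alb-instance         # mismatch here produces no endpoints
  ports:
  - name: http
    port: 80
    targetPort: 80
```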

Check for missing endpoints

The Kubernetes endpoints controller continuously scans for pods that match the Service's selector, and then posts the matches to an Endpoints object. If the selector doesn't match any pod labels, then no endpoints appear.

Run the following command:

$ kubectl describe svc SERVICE_NAME -n YOUR_NAMESPACE

Example output:

Name:                     alb-instance
Namespace:                default
Labels:                   <none>
Annotations:              <none>
Selector:                 app=alb-instance-1      
Type:                     NodePort
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.100.44.151
IPs:                      10.100.44.151
Port:                     http  80/TCP
TargetPort:               80/TCP
NodePort:                 http  32663/TCP
Endpoints:                <none>                 
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

Check if the endpoint is missing:

$ kubectl get endpoints SERVICE_NAME -n YOUR_NAMESPACE

Example output:

NAME           ENDPOINTS                                AGE
alb-instance   <none>                                   2d20h
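In the preceding output, the Service selector app=alb-instance-1 doesn't match the pod label app=alb-instance, so Endpoints is <none>. Correcting the selector in the Service manifest restores the endpoints. A sketch of the corrected Service, using the names from the example output:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: alb-instance
  namespace: default
spec:
  type: NodePort
  selector:
    app: alb-instance    # was app: alb-instance-1; now matches the pod label
  ports:
  - name: http
    port: 80
    targetPort: 80
```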

Check the service traffic policy and cluster security groups for issues with Application Load Balancers

Unhealthy targets in Application Load Balancer target groups usually occur for one of two reasons: either the service traffic policy, spec.externalTrafficPolicy, is set to Local instead of Cluster, or the node groups in the cluster are associated with different cluster security groups and traffic can't flow freely between the node groups.

Verify that the traffic policy is correctly configured:

$ kubectl get svc SERVICE_NAME -n YOUR_NAMESPACE -o=jsonpath='{.spec.externalTrafficPolicy}{"\n"}'

Example output:

Local

If the policy is set to Local, change it to Cluster:

$ kubectl edit svc SERVICE_NAME -n YOUR_NAMESPACE
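In the Service manifest, the setting is a single field under spec. A sketch of the relevant fragment:

```yaml
# Fragment of the Service spec
spec:
  type: NodePort
  externalTrafficPolicy: Cluster   # Cluster: any node answers the health check
                                   # Local: only nodes that run a pod respond
```

Note that Local preserves the client source IP but causes health checks to fail on nodes without a local pod, which is why targets on those nodes show as unhealthy.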

Check the cluster security groups

1.    Open the Amazon EC2 console.

2.    Select the healthy instance.

3.    Choose the Security tab and check the security group ingress rules.

4.    Select the unhealthy instance.

5.    Choose the Security tab and check the security group ingress rules.

If the security group for each instance is different, then modify the ingress rules in the Amazon EC2 console:

1.    From the Security tab, select the security group ID.

2.    Choose the Edit inbound rules button to modify ingress rules.

3.    Add inbound rules to allow traffic from the other node groups in the cluster.

Verify that your service is configured for targetPort

Your targetPort must match the containerPort in the pod that the service is sending traffic to.

To verify what your targetPort is configured to, run the following command:

$ kubectl get svc SERVICE_NAME -n YOUR_NAMESPACE -o=jsonpath="{.metadata.name}{'\t'}{.spec.ports[].targetPort}{'\t'}{.spec.ports[].protocol}{'\n'}"

Example output:

alb-instance	8080	TCP

In the preceding example output, the targetPort is configured to 8080. However, because the containerPort is set to 80, you must change the targetPort to 80.
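The fix is to point the Service's targetPort at the port that the container actually listens on. A sketch of the matching pair of fragments:

```yaml
# Fragment of the Service manifest
spec:
  ports:
  - name: http
    port: 80         # port that the Service exposes
    targetPort: 80   # must equal the containerPort below
---
# Fragment of the pod spec in the Deployment
spec:
  containers:
  - name: app             # hypothetical container name
    ports:
    - containerPort: 80   # port that the application listens on
```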

Verify that your AWS Load Balancer Controller has the correct permissions

The AWS Load Balancer Controller must have the correct permissions to update security groups to allow traffic from the load balancer to instances or pods. If the controller doesn't have the correct permissions, then you receive errors.

Check for errors in the AWS Load Balancer Controller deployment logs:

$ kubectl logs deploy/aws-load-balancer-controller -n kube-system

Check for errors in the individual controller pod logs:

$ kubectl logs CONTROLLER_POD_NAME -n YOUR_NAMESPACE

Note: Replace CONTROLLER_POD_NAME with your controller pod name and YOUR_NAMESPACE with your Kubernetes namespace.

Check the ingress annotations for issues with Application Load Balancers

For issues with the Application Load Balancer, check the Kubernetes ingress annotations:

$ kubectl describe ing INGRESS_NAME -n YOUR_NAMESPACE

Note: Replace INGRESS_NAME with the name of your Kubernetes Ingress and YOUR_NAMESPACE with your Kubernetes namespace.

Example output:

Name:             alb-instance-ingress
Namespace:        default
Address:          k8s-default-albinsta-fcb010af73-2014729787.ap-southeast-2.elb.amazonaws.com
Default backend:  alb-instance:80 (192.168.81.137:8080)
Rules:
  Host          Path  Backends
  ----          ----  --------
  awssite.cyou
                /   alb-instance:80 (192.168.81.137:8080)
Annotations:    alb.ingress.kubernetes.io/scheme: internet-facing        
                kubernetes.io/ingress.class: alb                         
Events:
  Type    Reason                  Age                  From     Message
  ----    ------                  ----                 ----     -------
  Normal  SuccessfullyReconciled  25m (x7 over 2d21h)  ingress  Successfully reconciled

To find ingress annotations that are specific to your use case, see Ingress annotations (from the Kubernetes website).
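A minimal Ingress manifest with the annotations from the example output. The host, service name, and port come from the output above; the healthcheck-path annotation is an optional addition that tells the controller which path the Application Load Balancer probes:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: alb-instance-ingress
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/healthcheck-path: /   # path the ALB probes
spec:
  rules:
  - host: awssite.cyou
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: alb-instance
            port:
              number: 80
```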

Check the Kubernetes Service annotations for issues with Network Load Balancers

For issues with the Network Load Balancer, check the Kubernetes Service annotations:

$ kubectl describe svc SERVICE_NAME -n YOUR_NAMESPACE

Example output:

Name:                     nlb-ip
Namespace:                default
Labels:                   <none>
Annotations:              service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip              
                          service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing          
                          service.beta.kubernetes.io/aws-load-balancer-type: external                   
Selector:                 app=nlb-ip
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.100.161.91
IPs:                      10.100.161.91
LoadBalancer Ingress:     k8s-default-nlbip-fff2442e46-ae4f8cf4a182dc4d.elb.ap-southeast-2.amazonaws.com
Port:                     http  80/TCP
TargetPort:               80/TCP
NodePort:                 http  31806/TCP
Endpoints:                192.168.93.144:80
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

Note: The IP address in the Endpoints field is the application pod IP. Take note of it, because you need it to manually test a health check.

To find Kubernetes Service annotations that are specific to your use case, see Service annotations (from the Kubernetes website).
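A minimal Service manifest with the Network Load Balancer annotations from the example output (a sketch; the name and selector come from the output above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nlb-ip
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
spec:
  type: LoadBalancer
  selector:
    app: nlb-ip
  ports:
  - name: http
    port: 80
    targetPort: 80
```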

Manually test a health check

Check your application pod IP address:

$ kubectl get pod -n YOUR_NAMESPACE -o wide

Run a test pod to manually test a health check from within the cluster:

$ kubectl run -n YOUR_NAMESPACE troubleshoot -it --rm --image=amazonlinux -- /bin/bash

For HTTP health checks:

# curl -Iv APPLICATION_POD_IP/HEALTH_CHECK_PATH

Note: Replace APPLICATION_POD_IP with your application pod IP and HEALTH_CHECK_PATH with the ALB Target group health check path.

Example command:

# curl -Iv 192.168.81.137

Example output:

* Trying 192.168.81.137:80...
* Connected to 192.168.81.137 (192.168.81.137) port 80 (#0)
> HEAD / HTTP/1.1
> Host: 192.168.81.137
> User-Agent: curl/7.78.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Server: nginx/1.21.3
Server: nginx/1.21.3
< Date: Tue, 26 Oct 2021 05:10:17 GMT
Date: Tue, 26 Oct 2021 05:10:17 GMT
< Content-Type: text/html
Content-Type: text/html
< Content-Length: 615
Content-Length: 615
< Last-Modified: Tue, 07 Sep 2021 15:21:03 GMT
Last-Modified: Tue, 07 Sep 2021 15:21:03 GMT
< Connection: keep-alive
Connection: keep-alive
< ETag: "6137835f-267"
ETag: "6137835f-267"
< Accept-Ranges: bytes
Accept-Ranges: bytes

< 
* Connection #0 to host 192.168.81.137 left intact

Check the HTTP response status code. If the response status code is 200 OK, then your application is responding correctly on the health check path.

If the HTTP response status code is 3xx or 4xx, then change your health check path to one that responds with 200 OK. For example, the following annotation sets the path:

alb.ingress.kubernetes.io/healthcheck-path: /ping

-or-

You can use the following annotation on the ingress resource to add a successful health check response status code range:

alb.ingress.kubernetes.io/success-codes: 200-399
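Both annotations belong in the Ingress metadata. A sketch of the fragment (the /ping path is only an example and must exist in your application):

```yaml
# Fragment of the Ingress manifest
metadata:
  annotations:
    alb.ingress.kubernetes.io/healthcheck-path: /ping   # example path
    alb.ingress.kubernetes.io/success-codes: 200-399    # accepted status codes
```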

For TCP health checks, first install netcat in the test pod:

# yum update -y && yum install -y nc

Test the TCP health checks:

# nc -z -v APPLICATION_POD_IP CONTAINER_PORT_NUMBER

Note: Replace APPLICATION_POD_IP with your application pod IP and CONTAINER_PORT_NUMBER with your container port.

Example command:

# nc -z -v 192.168.81.137 80

Example output:

Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 192.168.81.137:80.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.

Check the networking

For networking issues, verify the following:

  • The multiple node groups in the EKS cluster can freely communicate with each other
  • The network access control list (network ACL) that's associated with the subnet where your pods are running allows traffic from the load balancer subnet CIDR range
  • The network ACL that's associated with your load balancer subnet should allow return traffic on the ephemeral port range from the subnet where the pods are running
  • The route table allows local traffic from within the VPC CIDR range

Restart the kube-proxy

If the kube-proxy that runs on each node isn't behaving correctly, then it could fail to update the iptables rules for the service and endpoints. Restart the kube-proxy to force it to recheck and update iptables rules:

$ kubectl rollout restart daemonset.apps/kube-proxy -n kube-system

Example output:

daemonset.apps/kube-proxy restarted
