How can I troubleshoot pod status in Amazon EKS?

Last updated: 2020-02-13

My Amazon Elastic Kubernetes Service (Amazon EKS) pods that are running on Amazon Elastic Compute Cloud (Amazon EC2) instances or on a managed node group are stuck. How can I get my pods in the Running state?

Resolution

Important: The following steps apply only to pods launched on Amazon EC2 instances or on a managed node group. The steps don't apply to pods launched on AWS Fargate.

Find out the status of your pod

1.    To get information from the Events history of your pod, run the following command:

$ kubectl describe pod YOUR_POD_NAME

Note: The example commands in the following steps use the default namespace. For other namespaces, append -n YOURNAMESPACE to the command.
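
For a quick overview of the status of all pods before you describe a specific one, you can also run the following command:

$ kubectl get pods -o wide

The STATUS column shows states such as Pending, ContainerCreating, or CrashLoopBackOff, and the NODE column shows where each pod is scheduled.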

2.    Based on the status of your pod, complete the steps in one of the following sections: Your pod is in the Pending state, Your pod is in the Waiting state, or Your pod is in the CrashLoopBackOff state.

Your pod is in the Pending state

Pods in the Pending state can't be scheduled onto a node.

Your pod can be in the Pending state because there are insufficient resources on the available worker nodes, or because you've defined a hostPort on the pod that's already in use.

If you have insufficient resources on the available worker nodes, then delete unnecessary pods, or add more worker nodes. For example, your worker nodes can run out of CPU and memory. If this is a recurring issue, you can use the Kubernetes Cluster Autoscaler to automatically scale your worker node group when resources in your cluster are scarce.
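
To see how much CPU and memory the pods on each worker node have already requested, you can check the Allocated resources section of the node description. For example:

$ kubectl describe nodes | grep -A 7 "Allocated resources"

If the remaining allocatable CPU or memory on every node is smaller than your pod's resource requests, the pod stays in the Pending state until capacity is freed or added.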

If you're defining a hostPort for your pod, then consider the following:

(A) There are a limited number of places that a pod can be scheduled when you bind a pod to a hostPort.
(B) Don't specify a hostPort unless it's necessary, because the hostIP, hostPort, and protocol combination must be unique.
(C) If you must specify hostPort, then plan to schedule no more pods than there are worker nodes, as shown in the example spec after this list.
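
The following is a minimal sketch of a pod spec that binds a hostPort. The pod name, image, and port values are placeholders for illustration only:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: hostport-example        # hypothetical name
spec:
  containers:
  - name: web
    image: nginx                # placeholder image
    ports:
    - containerPort: 80
      hostPort: 8080            # only one pod per node can bind this hostIP, hostPort, and protocol combination
EOF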

The following example shows the output of the describe command for my-nginx-12345abc6d-7e8fg, which is in the Pending state. The pod is unscheduled because of a resource constraint.

$ kubectl describe pod my-nginx-12345abc6d-7e8fg

Name:               my-nginx-12345abc6d-7e8fg
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             pod-template-hash=12345abc6d
                    run=my-nginx
Annotations:        kubernetes.io/psp: eks.privileged
Status:             Pending
...
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  7s (x6 over 5m58s)  default-scheduler  0/2 nodes are available: 1 Insufficient pods, 1 node(s) had taints that the pod didn't tolerate.
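
In this example output, the pod also doesn't tolerate a taint on one of the nodes. To list the taints on your worker nodes and compare them with your pod's tolerations, you can run the following command:

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints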

If your pods are still in the Pending state after trying the preceding steps, complete the steps in the Additional troubleshooting section.

Your pod is in the Waiting state

A pod in the Waiting state is scheduled on a worker node (for example, an Amazon EC2 instance), but can't run on that node.

Your pod can be in the Waiting state because of an incorrect Docker image or repository name, a lack of permissions, or because the image doesn't exist.

If you have the incorrect Docker image or repository name, then complete the following:

1.    Confirm that the image and repository names are correct by logging in to Docker Hub, Amazon Elastic Container Registry (Amazon ECR), or another container image repository.

2.    Compare the repository and image name from the registry with the repository and image name specified in the pod specification.

If the image doesn't exist or you lack permissions, then complete the following:

1.    Verify that the image specified is available in the repository and that the correct permissions are configured to allow the image to be pulled.

2.    To confirm that image pull is possible and to rule out general networking and repository permission issues, manually pull the image from the Amazon EKS worker nodes with Docker:

$ docker pull yourImageURI:yourImageTag
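
If the image is stored in a private Amazon ECR repository, authenticate Docker to the registry before you pull. The following is a sketch that assumes an AWS CLI version with the get-login-password command; the account ID, Region, repository, and tag are placeholders:

$ aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 111122223333.dkr.ecr.us-east-1.amazonaws.com
$ docker pull 111122223333.dkr.ecr.us-east-1.amazonaws.com/yourRepository:yourImageTag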

3.    To verify that the image exists, check that both the image and tag are present in either Docker Hub or Amazon ECR.
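
For Amazon ECR, you can list the images and tags in a repository to confirm that the tag you reference exists. The repository name and Region in the following command are placeholders:

$ aws ecr describe-images --repository-name yourRepository --region us-east-1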

Note: If you're using Amazon ECR, verify that the repository policy allows image pull for the NodeInstanceRole. Or, verify that the AmazonEC2ContainerRegistryReadOnly managed policy is attached to the NodeInstanceRole.
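
To check which managed policies are attached to the node instance role, you can run a command similar to the following, where the role name is a placeholder for your NodeInstanceRole:

$ aws iam list-attached-role-policies --role-name YOUR_NODE_INSTANCE_ROLE_NAME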

The following example shows a pod in the Pending state with its container in the Waiting state because of an image pull error:

$ kubectl describe po web-test

Name:               web-test
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               ip-192-168-6-51.us-east-2.compute.internal/192.168.6.51
Start Time:         Wed, 22 Jan 2020 08:18:16 +0200
Labels:             app=web-test
Annotations:        kubectl.kubernetes.io/last-applied-configuration:
                      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"app":"web-test"},"name":"web-test","namespace":"default"},"spec":{...
                    kubernetes.io/psp: eks.privileged
Status:             Pending
IP:                 192.168.1.143
Containers:
  web-test:
    Container ID:   
    Image:          somerandomnonexistentimage
    Image ID:       
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ErrImagePull
...
Events:
  Type     Reason            Age                 From                                                 Message
  ----     ------            ----                ----                                                 -------
  Normal   Scheduled         66s                 default-scheduler                                    Successfully assigned default/web-test to ip-192-168-6-51.us-east-2.compute.internal
  Normal   Pulling           14s (x3 over 65s)   kubelet, ip-192-168-6-51.us-east-2.compute.internal  Pulling image "somerandomnonexistentimage"
  Warning  Failed            14s (x3 over 55s)   kubelet, ip-192-168-6-51.us-east-2.compute.internal  Failed to pull image "somerandomnonexistentimage": rpc error: code = Unknown desc = Error response from daemon: pull access denied for somerandomnonexistentimage, repository does not exist or may require 'docker login'
  Warning  Failed            14s (x3 over 55s)   kubelet, ip-192-168-6-51.us-east-2.compute.internal  Error: ErrImagePull

If your pods are still in the Waiting state after trying the preceding steps, complete the steps in the Additional troubleshooting section.

Your pod is in the CrashLoopBackOff state

Pods stuck in CrashLoopBackOff are starting, crashing, starting again, and then crashing again repeatedly.

If you receive the "Back-Off restarting failed container" output message, then your container probably exited soon after Kubernetes started the container.

To look for errors in the logs of the current pod, run the following command:

$ kubectl logs YOUR_POD_NAME

To look for errors in the logs of the previous instance of the container that crashed, run the following command:

$ kubectl logs --previous YOUR_POD_NAME

Note: For a multi-container pod, you can append the container name at the end. For example:

$ kubectl logs POD_NAME CONTAINER_NAME
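
The exit code of the last terminated container can also help narrow down the cause. The following sketch prints it for the first container in the pod:

$ kubectl get pod YOUR_POD_NAME -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

For example, an exit code of 1 usually points to an application error, and 137 means that the container was killed (often by the out-of-memory killer).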

If the Liveness probe isn't returning a successful status, verify that the Liveness probe is configured correctly for the application. For more information, see Configure Probes.
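
The following is a minimal sketch of a pod with an httpGet liveness probe. The pod name, image, path, port, and timing values are placeholders that you must adapt to your application:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: liveness-example          # hypothetical name
spec:
  containers:
  - name: app
    image: YOUR_IMAGE_URI         # placeholder image
    livenessProbe:
      httpGet:
        path: /healthz            # hypothetical health check endpoint
        port: 80
      initialDelaySeconds: 10     # give the application time to start before the first probe
      periodSeconds: 5
      failureThreshold: 3         # restart the container after 3 consecutive failures
EOF

If initialDelaySeconds is shorter than your application's startup time, the container can be killed and restarted before it ever becomes healthy, which also results in the CrashLoopBackOff state.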

The following example shows a pod in a CrashLoopBackOff state because the application exits after starting:

$ kubectl describe po crash-app-6847947bf8-28rq6

Name:               crash-app-6847947bf8-28rq6
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               ip-192-168-6-51.us-east-2.compute.internal/192.168.6.51
Start Time:         Wed, 22 Jan 2020 08:42:20 +0200
Labels:             pod-template-hash=6847947bf8
                    run=crash-app
Annotations:        kubernetes.io/psp: eks.privileged
Status:             Running
IP:                 192.168.29.73
Controlled By:      ReplicaSet/crash-app-6847947bf8
Containers:
  main:
    Container ID:  docker://6aecdce22adf08de2dbcd48f5d3d8d4f00f8e86bddca03384e482e71b3c20442
    Image:         alpine
    Image ID:      docker-pullable://alpine@sha256:ab00606a42621fb68f2ed6ad3c88be54397f981a7b70a79db3d1172b11c4367d
    Port:          80/TCP
    Host Port:     0/TCP
    Command:
      /bin/sleep
      1
    State:          Waiting
      Reason:       CrashLoopBackOff
...
Events:
  Type     Reason     Age                From                                                 Message
  ----     ------     ----               ----                                                 -------
  Normal   Scheduled  47s                default-scheduler                                    Successfully assigned default/crash-app-6847947bf8-28rq6 to ip-192-168-6-51.us-east-2.compute.internal
  Normal   Pulling    28s (x3 over 46s)  kubelet, ip-192-168-6-51.us-east-2.compute.internal  Pulling image "alpine"
  Normal   Pulled     28s (x3 over 46s)  kubelet, ip-192-168-6-51.us-east-2.compute.internal  Successfully pulled image "alpine"
  Normal   Created    28s (x3 over 45s)  kubelet, ip-192-168-6-51.us-east-2.compute.internal  Created container main
  Normal   Started    28s (x3 over 45s)  kubelet, ip-192-168-6-51.us-east-2.compute.internal  Started container main
  Warning  BackOff    12s (x4 over 42s)  kubelet, ip-192-168-6-51.us-east-2.compute.internal  Back-off restarting failed container

If your pods are still in the CrashLoopBackOff state after trying the preceding steps, complete the steps in the Additional troubleshooting section.

Additional troubleshooting

If your pod is still stuck after completing steps in the previous sections, try the following steps:

1.    To confirm that worker nodes exist in the cluster and are in Ready status (which allows pods to be scheduled on them), run the following command:

$ kubectl get nodes

The output should look similar to the following:

NAME                                          STATUS   ROLES    AGE   VERSION
ip-192-168-6-51.us-east-2.compute.internal    Ready    <none>   25d   v1.14.6-eks-5047ed
ip-192-168-86-33.us-east-2.compute.internal   Ready    <none>   25d   v1.14.6-eks-5047ed

If there are no worker nodes in the cluster, then add worker nodes.

If the nodes are in NotReady status or can't join the cluster, see How can I change the status of my nodes from NotReady or Unknown status to Ready status?
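
To see why a node is reporting NotReady, you can inspect its conditions, for example:

$ kubectl describe node YOUR_NODE_NAME

Check the Conditions section of the output for reasons such as MemoryPressure, DiskPressure, or a kubelet that has stopped posting status.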

2.    To check the version of the Kubernetes cluster, run the following command:

$ kubectl version --short

The output should look similar to the following:

Client Version: v1.14.6-eks-5047ed
Server Version: v1.14.9-eks-c0eccc

3.    To check the version of the Kubernetes worker node, run the following command:

$ kubectl get node -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion

The output should look similar to the following:

NAME                                          VERSION
ip-192-168-6-51.us-east-2.compute.internal    v1.14.6-eks-5047ed
ip-192-168-86-33.us-east-2.compute.internal   v1.14.6-eks-5047ed

4.    Based on the output from steps 2 and 3, confirm that the Kubernetes server version for the cluster matches the version of the worker nodes within an acceptable version skew.

Important: The patch versions can be different (for example, v1.14.x for the cluster vs. v1.14.y for the worker node).

If the cluster and worker node versions are incompatible, create a new node group with eksctl (see the eksctl tab) or AWS CloudFormation (see the Self-managed nodes tab).

--or--

Create a new managed node group (Kubernetes: v1.14, platform: eks.3 and above) using a compatible Kubernetes version. Then, delete the node group with the incompatible Kubernetes version.
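
The following is a minimal eksctl sketch for creating a new node group; the cluster name, node group name, instance type, and node count are placeholders:

$ eksctl create nodegroup --cluster YOUR_CLUSTER_NAME --name new-nodegroup --node-type t3.medium --nodes 2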

5.    To confirm that the Kubernetes control plane can communicate with the worker nodes, verify firewall rules against the recommended rules in Amazon EKS Security Group Considerations, and then verify that the nodes are in Ready status.
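
To find the security groups that are associated with your cluster's control plane so that you can compare their rules with the recommended ones, you can run a command similar to the following (the cluster name is a placeholder):

$ aws eks describe-cluster --name YOUR_CLUSTER_NAME --query cluster.resourcesVpcConfig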

