How can I troubleshoot pod status in Amazon EKS?

Last updated: 2021-11-09

My Amazon Elastic Kubernetes Service (Amazon EKS) pods that are running on Amazon Elastic Compute Cloud (Amazon EC2) instances or on a managed node group are stuck. I want to get my pods in the Running state.

Resolution

Important: The following steps apply only to pods launched on Amazon EC2 instances or a managed node group. The steps don't apply to pods launched on AWS Fargate.

Find out the status of your pod

1.    To check the status of your pod, run the following command:

$ kubectl get pod

2. To get information from the Events history of your pod, run the following command:

$ kubectl describe pod YOUR_POD_NAME

Note: The example commands in the following steps are run in the default namespace. For pods in other namespaces, append -n YOURNAMESPACE to the command.

3.    Based on the status of your pod, complete the steps in one of the following sections: Your pod is in the Pending state, Your pod is in the Waiting state, or Your pod is in the CrashLoopBackOff state.

Your pod is in the Pending state

Pods in the Pending state can't be scheduled onto a node.

Your pod might be stuck in the Pending state for one of the following reasons:

1. There are insufficient CPU or memory resources on the available worker nodes.

2. You defined a hostPort on the pod, and the port is already in use.

3. The nodes have taints that the pod doesn't tolerate.

4. There aren't enough worker nodes in Ready status in the cluster.

If you have insufficient CPU or memory resources on the available worker nodes, then delete unnecessary pods or add more worker nodes. If your worker nodes regularly run out of CPU and memory, use the Kubernetes Cluster Autoscaler to automatically scale your worker node group when resources in your cluster are scarce.

Insufficient CPU:

$ kubectl describe pod frontend-cpu
Name:         frontend-cpu
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
...
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  22s (x14 over 13m)  default-scheduler  0/3 nodes are available: 3 Insufficient cpu.

Insufficient memory:

$ kubectl describe pod frontend-memory
Name:         frontend-memory
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
...
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  80s (x14 over 15m)  default-scheduler  0/3 nodes are available: 3 Insufficient memory.
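
To see how much CPU and memory each worker node can allocate and how much is already requested, you can inspect the nodes directly. The following commands are a minimal sketch; kubectl top nodes works only if the Kubernetes Metrics Server is installed in the cluster:

# Show allocatable capacity and the resources already requested on each node
$ kubectl describe nodes | grep -A 8 "Allocated resources"

# Show current CPU and memory usage per node (requires the Metrics Server)
$ kubectl top nodes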

If you're defining a hostPort for your pod, then consider the following:

(A) There are a limited number of places that a pod can be scheduled when you bind a pod to a hostPort.
(B) Don't specify a hostPort unless it's necessary, because the hostIP, hostPort, and protocol combination must be unique.
(C) If you must specify hostPort, then schedule the same number of pods as there are worker nodes.

The following example shows the output of the describe command for frontend-port-77f67cff67-2bv7w, which is in the Pending state. The pod is unscheduled because the requested host port isn't available on any worker node in the cluster.

Port unavailable:

$ kubectl describe pod frontend-port-77f67cff67-2bv7w
Name:           frontend-port-77f67cff67-2bv7w
Namespace:      default
Priority:       0
Node:           <none>
Labels:         app=frontend-port
                pod-template-hash=77f67cff67
Annotations:    kubernetes.io/psp: eks.privileged
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/frontend-port-77f67cff67
Containers:
  app:
    Image:      nginx
    Port:       80/TCP
    Host Port:  80/TCP
...
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  11s (x7 over 6m22s)  default-scheduler  0/3 nodes are available: 3 node(s) didn't have free ports for the requested pod ports.
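
For reference, the hostPort that causes this kind of conflict is set on the container's port definition in the pod spec. The following is a minimal, hypothetical excerpt (the container name and image are examples only):

# Pod spec excerpt: binding container port 80 to host port 80
# Only one pod per node can bind a given hostIP, hostPort, and protocol combination.
spec:
  containers:
    - name: app
      image: nginx
      ports:
        - containerPort: 80
          hostPort: 80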


If pods can't be scheduled because the nodes have taints that the pod doesn't tolerate, then you see output similar to the following:

$ kubectl describe pod nginx
Name:         nginx
Namespace:    default
Priority:     0
Node:         <none>
Labels:       run=nginx
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
...
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  8s (x10 over 9m22s)  default-scheduler  0/3 nodes are available: 3 node(s) had taint {key1: value1}, that the pod didn't tolerate.

To check your nodes for taints, run the following command:

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
NAME                                                TAINTS
ip-192-168-4-78.ap-southeast-2.compute.internal     [map[effect:NoSchedule key:key1 value:value1]]
ip-192-168-56-162.ap-southeast-2.compute.internal   [map[effect:NoSchedule key:key1 value:value1]]
ip-192-168-91-249.ap-southeast-2.compute.internal   [map[effect:NoSchedule key:key1 value:value1]]

In this situation, if you want to retain the node taints, then specify a toleration for the pod in the PodSpec (see Taints and Tolerations on the Kubernetes website). Or, remove the taint from the node by appending "-" to the end of the taint, as in the following command:

$ kubectl taint nodes NODE_NAME key1=value1:NoSchedule-
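
If you keep the taint instead, a toleration in the pod spec lets the pod schedule onto the tainted nodes. The following is a minimal sketch that matches the key1=value1:NoSchedule taint shown in the preceding output (the container name and image are examples only):

# Pod spec excerpt: tolerate the key1=value1:NoSchedule taint
spec:
  tolerations:
    - key: "key1"
      operator: "Equal"
      value: "value1"
      effect: "NoSchedule"
  containers:
    - name: nginx
      image: nginx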

If your pods are still in the Pending state after trying the preceding steps, then complete the steps in the Additional troubleshooting section.

Your pod is in the Waiting state

A pod in the Waiting state is scheduled on a worker node (for example, an EC2 instance), but can't run on that node.

Your pod can be in the Waiting state because of an incorrect Docker image or incorrect repository name. Or, your pod could be in the Waiting state because the image doesn't exist or you lack permissions.

If you have the incorrect Docker image or repository name, then complete the following:

1.    Confirm that the image and repository name are correct by logging in to Docker Hub, Amazon Elastic Container Registry (Amazon ECR), or another container image repository.

2.    Compare the repository or image from the repository with the repository or image name specified in the pod specification.

If the image doesn't exist or you lack permissions, then complete the following:

1.    Verify that the image specified is available in the repository and that the correct permissions are configured to allow the image to be pulled.

2.    To confirm that image pull is possible and to rule out general networking and repository permission issues, manually pull the image. You must pull the image from the Amazon EKS worker nodes with Docker. For example:

$ docker pull yourImageURI:yourImageTag

3.    To verify that the image exists, check that both the image and tag are present in either Docker Hub or Amazon ECR.

Note: If you're using Amazon ECR, verify that the repository policy allows image pull for the NodeInstanceRole. Or, verify that the AmazonEC2ContainerRegistryReadOnly policy is attached to the worker node instance role.
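
To check the pull permissions from the AWS side, you can list the IAM policies attached to the worker node instance role and review the repository policy, if one is set. These AWS CLI commands are a sketch; the role and repository names are placeholders:

# List the managed policies attached to the worker node instance role
$ aws iam list-attached-role-policies --role-name YOUR_NODE_INSTANCE_ROLE

# Show the repository policy on the Amazon ECR repository, if one exists
$ aws ecr get-repository-policy --repository-name YOUR_REPOSITORY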

The following example shows a pod in the Pending state with the container in the Waiting state because of an image pull error:

$ kubectl describe po web-test

Name:               web-test
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               ip-192-168-6-51.us-east-2.compute.internal/192.168.6.51
Start Time:         Wed, 22 Jul 2021 08:18:16 +0200
Labels:             app=web-test
Annotations:        kubectl.kubernetes.io/last-applied-configuration:
                      {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"app":"web-test"},"name":"web-test","namespace":"default"},"spec":{...
                    kubernetes.io/psp: eks.privileged
Status:             Pending
IP:                 192.168.1.143
Containers:
  web-test:
    Container ID:   
    Image:          somerandomnonexistentimage
    Image ID:       
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ErrImagePull
...
Events:
  Type     Reason            Age                 From                                                 Message
  ----     ------            ----                ----                                                 -------
  Normal   Scheduled         66s                 default-scheduler                                    Successfully assigned default/web-test to ip-192-168-6-51.us-east-2.compute.internal
  Normal   Pulling           14s (x3 over 65s)   kubelet, ip-192-168-6-51.us-east-2.compute.internal  Pulling image "somerandomnonexistentimage"
  Warning  Failed            14s (x3 over 55s)   kubelet, ip-192-168-6-51.us-east-2.compute.internal  Failed to pull image "somerandomnonexistentimage": rpc error: code = Unknown desc = Error response from daemon: pull access denied for somerandomnonexistentimage, repository does not exist or may require 'docker login'
  Warning  Failed            14s (x3 over 55s)   kubelet, ip-192-168-6-51.us-east-2.compute.internal  Error: ErrImagePull

If your pods are still in the Waiting state after trying the preceding steps, then complete the steps in the Additional troubleshooting section.

Your pod is in the CrashLoopBackOff state

Pods stuck in CrashLoopBackOff are starting, crashing, starting again, and then crashing again repeatedly.

If you receive the "Back-Off restarting failed container" output message, then your container probably exited soon after Kubernetes started the container.

To look for errors in the logs of the current pod, run the following command:

$ kubectl logs YOUR_POD_NAME

To look for errors in the logs of the previous pod that crashed, run the following command:

$ kubectl logs --previous YOUR_POD_NAME

Note: For a multi-container pod, you can append the container name at the end. For example:

$ kubectl logs POD_NAME CONTAINER_NAME
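
If you're not sure of the container names in a multi-container pod, you can list them first. This is a small sketch:

# List the container names defined in the pod spec
$ kubectl get pod POD_NAME -o jsonpath='{.spec.containers[*].name}'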

If the Liveness probe isn't returning a successful status, then verify that the Liveness probe is configured correctly for the application. For more information, see Configure Probes on the Kubernetes website.

The following example shows a pod in the CrashLoopBackOff state because the application exits immediately after starting. Note the State, Last State, Reason, Exit Code, and Restart Count fields, along with the Events:

$ kubectl describe pod crash-app-b9cf4587-66ftw
Name:         crash-app-b9cf4587-66ftw
Namespace:    default
Priority:     0
Node:         ip-192-168-91-249.ap-southeast-2.compute.internal/192.168.91.249
Start Time:   Tue, 12 Oct 2021 12:24:44 +1100
Labels:       app=crash-app
              pod-template-hash=b9cf4587
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Running
IP:           192.168.82.93
IPs:
  IP:           192.168.82.93
Controlled By:  ReplicaSet/crash-app-b9cf4587
Containers:
  alpine:
    Container ID:   containerd://a36709d9520db92d7f6d9ee02ab80125a384fee178f003ee0b0fcfec303c2e58
    Image:          alpine
    Image ID:       docker.io/library/alpine@sha256:e1c082e3d3c45cccac829840a25941e679c25d438cc8412c2fa221cf1a824e6a
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 12 Oct 2021 12:26:21 +1100
      Finished:     Tue, 12 Oct 2021 12:26:21 +1100
    Ready:          False
    Restart Count:  4
    ...
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  2m30s                default-scheduler  Successfully assigned default/crash-app-b9cf4587-66ftw to ip-192-168-91-249.ap-southeast-2.compute.internal
  Normal   Pulled     2m25s                kubelet            Successfully pulled image "alpine" in 5.121853269s
  Normal   Pulled     2m22s                kubelet            Successfully pulled image "alpine" in 1.894443044s
  Normal   Pulled     2m3s                 kubelet            Successfully pulled image "alpine" in 1.878057673s
  Normal   Created    97s (x4 over 2m25s)  kubelet            Created container alpine
  Normal   Started    97s (x4 over 2m25s)  kubelet            Started container alpine
  Normal   Pulled     97s                  kubelet            Successfully pulled image "alpine" in 1.872870869s
  Warning  BackOff    69s (x7 over 2m21s)  kubelet            Back-off restarting failed container
  Normal   Pulling    55s (x5 over 2m30s)  kubelet            Pulling image "alpine"
  Normal   Pulled     53s                  kubelet            Successfully pulled image "alpine" in 1.858871422s


The following example shows a pod whose liveness probe is failing:

$ kubectl describe pod nginx
Name:         nginx
Namespace:    default
Priority:     0
Node:         ip-192-168-91-249.ap-southeast-2.compute.internal/192.168.91.249
Start Time:   Tue, 12 Oct 2021 13:07:55 +1100
Labels:       app=nginx
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Running
IP:           192.168.79.220
IPs:
  IP:  192.168.79.220
Containers:
  nginx:
    Container ID:   containerd://950740197c425fa281c205a527a11867301b8ec7a0f2a12f5f49d8687a0ee911
    Image:          nginx
    Image ID:       docker.io/library/nginx@sha256:06e4235e95299b1d6d595c5ef4c41a9b12641f6683136c18394b858967cd1506
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 12 Oct 2021 13:10:06 +1100
      Finished:     Tue, 12 Oct 2021 13:10:13 +1100
    Ready:          False
    Restart Count:  5
    Liveness:       http-get http://:8080/ delay=3s timeout=1s period=2s #success=1 #failure=3
    ...
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  2m47s                  default-scheduler  Successfully assigned default/nginx to ip-192-168-91-249.ap-southeast-2.compute.internal
  Normal   Pulled     2m44s                  kubelet            Successfully pulled image "nginx" in 1.891238002s
  Normal   Pulled     2m35s                  kubelet            Successfully pulled image "nginx" in 1.878230117s
  Normal   Created    2m25s (x3 over 2m44s)  kubelet            Created container nginx
  Normal   Started    2m25s (x3 over 2m44s)  kubelet            Started container nginx
  Normal   Pulled     2m25s                  kubelet            Successfully pulled image "nginx" in 1.876232575s
  Warning  Unhealthy  2m17s (x9 over 2m41s)  kubelet            Liveness probe failed: Get "http://192.168.79.220:8080/": dial tcp 192.168.79.220:8080: connect: connection refused
  Normal   Killing    2m17s (x3 over 2m37s)  kubelet            Container nginx failed liveness probe, will be restarted
  Normal   Pulling    2m17s (x4 over 2m46s)  kubelet            Pulling image "nginx"
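
In the preceding example, the liveness probe checks port 8080, but the nginx container listens on port 80, so the probe can never succeed and the kubelet keeps restarting the container. The following is a minimal sketch of a corrected probe that targets the port the application actually serves on (the path and timing values are examples, not values from your workload):

# Container spec excerpt: HTTP liveness probe against the port the application listens on
livenessProbe:
  httpGet:
    path: /
    port: 80
  initialDelaySeconds: 3
  periodSeconds: 2
  failureThreshold: 3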

If your pods are still in the CrashLoopBackOff state after trying the preceding steps, complete the steps in the Additional troubleshooting section.

Additional troubleshooting

If your pod is still stuck after completing steps in the previous sections, then try the following steps:

1.    To confirm that worker nodes exist in the cluster and are in Ready status, run the following command:

$ kubectl get nodes

Example output:

NAME                                          STATUS   ROLES    AGE   VERSION
ip-192-168-6-51.us-east-2.compute.internal    Ready    <none>   25d   v1.21.2-eks-5047ed
ip-192-168-86-33.us-east-2.compute.internal   Ready    <none>   25d   v1.21.2-eks-5047ed

If the nodes are not in the cluster, add worker nodes.

If the nodes are NotReady or can't join the cluster, see How can I change the status of my nodes from NotReady or Unknown status to Ready status?

2.    To check the version of the Kubernetes cluster, run the following command:

$ kubectl version --short

Example output:

Client Version: v1.21.2-eks-5047ed
Server Version: v1.21.2-eks-c0eccc

3.    To check the version of the Kubernetes worker node, run the following command:

$ kubectl get node -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion

Example output:

NAME                                          VERSION
ip-192-168-6-51.us-east-2.compute.internal    v1.21.2-eks-5047ed
ip-192-168-86-33.us-east-2.compute.internal   v1.21.2-eks-5047ed

4.    Confirm that the Kubernetes server version for the cluster matches the version of the worker nodes within an acceptable version skew (from the Kubernetes website). Use the output from the preceding steps 2 and 3 as the basis for this comparison.

Important: The patch versions can be different (for example, v1.21.x for the cluster vs. v1.21.y for the worker node).

If the cluster and worker node versions are incompatible, create a new node group with eksctl (see the eksctl tab) or AWS CloudFormation (see the Self-managed nodes tab).

-or-

Create a new managed node group (Kubernetes: v1.21, platform: eks.1 and above) using a compatible Kubernetes version. Then, delete the node group with the incompatible Kubernetes version.
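
For example, you can create a replacement managed node group with eksctl and delete the old one after your workloads move. These commands are a sketch; the cluster name, node group names, and instance type are placeholders:

# Create a replacement managed node group that runs a compatible Kubernetes version
$ eksctl create nodegroup --cluster YOUR_CLUSTER --name new-nodegroup --node-type t3.medium --nodes 2

# After pods are rescheduled, delete the node group with the incompatible version
$ eksctl delete nodegroup --cluster YOUR_CLUSTER --name old-nodegroup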

5.    Confirm that the Kubernetes control plane can communicate with the worker nodes by verifying firewall rules against recommended rules in Amazon EKS security group considerations. Then, verify that the nodes are in Ready status.
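
To see which security groups are associated with the cluster before comparing them against the recommended rules, you can describe the cluster with the AWS CLI. This is a sketch; the cluster name is a placeholder:

# Show the VPC configuration of the cluster, including its security groups
$ aws eks describe-cluster --name YOUR_CLUSTER --query "cluster.resourcesVpcConfig"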

