How can I troubleshoot pod status in Amazon EKS?
Last updated: 2020-02-13
My Amazon Elastic Kubernetes Service (Amazon EKS) pods that are running on Amazon Elastic Compute Cloud (Amazon EC2) instances or on a managed node group are stuck. How can I get my pods in the Running state?
Resolution
Important: The following steps apply only to pods launched on Amazon EC2 instances or on a managed node group. The steps don't apply to pods launched on AWS Fargate.
Find out the status of your pod
1. To get information from the Events history of your pod, run the following command:
$ kubectl describe pod YOUR_POD_NAME
Note: The example commands in the following steps use the default namespace. For other namespaces, append -n YOUR_NAMESPACE to the command.
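For example, to describe a pod that runs in the kube-system namespace, run the following command:
$ kubectl describe pod YOUR_POD_NAME -n kube-system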
2. Based on the status of your pod, complete the steps in one of the following sections: Your pod is in the Pending state, Your pod is in the Waiting state, or Your pod is in the CrashLoopBackOff state.
Your pod is in the Pending state
Pods in the Pending state can't be scheduled onto a node.
Your pod could be in the Pending state because you have insufficient resources on the available worker nodes, or you've defined an occupied hostPort on the pod.
If you have insufficient resources on the available worker nodes, then delete unnecessary pods, or add more worker nodes. For example, your worker nodes can run out of CPU and memory. If this is a recurring issue, you can use the Kubernetes Cluster Autoscaler to automatically scale your worker node group when resources in your cluster are scarce.
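To see how much CPU and memory the worker nodes have already committed, you can inspect the Allocated resources section of the node descriptions. For example (the number of lines printed by grep is approximate and might need adjusting):
$ kubectl describe nodes | grep -A 8 "Allocated resources"
If the Kubernetes Metrics Server is installed in your cluster, kubectl top nodes also shows current CPU and memory usage per node.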
If you're defining a hostPort for your pod, then consider the following:
(A) There are a limited number of places that a pod can be scheduled when you bind a pod to a hostPort.
(B) Don't specify a hostPort unless it's necessary, because the hostIP, hostPort, and protocol combination must be unique.
(C) If you must specify hostPort, then you can schedule at most the same number of pods as there are worker nodes.
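For reference, here is a minimal sketch of a pod specification that binds a hostPort (the pod name, image, and port values are placeholder examples). Because the hostIP, hostPort, and protocol combination must be unique on each node, only one copy of this pod can run on each worker node:
apiVersion: v1
kind: Pod
metadata:
  name: nginx-hostport-example
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
      hostPort: 8080      # only one pod on a given node can bind hostPort 8080
      protocol: TCP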
The following example shows the output of the describe command for my-nginx-12345abc6d-7e8fg, which is in the Pending state. The pod is unscheduled because of a resource constraint.
$ kubectl describe pod my-nginx-12345abc6d-7e8fg
Name: my-nginx-12345abc6d-7e8fg
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: <none>
Labels: pod-template-hash=12345abc6d
run=my-nginx
Annotations: kubernetes.io/psp: eks.privileged
Status: Pending
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 7s (x6 over 5m58s) default-scheduler 0/2 nodes are available: 1 Insufficient pods, 1 node(s) had taints that the pod didn't tolerate.
If your pods are still in the Pending state after trying the preceding steps, complete the steps in the Additional troubleshooting section.
Your pod is in the Waiting state
A pod in the Waiting state is scheduled on a worker node (for example, an Amazon EC2 instance), but can't run on that node.
Your pod can be in the Waiting state because of an incorrect Docker image or repository name, a lack of permissions, or because the image doesn't exist.
If you have the incorrect Docker image or repository name, then complete the following:
1. Confirm that the image and repository names are correct by logging in to Docker Hub, Amazon Elastic Container Registry (Amazon ECR), or another container image repository.
2. Compare the repository or image from the repository with the repository or image name specified in the pod specification.
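To see exactly which image names the pod specification references, you can also query the pod directly. For example:
$ kubectl get pod YOUR_POD_NAME -o jsonpath='{.spec.containers[*].image}'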
If the image doesn't exist or you lack permissions, then complete the following:
1. Verify that the image specified is available in the repository and that the correct permissions are configured to allow the image to be pulled.
2. To confirm that the image pull is possible and to rule out general networking and repository permission issues, connect to one of the Amazon EKS worker nodes and manually pull the image with Docker:
$ docker pull yourImageURI:yourImageTag
3. To verify that the image exists, check that both the image and tag are present in either Docker Hub or Amazon ECR.
Note: If you're using Amazon ECR, verify that the repository policy allows image pull for the NodeInstanceRole. Or, verify that the AmazonEC2ContainerRegistryReadOnly policy is attached to the worker node's IAM role.
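If the image is stored in Amazon ECR, the following AWS CLI commands are one way to confirm that the tag exists and to inspect the repository policy (the repository name and tag are placeholders; get-repository-policy returns an error if no repository policy is attached, in which case access is controlled by IAM permissions only):
$ aws ecr describe-images --repository-name YOUR_REPOSITORY_NAME --image-ids imageTag=YOUR_IMAGE_TAG
$ aws ecr get-repository-policy --repository-name YOUR_REPOSITORY_NAME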
The following example shows a pod in the Pending state with its container in the Waiting state because of an image pull error:
$ kubectl describe po web-test
Name: web-test
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: ip-192-168-6-51.us-east-2.compute.internal/192.168.6.51
Start Time: Wed, 22 Jan 2020 08:18:16 +0200
Labels: app=web-test
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"app":"web-test"},"name":"web-test","namespace":"default"},"spec":{...
kubernetes.io/psp: eks.privileged
Status: Pending
IP: 192.168.1.143
Containers:
web-test:
Container ID:
Image: somerandomnonexistentimage
Image ID:
Port: 80/TCP
Host Port: 0/TCP
State: Waiting
Reason: ErrImagePull
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 66s default-scheduler Successfully assigned default/web-test to ip-192-168-6-51.us-east-2.compute.internal
Normal Pulling 14s (x3 over 65s) kubelet, ip-192-168-6-51.us-east-2.compute.internal Pulling image "somerandomnonexistentimage"
Warning Failed 14s (x3 over 55s) kubelet, ip-192-168-6-51.us-east-2.compute.internal Failed to pull image "somerandomnonexistentimage": rpc error: code = Unknown desc = Error response from daemon: pull access denied for somerandomnonexistentimage, repository does not exist or may require 'docker login'
Warning Failed 14s (x3 over 55s) kubelet, ip-192-168-6-51.us-east-2.compute.internal Error: ErrImagePull
If your pods are still in the Waiting state after trying the preceding steps, complete the steps in the Additional troubleshooting section.
Your pod is in the CrashLoopBackOff state
Pods stuck in CrashLoopBackOff are starting, crashing, starting again, and then crashing again repeatedly.
If you receive the "Back-Off restarting failed container" output message, then your container probably exited soon after Kubernetes started the container.
To look for errors in the logs of the current pod, run the following command:
$ kubectl logs YOUR_POD_NAME
To look for errors in the logs of the previous pod that crashed, run the following command:
$ kubectl logs --previous YOUR_POD_NAME
Note: For a multi-container pod, you can specify the container name with the -c flag. For example:
$ kubectl logs YOUR_POD_NAME -c YOUR_CONTAINER_NAME
If the Liveness probe isn't returning a successful status, verify that the Liveness probe is configured correctly for the application. For more information, see Configure Probes.
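For reference, the following container snippet shows one possible Liveness probe configuration (the path, port, and timing values are placeholders; tune them so that the probe doesn't fail while the application is still starting up):
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10   # give the application time to start before the first probe
      periodSeconds: 5          # probe every 5 seconds
      failureThreshold: 3       # restart the container after 3 consecutive failures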
The following example shows a pod in a CrashLoopBackOff state because the application exits after starting:
$ kubectl describe po crash-app-6847947bf8-28rq6
Name: crash-app-6847947bf8-28rq6
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: ip-192-168-6-51.us-east-2.compute.internal/192.168.6.51
Start Time: Wed, 22 Jan 2020 08:42:20 +0200
Labels: pod-template-hash=6847947bf8
run=crash-app
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 192.168.29.73
Controlled By: ReplicaSet/crash-app-6847947bf8
Containers:
main:
Container ID: docker://6aecdce22adf08de2dbcd48f5d3d8d4f00f8e86bddca03384e482e71b3c20442
Image: alpine
Image ID: docker-pullable://alpine@sha256:ab00606a42621fb68f2ed6ad3c88be54397f981a7b70a79db3d1172b11c4367d
Port: 80/TCP
Host Port: 0/TCP
Command:
/bin/sleep
1
State: Waiting
Reason: CrashLoopBackOff
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 47s default-scheduler Successfully assigned default/crash-app-6847947bf8-28rq6 to ip-192-168-6-51.us-east-2.compute.internal
Normal Pulling 28s (x3 over 46s) kubelet, ip-192-168-6-51.us-east-2.compute.internal Pulling image "alpine"
Normal Pulled 28s (x3 over 46s) kubelet, ip-192-168-6-51.us-east-2.compute.internal Successfully pulled image "alpine"
Normal Created 28s (x3 over 45s) kubelet, ip-192-168-6-51.us-east-2.compute.internal Created container main
Normal Started 28s (x3 over 45s) kubelet, ip-192-168-6-51.us-east-2.compute.internal Started container main
Warning BackOff 12s (x4 over 42s) kubelet, ip-192-168-6-51.us-east-2.compute.internal Back-off restarting failed container
If your pods are still in the CrashLoopBackOff state after trying the preceding steps, complete the steps in the Additional troubleshooting section.
Additional troubleshooting
If your pod is still stuck after completing steps in the previous sections, try the following steps:
1. To confirm that worker nodes exist in the cluster and are in Ready status (which allows pods to be scheduled on them), run the following command:
$ kubectl get nodes
The output should look similar to the following:
NAME STATUS ROLES AGE VERSION
ip-192-168-6-51.us-east-2.compute.internal Ready <none> 25d v1.14.6-eks-5047ed
ip-192-168-86-33.us-east-2.compute.internal Ready <none> 25d v1.14.6-eks-5047ed
If the nodes are not in the cluster, add worker nodes.
If the nodes are NotReady or can't join the cluster, see How can I change the status of my nodes from NotReady or Unknown status to Ready status?
2. To check the version of the Kubernetes cluster, run the following command:
$ kubectl version --short
The output should look similar to the following:
Client Version: v1.14.6-eks-5047ed
Server Version: v1.14.9-eks-c0eccc
3. To check the version of the Kubernetes worker node, run the following command:
$ kubectl get node -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion
The output should look similar to the following:
NAME VERSION
ip-192-168-6-51.us-east-2.compute.internal v1.14.6-eks-5047ed
ip-192-168-86-33.us-east-2.compute.internal v1.14.6-eks-5047ed
4. Based on the output from steps 2 and 3, confirm that the Kubernetes server version for the cluster matches the version of the worker nodes within an acceptable version skew.
Important: The patch versions can be different (for example, v1.14.x for the cluster vs. v1.14.y for the worker node).
If the cluster and worker node versions are incompatible, create a new node group with eksctl (see the eksctl tab) or AWS CloudFormation (see the Self-managed nodes tab).
--or--
Create a new managed node group (Kubernetes: v1.14, platform: eks.3 and above) using a compatible Kubernetes version. Then, delete the node group with the incompatible Kubernetes version.
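For example, one way to create an additional node group with eksctl is the following (the cluster name, node group name, and node count are placeholders; see the eksctl documentation for the full list of options):
$ eksctl create nodegroup --cluster YOUR_CLUSTER_NAME --name YOUR_NODEGROUP_NAME --nodes 2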
5. To confirm that the Kubernetes control plane can communicate with the worker nodes, verify firewall rules against the recommended rules in Amazon EKS Security Group Considerations, and then verify that the nodes are in Ready status.
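For example, you can review the inbound and outbound rules of the cluster and worker node security groups with the AWS CLI (the security group ID is a placeholder):
$ aws ec2 describe-security-groups --group-ids YOUR_SECURITY_GROUP_ID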