Why is my Amazon EKS pod stuck in the ContainerCreating state with the error "failed to create pod sandbox"?

Last updated: 2021-11-30

My Amazon Elastic Kubernetes Service (Amazon EKS) pod is stuck in the ContainerCreating state with the error "failed to create pod sandbox".

Resolution

Your Amazon EKS pods might be stuck in the ContainerCreating state with a network connectivity error for several reasons. Use the following troubleshooting steps based on the error message that you get.

Error response from daemon: failed to start shim: fork/exec /usr/bin/containerd-shim: resource temporarily unavailable: unknown

This error occurs because of an operating system limitation that's caused by the defined kernel settings for the maximum number of process IDs (PIDs) or the maximum number of open files.
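To see the limits that are currently in effect, you can run commands similar to the following on the affected node (for example, over SSH or AWS Systems Manager Session Manager). This is a quick sketch, and the exact limits that matter depend on your AMI and workload:

$ sysctl kernel.pid_max    # maximum number of PIDs on the node
$ sysctl fs.file-max       # maximum number of open file handles on the node
$ ulimit -u                # maximum user processes for the current user
$ ulimit -n                # maximum open files for the current user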

Retrieve information about your pod by running the following command:

$ kubectl describe pod example_pod

The output looks similar to the following:

kubelet, ip-xx-xx-xx-xx.xx-xxxxx-x.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "example_pod": Error response from daemon: failed to start shim: fork/exec /usr/bin/containerd-shim: resource temporarily unavailable: unknown

To temporarily resolve the issue, restart the node.

To troubleshoot this issue, do the following:

  • Gather the node logs.
  • Review the Docker logs for the error "dockerd[4597]: runtime/cgo: pthread_create failed: Resource temporarily unavailable".
  • Review the Kubelet log for the errors "kubelet[5267]: runtime: failed to create new OS thread (have 2 already; errno=11)" and "kubelet[5267]: runtime: may need to increase max user processes (ulimit -u)".
  • Identify zombie processes by running the ps command, as shown in the example after this list. All the processes listed with the state Z in the output are zombie processes.
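For example, the following commands list and count processes in the Z state (the STAT column is the eighth field of ps aux output):

$ ps aux | awk '$8 ~ /^Z/ { print }'    # list zombie processes
$ ps aux | awk '$8 ~ /^Z/' | wc -l      # count zombie processes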

Network plugin cni failed to set up pod network: add cmd: failed to assign an IP address to container

This error indicates that the Container Network Interface (CNI) can't assign an IP address for the newly provisioned pod.

Retrieve information about your pod by running the following command:

$ kubectl describe pod example_pod

The output looks similar to the following:

Warning FailedCreatePodSandBox 23m (x2203 over 113m) kubelet, ip-xx-xx-xx-xx.xx-xxxxx-x.compute.internal (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" network for pod "provisioning-XXXXXXXXXXXXXXX": networkPlugin cni failed to set up pod "provisioning-XXXXXXXXXXXXXXX" network: add cmd: failed to assign an IP address to container

Review the subnet to determine whether it has run out of free IP addresses. You can view the available IP addresses for each subnet in the Amazon VPC console under the Subnets section.
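You can also check the count with the AWS Command Line Interface (AWS CLI). In the following example, subnet-XXXXXXXX is a placeholder for your subnet ID:

$ aws ec2 describe-subnets --subnet-ids subnet-XXXXXXXX \
    --query 'Subnets[].AvailableIpAddressCount'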

To resolve this issue, scale down some of the workload to free up available IP addresses. If additional subnet capacity is available, then you can scale the nodes. You can also create an additional subnet. For more information, see Create subnets with a new CIDR range in How do I use multiple CIDR ranges with Amazon EKS?

Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused

This error indicates that the aws-node pod failed to communicate with the IP address management daemon (ipamd).

Retrieve information about your pod by running the following commands:

$ kubectl describe pod example_pod
$ kubectl describe pod/aws-node-XXXXX -n kube-system

The output looks similar to the following:

Warning  FailedCreatePodSandBox  51s  kubelet, ip-xx-xx-xx-xx.ec2.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" network for pod "example_pod": NetworkPlugin cni failed to set up pod "example_pod" network: add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container
"XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" network for pod "example_pod": NetworkPlugin cni failed to teardown pod "example_pod" network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]

To troubleshoot this issue, run the following command to view the last log message:

$ kubectl -n kube-system exec -it aws-node-XXXXX -- tail -f /host/var/log/aws-routed-eni/ipamd.log | tee ipamd.log

The last log message looks similar to the following:

Getting running pod sandboxes from \"unix:///var/run/dockershim.sock\"

This message indicates that the pod was unable to mount /var/run/dockershim.sock.

To resolve this issue, try the following:

  • Restart the aws-node pod. Restarting might help the pod to remap the mount point. See the example commands after this list.
  • If the issue is still not resolved, then cordon the node and scale the nodes in the node group.
  • Try upgrading the Amazon VPC Container Network Interface (CNI) plugin to the latest version that the cluster supports.
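For example, the following commands delete an aws-node pod (the aws-node DaemonSet re-creates it automatically) and cordon a node. The pod and node names are placeholders:

$ kubectl -n kube-system delete pod aws-node-XXXXX             # the DaemonSet re-creates the pod
$ kubectl cordon ip-xx-xx-xx-xx.xx-xxxxx-x.compute.internal    # mark the node as unschedulable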

If the CNI was added as a managed add-on in the AWS Management Console, then the aws-node pods fail their probes. Switching to managed add-ons overwrites the service account, but the service account isn't configured with the selected IAM role. To resolve this issue, turn off the add-on from the console and create the service account using a manifest file. Or, edit the current aws-node service account to add the role that's used on the managed add-on.
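For example, assuming that you use IAM roles for service accounts (IRSA), you can attach the role through the eks.amazonaws.com/role-arn annotation. The account ID and role name in the following command are placeholders:

$ kubectl annotate serviceaccount aws-node -n kube-system \
    eks.amazonaws.com/role-arn=arn:aws:iam::111122223333:role/example-cni-role --overwrite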

Network plugin cni failed to set up pod "example_pod" network: failed to parse Kubernetes args: pod does not have label vpc.amazonaws.com/PrivateIPv4Address

You get this error either because the pod isn't running properly or because the certificate that the pod uses wasn't created successfully. This error relates to the VPC admission controller webhook that's required on Amazon EKS clusters to run Windows workloads. This component is a plugin that runs a pod in the kube-system namespace. The component runs on Linux nodes and enables networking for incoming pods on Windows nodes.

Retrieve information about your pod by running the following command:

$ kubectl describe pod example_pod

The output looks similar to the following:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" network for pod "example_pod": networkPlugin cni failed to set up pod "example_pod" network: failed to parse Kubernetes args: pod does not have label vpc.amazonaws.com/PrivateIPv4Address

To troubleshoot this issue, run the following command to confirm that the VPC admission controller pod is created:

$ kubectl get pods -n kube-system

If the admission controller pod isn't created, then enable Windows support for your cluster.

Important: Amazon EKS currently supports Windows node groups without requiring the VPC controller to be enabled. If you have the VPC controller enabled, then remove legacy Windows support from your data plane.
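Note: On clusters that run version 1.10 or later of the Amazon VPC CNI plugin, Windows support is enabled through the amazon-vpc-cni ConfigMap instead of the VPC controller. The following command is a sketch of that approach; confirm the exact procedure for your cluster version in the Amazon EKS documentation:

$ kubectl patch configmap amazon-vpc-cni -n kube-system \
    --type merge -p '{"data":{"enable-windows-ipam":"true"}}'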

Run the following command to check if there are any errors written into the logs:

$ kubectl logs your-admission-webhook-name -n kube-system

You can continue troubleshooting based on the errors that you identify in the logs.

