Why is my Amazon EKS pod stuck in the ContainerCreating state with the error "failed to create pod sandbox"?

Last updated: 2022-11-10

My Amazon Elastic Kubernetes Service (Amazon EKS) pod is stuck in the ContainerCreating state with the error "failed to create pod sandbox".

Resolution

Your Amazon EKS pods might be stuck in the ContainerCreating state with a network connectivity error for several reasons. Use the following troubleshooting steps based on the error message that you get.

Error response from daemon: failed to start shim: fork/exec /usr/bin/containerd-shim: resource temporarily unavailable: unknown

This error occurs because of an operating system limitation that's caused by the defined kernel settings for maximum PID or maximum number of files.

Run the following command to get information about your pod:

$ kubectl describe pod example_pod

Example output:

kubelet, ip-xx-xx-xx-xx.xx-xxxxx-x.compute.internal  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "example_pod": Error response from daemon: failed to start shim: fork/exec /usr/bin/containerd-shim: resource temporarily unavailable: unknown

To temporarily resolve the issue, restart the node.

To troubleshoot the issue, do the following:

  • Gather the node logs.
  • Review the Docker logs for the error "dockerd[4597]: runtime/cgo: pthread_create failed: Resource temporarily unavailable".
  • Review the kubelet log for the errors "kubelet[5267]: runtime: failed to create new OS thread (have 2 already; errno=11)" and "kubelet[5267]: runtime: may need to increase max user processes (ulimit -u)".
  • Identify the zombie processes by running the ps command. All the processes listed with the Z state in the output are the zombie processes.
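The zombie check in the last step can be sketched as a small filter. This is a minimal sketch rather than part of the original procedure: the function name list_zombies is hypothetical, and it assumes the Linux procps output format of ps, with the process state in the third column.

```shell
# Print every process whose state starts with Z (zombie).
# Reads `ps -eo pid,ppid,stat,comm` output on stdin, so the same filter
# also works on ps output that was saved with the node logs.
list_zombies() {
  # NR > 1 skips the header row; $3 is the STAT column.
  awk 'NR > 1 && $3 ~ /^Z/'
}

# Usage on the affected node:
#   ps -eo pid,ppid,stat,comm | list_zombies
```

The PPID column shows the parent that hasn't reaped each zombie. To compare against the kernel limits that this section mentions, sysctl kernel.pid_max fs.file-max prints the configured maximums.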

Network plugin cni failed to set up pod network: add cmd: failed to assign an IP address to container

This error indicates that the Container Network Interface (CNI) can't assign an IP address for the newly provisioned pod.

The following are reasons why the CNI fails to provide an IP address to the newly created pod:

  • The instance used the maximum allowed elastic network interfaces and IP addresses.
  • The Amazon Virtual Private Cloud (Amazon VPC) subnets have an IP address count of zero.

The following is an example of network interface IP address exhaustion:

Instance type    Maximum network interfaces    Private IPv4 addresses per interface    IPv6 addresses per interface
t3.medium        3                             6                                       6

In the preceding example, the t3.medium instance has a maximum of 3 network interfaces, and each network interface has a maximum of 6 IP addresses, for a total of 18. The first IP address is used for the node and isn't assignable. This leaves 17 IP addresses that the network interfaces can allocate.
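The arithmetic in the preceding paragraph can be written out as follows. This is a sketch of the calculation exactly as this article describes it; the function name assignable_ips is hypothetical, and the inputs are the t3.medium values from the table.

```shell
# Assignable addresses as described above:
# (network interfaces x addresses per interface) minus the node's own address.
# $1 = maximum network interfaces, $2 = IPv4 addresses per interface
assignable_ips() {
  echo $(( $1 * $2 - 1 ))
}

assignable_ips 3 6   # t3.medium: 3 * 6 - 1, prints 17
```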

The Local IP Address Management daemon (ipamD) logs show the following message when the network interface runs out of IP addresses:

"ipamd/ipamd.go:1285","msg":"Total number of interfaces found: 3 "
"AssignIPv4Address: IP address pool stats: total: 17, assigned 17"
"AssignPodIPv4Address: ENI eni-abc123 does not have available addresses"

Run the following command to get information about your pod:

$ kubectl describe pod example_pod

Example output:

Warning FailedCreatePodSandBox 23m (x2203 over 113m) kubelet, ip-xx-xx-xx-xx.xx-xxxxx-x.compute.internal (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" network for pod "provisioning-XXXXXXXXXXXXXXX": networkPlugin cni failed to set up pod "provisioning-XXXXXXXXXXXXXXX" network: add cmd: failed to assign an IP address to container

Review the subnet to identify if the subnet ran out of free IP addresses. You can view available IP addresses for each subnet in the Amazon VPC console under the Subnets section.

Subnet: XXXXXXXXXX
IPv4 CIDR Block 10.2.1.0/24   Number of allocated ips 254   Free address count 0
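The same check can be approximated from the command line. The following is a minimal sketch that assumes the AWS CLI is installed and configured; the function name flag_full_subnets is hypothetical. It filters the text output of aws ec2 describe-subnets, which reports the free address count in the AvailableIpAddressCount field.

```shell
# Flag subnets that have no free IP addresses.
# Expects whitespace-separated lines on stdin:
#   subnet ID, CIDR block, available address count
flag_full_subnets() {
  awk 'NF >= 3 && $3 == 0 { print $1 " (" $2 ") has no free IP addresses" }'
}

# Usage (requires AWS credentials):
#   aws ec2 describe-subnets \
#     --query 'Subnets[].[SubnetId,CidrBlock,AvailableIpAddressCount]' \
#     --output text | flag_full_subnets
```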

To resolve this issue, scale down some of the workload to free up IP addresses. If additional subnet capacity is available, then you can scale the node. You can also create an additional subnet. For more information, see How do I use multiple CIDR ranges with Amazon EKS? Follow the instructions in the Create subnets with a new CIDR range section.

Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused

This error indicates that the aws-node pod failed to communicate with IPAM because the aws-node pod failed to run on the node.

Run the following commands to get information about the pod:

$ kubectl describe pod example_pod
$ kubectl describe pod/aws-node-XXXXX -n kube-system

Example outputs:

Warning  FailedCreatePodSandBox  51s  kubelet, ip-xx-xx-xx-xx.ec2.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" network for pod "example_pod": NetworkPlugin cni failed to set up pod "example_pod" network: add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container
"XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" network for pod "example_pod": NetworkPlugin cni failed to teardown pod "example_pod" network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]

To troubleshoot this issue, verify that the aws-node pod is deployed and is in the Running state:

$ kubectl get pods --selector=k8s-app=aws-node -n kube-system

Note: Make sure that you're running the correct version of the VPC CNI plugin for the cluster version.
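One way to see which VPC CNI plugin version is deployed is to read the image tag from the aws-node DaemonSet. This is a sketch under the assumption that the plugin image is named amazon-k8s-cni; the function name cni_version is hypothetical.

```shell
# Extract the image tag of the VPC CNI plugin.
# Reads `kubectl describe daemonset aws-node -n kube-system` output on stdin.
cni_version() {
  # The matching line looks like "Image: <registry>/amazon-k8s-cni:<tag>",
  # so the tag is the third colon-separated field.
  grep 'amazon-k8s-cni:' | cut -d : -f 3
}

# Usage:
#   kubectl describe daemonset aws-node -n kube-system | cni_version
```

Compare the printed tag with the plugin versions that are listed as supported for your cluster's Kubernetes version.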

The pods might be in Pending state due to Liveness and Readiness probe errors. Liveness and Readiness probe errors mean that the application that's running in the container isn't ready to respond within the time that the probe allows.

Liveness:   exec [/app/grpc-health-probe -addr=:50051] delay=60s timeout=1s period=10s #success=1 #failure=3
Readiness:  exec [/app/grpc-health-probe -addr=:50051] delay=1s timeout=1s period=10s #success=1 #failure=3
kubelet: "Probe failed" probeType="Readiness" pod="kube-system/aws-node-xxx" podUID=xxx-xxx-xxx containerName="aws-node" probeResult=failure output="{\"level\":\"info\",\"ts\":\"2022-10-13T11:25:56.708Z\",\"caller\":\"/usr/local/go/src/runtime/proc.go:225\",\"msg\":\"timeout: failed to connect service \\\":50051\\\" within 5s\"}\n"

pod="kube-system/aws-node-xxxx" containerMessage="Container aws-node failed liveness probe, will be restarted"

To allow the application enough time to start and be ready to receive requests, you can do the following:

  • Update the timeout from 1 second to 5 seconds.
  • Increase the failure count from 3 to 5.
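The two changes above could be applied with a JSON patch against the aws-node DaemonSet. The following is a sketch, not a definitive procedure: the function name probe_patch is hypothetical, and the patch assumes that aws-node is the first container (index 0) in the pod template, which you should confirm before applying it.

```shell
# Print a JSON patch that sets timeoutSeconds and failureThreshold to 5
# on both the liveness and readiness probes.
# Assumption: aws-node is container index 0 in the DaemonSet pod template.
probe_patch() {
  cat <<'EOF'
[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 5},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 5},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 5},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/failureThreshold", "value": 5}
]
EOF
}

# Usage:
#   kubectl -n kube-system patch daemonset aws-node --type=json -p "$(probe_patch)"
```

After the patch is applied, the Liveness and Readiness lines in the kubectl describe output should show timeout=5s and #failure=5.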

Run the following command to view the last log message from the aws-node pod:

$ kubectl -n kube-system exec -it aws-node-XXX -- tail -f /host/var/log/aws-routed-eni/ipamd.log | tee ipamd.log

The issue might also occur because the Dockershim mount point fails to mount. The following is an example message that you can receive when this issue occurs:

Getting running pod sandboxes from \"unix:///var/run/dockershim.sock\"
Not able to get local pod sandboxes yet (attempt 1/5): rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or directory"

The preceding message indicates that the pod didn't mount /var/run/dockershim.sock.

To resolve this issue, try the following:

  • Restart the aws-node pod to remap the mount point.
  • Cordon the node, and scale the nodes in the node group.
  • Upgrade the Amazon VPC CNI plugin to the latest version that's supported for the cluster version.
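The first two options can be sketched as small wrappers around kubectl. The function names and the KUBECTL variable are hypothetical. The sketch assumes that deleting an aws-node pod counts as a restart, because the DaemonSet controller recreates the pod.

```shell
# Set KUBECTL="echo kubectl" to print the commands instead of running them.
KUBECTL=${KUBECTL:-kubectl}

# Delete one aws-node pod; the DaemonSet controller creates a replacement.
# $1 = pod name, for example aws-node-XXXXX
restart_aws_node() {
  $KUBECTL -n kube-system delete pod "$1"
}

# Mark a node as unschedulable so that new pods are placed on other nodes.
# $1 = node name as shown by `kubectl get nodes`
cordon_node() {
  $KUBECTL cordon "$1"
}
```

Running with KUBECTL="echo kubectl" first is a cheap way to confirm the pod and node names before anything is changed on the cluster.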

If you added the CNI as a managed plugin in the AWS Management Console, then the aws-node fails the probes. Managed plugins overwrite the service account. However, the service account isn't configured with the selected role. To resolve this issue, turn off the plugin from the AWS Management Console, and create the service account using a manifest file. Or, edit the current aws-node service account to add the role that's used on the managed plugin.
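Editing the service account to add the role might look like the following. This is a sketch that assumes the role is attached through the eks.amazonaws.com/role-arn annotation used by IAM roles for service accounts; the function name and the KUBECTL variable are hypothetical, and the role ARN is a placeholder that you supply.

```shell
# Set KUBECTL="echo kubectl" to print the command instead of running it.
KUBECTL=${KUBECTL:-kubectl}

# Annotate the aws-node service account with an IAM role.
# $1 = ARN of the role that's used on the managed plugin (placeholder)
annotate_aws_node_sa() {
  $KUBECTL -n kube-system annotate serviceaccount aws-node \
    "eks.amazonaws.com/role-arn=$1" --overwrite
}

# Usage:
#   annotate_aws_node_sa arn:aws:iam::<account-id>:role/<role-name>
```

Existing aws-node pods likely need to be recreated to pick up the change, for example with kubectl -n kube-system rollout restart daemonset aws-node.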

Network plugin cni failed to set up pod "my-app-xxbz-zz" network: failed to parse Kubernetes args: pod does not have label vpc.amazonaws.com/PrivateIPv4Address

You get this error because either the Amazon VPC admission controller webhook pod isn't running properly, or the certificate that the pod uses wasn't created successfully. The Amazon VPC admission controller webhook is required on Amazon EKS clusters to run Windows workloads. The webhook is a plugin that runs a pod in the kube-system namespace. The component runs on Linux nodes and allows networking for incoming pods on Windows nodes.

Run the following command to get the list of pods that are affected:

$ kubectl get pods

Example output:

my-app-xxx-zz        0/1     ContainerCreating   0          58m   <none>            ip-XXXXXXX.compute.internal   <none>
my-app-xxbz-zz       0/1     ContainerCreating   0          58m   <none>

Run the following command to get information about the pod:

$ kubectl describe pod my-app-xxbz-zz

Example output:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" network for pod "<POD_NAME>": networkPlugin cni failed to set up pod "example_pod" network: failed to parse Kubernetes args: pod does not have label vpc.amazonaws.com/PrivateIPv4Address
Reconciler worker 1 starting processing node ip-XXXXXXX.compute.internal.
Reconciler checking resource vpc.amazonaws.com/PrivateIPv4Address warmpool size 1 desired 3 on node ip-XXXXXXX.compute.internal.
Reconciler creating resource vpc.amazonaws.com/PrivateIPv4Address on node ip-XXXXXXX.compute.internal.
Reconciler failed to create resource vpc.amazonaws.com/PrivateIPv4Address on node ip-XXXXXXX.compute.internal: node has no open IP address slots.

Windows nodes support one network interface per node. The number of pods that you can run per Windows node is equal to the number of IP addresses available per network interface for the node's instance type, minus one. To resolve this issue, scale up the number of Windows nodes.
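The pod limit described above is a one-line calculation. In this sketch, the function name windows_max_pods is hypothetical, and the input of 6 is the per-interface address count from the earlier t3.medium table, used only for illustration.

```shell
# Pods per Windows node as described above:
# IP addresses per network interface for the instance type, minus one.
# $1 = IP addresses per network interface
windows_max_pods() {
  echo $(( $1 - 1 ))
}

windows_max_pods 6   # 6 addresses per interface, prints 5
```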

If the IP addresses aren't the issue, then review the Amazon VPC admission controller pod event and logs.

Run either of the following commands to confirm that the Amazon VPC admission controller pod is created:

$ kubectl get pods -n kube-system
$ kubectl get pods -n kube-system | grep "vpc-admission"

Example output:

vpc-admission-webhook-5bfd555984-fkj8z     1/1     Running   0          25m

Run the following command to get information about the pod:

$ kubectl describe pod vpc-admission-webhook-5bfd555984-fkj8z -n kube-system

Example output:

  Normal  Scheduled  27m   default-scheduler  Successfully assigned kube-system/vpc-admission-webhook-5bfd555984-fkj8z to ip-xx-xx-xx-xx.ec2.internal
  Normal  Pulling    27m   kubelet            Pulling image "xxxxxxx.dkr.ecr.xxxx.amazonaws.com/eks/vpc-admission-webhook:v0.2.7"
  Normal  Pulled     27m   kubelet            Successfully pulled image "xxxxxxx.dkr.ecr.xxxx.amazonaws.com/eks/vpc-admission-webhook:v0.2.7" in 1.299938222s
  Normal  Created    27m   kubelet            Created container vpc-admission-webhook
  Normal  Started    27m   kubelet            Started container vpc-admission-webhook

Run the following command to check the pod logs for any configuration issues:

$ kubectl logs vpc-admission-webhook-5bfd555984-fkj8z -n kube-system

Example output:

I1109 07:32:59.352298       1 main.go:72] Initializing vpc-admission-webhook version v0.2.7.
I1109 07:32:59.352866       1 webhook.go:145] Setting up webhook with OSLabelSelectorOverride: windows.
I1109 07:32:59.352908       1 main.go:105] Webhook Server started.
I1109 07:32:59.352933       1 main.go:96] Listening on :61800 for metrics and healthz
I1109 07:39:25.778144       1 webhook.go:289] Skip mutation for  as the target platform is .

The preceding output shows that the container started successfully. The pod then adds the vpc.amazonaws.com/PrivateIPv4Address label to the application pod. However, the manifest for the application pod must contain a node selector or affinity so that the pod is scheduled on the Windows nodes.
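A node selector can be added to an existing workload with a patch. The following is a sketch that assumes the application runs as a Deployment named my-app (a placeholder) and uses the well-known kubernetes.io/os node label; the function name windows_selector_patch is hypothetical.

```shell
# Print a patch that adds a Windows node selector to a pod template.
windows_selector_patch() {
  printf '%s\n' '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/os":"windows"}}}}}'
}

# Usage (my-app is a placeholder Deployment name):
#   kubectl patch deployment my-app -p "$(windows_selector_patch)"
```

The same nodeSelector entry can instead be written directly into the pod template of the application manifest.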

Other options to troubleshoot the issue include verifying the following:

  • You deployed the Amazon VPC admission controller pod in the kube-system namespace.
  • Logs or events aren't pointing to an expired certificate. If the certificate is expired and Windows pods are stuck in the ContainerCreating state, then you must delete and redeploy the pods.
  • There aren't any timeouts or DNS-related issues.

If you don't create the Amazon VPC admission controller, then turn on Windows support for your cluster.

Important: Amazon EKS doesn't require you to turn on the Amazon VPC admission controller to support Windows node groups. If you turned on the Amazon VPC admission controller, then remove legacy Windows support from your data plane.
