How can I troubleshoot Amazon EKS pods on AWS Fargate that are stuck in a Pending state?
Last updated: 2021-12-20
My Amazon Elastic Kubernetes Service (Amazon EKS) pods that are running on AWS Fargate instances are stuck in a Pending state. How can I get these pods to run?
Here are some common scenarios that prevent pods from running on Amazon Elastic Kubernetes Service (Amazon EKS) using AWS Fargate.
- There is a capacity error because a particular vCPU/memory combination is unavailable.
- The CoreDNS pods were created with a default annotation that must be removed to schedule them on a Fargate node.
- The pod didn't match any Fargate Profiles when it was created and is not assigned to the fargate-scheduler. If a pod isn't matched on creation, it isn't automatically rescheduled to Fargate nodes, even if a matching profile is created later. In this case, the pod is assigned to the default-scheduler.
- If the pod is assigned to the fargate-scheduler but remains in a Pending state, then additional troubleshooting might be required.
Before troubleshooting, note the following Fargate pod rules:
- You must configure namespace and match labels for your pod selectors. Fargate workflow matches pods to a Fargate profile only if both conditions match the pod specification.
- If you specify multiple pod selectors within a single Fargate profile, then the pod is scheduled by fargate-scheduler if it matches any of the selectors.
- If a pod specification matches multiple Fargate profiles, then the pod is scheduled according to one of those profiles, chosen at random. To avoid this, you can use the annotation eks.amazonaws.com/fargate-profile:<fp_name> within the pod specification.
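As a rough illustration of the rules above, the following shell sketch prints a pod manifest whose namespace and label are meant to match a Fargate profile selector, and that names one profile through the annotation described above. The function name render_fargate_pod, the pod name, the container image, and all argument values are hypothetical placeholders, not values from this article.

```shell
#!/bin/sh
# Hypothetical helper: prints a pod manifest for review, or for piping to
# "kubectl apply -f -". Every value here is a placeholder.
render_fargate_pod() {
  namespace="$1"   # should match the namespace in the Fargate profile selector
  app_label="$2"   # should match a label in the same selector
  profile="$3"     # Fargate profile to use when several profiles match
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: sample-app
  namespace: ${namespace}
  labels:
    app: ${app_label}
  annotations:
    eks.amazonaws.com/fargate-profile: ${profile}
spec:
  containers:
    - name: app
      image: public.ecr.aws/nginx/nginx:latest
EOF
}

# Example (not run here):
#   render_fargate_pod my-namespace my-app my-profile | kubectl apply -f -
```

Because profile matching happens when the pod is created, the namespace, labels, and annotation need to be in the manifest before the pod is applied.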
Important: The following steps apply only to pods launched with AWS Fargate. For information on pods launched on Amazon EC2 instances, see How can I troubleshoot pod status in Amazon EKS?
Find out the status of your pod
1. Run the following command to check your pod state:
kubectl get pods -n <namespace>
2. To get more error information about your pod, run the following describe command:
kubectl describe pod YOUR_POD_NAME -n <namespace>
Based on the output of the describe command, see the following resolutions.
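In a busy namespace it can help to narrow the output before reading it. The sketch below wraps two standard kubectl calls: one lists only pods whose phase is Pending, and the other prints the events recorded for a single pod. The function names list_pending_pods and pod_events are illustrative and are not part of kubectl.

```shell
#!/bin/sh
# Illustrative wrappers around kubectl; the function names are assumptions.

# List only the Pending pods in a namespace.
list_pending_pods() {
  # $1 = namespace; status.phase is a supported field selector for pods
  kubectl get pods -n "$1" --field-selector=status.phase=Pending
}

# Print the events for one pod, sorted by their last timestamp.
pod_events() {
  # $1 = namespace, $2 = pod name
  kubectl get events -n "$1" \
    --field-selector "involvedObject.name=$2" \
    --sort-by=.lastTimestamp
}

# Example (not run here):
#   list_pending_pods my-namespace
#   pod_events my-namespace YOUR_POD_NAME
```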
Resolving capacity error
If your pods have a capacity issue, then the describe output is similar to the following:
Fargate capacity is unavailable at this time. Please try again later or in a different availability zone
To resolve the error:
- Retry creating the pod after 15-20 minutes. Because the error is capacity-based, the exact amount of time can vary.
- Change the request (CPU/memory) within your pod specification. A new combination of vCPU/memory is then provisioned by the Fargate workflow.
Note: You are billed for a single vCPU/memory combination. See Pod CPU and memory for more information about how the combination is finalized based on your pod specification. Running "kubectl describe node" from your terminal or IDE can show a much higher vCPU/memory value than the combination that you are billed for. Fargate provisions resources from a capacity pool on a best-effort basis, so capacity that matches your requests isn't always available. However, you are billed only for pod usage at the equivalent vCPU/memory combination.
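If the pending pods are managed by a Deployment, one way to change the request is with kubectl set resources, which updates the pod template and triggers a rollout of new pods. This is a minimal sketch under that assumption; the function name set_pod_requests and the example values are hypothetical, and the right CPU/memory values depend on your workload.

```shell
#!/bin/sh
# Hypothetical helper: set new CPU/memory requests on a Deployment.
# Without a -c flag, kubectl applies the requests to every container
# in the pod template.
set_pod_requests() {
  # $1 = namespace, $2 = deployment name, $3 = cpu (e.g. 500m), $4 = memory (e.g. 1Gi)
  kubectl set resources deployment "$2" -n "$1" --requests="cpu=$3,memory=$4"
}

# Example (not run here):
#   set_pod_requests my-namespace my-deployment 500m 1Gi
```

For a standalone pod that isn't managed by a controller, edit the requests in the pod specification and recreate the pod instead.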
Resolving CoreDNS pods in pending state
If your pods are CoreDNS pods, then the pod name and status in the command output are similar to the following:
NAME                       READY   STATUS    RESTARTS   AGE
coredns-6548845887-qk9vf   0/1     Pending   0          157m
To resolve this and reassign the pods to the Fargate scheduler, patch the CoreDNS deployment to remove the following default annotation: eks.amazonaws.com/compute-type: ec2.
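The patch can be sketched as a JSON patch that removes the annotation from the deployment's pod template, followed by a rollout restart so that new CoreDNS pods are created. This assumes that CoreDNS runs as the coredns deployment in the kube-system namespace and that a Fargate profile already selects that namespace; the function name unpin_coredns_from_ec2 is hypothetical.

```shell
#!/bin/sh
# Remove the compute-type annotation from the CoreDNS pod template, then
# restart the deployment so that new pods are created.
unpin_coredns_from_ec2() {
  # "~1" is the JSON Pointer escape for "/" inside the annotation key.
  # The restart runs only if the patch succeeds.
  kubectl patch deployment coredns -n kube-system --type json \
    -p='[{"op": "remove", "path": "/spec/template/metadata/annotations/eks.amazonaws.com~1compute-type"}]' &&
  kubectl rollout restart deployment coredns -n kube-system
}

# Example (not run here):
#   unpin_coredns_from_ec2
```

If the annotation was already removed, the patch fails because the path doesn't exist, and the restart is skipped.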
Resolving pods assigned to default-scheduler
To determine the scheduler that your pods are assigned to, run the following command:
kubectl get pods -o yaml -n <namespace> <pod-name> | grep schedulerName
If the schedulerName in the output is default-scheduler, then update the pod specification so that it matches a Fargate profile, and then recreate the pods.
If the schedulerName is fargate-scheduler and you still get errors, then confirm that your pod follows all rules and Fargate considerations. See the following section for more troubleshooting steps.
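For scripting, the same check can read the field directly with a jsonpath expression instead of grep. The sketch below is one possible way to do that; the function names pod_scheduler and is_on_default_scheduler are illustrative only.

```shell
#!/bin/sh
# Illustrative helpers; the function names are assumptions.

# Print the scheduler name recorded in the pod specification.
pod_scheduler() {
  # $1 = namespace, $2 = pod name
  kubectl get pod "$2" -n "$1" -o jsonpath='{.spec.schedulerName}'
}

# Succeed (exit status 0) when the pod is assigned to default-scheduler,
# which means it did not match a Fargate profile when it was created.
is_on_default_scheduler() {
  [ "$(pod_scheduler "$1" "$2")" = "default-scheduler" ]
}

# Example (not run here):
#   if is_on_default_scheduler my-namespace my-pod; then
#     echo "Update the pod specification, then recreate the pod"
#   fi
```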
Troubleshooting pods assigned to fargate-scheduler
If your pods are assigned to fargate-scheduler but remain in a Pending state, then the describe output is similar to the following:
Events:
  Type     Reason            Age                     From
  ----     ------            ----                    ----
  Warning  FailedScheduling  2m25s (x301 over 5h3m)  fargate-scheduler
To troubleshoot this error:
- Delete and recreate the pods.
- Confirm the following are not set in the pod specification YAML:
These specifications cause the fargate-scheduler to skip the pod.
- Confirm that the subnets selected in your Fargate profile have enough free IP addresses to create new pods. Each Fargate node consumes one IP address from the subnet.
- Confirm that the NAT gateway is in a public subnet and has an Elastic IP address attached to it.
- Confirm that the DHCP option sets associated with your VPC have an AmazonProvidedDNS or a valid DNS server hostname for domain-name-servers.
- Confirm that DNS hostnames and DNS resolution are turned on for your VPC.
- If you are using private subnets for your Fargate pods with only VPC endpoints configured for service communication, then confirm that you have the following endpoints with DNS names allowed:
ECR - API
ECR - DKR
S3 Gateway endpoint
- Confirm the security group attached to the VPC endpoint allows communication from Fargate to and from the API server. The VPC endpoint security group must allow port 443 ingress from the cluster VPC CIDR. Private endpoint access must also be turned on for your cluster.
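Several of the network checks in the list above can be run from the AWS CLI. The following sketch assumes that the AWS CLI is installed and configured with permission to describe EKS and EC2 resources; the function names are hypothetical, and the cluster name, Fargate profile name, and VPC ID are placeholders that you supply.

```shell
#!/bin/sh
# Illustrative checks; all identifiers passed in are placeholders.

# Free IP addresses in each subnet that a Fargate profile uses.
profile_subnet_free_ips() {
  # $1 = cluster name, $2 = Fargate profile name
  subnets="$(aws eks describe-fargate-profile --cluster-name "$1" \
    --fargate-profile-name "$2" --query 'fargateProfile.subnets' --output text)"
  # $subnets is left unquoted on purpose: one argument per subnet ID
  aws ec2 describe-subnets --subnet-ids $subnets \
    --query 'Subnets[].[SubnetId,AvailableIpAddressCount]' --output text
}

# DNS resolution and DNS hostnames settings for the VPC.
vpc_dns_settings() {
  # $1 = VPC ID
  aws ec2 describe-vpc-attribute --vpc-id "$1" --attribute enableDnsSupport \
    --query 'EnableDnsSupport.Value' --output text
  aws ec2 describe-vpc-attribute --vpc-id "$1" --attribute enableDnsHostnames \
    --query 'EnableDnsHostnames.Value' --output text
}

# Service names of the VPC endpoints in the VPC (look for ecr.api, ecr.dkr, and s3).
vpc_endpoint_services() {
  # $1 = VPC ID
  aws ec2 describe-vpc-endpoints --filters "Name=vpc-id,Values=$1" \
    --query 'VpcEndpoints[].ServiceName' --output text
}

# Whether private endpoint access is turned on for the cluster.
cluster_private_access() {
  # $1 = cluster name
  aws eks describe-cluster --name "$1" \
    --query 'cluster.resourcesVpcConfig.endpointPrivateAccess' --output text
}

# Example (not run here):
#   profile_subnet_free_ips my-cluster my-profile
#   vpc_dns_settings vpc-EXAMPLE
```

These helpers only read configuration. Security group rules on the VPC endpoint still need to be reviewed separately.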