How can I troubleshoot Amazon EKS pods on AWS Fargate that are stuck in a Pending state?
Last updated: 2022-11-21
My Amazon Elastic Kubernetes Service (Amazon EKS) pods that are running on AWS Fargate instances are stuck in a Pending state. How can I get these pods to run?
Here are some common scenarios that cause pods to remain stuck in the Pending state on Amazon Elastic Kubernetes Service (Amazon EKS) using AWS Fargate:
- There is a capacity error because a particular vCPU/memory combination is unavailable.
- The CoreDNS pods were created with a default annotation that maps them to the Amazon Elastic Compute Cloud (Amazon EC2) compute type. This annotation must be removed to schedule the pods on a Fargate node.
- The pod didn't match any Fargate Profiles when it was created and isn't assigned to the fargate-scheduler. If a pod isn't matched on creation, then it isn't automatically rescheduled to Fargate nodes, even if a matching profile is created later. In this case, the pod is assigned to the default-scheduler.
- If the pod is assigned to the fargate-scheduler but remains in a Pending state, then additional troubleshooting might be required.
Before troubleshooting, note the following Fargate pod rules:
- You must configure a namespace and match labels for your pod selectors. The Fargate workflow matches a pod to a Fargate profile only if both the namespace and the labels match the pod specification.
- If you specify multiple pod selectors within a single Fargate profile, then the pod is scheduled by fargate-scheduler if it matches any of the selectors.
- If a pod specification matches multiple Fargate profiles, then the pod is scheduled using one of the matching profiles, chosen at random. To avoid this, you can set the annotation eks.amazonaws.com/fargate-profile to the name of the intended profile within the pod specification.
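The matching rules above can be sketched in a few lines of Python. This is a simplified model, not the actual fargate-scheduler logic: it assumes exact string matching on namespaces and labels, and the Selector class, the function names, and the sample profiles are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Selector:
    # One Fargate profile selector: a namespace plus labels that must all be on the pod.
    namespace: str
    labels: dict = field(default_factory=dict)


def selector_matches(selector, pod_namespace, pod_labels):
    """True only if the namespace matches AND every selector label is on the pod."""
    if selector.namespace != pod_namespace:
        return False
    return all(pod_labels.get(key) == value for key, value in selector.labels.items())


def matching_profiles(profiles, pod_namespace, pod_labels):
    """Return the names of profiles in which ANY selector matches the pod."""
    return [
        name
        for name, selectors in profiles.items()
        if any(selector_matches(s, pod_namespace, pod_labels) for s in selectors)
    ]


# Hypothetical profiles, for illustration only.
profiles = {
    "fp-web": [Selector("prod", {"app": "web"}), Selector("staging", {"app": "web"})],
    "fp-frontend": [Selector("prod", {"tier": "frontend"})],
}

# This pod matches two profiles, so the profile used would be chosen at random.
print(matching_profiles(profiles, "prod", {"app": "web", "tier": "frontend"}))
# This pod matches no profile, so it would stay with the default-scheduler.
print(matching_profiles(profiles, "dev", {"app": "web"}))
```

If the first call returns more than one profile name for a real pod, that is the situation where pinning the pod to a single profile is worth considering.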
Find out the status of your pod
1. Run the following command to check your pod state:
kubectl get pods -n <namespace>
2. To get more error information about your pod, run the following describe command:
kubectl describe pod <pod-name> -n <namespace>
Refer to the output of the describe command to evaluate which of the following resolutions will help troubleshoot your issue.
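As a rough aid for choosing among the resolutions in this article, the triage can be sketched as a small Python function. It is only a sketch under stated assumptions: the function name and parameters are hypothetical, the inputs are the text of the describe output plus the pod's schedulerName, and the checks use only the strings quoted in this article.

```python
def suggest_resolution(pod_name, describe_output, scheduler_name):
    """Map a Pending pod's details to the section of this article that applies."""
    # The capacity message is quoted in the "Resolving capacity error" section.
    if "Fargate capacity is unavailable" in describe_output:
        return "Resolving capacity error"
    if scheduler_name == "default-scheduler":
        # Assumption: CoreDNS pods are recognized by the "coredns" name prefix.
        if pod_name.startswith("coredns"):
            return "Resolving CoreDNS pods in a Pending state"
        return "Resolving pods assigned to default-scheduler"
    if scheduler_name == "fargate-scheduler":
        return "Troubleshooting pods assigned to fargate-scheduler"
    return "Unrecognized scheduler; review the pod events manually"


print(suggest_resolution("web-1", "Warning FailedScheduling ... fargate-scheduler", "fargate-scheduler"))
print(suggest_resolution("coredns-6548845887-qk9vf", "", "default-scheduler"))
```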
Resolving capacity error
If your pods have a capacity issue, then the describe output is similar to the following:
Fargate capacity is unavailable at this time. Please try again later or in a different availability zone
This means that Fargate can't provision compute capacity, based on the vCPU/memory combination that you selected.
To resolve the error:
- Retry creating the pod after 15-20 minutes. Because the error is capacity-based, the exact amount of time can vary.
- Change the resource requests (CPU/memory) within your pod specification. The Fargate workflow then provisions a new vCPU/memory combination.
Note: You're billed for the vCPU/memory combination that Fargate selects for your pod. For more information about how that combination is determined from your pod specification, see Pod CPU and memory. The output of kubectl describe node can show a much higher vCPU/memory value than the combination that you requested. This is because Fargate doesn't always have capacity that exactly matches your requests, and it provisions resources from a capacity pool on a best-effort basis. However, you're billed only for the vCPU/memory combination that corresponds to your pod's requests.
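To see how a change in requests can land on a different vCPU/memory combination, the rounding can be approximated in Python. Everything here is an assumption to verify against the Pod CPU and memory page: the table values and the 256 MB overhead are recalled from that documentation rather than taken from this article, the "smallest combination that fits" rule is a simplification, and the names are hypothetical.

```python
# (vCPU, allowed memory values in GB). Assumed values; verify against "Pod CPU and memory".
FARGATE_COMBINATIONS = [
    (0.25, [0.5, 1, 2]),
    (0.5, [1, 2, 3, 4]),
    (1, list(range(2, 9))),
    (2, list(range(4, 17))),
    (4, list(range(8, 31))),
    (8, list(range(16, 61, 4))),
    (16, list(range(32, 121, 8))),
]

# Assumption: memory that Fargate adds for Kubernetes components on each pod.
OVERHEAD_GB = 0.25


def provisioned_combination(cpu_request, memory_request_gb):
    """Return the smallest (vCPU, memory GB) pair that covers the requests, or None."""
    needed_memory = memory_request_gb + OVERHEAD_GB
    for vcpu, memory_options in FARGATE_COMBINATIONS:
        if vcpu < cpu_request:
            continue
        for memory in memory_options:
            if memory >= needed_memory:
                return vcpu, memory
    return None  # the requests exceed every combination in the table


print(provisioned_combination(0.25, 0.5))  # (0.25, 1)
print(provisioned_combination(1, 8))       # (2, 9)
```

In this model, a request of 1 vCPU and 8 GB moves up to the 2 vCPU tier because the added overhead pushes the memory past the 1 vCPU maximum. Lowering the memory request slightly would keep the pod on the smaller tier.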
Resolving CoreDNS pods in a Pending state
If CoreDNS pods are in a Pending state, then the output of the get pods command is similar to the following:
kubectl get pods -n kube-system
NAME                       READY   STATUS    RESTARTS   AGE
coredns-6548845887-qk9vf   0/1     Pending   0          157m
This might be because the CoreDNS deployment has the following default annotation: eks.amazonaws.com/compute-type: ec2.
To resolve this and re-assign the pods to the Fargate scheduler, see Update CoreDNS.
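Removing an annotation whose key contains a slash is easy to get wrong, because the slash must be escaped in a JSON Patch path. The sketch below builds such a patch in Python. It assumes the annotation sits on the deployment's pod template, and the helper names are hypothetical. The Update CoreDNS page remains the authoritative procedure.

```python
import json


def json_pointer_escape(token):
    """Escape one JSON Pointer token (RFC 6901): '~' becomes '~0', '/' becomes '~1'."""
    return token.replace("~", "~0").replace("/", "~1")


def remove_annotation_patch(annotation_key):
    """Build a JSON Patch (RFC 6902) that removes a pod-template annotation."""
    path = "/spec/template/metadata/annotations/" + json_pointer_escape(annotation_key)
    return json.dumps([{"op": "remove", "path": path}])


patch = remove_annotation_patch("eks.amazonaws.com/compute-type")
print(patch)

# One way to apply the patch (requires access to the cluster):
#   kubectl patch deployment coredns -n kube-system --type json -p '<patch printed above>'
```

After the annotation is removed, the existing Pending CoreDNS pods typically need to be recreated so that they are evaluated against the Fargate profile again.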
Troubleshooting pods assigned to fargate-scheduler
There are multiple reasons why pods assigned to fargate-scheduler might be stuck in Pending, ranging from misconfigured pod annotations to networking issues. If your pods remain in a Pending state, then the describe output is similar to the following:
Events:
  Type     Reason            Age                     From
  ----     ------            ----                    ----
  Warning  FailedScheduling  2m25s (x301 over 5h3m)  fargate-scheduler
To troubleshoot this error:
- Delete and recreate the pods.
- Confirm the following are not set in the pod specification YAML:
These specifications cause the fargate-scheduler to skip the pod.
- Confirm that the subnets selected in your Fargate profile have enough free IP addresses to create new pods. Each Fargate node consumes one IP address from the subnet.
- Confirm that the NAT gateway is in a public subnet and has an Elastic IP address attached to it.
- Confirm that the DHCP options set associated with your VPC specifies AmazonProvidedDNS or a valid DNS server for domain-name-servers.
- Confirm that DNS hostnames and DNS resolution are turned on for your VPC.
- If your Fargate pods use private subnets with only VPC endpoints configured for service communication, then you must allow these endpoints with DNS names:
ECR - API
ECR - DKR
S3 Gateway endpoint
- Confirm the security group attached to the VPC endpoint allows communication from Fargate to and from the API server. The VPC endpoint security group must allow port 443 ingress from the cluster VPC CIDR. Private endpoint access must also be turned on for your cluster.
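The security group check in the last item can be reasoned about offline with Python's ipaddress module. This is a minimal sketch: the rule dictionaries use a hypothetical, simplified shape (protocol, from_port, to_port, cidr) rather than the exact structure returned by any AWS API, and the CIDR values are examples.

```python
import ipaddress


def allows_https_from(ingress_rules, vpc_cidr):
    """True if some rule permits TCP 443 from a range that covers the whole VPC CIDR."""
    vpc = ipaddress.ip_network(vpc_cidr)
    for rule in ingress_rules:
        all_protocols = rule["protocol"] == "-1"  # "-1" stands for all protocols and ports
        protocol_ok = all_protocols or rule["protocol"] == "tcp"
        port_ok = all_protocols or rule["from_port"] <= 443 <= rule["to_port"]
        source = ipaddress.ip_network(rule["cidr"])
        # subnet_of requires both networks to have the same IP version.
        covers_vpc = source.version == vpc.version and vpc.subnet_of(source)
        if protocol_ok and port_ok and covers_vpc:
            return True
    return False


# Hypothetical rules for a VPC endpoint security group.
rules = [
    {"protocol": "tcp", "from_port": 443, "to_port": 443, "cidr": "10.0.0.0/16"},
]

print(allows_https_from(rules, "10.0.0.0/16"))  # True
print(allows_https_from(rules, "10.1.0.0/16"))  # False
```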
Resolving pods assigned to default-scheduler
To determine the scheduler that your pods are assigned to, run the following command:
kubectl get pods -o yaml -n <namespace> <pod-name> | grep schedulerName
In the output, check the schedulerName value. If it's default-scheduler, then the fargate-scheduler skipped this pod. To troubleshoot this, check your pod configuration for compute-type annotations and refer to AWS Fargate considerations.
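The scheduler name and the compute-type annotation can also be read together from the pod's JSON representation, for example the output of kubectl get pod with -o json. The following sketch parses such a document in Python. The function name and the sample pod are hypothetical. Only the schedulerName field and the eks.amazonaws.com/compute-type annotation key come from this article.

```python
import json


def scheduling_info(pod_json):
    """Extract the scheduler name and compute-type annotation from a pod's JSON."""
    pod = json.loads(pod_json)
    annotations = pod.get("metadata", {}).get("annotations") or {}
    return {
        "scheduler": pod.get("spec", {}).get("schedulerName"),
        "compute_type": annotations.get("eks.amazonaws.com/compute-type"),
    }


# Hypothetical, heavily trimmed pod document for illustration.
sample = """
{
  "metadata": {
    "name": "demo",
    "annotations": {"eks.amazonaws.com/compute-type": "ec2"}
  },
  "spec": {"schedulerName": "default-scheduler"}
}
"""

print(scheduling_info(sample))
# {'scheduler': 'default-scheduler', 'compute_type': 'ec2'}
```

A result like the one above, with default-scheduler and a compute type of ec2, suggests that the annotation is what kept the pod away from Fargate.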