How do I troubleshoot a failed upgrade of my Amazon EKS cluster?

Last updated: 2022-09-23

My Amazon Elastic Kubernetes Service (Amazon EKS) cluster fails to update. How do I resolve this?

Short description

To resolve a failed Amazon EKS cluster update, try the following:

  • For an IpNotAvailable error, verify that the subnet that's associated with your cluster has enough available IP addresses.
  • For a SubnetNotFound error, verify that the subnets exist and are correctly tagged.
  • For a SecurityGroupNotFound error, verify that the security groups that are associated with the cluster exist.
  • For an EniLimitReached error, increase the elastic network interface quota for the AWS account.
  • For an AccessDenied error, verify that you have the correct permissions.
  • For an OperationNotPermitted error, verify that the Amazon EKS service role has the correct permissions.
  • For a VpcIdNotFound error, verify that the VPC that's associated with the cluster exists.
  • Verify that the resources that you used to create the cluster weren't deleted.
  • For clusters created with eksctl, check whether the AWS CloudFormation stack failed to roll back.
  • For transient backend workflow issues, update the cluster again.

Note: The AWS console links in the Resolution section direct you to the us-east-1 AWS Region. If your resources are located in another AWS Region, then make sure to change the Region to the one where your resources reside.

Resolution

Verify that the subnets have available IP addresses (IpNotAvailable)

To update an Amazon EKS cluster, you must have five available IP addresses from each of the subnets. If you don't have enough available IP addresses, then you can delete unused network interfaces within the cluster subnets. Deleting a network interface releases the IP address. For more information, see Delete a network interface.

To check for available IP addresses in the Amazon EKS cluster subnets:

1.    Open the Amazon EKS console.

2.    Select the Amazon EKS cluster.

3.    Choose the Configuration tab.

4.    Choose the Networking tab.

5.    Under Subnets, select a subnet to open the Subnets page.

6.    Select a subnet and choose the Details tab.

7.    Locate Available IPv4 addresses to see how many available IP addresses the subnet has.

To check from the AWS Command Line Interface (AWS CLI), run the following commands:

1.    Get the subnets that are associated with the cluster:

$ aws eks describe-cluster --name cluster-name --region your-region

Note: Replace cluster-name with your cluster's name and your-region with your AWS Region.

Output:

...
"subnetIds": [
    "subnet-6782e71e",
    "subnet-e7e761ac"
],
...

2.    Describe the subnets from the preceding output:

$ aws ec2 describe-subnets --subnet-ids subnet-id --region your-region

Note: Replace subnet-id with your subnet's ID and your-region with your Region.

Output:

...
"AvailableIpAddressCount": 4089,
...
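
The two commands above can be combined into one check. The following is a minimal sketch, not an official tool: the function name check_cluster_subnet_ips and the MIN_FREE_IPS variable are illustrative, and the threshold of five mirrors the requirement stated at the start of this section.

```shell
# Sketch: report the available IPv4 address count for each cluster subnet.
# MIN_FREE_IPS is an illustrative name; 5 matches the requirement in this article.
MIN_FREE_IPS=5

check_cluster_subnet_ips() {
  local cluster="$1" region="$2"
  local subnet_ids

  # With --output text, the subnet IDs are returned on one tab-separated line.
  subnet_ids=$(aws eks describe-cluster --name "$cluster" --region "$region" \
    --query 'cluster.resourcesVpcConfig.subnetIds' --output text) || return 1

  # One row per subnet: "<subnet-id> <available-ip-count>".
  # $subnet_ids is intentionally unquoted so that each ID becomes its own argument.
  aws ec2 describe-subnets --subnet-ids $subnet_ids --region "$region" \
    --query 'Subnets[].[SubnetId,AvailableIpAddressCount]' --output text |
  while read -r id count; do
    if [ "$count" -lt "$MIN_FREE_IPS" ]; then
      echo "$id $count LOW"
    else
      echo "$id $count OK"
    fi
  done
}

# Usage (replace the placeholders):
#   check_cluster_subnet_ips cluster-name your-region
```

Subnets flagged LOW are the ones to free up before you retry the update.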

If you don't have enough available IP addresses, then you can set the WARM_IP_TARGET environment variable on the aws-node DaemonSet:

$ kubectl set env ds aws-node -n kube-system WARM_IP_TARGET=number

Note: Replace number with the number of IP addresses that you want to reserve from the subnets.

WARM_IP_TARGET defines how many secondary IP addresses the Container Network Interface (CNI) plugin keeps reserved for pods. For more information on WARM_IP_TARGET and other configuration variables, see What are the best practices to configure the Amazon VPC CNI plugin to use an IP address in VPC subnets with Amazon EKS?
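
To confirm the value that is currently in effect, you can read the variable back from the DaemonSet. This sketch assumes that the VPC CNI container is the first container in the aws-node pod template; the function name get_warm_ip_target is illustrative.

```shell
# Sketch: print the WARM_IP_TARGET value configured on the aws-node DaemonSet.
# Assumption: the CNI container is containers[0] in the pod template.
get_warm_ip_target() {
  local value
  value=$(kubectl get daemonset aws-node -n kube-system \
    -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="WARM_IP_TARGET")].value}') || return 1

  if [ -n "$value" ]; then
    echo "WARM_IP_TARGET=$value"
  else
    echo "WARM_IP_TARGET is not set"
  fi
}

# Usage:
#   get_warm_ip_target
```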

Verify that the subnets exist and are correctly tagged (SubnetNotFound)

To verify that your subnets exist, run the following command:

$ aws ec2 describe-subnets --subnet-ids subnet-id --region region

Note: Replace subnet-id with your subnet's ID and region with the Region where the subnets are located.

If the subnets don't exist, then you receive the following error:

An error occurred (InvalidSubnetID.NotFound) when calling the DescribeSubnets operation: The subnet ID 'subnet-id' does not exist
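
When a cluster uses several subnets, it can be convenient to test each ID separately so that one missing subnet doesn't hide the others. The following sketch relies only on the exit status of the command shown above. A non-zero status can also mean missing credentials or permissions, so the message is deliberately hedged. The function name report_missing_subnets is illustrative.

```shell
# Sketch: check each subnet ID on its own and report the ones that can't be described.
report_missing_subnets() {
  local region="$1"
  shift
  local id
  for id in "$@"; do
    if aws ec2 describe-subnets --subnet-ids "$id" --region "$region" >/dev/null 2>&1; then
      echo "$id exists"
    else
      # The call failed: the subnet is gone, or the caller can't access it.
      echo "$id not found or not accessible"
    fi
  done
}

# Usage (replace the placeholders):
#   report_missing_subnets your-region subnet-id-1 subnet-id-2
```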

To verify that the subnets are correctly tagged:

1.    Identify the subnets that are associated with the cluster using the steps in the Verify that the subnets have available IP addresses (IpNotAvailable) section.

2.    Open the VPC console.

3.    Navigate to the Subnets page.

4.    Select the subnets that should be associated with the cluster and choose the Tags tab in the Details pane.

5.    Verify that each subnet has the correct tags:

Key - kubernetes.io/cluster/cluster-name

Note: Replace cluster-name with your cluster's name. The preceding tag is added only to Amazon EKS clusters that run version 1.18 or earlier. For clusters created with Kubernetes version 1.19 or later, the tag isn't mandatory.

The value of the tag can be either shared or owned.
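
The same tag check can be done from the AWS CLI. This is a sketch under the assumption that the tag key follows the kubernetes.io/cluster/cluster-name pattern shown above; the function name show_cluster_tag is illustrative. With text output, the AWS CLI prints None when a subnet doesn't have the tag.

```shell
# Sketch: print the value of the kubernetes.io/cluster/<cluster-name> tag for each subnet.
show_cluster_tag() {
  local cluster="$1" region="$2"
  shift 2
  local key="kubernetes.io/cluster/${cluster}"

  # Each output row is "<subnet-id> <tag-value>"; the value is "None" if the tag is absent.
  aws ec2 describe-subnets --subnet-ids "$@" --region "$region" \
    --query "Subnets[].[SubnetId, Tags[?Key=='${key}'].Value | [0]]" --output text |
  while read -r id value; do
    case "$value" in
      shared|owned) echo "$id tagged ($value)" ;;
      *) echo "$id missing or unexpected tag value ($value)" ;;
    esac
  done
}

# Usage (replace the placeholders):
#   show_cluster_tag cluster-name your-region subnet-id-1 subnet-id-2
```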

Verify that the security groups that are associated with the cluster exist (SecurityGroupNotFound)

To identify the security groups that are associated with the cluster:

1.    Open the Amazon EKS console.

2.    Select the cluster.

3.    Choose the Configuration tab.

4.    Choose the Networking tab.

5.    Select the security groups that are listed under Cluster security group and Additional security groups.

If the security group exists, then the console opens and displays the security group details.

From the AWS CLI:

1.    Get the security groups associated with the cluster:

$ aws eks describe-cluster --name cluster-name --region your-region

Note: Replace cluster-name with your cluster's name and your-region with your Region.

Output:

...
"securityGroupIds": [
    "sg-xxxxxxxx"
]
...

2.    Describe the security group from the preceding output:

$ aws ec2 describe-security-groups --group-ids security-group-id --region your-region

Note: Replace security-group-id with your security group's ID and your-region with your Region.
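
The two steps can be chained so that every security group on the cluster is checked in one pass. The following sketch reads both the cluster security group and the additional security groups from the describe-cluster output. The function name check_cluster_security_groups is illustrative, and a failed lookup can also indicate missing permissions rather than a deleted group.

```shell
# Sketch: look up every security group that the cluster references.
check_cluster_security_groups() {
  local cluster="$1" region="$2"
  local sg_ids id

  # Flatten the cluster security group and the additional groups into one list.
  sg_ids=$(aws eks describe-cluster --name "$cluster" --region "$region" \
    --query 'cluster.resourcesVpcConfig.[clusterSecurityGroupId, securityGroupIds][]' \
    --output text) || return 1

  for id in $sg_ids; do
    # On success, print the group's ID, name, and VPC. On failure, flag the ID.
    if ! aws ec2 describe-security-groups --group-ids "$id" --region "$region" \
         --query 'SecurityGroups[].[GroupId,GroupName,VpcId]' --output text 2>/dev/null; then
      echo "$id not found or not accessible"
    fi
  done
}

# Usage (replace the placeholders):
#   check_cluster_security_groups cluster-name your-region
```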

Increase the elastic network interface quota for the AWS account (EniLimitReached)

If you reached your network interface quota, then you can remove unused network interfaces or request a limit increase.

If your network interfaces are attached to a cluster, then delete the cluster to remove the network interface. If your network interfaces are attached to unused worker nodes, then delete the Auto Scaling group for self-managed node groups. For managed node groups, delete the node group from the Amazon EKS console. To move workloads from one node group to another node group, see Migrating to a new node group.
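
To find candidates for removal, you can list the network interfaces that aren't attached to anything. This sketch filters on the available status within one VPC; the function name list_unused_enis is illustrative. Review each interface before you delete it, because an unattached interface might still be reserved for later use.

```shell
# Sketch: list unattached (status "available") network interfaces in a VPC.
list_unused_enis() {
  local region="$1" vpc_id="$2"
  aws ec2 describe-network-interfaces --region "$region" \
    --filters Name=status,Values=available Name=vpc-id,Values="$vpc_id" \
    --query 'NetworkInterfaces[].[NetworkInterfaceId,SubnetId,Description]' \
    --output text
}

# Usage (replace the placeholders):
#   list_unused_enis your-region vpc-id
#
# To delete an interface that you confirmed is unused:
#   aws ec2 delete-network-interface --network-interface-id eni-id --region your-region
```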

Verify that you have the correct permissions (AccessDenied)

1.    Open the IAM console.

2.    On the navigation pane, choose Roles or Users.

3.    Select the role or user.

4.    Verify that the IAM role or user has the correct permissions.
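
As a complement to the console check, the IAM policy simulator can evaluate specific actions for a role or user. The following is a sketch only: the function name check_eks_update_permissions is illustrative, and the list of actions is an assumption about what an update typically needs, not an authoritative list. The ARN must be an IAM user or role ARN. The simulation might not reflect every condition that applies to a real request.

```shell
# Sketch: simulate a few Amazon EKS actions for an IAM principal.
# The action list below is an assumption; adjust it to the operation that failed.
check_eks_update_permissions() {
  local principal_arn="$1"
  aws iam simulate-principal-policy \
    --policy-source-arn "$principal_arn" \
    --action-names eks:DescribeCluster eks:UpdateClusterVersion eks:UpdateClusterConfig \
    --query 'EvaluationResults[].[EvalActionName,EvalDecision]' --output text |
  while read -r action decision; do
    # EvalDecision is "allowed", "explicitDeny", or "implicitDeny".
    if [ "$decision" = "allowed" ]; then
      echo "$action allowed"
    else
      echo "$action DENIED ($decision)"
    fi
  done
}

# Usage (replace the placeholder ARN):
#   check_eks_update_permissions arn:aws:iam::account-id:role/role-name
```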

Verify that the service role has the correct permissions (OperationNotPermitted)

1.    Open the IAM console.

2.    On the navigation pane, choose Roles.

3.    Filter for AWSServiceRoleForAmazonEKS and select the role.

4.    Verify that the role has the AmazonEKSServiceRolePolicy policy attached.

If the policy isn't attached, then see Adding IAM identity permissions.
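
The attachment can also be checked from the AWS CLI. This sketch lists the managed policies that are attached to the service-linked role and looks for the expected name; the function name service_role_has_policy is illustrative.

```shell
# Sketch: succeed (exit status 0) if AmazonEKSServiceRolePolicy is attached to the role.
service_role_has_policy() {
  aws iam list-attached-role-policies --role-name AWSServiceRoleForAmazonEKS \
    --query 'AttachedPolicies[].PolicyName' --output text |
  tr '\t' '\n' |
  grep -qx 'AmazonEKSServiceRolePolicy'
}

# Usage:
#   if service_role_has_policy; then echo "policy attached"; else echo "policy missing"; fi
```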

Verify that the VPC associated with the cluster exists (VpcIdNotFound)

1.    Open the Amazon EKS console.

2.    Select the cluster.

3.    Choose the Configuration tab.

4.    Choose the Networking tab.

5.    Select the VPC ID link to see if the VPC exists.

If the VPC doesn't exist, then you must create a new cluster.
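
From the AWS CLI, the equivalent check reads the VPC ID from the cluster and then tries to describe that VPC. The function name check_cluster_vpc is illustrative, and a failed lookup can also mean that the caller lacks permission to describe the VPC.

```shell
# Sketch: verify that the VPC recorded on the cluster can still be described.
check_cluster_vpc() {
  local cluster="$1" region="$2" vpc_id

  vpc_id=$(aws eks describe-cluster --name "$cluster" --region "$region" \
    --query 'cluster.resourcesVpcConfig.vpcId' --output text) || return 1

  if aws ec2 describe-vpcs --vpc-ids "$vpc_id" --region "$region" >/dev/null 2>&1; then
    echo "$vpc_id exists"
  else
    echo "$vpc_id not found or not accessible"
    return 1
  fi
}

# Usage (replace the placeholders):
#   check_cluster_vpc cluster-name your-region
```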

Verify that resources associated with the cluster were deleted

If you created the cluster on the Amazon EKS console and the subnets that were used to create the cluster were deleted, then the cluster can't update. You must recreate the cluster and move the workloads from the old cluster to the new one.

Verify that the AWS CloudFormation stack failed to roll back (eksctl)
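
One way to check this from the AWS CLI is to read the status of the cluster's CloudFormation stack. The following is a minimal sketch that assumes eksctl's default stack name for the cluster, eksctl-cluster-name-cluster. If your stack uses a different name, then pass that name to describe-stacks instead. The function name eksctl_stack_status is illustrative.

```shell
# Sketch: print the CloudFormation stack status for an eksctl-created cluster.
# Assumption: the stack uses eksctl's default name, eksctl-<cluster-name>-cluster.
eksctl_stack_status() {
  local cluster="$1" region="$2" status

  status=$(aws cloudformation describe-stacks \
    --stack-name "eksctl-${cluster}-cluster" --region "$region" \
    --query 'Stacks[0].StackStatus' --output text) || return 1

  case "$status" in
    # Matches ROLLBACK_FAILED and UPDATE_ROLLBACK_FAILED.
    *ROLLBACK_FAILED) echo "$status: rollback failed, inspect the stack events" ;;
    *) echo "$status" ;;
  esac
}

# Usage (replace the placeholders):
#   eksctl_stack_status cluster-name your-region
#
# To see why the rollback failed:
#   aws cloudformation describe-stack-events --stack-name eksctl-cluster-name-cluster --region your-region
```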

Update the cluster again

Transient issues can cause the backend workflows to be unstable. If the preceding troubleshooting steps don't relate to your issue, then try to update the cluster again.
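
Before you retry, it can help to confirm how the previous update ended and which error code, if any, it recorded. The following sketch prints the update status and only the first error code. The function name show_update_result is illustrative. You can get the update ID from aws eks list-updates.

```shell
# Sketch: summarize the result of a previous cluster update.
# Only the first entry in the update's error list is reported.
show_update_result() {
  local cluster="$1" update_id="$2" region="$3"

  aws eks describe-update --name "$cluster" --update-id "$update_id" --region "$region" \
    --query 'update.[status, errors[0].errorCode]' --output text |
  {
    read -r status code
    case "$status" in
      Failed) echo "update failed with error code $code" ;;
      *) echo "update status: $status" ;;
    esac
  }
}

# Usage (replace the placeholders):
#   aws eks list-updates --name cluster-name --region your-region
#   show_update_result cluster-name update-id your-region
#
# To retry a Kubernetes version update:
#   aws eks update-cluster-version --name cluster-name --kubernetes-version version --region your-region
```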

