How do I troubleshoot Amazon EKS managed node group creation failures?

Last updated: 2022-05-11

My Amazon Elastic Kubernetes Service (Amazon EKS) managed node group failed to create. Nodes can't join the cluster and I received an error similar to the following:

"Instances failed to join the Kubernetes cluster"


Follow these troubleshooting instructions to resolve the Amazon EKS managed node group creation failure.

Note: If you receive errors when running AWS Command Line Interface (AWS CLI) commands, make sure that you’re using the most recent AWS CLI version.

Confirm that your Amazon EKS worker nodes can reach the API server endpoint for your cluster

You can launch Amazon EKS worker nodes in a subnet whose route table has a route to the API endpoint through a NAT gateway or internet gateway.

If your worker nodes are launched in a restricted private network, then confirm that your worker nodes can reach the Amazon EKS API server endpoint. For more information, see the requirements to run Amazon EKS in a private cluster without outbound internet access.

Note: If your nodes are in a private subnet backed by a NAT gateway, it's a best practice to create the NAT gateway in a public subnet.
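To see which routes the worker node subnet actually has, you can describe its route table with the AWS CLI. The subnet ID below is a placeholder:

```shell
SUBNET_ID="subnet-0abc123"  # placeholder: your worker node subnet ID

# List the routes for the route table associated with that subnet.
# A 0.0.0.0/0 route via a nat-* or igw-* ID indicates outbound access.
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=${SUBNET_ID}" \
  --query 'RouteTables[].Routes[].[DestinationCidrBlock,GatewayId,NatGatewayId]' \
  --output table
```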

If you're not using AWS PrivateLink endpoints, then verify access to API endpoints through a proxy server for the following AWS services:

  • Amazon Elastic Compute Cloud (Amazon EC2)
  • Amazon Elastic Container Registry (Amazon ECR)
  • Amazon Simple Storage Service (Amazon S3)

To verify that the worker node has access to the API server, run the following netcat command from the worker node:

nc -vz <api-server-endpoint> 443

Note: Replace <api-server-endpoint> with the API server endpoint for your cluster.
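If you don't have the API server endpoint handy, you can look it up with the AWS CLI and pass it to netcat. The cluster name below is a placeholder:

```shell
# Look up the cluster's API server endpoint (replace my-cluster with your
# cluster name; it's a placeholder here).
ENDPOINT=$(aws eks describe-cluster --name my-cluster \
  --query 'cluster.endpoint' --output text)

# netcat expects a bare hostname, so strip the https:// scheme first.
HOST=${ENDPOINT#https://}
nc -vz -w 5 "$HOST" 443
```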

Connect to your Amazon EKS worker node Amazon EC2 instance using SSH. Then, run the following command to check kubelet logs:

journalctl -f -u kubelet

If the kubelet logs don't provide information on the source of the issue, then run the following command to check the status of the kubelet on the worker node:

sudo systemctl status kubelet

Collect the Amazon EKS logs and the operating system logs for further troubleshooting.
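One way to collect both sets of logs is the EKS log collector script from the awslabs/amazon-eks-ami repository. The download path below is an assumption; confirm it against the repository before use:

```shell
# Download the EKS log collector script (path assumed; verify it in the
# awslabs/amazon-eks-ami repository) and run it on the worker node.
curl -sSLO https://raw.githubusercontent.com/awslabs/amazon-eks-ami/master/log-collector-script/linux/eks-log-collector.sh

# Run it only if the download succeeded; the script bundles kubelet,
# container runtime, and OS logs into a tarball for further analysis.
[ -s eks-log-collector.sh ] && sudo bash eks-log-collector.sh
```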

Verify that the Amazon EC2, Amazon ECR, and Amazon S3 API endpoints are reachable

Use SSH to connect to one of the worker nodes.

To verify if the Amazon EC2, Amazon ECR, and Amazon S3 API endpoints for your AWS Region are reachable, run the following command:

$ nc -vz ec2.<region>.amazonaws.com 443
$ nc -vz api.ecr.<region>.amazonaws.com 443
$ nc -vz s3.<region>.amazonaws.com 443

Note: Replace <region> with the AWS Region for your worker node.
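The same checks can be scripted as a loop so that any unreachable endpoint stands out at a glance. The Region below is a placeholder:

```shell
REGION="us-east-1"  # placeholder: use the AWS Region for your worker node

# Test TCP connectivity to each service endpoint on port 443,
# with a 5-second timeout per host.
for HOST in "ec2.${REGION}.amazonaws.com" \
            "api.ecr.${REGION}.amazonaws.com" \
            "s3.${REGION}.amazonaws.com"; do
  if nc -vz -w 5 "$HOST" 443 >/dev/null 2>&1; then
    echo "OK      $HOST"
  else
    echo "FAILED  $HOST"
  fi
done
```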

Configure the user data for your worker node

Custom AMIs specified in launch templates must include the Amazon EKS bootstrap invocation in the launch template user data. Amazon EKS doesn't merge the default bootstrap information into user data. For more information, see Introducing launch template and custom AMI support in Amazon EKS Managed Node Groups.

To configure user data for your worker node, you can specify the user data when launching your Amazon EC2 instances.

Update the user data field for the worker nodes to be similar to the following:

#!/bin/bash
set -o xtrace
/etc/eks/bootstrap.sh ${ClusterName} ${BootstrapArguments}

Note: Replace ${ClusterName} with the name of your Amazon EKS cluster. Replace ${BootstrapArguments} with additional bootstrap values, or leave the value blank.
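When you supply this script through an Amazon EC2 launch template, the user data must be base64-encoded. A sketch using the AWS CLI, with placeholder names throughout:

```shell
# Write the bootstrap user data (cluster name is a placeholder).
cat > userdata.txt <<'EOF'
#!/bin/bash
set -o xtrace
/etc/eks/bootstrap.sh my-cluster
EOF

# Create a launch template carrying the encoded user data.
# The template name and AMI ID are placeholders; base64 -w 0 is GNU
# coreutils syntax (on macOS, use base64 -i instead).
aws ec2 create-launch-template \
  --launch-template-name eks-worker-template \
  --launch-template-data "{\"ImageId\":\"ami-0123456789abcdef0\",\"UserData\":\"$(base64 -w 0 userdata.txt)\"}"
```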

Confirm that the Amazon VPC for your Amazon EKS cluster has support for a DNS hostname and DNS resolution

You must enable DNS hostnames and DNS resolution for your Amazon VPC after changing the cluster endpoint access from public to private. When you enable endpoint private access for your cluster, Amazon EKS creates a Route 53 private hosted zone on your behalf and associates it with your cluster's Amazon Virtual Private Cloud (Amazon VPC).

For more information, see Amazon EKS cluster endpoint access control.
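You can check, and if necessary enable, both VPC attributes with the AWS CLI. The VPC ID below is a placeholder:

```shell
VPC_ID="vpc-0abc123"  # placeholder: your cluster's VPC ID

# Check the current DNS settings for the VPC.
aws ec2 describe-vpc-attribute --vpc-id "$VPC_ID" --attribute enableDnsSupport
aws ec2 describe-vpc-attribute --vpc-id "$VPC_ID" --attribute enableDnsHostnames

# Enable them if either attribute returns "Value": false.
aws ec2 modify-vpc-attribute --vpc-id "$VPC_ID" --enable-dns-support '{"Value":true}'
aws ec2 modify-vpc-attribute --vpc-id "$VPC_ID" --enable-dns-hostnames '{"Value":true}'
```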

Verify worker node permissions

Make sure that the IAM instance role associated with the worker node has the AmazonEKSWorkerNodePolicy and AmazonEC2ContainerRegistryReadOnly policies attached.

Note: The AWS managed policy AmazonEKS_CNI_Policy must be attached to either the node instance role or a different role that's mapped to the aws-node Kubernetes service account. It's a best practice to attach the policy to the role that's associated with the Kubernetes service account instead of the node instance role.
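To list the managed policies attached to the node instance role, a quick check with the AWS CLI (the role name is a placeholder):

```shell
NODE_ROLE="my-eks-node-role"  # placeholder: your node instance role name

# The output should include AmazonEKSWorkerNodePolicy and
# AmazonEC2ContainerRegistryReadOnly.
aws iam list-attached-role-policies --role-name "$NODE_ROLE" \
  --query 'AttachedPolicies[].PolicyName' --output text
```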

Confirm worker node security group traffic requirements

Confirm that your control plane's security group and worker node security group are configured with the recommended settings for inbound and outbound traffic.
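To review the rules on the cluster security group, you can look up its ID from the cluster and then describe it. The cluster name below is a placeholder:

```shell
# Find the cluster security group ID (cluster name is a placeholder).
QUERY='cluster.resourcesVpcConfig.clusterSecurityGroupId'
SG_ID=$(aws eks describe-cluster --name my-cluster \
  --query "$QUERY" --output text)

# Inspect the group's inbound and outbound rules.
aws ec2 describe-security-groups --group-ids "$SG_ID" \
  --query 'SecurityGroups[].[IpPermissions,IpPermissionsEgress]'
```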

Confirm that the Amazon VPC subnets for the worker node have available IP addresses

If the Amazon VPC is running out of IP addresses, you can associate a secondary CIDR to your existing Amazon VPC. For more information, see Increase available IP addresses for your Amazon VPC.
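To see how many free IP addresses each worker node subnet still has, you can query the subnets with the AWS CLI. The subnet IDs below are placeholders:

```shell
SUBNETS="subnet-0abc123 subnet-0def456"  # placeholders: your node subnets

# Show the remaining free IP address count for each subnet.
aws ec2 describe-subnets \
  --subnet-ids $SUBNETS \
  --query 'Subnets[].{Subnet:SubnetId,FreeIPs:AvailableIpAddressCount}' \
  --output table
```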