How can I change the status of my nodes from NotReady or Unknown status to Ready status?

Last updated: 2020-01-28

My Amazon Elastic Kubernetes Service (Amazon EKS) worker nodes are in NotReady or Unknown status. I want to get my worker nodes back in Ready status again.

Short Description

You can't schedule pods on a node that's in NotReady or Unknown status. You can schedule pods only on a node that's in Ready status.

The following resolution addresses nodes in NotReady or Unknown status.

If your node is in the MemoryPressure, DiskPressure, or PIDPressure status, then you must manage your resources to allow additional pods to be scheduled on the node. If your node is in NetworkUnavailable status, then you must properly configure the network on the node. For more information, see Node Status.
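
For example, to review the conditions that the kubelet reports for a node, you can run commands similar to the following, where yourNodeName is a placeholder for the name of your worker node:

$ kubectl get nodes
$ kubectl describe node yourNodeName

The Conditions section of the describe output lists the MemoryPressure, DiskPressure, PIDPressure, and Ready conditions, along with a reason and message for each.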

Note: For information on managing pod evictions and resource limits, see Configure Out of Resource Handling and Managing Compute Resources for Containers.

Resolution

Check the aws-node and kube-proxy pods to see why the nodes are in NotReady status

A node in NotReady status isn't available for pods to be scheduled on.

1.    To check the status of your aws-node and kube-proxy pods, run the following command:

$ kubectl get pods -n kube-system -o wide

2.    Check the status of the aws-node and kube-proxy pods by reviewing the output from step 1.

Note: The aws-node and kube-proxy pods are each managed by a DaemonSet, so each node in the cluster must have one aws-node pod and one kube-proxy pod running on it. If no aws-node or kube-proxy pods are listed, then skip to step 4.

If your node status is normal, then your aws-node and kube-proxy pods should be in Running status. For example:

$ kubectl get pods -n kube-system -o wide
NAME                             READY   STATUS    RESTARTS   AGE        IP              NODE
aws-node-qvqr2                   1/1     Running   0          4h31m      192.168.54.115  ip-192-168-54-115.ec2.internal
kube-proxy-292b4                 1/1     Running   0          4h31m      192.168.54.115  ip-192-168-54-115.ec2.internal

If either pod is in a status other than Running, run the following command:

$ kubectl describe pod yourPodName -n kube-system

3.    To get additional information from the aws-node and kube-proxy pod logs, run the following command:

$ kubectl logs yourPodName -n kube-system

The logs and the events from the describe output can show why the pods aren't in Running status. For a node to change to Ready status, both the aws-node and kube-proxy pods must be Running on that node.
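
If a pod is restarting, you can also check the logs from the pod's previous container instance with a command similar to the following, where yourPodName is a placeholder:

$ kubectl logs yourPodName -n kube-system --previous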

Note: The names of the pods can differ from aws-node-qvqr2 and kube-proxy-292b4, which are shown in the preceding examples.

4.    If the aws-node and kube-proxy pods aren't listed after running the command from step 1, then run the following commands:

$ kubectl describe daemonset aws-node -n kube-system
$ kubectl describe daemonset kube-proxy -n kube-system

Search the output of the preceding commands for a reason why the pods can't be started.
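
You can also compare the desired and current pod counts for each DaemonSet with a command similar to the following. If the DESIRED and CURRENT values in the output don't match, then the DaemonSet can't place a pod on every node:

$ kubectl get daemonset aws-node kube-proxy -n kube-system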

Tip: You can search the Amazon EKS control plane logs for information on why the pods can't be scheduled.
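
Note: Control plane logs are available in Amazon CloudWatch Logs only if control plane logging is turned on for your cluster. For example, you can turn on the logs with an AWS Command Line Interface (AWS CLI) command similar to the following, where yourClusterName and yourRegion are placeholders:

$ aws eks update-cluster-config --region yourRegion --name yourClusterName --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'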

In some scenarios, a node can be in Unknown status. This means that the kubelet on the node can't communicate the node's current status to the control plane.

To troubleshoot nodes in Unknown status, complete the steps in the following sections:

  • Check the network configuration between nodes and the control plane
  • Check the status of the kubelet
  • Check that the Amazon EC2 API endpoint is reachable

Check the network configuration between nodes and the control plane

1.    Confirm that there are no network ACL rules on your subnets blocking traffic between the Amazon EKS control plane and your worker nodes.
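
For example, you can review the network ACL that's associated with a worker node subnet by running an AWS CLI command similar to the following, where yourSubnetId is a placeholder:

$ aws ec2 describe-network-acls --filters Name=association.subnet-id,Values=yourSubnetId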

2.    Confirm that the security groups for your control plane and nodes comply with minimum inbound and outbound requirements.
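
For example, you can list the security groups that are attached to your cluster, and then review the rules of each group, with AWS CLI commands similar to the following, where yourClusterName and yourSecurityGroupId are placeholders:

$ aws eks describe-cluster --name yourClusterName --query "cluster.resourcesVpcConfig.securityGroupIds"
$ aws ec2 describe-security-groups --group-ids yourSecurityGroupId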

3.    (Optional) If your nodes are configured to use a proxy, confirm that the proxy is allowing traffic to the API server endpoints.
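
For example, one way to check whether the kubelet on the node runs with proxy environment variables is to run a command similar to the following on the worker node. The exact location of your proxy configuration can vary depending on your setup:

$ sudo systemctl show kubelet --property=Environment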

4.    To verify that the node has access to the API server, run the following netcat command:

$ nc -vz 9FCF4EA77D81408ED82517B9B7E60D52.yl4.eu-north-1.eks.amazonaws.com 443
Connection to 9FCF4EA77D81408ED82517B9B7E60D52.yl4.eu-north-1.eks.amazonaws.com 443 port [tcp/https] succeeded!
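
The endpoint in the preceding example is only an example. You can look up your own cluster's API server endpoint with an AWS CLI command similar to the following, where yourClusterName is a placeholder. Remove the https:// prefix from the returned value before you use it with nc:

$ aws eks describe-cluster --name yourClusterName --query "cluster.endpoint" --output text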

Check the status of the kubelet

1.    Use SSH to connect to the affected worker node.

2.    To check the kubelet logs, run the following command:

$ journalctl -u kubelet > kubelet.log

Note: The kubelet.log file contains information on kubelet operations that can help you find the root cause of the node status issue.
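
For example, you can search the saved log file for errors with a command similar to the following:

$ grep -i -E "error|fail" kubelet.log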

If the logs don't provide information on the source of the issue, then run the following command to check the status of the kubelet on the worker node:

$ sudo systemctl status kubelet
  kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-eksclt.al2.conf
   Active: inactive (dead) since Wed 2019-12-04 08:57:33 UTC; 40s ago

If the kubelet isn't in the active (running) state (for example, the preceding output shows inactive (dead)), then run the following command to restart the kubelet:

$ sudo systemctl restart kubelet
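
After the restart, you can confirm that the kubelet is active and then, from a machine with kubectl access to the cluster, watch for the node to return to Ready status. For example:

$ sudo systemctl status kubelet
$ kubectl get nodes --watch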

Check that the Amazon EC2 API endpoint is reachable

1.    Use SSH to connect to one of the worker nodes.

2.    To check if the Amazon Elastic Compute Cloud (Amazon EC2) API endpoint for your AWS Region is reachable, run the following command:

$ nc -vz ec2.<region>.amazonaws.com 443
Connection to ec2.us-east-1.amazonaws.com 443 port [tcp/https] succeeded!

Note: Replace <region> with the AWS Region where your worker node is located. The preceding output shows an example for the us-east-1 Region.

