How do I troubleshoot DNS failures with Amazon EKS?

Last updated: 2020-02-21

The applications or pods that use CoreDNS in my Amazon Elastic Kubernetes Service (Amazon EKS) cluster are failing to resolve internal or external DNS names.

Short description

Pods running inside the Amazon EKS cluster use the CoreDNS service's cluster IP as the default name server for querying internal and external DNS records. Applications can fail DNS resolutions if there are any issues with the CoreDNS pods, the service configuration, or connectivity.

The CoreDNS pods are abstracted by a service object called kube-dns. To troubleshoot issues with your CoreDNS pods, verify that all the components of the kube-dns service are working. These components include, but aren't limited to, the service's endpoints and the iptables rules that kube-proxy creates.

Resolution

The following resolution assumes a CoreDNS ClusterIP of 10.100.0.10. Replace it with the value from your own cluster.

1.    To get the ClusterIP of your CoreDNS service, run the following command:

kubectl get service kube-dns -n kube-system

2.    To verify that DNS endpoints are exposed and pointing to CoreDNS pods, run the following command:

kubectl -n kube-system get endpoints kube-dns

You should see output similar to the following:

NAME       ENDPOINTS                                                        AGE
kube-dns   192.168.2.218:53,192.168.3.117:53,192.168.2.218:53 + 1 more...   90d

Note: If the endpoint list is empty, check the status of the CoreDNS pods:

kubectl get pods -n kube-system -l k8s-app=kube-dns

3.    Verify that the pods aren't blocked by a security group or network access control list (network ACL) when communicating with CoreDNS.

For more information, see Why won't my pods connect to other pods in Amazon EKS?

Verify that the kube-proxy pod is working

To verify that the kube-proxy pod has access to the API servers for your EKS cluster, check its logs for timeout errors to the control plane or HTTP 403 authorization errors.

To get the kube-proxy logs, run the following command:

kubectl logs -n kube-system --selector 'k8s-app=kube-proxy'

Note: The kube-proxy gets the endpoints from the control plane and creates the iptables rules on every worker node.
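On a worker node, you can confirm that the iptables rules for the kube-dns service exist by filtering the ruleset for its ClusterIP. The sketch below runs the filter against an assumed sample of `iptables-save` output (the chain names are illustrative, not your cluster's actual values); on a real worker node you would pipe the live output through the same grep:

```shell
# Illustrative check: filter iptables rules for the kube-dns ClusterIP.
# On a real worker node you would run: sudo iptables-save | grep 10.100.0.10
# The sample rules below are assumed output for demonstration only.
sample_rules='-A KUBE-SERVICES -d 10.100.0.10/32 -p udp -m udp --dport 53 -j KUBE-SVC-DNS-UDP
-A KUBE-SERVICES -d 10.100.0.10/32 -p tcp -m tcp --dport 53 -j KUBE-SVC-DNS-TCP'
printf '%s\n' "$sample_rules" | grep -c '10.100.0.10'   # counts matching rules
```

If the equivalent grep on a real node returns no rules for the ClusterIP, kube-proxy isn't programming the node, which explains DNS failures from pods on that node.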

Connect to the application pod to troubleshoot the DNS issue

1.    To run commands inside your application pods, run the following command to access a shell inside the running pod:

kubectl exec -it your-pod-name -- sh

If you receive an error similar to the following, your application pod might not have a shell binary available:

OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused "exec: \"sh\": executable file not found in $PATH": unknown
command terminated with exit code 126

To debug, update the image in your manifest file to another image that includes a shell (such as the busybox image).

2.    To verify that the cluster IP of the kube-dns service is in your pod's /etc/resolv.conf, run the following command in the shell inside of the pod:

cat /etc/resolv.conf

The following example resolv.conf shows a pod that's configured to point at 10.100.0.10 for DNS requests. The IP should match the ClusterIP of your kube-dns service.

nameserver 10.100.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5

Note: You can manage your pod's DNS configuration with the dnsPolicy field in the pod specification. If this field isn't populated, then the ClusterFirst DNS policy is used by default.
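As a quick sanity check, you can compare the nameserver entry from the pod's /etc/resolv.conf against the ClusterIP that you retrieved earlier. The following sketch operates on the example resolv.conf contents shown above; inside a pod you would read the file itself instead of a variable:

```shell
# Sketch: extract the nameserver from the example resolv.conf above and
# compare it to the kube-dns ClusterIP (10.100.0.10 in this article).
resolv_conf='nameserver 10.100.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5'
ns=$(printf '%s\n' "$resolv_conf" | awk '/^nameserver/ {print $2; exit}')
if [ "$ns" = "10.100.0.10" ]; then
  echo "resolv.conf points at the kube-dns ClusterIP"
else
  echo "unexpected nameserver: $ns"
fi
```

A mismatch usually means the pod uses a non-default dnsPolicy or a custom dnsConfig.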

3.    To verify that your pod can resolve an internal domain name using the default ClusterIP, run the following command in the shell inside the pod:

nslookup kubernetes 10.100.0.10

You should see output similar to the following:

Server:     10.100.0.10
Address:    10.100.0.10#53
Name:       kubernetes.default.svc.cluster.local
Address:    10.100.0.1

4.    To verify that your pod can resolve an external domain name using the default ClusterIP, run the following command in the shell inside the pod:

nslookup amazon.com 10.100.0.10

You should see output similar to the following:

Server:     10.100.0.10
Address:    10.100.0.10#53
Non-authoritative answer:
Name:   amazon.com
Address: 176.32.98.166
Name:    amazon.com
Address: 205.251.242.103
Name:    amazon.com
Address: 176.32.103.205

5.    To verify that your pod can resolve domain names using the IP address of a CoreDNS pod directly, run the following commands in the shell inside the pod:

nslookup kubernetes COREDNS_POD_IP
nslookup amazon.com COREDNS_POD_IP

Note: Replace COREDNS_POD_IP with one of the endpoint IP addresses from the kubectl get endpoints output that you retrieved earlier.

Get more detailed logs from CoreDNS pods for debugging

1.    To enable debug logging for CoreDNS, add the log plugin to the CoreDNS ConfigMap. First, open the ConfigMap for editing:

kubectl -n kube-system edit configmap coredns

2.    In the editor that opens, add the log plugin to the Corefile. See the following example:

kind: ConfigMap
apiVersion: v1
data:
  Corefile: |
    .:53 {
        log    # Enabling CoreDNS Logging
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          upstream
          fallthrough in-addr.arpa ip6.arpa
        }
        ...
...

Note: It can take several minutes for CoreDNS to reload the configuration. To apply the changes immediately, you can restart the pods one by one (for example, with kubectl rollout restart -n kube-system deployment coredns).

3.    To check whether the CoreDNS pods are receiving queries from the application pod, follow the logs:

kubectl logs --follow -n kube-system --selector 'k8s-app=kube-dns'

Search and ndots combination

The resolver uses the nameserver entry (usually the cluster IP of the kube-dns service) to send queries, and the search entry to complete a query name into a fully qualified domain name. The search list also includes the worker node's search domains (such as ec2.internal). The ndots value is the minimum number of dots that must appear in a name before the resolver tries it as an absolute query first; names with fewer dots are expanded with the search domains.

For example, with the default ndots value of 5, a name such as amazon.com isn't treated as fully qualified. The search domains, including those outside the internal cluster.local domain, are appended to the name before the absolute query is made.

See the following example with the /etc/resolv.conf setting of the application pod:

nameserver 10.100.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5

The resolver looks for five dots in the name being queried. If the pod makes a DNS resolution call for amazon.com (which has only one dot), your logs look similar to the following:

[INFO] 192.168.3.71:33238 - 36534 "A IN amazon.com.default.svc.cluster.local. udp 54 false 512" NXDOMAIN qr,aa,rd 147 0.000473434s
[INFO] 192.168.3.71:57098 - 43241 "A IN amazon.com.svc.cluster.local. udp 46 false 512" NXDOMAIN qr,aa,rd 139 0.000066171s
[INFO] 192.168.3.71:51937 - 15588 "A IN amazon.com.cluster.local. udp 42 false 512" NXDOMAIN qr,aa,rd 135 0.000137489s
[INFO] 192.168.3.71:52618 - 14916 "A IN amazon.com.ec2.internal. udp 41 false 512" NXDOMAIN qr,rd,ra 41 0.001248388s
[INFO] 192.168.3.71:51298 - 65181 "A IN amazon.com. udp 28 false 512" NOERROR qr,rd,ra 106 0.001711104s

Note: NXDOMAIN means that the domain record wasn't found, and NOERROR means that the domain record was found.
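A quick way to gauge how much search-list churn the application pod is generating is to count NXDOMAIN responses relative to NOERROR responses in the CoreDNS logs. This sketch runs the counts against simplified copies of the sample log lines above; in practice you would pipe the output of the kubectl logs command through the same grep commands:

```shell
# Count NXDOMAIN vs NOERROR responses in the sample CoreDNS log lines above.
logs='"A IN amazon.com.default.svc.cluster.local. udp 54 false 512" NXDOMAIN
"A IN amazon.com.svc.cluster.local. udp 46 false 512" NXDOMAIN
"A IN amazon.com.cluster.local. udp 42 false 512" NXDOMAIN
"A IN amazon.com.ec2.internal. udp 41 false 512" NXDOMAIN
"A IN amazon.com. udp 28 false 512" NOERROR'
printf '%s\n' "$logs" | grep -c 'NXDOMAIN'   # 4 failed search-list attempts
printf '%s\n' "$logs" | grep -c 'NOERROR'    # 1 successful absolute query
```

A high ratio of NXDOMAIN to NOERROR for external names is a strong hint that the search and ndots expansion described below is inflating your query volume.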

Each search domain is appended to amazon.com before the final call is made on the absolute domain. The final query name ends with a dot ( . ), which makes it a fully qualified domain name. This means that every external domain name query can generate four or five additional calls, which can overwhelm the CoreDNS pods.

To resolve this issue, either change ndots to 1 (so that the search list is skipped for any name that contains at least one dot), or append a dot ( . ) to the end of the domain that's queried to make it fully qualified:

nslookup example.com.
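The expansion behavior described above can be sketched as follows. This is a simplified model of the resolver's search-list logic, assuming the example resolv.conf from this article; it ignores details such as AAAA queries, so it's an illustration rather than the actual resolver implementation:

```shell
# Simplified model of resolver search-list expansion with ndots:5 and the
# search domains from the example resolv.conf above.
expand_query() {
  name=$1
  ndots=5
  search_domains='default.svc.cluster.local svc.cluster.local cluster.local ec2.internal'
  case "$name" in
    *.) printf '%s\n' "$name"; return ;;  # trailing dot: already fully qualified
  esac
  dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
  if [ "$((dots))" -lt "$ndots" ]; then
    # fewer dots than ndots: try every search domain first
    for domain in $search_domains; do
      printf '%s.%s\n' "$name" "$domain"
    done
  fi
  printf '%s.\n' "$name"  # final absolute query
}

expand_query amazon.com    # five queries, matching the log output above
expand_query amazon.com.   # one query: the trailing dot skips the search list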

Be aware of VPC resolver (AmazonProvidedDNS) limits

The VPC resolver can accept a maximum of 1024 packets per second per network interface. This is a hard limit. If more than one CoreDNS pod is on the same worker node, then the chances of hitting this limit for external domain queries are higher.
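To get a rough sense of the headroom, you can combine this limit with the ndots behavior described earlier. The numbers below are illustrative assumptions (about five A queries per external lookup with ndots:5, doubled if AAAA is also queried), not measured values:

```shell
# Back-of-the-envelope headroom estimate (illustrative assumptions):
# ~5 A queries per external lookup with ndots:5, doubled for AAAA queries.
pps_limit=1024           # VPC resolver hard limit per network interface
packets_per_lookup=10    # assumed packets generated per external lookup
echo $((pps_limit / packets_per_lookup))  # roughly 102 external lookups/sec per ENI
```

Under these assumptions, only around a hundred external lookups per second fit through one network interface, which is why spreading CoreDNS pods across nodes matters.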

To use podAntiAffinity rules to schedule CoreDNS pods on separate instances, add the following options to the CoreDNS deployment:

spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: k8s-app
            operator: In
            values:
            - kube-dns
        topologyKey: kubernetes.io/hostname
