How can I troubleshoot unhealthy Route 53 health checks?

Last updated: 2021-05-20

The Amazon Route 53 health checks I created are reporting as unhealthy. How can I troubleshoot and fix the issues?

Resolution

Note: If you receive errors when running AWS Command Line Interface (AWS CLI) commands, make sure that you’re using the most recent AWS CLI version.

First, determine the reason for the last health check failure in the AWS Management Console, or with the get-health-check-last-failure-reason command in the AWS CLI. Then, identify the health check type and complete the corresponding troubleshooting steps to find and fix the issue.

Note: Regardless of health check type, be sure to check the status of the "Invert health check status" option. If this option is set to "true", then Route 53 considers the health check unhealthy when the health checkers mark the health check as healthy, and vice versa.
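Both checks above can be scripted with the AWS CLI. The following is a minimal sketch, assuming that the AWS CLI is configured with permission to read Route 53 health checks. The function names and the --query expressions are illustrative choices, not part of the original procedure.

```shell
# Print the Region, checker IP address, and last status message
# reported by each Route 53 health checker for one health check.
last_failure_reason() {
  aws route53 get-health-check-last-failure-reason \
    --health-check-id "$1" \
    --query 'HealthCheckObservations[].[Region,IPAddress,StatusReport.Status]' \
    --output text
}

# Print whether "Invert health check status" is set (True or False).
is_inverted() {
  aws route53 get-health-check \
    --health-check-id "$1" \
    --query 'HealthCheck.HealthCheckConfig.Inverted' \
    --output text
}

# Usage (replace the placeholder with your health check ID):
#   last_failure_reason <health-check-id>
#   is_inverted <health-check-id>
```

If is_inverted prints True, read the status messages from last_failure_reason with the inversion in mind.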

Troubleshoot a health check that monitors an endpoint

Cause: This issue is indicated by the "The health checker could not establish a connection within the timeout limit." error message. This error is caused by a timeout that happens when health checkers attempt to establish a connection with the configured endpoint. The maximum time allowed to establish a connection differs based on the health check protocol (TCP, HTTP, or HTTPS):

  • For TCP health checks, the TCP connection between the health checkers and the endpoint must happen within ten seconds.
  • For HTTP and HTTPS health checks, the TCP connection between the health checkers and the endpoint must happen within four seconds. The endpoint must respond with a 2xx or 3xx HTTP status code within two seconds after establishing a connection. For more information, see How Amazon Route 53 determines whether a health check is healthy.

Solution:

1.    In the health check configuration, note the "Domain name" or "IP address" of the endpoint.

2.    Access the endpoint. Confirm that the firewall or server allows connections from the Route 53 public IP addresses for the Regions enabled in the health check configuration. See IP ranges and search for "service": "ROUTE53_HEALTHCHECKS". If the endpoint resources are hosted on AWS, configure security groups and network access control lists (network ACLs) to allow the IP addresses of the Route 53 health checkers.
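To list the health checker address ranges for step 2, you can filter the published IP ranges file. The following is a rough sketch that assumes the file keeps its one-field-per-line layout, with "region" listed before "service" in each entry. A JSON-aware tool such as jq is more robust if you have one available. The function name is hypothetical.

```shell
# Read ip-ranges.json on stdin and print the prefixes whose service is
# ROUTE53_HEALTHCHECKS. An optional first argument limits the output
# to one Region.
route53_checker_prefixes() {
  awk -F'"' -v want="${1:-}" '
    /"ip(v6)?_prefix"/ { prefix = $4 }   # remember the most recent prefix
    /"region"/         { region = $4 }   # and the Region it belongs to
    /"service"/ && $4 == "ROUTE53_HEALTHCHECKS" &&
      (want == "" || region == want)     { print prefix }
  '
}

# Usage:
#   curl -s https://ip-ranges.amazonaws.com/ip-ranges.json | route53_checker_prefixes us-east-1
```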

3.    Use the following tools to test connectivity with the configured endpoint over the internet. Be sure to replace the placeholders in the commands with your respective values.

TCP test:

$ telnet <domain name / IP address> <port>
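Telnet doesn't enforce the ten-second limit itself. As an alternative, the following sketch applies the limit explicitly. It assumes that bash (for its /dev/tcp redirection) and the coreutils timeout command are available, and the function name is hypothetical. The test runs from your own machine, so it only approximates what the Route 53 health checkers see.

```shell
# Try to open a TCP connection to host:port within a time limit
# (default 10 seconds, the limit for TCP health checks).
tcp_check() {
  local host=$1 port=$2 limit=${3:-10}
  if timeout "$limit" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
    echo "connected to $host:$port within ${limit}s"
  else
    echo "no connection to $host:$port within ${limit}s"
    return 1
  fi
}

# Usage:
#   tcp_check <domain name / IP address> <port>
```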

HTTP/HTTPS test:

$ curl -Ik -w "HTTPCode=%{http_code} TotalTime=%{time_total}\n" <http/https>://<domain-name/ip address>:<port>/<path> -so /dev/null

Compare the output of the previous tests with the timeout values for the health checks. Then, confirm that your application is responding within the respective timelines.

For example, if you run the following test:

curl -Ik -w "HTTPCode=%{http_code} TotalTime=%{time_total}\n" https://example.com -so /dev/null

Then the output is:

HTTPCode=200 TotalTime=0.001963

In this example, the Total Time to get responses with the HTTP status code 200 is 0.001963 seconds.

For HTTP and HTTPS health checks, the TCP connection must be established within four seconds, and the endpoint must respond with an HTTP status code within two seconds after connecting. The Total Time must therefore be no more than six seconds. A Total Time value higher than six seconds indicates that the endpoint is slow to respond, which causes the health check to fail. In these cases, check your endpoint to make sure that it responds within the timeout period.
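The six-second total can hide which of the two limits was exceeded. The following sketch checks each limit separately, assuming that you also capture curl's %{time_connect} value alongside %{time_total}. The function name is hypothetical. For HTTPS, the time after the TCP connection also includes the TLS handshake, so treat the result as an approximation.

```shell
# Compare curl timings against the HTTP/HTTPS health check limits.
#   $1 = TCP connect time in seconds (curl %{time_connect})
#   $2 = total time in seconds       (curl %{time_total})
check_http_timing() {
  awk -v c="$1" -v t="$2" 'BEGIN {
    if (c > 4)     { print "FAIL: TCP connection took longer than 4 seconds"; exit 1 }
    if (t - c > 2) { print "FAIL: response took longer than 2 seconds after connecting"; exit 1 }
    print "OK"
  }'
}

# Usage:
#   check_http_timing $(curl -Ik -so /dev/null -w "%{time_connect} %{time_total}" https://example.com)
```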

If your output from the test commands shows an HTTPCode that isn't a 2xx or 3xx status code, then check the following configurations:

  • Firewall rules
  • Security groups
  • Network ACLs

When checking these configurations, confirm that your endpoint allows connections from Route 53 public IP addresses.

4.    If enabled, use the Latency graph option in the health check configuration to check the metrics graph for:

  • TCP Connection Time
  • Time to first byte
  • Time to complete SSL handshake

For more information, see Monitoring the latency between health checkers and your endpoint.

Note:

  • If the Latency graph option isn't enabled, you can't turn it on by editing the existing health check. Instead, you must create a new health check with the option enabled.
  • If the Elastic IP address of the endpoint that you're monitoring is released or updated, the health check might fail.

Cause: This issue is indicated by the "SSL alert handshake_failure" error message.

Solution:

This error indicates that SSL or TLS negotiation with the endpoint failed. When you enable SNI (HTTPS Only), Route 53 sends the hostname in the "client_hello" message to the endpoint during TLS negotiation. This action allows the endpoint to respond to the HTTPS request with the applicable SSL or TLS certificate.

If the hostname that you're monitoring isn't part of the common name in the endpoint's SSL or TLS certificate, then you receive an "SSL alert handshake_failure" error message.

Note: To enable SNI, the monitored endpoint must support SNI.
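To see which certificate the endpoint presents for a given SNI hostname, you can reproduce the negotiation with OpenSSL. The following is a sketch that assumes OpenSSL 1.1.1 or later (for the x509 -ext option). The function names are hypothetical.

```shell
# Connect with the given SNI hostname and print the raw s_client output.
#   $1 = host, $2 = port (default 443), $3 = SNI hostname (default: host)
fetch_cert() {
  local host=$1 port=${2:-443} sni=${3:-$1}
  openssl s_client -connect "${host}:${port}" -servername "$sni" </dev/null 2>/dev/null
}

# Print the subject and subject alternative names of the presented certificate.
cert_names() {
  fetch_cert "$@" | openssl x509 -noout -subject -ext subjectAltName
}

# Usage:
#   cert_names <domain-name> 443
```

If the hostname configured in the health check doesn't appear in the names that cert_names prints, the endpoint is likely presenting a certificate for a different host.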

Troubleshoot health checks with the string match condition

Cause: This issue is indicated when the endpoint server returns "200 OK", but Route 53 marks the health check as unhealthy. Health checkers must establish a TCP connection with the endpoint within four seconds. Health checkers must then receive an HTTP status code of 2xx or 3xx in the next two seconds. Then, the configured string must appear in the first 5,120 bytes of the response body within the next two seconds. If the string isn't present in the first 5,120 bytes, then Route 53 marks the health check as unhealthy.

Solution:

To verify whether the string appears entirely in the first 5,120 bytes of the response body, use the following command. Be sure to replace the placeholders, including <search-string>, with your values.

$ curl -sL <http/https>://<domain-name>:<port> | head -c 5120 | grep "<search-string>"
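The same check can be wrapped in a small function that returns a pass or fail exit status. This is a minimal sketch with a hypothetical function name. It treats the search string as a literal, case-sensitive string, which is an assumption about how you want to match.

```shell
# Succeed only if the literal string $1 appears entirely within the
# first 5,120 bytes of the response body read from stdin.
string_in_first_5120_bytes() {
  head -c 5120 | grep -qF -- "$1"
}

# Usage:
#   curl -s https://example.com/health | string_in_first_5120_bytes "OK" \
#     && echo "string found" || echo "string not in first 5,120 bytes"
```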

Troubleshoot a health check that monitors a CloudWatch alarm

Cause: Route 53 doesn't wait for the Amazon CloudWatch alarm to go into the ALARM state. Route 53 monitors the metric data stream instead of the state of the CloudWatch alarm.

Solution:

1.    Verify how the health check is configured to behave when the alarm is in the "INSUFFICIENT_DATA" state. If the metric data stream provides insufficient information to determine the state of the alarm, then the health check status depends on the "InsufficientDataHealthStatus" configuration. The status options for the "InsufficientDataHealthStatus" setting are "healthy", "unhealthy", or "last known status".

2.    When you update the configuration of a CloudWatch alarm, the new settings don't automatically appear in the associated health check. To synchronize the health check configuration with the updated CloudWatch alarm's configuration:

  • In the Route 53 console, choose Health Checks.
  • Select the health check, and then choose Synchronize configuration.
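To review the settings from the steps above without the console, you can read them from the health check with the AWS CLI. The following is a sketch that assumes the AWS CLI is configured with permission to read Route 53 health checks. The function names are hypothetical.

```shell
# Print the InsufficientDataHealthStatus setting of a health check.
insufficient_data_status() {
  aws route53 get-health-check \
    --health-check-id "$1" \
    --query 'HealthCheck.HealthCheckConfig.InsufficientDataHealthStatus' \
    --output text
}

# Print the CloudWatch alarm settings that the health check currently uses,
# for comparison with the alarm's own configuration.
alarm_config() {
  aws route53 get-health-check \
    --health-check-id "$1" \
    --query 'HealthCheck.CloudWatchAlarmConfiguration' \
    --output json
}

# Usage (replace the placeholder with your health check ID):
#   insufficient_data_status <health-check-id>
#   alarm_config <health-check-id>
```

If the values that alarm_config prints differ from the alarm's current settings in CloudWatch, the health check configuration is out of date and needs to be synchronized.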
