How to achieve DNS high availability with Route 53 Resolver endpoints

This post assumes a certain level of technical knowledge, including familiarity with DNS terminology, Wireshark, and Amazon Route 53 Resolver endpoints.

Introduction

The Domain Name System (DNS) is a critical service underpinning nearly the entire internet. As nearly every application begins with DNS resolution, a highly available and performant DNS architecture is crucial for application availability. To ensure high availability, DNS services must be resilient against component failures and network connectivity issues.

Amazon Web Services (AWS) customers operate infrastructures with varying levels of complexity and scale, including hybrid environments that encompass multi-cloud and integration with traditional data centers. Amazon’s Chief Technology Officer, Werner Vogels, famously said, “Everything fails, all the time.” This underscores the importance of anticipating potential failures and ensuring that applications can resolve DNS names even during Availability Zone failures. This post focuses on achieving DNS resilience using Route 53 Resolver endpoints, providing examples and considerations for their use in hybrid environments to enhance DNS reliability for business-critical applications.

Route 53 Resolver endpoints overview

Route 53 Resolver is a service of Route 53 that provides DNS resolution for resources running within an Amazon Virtual Private Cloud (VPC), so that they can reach public resources on the internet or any resources within the AWS network. If you have a custom DNS service running outside of the AWS Cloud, Amazon Route 53 offers Route 53 Resolver endpoints to establish hybrid DNS resolution. Outbound endpoints enable AWS resources to send DNS queries from the VPC to external networks or another VPC. In contrast, inbound endpoints enable DNS queries from external networks or other VPCs to be received in your VPC. By using Resolver rules, you can forward DNS requests to the appropriate target for resolution, ensuring seamless DNS resolution across different platforms and environments.

Outbound endpoint resiliency

Before delving into the resiliency of outbound endpoints, let’s first discuss the overall DNS query evaluation flow when a DNS query is sent from VPC resources through these endpoints. As illustrated in Figure 1, when the Route 53 Resolver receives a DNS query, it first evaluates the presence of any Resolver rules attached to the VPC. If the DNS query matches an existing rule to forward to the outbound endpoint, the request is forwarded through the outbound endpoint to the external DNS service.

Figure 1: Outbound endpoint DNS query resolution flow
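
The rule evaluation step in this flow can be sketched in a few lines of Python. This is an illustrative model only, not the Resolver implementation: the rule table and the pick_rule helper are hypothetical, and the sketch assumes that when several forwarding rules match a query, the rule with the most specific domain name is chosen.

```python
from typing import Optional

# Hypothetical rule table: rule domain name -> where matching queries are sent.
RULES = {
    "example-onprem.com": "outbound-endpoint",
}


def pick_rule(query_name: str, rules: dict) -> Optional[str]:
    """Return the most specific rule domain that matches query_name, or None."""
    name = query_name.rstrip(".").lower()
    labels = name.split(".")
    # Try the full name first, then drop the leftmost label one at a time,
    # so longer (more specific) rule domains are found before shorter ones.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in rules:
            return candidate
    return None


matched = pick_rule("server.example-onprem.com", RULES)
unmatched = pick_rule("aws.amazon.com", RULES)
```

Here matched is the rule domain whose target receives the query, while unmatched is None, meaning the query would not be forwarded through the outbound endpoint.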

Now, let’s dive deep into the outbound endpoint resiliency aspect. I have a domain, example-onprem.com, hosted on two of my on-premises DNS servers that are authoritative for the domain. Critical applications within the VPC need to connect to these on-premises applications using the name server.example-onprem.com.

Figure 2: Outbound endpoint architecture with three Availability Zones

As illustrated in Figure 2, for maximum resiliency, I have created outbound endpoints in three different Availability Zones and associated the Resolver rules as shown in Figures 3 and 4. Next, I ensured that the network layer was established and that the application could resolve the DNS name server.example-onprem.com. Upon receiving the DNS requests from the application in the workload subnet for server.example-onprem.com [1], the Route 53 Resolver first evaluates the associated rules [2], then duplicates the DNS queries for redundancy and forwards them to any two of the three outbound endpoint interfaces (192.168.71.24, 192.168.72.120, and 192.168.73.90) [3]. Subsequently, the outbound endpoints forward these queries to the on-premises DNS servers 10.4.1.227 and 10.4.1.118 [4] according to the rule definition.

Figure 3: Outbound endpoints with three elastic network interfaces across three Availability Zones for resiliency

Figure 4: Resolver rule with two target DNS resolver IP addresses
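
For readers who prefer to script this configuration, the following sketch builds the request parameters for a forwarding rule like the one in Figure 4. It is a sketch under stated assumptions: the forward_rule_params helper is hypothetical, the endpoint ID is a placeholder, and the parameter names follow the boto3 route53resolver create_resolver_rule call as I understand it, so verify them against the current API reference before use.

```python
import uuid


def forward_rule_params(domain, endpoint_id, target_ips, port=53):
    """Build keyword arguments for a FORWARD Resolver rule."""
    if len(target_ips) < 2:
        raise ValueError("provide at least two target IPs for redundancy")
    return {
        "CreatorRequestId": str(uuid.uuid4()),  # unique token for this request
        "Name": domain.replace(".", "-"),
        "RuleType": "FORWARD",
        "DomainName": domain,
        "ResolverEndpointId": endpoint_id,
        "TargetIps": [{"Ip": ip, "Port": port} for ip in target_ips],
    }


params = forward_rule_params(
    "example-onprem.com",
    "rslvr-out-EXAMPLE",  # hypothetical outbound endpoint ID
    ["10.4.1.227", "10.4.1.118"],  # on-premises DNS servers from this post
)
# With AWS credentials configured, the call would look like:
# boto3.client("route53resolver").create_resolver_rule(**params)
```

The helper refuses a single target IP on purpose, so a rule without a redundant target cannot be built by accident.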

The next step is to run an experiment to simulate an Availability Zone failure using the AWS Fault Injection Service (FIS). While running this experiment, I used Wireshark to capture the DNS query packets. The application server initiated multiple DNS queries for the name server.example-onprem.com with AmazonProvidedDNS, which is Route 53 Resolver, as shown in Figure 5.

Figure 5: Wireshark capture of application server sending multiple DNS queries to Route 53 Resolvers (VPC DNS)

These DNS queries are sent to the on-premises DNS servers through outbound endpoints, as depicted in Figures 6 and 7. It’s important to note that I used a continuous DNS query script to generate traffic for this demonstration. Additionally, the DNS record I used had a time to live (TTL) value of zero. If you have DNS caching enabled, you may not observe a significant volume of packets reaching your on-premises DNS servers.

Figure 6: Wireshark capture of DNS Server 1 hosting example-onprem.com and receiving DNS requests from all outbound Resolver endpoints

Figure 7: Wireshark capture of DNS Server 2 hosting example-onprem.com and receiving DNS requests from all outbound Resolver endpoints
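
The continuous DNS query script used to generate this traffic is not shown in this post. A minimal stand-in, assuming only the Python standard library, could look like the following. The probe function and its parameters are hypothetical names chosen for illustration, and the resolve argument exists so that the loop can be exercised without a live DNS server.

```python
import socket
import time


def probe(name, count, interval=1.0, resolve=socket.gethostbyname):
    """Resolve name `count` times and tally successes and failures."""
    results = {"ok": 0, "failed": 0}
    for attempt in range(count):
        try:
            resolve(name)
            results["ok"] += 1
        except OSError:  # socket.gaierror and timeouts are OSError subclasses
            results["failed"] += 1
        if attempt < count - 1:
            time.sleep(interval)  # pause between queries, not after the last
    return results


if __name__ == "__main__":
    # Run from an instance in the workload subnet during the experiment.
    print(probe("server.example-onprem.com", count=60, interval=1.0))
```

Note that socket.gethostbyname goes through the operating system resolver, so any local caching layer on the host can hide queries from the packet capture even when the record TTL is zero.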

In the next step, I ran the FIS Availability Zone power failure scenario against one of the Availability Zones. During my experiment, FIS simulated a network interruption in the Availability Zone hosting the outbound endpoint IP address 192.168.71.24. Despite the interruption, the application server did not experience any DNS timeouts. This is because redundancy is built into the Resolver, and the queries were still forwarded to the on-premises DNS servers from the endpoint IP addresses in the active Availability Zones. The destination server’s Wireshark capture in Figure 8 shows that the DNS server received requests from the active outbound endpoint IP addresses 192.168.72.120 and 192.168.73.90. Once the FIS simulation concluded, DNS queries resumed from all three endpoints.

Figure 8: Wireshark capture of on-premises DNS servers with the highlighted packet when the DNS queries resumed from the endpoint in the failed Availability Zone simulation using FIS

This experiment demonstrates that configuring outbound Resolver endpoints across multiple Availability Zones ensures DNS resiliency for applications hosted on AWS.

Inbound endpoint resiliency

Similarly, for applications running on on-premises or third-party cloud infrastructure, DNS resiliency is just as crucial when your DNS domains are hosted on the AWS Cloud. As illustrated in Figure 9, when a DNS query is received at inbound Resolver endpoints, it forwards the query to Route 53 Resolver for further evaluation. You then need to configure the on-premises DNS resolver to forward the queries to the inbound Resolver endpoints’ IP addresses. Now let’s dive deep into the resiliency aspect.

Figure 9: Inbound endpoint DNS query resolution flow

As shown in Figure 10, the domain example-private-dns.com is a private hosted zone attached to the VPC in which the inbound endpoints are created. It is essential that the private hosted zone is associated with the VPC where the inbound endpoints are located. Otherwise, name resolution will fail.

Figure 10: VPC private hosted zone associated with the VPC where inbound endpoints are configured

Next, an inbound endpoint is created in three Availability Zones for maximum resiliency, as depicted in Figure 11.

Figure 11: Inbound endpoint configuration with three IP addresses across multiple Availability Zones for high availability

Next, the on-premises DNS resolver is configured to forward the DNS queries for the domain example-private-dns.com to the three inbound Resolver endpoint IP addresses, as shown in Figure 12. You need to configure similar conditional forwarders for any other domains hosted on AWS, or you can configure a general DNS forwarder that sends all queries to the inbound Resolver endpoints.

Figure 12: On-premises DNS resolver conditional forwarder to forward DNS queries to inbound Resolver endpoint IP addresses

The on-premises application servers send DNS queries for aws-app.example-private-dns.com to their local DNS resolver [1]. As illustrated in Figure 13 and based on the configured DNS forwarders, the query is then sent to the first inbound endpoint IP address in the list, 192.168.71.252 [2]. The behavior of third-party DNS resolvers may vary, so refer to the vendor documentation. The inbound endpoint then forwards the DNS queries to VPC DNS [3], which resolves the name against the private hosted zone [4].

Figure 13: Inbound endpoints designed with the redundancy of three Availability Zones

Next, I used AWS FIS to create an AZ-level power failure scenario for 2 minutes in the Availability Zone that the on-premises DNS server was sending requests to. Each DNS service provider may have its unique resolution behavior, typically involving timeout settings defined in their configuration (5 seconds in this test). They are designed to recover quickly and connect to the next available DNS forwarder in the list. Once the affected Availability Zone becomes active again, you will observe that DNS queries resume, including those directed to the previously failed inbound endpoint IP address, as depicted in Figures 14 and 15.

Figure 14: Wireshark packets with DNS queries failing over to redundant IP addresses during the Availability Zone simulation failure

Figure 15: Wireshark packets with DNS queries to the original IP address resumed after the Availability Zone is active
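
The failover behavior observed in this test can be modeled as an ordered list of forwarders with a per-forwarder timeout. The sketch below is a simplified model, not the logic of any particular DNS product: the forward and make_send helpers are hypothetical, the second and third IP addresses are placeholders (only 192.168.71.252 appears in this post), and the send callable stands in for a real DNS query so the example runs without a network.

```python
# First address is from this post; the other two are hypothetical placeholders.
FORWARDERS = ["192.168.71.252", "192.168.72.10", "192.168.73.10"]


def forward(query, forwarders, send, timeout=5.0):
    """Try each forwarder in order; return (ip, answer) from the first reply."""
    for ip in forwarders:
        try:
            return ip, send(query, ip, timeout)
        except TimeoutError:
            continue  # this forwarder did not answer in time; try the next one
    raise TimeoutError("no forwarder answered")


def make_send(down):
    """Build a fake sender in which the IPs listed in `down` always time out."""
    def send(query, ip, timeout):
        if ip in down:
            raise TimeoutError(ip)
        return f"answer for {query}"
    return send


name = "aws-app.example-private-dns.com"
during_failure = forward(name, FORWARDERS, make_send({"192.168.71.252"}))
after_recovery = forward(name, FORWARDERS, make_send(set()))
```

In this model, during_failure is answered by the second forwarder in the list, and after_recovery goes back to the first one, which mirrors the sequence shown in Figures 14 and 15.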

This experiment demonstrates that configuring inbound Resolver endpoints across multiple Availability Zones ensures DNS resiliency for on-premises applications that resolve DNS names hosted on AWS. You need to ensure the on-premises DNS resolver is configured to forward the queries to all inbound Resolver endpoint IP addresses.

Resolver endpoint performance and adding additional ENIs

Each Resolver endpoint IP address can process up to 10,000 queries per second (QPS) over UDP, but the QPS per network interface is significantly lower for DNS over HTTPS (DoH). Factors such as query type, response size, target name server health, query response times, and round-trip latency influence the actual QPS rate. To ensure high availability, Route 53 Resolver generates multiple redundant outbound queries. Any slow-responding target name servers can also reduce the capacity of the Resolver endpoints. You should monitor the CloudWatch metrics InboundQueryVolume, OutboundQueryVolume, and OutboundQueryAggregateVolume for the Resolver endpoints and for each endpoint IP address. If the maximum QPS exceeds 50 percent of the capacity for any Resolver endpoint IP address, the recommendation is to add an endpoint network interface for better performance.
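
The 50 percent guideline is easy to apply once you have peak QPS figures per endpoint IP address. The following sketch assumes you have already extracted those peaks from CloudWatch; the ips_needing_capacity helper and the sample numbers are hypothetical.

```python
CAPACITY_QPS = 10_000  # per endpoint IP address over UDP, as stated above
THRESHOLD = 0.5        # add an interface above 50 percent of capacity


def ips_needing_capacity(peak_qps_by_ip, capacity=CAPACITY_QPS, threshold=THRESHOLD):
    """Return the endpoint IPs whose peak QPS exceeds threshold * capacity."""
    limit = capacity * threshold
    return sorted(ip for ip, qps in peak_qps_by_ip.items() if qps > limit)


# Hypothetical peak values for the three outbound endpoint IPs in this post.
peaks = {
    "192.168.71.24": 6_200,
    "192.168.72.120": 4_100,
    "192.168.73.90": 5_000,
}
flagged = ips_needing_capacity(peaks)
```

With these sample values only 192.168.71.24 is flagged, because 5,000 QPS sits exactly at the limit rather than above it. For DoH or connection-tracked interfaces, pass a lower capacity value.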

Connection tracking is a networking concept where a networking device, like a firewall, router, or NAT device, needs to track and maintain information about the state of IP traffic going through it. When creating Resolver endpoints, you can add security groups to restrict DNS traffic to a specific source or protocol. When restrictive security groups are added to endpoint network interfaces, the connections are tracked, which can reduce the rate to as low as 1,500 QPS. In such scenarios, for better performance and high availability, you need to add network interfaces per endpoint in each Availability Zone or configure the security groups to avoid connection tracking, as described in the documentation on untracked connections. For additional information, read Resolver endpoint scaling and the Networking & Content Delivery Blog post Using connection tracking improvements to increase network performance.

Adding Resolver endpoint network interfaces increases reliability and performance, and the QPS capacity scales linearly with the number of interfaces. However, you must plan for times of reduced capacity, such as planned maintenance or unplanned incidents. Use the formula (n−1) * 10,000, where n is the number of endpoint IP addresses, to estimate the best-case QPS rate with one IP address unavailable. For example, six endpoint IP addresses can support up to (6−1) * 10,000 = 50,000 QPS.
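
The (n−1) formula can be wrapped in two small helpers, one for the capacity of a given configuration and one for the inverse question of how many interfaces a target load needs. This is a sketch: the function names are hypothetical, and n is taken to be the number of endpoint network interfaces, each rated at 10,000 QPS unless you pass a different figure.

```python
def planned_capacity_qps(interfaces, per_interface_qps=10_000, reserve=1):
    """Best-case QPS with `reserve` interfaces assumed unavailable."""
    if interfaces <= reserve:
        raise ValueError("need more interfaces than the reserve")
    return (interfaces - reserve) * per_interface_qps


def interfaces_needed(target_qps, per_interface_qps=10_000, reserve=1):
    """Smallest interface count whose planned capacity covers target_qps."""
    working = -(-target_qps // per_interface_qps)  # ceiling division
    return working + reserve


six = planned_capacity_qps(6)                                # 50000
six_tracked = planned_capacity_qps(6, per_interface_qps=1_500)  # 7500
```

The second call shows how sharply the same six interfaces drop when connection tracking limits each one to 1,500 QPS.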

Considerations to design for high availability

This section provides different criteria to consider when designing highly resilient DNS architectures with Route 53 Resolver endpoints.

DNS caching
Resolver endpoints cache DNS responses according to the TTL configured on each record. For better query performance, set record TTLs to the highest values that your applications can tolerate.

Network connectivity
The Route 53 Resolver endpoints are placed in subnets and rely on subnet routing to connect to DNS networks. The recommendation is to use separate /28 or /27 subnets for Resolver endpoints with their own route table. You must also ensure resilient IPv4 and IPv6 network layer connectivity using AWS Site-to-Site VPN, AWS Direct Connect, or third-party virtual private network (VPN) solutions. For DNS routing between multiple VPCs through Resolver endpoints, use AWS Transit Gateway or Amazon VPC peering. This applies to single or multi-Region architectures across multiple AWS accounts.

Third-party DNS resolvers
You must make sure to configure forwarder rules on the third-party DNS resolvers to forward DNS queries to multiple inbound Resolver endpoints.

Route 53 Resolver (VPC DNS)
Route 53 Resolver (VPC DNS) is highly available by design, so you don’t need to plan for its resiliency. Ensure your VPC Dynamic Host Configuration Protocol (DHCP) option set uses AmazonProvidedDNS as the preferred DNS server.

Resolver endpoint quotas
By default, you can add up to six IP addresses per endpoint, and you can request a higher limit using Service Quotas.

Resolver rules
For redundancy, you must provide more than one target DNS server IP address inside each outbound Resolver endpoint rule. You can add up to six target IP addresses.

Sharing of Resolver rules
If you have a centralized networking architecture with Resolver endpoints created in one account and resources in a shared services account, then you can share the Resolver rules across your organization accounts using AWS Resource Access Manager (AWS RAM) to use the already created highly available endpoints.

Configuring Resolver endpoints
To ensure redundancy, a Resolver endpoint requires a minimum of two IP addresses, each backed by an elastic network interface in a subnet that you select. In Regions with three or more Availability Zones, at least three endpoint IP addresses across Availability Zones are recommended for mission-critical applications. Resolver endpoints are also Regional, so for multi-Region applications, create separate Resolver endpoints and Resolver rules and ensure network connectivity in each Region.

Service level agreement (SLA)
Multi-AZ Resolver endpoint configuration offers higher SLA than single-AZ deployments. For additional information, refer to Amazon Route 53 Resolver Endpoints Service Level Agreement.

Finding nonresilient Resolver endpoints
You can use the Amazon Route 53 Resolver Endpoint Availability Zone Redundancy Trusted Advisor check to identify nonresilient Resolver endpoints across your AWS Organizations accounts.
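
If you keep an inventory of endpoint IP addresses and their Availability Zones, the same kind of redundancy review can be approximated offline. The sketch below is not the Trusted Advisor check itself: the nonresilient_endpoints helper, the endpoint IDs, and the Availability Zone names are all hypothetical.

```python
def nonresilient_endpoints(endpoints, minimum_azs=2):
    """Return IDs of endpoints whose IPs span fewer than minimum_azs AZs."""
    return sorted(
        endpoint_id
        for endpoint_id, ip_to_az in endpoints.items()
        if len(set(ip_to_az.values())) < minimum_azs
    )


# Hypothetical inventory: endpoint ID -> {IP address: Availability Zone}.
inventory = {
    "rslvr-out-EXAMPLE": {
        "192.168.71.24": "az-a",
        "192.168.72.120": "az-b",
        "192.168.73.90": "az-c",
    },
    "rslvr-in-SINGLEAZ": {
        "192.168.71.10": "az-a",
        "192.168.71.11": "az-a",
    },
}
at_risk = nonresilient_endpoints(inventory)
```

The second endpoint has two IP addresses but both sit in the same Availability Zone, so it is the only one reported. Raising minimum_azs to 3 applies the stricter recommendation for mission-critical applications.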

Conclusion

AWS emphasizes the importance of planning for high availability, especially for business-critical workloads. When designing a DNS infrastructure on AWS, it’s important to note that the number of Resolver endpoints should not be solely determined by utilization metrics. Instead, the focus should be on creating a highly available DNS environment that can withstand failures and ensure uninterrupted service.

This post offered a technical overview of outbound and inbound endpoint resiliency, and you have learned practical methods to test and achieve DNS resiliency using Resolver endpoints. Additionally, you have gained insights into the behavior and response of outbound and inbound endpoints during Availability Zone–specific outages. This post also highlighted various considerations for designing a highly available DNS environment using Route 53 Resolver endpoints.

You can safely configure Resolver endpoints across multiple Availability Zones without incurring additional costs for cross-AZ DNS queries facilitated through VPC endpoints. However, you will be charged for the number of Resolver endpoints you create. For information about Route 53 Resolver endpoint pricing, refer to the Amazon Route 53 pricing page and use the AWS Pricing Calculator to create a customized cost estimate. Additionally, you can explore cost optimization strategies by referring to the guide on optimizing costs for Route 53 Resolver endpoints.

A correction was made on August 6, 2024: An earlier version of this post included diagrams with missing IP addresses. These diagrams have been updated.

About the Authors

Kartik Bheemisetty

Kartik Bheemisetty is a Sr. Technical Account Manager in the US-ISV segment, where he helps customers achieve their business goals with AWS cloud services. He holds subject matter expertise in AWS Network and Content Delivery services. He offers expert guidance on best practices, facilitates access to subject matter experts, and delivers actionable insights on optimizing AWS spend, workloads, and events. You can connect with him on LinkedIn.

Randy Weinstein

Randy Weinstein is a Sr. Solutions Architect at Amazon Web Services. With broad experience across multiple areas of technology, Randy enjoys designing and building software defined infrastructures that underpin complex business systems.
