Using load balancer target group health thresholds to improve availability
AWS recently added features to our Elastic Load Balancers (ELB) that give you control over when they take measures to shift traffic between targets. In this blog, we will explore these new capabilities and review how you can use them to improve the availability and resiliency of your applications.
Two types of Elastic Load Balancer, Network Load Balancer (NLB) and Application Load Balancer (ALB), provide you with an easy way to improve the availability and scalability of your workloads by automatically distributing incoming traffic across multiple targets, such as EC2 instances, containers, and IP addresses. For both Network and Application Load Balancers, load balancing occurs at two levels: Domain Name System (DNS) routing and request-level routing. DNS routing distributes traffic evenly across each configured Availability Zone (AZ) of your load balancer. Request routing happens when a client sends traffic to the IP address of a load balancer, and the load balancer distributes this traffic to a healthy target. To fail over traffic across AZs, your load balancer configuration requires two or more active AZs (each AZ is a fully isolated partition of AWS infrastructure made up of multiple data centers). We recommend that you deploy your workloads across two or more AZs, according to our reliability best practices in the AWS Well-Architected Framework.
Overview of actions when targets fail health checks
Before we dive into the configuration and use cases for these new capabilities, let’s review how Network Load Balancer and Application Load Balancer manage DNS and request-level routing by default.
At the DNS level, Amazon Route 53 performs health checks for each of the load balancer’s AZs. When Route 53 detects that the load balancer IP address or addresses in an AZ are unhealthy, either because of timeouts or because the AZ reports that there are not enough healthy targets, it triggers the DNS failover action. This action removes the load balancer IP addresses in the affected AZ from the load balancer’s FQDN record.
By default, both Application and Network Load Balancers monitor the health of their targets and route requests only to healthy targets. However, if the load balancer does not have enough healthy targets, it automatically triggers the Routing (Fail Open) action, which sends traffic to any registered target as if they were all healthy.
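The default request-level behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not how the load balancer is implemented: the `Target` class, the `eligible_targets` function, and the target IDs are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    target_id: str  # hypothetical identifier, for illustration only
    healthy: bool   # latest health check result as seen by one load balancer node


def eligible_targets(targets, minimum_healthy=1):
    """Return the targets a load balancer node may route new traffic to."""
    healthy = [t for t in targets if t.healthy]
    if len(healthy) >= minimum_healthy:
        # Normal operation: only healthy targets receive traffic.
        return healthy
    # Routing (Fail Open): too few healthy targets, so every registered
    # target is treated as if it were healthy.
    return list(targets)


fleet = [Target("i-a", True), Target("i-b", False), Target("i-c", False)]
print([t.target_id for t in eligible_targets(fleet)])

all_down = [Target("i-a", False), Target("i-b", False)]
print([t.target_id for t in eligible_targets(all_down)])
```

With the default minimum of one healthy target, the first call returns only `i-a`, while the second call fails open and returns both registered targets.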
Now let’s explore how to use the new available options to configure target group settings for Routing (Fail Open) and DNS failover when a percentage or a fixed number of targets becomes unhealthy.
Target group health threshold
The default threshold value for both Routing (Fail Open) and DNS failover is set to a minimum of one healthy target. This configuration triggers the Routing (Fail Open) and the DNS failover actions when your target group does not have any healthy targets (that is, the condition of at least one healthy target is not met). The default setting is used to mitigate impact when there is a hard failure and all targets fail health checks.
By configuring your target group health thresholds, you can enforce a timely response to health check failures. This helps when the health checks for your targets temporarily fail, for example, because of a transient connectivity issue with dependencies like databases. This allows you to handle zonal failures before all targets fail, and increases the likelihood that a client connection succeeds, improving your availability and resiliency.
You can configure the health threshold settings for each target group, so let’s review these settings next.
Routing (Fail Open) threshold configuration
The Routing (Fail Open) threshold configuration allows you to define the minimum number of healthy targets below which the Routing (Fail Open) action is triggered. When you cross the threshold, the load balancer routes new client flows or requests to any target in the target group, regardless of its health check status. With this setting, your load balancer can prevent outages during temporary failures, as it can send traffic to unhealthy targets until they recover, reducing the risk of overloading the remaining healthy targets.
Let’s consider a load balancer that identifies a target as unhealthy and stops sending client traffic to it. If a transient issue occurs, for example, a large percentage of targets in the target group temporarily fail health checks, the Routing (Fail Open) settings allow the load balancer to continue routing client traffic to any target. Whether that client traffic succeeds or fails then depends on the actual state of the backend application targets, not on their health check state.
Routing (Fail Open) threshold considerations:
- Routing (Fail Open) threshold must be less than or equal to the DNS failover threshold.
- Each target group triggers its Routing (Fail Open) separately.
- When triggered, the load balancer routes traffic to any target in the target group. If these targets cannot serve traffic, the connections or requests will fail.
- Each Load Balancer Elastic Network Interface (ENI) makes independent routing decisions and health checks towards registered targets; therefore, each load balancer IP address has its own health check state for each registered target.
- The Cross-Zone Load Balancing setting is honored, regardless of when Routing (Fail Open) is triggered. This means that even if other AZs have healthy targets, the AZ where the threshold was crossed will fail open traffic only to the targets within the same AZ.
- If the unhealthy targets refuse TCP connections:
  - Network Load Balancer propagates these errors to clients, so for troubleshooting, check the TCP_Target_Reset_Count per AZ metric.
  - Application Load Balancers generate 502 errors, so for troubleshooting, check HTTPCode_ELB_502_Count metrics.
DNS failover threshold configuration
By configuring your DNS failover threshold, you steer client traffic only to the AZs that are healthy and serve the needs of your workloads. When the threshold is crossed, the DNS failover occurs, and Route 53 removes the IP address of the affected AZ from the ELB FQDN record.
For example, if a load balancer IP marks all targets as unhealthy, this causes Amazon Route 53 health checks to mark the IP address of the affected AZ as unhealthy and remove it from the FQDN record. Therefore, clients performing a new DNS resolution for the load balancer record will receive only healthy load balancer IP addresses.
Availability Zone requirements for DNS failover:
For DNS failover to occur, you must configure your load balancer across at least two AZs. Application Load Balancers are multi-AZ by default, so at creation time they require you to specify at least two subnets (one in each AZ). Network Load Balancers don’t need to span multiple AZs, but zonal DNS failover cannot be performed for NLBs configured with a single AZ.
If an Application Load Balancer is deployed across two AZs with at least one registered target, it has a minimum of one ENI in each configured AZ. If you select three or more AZs, the ALB creates one ENI in each AZ that has at least one registered target. Application Load Balancer IP addresses can change throughout the life of your load balancer, for example, when the load balancer scales up or down based on your traffic profile. An Application Load Balancer can have up to 100 active IP addresses/ENIs, distributed across all enabled AZs.
Network Load Balancers have one ENI with one static IP address in each selected AZ, regardless of the number of targets. Although the network load balancer IP addresses do not change, we recommend clients resolve the load balancer DNS name to ensure they connect to healthy IPs (ones that can process traffic).
DNS failover considerations:
- The DNS failover threshold must be greater than or equal to the threshold for Routing (Fail Open), so that DNS failover occurs either together with, or before Routing (Fail Open).
- When using multiple target groups, if a single Target Group is not meeting the configured threshold, DNS failover begins.
- When the DNS failover occurs, the IP of the load balancer ENI is removed from the ELB DNS hostname, which removes the capacity of an entire AZ. Make sure that you have capacity to tolerate this failover in the remaining AZs, especially for load balancers with Cross-Zone Load Balancing turned off.
- As a best practice, clients should honor the DNS record time-to-live (TTL), which is set to 60 seconds. Until the TTL expires for a cached entry, the local client DNS cache can still contain IP addresses that Route 53 has removed from the load balancer DNS record.
- For load balancers with Cross-Zone Load Balancing turned off, if the AZ does not have any registered targets, the IP of this AZ is removed from the FQDN record.
- With DNS failover, if all load balancer AZs are considered unhealthy, the load balancer DNS record will contain the IPs of all AZs, to protect your workloads against configuration errors. Here, the Route 53 Evaluate Target Health check will fail. If your DNS record in Route 53 is configured with a failover to another resource, then failover will occur. The same applies for target groups with no targets.
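To make the DNS failover behavior above concrete, here is a minimal Python sketch of which IP addresses end up in the load balancer DNS record. It is a simplified model under stated assumptions: the function name `dns_record_ips`, the AZ labels, and the address `10.100.1.30` are hypothetical, and the sketch ignores TTL caching on the client side.

```python
def dns_record_ips(zone_ips, zone_healthy):
    """Return the IPs that remain in the load balancer DNS record.

    zone_ips:     dict mapping an AZ name to the list of load balancer IPs in that AZ
    zone_healthy: dict mapping an AZ name to True when the AZ meets its DNS
                  failover threshold
    """
    healthy_ips = [
        ip
        for az, ips in zone_ips.items()
        if zone_healthy.get(az, False)
        for ip in ips
    ]
    if healthy_ips:
        # DNS failover: IPs of AZs that breached the threshold are removed.
        return healthy_ips
    # Every AZ is considered unhealthy: the record contains the IPs of all AZs.
    return [ip for ips in zone_ips.values() for ip in ips]


# Hypothetical two-AZ load balancer with one IP per AZ.
ips = {"az1": ["10.100.1.30"], "az2": ["10.100.2.30"]}

print(dns_record_ips(ips, {"az1": True, "az2": False}))
print(dns_record_ips(ips, {"az1": False, "az2": False}))
```

The first call drops the AZ2 address from the record, while the second call returns both addresses because no AZ is considered healthy.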
Considerations when using Cross-Zone Load Balancing
Cross-Zone Load Balancing can be configured either at the load balancer level or at the target group level. When you turn off Cross-Zone Load Balancing, each load balancer ENI/IP address sends traffic only to targets in the same AZ. The number of targets visible by each AZ is equal to the number of registered targets in the AZ, as highlighted in the following diagram (figure 1):
Best Practices for using Load Balancers with Cross-Zone Load Balancing off:
- Consider the total number of targets that you have in the Target Group per AZ to calculate the desired threshold, as both DNS failover and Routing (Fail Open) thresholds you set are based on the total number of targets in each AZ.
- Note that when you use multiple target groups with Cross-Zone Load Balancing off, a single target group crossing its threshold is enough to initiate failover, which removes the IP of the impacted AZ. This causes new flows to be routed to the remaining AZs. You can turn off DNS failover for target groups that should not trigger this action. Also, ensure that targets in the remaining AZs have enough capacity to handle the traffic from the affected AZ.
When Cross-Zone Load Balancing is on, each Application or Network Load Balancer ENI/IP can send traffic to any registered targets in the target group. The number of targets visible to each load balancer ENI/IP address may be different than what is visible when cross-zone load balancing is turned off. This is shown in the following diagram (figure 2):
Best Practices for using Load Balancers with Cross-Zone Load Balancing on:
- Consider the total number of targets in each Target Group to calculate the desired threshold, as both DNS failover and Routing (Fail Open) thresholds are based on the total number of targets across all your ELB AZs.
- If traffic does not flow across AZs, for example, during a network failure event within one Availability Zone, the number of targets seen by each AZ will be lower than the total number of registered targets. Here, the affected AZ may not be able to communicate with targets in other AZs, resulting in differences in health check status for the affected AZ and timeouts when attempting to send traffic to this AZ. Ensure your failover threshold settings account for the number of targets in each AZ.
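The effect of Cross-Zone Load Balancing on threshold evaluation can be illustrated with a small Python sketch. It assumes a simplified model in which every load balancer node sees the same health state for a given target (in practice, each node runs its own health checks), and the function name `breached_zones` and the AZ labels are hypothetical.

```python
def breached_zones(healthy_per_az, registered_per_az, percentage, cross_zone):
    """Return the AZs whose share of visible healthy targets is below `percentage`."""
    total_registered = sum(registered_per_az.values())
    total_healthy = sum(healthy_per_az.values())
    zones = []
    for az, registered in registered_per_az.items():
        if cross_zone:
            # Cross-zone on: each node sees every target in the target group.
            visible, healthy = total_registered, total_healthy
        else:
            # Cross-zone off: each node sees only the targets in its own AZ.
            visible, healthy = registered, healthy_per_az[az]
        # Integer comparison equivalent to: healthy / visible < percentage / 100
        if healthy * 100 < percentage * visible:
            zones.append(az)
    return zones


registered = {"az1": 10, "az2": 10, "az3": 10}
healthy = {"az1": 4, "az2": 10, "az3": 10}

print(breached_zones(healthy, registered, percentage=50, cross_zone=False))
print(breached_zones(healthy, registered, percentage=50, cross_zone=True))
```

With cross-zone off, AZ1 sees 4 of 10 healthy targets (40%) and breaches a 50% threshold. With cross-zone on, each node sees 24 of 30 healthy targets (80%), so the same failure does not breach the threshold in any AZ.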
Differences between Network and Application Load Balancer DNS failover behavior
Network Load Balancers trigger DNS failover when you have empty target groups, therefore ensure you don’t have an AZ with no registered targets. Let’s consider a scenario with two target groups, and cross-zone load balancing turned off, as highlighted in the following diagram (figure 3).
Cross-zone load balancing: Off
Target Group 1 and Target Group 2:
- Target Group 1: 4 Targets
  - 2 in AZ1 – healthy
  - 2 in AZ2 – healthy
- Target Group 2: 2 Targets
  - 2 in AZ1 – healthy
  - 0 in AZ2 – no registered targets (empty)
In this case, when a Network Load Balancer fails the DNS health check of AZ2 and DNS failover is triggered, Route 53 removes the NLB IP address 10.100.2.30 (AZ2) from the NLB FQDN. This stops traffic flowing to AZ2, even though the traffic to the listener associated with Target Group 1 could still be served. Client traffic sent to the NLB IP 10.100.2.30 (AZ2) on the listener associated with Target Group 2 will time out.
In the same scenario, because the Application Load Balancer has at least one target group that is not empty, and the other target groups are healthy, DNS health checks will not fail. The traffic received in AZ2 that is routed to Target Group 2 results in HTTP 503 errors, as there are no healthy targets to receive the traffic.
The DNS failover default setting is configured in the target group attribute configuration. When you have multiple target groups associated with your load balancer, if one target group breaches the DNS failover threshold, the DNS failover action occurs. This happens even if other target groups are not breaching their own thresholds. You can reconfigure health state requirements by turning off the DNS failover action for a target group. You can see this in the following screenshot (figure 4):
Turning this off excludes the target group from DNS failover actions. This can be used if you have a target group that does not serve the majority of traffic.
Configuring target group health thresholds settings
You can configure your Target Group Zonal health threshold using either the Unified Configuration or Detailed Configuration approach.
Unified Configuration – using the same threshold across DNS failover and Routing (Fail Open):
Unified Configuration offers you a single setting for both types of thresholds, which covers the majority of use cases. Configuring a single setting allows you to apply a single zonal health threshold value for both DNS failover and Routing (Fail Open). If the number of healthy targets falls below the threshold, both DNS failover and Routing (Fail Open) actions are initiated at the same time.
Detailed Configuration – using custom threshold values per action type:
When you set the Detailed Configuration parameters to different threshold values, DNS failover takes precedence over Routing (Fail Open). If the unhealthy ALB/NLB IP keeps receiving traffic after DNS failover is performed and you then cross the Routing (Fail Open) threshold, the load balancer sends traffic to any registered target, as if they were all healthy. This is helpful when your targets are unhealthy because of short-term or transient issues, ensuring healthy targets are not overwhelmed, and allowing unhealthy targets to recover.
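As an illustration of the two configuration approaches, the following Python sketch builds the target group attribute list for a set of thresholds and checks the ordering rule stated earlier (the Routing (Fail Open) threshold must not exceed the DNS failover threshold). The attribute keys are written as I understand them from the Elastic Load Balancing API reference, so verify them against the current documentation before use. The helper names `health_threshold_attributes` and `_check_order` are hypothetical.

```python
PREFIX = "target_group_health"
DNS = f"{PREFIX}.dns_failover.minimum_healthy_targets"
ROUTING = f"{PREFIX}.unhealthy_state_routing.minimum_healthy_targets"


def _check_order(routing, dns, unit):
    """Enforce: Routing (Fail Open) threshold <= DNS failover threshold."""
    if routing != "off" and dns != "off" and int(routing) > int(dns):
        raise ValueError(
            f"Routing (Fail Open) {unit} must not exceed DNS failover {unit}"
        )


def health_threshold_attributes(dns_count=1, routing_count=1,
                                dns_percentage="off", routing_percentage="off"):
    """Build the attribute list for one target group (values are sent as strings)."""
    _check_order(routing_count, dns_count, "count")
    _check_order(routing_percentage, dns_percentage, "percentage")
    values = {
        f"{DNS}.count": dns_count,
        f"{DNS}.percentage": dns_percentage,
        f"{ROUTING}.count": routing_count,
        f"{ROUTING}.percentage": routing_percentage,
    }
    return [{"Key": key, "Value": str(value)} for key, value in values.items()]


# Detailed configuration: DNS failover at 50%, Routing (Fail Open) at 30%.
for attribute in health_threshold_attributes(dns_percentage=50, routing_percentage=30):
    print(attribute["Key"], "=", attribute["Value"])
```

Passing the same value for both actions corresponds to the unified approach. The returned list has the Key/Value shape accepted by the `Attributes` parameter of the boto3 `elbv2` client method `modify_target_group_attributes`, which also needs the ARN of your target group.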
Choosing between percentage or fixed thresholds:
When configuring thresholds, you can set a static value for the minimum acceptable number of healthy targets, a percentage of the total number of registered targets, or both. A static value stays fixed regardless of how many targets are in the target group. This is useful when you have benchmarked your application requirements and know the minimum number of targets your workload requires. When you configure a percentage of healthy targets as a threshold, the load balancer calculates the number based on the total number of targets registered with the target group. This is useful when your workload has a variable number of targets. When both are configured and either value (percentage or static) crosses the failover threshold, the configured action is executed.
Let’s consider the following example scenarios:
- 10 targets
- Routing (Fail Open) and DNS failover threshold static values: 5
- Routing (Fail Open) and DNS failover threshold percentage values: 30% → 3 (Dynamically calculated)
When 6 of the 10 targets fail health checks, the Routing (Fail Open) action and DNS failover trigger, because the target group has fewer than 5 healthy targets. This happens even though the target group percentage threshold of 3 healthy targets was not crossed. Here, the static value was crossed first.
- 30 targets
- Routing (Fail Open) and DNS failover threshold static values: 5
- Routing (Fail Open) and DNS failover threshold percentage values: 30% → 9 (Dynamically calculated)
Because the target group has less than 30% of targets considered healthy, the DNS failover action triggers when 22 of 30 targets fail health checks (8 targets are healthy). This happens even though the target group static threshold was not crossed. In this case the percentage value was crossed first.
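The two example scenarios above can be checked with a short Python sketch. It assumes, as the scenarios describe, that a threshold is crossed when the number of healthy targets is strictly below the static value or strictly below the configured percentage of registered targets. The function name `breached_thresholds` is hypothetical.

```python
def breached_thresholds(healthy, registered, min_count=None, min_percentage=None):
    """Return which configured thresholds ("count", "percentage") are crossed."""
    reasons = []
    if min_count is not None and healthy < min_count:
        reasons.append("count")
    # Integer comparison equivalent to: healthy / registered < min_percentage / 100
    if min_percentage is not None and healthy * 100 < min_percentage * registered:
        reasons.append("percentage")
    return reasons


# Scenario 1: 10 targets, 6 fail health checks, so 4 remain healthy.
print(breached_thresholds(4, 10, min_count=5, min_percentage=30))

# Scenario 2: 30 targets, 22 fail health checks, so 8 remain healthy.
print(breached_thresholds(8, 30, min_count=5, min_percentage=30))
```

In the first scenario only the static value is crossed (4 is below 5, while 40% is not below 30%). In the second scenario only the percentage is crossed (8 of 30 is about 26.7%, while 8 is not below 5). Either one is enough to trigger the configured action.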
Custom versus unified threshold values
To show the difference between the two options, let’s look at a third scenario:
- 3 Availability Zones
- Cross-Zone load balancing turned off
- 10 Targets in each AZ / 30 Targets in the Target Group
- DNS failover threshold percentage value: 50%
- Routing (Fail Open) threshold percentage value: 30%
When the number of healthy targets is less than 50% (5 out of 10) in one AZ, DNS failover occurs and the client traffic is redirected to healthy AZs. If client traffic persists after the DNS Failover and the number of healthy targets drops below 30% (3 out of 10), the Routing (Fail Open) action occurs.
Let’s walk through how this might happen:
The following timeline tracks AZ 1 through this scenario, covering the number of healthy and unhealthy targets, the DNS failover threshold and state, and the Routing (Fail Open) threshold and state:
- At T0, everything is healthy.
- At T1, 3 targets are unhealthy.
- At T2, 5 targets are unhealthy.
- At T3, the DNS failover threshold is breached, because only 4 targets (fewer than 5) are healthy.
- At T4, one additional target fails, but this does not trigger any additional action.
- At T5, the Routing (Fail Open) threshold is breached, because only 2 targets (fewer than 3) are healthy.
- At T6, one additional target fails, but all actions have already occurred.
- At T7, targets start to recover, and Routing (Fail Open) recovers.
- At T8, DNS failover also recovers.
- At T9, everything is healthy.
In this scenario, we configured the thresholds with values different from the defaults. Note that the default threshold for both DNS failover and Routing (Fail Open) is 1. With the default configuration, all targets in the target group must fail health checks before any action is triggered. Because that did not occur in this scenario, no actions would have been taken.
For Network Load Balancers, use the CloudWatch metric UnhealthyRoutingFlowCount to see the number of flows (or connections) that are routed using the Routing (Fail Open) action. For Application Load Balancers, use the CloudWatch metric UnhealthyRoutingRequestCount to see the number of requests that are routed using the Routing (Fail Open) action.
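The walkthrough for AZ 1 can be replayed with a small Python simulation. The healthy-target counts for T0 through T6 and T9 follow the steps listed above. The counts for T7 and T8 (4 and 6) are assumptions chosen only to be consistent with the described recovery, since the exact values are not stated. All names in the sketch are hypothetical.

```python
REGISTERED = 10  # targets in AZ 1, with Cross-Zone Load Balancing off

# Healthy targets in AZ 1 at T0..T9. T7 = 4 and T8 = 6 are assumed values.
HEALTHY_BY_STEP = [10, 7, 5, 4, 3, 2, 1, 4, 6, 10]


def below_percentage(healthy, registered, percentage):
    """True when healthy targets are strictly below `percentage` of registered."""
    return healthy * 100 < percentage * registered


def active_steps(percentage):
    """Return the time steps at which a percentage threshold is breached."""
    return [
        f"T{step}"
        for step, healthy in enumerate(HEALTHY_BY_STEP)
        if below_percentage(healthy, REGISTERED, percentage)
    ]


def default_config_triggers():
    """With the default threshold (count = 1), an action needs zero healthy targets."""
    return any(healthy < 1 for healthy in HEALTHY_BY_STEP)


print("DNS failover active:", active_steps(50))
print("Routing (Fail Open) active:", active_steps(30))
print("Default configuration triggers:", default_config_triggers())
```

Under these assumptions, DNS failover is active from T3 through T7 and Routing (Fail Open) from T5 through T6, while the default configuration never triggers because at least one target stays healthy throughout.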
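As a starting point for monitoring, the sketch below builds the request parameters for a CloudWatch statistics query on these metrics. The metric names come from the text above, and the namespaces are the standard ones for each load balancer type. The choice of the `LoadBalancer` and `TargetGroup` dimensions, the one-minute period, and the example dimension values are assumptions to verify against the CloudWatch metrics documentation for your load balancer. The function name `fail_open_metric_query` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Load balancer type -> (CloudWatch namespace, fail open metric name)
FAIL_OPEN_METRICS = {
    "network": ("AWS/NetworkELB", "UnhealthyRoutingFlowCount"),
    "application": ("AWS/ApplicationELB", "UnhealthyRoutingRequestCount"),
}


def fail_open_metric_query(lb_type, load_balancer, target_group, minutes=60):
    """Build keyword arguments for a CloudWatch get_metric_statistics call."""
    namespace, metric_name = FAIL_OPEN_METRICS[lb_type]
    end = datetime.now(timezone.utc)
    return {
        "Namespace": namespace,
        "MetricName": metric_name,
        # Assumed dimensions: adjust them to match how the metric is published.
        "Dimensions": [
            {"Name": "LoadBalancer", "Value": load_balancer},
            {"Name": "TargetGroup", "Value": target_group},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,
        "Statistics": ["Sum"],
    }


# Placeholder dimension values, for illustration only.
query = fail_open_metric_query(
    "network", "net/example-nlb/0123456789abcdef", "targetgroup/example-tg/0123456789abcdef"
)
print(query["Namespace"], query["MetricName"])
```

The resulting dictionary can be passed as keyword arguments to the boto3 CloudWatch client method `get_metric_statistics`. A non-zero sum over the period indicates that flows or requests were routed while the target group was failing open.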
With configurable target group health thresholds, you can better control your application availability and failover actions in case of availability events. This improves your resiliency and gives you more control when managing traffic around unhealthy targets in an AZ. In this blog, we reviewed three scenarios on how to configure these new capabilities, together with considerations and best practices for integration with Cross Zone Load Balancing. We recommend you evaluate your use case and check all considerations when implementing this with your target group configuration.