Improving availability with Application Load Balancer automatic target weights

In this post, we explore Automatic Target Weights (ATW), which can reduce the number of errors web application users experience. ATW provides the ability to detect and mitigate gray failures for targets behind Application Load Balancers (ALB). A gray failure occurs when an ALB target passes active load balancer health checks, making it look healthy, but still returns errors. Gray failures have many possible causes, including application bugs, a dependency failure, intermittent network packet loss, a cold cache on a newly launched target, and CPU overload.

To improve application availability and account for these gray failures, ALB released the ATW feature, which includes anomaly detection and a load balancing algorithm called Weighted Random. ATW’s anomaly detection analyzes the HTTP return status codes and TCP/TLS errors to identify targets with a disproportionate ratio of errors compared to other targets in the same target group. When ATW identifies anomalous targets, it reduces traffic to the under-performing targets and gives a larger portion of the traffic to targets that are not exhibiting these errors. When the gray failures decrease or stop, ALB will slowly increase traffic back onto these targets.

Anomaly detection

There are two aspects to the ATW feature: anomaly detection and anomaly mitigation. Anomaly detection is automatically enabled on HTTP/HTTPS target groups with at least three healthy targets. It looks for targets exhibiting signs of a gray failure based on the HTTP/TCP/TLS responses.

If you look at the registered targets console (Figure 1), you will notice a new column titled “Anomaly detection result.” This column shows “Normal” for targets that are operating as they should be. However, if a target is experiencing a gray failure, it will show as “Anomalous.” This could result from a bug in your application, an overloaded server, or a misconfiguration.

Screenshot of the new target group console showing 4 of the 12 targets are anomalous.

Figure 1—Target group with anomalous targets

ATW calculates anomaly detection results every five seconds by looking at the responses from each of your targets. Targets experiencing a significantly higher ratio (number of errors divided by total number of requests) of HTTP 5xx, TCP or TLS errors than other targets are considered anomalous.
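
To make the ratio comparison concrete, here is a minimal Python sketch. It is not ALB's actual detection algorithm, which is not described in this post. The target IDs, the `factor` multiplier, and the `floor` value are all hypothetical, chosen only to illustrate how one target's error ratio can stand out against its peers.

```python
from statistics import median


def error_ratio(errors: int, requests: int) -> float:
    """Number of errors divided by total requests (0.0 when there is no traffic)."""
    return errors / requests if requests else 0.0


def find_outliers(stats: dict[str, tuple[int, int]],
                  factor: float = 3.0,
                  floor: float = 0.01) -> list[str]:
    """Return targets whose error ratio is far above the group's median.

    `factor` and `floor` are hypothetical tuning values, not ALB settings.
    """
    ratios = {target: error_ratio(e, r) for target, (e, r) in stats.items()}
    baseline = median(ratios.values())
    threshold = max(baseline * factor, floor)
    return sorted(t for t, ratio in ratios.items() if ratio > threshold)


# (errors, requests) per target over one evaluation window; made-up numbers.
stats = {
    "i-aaa": (2, 1000),
    "i-bbb": (3, 1000),
    "i-ccc": (250, 1000),
    "i-ddd": (1, 1000),
}
print(find_outliers(stats))  # ['i-ccc']
```

Comparing against the group's median rather than a fixed threshold reflects the idea in the paragraph above: a target is anomalous relative to the other targets in the same target group.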

Anomaly mitigation

In order for users of your web application to experience fewer errors, you need to enable anomaly mitigation, which reduces the number of requests sent to targets with gray failures. You can enable ATW mitigation mode using the AWS Console, AWS CloudFormation, our API, or the CLI. To enable it using the console, navigate to the target group attribute section, select the load balancing algorithm “Weighted random” and make sure you have selected “Turn on anomaly mitigation” (Figure 2).

Figure 2—Load balancing algorithm

Alternatively, you can enable anomaly mitigation by using the CLI:

aws elbv2 modify-target-group-attributes --target-group-arn <Target Group ARN> \
--attributes Key=load_balancing.algorithm.type,Value=weighted_random \
Key=load_balancing.algorithm.anomaly_mitigation,Value=on
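
If you prefer to script this change, the following Python sketch builds the same two attributes and passes them to an Elastic Load Balancing v2 client. It assumes you supply a boto3 `elbv2` client; the helper names `atw_attributes` and `enable_atw` are our own and are not part of any SDK.

```python
def atw_attributes(mitigation: bool = True) -> list[dict[str, str]]:
    """Build the target group attributes used by the CLI command above."""
    return [
        {"Key": "load_balancing.algorithm.type", "Value": "weighted_random"},
        {
            "Key": "load_balancing.algorithm.anomaly_mitigation",
            "Value": "on" if mitigation else "off",
        },
    ]


def enable_atw(client, target_group_arn: str) -> None:
    """Apply the attributes with a client that exposes modify_target_group_attributes.

    `client` is assumed to be boto3.client("elbv2"); it is passed in so the
    function can be exercised without AWS credentials.
    """
    client.modify_target_group_attributes(
        TargetGroupArn=target_group_arn,
        Attributes=atw_attributes(),
    )
```

A typical call would look like `enable_atw(boto3.client("elbv2"), "<Target Group ARN>")`.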

Now when you look at the target group’s registered targets (Figure 3), you will see a new column, this one titled “Mitigation in effect.”

Screenshot showing the ALB target group registered targets screen with 12 total targets. In the anomaly detection column, 8 show as normal and 4 show anomalous and have mitigation in effect.

Figure 3—Mitigation in effect

When you enable anomaly mitigation and ATW detects anomalous targets, the ALB decreases the amount of traffic sent to those targets. If your anomalous targets continue to experience 5xx/TCP/TLS errors at a significantly higher ratio than other targets, then the weight continues to be adjusted down to further reduce the traffic. If the targets show signs of recovery by returning a lower ratio of 5xx/TCP/TLS errors, then ATW slowly increases traffic to the anomalous targets by increasing the targets’ dynamic weight.
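
The following Python sketch illustrates the general idea of weighted random selection with weights that move down and back up. It is an illustration only: the step sizes, the minimum weight, and the function names are hypothetical, and ALB's real weight adjustments are not documented in this post.

```python
import random


def pick_target(weights: dict[str, float], rng: random.Random) -> str:
    """Choose one target at random, with probability proportional to its weight."""
    targets = list(weights)
    return rng.choices(targets, weights=[weights[t] for t in targets], k=1)[0]


def adjust_weight(weight: float, still_anomalous: bool,
                  step: float = 0.5, recovery: float = 0.1,
                  minimum: float = 0.05) -> float:
    """Lower the weight quickly while errors persist; raise it slowly on recovery.

    All numeric defaults are hypothetical values for illustration.
    """
    if still_anomalous:
        return max(weight * step, minimum)
    return min(weight + recovery, 1.0)


# Three normal targets and one target that stayed anomalous for two evaluations.
weights = {"i-aaa": 1.0, "i-bbb": 1.0, "i-ccc": 1.0, "i-ddd": 1.0}
for _ in range(2):
    weights["i-ddd"] = adjust_weight(weights["i-ddd"], still_anomalous=True)

rng = random.Random(7)
counts = {target: 0 for target in weights}
for _ in range(10_000):
    counts[pick_target(weights, rng)] += 1
print(weights["i-ddd"], counts)
```

In this sketch the anomalous target still receives a small share of requests, which is what lets its error ratio be observed again and its weight be raised once it recovers.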

You can retrieve similar information using the DescribeTargetHealth API, or the CLI:

aws elbv2 describe-target-health \
--targets Id=<Target ID>,Port=<Target Port> \
--target-group-arn <Target Group ARN> \
--include AnomalyDetection

For each target, the response includes a section “AnomalyDetection” describing the “Result” as “anomalous” or “normal”. It also includes the “MitigationInEffect” state as yes or no. The following snippet is an example of DescribeTargetHealth for an anomalous target:

{
	"Target": {
		"Id": "i-00000000000000000",
		"Port": 80
	},
	"HealthCheckPort": "80",
	"TargetHealth": {
		"State": "healthy"
	},
	"AnomalyDetection": {
		"Result": "anomalous",
		"MitigationInEffect": "yes"
	}
}
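
If you want to act on this output in a script, a small Python sketch like the one below can pick out the targets that are anomalous and currently mitigated. It assumes the full DescribeTargetHealth response, where entries like the one above are listed under `TargetHealthDescriptions`. The second target in the sample and the helper name `mitigated_targets` are made up for illustration.

```python
import json

# Sample response: the first entry mirrors the snippet above,
# the second is a hypothetical healthy, normal target.
response_text = """
{
  "TargetHealthDescriptions": [
    {
      "Target": {"Id": "i-00000000000000000", "Port": 80},
      "HealthCheckPort": "80",
      "TargetHealth": {"State": "healthy"},
      "AnomalyDetection": {"Result": "anomalous", "MitigationInEffect": "yes"}
    },
    {
      "Target": {"Id": "i-11111111111111111", "Port": 80},
      "HealthCheckPort": "80",
      "TargetHealth": {"State": "healthy"},
      "AnomalyDetection": {"Result": "normal", "MitigationInEffect": "no"}
    }
  ]
}
"""


def mitigated_targets(response: dict) -> list[str]:
    """Return IDs of targets reported as anomalous with mitigation in effect."""
    found = []
    for description in response.get("TargetHealthDescriptions", []):
        anomaly = description.get("AnomalyDetection", {})
        if (anomaly.get("Result") == "anomalous"
                and anomaly.get("MitigationInEffect") == "yes"):
            found.append(description["Target"]["Id"])
    return found


print(mitigated_targets(json.loads(response_text)))  # ['i-00000000000000000']
```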

Viewing effectiveness of ATW anomaly mitigation

You might find it helpful to create a CloudWatch dashboard to monitor the status of your web application and to see how anomaly detection and mitigation are working. The screenshot in Figure 4 shows traffic to an example web application before and after enabling anomaly mitigation.

Screenshot of a CloudWatch dashboard showing 4 graphs. The graphs show a high number of HTTP 5xx errors before anomaly mitigation is enabled and a small number of errors after it's enabled.

Figure 4 – CloudWatch dashboard

On the left side of the graphs, you can see that the application experienced a high number of HTTP 5xx errors (about 25% of the 800,000 requests) before we enabled anomaly mitigation. The “Anomalous Host Count” graph shows that four of the twelve targets experienced gray failures.

Halfway through the time period, you can see the impact of enabling anomaly mitigation. The mitigated host count jumps up to four. Simultaneously, the number of HTTP 5xx errors drops to fewer than 10,000 and the number of HTTP 2xx responses jumps by 200,000.

In this example, ATW anomaly detection and mitigation automatically eliminated 96% of the errors that users would have otherwise experienced.

ATW anomaly mitigation and auto scaling groups

You can use auto scaling group policies together with the Weighted Random load balancing algorithm. However, if you have a dynamic scaling policy based on average CPU utilization, the metric (an all-up average) will not change when individual targets are under ATW mitigation, even though the CPU utilization of individual instances will change.

You can create a custom CloudWatch metric to calculate an average CPU utilization which accounts for the capacity lost due to mitigated targets. This metric uses the auto scaling group average CPU utilization and adjusts the metric value based on the number of non-mitigated targets.

The custom metric looks like this:

(EC2_AutoScalingGroup_CpuUtilization * AutoScaling_GroupInServiceInstances) / (AutoScaling_GroupInServiceInstances - MitigatedHostCount)
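
As a worked example of this formula, the short Python sketch below plugs in the twelve targets and four mitigated hosts from the earlier dashboard example. The 40% average CPU utilization is a hypothetical input, and the function name `adjusted_cpu` is our own.

```python
def adjusted_cpu(avg_cpu: float, in_service: int, mitigated: int) -> float:
    """Scale the group's average CPU to the instances still taking full traffic.

    Mirrors the custom metric above:
    (avg_cpu * in_service) / (in_service - mitigated)
    """
    serving = in_service - mitigated
    if serving <= 0:
        raise ValueError("at least one instance must not be mitigated")
    return avg_cpu * in_service / serving


# 12 in-service instances, 4 mitigated, hypothetical 40% group average.
print(adjusted_cpu(40.0, 12, 4))  # 60.0
```

With no mitigated targets the adjusted value equals the plain average, so a scaling policy based on it behaves as before until mitigation removes capacity.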

Considerations

Here are some things to know about how ATW works:

ATW can only be used in target groups with three or more healthy targets.
When cross-zone is enabled, ATW detects and mitigates failures on up to 50% of all targets in a target group.
When cross-zone is disabled, ATW detects and mitigates failures on up to 50% of targets per Availability Zone (AZ).
ATW may not be able to detect anomalous targets when the request rate is lower than two requests per second per target.
When most of the targets are experiencing gray failures, ATW is unable to detect outliers or perform mitigation.
There is no additional charge when using ATW detection and mitigation features.

Conclusion

ATW is now available in all commercial AWS Regions on existing and newly created ALBs. We encourage you to start experimenting with it right away. To learn more about this feature, refer to the documentation.

About the authors

Scott Hewitt

Scott Hewitt is a Senior Solutions Architect at AWS in Chicago. He helps customers effectively architect their applications to run on AWS. Scott’s passion for networking started over 20 years ago with datacenter networking.

Jorge Prado

Jorge is a Senior Technical Account Manager at AWS in North Carolina. He is passionate about helping Enterprise Support customers find the right solutions and achieve operational excellence. His focus is on networking technologies. In his free time he enjoys reading, watching movies and playing video games with his kids.

Pushkar Patil

Pushkar Patil is a Product Owner in the AWS networking team based out of California. He has over a decade of experience driving product innovation and strategic planning in cloud computing and infrastructure. Pushkar has successfully launched many new products by understanding customers’ needs and delivering innovative solutions. When not working, you can find this cricket enthusiast traveling with his family.
