Monitoring load balancers using Amazon CloudWatch anomaly detection alarms

Load balancers are a critical component in the architecture of distributed software services. AWS Elastic Load Balancing (ELB) provides highly performant automatic distribution for any scale of incoming traffic across many compute targets (Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), AWS Lambda, etc.), while enabling developers to adopt security best practices at the network boundary (among many other features).

Because load balancers sit high up in the service stack, the metrics they emit provide crucial and unique insight into service health, service performance, and end-to-end network performance. Monitoring these metrics provides visibility into many kinds of incidents across the service stack and the network. This visibility can mean an incident is detected and mitigated quickly rather than becoming a prolonged outage.

This post begins with a brief overview of AWS Network Load Balancer (NLB) monitoring. This is followed by a look at the NLB metric TCP_Target_Reset_Count and why conventional Amazon CloudWatch alarms using static thresholds can’t be used for monitoring this class of metrics. Then a brief look at CloudWatch anomaly detection alarms is presented, followed by a deep-dive into how this can be used for monitoring TCP_Target_Reset_Count. In conclusion, we highlight some of the situations where this monitoring can be useful.

NLB TCP reset count metrics

NLBs emit a set of TCP reset count metrics, broken down by where the reset originated (target, client, or the load balancer itself). The TCP reset flag is summarized next, followed by the definition of these metrics and what they might indicate.

TCP reset flag

Every packet in a TCP connection contains a TCP header. Each header contains a bit known as the “reset” (RST) flag. Setting this bit to 0 has no effect; setting it to 1 tells the receiver that the given TCP connection shouldn’t be used anymore. A reset closes the TCP connection instantly.
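
As an illustration of the flag described above, the following minimal sketch checks the RST bit in a raw TCP header. The function and constant names are assumptions made for this example and are not part of any AWS API. It relies only on the standard TCP header layout, where the flags byte sits at offset 13 and RST is the 0x04 bit.

```typescript
// Offset of the flags byte within a TCP header. Bytes 0-12 hold the ports,
// sequence and acknowledgement numbers, and the data offset.
const TCP_FLAGS_OFFSET = 13;
// Bit mask of the RST flag within the flags byte.
const TCP_FLAG_RST = 0x04;

// Hypothetical helper: returns true when the RST bit is set in a raw TCP header.
function hasRstFlag(tcpHeader: Uint8Array): boolean {
    if (tcpHeader.length < 20) {
        throw new Error("A TCP header is at least 20 bytes long");
    }
    return (tcpHeader[TCP_FLAGS_OFFSET] & TCP_FLAG_RST) !== 0;
}

// Example: a 20-byte header with ACK (0x10) and RST (0x04) set.
const resetHeader = new Uint8Array(20);
resetHeader[12] = 0x50; // data offset of 5 words (20 bytes), no options
resetHeader[13] = 0x14;

// Example: the same header with only ACK set.
const ackHeader = new Uint8Array(20);
ackHeader[12] = 0x50;
ackHeader[13] = 0x10;

console.log(hasRstFlag(resetHeader), hasRstFlag(ackHeader));
```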

TCP_Target_Reset_Count, TCP_Client_Reset_Count and TCP_ELB_Reset_Count

TCP_Target_Reset_Count is an ELB metric published in CloudWatch. It counts the total number of reset (RST) packets sent from a target (for example, an Amazon EC2 host) to a client. A reset packet is one with no payload and with the RST bit set in the TCP header flags. These resets are generated by the target and forwarded by the load balancer. Sum is the most useful statistic for this metric. Similarly, the NLB also emits metrics corresponding to resets generated by the load balancer itself (TCP_ELB_Reset_Count) and resets generated by the client (TCP_Client_Reset_Count).

For a generic system comprising an NLB and underlying compute (such as Amazon EC2 hosts), TCP connections are short lived (governed by the time-to-live (TTL) configurations). Therefore, these reset metrics are expected to have a baseline value greater than 1 in any given time period, as TCP connections are opened and closed continuously.

Spikes or dips in these reset metrics can occur when the target, client, or load balancer is closing more or fewer connections than usual. Some situations when this can occur:

Breakdown or delay in the ‘client → NLB → target’ communication
For example, a networking issue which prevents the targets, NLB, or the clients from successfully communicating with one another. This would lead to a dip in reset metrics, lasting for the duration of the underlying issue. This scenario indicates either a full or partial outage of the service (such as a spike in 4xx errors on the client side), which should be alarmed on and mitigated appropriately.
A code-deployment on the underlying targets
This leads to a spike in reset metrics as the targets will send a reset packet before shutting down the application and starting the deployment. This is expected behavior and not an issue.

Monitoring TCP reset count metrics

As explained previously, NLB reset count metrics can highlight critical issues in the client → NLB → target communication. This can lead to increased errors and a detrimental customer experience. Accurate alarming of these NLB reset count metrics can notify the service owner and enable them to activate mitigation strategies.

Static threshold alarming

Conventional CloudWatch alarms monitor a metric with a static threshold. For example, the alarm is triggered when a metric has a value greater than a threshold X, for Y data points in a given time duration. The threshold X is configured in advance and is a constant.
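
The evaluation rule described above can be sketched roughly as follows. This is a simplified illustration and not CloudWatch's actual implementation. The function name, parameter names, and sample values are assumptions made for this example.

```typescript
// Simplified sketch of a static threshold check: the alarm fires when at
// least `datapointsToAlarm` of the most recent `evaluationPeriods` values
// are above the fixed threshold.
function isStaticAlarmBreached(
    values: number[],
    threshold: number,
    evaluationPeriods: number,
    datapointsToAlarm: number,
): boolean {
    if (evaluationPeriods < 1 || datapointsToAlarm < 1) {
        throw new Error("evaluationPeriods and datapointsToAlarm must be at least 1");
    }
    const recent = values.slice(-evaluationPeriods);
    const breaching = recent.filter((value) => value > threshold).length;
    return breaching >= datapointsToAlarm;
}

// Hypothetical request durations in milliseconds, one value per period.
const requestDurations = [120, 130, 510, 540, 560];

// Threshold X = 500 ms, Y = 3 breaching data points out of the last 3 periods.
console.log(isStaticAlarmBreached(requestDurations, 500, 3, 3));
```

The threshold never changes in this sketch, which is the property that makes this kind of alarm simple to reason about but unable to follow a metric whose normal range moves.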

Figure 1. Static threshold alarm for request duration


This kind of alarming strategy fails for metrics where a static threshold can’t represent the normal operating conditions of the system. This is the case if the safe values of the metric (indicating normal operating conditions) change frequently, for example when they depend on the daily traffic pattern or the size of the auto-scaled service fleet.

The reset count metrics (such as TCP_Target_Reset_Count) fall under this category. By definition, the normal range of the metric depends on the number of hosts in the underlying fleet (among other factors).

Figure 2. Static threshold alarm and TCP_Target_Reset_Count

For example, in the previous figure, a CloudWatch snapshot of TCP_Target_Reset_Count over six consecutive days is shown. Region A and Region B indicate anomalies in the system (irregular spikes or dips in RST count) while Region C is healthy.

An alarm threshold value of 1365 is sufficient to detect the spike in Region A, but this value fails to capture the dip shown in Region B. One possible solution could be to create a separate alarm that triggers if the metric falls below a lower threshold value of 1200. However, both of these alarms will be static and will fail to adapt to changes in the contributing factors (for example, the host count).

The previous example is a small snapshot (six days), and over a longer period (months) this metric can have even more variation. TCP reset metrics thus can’t be monitored by a static threshold.

CloudWatch anomaly detection alarms

CloudWatch anomaly detection alarms solve the above problem by building a statistical model of the underlying metric. This enables the creation of dynamic alarm thresholds with both an upper and a lower limit. These statistical models are continuously re-trained, which accounts for changing trends in the metrics (the different regions in the previous figure).

Figure 3. Anomaly detection alarm and TCP_Target_Reset_Count


In the previous figure, a CloudWatch anomaly detection alarm is used for monitoring TCP_Target_Reset_Count. The dynamic threshold is denoted by the grey band, which continuously adjusts to changes in the metric trend. Some things to note in the figure:

The alarm is triggered (in red) for the extreme peaks and valleys of the metric, indicating either an increased rate of failure (on the target) or some kind of a networking issue respectively.
The alarm is already triggered while the metric has started a rapid descent or ascent. This leads to an earlier detection of the event, allowing service owners to trigger mitigations earlier and shorten the time-to-mitigation.
The width of the threshold band can be controlled by a single parameter – large values lead to a thicker band while small values lead to a thinner band. A larger threshold band is less sensitive compared to a smaller band.
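
To build intuition for how the band width affects sensitivity, here is a deliberately simplified sketch that derives a band from the mean and standard deviation of recent values. This is not how CloudWatch trains its anomaly detection model; the function names, the `width` parameter, and the sample data are all assumptions made for this illustration.

```typescript
interface Band {
    lower: number;
    upper: number;
}

// Simplified stand-in for a dynamic threshold: mean plus or minus
// `width` standard deviations of the recent history.
function computeBand(history: number[], width: number): Band {
    if (history.length === 0) {
        throw new Error("history must not be empty");
    }
    const mean = history.reduce((sum, v) => sum + v, 0) / history.length;
    const variance =
        history.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) / history.length;
    const stdDev = Math.sqrt(variance);
    return { lower: mean - width * stdDev, upper: mean + width * stdDev };
}

// Reports whether a new value is above, below, or inside the band.
function classify(value: number, band: Band): "above" | "below" | "inside" {
    if (value > band.upper) {
        return "above";
    }
    if (value < band.lower) {
        return "below";
    }
    return "inside";
}

// Hypothetical reset counts for recent periods.
const recentResetCounts = [100, 110, 90, 105, 95];

const narrowBand = computeBand(recentResetCounts, 2);
const wideBand = computeBand(recentResetCounts, 8);

// The same spike is flagged by the narrow band but tolerated by the wide one.
console.log(classify(130, narrowBand), classify(130, wideBand));
```

Because the band is recomputed from recent data, both spikes and dips can be flagged without choosing fixed upper and lower thresholds in advance.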

Creating an anomaly detection alarm for TCP_Target_Reset_Count using AWS Cloud Development Kit

AWS Cloud Development Kit (AWS CDK) is an open source software development framework to define your cloud application resources using familiar programming languages.

Here are some things to note in the following implementation:

This assumes the NLB ARN is being exported from the corresponding stack, which isn’t necessary if you’re creating the NLB in the same AWS CDK package being used to create the alarm.
The number of standard deviations used for the anomaly detection band is set to 8. This should be tuned depending on the desired sensitivity of the alarm. Increasing it makes the anomaly detection band larger, and thus the alarm becomes less sensitive to small changes in the TCP_Target_Reset_Count metric.

Anomaly detection alarm class

import {CfnAlarm, CfnAnomalyDetector, Metric, TreatMissingData} from "@aws-cdk/aws-cloudwatch";
import {Construct, Duration} from "@aws-cdk/core";

export interface AnomalyDetectionAlarmProps {
    readonly alarmName: string;
    readonly alarmDescription: string;
    readonly metric: Metric;
    readonly comparisonOperator: string;
    readonly evaluationPeriods: number;
    readonly period: Duration;
    readonly standardDeviation: number;
    readonly alarmActions?: string[];
    readonly modelConfiguration?: CfnAnomalyDetector.ConfigurationProperty;
}

export class AnomalyDetectionAlarm extends Construct {
    constructor(scope: Construct, id: string, props: AnomalyDetectionAlarmProps) {
        super(scope, id);

        const metricName = props.metric.metricName || "";
        const anomalyDetectorMetricId = `anomalyDetectorMetricId`;
        const anomalyDetectorId = `anomalyDetectorId`;
        const metricStats = props.metric.toMetricConfig().metricStat;
        const namespace = metricStats?.namespace || "";
        const stats = metricStats?.statistic || "";
        const dimensions = metricStats?.dimensions || undefined;
        const alarmActions = props?.alarmActions || [];

        new CfnAnomalyDetector(this, anomalyDetectorId, {
            configuration: props.modelConfiguration,
            namespace,
            metricName,
            stat: stats,
            dimensions,
        });

        // Create the alarm as a child of this construct. A constructor should not return a different object.
        new CfnAlarm(this, props.alarmName, {
            alarmName: props.alarmName,
            alarmDescription: props.alarmDescription,
            comparisonOperator: props.comparisonOperator,
            evaluationPeriods: props.evaluationPeriods,
            thresholdMetricId: anomalyDetectorMetricId,
            treatMissingData: TreatMissingData.MISSING,
            metrics: [
                {
                    expression: `ANOMALY_DETECTION_BAND(m1, ${props.standardDeviation})`,
                    id: anomalyDetectorMetricId,
                },
                {
                    id: "m1",
                    metricStat: {
                        metric: {
                            namespace,
                            metricName,
                            dimensions,
                        },
                        period: props.period.toSeconds(),
                        stat: stats,
                    },
                },
            ],
            alarmActions,
        });
    }
}

Instantiate the alarm

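// Method of the stack class that owns the alarm.
// Requires Duration and Fn from "@aws-cdk/core", Metric from "@aws-cdk/aws-cloudwatch",
// and the AnomalyDetectionAlarm class defined above.
// loadBalancerNameFromListenerArn is a helper that isn't shown in this post.
private createNLBAnomalyDetectionAlarm(alarmName: string) {
    const nlbName = loadBalancerNameFromListenerArn(Fn.importValue("ServiceLoadBalancer"));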
    const metricName = "TCP_Target_Reset_Count";
    const metric = new Metric({
        statistic: "Sum",
        label: nlbName,
        metricName,
        namespace: "AWS/NetworkELB",
        period: Duration.minutes(5),
        dimensions: {
            LoadBalancer: nlbName,
        },
    });
    new AnomalyDetectionAlarm(this, `${metricName}_Alarm`, {
        alarmName,
        alarmDescription: "TCP_Target_Reset_Count below the anomaly detector threshold",
        metric,
        comparisonOperator: "LessThanLowerThreshold",
        evaluationPeriods: 3,
        period: Duration.minutes(5),
        standardDeviation: 8,
    });
}
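
The snippet above relies on a helper to turn an ARN into the value of the LoadBalancer dimension, which for an NLB has the form net/load-balancer-name/load-balancer-id (the final portion of the load balancer ARN). The following is one possible sketch of such a helper. The function name, the sample ARNs, and the account and resource IDs are hypothetical. It also assumes the ARN is a concrete string: a value returned by Fn.importValue is only resolved at deployment time, so plain string operations like these would likely need to be replaced with CloudFormation intrinsic functions (for example, Fn.split and Fn.select) in that case.

```typescript
// Hypothetical helper: extracts the CloudWatch LoadBalancer dimension value
// ("net/<name>/<id>") from a concrete load balancer or listener ARN.
function loadBalancerDimensionFromArn(arn: string): string {
    // An ARN has the form arn:partition:service:region:account-id:resource.
    const resource = arn.split(":").slice(5).join(":");
    // resource is "loadbalancer/net/<name>/<id>" or "listener/net/<name>/<id>/<listener-id>".
    const parts = resource.split("/");
    const isSupportedType = parts[0] === "loadbalancer" || parts[0] === "listener";
    if (!isSupportedType || parts.length < 4) {
        throw new Error(`Not a load balancer or listener ARN: ${arn}`);
    }
    return parts.slice(1, 4).join("/");
}

// Made-up ARNs used only for illustration.
const sampleLoadBalancerArn =
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/my-nlb/50dc6c495c0c9188";
const sampleListenerArn =
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/net/my-nlb/50dc6c495c0c9188/f2f7dc8efc522ab2";

console.log(loadBalancerDimensionFromArn(sampleLoadBalancerArn));
console.log(loadBalancerDimensionFromArn(sampleListenerArn));
```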

Conclusion

We presented an overview of NLB reset count metrics and their utility. We then described why conventional CloudWatch alarms can’t be used for monitoring these metrics. Finally, we took a deep dive into using CloudWatch anomaly detection alarms and AWS CDK to monitor these metrics.

These alarms can be used in conjunction with conventional NLB alarms, such as unhealthy host count. This setup is being used by a software development team in Prime Video to improve time-to-detection (and time-to-mitigation) for certain incidents (mentioned above) by more than one hour.


Varun Jewalikar

Varun Jewalikar is a Software Engineer at Prime Video. He is passionate about large scale distributed systems, chaos engineering and open source. You can connect with him at https://www.linkedin.com/in/vjewalikar/
