Using Amazon CloudWatch metric math to monitor and scale resources

Many applications require monitoring, scaling, and alerting across multiple dimensions. This requirement adds operational complexity for Developer Operations (DevOps) teams, as they must track numerous discrete data points. Instead, you can use Amazon CloudWatch metric math to create composite metrics quickly and easily. In this post, you’ll learn to apply these concepts to monitoring dashboards, operational alerts, and resource scaling policies.

CloudWatch is a monitoring and observability service that provides data and actionable insights to monitor your applications, respond to system-wide performance changes, and optimize resource utilization. You can deploy the CloudWatch agent across your compute resources like Amazon Elastic Compute Cloud (Amazon EC2), container services like Amazon Elastic Container Service (Amazon ECS), and even on-premises servers. Furthermore, many AWS services can also centralize their telemetry into your AWS account’s CloudWatch resources. Operating CloudWatch across the organization’s portfolio enables a consistent control plane and experience.

This abundance of data can help drive automation that detects and remediates resource performance issues. Many customers begin with simple metrics, such as CPU utilization and memory-pressure thresholds. However, as their applications modernize, they need more sophisticated detection logic that combines multiple metrics, conditional logic, and Boolean operators. You might consider building custom behaviors within an AWS Lambda function, but that’s undifferentiated heavy lifting and detracts from delivering value to your customers.

There are several additional benefits to using composite metrics within the CloudWatch feature set. For example, they reduce the total number of alarms necessary to monitor the environment, which removes duplicate messaging and redundant alarm costs. Operational teams can also increase consistency and efficiency by working with holistic signals instead of low-level data points.

This post demonstrates how to implement sophisticated composite metrics using discrete performance counters. Then, you’ll learn how to integrate CloudWatch metric math into your CloudWatch dashboards and CloudWatch alarms.

Solution overview

Metric math enables you to query multiple CloudWatch metrics and use math expressions to create new time series based on these metrics. You can visualize the resulting time series on the CloudWatch console and add them to dashboards. Using Lambda metrics as an example, you could divide the Errors metric by the Invocations metric to get an error rate. Then, add the resulting time series to a graph on your CloudWatch dashboard.
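
As a minimal sketch of the error-rate example, the following Python builds the list of queries in the shape that the CloudWatch GetMetricData API accepts. The function name my-function, the query IDs, and the five-minute period are illustrative assumptions, not values from this post.

```python
def build_error_rate_queries(function_name, period=300):
    """Return MetricDataQueries that divide Lambda Errors by Invocations."""

    def lambda_metric(query_id, metric_name):
        # Input series: fetched for the expression, but not returned itself.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "FunctionName", "Value": function_name}
                    ],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return [
        lambda_metric("errors", "Errors"),
        lambda_metric("invocations", "Invocations"),
        {
            # The metric math expression refers to the inputs by their Id.
            "Id": "error_rate",
            "Expression": "errors / invocations",
            "Label": "Error rate",
            "ReturnData": True,
        },
    ]


# "my-function" is a hypothetical Lambda function name.
queries = build_error_rate_queries("my-function")
```

If you use boto3, you could pass this list as the MetricDataQueries argument of the CloudWatch client’s get_metric_data call, together with a StartTime and an EndTime.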

The following diagram illustrates how to combine different metrics to create a composite metric:

Fig 1 : Solution architecture combines several metrics to create a holistic utilization metric and aggregate alarm.

The typical steps in this process are:

Choose two or more metrics, such as CPU, memory, and network utilization.
Use metric math expressions and functions to combine the time series.
Include the composite metric in CloudWatch dashboards, alerts, and autoscale policies.
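
The steps above can be sketched as a dashboard widget definition. This example builds a dashboard body that hides two input metrics and graphs only their combination. It makes several assumptions: the CloudWatch agent publishes mem_used_percent in the CWAgent namespace with an AutoScalingGroupName dimension, the group is called my-asg, and a simple average of CPU and memory stands in for whatever composite suits your workload.

```python
import json


def build_dashboard_body(asg_name, region="us-east-1"):
    """Return a dashboard body (JSON string) with one composite-metric widget."""
    dims = ["AutoScalingGroupName", asg_name]
    widget = {
        "type": "metric",
        "x": 0,
        "y": 0,
        "width": 12,
        "height": 6,
        "properties": {
            "title": "Composite utilization",
            "region": region,
            "view": "timeSeries",
            "stat": "Average",
            "period": 300,
            "metrics": [
                # Step 1: choose the input metrics (hidden on the graph).
                ["AWS/EC2", "CPUUtilization", *dims,
                 {"id": "cpu", "visible": False}],
                # Assumption: memory comes from the CloudWatch agent.
                ["CWAgent", "mem_used_percent", *dims,
                 {"id": "mem", "visible": False}],
                # Step 2: combine them with a metric math expression.
                [{"expression": "(cpu + mem) / 2",
                  "label": "Average utilization",
                  "id": "util"}],
            ],
        },
    }
    # Step 3: the widget list is what a dashboard stores.
    return json.dumps({"widgets": [widget]})


# "my-asg" is a hypothetical Auto Scaling group name.
body = build_dashboard_body("my-asg")
```

With boto3, the resulting string could be supplied as the DashboardBody argument of put_dashboard.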

Approach

Suppose that you have an Amazon EC2 Auto Scaling group (ASG) containing Amazon EC2 instances. These resources post a custom metric called MyPendingTransactions, representing the in-flight work. You must scale this ASG in response to high CPU utilization, or when the pending transactions exceed a threshold. In that case, you can use metric math’s IF function and OR operator to create a binary signal that scaling is necessary.
Here is an example ScaleUpRequested metric expression that implements this pattern:

IF(CPU > 70, 1, 0) OR IF(MyPendingTransactions > 75, 1, 0)

If the CPU utilization is above 70%, then the metric math expression will return 1.
If MyPendingTransactions is above 75, then the metric math expression will return 1.
Otherwise, the metric math expression will return 0.
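
To make the three cases concrete, here is a small local simulation of the expression in Python. It is only a sketch of the logic, not how CloudWatch evaluates metric math: it assumes the two series are already aligned point by point and contain no missing data, and the sample values are invented for illustration.

```python
def scale_up_requested(cpu, pending, cpu_limit=70, pending_limit=75):
    """Evaluate the ScaleUpRequested logic for two aligned series."""
    signal = []
    for cpu_value, pending_value in zip(cpu, pending):
        # IF(CPU > 70, 1, 0)
        cpu_high = 1 if cpu_value > cpu_limit else 0
        # IF(MyPendingTransactions > 75, 1, 0)
        backlog_high = 1 if pending_value > pending_limit else 0
        # OR: 1 when either condition holds, otherwise 0.
        signal.append(1 if (cpu_high or backlog_high) else 0)
    return signal


# Invented sample data: CPU percent and pending transaction counts.
cpu_series = [35.0, 82.5, 40.0, 91.0]
pending_series = [10, 20, 120, 300]
print(scale_up_requested(cpu_series, pending_series))  # [0, 1, 1, 1]
```

The first point is below both limits, the second breaches only the CPU limit, the third breaches only the backlog limit, and the fourth breaches both.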

You can visualize the expression using CloudWatch metrics, as illustrated in the following diagram:

Fig 2 : CloudWatch metric math expression that combines several metrics into one composite metric.

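Next, define a CloudWatch alarm based on the Combined metric. Use 1 for the threshold, which is the value that the expression returns when any condition is met. For a Combined expression that applies the same pattern to CPU, memory, and network utilization, the alarm will transition into the ALARM state when the CPU utilization is above 70%, the memory utilization is above 75%, or the network utilization is above 80%. Configure the CloudWatch alarm to send a notification or increase the desired ASG capacity.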

Fig 3 : CloudWatch metric math alarm specifications.

Note that Amazon EC2 doesn’t publish a memory utilization metric by default. Therefore, use the CloudWatch agent to collect memory metrics from your instances.
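
The alarm definition can also be expressed in code. The following sketch builds the parameters for a metric math alarm on the ScaleUpRequested expression from earlier in this section. The MyApp namespace, the AutoScalingGroupName dimension on the custom metric, the one-minute period, the three evaluation periods, and the missing-data setting are all assumptions chosen for illustration.

```python
def build_scale_up_alarm(asg_name, action_arns, period=60):
    """Return PutMetricAlarm parameters for a ScaleUpRequested alarm."""

    def metric(query_id, namespace, name):
        # Input series referenced by Id inside the expression.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": namespace,
                    "MetricName": name,
                    "Dimensions": [
                        {"Name": "AutoScalingGroupName", "Value": asg_name}
                    ],
                },
                "Period": period,
                "Stat": "Average",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{asg_name}-scale-up-requested",
        # The expression returns 0 or 1, so alarm when it reaches 1.
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Threshold": 1.0,
        "EvaluationPeriods": 3,
        "TreatMissingData": "notBreaching",
        "AlarmActions": list(action_arns),
        "Metrics": [
            metric("cpu", "AWS/EC2", "CPUUtilization"),
            # Assumption: the custom metric lives in a "MyApp" namespace.
            metric("pending", "MyApp", "MyPendingTransactions"),
            {
                "Id": "scale_up_requested",
                "Expression": "IF(cpu > 70, 1, 0) OR IF(pending > 75, 1, 0)",
                "Label": "ScaleUpRequested",
                "ReturnData": True,  # the alarm watches only this series
            },
        ],
    }


# Hypothetical group name; supply your own SNS topic or policy ARNs.
params = build_scale_up_alarm("my-asg", action_arns=[])
```

With boto3, you could create the alarm by unpacking this dictionary into the CloudWatch client’s put_metric_alarm call. Only the expression has ReturnData set to True, because an alarm evaluates a single time series.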

Additional use cases

You can use this technique in several related scenarios. Let’s briefly examine other use cases that you’ll encounter.

Simplifying autoscaling

Amazon EC2 Auto Scaling makes sure that you have enough Amazon EC2 instances to run your application. You can create an Auto Scaling group that contains a collection of EC2 instances. Besides instance utilization metrics, you can drive scaling actions with metrics from other AWS services, such as the total requests to an Elastic Load Balancing (ELB) load balancer or pending-message counters such as the Amazon Simple Queue Service (Amazon SQS) ApproximateNumberOfMessagesVisible metric.
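
One way to combine a queue metric with group size is a target tracking policy that scales on the backlog per instance. The sketch below builds such a configuration with a metric math expression. The queue name my-queue, the group name my-asg, and the target of 100 messages per instance are assumptions, and the GroupInServiceInstances metric is only available when group metrics are enabled on the ASG.

```python
def build_backlog_tracking_config(queue_name, asg_name, target_per_instance):
    """Return a target tracking configuration based on backlog per instance."""
    return {
        "CustomizedMetricSpecification": {
            "Metrics": [
                {
                    # Messages waiting in the queue.
                    "Id": "m1",
                    "MetricStat": {
                        "Metric": {
                            "Namespace": "AWS/SQS",
                            "MetricName": "ApproximateNumberOfMessagesVisible",
                            "Dimensions": [
                                {"Name": "QueueName", "Value": queue_name}
                            ],
                        },
                        "Stat": "Sum",
                    },
                    "ReturnData": False,
                },
                {
                    # Instances currently in service (needs group metrics).
                    "Id": "m2",
                    "MetricStat": {
                        "Metric": {
                            "Namespace": "AWS/AutoScaling",
                            "MetricName": "GroupInServiceInstances",
                            "Dimensions": [
                                {"Name": "AutoScalingGroupName", "Value": asg_name}
                            ],
                        },
                        "Stat": "Average",
                    },
                    "ReturnData": False,
                },
                {
                    # Composite metric: backlog per instance.
                    "Id": "e1",
                    "Expression": "m1 / m2",
                    "Label": "Backlog per instance",
                    "ReturnData": True,
                },
            ]
        },
        "TargetValue": float(target_per_instance),
    }


# Hypothetical names and target value.
config = build_backlog_tracking_config("my-queue", "my-asg", 100)
```

With boto3, this dictionary could be passed as the TargetTrackingConfiguration argument of the Auto Scaling client’s put_scaling_policy call, with PolicyType set to TargetTrackingScaling.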

Simplifying observability

You can build CloudWatch dashboards to monitor and visualize your AWS resources, such as EC2 instances. Observability involves examining a system’s outputs to understand not only its performance, but also its stability. A CloudWatch dashboard can help you visualize system performance and interpret metrics for your AWS services and workloads. The dashboard can provide a single view of your resources and aggregate information across your deployment.

Simplifying monitoring alerts

Instead of creating a separate alarm for each metric, you can consolidate alerts using CloudWatch metric math functions. One alarm on a composite expression is simpler to manage than many individual alarms. There’s also a potential cost saving, as standard alarms are $0.10 per month versus composite alarms, which are $0.50 per month.

Deploying the solution

You can deploy an example implementation of this pattern into your account. Choose the appropriate AWS CloudFormation stack to provision the solution in your AWS account in your preferred Region:

N. Virginia (us-east-1)

Oregon (us-west-2)

After selecting the appropriate Launch stack button, the CloudFormation console will prompt you through the automated deployment.

Cleanup

The example template will cost under $1 per month, and your account might be eligible for the free-tier pricing structure. To prevent accruing additional charges in your AWS account, delete the CloudFormation stack that you provisioned by navigating to the CloudFormation console. If you created other test resources, then remove those as well.

Conclusion

In this post, you learned how to combine CloudWatch metrics for monitoring, scaling, and observing resources using metric math expressions. You can apply this pattern to several use cases to simplify automation and lower operational complexity. Furthermore, metric math provides advanced functions for scenarios like running summations, filling missing data points, and measuring variability over time. Lastly, you can use CloudWatch anomaly detection with your composite metric. This approach lets you continuously analyze system and application metrics, determine normal baselines, and surface anomalies with minimal user intervention.

Authors:

AWS Cloud Operations & Migrations Blog