AWS Cloud Operations & Migrations Blog

How to set up Amazon CloudWatch alarms to monitor the I/O performance of Amazon EBS volumes using metric math

To prevent application or database performance issues caused by disk latency, it is critical to monitor disk I/O and usage. Disk I/O is the number of read and write (input/output) operations that occur during a period; in other words, it measures the data transfer rate between a disk and memory (RAM) on the server. Disk I/O wait is the percentage of time your processors spend waiting on the disk, which generally indicates an I/O bottleneck. This can be caused by disks taking more time to respond to I/O requests, leading to higher latencies. Many enterprises experience a negative impact on the performance of their applications due to latencies in the storage layer when running their workloads. Hence, it is important to monitor disk I/O performance and take proactive action before the impact grows.

In this blog post, we show you how to use Amazon CloudWatch alarms to monitor your disk I/O latency and IOPS for performance.

Customers generally turn to Amazon EBS to create volumes on which they store the persistent data for their databases and applications. The performance of those volumes is critical to the applications or databases using them. Amazon EBS offers solid state drive (SSD)-backed volumes, which are optimal for transactional workloads involving frequent read/write operations with small I/O size, where the dominant performance attribute is IOPS. SSD-backed volume types include General Purpose SSD (gp2 and gp3) and Provisioned IOPS SSD (io1 and io2) volumes.

Some of the key metrics that are necessary to be monitored on Amazon EBS volumes are:

  • Total IOPS (Input/Output Operations Per Second): The total number of read and write operations in a specified period.
  • Total Latency: The amount of time taken to complete a read or write of a block.
  • Queue Length: The number of read and write operation requests waiting to be completed in a specified period. An alarm on queue length is especially relevant for Provisioned IOPS SSD volumes.

Amazon CloudWatch monitors your AWS resources and the applications you run on AWS in real time. To measure the performance of your resources and applications, you can use CloudWatch to collect and track metrics, which are time series that represent the value of a performance indicator over time. You can create alarms that watch metrics and, based on thresholds you define, send notifications to engage an operator or trigger automated remediation when a performance issue occurs. Amazon CloudWatch provides built-in monitoring for all EBS volumes on the metrics listed on this EBS documentation page. CloudWatch also provides the metric math feature, which enables you to query multiple CloudWatch metrics and use math expressions to create new time series based on those metrics.
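Before creating any alarms, you can experiment with metric math directly from the AWS CLI using the get-metric-data command. The following is a minimal sketch that sums read and write operations for one volume and divides by the period to approximate IOPS; the volume ID, Region, and time range are placeholders you would replace with your own values.

aws cloudwatch get-metric-data \
--region 'us-east-1' \
--start-time '2023-01-01T00:00:00Z' \
--end-time '2023-01-01T01:00:00Z' \
--metric-data-queries '[{"Id":"e1","Label":"Total IOPS","Expression":"(m1+m2)/60"},{"Id":"m1","ReturnData":false,"MetricStat":{"Metric":{"Namespace":"AWS/EBS","MetricName":"VolumeReadOps","Dimensions":[{"Name":"VolumeId","Value":"volxxxxxx"}]},"Period":60,"Stat":"Average"}},{"Id":"m2","ReturnData":false,"MetricStat":{"Metric":{"Namespace":"AWS/EBS","MetricName":"VolumeWriteOps","Dimensions":[{"Name":"VolumeId","Value":"volxxxxxx"}]},"Period":60,"Stat":"Average"}}]'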

In this post, we provide steps to set up CloudWatch alarms on the EBS metrics listed above so that you are automatically notified of performance issues on your EBS volumes before they cause a bigger impact. For example, suppose an application engineer introduces an ETL job that performs a high number of I/O operations daily during business hours. This causes disk latency to increase, which impacts regular database operations and degrades application performance for customers. An alarm on I/O latency helps identify the issue, which can then be mitigated by rescheduling the job to off-business hours or by configuring the appropriate volume type with the right IOPS and throughput for the use case.

Formulae and Thresholds for EBS Volume I/O Metrics

When defining alarms for EBS volumes, consider the following I/O thresholds and formulae:

  • For Total IOPS: To define the total IOPS threshold, consider the IOPS provisioned for the volume and the IOPS your application can drive without impacting performance (see the examples in Step 1).

Formula: Total IOPS per period = (Sum(VolumeReadOps) + Sum(VolumeWriteOps)) / Period

  • For Total Latency: To define the total latency threshold on an EBS volume, consider the maximum latency your application can tolerate without impacting performance.

Formula: Total Latency in ms = ((Sum(VolumeTotalReadTime) + Sum(VolumeTotalWriteTime)) × 1000) / (Sum(VolumeReadOps) + Sum(VolumeWriteOps))

  • For Queue Length: For Provisioned IOPS SSD volumes, consider an average queue length (rounded to the nearest whole number) of one for every 1000 provisioned IOPS in a minute. More details on the calculation can be found in this link.

Formula: Optimal Queue Length = Provisioned IOPS / 1000
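As a quick illustration with hypothetical numbers (a 300-second period, 210,000 read operations, 90,000 write operations, 135 seconds of combined read and write time, and a volume provisioned with 4,000 IOPS):

Total IOPS = (210,000 + 90,000) / 300 = 1,000 IOPS
Total Latency = (135 × 1000) / (210,000 + 90,000) = 0.45 ms
Optimal Queue Length = 4,000 / 1000 = 4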

Metric Math in CloudWatch

To create the alarms in CloudWatch for the EBS metrics defined above using the AWS Command Line Interface (AWS CLI), we are going to use the put-metric-alarm command.

In this blog, we are creating alarms for EBS IOPS, latency, and queue length, for which we use the metrics option of the put-metric-alarm command. Metrics is an array of MetricDataQuery structures that enables you to create an alarm based on the result of a metric math expression. Each item in the Metrics array either retrieves a metric or performs a math expression. For more details on the parameters of the metrics structure, refer to the CloudWatch API documentation.

In this blog, we will be focusing on the following parameters when defining alarms for EBS volumes; a minimal example entry tying them together follows the list.

  • Expression: A metric math expression to be performed on the returned data.
  • MetricName: The name of the metric. For example, to calculate latency, one of the metrics needed is VolumeTotalReadTime.
  • Dimensions: A dimension is a name/value pair that is part of the identity of a metric. For example, a dimension with the name VolumeId holds the ID of the volume that we need to create the alarm on.
  • Period: The granularity, in seconds, of the returned data points; that is, the length of time, in seconds, over which the specified metric is evaluated.
  • Stat: The statistic to return, for example, Average, Minimum, or Maximum.
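As a minimal sketch, a single entry in the Metrics array that retrieves one metric looks like the following; the volume ID is a placeholder, and the complete structures used in this post appear in Steps 2 and 3.

{
  "Id": "m1",
  "ReturnData": false,
  "MetricStat": {
    "Metric": {
      "Namespace": "AWS/EBS",
      "MetricName": "VolumeTotalReadTime",
      "Dimensions": [ { "Name": "VolumeId", "Value": "volxxxxxx" } ]
    },
    "Period": 60,
    "Stat": "Average"
  }
}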

Prerequisites

For this walkthrough, the following prerequisites are necessary:

  • An AWS account with permissions to create CloudWatch alarms
  • The AWS Command Line Interface (AWS CLI) installed and configured
  • An Amazon EBS volume whose I/O metrics you want to monitor
  • An Amazon SNS topic to receive the alarm notifications

CloudWatch Alarm Setup Walkthrough

Step 1: Define thresholds for alerting

Based on the performance that is expected of the application or database, you need to define the thresholds for alerts.

Example for IOPS threshold:

For example, based on historical trends, the application or database IOPS usage is observed to be between 6,500 and 7,500 IOPS. If an EBS Provisioned IOPS SSD volume is created with a maximum of 8,000 IOPS, you can define the warning threshold at around 90% of that value, that is 7,200 IOPS, and the critical threshold at 95% of that value, that is 7,600 IOPS. If it is a mission-critical application where you expect a maximum of 6,000 IOPS and an increase in IOPS would lead to other concurrency waits on the database, then you can set lower thresholds, with the warning threshold at around 70% (about 5,600 IOPS) and the critical threshold at around 75% (about 6,000 IOPS).

In this post, we assume a threshold of 95% for the IOPS alarm.

Example for latency threshold:

In this use case, based on historical trends, you observe an average volume latency of about 2 ms and a maximum latency of 6 ms for your application. If the latency goes above 4 ms, the application starts to slow down with acceptable customer impact, and if it crosses 6 ms, there is a significant impact on performance, with many webpages starting to time out. For this use case, we can choose a warning threshold of 3 ms and a critical threshold of 4 ms. This is an example; threshold limits should be set based on each application and other business needs.

Example for queue length threshold:

The application uses Provisioned IOPS SSD EBS volumes that are created with 8,000 IOPS. Based on the AWS documentation for Provisioned IOPS SSD EBS volumes, you should consider an average queue length of one for every 1000 provisioned IOPS in a minute.

In this example, we can set the alarm for queue length at 8 (which is derived from 8000/1000).

Step 2: Create the EBS latency alarm using the AWS CLI

EBS latency is calculated using the formula below:

Total Latency in ms = ((Sum(VolumeTotalReadTime) + Sum(VolumeTotalWriteTime)) × 1000) / (Sum(VolumeReadOps) + Sum(VolumeWriteOps))

The metrics parameter allows us to implement the latency formula and is therefore key to creating the latency alarm. Let us dissect the JSON document for metrics parameter below to understand how to structure it.

"Metrics": [
{
"Id": "e1", # ID for expression which is used to calculate latency
"Label": "Total Latency", # Label for the metric
"ReturnData": true,
"Expression": "(m3+m4)/(m1+m2)*1000" # Formula to calculate latency using id for each Metric below
},
{
"Id": "m3", # ID for calculating one metric in the formula, in this example VolumeTotalReadTime
"ReturnData": false,
"MetricStat": {
"Metric": {
"Namespace": "AWS/EBS", # The namespace for the metric associated with EBS
"MetricName": "VolumeTotalReadTime", # Metric name of VolumeTotalReadTime
"Dimensions": [
{
"Name": "VolumeId", # It is describing to look for the VolumeId
"Value": "volxxxxxxxx" # It is describing to look for the VolumeId with value volxxxxxxx
}
]
},
"Period": 60, # 1-minute datapoint, if you need this to be 5 minutes, it changes to 300
"Stat": "Average" # Statistic of Average is used, it can be Maximum or Minimum based on your use-case.
}
},
{
"Id": "m4", # ID for calculating one metric in the formula, in this example VolumeTotalWriteTime
"ReturnData": false,
"MetricStat": {
"Metric": {
"Namespace": "AWS/EBS",
"MetricName": "VolumeTotalWriteTime",
"Dimensions": [
{
"Name": "VolumeId",
"Value": " volxxxxxxxx "
}
]
},
"Period": 60,
"Stat": "Average"
}
},
{
"Id": "m1", # ID for calculating one metric in the formula, in this example VolumeTotalReadOps
"ReturnData": false,
"MetricStat": {
"Metric": {
"Namespace": "AWS/EBS",
"MetricName": "VolumeReadOps",
"Dimensions": [
{
"Name": "VolumeId",
"Value": " volxxxxxxxx "
}
]
},
"Period": 60,
"Stat": "Average"
}
},
{
"Id": "m2", # ID for calculating one metric in the formula, in this example VolumeTotalWriteOps
"ReturnData": false,
"MetricStat": {
"Metric": {
"Namespace": "AWS/EBS",
"MetricName": "VolumeWriteOps",
"Dimensions": [
{
"Name": "VolumeId",
"Value": "volxxxxxxxx"
}
]
},
"Period": 60,
"Stat": "Average"
}
}
]

Now that we understand how the metric math is defined for EBS latency, let's look at an example of creating a CloudWatch alarm for EBS latency using the AWS CLI with a period of 60 seconds. Make sure to edit the alarm name, alarm description, SNS topic ARN, evaluation periods, datapoints to alarm, threshold, comparison operator, Region, and the VolumeId, Stat, and Period values inside the metrics JSON based on your EBS volume details and requirements.

/usr/local/bin/aws cloudwatch put-metric-alarm \
--alarm-name 'Critical Latency Alarm for volume id volxxxxxx' \
--alarm-description 'This is a critical alert for latency of EBS volume with volume id volxxxxxx exceeding the threshold of 3ms' \
--actions-enabled \
--alarm-actions 'arn:aws:sns:us-east-1:xxxxxx:Critical_Alerts_NOC_Team' \
--evaluation-periods '5' \
--datapoints-to-alarm '5' \
--threshold '3' \
--comparison-operator 'GreaterThanThreshold' \
--treat-missing-data 'missing' \
--region 'us-east-1' \
--metrics '[{"Id":"e1","Label":"Total Latency","ReturnData":true,"Expression":"(m3+m4)/(m1+m2)*1000"},{"Id":"m3","ReturnData":false,"MetricStat":{"Metric":{"Namespace":"AWS/EBS","MetricName":"VolumeTotalReadTime","Dimensions":[{"Name":"VolumeId","Value":"volxxxxxx"}]},"Period":60,"Stat":"Average"}},{"Id":"m4","ReturnData":false,"MetricStat":{"Metric":{"Namespace":"AWS/EBS","MetricName":"VolumeTotalWriteTime","Dimensions":[{"Name":"VolumeId","Value":"volxxxxxx"}]},"Period":60,"Stat":"Average"}},{"Id":"m1","ReturnData":false,"MetricStat":{"Metric":{"Namespace":"AWS/EBS","MetricName":"VolumeReadOps","Dimensions":[{"Name":"VolumeId","Value":"volxxxxxx"}]},"Period":60,"Stat":"Average"}},{"Id":"m2","ReturnData":false,"MetricStat":{"Metric":{"Namespace":"AWS/EBS","MetricName":"VolumeWriteOps","Dimensions":[{"Name":"VolumeId","Value":"volxxxxxx"}]},"Period":60,"Stat":"Average"}}]'

Step 3: Create the EBS IOPS alarm using the AWS CLI

EBS IOPS is calculated using the formula below:

Total IOPS per period = (Sum(VolumeReadOps) + Sum(VolumeWriteOps)) / Period

The key component for creating the IOPS alarm is the metrics parameter, which implements the IOPS formula. Let us dissect the JSON document for the metrics parameter below to understand how to structure it.

"Metrics": [
{
"Id": "e1", #ID for expression which is used to calculate latency
"Label": "Total IOPS ( ReadOps + WriteOps )", # Label for metric
"ReturnData": true,
"Expression": "(m1+m2)/60" # Formula to calculate IOPS using id for each Metric below
},
{
"Id": "m1", #ID for calculating one metric in the formula, in this example VolumeReadOps
"ReturnData": false,
"MetricStat": {
"Metric": {
"Namespace": "AWS/EBS",
"MetricName": "VolumeReadOps", #metric name of VolumeReadOps
"Dimensions": [
{
"Name": "VolumeId", #It is describing to look for the VolumeId
"Value": "volxxxxxxxx" #It is describing to look for the VolumeId with value volxxxxxxx
}
]
},
"Period": 60, #1-minute datapoint, if you need this to be 5 minutes, it changes to 300
"Stat": "Average" #Statistic of Average is used, it can be Maximum or Minimum based on your use-case.
}
},
{
"Id": "m2", # ID for calculating one metric in the formula, in this example VolumeWriteOps
"ReturnData": false,
"MetricStat": {
"Metric": {
"Namespace": "AWS/EBS",
"MetricName": "VolumeWriteOps", # metric name of VolumeWriteOps
"Dimensions": [
{
"Name": "VolumeId",
"Value": "volxxxxxxxx"
}
]
},
"Period": 60,
"Stat": "Average"
}
}
]

Now that we understand how the metric math is defined for EBS IOPS, let's look at an example of creating a CloudWatch alarm for EBS IOPS using the AWS CLI with a period of 60 seconds. Make sure to edit the alarm name, alarm description, SNS topic ARN, evaluation periods, datapoints to alarm, threshold, comparison operator, Region, and the VolumeId, Stat, and Period values inside the metrics JSON based on your EBS volume details and requirements.

/usr/local/bin/aws cloudwatch put-metric-alarm \
--alarm-name 'Critical IOPS Alarm for volume id volxxxxxx' \
--alarm-description 'This is a critical alert for IOPS of EBS volume with volume id volxxxxxx exceeding the threshold of 95%' \
--actions-enabled \
--alarm-actions 'arn:aws:sns:us-east-1:xxxxx:Critical_Alerts_NOC_Team' \
--evaluation-periods '5' \
--datapoints-to-alarm '5' \
--threshold '7600' \
--comparison-operator 'GreaterThanThreshold' \
--treat-missing-data 'missing' \
--region 'us-east-1' \
--metrics '[{"Id":"e1","Label":"Total IOPS ( ReadOps + WriteOps )","ReturnData":true,"Expression":"(m1+m2)/60"},{"Id":"m1","ReturnData":false,"MetricStat":{"Metric":{"Namespace":"AWS/EBS","MetricName":"VolumeReadOps","Dimensions":[{"Name":"VolumeId","Value":"volxxxxxx"}]},"Period":60,"Stat":"Average"}},{"Id":"m2","ReturnData":false,"MetricStat":{"Metric":{"Namespace":"AWS/EBS","MetricName":"VolumeWriteOps","Dimensions":[{"Name":"VolumeId","Value":"volxxxxxx"}]},"Period":60,"Stat":"Average"}}]'

Step 4: Create the EBS queue length alarm using the AWS CLI

To create a CloudWatch alarm for queue length, you can use the VolumeQueueLength metric that CloudWatch publishes for EBS volumes directly, without metric math. An example of creating this alarm is shown below. Make sure to edit the alarm name, alarm description, SNS topic ARN, evaluation periods, period, threshold, Region, and volume ID based on your requirements.

/usr/local/bin/aws cloudwatch put-metric-alarm \
--alarm-name 'Critical Queue Length Alarm for volume id volxxxxxx' \
--alarm-description 'This is a critical alert for queue length of EBS volume with volume id volxxxxxx exceeding the threshold of 8' \
--actions-enabled \
--alarm-actions 'arn:aws:sns:us-east-1:xxxxxx:Critical_Alerts_NOC_Team' \
--evaluation-periods '5' \
--period '300' \
--threshold '8' \
--comparison-operator 'GreaterThanThreshold' \
--treat-missing-data 'missing' \
--region 'us-east-1' \
--metric-name 'VolumeQueueLength' \
--namespace 'AWS/EBS' \
--statistic 'Average' \
--dimensions Name=VolumeId,Value=volxxxxxx
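After creating the three alarms, you can list them together with describe-alarms using a name prefix. This is a minimal sketch, assuming the alarm names used in this post all begin with "Critical".

aws cloudwatch describe-alarms \
--alarm-name-prefix 'Critical' \
--region 'us-east-1'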

Clean up

To avoid incurring additional costs, make sure to clean up the CloudWatch alarms you created by following the steps in this link.
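If you prefer the AWS CLI, the following is a minimal sketch that deletes the alarms created in this walkthrough, assuming the alarm names used above.

aws cloudwatch delete-alarms \
--alarm-names 'Critical Latency Alarm for volume id volxxxxxx' 'Critical IOPS Alarm for volume id volxxxxxx' 'Critical Queue Length Alarm for volume id volxxxxxx' \
--region 'us-east-1'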

Summary

A proactive monitoring and alerting mechanism is essential to prevent situations where performance bottlenecks negatively impact your application. EBS volume latency, IOPS, and queue length are key metrics to monitor so that you can troubleshoot potential issues without delay. Metric math helps you monitor these metrics efficiently with CloudWatch alarms.

About the authors


Saumya Mula

Saumya Mula is a Senior Database Consultant with the Professional Services team at Amazon Web Services. She provides overall guidance on database migrations from on premises to AWS, along with automation, cost management, and performance tuning of critical production systems for Amazon customers.


Prashanth Ramaswamy

Prashanth Ramaswamy is a Senior Database Consultant with the Professional Services team at Amazon Web Services. Prashanth focuses on leading database migration efforts to AWS, as well as providing technical guidance, including cost optimization, monitoring, and modernization expertise, to Amazon customers.


Jeemy Patel

Jeemy Patel is a Database Consultant with the Professional Services team at Amazon Web Services. Jeemy helps customers with migrations to AWS and performance optimization, as well as providing technical guidance on various disaster recovery solutions for Amazon customers.