Monitoring and understanding Amazon EBS performance using Amazon CloudWatch
Storage and compute are the main pillars of many different types of applications, making them important to monitor and understand when optimizing or developing an application for peak performance. Amazon EBS is an easy-to-use, scalable, high-performance block-storage service designed for Amazon EC2. EBS is the main type of storage used by applications for high performance transaction-based use cases. Because applications depend on storage performance and stability, it is important to monitor EBS volumes. Common factors impacting Amazon EBS performance include a lack of free storage space, volume-type performance limits, and latency.
Many popular applications can collect and analyze these metrics in detail. In this blog, we focus on simple AWS features that can be used to quickly implement effective monitoring, along with alerts and automated remediation, to achieve peak reliability.
Monitoring Amazon EC2 and Amazon EBS
Amazon EC2 allows different types of metrics and logs to be collected, viewed, and analyzed. The metrics cover the EC2 instance, storage, network, and even the application level. They are easily gathered and viewed with Amazon CloudWatch.
Let’s summarize the different types of information available in CloudWatch. There are three complementary sets of data that provide visibility on storage performance available for Amazon EC2:
- The first set of data is part of the instance metrics generated by EC2. By default, Amazon EC2 provides instance metrics every 5 minutes. These metrics are available through Amazon CloudWatch. This data includes storage performance, and there is no charge for it. Detailed monitoring of the instance can be enabled to collect metrics more frequently, at up to 1 minute intervals. Enabling detailed monitoring has no cost, but use of the metrics with Amazon CloudWatch is charged as per Amazon CloudWatch pricing. All of these metrics come from the EC2 data collected at the hypervisor level and are published under the AWS/EC2 namespace. The full list of available metrics is in the EC2 documentation. The storage metrics are for the instance store. In addition, for EC2 Nitro-based instances, additional EBS storage metrics are also available.
- The second set of data is custom metrics from the OS level, which can be collected at a more granular level, up to 1 second intervals. This is done by enabling the CloudWatch agent in the EC2 instance. The CloudWatch agent is versatile: it can collect system-level metrics as well as logs from EC2 instances and from on-premises systems. The CloudWatch agent can run on Windows, Linux, or macOS, and on x86-64 or ARM64 architectures. It can also collect metrics and logs from custom applications or services using the StatsD and collectd protocols. This data can be seen in Amazon CloudWatch under a custom namespace.
- Apart from the above two, the third set of data is the EBS CloudWatch metrics. This data is generated by the EBS storage and is published under the AWS/EBS namespace. It comes from the EBS subsystem of EC2, unlike the previous data sets, which are seen through the lens of the hypervisor or OS. This data is reported per individual volume, while the previous data is aggregated at the instance level for instance store and EBS storage. Consequently, this data set provides deeper insight into usage and issues, including latency, at the volume level. The granularity of this data is 1 minute and there is no charge.
In summary, the EC2 metrics and CloudWatch agent metrics provide information on health and performance at the application and systems level. The EBS CloudWatch metrics provide performance details specific to EBS volumes. Since EBS volumes are the mainstay for high-performance applications, viewing these data sets together can help you correlate application and storage performance. The data is especially helpful for identifying IOPS bottlenecks and latencies. Amazon CloudWatch can also be used to create thresholds that identify issues with the monitored metrics. This function can be automated to raise alerts and trigger remediation.
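As an illustration of the second data set, here is a minimal sketch of a CloudWatch agent configuration that collects disk space and disk I/O metrics from the OS. The namespace name "Custom/EBSDemo" and the helper name build_agent_config are our own assumptions for this example; the measurement names are ones the agent documents for Linux, and you should check them against the agent documentation for your platform before use.

```python
import json


def build_agent_config(namespace="Custom/EBSDemo", interval_seconds=60):
    """Return a CloudWatch agent config (as a dict) for disk and disk I/O metrics.

    The namespace is a hypothetical example value. An interval below 60
    seconds produces high-resolution metrics, which are billed as custom metrics.
    """
    return {
        "metrics": {
            "namespace": namespace,
            "metrics_collected": {
                # Free-space monitoring: percentage of each filesystem in use.
                "disk": {
                    "measurement": ["used_percent"],
                    "resources": ["*"],
                    "metrics_collection_interval": interval_seconds,
                },
                # OS-level view of I/O activity per block device.
                "diskio": {
                    "measurement": ["reads", "writes", "io_time"],
                    "resources": ["*"],
                    "metrics_collection_interval": interval_seconds,
                },
            },
        }
    }


if __name__ == "__main__":
    # Print JSON that could be saved as the agent configuration file.
    print(json.dumps(build_agent_config(), indent=2))
```

The printed JSON is intended to be saved as the agent configuration file and loaded when the agent starts.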
Detailed EBS monitoring
Let’s drill down further into the EBS metrics and collect specific data to shed light on key performance metrics.
Measure storage latency
We can measure Amazon EBS storage latency using the metrics VolumeTotalReadTime and VolumeTotalWriteTime. We use a formula to plot the average time spent per IO operation so that changes, especially peaks, stand out and help isolate the cause of latencies.
The following steps show how to measure storage latency:
- In the Amazon EC2 console, click “Volumes” in the left navigation pane to see a list of the EBS volumes and the instances they are attached to. Note the “Volume ID” of the volume that you wish to monitor. For this example, select one volume.
- In the Amazon CloudWatch console, click “All Metrics” in the left navigation pane to see the metrics and namespaces available for monitoring. Click the box named “EBS” under “AWS Namespaces”. Next, click “Per-Volume Metrics” to see a table of all your EBS volumes and the metrics available for each volume. Enter the Volume ID from step 1 in the search bar at the top of the table to filter the view down to the eight metrics (rows) for that single volume.
- Click on the checkbox to the left of “Volume ID” to select all eight metrics. This action will select the metrics to display in the CloudWatch graph.
- Click the “Graphed Metrics” tab above the table to view a graph of the selected metrics.
- In the lower pane of this screen, you can change the period to 1 minute and set the statistic for each metric based on the EBS metrics documentation.
- Then create a new metric based on a math expression using the formula (VolumeTotalReadTime + VolumeTotalWriteTime) / (VolumeReadOps + VolumeWriteOps).
- Name the new metric (for example Vol1-IOTime).
- The graph shows the average time spent per IO. If you identify significant changes or high peaks in this graph, they could be the cause of bottlenecks in your application performance. Correlate the timestamps of these peaks with the other metrics you collect on your application performance, to find when and how these latencies impact your application.
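The console steps above can also be sketched in code. The following is a minimal example, assuming the Sum statistic for all four metrics and a 1-minute period; the function names and the query IDs m1 to m4 are our own choices for illustration. The list returned by latency_queries has the shape expected by the MetricDataQueries parameter of the CloudWatch GetMetricData API, and average_io_time applies the same formula to raw sums.

```python
def latency_queries(volume_id, period=60):
    """Build GetMetricData queries for the average time spent per IO on one volume."""

    def stat(query_id, metric_name):
        # One per-volume metric from the AWS/EBS namespace, summed over each period.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EBS",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "VolumeId", "Value": volume_id}],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,  # only the math expression is returned
        }

    return [
        stat("m1", "VolumeTotalReadTime"),
        stat("m2", "VolumeTotalWriteTime"),
        stat("m3", "VolumeReadOps"),
        stat("m4", "VolumeWriteOps"),
        {
            "Id": "e1",
            "Expression": "(m1 + m2) / (m3 + m4)",
            "Label": "Vol1-IOTime",
            "ReturnData": True,
        },
    ]


def average_io_time(read_time, write_time, read_ops, write_ops):
    """Apply the same formula locally; returns seconds per operation, or None if idle."""
    total_ops = read_ops + write_ops
    if total_ops == 0:
        return None  # no IO in this period, so the average is undefined
    return (read_time + write_time) / total_ops
```

With boto3 installed and credentials configured, the query list could be passed as cloudwatch.get_metric_data(MetricDataQueries=latency_queries("vol-..."), StartTime=..., EndTime=...). Because the time metrics are reported in seconds, multiply the result by 1000 to read it in milliseconds per operation.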
Note that VolumeTotalReadTime and VolumeTotalWriteTime are not supported with Multi-Attach enabled volumes. For other volumes, they can be used to measure disk latency. This walkthrough uses static thresholds rather than anomaly detection, so base the thresholds on the observed high and low averages. Setting them 5 to 10% above the average high and average low helps ensure a good threshold for the application.
Click on the action icon for e1 (the math expression) to create a threshold and an action based on that threshold.
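To make the 5 to 10% guidance concrete, here is a small arithmetic sketch. It assumes you have already read an average high (or average low) value off the latency graph; the function name threshold_band is hypothetical, and the margins are simply the 5% and 10% mentioned above.

```python
def threshold_band(observed_average, low_margin=0.05, high_margin=0.10):
    """Return (lower, upper) candidate thresholds 5% and 10% above an observed average."""
    if observed_average < 0:
        raise ValueError("observed_average must be non-negative")
    lower = observed_average * (1 + low_margin)
    upper = observed_average * (1 + high_margin)
    return lower, upper


if __name__ == "__main__":
    # Example: an average high of 2.0 ms per IO read from the graph (illustrative value).
    low, high = threshold_band(2.0)
    print(f"Candidate alarm thresholds: {low:.2f} ms to {high:.2f} ms")
```

Any value inside the returned band is a reasonable starting point; tune it after observing how often the alarm fires for your workload.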
Measure the number of disk operations queued
The metric that we use to measure the number of disk operations queued is VolumeQueueLength. This metric reflects the amount of traffic coming into storage and indicates how long the application waits for storage IO operations to complete. A spike in this metric can negatively impact the performance of your storage operations, and your application can suffer.
To monitor this metric, choose a target volume queue length based on your workload and volume type. You can then create a CloudWatch alarm on VolumeQueueLength using the Average statistic.
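As a sketch of such an alarm, the function below builds the keyword arguments for the CloudWatch PutMetricAlarm API. The alarm name pattern, the five evaluation periods, and the choice to treat missing data as not breaching are our own assumptions for illustration; the threshold itself must come from your workload and volume type.

```python
def queue_length_alarm(volume_id, threshold, period=60, evaluation_periods=5):
    """Build PutMetricAlarm arguments for the average VolumeQueueLength of one volume."""
    return {
        "AlarmName": f"ebs-queue-length-{volume_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EBS",
        "MetricName": "VolumeQueueLength",
        "Dimensions": [{"Name": "VolumeId", "Value": volume_id}],
        "Statistic": "Average",
        "Period": period,
        # Require several consecutive periods so a single short burst does not alarm.
        "EvaluationPeriods": evaluation_periods,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        # An idle or detached volume may report no data; do not alarm on that.
        "TreatMissingData": "notBreaching",
    }
```

With boto3 available, these arguments could be applied with boto3.client("cloudwatch").put_metric_alarm(**queue_length_alarm("vol-...", 4)), where the threshold of 4 is only an example value.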
Measure IOPS and set up alerts when IOPS crosses a threshold limit
Amazon EBS provides a choice of different volume types – gp2, gp3, io1, io2 and the newly released io2 Block Express, for use depending on the performance needs of applications. EBS storage pricing is based on the storage type, location (AWS Region), space used as well as the maximum performance levels (IOPS) expected. Customers select the IOPS levels based on their needs and can change the EBS volume type as their application needs change, thus optimizing the storage costs.
You may find that as your application traffic and IO operations increase, performance decreases. If the IO operations exceed the IOPS limit that you have set for a volume, the application will wait longer for the IO operations to complete. The latency graph from the “Measure storage latency” section will show spikes, and the application may suffer from performance degradation. We recommend that you monitor the IOPS levels to ensure they remain below the configured limits. If there is a risk that the limits will be exceeded, alerts should be sent so that the limits can be raised to what the application requires.
Two EBS metrics – VolumeReadOps and VolumeWriteOps – are used to measure the number of read and write operations on a volume. You can create a CloudWatch alarm to send alerts if these metrics are approaching the IOPS limits for your volume.
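Add VolumeReadOps and VolumeWriteOps to the graph with the Sum statistic, then change the SUM expression to (m1+m2)/60 to get the IOPS, where m1 and m2 are the IDs of those two metrics. EBS publishes these metrics at 1-minute granularity, so choose a period of 1 minute for granular data and divide by the 60 seconds in that period. Next, click on the action icon for e1, which takes you to the next page to set up the metric and condition.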
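The conversion from per-period operation counts to IOPS can be sketched as follows. The function names are our own, and the IDs m1 and m2 are assumed to refer to VolumeReadOps and VolumeWriteOps with the Sum statistic.

```python
def iops(read_ops_sum, write_ops_sum, period_seconds=60):
    """Average IO operations per second over one metric period."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return (read_ops_sum + write_ops_sum) / period_seconds


def iops_expression(period_seconds=60, read_id="m1", write_id="m2"):
    """Build the matching CloudWatch math expression, e.g. (m1+m2)/60."""
    return f"({read_id}+{write_id})/{period_seconds}"


if __name__ == "__main__":
    # Illustrative sums for a single 1-minute period.
    print(iops(240000, 120000))
    print(iops_expression())
```

If you graph with a different period, the divisor must change to match it; dividing a 5-minute sum by 60 would overstate IOPS by a factor of five.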
Set the “Define threshold value” field according to the IOPS configured for your volume and its usage. For example, if the current configured value is 8000 and you need to get an alert at 90%, then set the threshold value to 7200.
Based on the above alert, you can send a notification to an existing Amazon SNS topic or create a new topic. Additionally, you can enable an Auto Scaling action or a Systems Manager action, or even create a support ticket with the AWS Support team, based on the alarm state.
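Putting the threshold and the notification together, here is a minimal sketch of an IOPS alarm defined on the math expression. It builds keyword arguments for the CloudWatch PutMetricAlarm API; the alarm name, the single evaluation period, and the SNS topic ARN passed in are assumptions for this example, not values from the walkthrough.

```python
def alert_threshold(provisioned_iops, percent=90):
    """Threshold at a percentage of the configured IOPS (8000 at 90% gives 7200)."""
    return provisioned_iops * percent / 100


def iops_alarm(volume_id, provisioned_iops, sns_topic_arn, percent=90, period=60):
    """Build PutMetricAlarm arguments that alert when IOPS nears the configured limit."""

    def stat(query_id, metric_name):
        # Per-volume operation count, summed over each period.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EBS",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "VolumeId", "Value": volume_id}],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"ebs-iops-{volume_id}",  # hypothetical naming scheme
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "EvaluationPeriods": 1,
        "Threshold": alert_threshold(provisioned_iops, percent),
        "AlarmActions": [sns_topic_arn],  # notify this SNS topic in ALARM state
        "Metrics": [
            stat("m1", "VolumeReadOps"),
            stat("m2", "VolumeWriteOps"),
            {
                "Id": "e1",
                "Expression": f"(m1+m2)/{period}",
                "Label": "IOPS",
                "ReturnData": True,  # the alarm evaluates this expression
            },
        ],
    }
```

With boto3 available, the result could be applied with boto3.client("cloudwatch").put_metric_alarm(**iops_alarm(...)). The topic ARN must refer to an SNS topic that already exists in your account.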
Amazon CloudWatch is a monitoring tool that provides metrics and visibility for Amazon EC2 instances, Amazon EBS storage, the network, and even the application. Common factors impacting performance are often overlooked because they are not being monitored for the conditions that cause latency. In this blog, we have shown you how to use CloudWatch to measure storage latency, the number of disk operations queued, and IOPS. We have also demonstrated how you can set up alerts when IOPS cross a user-defined threshold limit. Monitoring EBS storage with Amazon CloudWatch is a useful mechanism to ensure your application’s performance.