Automating Amazon CloudWatch dashboard creation for Amazon EBS volume KPIs

Enterprises can benefit significantly from optimizing block storage performance in the cloud. One primary benefit is faster, more reliable data access for critical business operations and real-time applications, achieved through lower latency and higher throughput. Another is cost savings: operating storage more efficiently reduces the need for additional, higher-performance storage capacity. As the amount of data generated and stored in the cloud continues to grow, you must monitor your block storage performance by tracking key performance indicators (KPIs). Doing so lets you identify performance and cost optimization opportunities, and make sure that your block storage remains performant and efficient.

Amazon Elastic Block Store (EBS) is an easy-to-use, scalable, high-performance block-level storage service designed for use with Amazon Elastic Compute Cloud (EC2) instances. Choosing the appropriate EBS volume type, using EBS-optimized instances, and regularly monitoring performance can help prevent the need for frequent volume scaling. Amazon CloudWatch provides near real-time monitoring and visualization of various metrics in the AWS/EBS namespace. By regularly monitoring these metrics, you can detect performance issues and take appropriate actions to resolve them. The default metrics, such as VolumeReadOps/VolumeWriteOps and VolumeReadBytes/VolumeWriteBytes, give a basic understanding of volume performance, and they can also be used to calculate more advanced metrics. These include latency, total input/output operations per second (IOPS), and total throughput, which are considered KPIs for monitoring storage. Monitoring these advanced metrics helps you visualize the performance of an EBS volume, so that you can make appropriate changes to the volume’s performance configuration in the AWS Management Console.

In this post, we provide a solution that automates the calculation of these advanced metrics and generates a CloudWatch dashboard with all metrics in a single view. The dashboard provides a unified view of all performance KPIs, which you can use to monitor the historical performance of EBS volumes and to spot early warning signs of unforeseen issues. It helps you understand trends in the workload and take appropriate and timely actions to resolve performance bottlenecks. This can save time and costs in the long run. Furthermore, the dashboard is easy to share for faster communication during operational events.

Solution overview

The solution presented in this post is a quick way to create a custom CloudWatch dashboard. Automating the creation of this dashboard removes the manual and cumbersome task of adding expressions to calculate the KPIs. The solution creates a CloudWatch dashboard using AWS CloudFormation, which is a service for provisioning resources by treating infrastructure as code.

The dashboard contains twelve widgets that graph the following performance metrics:

Latency is the average time for an I/O operation to complete, measured in milliseconds. There are widgets for Read and Write Latency.
Throughput is the average amount of data transferred to or from an EBS volume in a certain amount of time, measured in MiB/s. There are widgets for Read, Write, and Total Throughput.
IOPS is the average number of read or write operations that an EBS volume performs per second. There are widgets for Read, Write, and Total IOPS.
IO Size is the average size of an I/O operation, measured in KiB. There are widgets for Read and Write IO Size.
Volume Idle Time is the total number of seconds in a specified period of time when no read or write operations were submitted to the EBS volume.
Queue Length is the number of read and write operation requests waiting to be completed in a specified period of time.

Some of these metrics are available in the AWS/EBS namespace by default, while the others are calculated as shown in the documentation.
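As a rough illustration of how the calculated KPIs relate to the raw metrics, the following Python sketch derives latency, IOPS, throughput, and I/O size from per-period Sum values. The PeriodSums class and derive_kpis function are hypothetical names introduced here, and the formulas are the commonly used ones (for example, latency as total time divided by operation count); they are an assumption and may differ in detail from the expressions in the CloudFormation template.

```python
from dataclasses import dataclass


@dataclass
class PeriodSums:
    """Sum statistics of raw AWS/EBS metrics for one volume over one period."""
    read_ops: float          # VolumeReadOps (count)
    write_ops: float         # VolumeWriteOps (count)
    read_bytes: float        # VolumeReadBytes (bytes)
    write_bytes: float       # VolumeWriteBytes (bytes)
    total_read_time: float   # VolumeTotalReadTime (seconds)
    total_write_time: float  # VolumeTotalWriteTime (seconds)


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when there were no operations in the period."""
    return numerator / denominator if denominator else 0.0


def derive_kpis(s: PeriodSums, period_seconds: int = 60) -> dict:
    """Derive the dashboard-style KPIs from one period of raw sums."""
    mib = 1024 * 1024
    return {
        # Latency: time spent per operation, converted to milliseconds.
        "read_latency_ms": _ratio(s.total_read_time, s.read_ops) * 1000,
        "write_latency_ms": _ratio(s.total_write_time, s.write_ops) * 1000,
        # IOPS: operations per second, averaged over the whole period.
        "read_iops": s.read_ops / period_seconds,
        "write_iops": s.write_ops / period_seconds,
        "total_iops": (s.read_ops + s.write_ops) / period_seconds,
        # Throughput: MiB per second, averaged over the whole period.
        "read_throughput_mibps": s.read_bytes / period_seconds / mib,
        "write_throughput_mibps": s.write_bytes / period_seconds / mib,
        "total_throughput_mibps": (s.read_bytes + s.write_bytes) / period_seconds / mib,
        # I/O size: KiB per operation.
        "read_io_size_kib": _ratio(s.read_bytes, s.read_ops) / 1024,
        "write_io_size_kib": _ratio(s.write_bytes, s.write_ops) / 1024,
    }
```

Volume Idle Time and Queue Length are not derived here because they are published directly in the AWS/EBS namespace as VolumeIdleTime and VolumeQueueLength.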

Screenshot of the full CloudWatch dashboard generated by the CloudFormation Stack.

Exhibit: CloudWatch dashboard for EBS volumes

These metrics are important to monitor because they report on the performance, efficiency, and responsiveness of the applications running on an Amazon EC2 instance, using Amazon EBS for storage. For example, high IOPS and low latency can help make sure that database operations are fast and responsive, while high throughput can help make sure that large data transfers are completed quickly. On the other hand, performance bottlenecks and higher latencies could result in application slowness or timeouts. Therefore, monitoring these metrics regularly can help identify performance issues.

To gain a high-level understanding of your EBS volume performance, review the Total IOPS and Total Throughput widgets. Additionally, it may be helpful to review read and write metrics separately to understand whether the I/O workload is read or write intensive.
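To make the idea of a widget backed by a calculated metric more concrete, here is a minimal Python sketch that builds a CloudWatch dashboard body containing a single Total IOPS widget. It uses the documented dashboard body structure and the PERIOD metric math function. The function names, widget layout, and example volume ID are illustrative assumptions; the post does not show the template's actual widget definitions.

```python
import json


def total_iops_widget(volume_id: str, region: str, period: int = 60) -> dict:
    """Build one metric widget that graphs (read ops + write ops) per second."""
    return {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 6,
        "properties": {
            "title": "Total IOPS",
            "region": region,
            "view": "timeSeries",
            "period": period,
            "stat": "Sum",
            "metrics": [
                # Metric math: sum of operations divided by the period length.
                [{"expression": "(m1 + m2) / PERIOD(m1)",
                  "label": "Total IOPS", "id": "e1"}],
                # Raw inputs are hidden so only the expression is graphed.
                ["AWS/EBS", "VolumeReadOps", "VolumeId", volume_id,
                 {"id": "m1", "visible": False}],
                ["AWS/EBS", "VolumeWriteOps", "VolumeId", volume_id,
                 {"id": "m2", "visible": False}],
            ],
        },
    }


def dashboard_body(volume_id: str, region: str) -> str:
    """Serialize a one-widget dashboard body as a JSON string."""
    return json.dumps({"widgets": [total_iops_widget(volume_id, region)]})


# Hypothetical volume ID and Region, for illustration only.
print(dashboard_body("vol-0123456789abcdef0", "us-east-1"))
```

A JSON string of this shape is what a dashboard resource in a CloudFormation template, or the CloudWatch PutDashboard API, takes as its dashboard body.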

Walkthrough

To deploy the solution, you must complete the following steps to create a CloudFormation stack and generate a custom CloudWatch dashboard:

Create a CloudFormation stack.
Specify stack details.
Configure stack options.
Review and launch stack.
Generate the CloudWatch dashboard.
(Optional) Set up CloudWatch alarms.

Step 1: Create a CloudFormation stack

Select this template to load the stack directly from Amazon Simple Storage Service (Amazon S3).
On the Create Stack page in the CloudFormation console, choose Next to proceed.

Screenshot of CloudFormation Console for creating a CloudFormation stack.

Step 2: Specify stack details

On the Specify stack details page, enter a stack name in the Stack name field.
In the Parameters section, there are two parameters to specify. For VolumeID1, select the volume ID of the primary EBS volume that you want to analyze. VolumeID2 is an optional parameter where you can specify the volume ID of another EBS volume in the same AWS Region. This is helpful if you want to analyze two different volumes or compare their performance.
Choose Next to proceed.

Console screenshot for specifying CloudFormation stack details.

Step 3: Configure stack options

In this step, you can specify additional stack options, such as tags and permissions. These are not required for this solution. However, you can refer to your organization’s policies and choose appropriate options.
Choose Next to proceed.

Step 4: Review and launch

On this page, review the details of your stack. If you must change any of the values before launching the stack, choose Edit on the appropriate section to go back to the relevant page.
After you review the stack creation settings, choose Create stack to launch your stack.
While your stack is being created, it is listed on the Stacks page with a status of CREATE_IN_PROGRESS.
After your stack has been successfully created, its status changes to CREATE_COMPLETE.
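If you prefer to script Steps 1 through 4 instead of using the console, a sketch along the following lines could work with the AWS SDK for Python (boto3). The parameter keys VolumeID1 and VolumeID2 come from the walkthrough above. The function names are hypothetical, the template URL must be supplied by you because it is not reproduced in this post, and omitting VolumeID2 assumes that the template defines a default for it.

```python
def build_create_stack_kwargs(stack_name, template_url, volume_id1, volume_id2=None):
    """Assemble the keyword arguments for CloudFormation's create_stack call."""
    parameters = [{"ParameterKey": "VolumeID1", "ParameterValue": volume_id1}]
    if volume_id2:
        # Optional second volume; assumes the template has a default otherwise.
        parameters.append({"ParameterKey": "VolumeID2", "ParameterValue": volume_id2})
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        "Parameters": parameters,
    }


def deploy(kwargs):
    """Create the stack and block until it reaches CREATE_COMPLETE."""
    import boto3  # imported lazily so the helper above has no dependencies

    cfn = boto3.client("cloudformation")
    cfn.create_stack(**kwargs)
    # The waiter polls the stack status and raises if creation fails.
    cfn.get_waiter("stack_create_complete").wait(StackName=kwargs["StackName"])
```

Calling deploy requires AWS credentials with permission to create CloudFormation stacks and CloudWatch dashboards.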

Step 5: Generate the CloudWatch dashboard

On the Stacks page, select the stack name that you just created, and choose the Resources tab to view the CloudWatch dashboard.
Select the link under the Physical ID column, which contains the Stack name followed by “_Dashboard”.
On the Dashboards page, select the appropriate time frame to analyze the metrics.

Screenshot of the Dashboards page where you can configure time frame.

(Optional) Step 6: Set up CloudWatch alarms

Hover over the metric for which you wish to set the CloudWatch alarm, and expand the widget by clicking on the expand icon at the upper right corner.
In the expanded widget, select View in metrics.
In the CloudWatch metrics console, select the bell icon next to each graphed metric that you want to set an alarm for.

Screenshot that shows the CloudWatch metrics console
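An alarm on a calculated metric can also be created programmatically. The following sketch builds the arguments for the CloudWatch PutMetricAlarm API, alarming when average read latency stays above a threshold. The function names, default alarm name, evaluation settings, and latency expression are illustrative assumptions rather than values taken from the template.

```python
def read_latency_alarm_kwargs(volume_id, threshold_ms, alarm_name=None):
    """Build put_metric_alarm arguments for a metric math read-latency alarm."""
    dimensions = [{"Name": "VolumeId", "Value": volume_id}]

    def raw(metric_id, metric_name):
        # A raw AWS/EBS metric that feeds the expression but is not alarmed on.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EBS",
                    "MetricName": metric_name,
                    "Dimensions": dimensions,
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": alarm_name or f"{volume_id}-read-latency",
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": threshold_ms,
        "EvaluationPeriods": 5,
        # Periods without reads yield no datapoint; do not treat them as breaches.
        "TreatMissingData": "notBreaching",
        "Metrics": [
            raw("m1", "VolumeTotalReadTime"),
            raw("m2", "VolumeReadOps"),
            # Exactly one entry returns data: the expression the alarm watches.
            {"Id": "e1", "Expression": "(m1 / m2) * 1000",
             "Label": "Read latency (ms)", "ReturnData": True},
        ],
    }


def create_alarm(kwargs):
    """Create or update the alarm using boto3."""
    import boto3  # imported lazily so the builder above has no dependencies

    boto3.client("cloudwatch").put_metric_alarm(**kwargs)
```

In practice you would also add an alarm action, such as an Amazon SNS topic, so that someone is notified when the alarm fires.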

Adapting the solution for micro-bursting volumes

If your application is experiencing micro-bursting related issues, then you can use this version of the CloudFormation template. You can create the stack using the same steps listed in the Walkthrough section above.

What is micro-bursting?

Micro-bursting occurs when an application sends a high number of I/O requests within a few seconds, in a burst that lasts less than a minute. The burst is obscured in CloudWatch graphs because the smallest interval (period) is 1 minute.

For example, consider an application sending 12,000 I/O requests to an io2 volume within a 10 second period, and no I/O requests for the remaining 50 seconds within a 1-minute interval. On CloudWatch, the average IOPS value is calculated as:

12,000 I/O operations / 60 seconds = 200 IOPS

In reality, the rate during the 10 seconds of activity should have been calculated as:

12,000 I/O operations / 10 seconds = 1,200 IOPS

For an io2 volume with a provisioned performance of 1,000 IOPS, this can be dismissed as a non-issue because the average IOPS from CloudWatch metrics shows 200 IOPS for the 1-minute interval. However, the application actually drove 1,200 IOPS, which exceeds the provisioned value. This can result in increased queue length and higher latencies for the application, and the volume's performance could be throttled at its provisioned IOPS value for the duration of the high I/O.

In this scenario, to get a better understanding of your application, you can use the same template to calculate the micro-burst metrics for IOPS and throughput.
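The arithmetic from the example can be sketched in a few lines of Python. The approach shown here, dividing the operation count by the active time (the period minus VolumeIdleTime), is one common way to estimate burst IOPS. It is an assumption that the template uses the same approach, and the function names are hypothetical.

```python
def average_iops(total_ops: float, period_seconds: float) -> float:
    """IOPS averaged over the whole period, as a standard graph would show."""
    return total_ops / period_seconds


def estimated_burst_iops(total_ops: float, period_seconds: float,
                         idle_seconds: float) -> float:
    """IOPS over only the time the volume was actually busy."""
    active_seconds = period_seconds - idle_seconds
    if active_seconds <= 0:
        return 0.0  # the volume was idle for the entire period
    return total_ops / active_seconds


# 12,000 operations in a 1-minute period, 50 seconds of which were idle.
print(average_iops(12_000, 60))              # 200.0
print(estimated_burst_iops(12_000, 60, 50))  # 1200.0
```

Comparing the second value against the volume's provisioned IOPS shows whether a burst exceeded it, even when the per-minute average looks low.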

CloudWatch dashboard generated by the CloudFormation Stack for micro bursting use case.

Exhibit: Amazon CloudWatch dashboard for micro-bursting EBS volumes

Create a custom CloudWatch dashboard to visualize Amazon EBS performance metrics for EC2 instances

You can adapt the solution to monitor the aggregate Amazon EBS metrics of all volumes attached to an EC2 instance. All EC2 instances have a limited bandwidth for I/O traffic between EC2 instances and EBS volumes. For EBS-optimized instances, this bandwidth is dedicated to Amazon EBS I/O traffic, while for older generation instance types this bandwidth is shared with other Amazon EC2 traffic.

For Nitro instances, there are Amazon EBS metrics in the AWS/EC2 namespace that can be used to calculate Amazon EBS KPIs, such as IOPS and throughput, at the instance level and to help identify instance-level bottlenecks. Amazon EBS metrics in the AWS/EC2 namespace are available in 5-minute intervals by default. However, you can enable detailed monitoring to get metrics at 1-minute granularity.

If you want to analyze these metrics in a custom CloudWatch dashboard, then you can use this version of the CloudFormation template and follow the same steps listed in the Walkthrough section above. When deploying the stack, use the resource IDs for the EC2 instances in the parameters section.

Here is a list of metrics available in the custom Amazon EC2 dashboard:

Throughput is the average amount of data transferred to or from all EBS volumes attached to the EC2 instance, measured in MiB/s. There are widgets for Read, Write, and Total Throughput.
IOPS is the average number of read or write operations per second across all EBS volumes attached to the instance. There are widgets for Read, Write, and Total IOPS.
EBS IO Balance % is the percentage of I/O credits remaining in the burst bucket.
EBS Byte Balance % is the percentage of throughput credits remaining in the burst bucket.

Note that these metrics are only available for instances built on the Nitro System. Additionally, EBS IO Balance % and EBS Byte Balance % metrics are only available for basic monitoring for some 4xlarge instance sizes or smaller. These instances can sustain their maximum performance for 30 minutes at least once every 24 hours.
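The instance-level IOPS and throughput values follow the same pattern as the volume-level ones, except that the period depends on the monitoring mode. The sketch below assumes per-period Sum values of the EBSReadOps, EBSWriteOps, EBSReadBytes, and EBSWriteBytes metrics from the AWS/EC2 namespace. The instance_kpis function is a hypothetical helper, not part of the template.

```python
MIB = 1024 * 1024


def instance_kpis(ebs_read_ops: float, ebs_write_ops: float,
                  ebs_read_bytes: float, ebs_write_bytes: float,
                  detailed_monitoring: bool = False) -> dict:
    """Convert per-period sums into per-second IOPS and MiB/s for an instance."""
    # Basic monitoring reports every 5 minutes, detailed monitoring every minute.
    period_seconds = 60 if detailed_monitoring else 300
    return {
        "read_iops": ebs_read_ops / period_seconds,
        "write_iops": ebs_write_ops / period_seconds,
        "total_iops": (ebs_read_ops + ebs_write_ops) / period_seconds,
        "read_throughput_mibps": ebs_read_bytes / period_seconds / MIB,
        "write_throughput_mibps": ebs_write_bytes / period_seconds / MIB,
        "total_throughput_mibps":
            (ebs_read_bytes + ebs_write_bytes) / period_seconds / MIB,
    }


# Example: one 5-minute datapoint under basic monitoring.
print(instance_kpis(30_000, 15_000, 300 * MIB, 150 * MIB))
```

Using the wrong period, for example dividing a 5-minute sum by 60, overstates IOPS and throughput by a factor of five, so it is worth confirming which monitoring mode the instance uses.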

Screenshot of the full CloudWatch dashboard generated by the CloudFormation Stack for EC2 Instance use case.

Exhibit: Amazon CloudWatch dashboard for aggregate EBS metrics at EC2 instance level

Cleaning up

To delete the dashboard, simply delete the CloudFormation stack by following instructions in the documentation.

Conclusion

In this post, we presented a solution that automates the creation of a comprehensive CloudWatch dashboard containing all the key performance metrics for Amazon EBS volumes. This solution can save you time and effort while helping you analyze KPIs such as latency, IOPS, and throughput for EBS volumes in a single view. You can monitor these KPIs to identify areas for improvement, enable informed decision-making to increase storage efficiency, and facilitate communication and collaboration by sharing KPI data with stakeholders during unexpected operational events.

Feel free to leave any questions or feedback in the comments section. Thank you for reading, and happy cloud computing!