Uncover new performance insights using Amazon EBS detailed performance statistics

As businesses increasingly rely on latency-sensitive applications for mission-critical workloads, understanding performance across the entire technology stack is essential to swiftly resolving bottlenecks that could affect application efficiency. Because storage performance and stability directly impact application efficiency, reliability, scalability, and user experience, organizations need to be able to collect and analyze detailed telemetry at the storage layer. Observability at the storage layer complements traditional application and OS monitoring, helping businesses ensure robust performance, stability, and fast issue resolution across their infrastructure.

Amazon Elastic Block Store (Amazon EBS) is an easy-to-use block storage service that many users rely on as the main storage for high-performance, transactional workloads, such as databases, enterprise applications, and latency-sensitive operations. These workloads demand consistent and reliable storage performance, so observability for EBS volumes is crucial. Real-time monitoring gives you visibility into key performance metrics, such as latency, throughput, and IOPS, allowing you to detect and address potential bottlenecks or issues proactively.

In this post, we discuss how to use Amazon EBS detailed performance statistics, a set of new EBS volume metrics that provide sub-minute granularity, to help you gain real-time visibility into your EBS volume performance. You can access these statistics directly from the Amazon EBS NVMe device attached to your Amazon Elastic Compute Cloud (Amazon EC2) instance and use them to monitor I/O performance at the storage level. We also provide examples of how to use these statistics to quickly assess EBS volume health and identify performance bottlenecks, which helps improve both the reliability and performance of your applications.

Solution overview

Using the new Amazon EBS detailed performance statistics at the instance level, we propose a solution that enhances observability and troubleshooting capabilities for latency-sensitive applications running on EC2 Nitro instances. We use the new ebsnvme script to collect high-frequency statistics on I/O operations, latency, and queue length, enabling proactive troubleshooting.

As examples of how to use these granular metrics, this solution demonstrates how to validate the responsiveness of EBS volumes, so that you can quickly notice any I/O interruptions. Furthermore, this solution helps you identify storage performance bottlenecks, which can be used to optimize the EBS volume and EC2 instance configurations for your workloads.

Prerequisites

This solution involves setting up an EC2 Nitro instance and an attached EBS volume to access detailed performance statistics for the EBS volume. This is a setup you likely already have if using Amazon EC2. To deploy the required components, you must complete the following steps:

1. Launch an EC2 Nitro instance (or use an existing Nitro instance), and connect to it, for example over SSH.

2. Identify the NVMe device associated with the EBS volume for which you wish to query the Amazon EBS stats. For example, you can run the following nvme-cli command to list all NVMe devices on the instance.

$ sudo nvme list

The following is an example output of the list command that lists the NVMe devices on the instance and their volume IDs.

[Figure: example output of the nvme list command, showing the NVMe devices on the instance and their volume IDs]

In this demonstration, assume that the EBS volume used by your application has the ID vol-02b51b6b2cb16aab1, so the NVMe device associated with it is /dev/nvme1n1.

3. Copy the ebsnvme script onto the EC2 instance (download it from this GitHub link if necessary).

4. Run the ebsnvme script, with the correct permissions, and pass the device as a parameter. The returned output looks like the following figure (JSON output can be retrieved by providing the -j or --json parameter to the script):

$ sudo ./ebsnvme stats /dev/nvme1n1 --json
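If you want to consume the JSON output from a program rather than read it in the terminal, the following is a minimal Python sketch. It assumes the script sits at ./ebsnvme, that sudo does not prompt for a password, and that the --json output is a single JSON document on stdout. The function name read_ebs_stats and the runner parameter are hypothetical and exist only to keep the example easy to test.

```python
import json
import subprocess


def read_ebs_stats(device, script="./ebsnvme", runner=subprocess.run):
    """Run `ebsnvme stats <device> --json` and return the parsed JSON.

    Assumptions (not confirmed by this post): the script lives at `script`,
    sudo needs no password prompt, and stdout holds one JSON document.
    """
    cmd = ["sudo", script, "stats", device, "--json"]
    # check=True raises CalledProcessError if the command exits non-zero.
    result = runner(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

A call such as read_ebs_stats("/dev/nvme1n1") would then return the statistics as a Python object that you can store or compare with a later sample.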

Alternatively, the --interval or -i parameter can be used with the ebsnvme script to poll the stats at the provided interval, in seconds.

$ sudo ./ebsnvme stats /dev/nvme1n1 --interval 15

The following is a section of example NVMe log output. It shows cumulative read/write operations, cumulative bytes, the time spent processing the read/write operations (in microseconds), and the number of microseconds during which the workload attempted to exceed the provisioned IOPS or throughput limits of the EBS volume or the EC2 instance.

[Figure: example NVMe log output]

Also included in the following figures are read and write I/O latency histograms, with each row representing the total number of I/O operations completed so far within a specific bin of time (in microseconds).

[Figure: read and write I/O latency histograms]

These statistics are presented as cumulative counters up to the time at which the command is executed. The command can be run at the desired interval, for example, every 15 seconds, with each subsequent output reflecting the updated cumulative totals for the metrics. Calculating the difference in the statistics across the last two outputs allows you to derive insight into the volume performance profile over the given 15 second period.
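As a minimal sketch of that subtraction, the following Python snippet computes per-interval deltas from two cumulative snapshots. The key names total_read_ops and total_write_ops are hypothetical stand-ins, not the script's confirmed JSON field names; the counter values are the ones used in Scenario 1 of this post. Deltas are only meaningful for cumulative counters, not for point-in-time values such as Queue Length.

```python
def counter_deltas(previous, current):
    """Return current - previous for every numeric counter present in both snapshots."""
    return {
        key: current[key] - previous[key]
        for key in current
        if key in previous
        and isinstance(current[key], (int, float))
        and isinstance(previous[key], (int, float))
    }


# Hypothetical key names; values are two samples taken about 15 seconds apart.
first = {"total_read_ops": 340086743, "total_write_ops": 340051578}
second = {"total_read_ops": 340174856, "total_write_ops": 340139785}

print(counter_deltas(first, second))
# {'total_read_ops': 88113, 'total_write_ops': 88207}
```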

Deriving insights from the Amazon EBS detailed performance statistics

Now that you have set up monitoring with the detailed performance statistics, we can demonstrate the different ways in which you can use them.

As mentioned in the preceding section, you can use the detailed statistics to view I/O latency histograms to observe the spread of I/O latency within the period. Furthermore, you can use the read/write operations and time spent statistics to calculate the average latency. Using the detailed statistics allows you to view the average latency at a sub-minute granularity.
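The average latency calculation can be sketched as follows: divide the increase in time spent by the increase in completed operations between two samples. The function name and the sample numbers are illustrative only and are not taken from a real volume.

```python
def average_latency_us(prev_ops, curr_ops, prev_time_us, curr_time_us):
    """Average per-I/O latency in microseconds between two cumulative samples.

    Returns None when no operations completed, because the average is undefined.
    """
    ops = curr_ops - prev_ops
    if ops <= 0:
        return None
    return (curr_time_us - prev_time_us) / ops


# Illustrative numbers: 2,000 operations took 1,200,000 us in total.
print(average_latency_us(1_000, 3_000, 500_000, 1_700_000))  # 600.0
```

Apply the same calculation separately to the read and write counters to get average read latency and average write latency for the interval.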

Here are two other methods for you to use the detailed statistics to shed light on key performance metrics. In Scenario 1, you monitor Amazon EBS I/O performance to determine if an EBS volume is not responding to I/O operations. In Scenario 2, you track the performance statistics to detect when workloads exceed the provisioned performance limits of your EC2 instances or EBS volumes, which may result in elevated latency.

Scenario 1: Identifying unresponsive state of an EBS volume

In this scenario, we discuss how to use Amazon EBS detailed performance statistics to observe when an EBS volume isn’t responding to I/O operations, allowing you to take timely actions as needed. If you observe multiple intervals where your volume is unresponsive, then you can either wait for AWS to resolve the issue, or you can take actions, such as replacing the affected volume or stopping and restarting the instance to which the volume is attached. In most cases, when your volume becomes unresponsive, Amazon EBS automatically diagnoses and recovers your volume within a few minutes.

To identify if your volume is unresponsive, you can use the following steps to determine whether I/O is disrupted on your volume:

1. Choose the EBS volume’s NVMe device to troubleshoot
2. Collect stats for the device at the desired intervals
3. Compare the stats to check if the EBS volume is unresponsive

Step 1: Choose the EBS volume’s NVMe device to troubleshoot

1. Use the nvme list command, as described in the prerequisites, to identify the NVMe device associated with the EBS volume on the instance.

Step 2: Collect stats for the device at the desired intervals

1. Collect the Amazon EBS detailed performance statistics directly from the device by using the ebsnvme command:

$ sudo ebsnvme stats /dev/nvme1n1

Step 3: Compare the stats to check if the EBS volume is unresponsive

1. From the output, consider the following three fields for this scenario: Total Read Ops, Total Write Ops, and Queue Length.

[Figure: ebsnvme stats output showing Total Read Ops, Total Write Ops, and Queue Length]

2. Issue the same ebsnvme command after a desired interval (for example, after 15 seconds), so that you can compare how the Total Read Ops and Total Write Ops counters have progressed at the Amazon EBS level.

[Figure: output of the same ebsnvme command issued after the interval]

3. From the detailed performance statistics collected approximately 15 seconds apart, we make the following key observations:

- Total Read Ops increased from 340086743 to 340174856, indicating 88113 Read operations completed in the 15 second span.
- Total Write Ops increased from 340051578 to 340139785, indicating 88207 Write operations completed in the 15 second span.
- Queue Length stayed non-zero between 9 and 11, indicating that the application was issuing I/Os to the EBS volume. If you see a gradual increase in the Queue Length, then it would reflect a buildup in queued I/Os.

This shows that the EBS volume is still completing the I/Os that it receives, which rules out the EBS volume as the source of the observed degradation in application performance. If we had seen an increase in the Queue Length along with 0 Read/Write Ops processed during the period, then it would reflect an unresponsive EBS volume.
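The check described in this scenario can be sketched in Python as follows. The key names are hypothetical, and the heuristic is slightly looser than the wording above: it flags any interval with zero completed operations while the queue is non-zero, rather than requiring the queue to grow. The first pair of samples uses the counter values from this scenario, with the queue lengths 9 and 11 assigned to the two samples as an assumption; the stalled sample is made up.

```python
def looks_unresponsive(prev, curr):
    """Flag an interval with no completed I/O while requests are still queued."""
    completed = (curr["read_ops"] - prev["read_ops"]) + (
        curr["write_ops"] - prev["write_ops"]
    )
    return completed == 0 and curr["queue_length"] > 0


# Counter values from the two samples in this scenario (hypothetical key names).
before = {"read_ops": 340086743, "write_ops": 340051578, "queue_length": 9}
after = {"read_ops": 340174856, "write_ops": 340139785, "queue_length": 11}
print(looks_unresponsive(before, after))  # False: 176320 I/Os completed

# Made-up stalled case: counters are frozen while the queue keeps growing.
stalled = {"read_ops": 340174856, "write_ops": 340139785, "queue_length": 32}
print(looks_unresponsive(after, stalled))  # True
```

In practice you would likely require several consecutive flagged intervals before acting, since an idle application also produces zero completed operations, in which case the queue length stays at zero.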

If you would like to validate your mechanisms of identifying unresponsive EBS volumes, please refer to the Conducting chaos engineering experiments on Amazon EBS using AWS Fault Injection Service blog post, which walks through how to set up an AWS Fault Injection Service Pause I/O experiment.

Scenario 2: Identifying bottlenecks in storage performance

Amazon EBS detailed performance statistics can also be used to configure the appropriate performance characteristics for your EBS volume and EC2 instance based on the performance needs of your application. The EBS Volume Performance Exceeded and EC2 Instance EBS Performance Exceeded statistics indicate the duration for which your workload consistently attempted to drive IOPS or throughput that is greater than your volume or your instance’s provisioned performance in a given period. Exceeding either the volume’s or instance’s provisioned performance can result in elevated latency on your workload. For this scenario, consider the same application as the one used in Scenario 1.

Complete the following steps to check if EBS volume performance is correctly provisioned:

1. Select the EBS volume’s NVMe device to check
2. Collect stats for the device at the desired intervals
3. Compare the stats to check if the EBS volume is exceeding provisioned performance

Step 1: Select the EBS volume’s NVMe device to check

1. This step is the same as Step 1 discussed previously in Scenario 1.

Step 2: Collect stats for the device at the desired intervals

1. Similar to Step 2 discussed in Scenario 1, access the detailed performance statistics across two points in time.

2. Consider the EBS Volume Performance Exceeded and EC2 Instance EBS Performance Exceeded statistics from the EBS NVMe device. Use the --interval option, which polls the statistics every <interval_seconds> and outputs the difference in statistics across that interval:

$ sudo ebsnvme stats /dev/nvme1n1 --interval 15

Step 3: Compare the stats to check if the EBS volume is exceeding provisioned performance

1. In the following example output, you can see the EBS Volume Performance Exceeded statistic increasing by 5349286 microseconds. This shows that the workload running on EBS volume vol-02b51b6b2cb16aab1 has attempted to drive more IOPS than provisioned on the underlying EBS volume, which can impact the volume’s I/O latency. We recommend that you increase the performance of your volume to make sure that you have sufficient provisioned performance for your application’s needs.

[Figure: example output showing the EBS Volume Performance Exceeded statistic increasing]

2. In the following example output, driving a different workload on the instance allows us to see that the volume has exceeded the provisioned IOPS performance at the attached EC2 instance level. In this case, up-sizing to a larger instance size can improve the performance of your application.

[Figure: example output showing the EC2 Instance EBS Performance Exceeded statistic increasing for IOPS]

3. A synthetic load generator for Oracle called Silly Little Oracle Benchmark (SLOB) could also be used to simulate workloads on Oracle databases while you monitor the Amazon EBS statistics to see whether the volume or the instance is becoming the bottleneck.

It’s important to have the right instance and volume configurations to avoid performance bottlenecks to your application. Refer to the EBS volume types documentation for more information on the different EBS volume types, and the Amazon EBS-optimized documentation to understand how to select the optimal combination of EC2 instance and EBS volume suited for your application. These statistics are available at up to a one-second granularity, which allows you to effectively perform these checks in real-time and initiate volume modifications to optimize volume characteristics as needed.
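To put a Performance Exceeded delta into perspective, you can express it as a share of the polling interval. The following Python sketch uses the 5349286 microsecond increase from the example above and assumes that it was measured over the 15 second interval passed to --interval. The function name is hypothetical.

```python
def exceeded_fraction(exceeded_delta_us, interval_seconds):
    """Fraction of the interval in which the workload tried to exceed provisioned limits."""
    return exceeded_delta_us / (interval_seconds * 1_000_000)


# 5349286 us is the increase reported in the example; the 15 second
# interval is an assumption based on the --interval 15 command.
print(f"{exceeded_fraction(5349286, 15):.1%}")  # 35.7%
```

A sustained high share suggests that the provisioned limit, rather than a transient burst, is constraining the workload.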

Cleaning up

If you created an EC2 instance and EBS volume for this exercise, then terminate and delete the appropriate instance and volumes to avoid future costs.

Conclusion

In this post, we presented a solution for accessing high-resolution statistics on Amazon EBS volume performance at the instance level. Amazon EBS detailed performance statistics give you a real-time view into your underlying EBS volume performance at sub-minute granularity, so that you can quickly find the root cause of disruptions to applications running on an EBS volume. They also help you identify application performance bottlenecks caused by workloads attempting to exceed the provisioned IOPS or throughput limits of the EC2 instance or EBS volume. Along with Amazon CloudWatch metrics, which give volume-level insights at one-minute granularity, these are additional tools that you can use to identify and resolve issues with your Amazon EBS storage with confidence.

Thank you for reading this post. If you have any comments or questions, please leave them in the comment section.


AWS Storage Blog