How can I identify if my EBS volume is micro-bursting and how can I prevent this from happening?

Last updated: 2020-10-29

My Amazon Elastic Block Store (Amazon EBS) volume isn't breaching its throughput (bytes/s) or IOPS (ops/s) limit in CloudWatch, but it appears throttled and is experiencing high latency/queue length. How can I determine if this is happening because of micro-bursting and how can I prevent this?

Short description

For most volumes, Amazon CloudWatch monitors IOPS (ops/s) and throughput (bytes/s) by collecting samples every 5 minutes. IO1 and IO2 volumes support detailed monitoring, which collects samples every minute.

Micro-bursting occurs when an EBS volume "bursts" to high IOPS or throughput for periods that are significantly shorter than the collection period. Because CloudWatch reports totals over the whole collection period, a short burst is averaged out and CloudWatch doesn't reflect it.

Example: An IO1 volume (1 minute collection period) with 950 IOPS provisioned has an application that pushes 1000 IOPS for 5 seconds. Amazon EBS throttles the application when it hits the volume's IOPS limit. At this point, the volume can't handle the workload, causing increased queue length and higher latency.

CloudWatch doesn't show that the volume breached the IOPS limit because the collection period is 60 seconds. IOPS of 1000 occurred for only 5 seconds. For the remaining 55 seconds of the 1 minute collection period, the volume remained idle. So, the number of VolumeReadOps+VolumeWriteOps over the whole minute was 5000 operations (1000*5 seconds). This equates to an average of 83.33 IOPS over that minute (5000/60 seconds), which usually isn't a concern.

In this case, the VolumeIdleTime at the same sample time is 55 seconds, because the volume was idle for the remainder of the collection period. This means that the 5000 operations (VolumeReadOps+VolumeWriteOps) at that sample time occurred over only 5 seconds. Calculate the average IOPS while the volume was active by dividing 5000 by 5. This equates to 1000 IOPS, which is above the 950 IOPS provisioned for the volume.
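The arithmetic in this example can be reproduced with a short Python sketch. The function names are hypothetical, and the figures are the ones used in the example above.

```python
def period_average_iops(total_ops, period_seconds):
    """Average IOPS across the whole collection period."""
    return total_ops / period_seconds


def active_average_iops(total_ops, period_seconds, idle_seconds):
    """Average IOPS across only the time the volume was active."""
    return total_ops / (period_seconds - idle_seconds)


# Figures from the example: 1000 IOPS for 5 seconds in a 60-second period.
total_ops = 1000 * 5   # VolumeReadOps + VolumeWriteOps
period_seconds = 60    # collection period of the IO1 volume
idle_seconds = 55      # VolumeIdleTime

print(round(period_average_iops(total_ops, period_seconds), 2))       # 83.33
print(active_average_iops(total_ops, period_seconds, idle_seconds))   # 1000.0
```

The first value is what a per-period average shows. The second is the rate the volume sustained while it was busy.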

To determine if micro-bursting is occurring on your volume, do the following:

  1. Use CloudWatch metrics to identify possible micro-bursting.
  2. Confirm micro-bursting using an OS-level tool, such as iostat.
  3. Prevent micro-bursting by changing your volume size or type to accommodate your applications.

Resolution

Use CloudWatch to identify possible micro-bursting

To identify micro-bursting using IOPS (ops/s) in CloudWatch, do the following:

1.    Check the VolumeIdleTime metric.

If the VolumeIdleTime is high, the volume remained idle for most of the collection period. Sufficiently high IOPS at the same sample time indicates that micro-bursting might have occurred.

2.    Calculate the average IOPS.

VolumeReadOps and VolumeWriteOps show only the number of I/O operations performed within the collection period. To calculate the average IOPS reached while the volume was active, divide Sum(VolumeReadOps)+Sum(VolumeWriteOps) by the volume's active time, as shown in the following formula:

Actual average IOPS in Ops/s = (Sum(VolumeReadOps) + Sum(VolumeWriteOps) ) / ( Period - Sum(VolumeIdleTime) )

Note: Period in the preceding formula is the length, in seconds, of a single CloudWatch sample. Set the Period of the CloudWatch graph to the volume's collection period.

If the formula gives a value greater than the maximum IOPS supported by the volume, then micro-bursting occurred.
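As a minimal sketch, the IOPS formula can be applied to a series of CloudWatch samples to flag the ones that exceed the volume's limit. The function name, the dictionary keys, and the sample values below are hypothetical. Each sample is assumed to hold the Sum statistic of VolumeReadOps, VolumeWriteOps, and VolumeIdleTime for one period.

```python
def flag_iops_micro_bursts(samples, period_seconds, max_iops):
    """Return (timestamp, active_iops) for samples above max_iops.

    samples: list of dicts with hypothetical keys "timestamp",
    "read_ops", "write_ops", and "idle_time" (seconds).
    """
    flagged = []
    for sample in samples:
        active_seconds = period_seconds - sample["idle_time"]
        if active_seconds <= 0:
            continue  # the volume was idle for the whole period
        active_iops = (sample["read_ops"] + sample["write_ops"]) / active_seconds
        if active_iops > max_iops:
            flagged.append((sample["timestamp"], active_iops))
    return flagged


# Made-up samples for a volume with a 60-second period and a 1000 IOPS limit.
samples = [
    {"timestamp": "12:00", "read_ops": 3000, "write_ops": 2500, "idle_time": 55},
    {"timestamp": "12:01", "read_ops": 20000, "write_ops": 10000, "idle_time": 0},
]

print(flag_iops_micro_bursts(samples, period_seconds=60, max_iops=1000))
# [('12:00', 1100.0)]
```

The second sample has more operations in total, but they are spread over the full 60 seconds (500 IOPS), so only the first sample is flagged.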

To identify micro-bursting using throughput (bytes/s) in CloudWatch, do the following:

1.    Check the VolumeIdleTime metric.

2.    Use the following formula to calculate the average throughput:

Actual Average Throughput in Bytes/s = (Sum(VolumeReadBytes) + Sum(VolumeWriteBytes) ) / ( Period - Sum(VolumeIdleTime) )

Note: Period in the preceding formula is the length, in seconds, of a single CloudWatch sample. Set the Period of the CloudWatch graph to the volume's collection period.

If the formula gives a value greater than the maximum throughput supported by the volume, then micro-bursting occurred.
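The throughput formula returns bytes per second, while volume limits are usually quoted in MiB/s. The following sketch does the conversion before comparing. The function name, the 250 MiB/s limit, and the sample values are assumptions chosen for illustration. Check the documented limit for your own volume type.

```python
MIB = 1024 * 1024  # bytes in one MiB


def active_throughput_mib(read_bytes_sum, write_bytes_sum, period_seconds, idle_time_sum):
    """Average throughput in MiB/s over the time the volume was active."""
    active_seconds = period_seconds - idle_time_sum
    if active_seconds <= 0:
        return 0.0  # the volume was idle for the whole period
    return (read_bytes_sum + write_bytes_sum) / active_seconds / MIB


# Made-up sample: 1 GiB read and 0.5 GiB written in a 300-second period,
# with the volume idle for 295 of those seconds.
throughput = active_throughput_mib(
    read_bytes_sum=1024 * MIB,
    write_bytes_sum=512 * MIB,
    period_seconds=300,
    idle_time_sum=295,
)

max_throughput_mib = 250  # assumed limit for this illustration
print(round(throughput, 1), throughput > max_throughput_mib)
```

Averaged over the full 300 seconds, the same sample works out to about 5 MiB/s, which is why the burst is easy to miss.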

Confirm micro-bursting using an OS-level tool, such as iostat

The preceding formulas don't always identify micro-bursting in real time. This is because the volume might be micro-bursting even if the VolumeIdleTime is low.

Example: Your volume spikes to a level that breaches the volume's limits. The volume then reduces to a very low level of activity without being completely idle for the remainder of the collection period. The VolumeIdleTime metric doesn't reflect the low activity, even though micro-bursting occurred.
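A small worked example shows why the formula misses this case. It assumes, for illustration only, that a trickle of one operation per second is enough to keep VolumeIdleTime at 0 for the period. The function name and the numbers are hypothetical.

```python
def formula_estimate(total_ops, period_seconds, idle_seconds):
    """Active-time IOPS as computed by the CloudWatch formula."""
    return total_ops / (period_seconds - idle_seconds)


burst_ops = 1000 * 5    # 1000 IOPS for 5 seconds
trickle_ops = 1 * 55    # 1 IOPS for the remaining 55 seconds

# Assumption: the trickle prevents the volume from recording any idle time.
estimate = formula_estimate(burst_ops + trickle_ops, period_seconds=60, idle_seconds=0)

print(estimate)  # 84.25
```

The estimate stays far below the 1000 IOPS reached during the burst, so only a tool that samples more often can reveal it.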

To confirm micro-bursting, use an OS-level tool that has a finer granularity than CloudWatch, such as iostat. For more information on iostat, see the iostat(1) Linux man page.

1.    Run the following command using iostat to report I/O stats for all of your mounted volumes with 1 second granularity:

iostat -xdmzt 1

Note: The iostat tool is part of the sysstat package. If the iostat command isn't found, run the following command to install sysstat on Amazon Linux AMIs.

$ sudo yum install sysstat -y

2.    To determine if you're hitting the throughput limit, review the rMB/s and wMB/s in the output. If rMB/s + wMB/s is greater than the volume's maximum throughput, micro-bursting is occurring.

To determine if you're hitting the IOPS limit, review the r/s and w/s in the output. If r/s + w/s is greater than the volume's maximum IOPS, micro-bursting is occurring.
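The checks in this step can be automated over captured iostat output. The following is a rough sketch, not a complete parser. The function name is hypothetical, and the sample text is made up and trimmed to a few columns. Real iostat -x output has more columns, and their order depends on the sysstat version, so the sketch looks columns up by header name.

```python
def find_bursts(iostat_text, max_iops, max_throughput_mb):
    """Return (device, iops, mb_per_s) for rows above either limit."""
    columns = None
    findings = []
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Device"):
            columns = fields  # header row names the columns that follow
            continue
        if columns is None or len(fields) != len(columns):
            continue  # timestamps and other non-data lines
        row = dict(zip(columns[1:], fields[1:]))
        try:
            iops = float(row["r/s"]) + float(row["w/s"])
            mb_per_s = float(row["rMB/s"]) + float(row["wMB/s"])
        except (KeyError, ValueError):
            continue  # expected columns are missing or not numeric
        if iops > max_iops or mb_per_s > max_throughput_mb:
            findings.append((fields[0], iops, mb_per_s))
    return findings


# Made-up, trimmed sample of two 1-second reports for one device.
SAMPLE = """\
10/29/2020 12:00:01 PM
Device:  r/s  w/s  rMB/s  wMB/s  %util
xvdf  700.00  400.00  5.50  3.10  99.80

10/29/2020 12:00:02 PM
Device:  r/s  w/s  rMB/s  wMB/s  %util
xvdf  10.00  5.00  0.08  0.04  1.20
"""

for device, iops, mb_per_s in find_bursts(SAMPLE, max_iops=1000, max_throughput_mb=250):
    print(device, iops, round(mb_per_s, 2))
```

In this sample, only the first one-second report exceeds the assumed 1000 IOPS limit. The following second is almost idle, which is the pattern that a per-minute average hides.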

Prevent micro-bursting by changing your volume size or type to accommodate your applications

Change the volume to a type/size that accommodates your required IOPS and throughput. For more information on volume types and their respective IOPS/throughput limits, see Amazon EBS volume types. Keep in mind that there are limits on the IOPS/throughput the instance can push to all attached EBS volumes.

It's a best practice to benchmark your volumes against your workload to verify which volume types safely accommodate your workload in a test environment. For more information, see Benchmark EBS volumes.
