Why am I experiencing a high I/O wait, increased queue length, and spike in latency with my Amazon EBS volume?

Last updated: 2022-12-02

I'm experiencing a high I/O wait, increased queue length, and spike in latency with my Amazon Elastic Block Store (Amazon EBS) volume. Why is this occurring?

Short description

For Amazon EBS volumes, an increased queue length and high I/O wait indicate that I/O operations are taking longer to complete.

The following are the most common reasons for increased latency:

  • The EBS volume is reaching its throughput or IOPS limit.
  • The Amazon Elastic Compute Cloud (Amazon EC2) instance's throughput or IOPS limit is reached.
  • Micro-bursting is occurring.
  • The volume is restored from a snapshot and is initializing.
  • There's an issue with the underlying storage subsystems of the volume.

Resolution

The volume is reaching its throughput or IOPS limit

EBS volumes have throughput and IOPS limits based on their type and size. You can also provision these limits for gp3, io1, and io2 volume types. If you're reaching these limits, then you might experience latency. To determine your throughput and IOPS limits, see How can I calculate the maximum IOPS and throughput for an Amazon EBS volume? Then, you can use CloudWatch metrics to check whether the EBS volumes of your EC2 instance are reaching their throughput or IOPS limits.

If you're frequently reaching your throughput or IOPS limit, then consider changing the volume type or size to one that meets your application needs. It's a best practice to benchmark your EBS volumes against your workload in a test environment to determine which volume types work best for you.
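As a rough sketch of this check, the following compares a volume's observed throughput and IOPS, derived from CloudWatch Sum statistics over a period, against its provisioned limits. All datapoint values and the provisioned limits are hypothetical:

```python
# Sketch: estimate a volume's throughput and IOPS from CloudWatch datapoints
# and compare them against the volume's provisioned limits. The values below
# are hypothetical Sum statistics over a 300-second period, as reported by
# the VolumeReadBytes/VolumeWriteBytes and VolumeReadOps/VolumeWriteOps metrics.

PERIOD_SECONDS = 300


def observed_performance(read_bytes, write_bytes, read_ops, write_ops,
                         period=PERIOD_SECONDS):
    """Return (throughput in MiB/s, IOPS) averaged over the period."""
    throughput_mib_s = (read_bytes + write_bytes) / period / (1024 ** 2)
    iops = (read_ops + write_ops) / period
    return throughput_mib_s, iops


# Hypothetical datapoints for a gp3 volume provisioned at 250 MiB/s and 3,000 IOPS.
throughput, iops = observed_performance(
    read_bytes=30 * 1024 ** 3,   # 30 GiB read in 5 minutes
    write_bytes=45 * 1024 ** 3,  # 45 GiB written in 5 minutes
    read_ops=400_000,
    write_ops=500_000,
)

VOLUME_THROUGHPUT_LIMIT = 250  # MiB/s (provisioned, hypothetical)
VOLUME_IOPS_LIMIT = 3_000      # IOPS (provisioned, hypothetical)

print(f"Observed: {throughput:.1f} MiB/s, {iops:.0f} IOPS")
print("Throughput at limit:", throughput >= VOLUME_THROUGHPUT_LIMIT)
print("IOPS at limit:", iops >= VOLUME_IOPS_LIMIT)
```

In this hypothetical case both averages sit at the provisioned limits, which would point to the volume itself as the bottleneck.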

The instance's throughput or IOPS limit is reached

EBS-optimized instances have a maximum aggregated throughput and IOPS that can be achieved across all EBS volumes that are attached to the instance. You might see high I/O wait and increased latency even though your volume isn't reaching its throughput or IOPS limits. If this is happening, then check whether the aggregated throughput or IOPS across all attached volumes is reaching the instance's throughput or IOPS limit.

For example, suppose that you have a gp3 volume of 1 TiB with 16,000 provisioned IOPS and 700 MiB/s throughput that's attached to a t3.medium instance. A t3.medium instance can achieve a maximum performance of 260.57 MiB/s throughput and 11,800 IOPS aggregated across all volumes that are attached to it. The instance can achieve this for only 30 minutes in a 24-hour period. Then, performance is throttled to a baseline of 43.43 MiB/s throughput and 2,000 IOPS aggregated across all the attached volumes. Although your single volume can sustain up to 700 MiB/s and 16,000 IOPS, the instance can't achieve this performance.
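The example above can be sketched as a simple calculation: the achievable performance in each dimension is the minimum of the volume's limit and the instance's limit. The limit figures are taken directly from the example:

```python
# Sketch of the example above: a gp3 volume (700 MiB/s, 16,000 IOPS) attached
# to a t3.medium instance. The effective performance is capped by whichever
# limit is lower: the volume's or the instance's.

volume = {"throughput_mib_s": 700.0, "iops": 16_000}

# t3.medium EBS limits from the example (aggregated across all attached volumes).
t3_medium_burst = {"throughput_mib_s": 260.57, "iops": 11_800}   # up to 30 min/24 h
t3_medium_baseline = {"throughput_mib_s": 43.43, "iops": 2_000}  # sustained


def effective(volume_limits, instance_limits):
    """Achievable performance is the minimum of the two limits, per dimension."""
    return {k: min(volume_limits[k], instance_limits[k]) for k in volume_limits}


print("While bursting:", effective(volume, t3_medium_burst))
print("At baseline:  ", effective(volume, t3_medium_baseline))
```

In both cases the instance's limits, not the volume's, determine what the workload actually achieves.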

If your application performance needs exceed the capabilities of your instance, then consider changing the instance type to one that can sustain your workload needs. For a list of available instance types with their respective Amazon EBS throughput and IOPS limits, see EBS-optimized instances specifications.

Micro-bursting is occurring

Micro-bursting happens when a volume bursts IOPS or throughput for a significantly shorter period than the metric collection period. Because Amazon CloudWatch metrics are averaged over the collection period, micro-bursting doesn't appear in the metrics, and you might miss it if you're not specifically checking for it. To determine if micro-bursting is the issue, see How can I identify if my EBS volume is micro-bursting and prevent this from happening?
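To see why averaging hides micro-bursting, consider a hypothetical volume with a 3,000 IOPS limit that hits its limit for 5 seconds and is otherwise idle during a 60-second collection period:

```python
# Sketch: why micro-bursting is invisible in averaged metrics. A volume with
# a 3,000 IOPS limit (hypothetical) bursts to its limit for 5 seconds and is
# idle for the remaining 55 seconds of a 60-second collection period.

BURST_SECONDS = 5
IDLE_SECONDS = 55
PERIOD = BURST_SECONDS + IDLE_SECONDS  # 60-second collection period

burst_iops = 3_000                       # the volume hits its IOPS limit
total_ops = burst_iops * BURST_SECONDS   # 15,000 ops in the whole period

# CloudWatch effectively reports Sum(ops) / period, averaging the burst away.
average_iops = total_ops / PERIOD

print(f"Peak IOPS during burst: {burst_iops}")
print(f"Average IOPS reported:  {average_iops:.0f}")
```

The reported average of 250 IOPS looks far below the limit, even though the volume was throttled at 3,000 IOPS during the burst.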

The volume is restored from a snapshot and is initializing

When a volume is restored from a snapshot, its data is pulled from Amazon Simple Storage Service (Amazon S3) and written to the volume. This process is called initialization. Initialization can cause increased latency in I/O operations the first time each block of data is accessed.

To reduce the impact of initialization on volume performance, you can force the initialization of the volume by reading from the blocks on the volume. You can also turn on Amazon EBS fast snapshot restore so that the volume is fully initialized at creation.
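On Linux, reading every block is typically done with a tool such as dd or fio. The following is a minimal Python equivalent of that sequential read, shown here only as a sketch; the device path in the commented call is hypothetical, and the command must be run with permission to read the device:

```python
# Sketch: force initialization of a restored volume by sequentially reading
# every block, so that each block is pulled from Amazon S3 up front rather
# than on first access by the application.

def read_all_blocks(path, block_size=1024 * 1024):
    """Sequentially read every block of the device (or file) at `path`.

    Returns the total number of bytes read.
    """
    bytes_read = 0
    with open(path, "rb") as dev:
        while True:
            chunk = dev.read(block_size)
            if not chunk:
                break
            bytes_read += len(chunk)
    return bytes_read


# Example (hypothetical device path; run with read access to the device):
# read_all_blocks("/dev/xvdf")
```

Reading the device once touches every block, so later application reads are served at full performance instead of waiting on initialization.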

There's an issue with the underlying storage subsystems of the volume

If you tried all the preceding troubleshooting steps and are still experiencing high latency, then contact AWS Support.