How do I troubleshoot the latency of Amazon EBS volumes caused by an IOPS bottleneck in my Amazon RDS instance?

Last updated: 2021-10-11

I have an Amazon Relational Database Service (Amazon RDS) DB instance. I want to troubleshoot the latency of the Amazon Elastic Block Store (Amazon EBS) volumes in my Amazon RDS instance.

Resolution

Latency in an Amazon RDS instance that's caused by an IOPS or throughput bottleneck usually has one of the following causes:

  • An IOPS bottleneck at the instance level
  • An IOPS bottleneck at the volume level
  • A throughput bottleneck at the instance level
  • A throughput bottleneck at the volume level
  • Micro-bursting

Use the following troubleshooting steps based on your use case.

RDS instance with General Purpose SSD (gp2)

Perform the following checks:

  1. Check the configuration information of the Amazon RDS instance, such as the DB instance class and storage size. This information can help you track the IOPS and throughput limits. You must know these values when you troubleshoot issues causing an IOPS or throughput bottleneck.
  2. Use the Amazon CloudWatch graphs to check for spikes in the values of DiskQueueDepth, ReadLatency, and WriteLatency. As a rule of thumb under normal load, expect a per-minute DiskQueueDepth of about one for every 1,000 IOPS, and expect ReadLatency and WriteLatency to stay below 10 milliseconds. If you notice spikes, then note the time of each spike.
  3. Use the CloudWatch graphs to view the ReadIOPS and WriteIOPS metrics. Check if the IOPS limit was breached at the volume level during the timeframe of the spikes in the values of DiskQueueDepth, ReadLatency, and WriteLatency.
  4. Use the CloudWatch graph to check if there is a drop in the value of BurstBalance. This check is applicable only for volumes with a size of less than 1 TB. A drop in the value of BurstBalance confirms the occurrence of an IOPS bottleneck during the timeframe of the spike.
  5. Use the CloudWatch graphs to view the ReadThroughput and WriteThroughput metrics. Check if these values spiked to the throughput limit at the volume level during the timeframe of the spikes in the values of DiskQueueDepth, ReadLatency, and WriteLatency.
  6. If you're using an EBS-optimized RDS instance class, then use the CloudWatch graphs to check for throttling of IOPS or throughput. For instance classes with burst capacity, view the EBSIOBalance% and EBSByteBalance% metrics in the CloudWatch graphs. Consistently low values of EBSIOBalance% or EBSByteBalance% indicate an IOPS or throughput bottleneck at the instance level.
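
The checks above can be sketched in code. The following minimal Python example is only an illustration: it assumes the single-volume Amazon EBS gp2 rules (3 IOPS per GiB, a 100 IOPS floor, a 16,000 IOPS ceiling, and bursting to 3,000 IOPS while BurstBalance is above zero), which might not match the limits of an RDS instance that stripes its storage across several volumes. The MinuteSample type, the is_iops_bound function, and the 90% headroom factor are hypothetical names and values chosen for this sketch. CloudWatch reports ReadLatency and WriteLatency in seconds, so the 10 millisecond guideline becomes 0.010.

```python
from dataclasses import dataclass

# Assumed single-volume gp2 rules; RDS storage striping can change these limits.
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000
GP2_BURST_IOPS = 3_000
LATENCY_LIMIT_S = 0.010  # 10 ms guideline, expressed in seconds


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS of a gp2 volume of the given size."""
    return max(GP2_MIN_IOPS, min(GP2_MAX_IOPS, size_gib * GP2_IOPS_PER_GIB))


@dataclass
class MinuteSample:
    """One 60-second CloudWatch datapoint per metric (hypothetical container)."""
    read_iops: float
    write_iops: float
    read_latency_s: float
    write_latency_s: float
    burst_balance_pct: float


def is_iops_bound(sample: MinuteSample, size_gib: int, headroom: float = 0.9) -> bool:
    """Return True when latency is high and total IOPS sits near the volume limit."""
    baseline = gp2_baseline_iops(size_gib)
    # While burst credits remain, a small volume can reach the burst rate.
    if sample.burst_balance_pct > 0:
        limit = max(baseline, GP2_BURST_IOPS)
    else:
        limit = baseline
    slow = max(sample.read_latency_s, sample.write_latency_s) > LATENCY_LIMIT_S
    busy = sample.read_iops + sample.write_iops >= headroom * limit
    return slow and busy


# A 500 GiB volume with drained burst credits, running near its 1,500 IOPS baseline.
drained = MinuteSample(900, 580, 0.025, 0.012, 0.0)
print(gp2_baseline_iops(500), is_iops_bound(drained, size_gib=500))
```

A minute that this sketch flags is a candidate for the volume-level IOPS bottleneck described in steps 3 and 4. It is not proof, because the thresholds are illustrative.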

Throttling of IOPS, throughput, or both indicates that the IOPS or throughput is inadequate for your workload at the storage level. To fix this issue, do the following:

  • Locate the SQL queries that create the most load on the database, and then optimize these queries. If the workload is as expected, or there is no scope for tuning the SQL queries, then you might need to increase the storage size to get a higher IOPS capacity.
    Note: After you increase the storage size of an RDS instance, you can't reduce the size to the previous value.
  • Consider switching the volume from General Purpose (gp2) to Provisioned IOPS (io1). If the DB instance is Single-AZ and you're using a custom parameter group, then switching between gp2 and io1 might cause a brief downtime. If your instance is Multi-AZ, then you don't experience any downtime.
  • If you notice throttling of IOPS or throughput at the instance level, then you must scale up the instance class to get a higher IOPS or throughput capacity.
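
To estimate how much storage a gp2 volume needs for a given baseline IOPS, the following sketch might help. It assumes the single-volume Amazon EBS gp2 formula (3 IOPS per GiB, at least 100 IOPS, at most 16,000 IOPS); an RDS instance that stripes storage across volumes can behave differently, so treat the result as a starting point. The function name gp2_gib_needed is hypothetical. Because the storage size can't be reduced after an increase, the function never returns less than the current size.

```python
import math

# Assumed single-volume gp2 rules.
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000


def gp2_gib_needed(target_iops: int, current_gib: int) -> int:
    """Smallest storage size (GiB) whose gp2 baseline reaches target_iops."""
    if target_iops > GP2_MAX_IOPS:
        raise ValueError("target exceeds the assumed gp2 ceiling; consider io1")
    current_baseline = max(
        GP2_MIN_IOPS, min(GP2_MAX_IOPS, current_gib * GP2_IOPS_PER_GIB)
    )
    if current_baseline >= target_iops:
        return current_gib  # already sufficient; storage can't be shrunk anyway
    return math.ceil(target_iops / GP2_IOPS_PER_GIB)


# A 500 GiB volume (1,500 baseline IOPS) that must sustain 4,500 IOPS.
print(gp2_gib_needed(4500, current_gib=500))
```

If the target is above the assumed ceiling, the sketch raises an error instead of returning a size, which corresponds to the case where a switch to Provisioned IOPS (io1) is worth evaluating.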

RDS instance with Provisioned IOPS (io1)

  1. Check the configuration information of the Amazon RDS instance, such as the DB instance class and defined Provisioned IOPS, to determine the IOPS limit or throughput limit for the DB instance class.
  2. Use the CloudWatch graphs to check for spikes in the values of DiskQueueDepth, ReadLatency, and WriteLatency. As a rule of thumb under normal load, expect a per-minute DiskQueueDepth of about one for every 1,000 IOPS, and expect ReadLatency and WriteLatency to stay below 10 milliseconds. If you notice spikes, then note the time of each spike.
  3. Use the CloudWatch graphs to view the ReadIOPS and WriteIOPS metrics. Check if the IOPS limit was breached during the timeframe of spikes in the values of DiskQueueDepth, ReadLatency, and WriteLatency.
  4. Use the CloudWatch graphs to view the ReadThroughput and WriteThroughput metrics. Check if these values spiked to the throughput limit during the timeframe of the spikes in the values of DiskQueueDepth, ReadLatency, and WriteLatency.
  5. If you're using an EBS-optimized RDS instance class, then use the CloudWatch graphs to check for throttling of IOPS or throughput. For instance classes with burst capacity, view the EBSIOBalance% and EBSByteBalance% metrics in the CloudWatch graphs. Consistently low values of EBSIOBalance% or EBSByteBalance% indicate an IOPS or throughput bottleneck at the instance level.
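
If you prefer to pull these metrics programmatically instead of reading the graphs, the following sketch builds one request per metric for the CloudWatch GetMetricStatistics API. It only constructs the parameter dictionaries and doesn't call AWS. The function name build_requests, the three-hour window, and the instance identifier "mydb" are assumptions for illustration. The namespace AWS/RDS and the DBInstanceIdentifier dimension are the ones that Amazon RDS publishes its metrics under.

```python
from datetime import datetime, timedelta, timezone

# Metrics named in the checks above.
METRICS = [
    "DiskQueueDepth", "ReadLatency", "WriteLatency",
    "ReadIOPS", "WriteIOPS",
    "ReadThroughput", "WriteThroughput",
    "EBSIOBalance%", "EBSByteBalance%",
]


def build_requests(db_instance_id: str, end: datetime, hours: int = 3) -> list:
    """Build GetMetricStatistics parameters for each metric at 60-second periods."""
    start = end - timedelta(hours=hours)
    return [
        {
            "Namespace": "AWS/RDS",
            "MetricName": name,
            "Dimensions": [
                {"Name": "DBInstanceIdentifier", "Value": db_instance_id}
            ],
            "StartTime": start,
            "EndTime": end,
            "Period": 60,  # seconds; matches the standard collection interval
            "Statistics": ["Average", "Maximum"],
        }
        for name in METRICS
    ]


# "mydb" is a placeholder identifier.
requests = build_requests("mydb", datetime(2021, 10, 11, 12, 0, tzinfo=timezone.utc))
print(len(requests), requests[0]["MetricName"])
```

With boto3 installed and credentials configured, each dictionary could be passed as cloudwatch.get_metric_statistics(**request), where cloudwatch is a boto3 CloudWatch client. Comparing the datapoints for the same timestamps makes it easier to see whether a latency spike coincides with IOPS or throughput at the limit.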

Throttling of IOPS or throughput indicates that the IOPS or throughput is inadequate for the workload at the storage level. To fix this issue, do the following:

  • Locate the SQL queries that create the most load on the database, and then optimize these queries. If the workload is as expected, or there is no scope for tuning the SQL queries, then you might need to increase the Provisioned IOPS.
  • If you notice throttling of IOPS or throughput at the instance level, then you need to scale up the instance class to get a higher throughput or IOPS capacity.

Micro-bursting

Micro-bursting occurs when an EBS volume "bursts" to high IOPS or throughput for periods that are significantly shorter than the collection period. CloudWatch metrics are collected at an interval of 60 seconds. Because the burst is shorter than the collection period, it's averaged out and CloudWatch doesn't reflect it. To identify whether micro-bursting causes the latency, turn on Enhanced Monitoring with a granularity of 1 second. Use the Read IO/s and Write IO/s metrics to determine the actual IOPS utilization per second, and use the Read Kb/s and Write Kb/s metrics to determine the actual throughput utilization per second. For more information, see Enhanced Monitoring metric descriptions.
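
The following small worked example shows why a 60-second average can hide a micro-burst. The numbers are made up for illustration: a hypothetical volume limit of 3,000 IOPS, with 50 seconds at 200 IOPS and 10 seconds pinned at the limit. The function names are also hypothetical.

```python
def minute_average(per_second_iops):
    """Average of per-second samples, similar to what a 60-second datapoint shows."""
    return sum(per_second_iops) / len(per_second_iops)


def seconds_at_limit(per_second_iops, limit, headroom=0.95):
    """Count the seconds in which IOPS reached at least headroom * limit."""
    return sum(1 for iops in per_second_iops if iops >= headroom * limit)


LIMIT_IOPS = 3000  # hypothetical volume limit

# 50 quiet seconds followed by a 10-second burst that is capped at the limit.
samples = [200] * 50 + [3000] * 10

print(round(minute_average(samples), 2))      # 666.67
print(seconds_at_limit(samples, LIMIT_IOPS))  # 10
```

The one-minute average of about 667 IOPS looks far below the limit, yet the volume spent 10 seconds at its limit. One-second Enhanced Monitoring data makes that burst visible.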