How do I troubleshoot write latency spikes in my Amazon RDS DB instance?

Last updated: 2022-07-14

I want to troubleshoot write latency spikes in my Amazon Relational Database Service (Amazon RDS) DB instance.

Short description

The Amazon CloudWatch metric WriteLatency defines the average amount of time taken per disk I/O operation. Ideally, write latency must not be more than a single digit millisecond.

The spike in write latency for your DB instance might be caused when you do the following:

The spike might also result from IOPS or throughput bottleneck caused due to heavy workload on the database.

Resolution

Troubleshoot latency spike

1.    To troubleshoot the causes for a high read or write latency on your DB instance, check the following CloudWatch metrics:

  • ReadLatency and WriteLatency
  • ReadIOPS and WriteIOPS
  • ReadThroughput and WriteThroughput
  • DiskQueueDepth
  • BurstBalance (for gp2 storage)

Suppose that you notice one or more of the following:

  • The latency values are high.
  • The throughput and IOPS values have reached their maximum limits.
  • The value of DiskQueueDepth is high.
  • The value of BurstBalance is low (for gp2).

This means that your RDS instance is under heavy workload and needs more resources. For more information, see How do I troubleshoot the latency of Amazon EBS volumes caused by an IOPS bottleneck in my Amazon RDS instance?

To troubleshoot issues that cause an IOPS or throughput bottleneck, do the following:

For an RDS instance with General Purpose SSD (gp2), check the DB instance class and storage size.

For an RDS instance with Provisioned IOPS (io1), check the DB instance class and defined Provisioned IOPS.

For more information, see DB instance classes and Amazon EBS–optimized instances.

2.    If CloudWatch metrics don't indicate any resource throttling, then check the Read IO/s and Write IO/s using Enhanced Monitoring.

CloudWatch metrics are recorded at an interval of 60 seconds. Therefore, every spike or drop might not be recorded. However, you can set up the granularity of Enhanced Monitoring for up to one second to capture data. Any spike in resource utilization within a 60-second interval can be captured by Enhanced Monitoring. For more information, see How can I identify if my EBS volume is micro-bursting and how can I prevent this from happening?

3.    If all the preceding checks don't indicate the cause for the issue, then check the CloudWatch metrics NetworkReceiveThroughput and NetworkTransmitThroughput to make sure that there are no issues with the network.

Mitigate the impact of lazy loading

When you restore an RDS database instance from a snapshot, the DB instance continues to load data in the background. This process is known as lazy loading.

Lazy Loading might happen in all scenarios that require restoring from a snapshot, such as point-in-time restore, conversion of Single-AZ instance to Multi-AZ instance, and creating a new read replica. If you try to access data that isn't loaded yet, the DB instance immediately downloads the requested data from Amazon Simple Storage Service (Amazon S3). Then, the instance continues to load the rest of the data in the background. For more information, see Amazon EBS snapshots. To help mitigate the effects of lazy loading on tables to which you require quick access, you can perform operations that involve full-table scans, such as SELECT *. This allows RDS to download all of the backed-up table data from Amazon S3.

Follow best practices

Keep the following best practices and workarounds in mind when dealing with high write latency in your DB instance:

  • Be sure that you have enough resources allocated to your database to run queries. With RDS, the amount of resources allocated depends on the instance type.
  • If neither CloudWatch metrics nor Enhanced Monitoring metrics indicate resource throttling, then use Performance Insights to monitor the database workload. Using Performance Insights, you can identify the underlying SQL queries running in your database when you are experiencing latency with your application. You can use this information to assess the load on your database and determine further actions. For more information, see Monitoring DB load with Performance Insights on Amazon RDS.
  • Prevent micro-bursting by changing the size or type of the Amazon EBS volume according to your use case.
  • To optimize database performance, make sure that your queries are properly tuned.
  • If you're converting your Single-AZ database instance to a Multi-AZ instance, consider doing so during off-business hours.
  • To reduce the impact of lazy loading after a Multi-AZ conversion, consider doing one of the following:
    • Perform a manual failover soon after converting to Multi-AZ instance.
    • Run a full dump or just the required queries to load all the data from the tables. This process can help with loading the data and forcing all blocks to be pushed from S3 into the new host. For Amazon RDS for PostgreSQL instances, you can run the pg_prewarm command.
  • You can configure Amazon CloudWatch alarms on RDS key metrics that are useful for determining the reason for Write Latency spikes in your RDS instances. Examples of these metrics include ReadIOPS, WriteIOPS, ReadThroughput, WriteThroughput, DiskQueueDepth, ReadLatency, and WriteLatency. You can use these alarms to make sure that the instance doesn't throttle.