Why is my SELECT query running slowly on my Amazon Aurora MySQL DB Cluster?

Last updated: 2022-09-08

I have an Amazon Aurora MySQL-Compatible Edition DB cluster, and when I run a SELECT query against my database, the query runs slowly. How can I identify the cause of slow SELECT queries and fix them?

Short description

There are several reasons why SELECT queries might run slowly on your Aurora MySQL-Compatible DB cluster:

  • Your Amazon Relational Database Service (Amazon RDS) system resources are overutilized. This can happen because of high CPU usage, low memory, or a workload that exceeds what your DB instance type can handle.
  • The database is locking and the resulting wait events are causing SELECT queries to perform poorly.
  • The SELECT query is doing full-table scans on large tables, or the query lacks the necessary indexes.
  • Your InnoDB history list length (HLL) has grown considerably due to long-running transactions.

Resolution

Monitor your Amazon RDS system resources using metrics

It's a best practice to always monitor your CPU utilization and free memory on your Amazon Aurora cluster. While occasional CPU spikes are normal, consistently high CPU for long periods of time can make your SELECT queries run slowly. Use the following tools to determine how and where you are using your CPU:

  1. Amazon CloudWatch metrics are the easiest way to monitor your CPU utilization. For more information on what metrics are available for Amazon Aurora, see Amazon CloudWatch metrics for Amazon Aurora.
  2. Enhanced Monitoring gives a detailed look at OS-level metrics at a finer granularity (as low as one second). The detailed breakdown shows how individual processes are using your CPU.
  3. Performance Insights accurately determines your DB load. Turn on Performance Insights for your DB instance, and then check whether your load exceeds the Max vCPU line. You can also monitor your top SQL queries by waits and identify the users that cause the most waits.

SELECT queries can also run slowly due to disk seeks. To minimize the disk I/O, the database engine tries to cache the block read from the disk. This means that next time the database needs the same block of data, it's fetched from the memory rather than from the disk.

Use these metrics to check if you are serving a particular query from disk or from memory:

  • VolumeReadIOPs: This metric is the number of billed volume-level (disk) read operations. It's a best practice to make sure that this value is as low as possible.
  • BufferCacheHitRatio: This metric is the percentage of requests that the buffer cache serves. It's a best practice to make sure that this value is as high as possible. If BufferCacheHitRatio drops and your SELECT statement is slow, then you're processing the query from the underlying storage volumes.
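You can also compare the same ratio at the engine level from the InnoDB buffer pool counters. A minimal check, run from any MySQL client connected to the instance:

```sql
-- Innodb_buffer_pool_read_requests counts logical reads from the cache;
-- Innodb_buffer_pool_reads counts reads that missed the cache and went
-- to the underlying storage. A rising "reads" counter relative to
-- "read_requests" means more queries are being served from disk.
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
```

These are standard MySQL status variables; track them over time rather than reading a single snapshot, because the counters are cumulative since the last restart.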

Another important resource for identifying slow SELECT statements is the slow query log. Enable slow query logging for your DB cluster to record these queries and take action later. For the MySQL 5.6-compatible version, use the MySQL Performance Schema to monitor query performance on an ongoing basis.
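For example, when the Performance Schema is enabled, you can rank statement digests by their total latency. A minimal sketch:

```sql
-- Top 10 statement patterns by cumulative execution time.
-- Timer columns in the Performance Schema are reported in picoseconds,
-- so divide by 1e12 to convert to seconds.
SELECT digest_text,
       count_star                AS executions,
       sum_timer_wait / 1e12     AS total_latency_sec
FROM performance_schema.events_statements_summary_by_digest
ORDER BY sum_timer_wait DESC
LIMIT 10;
```

The digests group similar statements together (literals normalized away), which makes this view a good starting point for deciding which query shapes to tune first.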

Identify deadlocks and wait events

The database engine locks data so that only one user session can write or update a row at any given time. Any other transaction that requires this row is kept on hold. With a shared (read) lock, write and update transactions are kept on hold while read transactions read the data. A deadlock occurs when two transactions each hold a lock that the other transaction needs, so neither can proceed.

To identify deadlocks on your database, turn on the innodb_print_all_deadlocks parameter in your DB parameter group. Then, monitor the mysql-error.log from the Amazon RDS console, the AWS CLI, or the API.

Or, log in to MySQL with an admin account, and run this command to identify deadlocks from the command output in the Latest Detected Deadlock section:

mysql> SHOW ENGINE INNODB STATUS\G
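To see which sessions are currently blocked and which transaction is blocking them, you can also query the InnoDB lock tables directly. A sketch for the MySQL 5.6/5.7-compatible versions (in MySQL 8.0-compatible versions, use performance_schema.data_lock_waits instead):

```sql
-- Pair each waiting transaction with the transaction that blocks it.
SELECT r.trx_id              AS waiting_trx,
       r.trx_mysql_thread_id AS waiting_thread,
       r.trx_query           AS waiting_query,
       b.trx_id              AS blocking_trx,
       b.trx_mysql_thread_id AS blocking_thread,
       b.trx_query           AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id;
```

The thread IDs in the output match SHOW PROCESSLIST, so you can trace a blocking transaction back to a specific connection.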

Check if your query is using an index

If a query doesn't have an index or does full table scans, then the query runs more slowly. Indexes help to speed up SELECT queries.

To check if your query is using an index, prefix the query with EXPLAIN. This is a helpful tool for troubleshooting slow queries. In the EXPLAIN output, check the table names, the key used, and the number of rows scanned during the query. If the output doesn't show any keys in use, then create an index on the columns used in the WHERE clause.

If the table has the required indexes, check that the table statistics are up to date. Accurate statistics mean that the query optimizer uses the most selective indexes with the correct cardinality, which improves query performance.
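The steps above can be sketched end to end. The orders table and customer_id column here are hypothetical placeholders for your own schema:

```sql
-- 1. Check the plan for the slow query. If the "key" column is NULL and
--    "rows" is close to the table size, the query is doing a full table scan.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- 2. Create an index on the column used in the WHERE clause.
CREATE INDEX idx_orders_customer_id ON orders (customer_id);

-- 3. Refresh the table statistics so the optimizer sees the new cardinality.
ANALYZE TABLE orders;
```

Run EXPLAIN again after creating the index to confirm that the key column now shows the new index and the estimated row count has dropped.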

Check the history list length (HLL)

InnoDB uses multiversion concurrency control (MVCC). MVCC maintains multiple copies of the same record to preserve read consistency. When you commit a transaction, InnoDB purges the older copies. But when a transaction stays open for a long time, the undo segments that hold those older copies grow, and the history list length (HLL) increases. The InnoDB history list length represents the number of changes that haven't been purged yet.

If your workload demands multiple open or long-running transactions, then you can expect to see a high HLL on the database.

Note: Long-running transactions aren't the only cause of HLL spikes. The HLL can also stay high when the purge threads can't keep up with the rate of changes on the database.

If you don't monitor the size of the HLL, then performance regresses over time. A growing HLL can also cause higher resource consumption, slower and inconsistent SELECT statement performance, and an increase in storage. In extreme cases, this can lead to a database outage.

To check the history list length, run the following command:

SHOW ENGINE INNODB STATUS;

Output:

------------
TRANSACTIONS
------------
Trx id counter 26368570695
Purge done for trx's n:o < 26168770192 undo n:o < 0 state: running but idle
History list length 1839

For Aurora MySQL, because all instances in a cluster share the same storage volume, the history list length applies at the cluster level, not at the individual instance level. Connect to your writer instance and run the following query:

SELECT server_id,
       IF(session_id = 'master_session_id', 'writer', 'reader') AS role,
       replica_lag_in_msec,
       oldest_read_view_trx_id,
       oldest_read_view_lsn
FROM mysql.ro_replica_status;

This query helps you understand the replica lag between the reader nodes and the writer node. It also details the oldest LSN used by the DB instance to read from storage, and the oldest read view TRX ID of the DB instance. Use this information to check if one of the readers is holding an old read view (as compared to the engine InnoDB status on the writer).

Note: Starting with Aurora MySQL versions 1.19 and 2.06, you can monitor the HLL using the RollbackSegmentHistoryListLength metric in CloudWatch. In older versions, check the trx_rseg_history_len counter using the following query:

SELECT name AS RollbackSegmentHistoryListLength, count
FROM INFORMATION_SCHEMA.INNODB_METRICS
WHERE name = 'trx_rseg_history_len';
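When the HLL is high, the next step is usually to find the transactions that are keeping the undo history alive. A minimal sketch:

```sql
-- List the oldest open transactions; long-open transactions prevent
-- InnoDB from purging old row versions and drive HLL growth.
SELECT trx_id,
       trx_started,
       trx_mysql_thread_id,
       trx_query,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS open_seconds
FROM information_schema.innodb_trx
ORDER BY trx_started
LIMIT 5;
```

Transactions with a large open_seconds value and a NULL trx_query are often idle sessions that began a transaction and never committed.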

Use the following methods to resolve issues with HLL growth:

  • If DML (write) statements cause the HLL growth: Canceling or terminating the statement triggers a rollback of the interrupted transaction. This can take a significant amount of time, because all of the changes made up to that point must be rolled back.
  • If a READ causes the HLL growth: Terminate the query using the mysql.rds_kill_query procedure.
  • Depending on how long the query has been running, work with your DBA to check whether terminating it with the stored procedure is safe.
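For example, using the RDS-provided stored procedures (the thread ID 230040 here is a hypothetical value taken from SHOW PROCESSLIST output):

```sql
-- Terminate only the running query on a connection:
CALL mysql.rds_kill_query(230040);

-- Or terminate the whole connection, rolling back its open transaction:
-- CALL mysql.rds_kill(230040);
```

Note that Amazon RDS and Aurora don't grant the SUPER privilege, so these procedures are the supported way to kill sessions instead of the plain KILL statement.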

It's a best practice to prevent HLL growth by monitoring the HLL with these methods and by avoiding open or long-running transactions on the database. Additionally, it's a best practice to commit data in smaller batches.
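As an illustration of committing in smaller batches, a large cleanup can be split into short transactions. The event_log table and retention period here are hypothetical:

```sql
-- Delete old rows in small batches instead of one long-running transaction,
-- so InnoDB can purge undo history between batches. With autocommit enabled,
-- each statement commits on its own; repeat until ROW_COUNT() returns 0.
DELETE FROM event_log
WHERE created_at < NOW() - INTERVAL 90 DAY
LIMIT 5000;
```

Each batch holds locks and undo records only briefly, which keeps the HLL low and reduces the impact on concurrent SELECT queries.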

Important: Don't reboot the DB cluster or instance. Purging the HLL is more efficient when the database can access the data in memory, in the buffer pool. If you reboot the database, the survivable page cache might be lost. When that happens, data pages must be read from the cluster volume to purge the HLL, which is slower than purging in memory and incurs additional I/O billing costs.

