Why is my SELECT query running slowly on my Amazon Aurora MySQL DB cluster?

Last updated: 2020-12-10 

I have an Amazon Aurora MySQL DB cluster, and when I run SELECT queries against my database, they run slowly. How can I identify the cause of slow SELECT queries and fix them?

Short description

There are a number of reasons why SELECT queries can run slowly on your Amazon Aurora MySQL DB cluster:

  • Your Amazon Relational Database Service (Amazon RDS) system resources are overutilized. This can happen because of high CPU, low memory, or a workload that exceeds what your DB instance type can handle.
  • Locks on the database and the resulting wait events are causing SELECT queries to perform poorly.
  • The SELECT query is doing full-table scans on large tables, or the query lacks the necessary indexes.
  • Your InnoDB history list length (HLL) has grown considerably due to long-running transactions.

Resolution

Monitor your Amazon RDS system resources using metrics

You should always monitor your CPU utilization and free memory on your Amazon Aurora cluster. While occasional CPU spikes are normal, consistently high CPU for long periods of time can make your SELECT queries run slowly. Use the following tools to determine how and where you are using your CPU:

  1. Amazon CloudWatch metrics are the easiest way to monitor your CPU utilization. For more information on the metrics available for Amazon Aurora, see Monitoring Amazon Aurora DB cluster metrics.
  2. Enhanced Monitoring gives a detailed look at OS-level metrics at a finer granularity, including a breakdown of how individual processes use your CPU.
  3. Performance Insights accurately determines your DB load. Enable Performance Insights for your DB instance, and then check whether your load exceeds the Max vCPU line. You can also monitor your top queries by waits and identify the users that cause the most waits.

SELECT queries can also run slowly due to disk seeks. To minimize disk I/O, the database engine tries to cache blocks read from disk. The next time the database needs the same block of data, it can fetch it from memory rather than going to disk.

Use the metrics below to check if you are serving a particular query from disk or from memory:

  • VolumeReadIOPs: This metric is the number of billed volume-level (disk) read operations, and it should be as low as possible.

  • BufferCacheHitRatio: This metric is the percentage of requests that the buffer cache serves, and it should be as high as possible. If BufferCacheHitRatio drops and your SELECT statement is slow, then you're serving the query from the underlying storage volumes (see the counters after this list).
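
As a rough engine-level complement to BufferCacheHitRatio, you can compare the InnoDB buffer pool counters yourself. These are standard MySQL status variables, not Aurora-specific metrics:

-- Logical read requests vs. reads that had to go to storage
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';
-- Approximate hit ratio: 100 * (1 - Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)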

Another important resource for identifying slow SELECT statements is the slow query log. Enable slow query logging for your DB cluster to record these queries and take action later. If you're using the MySQL 5.6-compatible version, you can also use the MySQL Performance Schema to monitor query performance on an ongoing basis.
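
For example, you can confirm from a client session whether slow query logging is turned on and what threshold it uses. Both variables are controlled through your DB cluster parameter group:

SHOW GLOBAL VARIABLES LIKE 'slow_query_log';
SHOW GLOBAL VARIABLES LIKE 'long_query_time';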

Identify deadlocks and wait events

Amazon RDS locks the data in your database so that only one user session can write to or update a row at any given time. Any other transaction that requires this row is kept on hold. With a shared lock, read transactions can read the data while write/update transactions are kept on hold. A deadlock occurs when two or more transactions each hold a lock that the other needs, so that neither can proceed.

To identify deadlocks on your database, enable the innodb_print_all_deadlocks parameter in your parameter groups.
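
You can verify that the parameter is in effect at the engine level by checking the corresponding system variable:

SHOW GLOBAL VARIABLES LIKE 'innodb_print_all_deadlocks';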

Then, run the following command and check the LATEST DETECTED DEADLOCK section of the output to identify deadlocks:

SHOW ENGINE INNODB STATUS\G

Note: You can't run this command on Aurora Readers because the readers are configured with the MySQL innodb_read_only parameter set to ON, by default. This parameter is non-modifiable for a DB instance as Aurora MySQL manages the read-only and read/write state of DB instances based on the type of cluster.

You can also query the INFORMATION_SCHEMA.INNODB_TRX table. This gives you precise details about every transaction currently running inside InnoDB, including the ones that are waiting.
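
For example, the following query lists the running transactions, how long each has been open, and the statement each is executing. All of the columns used here are standard INNODB_TRX columns:

SELECT trx_id,
       trx_state,
       trx_started,
       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS runtime_seconds,
       trx_mysql_thread_id,
       trx_query
FROM INFORMATION_SCHEMA.INNODB_TRX
ORDER BY trx_started;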

Check if your query is using an index

If a query lacks an index or performs full table scans on large tables, it generally runs more slowly. Indexes help speed up SELECT queries.

To check whether your query is using an index, use the EXPLAIN statement. This is a helpful tool for troubleshooting slow queries. In the EXPLAIN output, check the table names, the key used, and the number of rows scanned during the query. If the output doesn't show any key in use, create an index on the columns used in the WHERE clause, as in the example below.
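
For example, for a hypothetical orders table filtered on customer_id (the table and column names are illustrative, not from your schema):

EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- If the "key" column in the EXPLAIN output is NULL, the query isn't using an index.
-- Creating an index on the WHERE-clause column typically fixes this:
CREATE INDEX idx_orders_customer_id ON orders (customer_id);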

If the table already has the required indexes, check whether the table statistics are up to date. Accurate statistics let the query optimizer choose the most selective indexes with the correct cardinality, which improves query performance.
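
On MySQL-compatible engines, you can refresh a table's index statistics with ANALYZE TABLE (again, the table name is illustrative):

ANALYZE TABLE orders;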

Check the history list length (HLL)

InnoDB uses multi-version concurrency control (MVCC). MVCC maintains multiple copies of the same record to preserve read consistency. After you commit a transaction, InnoDB purges the older copies. But when a transaction stays uncommitted for a long time, the undo segments grow, and the history list length (HLL) increases. The InnoDB history list length represents the number of undo log records that haven't been purged yet.

If your workload demands a lot of long running or open transactions, then you can expect to see a high HLL on the database.

Note: Long-running transactions are not the only cause of HLL spikes. If the purge threads are unable to keep up with changes on the DB, the HLL can remain high even without long-running transactions.

If you don't monitor the size of the HLL, its growth causes performance regression over time, as well as higher resource consumption, slower and inconsistent SELECT statement performance, and an increase in storage. In extreme cases, it can lead to a database outage.

To check the history list length, run the following command:

SHOW ENGINE INNODB STATUS;

Output:

------------
TRANSACTIONS
------------
Trx id counter 26368570695
Purge done for trx's n:o < 26168770192 undo n:o < 0 state: running but idle
History list length 1839

For Aurora MySQL, because the cluster uses a shared storage volume, the history list length is tracked at the cluster level, not at the individual instance level. Connect to your writer and run the following query:

SELECT server_id,
       IF(session_id = 'master_session_id', 'writer', 'reader') AS role,
       replica_lag_in_msec,
       oldest_read_view_trx_id,
       oldest_read_view_lsn
FROM mysql.ro_replica_status;

This query helps you understand the replica lag between the reader nodes and the writer node. It also details the oldest LSN used by the DB instance to read from storage, and the oldest read view TRX ID of the DB instance. Use this information to check if one of the readers is holding an old read view (as compared to the engine InnoDB status on the writer).

Note: Starting from Aurora MySQL 1.19 and 2.06, you can monitor HLL using the RollbackSegmentHistoryListLength metric in CloudWatch. Or, in older versions, use trx_rseg_history_len to check HLL using the following command:

SELECT NAME AS RollbackSegmentHistoryListLength, COUNT
FROM INFORMATION_SCHEMA.INNODB_METRICS
WHERE NAME = 'trx_rseg_history_len';

Use the following methods to resolve issues with HLL growth:

  • If DML (writes) causes the HLL growth: Canceling or terminating the statement involves rolling back the interrupted transaction. This takes a significant amount of time because all of the updates made up to that point must be rolled back.

  • If a READ causes the HLL growth: Terminate the query using the mysql.rds_kill_query stored procedure.

  • Depending on how long the query has been running, work with your DBA to decide whether you can terminate the query using the stored procedure (see the example after this list).
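
For example, assuming the offending query's thread ID from SHOW PROCESSLIST is 12345 (an illustrative value), you can terminate just the query or the entire session:

SHOW FULL PROCESSLIST;
-- Terminate only the running query; the connection stays open:
CALL mysql.rds_kill_query(12345);
-- Or terminate the whole session:
CALL mysql.rds_kill(12345);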

The best way to avoid HLL growth is to monitor the HLL using the preceding methods and to avoid long-running or open transactions on the database. Additionally, it's a best practice to commit data in smaller batches.
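
For example, instead of deleting millions of rows in one transaction, a batched approach keeps each transaction, and therefore the undo history, short. The table, column, and batch size here are illustrative:

-- With autocommit enabled, each statement commits on its own
DELETE FROM orders WHERE created_at < '2020-01-01' LIMIT 5000;
-- Repeat the statement until it affects 0 rows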

Important: Don't reboot the DB cluster or instance. Purging the HLL is more efficient when the purge can access the data in the buffer pool. If you reboot the database, then all of the data must be read from disk again to complete the cleanup.