AWS Database Blog

Amazon Aurora MySQL version 2 (with MySQL 5.7 compatibility) to version 3 (with MySQL 8.0 compatibility) upgrade checklist, Part 2

In the first part, we discussed the most common issues that cause upgrade prechecks from Amazon Aurora MySQL-Compatible Edition v2 to v3 to fail. In this post, we discuss the most common causes of prolonged or unsuccessful upgrades.

Cluster has XA transactions in the prepared state

Amazon Aurora MySQL cancels the upgrade if it detects XA transactions in the prepared state in the database being upgraded. Failing over or rebooting the database does not get rid of prepared XA transactions. You must discover and then commit or roll back prepared XA transactions before the upgrade. The following is a quick example of how to do so.

Open a session, start an XA transaction, bring it to the prepared state, and then close the session.

mysql> USE sakila;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed

mysql> XA START 'gtridtest','bqualtest',123456;
Query OK, 0 rows affected (0.00 sec)

mysql> UPDATE actor SET first_name='testname' WHERE actor_id=3;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0

mysql> XA END 'gtridtest','bqualtest',123456;
Query OK, 0 rows affected (0.00 sec)

mysql> XA PREPARE 'gtridtest','bqualtest',123456;
Query OK, 0 rows affected (0.00 sec)

mysql> exit;

An xid is an XA transaction identifier. It indicates which transaction the statement applies to. xid values are supplied by the client or generated by the MySQL server. An xid value has from one to three parts: gtrid [, bqual [, formatID]]. gtrid is a global transaction identifier, bqual is a branch qualifier, and formatID is a number that identifies the format used by the gtrid and bqual values. bqual and formatID are optional. If not given, bqual defaults to '' (an empty string) and formatID defaults to 1.
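As a side note, the following minimal sketch (separate from the session above, using a hypothetical xid 'gtridonly') shows those defaults in action; because only gtrid is supplied, the same single-part xid must be used in every subsequent XA statement for that transaction.

XA START 'gtridonly';    -- bqual defaults to '', formatID defaults to 1
XA END 'gtridonly';
XA PREPARE 'gtridonly';
XA ROLLBACK 'gtridonly'; -- or XA COMMIT 'gtridonly';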

At this point, we have a prepared XA transaction in the database. The following command discovers prepared XA transactions.

mysql> XA RECOVER CONVERT xid;
+----------+--------------+--------------+----------------------------------------+
| formatID | gtrid_length | bqual_length | data                                   |
+----------+--------------+--------------+----------------------------------------+
|   123456 |            9 |            9 | 0x677472696474657374627175616C74657374 |
+----------+--------------+--------------+----------------------------------------+
1 row in set (0.00 sec)

The output columns have the following meanings:

  • formatID is the formatID part of the transaction xid
  • gtrid_length is the length in bytes of the gtrid part of the xid
  • bqual_length is the length in bytes of the bqual part of the xid
  • data is the concatenation of the gtrid and bqual parts of the xid

With these values, we can extract the following parts of the data field.

  • gtrid = 0x677472696474657374
  • bqual = 0x627175616C74657374
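If you prefer not to count hex digits by hand, the following sketch shows one way to split the data value using the reported lengths. Each byte is two hex characters, and the leading 0x accounts for the first two characters of the displayed string; the literal values below are taken from the output above.

SET @data := '0x677472696474657374627175616C74657374';
SET @gtrid_length := 9;
SET @bqual_length := 9;
-- gtrid hex digits start right after the 0x prefix; bqual follows immediately
SELECT CONCAT('0x', SUBSTRING(@data, 3, @gtrid_length * 2)) AS gtrid,
       CONCAT('0x', SUBSTRING(@data, 3 + @gtrid_length * 2, @bqual_length * 2)) AS bqual;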

To roll back the transaction, use the XA ROLLBACK command:

mysql> XA ROLLBACK 0x677472696474657374,0x627175616C74657374,123456;
Query OK, 0 rows affected (0.01 sec)
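If the prepared transaction should be kept instead of discarded, you can commit it with the same xid values (in place of, not after, the rollback above):

XA COMMIT 0x677472696474657374,0x627175616C74657374,123456;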

Cluster has a substantial number of tables

Aurora MySQL performs a major version upgrade as a multistage process, as outlined in the public documentation. Having a large number of tables in a cluster can extend the duration of the precheck and the engine version upgrade step. The engine version upgrade itself occurs in two steps: the data dictionary upgrade and the server upgrade.

There is a big change in the way the MySQL data dictionary is stored between MySQL 5.7 and MySQL 8.0. In MySQL 5.7 and lower, dictionary data was stored in metadata files (.frm, .par, and .trn files); for example, each table had its own .frm file holding its definition. In MySQL 8.0, the dictionary data is stored centrally in tables in the mysql schema, which brings benefits that you can learn about in the MySQL documentation. In the data dictionary upgrade step, the server creates data dictionary tables with updated definitions, copies persisted metadata to the new tables, atomically replaces the old tables with the new ones, and reinitializes the data dictionary. Therefore, when the cluster contains a large number of tables, removing the numerous metadata files and migrating their contents into tables can extend the overall upgrade duration. The Amazon Aurora MySQL version upgrade works in much the same way, so the impact of having numerous tables applies there as well.

In the server upgrade step, the server processes the tables in the user schemas as necessary. It checks tables and attempts to repair them if problems are found, which can be time-consuming when a cluster has many tables, especially large ones.
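If you want a sense of which tables might need attention before the upgrade, you can run CHECK TABLE ... FOR UPGRADE against selected tables, preferably on a clone. The table name below is just an example taken from the earlier session.

-- Reports version incompatibilities for the named table; "OK" means no action is needed
CHECK TABLE sakila.actor FOR UPGRADE;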

To mitigate the impact of a high table count during the v2 to v3 upgrade, we suggest reviewing and eliminating unused tables, such as backup tables and old table partitions. Upgrading the major version of a cluster with a substantial number of tables can be resource-intensive, potentially consuming more resources than your typical daily workload. If resource contention arises during a test upgrade, consider temporarily scaling up your production instance class and scaling back down once the upgrade is complete. You should monitor key Amazon CloudWatch metrics, including CPUUtilization for CPU contention, FreeableMemory and SwapUsage for memory pressure, and NetworkThroughput and StorageNetworkThroughput for instance throughput. You can run the following query to get the count of tables in your cluster.

SELECT count(*) AS TOTAL_NUMBER_OF_TABLES FROM INFORMATION_SCHEMA.TABLES;

We recommend running this query on a cloned cluster to minimize the impact on production, because running it on a production instance with a large number of tables, under live production load, can take a long time and degrade the database's performance.
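A per-schema breakdown can also help you spot schemas that hold backup or archival tables worth cleaning up. The following is a variant of the same query that excludes the system schemas.

SELECT TABLE_SCHEMA, COUNT(*) AS TABLE_COUNT
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
GROUP BY TABLE_SCHEMA
ORDER BY TABLE_COUNT DESC;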

Cluster has a high number of undo records

The in-place upgrade mechanism involves shutting down your DB cluster while the operation takes place. Amazon Aurora MySQL performs a clean shutdown and completes outstanding operations such as undo purge. An upgrade, even with the Amazon RDS Blue/Green Deployments and snapshot restore methods, might take a long time if there is a high number of undo records to purge.

The value of the Amazon Aurora MySQL CloudWatch metric RollbackSegmentHistoryListLength (HLL) on the writer instance of the cluster indicates the number of undo records stored by the database to implement multi-version concurrency control (MVCC). You can also run SHOW ENGINE INNODB STATUS and look for the value of History list length in the TRANSACTIONS section. Consider performing the upgrade only after the history list length is low. A generally acceptable value for the history list length is 100,000 and below. However, how quickly the HLL comes down depends on many factors, such as application workload, schema properties, and instance and cluster configuration. If you need to address a high HLL issue, refer to Why is my SELECT query running slowly on my Amazon Aurora MySQL DB Cluster?, The InnoDB history list length increased significantly, and Purge Configuration. You can also use Amazon RDS Blue/Green Deployments to work around the high HLL issue. While it won't decrease the upgrade time on the green cluster, it will help to minimize the application downtime. However, note that the green cluster might take longer to catch up due to the prolonged upgrade.
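As an alternative to parsing the SHOW ENGINE INNODB STATUS output, you can read the history list length from INFORMATION_SCHEMA.INNODB_METRICS, assuming the trx_rseg_history_len counter is enabled (it is by default).

-- Current history list length as tracked by InnoDB
SELECT NAME, COUNT
FROM INFORMATION_SCHEMA.INNODB_METRICS
WHERE NAME = 'trx_rseg_history_len';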

Cluster has large ongoing write transactions (uncommitted changes for many rows)

Similar to undo purge, transaction rollback happens during the clean shutdown, which can prolong the upgrade if there are many rows to roll back. Note that the time it takes to roll back an incomplete transaction can be three or four times the amount of time the transaction was active before it was interrupted, depending on the server load. There is no easy way to estimate the rollback time, and you can't cancel transactions that are being rolled back. Simply restarting the database or failing over the cluster will not speed up the rollback, and can make it slower in some cases because data has to be reloaded from disk into memory.

Therefore, you should check the total number of uncommitted rows by running the following query before proceeding with the upgrade:

SELECT * FROM INFORMATION_SCHEMA.INNODB_TRX ORDER BY TRX_ROWS_MODIFIED DESC;

The table contains one row for each transaction. The TRX_ROWS_MODIFIED column contains the number of rows modified or inserted by the transaction. Consider performing the upgrade only after all large transactions are committed or rolled back.
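If you have already interrupted a large transaction, you can check whether it is still rolling back by querying the same table for the ROLLING BACK state.

-- Lists transactions currently being rolled back and the number of rows they modified
SELECT trx_id, trx_state, trx_started, trx_rows_modified
FROM INFORMATION_SCHEMA.INNODB_TRX
WHERE trx_state = 'ROLLING BACK';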

Cluster has long-running or uncommitted idle transactions

Long-running transactions (read-only or read-write) or uncommitted idle transactions can hold table locks, blocking the upgrade prechecks from completing. As a consequence, the upgrade might take a very long time or even appear to be stuck. When this happens, if you run SELECT * FROM information_schema.processlist WHERE USER='rdsadmin', you might see the FLUSH LOCAL TABLE command waiting in the state Waiting for table flush, or the CHECK TABLE command waiting in the state Waiting for table metadata lock. In this case, you can use the following queries to identify the blocking transactions and commit, roll back, or kill them as appropriate.

SELECT * FROM information_schema.innodb_trx\G
SELECT * FROM information_schema.processlist WHERE id = <trx_mysql_thread_id from the innodb_trx output above>;

CALL mysql.rds_kill(<id from the processlist output above>);

To overcome this issue, identify long-running or uncommitted idle transactions before initiating the upgrade process. You can do so by running the following query:

SELECT a.trx_id, a.trx_state, a.trx_started,
       TIMESTAMPDIFF(SECOND, a.trx_started, NOW()) AS "Seconds Transaction Has Been Open",
       a.trx_rows_modified, b.id, b.user, b.host, b.db, b.command, b.time, b.state
FROM information_schema.innodb_trx a, information_schema.processlist b
WHERE a.trx_mysql_thread_id = b.id
ORDER BY a.trx_started;

Then wait for the transaction to complete, or kill it if possible. Note that if you kill a large write transaction, it may take a long time to roll back, as mentioned in the previous section.

CALL mysql.rds_kill(<id from the processlist output above>);

You can use the same queries to identify whether a cluster is processing any data definition language (DDL) statements. Make sure you perform the upgrade after all DDL statements (CREATE, DROP, ALTER, RENAME, and TRUNCATE) have finished; otherwise, Amazon Aurora MySQL cancels the upgrade. It is not advisable to interrupt a DDL statement while it is running, because doing so might cause data dictionary inconsistencies for the involved table.
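As a quick way to spot in-flight DDL statements, you can also filter the processlist for common DDL keywords. This is a best-effort sketch, and the keyword list may need adjusting for your workload.

-- Shows sessions whose current statement starts with a common DDL keyword
SELECT id, user, host, db, time, state, info
FROM information_schema.processlist
WHERE info REGEXP '^[[:space:]]*(ALTER|CREATE|DROP|RENAME|TRUNCATE|OPTIMIZE)[[:space:]]';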

Conclusion

In this post, we discussed the most common causes of prolonged or unsuccessful upgrades. With Amazon Aurora MySQL version 2 reaching its end of life on October 31, 2024, we encourage you to upgrade your Aurora MySQL version 2 clusters to the default minor version of Aurora MySQL v3 or higher at your earliest convenience to take advantage of the newer features and optimizations available in Aurora MySQL v3.


About the Authors

Huy Nguyen is a Senior Engineer in AWS Support. He specializes in Amazon RDS and Amazon Aurora. He provides guidance and technical assistance to customers, enabling them to build scalable, highly available, and secure solutions in the AWS Cloud.

Leevon Abuan is a Database Engineer in AWS Support. He focuses on Amazon RDS and Amazon Aurora, and has been helping customers resolve their complex technical issues when running databases in the cloud.