AWS Database Blog
Amazon Aurora MySQL version 2 (with MySQL 5.7 compatibility) to version 3 (with MySQL 8.0 compatibility) upgrade checklist, Part 2
In the first part, we discussed the most common issues that will cause upgrade prechecks from Amazon Aurora MySQL-Compatible Edition v2 to v3 to fail. In this post, we discuss the most common causes of prolonged and unsuccessful upgrade.
Cluster has XA transactions in the prepared state
Amazon Aurora MySQL cancels the upgrade if it detects XA transactions in the prepared state in the database when it is being upgraded. Failing over or rebooting database will not get rid of prepared XA transactions. You must discover and then commit or roll back prepared XA transactions before the upgrade. The following codes is a quick example to show how.
Open a session and start a XA transaction then close the session.
An xid
is an XA transaction identifier. It indicates which transaction the statement applies to. xid
values are supplied by the client or generated by the MySQL server. An xid
value has from one to three parts: gtrid [, bqual [, formatID]]
. gtrid
is a global transaction identifier, bqual
is a branch qualifier, and formatID
is a number that identifies the format used by the gtrid
and bqual
values. bqual
and formatID
are optional. The default bqual
value is ‘ ‘ if not given. The default formatID
value is 1 if not given.
At this point, we have a prepared XA transaction in the database. The following command discovers prepared XA transactions.
The output columns have the following meanings:
formatID
is theformatID
part of the transactionxid
gtrid_length
is the length in bytes of thegtrid
part of thexid
bqual_length
is the length in bytes of thebqual
part of thexid
data
is the concatenation of thegtrid
andbqual
parts of thexid
With these values, we can extract the following parts of the data field.
gtrid = 0x677472696474657374
bqual = 0x627175616C74657374
To rollback, you use the XA ROLLBACK
command
Cluster has a substantial number of tables
Aurora MySQL performs a major version upgrade as a multistage process as outlined in the public documentation. Having a large number of tables in a cluster can extend the duration of the precheck and engine version upgrade step. The engine version upgrade step occurs in 2 steps: data dictionary upgrade and server upgrade.
There is a big change in the way the MySQL data dictionary is stored between MySQL 8.0 and MySQL 5.7. In MySQL 5.7 and lower, the MySQL dictionary data was stored in datafiles (.frm files, .par files, .trn files). For example, if innodb_file_per_table =1, which is the default value, each InnoDB table will have its own .frm file. In MySQL 8.0, however, the MySQL dictionary data is stored centrally in tables in the mysql schema. This brings benefits that you can learn about in the MySQL documentation. In the data dictionary upgrade step above, the server creates data dictionary tables with updated definitions, copies persisted metadata to the new tables, atomically replaces the old tables with the new ones, and reinitializes the data dictionary. Hence, when the cluster contains a large number of tables, the removal and migration of numerous metadata files to tables may extend the overall upgrade duration. The Amazon Aurora MySQL version upgrade is quite similar, and therefore, the impact of having numerous tables applies as well.
In the server upgrade step, the server processes the tables in the user schemas as necessary. It will check and try to repair tables if problems are found, which can be time-consuming if there are lots of tables in a cluster, especially for large tables.
To mitigate the impact of a high table count during the v2 to v3 upgrade, we suggest reviewing and eliminating unused tables, such as backup tables and old table partitions. Upgrading the cluster’s major version with a substantial number of tables can be resource-intensive, potentially exceeding your daily workload. If resource contention arises during a test upgrade, consider temporarily scaling up your production instance class and scaling back down once the upgrade is complete. You should monitor key Amazon CloudWatch metrics, including CPUUtilization
for CPU contention, FreeableMemory
and SwapUsage
for memory pressure, and NetworkThroughput
and StorageNetworkThroughput
for instance throughput. You can run the following query to get the count of tables in your cluster.
We recommend to run it in a cloned cluster to minimize impact to production as running this query in a production instance with a large number of tables, with live production load, can take a long time and degrade the database’s performance.
Cluster has a high number of undo records
The in-place upgrade mechanism involves shutting down your DB cluster while the operation takes place. Amazon Aurora MySQL performs a clean shutdown and completes outstanding operations such as undo purge. An upgrade, even with RDS Blue/Green deployment and snapshot restore methods, might take a long time if there are a high number of undo records to purge.
The value of the Amazon Aurora MySQL CloudWatch metric RollbackSegmentHistoryListLength (HLL) on the writer instance of the cluster indicates the number of undo records stored by the database to implement multi-version concurrency control (MVCC). You can also run SHOW ENIGNE INNODB STATUS
and look for the value of History list length
in the TRANSACTIONS
section. Consider performing the upgrade only after the history list length is low. A generally acceptable value for the history list length is 100,000 and below. However, the speed of HLL to come down depends on many factors, such as application workload, schema properties, and instance cluster configuration. If you need to address a high HLL issue, refer to Why is my SELECT query running slowly on my Amazon Aurora MySQL DB Cluster?, The InnoDB history list length increased significantly, and Purge Configuration. You can also use Amazon RDS Blue/Green Deployments to workaround the high HLL issue. While it won’t decrease the upgrade time on the green cluster, it will help to minimize the application downtime. However, note that the green cluster might take longer to catch up due to the prolonged upgrade.
Cluster has large ongoing write transactions (uncommitted changes for many rows)
Similar to undo purge, transaction rollback happens during the clean shutdown, which may cause a prolonged upgrade if there are lots of rows to rollback. Note that the time it takes to roll back an incomplete transaction can be three or four times the amount of time a transaction has been active before it is interrupted, depending on the server load. There is no easy way to estimate the rollback time. You can’t cancel transactions that are being rolled back. Simply restarting the database or failing over cluster will not help speed up rolling back transactions and can make it slower in some cases as some data is reloaded from disk to memory.
Therefore, you should check the total number of uncommitted rows by running the following query before proceeding with the upgrade:
The table contains one row for each transaction. The TRX_ROWS_MODIFIED
column contains the number of rows modified or inserted by the transaction. Consider performing the upgrade only after all large transactions are committed or rolled back.
Cluster has long-running or uncommitted idle transactions
Read-only or read-write long-running transactions or uncommitted idle transactions could hold table locks, thereby blocking the upgrade prechecks from completing. As a consequence, the upgrade might take very long or even appear to be stuck. When this happens, if you run select * from information_schema.processlist where USER='rdsadmin'
, you might see the FLUSH LOCAL TABLE command is waiting in the state Waiting for table flush
, or the CHECK TABLE command is waiting in the state Waiting for table metadata lock
. In this case, you can use the following queries to identify the blocking transactions and commit or roll back or kill them accordingly if possible.
To overcome this issue, identify long-running or uncommitted idle transactions before initiating the upgrade process. You can do so by running the following queries
Then wait for it to complete or kill it if possible. Note that if you kill a large write transaction, it may take long time to rollback as mentioned in the above section.
You can use the same queries to identify if a cluster is processing any data definition language (DDL) statements. Make sure you perform the upgrade after all DDL statements (CREATE, DROP, ALTER, RENAME, and TRUNCATE
) are finished, otherwise Amazon Aurora MySQL will cancel the upgrade. It is not advisable to interrupt a DDL while it is running as it might cause data dictionary inconsistency issue for the involved table.
Conclusion
In this post, we discussed the most common causes of prolonged and unsuccessful upgrade. With Amazon Aurora MySQL version 2 reaching its end of life on October 31, 2024, we encourage you to upgrade your Aurora MySQL version 2 clusters to the default minor version of Aurora MySQL v3 or higher at your earliest convenience to take advantage of newer features and optimizations available in Aurora MySQL v3.
About the Authors
Huy Nguyen is a Senior Engineer in AWS Support. He specializes in Amazon RDS, Amazon Aurora. He provides guidance and technical assistance to customers, enabling them to build scalable, highly available, and secure solutions in the AWS Cloud.
Leevon Abuan is a Database Engineer in AWS Support. He focuses on Amazon RDS and Amazon Aurora, and has been helping customers on resolving their complex technical issues when running Databases in the Cloud.