Best Practices for Upgrading Amazon RDS for MySQL and Amazon RDS for MariaDB
One key feature of Amazon RDS is that you can easily upgrade a database instance to a new minor or major version of your database engine. You can perform upgrades on demand by using the AWS Management Console or AWS CLI. You can decide to automatically upgrade to a new minor engine version when it is marked as the preferred version by the Amazon RDS service team. You also need to upgrade your database engine version when an existing version is deprecated. Such deprecation happens when a version is no longer supported or has functional issues that are resolved in a future version.
Although Amazon RDS makes upgrades simple, you still need to be aware of a number of caveats to get the best experience. In this post, I discuss some of the caveats about upgrading versions of the MySQL and MariaDB database engines for Amazon RDS in the following topics:
- What happens behind the scenes during an engine upgrade?
- Problems that can occur during an upgrade
- Ways to detect problems before scheduling an upgrade
- How to minimize downtime using read replicas
What happens behind the scenes during an engine upgrade?
Amazon RDS supports both minor and major engine upgrades. A minor engine upgrade keeps the major version the same while advancing the minor version, for example going from 5.6.27 to 5.6.37. In contrast, a major version upgrade advances both minor and major version numbers, for example going from 5.6.27 to 5.7.19.
Minor version upgrades include only database changes that are backward-compatible with previous minor versions (of the same major version) of the database engine. The distinction between minor and major version upgrades is important. A major version upgrade can take longer than a minor version upgrade. You can also run into compatibility issues during or after a major version upgrade, which are very unlikely after a minor version upgrade.
Also, you can advance the major version only one step at a time. For example, an instance running 5.5.46 that needs to go to 5.7.19 first upgrades to a 5.6 version like 5.6.37 and then to 5.7.19. Additionally, usually the target version is restricted to the most recent version. For example, you might be able to upgrade an instance from 5.6.37 to 5.7.19, but not from 5.6.37 to 5.7.16.
For both minor and major version upgrades, the system takes the following steps during the upgrade process for MySQL and MariaDB for Amazon RDS:
- A snapshot is taken (if backups are enabled) while the instance is still running the previous version.
- The instance shuts down in a special mode called a slow shutdown to ensure data consistency.
- The instance is restarted with mysqld running the new engine version with networking disabled to prevent remote connections.
- The mysql_upgrade program runs on the database to convert system and user tables to the new version.
- An RDS-specific upgrade script runs to upgrade RDS-managed tables and stored procedures.
- The instance is restarted with networking enabled to allow for remote connections.
The main difference between the process for a minor version upgrade and a major version upgrade is the use of the mysql_upgrade program. This program checks all tables (both system and user) on the database. Depending on the state of a table, the program might need to physically recreate the table to conform to the new version. In most cases, no physical changes are needed and mysql_upgrade merely updates metadata in the table to indicate that the table has been upgraded.
In the case of a minor version upgrade, these table checks are usually trivial. The entire process usually takes only a few seconds to a few minutes depending on the number of tables. However, a major version upgrade can require more extensive changes to tables, as discussed later.
Problems that can occur during an upgrade
The main problems you might run into during or after an upgrade fall into these categories:
- Shutdown takes an inordinate amount of time.
- Latent corruption is triggered during the mysql_upgrade run.
- Physical rebuild of certain tables during a major version upgrade takes hours to days.
- Issues or regressions occur due to new version-specific bugs.
Let’s discuss these issues in more detail.
Shutdown takes an inordinate amount of time
Normally when mysqld shuts down, it performs a fast shutdown. In this mode, no new connections or transactions are allowed to start. The database runs a checkpoint to flush all dirty pages in the buffer cache to disk, while skipping other types of flush operations. However, when performing a minor or major version upgrade, Amazon RDS employs a slow shutdown by setting the innodb_fast_shutdown parameter to 0. This mode performs additional flushing, such as merging the change buffer and performing a full purge of deleted rows. This type of shutdown can take much longer than a normal shutdown. However, Amazon RDS uses it to prevent certain kinds of bugs and crashes from occurring after the upgrade. Slow shutdowns of a database with a very large change buffer can take up to several hours.
Latent corruption is triggered during the mysql_upgrade run
The Amazon RDS infrastructure is designed to prevent physical corruption to database tables. Unfortunately, physical corruption sometimes happens (along with logical corruption) from bugs in the MySQL engine. In most cases, corruption is detected immediately during normal workload processing. However, sometimes corruption can exist in an infrequently accessed table, such as one used for archival data. In this case, a corrupted table can remain under the radar for days, months, or years until it’s brought back to life during the mysql_upgrade sweep. Depending on the type of corruption, it can cause the database to enter a crash-loop during the upgrade and never complete the upgrade.
Physical rebuild of certain tables during a major version upgrade takes hours to days
In most cases, the upgrade process does not need to physically rebuild tables. The main exception is during a major version upgrade to 5.7 if the database once ran on an engine version before 5.6.4. Starting in 5.6.4, there was a format change in how temporal data was physically stored to support a fractional part. Temporal data types include the TIME, DATETIME, and TIMESTAMP types. Most databases include tables with temporal data, given how fundamental storing time and date values within a database is.
Therefore, if an RDS instance was originally a 5.5 instance that was later upgraded to 5.6 and now to 5.7, it requires physical table rebuilds on tables with temporal data. For small to midsized tables on general purpose (SSD) or provisioned IOPS storage, this table rebuild can add a few minutes to the upgrade. However, on magnetic storage, multi-gigabyte tables can lead to upgrades that take hours to days, even. You can find more details on how to detect and handle this issue in Upgrading the MySQL DB Engine in the Amazon RDS documentation.
Issues or regressions occur due to new version-specific bugs
Most customers report no problems after upgrading to a new version. In some cases, a bug might be present in the new version that causes a crash, performance regression, or incorrect query results. For Amazon RDS, we certify engine compatibility as much as possible with an extensive set of tests. In the end, though, we are presenting the MySQL database software as is from Oracle or MariaDB with only minimal changes to the native source code. Major version upgrades tend to be riskier in terms of the possibility of bugs, due to the wider changes in the native source code compared to minor version upgrades.
Ways to detect problems before scheduling an upgrade
The best way to detect problems before scheduling an upgrade is to perform a dry run. Thanks to features like snapshot restore and read replicas, you can easily spin up a copy of an existing production instance and attempt an upgrade. Doing this gives you a sense of how long the upgrade takes and whether there is any latent corruption. Checking for post-upgrade bugs or regressions requires additional testing with queries like the ones that you run in production.
A newly restored snapshot or replica can show high I/O latency on reads and writes. These high latencies can make for a longer than normal upgrade. For this reason, we recommend that you prewarm any restored instance before performing any upgrade testing. One way to prewarm a restored instance is to run mysqldump (with output directed to /dev/null) on frequently accessed tables.
One test that you might find somewhat difficult to emulate with a restored instance is how long a slow shutdown takes. Although activities like the time it takes to do a change buffer merge should match what happens in production, some shutdown variables require realistic loads. For example, the time it takes to write dirty pages to disk depends on the current workload. For this reason, try to run a synthetic workload against the restored instance during the upgrade or replay a captured workload from the general log.
Regardless, don’t perform an upgrade in production without having first done a dry run. Although Amazon RDS automates many database operations, some activities should be performed by someone with a good understanding of your database. Doing an upgrade dry run is one of these activities.
How to minimize downtime using read replicas
Typically in a self-managed or on-premises environment, a DBA minimizes downtime on an upgrade by using a rolling upgrade using read replicas. Amazon RDS doesn’t fully automate one-click rolling upgrades. However, you can still perform a rolling upgrade by creating a read replica, upgrading the replica, promoting the replica, and then routing traffic to the promoted replica. The following diagram shows this process, and you can find full details in Upgrading the MySQL DB Engine in the Amazon RDS documentation.
One other caveat about upgrade downtime is how Multi-AZ fits into the picture. One common fallacy is that Multi-AZ configurations prevents downtime during an upgrade. We do recommend that you use Multi-AZ for high availability, because it can prevent extended downtime due to hardware failure or a network outage. However, in the case of a MySQL or MariaDB engine upgrade, Multi-AZ doesn’t eliminate downtime. The slow shutdown and the physical changes made on the active server by the mysql_upgrade program require this downtime.
Major and minor version database engine upgrades are an important feature of Amazon RDS. Version upgrades enable you to exercise new engine features and reduce the risk of encountering functional or security issues. Amazon RDS makes it easy to upgrade an instance to a new engine version either automatically or with a few clicks in the AWS Management Console. Still, we do recommend that you plan ahead and test defensively before a version upgrade to handle issues mentioned in this post.
As always, AWS strives to invent and simplify on your behalf. We are working on making the RDS upgrade process even more reliable and streamlined for the future.
About the Authors
Jaime Lichauco is a database engineer in RDS team at Amazon Web Services.