How do I troubleshoot issues with major version upgrades in Amazon RDS for PostgreSQL and Aurora for PostgreSQL?

13 minute read
0

My engine version upgrade for Amazon Relational Database Service (Amazon RDS) for PostgreSQL or Amazon Aurora PostgreSQL-Compatible Edition is stuck or has failed.

Short description

When Amazon RDS supports a new version of a database engine, you can upgrade your DB instances to the new version. You can perform a minor version grade or a major version upgrade for your DB instances.

Minor version upgrades are used to patch security vulnerabilities and fix bugs. These upgrades usually don't add any new functionality, and don't change the internal storage format. They are always compatible with the earlier and later minor releases of the same major version. However, major version upgrades contain database changes that aren't backward-compatible with existing applications. These upgrades might change the internal format of system tables, data files, and data storage. Amazon RDS uses the PostgreSQL utility pg_upgrade to perform major version upgrades.

During a major version upgrade of a PostgreSQL instance, Amazon RDS runs a precheck procedure. This procedure identifies any issues that might cause the upgrade to fail. It checks for potential incompatible conditions across all databases. If Amazon RDS identifies an issue during the precheck process, it creates a log event for the failed precheck. For more information on the precheck process for all databases, check the pg_upgrade_precheck.log upgrade log. Amazon RDS appends a timestamp to the file name. RDS events might also provide the reasons for upgrade failure. But for issues that are engine specific, you must check the database log files.

For more information, see Viewing and listing database log files for RDS for PostgreSQL. Or, see Viewing and listing database log files for Aurora for PostgreSQL.

During a major version upgrade, RDS completes these steps:

  1. Create a snapshot of the instance before the upgrade. This happens only if you set the backup retention period for your DB instance to a number greater than zero.
  2. Shut down the instance.
  3. Use the pg_upgrade utility to run the upgrade job on the instance.
  4. Create a snapshot of the instance after the upgrade.

Resolution

Although Amazon RDS manages these upgrades, you might encounter the following issues during a version upgrade:

  • Upgrade takes a longer time.
  • Upgrade fails.

Upgrade takes a long time

Pending maintenance activities: Any pending maintenance activities are automatically applied with engine version upgrades. This might include applying an operating system patch on your RDS instance. In this case, the operating system patch is applied first, and then the engine version is upgraded. So, performing operating system maintenance activities leads to an increase in the time taken to complete the upgrade.

Also, if your RDS instance is in a Multi-AZ deployment, then operating system maintenance results in a failover. When you set up your instance in Multi-AZ, the backup for the instance is usually created on the secondary instance. In case of a failover, a backup is created on a new secondary instance after the upgrade. This backup on the new secondary instance might not be the latest backup. So a full backup might be triggered instead of an incremental backup. Creating a full backup can take a long time, especially if the database is very large.

To avoid this issue, look for pending maintenance activities in the Pending maintenance section in your RDS console. For Aurora for PostgreSQL, see Viewing pending maintenance.

Or, use the AWS Command Line Interface (AWS CLI) command describe-pending-maintenance-actions on your instance.

aws rds describe-pending-maintenance-actions --resource-identifier example-arn

Note: Complete these maintenance activities before performing the database engine version upgrades.

No snapshot created before the upgrade: It's a best practice to create a snapshot of the RDS or Aurora for PostgreSQL cluster snapshot before performing the upgrade. If you already turned on backups for your instance, then a snapshot is created automatically as part of the upgrade process. Creating a snapshot before the upgrade reduces the time needed for the upgrade process to complete. This is because only an incremental backup is created during the upgrade process in this case.

RDS for PostgreSQL Read replica upgrades: When you perform a major version upgrade of your primary DB instance, all the read replicas in the same Region are automatically upgraded. After the upgrade workflow starts, the read replicas wait for pg_upgrade to complete successfully on the primary DB instance. Then, the primary instance upgrade waits for the read replica upgrades to complete. You experience an outage until all the upgrades are complete. If the downtime window for the upgrade is limited, then you can promote or drop your replica instance. Then, recreate the read replicas after the upgrade is complete.

To safely upgrade the DB instances that make up your cluster, Aurora for PostgreSQL uses the pg_upgrade utility. After the writer upgrade completes, each reader instance experiences a brief outage while it's upgraded to the new major version.

Long-running transactions or high workload before the upgrade: Long-running transactions or high workload before the upgrade might increase the time taken to shut down the database and increase upgrade time.

Run this query to identify long-running transactions:

SQL>SELECT pid, datname, application_name, state, 
age(query_start, clock_timestamp()), usename, query 
FROM pg_stat_activity 
WHERE query NOT ILIKE '%pg_stat_activity%' AND 
usename!='rdsadmin' 
ORDER BY query_start desc;

Insufficient compute capacity: The pg_upgrade utility can be compute-intensive. So, it's a best practice to perform a dry-run upgrade before upgrading your production databases. You can restore a snapshot of the production instance and perform a dry run with the same instance class as that of the production database.

Upgrade fails

Unsupported DB instance classes: The upgrade might fail if the instance class of your DB instance isn't compatible with the PostgreSQL version that you're upgrading to. Be sure to check the compatibility of the instance class with the engine version. For more information, review the supported DB engines for DB instance classes for RDS for PostgreSQL. Or, review the supported DB engines for DB instance classes for Aurora for PostgreSQL.

Open prepared transactions: Prepared transactions that are open on the database might lead to upgrade failure. Be sure to commit or roll back all open prepared transactions before starting an upgrade.

Run this query to check if there are open prepared transactions on your instance:

SELECT count(*) FROM pg_catalog.pg_prepared_xacts;

In this case, the error in the pg_upgrade.log file looks similar to this:

------------------------------------------------------------------------
Upgrade could not be run on Wed Apr 4 18:30:52 2018
-------------------------------------------------------------------------
The instance could not be upgraded from 9.6.11 to 10.6 for the following reasons.
Please take appropriate action on databases that have usage incompatible with 
the requested major engine version upgrade and try the upgrade again.

*There are uncommitted prepared transactions. Please commit or rollback 
all prepared transactions.*

Unsupported data types: The upgrade fails with an error if you attempt to upgrade the database with unsupported data types, such as the following:

  • regcollation
  • regconfig
  • regdictionary
  • regnamespace
  • regoper
  • regoperator
  • regproc
  • regprocedure

Note: The data types regclass, regrole, and regtype are supported.

The PostgreSQL upgrade utility pg_upgrade doesn't support upgrading databases that include table columns using the reg* OID-referencing system data types. Remove all uses of reg* data types, except for regclass, regrole, and regtype, before attempting an upgrade.

Run this query to verify the usage of unsupported reg* data types:

SELECT count(*) FROM pg_catalog.pg_class c, pg_catalog.pg_namespace n, pg_catalog.pg_attribute a
  WHERE c.oid = a.attrelid
      AND NOT a.attisdropped
      AND a.atttypid IN ('pg_catalog.regproc'::pg_catalog.regtype,
                         'pg_catalog.regprocedure'::pg_catalog.regtype,
                         'pg_catalog.regoper'::pg_catalog.regtype,
                         'pg_catalog.regoperator'::pg_catalog.regtype,
                         'pg_catalog.regconfig'::pg_catalog.regtype,
                         'pg_catalog.regdictionary'::pg_catalog.regtype)
      AND c.relnamespace = n.oid
      AND n.nspname NOT IN ('pg_catalog', 'information_schema');

Logical replication slots: An upgrade can't occur if your instance has any logical replication slots. Logical replication slots are typically used for AWS Database Migration Service (AMS DMS) migration. They are also used for replicating tables from databases to data lakes, business intelligence tools, and other targets. Before upgrading, be sure that you know the purpose of logical replication slots that are in use and confirm that they can be deleted.

If the logical replication slots are still being used, you must not delete them. In this case, you can't proceed with upgrade.

The related error in the pg_upgrade log file looks similar to this example:

"The instance could not be upgraded because one or more databases have logical replication slots. Please drop all logical replication slots and try again."

If the logical replication slots aren't needed, run these queries to delete them:

SELECT * FROM pg_replication_slots;
SELECT pg_drop_replication_slot(slot_name);

Storage issues: While the pg_upgrade script runs, the instance might run out of space. This causes the script to fail, and you see an error message similar to this:

pg_restore: [archiver (db)] Error while PROCESSING TOC: 
pg_restore: [archiver (db)] could not execute query: ERROR: could not create file "base/12345/12345678": No space keyword" left  on device

To resolve this issue, be sure that the instance has sufficient free storage before starting the upgrade.

Unknown data types: PostgreSQL versions 10 and later don't support unknown data types. If a PostgreSQL version 9.6 database uses the unknown data type, an upgrade to version 10 shows an error message like this:

"The instance could not be upgraded because the 'unknown' data type is used in user tables. Please remove all usages of the 'unknown' data type and try again."

This is a PostgreSQL limitation, and RDS automation doesn't modify columns using the unknown data type. You might need to modify these columns manually before the upgrade.

Run this query to find columns in your database with unknown data type:

SELECT DISTINCT data_type FROM information_schema.columns WHERE data_type ILIKE 'unknown';

After identifying the columns, you can remove these columns or modify them to a supported data type.

Read replica upgrade failure (RDS for PostgreSQL only): PostgreSQL instance has read replicas, then read replica upgrade failures might cause your primary instance upgrade to get stuck. Read replica upgrade failure might also result in failure of the primary instance upgrade. A failed read replica is placed in the incompatible-restore state, and replication stops on the DB instance.

A read replica upgrade might fail for one of these reasons:

  • The read replica is unable to catch up with the primary DB instance even after the wait time.
  • The read replica is in a terminal or incompatible lifecycle state, such as storage-full or incompatible-restore.
  • When the primary DB instance upgrade starts, a separate minor version upgrade is running on the read replica.
  • The read replica uses incompatible parameters.
  • The read replica is unable to communicate with the primary DB instance to synchronize the data folder.

To resolve this issue, delete the read replica. Then, recreate a new read replica based on the upgraded primary instance after the primary instance is upgraded.

Incorrect primary user name: If the primary user name starts with "pg_", then the upgrade fails, and you see the following error message:

PreUpgrade checks failed: The instance could not be upgraded because one or more role names start with 'pg_'. Please rename  all roles with names that start with 'pg_' and try again

To resolve this issue, create another user with the rds_superuser role. You can contact AWS Support to update this user as the new primary user.

Incompatible parameter error: This error occurs if a memory-related parameter, such as shared_buffer or work_memory, is set to a higher value. This can cause the upgrade script to fail. To fix the issue, reduce the values of these parameters, and then try running the upgrade again.

Extensions not updated before the upgrade: A major version upgrade doesn’t upgrade any PostgreSQL extensions. If you didn't update the extensions before performing a major version upgrade, then you see this error in the pg_upgrade.log file:

The Logs indicates that the RDS instance ''xxxx'' has older version of PostGIS extension or its dependent extensions (address_standardizer,address_standardizer_data_us, postgis_tiger_geocoder, postgis_topology, postgis_raster) installed as against the current version required for the upgrade.

This error message indicates an issue with the PostGIS extension.

Run this query to check the default and installed versions for PostGIS and its dependent extensions:

SELECT name, default_version, installed_version
FROM pg_available_extensions WHERE installed_version IS NOT NULL AND
name LIKE 'postgis%' OR name LIKE 'address%';

If the value for installed_version is less than that of the default_version, then you must update PostGIS to the default version. To do this, run this query:

ALTER EXTENSION extension_name UPDATE TO 'default_version_number';

For more information, see Upgrading PostgreSQL extensions for RDS for PostgreSQL or Upgrading PostgreSQL extensions for Aurora PostgreSQL.

Issue in views due to change in system catalog of the target version: Columns in the certain views vary across different PostgreSQL versions.

For example, you might see an error message like this:

PreUpgrade checks failed: The instance could not be upgraded because one or more databases have views or materialized views which depend on 'pg_stat_activity'. Please drop them and try again.

This error occurs when you upgrade the database from version 9.5 to 9.6. This error is caused due to the pg_stat_activity view because the column waiting is replaced with wait_event_type and wait_event columns in version 9.6.

pg_restore: from TOC entry xxx; xxx xxxx VIEW sys_user_constraints art 
pg_restore: error: could not execute query: ERROR: column c.consrc does not exist LINE 18: "c"."consrc" AS "search_condition", ^ HINT: Perhaps you meant to reference the column "c.conkey" or the column "c.conbin".

This error occurs because the structure of the catalog pg_constraint has changed in PostgreSQL version 12.

You can resolve these issues by dropping the views based on system catalogs of the target version.

Note: Use caution when dropping these views. Be sure to consult with your DBA.

Other considerations

  • The pg_upgrade utility produces two logs: pg_upgrade_internal.log and pg_upgrade_server.log. Amazon RDS appends a timestamp to the file name for these logs. View these logs to get more information about the issues and errors encountered during the upgrade. For more information, see Monitoring Amazon RDS log files for RDS for PostgreSQL or Monitoring Amazon Aurora log files for Aurora for PostgreSQL.
  • When the upgrade is complete, upgrade the pg_statistics table by running ANALYZE on all user databases. A major upgrade doesn’t move the content of the pg_statistics table to the new version. Skipping this step might result in slow running queries.
  • If you upgraded to PostgreSQL version 10, then run REINDEX on any hash indexes you have. Hash indexes were changed in version 10 and must be rebuilt. To locate invalid hash indexes, run this SQL for each database that contains hash indexes:
SELECT idx.indrelid::regclass AS table_name, 
   idx.indexrelid::regclass AS index_name 
FROM pg_catalog.pg_index idx
   JOIN pg_catalog.pg_class cls ON cls.oid = idx.indexrelid 
   JOIN pg_catalog.pg_am am ON am.oid = cls.relam 
WHERE am.amname = 'hash' 
AND NOT idx.indisvalid;

Related information

Overview of upgrading PostgreSQL

Overview of the Aurora PostgreSQL upgrade processes

2 Comments

Can you not at least include a summary that you can append to these logs, like stop and start and a duration? That's 3 very useful lines when all you wanted is a timing information. There seems to be a lot of information will all AWS activities but not a shortened summary of this is it information.

Ed
replied 6 months ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

profile pictureAWS
MODERATOR
replied 6 months ago