AWS Database Blog

Use pgactive for rolling major version upgrades in Amazon RDS for PostgreSQL

In this post, we explore how pgactive can perform rolling major version upgrades for Amazon Relational Database Service (Amazon RDS) for PostgreSQL, allowing for a smoother transition with reduced impact on your applications.

PostgreSQL database upgrades in Amazon RDS

Database upgrades are essential for maintaining optimal performance and security. Each new major version release introduces enhanced features, performance optimizations, and critical security updates that protect against vulnerabilities.

RDS for PostgreSQL databases support both minor and major version upgrades. Minor version upgrades, which typically include bug fixes and security patches, are released periodically within a major version. These updates are backward-compatible, making sure that existing applications remain unaffected. You can initiate a minor version upgrade manually by modifying your database, or you can enable the Auto minor version upgrade option.

Unlike minor upgrades, major version upgrades introduce new, non-backward-compatible features and database changes. Major versions are typically released annually and are supported for about 5 years. Major upgrades can take significantly longer due to factors such as the number of database objects.

You can minimize downtime during minor version upgrades by using Amazon RDS Proxy or an open source database proxy such as PgBouncer. When upgrading minor versions of RDS Multi-AZ DB clusters, it’s possible to achieve downtime of 1 second or less by using the Advanced JDBC Wrapper Driver.

Amazon RDS for PostgreSQL provides multiple upgrade options, including in-place upgrades. For customers who require short, predictable downtime, Amazon RDS supports a fully managed blue/green deployment option. This approach involves provisioning a new DB instance with the target PostgreSQL version, validating the upgrade, and then swapping the database endpoints to minimize disruption.

Rolling upgrades with fallback using pgactive

Although blue/green deployment is effective for minimizing downtime during major version upgrades, some customers opt for a rolling upgrade approach to maintain high availability. In this method, new nodes running the updated version join an existing node that is configured to accept writes, with bidirectional data replication set up between them. This enables a gradual and smooth transition to the new database version while minimizing disruption to the production environment.

The pgactive extension enables active-active replication for Amazon RDS for PostgreSQL, allowing all participating nodes to handle read and write operations simultaneously. Data changes are automatically synchronized across all nodes in the cluster using PostgreSQL logical replication. pgactive includes built-in conflict detection, automatic resolution methods (such as last-write-wins), and conflict monitoring.
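To illustrate the last-write-wins idea conceptually, the following Python sketch shows how a conflict between two versions of the same row can be resolved by commit timestamp. This is illustrative only; pgactive resolves conflicts internally and this is not its implementation.

```python
from datetime import datetime, timezone

def last_write_wins(local_row, remote_row):
    """Return the row version with the later commit timestamp.

    Each row is a dict carrying a 'committed_at' datetime; on a tie,
    the local version is kept. Illustrative sketch only -- pgactive
    performs conflict resolution internally.
    """
    if remote_row["committed_at"] > local_row["committed_at"]:
        return remote_row
    return local_row

# Two nodes updated the same row concurrently; the later write wins.
local = {"id": 1, "val": "a",
         "committed_at": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)}
remote = {"id": 1, "val": "b",
          "committed_at": datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)}
winner = last_write_wins(local, remote)
```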

The following are some of the key benefits of using a logical replication mechanism like pgactive for a database upgrade:

  • Multi-tenant support – If you’re running an application in a multi-tenant environment, pgactive allows you to selectively replicate only the required schemas or databases to a higher version. Starting with pgactive version 2.1.4, it’s now possible to include or exclude specific tables.
  • Minimal downtime – Upgrading with pgactive significantly reduces downtime. The only interruption required is updating the application’s connection string to the new database.
  • Seamless rollback – By setting up active-active replication with pgactive, you can roll back your application to an older database version with minimal downtime and operational complexity.

Solution overview

For this post, we show an example for a major version upgrade. We use the source database version 14.12 and target version 16.3. The following diagram illustrates the solution architecture.

The process consists of the following steps:

  1. Set up the v14.12 source database.
  2. Set up the v16.3 (or higher) version database.
  3. Set up pgactive parameters and endpoints on both the v14.12 and v16.3 databases.
  4. Verify that data replicates bidirectionally between v14.12 and v16.3, and then point the application to the higher version (v16.3).

In the following sections, we walk through these steps with code examples.

Prerequisites

To follow along with this post, you should have the following prerequisites:

Create the source and target databases

In a real-world use case, you should already have a production database set up. For this post, we set up our source and target databases using the AWS Command Line Interface (AWS CLI). Complete the following steps:

  1. Use the following code to create a v14.12 database as the source database:
    aws rds create-db-instance \
    --engine postgres \
    --db-instance-identifier pg-14 \
    --engine-version 14.12 \
    --db-instance-class db.m5d.large \
    --vpc-security-group-ids mysecuritygroup \
    --db-subnet-group mydbsubnetgroup \
    --master-username masterawsuser \
    --manage-master-user-password 
  2. Connect to the database using psql and run the following query to verify the database version:
    # This needs to be run using a SQL client connected to the source DB
    postgres=> select version(); 
    							version 
    -------------------------------------------------------------------- 
    PostgreSQL 14.12 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-12), 64-bit
    
  3. Set up a higher version (v16.3) database environment for upgrade purposes:
    aws rds create-db-instance \
    --engine postgres \
    --db-instance-identifier pg-16 \
    --engine-version 16.3 \
    --db-instance-class db.m5d.large \
    --vpc-security-group-ids mysecuritygroup \
    --db-subnet-group mydbsubnetgroup \
    --master-username masterawsuser \
    --manage-master-user-password
  4. Connect to the new database using psql and run the following query to verify the database version:
    # This needs to be run using a SQL client connected to the target DB
    postgres=> select version();
     				version   
    ----------------------------------------------------------------- 
    PostgreSQL 16.3 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-12), 64-bit

Set up the pgactive extension between v14.12 and v16.3

To set up pgactive on both the source and target databases, follow these steps:

  1. Configure database parameters and create the pgactive extension on both the source and target databases.
  2. Set the rds.enable_pgactive parameter to 1 in a DB parameter group and assign that parameter group to your DB instance.
  3. Create a DB parameter group with rds.enable_pgactive set to 1.
    The following commands create such a parameter group for the postgres16 family; repeat them with a postgres14 family group for the pg-14 instance:

    REGION="us-east-1"
    
    aws rds create-db-parameter-group \
    --db-parameter-group-name pg16-pgactive \
    --db-parameter-group-family postgres16 \
    --description "Parameter group that contains pgactive settings for PostgreSQL 16" \
    --region "${REGION}"
    
    aws rds modify-db-parameter-group \
    --db-parameter-group-name pg16-pgactive \
    --parameters '[{"ParameterName": "rds.enable_pgactive","ParameterValue": "1","ApplyMethod": "pending-reboot"}]' \
    --region "${REGION}"

    You can also modify the parameter group directly from the AWS Management Console. For more information, see Working with DB parameter groups.

    If you’re adding the parameter to an existing RDS instance, you must restart the database for the settings to take effect. For example, to restart an RDS instance named pgactive-inst1, use the following command:

    aws rds reboot-db-instance \
    --db-instance-identifier pgactive-inst1 \
    --region "${REGION}"
  4. Retrieve the endpoints for the two RDS for PostgreSQL instances.
    For the purposes of this post, we call these endpoint-pgactive14 and endpoint-pgactive16.
  5. To check if the shared_preload_libraries parameter contains pgactive, run the following command on both the pg-14 and pg-16 instances:
    # This needs to be run using a SQL client on both instances (expected output: t)
    SELECT setting ~ 'pgactive' FROM pg_catalog.pg_settings 
    WHERE name = 'shared_preload_libraries';
    
    ?column?
    -----------
    t
  6. Install the pgactive extension on both instances:
    # This needs to be run using a SQL client on both instances
    CREATE EXTENSION IF NOT EXISTS pgactive;

Initialize the replication group

Next, you set up the connection information and credentials for the replication accounts to log in to each node. pgactive uses the foreign data wrapper infrastructure to securely manage and store these credentials.

Make sure that the shared_preload_libraries parameter includes pgactive and that the pgactive extension from the previous section is created on both the v14.12 and v16.3 databases.

  1. Run the following commands to set up the connection information for the pg-14 endpoint:
    # This needs to be run using a SQL client connected to the source DB
    -- connection info for endpoint1
    CREATE SERVER pgactive_server_endpoint1 FOREIGN DATA WRAPPER pgactive_fdw OPTIONS (host '<pg-14>', dbname 'upgrade');
    
    CREATE USER MAPPING FOR postgres SERVER pgactive_server_endpoint1 OPTIONS (user 'postgres', password '<password>');
    	
    -- connection info for endpoint2
    CREATE SERVER pgactive_server_endpoint2 FOREIGN DATA WRAPPER pgactive_fdw OPTIONS (host '<pg-16>', dbname 'upgrade');
    	
    CREATE USER MAPPING FOR postgres SERVER pgactive_server_endpoint2 OPTIONS (user 'postgres', password '<password>');
    
  2. Create the replication group on the pg-14 node and wait for it to become ready:
    # This needs to be run using a SQL client connected to the source DB
    SELECT pgactive.pgactive_create_group(node_name := 'endpoint1-app',node_dsn := 'user_mapping=postgres pgactive_foreign_server=pgactive_server_endpoint1');
    SELECT pgactive.pgactive_wait_for_node_ready();

    If the command succeeds, you should see the following output:

    pgactive_wait_for_node_ready
    ------------------------------
    
    (1 row)
  3. Run the following commands to set up the connection information on the pg-16 endpoint, and then join it to the replication group:
    # This needs to be run using a SQL client connected to the target DB
    -- connection info for endpoint1
    CREATE SERVER pgactive_server_endpoint1
    FOREIGN DATA WRAPPER pgactive_fdw
    OPTIONS (host '<pg-14>', dbname 'upgrade');
    CREATE USER MAPPING FOR postgres
    SERVER pgactive_server_endpoint1
    OPTIONS (user 'postgres', password '<password>');
    
    -- connection info for endpoint2
    CREATE SERVER pgactive_server_endpoint2
    FOREIGN DATA WRAPPER pgactive_fdw
    OPTIONS (host '<pg-16>', dbname 'upgrade');
    CREATE USER MAPPING FOR postgres
    SERVER pgactive_server_endpoint2
    OPTIONS (user 'postgres', password '<password>');
    
    -- join the pg-16 node to the group created on pg-14
    SELECT pgactive.pgactive_join_group(
      node_name := 'endpoint2-app',
      node_dsn := 'user_mapping=postgres pgactive_foreign_server=pgactive_server_endpoint2',
      join_using_dsn := 'user_mapping=postgres pgactive_foreign_server=pgactive_server_endpoint1');
    SELECT pgactive.pgactive_wait_for_node_ready();

    If the commands succeed, pgactive tries to synchronize the database. If the command runs successfully, you should see output similar to the following:

    NOTICE:  restoring database 'upgrade', 6% of 7483 MB complete
    NOTICE:  restoring database 'upgrade', 42% of 7483 MB complete
    NOTICE:  restoring database 'upgrade', 77% of 7483 MB complete
    NOTICE:  restoring database 'upgrade', 98% of 7483 MB complete
    NOTICE:  successfully restored database 'upgrade' from node node1-app in 00:1:02.15538
    pgactive_wait_for_node_ready
    ------------------------------
    
    (1 row)

On pg-16, check that the existing data is present, then insert sample data and verify that it replicates back to pg-14. Your database is now logically synced to v16.3 using pgactive. You can perform functional and stress testing to confirm that the new version behaves correctly and delivers optimal performance.

Prepare for the switchover

Now you can redirect traffic to the new database (v16.3) by updating the connection endpoint in the application. A major advantage of using pgactive is the ability to gradually shift traffic from the old database to the new one. For example, if the same database endpoint is used across multiple microservice applications, you don't need to update all microservices simultaneously. Instead, you can progressively change the endpoint to the new database (v16.3) as you test and validate each application. If you're running a multi-tenant environment, you can replicate only the required tables or databases and point each application to the new version.
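One simple way to stage such a gradual cutover is a per-service endpoint map maintained in configuration. The service names and endpoint hostnames below are hypothetical examples, not values from this walkthrough:

```python
# Hypothetical endpoints for the old (v14.12) and new (v16.3) instances.
OLD = "pg-14.example.us-east-1.rds.amazonaws.com"
NEW = "pg-16.example.us-east-1.rds.amazonaws.com"

# Services are moved one at a time as each is validated on v16.3.
ENDPOINTS = {
    "billing":   NEW,  # validated, cut over to v16.3
    "reporting": NEW,  # validated, cut over to v16.3
    "orders":    OLD,  # still on v14.12
    "inventory": OLD,  # still on v14.12
}

def endpoint_for(service):
    """Resolve the database endpoint for a service, defaulting to the
    old instance until the service is explicitly cut over."""
    return ENDPOINTS.get(service, OLD)
```

Because pgactive keeps both nodes in sync bidirectionally, services on either side of the map continue to see each other's writes during the transition.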

Limitations

1. Handling sequences

The pgactive extension is built on PostgreSQL's logical replication system, which replicates tables and data automatically. However, it comes with certain limitations, including the fact that sequence values are not synchronized after a node's initial synchronization, which may require additional handling in distributed environments.

As a one-time operation, you need to generate sequence update SQL using the following code:

SELECT 
    'SELECT setval(''' || schemaname || '.' || sequencename || ''', ' || ( 
        CASE WHEN last_value IS NULL THEN 
            start_value
        ELSE
            last_value
       END) || ', true);'
FROM pg_sequences;

Apply the output of this command on the joining node. This sets the correct last_value for each sequence on the joining node.

NOTE: pgactive has a global sequence mapping function. This function generates globally-unique values by combining three components: the sequence number, a unique node ID, and a timestamp.
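The composition of such a globally-unique value can be sketched in Python. The bit widths and layout below are illustrative assumptions, not pgactive's actual encoding:

```python
import time

def global_seq_value(seq, node_id, ts=None, seq_bits=20, node_bits=10):
    """Pack a timestamp, a node ID, and a local sequence number into one
    integer laid out as [timestamp | node_id | sequence].

    The bit widths are illustrative assumptions, not pgactive's actual
    layout. Values from different nodes never collide because the node
    ID occupies its own bit range.
    """
    ts = int(time.time()) if ts is None else ts
    assert node_id < (1 << node_bits) and seq < (1 << seq_bits)
    return (ts << (node_bits + seq_bits)) | (node_id << seq_bits) | seq
```

Two nodes generating the same local sequence number at the same instant still produce distinct values, because the node ID component differs.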

2. Materialized views

When a new node joins, materialized views are created with data as of the time the initial COPY completed. To get up-to-date data, refresh the materialized views:

REFRESH MATERIALIZED VIEW [CONCURRENTLY] view_name;

3. Support for DDL replication

If a new table is added to any node, pgactive automatically adds it to the pgactive replication set, but it is your responsibility to create the new table on each of the participating nodes. pgactive has the same DDL limitations as PostgreSQL logical replication.

4. Handling of existing logical replication

pgactive prohibits the use of the pglogical plugin and subscription functionality; in other words, no inbound logical replication is allowed. Existing slots replicating to downstream logical subscribers are permitted, but those downstream subscribers must be resynchronized after cutover to the newer version node.

5. Handling of tables without a primary key

Tables without primary keys are replicated by pgactive; however, UPDATE and DELETE operations on these tables are not. As a workaround, you can either create a unique index and use it as the replica identity, or set REPLICA IDENTITY FULL on the table.

Migrate a multi-tenant environment

The following example demonstrates a multi-tenant database migration. This approach shows how to gradually migrate specific applications in a multi-tenant environment from an older to a newer database version, eliminating the need to upgrade all applications simultaneously. For our use case, the current DB instance (v14.12) is serving applications 1, 2, 3, and 4. Complete the following steps to perform the migration:

  1. Deploy the new DB instance (v16.3).
  2. Configure the pgactive extension to enable bidirectional replication between v14.12 and v16.3.
  3. Selectively migrate applications 3 and 4 to the new database instance (v16.3), while applications 1 and 2 remain connected to the old instance (v14.12).

The following diagram illustrates this setup.

Monitor replication lag

Monitoring and alerting on replication lag is crucial for pgactive. pgactive can have lag at the decoding node or the applying node. Monitoring replication lag lets you diagnose potential issues with your active-active replication and helps mitigate the risk of introducing conflicting changes into your system or running out of disk space.

When the receiving node is down due to maintenance, networking issues, or hardware issues, the write-ahead log (WAL) accumulates on the sending node. If the issue isn't rectified in time, the sender node might run out of disk space, or the WAL might grow to a point where the receiving node can never catch up.

On the receiving node, if applying the WAL fails due to schema differences, a unique or primary key violation, or other reasons, the WAL accumulates on that node, and the node will eventually run out of disk space if the issue isn't rectified in time.
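The LSN columns reported by pgactive_get_replication_lag_info() use PostgreSQL's X/Y hex format. Converting them to byte positions, which is what the server-side function pg_wal_lsn_diff() does, yields lag in bytes. A minimal client-side sketch:

```python
def parse_lsn(lsn):
    """Convert a PostgreSQL LSN string like '0/1996D90' to an absolute
    byte position. The part before the slash is the high 32 bits."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def lag_bytes(sent_lsn, replay_lsn):
    """Bytes the subscriber still has to replay (0 when fully caught up),
    equivalent to pg_wal_lsn_diff(sent_lsn, replay_lsn) on the server."""
    return parse_lsn(sent_lsn) - parse_lsn(replay_lsn)

# Example using LSN values in the format shown in the output below:
# 0x1996D90 - 0x1996D58 = 56 bytes of replay lag.
lag = lag_bytes("0/1996D90", "0/1996D58")
```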

The following output is from a two-node pgactive setup where pgactive.pgactive_get_replication_lag_info() was executed on the pgactive1 node.

In this output, the lag between pgactive1 and pgactive2, and between pgactive2 and pgactive1, is shown:

postgres=# SELECT * FROM pgactive.pgactive_get_replication_lag_info();
-[ RECORD 1 ]--------+---------------------------------------------
node_name            | pgactive2
node_sysid           | 7412711618745234863
application_name     | pgactive:7412711618745234863:send
slot_name            | pgactive_16385_7412711618745234863_0_16385__
active               | t
active_pid           | 214501
pending_wal_decoding | 0
pending_wal_to_apply | 0
restart_lsn          | 0/1996D58
confirmed_flush_lsn  | 0/1996D90
sent_lsn             | 0/1996D90
write_lsn            | 0/1996D90
flush_lsn            | 0/1996D90
replay_lsn           | 0/1996D90
-[ RECORD 2 ]--------+---------------------------------------------
node_name            | pgactive1
node_sysid           | 7412711576138986882
application_name     | pgactive:7412711576138986882:send
slot_name            | pgactive_16385_7412711576138986882_0_16385__
active               | t
active_pid           | 214505
pending_wal_decoding | 0
pending_wal_to_apply | 0
restart_lsn          | 0/198F1B0
confirmed_flush_lsn  | 0/198F1E8
sent_lsn             | 0/198F1E8
write_lsn            | 0/198F1E8
flush_lsn            | 0/198F1E8
replay_lsn           | 0/198F1E8
-[ RECORD 3 ]--------+---------------------------------------------
node_name            | pgactive1
node_sysid           | 7412711576138986882
application_name     | pgactive:7412711576138986882:send
slot_name            | pgactive_16385_7412711576138986882_0_16385__
active               | t
active_pid           | 214582
pending_wal_decoding | 0
pending_wal_to_apply | 0
restart_lsn          | 0/198EC30
confirmed_flush_lsn  | 0/198EC68
sent_lsn             | 0/198EC68
write_lsn            | 0/198EC68
flush_lsn            | 0/198EC68
replay_lsn           | 0/198EC68
-[ RECORD 4 ]--------+---------------------------------------------
node_name            | pgactive2
node_sysid           | 7412711618745234863
application_name     | pgactive:7412711618745234863:send
slot_name            | pgactive_16385_7412711618745234863_0_16385__
active               | t
active_pid           | 214585
pending_wal_decoding | 0
pending_wal_to_apply | 0
restart_lsn          | 0/198EC30
confirmed_flush_lsn  | 0/198EC68
sent_lsn             | 0/198EC68
write_lsn            | 0/198EC68
flush_lsn            | 0/198EC68
replay_lsn           | 0/198EC68

At a minimum, the following alerts should be monitored:

  • active – Alert when active is false, which indicates that the slot is not currently in use (the subscriber instance has disconnected from the publisher).
  • pending_wal_decoding – In PostgreSQL’s logical replication, WAL files are stored in binary format. The publisher must decode these WAL changes and convert them into logical changes (such as insert, update, or delete operations). The metric pending_wal_decoding indicates the number of WAL files waiting to be decoded on the publisher side. Several factors can cause this number to increase:
    • When the subscriber is not connected, active status will be false and pending_wal_decoding will increase.
    • The slot is active, but the publisher isn’t keeping up with the volume of changes in the WAL.
  • pending_wal_to_apply – The metric pending_wal_to_apply indicates the number of WAL files waiting to be applied on the subscriber side. This number can increase due to several factors that prevent the subscriber from applying changes and can potentially cause a disk full scenario:
    • Schema differences. For example, when you have changes in the WAL stream for a table named sample, but that table doesn’t exist on the subscriber side.
    • Values in the primary key columns were updated.
    • Secondary unique indexes can cause data divergence.
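These checks can be scripted against the lag-info output. The following sketch evaluates one row from pgactive.pgactive_get_replication_lag_info() against the alerts above; the zero-tolerance threshold is an illustrative assumption you should tune for your workload:

```python
def evaluate_alerts(row, max_pending=0):
    """Return a list of alert messages for one lag-info row.

    `row` is a dict of column name -> value, as fetched from
    pgactive.pgactive_get_replication_lag_info(). The max_pending
    threshold is an illustrative assumption.
    """
    alerts = []
    if not row["active"]:
        alerts.append("slot %s is inactive (subscriber disconnected)"
                      % row["slot_name"])
    if row["pending_wal_decoding"] > max_pending:
        alerts.append("%d WAL files pending decode on the publisher"
                      % row["pending_wal_decoding"])
    if row["pending_wal_to_apply"] > max_pending:
        alerts.append("%d WAL files pending apply on the subscriber"
                      % row["pending_wal_to_apply"])
    return alerts
```

A healthy row like the ones in the sample output above (active = t, both pending counters at 0) produces no alerts.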

Rollback

Because the pgactive extension is already set up between the new primary and the old production database, transactions continue to replicate to the old primary (v14.12), keeping it in sync. If you need to roll back, you can simply point the application back to the old primary.

Clean up

To avoid incurring future charges, delete the resources you created. If you want to delete the old database after the upgrade, complete the following steps:

  1. Detach the old database from the replication group:
    # This needs to be run using a SQL client connected to the source DB
    SELECT pgactive.pgactive_detach_nodes(ARRAY['endpoint-pgactive14', 'endpoint-pgactive16']);
  2. Drop the pgactive extension:
    # This needs to be run using a SQL client connected to the source DB
    DROP EXTENSION pgactive;
  3. Delete the old DB instance.

Conclusion

In this post, we demonstrated how to use the pgactive replication extension to upgrade your database, reduce downtime, and increase application availability. Active-active replication is a powerful database replication technique commonly used for high availability and other use cases.

If you have any questions, comments, or feedback, please leave them in the comments section.


About the Authors

Santosh Bhupathi is a Senior Database Specialist Solutions Architect based in Philadelphia. With a focus on relational and NoSQL databases, he provides guidance and technical assistance to customers to help them design, deploy, and optimize database workloads on AWS.

Chirag Dave is a Principal Solutions Architect with Amazon Web Services, focusing on managed PostgreSQL. He maintains technical relationships with customers, making recommendations on security, cost, performance, reliability, operational efficiency, and best practice architectures.