Migrating PostgreSQL from on-premises or Amazon EC2 to Amazon RDS using logical replication

PostgreSQL is one of the most advanced popular open-source relational database systems. With more than 30 years of development work, PostgreSQL has proven to be a highly reliable and robust database that can handle a large number of complicated data workloads. For many, PostgreSQL is the open-source database of choice when migrating from commercial databases such as Oracle or Microsoft SQL Server. From a cloud standpoint, AWS provides two managed PostgreSQL options: Amazon Relational Database Service (RDS) for PostgreSQL and Amazon Aurora for PostgreSQL.

If you want to migrate or upgrade your PostgreSQL database from on premises to AWS-managed PostgreSQL or upgrade to a major version of PostgreSQL within AWS-managed services, you can do so through any native PostgreSQL feature by using logical replication. The pglogical extension works for the community PostgreSQL version 9.4 and higher, and is part of Amazon RDS for PostgreSQL as of version 9.6.10. The pglogical extension is a good option if you want to migrate from an environment of 9.4 to 9.6.10 or higher. PostgreSQL also introduced native logical replication as an inbuilt feature in version 10, which I won’t be covering in this post.

With AWS Database Migration Service (DMS), you can migrate on-premises PostgreSQL to Amazon RDS for PostgreSQL with minimal downtime. AWS DMS has some advantages, such as an easier setup process. It can also migrate tables that don’t have a primary key, but supports only full-load, not CDC, and validates the data between the source and target databases for tables which have primary key. It does have some limitations, which pglogical overcomes.

pglogical is helpful in the following scenarios:

When using more bytea, jsonb, and enum data types in tables.
Resetting of sequence value automatically (pglogical auto increments the current source sequence value on targetDB so, that the new rows get inserted with the next sequence value seamlessly).
If data gets truncated and loaded to tables frequently.
When creating your own replication set (for example, if you want to replicate only inserts and updates to the target database).

This post covers logical replication (using the pglogical extension), a use case of pglogical, and its limitations.

Overview of logical replication

Logical replication is a method of replicating data objects and their changes based upon their replication identity (usually a primary key). The term logical is in contrast to physical replication, which uses exact block addresses and byte-by-byte replication. Logical replication allows you to stream changes from a database in real time to a target database.

Physical vs. logical replication

Physical replication sends data to the replica in a binary format. Binary replication replicates the entire cluster in an all-or-nothing fashion; there is no way to get a specific table or database using binary replication. It is a complete cluster- and instance-level replication.

Logical replication sends data to the replica in a logical format. Additionally, logical replication can send data for a single table, database, or specific columns in a table.

The following diagram illustrates the architecture of logical replication

The architecture has the following features:

pglogical works on a per-database level, not the whole server level like physical streaming replication
The provider can send changes to multiple subscribers without incurring additional disk write overhead
Single subscribers can accept changes from multiple databases and detect conflicts between changes with automatic and configurable conflict resolution
Cascading replication is implemented in the form of change set forwarding

How logical replication works

Logical replication uses a publish and subscribe mechanism with one or more subscribers that subscribe to one or more publications on a publisher node. Subscribers pull data from their publications and may subsequently re-publish data to allow cascading replication or more complex configurations.

The master node (source database) defines the publication; that node is the publisher. The publisher always sends changed data (DMLs) to the subscriber and can send the data to multiple subscribers.

The replica server (target database) defines the subscription; that node is the subscriber. The subscriber accepts the data from multiple publishers and applies the changes to the target database.

pglogical asynchronously replicates only the changes in the data by using logical decoding. This makes it very efficient because only the differences are replicated. It is also tolerant of network faults because it can resume after the fault.

Amazon RDS for PostgreSQL 9.6 and later versions include pglogical. You can use it to set up logical replication between the master node (publisher) and replica node (subscriber).

You can use logical replication in the following scenarios:

Sending incremental changes of a single database or a subset of a database to subscribers as they occur
Consolidating multiple databases into a single database (for example, for analytical purposes)
Sharing a subset of a database between multiple databases
Upgrading a database from one major version to another
Migrating from one platform to another (for example, on-premises or Amazon EC2 to Amazon RDS)

Working with pglogical

This post demonstrates how to set up logical replication between multiple databases (multiple publishers) to Amazon RDS for PostgreSQL (single standby). This post uses two EC2 instances on which the community PostgreSQL has the pglogical extension installed. Amazon RDS for PostgreSQL has the pglogical extension by default because pglogical support is included in version 9.6 and later, but in the case of community PostgreSQL, you should install and configure this extension separately.

This post uses two PostgreSQL major versions running on single a RHEL-7 EC2 instance (PostgreSQL 9.6 and PostgreSQL 10).

Prerequisites

Before getting started, configure the listed parameters in the postgresql.conf of both PostgreSQL versions. Allow the replication user in the pg_hba.conf file on the source databases and target database to enable the parameter rds.logical_replication.

Configuring the source database

To configure the source database, complete the following steps:

In the source database, in postgresql.conf, edit the following parameters:

wal_level = 'logical'
track_commit_timestamp = on
max_worker_processes = 10
max_replication_slots = 10 (Set as per the requirment looking at publisher or subscriber)
max_wal_senders = 10
shared_preload_libraries = 'pglogical'

On the first source, restart the PostgreSQL instances for these parameters to take effect.

To see the set parameters, enter the following code:

/usr/pgsql-9.6/bin/psql -d source1 -p 5432 -c "select name, setting from pg_settings where name in ('wal_level','track_commit_timestamp','max_worker_processes','max_replication_slots','max_wal_senders  ','shared_preload_libraries');"

The following screenshot shows the query output.

On the second source, restart the PostgreSQL instances for these parameters to take effect.

To see the set parameters, enter the following code:

/usr/pgsql-10/bin/psql -d source2 -p 5433 -c "select name, setting from pg_settings where name in ('wal_level','track_commit_timestamp','max_worker_processes','max_replication_slots','max_wal_senders  ','shared_preload_libraries');”

The following screenshot shows the query output.

In pg_hba.conf, allow the replication user in pg_hba.conf to connect to the PostgreSQL instances.
Reload the PostgreSQL instances.
You can see the following changes in pg_hba.conf:
```
host     replication     all      <RDS endpoint>/32           md5
```

Reload the PostgreSQL instances (source-1 and source-2). See the following code:

/usr/pgsql-9.6/bin/psql -d source1 -p 5432 -c "select pg_reload_conf();"  
/usr/pgsql-10/bin/psql  -d source2 -p 5433 -c "select pg_reload_conf();"

Configuring the target DB parameter group

Amazon RDS for PostgreSQL has the pglogical extension by default. To configure the target DB parameter group, complete the following steps:

To enable the extension, in the target database, set rds.logical_replication=1 and shared_preload_libraries = 'pglogical' in the database parameter group.
Reboot the RDS instance for these parameters to take effect.
To see if the parameters are set on target database, enter the following code:
```
psql -h <your_rds_endpoint> -d target -U postgres -W -c "select name, setting from pg_settings where name in ('wal_level','track_commit_timestamp','max_worker_processes','max_replication_slots','max_wal_senders  ','shared_preload_libraries');"
```
The following screenshot shows the query output.
In Amazon RDS, whenever you enable rds.logical_replication, the replication entry is added in the pg_hba rules, to verify enter the following code:
```
./psql -h <your_rds_endpoint> -d target -U postgres -c "select pg_hba_file_rules();
```
The following screenshot shows the rule in pg_hba.conf for the replication user on target database.

Setting up logical replication

You are now ready to configure logical replication between multiple databases on your EC2 host. Complete the following steps:

Download the pglogical rpm and install on the source databases.
For PostgreSQL 9.6, enter the following code:

Curl https://access.2ndquadrant.com/api/repository/dl/default/release/9.6/rpm | bash
yum install postgresql96-pglogical

For PostgreSQL 10, enter the following code:

curl https://access.2ndquadrant.com/api/repository/dl/default/release/10/rpm | bash
yum install postgresql10-pglogical

You can now create the pglogical extension on the source and target databases.

On the source databases, create the pglogical extension with the following code:

/usr/pgsql-9.6/bin/psql -d source1 -p 5432 -c "create extension pglogical;"
/usr/pgsql-10/bin/psql -d source2 -p 5433 -c "create extension pglogical;"

The following screenshot shows that pglogical’s version is in the source database.
The pglogical schema and other objects are created under the pglogical schema, which is helpful in maintaining the information of the replication.
The following code to create pglogical extension in the target database:

psql -h <your_rds_endpoint> -d target -U postgres -W -c "create extension pglogical;"

The following code shows that pglogical is in the target database:

psql -h <your_rds_endpoint> -d target -U postgres -W  -c "SELECT e.extname AS "Name", e.extversion AS "Version", n.nspname AS "Schema", c.description AS "Description" FROM pg_catalog.pg_extension e LEFT JOIN pg_catalog.pg_namespace n ON n.oid = e.extnamespace LEFT JOIN pg_catalog.pg_description c ON c.objoid = e.oid AND c.classoid = 'pg_catalog.pg_extension'::pg_catalog.regclass where e.extname='pglogical' ORDER BY 1;"

The following screenshot shows that pglogical is in the target database.

Create the publication server on the source databases and add the tables and sequences to the replication set, covered in point 8.
For source-1, PostgreSQL 9.6, see the list of tables and sequences with the following code:
```
/usr/pgsql-9.6/bin/psql -d source1 -p 5432 <<EOF
\dt \ds
EOF
```
The following screenshot shows the list of sequences and tables present in the source-1 database.
For source-2, PostgreSQL 10, see the list of tables and sequences with the following code:
```
/usr/pgsql-10/bin/psql -d source2 -p 5433 <<EOF
\dt \ds
EOF
```
The following screenshot shows the list of sequences and tables present in the source-2 database.
Before you create the publication server, take the schema dump from PostgreSQL (9.6, 10) and restore it to the target Amazon RDS for PostgreSQL. You can do this using dump and restore by using native PostgreSQL utilities (pgdump -s to take the dump and psql -f<file> to restore).

Create the publisher on the source databases with the following code:

/usr/pgsql-9.6/bin/psql -d source1 -p 5432 <<EOF
SELECT pglogical.create_node( node_name := 'provider1', dsn := 'host=<your_ec2_endpoint> port=5432 dbname=source1 user=postgres' );
EOF

/usr/pgsql-10/bin/psql -d source2 -p 5433 <<EOF
SELECT pglogical.create_node( node_name := 'provider2', dsn := 'host=<your_ec2_endpoint> port=5433 dbname=source2 user=postgres' );
EOF

The following screenshot shows the creation of the publisher node in source-1 and source-2 databases.
The following screenshot of the pglogical.node_interface table shows an entry of the added node.

Add the tables and sequences to the default replication set. See the following code:

/usr/pgsql-9.6/bin/psql -d source1 -p 5432 <<EOF
SELECT pglogical.replication_set_add_all_tables('default', '{public}'::text[],synchronize_data := true);
select pglogical.replication_set_add_all_sequences( set_name := 'default', schema_names := '{public}'::text[], synchronize_data := true );
EOF

/usr/pgsql-10/bin/psql -d source2 -p 5433 <<EOF
SELECT pglogical.replication_set_add_all_tables('default', '{public}'::text[],synchronize_data := true);
select pglogical.replication_set_add_all_sequences( set_name := 'default', schema_names := '{public}'::text[], synchronize_data := true );
EOF

The following screenshot shows the addition of sequences and tables to replication set in the source-1 and source-2 databases.
In the preceding screenshot, there are added sequences in the replication set. This increments the sequences counter on the target.
You can now take the schema dump from the source databases (9.6, 10) restore it to the target Amazon RDS for PostgreSQL database, and create the subscription server on the target database.

To list tables and sequences in the target database, enter the following code:
```
psql -h <your_rds_endpoint> -d target -U postgres -W <<EOF
\dt \ds
EOF
```
The following screenshot shows the list of sequences and tables present in the target database.

Create one subscriber and two subscriptions in target database. See the following code:

psql -h <your_rds_endpoint> -d target -U postgres  <<EOF
SELECT pglogical.create_node(
    node_name := 'subscriber1',
    dsn := 'host=<your_rds_endpoint> port=5432 dbname=target password=postgres user=postgres'
);
EOF

The following screenshot shows the subscriber created in the target database.

Create subscriptions in the target database for source-1 and source-2 with the following code:

psql -h <your_rds_endpoint> -d target -U postgres  <<EOF
SELECT pglogical.create_subscription(
    subscription_name := 'subscription1',
    replication_sets := array['default'],
    provider_dsn := 'host=<your_ec2_endpoint> port=5432 dbname=source1 password=postgres user=postgres'
);

SELECT pglogical.create_subscription(
    subscription_name := 'subscription2',
    replication_sets := array['default'],
    provider_dsn := 'host=<your_ec2_endpoint> port=5433 dbname=source2 password=postgres user=postgres'
);
EOF

The following screenshot shows the subscriptions created in the target database.

Before loading the data into the source instances, compare the row count of tables on source PostgreSQL (9.6, 10) and target RDS PostgreSQL 11.
The following screenshot shows the comparison of the row count of tables on the source and target databases.
The following screenshot shows a small for loop script with insert statements to load the data into source databases.
The following screenshot compares the table count between the source and target databases before and after the data load.
The table data is replicated between the source and target databases, but what about the sequences? The state of sequences added to the replication sets is replicated periodically and not in real time.
The following screenshot shows the behavior of the sequence on the source and target database, if the sequences are added to the replication set.
During the initial data load using sequences, the sequence values are replicated from the source to the target database immediately. However, during the later loading cycles, it may take time to replicate the sequences data. As mentioned previously, sequence synchronization is not continuous or in real time. pglogical periodically synchronizes the sequences on the target database.
The following screenshot shows the incremental values of sequence on the source and target databases.
For this post, pglogical took more time (over 3 minutes) to replicate sequences to the target database.
When you switch over to the replica, execute the pglogical.synchronize_sequence function to synchronize all the sequences on the target database. You may want to execute this after there are no application connections to the current primary. See the following code:
```
/usr/pgsql-9.6/bin/psql -d source1 -p 5432 <<EOF
select pglogical.synchronize_sequence( seqoid ) from pglogical.sequence_state;
EOF
/usr/pgsql-9.6/bin/psql -d source2 -p 5433 <<EOF
select pglogical.synchronize_sequence( seqoid ) from pglogical.sequence_state;
EOF
```
The following screenshot shows the synchronization of sequences on the source and target databases after you enter the preceding code.
The dynamic buffer is used for the value being replicated so that the subscribers receive the future state of the sequence. This minimizes the chance of a subscriber’s notion of the sequence’s last_value falling behind, but does not completely eliminate the possibility. You may want to call synchronize_sequence to make sure all subscribers have up-to-date information about a given sequence after significant events in the database, such as data loading or during an online upgrade.

Limitations of pglogical

Using pglogical comes with the following limitations:

As of this writing, pglogical replication and administration requires superuser privileges. It may later extend to more granular privileges.
The UNLOGGED and TEMPORARY tables are not and cannot be replicated, much like with physical streaming replication. This is because the DMLs of these tables do not have entries in the WAL logs (transaction logs), and it is difficult for the logical worker process to capture the changes happening on these tables.
When you replicate multiple databases, you must set up individual provider and subscriber relationships for each. You cannot configure replication for all databases in a PostgreSQL install at one time.
You cannot replicate UPDATE and DELETE functions for tables that lack a primary or other valid replica identity, such as a UNIQUE Replication cannot find the table to update or delete because there is no unique identifier. Like other available tools on the market, including trigger-based tools like Bucardo and slony, pglogical needs a primary key or unique key on the tables to capture the DMLs.
There is no way to replicate the DDLs from provider to subscriber automatically. pglogical provides the pglogical.replicate_ddl_command function to allow DDLs to run on the provider and subscriber at a consistent point. For example, select pglogical.replicate_ddl_command('alter table public.migration1 add column surname varchar(20)');.
Using TRUNCATE...CASCADE only applies the CASCADE option on the provider side.

Conclusion

This post discussed the differences between logical and physical replications, use cases, advantages of logical replication, and how to implement logical replication between different versions of PostgreSQL by using the pglogical extension and built-in logical replication. Logical replication provides a more feasible and easier means to replicate your tables than the standard streaming replication mechanism. However, it does have some limitations that may or may not influence your decision to use it for replication.

About the Author

Rajesh Madiwale is an Associate Consultant with Amazon Web Services. He has deep expertise on database development and administration on Amazon RDS for PostgreSQL, Aurora PostgreSQL, Redshift, MySQL and Greenplum databases. He is an ardent member of the PostgreSQL community and has been working on PostgreSQL his entire tenure. He has also delivered several sessions at PostgreSQL conferences.