AWS Database Blog

Best practices for migrating PostgreSQL databases to Amazon RDS and Amazon Aurora

PostgreSQL is one of the most advanced open-source relational database systems. It is well suited for online transaction processing (OLTP) workloads ranging from a few GB to multiple TB. For many organizations, PostgreSQL is the open-source database of choice when migrating from commercial databases such as Oracle or Microsoft SQL Server. AWS offers Amazon Relational Database Service for PostgreSQL (Amazon RDS for PostgreSQL) and Amazon Aurora PostgreSQL-Compatible Edition as fully managed PostgreSQL database services.

This post discusses best practices and solutions for some of the limitations of using PostgreSQL native utilities to migrate PostgreSQL databases to Amazon RDS for PostgreSQL or Aurora PostgreSQL.

Overview

Migrating to RDS for PostgreSQL or Aurora PostgreSQL clusters requires planning, resources, and a maintenance window. In this post, we provide best practices for migrating PostgreSQL databases to Amazon RDS for PostgreSQL and Aurora PostgreSQL using PostgreSQL native tools such as pg_dump and pg_restore, logical replication, and the COPY command. These practices are useful for migrating PostgreSQL databases from on-premises servers, Amazon Elastic Compute Cloud (Amazon EC2) instances, or from one RDS or Aurora instance to another. If you can afford downtime during the migration, pg_dump and pg_restore is the preferred method. To minimize downtime, logical replication is the better solution. Use the COPY command to migrate filtered table data.

The following diagram compares the various PostgreSQL native migration strategies for migrating to Amazon RDS for PostgreSQL or Aurora PostgreSQL.

tree-style diagram comparing migration strategies.

Prerequisites

Before getting started, complete the following prerequisites:

  • To apply the best practices for migration, you need to have a source database up and in a Running state.
  • To avoid any access issues, run the migration commands as a superuser role at the source side.
  • Provision a target RDS for PostgreSQL database or Aurora PostgreSQL database.
  • Issue the migration commands at the target instance using the rds_superuser role.

Running an RDS or Aurora PostgreSQL instance incurs cost. For more information, see Amazon RDS for PostgreSQL Pricing or Amazon Aurora Pricing, respectively.

PostgreSQL native utility: pg_dump and pg_restore

pg_dump is a PostgreSQL native utility that generates a file of SQL commands. Feeding this file to a server recreates the database in the same state as it was at the time of the dump. This utility is commonly used to logically dump data for migration or upgrade purposes. Depending on the format, you can restore dump files with the psql program or with another PostgreSQL utility, pg_restore. This method is best suited for migrating databases from a few GB up to about 500 GB in size. Migrating bigger databases may require a longer outage, depending on database size. Next, we discuss some tips that can help reduce migration time when using the pg_dump, pg_restore, and psql utilities.

pg_dump and pg_restore options

Using some of the pg_dump and pg_restore options can result in faster migration. You can list these options by entering the pg_dump --help and pg_restore --help commands. When migrating to a higher version of PostgreSQL, we recommend using the higher version's pg_dump and pg_restore utilities. For example, when migrating from PostgreSQL 10 to PostgreSQL 11, use the pg_dump and pg_restore utilities of PostgreSQL 11. The following pg_dump and pg_restore options can make the dump and restore process quicker:

  • --format / -F – This option selects the output format. The default format is plain. You can restore files in plain format using psql. Other formats are custom, directory, and tar. You can choose these formats with --format=custom, --format=directory, or --format=tar. You can restore output files in these formats by using the pg_restore utility. Custom and directory are the most flexible output formats: compressed by default, they allow manual selection and reordering of archived items during restore, and they support a speedy dump and restore. The directory format also supports parallel dumps.
  • --jobs / -j – This option with the pg_dump command runs the given number of concurrent dump jobs simultaneously. With pg_restore, this option runs multiple restore jobs simultaneously. The --jobs option drastically reduces the time of the dump and restore, but it also increases the load on the database server. pg_dump opens the number of jobs + 1 connections to the database, so make sure the max_connections setting is high enough to accommodate all connections. The number of jobs should be equal to or less than the number of vCPUs allocated for the RDS instance to avoid resource contention. The following code shows an example of running five simultaneous pg_dump and pg_restore jobs. The --create option with pg_dump writes the CREATE DATABASE command in the dump file. --create with pg_restore creates the named database before restoring its contents. --dbname specifies the database name to connect to.
    pg_dump --host <hostname> --format=directory --create --jobs 5 --dbname <database name> --username <username> --file /home/ec2-user/db11.dump
    pg_restore --host <hostname> --format=directory --create --jobs 5 --dbname <database name> --username <username> /home/ec2-user/db11.dump
  • --table / -t – This option dumps all tables matching the given pattern. Similarly, --schema lets you dump all schemas matching a pattern. This is helpful when you want multiple pg_dump jobs running simultaneously to reduce overall migration time.
  • --compress / -Z – Valued between 0–9, this option specifies the compression level to use. Zero means no compression. For the custom archive format, this specifies compression of individual table-data segments. The default compression level is 6. A higher value can decrease the size of the dump file, but can cause high CPU and I/O consumption on the client host. If the client host has enough resources, compression levels of 7–8 can significantly lower the dump file size.
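As an illustration, a dump and restore that combines the directory format, parallel jobs, and a higher compression level might look like the following sketch. The hostnames, database names, and values here are examples only; adjust them to your environment.

```shell
# Dump with the directory format, 4 parallel jobs, and compression level 8.
pg_dump --host source.example.com --username app_user \
        --format=directory --jobs 4 --compress=8 \
        --file /tmp/appdb.dump appdb

# Restore with 4 parallel jobs; --create recreates the database before loading.
pg_restore --host target.example.com --username postgres \
           --dbname postgres --create --jobs 4 /tmp/appdb.dump
```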

To run the pg_dump utility, you need SELECT permission on all database objects. By default, rds_superuser doesn't have SELECT permission on all objects. As a workaround, grant SELECT permission to the rds_superuser role, or grant the permissions of other users to rds_superuser by running the following command:

GRANT <user> TO <rds_super_user>;

Modifications at the target RDS for PostgreSQL or Aurora PostgreSQL instance

You can make several schema-level changes and parameter modifications to achieve a faster restore. For schema-level changes, first restore the schema using the --schema-only option. By default, pg_dump generates COPY commands. Existing indexes and foreign keys can cause significant delays during data restore. Dropping foreign keys, primary keys, and indexes before the restore and adding them back afterward can drastically reduce migration time. After this schema modification, run pg_restore with the --data-only option.
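The workflow above can be sketched as follows; the host, table, index, and constraint names are hypothetical.

```shell
# 1. Restore the schema only.
pg_restore --host target.example.com --dbname appdb --schema-only /tmp/appdb.dump

# 2. Drop heavy indexes and foreign keys before the data load (example objects).
psql --host target.example.com --dbname appdb \
     -c 'DROP INDEX IF EXISTS orders_customer_idx;' \
     -c 'ALTER TABLE orders DROP CONSTRAINT IF EXISTS orders_customer_fk;'

# 3. Load the data only, in parallel.
pg_restore --host target.example.com --dbname appdb --data-only --jobs 4 /tmp/appdb.dump

# 4. Recreate the index and foreign key after the load.
psql --host target.example.com --dbname appdb \
     -c 'CREATE INDEX orders_customer_idx ON orders (customer_id);' \
     -c 'ALTER TABLE orders ADD CONSTRAINT orders_customer_fk FOREIGN KEY (customer_id) REFERENCES customer (id);'
```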

Certain parameter changes can help reduce restore time:

  • maintenance_work_mem – Specifies the maximum amount of memory to be used by maintenance operations, such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. Increasing its value can make restoring data and adding keys and indexes faster. If pg_restore runs n concurrent jobs, make sure the instance has n times maintenance_work_mem of memory available. Because the database isn't serving workloads during migration, you can reduce shared_buffers to free more memory for maintenance_work_mem.
  • checkpoint_timeout – Defines the maximum time between automatic WAL checkpoints in seconds.
  • max_wal_size – Defines the maximum size to let the WAL grow to between automatic WAL checkpoints. During migration, because losing data due to a crash isn't a concern, set these two parameters to higher values to make the restore faster.

After data is restored, you should reset all these parameters to their default values.
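On Amazon RDS and Aurora, you change these parameters through a custom DB parameter group rather than postgresql.conf. The following AWS CLI sketch shows one way to do this; the parameter group name and values are examples only (maintenance_work_mem is in kB, checkpoint_timeout in seconds, and max_wal_size in MB).

```shell
# Raise restore-related parameters in the instance's custom parameter group.
aws rds modify-db-parameter-group \
    --db-parameter-group-name my-restore-pg \
    --parameters "ParameterName=maintenance_work_mem,ParameterValue=1048576,ApplyMethod=immediate" \
                 "ParameterName=checkpoint_timeout,ParameterValue=1800,ApplyMethod=immediate" \
                 "ParameterName=max_wal_size,ParameterValue=8192,ApplyMethod=immediate"

# After the restore, reset them to their engine defaults.
aws rds reset-db-parameter-group \
    --db-parameter-group-name my-restore-pg \
    --parameters "ParameterName=maintenance_work_mem,ApplyMethod=immediate" \
                 "ParameterName=checkpoint_timeout,ApplyMethod=immediate" \
                 "ParameterName=max_wal_size,ApplyMethod=immediate"
```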

Cloud-native options

Some best practices for pg_dump and pg_restore host and database configuration can also make migration faster:

  • While running pg_dump, choosing a host with high CPU and throughput can speed up pg_dump. It also lets you run multiple pg_dump processes simultaneously.
  • Choosing an Amazon EC2 host in the same Availability Zone as the target database mitigates network latency during migration.
  • Similarly, a target database with higher compute and throughput configured with Provisioned IOPS storage can support concurrent restore.
  • Disabling Multi-AZ and backups during migration also makes the restore faster.
  • While migrating a small dataset from Amazon EC2 hosted PostgreSQL to Amazon RDS for PostgreSQL, streaming the output of pg_dump as input to psql avoids the space and time needed to create a dump file. This drastically reduces overall migration time. The following code is an example of running pg_dump and psql at the same time:
    pg_dump --host <source_host> --username <username> <source_dbname> | psql --host <target_host> --username <username> <target_dbname>

Another PostgreSQL utility, pg_dumpall, dumps all databases of the cluster in a single file. The only supported file format for pg_dumpall is text, and you can restore it by using the psql client. Because the Amazon RDS for PostgreSQL and Aurora PostgreSQL rds_superuser role doesn't have permission on the pg_authid table, it's important to use --no-role-passwords with pg_dumpall while dumping data. Only pg_dumpall from PostgreSQL 10 and above supports the --no-role-passwords option. The following code is a sample command dumping all databases in the Aurora cluster:

pg_dumpall --host <RDS/Aurora hostname> --no-role-passwords --username <Master User> > dumpall.dump

The pg_dump utility backs up only a single database at a time. It doesn't back up global objects such as users and groups. To migrate global objects, use a combination of pg_dumpall and psql. Complete the following steps:

  1. Take a backup of source tablespaces and roles using pg_dumpall:
    pg_dumpall --host <RDS/Aurora hostname> --globals-only --no-role-passwords --username <username> > globals_only.sql
  2. Run psql to restore the global object:
    psql --host <RDS/Aurora hostname> --dbname <database name> --username <username> -f globals_only.sql

Logical replication

Logical replication is a method of replicating data objects and their changes based upon their replication identity. We use the term logical in contrast to physical replication, which uses exact block addresses and byte-by-byte replication. Logical replication uses a publisher and subscriber model with one or more subscribers subscribing to one or more publications on the publisher node. Subscribers pull data from the publications they subscribe to and may subsequently republish data to allow cascading replication or more complex configurations. Unlike physical replication, you can set up logical replication between two different major versions of PostgreSQL.

Because Amazon RDS for PostgreSQL and Aurora PostgreSQL can't act as targets for external PostgreSQL physical replication, logical replication is one of the common ways to reduce overall migration time. It's also helpful when migrating from an unencrypted RDS for PostgreSQL instance to an encrypted RDS for PostgreSQL instance with minimal outage.

The following are the major steps to set up logical replication:

  1. Take schema-only dump at the source and restore at the target:
    pg_dump --host <hostname> --schema-only --dbname=sourcedb --username=<username> > schema_only.dump
    
    psql --host <hostname> --username=<username> --file=schema_only.dump --dbname=targetdb
  2. Modify the following parameters at the source and reboot the instance to apply the changes. If the source is an RDS for PostgreSQL or Aurora PostgreSQL instance, you need to enable rds.logical_replication to apply these changes:
    wal_level = logical
    wal_sender_timeout = 0
    max_replication_slots = 10 #number of subscriptions
    max_wal_senders = 12 #max_replication_slots(10) + number of other standbys (2)
  3. Allow the target RDS for PostgreSQL instance to access the source instance by modifying pg_hba.conf. If the source is an RDS for PostgreSQL or Aurora PostgreSQL instance, add the target's public and private IPs to the source security group.
  4. At the source, create a logical replication user and publication:
    psql --host <hostname> --username <username> --dbname sourcedb
    sourcedb=> create table customer (id int, name varchar(50), CONSTRAINT customer_pkey PRIMARY KEY (id));
    CREATE TABLE
    sourcedb=> CREATE ROLE log_repuser LOGIN;
    CREATE ROLE
    sourcedb=> \password log_repuser
    Enter new password:
    Enter it again:
    sourcedb=> GRANT rds_superuser TO log_repuser;
    GRANT ROLE
    sourcedb=> show rds.logical_replication;
    rds.logical_replication
    -------------------------
    on
    (1 row)
    sourcedb=> show wal_sender_timeout;
    wal_sender_timeout
    --------------------
    0
    (1 row)
    sourcedb=> CREATE PUBLICATION pub_1;
    CREATE PUBLICATION
    sourcedb=> ALTER PUBLICATION pub_1 ADD ALL TABLES;
    ALTER PUBLICATION
  5. At the target PostgreSQL instance, create a subscription:
    psql --host <hostname> --username <username> --dbname targetdb
    
    targetdb=> create table customer (id int, name varchar(50), address varchar(100), CONSTRAINT customer_pkey PRIMARY KEY (id));
    CREATE TABLE
    targetdb=> CREATE SUBSCRIPTION sub_1 CONNECTION 'host=logicalsource1.cxxxxxxxxxx7.us-west-2.rds.amazonaws.com port=5432 password=<password> user=log_repuser dbname=sourcedb' PUBLICATION pub_1;
    NOTICE: created replication slot "sub_1" on publisher
    CREATE SUBSCRIPTION
  6. Test the logical replication:
    sourcedb=> INSERT into customer values (1, 'john'), (2, 'jeff');
    INSERT 0 2
    
    targetdb=> select * from customer ;
    id | name | address
    ----+------+---------
    1 | john |
    2 | jeff |
    (2 rows)
    targetdb=> INSERT into customer values (3, 'john'), (4, 'jeff');
    INSERT 0 2
    targetdb=> select * from customer ;
    id | name | address
    ----+------+---------
    1 | john |
    2 | jeff |
    3 | john |
    4 | jeff |
    (4 rows)
    
    sourcedb=> INSERT into customer values (5, 'john'), (6, 'jeff');
    INSERT 0 2
    
    targetdb=> select * from customer ;
    id | name | address
    ----+------+---------
    1 | john |
    2 | jeff |
    3 | john |
    4 | jeff |
    5 | john |
    6 | jeff |
    (6 rows)
    targetdb=> INSERT into customer values (7, 'john', 'chicago'), (8, 'jeff','omaha');
    INSERT 0 2
    
    targetdb=> select * from customer ;
    id | name | address
    ----+------+---------
    1 | john |
    2 | jeff |
    3 | john |
    4 | jeff |
    5 | john |
    6 | jeff |
    7 | john | chicago
    8 | jeff | omaha
    (8 rows)
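After the initial sync, you can verify that replication is healthy from both sides. The following sketch uses the catalog views that ship with PostgreSQL 10 and later; hostnames and credentials are placeholders.

```shell
# On the publisher: confirm the replication slot and WAL sender are active.
psql --host <source_host> --dbname sourcedb --username <username> \
     -c 'SELECT slot_name, active FROM pg_replication_slots;' \
     -c 'SELECT application_name, state FROM pg_stat_replication;'

# On the subscriber: confirm the subscription is receiving changes.
psql --host <target_host> --dbname targetdb --username <username> \
     -c 'SELECT subname, received_lsn, latest_end_lsn FROM pg_stat_subscription;'
```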

The following are some best practices for PostgreSQL logical replication:

  • With logical replication, the target side still accepts writes. To avoid conflicts, make sure to revoke all write privileges on the replicated tables while replication is running.
  • In a publication, you can choose which types of commands to replicate: INSERT, DELETE, UPDATE, or ALL. The default is ALL. Set the replicated operation types according to your business needs and minimum requirements.
  • The replicating tables should be normal tables. Views, materialized views, partition root tables, or foreign tables can’t be replicated. You can replicate tables with no primary or unique keys, but updates and deletes on those tables are slow on the subscriber.
  • DDL changes aren't replicated. To avoid any interruption in replication, apply DDL changes, such as changes to a table definition, on both sides at the same time.
  • The TRUNCATE operation isn't replicated in PostgreSQL 10. As a workaround, use the DELETE operation. (PostgreSQL 11 and later replicate TRUNCATE.)
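For example, to limit a publication to a subset of operations as described above (the publication and table names here are illustrative):

```shell
# Publish only INSERT and UPDATE for one table; DELETE operations are not replicated.
psql --host <source_host> --dbname sourcedb --username <username> -c \
  "CREATE PUBLICATION pub_customer_writes FOR TABLE customer WITH (publish = 'insert, update');"
```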

It’s important to have a replication identity (primary key or unique index) for tables that are part of the publication in the logical replication. If the table doesn’t have a primary key or a unique index, you can set replication identity to full (ALTER TABLE <table name> REPLICA IDENTITY FULL) where the entire row acts as a primary key.

Migrating data using the COPY command

PostgreSQL also has the native COPY command to move data between tables and standard file system files. COPY is usually faster than INSERT. It's another alternative to the pg_dump and pg_restore commands for backing up and restoring PostgreSQL databases, tables, and underlying objects.

One advantage of the COPY command over pg_dump and pg_restore is that you can filter the data using SQL queries. You can also use the COPY command to transfer data from one format to another, or from a file to a table and vice versa. PostgreSQL COPY comes in two variations: the server-side COPY command and the client-side meta-command \copy. The server-side command requires superuser privileges or the pg_read_server_files, pg_write_server_files, or pg_execute_server_program privileges, because files are read or written directly by the server. The client-side \copy command invokes COPY FROM STDIN or COPY TO STDOUT, and then fetches and stores the data in a file accessible to the psql client. Server-side COPY isn't available on Amazon RDS for PostgreSQL or Aurora PostgreSQL. You need at least INSERT, UPDATE, and SELECT permissions on a table in order to copy to or from it. The following are a few key terms for the COPY command:

  • Delimiter – The character that separates fields within each row. By default, COPY expects a tab as the delimiter, but you can change this by specifying the DELIMITER option.
  • Header – The first line of the CSV file, consisting of the column names.

The following examples show basic command usage of the COPY command and \copy meta-command from the psql client:

postgres=> \copy (select * from activity limit 100) to '~/activity.csv' with DELIMITER ',' CSV HEADER;
COPY 100

postgres=> \copy activity_2 from '~/activity.csv' with DELIMITER ',' CSV HEADER;
COPY 100

For the server-side COPY command, file paths must be absolute. The client-side \copy command can handle relative pathnames.

In Amazon RDS for PostgreSQL 11.1 and above, you can import data from Amazon Simple Storage Service (Amazon S3) into a table belonging to an RDS for PostgreSQL DB instance. To do this, you use the aws_s3 PostgreSQL extension that Amazon RDS provides.
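A minimal sketch of such an import follows, assuming the instance has an IAM role that allows reading the bucket; the bucket name, file name, and Region are hypothetical.

```shell
psql --host <rds_host> --dbname appdb --username <username> <<'SQL'
-- Install the extension (CASCADE also installs the aws_commons helpers).
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;

-- Import a CSV file from S3 into the activity_2 table.
SELECT aws_s3.table_import_from_s3(
   'activity_2',                    -- target table
   '',                              -- column list ('' means all columns)
   '(format csv, header true)',     -- COPY options
   aws_commons.create_s3_uri('my-example-bucket', 'activity.csv', 'us-west-2')
);
SQL
```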

Clean up

To avoid incurring ongoing charges, delete the RDS for PostgreSQL and Aurora PostgreSQL instances.

Conclusion

This post demonstrated the best practices for migrating PostgreSQL databases to Amazon RDS for PostgreSQL or Aurora PostgreSQL. Apart from pg_dump and pg_restore, the COPY command, and logical replication, other options for migration are AWS Database Migration Service (AWS DMS) and pglogical.

If you have comments or questions about this solution, please submit them in the comments section.


About the authors

Vivek Singh is a Senior Database Specialist with AWS focusing on Amazon RDS/Aurora PostgreSQL engines. He works with enterprise customers providing technical assistance on PostgreSQL operational performance and sharing database best practices.

Akm Raziul Islam has worked as a consultant with a focus on databases and analytics at Amazon Web Services. He worked with customers to provide guidance and technical assistance on various database and analytical projects, helping them improve the value of their solutions when using AWS.