AWS Database Blog
Best practices for migrating PostgreSQL databases to Amazon RDS and Amazon Aurora
September 2023: This post was reviewed and updated for accuracy.
PostgreSQL is one of the most advanced open-source relational database systems. Well suited for online transaction processing (OLTP) workloads, PostgreSQL scales from a few GB to multi-TB databases. For many organizations, PostgreSQL is the open-source database of choice when migrating from commercial databases such as Oracle or Microsoft SQL Server. AWS offers Amazon Relational Database Service (Amazon RDS) for PostgreSQL and Amazon Aurora PostgreSQL-Compatible Edition as fully managed PostgreSQL database services.
This post discusses best practices for using PostgreSQL native utilities to migrate PostgreSQL databases to Amazon RDS for PostgreSQL or Aurora PostgreSQL, along with solutions for some of their limitations.
Overview
Migrating to RDS for PostgreSQL or Aurora PostgreSQL clusters requires strategy, resources, and a downtime maintenance window. In this post, we provide some best practices for migrating PostgreSQL databases to Amazon RDS for PostgreSQL and Aurora PostgreSQL using PostgreSQL native tools such as pg_dump and pg_restore, logical replication, and the COPY command. These practices are useful for migrating PostgreSQL databases from on-premises servers, Amazon Elastic Compute Cloud (Amazon EC2) instances, or from one RDS or Aurora instance to another. If downtime during migration is affordable, using pg_dump and pg_restore is the preferred method. To minimize downtime, logical replication is the better solution. Use the COPY command to migrate filtered table data.
The following diagram compares the various PostgreSQL native migration strategies for migrating to Amazon RDS for PostgreSQL or Aurora PostgreSQL.
Prerequisites
Before getting started, complete the following prerequisites:
- To apply the best practices for migration, you need a source database up and in a running state.
- To avoid any access issues, run the migration commands as a superuser role on the source side.
- Provision a target RDS for PostgreSQL or Aurora PostgreSQL database.
- Issue the migration commands at the target instance using the rds_superuser role.
Running an RDS or Aurora PostgreSQL instance incurs cost. For more information, see Amazon RDS for PostgreSQL Pricing or Amazon Aurora Pricing, respectively.
PostgreSQL native utility: pg_dump and pg_restore
pg_dump is a PostgreSQL native utility that generates a file of SQL commands. Feeding this file to a server recreates the database in the same state as it was at the time of the dump. This utility is commonly used for logical dumps of data for migration or upgrade purposes. Depending on the format, you can restore dump files with the psql program or with another PostgreSQL utility, pg_restore. This method is best suited for migrating databases from a few GB to 500 GB in size. Migrating bigger databases may require a longer outage, depending on the database size. Next, we discuss some tips that can help reduce the migration time using the pg_dump, pg_restore, and psql utilities.
pg_dump and pg_restore options
Using some of the pg_dump and pg_restore options can result in faster migration. You can find these options by entering the pg_dump --help and pg_restore --help commands. For migrating to a higher version of PostgreSQL, we recommend using the higher version of pg_dump and pg_restore. For example, when migrating from PostgreSQL 14 to PostgreSQL 15, use the pg_dump and pg_restore utilities of PostgreSQL 15. The following pg_dump and pg_restore options can make the dump and restore process quicker:
- --format / -F – This option selects the output format. The default format is plain; you can restore files in plain format using psql. The other formats are custom, directory, and tar, chosen with --format=custom, --format=directory, or --format=tar. You restore output files in these formats by using the pg_restore utility. Custom and directory are the most flexible output formats. Compressed by default, these formats allow manual selection, speedy dump and restore, and reordering of archived items during restore. The directory format also supports parallel dumps.
- --jobs / -j – With pg_dump, this option runs the given number of concurrent dump jobs simultaneously. With pg_restore, it runs multiple restore jobs simultaneously. The --jobs option drastically reduces the time of the dump and restore, but it also increases the load on the database server. pg_dump opens the number of jobs + 1 connections to the database, so make sure the max_connections setting is high enough to accommodate all connections. The number of jobs should be equal to or less than the number of vCPUs allocated for the RDS instance to avoid resource contention. The sketch after this list shows an example of running five simultaneous pg_dump and pg_restore jobs. The --create option with pg_dump writes the CREATE DATABASE command in the dump file; --create with pg_restore creates the named database before restoring its contents, and --dbname specifies the database to connect to.
- --table / -t – This option dumps all tables matching the given pattern. Similarly, --schema lets you dump all schemas matching a pattern. This is helpful when you want multiple pg_dump jobs running simultaneously to reduce overall migration time.
- --compress / -Z – Valued between 0–9, this option specifies the compression level to use. Zero means no compression. For the custom archive format, this specifies compression of individual table-data segments; the default is to compress at level 6. A higher value can decrease the size of the dump file, but can cause high resource consumption (such as CPU and I/O) on the client host. If the client host has enough resources, compression levels of 7–8 can significantly lower the dump file size.
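The following is a minimal sketch of running five simultaneous dump and restore jobs; the hostnames, user, and database name (mydb) are placeholders:

```bash
# Dump with 5 parallel jobs; directory format (-Fd) is required for parallel dumps
pg_dump -h source-host.example.com -U postgres -Fd -j 5 -f /tmp/mydb_dump mydb

# Restore with 5 parallel jobs; --create re-creates mydb after first
# connecting to the maintenance database named by --dbname
pg_restore -h mydbinstance.xxxxx.us-east-1.rds.amazonaws.com -U postgres \
    --create --dbname=postgres -j 5 /tmp/mydb_dump
```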
To run the pg_dump utility, you need SELECT permission on all database objects. By default, rds_superuser doesn't have SELECT permission on all objects. As a workaround, grant SELECT permission to the rds_superuser role, or grant the permissions of other users to rds_superuser by running the following command:
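A sketch of that grant, assuming the objects belong to a role named app_owner (a placeholder name):

```sql
-- rds_superuser inherits app_owner's privileges, including SELECT on its tables
GRANT app_owner TO rds_superuser;
```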
Modifications at the target RDS for PostgreSQL or Aurora PostgreSQL instance
You can make several schema-level changes and parameter modifications to achieve a faster restore. For schema-level changes, you need to restore with the --schema-only option first. By default, pg_dump generates COPY commands. Existing indexes and foreign keys can cause significant delays during data restore. Dropping foreign keys, primary keys, and indexes before the restore and adding them back afterward can drastically reduce migration time. After this schema modification, use pg_restore with --data-only.
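A minimal sketch of this two-phase restore, assuming the dump directory and endpoint names from the earlier example:

```bash
# Phase 1: restore the schema only
pg_restore -h mydbinstance.xxxxx.us-east-1.rds.amazonaws.com -U postgres \
    -d mydb --schema-only /tmp/mydb_dump

# (Drop foreign keys, primary keys, and indexes here, saving their definitions.)

# Phase 2: load the data only, then re-create the dropped keys and indexes
pg_restore -h mydbinstance.xxxxx.us-east-1.rds.amazonaws.com -U postgres \
    -d mydb --data-only -j 5 /tmp/mydb_dump
```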
Certain parameter changes can help reduce restore time:
- maintenance_work_mem – Specifies the maximum amount of memory to be used by maintenance operations, such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. Increasing its value can make data restores and the addition of keys and indexes faster. If pg_restore uses n concurrent jobs, make sure the instance has n times maintenance_work_mem memory available. Because the database isn't serving workloads during migration, you can reduce shared_buffers to accommodate more memory for maintenance_work_mem.
- checkpoint_timeout – Defines the maximum time between automatic WAL checkpoints, in seconds.
- max_wal_size – Defines the maximum size to let the WAL grow to between automatic WAL checkpoints. During migration, because losing data due to a crash isn't a concern, set these two parameters to higher values to make the restore faster.
After data is restored, you should reset all these parameters to their default values.
Cloud-native options
Beyond the pg_dump and pg_restore options, some host and database configuration best practices can make migration faster:
- While running pg_dump, choosing a host with high CPU and throughput can increase the speed of pg_dump. It also lets you run multiple pg_dump processes simultaneously.
- Choosing an Amazon EC2 host in the same Availability Zone as the target database mitigates network latency during migration.
- Similarly, a target database with higher compute and throughput, configured with Provisioned IOPS storage, can support concurrent restores.
- Disabling Multi-AZ and backups during migration also makes the restore faster.
- While migrating a small dataset from an Amazon EC2 hosted PostgreSQL instance to Amazon RDS for PostgreSQL, streaming the output of pg_dump as input to psql avoids the space and time needed to create a dump file. This drastically reduces overall migration time. A sketch of running pg_dump and psql at the same time follows this list.
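A minimal sketch of streaming the dump directly into the target (endpoints and database name are placeholders):

```bash
# Stream pg_dump output straight into psql; no intermediate dump file is written
pg_dump -h ec2-source.example.com -U postgres mydb | \
    psql -h mydbinstance.xxxxx.us-east-1.rds.amazonaws.com -U postgres -d mydb
```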
Another PostgreSQL utility, pg_dumpall, is used to dump all databases of a cluster into a single file. The only supported file format of pg_dumpall is plain text, which you can restore by using the psql client. Because the Amazon RDS for PostgreSQL and Aurora PostgreSQL rds_superuser role doesn't have permission on the pg_authid table, it's important to use --no-role-passwords with pg_dumpall while dumping data. Only newer versions of pg_dumpall (PostgreSQL 10.0 and above) support the --no-role-passwords option. The following is a sample command that dumps all of the databases in an Aurora cluster:
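A sketch of such a command (the cluster endpoint is a placeholder):

```bash
pg_dumpall -h mycluster.cluster-xxxxx.us-east-1.rds.amazonaws.com -U postgres \
    --no-role-passwords -f /tmp/all_databases.sql
```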
The pg_dump utility only takes a backup of a single database at a time. It doesn't back up global objects such as users and groups. To migrate global objects, you need to use a combination of pg_dumpall and psql. Complete the following steps (sketched after this list):
- Take a backup of the source tablespaces and roles using pg_dumpall.
- Run psql to restore the global objects.
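A minimal sketch of both steps; --globals-only restricts the dump to roles and tablespaces, and the endpoints are placeholders:

```bash
# Step 1: dump only the global objects (roles and tablespaces) from the source
pg_dumpall -h source-host.example.com -U postgres --globals-only \
    --no-role-passwords -f /tmp/globals.sql

# Step 2: replay them at the target
psql -h mydbinstance.xxxxx.us-east-1.rds.amazonaws.com -U postgres -f /tmp/globals.sql
```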
Logical replication
Logical replication is a method of replicating data objects and their changes based upon their replication identity. We use the term logical in contrast to physical replication, which uses exact block addresses and byte-by-byte replication. Logical replication uses a publisher and subscriber model with one or more subscribers subscribing to one or more publications on the publisher node. Subscribers pull data from the publications they subscribe to and may subsequently republish data to allow cascading replication or more complex configurations. Unlike physical replication, you can set up logical replication between two different major versions of PostgreSQL.
Because Amazon RDS for PostgreSQL and Aurora PostgreSQL aren't supported as targets for external PostgreSQL physical replication, logical replication is one of the common ways to reduce overall migration time. This is also helpful when migrating from an unencrypted RDS for PostgreSQL instance to an encrypted RDS for PostgreSQL instance with minimal outage.
The following are the major steps to set up logical replication:
- Take a schema-only dump at the source and restore it at the target (for example, with pg_dump --schema-only piped to psql).
- Modify the following parameters at the source and reboot the instance to apply the changes. If the source is an RDS for PostgreSQL or Aurora PostgreSQL instance, you need to enable rds.logical_replication to apply these changes.
  wal_level = logical
  wal_sender_timeout = 0
  max_replication_slots = 10  # number of subscriptions
  max_wal_senders = 12        # max_replication_slots (10) + number of other standbys (2)
- Provide the target RDS for PostgreSQL instance access to the source instance by modifying pg_hba.conf. If the source is an RDS for PostgreSQL or Aurora PostgreSQL instance, add the target's public and private IPs to the source security group.
- At the source, create a logical replication user and publication.
- At the target PostgreSQL instance, create a subscription.
- Test the logical replication (the last three steps are sketched after this list).
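The following is a minimal sketch of the publication, subscription, and verification steps; the role, password, endpoint, database, table, publication, and subscription names are all placeholders:

```sql
-- At the source: create a replication user and a publication.
-- (On an RDS/Aurora source, GRANT rds_replication TO repl_user instead of
-- using the REPLICATION attribute.)
CREATE ROLE repl_user WITH LOGIN REPLICATION PASSWORD 'change_me';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO repl_user;
CREATE PUBLICATION my_publication FOR ALL TABLES;

-- At the target: create a subscription that connects back to the source
CREATE SUBSCRIPTION my_subscription
    CONNECTION 'host=source-host.example.com port=5432 dbname=mydb user=repl_user password=change_me'
    PUBLICATION my_publication;

-- Test: run the INSERT at the source, then confirm the row arrives at the target
INSERT INTO mytable (id, name) VALUES (1, 'replication test');
SELECT * FROM mytable WHERE id = 1;
```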
The following are some best practices for PostgreSQL logical replication:
- While using logical replication, you can also insert data on the target side. To avoid conflicts, make sure to revoke all write privileges on the replicated tables while replication is running.
- In a publication, you can choose which types of commands to replicate: INSERT, DELETE, UPDATE, or ALL. The default is ALL. Set the replicated operation types according to your business needs and minimum requirements.
- Replicated tables must be normal tables. Views, materialized views, partition root tables, and foreign tables can't be replicated. You can replicate tables with no primary or unique key, but updates and deletes on those tables are slow on the subscriber.
- DDL changes aren't replicated. To avoid any interruption in replication, DDL changes such as changes to a table definition should be made on both sides at the same time.
- The TRUNCATE operation isn't supported. As a workaround, use the DELETE operation.
It’s important to have a replication identity (primary key or unique index) for tables that are part of the publication in logical replication. If a table doesn't have a primary key or a unique index, you can set the replica identity to full (ALTER TABLE <table name> REPLICA IDENTITY FULL), where the entire row acts as a primary key.
Migrating data using the COPY command
PostgreSQL also has the native COPY command to move data between tables and standard file system files. COPY is usually comparatively faster than INSERT. This is another alternative to the pg_dump and pg_restore commands to back up and restore PostgreSQL databases, tables, and underlying objects.
One of the advantages of the COPY command over pg_dump and pg_restore is that you can filter the data by using SQL queries. You can also use the COPY command to transfer data from one format to another, or from a file to a table and vice versa. PostgreSQL COPY commands come in two variations: the server-side COPY command and the client-side meta-command \copy. The server-side command requires superuser privileges or the pg_read_server_files, pg_write_server_files, or pg_execute_server_program privileges, because files are read or written directly by the server. The client-side \copy command invokes COPY FROM STDIN or COPY TO STDOUT, and then fetches and stores the data in a file accessible to the psql client. Server-side COPY isn't available on Amazon RDS for PostgreSQL or Aurora PostgreSQL. You need at least INSERT, UPDATE, and SELECT permissions on a table in order to copy to or from it. The following are a few key terms for the COPY command:
- Delimiter – A character that separates fields in the dataset. By default, COPY expects a tab as the delimiter, but you can change this by specifying a different character with the DELIMITER option.
- Header – The line of the CSV file that contains the column names.
The following examples show basic usage of the COPY command and the \copy meta-command from the psql client:
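A minimal sketch of both variations; the table, column, and file names are placeholders:

```sql
-- Server-side COPY (self-managed PostgreSQL only; not available on RDS/Aurora):
COPY mytable TO '/var/lib/postgresql/mytable.csv' WITH (FORMAT csv, HEADER);

-- Client-side \copy from psql (works with RDS/Aurora); paths are on the client:
\copy mytable TO '/tmp/mytable.csv' WITH (FORMAT csv, HEADER)
\copy mytable FROM '/tmp/mytable.csv' WITH (FORMAT csv, HEADER)

-- Filtering rows with a query while exporting:
\copy (SELECT id, name FROM mytable WHERE id < 100) TO '/tmp/mytable_subset.csv' WITH (FORMAT csv, HEADER)
```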
An input file name path can be an absolute or relative path, but an output file name path must be an absolute path. The client-side \copy can handle relative pathnames.
In Amazon RDS for PostgreSQL 11.1 and above, you can import data from Amazon Simple Storage Service (Amazon S3) into a table belonging to an RDS for PostgreSQL DB instance. To do this, you need to use the AWS_S3 PostgreSQL extension that Amazon RDS provides.
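A sketch of importing a CSV file from Amazon S3 with the aws_s3 extension; the bucket, key, Region, and table names are placeholders, and the instance needs an IAM role that allows reading the object:

```sql
-- One-time setup on the RDS for PostgreSQL instance
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;

-- Import s3://my-bucket/data/mytable.csv into mytable
SELECT aws_s3.table_import_from_s3(
    'mytable',                -- target table
    '',                       -- column list ('' means all columns)
    '(FORMAT csv, HEADER)',   -- COPY options
    aws_commons.create_s3_uri('my-bucket', 'data/mytable.csv', 'us-east-1')
);
```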
Clean up
To avoid incurring ongoing charges, delete the RDS for PostgreSQL and Aurora PostgreSQL instances.
Conclusion
This post demonstrated the best practices for migrating PostgreSQL databases to Amazon RDS for PostgreSQL or Aurora PostgreSQL. Apart from pg_dump and pg_restore, the COPY command, and logical replication, other options for migration are AWS Database Migration Service (AWS DMS) and pglogical.
If you have comments or questions about this solution, please submit them in the comments section.
About the authors
Vivek Singh is a Principal Database Specialist Technical Account Manager with AWS focusing on RDS/Aurora PostgreSQL engines. He works with enterprise customers providing technical assistance on PostgreSQL operational performance and sharing database best practices. He has over 17 years of experience in open-source database solutions, and enjoys working with customers to help design, deploy, and optimize relational database workloads on AWS.
Akm Raziul Islam has worked as a consultant with a focus on databases and analytics at Amazon Web Services. He worked with customers to provide guidance and technical assistance on various database and analytics projects, helping them improve the value of their solutions when using AWS.