AWS Database Blog

Migrating databases using RDS PostgreSQL Transportable Databases

Amazon RDS for PostgreSQL now supports the feature Transportable Databases, a high-speed data import and export method supported on versions 11.5 and later and 10.10 and later. If you need to import a PostgreSQL database from one RDS PostgreSQL instance into another, you can use native tools such as pg_dump and pg_restore or load data into target tables using the \copy command. With transportable databases, you can move data much faster than these traditional tools. This feature uses a new extension called pg_transport, which provides a per-database physical transport mechanism. Physical transport can move data much faster with minimum downtime by streaming the database files with minimal processing.

This post discusses common use cases for pg_transport and how to configure and use this in your RDS PostgreSQL environment using AWS CLI and the psql utility.

Use cases for pg_transport

Many customers run their own SaaS or multi-tenant and enterprise applications on RDS PostgreSQL because Amazon RDS makes it simple to set up, operate, and scale PostgreSQL deployments in AWS. It’s common to have multiple databases on a single RDS PostgreSQL instance serving different customers or applications. Transportable databases are useful in such environments; you can address the following use cases with minimal downtime:

  • Moving a database from one RDS instance to another for better resource management or isolation. For example, as an SaaS provider, you must move your customer data to a different configuration (larger instance type, provisioned IOPS, or different instance family) after they have reached a specific capacity.
  • Reorganizing data for security and compliance reasons. For example, moving a database from an RDS instance in one AWS account to an RDS instance in an account with additional security controls.
  • Providing a complete copy of your database for downstream applications or testing new features as needed.

Prerequisites

This post assumes you are familiar with RDS PostgreSQL and Amazon EC2 instances. You should also meet the following prerequisites:

  • Launch two RDS instances with version 11.5 and instance type r5. large in the same or different AWS accounts within the same AWS Region. Name the source RDS cluster as pgsource.<identifier>. <AWS region>.rds.amazonaws.com and the target as pgtarget.<identifier>. <AWS region>.rds.amazonaws.com. Both the source and target RDS instances should be of the same major version of PostgreSQL. For more information, see Create and Connect to a PostgreSQL Database.
  • Make sure that the VPC security group for your source database instance allows inbound traffic from the destination instance.
  • Set up an Amazon EC2 instance with the psql client to work with PostgreSQL instances. Configure your database security groups to allow traffic from this EC2 instance. For more information, see Connecting to a DB Instance Running the PostgreSQL Database Engine.
  • Configure AWS CLI with your credentials. You should have RDS administrator IAM privileges. For more information, see Identity and Access Management in Amazon RDS. You can also consider using AWS managed policy DatabaseAdministrator to manage the RDS instances.

For more information about the known limitations and configuration parameters related to this feature, see Transporting a PostgreSQL Database Using the transport.import_from_server Function.

Setting up the source environment

Before using this feature, enable the necessary configuration parameters and pg_transport extension on the source and target RDS environments. Complete the following steps:

  1. Log in to your EC2 instance. Create a new parameter group and attach it to the pgsource (source RDS) See the following command:
    aws rds create-db-parameter-group --db-parameter-group-name pg11-source-transport-group --db-parameter-group-family postgres11  --description "Parameter group with transport parameters enabled"
  2. Configure the pg_transport-related parameters pg_transport.num_workers, pg_transport.timing, and pg_transport.work_mem. Load the extension via shared_preload_libraries and update the max_worker_processes. See the following command:
    aws rds modify-db-parameter-group \
        --db-parameter-group-name pg11-source-transport-group \
        --parameters "ParameterName=pg_transport.num_workers,ParameterValue=4,ApplyMethod=immediate" \
                     "ParameterName=pg_transport.timing,ParameterValue=1,ApplyMethod=immediate" \
                     "ParameterName=pg_transport.work_mem,ParameterValue=131072,ApplyMethod=immediate" \
                     "ParameterName=shared_preload_libraries,ParameterValue=\"pg_stat_statements,pg_transport\",ApplyMethod=pending-reboot" \
                     "ParameterName=max_worker_processes,ParameterValue=24,ApplyMethod=pending-reboot"

    For more information related to pg_transport related parameters, see PostgreSQL.TransportableDB.Parameters

  1. Modify the source RDS instance with the newly created parameter group. You have to reboot the instance because you modified some of the static parameters such as shared_preload_libraries and max_worker_processes. See the following command:
    aws rds modify-db-instance --db-instance-identifier pgsource --db-parameter-group-name pg11-source-transport-group —apply-immediately
    aws rds reboot-db-instance —db-instance-identifier pgsource
  1. After the RDS instance is up, check the connectivity to the instance via psql. Set up a sample database to test pg_transport. This post uses two databases of different sizes, but you can use any databases from your environment. See the following command:
    psql -h pgsource.<identifier>.<AWS Region>.rds.amazonaws.com -d postgres -U <RDS Master User>
    create database imdb;
    create database benchdb;
    

    a.) Load the imdb database with data from the IMDB dataset (approximately 14 GB). There are many sample datasets available in the open-source community for PostgreSQL database. For more information, see Sample Databases in the PostgreSQL wiki.
    b.) Load the benchdb database using the pgbench This post uses the scale factor of 10000 to initialize the pgbench database (approximately 130 GB). Enter the following command to load the data into benchdb:

    pgbench -i -h pgsource.<identifier>.<AWS Region>.rds.amazonaws.com -d benchdb --foreign-keys -s 10000
  1. Verify the source database environment configuration and create the pg_transport extension for each database. After connecting to the source instance, enter the following commands. The following command also shows the output:
    psql -h pgsource.<identifier>.<AWS Region>.rds.amazonaws.com -d postgres -U <RDS Master User>
    
    postgres=>\c imdb
     
    imdb-> CREATE EXTENSION pg_transport;
    
    imdb->\dt+
                              List of relations
     Schema |       Name       | Type  |  Owner  |  Size   | Description
    --------+------------------+-------+---------+---------+-------------
     public | name_basics      | table | dbadmin | 926 MB  |
     public | title_akas       | table | dbadmin | 1520 MB |
     public | title_basics     | table | dbadmin | 705 MB  |
     public | title_crew       | table | dbadmin | 307 MB  |
     public | title_episode    | table | dbadmin | 176 MB  |
     public | title_principals | table | dbadmin | 1870 MB |
     public | title_ratings    | table | dbadmin | 49 MB   |
    
    imdb=> SELECT pg_size_pretty( pg_database_size('imdb') );
     pg_size_pretty
    ----------------
     14 GB
    (1 row)
    
    imdb->\c benchdb 
    
    benchdb-> CREATE EXTENSION pg_transport;
    benchdb-> \dt+
                              List of relations
     Schema |       Name       | Type  |  Owner  |  Size   | Description
    --------+------------------+-------+---------+---------+-------------
     public | pgbench_accounts | table | dbadmin | 125 GB  |
     public | pgbench_branches | table | dbadmin | 392 kB  |
     public | pgbench_history  | table | dbadmin | 0 bytes |
     public | pgbench_tellers  | table | dbadmin | 4360 kB |
  1. Verify that the source database configuration parameters updated via the new parameter group by entering the following query:
    select name, setting,unit from pg_settings where name in ('shared_preload_libraries','max_worker_processes', 
    'pg_transport.num_workers','pg_transport.timing','pg_transport.work_mem'); 

You now have a fully configured source environment.

Setting up the target environment

To set up the target environment, complete the following steps:

  1. Create a new parameter group and attach it to the pgtarget (target RDS) instance. See the following command:
    aws rds create-db-parameter-group --db-parameter-group-name pg11-target-transport-group --db-parameter-group-family postgres11  --description "Parameter group with transport parameters enabled"
  2. Configure the pg_transport-related parameters. Load the extension via shared_preload_libraries and update the max_worker_processes. You should set max_worker_processes on both the source and destination DB instances to at least three times the destination DB instance’s setting for the pg_transport.num_workers parameter. Add a few more to provide non-transport background worker processes. This post configures the value of max_worker_processes as 24. See the following command:
    aws rds modify-db-parameter-group \
        --db-parameter-group-name pg11-target-transport-group \
        --parameters "ParameterName=pg_transport.num_workers,ParameterValue=4,ApplyMethod=immediate" \
                     "ParameterName=pg_transport.timing,ParameterValue=1,ApplyMethod=immediate" \
                     "ParameterName=pg_transport.work_mem,ParameterValue=131072,ApplyMethod=immediate" \
                     "ParameterName=shared_preload_libraries,ParameterValue=\"pg_stat_statements,pg_transport\",ApplyMethod=pending-reboot"\
                     "ParameterName=max_worker_processes,ParameterValue=24,ApplyMethod=pending-reboot"
  3. Update the target RDS instance with the newly created parameter group. You have to reboot the instance due to the modification of static parameters such as shared_preload_libraries and max_worker_processes. See the following command:
    aws rds modify-db-instance --db-instance-identifier pgtarget --db-parameter-group-name pg11-target-transport-group —apply-immediately
    aws rds reboot-db-instance —db-instance-identifier pgtarget
  4. Verify that the target database configuration parameters updated via the new parameter group by entering the following query:
    select name, setting,unit from pg_settings where name in ('shared_preload_libraries','max_worker_processes', 
    'pg_transport.num_workers','pg_transport.timing','pg_transport.work_mem');
  5. After the RDS instance is up, check the connectivity to the instance via psql. Make sure the target environment does not contain the database with the same name as that of the source. See the following command:
    psql -h pgtarget.<identifier>.<AWS Region>.rds.amazonaws.com -d postgres -U <RDS Master User>
    postgres-> \l  
    Note: \l option list the databases. 
  6. Set up the pg_transport extension in the destination instance with the following command:
    postgres-> CREATE EXTENSION pg_transport;

Testing pg_transport

You now have fully configured source and target environments and are ready to test this feature. pg_transport provides a simple option to validate both environments before the actual transfer of data (a dry run of your migration).

  1. Use the transport.import_from_server function to perform the migration. Set the parameter dry_run=true. See the following command:
    psql -h pgtarget.<identifier>.<AWS Region>.rds.amazonaws.com -d postgres -U <RDS Master User>
    
    SELECT transport.import_from_server('pgsource.<identifier>.<AWS Region>.rds.amazonaws.com',5432,<Source DB User>,<Source DB User password>,'imdb',<Target DB user password>,true);    --dry run
    INFO:  Starting dry-run of import of database "imdb".
    INFO:  Created connections to remote database        (took 0.02 seconds).
    INFO:  Checked remote cluster compatibility          (took 0.05 seconds).
    INFO:  Dry-run complete                         (took 0.08 seconds total).
     import_from_server
    --------------------
    
    (1 row)
  2. After a successful dry run, initiate the database migration. Change the parameter value as false (dry_run=false). See the following command:
    postgres=>SELECT transport.import_from_server('pgsource.<identifier>.<AWS Region>.rds.amazonaws.com',5432,<Source DB User>,<Source DB User password>,'imdb',<Target DB user password>,false);
    
    INFO:  Starting import of database "imdb".
    INFO:  Created connections to remote database        (took 0.04 seconds).
    INFO:  Marked remote database as read only           (took 2.13 seconds).
    INFO:  Checked remote cluster compatibility          (took 0.02 seconds).
    INFO:  Signaled creation of PITR blackout window     (took 2.01 seconds).
    INFO:  Applied remote database schema pre-data       (took 0.65 seconds).
    INFO:  Created connections to local cluster          (took 0.01 seconds).
    INFO:  Locked down destination database              (took 0.00 seconds).
    INFO:  16.36% of files transported
    INFO:  23.64% of files transported (current rate is ~300 MB/sec)
    INFO:  34.55% of files transported (current rate is ~285 MB/sec)
    INFO:  38.18% of files transported (current rate is ~306 MB/sec)
    INFO:  41.82% of files transported (current rate is ~256 MB/sec)
    INFO:  49.09% of files transported (current rate is ~274 MB/sec)
    INFO:  58.18% of files transported (current rate is ~298 MB/sec)
    INFO:  63.64% of files transported (current rate is ~281 MB/sec)
    INFO:  81.82% of files transported (current rate is ~287 MB/sec)
    INFO:  90.91% of files transported (current rate is ~268 MB/sec)
    INFO:  Completed transfer of database files          (took 52.87 seconds).
    INFO:  Completed clean up                            (took 1.02 seconds).
    INFO:  Physical transport complete              (took 58.76 seconds total).
     import_from_server
    --------------------
    
    (1 row)
    
    postgres=> \c imdb
    imdb=> SELECT pg_size_pretty( pg_database_size('imdb') ); ## check the database size
     pg_size_pretty
    ----------------
     14 GB
    (1 row)

    For this post, transferring the 14-GB imdb database took less than 60 seconds.
    To understand the performance of pg_transport, this post tested importing the same imdb database using the pg_dump and pg_restore method. See the following command:

    pg_dump -Fc -v -h pgsource.<identifier>.<AWS Region>.rds.amazonaws.com -U [RDS Master User] imdb > imdb.dump
    pg_restore -v -h pgtarget.<identifier>.<AWS Region>.rds.amazonaws.com -U [RDS Master User] -d imdb  imdb.dump

    It took approximately 8 minutes, as opposed to pg_transport, which took less than a minute.
    While the database transport is underway, you can run read-only queries only. For example, if you attempt to update the table in imdb database title_rating, you get the following error:

    psql -h pgsource.<identifier>.<AWS Region>.rds.amazonaws.com -d imdb -U <RDS Master User>
    imdb=> update title_ratings set "averageRating"=7, "numVotes"=3000 where tconst=1;
    ERROR: cannot execute UPDATE in a read-only transaction
  1. Repeat the same steps for migrating the pgbench database benchdb. See the following command:
    postgres=>SELECT transport.import_from_server('pgsource.<identifier>.<AWS Region>.rds.amazonaws.com',5432,<Source DB User>,<Source DB User password>,'benchdb',<Target DB user password>,false);
    
    .........
    INFO:  Completed transfer of database files          (took 565.60 seconds).
    INFO:  Completed clean up                            (took 1.04 seconds).
    INFO:  Physical transport complete              (took 569.51 seconds total).

    It took approximately 569 seconds with pg_transport.num_workers=8 and max_worker_processes=40. There were no other workloads or queries running on the source and target RDS instances when these tests were conducted.  Your transfer speed may vary due to factors such as source and target system load, instance type, and RDS parameter configurations.

Best practices

For large-size databases (approximately 200 GB), you can modify the parameters related to pg_transport. For example, you can increase pg_transport.num_workers to 8 and max_worker_processes to three times the number of num_workers. pg_transport process consumes memory resources at the instance level, which may impact other running databases on both the source and target. Therefore, plan and test your configuration in a development environment before applying changes in the production environment.

pg_transport.work_mem is allocated for each worker process (for example, pg_transport.num_workers). Make sure you have sufficient freeable memory available in your instance. For more information, see Best Practices for Amazon RDS.

Setting up pg_transport requires rebooting the instance due to the modification of static parameters such as max_worker_processes and shared_preload_libraries. Therefore, it’s a good idea to test and standardize these parameters based on your database environment to avoid frequent changes.

After transporting the database, set up appropriate roles and permissions at the target as per your database access requirements. Additionally, you can enable extensions as needed by your application.

Summary

This post provided use cases of RDS PostgreSQL transportable databases feature and highlighted important considerations when configuring the pg_transport extension, as well as performance advantages over the traditional dump and load method.  We encourage you to try out this feature in your environment. As always, AWS welcomes feedback, so please leave comments or questions below.

 


About the Author

 

Gowri Balasubramanian is a Principal Database Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on both relational and NoSQL database services, helping them improve the value of their solutions when using AWS.