AWS Database Blog

Export and import data from Amazon S3 to Amazon Aurora PostgreSQL

You can build highly distributed applications using a multitude of purpose-built databases by decoupling complex applications into smaller pieces, which allows you to choose the right database for the right job. Amazon Aurora is the preferred choice for OLTP workloads. Aurora makes it easy to set up, operate, and scale a relational database in the cloud.

This post demonstrates how you can export and import data from Amazon Aurora PostgreSQL-Compatible Edition to Amazon Simple Storage Service (Amazon S3) and shares associated best practices. The feature to export and import data to Amazon S3 is also available for Amazon Aurora MySQL-Compatible Edition.

Overview of Aurora PostgreSQL-Compatible and Amazon S3

Aurora is a MySQL-compatible and PostgreSQL-compatible relational database built for the cloud. It combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open-source databases.

Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means customers of all sizes and industries can use it to store and protect any amount of data.

The following diagram illustrates the solution architecture.

Prerequisites

Before you get started, complete the following prerequisite steps:

  1. Launch an Aurora PostgreSQL DB cluster. You can also use an existing cluster.
    Note: To export data to Amazon S3 from Aurora PostgreSQL, your database must be running one of the following PostgreSQL engine versions: 10.14 or higher, 11.9 or higher, or 12.4 or higher.
  2. Launch an Amazon EC2 instance with the PostgreSQL client installed. You can also use pgAdmin or another client tool of your choice for this purpose.
  3. Create the required AWS Identity and Access Management (IAM) policies and roles:
    • Create an IAM policy that grants least-privilege access to the resources in the following code and name it aurora-s3-access-pol. The policy must allow access to the S3 bucket where the files are stored (for this post, aurora-pg-sample-loaddata01).
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Action": [
                      "s3:GetObject",
                      "s3:AbortMultipartUpload",
                      "s3:DeleteObject",
                      "s3:ListMultipartUploadParts",
                      "s3:PutObject",
                      "s3:ListBucket"
                  ],
                  "Resource": [
                      "arn:aws:s3:::aurora-pg-sample-loaddata01/*",
                      "arn:aws:s3:::aurora-pg-sample-loaddata01"
                  ]
              }
          ]
      }
    • Create two IAM roles named aurora-s3-export-role and aurora-s3-import-role and modify the trust relationships according to the following code. AssumeRole allows Aurora to access other AWS services on your behalf.
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": "rds.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }
    • Attach the policy aurora-s3-access-pol from the previous step to both roles. For this post, the roles aurora-s3-export-role and aurora-s3-import-role have the ARNs arn:aws:iam::123456789012:role/aurora-s3-export-role and arn:aws:iam::123456789012:role/aurora-s3-import-role, respectively.
  4. Associate the IAM roles to the cluster. This enables the Aurora DB cluster to access the S3 bucket. See the following code:
    aws rds add-role-to-db-cluster --db-cluster-identifier aurora-postgres-cl --feature-name s3Export --role-arn arn:aws:iam::123456789012:role/aurora-s3-export-role
    aws rds add-role-to-db-cluster --db-cluster-identifier aurora-postgres-cl --feature-name s3Import --role-arn arn:aws:iam::123456789012:role/aurora-s3-import-role
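
    To confirm that both roles are associated with the cluster before you continue, you can list them with the AWS CLI. The following check is a quick sketch (assuming the same cluster identifier used above); it returns each attached role with its status and feature name:

    aws rds describe-db-clusters --db-cluster-identifier aurora-postgres-cl \
      --query 'DBClusters[0].AssociatedRoles'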

You’re now ready to explore the following use cases of exporting and importing data.

Export data from Aurora PostgreSQL to Amazon S3

To export your data, complete the following steps:

  1. Connect to the cluster as the primary user, postgres in our case. By default, the primary user has permission to export and import data from Amazon S3. For this post, you create a test user with the minimum permissions required to export data to the S3 bucket. See the following code:
    psql -h aurora-postgres-cl.cluster-XXXXXXXXXXXX.us-east-1.rds.amazonaws.com -p 5432 -d pg1 -U postgres
  2. Install the required PostgreSQL extensions, aws_s3 and aws_commons. When you install the aws_s3 extension, the aws_commons extension is also installed:
    CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;
  3. Create the user testuser (or you can use an existing user):
    create user testuser;
    \password testuser;
  4. You can verify the user has been created with the \du command:
    \du testuser
    
               List of roles
     Role name | Attributes | Member of 
    -----------+------------+-----------
     testuser  |            | {}
    
  5. Grant privileges on the aws_s3 schema to testuser (or another user you chose):
    grant execute on all functions in schema aws_s3 to testuser;
  6. Log in to the cluster as testuser:
    psql -h aurora-postgres-cl.cluster-XXXXXXXXXXXX.us-east-1.rds.amazonaws.com -p 5432 -d pg1 -U testuser
  7. Create and populate a test table called apg2s3_table, which you use for exporting data to the S3 bucket. You can also use any existing table in your database (preferably a small one) to test this feature. See the following code:
    CREATE TABLE apg2s3_table (
    bid bigint PRIMARY KEY,
    name varchar(80)
    );
  8. Insert a few records using the following statement:
    INSERT INTO apg2s3_table (bid,name)
    VALUES (1, 'Monday'), (2,'Tuesday'),
    (3, 'Wednesday');
    
  9. To export data from the Aurora table to the S3 bucket, use the aws_s3.query_export_to_s3 and aws_commons.create_s3_uri functions:
    • We use the aws_commons.create_s3_uri function to load a variable with the appropriate URI information required by the aws_s3.query_export_to_s3 function. The parameters required by the aws_commons.create_s3_uri function are the S3 bucket name, the full path (folder and filename) for the file to be created by the export command, and the Region. See the following example code:
      SELECT aws_commons.create_s3_uri(
      'aurora-pg-sample-loaddata01',
      'apg2s3_table',
      'us-east-1'
      ) AS s3_uri_1 \gset
      
  10. Export the entire table to the S3 bucket with the following code:
    SELECT *
    FROM aws_s3.query_export_to_s3(
    'SELECT * FROM apg2s3_table',:'s3_uri_1');
    
     rows_uploaded | files_uploaded | bytes_uploaded 
    ---------------+----------------+----------------
                 3 |              1 |             31
    (1 row)
    

    If you’re trying this feature out for the first time and exporting from a larger table, consider adding a LIMIT clause to the query.

    If you’re using a PostgreSQL client that doesn’t have the \gset command available, the workaround is to call the aws_commons.create_s3_uri function inside of the aws_s3.query_export_to_s3 function as follows:

    SELECT *
    FROM aws_s3.query_export_to_s3(
    'SELECT * FROM apg2s3_table',
    aws_commons.create_s3_uri(
    'aurora-pg-sample-loaddata01',
    'apg2s3_table','us-east-1'));
    
     rows_uploaded | files_uploaded | bytes_uploaded 
    ---------------+----------------+----------------
                 3 |              1 |             31
    (1 row)
  11. The final step is to verify the export file was created in the S3 bucket. The default file size threshold is 6 GB. Because the data selected by the statement is less than the file size threshold, a single file is created:
    aws s3 ls s3://aurora-pg-sample-loaddata01/apg2s3_table --human-readable --summarize
    2020-12-12 12:06:59   31 Bytes apg2s3_table
    
    Total Objects: 1
       Total Size: 31 Bytes

If you need the exported files to be smaller than 6 GB, you can identify a column to split the table data into smaller portions and run multiple aws_s3.query_export_to_s3 calls (using a WHERE condition to select each portion). It’s best to do this when the amount of data selected is more than 25 GB.
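
As an illustrative sketch of that approach (the bid ranges and file names here are hypothetical, chosen only to show the pattern), you could run one export per range:

-- Hypothetical example: export two ranges of bid to separate S3 objects
SELECT * FROM aws_s3.query_export_to_s3(
   'SELECT * FROM apg2s3_table WHERE bid <= 50000000',
   aws_commons.create_s3_uri('aurora-pg-sample-loaddata01','apg2s3_table_part1','us-east-1'));

SELECT * FROM aws_s3.query_export_to_s3(
   'SELECT * FROM apg2s3_table WHERE bid > 50000000',
   aws_commons.create_s3_uri('aurora-pg-sample-loaddata01','apg2s3_table_part2','us-east-1'));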

Import data from Amazon S3 to Aurora PostgreSQL

In this section, you load the data back to the Aurora table from the S3 file. Complete the following steps:

  1. Create a new table called apg2s3_table_imp:
    CREATE TABLE apg2s3_table_imp(
    bid bigint PRIMARY KEY, 
    name varchar(80)
    );
  2. Use the create_s3_uri function to load a variable named s3_uri_1 with the appropriate URI information required by the aws_s3.table_import_from_s3 function:
    SELECT aws_commons.create_s3_uri(
    'aurora-pg-sample-loaddata01',
    'apg2s3_table','us-east-1') 
    AS s3_uri_1 \gset
  3. Use the aws_s3.table_import_from_s3 function to import the data file from an Amazon S3 prefix:
    SELECT aws_s3.table_import_from_s3(
    'apg2s3_table_imp', 
    '', 
    '(format text)',
    :'s3_uri_1'
    );
  4. Verify the information was loaded into the apg2s3_table_imp table:
    SELECT * FROM apg2s3_table_imp;
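
    As a quick sanity check, you can also compare the row count with the source table. The expected output below assumes only the three sample rows inserted earlier:

    SELECT count(*) FROM apg2s3_table_imp;
     count
    -------
         3
    (1 row)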

Best practices

This section discusses a few best practices for bulk loading large datasets from Amazon S3 to your Aurora PostgreSQL database. The observations we present are based on a series of tests loading 100 million records to the apg2s3_table_imp table on a db.r5.2xlarge instance (see the preceding sections for table structure and example records).

We carried out load testing when no other active transactions were running on the cluster. Results might vary depending on your cluster loads and instance type.

The baseline load included 100 million records using the single file apg2s3_table.csv, without any structural changes to the target table apg2s3_table_imp or configuration changes to the Aurora PostgreSQL database. The data load took approximately 360 seconds. See the following code:

time psql -h aurora-postgres-cl.cluster-XXXXXXXXXXXX.us-east-1.rds.amazonaws.com -p 5432 -U testuser -d pg1 -c "SELECT aws_s3.table_import_from_s3('apg2s3_table_imp','','(format CSV)','aurora-pg-sample-loaddata01','apg2s3_table.csv','us-east-1')"
                                          table_import_from_s3                                           
---------------------------------------------------------------------------------------------------------
100000000 rows imported into relation "apg2s3_table_imp" from file apg2s3_table.csv of 3177777796 bytes
(1 row)

real    6m0.368s
user    0m0.005s
sys    0m0.005s

We used the same dataset to implement some best practices iteratively and measure their performance benefits on the load times. The following best practices are listed in order of observed performance benefit, from smallest to largest improvement.

Drop indexes and constraints

Although indexes can significantly increase the performance of some DML operations such as UPDATE and DELETE, these data structures can also decrease the performance of inserts, especially when dealing with bulk data inserts. The reason is that after each new record is inserted in the table, the associated indexes also need to be updated to reflect the new rows being loaded.

Therefore, a best practice for bulk data loads is to drop indexes and constraints on the target tables before the load and recreate them when the load is complete. In our test case, by dropping the primary key and its associated index on the apg2s3_table_imp table, we reduced the total load time to approximately 131 seconds (data load plus recreation of the primary key). This is roughly 2.7 times faster than the baseline. See the following code:

psql -h aurora-postgres-cl.cluster-XXXXXXXXXXXX.us-east-1.rds.amazonaws.com -p 5432 -U testuser -d pg1 -c "\d apg2s3_table_imp" 
                                     Table "public.apg2s3_table_imp"
 Column |         Type          | Collation | Nullable | Default | Storage  | Stats target | Description 
--------+-----------------------+-----------+----------+---------+----------+--------------+-------------
 bid    | bigint                |           | not null |         | plain    |              | 
 name   | character varying(80) |           |          |         | extended |              | 
Indexes:
    "apg2s3_table_imp_pkey" PRIMARY KEY, btree (bid)
  
psql -h aurora-postgres-cl.cluster-XXXXXXXXXXXX.us-east-1.rds.amazonaws.com -p 5432 -U testuser -d pg1 -c "ALTER TABLE apg2s3_table_imp DROP CONSTRAINT apg2s3_table_imp_pkey"
ALTER TABLE
 
psql -h aurora-postgres-cl.cluster-XXXXXXXXXXXX.us-east-1.rds.amazonaws.com -p 5432 -U testuser -d pg1 -c "\d apg2s3_table_imp"
                 Table "public.apg2s3_table_imp"
 Column |         Type          | Collation | Nullable | Default 
--------+-----------------------+-----------+----------+---------
 bid    | bigint                |           | not null | 
 name   | character varying(80) |           |          | 

time psql -h aurora-postgres-cl.cluster-XXXXXXXXXXXX.us-east-1.rds.amazonaws.com -p 5432 -U testuser -d pg1 -c "SELECT aws_s3.table_import_from_s3('apg2s3_table_imp','','(format CSV)','aurora-pg-sample-loaddata01','apg2s3_table.csv','us-east-1')"
                                          table_import_from_s3                                           
---------------------------------------------------------------------------------------------------------
 100000000 rows imported into relation "apg2s3_table_imp" from file apg2s3_table.csv of 3177777796 bytes
(1 row)

real	1m24.950s
user	0m0.005s
sys	0m0.005s

time psql -h demopg-instance-1.cmcmpwi7rtng.us-east-1.rds.amazonaws.com -p 5432 -U testuser  -d pg1 -c "ALTER TABLE apg2s3_table_imp ADD PRIMARY KEY (bid);"
ALTER TABLE

real	0m46.731s
user	0m0.009s
sys	0m0.000s

Concurrent loads

To improve load performance, you can split large datasets into multiple files that can be loaded concurrently. However, the degree of concurrency can impact other transactions running on the cluster. The number of vCPUs allocated to the instance type plays a key role because the load operation requires CPU cycles to read data, insert into tables, commit changes, and more. For more information, see Amazon RDS Instance Types. Loading several tables concurrently on an instance with few vCPUs can cause CPU utilization to spike and might impact existing workloads. For more information, see Overview of monitoring Amazon RDS.

For our test case, we split the original file into five pieces of 20 million records each (see the sketch after the script for one way to produce and upload such pieces). Then we created a simple shell script that runs a psql command for each file to be loaded:

#To make the test simpler, we named the datafiles as apg2s3_table_imp_XX.
#Where XX is a numeric value between 01 and 05
for i in 1 2 3 4 5 
do
   myFile=apg2s3_table_imp_0$i
   psql -h aurora-postgres-cl.cluster-cmcmpwi7rtng.us-east-1.rds.amazonaws.com -p 5432 -U testuser -d pg1 -c "SELECT aws_s3.table_import_from_s3( 'apg2s3_table_imp', '', '(format CSV)','aurora-pg-sample-loaddata01','$myFile','us-east-1')" &
done
wait
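
The five input files can be produced in whatever way fits your pipeline. As one hypothetical sketch (assuming GNU split and a local copy of apg2s3_table.csv), you could split the file by line count and upload the pieces to the bucket:

# Hypothetical preparation step: create apg2s3_table_imp_01 .. apg2s3_table_imp_05
# as 20-million-row pieces of the original CSV, then upload them to Amazon S3
split --numeric-suffixes=1 -a 2 -l 20000000 apg2s3_table.csv apg2s3_table_imp_
for f in apg2s3_table_imp_0*; do
   aws s3 cp "$f" s3://aurora-pg-sample-loaddata01/
done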

You can export the PGPASSWORD environment variable in the session where you’re running the script to prevent psql from prompting for a database password.
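
For example (the value shown is only a placeholder):

export PGPASSWORD='<password for testuser>'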

Next, we ran the script:

time ./s3Load5
                                           table_import_from_s3                                           
----------------------------------------------------------------------------------------------------------
 20000000 rows imported into relation "apg2s3_table_imp" from file apg2s3_table_imp_01 of 617777794 bytes
(1 row)

                                           table_import_from_s3                                           
----------------------------------------------------------------------------------------------------------
 20000000 rows imported into relation "apg2s3_table_imp" from file apg2s3_table_imp_02 of 640000000 bytes
(1 row)

                                           table_import_from_s3                                           
----------------------------------------------------------------------------------------------------------
 20000000 rows imported into relation "apg2s3_table_imp" from file apg2s3_table_imp_03 of 640000002 bytes
(1 row)

                                           table_import_from_s3                                           
----------------------------------------------------------------------------------------------------------
 20000000 rows imported into relation "apg2s3_table_imp" from file apg2s3_table_imp_04 of 640000000 bytes
(1 row)

                                           table_import_from_s3                                           
----------------------------------------------------------------------------------------------------------
 20000000 rows imported into relation "apg2s3_table_imp" from file apg2s3_table_imp_05 of 640000000 bytes
(1 row)

real	0m40.126s
user	0m0.025s
sys	0m0.020s

time psql -h demopg-instance-1.cmcmpwi7rtng.us-east-1.rds.amazonaws.com -p 5432 -U testuser  -d pg1 -c "ALTER TABLE apg2s3_table_imp ADD PRIMARY KEY (bid);"
ALTER TABLE

real	0m48.087s
user	0m0.006s
sys	0m0.003s

The full load duration, including primary key rebuild, was reduced to approximately 88 seconds, which demonstrates the performance benefits of parallel loads.

Implement partitioning

If you’re using partitioned tables, consider loading partitions in parallel to further improve load performance. To test this best practice, we partitioned the apg2s3_table_imp table on the bid column as follows:

CREATE TABLE apg2s3_table_imp (
   bid BIGINT,
   name VARCHAR(80)
) PARTITION BY RANGE(bid);

CREATE TABLE apg2s3_table_imp_01 PARTITION OF apg2s3_table_imp FOR VALUES FROM (1) TO (20000001);
CREATE TABLE apg2s3_table_imp_02 PARTITION OF apg2s3_table_imp FOR VALUES FROM (20000001) TO (40000001);
CREATE TABLE apg2s3_table_imp_03 PARTITION OF apg2s3_table_imp FOR VALUES FROM (40000001) TO (60000001);
CREATE TABLE apg2s3_table_imp_04 PARTITION OF apg2s3_table_imp FOR VALUES FROM (60000001) TO (80000001);
CREATE TABLE apg2s3_table_imp_05 PARTITION OF apg2s3_table_imp FOR VALUES FROM (80000001) TO (MAXVALUE);
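
Before starting the load, you can confirm the partition layout with the psql \d+ meta-command, which lists the partitions of a partitioned table:

\d+ apg2s3_table_imp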

To load all the partitions in parallel, we modified the shell script used in the previous example:

#To make the test simpler, we named the datafiles as apg2s3_table_imp_XX.
#Where XX is a numeric value between 01 and 05
for i in 1 2 3 4 5
do
   myFile=apg2s3_table_imp_0$i
   psql -h aurora-postgres-cl.cluster-cmcmpwi7rtng.us-east-1.rds.amazonaws.com -p 5432 -U testuser -d pg1 -c "SELECT aws_s3.table_import_from_s3( '$myFile', '', '(format CSV)','aurora-pg-sample-loaddata01','$myFile','us-east-1')" &
done
wait

Then we ran the script to load all five partitions in parallel:

time ./s3LoadPart
                                            table_import_from_s3                                             
-------------------------------------------------------------------------------------------------------------
 20000000 rows imported into relation "apg2s3_table_imp_04" from file apg2s3_table_imp_04 of 640000000 bytes
(1 row)
                                            table_import_from_s3                                             
-------------------------------------------------------------------------------------------------------------
 20000000 rows imported into relation "apg2s3_table_imp_02" from file apg2s3_table_imp_02 of 640000000 bytes
(1 row)
                                            table_import_from_s3                                             
-------------------------------------------------------------------------------------------------------------
 20000000 rows imported into relation "apg2s3_table_imp_05" from file apg2s3_table_imp_05 of 640000002 bytes
(1 row)
                                            table_import_from_s3                                             
-------------------------------------------------------------------------------------------------------------
 20000000 rows imported into relation "apg2s3_table_imp_01" from file apg2s3_table_imp_01 of 617777794 bytes
(1 row)
                                            table_import_from_s3                                             
-------------------------------------------------------------------------------------------------------------
 20000000 rows imported into relation "apg2s3_table_imp_03" from file apg2s3_table_imp_03 of 640000000 bytes
(1 row)

real	0m28.665s
user	0m0.022s
sys	0m0.024s

time psql -h demopg-instance-1.cmcmpwi7rtng.us-east-1.rds.amazonaws.com -p 5432 -U testuser -d pg1 -c "ALTER TABLE apg2s3_table_imp ADD PRIMARY KEY (bid);"
ALTER TABLE

real	0m48.516s
user	0m0.005s
sys	0m0.004s

Loading all partitions concurrently reduced the load times even further, to approximately 77 seconds.

Disable triggers

Triggers are frequently used in applications to perform additional processing when certain DML operations (such as INSERT, UPDATE, or DELETE) run against tables. Because of this additional processing, performing large data loads on a table with enabled triggers can decrease load performance substantially. It’s recommended to disable triggers prior to bulk load operations and re-enable them when the load is complete.
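
A minimal sketch of this pattern in PostgreSQL, assuming the target table has user-defined triggers, is the following (ALTER TABLE ... DISABLE TRIGGER USER leaves system triggers, such as those enforcing foreign keys, untouched):

-- Disable user-defined triggers on the target table before the bulk load
ALTER TABLE apg2s3_table_imp DISABLE TRIGGER USER;

-- ... run the aws_s3.table_import_from_s3 load here ...

-- Re-enable the triggers after the load completes
ALTER TABLE apg2s3_table_imp ENABLE TRIGGER USER;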

Summary

This post demonstrated how to export and import data between Aurora PostgreSQL and Amazon S3. Aurora offers operational analytics capabilities, and its integration with Amazon S3 makes it easier to establish an agile analytics environment. For more information about importing and exporting data, see Migrating data to Amazon Aurora with PostgreSQL compatibility. For more information about Aurora best practices, see Best Practices with Amazon Aurora PostgreSQL.


About the authors

Suresh Patnam is a Solutions Architect at AWS. He helps customers innovate on the AWS platform by building highly available, scalable, and secure architectures on Big Data and AI/ML. In his spare time, Suresh enjoys playing tennis and spending time with his family.

Israel Oros is a Database Migration Consultant at AWS. He works with customers in their journey to the cloud with a focus on complex database migration programs. In his spare time, Israel enjoys traveling to new places with his wife and riding his bicycle whenever weather permits.