AWS Database Blog
Export and import data from Amazon S3 to Amazon Aurora PostgreSQL
You can build highly distributed applications using a multitude of purpose-built databases by decoupling complex applications into smaller pieces, which allows you to choose the right database for the right job. Amazon Aurora is the preferred choice for OLTP workloads. Aurora makes it easy to set up, operate, and scale a relational database in the cloud.
This post demonstrates how you can export and import data between Amazon Aurora PostgreSQL-Compatible Edition and Amazon Simple Storage Service (Amazon S3) and shares associated best practices. The feature to export and import data to and from Amazon S3 is also available for Amazon Aurora MySQL-Compatible Edition.
Overview of Aurora PostgreSQL-Compatible and Amazon S3
Aurora is a MySQL-compatible and PostgreSQL-compatible relational database built for the cloud. It combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open-source databases.
Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means customers of all sizes and industries can use it to store and protect any amount of data.
The following diagram illustrates the solution architecture.
Prerequisites
Before you get started, complete the following prerequisite steps:
- Launch an Aurora PostgreSQL DB cluster. You can also use an existing cluster.
Note: To export data to Amazon S3 from Aurora PostgreSQL, your database must be running PostgreSQL engine version 10.14 or higher, 11.9 or higher, or 12.4 or higher.
- Launch an Amazon EC2 instance with the PostgreSQL client installed. You can also use the pgAdmin tool or another tool of your choice for this purpose.
- Create the required AWS Identity and Access Management (IAM) policies and roles:
  - Create an IAM policy with the least-restricted privilege to the required S3 resources and name it aurora-s3-access-pol. The policy must have access to the S3 bucket where the files are stored (for this post, aurora-pg-sample-loaddata01). A sketch of the policy follows this list.
  - Create two IAM roles named aurora-s3-export-role and aurora-s3-import-role, and modify their trust relationships so that Aurora can assume the roles (AssumeRole allows Aurora to access other AWS services on your behalf).
  - Attach the policy aurora-s3-access-pol from the previous step to both roles. For this post, we create the roles aurora-s3-export-role and aurora-s3-import-role. Their associated ARNs are arn:aws:iam::123456789012:role/aurora-s3-export-role and arn:aws:iam::123456789012:role/aurora-s3-import-role, respectively.
- Associate the IAM roles with the cluster. This enables the Aurora DB cluster to access the S3 bucket. See the sketch after this list.
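The following is a minimal sketch of these IAM pieces. The policy grants only the S3 actions needed for export and import against the aurora-pg-sample-loaddata01 bucket; the account ID matches the example ARNs in this post, and the cluster identifier apg-cluster is a placeholder for your own cluster.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AuroraS3Access",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::aurora-pg-sample-loaddata01",
        "arn:aws:s3:::aurora-pg-sample-loaddata01/*"
      ]
    }
  ]
}
```

Both roles use a trust relationship that lets the RDS service assume them:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "rds.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

You can then associate the roles with the cluster, for example with the AWS CLI (apg-cluster is a placeholder cluster identifier):

```bash
# Associate the export role with the cluster for the s3Export feature
aws rds add-role-to-db-cluster \
    --db-cluster-identifier apg-cluster \
    --feature-name s3Export \
    --role-arn arn:aws:iam::123456789012:role/aurora-s3-export-role

# Associate the import role with the cluster for the s3Import feature
aws rds add-role-to-db-cluster \
    --db-cluster-identifier apg-cluster \
    --feature-name s3Import \
    --role-arn arn:aws:iam::123456789012:role/aurora-s3-import-role
```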
You’re now ready to explore the following use cases of exporting and importing data.
Export data from Aurora PostgreSQL to Amazon S3
To export your data, complete the following steps:
- Connect to the cluster as the primary user, postgres in our case. By default, the primary user has permission to export and import data from Amazon S3. For this post, you create a test user with the least-required permission to export data to the S3 bucket (a consolidated sketch of the statements for these steps follows the list).
- Install the required PostgreSQL extensions, aws_s3 and aws_commons. When you install the aws_s3 extension, the aws_commons extension is also installed.
- Create the user testuser (or you can use an existing user).
- Verify the user has been created with the \du command.
- Grant privileges on the aws_s3 schema to testuser (or another user you chose).
- Log in to the cluster as testuser.
- Create and populate a test table called apg2s3_table, which you use for exporting data to the S3 bucket. You can use any existing table in your database (preferably a small one) to test this feature.
- Insert a few records into the table.
- To export data from the Aurora table to the S3 bucket, use the aws_s3.query_export_to_s3 and aws_commons.create_s3_uri functions:
  - We use the aws_commons.create_s3_uri function to load a variable with the appropriate URI information required by the aws_s3.query_export_to_s3 function. The parameters required by aws_commons.create_s3_uri are the S3 bucket name, the full path (folder and file name) for the file to be created by the export command, and the Region.
- Export the entire table to the S3 bucket by passing that variable to aws_s3.query_export_to_s3 (see the sketch after this list).
If you’re trying this feature out for the first time on a larger table, consider adding a LIMIT clause to the query to keep the export small.
If you’re using a PostgreSQL client that doesn’t have the \gset command available, the workaround is to call the aws_commons.create_s3_uri function inline, inside the aws_s3.query_export_to_s3 call (this variant also appears in the sketch after this list).
- The final step is to verify that the export file was created in the S3 bucket. The default file size threshold is 6 GB. Because the data selected by the statement is less than the file size threshold, a single file is created.
If you need the exported files to be smaller than 6 GB, you can identify a column to split the table data into small portions and run multiple aws_s3.query_export_to_s3 statements (using a WHERE condition), exporting each portion to a separate file. It’s best to do this when the amount of data selected is more than 25 GB.
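The following is a consolidated, minimal sketch of the statements behind these steps. The column layout (bid plus a name column), the Region us-east-1, and the testuser password are assumptions for illustration; the bucket and file names match the examples used throughout this post.

```sql
-- As the primary user (postgres): install the extensions and set up the test user
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;     -- CASCADE also installs aws_commons
CREATE USER testuser WITH PASSWORD 'ChangeMe123';  -- placeholder password
\du testuser
GRANT USAGE ON SCHEMA aws_s3 TO testuser;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA aws_s3 TO testuser;

-- Log in as testuser, then create and populate the test table
CREATE TABLE apg2s3_table (
    bid  bigint PRIMARY KEY,
    name varchar(80)
);

INSERT INTO apg2s3_table
SELECT generate_series(1, 5), 'sample record';

-- Build the S3 URI once and keep it in the psql variable s3_uri
SELECT aws_commons.create_s3_uri(
    'aurora-pg-sample-loaddata01',   -- S3 bucket
    'apg2s3_table.csv',              -- full path (folder and file name)
    'us-east-1'                      -- Region (assumed)
) AS s3_uri \gset

-- Export the entire table to the S3 bucket
SELECT * FROM aws_s3.query_export_to_s3(
    'SELECT * FROM apg2s3_table',
    :'s3_uri',
    options := 'format csv'
);

-- Workaround if your client doesn't support \gset: nest create_s3_uri in the call
SELECT * FROM aws_s3.query_export_to_s3(
    'SELECT * FROM apg2s3_table',
    aws_commons.create_s3_uri('aurora-pg-sample-loaddata01',
                              'apg2s3_table.csv', 'us-east-1'),
    options := 'format csv'
);
```

From a shell with the AWS CLI configured, aws s3 ls s3://aurora-pg-sample-loaddata01/ lets you confirm that the single export file was created.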
Import data from Amazon S3 to Aurora PostgreSQL
In this section, you load the data back to the Aurora table from the S3 file. Complete the following steps:
- Create a new table called apg2s3_table_imp.
- Use the aws_commons.create_s3_uri function to load a variable named s3_uri_1 with the appropriate URI information required by the aws_s3.table_import_from_s3 function.
- Use the aws_s3.table_import_from_s3 function to import the data file from the Amazon S3 prefix.
- Verify the information was loaded into the apg2s3_table_imp table. A sketch of these statements follows the list.
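Under the same assumptions as the export sketch (column layout and Region), a minimal sketch of these import steps:

```sql
-- Create the target table with the same structure as apg2s3_table
CREATE TABLE apg2s3_table_imp (
    bid  bigint PRIMARY KEY,
    name varchar(80)
);

-- Build the S3 URI for the exported file and keep it in the psql variable s3_uri_1
SELECT aws_commons.create_s3_uri(
    'aurora-pg-sample-loaddata01',
    'apg2s3_table.csv',
    'us-east-1'
) AS s3_uri_1 \gset

-- Import the file from Amazon S3 into the table
SELECT aws_s3.table_import_from_s3(
    'apg2s3_table_imp',   -- target table
    '',                   -- column list (empty means all columns)
    '(format csv)',       -- COPY options
    :'s3_uri_1'
);

-- Verify the rows were loaded
SELECT count(*) FROM apg2s3_table_imp;
SELECT * FROM apg2s3_table_imp LIMIT 5;
```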
Best practices
This section discusses a few best practices for bulk loading large datasets from Amazon S3 to your Aurora PostgreSQL database. The observations we present are based on a series of tests loading 100 million records into the apg2s3_table_imp table on a db.r5.2xlarge instance (see the preceding sections for the table structure and example records).
We carried out load testing when no other active transactions were running on the cluster. Results might vary depending on your cluster loads and instance type.
The baseline load included 100 million records using the single file apg2s3_table.csv, without any structural changes to the target table apg2s3_table_imp or configuration changes to the Aurora PostgreSQL database. The data load took approximately 360 seconds. See the following code:
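A sketch of the baseline load, run from psql with timing enabled; the bucket and file name follow the earlier examples, and the Region is an assumption:

```sql
\timing on

-- Baseline: load the single 100-million-record file in one statement
SELECT aws_s3.table_import_from_s3(
    'apg2s3_table_imp',
    '',
    '(format csv)',
    aws_commons.create_s3_uri('aurora-pg-sample-loaddata01',
                              'apg2s3_table.csv', 'us-east-1')
);
-- In our tests this took approximately 360 seconds (see text)
```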
We used the same dataset to implement some best practices iteratively to measure their performance benefits on the load times. The following best practices are listed in order of the observed performance benefits from lower to higher improvements.
Drop indexes and constraints
Although indexes can significantly increase the performance of some DML operations such as UPDATE and DELETE, these data structures can also decrease the performance of inserts, especially when dealing with bulk data inserts. The reason is that after each new record is inserted in the table, the associated indexes also need to be updated to reflect the new rows being loaded.
Therefore, a best practice for bulk data loads is to drop indexes and constraints on the target tables before a load and recreate them when the load is complete. In our test case, by dropping the primary key and index associated with the apg2s3_table_imp table, we reduced the data load to approximately 131 seconds (data load plus recreation of the primary key). This is roughly 2.7 times faster than the baseline. See the following code:
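A sketch of that sequence, assuming the primary key constraint carries the default generated name apg2s3_table_imp_pkey:

```sql
-- Drop the primary key (and its underlying index) before the load
ALTER TABLE apg2s3_table_imp DROP CONSTRAINT apg2s3_table_imp_pkey;

-- Reload the data (truncate the table first if you're repeating the test)
SELECT aws_s3.table_import_from_s3(
    'apg2s3_table_imp',
    '',
    '(format csv)',
    aws_commons.create_s3_uri('aurora-pg-sample-loaddata01',
                              'apg2s3_table.csv', 'us-east-1')
);

-- Recreate the primary key after the load completes
ALTER TABLE apg2s3_table_imp ADD CONSTRAINT apg2s3_table_imp_pkey PRIMARY KEY (bid);
```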
Concurrent loads
To improve load performance, you can split large datasets into multiple files that can be loaded concurrently. However, the degree of concurrency can impact other transactions running on the cluster. The number of vCPUs allocated to the instance type plays a key role because a load operation requires CPU cycles to read data, insert into tables, commit changes, and more. For more information, see Amazon RDS Instance Types. Loading several tables concurrently with few vCPUs can cause CPU utilization to spike and may impact the existing workload. For more information, see Overview of monitoring Amazon RDS.
For our test case, we split the original file into five pieces of 20 million records each. Then we created a simple shell script to run a psql command for each file to be loaded, like the following:
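A sketch of such a script, assuming the five pieces were uploaded as apg2s3_table.csv.1 through apg2s3_table.csv.5 and that the connection settings (PGHOST, PGUSER, PGDATABASE, PGPASSWORD) are set in the environment; the script and file names are placeholders:

```bash
#!/bin/bash
# load_concurrent.sh: launch five psql sessions, each importing one file from Amazon S3.

for i in 1 2 3 4 5
do
  psql -c "SELECT aws_s3.table_import_from_s3(
             'apg2s3_table_imp',
             '',
             '(format csv)',
             aws_commons.create_s3_uri('aurora-pg-sample-loaddata01',
                                       'apg2s3_table.csv.${i}',
                                       'us-east-1'))" &
done

# Wait for all background loads to finish
wait
```

You can then time the whole load with, for example, time sh load_concurrent.sh.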
You can export the PGPASSWORD environment variable in the session where you’re running the script to prevent psql from prompting for a database password.
Next, we ran the script.
The full load duration, including primary key rebuild, was reduced to approximately 88 seconds, which demonstrates the performance benefits of parallel loads.
Implement partitioning
If you’re using partitioned tables, consider loading partitions in parallel to further improve load performance. To test this best practice, we partitioned the apg2s3_table_imp table on the bid column as follows:
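A sketch of the partitioned layout, assuming the two-column table from the earlier sketches and bid values spanning 1 through 100 million split into five equal ranges:

```sql
-- Recreate apg2s3_table_imp as a range-partitioned table on bid
DROP TABLE IF EXISTS apg2s3_table_imp;

CREATE TABLE apg2s3_table_imp (
    bid  bigint,
    name varchar(80)
) PARTITION BY RANGE (bid);

CREATE TABLE apg2s3_table_imp_p1 PARTITION OF apg2s3_table_imp FOR VALUES FROM (1)        TO (20000001);
CREATE TABLE apg2s3_table_imp_p2 PARTITION OF apg2s3_table_imp FOR VALUES FROM (20000001) TO (40000001);
CREATE TABLE apg2s3_table_imp_p3 PARTITION OF apg2s3_table_imp FOR VALUES FROM (40000001) TO (60000001);
CREATE TABLE apg2s3_table_imp_p4 PARTITION OF apg2s3_table_imp FOR VALUES FROM (60000001) TO (80000001);
CREATE TABLE apg2s3_table_imp_p5 PARTITION OF apg2s3_table_imp FOR VALUES FROM (80000001) TO (100000001);
```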
To load all the partitions in parallel, we modified the shell script used in the previous example:
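A sketch of the modified script, assuming each of the five files holds the bid range belonging to one partition, so each psql session targets its partition table directly (you could also load through the parent table and let tuple routing place the rows):

```bash
#!/bin/bash
# load_partitions.sh: load each file into its corresponding partition in parallel.
# PGHOST, PGUSER, PGDATABASE, and PGPASSWORD are expected to be set in the environment.

for i in 1 2 3 4 5
do
  psql -c "SELECT aws_s3.table_import_from_s3(
             'apg2s3_table_imp_p${i}',
             '',
             '(format csv)',
             aws_commons.create_s3_uri('aurora-pg-sample-loaddata01',
                                       'apg2s3_table.csv.${i}',
                                       'us-east-1'))" &
done

wait
```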
Then we ran the script to load all five partitions in parallel.
Loading all partitions concurrently reduced the load times even further, to approximately 77 seconds.
Disable triggers
Triggers are frequently used in applications to perform additional processing when certain operations (such as INSERT, UPDATE, or DELETE DML statements) run against tables. Because of this additional processing, performing large data loads on a table with enabled triggers can decrease load performance substantially. It’s recommended to disable triggers prior to bulk load operations and re-enable them when the load is complete.
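For example, a minimal sketch that disables user-defined triggers on the target table from the earlier examples before the load and re-enables them afterward:

```sql
-- Disable user-defined triggers on the table before the bulk load
ALTER TABLE apg2s3_table_imp DISABLE TRIGGER USER;

-- ... run the aws_s3.table_import_from_s3 load here ...

-- Re-enable the triggers once the load is complete
ALTER TABLE apg2s3_table_imp ENABLE TRIGGER USER;
```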
Summary
This post demonstrated how to import and export data between Aurora PostgreSQL and Amazon S3. Aurora is powered with operational analytics capabilities, and integration with Amazon S3 makes it easier to establish an agile analytical environment. For more information about importing and exporting data, see Migrating data to Amazon Aurora with PostgreSQL compatibility. For more information about Aurora best practices, see Best Practices with Amazon Aurora PostgreSQL.
About the authors
Suresh Patnam is a Solutions Architect at AWS. He helps customers innovate on the AWS platform by building highly available, scalable, and secure architectures on Big Data and AI/ML. In his spare time, Suresh enjoys playing tennis and spending time with his family.
Israel Oros is a Database Migration Consultant at AWS. He works with customers in their journey to the cloud with a focus on complex database migration programs. In his spare time, Israel enjoys traveling to new places with his wife and riding his bicycle whenever weather permits.