AWS Database Blog
Archive and Purge Data for Amazon RDS for PostgreSQL and Amazon Aurora with PostgreSQL Compatibility using pg_partman and Amazon S3
The structured data industry has been experiencing exponential data growth year over year. It's important for relational database users to manage their data storage footprint wisely by understanding data relevance while keeping business needs in mind.
Data archiving is a critical part of data management: it lets you keep business-relevant hot data in high-cost, high-performance storage while moving less significant cold data to cheaper storage. This archival practice helps reduce Total Cost of Ownership (TCO) and improves database performance, because smaller datasets result in faster access times.
Traditionally, database archival meant moving infrequently accessed data to lower-cost storage, such as offsite tape or disk, to retain it for long periods and for compliance purposes.
Over the last decade, with the rise of analytics and machine learning (ML), access to archived data has become a necessity for enterprise data mining and decision support. These needs are much more easily met by Amazon Simple Storage Service (Amazon S3) as an optimized archive storage solution than by offsite tape and disk. Amazon S3 is scalable and secure, offering high performance, high availability, and durability that can help you archive and run analytics at a lower cost.
In some cases, data in a relational database is relevant for a period. A common way to handle such data is to partition and archive the tables based on time. The PostgreSQL extension pg_partman is a popular data partitioning automation extension available in Amazon Relational Database Service (Amazon RDS) for PostgreSQL or Amazon Aurora PostgreSQL-Compatible Edition.
In this post, we show how you can efficiently use PostgreSQL’s native range partition to partition current (hot) data with pg_partman and archive historical (cold) data in Amazon S3. The solution works with Amazon RDS for PostgreSQL version 12.5 onward, and Amazon Aurora PostgreSQL version 12.6 onward.
Solution overview
The solution combines PostgreSQL's native range partitioning feature with pg_partman and the Amazon S3 export/import functions available in Amazon RDS.
PostgreSQL lets you divide a table into partitions based on ranges of a key column, such as a date/time column. This offers significant performance and management benefits for archiving and purging historical data: instead of bulk inserts and deletes, you simply copy a partition out for archival and then drop the partition when you no longer need it.
pg_partman is a PostgreSQL extension that works with PostgreSQL's native partitioning to create and manage time-based and serial-based partition sets. It automates child partition creation and applies your retention policy to detach or drop obsolete partitions for you.
When a partition becomes obsolete, you call the aws_s3.query_export_to_s3 function to upload its data to a designated Amazon S3 folder. After you confirm the upload, you drop the partition from the database. The archived data stays in the S3 bucket for its lifetime, moving through Amazon S3 storage classes according to the S3 lifecycle policy you configure. Because you can query data in Amazon S3 directly, restoring the archived data to the database becomes optional.
Figure 1 shows data movement between the database and the Amazon S3 archive:
Figure 2 shows the destination S3 bucket’s structure for a monthly partition table:
The solution involves the following specific steps:
- Set up S3 bucket policies
- Set up your database
- Partition the candidate table
- Upload the obsolete table partition to Amazon S3 and drop the archived partition from the database
- (Optional) Restore the archived data from Amazon S3 to the database
Prerequisites
To implement the solution, you must have the following components ready.
- A data retention policy that defines what data should be archived, where, and for how long it should be kept.
- An AWS Key Management Service (AWS KMS) customer managed key.
- A private S3 bucket as the archive destination:
- Encrypted with your AWS KMS customer key
- With a lifecycle policy configured
- An RDS for PostgreSQL instance (version 12.5 and above) or Aurora PostgreSQL cluster (version 12.6 or above) with the pg_partman extension installed.
- A privileged IAM user to manage IAM policies, run AWS Command Line Interface (AWS CLI) commands, and access the AWS Management Console.
- A scheduler like pg_cron to run partition maintenance jobs.
- A database user who runs the archive procedures. If you prefer, you can create a separate database user to perform archive operations on behalf of all the schema/table owners. In that case, you must grant the user the SELECT privilege on the target tables and their partitions:
GRANT CONNECT ON DATABASE <dbname> TO <archive_user>;
GRANT USAGE ON SCHEMA <schemaname> TO <archive_user>;
GRANT SELECT ON <tablename> TO <archive_user>;
GRANT SELECT ON <tablepartitionname> TO <archive_user>;
Set up S3 bucket policies
First, let's set up the S3 bucket. Use the AWS CLI to run the following commands (a consolidated sketch follows this list):
1. Create a policy with the necessary privileges to access the S3 bucket and its AWS KMS key. The policy document is a file named s3-exp-policy in the current directory. Take note of the policy ARN.
2. Create two AWS Identity and Access Management (IAM) roles: one for export and one for import. The trust policy document is a file named s3-exp-assumerole-policy in the current directory.
3. Attach the policy created in Step 1 to the roles created in Step 2.
4. Add the roles to your Amazon Aurora cluster or RDS instance through the AWS Management Console. Select the Aurora cluster or RDS instance, go to the Connectivity & security tab, fill in the role information, and select Add role. Add the pg-s3-export-role and the pg-s3-import-role one by one, as shown in Figure 3 and Figure 4.
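The following is a sketch of these AWS CLI steps. The policy and role names (pg-s3-archive-policy, pg-s3-export-role, pg-s3-import-role), the example account ID, the bucket name, and the KMS key ARN are placeholders for illustration; substitute your own values:

# Policy document granting access to the archive bucket and its AWS KMS key.
cat > s3-exp-policy <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:AbortMultipartUpload", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::dbarchive-test", "arn:aws:s3:::dbarchive-test/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["kms:GenerateDataKey", "kms:Decrypt"],
      "Resource": "arn:aws:kms:us-east-1:111122223333:key/<your-kms-key-id>"
    }
  ]
}
EOF

# Trust policy that allows Amazon RDS to assume the roles.
cat > s3-exp-assumerole-policy <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "rds.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Step 1: create the policy and note the ARN in the output.
aws iam create-policy --policy-name pg-s3-archive-policy \
    --policy-document file://s3-exp-policy

# Step 2: create the export and import roles.
aws iam create-role --role-name pg-s3-export-role \
    --assume-role-policy-document file://s3-exp-assumerole-policy
aws iam create-role --role-name pg-s3-import-role \
    --assume-role-policy-document file://s3-exp-assumerole-policy

# Step 3: attach the policy from Step 1 to both roles.
aws iam attach-role-policy --role-name pg-s3-export-role \
    --policy-arn arn:aws:iam::111122223333:policy/pg-s3-archive-policy
aws iam attach-role-policy --role-name pg-s3-import-role \
    --policy-arn arn:aws:iam::111122223333:policy/pg-s3-archive-policy

Step 4 is performed in the console as described above. For an Aurora cluster, the equivalent CLI call is aws rds add-role-to-db-cluster with --feature-name s3Export or s3Import (aws rds add-role-to-db-instance for an RDS instance).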
Set up your database
To set up your database, you install the required PostgreSQL schema and extensions. From a PostgreSQL client like psql, complete the following steps as rds_superuser (a sketch follows this list):
- Create the partman schema and the pg_partman extension. The partman schema has the tables and functions that help you configure and manage the table partitions.
- Create the aws_s3 extension. Its functions enable moving data between the database and Amazon S3. Grant permissions on them to the database user who performs the archive.
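A minimal sketch of these statements, run as rds_superuser; the grants assume the separate archive user from the prerequisites (shown here as archive_user):

CREATE SCHEMA partman;
CREATE EXTENSION pg_partman WITH SCHEMA partman;
CREATE EXTENSION aws_s3 CASCADE;   -- CASCADE also installs the aws_commons extension it depends on

-- Allow the archive user to call the Amazon S3 export/import functions.
GRANT USAGE ON SCHEMA aws_s3 TO archive_user;
GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA aws_s3 TO archive_user;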
Partition the candidate table
Examine the candidate table to determine the partition key based on your purge criteria. Learn how data is distributed to choose a proper partition interval and partition starting point.
As an example, we created a PostgreSQL table called ticket_purchase_hist from the aws-database-migration-samples repository. The repository's README.md file has the details on how to create the dms_sample schema and how to populate it with data. For performance reasons, we replaced the double precision data type with int.
The following is the DDL for ticket_purchase_hist:
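The exact definition comes from the dms_sample schema in that repository; the following is an approximate reconstruction (with the double precision columns changed to int, as noted above), so verify the column names, types, and keys against the repository before using it:

CREATE TABLE dms_sample.ticket_purchase_hist (
    sporting_event_ticket_id int NOT NULL,
    purchased_by_id int NOT NULL,
    transaction_date_time timestamp NOT NULL,
    transferred_from_id int,
    purchase_price numeric(8,2) NOT NULL
);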
By design, with each “sell” or “transfer” of a ticket, a record is inserted into this table with the current timestamp. There are a total of 5,742,278 records in the table. Figure 5 shows data distribution by month:
Assume the retention policy is 12 months. The optimal partition strategy is to partition this table on the transaction_date_time column with a monthly interval.
Note: The procedures in this section require application downtime.
Using the psql tool, follow these steps to partition the table (a consolidated sketch of the commands follows this list):
1. Create a new monthly range partition table from the original table. Choose transaction_date_time as the partition key.
2. Rename the tables so that the new table becomes the working table.
3. As rds_superuser, register the parent table with pg_partman and create the partitions. Because the earliest transaction date in this table is January 10, 2021, we set p_start_partition=>'2021-01-01'. Your earliest transaction date may be different, so make sure that you query your table and set p_start_partition accordingly.
4. Drop the default partition to avoid possible future partition creation issues. The default partition stores records that can't be mapped to any other partition. If a record has been inserted into the default partition, creating a partition that covers that record's value range will fail.
5. Migrate the data into the partitioned table.
6. Verify that the data was successfully inserted into each partition, for example by counting the records in each partition.
7. Add keys and indexes to the partitioned table if they are part of the original table.
8. As rds_superuser, set the retention policy with pg_partman. This example uses 12 months; your environment may have different requirements.
9. As rds_superuser, from the psql command line, run partman.run_maintenance_proc() periodically (monthly in this example). The procedure creates future partitions ahead of time and detaches obsolete partitions according to the retention policy set in Step 8. We recommend that you run this regularly from a job scheduler. Refer to the Amazon RDS documentation to learn more.
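The following is a consolidated sketch of these steps, using the table and partition names from this post. It uses LIKE to copy the column definitions and a plain INSERT to migrate the data; adjust names, dates, and index definitions to your environment:

-- Step 1: create the new range-partitioned table with the original table's column definitions.
CREATE TABLE dms_sample.ticket_purchase_hist_new
    (LIKE dms_sample.ticket_purchase_hist INCLUDING DEFAULTS)
    PARTITION BY RANGE (transaction_date_time);

-- Step 2: rename so that the partitioned table takes over the original name.
ALTER TABLE dms_sample.ticket_purchase_hist RENAME TO ticket_purchase_hist_old;
ALTER TABLE dms_sample.ticket_purchase_hist_new RENAME TO ticket_purchase_hist;

-- Step 3: register the parent table with pg_partman (monthly partitions, starting 2021-01-01).
SELECT partman.create_parent(
    p_parent_table    => 'dms_sample.ticket_purchase_hist',
    p_control         => 'transaction_date_time',
    p_type            => 'native',
    p_interval        => 'monthly',
    p_start_partition => '2021-01-01');

-- Step 4: drop the default partition created by pg_partman.
DROP TABLE dms_sample.ticket_purchase_hist_default;

-- Step 5: migrate the data into the partitioned table.
INSERT INTO dms_sample.ticket_purchase_hist
    SELECT * FROM dms_sample.ticket_purchase_hist_old;

-- Step 6: spot-check row counts, partition by partition.
SELECT count(*) FROM dms_sample.ticket_purchase_hist_p2021_01;

-- Step 7: recreate keys and indexes from the original table, for example:
-- CREATE INDEX ON dms_sample.ticket_purchase_hist (transaction_date_time);

-- Step 8: set the 12-month retention policy; keep detached partitions so they can be archived.
UPDATE partman.part_config
   SET retention = '12 months',
       retention_keep_table = true
 WHERE parent_table = 'dms_sample.ticket_purchase_hist';

-- Step 9: run partition maintenance, or schedule it, for example with pg_cron:
CALL partman.run_maintenance_proc();
-- SELECT cron.schedule('partman-maintenance', '0 1 1 * *', 'CALL partman.run_maintenance_proc()');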
Upload obsolete table partition to Amazon S3 and drop the archived partition from the database
pg_partman maintains each partition as a child table of its parent. As time goes by, a partition becomes obsolete as evaluated against the retention policy. partman.run_maintenance_proc() detects and detaches these partitions from the parent table automatically. Your task is to identify the detached partitions and ship them to the designated S3 bucket for archiving.
1. From the psql command line, run a query to identify the partitions that have been detached from the parent (see the sketch after this list).
2. Export the detached table data to the designated folder in the S3 bucket. In this step, you run psql to archive the data to the S3 bucket using two PostgreSQL functions:
  - aws_commons.create_s3_uri defines the target S3 bucket information. In the example, you enter the bucket name as 'dbarchive-test', the path as 'testdb/dms_sample/ticket_purchase_hist/202101/data', and the AWS Region as 'us-east-1'.
  - aws_s3.query_export_to_s3 extracts and exports data based on a query. In the example, the query exports the data for the entire partition.
3. Validate by comparing the record count in Amazon S3 with the record count in the database table. From the AWS CLI, run an Amazon S3 Select query to count the records for the archived table in the S3 bucket. In the example, the number of records is written to an output file called recordcount.csv. You should see 64946 in the recordcount.csv file, which matches the table record count.
4. Upload the archived table DDL to Amazon S3. Because the table structure may change over time, it's a good idea to add the archived table DDL to the Amazon S3 folder as well. From a machine with pg_dump installed, save the CREATE TABLE statement to a file named ticket_purchase_hist_p2021_01.sql, and then use the AWS CLI to upload ticket_purchase_hist_p2021_01.sql to the DDL folder of the S3 bucket.
5. Drop the archived table from the database using psql.
(Optional) Restore archived data from Amazon S3 to the database
You can access the data in Amazon S3 using tools like Amazon S3 Select (or Amazon S3 Glacier Select) or Amazon Athena. Only in special cases do you need to restore the archived data to a database. You can do this in two steps:
1. Restore the table DDL from Amazon S3. From the AWS CLI, download the DDL file from the S3 bucket; then, from psql, create the table using the downloaded file (see the sketch after this list).
2. Restore the archived data by calling the aws_s3.table_import_from_s3 and aws_commons.create_s3_uri functions:
  - aws_commons.create_s3_uri defines the source S3 bucket information. In the example, you enter the bucket name as 'dbarchive-test', the path as 'testdb/dms_sample/ticket_purchase_hist/202101/data', and the AWS Region as 'us-east-1'.
  - aws_s3.table_import_from_s3 imports data from the S3 bucket into the target table. In the example, you enter 'dms_sample.ticket_purchase_hist_p2021_01' as the target table name and 'text' for the format.
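A sketch of the DDL restore (Step 1), assuming the DDL was uploaded to the ddl folder used earlier; replace the connection placeholders with your own values:

aws s3 cp \
    s3://dbarchive-test/testdb/dms_sample/ticket_purchase_hist/202101/ddl/ticket_purchase_hist_p2021_01.sql .
psql -h <your-db-endpoint> -U <archive_user> -d <dbname> -f ticket_purchase_hist_p2021_01.sql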
Example code for psql:
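A minimal sketch of the import call (Step 2), assuming the bucket, path, and Region above and an empty column list so that all columns are imported:

SELECT aws_s3.table_import_from_s3(
    'dms_sample.ticket_purchase_hist_p2021_01',   -- target table
    '',                                           -- column list; empty imports all columns
    '(format text)',                              -- matches the text format used by the export
    aws_commons.create_s3_uri(
        'dbarchive-test',
        'testdb/dms_sample/ticket_purchase_hist/202101/data',
        'us-east-1'));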
The output reports the number of rows imported into the target table.
Summary
In this post, we showed you an efficient PostgreSQL database archive solution that combines table partitioning with pg_partman and the Amazon S3 functions. You can adapt the same solution to multiple tables in your own schema. Although we ran the commands manually, you can automate the process by scripting the steps together and scheduling them, for example with AWS Lambda.
Share your experience in the Comments section. We highly value your feedback!
About the Authors
Li Liu is a Senior Database Specialty Architect with the Professional Services team at Amazon Web Services. She helps customers migrate traditional on-premises databases to the AWS Cloud. She specializes in database design, architecture, and performance tuning.
Shunan Xiang is a Senior Database Consultant with the Professional Services team at AWS. He works as a database migration specialist to help Amazon customers design and implement scalable, secure, performant, and robust database solutions on the cloud.
Samujjwal Roy is a Principal DB Consultant with the Professional Services team at Amazon Web Services. He has been with Amazon for 17+ years and has led migration projects for internal and external Amazon customers to move their on-premises database environment to AWS Cloud database solutions.