AWS Database Blog

Archive and Purge Data for Amazon RDS for PostgreSQL and Amazon Aurora with PostgreSQL Compatibility using pg_partman and Amazon S3

Structured data volumes continue to grow exponentially year over year. It’s important for relational database users to manage their data storage footprint wisely by understanding data relevance while keeping business needs in mind.

Data archiving is a critical part of data management: it keeps business-relevant hot data in high-performance (and higher-cost) storage while moving less significant cold data to cheaper storage. This practice helps reduce total cost of ownership (TCO) and improve database performance, because smaller datasets result in faster access times.

Traditionally, database archiving meant moving infrequently accessed data to lower-cost storage, such as offsite tape or disk, to retain it for long periods and to meet compliance requirements.

Over the last decade, with the rise of analytics and machine learning (ML), access to archived data has become a necessity for enterprise data mining and decision support. Amazon Simple Storage Service (Amazon S3) meets these needs far more easily than offsite tape or disk. Amazon S3 is scalable and secure, offering the performance, availability, and durability that help you archive data and run analytics at a lower cost.

In some cases, data in a relational database is relevant only for a limited period. A common way to handle such data is to partition the tables by time and archive the older partitions. The PostgreSQL extension pg_partman is a popular partitioning automation extension available in Amazon Relational Database Service (Amazon RDS) for PostgreSQL and Amazon Aurora PostgreSQL-Compatible Edition.

In this post, we show how you can efficiently use PostgreSQL’s native range partitioning with pg_partman to keep current (hot) data in the database and archive historical (cold) data in Amazon S3. The solution works with Amazon RDS for PostgreSQL version 12.5 onward, and Amazon Aurora PostgreSQL version 12.6 onward.

Solution overview

The solution combines PostgreSQL’s native range partitioning feature with pg_partman and Amazon RDS’s Amazon S3 export/import functions.

PostgreSQL lets you divide a table into partitions based on date/time ranges of key columns, which offers significant performance and management benefits for archiving and purging historical data. Instead of bulk inserts and deletes, you simply copy a partition out for archiving, then drop the partition when you no longer need it.

pg_partman is a PostgreSQL extension that supports PostgreSQL’s native partitioning to create and manage time-based and serial-based partition sets. It automates the child partition creation and works with your retention policy to detach or drop the obsolete partitions for you.

When a partition becomes obsolete, you call the aws_s3.query_export_to_s3 function to upload its data to a designated Amazon S3 folder. After you confirm the export, you drop the partition from the database. The archived data stays in the S3 bucket for its lifetime, moving through different Amazon S3 storage tiers according to your S3 Lifecycle policy. Because you can query the data in Amazon S3 directly, restoring the archived data to the database becomes optional.

Figure 1 shows data movement between the database and the Amazon S3 archive:

Figure 1 – Data movement between the database and Amazon S3

Figure 2 shows the destination S3 bucket’s structure for a monthly partition table:

Figure 2 – Diagram representing the bucket’s folder organization

The solution involves the following specific steps:

  • Set up S3 bucket policies
  • Set up your database
  • Partition the candidate table
  • Upload the obsolete table partition to Amazon S3 and drop the archived partition from the database
  • (Optional) Restore the archived data from Amazon S3 to the database

Prerequisites

To implement the solution, you must have the following components ready.

  • A data retention policy that defines what data should be archived, where, and for how long it should be kept.
  • An AWS Key Management Service (AWS KMS) customer managed key.
  • A private S3 bucket as the archive destination.
  • An RDS for PostgreSQL instance (version 12.5 or later) or an Aurora PostgreSQL cluster (version 12.6 or later) with the pg_partman extension installed.
  • A privileged IAM user to manage IAM policies, run AWS Command Line Interface (AWS CLI) commands, and access the AWS Management Console.
  • A scheduler like pg_cron to run partition maintenance jobs.
  • A database user who runs the archive procedures. If you prefer, you can create a separate database user to perform archive operations on behalf of all the schema/table owners. In that case, you must grant the user the SELECT privilege on the target tables and their partitions:
    • GRANT CONNECT ON DATABASE <dbname> TO <archive_user>;
    • GRANT USAGE ON SCHEMA <schemaname> TO <archive_user>;
    • GRANT SELECT ON <tablename> TO <archive_user>;
    • GRANT SELECT ON <tablepartitionname> TO <archive_user>;

Set up S3 bucket policies

First, set up the permissions needed to access the S3 bucket. Use the AWS CLI to run the following commands:

  1. Create a policy with the necessary privileges to access the S3 bucket and its AWS KMS key. Take note of the policy ARN.
    aws iam create-policy --policy-name pg-s3-policy --policy-document file://s3-exp-policy

    In the preceding code, s3-exp-policy is a file in the current directory with the following content:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:AbortMultipartUpload",
                    "s3:DeleteObject",
                    "s3:ListMultipartUploadParts",
                    "s3:PutObject",
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::<bucketname>/<databasename>*",
                    "arn:aws:s3:::<bucketname>"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "kms:Decrypt",
                    "kms:DescribeKey",
                    "kms:GenerateDataKey",
                    "kms:Encrypt",
                    "kms:ReEncrypt*"
                ],
                "Resource": "<yourKMSkeyarn>"
            }
        ]
    }
  2. Create two AWS Identity and Access Management (IAM) roles: one for export and one for import.
    aws iam create-role  --role-name pg-s3-export-role  --assume-role-policy-document file://s3-exp-assumerole-policy
    
    aws iam create-role  --role-name pg-s3-import-role  --assume-role-policy-document file://s3-exp-assumerole-policy

    In the preceding code, s3-exp-assumerole-policy is saved in the current directory with the following contents:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "rds.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
  3. Attach the policy created in Step 1 to the roles created in Step 2.
    aws iam attach-role-policy  --policy-arn <yourpolicyarn> --role-name pg-s3-export-role
    aws iam attach-role-policy  --policy-arn <yourpolicyarn> --role-name pg-s3-import-role
  4. Add the roles to your Amazon Aurora cluster or RDS instance through the AWS Management Console.

Select the Aurora cluster or RDS instance, and on the Connectivity & security tab, fill in the role information and choose Add role. Add the pg-s3-export-role and the pg-s3-import-role one at a time, as shown in Figure 3 and Figure 4.

Figure 3 – Add S3 export role to Aurora cluster

Figure 4 – Add S3 import role to Aurora cluster
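
If you prefer to script this step instead of using the console, you can also associate the roles with the AWS CLI. The following commands are a sketch with placeholder identifiers; use add-role-to-db-cluster for an Aurora cluster and add-role-to-db-instance for an RDS for PostgreSQL instance:

aws rds add-role-to-db-cluster --db-cluster-identifier <yourclusterid> --role-arn <pg-s3-export-role-arn> --feature-name s3Export
aws rds add-role-to-db-cluster --db-cluster-identifier <yourclusterid> --role-arn <pg-s3-import-role-arn> --feature-name s3Import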

Set up your database

To set up your database, you install the required PostgreSQL schema and extensions. From a PostgreSQL client like psql, complete the following steps as rds_superuser:

  1. Create partman schema and pg_partman extension:
    CREATE SCHEMA partman;
    
    CREATE EXTENSION pg_partman WITH SCHEMA partman;

    The partman schema contains the tables and functions required to configure and manage the table partitions.

  2. Create the aws_s3 extension. Its functions move data between the database and Amazon S3. Grant permissions on it to the database user who performs the archiving.
    CREATE EXTENSION aws_s3 CASCADE;
    GRANT USAGE ON SCHEMA aws_s3 TO <archive_user>;
    GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA aws_s3 TO <archive_user>;
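
Optionally, you can verify that the extensions are in place before continuing. The following catalog query is a quick check (the aws_commons extension is installed automatically by the CASCADE option of CREATE EXTENSION aws_s3):

SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('pg_partman', 'aws_s3', 'aws_commons');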

Partition the candidate table

Examine the candidate table to determine the partition key based on your purge criteria, and review how the data is distributed to choose an appropriate partition interval and starting point.

As an example, we created a PostgreSQL table called ticket_purchase_hist from the aws-database-migration-samples repository on GitHub. The repository’s README.md file explains how to create the dms_sample schema and populate it with data. For performance reasons, we replaced the double precision data type with int.

The following is the DDL for ticket_purchase_hist:

CREATE TABLE dms_sample.ticket_purchase_hist (
	sporting_event_ticket_id int NOT NULL,
	purchased_by_id int NOT NULL,
	transaction_date_time timestamp(0) NOT NULL,
	transferred_from_id int NULL,
	purchase_price numeric(8, 2) NOT NULL,
	CONSTRAINT ticket_purchase_hist_pk PRIMARY KEY (sporting_event_ticket_id, purchased_by_id, transaction_date_time)
);

By design, with each “sell” or “transfer” of a ticket, a record is inserted into this table with the current timestamp. There are a total of 5,742,278 records in the table. Figure 5 shows data distribution by month:

Figure 5 – Table data distribution by month
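
You can produce a similar monthly distribution yourself with a query along these lines (a simple sketch against the sample table):

-- Count rows per calendar month to help choose the partition interval and start date
SELECT date_trunc('month', transaction_date_time) AS month, count(*) AS record_count
FROM dms_sample.ticket_purchase_hist
GROUP BY 1
ORDER BY 1;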

Assume the retention policy is 12 months. The optimal partition strategy is to partition this table on the transaction_date_time column with a monthly interval.

Note: The procedures in this section require application downtime.

Using the psql tool, follow these steps to partition the table:

  1. Create a new monthly range partition table from the original table. Choose transaction_date_time as the partition key.
    CREATE TABLE dms_sample.ticket_purchase_hist_temp (
    	sporting_event_ticket_id int NOT NULL,
    	purchased_by_id int NOT NULL,
    	transaction_date_time timestamp(0) NOT NULL,
    	transferred_from_id int NULL,
    	purchase_price numeric(8, 2) NOT NULL)
    	PARTITION BY RANGE(transaction_date_time);
  2. Rename the tables so that the new table becomes the working table:
    ALTER TABLE dms_sample.ticket_purchase_hist RENAME TO ticket_purchase_hist_old;
    ALTER TABLE dms_sample.ticket_purchase_hist_temp RENAME TO ticket_purchase_hist;
  3. As rds_superuser, register the parent table with pg_partman and create the partitions. Because the earliest transaction date in this table is January 10, 2021, we set p_start_partition => '2021-01-01'. Your earliest transaction date may be different, so make sure that you query your table and set p_start_partition accordingly.
    SELECT partman.create_parent(
        p_parent_table => 'dms_sample.ticket_purchase_hist',
        p_control => 'transaction_date_time',
        p_type => 'native',
        p_interval => 'monthly',
        p_premake => 7,
        p_start_partition => '2021-01-01');
  4. Drop the default partition to avoid partition creation issues later. The default partition stores records that can’t be mapped to any other partition. If a record has been inserted into the default partition, creating a partition that covers that record’s value range will fail.
    DROP TABLE dms_sample.ticket_purchase_hist_default;
  5. Migrate data into the partitioned table.
    INSERT INTO dms_sample.ticket_purchase_hist SELECT * FROM dms_sample.ticket_purchase_hist_old;
  6. Verify that the data was successfully inserted into each partition. For example:
    SELECT count(*) FROM dms_sample.ticket_purchase_hist_p2021_01;
    Output: 64946
  7. Add keys and indexes to the partitioned table if they are part of the original table.
    ALTER TABLE dms_sample.ticket_purchase_hist ADD CONSTRAINT ticket_purchase_hist_pk1 PRIMARY KEY (sporting_event_ticket_id, purchased_by_id, transaction_date_time);
  8. As rds_superuser, set retention policy with pg_partman.
    This example uses 12 months. Your environment may have different requirements.

    UPDATE partman.part_config SET infinite_time_partitions = true, retention = '12 months', retention_keep_table=true WHERE parent_table = 'dms_sample.ticket_purchase_hist';
  9. As rds_superuser, from the psql command line, run the following query periodically (monthly in this example).
    CALL partman.run_maintenance_proc();

The procedure creates future partitions ahead of time and detaches the obsolete partitions according to the retention policy that we set in Step 8. We recommend that you run this regularly from a job scheduler such as pg_cron. Refer to the Amazon RDS documentation to learn more.
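
For example, with the pg_cron extension mentioned in the prerequisites, a monthly schedule might look like the following sketch (the job name and timing are placeholders; pg_cron must be enabled on your instance and configured for your database):

SELECT cron.schedule(
    'partman-maintenance',            -- job name (placeholder)
    '0 1 1 * *',                      -- 01:00 on the first day of every month
    'CALL partman.run_maintenance_proc()');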

Upload obsolete table partition to Amazon S3 and drop the archived partition from the database

pg_partman maintains each partition as a child table of its parent. As time goes by, a partition becomes obsolete when evaluated against the retention policy. partman.run_maintenance_proc() detects and detaches these partitions from the parent table automatically. Your task is to identify the detached partitions and ship them to the designated S3 bucket for archiving.

  1. From the psql command line, run the following query to identify the partitions that are detached from the parent:
    SELECT relname, n.nspname
    FROM pg_class
    JOIN pg_namespace n ON n.oid = relnamespace
    WHERE relkind = 'r' AND relispartition = 'f'
    AND relname LIKE 'ticket_purchase_hist_p%' AND n.nspname = 'dms_sample';
    
    Output:
    relname                      |nspname   |
    ticket_purchase_hist_p2021_01|dms_sample|
  2. Export the detached table data to the designated folder in the S3 bucket.
    In this step, you run the psql command to archive the data to the S3 bucket using two PostgreSQL functions:

    • create_s3_uri defines the target S3 bucket information. In this example, the bucket name is 'dbarchive-test', the path is 'testdb/dms_sample/ticket_purchase_hist/202101/data', and the AWS Region is 'us-east-1'.
    • The query_export_to_s3 function extracts and exports data based on a query.

     In the following example, the query exports the data for the entire partition:

    SELECT *
    FROM aws_s3.query_export_to_s3(
    'SELECT * FROM dms_sample.ticket_purchase_hist_p2021_01',
    aws_commons.create_s3_uri(
    'dbarchive-test',
    'testdb/dms_sample/ticket_purchase_hist/202101/data','us-east-1'));
  3. Validate by comparing the number of records in Amazon S3 with the number of records in the database table.
    From the AWS CLI, run the following Amazon S3 Select query to count the records for the archived table in the S3 bucket. In this example, the record count is written to an output file called recordcount.csv.

    aws s3api select-object-content \
        --bucket dbarchive-test \
        --key testdb/dms_sample/ticket_purchase_hist/202101/data \
        --expression "select count(*) from s3object" \
        --expression-type 'SQL' \
        --input-serialization '{"CSV": {}, "CompressionType": "NONE"}' \
        --output-serialization '{"CSV": {}}' "recordcount.csv"

    You should see 64946 in the recordcount.csv file, which matches the table record count.

  4. Upload the archived table DDL to Amazon S3.
    Because the table structure may change over time, it’s a good idea to add the archived table DDL to the Amazon S3 folder as well. In the following example, you use the PostgreSQL pg_dump command to save the CREATE TABLE statement as a file named ticket_purchase_hist_p2021_01.sql, and then upload it to the S3 bucket. From a machine with pg_dump installed, run the following command to extract and save the CREATE TABLE DDL.

    pg_dump -h <dbhost> -d <dbname> -U <dbuser> -s -t dms_sample.ticket_purchase_hist_p2021_01 > ticket_purchase_hist_p2021_01.sql

    With AWS CLI, upload the ticket_purchase_hist_p2021_01.sql to the DDL folder of the S3 bucket.

    aws s3 cp ticket_purchase_hist_p2021_01.sql s3://dbarchive-test/testdb/dms_sample/ticket_purchase_hist/202101/ddl/ticket_purchase_hist_p2021_01.sql
  5. Drop the archived table from the database using psql.
    DROP TABLE dms_sample.ticket_purchase_hist_p2021_01 CASCADE;
    Note: You need to be the table owner to drop the table.

(Optional) Restore archived data from Amazon S3 to the database

You can access the data in Amazon S3 directly using tools such as Amazon S3 Select (or Amazon S3 Glacier Select) or Amazon Athena, so you only need to restore archived data to the database for special requirements. You can do this in two steps:

  1. Restore the table DDL from Amazon S3.
    From the AWS CLI, download the DDL file from the S3 bucket. For example:

    aws s3 cp s3://dbarchive-test/testdb/dms_sample/ticket_purchase_hist/202101/ddl/ticket_purchase_hist_p2021_01.sql ticket_purchase_hist_p2021_01.sql

    From psql, create the table using the above file.
    For example:

    psql "dbname=<dbname> host=<dbhost> user=<dbuser> port=5432" -f ticket_purchase_hist_p2021_01.sql
  2. Restore the archived data by calling aws_s3.table_import_from_s3 and aws_commons.create_s3_uri functions.
    • aws_commons.create_s3_uri defines the source S3 bucket information. In this example, the bucket name is 'dbarchive-test', the path is 'testdb/dms_sample/ticket_purchase_hist/202101/data', and the AWS Region is 'us-east-1'.
    • aws_s3.table_import_from_s3 imports data from the S3 bucket into the target table. In this example, the target table name is 'dms_sample.ticket_purchase_hist_p2021_01' and the format is 'text'.

Example code for psql:

SELECT *
FROM aws_s3.table_import_from_s3(
'dms_sample.ticket_purchase_hist_p2021_01',
'', 
'(format text)',
aws_commons.create_s3_uri(
'dbarchive-test',
'testdb/dms_sample/ticket_purchase_hist/202101/data', 'us-east-1'));

Output:

64946 rows imported into relation "dms_sample.ticket_purchase_hist_p2021_01" from file testdb/dms_sample/ticket_purchase_hist/202101/data of 3003108 bytes
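
If you also want the restored data to be visible through the partitioned parent table again, you can reattach the restored table as a partition. The following statement is a sketch for the January 2021 monthly partition; note that partman.run_maintenance_proc() will detach it again on its next run if the partition still falls outside the retention window:

ALTER TABLE dms_sample.ticket_purchase_hist
    ATTACH PARTITION dms_sample.ticket_purchase_hist_p2021_01
    FOR VALUES FROM ('2021-01-01') TO ('2021-02-01');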

Summary

In this post, we showed you an efficient PostgreSQL database archive solution that combines table partitioning with pg_partman and the Amazon S3 functions. You can adapt the same solution to multiple tables in your own schema. Although we ran the commands manually, you can automate the process by scripting the steps together and scheduling them to run, for example with AWS Lambda.

Share your experience in the Comments section. We highly value your feedback!


About the Authors

Li Liu is a Senior Database Specialty Architect with the Professional Services team at Amazon Web Services. She helps customers migrate traditional on-premises databases to the AWS Cloud. She specializes in database design, architecture, and performance tuning.

Shunan Xiang is a Senior Database Consultant with the Professional Services team at AWS. He works as a database migration specialist to help Amazon customers design and implement scalable, secure, performant, and robust database solutions in the cloud.

Samujjwal Roy is a Principal DB Consultant with the Professional Services team at Amazon Web Services. He has been with Amazon for 17+ years and has led migration projects for internal and external Amazon customers to move their on-premises database environments to AWS Cloud database solutions.