Partition existing tables using native commands in Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL

Customers use Amazon Relational Database Service (Amazon RDS) for PostgreSQL and Amazon Aurora PostgreSQL-Compatible Edition for hosting their Online Transaction Processing (OLTP) database workloads. Considering the scale at which today’s applications operate, databases can grow to hundreds of terabytes in a very short span of time. Databases grow in size because the majority share of database storage is taken by a few highly-active large tables. These large tables become operationally unmanageable and cause performance problems. Eventually, it becomes imperative to partition those tables for better manageability and to reduce operational and performance overhead.

Partitioning a table can help in many ways, such as the easier archiving of older data using DDL DETACH commands, improving query performance using partition pruning, reducing VACUUM time by running them on selective active partitions, and more. However, partitioning an existing table needs special approaches when you want to reduce downtime. There are a few ways in which you can partition a non-partitioned table with lower downtime: AWS Database Migration Service (AWS DMS) to partition existing tables, or PostgreSQL native SQL commands.

In this post, I show you how to use PostgreSQL native SQL commands to convert an existing non-partitioned table to a partitioned one. This method applies to Amazon RDS for PostgreSQL, Aurora PostgreSQL, and self-managed PostgreSQL.

Overview

To partition an existing large table, we use the following high-level steps:

Create and validate check constraint on the existing table.
Rename the existing table and its corresponding indexes, and create a new partitioned table with the original table’s name.
Attach the old table as a partition of the newly created partitioned table.
Drop the check constraint.
Move data from the old table that has been attached as a partition to multiple smaller partitions in the new table.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account.
An Amazon RDS for PostgreSQL (version 13.8) database instance or Aurora PostgreSQL (version 13.8) cluster –Create an RDS for PostgreSQL DB instance or create an Aurora PostgreSQL cluster.
An Amazon Simple Storage Service (Amazon S3) Bucket.
Knowledge of PostgreSQL partitioning and corresponding SQL commands and PostgreSQL locks.

Complete the following steps to prepare your environment. For this scenario, I create a sample table car_data that is non-partitioned. Now let’s use native commands to convert this table to a partitioned table of the same name.

Connect to the RDS for PostgreSQL DB instance or connect to the Aurora PostgreSQL cluster created in the prerequisite step.
Create a non-partitioned table and indexes using the following commands:

CREATE TABLE car_data (email VARCHAR(100),
name VARCHAR(60),
car_make VARCHAR(40),
car_model VARCHAR(40),
date_of_purchase DATE,
car_model_year INT,
car_price DECIMAL(11,2),
city_of_purchase VARCHAR(30),
country_of_purchase VARCHAR(40),
car_color VARCHAR(20),
car_vin VARCHAR(40));

CREATE INDEX ix_email ON car_data(email);
CREATE INDEX ix_city_of_purchase ON car_data(city_of_purchase);

Create a check constraint on the existing table

Define a check constraint that matches the partition boundaries as that of the old table when it’s attached as a partition to the new table. The check constraint must be defined on the column that has been used as the partition key of the new table. The purpose of the check constraint is to make sure that while attaching the old table as a partition, a table scan is avoided, which is needed to validate the partition boundaries.

Because we’re using date_of_purchase as the partition key, we determine the minimum and maximum value of it to define the check constraint accordingly. For this post, we get the following output:

SELECT MIN(date_of_purchase) FROM car_data;
min
------------
2000-06-25

SELECT MAX(date_of_purchase) FROM car_data;
max
------------
2022-06-23

Create the check constraint with the date ranges found in the output for date_of_purchase with the following code:

ALTER TABLE car_data ADD CONSTRAINT car_data_check_constraint CHECK (date_of_purchase >= DATE ‘2000-06-01’ AND date_of_purchase < DATE ‘2022-07-01’) NOT VALID;

We use the NOT VALID option to avoid an expensive table scan for constraint validation. Because we already have the minimum and maximum values for the partition key, we don’t need any additional enforcers for the check constraint.

Validate check constraint on the existing table

After creating the constraint with NOT VALID operator, run VALIDATE CONSTRAINT statement for the constraint to be accepted when attaching the table as a partition.

ALTER TABLE car_data VALIDATE CONSTRAINT car_data_check_constraint;

This command acquires a SHARE UPDATE EXCLUSIVE lock (you can find more details in the PostgreSQL documentation – VALIDATE CONSTRAINT and SHARE UPDATE EXCLUSIVE lock), which allows you to run your regular operations.

You can still run your regular workload with UPDATE, INSERT, DELETE and SELECT statements when VALIDATE CONSTRAINT command is running as shown in the following example.

mytestdb=> select a.locktype, a.relation,a.mode, b.relname, b.oid from pg_locks a, pg_class b where  a.relation=b.oid and b.oid=24657;
 locktype | relation |           mode           |   relname    |  oid  
----------+----------+--------------------------+--------------+-------
 relation |    24657 | ShareUpdateExclusiveLock | car_data     | 24657
(1 row)

mytestdb=> select sr_no, model, make, model_year from car_data limit 5;
 sr_no | model  |   make   | model_year 
-------+--------+----------+------------
     2 | BMW    | M5       |       2001
     3 | Toyota | Prius    |       2007
     4 | Ford   | F350     |       2002
     5 | Buick  | LaCrosse |       2005
     6 | Subaru | Outback  |       2008

mytestdb=> update car_data set model_year=2004 where sr_no=2 and model='BMW' and make='M5';
UPDATE 1

mytestdb=> insert into car_data (sr_no, model, make, model_year) values (2004,'Safari','Tata',2004);
INSERT 0 1

mytestdb=> select sr_no, model, make, model_year from car_data where sr_no=2 and model='BMW';
 sr_no | model | make | model_year 
-------+-------+------+------------
     2 | BMW   | M5   |       2004

mytestdb=> select sr_no, model, make, model_year from car_data where sr_no=2004 and model='Safari';
 sr_no | model  | make | model_year 
-------+--------+------+------------
  2004 | Safari | Tata |       2004

mytestdb=> delete from car_data where sr_no=2004 and model='Safari';
DELETE 1

mytestdb=> select sr_no, model, make, model_year from car_data where sr_no=2004 and model='Safari';
 sr_no | model | make | model_year 
-------+-------+------+------------
(0 rows)

Rename the existing table and indexes and create a new partitioned table

As the first step, you rename the existing table and its corresponding indexes, create a new partitioned table with the same name with a partition that can accept new data, and create the indexes as necessary. Furthermore, you add an index on the partitioned column. And you perform all of these actions in a single transaction because there are interdependencies among the steps.

During this operation, the table isn’t accessible for the entire transaction time because ALTER TABLE takes ACCESS EXCLUSIVE LOCK, which causes some downtime. See the following SQL statement:

BEGIN TRANSACTION;

ALTER TABLE car_data RENAME TO car_data_old;

ALTER TABLE ix_email RENAME TO ix_email_old;

ALTER TABLE ix_city_of_purchase RENAME TO ix_city_of_purchase_old;

CREATE TABLE car_data (email VARCHAR(100),
name VARCHAR(60),
car_make VARCHAR(40),
car_model VARCHAR(40),
date_of_purchase DATE,
car_model_year INT,
car_price DECIMAL(11,2),
city_of_purchase VARCHAR(30),
country_of_purchase VARCHAR(40),
car_color VARCHAR(20),
car_vin VARCHAR(40))
PARTITION BY RANGE (date_of_purchase);

CREATE INDEX ix_email ON car_data(email);

CREATE INDEX ix_city_of_purchase ON car_data(city_of_purchase);

CREATE INDEX ix_date_of_purchase ON car_data(date_of_purchase);

CREATE TABLE car_data_p202207 PARTITION OF car_data FOR VALUES FROM (‘2022-07-01’) TO (‘2022-08-01’);

COMMIT;

Attach the old table as a partition in the new table

Attach the renamed table car_data_old as a partition of car_data. This operation takes a SHARE UPDATE EXCLUSIVE lock on the main table car_data and an ACCESS EXCLUSIVE lock on the table being attached (car_data_old), thereby making it unavailable for reads and writes. See the following code:

ALTER TABLE car_data ATTACH PARTITION car_data_old FOR VALUES FROM (‘2000-06-01’) TO (‘2022-07-01’);

Without the check constraint, this step can take significantly longer, as it would scan the entire car_data_old table to validate partition boundaries.

Drop the check constraint

As you don’t need it, drop the check constraint that you created:

ALTER TABLE car_data_old DROP CONSTRAINT car_data_check_constraint;

Move data from the large partition to smaller ones

Because we attached a large table as a single partition in the new table, it should be divided into smaller partitions. You can move the data from this large partition to smaller ones incrementally at your own pace. This should be run when there is no or less business activity on the table, so that the impact of data movement, high I/O, and locks is low on the active queries. You must complete this process in steps using transactions, as mentioned in the following steps along with its corresponding SQL code:

Begin transaction.
```
BEGIN TRANSACTION;
```
Detach the old table partition that was attached in the step attach the old table as a partition in the new table.
```
ALTER TABLE car_data DETACH PARTITION car_data_old;
```
Create a new partition for the data that will be moved from the old table detached above. For example, create a partition for records with values between 2022-06-01 to 2022-07-01.
```
CREATE TABLE car_data_p202206 PARTITION OF car_data FOR VALUES FROM (‘2022-06-01’) TO (‘2022-07-01’);
```

Populate the partition, created above in Step 3, from the old table using INSERT INTO SELECT statement.

INSERT INTO car_data_p202206 SELECT * FROM car_data_old WHERE date_of_purchase >= ‘2022-06-01’ AND date_of_purchase < ‘2022-07-01’;

Delete the records that have been moved from old table to the new partition.

DELETE FROM car_data_old WHERE date_of_purchase >= ‘2022-06-01’ AND date_of_purchase < ‘2022-07-01’;

Add check constraint to avoid table scan while attaching the partition again.

ALTER TABLE car_data_old ADD CONSTRAINT car_data_check_constraint CHECK (date_of_purchase >= DATE ‘2000-06-01’ AND date_of_purchase < DATE ‘2022-06-01’) NOT VALID;

Validate the CHECK constraint created in the previous step.

ALTER TABLE car_data_old VALIDATE CONSTRAINT car_data_check_constraint;

Attach the old table with remaining records as a partition in the new table.

ALTER TABLE car_data ATTACH PARTITION car_data_old FOR VALUES FROM (‘2000-06-01’) TO (‘2022-06-01’);

Drop the constraint as it’s no longer needed.

ALTER TABLE car_data_old DROP CONSTRAINT car_data_check_constraint;

Complete the transaction.
```
COMMIT;
```

After detaching the partition car_data_old, the table car_data would only contain data from the newer partitions (from 2022-07-01 to 2022-08-01) that was defined earlier. Notice that the partition boundaries have changed while attaching the old table again. This is because some data has been moved to the new partition car_data_p202206. Moreover, you must define the check constraint and validate it each time you run this step because it avoids table scans each time new partition boundaries are defined. Continue this step until the entire partition with the old table has been divided into smaller partitions.

Clean up

After you complete all of the steps to implement this solution, delete the RDS DB instance or delete the Aurora DB cluster.

Conclusion

In this post, I showed you how to use PostgreSQL native commands to migrate an existing non-partitioned table to a partitioned one with minimal downtime. Tables can be unavailable because of locks that are taken while running ALTER commands during the renaming and partitioning processes. When we define a check constraint, the downtime during partitioning is reduced substantially by skipping the costly table scan that is otherwise needed while attaching a table as a partition. Make sure that you use the commands run in transactions as mentioned earlier, otherwise you won’t achieve the expected results.

Finally, I encourage you to test the process by yourself and clean up the resources once done. If you have questions or suggestions, then leave them in the comments section.

About the Author

Soumyadeep Dey is a Database Specialist Solutions Architect with AWS. He works with AWS customers to design scalable, performant, and robust database architectures on the cloud using both SQL and NoSQL database offerings.