AWS Database Blog

Partition existing tables using native commands in Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL

Customers use Amazon Relational Database Service (Amazon RDS) for PostgreSQL and Amazon Aurora PostgreSQL-Compatible Edition for hosting their online transaction processing (OLTP) database workloads. At the scale at which today's applications operate, databases can grow to hundreds of terabytes in a short span of time, and the majority of that storage is often consumed by a few highly active, large tables. These large tables become operationally unmanageable and cause performance problems. Eventually, it becomes imperative to partition them for better manageability and to reduce operational and performance overhead.

Partitioning a table helps in many ways, such as easier archiving of older data using the DETACH PARTITION command, improved query performance through partition pruning, and reduced VACUUM times, because you can run VACUUM selectively on the active partitions. However, partitioning an existing table requires a special approach when you want to minimize downtime. There are two main ways to partition a non-partitioned table with reduced downtime: use AWS Database Migration Service (AWS DMS), or use PostgreSQL native SQL commands.

In this post, I show you how to use PostgreSQL native SQL commands to convert an existing non-partitioned table to a partitioned one. This method applies to Amazon RDS for PostgreSQL, Aurora PostgreSQL, and self-managed PostgreSQL.

Overview

To partition an existing large table, we use the following high-level steps:

  1. Rename the existing table and its corresponding indexes, and create a new partitioned table with the original table’s name.
  2. Create a check constraint on the old table.
  3. Attach the old table as a partition of the newly created partitioned table.
  4. Drop the check constraint.
  5. Move data from the old table that has been attached as a partition to multiple smaller partitions in the new table.

Prerequisites

For this walkthrough, you need an Amazon RDS for PostgreSQL DB instance or an Amazon Aurora PostgreSQL cluster, and a SQL client such as psql that can connect to it.

Complete the following steps to prepare your environment. For this scenario, I create a non-partitioned sample table named car_data, and then use native commands to convert it to a partitioned table of the same name.

  1. Connect to the RDS for PostgreSQL DB instance or connect to the Aurora PostgreSQL cluster created in the prerequisite step.
  2. Create a non-partitioned table and indexes using the following commands:
CREATE TABLE car_data (email VARCHAR(100),
name VARCHAR(60),
car_make VARCHAR(40),
car_model VARCHAR(40),
date_of_purchase DATE,
car_model_year INT,
car_price DECIMAL(11,2),
city_of_purchase VARCHAR(30),
country_of_purchase VARCHAR(40),
car_color VARCHAR(20),
car_vin VARCHAR(40));

CREATE INDEX ix_email ON car_data(email);

CREATE INDEX ix_city_of_purchase ON car_data(city_of_purchase);
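The walkthrough assumes the table already holds data (the minimum and maximum purchase dates used later come from it). If you're following along in a test environment, you can load illustrative sample rows with generate_series; the values below are made up for demonstration, with purchase dates spread between 2000-06-25 and 2022-06-23 to match the minimum and maximum shown later:

INSERT INTO car_data (email, name, car_make, car_model, date_of_purchase, car_model_year, car_price, city_of_purchase, country_of_purchase, car_color, car_vin)
SELECT 'user' || n || '@example.com',
'User ' || n,
'Make' || (n % 20),
'Model' || (n % 50),
DATE '2000-06-25' + (n % 8034),
2000 + (n % 23),
ROUND((10000 + random() * 40000)::numeric, 2),
'City' || (n % 100),
'Country' || (n % 30),
'Color' || (n % 10),
MD5(n::text)
FROM generate_series(1, 1000000) AS n;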

Rename the existing table and indexes and create a new partitioned table

As the first step, you rename the existing table and its corresponding indexes, create a new partitioned table with the original name and a partition that can accept new data, and create the indexes as necessary. You also add an index on the partitioning column. You perform all of these actions in a single transaction because the steps depend on one another.

During this operation, the table isn't accessible for the duration of the transaction, because ALTER TABLE takes an ACCESS EXCLUSIVE lock; this causes a short period of downtime. See the following SQL statements:

BEGIN TRANSACTION;

ALTER TABLE car_data RENAME TO car_data_old;

ALTER INDEX ix_email RENAME TO ix_email_old;

ALTER INDEX ix_city_of_purchase RENAME TO ix_city_of_purchase_old;

CREATE TABLE car_data (email VARCHAR(100),
name VARCHAR(60),
car_make VARCHAR(40),
car_model VARCHAR(40),
date_of_purchase DATE,
car_model_year INT,
car_price DECIMAL(11,2),
city_of_purchase VARCHAR(30),
country_of_purchase VARCHAR(40),
car_color VARCHAR(20),
car_vin VARCHAR(40))
PARTITION BY RANGE (date_of_purchase);

CREATE INDEX ix_email ON car_data(email);

CREATE INDEX ix_city_of_purchase ON car_data(city_of_purchase);

CREATE INDEX ix_date_of_purchase ON car_data(date_of_purchase);

CREATE TABLE car_data_p202207 PARTITION OF car_data FOR VALUES FROM ('2022-07-01') TO ('2022-08-01');

COMMIT;
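
At this point, you can optionally confirm that car_data is partitioned and has the expected partition by querying the catalog (psql's \d+ car_data shows the same information):

SELECT inhrelid::regclass AS partition_name
FROM pg_inherits
WHERE inhparent = 'car_data'::regclass;
partition_name
-----------------
car_data_p202207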

Create a check constraint on the old table

Define a check constraint on the old table that matches the partition boundaries it will have when it's attached as a partition of the new table. The check constraint must be defined on the column that is used as the partition key of the new table. The purpose of the check constraint is to make sure that, while attaching the old table as a partition, PostgreSQL can skip the full table scan that is otherwise needed to validate the partition boundaries.

Because we're using date_of_purchase as the partition key, we determine its minimum and maximum values and define the check constraint accordingly. For this post, we get the following output:

SELECT MIN(date_of_purchase) FROM car_data_old;
min
------------
2000-06-25

SELECT MAX(date_of_purchase) FROM car_data_old;
max
------------
2022-06-23

Create the check constraint with the date ranges found in the output for date_of_purchase with the following code:

ALTER TABLE car_data_old ADD CONSTRAINT car_data_old_check_constraint CHECK (date_of_purchase >= DATE '2000-06-01' AND date_of_purchase < DATE '2022-07-01') NOT VALID;

ALTER TABLE car_data_old VALIDATE CONSTRAINT car_data_old_check_constraint;

The NOT VALID option makes the ADD CONSTRAINT statement return immediately, without scanning the table. The separate VALIDATE CONSTRAINT statement does scan the table, but it holds only a SHARE UPDATE EXCLUSIVE lock, so the table remains available for reads and writes while it runs. The validation step matters: PostgreSQL ignores constraints that are still marked NOT VALID when it checks whether an existing constraint already implies the partition boundaries.

Attach the old table as a partition in the new table

Attach the renamed table car_data_old as a partition of car_data. This operation takes a SHARE UPDATE EXCLUSIVE lock on the main table car_data and an ACCESS EXCLUSIVE lock on the table being attached (car_data_old), thereby making it unavailable for reads and writes. See the following code:

ALTER TABLE car_data ATTACH PARTITION car_data_old FOR VALUES FROM ('2000-06-01') TO ('2022-07-01');

Without a matching, validated check constraint, this step can take significantly longer, because PostgreSQL scans the entire car_data_old table to validate the partition boundaries while holding these locks.
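
To observe these locks for yourself, you can query pg_locks from a second session while the ATTACH PARTITION statement runs. This diagnostic query is not part of the migration itself:

SELECT c.relname, l.mode, l.granted
FROM pg_locks l
JOIN pg_class c ON c.oid = l.relation
WHERE c.relname IN ('car_data', 'car_data_old');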

Drop the check constraint

As you don’t need it, drop the check constraint that you created:

ALTER TABLE car_data_old DROP CONSTRAINT car_data_old_check_constraint;

Move data from the large partition to smaller ones

Because we attached the large table as a single partition of the new table, it should now be divided into smaller partitions. You can move the data from this large partition to smaller ones incrementally, at your own pace. Run this process when there is little or no business activity on the table, so that the data movement, additional I/O, and locks have minimal impact on active queries. Complete the process in steps, using transactions, as follows:

  1. Begin a transaction.
BEGIN TRANSACTION;
  2. Detach the old table, which was attached earlier as a single large partition of the new table.
ALTER TABLE car_data DETACH PARTITION car_data_old;
  3. Create a new partition for the data that will be moved out of the detached table. For example, create a partition for records with values from 2022-06-01 up to 2022-07-01.
CREATE TABLE car_data_p202206 PARTITION OF car_data FOR VALUES FROM ('2022-06-01') TO ('2022-07-01');
  4. Populate the partition created in the previous step from the old table using an INSERT INTO ... SELECT statement.
INSERT INTO car_data_p202206 SELECT * FROM car_data_old WHERE date_of_purchase >= '2022-06-01' AND date_of_purchase < '2022-07-01';
  5. Delete the records that have been moved from the old table to the new partition.
DELETE FROM car_data_old WHERE date_of_purchase >= '2022-06-01' AND date_of_purchase < '2022-07-01';
  6. Add a check constraint that matches the new, smaller partition boundaries of the old table.
ALTER TABLE car_data_old ADD CONSTRAINT car_data_old_check_constraint CHECK (date_of_purchase >= DATE '2000-06-01' AND date_of_purchase < DATE '2022-06-01') NOT VALID;
  7. Validate the constraint so that the attach in the next step can skip its own table scan.
ALTER TABLE car_data_old VALIDATE CONSTRAINT car_data_old_check_constraint;
  8. Attach the old table, with the remaining records, as a partition of the new table.
ALTER TABLE car_data ATTACH PARTITION car_data_old FOR VALUES FROM ('2000-06-01') TO ('2022-06-01');
  9. Drop the constraint, because it's no longer needed.
ALTER TABLE car_data_old DROP CONSTRAINT car_data_old_check_constraint;
  10. Commit the transaction.
COMMIT;

After detaching the partition car_data_old, the table car_data contains only the data in the newer partition (2022-07-01 to 2022-08-01) that was defined earlier. Notice that the partition boundaries change when the old table is attached again, because one month of data has been moved to the new partition car_data_p202206. You must also define the check constraint anew on each iteration, so that table scans are avoided as the new partition boundaries are set. Repeat these steps until the data in the old table has been fully divided into smaller partitions.
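
Each iteration of these steps differs only in its month boundaries, so you can sketch one iteration as an anonymous PL/pgSQL block that builds the statements with format(). The boundary dates below are illustrative; adjust month_lo and lower_bound for each run:

DO $$
DECLARE
    month_lo    date := DATE '2022-05-01';  -- month to split out (illustrative)
    month_hi    date := (month_lo + INTERVAL '1 month')::date;
    lower_bound date := DATE '2000-06-01';  -- remaining lower bound of car_data_old
    part        text := 'car_data_p' || to_char(month_lo, 'YYYYMM');
BEGIN
    EXECUTE 'ALTER TABLE car_data DETACH PARTITION car_data_old';
    EXECUTE format('CREATE TABLE %I PARTITION OF car_data FOR VALUES FROM (%L) TO (%L)', part, month_lo, month_hi);
    EXECUTE format('INSERT INTO %I SELECT * FROM car_data_old WHERE date_of_purchase >= %L AND date_of_purchase < %L', part, month_lo, month_hi);
    EXECUTE format('DELETE FROM car_data_old WHERE date_of_purchase >= %L AND date_of_purchase < %L', month_lo, month_hi);
    EXECUTE format('ALTER TABLE car_data_old ADD CONSTRAINT car_data_old_check_constraint CHECK (date_of_purchase >= DATE %L AND date_of_purchase < DATE %L) NOT VALID', lower_bound, month_lo);
    -- Validate so that the ATTACH below can skip its own table scan
    EXECUTE 'ALTER TABLE car_data_old VALIDATE CONSTRAINT car_data_old_check_constraint';
    EXECUTE format('ALTER TABLE car_data ATTACH PARTITION car_data_old FOR VALUES FROM (%L) TO (%L)', lower_bound, month_lo);
    EXECUTE 'ALTER TABLE car_data_old DROP CONSTRAINT car_data_old_check_constraint';
END $$;

A DO block runs as a single transaction, so this is equivalent to the explicit BEGIN ... COMMIT sequence shown earlier.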

Clean up

After you complete all of the steps in this solution, delete the RDS for PostgreSQL DB instance or the Aurora DB cluster to avoid ongoing charges.

Conclusion

In this post, I showed you how to use PostgreSQL native commands to migrate an existing non-partitioned table to a partitioned one with minimal downtime. The table is briefly unavailable because of the locks taken by the ALTER commands during the renaming and partitioning process. Defining a check constraint reduces this downtime substantially, because it lets PostgreSQL skip the costly table scan that is otherwise needed while attaching a table as a partition. Make sure that you run the commands in transactions as described earlier; otherwise, you won't achieve the expected results.

Finally, I encourage you to test the process yourself and clean up the resources when you're done. If you have questions or suggestions, leave them in the comments section.


About the Author

Soumyadeep Dey is a Database Specialist Solutions Architect with AWS. He works with AWS customers to design scalable, performant, and robust database architectures on the cloud using both SQL and NoSQL database offerings.