Partition existing tables using native commands in Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL
Customers use Amazon Relational Database Service (Amazon RDS) for PostgreSQL and Amazon Aurora PostgreSQL-Compatible Edition for hosting their Online Transaction Processing (OLTP) database workloads. Considering the scale at which today’s applications operate, databases can grow to hundreds of terabytes in a very short span of time. Databases grow in size because the majority share of database storage is taken by a few highly-active large tables. These large tables become operationally unmanageable and cause performance problems. Eventually, it becomes imperative to partition those tables for better manageability and to reduce operational and performance overhead.
Partitioning a table can help in many ways, such as the easier archiving of older data using DDL DETACH commands, improving query performance using partition pruning, reducing VACUUM time by running them on selective active partitions, and more. However, partitioning an existing table needs special approaches when you want to reduce downtime. There are a few ways in which you can partition a non-partitioned table with lower downtime: AWS Database Migration Service (AWS DMS) to partition existing tables, or PostgreSQL native SQL commands.
In this post, I show you how to use PostgreSQL native SQL commands to convert an existing non-partitioned table to a partitioned one. This method applies to Amazon RDS for PostgreSQL, Aurora PostgreSQL, and self-managed PostgreSQL.
To partition an existing large table, we use the following high-level steps:
- Create and validate check constraint on the existing table.
- Rename the existing table and its corresponding indexes, and create a new partitioned table with the original table’s name.
- Attach the old table as a partition of the newly created partitioned table.
- Drop the check constraint.
- Move data from the old table that has been attached as a partition to multiple smaller partitions in the new table.
For this walkthrough, you should have the following prerequisites:
- An AWS account.
- An Amazon RDS for PostgreSQL (version 13.8) database instance or Aurora PostgreSQL (version 13.8) cluster –Create an RDS for PostgreSQL DB instance or create an Aurora PostgreSQL cluster.
- An Amazon Simple Storage Service (Amazon S3) Bucket.
- Knowledge of PostgreSQL partitioning and corresponding SQL commands and PostgreSQL locks.
Complete the following steps to prepare your environment. For this scenario, I create a sample table
car_data that is non-partitioned. Now let’s use native commands to convert this table to a partitioned table of the same name.
- Connect to the RDS for PostgreSQL DB instance or connect to the Aurora PostgreSQL cluster created in the prerequisite step.
- Create a non-partitioned table and indexes using the following commands:
Create a check constraint on the existing table
Define a check constraint that matches the partition boundaries as that of the old table when it’s attached as a partition to the new table. The check constraint must be defined on the column that has been used as the partition key of the new table. The purpose of the check constraint is to make sure that while attaching the old table as a partition, a table scan is avoided, which is needed to validate the partition boundaries.
Because we’re using
date_of_purchase as the partition key, we determine the minimum and maximum value of it to define the check constraint accordingly. For this post, we get the following output:
Create the check constraint with the date ranges found in the output for
date_of_purchase with the following code:
We use the NOT VALID option to avoid an expensive table scan for constraint validation. Because we already have the minimum and maximum values for the partition key, we don’t need any additional enforcers for the check constraint.
Validate check constraint on the existing table
After creating the constraint with
NOT VALID operator, run
VALIDATE CONSTRAINT statement for the constraint to be accepted when attaching the table as a partition.
This command acquires a
SHARE UPDATE EXCLUSIVE lock (you can find more details in the PostgreSQL documentation – VALIDATE CONSTRAINT and SHARE UPDATE EXCLUSIVE lock), which allows you to run your regular operations.
You can still run your regular workload with UPDATE, INSERT, DELETE and SELECT statements when
VALIDATE CONSTRAINT command is running as shown in the following example.
Rename the existing table and indexes and create a new partitioned table
As the first step, you rename the existing table and its corresponding indexes, create a new partitioned table with the same name with a partition that can accept new data, and create the indexes as necessary. Furthermore, you add an index on the partitioned column. And you perform all of these actions in a single transaction because there are interdependencies among the steps.
During this operation, the table isn’t accessible for the entire transaction time because ALTER TABLE takes ACCESS EXCLUSIVE LOCK, which causes some downtime. See the following SQL statement:
Attach the old table as a partition in the new table
Attach the renamed table
car_data_old as a partition of
car_data. This operation takes a SHARE UPDATE EXCLUSIVE lock on the main table
car_data and an ACCESS EXCLUSIVE lock on the table being attached (
car_data_old), thereby making it unavailable for reads and writes. See the following code:
Without the check constraint, this step can take significantly longer, as it would scan the entire
car_data_old table to validate partition boundaries.
Drop the check constraint
As you don’t need it, drop the check constraint that you created:
Move data from the large partition to smaller ones
Because we attached a large table as a single partition in the new table, it should be divided into smaller partitions. You can move the data from this large partition to smaller ones incrementally at your own pace. This should be run when there is no or less business activity on the table, so that the impact of data movement, high I/O, and locks is low on the active queries. You must complete this process in steps using transactions, as mentioned in the following steps along with its corresponding SQL code:
- Begin transaction.
- Detach the old table partition that was attached in the step attach the old table as a partition in the new table.
- Create a new partition for the data that will be moved from the old table detached above. For example, create a partition for records with values between 2022-06-01 to 2022-07-01.
- Populate the partition, created above in Step 3, from the old table using INSERT INTO SELECT statement.
- Delete the records that have been moved from old table to the new partition.
- Add check constraint to avoid table scan while attaching the partition again.
- Validate the CHECK constraint created in the previous step.
- Attach the old table with remaining records as a partition in the new table.
- Drop the constraint as it’s no longer needed.
- Complete the transaction.
After detaching the partition
car_data_old, the table
car_data would only contain data from the newer partitions (from 2022-07-01 to 2022-08-01) that was defined earlier. Notice that the partition boundaries have changed while attaching the old table again. This is because some data has been moved to the new partition
car_data_p202206. Moreover, you must define the check constraint and validate it each time you run this step because it avoids table scans each time new partition boundaries are defined. Continue this step until the entire partition with the old table has been divided into smaller partitions.
In this post, I showed you how to use PostgreSQL native commands to migrate an existing non-partitioned table to a partitioned one with minimal downtime. Tables can be unavailable because of locks that are taken while running ALTER commands during the renaming and partitioning processes. When we define a check constraint, the downtime during partitioning is reduced substantially by skipping the costly table scan that is otherwise needed while attaching a table as a partition. Make sure that you use the commands run in transactions as mentioned earlier, otherwise you won’t achieve the expected results.
Finally, I encourage you to test the process by yourself and clean up the resources once done. If you have questions or suggestions, then leave them in the comments section.
About the Author
Soumyadeep Dey is a Database Specialist Solutions Architect with AWS. He works with AWS customers to design scalable, performant, and robust database architectures on the cloud using both SQL and NoSQL database offerings.