Remove bloat from Amazon Aurora and RDS for PostgreSQL with pg_repack

Do you have a database where the size of relations on disk is larger than you expect? Did you observe this size increasing every time you ran several UPDATE and DELETE operations on the relations? Did you observe adverse effects on performance of the database as a result of this? This might be a result of dead tuples in relations, which results in database bloat. In PostgreSQL databases, bloat is the extra space allocated to the table or index to maintain old version of rows that are no longer necessary.

AWS has helped you run PostgreSQL-based databases on Amazon Relational Database Service (Amazon RDS) for PostgreSQL since November 2013 and Amazon Aurora PostgreSQL-Compatible Edition since October 2017. Common questions we hear from customers managing PostgreSQL databases is “Why is my database so slow? Why are my queries taking so much CPU?” One of the reasons we have seen for such occurrences is heavy bloat in the database. In this post, we discuss bloat in relations or tables while also going through the adverse effects heavy bloat may cause to PostgreSQL databases and how you can remove bloat.

While unintended bloat is bad and that is what we mean by “heavy bloat” in this post. In PostgreSQL based databases, we sometimes intentionally add bloat using fill_factor to increase performance. For UPDATE and INSERT/DELETE heavy tables a small amount of bloat allows the table to have a natural balance and helps avoid need to extend and then contract the files every time.

Multiversion concurrency control

Before we begin, it’s important to understand multiversion concurrency control (MVCC). MVCC allows PostgreSQL to offer high concurrency even during significant database read and write activity. To implement this in PostgreSQL, an UPDATE or DELETE operation of a row (or tuple) doesn’t immediately remove the old version of the row. Instead, the old tuple is marked for deletion. Old tuples are also known as dead tuples. Therefore, the database can grow in size quickly, if your application performs a lot of UPDATE and DELETE operations, which may result in a significant number of dead tuples increasing the size of the table on the disk. In database terminology, these dead tuples in a table are what results in relation or table bloat.

Detect bloat

Before looking into how to remove bloat, let’s first discuss how you can identify bloated tables in a PostgreSQL database. The view pg_stat_user_tables in PostgreSQL provides statistic information about accesses (number of rows inserted, deleted, and updated, and estimated number of dead tuples) to a particular table. You can use the following SQL query to check the number of dead tuples and when the last autovacuum or vacuum ran on the tables:

SELECT relname AS TableName, n_live_tup AS LiveTuples, n_dead_tup AS DeadTuples, n_tup_del, n_tup_upd, last_autovacuum AS Autovacuum, last_vacuum AS ManualVacuum, now() FROM pg_stat_user_tables;

-[ RECORD 1 ]+------------------------------
tablename | dashboard
livetuples | 0
deadtuples | 98399974
n_tup_del | 66314578
n_tup_upd | 98400002
autovacuum | 2021-05-04 15:42:12.882048+00
manualvacuum |
now | 2021-05-04 15:53:32.012384+00

The view pg_stat_user_tables provides information about user tables only and needs to be run independently for all the databases. In our example case, we identified 98,399,974 dead tuples.

Remove bloat

The simplest way of reusing the storage space occupied by dead tuples is the PostgreSQL VACUUM operation. Autovacuum is a PostgreSQL auxiliary process that automates running VACUUM and ANALYZE commands. For more information about autovacuum in RDS for PostgreSQL environments, refer to Understanding autovacuum in Amazon RDS for PostgreSQL environments.

You can run two variants of VACUUM to get rid of dead tuples in PostgreSQL: standard VACUUM and VACUUM FULL.

The standard form of VACUUM removes dead row versions in tables and indexes and marks this space for reuse. However, it doesn’t free up the storage space allocated to the table. This shouldn’t pose an issue in normal day-to-day operations because this space can be reused by any subsequent inserts on the table.

VACUUM can run in parallel with production database operations without taking an exclusive lock on a table. Commands such as SELECT, INSERT, UPDATE, and DELETE continue to run normally. Vacuum can consume a substantial amount of I/O and CPU resources thus, it is helpful to run it in non peak hours. Further, when running manually you can choose to run Vacuum on a particular table or you can run it for the whole database:

VACUUM <Table Name> – Performs the VACUUM operation on a particular table
VACUUM – Runs VACUUM on all the tables within a database.

For example, see the following code:

testdb=> vacuum dashboard;
VACUUM

testdb=> vacuum;
VACUUM

VACUUM FULL can reclaim the space to operating system; however, before you run VACUUM FULL, it’s important to consider the following:

It requires an access exclusive lock on the table it’s working on, and therefore can obstruct other operations on the table (including SELECT).
It actively compacts the tables by creating a copy of the table with no dead tuples. This can not only take a long time but also double the storage consumed by the table and indexes, until the operation is complete.

You can run VACUUM FULL for a particular table or whole database:

VACUUM FULL <Table Name> – Performs the VACUUM FULL operation on a particular table.
VACUUM FULL – Runs VACUUM FULL on all the tables within a database. It is not recommend for a user to run Vacuum Full without providing table name, as it can compact system catalogs.

For example, see the following code:

testdb=> VACUUM FULL dashboard;
VACUUM

In production databases, you don’t want to use the VACUUM FULL operation because it blocks other activities in the database. Another alternative when you want the storage reclaimed rather than just be available for reuse is using the pg_repack extension.

pg_repack is an open-source PostgreSQL extension available with Amazon RDS (version 9.6.20+) and Aurora PostgreSQL that cleans up dead tuples and, like the VACUUM FULL operation, reclaims storage. Unlike VACUUM FULL, the pg_repack extension doesn’t require an exclusive lock for complete duration. pg_repack only hold an ACCESS EXCLUSIVE lock for a short period during initial creation of log table and during the final swap-and-drop phase. For the rest of the time, pg_repack only needs to hold an ACCESS SHARE lock on the original table, meaning INSERT, UPDATE, and DELETE operations may proceed as usual.

It lets you remove bloat from tables as well as from indexes. You can choose to run pg_repack in a time of less load on the database.

All forms of VACUUM as well as the pg_repack extension are I/O intensive operations. Therefore, discretion is advised when running the same.

Now let’s see how you can use pg_repack with Amazon RDS for PostgreSQL or Aurora PostgreSQL.

Prerequisites

Before getting started, complete the following prerequisites:

Connect to your Amazon RDS PostgreSQL or Aurora PostgreSQL compatible instance. If you’re using psql client, the connection string would look like below. For more information on how you can connect to your DB instance, read Connecting to a DB instance running the PostgreSQL database engine.

psql -h <DB Endpoint> -d <Database Name> -U <Username>

Create the extension pg_repack on the PostgreSQL database instance:

testdb=> create extension pg_repack;
CREATE EXTENSION

Determine the version of pg_repack installed in the server using the following query. You can also use \dx shorthand when using psql client.

testdb=> \dx pg_repack

********* QUERY **********
SELECT e.extname AS "Name", e.extversion AS "Version", n.nspname AS "Schema", c.description AS "Description"
FROM pg_catalog.pg_extension e LEFT JOIN pg_catalog.pg_namespace n ON n.oid = e.extnamespace LEFT JOIN pg_catalog.pg_description c ON c.objoid = e.oid AND c.classoid = 'pg_catalog.pg_extension'::pg_catalog.regclass
WHERE e.extname ~ '^(pg_repack)$'
ORDER BY 1;
**************************

List of installed extensions
Name | Version | Schema | Description 
-----------+---------+--------+--------------------------------------------------------------
pg_repack | 1.4.6 | public | Reorganize tables in PostgreSQL databases with minimal locks
(1 row)

Different versions of pg_repack might be on the Amazon RDS or Aurora PostgreSQL instance based on the database version. To determine the version of pg_repack supported with your instance’s database version, see PostgreSQL on Amazon RDS.

You need to have a client machine, for example an EC2 instance with pg_repack installed in it and connectivity to your RDS or Aurora instance. To install pg_repack you can either use source installation or PostgreSQL Yum Repository. Make sure to download and install the same version of the pg_repack extension in your client machine as in your database instance.
After the installation, use the following command on the client machine to run the extension:

pg_repack -h <RDS/Aurora Endpoint> -U <username> -t <tablename> -k <databasename>

For more options, refer to Usage.

Make sure that you have available storage before running pg_repack. You should have about twice the size of the target tables and indexes.

Run pg_repack

You can run pg_repack for a bloated table named dashboard in a public schema within Amazon RDS with endpoint test-repack-instance.xxxxxxx.rds.amazonaws.com, username repackuser, and database testdb using the following command:

[ec2-user@ pg_repack-1.4.6]$ pg_repack -h test-repack-instance.xxxxxxxrds.amazonaws.com -U repackuser -t dashboard -k testdb
Password:
INFO: repacking table "public.dashboard"

Now, let’s see the high-level steps being performed in the backend on the DB instance once the above command is run at the client machine.

Create a log table that captures any changes made to the original table:

DEBUG: create_pktype : CREATE TYPE repack.pk_16405 AS (id integer)
DEBUG: create_log : CREATE TABLE repack.log_16405 (id bigserial PRIMARY KEY, pk repack.pk_16405, row public.dashboard)

Create triggers on the original table to capture the delta and insert it into the log table:

DEBUG: create_trigger : CREATE TRIGGER repack_trigger AFTER INSERT OR DELETE OR UPDATE ON public.dashboard FOR EACH ROW EXECUTE PROCEDURE repack.repack_trigger('INSERT INTO repack.log_16405(pk, row) VALUES( CASE WHEN $1 IS NULL THEN NULL ELSE (ROW($1.id)::repack.pk_16405) END, $2)')
DEBUG: enable_trigger : ALTER TABLE public.dashboard ENABLE ALWAYS TRIGGER repack_trigger

Create a new table and copy data from the original data into the new table:

DEBUG: create_table : CREATE TABLE repack.table_16405 WITH (oids = false) TABLESPACE pg_default AS SELECT id FROM ONLY public.dashboard WITH NO DATA
DEBUG: copy_data : INSERT INTO repack.table_16405 SELECT id FROM ONLY public.dashboard

Create indexes on the new table that were present on the old table after all the data has been loaded:

DEBUG: index[0].target_oid : 16450
DEBUG: index[0].create_index : CREATE UNIQUE INDEX index_16450 ON repack.table_16405 USING btree (id) TABLESPACE pg_default

Replay the delta data from the log table into the new table:

DEBUG: ---- copy tuples ----

Swap the tables, including indexes and toast tables, using the system catalogs:

DEBUG: ---- swap ----

Drop the original table:

DEBUG: ---- drop ----
DEBUG: ---- analyze ----
DEBUG: No workers to disconnect.

Common issues faced using pg_repack

In this section, we discuss some common issues you may encounter when using the pg_repack extension with Amazon RDS for PostgreSQL or Aurora PostgreSQL.

Superuser checks

By default, pg_repack performs superuser checks. A true superuser isn’t available in Amazon RDS for PostgreSQL nor Aurora PostgreSQL environments. Because these are managed services, we have a primary user that you define while creating the DB instance, which is assigned the rds_superuser role. This is a predefined Amazon RDS role similar to the standard PostgreSQL superuser role, but with some restrictions. With these restrictions, the rds_superuser role isn’t considered a true superuser role. For more information, see Creating roles. As a result, when pg_repack performs a superuser check when using Amazon RDS or Amazon Aurora PostgreSQL, it fails with the following error message:

[ec2-user@pg_repack-1.4.6]$ pg_repack -h test-repack-instance.xxxxxxx.rds.amazonaws.com -U repackuser -t dashboard testdb

ERROR: pg_repack failed with error: You must be a superuser to use pg_repack

To resolve the error, you need to use the -k or --no-superuser-check option in the connection string. This option skips superuser checks when you run pg_repack.

Missing the pg_repack extension on the database instance

As part of the prerequisites, you must create the extension after logging in to your Amazon RDS for PostgreSQL or Aurora PostgreSQL instance. If you don’t create the extension, you encounter the following error:

[ec2-user@ pg_repack-1.4.6]$ pg_repack -h test-repack-instance.xxxxxxx.rds.amazonaws.com -U repackuser -t dashboard -k testdb

ERROR: pg_repack failed with error: pg_repack 1.4.6 is not installed in the database

To resolve this, you can log in to your instance and run the following command:

testdb=> create extension pg_repack;
CREATE EXTENSION

Version mismatch

The pg_repack extension version must be the same for the server and client side. In case of a disparity, you see the following error message:

[ec2-user@ pg_repack-1.4.6]$ pg_repack -h test-repack-instance.xxxxxxx.rds.amazonaws.com -U repackuser -t dashboard -k testdb

ERROR: pg_repack failed with error: program 'pg_repack 1.4.6' does not match database library 'pg_repack 1.4.0

Depending on which version of PostgreSQL you’re using, you may see a different version of pg_repack for different Amazon RDS for PostgreSQL or Aurora PostgreSQL instances. For example, PostgreSQL version 10.17 uses pg_repack version 1.4.3, whereas PostgreSQL version 13.3 uses pg_repack version 1.4.6.

You can check your pg_repack version with the following query:

testdb=> SELECT e.extname AS "Name", e.extversion AS "Version", n.nspname AS "Schema", c.description AS "Description"
FROM pg_catalog.pg_extension e LEFT JOIN pg_catalog.pg_namespace n ON n.oid = e.extnamespace LEFT JOIN pg_catalog.pg_description c ON c.objoid = e.oid AND c.classoid = 'pg_catalog.pg_extension'::pg_catalog.regclass
ORDER BY 1;

Name  | Version | Schema |             Description              
-----------+---------+--------+--------------------------------------------------------------
 pg_repack | 1.4.6  | public | Reorganize tables in PostgreSQL databases with minimal locks

To resolve this, you must reinstall the correct version of pg_repack in the client machine, which is the same as the pg_repack version installed in your instance.

Missing primary key or not-null unique constraint

pg_repack requires tables to have either a primary key or a not-null unique key defined. If you try to run pg_repack on a table that doesn’t have a primary key nor a non-null unique key, you get the following error:

[ec2-user@ pg_repack-1.4.6]$ pg_repack -h test-repack-instance.xxxxxxx.rds.amazonaws.com -U repackuser -t dashboard -k testdb

WARNING: relation "public.dashboard" must have a primary key or not-null unique keys

To resolve this, make sure your table has a primary key or a not-null unique key.

Summary

Bloat can have a significant impact on database performance. It’s important to keep it at a minimum in your PostgreSQL databases. In this post, we showed you how to identify bloat in a table and how to remove bloat from tables in Amazon RDS for PostgreSQL or Aurora PostgreSQL using manual operations like VACUUM, VACUUM FULL, and the pg_repack extension.

To learn more about the autovacuum auxiliary process, which can help you automate maintenance operations like VACUUM and ANALYZE, see Understanding autovacuum in Amazon RDS for PostgreSQL environments and A Case Study of Tuning Autovacuum in Amazon RDS for PostgreSQL.

Finally, we encourage you to keep bloat at a minimum in your PostgreSQL databases to ensure optimum performance. If you have questions or suggestions, please leave them in the comments section.

About the Authors

Ayushi Gupta is a Technical Account Manager at AWS, based out of Delhi, India. She takes initiative in helping customers on complex issues and finding pertinent solutions, strategizing optimal cloud deployments based on each customer’s unique business requirements. Her area of interest is database and migration technologies. She enjoys spending time with her family and is a nature admirer.

Vibhu Pareek is a Solutions Architect at AWS. He joined AWS in 2016 and specializes in providing guidance on cloud adoption through the implementation of well architected, repeatable patterns and solutions that drive customer innovation. He has keen interest in open source databases like PostgreSQL. In his free time, you’d find him enjoying a game of football or engaged in pretentious fancy cooking.