AWS Database Blog
Remove bloat from Amazon Aurora and RDS for PostgreSQL with pg_repack
Do you have a database where the size of relations on disk is larger than you expect? Do you see that size increase every time you run a batch of UPDATE and DELETE operations on those relations? Have you noticed the database’s performance suffering as a result? This can be caused by dead tuples in relations, which lead to database bloat. In PostgreSQL databases, bloat is the extra space allocated to a table or index to hold old versions of rows that are no longer needed.
AWS has helped you run PostgreSQL-based databases on Amazon Relational Database Service (Amazon RDS) for PostgreSQL since November 2013 and Amazon Aurora PostgreSQL-Compatible Edition since October 2017. Common questions we hear from customers managing PostgreSQL databases are “Why is my database so slow?” and “Why are my queries taking so much CPU?” One of the reasons we have seen for this is heavy bloat in the database. In this post, we discuss bloat in relations (tables), the adverse effects heavy bloat can have on PostgreSQL databases, and how you can remove it.
Unintended bloat is bad, and that is what we mean by “heavy bloat” in this post. In PostgreSQL-based databases, however, we sometimes intentionally add a small amount of slack space using fillfactor to improve performance. For UPDATE-heavy and INSERT/DELETE-heavy tables, a small amount of bloat lets the table reach a natural balance and helps avoid the need to repeatedly extend and then contract the underlying files.
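As a quick illustration, the fillfactor storage parameter reserves free space in each heap page; the table name and fillfactor value below are only an example, not something from this post’s workload.

```sql
-- Reserve 10% free space in each heap page so updated row versions can often stay on the
-- same page (HOT updates). This is a deliberate trade-off, not the unintended "heavy bloat"
-- discussed in this post. Table name and fillfactor value are illustrative.
CREATE TABLE orders (
    order_id bigint PRIMARY KEY,
    status   text
) WITH (fillfactor = 90);

-- The same parameter can be changed on an existing table (applies to newly written pages)
ALTER TABLE orders SET (fillfactor = 90);
```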
Multiversion concurrency control
Before we begin, it’s important to understand multiversion concurrency control (MVCC). MVCC allows PostgreSQL to offer high concurrency even during significant database read and write activity. To implement this, an UPDATE or DELETE operation on a row (or tuple) doesn’t immediately remove the old version of the row. Instead, the old tuple is marked for deletion. Old tuples are also known as dead tuples. Therefore, if your application performs a lot of UPDATE and DELETE operations, the database can quickly accumulate a significant number of dead tuples, increasing the size of tables on disk. In database terminology, these dead tuples are what cause relation or table bloat.
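To see MVCC leaving dead tuples behind, you can run a quick experiment like the following in a scratch database. The table name is hypothetical, and the statistics counters are updated asynchronously, so allow a moment before they reflect the changes.

```sql
-- Create a small table and fill it with rows
CREATE TABLE mvcc_demo (id int PRIMARY KEY, val text);
INSERT INTO mvcc_demo SELECT g, 'x' FROM generate_series(1, 100000) AS g;

-- Every UPDATE writes a new row version and leaves the old one behind as a dead tuple
UPDATE mvcc_demo SET val = 'y';

-- DELETE also only marks rows as dead; the space is not returned yet
DELETE FROM mvcc_demo WHERE id <= 50000;

-- Roughly 150,000 dead tuples remain until (auto)vacuum cleans them up
SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relname = 'mvcc_demo';
```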
Detect bloat
Before looking into how to remove bloat, let’s first discuss how you can identify bloated tables in a PostgreSQL database. The view pg_stat_user_tables in PostgreSQL provides statistics about accesses to a particular table (number of rows inserted, deleted, and updated, and the estimated number of dead tuples). You can use a query similar to the following to check the number of dead tuples and when autovacuum or vacuum last ran on each table:
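```sql
SELECT relname          AS table_name,
       n_live_tup       AS live_tuples,
       n_dead_tup       AS dead_tuples,
       last_vacuum,
       last_autovacuum
FROM   pg_stat_user_tables
ORDER  BY n_dead_tup DESC;
```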
The view pg_stat_user_tables provides information about user tables only, and the query needs to be run separately in each database. In our example, we identified 98,399,974 dead tuples.
Remove bloat
The simplest way of reusing the storage space occupied by dead tuples is the PostgreSQL VACUUM operation. Autovacuum is a PostgreSQL auxiliary process that automates running VACUUM and ANALYZE commands. For more information about autovacuum in RDS for PostgreSQL environments, refer to Understanding autovacuum in Amazon RDS for PostgreSQL environments.
You can run two variants of VACUUM to get rid of dead tuples in PostgreSQL: standard VACUUM and VACUUM FULL.
The standard form of VACUUM removes dead row versions in tables and indexes and marks this space for reuse. However, it doesn’t free up the storage space allocated to the table. This shouldn’t pose an issue in normal day-to-day operations because this space can be reused by any subsequent inserts on the table.
VACUUM can run in parallel with production database operations without taking an exclusive lock on a table. Commands such as SELECT, INSERT, UPDATE, and DELETE continue to run normally. However, VACUUM can consume a substantial amount of I/O and CPU resources, so it’s helpful to run it during off-peak hours. When running it manually, you can choose to vacuum a particular table or the whole database:
- VACUUM <Table Name> – Performs the VACUUM operation on a particular table.
- VACUUM – Runs VACUUM on all the tables within a database.
For example, see the following commands (shown here against the dashboard example table used later in this post):
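```sql
VACUUM dashboard;          -- vacuum only the dashboard table
VACUUM VERBOSE dashboard;  -- same, with a progress report
VACUUM;                    -- vacuum every table in the current database
```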
VACUUM FULL can return the space to the operating system; however, before you run VACUUM FULL, it’s important to consider the following:
- It requires an access exclusive lock on the table it’s working on, and therefore can obstruct other operations on the table (including SELECT).
- It actively compacts tables by creating a copy of the table with no dead tuples. This can not only take a long time but also roughly double the storage consumed by the table and its indexes until the operation is complete.
You can run VACUUM FULL for a particular table or whole database:
- VACUUM FULL <Table Name> – Performs the VACUUM FULL operation on a particular table.
- VACUUM FULL – Runs VACUUM FULL on all the tables within a database. Running VACUUM FULL without a table name isn’t recommended, because it also rewrites the system catalogs.
For example, see the following commands (again using the dashboard example table):
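```sql
VACUUM FULL dashboard;          -- rewrite the dashboard table and return the freed space to the OS
VACUUM FULL VERBOSE dashboard;  -- same, with progress details
```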
In production databases, you generally want to avoid the VACUUM FULL operation because it blocks other activity in the database. An alternative, when you want the storage reclaimed rather than just made available for reuse, is the pg_repack extension.
pg_repack
pg_repack is an open-source PostgreSQL extension available with Amazon RDS (version 9.6.20+) and Aurora PostgreSQL that cleans up dead tuples and, like the VACUUM FULL operation, reclaims storage. Unlike VACUUM FULL, the pg_repack extension doesn’t require an exclusive lock for the complete duration of the operation. pg_repack only holds an ACCESS EXCLUSIVE lock for a short period during the initial creation of the log table and during the final swap-and-drop phase. For the rest of the time, pg_repack only needs to hold an ACCESS SHARE lock on the original table, so INSERT, UPDATE, and DELETE operations can proceed as usual. It lets you remove bloat from tables as well as indexes, and you can choose to run it at a time of lower load on the database.
All forms of VACUUM, as well as the pg_repack extension, are I/O-intensive operations, so use discretion when running them.
Now let’s see how you can use pg_repack with Amazon RDS for PostgreSQL or Aurora PostgreSQL.
Prerequisites
Before getting started, complete the following prerequisites:
- Connect to your Amazon RDS for PostgreSQL or Aurora PostgreSQL instance. If you’re using the psql client, the connection string looks similar to the following (shown with the example endpoint, user, and database used later in this post). For more information on how to connect to your DB instance, see Connecting to a DB instance running the PostgreSQL database engine.
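```bash
# Host, port, user, and database are the example values used later in this post
psql -h test-repack-instance.xxxxxxx.rds.amazonaws.com -p 5432 -U repackuser -d testdb
```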
- Create the pg_repack extension on the PostgreSQL database instance:
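```sql
CREATE EXTENSION pg_repack;
```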
- Determine the version of pg_repack installed on the server using a query such as the following. You can also use the \dx shorthand in the psql client.
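```sql
SELECT name, default_version, installed_version
FROM   pg_available_extensions
WHERE  name = 'pg_repack';
```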
Different versions of pg_repack might be on the Amazon RDS or Aurora PostgreSQL instance based on the database version. To determine the version of pg_repack supported with your instance’s database version, see PostgreSQL on Amazon RDS.
- You need a client machine, for example an EC2 instance, with pg_repack installed on it and connectivity to your RDS or Aurora instance. To install pg_repack, you can either build from source or use the PostgreSQL Yum Repository. Make sure to download and install the same version of the pg_repack client on your client machine as the extension version on your database instance.
- After the installation, use the pg_repack command on the client machine to run the extension; a version check and the general form of the command look like the following:
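```bash
# Confirm that the client build matches the extension version on the database instance
pg_repack --version

# General form of a pg_repack run; replace the placeholders with your own values
pg_repack -h <endpoint> -U <username> -d <database> -t <schema.table> -k
```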
For more options, refer to Usage.
Make sure that you have enough free storage before running pg_repack; you need roughly twice the size of the target tables and their indexes.
Run pg_repack
You can run pg_repack for a bloated table named dashboard in the public schema on an Amazon RDS instance with endpoint test-repack-instance.xxxxxxx.rds.amazonaws.com, user name repackuser, and database testdb using a command like the following:
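```bash
# -t targets the public.dashboard table; -k skips the superuser check, which is needed
# on Amazon RDS and Aurora (see the Superuser checks section later in this post)
pg_repack -h test-repack-instance.xxxxxxx.rds.amazonaws.com -U repackuser -d testdb -t public.dashboard -k
```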
Now let’s look at the high-level steps pg_repack performs in the background on the DB instance after you run the preceding command from the client machine (a simplified sketch of the corresponding SQL follows the list):
- Create a log table that captures any changes made to the original table.
- Create triggers on the original table to capture the delta and insert it into the log table.
- Create a new table and copy the data from the original table into it.
- After all the data has been loaded, create the indexes on the new table that were present on the original table.
- Replay the delta data from the log table into the new table.
- Swap the tables, including indexes and TOAST tables, using the system catalogs.
- Drop the original table.
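To make these steps more concrete, the following is a simplified, illustrative sketch of the kind of SQL involved. It is not the exact SQL that pg_repack issues: pg_repack generates object names from the table’s OID, drives the later steps through its own functions in the repack schema, and handles details such as TOAST tables, ordering, and locking that are omitted here. The dashboard table is assumed to have an id primary key column.

```sql
-- Illustrative sketch only; object names and column definitions are simplified.

-- 1. Log table to capture changes made to public.dashboard while the copy runs
CREATE TABLE repack.log_16519 (id bigserial PRIMARY KEY, operation text, row public.dashboard);

-- 2. Trigger on the original table that records every INSERT/UPDATE/DELETE into the log table
CREATE TRIGGER repack_trigger
  AFTER INSERT OR UPDATE OR DELETE ON public.dashboard
  FOR EACH ROW EXECUTE PROCEDURE repack.repack_trigger();  -- arguments omitted in this sketch

-- 3. New, bloat-free copy of the table
CREATE TABLE repack.table_16519 (LIKE public.dashboard INCLUDING DEFAULTS);
INSERT INTO repack.table_16519 SELECT * FROM ONLY public.dashboard;

-- 4. Rebuild the original table's indexes on the new table
CREATE UNIQUE INDEX index_16522 ON repack.table_16519 (id);

-- 5-7. Replay the captured delta, swap the relations through the system catalogs under a
--      brief ACCESS EXCLUSIVE lock, and drop the original table's storage; pg_repack
--      performs these steps through its own helper functions in the repack schema.
```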
Common issues faced using pg_repack
In this section, we discuss some common issues you may encounter when using the pg_repack extension with Amazon RDS for PostgreSQL or Aurora PostgreSQL.
Superuser checks
By default, pg_repack performs superuser checks. A true superuser isn’t available in Amazon RDS for PostgreSQL or Aurora PostgreSQL environments. Because these are managed services, there is a primary user that you define while creating the DB instance, which is assigned the rds_superuser role. This is a predefined Amazon RDS role similar to the standard PostgreSQL superuser role, but with some restrictions. With these restrictions, the rds_superuser role isn’t considered a true superuser role. For more information, see Creating roles. As a result, when pg_repack performs its superuser check on Amazon RDS for PostgreSQL or Aurora PostgreSQL, it fails with an error message similar to the following:
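```
ERROR: You must be a superuser to use pg_repack
```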
To resolve the error, use the -k or --no-superuser-check option in the connection string. This option skips superuser checks when you run pg_repack.
Missing the pg_repack extension on the database instance
As part of the prerequisites, you must create the extension after logging in to your Amazon RDS for PostgreSQL or Aurora PostgreSQL instance. If you don’t create the extension, the pg_repack client fails with an error indicating that pg_repack is not installed in the database.
To resolve this, you can log in to your instance and run the following command:
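```sql
CREATE EXTENSION pg_repack;
```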
Version mismatch
The pg_repack extension version must be the same on the server and client side. If they differ, you see an error message similar to the following:
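```
ERROR: program 'pg_repack 1.4.6' does not match database library 'pg_repack 1.4.3'
```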
Depending on which version of PostgreSQL you’re using, you may see a different version of pg_repack for different Amazon RDS for PostgreSQL or Aurora PostgreSQL instances. For example, PostgreSQL version 10.17 uses pg_repack version 1.4.3, whereas PostgreSQL version 13.3 uses pg_repack version 1.4.6.
You can check your pg_repack version with a query such as the following:
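```sql
SELECT extname, extversion FROM pg_extension WHERE extname = 'pg_repack';
```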
To resolve this, reinstall the correct version of pg_repack on the client machine, matching the pg_repack version installed on your instance.
Missing primary key or not-null unique constraint
pg_repack requires tables to have either a primary key or a not-null unique key defined. If you try to run pg_repack on a table that has neither, you get a warning similar to the following and the table isn’t repacked:
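```
WARNING: relation "public.dashboard" must have a primary key or not-null unique keys
```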
To resolve this, make sure your table has a primary key or a not-null unique key.
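For example, assuming the dashboard table has an id column whose values are unique and not null (illustrative only), you could add a primary key like this:

```sql
ALTER TABLE public.dashboard ADD PRIMARY KEY (id);
```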
Summary
Bloat can have a significant impact on database performance, so it’s important to keep it at a minimum in your PostgreSQL databases. In this post, we showed you how to identify bloat in a table and how to remove it from tables in Amazon RDS for PostgreSQL or Aurora PostgreSQL using operations like VACUUM and VACUUM FULL and the pg_repack extension.
To learn more about the autovacuum auxiliary process, which can help you automate maintenance operations like VACUUM and ANALYZE, see Understanding autovacuum in Amazon RDS for PostgreSQL environments and A Case Study of Tuning Autovacuum in Amazon RDS for PostgreSQL.
Finally, we encourage you to keep bloat at a minimum in your PostgreSQL databases to ensure optimum performance. If you have questions or suggestions, please leave them in the comments section.
About the Authors
Ayushi Gupta is a Technical Account Manager at AWS, based out of Delhi, India. She takes initiative in helping customers on complex issues and finding pertinent solutions, strategizing optimal cloud deployments based on each customer’s unique business requirements. Her area of interest is database and migration technologies. She enjoys spending time with her family and is a nature admirer.
Vibhu Pareek is a Solutions Architect at AWS. He joined AWS in 2016 and specializes in providing guidance on cloud adoption through the implementation of well architected, repeatable patterns and solutions that drive customer innovation. He has keen interest in open source databases like PostgreSQL. In his free time, you’d find him enjoying a game of football or engaged in pretentious fancy cooking.