Why do I see high storage consumption on my source PostgreSQL database when I run an AWS DMS CDC task?
Last updated: 2022-07-25
I am running an AWS Database Migration Service (AWS DMS) change data capture (CDC) task and using a PostgreSQL database as the source. Why do I see high storage consumption on my source database when I run an AWS DMS CDC task?
When you use PostgreSQL as a source for a CDC task, AWS DMS uses logical replication slots, a PostgreSQL feature, to get changes from the source database. These slots represent a stream of operations that are replayed in the same order that they were made in the PostgreSQL source database.
By design, logical replication slots retain the write-ahead logs (WAL) that AWS DMS requires, even when AWS DMS isn't connected to the PostgreSQL source. So, WAL is removed from PostgreSQL only after AWS DMS confirms that it has fetched the changes that it needs from the replication slot. AWS DMS confirms this by advancing the restart_lsn of the replication slot.
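For example, you can inspect each slot's restart_lsn on the source with a query like the following, which reads the standard pg_replication_slots catalog view:

```sql
-- Show each replication slot and the WAL position it is holding back
SELECT slot_name, active, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots;
```

The restart_lsn column shows the oldest WAL position that the slot still requires; PostgreSQL can't remove WAL older than this position while the slot exists.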
Causes of high storage consumption on a PostgreSQL source
Because of how PostgreSQL implements the logical replication slot feature, in some scenarios, storage volume issues can occur on the source database.
- The AWS DMS CDC task is stopped for a long time - AWS DMS isn't connected to the source database and isn't consuming changes from the replication slot on the source. Because PostgreSQL retains WAL that hasn't been read from the slot, it can't delete the older WAL, and the storage on the source database eventually fills up.
- Heavy workload - Heavy workloads that generate excessive WAL can also cause storage to become full on a PostgreSQL database when logical replication slots are in use. This happens because PostgreSQL retains WAL as long as its log sequence numbers (LSNs) are still required by a replication slot.
- Idle replication slots - Even if a table, schema, or database that AWS DMS replicates changes from is inactive, the replication slot continues to retain WAL, because the slot advances only when AWS DMS reads changes from it. This causes storage to fill up on the source, even if the replicated tables have no transactions taking place on them.
Check if replication slots are causing high disk space usage on the PostgreSQL source
To check if replication slots are the cause of high disk space usage in your PostgreSQL database, run the queries listed in Why did I receive a "No space left on device" or "DiskFull" error on Amazon RDS for PostgreSQL?
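As a sketch of one such diagnostic, this query estimates how much WAL each slot is holding back (it assumes PostgreSQL 10 or later; on 9.x, the equivalent functions are pg_current_xlog_location and pg_xlog_location_diff):

```sql
-- Estimate the amount of WAL retained by each replication slot
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;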
Note: Many of these queries can be used on self-managed PostgreSQL installations.
Turn on the WAL heartbeat feature
Turn on the AWS DMS WAL heartbeat feature to help avoid consuming storage on a PostgreSQL source database. This feature generates a dummy transaction at regular intervals so that idle logical replication slots don't hold on to old WAL. The heartbeat keeps the slot's restart_lsn moving forward and prevents storage from filling up on the PostgreSQL source.
To turn on the WAL heartbeat feature, add this extra connection attribute (ECA) to the PostgreSQL source endpoint:
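The AWS DMS documentation for PostgreSQL source endpoints names this attribute heartbeatEnable:

```
heartbeatEnable=true;
```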
Optionally, specify these additional ECAs:
heartbeatFrequency determines how often (in minutes) the heartbeat transaction is run on the PostgreSQL source. For example, if you set heartbeatFrequency to a value of 15, then AWS DMS runs the heartbeat transaction every 15 minutes on the source.
heartbeatSchema specifies the database schema in which AWS DMS creates the database objects that generate the heartbeat transaction.
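For example, a full ECA string that turns on the heartbeat, runs it every 5 minutes, and places the heartbeat objects in a schema named awsdms_heartbeat (both the frequency and the schema name here are illustrative values, not defaults) might look like this:

```
heartbeatEnable=true;heartbeatFrequency=5;heartbeatSchema=awsdms_heartbeat;
```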
Note: The heartbeat transaction runs on the source only if an AWS DMS task is running. If your AWS DMS tasks are stopped, then the WAL heartbeat feature has no effect.
Limit the size of a slot in PostgreSQL
You can set the max_slot_wal_keep_size parameter on the source database in PostgreSQL version 13 and later. This parameter sets the maximum amount of WAL that replication slots can retain.
Note: The max_slot_wal_keep_size setting can help you avoid storage-full issues on the PostgreSQL source. But it can also cause WAL to be purged from the source before the AWS DMS CDC task reads those changes from the replication slot. If that happens, the task fails.
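As a sketch for a self-managed PostgreSQL 13+ instance (on Amazon RDS for PostgreSQL, set the parameter in a DB parameter group instead), you could cap the WAL retained by replication slots at 20 GB, an example value you should size to your own workload:

```sql
-- Limit WAL retained by replication slots to 20 GB (example value)
ALTER SYSTEM SET max_slot_wal_keep_size = '20GB';
-- This parameter is reloadable, so no restart is required
SELECT pg_reload_conf();
```

Slots that fall behind by more than this limit are marked lost, so monitor the wal_status column of pg_replication_slots after setting it.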