How can I identify what is causing the "No space left on device" or "DiskFull" error on Amazon RDS for PostgreSQL?

Last updated: 2019-04-16

I have a small Amazon Relational Database Service (Amazon RDS) for PostgreSQL database. The instance's free storage space is decreasing, and I received the following error:

"Error message: PG::DiskFull: ERROR: could not extend file "base/16394/5139755": No space left on device. HINT: Check free disk space."

How do I resolve DiskFull errors, and how do I prevent storage full issues?

Short Description

Amazon RDS DB instance storage is used by the following:

  • Temporary tables or files that are created by PostgreSQL transactions
  • Data files
  • Write-ahead logs (WAL)
  • Replication slots that are not consumed
  • DB logs (error files) that are retained for too long
  • Cross-region read replication that is stopped
  • Other DB or Linux files that support the consistent state of the RDS DB instance

Resolution

Use Amazon CloudWatch to monitor your DB storage space with the FreeStorageSpace metric. By setting a CloudWatch alarm on free storage space, you receive a notification when the space starts to decrease. If you receive an alarm, review the possible causes of storage issues listed earlier. We also recommend enabling the log_temp_files parameter and monitoring the DB logs to see when and how temporary files are created. If your DB instance is still consuming more storage than expected, check for the following (a query for a quick per-database size overview appears after this list):

  • Size of the DB log files
  • Presence of temporary files
  • Constant increase in transaction logs disk usage
  • Cross-region read replicas
  • Bloat or improper removal of dead rows
  • Presence of orphaned files
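
For a quick overview of which database is using the most storage, you can run a query similar to the following. This is a general PostgreSQL query rather than anything RDS-specific, and pg_database_size can require the CONNECT privilege on each database:

psql=> SELECT datname, pg_size_pretty(pg_database_size(datname)) AS database_size FROM pg_database ORDER BY pg_database_size(datname) DESC;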

Check the size of the DB log files

By default, PostgreSQL error log files have a retention value of 4,320 minutes (three days). Large log files use more storage space, typically because of higher workloads or excessive logging. You can change the retention period for system logs by using the rds.log_retention_period parameter in the DB parameter group that is associated with your DB instance. For example, if you set the value to 1440, logs are retained for one day. For more information, see PostgreSQL Database Log Files.
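
To check the current log retention setting and, on PostgreSQL 10 or later, to estimate how much space the server log files are using, you can run commands similar to the following sketch. Note that rds.log_retention_period is an Amazon RDS-specific parameter, and pg_ls_logdir is restricted by default to superusers and members of the pg_monitor role, so it might not be available to your user:

psql=> SHOW rds.log_retention_period;
psql=> SELECT pg_size_pretty(SUM(size)) AS total_log_size FROM pg_ls_logdir();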

Check for temporary files

Temporary files are stored per backend or session connection, and they are used as a resource pool or buffer. These files are stored separately from the shared resource space. You can review the temporary file statistics by comparing the output (the delta) of successive runs of a command similar to the following:

psql=> SELECT datname, temp_files AS "Temporary files", temp_bytes AS "Size of temporary files" FROM pg_stat_database;

Important: The temp_files and temp_bytes columns in the pg_stat_database view collect statistics cumulatively. This is by design, because these counters are reset only by recovery at server start, such as after an immediate shutdown, a server crash, or a point-in-time recovery (PITR). For this reason, it's a best practice to monitor how these counters grow in number and size over time, rather than to review only a single output.
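
Because the counters are cumulative, one way to watch their growth is to take a snapshot, let the workload run, and then compute the delta. The following is a minimal sketch that uses a temporary table (named temp_stats_snapshot here for illustration) within a single session:

psql=> CREATE TEMP TABLE temp_stats_snapshot AS SELECT datname, temp_files, temp_bytes FROM pg_stat_database;
-- ...let the workload run for a while, then compare against the snapshot...
psql=> SELECT d.datname, d.temp_files - s.temp_files AS new_temp_files, pg_size_pretty(d.temp_bytes - s.temp_bytes) AS new_temp_bytes FROM pg_stat_database d JOIN temp_stats_snapshot s ON d.datname = s.datname;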

Temporary files are created for sorts, hashes, and temporary query results. A log entry is made for each temporary file when the file is deleted. To track the temporary space used by individual queries, set log_temp_files in a custom parameter group. This parameter controls the logging of temporary file names and sizes. If you set the log_temp_files value to 0, all temporary file information is logged. If you set the parameter to a positive value, only files with sizes greater than or equal to the specified number of kilobytes are logged. The default setting is -1, which disables the logging of temporary files. You can also run EXPLAIN ANALYZE on your query to review disk sorting. When you review the log output, you can see the size of the temporary files created by your query. For more information, see the PostgreSQL Documentation for Monitoring Database Activity.
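
The following is a sketch of how a disk sort appears in EXPLAIN ANALYZE output. The table orders and the column order_date are hypothetical examples:

psql=> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders ORDER BY order_date;
-- In the plan output, look for a line similar to the following:
--   Sort Method: external merge  Disk: 102400kB
-- "external merge" indicates that the sort spilled to temporary files on disk.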

Check for a constant increase in transaction logs disk usage

The TransactionLogsDiskUsage CloudWatch metric represents the disk space used by transaction logs (WAL). The following can cause an increase in transaction log disk usage:

  • High DB loads (writes and updates that generate additional WALs)
  • Read replicas in a storage-full state (which causes transaction logs to be retained on the primary instance)
  • Replication slots

Replication slots can be created as part of the logical decoding feature of AWS Database Migration Service (AWS DMS). For logical replication, the rds.logical_replication parameter is set to 1. Replication slots retain the WAL files until the files are consumed by an external consumer, for example, pg_recvlogical; extract, transform, and load (ETL) jobs; or AWS DMS.

When you set the rds.logical_replication parameter value to 1, the wal_level, max_wal_senders, max_replication_slots, and max_connections parameters are also set. Changing these parameters can increase WAL generation, so it's a best practice to set the rds.logical_replication parameter only when you are using logical slots. If this parameter is set to 1 and logical replication slots are present but there is no consumer for the WAL files retained by a replication slot, then you can see an increase in transaction logs disk usage and a corresponding constant decrease in free storage space.
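
Before reviewing the slots themselves, you can confirm how logical replication is currently configured by querying pg_settings, as in the following sketch. The rds.logical_replication entry is specific to Amazon RDS and might not appear on other PostgreSQL installations:

psql=> SELECT name, setting FROM pg_settings WHERE name IN ('rds.logical_replication', 'wal_level', 'max_wal_senders', 'max_replication_slots', 'max_connections');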

Execute a query similar to the following to confirm the presence and size of replication slots:

PostgreSQL v9

psql=> select slot_name, pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)) as replicationSlotLag, active from pg_replication_slots;

PostgreSQL v10 and v11

psql=> select slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as replicationSlotLag, active from pg_replication_slots;

After you identify the replication slot that is not being consumed (with an active state that is False), then you can drop the replication slot by executing a query similar to the following:

psql=> select pg_drop_replication_slot('Your_slot_name');

Note: If an AWS DMS task is the consumer and it is no longer required, you can delete the task.

Sample output:

                            slot_name                            | replicationslotlag | active
------------------------------------------------------------------+--------------------+--------
 xc36ufspjql35djp_00013322_907c1e0a_9f8b_4c13_89ea_ef0ea1cf143d   | 129 GB             | f
 7pajuy7htthd7sqn_00013322_a27bcebf_7d0f_4124_b336_92d0fb9f5130   | 704 MB             | t
 zp2tkfo4ejw3dtlw_00013322_03e77862_689d_41c5_99ba_021c8a3f851a   | 624 MB             | t

In this example, the slot xc36ufspjql35djp_00013322_907c1e0a_9f8b_4c13_89ea_ef0ea1cf143d has an active state of False, so this slot isn't actively being used, and it is retaining 129 GB of transaction (WAL) files.

You can drop the slot by executing a command similar to the following:

psql=> select pg_drop_replication_slot('xc36ufspjql35djp_00013322_907c1e0a_9f8b_4c13_89ea_ef0ea1cf143d');

Cross-region read replicas

If you use cross-region read replication, a physical replication slot is created on the primary instance. If the cross-region read replica fails, the storage space on the primary DB instance can be affected, because the WAL files aren't replicated over to the read replica. You can use the CloudWatch metrics Oldest Replication Slot Lag and Transaction Logs Disk Usage to determine how far behind the most lagging replica is in terms of WAL data received, and how much storage is used for WAL data.

To check the status of a cross-region read replica, query pg_replication_slots. For more information, see the PostgreSQL Documentation for the pg_replication_slots view. If the active state is returned as false, then the slot is not currently being used for replication.

postgres=# select * from pg_replication_slots;

In addition to the replica lag CloudWatch metrics and events, you can use the pg_stat_replication view on the source instance to check the replication statistics. For more information, see the PostgreSQL Documentation for pg_stat_replication.
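
The following is a sketch of such a check for PostgreSQL 10 and later (on version 9.x, the columns are named sent_location, write_location, flush_location, and replay_location instead):

psql=> SELECT client_addr, state, sent_lsn, replay_lsn, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag FROM pg_stat_replication;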

Bloat or improper removal of dead rows (tuples)

In normal PostgreSQL operations, tuples that are deleted or made obsolete by an UPDATE aren't physically removed from their table. In the Multi-Version Concurrency Control (MVCC) implementation, when a DELETE operation is performed, the row is not immediately removed from the data file. Instead, the row is marked as deleted by setting the xmax field in its header. An UPDATE similarly marks the current row version as deleted and then inserts a new row version. This allows concurrency with minimal locking between different transactions. As a result, different row versions are kept as part of the MVCC process.

If dead rows aren't cleaned up, they can stay (invisible to any transaction) in the data files, which impacts disk space. If a table has many DELETE and UPDATE operations, the dead tuples might use a large amount of disk space—which is sometimes called bloat in PostgreSQL.

The VACUUM operation can reclaim storage that is occupied by dead tuples. It's a best practice to perform periodic vacuum or autovacuum operations on tables that are updated frequently. For more information, see the PostgreSQL Documentation for VACUUM.
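
To find the tables with the most dead tuples in the current database, you can run a query similar to the following against the pg_stat_user_tables view (a sketch; adjust the LIMIT as needed):

psql=> SELECT schemaname, relname, n_live_tup, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;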

To check for the estimated number of dead tuples, use the pg_stat_all_tables view. For more information, see the PostgreSQL Documentation for the pg_stat_all_tables view. In the following example, there are 1999952 dead tuples (n_dead_tup):

psql => select * from pg_stat_all_tables where relname='test';
-[ RECORD 1 ]-------+------------------------------
relid               | 16395
schemaname          | public
relname             | test
seq_scan            | 3
seq_tup_read        | 5280041
idx_scan            | 
idx_tup_fetch       | 
n_tup_ins           | 2000000
n_tup_upd           | 0
n_tup_del           | 3639911
n_tup_hot_upd       | 0
n_live_tup          | 1635941
n_dead_tup          | 1999952
n_mod_since_analyze | 3999952
last_vacuum         | 
last_autovacuum     | 2018-08-16 04:49:52.399546+00
last_analyze        | 2018-08-09 09:44:56.208889+00
last_autoanalyze    | 2018-08-16 04:50:22.581935+00
vacuum_count        | 0
autovacuum_count    | 1
analyze_count       | 1
autoanalyze_count   | 1


To remove the dead tuples and make that space available for reuse, run VACUUM on the table:

psql => VACUUM test;
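
Note that a plain VACUUM makes the space held by dead tuples available for reuse within the table, but it usually doesn't return that space to the operating system. VACUUM FULL rewrites the table and can shrink the underlying files, but it takes an exclusive lock and needs additional temporary disk space while it runs. To see how much space the table and its indexes use before and after the operation, you can run a check similar to the following (test is the example table from above):

psql=> SELECT pg_size_pretty(pg_total_relation_size('test'));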

Orphaned files

Orphaned files are files that are present in the database directory but have no database objects pointing to them. This scenario is rare, but it can occur if your instance runs out of storage during an operation such as ALTER TABLE, VACUUM FULL, or CLUSTER. To check for orphaned files, follow these steps:

1.    Connect to each database on your PostgreSQL DB instance.

2.    Execute the following queries to assess the used and real sizes.

-- Size of the database occupied by files
psql=> SELECT pg_size_pretty(pg_database_size('DATABASE_NAME'));

-- Size of the database retrieved by summing the objects (real size)
psql=> SELECT pg_size_pretty(SUM(pg_relation_size(oid))) FROM pg_class;

3.    Compare the results.

If there is a significant difference between the size of the database occupied by files and the size retrieved by summing the objects, then orphaned files might be using storage space.

