How can I identify what is causing the "No space left on device" or "DiskFull" error on Amazon RDS for PostgreSQL?

Last updated: 2019-04-16

I have a small Amazon Relational Database Service (Amazon RDS) for PostgreSQL database. The instance's free storage space is decreasing, and I received the following error:

"Error message: PG::DiskFull: ERROR: could not extend file "base/16394/5139755": No space left on device. HINT: Check free disk space."

How do I resolve DiskFull errors, and how do I prevent storage full issues?

Short Description

Amazon RDS DB instance storage is used by the following:

  • Temporary tables or files that are created by PostgreSQL transactions
  • Data files
  • Write-ahead logs (WAL)
  • Replication slots that are not consumed
  • DB logs (error files) that are retained for too long
  • Cross-region read replication that is stopped
  • Other DB or Linux files that support the consistent state of the RDS DB instance

Resolution

Use Amazon CloudWatch to monitor your DB storage space with the FreeStorageSpace metric. By setting a CloudWatch alarm on free storage space, you receive a notification when the space starts to decrease. If you receive an alarm, review the possible causes of storage issues listed earlier. We also recommend enabling the log_temp_files parameter and monitoring the DB logs to see when and how temporary files are created. If your DB instance is still consuming more storage than expected, check for the following (a query for a quick per-database size overview appears after this list):

  • Size of the DB log files
  • Presence of temporary files
  • Constant increase in transaction logs disk usage
  • Cross-region read replicas
  • Bloat or improper removal of dead rows
  • Presence of orphaned files
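
For a quick overview of which database is using the most storage, you can run a query similar to the following. This is a general PostgreSQL query rather than anything RDS-specific, and pg_database_size can require the CONNECT privilege on each database:

psql=> SELECT datname, pg_size_pretty(pg_database_size(datname)) AS database_size FROM pg_database ORDER BY pg_database_size(datname) DESC;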

Check the size of the DB log files

By default, PostgreSQL error log files have a retention value of 4,320 minutes (three days). Large log files use more storage space, typically because of higher workloads or excessive logging. You can change the retention period for system logs by using the rds.log_retention_period parameter in the DB parameter group that is associated with your DB instance. For example, if you set the value to 1440, logs are retained for one day. For more information, see PostgreSQL Database Log Files.
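
To check the current log retention setting and, on PostgreSQL 10 or later, to estimate how much space the server log files are using, you can run commands similar to the following sketch. Note that rds.log_retention_period is an Amazon RDS-specific parameter, and pg_ls_logdir is restricted by default to superusers and members of the pg_monitor role, so it might not be available to your user:

psql=> SHOW rds.log_retention_period;
psql=> SELECT pg_size_pretty(SUM(size)) AS total_log_size FROM pg_ls_logdir();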

Check for temporary files

Temporary files are stored per backend or session connection, and they are used as a resource pool or buffer. These files are stored separately from the shared resource space. You can review the temporary file statistics by comparing the output (the delta) of successive runs of a command similar to the following:

psql=> SELECT datname, temp_files AS "Temporary files", temp_bytes AS "Size of temporary files" FROM pg_stat_database;

Important: The temp_files and temp_bytes columns in the pg_stat_database view collect statistics cumulatively. This is by design, because these counters are reset only by recovery at server start, such as after an immediate shutdown, a server crash, or a point-in-time recovery (PITR). For this reason, it's a best practice to monitor how these counters grow in number and size over time, rather than to review only a single output.
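
Because the counters are cumulative, one way to watch their growth is to take a snapshot, let the workload run, and then compute the delta. The following is a minimal sketch that uses a temporary table (named temp_stats_snapshot here for illustration) within a single session:

psql=> CREATE TEMP TABLE temp_stats_snapshot AS SELECT datname, temp_files, temp_bytes FROM pg_stat_database;
-- ...let the workload run for a while, then compare against the snapshot...
psql=> SELECT d.datname, d.temp_files - s.temp_files AS new_temp_files, pg_size_pretty(d.temp_bytes - s.temp_bytes) AS new_temp_bytes FROM pg_stat_database d JOIN temp_stats_snapshot s ON d.datname = s.datname;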

Temporary files are created for sorts, hashes, and temporary query results. A log entry is made for each temporary file when the file is deleted. To track the temporary space used by individual queries, set log_temp_files in a custom parameter group. This parameter controls the logging of temporary file names and sizes. If you set the log_temp_files value to 0, all temporary file information is logged. If you set the parameter to a positive value, only files with sizes greater than or equal to the specified number of kilobytes are logged. The default setting is -1, which disables the logging of temporary files. You can also run EXPLAIN ANALYZE on your query to review disk sorting. When you review the log output, you can see the size of the temporary files created by your query. For more information, see the PostgreSQL Documentation for Monitoring Database Activity.
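
The following is a sketch of how a disk sort appears in EXPLAIN ANALYZE output. The table orders and the column order_date are hypothetical examples:

psql=> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders ORDER BY order_date;
-- In the plan output, look for a line similar to the following:
--   Sort Method: external merge  Disk: 102400kB
-- "external merge" indicates that the sort spilled to temporary files on disk.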

Check for a constant increase in transaction logs disk usage

The TransactionLogsDiskUsage CloudWatch metric represents the disk space used by transaction logs (WAL). The following can cause an increase in transaction log disk usage:

  • High DB loads (writes and updates that generate additional WALs)
  • Read replicas in a storage-full state (which causes transaction logs to be retained on the primary instance)
  • Replication slots

Replication slots can be created as part of the logical decoding feature of AWS Database Migration Service (AWS DMS). For logical replication, the rds.logical_replication parameter is set to 1. Replication slots retain the WAL files until the files are consumed by an external consumer, for example, pg_recvlogical; extract, transform, and load (ETL) jobs; or AWS DMS.

When you set the rds.logical_replication parameter value to 1, the wal_level, max_wal_senders, max_replication_slots, and max_connections parameters are also set. Changing these parameters can increase WAL generation, so it's a best practice to set the rds.logical_replication parameter only when you are using logical slots. If this parameter is set to 1 and logical replication slots are present but there is no consumer for the WAL files retained by a replication slot, then you can see an increase in transaction logs disk usage and a corresponding constant decrease in free storage space.
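
Before reviewing the slots themselves, you can confirm how logical replication is currently configured by querying pg_settings, as in the following sketch. The rds.logical_replication entry is specific to Amazon RDS and might not appear on other PostgreSQL installations:

psql=> SELECT name, setting FROM pg_settings WHERE name IN ('rds.logical_replication', 'wal_level', 'max_wal_senders', 'max_replication_slots', 'max_connections');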

Execute a query similar to the following to confirm the presence and size of replication slots:

PostgreSQL v9

psql=> select slot_name, pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)) as replicationSlotLag, active from pg_replication_slots;

PostgreSQL v10 and v11

psql=> select slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as replicationSlotLag, active from pg_replication_slots;

After you identify the replication slot that is not being consumed (with an active state that is False), then you can drop the replication slot by executing a query similar to the following:

psql=> select pg_drop_replication_slot('Your_slot_name');

Note: If an AWS DMS task is the consumer and it is no longer required, you can delete the task.

Sample output:

                            slot_name                            | replicationslotlag | active
------------------------------------------------------------------+--------------------+--------
 xc36ufspjql35djp_00013322_907c1e0a_9f8b_4c13_89ea_ef0ea1cf143d   | 129 GB             | f
 7pajuy7htthd7sqn_00013322_a27bcebf_7d0f_4124_b336_92d0fb9f5130   | 704 MB             | t
 zp2tkfo4ejw3dtlw_00013322_03e77862_689d_41c5_99ba_021c8a3f851a   | 624 MB             | t

In this example, the slot xc36ufspjql35djp_00013322_907c1e0a_9f8b_4c13_89ea_ef0ea1cf143d has an active state of False, so this slot isn't actively being used, and it is retaining 129 GB of transaction (WAL) files.

You can drop the slot by executing a command similar to the following:

psql=> select pg_drop_replication_slot('xc36ufspjql35djp_00013322_907c1e0a_9f8b_4c13_89ea_ef0ea1cf143d');

Cross-region read replicas

If you use cross-region read replication, a physical replication slot is created on the primary instance. If the cross-region read replica fails, the storage space on the primary DB instance can be affected, because the WAL files aren't replicated over to the read replica. You can use the CloudWatch metrics Oldest Replication Slot Lag and Transaction Logs Disk Usage to determine how far behind the most lagging replica is in terms of WAL data received, and how much storage is used for WAL data.

To check the status of a cross-region read replica, query pg_replication_slots. For more information, see the PostgreSQL Documentation for the pg_replication_slots view. If the active state is returned as false, then the slot is not currently being used for replication.

postgres=# select * from pg_replication_slots;

In addition to the replica lag CloudWatch metrics and events, you can use the pg_stat_replication view on the source instance to check the replication statistics. For more information, see the PostgreSQL Documentation for pg_stat_replication.
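
The following is a sketch of such a check for PostgreSQL 10 and later (on version 9.x, the columns are named sent_location, write_location, flush_location, and replay_location instead):

psql=> SELECT client_addr, state, sent_lsn, replay_lsn, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag FROM pg_stat_replication;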

Bloat or improper removal of dead rows (tuples)

In normal PostgreSQL operations, tuples that are deleted or made obsolete by an UPDATE aren't physically removed from their table. In the Multi-Version Concurrency Control (MVCC) implementation, when a DELETE operation is performed, the row is not immediately removed from the data file. Instead, the row is marked as deleted by setting the xmax field in its header. An UPDATE similarly marks the current row version as deleted and then inserts a new row version. This allows concurrency with minimal locking between different transactions. As a result, different row versions are kept as part of the MVCC process.

If dead rows aren't cleaned up, they can stay (invisible to any transaction) in the data files, which impacts disk space. If a table has many DELETE and UPDATE operations, the dead tuples might use a large amount of disk space—which is sometimes called bloat in PostgreSQL.

The VACUUM operation can reclaim storage that is occupied by dead tuples. It's a best practice to perform periodic vacuum or autovacuum operations on tables that are updated frequently. For more information, see the PostgreSQL Documentation for VACUUM.
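
To find the tables with the most dead tuples in the current database, you can run a query similar to the following against the pg_stat_user_tables view (a sketch; adjust the LIMIT as needed):

psql=> SELECT schemaname, relname, n_live_tup, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;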

To check for the estimated number of dead tuples, use the pg_stat_all_tables view. For more information, see the PostgreSQL Documentation for the pg_stat_all_tables view. In the following example, there are 1999952 dead tuples (n_dead_tup):

psql => select * from pg_stat_all_tables where relname='test';
-[ RECORD 1 ]-------+------------------------------
relid               | 16395
schemaname          | public
relname             | test
seq_scan            | 3
seq_tup_read        | 5280041
idx_scan            | 
idx_tup_fetch       | 
n_tup_ins           | 2000000
n_tup_upd           | 0
n_tup_del           | 3639911
n_tup_hot_upd       | 0
n_live_tup          | 1635941
n_dead_tup          | 1999952
n_mod_since_analyze | 3999952
last_vacuum         | 
last_autovacuum     | 2018-08-16 04:49:52.399546+00
last_analyze        | 2018-08-09 09:44:56.208889+00
last_autoanalyze    | 2018-08-16 04:50:22.581935+00
vacuum_count        | 0
autovacuum_count    | 1
analyze_count       | 1
autoanalyze_count   | 1


To remove the dead tuples and make that space available for reuse, run VACUUM on the table:

psql => VACUUM test;
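
Note that a plain VACUUM makes the space held by dead tuples available for reuse within the table, but it usually doesn't return that space to the operating system. VACUUM FULL rewrites the table and can shrink the underlying files, but it takes an exclusive lock and needs additional temporary disk space while it runs. To see how much space the table and its indexes use before and after the operation, you can run a check similar to the following (test is the example table from above):

psql=> SELECT pg_size_pretty(pg_total_relation_size('test'));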

Orphaned files

Orphaned files are files that are present in the database directory but have no database objects pointing to them. This scenario is rare, but it can occur if your instance runs out of storage during an operation such as ALTER TABLE, VACUUM FULL, or CLUSTER. To check for orphaned files, follow these steps:

1.    Connect to each database on your PostgreSQL DB instance.

2.    Execute the following queries to assess the used and real sizes.

-- Size of the database occupied by files
psql=> SELECT pg_size_pretty(pg_database_size('DATABASE_NAME'));

-- Size of the database retrieved by summing the objects (real size)
psql=> SELECT pg_size_pretty(SUM(pg_relation_size(oid))) FROM pg_class;

3.    Compare the results.

If there is a significant difference between the size of the database occupied by files and the size retrieved by summing the objects, then orphaned files might be using storage space.

