AWS Database Blog
Using delayed read replicas for Amazon RDS for PostgreSQL disaster recovery
Human errors can create substantial risks to business continuity. An accidental DELETE statement, an incorrect batch update, or a faulty application deployment can instantly corrupt years of critical business data. Amazon Relational Database Service (Amazon RDS) provides protection through automated backups and transaction log backups, giving you a reliable safety net. However, the traditional recovery method involves creating new database instances and performing point-in-time recovery. For large databases, this restoration process can extend to several hours, significantly impacting business operations.
AWS recently launched delayed read replicas for Amazon RDS for PostgreSQL. This feature offers an alternative approach for disaster recovery by maintaining a standby replica that intentionally lags behind the primary database by a configurable time interval, retaining data on the read replica as it existed minutes or hours in the past. This gives you an opportunity to detect data corruption on your production instance and promote the replica before a problematic operation is applied. This mechanism functions as a real-time safety net that reduces recovery complexity compared to traditional point-in-time backup restoration.
When data corruption occurs, you can promote the delayed replica to become the new primary, recovering within minutes. You can enable this feature using the recovery_min_apply_delay parameter. It is available with Amazon RDS for PostgreSQL versions 14.19, 15.14, 16.10, 17.6, and later.
In this post, we explore the use cases for delayed replication, the recovery procedures, and best practices for managing delayed replicas to help ensure your database recovery strategy is both robust and efficient.
Use cases for delayed replication
Delayed replicas address three primary use cases: preventing accidental data modifications, protecting against logical errors in applications, and enabling auditing and forensic analysis. Let’s explore each of these in detail:
- Preventing accidental data modifications – Human errors such as executing UPDATE or DELETE statements without proper WHERE clauses can instantly corrupt large datasets. A delayed replica provides a buffer period to detect such mistakes on production and promote the replica to recover from the accidental changes. For example, if a database administrator accidentally runs DELETE FROM customer_orders WHERE status = 'pending' instead of targeting a specific date range, the delayed replica gives you a window of time to recover from the mistake. Upon detecting the error, you can catch the replica up to a point just before the disastrous operation and promote the delayed replica to become the new primary, effectively rolling back the accidental changes.
- Protection against errors in applications – Application bugs or incorrect deployment logic can introduce risks of corruption through unwanted data changes, such as bulk inserting erroneous records, applying faulty data transformations, or triggering unintended cascade operations that affect multiple tables. Delayed read replicas give you an opportunity to recover from such errors. When new code contains bugs that modify critical data, you can halt the delayed replica’s write-ahead log (WAL) application and use it to restore the correct state.
- Auditing and forensic analysis of data changes – A delayed replica can serve as an auditing resource. It preserves the history of your data for a configurable delay period, so that you can examine and compare past and present data side by side. For example, if you suspect unauthorized or unintended changes, you can query the delayed replica and the primary to see what changed in that interval. Additionally, advanced users can inspect the WAL on the delayed replica using tools like the pg_walinspect extension to pinpoint the exact transactions that occurred. This forensic capability helps in auditing data changes and investigating incidents, without the complexity of restoring point-in-time backups.
In all these scenarios, delayed replication acts as an “undo buffer” or safety net. It is not a replacement for Automated Backups, but it complements your disaster recovery strategy by offering a real-time point-in-time recovery mechanism. As the PostgreSQL documentation notes, time-delayed replicas can be very useful for correcting data loss errors by providing a window to react.
Set up delayed replication in Amazon RDS for PostgreSQL
At its core, the recovery_min_apply_delay parameter controls PostgreSQL’s WAL replay mechanism at the transaction commit level. When configured in Amazon RDS for PostgreSQL, it modifies the replica’s recovery process by comparing the commit timestamp in each WAL record against the replica’s system clock, creating a deliberate lag in transaction visibility.
For an Amazon RDS for PostgreSQL database instance that already has a read replica, the following procedure shows how you can use the AWS CLI to configure delayed replication. A combined AWS CLI sketch follows the procedure.
- Create a custom database parameter group:
- Modify the newly created custom database parameter group to configure the recovery_min_apply_delay parameter. The default value of this parameter is 0 milliseconds (no delay), and the maximum is 86400000 milliseconds (24 hours). In this example, we configure it to be 43200000 milliseconds (approximately 12 hours). The following screenshot shows setting recovery_min_apply_delay to 43200000 milliseconds (~12 hours) in the newly created database parameter group using the Amazon RDS console.
- Modify the read replica database instance to use the custom database parameter group and reboot the replica for the parameter configurations to take effect.
Note: The recovery_min_apply_delay parameter is a static parameter, so you must reboot the replica database instance for the parameter change to take effect.
- Verify that the replica is configured with 12 hours of delay by connecting to the RDS read replica instance and running one of the following queries:
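The exact commands depend on your environment; the following is a minimal AWS CLI and psql sketch of the procedure above. The parameter group name (delayed-replica-pg17), parameter group family, replica identifier (awsblog-demo-replica), and endpoint are placeholders for illustration.

```bash
# 1. Create a custom DB parameter group (name and family are placeholders)
aws rds create-db-parameter-group \
  --db-parameter-group-name delayed-replica-pg17 \
  --db-parameter-group-family postgres17 \
  --description "Delayed read replica settings"

# 2. Set recovery_min_apply_delay to 43200000 ms (~12 hours); it is a static parameter
aws rds modify-db-parameter-group \
  --db-parameter-group-name delayed-replica-pg17 \
  --parameters "ParameterName=recovery_min_apply_delay,ParameterValue=43200000,ApplyMethod=pending-reboot"

# 3. Attach the parameter group to the read replica and reboot it
aws rds modify-db-instance \
  --db-instance-identifier awsblog-demo-replica \
  --db-parameter-group-name delayed-replica-pg17 \
  --apply-immediately
aws rds reboot-db-instance --db-instance-identifier awsblog-demo-replica

# 4. Verify the configured delay on the replica
psql -h <replica-endpoint> -U postgres -d postgres \
  -c "SHOW recovery_min_apply_delay;" \
  -c "SELECT setting, unit FROM pg_settings WHERE name = 'recovery_min_apply_delay';"
```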
Recovery control functions with delayed replication
The delayed replication feature on RDS for PostgreSQL also gives you access to two recovery functions for greater control over the recovery process.
These functions require the rds_superuser role for execution.
- pg_wal_replay_pause(): Use this function to request a pause in the recovery process. When called, it initiates a pause request, though the actual pause may not occur immediately. To confirm the recovery has fully paused, you can use pg_get_wal_replay_pause_state(). During a paused state, no new changes are applied to your delayed replica, giving you a stable point-in-time view of your data.
- pg_wal_replay_resume(): When you’re ready to continue the recovery process, call this function to resume normal operations. The delayed replica will begin applying changes from where it was paused.
Once you pause WAL replay with pg_wal_replay_pause(), you must call pg_wal_replay_resume() to continue replay of WAL logs. Otherwise, WAL logs will accumulate indefinitely on the read replica and cause excessive storage consumption.
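As a quick illustration, the following is a minimal SQL sketch of the pause-and-resume workflow, run on the delayed replica as a user with the rds_superuser role. pg_get_wal_replay_pause_state() is the standard PostgreSQL function for checking the pause state.

```sql
-- Request a pause of WAL replay; the pause takes effect asynchronously.
SELECT pg_wal_replay_pause();

-- Check the replay state: 'pause requested' means the pause is still pending,
-- 'paused' means no further changes are being applied.
SELECT pg_get_wal_replay_pause_state();

-- Inspect the stable, point-in-time view of your data here.

-- Resume WAL replay so WAL does not accumulate on the replica.
SELECT pg_wal_replay_resume();
```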
Demonstration and recovery with delayed replicas
Let’s explore a common disaster scenario and its resolution using delayed replication.
In this example, we’ll simulate an accidental database drop and demonstrate the recovery process using a delayed replica configured with a 12-hour delay.
At 2025-08-05 22:14:37 UTC, a critical incident occurs when a user accidentally drops a logical database from a production RDS instance. Production applications have begun failing, and a service outage is reported.
Recovering from a dropped database
Since the read replica is delayed by 12 hours, we have time to implement a recovery plan before the dangerous DROP DATABASE statement reaches our replica.
- First, we connect to the delayed replica using an account with rds_superuser privileges and verify the replication status. To prevent any further changes, we immediately pause the WAL replay.
- We capture comprehensive replica metrics (a sample query appears after this walkthrough). Given these replication metrics, our key observations are:
- The output shows that the read replica is running 48 minutes behind the primary (replication_lag: 00:48:34), which is normal since we have configured a 12-hour intentional delay (configured_delay: 12h).
- There’s approximately 576 MB of WAL data waiting to be replayed (replay_lag_bytes: 603978864), calculated as the difference between the last received WAL position (1/34000000) and the last replayed position (1/10000390).
- The last transaction was replayed at 21:52:58 UTC.
- We enabled the log_statement = all configuration on our source instance, which helps with investigating the incident. We traced the sequence of events and identified a checkpoint at 2025-08-05 22:10:28 UTC with LSN 1/1C000080, occurring before the database drop. Comparing the captured last_replayed_lsn = 1/10000390 on the delayed read replica with the LSN captured on the source instance, we can confirm that 1/1C000080 is ahead of 1/10000390 by 201,325,808 bytes in the WAL stream.
- Next, we set recovery_target_lsn = 1/1C000080 and recovery_target_inclusive = true on the delayed read replica.
Note: If you don’t have pgAudit or the log_statement parameter enabled, you can use recovery_target_time instead of recovery_target_lsn.
- We modify our read replica’s parameters and make the following changes (see the AWS CLI sketch after this walkthrough):
- Remove the replication delay (recovery_min_apply_delay).
- Set the target recovery point (recovery_target_lsn and recovery_target_inclusive). Since recovery_target_lsn and recovery_target_inclusive are static parameters, these changes require a database reboot to take effect.
While we’re setting recovery_min_apply_delay to 0, this alone won’t restart WAL replay. When WAL replay is explicitly paused using pg_wal_replay_pause(), it remains paused until manually resumed with pg_wal_replay_resume(), regardless of delay settings.
- We connect to the database again and verify that the replication delay is removed and the recovery_target_lsn and recovery_target_inclusive parameters are set on the database.
- We resume the WAL replay and monitor our recovery process.
- From the database error logs on the read replica, we see that the recovery was paused after reaching the configured recovery_target_lsn=1/1C000080.
- After confirming the blog_production database exists on the replica, we promote the read replica to become our new primary.
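The metrics referenced in this walkthrough can be captured with standard PostgreSQL recovery functions. The following query is a minimal sketch, run on the delayed replica; the column aliases mirror the names used above and are illustrative.

```sql
-- Run on the delayed read replica to capture replication metrics.
SELECT
    now() - pg_last_xact_replay_timestamp()     AS replication_lag,
    current_setting('recovery_min_apply_delay') AS configured_delay,
    pg_last_wal_receive_lsn()                   AS last_received_lsn,
    pg_last_wal_replay_lsn()                    AS last_replayed_lsn,
    pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                    pg_last_wal_replay_lsn())   AS replay_lag_bytes,
    pg_last_xact_replay_timestamp()             AS last_replay_time,
    pg_get_wal_replay_pause_state()             AS replay_state;
```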
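The parameter changes and promotion can likewise be scripted. This sketch assumes the placeholder parameter group and replica identifiers used earlier (delayed-replica-pg17 and awsblog-demo-replica); the LSN is the one identified during the investigation.

```bash
# Remove the delay and set the recovery target; these are static parameters.
aws rds modify-db-parameter-group \
  --db-parameter-group-name delayed-replica-pg17 \
  --parameters \
    "ParameterName=recovery_min_apply_delay,ParameterValue=0,ApplyMethod=pending-reboot" \
    "ParameterName=recovery_target_lsn,ParameterValue=1/1C000080,ApplyMethod=pending-reboot" \
    "ParameterName=recovery_target_inclusive,ParameterValue=true,ApplyMethod=pending-reboot"

# Reboot the replica so the static parameters take effect.
aws rds reboot-db-instance --db-instance-identifier awsblog-demo-replica

# Resume WAL replay; the replica rolls forward to the target LSN and pauses there.
psql -h <replica-endpoint> -U postgres -d postgres -c "SELECT pg_wal_replay_resume();"

# After verifying the data, promote the replica to a standalone primary.
aws rds promote-read-replica --db-instance-identifier awsblog-demo-replica
```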
Recovering applications from production outage
After the RDS for PostgreSQL read replica database instance is promoted, we verify it is active, healthy, and accepting new connections. We can now rename the RDS for PostgreSQL source instance to awsblog-demo-source-old and rename the newly promoted database instance to awsblog-demo-source, so that application traffic can be routed to the newly promoted instance.
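A sketch of that swap with the AWS CLI follows; awsblog-demo-replica is a placeholder for the promoted instance’s identifier.

```bash
# Rename the old source instance out of the way.
aws rds modify-db-instance \
  --db-instance-identifier awsblog-demo-source \
  --new-db-instance-identifier awsblog-demo-source-old \
  --apply-immediately

# Give the promoted instance the original identifier so application endpoints resolve to it.
aws rds modify-db-instance \
  --db-instance-identifier awsblog-demo-replica \
  --new-db-instance-identifier awsblog-demo-source \
  --apply-immediately
```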
Best practices
When implementing delayed replication in Amazon RDS for PostgreSQL, it’s important to follow best practices for optimal performance and to prevent storage-related issues:
Storage management and monitoring
- Set up comprehensive monitoring by configuring Amazon CloudWatch alarms to track FreeStorageSpace on your source and the delayed replica instance (see the sketch following this list).
- Enable storage auto-scaling on the source as well as the delayed replica instance to accommodate WAL log accumulation, which is particularly important when using delayed replication.
- Consider configuring the max_slot_wal_keep_size parameter to automatically rotate WAL logs, helping prevent storage-full conditions. This configuration safely manages WAL data without breaking replication – if streaming is interrupted and WAL is rotated on the source, RDS for PostgreSQL switches to recovery mode using archived WAL data from Amazon S3, then automatically re-establishes streaming replication once complete.
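For example, a storage alarm and a WAL retention cap might look like the following sketch; the thresholds, instance identifier, SNS topic ARN, and parameter group name are placeholders.

```bash
# Alarm when free storage on the delayed replica drops below ~10 GiB.
aws cloudwatch put-metric-alarm \
  --alarm-name delayed-replica-low-storage \
  --namespace AWS/RDS \
  --metric-name FreeStorageSpace \
  --dimensions Name=DBInstanceIdentifier,Value=awsblog-demo-replica \
  --statistic Average --period 300 --evaluation-periods 1 \
  --threshold 10737418240 --comparison-operator LessThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:dba-alerts

# Cap retained WAL on the source at ~50 GB (max_slot_wal_keep_size is in megabytes).
aws rds modify-db-parameter-group \
  --db-parameter-group-name source-instance-pg17 \
  --parameters "ParameterName=max_slot_wal_keep_size,ParameterValue=51200,ApplyMethod=immediate"
```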
Recovery management
- If your source instance or the delayed read replica is using too much storage because WAL logs are piling up, you can manually advance the replica to catch up and free space using Amazon RDS for PostgreSQL’s built-in recovery controls.
- Recovery target parameters:
  - recovery_target_time – Stop at a specific date/time
  - recovery_target_lsn – Stop at a specific log position
  - recovery_target_name – Stop at a named restore point
  - recovery_target_xid – Stop at a specific transaction
  - recovery_target – General recovery target setting
  - recovery_target_inclusive – Include or exclude the target point
- WAL replay control functions:
  - pg_wal_replay_pause() – Stop processing new changes
  - pg_wal_replay_resume() – Start processing changes again
- Using these parameters and functions, pause the replica, set your target point, then resume (see the sketch after this list). The replica will catch up to that point and stop, reducing stored WAL logs while keeping your delayed replica functional for recovery purposes. For more details on these database parameters and functions, refer to managing-rpg-delayed-replication.
- Regularly review your delayed replica’s replication status to monitor the lag and storage consumption, and adjust the delay interval based on your disaster recovery requirements and storage constraints.
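If you need to reclaim space without a specific LSN, a time-based catch-up is one option. The following sketch (placeholder identifiers and timestamp) pauses replay, sets recovery_target_time, and resumes so the replica advances only to that point.

```bash
# Pause replay on the delayed replica while choosing a safe target.
psql -h <replica-endpoint> -U postgres -c "SELECT pg_wal_replay_pause();"

# Set a recovery target time instead of an LSN (static parameter, reboot required).
aws rds modify-db-parameter-group \
  --db-parameter-group-name delayed-replica-pg17 \
  --parameters "ParameterName=recovery_target_time,ParameterValue=2025-08-05 20:00:00 UTC,ApplyMethod=pending-reboot"
aws rds reboot-db-instance --db-instance-identifier awsblog-demo-replica

# Resume replay; the replica applies WAL up to the target time and then stops.
psql -h <replica-endpoint> -U postgres -c "SELECT pg_wal_replay_resume();"
```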
Summary
In this post, we showed you how delayed replication in Amazon RDS for PostgreSQL can help you protect your database from data corruption and human errors. While implementing delayed replication requires careful planning and ongoing management, the benefits of having a real-time recovery option far outweigh the operational overhead.