Why do I have replication lags and errors in my RDS for PostgreSQL DB instance?
Last updated: 2022-06-22
I am getting replication errors and lags in my Amazon Relational Database Service (Amazon RDS) for PostgreSQL instance.
You can scale reads for your Amazon RDS for PostgreSQL DB instance by adding read replicas to the instance. RDS for PostgreSQL uses PostgreSQL native streaming replication to create a read-only copy of a source DB instance. This read replica DB instance is an asynchronously created physical replica of the source DB instance. This means that sometimes the replica can't keep up with the primary DB instance. As a result, replication lag can occur. The replica DB is created by a special connection that transmits the write ahead log (WAL) data from the source DB instance to the read replica. Therefore, the read replica checks the WAL logs to replicate the changes done on the primary instance. When the read replica can't find the WAL on the primary instance, the read replica is recovered from the archived WAL data in Amazon Simple Storage Service (Amazon S3). For more information, see How streaming replication works for different RDS for PostgreSQL versions.
You can monitor replication lag in Amazon CloudWatch by viewing the Amazon RDS ReplicaLag metric. This metric shows how far a read replica has fallen behind its source DB instance. Amazon RDS monitors the replication status of your read replica. Then, it updates the Replication State field in the Amazon RDS console to Error if replication stops for any reason. The ReplicaLag metric indicates how well a read replica is keeping up with the source DB instance and the amount of latency between the source DB instance and a specific read instance.
You might see one of the following errors in the RDS for PostgreSQL error logs when replica lag increases:
- Streaming replication has stopped: You get this error when streaming replication between the primary and replica instances breaks down. In this case, replication starts replaying from archive location in Amazon S3, leading to further increase of replica lag.
- Streaming replication has been terminated: You get this error when replication is stopped for more than 30 consecutive days, either manually or due to a replication error. In this case, Amazon RDS terminates replication between the primary DB instance and all read replicas to prevent increased storage requirements on the primary instance and longer failover times.
The read replica instance is available even after replication is terminated. However, replication can't be resumed because the transaction logs required by the read replica are deleted from the primary instance after replication is terminated.
The most common reasons for increase in the replica lag are the following:
- Configuration differences between the primary and replica instances
- Heavy write workload on the primary instance
- Transactions that are running for a long time
- Exclusive lock on primary instance tables
- Corrupted or missing WAL file
- Network issues
- Incorrect parameter setting
- No transactions
Configuration differences between primary instance and read replica
Incorrect replica instance configurations can affect replication performance. Read replica handles a write workload that's similar to that of the source instance along with additional read queries. Therefore, use replicas of the same or higher instance class and storage type as the source instance. Because the replica must replay the same write activity as the source instance, the use of a lower-instance class replica can cause high latency for the replica and increase replication lag. Mismatched storage configurations increase the replication lag as well.
Heavy write workload on the primary instance
A heavy write workload on the primary instance might create a high influx of WAL files. An increase in the number of WAL files and replaying of these files on read replicas might slow down the overall replication performance. Therefore, when you see an increase in replica lag, be sure to check the write activity on the primary instance. You can use CloudWatch metrics or Enhanced Monitoring to analyze this workload. View values for TransactionLogsDiskUsage, TransactionLogsGeneration, WriteIOPS, WriteThroughput, and WriteLatency to find if the source instance is under heavy write workload. You can also check for bottlenecks at the throughput level. Each instance type has its dedicated throughput. For more information, see Hardware specifications for DB instance classes.
To avoid this issue, control and distribute write activity for the source instance. Instead of performing many write activities together, break your task into smaller task bundles, and then distribute these bundles evenly across multiple transactions. You can use CloudWatch alerts for metrics, such as Writelatency and WriteIOPS, to be notified of heavy writes on the source instance.
Transactions that are running for a long time
Active transactions that are running for a long time in the database might interfere with the WAL replication process, thereby increasing the replication lag. Therefore, be sure to monitor the runtime of active transactions with the PostgreSQL pg_stat_activity view.
Run a query similar to the following to find the process ID (PID) of the query that's running for a long time:
SELECT datname, pid,usename, client_addr, backend_start, xact_start, current_timestamp - xact_start AS xact_runtime, state, backend_xmin FROM pg_stat_activity WHERE state='active';
After identifying the PID of the query, you can choose to end the query.
Run a query similar to the following to terminate the query:
You can also choose to rewrite or tune the query to avoid transactions that are running for a long time.
Exclusive lock on primary instance tables
When you run commands, such as DROP TABLE, TRUNCATE, REINDEX, VACUUM FULL, REFRESH MATERIALIZED VIEW (without CONCURRENTLY), on the primary instance, PostgreSQL processes an Access Exclusive lock. This lock prevents all other transactions from accessing the table for the lock’s hold duration. Usually, the table remains locked until the transaction ends. This lock activity is recorded in WAL, and is then replayed and held by the read replica. The longer the table remains under an Access Exclusive lock, the longer the replication lag.
To avoid this issue, it's a best practice to monitor the transactions by periodically querying the pg_locks and pg_stat_activity catalog tables.
SELECT pid, usename, pg_blocking_pids(pid) AS blocked_by, QUERY AS blocked_query<br>FROM pg_stat_activity<br>WHERE cardinality(pg_blocking_pids(pid)) > 0;
Corrupted or missing WAL file
A corrupted or missing WAL file can result in replica lag. In this case, you see an error in PostgreSQL logs stating that the WAL can't be opened. You might also see the error "requested WAL segment XXX has already been removed".
A network interruption between the primary and replica instances might cause issues with streaming replication that might result in an increased replica lag.
Incorrect parameter setting
Incorrectly setting some of the custom parameters in the server configuration parameter group might cause an increased replica lag. The following are some of the parameters that you must set correctly:
- wal_keep_segments: This parameter specifies the number of WAL files that the primary instance keeps in the pg_wal directory. The default value for this parameter is set to 32. If this parameter isn't set to a value that's high enough for your deployment, the read replica might fall behind, causing the streaming replication to stop. In this case, RDS generates a replication error and begins recovery on the read replica by replaying the primary instance's archived WAL data from S3. This recovery process continues until the read replica can continue streaming replication.
Note: In PostgreSQL version 13, the wal_keep_segments parameter is named wal_keep_size. This parameter serves the same purpose as wal_keep_segments. However, the default value for this parameter is defined in MB (2048 MB) rather than the number of files.
- max_wal_senders: This parameter specifies the maximum number of connections that the primary instance can support at the same time over the streaming replication protocol. The default value for this parameter for RDS for PostgreSQL 13 and higher releases is 20. This parameter should be set to a value that's slightly higher than the actual number of read replicas. If this parameter is set to a value that's less than the number of read replicas, then replication stops.
- hot_standby_feedback: This parameter specifies whether the replica instance sends feedback to the primary instance about queries that are currently running in the replica instance. By turning on this parameter, you curate the following error message at the source instance and postpone the VACUUM operation on related tables, unless the read query in the replica instance is completed. Therefore, a replica instance that has hot_standby_feedback turned on can serve long-running queries. However, this parameter can bloat tables at the source instance. Be sure to monitor long-running queries in replica instances to avoid serious issues such as out-of-storage and Transaction ID Wraparound in the primary instance.
ERROR: canceling statement due to conflict with recovery Detail: User query might have needed to see row versions that must be removed
- max_standby_streaming_delay/max_standby_archive_delay: You can enable parameters, such as max_standby_archive_delay or max_standby_streaming_delay, on the replica instance for completing long-running read queries. These parameters pause WAL replay in the replica if the source data is modified when read queries are running on the replica. A value of -1 for these parameters lets the WAL replay wait until the read query completes. However, this pause increases the replication lag indefinitely and causes high storage consumption at the source due to WAL accumulation.
If no user transactions are occurring on the source DB instance, then the PostgreSQL read replica reports a replication lag of up to five minutes.
PostgreSQL documentation for Server configuration