Why did my Amazon Aurora read replica fall behind and restart?

Last updated: 2020-06-22

I am running a production Amazon Aurora DB cluster and my reader node has restarted with the error message "Read Replica has fallen behind the master too much. Restarting MySQL" or "Read Replica has fallen behind the master too much. Restarting postgres". Why did my reader restart?

Short description

AuroraReplicaLag measures the lag, in milliseconds, when replicating updates from the primary Aurora DB instance to the reader nodes in an Aurora DB cluster. Aurora Replicas connect to the same storage volume as the primary DB instance and support read operations only. You can measure the lag between the primary and reader nodes by using the AuroraReplicaLag metric in Amazon CloudWatch.

For an Aurora PostgreSQL DB cluster, the AuroraReplicaLag metric indicates the lag for the page cache of the Aurora Replica compared to that of the primary DB instance. This means that after the data is written to the cluster volume, it is accessible to both the writer and reader nodes in near real-time.

To make sure that your changes are visible on the reader nodes, the cached data must be invalidated for read consistency. In some cases, there can be a delay in propagating changes across the reader nodes. You can observe this as an increase in the AuroraReplicaLag metric in CloudWatch, which can eventually lead to a restart.
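
To track this metric programmatically, you can pull AuroraReplicaLag from CloudWatch. The following is a minimal sketch using boto3's GetMetricStatistics API; the helper function name and the instance identifier are hypothetical, and you need AWS credentials configured to send the request:

```python
from datetime import datetime, timedelta, timezone

def aurora_replica_lag_request(instance_id, minutes=60, period=60):
    """Build a CloudWatch GetMetricStatistics request for the
    AuroraReplicaLag metric of one reader node (values are in ms).
    The helper name is illustrative, not part of any AWS SDK."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/RDS",
        "MetricName": "AuroraReplicaLag",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,  # seconds per datapoint
        "Statistics": ["Average", "Maximum"],
    }

# With credentials configured, send the request with boto3:
# import boto3
# cw = boto3.client("cloudwatch")
# stats = cw.get_metric_statistics(**aurora_replica_lag_request("reader-node-02"))
```

Watching the Maximum statistic alongside the Average helps catch short lag spikes that an average over a long period would smooth out.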

For Aurora MySQL, you can measure near real-time metadata regarding AuroraReplicaLag through the INFORMATION_SCHEMA.REPLICA_HOST_STATUS table as follows:
mysql> select server_id as Instance_Identifier,
       if(session_id = 'MASTER_SESSION_ID', 'writer', 'reader') as Role,
       replica_lag_in_milliseconds as AuroraReplicaLag
       from information_schema.replica_host_status;

+---------------------+--------+-------------------+
| Instance_Identifier | Role   | AuroraReplicaLag  |
+---------------------+--------+-------------------+
| primary-instance    | writer |                 0 |
| reader-node-02      | reader | 5.150000095367432 |
| reader-node-03      | reader | 5.033999919891357 |
+---------------------+--------+-------------------+

For Aurora PostgreSQL, call the aurora_replica_status() function and filter the results as follows:

postgres=> select server_id,
       case when session_id = 'MASTER_SESSION_ID' then 'Writer' else 'Reader' end as Role,
       cur_replay_latency_in_usec / 1000 as AuroraReplicaLag
       from aurora_replica_status();

  server_id   |  role  | aurorareplicalag
--------------+--------+------------------
 read-node-01 | Reader |               71
 read-node-02 | Reader |               79
 primary-node | Writer |


Be sure that all instances in the cluster have the same specification

If a reader node has a weaker DB instance class than the writer DB instance, the volume of changes can be too large for the reader to invalidate cached data and keep up. In this scenario, it is a best practice to use the same DB instance class for all DB instances in the Aurora cluster.

Monitor sessions using metrics and Enhanced Monitoring

A reader node can experience lag when multiple sessions are running at the same time. An Aurora Replica can be slow to apply the changes coming from the writer when there aren't enough resources available. You can visualize this using metrics such as CPUUtilization, DBConnections, NetworkReceiveThroughput, and ActiveTransactions.

For Aurora PostgreSQL, use the ReadIOPS metric instead of ActiveTransactions. You can enable Enhanced Monitoring at a granularity of 1 or 5 seconds to understand the resource utilization of the reader node. You can also use Performance Insights to visualize the workload.
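
Enhanced Monitoring is enabled per DB instance through the RDS ModifyDBInstance API. A minimal sketch of the request, assuming a hypothetical instance identifier and an existing IAM monitoring role (the role ARN shown is a placeholder):

```python
def enable_enhanced_monitoring(instance_id, monitoring_role_arn, interval=1):
    """Build an RDS ModifyDBInstance request that turns on Enhanced
    Monitoring. Valid intervals include 1 and 5 seconds; the helper
    name and identifiers are illustrative."""
    return {
        "DBInstanceIdentifier": instance_id,
        "MonitoringInterval": interval,          # seconds between OS metric samples
        "MonitoringRoleArn": monitoring_role_arn,  # IAM role that writes to CloudWatch Logs
        "ApplyImmediately": True,
    }

# With credentials configured:
# import boto3
# rds = boto3.client("rds")
# rds.modify_db_instance(**enable_enhanced_monitoring(
#     "reader-node-02", "arn:aws:iam::123456789012:role/emaccess"))
```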

Visualize write activity using Amazon CloudWatch

A sudden surge of write activity in an already write-heavy production cluster can cause an overload on the writer DB instance. The added stress caused by the surge can cause the reader nodes to fall behind. You can visualize this by using Amazon CloudWatch, where you observe sudden bursts of the DMLThroughput, DDLThroughput, and Queries metrics.

For Aurora PostgreSQL, you can visualize this using the WriteThroughput metric.

Investigate increasingly high History List Length (HLL) (Aurora MySQL)

MySQL's InnoDB engine uses multi-version concurrency control (MVCC) by default. This means that InnoDB must keep track of all changes to every affected row for the duration of a transaction. After long-running transactions complete, a spike in purge thread activity begins. Because of the backlog created by long-running transactions, the sudden purge can cause an Aurora Replica to fall behind.

As of Aurora MySQL versions 1.19.2 and 2.06, Aurora includes the metric RollbackSegmentHistoryListLength, which you can use in CloudWatch to visualize this purge backlog. It is also visible from SHOW ENGINE INNODB STATUS, or by querying the information schema as follows:

mysql> select NAME as RollbackSegmentHistoryListLength, COUNT
       from INFORMATION_SCHEMA.INNODB_METRICS
       where NAME = 'trx_rseg_history_len';

+----------------------------------+-------+
| RollbackSegmentHistoryListLength | COUNT |
+----------------------------------+-------+
| trx_rseg_history_len             |   358 |
+----------------------------------+-------+
1 row in set (0.00 sec)

Set up CloudWatch alarms to monitor this metric so that it doesn't reach an excessively high value. It is a best practice in Aurora to use many concurrent, short-lived transactions, so avoid long-running transactions.
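
An alarm on this metric can be created with CloudWatch's PutMetricAlarm API. The following sketch builds such a request; the threshold shown is an illustrative placeholder, not an AWS recommendation, and the helper name and instance identifier are hypothetical:

```python
def hll_alarm_request(instance_id, threshold=1_000_000):
    """Build a CloudWatch PutMetricAlarm request that fires when the
    rollback segment history list length stays high. The threshold is
    a placeholder; tune it to your workload."""
    return {
        "AlarmName": f"{instance_id}-high-hll",
        "Namespace": "AWS/RDS",
        "MetricName": "RollbackSegmentHistoryListLength",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # evaluate 5-minute averages
        "EvaluationPeriods": 3,   # require 15 minutes of sustained growth
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
    }

# With credentials configured:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**hll_alarm_request("primary-instance"))
```

Requiring several evaluation periods avoids alerting on the short, expected HLL spikes that follow a normal burst of write activity.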

Resolve transient network issue

Although rare, a transient network communication failure can occur between the writer and reader nodes, or between the nodes and the Aurora storage layer. Reader nodes can fall behind or restart due to a brief interruption in networking. The Aurora Replica can also fall behind due to network saturation caused by a large volume of changes, or brief moments of inter-AZ packet loss. Consider the size of the DB instance, as well as its networking capabilities, to avoid this issue.

Important notes to consider

There are some notable features within Aurora MySQL 1.x starting from 1.17.4 that you should be aware of:

  • aurora_enable_replica_log_compression is a cluster parameter that enables compression of replication payloads to improve network bandwidth utilization between the master and Aurora Replicas. This is useful when network saturation is evident.

  • aurora_enable_zdr is a cluster parameter that you can use to preserve open connections across an Aurora Replica restart.

  • aurora_enable_staggered_replica_restart is a cluster parameter that allows Aurora Replicas to follow a staggered restart schedule to increase cluster availability.
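
These cluster parameters are set on a custom DB cluster parameter group through the RDS ModifyDBClusterParameterGroup API. A minimal sketch for enabling replica log compression; the helper name and the parameter group name are hypothetical, and the immediate apply method assumes the parameter is dynamic:

```python
def enable_replica_log_compression(parameter_group):
    """Build a ModifyDBClusterParameterGroup request that turns on
    aurora_enable_replica_log_compression for a custom cluster
    parameter group (group name is a placeholder)."""
    return {
        "DBClusterParameterGroupName": parameter_group,
        "Parameters": [{
            "ParameterName": "aurora_enable_replica_log_compression",
            "ParameterValue": "1",
            "ApplyMethod": "immediate",  # assumes a dynamic parameter
        }],
    }

# With credentials configured:
# import boto3
# boto3.client("rds").modify_db_cluster_parameter_group(
#     **enable_replica_log_compression("my-aurora-cluster-params"))
```

Note that you cannot modify a default parameter group; attach a custom cluster parameter group to the cluster first.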