Why did my Amazon Aurora read replica fall behind and restart?

Last updated: 2021-02-02

I am running a production Amazon Aurora DB cluster. My reader node has restarted with the error message "Read Replica has fallen behind the master too much. Restarting MySQL" or "Read Replica has fallen behind the master too much. Restarting postgres." Why did my reader restart?

Short description

AuroraReplicaLag is a measure of the lag in milliseconds when replicating updates from the primary Aurora DB instance to the reader nodes in an Aurora DB cluster. Aurora Replicas connect to the same storage volume as the primary DB instance and support read operations only. You can measure the lag between the primary and reader DB instances by using the AuroraReplicaLag metric in Amazon CloudWatch.

For an Aurora PostgreSQL DB cluster, the AuroraReplicaLag metric indicates the lag for the page cache of the Aurora Replica compared to that of the primary DB instance. This means that after the data is written to the cluster volume, it's accessible to both the writer and reader nodes in near real time.

To make sure that changes are visible on the reader nodes, the cached data must be invalidated for read consistency. In some cases, there can be a delay when propagating changes across the reader nodes. This delay appears as an increase in the AuroraReplicaLag metric in CloudWatch and can eventually lead to a restart of the reader.

For Aurora MySQL, you can also view near real time replica lag by querying the INFORMATION_SCHEMA.REPLICA_HOST_STATUS table:

mysql> select server_id AS Instance_Identifier, if(session_id = 'MASTER_SESSION_ID','writer', 'reader') as Role, 
replica_lag_in_milliseconds as AuroraReplicaLag from information_schema.replica_host_status;

+---------------------+--------+-------------------+
| Instance_Identifier | Role   | AuroraReplicaLag  |
+---------------------+--------+-------------------+
| primary-instance    | writer |                 0 |
| reader-node-02      | reader | 5.150000095367432 |
| reader-node-03      | reader | 5.033999919891357 |
+---------------------+--------+-------------------+

For Aurora PostgreSQL, call the aurora_replica_status() function and filter the results:

postgres=> select server_id, case when session_id= 'MASTER_SESSION_ID' then 'Writer' else 'Reader' end AS Role, 
replica_lag_in_msec as AuroraReplicaLag from aurora_replica_status();

server_id    | role   | aurorareplicalag
-------------+--------+------------------
read-node-01 | Reader | 19.641
read-node-02 | Reader | 19.752
primary-node | Writer |
(3 rows)

Resolution

Be sure that all instances in the cluster have the same specification

When a reader node has a weaker DB instance class than the writer DB instance, the volume of changes can be too large for the reader to invalidate in its cache and then catch up. To avoid this, it's a best practice for all DB instances in the Aurora cluster to use the same DB instance class.

Monitor sessions using metrics and Enhanced Monitoring

A reader node can experience lag when multiple sessions are running at the same time. An Aurora Replica can be slow to apply the changes coming from the primary because it lacks available resources. You can see this lag in metrics such as CPUUtilization, DBConnections, NetworkReceiveThroughput, and ActiveTransactions.

For Aurora PostgreSQL, use the ReadIOPS metric instead of ActiveTransactions. You can turn on Enhanced Monitoring with a granularity of 5 seconds or 1 second to understand resource use on the reader node. You can also use Performance Insights to visualize the workload.
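
As a complement to these metrics, you can check session activity from inside the database itself. The following queries are a minimal sketch that groups current sessions (Aurora MySQL first, then Aurora PostgreSQL); adapt them to your workload:

mysql> select command, count(*) as sessions from information_schema.processlist group by command order by sessions desc;

postgres=> select state, count(*) as sessions from pg_stat_activity group by state;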

Visualize write activity using Amazon CloudWatch

A sudden surge of write activity in an already write-heavy production cluster can overload the writer DB instance. The added stress from the surge can cause the reader nodes to fall behind. You can see this in Amazon CloudWatch as sudden bursts in the DMLThroughput, DDLThroughput, and Queries metrics.

For Aurora PostgreSQL, you can see this in the WriteThroughput metric.
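
You can also sample write activity from inside the database as a rough cross-check of these CloudWatch metrics. The counters below are cumulative, so compare two samples taken a few minutes apart (a minimal sketch, Aurora MySQL first, then Aurora PostgreSQL):

mysql> show global status where Variable_name in ('Com_insert', 'Com_update', 'Com_delete', 'Queries');

postgres=> select datname, tup_inserted, tup_updated, tup_deleted from pg_stat_database where datname = current_database();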

Investigate increasingly high History List Length (HLL) (Aurora MySQL)

The MySQL InnoDB engine uses multi-version concurrency control (MVCC) by default. This means that the engine must track all changes made to all affected rows for the length of a transaction. After long-running transactions complete, a spike in purge thread activity begins. Due to the volume of backlog created by long-running transactions, the sudden purge can cause an Aurora Replica to fall behind.

As of versions 1.19.2 and 2.06, Aurora MySQL includes the RollbackSegmentHistoryListLength metric. You can use this metric in CloudWatch to monitor the HLL and spot the resulting purge. The HLL is also visible from SHOW ENGINE INNODB STATUS or by querying the information schema as follows:

mysql> select NAME AS RollbackSegmentHistoryListLength, 
COUNT from INFORMATION_SCHEMA.INNODB_METRICS where NAME = 'trx_rseg_history_len';

+----------------------------------+-------+
| RollbackSegmentHistoryListLength | COUNT |
+----------------------------------+-------+
| trx_rseg_history_len             |   358 |
+----------------------------------+-------+
1 row in set (0.00 sec)

Set up CloudWatch alarms to monitor this metric so that it doesn't reach an excessively high number. It's a best practice in Aurora to use short-lived transactions with high concurrency, so avoid running long-running transactions.
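
To find long-running transactions that inflate the HLL, you can query INFORMATION_SCHEMA.INNODB_TRX on the writer. The following query is a minimal example that lists the oldest open transactions:

mysql> select trx_id, trx_started, timestampdiff(second, trx_started, now()) as duration_in_seconds, 
trx_mysql_thread_id, trx_rows_modified from information_schema.innodb_trx order by trx_started limit 5;

Committing or rolling back these transactions allows the purge thread to drain the history list.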

Resolve transient network issue

Although rare, a transient network communication failure can occur between the writer and reader nodes, or between the nodes and the Aurora storage layer. Reader nodes can fall behind or restart due to a brief interruption in networking. The Aurora Replica can also fall behind because of network saturation from a large volume of changes, or brief moments of inter-AZ packet loss. It's a best practice to consider the size of the DB instance, as well as its networking capabilities, to avoid this issue.

Important: Be aware of the following notable features in Aurora MySQL 1.x, starting from version 1.17.4 (a query to check their current values follows the list):

  • aurora_enable_replica_log_compression is a cluster parameter that enables compression of replication payloads to improve network bandwidth use between the primary and Aurora Replicas. This is useful when network saturation is evident.

  • aurora_enable_zdr is a cluster parameter that you can use to preserve open connections when an Aurora Replica restarts.

  • aurora_enable_staggered_replica_restart is a cluster parameter that allows Aurora Replicas to follow a staggered restart schedule to increase cluster availability.
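
These parameters are set in the DB cluster parameter group. To confirm the values currently in effect from a client session, you can try a wildcard lookup like the one below, assuming your engine version exposes the parameters as global variables (otherwise, check the parameter group in the console or with the AWS CLI):

mysql> show global variables like 'aurora_enable%';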