Why did my Amazon Aurora read replica fall behind and restart?

Last updated: 2022-07-10

I am running a production Amazon Aurora DB cluster. My reader instance has restarted with the error message "Read Replica has fallen behind the master too much. Restarting MySQL" or "Read Replica has fallen behind the master too much. Restarting postgres." Why did my reader restart?

Short description

AuroraReplicaLag is a measure of lag in milliseconds when replicating changes from the writer Aurora DB instance to the reader instances in an Aurora DB cluster. Aurora Replicas connect to the same storage volume as the writer instance. You can measure the lag between the Writer and Reader instances by using the AuroraReplicaLag metric in Amazon CloudWatch.

For an Aurora DB cluster, the AuroraReplicaLag metric indicates the lag of the data cache (buffer pool or page cache) of an Aurora Replica compared to that of the writer DB instance. Changes are written synchronously to the shared storage volume, but replicated asynchronously to the reader instances, where any cached data related to the change is invalidated to maintain read consistency. In some cases, there can be a delay in propagating changes to the reader instances. This delay appears as an increase in the AuroraReplicaLag metric in CloudWatch and, if it grows too large, can cause the reader to restart.

You can also view near real-time AuroraReplicaLag metadata directly from the database:

For Amazon Aurora MySQL-Compatible Edition, select from the INFORMATION_SCHEMA.REPLICA_HOST_STATUS table:

mysql> select server_id AS Instance_Identifier, if(session_id = 'MASTER_SESSION_ID','writer', 'reader') as Role, 
replica_lag_in_milliseconds as AuroraReplicaLag from information_schema.replica_host_status;

+---------------------+--------+-------------------+
| Instance_Identifier | Role   | AuroraReplicaLag  |
+---------------------+--------+-------------------+
| myamscluster-aza-1  | writer |                 0 |
| myamscluster-azb-1  | reader | 5.150000095367432 |
| myamscluster-aza-2  | reader | 5.033999919891357 |
+---------------------+--------+-------------------+

For Amazon Aurora PostgreSQL-Compatible Edition, call the aurora_replica_status() function and filter the results:

postgres=> select server_id, case when session_id= 'MASTER_SESSION_ID' then 'Writer' else 'Reader' end AS Role, 
replica_lag_in_msec as AuroraReplicaLag from aurora_replica_status();

server_id          | role   | aurorareplicalag
-------------------+--------+------------------
myapgcluster-aza-1 | Reader | 19.641
myapgcluster-azb-1 | Reader | 19.752
myapgcluster-aza-2 | Writer |
(3 rows)

Note: The resolution in this article applies to single-Region Amazon Aurora DB clusters, not to multi-Region Aurora Global Database clusters.

Resolution

Be sure that all instances in the cluster have the same specification

When a reader instance has a weaker DB instance class than the writer DB instance, the volume of changes can be too large for the reader to invalidate cached data and then catch up. In this scenario, it's a best practice to use the same DB instance class for all DB instances in the Aurora cluster.

Monitor sessions using metrics and Enhanced Monitoring

A reader instance can experience lag when multiple sessions are running at the same time. An Aurora Replica can also be slow to apply the changes coming from the writer when there aren't enough resources available. You can see this lag in metrics such as CPUUtilization, DBConnections, and NetworkReceiveThroughput. You can also turn on Enhanced Monitoring with a granularity of 1 or 5 seconds to understand resource usage on the reader, and use Performance Insights to visualize its workload. Additionally, for Aurora PostgreSQL-Compatible only, use the ReadIOPS metric.
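As a quick check of what those sessions are doing, you can also query the engine directly on the reader. The following queries use standard MySQL and PostgreSQL system views and are illustrative examples, not Aurora-specific features:

For Aurora MySQL-Compatible:

mysql> select id, user, time as seconds_running, state, info as current_query 
from information_schema.processlist where command <> 'Sleep' order by time desc;

For Aurora PostgreSQL-Compatible:

postgres=> select pid, usename, state, now() - query_start as runtime, query 
from pg_stat_activity where state <> 'idle' order by runtime desc;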

Visualize write activity using Amazon CloudWatch

A sudden surge of write activity in an already write-heavy production cluster can overload the writer DB instance. The added stress from the surge can cause one or more reader instances to fall behind. In Amazon CloudWatch, this appears as sudden bursts in the DMLThroughput, DDLThroughput, and Queries metrics.

For Aurora PostgreSQL, you can see this in the WriteThroughput metric.

Investigate increasingly high History List Length (HLL) (Aurora MySQL-Compatible)

The MySQL InnoDB engine uses multi-version concurrency control (MVCC) by default. This means that the engine must track all changes to all affected rows for the duration of a transaction. After a long-running transaction completes, a spike in purge thread activity begins. Because of the volume of backlog created by the long-running transaction, the sudden purge can cause an Aurora Replica to fall behind.

On versions 1.19.2 and 2.06 and higher, Aurora MySQL includes the RollbackSegmentHistoryListLength metric. You can use this metric in CloudWatch to observe the purge backlog. It's also visible in the output of SHOW ENGINE INNODB STATUS, or by querying the information schema as follows:

mysql> select NAME AS RollbackSegmentHistoryListLength, 
COUNT from INFORMATION_SCHEMA.INNODB_METRICS where NAME = 'trx_rseg_history_len';

+----------------------------------+-------+
| RollbackSegmentHistoryListLength | COUNT |
+----------------------------------+-------+
| trx_rseg_history_len             |   358 |
+----------------------------------+-------+
1 row in set (0.00 sec)

Set up CloudWatch alarms on this metric so that you're notified before it reaches an excessively high value. It's a best practice in relational databases to avoid long-running transactions.
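To identify the long-running transactions that cause the history list to grow, you can query InnoDB's transaction view. This is a standard InnoDB query, shown here as an illustrative example:

mysql> select trx_id, trx_mysql_thread_id, trx_started, 
timestampdiff(second, trx_started, now()) as seconds_running 
from information_schema.innodb_trx order by trx_started;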

Account for brief network interruptions

Although rare, transient network communication failures can occur between the writer and reader instances, or between an instance and the Aurora storage layer. Reader instances can fall behind or restart because of a brief network interruption. An Aurora Replica can also fall behind when a large volume of changes saturates the instance's network bandwidth. It's a best practice to consider the size of the DB instance, and therefore its network capabilities, to avoid this issue.