How do I find the cause of replica lag when using Amazon Relational Database Service (Amazon RDS) for MySQL?

Because Amazon RDS for MySQL uses asynchronous replication, sometimes the replica isn't able to keep up with the primary instance.

For Amazon RDS for MySQL read replicas, you can monitor replication lag in Amazon CloudWatch by viewing the Amazon RDS ReplicaLag metric. The ReplicaLag metric reports the value of the Seconds_Behind_Master field of the SHOW SLAVE STATUS command.

The Seconds_Behind_Master shows the difference between the current timestamp on the replica instance and the original timestamp that was logged on the primary instance for the event that's being processed on the replica instance.

MySQL replication works with three threads: the Binlog Dump thread, the IO_THREAD, and the SQL_THREAD. For more information about how these threads function, see the MySQL Documentation for Replication Implementation Details. If there is a delay in replication, first identify whether the lag is caused by the replica IO_THREAD or the replica SQL_THREAD. Then, you can identify the root cause of the lag.

To identify which replication thread is lagging, see the following examples:

1.    First, run the MASTER STATUS command on the primary instance, and review the output:

mysql> SHOW MASTER STATUS;
+----------------------------+----------+--------------+------------------+-------------------+
| File                       | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+----------------------------+----------+--------------+------------------+-------------------+
| mysql-bin-changelog.066552|      521 |              |                  |                   |
+----------------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)

Note: In the example output, the master DB instance is writing the binary logs to the file mysql-bin.066552.

2.    Run the SHOW SLAVE STATUS command on the replica instance and review the output:

Example output A:

mysql> SHOW SLAVE STATUS\G;
*************************** 1. row ***************************
Master_Log_File: mysql-bin.066548
Read_Master_Log_Pos: 10050480
Relay_Master_Log_File: mysql-bin.066548
Exec_Master_Log_Pos: 10050300
Slave_IO_Running: Yes
Slave_SQL_Running: Yes

In example output A, the Master_Log_File: mysql-bin.066548 indicates that the replica IO_THREAD is reading from the binary log file mysql-bin.066548. This is because the primary DB instance is writing the binary logs to the file mysql-bin.066552. This output shows that the replica IO_THREAD is behind by 4 binlogs. However, the Relay_Master_Log_File is mysql-bin.066548, which indicates that the replica SQL_THREAD is reading from same file as the IO_THREAD. This means that the replica SQL_THREAD is keeping up, but the replica IO_THREAD is lagging behind.

Example output B:

mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Master_Log_File: mysql-bin.066552
Read_Master_Log_Pos: 430
Relay_Master_Log_File: mysql-bin.066530
Exec_Master_Log_Pos: 50360
Slave_IO_Running: Yes
Slave_SQL_Running: Yes

In example output B, the IO_THREAD is keeping up with the primary instance. The output shows that the Master log File is mysql-bin-changelog.066552 (the file parameter from the master status). And the replica IO_THREAD is Master_Log_File: mysql-bin-changelog.066552. In the replica output, the SQL thread is performing Relay_Master_Log_File: mysql-bin-changelog.066530. This means that SQL_THREAD is lagging behind by 12 binary logs.

Normally, IO_THREAD doesn't cause large replication delays, because the IO_THREAD reads the binary logs only from the master. However, network connectivity and network latency can affect the speed of the reads between the servers. The replica IO_THREAD could be slow due to high bandwidth usage.

If the replica SQL_THREAD is the source of replication delays, those delays could be due to the following reasons:

  • Long-running queries on the primary instance
  • Insufficient instance class size or storage
  • Parallel queries executed on the primary instance
  • Binary logs synced to the disk on the replica instance
  • Binlog_format on the replica is set to ROW
  • Replica creation lag

Long-running queries on the primary instance

Long-running queries on the primary instance that take an equal amount of time to run on the replica instance can increase the seconds_behind_master. For example, if a change executed on the primary instance takes one hour to execute, by the time that change starts running on the replica, the lag is one hour. Because the change might also take one hour to complete on the replica, by the time the change is complete, the total lag is approximately two hours. This delay is expected, but you can minimize this lag by monitoring the slow query log on the master. You can also identify long-running statements to reduce lag. Then, break long-running statements into smaller statements or transactions. For more information, see Accessing the MySQL Slow Query and General Logs.

Insufficient instance class size or storage

If the replica instance class or storage configuration is smaller than the primary, then the replica is throttled and unable to keep up with the changes made on the primary. Be sure that the instance type of the replica is the same or higher than the primary instance. For replication to operate effectively, each read replica requires the same amount of compute and storage resources as the source DB instance. For more information, see DB Instance Class.

Parallel queries executed on the primary instance

Queries that are executed in parallel on the primary are committed in a serial order on the replica, because the MySQL replication is single threaded (SQL_THREAD). If a high volume of writes to the source DB instance can occur in parallel, the writes to the read replica are serialized using a single SQL_THREAD. This can cause a lag between the source instance and read replica.

Multi-threaded (parallel) replication is available for MySQL 5.6, MySQL 5.7, and higher versions. For more information about multi-threaded replication, see the MySQL Documentation for Binary Logging Options and Variables.

Multi-threaded replication can cause gaps in replication. For example, multi-threaded replication isn't recommended when skipping the replication errors, because it's difficult to identify which transactions are skipped. This can lead to gaps in data consistency between the primary and replica instances.

Binary logs synced to the disk on the replica instance

If automatic backups are enabled on the replica, then additional effort might be required to sync the binary logs to the disk on the replica. The default value of the parameter sync_binlog is set to 1. If you change this value to 0, the synchronization of the binary log to the disk by the MySQL server is disabled. Instead of logging to the disk, the operating system (OS) occasionally flushes the binary logs to disk.

Disabling the binary log synchronization can reduce the performance overhead required to sync the binary logs to disk on every commit. But if there is a power failure or the OS crashes, some of the commits might not be synchronized to the binary logs. This can affect point in time restore (PITR) capabilities. For more information, see the MySQL Documentation for sync_binlog.

Binlog_format on the replica is set to ROW

If the binlog_format on the replica is set to ROW and the table on which the updates are made is missing the primary key, then by default the slave-rows-search-algorithms=TABLE_SCAN,INDEX_SCAN is executed on every row modified on the primary. And this parameter also performs a full table scan on the replica. For more information about the slave-rows-search-algorithms parameter, see the MySQL Documentation for slave-rows-search-algorithms.

Replica creation lag

Amazon RDS creates a read replica of a MySQL master by taking a DB snapshot of the master. Then, Amazon RDS restores the snapshot to create a new DB instance (replica) and establishes replication between the two.

New read replicas take time to be created. When the replication is established, there is a lag for the duration of the time that it takes to create a backup of the primary. To minimize this lag, create a manual backup before calling for the replica creation. Then, the snapshot taken by the replica creation process is an incremental backup, which is faster.

When a read replica is restored from a snapshot, the replica won't wait for all the data to be transferred from Amazon Simple Storage Service (Amazon S3) to the Amazon Elastic Block Store (Amazon EBS) volume that's associated with the replica instance. The replica instance is available to perform the DB operations, and the new volume created from existing EBS snapshot loads in the background (lazy loading). For Amazon RDS for MySQL replicas (EBS-based volumes), the replica lag might increase initially, because the loading effect might compromise the replication performance. For more information, see Initializing Amazon EBS Volumes.


Did this page help you? Yes | No

Back to the AWS Support Knowledge Center

Need help? Visit the AWS Support Center

Published: 2019-02-26