How do I monitor and troubleshoot replication lags when using the read replica for an Amazon RDS for Oracle instance?

Last updated: 2021-11-18

I want to monitor and troubleshoot replication lags when I use the read replica for my Amazon Relational Database Service (Amazon RDS) for Oracle DB instance.

Short description

You can configure the read replica for your RDS for Oracle instance using the Oracle Active Data Guard License. When you create a read replica for your instance, the primary and read replica instances operate in the Data Guard maximum performance mode. For more information, see Oracle documentation for Oracle Data Guard protection modes.

You can use the read replica to offload the read operations from the primary instance. You can also promote the read replica as a standalone DB instance in the event of disaster recovery. For more information, see Promoting a read replica to be a standalone DB instance.

Before configuring a read replica for your instance, be sure to check the limitations related to version and licensing. Also, review Miscellaneous requirements and restrictions for Oracle replicas.

Resolution

Keep in mind the following when you use the read replica for your RDS for Oracle instance:

  • Amazon RDS is a managed service. Therefore, the replication process is managed by an internal RDS_DATAGUARD user. This user is configured when you create a read replica instance.
  • You can't stop the managed recovery process from the admin user account.
  • You can't configure or modify the Oracle Data Guard Broker for RDS read replica.
  • You can't delay the replication of data from a primary instance to a read replica instance.
  • Modifications done to the option group propagate to the read replica instance. If the read replicas are configured in the same Region, then they are associated with the same option group. For more information, see Option requirements and restrictions for Oracle replicas.

Monitor the replication lag

You might experience replication lag if the replica can't keep up with the primary instance. To monitor the replication lag in Amazon CloudWatch, view the Amazon RDS metric ReplicaLag. This metric shows the replication lag in milliseconds. For more information, see Monitoring read replication and Viewing Amazon RDS metrics and dimensions.

To monitor and troubleshoot the replication lag in Active Data Guard, do the following:

Run the following query to check the Data Guard lag metrics:

sql > SELECT name,value,datum_time,time_computed FROM v$dataguard_stats;

Run the following query to check whether there is an archive log gap between the primary and read replica instances:

sql >  SELECT archived.thread# THREAD,max(archived.sequence#) PRIMARY,max(applied.sequence#) STANDBY,(max(archived.sequence#) - max(applied.sequence#)) GAP FROM v$archived_log archived,v$archived_log applied WHERE archived.thread#=applied.thread# AND applied.applied='YES' GROUP BY archived.thread#;

Note: RDS for Oracle doesn't alert you when there is a replication lag. To send notifications whenever the read replica doesn't keep up with the primary instance, create a CloudWatch alarm for the RDS metric ReplicaLag.

Troubleshoot the replication lag

  • If the primary instance is unavailable, check the current state of the instance using the Amazon RDS Console. Also, check if the instance is accessible from any SQL client. If the primary instance is unavailable (for example: the instance is in incompatible-parameters state), then the read replica can't keep up with the primary instance and falls behind. Check Amazon RDS events to find why the primary instance is unavailable. Then, fix the instance to make sure that it's accessible.
  • If the primary instance is under a high CPU or memory overload, then a high transport lag might result. Monitor the CloudWatch metrics for the instance. Also, use Enhanced Monitoring to identify whether a particular process causes an unusual memory or CPU usage spike. For more information, see Viewing OS metrics in the RDS console. When the CPU/memory load reduces, the replica might keep up with the primary instance. If the overload is as expected on the instance, you can choose to scale up the instance class, IOPS, or throughput as needed.
  • Monitor the alert.log files for the primary instance and read replica to identify alerts or ORA errors that might affect the transport or apply processes. For more information on accessing the alert log for RDS instances, see Oracle database log files. Resolve the identified errors to reduce the lag.
  • When you resize or change the number of redo logs in the primary instance, the changes aren't replicated in the standby instance. To replicate these changes, you must recreate the standby instance. It's a best practice to update the redo log parameters before configuring the read replica. For more information, see Changing your logging configuration.
  • Configuring the read replica instance with less compute or IOPS capacity compared to the primary instance might slow down the application of changes in the read replica, resulting in replication lag. Monitor the RDS metrics using CloudWatch to check for throttling in both instances. If the primary instance doesn't have any throttling and the read replica experiences throttling, identify the resource that's getting throttled. Then, scale up the storage or compute configuration as needed.