Build high availability for Amazon RDS Custom for Oracle using read replicas

A high availability solution for the database stack is an important aspect to consider when migrating or deploying Oracle databases in the AWS Cloud to help ensure that the architecture can meet the service level agreement (SLA) of the application. Unavailability of a critical database can lead to application outage, interruption to business operations, inaccessibility to critical data, revenue loss, and loss of customer trust.

In this post, we discuss how to enable high availability for an Amazon Relational Database Service (Amazon RDS) Custom for Oracle instance using read replicas and best practices to follow during role transition and failover scenarios.

You can achieve high availability (HA) and disaster recovery (DR) for Oracle databases using a physical standby database by creating and maintaining a physical copy of the primary database. The standby instance can be hosted in a location different and far enough from the primary instance, such as a different Region or Availability Zone, to maintain the availability of the database when the primary instance is impacted by unplanned incidents. Oracle Data Guard provides a comprehensive set of services to create, maintain, and monitor one or more standby databases. Data Guard maintains a copy of the data in a standby database that is continuously updated with changes from the primary database. Data Guard validates the changes before they’re applied to the standby database, to help prevent physical corruptions that occur in the storage layer from causing data loss and downtime. Data Guard also helps repair corrupted blocks in the primary database in real time by fetching the “good” version of the corrupted block from an Active Data Guard physical standby database. Additionally, Data Guard can also help improve availability of the database during planned maintenance activities such as database patching using Standby First Patching strategy.

Amazon Relational Database Service (Amazon RDS) Custom is a managed database service for legacy, custom, and packaged applications and any application that requires access to the underlying OS and DB environment. RDS Custom for Oracle provides the flexibility to customize your database, underlying server, and operating system configurations to support applications that require such customizations. RDS Custom for Oracle supports read replicas that are built on Data Guard technology, and you can use these replicas to offload reads from the primary and for disaster recovery. As of this writing, RDS Custom for Oracle doesn’t support the Multi-AZ option, which is available for Amazon RDS for Oracle as an HA solution using storage replication.

Solution overview

With RDS Custom for Oracle, you can create up to five read replicas, which are managed physical standby databases using Data Guard configuration. These read replicas created in mounted mode can be opened in read-only mode and used for reporting purposes. Read replicas in RDS Custom use asynchronous log shipping mode with MaxPerformance Data Guard configuration, and are restricted to the same Region as the primary instance (as of this writing). Because these read replicas are managed by AWS, most of the heavy lifting activities in maintaining a typical Data Guard configuration are automated, including the following tasks:

Creating and maintaining a Data Guard broker configuration
Shipping redo changes from the primary instance to the standby instance (read replica)
Resolving archivelog gaps
Purging archivelogs from the primary and standby when they’re no longer needed

The following diagram illustrates a read replica in RDS Custom for Oracle.

Solution Overview

However, this default configuration of read replicas doesn’t provide a Recovery Point Objective (RPO) of zero because the redo is shipped asynchronously from the primary instance to the replica. Furthermore, the failover operation is manual because there’s no automated failure detection or failover mechanism, and rebuilding the former primary instance after a failover operation needs to be done manually by creating a new read replica.

With shell access to the underlying Amazon Elastic Compute Cloud (Amazon EC2) instance of the RDS Custom for Oracle instance, the Data Guard configuration of the read replica can be customized to meet various customer-defined HA and DR requirements. The following are a few customizations you can make to the default read replica configuration to help you meet your Recovery Time Objective (RTO) and RPO requirements:

Modify log shipping mode from asynchronous to synchronous to help you achieve an RPO of zero
Enable Data Guard protection mode to maximum availability
Configure a Data Guard observer process to enable Fast-Start Failover (FSFO) to facilitate automated failure detection and failover
Enable the flashback database feature on both the primary and standby instances for automated and fast reinstatement of the former primary database after a failover activity

The following diagram illustrates a customized read replica configuration to help you meet your HA requirements.

Customized read replica

In addition to these managed replicas, you can also create self-managed standby databases in various locations, such as a different AWS Region, on-premises, or a different cloud environment, to help you meet your disaster recovery requirements. You can manage those self-managed standby databases and managed read replicas with a single Data Guard configuration.

In this post, we discuss the following topics:

Create a read replica for an existing RDS Custom for Oracle instance
Change Data Guard protection mode
Convert a replica from mounted state to open read-only
Role transition (switchover)
Data Guard failover
Promote the read replica
Configure FSFO
Seamless Oracle Client connectivity
Cross-Region Data Guard configuration

In this post, we refer to different resources such as Database Identifier in AWS Console, Database Unique Name (db_unique_name instance parameter), Database name in Data Guard broker configuration and Net service name in tnsnames.ora to illustrate a sample configuration. The following table serves as a reference to the names of different resources used in this post.

.	Primary	Read Replica
DB Identifier	orcl1	orcl2
DB Unique Name	ORCL_A	ORCL_B
DB Name in Data Guard Broker	orcl_a	orcl_b
Oracle Net Service Name	RDS_CUSTOM_ORCL_A	RDS_CUSTOM_ORCL_B

Create a read replica for an existing RDS Custom for Oracle instance

You can create a read replica for an existing RDS Custom for Oracle instance either from the AWS Management Console or using the AWS Command Line Interface (AWS CLI).

To create a read replica from the Amazon RDS console, complete the following steps:

On the Amazon RDS console, choose Databases in the navigation pane.
Select the database you want to add a read replica for.
On the Actions menu, choose Create read replica. Read replica is a copy of your source database, and automatically receives and applies changes received from source database.
For Replica source, enter your source DB instance.
In this example, we use the existing RDS Custom for Oracle instance orcl1 in the Singapore Region as the source to create our read replica.
For DB instance identifier, enter your DB instance (for this post, orcl2).
Choose the database instance class necessary for the performance requirements of the replica instance.
For Storage type, choose your storage type.
We recommend that you use the same storage type as the source DB instance for the read replica used as a failover target.
Accept the suggested values under Connectivity, and choose a different Availability Zone from the source database to protect from Availability Zone failure.
For IAM instance profile, choose the instance profile for your RDS Custom for Oracle DB instance.
Choose Create read replica to create the read replica orcl2.

Alternatively, you can create a read replica from the AWS CLI, as shown in the following example:

aws rds create-db-instance-read-replica \
--db-instance-identifier orcl4 \
--source-db-instance-identifier orcl1 \
--db-instance-class db.m5.large \
--availability-zone ap-southeast-1c \
--port 1521 \
--vpc-security-group-ids  sg-0534febf02235e865 \
--storage-type gp2 \
--replica-mode mounted \
--custom-iam-instance-profile AWSRDSCustomInstanceProfile-ap-southeast-1-Demo

Choose your database protection mode

Oracle Data Guard provides three protection modes to configure your Data Guard environment based on your availability, protection, and performance requirements. The following table summarizes these three modes.

Protection Mode	Redo Transport Setting	Description
MAXIMUM PERFORMANCE	ASYNC	For transactions happening on the primary database, redo data is asynchronously transmitted and written to the standby database redo log, therefore the performance impact is minimal. `MaxPerformance` can not provide RPO=0 due to asynchronous log shipping.
MAXIMUM PROTECTION	SYNC+AFFRIM	For transactions on the primary database, redo data is synchronously transmitted and written to the standby database redo log on disk before the transaction is acknowledged. If the standby database becomes unavailable, the primary database shuts itself down to ensure transactions are protected.
MAXIMUM AVAILABILITY	SYNC+AFFRIM	Similar to `MaxProtection` mode, except when no acknowledgement from standby database, it operates as if it were in `MaxPerformance` mode to preserve the primary database availability until it’s able to write its redo stream to a synchronized standby database again.
MAXIMUM AVAILABILITY	SYNC+NOAFFRIM	For transactions on the primary database, redo is synchronously transmitted to the standby database, and the primary waits only for acknowledgement that the redo has been received on the standby, not that it has been written to standby disk. This mode, which is also known as `FastSync`, can provide a performance benefit at the expense of potential exposure to data loss in a special case of multiple simultaneous failures.

Read replicas in RDS Custom for Oracle are created with maximum performance protection mode, which is also the default protection mode for Oracle Data Guard. It provides the lowest performance impact on the primary database, which can help you meet the RPO requirement measured in seconds. With FSFO FastStartFailoverLagLimit configuration, you can control the failover to the standby database based on the lag between primary and standby to further help you meet your RPO requirements.

To work to achieve a zero data loss (RPO=0) objective, you can customize the Data Guard protection mode to MaxAvailability with the SYNC+NOAFFIRM setting for redo transport for better performance. Because commits on the primary database are acknowledged only after the corresponding redo vectors are successfully transmitted to the standby database, the network latency between the primary instance and replica can be crucial for commit-sensitive workloads. Deploying the read replica in the same Availability Zone as the primary database provides lower network latency compared to a read replica in a different Availability Zone. However, the same Availability Zone read replicas may not meet your availability requirements because in the unlikely event of Availability Zone unavailability, both the primary instance and read replica instance are impacted. It’s a recommended practice to perform load testing for your workload to assess the performance impact when the read replica is customized to run in MaxAvailability mode.

Follow these steps to change Data Guard protection mode to MaxAvailability.

Access the Data Guard broker configuration on Primary (orcl1) via the Data Guard Command Line Interface (DGMGRL).
You can find the net service name for your database in the tnsnames.ora file located in the $ORACLE_HOME/network/admin directory. RDS Custom for Oracle automatically populates these entries for your primary database and your read replicas. The password for the RDS_DATAGUARD user is stored in AWS Secrets Manager, with secret name do-not-delete-rds-custom-+<<RDS Resource ID>>+-dg. Refer to Connecting to your RDS Custom DB instance using SSH for more information on how to connect to a RDS Custom EC2 instance using the SSH key retrieved from Secrets Manager.
```
$ dgmgrl RDS_DATAGUARD@RDS_CUSTOM_ORCL_A
DGMGRL for Linux: Release 19.0.0.0.0 - Production on Mon May 16 01:18:54 2022
Version 19.13.0.0.0

Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.

Welcome to DGMGRL, type "help" for information.
Password:
Connected to "ORCL_A"
Connected as SYSDG.
DGMGRL>
```

Verify the current protection mode and log transport setting:

DGMGRL> show configuration;

Configuration - rds_dg

Protection Mode: MaxPerformance
Members:
orcl_a - Primary database
orcl_b - Physical standby database

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 4 seconds ago)

DGMGRL> show database orcl_b;

Database - orcl_b

Role:               PHYSICAL STANDBY
Intended State:     APPLY-ON
Transport Lag:      0 seconds (computed 1 second ago)
Apply Lag:          0 seconds (computed 1 second ago)
Average Apply Rate: 2.00 KByte/s
Real Time Query:    OFF
Instance(s):
ORCL

Database Status:
SUCCESS

DGMGRL> show database orcl_b logxptmode;
LogXptMode = 'ASYNC'

Change log transport mode to FastSync, corresponding to the redo transport setting SYNC+NOAFFIRM. Change it for both the primary database (orcl_a) and standby database (orcl_b) to check that you have valid settings after the role switch.

DGMGRL> edit database orcl_b set property logxptmode=fastsync;
Property "logxptmode" updated

DGMGRL> show database orcl_b LogXptMode;
LogXptMode = 'fastsync'

DGMGRL> edit database orcl_a set property logxptmode=fastsync;
Property "logxptmode" updated

DGMGRL> show database orcl_a logxptmode;
LogXptMode = 'fastsync'

Change the protection mode to MaxAvailability:

DGMGRL> edit configuration set protection mode as maxavailability;
Succeeded.

DGMGRL> show configuration;

Configuration - rds_dg

Protection Mode: MaxAvailability
Members:
orcl_a - Primary database
orcl_b - Physical standby database

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 5 seconds ago)

Change replica status from mounted to read-only

Using a replica in read-only mode requires the Active Data Guard option which comes at an extra cost because it is a separately licensed feature of the Oracle database Enterprise Edition.

Having your read replica in read-only mode provides additional benefits:

Offloads read-only workloads from the primary database
Enables automatic repair of corrupted blocks by retrieving healthy blocks from the standby database using the Active Data Guard feature
Uses the far sync capability to keep the remote standby database in sync without the performance overhead associated with long-distance redo log transmission

The read replica is created in MOUNT mode by default. Follow these steps to open it in read-only mode.

Access the Data Guard broker configuration on Read Replica (orcl2) via DGMGRL.

$ dgmgrl RDS_DATAGUARD@RDS_CUSTOM_ORCL_B
DGMGRL for Linux: Release 19.0.0.0.0 - Production on Mon May 16 02:18:26 2022
Version 19.13.0.0.0

Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.

Welcome to DGMGRL, type "help" for information.
Password:
Connected to "ORCL_B"
Connected as SYSDG.

Stop redo apply for the standby database orcl_b.

DGMGRL> show database orcl_b

Database - orcl_b

Role:               PHYSICAL STANDBY
Intended State:     APPLY-ON
Transport Lag:      0 seconds (computed 1 second ago)
Apply Lag:          0 seconds (computed 1 second ago)
Average Apply Rate: 2.00 KByte/s
Real Time Query:    OFF
Instance(s):
ORCL

Database Status:
SUCCESS

DGMGRL> edit database orcl_b set state=apply-off;
Succeeded.

DGMGRL> show database orcl_b

Database - orcl_b

Role:               PHYSICAL STANDBY
Intended State:     APPLY-OFF
Transport Lag:      0 seconds (computed 0 seconds ago)
Apply Lag:          11 seconds (computed 0 seconds ago)
Average Apply Rate: (unknown)
Real Time Query:    OFF
Instance(s):
ORCL

Database Status:
SUCCESS

Connect to the standby database orcl_b using the TNS entry RDS_CUSTOM_ORCL_B and open it in read-only mode.

$ sqlplus RDS_DATAGUARD@RDS_CUSTOM_ORCL_B as sysdg

SQL*Plus: Release 19.0.0.0.0 - Production on Mon May 16 02:42:59 2022
Version 19.13.0.0.0

Copyright (c) 1982, 2021, Oracle.  All rights reserved.

Enter password:
Last Successful login time: Mon May 16 2022 02:18:31 +00:00

Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.13.0.0.0

SQL> select open_mode from v$database;

OPEN_MODE
--------------------
MOUNTED

SQL> alter database open read only;

Database altered.

SQL> select open_mode from v$database;

OPEN_MODE
--------------------
READ ONLY

Start redo apply on ORCL_B.

Connect to Data Guard broker configuration on read replica using DGMGRL and enable redo apply.

$ dgmgrl RDS_DATAGUARD@RDS_CUSTOM_ORCL_B
DGMGRL for Linux: Release 19.0.0.0.0 - Production on Mon May 16 02:18:26 2022
Version 19.13.0.0.0

Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.

Welcome to DGMGRL, type "help" for information.
Password:
Connected to "ORCL_B"
Connected as SYSDG.

DGMGRL> edit database orcl_b set state=apply-on;
Succeeded.

DGMGRL> show database orcl_b

Database - orcl_b

Role:               PHYSICAL STANDBY
Intended State:     APPLY-ON
Transport Lag:      0 seconds (computed 1 second ago)
Apply Lag:          0 seconds (computed 1 second ago)
Average Apply Rate: 13.00 KByte/s
Real Time Query:    ON
Instance(s):
ORCL

Database Status:
SUCCESS

Role transition (Data Guard switchover)

The primary database and one of the read replica databases can perform role transition. During this process, the databases in the Data Guard configuration continue to function in their new roles. Typical use cases for Data Guard switchover are disaster recovery drills, scheduled maintenance activities on databases, and Standby-First rolling patches.

Follow these steps to perform Data Guard switchover with a read replica.

Pause RDS Custom database automation for both orcl1 and orcl2.
There might be database restarts during the switchover; this depends on the existing Data Guard role and database open mode. Refer to Pausing and resuming RDS Custom automation for steps on how to pause and resume RDS Custom database automation. Although the RDS Custom automation framework doesn’t interfere with the role transition process, it’s a good practice to pause automation during Data Guard switchover.

Use DGMGRL to log in to database orcl1 with user RDS_DATAGUARD, and check the Data Guard status.

DGMGRL> show configuration;

Configuration - rds_dg

Protection Mode: MaxAvailability
Members:
orcl_a - Primary database
orcl_b - Physical standby database

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 55 seconds ago)

On the Amazon RDS console, check the instance role (primary and replica) in the database’s Replication section on the Connectivity & security tab.
The primary role should match the Data Guard primary database, and the replica role should match the Data Guard physical standby database. Note that the instance role shown on the Amazon RDS console is only updated correctly for read replicas managed by RDS Custom.

Perform the switchover:

DGMGRL> switchover to orcl_b;
Performing switchover NOW, please wait...
Operation requires a connection to database "orcl_b"
Connecting ...
Connected to "ORCL_B"
Connected as SYSDBA.
New primary database "orcl_b" is opening...
Operation requires start up of instance "ORCL" on database "orcl_a"
Starting instance "ORCL"...
Connected to an idle instance.
ORACLE instance started.
Connected to "ORCL_A"
Database mounted.
Database opened.
Connected to "ORCL_A"
Switchover succeeded, new primary is "orcl_b"

Verify Data Guard status.

DGMGRL> show configuration;

Configuration - rds_dg

Protection Mode: MaxAvailability
Members:
orcl_b - Primary database
orcl_a - Physical standby database

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 46 seconds ago)

DGMGRL> show configuration lag;

Configuration - rds_dg

Protection Mode: MaxAvailability
Members:
orcl_b - Primary database
orcl_a - Physical standby database
Transport Lag:      0 seconds (computed 1 second ago)
Apply Lag:          0 seconds (computed 1 second ago)

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 55 seconds ago)

You might see WARNING instead of SUCCESS for the configuration status if you run the show configuration command right after switchover. In an ideal scenario, the configuration returns to a healthy state within a minute.

The Amazon RDS console reflects the new roles after you perform the role switch. It may take a few minutes for the Replication state to update from empty to Replicating.

Resume RDS Custom database automation for orcl1 and orcl2.
Change your application configuration to use the new primary database endpoint for the read/write workloads, or refer to the best practices discussed later in this post to avoid such changes.

Data Guard failover

During the Data Guard failover process, the standby database (read replica) is converted to a primary database when the original primary database fails. This is a manual process when the Fast-Start Failover (FSFO) is not configured and there may or may not be data loss depending upon on the Data Guard configuration. The failover process doesn’t remove the primary database from the Data Guard broker configuration and it allows the former primary database to be reinstated when the Flashback database feature is enabled.

If you want to test failover without having to restore from backup or re-create the replica after failover, enable the flashback database as explained in the following steps.

Pause RDS Custom for Oracle database automation for databases orcl1 and orcl2, and disable FSFO if it has been enabled previously.
Log in to the Amazon RDS for Custom underlying EC2 instances running orcl1 and orcl2 and create a directory using user rdsdb.
```
$ mkdir /rdsdbdata/fra

$ ls -ld /rdsdbdata/fra
drwxrwxr-x 2 rdsdb rdsdb 4096 May 26 16:40 /rdsdbdata/fra
```
Enable the flashback database on both databases while the database is in MOUNT state:
```
SQL> alter system set dg_broker_start=false scope=both;
SQL> shutdown immediate
SQL> startup mount
SQL> alter system set db_recovery_file_dest='/rdsdbdata/fra';
SQL> alter system set db_flashback_retention_target=60;
SQL> alter system set db_recovery_file_dest_size=10G;
SQL> alter database flashback on;
```
The code has the following parameters:
- - dg_broker_start=false – Disables Data Guard broker automatic start to prevent the broker from automatically opening the database for the primary database and starting the managed recovery process for the standby database.
  - db_recovery_file_dest – The Oracle database stores database flashback logs in the fast recovery area (FRA) defined by this parameter.
  - db_flashback_retention_target – How far (in minutes) we need the database to rewind. The default value is 1440. To support our FSFO use case, 60 minutes usually is sufficient with maximum availability protection mode.
  - db_recovery_file_dest_size – The size of the FRA. Use the column ESTIMATED_FLASHBACK_SIZE in the view V$FLASHBACK_DATABASE_LOG to estimate the flashback database log space requirement during the peak workload.
Enable Data Guard broker automatic start to allow the broker process to automatically open the database for the primary database and start the managed recovery process for the standby database:
```
SQL> alter system set dg_broker_start=true scope=both;
```
When the process is complete, resume RDS Custom database automation for orcl1 and orcl2, and enable FSFO if it was disabled previously.

To test failover, or in the event of primary database failure, you can fail over to the read replica manually using the following steps:

Pause RDS Custom database automation for orcl2, the read replica instance that acts as the target for failover.

Use DGMGRL to log in to database orcl_b with user RDS_DATAGUARD, and check the Data Guard status. Make sure you log in to the target database, which is the failover target.

DGMGRL> show configuration;

Configuration - rds_dg

Protection Mode: MaxAvailability
Members:
orcl_a - Primary database
orcl_b - Physical standby database

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 31 seconds ago)

Start the failover:
```
DGMGRL> failover to orcl_b immediate;
Performing failover NOW, please wait...
Failover succeeded, new primary is "orcl_b"
```
With the IMMEDIATE option, no attempt is made to apply any unapplied redo that has been received. This helps in the event that the failover command fails with Error: ORA-16472: failover failed due to data loss.
Check the Data Guard status:
```
DGMGRL> show configuration;

Configuration - rds_dg

Protection Mode: MaxAvailability
Members:
orcl_b - Primary database
Warning: ORA-16629: database reports a different protection level from the protection mode

orcl_a - Physical standby database (disabled)
ORA-16661: the standby database needs to be reinstated

Fast-Start Failover:  Disabled

Configuration Status:
WARNING   (status updated 37 seconds ago)
```
The Data Guard configuration is in a WARNING state because there’s no working physical standby for the new primary database. There are a number of ways to reinstate the standby database, including the following:
- Use RMAN backup and restore or RMAN DUPLICATE from the active database.
- Use the Oracle flashback database option if enabled.
- On the Amazon RDS console, drop the old primary database (orcl1) and recreate a replica for orcl2. Note that in this case, the unique DB name takes the next running sequence name if possible. It might become ORCL_C instead of reusing the existing ORCL_A.
Resume RDS Custom database automation for orcl2.
Change your application configuration to use the new primary database endpoint for the read/write workloads, or refer to the best practices discussed later in this post to avoid such changes.

Promote the read replica

The read replica can be promoted to a standalone database that can support read/write workloads by choosing the Promote option on the Amazon RDS Console, the AWS CLI command promote-read-replica, or the PromoteReadReplica Amazon RDS API operation.

As part of this read replica promotion, Amazon RDS Custom for Oracle removes Data Guard configuration for orcl2 in both the primary (orcl1) and the Read Replica (orcl2). The promotion process completes even if the primary (orcl1) is not in “Available” status. Amazon RDS Custom for Oracle removes Dataguard the configuration for orcl2 in the former primary (orcl1) once it comes back to “Available” status.

To promote an RDS Custom for Oracle replica (orcl2) using the Amazon RDS Console:

On the Amazon RDS console, choose Databases in the navigation pane.
Choose the RDS Custom for Oracle replica (orcl2) that you want to promote.
For Actions, choose Promote.
On the Promote Oracle Replica page, enter the backup retention period and the backup window for the newly promoted DB instance, then choose Promote Read Replica.

promote replica

After a few minutes, the role for orcl2 will change from Replica (RDS Custom) to Instance (RDS Custom). If orcl2 was the only replica for the Primary (orcl1), the role for orcl1 will change from Primary (RDS Custom) to Instance (RDS Custom) too.

Alternatively, you can promote a read replica from the AWS CLI, as shown in the following example:

aws rds promote-read-replica \

--db-instance-identifier orcl2 \

--backup-retention-period 1

Configure Fast-Start Failover

FSFO monitors the health of instances in a Data Guard configuration using an observer process and automatically initiates failover to a standby database in the event of a primary instance failure. When there are multiple standby databases for the primary database, you can specify the list of databases as failover candidates via FastStartFailoverTarget configuration.

An observer process is a DGMGRL client component, typically running in a different machine from the primary and standby databases. The ORACLE HOME installation for the observer can be an Oracle Client Administrator installation, or you can install either an Oracle Database Enterprise Edition or Personal Edition. Refer to Installing and Starting the Observer for observer installation for your database release. To provide high availability for the observer process, you might want to do the following:

Enable EC2 instance auto-recovery for the EC2 instance running your observer. You need to automate the observer startup process as part of the OS startup.
Additionally, deploy an observer in the EC2 instance and configure an Amazon EC2 Auto Scaling group with size one (1). In the event of EC2 instance failure, the Auto Scaling group automatically spins up another EC2 instance. You need to automate the observer startup process as part of the OS bootstrap and startup.
For Oracle 12c Release 2 and above, you can deploy multiple (up to three) observers. One of them is the primary observer, and the rest are backup observers. When the primary observer fails, one of the backup observers resumes the primary role.

Follow these steps for configure FSFO for an RDS Custom for Oracle instance with read replicas:

Use DGMGRL from the observer host, which is configured with tnsnames.ora entries for the primary and standby databases:

$ dgmgrl RDS_DATAGUARD@RDS_CUSTOM_ORCL_A
DGMGRL for Linux: Release 19.0.0.0.0 - Production on Wed May 25 10:47:38 2022
Version 19.13.0.0.0

Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.

Welcome to DGMGRL, type "help" for information.
Password:
Connected to "ORCL_A"
Connected as SYSDG.

DGMGRL> show configuration;

Configuration - rds_dg

Protection Mode: MaxAvailability
Members:
orcl_a - Primary database
orcl_b - Physical standby database

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 34 seconds ago)

Although it’s a supported configuration to enable FSFO with maximum performance protection mode and to allow automatic failover as long as data loss is within the FastStartFailoverLagLimit configuration (value in seconds), you need to use maximum availability protection mode to work to achieve zero data loss (RPO=0).

Specify one or more standby databases as the failover target. In a configuration with only one standby database (read replica), this step is optional.

DGMGRL> edit database orcl_a set property FastStartFailoverTarget='orcl_b';
Property "faststartfailovertarget" updated

DGMGRL> edit database orcl_b set property FastStartFailoverTarget='orcl_a';
Property "faststartfailovertarget" updated

DGMGRL> show database orcl_a FastStartFailoverTarget;
FastStartFailoverTarget = 'orcl_b'

DGMGRL> show database orcl_b FastStartFailoverTarget;
FastStartFailoverTarget = 'orcl_a'

Specify the FastStartFailoverThreshold value in seconds, which indicates the duration the observer process waits when the primary database becomes unavailable before initiating the failover.
The default value is 30 seconds in Oracle 19c, and the minimum value is 6 seconds. A lower value can potentially shorten the RTO during the failover, whereas a higher value helps reduce the chance of unnecessary failover transient errors on the primary database. The RDS Custom for Oracle automation framework monitors database health and performs corrective actions every few seconds. Therefore, we recommend setting FastStartFailoverThreshold to a value higher than 10 seconds.In the following example, configure the threshold value to 45 seconds:
```
DGMGRL> edit configuration set property FastStartFailoverThreshold=45;
Property "faststartfailoverthreshold" updated

DGMGRL> show configuration FastStartFailoverThreshold;
FastStartFailoverThreshold = '45'
```
Enable FSFO:
```
DGMGRL> enable fast_start failover;
Warning: ORA-16827: Flashback Database is disabled

Enabled in Zero Data Loss Mode.
```
If your database doesn’t have flashback database enabled, the warning message ORA-16827 appears. Although the flashback database feature is optional for FSFO, it helps automatically reinstate failed primary databases using a flashback database to a point in time before failover if the FastStartFailoverAutoReinstate configuration property is set to TRUE (which is the default).

Check the configuration to confirm FSFO is enabled. The warning message is expected because you haven’t started the observer process.

DGMGRL> show configuration;

Configuration - rds_dg

Protection Mode: MaxAvailability
Members:
orcl_a - Primary database
Warning: ORA-16819: fast-start failover observer not started

orcl_b - (*) Physical standby database
Warning: ORA-16819: fast-start failover observer not started

Fast-Start Failover: Enabled in Zero Data Loss Mode

Configuration Status:
WARNING   (status updated 40 seconds ago)

Start the observer.
You can start the observer before or after you enable FSFO. If FSFO is already enabled, the observer immediately begins monitoring the status and connections to the primary and target standby databases. If FSFO is not enabled, the observer only starts monitoring after FSFO is enabled.
```
DGMGRL> start observer

[W000 2022-05-25T12:06:25.965+00:00] FSFO target standby is orcl_b
Observer 'ip-10-1-4-219.ap-southeast-1.compute.internal' started
[W000 2022-05-25T12:06:26.049+00:00] Observer trace level is set to USER
[W000 2022-05-25T12:06:26.049+00:00] Try to connect to the primary.
[W000 2022-05-25T12:06:26.049+00:00] Try to connect to the primary rds_custom_orcl_a.
[W000 2022-05-25T12:06:26.058+00:00] The standby orcl_b is ready to be a FSFO target
[W000 2022-05-25T12:06:27.058+00:00] Connection to the primary restored!
[W000 2022-05-25T12:06:29.058+00:00] Disconnecting from database rds_custom_orcl_a.
```
To avoid the observer stopping when the terminal running the command is closed, we recommend wrapping the command in a script and starting with the nohup command. From Oracle 12c Release 2 onwards, it’s possible to start an observer in the background. This feature uses Oracle Wallet to store the credentials to log in to target databases. In the following example, rds_custom_orcl_a and rds_custom_orcl_b are the keys for Oracle Wallet to retrieve the credentials for the observer. Because it’s going to run in background mode, you can press Enter to get the DGMGRL command prompt after the start observer command is complete.
```
DGMGRL> start observer in background logfile is '/tmp/observer.log' connect identifier is rds_custom_orcl_a;
Connected to "ORCL_A"
Submitted command "START OBSERVER" using connect identifier "rds_custom_orcl_a"
```
The following are the quick steps to configure Oracle Wallet on the machine running your Oracle Data Guard observer. For complete steps, refer to Configuring a Client to Use the External Password Store.

The first step is to create Oracle Wallet and credentials. For this post, rds_custom_orcl_a is your connection identifier, and rds_dataguad is the user name. Enter the password in the command when prompted.
```
$ mkstore -wrl /u01/app/oracle/wallet/ -create
$ mkstore -wrl /u01/app/oracle/wallet/ -createCredential rds_custom_orcl_a rds_dataguard
$ mkstore -wrl /u01/app/oracle/wallet/ -createCredential rds_custom_orcl_b rds_dataguard
```
Add the following lines to $ORACLE_HOME/network/admin/sqlnet.ora:
```
WALLET_LOCATION=(SOURCE=(METHOD=FILE)(METHOD_DATA=(DIRECTORY=/u01/app/oracle/wallet/)))
SQLNET.WALLET_OVERRIDE=TRUE
```

Check the observer status:

DGMGRL> show observer;

Configuration - rds_dg

Primary:            orcl_a
Active Target:      orcl_b

Observer "ip-10-1-4-219.ap-southeast-1.compute.internal1"(19.13.0.0.0) - Master

Host Name:                    ip-10-1-4-219.ap-southeast-1.compute.internal
Last Ping to Primary:         1 second ago
Last Ping to Target:          1 second ago

Check the Data Guard and FSFO status, and use VERBOSE to get detailed information about Data Guard configuration and status:

DGMGRL> show configuration verbose;

Configuration - rds_dg

Protection Mode: MaxAvailability
Members:
orcl_a - Primary database
orcl_b - (*) Physical standby database

(*) Fast-Start Failover target

Properties:
FastStartFailoverThreshold      = '45'
OperationTimeout                = '30'
TraceLevel                      = 'USER'
FastStartFailoverLagLimit       = '0'
CommunicationTimeout            = '15'
ObserverReconnect               = '0'
FastStartFailoverAutoReinstate  = 'TRUE'
FastStartFailoverPmyShutdown    = 'TRUE'
BystandersFollowRoleChange      = 'ALL'
ObserverOverride                = 'FALSE'
ExternalDestination1            = ''
ExternalDestination2            = ''
PrimaryLostWriteAction          = 'CONTINUE'
ConfigurationWideServiceName    = 'ORCL_CFG'

Fast-Start Failover: Enabled in Zero Data Loss Mode
Lag Limit:          0 seconds
Threshold:          45 seconds
Active Target:      orcl_b
Potential Targets:  "orcl_b"
orcl_b     valid
Observer:           ip-10-1-4-219.ap-southeast-1.compute.internal1
Shutdown Primary:   TRUE
Auto-reinstate:     TRUE
Observer Reconnect: (none)
Observer Override:  FALSE

Configuration Status:
SUCCESS

Optionally, perform a failover test by manually stopping the EC2 instance for orcl1. Before stopping the EC2 instance, use the tail command to monitor the observer log file.

$ tail -f /tmp/observer.log

[W000 2022-05-25T12:58:25.805+00:00] Primary database cannot be reached.
[W000 2022-05-25T12:58:25.805+00:00] Fast-Start Failover threshold has not exceeded. Retry for the next 45 seconds
[W000 2022-05-25T12:58:26.805+00:00] Try to connect to the primary.
ORA-12541: TNS:no listener

Unable to connect to database using rds_custom_orcl_a
[W000 2022-05-25T12:58:26.982+00:00] Primary database cannot be reached.
[W000 2022-05-25T12:58:27.982+00:00] Try to connect to the primary.
ORA-12541: TNS:no listener

Unable to connect to database using rds_custom_orcl_a
[W000 2022-05-25T12:58:27.985+00:00] Primary database cannot be reached.
[W000 2022-05-25T12:58:28.985+00:00] Try to connect to the primary.
ORA-12541: TNS:no listener

Unable to connect to database using rds_custom_orcl_a
[W000 2022-05-25T12:58:28.990+00:00] Primary database cannot be reached.
[W000 2022-05-25T12:58:29.990+00:00] Try to connect to the primary.
ORA-12541: TNS:no listener

Unable to connect to database using rds_custom_orcl_a

[W000 2022-05-25T12:59:12.008+00:00] Primary database cannot be reached.
[W000 2022-05-25T12:59:12.008+00:00] Fast-Start Failover threshold has expired.
[W000 2022-05-25T12:59:12.008+00:00] Try to connect to the standby.
[W000 2022-05-25T12:59:12.009+00:00] Making a last connection attempt to primary database before proceeding with Fast-Start Failover.
[W000 2022-05-25T12:59:12.009+00:00] Check if the standby is ready for failover.
[S002 2022-05-25T12:59:12.124+00:00] Fast-Start Failover started...

2022-05-25T12:59:12.124+00:00
Initiating Fast-Start Failover to database "orcl_b"...
[S002 2022-05-25T12:59:12.125+00:00] Initiating Fast-start Failover.
Performing failover NOW, please wait...
Failover succeeded, new primary is "orcl_b"
2022-05-25T12:59:15.906+00:00
[S002 2022-05-25T12:59:15.906+00:00] Fast-Start Failover finished...
[W000 2022-05-25T12:59:15.906+00:00] Failover succeeded. Restart pinging.
[W000 2022-05-25T12:59:15.938+00:00] Primary database has changed to orcl_b.
[W000 2022-05-25T12:59:16.941+00:00] Try to connect to the primary.
[W000 2022-05-25T12:59:16.942+00:00] Try to connect to the primary rds_custom_orcl_b.
[W000 2022-05-25T12:59:17.092+00:00] The standby orcl_a needs to be reinstated
[W000 2022-05-25T12:59:17.092+00:00] Try to connect to the new standby orcl_a.
[W000 2022-05-25T12:59:17.092+00:00] Connection to the primary restored!
[W000 2022-05-25T12:59:19.092+00:00] Disconnecting from database rds_custom_orcl_b.
[W000 2022-05-25T12:59:47.135+00:00] Try to connect to the new standby orcl_a.
ORA-12170: TNS:Connect timeout occurred

Use DGMGRL to log in to database orcl2 (with DB unique name orcl_b) with user RDS_DATAGUARD, and check the Data Guard status. It shows that orcl2 is our new primary database.

$ dgmgrl RDS_DATAGUARD@RDS_CUSTOM_ORCL_B
DGMGRL for Linux: Release 19.0.0.0.0 - Production on Wed May 25 13:02:37 2022
Version 19.13.0.0.0

Copyright (c) 1982, 2019, Oracle and/or its affiliates.  All rights reserved.

Welcome to DGMGRL, type "help" for information.
Password:
Connected to "ORCL_B"
Connected as SYSDG.

DGMGRL> show configuration;

Configuration - rds_dg

Protection Mode: MaxAvailability
Members:
orcl_b - Primary database
Warning: ORA-16824: multiple warnings, including fast-start failover-related warnings, detected for the database

orcl_a - (*) Physical standby database (disabled)
ORA-16661: the standby database needs to be reinstated

Fast-Start Failover: Enabled in Zero Data Loss Mode

Configuration Status:
WARNING   (status updated 5 seconds ago)

If the flashback database has been enabled on orcl1 prior to failover, after the former primary database returns online and starts in MOUNT state, the observer reinstates it into a new standby database. The reinstated database acts as the FSFO target for the new primary database. The observer log is as follows:

[W000 2022-05-25T13:07:42.577+00:00] Connection to the primary restored!
[W000 2022-05-25T13:07:42.577+00:00] Wait for new primary to be ready to reinstate.
[W000 2022-05-25T13:07:42.581+00:00] New primary is now ready to reinstate.
[W000 2022-05-25T13:07:43.581+00:00] Issuing REINSTATE command.

2022-05-25T13:07:43.581+00:00
Initiating reinstatement for database "orcl_a"...
Reinstating database "orcl_a", please wait...
[W000 2022-05-25T13:08:12.617+00:00] The standby orcl_a is ready to be a FSFO target
Reinstatement of database "orcl_a" succeeded
2022-05-25T13:08:27.047+00:00
[W000 2022-05-25T13:08:27.637+00:00] Successfully reinstated database orcl_a.

Seamless Oracle Client connectivity

You can avoid application configuration changes during a role transition by using a TNS entry that contains both the primary and standby database endpoints in the configuration. You can define two role-based database services to support both read/write and read-only workloads. In the following example, orcl_rw is the read/write service that’s active on the primary database. orcl_ro is the read-only service, and is active on the standby database that has been opened in read-only mode.

Run the following SQL statements in the primary database to create these services:

exec dbms_service.create_service('orcl_rw','orcl_rw');
exec dbms_service.create_service('orcl_ro','orcl_ro');

Run the following SQL statement in the primary database to start the orcl_rw service:

exec dbms_service.start_service('orcl_rw');

Run the following SQL statement in standby databases running in read-only mode to start the orcl_ro service:

exec dbms_service.start_service('orcl_ro');

To automatically start these services while the database is restarted, create the following trigger in the primary database:

CREATE OR REPLACE TRIGGER TrgDgServices after startup on database
DECLARE
db_role VARCHAR(30);
db_open_mode VARCHAR(30);
BEGIN
SELECT DATABASE_ROLE, OPEN_MODE INTO db_role, db_open_mode FROM V$DATABASE;
IF db_role = 'PRIMARY' THEN
DBMS_SERVICE.START_SERVICE('orcl_rw');
END IF;
IF db_role = 'PHYSICAL STANDBY' AND db_open_mode LIKE 'READ ONLY%' THEN
DBMS_SERVICE.START_SERVICE('orcl_ro');
END IF;
END;
/

The following is a sample configuration for a read/write workload:

ORCL_RW = (DESCRIPTION =
(CONNECT_TIMEOUT= 120)(RETRY_COUNT=20)(RETRY_DELAY=3)(TRANSPORT_CONNECT_TIMEOUT=3)
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST=orcl1.cngutvmeo0q7.ap-southeast-1.rds.amazonaws.com)(PORT=1521))
(ADDRESS = (PROTOCOL = TCP)(HOST=orcl2.cngutvmeo0q7.ap-southeast-1.rds.amazonaws.com)(PORT=1521))
)
(CONNECT_DATA=(SERVICE_NAME = orcl_rw))
)

The following is a sample configuration for a read-only workload:

ORCL_RO = (DESCRIPTION =
(CONNECT_TIMEOUT= 120)(RETRY_COUNT=20)(RETRY_DELAY=3)(TRANSPORT_CONNECT_TIMEOUT=3)
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST=orcl1.cngutvmeo0q7.ap-southeast-1.rds.amazonaws.com)(PORT=1521))
(ADDRESS = (PROTOCOL = TCP)(HOST=orcl2.cngutvmeo0q7.ap-southeast-1.rds.amazonaws.com)(PORT=1521))
)
(CONNECT_DATA=(SERVICE_NAME = orcl_ro))
)

Refer to Oracle Data Guard Specific FAN and FCF Configuration Requirements for other Oracle client considerations to improve failover time during failover and switchover operations.

Cross-Region Data Guard configuration

To help you meet your disaster recovery requirements, you may need to deploy a standby database in a different Region or in an environment outside of AWS, such as on-premises or other cloud services. For such configurations, you can manually set up a self-managed physical standby database (managed replicas can’t be created outside your primary Region as of this writing).

However, managed replicas and self-managed standby databases can be managed by a single Data Guard broker configuration. The following architecture shows one such configuration for a primary RDS Custom for Oracle instance with a managed replica in the same Region and a cross-Region self-managed physical standby database.

cross region

Refer to Enabling High Availability with Data Guard on Amazon RDS Custom for Oracle for more details on creating an RDS Custom for Oracle standby database in a remote Region and maintaining it as a self-managed physical standby database for a primary RDS Custom for Oracle instance.

Refer to the Oracle Data Guard Broker guide and Data Guard Concepts and Administration for details on how to configure and manage a Data Guard environment with multiple standby databases.

Conclusion

Amazon RDS Custom for Oracle allows you to create read replicas that are built on Oracle Data Guard technology to offload read-only workloads from your primary instance. In this post, we discussed options to customize the default configuration of Oracle Data Guard deployed by a read replica to help you meet your specific RTO and RPO requirements for your critical applications hosted on RDS Custom for Oracle. We also reviewed best practices and step-by-step instructions to follow during role transition and failover scenarios.

If you have any comments or questions, leave them in the comment section.

About the author

Jobin Joseph is a Senior Database Specialist Solution Architect based in Dubai. With a focus on relational database engines, he assists customers in migrating and modernizing their database workloads to AWS. He is an Oracle Certified Master with 20 years of experience with Oracle databases.

Donghua Luo is a Senior RDS Specialist Solutions Architect. He works with AWS customers to migrate and achieve higher flexibility, scale, and resiliency with database services in AWS cloud.