Enable transparent connectivity to Oracle Data Guard environments using Amazon Route 53 CNAME records

Customers choose AWS for running their Oracle database workload to help increase resiliency, performance, and scalability of the database layer. A high availability (HA) solution for the database stack is an important aspect to consider when migrating or deploying Oracle databases in AWS to help ensure that the architecture can meet the service level agreement (SLA) of the application. Customers who run their Oracle databases on Amazon Elastic Compute Cloud (Amazon EC2) commonly choose Oracle Data Guard physical standby databases to help meet the HA and disaster recovery (DR) for their Oracle database workloads.

As discussed in this Oracle documentation, role-based services with multiple listener endpoints in the connection URL or tnsnames.ora entry is the preferred way to transparently connect to the database layer that is part of a Data Guard configuration. However, some application components and driver configurations don’t support multiple hostnames in the connection URL. Those applications require a single hostname or IP for the clients to connect to the Data Guard environment.

This post talks about the concept of using an Amazon Route 53 CNAME record in a Data Guard environment on EC2 and lists the artifacts to automatically route the connection between primary and standby environments in a Data Guard configuration based on the database role.

Solution overview

To help avoid the manual efforts to update DNS entries or tnsnames.ora file after a failover or switchover operation in a Data Guard environment, the solution uses an AFTER DB_ROLE_CHANGE trigger to automate the DNS failover process. This trigger runs a shell script on the database host, which in turn updates the CNAME record in Route 53 to point the CNAME records to reflect the role transition. The following diagram illustrates the solution architecture (Figure 1).

Figure 1. Solution architecture

The solution discussed in this post covers routing new database connection requests to the right database post a Data Guard switchover activity. However, other factors such as application/client TTL settings and behavior of the connection pool to invalidate the connection handles created prior to the switchover activity can cause the application to connect to the database with a different role (like read-write workloads are connected to standby after switchover) and can generate errors, such as ORA-16000: database or pluggable database open for read-only access. It is a best practice to verify the database role before using the connection handles for transactions to verify that the application is connected to the database with the expected role.

The following workflow depicts the sequence of events that happens during a failover or switchover activity in a Data Guard environment to enable seamless connectivity for the application:

A role transition event occurs in the Data Guard environment.
The event triggers the AFTER DB_ROLE_CHANGE trigger.
The trigger runs the shell script on the EC2 instance using a scheduler job.
The shell script updates Route 53 to point the CNAME records to reflect the role transition.

Prerequisites

This post assumes the following prerequisites:

You should have an existing Data Guard configuration with one primary and one standby DB instance within a single VPC. Refer to the Oracle quick start template to deploy a Data Guard environment on Amazon EC2.
The steps discussed here are for self-managed Data Guard configuration on Amazon EC2 with Red Hat Linux AMI.
The scenario discussed in the post involves one primary and one standby database in the Data Guard configuration. For any other configurations, the scripts shown in this example require additional changes.
A private or public Route 53 hosted zone should be configured in the VPC where the DB environment exists.
The shell script uses the instance profile of the EC2 instance to run the AWS Command Line Interface (AWS CLI) commands. Make sure that the instance profile of the EC2 instances hosting the primary and standby databases has a policy attached that allows changing the record set in the hosted zone such as the following:

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DBCnameFlipPloicy",
"Effect": "Allow",
"Action": [
"route53:ChangeResourceRecordSets",
"route53:ListResourceRecordSets"
],
"Resource": "arn:aws:route53:::hostedzone/<<YourHostedZoneId>>"
}
]
}

Nslookup, jq, and curl utilities must be installed on all of the DB hosts. If not installed, you can install the utility on RHEL Linux using the following command:

yum install -y bind-utils
yum install -y curl
yum install -y jq

Environment details

This post assumes a Data Guard configuration with two instances within a single VPC, one primary and one standby, with the following details and naming conventions:

Oracle database version – 19.10 configured in maximum performance mode with Active Data Guard
Route 53 domain name – mydbdomain
Database name – orcl
DB_UNIQUE_NAME – orcl_a and orcl_b
Instance names – orcl
Route 53 A record for the host in AZ1 – orcl-a-db.mydbdomain
Route 53 A record for the host in AZ2 – orcl-b-db.mydbdomain

Route 53 configuration

Two A records are created in Route 53 to point to the IPs of the primary and standby hosts. Two CNAME records are also created in Route 53, which are automatically updated during the Data Guard switchover and failover scenarios. The CNAME record orcl-rw.mydbdomain points to the instance in the primary role that can accept read/write transactions, and orcl-ro.mydbdomain points to the instance in the standby role that accepts read-only queries.

The A records configuration is as follows:

DB host IP in AZ1 (10.0.0.5 in this example) – orcl-a-db.mydbdomain
DB host IP in AZ2 (10.0.32.5 in this example) – orcl-b-db.mydbdomain

The CNAME records configuration is as follows:

orcl-a-db.mydbdomain – orcl-rw.mydbdomain
orcl-b-db.mydbdomain – orcl-ro.mydbdomain

The following screenshot shows the Route 53 console view of the domain mydbdomain.

Figure 2. The Route 53 console view of the domain mydbdomain

TNS configuration

The following tnsnames.ora file entries show how connections can be made to primary and standby databases using the CNAME records without a dependency on the actual IP address of the EC2 instances that host primary and standby databases. The entry orcl_a always points to the instance on orcl-a-db.mydbdomain, and orcl_b always points to the instance on orcl-b-db.mydbdomain, regardless of their roles. The entries orclrw and orclro direct the connection to the databases playing primary and standby roles, respectively.

orcl_a =
(description =
(address = (protocol = tcp)(host = orcl-a-db.mydbdomain)(port = 1525))
(connect_data =
(server = dedicated)
(service_name = orcl_a)
)
)

orcl_b =
(description =
(address = (protocol = tcp)(host = orcl-b-db.mydbdomain)(port = 1525))
(connect_data =
(server = dedicated)
(service_name = orcl_b)
)
)

orclrw =
(description =
(address = (protocol = tcp)(host = orcl-rw.mydbdomain)(port = 1525))
(connect_data =
(server = dedicated)
(service_name = orcl)
)
)

orclro =
(description =
(address = (protocol = tcp)(host = orcl-ro.mydbdomain)(port = 1525))
(connect_data =
(server = dedicated)
(service_name = orcl)
)
)

To enable connectivity using orclrw and orclro TNS entries, you can use either a role-based service or a static listener registration entry in both the primary and standby listener, as shown in the following code:

SID_DESC =
      (GLOBAL_DBNAME = orcl)
      (ORACLE_HOME = /opt/oracle/product/19c/dbhome_1)
      (SID_NAME = orcl)
    )

Implement the solution

To implement an automated DNS update during an Oracle switchover or failover, we use an Oracle database trigger and a shell script. The following are the high-level steps for the entire workflow:

Create a DB_ROLE_CHANGE ON DATABASE trigger on the primary database
The trigger in turn creates a DBMS job that calls a shell script with the cname_switch.sh.
The shell script updates the Route 53 CNAME entries.

Database trigger

Use the following code for the database trigger:

CREATE OR REPLACE TRIGGER sys.cname_flip_post_role_change 
AFTER DB_ROLE_CHANGE ON DATABASE
DECLARE
  v_db_name VARCHAR2(9);
  v_db_role VARCHAR2(16);
BEGIN
  SELECT DATABASE_ROLE  INTO v_db_role FROM V$DATABASE;
  SELECT DB_UNIQUE_NAME INTO v_db_name FROM V$DATABASE;

  IF v_db_role = 'PRIMARY' THEN
    BEGIN
      dbms_scheduler.drop_job('RW_CNAME_FLIP');
    EXCEPTION
      WHEN OTHERS THEN NULL;
    END;

    dbms_scheduler.create_job(
      job_name   => 'RW_CNAME_FLIP',
      job_type   => 'EXECUTABLE',
      number_of_arguments => 1,
      job_action => '/home/oracle/admin/bin/cname_switch.sh',
      enabled    => false,
      auto_drop  => true);

    dbms_scheduler.set_job_argument_value(
      job_name          => 'RW_CNAME_FLIP',
      argument_position => 1,
      argument_value    => v_db_name);

    BEGIN
      dbms_scheduler.run_job('RW_CNAME_FLIP');
    EXCEPTION
    WHEN OTHERS THEN
      raise_application_error(-20101, 'CNAME flip failed, check script error');
    END;

  END IF;

EXCEPTION
  WHEN OTHERS THEN
    raise_application_error(-20102, 'CNAME flip failed due to error: ' || SQLERR
M);
END;
/

Shell script

This script determines the current CNAME, identifies the dependent A records, and maps the CNAME to the correct A records accordingly. This shell script is provided for reference assuming the naming conventions for db_name and db_unique_name as used in the sample configuration. You should review and modify the script to meet your specific requirements and organization standards.

As per the example shown earlier, the shell script is placed in the location /home/oracle/admin/bin/cname_switch.sh.

Note: it’s common to see production databases that are restored or cloned to lower environments.

If the script is run in those environments, it can potentially change the CNAME entries unexpectedly. To mitigate this, the shell script has the function restore_safeguard. This function checks that the IP assigned to the EC2 instance is actually matching with the A records configured for this database in Route 53. If no match is found, this will not perform CNAME failover.

#! /bin/bash
#set -x

# Variables may need to be changed to suit your environment

DB_NAME=$1
DB_IN=$1
echo "Orginal Input : ${DB_NAME}"
DB_NAME=`echo "${DB_NAME::-2}"`  # removing last 2 characters from DB_UNIQUE_NAME
DB_NAME=`echo "${DB_NAME}" | tr '[:upper:]' '[:lower:]'`
echo "Modified Input : ${DB_NAME}"

DB_DOMAIN=<<YOUR_AWS_ROUTE53_DOMAIN_NAME>>    # Update as per your AWS Route53 domian name
ZONE_ID=<<YOUR_AWS_ROUTE53_HOSTED_ZONE_ID>>   # Update as per your AWS Route53 hosted zone ID
EC2_METADATA='http://169.254.169.254/latest/dynamic/instance-identity/document'

# CNAME and A-Records related varables :

RW_CNAME=`echo "${DB_NAME}-rw.${DB_DOMAIN}"`
RO_CNAME=`echo "${DB_NAME}-ro.${DB_DOMAIN}"`
A_CNAME=`echo "${DB_NAME}-a-db.${DB_DOMAIN}"`
B_CNAME=`echo "${DB_NAME}-b-db.${DB_DOMAIN}"`

REGION=`curl -s ${EC2_METADATA}|grep region|awk -F\" '{print $4}'`

# Logfile configuration and file initilization

TS=`date +%Y%m%d_%H%M%S`
LOG_DIR=/tmp
CHANGE_SET_FILE=`echo "${LOG_DIR}/${DB_NAME}-CnameFlip-${TS}.json"`
LOG_FILE=`echo "${LOG_DIR}/${DB_NAME}-CnameFlip-${TS}.log"`
CONF_FILE=`echo "file://${CHANGE_SET_FILE}"`

# Function to check if current host IP matching with Route 53 configuration

IS_SAFE='Unsafe'

function restore_safeguard()
{
    AWS_TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"`
    LOCAL_IPV4=`curl -sH "X-aws-ec2-metadata-token: $AWS_TOKEN" -v http://169.254.169.254/latest/meta-data/local-ipv4`
    PUBLIC_IPV4=`curl -sH "X-aws-ec2-metadata-token: $AWS_TOKEN" -v http://169.254.169.254/latest/meta-data/public-ipv4`
    NOT_FOUND=`echo ${PUBLIC_IPV4} | grep '404 - Not Found' | wc -l`

    if [ ${NOT_FOUND} == 1 ]; then
       PUBLIC_IPV4='No Public IP Assigned'
    fi

    A_IP=$(aws route53 list-resource-record-sets --hosted-zone-id ${ZONE_ID} \
           --query 'ResourceRecordSets[?Type==`A`].{Name: Name, Value:ResourceRecords[0].Value}' | \
           jq -cr --arg DB_NAME "${DB_NAME}-a" '.[] | select( .Name | contains($DB_NAME)).Value')

    B_IP=$(aws route53 list-resource-record-sets --hosted-zone-id ${ZONE_ID} \
           --query 'ResourceRecordSets[?Type==`A`].{Name: Name, Value:ResourceRecords[0].Value}' | \
           jq -cr --arg DB_NAME "${DB_NAME}-b" '.[] | select( .Name | contains($DB_NAME)).Value')

    PREVIOUS_RW_ID=$(aws route53 list-resource-record-sets --hosted-zone-id ${ZONE_ID} \
           --query 'ResourceRecordSets[?Type==`CNAME`].{Name: Name, Value:ResourceRecords[0].Value}' | \
           jq -cr --arg DB_NAME "${DB_NAME}-rw" '.[] | select( .Name | contains($DB_NAME)).Value' | cut -d'-' -f2)

    if [ ${PREVIOUS_RW_ID} == 'a' ]; then
       RW_NODE_IP=${A_IP}
       RO_NODE_IP=${B_IP}
    else
       RW_NODE_IP=${B_IP}
       RO_NODE_IP=${A_IP}
    fi

    # Looging Input values

    echo "Orginal Input   : ${DB_IN}"          | tee -a ${LOG_FILE}
    echo "Modified Input  : ${DB_NAME}"        | tee -a ${LOG_FILE}
    echo "Current RW ID   : ${PREVIOUS_RW_ID}" | tee -a ${LOG_FILE}
    echo "Host Private IP : ${LOCAL_IPV4}"     | tee -a ${LOG_FILE}
    echo "Host Public IP  : ${PUBLIC_IPV4}"    | tee -a ${LOG_FILE}
    echo "A Node IP       : ${A_IP}"           | tee -a ${LOG_FILE}
    echo "A Node IP       : ${B_IP}"           | tee -a ${LOG_FILE}
    echo "RW Node IP      : ${RW_NODE_IP}"     | tee -a ${LOG_FILE}
    echo "RO Node IP      : ${RO_NODE_IP}"     | tee -a ${LOG_FILE}

    if [ "${LOCAL_IPV4}" == "${RO_NODE_IP}" -o "${PUBLIC_IPV4}" == "${RO_NODE_IP}" ]; then
       IS_SAFE='Safe'
    else
       IS_SAFE='Unsafe'
    fi
}

restore_safeguard

if [ ${IS_SAFE} == 'Safe' ]; then
   echo "Safe for CNAME faliover..." | tee -a ${LOG_FILE}
else
   echo "Unsafe for CNAME faliover..." | tee -a ${LOG_FILE}
   echo "Aborting..."
   exit 1
fi

PRI_DB_ID=`nslookup ${RW_CNAME}|grep "canonical name"|cut -d'=' -f2|cut -d'-' -f2`

# Looging Input values :
echo "Orginal Input      : ${DB_IN}"     | tee    ${LOG_FILE}
echo "Modified Input     : ${DB_NAME}"   | tee -a ${LOG_FILE}
echo "Current RW host ID : ${PRI_DB_ID}" | tee -a ${LOG_FILE}

echo -e "\nChange to be done : \n" | tee -a ${LOG_FILE}

if [ ${PRI_DB_ID} == 'a' ]; then
   echo "Changing ${RW_CNAME} from ${A_CNAME} to ${B_CNAME}" | tee -a ${LOG_FILE}
   echo "Changing ${RO_CNAME} from ${B_CNAME} to ${A_CNAME}" | tee -a ${LOG_FILE}
   TO_BE_RW_CNAME=${B_CNAME}
   TO_BE_RO_CNAME=${A_CNAME}
else
   echo "Changing ${RW_CNAME} from ${B_CNAME} to ${A_CNAME}" | tee -a ${LOG_FILE}
   echo "Changing ${RO_CNAME} from ${A_CNAME} to ${B_CNAME}" | tee -a ${LOG_FILE}
   TO_BE_RW_CNAME=${A_CNAME}
   TO_BE_RO_CNAME=${B_CNAME}
fi

R53_CHANGE=`echo -e "
{
  \"Comment\": \"Flip CNAMEs\",
  \"Changes\": [
    {
      \"Action\" : \"UPSERT\",
      \"ResourceRecordSet\" : {
        \"Name\" : \"${RW_CNAME}.\",
        \"Type\" : \"CNAME\",
        \"TTL\"  : 60,
        \"ResourceRecords\" : [{ \"Value\": \"${TO_BE_RW_CNAME}.\" }]
      }
    },
    {
      \"Action\" : \"UPSERT\",
      \"ResourceRecordSet\" : {
        \"Name\" : \"${RO_CNAME}\",
        \"Type\" : \"CNAME\",
        \"TTL\"  : 60,
        \"ResourceRecords\" : [{ \"Value\": \"${TO_BE_RO_CNAME}.\" }]
      }
    }
  ]
}
"`

echo -e "\nRoute53 Change Set :\n" | tee -a ${LOG_FILE}
echo ${R53_CHANGE} | tee -a ${LOG_FILE}
echo ${R53_CHANGE} > ${CHANGE_SET_FILE}

echo -e "\nCommand to Execute : " | tee -a ${LOG_FILE}
echo -e "\naws route53 change-resource-record-sets --hosted-zone-id ${ZONE_ID} \
         --change-batch ${CONF_FILE} \n" | tee -a ${LOG_FILE}

echo -e "\nExecution Result :\n"
aws route53 change-resource-record-sets --hosted-zone-id ${ZONE_ID} \
--change-batch ${CONF_FILE} | tee -a ${LOG_FILE}

echo -e "\nAfter Change :\n "
aws route53 list-resource-record-sets --hosted-zone-id ${ZONE_ID} | tee -a ${LOG_FILE}

Test the solution

The following screenshot shows the Route 53 console view of the domain mydbdomain before the switchover. The primary database is running on orcl-a-db.mydomain because orcl-rw.mydomain is pointing to that.

Figure 3. Route 53 console view of the domain mydbdomain before the switchover

The following SQL displays the current role of both primary and standby databases and host_name they are currently running on.

[oracle@ip-10-0-0-5 sql]$ cat db_info.sql

ALTER SESSION SET NLS_DATE_FORMAT='YYYY-MM-DD:HH24:MI';
set lines 150 pages 200
col HOST_NAME for a30 trunc

select d.NAME, d.db_unique_name, d.DATABASE_ROLE, d.OPEN_MODE, i.INSTANCE_NAME, 
i.HOST_NAME, i.STARTUP_TIME
from v$instance i, v$database d;

[oracle@ip-10-0-0-5 sql]$ sqlplus system@orclrw

SQL> @db_info

NAME  DB_UNIQUE_NAME DATABASE_ROLE OPEN_MODE INSTANCE_NAME HOST_NAME STARTUP_TIME
------ ---------------- -------------- ---------------- ------------------------------ ----------------
ORCL orcl_a PRIMARY READ WRITE orcl ip-10-0-0-5.us-west-2.compute. 2020-05-24:01:47

[oracle@ip-10-0-0-5 sql]$ sqlplus system@orclro

SQL> @db_info

NAME DB_UNIQUE_NAME DATABASE_ROLE OPEN_MODE INSTANCE_NAME HOST_NAME STARTUP_TIME
------ ---------------- -------------------- -------------- ------------------------------- ----------------
ORCL orcl_b PHYSICAL STANDBY READ ONLY WITH APPLY orcl ip-10-0-32-5.us-west-2.compute. 2020-05-24:05:50

Let’s initiate the switchover:

[oracle@ip-10-0-0-5 sql]$ dgmgrl /
DGMGRL for Linux: Release 12.2.0.1.0 - Production on Wed May 27 06:42:51 2020

Copyright (c) 1982, 2017, Oracle and/or its affiliates.  All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected to "orcl_a"
Connected as SYSDG.
DGMGRL> show configuration;

Configuration - awsguard

  Protection Mode: MaxPerformance
  Members:
  orcl_a - Primary database
    orcl_b - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS   (status updated 39 seconds ago)

DGMGRL> switchover to orcl_b;
Performing switchover NOW, please wait...
Operation requires a connection to database "orcl_b"
Connecting ...
Connected to "orcl_b"
Connected as SYSDBA.
New primary database "orcl_b" is opening...
Oracle Clusterware is restarting database "orcl_a" ...
Switchover succeeded, new primary is "orcl_b"
DGMGRL>
DGMGRL> show configuration;

Configuration - awsguard

  Protection Mode: MaxPerformance
  Members:
  orcl_b - Primary database
    orcl_a - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS   (status updated 67 seconds ago)

DGMGRL>

Now that the switchover is complete, let’s connect to the database using the orclrw and orclro TNS entries using the following code:

[oracle@ip-10-0-0-5 sql]$ sqlplus system@orclrw

SQL> @db_info

NAME DB_UNIQUE_NAME  DATABASE_ROLE  OPEN_MODE     INSTANCE_NAME  HOST_NAME                      STARTUP_TIME
----- -------------- ------------- -------------- ------------------------------ ----------------
ORCL  orcl_b PRIMARY        READ WRITE    orcl          ip-10-0-32-5.us-west-2.compute 2020-05-24:05:50


[oracle@ip-10-0-0-5 sql]$ sqlplus system@orclro

SQL> @db_info

NAME  DATABASE_ROLE     OPEN_MODE            INSTANCE_NAME  HOST_NAME            STARTUP_TIME
----- ----------------- -------------------- -------------- ------------------------------ ----------------
ORCL orcl_a PHYSICAL STANDBY  READ ONLY WITH APPLY orcl          ip-10-0-0-5.us-west-2.compute. 2020-05-27:06:43

The following screenshot shows the Route 53 console view of the domain mydbdomain after the switchover. The primary database is now running on orcl-b-db.mydomain because orcl-rw.mydomain is pointing to that.

Figure 4. Route 53 console view of the domain mydbdomain after the switchover

Conclusion

Application connectivity to a Data Guard environment can be challenging, especially when the application configuration doesn’t support multiple hostnames or listener endpoints. In this post, we discussed step-by-step details to enable seamless connectivity to Data Guard environments using Route 53 CNAME records, a database trigger, and a shell script. You can use these artifacts to direct the DB connections to the database with the right role seamlessly without application changes. If you are using Data Guard Observer for automated failover, another blog, Setup a high availability design for Oracle Data Guard (Fast-Start Failover) using Amazon Route 53 discusses an alternate mechanism to achieve the same result.

AWS Architecture Blog