AWS Database Blog

Effectively migrating LOB data to Amazon S3 from Amazon RDS for Oracle with AWS DMS

Data lakes in AWS are highly popular, and at the heart of them is Amazon Simple Storage Service (Amazon S3), which offers unlimited, highly durable, elastic, and cost-effective storage. You can build data lakes in Amazon S3 from different data sources. Amazon Relational Database Service (Amazon RDS) for Oracle databases are a common source for data lakes, and in many cases these databases have tables with LOB columns. After you replicate the data to Amazon S3, you can use various AWS services to consume it for AI, machine learning (ML), reporting, or even one-time queries without impacting the source database.

In this post, I walk through an effective way to migrate and continuously replicate tables with LOB columns from Amazon RDS for Oracle to Amazon S3 using AWS Database Migration Service (AWS DMS). AWS DMS helps you migrate databases to AWS quickly and securely; in this post, I explore its heterogeneous replication feature.

Prerequisites

For this walkthrough, you need the following:

  • An AWS account
  • An Amazon RDS for Oracle database (Oracle 12.1.0.2.v7 or later)
  • An S3 bucket in the same AWS Region as the AWS DMS replication instance you create to migrate your data

Solution overview

The following diagram illustrates the architecture for this solution.

To implement the solution, you complete the following steps:

  1. Create an AWS Identity and Access Management (IAM) role to access the target S3 bucket.
  2. Prepare the source Amazon RDS for Oracle database for replication.
  3. Create your AWS DMS instance and endpoints.
  4. Create a task for the full load.
  5. Create a task for change data capture (CDC).

For this post, I use Amazon RDS for Oracle (12.1.0.2) as the source in the same Region as the target S3 bucket.

Creating an IAM role to access the target S3 bucket

You need an IAM role with write and delete access to your target S3 bucket and optional tagging access so you can tag any S3 objects written to this bucket. For this post, I created the role dmsdemos3bucketrole and used the following JSON policy. The name of my S3 bucket is dmsdemos3bucket. Make sure you change the name of the S3 bucket as appropriate in the policy:

{
    "Version": "2012-10-17",
    "Statement": [{
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:PutObjectTagging"
            ],
            "Resource": [
                "arn:aws:s3:::dmsdemos3bucket/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::dmsdemos3bucket"
            ]
        }
    ]
}
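
In addition to the permissions policy, the role's trust relationship must allow AWS DMS to assume it. The following is the standard service trust policy for AWS DMS:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {
            "Service": "dms.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
    }]
}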

Preparing the Amazon RDS for Oracle database for replication

For AWS DMS to migrate the data to Amazon S3 from the source database, you need to complete a few prerequisites so the AWS DMS instance can read the data. This includes creating a database user with appropriate permissions and making configuration changes to the database based on the method you choose for continuous data replication.

The migration consists of two major parts:

  • Full load – The initial data is migrated from the Amazon RDS for Oracle database to Amazon S3. During a full load task, AWS DMS performs a full table scan of each source table, processing tables in parallel. This consumes some resources on your source database.
  • CDC – The changes during and after the full load are cached in the memory of the AWS DMS instance, based on the configuration of CDC. If available memory is exhausted in the instance, the changes are cached to disk. During a CDC task, AWS DMS applies these cached changes to the target. There are two modes available for CDC with the source as Amazon RDS for Oracle: Oracle LogMiner (default) and AWS DMS Binary Reader. For more information, see Using Oracle LogMiner or AWS DMS binary reader for change data capture (CDC).

I choose Binary Reader for this migration over LogMiner for the following reasons:

  • Binary Reader supports CDC for LOBs in Oracle version 12c; LogMiner doesn’t.
  • For migrations with a high volume of changes, LogMiner might have some I/O or CPU impact on the source database. Binary Reader is less likely to have this impact because the archive logs are copied to the replication instance and mined there.
  • For migrations with a high volume of changes, CDC performance is usually much better when using Binary Reader compared to LogMiner.
  • LogMiner doesn’t support updates to LOBs.

After you choose your mode for CDC, you can prepare the source Amazon RDS for Oracle database.

Setting up supplemental logging

Enable supplemental logging at the database level and for primary key columns (assuming the table has a primary key). See the following code:

exec rdsadmin.rdsadmin_util.alter_supplemental_logging('ADD');
exec rdsadmin.rdsadmin_util.alter_supplemental_logging('ADD','PRIMARY KEY');

If the source table doesn't have a primary key, enable supplemental logging for all columns instead. See the following code:

exec rdsadmin.rdsadmin_util.alter_supplemental_logging('ADD','ALL');
exec rdsadmin.rdsadmin_util.alter_supplemental_logging('DROP','PRIMARY KEY');
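
To confirm the supplemental logging configuration, you can query v$database:

select supplemental_log_data_min, supplemental_log_data_pk, supplemental_log_data_all from v$database;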

Enabling automatic backups

Make sure you enable automatic backups for the source database, which places the database in ARCHIVELOG mode.

On the Amazon RDS console, in the Management Settings section for your Amazon RDS for Oracle database instance, set the Enabled Automatic Backups option to Yes.
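
Alternatively, you can enable automatic backups from the AWS CLI by setting a nonzero backup retention period. The following is a sketch; my-oracle-db is a hypothetical instance identifier, so replace it with your own:

aws rds modify-db-instance --db-instance-identifier my-oracle-db --backup-retention-period 7 --apply-immediately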

Setting up log archival

Retain archive logs for at least 24 hours (you need additional storage for these archive logs). See the following code:

exec rdsadmin.rdsadmin_util.set_configuration('archivelog retention hours',24);
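
You can verify the retention setting with the show_configuration procedure:

set serveroutput on
exec rdsadmin.rdsadmin_util.show_configuration;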

Creating directories for Binary Reader

Create your directories with the following code:

exec rdsadmin.rdsadmin_master_util.create_archivelog_dir;
exec rdsadmin.rdsadmin_master_util.create_onlinelog_dir;
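
To confirm that the directories exist, query dba_directories as the master user:

select directory_name, directory_path from dba_directories where directory_name in ('ONLINELOG_DIR','ARCHIVELOG_DIR');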

Creating a database user for AWS DMS in the source database

Create a user (DMS_USER) in the source database with the privileges that AWS DMS requires, including support for Binary Reader mode. Make sure to replace the <password> placeholder in the following code with a secure password:

CREATE USER DMS_USER IDENTIFIED BY <password>;

GRANT CREATE SESSION TO DMS_USER;
GRANT SELECT ANY TABLE TO DMS_USER;
GRANT SELECT ANY TRANSACTION to DMS_USER;
GRANT LOGMINING TO DMS_USER;
GRANT READ ON DIRECTORY ONLINELOG_DIR TO DMS_USER; 
GRANT READ ON DIRECTORY ARCHIVELOG_DIR TO DMS_USER;


begin
rdsadmin.rdsadmin_util.grant_sys_object( p_obj_name => 'ALL_VIEWS', p_grantee => 'DMS_USER', p_privilege => 'SELECT');
rdsadmin.rdsadmin_util.grant_sys_object( p_obj_name => 'DBA_TABLESPACES', p_grantee => 'DMS_USER', p_privilege => 'SELECT');
rdsadmin.rdsadmin_util.grant_sys_object( p_obj_name => 'ALL_TAB_PARTITIONS', p_grantee => 'DMS_USER', p_privilege => 'SELECT');
rdsadmin.rdsadmin_util.grant_sys_object( p_obj_name => 'ALL_INDEXES', p_grantee => 'DMS_USER', p_privilege => 'SELECT');
rdsadmin.rdsadmin_util.grant_sys_object( p_obj_name => 'ALL_OBJECTS', p_grantee => 'DMS_USER', p_privilege => 'SELECT');
rdsadmin.rdsadmin_util.grant_sys_object( p_obj_name => 'ALL_TABLES', p_grantee => 'DMS_USER', p_privilege => 'SELECT');
rdsadmin.rdsadmin_util.grant_sys_object( p_obj_name => 'ALL_USERS', p_grantee => 'DMS_USER', p_privilege => 'SELECT');
rdsadmin.rdsadmin_util.grant_sys_object( p_obj_name => 'ALL_CATALOG', p_grantee => 'DMS_USER', p_privilege => 'SELECT');
rdsadmin.rdsadmin_util.grant_sys_object( p_obj_name => 'ALL_CONSTRAINTS', p_grantee => 'DMS_USER', p_privilege => 'SELECT');
rdsadmin.rdsadmin_util.grant_sys_object( p_obj_name => 'ALL_CONS_COLUMNS', p_grantee => 'DMS_USER', p_privilege => 'SELECT');
rdsadmin.rdsadmin_util.grant_sys_object( p_obj_name => 'ALL_TAB_COLS', p_grantee => 'DMS_USER', p_privilege => 'SELECT');
rdsadmin.rdsadmin_util.grant_sys_object( p_obj_name => 'ALL_IND_COLUMNS', p_grantee => 'DMS_USER', p_privilege => 'SELECT');
rdsadmin.rdsadmin_util.grant_sys_object( p_obj_name => 'ALL_LOG_GROUPS', p_grantee => 'DMS_USER', p_privilege => 'SELECT');
rdsadmin.rdsadmin_util.grant_sys_object('V_$ARCHIVED_LOG','DMS_USER','SELECT');
rdsadmin.rdsadmin_util.grant_sys_object('V_$LOG','DMS_USER','SELECT');
rdsadmin.rdsadmin_util.grant_sys_object('V_$LOGFILE','DMS_USER','SELECT');
rdsadmin.rdsadmin_util.grant_sys_object('V_$DATABASE','DMS_USER','SELECT');
rdsadmin.rdsadmin_util.grant_sys_object('V_$THREAD','DMS_USER','SELECT');
rdsadmin.rdsadmin_util.grant_sys_object('V_$PARAMETER','DMS_USER','SELECT');
rdsadmin.rdsadmin_util.grant_sys_object('V_$NLS_PARAMETERS','DMS_USER','SELECT');
rdsadmin.rdsadmin_util.grant_sys_object('V_$TIMEZONE_NAMES','DMS_USER','SELECT');
rdsadmin.rdsadmin_util.grant_sys_object('V_$TRANSACTION','DMS_USER','SELECT');
rdsadmin.rdsadmin_util.grant_sys_object('DBA_REGISTRY','DMS_USER','SELECT');
rdsadmin.rdsadmin_util.grant_sys_object('OBJ$','DMS_USER','SELECT');
rdsadmin.rdsadmin_util.grant_sys_object('ALL_ENCRYPTED_COLUMNS','DMS_USER','SELECT');
rdsadmin.rdsadmin_util.grant_sys_object('V_$LOGMNR_LOGS','DMS_USER','SELECT');
rdsadmin.rdsadmin_util.grant_sys_object('V_$LOGMNR_CONTENTS','DMS_USER','SELECT');
rdsadmin.rdsadmin_util.grant_sys_object('DBMS_LOGMNR','DMS_USER','EXECUTE');
rdsadmin.rdsadmin_util.grant_sys_object('V_$TRANSPORTABLE_PLATFORM','DMS_USER','SELECT');
end;
/
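
As a quick check that the grants took effect, you can connect as DMS_USER and query one of the granted views. The following sketch assumes a SQL*Plus client; the endpoint, password, and database name are placeholders:

sqlplus DMS_USER/<password>@//<db-endpoint>:1521/<db-name>

select name, current_scn from v$database;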

Setting up an AWS DMS instance and endpoints

For instructions on setting up your instance, see Creating a Replication Instance.

For this post, I create an AWS DMS instance with version 3.3.3, the latest at the time of writing. The latest version is recommended because it comes with many advanced features and performance improvements.
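
If you prefer the AWS CLI, the following sketch creates a comparable replication instance; the identifier, instance class, and storage size are illustrative values you should adjust to your workload:

aws dms create-replication-instance \
    --replication-instance-identifier dms-demo-instance \
    --replication-instance-class dms.c4.large \
    --engine-version 3.3.3 \
    --allocated-storage 100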

Creating your source endpoint

When your instance creation is complete, create the source endpoint for Amazon RDS for Oracle using the source database connection details and the DMS_USER credentials. For detailed instructions, see Creating source endpoint.

In the Extra connection attributes, make sure to use the following parameters so the endpoint uses Binary Reader for CDC. In oraclePathPrefix, change ORCL to the appropriate database name:

useLogminerReader=N;useBfile=Y;accessAlternateDirectly=false;useAlternateFolderForOnline=true;oraclePathPrefix=/rdsdbdata/db/ORCL_A/;usePathPrefix=/rdsdbdata/log/;replacePathPrefix=true
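
The same endpoint can be created from the AWS CLI. The following is a sketch with placeholder connection details; the extra connection attributes are the ones shown above:

aws dms create-endpoint \
    --endpoint-identifier oracle-source \
    --endpoint-type source \
    --engine-name oracle \
    --username DMS_USER \
    --password <password> \
    --server-name <rds-endpoint> \
    --port 1521 \
    --database-name ORCL \
    --extra-connection-attributes "useLogminerReader=N;useBfile=Y;accessAlternateDirectly=false;useAlternateFolderForOnline=true;oraclePathPrefix=/rdsdbdata/db/ORCL_A/;usePathPrefix=/rdsdbdata/log/;replacePathPrefix=true"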

Creating your target endpoint

To create your target Amazon S3 endpoint, use the destination S3 bucket details along with the service access role ARN of the role you created to access the S3 bucket. For detailed instructions, see Creating target endpoint.
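
As a sketch, the equivalent AWS CLI call looks like the following; <account-id> is a placeholder for your AWS account ID:

aws dms create-endpoint \
    --endpoint-identifier s3-target \
    --endpoint-type target \
    --engine-name s3 \
    --s3-settings '{"ServiceAccessRoleArn":"arn:aws:iam::<account-id>:role/dmsdemos3bucketrole","BucketName":"dmsdemos3bucket","CompressionType":"NONE"}'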

Creating a task for the full load

Before you initiate the full load, record the current SCN in the source database so you can use it later as the starting point for change data capture. See the following code:

select to_char(current_scn) from v$database;

To create the task for the full load, see the detailed instructions at Creating a DMS task. Make sure you choose Migrate existing data. As of this writing, full LOB mode isn't supported with Amazon S3 as the target, so the only option available is Limited LOB mode. Choose Limited LOB mode and set Maximum LOB size to the largest LOB size in the source table.
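
In the task settings JSON, this choice corresponds to the TargetMetadata section. The following sketch enables limited LOB mode with an illustrative 32 KB maximum; set LobMaxSize to the largest LOB in your source table:

"TargetMetadata": {
    "FullLobMode": false,
    "LimitedSizeLobMode": true,
    "LobMaxSize": 32
}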

After you create the full load task, you can monitor its progress on the Table Statistics tab. For more information, see Monitoring AWS DMS tasks.

Creating a task for CDC

The next step is to continuously replicate data changes from the Amazon RDS for Oracle database to Amazon S3. To create this task, see the detailed instructions at Creating a DMS task. Make sure you choose Replicate data changes only and specify the SCN you recorded before the full load as the CDC start point. As mentioned earlier, full LOB mode isn't supported with Amazon S3 as the target, so choose Limited LOB mode and set Maximum LOB size to the largest LOB size in the source table.
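
The following is a minimal AWS CLI sketch of this task; the ARNs and table mappings file are placeholders, and the CDC start position is the SCN you recorded before the full load:

aws dms create-replication-task \
    --replication-task-identifier cdc-to-s3 \
    --source-endpoint-arn <source-endpoint-arn> \
    --target-endpoint-arn <target-endpoint-arn> \
    --replication-instance-arn <replication-instance-arn> \
    --migration-type cdc \
    --cdc-start-position "<scn-from-full-load>" \
    --table-mappings file://mappings.json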

After you create and start the task, the DML transactions captured from the recorded SCN onward (including changes made during the full load) are applied to the S3 bucket. For CDC, an additional column is present in the target with a value of I, D, or U against each row in the CSV file. The values represent whether the change was an insert, delete, or update, respectively. Full load output doesn't include this column.

How CDC works for LOB columns

In this section, I explain how AWS DMS CDC works for tables with LOB data under different scenarios.

For these use cases, I use a table named test1 at the source, created as follows:

create table test1 (
    col1 int primary key,
    col2 clob,
    col3 varchar2(10)
);

To insert a row into the source table, enter the following code:

insert into test1 values (1, 'row1', 'test1');
commit;

In the target, you have a new row in the CSV file. The first column indicates the operation, followed by the values of col1, col2, and col3:

I,1,row1,test1

To update the LOB column in the source table, enter the following code:

update test1 set col2='newrow' where col1=1;
commit;

In the target, you have a new row in the CSV file.

U,1,newrow,test1

To update a non-LOB column in the same row, enter the following code:

update test1 set col3='test3' where col1=1;
commit;

In the target, you have a new row in the CSV file.

U,1,,test3

In this use case, the LOB column is blank because the update didn't change the LOB column value.

To delete the row, enter the following code:

delete from test1 where col1=1;
commit;

In the target, you have a new row in the CSV file.

D,1,,test3

In the third and fourth use cases, updates and deletes that don't change the LOB column result in a blank LOB column for that row in the target. This is because supplemental logging in Oracle doesn't capture LOB, LONG, and ADT columns unless they're explicitly updated. Keep this in mind while migrating LOB data during CDC, and configure any application that consumes the data in the Amazon S3 data lake to handle these cases.

Performance considerations

When implementing this solution, consider the following:

  • Limited LOB mode – Using limited LOB mode considerably improves performance when migrating LOB data. Additionally, full LOB mode currently isn't supported with Amazon S3 as the target.
  • Loading multiple tables in parallel – By default, AWS DMS loads eight tables at a time. You can improve performance by choosing a larger replication instance and increasing the maximum number of tables to load in parallel in the Advanced settings when creating the task (see the sketch after this list).
  • Choosing the optimum size for a replication instance – The AWS DMS instance can use a significant amount of memory and CPU when you have many tables or LOB data types. It’s recommended to set up a larger replication instance. You can also boost your throughput by running multiple tasks in parallel, if needed. Make sure you monitor CPU and memory consumption as you run your tests and right-size as appropriate.
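
The parallel table load setting mentioned in the previous list maps to MaxFullLoadSubTasks in the task settings JSON. The following sketch raises it from the default of 8 to an illustrative 16:

"FullLoadSettings": {
    "MaxFullLoadSubTasks": 16
}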

Conclusion

In this post, I walked you through the steps to effectively migrate tables with LOB data from an Amazon RDS for Oracle database to Amazon S3. I explained with examples how this LOB data is represented in a CSV file in Amazon S3 during CDC. I also explained why Binary Reader mode and limited LOB mode are required to migrate LOB data. For more information, see Using Amazon S3 as a Target for AWS Database Migration Service.

This post also covered some performance considerations while using AWS DMS. For more information about AWS DMS best practices, see Improving the performance of an AWS DMS migration.

Similarly, you can migrate data from on-premises or Amazon Elastic Compute Cloud (Amazon EC2)-based Oracle databases to Amazon S3 with minor adjustments.

I hope you found this post informative and learned an effective way to build a data lake using AWS DMS with LOB data from a source database such as Oracle. AWS welcomes your feedback and comments.

About the Author

Harish Shenoy is a Senior Technical Account Manager for AWS Enterprise Support. He has been working with various database technologies for more than 9 years. He is a great advocate of automation and has designed and implemented multiple database automation tools in his career. He works with enterprise customers to design, implement, and support complex cloud infrastructures. He is a sports enthusiast who enjoys playing cricket, ping pong, badminton, and tennis. He spends his free time traveling and exploring different places with his wife and daughter.