AWS Database Blog

Migrate billions of records from an Oracle data warehouse to Amazon Redshift using AWS DMS

Customers are migrating to Amazon Redshift to modernize their data warehouse solution and help save on their licensing, support, operations, and maintenance costs. To migrate data from an on-premises data warehouse to Amazon Redshift, you can use services such as AWS Database Migration Service (AWS DMS), AWS Schema Conversion Tool (AWS SCT), Amazon Simple Storage Service (Amazon S3), and AWS Snowball Edge.

The type of service to choose for the data migration varies depending on your use case, dataset size, and other software prerequisites. Migrating large datasets has its challenges, and it’s essential to develop a migration strategy that accounts for encryption requirements, storage requirements, network connectivity, and data transfer speeds between the source and target data stores. Besides the technical factors, you should also be aware of the non-technical aspects of your migration, such as the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) goals of your organization.

While migrating a large volume of data from an Oracle database to Amazon Redshift, you may run into challenges such as an ORA-01555: Snapshot too old error, which can occur when the rate of change on the source table is too high for the undo tablespace to retain the data needed for a consistent read. In this post, we walk you through an approach to overcome this challenge. We show how to migrate a large table (approximately 500 GB in size) in an on-premises Oracle data warehouse running on Oracle Real Application Clusters (Oracle RAC) to Amazon Redshift using AWS DMS.

Solution overview

The AWS resources for this example solution are Amazon Redshift and AWS DMS. Let’s review these resources.

Amazon Redshift is an enterprise-level, petabyte-scale, cloud data warehousing solution. With Amazon Redshift, you can analyze large volumes of data using existing business intelligence tools. Amazon Redshift integrates with other AWS services to provide features such as querying the data lake with Amazon Redshift Spectrum, semi-structured data ingestion, streaming ingestion from Amazon Kinesis or Amazon Managed Streaming for Apache Kafka (Amazon MSK), and federated queries to Amazon Aurora and Amazon Relational Database Service (Amazon RDS) operational databases. Amazon Redshift Serverless allows you to run analytics workloads with high performance; it scales automatically and is ideal for variable, periodic, and spiky workloads.

AWS DMS is a service for migrating data between source and target data stores. The source and target data stores can use the same database engine or different engines, but at least one of them must reside in the AWS Cloud.

Depending on the volume of data to migrate, you can use either AWS SCT or AWS DMS. AWS SCT is the preferred tool for migrating workloads larger than 10 TB, and AWS DMS is traditionally used for workloads smaller than 10 TB. Also, be mindful that you can install AWS SCT only on specific operating systems; refer to Installing AWS SCT for the list of supported operating systems. When using AWS DMS to migrate data from large Oracle tables, DBAs may run into an ORA-01555: Snapshot too old error. One way to avoid this error is to break a long-running process into a series of smaller steps, so in this solution we split the migration of 5 billion records across five AWS DMS tasks.

The following diagram illustrates our solution architecture.

Parallel-load process

Let’s review the concept of the parallel-load process before we dive into the solution.

The parallel-load process helps migrate large volumes of records faster and more efficiently. It can migrate a single segmented table using several threads in parallel. AWS DMS splits a full load task into threads, with each table segment allocated to its own thread. You can segment tables by ranges of column values; a column with sequential values, such as a primary key, works well. We provide the code snippet for the parallel-load process later in this post.

Configure the following parameters in the AWS DMS task settings to expedite the data migration:

  • ParallelLoadThreads specifies the number of threads that AWS DMS uses to load data into each table on the target database. The default value is 0.
  • ParallelLoadBufferSize specifies the maximum number of records to store in the buffer that the parallel load threads use before writing to the target database tables. The default value is 50.

Refer to Working with AWS DMS tasks for more information on creating and modifying an AWS DMS task.
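To make this concrete, the following is a minimal sketch of where these two parameters sit in the AWS DMS task settings document, using the AWS SDK for Python (Boto3) convention of passing settings as a JSON string. The values shown are illustrative placeholders, not tuned recommendations:

import json

# Sketch of the relevant portion of an AWS DMS task settings document.
# ParallelLoadThreads and ParallelLoadBufferSize live under TargetMetadata;
# the values below are illustrative only.
task_settings = {
    "TargetMetadata": {
        "ParallelLoadThreads": 4,       # threads per table load (default 0)
        "ParallelLoadBufferSize": 100,  # records buffered per thread (default 50)
    }
}

# The settings are passed to AWS DMS as a JSON string when you create or modify a task.
print(json.dumps(task_settings, indent=2))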

Prerequisites

The following are the prerequisites needed to perform the migration from an Oracle database to Amazon Redshift:

  • An active AWS account.
  • An Amazon Redshift cluster configured, and up and running.
  • An Oracle database instance in an on-premises environment or the AWS Cloud.
  • Network connectivity between the on-premises data center and the AWS Cloud.
  • An AWS DMS replication instance configured, including source and target endpoints for the data stores.
  • Availability of sufficient storage space on the source database server to hold the database archive logs.
  • Binary Reader mode configured for the AWS DMS source endpoint.
  • Verify that there are no long-running operations on the table before you start the copy process. Use the following SQL to check for long-running operations on the source database:
COLUMN percent FORMAT 999.99

SELECT sid,
       TO_CHAR(start_time, 'hh24:mi:ss') stime,
       message,
       (sofar / totalwork) * 100 percent
FROM   v$session_longops
WHERE  totalwork != 0
AND    sofar / totalwork < 1;

Migrate data from an Oracle data warehouse to Amazon Redshift

Let’s walk through the approach we used to migrate 5 billion rows from an Oracle data warehouse on a four-node RAC system using Automatic Storage Management (ASM) to Amazon Redshift.

  1. Pause writes during the copy process and capture the timestamp after copying the data, or let table writes continue and capture the timestamp at the beginning of the copy process. We used the first approach, because the latter requires manual reconciliation of the records during the cutover.
  2. Copy the data into five tables with 1 billion records each:
CREATE TABLE <SOURCE_TABLE_PART1> AS
  SELECT * FROM <SOURCE_TABLE> WHERE <PRIMARY_KEY> BETWEEN 1 AND 1000000000;

-- Repeat for parts 2 through 4 with non-overlapping key ranges
-- (BETWEEN is inclusive on both ends).

CREATE TABLE <SOURCE_TABLE_PART5> AS
  SELECT * FROM <SOURCE_TABLE> WHERE <PRIMARY_KEY> BETWEEN 4000000001 AND 5000000000;

With this method, AWS DMS reads the data from the smaller tables instead of one big table for a full load.

  3. Configure Binary Reader and ParallelAsmReadThreads in the extra connection attributes of your AWS DMS source endpoint:
useLogMinerReader=N;useBfile=Y;asm_user=<asm username>;asm_server=<IP address>:<port number>/+ASM;parallelASMReadThreads=8;readAheadBlocks=150000;

We recommend setting ParallelAsmReadThreads to 8 for an Oracle ASM source. Refer to Using an Oracle database as a source for AWS DMS for more information on configuring extra connection attributes for an Oracle source endpoint. The AWS Knowledge Center also explains how to add or modify extra connection attributes for an AWS DMS endpoint.

  4. Create a full-load-only AWS DMS task for each of the five tables of 1 billion rows to migrate the data to the target Amazon Redshift tables.

For this solution, we created five tasks, one for each table of 1 billion records. Refer to Creating a task for instructions.
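If you prefer scripting over the console, the following is a minimal sketch of creating the five full load tasks with the AWS SDK for Python (Boto3). The ARNs, task identifiers, and table mapping file names are placeholders you would replace with your own:

import boto3

dms = boto3.client("dms")

# Placeholder ARNs -- replace with the ARNs of your own AWS DMS resources.
SOURCE_ENDPOINT_ARN = "arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE"
TARGET_ENDPOINT_ARN = "arn:aws:dms:us-east-1:123456789012:endpoint:TARGET"
REPLICATION_INSTANCE_ARN = "arn:aws:dms:us-east-1:123456789012:rep:INSTANCE"

# One full-load-only task per part table (part 1 through part 5).
for part in range(1, 6):
    # Hypothetical per-part table mapping file that selects <SOURCE_TABLE_PARTn>.
    with open(f"table-mappings-part{part}.json") as f:
        table_mappings = f.read()

    dms.create_replication_task(
        ReplicationTaskIdentifier=f"oracle-to-redshift-part{part}",
        SourceEndpointArn=SOURCE_ENDPOINT_ARN,
        TargetEndpointArn=TARGET_ENDPOINT_ARN,
        ReplicationInstanceArn=REPLICATION_INSTANCE_ARN,
        MigrationType="full-load",
        TableMappings=table_mappings,
    )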

  5. In the task settings of your AWS DMS task, modify the parameter values as per the following table.

Parameter Name         | Value
---------------------- | -----
ParallelLoadThreads    | 4
ParallelLoadBufferSize | 100
CommitRate             | 50000

The preceding values must be modified for each of your AWS DMS tasks. For instructions, refer to Modifying a task.
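As a sketch of applying these values programmatically, the following Boto3 call passes the corresponding task settings JSON to an existing (stopped) task; the task ARN is a placeholder. ParallelLoadThreads and ParallelLoadBufferSize belong to the TargetMetadata section and CommitRate to FullLoadSettings. Depending on your workflow, you may prefer to retrieve the full task settings document, edit these fields, and pass the complete document back:

import json
import boto3

dms = boto3.client("dms")

# Values from the preceding table.
settings = {
    "TargetMetadata": {
        "ParallelLoadThreads": 4,
        "ParallelLoadBufferSize": 100,
    },
    "FullLoadSettings": {
        "CommitRate": 50000,
    },
}

dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:us-east-1:123456789012:task:PART1",  # placeholder
    ReplicationTaskSettings=json.dumps(settings),
)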

As described earlier, we need to create a table mapping rule of type table-settings to use parallel load. In our case, we used the sequential primary key to split the records for each task into multiple segments. Choose the boundaries carefully so that no records are missed, and choose columns that are indexed on the source so that the range queries can use those indexes to speed up ingestion.

  6. Use the following sample code snippet for the table-settings rule type with parallel load:
    {
    	"rule-type":"table-settings",
    	"rule-id":"2",
    	"rule-name":"2",
    	"object-locator":{
    		"schema-name":"your_schema",
    		"table-name":"your_table"
    	},
    	"parallel-load":{
    		"type":"ranges",
    		"columns":[
    		  "id_column_name"
    		],
    		"boundaries":[
    		  [
    			  "25000000"
    		  ],
    		  [
    			   "50000000"
    		  ],
    		  [
    			   "75000000"
    		  ],
    		  [
    			   "100000000"
    		  ]
    		]
    	}
    }

Because we set ParallelLoadThreads to 4, we defined boundaries that split the data into segments that those four threads can load in parallel. You can adjust the number of threads and boundaries as needed. Apply the mapping rule to each of the full load tasks and adjust the boundaries for its key range. For more information, refer to Using table mapping to specify task settings.

  7. Now start the full load AWS DMS tasks. In our case, we ran two AWS DMS tasks at a time.

The number of AWS DMS tasks you can run in parallel depends on many factors, such as network connectivity, network bandwidth, activity on the source table, the size and number of large objects (LOBs), the AWS DMS replication instance size and type, and more. We used a dms.r5.xlarge AWS DMS replication instance. Refer to Selecting the best size for a replication instance for guidance.
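As a minimal Boto3 sketch with placeholder task ARNs, starting two of the full load tasks looks like the following; start the remaining tasks as earlier ones finish:

import boto3

dms = boto3.client("dms")

# Placeholder ARNs for the two tasks run at a time.
task_arns = [
    "arn:aws:dms:us-east-1:123456789012:task:PART1",
    "arn:aws:dms:us-east-1:123456789012:task:PART2",
]

for arn in task_arns:
    dms.start_replication_task(
        ReplicationTaskArn=arn,
        StartReplicationTaskType="start-replication",  # first run of the task
    )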

  8. When the full load AWS DMS tasks are complete and the data is migrated to the target tables, merge the five target tables into a single table in the target Amazon Redshift data warehouse:
INSERT INTO MERGED_TABLE
SELECT * FROM TABLE1
UNION ALL
SELECT * FROM TABLE2
UNION ALL
-- ... continue with TABLE3 and TABLE4 ...
SELECT * FROM TABLE5;

We use UNION ALL because the part tables cover non-overlapping key ranges, so there is no need for the duplicate elimination that UNION performs. After we migrate the data from the five tables to the merged Amazon Redshift table, we need to replicate the ongoing changes to keep the source and target tables in sync. For this purpose, we configure and run a new AWS DMS change data capture (CDC) task for ongoing replication from the source Oracle data warehouse table to the merged table in Amazon Redshift. For instructions, refer to Creating a task.

  9. In the CDC settings of this AWS DMS task, use the timestamp captured in Step 1 as the CDC start time to perform the final cutover.
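The following is a minimal Boto3 sketch of creating that CDC-only task with the captured cutover timestamp; the ARNs, timestamp value, and table mapping file are placeholders:

from datetime import datetime, timezone

import boto3

dms = boto3.client("dms")

# Timestamp captured in Step 1, after the copy completed (placeholder value).
cdc_start = datetime(2023, 1, 15, 2, 30, 0, tzinfo=timezone.utc)

# Hypothetical table mapping that maps the source table to the merged Redshift table.
with open("table-mappings-merged.json") as f:
    table_mappings = f.read()

dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-redshift-cdc",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="cdc",
    TableMappings=table_mappings,
    CdcStartTime=cdc_start,  # ongoing replication starts from this point in time
)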

Clean up

To clean up and prevent incurring further costs, complete the following steps:

  1. Consider downsizing your AWS DMS replication instance for the CDC.
  2. Delete the AWS DMS tasks for the temporary tables you created.
  3. Delete the temporary database tables in the source and the target data stores.
  4. After the cutover, delete the AWS DMS replication instance and the AWS DMS task.

Conclusion

AWS DMS is a comprehensive service that supports over 20 database and analytic engines to migrate data from commercial database engines such as Oracle and Microsoft SQL Server. We can help you in this database migration journey through our network of AWS Professional Services, Database Freedom partners, and Amazon Database Migration Accelerator.

In this post, we shared an approach that you can use to migrate 5 billion records from an on-premises Oracle data warehouse table to Amazon Redshift. AWS DMS supports a range of homogeneous and heterogeneous data replications. For a complete list, refer to Sources for AWS DMS and Targets for AWS DMS. For further reading and guidance on AWS DMS, refer to How AWS Database Migration Service works and Best practices for AWS Database Migration Service.

If you have questions or suggestions, leave a comment.


About the authors

Keerthi Maryada serves as a Delivery Practice Manager within the Public Sector Proserve team at Amazon Web Services. Her primary responsibilities involve working closely with customers to offer technical oversight and guidance on data and analytics projects, helping to optimize their solutions when utilizing AWS. Beyond work, she enjoys spending her free time engaged in activities such as long runs, travel, reading, and quality family time.

Prathap Thoguru is a Technical Leader and an Enterprise Solutions Architect at AWS. He’s an AWS certified professional in nine areas and specializes in data and analytics. He helps customers get started on and migrate their on-premises workloads to the AWS Cloud. He holds a Master’s degree in Information Technology from the University of Newcastle, Australia.


Javeed Mohammed is a Database Specialist Solutions Architect with Amazon Web Services. He works with the Amazon RDS team, focusing on commercial database engines like Oracle. He enjoys working with customers to help design, deploy, and optimize relational database workloads on the AWS Cloud.