AWS Database Blog
Continuous database replication using AWS DMS to migrate from Oracle to Amazon Aurora
This is a guest post from IPG. In their own words, “IPG, based in Japan, specializes in TV-related data. We use data from broadcasters across Japan, format it for easy use and understanding, structure it with metatags, and make it simple to use on any platform, like smartphones.”
This post discusses the different optimizations we tested during our migration from Oracle to Amazon Aurora with MySQL compatibility using AWS DMS, shows the shortcomings and benefits of each solution, and describes what worked best for our use case.
The following diagram illustrates how IPG moves data from satellites, broadcasters, and content providers.
We initially used an Oracle database to store and manage TV program data and its related information. Our input data has the following details:
- EPG data – Approximately 250,000 programs (TV program information for 8 days ahead, including terrestrial, BS, CS110°, and the new 4K8K satellite broadcasts)
- Broadcasters – Approximately 200 broadcasters and around 1,000 channels
- Content providers – Approximately 10 companies
This data is sent to 30 companies to be used in their services.
For system operations, we used a complex network of software programs, including 40 batch programs, 10 operation UIs, and 40 API services.
Concerns with Oracle
While this system was durable and stable, we had the following concerns:
- How the database would respond to scale and updates over time, considering that Oracle tightly couples its components, which makes it challenging to scale and evolve.
- Only a few engineers on our team were knowledgeable about Oracle databases.
Deciding to migrate
Given our concerns, we decided to migrate to a system that could respond quickly to changes and unpredictable business requirements. We looked for a new environment that allowed us to develop consumer and API services on top of the existing data. We also needed to keep the existing system fully stable while we migrated. We discovered AWS DMS, which can continuously replicate existing data, including between different database engines.
Companies often choose PostgreSQL as the target database when migrating off Oracle because PostgreSQL offers many features similar to Oracle's. However, because our company already runs MySQL databases and has in-house MySQL expertise, we targeted MySQL. We checked the compatibility of MySQL with our Oracle database using AWS SCT. AWS SCT generates a schema into which the data physically fits, but it is not designed to produce a schema optimized for the target database; further optimization is always required when migrating with AWS SCT. From the AWS SCT assessment, we found that it could automatically convert 100% of our database storage objects. Of our database code objects, it could convert 100% of view objects and 67% of function objects.
Based on this result, we chose MySQL as the target database and decided to use Aurora with MySQL compatibility for improved performance and scalability.
Discovering obstacles
We attended a free, hands-on AWS DMS workshop held by AWS Japan. We learned basic AWS DMS operations and confirmed that AWS DMS and AWS SCT would be essential for a simple and easy migration. The workshop also covered using AWS DMS with AWS CloudFormation, so we could template our infrastructure instead of rebuilding it from scratch for every duplicate environment.
Back at the office, we tried to create a staging system but ran into the following obstacles:
- The initial full load did not produce favorable results because the process overloaded the source database (Oracle); read throughput on the source was high, and network bandwidth reached its upper limit.
- After 10 hours, the full load could not finish due to the following factors:
- The schema that AWS SCT generated automatically caused a long full load because foreign keys and indexes were still set.
- Because we did not explicitly specify table mapping, temporary tables were automatically created. This caused an error in the target table.
- The instance size of the target database was small, which affected the processing speed for writing.
- The size of the data we needed to migrate was large.
Implementing the solution
After the preceding issues became apparent, we proceeded to optimize our migration process.
At the time, we maintained the existing Oracle environment while running the new MySQL environment in parallel, so our infrastructure costs doubled. We realized that we had to keep operational costs in the new environment low, with the minimum possible infrastructure configuration. We also realized that we could not easily shut down or reconfigure the existing Oracle environment; even if a failure occurred and we had to re-execute the initial load, the impact on the Oracle environment had to be minimal. Lastly, we found that shortening the initial load time contributed to shortening the recovery time objective (RTO) in disaster recovery and to improving the service level.
With these lessons in mind, we made the following adjustments:
- Choose appropriate instance sizes for the source, replication, and target
- List the tables to migrate
- Reduce foreign keys and indexes in the target database (we reduced the number of indexes while moving the data and recreated them afterward)
To reduce the time required for a full load, we changed the required AWS infrastructure configuration from a large instance size to a 4xlarge size during the full load operation. For change data capture (CDC), which applies ongoing changes, we needed to reduce running costs by using the minimum instance size. There is also a restriction that you cannot change the configuration of the existing Oracle environment instance during the full load operation. To solve these problems and operate efficiently, we completed the following steps:
1. Take a snapshot of the source database (Oracle) and restore an Oracle instance for the full load from that snapshot. As part of this step:
   - Record the starting time of snapshot creation.
   - Do not connect the existing business system to the restored Oracle instance.
   - Select a large instance size for the Oracle instance used for the full load.
The following diagram illustrates the snapshot and restore architecture.
2. Execute the full load task using the Oracle instance for the full load as the source database. Select a large instance size for the replication instance and the target database instance.
The following diagram illustrates the full load task architecture.
3. After the full load is complete, delete the Oracle instance used for the full load, and scale down the instance sizes of the replication instance and the target database.
The following diagram illustrates the change in architecture.
4. Execute the CDC task using the Oracle instance in the existing environment as the source database. Calculate the system change number (SCN) from the snapshot creation time recorded in Step 1; see the example code after this list.
5. Choose Specify log sequence number for CDC start mode, and set the SCN to the value you calculated.
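As an illustration of the SCN calculation in Step 4, here is a minimal sketch assuming the python-oracledb driver and placeholder connection details; it queries Oracle's TIMESTAMP_TO_SCN function with the snapshot creation time recorded in Step 1.

```python
# Sketch: derive the SCN for the CDC start point from the snapshot creation time.
# Assumes the python-oracledb driver; connection details are placeholders.
import oracledb

conn = oracledb.connect(user="admin", password="********",
                        dsn="oracle-source.example.com:1521/ORCL")

snapshot_time = "2019-09-25 12:00:00"  # starting time of snapshot creation (Step 1)

with conn.cursor() as cur:
    # TIMESTAMP_TO_SCN maps a timestamp to the nearest system change number.
    cur.execute(
        "SELECT TIMESTAMP_TO_SCN(TO_TIMESTAMP(:ts, 'YYYY-MM-DD HH24:MI:SS')) FROM dual",
        ts=snapshot_time,
    )
    scn = cur.fetchone()[0]

print(f"CDC start SCN: {scn}")
```

Note that Oracle retains the timestamp-to-SCN mapping only for a limited period, so this conversion should be run reasonably soon after the snapshot is taken.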
The following diagram shows the CDC task architecture.
The following diagram illustrates the new solution architecture.
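To make steps 3 through 5 more concrete, the following sketch shows how the scale-down and the CDC start at a specific SCN could be scripted with boto3. The ARNs, identifiers, and instance classes are placeholders rather than our actual values; the same operations can also be performed from the console or the AWS CLI.

```python
# Sketch: scale down after the full load, then start CDC from the calculated SCN.
# ARNs, identifiers, and instance classes below are placeholders.
import boto3

dms = boto3.client("dms")
rds = boto3.client("rds")

# Scale the replication instance down for CDC (for example, 4xlarge -> large).
dms.modify_replication_instance(
    ReplicationInstanceArn="arn:aws:dms:ap-northeast-1:123456789012:rep:EXAMPLE",
    ReplicationInstanceClass="dms.c5.large",
    ApplyImmediately=True,
)

# Scale the Aurora writer instance down (for example, 4xlarge -> 2xlarge).
rds.modify_db_instance(
    DBInstanceIdentifier="aurora-mysql-target-instance",
    DBInstanceClass="db.r5.2xlarge",
    ApplyImmediately=True,
)

# Start the CDC task from the SCN calculated from the snapshot creation time.
dms.start_replication_task(
    ReplicationTaskArn="arn:aws:dms:ap-northeast-1:123456789012:task:EXAMPLE",
    StartReplicationTaskType="start-replication",
    CdcStartPosition="1234567890",  # the calculated SCN
)
```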
The source database contained approximately 500 tables and 1.5 billion rows (240 GB) of data. We reduced this to approximately 160 tables and 350 million rows (65 GB) by selecting only the tables the service actually needs and leaving the unnecessary tables in Oracle. In the table mapping rules attached to the migration task, we created a rule for every table and explicitly set the rule-action to include (migrate) or exclude (do not migrate).
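As a rough illustration of such mapping rules (with hypothetical schema and table names, not our actual task definition), explicit include and exclude selection rules can be attached to the task like this with boto3:

```python
# Sketch: explicit include/exclude table mapping rules for the DMS migration task.
# Schema names, table names, and ARNs are hypothetical.
import json
import boto3

dms = boto3.client("dms")

table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-programs",
            "object-locator": {"schema-name": "IPG", "table-name": "PROGRAMS"},
            "rule-action": "include",
        },
        {
            "rule-type": "selection",
            "rule-id": "2",
            "rule-name": "exclude-legacy-log",
            "object-locator": {"schema-name": "IPG", "table-name": "LEGACY_LOG"},
            "rule-action": "exclude",
        },
        # ...one selection rule per table, so nothing is migrated implicitly
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-aurora-full-load",
    SourceEndpointArn="arn:aws:dms:ap-northeast-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:ap-northeast-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:ap-northeast-1:123456789012:rep:EXAMPLE",
    MigrationType="full-load",
    TableMappings=json.dumps(table_mappings),
)
```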
We also tried removing all foreign key constraints and indexes from the target database schema for the full load operation, performing the full load, and then restoring the foreign key constraints and indexes to their tables.
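On the MySQL side, that approach boils down to dropping the constraints and secondary indexes before the full load and recreating them afterward. The following is a minimal sketch assuming the PyMySQL driver and hypothetical table, constraint, and index names; as described in the results below, we ultimately kept the constraints and indexes in place during the full load.

```python
# Sketch: drop a foreign key and index before the full load, restore them after.
# Assumes the PyMySQL driver; table, constraint, and index names are hypothetical.
import pymysql

conn = pymysql.connect(host="aurora-mysql-target.example.com",
                       user="admin", password="********", database="ipg")

with conn.cursor() as cur:
    # Before the full load: remove constraints and secondary indexes.
    cur.execute("ALTER TABLE programs DROP FOREIGN KEY fk_programs_channel")
    cur.execute("ALTER TABLE programs DROP INDEX idx_programs_start_time")

    # ...run the AWS DMS full load task here...

    # After the full load: restore the constraints and indexes.
    cur.execute("""ALTER TABLE programs
                   ADD CONSTRAINT fk_programs_channel
                   FOREIGN KEY (channel_id) REFERENCES channels (channel_id)""")
    cur.execute("ALTER TABLE programs ADD INDEX idx_programs_start_time (start_time)")

conn.commit()
```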
Implementation result
The implementation yielded positive results. First, by using the instance restored from the snapshot as the source database for the full load and increasing each instance size (source, replication, and target), we reduced the full load time by approximately 6 hours. During the full load, we used 4xlarge instances for the source, replication, and target. For CDC execution, we scaled down the replication instance (from 4xlarge to large) and the target database (from 4xlarge to 2xlarge).
The following table summarizes the instance size and full load times.
Instance Size | Full Load Time |
Source database: large, Replication instance: large, Target database: large | 7 hours, 36 minutes |
Source database: 2xlarge, Replication instance: 2xlarge, Target database: 2xlarge | 2 hours, 30 minutes |
Source database: 4xlarge, Replication instance: 4xlarge, Target database: 4xlarge | 1 hour, 46 minutes |
We reduced the full load time by approximately 5 hours by adjusting the amount of data. We wanted to cut migration time because the full load might have to run more than once, especially if there were issues with the CDC operation.
The following table summarizes the full load times from this adjustment.
Data Volume Adjustment | Full Load Time |
Not Applied | 12 hours |
Applied | 7 hours |
Although removing the indexes cut the full load time to less than half, reassigning the foreign key constraints and indexes after the full load took longer than the full load itself. This approach may be useful when there are few rows per table, but it did not suit our use case. Therefore, we performed the full load with foreign key constraints and indexes in place.
The following table summarizes the full load time with and without indexes.
Execution Method | Full Load Time |
Full load with indexes | 12 hours |
Full load without indexes | 5 hours |
Full load without indexes, plus re-adding indexes afterward (8 parallel processes) | 13 hours |
Conclusion
Throughout this process of experimentation and testing, we learned that the micro-optimizations we attempted in the first few implementations had minor benefits, but none were as impactful as the final implementation of scaling the instances. With this implementation, we achieved the following excellent results:
- We did not impose the full load workload on the existing environment, because the full load ran against an instance restored from a snapshot
- We shortened the RTO by allocating more instance resources during the full load and completing the full load tasks in a short time
- We minimized operational costs at CDC execution by choosing the minimum instance size required for the task
The following diagram illustrates the architecture of the new system.
We also plan to make future improvements. First, we plan to address data that can be lost or duplicated because of an ambiguous CDC starting point. AWS DMS allows you to start CDC from a specified SCN. Because AWS DMS is a logical replication tool, it can eventually make the target consistent with the source, but choosing the wrong SCN risks data loss. We therefore recommend specifying a slightly earlier SCN so that the target eventually becomes consistent, even if a few transactions are replayed.
We also intend to improve our procedure for rerunning the full load. When the full load is performed again, the data in the target database is cleared first. Because we shortened the RTO, we also shortened the data loss window (the period during the full load), but we would like to make it even shorter if possible. For example, we could keep the target database data as it was when CDC stopped so that the service keeps working, or complete the full load in another cluster and then switch clusters.
This was a massive and exciting project for us, and we're happy to share what we learned from the experience. For more information, see the recap of our October 2019 AWS DMS seminar on the IPG developer diary website (the recap is in Japanese). Best of all, we can provide our users with services and value by efficiently using AWS solutions.
About the Author
Toru Kimura is a Technical Program Manager who works for IPG, Inc. He is responsible for the deployment of the platform system (minds) managing TV program information, and for the machine learning project team.