AWS Cloud Operations Blog

How TMAP Migrated their large Oracle Database to Amazon Aurora MySQL using AWS DMS

Launched in 2002, TMAP Mobility is South Korea’s leading mobility platform with 20 million registered users and 14 million monthly active users. TMAP provides navigation services based on a wide range of real-time traffic information and data.

TMAP is growing vertical offerings that add value to its navigation services, such as user profiles, ads, driving patterns, and surrogate driving. Although the navigation services are primarily hosted in their on-premises infrastructure, the vertical services are being developed on AWS or migrated to AWS. TMAP learned that AWS makes it possible to quickly and easily verify, develop, and launch services.

In 2022, TMAP migrated the “User Profile” application and database, one of their core navigation backend services, to AWS. The “User Profile” service manages user-specific data and provides APIs to personalize user experiences such as Favorites, Recent Destinations, Profiles, and Preferences. These APIs performed single-item selects, inserts, and updates, and processed requests of up to 6,000 TPS on an Oracle database. TMAP chose Amazon Aurora MySQL as the target database to save licensing costs and because of its high performance, availability, and scalability. Additionally, as TMAP was already using MySQL, their technical staff had the necessary skills to quickly adopt and use Amazon Aurora MySQL.

The TMAP team partnered with the AWS team to successfully complete the migration of their complex core services database to the AWS Cloud. In this blog post, we will share the challenges TMAP faced during the migration and the solutions implemented to overcome them. We hope this helps others who are engaged in similar migration efforts.

Migration challenges

TMAP planned to migrate the user profile service to AWS within six months. The scope of the migration included the web application that received user requests, the main Oracle database that stored user profiles, and a smaller MySQL database that stored purchase history.

The main database, based on Oracle 11g, was around 1 TB in size, with 3 schemas and 163 tables. The smaller MySQL database was around 5 GB in size, with 28 tables in a single schema. Because the main database received 30,000 to 40,000 QPS (queries per second) of real-time requests from web application servers for storing bookmarks and recent route searches, it was crucial to minimize the business impact of any failures during migration.

Furthermore, the user profile services are also consumed by other related services such as user driving score, advertisements, user authentication, products, and vouchers. Minimizing real-time replication delays and securing data consistency between the heterogeneous databases was key to maintaining the quality of the overall navigation service.

TMAP faced the following challenges during the migration process:

  • Minimizing any performance-related impact on the production DB
  • Addressing Change Data Capture (CDC) delays during real-time replication
  • Resolving data inconsistency caused by differences in data processing mechanisms between Oracle and MySQL

Figure 1 shows the on-premises architecture of the user profile services. Figure 2 shows the proposed AWS architecture of the user profile service.

Figure 1. On-premises architecture of the user profile services

Figure 2. AWS architecture of the user profile service

Solution overview

Prior to starting the actual data migration, TMAP used the AWS Schema Conversion Tool (AWS SCT) to convert objects from the source Oracle database to the target Aurora MySQL database. Approximately 90% of tables, views, functions, and other objects were converted automatically with no or minimal changes. The remaining schema changes were completed manually by the database administrators. TMAP set up an intermediate Oracle database and loaded it with data from the main production Oracle database using Data Pump. This intermediate Oracle database served as the source for AWS Database Migration Service (AWS DMS) to transfer data into the target Aurora MySQL database, thereby minimizing the impact on the main production database during the actual data migration. For tables with frequent changes, TMAP scaled up the DMS instance and split a single task into multiple tasks based on the User Key column. This mitigated the latency in real-time replication of large amounts of data. By adding a column to the unique index, TMAP also addressed data loss caused by the difference in how Oracle and MySQL handle trailing whitespace.

Solution to minimizing performance impact of production DB

TMAP implemented the following steps to reduce the volume of data requiring online migration from the two on-premises databases (the main Oracle DB and the smaller MySQL DB) to Amazon Aurora MySQL.

Figure 3. Overall Architecture of User Profile Migration

Step 1: TMAP fully loaded the initial objects and data into an intermediate Amazon Relational Database Service (Amazon RDS) for Oracle instance, which was created temporarily for the migration to AWS. Among the 164 tables and 154 coded objects of the main on-premises Oracle database, TMAP excluded unnecessary objects from the migration. For 25 tables and objects that did not require historical data or that received data from external sources, TMAP loaded only the schema into the temporary database. For the 35 tables that would replicate data in real time using DMS, TMAP backed up the initial data from the source DB using Data Pump. After the dump file was uploaded to Amazon Simple Storage Service (Amazon S3), it was fully loaded into the temporary database.
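
The load into the intermediate database can be sketched as follows. This is a minimal, hypothetical example (the bucket, dump file, and schema names are placeholders, not TMAP’s actual values): it uses the Amazon S3 integration of RDS for Oracle to pull a Data Pump dump file into the instance’s DATA_PUMP_DIR, and then imports it with DBMS_DATAPUMP.

-- Hypothetical sketch: download the Data Pump dump file from Amazon S3
-- into the intermediate RDS for Oracle instance and import the schema.
-- Bucket, file, and schema names are examples only.

-- 1. Pull the dump file from S3 into DATA_PUMP_DIR (RDS for Oracle S3 integration).
SELECT rdsadmin.rdsadmin_s3_tasks.download_from_s3(
         p_bucket_name    => 'tmap-dms-staging',
         p_s3_prefix      => 'dump/profile_full.dmp',
         p_directory_name => 'DATA_PUMP_DIR') AS task_id
FROM   dual;

-- 2. Import the schema with DBMS_DATAPUMP.
DECLARE
  v_hdl NUMBER;
BEGIN
  v_hdl := DBMS_DATAPUMP.OPEN(operation => 'IMPORT', job_mode => 'SCHEMA');
  DBMS_DATAPUMP.ADD_FILE(
    handle    => v_hdl,
    filename  => 'profile_full.dmp',
    directory => 'DATA_PUMP_DIR',
    filetype  => DBMS_DATAPUMP.KU$_FILE_TYPE_DUMP_FILE);
  DBMS_DATAPUMP.METADATA_FILTER(v_hdl, 'SCHEMA_EXPR', 'IN (''PROFILE'')');
  DBMS_DATAPUMP.START_JOB(v_hdl);
END;
/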

Step 2: With AWS SCT, TMAP extracted and converted schema objects from the Amazon RDS for Oracle instance in Step 1 to the target Amazon Aurora MySQL database. Most objects were auto-generated, with the exception of a few tables that were created manually. For example, a table that stores all user driving history used a HASH partition with the User Key column as the partition key to enhance performance. MySQL HASH partitioning only supports integer-type partition keys, so AWS SCT could not generate this partition automatically. Therefore, TMAP manually re-created it with a KEY partition. For large object (LOB), date, and number type columns, TMAP checked the actual data to select the appropriate target type among TEXT/LONGTEXT, DATE/DATETIME, and INT/BIGINT/DECIMAL in the automatically generated tables.
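
To illustrate the manual conversion, the following is a hypothetical sketch of the target DDL (the table name, columns, and partition count are examples, not TMAP’s actual schema). MySQL HASH partitioning requires an integer key, so the VARCHAR user key is handled with a KEY partition instead.

-- Hypothetical example: KEY partitioning on a VARCHAR user key,
-- since MySQL HASH partitioning accepts only integer partition keys.
CREATE TABLE user_driving_history (
  user_key   VARCHAR(14) NOT NULL,
  drive_seq  BIGINT      NOT NULL,
  drive_dtm  DATETIME    NOT NULL,
  distance_m INT,
  PRIMARY KEY (user_key, drive_seq)   -- the partition key must be part of every unique key
)
PARTITION BY KEY (user_key)
PARTITIONS 16;                        -- example partition count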

Figures 4A, 4B and 4C show the Schema Conversion Tool Reports for the different types of tables.

Schema A. The tables for managing user profiles in Schema A were mostly automatically convertible. TMAP confirmed that procedures were no longer in use and thus excluded them from the migration.

Figure 4A. Schema Conversion Tool Report for User Profile Tables

Schema B, encompassing vehicle and driving information, contained five partition tables with a high degree of conversion complexity. Because these tables were relatively small, at several gigabytes each, TMAP decided to remove the partitions, weighing the effort of partition management against the expected performance benefit.

Figure 4B. Schema Conversion Tool Report for Vehicle and Driving Tables

Schema C. This schema contained objects for recent destinations, favorites, and user profiles, among other things. Among the 15 tables that required complex actions, most were partition tables. For load balancing, TMAP retained only the HASH partitions and converted the rest into non-partitioned tables. As the key column of the HASH partition was of the VARCHAR type, it was converted to a KEY partition. TMAP also removed programmable objects such as views, procedures, and functions, as they were not going to be used by the development team.

Figure 4C. Schema Conversion Tool Report for Destinations and Favorites Tables

Step 3: Using an AWS DMS full load task, TMAP loaded the entire data set from the Amazon RDS for Oracle instance in Step 1 into the target Amazon Aurora MySQL database (MySQL version 8.0, Aurora version 3.02.0) set up in Step 2. For large tables, TMAP changed the mapping rules to perform table-level parallel processing, as illustrated in Figure 5.

Figure 5. DMS Full Load Task Configuration

ParallelLoadThreads needs to be set carefully, as the task can consume a large amount of memory and sometimes fail. A high ParallelLoadThreads value can also trigger failures due to deadlocks when loading to the target. For more information and best practices, refer to the AWS DMS User Guide section on task settings.
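
The table-level parallelism referenced in Figure 5 is expressed in the DMS table mapping as a table-settings rule. The following is a minimal sketch under assumed names (schema SCHEMA_C, table RECENT_DEST, column USER_KEY, and the range boundaries are hypothetical); TMAP’s actual rules are shown in Figure 5.

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-recent-dest",
      "object-locator": { "schema-name": "SCHEMA_C", "table-name": "RECENT_DEST" },
      "rule-action": "include"
    },
    {
      "rule-type": "table-settings",
      "rule-id": "2",
      "rule-name": "parallel-load-recent-dest",
      "object-locator": { "schema-name": "SCHEMA_C", "table-name": "RECENT_DEST" },
      "parallel-load": {
        "type": "ranges",
        "columns": ["USER_KEY"],
        "boundaries": [["29999999999999"], ["59999999999999"], ["89999999999999"]]
      }
    }
  ]
}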

Step 4: Using an ongoing change data capture (CDC) task of AWS DMS, TMAP replicated data between the on-premises Oracle and Amazon Aurora MySQL databases. Data that changed in the on-premises Oracle tables after Step 1 was applied to the Amazon Aurora MySQL tables.

Figure 6 shows the DMS CDC Task Mapping rules.

Figure 6. DMS CDC Task Mapping Rules

Step 5: To migrate the on-premises MySQL database, TMAP created a new database on the Amazon Aurora MySQL cluster built in Step 2. TMAP selected the tables and other objects to migrate, took a data dump from the on-premises MySQL database, and loaded the data into the new database. TMAP performed the dump operation on a passive master to minimize the impact on the production database. For 10 of the 28 target tables, TMAP dumped and loaded only the schema. For the 18 tables that required real-time data replication, TMAP dumped and loaded the entire data set, using the --master-data option to capture the binary log coordinates.

Step 6: For data that was inserted, updated, or deleted after the data load in Step 5, TMAP used replication to transfer the changes to the target Amazon Aurora MySQL tables in real time. TMAP used the replicate-do-db parameter of Aurora MySQL to replicate only the specific database among the databases in the source MySQL.

Figure 7. MySQL Binlog Replication Settings
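
The replication setup in Step 6 can be sketched with the Aurora MySQL stored procedures for external binary log replication. This is a hypothetical example (host, credentials, and binlog coordinates are placeholders, and procedure availability depends on the Aurora MySQL version); the replicate-do-db filter itself is set in the DB cluster parameter group rather than in SQL, and TMAP’s actual settings are shown in Figure 7.

-- Hypothetical sketch: point Aurora MySQL at the on-premises MySQL source
-- using the binlog coordinates captured by the dump's --master-data option.
-- (replicate-do-db is configured in the DB cluster parameter group, for example
--  replicate-do-db = userprofile, so that only that database is replicated.)
CALL mysql.rds_set_external_master(
  'onprem-mysql.example.com',   -- source host (placeholder)
  3306,                         -- source port
  'repl_user',                  -- replication user (placeholder)
  'repl_password',              -- replication password (placeholder)
  'mysql-bin.000123',           -- binlog file from the dump header
  4567890,                      -- binlog position from the dump header
  0);                           -- 0 = no SSL

CALL mysql.rds_start_replication;

-- Check replication status and lag on the Aurora side.
SHOW REPLICA STATUS\G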

Solution to address Change Data Capture (CDC) delays in real-time replication

As mentioned earlier, TMAP started by fully loading the production DB data into the intermediate Oracle DB. While replicating real-time changes from the production DB with the CDC function of DMS, TMAP faced delays on both the source and the target during synchronization, as well as an out-of-memory issue on the DMS instance.

The key Amazon CloudWatch metrics related to replication delay are:

  • CDCLatencySource: Latency of reading changes from the source
  • CDCLatencyTarget: Latency of applying changes from the source to the target
  • CDCIncomingChanges: The number of events waiting to be applied to the target. This metric increases when changes from the source are not yet applied to the target.

Please refer to Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong for additional information.

Figure 8. Both CDCLatencySource and CDCLatencyTarget increased

TMAP identified and resolved three issues to eliminate the replication delays.

Issue 1. Replication delay due to increased source latency

TMAP initially used a single DMS task for CDC of the 35 tables. Reading data from the source with a single task resulted in both source and target latency, which increased the values of both CDCLatencySource and CDCLatencyTarget in CloudWatch.

Higher source latency can be caused by the following:

  1. Network latency between source and DMS instance
  2. Bad Performance at source
  3. Task settings causing DMS to read changes slowly from the source.

None of these applied to TMAP, so TMAP scaled up the DMS instance (from r5.2xlarge to r5.4xlarge). This decreased the CDCLatencySource values. However, the CDCLatencyTarget and CDCIncomingChanges metrics continued to increase because of the high volume of changes occurring in the source database.

Issue 2. Replication delay due to target latency

The target latency increased because a single DMS task sequentially replicated large-scale source changes to the target. To solve the issue, TMAP used filters in the mapping rules for large tables to divide the task into smaller ones. The key was to find conditional clauses so that data transactions were not concentrated in one task.

Figure 9. DMS CDC Task Selection Rule

For example, the table containing user information has a 14-digit VARCHAR column that stores the year when the user started the service subscription. TMAP used the column values to separate tasks by subscription year. The table was therefore split into 9 tasks (2014-2022), and these 9 tasks were executed in parallel for CDC. This strategy resolved the replication delays at the target.
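
Each per-year task can apply a filter inside its DMS selection rule. The sketch below shows one of the parallel tasks; the schema, table, and column names are hypothetical, and it assumes the 14-digit column begins with the subscription year (TMAP’s actual rules are shown in Figure 9).

{
  "rule-type": "selection",
  "rule-id": "1",
  "rule-name": "cdc-user-info-2014",
  "object-locator": { "schema-name": "SCHEMA_C", "table-name": "USER_INFO" },
  "rule-action": "include",
  "filters": [
    {
      "filter-type": "source",
      "column-name": "SUB_START_DT",
      "filter-conditions": [
        {
          "filter-operator": "between",
          "start-value": "20140000000000",
          "end-value": "20149999999999"
        }
      ]
    }
  ]
}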

Figure 10. Completely synchronized with the source after target latency is addressed

Issue 3. Out of memory (OOM) issues on the DMS instance

TMAP observed OOM issues on the DMS instance as they were determining the optimum number of DMS tasks to run.

TMAP reduced the StreamBufferCount value in the task's StreamBufferSettings from 180 to 32, which successfully resolved the OOM issue. Although increasing StreamBufferCount speeds up DMS data extraction, it also uses more memory, which can cause OOM issues and eventual failure of the DMS tasks.
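
In the task settings JSON, this change corresponds to the StreamBufferSettings block. The sketch below shows only that block, with the reduced StreamBufferCount and the other fields at their default values (a simplified example, not TMAP's full task settings).

{
  "StreamBufferSettings": {
    "StreamBufferCount": 32,
    "StreamBufferSizeInMB": 8,
    "CtrlStreamBufferSizeInMB": 5
  }
}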

Solution to address data consistency issues caused by differences in data processing mechanisms between Oracle and MySQL

While transferring data from Oracle to MySQL, TMAP discovered differences in how the two databases handle trailing spaces.

Oracle’s unique index treats a value with trailing spaces as distinct from the same value without them, while MySQL’s unique index ignores the trailing spaces, considers only the leading characters, and treats the two values as duplicates. Please refer to the test results in Figure 11 for details.

mysql> create table blanktest (vchar varchar(10));
Query OK, 0 rows affected (0.09 sec)

mysql> insert into blanktest values ('text'), ('text ');
Query OK, 2 rows affected (0.02 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select vchar, length(vchar) from blanktest where vchar='text';
+-------+---------------+
| vchar | length(vchar) |
+-------+---------------+
| text  |             4 |
| text  |             5 |
+-------+---------------+
2 rows in set (0.05 sec)

mysql> select vchar, length(vchar) from blanktest where vchar='text ';
+-------+---------------+
| vchar | length(vchar) |
+-------+---------------+
| text  |             4 |
| text  |             5 |
+-------+---------------+
2 rows in set (0.05 sec)

mysql> alter table blanktest add unique index blanktest_ix01 (vchar);
ERROR 1062 (23000): Duplicate entry 'text' for key 'blanktest.blanktest_ix01'

Figure 11. MySQL’s Handling of Trailing Whitespace

SQL> select vchar, length(vchar) from blanktest where vchar='text';
VCHAR	   LENGTH(VCHAR)
---------- -------------
text		       4

SQL> select vchar, length(vchar) from blanktest where vchar='text ';
VCHAR	   LENGTH(VCHAR)
---------- -------------
text		       5

SQL> create unique index test_ix01 on blanktest (vchar);

Index created.

Figure 12. Oracle’s Handling of Trailing Whitespace

Some data in TMAP’s ‘Recent Destination’ table was failing to migrate due to duplication errors caused by these trailing whitespace differences, as shown in Figure 13.

RetCode: SQL_ERROR  SqlState: 23000 NativeError: 1062 Message: [MySQL][ODBC 8.0(w) Driver][mysqld-8.0.23]Duplicate entry '20226RD4002_xxxxxxxxxxx-xxxxxx' for key 'poi_t_rct_id_c.POI_T_RCT_ID_C_PK

Figure 13. Errors Occurring in the DMS’s awsdms_apply_exceptions Table

TMAP considered several approaches to address this issue. One approach was to add a function-based unique index using the length() function, which could distinguish values based on the presence of trailing whitespace. However, this method could not be implemented because AWS DMS does not support the migration of function-based indexes.

As a second approach, TMAP added a column to the unique index that stored the timestamp of when a user made a query. This approach worked as it was improbable for a user to enter two queries with the same destination at exactly the same time. By implementing this method, TMAP was able to resolve the data consistency issues and successfully complete the migration.
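
Both approaches can be illustrated with hypothetical DDL (the table and column names are examples, not TMAP’s actual schema): the first statement shows the rejected function-based style of unique index, and the second shows the adopted composite unique index that includes the query timestamp.

-- Approach 1 (rejected): a functional unique index that also keys on the value's
-- byte length, so 'text' and 'text ' no longer collide as duplicates.
ALTER TABLE recent_dest
  ADD UNIQUE INDEX ux_recent_dest_len (user_key, dest_name, (LENGTH(dest_name)));

-- Approach 2 (adopted): include the query timestamp in the unique index; two identical
-- destinations entered by the same user at exactly the same instant are improbable.
ALTER TABLE recent_dest
  ADD UNIQUE INDEX ux_recent_dest_ts (user_key, dest_name, query_dtm);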

Conclusion

In this post, we covered how TMAP overcame challenges and migrated its large Oracle database to Amazon Aurora MySQL. TMAP worked with the AWS team to learn and apply a number of best practices to this critical database migration. Adopting an intermediate database to mitigate the impact on its running production databases during migration and configuring DMS tasks for optimum performance helped TMAP complete this critical migration effort on time. Furthermore, application response time improved by 41% compared to the on-premises environment. The auto-scaling capabilities of Amazon Aurora for handling peak-period loads also reduced the operational burden on TMAP staff. TMAP was also able to move away from database license renewal contracts and can now take advantage of a pay-as-you-go model.

TMAP plans to continue its migration to AWS by choosing an appropriate target AWS database for each on-premises workload, and plans to migrate other core services that use Oracle in the second half of 2023. Having migrated the User Profile service, TMAP has gained confidence in completing heterogeneous database migrations. As a next step, the migration of the authentication service to Aurora PostgreSQL is being prioritized for performance and scalability reasons.

“It was challenging to change the large-scale DBMS of UserProfile, which is TMAP’s main system used by over 10 million monthly users. This case was shared company-wide as the best practice of TMAP Mobility. DevOps engineers have acquired knowledge and experiences, and now more efficiently perform migration”

– Soonhyoung An, Leader, Backend Infrastructure Group

About the authors:

SangHun Kwon

SangHun Kwon is a Cloud Database Engineer working at TMAP Mobility. He is dedicated to creating efficient and effective designs for various workloads. His primary objective is to enhance database performance and reduce costs, guaranteeing a seamless and uninterrupted operation.

MinWoo Lee

MinWoo Lee is a Solutions Architect in AWS Korea. He works on multi-account architectures for large enterprise customers. He enjoys helping customers achieve governance and control on their cloud journey.

Soyoung Jeon

Soyoung Jeon is a Migration and Modernization Solutions Architect with expertise in database and analytics in AWS Korea. Her primary focus is helping customers accelerate their migration from on-premises to AWS in an agile and efficient way. She is dedicated to customers’ modernization to free up their time from operational toil.

SooJung Lee

SooJung Lee is a Technical Account Manager in AWS Korea. She helps enterprise customers establish operational excellence as they move to the cloud.