AWS Database Blog

Migrate very large databases to Amazon Aurora MySQL using MyDumper and MyLoader

Amazon Aurora is a MySQL- and PostgreSQL-compatible relational database built for the cloud. Aurora combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open source databases. Aurora provides advanced features such as Amazon Aurora Global Databases, Amazon Aurora Serverless v2, Amazon Aurora Parallel Query, and database cloning, while supporting up to 15 low-latency read replicas per cluster.

Historically, very large MySQL databases (VLDBs) have been hosted on specialized hardware within corporate data centers, resulting in significant upfront investments in infrastructure, maintenance, and management. To modernize your relational database while achieving full MySQL compatibility, high availability, and global-scale performance, consider migrating your VLDB to Amazon Aurora MySQL-Compatible Edition.

In this post, we discuss how to migrate MySQL VLDBs from a self-managed MySQL database to Amazon Aurora MySQL-Compatible Edition using the MyDumper and MyLoader tools.

You can also use this solution to migrate to Amazon Relational Database Service (Amazon RDS) for MySQL.

VLDB migration challenges

Migrating a MySQL VLDB requires careful evaluation of several critical factors, with database size, data consistency, and complexity being the most important. The most common approach of migrating MySQL VLDBs is physical migration, which involves copying physical binary database files, transferring them over the network, and restoring them on the destination server. However, with a VLDB containing numerous objects and a large volume of data, this method can significantly slow down both the backup and restore processes. Additionally, physical backup tools often lack the flexibility to break the backup into smaller, more manageable files, further complicating the migration.

Another approach to migrating VLDBs is logical migration. This approach involves exporting database objects (schemas, tables, data) as SQL statements that can be run on the target to reproduce the original object definitions and table data. Traditional tools used for logical migrations, such as mysqldump and mysqlpump, aren’t optimized for VLDBs; in particular, mysqldump is single-threaded and processes one table at a time, leading to slower migration times.

In this post, we demonstrate logical migration using the open source tools MyDumper and MyLoader, which are better suited for MySQL VLDB migration due to their multi-threaded capabilities.

Solution overview

MyDumper and MyLoader are open source MySQL export/import tools engineered to address the performance limitations often encountered with the traditional mysqldump and mysqlpump utilities. These tools offer a wide range of features, such as concurrent data export and import, per-table dump file generation, splitting of large tables into smaller chunks, separate data and metadata files for simplified parsing and management, customizable transaction sizes during import, and more.

The following diagram shows the high-level steps involved in migrating data using MyDumper and MyLoader.

Migrating using MyDumper and MyLoader involves the following steps:

  1. Use MyDumper to create a backup of the source MySQL database.
  2. Upload the backup files to an Amazon Simple Storage Service (Amazon S3) bucket (a minimal AWS CLI sketch follows this list). There are alternative ways to upload the backup files; refer to the AWS Prescriptive Guidance documentation to choose the right method for your environment.
  3. Restore the backup files using MyLoader.
  4. Optionally, for minimal downtime, set up replication between the source and the Aurora MySQL DB cluster using the binary log position captured by MyDumper.
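
To illustrate step 2, the following is a minimal sketch that copies the MyDumper output directory to Amazon S3 using the AWS CLI; the bucket name, prefix, and local path are placeholders for your environment:

# Upload the MyDumper backup directory to Amazon S3
# (bucket name, prefix, and local path are placeholders)
aws s3 sync /path/to/backup s3://<your-backup-bucket>/mydumper-backup/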

For step-by-step instructions on using the MyDumper and MyLoader method, refer to Multi-threaded migration using MyDumper and MyLoader. For more information about limitations, see Limitations and recommendations for importing backup files from Amazon S3 to Amazon RDS.
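
To illustrate the optional replication in step 4, the following sketch configures binary log replication from the source into the Aurora MySQL DB cluster using the coordinates recorded in the MyDumper metadata file. The endpoint, credentials, and log coordinates are placeholders, and the sketch assumes the Amazon RDS stored procedures for external replication (mysql.rds_set_external_source and mysql.rds_start_replication) are available on your engine version:

# The MyDumper metadata file records the source binary log coordinates,
# for example: Log: mysql-bin.000125   Pos: 48273645
cat /path/to/backup/metadata

# Point the Aurora MySQL DB cluster at the source and start replication
# (endpoint, credentials, and coordinates are placeholders)
mysql -h <aurora-cluster-endpoint> -u <admin_user> -p -e "
  CALL mysql.rds_set_external_source('<source_server_address>', 3306,
       '<repl_user>', '<repl_password>', 'mysql-bin.000125', 48273645, 0);
  CALL mysql.rds_start_replication;"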

In the following section, we discuss how you can optimize the performance and efficiency of MyDumper and MyLoader for migrating VLDBs.

Optimizations for logical backups using MyDumper

Optimizing the backup process when using MyDumper is crucial for achieving efficient, faster, and reliable data backups. MyDumper offers a range of features that allow you to optimize your backup strategies, meet your unique requirements, and improve backup efficiency and performance. In this section, we discuss a few key techniques for improving the backup process.

Parallel threads for backup

Instead of performing backups sequentially, you can use multiple threads to run the backups in parallel. This concurrent approach significantly improves performance and efficiency, especially for VLDBs. A multi-threaded approach allows for efficient utilization of system resources, including the CPU cores and disk I/O. A good guideline is to use one thread per server CPU core.

With the --threads (-t) option, MyDumper lets you run the backup in parallel by specifying the number of threads to use. For example, to perform a backup using 8 threads, run the following command:

mydumper --host=localhost --user=root --password=mypassword --outputdir=path/to/backup --threads=8

Split large table backups into smaller files

VLDBs often contain tables with millions or even billions of rows, resulting in backup files that can become excessively large, making them difficult to manage, transfer, and store. To address this, you can split large table backups into smaller, more manageable chunks by considering factors like chunk size or row count, which can improve manageability and reduce restore times.

The --chunk-filesize (-F) option in MyDumper allows you to split large tables into separate files based on a specified file size. You should choose the chunk file size in MyDumper based on factors such as parallel restoration requirements, storage constraints, network transfer considerations, and backup compression needs, while also considering the trade-offs between a large number of small files and the benefits of parallel restoration.

For example, the following command splits large tables into files of approximately 1 GB each:

mydumper --chunk-filesize=1G --outputdir=/path/to/backup --database=mydatabase

This approach results in more manageable file sets that can be restored in parallel and transferred more efficiently over the network compared to a single large file. In general, chunk sizes between 1–10 GB work well. Files smaller than 1 GB might introduce overhead due to the management of numerous files, and files larger than 10 GB might negatively impact the database.

Alternatively, you can use the --rows (-r) option to specify the number of rows included in each INSERT statement in the SQL dump files.

For example:

mydumper --rows=100000 --outputdir=/path/to/backup --database=mydatabase

In this example, each dump file contains INSERT statements for 100,000 rows. In general, the value of the --rows option in MyDumper should be determined by considering the trade-off between table size (higher values for larger tables), available memory resources (lower values for limited memory), and parallel restoration requirements (lower values for better workload distribution across multiple threads or processes).

You can use either the --chunk-filesize option or the --rows option, but not both. If both parameters are provided, MyDumper prioritizes the --rows option.

Compress your backup files

Enabling compression reduces backup file size, improving storage efficiency and accelerating file transfers over the network. This results in faster backup and restore operations, minimizing downtime and enhancing performance, especially in environments with limited network bandwidth or slower connections. However, compression also introduces additional CPU overhead. Therefore, it’s crucial to make sure adequate CPU resources are available to mitigate potential bottlenecks and fully take advantage of the benefits of compression.

With --compress (-c), MyDumper gives you two options to compress your files: gzip and zstd. The recommended compression method is zstd (Zstandard) because it provides better compression ratios along with faster compression and decompression speeds compared to gzip.

To use zstd compression with MyDumper, you can use the --compress or -c option followed by zstd. For example:

mydumper --compress=zstd --outputdir=/path/to/backup --database=mydatabase

The zstd compression algorithm provides high compression ratios while maintaining high decompression speeds, making it suitable for backup scenarios where you need to balance compression efficiency and performance.

However, if you have specific requirements or compatibility concerns with existing tools or systems, you can use gzip compression by specifying --compress=gzip or -c gzip instead.
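
For example, the following command produces gzip-compressed dump files:

mydumper --compress=gzip --outputdir=/path/to/backup --database=mydatabase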

Use flat files for backup

Using flat files for data import can be faster than importing SQL dumps. You can load flat files into the database using the LOAD DATA LOCAL INFILE statement or the MyLoader import tool, both of which read rows from a text file into a table at high speed. By generating multiple flat files, tables can be restored in parallel, further improving performance.
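
As a rough sketch, loading a single flat file manually could look like the following; the host, credentials, file name, and tab-delimited format are assumptions, and MyLoader handles this step for you when restoring a full backup:

# Load one MyDumper-generated flat file into a table
# (host, credentials, file name, and field/line delimiters are assumptions)
mysql --local-infile=1 -h <target_host> -u <user> -p mydatabase -e "
  LOAD DATA LOCAL INFILE '/path/to/backup/mydatabase.mytable.00000.dat'
  INTO TABLE mytable
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';"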

To generate flat files, use the --load-data option in MyDumper. MyDumper then writes flat files for each table, aiding parallel data loading into the database.

For example, to generate flat files using MyDumper, you could run the following command:

mydumper --load-data --outputdir=/path/to/backup --database=mydatabase

Additionally, you can use --stream to stream the files over STDOUT after they have been written, and --exec to run a command after each file has been generated. For example:

mydumper --stream --outputdir=/path/to/backup --database=mydatabase | myloader --stream --host=<target_host> --user=<user> --password=<password> --overwrite-tables

This command streams the backup directly to MyLoader on the target database, without staging the dump files in between. For more detailed instructions, refer to MyDumper usage.

The following example command generates backup files for all databases on the source server, excluding the mysql, information_schema, and performance_schema schemas:

mydumper --host=<source_server_address> --user=<source_user> \
--password=<source_user_password> --outputdir=<output_dir> --rows=500000 \
--threads=16 --compress=zstd --events --triggers --routines \
--regex '^(?!(mysql|information_schema|performance_schema))' \
--logfile=/<output_dir>/mydumper-logs.txt

For a complete list of supported parameters and their default values, use the built-in help:

mydumper --help

When you run the preceding command, MyDumper sets up its global variables and creates the worker threads. Once the threads are synchronized, the consistent backup begins.

Optimizations for restoring logical backups using MyLoader

MyLoader is an open source MySQL import utility that simplifies and speeds up loading data into MySQL databases. Developed as a companion tool to MyDumper, MyLoader offers a seamless and efficient solution for importing data exported using MyDumper.

Its multi-threaded architecture enables parallel execution, maximizing resource utilization and accelerating data import operations. MyLoader supports various data formats, including SQL files generated by MyDumper, CSV files, and TSV files, offering flexibility in data import sources. MyLoader also offers transactional support for data import operations, helping ensure data consistency and integrity. By encapsulating data import tasks within transactions, MyLoader enables you to roll back changes in case of errors or failures, minimizing the risk of data corruption.

Optimizing the restore process when using MyLoader for backup restoration is crucial for achieving faster, efficient, and reliable data recovery. MyLoader offers a range of configurable options to tailor the data import processes. In this section, we discuss key techniques for optimizing the restore process.

Parallel threads for restore

MyLoader enhances restore performance by distributing workloads across multiple threads, allowing concurrent processing of multiple tables or data files. This maximizes resource utilization and reduces overall restore time. However, it’s important to consider the available system resources and balance the number of threads with system capacity to avoid resource contention, which can lead to performance degradation.

A best practice is to allocate one thread for every two vCPUs.

The --threads (-t) option lets you specify the number of threads to use during the data import process, enabling parallel processing and further optimizing the import performance by concurrently processing multiple data files or tables. The following is an example command to use MyLoader with 4 threads:

myloader --threads=4 --directory=/path/to/backup --database=mydatabase
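
Following the one-thread-per-two-vCPUs guideline above, here is a minimal sketch that derives the thread count from the restore host's vCPU count (it assumes GNU nproc is available and that MyLoader runs on the restore host):

# Use roughly one MyLoader thread for every two vCPUs on the restore host
THREADS=$(( $(nproc) / 2 ))
myloader --threads="${THREADS}" --directory=/path/to/backup --database=mydatabase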

Batch size optimization

By configuring the batch size to an optimal value, you can balance the trade-off between transaction size and commit frequency. Larger batch sizes can improve performance by reducing transaction overhead but might lead to memory exhaustion or transaction log limits. Smaller batch sizes are useful for systems with limited memory or specific configurations.

To determine the optimal value for the --queries-per-transaction (-q) option in MyLoader, consider the transaction size, available memory, and database configuration. For example, for an Aurora MySQL database with 64 GB of RAM, you can start with values like 50,000 or 100,000. For very large tables or datasets, reduce the value to avoid excessive memory consumption. If the data loading process is still slow, increase the value while monitoring memory usage and transaction log growth.
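
For instance, a starting point for the 64 GB example above might look like the following; the value is an assumption to tune for your workload, not a universal recommendation:

# Start with a moderate queries-per-transaction value and tune from there
myloader --directory=/path/to/backup --database=mydatabase --queries-per-transaction=50000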

The process to optimize batch size includes the following steps:

  1. Start with a moderate value (such as 1,000 or 10,000) depending on the size of your database and the available memory.
  2. Monitor the restore process for performance, memory usage, and potential errors like out-of-memory issues or transaction log size limits.
  3. If the restoration process is running smoothly and there are sufficient memory resources, you can gradually increase the --queries-per-transaction value (such as 50,000 or 100,000) to see if it improves performance.
  4. If you encounter issues like out-of-memory errors or transaction log size limits being reached, reduce the value for --queries-per-transaction (such as 100 or 1,000).
  5. Repeat Steps 2–4 until you find the optimal value that maximizes performance without causing resource limitations or errors.

Index optimization

When loading data into a database table, the database management system (DBMS) must maintain not only the data itself but also associated indexes. Indexes are data structures that facilitate efficient data retrieval by creating sorted representations of the data based on specific columns or combinations of columns.

The overhead of maintaining indexes during data loading can be significant, particularly for large tables or databases with complex index structures. The reason for this overhead is that the DBMS must update indexes when new data is inserted, which can be resource-intensive, particularly for non-clustered indexes.

MyLoader addresses this issue by providing an option to load data without secondary indexes. By disabling the creation or maintenance of secondary indexes during the data loading process, MyLoader can reduce the overhead associated with index management. This optimization can lead to faster data restoration times, because the DBMS doesn’t need to spend resources on updating indexes concurrently with data insertion.

However, recreating secondary indexes on large tables comes with its own set of challenges. The index creation process can be resource-intensive and time-consuming, requiring substantial temporary disk space. Depending on the size of the table and the complexity of the secondary indexes, the index creation can take several hours to complete.

Analyzing the trade-offs between restore performance and index availability is crucial. Factors to consider include dataset size, the number of secondary indexes, index complexity, and the urgency of querying restored data.

The --innodb-optimize-keys option in MyLoader optimizes the creation of InnoDB secondary indexes during the data import process. When enabled, MyLoader defers the creation of secondary indexes until after it has loaded the data into InnoDB tables. This option is recommended for restoring VLDBs because it enhances the restore performance.

The following is an example of how to use the --innodb-optimize-keys option in MyLoader:

myloader --directory=/path/to/backup --database=mydatabase --innodb-optimize-keys

Transaction management

Fine-tuning transaction management parameters, such as the number of rows per transaction (--rows), can optimize transactional behavior during the restore process. This option is particularly useful for large tables that need to be split into multiple transactions for efficient parallel loading. By limiting the number of rows per transaction, you can optimize resource utilization, reduce memory consumption, and improve transaction throughput.

The --rows (-r) option enables you to define the maximum number of rows processed per transaction during data import.

For example, let's assume you have a large table with 100 million rows and want to restore the data in parallel using multiple MyLoader threads. You also have sufficient memory available, allocating between 25–50% of the total system memory for the MyLoader process during parallel restoration. This allocation leaves enough memory for other system processes and helps prevent excessive swapping or out-of-memory issues. In this scenario, you could set the --rows parameter to a value like 100,000 or 1,000,000 to balance performance and memory usage while still enabling parallel restoration.

The following is an example command using MyLoader with the --rows parameter set to 1,000,000:

myloader --directory=/path/to/backup --overwrite-tables --rows=1000000

This command tells MyLoader to load data with up to 1,000,000 rows per INSERT statement. For small tables, the entire table can be loaded in a single transaction, regardless of the --rows value. For instance, if a table contains only 1,000 rows and you’ve set --rows=1,000,000, MyLoader will generate a single INSERT statement with the 1,000 rows.

It’s important to experiment with the --rows value in your test environment to optimize memory consumption and transactional throughput.

Resource allocation

Allocating sufficient system resources, such as CPU, memory, and disk I/O, is crucial for optimizing the restore process. Verifying that MyLoader has adequate resources minimizes bottlenecks and maximizes performance during data restoration. Temporarily scaling up your database instance can help facilitate a smoother import process. Data import might require additional resources like increased CPU usage, Amazon Elastic Block Store (Amazon EBS) bandwidth, and memory. When the import is complete, you can downgrade the instance to its usual configuration.
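
As a hedged sketch, you could temporarily scale up the Aurora writer instance with the AWS CLI before the import and scale it back down afterward; the instance identifier and instance class are placeholders for your environment:

# Temporarily scale up the Aurora writer instance for the import
# (instance identifier and class are placeholders)
aws rds modify-db-instance \
    --db-instance-identifier <aurora-writer-instance> \
    --db-instance-class db.r7g.16xlarge \
    --apply-immediately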

The following command imports backup files into an Aurora MySQL database by loading data in batches of 100,000 rows, grouping 1,500 insert queries per transaction. It uses 16 threads for parallel data loading to improve performance and defers the creation of secondary indexes during the load for better efficiency.

myloader --host=<host_name> --user=<source_user> \
--password=<source_user_password> --directory=<output_dir> \
--queries-per-transaction=1500 --rows=100000 --threads=16 --innodb-optimize-keys \
--verbose=3 --logfile=/<output_dir>/myloader-logs.txt

The optimal values for these options vary depending on your specific database characteristics, hardware resources, and restoration requirements. It’s generally recommended to experiment with different values and observe the impact on restore performance and memory usage to find the optimal settings for your specific use case.

For a complete list of parameters and usage, see MyLoader Usage.

Next, we compare the results of benchmarking the backup and restore processes using both mysqldump and the MyDumper and MyLoader tools.

Benchmark results

We used mysqldump and MyDumper/MyLoader to back up an on-premises MySQL database and restore it to an Aurora MySQL DB cluster. We carried out benchmarking tests using an Amazon Elastic Compute Cloud (Amazon EC2) instance of type r7a.16xlarge with 64 vCPUs, 512 GB RAM, 45,000 Provisioned IOPS, and an 8,000 GB EBS volume. The MySQL version used was 8.0.36, and the dataset used for testing comprised 900 objects, totaling approximately 1.6 TB.

We compared the backup performance of mysqldump and MyDumper across various thread configurations. MyDumper performed consistently and efficiently at every thread count, surpassing mysqldump in every case. As the number of threads increased from 16 to 32 and 64, MyDumper consistently outperformed mysqldump in backup speed. These findings highlight the importance of choosing the right backup tool: MyDumper is a strong option for organizations seeking multi-threaded, efficient, and high-performance database backup operations.

In the following graph, the mysqldump time is excluded to show the backup completion time of MyDumper with different thread counts.

Your performance results might vary depending on your workload and instance configuration.

Next, we performed an import into an Aurora MySQL db.r7g.16xlarge instance using both mysqldump output and MyLoader. Similar to the MyDumper results, MyLoader restored the database to Aurora MySQL in less time than mysqldump, which took over 75 hours to finish a similar job.

In the following graph, the mysqldump time has been excluded to show the restore completion time of MyLoader with different thread counts.

Notably, the restore process with higher thread counts was actually slower. Although using multiple threads can help parallelize the workload, it also introduces overhead due to increased resource contention and coordination among the threads. When the number of threads exceeds the optimal level for a given workload and system configuration, the overhead of thread management and resource contention can outweigh the benefits of parallelization, leading to slower overall performance.

To determine the optimal thread count for your specific use case, it’s recommended to perform benchmarking tests with different thread configurations and monitor relevant performance metrics, such as CPU utilization, memory usage, and I/O throughput. By analyzing these metrics, you can identify the optimal settings where the restore process achieves the desired balance between performance and resource efficiency.
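
As a rough sketch of such a test, you could time trial restores into a scratch database at different thread counts before settling on a configuration; the thread counts, paths, and database name below are assumptions:

# Time trial restores at different thread counts
# (paths, database name, and thread counts are assumptions; run against a test database)
for t in 8 16 32 64; do
    echo "Restoring with ${t} threads"
    time myloader --threads="${t}" --directory=/path/to/backup \
        --database=mytestdatabase --overwrite-tables
done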

AWS DMS

If you have a MySQL-compatible database such as MariaDB that can’t be migrated using native export/import tools, AWS Database Migration Service (AWS DMS) is a good alternative. AWS DMS is a managed database migration service that helps you quickly and securely migrate a database to AWS. Multiple options are available for data migration, including schema remapping, advanced filtering, and merging multiple database servers into one Aurora MySQL cluster. AWS DMS allows for straightforward migrations with minimal downtime using change data capture (CDC) replication. For more information, see Using a MySQL-compatible database as a source for AWS DMS.

Conclusion

Migrating an on-premises very large database to Aurora is a substantial decision aimed at taking advantage of cloud-based infrastructure. However, the migration process is inherently time-consuming and intricate. Various factors demand careful consideration, from evaluating the size and complexity of the database to meeting performance requirements and compliance standards. In this post, we discussed the MyDumper and MyLoader open source export/import tools for migrating on-premises very large databases to Aurora MySQL-Compatible, including their various options.

MyDumper supports parallel backups (--threads), table chunking (--chunk-filesize), compression (--compress with zstd recommended), and flat file creation (--load-data) for faster restoration. MyLoader enables parallel data import (--threads), index optimization (--innodb-optimize-keys), and batch size configuration (--queries-per-transaction, --rows) for efficient restoration. Benchmarks show they outperform mysqldump, making them suitable for migrating VLDBs to Aurora MySQL-Compatible.

If you found this post helpful or have any additional thoughts, leave a comment.


About the Author

Maria Ehsan is a Database Specialist Solutions Architect based in Dubai. She primarily focuses on helping customers with the migration, optimization, and modernization of their database systems using purpose-built databases offered by AWS.