Amazon RDS Customer Data Import Guide for MySQL

Articles & Tutorials>Amazon RDS>Amazon RDS Customer Data Import Guide for MySQL
This guide provides information on loading data into Amazon RDS for the MySQL 5.1 engine. The main body of the guide provides step-by-step procedures for loading data fast and efficiently while safeguarding your data. An appendix is provided that contains additional technical information.

Details

Submitted By: John@AWS
AWS Products Used: Amazon EC2
Created On: October 23, 2009 5:20 PM GMT
Last Updated: December 10, 2009 9:06 AM GMT

Procedures

We present two procedures for loading data into Amazon RDS. One applies when the source data is stored in a MySQL database and the amount of data to be loaded is less than 1GB. The other is more general, works for any data source, and can handle very large loads.

Warning: The 'mysql' database contains authentication and authorization information required to log into your DBInstance and access your data. Dropping, altering, renaming, or truncating tables, data, or other contents of the 'mysql' database in your DBInstance can result in error and may render the DBInstance and your data inaccessible. If this occurs, the DBInstance can be restored from a snapshot using rds-restore-db-instance-from-db-snapshot or recovered using rds-restore-db-instance-to-point-in-time.

Small Loads from Existing MySQL Databases

For small amounts of data already stored in MySQL, the simplest way to transfer it to Amazon RDS is to extract the data with mysqldump and pipe it directly into Amazon RDS. mysqldump is a command line utility included with the MySQL client software distribution that's commonly used to make backups and transfer data from one MySQL server to another. Here's an example showing the "acme" database being copied to Amazon RDS.

mysqldump acme | mysql --host=hostname --user=username --password acme

We use the --host and --user options to specify the hostname and username used to connect to our DB Instance. The --password option prompts for the password. There are other options that allow mysqldump to do some very useful things, such as:

  • Sort each table's data by its primary key (--order-by-primary). This can dramatically reduce load times.
  • Compress the data before sending it to Amazon RDS (--compress). While it probably won't make the transfer faster, this option can significantly reduce network bandwidth consumption.
  • Ensure that all of the data is consistent with a single point in time (--single-transaction). If there are other processes changing the data while mysqldump is reading it, use this option to maintain data integrity.
See the MySQL documentation for full details.

All Loads / Any Source

If you have more than 1GB of data to load, or if your data is coming from somewhere other than a MySQL database, we recommend creating flat files and loading them with mysqlimport. mysqlimport is another command line utility bundled with the MySQL client software whose purpose is to load flat files into MySQL.

We also recommend creating DB Snapshots of the target Amazon RDS DB Instance before and after the data load. Amazon RDS DB Snapshots are complete backups of your DB Instance that can be used to restore your DB Instance to a known state. When you initiate a DB Snapshot, I/O operations to your database instance are temporarily suspended for a few minutes while your database is backed up.

Creating a DB Snapshot immediately before the load allows you restore the database to its state prior to the load, should the need arise. A DB Snapshot taken immediately after the load protects you from having to load the data again in case of a mishap and can also be used to seed new database instances.

We'll start by listing the steps in order and discuss the details of each separately.

  1. Create flat files containing the data to be loaded
  2. Stop any applications accessing the target DB Instance
  3. Create a DB Snapshot
  4. Disable Amazon RDS automated backups
  5. Load the data using mysqlimport
  6. Enable automated backups again

Step 1: Create Flat Files Containing the Data to be Loaded

Use a common format, such as CSV (Comma-Separated Values), to store the data to be loaded. Each table must have its own file; data for multiple tables cannot be combined in the same file. Give each file the same name as the table it corresponds to. The file extension can be anything you like. For example, if the table name is "sales", the file name could be "sales.csv" or "sales.txt", but not "sales_01.csv".

Whenever possible, order the data by the primary key of the table being loaded. This drastically improves load times and minimizes disk storage requirements.

The speed and efficiency of this procedure is dependent upon keeping the size of the files small. If the uncompressed size of any individual file is larger than 1GB, split it into multiple files and load each one separately.

On Unix-like systems (including Linux), use the 'split' command. For example, the following command splits the sales.csv file into multiple files of less than 1GB, splitting only at line breaks (-C 1024m). The new files will be named sales.part_00, sales.part_01, etc.

split -C 1024m -d sales.csv sales.part_

Similar utilities are available on other operating systems.

Step 2: Stop Any Applications Accessing the Target DB Instance

Before starting a large load, stop all application activity accessing the target DB Instance that you will be loading to (particularly if other sessions will be modifying the tables being loaded or tables they reference). This will reduce the risk of constraint violations occurring during the load, improve load performance, and make it possible to restore the database instance to the point just prior to the load without losing changes made by processes not involved in the load.

Of course, this may not be possible or practical. If you are unable to stop applications from accessing the DB Instance prior to the load, take steps to ensure the availability and integrity of your data. The specific steps required vary greatly depending upon specific use cases and site requirements.

Step 3: Create a DB Snapshot

If you will be loading data into a new DB Instance that contains no data, you may skip this step. Otherwise, creating a DB Snapshot of your DB Instance will allow you to restore the database Instance to the point just prior to the load, if it becomes necessary. As previously mentioned, when you initiate a DB Snapshot, I/O operations to your database instance are suspended for a few minutes while the database is backed up.

In the example below, we use the rds-create-db-snapshot command to create a DB Snapshot of our AcmeRDS instance and give the DB Snapshot the identifier "preload".

rds-create-db-snapshot AcmeRDS --db-snapshot-identifier=preload

You can also use the restore from DB Snapshot functionality in order to create test database instances for dry runs or to "undo" changes made during the load.

It is important to keep in mind that restoring a database from a DB Snapshot creates a new DB Instance which, like all DB Instances, has a unique identifier and endpoint. If you need to restore the database instance without changing the endpoint, you must first delete the DB Instance so that the endpoint can be reused.

For example, to create a DB Instance for dry runs or other testing, you would give the DB Instance its own identifier. In the example, "AcmeRDS-2" is the identifier and we would connect to the database instance using the endpoint associated with AcmeRDS-2.

rds-restore-db-instance-from-db-snapshot AcmeRDS-2 --db-snapshot-identifier=preload

To reuse the existing endpoint, we must first delete the database instance and then give the restored database the same identifier:

rds-delete-db-instance AcmeRDS --final-db-snapshot-identifier AcmeRDS-Final
rds-restore-db-instance-from-db-snapshot AcmeRDS --db-snapshot-identifier=preload

Note that the example takes a final DB Snapshot of the database instance before deleting it. This is optional, but recommended.

Step 4: Disable Amazon RDS Automated Backups

Warning: DO NOT DISABLE AUTOMATED BACKUPS IF YOU NEED TO RETAIN THE ABILITY TO PERFORM POINT-IN-TIME RECOVERY. Disabling automated backups erases all existing backups, so point-in-time recovery will not be possible after automated backups have been disabled. Disabling automated backups is a performance optimization and is not required for data loads. Note that DB Snapshots are not affected by disabling automated backups. All existing DB Snapshots are still available for restore.

Disabling automated backups will reduce load time by about 25% and reduce the amount of storage space required during the load. If you will be loading data into a new DB Instance that contains no data, disabling backups is an easy way to speed up the load and avoid using the additional storage needed for backups. However, if you will be loading into a DB Instance that already contains data; you must weigh the benefits of disabling backups against the impact of losing the ability to perform point-in-time-recovery.

DB Instances have automated backups enabled by default (with a one day retention period). In order to disable automated backups, you must set the backup retention period to zero. After the load, you can re-enable backups by setting the backup retention period to a non-zero value. In order to enable or disable backups, Amazon RDS must shut the DB Instance down and restart it in order to turn MySQL logging on or off.

Use the rds-modify-db-instance command to set the backup retention to zero and apply the change immediately. Setting the retention period to zero requires a DB Instance restart, so wait until the restart has completed before proceeding.

rds-modify-db-instance AcmeRDS --apply-immediately --backup-retention-period=0

You can check the status of your DB Instance with the rds-describe-db-instances command. The example displays the status of the AcmeRDS database instance and includes the --headers option to show column headings.

rds-describe-db-instances AcmeRDS --headers

When the Status column shows that the database is available, you're ready to proceed.

Step 5: Load the Data

Use the mysqlimport utility to load the flat files into Amazon RDS. In the example we tell mysqlimport to load all of the files named "sales" with an extension starting with "part_". This is a convenient way to load all of the files created in the "split" example. Use the --compress option to minimize network traffic. The --fields-terminated-by=',' option is used for CSV files and the --local option specifies that the incoming data is located on the client. Without the --local option, MySQL will look for the data on the database host, so always specify the --local option.

mysqlimport --local --compress --user=username --password --host=hostname \
--fields-terminated-by=',' Acme sales.part_*

For very large data loads, take additional DB Snapshots periodically between loading files and note which files have been loaded. If a problem occurs, you can easily resume from the point of the last DB Snapshot, avoiding lengthy reloads.

Step 6: Enable Amazon RDS Automated Backups

Once the load is finished, re-enable Amazon RDS automated backups by setting the backup retention period back to its pre-load value. As noted earlier, Amazon RDS will restart the DB Instance, so be prepared for a brief outage.

In the example, we use the rds-modify-db-instance command to enable automated backups for the AcmeRDS DB Instance and set the retention period to 1 day.

rds-modify-db-instance AcmeRDS --apply-immediately --backup-retention-period=1

Appendix

This section contains additional technical information related to loading data into MySQL. It is intended for advanced users who are familiar with the MySQL server architecture. Note that all comments related to LOAD DATA INFILE apply to mysqlimport as well.

Transaction Size

Transaction size plays an important role in MySQL data loads. It has a major influence on resource consumption, disk space utilization, resume process, time to recover, and input format (flat files or SQL). This section describes how transaction size affects binary logging and makes the case for disabling binary logging during large data loads. As noted earlier, binary logging is enabled and disabled by setting the Amazon RDS automated backup retention period. Non-zero values enable binary logging, and zero disables it. We also describe the impact of large transactions on InnoDB and why it's important to keep transaction sizes small.

Binary Log

Data loads incur a performance penalty and require additional free disk space (up to 4X more) when binary logging is enabled versus loading the same data with binary logging turned off. The severity of the performance penalty and the amount of free disk space required is directly proportional to the size of the transactions used to load the data.

Small Transactions

For small transactions, binary logging doubles the number of disk writes required to load the data. Depending upon the upload rate, other database activity taking place during the load, and the capacity of your Amazon RDS DB Instance, this can severely degrade performance for other database sessions and increase the time required to load the data.

The binary logs also consume disk space roughly equal to the amount of data loaded until they are backed up and removed. Fortunately, Amazon RDS minimizes this by backing up and removing binary logs on a frequent basis.

Large Transactions

Large transactions incur a 3X penalty for IOPS and disk consumption with binary logging enabled. This is due to the binary log cache spilling to disk, consuming disk space and incurring additional IO for each write. The cache cannot be written to the binlog until the transaction commits or rolls back, so it consumes disk space in proportion to the amount of data loaded. When the transaction commits, the cache must be copied to the binlog, creating a third copy of the data on disk.

Because of this, there must be at least three times as much free disk space available to load the data compared to loading with binary logging disabled. For example, 10GB of data loaded as a single transaction will consume at least 30GB disk space during the load: 10GB for the table + 10GB for the binary log cache + 10GB for the binary log itself. The cache file remains on disk until the session that created it terminates or the session fills its binary log cache again during another transaction. The binary log must remain on disk until backed up, so it may be some time before the extra 20GB is freed.

If the data was loaded using LOAD DATA LOCAL INFILE, yet another copy of the data is created if the database has to be recovered from a backup made prior to the load. During recovery, MySQL extracts the data from the binary log into a flat file and then executes LOAD DATA LOCAL INFILE, just as the original transaction, only this time the infile is local to the database server. Continuing with the example above, recovery will fail unless there is at least 40GB free disk space available.

Disable Binary Logging

Whenever possible, disable binary logging during large data loads to avoid the resource overhead and addition disk space requirements. In Amazon RDS, disabling binary logging is as simple as setting the backup retention period to zero. If you do this, it's recommended that you take a DB Snapshot of the database instance immediately before the load so that you can quickly and easily undo changes made during loading if the need arises.

After the load, set the backup retention period back to an appropriate (no zero) value.

InnoDB

The information in this section provides a strong argument for keeping transaction sizes small when using InnoDB.

Undo

InnoDB generates undo to support features such as transaction rollback and MVCC . Undo is stored in the InnoDB system tablespace (usually ibdata1) and is retained until removed by the purge thread. The purge thread cannot advance beyond the undo of the oldest active transaction, so it is effectively blocked until the transaction commits or completes a rollback. If the database is processing other transactions during the load, their undo also accumulates in the system tablespace and cannot be removed even if they commit and no other transaction needs the undo for MVCC. In this situation, all transactions (including read-only transactions) that access any of the rows changed by any transaction (not just the load transaction) slow down as they scan through undo that could have been purged if not for the long running load transaction.

Since undo is stored in the system tablespace and since the system tablespace never shrinks in size, large data load transactions can cause the system tablespace to become quite large, consuming disk space that cannot be reclaimed without recreating the database from scratch.

Rollback

InnodDB optimized for commits. Rolling back a large transaction can take a very, very long time. In some cases, it may be faster to perform a point-in-time recovery or restore a DB Snapshot.

Input Data Format

MySQL can accept incoming data in one of two forms: flat files and SQL. This section points out some key advantages and disadvantages of each.

Flat Files

Loading flat files with LOAD DATA LOCAL INFILE can be the fastest and least costly method of loading data as long as transactions are kept relatively small. Compared to loading the same data with SQL, flat files usually require less network traffic, lowering transmission costs and load much faster due to the reduced overhead in the database.

One big transaction

LOAD DATA INFILE loads the entire flat file as one transaction. This isn't necessarily a bad thing. If the size of the individual files can be kept small, this has a number of advantages:
  • Resume Capability

  • Keeping track of which files have been loaded is easy. If a problem arises during the load, you can pick up where you left off with little effort. Some data may have to be retransmitted to Amazon RDS, but with small files, the amount retransmitted is minimal.

  • Load data in parallel

  • If you've got IOPs and network bandwidth to spare with a single file load, loading in parallel may save time.

  • Throttle the load rate

  • Data load impacting other processes? Throttle the load by increasing the interval between files.

Be Careful

The advantages of LOAD DATA INFILE diminish rapidly as transaction size increases. If breaking up a large set of data into smaller ones isn't an option, SQL may be the better choice.

SQL

SQL has one main advantage over flat files: it's easy to keep transaction sizes small. However, SQL can take significantly longer to load than flat files and it can be difficult to determine where to resume the load after a failure. For example, mysqldump files are not restartable. If a failure occurs while loading a mysqldump file, the file will require modification or replacement before the load can resume. The alternative is to restore to the point in time prior to the load and replay the file once the cause of the failure has been corrected.

Take Checkpoints Using Amazon RDS Snapshots

If you have a load that's going to take several hours or even days, loading without binary logging isn't a very attractive prospect unless you can take periodic checkpoints. This is where the Amazon RDS DB Snapshot feature comes in very handy. A DB Snapshot creates a point-in-time consistent copy of your database instance which can be used restore the database to that point in time after a crash or other mishap.

To create a checkpoint, simply take a DB Snapshot. Any previous DB Snapshots taken for checkpoints can be removed without affecting durability or restore time.

Snapshots are fast too, so frequent checkpointing doesn't add significantly to load time.

Decreasing Load Time

Here are some additional tips to reduce load times:

  • Create all secondary indexes prior to loading. This is counterintuitive for those familiar with other databases. Adding or modifying a secondary index causes MySQL to create a new table with the index changes, copy the data from the existing table to the new table, and drop the original table.
  • Load data in PK order. This is particularly helpful for InnoDB tables where load times can be reduced by 75-80% and datafile size cut in half.
  • Disable foreign key constraints foreign_key_checks=0 For flat files loaded with LOAD DATA INFILE, this is required in many cases. For any load, disabling FK checks will provide significant performance gains. Just be sure to enable the constraints and verify the data after the load.
  • Load in parallel unless already near a resource limit. Use partitioned tables when appropriate.
  • Use multi-value inserts when loading with SQL to minimize statement execution overhead. When using mysqldump, this is done automatically.
  • Reduce InnoDB log IO innodb_flush_log_at_trx_commit=0

    Note: using innodb_flush_log_at_trx_commit=0 causes InnoDB to flush its logs every second instead of at each commit. This provides a significant speed advantage, but can lead to data loss during a crash. Use with caution.

©2013, Amazon Web Services, Inc. or its affiliates. All rights reserved.