AWS Database Blog
Best practices for exporting and importing data from Amazon Aurora MySQL to Amazon S3
You can build highly distributed applications using a multitude of purpose-built databases by decoupling complex applications into smaller pieces, which allows you to choose the right database for the right job. Amazon Aurora is the preferred choice for OLTP workloads. Aurora makes it easy to set up, operate, and scale a relational database in the cloud.
This post demonstrates how you can export and import data between Amazon Aurora MySQL and Amazon Simple Storage Service (Amazon S3) and shares associated best practices. The export and import to Amazon S3 feature is also available for Amazon Aurora PostgreSQL; this post focuses on Amazon Aurora MySQL.
Overview of Aurora and Amazon S3
Aurora is a MySQL and PostgreSQL-compatible relational database built for the cloud, which combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open-source databases.
Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means customers of all sizes and industries can use it to store and protect any amount of data.
Prerequisites
Before you get started, complete the following:
- Launch an Aurora MySQL DB cluster. You can also use an existing cluster.
- Launch an Amazon EC2 instance with the MySQL client installed. You can also use MySQL Workbench for this purpose.
- Create the following required Identity and Access Management (IAM) policies and roles:
  - Create an IAM policy with the least-restricted privilege to the resources, as shown in the sample policy after this list, and name it aurora-s3-access-pol. The policy must have access to the S3 bucket where the files are stored (for this post, sample-loaddata01).
  - Create an IAM role and modify the trust relationship according to the sample trust policy after this list. AssumeRole allows Aurora to access other AWS services on your behalf.
  - Attach the policy aurora-s3-access-pol from the previous step. For this post, the role is named s3-demo-access-role and the role ARN is arn:aws:iam::123456789012:role/s3-demo-access-role.
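The following is a minimal sketch of what the aurora-s3-access-pol policy might look like, assuming the sample-loaddata01 bucket from this post; scope the actions and resources to your own requirements:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAuroraS3ExportImport",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::sample-loaddata01",
        "arn:aws:s3:::sample-loaddata01/*"
      ]
    }
  ]
}
```

The trust relationship on s3-demo-access-role lets the Amazon RDS service assume the role on behalf of Aurora:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "rds.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```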
- Create and assign parameter groups for the cluster:
  - Create a custom DB cluster parameter group using the AWS Command Line Interface (AWS CLI), as shown in the code after this list.
  - Modify the parameter group to set the value of the aurora_default_s3_role parameter to the IAM role s3-demo-access-role, also shown in the code after this list. You can instead map the IAM role to the aurora_select_into_s3_role parameter to allow only data export, or to the aurora_load_from_s3_role parameter to allow only data import from the S3 bucket. If an IAM role isn’t specified for these two parameters, Aurora uses the IAM role specified in aurora_default_s3_role for both export and import.
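For example, with the AWS CLI (the parameter group name aurora-mysql57-s3-pg and the family aurora-mysql5.7 are example values; choose a family that matches your cluster version):

```bash
# Create a custom DB cluster parameter group
aws rds create-db-cluster-parameter-group \
    --db-cluster-parameter-group-name aurora-mysql57-s3-pg \
    --db-parameter-group-family aurora-mysql5.7 \
    --description "Cluster parameter group for Aurora MySQL S3 export and import"

# Set aurora_default_s3_role to the ARN of the IAM role created earlier
aws rds modify-db-cluster-parameter-group \
    --db-cluster-parameter-group-name aurora-mysql57-s3-pg \
    --parameters "ParameterName=aurora_default_s3_role,ParameterValue=arn:aws:iam::123456789012:role/s3-demo-access-role,ApplyMethod=immediate"
```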
- Modify the Aurora cluster and associate the newly created parameter group. See the following code:
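For example (the cluster identifier aurora-mysql-cluster is a placeholder for your own cluster):

```bash
aws rds modify-db-cluster \
    --db-cluster-identifier aurora-mysql-cluster \
    --db-cluster-parameter-group-name aurora-mysql57-s3-pg \
    --apply-immediately
```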
The --apply-immediately option triggers the modification and asynchronously applies the changes as soon as possible, regardless of the PreferredMaintenanceWindow setting on the DB cluster.
- Associate the IAM role with the cluster. This allows database users in the Aurora DB cluster to access the S3 bucket. See the following code:
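A sketch with the same placeholder cluster identifier and the role ARN from this post:

```bash
aws rds add-role-to-db-cluster \
    --db-cluster-identifier aurora-mysql-cluster \
    --role-arn arn:aws:iam::123456789012:role/s3-demo-access-role
```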
- Reboot the primary instance in the cluster to apply the changes. See the following code:
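For example (the instance identifier is a placeholder for your writer instance):

```bash
aws rds reboot-db-instance \
    --db-instance-identifier aurora-mysql-cluster-instance-1
```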
You’re now ready to explore the following use cases of exporting and importing data.
Exporting data from Aurora MySQL to Amazon S3
To export your data, complete the following steps:
- Connect to the cluster as the master user. By default, the master user has permission to export and import data from Amazon S3. For this post, you create a test user with the least-required permission to export data to the S3 bucket. See the following code:
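For example, from the EC2 instance with the MySQL client (the endpoint and master user name are placeholders for your own cluster):

```bash
mysql -h aurora-mysql-cluster.cluster-xxxxxxxxxxxx.us-east-1.rds.amazonaws.com -u admin -p
```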
- Create the user testuser and grant the required privileges to the user on the database. This post uses sampledb01. See the following code:
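A sketch of the user setup; the exact privileges you grant on sampledb01 (and the '%' host) depend on your environment:

```sql
-- Run as the master user
CREATE USER 'testuser'@'%' IDENTIFIED BY 'ChooseAStrongPassword';
GRANT SELECT, INSERT, CREATE, DROP ON sampledb01.* TO 'testuser'@'%';
```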
- Grant the SELECT INTO S3 privilege to testuser with the following code:
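A sketch; the '%' host is an example, so scope it to your client hosts as needed:

```sql
-- Aurora MySQL privilege that allows exporting query results to Amazon S3
GRANT SELECT INTO S3 ON *.* TO 'testuser'@'%';
```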
- Log in to the cluster as testuser:
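For example (same placeholder endpoint as before):

```bash
mysql -h aurora-mysql-cluster.cluster-xxxxxxxxxxxx.us-east-1.rds.amazonaws.com -u testuser -p sampledb01
```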
- For this post, you export the lineitem table to the S3 bucket. You can use any existing table in your database (preferably of a small size) to test this feature. The following code shows the structure of the lineitem table and displays a few of its records.
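The following is a simplified, TPC-H-style sketch of a lineitem table plus a quick preview query; the actual table in your database may differ:

```sql
-- Simplified, TPC-H-style sketch of the lineitem table
CREATE TABLE lineitem (
  l_orderkey      BIGINT NOT NULL,
  l_partkey       BIGINT NOT NULL,
  l_suppkey       BIGINT NOT NULL,
  l_linenumber    INT NOT NULL,
  l_quantity      DECIMAL(12,2),
  l_extendedprice DECIMAL(12,2),
  l_discount      DECIMAL(12,2),
  l_tax           DECIMAL(12,2),
  l_returnflag    CHAR(1),
  l_linestatus    CHAR(1),
  l_shipdate      DATE,
  l_commitdate    DATE,
  l_receiptdate   DATE,
  l_shipinstruct  CHAR(25),
  l_shipmode      CHAR(10),
  l_comment       VARCHAR(44),
  PRIMARY KEY (l_orderkey, l_linenumber)
);

-- Preview a few records
SELECT l_orderkey, l_linenumber, l_quantity, l_extendedprice, l_shipdate
FROM lineitem
LIMIT 5;
```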
- To export data from the Aurora table to the S3 bucket, use the SELECT INTO OUTFILE S3 statement. The following statement exports the entire table to the S3 bucket. If you’re trying this feature out for the first time, consider using the LIMIT clause for a larger table. See the following code:
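A sketch of the export statement, assuming the sample-loaddata01 bucket is in the same AWS Region as the cluster (otherwise, use the s3-region:// form of the URI) and using an unload/ prefix chosen for this example:

```sql
SELECT * FROM lineitem
INTO OUTFILE S3 's3://sample-loaddata01/unload/lineitem'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
```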
The exported file is created in the S3 bucket. The default file size threshold is 6 GB. Because the data selected by the statement is less than the file size threshold, a single file is created. If you need the files to be smaller than 6 GB, you can identify a column to split the table data into small portions and run multiple SELECT INTO OUTFILE S3 statements (using a WHERE condition). It’s best to do this when the amount of data selected is more than 25 GB. The following code shows how to list the exported objects and how to split the export.
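For example, you can list the exported objects with the AWS CLI and, for a larger table, split the export into ranges with a WHERE condition (the l_shipdate boundaries are illustrative):

```bash
# List the objects written under the export prefix
aws s3 ls s3://sample-loaddata01/unload/ --recursive --human-readable
```

```sql
-- Export the table in two slices by filtering on a suitable column
SELECT * FROM lineitem
WHERE l_shipdate < '1995-01-01'
INTO OUTFILE S3 's3://sample-loaddata01/unload/lineitem-part1'
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

SELECT * FROM lineitem
WHERE l_shipdate >= '1995-01-01'
INTO OUTFILE S3 's3://sample-loaddata01/unload/lineitem-part2'
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
```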
Importing data from Amazon S3 to Aurora MySQL
In this section, you load the data back to the Aurora table from the Amazon S3 file. Complete the following steps:
- Grant the LOAD FROM S3 privilege to testuser. See the following code:
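A sketch, again using the example '%' host:

```sql
-- Aurora MySQL privilege that allows loading data from Amazon S3
GRANT LOAD FROM S3 ON *.* TO 'testuser'@'%';
```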
- Use LOAD DATA FROM S3 to import the data file from an Amazon S3 prefix. See the following code:
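A sketch that reloads the files written by the earlier export, assuming the same bucket and prefix:

```sql
LOAD DATA FROM S3 PREFIX 's3://sample-loaddata01/unload/lineitem'
INTO TABLE lineitem
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
```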
- Verify the loaded files using the aurora_s3_load_history system table. See the following code:
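For example:

```sql
-- Each successfully loaded file is recorded in this system table
SELECT * FROM mysql.aurora_s3_load_history
WHERE load_prefix = 's3://sample-loaddata01/unload/lineitem';
```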
Best practices
This section discusses a few best practices for bulk loading large datasets from Amazon S3 to your Aurora MySQL database. This post bases these observations on a series of tests loading 50 million records to the lineitem
table on a db.r4.4xlarge instance (see the preceding sections for table structure and example records).
Load testing was carried out when no other active transactions were running on the cluster. Results might vary depending on your cluster load and instance type.
File size
Make sure that the source files in the S3 bucket aren’t too small. Loading several small files (1–10 MB) adds the overhead of file contention and impacts load performance. For optimal performance, consider a file size between 100 MB and 1 GB.
The following graph shows the import time versus file size while loading 50 million records to the lineitem
table on a db.r4.4xlarge instance. It’s quite evident that load performance falls very sharply with small files, but doesn’t meaningfully increase with very large files.
Partitioning
If you’re using partitioned tables, consider loading partitions in parallel to improve load performance. Load from Amazon S3 supports explicit partition selection and only locks the partition where data is being loaded. Partitioning the lineitem table on the linenumber column and loading all partitions concurrently shows a significant improvement in load time. See the following code:
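A sketch of the approach, assuming the lineitem layout shown earlier: partition the table by LIST on l_linenumber, then run one load per partition from separate sessions (partition values and file prefixes are illustrative):

```sql
-- Partition lineitem by line number (TPC-H line numbers range from 1 to 7)
ALTER TABLE lineitem
PARTITION BY LIST (l_linenumber) (
  PARTITION p1 VALUES IN (1),
  PARTITION p2 VALUES IN (2),
  PARTITION p3 VALUES IN (3),
  PARTITION p4 VALUES IN (4),
  PARTITION p5 VALUES IN (5),
  PARTITION p6 VALUES IN (6),
  PARTITION p7 VALUES IN (7)
);

-- Run each statement from its own session so the partitions load concurrently
LOAD DATA FROM S3 PREFIX 's3://sample-loaddata01/unload/lineitem-p1'
INTO TABLE lineitem PARTITION (p1)
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

LOAD DATA FROM S3 PREFIX 's3://sample-loaddata01/unload/lineitem-p2'
INTO TABLE lineitem PARTITION (p2)
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
-- ...and so on for the remaining partitions
```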
Without a partition strategy, the approximate load time is 8 minutes, 20 seconds. With the LIST (linenumber) partition strategy and concurrent loads, the approximate load time is 2 minutes, 30 seconds.
Concurrent loads
To improve load performance, you can load multiple tables concurrently. However, the degree of concurrency can impact the other transactions running on the cluster. The number of vCPUs allocated to the instance type plays a key role, because the load operation requires CPU cycles to read data, insert into tables, commit changes, and more. For more information, see Amazon RDS Instance Types. Loading several tables concurrently with few vCPUs can cause CPU utilization to spike and may impact the existing workload. For more information, see Overview of Monitoring Amazon RDS.
The following Amazon CloudWatch metric illustrates the optimal CPU utilization when eight concurrent loads were running on a db.r4.8xlarge (32 vCPU) cluster.
The following Amazon RDS Performance Insights graph was taken during the eight concurrent loads. From the graph, it’s quite evident that the majority of the time is spent consuming basic host resources like CPU and I/O. The dominant contributor to DB load is the wait event io/table/sql/handler, which suggests the storage engine is processing I/O requests against tables.
For more information about wait events, see Aurora MySQL Events. For more information about using Performance Insights, see Tuning Amazon RDS for MySQL with Performance Insights.
Another option is to split the data into multiple files in the S3 bucket and run concurrent load sessions against a single non-partitioned table. This slightly improves the load time compared to running a single load operation. With one session, the approximate load time to import 150 million rows was 23 minutes, 40 seconds. With three concurrent sessions each importing 50 million records, the approximate load time was 22 minutes, 10 seconds.
Running the export from a reader instance
Exporting a large amount of data from Aurora to Amazon S3 is an intensive operation. Running the export from the writer instance may impact other transactions running on the cluster. To avoid this, you can run the export from a reader instance. Aurora readers are designed to serve read traffic and support the SELECT INTO OUTFILE S3 operation.
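For example, point the MySQL client at the cluster reader endpoint (a placeholder is shown) and run the SELECT INTO OUTFILE S3 statement from that session:

```bash
mysql -h aurora-mysql-cluster.cluster-ro-xxxxxxxxxxxx.us-east-1.rds.amazonaws.com -u testuser -p sampledb01
```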
Constraints
The data import process often becomes slower when foreign key and unique key checks are enabled. You can turn these checks off at the session level by setting the corresponding parameters to 0 before running the load statements. Only consider this if you trust the integrity of the data loaded from Amazon S3 and there is no other write workload on the table. To disable the key checks, enter the following code:
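For example:

```sql
-- Disable the checks for the current session only
SET foreign_key_checks = 0;
SET unique_checks = 0;
```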
After the load statements are complete, enable the key checks with the following code:
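For example:

```sql
-- Re-enable the checks after the load completes
SET foreign_key_checks = 1;
SET unique_checks = 1;
```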
Don’t modify the table structure (for example, by using ALTER TABLE) while foreign key checks and unique checks are disabled. As a general best practice, assess the impact before disabling these parameters.
Summary
This post demonstrated how to import and export data between Aurora MySQL and Amazon S3. Aurora is powered with operational analytics capabilities, and integration with Amazon S3 makes it easier to establish an agile analytical environment. For more information about importing data, see Loading Data into an Amazon Aurora MySQL DB Cluster from Text Files in an Amazon S3 Bucket. For more information about exporting data, see Saving Data from an Amazon Aurora MySQL DB Cluster into Text Files in an Amazon S3 Bucket. For more information about Aurora best practices, see Best Practices with Amazon Aurora MySQL.
About the Authors
Suresh Patnam is a Senior Big Data Architect at AWS. He works with customers in their journey to the cloud with a focus on big data, data lakes, and data strategy. In his spare time, Suresh enjoys playing tennis and spending time with his family.
Mahesh Goyal is a Data Architect in Big Data at AWS. He works with customers in their journey to the cloud with a focus on big data and data warehouses. In his spare time, Mahesh likes to listen to music and explore new food places with his family.