I want to upload a large amount of data to Amazon Simple Storage Service (Amazon S3), or copy a large amount of data between S3 buckets. How can I optimize the performance of this data transfer?
Consider the following methods of transferring large amounts of data to or from Amazon S3 buckets:
Parallel uploads using the AWS Command Line Interface (AWS CLI)
Note: As a best practice, be sure that you're using the most recent version of the AWS CLI. For more information, see Installing the AWS Command Line Interface.
To potentially decrease the overall time it takes to complete the transfer, split the transfer into multiple mutually exclusive operations. You can run multiple instances of aws s3 cp (copy), aws s3 mv (move), or aws s3 sync (synchronize) at the same time.
One way to split up your transfer is to use --exclude and --include parameters to separate the operations by file name. For example, if you need to copy a large amount of data from one bucket to another bucket, and all the file names begin with a number, you can run the following commands on two instances of the AWS CLI.
Note: The --exclude and --include parameters are processed on the client side. Because of this, the resources of your local machine might affect the performance of the operation.
Run this command to copy the files with names that begin with the numbers 0 through 4:
aws s3 cp s3://srcbucket/ s3://destbucket/ --recursive --exclude "*" --include "0*" --include "1*" --include "2*" --include "3*" --include "4*"
Run this command to copy the files with names that begin with the numbers 5 through 9:
aws s3 cp s3://srcbucket/ s3://destbucket/ --recursive --exclude "*" --include "5*" --include "6*" --include "7*" --include "8*" --include "9*"
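Because the filters run client side, it's worth understanding how they combine: the AWS CLI evaluates --exclude and --include patterns in the order given, and the last pattern that matches a key decides whether it is included. A minimal Python sketch of that rule (the file names and pattern list here are illustrative, not part of the commands above):

```python
from fnmatch import fnmatch

def apply_filters(keys, filters):
    """Rough sketch of AWS CLI --exclude/--include handling:
    filters are applied in order, and the last matching rule wins.
    `filters` is a list of (action, pattern) tuples."""
    selected = []
    for key in keys:
        decision = True  # with no filters, everything is included
        for action, pattern in filters:
            if fnmatch(key, pattern):
                decision = (action == "include")
        if decision:
            selected.append(key)
    return selected

# Mirrors --exclude "*" --include "0*" ... --include "4*"
filters = [("exclude", "*")] + [("include", f"{d}*") for d in "01234"]
keys = ["0a.log", "42.csv", "7b.dat", "alpha.txt"]
print(apply_filters(keys, filters))  # → ['0a.log', '42.csv']
```

This is why the leading --exclude "*" is needed: without it, every key is included by default and the --include patterns would have nothing to narrow down.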
Important: If you need to transfer a very large number of objects (hundreds of millions), consider building a custom application using an AWS SDK to perform the copy. While the AWS CLI can perform the copy, a custom application might be more efficient at that scale.
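At that scale, a common pattern for such an application is to list the keys and fan the per-object copy calls out across worker threads. A minimal sketch of the concurrency structure, assuming the actual copy would be an SDK call such as boto3's copy_object (injected here as a callable so the sketch stays self-contained):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_copy(keys, copy_one, workers=32):
    """Fan per-object copies out across a thread pool.
    `copy_one` is the per-key copy callable; with an AWS SDK such as
    boto3 it would wrap copy_object (or a managed transfer for large
    objects). Returns the number of keys processed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(copy_one, keys))
    return len(results)

# Illustration with a stand-in copy function; a real application would
# call the SDK here and add retries and error handling.
copied = []
count = parallel_copy(["key-a", "key-b", "key-c"], copied.append, workers=4)
print(count)  # → 3
```

A real implementation would also paginate the source bucket listing and handle throttling, which is where the efficiency gains over repeated CLI invocations come from.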
AWS Snowball
Consider using AWS Snowball for transfers between your on-premises data centers and Amazon S3, particularly when the data exceeds 10 TB.
Note the following limitations:
- AWS Snowball doesn't support bucket-to-bucket data transfers.
- AWS Snowball doesn't support server-side encryption with keys managed by AWS Key Management Service (AWS KMS). For more information, see Server-Side Encryption in AWS Snowball.
S3DistCp with Amazon EMR
Consider using S3DistCp with Amazon EMR to copy data across Amazon S3 buckets. S3DistCp enables parallel copying of large volumes of objects.
Important: Because this option requires you to launch an Amazon EMR cluster, be sure to review Amazon EMR pricing.