What's the best way to transfer large amounts of data from one Amazon S3 bucket to another?
I want to transfer a large amount of data (1 TB or more) from one Amazon Simple Storage Service (Amazon S3) bucket to another bucket. How can I do that?
Depending on your use case, you can perform the data transfer between buckets using one of the following options:
- Run parallel uploads using the AWS Command Line Interface (AWS CLI)
- Use an AWS SDK
- Use cross-Region replication or same-Region replication
- Use Amazon S3 batch operations
- Use S3DistCp with Amazon EMR
Run parallel uploads using the AWS CLI
Note: As a best practice, be sure that you're using the most recent version of the AWS CLI. For more information, see Installing the AWS CLI.
To shorten the transfer time, you can split the transfer into multiple mutually exclusive operations and run them in parallel. For example, you can run multiple, parallel instances of aws s3 cp, aws s3 mv, or aws s3 sync using the AWS CLI. Give each instance its own --exclude and --include parameters so that each instance handles a different set of objects. These parameters filter operations by file name.
Note: The --exclude and --include parameters are processed on the client side. Because of this, the resources of your local machine might affect the performance of the operation.
For example, to copy a large amount of data from one bucket to another where all the file names begin with a number, you can run the following commands on two instances of the AWS CLI. First, run this command to copy the files with names that begin with the numbers 0 through 4:
aws s3 cp s3://source-awsexamplebucket/ s3://destination-awsexamplebucket/ --recursive --exclude "*" --include "0*" --include "1*" --include "2*" --include "3*" --include "4*"
Then, run this command to copy the files with names that begin with the numbers 5 through 9:
aws s3 cp s3://source-awsexamplebucket/ s3://destination-awsexamplebucket/ --recursive --exclude "*" --include "5*" --include "6*" --include "7*" --include "8*" --include "9*"
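The two commands above can also be started together from one script. The following is a minimal sketch, assuming a POSIX-like shell and the example bucket names used above. The names SRC, DST, copy_range, run_parallel_copies, and DRY_RUN are illustrative and are not part of the AWS CLI.

```shell
# Sketch: run one `aws s3 cp` per range of leading digits, in parallel.
# SRC, DST, and the function names below are illustrative assumptions.
SRC="s3://source-awsexamplebucket/"
DST="s3://destination-awsexamplebucket/"

# Copy the objects whose names begin with the digits $1 through $2.
# Set DRY_RUN=1 to print the command instead of running it.
copy_range() {
  first=$1
  last=$2
  set -- s3 cp "$SRC" "$DST" --recursive --exclude "*"
  d=$first
  while [ "$d" -le "$last" ]; do
    set -- "$@" --include "${d}*"   # one --include filter per leading digit
    d=$((d + 1))
  done
  if [ -n "${DRY_RUN:-}" ]; then
    echo "aws $*"
  else
    aws "$@"
  fi
}

# Start both copies in the background, then wait for each one.
# Returns non-zero if either copy fails.
run_parallel_copies() {
  copy_range 0 4 & pid1=$!
  copy_range 5 9 & pid2=$!
  status=0
  wait "$pid1" || status=1
  wait "$pid2" || status=1
  return "$status"
}
```

To preview the generated commands without copying anything, run DRY_RUN=1 run_parallel_copies. To start the transfer, run run_parallel_copies.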
Additionally, you can customize the following AWS CLI configurations to speed up the data transfer:
- multipart_chunksize: This value sets the size of each part that the AWS CLI uploads in a multipart upload for an individual file. This setting allows you to break down a larger file (for example, 300 MB) into smaller parts for quicker upload speeds.
Note: A multipart upload can contain no more than 10,000 distinct parts for a single file. Be sure that the multipart_chunksize value that you set is large enough for your largest file to fit within that limit, while still keeping the individual parts small enough to upload quickly.
- max_concurrent_requests: This value sets the number of requests that can be sent to Amazon S3 at a time. The default value is 10. You can increase it to a higher value like 50.
Note: Running more threads consumes more resources on your machine. Be sure that your machine has enough resources to support the maximum number of concurrent requests that you want.
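The balance between chunk size and the 10,000-part limit can be worked out with a little arithmetic. The following is a rough sketch, assuming a POSIX-like shell. The function names min_chunk_mib and apply_s3_settings are illustrative, and the 5 MiB floor reflects the Amazon S3 minimum part size for multipart uploads.

```shell
# Smallest whole-MiB chunk size that keeps a file of $1 bytes
# within 10,000 parts (never below the 5 MiB S3 minimum part size).
min_chunk_mib() {
  size_bytes=$1
  max_parts=10000
  mib=$((1024 * 1024))
  per_part=$(( (size_bytes + max_parts - 1) / max_parts ))   # ceil(bytes / parts)
  chunk=$(( (per_part + mib - 1) / mib ))                    # round up to whole MiB
  if [ "$chunk" -lt 5 ]; then
    chunk=5
  fi
  echo "$chunk"
}

# Write both settings to the default AWS CLI profile.
# $1 = chunk size in MiB, $2 = number of concurrent requests.
apply_s3_settings() {
  aws configure set default.s3.multipart_chunksize "${1}MB"
  aws configure set default.s3.max_concurrent_requests "$2"
}
```

For example, min_chunk_mib 1099511627776 (a single 1 TiB file) prints 105, so apply_s3_settings 105 50 would set a chunk size that fits that file and raise the concurrency to 50.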
Use an AWS SDK
Consider building a custom application using an AWS SDK to perform the data transfer for a very large number of objects. While the AWS CLI can perform the copy operation, a custom application might be more efficient at performing a transfer at the scale of hundreds of millions of objects.
Use cross-Region replication or same-Region replication
After you set up cross-Region replication (CRR) or same-Region replication (SRR) on the source bucket, Amazon S3 automatically and asynchronously replicates new objects from the source bucket to the destination bucket. You can choose to filter which objects are replicated using a prefix or tag. For more information on configuring replication and specifying a filter, see Replication Configuration Overview.
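As an illustration of a replication rule with a prefix filter, here is a minimal sketch that assumes a POSIX-like shell and an existing IAM role that Amazon S3 can assume for replication. The role ARN, the rule ID, the prefix, and the function names are hypothetical placeholders. The sketch also turns on versioning, which replication requires on both buckets.

```shell
# Print a replication configuration with a single prefix-filtered rule.
# $1 = IAM role ARN, $2 = destination bucket name, $3 = key prefix to replicate.
replication_config() {
  role_arn=$1
  dest_bucket=$2
  prefix=$3
  cat <<EOF
{
  "Role": "$role_arn",
  "Rules": [
    {
      "ID": "example-prefix-rule",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": { "Prefix": "$prefix" },
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": { "Bucket": "arn:aws:s3:::$dest_bucket" }
    }
  ]
}
EOF
}

# Enable versioning on both buckets, then apply the rule to the source bucket.
# $1 = source bucket, $2 = destination bucket, $3 = role ARN, $4 = prefix.
enable_replication() {
  aws s3api put-bucket-versioning --bucket "$1" \
    --versioning-configuration Status=Enabled &&
  aws s3api put-bucket-versioning --bucket "$2" \
    --versioning-configuration Status=Enabled &&
  replication_config "$3" "$2" "$4" > replication.json &&
  aws s3api put-bucket-replication --bucket "$1" \
    --replication-configuration file://replication.json
}
```

A call might look like enable_replication source-awsexamplebucket destination-awsexamplebucket arn:aws:iam::111122223333:role/example-role logs/, where the account ID and role name are placeholders.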
After replication is configured, only new objects are replicated to the destination bucket. Existing objects are not replicated to the destination bucket. To replicate existing objects, you can run the following cp command after setting up replication on the source bucket:
aws s3 cp s3://source-awsexamplebucket s3://source-awsexamplebucket --recursive --storage-class STANDARD
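This command copies each object in the source bucket onto itself, which triggers replication to the destination bucket. Because of the --storage-class parameter, the command also sets the storage class of every copied object to STANDARD.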
Note: It's a best practice to test the cp command in a non-production environment. Doing so allows you to configure the parameters for your exact use case.
Use Amazon S3 batch operations
You can use Amazon S3 batch operations to copy multiple objects with a single request. When you create a batch operation job, you specify which objects to perform the operation on using an Amazon S3 inventory report or a CSV file. Amazon S3 batch operations then calls the corresponding API for each object in that list.
After the batch operation job is complete, you get a notification and you can choose to receive a completion report about the job.
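For the CSV option, the manifest lists one object per row as bucket,key. The following is a small sketch of building such a file from a list of keys, assuming a POSIX-like shell and keys that need no URL encoding (keys with special characters would have to be URL-encoded first). The function name manifest_lines is illustrative.

```shell
# Read object keys from standard input (one per line) and print
# one "bucket,key" manifest row per key. Blank lines are skipped.
# $1 = name of the bucket that holds the objects.
manifest_lines() {
  bucket=$1
  while IFS= read -r key; do
    if [ -n "$key" ]; then
      printf '%s,%s\n' "$bucket" "$key"
    fi
  done
}
```

For example, printf '%s\n' 0001.csv 0002.csv | manifest_lines source-awsexamplebucket > manifest.csv writes a two-row manifest that you can upload to Amazon S3 and reference when you create the job.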
Use S3DistCp with Amazon EMR
The S3DistCp operation on Amazon EMR can perform parallel copying of large volumes of objects across Amazon S3 buckets. S3DistCp first copies the files from the source bucket to the worker nodes in an Amazon EMR cluster. Then, the operation writes the files from the worker nodes to the destination bucket. For more guidance on using S3DistCp, see Seven tips for using S3DistCp on Amazon EMR to move data efficiently between HDFS and Amazon S3.
Important: Because this option requires that you use Amazon EMR, be sure to review Amazon EMR pricing.