What's the best way to transfer large amounts of data from one Amazon S3 bucket to another?

I want to transfer a large amount of data (1 TB or more) from one Amazon Simple Storage Service (Amazon S3) bucket to another bucket.

Short description

Depending on your use case, you can perform the data transfer between buckets using one of the following options:

  • Run parallel uploads using the AWS CLI
  • Use an AWS SDK
  • Use cross-Region replication or same-Region replication
  • Use Amazon S3 batch operations
  • Use S3DistCp with Amazon EMR
  • Use AWS DataSync

Resolution

Run parallel uploads using the AWS CLI

Note: If you receive errors when running AWS Command Line Interface (AWS CLI) commands, make sure that you're using the most recent version of the AWS CLI.

To improve your transfer time, use multi-threading. Split the transfer into multiple mutually exclusive operations. For example, use the AWS CLI to run multiple, parallel instances of aws s3 cp, aws s3 mv, or aws s3 sync. You can create more upload threads when you use the --exclude and --include parameters for each instance of the AWS CLI. These parameters filter operations by file name.

Note: The --exclude and --include parameters are processed on the client side. This means that the resources on your local machine might affect the performance of the operation.

For example, to copy a large amount of data from one bucket to another, run the following commands. Note that the file names begin with a number.

First, run this command to copy the files with names that begin with the numbers 0 through 4:

aws s3 cp s3://source-awsexamplebucket/ s3://destination-awsexamplebucket/ --recursive --exclude "*" --include "0*" --include "1*" --include "2*" --include "3*" --include "4*"

Then, run this command on a second AWS CLI instance to copy the files with names that begin with the numbers 5 through 9:

aws s3 cp s3://source-awsexamplebucket/ s3://destination-awsexamplebucket/ --recursive --exclude "*" --include "5*" --include "6*" --include "7*" --include "8*" --include "9*"
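
The two commands above can also be started from a single shell session as background jobs. The following is a minimal sketch under stated assumptions, not part of the original guidance: copy_by_prefix and the AWS_CLI variable are hypothetical names introduced here for illustration, and the bucket names are the placeholders used in this article.

```shell
# Hypothetical helper: copy only objects whose key begins with one of the
# given characters. AWS_CLI is an assumption of this sketch; it defaults to
# the real "aws" binary and can be set to "echo" for a dry run.
copy_by_prefix() {
  src=$1
  dst=$2
  shift 2
  # Turn each remaining argument (for example 0 1 2) into --include "0*" ...
  for p in "$@"; do
    set -- "$@" --include "${p}*"
    shift
  done
  "${AWS_CLI:-aws}" s3 cp "$src" "$dst" --recursive --exclude "*" "$@"
}

# Example (commented out so that nothing is copied by accident):
# copy_by_prefix s3://source-awsexamplebucket/ s3://destination-awsexamplebucket/ 0 1 2 3 4 &
# copy_by_prefix s3://source-awsexamplebucket/ s3://destination-awsexamplebucket/ 5 6 7 8 9 &
# wait   # block until both background copies finish
```

Setting AWS_CLI=echo before calling the helper prints the command that would run, which is a cheap way to check the filters before starting a large copy.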

If you want to speed up the data transfer, then customize the following AWS CLI configurations:

  • multipart_chunksize: This value sets the size of each part that the AWS CLI uploads in a multipart upload for an individual file. This setting allows you to break down a larger file (for example, 300 MB) into smaller parts for quicker upload speeds.
    Note: A multipart upload requires that a single file is uploaded in no more than 10,000 distinct parts. Verify that the chunk size that you set balances the size of each part against the number of parts.
  • max_concurrent_requests: This value sets the number of requests that you can send to Amazon S3 at a time. The default value is 10, but you can increase it to a higher value. Verify that your machine has enough resources to support the maximum number of concurrent requests that you want.
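
The 10,000-part limit and the two settings above can be checked and applied from the shell. This is a sketch only: min_part_size_bytes, tune_s3_cli, and the AWS_CLI variable are hypothetical names introduced here, and the values 64MB, 128MB, and 20 are illustrative rather than recommended.

```shell
# Smallest part size, in bytes, that keeps an object of the given size
# within the 10,000-part limit (integer ceiling of size / 10000).
min_part_size_bytes() {
  echo $(( ($1 + 9999) / 10000 ))
}

# Hypothetical wrapper around "aws configure set" for the default profile.
# The fallback values 64MB and 20 are illustrative, not recommendations.
# AWS_CLI is an assumption of this sketch; set it to "echo" for a dry run.
tune_s3_cli() {
  "${AWS_CLI:-aws}" configure set default.s3.multipart_chunksize "${1:-64MB}"
  "${AWS_CLI:-aws}" configure set default.s3.max_concurrent_requests "${2:-20}"
}

# Example (commented out so that no configuration is changed by accident):
# min_part_size_bytes 1099511627776   # size of a 1 TiB object in bytes
# tune_s3_cli 128MB 20
```

For a 1 TiB object the first function prints 109951163 (about 105 MiB), so a chunk size such as 128MB would stay within the part limit for an object of that size.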

Use an AWS SDK

Use an AWS SDK to build a custom application that performs data transfers for a large number of objects. Depending on your use case, a custom application might be more efficient than the AWS CLI for transferring hundreds of millions of objects.

Use cross-Region replication or same-Region replication

Set up cross-Region replication (CRR) or same-Region replication (SRR) on the source bucket. This allows Amazon S3 to automatically replicate new objects from the source bucket to the destination bucket. To filter the objects that Amazon S3 replicates, use a prefix or tag. For more information on configuring replication and specifying a filter, see Replication configuration.
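
As an illustration of a prefix filter, the sketch below prints a replication configuration and shows, in comments, how it might be applied with the AWS CLI. The helper name replication_config, the rule ID, the account ID, the role name, and the logs/ prefix are all hypothetical. It assumes that versioning is already turned on for both buckets and that the IAM role exists.

```shell
# Hypothetical helper: print a replication configuration that replicates
# only objects under one key prefix. All three arguments are placeholders.
replication_config() {
  role_arn=$1
  dest_bucket=$2
  prefix=$3
  cat <<EOF
{
  "Role": "$role_arn",
  "Rules": [
    {
      "ID": "copy-prefix",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": { "Prefix": "$prefix" },
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": { "Bucket": "arn:aws:s3:::$dest_bucket" }
    }
  ]
}
EOF
}

# Example (commented out; the role ARN and prefix are made up for illustration):
# replication_config arn:aws:iam::111122223333:role/example-replication-role \
#   destination-awsexamplebucket logs/ > replication.json
# aws s3api put-bucket-replication --bucket source-awsexamplebucket \
#   --replication-configuration file://replication.json
```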

After you configure replication, only new objects are replicated to the destination bucket. Existing objects aren't replicated to the destination bucket. For more information, see Replicating existing objects with S3 Batch Replication.

Use Amazon S3 batch operations

You can use Amazon S3 batch operations to copy multiple objects with a single request. When you create a batch operation job, you can use an Amazon S3 inventory report to specify which objects to perform the operation on. Or, you can use a CSV manifest file to specify a batch job. Then, Amazon S3 batch operations call the API to perform the operation.

After the batch operation job completes, you get a notification and an optional completion report about the job.
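
A CSV manifest for a batch operations job lists one object per line as a bucket name and an object key. The following sketch builds such a file from a list of keys. The helper name make_manifest and the example keys are hypothetical, and the sketch assumes that the keys are already URL-encoded.

```shell
# Hypothetical helper: read object keys from standard input (one per line)
# and print CSV manifest lines in the form "bucket,key".
# Assumption: the keys are already URL-encoded; no encoding is done here.
make_manifest() {
  bucket=$1
  while IFS= read -r key; do
    # Skip blank lines so the manifest has no empty keys.
    if [ -n "$key" ]; then
      printf '%s,%s\n' "$bucket" "$key"
    fi
  done
}

# Example (commented out; the keys are made up for illustration):
# printf '%s\n' 0001.dat 0002.dat | make_manifest source-awsexamplebucket > manifest.csv
```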

Use S3DistCp with Amazon EMR

The S3DistCp operation on Amazon EMR can copy a large number of objects in parallel across Amazon S3 buckets. S3DistCp first copies the files from the source bucket to the worker nodes in an Amazon EMR cluster. Then, the operation writes the files from the worker nodes to the destination bucket. For more guidance on using S3DistCp, see Seven tips for using S3DistCp on Amazon EMR to move data efficiently between HDFS and Amazon S3.
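
A bucket-to-bucket copy with S3DistCp might look like the sketch below when run on a node of an existing EMR cluster. The wrapper name distcp_between_buckets and the S3DISTCP variable are hypothetical, and the bucket names are the placeholders used in this article.

```shell
# Hypothetical wrapper around the s3-dist-cp command that is available on
# Amazon EMR cluster nodes. S3DISTCP is an assumption of this sketch; set it
# to "echo" to print the command instead of running it.
distcp_between_buckets() {
  "${S3DISTCP:-s3-dist-cp}" --src "$1" --dest "$2"
}

# Example (commented out; intended for the primary node of an EMR cluster):
# distcp_between_buckets s3://source-awsexamplebucket/ s3://destination-awsexamplebucket/
```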

Important: Because this option requires that you use Amazon EMR, be sure to review Amazon EMR pricing.

Use AWS DataSync

To move large amounts of data from one Amazon S3 bucket to another bucket, perform these steps:

  1. Open the AWS DataSync console.
  2. Create a task.
  3. Create a new location for Amazon S3.
  4. Select your S3 bucket as the source location.
  5. Update the source location configuration settings. Make sure that you specify the AWS Identity and Access Management (IAM) role to access your source S3 bucket.
  6. Select your S3 bucket as the destination location.
  7. Update the destination location configuration settings. Make sure that you specify the IAM role to access your S3 destination bucket.
  8. Configure the settings for your task.
  9. Review the configuration details.
  10. Choose Create task.
  11. Start your task.
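
The console steps above have rough equivalents in the AWS CLI. The sketch below chains them together. The function name start_bucket_to_bucket_task and the AWS_CLI variable are hypothetical. The sketch also assumes that a single IAM role grants DataSync access to both buckets, which may not match your setup, and it uses default task settings.

```shell
# Hypothetical helper: create two S3 locations, create a task between them,
# and start the task.
# Arguments: source bucket ARN, destination bucket ARN, IAM role ARN.
# Assumption: the same role can access both buckets.
start_bucket_to_bucket_task() {
  cli=${AWS_CLI:-aws}
  src_loc=$("$cli" datasync create-location-s3 --s3-bucket-arn "$1" \
    --s3-config "BucketAccessRoleArn=$3" --query LocationArn --output text) || return 1
  dst_loc=$("$cli" datasync create-location-s3 --s3-bucket-arn "$2" \
    --s3-config "BucketAccessRoleArn=$3" --query LocationArn --output text) || return 1
  task=$("$cli" datasync create-task --source-location-arn "$src_loc" \
    --destination-location-arn "$dst_loc" --query TaskArn --output text) || return 1
  "$cli" datasync start-task-execution --task-arn "$task"
}

# Example (commented out; the account ID and role name are made up):
# start_bucket_to_bucket_task arn:aws:s3:::source-awsexamplebucket \
#   arn:aws:s3:::destination-awsexamplebucket \
#   arn:aws:iam::111122223333:role/example-datasync-role
```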

Important: When you use AWS DataSync, you incur additional costs. To preview any DataSync costs, review the DataSync pricing structure and DataSync limits.

AWS OFFICIAL
Updated 8 months ago
2 Comments

How long will each of the above mentioned services take to complete a transfer of 200TB of data?

answered 9 months ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
MODERATOR
answered 9 months ago