How can I use Data Pipeline to run a one-time copy or automate a scheduled synchronization of my Amazon S3 buckets?

Last updated: 2020-06-18

I want to transfer data between two Amazon Simple Storage Service (Amazon S3) buckets as a one-time task, or as a scheduled synchronization. How can I set up a copy or sync operation between buckets using AWS Data Pipeline?

Resolution

Note: Using Data Pipeline is one option for transferring data between S3 buckets. Other options include using S3 Batch Operations, enabling replication, or running the cp or sync commands on the AWS Command Line Interface (AWS CLI).

1.    Confirm that your AWS Identity and Access Management (IAM) user or role has sufficient permissions for using Data Pipeline.
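
For example, one way to grant broad Data Pipeline permissions is to attach the AWS managed policy AWSDataPipeline_FullAccess. The command below is a sketch only; the user name datapipeline-user is a placeholder, and you can attach the policy to a role instead.

aws iam attach-user-policy --user-name datapipeline-user --policy-arn arn:aws:iam::aws:policy/AWSDataPipeline_FullAccess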

2.    Sign in to the AWS Data Pipeline console with your IAM user or role. Confirm that the console is set to an AWS Region that supports Data Pipeline.
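
If you prefer to confirm from the AWS CLI that Data Pipeline is available to you in a given Region, you can call a read-only API in that Region. This check is optional, and the Region name below is only an example.

aws datapipeline list-pipelines --region us-east-1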

Important: The source and destination buckets don't need to be in the same Region, and the buckets don't need to be in the same Region as the pipeline. However, data transfers between Regions incur charges, so be sure to review data transfer pricing for Amazon S3.

3.    Choose Create Pipeline.

4.    For Name, enter a name for the pipeline.

5.    For Source, select Build using a template. Then, select Run AWS CLI command.

6.    For AWS CLI command, to set up a copy operation, enter the following command:

aws s3 cp s3://source-AWSDOC-EXAMPLE-BUCKET1 s3://destination-AWSDOC-EXAMPLE-BUCKET2 --recursive

Note: The --recursive option copies all objects under the source bucket. The copy operation overwrites any objects in the destination bucket that have the same key name as objects in the source bucket.
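
If you want to preview what the copy would transfer before you activate the pipeline, you can run the same command yourself from the AWS CLI with the --dryrun option, which lists the operations without performing them:

aws s3 cp s3://source-AWSDOC-EXAMPLE-BUCKET1 s3://destination-AWSDOC-EXAMPLE-BUCKET2 --recursive --dryrun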

To set up a sync operation, enter the following command:

aws s3 sync s3://source-AWSDOC-EXAMPLE-BUCKET1 s3://destination-AWSDOC-EXAMPLE-BUCKET2

Note: The sync command compares the source and destination buckets, and then transfers only the difference.
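
By default, the sync command doesn't delete destination objects that no longer exist in the source. If you want the destination bucket to mirror the source exactly, you can add the --delete option. Use it carefully, because it removes destination objects that have no match in the source:

aws s3 sync s3://source-AWSDOC-EXAMPLE-BUCKET1 s3://destination-AWSDOC-EXAMPLE-BUCKET2 --delete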

7.    For Run, select on pipeline activation for a one-time copy or sync job. Or, select on a schedule for a scheduled copy or sync, and then complete the Run every, Starting, and Ending fields based on your use case.

8.    For Logging, you can select Enabled, and then enter an S3 location for logs. Or, if you don't want logs, you can select Disabled.
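
After the pipeline runs, you can confirm that log files are being written by listing the log location from the AWS CLI. The bucket and prefix below are examples only; use the S3 location that you entered for Logging.

aws s3 ls s3://AWSDOC-EXAMPLE-LOG-BUCKET/datapipeline-logs/ --recursive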

9.    For IAM roles, you can select either the Default role or a Custom role. The default role has Amazon S3 permissions for s3:CreateBucket, s3:DeleteObject, s3:Get*, s3:List*, and s3:Put*.

Note: If the buckets have default encryption with AWS Key Management Service (AWS KMS), then you must grant the Data Pipeline role permission to use the AWS KMS key.
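
One way to grant that access is to create a KMS grant for the role. The following command is a sketch only; the key ID, account ID, and role name are placeholders, and adding the role to the key policy is an alternative.

aws kms create-grant --key-id 1234abcd-12ab-34cd-56ef-1234567890ab --grantee-principal arn:aws:iam::111122223333:role/DataPipelineDefaultResourceRole --operations Decrypt Encrypt GenerateDataKey DescribeKey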

10.    Choose Activate.
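
After activation, you can monitor the pipeline from the AWS CLI if you prefer. The pipeline ID below is a placeholder; you can find your pipeline's ID in the output of list-pipelines.

aws datapipeline list-pipelines
aws datapipeline list-runs --pipeline-id df-EXAMPLE123456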

Note: You can optionally optimize performance by creating multiple pipelines, one for each root-level prefix in your bucket, so that the prefixes transfer in parallel.
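
For example, if the bucket has root-level prefixes logs/ and images/, each pipeline could run one of the following commands so that the prefixes transfer in parallel. The prefix names are examples only:

aws s3 sync s3://source-AWSDOC-EXAMPLE-BUCKET1/logs s3://destination-AWSDOC-EXAMPLE-BUCKET2/logs
aws s3 sync s3://source-AWSDOC-EXAMPLE-BUCKET1/images s3://destination-AWSDOC-EXAMPLE-BUCKET2/images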