How can I use Data Pipeline to run a one-time copy or automate a scheduled synchronization of my Amazon S3 buckets?
Last updated: 2020-12-23
I want to transfer data between two Amazon Simple Storage Service (Amazon S3) buckets as a one-time task, or as a scheduled synchronization. How can I set up a copy or sync operation between buckets using AWS Data Pipeline?
Note: Using Data Pipeline is one option for transferring data between S3 buckets. Other options include using S3 Batch Operations, enabling replication, or running the cp or sync commands on the AWS Command Line Interface (AWS CLI).
Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent AWS CLI version.
1. Confirm that your AWS Identity and Access Management (IAM) user or role has sufficient permissions for using Data Pipeline.
Important: The source and destination buckets don't need to be in the same Region. The S3 buckets also don't need to be in the same Region as the pipeline. However, because data transfers between Regions incur cost, make sure to review data transfer pricing for Amazon S3.
2. Open the AWS Data Pipeline console.
3. Choose Create Pipeline.
4. For Name, enter a name for the pipeline.
5. For Source, select Build using a template. Then, select Run AWS CLI command.
6. For AWS CLI command, to set up a copy operation, enter the following command:
aws s3 cp s3://source-AWSDOC-EXAMPLE-BUCKET1 s3://destination-AWSDOC-EXAMPLE-BUCKET2 --recursive
Note: The copy command overwrites any objects in the destination bucket that have the same key name as objects in the source bucket.
To set up a sync operation, enter the following command:
aws s3 sync s3://source-AWSDOC-EXAMPLE-BUCKET1 s3://destination-AWSDOC-EXAMPLE-BUCKET2
Note: The sync command compares the source and destination buckets, and then transfers only the difference.
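Both commands accept additional options that are useful before committing to a full transfer. The following is a sketch, assuming the AWS CLI is installed and credentials are configured; the bucket names are the placeholders from the commands above, and the `logs/` prefix is a hypothetical example:

```shell
# Preview what a sync would transfer without copying anything (--dryrun lists the operations).
aws s3 sync s3://source-AWSDOC-EXAMPLE-BUCKET1 s3://destination-AWSDOC-EXAMPLE-BUCKET2 --dryrun

# Copy only a single prefix recursively (hypothetical prefix "logs/").
aws s3 cp s3://source-AWSDOC-EXAMPLE-BUCKET1/logs/ s3://destination-AWSDOC-EXAMPLE-BUCKET2/logs/ --recursive
```

Running a `--dryrun` first is a low-cost way to confirm the command touches the objects you expect, which matters when the copy overwrites matching keys in the destination.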
7. For Run, select on pipeline activation for a one-time copy or sync job. Or, select on a schedule for a scheduled copy or sync, and then complete the Run every, Starting, and Ending fields based on your use case.
8. For Logging, you can select Enabled, and then enter an S3 location for logs. Or, if you don't want logs, you can select Disabled.
9. For IAM roles, you can select either the Default role or a Custom role. The default role has Amazon S3 permissions for s3:CreateBucket, s3:DeleteObject, s3:Get*, s3:List*, and s3:Put*.
Note: If the buckets use default encryption with AWS Key Management Service (AWS KMS), then make sure that the Data Pipeline role is granted permission to use the AWS KMS key.
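One way to grant that access is an inline IAM policy on the pipeline's resource role. The following is a sketch, assuming the default resource role name DataPipelineDefaultResourceRole and a hypothetical key ARN; substitute your own values:

```shell
# Attach an inline policy that lets the Data Pipeline resource role decrypt source
# objects and generate data keys for writing encrypted destination objects.
# The role name and key ARN below are assumptions; replace them with your own.
aws iam put-role-policy \
  --role-name DataPipelineDefaultResourceRole \
  --policy-name kms-key-access \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
      "Resource": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"
    }]
  }'
```

If the key belongs to a different AWS account, the key policy on that key must also allow the role to use it.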
10. Choose Activate.
Note: You can optionally optimize performance by creating a separate pipeline for each root-level prefix in your bucket, so that the prefixes transfer in parallel.
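The per-prefix split can be sketched as a small shell loop. The prefixes below are hypothetical, and the loop only echoes each command as a preview; drop the `echo` to run the transfers, or use each printed command as the AWS CLI command for its own pipeline:

```shell
# One sync command per root-level prefix (hypothetical prefixes below).
SRC="s3://source-AWSDOC-EXAMPLE-BUCKET1"
DST="s3://destination-AWSDOC-EXAMPLE-BUCKET2"

for prefix in logs/ images/ reports/; do
  # Print the per-prefix command; remove "echo" to actually execute it.
  echo aws s3 sync "$SRC/$prefix" "$DST/$prefix"
done
```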