A step-by-step guide to synchronize data between Amazon S3 buckets
The need for data synchronization in Amazon S3 comes up in a number of scenarios for customers – enabling a new geographic region for end users, migrating data between AWS accounts, or creating additional copies of data for disaster recovery (DR).
In this post, we walk through the options available to S3 customers for migrating or synchronizing data, and provide guidance on which is the best choice for different scenarios. We start with S3 Replication, a simple-to-configure S3 feature that will replicate all newly written objects, as they are written. Next, we go into more advanced techniques for migrating data, which might be useful if you need to re-drive replication tasks.
Section 1: Replicating new objects between S3 buckets
S3 Replication enables automatic, asynchronous copying of objects across Amazon S3 buckets. Buckets configured for object replication can be owned by the same or different AWS accounts and can be in the same or different AWS Regions. S3 Replication can be used to copy new objects between two or more S3 buckets, and can be additionally enabled to copy existing objects. To enable S3 Replication, refer to the S3 Replication User Guide.
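As a rough illustration of what a replication rule looks like, the following sketch builds a minimal configuration for replicating all new objects. The role ARN, bucket ARN, and rule ID are hypothetical placeholders, and the structure follows the shape that the boto3 `put_bucket_replication` call expects; treat it as a starting point rather than a complete setup.

```python
def build_replication_config(role_arn, destination_bucket_arn,
                             rule_id="replicate-all",
                             replicate_delete_markers=False):
    """Return a replication configuration with one bucket-wide rule."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": rule_id,
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {},  # empty filter: the rule applies to every object
                "DeleteMarkerReplication": {
                    "Status": "Enabled" if replicate_delete_markers else "Disabled"
                },
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


# Hypothetical ARNs, for illustration only.
config = build_replication_config(
    "arn:aws:iam::111122223333:role/example-replication-role",
    "arn:aws:s3:::example-destination-bucket",
)
```

With boto3 installed, this dictionary could be passed as `ReplicationConfiguration` to `put_bucket_replication` on the source bucket. Both buckets must have S3 Versioning enabled first.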
Estimating cost of S3 Replication
For S3 Replication (Cross-Region Replication and Same-Region Replication), you pay for replication request charges, storage charges for selected destination, and applicable infrequent access storage retrieval fees. For Cross-Region Replication (CRR), you also pay for inter-Region Data Transfer OUT from S3 to each destination Region. Additionally, when you use S3 Replication Time Control (S3 RTC), you pay a Replication Time Control Data Transfer fee.
For example, let’s assume you want to replicate data from US-west to US-east with CRR. Your bucket consists of 4-MB objects and the data totals 5 TB. You have 1.3 million objects in the bucket, so you will be charged replication request charges of $6.50 (request charges are $0.005 per 1,000 requests). Since you are replicating data between two AWS Regions, you will also be charged a data transfer fee of $102.40 (5 TB is 5,120 GB, and the data transfer fee is $0.02/GB).
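The arithmetic in this example can be reproduced with a short sketch. The function name is ours, and the default prices are the example rates quoted above, which may not match current pricing.

```python
def estimate_crr_cost(object_count, total_gb,
                      request_price_per_1000=0.005,
                      transfer_price_per_gb=0.02):
    """Return (request_cost, transfer_cost) in USD for a cross-Region replication."""
    request_cost = object_count / 1000 * request_price_per_1000
    transfer_cost = total_gb * transfer_price_per_gb
    return round(request_cost, 2), round(transfer_cost, 2)


# 1.3 million objects, 5 TB expressed as 5 * 1024 GB
request_cost, transfer_cost = estimate_crr_cost(1_300_000, 5 * 1024)
print(f"requests: ${request_cost:.2f}, transfer: ${transfer_cost:.2f}")
# prints "requests: $6.50, transfer: $102.40"
```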
If your objects are encrypted with SSE-KMS, you will be charged for two KMS operations (decrypt at source and re-encrypt in destination) at a rate of $0.03 per 10,000 requests (more details on KMS charges here). For objects encrypted with SSE-KMS, consider enabling S3 Bucket Keys. S3 Bucket Keys can reduce AWS KMS request costs by up to 99 percent by decreasing the request traffic from Amazon S3 to AWS KMS. There are no additional charges for replicating data cross-account. For updated pricing on S3 Replication, please refer to the pricing FAQs under Replication here and the S3 pricing page.
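To get a feel for the KMS portion, here is a small sketch using the rate quoted above. It assumes two KMS operations per replicated object and treats the 99 percent S3 Bucket Keys reduction as a best case, not a guarantee.

```python
def estimate_kms_request_cost(object_count, operations_per_object=2,
                              price_per_10000=0.03):
    """Return the KMS request cost in USD for replicating SSE-KMS objects."""
    return object_count * operations_per_object / 10_000 * price_per_10000


kms_cost = estimate_kms_request_cost(1_300_000)
# Best case with S3 Bucket Keys, assuming the full "up to 99 percent" reduction.
kms_cost_with_bucket_keys = kms_cost * 0.01
print(f"without Bucket Keys: ${kms_cost:.2f}")
print(f"best case with Bucket Keys: ${kms_cost_with_bucket_keys:.3f}")
```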
Section 2: Replicating existing objects between S3 buckets
If your bucket has new and existing data to be replicated, it’s best to configure existing object replication. You can enable existing object replication by contacting AWS Support. Once support for replication of existing objects has been enabled for the AWS account, you will be able to use S3 Replication for existing objects, in addition to newly uploaded objects. Existing object replication is an extension of the existing S3 Replication feature and includes all the same functionality. This includes the ability to replicate objects while retaining metadata (such as object creation date and time), replicate objects into different storage classes, and maintain object copies under different ownership. Existing object replication has a different timeline than new object replication, and can take some time. Time estimates will differ based on size of data to be replicated, object count, if objects are encrypted, and Region pairs for data transfer. Refer to this blog post for more details on existing object replication.
Note that existing object replication will not replicate objects that were previously replicated to or from another bucket, or have previously failed replication. If your bucket contains these types of objects, please refer to Section 3 for guidance.
Estimating cost of existing object replication
The cost to replicate existing objects is the same as replication of new objects, as explained at the bottom of Section 1.
Section 3: Compare and copy objects between S3 buckets
If your bucket has objects that have previously been replicated, or failed, and that need to be copied to the destination bucket, you will first want to identify the objects that need to be copied. In this section, we show you how to identify the objects by comparing the source and destination buckets with S3 Inventory, and then how to copy those objects using S3 Batch Operations.
Please note that copying objects is different from replication in that copying does not maintain the object metadata like version ID and creation timestamp. Later in the post, we provide a few options for performing the object copy such as a copy in place, or copy directly to the destination bucket.
Step 1: Compare two Amazon S3 buckets
To get started, we first compare the objects in the source and destination buckets to find the list of objects that you want to copy.
Step 1a. Generate S3 Inventory for S3 buckets
Configure Amazon S3 Inventory to generate a daily report on both buckets. Please refer to this User Guide for step-by-step instructions on how to configure S3 Inventory reports. The first daily report can take up to 48 hours to generate. Make sure to keep the following list of parameters selected:
- Inventory scope: Include all versions (assuming you have S3 Versioning enabled since you have S3 Replication enabled on these buckets)
- Destination bucket: Provide S3 bucket to store Inventory report
- Frequency: Daily (first Inventory report generation may take up to 48 hours).
- Output format: Apache Parquet
- Additional fields – select Replication status. If necessary, you can also select other parameters such as Size, Last modified, ETag, etc.
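The parameters above can also be expressed as a configuration document. The following is a sketch that builds one in the shape expected by the boto3 `put_bucket_inventory_configuration` call; the configuration ID, report bucket ARN, and prefix are hypothetical.

```python
def build_inventory_config(config_id, report_bucket_arn, prefix="inventory"):
    """Return a daily, all-versions, Parquet inventory configuration."""
    return {
        "Id": config_id,
        "IsEnabled": True,
        "IncludedObjectVersions": "All",     # Inventory scope: all versions
        "Schedule": {"Frequency": "Daily"},  # Frequency: Daily
        "OptionalFields": [
            "ReplicationStatus",             # required for the queries that follow
            "Size",
            "LastModifiedDate",
            "ETag",
        ],
        "Destination": {
            "S3BucketDestination": {
                "Bucket": report_bucket_arn,  # bucket that stores the report
                "Format": "Parquet",          # Output format: Apache Parquet
                "Prefix": prefix,
            }
        },
    }


# Hypothetical names, for illustration only.
inventory_config = build_inventory_config(
    "daily-all-versions", "arn:aws:s3:::example-inventory-reports"
)
```

With boto3, this dictionary could be passed as `InventoryConfiguration`, together with the same `Id`, once for the source bucket and once for the destination bucket.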
Once S3 Inventory has generated reports for both source and destination buckets, you can use Amazon Athena to query them.
Step 1b. Query S3 Inventory using Amazon Athena
You can create tables for source and destination S3 Inventory reports in Amazon Athena as described in our public documentation here. Briefly, let’s understand the table schema, and then review a few example queries.
- bucket (string): The name of the S3 bucket.
- key (string): Object key name.
- version_id (string): The version ID assigned to the object if versioning was enabled. If there is no value, then it may be because you did not set up versioning when you created the object.
- is_latest (boolean): If there are multiple versions of the object, then it will show as ‘true’ if the selected record is the latest version, else ‘false.’
- is_delete_marker (boolean): Set to ‘true’ if this version of the object is a delete marker, else ‘false.’
- replication_status (string): Object replication status if you set up S3 Replication. If there is no value, then it may be because you created the object before you had set up S3 Replication or while it was suspended. Refer to this documentation to understand S3 Replication status information.
Next are two of the most commonly used queries to identify differences between source and destination buckets using S3 inventory tables:
Example query 1: Get a count of objects that exist in source bucket but not in destination to give us an estimate of how much data needs to be replicated to synchronize destination with source. In this query, we compare source and destination buckets (tables) based on object key (column key of table).
SELECT COUNT(*) FROM your_s3source_bucket_table st LEFT JOIN your_s3destination_bucket_table dt ON dt.key = st.key WHERE dt.key IS NULL;
Example query 2: Get a list of object key and version IDs that exist in the source bucket but not in the destination.
SELECT st.* FROM your_s3source_bucket_table st LEFT JOIN your_s3destination_bucket_table dt ON dt.key = st.key AND dt.version_id = st.version_id WHERE dt.key IS NULL;
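The logic of the two queries can be illustrated on a handful of in-memory rows. This is only a sketch of the comparison, with made-up keys and version IDs; it assumes every row has a non-null version ID, which is worth noting because SQL equality never matches NULL values while Python treats `None == None` as true.

```python
# Each row is (key, version_id), as it would appear in an Inventory table.
source_rows = [("a.txt", "v1"), ("a.txt", "v2"), ("b.txt", "v1"), ("c.txt", "v1")]
destination_rows = [("a.txt", "v1"), ("b.txt", "v9")]


def count_rows_with_missing_key(source, destination):
    """Query 1: count source rows whose key is absent from the destination."""
    destination_keys = {key for key, _ in destination}
    return sum(1 for key, _ in source if key not in destination_keys)


def rows_with_missing_version(source, destination):
    """Query 2: source rows whose (key, version_id) pair is absent."""
    destination_pairs = set(destination)
    return [row for row in source if row not in destination_pairs]


missing_count = count_rows_with_missing_key(source_rows, destination_rows)
missing_versions = rows_with_missing_version(source_rows, destination_rows)
```

Here only `c.txt` is missing by key, while comparing by key and version ID also surfaces `a.txt` version `v2` and `b.txt` version `v1`.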
You can also query S3 Inventory to address a few other scenarios, such as whether objects exist in the destination but not in the source bucket, or to identify only the existing objects in the source. The replication_status of the existing objects will be empty or null. Note that there are a few different reasons why the source and destination Inventory reports may look different. For example, you may have different S3 Lifecycle policies on each bucket. You may also see differences if S3 Inventory was triggered at different times or if S3 Replication is still in progress in the source bucket.
If your source and destination buckets have 100M+ objects, then the S3 Inventory reports will be fairly large. In that case, you can load S3 Inventory into an Amazon Redshift database and run all of the preceding queries (from the Athena example above) against the datasets loaded in the Amazon Redshift cluster.
Estimating cost of generating S3 Inventory reports
To generate S3 Inventory reports, you will be charged $0.0025 per 1 million objects listed, plus the storage cost of the Inventory report in the bucket of your choice. A new Inventory report is published to your S3 bucket every 24 hours, and you pay storage for these reports like all other objects in your bucket. For example, if your Inventory report lists 10 million objects and the report itself is 100 MB, you will be charged $0.025 for generating each daily report ($0.0025 per million objects listed x 10) and about $0.0022 per month for storing each report ($0.023 per GB-month for S3 Standard x 100 MB). For updated pricing, please refer to this page.
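The Inventory example works out as follows. This sketch assumes the $0.023/GB figure is the S3 Standard rate per GB-month and converts the report size from MB to GB by dividing by 1,024; the function name is ours.

```python
def estimate_inventory_cost(objects_listed, report_mb,
                            list_price_per_million=0.0025,
                            storage_price_per_gb_month=0.023):
    """Return (listing_cost, monthly_storage_cost) in USD for one report."""
    listing_cost = objects_listed / 1_000_000 * list_price_per_million
    storage_cost = report_mb / 1024 * storage_price_per_gb_month
    return listing_cost, storage_cost


listing_cost, storage_cost = estimate_inventory_cost(10_000_000, 100)
print(f"listing: ${listing_cost:.4f}, storage: ${storage_cost:.4f}")
```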
Step 2: Copy objects between S3 buckets
Once you have identified the objects that need to be copied between S3 buckets, the next step is to prepare the copy job and initiate it. We provide two options: S3 Batch Operations and S3DistCp (only required if you have objects larger than 5 GB, because those are not supported by S3 Batch Operations).
Copy objects using S3 Batch Operations
You can use S3 Batch Operations to create a copy job to copy objects identified in a manifest file. Before you start setting up the Batch Operations job, make sure that S3 Replication is configured and verify that it is working. To do so, verify that newly written objects are being replicated. There are two ways to do the actual copy operation.
Our recommended option is to copy the objects in place by setting up an S3 Batch Operations job that overwrites the objects in the source bucket (instead of the destination bucket). If you have S3 Versioning enabled in your source bucket, the copy in place operation creates a new version of each object in the source bucket. This automatically triggers S3 Replication of these new versions from source to destination.
When you do the copy in place, we recommend that you stop writes into the source bucket while the copy operation is in progress, to avoid version conflicts in the destination bucket. This is especially useful if your application might overwrite existing objects while the copy in place operation is in progress. We prefer the copy in place option over copying the objects directly to the destination bucket, because if you apply new mutations (change object tags, ACLs, or other metadata) to the copy in the source bucket, those will continue to replicate to the destination with S3 Replication. The copy in place option can be more expensive than copying objects directly to the destination bucket because of the additional storage cost of maintaining more than one version of the object. There are, however, ways to optimize cost, for example, by using S3 Lifecycle in the source bucket to expire noncurrent versions of the objects.
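As one way to approach the Lifecycle optimization, the following sketch builds a rule that expires noncurrent versions after a chosen number of days. The rule ID and the 30-day value are hypothetical, and the structure follows the shape expected by the boto3 `put_bucket_lifecycle_configuration` call.

```python
def build_noncurrent_expiration_config(noncurrent_days,
                                       rule_id="expire-noncurrent-versions"):
    """Return a lifecycle configuration that expires noncurrent versions."""
    if noncurrent_days < 1:
        raise ValueError("noncurrent_days must be a positive number of days")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: apply to the whole bucket
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
            }
        ]
    }


# Hypothetical retention period, for illustration only.
lifecycle_config = build_noncurrent_expiration_config(30)
```

Choose a retention period long enough that the copy in place job and the resulting replication have finished before the older versions expire.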
Another option with S3 Batch Operations is to copy the identified objects directly to the destination bucket using the manifest file. This will be cheaper than doing a copy in place, but the object copy in the destination bucket will not be a managed resource like an object replicated via S3 Replication. Therefore, subsequent metadata updates to the source object will not be copied to the destination object copies.
Once the copy operations complete, you can restart the writes to your source bucket. Remember to keep S3 Replication enabled on the source bucket to ensure new objects landing in the source bucket get replicated to the destination buckets automatically.
Generate manifest for S3 Batch Operations copy job
To perform the S3 Batch Operations job, you will need to generate a manifest file to submit the job. You can use Amazon Athena or Amazon Redshift queries to generate a manifest file. The following sample query will generate a manifest to drive a Batch Operations job in a comma-separated values (CSV) format. You can add additional criteria in the below query by adding to the WHERE clause, for example, filter for
SELECT '("' || t1.bucket || '","' || t1.key || '","' || NULLIF (t1.version_id,null) || '")' result FROM your_s3source_bucket_table t1 LEFT JOIN your_s3destination_bucket_table t2 ON t2.key = t1.key WHERE t2.key IS NULL
Make sure that the manifest file does not include the header information, which the Athena or Amazon Redshift query results could have generated. You can use the sed Linux command or a PowerShell script on Windows to remove the headers. Another tip to keep in mind is to truncate your first manifest to a few objects and submit a similar Batch Operations job. The benefit of a short manifest is that the job will complete very quickly. You can use this first job as a dry run to verify all of your settings and options before submitting the job for all your data.
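If you prefer a script over sed or PowerShell, both steps can be sketched in a few lines of Python. The header value `result` and the sample rows below are assumptions for illustration; check the first line of your own query output before relying on this.

```python
def strip_header(lines, header):
    """Drop the first line if it equals the header (ignoring quotes and case)."""
    if lines and lines[0].strip().strip('"').lower() == header.lower():
        return lines[1:]
    return lines


def dry_run_manifest(lines, sample_size=5):
    """Return only the first few manifest lines for a quick test job."""
    return lines[:sample_size]


# Hypothetical query output with a header row.
raw_lines = [
    '"result"',
    "example-source-bucket,photos/a.jpg,v1",
    "example-source-bucket,photos/b.jpg,v2",
    "example-source-bucket,photos/c.jpg,v3",
]
manifest_lines = strip_header(raw_lines, "result")
sample_lines = dry_run_manifest(manifest_lines, sample_size=2)
```

Writing `sample_lines` to a separate file gives you a short manifest for the dry run, while `manifest_lines` drives the full job.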
Tip for maintaining version stack with S3 Batch Operations copy
Objects are not necessarily copied in the same order as they appear in the manifest. For versioned buckets, if it is important that the order of current and noncurrent versions in your destination bucket matches the source, we recommend that you schedule two S3 Batch Operations copy jobs. In the first job, copy all noncurrent versions. Then schedule a subsequent Batch Operations job to copy the current versions of those objects. For additional details on how to configure S3 Batch Operations copy, refer to the documentation.
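One way to prepare the two manifests is to split the Inventory rows on the is_latest column. The sketch below uses made-up rows with the column names from the table schema described earlier; the function name is ours.

```python
def split_by_version_state(rows):
    """Return (noncurrent, current) manifest entries as (bucket, key, version_id)."""
    noncurrent, current = [], []
    for row in rows:
        entry = (row["bucket"], row["key"], row["version_id"])
        if row["is_latest"]:
            current.append(entry)
        else:
            noncurrent.append(entry)
    return noncurrent, current


# Hypothetical Inventory rows, for illustration only.
inventory_rows = [
    {"bucket": "example-source-bucket", "key": "a.txt", "version_id": "v1", "is_latest": False},
    {"bucket": "example-source-bucket", "key": "a.txt", "version_id": "v2", "is_latest": True},
    {"bucket": "example-source-bucket", "key": "b.txt", "version_id": "v1", "is_latest": True},
]
noncurrent_entries, current_entries = split_by_version_state(inventory_rows)
```

The first job would use `noncurrent_entries` and the second job `current_entries`, submitted only after the first has completed.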
Estimating cost of S3 Batch Operations copy
For each S3 Batch Operations job, you will be charged $0.25 per job and $1 per 1 million objects processed. We recommend that you split large jobs into multiple jobs, so you get the benefit of parallel processing. In addition to the Batch Operations job management fee, you will be charged for the actual operation, which in this case is copy. Copy, similar to replication, incurs request charges, plus inter-Region Data Transfer OUT if you want to do a cross-Region copy.
Let’s assume the same example as Section 1, where you want to copy 5 TB of data from US-west to US-east. Your bucket consists of 4-MB objects and the object count is 1.3 million. If you ran this as a single S3 Batch Operations job, you’d be charged $1.55 for job management ($0.25 per job + $1.30 for 1.3 million objects processed); splitting the work into more jobs adds $0.25 per additional job. Additionally, you’d be charged copy request charges of $0.005 per 1,000 requests (for S3 Standard storage class) and inter-Region Data Transfer OUT of $0.02/GB. For the copy operation, you’d be charged $108.90, where $6.50 is for the copy requests on 1.3 million objects and the majority, $102.40, is for data transfer of 5 TB between two Regions. For updated pricing, please refer to this page.
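The Batch Operations example can be checked with a short sketch, which also shows how the management fee changes when the work is split across several jobs. The function name is ours and the default prices are the example rates from this section.

```python
def estimate_batch_copy_cost(object_count, total_gb, job_count=1,
                             price_per_job=0.25,
                             price_per_million_objects=1.00,
                             request_price_per_1000=0.005,
                             transfer_price_per_gb=0.02):
    """Return the management and copy costs in USD for a cross-Region copy."""
    management = (job_count * price_per_job
                  + object_count / 1_000_000 * price_per_million_objects)
    requests = object_count / 1000 * request_price_per_1000
    transfer = total_gb * transfer_price_per_gb
    return {
        "management": round(management, 2),
        "copy": round(requests + transfer, 2),
    }


single_job = estimate_batch_copy_cost(1_300_000, 5 * 1024)
four_jobs = estimate_batch_copy_cost(1_300_000, 5 * 1024, job_count=4)
print(single_job)  # management 1.55, copy 108.9
print(four_jobs)   # management 2.3, copy 108.9
```

Splitting into four jobs adds only $0.75 in management fees here, so the per-job fee rarely outweighs the benefit of parallel processing.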
Copy objects using S3DistCp
If your bucket has a number of objects larger than 5 GB, we recommend that you use S3DistCp to perform the copy operation. To run S3DistCp, you will need to launch an Amazon EMR cluster and execute S3DistCp from the primary node. The number of task nodes in your Amazon EMR cluster determines the parallelism of the copy process, and therefore the copy performance. S3DistCp will also need a manifest file to copy identified objects from the source to the destination bucket (refer to the previous section on how to generate a manifest file for S3 Batch Operations).
Once you have generated a CSV manifest, make sure to compress this CSV file into .gz format and make it available in your S3 bucket for S3DistCp to reference. You can run the following S3DistCp command from an Amazon EMR cluster to start the object copy between S3 buckets:
s3-dist-cp --src s3://your-source-bucket --dest s3://your-destination-bucket --copyFromManifest --previousManifest=s3://your-s3distcp-manifest-file
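The compression step mentioned above can be done with any gzip tool. As a minimal sketch using only the Python standard library, with a hypothetical local file name in the usage comment:

```python
import gzip
import shutil


def gzip_file(source_path, destination_path=None):
    """Compress source_path into .gz format and return the new file's path."""
    destination_path = destination_path or source_path + ".gz"
    with open(source_path, "rb") as source, gzip.open(destination_path, "wb") as target:
        shutil.copyfileobj(source, target)  # stream so large manifests fit in memory
    return destination_path


# Example usage with a hypothetical local file:
# gzip_file("manifest.csv")  # writes manifest.csv.gz next to the original
```

The compressed file then needs to be uploaded to the S3 location that the command references.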
Estimating cost of S3DistCp
If you use S3DistCp to perform the copy operation, you will be charged the cost of running an EMR cluster, copy request charges of $0.005 per 1,000 requests (for S3 Standard storage class), and S3 Data Transfer OUT (for cross-Region copy operations). If you are using multipart copy for objects larger than 5 GB, you will be charged a copy request charge for each object part. For more information on how to use multipart copy, see here.
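To see how per-part charging adds up, the following sketch counts the parts for one large object and prices the part copy requests at the example rate. The 20-GiB object and 1-GiB part size are hypothetical; the limits checked are the S3 multipart limits of 5 MiB to 5 GiB per part and at most 10,000 parts.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def multipart_copy_parts(object_bytes, part_bytes):
    """Return how many parts a multipart copy needs for one object."""
    if not 5 * MIB <= part_bytes <= 5 * GIB:
        raise ValueError("part size must be between 5 MiB and 5 GiB")
    parts = math.ceil(object_bytes / part_bytes)
    if parts > 10_000:
        raise ValueError("a multipart upload allows at most 10,000 parts")
    return parts


def part_copy_request_cost(parts, request_price_per_1000=0.005):
    """Return the request cost in USD for the part copy requests only."""
    return parts / 1000 * request_price_per_1000


parts = multipart_copy_parts(20 * GIB, 1 * GIB)
cost = part_copy_request_cost(parts)
```

The per-object cost is small, but it scales with the part count, so a smaller part size means more requests per object.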
Other considerations to keep in mind
- If you want to sync your delete markers between source and destination buckets, you can easily enable or disable the replication of delete markers between source and destination buckets for each replication rule.
- S3 Replication recently launched support to replicate data to more than one destination in the same or different AWS Regions.
- Before using S3 Batch Operations, determine whether you need all versions of the source objects in the destination bucket, or only the latest version of the object, and set up your manifest accordingly.
- S3 Batch Operations and S3 PUT Copy are both limited to objects of a maximum of 5 GB. For objects larger than 5 GB, consider doing a multipart upload with MPU Copy or S3DistCp.
To avoid incurring additional cost, consider deleting the resources created in your AWS account for this migration, such as the S3 Inventory configuration and the reports it generated, S3 Batch Operations jobs, EMR clusters for S3DistCp, and Athena tables with their related S3 buckets and objects. Additionally, if you created multiple copies or versions of your objects that you want to clean up, consider enabling S3 Lifecycle on each of your buckets.
Finally, if your organization does not have an ongoing need to copy data to a different bucket or Region for data protection or compliance reasons, consider disabling S3 Replication.
In this post, we’ve shown you how to use a combination of S3 Replication, S3 Inventory, S3 Batch Operations, and S3DistCp to synchronize bucket contents. We demonstrated comparing buckets using S3 Inventory and generating a job manifest with Amazon Athena, and provided a couple of different options for copying data using S3 Batch Operations or S3DistCp. We also introduced a few features of S3 Replication that help you copy existing objects, replicate delete markers, and replicate to multiple destinations. Going forward, we recommend that you keep S3 Replication enabled between the source and destination buckets so that all writes to the source bucket are replicated to the destination bucket automatically and asynchronously.
Thanks for reading this blog post on synchronizing data between Amazon S3 buckets. If you have any comments or questions, please leave them in the comments section.