AWS Storage Blog

Synchronizing your data to Amazon S3 using AWS DataSync

There are many factors to consider when migrating data from on premises to the cloud, including speed, efficiency, network bandwidth, and cost. A common challenge many organizations face is choosing the right utility to copy large amounts of data from on premises to an Amazon S3 bucket.

I often see cases in which customers start with a free data transfer utility, or an AWS Snow Family device, to get their data into S3. Sometimes, those same customers then use AWS DataSync to capture ongoing incremental changes.

In this type of scenario, where data is first copied to S3 using one tool and incremental updates are applied using DataSync, there are a few questions to consider. How will DataSync respond when copying data to an S3 bucket that contains files written by a different data transfer utility? Will DataSync recognize that the existing files match the on-premises files? Will a second copy of the data be created in S3, or will the data need to be retransmitted?

To avoid additional time, costs, and bandwidth consumption, it is important to fully understand exactly how DataSync identifies “changed” data. DataSync uses object metadata to identify incremental changes. If the data was transferred using a utility other than DataSync, this metadata will not be present. In that case, DataSync will need to perform additional operations to properly transfer incremental changes to S3. Depending upon which storage class was used for the initial transfer, this could result in unexpected costs.

In this post, I dive deep into copying on-premises data to an S3 bucket by exploring the following three distinct scenarios, each of which will produce a unique result:

  1. Using DataSync to perform the initial copy and all incremental changes.
  2. Using DataSync to synchronize data that was written by a utility other than DataSync in which the storage class is: S3 Standard, S3 Intelligent-Tiering (Frequent Access or Infrequent Access tiers), S3 Standard-IA, or S3 One Zone-IA.
  3. Using DataSync to synchronize data that was written by a utility other than DataSync in which the storage class is: S3 Glacier, S3 Glacier Deep Archive, or S3 Intelligent-Tiering (Archive Access or Deep Archive Access tiers).

After reviewing the detailed results of each scenario, you will be better prepared to decide how to use DataSync to efficiently migrate and synchronize your data to Amazon S3 without unexpected charges.

Solution overview

DataSync is an online data transfer service that simplifies, automates, and accelerates moving data between on-premises storage systems and AWS Storage services, as well as between AWS Storage services. You can use DataSync to migrate active datasets to AWS, archive data to free up on-premises storage capacity, replicate data to AWS for business continuity, or transfer data to the cloud for analysis and processing.

Let’s take a moment to review how DataSync can be used to transfer data from on premises to AWS storage services. A DataSync agent is a virtual machine (VM) that is used to read or write data from on-premises storage systems. The agent communicates with the DataSync service in the AWS Cloud, which performs the actual reading and writing of the data to AWS Storage services.

A DataSync task consists of a pair of locations between which data is transferred. When you create a task, you define both a source and a destination location. For detailed information on configuring DataSync, please visit the user guide.
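
If you prefer to define the locations and task programmatically, the following is a minimal boto3 sketch. The hostnames, paths, and ARNs are placeholders, not values from this environment.

# Minimal sketch: define an NFS source location, an S3 destination location,
# and a task that pairs them. All names and ARNs below are placeholders.
import boto3

datasync = boto3.client("datasync")

# Source: an on-premises NFS export, read through a previously activated DataSync agent.
source = datasync.create_location_nfs(
    ServerHostname="nfs.example.internal",
    Subdirectory="/export/data",
    OnPremConfig={"AgentArns": ["arn:aws:datasync:us-east-1:111122223333:agent/agent-0123456789abcdef0"]},
)

# Destination: an S3 bucket. The storage class for objects written by the task is
# set here, and any supported storage class can be chosen.
destination = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::example-destination-bucket",
    Subdirectory="/",
    S3StorageClass="STANDARD",
    S3Config={"BucketAccessRoleArn": "arn:aws:iam::111122223333:role/DataSyncS3AccessRole"},
)

# The task simply pairs the two locations.
task = datasync.create_task(
    SourceLocationArn=source["LocationArn"],
    DestinationLocationArn=destination["LocationArn"],
    Name="nfs-to-s3-sync",
)
print(task["TaskArn"])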

The following diagram shows the architecture of my testing environment. The source locations are a set of Network File System (NFS) shares, hosted by an on-premises Linux server. The target location is an S3 bucket with versioning enabled. Three DataSync tasks have been configured, one for each scenario, using the following task options (a scripted sketch of this configuration follows the diagram):

  • Verify data: Verify only the data transferred. This option calculates the checksum of transferred files and metadata on the source. It then compares this checksum to the checksum calculated on those files at the destination at the end of the transfer.
  • Transfer mode: Transfer only data that has changed. This mode copies only data and metadata that differ between the source and destination.
  • Task logging: Log level: Log all transferred objects and files. Log records are published to CloudWatch Logs for all files or objects that the task copies and integrity-checks.

Architecture diagram for the testing environment
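
These three options correspond to fields of the DataSync Options structure when the task is scripted rather than configured in the console. The sketch below assumes the task already exists; both ARNs are placeholders.

# Minimal sketch: apply the task options used in this post and start an execution.
# The task and log group ARNs are placeholders.
import boto3

datasync = boto3.client("datasync")
task_arn = "arn:aws:datasync:us-east-1:111122223333:task/task-0123456789abcdef0"

datasync.update_task(
    TaskArn=task_arn,
    CloudWatchLogGroupArn="arn:aws:logs:us-east-1:111122223333:log-group:/datasync/blog-tests:*",
    Options={
        "VerifyMode": "ONLY_FILES_TRANSFERRED",  # "Verify only the data transferred"
        "TransferMode": "CHANGED",               # "Transfer only data that has changed"
        "LogLevel": "TRANSFER",                  # log all transferred objects and files
    },
)

# Each scenario in this post is simply one execution of the corresponding task.
execution = datasync.start_task_execution(TaskArn=task_arn)
print(execution["TaskExecutionArn"])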

To dive deep into the inner workings of DataSync, I make use of three tools:

    • Amazon CloudWatch Logs, which receive the DataSync task logs described above.
    • Amazon S3 server access logs, which record the requests made against the destination bucket.
    • The Amazon S3 console, which shows the objects, their storage classes, and their versions.

A brief Amazon S3 primer

Since this blog is focused on transferring data to an S3 bucket, let us review some of the relevant topics. Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance.

S3 storage classes

S3 offers a range of storage classes designed for different access patterns at corresponding rates.

    • S3 Standard for general-purpose storage of frequently accessed data.
    • S3 Intelligent-Tiering for data with unknown or changing access patterns. This storage class automatically transitions data between four access tiers based on data access: Frequent Access, Infrequent Access, Archive Access, and Deep Archive Access.
    • S3 Standard-Infrequent Access and S3 One Zone-Infrequent Access for long-lived, but less frequently accessed data.
    • S3 Glacier and S3 Glacier Deep Archive for long-term archive and digital preservation of data with asynchronous access.

DataSync allows you to write directly to any S3 storage class.

S3 metadata

When copying file data to Amazon S3, DataSync automatically copies each file as a single S3 object in a 1:1 relationship, and preserves the source file’s POSIX metadata as Amazon S3 object metadata. If a DataSync task is configured to “transfer only data that has changed,” DataSync compares the file metadata with the metadata of the corresponding S3 object to determine whether the file needs to be transferred. The following image shows an example of S3 object metadata written by DataSync.

An example of S3 object metadata that was written by DataSync
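
To inspect this metadata yourself, a HeadObject request returns an object’s user-defined metadata without downloading the object. The following is a minimal sketch; the bucket and key names are placeholders, and the metadata keys you see will be whatever DataSync wrote for your transfer.

# Minimal sketch: read back the user-defined metadata on an object written by DataSync.
# Bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")

response = s3.head_object(Bucket="example-destination-bucket", Key="TestFile1")

# User-defined metadata (the x-amz-meta-* headers) is returned as a dict.
# For a DataSync-written object, this is where the POSIX attributes are kept.
for key, value in response["Metadata"].items():
    print(f"{key}: {value}")

# The storage class header is omitted for S3 Standard objects.
print("Storage class:", response.get("StorageClass", "STANDARD"))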

S3 API Operations

S3 supports the REST API standard. This allows for programmatic access to S3 buckets and objects. In this blog, I will focus on the following actions:

    • HeadObject: retrieves metadata from an object without returning the object itself.
    • PutObject: adds an object to a bucket.
    • CopyObject: creates a copy of an object that is already stored in Amazon S3.
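
For reference, here is how these three actions look in a minimal boto3 sketch; the bucket and key names are placeholders.

# Minimal sketch of the three S3 actions referenced in this post.
# Bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3")
bucket = "example-destination-bucket"

# PutObject: upload an object to the bucket.
s3.put_object(Bucket=bucket, Key="TestFile1", Body=b"example contents")

# HeadObject: read the object's metadata without downloading the body.
head = s3.head_object(Bucket=bucket, Key="TestFile1")
print(head["ContentLength"], head["Metadata"])

# CopyObject: create a copy of an object that is already stored in S3.
s3.copy_object(
    Bucket=bucket,
    Key="TestFile1-copy",
    CopySource={"Bucket": bucket, "Key": "TestFile1"},
)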

S3 Versioning

Amazon S3 Versioning is a means of keeping multiple variants of an object in the same bucket. You can use the S3 Versioning feature to preserve, retrieve, and restore every version of every object stored in your buckets. The impact of versioning should be carefully considered when using DataSync to transmit data to an S3 bucket.

    • If DataSync is being used for an ongoing backup of your on-premises storage, versioning may be desirable. This would allow you to restore both current and prior versions of objects.
    • If DataSync is being used as a migration tool, you may want to disable versioning. As the following scenarios show, synchronizing to a bucket with existing data can result in two copies of the data being stored in S3 (a sketch for checking and suspending versioning follows this list).
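
The following is a minimal sketch for checking the destination bucket’s versioning state and suspending it for a one-time migration; the bucket name is a placeholder.

# Minimal sketch: check the versioning state of the destination bucket and,
# for a one-time migration, suspend it. The bucket name is a placeholder.
import boto3

s3 = boto3.client("s3")
bucket = "example-destination-bucket"

status = s3.get_bucket_versioning(Bucket=bucket).get("Status", "Disabled")
print("Versioning status:", status)  # "Enabled", "Suspended", or "Disabled" if never enabled

if status == "Enabled":
    # Suspending keeps existing versions but stops new versions from being created.
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Suspended"},
    )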

Walkthrough scenarios

Scenario 1: Using DataSync to perform the initial synchronization of data to an S3 bucket.

In this scenario, I demonstrate transferring files from an on-premises NFS share to an empty S3 bucket. This scenario sets our baseline to understand DataSync’s behavior as related to scenarios 2 and 3. As DataSync is being used for the initial synchronization, the S3 storage class has no impact on the behavior.

  1. The on-premises NFS share contains two files: “TestFile1” and “TestFile2.” The S3 bucket is empty. A DataSync task is executed to transfer the NFS files to S3. Upon completion of this task, the following is observed:
    • CloudWatch Logs for the DataSync task confirm that the two files were successfully transferred and verified:

CloudWatch Logs for the DataSync task confirm that the two files were successfully transferred and verified

    • S3 server access logs show that DataSync performed a “PUT” operation on the two files.

    • The S3 console shows the two objects in the bucket:

Two objects in the S3 bucket

  2. Two files are added to the NFS share (“TestFile3” and “TestFile4”). The DataSync task is executed a second time. Upon completion of the task, the following is observed:
    • CloudWatch Logs for the DataSync task show that only the two new files were transferred and verified:

CloudWatch Logs for the DataSync task show that only the two new files were transferred and verified

    • S3 server access logs show that DataSync performed the following operations:
      • On the existing objects, a “HEAD” operation was used to read the object metadata. Since the metadata was identical to the source, DataSync did not need to perform any additional operations.
      • On the two new files, a “PUT” operation occurred.

S3 server access logs show that DataSync performed the following operations on existing and new objects

    • The S3 console confirms that the four objects have been transferred to the S3 bucket. Note that there is only one version of each (a sketch for listing versions follows this scenario):

The S3 console confirms that the four objects have been transferred to the S3 bucket. Note that there is only one version of each

  3. No new files are added to the NFS share. The DataSync task is executed a third time. Upon completion of the task, the following is observed:
    • CloudWatch Logs show that no files were transferred:

CloudWatch Logs show that no files were transferred (no new files were added)

    • S3 server access logs show that a “HEAD” operation occurred on each object. Since the S3 metadata matched the source, no further operations were required.

Since the S3 metadata matched the source, no further operations were required
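
The single-version observation above can also be confirmed programmatically. The following is a minimal sketch that counts object versions per key; the bucket name is a placeholder.

# Minimal sketch: count the versions of each object in the destination bucket.
# The bucket name is a placeholder.
import boto3
from collections import Counter

s3 = boto3.client("s3")
bucket = "example-destination-bucket"

versions_per_key = Counter()
paginator = s3.get_paginator("list_object_versions")
for page in paginator.paginate(Bucket=bucket):
    for version in page.get("Versions", []):
        versions_per_key[version["Key"]] += 1

for key, count in sorted(versions_per_key.items()):
    print(f"{key}: {count} version(s)")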

Scenario 2: Using DataSync to synchronize data with existing S3 data in which the storage class is Standard, S3 Intelligent-Tiering (Frequent or Infrequent Access tiers), S3 Standard-IA, or S3 One Zone-IA.

Unlike the first scenario, the S3 bucket is not empty. It contains two objects, which are the same data as the two files in the on-premises file share. However, the files were uploaded to S3 using a utility other than DataSync, and therefore do not have the POSIX metadata.

  1. The on-premises NFS share contains two files: “TestFile1” and “TestFile2.” The S3 bucket contains copies of the same data, in the Standard storage class.

The on-premises NFS share contains two files, “TestFile1” and “TestFile2,” and the S3 bucket contains copies of the same data in the Standard storage class

  2. Two additional files are added to the NFS share (“TestFile3” and “TestFile4”). The DataSync task is executed for the first time. Upon completion, the following is observed:
    • CloudWatch Logs indicate that only the new files were transferred, but all four were verified:

CloudWatch Logs indicate that only the new files were transferred, but all four were verified

    • S3 server access logs show an interesting pattern:
      • For the two existing files, DataSync performed the following operations:
        • A “HEAD” operation to read the object’s metadata. Since these objects were transferred to S3 using something other than DataSync, they lacked the POSIX metadata.
        • A “REST.GET.OBJECT” operation reads the object in order to calculate a checksum, to verify that this object matches the source file.
        • A “REST.COPY.OBJECT” operation created a new object, which includes the POSIX metadata. This is essentially an “in-place copy” of the object and did not result in data being retransmitted over the wire (see the sketch after this step).
      • For the two new files, a “PUT” operation occurred.

For the two new files, a 'PUT' operation occurred, with several operations performed on the two existing files

    • The S3 console shows that there are two versions of the previously transmitted files, and only one version of the new files:

The S3 console shows that there are two versions of the previously transmitted files, and only one version of the new files
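
Conceptually, that in-place copy resembles an S3 CopyObject request in which the object is copied onto itself with replacement metadata. The sketch below is illustrative only; DataSync performs the equivalent operation internally, and the bucket, key, and metadata values shown are placeholders.

# Conceptual sketch: an "in-place copy" that rewrites an object's user-defined
# metadata without retransmitting the data over the wire. Bucket, key, and
# metadata values are placeholders; DataSync uses its own metadata keys.
import boto3

s3 = boto3.client("s3")
bucket = "example-destination-bucket"
key = "TestFile1"

s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key},  # copy the object onto itself
    MetadataDirective="REPLACE",                # required when replacing metadata on a self-copy
    Metadata={"example-attribute": "example-value"},
)
# With versioning enabled, the copy becomes a new current version of the object,
# which is why the console shows two versions of the previously transferred files.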

  3. No new files are added to the NFS share. The DataSync task is executed a second time. Upon completion of the task, the following is observed:
    • CloudWatch Logs show that no files were transferred:

CloudWatch Logs show that no files were transferred

    • S3 server access logs show that only a “HEAD” operation occurred on each object. Since the S3 metadata matched the source, no further operations were required.

Only a 'HEAD' operation occurred on each object. Since the S3 metadata matched the source, no further operations were required

Scenario 3: Using DataSync to synchronize data with existing S3 data in which the storage class is Amazon S3 Glacier, S3 Glacier Deep Archive, or S3 Intelligent-Tiering (Archive or Deep Archive tiers).

This is similar to scenario 2, in which the S3 bucket is not empty. It contains two objects, which are the same data as the two files in the on-premises file share. The files were uploaded to S3 using a utility other than DataSync, and therefore do not have the POSIX metadata. However, unlike the prior scenario, the objects reside in a storage class that is intended for long-term archiving.

  1. The on-premises NFS share contains two files: “TestFile1” and “TestFile2.” The S3 bucket contains copies of the same data, in the Amazon S3 Glacier storage class.

The on-premises NFS share contains two files, “TestFile1” and “TestFile2,” and the S3 bucket contains copies of the same data in the Amazon S3 Glacier storage class

  2. Two additional files are added to the NFS share (“TestFile3” and “TestFile4”). The DataSync task is executed for the first time. Upon completion, the following is observed:
    • CloudWatch Logs indicate that both the existing files and new files were transferred and verified:

CloudWatch Logs indicate that both the existing files and new files were transferred and verified

    • S3 server access logs show a different pattern from the prior test:
      • For the two existing files, DataSync performed the following operations:
        • A “HEAD” operation to read the object’s metadata. Since the objects were transferred to S3 outside of DataSync, they lacked the POSIX metadata.
        • Because the objects are in an Archive storage class, DataSync cannot perform a “GET” operation to read the object and compute a checksum. Therefore, a new copy of the file had to be transmitted over the wire using a “PUT” operation.
      • For the two new files, a “PUT” operation occurred.

For the two new files, a 'PUT' operation occurred

    • The S3 console shows that there are two versions of the previously transmitted files, and only one version of the new files:

The S3 console shows that there are two versions of the previously transmitted files, and only one version of the new files

  3. No new files are added to the NFS share. The DataSync task is executed a second time. Upon completion of the task, the following is observed:
    • CloudWatch Logs show that no files were transferred:

CloudWatch Logs show that no files were transferred

    • S3 server access logs show that only a “HEAD” operation occurred on each object. Since the S3 metadata matched the source, no further operations were required. It is important to note that S3 metadata can be read even when an object resides in an Archive storage class (illustrated in the sketch after this scenario).

Only a “HEAD” operation occurred on each object. Since the S3 metadata matched the source, no further operations were required
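
That distinction, readable metadata but an unreadable object body, can be reproduced directly with the S3 API. The following is a minimal sketch, assuming an archived object that has not been restored; the bucket and key names are placeholders.

# Minimal sketch: for an archived object that has not been restored,
# HeadObject succeeds (metadata is readable) but GetObject is rejected.
# Bucket and key names are placeholders.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket = "example-destination-bucket"
key = "TestFile1"

head = s3.head_object(Bucket=bucket, Key=key)
print("Storage class:", head.get("StorageClass"))  # e.g. "GLACIER"
print("Metadata:", head["Metadata"])               # still readable

try:
    s3.get_object(Bucket=bucket, Key=key)
except ClientError as error:
    # An unrestored archived object returns an InvalidObjectState error.
    print("GET failed:", error.response["Error"]["Code"])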

Conclusion

In this blog, I walked you through a deep dive into how AWS DataSync operates when it copies data to an empty S3 bucket, compared to S3 buckets with existing data. The takeaways are:

In scenario 1, DataSync performed the initial copy from on premises to an empty S3 bucket. DataSync was able to write the files and the required metadata in one step. This is a very efficient process that makes excellent use of network bandwidth, can save time, and may reduce overall costs. Most importantly, it has the advantage of working with any Amazon S3 storage class.

Scenario 2 demonstrated synchronizing data with previously transmitted S3 data in which the S3 storage class is: Standard, S3 Intelligent-Tiering (Frequent or Infrequent Access tiers), S3 Standard-IA, or S3 One Zone-IA. Although the data did not need to be retransmitted over the wire, a new copy of each S3 object was created. This approach will still offer bandwidth efficiency, but will result in additional S3 request charges for the GET and COPY operations.

Scenario 3 demonstrated synchronizing data with previously transmitted S3 data in which the S3 storage class is: Amazon S3 Glacier, S3 Glacier Deep Archive, or S3 Intelligent-Tiering (Archive or Deep Archive tiers). In this approach, each file needed to be retransmitted over the wire. This represents the least-efficient experience as it results in additional S3 request charges, increased bandwidth utilization, and additional time.

Regardless of the S3 storage class, synchronizing with previously transmitted S3 data will create a duplicate copy of each object if S3 Versioning is enabled on the bucket. This results in additional S3 storage costs, as showcased in scenarios 2 and 3.

If you use DataSync for ongoing synchronization of your data, then my recommendation is to use DataSync to perform the initial transfer to S3 whenever possible. However, in cases where you have copied the data to S3 by a means other than DataSync, please ensure that the storage class is not Amazon S3 Glacier, S3 Glacier Deep Archive, or S3 Intelligent-Tiering (Archive or Deep Archive tiers). I also recommend suspending S3 Versioning on the bucket, if it is currently enabled.
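
Putting those recommendations into practice, the following sketch is a simple pre-flight check that reports archived objects and the bucket’s versioning state before the first DataSync run. The bucket name is a placeholder, and the check does not distinguish the archive tiers within S3 Intelligent-Tiering.

# Minimal sketch: pre-flight check before synchronizing into a bucket that
# already contains data. The bucket name is a placeholder.
import boto3

s3 = boto3.client("s3")
bucket = "example-destination-bucket"

# 1. Flag objects in archive storage classes, which would force a full re-transfer.
archive_classes = {"GLACIER", "DEEP_ARCHIVE"}
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        storage_class = obj.get("StorageClass", "STANDARD")
        if storage_class in archive_classes:
            print(f"Archived object (would be retransmitted): {obj['Key']} [{storage_class}]")

# 2. Report the bucket's versioning state; "Enabled" means rewritten objects
#    will keep their prior copies as additional versions.
status = s3.get_bucket_versioning(Bucket=bucket).get("Status", "Disabled")
print("Bucket versioning:", status)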

Thank you for reading about AWS DataSync and Amazon S3. If you have any questions or comments, please leave a note in the comments section.