Migrating hundreds of TB of data to Amazon S3 with AWS DataSync

This blog is co-authored by Satish Kumar of Autodesk and Sona Rajamani of AWS.

Enterprises are often faced with challenges in migrating vast amounts of data efficiently and effectively from their on-premises data storage environments to AWS. To aid and ease this migration, AWS offers offline data transfer services such as AWS Snowball, AWS Snowball Edge, and AWS Snowmobile. At re:Invent 2018, AWS launched a new service to expedite data transfer over the network, called AWS DataSync.

In this post, we show you how Autodesk successfully migrated over 700 terabytes (TB) of data from their on-premises Dell EMC Data Domain storage system to Amazon Simple Storage Service (Amazon S3). We did this swiftly and effortlessly using DataSync.

“Our petabyte scale data migration journey from on-premises to AWS was accomplished swiftly with minimal effort and was completely self-managed with AWS DataSync. This solution is a game changer!”

– Satish Kumar, Infrastructure Engineer at Autodesk Inc

Overview

To give some background, Autodesk is a leader in 3D design, engineering, and entertainment software. Autodesk makes software for people who make things. If you’ve ever driven a high-performance car, admired a towering skyscraper, used a smartphone, or watched a great film, chances are that you’ve experienced what millions of Autodesk customers are doing with their software.

Autodesk has been an AWS Partner Network Advanced Technology Partner for over 6 years, enabling many customers to leverage the benefits of cloud computing technology to design, engineer, and build products in the Manufacturing, Construction, and Entertainment industries.

Enterprises typically maintain large storage arrays on premises to provide storage for servers and applications. In the context of this post, Autodesk’s on-premises data source was a Data Domain storage array, which was a target for Oracle RMAN and SQL Server database backups. Due to compliance requirements, the database backups must be retained for a certain number of years. Over time, the Data Domain storage system had accumulated over 2.4 petabytes (PB) of de-duplicated and compressed data. In later sections of this post, we talk about how Autodesk scaled this dataset down to 700 TB.

Autodesk’s Infrastructure teams were tasked with managing and maintaining the Data Domain system, which spanned across multiple racks and arrays in the data center.

Autodesk IT evaluated multiple options to ease the maintenance and management of this system. One option was to upgrade the existing Data Domain infrastructure to a newer version, but that would still leave infrastructure to build and maintain. Autodesk preferred to move to a managed, scalable, and reliable solution.

Autodesk decided to use Amazon S3 because of the low cost, pay-as-you-go model, high durability, and availability. It also has lifecycle management capabilities for long-term archival storage to Amazon S3 Glacier or Amazon S3 Glacier Deep Archive. The goal was to move this dataset to S3 as soon as possible and eventually lifecycle it to Amazon S3 Glacier for long-term retention.

Migration

Migrating 2.4 PB of data came with its own set of challenges. To reduce the scale of this migration, Autodesk internally evaluated the types of database backups being stored and the current retention policies. After closer investigation and analysis, Autodesk was able to come up with a more reasonable data retention policy that was acceptable to the application owners. This exercise allowed them to cut down the “data-to-copy” from 2.4 PB to just over 700 TB. That reduction of roughly 70% (to less than a third of the original size) was possible by understanding the application needs and qualifying the data that had to be retained for compliance purposes.
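
As a quick check on the size of that reduction, the arithmetic can be sketched as follows. The inputs are the rounded figures quoted above, and we assume decimal units (1 PB = 1,000 TB), so the result is approximate.

```python
# Rough check of the dataset reduction, using the rounded figures from the text.
# Assumption: decimal units (1 PB = 1000 TB).
ORIGINAL_TB = 2.4 * 1000   # 2.4 PB expressed in TB
REDUCED_TB = 700.0         # "just over 700 TB"


def reduction_percent(original: float, reduced: float) -> float:
    """Return the percentage by which `original` shrank to reach `reduced`."""
    return (original - reduced) / original * 100.0


if __name__ == "__main__":
    pct = reduction_percent(ORIGINAL_TB, REDUCED_TB)
    factor = ORIGINAL_TB / REDUCED_TB
    print(f"Reduction: {pct:.1f}% (about {factor:.1f}x smaller)")
```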

With the revised and reduced dataset of 700 TB, a one-time bulk migration seemed viable using multiple AWS Snowballs in parallel. Because Snowballs are physical devices shipped to data centers, using them requires additional planning and resources. Working together, we estimated that this migration would need 10 Snowballs operating in parallel.

Based on Autodesk’s experience with Snowball, we decided to write scripts to automate copying the data to AWS Snowball and validating it after the transfer to S3. Together we formed a migration plan and identified a sequence of tasks for the Snowball-based migration:

Doing a proof-of-concept with both Snowball and Snowball Edge
Developing scripts for copying and validation
Testing the scripts
Getting the 10 Snowball devices
Setting them up in the data center
Establishing network connectivity
Copying, shipping, importing the data to S3
Finally, verifying data integrity

As Autodesk was preparing to execute on this plan, AWS launched DataSync. It’s a data transfer service that makes it easy for users to automate moving data between on-premises storage and S3 or Amazon EFS. DataSync automatically handles many of the tasks related to data transfers such as encryption, network optimization, and data integrity validation that can slow down migrations or burden IT operations. DataSync can transfer data at speeds up to 10 times faster than open-source tools. You pay only for the data copied. Because Autodesk has a 2-Gbps internet circuit, we were open to DataSync as a network-based transfer option.

Some of the benefits of DataSync that led to the selection of this solution include:

High data throughput
Full support with NFS
Easy-to-use console and AWS CLI management
Time savings (takes only a few minutes to set up)
Pay-per-GB transfer

DataSync was generally available at launch. Within a few weeks, Autodesk deployed and started testing the DataSync agent on-premises. Testing was successful, proving that the DataSync solution could save time, effort, and costs with its performance and high data throughput. With just one agent, we were able to saturate Autodesk’s internet circuit quickly, and realized the power of fast data transfer with DataSync. The Autodesk Network Engineering team immediately reached out to understand what we were doing to saturate the network.

After that, Autodesk IT worked together with Autodesk Network Engineering to put network bandwidth controls in place on Autodesk’s primary switches to prevent the DataSync agent traffic from inundating the shared internet circuit. Schedule-based restrictions limited the agent bandwidth to a maximum of 1.4 Gbps.

DataSync now supports setting bandwidth limits directly in the transfer task settings.

The transfer task was fast and steady, and it fully used the allocated 1.4-Gbps pipe during off-peak hours at night. It yielded a data transfer rate of about 500 GiB/hour. At this rate, Autodesk successfully transferred and verified the entire 700-TB dataset to S3 within two and a half months.
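
To put these numbers in perspective, here is a back-of-the-envelope sketch of the transfer math. It assumes that the quoted rate is in gibibytes (GiB) per hour, that 700 TB is a decimal figure, and that protocol overhead can be ignored. The function names are ours, purely for illustration.

```python
# Back-of-the-envelope transfer estimate from the figures quoted in the text.
# Assumptions: 700 TB is decimal (10**12 bytes per TB) and the observed rate
# is 500 GiB (2**30 bytes) per hour.

GIB = 2**30
TB = 10**12


def link_capacity_gib_per_hour(gbps: float) -> float:
    """Theoretical ceiling of a link in GiB/hour, ignoring protocol overhead."""
    bytes_per_second = gbps * 1e9 / 8
    return bytes_per_second * 3600 / GIB


def transfer_hours(dataset_tb: float, rate_gib_per_hour: float) -> float:
    """Hours of active transfer needed at a steady rate."""
    return dataset_tb * TB / (rate_gib_per_hour * GIB)


if __name__ == "__main__":
    ceiling = link_capacity_gib_per_hour(1.4)
    hours = transfer_hours(700, 500)
    print(f"1.4 Gbps ceiling: {ceiling:.0f} GiB/hour")
    print(f"Utilization at 500 GiB/hour: {500 / ceiling:.0%}")
    print(f"Active transfer time: {hours:.0f} hours ({hours / 24:.0f} days)")
```

Under these assumptions, the dataset needs roughly 54 days of continuous transfer at the observed rate, which is consistent with a two-and-a-half-month migration in which the full bandwidth was available only during off-peak hours.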

The following is a high-level architecture diagram of the implementation of Autodesk’s DataSync solution:

Workflow:

Deploy DataSync Agent (OVA file) on-premises for local storage access.
Secure data transfer over the internet using a purpose-built protocol and TLS encryption.
The DataSync service writes data to S3.
Life-cycle rules are set to move data to long-term archival (Glacier).
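
The last step of this workflow can also be scripted. The following is a minimal sketch, not Autodesk’s actual configuration: the rule ID, prefix, and 30-day threshold are hypothetical, and the rule can be created in a disabled state so that it does not interfere with a running transfer. It assumes a boto3 S3 client is passed in.

```python
# Sketch: build an S3 lifecycle configuration that transitions objects to
# S3 Glacier. The rule ID, prefix, and day count are illustrative assumptions.
from typing import Any, Dict


def build_glacier_lifecycle(prefix: str, days: int, enabled: bool) -> Dict[str, Any]:
    """Return a lifecycle configuration with a single Glacier transition rule."""
    return {
        "Rules": [
            {
                "ID": "archive-db-backups",  # hypothetical rule name
                "Status": "Enabled" if enabled else "Disabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


def apply_lifecycle(s3_client: Any, bucket: str, config: Dict[str, Any]) -> None:
    """Apply the configuration using a boto3 S3 client supplied by the caller."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


# Usage (requires boto3 and AWS credentials; the bucket name is hypothetical):
#   import boto3
#   config = build_glacier_lifecycle("backups/", days=30, enabled=False)
#   apply_lifecycle(boto3.client("s3"), "example-backup-bucket", config)
```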

Solution

To get started with DataSync, open the DataSync service in the AWS Management Console.

Next, select the On-premises to AWS use case. To access the on-premises storage, an agent has to be downloaded, deployed, and activated in the VMware environment on-premises. The activation process associates the agent with the corresponding AWS account.

In the Agent setup section of the Create agent page, choose Download image. The agent is an OVA file. Autodesk IT deployed this OVA in a VMware ESXi hypervisor on-premises. Next, a static IP address was assigned to this virtual machine (VM). Alternatively, you can use DHCP-based IP and hostname assignment if available.

To separate the northbound traffic (between the DataSync agent and AWS) from the southbound traffic (between the Data Domain NFS export and the DataSync agent), Autodesk IT assigned a second network adapter (NIC) and IP address to the VM. This is an optional step to provide a separate interface for southbound network traffic to the NFS.

Next, on the Create agent page, enter the agent’s northbound IP address, and choose the Get key option to get an activation key. Give the agent a name and a tag as well.

The activation key is important, because it securely associates the agent with the AWS account. The activation process requires the agent’s port 80 to be accessible from the browser. After the agent is activated, it closes port 80 and the port is no longer accessible.
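
The same registration can be done programmatically once you have an activation key. The sketch below assumes that the key was already obtained (for example, from the console as described above) and that a boto3 DataSync client is passed in. The agent name and tags are placeholders.

```python
# Sketch: register an activated agent with the DataSync API.
# The activation key, agent name, and tags are placeholders, not real values.
from typing import Any, Dict


def build_agent_request(activation_key: str, name: str, tags: Dict[str, str]) -> Dict[str, Any]:
    """Build keyword arguments for the DataSync create_agent call."""
    tag_list = [{"Key": k, "Value": v} for k, v in sorted(tags.items())]
    return {"ActivationKey": activation_key, "AgentName": name, "Tags": tag_list}


def register_agent(datasync: Any, activation_key: str, name: str, tags: Dict[str, str]) -> str:
    """Call create_agent on a caller-supplied client and return the agent ARN."""
    response = datasync.create_agent(**build_agent_request(activation_key, name, tags))
    return response["AgentArn"]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   arn = register_agent(boto3.client("datasync"), "<activation-key>",
#                        "onprem-agent", {"team": "infrastructure"})
```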

Next, two locations were configured for Autodesk’s use case: one for the source and another for the destination. A location is a configuration of an NFS file system or AWS storage service. In this use case, Autodesk used NFS as the location type for the source. The destination pointed to an S3 bucket.
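
For readers who prefer scripting over the console, the two locations could be created along these lines. This is a sketch under assumptions: the hostname, export path, and ARNs are hypothetical, and the IAM role that grants DataSync access to the bucket is assumed to exist already.

```python
# Sketch only: all hostnames, paths, and ARNs passed in are hypothetical.
from typing import Any, Dict


def create_locations(
    datasync: Any,
    agent_arn: str,
    nfs_host: str,
    nfs_export: str,
    bucket_arn: str,
    bucket_role_arn: str,
) -> Dict[str, str]:
    """Create the NFS source and S3 destination; return their location ARNs."""
    source = datasync.create_location_nfs(
        ServerHostname=nfs_host,
        Subdirectory=nfs_export,
        OnPremConfig={"AgentArns": [agent_arn]},
    )
    destination = datasync.create_location_s3(
        S3BucketArn=bucket_arn,
        S3Config={"BucketAccessRoleArn": bucket_role_arn},
    )
    return {
        "source": source["LocationArn"],
        "destination": destination["LocationArn"],
    }


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   arns = create_locations(boto3.client("datasync"), agent_arn,
#                           "datadomain.example.internal", "/backups",
#                           "arn:aws:s3:::example-backup-bucket", role_arn)
```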

Now that an agent and the source and destination locations were configured, the next step was to create a task. A task is a set of two locations (source and destination) and a set of options that help control the behavior of the task. The settings are straightforward, but we want to call out a couple of features:

Enable verification—With this option, DataSync performs a data integrity verification at the end of the task execution, after all data and metadata have been transferred. This setting was left enabled to ensure that the source and destination data matched when the copy was complete.
Set bandwidth limit (MiB/s)—Allows you to limit the bandwidth used by the DataSync agent for this task. With Use available, DataSync agent maximizes the network bandwidth utilization that is available for the transfer.
In Autodesk’s use case, we chose the Use available option because a bandwidth limit was already in place on Autodesk’s network switches to control the northbound traffic. However, Set bandwidth limit is a less intrusive option that could work for most customer scenarios.
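
If you create tasks through the API rather than the console, these two settings map to the task options VerifyMode and BytesPerSecond. The sketch below is illustrative: the helper name is ours, and it converts a console-style MiB/s limit into the bytes-per-second value that the API expects, using -1 for no limit.

```python
# Sketch: translate the two console settings into DataSync task options.
# The helper name and its defaults are illustrative assumptions.
from typing import Any, Dict, Optional


def build_task_options(
    bandwidth_mib_per_s: Optional[float] = None, verify: bool = True
) -> Dict[str, Any]:
    """Return an Options dict; None for bandwidth means 'use available'."""
    if bandwidth_mib_per_s is None:
        bytes_per_second = -1  # -1 tells DataSync not to throttle
    else:
        bytes_per_second = int(bandwidth_mib_per_s * 1024 * 1024)
    return {
        "VerifyMode": "POINT_IN_TIME_CONSISTENT" if verify else "NONE",
        "BytesPerSecond": bytes_per_second,
    }


# Usage (requires boto3 and AWS credentials; ARNs and name are hypothetical):
#   import boto3
#   boto3.client("datasync").create_task(
#       SourceLocationArn=source_arn,
#       DestinationLocationArn=destination_arn,
#       Name="db-backup-migration",
#       Options=build_task_options(bandwidth_mib_per_s=100),
#   )
```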

After the task execution starts, it transitions through four phases (launching, preparing, transferring, and verifying), and then terminates with success or failure, as shown in the diagram below. The status of the task can be monitored using Amazon CloudWatch.
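
Besides CloudWatch, the phases can be observed by polling the DataSync API. The following sketch takes that alternative approach and is not what Autodesk used. It assumes a boto3 DataSync client, an execution ARN returned by start_task_execution, and that SUCCESS and ERROR are the only terminal states of interest. The polling interval is arbitrary.

```python
# Sketch: poll a task execution until it finishes, recording each phase once.
# Assumption: SUCCESS and ERROR are treated as the terminal states.
import time
from typing import Any, Callable, List

TERMINAL_STATES = {"SUCCESS", "ERROR"}


def wait_for_execution(
    datasync: Any,
    execution_arn: str,
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Return the distinct statuses observed, in order, ending in a terminal one."""
    seen: List[str] = []
    while True:
        response = datasync.describe_task_execution(TaskExecutionArn=execution_arn)
        status = response["Status"]
        if not seen or seen[-1] != status:
            seen.append(status)  # record only transitions, not every poll
        if status in TERMINAL_STATES:
            return seen
        sleep(poll_seconds)


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("datasync")
#   arn = client.start_task_execution(TaskArn=task_arn)["TaskExecutionArn"]
#   print(wait_for_execution(client, arn))
```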

Best practices

Based on Autodesk’s experience with DataSync, we want to highlight a few points:

Enterprises often end up storing data that may be redundant or even unnecessary. Before a petabyte-scale migration, revisit your backup retention policies and revise them if needed. Understand the real purpose of your enterprise archive data retention policies and clean up the dataset before initiating the migration to save time and cost.
While the DataSync task is running, do not enable the lifecycle transition rule from S3 to Amazon S3 Glacier. Always wait until the task execution is complete. The objects being written to the bucket must be available for the final data verification step. If the objects are moved to Amazon S3 Glacier before this verification is complete, you have to run a recall job (for example, with the s3bro tool) to make the objects available again in S3 for verification.
A single DataSync agent is designed to maximize its use of the available network pipe. To limit the bandwidth used by an agent for the task, make sure to use the Set bandwidth limit option in the task.
Clean your dataset and reorganize the folders before initiating data transfer to save on costs by transferring only the data intended. DataSync now has a Filter feature to sift and transfer only a subset of your source files as needed.
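
As an illustration of the last point, an exclude filter for a task might be assembled as follows. This is a sketch with hypothetical patterns. It assumes the API's SIMPLE_PATTERN filter type, in which multiple patterns are joined with a pipe character.

```python
# Sketch: build a DataSync exclude filter from a list of patterns.
# The example patterns are hypothetical; adjust them to your own layout.
from typing import Dict, List, Sequence


def build_exclude_filter(patterns: Sequence[str]) -> List[Dict[str, str]]:
    """Join non-empty patterns with '|' into a single SIMPLE_PATTERN rule."""
    cleaned = [p.strip() for p in patterns if p.strip()]
    if not cleaned:
        return []  # no filter: everything under the source location is copied
    return [{"FilterType": "SIMPLE_PATTERN", "Value": "|".join(cleaned)}]


# Usage (requires boto3 and AWS credentials; ARNs are hypothetical):
#   import boto3
#   boto3.client("datasync").create_task(
#       SourceLocationArn=source_arn,
#       DestinationLocationArn=destination_arn,
#       Excludes=build_exclude_filter(["/scratch", "*.tmp"]),
#   )
```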

Conclusion

In this post, we demonstrate how to successfully and swiftly complete a large data migration from on-premises to AWS using AWS DataSync. We hope this post helps you in your decision-making process to try out and use DataSync for your own use cases. We welcome your feedback and suggestions.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.