AWS Storage Blog

Migrate large HPC datasets from the edge to the cloud then synchronize continuously

Organizations running high-performance computing (HPC) workloads on premises often want to move data to the cloud to leverage scalability, performance, cost optimization, and other benefits of the cloud. For edge locations with limited or no available network bandwidth, online migrations can take a long time or be impossible.

In locations where there is limited bandwidth availability, you can use AWS Snowball Edge Storage Optimized devices for the primary bulk data transfer and then AWS DataSync for ongoing updates and metadata synchronization. You can use the snow-transfer-tool to boost data transfer efficiency, and Amazon FSx for Lustre offers a powerful file system solution for mounting the HPC data as a Lustre file system.

In this post, I walk through ordering Snowball devices, setting up virtual machines to transfer data from on-premises storage to AWS Snowball, configuring DataSync for post-migration synchronization of data and metadata, and mounting migrated data as a Lustre file system using FSx for Lustre. This solution offers a comprehensive approach for efficiently migrating large HPC datasets to AWS and, to account for changes to the data after the migration, maintaining ongoing synchronization of your datasets. It specifically addresses the need for low-latency data mounting for HPC workloads via FSx for Lustre.

Solution architecture

In this architecture, you use AWS Snowball Edge Storage Optimized devices with 210 TB capacity for the initial offline data transfer. Snowball devices are suitcase-sized, portable storage devices that allow users to securely transfer large datasets physically to AWS, bypassing the constraints of low-bandwidth connections. This design uses a virtual machine connected to a private network to replicate data onto the Snowball device.

You use DataSync, a migration service that allows you to transfer data between on-premises and AWS, between AWS storage services, and between AWS and other locations, for post-migration synchronization. DataSync uses an on-premises agent to transfer data to AWS. The migrated data is then stored in Amazon Simple Storage Service (Amazon S3) and linked with an FSx for Lustre file system for low-latency access.

For HPC workloads, where speed is crucial, FSx for Lustre provides an easy and cost-effective way to launch and run the high-performance Lustre file system. FSx for Lustre is POSIX-compliant, allowing you to run your Linux-based HPC applications without modification. You use FSx for Lustre integrated with Amazon S3. When connected to an Amazon S3 bucket, an FSx for Lustre file system presents S3 objects as files.

Solution architecture diagram for migrating and synchronizing to AWS from on-premises using Snowball and DataSync

In this diagram:

  1. Bulk offline data transfer using Amazon Snowball Edge Storage Optimized devices.
  2. Post-migration synchronization with AWS DataSync.
  3. Mounting migrated data in Amazon FSx for Lustre from the Amazon S3 bucket.

Prerequisites

The following are necessary to continue with this post:

  1. Compute for snow-transfer-tool:
    • The virtual machine (VM) hardware required depends on how many Snow devices run in parallel for data transfer.
    • The snow-transfer-tool and other steps need Python 3.6 or higher and the AWS Command Line Interface (AWS CLI).
  2. A virtual machine to run the DataSync agent for post-migration synchronization of data.

Walkthrough

Once you have set up the virtual machine for data transfer to Snowball and the compute resources to run the DataSync agent, as specified in the prerequisites, you can start deploying the solution.

Follow these six steps to deploy the solution:

  1. Create an S3 bucket
  2. Order Snowball Edge devices
  3. VM configuration for data transfer to Snowball
  4. Transfer data to Snowball
  5. Use DataSync to identify and replicate any changes
  6. Mounting data in a Lustre volume

1. Create an S3 bucket

Because AWS Snowball Edge is a regional service, an Amazon S3 bucket in the corresponding AWS Region is required before placing a Snowball order. To create the bucket, use either the AWS CLI or the console.

aws s3 mb s3://mysnowball-data --region eu-west-1

2. Order Snowball Edge devices

2.1. Calculate the number of devices required for the migration:

Number of devices = Total data to be migrated / Capacity of the selected Snowball Edge Storage Optimized device

Example: Migrating 1 PB of data requires five Snowball Edge Storage Optimized 210 TB devices.

2.2. Create a Snow job

You can create a Snow job from the console. Follow the instructions in this developer guide.

Use the following configuration in the Snow job for data migration:

  • Choose a job type: import into Amazon S3
  • Snow devices: Snowball Edge Storage Optimized with 210 TB
  • Choose your pricing option: on-demand
  • Select the storage type: Amazon S3 data transfer
  • Select your S3 buckets: mysnowball-data (from Step 1)
  • Encryption: aws/importexport (default)
  • Shipping address: your shipping address
  • Shipping speed: One-Day Shipping (1 business day)
  • Set notifications: Create a new SNS topic, provide a topic name and email.

3. VM configuration for data transfer to Snowball

3.1. Create a Linux VM and mount the on-premises volume.

Deploy a Linux VM in the on-premises network for migration purposes, and install the packages required to mount the source file system in the VM.
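As an illustration, if the source data is shared over NFS, a minimal sketch of this step could look like the following. The NFS server name and export path are placeholders, the mount point /nfs matches the source path used later in this post, and the client package name depends on your distribution.

# Install the NFS client (nfs-utils on RHEL/Amazon Linux; nfs-common on Debian/Ubuntu)
sudo yum install -y nfs-utils

# Mount the on-premises export that holds the HPC dataset
sudo mkdir -p /nfs
sudo mount -t nfs <nfs_server>:/export /nfs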

3.2. Install snow-transfer-tool

The snow-transfer-tool can help you efficiently handle large migrations to Snowball Edge devices. The snow-transfer-tool facilitates the grouping of large collections of small files based on a predetermined size, with batch transfer for more efficient uploading. It scans the source directory and generates several partition files based on the partition size and the device size. Each line of a partition file is the absolute path of a file that is stored inside your source directory.

git clone https://github.com/aws-samples/snow-transfer-tool.git
cd snow-transfer-tool/
./install.sh

3.3. Divide the files into lists based on device size

Before ordering Snowball devices, you must generate lists of file paths with the snow-transfer-tool based on the size of the devices. This example uses 210 TB devices.

The following command creates partition files within the designated file list directory, organizing them in separate subfolders for each device.

source_file_path=/nfs
filelist_directory=~/snow/list_files
logs=~/snow/logs

snowTransfer gen_list --device_capacity 210TB --src $source_file_path --filelist_dir $filelist_directory --log_dir $logs --partition_size 1GB

Here is a screenshot of the list of files generated by the preceding command:

screenshot of the directory showing the list of files generated by the snowTransfer gen_list command

4. Transfer data to Snowball

4.1. Download the access keys for the Snowball device:

4.1.1. Download AWS OpsHub for Snow Family devices.

4.1.2. Unlock the device through AWS OpsHub using the unlock code and manifest file from the Snowball job.

4.1.3. Download the access keys from AWS OpsHub.
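Alternatively, if you prefer the command line, the Snowball Edge client can unlock the device and return the local access keys. The following is a sketch with placeholder values for the device IP, manifest file, and unlock code.

# Unlock the device with the manifest file and unlock code from the Snow job
snowballEdge unlock-device --endpoint https://<snowball_1_IP> --manifest-file <path_to_manifest> --unlock-code <unlock_code>

# List the local access key ID, then retrieve the matching secret access key
snowballEdge list-access-keys --endpoint https://<snowball_1_IP> --manifest-file <path_to_manifest> --unlock-code <unlock_code>
snowballEdge get-secret-access-key --access-key-id <access_key_id> --endpoint https://<snowball_1_IP> --manifest-file <path_to_manifest> --unlock-code <unlock_code>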

4.2. Configure an AWS profile

You need to configure an AWS profile on the virtual machine set up in step 3. This profile is used by the AWS CLI to authenticate with the Snowball device. Use the access keys obtained in step 4.1 for its setup. Run the following AWS CLI command to configure the profile for connecting to Snowball.

aws configure --profile snowball1
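To verify that the profile and the device endpoint work before starting the transfer, you can list the bucket on the device; the device IP below is a placeholder.

# List the buckets on the Snowball Edge device to confirm connectivity and credentials
aws s3 ls --profile snowball1 --endpoint-url https://<snowball_1_IP>:8443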

4.3. Transfer data to Snowball

You can use the snow-transfer-tool upload command to upload files from the source folder to the device using the generated file lists. In the following command, "--extract_flag true" configures the metadata flags required for the automatic extraction of the tar files created for batch transfer to Snowball during import into the S3 bucket.

snowTransfer upload --src $filelist_directory/device_210.00TiB_1/fl_QDU2UZ_1.ready --bucket_name mysnowball-data --log_dir ~/snow/logs/ --endpoint https://<snowball_1_IP>:8443 --profile_name snowball1 --extract_flag true

The blog “Migrating mixed file sizes with the snow-transfer-tool on AWS Snowball Edge devices” provides a detailed explanation of the tool and how you can optimize data transfer using increased parallelism.
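As a rough sketch of that approach, you could launch one upload per partition file in the background. The partition file pattern below follows the lists generated in step 3.3, and the device IP remains a placeholder.

# Launch one upload per partition file for the first device, then wait for all uploads to finish
for partition in $filelist_directory/device_210.00TiB_1/*.ready; do
  snowTransfer upload --src $partition --bucket_name mysnowball-data --log_dir ~/snow/logs/ --endpoint https://<snowball_1_IP>:8443 --profile_name snowball1 --extract_flag true &
done
wait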

4.4. Return the Snowball device

Prepaid shipping details are displayed on the E Ink screen of the Snowball device. You can refer to this developer guide.

4.5. Repeat the preceding steps (4.1 to 4.4) until the data migration is complete.

5. Use DataSync to identify and replicate any changes

After finishing the bulk data transfer with the Snowball Edge devices, use DataSync to synchronize both the updated data and the POSIX metadata needed to mount the data as a Lustre file system for HPC data processing from Linux.

In DataSync, the initial step is configuring the DataSync agent, detailed in this DataSync User Guide. Next, configure NFS as the source location and the Snowball Edge-migrated data bucket as the destination. To transfer POSIX metadata from an NFS file system, enable the following options under Additional settings in DataSync: Copy ownership, Copy permissions, and Copy timestamps.
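If you prefer to script this step, the following is a minimal sketch with the DataSync CLI. The location ARNs are placeholders, and the option values correspond to the console settings above (Copy ownership, Copy permissions, and Copy timestamps).

# Create a DataSync task that preserves ownership (UID/GID), POSIX permissions, and timestamps
aws datasync create-task --source-location-arn <nfs_location_arn> --destination-location-arn <s3_location_arn> --options Uid=INT_VALUE,Gid=INT_VALUE,PosixPermissions=PRESERVE,Mtime=PRESERVE,Atime=BEST_EFFORT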

The blog “Synchronizing your data to Amazon S3 using AWS DataSync” provides a detailed guide for the DataSync metadata migration and data synchronization.

6. Mounting data in a Lustre volume

6.1. Create a Lustre volume for hosting the POSIX file system

Create a new FSx for Lustre volume with the data repository import/export option configured and choose to import data from and export data to Amazon S3. After your Lustre file system is created, FSx for Lustre can keep your file and directory listings up to date automatically as you add or modify objects in your S3 bucket.

aws fsx create-file-system --file-system-type LUSTRE --subnet-ids subnet-*** --security-group-ids sg-** --lustre-configuration DeploymentType=PERSISTENT_1,PerUnitStorageThroughput=100,AutoImportPolicy=NEW_CHANGED_DELETED,ImportPath=s3://mysnowball-data/dataset/,ExportPath=s3://mysnowball-data/dataset/ --storage-capacity 1200

screenshot of the AWS Console for FSx for Lustre file system with S3 as the data repository

6.2. Mounting FSx for Lustre volume in a Linux system

Finally, install the Lustre client (instructions are in the Amazon FSx for Lustre guide) and mount the Lustre volume on the Linux system with the following command.

mount -t lustre -o relatime,flock fs-****.fsx.eu-west-1.amazonaws.com@tcp:/**** /fsx
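As a quick check after mounting, you can confirm that the file system is attached and that the imported S3 objects appear as files; a minimal sketch follows, assuming the /fsx mount point from the previous command.

# Confirm the Lustre file system is mounted and list the imported files
df -h /fsx
ls -l /fsx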

Cleaning up

After completing the migration, delete the resources created as described in this blog post to avoid future charges; a sketch of the corresponding CLI commands follows the list.

  1. Unmount the Lustre volume from the HPC cluster.
  2. Delete the FSx for Lustre file system created to mount the S3 bucket in the HPC cluster.
  3. Clean up the DataSync resources used for post-migration data synchronization.
  4. Remove the on-premises virtual machine used for Snowball data transfer.
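A minimal sketch of these cleanup steps from the CLI follows; the file system ID and ARNs are placeholders for the resources created earlier in this post.

# Unmount the Lustre volume, then delete the FSx for Lustre file system
sudo umount /fsx
aws fsx delete-file-system --file-system-id fs-****

# Remove the DataSync task, locations, and agent used for post-migration synchronization
aws datasync delete-task --task-arn <task_arn>
aws datasync delete-location --location-arn <nfs_location_arn>
aws datasync delete-location --location-arn <s3_location_arn>
aws datasync delete-agent --agent-arn <agent_arn>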

Conclusion

In this post, I covered using AWS Snowball, AWS DataSync, and Amazon FSx for Lustre as an effective and scalable solution for migrating large HPC datasets to AWS, even from bandwidth-constrained locations. This approach facilitates offline data transfer via Snowball, continuous synchronization with DataSync, and low-latency access to the migrated data using FSx for Lustre.

Using AWS Snowball devices with the snow-transfer-tool provides an efficient offline data migration approach. The snow-transfer-tool improves performance by grouping files and using parallel processing through partitioned files, leading to a significant reduction in data transfer time. Additionally, simultaneous use of multiple Snowball Edge devices further reduces the time of data transfer to AWS. DataSync facilitates the transfer of POSIX metadata and newly generated files after the Snowball data migration, ensuring a comprehensive and seamless data transfer approach. Amazon FSx for Lustre provides a cost-effective, high-speed, POSIX-compliant file system for HPC workloads that integrates with Amazon S3. You can use this solution for faster transfer of HPC data where there is limited or no bandwidth availability.

For information about Snowball and DataSync, check out AWS Snowball Edge and AWS DataSync documentation. If you’d like to discuss your Snow purchase in more detail, reach out to your AWS account team or to our sales team.

Lijo Antony

Lijo Antony is a Senior Solutions Architect at Amazon Web Services (AWS), responsible for helping global customers build resilient, scalable, and secure solutions on AWS. He has a background in DevOps, infrastructure, networking, and IT business practices. He is passionate about technology and helping customers solve problems. In his free time, he enjoys spending time with his family and traveling.