AWS Storage Blog

Migrating mixed file sizes with the snow-transfer-tool on AWS Snowball Edge devices

When moving your applications and business infrastructure to AWS, it is likely you will need to migrate your existing data as well. This data often comes from file share environments and contains a variety of file sizes. If the data contains more than a single digit percentage of files under 1 MB, your migration performance may be impacted by these small files.

Small files present a challenge because each file requires seeking, reading, and writing both file data and metadata. The result is reduced throughput compared to larger files, due to latency in the metadata operations, particularly with devices that utilize spinning media. AWS Snowball Edge devices are not immune to this impact.

To help resolve this challenge, the AWS documentation recommends batching small files together into tar or zip files when using the Amazon S3 API as the transfer mechanism on the AWS Snowball Edge device. Following these guidelines on your own requires considerable time spent developing scripting and tooling that addresses your compliance, logging, and transfer optimization needs.

We recognize this process as a heavy lift for our customers, and we’ve developed the snow-transfer-tool to make it easier and more streamlined. Using the snow-transfer-tool with AWS Snowball Edge is a way to achieve these recommendations and increase the efficiency of your data transfer.

In this post, I describe specific use cases and walk you through the steps you need to perform a migration using the snow-transfer-tool with AWS Snowball Edge devices.

Solution overview

The snow-transfer-tool provides multiple options to optimize your process depending on the size of the data being moved, the available compute capacity on the transfer workstation, and the number of devices being utilized for the migration. It also provides logging and options to tune your transfer based on the compute resources available and the best practices for your particular scenario. When the tool encounters large files, they are directly uploaded to the Snowball Edge device.

The tool supports data migrations that require either a single Snowball Edge (SBE) device or multiple devices, using the S3 API for data transfer. The differences in leveraging the tool for each of these use cases are covered below. It is important to note that the script can be run with either command line options or a configuration file. Detailed usage information can be found in the README.md file in the GitHub repository. Additionally, a video walk-through of both scenarios is available.

Prerequisites

The walkthrough has the following prerequisites:

  • The data migration system(s) that will run the snow-transfer-tool:
    • 6+ physical cores
    • 16GB RAM
    • 10Gb Ethernet
    • Windows 2016 or newer, Linux with kernel 3.12 or newer, macOS 12 or newer
    • Python 3.6 or greater
    • AWS CLI installed on system (used to configure profiles for each Snowball Edge device)
  • An AWS account.
  • AWS Snowball Edge device with the Amazon S3 API selected as the data transfer option. This is selected at the time of device ordering and cannot be changed once the device is shipped.
  • AWS Snowball Edge device that has been unlocked.
  • Amazon S3 credentials from the AWS Snowball Edge device
  • If using HTTPS for data transfer over the network, obtain the SSL certificate as defined in the documentation and set the AWS_CA_BUNDLE environment variable to point to the certificate.
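The following is a minimal sketch of that last step on Linux, assuming the Snowball Edge client has already been configured for the device (snowballEdge configure) and using a hypothetical certificate path of /my/certs/sbe1.pem:

snowballEdge list-certificates
snowballEdge get-certificate --certificate-arn <certificate-arn-from-previous-output> > /my/certs/sbe1.pem
export AWS_CA_BUNDLE=/my/certs/sbe1.pem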

Walkthrough: Preparation for all scenarios

This section covers the preparatory steps needed for all data transfer scenarios. The generalized steps are:

  1. Meet the prerequisites defined above
  2. Download and install the tool
  3. Configure the AWS CLI profile for the Snowball Edge device
  4. Perform the copy
    • Optional: Create the partition files

Link to GitHub repository: https://github.com/aws-samples/snow-transfer-tool

Installation and configuration

The following steps should be performed against the data migration host system.

To download and install the snow-transfer-tool

1. Using your browser, navigate to https://github.com/aws-samples/snow-transfer-tool.
2. Select the green button that is labeled Code.
3. In the drop-down menu, select Download Zip.

Image depicting the process to download the ZIP archive containing the code for the snow-transfer-tool. URL for ZIP archive at time of writing: https://github.com/aws-samples/snow-transfer-tool/archive/refs/heads/main.zip

4. Extract this file on the computer(s) acting as the data copy workstation(s).
5. For Windows, select and run install.bat. For Linux or macOS, run install.sh.
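Alternatively, if git is available on the workstation, cloning the repository and then running the installer from step 5 accomplishes the same thing. A minimal sketch for Linux or macOS:

git clone https://github.com/aws-samples/snow-transfer-tool.git
cd snow-transfer-tool
./install.sh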

Configure Amazon S3 credentials

The following configuration step simplifies multiple copy operations to the same Snowball Edge by storing the keys needed to authenticate the S3 API operations being performed. If there are only a small number of copy jobs, or your security policy prohibits storing the keys in a text file on the data migration workstation, you can use the aws_access_key_id and aws_secret_access_key option flags and skip these steps.

  1. After installation of the AWS CLI completes, create profiles on each workstation being used for the data copy:
    aws configure --profile {profile name}

    Example:

    aws configure --profile snowballedge-1
  2. You will then be prompted to provide the access key, secret key, region, and default output format for this device. The access key and secret access key can be obtained for each device via the Snowball Edge CLI or via AWS OpsHub (see the sketch after this list).
  3. Repeat the profile creation for each device being managed on each system that will be used for data copy operations.
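If you are using the Snowball Edge CLI, a minimal sketch for retrieving the keys, assuming the client has already been configured for the device, looks like this:

snowballEdge list-access-keys
snowballEdge get-secret-access-key --access-key-id <access-key-id-from-previous-output>

Enter the returned values when prompted by aws configure.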

Copy data to the AWS Snowball Edge

There are two methods to copy data using the script, with and without partition files. Partition files are text files that list the files to be uploaded to a single archive file in the AWS Snowball Edge device. If the number of files to copy is relatively small (low thousands), using the script without partition files is the simplest method.

If there are many thousands, millions, or even billions of files, or if the total capacity being copied exceeds the capacity of a single Snowball Edge device, using partition files provides the ability to parallelize multiple jobs to a single device or to upload data to multiple devices simultaneously. Additionally, using partition files enables you to begin the migration preparation before ordering a Snowball Edge device.

NOTE: It is critical that the commands be run from the workstation console or in a resumable session, such as a screen session on Linux. Otherwise, a disconnected session can cause the copy process to fail.
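For example, on Linux you might wrap the upload in a named screen session so that a dropped SSH connection does not interrupt the copy (the session name sbe-upload is arbitrary):

screen -S sbe-upload
# run the snowTransfer upload command inside the session
# detach with Ctrl-A then D; reattach later with:
screen -r sbe-upload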

Copy without partition files

The following sample command will directly upload the data located in /dir/with/smallfiles to the bucket named my-snowballedge-bucket on the AWS Snowball Edge device with an IP 192.168.50.51 using the HTTPS protocol. Archive files will be written to the prefix of mysmallfiles and are set to be auto-extracted at time of import. The transfer logs will be placed in /my/logs/ and will be uploaded to the device.

snowTransfer upload --src /dir/with/smallfiles/ --bucket_name my-snowballedge-bucket \
  --log_dir /my/logs/ --profile_name snowballedge-1 --endpoint https://192.168.50.51:8443 \
  --prefix_root mysmallfiles --extract_flag true --upload_logs true \
  --max_files 100000 --partition_size 10GB

When I tested this process against a single directory containing 131 million 4-KB files, it took a little under 32 hours to upload all the files to the Snowball Edge device. Copying the same files without the snow-transfer-tool took over 22 days (roughly 528 hours), a 94% decrease in the time required for the migration. While your results may vary depending on the numerous factors that can impact performance, the general principle of turning many small writes into a large, sequential write should result in improved throughput.

Copy with partition files

Partition files specify what files will be batched into a particular archive file on the AWS Snowball Edge device. The partition files break up the batched files into smaller, manageable chunks that are supported by the Snowball Edge import service for auto-extraction. To be auto-extracted at time of import to Amazon S3, the archive files must meet certain requirements for size and number of files per archive. Partition file creation allows you to follow these requirements and to optimize the upload of the data to the Snowball device(s) through parallel uploads. When working with multiple devices, the partition files may be stored in a network share available to mount on all the data copy workstations.

Another benefit of using partition files is that you also have a manifest of what files are copied to each device. If a device is damaged in shipping or a copy process fails, it is possible to copy the same files to a replacement device without needing to inventory all the objects in S3 and perform a comparison.
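Because each partition file is simply a text listing of source files, it can be queried with standard tools. As a sketch, assuming partition files generated under /my/partition_files and a hypothetical source file path, the following shows which partition file (and therefore which device subdirectory) a given file was assigned to:

grep -rlF '/mysrcdir/smallfiles/projectA/data.bin' /my/partition_files/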

There are two steps to copying the data using partition files:

1. Generate partition files

The first step is to generate the partition files. This may be done prior to ordering any Snow devices. In this example, we are moving 210TB of data and will distribute it evenly across 3 devices:

snowTransfer gen_list --filelist_dir /my/partition_files --partition_size 1GB \
  --device_capacity 70TB --src /mysrcdir/smallfiles --log_dir /my/logs

This command will create the partition files in the location specified by the filelist_dir option, with a subdirectory for each device, each filled to a capacity of 70TB as set with the device_capacity option. The maximum size of each uploaded tar file will be 1GB, as set by the partition_size option.
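Based on the paths used later in this walkthrough, the resulting layout looks roughly like the following (an illustrative sketch; the random string in the file names will differ per run):

/my/partition_files/
  device_70.00TiB_1/
    fl_RKPYSJ_1.ready
    fl_RKPYSJ_2.ready
    ...
  device_70.00TiB_2/
    ...
  device_70.00TiB_3/
    ...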

2. Copy data to the AWS Snowball Edge device(s)

Building on the example above, the following command will upload the files listed in the partition file named fl_RKPYSJ_1.ready to the bucket named my-snowballedge-bucket on the AWS Snowball Edge device with the IP address 192.168.50.51 using the HTTPS protocol. When written to the tar file, the files will be written with a prefix of mysmallfiles and are set to be auto-extracted at time of import to Amazon S3, with the original directory structure preserved following any prefix set with the --prefix_root option. The transfer logs will be placed in /my/logs/ and will also be uploaded to the device.

snowTransfer upload --src /my/partition_files/device_70.00TiB_1/fl_RKPYSJ_1.ready \
  --bucket_name my-snowballedge-bucket --log_dir /my/logs/ --endpoint https://192.168.50.51:8443 \
  --profile_name snowballedge-1 --prefix_root mysmallfiles --extract_flag true --upload_logs true

While the preceding example is for a single partition file representing at most one gigabyte of data, the scenario described above has 210 TB of data to move, divided evenly into 70 TB per Snowball Edge device, so each device subdirectory contains roughly 70,000 partition files. If workstation-1 is the worker for snowballedge-1, workstation-2 is the worker for snowballedge-2, and workstation-3 is the worker for snowballedge-3, then the commands for each workstation to upload each file sequentially would look like this:

workstation-1

for i in {1..70000}; do snowTransfer upload --src /my/partition_files/device_70.00TiB_1/fl_RKPYSJ_$i.ready --profile_name snowballedge-1 --bucket_name my-snowballedge-bucket --log_dir /my/logs/ --endpoint https://192.168.50.51:8443 --prefix_root mysmallfiles --extract_flag true --upload_logs true;done

workstation-2

for i in {1..70000}; do snowTransfer upload --src /my/partition_files/device_70.00TiB_2/fl_RKPYSJ_$i.ready --profile_name snowballedge-2 --bucket_name my-snowballedge-bucket --log_dir /my/logs/ --endpoint https://192.168.50.52:8443 --prefix_root mysmallfiles --extract_flag true --upload_logs true;done

workstation-3

for i in {1..70000}; do snowTransfer upload --src /my/partition_files/device_70.00TiB_3/fl_RKPYSJ_$i.ready --profile_name snowballedge-3 --bucket_name my-snowballedge-bucket --log_dir /my/logs/ --endpoint https://192.168.50.53:8443 --prefix_root mysmallfiles --extract_flag true --upload_logs true;done

Further optimization of the upload process

These additional option flags and optimizations may be used to enhance your upload process, depending on the specifics of your environment.

The --max_process option flag can be set to allow higher concurrency in upload operations if sufficient CPU cores and free memory are available on the workstations.
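For example, a sketch that extends the earlier single-device command (the value of 10 is illustrative and should be sized to the cores and memory available on your workstation):

snowTransfer upload --src /dir/with/smallfiles/ --bucket_name my-snowballedge-bucket \
  --log_dir /my/logs/ --profile_name snowballedge-1 --endpoint https://192.168.50.51:8443 \
  --prefix_root mysmallfiles --extract_flag true --max_process 10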

The --partition_size and --max_files parameters are used to ensure that the generated tar files fall within best practices for the import process as defined in the AWS Snowball Edge documentation. Additionally, files larger than the --partition_size value are directly uploaded using multipart uploads, resulting in an optimized transfer process.

Setting the --compression flag to true reduces the size of the archive stored on the device if the data is compressible, which should also reduce the time required to upload the data to the device.

When uploading data using partition files, it is also possible to tell the script to strip certain characters from the destination file prefix using the --ignored_path_prefix flag. For example, setting the flag to a value of /user/johndoe when the files in the partition file have a path of /user/johndoe/appdata/ would result in the data in S3 being located in /appdata/.
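A sketch of that example, using the partition file from earlier and a hypothetical source layout under /user/johndoe/appdata/:

snowTransfer upload --src /my/partition_files/device_70.00TiB_1/fl_RKPYSJ_1.ready \
  --bucket_name my-snowballedge-bucket --log_dir /my/logs/ --profile_name snowballedge-1 \
  --endpoint https://192.168.50.51:8443 --ignored_path_prefix /user/johndoe \
  --extract_flag true --upload_logs true

With this setting, a file listed as /user/johndoe/appdata/report.csv would be stored under /appdata/report.csv after extraction.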

On Linux and macOS, there are additional packages that may be used to improve upload times even further. The GNU Parallel tool enables parallel execution of processes on a single host. Using it with the snow-transfer-tool upload command would look like this:

parallel -j 8 snowTransfer upload --src /my/testscript/partition/device_70.00TiB_3/fl_RKPYSJ_{}.txt \
  --profile_name snowballedge-1 --bucket_name my-snowballedge-bucket --log_dir /my/logs/ \
  --endpoint https://192.168.50.51:8443 --prefix_root testscript1 --extract_flag true ::: {1..70000}

This command results in 8 parallel upload processes, with the partition files 1 through 70,000 spread across those 8 processes. Note that each upload process will, by default, create up to 5 processes, so 8 parallel jobs can generate up to roughly 40 concurrent connections, at the top of the recommended range of 30-40 concurrent connections for a single Snowball Edge device. Using this command with the test data set of 131 million 4-KB files reduced the time to copy to the Snowball Edge device from just under 32 hours to just under 12 hours.

NOTE: Using increased parallelism can result in reduced performance for other workloads using the source storage device.

Conclusion

The snow-transfer-tool simplifies data migrations in several ways. It improves performance by batching small files, and it enables parallel data copies through the use of partition files, reducing the time required to perform data migrations with Snowball Edge devices. The tool also increases the repeatability of jobs via the partition files and improves visibility and reporting by creating log files. We encourage you to view the video, use the snow-transfer-tool, and provide feedback via the GitHub mechanisms as appropriate. Additionally, if you have an enhancement to the codebase that you would like to share, please feel free to do so by following the instructions in the CONTRIBUTING.md file in the repository.

David Byte

David Byte works as a Principal Storage Solution Architect and supports the Snow service for Data Migrations. He brings over thirty years of experience across a wide variety of environments and obsesses over doing the right thing for the customer. In his spare time, he enjoys visiting beaches with his wife, friends, and children.

Zichong Li

Zichong works as a Software Engineer on the AWS Snow Family team. He and his team are primarily focused on improving the data transfer experience for AWS Snow Family customers.