Best practices for accelerating data migrations using AWS Snowball Edge

Customers frequently perform bulk migrations of their application data when moving to the cloud. There are different online and offline methods for moving your data to the cloud. When proceeding with a data migration, data owners must consider the amount of data, transfer time, frequency, bandwidth, network costs, and security concerns. No matter how data makes its way to the cloud, customers often ask us how they can transfer their data to the cloud as quickly and as efficiently as possible. Often, speed is paramount, as customers want their data available and functional, and to take advantage of the agility, scalability, elasticity, reliability, and cost savings the AWS Cloud offers.

Even with high-speed internet connections, it can take months to transfer large amounts of data. For example, it can take more than 100 days to transfer 100 TB of data over a dedicated 100-Mbps connection. You can accomplish the same transfer in less than one week, plus shipping time, using two AWS Snowball Edge devices. Snowball Edge offers customers a fast and inexpensive way to transfer data both into and out of AWS.
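The 100-day figure can be sanity-checked with a quick back-of-the-envelope calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes) and takes link utilization as a parameter. The 80% utilization in the first example is our assumption rather than a measured value, and transfer_days is a hypothetical helper name.

```shell
#!/bin/sh
# Estimate how many days a network transfer takes.
# Usage: transfer_days <data_TB> <link_Mbps> <utilization 0-1>
transfer_days() {
    awk -v tb="$1" -v mbps="$2" -v util="$3" 'BEGIN {
        bits = tb * 1e12 * 8                  # decimal terabytes to bits
        seconds = bits / (mbps * 1e6 * util)  # effective link rate in bits/s
        printf "%.1f\n", seconds / 86400      # 86,400 seconds per day
    }'
}

transfer_days 100 100 0.8   # prints 115.7
transfer_days 100 100 1.0   # prints 92.6
```

Even at full line rate the transfer takes about three months, and any sustained utilization below roughly 92% pushes it past 100 days.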

Customers often use Snowball Edge for data transfers when there are connectivity limitations, bandwidth constraints, high network connection costs, legacy environment challenges, or when data is collected in remote locations. Although Snowball Edge also provides edge computing and edge storage capabilities, this blog focuses on data migration use cases and how you can accelerate bulk data transfers into Amazon S3.

In this blog, we discuss techniques and provide examples on how to speed up your data migration using one or more AWS Snowball Edge devices. You can use multiple devices in parallel or clustered together to transfer your on-premises data to the Snowball Edge. Most of the techniques discussed in this blog are specific to Snowball Edge and are not applicable to AWS Snowcone, which is the smallest member of the AWS Snow Family, as Snowcone does not currently have an Amazon S3 endpoint.

Use the right tool for your job

Whenever you have available connectivity between your location and AWS, you should first explore online transfer methods using AWS DataSync, the AWS CLI, or alternative transfer methods. Use the Snow Family of devices only when you have connectivity challenges with the internet or the cloud. Use Snowcone when you must transfer less than 32 TB to AWS; we recommend this because four 8 TB Snowcones cost less than a single 80 TB Snowball Edge. Use Snowball Edge when you have more data than that to upload to AWS.

Data transfer tools

The following is a list of some of the tools we tried and used when transferring data to Snowball Edge:

AWS CLI version 1.16.14 or earlier: Use s3 cp or s3 sync to copy or transfer changed data from your source to the Snowball Edge Amazon S3 endpoint.
AWS OpsHub: Use a graphical user interface to manage your Snow devices, deploy edge computing workloads, and simplify data migration to the cloud.
S5cmd: This third-party tool provides faster data transfers to Amazon S3 endpoints with parallelism, multi-threading, tab completion, and wildcard support for files.
Minio Client (mc): Use this tool to copy between file systems and Amazon S3 cloud storage solutions that support AWS Signature v2 and v4. The Minio client offers fast transfer speeds that are comparable with s5cmd.

Both the AWS CLI and AWS OpsHub are the recommended data transfer methods per the AWS Snowball Edge Developer Guide. Customers have also been successful using s5cmd and the Minio client (mc) for data transfers to Snowball Edge.

Amazon S3 or NFS?

Customers often ask whether they should use the built-in Amazon S3 endpoint on Snowball Edge or the file interface (NFS) for data transfers to Snowball Edge. When transfer speed and duration are important, we recommend using the S3 endpoint, as it is currently roughly 10 times faster than the file interface.

Note the following considerations when you transfer data to the Snowball Edge Amazon S3 endpoint:

Snowball Edge does not support symbolic links.
Snowball Edge does not support copying empty folders.
Avoid modifying files during data transfer as the copy operation will fail the checksum validation, and it will be marked as a failed transfer.
Maximum file size is 5 TB.
Maximum upload part size is 512 MB.
File and folder names must conform to the Amazon S3 object key naming guidelines.
Use UTF-8 characters for object key names.
File path length (for example, /top-directory/subdirectory/filename.txt) must not exceed the Amazon S3 object key length limit of 1,024 bytes.
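Several of these considerations can be checked on the source before you start copying. The following is a minimal sketch, not an official tool: preflight_check is a hypothetical helper name, it relies on the -empty test available in GNU and BSD find, and it measures path length on the path as find prints it (including the source prefix), so treat the last check as an approximation of the final object key length.

```shell
#!/bin/sh
# Report items under a source directory that conflict with the
# considerations listed above. Usage: preflight_check <source-dir>
preflight_check() {
    src=$1

    echo "== Symbolic links (not supported) =="
    find "$src" -type l

    echo "== Empty directories (not copied) =="
    # -empty is a GNU/BSD find extension, not POSIX
    find "$src" -type d -empty

    echo "== Paths longer than 1,024 bytes =="
    # LC_ALL=C makes awk count bytes rather than characters
    find "$src" | LC_ALL=C awk 'length($0) > 1024'
}
```

Run it as preflight_check /Logs/April and resolve anything it lists before starting the transfer.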

Occasionally, there are scenarios when you may want to use the file interface:

Environment is limited to NFS and does not allow for Amazon S3 endpoint use.
Source produces transfer rates less than 40 MB/s.
Preserve source metadata, for example, POSIX permissions, user ID, and group ID — the metadata is stored in the userMetadata portion of the corresponding Amazon S3 object.

The file interface on Snowball Edge supports a subset of the standard NFS protocol. Consequently, you will encounter errors if the copy tool uses unsupported calls. For example, Snowball Edge does not support copying Oracle RMAN backups directly to the NFS interface. Here are some additional considerations for using the file interface:

Maximum file size is 150 GB with NFS data transfers.
The interface ignores changes to file permissions after a file is written to Snowball Edge.
Avoid truncating, renaming, or changing ownership operations on files after data transfers.
Ensure that file paths do not exceed 1,024 UTF-8 characters. Linux itself permits file paths of up to 4,096 characters, so longer paths can exist on the source.

Use only one method to read and write data to Snowball Edge. Using both Amazon S3 and NFS at the same time can result in undefined behavior.

Accelerate your data migration

Customers also often ask how to speed up their data transfers to the Snowball Edge. The AWS Snowball Edge Developer Guide contains performance recommendations and best practices that you should read. The following subsections cover some additional best practices we often use to accelerate data transfers to Snowball Edge.

Optimize for small files with batching

Each copy operation has some overhead because of encryption; therefore, performing many transfers on individual small files is slower overall than transferring the same data in larger files. To significantly improve your transfer speed for small files (files less than 1 MB), batch the small files together. Batching files is a manual process. If the batched files are transferred to the Snowball Edge with the --metadata snowball-auto-extract=true option, the batches are automatically extracted when data is imported from Snowball Edge into Amazon S3. Note: Standard S3 charges apply when your data is uploaded into your S3 bucket from the Snow device. If snowball-auto-extract=true is set, a PUT request is initiated for each extracted small file.

Run the following tar command to manually batch small files, and then transfer them to Snowball Edge:

tar -cvf - /Logs/April | aws s3 cp - s3://mybucket/batch01.tar --metadata snowball-auto-extract=true --endpoint http://192.0.0.0:8080 --profile snowprofile

Keep the following in mind when batching small files:

Maximum batch size of 100 GB.
Recommended maximum of 10,000 files per batch.
Batches larger than 100 GB are not auto-extracted when imported to Amazon S3.
Supported archive formats are tgz, tar, and ZIP.
s5cmd does not support the --metadata snowball-auto-extract=true option, so use the AWS CLI or Minio client (mc) instead.
Minio client (mc) supports reading metadata and saving it during copy operations using the -a option.
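When a directory holds more small files than fit in one batch, the batching can be scripted. The sketch below is one possible approach under stated assumptions: make_batches is a hypothetical helper, both arguments should be absolute paths with the output directory outside the source, file names must not contain newlines, and the -T option is assumed to be available (GNU tar and bsdtar). It only builds the archives locally and does not upload anything.

```shell
#!/bin/sh
# Pack small files (under roughly 1 MB) into tar archives of at most
# N files each.
# Usage: make_batches <source-dir> <output-dir> [files-per-batch]
make_batches() {
    src=$1
    out=$2
    per_batch=${3:-10000}   # recommended maximum number of files per batch

    mkdir -p "$out"

    # List small files relative to the source, then split the list
    # into chunks of per_batch lines (list-aa, list-ab, ...).
    ( cd "$src" && find . -type f -size -1024k | sort ) |
        split -l "$per_batch" - "$out/list-"

    for list in "$out"/list-*
    do
        [ -f "$list" ] || continue   # no small files were found
        # -C switches to the source directory; -T reads member names from the list
        tar -C "$src" -cf "$out/batch-${list##*/list-}.tar" -T "$list"
    done
}
```

Each resulting archive can then be uploaded with the aws s3 cp command shown earlier, including --metadata snowball-auto-extract=true. With files under 1 MB and at most 10,000 files per archive, each batch stays below 10 GB, well within the 100 GB limit.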

Optimize for large files

Whenever you have many large files to transfer, upload each file in smaller parts so that more threads can transfer parts of the same object in parallel.

When using s3 cp or s3 sync to transfer data to Snowball Edge, you can fine-tune the AWS CLI by modifying the configuration file located in ~/.aws/config.

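# Place the s3 settings under the profile you pass with --profile (snowprofile in the earlier example)
[profile snowprofile]
s3 =
    max_concurrent_requests = 30
    multipart_threshold = 32MB
    multipart_chunksize = 32MB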

The aws s3 transfer commands are multi-threaded, and you can optimize large file transfers by configuring the following:

max_concurrent_requests: The maximum number of concurrent requests allowed at any given time. Default value is set to 10. For optimal throughput, set this parameter as high as your connection can sustain. As a best practice, configure this value to be <= 40.
multipart_chunksize: This value sets the size of each part that the AWS CLI uploads in a multipart upload for an individual file. This setting allows you to break down a larger file into smaller parts for quicker upload speeds. Default value is set to 8 MB. Ensure the value you set balances the part file size and the number of parts. Note: A multipart upload requires that a single file is uploaded in <= 10,000 distinct parts.
multipart_threshold: The size threshold the CLI uses for multipart transfers of individual files. Default value is set to 8 MB.

The best values for these parameters vary based on your host resources, network, and file sizes. At some point, the AWS CLI reaches a performance limit. Specifically, if you raise the max_concurrent_requests setting and don’t see any performance increase between 10 and 20 threads, you have hit a limitation of the CLI. Other software that can run hundreds of threads can continue to increase performance, as it is reasonable to assume that added threads keep adding performance until another bottleneck presents itself, such as running out of CPU. Refer to the AWS CLI S3 configuration documentation for more information. Also, see the examples in the Maximizing Storage Throughput and Performance workshop for some testing ideas.
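The balance between part size and the 10,000-part limit noted above is simple arithmetic. The sketch below assumes that the CLI's MB suffix means 1024 × 1024 bytes; parts_needed and min_chunksize_mb are hypothetical helper names used only for illustration.

```shell
#!/bin/sh
# Assumption: the CLI's "MB" suffix means 1024 * 1024 bytes.
MIB=$(( 1024 * 1024 ))
MAX_PARTS=10000

# Number of parts for a file of <bytes> uploaded with <chunk size in MB>
parts_needed() {
    chunk=$(( $2 * MIB ))
    echo $(( ($1 + chunk - 1) / chunk ))   # round up
}

# Smallest whole-MB chunk size that keeps a file of <bytes> within MAX_PARTS
min_chunksize_mb() {
    limit=$(( MAX_PARTS * MIB ))
    echo $(( ($1 + limit - 1) / limit ))   # round up
}

ONE_TIB=$(( 1024 * 1024 * 1024 * 1024 ))
parts_needed "$ONE_TIB" 32      # prints 32768
min_chunksize_mb "$ONE_TIB"     # prints 105
```

With a 32MB chunk size, 10,000 parts cover files up to 320,000 MB (about 312.5 GB); under the same assumption, a 1 TB file would need a chunk size of at least 105MB.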

Parallelize data transfers

Performing multiple write operations at any given time improves your data transfers to Snowball Edge. Specifically, consider multiple concurrent writes from the same workstation or from multiple workstations connected to the Snowball Edge.

To run transfers in parallel, we recommend using GNU parallel. GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that runs once for each line of the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. A job can also be a command that reads from a pipe. GNU parallel can then split the input and pipe it into commands in parallel.

The following script can help you parallelize your data transfer by providing a list of files to copy to Snowball Edge:

#!/bin/sh
# Copy files in parallel to Snowball Edge

AWS_PROFILE="snow-profile"     # Name of profile in credentials file where key/secret is stored
ENDPOINT=http://192.0.0.0:8080 # Snowball Edge S3 endpoint
BUCKET=snowbucket              # S3 bucket
FILE_PATH=$1                   # First and only argument - List of files to be copied

# Print usage and stop if the file list is missing or not a regular file
if [ ! -f "$FILE_PATH" ]
then
    echo "Syntax: $0 <file-list>"
    exit 1
fi

echo "Will copy files from $FILE_PATH"

# Copy files to Snowball Edge using parallel, where -j10 means 10 concurrent copies
parallel -j10 "aws s3 cp {} s3://$BUCKET/{} --endpoint $ENDPOINT --profile $AWS_PROFILE" < "$FILE_PATH"

Partition large datasets

To split files into manageable sizes for data transfers to Snowball Edge, consider using fpart to sort file trees and pack them into partitions. fpart splits a list of directories and file trees into a certain number of partitions, trying to produce partitions with the same size and number of files. It can also produce partitions with a given number of files or a limited size. You can generate a partitions list, which third-party programs use for data transfers.

Use the following script to help generate a partitioned list of all subdirectories of your source dataset that you want to copy to Snowball Edge:

#!/bin/sh
# Script to create a partitioned list of all subdirectories
MOUNT_POINT="data"         # Mount point where the subdirectories are located
PART_FILE_NAME="part-list" # Individual partition file prefix
FILES=10000                # Number of files in a partition

for path in "$MOUNT_POINT"/*/
do
    [ -d "$path" ] || continue   # Skip if there are no subdirectories
    subdir=$(basename "$path")
    echo "$(date): Currently working on $MOUNT_POINT/$subdir"
    # -f limits the files per partition, -o sets the output file name prefix
    fpart -f "$FILES" -o "snowball-$subdir-$PART_FILE_NAME" "$MOUNT_POINT/$subdir"
done

Improve copy performance using additional infrastructure

Use multiple workstations for dividing and copying your dataset in parallel to either the same Snowball Edge or additional Snowball Edge devices. Ensure that the data source can support higher read performance and will not impact the production workloads that are running on the data source.
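One simple way to divide the work is to deal the partition lists produced by fpart out to the workstations in round-robin order. The sketch below only prints an assignment; assign_partitions and the worker-N labels are hypothetical names, and it assumes that every workstation can read the same source paths.

```shell
#!/bin/sh
# Print a round-robin assignment of partition list files to workstations.
# Usage: assign_partitions <number-of-workstations> <list-file>...
assign_partitions() {
    workers=$1
    shift
    i=0
    for list in "$@"
    do
        echo "worker-$(( i % workers )) $list"
        i=$(( i + 1 ))
    done
}
```

For example, assign_partitions 3 snowball-*-part-list.* spreads the lists across three workstations. Because each partition holds about the same number of files, the file count per workstation stays balanced, although the byte count may not.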

Log copy operations and data verification

When you copy files to Snowball Edge using the AWS CLI, the Amazon S3 Adapter for Snowball Edge generates a number of checksums and uses these checksums to automatically validate the integrity of your data throughout the transfers. When checksums don’t match, the associated data is not imported into Amazon S3.

As a best practice, you should log the files you copy. This enables you to perform verifications on files transferred and to log any errors that may occur when copying data from your source to the Snowball Edge. On Linux, you can redirect standard output and errors to a file using the 2>&1 operation when you use the CLI to perform data transfers.
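As an illustration of this logging practice, the following sketch wraps any command so that its output and errors are appended to one log file together with timestamps and the exit status. run_logged is a hypothetical helper name, not part of the AWS CLI.

```shell
#!/bin/sh
# Run a command, appending its standard output and errors to a log file.
# Usage: run_logged <log-file> <command> [args...]
run_logged() {
    log=$1
    shift
    echo "$(date): starting: $*" >> "$log"
    "$@" >> "$log" 2>&1      # 2>&1 sends errors to the same file as output
    status=$?
    echo "$(date): finished with exit status $status" >> "$log"
    return "$status"
}
```

For example, run_logged copy.log aws s3 cp /Logs/April s3://mybucket/April --recursive --endpoint http://192.0.0.0:8080 --profile snowprofile records every copied file and any error in copy.log.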

Whenever the Snow service imports data from Snowball Edge to Amazon S3, it generates a job report at the end of the process to provide a summary of the data transfer. Additionally, the Snow service provides a success log and a failure log to give you further insight into the status of the transferred objects in S3. Refer to the documentation on getting your job completion report and logs for additional information.

If needed, you can use checksum utilities provided with various operating systems to generate checksums of your dataset for compliance or auditing needs:

Linux:

$ md5sum myfile.json

047646f5571c9c1919fe31744f9d0abc  myfile.json

macOS:

prompt> openssl md5 myfile.json
MD5(myfile.json)= 047646f5571c9c1919fe31744f9d0abc

Windows:

C:\> certutil -hashfile myfile.json
SHA1 hash of myfile.json:
1a175d4f6d525e80334dedf6ad6b35b8ca8b742
CertUtil: -hashfile command completed successfully.

Amazon S3, using s3cmd (an S3 client and backup tool for Linux and macOS):

prompt> s3cmd ls --list-md5 s3://snow-bucket/
2018-09-28 16:22          286  fb59697dd6b1896bef2d5cb7e5e99546  s3://snow-bucket/myfile.json
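For a whole dataset, the same utilities can produce a manifest that you keep for auditing. The sketch below assumes a Linux host with GNU coreutils (md5sum with the -c and --quiet options). make_manifest and verify_manifest are hypothetical helper names, and the manifest path should be absolute and located outside the source directory.

```shell
#!/bin/sh
# Write an MD5 manifest covering every file under <source-dir>.
# Usage: make_manifest <source-dir> <absolute-manifest-path>
make_manifest() {
    ( cd "$1" && find . -type f -exec md5sum {} + | sort -k 2 ) > "$2"
}

# Re-check the files against the manifest. Prints only mismatches and
# returns a non-zero status if any file differs or is missing.
# Usage: verify_manifest <source-dir> <absolute-manifest-path>
verify_manifest() {
    ( cd "$1" && md5sum -c --quiet "$2" )
}
```

Generating the manifest before the copy and verifying it afterwards is one way to confirm that the source did not change during the transfer.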

Summary

In this blog, we discussed data migration considerations, various tools, and techniques to accelerate your bulk data migrations to AWS using AWS Snowball Edge in a reliable and efficient manner. Leveraging the techniques mentioned can help you improve and optimize your data transfer speeds, reducing the time needed for data transfers. They also enable you to tackle common challenges with large-scale data transfers including high network costs, long transfer times, and security concerns, thus getting your data into AWS faster.

Thank you for reading about best practices for accelerating data migrations using AWS Snowball Edge. Please leave a comment in the comments section if you have any questions.