AWS Storage Blog

Migrate HDFS files to an Amazon S3 data lake with AWS Snowball Edge

The need to store newly connected data grows as the sources of data increase. Enterprise customers use Hadoop Distributed File System (HDFS) as their data lake storage repository for on-premises Hadoop applications. Customers are migrating their data lakes to AWS for a more secure, scalable, agile, and cost-effective solution.

For HDFS migrations where high-speed transfer rates to AWS are not feasible, AWS offers the AWS Snowball Edge service. The best practice for HDFS migrations with AWS Snowball Edge is to use an intermediary staging machine for the file transfer. This blog post details how to use that intermediary staging machine during your migration.

AWS Snowball Edge is a physical data migration and edge computing device (figure 1) that gets shipped to you. It is used to transfer large amounts of data into Amazon Simple Storage Service (Amazon S3).

Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. For bulk ingestion, Amazon S3 is the entry point for the data lake. Designed for 99.999999999% (11 9s) of durability, Amazon S3 hosts more than 10,000 data lakes.

As a foundation for data lake storage in AWS, Amazon S3 enables the decoupling of cost-effective storage from compute and data processing. Amazon EMR is a managed platform for running big data frameworks, like Hadoop and Spark. In addition to HDFS, Amazon EMR clusters provide the EMR File System (EMRFS), an implementation of the Hadoop file system used for reading and writing to Amazon S3. On top of the native AWS services, Data Analytics Partners within the AWS Partner Network (APN) provide integrated tools and standardized frameworks for processing, scheduling, security, and analytics.

In this blog post, we review the steps to optimally transfer your data from on-premises HDFS files to Amazon S3 using AWS Snowball Edge. Perform these steps initially in the “POC, tooling, and optimization” phase laid out in Data Migration Best Practices with Snowball Edge for one Snowball Edge, and then apply them in the execution phase across multiple Snowball Edge devices.

We provide criteria on how to decide when to use AWS Snowball Edge.

Prerequisites

You must have the following before proceeding through all the components of this post.

  • AWS account
  • IAM User
  • AWS Snowball Edge device onsite and connected to your local network
  • A machine (VM or bare-metal host) with 10 Gbps network uplinks

Deciding on AWS Snowball Edge

AWS provides services to ingest and transfer data into Amazon S3. Some are designed for migration into AWS using available networks and others are used for offline migrations. AWS Snowball Edge is generally used for offline migrations.

To learn more about other mechanisms for online ingestion, refer to the other AWS data transfer services.

If you’re facing network limitations with your on-premises copy to Amazon S3, AWS Snowball Edge provides the ability to securely copy and bulk transport files, and to perform edge computing. AWS Snowball Edge is typically most cost effective for data transfers of over 10 terabytes (TB). Figure 2 below estimates transfer times for descending network transfer rates. When comparing these to your Snowball Edge timelines, expect approximately 15 days from job order to data being available in Amazon S3 for each Snowball Edge.

Figure 2: Estimated transfer times for 82 TB at various network rates

Rate (Mbps) | 82 TB transfer time (days)
6400        | 1.22
3600        | 2.11
3200        | 2.37
2400        | 3.16
2216        | 3.42
1600        | 4.75
800         | 9.49
480         | 15.53
240         | 31.06
80          | 85.42
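As a rough check of these estimates: 82 TB is approximately 656,000,000 megabits (82 × 10^6 MB × 8 bits per byte), so at a sustained 800 Mbps the transfer takes about 656,000,000 / 800 ≈ 820,000 seconds, or roughly 9.5 days, in line with the table. Real-world throughput is typically lower once protocol overhead and source read rates are factored in.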

Once you have decided to deploy AWS Snowball Edge, use the Getting Started with AWS Snowball Edge: Your First Job documentation to order your first job. The guide steps through creating the job and details how to have the Snowball Edge device shipped to your shipping address.

Connect AWS Snowball Edge to your local network by using the Connect to Your Local Network document. The blog post Data Migration Best Practices with Snowball Edge provides guidance on network requirements. Ideally, this local network connection provides 10 Gbps or greater throughput and low latency. The device does not need to be connected to the internet.

Migration steps

The below steps walk you through how to use a staging machine with AWS Snowball Edge to migrate HDFS files to Amazon S3:

  1. Prepare staging machine
  2. Test copy performance
  3. Copy files to the device
  4. Validate file transfer

Step 1: Prepare staging machine

The following section details how to set up the staging machine. As a best practice, Hadoop file transfers to AWS Snowball Edge use an intermediary staging machine with HDFS mounted to the local file system. Mounting HDFS allows you to interact with it as a local file system. The staging machine is then used for high throughput parallel reads from HDFS and writes to AWS Snowball Edge. Below you can see the workflow in figure 3.

Figure 3: Workflow for preparing a staging machine

  • Note: If multiple staging machines are used, each machine must mount HDFS as a local file system.

The current guidance for the staging machine is a minimum of four cores, 32 GB of memory, and fast disk access to optimize your throughput. The staging machine can be a virtual machine (VM) or bare-metal host with 10 Gbps network uplinks; it can even be an instance launched from an AMI on the Snowball Edge itself. Bare-metal hosts have been observed to deliver better performance for large copy operations.

Staging Machine Setup

  1. Install the AWS CLI.
  2. Configure a mountable HDFS for your Hadoop cluster. Install and configure “Filesystem in Userspace” (FUSE) as an interface into HDFS.
  3. Mount the FUSE interface on the staging machine and test your access to HDFS as a local file system. A mounted interface allows you to interact with HDFS as a local file system.
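The exact commands depend on your Hadoop distribution; the following is a minimal sketch assuming a Linux staging machine and a CDH-style hadoop-hdfs-fuse package. The package name, NameNode host, port, mount point, and profile name are placeholders to adjust for your environment.

# 1. Install the AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip && sudo ./aws/install

# 2. Configure a named profile with the access key and secret key retrieved from the Snowball Edge client
aws configure --profile <profile-name>

# 3. Install the HDFS FUSE connector and mount HDFS as a local file system
sudo yum install -y hadoop-hdfs-fuse
sudo mkdir -p /mnt/hdfs
sudo hadoop-fuse-dfs dfs://<namenode-host>:8020 /mnt/hdfs

# 4. Verify local file system access to HDFS
ls /mnt/hdfs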

Step 2: Test copy performance

The below section explores testing and improving copy performance. In terms of throughput, a single copy or sync operation is not sufficient to transfer large datasets. To increase the transfer rate, explore data partitioning and parallel copy options.

For large data transfers, we recommend partitioning your data into a number of distinct parts. Segmenting your file transfer allows the partitions to be transferred either in parallel or one at a time. If a copy failure occurs, troubleshooting the identified segment is quicker: you only review the failed segment instead of the dataset in its entirety. Parallel transfers enable multiple writes to the AWS Snowball Edge, which increases performance.

  • Note: Customers use GNU parallel with multiple workers for parallel transfers. The number of parallel operations depends on the network bandwidth, data source performance, and latency.
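As an illustration only, a GNU parallel invocation along these lines can drive multiple copy workers against the mounted HDFS path. The job count, dataset path, bucket name, prefix, and profile are hypothetical placeholders, and note that this particular form flattens file paths to their base names under the target prefix:

# Hypothetical sketch: copy every file under a mounted HDFS path with 8 parallel workers
find /mnt/hdfs/<dataset> -type f | parallel -j 8 aws s3 cp {} s3://bucket_name/prefix/{/} --endpoint http://<SNOWBALL_EDGE IP>:8080 --profile <profile-name>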

Once the staging machine is set up and the segments are planned out, test the copy operation throughput. A successful HDFS transfer to Snowball Edge is able to achieve 1.6-2.4 Gbps throughput. Performance varies based on hardware, network, and file size.

A simple copy command is listed below:

aws s3 cp /path/to/<file> s3://bucket/prefix/<file> --endpoint http://<SNOWBALL_EDGE IP>:8080 --profile <profile-name>

  1. Open a single terminal and run the aws s3 cp command.
  2. Record the transfer rate (for example, 400 Mbps).
  3. Parallelize the copy operation by adding terminals, each running its own aws s3 cp command, and assess performance after each test. You eventually reach an upper bound where adding more terminals negatively impacts your throughput.
  4. Add the per-terminal rates together (for example, terminal 1 at 400 Mbps plus terminal 2 at 400 Mbps gives 800 Mbps in total). To improve performance further, target network latencies, staging machine enhancements, file size optimizations, and HDFS read rates through the FUSE mount. Also, check out Best Practices for Data Migration.
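One simple way to record a per-terminal rate is to time a single test copy and divide the file size (in megabits) by the elapsed seconds. A sketch, using a hypothetical test file:

# Time one test copy to derive a per-terminal transfer rate
time aws s3 cp /mnt/hdfs/<test-file> s3://bucket_name/test/<test-file> --endpoint http://<SNOWBALL_EDGE IP>:8080 --profile <profile-name>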

Step 3: Copy files to the device

Once you complete your performance testing and optimization, prepare commands to copy your files in parallel to the Snowball Edge device. The commands should be run on the staging machine. For example, the command below performs a recursive copy on all files within the specified path and writes them to the AWS Snowball Edge endpoint.

A recursive copy command is listed below:

aws s3 cp /path/to/mounted/hdfs s3://bucket_name --recursive --endpoint http://<SNOWBALL_EDGE IP>:8080 --profile <profile-name>

In addition to running transfers in parallel, batching smaller files (less than 5 MB) together helps increase throughput. By default, Snowball Edge encrypts all data with 256-bit encryption during each copy operation. The encryption process has per-object overhead that can cause latency when transferring small files, so batching them helps boost transfer performance. We recommend batching small files together into archives of up to 10,000 files each. During the transfer operation, enable the auto-extraction of archives imported into Amazon S3.

A command to batch and copy files is listed below:

tar -zcf - /path/to/batch | aws s3 cp - s3://bucket_name/batch_name.tar.gz --metadata snowball-auto-extract=true --endpoint http://<SNOWBALL_EDGE IP>:8080 --profile <profile-name>

  • Note: Batches must be in one of the supported archive formats. To get additional information on batching, check out Batching Small Files.
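To extend the single-archive command above to a large directory of small files, one approach is to split the file listing into chunks of at most 10,000 entries and create one auto-extracting archive per chunk. The following is a sketch only; the source path, bucket name, and profile are placeholders:

# Hypothetical sketch: create archives of at most 10,000 files each and copy them to the device
cd /mnt/hdfs/<small-files-path>
find . -type f | split -l 10000 - filelist.
for list in filelist.*; do
  tar -czf - -T "$list" | aws s3 cp - s3://bucket_name/batches/"$list".tar.gz --metadata snowball-auto-extract=true --endpoint http://<SNOWBALL_EDGE IP>:8080 --profile <profile-name>
done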

Step 4: Validate file transfer

After completing your file transfer to the AWS Snowball Edge, and before shipping the device back to AWS, there remains one important step: validating that all of your files can be imported to Amazon S3 successfully.

Use the aws s3 ls command to list all of the objects on the AWS Snowball Edge device. Once you have an inventory of the objects you copied to the device, you can easily compare it against the files from your source location and identify any files that were not transferred.
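For example, under the assumption that your object keys mirror the mounted HDFS paths and contain no spaces, a comparison along these lines can flag missing files (the file names and paths are illustrative):

# List object keys on the device, list source files, and compare the two inventories
aws s3 ls s3://bucket_name --recursive --endpoint http://<SNOWBALL_EDGE IP>:8080 --profile <profile-name> | awk '{print $NF}' | sort > device_inventory.txt
(cd /mnt/hdfs && find . -type f | sed 's|^\./||' | sort) > source_inventory.txt
diff device_inventory.txt source_inventory.txt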

If a file fails validation, it is not written to the device. Check out some of the common validation errors.

After validation, disconnect the device and ship it back to AWS. Once AWS receives the AWS Snowball Edge, the data is imported into Amazon S3. When the import completes, you can navigate to a Job Completion Report in the AWS Management Console. The report provides an overview of the job.

If additional validation of the import job is needed, you can enable Amazon S3 inventory or run aws s3 sync against the S3 bucket and prefixes used for the import. S3 inventory generates a CSV file that you can compare against the source files you migrated from on premises. Running aws s3 sync recursively copies new and updated files from the source directory to the destination and syncs directories with S3 prefixes; a dry-run example is sketched after the note below.

  • Note: To use the sync command as described for validation, you must be able to connect to the internet from your client.
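For instance, once the import has completed and your client has internet connectivity, a dry run of the sync command reports any source files missing from the bucket without copying anything (the paths, bucket name, and profile are placeholders):

# Dry run: report source files not yet present in the destination bucket
aws s3 sync /mnt/hdfs s3://bucket_name --dryrun --profile <profile-name>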

Conclusion

As your on-premises data and Hadoop environment grow, AWS Snowball Edge is available to accelerate your journey to Amazon S3. For a Hadoop migration where network bandwidth is limited and online transfer is not feasible, AWS offers the AWS Snowball Edge service. Amazon S3 provides a more secure, scalable, and cost-effective storage solution for your historical and growing data.

Now that you have performed the “POC, tooling, and optimization” phase of the HDFS migration for one Snowball Edge, as laid out in Data Migration Best Practices with Snowball Edge, you are ready to apply your learnings toward the execution phase with multiple Snowball Edge devices.

Learn more: