AWS Big Data Blog

Moving Big Data into the Cloud with Tsunami UDP

by Matt Yanchyshyn

Matt Yanchyshyn is a Principal Solutions Architect with Amazon Web Services

AWS Solutions Architect Leo Zhadanovsky also contributed to this post.

Introduction

One of the biggest challenges facing companies that want to leverage the scale and elasticity of AWS for analytics is how to move their data into the cloud. Datasets of multiple petabytes are increasingly common, and moving data of that magnitude over the network can take considerable time. One way to speed things up is to use accelerated file transfer protocols, which combine UDP for data transfer with TCP for the control connection.

Unlike purely TCP-based protocols such as SCP, FTP or HTTP, these hybrid UDP/TCP protocols can achieve much greater throughput, using more of your available bandwidth, while being less susceptible to network latency. This makes them ideal for transfers over long distances, such as between AWS regions or for transmitting large files to and from the cloud. Under ideal circumstances, accelerated file transfer protocols that implement the hybrid UDP/TCP model can achieve transfer rates that are dozens if not hundreds of times greater than traditional TCP-based protocols such as FTP.

This blog post will show you how to use Tsunami UDP, a file transfer solution that implements a hybrid UDP/TCP accelerated file transfer protocol, to move data from Amazon EC2 to Amazon S3.  (Other powerful accelerated file transfer and workflow solutions include Aspera, ExpeDat, File Catalyst, Signiant, and Attunity. Many of these products are available on the AWS Marketplace.)

AWS Public Data Sets

AWS hosts a variety of public data sets at no charge to the community. For this blog post, we will transfer the Wikipedia Traffic Statistics V2, a 650GB collection of 16 months’ worth of hourly pageview statistics for all articles in Wikipedia. This public dataset is stored as an EBS snapshot that you can mount onto an Amazon EC2 instance. For more information about this process, see the EC2 documentation.
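
If you want to work from the EBS snapshot itself rather than the S3 copy we describe below, the general pattern is to create a volume from the snapshot in your instance's Availability Zone, attach it, and mount it. This is only a sketch: the snapshot ID, volume ID, instance ID, and device names below are hypothetical placeholders, not values from this post.

# Create a volume from the public snapshot in the same Availability Zone as your instance
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --availability-zone us-east-1a

# Attach the new volume (use the volume ID returned by the previous command), then mount it
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/sdf
sudo mkdir -p /mnt/wikistats
sudo mount /dev/xvdf /mnt/wikistats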

We will move this 650GB (compressed) dataset from an Amazon EC2 instance located in the AWS Tokyo Region (ap-northeast-1) to an Amazon S3 bucket in the AWS N. Virginia Region (us-east-1). Once in Amazon S3, the data is available for your big data analytics projects using AWS services such as Amazon Elastic MapReduce (EMR) and Amazon Redshift, both of which can quickly import data stored in Amazon S3 and analyze it at massive scale. In this demonstration we'll move a large dataset from an Amazon EC2 instance in one region to another, but these examples are also applicable if datasets must be moved from on-premises datacenters to AWS or vice versa.

Tsunami UDP

Tsunami UDP is a popular open source file transfer protocol and associated command line interface that uses TCP for control and UDP for data transfer. It is designed to increase network efficiency by replacing TCP’s packet acknowledgement-based congestion control mechanisms with an alternative model that leverages UDP and is focused on the efficiency of data transfer on lossy or variable-latency networks. This mitigates the negative effects that high network latency has on throughput when using purely TCP-based protocols.

Tsunami UDP appeals to many AWS customers for several reasons: it’s fast, it’s free, and it’s easy to use. It was originally released in 2002 by Mark Meiss and his lab colleagues at Indiana University. The version widely used today is a fork of that original code, released in 2009 and currently hosted on SourceForge.

Setting up the AWS Public Data Set on an Amazon EC2 instance

Before we test Tsunami UDP, we need to download a dataset for our tests. We’ve placed a copy of the data in an Amazon S3 bucket in ap-northeast-1 for convenience.

Set up the Tsunami UDP server

  1. Launch an Amazon Linux instance in ap-northeast-1 (Tokyo). Choose one with 10Gbit networking and enough ephemeral storage to hold the dataset. The i2.8xlarge Amazon EC2 instance type is an excellent choice, while the cc2.8xlarge is a cost-effective alternative. For more information about available Amazon EC2 instance types, visit the Amazon EC2 Instance Types page. For convenience we have created a CloudFormation template that launches an Amazon EC2 instance, opens ports TCP 22 and TCP/UDP 46224 for SSH and Tsunami UDP access, sets up a local ephemeral volume on the EC2 instance, and installs the Tsunami UDP application from source. The AWS documentation explains how to launch a CloudFormation stack; a command-line sketch of the same launch appears after this list.
  2. Log in to your newly created instance with SSH. For example:
ssh -i mykey.pem ec2-user@ec2-12-234-567-890.ap-northeast-1.compute.amazonaws.com
  3. Set up the AWS CLI using IAM credentials:
aws configure
  4. Copy the Wikipedia statistics onto the ephemeral device:
aws s3 cp --region ap-northeast-1 --recursive s3://accel-file-tx-demo-tokyo/ /mnt/bigephemeral

Downloading these files takes a long time. If you don’t intend to manipulate the dataset later using Hadoop and only want to test throughput or the functionality of Tsunami UDP, you can quickly create a temporary file by issuing the following command on your Amazon Linux instance, replacing 650G with the file size in gigabytes that you’d like to test with:

fallocate -l 650G bigfile.img

Tsunami UDP transfers take a little while to ramp up to their maximum transfer rate, and each file incurs coordination overhead, so there is a performance penalty when many small files are transferred. Consider transferring a smaller number of large files rather than a large number of small files. On an i2.2xlarge instance, for example, Tsunami UDP will transfer a single 650GB file at a sustained rate of ~650Mbps. By comparison, transferring the individual ~50MB pagecount* files from the dataset results in an average transfer rate of ~250Mbps.

To maximize the data transfer rate for the Wikipedia dataset, you can create a single large tar file using:

tar cvf /mnt/bigephemeral/wikistats.tar /mnt/bigephemeral
  5. Once your files are ready to transfer, start a Tsunami UDP server listening on TCP/UDP port 46224 and serving all files on the ephemeral RAID0 volume array:
tsunamid --port 46224 /mnt/bigephemeral/*
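
If you prefer the command line to the console, launching the CloudFormation stack from step 1 looks roughly like the following. This is only a sketch: the stack name, local template file name, and KeyName parameter are hypothetical placeholders for whatever the template you downloaded actually expects.

# Launch the Tsunami UDP server stack in the Tokyo Region from a local copy of the template
aws cloudformation create-stack \
  --region ap-northeast-1 \
  --stack-name tsunami-udp-server \
  --template-body file://tsunami-udp-server.template \
  --parameters ParameterKey=KeyName,ParameterValue=mykey

# Check the stack's status (and any outputs, such as the instance's public DNS name)
aws cloudformation describe-stacks --region ap-northeast-1 --stack-name tsunami-udp-server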

Set up the Tsunami UDP client

  1. Launch an Amazon Linux instance in us-east-1 (N. Virginia).  For testing purposes this instance should be the same type as the one you launched in ap-northeast-1. You can re-use the same CloudFormation template as above, this time in us-east-1.
  2. Log in to your newly created instance with SSH.

Transferring the data and measuring performance

  1. Run the Tsunami UDP client, replacing [server] with the public IP address of the Tsunami UDP server Amazon EC2 instance that you launched in the AWS Tokyo Region:
tsunami connect [server] get *
  2. If you want to limit the transfer rate to avoid saturating your network link, use the “set rate” option. For example, this will limit transfers to 100 Mbps:
tsunami set rate 100M connect [server] get *
  3. Use the Amazon CloudWatch NetworkOut metric on the Tsunami UDP server and the NetworkIn metric on the Tsunami UDP client to measure transfer performance (see the command-line sketch at the end of this section).

CloudWatch graphs of NetworkOut on the server and NetworkIn on the client show the transfer performance.

For this long-distance file transfer between Tokyo and Virginia, we sustained 651 Mbps (81.4 MB/s) for the duration of the transfer on an i2.2xlarge instance. Pretty impressive when you consider the distance!

  4. To compare this to other TCP-based protocols, you can try SCP (secure copy). For example:
scp -i yourkey.pem ec2-user@[server]:/mnt/bigephemeral/bigfile.img /mnt/bigephemeral/

Using the same i2.2xlarge instances for both server and client, SCP achieved an average transfer rate of only ~9 MB/s when transferring the single, large 650GB file, roughly nine times slower than Tsunami UDP. Even allowing for the overhead of SCP’s encrypted SSH connection, that is a significant performance benefit.
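
As mentioned in step 3, you can also pull these throughput numbers from the command line rather than the CloudWatch console. A minimal sketch, assuming basic (five-minute) monitoring and a hypothetical instance ID and time window:

# Bytes sent by the Tsunami UDP server, reported per five-minute period
# (divide each datapoint by the 300-second period to estimate throughput in bytes/sec)
aws cloudwatch get-metric-statistics \
  --region ap-northeast-1 \
  --namespace AWS/EC2 \
  --metric-name NetworkOut \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2014-07-01T00:00:00Z --end-time 2014-07-01T01:00:00Z \
  --period 300 --statistics Sum

# Run the same command with --metric-name NetworkIn, --region us-east-1, and the
# client's instance ID to see the receiving side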

Moving the dataset into Amazon S3

Once the data is transferred to your EC2 instance in us-east-1, you can move it into Amazon S3. From there you can import it into Amazon Redshift using the parallel COPY command (a sketch follows the steps below), analyze it directly using Amazon EMR, or archive it for later use:

  1. Create a new Amazon S3 bucket in the AWS N. Virginia Region (us-east-1).
  2. Copy the data from your Amazon EC2 instance in us-east-1 into the newly created bucket:
aws s3 cp --recursive /mnt/bigephemeral s3://<your-new-bucket>/
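
If your next stop is Amazon Redshift rather than Amazon EMR, the parallel COPY command mentioned above can load the gzip-compressed, space-delimited pagecount files straight from the bucket. A rough sketch, run through psql against a hypothetical cluster endpoint, database, table, and set of credentials (the table must already exist with matching columns):

# Load the pagecount files from S3 into a hypothetical wikistats table
psql -h mycluster.abc123xyz789.us-east-1.redshift.amazonaws.com -p 5439 -U admin -d mydb -c "
COPY wikistats
FROM 's3://<your-new-bucket>/'
CREDENTIALS 'aws_access_key_id=<your-key-id>;aws_secret_access_key=<your-secret-key>'
GZIP DELIMITER ' ';"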

Note: The unified AWS CLI automatically uses multipart uploads when copying large files to Amazon S3, optimizing throughput.

Note: If you tarballed your Wikipedia Traffic Statistics V2 dataset before transferring it with Tsunami UDP, untar it before moving on to the next step of analyzing the data with Amazon EMR.

Analyzing the dataset using Amazon EMR

Once the dataset is in Amazon S3, you can use Amazon EMR to analyze or transform the data. For example, you can use Apache Spark on Amazon EMR to query the Wikipedia pageview statistics used in this blog post.
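
As a rough sketch of that workflow, and assuming a recent AWS CLI and an Amazon EMR release that bundles Spark (the cluster name, release label, instance types, counts, and key name below are arbitrary placeholders):

# Launch a small EMR cluster with Spark in the same region as the bucket
aws emr create-cluster \
  --region us-east-1 \
  --name wikistats-spark \
  --release-label emr-6.10.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=mykey

# Then SSH to the master node and point spark-shell (or a spark-submit job of your own)
# at s3://<your-new-bucket>/ to query the pageview data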

Conclusion

Tsunami UDP provides a free, easy way to quickly move large amounts of data in and out of AWS or between regions. When combined with the AWS CLI’s multipart upload to Amazon S3, it’s a convenient and free way to move large datasets into Amazon S3’s durable, low-cost object storage, where they can be analyzed using AWS big data services such as Amazon EMR and Amazon Redshift.

Tsunami UDP does, however, have some limitations. It does not support native encryption, it is single-threaded, and it can be difficult to automate because there is no SDK or plugin ecosystem. Running multiple clients and servers to work around the single-thread limitation can cause retries during transfers, often lowering overall throughput instead of improving it. Tsunami UDP also does not offer native Amazon S3 integration, so transfers must first be terminated on an Amazon EC2 instance and then re-transmitted to Amazon S3 using a tool such as the AWS CLI.

Lastly, because the most recent contribution to the Tsunami UDP codebase was in 2009, there is no commercial support and there are no active open source forums devoted to the product. Two other accelerated file transfer solutions address these shortfalls: ExpeDat S3 Gateway and Signiant SkyDrop. Both of these products support encryption, provide native S3 integration, and include additional features that make them appealing to a wider commercial audience.

If you have questions about this article, please click “Comments” below and let us know!

————————————————-

Related:

Building and Maintaining an Amazon S3 Metadata Index without Servers