AWS Big Data Blog
Using AWS for Multi-instance, Multi-part Uploads
James Saull is a Principal Solutions Architect with AWS
There are many advantages to using multi-part, multi-instance uploads for large files. First, throughput is improved because you can upload parts in parallel. Amazon Simple Storage Service (Amazon S3) can store objects up to 5 TB, yet a single machine with a 1 Gbps interface would take about 12 hours to upload a file that size, and it would also require 5 TB of local or attached storage to hold the complete file before uploading. Second, you can recover quickly from network issues: smaller part sizes limit the impact of restarting an upload that failed due to a network error. Third, you can pause and resume object uploads, and you can begin an upload before you know the final object size (you can upload an object as you are creating it). Lastly, it is a recommended best practice to upload large objects into Amazon S3 by dividing them into many smaller parts.
This article shows you how to use many instances for a single multi-part upload to Amazon S3. We’ll organize all of the participating instances using Amazon DynamoDB and use some basic .NET code to work with 1,000 log files spread across ten Amazon Elastic Compute Cloud (Amazon EC2) instances, resulting in a single 25.8 GB log file uploaded at a rate in excess of 1 gigabyte per second.
The Challenge of Multi-Part Uploads
Using multiple instances poses several challenges. For example, many processes may simultaneously want to begin a multi-part upload, but they must coordinate because only one process can initiate the multipart upload and share its upload ID with the cluster. In addition, one process will be the last to finish uploading a part, completing the set; that process must recognize this and complete the multipart upload. Moreover, each process must determine which part (or parts) it has and where they sit within the sequence. If you’re composing a 4K 3D video master, sequence matters if the film is to make any sense. If you’re composing a single log file from 1,000 servers, sequence may not be important because de-multiplexing can be performed later during data analysis.
Executing a Multi-Part Upload
For cluster coordination, Amazon DynamoDB is often used as a shared state provider. It has a simple programming model and is a managed database service, which eliminates setup and operational overhead. In addition, it is highly available and durable, running across multiple Availability Zones. Amazon DynamoDB can scale from small clusters that need only a few reads/writes per second for coordination up to very large, busy clusters requiring hundreds or thousands of reads/writes per second.
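To give a concrete flavor of that programming model, the snippet below sketches one possible coordination table using the AWS SDK for .NET, keyed on the target S3 object key. The table name, attribute names, and throughput figures are assumptions for illustration only; they are not taken from the code used in this article.

using System.Collections.Generic;
using Amazon;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;

class CoordinationTableSetup
{
    static void Main()
    {
        var dynamoDb = new AmazonDynamoDBClient(RegionEndpoint.EUWest1);

        // Hypothetical coordination table: one item per target S3 object,
        // keyed on the object key. All names here are illustrative.
        dynamoDb.CreateTable(new CreateTableRequest
        {
            TableName = "MultipartUploadCoordination",
            AttributeDefinitions = new List<AttributeDefinition>
            {
                new AttributeDefinition { AttributeName = "ObjectKey", AttributeType = ScalarAttributeType.S }
            },
            KeySchema = new List<KeySchemaElement>
            {
                new KeySchemaElement { AttributeName = "ObjectKey", KeyType = KeyType.HASH }
            },
            // A few reads/writes per second is plenty for a ten-node cluster;
            // provision more for larger, busier clusters.
            ProvisionedThroughput = new ProvisionedThroughput
            {
                ReadCapacityUnits = 5,
                WriteCapacityUnits = 5
            }
        });
    }
}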
The following diagram illustrates the multipart upload process.
A primitive proof-of-concept implementation is written to behave as follows:
- A separate process emits files that represent parts of a much larger whole. We’ll assume the order in which the files are processed matters for this application. Therefore, we will use a convention to specify the file part such as a time range, a byte range, or a sequence number.
- The large file is split into multiple parts in a folder on local disk using the following naming convention: filename.abc_1_100, filename.abc_8_100, where the suffix gives the part number followed by the total number of parts. This tells the upload agent which parts of the sequence it is uploading. The parts can be scattered across multiple machines.
- The upload agent enumerates the files in the folder, sequentially or in parallel.
- The agent checks Amazon DynamoDB to see if this is the first process attempting to upload this file, using Amazon DynamoDB’s Conditional Writes or Optimistic Locking Using Version Numbers to control concurrency. After all, many threads on many machines may simultaneously try to initiate the multi-part upload.
- If it is the first, it creates a new entry in Amazon DynamoDB and initiates the Amazon S3 multipart upload, recording the S3-provided upload ID in Amazon DynamoDB for all subsequent processes to reuse when they upload a part (a sketch of this claim-and-initiate step appears after this list).
- It uploads all the parts while keeping a record of each in Amazon DynamoDB.
- Each time a part completes uploading, it checks Amazon DynamoDB to see whether all parts have now been uploaded. Other processes on other machines may still be in progress; at some point, this check will be done by the host holding the final part (see the second sketch after this list).
- When all parts are uploaded, the last process closes the multipart upload and removes the entries from Amazon DynamoDB.
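To make the claim-and-initiate step concrete, here is a minimal sketch using the AWS SDK for .NET. It is not the article’s actual MultiMachineMultiPartUpload.exe code: the table and attribute names are assumptions, and for brevity it initiates a multipart upload first and then uses a conditional write to decide whose upload ID the cluster keeps, aborting the spare upload if another process won the race.

using System.Collections.Generic;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;
using Amazon.S3;
using Amazon.S3.Model;

static class UploadCoordinator
{
    // Illustrative names; not the article's actual agent code.
    // Returns the shared S3 upload ID: either the one this process just
    // initiated (if it won the conditional write) or the one recorded by
    // whichever process got there first.
    public static string InitiateOrJoinUpload(
        IAmazonS3 s3, IAmazonDynamoDB dynamoDb,
        string bucket, string key, string tableName)
    {
        var initiate = s3.InitiateMultipartUpload(
            new InitiateMultipartUploadRequest { BucketName = bucket, Key = key });

        try
        {
            // Conditional write: only succeeds if no other process has
            // already registered an upload ID for this object key.
            dynamoDb.PutItem(new PutItemRequest
            {
                TableName = tableName,
                Item = new Dictionary<string, AttributeValue>
                {
                    { "ObjectKey", new AttributeValue { S = key } },
                    { "UploadId", new AttributeValue { S = initiate.UploadId } },
                    { "PartsUploaded", new AttributeValue { N = "0" } }
                },
                ConditionExpression = "attribute_not_exists(ObjectKey)"
            });
            return initiate.UploadId;
        }
        catch (ConditionalCheckFailedException)
        {
            // Another process initiated first: abandon our upload ID and
            // read the shared one from the coordination table instead.
            s3.AbortMultipartUpload(new AbortMultipartUploadRequest
            {
                BucketName = bucket, Key = key, UploadId = initiate.UploadId
            });

            var item = dynamoDb.GetItem(new GetItemRequest
            {
                TableName = tableName,
                Key = new Dictionary<string, AttributeValue>
                {
                    { "ObjectKey", new AttributeValue { S = key } }
                },
                ConsistentRead = true
            });
            return item.Item["UploadId"].S;
        }
    }
}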
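A companion sketch, under the same assumptions, shows one way to handle the completion check: after each part upload, the agent atomically increments a shared counter, and whichever process raises the counter to the total part count closes the multipart upload, collecting every machine’s part ETags from Amazon S3 with ListParts before cleaning up the coordination item.

using System.Collections.Generic;
using System.Linq;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.Model;
using Amazon.S3;
using Amazon.S3.Model;

static class PartTracker
{
    // Illustrative names; not the article's actual agent code.
    // Upload one local file as a numbered part, record it, and if it was
    // the last outstanding part cluster-wide, complete the upload.
    public static void UploadAndRecordPart(
        IAmazonS3 s3, IAmazonDynamoDB dynamoDb, string tableName,
        string bucket, string key, string uploadId,
        string filePath, int partNumber, int totalParts)
    {
        s3.UploadPart(new UploadPartRequest
        {
            BucketName = bucket,
            Key = key,
            UploadId = uploadId,
            PartNumber = partNumber,
            FilePath = filePath
        });

        // Atomically count this part; ReturnValues gives us the new total
        // so we can tell whether ours was the final part in the set.
        var update = dynamoDb.UpdateItem(new UpdateItemRequest
        {
            TableName = tableName,
            Key = new Dictionary<string, AttributeValue>
            {
                { "ObjectKey", new AttributeValue { S = key } }
            },
            UpdateExpression = "ADD PartsUploaded :one",
            ExpressionAttributeValues = new Dictionary<string, AttributeValue>
            {
                { ":one", new AttributeValue { N = "1" } }
            },
            ReturnValues = ReturnValue.UPDATED_NEW
        });

        int uploadedSoFar = int.Parse(update.Attributes["PartsUploaded"].N);
        if (uploadedSoFar < totalParts)
            return; // other machines are still uploading

        // This process uploaded the final part: gather every part's ETag
        // from S3, close the multipart upload, then clean up the table.
        var parts = s3.ListParts(new ListPartsRequest
        {
            BucketName = bucket, Key = key, UploadId = uploadId
        });

        s3.CompleteMultipartUpload(new CompleteMultipartUploadRequest
        {
            BucketName = bucket,
            Key = key,
            UploadId = uploadId,
            PartETags = parts.Parts
                .OrderBy(p => p.PartNumber)
                .Select(p => new PartETag(p.PartNumber, p.ETag))
                .ToList()
        });

        dynamoDb.DeleteItem(new DeleteItemRequest
        {
            TableName = tableName,
            Key = new Dictionary<string, AttributeValue>
            {
                { "ObjectKey", new AttributeValue { S = key } }
            }
        });
    }
}

Pulling the ETags back with ListParts avoids having to store each part’s ETag in Amazon DynamoDB; with more than 1,000 parts the listing would need to be paginated.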
Testing with a Single Machine, Single Process, Multipart Upload
We’ll test using the AWS Tools for Windows PowerShell. With a single machine, there are two likely constraints on upload throughput: the network capacity of the Amazon EC2 instance and the local disk I/O of the storage holding the upload source. For example, let’s look at a Windows Server 2012 m1.xlarge instance in EU-WEST-1 uploading to a bucket in the same region using the AWS Tools for Windows PowerShell:
The upload runs at 119.4 MB/s (3,758,587,904 bytes in 30.0123006 seconds). This is roughly what we might expect from a 1 Gbps network interface, especially since the elapsed time is not purely data transfer: it also includes instantiating the cmdlet and establishing multiple HTTP connections.
Setting Up a Multi-Instance, Multi-Part Upload System
Now let’s increase the size of the upload as we recruit multiple machines and aggregate their resources.
New-EC2Instance -ImageId ami-1214ca65 `
    -InstanceType m1.xlarge `
    -MaxCount 10 `
    -InstanceProfile_Id S3AndDynamoDBAccess `
    -AssociatePublicIp $true `
    -SubnetId subnet-c1e90c98 `
    -SecurityGroupIds sg-b4a502d1 `
    -KeyName myKey
The above invocation launches ten Amazon EC2 hosts in a VPC using an AMI I created based on Windows Server 2012. The instance profile has only the permissions needed for my AWS SDK-based code to upload data to an Amazon S3 bucket and write cluster state to an Amazon DynamoDB table; this avoids placing AWS access keys in configuration files. By specifying a public subnet and assigning public IP addresses, the instances can reach the public Amazon S3 and Amazon DynamoDB endpoints. Finally, the security group allows outbound Internet access but no inbound Internet access (except RDP from the IP address of my desktop). It also allows traffic between members of the same security group, because I will need to coordinate tasks across the cluster later. Next, pull down the latest version of the executable code to each of the machines. I use Windows Remote Management to instruct all machines to perform the same task:
# Open up a remote session to all of the hosts (created above)
$PrivateIps = (Get-EC2NetworkInterface -Region eu-west-1).PrivateIpAddress
$s = New-PSSession -ComputerName $PrivateIps

# Copy the compiled code from S3 to the local disk of each host
Invoke-Command -Session $s -ScriptBlock {
    Read-S3Object -BucketName jsaull-mybucket-dub `
        -Key MultiMachineMultiPartUpload.exe `
        -File C:\temp\MultiMachineMultiPartUpload.exe `
        -Region eu-west-1
}
On each of the servers, lay out 100 log files using our naming convention. In aggregate, this results in a multi-part upload with 1,000 parts.
$scriptBlock = {
    param (
        [int] $loop,
        [int] $total
    )

    $partsFolder = "w:\mmufiles"

    # If this machine was used in a previous test it may have
    # incorrectly named file parts
    if (Test-Path $partsFolder) { Remove-Item -Path $partsFolder -Recurse }
    New-Item $partsFolder -ItemType directory

    # Download the template file part
    Read-S3Object -BucketName jsaull-mybucket-dub `
        -Key weblog.log `
        -File $partsFolder\weblog.log `
        -Region eu-west-1

    foreach ($count in ((($loop * 100) - 99)..($loop * 100)))
    {
        Copy-Item $partsFolder\weblog.log ("$partsFolder\weblog.log_" + $count + "_" + $total)
    }

    Remove-Item $partsFolder\weblog.log   # Remove the template file
}

$position = 1   # Each machine in the cluster wants to know its position

# Asynchronously (in parallel) instruct each machine in the cluster
# to execute our script with individualised parameters
foreach ($machine in $PrivateIps)
{
    Invoke-Command -ComputerName $machine `
        -ScriptBlock $scriptBlock `
        -ArgumentList $position, ($PrivateIps.Length * 100) `
        -AsJob -JobName CreateParts
    $position++
}
The figure below shows some of the file parts created for the test. Each instance has 100 parts of 26.5 MB each; the naming convention shows that this instance holds the first 100 of the 1,000 parts.
Testing the Multi-Instance, Multi-Part Upload System
With the cluster up and running with sample log files, it is time to tell the upload agent to gather all the parts on the W: drive and send them to Amazon S3 to form the megalog.log file:
Invoke-Command -Session $s `
    -ScriptBlock { C:\temp\MultiMachineMultiPartUpload.exe jsaull-mybucket-dub megalog.log w:\mmufiles }
The effect? Throughput reaches approximately 1100 MB/s for the 25.8 GB file, although the average is 715 MB/s if we factor in one instance that straggles. The last instance tells Amazon S3 that all parts are uploaded and cleans up the records in Amazon DynamoDB. Each instance runs MultiMachineMultiPartUpload.exe, which reports the time taken, bytes transferred, and transfer rate:
Conclusion
This article has shown that Amazon S3 multi-part uploading can be scaled out beyond a single process on a single machine. In this example, a single host uploaded a 3.5 GB file to Amazon S3 at 119 MB/s, whereas a coordinated cluster of 10 machines uploaded a 25.8 GB file at an average rate of 715 MB/s (inclusive of the time to coordinate and manage the cluster). Finding ways to scale out against high-scale services like Amazon S3 can significantly improve your ability to work with and manage big data assets and processes.
If you have questions or suggestions, please leave a comment below.