Efficient Amazon S3 Object Concatenation Using the AWS SDK for Ruby
Today’s post is from one of our Solutions Architects, Jonathan Desrocher, who coincidentally is also a huge fan of the AWS SDK for Ruby.
There are certain situations where we would like to take a dataset that is spread across numerous Amazon Simple Storage Service (Amazon S3) objects and represent it as a new object that is the concatenation of those S3 objects. A real-life example might be combining individual hourly log files from different servers into a single environment-wide concatenation for easier indexing and archival. Another use case would be concatenating outputs from multiple Elastic MapReduce reducers into a single task summary.
While it is possible to download and re-upload the data to S3 through an EC2 instance, a more efficient approach would be to instruct S3 to make an internal copy using the new copy_part API operation that was introduced into the SDK for Ruby in version 1.10.0.
Why upload when you can copy?
Typically, new S3 objects are created by uploading data from a client using the AWS::S3::S3Object#write method or by copying the contents of an existing S3 object using the AWS::S3::S3Object#copy_to method of the Ruby SDK.
While the copy operation offloads the data transfer from the client to the S3 back end, it can only produce a new object whose data is identical to that of the source object. Because S3 objects are immutable, this limits the usefulness of the copy operation to occasions where we want to preserve the data but change the object’s properties (such as key name or storage class).
In our case, we want to offload the heavy lifting of the data transfer to S3’s copy functionality, but at the same time, we need to be able to shuffle different source objects’ contents into a single target derivative—and that brings us to the Multipart Upload functionality.
Copying into a Multipart Upload
Amazon S3 offers a Multipart Upload feature that enables customers to create a new object in parts and then combine those parts into a single, coherent object.
In its own right, Multipart Upload enables us to upload large amounts of data efficiently and to cope with an unreliable network connection (which is often the case with mobile devices), because each upload part can be retried individually, reducing the volume of data that must be retransmitted. Just as importantly, the upload parts can be uploaded in parallel, which can greatly increase the aggregate throughput of the upload (note that the same benefits also apply when using byte range GETs).
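To illustrate the retry and parallelism points, here is a minimal pure-Ruby sketch that makes no S3 calls. The `with_retries` helper and the `send_part` lambda are hypothetical stand-ins for a real part transfer, and the failure of part 2 is simulated.

```ruby
# Runs the block, retrying it up to max_attempts times before giving up.
def with_retries(max_attempts)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue IOError
    retry if attempts < max_attempts
    raise
  end
end

# Simulated transfer (hypothetical): part 2 fails on its first attempt only.
sent = Hash.new(0)
lock = Mutex.new
send_part = lambda do |number|
  first_try = lock.synchronize { (sent[number] += 1) == 1 }
  raise IOError, 'connection reset' if number == 2 && first_try
end

# One thread per part; a failure re-sends only the part that failed.
threads = (1..3).map do |number|
  Thread.new { with_retries(3) { send_part.call(number) } }
end
threads.each(&:join)

p sent
```

In this simulation part 2 is sent twice while parts 1 and 3 are each sent once, which is the saving that per-part retries provide over restarting a whole upload.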
Multipart Upload can be combined with the copy functionality through the Ruby SDK’s AWS::S3::MultipartUpload#copy_part method—which results in the internal copy of the specified source object into an upload part of the Multipart Upload.
When the Multipart Upload is completed, the upload parts are combined so that the last byte of each part is immediately followed by the first byte of the next part (which could itself be the target of a copy operation). The result is a true in-order concatenation of the specified source objects.
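The ordering rule can be pictured with a toy model in plain Ruby. This is only an illustration under the assumption that each part is a hash with a number and some data; the `assemble` method is hypothetical and is not part of the SDK.

```ruby
# Toy model of how a completed Multipart Upload is assembled.
# Parts are joined in ascending part-number order, not in the
# order in which they happened to finish.
def assemble(parts)
  parts.sort_by { |part| part[:part_number] }
       .map { |part| part[:data] }
       .join
end

# Parts listed in arrival order; part 2 happened to finish first.
parts = [
  { :part_number => 2, :data => "server-b 10:00 GET /\n" },
  { :part_number => 1, :data => "server-a 10:00 GET /\n" },
  { :part_number => 3, :data => "server-c 10:00 GET /\n" }
]

puts assemble(parts)
```

The log lines come out in the order server-a, server-b, server-c. The SDK example that follows performs the real operation against S3.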
Note that the following example uses Amazon EC2 roles for authenticating to S3. For more information about this feature, see our “credential management” post series.
    require 'rubygems'
    require 'aws-sdk'

    s3 = AWS::S3.new()
    mybucket = s3.buckets['my-multipart']

    # First, let's start the Multipart Upload
    obj_aggregate = mybucket.objects['aggregate'].multipart_upload

    # Then we will copy into the Multipart Upload all of the objects in a certain S3 directory.
    mybucket.objects.with_prefix('parts/').each do |source_object|
      # Skip the directory object
      unless (source_object.key == 'parts/')
        # Note that this section is thread-safe and could greatly benefit from parallel execution.
        obj_aggregate.copy_part(source_object.bucket.name + '/' + source_object.key)
      end
    end

    obj_completed = obj_aggregate.complete()

    # Generate a signed URL to enable a trusted browser to access the new object without authenticating.
    puts obj_completed.url_for(:read)
- The AWS::S3::MultipartUpload#copy_part method has an optional parameter called :part_number. Omitting this parameter (as in the example above) is thread-safe. However, if multiple processes are participating in the same Multipart Upload (as in different Ruby interpreters on the same machine or different machines altogether), then the part number must be explicitly provided in order to avoid sequence collisions.
- With the exception of the last part, there is a 5 MB minimum part size.
- The completed Multipart Upload object is limited to a 5 TB maximum size.
- It is possible to mix and match upload parts that are copies of existing S3 objects with upload parts that are actually uploaded from the client.
- For more information on S3 multipart upload and other cool S3 features, see the “STG303 Building scalable applications on S3” session from AWS re:Invent 2012.
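The size limits and the part-numbering note above can be checked before any copy is issued. The following is a minimal sketch in plain Ruby, not part of the SDK: `concatenation_problems` and `assign_part_numbers` are hypothetical helper names, and the limits are assumed to be expressed in binary units.

```ruby
MIN_PART_SIZE   = 5 * 1024**2  # 5 MB minimum for every part except the last
MAX_OBJECT_SIZE = 5 * 1024**4  # 5 TB maximum for the completed object

# Returns a list of problems with concatenating sources of the given sizes
# (in bytes, in part order). An empty list means the plan fits the limits.
def concatenation_problems(sizes)
  problems = []
  sizes[0...-1].each_with_index do |size, index|
    if size < MIN_PART_SIZE
      problems << "part #{index + 1} is #{size} bytes, below the 5 MB minimum"
    end
  end
  total = sizes.inject(0) { |sum, size| sum + size }
  problems << "total size #{total} exceeds 5 TB" if total > MAX_OBJECT_SIZE
  problems
end

# Gives every source key an explicit part number (starting at 1) and deals
# the [part_number, key] pairs out to the workers round-robin, so that
# separate processes never pick the same number.
def assign_part_numbers(keys, worker_count)
  plan = Array.new(worker_count) { [] }
  keys.each_with_index do |key, index|
    plan[index % worker_count] << [index + 1, key]
  end
  plan
end

puts concatenation_problems([6 * 1024**2, 1024, 300])
p assign_part_numbers(%w[parts/a.log parts/b.log parts/c.log], 2)
```

Each worker process would then pass its assigned number through the :part_number option of copy_part, which keeps the final order independent of which process finishes first.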