Efficient Amazon S3 Object Concatenation Using the AWS SDK for Ruby

by Trevor Rowe

Today’s post is from one of our Solutions Architects: Jonathan Desrocher, who coincidentally is also a huge fan of the AWS SDK for Ruby.


There are certain situations where we would like to take a dataset that is spread across numerous Amazon Simple Storage Service (Amazon S3) objects and represent it as a new object that is the concatenation of those S3 objects. A real-life example might be combining individual hourly log files from different servers into a single environment-wide concatenation for easier indexing and archival. Another use case would be concatenating outputs from multiple Elastic MapReduce reducers into a single task summary.

While it is possible to download and re-upload the data to S3 through an EC2 instance, a more efficient approach is to instruct S3 to make an internal copy using the copy_part API operation, which was introduced in version 1.10.0 of the SDK for Ruby.

Why upload when you can copy?

Typically, new S3 objects are created by uploading data from a client with the AWS::S3::S3Object#write method, or by copying the contents of an existing S3 object with the Ruby SDK's AWS::S3::S3Object#copy_to method.

While the copy operation offers the advantage of offloading data transfer from the client to the S3 back end, it can only produce new objects containing exactly the same data as the original. Because S3 objects are immutable, this limits the copy operation to occasions where we want to preserve the data but change the object's properties, such as its key name or storage class.
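
For illustration, here is a minimal sketch of both operations using the v1 SDK (the bucket and key names are hypothetical):

require 'aws-sdk'

s3 = AWS::S3.new
bucket = s3.buckets['my-bucket']

# Create a new object by uploading data from the client.
bucket.objects['logs/app.log'].write("2013-11-05 12:00:00 GET /index\n")

# Server-side copy: same data, new key. Changing a property such as the
# storage class works the same way (e.g., :reduced_redundancy => true).
bucket.objects['logs/app.log'].copy_to('archive/app.log')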

In our case, we want to offload the heavy lifting of the data transfer to S3’s copy functionality, but at the same time, we need to be able to shuffle different source objects’ contents into a single target derivative—and that brings us to the Multipart Upload functionality.

Copying into a Multipart Upload

Amazon S3 offers a Multipart Upload feature that enables customers to create a new object in parts and then combine those parts into a single, coherent object.

In its own right, Multipart Upload enables us to efficiently upload large amounts of data and to deal with unreliable network connections (often the case with mobile devices), because individual parts can be retried without re-sending the rest of the object. Just as importantly, the parts can be uploaded in parallel, which can greatly increase the aggregate throughput of the upload (note that the same benefits apply to downloads when using byte-range GETs).
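
As a quick sketch of the download side (the bucket and key are hypothetical), each byte range below could be fetched by a separate thread and the chunks stitched back together:

require 'aws-sdk'

obj = AWS::S3.new.buckets['my-bucket'].objects['big-file.bin']

# Read only the first 5 MB of the object; issuing several such ranged
# reads in parallel, one per chunk, raises aggregate download throughput.
first_chunk = obj.read(:range => 0...(5 * 1024 * 1024))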

Multipart Upload can be combined with the copy functionality through the Ruby SDK’s AWS::S3::MultipartUpload#copy_part method—which results in the internal copy of the specified source object into an upload part of the Multipart Upload.

Upon completion of the Multipart Upload job, the upload parts are combined so that the last byte of one part is immediately followed by the first byte of the next (which could itself be the target of a copy operation), resulting in a true in-order concatenation of the specified source objects.

Code Sample

Note that this example uses Amazon EC2 roles for authenticating to S3. For more information about this feature, see our “credential management” post series.


require 'rubygems'
require 'aws-sdk'

s3 = AWS::S3.new()
mybucket = s3.buckets['my-multipart']

# First, let's start the Multipart Upload
obj_aggregate = mybucket.objects['aggregate'].multipart_upload

# Then we will copy into the Multipart Upload all of the objects in a certain S3 directory.
mybucket.objects.with_prefix('parts/').each do |source_object|

  # Skip the directory object
  unless (source_object.key == 'parts/')
    # Note that this section is thread-safe and could greatly benefit from parallel execution.
    obj_aggregate.copy_part(source_object.bucket.name + '/' + source_object.key)
  end

end

obj_completed = obj_aggregate.complete()

# Generate a signed URL to enable a trusted browser to access the new object without authenticating.
puts obj_completed.url_for(:read)

Last Notes

  • The AWS::S3::MultipartUpload#copy_part method has an optional parameter called :part_number. Omitting this parameter (as in the example above) is thread-safe. However, if multiple processes are participating in the same Multipart Upload (for example, different Ruby interpreters on the same machine, or different machines altogether), then the part number must be provided explicitly to avoid sequence collisions (see the sketch following this list).
  • With the exception of the last part, there is a 5 MB minimum part size.
  • The completed Multipart Upload object is limited to a 5 TB maximum size.
  • It is possible to mix and match upload parts that are copies of existing S3 objects with upload parts that are uploaded directly from the client.
  • For more information on S3 multipart upload and other cool S3 features, see the “STG303 Building scalable applications on S3” session from AWS re:Invent 2012.
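
For the multi-process scenario in the first note, here is a minimal sketch (reusing the hypothetical bucket layout from the example above) that assigns part numbers explicitly so that concurrent workers cannot collide:

require 'aws-sdk'

bucket = AWS::S3.new.buckets['my-multipart']
obj_aggregate = bucket.objects['aggregate'].multipart_upload

sources = bucket.objects.with_prefix('parts/').reject { |o| o.key == 'parts/' }

# Assign part numbers up front; each (source, part number) pair could then
# be handed to a different process or machine sharing this upload.
sources.each_with_index do |source_object, i|
  obj_aggregate.copy_part(source_object.bucket.name + '/' + source_object.key,
                          :part_number => i + 1)
end

# :remote_parts completes the upload using the parts S3 has recorded,
# which is what we want when parts were added by other processes.
obj_aggregate.complete(:remote_parts)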

Happy concatenating!

AWS SDK for Ruby and Nokogiri

by Trevor Rowe

In two weeks, on November 19, 2013, we will be removing the upper bound from the nokogiri gem dependency in version 1 of the AWS SDK for Ruby. We’ve been discussing this change with users for a while on GitHub.

Why Is There Currently an Upper Bound?

Nokogiri removed support for Ruby 1.8 with the release of version 1.6. The Ruby SDK requires Nokogiri and supports Ruby 1.8.7+. The current restriction ensures that Ruby 1.8 users can install the aws-sdk gem.

Why Remove the Upper Bound?

Users of the Ruby SDK who are on Ruby 1.9+ have been asking us to remove the upper bound. Some want access to features of Nokogiri 1.6+, while others run into dependency-management headaches when multiple libraries require Nokogiri with mutually exclusive version constraints.

The Ruby SDK has been tested against Nokogiri 1.4+. By removing the restriction, we let end users choose the version of Nokogiri that best suits their needs. Libraries that pin narrow dependency ranges can make managing co-dependencies difficult, and we want to help remove some of that headache.

Will it Still be Possible to Install the Ruby SDK in Ruby 1.8.7?

Yes, it will still be possible to install the Ruby SDK in Ruby 1.8.7 when the restriction is removed. If your target environment already has Nokogiri < 1.6 installed, then you don’t need to do anything. Otherwise, you will need to install Nokogiri before installing the aws-sdk gem.

If you are using Bundler to install the aws-sdk gem, add an entry for Nokogiri:

gem 'aws-sdk'
gem 'nokogiri', '< 1.6'

If you are using the gem command to install the Ruby SDK in Ruby 1.8.7, install a compatible version of Nokogiri first:

gem install nokogiri --version="<1.6"
gem install aws-sdk

You should not need to make any changes if any of the following are true:

  • You have a Gemfile.lock deployed with your application
  • Nokogiri < 1.6 is already installed in your target environment
  • Your deployed environment does not reinstall gems when restarted

What About Version 2 of the Ruby SDK?

Version 2 of the Ruby SDK no longer depends directly on Nokogiri. Instead, it depends on multi_xml and multi_json. This allows you to use Ox and Oj for speed, or Nokogiri if you prefer. Additionally, the Ruby SDK will work with pure-Ruby XML and JSON parsing libraries, which makes distributing to varied environments much simpler.
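
For example, with version 2 you could opt into the faster native parsers simply by adding them to your Gemfile; multi_xml and multi_json automatically prefer Ox and Oj when they are present. The entries below are a sketch (the v2 preview gem is named aws-sdk-core), not a requirement:

gem 'aws-sdk-core'  # version 2 preview of the Ruby SDK
gem 'ox'            # fast XML parsing, picked up by multi_xml
gem 'oj'            # fast JSON parsing, picked up by multi_json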

Keep the Feedback Coming

We appreciate all of the users who have chimed in on this issue and shared their feedback. User feedback definitely helps guide development of the Ruby SDK. If you haven’t had a chance to check out our work on v2 of the Ruby SDK, please take a look at the GitHub repo for AWS Ruby Core.

Thanks!