AWS Official Blog

Elastic MapReduce Updates – Hive, Multipart Upload, JDBC, Squirrel SQL

by Jeff Barr | in Amazon Elastic MapReduce

I have a number of Elastic MapReduce updates for you:

  • Support for S3’s Large Objects and Multipart Upload
  • Upgraded Hive Support
  • JDBC Drivers for Hive
  • A tutorial on the use of Squirrel SQL with Elastic MapReduce

Support for S3’s Large Objects and Multipart Upload
We recently introduced an important new feature for Amazon S3: the ability to break a single object into chunks and to upload two or more of the chunks to S3 concurrently. Applications that make use of this feature enjoy quicker uploads with better error recovery, and can upload objects up to 5 TB in size.
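To make the chunking idea concrete, here is a minimal Python sketch of how an object might be divided into parts and handed to parallel workers. It is an illustration only, not how Elastic MapReduce or any SDK implements the feature. The function names and the `upload_part` callable are hypothetical, and the limits in the constants (5 MB minimum part size, 10,000 parts per upload) are the S3 limits as I understand them.

```python
from concurrent.futures import ThreadPoolExecutor

MIN_PART_SIZE = 5 * 1024 * 1024  # assumed S3 minimum for every part except the last
MAX_PARTS = 10_000               # assumed S3 maximum number of parts per upload


def plan_parts(object_size, part_size=MIN_PART_SIZE):
    """Return (part_number, offset, length) tuples that cover the whole object."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the minimum part size")
    count = -(-object_size // part_size)  # ceiling division
    if count > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    return [
        (i + 1, i * part_size, min(part_size, object_size - i * part_size))
        for i in range(count)
    ]


def upload_concurrently(data, upload_part, part_size=MIN_PART_SIZE, workers=4):
    """Send each chunk of data through upload_part(number, chunk) in parallel.

    upload_part is a hypothetical callable supplied by the caller. It stands
    in for the real S3 request and returns some receipt for the part.
    """
    parts = plan_parts(len(data), part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(upload_part, number, data[offset:offset + length])
            for number, offset, length in parts
        ]
        # Results are collected in part order. A failed part raises here,
        # and only that part would need to be sent again.
        return [future.result() for future in futures]
```

Because each part is an independent request, a network failure costs one part rather than the whole object, which is where the better error recovery comes from.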

Amazon Elastic MapReduce now supports this feature, but it is not enabled by default. Once it has been enabled, Elastic MapReduce can actually begin the upload process before the Hadoop task has finished. The combination of parallel uploads and an earlier start means that data-intensive applications will often finish more quickly.

To enable Multipart Upload to Amazon S3, you must add a new entry to your Hadoop configuration file. You can find complete information in the newest version of the Elastic MapReduce documentation. This feature is not enabled by default because, once it is on, your application becomes responsible for cleaning up after a failed upload. The AWS SDK for Java contains a helper method (AbortMultipartUploads) to simplify the cleanup process.
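The cleanup step amounts to listing the uploads that were started but never completed and aborting the ones that are old enough to be considered abandoned. The sketch below shows that logic in Python rather than with the Java helper mentioned above. It assumes a boto3-style S3 client (boto3 is my assumption, not something this post covers) and uses that client's `list_multipart_uploads` and `abort_multipart_upload` calls. The function name `abort_stale_uploads` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def abort_stale_uploads(s3, bucket, older_than, now=None):
    """Abort multipart uploads in bucket that were initiated before the cutoff.

    s3 is assumed to behave like a boto3 S3 client. older_than is a
    timedelta. Returns the (key, upload_id) pairs that were aborted.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - older_than
    aborted = []
    kwargs = {"Bucket": bucket}
    while True:
        page = s3.list_multipart_uploads(**kwargs)
        for upload in page.get("Uploads", []):
            # "Initiated" is expected to be a timezone-aware datetime.
            if upload["Initiated"] < cutoff:
                s3.abort_multipart_upload(
                    Bucket=bucket,
                    Key=upload["Key"],
                    UploadId=upload["UploadId"],
                )
                aborted.append((upload["Key"], upload["UploadId"]))
        if not page.get("IsTruncated"):
            break
        # Continue the listing from where the previous page stopped.
        kwargs["KeyMarker"] = page["NextKeyMarker"]
        kwargs["UploadIdMarker"] = page["NextUploadIdMarker"]
    return aborted
```

A job could call something like `abort_stale_uploads(client, "my-bucket", timedelta(days=1))` on a schedule, where the bucket name is a placeholder. Pick a cutoff comfortably longer than your longest-running upload so that in-flight work is not aborted.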

Upgraded Hive Support
You can now use Hive 0.7 with Elastic MapReduce. This version of Hive provides a number of new features, including support for the HAVING and IN clauses, along with performance enhancements from local mode queries, improved column compression, and dynamic partitioning.

You can also run versions 0.5 and 0.7 concurrently on the same cluster. You will need to use the Elastic MapReduce command-line tools to modify the default version of Hive for a particular job step.

JDBC Drivers for Hive
We have released a set of JDBC drivers for Apache Hive that have been optimized for use with Elastic MapReduce. Separate builds of the drivers are available for versions 0.5 and 0.7 of Hive.

Squirrel SQL Tutorial
We have written a tutorial to show you how to use the open source Squirrel SQL client to connect to Elastic MapReduce using the new JDBC drivers. You will be able to query your data using a graphical query tool.

— Jeff;