AWS News Blog

Elastic MapReduce Now Supports Hive 13

I am pleased to announce that Elastic MapReduce now supports version 13 of Hive. Hive is a great tool for building and querying large data sets. It supports the ETL (Extract/Transform/Load) process with some powerful tools, and gives you access to files stored on your EMR cluster in HDFS or in Amazon Simple Storage Service (S3). Programmatic or ad hoc queries supplied to Hive are executed in massively parallel fashion by taking advantage of the MapReduce model.

Version 13 Features
Version 13 of Hive includes all sorts of cool and powerful new features. Here’s a sampling:

Vectorized Query Execution – This feature reduces CPU usage for query operations such as scans, filters, aggregates, and joins. Instead of processing queries on a row-by-row basis, the vectorized query execution feature processes blocks of 1024 rows at a time. This reduces internal overhead and allows each column of data stored within the block to be processed in a tight, efficient loop. In order to take advantage of this feature, your data must be stored in the ORC (Optimized Row Columnar) format. To learn more about this format and its advantages, take a look at ORC: An Intelligent Big Data file format for Hadoop and Hive.
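
To make the block-at-a-time idea concrete, here is a minimal Python sketch. It is not Hive's implementation; the function names (`batches`, `filtered_sum`) and the sample data are made up for illustration, and only the block size of 1024 comes from the description above.

```python
from typing import Iterator, List

BATCH_SIZE = 1024  # block size mentioned in the feature description


def batches(column: List[int], size: int = BATCH_SIZE) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` values from one column."""
    for start in range(0, len(column), size):
        yield column[start:start + size]


def filtered_sum(column: List[int], threshold: int) -> int:
    """Sum the values greater than `threshold`, one block at a time."""
    total = 0
    for batch in batches(column):
        # Filter and aggregate in a tight loop over one block of a single column.
        total += sum(value for value in batch if value > threshold)
    return total


# Hypothetical column of 5000 values, split into blocks of up to 1024.
prices = list(range(5000))
result = filtered_sum(prices, 4000)
```

The per-block work here is a simple filter and sum; the point is only that the loop touches one column of one block at a time instead of walking whole rows.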

Faster Plan Serialization – The process of serializing a query plan (turning a complex Java object into an XML representation) is now faster. This speeds up the transmission of the query plan to the worker nodes and improves overall Hive performance.

Support for DECIMAL and CHAR Data Types – The new DECIMAL data type supports exact representation of numerical values with up to 38 digits of precision. The new CHAR data type supports fixed-length, space-padded strings. See the documentation on Hive Data Types for more information.
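
If exact decimals and space-padded strings are new to you, the following Python sketch may help. It uses Python's standard `decimal` module and `str.ljust` as stand-ins; it does not call Hive, and the helper name `to_char` is hypothetical. How Hive itself treats a value longer than the declared CHAR length is not covered here, so the sketch simply rejects it.

```python
from decimal import Context, Decimal

# A context with 38 digits of precision, matching the limit described above.
ctx = Context(prec=38)

float_sum = 0.1 + 0.2  # binary floating point, not exactly 0.3
decimal_sum = ctx.add(Decimal("0.1"), Decimal("0.2"))  # exactly 0.3


def to_char(value: str, length: int) -> str:
    """Pad `value` with trailing spaces to `length` characters (CHAR-like)."""
    if len(value) > length:
        # Assumption of this sketch only: over-long input is an error.
        raise ValueError("value is longer than the fixed length")
    return value.ljust(length)


code = to_char("ab", 5)
```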

Subquery Support for IN, NOT IN, EXISTS, and NOT EXISTS – Hive subqueries within a WHERE clause now support the IN, NOT IN, EXISTS, and NOT EXISTS operators in both correlated and uncorrelated form. In an uncorrelated subquery, columns from the parent query are not referenced; a correlated subquery does reference them.
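
Here is a small sketch of the two subquery forms. To keep it runnable without a cluster it uses Python's built-in `sqlite3` module rather than Hive, so treat it as an illustration of the SQL shapes only; Hive may impose restrictions that SQLite does not. The `customers` and `orders` tables are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER, name TEXT);
CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
INSERT INTO customers VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
INSERT INTO orders VALUES (1, 50), (1, 20), (3, 75);
""")

# Uncorrelated: the subquery does not reference any column of the parent query.
with_orders = [row[0] for row in conn.execute("""
    SELECT name FROM customers
    WHERE id IN (SELECT customer_id FROM orders)
    ORDER BY name
""")]

# Correlated: the subquery references c.id from the parent query.
without_orders = [row[0] for row in conn.execute("""
    SELECT name FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY name
""")]
```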

JOIN Conditions in WHERE Clauses – Hive now supports JOIN conditions within WHERE clauses.
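
As a rough illustration, the sketch below writes the same join twice: once with the condition in the WHERE clause and once with an explicit JOIN ... ON. It runs against Python's built-in `sqlite3` module as a stand-in for Hive, and the `stocks` and `exchanges` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stocks (symbol TEXT, exchange_id INTEGER);
CREATE TABLE exchanges (id INTEGER, name TEXT);
INSERT INTO stocks VALUES ('AAA', 1), ('BBB', 2), ('CCC', 1);
INSERT INTO exchanges VALUES (1, 'North'), (2, 'South');
""")

# Join condition expressed in the WHERE clause.
implicit = conn.execute("""
    SELECT s.symbol, e.name
    FROM stocks s, exchanges e
    WHERE s.exchange_id = e.id
    ORDER BY s.symbol
""").fetchall()

# The same join written with an explicit JOIN ... ON.
explicit = conn.execute("""
    SELECT s.symbol, e.name
    FROM stocks s JOIN exchanges e ON s.exchange_id = e.id
    ORDER BY s.symbol
""").fetchall()
```

Both queries return the same rows; the difference is only where the join condition is written.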

Improved Windowing Functions – Hive now supports improved, highly optimized versions of the “windowing” functions that perform aggregation over a moving window. For example, you can easily compute the moving average of a stock price over a specified number of days.
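
The moving average mentioned above can be sketched in plain Python. This is not how Hive computes it; it only mirrors the result you would expect from a window such as `AVG(price) OVER (ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)`. The function name `moving_average` and the sample prices are made up.

```python
from collections import deque
from typing import Deque, Iterable, List


def moving_average(prices: Iterable[float], window: int) -> List[float]:
    """Average each price with up to `window - 1` prices that precede it."""
    recent: Deque[float] = deque(maxlen=window)
    running = 0.0
    averages = []
    for price in prices:
        if len(recent) == window:
            # The oldest value is about to fall out of the window.
            running -= recent[0]
        recent.append(price)
        running += price
        averages.append(running / len(recent))
    return averages


# Hypothetical daily closing prices and a three-day window.
closes = [10.0, 12.0, 14.0, 16.0]
averages = moving_average(closes, 3)
```

The first two results average fewer than three values because the window has not filled yet, which matches how a ROWS BETWEEN ... PRECEDING frame behaves at the start of a partition.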

Catch the Buzz
You can start using these new features today by making use of version 3.2.0 of the Elastic MapReduce AMI in your newly launched clusters.

Jeff;