Additional Features of Hive in Amazon Elastic MapReduce

This article describes the Hive extensions that make Hive work more easily with Amazon Elastic MapReduce.


Submitted By: Vaibhav@AWS
Created On: September 25, 2009


For a list of other documents related to running Hive on Elastic MapReduce, see the Amazon Elastic MapReduce documentation.

Recovering Partitions

We added a statement to the Hive query language that recovers the partitions of a table from table data located in Amazon S3. For example:

  create external table impressions (json string) 
    partitioned by (dt string) 
    location 's3://elasticmapreduce/samples/hive-ads/tables/impressions'
  ;
  alter table impressions recover partitions ;

The partition directories and data must be located at the path specified in the table definition, and the directories must be named according to the Hive convention, for example, dt=2009-01-01.
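For instance, with the table definition above, recover partitions would discover data laid out as follows (the trailing file names here are hypothetical):

  s3://elasticmapreduce/samples/hive-ads/tables/impressions/dt=2009-01-01/file1
  s3://elasticmapreduce/samples/hive-ads/tables/impressions/dt=2009-01-02/file1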

Writing Data Directly to Amazon S3

The Hadoop Distributed File System (HDFS) and Amazon S3 have different properties, so they need to be treated differently when used within Elastic MapReduce and Hive.

The version of Hive installed with Amazon Elastic MapReduce writes directly to Amazon S3 without the use of temporary files. This produces a significant performance improvement, but it means that, from a Hive perspective, HDFS and Amazon S3 behave differently.

A consequence of writing directly to Amazon S3 is that you cannot read from and write to the same table within a single Hive statement if that table is located in Amazon S3. If you want to update a table located in Amazon S3:

1. Create a temporary table in the job flow local HDFS filesystem

2. Write the results to that table, and then copy them to Amazon S3:

  create temporary table tmp like my_s3_table ;
  insert overwrite table tmp select .... ;
  insert overwrite table my_s3_table select * from tmp ;

Accessing Resources in Amazon S3

The version of Hive installed in Amazon Elastic MapReduce enables you to reference resources, such as JAR files, located in Amazon S3. For example:

  add jar s3://elasticmapreduce/samples/hive-ads/lib/jsonserde.jar

You can also reference scripts located in Amazon S3 to execute custom map and reduce operations:

  from logs select transform (line) 
    using 's3://mybucket/scripts/parse-logs.pl' as 
      (time string, exception_type string, exception_details string)

Variables

You can include variables in your scripts using the dollar sign and curly braces. For example:

  add jar ${LIB}/jsonserde.jar

You pass the values of these variables to Hive on the command line using the -d parameter, for example:

  -d LIB=s3://elasticmapreduce/samples/hive-ads/lib

You can also pass the values into steps that execute Hive scripts:

  $ elastic-mapreduce --hive-script --arg s3://mybucket/script.q \
      --args -d,LIB=s3://elasticmapreduce/samples/hive-ads/lib

Making JDBC Connections

Whenever you start an interactive Hive session (using the AWS Management Console or the command line), Hive is installed in the job flow and a Hive server starts on the master node.

The Hive server accepts JDBC connections from the Hive JDBC driver on port 10000. To establish a connection from a remote machine:

1. Start an SSH tunnel:

  ssh -i my_private_key.pem hadoop@<master-node-dns> -N -L 1234:localhost:10000

Replace <master-node-dns> with the DNS name of the master node of your job flow. Programmatically, you can establish an SSH tunnel using JSch, as shown in the sketch after this procedure.

2. Connect to the Hive server using the JDBC connection string.

  jdbc:hive://localhost:1234/default

Alternatively, you can connect from a machine running in Amazon EC2 that is in either the ElasticMapReduce-Master or the ElasticMapReduce-Slave security group.
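As a rough sketch, the following Java program combines both steps: it opens the SSH tunnel with JSch and then queries Hive through the JDBC driver. It assumes the JSch and Hive JDBC driver jars are on the classpath; the master node DNS name, key file, and query are placeholders.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  import com.jcraft.jsch.JSch;
  import com.jcraft.jsch.Session;

  public class HiveJdbcExample {
      public static void main(String[] args) throws Exception {
          // Placeholder: replace with the DNS name of your master node.
          String masterDns = "ec2-xx-xx-xx-xx.compute-1.amazonaws.com";

          // Step 1: open the SSH tunnel (equivalent to the ssh command above).
          JSch jsch = new JSch();
          jsch.addIdentity("my_private_key.pem");
          Session session = jsch.getSession("hadoop", masterDns, 22);
          session.setConfig("StrictHostKeyChecking", "no");
          session.connect();
          session.setPortForwardingL(1234, "localhost", 10000);

          // Step 2: connect through the tunnel using the Hive 0.x
          // JDBC driver class.
          Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
          Connection con =
              DriverManager.getConnection("jdbc:hive://localhost:1234/default", "", "");
          Statement stmt = con.createStatement();
          ResultSet rs = stmt.executeQuery("show tables");
          while (rs.next()) {
              System.out.println(rs.getString(1));
          }

          con.close();
          session.disconnect();
      }
  }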

Persisting Hive Schema

By default, Hive keeps its schema information on the master node, and that information ceases to exist when the job flow terminates.

This feature allows you to override the location of the metadata store so that it uses, for example, a MySQL instance that you already have running in Amazon EC2.

The first step is to create a Hive site configuration file that overrides the location of the metadata store, and store that file in Amazon S3:

  <configuration>
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://ec2-72-44-33-189.compute-1.amazonaws.com:3306/hive?user=user12&amp;password=abababa7&amp;create=true</value>
      <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>com.mysql.jdbc.Driver</value>
      <description>Driver class name for a JDBC metastore</description>
    </property>
  </configuration>

Let's assume that this is located at:

  s3://mybucket/conf/hive-site.xml

You use the --hive-site option to install the configuration file in a job flow using the Amazon Elastic MapReduce command line:

  $ elastic-mapreduce --jobflow $JOBFLOW \
      --hive-site=s3://mybucket/conf/hive-site.xml

Amazon Elastic MapReduce Hive Steps

The command line client provides a convenient way to run Hive steps. You may, however, want to issue Hive steps from programs that call directly into the Amazon Elastic MapReduce web service.

The command line client's --describe command shows the form of the install step on a job flow that has Hive installed:

  "StepConfig": {
    "ActionOnFailure": "TERMINATE_JOB_FLOW",
    "Name": "Setup Hive",
    "HadoopJarStep": {
       "MainClass": null,
       "Jar": "s3:\/\/us-east-1.elasticmapreduce\/libs\/script-runner\/script-runner.jar",
       "Args": [
         "s3:\/\/us-east-1.elasticmapreduce\/libs\/hive\/0.4\/install-hive"
        ],
       "Properties": []
    }
  }

From this example you can see that a custom JAR called script-runner executes a script called install-hive, which resides in Amazon S3.

Notice that the install scripts are region-specific. If you're launching a job flow in eu-west-1, for example, you should reference the install script in the bucket eu-west-1.elasticmapreduce rather than us-east-1.elasticmapreduce.
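If you call the web service from code, the step shown above maps onto the AddJobFlowSteps API action. The following is a hypothetical sketch using the AWS SDK for Java class names (an assumption here, not part of the original article; any client that can call AddJobFlowSteps works); the job flow ID is a placeholder.

  import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
  import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
  import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
  import com.amazonaws.services.elasticmapreduce.model.StepConfig;

  public class InstallHiveStep {
      public static void main(String[] args) {
          // Placeholder: replace with your job flow ID.
          String jobFlowId = "j-XXXXXXXXXXXX";

          // Mirror the install step shown by --describe:
          // script-runner.jar runs the region-specific install-hive script.
          HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
              .withJar("s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar")
              .withArgs("s3://us-east-1.elasticmapreduce/libs/hive/0.4/install-hive");

          StepConfig setupHive = new StepConfig()
              .withName("Setup Hive")
              .withActionOnFailure("TERMINATE_JOB_FLOW")
              .withHadoopJarStep(jarStep);

          AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient();
          emr.addJobFlowSteps(new AddJobFlowStepsRequest()
              .withJobFlowId(jobFlowId)
              .withSteps(setupHive));
      }
  }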

Hive Patches for Amazon Elastic MapReduce

The Amazon Elastic MapReduce team has created the following patches for Hive.

Write to Amazon S3

Supports moving data between different file systems, such as HDFS and Amazon S3. Adds support for file systems (such as Amazon S3) that do not provide a “move” operation. Removes redundant operations like moving data to and from the same location.

Status: Submitted
Fixed in AWS Hive Version: 0.4
Fixed in Apache Hive Version: n/a (HIVE-2318)

Scripts in Amazon S3

Enables Hive to download Hive scripts from Amazon S3 buckets and run them. Saves you the step of copying scripts to HDFS before running them.

Status: Committed
Fixed in AWS Hive Version: 0.4
Fixed in Apache Hive Version: 0.7.0 (HIVE-1624)

Recover partitions

Allows you to recover partitions from table data located in Amazon S3 and Hive table data in HDFS.

Status: Not Submitted
Fixed in AWS Hive Version: 0.4
Fixed in Apache Hive Version: n/a

Variables in Hive

Creates a separate namespace (aside from HiveConf) for managing Hive variables. Adds support for setting variables on the command line using either '-define x=y' or 'set hivevar:x=y'. Adds support for referencing variables in statements using '${var_name}'. Provides a means for differentiating between hiveconf, hivevar, system, and environment properties in the output of 'set -v'.

Status: Committed
Fixed in AWS Hive Version: 0.4
Fixed in Apache Hive Version: 0.8.0 (HIVE-2020)

Report progress while writing to Amazon S3

FileSinkOperator reports progress to Hadoop while writing large files, so that the task is not killed.

Status: Not Submitted
Fixed in AWS Hive Version: 0.5
Fixed in Apache Hive Version: n/a

Fix compression arguments

Corrects an issue where compression values were not set correctly in FileSinkOperator, which resulted in uncompressed files.

Status: Submitted
Fixed in AWS Hive Version: 0.5
Fixed in Apache Hive Version: n/a (HIVE-2266)

Fix UDAFPercentile to tolerate null percentiles

Fixes an issue where UDAFPercentile would throw a null pointer exception when passed null percentile list.

Status: Committed
Fixed in AWS Hive Version: 0.5
Fixed in Apache Hive Version: 0.8.0 (HIVE-2298)

Fix hashCode method in DoubleWritable class

Fixes the hashCode() method of Hive's DoubleWritable class, preventing a HashMap keyed by DoubleWritable from degenerating into a linked list.

Status: Committed
Fixed in AWS Hive Version: 0.7
Fixed in Apache Hive Version: 0.7.0 (HIVE-1629)
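For reference, the standard way to hash a double (the approach used by java.lang.Double) folds the two 32-bit halves of its IEEE 754 bit pattern together so that distinct values spread across hash buckets; this is a sketch of the idea, not necessarily the exact patched code.

  // Sketch: hash a double the way java.lang.Double does, folding both
  // 32-bit halves of the bit pattern so that a HashMap keyed by doubles
  // does not collapse into a single bucket.
  public static int doubleHashCode(double value) {
      long bits = Double.doubleToLongBits(value);
      return (int) (bits ^ (bits >>> 32));
  }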

Recover partitions, version 2

Improved version of Recover Partitions that uses less memory.

Status: Not Submitted
Fixed in AWS Hive Version: 0.7
Fixed in Apache Hive Version: n/a

HAVING clause

Adds support for the HAVING clause, which lets you filter directly on group-by expressions instead of using nested queries. Integrates Hive with other data analysis tools that rely on the HAVING expression.

Status: Committed
Fixed in AWS Hive Version: 0.7
Fixed in Apache Hive Version: 0.7.0 (HIVE-1790)

Improve Hive query performance

Reduces startup time for queries spanning a large number of partitions.

Status: Committed
Fixed in AWS Hive Version: 0.7.1
Fixed in Apache Hive Version: 0.8.0 (HIVE-2299)

Improve Hive query performance for Amazon S3 queries

Reduces startup time for Amazon S3 queries. Set hive.optimize.s3.query=true to enable the optimization.

Status: Not Submitted
Fixed in AWS Hive Version: 0.7.1
Fixed in Apache Hive Version: n/a

Skip comments in Hive scripts

Fixes an issue where Hive scripts would fail on a comment line; now Hive scripts skip commented lines.

Status: Committed
Fixed in AWS Hive Version: 0.7.1
Fixed in Apache Hive Version: 0.8.0 (HIVE-2259)