Additional Features of Hive in Amazon Elastic MapReduce
This article describes the Hive extensions that make Hive work more easily with Amazon Elastic MapReduce.
Submitted By: Vaibhav@AWS
Created On: September 25, 2009
For a list of other documents related to running Hive on Elastic MapReduce, go here.
Recovering Partitions
We added a statement to the Hive query language that recovers the partitions of a table from table data located in Amazon S3. For example:
create external table raw_impression (json string)
  partitioned by (dt string)
  location 's3://elastic-mapreduce/samples/hive-ads/tables/impressions' ;

alter table raw_impression recover partitions ;
The partition directories and data must be at the location specified in the table definition and must be named according to the Hive convention, e.g., dt=2009-01-01.
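For example, given the table location above, partition data laid out as follows (the dates are illustrative) would be picked up by the recover partitions statement:

s3://elastic-mapreduce/samples/hive-ads/tables/impressions/dt=2009-01-01/
s3://elastic-mapreduce/samples/hive-ads/tables/impressions/dt=2009-01-02/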
Writing Data Directly to Amazon S3
The Hadoop Distributed File System (HDFS) and Amazon S3 have different properties, so they need to be treated differently when used within Elastic MapReduce and Hive.
The version of Hive installed with Amazon Elastic MapReduce writes directly to Amazon S3 without the use of temporary files. This produces a significant performance improvement, but it means that, from a Hive perspective, HDFS and Amazon S3 behave differently.
A consequence of writing directly to Amazon S3 is that you cannot read and write within the same Hive statement to the same table if that table is located in Amazon S3. If you want to update a table located in Amazon S3:
1. Create a temporary table in the job flow's local HDFS filesystem.
2. Write the results to that table, and then copy them to Amazon S3:
create temporary table tmp like my_s3_table ;

insert overwrite table tmp
  select .... ;

insert overwrite table my_s3_table
  select * from tmp ;
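As a concrete sketch (the deduplication query and table name here are hypothetical), removing duplicate rows from a table stored in Amazon S3 might look like this; the intermediate HDFS table keeps Hive from reading and writing the same Amazon S3 location in a single statement:

create temporary table tmp like my_s3_table ;

-- read from the Amazon S3 table into the local HDFS table
insert overwrite table tmp
  select distinct * from my_s3_table ;

-- copy the result back to the Amazon S3 table
insert overwrite table my_s3_table
  select * from tmp ;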
Accessing Resources in Amazon S3
The version of Hive installed in Amazon Elastic MapReduce enables you to reference resources located in Amazon S3. For example:
add jar s3://elasticmapreduce/samples/hive-ads/lib/jsonserde.jar
You can also reference scripts located in Amazon S3 to execute custom map and reduce operations:
from logs
  select transform (line)
  using 's3://mybucket/scripts/parse-logs.pl'
  as (time string, exception_type string, exception_details string)
Variables
You can include variables in your scripts using the dollar sign and curly braces.
add jar ${LIB}/jsonserde.jar
You pass the values of these variables to Hive on the command line using the -d parameter, e.g.
-d LIB=s3://elasticmapreduce/samples/hive-ads/lib
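For example, when running a script with the Hive client on the master node, the full invocation might look like the following (my-script.q is a hypothetical script name, and the -d option comes from the Variables in Hive patch described later in this article):

hive -d LIB=s3://elasticmapreduce/samples/hive-ads/lib -f my-script.q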
You can also pass the values to steps that execute Hive scripts:
$ elastic-mapreduce --hive-script --arg s3://mybucket/script.q \
    --args -d,LIB=s3://elasticmapreduce/samples/hive-ads/lib
Making JDBC Connections
A Hive server starts on the master node whenever Hive is installed in a job flow, for example when you start an interactive Hive session using the AWS Management Console or the command line.
The Hive server accepts JDBC connections from the Hive JDBC driver on port 10000. To establish a connection from a remote machine:
1. Start an SSH tunnel:
ssh -i my_private_key.pem hadoop@master-public-dns-name -N -L 1234:localhost:10000
Replace master-public-dns-name with the public DNS name of the job flow's master node.
2. Connect to the Hive server using the JDBC connection string.
jdbc:hive://localhost:1234/default
Alternatively, you can connect from a machine running in EC2 that is either in the ElasticMapReduce-Master or ElasticMapReduce-Slave security group.
Persisting Hive Schema
By default, Hive keeps its schema information on the master node, and that information ceases to exist when the job flow terminates.
This feature allows you to override the location of the metadata store so that you can use, for example, a MySQL instance that you already have running in EC2.
The first step is to create a Hive site configuration file that overrides the location of the metadata store and to store that file in Amazon S3:
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://ec2-72-44-33-189.compute-1.amazonaws.com:3306/hive?user=user12&amp;password=abababa7&amp;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
</configuration>
Let's assume that this is located at:
s3://mybucket/conf/hive-site.xml
You use the --hive-site option to install the configuration file in a job flow using the Amazon Elastic MapReduce command line client:
$ elastic-mapreduce --jobflow $JOBFLOW \
    --hive-site=s3://mybucket/conf/hive-site.xml
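You can also supply the option when creating a new interactive Hive job flow. The following is only a sketch, assuming the command line client's --create, --alive, --name, and --hive-interactive options:

$ elastic-mapreduce --create --alive --name "Hive with external metastore" \
    --hive-interactive \
    --hive-site=s3://mybucket/conf/hive-site.xml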
Amazon Elastic MapReduce Hive Steps
The command line client provides a convenient way to access Hive steps, but you may also want to access them from programs that call directly into the Amazon Elastic MapReduce web service.
The command line client's --describe command shows the form of the install step on a job flow that has Hive installed on it:
"StepConfig": { "ActionOnFailure": "TERMINATE_JOB_FLOW", "Name": "Setup Hive", "HadoopJarStep": { "MainClass": null, "Jar": "s3:\/\/us-east-1.elasticmapreduce\/libs\/script-runner\/script-runner.jar", "Args": [ "s3:\/\/us-east-1.elasticmapreduce\/libs\/hive\/0.4\/install-hive" ], "Properties": [] } }
From this example you can see that a custom JAR called script-runner executes a script called install-hive, which resides in Amazon S3.
Notice that the install scripts are region-specific. If you're launching a job flow in eu-west-1, for example, you should reference the install script in the bucket eu-west-1.elasticmapreduce rather than us-east-1.elasticmapreduce.
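For example, a step equivalent to the one above could be added to a job flow running in eu-west-1 using the command line client (a sketch, assuming the client's --jobflow, --jar, and --arg options):

$ elastic-mapreduce --jobflow $JOBFLOW \
    --jar s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar \
    --arg s3://eu-west-1.elasticmapreduce/libs/hive/0.4/install-hive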
Hive Patches for Amazon Elastic MapReduce
The Amazon Elastic MapReduce team has created the following patches for Hive.
Write to Amazon S3
Supports moving data between different file systems, such as HDFS and Amazon S3. Adds support for file systems (such as Amazon S3) that do not provide a “move” operation. Removes redundant operations like moving data to and from the same location.

Scripts in Amazon S3
Enables Hive to download Hive scripts from Amazon S3 buckets and run them. Saves you the step of copying scripts to HDFS before running them. Status: Committed

Recover partitions
Allows you to recover partitions from table data located in Amazon S3 and Hive table data in HDFS.

Variables in Hive
Creates a separate namespace (aside from HiveConf) for managing Hive variables. Adds support for setting variables on the command line using either '-define x=y' or 'set hivevar:x=y'. Adds support for referencing variables in statements using '${var_name}'. Provides a means for differentiating between hiveconf, hivevar, system, and environment properties in the output of 'set -v'. Status: Committed

Report progress while writing to Amazon S3
FileSinkOperator reports progress to Hadoop while writing large files, so that the task is not killed. Status: Not Submitted

Fix compression arguments
Corrects an issue where compression values were not set correctly in FileSinkOperator, which resulted in uncompressed files. Status: Submitted

Fix UDAFPercentile to tolerate null percentiles
Fixes an issue where UDAFPercentile would throw a null pointer exception when passed a null percentile list. Status: Committed

Fix hashCode method in DoubleWritable class
Fixes the hashCode() method of Hive's DoubleWritable class and prevents a HashMap (of type DoubleWritable) from behaving as a LinkedList. Status: Committed

Recover partitions, version 2
Improved version of Recover Partitions that uses less memory. Status: Not Submitted

HAVING clause
Lets you use the HAVING clause to filter directly on group-by expressions, instead of using nested queries (see the example following this list). Integrates Hive with other data analysis tools that rely on the HAVING expression. Status: Committed

Improve Hive query performance
Reduces startup time for queries spanning a large number of partitions. Status: Committed

Improve Hive query performance for Amazon S3 queries
Reduces startup time for Amazon S3 queries. Set hive.optimize.s3.query=true to enable the optimization. Status: Not Submitted

Skip comments in Hive scripts
Fixes an issue where Hive scripts would fail on a comment line; now Hive scripts skip commented lines.
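For example, with the HAVING patch a query against a hypothetical impressions table can filter aggregated groups directly, instead of wrapping the aggregation in a nested query:

-- with the HAVING clause
select ad_id, count(*) as impression_count
from impressions
group by ad_id
having count(*) > 100 ;

-- equivalent nested query without HAVING
select * from (
  select ad_id, count(*) as impression_count
  from impressions
  group by ad_id
) counts
where impression_count > 100 ;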