AWS News Blog

Using Elastic MapReduce as a Generic Hadoop Cluster Manager

My colleague Steve McPherson sent along a nice guest post to get you thinking about ways to use Elastic MapReduce in non-traditional ways!

Jeff;


Amazon Elastic MapReduce (EMR) is a fully managed Hadoop-as-a-service platform that removes the operational overhead of setting up, configuring and managing the end-to-end lifecycle of Hadoop clusters. Many of our customers never interact with Hadoop for scheduled data processing tasks or job flows (clusters in EMR terminology). Instead, they specify an input data source, the query or program that should be run, and the output location for the results.

As the Hadoop ecosystem has expanded from being a generic MapReduce (batch-oriented data processing) system, EMR has expanded to support Hadoop clusters that are long-running, shared, interactive data-processing environments. EMR clusters have Hive and Pig already set up when started and implement the full suite of best practices and integrations with related AWS services such as EC2, VPC, CloudWatch, S3, DynamoDB and Kinesis.

Despite the name Elastic MapReduce, the service goes far beyond batch-oriented processing. Clusters in EMR have a flexible and rich cluster-management framework that users can customize to run any Hadoop ecosystem application such as low-latency query engines like Hbase (with Phoenix), Impala, Spark/Shark and machine learning frameworks like Mahout. These additional components can be installed using Bootstrap Actions or Steps.

Bootstrap Actions are scripts that run on every machine in the cluster as they are brought online, but before the core Hadoop services like HDFS (name node or data node) and the Hive Metastore are configured and started. For example, Cascading, Apache Spark, and Presto can be deployed to a cluster without any need to communicate with HDFS or Zookeeper.

Steps are scripts as well, but they run only on machines in the Master-Instance group of the cluster. This mechanism allows applications like Zookeeper to configure the master instances and allows applications like Hbase and Apache Drill to configure themselves.

The Amazon EMR team maintains an open source repository of bootstrap actions and related steps that can be used as examples for writing your own Bootstrap actions and Steps. Using these examples, our customers configure applications like Apache Drill and OpenTSB to run in EMR. If you are using these, we.d love to know how you.ve customized EMR to suit your use case. And yes, pull requests are welcome!

Here’s an example that shows you how to use the Presto boostrap action from the repository. Run the following command to create an Elastic MapReduce cluster:


$ elastic-mapreduce --create --name "Presto" --alive \
  --hive-interactive --ami-version 3.1.0 --num-instances 4 \
  --master-instance-type i2.2xlarge --slave-instance-type i2.2xlarge \
  --bootstrap-action s3://presto-bucket/install_presto_0.71.rb --args "-t","1GB","-l","DEBUG","-j","-server -Xmx1G -XX:+UseConcMarkSweepGC -XX:+ExplicitGCInvokesConcurrent -XX:+AggressiveOpts -XX:+HeapDumpOnOutOfMemoryError -XX:OnOutOfMemoryError=kill -9 %p -Dhive.config.resources=/home/hadoop/conf/core-site.xml,/home/hadoop/conf/hdfs-site.xml","-v","0.72","-s","1GB","-a","1h","-p","http://central.maven.org/maven2/com/facebook/presto/" \
  --bootstrap-name "Install Presto"

Once the cluster is up and running, start Hive like this:


$ hive

Create, set up, and test a table:


# Create Hive table
DROP TABLE IF EXISTS apachelog;
CREATE EXTERNAL TABLE apachelog (
  host STRING,
  IDENTITY STRING,
  USER STRING,
  TIME STRING,
  request STRING,
  STATUS STRING,
  SIZE STRING,
  referrer STRING,
  agent STRING 
)
PARTITIONED BY(iteration_no int)
LOCATION 's3://publicprestodemodata/apachelogsample/hive';
 
ALTER TABLE apachelog RECOVER PARTITIONS;
 
# Test Hive
select * from apachelog where iteration_no=101 limit 10;
 
# Exit Hive
exit

Start Presto and run a test query:


# Set Presto Pager to null for clean display
export PRESTO_PAGER=
 
# Launch Presto
./presto --catalog hive
 
# Show tables to prove that Presto is seeing Hive's tables
show tables;
 
# Run test query in Presto
select * from apachelog where iteration_no=101 limit 10;

— Steve

Jeff Barr

Jeff Barr

Jeff Barr is Chief Evangelist for AWS. He started this blog in 2004 and has been writing posts just about non-stop ever since.