Getting Started with Amazon EMR Bootstrap Actions
Steve McPherson is a Senior Manager for Amazon Elastic MapReduce
Note: This post was updated 2/8/16. The Presto bootstrap action documented in the original post has been deprecated because EMR now offers a Presto-Sandbox as a full-fledged EMR application. For details, see the EMR sandbox.
Amazon Elastic MapReduce (EMR) is a fully managed Hadoop-as-a-service platform that removes the operational overhead of setting up, configuring and managing the end-to-end lifecycle of Hadoop clusters. Many of our customers use the service for scheduled data processing tasks or job flows (clusters in EMR terminology) without ever having to interact with Hadoop infrastructure itself. Instead, they specify an input data source, the query or program that should be run, and the output location for the results.
As the Hadoop ecosystem has expanded from being a generic MapReduce (batch-oriented data processing) system, EMR has expanded to support Hadoop clusters that are long-running, shared, interactive data-processing environments. EMR clusters come prepackaged with the most common Hadoop apps like Hive, Pig and Cascading. The apps are configured to implement the full suite of best practices and integrations with related AWS services such as EC2, VPC, CloudWatch, S3, DynamoDB and Kinesis.
Despite the name Elastic MapReduce, the service goes far beyond batch-oriented processing. Clusters in EMR have a flexible and rich cluster-management framework that users can customize to run any Hadoop ecosystem application such as low-latency query engines like Hbase (with Phoenix), Impala, Spark/Shark and machine learning frameworks like Mahout. These additional components can be installed using Bootstrap Actions or Steps.
Bootstrap Actions are scripts that run on every machine in the cluster as they are brought online, but before the core Hadoop services like HDFS (name node or data node) and the Hive Metastore are configured and started.
Steps are scripts as well, but they run only on machines in the Master-Instance group of the cluster. This mechanism allows applications like Zookeeper to configure the master instances and allows applications like Hbase and Apache Drill to configure themselves.
The Amazon EMR team maintains an open source repository of bootstrap actions and related steps that can be used as examples for writing your own Bootstrap actions and Steps. Using these examples, our customers configure applications like Apache Drill and OpenTSB to run in EMR.
If you have questions or suggestions, please leave a comment below.