New Elastic MapReduce Feature: Bootstrap Actions

When you launch an Amazon Elastic MapReduce job flow, the Hadoop job is run on a generic AMI that we supply. Until now, there’s been no easy way to customize the image by modifying configuration files or installing additional software.

By popular demand, we now support bootstrap actions for each Elastic MapReduce job flow. The bootstrap actions are scripts stored in Amazon S3. You can write the scripts in any language that’s already installed on the instance: Perl, Python, Ruby, or Bash. Bash is probably your best bet for simple customizations. Here’s an example of running a job flow with a bootstrap action that uses the Elastic MapReduce command-line client:

$ elastic-mapreduce --create \
  --bootstrap-action s3://elasticmapreduce/scripts/configure-hadoop \
  --arg s3://mybucket/config/custom-site-config.xml

This command uses a bootstrap action provided by Elastic MapReduce that overrides settings in the Hadoop site configuration with values loaded from a file in S3.
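
For reference, the file you pass to configure-hadoop is a standard Hadoop site configuration document. Here’s a minimal sketch of what custom-site-config.xml might contain; the property shown is just an illustrative override, not something your job flow necessarily needs:

<?xml version="1.0"?>
<configuration>
  <!-- Example: give each map/reduce child task a 1 GB heap -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
</configuration>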

Another predefined bootstrap action allows you to modify the amount of memory allocated to various Hadoop daemons:

$ elastic-mapreduce --create \
  --bootstrap-action s3://elasticmapreduce/scripts/configure-daemons \
  --arg --namenode-heap-size=2048 \
  --arg --namenode-opts=-XX:GCTimeRatio=19

This command sets the heap size for the NameNode to 2048 MB and adds the Java command-line option -XX:GCTimeRatio=19. That option raises the garbage collector’s target share of total time from the default of 1% to 1/(1+19), or 5%, which increases the frequency with which the Java garbage collector runs.

Bootstrap actions run as the user hadoop, but this user is allowed to escalate to root using sudo, so if you wanted to install a Debian package you could write a bootstrap action like this:

#!/bin/bash
# Fetch the package from S3 into the current directory, then install it as root.
hadoop fs -copyToLocal s3://mybucket/packages/mypackage.deb .
sudo dpkg -i mypackage.deb
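
To use it, you’d upload the script to a bucket of your own and pass its location when creating the job flow. Here’s a sketch, assuming the script was saved as s3://mybucket/bootstrap/install-package.sh (a hypothetical path):

$ elastic-mapreduce --create \
  --bootstrap-action s3://mybucket/bootstrap/install-package.sh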

If your bootstrap action fails, your job flow will be shut down, so you’ll want to test your bootstrap action script on a running job flow before specifying it as a bootstrap action. To do this, run a development job flow with the --alive option, like this:

$ elastic-mapreduce --create --alive --name "My Development JobFlow"

Then you can SSH to the master node of your job flow, download your script from S3 with hadoop fs -copyToLocal, and execute it. Once you know that it works, try it as a bootstrap action on a new job flow.
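
Here’s a sketch of what such a test session might look like; the key pair file, the master node’s public DNS name, and the script path are placeholders for the values from your own setup:

$ ssh -i mykeypair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
$ hadoop fs -copyToLocal s3://mybucket/bootstrap/install-package.sh .
$ chmod +x install-package.sh
$ ./install-package.sh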

There’s more information on Elastic MapReduce bootstrap actions in the newest version of the documentation.

— Jeff;

Jeff Barr

Jeff Barr is Chief Evangelist for AWS. He started this blog in 2004 and has been writing posts just about non-stop ever since.