Change Log
- 2012-04-09 - Added support for Pig 0.9.2, Pig versioning, and Hive 0.7.1.4.
- 2012-03-13 - Added support for Hive 0.7.1.3.
- 2012-02-28 - Added support for Hive 0.7.1.2.
- 2011-12-08 - Added support for Amazon Machine Image (AMI) versioning, Hadoop 0.20.205, Hive 0.7.1, and Pig 0.9.1. The default AMI version is the latest AMI version available.
- 2011-11-30 - Fixed support for Amazon Elastic IP addresses.
- 2011-08-08 - Added support for running a job flow on Spot Instances.
Note: You may now specify a bid price when creating instance groups. For example, use --bid-price 0.10 to specify a Spot Price bid of 10 cents an hour.
- 2011-01-24 - Fixed bugs in the --json command processing and the list option.
- 2010-12-08 - Added support for Hive 0.7.
- 2010-11-11 - Fixed bugs in the processing of pig and hive arguments and the --main-class argument to the custom jar step.
- 2010-10-19 - Added support for resizing running job flows. Substantially reworked argument processing to be more consistent and unit testable.
Note: This version is required to resize running job flows and manipulate instance groups.
- 2010-09-16 - Added support for fetching files from EMR.
- 2010-06-02 - Added support for Hadoop 0.20, Hive 0.5 and Pig 0.6.
- 2010-04-07 - Added support for bootstrap actions.
Installation and Dependencies
The command line client requires Ruby 1.8. Once Ruby is installed, unzip elastic-mapreduce-ruby.zip into a directory and run the command line client with the elastic-mapreduce command.
Windows users can install Ruby 1.8 using the one-click installer. Ubuntu and Debian users can install Ruby with
sudo apt-get install ruby
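To confirm that a Ruby interpreter is available before unzipping the client, a check along these lines may help. This is a minimal sketch: check_ruby is a hypothetical helper name and is not part of the client.

```shell
# Hypothetical helper: print the Ruby version if an interpreter is on the
# PATH, otherwise report that none was found and return a non-zero status.
check_ruby() {
  if command -v ruby >/dev/null 2>&1; then
    ruby -v
  else
    echo "ruby not found on PATH"
    return 1
  fi
}
```

Run check_ruby and compare the reported version with the Ruby 1.8 requirement stated above.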
Setting up a Credentials File
To avoid supplying your AWS credentials on the command line, copy them into a file named credentials.json and place the file in the directory where you unzipped the client. The credentials file should look like this:
{
"access-id": "",
"private-key": "",
"key-pair": "",
"key-pair-file": "",
"log-uri": ""
}
The key-pair is the name of the EC2 key pair that you created using either the EC2 API or the EC2 tab of the AWS Management Console.
The key-pair-file is the name of the file, on the local disk of the machine where you run the command line client, that contains the private key you downloaded when you created your EC2 key pair.
Note: EC2 key pairs are specific to a region. If you start job flows in eu-west-1, for instance, you need a key pair that you created in the eu-west-1 region of EC2.
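A missing key in credentials.json is easy to overlook, so a quick check of the file can be useful. The following is a minimal sketch, assuming the five keys shown in the example above; missing_credential_keys is a hypothetical helper, not part of the client, and it does a plain text search rather than parsing the JSON.

```shell
# Hypothetical helper: print every expected key that does not appear in the
# given credentials file. It looks for "key": as text; it does not validate
# the JSON syntax or check that the values are filled in.
missing_credential_keys() {
  file=$1
  for key in access-id private-key key-pair key-pair-file log-uri; do
    grep -q "\"$key\"[[:space:]]*:" "$file" || echo "$key"
  done
}
```

For example, missing_credential_keys credentials.json prints nothing when all five keys are present and prints one key name per line otherwise.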
Edit Your Path
If bash is your shell, you can add the elastic-mapreduce program to your path with
export PATH=$PATH:
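A slightly more defensive variant appends the client directory only when it is not already on the path, so that sourcing a startup file repeatedly does not keep growing PATH. This is a sketch: add_to_path is a hypothetical helper, and the directory shown is a placeholder assumption standing in for wherever you unzipped the client.

```shell
# Hypothetical helper: append a directory to PATH unless it is already listed.
add_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                # already present, nothing to do
    *) PATH="$PATH:$1" ;;       # not present, append it
  esac
  export PATH
}

# Placeholder location (an assumption); replace it with the directory where
# you unzipped the client.
add_to_path "$HOME/elastic-mapreduce-ruby"
```

Placing these lines in your bash startup file makes the change apply to new shells.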
Usage
The elastic-mapreduce command line client supports the following operations: --create, --list, --describe, and --terminate.
Listing Active Job Flows
elastic-mapreduce --list # list recently created job flows
elastic-mapreduce --list --active # list all running or starting job flows
elastic-mapreduce --list --all # list all job flows
Creating a New Job Flow
# create a job flow that requires manual termination
elastic-mapreduce --create --alive
# create a job and run the default streaming application
elastic-mapreduce --create --stream
Hadoop 0.20 is run by default. You can run 0.18 by specifying "--hadoop-version 0.18".
The following creates a job flow that runs a mapper written in Python and stored in Amazon S3. Change mybucket to your own bucket in Amazon S3 before running it.
# create a job flow and run a mapper written in Python and stored in Amazon S3
elastic-mapreduce --create \
--stream --input s3n://elasticmapreduce/samples/wordcount/input \
--mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
--output s3n://mybucket/output_path
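Because the command fails if mybucket is left unchanged, it can be convenient to wrap it so the bucket is passed as an argument. The following is a minimal sketch, not part of the client: run_wordcount is a hypothetical wrapper that uses only the options shown above and refuses to run when the bucket is empty or still the placeholder name.

```shell
# Hypothetical wrapper: run the word count example with the output written
# to the bucket passed as the first argument.
run_wordcount() {
  bucket=$1
  # Refuse to run with no bucket or with the placeholder name from the example.
  if [ -z "$bucket" ] || [ "$bucket" = "mybucket" ]; then
    echo "usage: run_wordcount <your-s3-bucket>" >&2
    return 1
  fi
  elastic-mapreduce --create \
    --stream --input s3n://elasticmapreduce/samples/wordcount/input \
    --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
    --output "s3n://$bucket/output_path"
}
```

Call it with the name of a bucket you own as the only argument.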
The following creates a job flow that runs a Hadoop job using a Java main function in a jar stored in Amazon S3.
# create a job flow and run a Java main function from a jar stored in Amazon S3
elastic-mapreduce --create \
--jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar \
--arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br \
--arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br \
--arg hdfs:///cloudburst/output/1 \
--args 36,3,0,1,240,48,24,24,128,16
Add a Step to a Running Job Flow
The following adds a step that uses a Java main function to execute a Hadoop job.
elastic-mapreduce --jobflow j-ABABABABABAB \
--jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar \
--arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br \
--arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br \
--arg hdfs:///cloudburst/output/1 \
--args 36,3,0,1,240,48,24,24,128,16
Terminate a Job Flow
# terminate all active job flows
elastic-mapreduce --list --active --terminate
# terminate a running job flow
elastic-mapreduce --terminate --jobflow j-ABABABASABA
Active job flows are job flows that are either starting or running.
SSH onto the Master Node of a Running Job Flow
elastic-mapreduce --ssh --jobflow j-ABABABABA
Run a Job Flow with Debugging Enabled and Hive and Pig Installed
elastic-mapreduce --create --alive --name "My Dev JobFlow" \
--enable-debugging --hive-interactive --pig-interactive
Additional Documentation
The archive comes with a more extensive README file that explains how to use the command line client in more detail.