Amazon Elastic MapReduce Ruby Client

Community Contributed Software

  • Amazon Web Services provides links to these packages as a convenience for our customers, but software not authored by an "@AWS" account has not been reviewed or screened by AWS.
  • Please review this software to ensure it meets your needs before using it.

A command line client for creating, describing, and terminating Job Flows using Amazon Elastic MapReduce. Also included is a pure Ruby library for making web service calls to the Amazon Elastic MapReduce web service.

Details

Submitted By: Richard@AWS
AWS Products Used: Amazon Elastic MapReduce
Language(s): Ruby
License: Apache License 2.0
Created On: April 1, 2009 1:48 AM GMT
Last Updated: April 18, 2012 10:44 PM GMT
Download

Change Log

  • 2012-04-09 - Added support for Pig 0.9.2, Pig versioning, and Hive 0.7.1.4.
  • 2012-03-13 - Added support for Hive 0.7.1.3.
  • 2012-02-28 - Added support for Hive 0.7.1.2.
  • 2011-12-08 - Added support for Amazon Machine Image (AMI) versioning, Hadoop 0.20.205, Hive 0.7.1, and Pig 0.9.1. The default AMI version is the latest AMI version available.
  • 2011-11-30 - Fixed support for Amazon Elastic IP addresses.
  • 2011-08-08 - Added support for running a job flow on Spot Instances.

    Note: You may now specify a bid price when creating instance groups. For example, use --bid-price 0.10 to specify a Spot Price bid of 10 cents an hour.

  • 2011-01-24 - Fixed bugs in the --json command processing and the list option.
  • 2010-12-08 - Added support for Hive 0.7.
  • 2010-11-11 - Fixed bugs in the processing of pig and hive arguments and the --main-class argument to the custom jar step.
  • 2010-10-19 - Added support for resizing running job flows. Substantially reworked processing arguments to be more consistent and unit testable.

    Note: This version is required to resize running job flows and manipulate instance groups.

  • 2010-09-16 - Added support for fetching files from EMR.
  • 2010-06-02 - Added support for Hadoop 0.20, Hive 0.5 and Pig 0.6.
  • 2010-04-07 - Added support for bootstrap actions.

Installation and Dependencies

The command line client requires Ruby 1.8. Once you have Ruby installed, unzip elastic-mapreduce-ruby.zip into a directory and then run the command line client with elastic-mapreduce.

Windows users can install Ruby 1.8 using the one-click installer. Ubuntu and Debian users can install Ruby with

    sudo apt-get install ruby

Setting up a Credentials File

To avoid having to supply your AWS credentials on the command line, copy your credentials into a file named credentials.json and place this file in the directory where you unzipped the client. The credentials file should look like this:

  {
    "access-id":     "",
    "private-key":   "",
    "key-pair":      "",
    "key-pair-file": "",
    "log-uri":       ""
  }

The key-pair is the name of the EC2 key pair that you created, either using the EC2 API or using the EC2 tab of the AWS Management Console.

The key-pair-file is the path to a file on the local disk of the machine where you run the command line client. It contains the private key that you downloaded when you created your EC2 key pair.

Note: EC2 key pairs are specific to a region. If you start job flows in eu-west-1, for instance, you need to use a key pair that you created in the eu-west-1 region of EC2.
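
Before running the client, it can help to check that every field in credentials.json has been filled in. The following is a minimal sketch, assuming a Ruby with the json library available. The helper names are hypothetical and not part of the client, and treating all five fields as required is an assumption based on the example file above.

```ruby
require 'json'

# Keys taken from the credentials file example above.
# Assumption: all five are treated as required here.
REQUIRED_KEYS = %w[access-id private-key key-pair key-pair-file log-uri].freeze

# Returns the keys that are missing or blank in a parsed credentials hash.
def missing_credentials(credentials)
  REQUIRED_KEYS.select do |key|
    value = credentials[key]
    value.nil? || value.to_s.strip.empty?
  end
end

# Parses the text of a credentials file and returns the problem keys
# (hypothetical helper, not part of the elastic-mapreduce client).
def check_credentials(json_text)
  missing_credentials(JSON.parse(json_text))
end
```

For example, `check_credentials(File.read('credentials.json'))` returns an empty array when every field is set, and the names of the blank or absent fields otherwise.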

Edit Your Path

If you're running bash as your shell then you can add the elastic-mapreduce program to your path with

    export PATH=$PATH:
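
After editing your path, you can confirm from Ruby that the program is found. This is a rough sketch only: `find_on_path` is a hypothetical helper, not something shipped with the client, and it checks only for an executable file with the given name in each PATH directory.

```ruby
# Returns the full path of the first executable file named `program`
# found in the directories listed in path_env, or nil if there is none.
# Hypothetical helper for illustration; not part of the client.
def find_on_path(program, path_env = ENV['PATH'].to_s)
  path_env.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, program)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end
```

Calling `find_on_path('elastic-mapreduce')` returns nil when the directory holding the client has not been added to the path.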

Usage

The elastic-mapreduce command line client supports the following operations: --create, --list, --describe, and --terminate.

Listing Active Job Flows

    elastic-mapreduce --list           # list recently created job flows
    elastic-mapreduce --list --active  # list all running or starting job flows
    elastic-mapreduce --list --all     # list all job flows

Creating a new Job Flow

    # create a job flow that requires manual termination
    elastic-mapreduce --create --alive  

    # create a job and run the default streaming application
    elastic-mapreduce --create --stream 

The version of Hadoop run is 0.20 by default. You can run 0.18 by specifying "--hadoop-version 0.18".

The following creates a job flow that runs a mapper written in Python and stored in Amazon S3. Change mybucket to your own Amazon S3 bucket before running it.

    # create a job and run a mapper written in python and stored in Amazon S3
    elastic-mapreduce --create \
      --stream --input s3n://elasticmapreduce/samples/wordcount/input \
      --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
      --output s3n://mybucket/output_path 
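
If you launch job flows from a script, one option is to assemble the same command in Ruby so that only the bucket name changes. The sketch below uses only the flags and sample paths from the streaming example above. `streaming_command` is a hypothetical helper, and the code prints the command rather than running it.

```ruby
require 'shellwords'

# Builds the argument list for the streaming example above.
# output_bucket is the only value that varies; every flag and sample
# path is copied from the example. Hypothetical helper for illustration.
def streaming_command(output_bucket)
  [
    'elastic-mapreduce', '--create',
    '--stream',
    '--input',  's3n://elasticmapreduce/samples/wordcount/input',
    '--mapper', 's3://elasticmapreduce/samples/wordcount/wordSplitter.py',
    '--output', "s3n://#{output_bucket}/output_path"
  ]
end

# Print the command as a single shell-escaped line for review.
puts Shellwords.join(streaming_command('mybucket'))
```

Keeping the command as an array means it can later be passed to `system(*streaming_command('mybucket'))`, which runs the program without going through a shell.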

The following creates a job flow that runs a Hadoop job using a Java main function in a JAR stored in Amazon S3.

    # create a job flow and run a Hadoop job from a jar stored in Amazon S3
    elastic-mapreduce --create \
      --jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar \
      --arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br \
      --arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br \
      --arg hdfs:///cloudburst/output/1 \
      --args 36,3,0,1,240,48,24,24,128,16

Add a step to a running Job Flow

The following adds a step that uses a Java main function to execute a Hadoop job.

    elastic-mapreduce --jobflow j-ABABABABABAB \
      --jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar \
      --arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br \
      --arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br \
      --arg hdfs:///cloudburst/output/1 \
      --args 36,3,0,1,240,48,24,24,128,16

Terminate a Job Flow

    # terminate all active job flows
    elastic-mapreduce --list --active --terminate

    # terminate a running job flow
    elastic-mapreduce --terminate --jobflow j-ABABABASABA

Active job flows are job flows that are either starting or running.
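
The definition of "active" can be illustrated with a small filter. This is a sketch under stated assumptions: the hash shape, the "JobFlowId" and "State" keys, and the state strings are hypothetical and are not the client's actual response format.

```ruby
# States treated as active, following the sentence above (starting or running).
# Assumption: the exact state strings are chosen for illustration.
ACTIVE_STATES = %w[STARTING RUNNING].freeze

# Returns the IDs of the job flows whose state is one of ACTIVE_STATES.
# Each job flow is a hypothetical hash with "JobFlowId" and "State" keys.
def active_job_flow_ids(job_flows)
  active = job_flows.select { |flow| ACTIVE_STATES.include?(flow['State']) }
  active.map { |flow| flow['JobFlowId'] }
end

# Placeholder job flow IDs reused from the examples in this document.
sample = [
  { 'JobFlowId' => 'j-ABABABABABAB', 'State' => 'RUNNING' },
  { 'JobFlowId' => 'j-ABABABASABA',  'State' => 'STARTING' },
  { 'JobFlowId' => 'j-ABABABABA',    'State' => 'COMPLETED' }
]

puts active_job_flow_ids(sample).inspect
```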

SSH onto the Master Node of a Running Job Flow

    elastic-mapreduce --ssh --jobflow j-ABABABABA

Run a Job Flow with Debugging Enabled and Hive and Pig Installed

    elastic-mapreduce --create --alive --name "My Dev JobFlow" \
      --enable-debugging --hive-interactive --pig-interactive

Additional Documentation

The archive comes with a more extensive README file that explains how to use the command line client in more detail.

Comments

ruby bug
I'm getting this error when i try to run it, where can i make a bug report?

    elastic-mapreduce --help
    /usr/local/ec2/elastic-mapreduce-ruby/amazon/coral/httpdestinationhandler.rb:23: warning: else without rescue is useless
    /usr/local/ec2/elastic-mapreduce-ruby/amazon/coral/awsquery.rb:3:in `require': /usr/local/ec2/elastic-mapreduce-ruby/amazon/coral/httpdestinationhandler.rb:19: syntax error, unexpected ':', expecting keyword_then or ',' or ';' or '\n' (SyntaxError)
    /usr/local/ec2/elastic-mapreduce-ruby/amazon/coral/httpdestinationhandler.rb:36: syntax error, unexpected keyword_end, expecting $end
        from /usr/local/ec2/elastic-mapreduce-ruby/amazon/coral/awsquery.rb:3:in `'
        from /usr/local/ec2/elastic-mapreduce-ruby/amazon/coral/service.rb:5:in `require'
        from /usr/local/ec2/elastic-mapreduce-ruby/amazon/coral/service.rb:5:in `'
        from /usr/local/ec2/elastic-mapreduce-ruby/amazon/coral/elasticmapreduceclient.rb:3:in `require'
        from /usr/local/ec2/elastic-mapreduce-ruby/amazon/coral/elasticmapreduceclient.rb:3:in `'
        from /usr/local/ec2/elastic-mapreduce-ruby/elastic-mapreduce:7:in `require'
        from /usr/local/ec2/elastic-mapreduce-ruby/elastic-mapreduce:7:in `'
Tommy Chheng on August 16, 2010 7:52 PM GMT
ruby emr & json
This library was well built, but I have a few concerns. The lack of proper documentation and examples forced me to go to the actual Amazon EMR API to better understand how the library works. Also, if you take a closer look at the code, the JSON module and various methods are overwritten. Keep this in mind if you have implementations that rely on system or bundled json gems. Also, the library makes it extremely difficult to determine job flow and job flow step information, since the responses returned are hashes within arrays within hashes where keys are strings and camel cased. The implementation would be much more DRY and concise if there were convenience methods to obtain information about job flows and steps. I will most likely work on refactoring this code for use internally. It's a great library, but be aware of the implementation and lack of documentation.
kennethcheung on July 8, 2010 4:48 PM GMT
Good tutorial...
A few things to clarify: for access_id, go to 'Accounts' -> 'Security Credentials' -> 'Access Keys' (it's the 'Access Key ID' there). For private_key, from the same page, click on 'Show' under 'Secret Access Key' to find it.
dammmitimmad on January 31, 2010 10:03 PM GMT
re:
I think the keypair in the credentials.json means the name of the key pair, such as gsg-keypair
yuangw on November 16, 2009 10:00 PM GMT
edit/fix for "Terminate a Job Flow" section
Minor edit: the "Terminate a Job Flow" section shows use of "elastic-map-reduce" client script, which should be "elastic-mapreduce" instead.
insights4sharethis on July 18, 2009 5:08 AM GMT
Credentials File Needs More Docs
This tutorial works flawlessly, except I can't figure out what goes in the following fields: "keypair": "" and "log_uri": "". Is keypair the path to a file or is it the actual keypair? I found the docs for log_uri, so that's OK; I thought I'd include it in the comment: s3n://[bucket name]/lastfm/logs/ What is the keypair value? What goes there? HELP!
Paul Kenjora on April 29, 2009 7:35 PM GMT
©2010, Amazon Web Services LLC or its affiliates. All rights reserved.