How to Create and Debug an Amazon Elastic MapReduce Job Flow

This document provides a quick guide on how to use Elastic MapReduce to develop, debug, and run job flows that have multiple steps.

Details

Submitted By: Gwen@AWS
AWS Products Used: Amazon Elastic MapReduce
Language(s): Python, Ruby
Created On: June 29, 2010 12:51 AM GMT
Last Updated: July 9, 2010 7:35 PM GMT

How to Create and Debug an Elastic MapReduce Job Flow

This document explains how to use Elastic MapReduce to develop, debug, and run job flows that have multiple steps.

Amazon Elastic MapReduce, Amazon Elastic Compute Cloud, and Amazon Simple Storage Service are sometimes referred to within this document as "Elastic MapReduce," "EC2," and "Amazon S3," respectively.

Understanding Elastic MapReduce Job Flows

A job flow is a user-defined task that Amazon Elastic MapReduce executes by running a cluster of Amazon EC2 instances. A job flow defines one or more steps. A step is a MapReduce algorithm implemented as a Java JAR or a Hadoop streaming program written in Python, Ruby, Perl, or C++. Steps are executed in sequence on the master node of an Amazon EC2 cluster.

A job flow typically consists of multiple steps where the output of one step becomes the input of the next. For example, you might have a step that counts the number of times each word occurs in a document and a second step that sorts the output from the first step based on the number of occurrences.

Data is normally passed from one step to the next using files stored in Hadoop's distributed file system (HDFS). Data stored in HDFS exists only as long as the job flow is running. Once the job flow has shut down, all HDFS data is discarded. The final step in a job flow usually stores its results in Amazon S3 so that they are accessible to future job flows.
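
To make the two-step example concrete, the following is a purely illustrative sketch of what the two streaming phases might look like as a single Python script. The phase names, the single-file layout, and the trick of zero-padding counts so that Hadoop's key sort orders them numerically are assumptions of this example, not code shipped with Elastic MapReduce:

#!/usr/bin/env python
# Illustrative two-step streaming job written as stdin/stdout filters.
# Step 1 mapper emits "word<TAB>1"; a summing reducer totals the counts
# per word. Step 2 mapper swaps the columns to "count<TAB>word" so that
# Hadoop's shuffle sorts the records by count before the second reducer.
import sys

def count_mapper():
    for line in sys.stdin:
        for word in line.strip().split():
            print "%s\t1" % word

def sort_mapper():
    for line in sys.stdin:
        word, count = line.strip().split("\t")
        # Zero-pad the count so the textual key sort matches numeric order.
        print "%010d\t%s" % (int(count), word)

if __name__ == "__main__":
    # Select the phase with an argument: "count" (step 1) or "sort" (step 2).
    count_mapper() if sys.argv[1:] == ["count"] else sort_mapper()

In a real job flow, each phase would be uploaded to Amazon S3 and referenced as the mapper of its own step, with the first step writing to an HDFS path that the second step reads as its input.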

MapReduce is the algorithm you use to process your data using Hadoop. You can specify the algorithms in several ways:

- Using Hive, which enables you to use a SQL-like high-level language called HiveQL to create MapReduce algorithms for job flows

- Using Pig, which enables you to use a high-level language called Pig Latin to create MapReduce algorithms for job flows

- Using Java to create the algorithm and thereby creating a Custom JAR to use for the Hadoop job flow

- Using some computer language other than Java to create the algorithm and thereby creating a Streaming job flow

You can create job flows consisting of multiple steps using the Elastic MapReduce command line interface (CLI). The AWS console only supports job flows with single steps. This document primarily describes how to manage job flows with the CLI. Details on how to use the AWS Console and Elastic MapReduce API are in the Amazon Elastic MapReduce Developer Guide and the Amazon Elastic MapReduce API Reference.

Installing Ruby

You must have the following versions of Ruby and the Ruby JSON library installed to use the Elastic MapReduce CLI.

  • Ruby version 1.8 or later. To get Ruby, go to http://www.ruby-lang.org/.

  • Ruby JSON library version 0.4.2-1 or later. To get the Ruby JSON library, go to http://json.rubyforge.org/.

To Install Ruby

1. Use the appropriate download command for your operating system.

2. Verify that Ruby is running by typing the following at the command prompt:

$ ruby -v

If the version displays, you installed Ruby correctly. If the version does not display, repeat the installation process.

Installing the Command Line Interface

To download the Elastic MapReduce CLI

1. Create a local directory for the CLI. For example, from the command line enter:

mkdir elastic-mapreduce-cli

2. Go to http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2264&categoryID=273 and click Download.

3. Save the file in your newly created directory.

To install the Elastic MapReduce CLI

1. If not already there, navigate to the elastic-mapreduce-cli directory.

2. Unzip the compressed file by entering the following from the command line.

$ unzip elastic-mapreduce-ruby.zip

Configuring Credentials

Your credentials are used to calculate the signature value for every request you make. Elastic MapReduce automatically looks for them in credentials.json. For that reason, it's most convenient to edit the credentials.json file and include your AWS credentials. You can substitute the name of another file on the command line where your credentials reside, but you must name the file in every request.

An AWS key pair is a security credential similar to a password, which you use to securely connect to your instance once it's running. If you are new to AWS, you have not yet created any key pairs. We recommend that you create a new key pair.

To create an Amazon EC2 key pair

1. Go to the EC2 tab of the AWS Management Console. If you are not logged in to AWS, supply your AWS account credentials.

2. From the Amazon EC2 Console Dashboard, click Key Pairs.

3. On the Key Pairs page, click Create Key Pair.

4. Enter a name for your key pair, for example, mykeypair.

5. Click Create.

6. Save the PEM file in a safe location. You will need this file when logging in to the master node running your job flow.

On a Linux or UNIX computer, set the permissions on the PEM file that contains your new key pair. For example, if you saved the file as mykeypair.pem, the command looks like:

chmod og-rwx mykeypair.pem

The Elastic MapReduce credentials file provides information required for many commands. The file saves you from the trouble of entering the information repeatedly.

To update your credentials file

1. If you do not already have a credentials file, create a file named credentials.json located in your CLI directory.

2. Add the following lines to your credentials file:

{
  "access_id": "[Your AWS Access Key ID]",
  "private_key": "[Your AWS Secret Access Key]",
  "keypair": "[key pair name created previously]",
  "log_uri": "[Path to a bucket you own in Amazon S3, for example, s3://mybucket/logs]",
  "region": "[The region where you want to launch your job flow, either us-east-1 or eu-west-1]"
}
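
Because a missing comma or bracket in credentials.json leads to confusing CLI errors, you may find it helpful to confirm that the file parses and contains the expected keys. The snippet below is a hypothetical convenience check using Python's standard json module; it is not part of the Elastic MapReduce CLI:

#!/usr/bin/env python
# Hypothetical sanity check for credentials.json; not part of the CLI.
import json

creds = json.load(open("credentials.json"))   # raises ValueError if malformed

for key in ("access_id", "private_key", "keypair", "log_uri", "region"):
    if key not in creds:
        print "credentials.json is missing the key: %s" % key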

To retrieve your AWS credentials

a. Go to http://aws.amazon.com

b. Click Your Account/Access Identifiers.

c. Record your IDs.

To create a log-uri

a. The value of the log-uri is an Amazon S3 bucket that you create. To create an S3 bucket, go to https://console.aws.amazon.com/s3/home.

b. Click Create Bucket.

c. Enter a bucket name, for example, mylog-uri. Bucket names must be globally unique, so you cannot use a name already taken by another bucket.

d. Choose the same Region that you specified for region in your credentials.json file.

Note:

When you enable logging in the Create a Bucket wizard, it only enables bucket access logs, not Elastic MapReduce job flow logs.

e. Click Create. You have now created a bucket with the URI http://s3.amazonaws.com/mylog-uri/.

Listing All Command Line Options

You can use the help option to list all of the commands available in the Elastic MapReduce CLI.

To list all of the command options

Type the following at the command prompt:

$ ./elastic-mapreduce --help 

For Microsoft Windows, type the following:

$ ruby elastic-mapreduce --help 

The general format is:

elastic-mapreduce [options] 

Creating and Managing Job Flows

This section describes basic introductory information on how to create and manage job flows. For more detailed descriptions and additional functionality, see the Amazon Elastic MapReduce Developer Guide.

Creating a Job Flow

Using the command line interface, you can construct a job flow that will continue to run until you terminate it. This is useful for debugging. When a step fails, you can add another step to your active job flow without having to incur the shutdown and startup cost of a job flow.

The following command will start a job flow that will keep running and consuming resources until you terminate it.

To create a job flow

From the command prompt, enter a command similar to the following:

$ ./elastic-mapreduce --create --alive --log-uri s3n://my-example-bucket/logs
Created job flow j-36U2JMAE73054

By default this will launch a job flow running on a single m1.small instance. Later, when your steps are running correctly on small sample data, you will want to launch job flows running on more instances.

You can specify the number of instances and the type of instance to run with the --num-instances and --instance-type options.

The --alive option tells the job flow to keep running even when it has finished all its steps.

The --log-uri option specifies a location in Amazon S3 where the log files from your job flow are pushed. It can be safely omitted if you have not created a bucket in Amazon S3. Log files are not pushed to Amazon S3 until 5 minutes after a step completes. For debug sessions, you will most likely log on to the master node of your active job flow. Specifying a log URI is required if you want to be able to read log files from Amazon S3 after the job flow has terminated.
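
If you create job flows from scripts, one convenient pattern is to wrap the CLI and capture the job flow ID from the "Created job flow" line it prints. The following is a hedged sketch of such a wrapper; the function name is hypothetical, and it assumes the CLI and credentials.json are set up in the current directory as described above:

#!/usr/bin/env python
# Hypothetical wrapper around the Elastic MapReduce CLI. Assumes
# ./elastic-mapreduce and credentials.json are in the current directory.
import re
import subprocess

def create_job_flow(log_uri):
    output = subprocess.Popen(
        ["./elastic-mapreduce", "--create", "--alive", "--log-uri", log_uri],
        stdout=subprocess.PIPE).communicate()[0]
    # The CLI prints a line such as "Created job flow j-36U2JMAE73054".
    match = re.search(r"j-\w+", output)
    return match.group(0) if match else None

if __name__ == "__main__":
    print create_job_flow("s3n://my-example-bucket/logs")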

How to Create a Job Flow Using Hive

You can run Hive in interactive or batch modes. Typically, you use interactive mode to troubleshoot your job flow and batch mode in production. For information on how to use Hive in batch mode, see the Amazon Elastic MapReduce Developer Guide.

To Run Hive in Interactive Mode

1. Use the alive option with the create command so that the job flow remains active until you terminate it.

$ elastic-mapreduce --create --alive --name "Testing Hive -- $USER" \ --num-instances 5 --instance-type c1.medium \ --hive-interactive 

The return output is similar to the following:

Created jobflow j-ABABABABABAB

2. Wait for the job flow to start and reach a waiting state. Optionally, you can list running job flows using the following command.

$ elastic-mapreduce --list --active

3. Once the job flow is in the waiting state, SSH in to the master node as the hadoop user and run Hive.

$ elastic-mapreduce --jobflow j-ABABABABABAB --ssh
...
hadoop@domU-12-12-12-12-12-12:~$ hive
hive>

You are now running Hive in interactive mode and can execute Hive queries.

How to Create a Job Flow Using Pig

You can run Pig in interactive or batch modes. Typically, you use interactive mode to troubleshoot your job flow and batch mode in production. For information on how to use Pig in batch mode, see the Amazon Elastic MapReduce Developer Guide.

To Run Pig in Interactive Mode

1. Use the alive option with the create command so that the job flow remains active until you terminate it.

$ elastic-mapreduce --create --alive --name "Testing Pig -- $USER" \ --num-instances 5 --instance-type c1.medium \ --pig-interactive 

The return is similar to the following:

 Created jobflow j-ABABABABABAB 

2. Wait for the job flow to start and reach a waiting state. Optionally, you can list running job flows using the following command:

 $ elastic-mapreduce --list --active 

3. Once the job flow is in the waiting state, SSH in to the master node as the hadoop user and run Pig.

$ elastic-mapreduce --jobflow j-ABABABABABAB --ssh
...
hadoop@domU-12-12-12-12-12-12:~$ pig
grunt>

You are now running Pig in interactive mode and can execute Pig queries.

How to Create a Job Flow Using Streaming

To run a job flow using this method, you must first describe it in a JSON file. The following example shows how to run a streaming job flow that resides in the file streaming_jobflow.json. A streaming job flow uses a MapReduce executable not written in Java.

This job flow uses a Python script that counts the words in a file. Before using the following sample job flow code, replace the variables MY_LOG_BUCKET and MY_BUCKET with real Amazon S3 bucket names and MY_KEY_NAME with an EC2 key pair name. Save the following example job flow in a file called streaming_jobflow.json.

{
  "Name": "Wordcount Using Python Example", 
  "LogUri": "[MY_LOG_BUCKET]/log", 
  "Instances": 
  { 
   "SlaveInstanceType": "m1.small", 
   "MasterInstanceType": "m1.small", 
   "InstanceCount": "1", 
   "Ec2KeyName": "[MY_KEY_NAME]", 
   "KeepJobFlowAliveWhenNoSteps": "false" 
  }, 

"Steps": 
 [ 
 
   { 
      "Name": "Streaming Job flow Step", 
      "ActionOnFailure": "CONTINUE", 
      "HadoopJarStep": 
      { 
         "Jar": "/home/hadoop/contrib/streaming/hadoop-0.18-streaming.jar", 
         "Args": 
         [ 
            "-input", "elasticmapreduce-external/demo/wordcount/input", 
            "-output", "[MY_BUCKET]/demo/output", 
            "-mapper", "elasticmapreduce-external/demo/wordcount/wordSplitter.py", 
            "-reducer", "aggregate" 
         ] 
      } 
    } 
  ] 
}

This job flow specifies the type of hardware to run the job flow on, including the number of machines (InstanceCount), the place for the job flow to upload logs to (LogUri), whether or not the job flow should remain running after all steps have finished (KeepJobFlowAliveWhenNoSteps), and what happens if there is an error (ActionOnFailure). The Steps element is an array that enables you to specify zero or more steps. Elastic MapReduce executes multiple steps in the order listed in the job flow.

Note:

Zero steps means that data is not processed. You might specify zero steps to set up the EC2 cluster so that you can subsequently use AddJobFlowSteps to add steps in a debugging scenario.

Note:

If the InstanceCount value is 1, a single instance serves as both the master node and the slave node. If the value is greater than one, one instance is the master node and the remainder are slave nodes.

If you are using the HDFS file system for the input data, use three slashes (///) to designate the path, for example, hdfs:///aws-hadoop/MyCompany/sampleInput/.

The reducer executable, aggregate, is included from the Aggregate library that comes with Hadoop. Aggregate provides many basic reducer aggregations, such as sum, max, and min.

To create the Amazon S3 bucket where you want the processed data to be uploaded, use the Amazon S3 console at https://console.aws.amazon.com/s3/home. You then use the -output parameter to specify the location of that bucket.

The RunJobFlow example uses the following Python script, wordSplitter.py.

#!/usr/bin/python
import sys

def main(argv):
    # Emit one record per word in the format expected by the aggregate
    # reducer: "LongValueSum:<word>\t1".
    line = sys.stdin.readline()
    try:
        while line:
            line = line.rstrip()
            words = line.split()
            for word in words:
                print "LongValueSum:" + word + "\t" + "1"
            line = sys.stdin.readline()
    except EOFError:
        return None

if __name__ == "__main__":
    main(sys.argv)
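
The LongValueSum: prefix tells the aggregate reducer to sum the 1s emitted for each word. Purely as a hedged illustration of what that summing step does (this is not the code Hadoop actually runs), a hand-written streaming reducer for plain "word<TAB>count" records might look like:

#!/usr/bin/env python
# Illustrative summing reducer: streaming input arrives grouped and sorted
# by key, one "word<TAB>count" record per line.
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word and current_word is not None:
        print "%s\t%d" % (current_word, total)
        total = 0
    current_word = word
    total += int(count)
if current_word is not None:
    print "%s\t%d" % (current_word, total)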

To run a streaming job flow

  1. Use the Amazon S3 console at https://console.aws.amazon.com/s3/home to upload a MapReduce executable, such as wordSplitter.py, into an Amazon S3 bucket.
  2. Run the job flow using the RunJobFlow command, as follows.

$ ./elasticmapreduce-client.rb RunJobFlow streaming_jobflow.json
{"JobFlowId":"j-138L1TOL8PIJT"}

The maximum lifetime of a job flow is 2 weeks.

Running a Custom JAR Job Flow

This section explains how to run a job flow that uses a MapReduce executable written in Java.

This job flow specifies the type of hardware used to run the job flow, including the number of machines (InstanceCount), the place for the job flow to upload logs to (LogUri), whether the job flow should remain running after all steps have finished (KeepJobFlowAliveWhenNoSteps), and what happens in the event an error occurs (ActionOnFailure). The Steps element is an array that enables you to specify zero or more steps. Elastic MapReduce executes multiple steps in the order listed in the job flow.

Note:

Zero steps means that data is not processed. You might specify zero steps to set up the EC2 cluster so that you can subsequently use AddJobFlowSteps to add steps in a debugging scenario.

1. Build a JAR that contains a main function and has been linked against Hadoop version 0.20. For more information about building a Hadoop main function, go to http://hadoop.apache.org/core/docs/current/mapred_tutorial.html and look at the word count example.

2. To bundle a JAR, put your source code in the src directory and unpack Hadoop into a directory stored in $HADOOP_HOME. Bundle the JAR in one of the following ways:

If you are using Linux or UNIX input the following:

export HADOOP_HOME=/home/myname/hadoop-0.20
mkdir -p build/output
javac -cp $CLASSPATH:$(ls $HADOOP_HOME/*.jar $HADOOP_HOME/lib/*.jar | tr "\n" ":") \
    -sourcepath src $(find src -name "*.java") -d build/output -s build/output
pushd build/output; jar cf ../wordcount.jar * ; popd

If you are using Microsoft Windows, export a JAR file from Eclipse using the export command in the File menu.

3. Upload your JAR to Amazon S3 using the Amazon S3 console at https://console.aws.amazon.com/s3/home.

4. Construct a job flow description that specifies the JAR.

This example job flow runs the custom JAR that you uploaded.

a. Before using the following sample job flow code, replace the variables MY_LOG_BUCKET and MY_BUCKET with real Amazon S3 bucket names and MY_KEY_NAME with an EC2 key pair name.

b. Save the following example job flow in a file called custom_jar_jobflow.json.

{
   "Name": "Execute custom jar step", 
   "LogUri": "[MY_LOG_BUCKET]/log", 
   "Instances": 
   { 
      "SlaveInstanceType": "m1.small", 
      "MasterInstanceType": "m1.small", 
      "InstanceCount": "1", 
      "Ec2KeyName": "[MY_KEY_NAME]", 
      "KeepJobFlowAliveWhenNoSteps": "false" 
   }, 
   "Steps": 
   [ 
      { 
         "Name": "Custom Jar Grep Example", 
         "ActionOnFailure": "CONTINUE", 
         "HadoopJarStep": 
         { 
            "MainClass": "grep", 
            "Jar": "[MY_BUCKET]/demo/custom.jar", 
            "Args": 
            [ 
               "[MY_BUCKET]/demo/input", 
               "[MY_BUCKET]/output", 
               "the" 
            ] 
         } 
      } 
   ] 
}

5. Run the job flow.

 $ ./elasticmapreduce-client.rb RunJobFlow custom_jar_jobflow.json
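
Because the job flow descriptions above contain placeholders, you may want to substitute your bucket and key pair names programmatically and confirm that the result is still valid JSON before calling RunJobFlow. The helper below is hypothetical (including the template file name); it simply replaces the placeholders and parses the output:

#!/usr/bin/env python
# Hypothetical helper: fill in the placeholders in a job flow template and
# verify that the result is valid JSON before running it with the client.
import json

substitutions = {
    "[MY_LOG_BUCKET]": "s3n://my-example-bucket",
    "[MY_BUCKET]": "s3n://my-example-bucket",
    "[MY_KEY_NAME]": "mykeypair",
}

text = open("custom_jar_jobflow.json.template").read()
for placeholder, value in substitutions.items():
    text = text.replace(placeholder, value)

json.loads(text)                                 # raises ValueError if malformed
open("custom_jar_jobflow.json", "w").write(text)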

Creating a Job Flow with Bootstrap Actions

Bootstrap actions allow you to pass a reference to a script stored in Amazon S3 and related arguments to Elastic MapReduce when creating a job flow. This script is executed on each job flow instance before the actual job flow runs.

In the following procedure, we provide an example that creates a predefined word count sample job flow with a bootstrap action script that downloads and extracts a compressed tar archive from Amazon S3. The sample script we use is already stored in Amazon S3:

http://elasticmapreduce.s3.amazonaws.com/bootstrap-actions/download.sh

The sample script looks like this:

#!/bin/bash
set -e 
bucket=elasticmapreduce 
path=samples/bootstrap-actions/file.tar.gz 
wget -S -T 10 -t 5 http://$bucket.s3.amazonaws.com/$path 
mkdir -p /home/hadoop/contents 
tar -C /home/hadoop/contents -xzf file.tar.gz  
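
A bootstrap action can be any executable the instances are able to run, not only a shell script. Purely as a hedged illustration (this is not an official sample), a rough Python equivalent of download.sh might look like the following; it assumes Python and network access are available on the instance:

#!/usr/bin/env python
# Rough, hypothetical Python equivalent of download.sh: fetch a tar archive
# from Amazon S3 over HTTP and unpack it under /home/hadoop/contents.
import os
import tarfile
import urllib

bucket = "elasticmapreduce"
path = "samples/bootstrap-actions/file.tar.gz"
urllib.urlretrieve("http://%s.s3.amazonaws.com/%s" % (bucket, path),
                   "file.tar.gz")

dest = "/home/hadoop/contents"
if not os.path.isdir(dest):
    os.makedirs(dest)
archive = tarfile.open("file.tar.gz", "r:gz")
archive.extractall(dest)
archive.close()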

References to bootstrap action scripts are passed to Elastic MapReduce by using the --bootstrap-action parameter with the create command. The syntax for this parameter is:

--bootstrap-action "s3://<mybucket>/<myfile1>" --args  "<arg1>","<arg2>" 

You can specify up to 16 bootstrap actions per job by providing multiple --bootstrap-action parameters on the command line.

Creating a job flow with bootstrap actions

To create a job flow with bootstrap actions

If you're running a Linux or UNIX computer, enter the following command at a command prompt:

./elastic-mapreduce --create --stream --alive \
--input s3n://elasticmapreduce/samples/wordcount/input \
--mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
--output s3n://yourbucket \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/download.sh

If you're running on Windows, enter the following command at a command prompt:

ruby elastic-mapreduce --create --stream --alive \
--input s3n://elasticmapreduce/samples/wordcount/input \
--mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
--output s3n://yourbucket \
--bootstrap-action "s3://elasticmapreduce/bootstrap-actions/download.sh"

While the job flow master instance is running, you can connect to the master instance and see the log files that the bootstrap action script generated, stored in the /mnt/var/log/bootstrap-actions/1 directory:

$ cd /mnt/var/log/bootstrap-actions/1 
$ ls 
controller stderr stdout 

The contents of the stderr log file should look similar to the following example:

$ cat stderr
--2010-02-20 01:26:24-- http://beta.elasticmapreduce.s3.amazonaws.com/samples/bootstrap-actions/file.tar.gz 
Resolving beta.elasticmapreduce.s3.amazonaws.com... 72.21.211.147 
Connecting to beta.elasticmapreduce.s3.amazonaws.com|72.21.211.147|:80... connected. 
HTTP request sent, awaiting response... 
HTTP/1.1 200 OK 
x-amz-id-2: W4ggqTgUmMWylfuQksXgBi5hxrgmiHp8LYrZH184CXpUW+s1/jfOvKmhoG/NIFJz 
x-amz-request-id: D673608F6C6114D2 
Date: Sat, 20 Feb 2010 01:26:25 GMT 
x-amz-meta-s3fox-filesize: 153 
x-amz-meta-s3fox-modifiedtime: 1256233644776 
Last-Modified: Thu, 22 Oct 2009 17:47:44 GMT 
ETag: "47a007dae0ff192c166764259246388c" 
Content-Type: application/gzip 
Content-Length: 153 
Connection: Keep-Alive 
Server: AmazonS3 
Length: 153 [application/gzip] 
Saving to: `file.tar.gz' 

0K                                                       100% 24.3M=0s 
2010-02-20 01:26:24 (24.3 MB/s) - `file.tar.gz' saved [153/153]  

You can verify that the bootstrap action script worked by changing to the /home/hadoop/contents directory. You should see a README file containing the words "Hello world!":

$ cd /home/hadoop/contents
$ ls
README
$ cat README
Hello World!
$

If the bootstrap action fails, the job flow terminates with a lastStateChangeReason error code. You can examine the log files generated by the bootstrap action. They are stored in the log directory in the Amazon S3 bucket that you specified when you ran the job flow. For example:

 s3://my-bucket/logs/j-ABABABA/i-18889182/bootstrap-actions/1/stderr
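
If you prefer to examine those Amazon S3 log files programmatically rather than through the console, the sketch below uses the third-party boto library (not part of the Elastic MapReduce CLI); the bucket name and key prefix are the hypothetical ones from the example path above:

#!/usr/bin/env python
# Hypothetical helper using the third-party boto library to list and print
# bootstrap action log files that a job flow pushed to Amazon S3.
import boto

conn = boto.connect_s3()          # reads your AWS keys from the environment
bucket = conn.get_bucket("my-bucket")
for key in bucket.list(prefix="logs/j-ABABABA/i-18889182/bootstrap-actions/"):
    print key.name
    print key.get_contents_as_string()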

Retrieving Information About a Job Flow

You can get information about a job flow using the describe option and a specified job flow ID.

To get information about a job flow

Use the describe option with a valid job flow ID.

 $ ./elastic-mapreduce --describe --jobflow [JobFlowId]

The response looks similar to the following:

{ 
   "JobFlows": 
   [ 
      { 
         "LogUri": null, 
         "Name": "Development Job Flow", 
         "ExecutionStatusDetail": 
         { 
            "EndDateTime": 1237948135.0, 
            "CreationDateTime": 1237947852.0, 
            "LastStateChangeReason": null, 
            "State": "COMPLETED", 
            "StartDateTime": 1237948085.0 
         }, 
         "Steps": [], 
         "Instances": 
         { 
            "Ec2KeyName": null, 
            "InstanceCount": 1.0, 
            "Placement": 
            { 
               "AvailabilityZone": "us-east-1a" 
            }, 
            "KeepJobFlowAliveWhenNoSteps": false, 
            "MasterInstanceType": "m1.small", 
            "SlaveInstanceType": "m1.small", 
            "MasterPublicDnsName": "ec2-67-202-3-73.compute-1.amazonaws.com", 
            "MasterInstanceId": "i-39325750" 
         }, 
         "JobFlowId": "j-3GJ4FRRNKGY97" 
      } 
   ] 
} 
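
Because the --describe output is JSON, you can hand it to a short script and pull out just the fields you need, such as the job flow state and the master node DNS name. The helper below is a hypothetical convenience and assumes the CLI is invoked the same way as elsewhere in this document:

#!/usr/bin/env python
# Hypothetical helper: run --describe for a job flow and print its state
# and master node DNS name from the returned JSON.
import json
import subprocess
import sys

def describe(jobflow_id):
    output = subprocess.Popen(
        ["./elastic-mapreduce", "--describe", "--jobflow", jobflow_id],
        stdout=subprocess.PIPE).communicate()[0]
    return json.loads(output)["JobFlows"][0]

if __name__ == "__main__":
    flow = describe(sys.argv[1])
    print flow["ExecutionStatusDetail"]["State"]
    print flow["Instances"].get("MasterPublicDnsName")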

Listing Job Flows

Use the list option by itself or in combination with other options to list job flows in various states. This section presents some of those listings.

To list job flows created in the last two days

Use the list option to list job flows:

 $ ./elastic-mapreduce --list 

The response is similar to the following:

j-1YE2DN7RXJBWU    FAILED                                                   Example Job Flow
                      CANCELLED    Custom Jar
j-3GJ4FRRNKGY97    COMPLETED  ec2-67-202-3-73.compute-1.amazonaws.com       Example job flow
j-5XXFIQS8PFNW     COMPLETED  ec2-67-202-51-30.compute-1.amazonaws.com      demo 3/24 s1
                      COMPLETED    Custom Jar

This example shows there were three job flows created in the last two days. The indented lines are job flow steps. The columns for a job flow line are job flow ID, job flow state, master node DNS name, and job flow name. The columns for a job flow step line are step state and step name.

If you have not created any job flows in the last two days you will get no output from the command.

To list only active job flows

Use the list and active options, as follows:

$ ./elastic-mapreduce --list --active 

The response lists job flows that are starting, running, or shutting down.

To list only running or terminated job flows

Use the --state option with the RUNNING and TERMINATED states, as follows:

$ ./elastic-mapreduce --list --state RUNNING --state TERMINATED 

The response lists job flows that are running or have terminated.

Adding Steps to a Job Flow

You can add steps to a job flow only if you set the RunJobFlow parameter KeepJobFlowAliveWhenNoSteps to True. This value keeps the EC2 cluster engaged even after the successful completion of a job flow. If you already set the value to False, just revise the jar in the job flow and rerun it.

To add a step to a job flow using default parameter values

Add the step using the -j option, as follows.

 $ ./elastic-mapreduce -j j-36U2JMAE73054 --streaming 

The --streaming argument adds a streaming step using default parameters. The default parameters are the word count example that is available in the Elastic MapReduce console.

You can see the step you just added using the console. In the console, refresh the job flow (j-36U2JMAE73054 in this example) you created, click it, and look at the detail pane in the lower half of the screen and you'll see the step you just added.

To add a step to a job flow using non-default parameter values

Add the step using the -j option, as follows.

$ ./elastic-mapreduce -j j-36U2JMAE73054 \
--jar s3n://elasticmapreduce/samples/cloudburst/cloudburst.jar \ 
--main-class org.myorg.WordCount \ 
--arg s3n://elasticmapreduce/samples/cloudburst/input/s_suis.br \ 
--arg s3n://elasticmapreduce/samples/cloudburst/input/100k.br \ 
--arg hdfs:///cloudburst/output/1 \ 
--arg 36 --arg 3 --arg 0 --arg 1 --arg 240 --arg 48 --arg 24 \ 
--arg 24 --arg 128 --arg 16  

This command runs an example job flow step that downloads and runs the jar. The arguments are passed to the main function in the jar.

If your JAR has a manifest (META-INF/MANIFEST.MF) that specifies the main class, you do not need to specify the main class with --main-class as was done in the previous example.

Terminating a Job Flow

You use a job flow ID to specify the job flow you want to terminate.

To terminate a job flow

Use the terminate command to terminate a job flow. This example uses job flow j-C019299B1X.

$ ./elastic-mapreduce --terminate j-C019299B1X 

Enabling Debugging

You can enable logging at the step and Hadoop job levels when you create a job flow by including two options:

--enable-debugging --log-uri pathToLogFilesOnAmazonS3

To access the log files, use the AWS Management Console.

Debugging Job Flows Using SSH

The easiest way to access log files is by using the AWS Management Console.

If you prefer not to use the console, you can set up an SSH tunnel between your host and the EC2 master node where you can look on the file system for log files or at the job flow statistics published by the Hadoop web server. The master node in the cluster contains summary information of all of the work done by the slave nodes. You can, however, explore the working and error logs on each slave node in an effort to resolve problems occurring in the execution of the job flow.

Important:

You cannot use Amazon Elastic MapReduce to debug an application unless you enable debugging by specifying both --log-uri and --enable-debugging in the same job flow creation command. The --log-uri option specifies a location in Amazon S3 where Amazon Elastic MapReduce saves logs (you must supply a bucket name as the argument), and --enable-debugging enables Amazon Elastic MapReduce to save the state of Hadoop jobs, tasks, and task attempts in Amazon SimpleDB. Without these options, Amazon Elastic MapReduce does not save state information for Hadoop jobs, tasks, or task attempts.

How to Determine the DNS of the Master Node

You need the DNS of the master node to log in so that you can inspect the log files. This section explains how to discover the DNS of the master node.

To determine the DNS of a master node

Use the --list option as follows. For a Linux or UNIX computer, enter:

 $ ./elastic-mapreduce --list --jobflow [Your job flow ID]

For Microsoft Windows, enter:

 $ ruby elastic-mapreduce --list --jobflow [Your job flow ID]

In the response, the third column lists the DNS name of the master node if that node is currently running.

If you don't know the job flow ID, use the --list option with the --active option (instead of the --jobflow option) to list all active job flows.

Monitoring Job Flow Status Using SSH

To log on to the master node, use one of the following commands.

For a Linux or UNIX computer, enter:

$ ./elastic-mapreduce --jobflow [Your job flow ID] --ssh 

For Microsoft Windows enter:

$ ruby elastic-mapreduce --jobflow [Your job flow ID] --ssh 

Alternatively, you can use SSH port forwarding to set up a secure link between your computer and the master node in the EC2 cluster that is processing your job flow. To SSH in to the master node as the hadoop user, your job flow must be in either the WAITING or RUNNING state. To make the job flow remain in the WAITING state even after successful completion, use the --alive option in the CreateJobFlow command.

To view the Elastic MapReduce logs on the EC2 master node

1. Open an SSH shell and use an ssh command of the following form to set up an SSH connection as the Hadoop user between your host and the EC2 master node.

 ssh -i [keyfile.pem] hadoop@[EC2_master_node_DNS]

Substitute the PEM file from your own key pair and the public DNS name of the master node. The following is an example for myKeyPairName.pem at ec2-67-202-49-73.

 ssh -i ~/ec2-keys/myKeyPairName.pem hadoop@ec2-67-202-49-73.compute-1.amazonaws.com

For keyfile, use the value you set for:

  • SSH Key Name in the console
  • Ec2KeyName in the CLI
  • Ec2KeyName in a RunJobFlow request

The key name provides a handle to the master node and enables you to log in to it as the hadoop user without using a password. You cannot SSH in to the master node if you did not set a value for SSH Key Name or Ec2KeyName.

For EC2_master_node_DNS, use the value returned for it in the console or from DescribeJobFlows. You always log in as hadoop.

Note:

As an alternative to SSH, you can use a utility, such as PuTTY.

If you get an error executing the ssh command, you might not have set the permissions on your PEM file correctly, the key file might be specified incorrectly, or you might have copied the DNS name incorrectly.

2. Navigate to /mnt/var/log/hadoop/steps/1 to see the logs on the master node for the first step. The second step log files are in /mnt/var/log/hadoop/steps/2 and so on. The log files are:

  • controller: Log file of the process that attempts to execute your step
  • syslog: Log file generated by Hadoop that describes the execution of your Hadoop job by the job flow step
  • stderr: The stderr log file generated by Hadoop when it attempts to execute your job flow
  • stdout: The stdout log file generated by Hadoop when it attempts to execute your job flow

These log files may not appear until the step has run for some time, finished, or failed. These logs contain counter and status information.
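
When a step fails, the quickest signal is usually an error line in syslog or stderr. The script below is a hypothetical convenience (not part of the Elastic MapReduce tools) that you could run on the master node to surface such lines for a given step:

#!/usr/bin/env python
# Hypothetical helper, run on the master node, that prints ERROR and
# Exception lines from a step's log files under /mnt/var/log/hadoop/steps.
import os
import sys

step = sys.argv[1] if len(sys.argv) > 1 else "1"
log_dir = "/mnt/var/log/hadoop/steps/%s" % step

for name in ("syslog", "stderr"):
    path = os.path.join(log_dir, name)
    if not os.path.exists(path):
        continue
    for line in open(path):
        if "ERROR" in line or "Exception" in line:
            print "%s: %s" % (name, line.rstrip())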

Note :

If you specified a log URI where Elastic MapReduce uploads log files to Amazon S3, you can inspect the log files on Amazon S3. There is, however, a five-minute delay between when the log files stop being written and when they are pushed to Amazon S3. So it is faster to look at the log files on the master node, especially if the step failed quickly.

Amazon Elastic MapReduce Resources

The following lists related resources that you'll find useful as you work with this service.

Amazon Elastic MapReduce Getting Started Guide http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/

Amazon Elastic MapReduce Developer Guide http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/

Amazon Elastic MapReduce API Reference http://docs.amazonwebservices.com/ElasticMapReduce/latest/API/

Amazon Elastic MapReduce Technical FAQ http://aws.amazon.com/elasticmapreduce/faqs/

AWS Developer Resource Center http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=59

Discussion Forums http://developer.amazonwebservices.com/connect/forum.jspa?forumID=52

Copyright © 2010 Amazon Web Services LLC or its affiliates. All rights reserved.
