Developing, Deploying and Debugging Hadoop Jobs in Eclipse for Amazon Elastic MapReduce using Karmasphere Studio

This tutorial will show you how to use Karmasphere Studio to develop, debug and deploy Hadoop Jobs for Amazon Elastic MapReduce.

Details

Submitted By: Amazon Web Services
AWS Products Used: Elastic MapReduce
Created On: October 28, 2011 12:28 AM GMT
Last Updated: October 28, 2011 12:28 AM GMT
Karmasphere

Introduction

Karmasphere Studio is a graphical development environment for prototyping, developing, debugging and deploying Hadoop jobs on Amazon Elastic MapReduce. Studio is available for Windows, Mac and Linux systems and dramatically accelerates the Hadoop development process.

Assumptions

This document assumes:

  • You are familiar with Amazon’s AWS products and architecture
  • You are familiar with Java and basic Hadoop MapReduce concepts

    About Karmasphere

    Karmasphere Studio’s familiar graphical environment supports the complete lifecycle for developing Amazon Elastic MapReduce applications, including prototyping, developing, testing, debugging, deploying and optimizing Hadoop Jobs. By simplifying development, Karmasphere Studio increases the productivity of developers, saving valuable time and effort. Its intuitive, visual interface enables the full spectrum of developers, from those just starting with Big Data to those highly experienced with Java, Cascading and Streaming, to take advantage of Elastic MapReduce.

    Who should use Karmasphere Studio?

    Karmasphere Studio is designed for developers who support analytic teams. It provides a graphical environment to learn Hadoop, develop custom analytic algorithms, and systematize the creation of the datasets that analysts find meaningful. The results can then be integrated into business processes and applications.

    How to get Karmasphere Studio

    Karmasphere Studio is available in the same pay-as-you-go, hourly pricing model as Amazon Elastic MapReduce, providing a low cost of entry and a single payment process through Amazon. For pricing details and to download the Karmasphere software, please visit the Elastic MapReduce with Karmasphere Analytics detail page.

    Starting Karmasphere Studio in Eclipse

    Karmasphere Studio is a plug-in for Eclipse and requires Eclipse version 3.6.2 to be installed. You will also need to install JDK 1.6 (Karmasphere Studio will not work with JDK 1.7). To download Eclipse, please go to the Eclipse web site at http://www.eclipse.org. Install Karmasphere Studio as a plug-in, using the Help -> Install New Software wizard within the Eclipse IDE.

    How to Configure Karmasphere Studio with your Amazon Credentials

    To use Karmasphere Studio with Elastic MapReduce, Amazon S3, and Amazon RDS, you need your AWS Access Key ID, your Secret Access Key, and an AWS key pair. An AWS key pair is a security credential similar to a password, which you use to securely connect to your instance once it's running.

    Before you can use Karmasphere Studio with Amazon Elastic MapReduce and Amazon S3, you must:

    1. Set up an Amazon AWS Account. If you don’t have an AWS account, you can create one here: www.amazon.com/gp/aws/registration/registration-form.html
    2. Generate and save an AWS SSH key pair (key name and private key file)

    How to Enter your AWS Account Information in Karmasphere Studio

    Note: Be sure that your Eclipse perspective is set to ‘Hadoop’ before beginning this section.

    Select Amazon Accounts in the left-hand panel and right-click on it. Click on New Amazon Account.

    Karmasphere

    Enter the Account Name, AWS Access Key ID, and AWS Secret Key. Test the account credentials by clicking on the Test button. Click Next.

    Karmasphere

    Enter your SSH Key information. Select the file(s) corresponding to the EC2 Key(s). After the required file(s) have been selected, click Finish.

    Karmasphere

    The Amazon Accounts tree will appear as shown, displaying your new account.

    Karmasphere

    Accessing Amazon S3

    Expand Amazon Accounts to show the tree structure of the account. This view shows the existing S3 filesystems, job flow templates, and Active and Terminated job flows.

    Karmasphere

    How to Access Amazon S3

    Karmasphere Studio enables you to browse, read and write data to the s3n buckets associated with your account.

    To use an existing bucket, right-click on the S3 bucket name and select Browse.

    Karmasphere

    Browsing the Amazon S3 Filesystem

    To browse an s3n bucket, select the bucket and click Browse. The contents of the bucket are shown in the right-hand portion of the window.

    Karmasphere

    To view a file, select the file, right-click, and choose Open from the pull-down menu. The file contents are displayed.

    Karmasphere
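    The same buckets can also be read and written programmatically from Hadoop code through s3n:// URIs. The following is a minimal sketch (the bucket name is a placeholder) using the Hadoop FileSystem API; your Access Key ID and Secret Access Key are supplied through the fs.s3n.* configuration properties, the same credentials you entered into Karmasphere Studio above:

        import java.net.URI;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ListS3nBucket {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Same Access Key ID / Secret Access Key as the Studio account profile.
                conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
                conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY");

                // "my-bucket" is a placeholder for one of the buckets shown under S3 Filesystems.
                FileSystem fs = FileSystem.get(URI.create("s3n://my-bucket/"), conf);
                for (FileStatus status : fs.listStatus(new Path("s3n://my-bucket/"))) {
                    System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
                }
            }
        }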

    How to Create a new s3n bucket

    To create a new s3n bucket, right-click on S3 Filesystems under Amazon Accounts, then click on New S3 Bucket.

    Karmasphere

    Enter the new bucket name and location. Click OK.

    Karmasphere
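    If you prefer to script this step, the same bucket can also be created outside the Studio UI with the AWS SDK for Java. A minimal sketch (the bucket name is illustrative and must be globally unique):

        import com.amazonaws.auth.BasicAWSCredentials;
        import com.amazonaws.services.s3.AmazonS3Client;

        public class CreateBucket {
            public static void main(String[] args) {
                // Uses the same Access Key ID / Secret Access Key as the Studio account profile.
                AmazonS3Client s3 = new AmazonS3Client(
                        new BasicAWSCredentials("YOUR_ACCESS_KEY_ID", "YOUR_SECRET_ACCESS_KEY"));
                // "my-new-bucket" is a placeholder bucket name.
                s3.createBucket("my-new-bucket");
            }
        }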

    How to Create and Use Elastic MapReduce Job Flows

    Karmasphere Studio empowers the user to deploy three types of jobs: workflows, pre-existing JAR files, and streaming jobs.

    A user can view the submitted job’s current status as well as the input, output, and log files on EMR.

    Creating and Deploying a Java-based Hadoop Job Using the Workflow

    You must create a workflow before you can deploy it. Once the workflow is created, click on the Deploy icon.

    Karmasphere

    In the Deployment window, enter the Job Name, select the Target Cluster and Data Filesystem, enter the input and output parameters, and then click OK. Note that each parameter can either be entered on a separate line or on the same line separated by a space. In the example screen-shot, the first parameter is the location of the input file and the second parameter is the location of the output directory. The output directory should either be empty or not exist yet.

    Karmasphere

    Deployment status is available in the Output window as shown.

    Karmasphere
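    The input and output parameters entered above are passed to the job's main class as ordinary program arguments. A minimal driver along these lines for the Hadoop 0.20 API (the class name is illustrative; it runs an identity map/reduce until you set your own Mapper and Reducer classes):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class ExampleDriver {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Job job = new Job(conf, "example job");
                job.setJarByClass(ExampleDriver.class);
                // Identity map/reduce by default; set your own Mapper and Reducer classes here.
                job.setOutputKeyClass(LongWritable.class);
                job.setOutputValueClass(Text.class);

                // First parameter: input file or directory (for example an s3n:// path).
                FileInputFormat.addInputPath(job, new Path(args[0]));
                // Second parameter: output directory; it must not already exist.
                FileOutputFormat.setOutputPath(job, new Path(args[1]));

                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }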

    Deploying a Java-based Hadoop Job that was created outside the Workflow

    A pre-existing JAR file is deployed to a cluster using the Hadoop Services view. Select the Hadoop perspective and open Hadoop Services. Right-click on Hadoop Jobs, and then select New Job.

    Karmasphere

    Enter the Job Name and select the Job Type ‘Hadoop Job from Pre-existing JAR file’. Click Next.

    Karmasphere

    Select the Primary Jar File and the Main Class. Click Next.

    Karmasphere

    Select the Default Cluster, enter the Default Arguments and then click Finish. The Default Arguments are either separated by a space or by a newline. Note: The output directory should either be empty or not exist.

    Karmasphere

    This job is displayed under Hadoop Services. Right-click on the job, and then click Run Job.

    Karmasphere

    Verify all the parameters are correct, change them if required, and click OK.

    Karmasphere

    Once the job is deployed, the deployment status is available in the Output window.

    Deploying a Streaming Job

    Open the Hadoop perspective and then open Hadoop Services. Right-click on Hadoop Jobs, then click New Job.

    Karmasphere

    Enter the Job Name, select the Job Type, and then click Next.

    Karmasphere

    Enter Input Location and Output Location in the text areas that are provided and then click Next.

    Karmasphere

    Select the Mapper and Reducer types as Raw Command, enter /bin/cat for both the Mapper and the Reducer, and then click Finish. Note that if you are developing your own code for the Mapper and/or Reducer, you need to check the respective Upload boxes.

    Karmasphere
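    With /bin/cat as both commands the job simply copies its input. If you supply your own Mapper or Reducer instead, it can be any executable that reads lines from standard input and writes tab-separated key/value pairs to standard output. A word-count style streaming mapper, sketched here in Java to stay consistent with the rest of this tutorial (a script in any other language works the same way):

        import java.io.BufferedReader;
        import java.io.InputStreamReader;

        // Streaming mapper: reads input lines from stdin and emits "word<TAB>1" for each token.
        public class StreamingWordCountMapper {
            public static void main(String[] args) throws Exception {
                BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
                String line;
                while ((line = in.readLine()) != null) {
                    for (String word : line.trim().split("\\s+")) {
                        if (word.length() > 0) {
                            System.out.println(word + "\t1");
                        }
                    }
                }
            }
        }

    The Mapper command would then invoke this class (for example, java StreamingWordCountMapper) instead of /bin/cat, with the corresponding Upload box checked so the code is shipped to the cluster.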

    The newly created job is now available under Hadoop Jobs. Right-click on the job, then click Run Job.

    Karmasphere

    Check the parameters in the Deployment window, modify them if necessary, and then click OK.

    Karmasphere

    Standalone Jar

    A pre-existing JAR file is deployed to a cluster using Hadoop Services. Open the Hadoop Perspective and then open Hadoop Services. Right-click on Jobs, then click on New Job.

    Karmasphere

    Enter the Job Name, select the Job Type as Hadoop Job from pre-existing JAR file, then click Next.

    Karmasphere

    Select the Primary Jar file, enter the Main Class and then click Next.

    Karmasphere

    Select the Default Cluster, enter the Default Parameters and then click Finish. The Default Parameters can either be separated by a space or by a new line. In the example screen-shot the default parameters entered are the input file and the output directory locations. Note that the output directory should either be empty or not exist.

    Karmasphere

    This job can be viewed under Hadoop Services.

    Karmasphere

    Right-click on the job, then click Run Job.

    Karmasphere

    Verify parameters and then click OK.

    Karmasphere

    Once a job is deployed, its deployment status is available in the Output window.

    Karmasphere

    Starting a New job flow using a job flow Template

    Karmasphere Studio enables you to create new job flows and use existing job flows. To create a new job flow, you first create one or more templates that reflect the parameters of the job flow you want to start. Right-click on Job Flow Templates under Amazon Accounts and select New Job Flow Template.

    Karmasphere

    Give the cluster a name. Amazon currently provides a choice of Hadoop 0.18.3 and Hadoop 0.20. Select the Hadoop version to use with EMR for your cluster from the drop-down menu. Associate a default file system with the cluster. This will normally be an S3 filesystem. If you have not yet created a file system, click the Add button to create one. Click the Next button.

    Karmasphere

    Configure the desired parameters of your job flow. It is very important to have the account information and SSH key correct; otherwise, Karmasphere Studio will not be able to create or access your cluster. To configure your Amazon AWS account credentials, click Manage. Configure one or more Amazon account profiles, each of which includes your AWS Security Credentials.

    In the Amazon cluster parameters window, select your account credentials from the list; the boxes below are automatically populated with values downloaded from your Amazon account. Select your S3 buckets and AWS key pairs from the drop-downs to avoid mistakes.

    Make sure the private key file you provide matches the key name you select. To keep the cluster alive after the job has completed, check the corresponding box. Click Finish to complete the creation of your Amazon Elastic MapReduce cluster object.

    Karmasphere

    The newly created cluster appears under Hadoop Services.

    Karmasphere
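    For reference, the settings collected by the template wizard map onto the Elastic MapReduce RunJobFlow API, which Karmasphere Studio calls on your behalf. A rough sketch of the same configuration using the AWS SDK for Java (all names, counts and instance types below are illustrative):

        import com.amazonaws.auth.BasicAWSCredentials;
        import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
        import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
        import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
        import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

        public class StartJobFlow {
            public static void main(String[] args) {
                AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
                        new BasicAWSCredentials("YOUR_ACCESS_KEY_ID", "YOUR_SECRET_ACCESS_KEY"));

                RunJobFlowRequest request = new RunJobFlowRequest()
                        .withName("MyCluster")                           // cluster name from the wizard
                        .withLogUri("s3n://my-bucket/logs/")             // log location on the default filesystem
                        .withInstances(new JobFlowInstancesConfig()
                                .withHadoopVersion("0.20")               // Hadoop 0.18.3 or 0.20
                                .withEc2KeyName("my-keypair")            // SSH key pair name
                                .withInstanceCount(3)
                                .withMasterInstanceType("m1.small")
                                .withSlaveInstanceType("m1.small")
                                .withKeepJobFlowAliveWhenNoSteps(true)); // the "keep alive" check box

                RunJobFlowResult result = emr.runJobFlow(request);
                System.out.println("Started job flow: " + result.getJobFlowId());
            }
        }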

    Starting a job flow Instance

    Start a job flow instance by right-clicking on Job Flow Templates and selecting the job flow template you want to use to start the instance.

    Karmasphere

    Enter the Instance name and click OK.

    Karmasphere

    The new instance appears under Active job flows, as shown.

    Karmasphere

    Monitoring, Profiling and Debugging Jobs on a job flow

    To monitor jobs, select Active job flows under your Amazon Account and right-click to select Monitor Job Flows. The monitor jobs view is shown. The Job Monitor window reads job information from the Amazon S3 filesystem associated with your EMR cluster.

    Karmasphere

    How to Profile a Job

    To access job profiling features, select the job in the Monitor view and click on the Profile Job button.

    Karmasphere

    A sample job profile display is shown.

    Karmasphere

    In this view you can profile a job’s counters, view logs and tasks and see the job configuration. Select the appropriate tab to profile that aspect of the job. The Diagnostics tab provides a diagnostic evaluation of the job based on job and task counters. This will aid you in the debugging and development process.
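    The Counters tab shows Hadoop's built-in job and task counters along with any custom counters your job defines. A minimal sketch of incrementing a custom counter from a Mapper (the group and counter names are illustrative) so that it appears in this profile view:

        import java.io.IOException;

        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                if (value.toString().trim().isEmpty()) {
                    // Custom counter; shows up under its group in the job profile's counters.
                    context.getCounter("MyJob", "EmptyLines").increment(1);
                    return;
                }
                context.write(new Text(value.toString()), new LongWritable(1));
            }
        }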

    How to Debug a Job

    Karmasphere Studio allows you to debug a MapReduce job from your desktop system. It is an easy way to get started with MapReduce programming because you can continuously develop and debug your job without the need for a cluster or the delays of a full job deployment cycle.
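    Desktop debugging like this is conceptually equivalent to running the job with Hadoop's local job runner, which executes the whole pipeline in a single JVM so breakpoints in your Mapper and Reducer are hit directly in the debugger. A sketch of that stand-alone setup for Hadoop 0.20 (paths and class name are illustrative), in case you want to reproduce it outside Studio:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class LocalDebugRun {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                // Run everything in-process on the local filesystem instead of a cluster.
                conf.set("mapred.job.tracker", "local");
                conf.set("fs.default.name", "file:///");

                Job job = new Job(conf, "local debug run");
                job.setJarByClass(LocalDebugRun.class);
                // Identity map/reduce here; substitute the classes you are debugging.
                job.setOutputKeyClass(LongWritable.class);
                job.setOutputValueClass(Text.class);

                FileInputFormat.addInputPath(job, new Path("input/sample.txt"));   // illustrative path
                FileOutputFormat.setOutputPath(job, new Path("output/debug-run")); // must not exist yet
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }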

    Select the Java perspective in Eclipse. Open the Project Explorer and expand a project. Double-click on the workflow (here, HadoopJob.workflow). The Project Explorer is shown with the WordCountProject open and the workflow expanded.

    Karmasphere

    The right-hand panel allows access to the Input (file), Mapper, Partitioner, Comparator, Combiner, Reducer and Output (file) stages of the MapReduce pipeline, and each stage can be debugged as required. Click on the desired tab to select that stage, and modify the code as required.

    