Introduction
Karmasphere Studio is a graphical environment that provides visual SQL access to data in Amazon Elastic MapReduce. Studio is available for Windows, Mac and Linux systems and dramatically accelerates the Hadoop development process.
Assumptions
This document assumes:
About Karmasphere
Karmasphere Studio’s familiar graphical environment supports the complete lifecycle for developing Amazon Elastic MapReduce applications, including prototyping, developing, testing, debugging, deploying and optimizing Hadoop jobs. By simplifying development, Karmasphere Studio increases the productivity of developers, saving valuable time and effort. Its intuitive, visual interface enables the full spectrum of developers, from those just starting with Big Data to those highly experienced with Java, Cascading and Streaming, to take advantage of Elastic MapReduce.
Who should use Karmasphere Studio?
Karmasphere Studio is designed for developers who support analytic teams. It provides a graphical environment in which to learn Hadoop, develop custom analytic algorithms and systematize the creation of the meaningful datasets that analysts find. The results can then be integrated into business processes and applications.
How to get Karmasphere Studio
Karmasphere Studio is available in the same pay-as-you-go, hourly pricing model as Amazon Elastic MapReduce, providing a low cost of entry and a single payment process through Amazon. For pricing details and to download the Karmasphere software, please visit the Elastic MapReduce with Karmasphere Analytics detail page.
Starting Karmasphere Studio in Eclipse
Karmasphere Studio is a plug-in for Eclipse and requires Eclipse version 3.6.2 to be installed. You will also need to install JDK 1.6 (Karmasphere Studio will not work with JDK 1.7). To download Eclipse, go to the Eclipse web site at http://www.eclipse.org. Install Karmasphere Studio as a plug-in, using the Help -> Install New Software wizard within the Eclipse IDE.
How to Configure Karmasphere Studio with your Amazon Credentials
To use Karmasphere Studio with Elastic MapReduce, Amazon S3, and Amazon RDS, you need your AWS Access Key ID, your Secret Access Key and an AWS key pair. An AWS key pair is a security credential similar to a password, which you use to securely connect to your instance once it is running.
Before you can use Karmasphere Studio with Amazon Elastic MapReduce and Amazon S3, you must:
- Set up an AWS account. If you don’t have an AWS account, you can create one here: www.amazon.com/gp/aws/registration/registration-form.html
- Generate and save an AWS SSH key (SSH key name and private key).
How to Enter your AWS Account Information in Karmasphere Studio
Note: Be sure that your Eclipse perspective is set to ‘Hadoop’ before beginning this section.
Right-click on Amazon Accounts in the left-hand panel, then click on New Amazon Account.
Enter the Account Name, AWS Access Key ID, and AWS Secret Key. Test the account credentials by clicking on the Test button. Click Next.
Enter your SSH Key information. Select the file(s) corresponding to the EC2 Key(s). After the required file(s) have been selected, click Finish.
The Amazon Accounts tree now displays your new account.
Accessing Amazon S3
Expand Amazon Accounts to show the tree structure of the account. This view shows the existing S3 file systems, the job flow templates, and the active and terminated job flows.
How to Access Amazon S3
Karmasphere Studio enables you to browse, read and write data to the s3n buckets associated with your account.
To use an existing bucket, right-click on the S3 bucket name and select Browse.
Browsing the Amazon S3 Filesystem
To browse an s3n bucket, select the bucket and click Browse. The contents of the bucket are shown in the right-hand portion of the window.
To view a file, right-click on the file and choose Open from the pull-down menu. The file contents are displayed.
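Paths inside a bucket are commonly written as s3n:// URIs when they are passed to a job as input or output locations. As a minimal sketch of how such a URI breaks down into a bucket name and a path within the bucket, the following uses Python's standard library. The function name and the example bucket are hypothetical and are not part of Karmasphere Studio.

```python
from urllib.parse import urlparse

def split_s3n_uri(uri):
    """Split an s3n:// URI into (bucket, key). Illustrative helper only."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3n":
        raise ValueError(f"not an s3n URI: {uri!r}")
    if not parsed.netloc:
        raise ValueError(f"no bucket name in URI: {uri!r}")
    # The bucket is the host part; the key is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")

# Hypothetical bucket and file names, used only for illustration.
bucket, key = split_s3n_uri("s3n://example-bucket/input/words.txt")
```

Here bucket holds the bucket name that appears under S3 Filesystems, and key holds the path of the file as it would be shown in the browse view.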
How to Create a new s3n bucket
To create a new s3n bucket, right-click on S3 Filesystems under Amazon Accounts, then click on New S3 Bucket.
Enter the new bucket name and location. Click OK.
How to Create and Use Elastic MapReduce Job Flows
Karmasphere Studio empowers the user to deploy three types of jobs: workflows, pre-existing JAR files, and streaming jobs.
A user can view the submitted job’s current status as well as the input, output, and log files on EMR.
Creating and Deploying a Java-based Hadoop Job Using the Workflow
You must create a workflow before you can deploy it. Once the workflow is created, click on the Deploy icon.
In the Deployment window, enter the Job Name, select the Target Cluster and Data Filesystem, enter the input and output parameters, and then click OK. Note that the parameters can be entered either on separate lines or on the same line separated by spaces. In the example screenshot, the first parameter is the location of the input file and the second parameter is the location of the output directory. The output directory should either be empty or not exist.
Deployment status is available in the Output window as shown.
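The parameter rule described above (one parameter per line, or several on one line separated by spaces) amounts to splitting the field on any whitespace. The following is a rough sketch of that rule, not Studio's actual parsing code; the function name and the s3n locations are made up for illustration, and the sketch makes no attempt to handle values that themselves contain spaces.

```python
def parse_job_parameters(text):
    """Split a parameter field on spaces and newlines alike (illustrative only)."""
    return text.split()

# Hypothetical input file and output directory locations.
same_line = "s3n://example-bucket/input/words.txt s3n://example-bucket/output"
separate_lines = "s3n://example-bucket/input/words.txt\ns3n://example-bucket/output"

# Both forms yield the same two parameters: input location first, output second.
input_location, output_location = parse_job_parameters(same_line)
```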
Deploying a Java-based Hadoop Job that was created outside the Workflow
A pre-existing JAR file is deployed to a cluster using the Hadoop Services view. Select the Hadoop perspective and open Hadoop Services. Right-click on Hadoop Jobs, and then select New Job.
Enter the Job Name and select ‘Hadoop Job from Pre-existing JAR file’ as the Job Type. Click Next.
Select the Primary Jar File and the Main Class. Click Next.
Select the Default Cluster, enter the Default Arguments and then click Finish. The Default Arguments are separated either by a space or by a newline. Note: The output directory should either be empty or not exist.
This job is displayed under Hadoop Services. Right-click on the job, and then click Run Job.
Verify that all the parameters are correct, change them if required, and click OK.
Once the job is deployed, the deployment status is available in the Output window.
Deploying a Streaming Job
Open the Hadoop perspective and then open Hadoop Services. Right-click on Hadoop Jobs, then click New Job.
Enter the Job Name, select the Job Type and then click Next.
Enter the Input Location and Output Location in the text areas provided and then click Next.
Select the Mapper and Reducer types as Raw Command, enter /bin/cat for both the Mapper and Reducer and then click Finish. Note that if you are developing your own code for the Mapper and/or Reducer you need to check the respective boxes for Upload.
The newly created job is now available under Hadoop Jobs. Right-click on the job, then click Run Job.
Check the parameters in the Deployment window, modify them if necessary, and then click OK.
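When /bin/cat is used for both stages, the job simply copies its input through the streaming pipeline. If you write your own Mapper and Reducer instead, each one is a program that reads lines and writes tab-separated key/value lines. The following sketch shows a word-count mapper and reducer as plain Python functions under that convention. The function names are hypothetical, and the last lines only imitate locally the sort that Hadoop performs between the two stages.

```python
from itertools import groupby

def map_lines(lines):
    """Emit one 'word<TAB>1' record per word found in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_lines(sorted_records):
    """Sum the counts per word. Records must arrive sorted by key."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

# Local imitation of the map -> sort -> reduce sequence on one line of text.
records = sorted(map_lines(["to be or not to be"]))
result = list(reduce_lines(records))
```

To run these as real streaming scripts, each function would be wrapped in a small program that reads from standard input and prints to standard output, and the Upload boxes mentioned above would be checked so that the scripts are sent to the cluster.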
Standalone Jar
A pre-existing JAR file is deployed to a cluster using Hadoop Services. Open the Hadoop perspective and then open Hadoop Services. Right-click on Hadoop Jobs, then click on New Job.
Enter the Job Name, select Hadoop Job from pre-existing JAR file as the Job Type, then click Next.
Select the Primary Jar file, enter the Main Class and then click Next.
Select the Default Cluster, enter the Default Parameters and then click Finish. The Default Parameters can be separated either by a space or by a new line. In the example screenshot, the default parameters entered are the input file and the output directory locations. Note that the output directory should either be empty or not exist.
This job can be viewed under Hadoop Services.
Right-click on the job, then click Run Job.
Verify parameters and then click OK.
Once a job is deployed, its deployment status is available in the Output window.
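The output directory rule stated above (the directory must be empty or must not exist) can be checked before a job is run. The sketch below expresses the rule against a local directory as a stand-in; it does not talk to S3, and the function name is hypothetical. Checking an S3 location would require listing the bucket instead.

```python
import tempfile
from pathlib import Path

def output_dir_is_usable(path):
    """Return True if path does not exist or is an empty directory."""
    p = Path(path)
    if not p.exists():
        return True
    return p.is_dir() and not any(p.iterdir())

# Demonstration on a temporary local directory.
with tempfile.TemporaryDirectory() as tmp:
    missing_ok = output_dir_is_usable(Path(tmp) / "output")  # does not exist
    empty_ok = output_dir_is_usable(tmp)                      # exists, empty
    (Path(tmp) / "part-00000").write_text("stale result\n")
    stale_ok = output_dir_is_usable(tmp)                      # exists, not empty
```

In this demonstration the first two checks pass and the third fails, because the directory already contains a result file from an earlier run.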
Starting a New job flow using a job flow Template
Karmasphere Studio enables you to create new job flows and use existing job flows. To create a new job flow, first create one or more templates that reflect the parameters of the job flow you want to start. Right-click on job flow Templates under Amazon Accounts and select New job flow Template.
Give the cluster a name. Amazon currently provides a choice of Hadoop 0.18.3 and Hadoop 0.20. Select the Hadoop version to use with EMR for your cluster from the drop-down menu. Associate a default file system with the cluster. This will normally be an S3 filesystem. If you have not yet created a file system, click the Add button to create one. Click the Next button.
Configure the desired parameters of your job flow. It is very important that the account information and SSH key are correct; otherwise Karmasphere Studio will not be able to create or access your cluster. To configure your Amazon AWS account credentials, click Manage. Configure one or more Amazon account profiles, each of which includes your AWS Security Credentials.
In the Amazon cluster parameters window, select your account credentials from the list. The boxes below are automatically populated with values downloaded from your Amazon account. Select your S3 buckets and AWS key pairs from the drop-down menus rather than typing them, to avoid mistakes.
Make sure the private key file you provide matches the key name you select. To keep the cluster alive after the job has completed, check the corresponding box. Click Finish to complete the creation of your Amazon Elastic MapReduce cluster object.
The newly created cluster appears under Hadoop Services.
Starting a job flow Instance
Start a job flow Instance by right-clicking on job flow Templates and selecting the job flow template that you want to use to start the Instance.
Enter the Instance name and click OK.
The new instance is shown under Active job flows.
Monitoring, Profiling and Debugging Jobs on a job flow
To monitor jobs, select Active job flows under your Amazon Account and right-click to select Monitor Job Flows. The monitor jobs view is shown. The Job Monitor window reads job information from the Amazon S3 filesystem associated with your EMR cluster.
How to Profile a Job
To access job profiling features, select the job in the Monitor view and click on the Profile Job button.
A sample job profile display is shown.
In this view you can profile a job’s counters, view logs and tasks and see the job configuration. Select the appropriate tab to profile that aspect of the job. The Diagnostics tab provides a diagnostic evaluation of the job based on job and task counters. This will aid you in the debugging and development process.
How to Debug a Job
Karmasphere Studio allows you to debug a MapReduce job from your desktop system. It is an easy way to get started with MapReduce programming because you can continuously develop and debug your job without the need for a cluster or the delays of a full job deployment cycle.
Select the Java perspective in Eclipse. Open the Project Explorer and expand a project. Double-click on the workflow (HadoopJob.workflow in this example). The Project Explorer is shown with the WordCountProject open and the workflow expanded.
The right-hand panel provides access to the stages of the MapReduce pipeline: Input (file), Mapper, Partitioner, Comparator, Combiner, Reducer and Output (file). Each stage can be debugged as required. Click on the desired tab to select that stage, and modify the code as required.
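To make the roles of these stages concrete, the following is a small local simulation of a word-count pipeline in plain Python. It is only a sketch of the general MapReduce flow, not Studio's debugger or Hadoop's implementation. All function names are hypothetical, each input line stands in for one map task, and the comparator is assumed to be ordinary string ordering.

```python
from collections import defaultdict

def mapper(line):
    """Mapper: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def partitioner(key, num_reducers):
    """Partitioner: choose a reducer for the key, deterministically."""
    return sum(key.encode("utf-8")) % num_reducers

def combiner(key, values):
    """Combiner: pre-aggregate counts on the map side."""
    return key, sum(values)

def reducer(key, values):
    """Reducer: produce the final count for the key."""
    return key, sum(values)

def run_local(lines, num_reducers=2):
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:
        # Map one line, then group and combine its output locally.
        grouped = defaultdict(list)
        for key, value in mapper(line):
            grouped[key].append(value)
        for key, values in grouped.items():
            ckey, cvalue = combiner(key, values)
            partitions[partitioner(ckey, num_reducers)][ckey].append(cvalue)
    output = []
    for part in partitions:
        # Comparator: keys reach each reducer in sorted order.
        for key in sorted(part):
            output.append(reducer(key, part[key]))
    return output

counts = dict(run_local(["to be or not to be", "to do"]))
```

Stepping through run_local with a debugger shows the same intermediate data that each tab exposes: the pairs leaving the mapper, the partition each key is sent to, the combined values, and the final reducer output.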