Using Hive on Amazon Elastic MapReduce with Karmasphere Analytics

This tutorial shows how to use Karmasphere Analyst with Amazon Elastic MapReduce to analyze large data sets stored in Amazon S3.


Submitted By: Amazon Web Services
Created On: October 28, 2011


Introduction

Used with Amazon EMR, Karmasphere Analyst provides an intuitive, high-productivity solution for working with large structured and unstructured data sets using Apache Hadoop. Karmasphere Analyst runs on Windows, Mac and Linux desktop systems. It provides a comprehensive workspace for data professionals and data analysts who explore and interact with Big Data stored on Amazon S3 using Elastic MapReduce. With Karmasphere Analyst you have immediate access to unstructured, semi-structured and structured data in Hadoop. Through familiar SQL and wizards you can make ad-hoc queries, interact with the results, and iterate.

How to get Karmasphere Analyst

Karmasphere Analyst is available in the same pay-as-you-go, hourly pricing model as Amazon Elastic MapReduce, providing a low cost of entry and a single payment process through Amazon. For pricing details and to download the Karmasphere software, please visit the Elastic MapReduce with Karmasphere Analytics detail page.

Key Concepts

The Big Data Analytics Workflow

The workflow model comprises four stages:

  • Access: Configure your AWS credentials, then manage and start connections to EMR job flows. You will need your AWS account credentials, including your AWS Access Key ID and Secret Access Key, as well as your SSH key name and private key.
  • Assemble: Prepare and manage tables, including unstructured, structured and compressed data.
  • Analyze: Explore and mine the data iteratively.
  • Act: Share, save, operationalize and integrate results, charts and queries.

    How to Configure Access to Elastic MapReduce and Amazon S3

    Configuring Your Amazon Credentials

    To configure your AWS account credentials for the first time, open the ‘File’ menu and select ‘Manage Cloud Credentials’.

    Click on the ‘Add’ button.

    Enter the name you want to use for these credentials, then enter your Access Key ID and Secret Access Key. Click the ‘Test’ button to verify the credentials. Next, click the SSH Keys button, select the Amazon EMR region you use, and enter the SSH key information for that region. Click the OK button.

    Using an Existing Job Flow

    If you have already configured a job flow and want to start the connection, select that connection and click on the Start Connection icon. Once the connection is up, the red cloud graphic turns green.

    Example connections are shown in the connection pane.

    Launching a New Job Flow

    On the Karmasphere Analyst home view, click on the ‘Access’ icon. The Access view is shown.

    Click on the New Cloud connection icon. In the window that is displayed, enter the additional information required.

    Analyst creates the connection and starts the job flow on the cluster. The connection then appears in the connection pane.

    Starting a Connection to a Job Flow

    (Screenshot: the connection pane)

    Select the connection in the connection pane and click on the Start Connection icon.

    How to Assemble Data and Manage Tables

    Assembling and managing tables makes up the Assemble stage, the second step of the Karmasphere Analyst workflow. In this stage you collect data of any format and organize and prepare it for analysis, the third stage of the workflow. The result of the Assemble stage is one or more tables.

    Analyst understands many common file and compression formats, such as ZIP and GZIP, and quickly prepares your data for analysis. A sample excite data log in GZIP format is included in the Analyst installation for testing and analysis.
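
As a rough illustration of what preparing a GZIP-compressed log involves, the Python sketch below decompresses a file and splits each line into columns. It assumes a tab-delimited layout, which this tutorial does not state about the sample file. The helper name `read_gzip_rows` is hypothetical and is not part of Karmasphere Analyst.

```python
import gzip


def read_gzip_rows(path, delimiter="\t", limit=None):
    """Return rows from a GZIP-compressed text file as lists of column values.

    Hypothetical helper for illustration; the tab delimiter is an assumption.
    """
    rows = []
    # Mode "rt" decompresses and decodes the file to text in one step.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            # Stop early when only a preview of the first rows is wanted.
            if limit is not None and len(rows) >= limit:
                break
            rows.append(line.rstrip("\r\n").split(delimiter))
    return rows
```

For example, `read_gzip_rows("excite.log.gz", limit=5)` would return the first five rows of the sample file as lists of strings, which is a quick way to check how many columns a log contains before building a table from it.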

    Note: Before starting the Assemble stage, make sure you are using the Amazon EMR connection you created above, so that the table is created on EMR.

    To get to the Assemble view, click on the ‘Assemble’ icon on the home view. The Assemble view is shown.

    Click on the icon to create and load a new table on EMR. Specify the name of the table and the location of the data file used to load it. The sample data file is excite.log.gz, located in the Sample folder of your Karmasphere Analyst installation directory.

    Enter a name for this new table. If a table with that name already exists, the table name field is shown with a red border. For the source data, click on the Browse button and select the excite.log.gz file. Click on the Next button, then click Next again to continue to the following screen.

    Accept the default values for the next two steps. Click on the Finish button to create and load this table.

    A dialog window is displayed, indicating that the operation was successful. Click the OK button.

    Note: At the bottom of the Assemble view, a progress bar shows the data being copied to your S3 bucket.

    How to Analyze Your Data

    The Analyze stage is the third step of the Analyst workflow, in which you explore and mine the data iteratively.

    Once you start to see patterns and trends, you can iterate on the results by formatting, filtering and sorting them. Karmasphere Analyst lets you enter queries in HQL (Hive Query Language), a SQL-like query language. Syntax highlighting, auto-complete and visual query plans, which can help you optimize a query, are also provided.

    Once the results have been generated, you can act on them: save and re-use scripts, export data to files and databases, and integrate the results into tools like Microsoft Excel and Tableau.

    To get to the Analyze view, click on the Analyze icon in the home view. The Analyze view is shown.

    Note: Be sure to use the correct EMR connection when accessing the Analyze stage, so that you are accessing and working with the desired data.

    Executing Queries – An Example

    Enter the following command in the Query window. Any syntax errors are displayed below the Query window. Click on the ‘Run’ button on the right-hand side to execute the query.

    SELECT col2, count(1) query_count FROM newtable WHERE col2 LIKE '%lake %' GROUP BY col2 ORDER BY query_count DESC
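
To make the logic of this query concrete, the Python sketch below performs the same steps on a handful of made-up rows: keep the rows whose col2 value contains the text "lake " (including the trailing space), count how often each distinct value occurs, and sort by that count in descending order. It assumes col2 is the third column of each row, which this tutorial does not confirm. The sample rows and the function name `count_matching_queries` are hypothetical. The sketch only mirrors what the query computes and does not reflect how Hive executes it.

```python
from collections import Counter


def count_matching_queries(rows, needle="lake "):
    """Mirror the query: filter on col2, group by col2, order by count descending."""
    # WHERE col2 LIKE '%lake %': case-sensitive substring test on the third column
    # (assumed here to be col2).
    matching = [row[2] for row in rows if needle in row[2]]
    # GROUP BY col2 with count(1): occurrences of each distinct value.
    counts = Counter(matching)
    # ORDER BY query_count DESC: highest count first.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Made-up rows in (col0, col1, col2) form, for illustration only.
sample_rows = [
    ("u1", "t1", "lake tahoe hotels"),
    ("u2", "t2", "lake tahoe hotels"),
    ("u3", "t3", "salt lake city"),
    ("u4", "t4", "lakeside cabins"),  # no space after "lake", so not matched
    ("u5", "t5", "weather"),
]

for query_text, query_count in count_matching_queries(sample_rows):
    print(query_text, query_count)
```

With these sample rows, "lake tahoe hotels" is listed first with a count of 2, followed by "salt lake city" with a count of 1. The row "lakeside cabins" is excluded because the pattern requires a space directly after "lake".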

    The left-hand pane shows the table schema, as shown.

    Filtering these results allows for further iteration. Click on the Filter results icon. A dialog window is displayed. Select the options as shown. Click the OK button.

    The results now show only the records matching this filter.

    Act on the results by saving them as an XLS file. Click on the Save to XLS file icon. Enter the name and location of the XLS file to save. Click on the OK button. Use the XLS file viewer to view the results.

    How to Chart Results

    Karmasphere Analyst allows you to chart your results in a variety of chart types, including pie, bar, line, column and scatter charts. Click on the Chart Results icon. Select the chart type (Pie, Bar, Line, Column, Scatter) and other options, such as a chart title. A sample pie chart of the filtered results from the query above is shown.

    How to Act on Results, Charts and SQL Queries

    With Karmasphere Analyst, you act on the results by saving them in a variety of forms: as a database table, as a Hive table, or as a file for viewing in an XLS file viewer. You can also save the charts you’ve created for later use in reports or as reference.

    Launch the XLS file viewer by clicking on the Launch XLS viewer icon. Microsoft Excel starts and displays the file.
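
Exporting results for use in a spreadsheet can also be sketched outside Analyst. The Python example below writes result rows to a CSV file, which Excel and similar tools can open. CSV is used here in place of XLS because Python's standard library has no XLS writer. The function name `export_results_csv`, the header names and the output path are hypothetical and are not part of Karmasphere Analyst.

```python
import csv


def export_results_csv(results, path, header=("col2", "query_count")):
    """Write result rows to a CSV file with a header line (illustrative helper)."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(results)
    return path
```

For example, `export_results_csv([("lake tahoe hotels", 2)], "results.csv")` would write a header line and one data row to `results.csv` in the current directory.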
