Launch an EMR Job Flow with the MapR Distribution for Hadoop

Learn how to launch an EMR job flow with the MapR Distribution for Hadoop. MapR offers an open, enterprise-grade distribution that makes Hadoop easier to use and more dependable.

Details

Submitted By: AdamG@AWS
AWS Products Used: Amazon Elastic MapReduce
Last Updated: September 12, 2012 1:53 AM GMT

Introduction

MapR offers an open, enterprise-grade distribution that makes Hadoop easier to use and more dependable. Combined with Amazon Elastic MapReduce's managed Hadoop environment, seamless integration with other AWS services, and hourly pricing with no upfront fees or long-term commitments, Amazon EMR with MapR offers customers a powerful tool for generating insights from their data. For more details on EMR with MapR, visit the EMR with the MapR Distribution for Hadoop detail page.

Starting an EMR Job Flow with the MapR Distribution for Hadoop

Follow these steps to start an Amazon EMR with MapR cluster:

Log in to your Amazon Web Services Account

Use your normal Amazon Web Services (AWS) credentials to log in to your AWS account.

Start a New Job Flow and Select a MapR Product

From the AWS Management Console
  1. Select Elastic MapReduce, then Create New Job Flow.
  2. Select MapR M3 Edition or MapR M5 Edition from the Hadoop Version drop-down selector.

  [Screenshot: Create New Job Flow page showing the Hadoop Version selector]

    • MapR M3 Edition is a complete Hadoop distribution that provides many unique capabilities such as industry-standard NFS and ODBC interfaces, end-to-end management, high reliability and automatic compression. You can manage a MapR cluster via the AWS Management Console, the command line, or a REST API. Amazon EMR's standard rates include the full functionality of MapR M3 at no additional cost.
    • MapR M5 Edition expands the capabilities of M3 with enterprise-grade capabilities such as high availability, snapshots and mirroring.

  3. Continue to specify your job flow as described in Creating a Job Flow.

From the Command Line Interface

Use the --with-supported-products parameter with the elastic-mapreduce command to specify a MapR distribution:

  • Use mapr-m3 for MapR M3.
  • Use mapr-m5 for MapR M5.

This example launches a job flow that uses the MapR M3 Edition distribution:
./elastic-mapreduce --create --alive \
--instance-type m1.xlarge \
--num-instances 5 \
--with-supported-products mapr-m3
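
Once the command returns a job flow ID, you can confirm that the cluster is starting up with the same client. This quick check uses the elastic-mapreduce tool's --list and --active options; the job flow state is also visible in the console:

./elastic-mapreduce --list --active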

From the REST API

Make a call to RunJobFlow that specifies a MapR distribution as a member of the SupportedProducts list:

  • Use mapr-m3 for MapR M3.
  • Use mapr-m5 for MapR M5.

This example launches a job flow that uses the MapR M3 distribution:
https://elasticmapreduce.amazonaws.com?Action=RunJobFlow
&Name=MyJobFlowName&LogUri=s3n%3A%2F%2Fmybucket%2Fsubdir
&Instances.MasterInstanceType=m1.xlarge&Instances.SlaveInstanceType=m1.xlarge&Instances.InstanceCount=4
&Instances.Ec2KeyName=myec2keyname&Instances.Placement.AvailabilityZone=us-east-1a
&Instances.KeepJobFlowAliveWhenNoSteps=true
&Instances.TerminationProtected=true
&Steps.member.1.Name=MyStepName
&Steps.member.1.ActionOnFailure=CONTINUE
&Steps.member.1.HadoopJarStep.Jar=MyJarFile
&Steps.member.1.HadoopJarStep.MainClass=MyMainClass
&Steps.member.1.HadoopJarStep.Args.member.1=arg1
&Steps.member.1.HadoopJarStep.Args.member.2=arg2
&SupportedProducts.member.1=mapr-m3
&AuthParams

Configuring Your MapR Job Flow

After your MapR job flow is running, you need to open port 8453 to enable access to the MapR Control System (MCS). Follow these steps to open the port from the console; a command-line alternative is sketched after the steps.

  1. Select your job from the list of jobs displayed in Your Elastic MapReduce Job Flows in the Elastic MapReduce tab of the AWS Management Console, then select the Description tab in the lower pane. Make a note of the Master Public DNS Name value. Click the Amazon EC2 tab in the AWS Management Console to open the Amazon EC2 Console Dashboard.

  2. Select Security Groups from the Network & Security group in the Navigation pane at the left of the EC2 Console Dashboard.

  3. Select ElasticMapReduce-master from the list displayed in Security Groups.

  4. In the lower pane, click the Inbound tab.

  5. In the Port Range field, type 8453. Leave the default value in the Source field.

  6. Click Add Rule, then click Apply Rule Changes.
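
If you prefer to open the port from the command line instead of the console, the AWS CLI can add the same inbound rule. This is a sketch that assumes the AWS CLI is installed and configured and that the master security group has the default name ElasticMapReduce-master; restrict the --cidr value to your own address range if you do not want to open the port to all sources:

aws ec2 authorize-security-group-ingress \
  --group-name ElasticMapReduce-master \
  --protocol tcp \
  --port 8453 \
  --cidr 0.0.0.0/0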

You can now navigate to the master node's DNS address. Connect to port 8453 to log in to the MapR Control System. Use the string hadoop for both login and password at the MCS login screen.

Testing Your Cluster

Amazon EMR with MapR provides a Debian environment with MapR software running on each node. MapR's NFS interface mounts the cluster on localhost at the /mapr directory, so standard Linux tools and Hadoop commands operate on the same files (a quick cross-check is sketched after the steps below). Follow these steps to create a file and run your first MapReduce job:

    1. Connect to the master node with SSH as user hadoop. Pass your .pem credentials file to ssh with the -i flag, as in this example:
      ssh -i /path_to_pemfile/credentials.pem hadoop@masterDNS.amazonaws.com
    2. Create a simple text file:
      cd /mapr/my.cluster.com
      mkdir in
      echo "the quick brown fox jumps over the lazy dog" > in/data.txt
    3. Run the following command to perform a word count on the text file:
      hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /mapr/my.cluster.com/in /mapr/my.cluster.com/out
      
      As the job runs, you should see terminal output similar to the following:
      12/06/09 00:00:37 INFO fs.JobTrackerWatcher: Current running JobTracker is: ip-10-118-194-139.ec2.internal/10.118.194.139:9001
      12/06/09 00:00:37 INFO input.FileInputFormat: Total input paths to process : 1
      12/06/09 00:00:37 INFO mapred.JobClient: Running job: job_201206082332_0004
      12/06/09 00:00:38 INFO mapred.JobClient: map 0% reduce 0%
      12/06/09 00:00:50 INFO mapred.JobClient: map 100% reduce 0%
      12/06/09 00:00:57 INFO mapred.JobClient: map 100% reduce 100%
      12/06/09 00:00:58 INFO mapred.JobClient: Job complete: job_201206082332_0004
      12/06/09 00:00:58 INFO mapred.JobClient: Counters: 25
      12/06/09 00:00:58 INFO mapred.JobClient: Job Counters
      12/06/09 00:00:58 INFO mapred.JobClient: Launched reduce tasks=1
      12/06/09 00:00:58 INFO mapred.JobClient: Aggregate execution time of mappers(ms)=6193
      12/06/09 00:00:58 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
      12/06/09 00:00:58 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
      12/06/09 00:00:58 INFO mapred.JobClient: Launched map tasks=1
      12/06/09 00:00:58 INFO mapred.JobClient: Data-local map tasks=1
      12/06/09 00:00:58 INFO mapred.JobClient: Aggregate execution time of reducers(ms)=4875
      12/06/09 00:00:58 INFO mapred.JobClient: FileSystemCounters
      12/06/09 00:00:58 INFO mapred.JobClient: MAPRFS_BYTES_READ=385
      12/06/09 00:00:58 INFO mapred.JobClient: MAPRFS_BYTES_WRITTEN=276
      12/06/09 00:00:58 INFO mapred.JobClient: FILE_BYTES_WRITTEN=94449
      12/06/09 00:00:58 INFO mapred.JobClient: Map-Reduce Framework
      12/06/09 00:00:58 INFO mapred.JobClient: Map input records=1
      12/06/09 00:00:58 INFO mapred.JobClient: Reduce shuffle bytes=94
      12/06/09 00:00:58 INFO mapred.JobClient: Spilled Records=16
      12/06/09 00:00:58 INFO mapred.JobClient: Map output bytes=80
      12/06/09 00:00:58 INFO mapred.JobClient: CPU_MILLISECONDS=1530
      12/06/09 00:00:58 INFO mapred.JobClient: Combine input records=9
      12/06/09 00:00:58 INFO mapred.JobClient: SPLIT_RAW_BYTES=125
      12/06/09 00:00:58 INFO mapred.JobClient: Reduce input records=8
      12/06/09 00:00:58 INFO mapred.JobClient: Reduce input groups=8
      12/06/09 00:00:58 INFO mapred.JobClient: Combine output records=8
      12/06/09 00:00:58 INFO mapred.JobClient: PHYSICAL_MEMORY_BYTES=329244672
      12/06/09 00:00:58 INFO mapred.JobClient: Reduce output records=8
      12/06/09 00:00:58 INFO mapred.JobClient: VIRTUAL_MEMORY_BYTES=3252969472
      12/06/09 00:00:58 INFO mapred.JobClient: Map output records=9
      12/06/09 00:00:58 INFO mapred.JobClient: GC time elapsed (ms)=18
      
    4. Check the /mapr/my.cluster.com/out directory for a file named part-r-00000 with the results of the job.
      cat out/part-r-00000
      brown 1
      dog 1
      fox 1
      jumps 1
      lazy 1
      over 1
      quick 1
      the 2
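
Because the cluster is NFS-mounted at /mapr, the files you just created are also visible through the Hadoop shell. As a quick cross-check (a sketch that assumes the default cluster name my.cluster.com used in the steps above), list the output directory through the NFS mount and through hadoop fs and compare:

  ls /mapr/my.cluster.com/out
  hadoop fs -ls /mapr/my.cluster.com/out

Both commands should show the same part-r-00000 results file.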
      