Introduction
MapR offers an open, enterprise-grade distribution that makes Hadoop easier to use and more dependable. Combined with Amazon Elastic MapReduce's managed Hadoop environment, seamless integration with other AWS services, and hourly pricing with no upfront fees or long-term commitments, Amazon EMR with MapR offers customers a powerful tool for generating insights from their data. For more details on EMR with MapR, visit the EMR with the MapR Distribution for Hadoop detail page.
Starting an EMR Job Flow with the MapR Distribution for Hadoop
Follow these steps to start an Amazon EMR with MapR cluster:
Log in to your Amazon Web Services Account
Use your normal Amazon Web Services (AWS) credentials to log in to your AWS account.
Start a New Job Flow and Select a MapR Product
From the AWS Management Console
- Select Elastic MapReduce, then Create New Job Flow.
- Select MapR M3 Edition or MapR M5 Edition from the Hadoop Version drop-down selector.
- MapR M3 Edition is a complete Hadoop distribution that provides many unique capabilities such as industry-standard NFS and ODBC interfaces, end-to-end management, high reliability and automatic compression. You can manage a MapR cluster via the AWS Management Console, the command line, or a REST API. Amazon EMR's standard rates include the full functionality of MapR M3 at no additional cost.
- MapR M5 Edition expands the capabilities of M3 with enterprise-grade capabilities such as high availability, snapshots and mirroring.
- Continue to specify your job flow as described in Creating a Job Flow.
From the Command Line Interface
Use the --with-supported-products parameter with the elastic-mapreduce command to specify a MapR distribution:
- Use mapr-m3 for MapR M3.
- Use mapr-m5 for MapR M5.
This example launches a job flow that uses the MapR M3 Edition distribution:
./elastic-mapreduce --create --alive \
  --instance-type m1.xlarge \
  --num-instances 5 \
  --with-supported-products mapr-m3
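If you launch job flows from a script, a small wrapper can catch a mistyped product name before the command runs. The following is a minimal sketch, not part of the elastic-mapreduce tool: the function name build_create_command and the MAPR_EDITIONS tuple are hypothetical, and the flags are only the ones shown in the example above.

```python
import shlex

# Product identifiers taken from this article; anything else is rejected.
MAPR_EDITIONS = ("mapr-m3", "mapr-m5")


def build_create_command(edition, instance_type="m1.xlarge", num_instances=5):
    """Return the argument list for launching a long-running MapR job flow.

    Hypothetical helper: it only assembles the flags shown in the article
    and does not contact AWS.
    """
    if edition not in MAPR_EDITIONS:
        raise ValueError(f"unknown MapR edition: {edition!r}")
    return [
        "./elastic-mapreduce", "--create", "--alive",
        "--instance-type", instance_type,
        "--num-instances", str(num_instances),
        "--with-supported-products", edition,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_create_command("mapr-m3")))
```

Passing the returned list to subprocess.run would execute the command; the sketch stops short of that so nothing is launched by accident.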
From the REST API
Make a call to RunJobFlow that specifies a MapR distribution as a member of the SupportedProducts list:
- Use mapr-m3 for MapR M3.
- Use mapr-m5 for MapR M5.
This example launches a job flow that uses the MapR M3 distribution:
https://elasticmapreduce.amazonaws.com?Action=RunJobFlow
&Name=MyJobFlowName&LogUri=s3n%3A%2F%2Fmybucket%2Fsubdir
&Instances.MasterInstanceType=m1.xlarge&Instances.SlaveInstanceType=m1.xlarge&Instances.InstanceCount=4
&Instances.Ec2KeyName=myec2keyname&Instances.Placement.AvailabilityZone=us-east-1a
&Instances.KeepJobFlowAliveWhenNoSteps=true
&Instances.TerminationProtected=true
&Steps.member.1.Name=MyStepName
&Steps.member.1.ActionOnFailure=CONTINUE
&Steps.member.1.HadoopJarStep.Jar=MyJarFile
&Steps.member.1.HadoopJarStep.MainClass=MyMainClass
&Steps.member.1.HadoopJarStep.Args.member.1=arg1
&Steps.member.1.HadoopJarStep.Args.member.2=arg2
&SupportedProducts.member.1=mapr-m3
&AuthParams
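The query string above can also be assembled programmatically, which avoids percent-encoding values by hand. This is a rough sketch using only the Python standard library. The function name build_run_job_flow_url is hypothetical, it covers only a subset of the parameters in the example, and it deliberately omits the authentication parameters that the AuthParams placeholder stands for, so the resulting URL is not a complete, signed request.

```python
from urllib.parse import urlencode

# Endpoint as shown in the example request above.
ENDPOINT = "https://elasticmapreduce.amazonaws.com"


def build_run_job_flow_url(name, log_uri, edition,
                           instance_type="m1.xlarge", instance_count=4):
    """Build an unsigned RunJobFlow query URL for a MapR job flow.

    Hypothetical helper: authentication parameters are NOT included and
    must be added before the request is sent.
    """
    params = [
        ("Action", "RunJobFlow"),
        ("Name", name),
        ("LogUri", log_uri),
        ("Instances.MasterInstanceType", instance_type),
        ("Instances.SlaveInstanceType", instance_type),
        ("Instances.InstanceCount", str(instance_count)),
        ("Instances.KeepJobFlowAliveWhenNoSteps", "true"),
        # The MapR edition is passed as a member of SupportedProducts.
        ("SupportedProducts.member.1", edition),
    ]
    # urlencode percent-encodes each value (for example ':' and '/').
    return ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_run_job_flow_url("MyJobFlowName",
                                 "s3n://mybucket/subdir", "mapr-m3"))
```

To select M5 instead, pass "mapr-m5" as the edition argument.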
Configuring your MapR Job Flow
After your MapR job flow is running, you need to open a port to enable access to the MapR Control System (MCS). Follow these steps to open a port.
- Select your job from the list of jobs displayed in Your Elastic MapReduce Job Flows in the Elastic MapReduce tab of the AWS Management Console, then select the Description tab in the lower pane. Make a note of the Master Public DNS Name value. Click the Amazon EC2 tab in the AWS Management Console to open the Amazon EC2 Console Dashboard.
- Select Security Groups from the Network & Security group in the Navigation pane at the left of the EC2 Console Dashboard.
- Select Elastic MapReduce-master from the list displayed in Security Groups.
- In the lower pane, click the Inbound tab.
- In Port Range:, type 8453. Leave the default value in the Source: field.
- Click Add Rule, then click Apply Rule Changes.
You can now navigate to the master node's DNS address. Connect to port 8453 to log in to the MapR Control System. Use the string hadoop for both login and password at the MCS login screen.
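Before opening a browser, you may want to confirm that the new security group rule has taken effect. The sketch below simply attempts a TCP connection to port 8453 on the master node. The function name port_is_open is hypothetical, and the host name is a placeholder for your own Master Public DNS Name. A successful connection only shows that the port is reachable, not that the MCS login will succeed.

```python
import socket


def port_is_open(host, port=8453, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens the socket;
        # the with-block closes it again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    # Placeholder: replace with the Master Public DNS Name you noted earlier.
    master = "masterDNS.amazonaws.com"
    print("MCS port reachable:", port_is_open(master))
```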
Testing Your Cluster
Amazon EMR with MapR provides a Debian environment with MapR software running on each node. MapR's NFS interface mounts the cluster on localhost at the /mapr directory.
Follow these steps to create a file and run your first MapReduce job:
- Connect to the master node with SSH as user hadoop. Pass your .pem credentials file to ssh with the -i flag, as in this example:
ssh -i /path_to_pemfile/credentials.pem hadoop@masterDNS.amazonaws.com
- Create a simple text file:
cd /mapr/my.cluster.com
mkdir in
echo "the quick brown fox jumps over the lazy dog" > in/data.txt
- Run the following command to perform a word count on the text file:
hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /mapr/my.cluster.com/in /mapr/my.cluster.com/out
As the job runs, you should see terminal output similar to the following:
12/06/09 00:00:37 INFO fs.JobTrackerWatcher: Current running JobTracker is: ip-10-118-194-139.ec2.internal/10.118.194.139:9001
12/06/09 00:00:37 INFO input.FileInputFormat: Total input paths to process : 1
12/06/09 00:00:37 INFO mapred.JobClient: Running job: job_201206082332_0004
12/06/09 00:00:38 INFO mapred.JobClient: map 0% reduce 0%
12/06/09 00:00:50 INFO mapred.JobClient: map 100% reduce 0%
12/06/09 00:00:57 INFO mapred.JobClient: map 100% reduce 100%
12/06/09 00:00:58 INFO mapred.JobClient: Job complete: job_201206082332_0004
12/06/09 00:00:58 INFO mapred.JobClient: Counters: 25
12/06/09 00:00:58 INFO mapred.JobClient: Job Counters
12/06/09 00:00:58 INFO mapred.JobClient: Launched reduce tasks=1
12/06/09 00:00:58 INFO mapred.JobClient: Aggregate execution time of mappers(ms)=6193
12/06/09 00:00:58 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/06/09 00:00:58 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/06/09 00:00:58 INFO mapred.JobClient: Launched map tasks=1
12/06/09 00:00:58 INFO mapred.JobClient: Data-local map tasks=1
12/06/09 00:00:58 INFO mapred.JobClient: Aggregate execution time of reducers(ms)=4875
12/06/09 00:00:58 INFO mapred.JobClient: FileSystemCounters
12/06/09 00:00:58 INFO mapred.JobClient: MAPRFS_BYTES_READ=385
12/06/09 00:00:58 INFO mapred.JobClient: MAPRFS_BYTES_WRITTEN=276
12/06/09 00:00:58 INFO mapred.JobClient: FILE_BYTES_WRITTEN=94449
12/06/09 00:00:58 INFO mapred.JobClient: Map-Reduce Framework
12/06/09 00:00:58 INFO mapred.JobClient: Map input records=1
12/06/09 00:00:58 INFO mapred.JobClient: Reduce shuffle bytes=94
12/06/09 00:00:58 INFO mapred.JobClient: Spilled Records=16
12/06/09 00:00:58 INFO mapred.JobClient: Map output bytes=80
12/06/09 00:00:58 INFO mapred.JobClient: CPU_MILLISECONDS=1530
12/06/09 00:00:58 INFO mapred.JobClient: Combine input records=9
12/06/09 00:00:58 INFO mapred.JobClient: SPLIT_RAW_BYTES=125
12/06/09 00:00:58 INFO mapred.JobClient: Reduce input records=8
12/06/09 00:00:58 INFO mapred.JobClient: Reduce input groups=8
12/06/09 00:00:58 INFO mapred.JobClient: Combine output records=8
12/06/09 00:00:58 INFO mapred.JobClient: PHYSICAL_MEMORY_BYTES=329244672
12/06/09 00:00:58 INFO mapred.JobClient: Reduce output records=8
12/06/09 00:00:58 INFO mapred.JobClient: VIRTUAL_MEMORY_BYTES=3252969472
12/06/09 00:00:58 INFO mapred.JobClient: Map output records=9
12/06/09 00:00:58 INFO mapred.JobClient: GC time elapsed (ms)=18
- Check the /mapr/my.cluster.com/out directory for a file named part-r-00000 with the results of the job.
cat out/part-r-00000
brown 1
dog 1
fox 1
jumps 1
lazy 1
over 1
quick 1
the 2
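To sanity-check the job's result without the cluster, the same count can be reproduced locally. The sketch below is a plain in-memory word count, not the Hadoop wordcount example itself. It assumes whitespace-separated words and sorts by word to mirror the ordering seen in the output file; the function name word_count is hypothetical.

```python
from collections import Counter


def word_count(text):
    """Return (word, count) pairs sorted by word, splitting on whitespace."""
    return sorted(Counter(text.split()).items())


if __name__ == "__main__":
    # Same sentence that was written to in/data.txt above.
    sentence = "the quick brown fox jumps over the lazy dog"
    for word, count in word_count(sentence):
        print(f"{word}\t{count}")
```

For the sample sentence this yields eight distinct words, with "the" counted twice, which agrees with the "Reduce output records=8" counter in the job log.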