Amazon EMR with the MapR Distribution for Hadoop - Key Features

The MapR Hadoop distribution adds dependability and ease of use to the strength and flexibility of Hadoop. The Amazon Elastic MapReduce (EMR) service enables you to easily set up, operate, and scale MapR deployments in the cloud, as well as integrate with other AWS services. Users can take advantage of hourly pricing with no up-front fees or long-term commitments.

Details

Submitted By: AdamG@AWS
AWS Products Used: Amazon Elastic MapReduce
Created On: June 12, 2012 9:19 PM GMT
Last Updated: June 12, 2012 11:45 PM GMT
MapR Technologies

Using NFS

The MapR distribution for Hadoop provides an NFS interface that you can use to mount the cluster. The NFS interface enables you to use standard Linux tools and applications with your cluster directly. You can get data into and out of the cluster with scp, and analyze data with commands like grep, sed, awk, or your own applications or scripts. Amazon EMR with MapR clusters have NFS preconfigured. The cluster is mounted at the /mapr directory on the master node; cluster data and files reside in the directory /mapr/clustername (for example /mapr/my.cluster.com). To use NFS on your Amazon EMR with MapR cluster, log in to the master node via ssh. After logging in to the cluster, you can use standard file-based applications, including Linux utilities, file browsers, and other applications.

Example: Creating a file via NFS
hadoop@domU-12-31-39-0E-E5-61:~$ cd /mapr/my.cluster.com
hadoop@domU-12-31-39-0E-E5-61:~$ mkdir test
hadoop@domU-12-31-39-0E-E5-61:~$ ls
cluster-info hbase test var
hadoop@domU-12-31-39-0E-E5-61:~$ cd test
hadoop@domU-12-31-39-0E-E5-61:~$ echo "Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit" > testfile
hadoop@domU-12-31-39-0E-E5-61:~$ cat testfile
Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit
hadoop@domU-12-31-39-0E-E5-61:~$ grep pisci *
testfile: Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit

Using BI Tools and Query Builders with the Hive ODBC Connector

The MapR distribution for Hadoop provides a Hive ODBC driver that conforms to the standard ODBC 3.52 specification. To get started, install the correct Hive ODBC connector:

  • Install the 64-bit connector for 64-bit applications
  • Install the 32-bit connector for 32-bit applications

Create a Data Source Name (DSN) for the data source on your Amazon EMR with MapR cluster with the Data Source Administrator by following these steps:

  1. Open the Data Source Administrator from the Start menu.
  2. Click Add to open the Create New Data Source dialog.
  3. Select Hive ODBC Connector and click Finish to open the MapR Hive ODBC Connector Setup window.
  4. Enter the connection information for the Hive instance:
    • Data Source Name – a name for the DSN
    • Description – an optional description for the DSN
    • Host – the IP address or hostname of the Amazon EMR with MapR node running Hive Server (i.e., the Thrift server)
    • Port – the listening port for Hive Server (you'll need to open this port in the security group)
    • Database � use show databases at the Hive command line if you are not sure
  5. Click Test to test the connection.
  6. When you're sure the connection works, click Finish.
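
If you are not sure what to enter for the Database or Port fields, you can check on the cluster first. The lines below are a minimal sketch to run on the master node; port 10000 is the common Hive Server (Thrift) default, so confirm the actual port for your cluster before opening it in the security group.

# List the databases available to Hive; use one of these names in the Database field of the DSN
hive -e "show databases;"

# Confirm that the Hive Thrift server is listening (10000 is the usual default;
# verify against your cluster's configuration)
netstat -nlt | grep 10000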

The MapR Control System

The MapR Control System (MCS) enables you to manage all aspects of your Amazon EMR with MapR cluster. For example, you can:
  • Monitor the health of the nodes
  • Monitor system alarms
  • Create, remove and manage volumes
  • Configure snapshots and mirroring
  • Manage schedules
  • Change your cluster's NFS and NFS High Availability (HA) settings
  • Generate a Nagios topology script
To log in to the MapR Control System (MCS), use a browser to open an HTTPS connection to port 8453 on your master node (this port must be open in the security group). You can determine the address of your Amazon EMR with MapR master node with the elastic-mapreduce --list --active command or by selecting your EMR job flow from the Elastic MapReduce tab of the AWS Management Console.
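
For example, assuming the elastic-mapreduce command line client is installed and configured, the following sketch locates the master node and builds the MCS URL:

# List active job flows; the output includes the master node's public DNS name
elastic-mapreduce --list --active

# Then open the MCS in a browser (port 8453 must be open in the security group):
#   https://<master-public-dns-name>:8453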

Creating a Volume

MapR volumes can be used to group related files and directories into a single tree structure so they can be easily organized, managed, and secured. Volumes provide the ability to apply policies, such as:
  • Replication factor
  • Mirroring
  • Snapshots
  • Data placement control
  • Quotas and usage tracking
  • Administrative permissions

To create a volume using the MapR Control System:

  1. In the Navigation pane, click Volumes under MapR-FS.
  2. Click the New Volume button.



    The New Standard Volume dialog displays.
  3. Fill in the fields in the New Standard Volume dialog:
    • Volume Type: Standard Volume
    • Volume Name: MyVolume
    • Mount Path: /myvolume
  4. Log on to the master node via ssh and check that your volume has been created at the correct path:
    /mapr/my.cluster.com/myvolume
  5. Create a file inside the volume. Example:
    cd /mapr/my.cluster.com/myvolume
    touch myfile
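
You can also create volumes from the command line. The following is a minimal sketch using maprcli on the master node, with the same volume name and mount path as the example above; check the Volumes documentation for the full set of options.

# Create a standard volume named MyVolume mounted at /myvolume
maprcli volume create -name MyVolume -path /myvolume

# Verify over NFS that the mount path exists
ls /mapr/my.cluster.com/myvolume
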
For more information, see Volumes in the MapR documentation.

Using Snapshots (M5 only)

MapR is the only Hadoop distribution that provides snapshots, enabling users to recover from user and application errors. Snapshots can be configured with flexible schedules to accommodate a range of recovery point objectives. Because MapR's snapshots are consistent, any application can be snapshotted, including HBase. Recovering from a snapshot is as easy as copying the directory or files from the hidden snapshot directory to the current read/write directory.

Note that snapshots are extremely efficient and do not have any performance impact. No data is copied in order to create a snapshot. As a result, a snapshot of a petabyte volume can be performed in seconds. MapR uses redirect-on-write operations, meaning that each write in the system goes to a new block on disk.

To create a snapshot using the MapR Control System:

  1. In the Navigation pane, click on the Volumes element under MapR-FS.
  2. Select the checkbox next to MyVolume.
  3. Click Snapshots.
  4. Click New Snapshot and fill out the field:
    Name for new snapshot(s)?: MySnapshot
  5. Log on to the master node via ssh, and check that your snapshot was created:
    cd /mapr/my.cluster.com/myvolume/.snapshot
    ls
  6. Look inside the snapshot to see that the file myfile that you created in the volume MyVolume is there:
    cd MySnapshot
    ls


You can also create snapshots using the MCS CLI or REST API.
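
For example, the following is a minimal maprcli sketch that creates and lists the same snapshot from the master node (verify the option names against your MapR version):

# Create a snapshot of MyVolume named MySnapshot
maprcli volume snapshot create -volume MyVolume -snapshotname MySnapshot

# List snapshots to confirm that it was created
maprcli volume snapshot list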

The real power of snapshots is protection against user error: if you accidentally delete or damage a file, you can recover it from an existing snapshot.

To recover a file from a snapshot:
  1. Log on to the master node via ssh.
  2. "Accidentally" delete the file you created earlier:
    cd /mapr/my.cluster.com/myvolume
    rm myfile
  3. Check the snapshot directory:
    cd /mapr/my.cluster.com/myvolume/.snapshot/MySnapshot
    ls
  4. Copy the deleted file back to the volume:
    cp myfile ../..

Using Mirrors (M5 only)

Going far beyond replication, MapR's mirroring means you can set policies around your Recovery Time Objectives (RTO) and mirror your data automatically within your cluster, between clusters (such as a production and a research cluster), or between sites.

A mirror volume is a read-only physical copy of another volume, the source volume.

To create a mirror using the MapR Control System:

  1. In the Navigation pane, click Volumes under MapR-FS.
  2. Click the New Volume button.



    The New Standard Volume dialog displays.
  3. Fill in the fields in the New Standard Volume dialog:
    • Volume Type: Local Mirror Volume
    • Mirror Name: backup.MyVolume
    • Source Volume Name: MyVolume
    • Mount Path: /backup/myvolume (make sure the directory /backup exists)
    • Mirror Scheduling: Important Data
The Mirror Scheduling field specifies a regular time for the data to be synchronized, but you can also synchronize the mirror with the source volume on demand: once you have created the mirror, select the checkbox beside it in the list, and click Start Mirroring.
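
You can also trigger mirror synchronization from the command line. The sketch below uses maprcli with the mirror name from the example above; verify the option names against your MapR version.

# Start an on-demand synchronization of the mirror with its source volume
maprcli volume mirror start -name backup.MyVolume

# Check the mirror volume's details, including its source volume
maprcli volume info -name backup.MyVolume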

Using Schedules (M5 only)

Both snapshots and mirrors can be controlled by schedules. Out of the box, the cluster comes with a few schedules to get you started, but you may want to create your own or modify the existing schedules.

To create a schedule using the MapR Control System:
  1. In the Navigation pane, expand the MapR-FS group and click the Schedules view.
  2. Click New Schedule.
  3. Type a name for the new schedule in the Schedule Name field.
  4. Define one or more schedule rules in the Schedule Rules section.
  5. Click [ + Add Rule ] to specify additional schedule rules, as desired.
  6. Click Save Schedule to create the schedule.
Once you have created a schedule, you can use it to schedule snapshots and mirrors:
  • To use the schedule for snapshots, apply it to a standard volume
  • To use the schedule for mirrors, apply it to a mirror volume
You can use the same schedule for as many volumes as you like.

To apply a schedule to a volume:
  1. In the Navigation pane, click on the Volumes element under MapR-FS.
  2. Select the checkbox next to the volume.
  3. Click Properties to display the Volume Properties window.
  4. Scroll down to the Scheduling pane and select the schedule from the Snapshot Schedule or Mirror Schedule dropdown list.
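
Schedules can also be listed and applied from the command line. The sketch below assumes the maprcli schedule and volume commands; the schedule ID is a placeholder to be taken from the output of the list command.

# List existing schedules and note the id of the one you want to apply
maprcli schedule list

# Apply a snapshot schedule to a standard volume (replace <schedule-id>)
maprcli volume modify -name MyVolume -schedule <schedule-id>

# Apply a mirror schedule to a mirror volume
maprcli volume modify -name backup.MyVolume -mirrorschedule <schedule-id>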

High Availability (M5 only)

MapR provides high availability, redundancy and fault tolerance at all levels of the stack.
  • The MapR JobTracker HA improves recovery time objectives and provides for a self-healing cluster. Upon failure, the JobTracker automatically restarts on another node in the cluster. TaskTrackers will automatically reconnect to the new JobTracker. Any currently running jobs or tasks continue without losing any progress or failing.
  • MapR is the only Hadoop distribution with a no-NameNode architecture, supporting automatic failover and failback. The metadata for the entire cluster is distributed, so that there is no loss or downtime even in the face of multiple disk or node failures. Also, unlike other distributions, MapR's HA is self-contained and does not have any external dependencies (e.g., NAS appliance).
You can see the redundant services for yourself in the MapR Control System: from the Dashboard, the running services are visible.

To view services in the MapR Control System:
  1. Under the Cluster group in the left pane, click Dashboard.
  2. Check the Services pane and make sure each service is running the correct number of instances.
    • For an M3 cluster:
      • Instances of the FileServer and TaskTracker on all nodes
      • 3 or 5 instances of ZooKeeper
      • 1 instance of the CLDB, JobTracker, NFS, and WebServer
    • For an M5 cluster:
      • Instances of the FileServer and TaskTracker on all nodes
      • 3 or 5 instances of ZooKeeper
      • 3 instances of the CLDB
      • 1 to 3 instances of the JobTracker
      • 1 instance of the WebServer
      • 1 or more instances of NFS
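
You can also check services from the command line; the sketch below uses maprcli, with the node hostname as a placeholder.

# List the services configured on a particular node
maprcli service list -node <node-hostname>

# Show the nodes in the cluster and the services each one runs
maprcli node list -columns hostname,svc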

Automatic Compression

MapR saves disk space and network bandwidth by automatically and transparently compressing data. By default, all data on disk is compressed, and all data transmitted within a MapR cluster or between clusters (when mirroring) is compressed and checksummed over the wire. Users do not need to manually compress or index data. See Compression in the MapR documentation.
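
If you want to inspect or change compression for a particular directory, MapR exposes this through the hadoop mfs command. The sketch below is an assumption about the available options and uses a hypothetical directory name, so check the Compression documentation for the exact syntax.

# List a directory with MapR-specific attributes; the output includes a
# per-file flag showing whether data is stored compressed
hadoop mfs -ls /myvolume

# Turn compression off (or back on) for new files written under a directory
hadoop mfs -setcompression off /myvolume/uncompressed-data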

CLI and REST API

The MapR Control System is available through a CLI and REST API. This list provides a brief summary of available commands:

  • acl: Modify and create Access Control Lists (ACL).
  • alarm: Interact with system alarms.
  • config: Change MapR configuration values.
  • dashboard: Display summary information about the cluster.
  • dialhome: Change your cluster's Dial Home settings.
  • disk: List, add, or remove disks from the cluster.
  • entity: Manage users and groups.
  • license: Manage your MapR licenses.
  • nagios: Generate a Nagios topology script.
  • nfsmgmt: Refresh NFS exports.
  • node: Manage the status and behavior of nodes in your cluster.
  • schedule: List and modify schedules for mirroring and snapshot syncing.
  • service: List all services on a specified node.
  • setloglevel: Set log levels for individual services on a node.
  • trace: View and modify the trace buffer and trace levels for system modules.
  • urls: Display the status page URL for a specified service.
  • virtualip: Manage Virtual IP addresses.
  • volume: Work with volumes, snapshots, and mirrors.
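
For example, most commands can be invoked either through maprcli or as a REST call to the MCS webserver. The REST URL below is a sketch: it assumes the API is served on the same port (8453) as the MCS on an Amazon EMR with MapR cluster, and the credentials and hostname are placeholders.

# CLI: summary information about the cluster
maprcli dashboard info

# CLI: list volumes and snapshots
maprcli volume list
maprcli volume snapshot list

# REST: the same dashboard information over HTTPS
curl -k -u <user>:<password> "https://<master-public-dns-name>:8453/rest/dashboard/info"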