Amazon EMR with the MapR Distribution for Hadoop

Amazon Elastic MapReduce (Amazon EMR) makes it easy to provision and manage the open source data analytics framework, Apache Hadoop, in the AWS Cloud. Hadoop is available in multiple distributions and Amazon EMR gives you the option of using the Amazon Distribution or the MapR Distribution for Hadoop.

MapR introduces enterprise-focused features for Hadoop such as high availability, data snapshotting, cluster mirroring across Availability Zones, and NFS mounts. Combined with Amazon Elastic MapReduce’s managed Hadoop environment, seamless integration with other AWS services, and hourly pricing with no upfront fees or long-term commitments, Amazon EMR with the MapR Distribution for Hadoop offers customers a powerful tool for generating insights from their data.

Features

Industry-standard interfaces

  • NFS - MapR provides random read/write access and a standard NFS interface so that users can mount the cluster and leverage standard file-based applications with Hadoop, including Linux utilities, file browsers and other applications. When using MapR on Amazon EMR, the NFS interface is pre-mounted at /mapr.
  • ODBC - MapR provides an ODBC driver for Hive that conforms to the standard ODBC 3.52 specification, enabling users to utilize any BI tool or SQL query builder with Hadoop. MicroStrategy, Tableau, Excel, Toad and many other commercial and open source tools are supported.
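Because the NFS interface exposes the cluster as an ordinary file system, standard file APIs are enough to read and update data in place. The following is a minimal Python sketch of that idea, not an official MapR example. The function name and the directory layout beneath the mount point are hypothetical; only the /mapr mount point comes from the description above.

```python
from pathlib import Path


def write_then_patch(mount_root, relative_path, initial, offset, patch):
    """Write a file under the mount, overwrite bytes in place, return the result.

    mount_root would be "/mapr" on an EMR node running MapR; any writable
    directory works for trying the sketch locally.
    """
    target = Path(mount_root) / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)

    # Ordinary sequential write, exactly as on a local disk.
    target.write_bytes(initial)

    # Random write: seek to an offset and overwrite part of the file in place.
    with target.open("r+b") as handle:
        handle.seek(offset)
        handle.write(patch)

    return target.read_bytes()
```

For example, calling write_then_patch("/mapr", "my.cluster.com/user/demo/data.bin", b"hello world", 6, b"WORLD") would be expected to leave a file containing b"hello WORLD"; the cluster and user names in that path are placeholders.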

Management

  • Deployment - Amazon EMR with MapR fully automates the provisioning, installation, and configuration of the cluster, which can be launched via the AWS Management Console, CLI, or API.
  • MapR Control System (MCS) - MapR provides end-to-end monitoring and management for Hadoop, including hardware, storage, MapReduce and other components in the distribution.
  • CLI and REST API - All MCS capabilities are also exposed through the CLI and REST API. This enables users to obtain cluster information and perform operations programmatically. It also allows integration with third-party and custom monitoring/management systems.
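As a rough illustration of programmatic access, the Python sketch below only builds an HTTPS request for a REST call and does not send it. The URL layout (https://host:8443/rest/command), the default port, the use of HTTP basic authentication, and the example command are all assumptions that should be checked against the MapR documentation for your cluster version; the function name is hypothetical.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_rest_request(host, command, params, user, password, port=8443):
    """Build (but do not send) a request for an assumed MCS REST endpoint."""
    # Assumed layout: https://<host>:<port>/rest/<command>?<parameters>
    url = f"https://{host}:{port}/rest/{command}"
    query = urlencode(params)
    if query:
        url += "?" + query

    # Assumed authentication scheme: HTTP basic auth with cluster credentials.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(url, headers={"Authorization": f"Basic {token}"})
```

A monitoring system could pass the resulting object to urllib.request.urlopen and parse the response, once the endpoint and certificate handling have been confirmed for the cluster in question.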

Business continuity

  • File System High Availability - MapR provides a no-NameNode architecture that can tolerate multiple simultaneous failures with automatic failover and failback. The metadata is distributed and replicated, just like the data. There is no NameNode, so there is no practical limit to how many files can be stored, and there is no dependency on any external NAS.
  • MapReduce High Availability - MapR provides JobTracker HA, with automatic failover and failback. If the active JobTracker fails, it is automatically started on a different node, and all jobs and tasks continue to run with no interruption.
  • Data Protection - MapR provides snapshots for point-in-time recovery, enabling users to recover from user and application errors. MapR uses redirect-on-write technology, so only changed blocks are snapshotted, and there is no impact on performance. Note that snapshots are guaranteed to be consistent, so all applications are supported.
  • Disaster Recovery - MapR provides mirroring between clusters, enabling disaster recovery across Availability Zones, as well as hybrid deployments involving both on-premises and EMR clusters. For hybrid deployments, all MapR-based Hadoop distributions are supported, including EMC Greenplum MR and the Cisco UCS appliance. Note that only changed blocks are transferred, and all data is automatically compressed.
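The redirect-on-write idea behind the snapshots described above can be illustrated with a toy model. The Python sketch below is not MapR's implementation; the class and its methods are hypothetical and exist only to show why a snapshot needs to record a block map rather than copy data, and why only blocks changed afterwards consume new space.

```python
class ToyVolume:
    """Toy redirect-on-write volume: snapshots share unchanged blocks."""

    def __init__(self):
        self._store = {}      # physical block id -> bytes
        self._live = {}       # logical block number -> physical block id
        self._snapshots = {}  # snapshot name -> frozen copy of the block map
        self._next_id = 0

    def write(self, logical_block, data):
        # Redirect the write to a fresh physical block; blocks referenced
        # by earlier snapshots are never modified.
        self._store[self._next_id] = data
        self._live[logical_block] = self._next_id
        self._next_id += 1

    def snapshot(self, name):
        # Only the block map is copied, not the data itself.
        self._snapshots[name] = dict(self._live)

    def read(self, logical_block, snapshot=None):
        block_map = self._live if snapshot is None else self._snapshots[snapshot]
        return self._store[block_map[logical_block]]

    def physical_block_count(self):
        return len(self._store)
```

After writing two blocks, taking a snapshot, and rewriting one block, the toy volume holds three physical blocks: the snapshot still reads the old contents, the live view reads the new contents, and the unchanged block is shared. The toy never reclaims unreferenced blocks, which a real file system would have to do.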

Compression and performance

  • Compression - MapR automatically and transparently compresses all data that is not already compressed. This reduces disk and network I/O and increases performance. There is no need to manually compress files or modify applications to handle compression. Random read/write is efficient because only the necessary blocks are decompressed, and files are splittable.
  • Performance - MapR features an advanced architecture that provides higher efficiency and parallelism, while reducing disk and network I/O.
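To see why block-level compression keeps random reads efficient, consider the simplified Python sketch below. It compresses a byte string in independent fixed-size blocks with zlib and decompresses only the blocks that overlap a requested range. The block size, the choice of zlib, and the function names are illustrative assumptions, not details of MapR's on-disk format.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # hypothetical block size, chosen only for the example


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress data as a list of independently compressed blocks."""
    return [
        zlib.compress(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]


def read_range(blocks, offset, length, block_size=BLOCK_SIZE):
    """Return `length` bytes starting at `offset`, touching only needed blocks."""
    if length <= 0:
        return b""
    first = offset // block_size
    last = (offset + length - 1) // block_size

    # Decompress only the blocks that overlap the requested range.
    chunk = b"".join(zlib.decompress(block) for block in blocks[first:last + 1])
    start = offset - first * block_size
    return chunk[start:start + length]
```

Because each block is compressed on its own, a reader or a MapReduce split can start at any block boundary without decompressing the data that precedes it.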


Editions

The M3 and M5 editions of MapR available on Amazon EMR are complete Hadoop distributions, including many open source components such as Hive, Pig, and Cascading. Both editions include the industry-standard interfaces (e.g., NFS, ODBC), as well as the management, compression and performance benefits described on this page. M5 additionally includes the business continuity capabilities, such as HA, data protection and disaster recovery.


Pricing

With Elastic MapReduce you only pay for what you use.

Your cost will depend on the number and type of Amazon EC2 Instances in your job flow and the amount of time it is running. Pricing for Elastic MapReduce with MapR is in addition to pricing for EC2 and S3.

Pricing for Amazon EC2 and Amazon Elastic MapReduce

You are charged from the time the job flow begins processing until it is terminated. Partial hours are rounded up.
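The billing rule can be sketched as a small calculation. The Python function below is an illustrative estimate only: it assumes every instance runs for the full life of the job flow and that the Elastic MapReduce with MapR charge is a per-instance hourly rate added to the EC2 rate. The function name and the rates in the usage note are hypothetical; use the current published prices for a real estimate.

```python
import math


def estimate_job_flow_cost(instance_count, runtime_hours,
                           ec2_hourly_rate, emr_hourly_rate):
    """Estimate the compute cost of a job flow, rounding partial hours up."""
    # Partial hours are rounded up to the next full hour.
    billable_hours = math.ceil(runtime_hours)

    # The EMR charge is assumed to be added on top of the EC2 charge.
    return instance_count * billable_hours * (ec2_hourly_rate + emr_hourly_rate)
```

With made-up rates of 0.50 (EC2) and 0.25 (EMR) per instance hour, a 10-instance job flow that runs for 2.25 hours is billed for 3 hours per instance, giving 10 × 3 × 0.75 = 22.50. Amazon S3 and Amazon SimpleDB charges are not included.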

Save Money with Reserved and Spot Instances

The Amazon EC2 prices above are for On-Demand Instances. On-Demand Instances are the most expensive option but give you the most flexibility. EC2 also offers Reserved Instances and Spot Instances.

  • Reserved Instances give you the option to make a low, one-time payment for each instance you want to reserve and in turn receive a significant discount on the hourly charge for that instance. There are three Reserved Instance types (Light, Medium, and Heavy Utilization Reserved Instances) that enable you to balance the amount you pay upfront with your effective hourly price.
  • Spot Instances enable you to bid for unused Amazon EC2 capacity. Instances are charged the Spot Price, which is set by Amazon EC2 and fluctuates periodically depending on the supply of and demand for Spot Instance capacity. To use Spot Instances, you specify the maximum price you are willing to pay per instance hour. If your maximum price bid exceeds the current Spot Price, your request is fulfilled and your instances will run until either you choose to terminate them or the Spot Price increases above your maximum price (whichever is sooner).
To view more information and current prices for Reserved Instances and Spot Instances, see the Amazon EC2 pricing page.
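The Spot Instance rules described above can be modeled with a short simulation. The Python sketch below is a simplification: it assumes one Spot Price per hour, treats the request as fulfilled only if the maximum price exceeds the first price in the series, and stops permanently once the Spot Price rises above the maximum. The function name and price series are hypothetical.

```python
def simulate_spot_request(max_price, hourly_spot_prices):
    """Return (hours_run, total_charge) for a single Spot request."""
    hours_run = 0
    total_charge = 0.0
    for hour, spot_price in enumerate(hourly_spot_prices):
        # The request is fulfilled only if the maximum price exceeds
        # the current Spot Price.
        if hour == 0 and not max_price > spot_price:
            break
        # Instances stop once the Spot Price rises above the maximum price.
        if spot_price > max_price:
            break
        hours_run += 1
        # Instances are charged the Spot Price, not the maximum price.
        total_charge += spot_price
    return hours_run, total_charge
```

For a maximum price of 0.25 and hourly Spot Prices of 0.125, 0.25, 0.5, and 0.125, the instances run for the first two hours at a total charge of 0.375 and are terminated in the third hour, when the Spot Price exceeds the maximum.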

Other Pricing Details

Amazon S3 is billed separately. (Many customers store their input and output data in S3; others store all of the data locally on HDFS.) Currently it costs $668 per month to store 10 TB of data in S3 with reduced redundancy. The more data you store, the lower the monthly price per GB.

Amazon SimpleDB is also billed separately. (This applies only if you enable debugging for your job flow.)

Estimate your bill

You can use the AWS Simple Monthly Calculator to estimate your bill.


Getting Started

The EMR Developer Guide includes detailed instructions on how to launch MapR on EMR using the AWS Management Console, CLI or API. To launch a MapR cluster using the AWS Management Console:
  1. Access the EMR service on the AWS Management Console.
  2. Click Create New Job Flow to start the Create a new Job Flow wizard. This wizard will launch the MapR cluster.
  3. Select MapR M3 or M5 from the Hadoop Version dropdown list on the Define Job Flow pane of the wizard.
  4. Follow the remaining steps in the wizard to launch your job flow.
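
For launching through the API instead of the console, the Python sketch below builds the parameters for a RunJobFlow request. It assumes the AWS SDK for Python (boto3) and its SupportedProducts parameter with the values "mapr-m3" and "mapr-m5"; the helper name and the choice to use one instance type for all nodes are illustrative, and availability of MapR depends on the EMR release you use. The sketch only constructs the request and does not call AWS.

```python
def build_mapr_job_flow_request(name, edition, instance_type, instance_count):
    """Build keyword arguments for an EMR RunJobFlow call that selects MapR."""
    if edition not in ("m3", "m5"):
        raise ValueError("edition must be 'm3' or 'm5'")

    return {
        "Name": name,
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Keep the cluster running after its steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumed product identifiers: "mapr-m3" or "mapr-m5".
        "SupportedProducts": [f"mapr-{edition}"],
    }
```

With boto3 installed and credentials configured, the result could be passed as boto3.client("emr").run_job_flow(**request). Additional parameters, such as an AMI version or IAM roles, may be required depending on the account and EMR release; consult the EMR Developer Guide for the authoritative procedure.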


Support

AWS Premium Support customers may contact Amazon regarding any issues with MapR on EMR.

M5 users may also contact MapR 24x7 support directly by emailing support@mapr.com. All M3 and M5 users are welcome to post questions to the MapR Forums, which are continuously monitored by MapR.


About MapR Technologies

MapR delivers on the promise of Hadoop, making managing and analyzing Big Data a reality for more business users. The award-winning MapR Distribution brings dependability, speed and ease-of-use to Hadoop. Combined with data protection and business continuity, MapR enables customers to harness the power of Big Data analytics. Investors include Lightspeed Venture Partners, NEA and Redpoint Ventures. Connect with MapR on Facebook, LinkedIn, and Twitter.


©2013, Amazon Web Services, Inc. or its affiliates. All rights reserved.