Elastic MapReduce Training
What is Amazon Elastic MapReduce?
With Amazon Elastic MapReduce (Amazon EMR), you can analyze vast amounts of data by distributing the computational work across a cluster of virtual servers running in the Amazon cloud. The cluster is managed using an open-source framework called Hadoop. Amazon EMR has been used by thousands of customers around the world to launch millions of Hadoop clusters since 2009.
Hadoop uses a distributed processing architecture called MapReduce, in which a task is mapped to a set of servers for processing. The results of the computation performed by those servers are then reduced down to a single output set. One node, designated as the master node, controls the distribution of tasks. The following diagram shows a Hadoop cluster with the master node directing a group of slave nodes that process the data.
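As a rough sketch of the data flow described above, here is the classic word-count problem expressed as separate map and reduce phases in plain Python. This only illustrates the MapReduce pattern; on a real cluster, Hadoop distributes the map work across the slave nodes and shuffles intermediate pairs before the reduce step.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: turn each input record into (key, value) pairs.
    For word count, every word yields the pair (word, 1)."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: combine all values that share a key into one result."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
result = reduce_phase(map_phase(docs))
print(result)
```

In Hadoop, the map and reduce functions run on different machines, and the framework handles moving the intermediate pairs between them; the single-process version above just makes the two phases easy to see.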
Amazon EMR has made enhancements to Hadoop and other open-source applications to work seamlessly with AWS. For example, Hadoop clusters running on Amazon EMR use Amazon EC2 instances as virtual Linux servers for the master and slave nodes, Amazon S3 for bulk storage of input and output data, and Amazon CloudWatch to monitor cluster performance and raise alarms. You can also move data into and out of Amazon DynamoDB using Amazon EMR and Hive. All of this is orchestrated by Amazon EMR control software that launches and manages the Hadoop cluster. This process is called an Amazon EMR job flow.
The following diagram illustrates how Amazon EMR interacts with other AWS services.
Open-source projects that run on top of the Hadoop architecture can also be run on Amazon EMR. The most popular applications, such as Hive, Pig, HBase, DistCp, Ganglia, Mahout, and R, are already integrated with Amazon EMR.
By running Hadoop on Amazon EMR you get the benefits of the cloud:
- The ability to provision clusters of virtual servers within minutes.
- You can scale the number of virtual servers in your cluster to manage your computation needs, and only pay for what you use.
- Integration with other AWS services.
Getting Started Tutorial
Are you ready to launch your first cluster? Click here to create a cluster that will count the frequency of words in a sample text file.
Amazon EMR 101 (11 Videos)
Other Training Videos
- The agenda for the class, and why developers should consider using Amazon Elastic MapReduce.
- Signing up for an AWS account, generating a key-pair, and setting up an S3 bucket.
- Creating, monitoring, and getting results from your EMR Job Flow.
- EC2 instance types, pricing, and Hadoop cluster configuration.
- S3 architectures, pricing, and access control.
- How to use a Hadoop Job Flow to analyze text from Wikipedia.
- When and how to use the EMR and s3cmd tools.
- Best practices for debugging EMR Job Flows.
- Creating, monitoring, and getting results from Hive & Pig Job Flows.
- How to use a Hive Job Flow to analyze Wikipedia article data.
- Bootstrap actions, spot pricing, and task groups.
- Monitoring Amazon EMR Metrics.
- Save Money with Spot Instances.
- Use EMR to export and analyze data from DynamoDB.
- Use AWS Data Pipeline to schedule recurring EMR jobs.
- Learn how to use MapR's Hadoop Distributions on EMR.
- From AWS Re:Invent 2012: Learn about EMR and the Hadoop Ecosystem.
- An overview of AWS Big Data.
If you are planning to process more than 1 TB per day you may be eligible for EMR Bootcamp, an onsite proof-of-concept and knowledge transfer workshop with an AWS Solutions Architect who specializes in EMR. To learn more, click here or contact us.
Scale Unlimited offers customized on-site training for companies that need to quickly learn how to use EMR and other big data technologies. To find out more, click here.
Amazon EMR Documentation
Detailed Amazon EMR Documentation:
Amazon EMR Options (Cluster Configurations, Applications, Tools)
This section describes some of the choices you need to make when using EMR.
Cluster Configuration Choices
With EMR you decide:
- How many nodes to provision in your cluster (This will be driven by the amount of data you have and your time requirements)
- What types of nodes to provision in your cluster (For example: high CPU, high memory, high storage, etc.)
- Which Hadoop distribution to use (Amazon EMR supports Amazon's Hadoop distribution, MapR M3, and MapR M5)
- How long to keep your cluster running (Some users never terminate their clusters; other users launch/terminate many clusters per day)
- Which AWS region to use (EMR is available in AWS data centers around the world)
- Where to store your data (Some users store the input and output data in S3, others store the data on the cluster's HDFS)
Types of EMR Applications
Amazon EMR simplifies running Hadoop and related big-data applications on AWS. You can use it to manage and analyze vast amounts of data. For example, a cluster can be configured to process petabytes of data.
Custom MapReduce Applications
In order to develop custom Hadoop applications, you used to need access to a lot of hardware to test your Hadoop programs. Amazon EMR makes it easy to spin up a set of Amazon EC2 instances as virtual servers to run your Hadoop cluster. You can also test various server configurations without having to purchase or reconfigure hardware. When you're done developing and testing your application, you can terminate your cluster, only paying for the computational time you used.
Amazon EMR provides three types of clusters (also called job flows) that you can launch to run custom map-reduce applications, depending on the type of program you're developing and which libraries you intend to use.
- Custom JAR: Run your custom map-reduce program written in Java. This cluster provides low-level access to the MapReduce API. You have the most flexibility with this type of cluster, but also the responsibility of defining and implementing the map and reduce tasks in your Java application.
- Cascading: This cluster makes use of the Cascading Java library, which provides features such as splitting and joining data streams. You have less flexibility than with a Custom JAR cluster, but the application development is simplified.
- Streaming: Run a single Hadoop job based on map and reduce functions you upload to Amazon S3. The functions can be implemented in any of the following supported languages: Ruby, Perl, Python, PHP, R, Bash, C++.
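To make the Streaming option concrete, here is a minimal word-count mapper and reducer pair in Python. This is an illustrative sketch, not EMR-specific code: Hadoop streaming feeds each script lines on standard input and reads tab-separated `key\tvalue` lines on standard output, sorting the mapper's output by key before the reducer sees it (simulated below with `sorted`).

```python
def mapper(lines):
    """Streaming mapper: emit one tab-separated 'word\t1' line per word."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word.lower()}\t1"

def reducer(lines):
    """Streaming reducer: input arrives sorted by key, so all counts for
    the same word are adjacent and can be summed in a single pass."""
    current, total = None, 0
    for line in lines:
        word, count = line.strip().split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__":
    # Hadoop streaming would pipe real input through stdin/stdout; here
    # we simulate the shuffle step locally by sorting the mapper output.
    sample = ["the quick brown fox", "the lazy dog"]
    for line in reducer(sorted(mapper(sample))):
        print(line)
```

On EMR these would typically be two separate scripts reading `sys.stdin`, uploaded to Amazon S3 and referenced as the mapper and reducer when you configure the Streaming cluster.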
Data Analysis on Amazon EMR
You can use Amazon EMR to analyze data without writing a line of code. Several open-source applications run on top of Hadoop and make it possible to run map-reduce jobs and query data using either a SQL-like syntax or a specialized query language called Pig Latin. Amazon EMR is integrated with Apache Hive and Apache Pig.
Data Storage on Amazon EMR
Distributed storage is a way to store large amounts of data over a distributed network of computers with redundancy to protect against data loss. Amazon EMR is integrated with the Hadoop Distributed File System (HDFS) and Apache HBase.
Move Data with Amazon EMR
You can use Amazon EMR to move large amounts of data in and out of databases and data stores. By distributing the work, the data can be moved quickly. Amazon EMR provides custom libraries to move data in and out of Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB, and Apache HBase.
EMR Management Tools
There are several ways you can launch and manage clusters with Amazon EMR:
- Console — a graphical interface that you can use to launch and manage job flows. With it, you fill out web forms to specify the details of job flows to launch, view the details of existing job flows, and debug and terminate job flows. Using the console is the easiest way to get started with Amazon EMR; no programming knowledge is required. The console is available online here.
- Command Line Interface (CLI) — an application you run on your local machine to connect to Amazon EMR and create and manage job flows. With it, you can write scripts that automate the process of launching and managing job flows. Using the CLI is the best option if you prefer working from a command line. For more information, see Command Line Interface Reference for Amazon EMR.
- Software Development Kit (SDK) — AWS provides an SDK with functions that call Amazon EMR to create and manage job flows. With it, you can write applications that automate the process of creating and managing job flows. Using the SDK is the best option if you want to extend or customize the functionality of Amazon EMR. You can download the AWS SDK for Java here.
- Web Service API — AWS provides a low-level interface that you can use to call the web service directly using JSON. Using the API is the best option if you want to create a custom SDK that calls Amazon EMR. For more information, see the Amazon EMR API Reference.