Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
Using Amazon Elastic MapReduce, you can instantly provision as much or as little capacity as you like to perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics research. Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management, or tuning of Hadoop clusters or the compute capacity upon which they sit.
Amazon Elastic MapReduce automatically spins up a Hadoop implementation of the MapReduce framework on Amazon EC2 instances, sub-dividing the data in a job flow into smaller chunks so that they can be processed (the “map” function) in parallel, and eventually recombining the processed data into the final solution (the “reduce” function). Amazon S3 serves as the source for the data being analyzed, and as the output destination for the end results.
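The map and reduce phases described above can be sketched in a few lines of ordinary Python. This is a toy word count, not EMR code: the chunk boundaries are chosen by hand here, whereas EMR sub-divides the input for you and runs the map phase across many instances.

```python
from collections import Counter
from functools import reduce

def map_chunk(chunk):
    # "Map": process one sub-division of the data independently.
    return Counter(chunk.split())

def reduce_counts(a, b):
    # "Reduce": recombine partial results into the final answer.
    return a + b

# Sub-divide the job into smaller chunks, as EMR does across instances.
chunks = ["the quick brown fox", "jumps over the lazy dog", "the end"]
partials = [map_chunk(c) for c in chunks]   # map phase (parallelizable)
total = reduce(reduce_counts, partials)     # reduce phase
print(total["the"])  # 3
```

In a real job flow, each call to the map function would run on a different node, and Amazon S3 would hold both the input chunks and the final recombined output.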
To use Amazon Elastic MapReduce, you simply upload your input data and your data processing application to Amazon S3, choose the number and type of Amazon EC2 instances for your job flow, start the job flow from the AWS Management Console (or via the Command Line Tools or APIs), and collect your results from Amazon S3 when processing completes.
Elastic — Amazon Elastic MapReduce enables you to use as many or as few compute instances running Hadoop as you want. You can commission one, hundreds, or even thousands of instances to process gigabytes, terabytes, or even petabytes of data. You can modify the number of instances while your job flow is running and you can run as many job flows concurrently as you wish. You can instantly spin up large Hadoop job flows which will start processing within minutes, not hours or days. When your job flow completes, unless you specify otherwise, the service automatically tears down your instances.
Easy to use — You don’t need to worry about setting up, running, or tuning the performance of Hadoop clusters; instead, you can concentrate on data analysis. We provide easy-to-use tools and sample data processing applications that let you get up and running without writing a single line of code. Once you start a job flow, Amazon Elastic MapReduce handles Amazon EC2 instance provisioning, security settings, Hadoop configuration and set-up, log collection, health monitoring, and other hardware-related complexities such as automatically removing faulty instances from your running job flow.
Reliable — Amazon Elastic MapReduce is built on Amazon’s highly reliable infrastructure, with Hadoop’s performance tuned specifically for Amazon’s environment. The service also monitors your job flow execution, retrying failed tasks, shutting down problematic instances, and provisioning new nodes to replace those that fail.
Seamlessly integrated with other AWS services — Amazon Elastic MapReduce is designed to integrate easily with other AWS services such as Amazon S3, DynamoDB, and EC2, providing the infrastructure for data processing applications. The service runs job flows in Amazon EC2 and stores input and output data in Amazon S3 and/or Amazon DynamoDB.
Secure — Amazon Elastic MapReduce automatically configures Amazon EC2 firewall settings that control network access to and between instances that run your job flows. Job flows can also be launched within Amazon Virtual Private Cloud (Amazon VPC), allowing you to isolate your compute instances, specify the IP range you wish to use, and connect to your existing IT infrastructure using an industry-standard encrypted IPsec VPN.
Inexpensive — Amazon Elastic MapReduce passes on to you the financial benefits of Amazon’s scale. You pay a very low rate for the compute capacity you actually consume. Amazon Elastic MapReduce is optimized to save you money by monitoring progress of your job flows and turning off resources when a job flow is completed.
Multiple Locations — Amazon Elastic MapReduce uses geographically dispersed EC2 infrastructure and is currently available in the US East (Northern Virginia), US West (Oregon), US West (Northern California), EU (Ireland), Asia Pacific (Singapore), Asia Pacific (Tokyo), Asia Pacific (Sydney), South America (Sao Paulo), and AWS GovCloud (US) Regions.
Third Party Tools — Amazon Elastic MapReduce integrates with a wide array of third-party tools and solutions. For example, Karmasphere Analyst is a visual, desktop workspace for analyzing data on Amazon Elastic MapReduce. It provides graphical tools to perform SQL-based querying of structured and unstructured data and to visualize the results. Karmasphere Analyst is available with hourly pricing and no upfront fees or long-term commitments. Please visit the Elastic MapReduce with Karmasphere Analytics detail page to learn more.
To use Amazon Elastic MapReduce, you need to select the type and quantity of Amazon EC2 instances to include in your job flow. EMR supports On-Demand, Reserved, and Spot pricing options; if you have Reserved Instances they will be used first.
Standard — Instances of this family are well suited for most applications.
High-Memory — Instances of this family offer large memory sizes for high-throughput applications, including database and memory caching applications.
High-CPU — Instances of this family have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications.
Cluster Compute — Instances of this family combine large memory sizes and high CPU resources with 10 Gbps networking. They are well suited for high-performance, I/O-intensive applications, such as mapping genomes for scientific research, simulating aerospace and automotive designs for engineering activities, and mining data for business intelligence.
High I/O — Instances of this family are ideal for high-performance database applications such as HBase.
High Storage — Instances of this family are ideal for applications that require sequential access to very large data sets.
*EC2 Compute Unit (ECU) – One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.
With Elastic MapReduce (EMR) you can launch a persistent cluster that stays up indefinitely or a temporary cluster that terminates after the analysis is complete. EMR supports a variety of EC2 instance types (standard, high CPU, high memory, high I/O, etc.) and EC2 pricing options (On-Demand, Reserved, and Spot). When you launch an EMR cluster (also called a "job flow"), you choose how many and what type of Amazon EC2 Instances to provision. The EMR price is in addition to the EC2 price. EMR and EC2 charge by the hour, so you only pay for what you use.
You are charged from the time the job flow begins processing until it is terminated. Partial hours are rounded up.
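The hour-rounded, per-instance billing model described above can be illustrated with a short calculation. The function name and the rates here are illustrative placeholders, not actual prices:

```python
import math

def job_flow_cost(runtime_hours, instances, ec2_rate, emr_rate):
    # Both EC2 and EMR charge per instance-hour, and partial hours
    # are rounded up, so 2.2 hours bills as 3 full hours.
    billed_hours = math.ceil(runtime_hours)
    return billed_hours * instances * (ec2_rate + emr_rate)

# e.g. 10 instances for 2.2 hours at hypothetical hourly rates:
print(round(job_flow_cost(2.2, 10, ec2_rate=0.10, emr_rate=0.015), 2))  # 3.45
```

Note that the EMR rate is added on top of the EC2 rate for every instance-hour, as stated above.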
The Amazon EC2 prices above are for On-demand Instances. On-Demand Instances are the most expensive but give you the most flexibility. EC2 also offers Reserved Instances and Spot Instances.
"Amazon Elastic MapReduce with Spot Instances has made it easy to prototype and surprisingly cost-effective to scale, decreasing our data processing costs by over 50%." - VP of Engineering at Fliptop

To view more information and current prices for Reserved Instances and Spot Instances, see the Amazon EC2 pricing page.
EMR supports the MapR M3 and MapR M5 Hadoop Distributions. There is an additional charge for the MapR M5 Distribution. See the MapR detail page for more information and current prices.
Amazon S3 is billed separately. (Many customers store their input and output data in S3; others store all of the data locally on HDFS.) Currently it costs $668 per month to store 10 TB of data in S3 with reduced redundancy. The more data you store, the lower the monthly price per GB.
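Using the figure quoted above, the implied blended monthly rate works out to roughly 6.5 cents per GB (tiered pricing means larger volumes average out lower):

```python
monthly_cost = 668.0      # USD per month for 10 TB, reduced redundancy (figure above)
stored_gb = 10 * 1024     # 10 TB expressed in GB
per_gb = monthly_cost / stored_gb
print(round(per_gb, 3))  # 0.065
```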
Amazon SimpleDB is also billed separately; it is used only if you enable debugging for your job flow.
You can use the AWS Simple Monthly Calculator to estimate your bill.
Amazon Elastic MapReduce uses Apache Hadoop as its distributed processing engine. Hadoop is an open source Java software framework that supports data-intensive distributed applications running on large clusters of commodity hardware. Hadoop implements a computational model named “MapReduce,” in which the job is divided into many small fragments of work, each of which may be executed on any node in the cluster. This framework has been used by developers, enterprises, and startups and has proven to be a reliable software platform for processing up to petabytes of data on clusters of thousands of commodity machines.
Amazon Elastic MapReduce allows you to implement data processing applications in many languages, including Java, Perl, Ruby, Python, PHP, R, and C++. You can test these applications on different instance types and job flow sizes to pick the optimal performance settings for your specific case.
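Languages other than Java typically plug into Hadoop via Hadoop Streaming, which exchanges tab-separated key/value lines between the framework and your mapper and reducer processes. The sketch below simulates that protocol locally in Python (the function names are illustrative; in a real job flow the mapper and reducer would be separate scripts reading stdin and writing stdout):

```python
from itertools import groupby

def mapper(lines):
    # Streaming mapper: emit one "key\tvalue" line per record.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(pairs):
    # Streaming hands the reducer its input already sorted by key.
    rows = (p.split("\t") for p in pairs)
    for key, group in groupby(rows, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(v) for _, v in group)}"

# Simulate the framework locally: map, shuffle (sort), reduce.
mapped = sorted(mapper(["to be or not to be"]))
result = dict(line.split("\t") for line in reducer(mapped))
print(result["to"])  # 2
```

Because the protocol is just line-oriented text, the same structure works in Perl, Ruby, PHP, or any other language that can read and write standard streams.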
Log in to the AWS Management Console to start an Amazon Elastic MapReduce “job flow.” Simply choose the number and type of Amazon EC2 instances you want, specify the location of your data and/or application on Amazon S3, and then click the “Create Job Flow” button. Alternatively you can start a job flow by specifying the same information mentioned above via our Command Line Tools or APIs. Amazon Elastic MapReduce employs a simple web service interface that is easy to use and highly flexible:
If you wish to run job flows with more than 20 instances, please complete the instance request form.
You are only charged for the resources actually consumed. For example, let’s say you launched 100 Amazon EC2 Standard Small instances for an Amazon Elastic MapReduce job flow, where the Amazon Elastic MapReduce cost is an incremental $0.015 per hour. The Amazon EC2 instances will begin booting immediately, but they won’t necessarily all start at the same moment. Amazon Elastic MapReduce will track when each instance starts and will check it into the cluster so that it can accept processing tasks.
In the first 10 minutes after your launch request, Amazon Elastic MapReduce either starts your job flow (if all of your instances are available) or checks in as many instances as possible. Once the 10 minute mark has passed, Amazon Elastic MapReduce will start processing (and charging for) your job flow as soon as 90% of your requested instances are available. As the remaining 10% of your requested instances check in, Amazon Elastic MapReduce starts charging for those instances as well.
So, in the above example, if all 100 of your requested instances are available 10 minutes after you kick off a launch request, you’ll be charged $1.50 per hour (100 * $0.015) for as long as the job flow takes to complete. If only 90 of your requested instances were available at the 10 minute mark, you’d be charged $1.35 per hour (90 * $0.015) for as long as this was the number of instances running your job flow. When the remaining 10 instances checked in, you’d be charged $1.50 per hour (100 * $0.015) for as long as the balance of the job flow takes to complete. Each job flow will run until one of the following occurs: you terminate the job flow with the TerminateJobFlows API call (or an equivalent tool), the job flow shuts itself down, or the job flow is terminated due to software or hardware failure. Partial instance hours consumed are billed as full hours.
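The arithmetic in the worked example above can be checked directly, using the $0.015 incremental EMR rate quoted in the text (EC2 instance charges would be additional):

```python
EMR_RATE = 0.015  # USD per instance-hour, the rate from the example above

def hourly_emr_charge(checked_in_instances):
    # You are charged only for instances that have checked into the cluster.
    return checked_in_instances * EMR_RATE

print(round(hourly_emr_charge(100), 2))  # 1.5
print(round(hourly_emr_charge(90), 2))   # 1.35
```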