Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
Using Amazon Elastic MapReduce, you can instantly provision as much or as little capacity as you like to perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit.
Amazon Elastic MapReduce automatically spins up a Hadoop implementation of the MapReduce framework on Amazon EC2 instances, sub-dividing the data in a job flow into smaller chunks so that they can be processed (the “map” function) in parallel, and eventually recombining the processed data into the final solution (the “reduce” function). Amazon S3 serves as the source for the data being analyzed, and as the output destination for the end results.
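The split, map, and reduce flow described above can be illustrated with a small local sketch. This is not how Hadoop or Amazon Elastic MapReduce is implemented; it only mimics the idea on one machine, using threads in place of cluster nodes. The function names, the chunk size, and the sample log lines are all invented for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Sub-divide the input into smaller chunks that can be processed independently."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(lines):
    """The "map" step: count the words in one chunk, independently of all others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """The "reduce" step: merge two partial results into one."""
    return left + right


def word_count(records, chunk_size=2, workers=4):
    """Run the map step on every chunk in parallel, then recombine the results."""
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    # Hypothetical sample input standing in for data stored in Amazon S3.
    log_lines = [
        "GET /index.html",
        "GET /about.html",
        "POST /login",
        "GET /index.html",
        "GET /index.html",
    ]
    totals = word_count(log_lines)
    print(totals.most_common(3))
```

Because each chunk is mapped without reference to any other chunk, the same result is obtained regardless of how many workers run or in which order the chunks finish.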
To use Amazon Elastic MapReduce, you simply:
Elastic — Amazon Elastic MapReduce enables you to use as many or as few compute instances running Hadoop as you want. You can commission one, hundreds, or even thousands of instances to process gigabytes, terabytes, or even petabytes of data. And, you can run as many job flows concurrently as you wish. You can instantly spin up large Hadoop job flows which will start processing within minutes, not hours or days. When your job flow completes, unless you specify otherwise, the service automatically tears down your instances.
Easy to use — You don’t need to worry about setting up, running, or tuning the performance of Hadoop clusters; instead, you can concentrate on data analysis. We provide easy-to-use tools and sample data processing applications that let you get up and running without writing a single line of code. Once you start a job flow, Amazon Elastic MapReduce handles Amazon EC2 instance provisioning, security settings, Hadoop configuration and set-up, log collection, health monitoring, and other hardware-related complexities such as automatically removing faulty instances from your running job flow.
Reliable — Amazon Elastic MapReduce is built on Amazon’s highly reliable infrastructure, and Hadoop’s performance has been tuned specifically for that environment. The service also monitors your job flow execution, retrying failed tasks and shutting down problematic instances.
Seamlessly integrated with other AWS services — Amazon Elastic MapReduce is designed to integrate easily with other AWS services such as Amazon S3 and EC2, providing the infrastructure for data processing applications. The service runs job flows in Amazon EC2 and stores input and output data in Amazon S3.
Secure — Amazon Elastic MapReduce automatically configures Amazon EC2 firewall settings that control network access to and between instances that run your job flows.
Inexpensive — Amazon Elastic MapReduce passes on to you the financial benefits of Amazon’s scale. You pay a very low rate for the compute capacity you actually consume. Amazon Elastic MapReduce is optimized to save you money by monitoring progress of your job flows and turning off resources when a job flow is completed.
Multiple Locations — Amazon Elastic MapReduce uses geographically dispersed EC2 infrastructure and is currently available in the US-East (Northern Virginia), US-West (Northern California), and EU (Ireland) Regions.
Third Party Tools — Amazon Elastic MapReduce is supported by Karmasphere Studio for Hadoop, a NetBeans-based integrated development environment (IDE) that makes it easy to develop, debug, and deploy job flows from your desktop directly to Amazon Elastic MapReduce. See Karmasphere Studio for Hadoop for more details on this IDE.
To use Amazon Elastic MapReduce, you first select the type and quantity of Amazon EC2 instances you want. Amazon Elastic MapReduce works with any Amazon EC2 Linux/Unix instance type. It supports both On-Demand and Reserved Instances; if you have Reserved Instances, your job flows will use them first.
Standard Instances
Instances of this family are well suited for most applications.
High-Memory Instances
Instances of this family offer large memory sizes for high throughput applications, including database and memory caching applications.
High-CPU Instances
Instances of this family have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications.
EC2 Compute Unit (ECU) – One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.
Amazon Elastic MapReduce is currently available in the US and EU Regions. You pay only for what you use; there is no minimum fee. Amazon Elastic MapReduce pricing is in addition to normal Amazon EC2 and Amazon S3 pricing.
| Standard On-Demand Instances | Amazon EC2 Price per hour (On-Demand Instances) | Amazon Elastic MapReduce Price per hour |
|---|---|---|
| Small (Default) | $0.085 per hour | $0.015 per hour |
| Large | $0.34 per hour | $0.06 per hour |
| Extra Large | $0.68 per hour | $0.12 per hour |
| High-Memory On-Demand Instances | | |
| Extra Large | $0.50 per hour | $0.09 per hour |
| Double Extra Large | $1.20 per hour | $0.21 per hour |
| Quadruple Extra Large | $2.40 per hour | $0.42 per hour |
| High-CPU On-Demand Instances | | |
| Medium | $0.17 per hour | $0.03 per hour |
| Extra Large | $0.68 per hour | $0.12 per hour |
| Standard On-Demand Instances | Amazon EC2 Price per hour (On-Demand Instances) | Amazon Elastic MapReduce Price per hour |
|---|---|---|
| Small (Default) | $0.095 per hour | $0.015 per hour |
| Large | $0.38 per hour | $0.06 per hour |
| Extra Large | $0.76 per hour | $0.12 per hour |
| High-Memory On-Demand Instances | | |
| Extra Large | $0.57 per hour | $0.09 per hour |
| Double Extra Large | $1.58 per hour | $0.21 per hour |
| Quadruple Extra Large | $3.16 per hour | $0.42 per hour |
| High-CPU On-Demand Instances | | |
| Medium | $0.19 per hour | $0.03 per hour |
| Extra Large | $0.76 per hour | $0.12 per hour |
| Standard On-Demand Instances | Amazon EC2 Price per hour (On-Demand Instances) | Amazon Elastic MapReduce Price per hour |
|---|---|---|
| Small (Default) | $0.095 per hour | $0.015 per hour |
| Large | $0.38 per hour | $0.06 per hour |
| Extra Large | $0.76 per hour | $0.12 per hour |
| High-Memory On-Demand Instances | | |
| Extra Large | $0.57 per hour | $0.09 per hour |
| Double Extra Large | $1.58 per hour | $0.21 per hour |
| Quadruple Extra Large | $3.16 per hour | $0.42 per hour |
| High-CPU On-Demand Instances | | |
| Medium | $0.19 per hour | $0.03 per hour |
| Extra Large | $0.76 per hour | $0.12 per hour |
Amazon EC2, Amazon S3, and Amazon SimpleDB charges are billed separately. Pricing for Amazon Elastic MapReduce is per instance-hour consumed for each instance type, from the time the job flow begins processing until it is terminated. Each partial instance-hour consumed is billed as a full hour. For additional details, see Amazon EC2 Instance Types, Amazon EC2 Reserved Instances Pricing, Amazon S3 Pricing, and Amazon SimpleDB Pricing.
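As a rough illustration of the per-instance-hour rule, the sketch below estimates the cost of a job flow from the rates in the first pricing table above. It makes several simplifying assumptions that the text does not state: every instance runs for the whole job flow, Amazon EC2 bills the same rounded-up hours, and Amazon S3 and Amazon SimpleDB charges are ignored. The dictionary keys and the function name are made up for this example.

```python
import math
from decimal import Decimal

# (EC2 rate, Elastic MapReduce rate) in USD per instance-hour, copied from the
# first pricing table above. The keys are hypothetical labels, not API names.
PRICES = {
    "standard-small": (Decimal("0.085"), Decimal("0.015")),
    "standard-large": (Decimal("0.34"), Decimal("0.06")),
    "standard-xlarge": (Decimal("0.68"), Decimal("0.12")),
    "high-memory-xlarge": (Decimal("0.50"), Decimal("0.09")),
    "high-cpu-medium": (Decimal("0.17"), Decimal("0.03")),
}


def estimate_job_flow_cost(instance_type, instance_count, runtime_hours):
    """Estimate EC2 and Elastic MapReduce charges for one job flow.

    Partial instance-hours are rounded up to full hours before pricing.
    """
    ec2_rate, emr_rate = PRICES[instance_type]
    billed_hours = math.ceil(runtime_hours)
    ec2_charge = ec2_rate * instance_count * billed_hours
    emr_charge = emr_rate * instance_count * billed_hours
    return {
        "billed_hours": billed_hours,
        "ec2": ec2_charge,
        "emr": emr_charge,
        "total": ec2_charge + emr_charge,
    }


if __name__ == "__main__":
    # Ten Large instances running for two and a half hours are billed for three.
    print(estimate_job_flow_cost("standard-large", 10, 2.5))
```

Decimal is used instead of float so that sums of prices such as $0.015 stay exact.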
Amazon Elastic MapReduce uses Apache Hadoop as its distributed processing engine. Hadoop is an open source Java software framework that supports data-intensive distributed applications running on large clusters of commodity hardware. Hadoop implements a computational model named “MapReduce,” in which the job is divided into many small fragments of work, each of which may be executed on any node in the cluster. This framework has been used by developers, enterprises, and startups and has proven to be a reliable software platform for processing up to petabytes of data on clusters of thousands of commodity machines.
Amazon Elastic MapReduce allows you to implement data processing applications in many languages, including Java, Perl, Ruby, Python, PHP, R, and C++. You can test these applications on different instance types and job flow sizes to pick the optimal performance settings for your specific case.
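The text does not say how non-Java programs are plugged into a job flow. One common approach in the Hadoop ecosystem is the Hadoop Streaming convention, in which the mapper and reducer are ordinary programs that read lines and write tab-separated key/value lines, and the reducer receives its input sorted by key. The sketch below assumes that convention and simulates it locally; the function names are illustrative only.

```python
from itertools import groupby


def mapper(lines):
    """Emit one "word<TAB>1" line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts per key. The input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Simulate map -> sort by key -> reduce on a single machine."""
    mapped = sorted(mapper(lines), key=lambda line: line.split("\t", 1)[0])
    return list(reducer(mapped))


if __name__ == "__main__":
    for row in run_locally(["to be or not", "to be"]):
        print(row)
```

In a real streaming job the two functions would typically live in separate scripts that read from standard input and print to standard output, with the framework performing the sort between them.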
Log in to the AWS Management Console to start an Amazon Elastic MapReduce “job flow.” Simply choose the number and type of Amazon EC2 instances you want, specify the location of your data and/or application on Amazon S3, and then click the “Create Job Flow” button. Alternatively, you can start a job flow by specifying the same information via our Command Line Tools or APIs. Amazon Elastic MapReduce employs a simple web service interface that is easy to use and highly flexible.
If you wish to run job flows with more than 20 instances, please complete the instance request form.
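The information a job flow request carries (instance type, instance count, and the Amazon S3 location of the application and data) can be sketched as a plain data structure. The example below only builds and checks a dictionary; it does not call AWS. The field names are modelled on the RunJobFlow request as I understand it and should be verified against the API reference, and the instance type string, bucket, and jar path are hypothetical.

```python
MAX_INSTANCES_WITHOUT_REQUEST = 20  # limit stated in the text above


def build_job_flow_request(name, instance_type, instance_count, jar_uri, args, log_uri=None):
    """Assemble a job flow description; nothing is sent to AWS."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    request = {
        "Name": name,
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
        },
        "Steps": [
            {
                "Name": f"{name}-step",
                "ActionOnFailure": "TERMINATE_JOB_FLOW",
                "HadoopJarStep": {"Jar": jar_uri, "Args": list(args)},
            }
        ],
    }
    if log_uri:
        request["LogUri"] = log_uri
    return request


def needs_instance_request_form(request):
    """True when the job flow asks for more than 20 instances."""
    return request["Instances"]["InstanceCount"] > MAX_INSTANCES_WITHOUT_REQUEST


if __name__ == "__main__":
    # All S3 locations below are placeholders.
    request = build_job_flow_request(
        name="log-analysis",
        instance_type="m1.small",
        instance_count=25,
        jar_uri="s3://example-bucket/jobs/wordcount.jar",
        args=["s3://example-bucket/input/", "s3://example-bucket/output/"],
    )
    print(needs_instance_request_form(request))
```

Keeping the request as plain data makes it easy to validate limits such as the 20-instance threshold before anything is submitted.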
You are only charged for the resources actually consumed. For example, let’s say you launched 100 Amazon EC2 Standard Small instances for an Amazon Elastic MapReduce job flow, where the Amazon Elastic MapReduce cost is an incremental $0.015 per hour. The Amazon EC2 instances will begin booting immediately, but they won’t necessarily all start at the same moment. Amazon Elastic MapReduce will track when each instance starts and will check it into the cluster so that it can accept processing tasks.
In the first 10 minutes after your launch request, Amazon Elastic MapReduce either starts your job flow (if all of your instances are available) or checks in as many instances as possible. Once the 10 minute mark has passed, Amazon Elastic MapReduce will start processing (and charging for) your job flow as soon as 90% of your requested instances are available. As the remaining 10% of your requested instances check in, Amazon Elastic MapReduce starts charging for those instances as well.
So, in the above example, if all 100 of your requested instances are available 10 minutes after you kick off a launch request, you’ll be charged $1.50 per hour (100 * $0.015) for as long as the job flow takes to complete. If only 90 of your requested instances are available at the 10 minute mark, you’ll be charged $1.35 per hour (90 * $0.015) for as long as this is the number of instances running your job flow. When the remaining 10 instances check in, you’ll be charged $1.50 per hour (100 * $0.015) for the balance of the job flow.
Each job flow will run until one of the following occurs: you terminate the job flow with the TerminateJobFlows API call (or an equivalent tool), the job flow shuts itself down, or the job flow is terminated due to software or hardware failure. Partial instance hours consumed are billed as full hours.
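The start-up rule and the hourly figures in the example above can be checked with a short calculation. The sketch below encodes one reading of the rule: processing starts as soon as every instance has checked in during the first 10 minutes, and otherwise at the 10 minute mark or whenever 90% have checked in, whichever is later. The function names and the sample check-in times are invented for illustration.

```python
import math
from decimal import Decimal

EMR_RATE_SMALL = Decimal("0.015")  # USD per Standard Small instance-hour
GRACE_MINUTES = 10
START_FRACTION = Decimal("0.9")


def processing_start_minute(check_in_minutes, requested):
    """Return the minute at which processing (and charging) starts.

    Returns None if too few instances ever check in to reach 90%.
    """
    times = sorted(check_in_minutes)
    if len(times) == requested and times[-1] <= GRACE_MINUTES:
        return times[-1]  # every instance arrived within the first 10 minutes
    needed = math.ceil(START_FRACTION * requested)
    if len(times) < needed:
        return None
    return max(GRACE_MINUTES, times[needed - 1])


def hourly_charge(instances_checked_in, rate=EMR_RATE_SMALL):
    """Elastic MapReduce charge per hour for the instances currently checked in."""
    return instances_checked_in * rate


if __name__ == "__main__":
    # Hypothetical launch: 90 instances ready after 2 minutes, 10 after 25 minutes.
    check_ins = [2] * 90 + [25] * 10
    print(processing_start_minute(check_ins, requested=100))
    print(hourly_charge(90), hourly_charge(100))
```

The two hourly rates printed correspond to the 90-instance and 100-instance phases of the worked example; Amazon EC2 charges would be added on top.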
Your use of this service is subject to the Amazon Web Services Customer Agreement.