Amazon Elastic MapReduce (Amazon EMR) is a web service that makes it easy to quickly and cost-effectively process vast amounts of data.
Amazon EMR uses Hadoop, an open source framework, to distribute your data and processing across a resizable cluster of Amazon EC2 instances. Amazon EMR is used in a variety of applications, including log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics. Customers launch millions of Amazon EMR clusters every year.
- Razorfish increased return on ad spend by 500%.
- Yelp was able to save $55,000 in upfront hardware costs.
- So-net reduced the cost of ad hoc data analysis by 50%.
- Olaworks maximized the performance of image recognition.
- Fliptop scaled their lookup capacity to over 1 million contacts per day.
- Bankinter runs simulations not previously possible.
- Getty Images reduced the time to rank images by 300%.
- Unilever processes genetic sequences 20 times faster.
Easy to Use
You can launch an Amazon EMR cluster in minutes. You don’t need to worry about node provisioning, cluster setup, Hadoop configuration, or cluster tuning. Amazon EMR takes care of these tasks so you can focus on analysis.
With Amazon EMR, you can provision one, hundreds, or thousands of compute instances to process data at any scale. You can easily increase or decrease the number of instances and you only pay for what you use.
You can launch a 10-node Hadoop cluster for as little as $0.15 per hour. Because Amazon EMR has native support for Amazon EC2 Spot and Reserved Instances, you can also save 50-80% on the cost of the underlying instances.
You can spend less time tuning and monitoring your cluster. Amazon EMR has tuned Hadoop for the cloud; it also monitors your cluster, retrying failed tasks and automatically replacing poorly performing instances.
Amazon EMR automatically configures Amazon EC2 firewall settings that control network access to instances, and you can launch clusters in an Amazon Virtual Private Cloud (VPC), a logically isolated network you define.
You have complete control over your cluster. You have root access to every instance, you can easily install additional applications, and you can customize every cluster. Amazon EMR also supports multiple Hadoop distributions and applications.
Amazon EMR can be used to analyze click stream data in order to segment users and understand user preferences. Advertisers can also analyze click streams and advertising impression logs to deliver more effective ads.
Amazon EMR can be used to process vast amounts of genomic data and other large scientific data sets quickly and efficiently. Researchers can access genomic data hosted for free on AWS.
Amazon EMR can be used to process logs generated by web and mobile applications. Amazon EMR helps customers turn petabytes of unstructured or semi-structured data into useful insights about their applications or users.
- Learn how Razorfish uses EMR for click stream analysis
- Read about the 1000 Genomes Project and AWS
- Learn how Yelp uses EMR to drive key website features
To use Amazon EMR, you simply:

1. Develop your data processing application.
2. Upload your application and data to Amazon S3.
3. Configure and launch your cluster.
4. Retrieve the output from Amazon S3.
Are you ready to launch your first cluster? Click here to view the Getting Started Tutorial. In the tutorial you will create a cluster that will count the frequency of words in a sample text file. In just a few minutes your cluster will be up and running.
Amazon EMR enables you to quickly and easily provision as much capacity as you need and add or remove capacity at any time. This is very useful if you have variable or unpredictable processing requirements. For example, if the bulk of your processing occurs at night, you might need 100 instances during the day and 500 instances at night. Alternatively, you might need a significant amount of capacity for a short period of time. With Amazon EMR you can quickly provision hundreds or thousands of instances, and shut them down when your job is complete (to avoid paying for idle capacity).
There are two main options for adding or removing capacity:
Deploy multiple clusters: If you need more capacity, you can easily launch a new cluster and terminate it when you no longer need it. There is no limit to how many clusters you can have. You may want to use multiple clusters if you have multiple users or applications. For example, you can store your input data in Amazon S3 and launch one cluster for each application that needs to process the data. One cluster might be optimized for CPU, a second cluster might be optimized for storage, etc.
Resize a running cluster: With Amazon EMR it is easy to resize a running cluster. You may want to resize a cluster if you are storing your data in HDFS and you want to temporarily add more processing power. For example, some customers add hundreds of instances to their clusters when their batch processing occurs, and remove the extra instances when processing completes.
Amazon EMR is designed to reduce the cost of processing large amounts of data. Some of the features that make it low cost include low hourly pricing, Amazon EC2 Spot integration, Amazon EC2 Reserved Instance integration, elasticity, and Amazon S3 integration.
Low Hourly Pricing: Amazon EMR pricing is per instance hour and starts at $0.015 per instance hour for a small instance ($131.40 per year). See the pricing section for more detail.
Amazon EC2 Spot Integration: Amazon EC2 Spot Instances allow you to name your own price for Amazon EC2 capacity. You simply specify the maximum hourly price that you are willing to pay to run a particular instance type. As long as your bid price exceeds the Spot market price, you will keep the instances and typically pay a fraction of the On-Demand price. The Spot Price fluctuates based on supply and demand for instances, but you will never pay more than the maximum price you specified. Amazon EMR makes it easy to use Spot Instances so you can save both time and money. Amazon EMR clusters include "core nodes" that run HDFS and "task nodes" that do not; task nodes are ideal for Spot because if the Spot price increases and you lose those instances you will not lose data stored in HDFS. (Learn more about core and task nodes.)
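As a sketch of how this looked with the elastic-mapreduce command line client of this era, a cluster could mix On-Demand master and core nodes with a Spot task group; the instance types, counts, and bid price below are illustrative assumptions, not recommendations:

```shell
# Hypothetical invocation: On-Demand master and core nodes, plus a
# task-node group bid onto the Spot market at $0.01 per instance hour.
elastic-mapreduce --create --alive --name "Spot task node example" \
  --instance-group master --instance-type m1.small --instance-count 1 \
  --instance-group core   --instance-type m1.small --instance-count 4 \
  --instance-group task   --instance-type m1.small --instance-count 10 \
  --bid-price 0.01
```

Because only the task group runs on Spot, losing those instances to a rising Spot price slows the job down but does not lose any HDFS data.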
Amazon EC2 Reserved Instance Integration: Amazon EC2 Reserved Instances enable you to maintain the benefits of elastic computing while lowering costs and reserving capacity. With Reserved Instances you pay a low, one-time fee and in turn receive a significant discount on the hourly charge for that instance. Amazon EMR makes it easy to utilize Reserved Instances so you can save up to 65% off the On-Demand price.
Elasticity: Because Amazon EMR makes it easy to add and remove capacity, you don’t need to provision excess capacity. For example, you may not know how much data your cluster(s) will be handling in 6 months, or you may have spiky processing needs. With Amazon EMR you don't need to guess your future requirements or provision for peak demand because you can easily add/remove capacity at any time.
Amazon S3 Integration: Amazon EMR makes it possible to efficiently process data in Amazon S3, so you can store your data in Amazon S3 and use multiple Amazon EMR clusters to process the same data set. Each cluster can be optimized for a particular workload, which can be more efficient than a single cluster serving multiple workloads with different requirements. For example, you might have one cluster that is optimized for I/O and another that is optimized for CPU, each processing the same data set in Amazon S3. In addition, by storing your input and output data in Amazon S3, you can shut down clusters when they are no longer needed.
With Amazon EMR, you can leverage multiple data stores, including Amazon S3, the Hadoop Distributed File System (HDFS), and Amazon DynamoDB.
Amazon S3: Amazon S3 is Amazon’s highly durable, scalable, secure, fast, and inexpensive storage service. Amazon EMR has made numerous improvements to Hadoop so you can seamlessly process large amounts of data stored in Amazon S3. When you launch your cluster, Amazon EMR streams the data from Amazon S3 to each instance in your cluster and begins processing it immediately. One advantage of storing your data in Amazon S3 and processing it with Amazon EMR is you can use multiple clusters to process the same data. For example, you might have a Hive development cluster that is optimized for memory and CPU and an HBase production cluster that is optimized for I/O.
Hadoop Distributed File System (HDFS): HDFS is the Hadoop file system. In Amazon EMR, HDFS uses local ephemeral storage. Depending on the instance type, this could be spinning disks or solid state drives. Every instance in your cluster has local ephemeral storage, but you decide which instances run HDFS. Amazon EMR refers to instances running HDFS as ‘core nodes’ and instances not running HDFS as ‘task nodes’.
Amazon DynamoDB: Amazon DynamoDB is a fast, fully managed NoSQL database service. Amazon EMR has direct integration with Amazon DynamoDB so you can quickly and efficiently process data stored in Amazon DynamoDB and transfer data between Amazon DynamoDB, Amazon S3, and HDFS in Amazon EMR.
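One common pattern is to expose a DynamoDB table to Hive on the cluster and export its contents to Amazon S3. The sketch below uses the DynamoDB storage handler that EMR's Hive ships with; the table name, columns, column mapping, and S3 path are placeholders:

```sql
-- Hypothetical Hive table backed by a DynamoDB table named "Orders".
CREATE EXTERNAL TABLE orders_ddb (
  order_id STRING,
  total    DOUBLE
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name"     = "Orders",
  "dynamodb.column.mapping" = "order_id:OrderId,total:Total"
);

-- Copy the DynamoDB data into S3 via Hive (path is a placeholder):
INSERT OVERWRITE DIRECTORY 's3://my-bucket/orders-export/'
SELECT * FROM orders_ddb;
```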
Other AWS Data Stores: Amazon EMR customers also use Amazon Relational Database Service (a web service that makes it easy to set up, operate, and scale a relational database in the cloud), Amazon Glacier (an extremely low-cost storage service that provides secure and durable storage for data archiving and backup), and Amazon Redshift (a fast, fully managed, petabyte-scale data warehouse service). AWS Data Pipeline is a web service that helps customers reliably process and move data between different AWS compute and storage services (including Amazon EMR) as well as on-premises data sources at specified intervals.
EMR supports powerful and proven Hadoop tools such as Hive, Pig, and HBase.
Hive is an open source data warehouse and analytics package that runs on top of Hadoop. Hive uses HiveQL, a SQL-based language that allows users to structure, summarize, and query data. HiveQL goes beyond standard SQL, adding first-class support for map/reduce functions and complex extensible user-defined data types like JSON and Thrift. This capability allows processing of complex and unstructured data sources such as text documents and log files. Hive allows user extensions via user-defined functions written in Java. Amazon EMR has made numerous improvements to Hive, including direct integration with Amazon DynamoDB and Amazon S3. For example, with Amazon EMR you can load table partitions automatically from Amazon S3, you can write data to tables in Amazon S3 without using temporary files, and you can access resources in Amazon S3 such as scripts for custom map/reduce operations and additional libraries. Learn more about Hive and EMR.
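As a sketch of the S3 integration described above, a Hive table can point directly at data in Amazon S3, and EMR's Hive extension can pick up partitions that already exist in the bucket; the bucket name, columns, and partition layout below are assumptions for illustration:

```sql
-- Hypothetical external table over tab-delimited log data in S3.
CREATE EXTERNAL TABLE impressions (
  user_id STRING,
  ad_id   STRING,
  clicked BOOLEAN
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/impressions/';

-- EMR's Hive extension discovers partitions already present in S3:
ALTER TABLE impressions RECOVER PARTITIONS;
```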
Pig is an open source analytics package that runs on top of Hadoop. Pig uses Pig Latin, a SQL-like language that allows users to structure, summarize, and query data. As well as SQL-like operations, Pig Latin also adds first-class support for map/reduce functions and complex extensible user-defined data types. This capability allows processing of complex and unstructured data sources such as text documents and log files. Pig allows user extensions via user-defined functions written in Java. Amazon EMR has made numerous improvements to Pig, including the ability to use multiple file systems (normally Pig can only access one remote file system), the ability to load custom JARs and scripts from Amazon S3 (e.g. “REGISTER s3:///my-bucket/piggybank.jar”), and additional functionality for String and DateTime processing. Learn more about Pig and EMR.
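A minimal Pig Latin script using the S3 features above might look like the following sketch; the bucket, paths, and record layout are placeholders:

```pig
-- Hypothetical script: count page hits in tab-delimited logs stored in S3.
REGISTER s3://my-bucket/piggybank.jar;

logs   = LOAD 's3://my-bucket/logs/' USING PigStorage('\t')
         AS (ip:chararray, url:chararray);
by_url = GROUP logs BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(logs) AS hits;
STORE counts INTO 's3://my-bucket/output/';
```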
HBase is an open source, non-relational, distributed database modeled after Google's BigTable. It was developed as part of Apache Software Foundation's Hadoop project and runs on top of the Hadoop Distributed File System (HDFS) to provide BigTable-like capabilities for Hadoop. HBase provides a fault-tolerant, efficient way of storing large quantities of sparse data using column-based compression and storage. In addition, HBase provides fast lookup of data because recently accessed data is cached in memory. HBase is optimized for sequential write operations, and it is highly efficient for batch inserts, updates, and deletes. HBase works seamlessly with Hadoop, sharing its file system and serving as a direct input and output to Hadoop jobs. HBase also integrates with Apache Hive, enabling SQL-like queries over HBase tables, joins with Hive-based tables, and support for Java Database Connectivity (JDBC). With Amazon EMR you can back up HBase to Amazon S3 (full or incremental, manual or automated) and you can restore from a previously created backup. Learn more about HBase and EMR.
Other: Amazon EMR also supports a variety of other popular applications and tools, such as R, Mahout (machine learning), Ganglia (monitoring), Spark (in-memory MapReduce), Shark (data warehouse on Spark), Accumulo (secure NoSQL database), Sqoop (relational database connector), HCatalog (table and storage management), and more.
Use the MapR Distribution: MapR delivers on the promise of Hadoop with a proven, enterprise-grade platform that supports a broad set of mission-critical and real-time production uses. MapR brings unprecedented dependability, ease-of-use and world-record speed to Hadoop, NoSQL, database and streaming applications in one unified Big Data platform. Learn more about using MapR on EMR.
Tune Your Cluster: You choose what types of EC2 instances to provision in your cluster (standard, high memory, high CPU, high I/O, etc.) based on your application’s requirements. You have root access to every instance and you can fully customize your cluster to suit your requirements. Learn more about supported EC2 Instance Types.
Debug Your Applications: When you enable debugging on a cluster, Amazon EMR archives the log files to Amazon S3 and then indexes those files. You can then use a graphical interface to browse the logs in an intuitive way. Learn more about debugging EMR jobs.
Monitor Your Cluster: You can use Amazon CloudWatch to monitor 23 custom Amazon EMR metrics, such as the average number of running map and reduce tasks. You can also set alarms on these metrics. Learn more about monitoring EMR clusters.
Schedule Recurring Workflows: You can use AWS Data Pipeline to schedule recurring workflows involving Amazon EMR. AWS Data Pipeline is a web service that helps you reliably process and move data between different AWS compute and storage services as well as on-premises data sources at specified intervals. Learn more about EMR and Data Pipeline.
Cascading: Cascading is an open-source Java library that provides a query API, a query planner, and a job scheduler for creating and running Hadoop MapReduce applications. Applications developed with Cascading are compiled and packaged into standard Hadoop-compatible JAR files similar to other native Hadoop applications. Learn more about Cascading and EMR.
Control Network Access to Your Cluster: You can launch your cluster in an Amazon Virtual Private Cloud (VPC), a logically isolated section of the AWS cloud. You have complete control over your virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways. Learn more about EMR and Amazon VPC.
Manage Users and Permissions: You can use AWS Identity & Access Management (IAM) tools such as IAM Users and Roles to control access and permissions. For example, you could give certain users read but not write access to your clusters. Learn more about controlling access to your cluster.
Install Additional Software: You can use bootstrap actions to install additional software and to change the configuration of applications on the cluster. Bootstrap actions are scripts that are run on the cluster nodes when Amazon EMR launches the cluster. They run before Hadoop starts and before the node begins processing data. You can write custom bootstrap actions, or use predefined bootstrap actions provided by Amazon EMR. Learn more about EMR Bootstrap Actions.
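A bootstrap action is simply a script stored in Amazon S3 and referenced when the cluster is created. A minimal custom sketch (the package installed is a placeholder, not something the cluster needs) might look like:

```shell
#!/bin/bash
# Hypothetical bootstrap action: runs on every node before Hadoop
# starts and before the node begins processing data.
set -e
sudo yum install -y ganglia-gmond   # placeholder package for illustration
```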
Efficiently Copy Data: You can quickly move large amounts of data from Amazon S3 to HDFS, from HDFS to Amazon S3, and between Amazon S3 buckets using Amazon EMR’s S3DistCp, an extension of the open source tool Distcp, which uses MapReduce to efficiently move large amounts of data. Learn more about S3DistCp.
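Run as a step on the cluster, an S3DistCp invocation might look like the following sketch; the JAR path reflects EMR clusters of this era, and the bucket names and regular expression are assumptions:

```shell
# Hypothetical step: copy logs from S3 into HDFS, grouping small files
# by the year-month captured in their names.
hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --src  s3://my-bucket/logs/ \
  --dest hdfs:///local-logs/ \
  --groupBy '.*([0-9]{4}-[0-9]{2}).*'
```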
Hadoop Streaming: Hadoop Streaming is a utility that comes with Hadoop that enables you to develop MapReduce executables in languages other than Java. Streaming is implemented in the form of a JAR file. Learn more about Hadoop Streaming with EMR.
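To illustrate the idea, here is a minimal word-count map/reduce pair in Python of the kind Hadoop Streaming runs. In a real Streaming job the mapper and reducer would be separate scripts reading lines from stdin and writing tab-separated pairs to stdout; this sketch simulates the map, shuffle/sort, and reduce phases locally:

```python
# Minimal Hadoop Streaming-style word count, simulated in-process.

def mapper(lines):
    # Map phase: emit a ("word", 1) pair for every token.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce phase: sum the counts for each word. Streaming delivers
    # pairs sorted by key; a dict handles them just as well locally.
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

if __name__ == "__main__":
    text = ["the quick brown fox", "the lazy dog"]
    print(reducer(sorted(mapper(text))))  # word frequencies for the sample
```

The sort between the two phases stands in for Hadoop's shuffle, which groups all values for a key before they reach the reducer.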
Custom Jar: Write a Java program, compile against the version of Hadoop you want to use, and upload to Amazon S3. You can then submit Hadoop jobs to the cluster using the Hadoop JobClient interface. Learn more about Custom Jar processing with EMR.
With Amazon EMR you can launch a persistent cluster that stays up indefinitely or a temporary cluster that terminates after the analysis is complete. Amazon EMR supports a variety of Amazon EC2 instance types (standard, high CPU, high memory, high I/O, etc.) and Amazon EC2 pricing options (On-Demand, Reserved, and Spot). When you launch an Amazon EMR cluster (also called a "job flow"), you choose how many and what type of Amazon EC2 Instances to provision. The Amazon EMR price is in addition to the Amazon EC2 price.
You can estimate your bill using the AWS Simple Monthly Calculator.
You are charged from the time the cluster begins processing until it is terminated. Partial hours are rounded up.
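The billing rule above can be made concrete with a small calculation. The $0.015 EMR charge is the small-instance figure quoted on this page; the EC2 On-Demand rate is an assumption for illustration, not a price quote:

```python
import math

# Sketch of EMR's hourly billing: every instance-hour is billed in full,
# and partial hours are rounded up.
EC2_SMALL_PER_HOUR = 0.06   # assumed EC2 On-Demand rate, for illustration
EMR_SMALL_PER_HOUR = 0.015  # EMR charge per small-instance hour (this page)

def cluster_cost(instances, runtime_hours):
    billed_hours = math.ceil(runtime_hours)  # partial hours round up
    return instances * billed_hours * (EC2_SMALL_PER_HOUR + EMR_SMALL_PER_HOUR)

print(cluster_cost(10, 2.5))  # 10 instances billed for 3 full hours
```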
The Amazon EC2 prices above are for On-Demand Instances. On-Demand Instances are the most expensive but give you the most flexibility. EC2 also offers Reserved Instances and Spot Instances.
"Amazon Elastic MapReduce with Spot Instances has made it easy to prototype and surprisingly cost-effective to scale, decreasing our data processing costs by over 50%." - VP of Engineering at FliptopTo view more information and current prices for Reserved Instances and Spot Instances, see the Amazon EC2 pricing page.
Instances of this family are well suited for most applications.
Instances of this family offer large memory sizes for high throughput applications, including database and memory caching applications.
Instances of this family have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications.
Instances of this family combine large memory sizes and high CPU resources with 10 Gbps networking. They are well-suited for high performance, I/O intensive applications, such as mapping genomes for scientific research, simulating aerospace and automotive designs for engineering activities, and mining data for business intelligence.
High I/O instances are ideal for high performance database applications such as HBase.
High Storage instances are ideal for applications that require sequential access to very large data sets.
*EC2 Compute Unit (ECU) – One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.
Are you ready to launch your first cluster? Click here to create a cluster that will count the frequency of words in a sample text file. In just a few minutes your cluster will be up and running.
Note: The videos below show the old EMR Management Console. Check out the Documentation if you are unsure how to do something in the new EMR Management Console.

Amazon EMR 101 (11 Videos)
If you are planning to process more than 1 TB per day you may be eligible for EMR Bootcamp, an onsite proof-of-concept and knowledge transfer workshop with an AWS Solutions Architect who specializes in EMR. To learn more, click here or contact us.
Scale Unlimited offers customized on-site training for companies that need to quickly learn how to use EMR and other big data technologies. To find out more, click here.
Detailed Amazon EMR Documentation: