Amazon Elastic MapReduce (Amazon EMR) is a web service that makes it easy to quickly and cost-effectively process vast amounts of data.
Amazon EMR uses Hadoop, an open source framework, to distribute your data and processing across a resizable cluster of Amazon EC2 instances. Amazon EMR is used in a variety of applications, including log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics. Customers launch millions of Amazon EMR clusters every year.
You can launch an Amazon EMR cluster in minutes. You don’t need to worry about node provisioning, cluster setup, Hadoop configuration, or cluster tuning. Amazon EMR takes care of these tasks so you can focus on analysis.
With Amazon EMR, you can provision one, hundreds, or thousands of compute instances to process data at any scale. You can easily increase or decrease the number of instances and you only pay for what you use.
You can launch a 10-node Hadoop cluster for as little as $0.15 per hour. Because Amazon EMR has native support for Amazon EC2 Spot and Reserved Instances, you can also save 50-80% on the cost of the underlying instances.
Amazon EMR automatically configures Amazon EC2 firewall settings that control network access to instances, and you can launch clusters in an Amazon Virtual Private Cloud (VPC), a logically isolated network you define.
You have complete control over your cluster. You have root access to every instance, you can easily install additional applications, and you can customize every cluster. Amazon EMR also supports multiple Hadoop distributions and applications.
Amazon EMR can be used to analyze click stream data in order to segment users and understand user preferences. Advertisers can also analyze click streams and advertising impression logs to deliver more effective ads.
Amazon EMR can be used to process logs generated by web and mobile applications. Amazon EMR helps customers turn petabytes of un-structured or semi-structured data into useful insights about their applications or users.
To use Amazon EMR, you simply:
- Develop your data processing application. You can use Java, Hive (a SQL-like language), Pig (a data processing language), Cascading, Ruby, Perl, Python, R, PHP, C++, or Node.js. Amazon EMR provides code samples and tutorials to get you up and running quickly.
- Upload your application and data to Amazon S3. If you have a large amount of data to upload, you may want to consider using AWS Import/Export (to upload data using physical storage devices) or AWS Direct Connect (to establish a dedicated network connection from your data center to AWS). If you prefer, you can also write your data directly to a running cluster.
- Configure and launch your cluster. Using the AWS Management Console, EMR's Command Line Interface, SDKs, or APIs, specify the number of EC2 instances to provision in your cluster, the types of instances to use (standard, high memory, high CPU, high I/O, etc.), the applications to install (Hive, Pig, HBase, etc.), and the location of your application and data. You can use Bootstrap Actions to install additional software or change default settings.
- (Optional) Monitor the cluster. You can monitor the cluster’s health and progress using the Management Console, Command Line Interface, SDKs, or APIs. EMR integrates with Amazon CloudWatch for monitoring/alarming and supports popular monitoring tools like Ganglia. You can add/remove capacity to the cluster at any time to handle more or less data. For troubleshooting, you can use the console’s simple debugging GUI.
- Retrieve the output. Retrieve the output from Amazon S3 or HDFS on the cluster. Visualize the data with tools like Tableau and MicroStrategy. Amazon EMR will automatically terminate the cluster when processing is complete. Alternatively you can leave the cluster running and give it more work to do.
Are you ready to launch your first cluster? Click here to view the Getting Started Tutorial. In the tutorial you will create a cluster that will count the frequency of words in a sample text file. In just a few minutes your cluster will be up and running.