EMR | AWS Big Data Blog

Tag: EMR

Getting Started with Amazon EMR Bootstrap Actions

by Steve McPherson | on 06 SEP 2014

Steve McPherson is a Senior Manager for Amazon Elastic MapReduce.

Note: This post was updated 2/8/16. The Presto bootstrap action documented in the original post has been deprecated because EMR now offers a Presto-Sandbox as a full-fledged EMR application. For details, see the EMR sandbox.

Amazon Elastic MapReduce (EMR) is a fully managed Hadoop-as-a-service platform that removes the operational overhead of setting up, configuring and managing the end-to-end lifecycle of Hadoop clusters. Many of our customers use the service for scheduled data processing tasks or job flows (clusters in EMR terminology) without ever having to interact with Hadoop infrastructure itself. Instead, they specify an input data source, the query or program that should be run, and the output location for the results.

As the Hadoop ecosystem has grown beyond a generic MapReduce (batch-oriented data processing) system, EMR has expanded to support Hadoop clusters that are long-running, shared, interactive data-processing environments. EMR clusters come prepackaged with the most common Hadoop apps, such as Hive, Pig and Cascading. The apps are configured to implement the full suite of best practices and integrations with related AWS services such as EC2, VPC, CloudWatch, S3, DynamoDB and Kinesis.

Despite the name Elastic MapReduce, the service goes far beyond batch-oriented processing. Clusters in EMR have a flexible and rich cluster-management framework that users can customize to run any Hadoop ecosystem application, such as low-latency query engines like HBase (with Phoenix), Impala and Spark/Shark, and machine learning frameworks like Mahout. These additional components can be installed using Bootstrap Actions or Steps.

Bootstrap Actions are scripts that run on every machine in the cluster as it is brought online, but before the core Hadoop services like HDFS (name node or data node) and the Hive Metastore are configured and started.
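
As a rough illustration of how a Bootstrap Action is attached to a cluster, the sketch below builds the kind of structure that the boto3 EMR client accepts in the BootstrapActions parameter of run_job_flow. This is an assumption-laden example rather than the method used in this post: boto3 is not mentioned in the original, and the action name, bucket, script path and argument are all hypothetical placeholders.

```python
def bootstrap_action(name, script_path, args=None):
    """Describe one bootstrap action: a script in S3 plus optional arguments.

    The returned dict follows the shape boto3's EMR run_job_flow call
    expects for each entry of its BootstrapActions list.
    """
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_path,
            "Args": list(args or []),
        },
    }


# Hypothetical script location and argument, for illustration only.
actions = [
    bootstrap_action(
        "install-extra-packages",
        "s3://example-bucket/bootstrap/install.sh",
        ["--verbose"],
    )
]

# With boto3 installed and AWS credentials configured, this list could be
# passed as BootstrapActions=actions to boto3.client("emr").run_job_flow(...),
# alongside the other required cluster settings (not shown here).
print(actions)
```

The script itself runs on each node as it comes online, so anything it installs is in place before the Hadoop services start.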

(more…)

Building a Recommender with Apache Mahout on Amazon Elastic MapReduce (EMR)

by Accenture | on 16 JUL 2014

This is a guest post by Andrew Musselman, who, as chief data scientist, leads the technical side of the global big data practice at Accenture. He is a PMC member on the Apache Mahout project and is writing a book on data science for O’Reilly. Accenture is an APN Big Data Competency Partner.

This post introduces machine learning, provides context for the Apache Mahout project, and offers some specifics about recommender systems. Then, using Amazon EMR, we’ll tour the workflows for building a simple movie recommender and for writing and running a simple web service to provide results to client applications. Finally, we’ll list some ways to learn more and engage with the Mahout community.

Machine Learning

Machine learning has its roots in artificial intelligence. The term implies that machine learning tools bring cognition and automated decision-making to data problems, but currently machine learning methods do not include computer thought. Even so, machine learning tools usually do employ some type of automated decision-making, often iteratively working toward minimizing or maximizing a specific measure of a model's performance.
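
To make "iteratively working toward minimizing a measurement" concrete, here is a minimal sketch, not taken from the original post, that fits a single slope parameter by gradient descent on mean squared error. The function name, data, learning rate and step count are all illustrative assumptions.

```python
def fit_slope(xs, ys, learning_rate=0.01, steps=500):
    """Fit y ~ w * x by repeatedly nudging w to reduce mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of mean((w * x - y) ** 2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        # Step against the gradient so the error shrinks.
        w -= learning_rate * grad
    return w


# Toy data generated from y = 2 * x, so the fitted slope should approach 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = fit_slope(xs, ys)
print(round(w, 4))
```

Each pass measures how wrong the current model is and adjusts it slightly, which is the iterative pattern described above in its simplest form.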

The field of machine learning encompasses many topics and approaches, usually falling into the categories of classification, clustering, and recommenders.

(more…)