AWS Big Data Blog

Category: Amazon EMR

Deploying Cloudera’s Enterprise Data Hub on AWS

Karthik Krishnan is an AWS Solutions Architect. UPDATE April 6, 2015: The newest quickstart reference guide supports Cloudera Director 1.1.0. To manage your cluster with Cloudera Director 1.1.0, refer to the updated reference guide. Apache Hadoop is an open-source software framework for storing and processing large-scale data sets. In this post, we discuss the deployment of […]

Ensuring Consistency When Using Amazon S3 and Amazon Elastic MapReduce for ETL Workflows

Jonathan Fritz is a Senior Product Manager for Amazon EMR. AWS Solutions Architect Manjeet Chayel also contributed to this post. The EMR File System (EMRFS) is an implementation of HDFS that allows Amazon Elastic MapReduce (Amazon EMR) clusters to store data on Amazon Simple Storage Service (Amazon S3). Many Amazon EMR customers use it to […]

Statistical Analysis with Open-Source R and RStudio on Amazon EMR

Markus Schmidberger is a Senior Big Data Consultant for AWS Professional Services. Big Data is on every CIO’s mind. It is synonymous with technologies like Hadoop and the ‘NoSQL’ class of databases. Another technology shaking things up in Big Data is R. This blog post describes how to set up R, RHadoop packages and RStudio […]

Using Amazon EMR with SQL Workbench and other BI Tools

This is a guest post by Kyle Porter, a Sales Engineer at Simba Technologies. Jon Einkauf, a Senior Product Manager for Amazon Elastic MapReduce, and AWS Senior Technical Writer Jeff Slone also contributed to this post. Note: Ports have changed on EMR 4.x. Before walking through this post, please consult the EMR documentation to […]

Using Amazon EMR and Tableau to Analyze and Visualize Data

Rahul Bhartia is an AWS Solutions Architect. Hadoop provides a great ecosystem of tools for extracting value from data in various formats and sizes. Originally focused on large-batch processing with tools like MapReduce, Pig and Hive, Hadoop now provides many tools for running interactive queries on your data, such as Impala, Drill, and Presto. […]

Getting Started with Amazon EMR Bootstrap Actions

Steve McPherson is a Senior Manager for Amazon Elastic MapReduce. Note: This post was updated 2/8/16. The Presto bootstrap action documented in the original post has been deprecated because EMR now offers a Presto-Sandbox as a full-fledged EMR application. For details, see the EMR sandbox. Amazon Elastic MapReduce (EMR) is a fully managed Hadoop-as-a-service platform […]

Building a Recommender with Apache Mahout on Amazon Elastic MapReduce (EMR)

This is a guest post by Andrew Musselman, who, as chief data scientist, leads the global big data practice from the technical side at Accenture. He is a PMC member on the Apache Mahout project and is writing a book on data science for O’Reilly. Accenture is an APN Big Data Competency Partner. This post […]
