Tag: EMR
Building and Running a Recommendation Engine at Any Scale
This is a guest post by K Young, co-founder and CEO of Mortar Data. Mortar Data is an AWS advanced technology partner.
UPDATE: Mortar Data has transitioned into Datadog and has wound down the public Mortar service. The tutorial below no longer works. To learn more about building a recommendation engine on AWS, see Building a Recommendation Engine with Spark ML on Amazon EMR using Zeppelin.
This post shows you how to build a powerful, scalable, customizable recommendation engine using Mortar Data and run it on AWS. You’ll fork an open-source template project, so you won’t have to build from scratch, and you’ll start seeing results fast.
A companion webinar will be held at 11AM PT on December 17, 2014. The webinar will include a live demo, additional background on recommendation engine motivation, theory, and technologies, and advice for avoiding technical and business missteps. K will also be available for Q&A. Register here.
Why Build a Custom Recommendation Engine?
Most of us have experienced the power of personalized recommendations firsthand. Maybe you found former colleagues and classmates with LinkedIn’s “People You May Know” feature. Perhaps you watched a movie because Netflix suggested it to you. And you’ve most likely bought something that Amazon recommended under “Frequently Bought Together” or “Customers Who Bought This.” Recommendation engines account for a huge share of revenue and user activity, often 30 to 50 percent, at those companies and countless others.
IndiaMart, the world’s second-largest B2B marketplace according to the Economic Times, implemented a custom recommendation engine using Mortar Data in just one week. Since that time the company has reported a 30 percent increase in click-through rate.
The open-source recommendation engine provided by Mortar Data is robust, easy to operate, and very customizable. The Mortar platform-as-a-service runs almost entirely on top of AWS, primarily using Amazon Elastic MapReduce (Amazon EMR) along with Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB, which makes executing and operating such a recommendation system cost-effective and straightforward.
Below, I’ll outline how the recommendation engine works and show you how to implement one yourself.
Getting HBase Running on Amazon EMR and Connecting it to Amazon Kinesis
Wangechi Doble is an AWS Solutions Architect
Introduction
Apache HBase is an open-source, column-oriented, distributed NoSQL database that runs on the Apache Hadoop framework. In the AWS Cloud, you can choose to deploy Apache HBase on Amazon Elastic Compute Cloud (Amazon EC2) and manage it yourself, or leverage Apache HBase as a managed service on Amazon Elastic MapReduce (Amazon EMR). Amazon EMR is a fully managed, hosted Hadoop framework on top of Amazon EC2. This post shows you how to launch an Apache HBase cluster on Amazon EMR using the AWS SDK for Java and how to extend the Amazon Kinesis Connector Library to stream data in real time to HBase running on an Amazon EMR cluster. Amazon Kinesis is a fully managed service for real-time processing of streaming big data.
Launching an Amazon EMR Cluster with HBase
We will use the AWS SDK for Java in this post to launch an Amazon EMR cluster with HBase. To learn more about launching Apache HBase on Amazon EMR, see the Installing HBase on an Amazon EMR Cluster section of the Amazon EMR documentation.
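The post itself uses the AWS SDK for Java. As a rough illustration in Python instead, the sketch below only assembles the parameters one might pass to the boto3 run_job_flow call; it does not contact AWS. The function name, cluster name, release label, instance type, and role names are all assumptions chosen for illustration, and selecting HBase through a release label is a newer style than the one this post was written for.

```python
def build_hbase_cluster_request(name, core_nodes=2,
                                instance_type="m5.xlarge",
                                release_label="emr-5.36.0"):
    """Assemble keyword arguments for an EMR cluster that runs HBase.

    Every default value here is an illustrative assumption, not a value
    taken from the original post.
    """
    if core_nodes < 1:
        raise ValueError("an HBase cluster needs at least one core node")
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "HBase"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            # One master node plus the requested core nodes.
            "InstanceCount": core_nodes + 1,
            # Keep the cluster running so HBase stays available.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_hbase_cluster_request("hbase-demo", core_nodes=3)

# To actually launch the cluster (requires boto3 and AWS credentials):
#   import boto3
#   response = boto3.client("emr").run_job_flow(**request)
#   print(response["JobFlowId"])
```

Keeping the request as a plain dictionary makes it easy to inspect or adjust before any cluster, and any cost, is created.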
The Impact of Using Latest-Generation Instances for Your Amazon EMR Job
Nick Corbett is a Big Data Consultant for AWS Professional Services
Amazon Elastic MapReduce (Amazon EMR) is a web service that makes it easy to process large amounts of data efficiently. Amazon EMR uses the popular open source framework Apache Hadoop combined with several other AWS products to do such tasks as web indexing, data mining, log file analysis, machine learning, scientific simulation and data warehousing.
Traditionally, Hadoop has used a distributed processing engine called MapReduce to divide large jobs into small tasks. These are spread out over a fleet of servers and executed in parallel. This scales well, even when the input data is on the order of petabytes. However, MapReduce is not the only choice; open source products such as Impala, Apache Spark, or Presto can all be run on Amazon EMR and offer alternative frameworks for processing large amounts of data.
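To make the map, shuffle, and reduce phases concrete, here is a minimal single-machine sketch of a word count. It is not Hadoop code: a thread pool stands in for the fleet of servers, and the function names and sample input are assumptions made up for this illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(chunk):
    """Map phase: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def reduce_task(word, counts):
    """Reduce phase: combine all counts emitted for a single key."""
    return word, sum(counts)


def word_count(chunks, workers=4):
    # Run the map tasks in parallel, one task per input chunk.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_task, chunks))

    # Shuffle: group the intermediate values by key.
    grouped = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            grouped[word].append(count)

    # Reduce each key independently of the others.
    return dict(reduce_task(word, counts) for word, counts in grouped.items())


chunks = [
    ["the quick brown fox", "the lazy dog"],
    ["the dog barks"],
]
result = word_count(chunks)
```

Because each map task sees only its own chunk and each reduce call sees only one key, both phases can be spread across as many workers as are available.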
Whatever your framework choice, when you run an Amazon EMR cluster you need to choose the instance type that’s used for your cluster’s nodes. An overview of all the instance types is provided in the Amazon EMR documentation. The table below lists specifications of instance types discussed in this post.
ETL Processing Using AWS Data Pipeline and Amazon Elastic MapReduce
Manjeet Chayel is an AWS Solutions Architect
This blog post shows you how to build an ETL workflow that uses AWS Data Pipeline to schedule an Amazon Elastic MapReduce (Amazon EMR) cluster to clean and process web server logs stored in an Amazon Simple Storage Service (Amazon S3) bucket. AWS Data Pipeline is an ETL service that you can use to automate the movement and transformation of data. It launches an Amazon EMR cluster for each scheduled interval, submits jobs as steps to the cluster, and terminates the cluster after tasks have completed.
In this post, you’ll create the following ETL workflow:
To create the workflow, we’ll use the Pig and Hive examples discussed in the blog post “Ensuring Consistency when Using Amazon S3 and Amazon EMR.” This ETL workflow pushes webserver logs to an Amazon S3 bucket, cleans and filters the data using Pig scripts, and then generates analytical reports from this data using Hive scripts. AWS Data Pipeline allows you to run this workflow for a schedule in the future and lets you backfill data by scheduling a pipeline to run from a start date in the past.
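The backfill idea can be sketched as a small calculation: given a start date in the past and a fixed period, list every interval that has already fully elapsed and would therefore need a run. This is only an illustration of the concept, not AWS Data Pipeline's scheduler; the function name, the one-day period, and the rule that a run covers a completed interval are assumptions.

```python
from datetime import datetime, timedelta


def scheduled_runs(start, now, period=timedelta(days=1)):
    """Return (interval_start, interval_end) pairs for each fully elapsed period."""
    runs = []
    interval_start = start
    # Only include intervals that have completely finished by `now`.
    while interval_start + period <= now:
        runs.append((interval_start, interval_start + period))
        interval_start += period
    return runs


start = datetime(2014, 11, 1)
now = datetime(2014, 11, 4, 12, 0)
backfill = scheduled_runs(start, now)

for begin, end in backfill:
    print(f"process logs from {begin:%Y-%m-%d} to {end:%Y-%m-%d}")
```

Each pair corresponds to one scheduled interval, for which a cluster would be launched, given its steps, and terminated.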
Installing Apache Spark on an Amazon EMR Cluster
Jonathan Fritz is a Senior Product Manager for Amazon EMR
———————–
Please note – Amazon EMR now officially supports Spark. For more information about Spark on EMR, visit the Spark on Amazon EMR page or read Intent Media’s guest post on the AWS Big Data Blog about Spark on EMR.
——–—————
Over the last five years, Amazon EMR has evolved into a container for running many distributed computing frameworks beyond just Hadoop MapReduce. Customers can choose to run a variety of engines such as HBase, Impala, Spark, or Presto in their EMR cluster and leverage Amazon EMR’s features like fast performance of Amazon S3, connectivity with other AWS services, and ease of use (cluster creation and management).
We’re particularly excited about Apache Spark, an engine in the Apache Hadoop ecosystem for fast and efficient processing of large datasets. By using in-memory, fault-tolerant resilient distributed datasets (RDDs) and directed acyclic graphs (DAGs) to define data transformations, Spark has shown significant performance increases for certain workloads when compared to Hadoop MapReduce.
EMR is no stranger to Spark. In fact, customers have been running Spark on EMR managed Hadoop clusters for years. To provide our customers with easy access to Spark on their EMR cluster, we wrote a bootstrap action accompanied by an article back in February 2013 on how to use Spark and Shark.
Much has changed in the Spark ecosystem since then: Spark graduated to 1.x thereby guaranteeing stability of its core API for all 1.x releases, Shark has been deprecated in favor of Spark SQL, and Spark can be run on top of YARN (the resource manager for Hadoop 2). In light of these changes, we have revised the bootstrap action to install Spark 1.x on our Hadoop 2.x AMIs and run it on top of YARN. The bootstrap action also installs and configures Spark SQL, Spark Streaming, MLlib, and GraphX.
Deploying Cloudera’s Enterprise Data Hub on AWS
Karthik Krishnan is an AWS Solutions Architect
UPDATE April 6, 2015: The newest quickstart reference guide supports Cloudera Director 1.1.0. To manage your cluster with Cloudera Director 1.1.0, refer to the updated reference guide.
Apache Hadoop is an open-source software framework for storing and processing large-scale data sets. In this post, we discuss the deployment of a Hadoop cluster via Cloudera’s Enterprise Data Hub (EDH) on AWS. The deployment below leverages various AWS technologies such as AWS CloudFormation, Amazon Elastic Compute Cloud (Amazon EC2), and Amazon Virtual Private Cloud (Amazon VPC), along with Cloudera Director software. Cloudera Director enables the delivery of an enterprise-class, elastic, self-service experience for the Enterprise Data Hub on cloud infrastructure. This deployment allows customers to build a Hadoop cluster rapidly and on demand, while providing enough options to customize their cluster at fine granularity.
The flexible architecture allows you to choose the most appropriate network, compute, and storage infrastructure for your environment, while the automation via CloudFormation and Cloudera Director software takes care of building the infrastructure on AWS. Roughly, the automation works by launching a launcher instance through which the entire cluster is constructed. Cluster customization options are set on the launcher instance. Because most of the steps are automated, customers can rapidly construct the cluster on AWS by changing configuration files and parameters during deployment. Let’s begin!
Ensuring Consistency When Using Amazon S3 and Amazon Elastic MapReduce for ETL Workflows
Jonathan Fritz is a Senior Product Manager for Amazon EMR. AWS Solutions Architect Manjeet Chayel also contributed to this post.
The EMR File System (EMRFS) is an implementation of HDFS that allows Amazon Elastic MapReduce (Amazon EMR) clusters to store data on Amazon Simple Storage Service (Amazon S3). Many Amazon EMR customers use it to inexpensively store massive amounts of data with high durability and availability. However, Amazon S3 was designed for eventual consistency, which can cause issues for certain multi-step, extract-transform-load (ETL) data processing pipelines. For instance, if you list objects in an Amazon S3 bucket immediately after adding new objects from another step in the pipeline, the list may be incomplete. Because this list is the input for the next step, the set of files being processed in that stage of the job will be incomplete.
Creating a Consistent View of Amazon S3 for Amazon EMR
To address the challenges presented by Amazon S3’s eventual consistency model, the Amazon EMR team has released a new feature called “consistent view” for EMRFS. Consistent view is an optional feature that allows Amazon EMR clusters to check for list and read-after-write consistency for new Amazon S3 objects written by or synced with EMRFS. If it detects that Amazon S3 is inconsistent during a file system operation, it retries that operation according to user-defined rules. Consistent view does this by storing metadata in Amazon DynamoDB to keep track of Amazon S3 objects. This creates stronger ETL pipelines by making sure the output from a previous step is completely listed as the input for the current step. By default, an Amazon DynamoDB table is created to hold the EMRFS metadata with 500 read capacity units and 100 write capacity units, so there is a small Amazon DynamoDB charge associated with enabling consistent view. The table read/write capacity settings are configurable depending on how many objects EMRFS is tracking and the number of concurrent nodes reading from the metadata.
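The check-and-retry idea behind consistent view can be sketched in a few lines. This is not EMRFS code: the FlakyListing class is a made-up stand-in for a listing that lags behind writes, the metadata list plays the role of the records kept in Amazon DynamoDB, and the retry count and exponential backoff are assumed values standing in for the user-defined rules.

```python
import time


class FlakyListing:
    """Hypothetical stand-in for an eventually consistent object listing.

    Each call reveals one more key, imitating a listing that lags writes.
    """

    def __init__(self, keys):
        self._keys = list(keys)
        self._visible = 0

    def list_keys(self):
        self._visible = min(self._visible + 1, len(self._keys))
        return set(self._keys[: self._visible])


def consistent_list(list_keys, expected_keys, retries=5, delay=0.01):
    """List keys, retrying until every key recorded in the metadata appears."""
    expected = set(expected_keys)
    for attempt in range(retries + 1):
        listed = list_keys()
        if expected <= listed:
            return listed
        if attempt < retries:
            # Back off exponentially before asking again.
            time.sleep(delay * (2 ** attempt))
    missing = sorted(expected - listed)
    raise RuntimeError(f"listing still missing keys: {missing}")


# The metadata records what the previous pipeline step wrote.
metadata = ["out/part-0000", "out/part-0001", "out/part-0002"]
store = FlakyListing(metadata)
listed = consistent_list(store.list_keys, metadata)
```

Failing loudly once the retries are exhausted matters as much as the retry itself: the next step either sees the complete input or does not run on a partial one.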
Statistical Analysis with Open-Source R and RStudio on Amazon EMR
Markus Schmidberger is a Senior Big Data Consultant for AWS Professional Services
Big Data is on every CIO’s mind. It is synonymous with technologies like Hadoop and the ‘NoSQL’ class of databases. Another technology shaking things up in Big Data is R. This blog post describes how to set up R, RHadoop packages, and RStudio server on Amazon Elastic MapReduce (Amazon EMR). This combination provides a powerful statistical analysis environment, including a user-friendly IDE on a fully managed Hadoop environment that starts up in minutes and saves time and money for your data-driven analyses. At the end of this post, I’ve added a Big Data analysis using a public data set with daily global weather measurements.
R is an open source programming language and software environment designed for statistical computing, visualization, and data analysis. Thanks to its flexible package system and powerful statistical engine, R can provide methods and technologies to manage and process large amounts of data. It is the fastest-growing analytics platform in the world, and is established in both academia and business due to its robustness, reliability, and accuracy. Nearly every top vendor of advanced analytics has integrated R and can now import R models. This allows data scientists, statisticians, and other sophisticated enterprise users to leverage R within their analytics package.
Using Amazon EMR with SQL Workbench and other BI Tools
This is a guest post by Kyle Porter, a Sales Engineer at Simba Technologies.
Jon Einkauf, a Senior Product Manager for Amazon Elastic MapReduce and AWS Senior Technical Writer Jeff Slone also contributed to this post.
—————-
Note: Ports have changed on EMR 4.x. Before walking through this post, please consult the EMR documentation to make sure you are connecting to the correct port.
—————-
Many customers use business intelligence tools like SQL Workbench to query data with Amazon Elastic MapReduce (Amazon EMR). These tools are simple to set up and make it easier to develop and run queries. This post shows you how to use SQL Workbench to query sample Amazon CloudFront access logs stored in Amazon Simple Storage Service (Amazon S3) using Hive 0.13 on Amazon EMR.
Amazon EMR employs drivers from Simba Technologies to connect to client JDBC and ODBC applications. The example in this blog post uses the Hive JDBC driver, but Amazon EMR also provides drivers for Hive ODBC, Impala JDBC, Impala ODBC, and HBase ODBC, so you can use a variety of other tools and applications. You can download these drivers. This tutorial assumes that you already have an Amazon EMR cluster with Hive running. If you aren’t sure how to do this, AWS provides instructions.
Using Amazon EMR and Tableau to Analyze and Visualize Data
Rahul Bhartia is an AWS Solutions Architect
Introduction
Hadoop provides a great ecosystem of tools for extracting value from data in various formats and sizes. Originally focused on large-batch processing with tools like MapReduce, Pig and Hive, Hadoop now provides many tools for running interactive queries on your data, such as Impala, Drill, and Presto. This post shows you how to use Amazon Elastic MapReduce (Amazon EMR) to analyze a data set available on Amazon Simple Storage Service (Amazon S3) and then use Tableau with Impala to visualize the data.
Amazon Elastic MapReduce
Amazon EMR is a web service that makes it easy to quickly and cost-effectively process vast amounts of data. Amazon EMR uses Apache Hadoop, an open source framework, to distribute and process your data across a resizable cluster of Amazon Elastic Compute Cloud (Amazon EC2) instances.
Impala
Impala is an open source tool in the Hadoop ecosystem and is available on EMR for interactive, ad hoc querying using SQL syntax. Rather than running on a MapReduce engine as Hive does, Impala uses a massively parallel processing (MPP) engine similar to those found in traditional relational database management systems (RDBMS), which allows it to achieve faster query response times.