AWS Big Data Blog

Building and Running a Recommendation Engine at Any Scale

by Mortar Data

This is a guest post by K Young, co-founder and CEO of Mortar Data. Mortar Data is an AWS advanced technology partner.

UPDATE: Mortar Data has transitioned into Datadog and has wound down the public Mortar service. The tutorial below no longer works. To learn more about building a recommendation engine on AWS, see Building a Recommendation Engine with Spark ML on Amazon EMR using Zeppelin.


This post shows you how to build a powerful, scalable, customizable recommendation engine using Mortar Data and run it on AWS. You’ll fork an open-source template project, so you won’t have to build from scratch, and you’ll start seeing results fast.

A companion webinar will be held at 11 AM PT on December 17, 2014. The webinar will include a live demo, additional background on recommendation engine motivation, theory, and technologies, and advice for avoiding technical and business missteps. K will also be available for Q&A. Register here.

Why Build a Custom Recommendation Engine?

Most of us have experienced the power of personalized recommendations firsthand. Maybe you found former colleagues and classmates with LinkedIn’s “People You May Know” feature. Perhaps you watched a movie because Netflix suggested it to you. And you’ve most likely bought something that Amazon recommended under “Frequently Bought Together” or “Customers Who Bought This.” Recommendation engines account for a huge share of revenue and user activity, often 30 to 50 percent, at those companies and countless others.

IndiaMart, the world’s second-largest B2B marketplace according to the Economic Times, implemented a custom recommendation engine using Mortar Data in just one week. Since that time the company has reported a 30 percent increase in click-through rate.

The open-source recommendation engine provided by Mortar Data is robust, easy to operate, and highly customizable. The Mortar platform-as-a-service runs almost entirely on top of AWS, primarily using Amazon Elastic MapReduce (Amazon EMR) along with Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB, which makes running and operating such a recommendation system cost-effective and straightforward.

Below, I’ll outline how the recommendation engine works and show you how to implement one yourself.

How Does Mortar Data’s Recommendation Engine Work?

The Mortar recommendation engine is a collaborative filtering system, meaning that it bases recommendations on the preferences of many users. In general, the input data can be just about anything—product purchases, social media favorites, article shares, video plays—as long as each record captures an interaction between a user and an item. You can also add links directly between items (books by the same author, for instance).

By default, the recommendation engine produces two sets of recommendations: user-item (personalized recommendations for individual users) and item-item (similar to Amazon’s “Customers Who Bought This” recommendations). In a full production recommender pipeline, this portion, from input signals through generated recommendations, is powered by open code and works out of the box with Mortar running on AWS.

Getting Started

To get started, you’ll need a Mortar account. If you don’t have a Mortar login, you can sign up for a free trial here.

You’ll also need to install the Mortar development framework to follow the example in this post. Mac and Linux users can install Mortar with a single command:

curl -sSL install.mortardata.com | bash

For other operating systems, check out our install documentation. You will also need a GitHub account, with SSH access to GitHub configured on your machine.

1. Fork the Code

Mortar’s open-source recommendation engine is free for anyone to use, modify, or port to other platforms. To build and run it on Mortar, run the following command (replace project_name with a unique project name of your own) to grab a copy of the code:

mortar projects:fork git@github.com:mortardata/mortar-recsys.git project_name
cd project_name

2. Run an Example Recommender

If you glance through the project directory you’ve just forked, you’ll see that the code for the recommender is contained in a number of subdirectories that follow a standardized code-organization scheme. User-defined functions (helper functions written in Java, Python, or Jython) live in the udfs directory, parameter files live in params, and Luigi scripts for dependency and workflow management live in luigiscripts. In this example, though, we’re mostly going to focus on the Apache Pig data-flow scripts, which live in the pigscripts directory.

Open pigscripts/retail-recsys.pig in your favorite code editor. This is a top-level Pig script that runs the recommendation engine, pulling in some dummy retail data we’ve provided as local JSON files (two separate files in this case—one for purchases and one for wishlisted items—both located in the data/retail directory of your project). Below is a snippet of the purchase data from a fictional movie store. Each line contains a movie ID, a row ID, a user ID, a purchase price, and the movie name:
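(The rows below are illustrative placeholders in the format just described, not actual rows from the sample files:)

{"row_id": 0, "movie_id": "1", "movie_name": "Analyze This", "user_id": "42", "purchase_price": 10}
{"row_id": 1, "movie_id": "2", "movie_name": "Blade Runner", "user_id": "17", "purchase_price": 12}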

The example script pigscripts/retail-recsys.pig is ready to run, so give it a quick spin. To run the entire recommendation engine for free on your machine (don’t worry, it shouldn’t take more than a minute or two with such small input data files), just run the following command from inside your project’s directory:

mortar local:run pigscripts/retail-recsys.pig -f params/retail.params

The first time you run a mortar local command, it will take a minute or two to set up your environment. Then you’ll see status messages stream across the terminal as the job progresses. While it’s running, I’ll quickly walk through what’s happening under the hood, since you’ll use similar strategies to run your own recommender. The first section of the script imports the core recommendation engine code (recommenders.pig), points the recommendation engine to the input data files, and tells the recommender where to store the output:
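(A sketch of that section; the exact parameter names and local paths may differ slightly in the version you fork:)

-- pull in the core recommendation engine macros
import 'recommenders.pig';

%default INPUT_PATH_PURCHASES '../data/retail/purchases.json'
%default INPUT_PATH_WISHLIST '../data/retail/wishlists.json'
%default OUTPUT_PATH '../data/retail/out'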

Next, a load statement tells Pig how to load the purchase data (forming a new alias called purchase_input) by specifying the file type and the schema of the data:
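(A sketch, with the schema inferred from the sample-data description above; the project’s actual loader call may be formatted differently:)

-- JsonLoader takes the record schema as a single string
purchase_input = load '$INPUT_PATH_PURCHASES'
    using org.apache.pig.piggybank.storage.JsonLoader(
        'row_id: int, movie_id: chararray, movie_name: chararray, user_id: chararray, purchase_price: int');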

The script also includes a similar load statement for the wishlist data, which we’ll skip over here for brevity.

Now that we’ve loaded the data files into Pig, we need to focus on the data fields (one for users, one for items) that we will use as input signals for the recommendation engine. For purchases, the relevant fields are user_id and movie_name. We assign purchase signals an arbitrary weight of 1.0.

When we generate signals for wishlisted items, we assign those interactions a weight of 0.5 so that the recommendation engine interprets purchasing an item as a more meaningful interaction (by a factor of two) than adding an item to a wishlist. Having generated our weighted signals, we then run a union on the two aliases to merge all our input signals into a single alias called user_signals:
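(In outline; the field names follow the sample data, and the project’s actual code may differ slightly:)

-- purchases are weighted twice as heavily as wishlist adds
purchase_signals = foreach purchase_input generate
    user_id    as user,
    movie_name as item,
    1.0        as weight;

wishlist_signals = foreach wishlist_input generate
    user_id    as user,
    movie_name as item,
    0.5        as weight;

user_signals = union purchase_signals, wishlist_signals;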

Now that we have our signals prepped, we’re ready to run them through the recommender by calling standard macros that are included in the recommendation engine project. Basically, the macros build a graph of users and their interactions, use the weights of the signals to infer their preferences and tastes, and then return the top-rated recommendations.
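In the retail example, those macro calls look something like the following sketch (the macro names are taken from the mortar-recsys project; check recommenders.pig in your fork for the exact signatures):

item_item_recs = recsys__GetItemItemRecommendations(user_signals);
user_item_recs = recsys__GetUserItemRecommendations(user_signals, item_item_recs);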

Finally, the script stores the recommendations to the output location we defined at the start of the script (running an rmf beforehand to make sure there is no pre-existing data in the output location):
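(Along these lines; PigStorage writes tab-delimited output by default:)

-- Hadoop refuses to write to an existing output location, so clear it first
rmf $OUTPUT_PATH/item_item_recs;
rmf $OUTPUT_PATH/user_item_recs;

store item_item_recs into '$OUTPUT_PATH/item_item_recs' using PigStorage();
store user_item_recs into '$OUTPUT_PATH/user_item_recs' using PigStorage();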

That’s how it works—below you’ll use similar strategies to run a recommender with your own data at scale.

When the job is done running in your terminal, open up the item-item recommendations at data/retail/out/item_item_recs/part-r-00000 (Hadoop saves output as numbered part files, but in this case there should only be one file). You’ll see a tab-delimited file listing films and the top recommendations for those films. In the snippet below, you can see the top five recommendations generated for the films Analyze This and Blade Runner. (For the purposes of this example, you can ignore the third and fourth columns; the fifth column is simply the rank of the recommendation.)
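Schematically, each row has this layout (placeholders showing the column positions, not actual output):

<film>	<recommended film>	<column 3>	<column 4>	<rank>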

Now that you’ve done a quick pass through the recommendation engine, I’ll show you how to run it with your own input data.

3. Run a Recommendation Engine with Your Own Data

The Mortar recommendation engine runs in AWS, so storing your input data in Amazon S3 maximizes performance and cost-efficiency. As mentioned above, just about any data recording user-item interactions will work.

Open up pigscripts/my-recommender.pig in your favorite code editor. This is a template script that you can easily modify to work with your data, as I’ll outline below. For simplicity, I’ll assume that you’re only using one type of input data, but as you saw in the movie example above, it’s easy to incorporate multiple signals of different weights.

3.1 Set Input/Output Paths

Change the placeholder Amazon S3 paths at the top of the script to match the actual Amazon S3 path to your input data and the Amazon S3 location where you want the recommendations to be stored. (You can specify a single input file or an entire input directory, in which case the recommender will ingest all the data in that directory.)
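For example (a sketch; the parameter names in your template may differ, and the bucket and paths below are placeholders):

%default INPUT_PATH 's3://your-bucket/path/to/input'
%default OUTPUT_PATH 's3://your-bucket/path/to/output'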

If you haven’t already done so, you’ll need to create an IAM user for Mortar via the AWS console and grant the Mortar IAM user access to your Amazon S3 buckets.

3.2 Generate a Load Statement

The template script contains a sample load statement in a comment block. Uncomment the “Load Data” section of the script (by removing `/*` and `*/`).

The easiest way to craft your own load statement for CSVs, JSON, XML, or MongoDB data is to use Mortar’s Load Statement Generator. (Info on loading other data types is available here.) The Load Statement Generator will provide you a complete load statement that you can copy and paste into any script, but in this case it’ll make downstream integration easier if you retain the start of the sample load statement and only replace the part that follows the word “using”:
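(For a simple CSV of user-item interactions, the result might look like the sketch below; the loader and field names are illustrative, and yours will come from the generator:)

-- keep the alias and path; only the part after "using" comes from the generator
raw_input = load '$INPUT_PATH'
    using org.apache.pig.piggybank.storage.CSVExcelStorage()
    as (user_id: chararray, item_id: chararray);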

3.3 Generate Signals

Uncomment the next section of code (under “Convert Data to Signals”). All you need to do here is to specify which field in your data represents the user, and which represents the item. For the load statement above, that would look like this:
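(Continuing the illustrative CSV sketch above, with user_id and item_id as the assumed field names:)

user_signals = foreach raw_input generate
    user_id as user,
    item_id as item,
    1.0     as weight;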

Note that you don’t need to worry about adjusting the value of `weight` unless you’re combining multiple signals that should be treated differently.

3.4 Finalize and Test Recommender Code

Uncomment the final section of code (under “Use Mortar recommendation engine to convert signals to recommendations”). You can leave this entire code block as-is, but if you don’t want to generate recommendations for individual users (that is, if you only want item-item recommendations) you can delete the three lines of code pertaining to user_item_recs.

Now we’ll use the Illustrate tool to make sure your data is being loaded correctly. Illustrate pulls in a small sample of your data and shows you how data will flow through your script. Running Illustrate is a fast and free way to get a sense of how your code is working before launching a cluster and to catch any errors you may have missed. In this case we’re going to focus on the alias raw_input:

mortar local:illustrate pigscripts/my-recommender.pig raw_input -f params/my_recommender.params     

In a browser window, you should see a small snapshot of your data. Take a moment to ensure that the schema accurately represents the data.

3.5 Run on Amazon EMR

With Mortar, you can launch an Amazon EMR cluster of whatever size you need with a single command. For your initial run, we suggest that you try a 10-node Amazon EMR cluster and scale up or down for future runs depending on how long it takes. (The Mortar recommender project defaults to using AWS spot instances. While spot instance prices aren’t guaranteed, they’re typically in the ballpark of $0.14 per hour per node.) To run on 10 nodes, just use the following command:

mortar run pigscripts/my-recommender.pig -f params/my_recommender.params --clustersize 10

You’ll see output in your terminal providing you with a URL where you can monitor the status of your job. By following that link you can also explore a visualization of your job and view logs from Pig and from individual MapReduce jobs.

3.6 Interpret Results

Once your job completes, the Job Detail page will serve up a link to download the recommendations directly from Amazon S3. You can also use the AWS Command Line Interface to access your results.

It’s usually easiest to dive into the item-item recommendations first. You can get a good, high-level sense of recommendation quality by examining a few “sentinel” items—for a music recommender, you might look at the recommendations generated for a violinist, a top 40 pop singer, and a rap group, just to make sure that the output is sensible across a range of inputs. For a deep dive on evaluating and tuning your recommender, see the Mortar recommendation engine tutorial, where we’ve documented about a dozen common variations and modification techniques that you can use.

Conclusion

Once you’re satisfied with the output from your recommendation engine, you can easily productionize it. That means automating the recommender so that it runs as often as you need to account for new data (daily or weekly, for example), and serving up recommendations not just as TSV files but from a scalable, high-availability database.

Mortar makes both these steps simple by integrating fully with Luigi, a powerful dependency manager and workflow engine developed and open-sourced by Spotify. Your recommendation engine project contains an example Luigi script (luigiscripts/retail-luigi.py) that runs a full recommendation engine pipeline end-to-end, writing the results to Amazon DynamoDB (and adjusting the table throughput as needed) and then automatically shutting down the Amazon EMR clusters once they’re no longer needed. You can easily modify the script to run your own recommender, and to add any intermediate data-processing steps you may need. Plus, you can use Mortar to schedule regularly recurring Luigi jobs.

The Mortar help site has a complete guide to how Luigi works and how to automate your recommendation engine, as well as more detailed instructions on all the steps outlined above.

Happy building!

If you have a question or suggestion, please leave a comment below.

 

Do more with Machine Learning:

Building a Numeric Regression Model with Amazon Machine Learning