Build a movie recommender with factorization machines on Amazon SageMaker
Recommendation is one of the most popular applications in machine learning (ML). In this blog post, I’ll show you how to build a movie recommendation model based on factorization machines — one of the built-in algorithms of Amazon SageMaker — and the popular MovieLens dataset.
A word about factorization machines
Factorization Machines (FM) are a supervised machine learning technique introduced in 2010 (research paper, PDF). FM get their name from their ability to reduce problem dimensionality thanks to matrix factorization.
Factorization machines can be used for classification or regression and are much more computationally efficient on large sparse data sets than traditional algorithms like linear regression. This property is why FM are widely used for recommendation: user count and item count are typically very large, although the number of actual ratings is very small (users don’t rate all available items!).
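For reference, here is the second-order model equation as I understand it from the 2010 paper. The symbol names (n features, k factors, weights w, factor vectors v) follow the paper’s notation and are not defined anywhere else in this post, so treat this as a summary rather than a description of the Amazon SageMaker implementation.

```latex
\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i
  + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j,
\qquad \mathbf{v}_i \in \mathbb{R}^{k}
```

Each pairwise weight is the dot product of two k-dimensional factor vectors, so the model learns n*k interaction parameters instead of one weight per feature pair. That is where the dimensionality reduction comes from.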
Here’s a simple example, in which a sparse rating matrix (dimension 4×4) is factored into a dense user matrix (dimension 4×2) and a dense item matrix (dimension 2×4). The number of factors (2) is smaller than the number of columns of the rating matrix (4). In addition, multiplying the two dense matrices fills all blank values in the rating matrix, which we can then use to recommend new items to any user.
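To make the shapes concrete, here is a minimal sketch in plain Python. The numbers are made up for illustration (they are not learned from any data set), and the matmul() helper is a hypothetical name: the point is only that a 4×2 matrix times a 2×4 matrix yields a full 4×4 score matrix.

```python
# Toy illustration: the values below are invented, not taken from MovieLens.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Dense user matrix: 4 users x 2 factors.
users = [[1.0, 0.5],
         [0.0, 2.0],
         [1.5, 1.0],
         [0.5, 0.0]]

# Dense item matrix: 2 factors x 4 items.
items = [[2.0, 0.0, 1.0, 4.0],
         [1.0, 2.0, 0.0, 0.5]]

# The product is a full 4x4 matrix: every (user, item) cell gets a score,
# including cells that would be blank in a sparse rating matrix.
scores = matmul(users, items)

for row in scores:
    print(row)
```

In a real model, the two small matrices are learned during training so that their product approximates the known ratings.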
The MovieLens dataset
This dataset is a great starting point for recommendation. It comes in multiple sizes. In this blog post we’ll use ml-100k: 100,000 ratings from 943 users on 1682 movies. The ml-100k rating matrix is quite sparse (93.7% to be precise) because it only holds 100,000 ratings out of a possible 1,586,126 (943*1682).
Here are the first 10 lines in the data set: user 754 gave movie 595 a 2-star rating, and so on.
# user id, movie id, rating, timestamp
754 595 2 879452073
932 157 4 891250667
751 100 4 889132252
101 820 3 877136954
606 1277 3 878148493
581 475 4 879641850
13 50 5 882140001
457 59 5 882397575
111 321 3 891680076
123 657 4 879872066
Data set preparation
As explained earlier, FM work best on high-dimensional sparse datasets. As a consequence, we’re going to one-hot encode user IDs and movie IDs (we’ll ignore timestamps). Thus, each sample in our data set will be a 2,625-element Boolean vector (943+1682) with only two values set to 1: one for the user ID and one for the movie ID.
We’re going to build a binary recommender (that is, like/don’t like). 4-star and 5-star ratings are set to 1. Lower ratings are set to 0.
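Here is a small sketch of that encoding for a single rating. The encode_sample() helper is a hypothetical name, and I’m assuming user columns come first, followed by movie columns. A dense list is used only to keep the example readable.

```python
NB_USERS = 943
NB_MOVIES = 1682
NB_FEATURES = NB_USERS + NB_MOVIES  # 2,625 columns

def encode_sample(user_id, movie_id, rating):
    """Return (feature_vector, label) for one rating line."""
    features = [0.0] * NB_FEATURES
    features[user_id - 1] = 1.0               # IDs start at 1, columns at 0
    features[NB_USERS + movie_id - 1] = 1.0   # movie columns follow user columns
    label = 1.0 if rating >= 4 else 0.0       # 4 or 5 stars means "like"
    return features, label

# First line of the data set: user 754 gave movie 595 a 2-star rating.
features, label = encode_sample(754, 595, 2)
hot = [i for i, v in enumerate(features) if v == 1.0]
print(hot, label)  # [753, 1537] 0.0
```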
One last thing: the FM implementation in Amazon SageMaker requires training and test data to be stored in float32 tensors in protobuf format. (Yes, that sounds complicated 🙂) However, the Amazon SageMaker SDK provides a convenient utility function that takes care of this, so don’t worry too much about it.
The high-level view
Here are the steps you need to implement:
- Load the MovieLens training set and test set from disk.
- For each set, build a sparse matrix holding one-hot encoded data samples.
- For each set, build a label vector holding ratings.
- Write both sets to protobuf-encoded files.
- Copy these files to an Amazon S3 bucket.
- Configure and run a factorization machines training job on Amazon SageMaker.
- Deploy the corresponding model to an endpoint.
- Run some predictions.
Let’s get going!
Loading the MovieLens dataset
ml-100k contains multiple text files, but we’re only going to use two of them to build our model:
- ua.base (90,570 samples) will be our training set.
- ua.test (9,430 samples) will be our test set.
Both files have the same tab-separated format:
- user id (integer between 1 and 943)
- movie id (integer between 1 and 1682)
- rating (integer between 1 and 5)
- timestamp (epoch-based integer)
As a consequence, we’re going to build the following data structures:
- A training sparse matrix: 90,570 lines and 2,625 columns (943 one-hot encoded features for the user ID, plus 1682 one-hot encoded features for the movie ID)
- A training label array: 90,570 ratings
- A test sparse matrix: 9,430 lines and 2,625 columns
- A test label array: 9,430 ratings
Reminder: Each sample must be a single one-hot encoded feature vector. Yes, you do need to concatenate the one-hot encoded values for user ID, movie ID, and any additional feature you might add. Building a list of distinct vectors (one for the user ID, one for the movie ID, etc.) isn’t the right way.
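A loader along these lines could look like the sketch below. The function names are hypothetical, and collecting (row, column) coordinates before building a scipy matrix is my own design choice, not necessarily how the original notebook does it. numpy and scipy are imported lazily inside to_sparse(), so the parsing step runs with the standard library alone.

```python
import csv
import io

NB_USERS = 943
NB_MOVIES = 1682
NB_FEATURES = NB_USERS + NB_MOVIES

def load_dataset(f):
    """Read tab-separated rating lines from a file-like object.

    Returns (rows, cols, labels): coordinates of the non-zero cells of the
    one-hot matrix, plus one binary label per sample.
    """
    rows, cols, labels = [], [], []
    reader = csv.reader(f, delimiter="\t")
    for line, (user_id, movie_id, rating, _timestamp) in enumerate(reader):
        rows += [line, line]
        cols += [int(user_id) - 1, NB_USERS + int(movie_id) - 1]
        labels.append(1.0 if int(rating) >= 4 else 0.0)
    return rows, cols, labels

def to_sparse(rows, cols, labels):
    """Build a float32 scipy sparse matrix and a float32 label vector."""
    import numpy as np                   # lazy import: only needed here
    from scipy.sparse import csr_matrix
    data = np.ones(len(rows), dtype="float32")
    X = csr_matrix((data, (rows, cols)),
                   shape=(len(labels), NB_FEATURES), dtype="float32")
    y = np.array(labels, dtype="float32")
    return X, y

# Two lines in the same format as ua.base, held in memory for the demo.
sample = io.StringIO("754\t595\t2\t879452073\n932\t157\t4\t891250667\n")
rows, cols, labels = load_dataset(sample)
print(rows, cols, labels)
```

For the real files, you would open ua.base and ua.test instead of the in-memory sample, then pass the result to to_sparse() to get the matrix and label vector.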
Our training matrix is now even sparser: Of all 237,746,250 values (90,570*2,625), only 181,140 are non-zero (90,570*2). In other words, the matrix is 99.92% sparse. Storing this as a dense matrix would be a massive waste of both storage and computing power.
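If you’d like to double-check that arithmetic, a few lines of Python are enough. The variable names are mine.

```python
n_samples = 90_570
n_features = 943 + 1682            # 2,625 one-hot columns
total_cells = n_samples * n_features
non_zero = n_samples * 2           # one user bit and one movie bit per sample
sparsity = 1 - non_zero / total_cells

print(total_cells, non_zero, f"{sparsity:.2%}")
```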
We should check that we have approximately the same number of samples per class. An unbalanced data set is a serious problem for classifiers.
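One way to run that check is sketched below, assuming labels are 0/1 values as described earlier. The class_balance() helper is a hypothetical name, and the toy labels are invented: run it on your real training labels (for a numpy array, pass labels.tolist()) to get the actual split.

```python
from collections import Counter

def class_balance(labels):
    """Return {label: share of samples} for a sequence of 0/1 labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}

# Toy labels for illustration only.
toy_labels = [1.0, 0.0, 1.0, 1.0, 0.0]
print(class_balance(toy_labels))
```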
Slightly unbalanced, but nothing bad. Let’s move on!
Writing to protobuf files
Next, we’re going to write the training set and the test set to two protobuf files stored in Amazon S3. Fortunately, we can rely on the write_spmatrix_to_sparse_tensor() utility function. It writes our samples and labels into an in-memory protobuf-encoded sparse multi-dimensional array (AKA tensor).
Then we commit the buffer to Amazon S3. After this step is complete, we’re done with data preparation, and we can now focus on our training job.
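These two steps can be sketched as follows. The function names, bucket, prefix, and key are placeholders of mine. The AWS libraries are imported inside the function, so the URI helper can be tried without credentials. I’m assuming X is a scipy sparse float32 matrix and y a float32 label vector.

```python
import io

def s3_uri(bucket, prefix, key):
    """Build the S3 URI where a data set file will live."""
    return "s3://{}/{}/{}".format(bucket, prefix, key)

def write_dataset_to_protobuf(X, y, bucket, prefix, key):
    """Serialize (X, y) to protobuf in memory, then upload the buffer to S3."""
    # Lazy imports: these need the SageMaker SDK, boto3, and AWS credentials.
    import boto3
    import sagemaker.amazon.common as smac

    buf = io.BytesIO()
    smac.write_spmatrix_to_sparse_tensor(buf, X, y)
    buf.seek(0)  # rewind the buffer before uploading it
    obj_key = "{}/{}".format(prefix, key)
    boto3.resource("s3").Bucket(bucket).Object(obj_key).upload_fileobj(buf)
    return s3_uri(bucket, prefix, key)

# Hypothetical bucket and prefix, shown only to illustrate the resulting URI.
print(s3_uri("my-bucket", "fm-movielens/train", "train.protobuf"))
```

Calling the function once for the training set and once for the test set gives you the two S3 URIs that the training job needs.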
Troubleshooting tips for training
- Are both samples and labels float32 values?
- Are samples stored in a sparse matrix (not a numpy array or anything else)?
- Are labels stored in a vector (not any kind of matrix)?
- Is write_spmatrix_to_sparse_tensor() undefined? It was added in SDK 1.0.2, and you might need to upgrade the Amazon SageMaker SDK. See the note below.
Note: Upgrading to the latest Amazon SageMaker SDK.
- Open your notebook instance.
- On your instance, open a Jupyter terminal.
- Activate the Conda environment where you’d like to upgrade the SDK, for example:
Our training set in Amazon S3 is only 5.5MB. Sparse matrices FTW!
Running the training job
Let’s start by creating an Estimator based on the FM container available in our AWS Region. Then, we have to set some FM-specific hyperparameters (full list in the documentation):
- feature_dim: the number of features in each sample (2,625 in our case).
- predictor_type: ‘binary_classifier’ is what we’re going to use.
- num_factors: the common dimension for the user and item matrices (as explained in the example at the start of the post).
The other ones used here are optional (and quite self-explanatory).
Finally, let’s run the training job. Calling the fit() API is all it takes, passing both the training and test sets hosted in S3. Simple and elegant.
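Put together, the configuration and training call could look like the sketch below. It’s written against the argument names of version 2 of the SageMaker Python SDK (older releases named some of these arguments differently). The train() wrapper, the instance type, and the values of num_factors and mini_batch_size are my own assumptions. Only feature_dim, predictor_type, and the 50 epochs come from this post.

```python
# Values other than feature_dim, predictor_type, and epochs are illustrative.
HYPERPARAMETERS = {
    "feature_dim": 943 + 1682,            # 2,625 one-hot columns
    "predictor_type": "binary_classifier",
    "num_factors": 64,                    # assumption
    "mini_batch_size": 1000,              # assumption
    "epochs": 50,
}

def train(role, train_uri, test_uri, output_uri, instance_type="ml.c4.xlarge"):
    """Create an FM Estimator, set hyperparameters, and launch training."""
    # Lazy imports: these need the SageMaker SDK and AWS credentials.
    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator

    session = sagemaker.Session()
    image = image_uris.retrieve("factorization-machines", session.boto_region_name)
    fm = Estimator(
        image_uri=image,
        role=role,
        instance_count=1,
        instance_type=instance_type,
        output_path=output_uri,
        sagemaker_session=session,
    )
    fm.set_hyperparameters(**HYPERPARAMETERS)
    # The channel names match the two data sets uploaded to S3.
    fm.fit({"train": train_uri, "test": test_uri})
    return fm
```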
A few minutes later, training is complete. We can check out the training log either in the Jupyter notebook or in Amazon CloudWatch Logs (in the /aws/sagemaker/trainingjobs log group).
After 50 epochs, test accuracy is 71.6% and the F1 score (a typical metric for a binary classifier) is 0.75 (1 indicates a perfect classifier). Not great, but with all that sparse matrix and protobuf excitement, I didn’t spend much time tuning hyperparameters. Surely you can do better.
[01/29/2018 13:42:41 INFO 140015814588224] #test_score (algo-1) : binary_classification_accuracy
[01/29/2018 13:42:41 INFO 140015814588224] #test_score (algo-1) : 0.7159
[01/29/2018 13:42:41 INFO 140015814588224] #test_score (algo-1) : binary_classification_cross_entropy
[01/29/2018 13:42:41 INFO 140015814588224] #test_score (algo-1) : 0.581087609863
[01/29/2018 13:42:41 INFO 140015814588224] #test_score (algo-1) : binary_f_1.000
[01/29/2018 13:42:41 INFO 140015814588224] #test_score (algo-1) : 0.74558968389
We have one last step to cover: model deployment.
Deploying the model
All it takes to deploy the model is a simple API call. In the old days (6 months or so ago), this would have required quite a bit of work, even on AWS. Here, just call deploy().
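As a minimal sketch, assuming you hold a trained estimator object: the deploy_model() wrapper and the instance type are my own choices, and the only SDK call involved is deploy().

```python
def deploy_model(estimator, instance_type="ml.c4.xlarge"):
    """Deploy a trained estimator to a single-instance endpoint."""
    return estimator.deploy(initial_instance_count=1, instance_type=instance_type)
```

The returned predictor object is what you call predict() on.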
We’re now ready to invoke the model’s HTTP endpoint thanks to the predict() API. The format for both request and response data is JSON, which requires us to provide a simple serializer to convert our sparse matrix samples to JSON.
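A serializer for this purpose can be very small. The sketch below assumes the dense JSON request layout (an "instances" list of "features" arrays) and takes rows as plain Python lists. How you attach it to the predictor depends on your SDK version, so I’ve left that part out.

```python
import json

def fm_serializer(rows):
    """Convert dense feature rows into a JSON request body."""
    return json.dumps({"instances": [{"features": list(row)} for row in rows]})

# Two tiny made-up rows; real samples would hold 2,625 values each.
body = fm_serializer([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
print(body)
```

With a scipy sparse matrix, converting a slice with toarray().tolist() first gives rows of plain Python floats that json can handle.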
We’re now able to classify any movie for any user. Just build a new data set, process it the same way as the training and test sets, and use predict() to get results. You should also experiment with different prediction thresholds (set the prediction to 1 above a given score and to 0 under it) and see which value gives you the most effective recommendations. The MovieLens data set also includes movie titles, so there’s plenty more to explore.
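Here is one way to experiment with thresholds offline, assuming you have collected a score for each test sample along with its true label. The helper names, scores, and labels below are all made up for illustration.

```python
def apply_threshold(scores, threshold):
    """Map scores to 1 (recommend) at or above the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]

def f1_score(labels, predicted):
    """F1 = 2 * precision * recall / (precision + recall)."""
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up scores and labels, for illustration only.
scores = [0.91, 0.62, 0.48, 0.30, 0.75]
labels = [1, 1, 0, 0, 0]

for threshold in (0.4, 0.5, 0.7):
    predicted = apply_threshold(scores, threshold)
    print(threshold, predicted, round(f1_score(labels, predicted), 3))
```

A higher threshold trades recall for precision, so the best value depends on how costly a bad recommendation is for your application.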
Built-in algorithms are a great way to get the job done quickly, without having to write any training code. There’s quite a bit of data preparation involved, but as you saw in this blog post, it’s key to making very large training jobs fast and scalable.
If you’re curious about other Amazon SageMaker built-in algorithms, here are a couple of previous posts:
In addition, if you’d like to know more about recommendation systems, here are a few resources you may find interesting.
- “Two Decades of Recommender Systems at Amazon.com” — Research paper
- Amazon DSSTNE: Deep Scalable Sparse Tensor Network Engine — GitHub
- “Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE” — AWS blog
- “A quick demo of Amazon DSSTNE” — YouTube video
- “Using MXNet for Recommendation Modeling at Scale (MAC306)” — AWS re:Invent 2016 video
- “Building Content Recommendation Systems Using Apache MXNet and Gluon (MCL402)” — AWS re:Invent 2017 presentation
As always, thank you for reading. Happy to answer questions on Twitter.
Several of my AWS colleagues provided excellent advice as well as debugging tips, so please let me thank Sireesha Muppala, Yuri Astashanok, David Arpin, and Guy Ernest.
About the Author
Julien is the Artificial Intelligence & Machine Learning Evangelist for EMEA. He focuses on helping developers and enterprises bring their ideas to life. In his spare time, he reads the works of JRR Tolkien again and again.