Using collaborative filtering on Yelp data to build a recommendation system in Amazon Neptune
“I’m hungry. Where should I go to eat?” It’s one of the most common questions we ask ourselves every day, and when you’re going out to spend money somewhere, you don’t want to simply pick a random place and try it—you want some sort of assurance that the restaurant you choose matches what you’re looking for. Ideally, you follow the suggestion of a trusted friend who knows your tastes, but what do you do when that isn’t an option? For the last decade, people have increasingly been turning to apps like Yelp, Foursquare, or OpenTable, which gather the collective opinions of hundreds and thousands of diners and distill this information into personalized recommendations for each end-user in a process called collaborative filtering.
Amazon Neptune is a fast, reliable, fully managed graph database service that makes it easy to build applications that work with highly connected datasets. In this post, we use Neptune to ingest and analyze the Yelp Open Dataset, which contains a subset of business, review, and user data from real Yelp users and businesses. After reading this post, you should feel comfortable working with Neptune to build a recommendation engine using collaborative filtering.
To launch a Neptune and Amazon SageMaker instance to follow along with this post, see Analyze Amazon Neptune Graphs using Amazon SageMaker Jupyter Notebooks. The post provides an AWS CloudFormation template that works in several Regions.
The Yelp dataset is fairly large, and the author for this post used an r5.12xlarge instance while testing these queries on the Yelp dataset. Be aware that you incur charges by launching this stack.
Formatting and loading data with Neptune
Amazon Neptune supports a variety of data formats when loading data. For this post, you query the data using Gremlin, a graph traversal and query language developed as part of the open-source Apache TinkerPop project, so you need to load the data using the CSV format for property graphs. For more information about data formats, see Load Data Formats. After you format the data, you use the Neptune bulk loading API to load the CSV files. For more information about bulk loading, see Using the Amazon Neptune Bulk Loader to Ingest Data.
The stack includes some Jupyter notebooks, which include Python utility libraries for bulk loading data, querying the Neptune endpoints for loading errors, and launching a Gremlin console for running queries inside of a browser terminal window. This significantly simplifies the process, which otherwise would require additional infrastructure steps around creating Amazon Simple Storage Service (Amazon S3) VPC endpoints for Neptune to use. This post first looks at loading the Yelp business data file to completion before moving on to the other Yelp data files.
In the end, your Neptune graph will contain the following:
- Business vertices.
- User vertices.
- Review edges, which traverse from a user to a business. Edges usually represent verbs (for this use case, “user review businesses”).
- Friend edges, which create a relationship between users. Because edges in graph databases are directional and Yelp users are friends which each other in both directions, you need to create two friend edges for each relationship, one in each direction (from UserA to UserB and from UserB to UserA). For this use case, this happens naturally if both users are in the dataset and UserA’s friends array would contain UserB and UserB’s friends array would contain UserA, so you don’t need to do anything special in the convertor code. The verb for this edge is “users befriend users”.
The Yelp dataset is delivered in JSON format with a similar structure for every business. See the following example data:
The following Python 3 script processes this JSON file and converts it into the CSV format that Neptune is expecting. You don’t need every single property for your sample recommendation engine.
The preceding script has a bug (it doesn’t sanitize the address field like it does with the business name field), which leads to problems importing the data. This was a real bug found while developing this post; we left it in to demonstrate debugging and correcting data loads. This post also includes a second version of the script that fixes the bug.
The following code is the first two lines the script outputs:
You should keep in mind the following:
- Vertices always need an ID column that uniquely identifies the vertex. For this use case, the Yelp dataset already includes unique IDs for businesses, reviews, and users, which you can use.
- Vertices always need a label column identifying the type of the vertex. For this use case, you use the strings
- The default data type for properties unless otherwise specified is a list of strings. These strings should be surrounded by double quotes. If more than one entry is in the list (as in business categories), the entries must be separated by a semicolon.
- Other data types can be specified in the CSV header. This use case uses the stars rating for a business, which is rounded to the nearest 0.5 (for example, 3.0, 3.5, 4.0, 4.5).
To format and load your data, complete the following steps:
- Copy this file to an S3 bucket in your AWS account.
- On the Amazon SageMaker console, under Notebook, choose Notebook instances.
- Choose your notebook.
- Choose Open Jupyter and load the CSV file into Neptune.
- In Jupyter, from the New drop-down menu, choose conda_python3.
You now enter and run several commands in the notebook. You first load the Neptune utility library, which allows Jupyter to interact with Neptune. Then you bulk load the yelp_business_nodes.csv file from your S3 bucket.
If you attempt to use the bucket as shown in this post, it fails; the Yelp Dataset license doesn’t allow us to distribute the dataset, so you need to download it and make the transformation in your own S3 bucket.
- Enter the following code:
- Choose Run.
Inside that curl command, you see the endpoint of your Neptune cluster and a status endpoint specific to this bulk load request. (for this post,
- Record this endpoint to use later.
When the bulk load finishes, an exception is thrown from the bulk loader that tells you the status is
LOAD_FAILED with one
parsingError. See the following screenshot.
Jupyter notebooks allow you to run bash commands, and you want to use curl to hit the loader API endpoint from earlier while appending the query parameters
“?details=true&errors=true”. This tells Neptune to show you the error details from the bulk load that caused the exception.
- Enter the following code in the Jupyter command cell:
The following screenshot shows the output.
The response shows that an invalid character is in line number 108514 of the CSV file. See the following code:
The address column for this row contains an embedded double quote:
"30 Brisdale Rd, Bldg "C", Building C", which makes the CSV format invalid. You could change the code to make sure any such entries are properly escaped, but for this use case, you can just strip out any single or double quotes from the business names during the JSON to CSV transformation by using Python’s
replace() function. After you make this change, the following is the correct Python code for this transformation:
Before you upload the corrected CSV file to Amazon S3 and perform the bulk load again, you first need to tell Neptune to empty out the vertices you already loaded so you can load the CSV file again.
clear(), which is a utility function included in the Neptune Python utility library mentioned earlier.
See the following screenshot for the output.
- Reload the file and perform the bulk load again.
You should now get a successful response with no errors shown. See the following screenshot.
- To verify there were no errors, query the loader API endpoint again, using the specific load ID from your previous load. See the following screenshot.
Now that you know the process, you can follow it for the other three entity types using the following code (this is the correct version, without bugs). Regarding edge data, keep in mind the following:
- You convert the review’s date property to ISO 8601 format, which is important later when you write your queries.
- Edges also need unique ID values, just like vertices do. For review edges, you can use the unique
review_idprovided by the Yelp dataset. Because there is no unique ID representing the relationship between two friends, you need to construct one. You can join the two user IDs with the string
friendbetween them. For example,
ntlvfPzc8eglqvk92iDIAw-friend-oeMvJh94PiGQnx_6GlndPQ, and the relationship in the other direction is
- Repeat the preceding steps for the remaining entity types. Use the following code:
- Upload those files to your S3 bucket and run the bulk load for all three files.
If you have the CSV files of edges and vertices in a single bucket, Neptune detects which is which and loads them in the correct order. Neptune also allows you to queue ingest jobs using the
queueRequest parameter. For more information about data formats, see Neptune Loader Command.
When debugging the ingest phase, load each file one at a time so you can more easily detect if one file has a problem. Therefore, you need to load the vertices before the edges, because the edges reference a relationship between two vertices and if the vertices don’t already exist, the edge fails to load. The
neptune.bulkLoad() is a helper function from the Python utility library. See the following code:
If you worked for Yelp and had access to the full dataset of all users, businesses, and reviews, you could actually load the full graph. However, this is a sample dataset, which means reviews might exist in the dataset but the associated users or businesses do not. The same is true for friend relationships. For example, UserA might have UserB in the friends array, but UserB isn’t in the dataset. Because of this, you see
FROM_OR_TO_VERTEX_ARE_MISSING errors from the edge
bulkLoad() calls with an error message such as
"Either from vertex, 'dxCce2EdLOAy78Y3fAXCgw', or to vertex, 'XXuS66qQ8np3D3_j_7YMAQ', is not present." See the following screenshot.
You could have your Python code check for these dangling edges and remove them from the CSV files, but the Python utility bulk loading libraries default the
failOnError flag to
FALSE. The bulk loader lets us know what fails, so for this use case it’s easier to pass them through to the loader and ignore these errors because there’s nothing you can do about it with the incomplete dataset.
Running queries against Neptune
Now that you have ingested your dataset into Neptune, you can start querying that data. For this post, we use Gremlin queries to explore the data. You can run these queries directly from the Jupyter notebook. See the following code:
The following screenshot shows the output.
You may prefer using the Gremlin console when developing queries instead of the Jupyter notebook. Fortunately, this stack includes a web-based Linux terminal and installs the Gremlin console for you.
- In Jupyter, from the New drop-down menu, choose Terminal.
When you’re in the terminal, you need to change to the /home/ec2-user/neptune-tools/apache-tinkerpop-gremlin-console-3.4.1 directory and execute bin/gremlin.sh to start the console.
- Tell the console to connect to your remote Neptune instance with the following two commands:
Your console is now ready to run queries. See the following screenshot.
Exploring the data
This section demonstrates how to explore your data with several different queries and determine the following:
- How many of each entity type is in the dataset?
- Which cities in the database have the most businesses?
- Which cities have the most reviews?
- Who are the top five reviewers in the city with the most reviews?
Number of entity types
To determine how many of each entity type is in the dataset, see the following code:
The output tells you that the dataset has roughly 200,000 business vertices, 2 million user vertices, 8 million review edges (which link user and business vertices) and 19 million friend edges (which link user vertices to other user vertices).
Cities with the most businesses
To determine which cities in the dataset have the most businesses, see the following code:
The results show that Las Vegas is the clear winner, with 31,556 businesses. However, the fact that it has a lot of businesses in the dataset doesn’t mean it has the largest user base or set of reviews for you to analyze.
Cities with the most reviews
The query to determine which cities have the most reviews traverses the graph between vertices. If I’m sitting on an edge, I can never remember which way the
outV() predicates take me in the graph, so we can pick a random review edge and try it both ways to see what happens. See the following code:
g.E() is generally frowned upon because there could be many types of edges in your graph with counts numbering into the billions. You should start your traversal with
g.V(), rewriting the previous two queries (and all future queries) to start with vertices instead of edges. See the following code:
If you’re sitting on a review edge,
inV() takes you to the business vertex for that review, and
outV() takes you to the vertex of the user who wrote that review.
One way to approach the original question (which cities in the dataset have the most reviews) is to start with the set of all edges and see which cities have the most reviews. See the following code:
This query returned an error indicating that some of the vertices are missing the city property. If you want to find the businesses that don’t have a city property and clean up your dataset, you could find them using the
hasNot() operator. See the following code:
Because there are only two such businesses, you can filter them out of the query by adding a
.has(‘city’) step after traversing to the business vertices. See the following code:
Las Vegas is also the clear winner on number of reviews.
Now that you know Las Vegas has the most reviews, you can focus your analysis there as you drill down.
Top five reviewers
For this use case, you need to pick a user and generate recommendations for them. It’s hard to make informed recommendations for users that don’t have reviews because you can’t judge what they like and don’t like, so you need to find a user that has written several reviews.
To determine which users in this dataset are the top five reviewers in Las Vegas, you must build the query one step at a time.
First, find all the businesses in Las Vegas. See the following code:
That’s reassuring because 31,556 is the same number you got from the
Next, find all the reviews for these Las Vegas businesses with the following code:
Again, the result matches the previous
Lastly, find the top users who wrote reviews in Las Vegas by grouping and counting by the user represented in the review. You do that by starting with all the reviews from Las Vegas restaurants and stepping to the user vertices, where you can group by the user’s profile URL and get the count for the top five users with the most reviews. See the following code:
The first result shows a user with ID
bLbSNkLggFnqwNNzzq-Ijw, who wrote 2,278 reviews in Las Vegas. This is the kind of super user you want when testing your recommendation engine because you have a lot of indicators of what they like.
Building out the recommendation engine
Now that you know the ID of a user with lots of reviews, you want to build a recommendation list for them. Start your analysis on that specific user vertex. See the following code (note that names have been changed to protect the innocent):
Because you use these
valueMap() options repeatedly, note the following:
- This user has a very large friends array. The
valueMap()step allows you to specify which keys to include in the results, but you don’t have an option to include all keys except for a select few. You can accomplish this by using a filter step to include everything except the
friendskey (you could also add more entries to the
without()step if you needed to exclude more than just friends).
with(WithOptions.tokens)adds the ID and label properties to the output.
- For the remainder of this post, we refer to this user as Jane (names have been changed to protect the innocent).
You can ask a few simple questions of Jane’s data, like how many reviews they have written. See the following code:
2,555 is slightly larger than the 2,278 you saw in the previous query, because the previous query only counted reviews in Las Vegas and Jane also has some reviews outside of Las Vegas.
You can calculate a histogram of Jane’s star ratings. The following code returns the key names in descending order:
You can deduce that Jane doesn’t often give 5-star reviews, and you have a large set of information about what Jane likes given the total number of 4- and 5-star reviews. This gives you good information for the recommendation engine you’re about to build.
You can also count Jane’s 4- and 5-star reviews by using
has(). Both give the same answer, but for simple cases when just comparing a property and a value, you should use
where() step allows a traversal to be used inside its body, which
has() does not, and there’s no guarantee a query planner will optimize it down to the same exact pattern that
has() creates. See the following code:
Your basic algorithm for building out the recommendation list is as follows:
- Find Jane’s 5-star reviews to know which businesses they liked. In later queries, you can see where we add in an
aggregate()step to save the list into a temporary variable
myLikes. This is so you can refer to this variable later in the query.
- Find all the users who also gave 4- or 5-star reviews to the same businesses that Jane gave 5 stars to.
- Some of those users from Jane’s like-minded group might have liked more than one of the same business that Jane did, so you also deduplicate the list.
- From a graph point of view, Jane is also in the like-minded group, so you need to remove their vertex from the list.
- Take the top 100 people from Jane’s like-minded group and find the top 10 places that the group liked that Jane hasn’t been to, and sort that list by average star rating. You use the
myLikesaggregated set (also known as a bulk set) to filter out the businesses Jane already reviewed from the businesses Jane hasn’t reviewed yet. This is your master recommendation list, which you further refine later in this post.
The following code builds the recommendation list:
The following code shows the first business on that list:
This breakfast and brunch spot in Las Vegas averages 5 stars, the highest rating possible. It appears we’re on the right track!
From this point you could make several further refinements to your recommendation list:
- Only return businesses with
Coffee & Teaas a category for when Jane wants a dose of caffeine.
- Only return highly-rated businesses that have over 100 reviews. This sometimes helps to reduce the noise of businesses that only have 1 or 2 reviews posted by friends of the owner.
- Only return businesses that are open for business right now (we omitted operating hours from the dataset to save space when loading the Yelp dataset; you would need to add it back in).
- Filter the list using geospatial boundaries, for example, if Jane wants a coffee shop within walking distance. This would best be achieved by dual-ingesting the data into both Neptune and Amazon OpenSearch Service, a secondary search indexing system that supports geospatial queries.
- Only show businesses that opened recently, for example if Jane is interested in trying new places.
These refinements aren’t mutually exclusive. For example, Jane might want to find a coffee shop within walking distance that opened within the last 6 months.
We leave the first four query refinements for you to experiment with. Finding businesses that opened recently presents a special challenge because there isn’t any data in the business vertex to tell you when the business was started. However, you can find the earliest review for that business and use it as a proxy for the opening date of the business. In other words, if the first review for that business was 6 months ago, you can infer that the business opened in the last 6 months because very few businesses would be open for years without receiving a single Yelp review. Each review has a
date property in ISO 8601 format. ISO 8601 format is lexicographically sortable, so to find reviews that came in 2020 or later, look for reviews where the
date string is greater than
However, for this use case, the Yelp dataset uses data from 2019. To determine which of Jane’s recommendations opened in 2019, see the following code:
The Bronze Cafe, Water Grill, and Tatsujin X are some new restaurants that Jane might want to try.
If you launched the CloudFormation stack to follow along with this post, don’t forget to delete that stack to avoid incurring unnecessary costs.
Neptune is a very powerful managed graph database service that can help you quickly build a recommendation system like the one we built in this post using collaborative filtering on restaurant review data.
If you want to generate recommendations from your data, get started today!
About the Author
Chad Tindel is a DynamoDB Specialist Solutions Architect based out of New York City. He works with large enterprises to evaluate, design, and deploy DynamoDB-based solutions. Prior to joining Amazon he held similar roles at Red Hat, Cloudera, MongoDB, and Elastic.