AWS Database Blog
Make relevant movie recommendations using Amazon Neptune, Amazon Neptune Machine Learning, and Amazon OpenSearch Service
Content streaming platforms and video-on-demand services have become the go-to for many movie viewers today. If you use a content streaming service, you know how important it is to find movies that interest you. Many services offer expansive catalogs of movie content, which makes recommendations essential for surfacing the titles most relevant to you. An effective movie recommendation system improves the viewer experience and drives users to new and exciting content.
In this post, we discuss a design for a highly searchable movie content graph database built on Amazon Neptune, a managed graph database service. We demonstrate how to build a list of relevant movies matching a user’s search criteria through the powerful combination of lexical, semantic, and graphical similarity methods using Neptune, Amazon OpenSearch Service, and Neptune Machine Learning. To find matches, we compare movies by text similarity as well as by similarity of their vector embeddings, which we build using both sentence and graph neural network (GNN) models.
Solution overview
We explore a solution for combined lexical, semantic, and graph search in two parts. First we walk through the end user’s search experience; then we discuss populating our search data stores.
End-user search experience
The following image illustrates the solution from the end-user search perspective.
The user first initiates a request to an OpenSearch Service domain, passing in a search term (most likely a movie title) as input. On the user’s behalf, we then run three different types of search and combine the results into the strongest final response:
- Lexical: We find movies whose titles are most lexically, or “textually”, similar to the title being searched. For example, Poseidon and Poseidon Adventure are lexically similar movies you could expect to be returned from the search input Poseidon.
- Semantic: We search for movies with titles that, even if lexically dissimilar, have similar semantic meaning to that of the movie being searched. For example, Poseidon and Zeus have semantically similar titles; both titles refer to mythological gods. To facilitate semantic search, we encode movie titles as vector embeddings that represent a title as a list of numbers. Mathematically, movies are semantically similar if their vectors are close to each other. Building a semantic search in OpenSearch is a helpful introduction to working with vector embeddings and semantic similarity search.
- Graphical: We find movies with similar network relationships to neighboring node types. For example, we find movies with genres and actors similar to those of the movie being searched. This search also uses vector embeddings, finding close matches for movies with similar neighborhoods. We can then query the Neptune graph to locate the matching movies and traverse their relationships to learn more about them and further explain their similarities. As we’ll discover, Poseidon and Dragonball: Evolution are graphically similar, although their titles are markedly different.
Data population
The following diagram illustrates how the data is populated.
In this solution, the primary data store is a movie content graph in Neptune. It is populated from sources such as IMDb, Rotten Tomatoes, and Wikidata. Our demo uses only IMDb data. The OpenSearch Service domain contains a copy of the movie content data in a searchable text index. The domain also contains embeddings: vector representations of movies that support nearest-neighbor search query patterns.
There are two types of embeddings, which we produce using Amazon SageMaker:
- Semantic embeddings, which represent the text attributes of the movie, notably its title. To turn these features into embeddings, we use a Bidirectional Encoder Representations from Transformers (BERT) model. The BERT model consumes the text features as input and outputs a vector representation of them in the form of an embedding. These embeddings enable semantic matching of movie-related text.
- Graph embeddings, which represent the movie’s surrounding context and its general graph neighborhood. We use Neptune Machine Learning (Neptune ML) to produce embeddings based on GNNs. These embeddings enable matching of graphically similar movies based on comparison of their embeddings alone. For more background on how GNNs produce graph-aware embeddings, explore Models and Model Training in Amazon Neptune.
Prerequisites
To set up this solution, you need an AWS account with permission to create resources such as a Neptune cluster, an OpenSearch Service domain, and SageMaker resources. Running this solution will incur charges; refer to the pricing guides for Neptune, OpenSearch Service, and SageMaker for specific pricing details. We execute the following steps to build our movie search capabilities:
- Provision resources: Neptune cluster, OpenSearch Service domain, SageMaker environment, and notebook.
- Populate movie content graph.
- Create and populate the OpenSearch Service indices: lexical, semantic, and graph.
- Explore search results.
Provision resources
Follow the setup instructions to create a Neptune cluster, Neptune Workbench notebook instance, OpenSearch Service domain, and other resources. On the Neptune Workbench notebook instance, open Jupyter and run three notebooks, shown in the next figure.
Populate movie content graph
In the Jupyter notebook window, open the notebook 01-PopulateAndExploreNeptune.ipynb. Follow the instructions to:
- Bulk-load the movie content data into the Neptune database. Use the Neptune bulk loader to load movie data held in a public Amazon Simple Storage Service (Amazon S3) bucket. This data is for demonstration purposes only.
- Run openCypher graph queries to explore and get familiar with the data.
The movie content graph is modeled as a Labeled Property Graph whose structure is shown in the next figure.
In the figure, circles represent types of nodes. Arrows are edges connecting nodes. Here is a summary of the model:
- At the heart of the model is the movie node. Nodes with this label have the properties year, averageRating, runtime, numVotes, and title.
- A node labeled Artist is a person who contributes to a movie. Properties include name, birthYear, and optionally deathYear.
- There are edges labeled actor, writer, producer, director, and actress. These represent artist contributions to movies. Notice that edge direction moves from movie to Artist. An edge labeled actor from a movie node to an Artist node indicates the artist is an actor in that movie.
- A movie can have zero or more genres. There is a node labeled Genre with a genre property. Example values for this property include comedy and drama. To indicate that a specific movie has that genre, we draw an edge labeled genre from the movie to the Genre node.
- The node labeled Person represents a movie fan who rates movies and follows artists. There are edges labeled rated and follows that connect a person to a movie and to an artist, respectively. The rated edge has the property rating, indicating the numeric rating the person gave the movie. Properties of a person are birthDay, firstName, lastName, and others. When one person knows another person, we draw an edge labeled knows between them, with creationDate indicating when that relationship began.
- The node labeled Place represents a geographical location in which a person is located. The edge labeled isLocatedIn connects a person to their place. A place can be part of another place, indicated by the edge labeled isPartOf connecting one place to another.
Populate OpenSearch indices
In Jupyter, open the notebook 02-PopulateOpenSearch.ipynb
. Run the steps to create indices, synchronize content from the Neptune database to your OpenSearch Service domain, and create and ingest embeddings.
Create OpenSearch indices
We create three indices in the OpenSearch Service domain:
- movie: A lexical index whose documents represent movie nodes. The document ID is the movie node ID. Attributes are title, averageRating, runtime, year, and numVotes, as well as two kNN embeddings: sentence_embedding and gnn_embedding. We don’t perform searches on these embeddings in this index, but carry them so they can be returned as output from queries.
- movie_sentence: A kNN index with the same structure as movie, but intended to support vector similarity search on sentence_embedding.
- movie_gnn: A kNN index with the same structure as movie, but intended to support vector similarity search on gnn_embedding.
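As an illustration, the following is a minimal sketch of how one of the kNN indices might be created with the opensearch-py client. The aos_client variable and the embedding dimensions come from this post; the exact settings and mapping are assumptions, not the notebook’s exact code.

```python
# A sketch of creating the movie_sentence kNN index with opensearch-py.
# aos_client is an opensearchpy.OpenSearch client connected to the domain.
knn_index_body = {
    "settings": {"index": {"knn": True}},  # enable k-NN search on this index
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "year": {"type": "integer"},
            "averageRating": {"type": "float"},
            "runtime": {"type": "integer"},
            "numVotes": {"type": "integer"},
            # 384-dimensional sentence embedding (all-MiniLM-L6-v2)
            "sentence_embedding": {"type": "knn_vector", "dimension": 384},
            # 64-dimensional Neptune ML GNN embedding
            "gnn_embedding": {"type": "knn_vector", "dimension": 64},
        }
    },
}

aos_client.indices.create(index="movie_sentence", body=knn_index_body)
```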
Create the sentence embeddings
We use a BERT model to create a sentence-level vectorization of movie titles. In particular, we use the all-MiniLM-L6-v2 model available in the Python sentence-transformers library. We define the get_embeddings function to use this library to create an embedding:
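A minimal sketch of such a function, assuming the sentence-transformers package is installed:

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")

def get_embeddings(sentences):
    # encode() accepts a single string or a list of strings and
    # returns the corresponding embedding vector(s)
    return model.encode(sentences)
```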
We scroll through the movies CSV file from which we bulk-loaded movie nodes to the Neptune database. For each movie title, we call the sentence model to create its respective embedding, and we save the results to sentence_embeddings.csv.
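A sketch of this step, assuming the bulk-load file movies.csv has ~id and title columns (the actual Neptune bulk-load file may use typed headers such as title:String):

```python
import pandas as pd

# Read the movie nodes we bulk-loaded to Neptune; file and column names
# here are assumptions for illustration
movies = pd.read_csv("movies.csv")

titles = movies["title"].astype(str).tolist()
embeddings = get_embeddings(titles)  # one 384-dimensional vector per title

# Write one row per movie: the node ID plus a semicolon-separated vector
pd.DataFrame({
    "~id": movies["~id"],
    "embedding:vector": [";".join(str(x) for x in vec) for vec in embeddings],
}).to_csv("sentence_embeddings.csv", index=False)
```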
Here is an example snippet of that file. It has two columns: ~id is the movie node ID; embedding:vector is a semicolon-separated list of numbers constituting the embedding. The dimension of these embeddings is 384.
Create the GNN embeddings
We use a SageMaker pipeline to generate GNN embeddings for the movies. The first step is to create the pipeline.
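A sketch of the create call with boto3; the pipeline name, bucket, object key, and role ARN below are placeholders:

```python
import boto3

sm = boto3.client("sagemaker")

# Register the pipeline from a definition stored in S3; all names and
# locations here are hypothetical placeholders
sm.create_pipeline(
    PipelineName="movie-gnn-embedding-pipeline",
    PipelineDefinitionS3Location={
        "Bucket": "my-pipeline-bucket",
        "ObjectKey": "pipelines/gnn-embedding-pipeline.json",
    },
    RoleArn="arn:aws:iam::111122223333:role/MySageMakerPipelineRole",
)
```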
The code that drives the pipeline resides in the S3 bucket at the location shown (PipelineDefinitionS3Location). The pipeline runs under an IAM role (RoleArn) that has access to input and output data and can create SageMaker resources.
Next, we start the pipeline.
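A sketch of the start call; the S3 URIs and instance types are placeholder values:

```python
# Start an execution of the pipeline; parameter values are placeholders
execution = sm.start_pipeline_execution(
    PipelineName="movie-gnn-embedding-pipeline",
    PipelineParameters=[
        {"Name": "inputDataS3Location", "Value": "s3://my-bucket/source-csv/"},
        {"Name": "processedDataS3Location", "Value": "s3://my-bucket/processed/"},
        {"Name": "trainModelS3Location", "Value": "s3://my-bucket/model/"},
        {"Name": "embeddingS3Location", "Value": "s3://my-bucket/embeddings/"},
        {"Name": "embeddingDimension", "Value": "64"},
        {"Name": "model", "Value": "rgcn"},
        {"Name": "processingInstanceType", "Value": "ml.m5.2xlarge"},
        {"Name": "trainingInstanceType", "Value": "ml.g4dn.2xlarge"},
    ],
)
```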
There are several important parameters:
- inputDataS3Location is where our source data resides. Our source data is the set of S3 CSV files that we used to bulk load the Neptune database.
- processedDataS3Location is the S3 output folder where the pipeline records its processing results. Processing is one of two main steps in the pipeline; the other is training. In processing, the pipeline prepares the source data for training by converting it to numeric vectors, normalizing it, and dividing the data into training and test sets.
- trainModelS3Location is the S3 output folder for the training step of the pipeline. Training considers node structure plus neighboring nodes to learn a model from which it produces embeddings.
- embeddingS3Location is the S3 location where the training step writes CSV files containing the embeddings. We look at the structure of these files presently.
- embeddingDimension is the dimension of the embeddings for training to generate. We choose 64, a suitable value balancing processing time and expressivity of embeddings.
- model is the model type. We use rgcn, the recommended default. A relational graph convolutional network (R-GCN) is a model that considers the properties of nodes together with node relationships when building embeddings.
- processingInstanceType is the SageMaker instance type to use for the processing step. Before starting the pipeline, check SageMaker quotas to ensure you have available instances of this type.
- trainingInstanceType is the SageMaker instance type to use for the training step. Again, check SageMaker quotas to ensure you have available instances of this type.
The pipeline runs asynchronously. We check its status and wait for the pipeline to complete before continuing.
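A sketch of such a status check, polling with the execution ARN returned by start_pipeline_execution above:

```python
import time

# Poll until the pipeline execution reaches a terminal state
execution_arn = execution["PipelineExecutionArn"]
while True:
    status = sm.describe_pipeline_execution(
        PipelineExecutionArn=execution_arn
    )["PipelineExecutionStatus"]
    if status not in ("Executing", "Stopping"):
        break
    time.sleep(60)

print(status)  # "Succeeded" when processing and training have completed
```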
Training produces separate CSV files, one for each node label, but we use only the file for Movie. The CSV format is much like that of the sentence embedding file.
Populate OpenSearch indices
We populate the three indices using the bulk() method of the OpenSearch client aos_client. We pass records in chunks of, say, 10,000. Each record has a structure like the following.
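A sketch of the record structure and bulk load; the field values shown are illustrative placeholders, not actual data from the dataset:

```python
from opensearchpy import helpers

# movie_rows would be built by joining the bulk-load CSV with the two
# embedding CSVs; a single illustrative row:
movie_rows = [{
    "~id": "tt0409182",
    "title": "Poseidon",
    "year": 2006,
    "averageRating": 5.7,               # illustrative value
    "runtime": 98,
    "numVotes": 100000,                 # illustrative value
    "sentence_embedding": [0.0] * 384,  # real vectors come from the CSVs
    "gnn_embedding": [0.0] * 64,
}]

# One action per (index, movie) pair; the same record goes to each index
def to_actions(rows):
    for row in rows:
        for index_name in ("movie", "movie_sentence", "movie_gnn"):
            yield {
                "_index": index_name,
                "_id": row["~id"],  # movie node ID from Neptune
                "title": row["title"],
                "year": row["year"],
                "averageRating": row["averageRating"],
                "runtime": row["runtime"],
                "numVotes": row["numVotes"],
                "sentence_embedding": row["sentence_embedding"],
                "gnn_embedding": row["gnn_embedding"],
            }

# helpers.bulk wraps the client's bulk() method and chunks the requests
helpers.bulk(aos_client, to_actions(movie_rows), chunk_size=10000)
```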
We load the same record to each of the three indices. The _index attribute specifies the index to use. The ID of the document to be loaded is the movie node ID. The fields are node properties (title, year, averageRating, runtime, numVotes) plus the two embeddings: sentence_embedding and gnn_embedding. We bring this data together from three files: the CSV from which we bulk-loaded movies to Neptune plus the two embedding files.
Neptune provides a full-text search (FTS) capability that automatically syncs Neptune to OpenSearch Service using a streams-triggered polling approach. We did not adopt FTS for this design because it does not directly support embeddings, kNN indices, or semantic search. However, the bulk approach we used to populate OpenSearch Service is suitable for a movie content graph, where new content is added in incremental batches.
Perform search
In Jupyter, open the notebook 03-Search.ipynb
. Run the steps to perform various searches.
Perform lexical search
First, perform a fuzzy lexical search for Posidon, which we have deliberately misspelled.
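A sketch of such a fuzzy match query against the movie index; the exact query body in the notebook may differ:

```python
# Fuzzy lexical search on the title field; fuzziness "AUTO" tolerates
# the misspelling "Posidon"
response = aos_client.search(
    index="movie",
    body={
        "size": 10,
        "query": {
            "match": {
                "title": {"query": "Posidon", "fuzziness": "AUTO"}
            }
        },
    },
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```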
The results indicate several movies like Posidon:
Perform semantic search using sentence embeddings
Next, perform a semantic search using sentence embedding comparison. The following query finds movies semantically similar to Poseidon:
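A sketch of the kNN query, embedding the search term with the same sentence model used at ingestion time:

```python
# Embed the search term with the same model used to build the index
query_vector = get_embeddings("Poseidon").tolist()

response = aos_client.search(
    index="movie_sentence",
    body={
        "size": 10,
        "query": {
            "knn": {
                "sentence_embedding": {"vector": query_vector, "k": 10}
            }
        },
    },
)
```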
It returns the following results:
Among the matches is a movie titled Zeus. Intuitively this makes sense because both Zeus and Poseidon are gods from mythology.
Combine query results across three indices
Now we bring the three indices together in one overall search. Run the code cell under the heading Combined score to execute the following steps:
- Perform a lexical search of the title using the movie index.
- Perform a semantic search of the title using the movie_sentence index.
- Normalize scores for each match. Lexical and semantic searches produce scores on different scales, so transform each score to a number between 0 and 1 recording its importance within its search. The highest match in a result gets a score of 1, the lowest a score of 0. The normalized score is calculated as (score - min)/(max - min). In the following, The Poseidon Adventure has a local score returned from the lexical search of 10.5. Relative to the minimum and maximum scores in the lexical result, we arrive at a normalized score of 0.4544. On sentence search, the same movie has a local score of 0.87. Relative to the minimum and maximum scores in the sentence result, the movie has a normalized score of 0.588.
- For each movie returned from lexical and semantic search, obtain the GNN embedding of the movie. Run a kNN search on the movie_gnn index to find movies with similar GNN embeddings. Note that we cannot perform a top-level search on the GNN index because there is no easy way to create a GNN embedding of the search term. To find the closest GNN matches for the search term Poseidon, we find movies with GNN embeddings similar to those of the parent movies from the preceding table: The Poseidon Adventure, Zeus, Captain Conan, and the others. The score we assign to a GNN result is the average of the normalized GNN score and the normalized score of the parent.
- Combine the lexical, sentence, and GNN scores for a movie into a single combined score. The combined score is the maximum of these scores, with extra weight on GNN and sentence results. A sketch of this scoring logic follows the list.
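The following is a minimal sketch of the normalization and combination described above; the weights are illustrative assumptions, not the notebook’s exact values:

```python
# Min-max normalize raw OpenSearch scores to the [0, 1] range
def normalize(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {movie: (s - lo) / span for movie, s in scores.items()}

# Combine normalized per-index scores; the weights on sentence and GNN
# results are hypothetical values chosen for illustration
def combine(lexical, sentence, gnn, s_weight=1.1, g_weight=1.2):
    combined = {}
    for movie in set(lexical) | set(sentence) | set(gnn):
        combined[movie] = max(
            lexical.get(movie, 0.0),
            s_weight * sentence.get(movie, 0.0),
            g_weight * gnn.get(movie, 0.0),
        )
    # Highest combined score first
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```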
Here are the top results for Poseidon in descending order of combined score. The rightmost why column indicates the search indices (l for lexical, s for sentence, g for GNN) that most affected the score.
Among the results, Dragonball: Evolution, as evidenced by its why value of g, is graphically similar to Poseidon by virtue of having similar GNN embeddings. We dig into why by querying the Neptune content graph in the next section.
Another method for combined scores
OpenSearch Service recently announced support for hybrid search. As Hybrid Search with Amazon OpenSearch Service explains, score normalization and combination is a stage in a search pipeline powering a combined lexical and semantic query. We could use this approach to combine lexical and sentence searches on movies, avoiding the need to calculate scores on our own. Additionally, hybrid search is better equipped than our approach to handle differences in scoring between data nodes in an OpenSearch Service domain.
The hybrid approach does not easily accommodate our GNN search, which we cannot perform directly. Rather, for movies returned from the lexical and sentence searches, we perform a search on a separate GNN index for movies with embeddings similar to those of the lexical and sentence results. We must then incorporate the GNN scores into the combined score.
Query Neptune for more context
We can query the content graph in Neptune to explore the relationship between Poseidon (with ID tt0409182) and Dragonball: Evolution (ID tt1098327). The following openCypher query finds paths of up to three hops connecting these movies.
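A sketch of such a query, run here through the boto3 neptunedata client (in the workbench notebook, the same openCypher can be run with the %%oc cell magic); the endpoint URL is a placeholder, and the exact query in the notebook may differ:

```python
import boto3

# Placeholder endpoint; replace with your Neptune cluster endpoint
neptune = boto3.client(
    "neptunedata", endpoint_url="https://my-neptune-cluster:8182"
)

# Variable-length match: paths of one to three hops between the two movies
query = """
MATCH p = (m1:movie)-[*1..3]-(m2:movie)
WHERE id(m1) = 'tt0409182' AND id(m2) = 'tt1098327'
RETURN p
LIMIT 25
"""

result = neptune.execute_open_cypher_query(openCypherQuery=query)
print(result["results"])
```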
We can visualize the paths in the Graph view. Notice that the two movies directly share the genres Action and Adventure, plus an artist, Emmy Rossum (nm0002536).
From the graph, we can obtain basic details of the movies, genres, and cast with this query.
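A sketch of one such query, following the model described earlier (movie nodes connect to Genre nodes via genre edges and to Artist nodes via actor and actress edges); it is an illustration rather than the notebook’s exact code:

```python
# Pull title, year, rating, genres, and cast for the two movies
query = """
MATCH (m:movie)-[:genre]->(g:Genre)
WHERE id(m) IN ['tt0409182', 'tt1098327']
OPTIONAL MATCH (m)-[:actor|actress]->(a:Artist)
RETURN m.title AS title, m.year AS year, m.averageRating AS rating,
       collect(DISTINCT g.genre) AS genres,
       collect(DISTINCT a.name) AS cast
"""

result = neptune.execute_open_cypher_query(openCypherQuery=query)
```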
The results are shown in the next figure.
Clearly, querying the OpenSearch Service domain indices helps us find movies, but querying Neptune helps us explain their similarity and further explore their relationships.
Clean up
If you’re done with the solution and wish to avoid future charges, delete the Neptune cluster, SageMaker notebook instance, and OpenSearch Service domain.
Conclusion
In this post, we discussed a highly searchable movie content graph that provides end users with recommendations of movie titles that are lexically, semantically, and graphically relevant to their search criteria. Our combined-method approach, using a Neptune database and an OpenSearch Service domain with both text and kNN indices, was able to identify relevant movie suggestions that would have been overlooked by lexical or semantic search alone. To support semantic and graphical similarity matching, we created two types of embeddings: a vectorization of movie text attributes, plus a Neptune ML embedding of movie nodes based on a GNN.
For a related example, read Power recommendation and search using an IMDb knowledge graph.
About the Authors
Graham Kutchek is a Database Specialist Solutions Architect with expertise across all of Amazon’s database offerings. He is an industry specialist in media and entertainment, helping some of the largest media companies in the world run scalable, efficient, and reliable database deployments. Graham has a particular focus on graph databases, vector databases, and AI recommendation systems. Connect with him on LinkedIn.
Mike Havey is a Senior Solutions Architect for AWS with over 25 years of experience building enterprise applications. Mike is the author of two books and numerous articles. Visit his Amazon author page.