AWS for M&E Blog

Build generative AI conversational search assistant on IMDb dataset using Amazon Bedrock and Amazon OpenSearch Service

This blog demonstrates how to use Large Language Models (LLMs), Amazon Bedrock, and Amazon OpenSearch Service to create a movie recommendations chatbot. The demonstration uses the IMDb and Box Office Mojo Movies/TV/OTT licensable data package, which provides a wide range of entertainment metadata. Many Amazon Web Services (AWS) media and entertainment (M&E) customers license IMDb data through AWS Data Exchange to improve content discovery and increase customer engagement and retention. We explain how to build a retrieval-augmented generation (RAG) chatbot on top of the IMDb dataset for both exact and semantic match querying.

Background

While watching a movie or TV show, have you ever thought to yourself, “I wish I could find other movies like this one” or “In what other movies have actors from this film also appeared?” In this blog post, you will learn how to answer these questions by pairing external data sources like IMDb with LLMs in Amazon Bedrock.

This post provides a walk-through of how conversational search can be enabled for online M&E platforms to provide a friendly user experience. Recently, LLMs have shown groundbreaking results on various Natural Language Processing and Understanding (NLP/NLU) tasks. LLMs can accurately understand raw user intent and generate results within the specific context. They are also amenable to teaching through a few examples (few-shot learning) and can answer questions grounded in knowledge bases through RAG techniques. By combining these technologies with datasets like IMDb, streaming platforms can support advanced search queries like “Movies with Tom Cruise that are comical” or “Spiderman movie shot in London with Tom Holland”. Additionally, with the conversational capabilities of LLMs, users no longer need a refined query to start with. They can issue a series of probing queries to the LLM and shortlist the movies of interest.

IMDb and Box Office Mojo Movie/TV/OTT licensable dataset

The IMDb dataset on AWS Data Exchange provides over 1.6 billion user ratings; credits for more than 13 million cast and crew members; 10 million movie, TV, and entertainment titles; and global box office reporting data from more than 60 countries.

Large Language Models (LLMs)

Large language models (LLMs) are artificial intelligence (AI) models trained on vast amounts of text data to predict subsequent words or phrases in a sequence. These models are proficient at various natural language processing tasks, including language translation, text generation, and sentiment analysis. Their value is especially prominent in situations with insufficient labeled data, as they learn the appropriate context and interpretation of words from unlabeled data, enabling them to produce text that closely resembles human language.

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) combines neural information retrieval with neural text generation. Recent advances in LLMs have led to models that can generate coherent and fluent text. However, these models often struggle to generate factually accurate and consistent text because they rely solely on their generative capabilities. To overcome this issue, researchers have proposed RAG. First, a retrieval model retrieves relevant information from a knowledge source. This retrieved information serves as a foundation for the subsequent text generation step. Next, the retrieved information is integrated into a neural text generation model, where it guides and constrains the generation process. This combined approach allows the generation model to produce outputs that are more factually grounded and consistent with the retrieved information.
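In pseudocode, the two stages look like the following. This is a minimal sketch; retrieve_documents and generate_answer are hypothetical stand-ins for the retrieval (OpenSearch) and generation (Bedrock) components described later in this post.

def answer_with_rag(question):
    # Stage 1 -- retrieval: pull relevant documents from the knowledge source
    documents = retrieve_documents(question, top_k=5)  # hypothetical retriever
    # Stage 2 -- generation: ground the LLM answer in the retrieved context
    prompt = (
        "Using only the context below, answer the question.\n\n"
        f"Context: {documents}\n\nQuestion: {question}"
    )
    return generate_answer(prompt)  # hypothetical LLM call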

Amazon OpenSearch Service

Amazon OpenSearch Service is a fully managed service that makes it easy to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open source, distributed search and analytics suite derived from Elasticsearch. OpenSearch Service offers the latest versions of OpenSearch, support for 19 versions of Elasticsearch (versions 1.5 to 7.10), and visualization capabilities powered by OpenSearch Dashboards and Kibana (versions 1.5 to 7.10).

Architecture

Let’s examine how conversational search and chat take place with IMDb and Box Office Mojo Movie/TV/OTT dataset.

The following diagram illustrates the solution architecture, which comprises (1) a Streamlit app that acts as the frontend for receiving user queries and displaying results, and (2) a number of backend components that process queries to obtain responses.

Architecture diagram

First, the user interacts with the Streamlit app by entering a query relating to the search use case (as shown in blue circles in the prior diagram). This can include questions searching for movies in a genre, starring specific actors, and much more.

In the backend, an LLM from Amazon Bedrock is used to convert a user query into an OpenSearch domain-specific language (DSL) query or into embeddings. For example, a user search query:

"What movies are starring Tom Cruise?"

converts to the search query:

{"query":{"bool":{"must":[{"terms":{"stars.keyword":
["Tom Cruise"]}}]}}, "sort": [{"rating":
{"order": "desc","missing":"_last",
"unmapped_type" : "long"}}]}

The search query returns the most similar records from the IMDb and Box Office Mojo dataset stored in the OpenSearch index (we create both full-text and kNN-based indices). The response includes information about relevant movies such as their plot, release date, genre, location, rating, directors, producers, etc.
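As a minimal sketch of how such a generated query body can be executed, the following uses the opensearch-py client. The domain endpoint and the index name imdb-text-index are illustrative, and authentication (for example, SigV4 request signing) is omitted for brevity.

from opensearchpy import OpenSearch, RequestsHttpConnection

# Substitute your OpenSearch Service domain endpoint; auth setup is omitted here
client = OpenSearch(
    hosts=[{"host": "<domain-endpoint>", "port": 443}],
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

# The DSL body produced by the LLM for "What movies are starring Tom Cruise?"
query = {
    "query": {"bool": {"must": [{"terms": {"stars.keyword": ["Tom Cruise"]}}]}},
    "sort": [{"rating": {"order": "desc", "missing": "_last", "unmapped_type": "long"}}],
}

response = client.search(index="imdb-text-index", body=query)
for hit in response["hits"]["hits"]:
    print(hit["_source"]["originalTitle"], hit["_source"].get("imdbRating"))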

Note that the ideal prompt or instruction for producing the required output (prompt engineering) can vary from one LLM to another. You can modify the instructions here and optimize them for your LLM of choice.

After receiving search results, users can choose to activate a virtual agent. This agent utilizes the response documents from the search query and, using an LLM, can respond to any questions related to the movies found in the search results. Additionally, the chat interaction maintains a session-specific memory that allows the virtual agent to reference previous user search queries when providing answers.
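A minimal sketch of such session memory, using LangChain with Amazon Bedrock, follows. The window size of five matches the limit discussed later in this post; the exact chain used in the repo may differ.

from langchain.chains import ConversationChain
from langchain.llms import Bedrock
from langchain.memory import ConversationBufferWindowMemory

llm = Bedrock(model_id="anthropic.claude-v1")  # Claude-V1 through Amazon Bedrock
# Keep only the last five exchanges so the context length stays bounded
memory = ConversationBufferWindowMemory(k=5)
agent = ConversationChain(llm=llm, memory=memory)

# The documents returned by the search step are passed in as conversation context
print(agent.predict(input="Given these search results: ... Which movie is set in London?"))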

Prerequisites

To implement this solution, you need an AWS account and familiarity with Amazon OpenSearch Service and Amazon Bedrock. The high-level steps to testing this application are as follows:

  1. IMDb dataset creation
  2. IMDb embedding generation
  3. OpenSearch index creation
  4. Streamlit application launching and testing

Provision resources with AWS CloudFormation

Now that you’ve seen the structure of the solution, you can deploy it into your account to run an example workflow. This workflow will spin up Amazon OpenSearch Service and an Amazon SageMaker Studio domain with appropriate settings in VPC mode.

You can launch the stack in AWS Region us-east-1 on the AWS CloudFormation console using the AWS CloudFormation template in the GitHub repo.

1. IMDb dataset creation

1.1 Export data to Amazon S3

In order to use the IMDb dataset, use the following steps:

Step 1: Subscribe to IMDb data in AWS Data Exchange

  1. Log into the AWS Management Console using this link: https://console.aws.amazon.com/.
  2. In the search bar, search for AWS Data Exchange, and then click on AWS Data Exchange.
  3. In the left panel, click on Browse catalog.
  4. In the search box under Browse catalog, type IMDb.
  5. Subscribe to either IMDb and Box Office Mojo Movie/TV/OTT Data (SAMPLE) or IMDb and Box Office Mojo Movie/TV/OTT Data (PAID).

IMDb publishes its dataset once every day on AWS Data Exchange.

Step 2: Export the IMDb data from ADX into Amazon S3

Follow the steps in this workshop to export the IMDb data from ADX to Amazon S3.

Step 3: Unzip the files to obtain: title_essential_v1_complete.jsonl and name_essential_v1_complete.jsonl

1.2 Process IMDb dataset

To use IMDb data for index creation, the raw data needs to be processed into a tabular format. We first unify the IMDb files, merging movie title information with movie cast/crew information so that the names of the actors, directors, and other crew members are in one dataset. We then subset it to the smaller set of movies present in the MovieLens data, imitating a smaller catalog. Creating this subset is optional; you can also work with the full dataset.

The following are the steps to merge the two IMDb datasets (code); a condensed pandas sketch follows the list:

  1. Filter the title_essential dataset by columns: `['image', 'titleId', 'originalTitle', 'titleDisplay', 'principalCastMembers', 'principalCrewMembers', 'genres', 'keywordsV2', 'locations', 'plot', 'plotShort', 'plotMedium', 'plotLong', 'imdbRating', 'year', 'titleType']`
  2. Split the principalCastMembers column by category and create 3 new columns: Actors, Directors and Producers. (These contain just a numerical ID)
  3. Include the actual names of cast members (actors, directors, and producers) from the mapping of the name_essential dataset and add it into the movies dataset.
  4. Add processed versions of the keyword, location, and poster url.
  5. Save the results as a parquet file in s3 (for example: s3://<bucket>/<folder>/movies.parquet).
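As a condensed sketch of this merge in pandas: the cast-member field names nameId, nameDisplay, and category are assumptions about the JSONL schema, so treat the linked code as the authoritative version.

import pandas as pd

titles = pd.read_json("title_essential_v1_complete.jsonl", lines=True)
names = pd.read_json("name_essential_v1_complete.jsonl", lines=True)

# Assumed schema: map each numeric name ID to a display name
id_to_name = dict(zip(names["nameId"], names["nameDisplay"]))

def cast_by_category(members, category):
    # Resolve the IDs for one category (actor, director, producer) to names
    return [id_to_name.get(m["nameId"]) for m in members if m.get("category") == category]

for col, cat in [("actors", "actor"), ("directors", "director"), ("producers", "producer")]:
    titles[col] = titles["principalCastMembers"].apply(lambda m, c=cat: cast_by_category(m, c))

titles.to_parquet("s3://<bucket>/<folder>/movies.parquet")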

Once the IMDb dataset is created, additional processing retains only the movies in the smaller MovieLens dataset (ml-latest-small.zip) to reduce the dataset size. You can skip this step if you prefer to work on the larger IMDb dataset, ensuring the data schema is consistent.

Run this notebook, which performs the following steps (a condensed sketch follows the list):

  1. Filter the full raw IMDb dataset to only the movies that are included in the MovieLens dataset.
  2. Process the location to remove duplicates in city and country names.
  3. Save the results as a parquet file to Amazon S3. The default path to the file is s3://<bucket>/<folder>/imdb_ml_data.parquet.
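A minimal sketch of the MovieLens filter, assuming the links.csv file that ships with ml-latest-small (it maps each MovieLens movieId to the numeric part of the corresponding IMDb title ID):

import pandas as pd

movies = pd.read_parquet("s3://<bucket>/<folder>/movies.parquet")
links = pd.read_csv("ml-latest-small/links.csv")  # columns: movieId, imdbId, tmdbId

# Rebuild the "tt"-prefixed IMDb title ID from the zero-padded numeric form
movielens_ids = set("tt" + links["imdbId"].astype(str).str.zfill(7))
subset = movies[movies["titleId"].isin(movielens_ids)]

subset.to_parquet("s3://<bucket>/<folder>/imdb_ml_data.parquet")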

Following is a snapshot of a subset of the sample dataset.

Snapshot of the processed dataset.

2. Embedding creation

While the dataset above could be useful for filtering queries like “What are some Tom Hanks movies?”, we want to further enhance the dataset with rich semantic information to answer queries like “What are some sniper action movies?”.

Follow the instructions in this notebook, which performs the following tasks (a sketch of the embedding step follows the list):

  1. Augments the IMDb keywords with highly confident keywords from the movielens-20m dataset. This is an optional step.
  2. Uses a T5-large sentence transformer to generate the embeddings (size 768) of the “plot”, “keywords”, and “plots+keywords” columns.
  3. Adds the embeddings to the original movies dataset and saves it as a parquet file to Amazon S3.
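A sketch of the embedding step with the sentence-transformers library follows; the column and output file names are illustrative, and sentence-t5-large produces 768-dimensional vectors.

import pandas as pd
from sentence_transformers import SentenceTransformer

movies = pd.read_parquet("s3://<bucket>/<folder>/imdb_ml_data.parquet")

model = SentenceTransformer("sentence-transformers/sentence-t5-large")
# Encode the plot text; each row gets a 768-dimensional vector
movies["plot_embedding"] = model.encode(movies["plot"].fillna("").tolist()).tolist()

movies.to_parquet("s3://<bucket>/<folder>/movies_with_embeddings.parquet")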

Now, navigate back to the repo and under the src/ folder, run:

python index_creation.py

This command performs the following tasks (a sketch of the kNN index mapping follows the list):

  1. Initializes the OpenSearch Service client using the Boto3 Python library.
  2. Fills empty entries in the IMDb dataset as null.
  3. Creates two indices for text and kNN embedding search, and bulk uploads data from the combined dataframe through the ingest_data_into_os function.
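As a hedged sketch, the kNN index creation might look like the following; the index and field names are illustrative, index_creation.py defines the actual mappings, and client is the opensearch-py client shown earlier.

index_body = {
    "settings": {"index": {"knn": True}},  # enable the OpenSearch k-NN plugin
    "mappings": {
        "properties": {
            # 768 matches the sentence-t5-large embedding size
            "plot_embedding": {"type": "knn_vector", "dimension": 768},
            "originalTitle": {"type": "text"},
        }
    },
}
client.indices.create(index="imdb-knn-index", body=index_body)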

Once the indices are created, update the domain name and index name in config.yml.

This data ingestion process takes 5-10 minutes. In this example, two indices are created to enable text-based search and kNN embedding-based search. The text search maps the free-form query the user enters to the metadata of the movies. Queries like “Movies directed by Christopher Nolan” or “Movies with actor Tom Hanks” are used for direct text search because they map to specific metadata like director and producer. However, open-ended queries like “what are some sniper action movies” are routed to embedding-based semantic search. The kNN embedding search finds the k closest movies in the embedding latent space to return as outputs.
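For illustration, a kNN query built from the same embedding model might look like the following; the field and index names are illustrative, and client is the opensearch-py client shown earlier.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/sentence-t5-large")
query_vec = model.encode("what are some sniper action movies")

knn_query = {
    "size": 5,  # return the 5 nearest movies in embedding space
    "query": {"knn": {"plot_embedding": {"vector": query_vec.tolist(), "k": 5}}},
}
response = client.search(index="imdb-knn-index", body=knn_query)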

For more information on the OpenSearch indices, refer to this blog post.

3. Create Streamlit app

Now that you have a working text search and kNN index on OpenSearch, you can build an ML-powered application using Streamlit, a Python package for creating the front end of this use case.

To run the code, first make sure that Streamlit and aws_requests_auth are installed in your Python environment (any suitable compute infrastructure, such as Amazon EC2 or an Amazon ECS container). The code repo provided runs Streamlit on Amazon SageMaker Studio. Use the following command to install the requirements:

pip install -r requirements.txt

Go into the streamlit/ folder and run:

sh run.sh

to get the SageMaker Studio URL for the Streamlit app. Then run:

streamlit run chat.py

to get the port number. Combine the SageMaker Studio URL with the port number in the following way: {sm_studio_url}/{port_number}/.

Following is how the Streamlit app looks:

Screenshot of the UI.

Search and chat

Once you have navigated to the IMDb Conversational Search application, you have a choice to select the following:

LLM
This demonstration uses Amazon Bedrock (Claude-V1) as the LLM through LangChain.

Task type
Search or search and chat. Select search if you only want to identify movies in the IMDb dataset that match your specific query. If you also want chatbot access to ask further questions about the movies returned by the search query, select “search and chat”. The search and chat functionality is performed by the specific LLM you select.

Question
Select a question for the search use case (whether you select search or search and chat as the task type). The application provides a set of default questions as well as an option to add your own question as shown in the following:

Sample questions in the demo.

Any of the questions for the search use case can be classified as either (a sketch of one possible routing prompt follows the list):

  1. Exact Match: Searching for movies based on location, actor, plot, rating, directors, etc.
  2. Semantic Match: Searching for movies that are similar to others
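One possible way to implement this routing is a small few-shot prompt; this is a hypothetical sketch, and the repo's actual classification logic may differ.

ROUTER_PROMPT = """Classify the movie question as EXACT (names a specific actor,
director, location, rating, or genre) or SEMANTIC (open-ended or similarity based).

Question: Movies directed by Christopher Nolan
Answer: EXACT

Question: What are some sniper action movies?
Answer: SEMANTIC

Question: {question}
Answer:"""

def route(question, llm):
    # llm is a hypothetical callable; EXACT goes to the text index, SEMANTIC to kNN
    label = llm(ROUTER_PROMPT.format(question=question)).strip()
    return "imdb-text-index" if label.startswith("EXACT") else "imdb-knn-index"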

For example, if you ask the question, “What are some action movies starring Tom Cruise?” the LLM searches for movies in the IMDb dataset that conform to this exact match and provides the following output:

Generated answer for one of the built-in questions for the search task.

In addition, if you selected the “search and chat” capability, then the following chat interface is output in the application:

Generated chat interface for the same sample question.

Currently, the chat chain supports five questions at a time to maintain the context length. However, one can try advanced approaches like chunking the context or storing the last k conversation turns to maintain a constant context length.

Enhancing search results with recommendations

While the current system performs generic retrieval of content from the search index, it can be interfaced with another tool like the Amazon Personalize reranking capability to reorder search results based on previously captured user history.

Conclusion

In this post, we described how to build conversational search on top of the IMDb dataset using text- and kNN-based search through Amazon Bedrock and Amazon OpenSearch Service. The application uses LLMs to convert user questions into queries that can run against OpenSearch and to provide a conversational virtual agent for asking further questions about the returned movies.

For more information about the code sample in this post, visit the GitHub repo.

Gaurav Rele

Gaurav Rele is a Senior Data Scientist at the Generative AI Innovation Center, where he works with AWS customers across different verticals to accelerate their use of generative AI and AWS Cloud services to solve their business challenges.

Divya Bhargavi

Divya Bhargavi is a Senior Applied Scientist Lead at the Generative AI Innovation Center, where she solves high-value business problems for AWS customers using generative AI methods. She works on image/video understanding & retrieval, knowledge graph augmented large language models and personalized advertising use cases.

Suren Gunturu

Suren Gunturu is a Data Scientist working in the Generative AI Innovation Center, where he works with various AWS customers to solve high-value business problems. He specializes in building ML pipelines using Large Language Models, primarily through Amazon Bedrock and other AWS Cloud services.

Vidya Sagar Ravipati

Vidya Sagar Ravipati is a Science Manager at the Generative AI Innovation Center, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption.