How to use IMDb data in search and machine learning applications
The IMDb Essential Metadata for Movies/TV/OTT licensed data package provides metadata for more than 8 million movies, TV shows and video games. Many AWS media and entertainment customers license this data through AWS Data Exchange (ADX) to improve content discovery and increase customer engagement and retention.
This blog post explains how to transform and prepare IMDb data to power search and recommendations in your applications. An AWS CloudFormation template uses AWS Step Functions and Amazon Athena to define a schema for the data and to stage it for efficient query performance.
The result is IMDb data stored in Amazon Simple Storage Service (Amazon S3), where you can query it using AWS analytic services such as Amazon Athena, Amazon EMR, AWS Glue, and Amazon Redshift. You can ingest IMDb data into Amazon OpenSearch to build search applications. You can also ingest it into Amazon Personalize and Amazon SageMaker to build recommendation engines and machine learning applications.
IMDb Essential Metadata for Movies/TV/OTT
The IMDb Essential Metadata for Movies/TV/OTT licensed data package consists of JSON files with IMDb metadata for more than 8 million titles (including movies, TV and OTT shows, and video games) and 11 million people (including cast, crew, and entertainment professionals). IMDb’s licensed metadata includes unique IDs, a billion 1-10 star ratings from fans globally, plots, genres, categorized keywords, posters, credits, and more.
Media and entertainment customers including pay TV, direct-to-consumer, and streaming operators license this data via ADX to improve content discovery and increase customer engagement and retention. Customers use IMDb data to enhance in-catalog and out-of-catalog title search and to power relevant content recommendations.
Using IMDb data
To query and use IMDb data from Athena and other AWS analytic services, you need to perform the following actions:
- Define schema: Querying the IMDb data from Athena and from other AWS analytic services requires the schema for this data in the Glue Data Catalog. The schema for the IMDb data is included in the data package in the file called documentation/essential_v1/documentation/essential_v1.pdf as Athena DDL statements. The CloudFormation template provided executes these DDL statements to create the schema for the IMDb data.
- Convert data format from JSON to Parquet: The IMDb data files are provided in JSON gzip format. The JSON format is human-readable, which makes it easy to inspect. However, Parquet, which is a columnar data format, provides better performance and lower cost compared to text formats such as JSON. The CloudFormation template provided converts the IMDb data files from JSON to Parquet.
- Remove non-movie titles: The IMDb data contains metadata for movies, TV shows, and other content. Queries run faster with less data to scan. The CloudFormation template provided reduces the data that Athena scans for each query by removing titles that are not movies. This helps improve query performance and lower cost.
- Join title and name datasets: The IMDb data contains a title dataset and a name dataset. The title dataset contains information on movies, TV series, and other content. The name dataset contains information on performers and creators. Queries that need information from both datasets require a join across them, and joins on large datasets are time-consuming. The CloudFormation template provided joins the title dataset and the name dataset to create a merged dataset. This avoids the cost of doing a join in subsequent queries, which improves query performance.
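The CloudFormation template itself is not reproduced in this post, so the following is only a minimal sketch of what a JSON-to-Parquet staging query of this kind could look like. It builds Athena CTAS (CREATE TABLE AS SELECT) statements as strings. The Parquet table names match the ones queried later in this post; the source table names (imdb_title_json, imdb_name_json) and the S3 staging location are hypothetical placeholders, not values taken from the template.

```python
def ctas_parquet(target_table: str, select_sql: str, s3_location: str) -> str:
    """Return an Athena CTAS statement that stores a query result as Parquet."""
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (format = 'PARQUET', external_location = '{s3_location}')\n"
        f"AS {select_sql}"
    )


# Hypothetical staging location; replace with an empty prefix in your own bucket.
STAGING = "s3://example-bucket/imdb-staging"

# Convert the JSON-backed tables (assumed names) into Parquet-backed tables.
convert_titles = ctas_parquet(
    "imdb_title_parquet", "SELECT * FROM imdb_title_json", f"{STAGING}/title/"
)
convert_names = ctas_parquet(
    "imdb_name_parquet", "SELECT * FROM imdb_name_json", f"{STAGING}/name/"
)

print(convert_titles)
```

The filter and join steps can follow the same pattern, with a WHERE clause or a JOIN in the SELECT part. The column names they need should come from the schema documentation in the data package.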
The IMDb data used in this blog post requires an IMDb content license and paid subscription to the IMDb Essential Metadata for Movies/TV/OTT in ADX. To inquire about a license and to access sample data, visit developer.imdb.com.
Steps to use the IMDb data
To use this data, perform the following steps. The details for these steps are in the following sections.
- Step 1: Subscribe to IMDb in ADX.
- Step 2: Export IMDb data from ADX into Amazon S3.
- Step 3: Use the CloudFormation template included with this blog post to set up a Step Functions state machine that runs a sequence of Athena queries to stage the data.
- Step 4: Execute the Step Functions state machine created in the previous step to set up the IMDb data in Amazon S3 and enable queries on it from Athena and other AWS analytic services.
- Step 5: Validate the joined table by querying it from Athena.
Step 1: Subscribe to IMDb data in AWS Data Exchange
Follow these steps to subscribe to IMDb data in ADX:
- Log in to the AWS Management Console at https://console.aws.amazon.com/.
- In the search bar, search for AWS Data Exchange and then click on AWS Data Exchange.
- In the left panel, click on Browse catalog.
- In the search box under Browse catalog, type IMDb.
- Subscribe to either IMDb and Box Office Mojo Movie/TV/OTT Data (SAMPLE) or IMDb and Box Office Mojo Movie/TV/OTT Data.
IMDb publishes its data set once every day on AWS Data Exchange.
Step 2: Export the IMDb data from ADX into Amazon S3
Follow the steps in this workshop to export the IMDb data from ADX to Amazon S3.
Step 3: Create Step Functions state machine using CloudFormation template
In this step, you will launch a CloudFormation stack that builds a Step Functions state machine to run a sequence of Athena queries to clean and stage the IMDb data.
Before launching the CloudFormation template, note the title S3 path and the name S3 path from the IMDb data that you exported to Amazon S3.
The title S3 path will look something like this:
The name S3 path will look something like this:
Next, perform the following actions:
1. Launch the following CloudFormation stack.
2. Click Next to enter the stack details.
3. Under Parameters, enter the IMDb title S3 path and the IMDb name S3 path. These are the paths noted earlier in this step. Click Next. The stack options appear.
4. Click Next to accept the default settings. Under Capabilities, choose I acknowledge that AWS CloudFormation might create IAM resources.
5. Click Create stack.
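If you prefer to launch the stack from a script rather than the console, the same inputs can be passed to the CloudFormation CreateStack API. The sketch below builds the request and, separately, sends it with boto3. The parameter keys (ImdbTitleS3Path, ImdbNameS3Path), the stack name, and the template URL are hypothetical; check the template for the real parameter names before using it.

```python
def build_create_stack_request(
    stack_name: str, template_url: str, title_s3_path: str, name_s3_path: str
) -> dict:
    """Build keyword arguments for the CloudFormation CreateStack API call."""
    for path in (title_s3_path, name_s3_path):
        if not path.startswith("s3://"):
            raise ValueError(f"Expected an S3 URI, got: {path!r}")
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        "Parameters": [
            # Hypothetical parameter keys; the template defines the real ones.
            {"ParameterKey": "ImdbTitleS3Path", "ParameterValue": title_s3_path},
            {"ParameterKey": "ImdbNameS3Path", "ParameterValue": name_s3_path},
        ],
        # Equivalent to the IAM acknowledgement check box in the console.
        # A template that names its IAM resources needs CAPABILITY_NAMED_IAM.
        "Capabilities": ["CAPABILITY_IAM"],
    }


def create_stack(request: dict) -> dict:
    """Send the request to CloudFormation (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    return boto3.client("cloudformation").create_stack(**request)
```

Keeping the request builder separate from the API call makes it easy to inspect or log the exact parameters before any resources are created.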
Step 4: Set up the IMDb data in Amazon S3 and AWS Glue to enable queries
In this step, you will run the Step Functions state machine created in the previous step by the CloudFormation template.
This step produces the following outcomes:
- Define schema on the IMDb title and name data in Glue Data Catalog.
- Convert the IMDb data in S3 from JSON gzip format to Parquet for faster queries and more compact storage.
- Filter the IMDb data to remove titles that are not movies.
- Join the IMDb name and title datasets into a single denormalized dataset so you can query it faster by avoiding joins in each query.
To perform this step, run the Step Functions state machine created in the previous step using the following procedure:
- Click this link to go to the CloudFormation service in the AWS Management Console. If you are prompted, sign in.
- Click on the CloudFormation stack you created in the previous step.
- Click on the Resources tab.
- Click on the resource called EtlStateMachine. This takes you to the Step Functions service in the AWS Management Console and to the state machine created in the previous step.
- Click on Start execution.
- In the dialog, leave the default values unchanged and click Start execution.
- Wait for execution to complete successfully. This takes about 30 minutes.
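The console procedure above can also be scripted. The following sketch starts an execution and polls until it finishes, using the Step Functions StartExecution and DescribeExecution calls. It takes the client as an argument, so it assumes you create one yourself (for example with boto3.client("stepfunctions")) and that you supply the ARN of the state machine created by the stack.

```python
import time

# Statuses after which an execution will not change any more.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def run_state_machine(sfn_client, state_machine_arn, poll_seconds=30, sleep=time.sleep):
    """Start an execution with default input and wait for a terminal status."""
    started = sfn_client.start_execution(stateMachineArn=state_machine_arn)
    execution_arn = started["executionArn"]
    while True:
        described = sfn_client.describe_execution(executionArn=execution_arn)
        status = described["status"]
        if status in TERMINAL_STATUSES:
            return status
        # Still running: wait before asking again.
        sleep(poll_seconds)
```

Because the run takes about 30 minutes, a polling interval of 30 seconds or more keeps the number of API calls low.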
Step 5: Validate joined table by querying from Athena
After the state machine in the previous step successfully completes execution, the IMDb data can be queried from Athena. To validate the data, follow these steps:
1. Click this link to go to the Athena query editor in the AWS Management Console. If you are prompted, sign in.
2. Calculate the number of rows in the IMDb name table. To do this, enter this query in the query editor.
SELECT COUNT(1) AS row_count FROM imdb_name_parquet
The following screenshot shows the output. The exact number of rows you get will depend on how recently you fetched the IMDb data.
3. Calculate the number of rows in the IMDb title table. To do this, enter this query in the query editor.
SELECT COUNT(1) AS row_count FROM imdb_title_parquet
The following screenshot shows the output.
4. Calculate the number of rows in the merged IMDb movie table. To do this, enter this query in the query editor.
SELECT COUNT(1) AS row_count FROM imdb_movie_merged_parquet
The following screenshot shows the output.
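5. Next, select an arbitrary movie released in 2021. To do this, enter this query in the query editor. Note that LIMIT 1 returns whichever matching row Athena reads first, not a statistically random one.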
SELECT titleId, year, originalTitle FROM imdb_movie_merged_parquet WHERE year = 2021 LIMIT 1
This produces a single record containing metadata for a movie released in 2021. The following screenshot shows the output.
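The same validation can be run from a script. The sketch below submits a query through the Athena API, waits for it to finish, and returns the rows from the first page of results. The table names are the ones used in the queries above; the database name and the S3 output location are parameters you must supply, because they depend on how the stack was set up. The client is passed in (for example boto3.client("athena")).

```python
import time


def run_athena_query(athena_client, sql, database, output_location,
                     poll_seconds=2, sleep=time.sleep):
    """Run a SELECT query and return its first page of rows as lists of strings."""
    query_id = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        sleep(poll_seconds)
    rows = athena_client.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # For SELECT queries the first row holds the column headers, so skip it.
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows[1:]]


def count_rows(athena_client, table, database, output_location, **kwargs):
    """Return the row count of one table, as in the validation queries above."""
    sql = f"SELECT COUNT(1) AS row_count FROM {table}"
    rows = run_athena_query(athena_client, sql, database, output_location, **kwargs)
    return int(rows[0][0])
```

Calling count_rows for imdb_name_parquet, imdb_title_parquet, and imdb_movie_merged_parquet reproduces the three row-count checks in this step.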
Here are some troubleshooting tips.
- If you have AWS Lake Formation enabled on your account, you will need to give the role executing the CloudFormation template database creator access in Lake Formation.
- Some of the JSON to Parquet conversion queries can take 30 minutes to complete. If you see timeouts in Athena queries, you should increase the timeout quota. In the AWS Management Console, go to Service Quotas, find the Athena DML query timeout quota, and increase it to 60 minutes.
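For the Lake Formation tip, the database creator grant can be made in the Lake Formation console or through its GrantPermissions API. The following is a minimal sketch of the API route. It assumes the role you need to grant to is the one executing the CloudFormation template, and the role ARN is something you supply.

```python
def build_create_database_grant(role_arn: str) -> dict:
    """Build a Lake Formation GrantPermissions request for database creation."""
    if not role_arn.startswith("arn:"):
        raise ValueError(f"Expected an IAM role ARN, got: {role_arn!r}")
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        # An empty Catalog resource targets the account's Data Catalog.
        "Resource": {"Catalog": {}},
        "Permissions": ["CREATE_DATABASE"],
    }


def grant_create_database(role_arn: str) -> None:
    """Send the grant to Lake Formation (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    boto3.client("lakeformation").grant_permissions(
        **build_create_database_grant(role_arn)
    )
```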
You started with IMDb data exported from ADX to Amazon S3. Then, you ran a CloudFormation template to generate a Step Functions state machine. Next, you ran the Step Functions state machine that ran a sequence of Athena queries. The effect of these queries was to put a schema on the IMDb data, to convert the format of the IMDb data from JSON gzip to Parquet, to filter the data to remove titles that were not movies, and then to join the IMDb title dataset with the IMDb name dataset. Finally, you validated that you could run queries on the data from Athena.
This data can now be used to run queries on the IMDb data from Athena, EMR, Glue, Redshift and other AWS analytic services. It can be exported into OpenSearch to build search applications. It can also be used to build recommendation engines and machine learning models in SageMaker and Personalize.
To learn more about IMDb data and subscription options, visit developer.imdb.com.