Create real-time, personalized movie recommendations with Amazon Personalize

Download and prepare the dataset

In this module, you download your dataset, inspect the dataset, then create the dataset group and schema you use in this tutorial.

Time to Complete Module: 20 Minutes

Amazon Personalize datasets are containers for data. A dataset group is a collection of related datasets (Interactions, Users, and Items). There are three types of datasets in Amazon Personalize:

Interactions: This dataset stores historical and real-time data from interactions between users and items. This data can include impressions data and contextual metadata on your users’ browsing context, such as their location or device (mobile, tablet, desktop, and so on). You must at minimum create an Interactions dataset.
Users: This dataset stores metadata about your users. This might include information such as age, gender, or loyalty membership which can be important in personalization systems.
Items: This dataset stores metadata about your items. This might include information such as price, SKU type, or availability.

In this tutorial, you only use Interactions data. For the advanced use of other types of datasets, see Datasets and Schemas.

Your Amazon Personalize model is trained on the MovieLens Latest Small dataset, which contains about 100,000 ratings and 3,600 tag applications applied to about 9,000 movies by about 600 users. The MovieLens dataset is curated by GroupLens Research.

Step 1. Import the libraries
To prepare the data, train the Personalize model, and deploy it, you must first import some libraries in your Jupyter notebook environment. Copy and paste the following code into the code cell in your Jupyter notebook instance and choose Run.

import time
from time import sleep
import json
from datetime import datetime
import boto3
import pandas as pd

While the code runs, an * appears between the square brackets. After a few seconds, the code execution completes and the * is replaced with the number 1.
Step 2. Fetch the dataset
For this tutorial, you use the MovieLens dataset to create a movie recommendation model. You use the ml-latest-small dataset version that includes 100,836 interactions. First, download the dataset and unzip the contents into a new folder.

Copy and paste the following code into the code cell in your Jupyter notebook instance and choose Run.

data_dir = "data"
!mkdir $data_dir
!cd $data_dir && wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
!cd $data_dir && unzip ml-latest-small.zip
dataset_dir = data_dir + "/ml-latest-small/"
!ls $dataset_dir
Next, open the data file and take a closer look at the data.

Copy and paste the following code into the code cell in your Jupyter notebook instance and choose Run.

original_data = pd.read_csv(dataset_dir + '/ratings.csv')
print(original_data.info())
original_data.head()

The output shows that the dataset has 100,836 entries in 4 columns. Every column is stored as int64 except rating, which is float64.

Remember, the data you need is user-item-interaction data, which in this case, is userId, movieId, and timestamp. This dataset has an additional column, rating, which can be dropped from the dataset after you have used it to focus on positive interactions.
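To make the timestamp column easier to read while inspecting the data, the following minimal sketch converts a Unix timestamp to a UTC datetime. It assumes the timestamps are seconds since the Unix epoch, which is how MovieLens records them. The helper name epoch_to_utc and the sample value are illustrative only and are not part of the tutorial code.

```python
from datetime import datetime, timezone


def epoch_to_utc(ts: int) -> datetime:
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


# Illustrative epoch-seconds value (an assumption, used only for this demo).
sample_ts = 964982703
print(epoch_to_utc(sample_ts).isoformat())  # 2000-07-30T18:45:03+00:00
```

The TIMESTAMP values themselves should stay as integers in the data; the conversion is only for reading them during inspection.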
Step 3. Prepare the data

In this step, you filter out disliked movies and create two event types so that the data better simulates what a video-on-demand (VOD) platform gathers.
Because this is an explicit-feedback movie rating dataset, it contains ratings from 0.5 to 5 in half-star increments. For this tutorial, you want to include only movies that were "liked" by the users, and simulate an implicit-feedback dataset similar to the data gathered by a VOD platform. To do that, you drop all interactions rated 1 or lower and create two EVENT_TYPE values: click and watch. Any movie rated above 1 (1.5 and higher) is recorded as a click, and any movie rated above 3 (3.5 and higher) is recorded as both a click and a watch.

In your Jupyter notebook, copy and paste the following code into your code cell and choose Run.

# Ratings above 3 count as a watch event
watched_df = original_data.copy()
watched_df = watched_df[watched_df['rating'] > 3]
watched_df = watched_df[['userId', 'movieId', 'timestamp']]
watched_df['EVENT_TYPE'] = 'watch'

# Ratings above 1 count as a click event
clicked_df = original_data.copy()
clicked_df = clicked_df[clicked_df['rating'] > 1]
clicked_df = clicked_df[['userId', 'movieId', 'timestamp']]
clicked_df['EVENT_TYPE'] = 'click'

# Combine both event types (DataFrame.append was removed in pandas 2.0, so use pd.concat)
interactions_df = pd.concat([clicked_df, watched_df])
interactions_df.sort_values("timestamp", axis=0, ascending=True, inplace=True, na_position='last')

You can use the watched_df and clicked_df sets for more advanced training in Amazon Personalize. However, for this tutorial, you work with only the interactions_df set.
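The effect of the two rating filters in the code above (rating > 1 and rating > 3) can be sketched for a single rating as follows. This is a plain-Python illustration only: the function name event_types_for_rating is hypothetical, and the tutorial itself applies the same thresholds with pandas filters.

```python
from typing import List


def event_types_for_rating(rating: float) -> List[str]:
    """Return the event types one rating produces under the tutorial's thresholds."""
    events = []
    if rating > 1:   # same condition as the clicked_df filter
        events.append("click")
    if rating > 3:   # same condition as the watched_df filter
        events.append("watch")
    return events


# Made-up ratings chosen to sit on either side of each threshold.
for rating in (1.0, 1.5, 3.0, 3.5, 5.0):
    print(rating, event_types_for_rating(rating))
```

A rating of 1.0 yields no events, so that interaction is dropped. A rating of 3.5 yields both a click and a watch, which is why the combined interactions_df can contain two rows for the same user, movie, and timestamp.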
Next, save the interactions_df set as a CSV file. You upload this file to an Amazon S3 bucket in a later step. Copy and paste the following code into the code cell in your Jupyter notebook instance and choose Run.

interactions_df.rename(columns = {'userId':'USER_ID', 'movieId':'ITEM_ID', 'timestamp':'TIMESTAMP'}, inplace = True)
interactions_filename = "interactions.csv"
interactions_df.to_csv((data_dir+"/"+interactions_filename), index=False, float_format='%.0f')
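Before uploading the file, it can help to sanity-check what was written. The following sketch, which uses only the Python standard library, counts events by type and confirms that the timestamps are integers in ascending order. The function name summarize_interactions and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter
from typing import Tuple


def summarize_interactions(csv_text: str) -> Tuple[Counter, bool]:
    """Count rows per EVENT_TYPE and report whether TIMESTAMP is non-decreasing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    event_counts = Counter()
    timestamps = []
    for row in reader:
        event_counts[row["EVENT_TYPE"]] += 1
        # int() raises ValueError if a timestamp was written as a non-integer.
        timestamps.append(int(row["TIMESTAMP"]))
    is_sorted = all(a <= b for a, b in zip(timestamps, timestamps[1:]))
    return event_counts, is_sorted


# Made-up rows in the same column layout as interactions.csv.
sample = (
    "USER_ID,ITEM_ID,TIMESTAMP,EVENT_TYPE\n"
    "1,10,100,click\n"
    "1,10,100,watch\n"
    "2,20,200,click\n"
)
counts, is_sorted = summarize_interactions(sample)
print(dict(counts), is_sorted)
```

To check the real file in the notebook, read it first, for example with open(data_dir + "/" + interactions_filename).read(), and pass the text to the function.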

Step 4. Create the dataset group

A dataset group is a collection of related datasets. For this step, you create a new dataset group named personalize-demo-movielens and then wait for it to become active.

# Configure the SDK to Personalize:
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')

create_dataset_group_response = personalize.create_dataset_group(
    name = "personalize-demo-movielens"
)

dataset_group_arn = create_dataset_group_response['datasetGroupArn']
print(json.dumps(create_dataset_group_response, indent=2))

Before you can use the dataset group, it must be active. Run the following code block and wait for the output to print an ACTIVE status.

Note: The dataset group status is checked every 60 seconds, up to a maximum of 3 hours.

%%time
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)
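The same poll-until-done pattern recurs whenever you wait on a resource status, so it can be wrapped in a small reusable helper. The sketch below is one possible shape, not part of the tutorial: the name wait_for_status and its parameters are assumptions, and the demo uses a stand-in status function instead of a real AWS call.

```python
import time
from typing import Callable

# Statuses at which the tutorial's loop stops polling.
TERMINAL_STATUSES = {"ACTIVE", "CREATE FAILED"}


def wait_for_status(
    get_status: Callable[[], str],
    timeout_s: float = 3 * 60 * 60,
    interval_s: float = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until it returns a terminal status or the timeout passes."""
    deadline = time.time() + timeout_s
    while True:
        status = get_status()
        print("Status: {}".format(status))
        if status in TERMINAL_STATUSES:
            return status
        if time.time() >= deadline:
            raise TimeoutError("Gave up waiting; last status was {}".format(status))
        sleep(interval_s)


# Demo with a stand-in status source; no AWS call is made here.
fake_statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
final_status = wait_for_status(lambda: next(fake_statuses), interval_s=0)
print(final_status)
```

In the notebook, get_status could be a lambda that calls personalize.describe_dataset_group(datasetGroupArn = dataset_group_arn) and returns ["datasetGroup"]["status"], exactly as the loop above does. Unlike that loop, this sketch raises an error when the time limit is reached rather than ending silently.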

Step 5. Create the schema and dataset

Amazon Personalize needs a schema to understand your data. The following code block creates the appropriate schema for the MovieLens dataset and provides it to Personalize. This code block also creates the interactions dataset within the dataset group. Personalize uses this dataset to train the recommendation model.

Run the following code block to create the schema and the dataset.

interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "EVENT_TYPE",
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        }
    ],
    "version": "1.0"
}

create_schema_response = personalize.create_schema(
    name = "personalize-demo-movielens-interactions",
    schema = json.dumps(interactions_schema)
)

interaction_schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))

dataset_type = "INTERACTIONS"
create_dataset_response = personalize.create_dataset(
    name = "personalize-demo-movielens-ints",
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = interaction_schema_arn
)

interactions_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))
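Because the schema describes the columns of the CSV file you saved earlier, a quick check that every schema field has a matching column can catch naming mistakes before the import step. The following is a minimal sketch under that assumption. The helper name missing_schema_columns is hypothetical, and the header string is typed in by hand to mirror the columns of interactions.csv.

```python
import json
from typing import List

# Same field layout as the interactions schema defined in this step.
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "EVENT_TYPE", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}


def missing_schema_columns(schema_json: str, csv_header: str) -> List[str]:
    """List schema field names that do not appear in a CSV header line."""
    field_names = [field["name"] for field in json.loads(schema_json)["fields"]]
    columns = {name.strip() for name in csv_header.split(",")}
    return [name for name in field_names if name not in columns]


schema_json = json.dumps(interactions_schema)
print(missing_schema_columns(schema_json, "USER_ID,ITEM_ID,TIMESTAMP,EVENT_TYPE"))  # []
print(missing_schema_columns(schema_json, "USER_ID,ITEM_ID,TIMESTAMP"))  # ['EVENT_TYPE']
```

In the notebook, the header can be built from the DataFrame instead of typed by hand, for example ",".join(interactions_df.columns).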

Conclusion

In this module, you imported and fetched the dataset you use for your movie title recommendation system. Then, you prepared the dataset by splitting the dataset based on movie ratings. Finally, you created the dataset group, schema, and interactions dataset that you use to train your Amazon Personalize model.

In the next module, you create your Amazon S3 bucket that stores the interaction data, and configure the S3 bucket to allow Amazon Personalize access to the data.

Next: Import the data
