In this module, you download and inspect your dataset, then create the dataset group and schema you use in this tutorial.

Time to Complete Module: 20 Minutes


Amazon Personalize datasets are containers for data. A dataset group is a collection of related datasets (Interactions, Users, and Items). There are three types of datasets in Amazon Personalize:

  • Interactions: This dataset stores historical and real-time data from interactions between users and items. This data can include impressions data and contextual metadata on your users’ browsing context, such as their location or device (mobile, tablet, desktop, and so on). You must at minimum create an Interactions dataset.
  • Users: This dataset stores metadata about your users. This might include information such as age, gender, or loyalty membership, which can be important in personalization systems.
  • Items: This dataset stores metadata about your items. This might include information such as price, SKU type, or availability.

In this tutorial, you only use Interactions data. For the advanced use of other types of datasets, see Datasets and Schemas.

Your Amazon Personalize model will be trained on the MovieLens Latest Small dataset that contains 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. The MovieLens dataset is curated by GroupLens Research.


  • Step 1. Import the libraries

    To prepare the data, train the Personalize model, and deploy it, you must first import some libraries in your Jupyter notebook environment. Copy and paste the following code into the code cell in your Jupyter notebook instance and choose Run.

    import time
    from time import sleep
    import json
    from datetime import datetime
    import boto3
    import pandas as pd

    While the code runs, an * appears between the square brackets. After a few seconds, the code execution completes and the * is replaced with the number 1.

  • Step 2. Fetch the dataset

    For this tutorial, you use the MovieLens dataset to create a movie recommendation model. You use the ml-latest-small dataset version that includes 100,836 interactions. First, download the dataset and unzip the contents into a new folder.

    Copy and paste the following code into the code cell in your Jupyter notebook instance and choose Run.

    data_dir = "data"
    !mkdir -p $data_dir
    
    !cd $data_dir && wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
    !cd $data_dir && unzip ml-latest-small.zip
    dataset_dir = data_dir + "/ml-latest-small/"
    !ls $dataset_dir
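
    If wget or unzip is not available in your notebook environment, the same download can be done from Python. The following is a minimal sketch that uses only the standard library. The helper name fetch_and_extract is hypothetical and is not part of the original tutorial.

```python
import os
import urllib.request
import zipfile


def fetch_and_extract(url, data_dir):
    """Download a zip archive into data_dir (if missing) and unpack it there.

    Hypothetical helper for illustration; returns the local archive path.
    """
    os.makedirs(data_dir, exist_ok=True)  # no error if the folder already exists
    archive_path = os.path.join(data_dir, os.path.basename(url))
    if not os.path.exists(archive_path):
        # Only hit the network when the archive is not already on disk.
        urllib.request.urlretrieve(url, archive_path)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(data_dir)
    return archive_path
```

    Under these assumptions, calling fetch_and_extract("http://files.grouplens.org/datasets/movielens/ml-latest-small.zip", "data") should leave the same data/ml-latest-small/ folder as the shell commands above.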

    Next, open the data file and take a closer look at the data.

    Copy and paste the following code into the code cell in your Jupyter notebook instance and choose Run.

    original_data = pd.read_csv(dataset_dir + 'ratings.csv')
    print(original_data.info())
    original_data.head()

    From this, you can see that the dataset has 100,836 entries and 4 columns. Every column is stored as int64 except rating, which is float64.

    Remember, the data you need is user-item interaction data, which in this case is userId, movieId, and timestamp. This dataset has an additional column, rating, which you drop after using it to select positive interactions.
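
    The timestamp column holds Unix epoch time in seconds, which is hard to read at a glance. As a small aid while inspecting the data, the sketch below converts one such value into a readable UTC date. The helper name epoch_to_utc is hypothetical, and the value passed to it is illustrative rather than taken from the dataset.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Illustrative value only, not taken from the dataset.
print(epoch_to_utc(0).isoformat())  # 1970-01-01T00:00:00+00:00
```

    For a whole column, pd.to_datetime(original_data['timestamp'], unit='s') does the same conversion in pandas.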

  • Step 3. Prepare the data

    In this step, you derive two event types from the ratings to filter out unliked movies and better simulate data gathered by a video-on-demand (VOD) platform.

    Because this is an explicit-feedback movie rating dataset, each interaction carries a rating from 0.5 to 5 in half-star increments. For this tutorial, you want to include only movies that were "liked" by the users, and simulate an implicit-feedback dataset similar to the data gathered by a VOD platform. To do that, you filter out all interactions rated 1 or lower and create two EVENT_TYPE values: click and watch. Any movie rated above 1 is recorded as a click, and any movie rated above 3 is recorded as both a click and a watch.

    In your Jupyter notebook, copy and paste the following code into your code cell and choose Run.

    watched_df = original_data.copy()
    watched_df = watched_df[watched_df['rating'] > 3]
    watched_df = watched_df[['userId', 'movieId', 'timestamp']]
    watched_df['EVENT_TYPE']='watch'
    
    clicked_df = original_data.copy()
    clicked_df = clicked_df[clicked_df['rating'] > 1]
    clicked_df = clicked_df[['userId', 'movieId', 'timestamp']]
    clicked_df['EVENT_TYPE']='click'
    
    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent.
    interactions_df = pd.concat([clicked_df, watched_df])
    interactions_df.sort_values("timestamp", axis = 0, ascending = True, 
                     inplace = True, na_position ='last')

    You can use the watched_df and clicked_df sets for more advanced training in Amazon Personalize. However, for this tutorial, you work with only the interactions_df set. 
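
    To make the thresholds in the code above easier to check, the following sketch expresses the same rule as a plain function applied to one rating at a time. The function name events_for_rating and the sample ratings are hypothetical; the thresholds are the ones used in the filters above.

```python
def events_for_rating(rating):
    """Return the simulated event types for one explicit rating.

    Mirrors the filters above: rating > 1 gives "click",
    rating > 3 additionally gives "watch".
    """
    events = []
    if rating > 1:
        events.append("click")
    if rating > 3:
        events.append("watch")
    return events


# Hypothetical sample ratings, chosen to sit on either side of each threshold.
for rating in (1.0, 1.5, 3.0, 3.5, 5.0):
    print(rating, events_for_rating(rating))
```

    A rating of exactly 1 or exactly 3 falls on the lower side of its threshold, because both comparisons are strict.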

    Next, save the interactions_df set. You upload this set to an Amazon S3 bucket in a later step. Copy and paste the following code into your Jupyter notebook.

    interactions_df.rename(columns = {'userId':'USER_ID', 'movieId':'ITEM_ID', 
                                  'timestamp':'TIMESTAMP'}, inplace = True) 
    interactions_filename = "interactions.csv"
    interactions_df.to_csv((data_dir+"/"+interactions_filename), index=False, float_format='%.0f')
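
    Before you upload the file, it can be worth confirming that it contains the event types you expect. The following sketch counts rows per EVENT_TYPE using only the standard library. The helper name count_event_types is hypothetical, and the in-memory sample rows are made up to stand in for the saved file.

```python
import csv
import io
from collections import Counter


def count_event_types(csv_file):
    """Count rows per EVENT_TYPE in an open interactions CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row["EVENT_TYPE"] for row in reader)


# Made-up sample rows standing in for data/interactions.csv.
sample = io.StringIO(
    "USER_ID,ITEM_ID,TIMESTAMP,EVENT_TYPE\n"
    "1,10,1000000000,click\n"
    "1,10,1000000000,watch\n"
    "2,20,1000000100,click\n"
)
counts = count_event_types(sample)
print(dict(counts))  # {'click': 2, 'watch': 1}
```

    To run the same check on the real file, open it with open(data_dir + "/" + interactions_filename, newline="") and pass the file object to the function.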

  • Step 4. Create the dataset group

    A dataset group is a collection of related datasets. For this step, you create a new dataset group named personalize-demo-movielens and then wait for it to become active.

    # Configure the SDK to Personalize:
    personalize = boto3.client('personalize')
    personalize_runtime = boto3.client('personalize-runtime')
    
    create_dataset_group_response = personalize.create_dataset_group(
        name = "personalize-demo-movielens"
    )
    
    dataset_group_arn = create_dataset_group_response['datasetGroupArn']
    print(json.dumps(create_dataset_group_response, indent=2))

    Before you can use the dataset group, it must be active. Run the following code block and wait for the output to print an ACTIVE status.

    Note: The dataset group status is checked every 60 seconds, up to a maximum of 3 hours.

    %%time
    max_time = time.time() + 3*60*60 # 3 hours
    while time.time() < max_time:
        describe_dataset_group_response = personalize.describe_dataset_group(
            datasetGroupArn = dataset_group_arn
        )
        status = describe_dataset_group_response["datasetGroup"]["status"]
        print("DatasetGroup: {}".format(status))
        
        if status == "ACTIVE" or status == "CREATE FAILED":
            break
            
        time.sleep(60)
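
    The same polling pattern recurs whenever you wait on a long-running resource, so it can be convenient to wrap it in a reusable function. The following is a minimal sketch, not part of the original tutorial: the name wait_for_status and its parameters are hypothetical, and the status lookup is passed in as a callable so the loop can be exercised without calling AWS.

```python
import time


def wait_for_status(get_status, terminal=("ACTIVE", "CREATE FAILED"),
                    timeout_s=3 * 60 * 60, interval_s=60, sleep=time.sleep):
    """Poll get_status() until it returns a terminal status or time runs out.

    Hypothetical helper: get_status is any zero-argument callable that
    returns the current status string.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        print("Status: {}".format(status))
        if status in terminal:
            return status
        sleep(interval_s)  # wait before asking again
    raise TimeoutError("no terminal status within {} seconds".format(timeout_s))
```

    With the client from this step, the callable could be lambda: personalize.describe_dataset_group(datasetGroupArn=dataset_group_arn)["datasetGroup"]["status"]. Unlike the loop above, this sketch raises an error on timeout instead of ending silently.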

  • Step 5. Create the schema and dataset

    Amazon Personalize needs a schema to understand your data. The following code block creates the appropriate schema for the MovieLens dataset and provides it to Personalize. This code block also creates the interactions dataset within the dataset group. Personalize uses this dataset to train the recommendation model.
     
    Run the following code block to create the schema and the dataset.

    interactions_schema = {
        "type": "record",
        "name": "Interactions",
        "namespace": "com.amazonaws.personalize.schema",
        "fields": [
            {
                "name": "USER_ID",
                "type": "string"
            },
            {
                "name": "ITEM_ID",
                "type": "string"
            },
            {
                "name": "EVENT_TYPE",
                "type": "string"
            },
            {
                "name": "TIMESTAMP",
                "type": "long"
            }
        ],
        "version": "1.0"
    }
    
    create_schema_response = personalize.create_schema(
        name = "personalize-demo-movielens-interactions",
        schema = json.dumps(interactions_schema)
    )
    
    interaction_schema_arn = create_schema_response['schemaArn']
    print(json.dumps(create_schema_response, indent=2))
    
    dataset_type = "INTERACTIONS"
    create_dataset_response = personalize.create_dataset(
        name = "personalize-demo-movielens-ints",
        datasetType = dataset_type,
        datasetGroupArn = dataset_group_arn,
        schemaArn = interaction_schema_arn
    )
    
    interactions_dataset_arn = create_dataset_response['datasetArn']
    print(json.dumps(create_dataset_response, indent=2))
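
    Because the schema names the columns that Amazon Personalize reads, a mismatch between the schema and the CSV header is a likely source of import problems later. The following sketch compares the two locally. The helper name missing_schema_fields is hypothetical; the schema is a copy of the one above, and the header lists the columns written to interactions.csv in Step 3.

```python
import json

# Copy of the schema defined in this step.
schema_json = json.dumps({
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "EVENT_TYPE", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
})


def missing_schema_fields(schema_json, csv_header):
    """Return the schema field names that do not appear in the CSV header."""
    fields = [field["name"] for field in json.loads(schema_json)["fields"]]
    return [name for name in fields if name not in csv_header]


# Column names as written to interactions.csv in Step 3.
header = ["USER_ID", "ITEM_ID", "TIMESTAMP", "EVENT_TYPE"]
print(missing_schema_fields(schema_json, header))  # []
```

    An empty list means every schema field has a matching column. This sketch checks names only, not column order or value types.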


In this module, you imported and fetched the dataset you use for your movie title recommendation system. Then, you prepared the dataset by splitting the dataset based on movie ratings. Finally, you created the dataset group, schema, and interactions dataset that you use to train your Amazon Personalize model.

In the next module, you create your Amazon S3 bucket that stores the interaction data, and configure the S3 bucket to allow Amazon Personalize access to the data.