AWS Machine Learning Blog

Amazon Personalize can now use 10X more item attributes to improve relevance of recommendations

January 2023: This blog post was reviewed and updated by Brian Soper and Rob Percival, with new steps and code along with the option to use AWS CloudShell to run the procedure.

Amazon Personalize is a machine learning service that enables you to personalize your website, app, ads, emails, and more with custom machine learning models you can create in Amazon Personalize with no prior machine learning experience. AWS is pleased to announce that Amazon Personalize now supports ten times more item attributes for modeling. Previously, you could use up to five item attributes while building an ML model in Amazon Personalize; this limit is now 50 attributes. You can now use more information about your items, such as category, brand, price, duration, size, author, and year of release, to increase the relevance of recommendations.

In this post, you learn how to add item metadata with custom attributes to Amazon Personalize and create a model using this data and user interactions. This post uses the Amazon customer reviews data for beauty products. For more information and to download this data, see Amazon Customer Reviews Dataset. We will use the history of what items the users have reviewed along with user and item metadata to generate product recommendations for them.

Pre-processing the data

To model the data in Amazon Personalize, you need to break it into the following datasets:

  • Users – Contains metadata about the users
  • Items – Contains metadata about the items
  • Interactions – Contains interactions (for this post, reviews) and metadata about the interactions

For each respective dataset, this post uses the following attributes:

  • Users – customer_id, helpful_votes, and total_votes
  • Items – product_id, product_category, and product_parent
  • Interactions – product_id, customer_id, review_date, and star_rating

This post does not use the other attributes available, which include marketplace, review_id, product_title, vine, verified_purchase, review_headline, and review_body.

Additionally, to conform with the keywords in Amazon Personalize, this post renames customer_id to USER_ID, product_id to ITEM_ID, and review_date to TIMESTAMP.
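The renaming above is a simple header mapping. As a minimal sketch using only the standard library (column names taken from this post), it looks like this:

```python
# Mapping from the review dataset's column names to the reserved
# Amazon Personalize field names used in this post
RENAME = {"customer_id": "USER_ID", "product_id": "ITEM_ID", "review_date": "TIMESTAMP"}

def rename_header(header):
    """Return the header row with Personalize keyword names applied."""
    return [RENAME.get(col, col) for col in header]

sample = ["customer_id", "product_id", "review_date", "star_rating"]
print(rename_header(sample))  # ['USER_ID', 'ITEM_ID', 'TIMESTAMP', 'star_rating']
```

The pandas code later in this post applies the same mapping with DataFrame.rename.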

To make getting started easier, you can use AWS CloudShell to experiment with this procedure. To do this, choose a Region from the AWS Regional Services List that supports both AWS CloudShell and Amazon Personalize. If you are not using CloudShell, make sure your environment includes the AWS CLI.

To download and process the data for input to Amazon Personalize, use the following example code blocks. The Python code blocks assume Python 3.

#Downloading data
#If using AWS CloudShell, use the /tmp directory for more space to work
cd /tmp
aws s3 cp s3://amazon-reviews-pds/tsv/amazon_reviews_us_Beauty_v1_00.tsv.gz .
gunzip amazon_reviews_us_Beauty_v1_00.tsv.gz
#Adding Pandas package
pip3 install pandas

For the Users dataset, enter the following code:

#Generating the user dataset
import pandas as pd
fields = ['customer_id', 'helpful_votes', 'total_votes']
df = pd.read_csv('amazon_reviews_us_Beauty_v1_00.tsv', sep='\t', usecols=fields)
df = df.rename(columns={'customer_id':'USER_ID'})
df.to_csv('User_dataset.csv', index = None, header=True)

You can preview the Users dataset by running

df.head()

Delete the Users DataFrame to free up memory by running del df.

For the Items dataset, enter the following code:

#Generating the item dataset
import pandas as pd
fields = ['product_id', 'product_category', 'product_parent']
df1 = pd.read_csv('amazon_reviews_us_Beauty_v1_00.tsv', sep='\t', usecols=fields)
df1= df1.rename(columns={'product_id':'ITEM_ID'})

#Clip category names to 999 characters to conform to Personalize limits
maxlen = 999
for index, row in df1.iterrows():
    product_category = row['product_category'][:maxlen]
    df1.at[index, 'product_category'] = product_category
# End of for loop - hit enter here if running interactive mode
df1.to_csv('Item_dataset.csv', index = None, header=True)

You can preview the Items dataset by running

df1.head()

Delete the Items DataFrame to free up memory by running del df1.
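The clipping loop in the Items code above amounts to plain string slicing: slicing past the end of a shorter string is a no-op, so every value ends up at most 999 characters. A small illustration:

```python
maxlen = 999  # Amazon Personalize character limit used in this post

def clip(value, limit=maxlen):
    # Slicing never raises for short strings; it returns them unchanged
    return value[:limit]

print(len(clip("x" * 2000)))  # 999
print(clip("Beauty"))         # Beauty
```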

For the Interactions dataset, enter the following code:

#Generating the interactions dataset
import pandas as pd
from datetime import datetime
fields = ['product_id', 'customer_id', 'review_date', 'star_rating']
df2 = pd.read_csv('amazon_reviews_us_Beauty_v1_00.tsv', sep='\t', usecols=fields)
#Note that you can ignore the "...DtypeWarning..." message if you are running this process in CloudShell
df2= df2.rename(columns={'product_id':'ITEM_ID', 'customer_id':'USER_ID', 'review_date':'TIMESTAMP'})

#Converting review_date to a UNIX timestamp, rounded to whole seconds
num_errors = 0
for index, row in df2.iterrows(): 
    time_input= row["TIMESTAMP"]
    try:
        time_input = datetime.strptime(time_input, "%Y-%m-%d")
        timestamp = round(datetime.timestamp(time_input))
        df2.at[index, "TIMESTAMP"] = timestamp
    except (TypeError, ValueError):
        print("exception at index: {}".format(index))
        num_errors += 1
# End of for loop - hit enter here if running interactive mode
# You should receive a series of "exception at index..." outputs
print("Total rows in error: {}".format(num_errors))
df2.to_csv("Interaction_dataset.csv", index = None, header=True)

You can preview the Interactions dataset by running

df2.head()

If you are using interactive mode, exit python3 and return to the bash shell by running quit().
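The conversion loop above parses each review_date and writes back an epoch value. Note that datetime.timestamp() on a naive datetime interprets it in the machine's local time zone; if you want reproducible values, attach UTC explicitly, as in this small sketch:

```python
from datetime import datetime, timezone

def to_epoch(date_str):
    """Parse a YYYY-MM-DD string and return its UTC UNIX timestamp in seconds."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return round(dt.timestamp())

print(to_epoch("2015-08-31"))  # 1440979200
```

Either interpretation is fine for Amazon Personalize as long as you apply it consistently across the dataset.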

Uploading the data

Note that if your CloudShell session is lost at any point in the procedure, you can resume by restoring previously set variables from the persistent file with the Bash command source ~/local_variables.txt.

Also note that CloudShell is a regional instance, so make sure you log back into CloudShell in the same Region where you started.

After pre-processing is complete, upload the data to your Amazon S3 bucket. Be sure to replace <your_bucket_name_here> with a globally unique S3 bucket name that observes the S3 bucket naming rules.

demo_bucket_name="<your_bucket_name_here>"\
&&echo demo_bucket_name=$demo_bucket_name \
>> ~/local_variables.txt

demo_key_prefix="train/demo"\
&&echo demo_key_prefix=$demo_key_prefix \
>> ~/local_variables.txt

aws s3 mb s3://$demo_bucket_name

aws s3api put-object \
--bucket $demo_bucket_name \
--key "${demo_key_prefix}/users/user_dataset.csv" \
--body User_dataset.csv

aws s3api put-object \
--bucket $demo_bucket_name \
--key "${demo_key_prefix}/items/item_dataset.csv" \
--body Item_dataset.csv

aws s3api put-object \
--bucket $demo_bucket_name \
--key "${demo_key_prefix}/interactions/interaction_dataset.csv" \
--body Interaction_dataset.csv
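If you want to sanity-check a bucket name locally before running aws s3 mb, a simplified check might look like the following. This covers only the most common rules (length, lowercase letters, digits, hyphens, periods, edge characters), not the full S3 specification:

```python
import re

# Simplified subset of the S3 bucket naming rules; see the official
# documentation for the complete set (no IP-address names, no "xn--" prefix, etc.)
BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")

def looks_like_valid_bucket_name(name):
    return bool(BUCKET_RE.match(name))

print(looks_like_valid_bucket_name("demo-personalize-2023"))  # True
print(looks_like_valid_bucket_name("Bad_Bucket"))             # False
```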

Ingesting the data

After you process the preceding data, you can ingest it in Amazon Personalize.

Creating a dataset group

To create a dataset group to store events (user interactions) sent by your application and the metadata for users and items, complete the following commands:

dataset_group_name="demo-dataset"\
&&echo dataset_group_name=$dataset_group_name \
>> ~/local_variables.txt

aws personalize create-dataset-group \
--name $dataset_group_name

dataset_group_arn=$(aws personalize list-dataset-groups \
--query 'datasetGroups[?name==`demo-dataset`].datasetGroupArn' \
--output=text)\
&&echo dataset_group_arn=$dataset_group_arn \
>> ~/local_variables.txt

Creating a dataset and defining schema

After you create the dataset group, create a dataset and define a schema for each of your three datasets with the following commands.

Create schemas for Items, Users, and Interactions:

# Create the Items Schema
aws personalize create-schema \
--name 'demo-items-schema' \
--schema ' {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "product_parent",
            "type": "string",
            "categorical": true
        },
        {
            "name": "product_category",
            "type": "string",
            "categorical": true
        }
    ],
    "version": "1.0"
}'

items_schema_arn=$(aws personalize list-schemas \
--query 'schemas[?name==`demo-items-schema`].schemaArn' \
--output text)\
&&echo items_schema_arn=$items_schema_arn \
>> ~/local_variables.txt

# Create the Users Schema
aws personalize create-schema \
--name 'demo-users-schema' \
--schema ' {
    "type": "record",
    "name": "Users",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "helpful_votes",
            "type": "float"
        },
        {
            "name": "total_votes",
            "type": "float"
        }
    ],
    "version": "1.0"
}'

users_schema_arn=$(aws personalize list-schemas \
--query 'schemas[?name==`demo-users-schema`].schemaArn' \
--output text)\
&&echo users_schema_arn=$users_schema_arn \
>> ~/local_variables.txt

# Create the Interactions Schema
aws personalize create-schema \
--name 'demo-interactions-schema' \
--schema ' {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "star_rating",
            "type": "string",
            "categorical": true
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        }
    ],
    "version": "1.0"
}'

interactions_schema_arn=$(aws personalize list-schemas \
--query 'schemas[?name==`demo-interactions-schema`].schemaArn' \
--output text)\
&&echo interactions_schema_arn=$interactions_schema_arn \
>> ~/local_variables.txt 

Create the datasets for Items, Users, and Interactions:

# Create Items datasets
aws personalize create-dataset \
--name "demo-items" \
--schema-arn $items_schema_arn \
--dataset-group-arn $dataset_group_arn \
--dataset-type Items

items_dataset_arn=$(aws personalize list-datasets \
--query 'datasets[?name==`demo-items`].datasetArn' \
--output=text)\
&&echo items_dataset_arn=$items_dataset_arn \
>> ~/local_variables.txt

# Create Users datasets
aws personalize create-dataset \
--name "demo-users" \
--schema-arn $users_schema_arn \
--dataset-group-arn $dataset_group_arn \
--dataset-type Users

users_dataset_arn=$(aws personalize list-datasets \
--query 'datasets[?name==`demo-users`].datasetArn' \
--output=text)\
&&echo users_dataset_arn=$users_dataset_arn \
>> ~/local_variables.txt

# Create Interactions datasets
aws personalize create-dataset \
--name "demo-interactions" \
--schema-arn $interactions_schema_arn \
--dataset-group-arn $dataset_group_arn \
--dataset-type Interactions

interactions_dataset_arn=$(aws personalize list-datasets \
--query 'datasets[?name==`demo-interactions`].datasetArn' \
--output=text)\
&&echo interactions_dataset_arn=$interactions_dataset_arn \
>> ~/local_variables.txt

Importing the data

After you create the datasets, import the data from Amazon S3 by completing the following commands.

Set up policies and roles to allow S3 and Personalize interactions:

# Create IAM Execution Role for Personalize service to read data from bucket
personalize_iam_policy_name="Demo-Personalize-ExecutionPolicy"\
&&echo personalize_iam_policy_name=$personalize_iam_policy_name \
>> ~/local_variables.txt

personalize_iam_role_name="Demo-Personalize-ExecutionRole"\
&&echo personalize_iam_role_name=$personalize_iam_role_name \
>> ~/local_variables.txt

personalize_managed_iam_service_policy_arn=\
"arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess"\
&&echo personalize_managed_iam_service_policy_arn=\
$personalize_managed_iam_service_policy_arn \
>> ~/local_variables.txt

printf -v personalize_iam_policy_json '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::%s",
                "arn:aws:s3:::%s/*"
            ]
        }
    ]
}' "$demo_bucket_name" "$demo_bucket_name"

aws iam create-policy \
--policy-name $personalize_iam_policy_name \
--policy-document "$personalize_iam_policy_json"

personalize_iam_policy_arn=$(aws iam list-policies \
--query 'Policies[?PolicyName==`Demo-Personalize-ExecutionPolicy`].Arn' \
--output text)\
&&echo personalize_iam_policy_arn=$personalize_iam_policy_arn \
>> ~/local_variables.txt

aws iam create-role \
--role-name $personalize_iam_role_name \
--assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}'

personalize_iam_role_arn=$(aws iam list-roles \
--query 'Roles[?RoleName==`Demo-Personalize-ExecutionRole`].Arn' \
--output text)\
&&echo personalize_iam_role_arn=$personalize_iam_role_arn \
>> ~/local_variables.txt


aws iam attach-role-policy \
--role-name $personalize_iam_role_name \
--policy-arn $personalize_iam_policy_arn


aws iam attach-role-policy \
--role-name $personalize_iam_role_name \
--policy-arn $personalize_managed_iam_service_policy_arn

# Create S3 bucket policy and attach to bucket for Personalize to access S3
printf -v s3_bucket_policy_json '{
    "Version": "2012-10-17",
    "Id": "PersonalizeS3BucketAccessPolicy",
    "Statement": [
        {
            "Sid": "PersonalizeS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::%s",
                "arn:aws:s3:::%s/*"
            ]
        }
    ]
}' "$demo_bucket_name" "$demo_bucket_name"

aws s3api put-bucket-policy \
--bucket $demo_bucket_name \
--policy "$s3_bucket_policy_json"
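The printf -v templating above substitutes the bucket name into the policy document. An equivalent way to build and validate the same JSON, sketched in Python (the bucket name here is a placeholder):

```python
import json

demo_bucket_name = "my-demo-bucket"  # placeholder; use your own bucket name

# Same bucket policy as the shell version, built as a Python dict
policy = {
    "Version": "2012-10-17",
    "Id": "PersonalizeS3BucketAccessPolicy",
    "Statement": [
        {
            "Sid": "PersonalizeS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {"Service": "personalize.amazonaws.com"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{demo_bucket_name}",
                f"arn:aws:s3:::{demo_bucket_name}/*",
            ],
        }
    ],
}

# json.dumps guarantees the document is well-formed before it reaches the API
policy_json = json.dumps(policy)
print(f"arn:aws:s3:::{demo_bucket_name}" in policy_json)  # True
```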

Create dataset import jobs:


# Create dataset import jobs
aws personalize create-dataset-import-job \
--role-arn $personalize_iam_role_arn \
--job-name "demo-initial-items-import" \
--dataset-arn $items_dataset_arn \
--data-source "dataLocation=s3://${demo_bucket_name}/${demo_key_prefix}/items/item_dataset.csv"

aws personalize create-dataset-import-job \
--role-arn $personalize_iam_role_arn \
--job-name "demo-initial-users-import" \
--dataset-arn $users_dataset_arn \
--data-source "dataLocation=s3://${demo_bucket_name}/${demo_key_prefix}/users/user_dataset.csv"

aws personalize create-dataset-import-job \
--role-arn $personalize_iam_role_arn \
--job-name "demo-initial-interactions-import" \
--dataset-arn $interactions_dataset_arn \
--data-source "dataLocation=s3://${demo_bucket_name}/${demo_key_prefix}/interactions/interaction_dataset.csv"

Check the status of the dataset import jobs. This may take several minutes.

# Check status of dataset import jobs for "ACTIVE" status before proceeding
aws personalize list-dataset-import-jobs \
--query 'datasetImportJobs[?jobName==`demo-initial-items-import`].[jobName, status]' \
--output text&&\
aws personalize list-dataset-import-jobs \
--query 'datasetImportJobs[?jobName==`demo-initial-users-import`].[jobName, status]' \
--output text&&\
aws personalize list-dataset-import-jobs \
--query 'datasetImportJobs[?jobName==`demo-initial-interactions-import`].[jobName, status]' \
--output text
# Once status of dataset import jobs are all "ACTIVE," proceed to next step.
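Rather than rerunning the status commands by hand, you can poll until every job reports ACTIVE. A generic sketch of that wait loop (the fetch_status callable stands in for the list-dataset-import-jobs query above):

```python
import time

def wait_until_active(fetch_status, poll_seconds=60, max_polls=60):
    """Call fetch_status() until it returns 'ACTIVE'; raise on failure or timeout."""
    for _ in range(max_polls):
        status = fetch_status()
        if status == "ACTIVE":
            return status
        if status == "CREATE FAILED":
            raise RuntimeError("import job failed")
        time.sleep(poll_seconds)
    raise TimeoutError("job did not become ACTIVE in time")

# Demo with a stand-in that becomes ACTIVE on the third call
statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
print(wait_until_active(lambda: next(statuses), poll_seconds=0))  # ACTIVE
```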

Training a model

After you ingest the data into Amazon Personalize, you are ready to train a model (solution version). To do so, map the recipe (algorithm) you want to use to your use case; see the Amazon Personalize documentation for the available recipes.

This post uses the User-Personalization recipe to define a solution and then train a solutionVersion (model). Complete the following commands.

# Create solution and train solution version
# Note that you may need to source saved variables from the file:
source ~/local_variables.txt

aws personalize create-solution \
--name demo-user-personalization \
--dataset-group-arn $dataset_group_arn \
--recipe-arn arn:aws:personalize:::recipe/aws-user-personalization

personalize_solution_arn=$(aws personalize list-solutions \
--query 'solutions[?name==`demo-user-personalization`].solutionArn' \
--output text)\
&&echo personalize_solution_arn=$personalize_solution_arn \
>> ~/local_variables.txt

aws personalize create-solution-version \
--solution-arn $personalize_solution_arn \
--training-mode FULL

personalize_solution_version_arn=$(aws personalize describe-solution \
--solution-arn $personalize_solution_arn \
--query 'solution.latestSolutionVersion.solutionVersionArn' \
--output text)\
&&echo personalize_solution_version_arn=$personalize_solution_version_arn \
>> ~/local_variables.txt

You can also change the default hyperparameters or perform hyperparameter optimization for a solution.
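With the AWS CLI, you do this by passing --solution-config to create-solution. A sketch of the JSON shape follows; the field names come from the CreateSolution API, but the hyperparameter value below is purely illustrative, not a tuned recommendation:

```python
import json

# Illustrative solutionConfig; "hidden_dimension" is one of the
# User-Personalization recipe's hyperparameters, and the HPO resource
# limits control how many tuning jobs run. Values here are examples only.
solution_config = {
    "algorithmHyperParameters": {"hidden_dimension": "100"},
    "hpoConfig": {
        "hpoResourceConfig": {
            "maxNumberOfTrainingJobs": "4",
            "maxParallelTrainingJobs": "2",
        }
    },
}

print(json.dumps(solution_config))
```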

Check the status of the solution version. This may take an hour or longer because it runs full training on the datasets.

# Check status of solution version training for "ACTIVE" before proceeding
# Note that you may need to source saved variables from the file:
source ~/local_variables.txt

aws personalize describe-solution-version \
--solution-version-arn $personalize_solution_version_arn \
--query 'solutionVersion.[solutionVersionArn, status]'

Getting recommendations

To get recommendations, create a campaign using the solution and solution version you just created. Complete the following steps:

# Create Campaign from solution version
# Note that you may need to source saved variables from the file:
source ~/local_variables.txt

aws personalize create-campaign \
--name demo-user-personalization-test \
--solution-version-arn $personalize_solution_version_arn \
--min-provisioned-tps 1

personalize_campaign_arn=$(aws personalize list-campaigns \
--solution-arn $personalize_solution_arn \
--query 'campaigns[?name==`demo-user-personalization-test`].campaignArn' \
--output text)\
&&echo personalize_campaign_arn=$personalize_campaign_arn \
>> ~/local_variables.txt

Check the status of the campaign. This may take several minutes.

# Check status of Campaign for "ACTIVE" status
# Note that you may need to source saved variables from the file:
source ~/local_variables.txt

aws personalize describe-campaign \
--campaign-arn $personalize_campaign_arn \
--query 'campaign.[name, status]'

After you set up the campaign, you can programmatically call the campaign to get recommendations in the form of item IDs. You can also use the console to get the recommendations and perform spot checks. Additionally, Amazon Personalize offers the ability to batch process recommendations. For more information, see Now available: Batch Recommendations in Amazon Personalize.

One way to test the campaign is with the following commands, which query both an existing and a nonexistent user.

# Test campaign results with a known and a nonexistent user ID
# Note that you may need to source saved variables from the file:
source ~/local_variables.txt

# Known user from dataset
test_user_id_1="19551372"

# New nonexistent user
test_user_id_2="1955137000"

# View recommendations for known user with previous recorded interactions
aws personalize-runtime get-recommendations \
--campaign-arn $personalize_campaign_arn \
--user-id $test_user_id_1 \
--num-results 5

# View recommendations for new user with no previous recorded interactions
aws personalize-runtime get-recommendations \
--campaign-arn $personalize_campaign_arn \
--user-id $test_user_id_2 \
--num-results 5

You should see the top five recommended item IDs for each user, ranked in descending order of relevance.
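The response is JSON containing an itemList of itemId/score pairs. Extracting just the item IDs, using a response of the same shape (the IDs below are fabricated examples, not real products):

```python
import json

# Shape of a get-recommendations response; item IDs here are made up
response_json = '''{
  "itemList": [
    {"itemId": "B00EXAMPLE1", "score": 0.021},
    {"itemId": "B00EXAMPLE2", "score": 0.017}
  ],
  "recommendationId": "RID-example"
}'''

response = json.loads(response_json)
item_ids = [item["itemId"] for item in response["itemList"]]
print(item_ids)  # ['B00EXAMPLE1', 'B00EXAMPLE2']
```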

Cleaning up

If you would like to remove the resources that you created in this post, run the following commands:

# Note that you may need to source saved variables from the file:
source ~/local_variables.txt
#Delete the campaign
aws personalize delete-campaign --campaign-arn $personalize_campaign_arn
#Wait for the campaign deletion to complete before deleting the solution
#Delete the solution
aws personalize delete-solution --solution-arn $personalize_solution_arn
#Delete datasets
aws personalize delete-dataset --dataset-arn $items_dataset_arn
aws personalize delete-dataset --dataset-arn $users_dataset_arn
aws personalize delete-dataset --dataset-arn $interactions_dataset_arn
#Delete the dataset group
aws personalize delete-dataset-group --dataset-group-arn $dataset_group_arn
#Delete schemas
aws personalize delete-schema --schema-arn $items_schema_arn
aws personalize delete-schema --schema-arn $users_schema_arn
aws personalize delete-schema --schema-arn $interactions_schema_arn
#Delete the S3 bucket and contents
aws s3 rb s3://$demo_bucket_name --force
#Delete IAM Role and Policy
aws iam detach-role-policy --role-name $personalize_iam_role_name --policy-arn $personalize_managed_iam_service_policy_arn
aws iam detach-role-policy --role-name $personalize_iam_role_name --policy-arn $personalize_iam_policy_arn
aws iam delete-role --role-name $personalize_iam_role_name
aws iam delete-policy --policy-arn $personalize_iam_policy_arn
#To clear the file saving variables
rm -f ~/local_variables.txt

Conclusion

You can now use these recommendations to power display experiences, such as personalizing the homepage of your beauty website based on what you know about the user, or sending a promotional email with recommendations. Performing real-time recommendations with Amazon Personalize requires you to also send user events as they occur. For more information, see Amazon Personalize is Now Generally Available. Get started with Amazon Personalize today!


About the author

Vaibhav Sethi is the Product Manager for Amazon Personalize. He focuses on delivering products that make it easier to build machine learning solutions. In his spare time, he enjoys hiking and reading.

Brian Soper is a Solutions Architect at Amazon Web Services helping AWS customers transform and architect for the cloud since 2018. Brian has a 20+ year background building out physical and virtual infrastructure for both on-premises and cloud.

Rob Percival is an Account Manager in the AWS Games organization. He works with operators, game developers, and software providers in the US Real Money Gaming (online sports betting and casino gambling) industry to increase speed to market, gain deeper insight on their players, and accelerate experimentation and innovation using AWS.