AWS Machine Learning Blog
Amazon Personalize can now use 10X more item attributes to improve relevance of recommendations
January 2023: This blog post was reviewed and updated by Brian Soper and Rob Percival, with new steps and code along with the option to use AWS CloudShell to run the procedure.
Amazon Personalize is a machine learning service that enables you to personalize your website, app, ads, emails, and more with custom machine learning models, no prior machine learning experience required. AWS is pleased to announce that Amazon Personalize now supports ten times more item attributes for modeling. Previously, you could use up to five item attributes while building an ML model in Amazon Personalize; this limit is now 50 attributes. You can now use more information about your items, such as category, brand, price, duration, size, author, and year of release, to increase the relevance of recommendations.
In this post, you learn how to add item metadata with custom attributes to Amazon Personalize and create a model using this data and user interactions. This post uses the Amazon customer reviews data for beauty products. For more information and to download this data, see Amazon Customer Reviews Dataset. We will use the history of what items the users have reviewed along with user and item metadata to generate product recommendations for them.
Pre-processing the data
To model the data in Amazon Personalize, you need to break it into the following datasets:
- Users – Contains metadata about the users
- Items – Contains metadata about the items
- Interactions – Contains interactions (for this post, reviews) and metadata about the interactions
For each respective dataset, this post uses the following attributes:
- Users – customer_id, helpful_votes, and total_votes
- Items – product_id, product_category, and product_parent
- Interactions – product_id, customer_id, review_date, and star_rating
This post does not use the other attributes available, which include marketplace, review_id, product_title, vine, verified_purchase, review_headline, and review_body.
Additionally, to conform with the reserved keywords in Amazon Personalize, this post renames customer_id to USER_ID, product_id to ITEM_ID, and review_date to TIMESTAMP.
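The column selection and renaming above can be sketched with just the Python standard library. The two-row TSV sample and the product ID below are made up for illustration; the real dataset has many more columns and rows:

```python
import csv
import io

# Illustrative sample of the review TSV (the real file has more columns)
sample_tsv = (
    "customer_id\tproduct_id\tproduct_category\treview_date\tstar_rating\n"
    "19551372\tB00EXAMPLE\tBeauty\t2015-08-31\t5\n"
)

# Column subset and Personalize-keyword renames used for the Interactions dataset
rename = {"customer_id": "USER_ID", "product_id": "ITEM_ID", "review_date": "TIMESTAMP"}
interaction_fields = ["product_id", "customer_id", "review_date", "star_rating"]

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))
interactions = [{rename.get(f, f): row[f] for f in interaction_fields} for row in rows]
print(sorted(interactions[0]))  # ['ITEM_ID', 'TIMESTAMP', 'USER_ID', 'star_rating']
```

The pandas code in the next sections does the same selection and renaming at full-dataset scale.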
To make getting started easier, you can use AWS CloudShell to experiment with this procedure. To do this, choose a Region from the AWS Regional Services List that supports both AWS CloudShell and Amazon Personalize. If you are not using CloudShell, be sure your environment includes the AWS CLI.
To download and process the data for input to Amazon Personalize, use the following example code blocks. The Python code blocks assume Python 3.
#Downloading data
#If using AWS CloudShell, use the /tmp directory for more space to work
cd /tmp
aws s3 cp s3://amazon-reviews-pds/tsv/amazon_reviews_us_Beauty_v1_00.tsv.gz .
gunzip amazon_reviews_us_Beauty_v1_00.tsv.gz
#Adding Pandas package
pip3 install pandas
For the Users dataset, enter the following code:
#Generating the user dataset
import pandas as pd
fields = ['customer_id', 'helpful_votes', 'total_votes']
df = pd.read_csv('amazon_reviews_us_Beauty_v1_00.tsv', sep='\t', usecols=fields)
df = df.rename(columns={'customer_id':'USER_ID'})
df.to_csv('User_dataset.csv', index = None, header=True)
The following screenshot shows the Users dataset. This output can be generated by df.head(). Delete the Users dataset dataframe to free up memory by running del df.
For the Items dataset, enter the following code:
#Generating the item dataset
import pandas as pd
fields = ['product_id', 'product_category', 'product_parent']
df1 = pd.read_csv('amazon_reviews_us_Beauty_v1_00.tsv', sep='\t', usecols=fields)
df1= df1.rename(columns={'product_id':'ITEM_ID'})
#Clip category names to 999 characters to conform to Personalize limits
maxlen = 999
for index, row in df1.iterrows():
    product_category = row['product_category'][:maxlen]
    df1.at[index, 'product_category'] = product_category
# End of for loop - hit enter here if running interactive mode
df1.to_csv('Item_dataset.csv', index = None, header=True)
The following screenshot shows the Items dataset. This output can be generated by df1.head(). Delete the Items dataset dataframe to free up memory by running del df1.
For the Interactions dataset, enter the following code:
#Generating the interactions dataset
import pandas as pd
from datetime import datetime
fields = ['product_id', 'customer_id', 'review_date', 'star_rating']
df2 = pd.read_csv('amazon_reviews_us_Beauty_v1_00.tsv', sep='\t', usecols=fields)
#Note that you can ignore the "...DtypeWarning..." message if you are running this process in CloudShell
df2= df2.rename(columns={'product_id':'ITEM_ID', 'customer_id':'USER_ID', 'review_date':'TIMESTAMP'})
#Converting the review date to a UNIX timestamp rounded to whole seconds
num_errors = 0
for index, row in df2.iterrows():
    time_input = row["TIMESTAMP"]
    try:
        time_input = datetime.strptime(time_input, "%Y-%m-%d")
        timestamp = round(datetime.timestamp(time_input))
        df2.at[index, "TIMESTAMP"] = timestamp
    except Exception:
        print("exception at index: {}".format(index))
        num_errors += 1
# End of for loop - hit enter here if running interactive mode
# You should receive a series of "exception at index..." outputs
print("Total rows in error: {}".format(num_errors))
df2.to_csv("Interaction_dataset.csv", index = None, header=True)
The following screenshot shows the Interactions dataset. This output can be generated by df2.head(). If using interactive mode, quit python3 and return to the bash shell by running quit().
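Note that the conversion loop above uses the system's local timezone (via datetime.timestamp), so the resulting values can shift by your UTC offset; Personalize only needs the ordering to be consistent. A deterministic UTC variant of the same conversion, as a standalone sketch:

```python
from datetime import datetime, timezone

def to_unix_timestamp(date_str: str) -> int:
    """Convert a YYYY-MM-DD review date to a UNIX epoch timestamp in UTC."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

print(to_unix_timestamp("2015-08-31"))  # 1440979200
```

In recent pandas versions the whole column can also be converted in one vectorized step, for example pd.to_datetime(df2['TIMESTAMP']).astype('int64') // 10**9, which is much faster than iterating with iterrows.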
Uploading the data
Note that if your CloudShell session is lost at any point in the procedure, you can resume work by pulling previously set variables from the persistent file with the Bash command source ~/local_variables.txt.
Also note that CloudShell is a regional instance, so make sure you log back into CloudShell in the same Region where you started.
After pre-processing is complete, upload the data to your Amazon S3 bucket. Be sure to replace <your_bucket_name_here> with a globally unique S3 bucket name that observes the S3 bucket naming rules.
demo_bucket_name="<your_bucket_name_here>"\
&&echo demo_bucket_name=$demo_bucket_name \
>> ~/local_variables.txt
demo_key_prefix="train/demo"\
&&echo demo_key_prefix=$demo_key_prefix \
>> ~/local_variables.txt
aws s3 mb s3://$demo_bucket_name
aws s3api put-object \
--bucket $demo_bucket_name \
--key "${demo_key_prefix}/users/user_dataset.csv" \
--body User_dataset.csv
aws s3api put-object \
--bucket $demo_bucket_name \
--key "${demo_key_prefix}/items/item_dataset.csv" \
--body Item_dataset.csv
aws s3api put-object \
--bucket $demo_bucket_name \
--key "${demo_key_prefix}/interactions/interaction_dataset.csv" \
--body Interaction_dataset.csv
Ingesting the data
After you process the preceding data, you can ingest it in Amazon Personalize.
Creating a dataset group
To create a dataset group to store events (user interactions) sent by your application and the metadata for users and items, complete the following commands:
dataset_group_name="demo-dataset"\
&&echo dataset_group_name=$dataset_group_name \
>> ~/local_variables.txt
aws personalize create-dataset-group \
--name $dataset_group_name
dataset_group_arn=$(aws personalize list-dataset-groups \
--query 'datasetGroups[?name==`demo-dataset`].datasetGroupArn' \
--output=text)\
&&echo dataset_group_arn=$dataset_group_arn \
>> ~/local_variables.txt
Creating a dataset and defining schema
After you create the dataset group, create a dataset and define a schema for each of them. The following commands cover your three datasets.
Create schemas for Items, Users, and Interactions:
# Create the Items Schema
aws personalize create-schema \
--name 'demo-items-schema' \
--schema ' {
"type": "record",
"name": "Items",
"namespace": "com.amazonaws.personalize.schema",
"fields": [
{
"name": "ITEM_ID",
"type": "string"
},
{
"name": "product_parent",
"type": "string",
"categorical": true
},
{
"name": "product_category",
"type": "string",
"categorical": true
}
],
"version": "1.0"
}'
items_schema_arn=$(aws personalize list-schemas \
--query 'schemas[?name==`demo-items-schema`].schemaArn' \
--output text)\
&&echo items_schema_arn=$items_schema_arn \
>> ~/local_variables.txt
# Create the Users Schema
aws personalize create-schema \
--name 'demo-users-schema' \
--schema ' {
"type": "record",
"name": "Users",
"namespace": "com.amazonaws.personalize.schema",
"fields": [
{
"name": "USER_ID",
"type": "string"
},
{
"name": "helpful_votes",
"type": "float"
},
{
"name": "total_votes",
"type": "float"
}
],
"version": "1.0"
}'
users_schema_arn=$(aws personalize list-schemas \
--query 'schemas[?name==`demo-users-schema`].schemaArn' \
--output text)\
&&echo users_schema_arn=$users_schema_arn \
>> ~/local_variables.txt
# Create the Interactions Schema
aws personalize create-schema \
--name 'demo-interactions-schema' \
--schema ' {
"type": "record",
"name": "Interactions",
"namespace": "com.amazonaws.personalize.schema",
"fields": [
{
"name": "USER_ID",
"type": "string"
},
{
"name": "ITEM_ID",
"type": "string"
},
{
"name": "star_rating",
"type": "string",
"categorical": true
},
{
"name": "TIMESTAMP",
"type": "long"
}
],
"version": "1.0"
}'
interactions_schema_arn=$(aws personalize list-schemas \
--query 'schemas[?name==`demo-interactions-schema`].schemaArn' \
--output text)\
&&echo interactions_schema_arn=$interactions_schema_arn \
>> ~/local_variables.txt
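Before importing, it can be worth checking that every schema field has a matching CSV column, since Personalize matches columns to fields by name rather than by position. A small stdlib sketch using the Interactions schema above (the header list here is illustrative):

```python
import json

# The Interactions schema defined above
interactions_schema = json.loads('''{
  "type": "record",
  "name": "Interactions",
  "namespace": "com.amazonaws.personalize.schema",
  "fields": [
    {"name": "USER_ID", "type": "string"},
    {"name": "ITEM_ID", "type": "string"},
    {"name": "star_rating", "type": "string", "categorical": true},
    {"name": "TIMESTAMP", "type": "long"}
  ],
  "version": "1.0"
}''')

# Illustrative header row of Interaction_dataset.csv (column order does not matter)
csv_header = ["ITEM_ID", "USER_ID", "TIMESTAMP", "star_rating"]

schema_fields = {f["name"] for f in interactions_schema["fields"]}
missing = schema_fields - set(csv_header)
print(missing)  # set() -> every schema field is present in the CSV
```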
Create the datasets for Items, Users, and Interactions:
# Create Items datasets
aws personalize create-dataset \
--name "demo-items" \
--schema-arn $items_schema_arn \
--dataset-group-arn $dataset_group_arn \
--dataset-type Items
items_dataset_arn=$(aws personalize list-datasets \
--query 'datasets[?name==`demo-items`].datasetArn' \
--output=text)\
&&echo items_dataset_arn=$items_dataset_arn \
>> ~/local_variables.txt
# Create Users datasets
aws personalize create-dataset \
--name "demo-users" \
--schema-arn $users_schema_arn \
--dataset-group-arn $dataset_group_arn \
--dataset-type Users
users_dataset_arn=$(aws personalize list-datasets \
--query 'datasets[?name==`demo-users`].datasetArn' \
--output=text)\
&&echo users_dataset_arn=$users_dataset_arn \
>> ~/local_variables.txt
# Create Interactions datasets
aws personalize create-dataset \
--name "demo-interactions" \
--schema-arn $interactions_schema_arn \
--dataset-group-arn $dataset_group_arn \
--dataset-type Interactions
interactions_dataset_arn=$(aws personalize list-datasets \
--query 'datasets[?name==`demo-interactions`].datasetArn' \
--output=text)\
&&echo interactions_dataset_arn=$interactions_dataset_arn \
>> ~/local_variables.txt
Importing the data
After you create the dataset, import the data from Amazon S3. To import your Items data, complete the following commands.
Set up policies and roles to allow S3 and Personalize interactions:
# Create IAM Execution Role for Personalize service to read data from bucket
personalize_iam_policy_name="Demo-Personalize-ExecutionPolicy"\
&&echo personalize_iam_policy_name=$personalize_iam_policy_name \
>> ~/local_variables.txt
personalize_iam_role_name="Demo-Personalize-ExecutionRole"\
&&echo personalize_iam_role_name=$personalize_iam_role_name \
>> ~/local_variables.txt
personalize_managed_iam_service_policy_arn=\
"arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess"\
&&echo personalize_managed_iam_service_policy_arn=\
$personalize_managed_iam_service_policy_arn \
>> ~/local_variables.txt
printf -v personalize_iam_policy_json '{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::%s",
"arn:aws:s3:::%s/*"
]
}
]
}' "$demo_bucket_name" "$demo_bucket_name"
aws iam create-policy \
--policy-name $personalize_iam_policy_name \
--policy-document "$personalize_iam_policy_json"
personalize_iam_policy_arn=$(aws iam list-policies \
--query 'Policies[?PolicyName==`Demo-Personalize-ExecutionPolicy`].Arn' \
--output text)\
&&echo personalize_iam_policy_arn=$personalize_iam_policy_arn \
>> ~/local_variables.txt
aws iam create-role \
--role-name $personalize_iam_role_name \
--assume-role-policy-document '{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "personalize.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}'
personalize_iam_role_arn=$(aws iam list-roles \
--query 'Roles[?RoleName==`Demo-Personalize-ExecutionRole`].Arn' \
--output text)\
&&echo personalize_iam_role_arn=$personalize_iam_role_arn \
>> ~/local_variables.txt
aws iam attach-role-policy \
--role-name $personalize_iam_role_name \
--policy-arn $personalize_iam_policy_arn
aws iam attach-role-policy \
--role-name $personalize_iam_role_name \
--policy-arn $personalize_managed_iam_service_policy_arn
# Create S3 bucket policy and attach to bucket for Personalize to access S3
printf -v s3_bucket_policy_json '{
"Version": "2012-10-17",
"Id": "PersonalizeS3BucketAccessPolicy",
"Statement": [
{
"Sid": "PersonalizeS3BucketAccessPolicy",
"Effect": "Allow",
"Principal": {
"Service": "personalize.amazonaws.com"
},
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::%s",
"arn:aws:s3:::%s/*"
]
}
]
}' "$demo_bucket_name" "$demo_bucket_name"
aws s3api put-bucket-policy \
--bucket $demo_bucket_name \
--policy "$s3_bucket_policy_json"
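The printf templating above works, but if you script this step in Python, building the policy document with json.dumps sidesteps shell-quoting mistakes. A sketch, with a placeholder bucket name standing in for $demo_bucket_name:

```python
import json

bucket = "my-example-bucket"  # placeholder for $demo_bucket_name

# Same bucket policy as above, built as a Python dict
policy = {
    "Version": "2012-10-17",
    "Id": "PersonalizeS3BucketAccessPolicy",
    "Statement": [{
        "Sid": "PersonalizeS3BucketAccessPolicy",
        "Effect": "Allow",
        "Principal": {"Service": "personalize.amazonaws.com"},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
    }],
}
policy_json = json.dumps(policy)  # pass this string to put-bucket-policy
```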
Create dataset import jobs:
# Create dataset import jobs
aws personalize create-dataset-import-job \
--role-arn $personalize_iam_role_arn \
--job-name "demo-initial-items-import" \
--dataset-arn $items_dataset_arn \
--data-source "dataLocation=s3://${demo_bucket_name}/${demo_key_prefix}/items/item_dataset.csv"
aws personalize create-dataset-import-job \
--role-arn $personalize_iam_role_arn \
--job-name "demo-initial-users-import" \
--dataset-arn $users_dataset_arn \
--data-source "dataLocation=s3://${demo_bucket_name}/${demo_key_prefix}/users/user_dataset.csv"
aws personalize create-dataset-import-job \
--role-arn $personalize_iam_role_arn \
--job-name "demo-initial-interactions-import" \
--dataset-arn $interactions_dataset_arn \
--data-source "dataLocation=s3://${demo_bucket_name}/${demo_key_prefix}/interactions/interaction_dataset.csv"
Check status of the dataset import jobs. This may take several minutes.
# Check status of dataset import jobs for "ACTIVE" status before proceeding
aws personalize list-dataset-import-jobs \
--query 'datasetImportJobs[?jobName==`demo-initial-items-import`].[jobName, status]' \
--output text&&\
aws personalize list-dataset-import-jobs \
--query 'datasetImportJobs[?jobName==`demo-initial-users-import`].[jobName, status]' \
--output text&&\
aws personalize list-dataset-import-jobs \
--query 'datasetImportJobs[?jobName==`demo-initial-interactions-import`].[jobName, status]' \
--output text
# Once status of dataset import jobs are all "ACTIVE," proceed to next step.
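Rather than rerunning the status checks by hand, a small generic poller can wrap them. The check function below is a stand-in (a real one would call list-dataset-import-jobs), and the status strings follow the CREATE PENDING > CREATE IN_PROGRESS > ACTIVE progression:

```python
import time

def wait_until_done(check_status, poll_seconds=60, max_polls=60):
    """Poll check_status() until it returns a terminal status."""
    for _ in range(max_polls):
        status = check_status()
        if status in ("ACTIVE", "CREATE FAILED"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal status in time")

# Stand-in for a real status check against the Personalize API
statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
print(wait_until_done(lambda: next(statuses), poll_seconds=0))  # ACTIVE
```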
Training a model
After you ingest the data into Amazon Personalize, you are ready to train a model (a solutionVersion). To do so, map the recipe (algorithm) you want to use to your use case. The following are your available options:
- For user personalization, such as recommending items to a user, use one of the recipes described in the user personalization recipes documentation pages.
- For recommending items similar to an input item, use SIMS.
- For reranking a list of input items for a given user, use Personalized-Ranking.
This post uses the User-Personalization recipe to define a solution and then train a solutionVersion (model). Complete the following commands.
# Create solution and train solution version
# Note that you may need to source saved variables from the file:
source ~/local_variables.txt
aws personalize create-solution \
--name demo-user-personalization \
--dataset-group-arn $dataset_group_arn \
--recipe-arn arn:aws:personalize:::recipe/aws-user-personalization
personalize_solution_arn=$(aws personalize list-solutions \
--query 'solutions[?name==`demo-user-personalization`].solutionArn' \
--output text)\
&&echo personalize_solution_arn=$personalize_solution_arn \
>> ~/local_variables.txt
aws personalize create-solution-version \
--solution-arn $personalize_solution_arn \
--training-mode FULL
personalize_solution_version_arn=$(aws personalize describe-solution \
--solution-arn $personalize_solution_arn \
--query 'solution.latestSolutionVersion.solutionVersionArn' \
--output text)\
&&echo personalize_solution_version_arn=$personalize_solution_version_arn \
>> ~/local_variables.txt
You can also change the default hyperparameters or perform hyperparameter optimization for a solution.
Check status of the solution version. This may take an hour or longer as it is running full training on the datasets.
# Check status of solution version training for "ACTIVE" before proceeding
# Note that you may need to source saved variables from the file:
source ~/local_variables.txt
aws personalize describe-solution-version \
--solution-version-arn $personalize_solution_version_arn \
--query 'solutionVersion.[solutionVersionArn, status]'
Getting recommendations
To get recommendations, create a campaign using the solution and solution version you just created. Complete the following steps:
# Create Campaign from solution version
# Note that you may need to source saved variables from the file:
source ~/local_variables.txt
aws personalize create-campaign \
--name demo-user-personalization-test \
--solution-version-arn $personalize_solution_version_arn \
--min-provisioned-tps 1
personalize_campaign_arn=$(aws personalize list-campaigns \
--solution-arn $personalize_solution_arn \
--query 'campaigns[?name==`demo-user-personalization-test`].campaignArn' \
--output text)\
&&echo personalize_campaign_arn=$personalize_campaign_arn \
>> ~/local_variables.txt
Check status of the campaign. This may take several minutes.
# Check status of Campaign for "ACTIVE" status
# Note that you may need to source saved variables from the file:
source ~/local_variables.txt
aws personalize describe-campaign \
--campaign-arn $personalize_campaign_arn \
--query 'campaign.[name, status]'
After you set up the campaign, you can programmatically call the campaign to get recommendations in form of item IDs. You can also use the console to get the recommendations and perform spot checks. Additionally, Amazon Personalize offers the ability to batch process recommendations. For more information, see Now available: Batch Recommendations in Amazon Personalize.
One way to test the campaign is with the following commands, which test both an existing and a nonexistent user.
# Note that you may need to source saved variables from the file:
source ~/local_variables.txt
# Known user from dataset
test_user_id_1="19551372"
# New nonexistent user
test_user_id_2="1955137000"
# View recommendations for known user with previous recorded interactions
aws personalize-runtime get-recommendations \
--campaign-arn $personalize_campaign_arn \
--user-id $test_user_id_1 \
--num-results 5
# View recommendations for new user with no previous recorded interactions
aws personalize-runtime get-recommendations \
--campaign-arn $personalize_campaign_arn \
--user-id $test_user_id_2 \
--num-results 5
You should see the top five ranked item IDs for this user in descending order.
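Programmatically, the JSON returned by get-recommendations can be parsed for the item IDs. The response below is a trimmed illustration of the itemList shape with made-up item IDs:

```python
import json

# Trimmed, illustrative get-recommendations response (item IDs are made up)
response_json = '''{
  "itemList": [
    {"itemId": "B00EXAMPLE1", "score": 0.0071},
    {"itemId": "B00EXAMPLE2", "score": 0.0065}
  ]
}'''

response = json.loads(response_json)
item_ids = [item["itemId"] for item in response["itemList"]]
print(item_ids)  # ['B00EXAMPLE1', 'B00EXAMPLE2']
```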
Removing created resources
If you would like to remove the resources that you created in this post, run the following commands:
# Note that you may need to source saved variables from the file:
source ~/local_variables.txt
#Delete the campaign
aws personalize delete-campaign --campaign-arn $personalize_campaign_arn
#Wait for the campaign deletion to complete before proceeding
#Delete the solution
aws personalize delete-solution --solution-arn $personalize_solution_arn
#Delete datasets
aws personalize delete-dataset --dataset-arn $items_dataset_arn
aws personalize delete-dataset --dataset-arn $users_dataset_arn
aws personalize delete-dataset --dataset-arn $interactions_dataset_arn
#Delete the dataset group
aws personalize delete-dataset-group --dataset-group-arn $dataset_group_arn
#Delete schemas
aws personalize delete-schema --schema-arn $items_schema_arn
aws personalize delete-schema --schema-arn $users_schema_arn
aws personalize delete-schema --schema-arn $interactions_schema_arn
#Delete the S3 bucket and contents
aws s3 rb s3://$demo_bucket_name --force
#Delete IAM Role and Policy
aws iam detach-role-policy --role-name $personalize_iam_role_name --policy-arn $personalize_managed_iam_service_policy_arn
aws iam detach-role-policy --role-name $personalize_iam_role_name --policy-arn $personalize_iam_policy_arn
aws iam delete-role --role-name $personalize_iam_role_name
aws iam delete-policy --policy-arn $personalize_iam_policy_arn
#To clear the file saving variables
rm -f ~/local_variables.txt
Conclusion
You can now use these recommendations to power display experiences, such as personalizing the homepage of your beauty website based on what you know about the user, or to send a promotional email with recommendations. Performing real-time recommendations with Amazon Personalize also requires you to send user events as they occur. For more information, see Amazon Personalize is Now Generally Available. Get started with Amazon Personalize today!
About the author
Vaibhav Sethi is the Product Manager for Amazon Personalize. He focuses on delivering products that make it easier to build machine learning solutions. In his spare time, he enjoys hiking and reading.
Brian Soper is a Solutions Architect at Amazon Web Services helping AWS customers transform and architect for the cloud since 2018. Brian has a 20+ year background building out physical and virtual infrastructure for both on-premises and cloud.
Rob Percival is an Account Manager in the AWS Games organization. He works with operators, game developers, and software providers in the US Real Money Gaming (online sports betting and casino gambling) industry to increase speed to market, gain deeper insight on their players, and accelerate experimentation and innovation using AWS.