AWS Machine Learning Blog

Automating an Amazon Personalize solution using the AWS Step Functions Data Science SDK

Machine learning (ML)-based recommender systems aren’t a new concept in industries such as retail, media and entertainment, and education, but developing such a system can be a resource-intensive task, from data labeling, training, and inference to scaling. You also need to apply continuous integration, continuous deployment, and continuous training to your ML model, a practice known as MLOps. MLOps helps you build and integrate code across ML tools and frameworks. Moreover, building a recommender system requires managing and orchestrating multiple workflows, for example, waiting for a training job to finish before triggering a deployment.

In this post, we show how to use Amazon Personalize to create a serverless recommender system for movie recommendations, with no ML experience required. Creating an Amazon Personalize solution workflow involves multiple steps, such as preparing and importing the data, choosing a recipe, creating a solution, and finally generating recommendations. End users or your data scientists can orchestrate these steps using AWS Lambda functions and AWS Step Functions. However, writing JSON state machine definitions by hand to manage such workflows can be complicated.

Instead, you can orchestrate an Amazon Personalize workflow with the AWS Step Functions Data Science SDK for Python from an Amazon SageMaker Jupyter notebook. The AWS Step Functions Data Science SDK is an open-source library that allows data scientists to easily create workflows that process and publish ML models using SageMaker Jupyter notebooks and Step Functions. You can create multi-step ML workflows in Python that orchestrate AWS infrastructure at scale, without having to provision and integrate the AWS services separately.
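If you want to experiment with the SDK outside the accompanying notebook, it’s available on PyPI as the stepfunctions package. The following is a minimal setup sketch; the import list mirrors the classes used later in this post, though exact module paths can vary by SDK version:

# Install the SDK in a notebook cell: !pip install stepfunctions
from stepfunctions.steps import Chain, Choice, ChoiceRule, Fail, LambdaStep, Succeed, Wait
from stepfunctions.workflow import Workflow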

Solution and services overview

We orchestrate the Amazon Personalize training workflow steps, such as preparing and importing the data, choosing a recipe, and creating a solution, using Step Functions and Lambda functions to create the following end-to-end automated workflow.

We walk you through the following steps, using the accompanying SageMaker Jupyter notebook, to create this Amazon Personalize workflow:

  1. Complete the prerequisites of setting up permissions and preparing the dataset.
  2. Set up Lambda functions and Step Functions task states.
  3. Add wait conditions to each Step Functions task state.
  4. Add a control flow and link the states.
  5. Define and run the workflows by combining the preceding steps.
  6. Generate recommendations.

Prerequisites

Before you build your workflow and generate recommendations, you must set up a notebook instance, AWS Identity and Access Management (IAM) roles, and an Amazon Simple Storage Service (Amazon S3) bucket.

Creating a notebook instance in SageMaker

For full instructions on creating your SageMaker notebook instance, see Step 2: Create an Amazon SageMaker Notebook Instance. When creating an IAM role for your notebook, make sure it has the AmazonS3FullAccess and AmazonPersonalizeFullAccess policies so that the notebook can access the Amazon S3 and Amazon Personalize APIs.

  1. When the notebook is active, choose Open Jupyter.
  2. On the Jupyter dashboard, choose New.
  3. Choose Terminal.
  4. In the terminal, enter the following code:
    cd SageMaker 
    git clone https://github.com/aws-samples/personalize-data-science-sdk-workflow.git
    
  5. Open the notebook by choosing Personalize-Stepfunction-Workflow.ipynb in the root folder.

You’re now ready to run the following steps through the notebook cells. You can find the complete notebook for this solution in the GitHub repo.

Setting up the Step Functions execution role

You need a Step Functions execution role so that you can create and invoke workflows in Step Functions. For instructions, see Create a role for Step Functions or go through the steps in the notebook Create an execution role for Step Functions.

Attach an inline policy to the role you created for the notebook instance in the previous step. Enter the JSON policy by copying the policy from the notebook.
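The following is a rough Boto3 sketch of attaching such an inline policy; the policy statement shown is a placeholder and the role and policy names are hypothetical, so use the actual policy JSON from the notebook:

import json
import boto3

iam = boto3.client('iam')

# Placeholder policy; copy the real policy JSON from the notebook
inline_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["states:*"],  # assumed permissions for illustration
        "Resource": "*"
    }]
}

iam.put_role_policy(
    RoleName='<YOUR NOTEBOOK ROLE NAME>',      # role created for the notebook instance
    PolicyName='StepFunctionsWorkflowPolicy',  # hypothetical policy name
    PolicyDocument=json.dumps(inline_policy)
)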

Preparing your dataset

To prepare your dataset, complete the following steps:

  1. Create an S3 bucket to store the training dataset and provide the bucket name and file name movie-lens-100k.csv in the notebook step Setup S3 location and filename. See the following code:
    bucket = "<SAMPLE BUCKET NAME>" # replace with the name of your S3 bucket
    filename = "<SAMPLE FILE NAME>" # replace with a name that you want to save the dataset under
    
  2. Attach the policy to the S3 bucket by running the Attach policy to Amazon S3 bucket step in the notebook.
  3. Create an IAM role for Amazon Personalize by running the Create Personalize Role step in the notebook using the Python Boto3 create_role API (a combined sketch of steps 2 and 3 appears after this list).
  4. To preprocess the dataset and upload it to Amazon S3, run the data preparation step in the following notebook cell to download the movie-lens dataset and load it into a DataFrame (the next step keeps only interactions with a rating above 2):
    !wget -N http://files.grouplens.org/datasets/movielens/ml-100k.zip
    !unzip -o ml-100k.zip
    data = pd.read_csv('./ml-100k/u.data', sep='\t', names=['USER_ID', 'ITEM_ID', 'RATING', 'TIMESTAMP'])
    pd.set_option('display.max_rows', 5)
    data
    

The following screenshot shows your output.

  5. Run the following code to upload your data to the S3 bucket that you created:
    data = data[data['RATING'] > 2] # keep only interactions with a rating above 2
    data2 = data[['USER_ID', 'ITEM_ID', 'TIMESTAMP']] 
    data2.to_csv(filename, index=False)
    
    boto3.Session().resource('s3').Bucket(bucket).Object(filename).upload_file(filename)
    
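For reference, the following is a rough sketch of what steps 2 and 3 do with Boto3. The role name is a placeholder and the policy shapes are assumptions; the notebook’s own policy and role definitions are authoritative:

import json
import boto3

iam = boto3.client('iam')
s3 = boto3.client('s3')

bucket = "<SAMPLE BUCKET NAME>"  # same bucket as above

# Step 2: allow Amazon Personalize to read the training data from the bucket
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "personalize.amazonaws.com"},
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::{}".format(bucket), "arn:aws:s3:::{}/*".format(bucket)]
    }]
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(bucket_policy))

# Step 3: create a role that Amazon Personalize can assume to read from Amazon S3
assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "personalize.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}
create_role_response = iam.create_role(
    RoleName="PersonalizeS3Role",  # placeholder; the notebook defines its own role name
    AssumeRolePolicyDocument=json.dumps(assume_role_policy)
)
iam.attach_role_policy(
    RoleName="PersonalizeS3Role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess"
)
role_arn = create_role_response['Role']['Arn']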

Setting up your Lambda functions and Step Functions task states

After you complete the prerequisite steps, you’re ready to set up your Lambda functions and Step Functions task states.

You need to set up a Lambda function for each Amazon Personalize task, such as creating a dataset, choosing a recipe, and creating a solution, using the Amazon Personalize AWS SDK for Python (Boto3). Use the Python code provided in the Lambda GitHub repo, which contains the code for each of these functions. For instructions on creating your functions, see Building Lambda functions with Python.

While creating an execution role for these functions, make sure you attach the AmazonPersonalizeFullAccess and AWSLambdaBasicExecutionRole permissions policies to the role.

Orchestrate each of these Lambda functions as Step Functions task states. A task state represents a single unit of work performed by a state machine. Step Functions can invoke Lambda functions directly from a task state. For more information, see Creating a Step Functions State Machine That Uses Lambda. The following is a Step Functions workflow diagram to orchestrate the Lambda functions we cover in this section.

We already created Lambda functions for each of the Amazon Personalize steps using the Amazon Personalize AWS SDK for Python (Boto3) with the Python code provided in the Lambda GitHub repo. We now walk through each step in this section. Alternatively, you can build the Lambda functions on the Lambda console.

Creating a schema

Before you add a dataset to Amazon Personalize, you must define a schema for that dataset. For more information, see Datasets and Schemas. First we create a Step Functions task state to create a schema using the following code in the walkthrough notebook:

lambda_state_schema = LambdaStep(
    state_id="create schema",
    parameters={  
        "FunctionName": "stepfunction-create-schema", #replace with the name of the function you created
        "Payload": {  
           "input": "personalize-stepfunction-schema"
        }
    },
    result_path='$'    
)

The preceding step invokes the Lambda function stepfunction-create-schema.py. The following Lambda code snippet uses personalize.create_schema() to create a schema:

import json
import boto3

personalize = boto3.client('personalize')

# 'schema' is the Avro schema dictionary defined in the Lambda function (see the example below)
create_schema_response = personalize.create_schema(
        name = event['input'],
        schema = json.dumps(schema)
    )

This API creates an Amazon Personalize schema from the specified schema string. The schema you create must be in Avro JSON format.
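For reference, an Interactions schema for the movie-lens data prepared earlier follows the standard Amazon Personalize format. The following is a sketch of what the schema variable in the Lambda function might contain; the field names match the USER_ID, ITEM_ID, and TIMESTAMP columns from the data preparation step:

# Avro JSON schema for the Interactions dataset
schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"}
    ],
    "version": "1.0"
}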

Creating a dataset group

After you create a schema, we similarly create a Step Functions task state for creating a dataset group. A dataset group contains related datasets that supply data for training a model. A dataset group can contain at most three datasets, one for each type of dataset: Interactions, Items, and Users.

To train a model (create a solution), you need a dataset group that contains an Interactions dataset.

Run the notebook steps to create the task state with the state ID create dataset group. This state invokes the Lambda function stepfunctioncreatedatagroup.py, which uses the personalize.create_dataset_group() Boto3 API to create an empty dataset group that holds the dataset you create next.

Creating a dataset

In this step, you create a Step Functions task state that creates an empty dataset and adds it to the dataset group from the previous step.

Run the notebook steps to create the state ID create dataset:

lambda_state_createdataset = LambdaStep(
    state_id="create dataset",
    parameters={  
        "FunctionName": "stepfunctioncreatedataset", #replace with the name of the function you created
        "Payload": {  
           "schemaArn.$": '$.schemaArn',
           "datasetGroupArn.$": '$.datasetGroupArn'
        }
    },
    result_path='$'
)

The preceding Step Functions task state invokes the Lambda function stepfunctioncreatedataset.py. The following code snippet uses the personalize.create_dataset() Boto3 API to create the dataset:

create_dataset_response = personalize.create_dataset(
    name = "personalize-stepfunction-dataset",
    datasetType = dataset_type,
    datasetGroupArn = event['datasetGroupArn'],
    schemaArn = event['schemaArn']
)

This API creates an empty dataset and adds it to the specified dataset group.

Importing your data

When you complete Step 1: Creating a Dataset Group and Step 2: Creating a Dataset and a Schema, you’re ready to import your training data into Amazon Personalize. When you import data, you can choose to import records in bulk, import records individually, or both, depending on your business requirements and the amount of historical data you have. If you have a large amount of historical records, we recommend you import data in bulk and add data incrementally as necessary.
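Bulk import is the path this walkthrough automates. For reference, individual records can be added incrementally through the PutEvents API on the personalize-events client; note that this requires an event tracker, which the workflow in this post doesn’t create. The following is a minimal sketch with hypothetical IDs:

import json
import time

import boto3

personalize_events = boto3.client('personalize-events')

# Record a single interaction incrementally; the tracking ID comes from an event tracker
personalize_events.put_events(
    trackingId='<YOUR EVENT TRACKER TRACKING ID>',
    userId='196',           # hypothetical user ID from the movie-lens dataset
    sessionId='session-1',  # hypothetical session ID
    eventList=[{
        'sentAt': int(time.time()),
        'eventType': 'RATING',
        'properties': json.dumps({'itemId': '242'})
    }]
)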

Run through the notebook steps to create a Step Functions task state to import a dataset with state_id="create dataset import job". The following code is the snippet for the underlying Lambda function stepfunction-createdatasetimportjob.py:

create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "stepfunction-dataset-import-job",
    datasetArn = datasetArn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket, filename)
    },
    roleArn = roleArn
)

We use the personalize.create_dataset_import_job() Boto3 API to import the dataset in the preceding Lambda code. This API creates a job that imports training data from your data source (an S3 bucket) to an Amazon Personalize dataset.

Creating a solution

After you prepare and import the data, you’re ready to create a solution. A solution refers to the combination of an Amazon Personalize recipe, customized parameters, and one or more solution versions (trained models). A recipe is an Amazon Personalize term specifying an appropriate algorithm to train for a given use case. After you create a solution with a solution version, you can create a campaign to deploy the solution version and get recommendations. We create a Step Functions task state for choosing a recipe and creating a solution.

Choosing a recipe, configuring a solution, and creating a solution version

Run the notebook cell Choose a recipe to create a task state with the state ID state_id="select recipe and create solution". This state represents the Lambda function stepfunction_select-recipe_create-solution.py. See the following code snippet:

list_recipes_response = personalize.list_recipes()

recipe_arn = "arn:aws:personalize:::recipe/aws-user-personalization" # aws-user-personalization selected for demo purposes

create_solution_response = personalize.create_solution(
    name = "stepfunction-solution",
    datasetGroupArn = event['dataset_group_arn'],
    recipeArn = recipe_arn
)

We use the personalize.list_recipes() Boto3 API to return a list of available recipes and the personalize.create_solution() Boto3 API to create the configuration for training a model. A trained model is known as a solution.

We use the aws-user-personalization recipe for this post. For more information, see Choosing a Recipe and Configuring a Solution.

After you choose a recipe and configure your solution, you’re ready to create a solution version, which refers to a trained ML model.

Run the notebook cell Create Solution Version to create a task state with the state ID state_id="create solution version" using the underlying Lambda function stepfunction_create_solution_version.py. The following is the Lambda code snippet using the personalize.create_solution_version() Boto3 API, which trains or retrains an active solution:

create_solution_version_response = personalize.create_solution_version(
    solutionArn = event['solution_arn']
)

A solution is created using the CreateSolution operation and must be in the ACTIVE state before calling CreateSolutionVersion. A new version of the solution is created every time you call this operation. For more information, see Creating a Solution Version.

Creating a campaign

A campaign is used to make recommendations for your users. You create a campaign by deploying a solution version.

Run the notebook cell Create Campaign to create a task state with the state ID state_id="create campaign" using the underlying Lambda function stepfunction_getsolution_metric_create_campaign.py. This function uses the following APIs:

get_solution_metrics_response = personalize.get_solution_metrics(
    solutionVersionArn = event['solution_version_arn']
)

create_campaign_response = personalize.create_campaign(
    name = "stepfunction-campaign",
    solutionVersionArn = event['solution_version_arn'],
    minProvisionedTPS = 1
)

personalize.get_solution_metrics() gets the metrics for the specified solution version and personalize.create_campaign() creates a campaign by deploying a solution version.

In this section, we covered all the Step Functions task states and their Lambda functions needed to orchestrate the Amazon Personalize workflow. Proceed to the next section to add the Wait states to these steps.

Adding a Wait state to your steps

In this section, we add Wait states because each step must wait for the previous one to finish before the next begins. For example, you can only create a dataset after its dataset group exists.

Run the notebook steps from Wait for Schema to be ready through Wait for Campaign to be ACTIVE so that each Step Functions task waits for the preceding step to finish before triggering the next one.

The following code is an example of what a Wait state looks like to wait for the schema to be created:

wait_state_schema = Wait(
    state_id="Wait for create schema - 5 secs",
    seconds=5
)

This adds a wait time of 5 seconds, which means the workflow pauses for 5 seconds before moving to the next state.

Adding a control flow and linking states by using the Choice state

The AWS Step Functions Data Science SDK’s Choice state supports branching logic based on the outputs from previous steps. You can create dynamic and complex workflows by adding this state.

After you define these steps, chain them together into a logical sequence. Run all the notebook steps to add choices while automating Amazon Personalize datasets, recipes, solutions, and the campaign.

The following code is an example of the Choice state for the create_campaign workflow:

create_campaign_choice_state = Choice(
    state_id="Is the Campaign ready?"
)
create_campaign_choice_state.add_choice(
    rule=ChoiceRule.StringEquals(variable=lambda_state_campaign_status.output()['Payload']['status'], value='ACTIVE'),
    next_step=Succeed("CampaignCreatedSuccessfully")     
)
create_campaign_choice_state.add_choice(
    ChoiceRule.StringEquals(variable=lambda_state_campaign_status.output()['Payload']['status'], value='CREATE PENDING'),
    next_step=wait_state_campaign
)
create_campaign_choice_state.add_choice(
    ChoiceRule.StringEquals(variable=lambda_state_campaign_status.output()['Payload']['status'], value='CREATE IN_PROGRESS'),
    next_step=wait_state_campaign
)

create_campaign_choice_state.default_choice(next_step=Fail("CreateCampaignFailed"))
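The preceding choice rules branch on the status field returned by the campaign status-check Lambda function (lambda_state_campaign_status), which the notebook provides. As a rough sketch of that pattern, assuming the campaign ARN is passed in the event payload, the handler could look like the following:

import boto3

personalize = boto3.client('personalize')

def lambda_handler(event, context):
    # Look up the campaign's current status ('ACTIVE', 'CREATE PENDING', and so on)
    # so the Choice state can branch on $.Payload.status
    response = personalize.describe_campaign(campaignArn=event['campaign_arn'])  # assumed event key
    return {
        'campaign_arn': event['campaign_arn'],
        'status': response['campaign']['status']
    }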

Defining and running workflows

You now create a workflow that runs a group of Lambda functions (steps) in a specific order, where one function’s output passes to the next function’s input. We define a workflow for each stage of the Amazon Personalize setup:

  • Dataset workflow
  • Dataset import workflow
  • Recipe and solution workflow
  • Create campaign workflow
  • Main workflow to orchestrate the four preceding workflows for Amazon Personalize

After completing these steps, we can run our main workflow. To learn more about Step Functions workflows, see the AWS Step Functions Developer Guide.

Dataset workflow

Run the following code to generate a workflow definition for dataset creation using the Lambda function state defined by Step Functions:

Dataset_workflow_definition=Chain([lambda_state_schema,
                                   wait_state_schema,
                                   lambda_state_datasetgroup,
                                   wait_state_datasetgroup,
                                   lambda_state_datasetgroupstatus
                                  ])
Dataset_workflow = Workflow(
    name="Dataset-workflow",
    definition=Dataset_workflow_definition,
    role=workflow_execution_role
)

The following screenshot shows the Dataset workflow view.
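You can reproduce this view in the notebook by rendering the workflow graph with the SDK:

Dataset_workflow.render_graph()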

Dataset import workflow

Run the following code to generate a workflow definition for the dataset import job using the Lambda function states defined by Step Functions:

DatasetImport_workflow_definition=Chain([lambda_state_createdataset,
                                   wait_state_dataset,
                                   lambda_state_datasetimportjob,
                                   wait_state_datasetimportjob,
                                   lambda_state_datasetimportjob_status,
                                   datasetimportjob_choice_state
                                  ])

The following screenshot shows the DatasetImport workflow view.

Recipe and solution workflow

Run the notebook to generate a workflow definition for the recipe and solution steps using the Lambda function states defined by Step Functions:

Create_receipe_sol_workflow_definition=Chain([lambda_state_select_receipe_create_solution,
                                   wait_state_receipe,
                                   lambda_create_solution_version,
                                   wait_state_solutionversion,
                                   lambda_state_solutionversion_status,
                                   solutionversion_choice_state
                                  ])

The following screenshot shows the Create-recipe workflow view.

Campaign workflow

Run the notebook to generate a workflow definition for the campaign using the Lambda function states defined by Step Functions:

Create_Campaign_workflow_definition=Chain([lambda_create_campaign,
                                   wait_state_campaign,
                                   lambda_state_campaign_status,
                                   wait_state_datasetimportjob,
                                   create_campaign_choice_state
                                  ])

The following screenshot shows the Campaign-workflow view.

Main workflow

Now we combine the four preceding workflow definitions into a single definition that automates all the task states to create the dataset, recipe, solution, and campaign in Amazon Personalize, including the Wait and Choice states. Run the following notebook cell to generate the main workflow definition:

Main_workflow_definition=Chain([call_dataset_workflow_state,
                                call_datasetImport_workflow_state,
                                call_receipe_solution_workflow_state,
                                call_campaign_solution_workflow_state
                               ])
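As with the individual workflows, the notebook wraps this definition in a Workflow object and registers it with Step Functions before running it. The following sketch assumes the same execution role as before; the workflow name is illustrative:

Main_workflow = Workflow(
    name="Main-workflow",
    definition=Main_workflow_definition,
    role=workflow_execution_role
)
Main_workflow.create()  # registers the state machine in Step Functions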

Running the workflow

Run the following code to trigger an Amazon Personalize automated workflow to create a dataset, recipe, solution, and campaign using the underlying Lambda function and Step Functions states:

Main_workflow_execution = Main_workflow.execute()
Main_workflow_execution.render_progress()

The following screenshot shows the progress of the overall workflow.

The following screenshots show the individual workflows.

Alternatively, you can monitor the steps on the Step Functions console to see the status of each step in progress.

To see the detailed progress of each step and if it succeeded or failed, run the following code:

Main_workflow_execution.list_events(html=True)

The following screenshot shows the step details.

Wait until the workflows are complete before moving on to the next steps.

Generating recommendations

Now that you have a successful campaign, you can use it in a workflow that generates recommendations. The following screenshot shows your recommendation workflow view.
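Under the hood, the recommendation workflow’s Lambda function queries the deployed campaign. The notebook provides the actual function; the following is a minimal sketch of the pattern using the GetRecommendations API, where the event keys are assumptions for illustration:

import boto3

personalize_runtime = boto3.client('personalize-runtime')

def lambda_handler(event, context):
    # Ask the campaign for the top recommended items for a user
    response = personalize_runtime.get_recommendations(
        campaignArn=event['campaign_arn'],
        userId=str(event['user_id'])
    )
    return {'item_list': response['itemList']}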

Running the recommendation workflow

Run the following code to trigger a recommendation workflow using the underlying Lambda function and Step Functions states:

recommendation_workflow_execution = recommendation_workflow.execute()
recommendation_workflow_execution.render_progress()

The following screenshot shows your workflow.

Use the following code to display the movie recommendations generated by the workflow:

item_list = recommendation_workflow_execution.get_output()['Payload']['item_list']

print("Recommendations:")
for item in item_list:
    item_id = int(item['itemId'])  # np.int is deprecated; use the built-in int
    item_title = items.loc[items['ITEM_ID'] == item_id].values[0][-1]
    print(item_title)

You can find the complete notebook for this solution in the GitHub repo.

Cleaning up

Make sure to clean up the Amazon Personalize resources and the Step Functions state machines created in this post to avoid incurring charges. On the Amazon Personalize console, delete these resources in the following order:

  1. Campaign
  2. Recipes
  3. Solutions
  4. Datasets
  5. Dataset groups

Summary

Amazon Personalize makes it easy for developers to deliver highly personalized recommendations to customers who use their applications. It uses the same ML technology used by Amazon.com for real-time personalized recommendations, with no ML expertise required. This post shows how to use the AWS Step Functions Data Science SDK to automate the process of orchestrating Amazon Personalize workflow steps such as creating the dataset group, dataset, dataset import job, solution, solution version, and campaign, and generating recommendations. Automating the creation of a recommender system using Step Functions can improve the CI/CD pipeline of the model training and deployment process.

For additional technical documentation and example notebooks related to the SDK, see Introducing the AWS Step Functions Data Science SDK for Amazon SageMaker.

 


About the Authors

Neel Sendas is a Senior Technical Account Manager at Amazon Web Services. Neel works with enterprise customers to design, deploy, and scale cloud applications to achieve their business goals. He has worked on various ML use cases, ranging from anomaly detection to predictive product quality for manufacturing and logistics optimization. When he is not helping customers, he dabbles in golf and salsa dancing.

 

Mona Mona is an AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with the World Wide Public Sector team and helps customers adopt machine learning on a large scale. She is passionate about NLP and ML explainability areas in AI/ML.