Incrementally update a dataset with a bulk import mechanism in Amazon Personalize
We are excited to announce that Amazon Personalize now supports incremental bulk dataset imports, a new option for updating your data and improving the quality of your recommendations. Keeping your datasets current is an important part of maintaining the relevance of your recommendations. Prior to this launch, Amazon Personalize offered two mechanisms for ingesting data:
- DatasetImportJob – DatasetImportJob is a bulk data ingestion mechanism designed to import large datasets into Amazon Personalize. A typical journey starts with importing your historical interactions dataset in addition to your item catalog and user dataset. DatasetImportJob can then be used to keep your datasets current by sending updated records in bulk. Prior to this launch, data ingested via previous import jobs was overwritten by any subsequent DatasetImportJob.
- Streaming APIs – The streaming APIs (PutEvents, PutUsers, and PutItems) are designed to incrementally update each respective dataset in real time. For example, after you have trained your model and launched your campaign, your users continue to generate interactions data. This data is then ingested via the PutEvents API, which incrementally updates your interactions dataset (see the sketch following this list). Using the streaming APIs allows you to ingest data as you get it rather than accumulating it and scheduling ingestion.
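For example, a minimal sketch of streaming one interaction with the PutEvents API via Boto3 might look like the following; the tracking ID, user ID, session ID, and item ID are placeholder values:

```python
import boto3
from datetime import datetime, timezone

# Client for the Amazon Personalize real-time event ingestion API
personalize_events = boto3.client('personalize-events')

# Stream a single interaction; all identifiers here are placeholders
personalize_events.put_events(
    trackingId='your-event-tracker-tracking-id',
    userId='user-123',
    sessionId='session-456',
    eventList=[{
        'eventType': 'click',
        'itemId': 'item-789',
        'sentAt': datetime.now(timezone.utc)
    }]
)
```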
With incremental bulk imports, Amazon Personalize simplifies the ingestion of historical records by enabling you to import incremental changes to your datasets with a DatasetImportJob. You can import up to 100 GB of data per FULL DatasetImportJob or up to 1 GB of data per INCREMENTAL DatasetImportJob. Data added to your datasets using INCREMENTAL imports is appended to your existing datasets. If your incremental import duplicates any records found in your existing dataset, Amazon Personalize updates them with the newly imported version, further simplifying the data ingestion process. In the following sections, we describe the changes to the existing API to support incremental dataset imports.
CreateDatasetImportJob
A new parameter called importMode has been added to the CreateDatasetImportJob API. This parameter is an enum type with two values: FULL and INCREMENTAL. The parameter is optional and defaults to FULL to preserve backward compatibility. The CreateDatasetImportJob request is as follows:
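The following is a minimal sketch of the request shape, shown as a Python dict; the job name, ARNs, and Amazon S3 location are placeholder values:

```python
# Request shape for CreateDatasetImportJob (all values are placeholders)
{
    "jobName": "YourImportJob",
    "datasetArn": "arn:aws:personalize:us-east-1:111111111111:dataset/YourDatasetGroup/INTERACTIONS",
    "dataSource": {
        "dataLocation": "s3://your-bucket/your-data.csv"
    },
    "roleArn": "arn:aws:iam::111111111111:role/YourPersonalizeRole",
    "importMode": "INCREMENTAL"  # or "FULL" (the default)
}
```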
The Boto3 API is create_dataset_import_job, and the AWS Command Line Interface (AWS CLI) command is create-dataset-import-job.
DescribeDatasetImportJob
The response to DescribeDatasetImportJob has been extended to include whether the import was a full or incremental import. The type of import is indicated in a new importMode field, which is an enum type with two values: FULL and INCREMENTAL. The DescribeDatasetImportJob response is as follows:
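An abridged sketch of the response, shown as a Python dict with placeholder values:

```python
# Abridged response shape for DescribeDatasetImportJob (placeholder values)
{
    "datasetImportJob": {
        "jobName": "YourImportJob",
        "datasetImportJobArn": "arn:aws:personalize:us-east-1:111111111111:dataset-import-job/YourImportJob",
        "datasetArn": "arn:aws:personalize:us-east-1:111111111111:dataset/YourDatasetGroup/INTERACTIONS",
        "dataSource": {"dataLocation": "s3://your-bucket/your-data.csv"},
        "roleArn": "arn:aws:iam::111111111111:role/YourPersonalizeRole",
        "status": "ACTIVE",
        "importMode": "INCREMENTAL"  # FULL or INCREMENTAL
    }
}
```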
The Boto3 API is describe_dataset_import_job, and the AWS CLI command is describe-dataset-import-job.
ListDatasetImportJobs
The response to ListDatasetImportJobs has been extended to include whether each import was a full or incremental import. The type of import is indicated in a new importMode field, which is an enum type with two values: FULL and INCREMENTAL. The ListDatasetImportJobs response is as follows:
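An abridged sketch of the response, shown as a Python dict with placeholder values; each job summary carries its own importMode:

```python
# Abridged response shape for ListDatasetImportJobs (placeholder values)
{
    "datasetImportJobs": [
        {
            "datasetImportJobArn": "arn:aws:personalize:us-east-1:111111111111:dataset-import-job/YourImportJob",
            "jobName": "YourImportJob",
            "status": "ACTIVE",
            "importMode": "INCREMENTAL"  # FULL or INCREMENTAL
        }
    ]
}
```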
The Boto3 API is list_dataset_import_jobs, and the AWS CLI command is list-dataset-import-jobs.
Code example
The following code shows how to create a dataset import job for incremental bulk import using the SDK for Python (Boto3):
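A minimal sketch, assuming the dataset, IAM role, and S3 data already exist; all ARNs and the S3 path are placeholder values:

```python
import boto3

personalize = boto3.client('personalize')

# Create an incremental bulk import job that appends the S3 data
# to the existing dataset instead of replacing it
response = personalize.create_dataset_import_job(
    jobName='YourIncrementalImportJob',
    datasetArn='arn:aws:personalize:us-east-1:111111111111:dataset/YourDatasetGroup/INTERACTIONS',
    dataSource={'dataLocation': 's3://your-bucket/your-incremental-data.csv'},
    roleArn='arn:aws:iam::111111111111:role/YourPersonalizeRole',
    importMode='INCREMENTAL'
)

print(response['datasetImportJobArn'])
```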
Summary
In this post, we described how you can use this new feature in Amazon Personalize to perform incremental updates to a dataset with bulk import, keeping your data fresh and improving the relevance of your recommendations. If you have delayed access to your data, incremental bulk import allows you to ingest it more easily by appending it to your existing datasets.
Try out this new feature by accessing Amazon Personalize now.
About the authors
Neelam Koshiya is an enterprise solution architect at AWS. Her current focus is to help enterprise customers with their cloud adoption journey for strategic business outcomes. In her spare time, she enjoys reading and being outdoors.
James Jory is a Principal Solutions Architect in Applied AI with AWS. He has a special interest in personalization and recommender systems and a background in ecommerce, marketing technology, and customer data analytics. In his spare time, he enjoys camping and auto racing simulations.
Daniel Foley is a Senior Product Manager for Amazon Personalize. He is focused on building applications that leverage artificial intelligence to solve our customers’ largest challenges. Outside of work, Dan is an avid skier and hiker.
Alex Berlingeri is a Software Development Engineer with Amazon Personalize working on a machine learning powered recommendations service. In his free time he enjoys reading, working out and watching soccer.