Incrementally update a dataset with a bulk import mechanism in Amazon Personalize
We are excited to announce that Amazon Personalize now supports incremental bulk dataset imports, a new option for updating your data and improving the quality of your recommendations. Keeping your datasets current is an important part of maintaining the relevance of your recommendations. Prior to this launch, Amazon Personalize offered two mechanisms for ingesting data:
- DatasetImportJob – DatasetImportJob is a bulk data ingestion mechanism designed to import large datasets into Amazon Personalize. A typical journey starts with importing your historical interactions dataset in addition to your item catalog and user dataset. DatasetImportJob can then be used to keep your datasets current by sending updated records in bulk. Prior to this launch, data ingested via previous import jobs was overwritten by any subsequent DatasetImportJob.
- Streaming APIs – The streaming APIs (PutEvents, PutUsers, and PutItems) are designed to incrementally update each respective dataset in real time. For example, after you have trained your model and launched your campaign, your users continue to generate interactions data. This data is then ingested via the PutEvents API, which incrementally updates your interactions dataset (see the sketch following this list). Using the streaming APIs allows you to ingest data as you get it rather than accumulating it and scheduling ingestion.
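For example, a minimal sketch of streaming one interaction with the PutEvents API via Boto3 might look like the following; the tracking ID, user ID, session ID, and item ID are placeholder values:

```python
import boto3
from datetime import datetime, timezone

# Client for the Amazon Personalize real-time event ingestion API
personalize_events = boto3.client('personalize-events')

# Stream a single interaction; all identifiers here are placeholders
personalize_events.put_events(
    trackingId='your-event-tracker-tracking-id',
    userId='user-123',
    sessionId='session-456',
    eventList=[{
        'eventType': 'click',
        'itemId': 'item-789',
        'sentAt': datetime.now(timezone.utc)
    }]
)
```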
With incremental bulk imports, Amazon Personalize simplifies the ingestion of historical records by enabling you to import incremental changes to your datasets with a DatasetImportJob. You can import up to 100 GB of data per FULL DatasetImportJob or up to 1 GB of data per INCREMENTAL DatasetImportJob. Data added to your datasets using INCREMENTAL imports is appended to your existing datasets. If your incremental import duplicates any records found in your existing dataset, Amazon Personalize updates them with the newly imported version, further simplifying the data ingestion process. In the following sections, we describe the changes to the existing API to support incremental dataset imports.
CreateDatasetImportJob
A new parameter called importMode has been added to the CreateDatasetImportJob API. This parameter is an enum type with two values: FULL and INCREMENTAL. The parameter is optional and defaults to FULL to preserve backward compatibility. The CreateDatasetImportJob request is as follows:
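The following is a minimal sketch of the request shape, shown as a Python dict; the job name, ARNs, and Amazon S3 location are placeholder values:

```python
# Request shape for CreateDatasetImportJob (all values are placeholders)
{
    "jobName": "YourImportJob",
    "datasetArn": "arn:aws:personalize:us-east-1:111111111111:dataset/YourDatasetGroup/INTERACTIONS",
    "dataSource": {
        "dataLocation": "s3://your-bucket/your-data.csv"
    },
    "roleArn": "arn:aws:iam::111111111111:role/YourPersonalizeRole",
    "importMode": "INCREMENTAL"  # or "FULL" (the default)
}
```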
The Boto3 API is create_dataset_import_job, and the AWS Command Line Interface (AWS CLI) command is create-dataset-import-job.
DescribeDatasetImportJob
The response to DescribeDatasetImportJob has been extended to include whether the import was a full or incremental import. The type of import is indicated in a new importMode field, which is an enum type with two values: FULL and INCREMENTAL. The DescribeDatasetImportJob response is as follows:
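An abridged sketch of the response, shown as a Python dict with placeholder values:

```python
# Abridged response shape for DescribeDatasetImportJob (placeholder values)
{
    "datasetImportJob": {
        "jobName": "YourImportJob",
        "datasetImportJobArn": "arn:aws:personalize:us-east-1:111111111111:dataset-import-job/YourImportJob",
        "datasetArn": "arn:aws:personalize:us-east-1:111111111111:dataset/YourDatasetGroup/INTERACTIONS",
        "dataSource": {"dataLocation": "s3://your-bucket/your-data.csv"},
        "roleArn": "arn:aws:iam::111111111111:role/YourPersonalizeRole",
        "status": "ACTIVE",
        "importMode": "INCREMENTAL"  # FULL or INCREMENTAL
    }
}
```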
The Boto3 API is describe_dataset_import_job, and the AWS CLI command is describe-dataset-import-job.
ListDatasetImportJobs
The response to ListDatasetImportJobs has been extended to include whether each import was a full or incremental import. The type of import is indicated in a new importMode field, which is an enum type with two values: FULL and INCREMENTAL. The ListDatasetImportJobs response is as follows:
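An abridged sketch of the response, shown as a Python dict with placeholder values; each job summary carries its own importMode:

```python
# Abridged response shape for ListDatasetImportJobs (placeholder values)
{
    "datasetImportJobs": [
        {
            "datasetImportJobArn": "arn:aws:personalize:us-east-1:111111111111:dataset-import-job/YourImportJob",
            "jobName": "YourImportJob",
            "status": "ACTIVE",
            "importMode": "INCREMENTAL"  # FULL or INCREMENTAL
        }
    ]
}
```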
The Boto3 API is list_dataset_import_jobs, and the AWS CLI command is list-dataset-import-jobs.
Code example
The following code shows how to create a dataset import job for incremental bulk import using the SDK for Python (Boto3):
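A minimal sketch, assuming the dataset, IAM role, and S3 data already exist; all ARNs and the S3 path are placeholder values:

```python
import boto3

personalize = boto3.client('personalize')

# Create an incremental bulk import job that appends the S3 data
# to the existing dataset instead of replacing it
response = personalize.create_dataset_import_job(
    jobName='YourIncrementalImportJob',
    datasetArn='arn:aws:personalize:us-east-1:111111111111:dataset/YourDatasetGroup/INTERACTIONS',
    dataSource={'dataLocation': 's3://your-bucket/your-incremental-data.csv'},
    roleArn='arn:aws:iam::111111111111:role/YourPersonalizeRole',
    importMode='INCREMENTAL'
)

print(response['datasetImportJobArn'])
```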
Summary
In this post, we described how you can use this new feature in Amazon Personalize to perform incremental updates to a dataset with bulk import, keeping your data fresh and improving the relevance of your recommendations. If you have delayed access to your data, incremental bulk import allows you to ingest it more easily by appending it to your existing datasets.
Try out this new feature by accessing Amazon Personalize now.
About the authors
Neelam Koshiya is an enterprise solution architect at AWS. Her current focus is to help enterprise customers with their cloud adoption journey for strategic business outcomes. In her spare time, she enjoys reading and being outdoors.
James Jory is a Principal Solutions Architect in Applied AI with AWS. He has a special interest in personalization and recommender systems and a background in ecommerce, marketing technology, and customer data analytics. In his spare time, he enjoys camping and auto racing simulations.
Daniel Foley is a Senior Product Manager for Amazon Personalize. He is focused on building applications that leverage artificial intelligence to solve our customers’ largest challenges. Outside of work, Dan is an avid skier and hiker.
Alex Berlingeri is a Software Development Engineer with Amazon Personalize working on a machine learning powered recommendations service. In his free time he enjoys reading, working out and watching soccer.