Using Amazon SageMaker, AWS Marketplace, and AWS Data Exchange to predict retail product popularity
For many of my customers, use of machine learning and advanced analytics is a competitive differentiator. To do machine learning (ML) and analytics at scale, you need access to high-quality data sets.
While some data is available in house, data scientists often need high-quality third-party data to supplement their own for analysis. AWS Data Exchange makes it easy for customers to find, subscribe to, and use third-party data in AWS. As of this writing, AWS Data Exchange offers more than 2,000 data products from over 100 qualified data providers. You can subscribe to datasets, download them or load them into Amazon S3, and analyze them with AWS analytics and ML services such as Amazon SageMaker.
In this blog post, I will share an experiment I recently conducted to predict popularity of retail products using a dataset from AWS Data Exchange. I also used Amazon SageMaker and a third-party algorithm available in AWS Marketplace to analyze and generate insights from the data.
The retail scenario: deliver a custom ML model to predict the popularity of retail bath products
In this scenario, I run an ML startup called Decision Support System (DSS). My customer, a retail store, asked me to deliver a custom ML model that predicts the popularity of bath products, so the store can increase sales and revenue by stocking the most popular items. The model predicts popularity based on attributes such as the product name and category.
My client had already purchased the Retail data set package, which contains real-world datasets for bath products from large chains such as Bath & Body Works and Bed Bath & Beyond. I used Intel's DAAL Decision Forest Classification algorithm, available in AWS Marketplace, to train a machine learning model.
Here is how I prepared for the experiment:
- I ensured that I had permissions to use AWS Data Exchange and associated services to subscribe to and export datasets by attaching the AWSDataExchangeSubscriberFullAccess managed policy to my IAM principal. This gave me all the permissions I needed to use AWS Data Exchange as a subscriber. For more information, see Identity and Access Management in AWS Data Exchange.
- I ensured that I had access to the S3 bucket to which I exported the dataset.
- I read Working with Data Sets to familiarize myself with basic AWS Data Exchange concepts.
- I also ensured that my IAM principal had the AmazonSageMakerFullAccess managed IAM policy attached.
Since my client had already procured the dataset, the next logical step was to transfer it into an S3 bucket, which I did by following the documentation.
Data providers often publish updated revisions of their datasets. To ensure I was using the latest revision, I set up a process to automatically load new data from AWS Data Exchange into the S3 bucket. For information on how to set up such a process, see Find and acquire new data sets and retrieve new updates automatically using AWS Data Exchange.
My approach involved three main steps:
- Subscribing to a third-party algorithm in AWS Marketplace. This step is optional; you can use any algorithm to train a model.
- Creating an Amazon SageMaker notebook instance and writing code to prepare the data.
- Training a custom model using the procured third-party data and algorithm.
Step 1: Subscribe to the algorithm
I followed these steps to subscribe to the Intel® DAAL Decision Forest Classification algorithm in AWS Marketplace:
- Opened the Intel® DAAL Decision Forest Classification listing in AWS Marketplace.
- Read the Highlights, Product Overview, Usage information, and Additional resources, and noted the supported instance types.
- Selected Continue to Subscribe.
- Reviewed the End user license agreement, Support Terms, and Pricing Information. I selected Accept Offer to agree to the pricing, EULA, and support terms of the listing.
For this exercise, I used the Intel® DAAL Decision Forest Classification algorithm to train my machine learning model. You can also try other options, including built-in Amazon SageMaker algorithms that support classification, such as the Linear Learner and XGBoost algorithms, as well as algorithms available in the major supported frameworks.
Step 2: Create an Amazon SageMaker notebook instance and conduct the experiment
I created an Amazon SageMaker notebook instance to analyze the data and train my machine learning model to predict popular bath products. For instructions on how to create the Amazon SageMaker notebook instance, see Create a Notebook Instance. Here are the steps I performed in the notebook:
- Analyzed and feature-engineered data
- Identified interesting features:
- I looked at the dataset and found that it contained features such as price, name, category, how well the product lasted in the market, reviews, ratings, and promotions applied.
- I decided to keep my model simple and chose name and category as features.
- I combined review counts, review ratings, and the duration for which the product lasted in the market into a single outcome variable that signified whether or not the product became popular.
- Cleansed the data: I cleansed the product name feature by removing all special characters, converting numerals to words, and changing the text case to lowercase. I also removed insignificant null values.
- Created new features:
- The data showed that popular products had shorter names. I created a product-name-length feature.
- I also found that categories were really broad. For example, three-wick-candle as well as single-wick-candle were in the same category. I decided to extract all common suffixes to create a new feature called sub-category.
- Generated embeddings: For the product name column itself, I generated embeddings. I visualized the embeddings via t-SNE plot.
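The cleansing and feature-engineering steps above can be sketched in plain Python. The helper names, the digit-to-word mapping, and the suffix heuristic below are my own illustrative choices, not the exact code from the notebook:

```python
import re

# Map numerals to words so "3-wick" and "three-wick" normalize the same way.
# (A small illustrative mapping; a library could cover larger numbers.)
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def cleanse_name(name: str) -> str:
    """Lowercase, convert numerals to words, and strip special characters."""
    name = name.lower()
    name = "".join(DIGIT_WORDS.get(ch, ch) for ch in name)
    name = re.sub(r"[^a-z\s-]", " ", name)    # drop special characters
    return re.sub(r"\s+", " ", name).strip()  # collapse extra whitespace

def extract_subcategory(name: str) -> str:
    """Use the trailing word of the cleaned name as a sub-category, so that
    'three-wick-candle' and 'single-wick-candle' both yield 'candle'."""
    tokens = cleanse_name(name).replace("-", " ").split()
    return tokens[-1] if tokens else ""

def name_length(name: str) -> int:
    """Popular products tended to have shorter names; use length as a feature."""
    return len(cleanse_name(name))
```

`cleanse_name("Ginger orange 3-wick candle!")` yields `"ginger orange three-wick candle"`, and the sub-category helper maps both candle variants to the same value.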
- Prepared data for training
- Many algorithms, including the one I intended to use, accept data only in numeric format, so I one-hot encoded the category and subcategory features.
- I randomized and split the dataset into training and testing datasets and then uploaded them to S3.
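A dependency-free sketch of the encoding and split steps above. In practice the notebook would more likely use pandas `get_dummies` and scikit-learn's `train_test_split`; the helper names here are illustrative:

```python
import random

def one_hot(rows, column):
    """Replace a categorical column with 0/1 indicator columns."""
    values = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for value in values:
            new_row[f"{column}_{value}"] = 1 if row[column] == value else 0
        encoded.append(new_row)
    return encoded

def split(rows, train_frac=0.8, seed=42):
    """Randomize and split rows into training and testing sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```

The resulting training and testing sets can then be written out as CSV and uploaded to S3 (for example with boto3) for the training job to consume.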
- Specified the hyperparameters required by the algorithm and ran a training job. The training job used the data from my S3 bucket, the algorithm from AWS Marketplace, and the hyperparameters I specified, and produced a model with 70% accuracy.
- Once the model was trained, I had two options:
- Perform a batch inference.
- Stand up an Amazon SageMaker endpoint for performing real-time inference. For this exercise, I stood up an Amazon SageMaker endpoint.
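With the SageMaker Python SDK, the train-then-deploy flow above can be sketched roughly as follows. The algorithm ARN, role, bucket, hyperparameter names, and instance types are all placeholders, not values from the actual experiment; consult the listing's usage information for the real ones:

```python
# Hypothetical hyperparameters; the real names and values come from the
# algorithm's usage information in AWS Marketplace.
HYPERPARAMETERS = {"nClasses": "2", "nTrees": "100", "maxTreeDepth": "10"}

def train_and_deploy(algorithm_arn, role, bucket):
    """Train with a subscribed Marketplace algorithm, then stand up an endpoint."""
    import sagemaker  # imported here so the sketch loads without the SDK installed
    from sagemaker.algorithm import AlgorithmEstimator

    estimator = AlgorithmEstimator(
        algorithm_arn=algorithm_arn,   # ARN of the subscribed Marketplace algorithm
        role=role,                     # IAM role SageMaker assumes for the job
        instance_count=1,
        instance_type="ml.m5.xlarge",  # must be one of the listing's supported types
        hyperparameters=HYPERPARAMETERS,
        sagemaker_session=sagemaker.Session(),
    )
    # Train on the prepared dataset uploaded to S3 earlier.
    estimator.fit({"training": f"s3://{bucket}/bath-products/train.csv"})
    # Stand up an Amazon SageMaker endpoint for real-time inference.
    return estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```

Batch inference would instead use the estimator's transformer rather than a persistent endpoint.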
- Tested my trained model.
- I entered Ginger orange 3-wick candle as the product name, and the ML model predicted whether it would become popular or not.
- Then I tried tweaking product names to a few other combinations, such as Orange vanilla hand soap and Orange hand soap.
- I tuned the model further by performing hyperparameter optimization to achieve higher accuracy results.
- I continued performing inference on the tuned model until I had a list of products and categories to recommend to my client.
I could extend the exercise further by adjusting features, such as price and promotions. I could also create additional synthetic features based on occurrences of specific fragrant elements. For example, I could predict whether lavender- or coconut-scented products were more likely to be popular.
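One way to sketch such a synthetic fragrance feature, with a purely illustrative keyword list:

```python
# Hypothetical fragrance vocabulary; a real list would be derived from the data.
FRAGRANCES = ("lavender", "coconut", "vanilla", "ginger", "orange")

def fragrance_features(product_name):
    """Return one 0/1 indicator per fragrance keyword found in the product name."""
    lowered = product_name.lower()
    return {f"has_{f}": int(f in lowered) for f in FRAGRANCES}
```

These indicators could then be added to the feature set to test whether, say, lavender- or coconut-scented products are more likely to be popular.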
In this post, I showed a scenario in which I delivered a custom ML model that predicts the popularity of retail bath products in order to increase sales. I advised my client to stock specific products based on the created model. For a deep dive walkthrough of this process, watch my re:Invent session demonstrating this scenario and exercise.
These other blog posts cover aspects of using data from AWS Data Exchange:
- AWS Data Exchange – Find, Subscribe To, and Use Data Products
- Building machine learning workflows with AWS Data Exchange and Amazon SageMaker
- Find and acquire new data sets and retrieve new updates automatically using AWS Data Exchange
- Using AWS Marketplace for machine learning workloads
About the author
Kanchan Waikar is a Senior Partner Solutions Architect at Amazon Web Services in the AWS Marketplace for machine learning group. She has over 13 years of experience building, architecting, and managing NLP and software development projects. She holds a master's degree in computer science (data science major), and she enjoys helping customers build solutions backed by AI/ML-based AWS services and partner solutions.