In this lab, you learn how to build a semantic content recommendation system that combines topic modeling and nearest-neighbor techniques for information retrieval, using the Amazon SageMaker built-in algorithms Neural Topic Model (NTM) and k-Nearest Neighbors (k-NN).

Information retrieval is the science of searching for information in a document, searching for documents themselves, or searching for metadata that describe data. This lab combines topic modeling and nearest-neighbor techniques for information retrieval: topic modeling generates semantic distribution vectors that represent the meaning of documents in terms of topics, and the nearest-neighbor technique indexes those topic vectors so that, for a given input document, similar documents can be retrieved based on topic similarity. Because you use Amazon SageMaker built-in algorithms, you do not need labeled data, and retrieval is based on semantic similarity rather than simple string matching.
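To make the retrieval idea concrete before touching SageMaker, here is a minimal sketch in plain NumPy (the topic vectors and query are hypothetical, not output from NTM): each document is represented as a topic-distribution vector, and recommendations are the documents whose vectors are closest to a query vector by cosine similarity.

```python
import numpy as np

# Hypothetical topic-distribution vectors (rows sum to 1), one per document.
# In the lab, vectors like these would come from the trained NTM model.
corpus_topics = np.array([
    [0.70, 0.20, 0.10],   # doc 0: mostly topic A
    [0.10, 0.80, 0.10],   # doc 1: mostly topic B
    [0.65, 0.25, 0.10],   # doc 2: mostly topic A
    [0.05, 0.15, 0.80],   # doc 3: mostly topic C
])

def recommend(query_vec, topic_matrix, k=2):
    """Return indices of the k documents whose topic vectors are
    closest to the query by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    m = topic_matrix / np.linalg.norm(topic_matrix, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity to each document
    return np.argsort(-sims)[:k]      # indices of the k most similar docs

query = np.array([0.60, 0.30, 0.10])  # a new document's topic mixture
print(recommend(query, corpus_topics))  # → [2 0]
```

In the lab, the k-NN built-in algorithm plays the role of this `recommend` function, building an index over the NTM topic vectors so lookups stay fast at scale.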

Amazon SageMaker is a fully managed, end-to-end machine learning platform that covers the entire machine learning lifecycle. Amazon SageMaker NTM is an unsupervised learning algorithm that organizes a corpus of documents into topics, each containing word groupings based on their statistical distribution. Amazon SageMaker k-Nearest Neighbors (k-NN) is a non-parametric, index-based, supervised learning algorithm that can be used for classification and regression tasks.
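As a minimal illustration of the k-NN idea itself (plain Python on toy data, not the SageMaker built-in algorithm), the sketch below shows why k-NN is called non-parametric and index-based: the "model" is just the stored training examples, and a prediction is a majority vote among the k nearest points.

```python
from collections import Counter
import math

# Toy training set: 2-D points with class labels. k-NN has no training
# step beyond storing (indexing) these examples.
train = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((5.0, 5.0), "B"),
    ((4.8, 5.2), "B"),
]

def knn_predict(point, data, k=3):
    """Classify `point` by majority vote among its k nearest
    neighbors under Euclidean distance."""
    nearest = sorted(data, key=lambda ex: math.dist(point, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((1.1, 0.9), train))  # → A
```

For regression, the same neighbor lookup would average the neighbors' target values instead of voting; the SageMaker algorithm supports both modes.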

One of the key components of Amazon SageMaker is its suite of highly scalable built-in algorithms. This lab uses the Amazon SageMaker Neural Topic Model (NTM) and k-Nearest Neighbors (k-NN) algorithms to combine information retrieval techniques and build a recommendation system.

Some of the key reasons to use Amazon SageMaker for your information retrieval are:

  • Scalable Training: SageMaker fully manages the infrastructure needed to train models at scale: it sets up the instances, moves data between storage and compute (and between compute instances), and de-provisions the compute once the job completes. With managed spot training, you can save up to 90% on training costs.
  • Scalable Deployment: Once your models are trained, SageMaker can fully manage both offline (batch) inference and online deployment of your trained models by creating a hosted endpoint for you. SageMaker automatically scales your endpoints up and down based on incoming traffic.
  • Monitoring: With Amazon SageMaker Model Monitor, you can monitor your model endpoints for drift in your data and emit alarms when drift is detected, signaling that your models may need retraining.
  • Fully Managed Hyperparameter Tuning: Training ML models often requires a time-consuming process of tuning hyperparameters. SageMaker manages hyperparameter tuning jobs for you: you simply select the number of tuning jobs to run in parallel and in total, and the metric you want to optimize, and Amazon SageMaker takes care of the rest.
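To see what a tuning job automates, here is a tiny local sketch with hypothetical data: it evaluates a few candidate values of k for a toy k-NN classifier against a validation set and keeps the value with the best accuracy, the same select-by-metric loop that SageMaker runs for you in parallel and at scale.

```python
from collections import Counter
import math

# Hypothetical labeled data, split into train/validation sets.
# The point at (1.05, 1.0) is a deliberate outlier labeled "B".
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((1.05, 1.0), "B"),
         ((5.0, 5.0), "B"), ((4.8, 5.2), "B"), ((5.1, 4.9), "B")]
valid = [((1.1, 1.0), "A"), ((5.0, 5.1), "B"), ((0.8, 0.9), "A")]

def knn_predict(point, data, k):
    nearest = sorted(data, key=lambda ex: math.dist(point, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(k):
    """The metric a tuning job would monitor for each trial."""
    hits = sum(knn_predict(p, train, k) == y for p, y in valid)
    return hits / len(valid)

# "Tuning": evaluate each candidate k and keep the best one.
best_k = max([1, 3, 5], key=accuracy)
print(best_k)  # → 3 (k=1 is fooled by the outlier)
```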

AWS Experience: Intermediate

Time to complete: 2 hours

Cost to complete: This tutorial will cost you less than $2 (assuming all services are running for 2 hours)*

Requirements:

• Active AWS Account**
• Browser: AWS recommends Chrome
• Amazon SageMaker
• Amazon SageMaker Notebooks
• Amazon SageMaker Built-in Algorithms
• Amazon S3
• AWS SDK for Python (Boto3)

*This estimate assumes you follow the recommended configurations throughout the tutorial and terminate all resources within 2 hours.

**Accounts that have been created within the last 24 hours might not yet have access to the resources required for this project.