In this module, you use the built-in Amazon SageMaker Neural Topic Model (NTM) Algorithm to train the topic model.

Amazon SageMaker NTM is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution. For example, documents that contain frequent occurrences of words such as "bike", "car", "train", "mileage", and "speed" are likely to share a topic on "transportation". Topic modeling can be used to classify or summarize documents based on the topics detected, or to retrieve information or recommend content based on topic similarities. The topics that NTM learns from documents are characterized as a latent representation because they are inferred from the observed word distributions in the corpus. The semantics of a topic are usually inferred by examining its top-ranking words. Because the method is unsupervised, only the number of topics, not the topics themselves, is prespecified. In addition, the topics are not guaranteed to align with how a human might naturally categorize documents.
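
To make the idea of reading a topic from its top-ranking words concrete, here is a minimal sketch in plain Python. The vocabulary and the topic-word weights are made-up illustrative values, not output from NTM, and top_words is a hypothetical helper introduced only for this example.

```python
# Hypothetical vocabulary and topic-word weights (illustrative values only,
# not produced by NTM). Each row holds one topic's weight for every word.
vocab = ["bike", "car", "train", "mileage", "speed", "court", "judge", "law"]
topic_word_weights = [
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.0, 0.1, 0.0],  # topic 0
    [0.0, 0.1, 0.0, 0.0, 0.1, 0.9, 0.8, 0.7],  # topic 1
]


def top_words(weights, vocab, k=3):
    """Return the k words with the largest weights for one topic."""
    ranked = sorted(range(len(vocab)), key=lambda i: weights[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]


topic_top_words = [top_words(row, vocab) for row in topic_word_weights]
for topic_id, words in enumerate(topic_top_words):
    print(topic_id, words)
```

In this toy data, a reader would probably label topic 0 "transportation" and topic 1 "legal". That labeling step is a human judgment and is not something the model provides.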

In the following steps, you specify your NTM algorithm for the training job, specify infrastructure for the model, set the hyperparameter values to tune the model, and run the model. Then, you deploy the model to an endpoint managed by Amazon SageMaker to make predictions.

Time to Complete Module: 20 Minutes


  • Step 1. Create and run the training job

    The built-in Amazon SageMaker algorithms are stored as Docker containers in Amazon Elastic Container Registry (Amazon ECR). For model training, you first need to specify the location of the NTM container in Amazon ECR for your region.

    In your notebook instance, copy and paste the following code into a new code cell and choose Run.

    import boto3
    from sagemaker.amazon.amazon_estimator import get_image_uri
    container = get_image_uri(boto3.Session().region_name, 'ntm')

    The Amazon SageMaker Python SDK includes the sagemaker.estimator.Estimator estimator. This estimator allows you to specify the infrastructure (Amazon EC2 instance type and number of instances), the hyperparameters, the output path, and, optionally, any security-related settings (virtual private cloud (VPC), security groups, and so on) that may be relevant if you are training your model in a custom VPC of your choice as opposed to an Amazon VPC. NTM takes full advantage of GPU hardware and, in general, trains roughly an order of magnitude faster on a GPU than on a CPU. Multi-GPU or multi-instance training further improves training speed roughly linearly if communication overhead is low compared to compute time.

    To create an instance of the sagemaker.estimator.Estimator class, copy and paste the following code into a new code cell and choose Run.

    import sagemaker

    # role and output_path are assumed to be defined earlier in the notebook
    sess = sagemaker.Session()
    ntm = sagemaker.estimator.Estimator(container,
                                        role, 
                                        train_instance_count=2, 
                                        train_instance_type='ml.c4.xlarge',
                                        output_path=output_path,
                                        sagemaker_session=sess)
    

    Now, you can set the hyperparameters for the topic model:

    ntm.set_hyperparameters(num_topics=NUM_TOPICS, feature_dim=vocab_size, mini_batch_size=128, 
                            epochs=100, num_patience_epochs=5, tolerance=0.001)
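
    The num_patience_epochs and tolerance values control early stopping. The sketch below illustrates one plausible reading of such a rule: stop once the relative change in the loss has stayed below tolerance for the last num_patience_epochs epochs. It is plain Python with made-up loss values, should_stop is a hypothetical helper, and the exact criterion that NTM applies internally may differ.

    ```python
    def should_stop(losses, num_patience_epochs=5, tolerance=0.001):
        """Return True when the relative loss change stayed below `tolerance`
        for each of the last `num_patience_epochs` epochs."""
        if len(losses) <= num_patience_epochs:
            return False  # not enough history yet
        recent = losses[-(num_patience_epochs + 1):]
        for prev, curr in zip(recent, recent[1:]):
            rel_change = abs(prev - curr) / max(abs(prev), 1e-12)
            if rel_change >= tolerance:
                return False
        return True


    # Made-up loss values: first still improving, then flat.
    improving = [10.0, 9.0, 8.2, 7.6, 7.1, 6.7, 6.4]
    plateaued = improving + [6.4, 6.3999, 6.3998, 6.3998, 6.3997, 6.3997]

    print(should_stop(improving))
    print(should_stop(plateaued))
    ```

    Under this reading, epochs=100 acts as an upper bound, and training can end earlier once the loss stops changing meaningfully.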
    

    SageMaker offers two modes for data channels:

    • FullyReplicated: All data files are copied to all workers.
    • ShardedByS3Key: Data files are sharded to different workers, that is, each worker receives a different portion of the full data set.

    At the time of writing, the Amazon SageMaker Python SDK uses FullyReplicated mode by default for all data channels. This mode is desirable for the validation (test) channel but is less efficient for the training channel when you use multiple workers.
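
    The difference between the two modes can be illustrated with a small simulation. This is a sketch in plain Python, not SageMaker code: assign_files is a hypothetical helper, the object keys are invented, and the round-robin assignment is an assumption made for illustration because the exact way SageMaker divides keys among workers is not described here.

    ```python
    def assign_files(keys, num_workers, distribution):
        """Return {worker index: list of S3 keys that worker would read}."""
        if distribution == "FullyReplicated":
            # Every worker receives a copy of every file.
            return {w: list(keys) for w in range(num_workers)}
        if distribution == "ShardedByS3Key":
            # Round-robin is an assumption made for illustration only.
            return {w: keys[w::num_workers] for w in range(num_workers)}
        raise ValueError("unknown distribution: {}".format(distribution))


    # Hypothetical object keys under the training prefix.
    keys = ["train/part-0", "train/part-1", "train/part-2", "train/part-3"]

    replicated = assign_files(keys, 2, "FullyReplicated")
    sharded = assign_files(keys, 2, "ShardedByS3Key")
    print(replicated)
    print(sharded)
    ```

    With sharding, each worker reads a disjoint subset of the files, so the training prefix should contain at least as many objects as there are training instances.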

    In this case, you want to have each worker go through a different portion of the full dataset to provide different gradients within epochs. You specify distribution to be ShardedByS3Key for the training data channel as follows.

    from sagemaker.session import s3_input
    s3_train = s3_input(s3_train_data, distribution='ShardedByS3Key') 
    ntm.fit({'train': s3_train, 'test': s3_val_data})
    

    You should see the following output in your terminal:

    Completed - Training job completed

    Success! You've trained your topic model with the NTM algorithm.

    In the next step, you deploy your model to Amazon SageMaker hosting services.

  • Step 2. Deploy the topic model

    A trained model by itself is simply a tar file consisting of the model weights and does nothing on its own. To make the model useful and get predictions, you need to deploy the model.

    There are two ways to deploy the model in Amazon SageMaker, depending on how you want to generate inferences:

    • To get one inference at a time, set up a persistent endpoint using Amazon SageMaker hosting services.
    • To get inferences for an entire dataset, use Amazon SageMaker batch transform.

    This lab provides both options for you to choose the best approach for your use case.

    With Amazon SageMaker hosting services, a live HTTPS endpoint runs on an Amazon EC2 instance. You pass a payload to the endpoint and obtain inferences in response.

    When you deploy the model, you call the deploy method of the sagemaker.estimator.Estimator object. When you call the deploy method, you specify the number and type of ML instances that you want to use to host the endpoint.

    Copy and paste the following code and choose Run to deploy the model.

    ntm_predictor = ntm.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

    The deploy method creates the deployable model, configures the Amazon SageMaker hosting services endpoint, and launches the endpoint to host the model.

    To run inferences against an endpoint, you need to ensure that the input payload is serialized in a format that the trained model can read, and that the inference output is deserialized into a human-readable format. In the following code, you use csv_serializer to pass CSV-formatted data (as vectors) to the model and json_deserializer to parse the JSON output that the model returns.

    Copy and paste the following code into a code cell and choose Run.

    from sagemaker.predictor import csv_serializer, json_deserializer
    
    ntm_predictor.content_type = 'text/csv'
    ntm_predictor.serializer = csv_serializer
    ntm_predictor.deserializer = json_deserializer
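
    To show what the serializer and deserializer are doing, here is a minimal sketch that uses only the Python standard library rather than the SageMaker SDK. The helper to_csv_row is hypothetical, the word counts are invented, and the response body is a made-up example that follows the shape this tutorial reads from the endpoint (a predictions list whose items carry topic_weights). The numbers are not real model output.

    ```python
    import json


    def to_csv_row(vector):
        """Serialize one word-count vector as a single CSV line."""
        return ",".join(str(v) for v in vector)


    # Hypothetical word counts for one document (vocabulary of 6 words).
    payload = to_csv_row([0, 2, 0, 1, 0, 3])
    print(payload)

    # Made-up response body; the values are illustrative only.
    response_body = '{"predictions": [{"topic_weights": [0.1, 0.7, 0.2]}]}'
    results = json.loads(response_body)
    topic_weights = [p["topic_weights"] for p in results["predictions"]]
    print(topic_weights)
    ```

    Each inner list holds one weight per topic for one document, so its length would equal NUM_TOPICS for the trained model.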

    Next, extract the topic vectors for the training data that you will use in the K-NN model.

    Copy and paste the following code into a new code cell and choose Run.

    predictions = []
    for item in np.array(vectors.todense()):
        # Send one document vector at a time to the endpoint
        results = ntm_predictor.predict(item)
        predictions.append(np.array([prediction['topic_weights'] for prediction in results['predictions']]))
        
    predictions = np.array([np.ndarray.flatten(x) for x in predictions])
    topicvec = train_labels[newidx]
    topicnames = [categories[x] for x in topicvec]
    

    Success! Now, you can explore the model outputs.

    With batch transform, you can run inferences on a batch of data at a time. Amazon SageMaker creates the necessary compute infrastructure and tears it down once the batch job is completed.

    The batch transform code creates a sagemaker.transformer.Transformer object from the topic model. Then, it calls that object's transform method to create a transform job. When you create the sagemaker.transformer.Transformer object, you specify the number and type of instances to use to perform the batch transform job, and the location in Amazon S3 where you want to store the inferences.  

    To run inferences as a batch job, copy and paste the following code into a code cell and choose Run.

    np.savetxt('trainvectors.csv',
               vectors.todense(),
               delimiter=',',
               fmt='%i')
    batch_prefix = '20newsgroups/batch'
    
    train_s3 = sess.upload_data('trainvectors.csv', 
                                bucket=bucket, 
                                key_prefix='{}/train'.format(batch_prefix))
    print(train_s3)
    batch_output_path = 's3://{}/{}/test'.format(bucket, batch_prefix)
    
    ntm_transformer = ntm.transformer(instance_count=1,
                                      instance_type ='ml.m4.xlarge',
                                      output_path=batch_output_path
                                     )
    ntm_transformer.transform(train_s3, content_type='text/csv', split_type='Line')
    ntm_transformer.wait()
    

    Once the transform job is done, you can use the following code to download the outputs back to your local notebook instance for inspection.

    !aws s3 cp --recursive $ntm_transformer.output_path ./
    !head -c 5000 trainvectors.csv.out

    Success! The model converted each document into a NUM_TOPICS-dimensional topic vector. You can now explore the topic model.

  • Step 3. Explore the topic model

    One approach for exploring the model outputs is to visualize the generated topic vectors using a t-SNE plot. t-SNE, or t-Distributed Stochastic Neighbor Embedding, is a non-linear dimensionality reduction technique that aims to preserve the distances between nearest neighbors in the original high-dimensional space when mapping them to the resulting lower-dimensional space. By setting the number of dimensions to 2, you can use it to visualize the topic vectors in 2D space.

    In your Jupyter notebook, copy and paste the following code into a new code cell and choose Run.

    import time

    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns
    from sklearn.manifold import TSNE

    time_start = time.time()
    tsne = TSNE(n_components=2, verbose=1, perplexity=50, n_iter=5000)
    tsne_results = tsne.fit_transform(predictions)
    print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
    tsnedf = pd.DataFrame()
    tsnedf['tsne-2d-one'] = tsne_results[:,0]
    tsnedf['tsne-2d-two'] = tsne_results[:,1]
    tsnedf['Topic']=topicnames
    plt.figure(figsize=(25,25))
    sns.lmplot(
        x="tsne-2d-one", y="tsne-2d-two",
        hue='Topic',
        palette=sns.color_palette("hls", NUM_TOPICS),
        data=tsnedf,
        legend="full",
        fit_reg=False
    )
    plt.axis('Off')
    plt.show()

    The t-SNE plot should show some large topic clusters like the following image. Plots like these can be used to estimate the number of distinct topic clusters in the dataset. Currently, NUM_TOPICS is set to 30, but many topics appear close to each other in the t-SNE plot and could be combined into a single topic. Ultimately, because topic modeling is largely an unsupervised learning problem, you must use visualizations such as these to determine the right number of topics into which to partition the dataset.

    Try experimenting with different topic numbers to see what the visualization looks like.


In this module, you retrieved the Amazon SageMaker Neural Topic Model (NTM) algorithm from Amazon ECR. Then, you specified algorithm-specific hyperparameters and provided the Amazon S3 bucket for artifact storage. Next, you deployed the model to an endpoint using Amazon SageMaker hosting services and ran inferences with batch transform. Finally, you explored the model using different values for the topic number.

In the next module, you train and deploy your content recommendation model.