AWS Partner Network (APN) Blog

Understanding and Monitoring Embeddings in Amazon SageMaker with WhyLabs AI Observatory Platform

By Andre Elizondo, Principal Solutions Architect – WhyLabs
By Shun Mao, Sr. Partner Solutions Architect – AWS
By James Yi, Sr. Partner Solutions Architect – AWS


With the rise of large language models (LLMs), natural language processing (NLP), and generative AI models, embeddings are becoming a critical piece of data in more machine learning (ML) use cases.

In this post, we’ll explore how embeddings are used in machine learning, where problems can show up that impact your ML models, and how you can use WhyLabs to identify those problems and create monitors so they don’t recur in the future.

WhyLabs is an AWS Partner and an essential artificial intelligence (AI) observability platform for machine learning model and data health. It’s the only ML monitoring and observability platform that doesn’t operate on raw data, which enables a no-configuration solution, privacy preservation, and massive scale.

WhyLabs AI Observatory Platform is available in AWS Marketplace.

What Are Embeddings and How Are They Used?

Embeddings are a way to represent complex data types as numerical representations that preserve context and relationships. They can be sparse or dense to represent different types of data, and embeddings are heavily used in machine learning for a variety of data types and tasks as inputs, intermediate products, and outputs.
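
As a toy illustration of how relationships are preserved, related items map to vectors that point in similar directions, which we can check with cosine similarity. The vectors below are made-up placeholder values, not the output of any real model.

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: close to 1.0 means similar direction, close to 0.0 means unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 4-dimensional embeddings with made-up values, purely for illustration
emb_cat = np.array([0.9, 0.1, 0.3, 0.0])
emb_kitten = np.array([0.85, 0.15, 0.35, 0.05])  # semantically close to "cat"
emb_invoice = np.array([0.0, 0.9, 0.1, 0.8])     # an unrelated concept

print(cosine_similarity(emb_cat, emb_kitten))   # high similarity
print(cosine_similarity(emb_cat, emb_invoice))  # low similarity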

Here are a few examples of where embeddings are often used in AI/ML. In each of these use cases, embeddings are critical as a way to preserve the original context, which can later be decoded or used to compare characteristics of the upstream data.

Natural language understanding and text analysis:

  • Sentiment analysis
  • Document classification
  • Text generation

Computer vision and image processing:

  • Manufacturing quality assurance
  • Autonomous driving

Audio processing:

  • Text-to-speech models
  • Speaker identification

Tabular machine learning:

  • Product recommendation
  • Anonymization and privacy

Because embeddings are context-preserving, they’re a trailing indicator of a change in the upstream data or a failure in upstream transformation steps. Depending on the structure of your organization, the teams creating the embeddings may be different from the teams putting them to use.

While there are many different ways to create embeddings, we won’t cover them in this post. Instead, we’ll discuss how embeddings can be measured to identify meaningful drift in the transformed inputs, for example by identifying the closest centroid cluster or measuring distances to individual centroids.

Typically for debugging, data scientists use lower-dimensional representations like UMAP or t-SNE, which are helpful for visually identifying clusters but aren’t a scalable approach to understanding your embeddings over time in production.
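
For reference, that ad-hoc approach looks something like the minimal sketch below, which projects an embedding matrix down to two dimensions with scikit-learn’s t-SNE. The "embeddings" and "labels" arrays here are random placeholders standing in for real model outputs.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder data: in practice these would be your real embeddings and labels
embeddings = np.random.rand(500, 50)
labels = np.random.randint(0, 7, size=500)

# Project to 2D and plot, coloring points by label to spot clusters visually
projected = TSNE(n_components=2, random_state=42).fit_transform(embeddings)
plt.scatter(projected[:, 0], projected[:, 1], c=labels, cmap="tab10", s=5)
plt.title("2D t-SNE projection of embeddings")
plt.show()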

To handle this in a scalable way, you can use whylogs, an open-source library for logging any kind of data. It creates a lightweight statistical profile of your data that can be used to extract meaningful insights and characteristics, letting you measure quality and drift over time.

Using whylogs, customers can identify centroids in their embeddings and measure distances within different clusters. This can be helpful for identifying how your embeddings change over time or shift suddenly due to a change in the upstream data. Read more in this WhyLabs blog post.


Figure 1 – Visualization of embedding space.

Train and Deploy a Classification Model

In this section, we’ll set up and train a simple classification model in Amazon SageMaker, which lets you build, train, and deploy machine learning models using fully managed infrastructure, tools, and workflows.

We’ll use the 20 Newsgroups dataset to create vectors and train our model on those vectors.

! pip install whylogs[whylabs] scikit-learn==1.0.2 --quiet

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd
from whylogs.experimental.preprocess.embeddings.selectors import PCACentroidsSelector
from sklearn.naive_bayes import MultinomialNB
import joblib

categories = [
    "alt.atheism",
    "soc.religion.christian",
    "comp.graphics",
    "rec.sport.baseball",
    "talk.politics.guns",
    "misc.forsale",
    "sci.med",
]

twenty_train = fetch_20newsgroups(
    subset="train", remove=("headers", "footers", "quotes"), categories=categories, shuffle=True, random_state=42
)

vectorizer = Pipeline(
    [
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
    ]
)
vectors_train = vectorizer.fit_transform(twenty_train.data)

vectors_train = vectors_train.toarray()

clf = MultinomialNB(alpha=0.01)
clf.fit(vectors_train, twenty_train.target)

Next, we’ll create an entrypoint script that defines how to load our model and make predictions. We’ll then deploy our model to an endpoint so we can make some batched predictions and compare our results. To simplify defining the endpoint in SageMaker, we’ll use a pretrained model on this same dataset.

If downloading the model from the URI in the code fails, we can download the model manually and upload it to our own Amazon Simple Storage Service (Amazon S3) bucket, then replace the S3 URI with the new model location.
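
A rough sketch of that fallback could look like the following; the bucket name and key below are placeholders you would replace with your own.

# Sketch of the fallback: download the public model artifact over HTTPS and
# upload it to your own S3 bucket (bucket name and key are placeholders)
import urllib.request
import boto3

model_url = "https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/Newsgroups/model.tar.gz"
urllib.request.urlretrieve(model_url, "model.tar.gz")

boto3.client("s3").upload_file("model.tar.gz", "your-bucket-name", "newsgroups/model.tar.gz")
# Then point model_data below at "s3://your-bucket-name/newsgroups/model.tar.gz"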

%%writefile model.py

import pandas as pd
import numpy as np
import os
import joblib

def model_fn(model_dir):
    # Load the serialized scikit-learn classifier from the model artifact directory
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf

def predict_fn(data, model):
    # Run inference on the deserialized request payload
    result = model.predict(data)
    return result

from sagemaker.sklearn.model import SKLearnModel
from sagemaker import get_execution_role

sklearn_model = SKLearnModel(model_data="https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/Newsgroups/model.tar.gz",
                             role=get_execution_role(),
                             entry_point="model.py",
                             framework_version="1.2-1")

predictor = sklearn_model.deploy(instance_type="ml.t2.medium", initial_instance_count=1)
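
Optionally, before moving on we can send one of the training vectors to the new endpoint as a quick sanity check. This step isn’t required for the rest of the walkthrough; it just confirms the deployment works.

# Optional sanity check: predict a single training document and compare with its label
sample_prediction = predictor.predict([vectors_train[0]])
print("predicted:", twenty_train.target_names[sample_prediction[0]])
print("actual:   ", twenty_train.target_names[twenty_train.target[0]])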

Measure Embedding Distances with whylogs

Now that we have our model trained and entrypoint defined, we’ll capture a set of reference points that identify the centroids of the embeddings our model was trained on. This will help us compare distances to each centroid during inference, and we’ll come back to that a bit later in this post.

To capture our reference points, we have a few different options in whylogs. We can manually define relationships, or let whylogs automatically identify centroids based on corresponding labels or by utilizing an unsupervised clustering approach.

For this example, we have a well-labeled dataset so we’ll have whylogs choose our reference points based on labels denoting each of our centroids.

references, labels = PCACentroidsSelector(n_components=20).calculate_references(vectors_train, twenty_train.target)
ref_labels = [twenty_train.target_names[x].split(".")[-1] for x in labels]
print(ref_labels)

We now have our centroids and references defined, and this allows us to use the references when profiling our dataset and comparing it to other batches of data. Next, let’s define the necessary setup for whylogs to understand what reference points to compare to.

import whylogs as why
from whylogs.core.resolvers import MetricSpec, ResolverSpec
from whylogs.core.schema import DeclarativeSchema
from whylogs.experimental.extras.embedding_metric import (
    DistanceFunction,
    EmbeddingConfig,
    EmbeddingMetric,
)
from whylogs.experimental.extras.nlp_metric import BagOfWordsMetric
from whylogs.core.resolvers import STANDARD_RESOLVER

config = EmbeddingConfig(
    references=references,
    labels=ref_labels,
    distance_fn=DistanceFunction.cosine,
)
embeddings_resolver = ResolverSpec(column_name="news_centroids", metrics=[MetricSpec(EmbeddingMetric, config)])
tokens_resolver = ResolverSpec(column_name="document_tokens", metrics=[MetricSpec(BagOfWordsMetric)])

embedding_schema = DeclarativeSchema(STANDARD_RESOLVER+[embeddings_resolver])
token_schema = DeclarativeSchema(STANDARD_RESOLVER+[tokens_resolver])
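
As an optional check of this setup, we can profile the training embeddings with the new schema; the resulting EmbeddingMetric components should include distances from each document to the labeled centroids. This baseline profile is separate from the batches we upload later.

# Optional: profile the training embeddings to confirm the schema resolves the
# EmbeddingMetric and computes distances to the labeled centroids
train_profile = why.log(row={"news_centroids": vectors_train}, schema=embedding_schema)
print(train_profile.view().to_pandas().columns.tolist())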

Monitor Embedding Drift with WhyLabs Observatory

At this point, we have a trained model, reference embeddings, and a whylogs resolver defined to extract the information we want from our embeddings. To see the power of measuring embedding distances, we’ll create a scenario where we use our classifier to predict the classes of documents like those it learned from our training set.

Let’s define a series of news article batches to transform and send to our model. We’ll add some perturbation to the later batches by taking a percentage of the articles and translating them into Spanish before transforming them and running our classifier.

To speed things up, let’s download the production data from a public S3 bucket. That way, we won’t have to translate or tokenize the documents ourselves.

The dataframe below contains 5,306 documents—2,653 in English and 2,653 in Spanish. The Spanish documents were obtained by simply translating the English ones. Documents that have the same “doc_id” refer to the same document in different languages. We’ll also define a method to inject some interesting scenarios when we profile our data later.

download_url = "https://whylabs-public.s3.us-west-2.amazonaws.com/whylogs_examples/Newsgroups/production_en_es.parquet"
prod_df = pd.read_parquet(download_url)
language_perturbation_ratio = [0,0,0,0,0.33,0.66,1]

def get_docs_by_language_ratio(batch_df, ratio):
    n_docs = len(batch_df[batch_df["language"] == "en"])
    n_es_docs = int(n_docs * ratio)
    n_en_docs = n_docs - n_es_docs
    en_df = batch_df[batch_df["language"] == "en"].sample(n_en_docs)

    # keep only Spanish translations of documents not already sampled in English
    es_df = batch_df[~batch_df['doc_id'].isin(en_df["doc_id"])]
    es_df = es_df[es_df["language"] == "es"].sample(n_es_docs)
    docs = pd.concat([en_df, es_df])
    return docs
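
To illustrate the helper’s behavior, calling it on a single batch with a ratio of 0.5 should return a mix in which roughly half the documents are the Spanish translations (assuming batch IDs start at 0, as in the loop below).

# Illustration only: mix the first batch so roughly half of the documents are Spanish
first_batch = prod_df[prod_df["batch_id"] == 0]
mixed_example = get_docs_by_language_ratio(first_batch, 0.5)
print(mixed_example["language"].value_counts())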

Next, we want to define our WhyLabs organization, project ID, and API key that will be used to store and process our batches.

import os
os.environ["WHYLABS_DEFAULT_ORG_ID"] = "ORG-ID"
os.environ["WHYLABS_API_KEY"] = "API-Key"
os.environ["WHYLABS_DEFAULT_DATASET_ID"] = "PROJECT-ID"

Now that we have our authentication defined, we’ll separate our dataset into individual batches denoting a day’s worth of data. We’ll create a dataframe for each day with the raw inputs, tokens from our document, embeddings produced, and the output received from our model.

Afterwards, we’ll profile each dataframe and backdate it to show a progression towards the drift and model degradation we’re expecting.

from datetime import datetime,timedelta, timezone
import whylogs as why
from whylogs.api.writer.whylabs import WhyLabsWriter
import random

writer = WhyLabsWriter()

for day, batch_df in prod_df.groupby("batch_id"):
    batch_df['tokens'] = batch_df['tokens'].apply(lambda x: x.tolist())

    dataset_timestamp = datetime.now() - timedelta(days=6-day)
    dataset_timestamp = dataset_timestamp.replace(hour=0, minute=0, second=0, microsecond=0, tzinfo = timezone.utc)

    print(f"day {day}: {dataset_timestamp}")

    ratio = language_perturbation_ratio[day]
    print(f"{ratio*100}% of documents with language perturbation")
 
    mixed_df = get_docs_by_language_ratio(batch_df, ratio)
    mixed_df = mixed_df.dropna()

    sample_ratio = random.uniform(0.8, 1) # just to have some variability in the total number of daily docs
    mixed_df = mixed_df.sample(frac=sample_ratio).reset_index(drop=True)

    vectors = vectorizer.transform(mixed_df['doc']).toarray()
    
    predicted = []
    for i in range(0, len(vectors)):
        result = predictor.predict([vectors[i]])
        predicted.append(result[0])
    
    print("mean accuracy: ", np.mean(predicted == mixed_df['target']))

    print("Profiling and logging Embeddings and Tokens...")
    embeddings_profile = why.log(row={"news_centroids": vectors},
                    schema=embedding_schema)
    embeddings_profile.set_dataset_timestamp(dataset_timestamp)
    writer.write(file=embeddings_profile.view())

    tokens_df = pd.DataFrame({"document_tokens":mixed_df["tokens"]})
    
    tokens_profile = why.log(tokens_df, schema=token_schema)
    tokens_profile.set_dataset_timestamp(dataset_timestamp)
    writer.write(file=tokens_profile.view())    

    newsgroups_df = pd.DataFrame({"output_target": mixed_df["target"],
                            "output_prediction": predicted})
    # to map indices to label names
    newsgroups_df["output_target"] = newsgroups_df["output_target"].apply(lambda x: ref_labels[x])
    newsgroups_df["output_prediction"] = newsgroups_df["output_prediction"].apply(lambda x: ref_labels[x])

    
    print("Profiling and logging classification metrics...")
    classification_results = why.log_classification_metrics(
        newsgroups_df,
        target_column="output_target",
        prediction_column="output_prediction",
        log_full_data=True
    )
    classification_results.set_dataset_timestamp(dataset_timestamp) 
    writer.write(file=classification_results.view())

Here’s a high-level architecture of what we just did:


Figure 2 – Architecture of the integration in this post.

When we open our project in WhyLabs, we see that our profiles were successfully generated for each batch and submitted to the platform. We won’t cover every feature and output created by our resolver but will highlight three of them below.

Observe Introduced Drift in WhyLabs

You should now have access to a number of different features in your dashboard that represent the different aspects of the pipeline we monitored:

  • news_centroids: Relative distance of each document to the centroids of each reference topic cluster, and frequent items for the closest centroid for each document.
  • document_tokens: Distribution of tokens (term length, document length and frequent items) in each document.
  • output_prediction and output_target: The output (predictions and targets) of the classifier that will also be used to compute metrics on the “Performance” tab.

With the monitored information, we should be able to correlate the anomalies and reach a conclusion about what happened.

news_centroids.closest

In the chart below, we can see the distribution of the closest centroid for each document. For the first four days, the distributions are similar to one another. The language perturbation injected in the last three days skews the distribution towards the “forsale” topic.


Figure 3 – Visualization in WhyLabs for ‘news_centroids.closest’ input.

document_tokens.frequent_terms

Since we removed the English stopwords in our tokenization process but didn’t remove the Spanish ones, most of the frequent terms in the selected period are Spanish stopwords; those stopwords don’t appear in the first four days.


Figure 4 – Visualization in WhyLabs for ‘document_tokens.frequent_terms.’

Performance.F1

In the “Performance” tab, there is plenty of information telling us our performance is degrading. For example, the F1 chart below shows the model getting steadily worse starting from the fifth day.


Figure 5 – F1 performance metric visualization in WhyLabs.

For now, we’ll focus on how to use WhyLabs to monitor these signals and be notified in the future when our dataset changes in a way that impacts our model’s performance.

Navigate to the Monitor Manager and select the “Presets” tab.


Figure 6 – Preset drift monitors in WhyLabs.

Next, we’ll create a drift monitor on our discrete inputs using the “Configure” option on the “Data drift in model inputs” preset for “All discrete inputs.” Click through to modify the drift distance threshold under section 2 and leave everything else the same. Lastly, use the save button at the bottom to finish creating the monitor.


Figure 7 – Customization of preset monitor parameters in WhyLabs.

Now, we’ll preview our monitor on the “news_centroids.closest” feature to show the drift in its categorical distribution when we changed the language to Spanish, which caused the “forsale” cluster to become the closest centroid more consistently.


Figure 8 – Monitor failure preview in WhyLabs for ‘news_centroids.closest’ input.

We can see that WhyLabs identified the drift in closest clusters, which would have triggered an alert to our downstream notification endpoint. This can help us catch a sudden change like this quickly in the future.

Conclusion

Embarking on your journey with WhyLabs and Amazon SageMaker is simple. Take a look at the sample notebook that the example in this post is built from, and then head over to WhyLabs Observatory to create a free account and begin monitoring your SageMaker models.

You can also learn more about the WhyLabs AI Observability Platform in AWS Marketplace.



WhyLabs – AWS Partner Spotlight

WhyLabs is an AWS Partner and AI observability platform for machine learning model and data health.

Contact WhyLabs | Partner Overview | AWS Marketplace