Amazon SageMaker Clarify

Evaluate models and explain model predictions

What is Amazon SageMaker Clarify?

Benefits of SageMaker Clarify

Automatically evaluate FMs for your generative AI use case with metrics such as accuracy, robustness, and toxicity to support your responsible AI initiative. For nuanced content or criteria that require sophisticated human judgment, you can use your own workforce or a managed workforce provided by AWS to review model responses.
Explain how input features contribute to your model predictions during model development and inference. Evaluate your FM during customization using automatic and human-based evaluations.
Generate easy-to-understand metrics, reports, and examples throughout the FM customization and MLOps workflow.
Detect potential bias and other risks, as prescribed by guidelines such as ISO 42001, during data preparation, model customization, and in your deployed models.

Evaluate foundation models

Evaluation wizard and reports

To launch an evaluation, select the model, task, and evaluation type, either human-based or automatic. Use the evaluation results to select the best model for your use case and to quantify the impact of your model customization techniques, such as prompt engineering, reinforcement learning from human feedback (RLHF), retrieval-augmented generation (RAG), and supervised fine-tuning (SFT). Evaluation reports summarize scores across multiple dimensions, allowing quick comparisons and decisions. More detailed reports provide examples of the highest- and lowest-scoring model outputs, helping you focus on where to optimize further.

Customization

Get started quickly with curated datasets, such as CrowS-Pairs, TriviaQA, and WikiText, and curated algorithms, such as BERTScore, ROUGE, and F1. You can also bring your own prompt datasets and scoring algorithms specific to your generative AI application. The automatic evaluation is also available as an open-source library on GitHub so that you can run it anywhere. Sample notebooks show you how to programmatically run evaluations for any FM, including models that are not hosted on AWS, and how to integrate FM evaluations with SageMaker MLOps and governance tools, such as SageMaker Pipelines, SageMaker Model Registry, and SageMaker Model Cards.
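As an illustration, a minimal sketch of a programmatic evaluation with the open-source library might look like the following. The dataset file, endpoint name, model ID, prompt template, and request/response templates are placeholders, and class or parameter names can differ between library versions, so consult the repository's examples before running it.

```python
# Sketch: programmatic Q&A accuracy evaluation with the open-source evaluation library.
# All names below (dataset, endpoint, model ID, templates) are placeholders.
from fmeval.data_loaders.data_config import DataConfig
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.eval_algorithms.qa_accuracy import QAAccuracy, QAAccuracyConfig
from fmeval.model_runners.sm_jumpstart_model_runner import JumpStartModelRunner

# Custom prompt dataset in JSON Lines format, one record per line:
# {"question": "...", "answer": "..."}
data_config = DataConfig(
    dataset_name="my_qa_dataset",
    dataset_uri="my_qa_dataset.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answer",
)

# Model runner that sends prompts to a SageMaker JumpStart endpoint.
# The request/response templates depend on the model's payload format.
model_runner = JumpStartModelRunner(
    endpoint_name="my-jumpstart-endpoint",
    model_id="huggingface-llm-falcon-7b-instruct-bf16",
    content_template='{"inputs": $prompt, "parameters": {"max_new_tokens": 64}}',
    output="[0].generated_text",
)

# Run the evaluation and save a detailed report locally.
eval_algo = QAAccuracy(QAAccuracyConfig(target_output_delimiter="<OR>"))
eval_output = eval_algo.evaluate(
    model=model_runner,
    dataset_config=data_config,
    prompt_template="Answer the question: $model_input",
    save=True,
)
print(eval_output)
```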

Human-based evaluations

Some evaluation criteria are nuanced or subjective and require human judgment to assess. In addition to automated, metrics-based evaluations, you can ask humans (either your own employees or an AWS-managed evaluation team) to evaluate model outputs on dimensions like helpfulness, tone, and adherence to brand voice. Human evaluators can also check for consistency with company-specific guidelines, nomenclature, and brand voice. Set up custom instructions to guide your evaluation team on how to evaluate prompts, for example by ranking responses or indicating thumbs up/down.

Model quality evaluations

Evaluate your FM to determine whether it provides high-quality responses for your specific generative AI task using automatic and/or human-based evaluations. Evaluate model accuracy with evaluation algorithms, such as BERTScore, ROUGE, and F1, tailored for specific generative AI tasks, such as summarization, question answering (Q&A), and classification. Check the semantic robustness of your FM output when the inputs are perturbed in meaning-preserving ways, such as Butter Fingers typos, random upper-casing, and whitespace addition/removal.
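To make the accuracy and robustness checks concrete, here is a simple, self-contained sketch of a token-overlap F1 score and a random upper-case perturbation of the kind used in semantic robustness testing. SageMaker Clarify's built-in algorithms compute these scores for you, so this is illustration only.

```python
import random
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def random_upper_case(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Meaning-preserving perturbation: randomly upper-case a fraction of characters."""
    rng = random.Random(seed)
    return "".join(c.upper() if rng.random() < rate else c for c in text)

# Robustness check: the score on the perturbed prompt should stay close to
# the score on the original prompt.
original_prompt = "who wrote the novel moby dick"
perturbed_prompt = random_upper_case(original_prompt)
print(perturbed_prompt)
print(token_f1("Herman Melville wrote it", "Herman Melville"))
```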

Model responsibility evaluations

Evaluate the risk that your FM encodes stereotypes along the categories of race/color, gender/gender identity, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status using automatic and/or human-based evaluations. You can also evaluate the risk of toxic content. These evaluations can be applied to any task that involves generation of content, including open-ended generation, summarization, and question answering.
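Conceptually, a CrowS-Pairs-style stereotyping check compares how often the model prefers a stereotyped sentence over its anti-stereotyped counterpart. The sketch below illustrates the idea with a caller-supplied sentence_score function, a hypothetical stand-in for a model log-likelihood; the actual Clarify scoring is more involved.

```python
from typing import Callable, List, Tuple

def prompt_stereotyping_rate(
    pairs: List[Tuple[str, str]],
    sentence_score: Callable[[str], float],
) -> float:
    """Fraction of CrowS-Pairs-style pairs for which the model assigns a higher
    score (e.g., log-likelihood) to the stereotyped sentence than to the
    anti-stereotyped one. A value near 0.5 suggests no systematic preference."""
    preferred = sum(
        1 for stereotyped, anti_stereotyped in pairs
        if sentence_score(stereotyped) > sentence_score(anti_stereotyped)
    )
    return preferred / len(pairs)

# Usage with a dummy scorer; in practice, sentence_score would call the FM
# to obtain a log-likelihood for each sentence.
dummy_pairs = [("sentence A", "sentence B"), ("sentence C", "sentence D")]
print(prompt_stereotyping_rate(dummy_pairs, sentence_score=len))
```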


Model predictions

Explain model predictions

SageMaker Clarify is integrated with SageMaker Experiments to provide scores detailing which features contributed the most to your model prediction on a particular input for tabular, natural language processing (NLP), and computer vision models. For tabular datasets, SageMaker Clarify can also output an aggregated feature importance chart which provides insights into the overall prediction process of the model. These details can help determine if a particular model input has more influence than expected on overall model behavior.
Screenshot of a feature importance graph for a trained model in SageMaker Experiments
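A sketch of running the same explainability analysis as a SageMaker Clarify processing job with the SageMaker Python SDK follows; the IAM role, S3 paths, model name, and column names are placeholders for a tabular use case.

```python
from sagemaker import Session
from sagemaker.clarify import (
    SageMakerClarifyProcessor, DataConfig, ModelConfig, SHAPConfig,
)

session = Session()
role = "arn:aws:iam::111122223333:role/MySageMakerRole"  # placeholder IAM role

clarify_processor = SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

# Tabular dataset in S3 and the trained model to explain (placeholder names).
data_config = DataConfig(
    s3_data_input_path="s3://my-bucket/clarify/train.csv",
    s3_output_path="s3://my-bucket/clarify/explainability-output",
    label="target",
    headers=["target", "age", "income", "tenure"],
    dataset_type="text/csv",
)
model_config = ModelConfig(
    model_name="my-trained-model",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
)

# Kernel SHAP configuration: a baseline record and the number of synthetic samples.
shap_config = SHAPConfig(
    baseline=[[35, 50000, 5]],
    num_samples=100,
    agg_method="mean_abs",
)

clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,
)
```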

Monitor your model for changes in behavior

Changes in live data can expose new behavior in your model. For example, a credit risk prediction model trained on data from one geographical region could change the importance it assigns to various features when applied to data from another region. SageMaker Clarify is integrated with SageMaker Model Monitor to notify you, using alerting systems such as CloudWatch, if the importance of input features shifts, causing model behavior to change.
Screenshot of feature importance monitoring in SageMaker Model Monitor
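Continuing the sketch above, and reusing its data_config, model_config, and shap_config, a feature attribution drift monitor could be attached to a live endpoint roughly as follows. The endpoint name and S3 path are placeholders, and argument names may vary across SDK versions, so treat this as an outline.

```python
from sagemaker import Session
from sagemaker.model_monitor import ModelExplainabilityMonitor, CronExpressionGenerator

session = Session()
role = "arn:aws:iam::111122223333:role/MySageMakerRole"  # placeholder IAM role

explainability_monitor = ModelExplainabilityMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

# Create a baseline from the configs defined in the explainability sketch above,
# then schedule recurring checks against the live endpoint.
explainability_monitor.suggest_baseline(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,
)
explainability_monitor.create_monitoring_schedule(
    endpoint_input="my-endpoint",  # placeholder endpoint name
    output_s3_uri="s3://my-bucket/clarify/attribution-drift",
    schedule_cron_expression=CronExpressionGenerator.daily(),
)
```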

Detect bias

Identify imbalances in data

SageMaker Clarify helps identify potential bias during data preparation without writing code. You specify input features, such as gender or age, and SageMaker Clarify runs an analysis job to detect potential bias in those features. SageMaker Clarify then provides a visual report with a description of the metrics and measurements of potential bias so that you can identify steps to remediate it. If imbalances are found, you can use SageMaker Data Wrangler to balance your data. SageMaker Data Wrangler offers three balancing operators, random undersampling, random oversampling, and SMOTE (Synthetic Minority Oversampling Technique), to rebalance imbalanced datasets.

Screenshot of bias metrics during data preparation in SageMaker Data Wrangler
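The same pre-training bias analysis can also be run programmatically as a SageMaker Clarify processing job. In the sketch below, the IAM role, S3 paths, column names, and facet values are placeholders.

```python
from sagemaker import Session
from sagemaker.clarify import SageMakerClarifyProcessor, DataConfig, BiasConfig

session = Session()
role = "arn:aws:iam::111122223333:role/MySageMakerRole"  # placeholder IAM role

clarify_processor = SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

# Dataset and the facet (sensitive feature) to analyze, e.g. gender.
data_config = DataConfig(
    s3_data_input_path="s3://my-bucket/clarify/train.csv",
    s3_output_path="s3://my-bucket/clarify/pre-training-bias",
    label="approved",
    headers=["approved", "gender", "age", "income"],
    dataset_type="text/csv",
)
bias_config = BiasConfig(
    label_values_or_threshold=[1],   # the favorable outcome
    facet_name="gender",
    facet_values_or_threshold=[0],   # the group to check for disadvantage
)

# Computes pre-training metrics such as class imbalance (CI) and
# difference in proportions of labels (DPL).
clarify_processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    methods="all",
)
```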

Check your trained model for bias

After you’ve trained your model, you can run a SageMaker Clarify bias analysis through Amazon SageMaker Experiments to check your model for potential bias, such as predictions that produce a negative result more frequently for one group than for another. You specify the input features with respect to which you would like to measure bias in the model outcomes, and SageMaker runs an analysis and provides you with a visual report that identifies the different types of bias for each feature. The AWS open-source method Fair Bayesian Optimization can help mitigate bias by tuning a model’s hyperparameters.

Screenshot of bias metrics in a trained model in SageMaker Experiments
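A sketch of the corresponding post-training bias analysis follows; it adds the trained model and a rule for turning model outputs into predicted labels, and all resource names are placeholders.

```python
from sagemaker import Session
from sagemaker.clarify import (
    SageMakerClarifyProcessor, DataConfig, BiasConfig,
    ModelConfig, ModelPredictedLabelConfig,
)

session = Session()
role = "arn:aws:iam::111122223333:role/MySageMakerRole"  # placeholder IAM role

clarify_processor = SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)

data_config = DataConfig(
    s3_data_input_path="s3://my-bucket/clarify/validation.csv",
    s3_output_path="s3://my-bucket/clarify/post-training-bias",
    label="approved",
    headers=["approved", "gender", "age", "income"],
    dataset_type="text/csv",
)
bias_config = BiasConfig(
    label_values_or_threshold=[1],
    facet_name="gender",
    facet_values_or_threshold=[0],
)
model_config = ModelConfig(
    model_name="my-trained-model",   # placeholder model name
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
)

# How to turn the model's probability output into a predicted label.
predictions_config = ModelPredictedLabelConfig(probability_threshold=0.5)

# Computes post-training metrics such as disparate impact (DI) and
# difference in positive proportions in predicted labels (DPPL).
clarify_processor.run_post_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predictions_config,
    methods="all",
)
```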

Monitor your deployed model for bias

Bias can be introduced or exacerbated in deployed ML models when the training data differs from the live data that the model sees during deployment. For example, the outputs of a model for predicting home prices can become biased if the mortgage rates used to train the model differ from current mortgage rates. SageMaker Clarify bias detection capabilities are integrated into Amazon SageMaker Model Monitor so that when SageMaker detects bias beyond a certain threshold, it automatically generates metrics that you can view in Amazon SageMaker Studio and through Amazon CloudWatch metrics and alarms.

Screenshot of bias monitoring in SageMaker Model Monitor
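To get notified on bias drift, you could attach a CloudWatch alarm to the metric emitted by the monitoring schedule. The namespace, metric, and dimension names below are assumptions for illustration; check the metrics actually published by your schedule before using them.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical namespace/metric/dimension names for illustration; look up the
# exact names emitted by your Model Monitor bias schedule in the CloudWatch console.
cloudwatch.put_metric_alarm(
    AlarmName="clarify-bias-drift-alarm",
    Namespace="aws/sagemaker/Endpoints/bias-metrics",
    MetricName="bias_drift",
    Dimensions=[
        {"Name": "Endpoint", "Value": "my-endpoint"},
        {"Name": "MonitoringSchedule", "Value": "my-bias-monitoring-schedule"},
    ],
    Statistic="Maximum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=0.1,
    ComparisonOperator="GreaterThanThreshold",
    AlarmDescription="Alert when the monitored bias metric exceeds the allowed drift threshold.",
)
```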
