Amazon SageMaker Clarify

Evaluate models and explain model predictions

What is Amazon SageMaker Clarify?

Benefits of SageMaker Clarify

Automatically evaluate FMs for your generative AI use case with metrics such as accuracy, robustness, and toxicity to support your responsible AI initiative. For criteria or nuanced content that requires sophisticated human judgment, you can choose to leverage your own workforce or use a managed workforce provided by AWS to review model responses.
Explain how input features contribute to your model predictions during model development and inference. Evaluate your FM during customization using automatic and human-based evaluations.
Generate easy-to-understand metrics, reports, and examples throughout the FM customization and MLOps workflow.
Detect potential bias and other risks, as prescribed by guidelines such as ISO 42001, during data preparation, model customization, and in your deployed models (see the sketch below).
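
As a rough illustration of what a pre-training bias check can look like with the clarify module of the SageMaker Python SDK, the sketch below scans a hypothetical tabular training set for bias against a "gender" facet. The S3 paths, column names, instance settings, and favorable label value are placeholders, not prescribed values.

    import sagemaker
    from sagemaker import clarify

    session = sagemaker.Session()
    role = sagemaker.get_execution_role()

    # Processor that runs the Clarify analysis job (instance settings are illustrative)
    clarify_processor = clarify.SageMakerClarifyProcessor(
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        sagemaker_session=session,
    )

    # Where the training data lives and where the bias report should be written
    # (the S3 paths, label, and headers are placeholders)
    data_config = clarify.DataConfig(
        s3_data_input_path="s3://my-bucket/train/train.csv",
        s3_output_path="s3://my-bucket/clarify-bias-output/",
        label="approved",
        headers=["approved", "age", "income", "gender"],
        dataset_type="text/csv",
    )

    # Which outcome counts as favorable and which facet (sensitive attribute) to check
    bias_config = clarify.BiasConfig(
        label_values_or_threshold=[1],
        facet_name="gender",
    )

    # Compute pre-training bias metrics (for example, class imbalance) on the dataset itself
    clarify_processor.run_pre_training_bias(
        data_config=data_config,
        data_bias_config=bias_config,
        methods="all",
    )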

Evaluate foundation models

Evaluation wizard and reports

To launch an evaluation, select the model, the task, and the evaluation type (human-based or automatic). Use evaluation results to select the best model for your use case and to quantify the impact of your model customization techniques, such as prompt engineering, reinforcement learning from human feedback (RLHF), retrieval-augmented generation (RAG), and supervised fine-tuning (SFT). Evaluation reports summarize scores across multiple dimensions, allowing quick comparisons and decisions. More detailed reports provide examples of the highest- and lowest-scoring model outputs, helping you focus on where to optimize further.

Customization

Get started quickly with curated datasets, such as CrowS-Pairs, TriviaQA, and WikiText, and curated algorithms, such as BERTScore, ROUGE, and F1. You can also bring your own prompt datasets and scoring algorithms specific to your generative AI application. The automatic evaluation is also available as an open-source library (fmeval) on GitHub so you can run it anywhere. Sample notebooks show you how to programmatically run evaluations for any FM, including models that are not hosted on AWS, and how to integrate FM evaluations with SageMaker MLOps and governance tools, such as SageMaker Pipelines, SageMaker Model Registry, and SageMaker Model Cards.
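
For a flavor of the programmatic path, here is a minimal, illustrative snippet using the open-source fmeval library. It assumes the factual-knowledge evaluator and its evaluate_sample helper as documented for the library; the answer and response strings are made up.

    # pip install fmeval
    from fmeval.eval_algorithms.factual_knowledge import (
        FactualKnowledge,
        FactualKnowledgeConfig,
    )

    # Score a single model response against an expected answer; "<OR>" separates
    # acceptable alternative answers in the target output (values are illustrative)
    eval_algo = FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="<OR>"))
    scores = eval_algo.evaluate_sample(
        target_output="Paris<OR>the city of Paris",
        model_output="The capital of France is Paris.",
    )
    print(scores)  # a list of EvalScore records with metric names and values

For dataset-level runs, the same algorithms also expose an evaluate() method that accepts a model runner and a dataset configuration; see the library's documentation for the exact parameters.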

Human-based evaluations

Some evaluation criteria are nuanced or subjective and require human judgment to assess. In addition to automated, metrics-based evaluations, you can ask humans (either your own employees or an AWS-managed evaluation team) to evaluate model outputs on dimensions such as helpfulness, tone, and adherence to brand voice. Human evaluators can also check for consistency with company-specific guidelines and nomenclature. Set up custom instructions that tell your evaluation team how to assess prompts, for example by ranking responses or giving a thumbs up/down.

Model quality evaluations

Evaluate your FM to determine whether it provides high-quality responses for your specific generative AI task using automatic and/or human-based evaluations. Evaluate model accuracy with evaluation algorithms, such as BERTScore, ROUGE, and F1, tailored to specific generative AI tasks, such as summarization, question answering (Q&A), and classification. Check the semantic robustness of your FM output when the inputs receive semantics-preserving perturbations, such as ButterFingers (typos), random upper case, and whitespace add/remove.
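
As one illustrative example of the accuracy algorithms named above, the sketch below scores a single summary with the open-source fmeval library's summarization evaluator, which reports metrics such as ROUGE, METEOR, and BERTScore. The default configuration is assumed and the texts are invented.

    from fmeval.eval_algorithms.summarization_accuracy import SummarizationAccuracy

    # Compare a generated summary with a reference summary using the default settings
    eval_algo = SummarizationAccuracy()
    scores = eval_algo.evaluate_sample(
        target_output="The report projects modest revenue growth next quarter.",
        model_output="Revenue is expected to grow modestly in the coming quarter.",
    )
    for score in scores:  # e.g., ROUGE, METEOR, BERTScore
        print(score.name, score.value)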

Model responsibility evaluations

Evaluate the risk that your FM encodes stereotypes along the categories of race/color, gender/gender identity, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status using automatic and/or human-based evaluations. You can also evaluate the risk of toxic content. These evaluations can be applied to any task that involves content generation, including open-ended generation, summarization, and question answering.
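
As a small, hedged example, the snippet below scores a single generated response for toxicity with the open-source fmeval library; the default detector configuration is assumed and the sample text is made up.

    from fmeval.eval_algorithms.toxicity import Toxicity, ToxicityConfig

    # Score one generated response with the library's default toxicity detector
    eval_algo = Toxicity(ToxicityConfig())
    scores = eval_algo.evaluate_sample(
        model_output="Thanks for reaching out, I'm happy to help with your order."
    )
    for score in scores:
        print(score.name, score.value)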

Model predictions

Explain model predictions

SageMaker Clarify is integrated with SageMaker Experiments to provide scores detailing which features contributed the most to your model's prediction on a particular input for tabular, natural language processing (NLP), and computer vision models. For tabular datasets, SageMaker Clarify can also output an aggregated feature importance chart that provides insight into the model's overall prediction process. These details can help you determine whether a particular model input has more influence than expected on overall model behavior.
Screenshot of a feature importance graph for a trained model in SageMaker Experiments
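
As a rough sketch of how these attributions can be produced programmatically, the snippet below runs a SHAP-based explainability job with the clarify module of the SageMaker Python SDK. The S3 paths, column names, model name, baseline record, and instance settings are placeholders.

    import sagemaker
    from sagemaker import clarify

    session = sagemaker.Session()
    role = sagemaker.get_execution_role()

    processor = clarify.SageMakerClarifyProcessor(
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        sagemaker_session=session,
    )

    # Dataset to explain and where to write the report (placeholder paths and columns)
    data_config = clarify.DataConfig(
        s3_data_input_path="s3://my-bucket/validation/validation.csv",
        s3_output_path="s3://my-bucket/clarify-explainability/",
        label="approved",
        headers=["approved", "age", "income", "tenure"],
        dataset_type="text/csv",
    )

    # Deployed model that Clarify queries while computing attributions (placeholder name)
    model_config = clarify.ModelConfig(
        model_name="credit-risk-model",
        instance_type="ml.m5.xlarge",
        instance_count=1,
        accept_type="text/csv",
    )

    # SHAP settings: one baseline record, a sample budget, and an aggregation method
    shap_config = clarify.SHAPConfig(
        baseline=[[35, 50000, 24]],
        num_samples=100,
        agg_method="mean_abs",
    )

    # Produces per-record attributions plus the aggregated feature importance chart
    processor.run_explainability(
        data_config=data_config,
        model_config=model_config,
        explainability_config=shap_config,
    )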

Monitor your model for changes in behavior

Changes in live data can expose new model behavior. For example, a credit risk prediction model trained on data from one geographic region could change the importance it assigns to various features when applied to data from another region. SageMaker Clarify is integrated with SageMaker Model Monitor to notify you, through alerting systems such as Amazon CloudWatch, if the importance of input features shifts and model behavior changes as a result.
Screenshot of feature importance monitoring in SageMaker Model Monitor
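
For reference, a minimal sketch of scheduling attribution-drift monitoring with the SageMaker Python SDK might look like the following. The endpoint name, schedule name, and S3 path are placeholders, and the analysis configuration is assumed to come from an earlier suggest_baseline() run on the same monitor.

    import sagemaker
    from sagemaker.model_monitor import CronExpressionGenerator, ModelExplainabilityMonitor

    session = sagemaker.Session()
    role = sagemaker.get_execution_role()

    # Monitor that periodically recomputes feature attributions on captured endpoint traffic
    monitor = ModelExplainabilityMonitor(
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",
        sagemaker_session=session,
    )

    # Hourly schedule against a live endpoint; drift in feature importance surfaces as
    # violations and CloudWatch metrics you can alarm on
    monitor.create_monitoring_schedule(
        monitor_schedule_name="credit-risk-explainability-monitor",
        endpoint_input="credit-risk-endpoint",
        output_s3_uri="s3://my-bucket/model-monitor/explainability/",
        schedule_cron_expression=CronExpressionGenerator.hourly(),
    )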