What is Amazon SageMaker Clarify?
Benefits of SageMaker Clarify
Evaluate foundation models
Evaluation wizard and reports
To launch an evaluation, select the model, the task, and the evaluation type: human-based or automatic. Use evaluation results to select the best model for your use case and to quantify the impact of model customization techniques such as prompt engineering, reinforcement learning from human feedback (RLHF), retrieval-augmented generation (RAG), and supervised fine-tuning (SFT). Evaluation reports summarize scores across multiple dimensions, enabling quick comparisons and decisions. More detailed reports include examples of the highest- and lowest-scoring model outputs, so you can focus on where to optimize further.
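As a rough illustration of how scores can be compared across dimensions, the following Python sketch tabulates two hypothetical reports and picks a model. The report structure and score names are illustrative assumptions, not the format SageMaker Clarify actually emits.

```python
# Hypothetical sketch: comparing evaluation reports across models.
# The report structure below is an assumption for illustration, not
# the actual report format produced by SageMaker Clarify.
reports = {
    "model-a": {"accuracy": 0.81, "toxicity": 0.02, "robustness": 0.74},
    "model-b": {"accuracy": 0.77, "toxicity": 0.01, "robustness": 0.83},
}

dimensions = ["accuracy", "toxicity", "robustness"]
print(f"{'model':<10}" + "".join(f"{d:>12}" for d in dimensions))
for model, scores in reports.items():
    print(f"{model:<10}" + "".join(f"{scores[d]:>12.2f}" for d in dimensions))

# Pick the model with the highest accuracy, breaking ties on robustness.
best = max(reports, key=lambda m: (reports[m]["accuracy"], reports[m]["robustness"]))
print(f"best model for this use case: {best}")
```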
Customization
Get started quickly with curated datasets, such as CrowS-Pairs, TriviaQA, and WikiText, and curated algorithms, such as BERTScore, ROUGE, and F1. You can also bring your own prompt datasets and scoring algorithms specific to your generative AI application. The automatic evaluation is also available as an open-source library on GitHub, so you can run it anywhere. Sample notebooks show you how to programmatically run evaluations for any FM, including models that are not hosted on AWS, and how to integrate FM evaluations with SageMaker MLOps and governance tools, such as SageMaker Pipelines, SageMaker Model Registry, and SageMaker Model Cards.
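A custom scoring algorithm can be as simple as a token-level F1 function applied to your own prompt dataset. The sketch below is plain Python in the style of the classic SQuAD F1 metric; it is a stand-in for a scorer you might plug into the open-source evaluation library, not the library's own implementation, and the dataset is illustrative.

```python
# Minimal sketch of a custom scoring algorithm: token-level F1 for Q&A.
# A stand-in for a custom scorer, not the evaluation library's own code.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Score a custom prompt dataset (prediction/reference pairs).
dataset = [
    ("Paris is the capital of France", "Paris"),
    ("The answer is 42", "42"),
]
scores = [token_f1(pred, ref) for pred, ref in dataset]
print(f"mean F1: {sum(scores) / len(scores):.3f}")
```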
Human-based evaluations
Some evaluation criteria are nuanced or subjective and require human judgment to assess. In addition to automated, metrics-based evaluations, you can ask humans (either your own employees or an AWS-managed evaluation team) to evaluate model outputs on dimensions such as helpfulness, tone, and adherence to brand voice. Human evaluators can also check for consistency with company-specific guidelines and nomenclature. Set up custom instructions that tell your evaluation team how to evaluate prompts, for example by ranking outputs or indicating thumbs up/down.
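To picture the thumbs up/down approach, the following sketch aggregates evaluator votes into per-model approval rates. The vote record format is an assumption for illustration, not the actual output schema of a human evaluation job.

```python
# Hypothetical sketch: aggregating thumbs up/down votes from a human
# evaluation team into per-model approval rates. The record format is
# illustrative, not the actual human-evaluation output schema.
from collections import defaultdict

votes = [
    {"model": "model-a", "prompt_id": 1, "thumbs_up": True},
    {"model": "model-a", "prompt_id": 2, "thumbs_up": False},
    {"model": "model-b", "prompt_id": 1, "thumbs_up": True},
    {"model": "model-b", "prompt_id": 2, "thumbs_up": True},
]

tally = defaultdict(lambda: [0, 0])  # model -> [ups, total]
for v in votes:
    tally[v["model"]][0] += v["thumbs_up"]
    tally[v["model"]][1] += 1

for model, (ups, total) in tally.items():
    print(f"{model}: {ups}/{total} approved ({ups / total:.0%})")
```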
Model quality evaluations
Evaluate your FM to determine whether it provides high-quality responses for your specific generative AI task, using automatic and/or human-based evaluations. Evaluate model accuracy with evaluation algorithms, such as BERTScore, ROUGE, and F1, tailored to specific generative AI tasks, such as summarization, question answering (Q&A), and classification. Check the semantic robustness of your FM's output when the inputs are modified with semantics-preserving perturbations, such as butter-finger typos, random upper-casing, and whitespace addition or removal.
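The three perturbation styles named above can be pictured with the short sketches below. The open-source evaluation library ships its own implementations, so treat these as illustrations of the idea; the keyboard-neighbor map is intentionally truncated.

```python
# Illustrative implementations of semantics-preserving perturbations;
# sketches of the idea, not the evaluation library's own code.
import random

KEYBOARD_NEIGHBORS = {"a": "qwsz", "e": "wsdr", "o": "iklp", "t": "rfgy"}  # truncated

def butter_finger(text: str, rate: float = 0.1, seed: int = 0) -> str:
    # Replace some characters with a neighboring key, simulating typos.
    rng = random.Random(seed)
    return "".join(
        rng.choice(KEYBOARD_NEIGHBORS[c.lower()])
        if c.lower() in KEYBOARD_NEIGHBORS and rng.random() < rate else c
        for c in text
    )

def random_upper(text: str, rate: float = 0.1, seed: int = 0) -> str:
    # Upper-case a random subset of characters.
    rng = random.Random(seed)
    return "".join(c.upper() if rng.random() < rate else c for c in text)

def whitespace_add_remove(text: str, rate: float = 0.05, seed: int = 0) -> str:
    # Randomly drop existing spaces and insert spurious ones.
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch == " " and rng.random() < rate:
            continue  # remove a space
        out.append(ch)
        if ch != " " and rng.random() < rate:
            out.append(" ")  # add a spurious space
    return "".join(out)

prompt = "What is the tallest mountain on earth?"
for perturb in (butter_finger, random_upper, whitespace_add_remove):
    print(perturb.__name__, "->", perturb(prompt))
```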
Model responsibility evaluations
Evaluate the risk that your FM encodes stereotypes across the categories of race/color, gender/gender identity, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status, using automatic and/or human-based evaluations. You can also evaluate the risk of toxic content. These evaluations can be applied to any task that involves generating content, including open-ended generation, summarization, and question answering.
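One common automatic approach, in the style of the CrowS-Pairs dataset, compares how a model scores a stereotypical sentence against its anti-stereotypical counterpart. In the sketch below, sentence_log_likelihood is a toy stand-in for your model's scoring call, and the sentence pair is illustrative.

```python
# Sketch of a CrowS-Pairs-style stereotype check. A fair model should
# prefer the stereotypical variant about half the time.
def sentence_log_likelihood(sentence: str) -> float:
    # Toy stand-in: in practice, query your model for the sentence's
    # log-likelihood. Faked here so the sketch runs end to end.
    return -float(len(sentence))

pairs = [
    ("The engineer fixed the bug; he was brilliant.",
     "The engineer fixed the bug; she was brilliant."),
]

def stereotype_rate(pairs) -> float:
    # Fraction of pairs where the model prefers the stereotypical
    # variant; an unbiased model would sit near 0.5.
    preferred = sum(
        sentence_log_likelihood(stereo) > sentence_log_likelihood(anti)
        for stereo, anti in pairs
    )
    return preferred / len(pairs)

print(f"stereotype preference rate: {stereotype_rate(pairs):.2f}")
```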
Model predictions
Explain model predictions
SageMaker Clarify is integrated with SageMaker Experiments to provide scores detailing which features contributed the most to your model's prediction on a particular input, for tabular, natural language processing (NLP), and computer vision models. For tabular datasets, SageMaker Clarify can also output an aggregated feature importance chart that provides insight into the model's overall prediction process. These details can help you determine whether a particular model input has more influence than expected on overall model behavior.
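Clarify's feature attributions are based on Shapley values. As a local illustration of the underlying idea (not the Clarify integration itself), the following sketch computes per-instance and aggregated Kernel SHAP attributions for a toy tabular model using the open-source shap package.

```python
# Local illustration of the underlying technique (Kernel SHAP) with the
# open-source `shap` package on a toy tabular model; this is not the
# SageMaker Clarify integration itself.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
model = RandomForestClassifier(random_state=0).fit(X, y)

background = shap.sample(X, 50)  # background dataset for the explainer
explainer = shap.KernelExplainer(lambda d: model.predict_proba(d)[:, 1], background)

# Per-instance attributions: which features drove this one prediction?
local_attributions = explainer.shap_values(X[:1])

# Aggregated importances: mean |SHAP value| over a sample of inputs,
# analogous to the aggregated feature importance chart described above.
global_attributions = np.abs(explainer.shap_values(X[:25])).mean(axis=0)
print("local:", np.round(local_attributions, 3))
print("global:", np.round(global_attributions, 3))
```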
Monitor your model for changes in behavior
Changes in live data can expose new model behavior. For example, a credit risk prediction model trained on data from one geographical region could assign different importance to various features when applied to data from another region. SageMaker Clarify is integrated with SageMaker Model Monitor to notify you, through alerting systems such as Amazon CloudWatch, when the importance of input features shifts and model behavior changes as a result.
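As a rough sketch of the kind of signal involved, the code below compares baseline and live aggregated feature importances and publishes a drift score as a custom CloudWatch metric that an alarm could watch. The drift measure (1 minus cosine similarity) and the metric namespace are illustrative assumptions; the managed Clarify and Model Monitor integration computes its own drift statistics.

```python
# Hedged sketch: compare baseline vs live aggregated feature importances
# and push a drift metric to CloudWatch. The drift score and namespace
# are illustrative, not what the managed integration computes.
import boto3
import numpy as np

baseline = np.array([0.42, 0.31, 0.18, 0.09])  # importances at training time
live = np.array([0.18, 0.40, 0.30, 0.12])      # importances on recent traffic

cosine = float(np.dot(baseline, live) /
               (np.linalg.norm(baseline) * np.linalg.norm(live)))
drift = 1.0 - cosine

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_data(
    Namespace="Custom/FeatureAttributionDrift",  # hypothetical namespace
    MetricData=[{"MetricName": "drift", "Value": drift, "Unit": "None"}],
)
print(f"drift score pushed to CloudWatch: {drift:.3f}")
```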