
LLM Evaluation Agent

  • Prototyping
  • S3
  • Intermediate

This agent helps you evaluate LLMs, agents, and prompts through natural-language configuration, automated dataset generation, multi-judge scoring, and PDF reporting.

Created on May 14, 2026 by Andre Gomes


Agent Details

An LLM Evaluation Agent you describe your evaluation to in natural language: the expert AI agent handles dataset generation, judge configuration, execution, and analysis end-to-end, and hands you back a PDF report.

Features
Expert agent interface — The agent knows evaluation best practices, recommends criteria, and validates configurations before execution. No config files or CLI expertise needed.
Jury system — Multiple judges from different model families (e.g. Claude Sonnet, Nova Pro, Nemotron) each evaluate distinct aspects of every response, such as correctness, reasoning, and completeness. Combining diverse judge families reduces self-preference bias, and aggregating weak signals from diverse judges and criteria produces stronger results than any single judge (Verma et al., 2025; Frick et al., 2025).
Adaptable binary scoring — Binary pass/fail per criterion rather than subjective numeric scales, an approach shown to produce more reliable results across judges (Chiang et al., 2025). The agent tailors criteria to what you're evaluating (see the aggregation sketch after this list).
Document-grounded synthetic data — Upload PDFs, knowledge bases, or product docs and generate QA pairs grounded in your actual content, reflecting real customer scenarios.
Agentic eval support — Evaluate any agent calling Bedrock (Strands, LangChain, custom boto3) with zero code modification via OpenTelemetry instrumentation.
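To make the jury and binary-scoring ideas concrete, here is a minimal Python sketch (not the agent's internal code) of how per-criterion pass/fail verdicts from several judges could be aggregated. The judge names, criteria, and QA-pair fields are illustrative assumptions, not the tool's actual schema.

from collections import defaultdict

# One document-grounded QA pair plus the candidate answer being judged
# (field names are hypothetical).
sample = {
    "question": "What is the refund window for annual plans?",
    "reference_answer": "30 days from the purchase date.",
    "candidate_answer": "Annual plans can be refunded within 30 days of purchase.",
}

# Each judge returns a binary pass/fail verdict per criterion.
verdicts = {
    "claude-sonnet": {"correctness": True, "completeness": True, "reasoning": True},
    "nova-pro":      {"correctness": True, "completeness": False, "reasoning": True},
    "nemotron":      {"correctness": True, "completeness": True, "reasoning": False},
}

# Aggregate: pass rate per criterion across the jury, then a simple overall score.
per_criterion = defaultdict(list)
for judge, scores in verdicts.items():
    for criterion, passed in scores.items():
        per_criterion[criterion].append(passed)

pass_rates = {c: sum(v) / len(v) for c, v in per_criterion.items()}
overall = sum(pass_rates.values()) / len(pass_rates)

print(pass_rates)            # e.g. correctness 1.0, completeness and reasoning ~0.67
print(f"overall: {overall:.2f}")

Binary verdicts make this aggregation trivial: each criterion's score is just the fraction of judges that passed it, so disagreement between judge families is visible directly in the pass rates.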

Installation instructions

Prerequisites

  • AWS credentials with Bedrock model access
  • uv installed
  • Claude Code, Cursor, Kiro, VS Code, or any MCP-compatible IDE

Install

Pick your IDE, then paste the command or click the link.

Claude Code — one CLI command:

claude mcp add eval -s user -- uvx --from llm-evaluation-system eval-mcp

Cursor — one-click deeplink: Install eval-mcp in Cursor

Kiro — add to ~/.kiro/settings/mcp.json:

{ "mcpServers": { "eval": { "command": "uvx", "args": ["--from", "llm-evaluation-system", "eval-mcp"] } } }

Codex CLI — add to ~/.codex/config.toml, then restart Codex:

[mcp_servers.eval]
command = "uvx"
args = ["--from", "llm-evaluation-system", "eval-mcp"]

VS Code (with GitHub Copilot MCP) — one CLI command:

code --add-mcp '{"name":"eval","command":"uvx","args":["--from","llm-evaluation-system","eval-mcp"]}'

Using a coding agent to install? Point it at INSTALL.md — it handles the config edit and asks about optional S3 team sharing.

Upgrading

uvx caches the resolved version per package. To pull newer releases, invalidate the cache:

uv cache clean llm-evaluation-system

Restart your IDE afterwards; the next launch resolves and caches the newest published version.

Use

Ask your AI assistant to evaluate agents, models, or prompts — using a dataset you provide or one generated from your documents or context:

  • "Evaluate my agent at ./my_agent.py"
  • "Compare Claude Sonnet vs Nova Pro on this dataset"
  • "Test these three prompt templates against my golden QA set"
  • "Generate a dataset from this PDF and run an eval"

The agent picks the right mode, auto-generates whatever is missing (dataset, judges, criteria), runs the evaluation, opens the results viewer in your browser, and hands you a PDF report.
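For the first prompt above, "my agent" can be as plain as a script that calls Bedrock directly. The hypothetical ./my_agent.py below contains no evaluation-specific instrumentation, matching the zero-code-modification claim in Features; the model ID, function name, and prompt are illustrative assumptions, not requirements of the tool.

import boto3

client = boto3.client("bedrock-runtime")

def answer(question: str) -> str:
    """Send a single question to a Bedrock model and return the text reply."""
    response = client.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        messages=[{"role": "user", "content": [{"text": question}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

if __name__ == "__main__":
    print(answer("What is the refund window for annual plans?"))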