
LLM Evaluation Agent

  • Prototyping
  • S3
  • Intermediate

This agent helps you evaluate LLMs, agents, and prompts through natural-language configuration, automated dataset generation, multi-judge scoring, and PDF reporting.

Created by Andre Gomes on May 14, 2026

By using these prompts, you agree to the disclaimer.

Agent details

An LLM Evaluation Agent: describe what you want to evaluate in natural language, and the expert AI agent handles dataset generation, judge configuration, execution, and analysis end to end, then hands you back a PDF report.

Features
Expert agent interface — The agent knows evaluation best practices, recommends criteria, and validates configurations before execution. No config files or CLI expertise needed.
Jury system — Multiple judges from different model families (e.g. Claude Sonnet, Nova Pro, Nemotron) each evaluate distinct aspects of every response — correctness, reasoning, completeness. Combining diverse judge families reduces self-preference bias, and aggregating weak signals from diverse judges and criteria produces stronger results than any single judge (Verma et al., 2025, Frick et al., 2025).
Adaptable binary scoring — Binary pass/fail per criterion rather than subjective numeric scales, shown to produce more reliable results across judges (Chiang et al., 2025). Criteria are tailored by the agent to what you're evaluating.
Document-grounded synthetic data — Upload PDFs, knowledge bases, or product docs and generate QA pairs grounded in your actual content, reflecting real customer scenarios.
Agentic eval support — Evaluate any agent calling Bedrock (Strands, LangChain, custom boto3) with zero code modification via OpenTelemetry instrumentation.
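The jury-plus-binary-scoring idea above can be sketched as simple vote aggregation: each judge emits a pass/fail verdict per criterion, and a criterion passes when a majority of judges agree. This is a toy illustration only, not eval-mcp's actual implementation; the judge names and data layout are assumptions.

```python
# Toy sketch of jury-style aggregation (illustrative, not eval-mcp's code):
# each judge returns a binary pass/fail per criterion, and a response
# passes a criterion when a majority of judges vote pass.
from collections import defaultdict

verdicts = {
    # (judge, criterion) -> passed?
    ("claude-sonnet", "correctness"): True,
    ("nova-pro", "correctness"): True,
    ("nemotron", "correctness"): False,
    ("claude-sonnet", "completeness"): True,
    ("nova-pro", "completeness"): False,
    ("nemotron", "completeness"): False,
}

def majority_by_criterion(verdicts):
    # Group votes per criterion, then apply a strict majority rule.
    votes = defaultdict(list)
    for (judge, criterion), passed in verdicts.items():
        votes[criterion].append(passed)
    return {c: sum(v) > len(v) / 2 for c, v in votes.items()}

print(majority_by_criterion(verdicts))
# → {'correctness': True, 'completeness': False}
```

Aggregating many such weak binary signals, rather than trusting one judge's numeric score, is the design choice the cited papers argue for.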

Installation instructions

Prerequisites

  • AWS credentials with Bedrock model access
  • uv installed
  • Claude Code, Cursor, Kiro, VS Code, or any MCP-compatible IDE

Install

Pick your IDE and paste or click.

Claude Code — one CLI command:

claude mcp add eval -s user -- uvx --from llm-evaluation-system eval-mcp

Cursor — one-click deeplink: Install eval-mcp in Cursor

Kiro — add to ~/.kiro/settings/mcp.json:

{
  "mcpServers": {
    "eval": {
      "command": "uvx",
      "args": ["--from", "llm-evaluation-system", "eval-mcp"]
    }
  }
}

Codex CLI — add to ~/.codex/config.toml, then restart Codex:

[mcp_servers.eval]
command = "uvx"
args = ["--from", "llm-evaluation-system", "eval-mcp"]

VS Code (with GitHub Copilot MCP) — one CLI command:

code --add-mcp '{"name":"eval","command":"uvx","args":["--from","llm-evaluation-system","eval-mcp"]}'

Using a coding agent to install? Point it at INSTALL.md — it handles the config edit and asks about optional S3 team sharing.

Upgrading

uvx caches the resolved version per package. To pull newer releases, invalidate the cache:

uv cache clean llm-evaluation-system

Restart your IDE afterward. The next launch resolves and caches the newest published version.

Use

Ask your AI assistant to evaluate agents, models, or prompts — using a dataset you provide or one generated from your documents or context:

  • "Evaluate my agent at ./my_agent.py"
  • "Compare Claude Sonnet vs Nova Pro on this dataset"
  • "Test these three prompt templates against my golden QA set"
  • "Generate a dataset from this PDF and run an eval"

The agent picks the right mode, auto-generates whatever's missing (dataset, judge, criteria), runs it, opens the results viewer in your browser, and hands you a PDF report.
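For the "golden QA set" and generated-dataset cases above, a dataset is essentially a list of question/expected-answer pairs. The sketch below shows one common way such data is serialized (JSONL, one record per line); the field names here are assumptions for illustration, not eval-mcp's actual schema.

```python
# Hypothetical golden QA-set rows serialized as JSONL (one JSON object
# per line). Field names are illustrative, not the tool's real schema.
import json

rows = [
    {"question": "What is the refund window?", "expected": "30 days"},
    {"question": "Does the product support SSO?", "expected": "Yes, via SAML"},
]

# Serialize: one compact JSON record per line.
jsonl = "\n".join(json.dumps(r) for r in rows)
print(jsonl)

# Deserialize: parse each line back into a dict.
parsed = [json.loads(line) for line in jsonl.splitlines()]
```

Grounding rows like these in your uploaded documents is what the synthetic-data feature automates.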