Agent evaluation & decision engines
Your AI agents are running — but how do you know they’re performing as you expect? Learn how to measure agent performance with Amazon Bedrock AgentCore Evaluations, diagnose failures, and apply the decision engines and guardrails that shape how agents behave.
Overview
Technical workshop: Agent evaluation & decision engines
Join AWS experts for a hands-on workshop using Amazon Bedrock AgentCore Evaluations alongside evaluation frameworks and MLOps tools from LangChain, Vellum, LangFuse, Weights & Biases, Comet ML, Patronus AI, and Deepchecks. Measure what matters. Ship agents that work.
Agent evaluation & decision engines
Learn how to measure agent performance using Amazon Bedrock AgentCore Evaluations, understand the decision engines that shape agent behavior, and apply guardrails that keep agents safe in production.
What you’ll learn:
- End-to-end agent evaluation using Amazon Bedrock AgentCore Evaluations — managed, continuous scoring of agent behavior using built-in and custom evaluators, applied to a DevOps Companion agent walkthrough
- A three-layer evaluation framework covering task-level correctness, trajectory quality, and system-level health — with automated techniques like LLM-as-a-judge, rubric-based scoring, and tracing-driven observability
- Decision engines: the think-act-observe loop, chain-of-thought reasoning, and tool selection patterns that shape how agents behave
- Guardrails for confidence thresholds, uncertainty handling, and fallback behaviors — extended with partner tools from LangFuse, Patronus AI, Weights & Biases, and more
Featured AI tools in AWS Marketplace
Extend Amazon Bedrock AgentCore Evaluations with evaluation frameworks and MLOps platforms — all available through your AWS account.
Page topics
- The unique paradigm of agent evaluation
- Anatomy of an agent evaluation framework
- Defining what "good" looks like — Metrics and evaluation criteria
- Building and managing evaluation datasets
- Automated evaluation techniques
- Tracing and observability as evaluation infrastructure
- Decision engines — How agents choose what to do
- Confidence, uncertainty, and guardrails
- Evaluation in the development lifecycle — From prototyping to production
- Evaluating a real agent end-to-end
- Conclusion and looking ahead
The unique paradigm of agent evaluation
Every discipline eventually develops its own quality assurance practices, and agentic AI is no exception. But before teams can adopt good evaluation practices, they need to understand why the practices from related disciplines — software testing, ML model evaluation, and even traditional QA — are not quite applicable when used with agents.
The limits of traditional testing
In classical software development, testing is deterministic. A given input produces a given output, and testing verifies that the two match. Unit tests, integration tests, and end-to-end tests all operate on this assumption. When a function deviates from the expected output, it is a failure.
Agents break this model in three fundamental ways:
- They are probabilistic. Even given identical inputs, an agent backed by a large language model (LLM) may produce different outputs across runs. Temperature settings, sampling parameters, and subtle variations in context can all shift the response.
- They are path-dependent. The quality of the final output depends not just on what the agent produced, but on how it got there — which tools it called, in what order, with what parameters, and how it handled the results. An agent that arrives at a correct answer through a flawed reasoning chain is still a fragile agent.
- They operate at the intersection of language and action, where quality is often subjective or domain-specific. Judging whether a response is "correct" often requires a human with domain expertise or business context that is difficult to encode in a simple assertion.
The shift to continuous, multi-dimensional assessment
The right mental model for agent evaluation is not a test suite — it is a measurement program. Instead of asking "does this pass?", teams should be asking a series of layered questions about quality, behavior, and system health.
At the output level: Is the agent's response accurate, complete, and appropriately formatted? Does it faithfully follow instructions? Does it avoid hallucination and stay grounded in the available data?
At the trajectory level: Did the agent take the right steps to arrive at the output? Were tool calls appropriate and well-formed? Did the agent handle ambiguity, errors, and unexpected results gracefully?
At the system level: Is the agent completing tasks within acceptable latency bounds? Are token costs in line with expectations? Is the failure rate stable or trending in the wrong direction?
These three levels — output quality, trajectory quality, and system health — form the foundation of a complete evaluation framework, which we will build out in the next section.
Why this matters early
One of the most common mistakes teams make is treating evaluation as a late-stage concern — something to address after the agent is "done." This almost always leads to pain. Without evaluation instrumentation in place from the start, teams lack the baselines needed to detect regressions. Without clear quality criteria, determining whether a prompt change or model update was beneficial becomes guesswork. Without systematic measurement, there is no reliable way to demonstrate improvement — or to know when the agent is ready for production.
Evaluation is most powerful when it is built into the development process from the beginning. We will look at evaluation during development later in this guide; for now, the key takeaway is this: agent evaluation is not an afterthought — it is an ongoing discipline, and it starts on day one.
Anatomy of an agent evaluation framework
A complete agent evaluation framework is not a single tool or a single test. It is a layered system that measures agent behavior at multiple levels of granularity, from individual outputs to overall system performance. Understanding this structure helps teams decide where to invest their evaluation effort and how to interpret what they find.
Layer 1: Task-level correctness
The most immediate layer of evaluation asks: did the agent accomplish the task? This sounds simple but defining what "accomplished" means requires care. For some tasks — generating a structured JSON response, extracting a specific piece of information from a document, or answering a factual question — correctness can be defined precisely and checked automatically. For others — summarizing a policy document, drafting a customer email, or recommending a course of action — correctness is a matter of degree, and human judgment or proxy evaluation methods are required.
Task-level correctness evaluation typically uses one or more of the following approaches:
- Exact match: the output matches a reference answer.
- Fuzzy match: the output is semantically similar to a reference.
- Rubric-based scoring: the output is rated against a defined set of criteria.
- Reference-free evaluation: the output is judged on its own merits without a reference, often using an LLM as evaluator.
Each approach has trade-offs in terms of cost, reliability, and applicability, which will be explored in detail as we look into evaluation techniques.
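To make the first two approaches concrete, here is a minimal sketch of exact and fuzzy matching against a reference answer. The similarity threshold and the use of Python's difflib as a lexical stand-in for semantic similarity are illustrative assumptions; embedding- or LLM-based scoring, covered later in this module, is usually more robust.

```python
# Minimal sketch of task-level correctness checks. The 0.8 threshold and the
# difflib-based similarity are illustrative assumptions, not production settings.
from difflib import SequenceMatcher

def exact_match(output: str, reference: str) -> bool:
    # Strictest check: normalized string equality against the reference answer.
    return output.strip().lower() == reference.strip().lower()

def fuzzy_match(output: str, reference: str, threshold: float = 0.8) -> bool:
    # Cheap lexical proxy for "semantically similar"; embedding-based scoring
    # (shown later in this module) is a more robust alternative.
    return SequenceMatcher(None, output.lower(), reference.lower()).ratio() >= threshold

candidate = "The deployment failed because the IAM role lacks eks:DescribeCluster."
reference = "Deployment failed: the IAM role is missing the eks:DescribeCluster permission."
print(exact_match(candidate, reference), fuzzy_match(candidate, reference))
```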
Layer 2: Trajectory quality
Where task-level correctness tells you whether the agent got the right answer, trajectory quality tells you whether it got there in the right and most efficient way. This distinction matters enormously in production environments, where an agent that consistently arrives at correct outputs through fragile or inefficient reasoning chains is a disaster waiting to happen.
Trajectory evaluation examines the full execution path: the sequence of thoughts, tool calls, and observations that the agent produced.
Key questions include:
- Did the agent call the right tools?
- Did it provide correctly formatted and semantically appropriate arguments to those tools?
- Did it correctly interpret and integrate tool outputs into its reasoning?
- Did it take unnecessary steps, or skip steps it should have taken?
- Did it handle errors and unexpected outputs gracefully?
Capturing trajectory data requires instrumentation at the agent framework level — every tool call, its inputs and outputs, and the model's reasoning at each step must be logged and made available for analysis: this is the domain of observability.
Layer 3: System-level health
The third layer steps back from individual task executions and looks at aggregate system behavior over time. This is where latency, cost, and reliability metrics live. An agent that produces excellent output but takes 45 seconds per task and consumes thousands of tokens per request may be technically impressive but operationally undeployable.
System-level metrics to track include:
- Median and p95 task completion latency
- Total token consumption per task (broken down by input and output)
- Task success rate (the fraction of tasks that complete without errors or timeouts)
- Error and fallback rates
- Cost per task
These metrics should be tracked over time and tied to specific versions of the agent, enabling teams to detect regressions when prompts, tools, or models change.
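As a sketch of how these measurements might be captured, the snippet below publishes per-task metrics to Amazon CloudWatch as custom metrics, dimensioned by agent version so regressions can be tied to specific releases. The namespace, metric names, and dimension names are assumptions, not a prescribed schema.

```python
# Hedged sketch: publishing per-task system-health metrics to CloudWatch so they
# can be tracked per agent version. Names below are illustrative assumptions.
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_task_metrics(agent_version: str, latency_ms: float,
                         input_tokens: int, output_tokens: int, succeeded: bool) -> None:
    dimensions = [{"Name": "AgentVersion", "Value": agent_version}]
    cloudwatch.put_metric_data(
        Namespace="AgentEvaluation/SystemHealth",   # assumed namespace
        MetricData=[
            {"MetricName": "TaskLatency", "Value": latency_ms,
             "Unit": "Milliseconds", "Dimensions": dimensions},
            {"MetricName": "TokensConsumed", "Value": float(input_tokens + output_tokens),
             "Unit": "Count", "Dimensions": dimensions},
            {"MetricName": "TaskSuccess", "Value": 1.0 if succeeded else 0.0,
             "Unit": "Count", "Dimensions": dimensions},
        ],
    )

publish_task_metrics("v1.4.0", latency_ms=8200, input_tokens=3100,
                     output_tokens=650, succeeded=True)
```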
Together, these three layers — task-level correctness, trajectory quality, and system-level health — form the complete evaluation picture. Most teams start with the first layer, add the second as their systems mature, and eventually build dashboards that integrate all three. The goal is a measurement program that gives genuine confidence in agent behavior, not just a set of tests that agents can pass.
For teams building on AWS, Amazon Bedrock AgentCore Evaluations provides a fully managed service that addresses all three layers directly. It integrates with popular agent frameworks — including Strands Agents SDK and LangGraph — through OpenTelemetry and OpenInference instrumentation, converting agent traces into a unified format and scoring them automatically using LLM-as-a-Judge techniques.
Rather than requiring teams to assemble their own evaluation pipeline from scratch, AgentCore Evaluations provides the infrastructure for both development-time assessment and continuous production monitoring out of the box. The sections that follow explain the concepts behind each evaluation layer in depth; wherever a concept maps to a specific AgentCore Evaluations capability, that mapping is called out explicitly.
Defining what "good" looks like — Metrics and evaluation criteria
Metrics are only as useful as the criteria they measure. Before choosing tools or writing evaluation scripts, teams need to invest time in defining what good agent behavior actually looks like for their specific use case. This section covers the most commonly used evaluation metrics and how to anchor them to real business outcomes rather than abstract benchmarks.
Output quality metrics
Output quality metrics assess the content of what the agent produces. The right set of metrics depends on the task type, but the following are widely applicable across use cases.
Faithfulness measures whether the agent's response is grounded in the information available to it — retrieved documents, tool outputs, or provided context — rather than fabricated from model weights. A faithful agent does not invent facts. Faithfulness is especially critical in enterprise settings where agents handle sensitive or regulated information.
Relevance measures whether the response addresses what was asked. An agent can produce accurate, well-written output that completely misses the point of the user's request. Relevance scoring evaluates alignment between the input intent and the output content.
Completeness measures whether the response covers all required aspects of the task. A partial answer to a multi-part question may score well on faithfulness and relevance but fail on completeness.
Format compliance measures whether the output adheres to any required structure or schema. For agents producing structured outputs — JSON, XML, markdown tables, or domain-specific formats — format compliance is often a prerequisite for downstream processing.
Platforms like Vellum, available in AWS Marketplace, provide built-in metric libraries and prompt comparison workflows that make it straightforward to evaluate these dimensions systematically. LangSmith, also available in AWS Marketplace, integrates natively with LangChain-based agents and provides output-level scoring alongside trace visualization.
Amazon Bedrock AgentCore Evaluations covers these same output quality dimensions through its built-in evaluator library: the Builtin.Correctness evaluator measures factual accuracy; Builtin.Faithfulness checks whether the response is grounded in the available context rather than fabricated; and Builtin.Helpfulness scores, from the user’s perspective, how useful and actionable the response is. These evaluators require no configuration — they are pre-built with optimized prompt templates and standardized scoring criteria and can be applied immediately to any agent producing observable traces.
Trajectory and tool use metrics
Trajectory metrics assess the quality of the agent's reasoning and action path, not just its final output.
Tool call accuracy measures whether the agent invoked the right tools for the task. An agent that uses a web search tool to answer a question that should have been answered from a retrieved document is making a poor decision, even if the search returns a usable result.
Argument quality measures whether tool calls were made with well-formed and semantically appropriate arguments. A tool called with a malformed query or an incorrect parameter is a trajectory failure, regardless of whether the tool returns something usable.
Step efficiency measures whether the agent completed the task in a reasonable number of steps. An agent that takes 12 tool calls to answer a question that should require 3 is exhibiting inefficient reasoning, which affects both latency and cost.
Error handling quality measures how gracefully the agent responds to tool failures, unexpected outputs, or ambiguous situations. A well-designed agent does not spiral into loops or produce incoherent output when something goes wrong — it degrades gracefully and escalates appropriately.
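The sketch below illustrates how trajectory metrics of this kind might be computed against an annotated reference, assuming a simplified trace structure (a list of tool-call records). The field names and scoring formulas are illustrative and not taken from any particular framework.

```python
# Illustrative trajectory scoring against an annotated reference. The trace
# structure and formulas are assumptions for demonstration purposes only.
def score_trajectory(tool_calls: list[dict], expected_tools: list[str],
                     max_reasonable_steps: int) -> dict:
    called = [c["tool_name"] for c in tool_calls]
    # Tool call accuracy: fraction of expected tools that were actually invoked.
    hit = sum(1 for t in expected_tools if t in called)
    accuracy = hit / len(expected_tools) if expected_tools else 1.0
    # Step efficiency: penalize traces that take far more steps than expected.
    efficiency = min(1.0, max_reasonable_steps / max(len(called), 1))
    # Argument quality proxy: fraction of calls whose arguments parsed cleanly.
    well_formed = sum(1 for c in tool_calls if c.get("arguments_valid", True)) / max(len(tool_calls), 1)
    return {"tool_call_accuracy": accuracy, "step_efficiency": efficiency,
            "argument_quality": well_formed}

trace = [{"tool_name": "git_clone", "arguments_valid": True},
         {"tool_name": "web_search", "arguments_valid": True},
         {"tool_name": "cdk_synth", "arguments_valid": False}]
print(score_trajectory(trace, expected_tools=["git_clone", "cdk_synth"], max_reasonable_steps=3))
```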
LangFuse, available in AWS Marketplace, is particularly well suited for trajectory-level evaluation, providing step-by-step trace inspection, scoring at the span level, and the ability to flag individual tool calls for review. Weights & Biases and Comet ML, both available in AWS Marketplace, offer experiment tracking capabilities that are valuable when comparing trajectory quality across agent versions or prompt variations.
Amazon Bedrock AgentCore Evaluations provides dedicated built-in evaluators for trajectory quality as well. Builtin.ToolSelectionAccuracy operates at the TOOL_CALL level, assessing whether the agent chose the right tool for the current step. A companion evaluator measures tool parameter accuracy, checking whether the arguments passed to each tool were well-formed and semantically appropriate. These evaluators produce scores at the individual tool invocation level, making it straightforward to pinpoint which tool calls in a long execution trace are contributing most to quality problems.
Tying metrics to business outcomes
The most important rule in metric design is this: every metric should be traceable to a business outcome. Abstract metrics like "ROUGE score" or "cosine similarity" are useful for researchers but often fail to answer the question that matters to stakeholders: is this agent making our business better?
Translating between technical metrics and business outcomes requires collaboration between the engineering team and the domain experts who understand the business context. For a customer support agent, the relevant outcome might be first-contact resolution rate or customer satisfaction score. For a document processing agent, it might be extraction accuracy and exception rate. For the DevOps Companion used as the reference implementation throughout this series, the relevant outcomes are deployment success rate, infrastructure correctness, and observability coverage — metrics that map directly to engineering productivity and system reliability.
Starting with the business outcome and working backward to the metrics is almost always more effective than starting with metrics and hoping they correlate with outcomes. This approach also makes it much easier to communicate evaluation results to non-technical stakeholders.
AgentCore Evaluations includes Builtin.GoalSuccessRate, a session-level evaluator that measures whether the agent ultimately achieved the user’s goal across the full interaction — not just whether individual responses were high quality. This is the built-in evaluator that sits closest to business outcomes, and it complements the trace-level evaluators (Correctness, Faithfulness, Helpfulness) and tool_call-level evaluators (Tool Selection Accuracy) to give teams a full picture from individual steps through to overall task completion.
Building and managing evaluation datasets
A metric without data to measure it against is just a formula. Evaluation datasets — curated collections of inputs, expected outputs, and scoring criteria — are the raw material of any serious evaluation program. Building and maintaining them well is one of the highest-leverage investments a team can make.
Types of evaluation datasets
Most mature agent evaluation programs rely on several complementary dataset types, each serving a different purpose.
Golden datasets contain carefully curated examples of high-quality agent inputs and their expected outputs, typically annotated by domain experts. These are the ground truths against which agent behavior is measured. Golden datasets are expensive to create and must be maintained as the task scope evolves, but they provide the most reliable signal about output quality.
Regression datasets capture cases where the agent previously failed or behaved unexpectedly. As issues are identified and resolved, the corresponding inputs are added to the regression dataset, ensuring that fixes are verified and that previously resolved failures do not reappear in future versions.
Adversarial datasets contain deliberately challenging inputs designed to probe the agent's failure modes — ambiguous instructions, edge cases, missing context, contradictory information, or inputs designed to elicit hallucination. Adversarial testing is essential for understanding the boundaries of agent reliability before those boundaries are discovered in production.
Production-derived datasets are built from real agent interactions, sampled and annotated from production traffic. These datasets are invaluable because they reflect actual usage patterns rather than hypothetical scenarios designed in advance. Sampling strategies should be designed to capture a representative mix of common cases, rare cases, and edge cases.
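For illustration, the snippet below sketches what individual golden and regression records might look like when stored as JSON Lines. All field names and identifiers are hypothetical; real schemas should reflect your task types and scoring methods.

```python
# Hedged sketch of evaluation dataset records serialized as JSON Lines.
# Field names, IDs, and rubric references are illustrative assumptions.
import json

golden_example = {
    "dataset": "golden", "task_type": "infra_analysis",
    "input": "Which services in this repo require a relational database?",
    "expected_output": "orders-service and billing-service (both declare PostgreSQL clients).",
    "scoring": {"method": "rubric", "rubric_id": "infra-correctness-v2"},  # hypothetical rubric ID
}
regression_example = {
    "dataset": "regression", "source_trace_id": "tr-20240311-0042",  # hypothetical trace ID
    "input": "Generate alarms for a service that emits no structured logs.",
    "failure_observed": "Agent hallucinated a log group name.",
}
with open("eval_dataset.jsonl", "a") as f:
    for record in (golden_example, regression_example):
        f.write(json.dumps(record) + "\n")
```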
Dataset versioning and maintenance
Evaluation datasets are living artifacts. As the agent's task scope evolves, new tools are added, or the underlying models change, datasets must be updated to remain relevant. Teams should treat evaluation datasets with the same versioning practices applied to code — changes should be tracked, reviewed, and linked to specific agent versions.
A common error teams make is allowing datasets to drift out of alignment with the current agent behavior. If the agent's task scope has expanded since the dataset was created, evaluations will only measure a subset of the agent's behavior, creating blind spots. Periodic reviews of dataset coverage — asking whether the dataset adequately represents current production traffic — help avoid this.
Vellum provides native dataset management and versioning capabilities, making it straightforward to maintain multiple dataset versions and link evaluation runs to specific dataset snapshots. LangSmith's dataset management features support similar workflows within the LangChain ecosystem. Teams building on AWS can also use Amazon S3 with versioning enabled as a durable, cost-effective store for evaluation data, combined with metadata management in Amazon DynamoDB for querying and lineage tracking.
The cold start problem
New agent deployments face a bootstrapping challenge: without production traffic, there is no production-derived dataset. Without a golden dataset, there is no reliable ground truth. Without an adversarial dataset, failure modes are unknown.
The practical approach to the cold start problem involves several steps. Start by defining the most important task types and manually creating 20 to 50 golden examples per type — enough to provide meaningful signal without requiring a massive upfront investment. Identify the most obvious failure modes based on domain knowledge and create a small adversarial set to test against them. As the agent handles early production traffic, sample those interactions, prioritize annotation of interesting or ambiguous cases, and continue building out from there.
The goal is not a perfect dataset on day one — it is a dataset that improves in quality and coverage over time, in step with the agent itself.
Automated evaluation techniques
Manual evaluation by domain experts provides the highest quality signal, but it is slow and expensive. Automated evaluation makes continuous measurement feasible — and for agent systems that handle hundreds or thousands of tasks per day, it is the only viable path. This section covers the most effective automated evaluation techniques and their respective trade-offs.
LLM-as-a-judge
One of the most significant recent developments in agent evaluation is the use of LLMs as automated evaluators. The core idea is straightforward: rather than requiring a human to score an agent's output against a rubric, a capable LLM is provided with the rubric, the input, and the agent's output, and asked to produce a score and an explanation.
LLM-as-a-judge works surprisingly well for many evaluation tasks, particularly those involving natural language quality dimensions like coherence, completeness, and faithfulness. It scales cheaply, produces human-readable explanations alongside scores, and can be configured to evaluate multiple dimensions in a single call. On AWS, foundation models accessible through Amazon Bedrock can serve as the evaluator model, providing a cost-effective and scalable solution.
The technique has important limitations that teams should be aware of. LLM judges can exhibit positional bias (favoring responses that appear first in a comparison), verbosity bias (favoring longer responses even when shorter ones are better), and self-reinforcing bias (a model tends to rate its own outputs more favorably). Mitigations include using different model families for generation and evaluation, employing multiple judges and aggregating scores, and using calibrated rubrics with specific, unambiguous scoring criteria.
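As a minimal sketch of the pattern, the snippet below uses the Amazon Bedrock Converse API to ask a judge model to score faithfulness against a simple rubric. The rubric text, scoring scale, and choice of judge model are assumptions; substitute whatever rubric and model suit your use case.

```python
# Minimal LLM-as-a-judge sketch using the Amazon Bedrock Converse API.
# The rubric, scale, and judge model ID are assumptions for illustration.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

RUBRIC = """Score the agent response for FAITHFULNESS on a 1-5 scale:
5 = every claim is directly supported by the provided context
3 = mostly supported, with minor unsupported details
1 = contains claims contradicted by or absent from the context
Return JSON: {"score": <int>, "explanation": "<one sentence>"}"""

def judge_faithfulness(context: str, agent_response: str,
                       judge_model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0") -> dict:
    # Assumed model ID; use any judge model enabled in your account and region.
    prompt = f"Context:\n{context}\n\nAgent response:\n{agent_response}\n\n{RUBRIC}"
    result = bedrock.converse(
        modelId=judge_model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0, "maxTokens": 300},  # keep judging as deterministic as possible
    )
    # Sketch only: assumes the judge returns well-formed JSON as instructed.
    return json.loads(result["output"]["message"]["content"][0]["text"])
```

Using a different model family for the judge than for the agent itself, as the mitigation above suggests, only requires changing the judge_model_id argument.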
Patronus AI, available in AWS Marketplace, specializes in LLM evaluation and provides a library of pre-built evaluation metrics, bias detection, and support for custom evaluation criteria. Deepchecks, also available in AWS Marketplace, offers similar capabilities with a focus on LLM testing infrastructure and integration with CI/CD pipelines.
Amazon Bedrock AgentCore Evaluations implements LLM-as-a-Judge as the underlying mechanism for all 13 of its built-in evaluators, as well as for any custom evaluators teams define. For custom evaluators, teams provide evaluation instructions (the rubric), select a foundation model from Amazon Bedrock to act as the judge, and define a rating scale — either categorical labels or numerical scores. AgentCore handles the prompt construction, model invocation, and result collection automatically. Critically, the same judge infrastructure powers both on-demand evaluation (for development and investigation) and online evaluation (for continuous production monitoring), so teams do not need to maintain two separate evaluation systems as their agents move through the lifecycle.
Rubric-based scoring
A rubric is a structured scoring guide that defines what different score levels mean for a given dimension. Rather than asking an evaluator (human or LLM) to produce a free-form score from 1 to 10, a rubric specifies exactly what a 1, 3, and 5 look like in terms of observable characteristics of the output.
Well-designed rubrics dramatically improve evaluation consistency. They make implicit quality criteria explicit, reduce evaluator disagreement, and make it possible to onboard new human or automated evaluators quickly. Building good rubrics requires close collaboration with domain experts and iterative refinement based on observed scoring disagreements.
When creating custom evaluators in Amazon Bedrock AgentCore Evaluations, the rating scale definition is effectively the rubric in structured form. Teams specify each score level with a label (such as “Excellent,” “OK,” or “Poor”) and a written definition that describes what observable characteristics of the agent’s output qualify for that score. These definitions are injected into the judge model’s prompt alongside the actual trace, ensuring that scoring decisions are anchored to the rubric rather than the judge model’s unconstrained judgment.
Reference-free evaluation
Some tasks do not lend themselves to reference-based evaluation because there is no single correct answer — there are many acceptable answers, and the task is to judge quality rather than correctness. In these cases, reference-free evaluation, where the output is assessed on its own merits without comparison to a ground truth, is the appropriate approach.
Reference-free evaluation is commonly used for open-ended generation tasks, conversational agents, and use cases where the acceptable response space is large. LLM-as-a-judge approaches are well suited here, as are trained reward models where sufficient annotation data is available.
Embedding-based similarity
For tasks where a reference output is available but exact or fuzzy string matching is too rigid, embedding-based similarity provides a middle ground. Both the reference output and the agent's output are encoded as embedding vectors, and the cosine similarity between them serves as a proxy for semantic similarity. This approach is more robust than string matching across paraphrases and minor wording variations.
Amazon Bedrock provides access to embedding models that can be used directly in evaluation pipelines. Teams can build embedding-based similarity scoring into AWS Lambda functions or Amazon SageMaker Processing jobs, making it straightforward to run at scale without managing dedicated infrastructure.
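A minimal sketch of this approach using an Amazon Bedrock embedding model is shown below. The model ID is an assumption; any embedding model enabled in your account can be substituted.

```python
# Hedged sketch of embedding-based similarity scoring with an Amazon Bedrock
# embedding model. The model ID is an assumption, not a requirement.
import json
import math
import boto3

bedrock = boto3.client("bedrock-runtime")

def embed(text: str, model_id: str = "amazon.titan-embed-text-v2:0") -> list[float]:
    response = bedrock.invoke_model(
        modelId=model_id,
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

reference = "Deployment failed because the IAM role is missing eks:DescribeCluster."
candidate = "The rollout did not succeed: the execution role lacks the eks:DescribeCluster permission."
print(cosine_similarity(embed(reference), embed(candidate)))
```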
Structured output validation
For agents that produce structured outputs — JSON, XML, or domain-specific schemas — automated validation against the expected schema is both inexpensive and reliable. Schema validation catches format compliance failures immediately and provides a high-confidence, low-cost baseline metric that can be incorporated into every evaluation run.
Going beyond schema validation, semantic validation checks whether the content of a structured output is logically consistent and within expected ranges. For example, a booking agent that produces valid JSON but schedules a meeting in the past has passed schema validation but failed semantic validation. Both layers are necessary for complete coverage.
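The sketch below applies both layers to the booking-agent example above, using the jsonschema package for format compliance and a simple date check for semantic validity. The schema itself is an illustrative assumption.

```python
# Two-layer validation sketch: schema compliance, then a semantic range check.
# The booking schema and field names are illustrative assumptions.
from datetime import datetime, timezone
from jsonschema import validate, ValidationError

BOOKING_SCHEMA = {
    "type": "object",
    "required": ["attendee", "start_time"],
    "properties": {
        "attendee": {"type": "string"},
        "start_time": {"type": "string", "format": "date-time"},
    },
}

def validate_booking(output: dict) -> list[str]:
    errors = []
    try:
        validate(instance=output, schema=BOOKING_SCHEMA)        # layer 1: format compliance
    except ValidationError as e:
        errors.append(f"schema: {e.message}")
        return errors
    start = datetime.fromisoformat(output["start_time"])
    if start <= datetime.now(timezone.utc):                     # layer 2: semantic validity
        errors.append("semantic: meeting is scheduled in the past")
    return errors

print(validate_booking({"attendee": "alice@example.com", "start_time": "2023-01-15T10:00:00+00:00"}))
```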
Tracing and observability as evaluation infrastructure
Evaluation does not happen in a vacuum. It depends on data — specifically, detailed records of what the agent did, why it did it, and what happened as a result. Tracing and observability are the infrastructure that generates that data. Without them, evaluation is limited to examining final outputs in isolation, with no visibility into the reasoning and actions that produced them.
What tracing captures
A trace is a structured record of a single agent execution. It captures every step in the execution path: the initial input, the model's reasoning at each step, every tool call with its input arguments and returned output, any intermediate state updates, and the final output. In multi-step agent loops, a trace may contain dozens of spans, each representing a discrete unit of work.
For evaluation purposes, traces provide the raw material for trajectory-level assessment. Without traces, questions like "did the agent call the right tool?" or "how did the agent handle the tool returning an empty result?" cannot be answered systematically. With traces, these questions become answerable at scale.
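For illustration, a single tool-call span might look something like the record below. The field names are assumptions; agents instrumented with OpenTelemetry or OpenInference will follow those standards' attribute conventions instead.

```python
# Illustrative shape of one span within an agent trace. All field names and
# identifiers here are hypothetical, shown only to make the concept concrete.
tool_call_span = {
    "trace_id": "tr-20240311-0042",          # hypothetical trace ID
    "span_id": "sp-007",
    "span_type": "TOOL_CALL",
    "tool_name": "cloudwatch_put_alarm",
    "input": {"alarm_name": "orders-5xx-rate", "threshold": 5},
    "output": {"status": "ok"},
    "latency_ms": 412,
    "tokens": {"input": 0, "output": 0},
    "parent_span_id": "sp-003",              # the reasoning step that issued this call
}
```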
Instrumentation approaches
Instrumentation — the process of adding tracing code to an agent — can be done at several levels. Framework-level instrumentation, provided by agent frameworks themselves, is the most convenient: if a team is using a framework that has native tracing support, traces are generated automatically with minimal configuration. Application-level instrumentation, where teams add explicit trace logging at key points in their agent code, provides more control at the cost of more implementation effort.
Most production agent implementations on AWS combine both approaches: framework-level tracing captures the standard execution flow, while application-level instrumentation adds business-specific context — the customer ID, the transaction reference, the policy version — that makes traces useful for debugging real-world issues rather than just observing agent mechanics.
LangSmith provides deep tracing integration for LangChain-based agents, capturing the full execution graph with per-step timing, token counts, and the ability to annotate individual spans for evaluation. LangFuse, available in AWS Marketplace, is a strong alternative for teams using other frameworks, offering an open standard for trace collection and a rich UI for trace inspection and evaluation workflows.
For teams building on AWS infrastructure, traces can be stored in Amazon S3 in a structured format (such as JSON Lines) and queried using Amazon Athena, providing a cost-effective foundation for large-scale trace analysis without dedicated infrastructure. Amazon CloudWatch can surface aggregate metrics derived from traces, providing operational dashboards alongside the detailed trace data.
From traces to evaluation
The connection between tracing and evaluation runs in both directions. Traces feed evaluation by providing the data that automated and human evaluators need to assess trajectory quality. Evaluation feeds traces by enriching them with quality scores, annotations, and failure labels that make individual traces searchable and comparable.
A mature tracing and evaluation system allows teams to ask questions like: "Show me all traces from the last week where the agent made more than 5 tool calls and the final output received a faithfulness score below 0.7." This kind of query is the foundation of effective debugging and continuous improvement — it surfaces systematic failure modes that would be invisible if evaluations were conducted on a small, manually selected sample.
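A hedged sketch of that query, run with Amazon Athena over trace records stored in S3, is shown below. The table name, column names, database, and output location are assumptions that depend on how your traces are structured and catalogued.

```python
# Hedged sketch: querying trace records in S3 with Amazon Athena. The table,
# columns, Glue database, and output bucket are assumptions for illustration.
import boto3

athena = boto3.client("athena")

QUERY = """
SELECT trace_id, tool_call_count, faithfulness_score
FROM agent_traces
WHERE run_date >= date_add('day', -7, current_date)
  AND tool_call_count > 5
  AND faithfulness_score < 0.7
ORDER BY faithfulness_score ASC
"""

athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "agent_observability"},              # assumed Glue database
    ResultConfiguration={"OutputLocation": "s3://my-eval-results/athena/"},  # assumed output bucket
)
```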
Weights & Biases and Comet ML, both available in AWS Marketplace, provide experiment tracking capabilities that integrate well with tracing workflows, making it possible to correlate trace-level data with agent configuration changes and model updates across experiments.
Amazon Bedrock AgentCore Evaluations uses traces as its primary input. It reads from CloudWatch log groups where agents emit their OpenTelemetry traces, converting them to a unified format before scoring. This means the instrumentation investment a team makes once — configuring their Strands or LangGraph agent to emit OTEL traces to CloudWatch — feeds both the observability layer and the evaluation layer simultaneously, without requiring a separate data pipeline. All evaluation results are published back to Amazon CloudWatch in the AgentCore Observability dashboard, giving teams a single pane of glass for trace inspection, evaluation scores, and operational metrics. Because results live in CloudWatch natively, teams can use all of its capabilities — alarms, dashboards, anomaly detection, and automated actions — to act on evaluation data without building additional tooling.
Decision engines — How agents choose what to do
Evaluation tells us how well an agent is performing. Decision engines determine what the agent does in the first place. These two concepts are deeply interconnected: understanding how an agent makes decisions is a prerequisite for designing effective evaluations of those decisions, and evaluation results are the primary feedback mechanism for improving decision quality over time.
The think-act-observe loop
The most widely adopted pattern for agent decision-making is the think-act-observe loop, introduced in Module 1 of this series. The loop works as follows: the agent receives an input (a goal, a question, or a task), thinks about what to do next, takes an action (typically a tool call), observes the result of that action, and then thinks again, incorporating the new observation into its next step. This cycle continues until the agent concludes that the goal has been achieved, determines that it cannot proceed, or hits a defined termination condition.
The think-act-observe loop is powerful because it is general. It does not assume a fixed task structure — the agent adapts its plan dynamically based on what it observes. This generality comes with a cost: because the execution path is not predetermined, predicting and evaluating agent behavior requires examining the full trace rather than just the inputs and outputs.
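A skeletal version of the loop is sketched below. The think, act, and is_done callables are placeholders for the model invocation, tool dispatch, and termination check that an agent framework normally provides; the step cap is one example of a defined termination condition.

```python
# Skeletal think-act-observe loop. think(), act(), and is_done() are placeholders
# for capabilities an agent framework (or your own code) supplies.
MAX_STEPS = 10  # a defined termination condition that guards against runaway loops

def run_agent(goal: str, think, act, is_done) -> list[dict]:
    context: list[dict] = [{"role": "user", "goal": goal}]
    for _ in range(MAX_STEPS):
        decision = think(context)        # model reasons about the next step
        if is_done(decision):
            break
        observation = act(decision)      # execute the chosen tool call
        context.append({"decision": decision, "observation": observation})
    return context                       # the full trajectory, ready for evaluation
```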
On AWS, the Amazon Bedrock Agents service implements the think-act-observe loop natively, handling the orchestration loop, tool invocation, and context management automatically. For teams using the Strands Agents SDK, the same pattern is supported with more flexibility for custom tool definitions and multi-agent configurations.
Chain-of-thought reasoning
Chain-of-thought (CoT) prompting is a technique that elicits step-by-step reasoning from a language model before it produces a final answer or decision. Rather than jumping directly to an output, the model is encouraged (via prompt design or model training) to think through the problem explicitly, producing a reasoning trace as an intermediate artifact.
For agent decision-making, chain-of-thought reasoning serves two important functions. First, it improves decision quality — models that reason explicitly before acting tend to make better decisions, particularly on complex or multi-step tasks. Second, it makes decision-making inspectable. The reasoning trace produced by a chain-of-thought prompt is a natural artifact for evaluation: it reveals why the agent made the decision it made, which is invaluable for debugging and improvement.
Modern foundation models, including those available through Amazon Bedrock, support chain-of-thought reasoning natively. Some models produce reasoning traces implicitly; others can be prompted to reason explicitly using prompt patterns like "Let's think through this step by step before deciding on an action."
Tool selection and planning
At the heart of the decision engine is the tool selection mechanism: given the current state of the world (the goal, the conversation history, the results of previous tool calls, and any retrieved context), which tool should the agent invoke next?
In practice, tool selection is performed by the language model itself. The agent is provided with a list of available tools, descriptions of what each tool does and when it should be used, and the current context. The model reasons about which tool is most appropriate for the next step and generates a structured tool call — typically a JSON object containing the tool name and its arguments.
The quality of tool selection is heavily influenced by prompt design. Tool descriptions that are clear, specific, and include examples of appropriate use tend to produce significantly better selection behavior than vague or overlapping descriptions. This is an area where evaluation feedback loops are especially valuable: systematic analysis of tool calls identifies which tools are being misused and informs targeted improvements to their descriptions.
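For illustration, here is what a tool definition might look like in the format used by the Amazon Bedrock Converse API's tool configuration. The tool itself is hypothetical; note how the description states both when to use it and when not to.

```python
# Illustrative tool definition in the Converse API toolConfig format.
# The tool and its parameters are hypothetical examples.
LOG_PATTERN_TOOL = {
    "toolSpec": {
        "name": "get_error_log_patterns",
        "description": (
            "Return the most frequent structured error patterns for one service. "
            "Use this BEFORE generating CloudWatch alarms for that service. "
            "Do NOT use it for general log searches or for services with no deployments."
        ),
        "inputSchema": {
            "json": {
                "type": "object",
                "required": ["service_name"],
                "properties": {
                    "service_name": {"type": "string",
                                     "description": "Exact service name from the repo inventory"},
                    "lookback_hours": {"type": "integer",
                                       "description": "How far back to sample logs", "default": 24},
                },
            }
        },
    }
}
```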
For complex, multi-step tasks, agents may also engage in explicit planning — decomposing a high-level goal into a sequence of sub-tasks before beginning execution. Planning approaches range from simple decomposition ("I need to do A, then B, then C") to more sophisticated methods like Tree of Thoughts, which explores multiple possible plans in parallel and selects the most promising one. The right planning approach depends on task complexity, acceptable latency, and the cost of execution errors.
Confidence, uncertainty, and guardrails
An agent that always produces confident, decisive outputs is not a reliable agent — it is a reckless one. In practice, there are inputs that genuinely fall outside the agent's reliable operating range, tool states that make confident execution impossible, and situations where escalating to a human is the right answer. Agents that cannot recognize and respond appropriately to uncertainty are dangerous in proportion to their autonomy.
Recognizing uncertainty
Uncertainty in agent behavior arises from several sources. Input uncertainty occurs when the agent's input is ambiguous, incomplete, or contradictory — the agent cannot determine with confidence what is being asked. Model uncertainty arises from the probabilistic nature of LLM outputs: some inputs fall in regions of the model's distribution where output quality is low or inconsistent. Tool uncertainty occurs when tool outputs are unexpected, empty, or erroneous, leaving the agent without the information it needs to proceed confidently. Context uncertainty arises in long conversations or multi-step tasks where accumulated context may be incomplete or internally inconsistent.
Well-designed agents are equipped to detect these uncertainty signals and respond appropriately. Detection mechanisms include confidence thresholds (where the model's output probability or an evaluator's score falls below a defined minimum), consistency checks (running the same reasoning step multiple times and flagging high variance in outputs), and explicit self-assessment prompts (asking the model to rate its confidence in its own output before proceeding).
Fallback behaviors and escalation
When an agent detects uncertainty above a configured threshold, it needs a well-defined fallback behavior. Common patterns include requesting clarification from the user, routing to a more capable (typically larger and more expensive) model, escalating to a human reviewer, returning a conservative default response with a clear indication that the agent's confidence was low, or declining to act and explaining why.
The right fallback behavior depends on the stakes of the task. For a customer-facing support agent, asking a clarifying question is often the appropriate first response to ambiguity. For an agent executing financial transactions, any uncertainty above a defined threshold should trigger human review before action is taken. For an agent generating draft content, returning a low-confidence draft with a flag for human review may be perfectly acceptable.
Escalation paths should be designed into the agent architecture from the start, not bolted on after deployment. This means defining the conditions under which escalation occurs, the mechanism by which control is transferred (queue, notification, synchronous review), and the process by which the human reviewer's decision is incorporated back into the agent's context.
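The sketch below combines two of these ideas: a self-consistency check as an uncertainty signal, and threshold-based routing to a fallback behavior. The thresholds, the sample_model placeholder, and the stakes categories are all assumptions to be calibrated against real evaluation data.

```python
# Hedged sketch: a self-consistency check as an uncertainty signal, plus
# threshold-based routing to a fallback. sample_model is a placeholder for
# however you invoke the underlying model; thresholds are illustrative.
def consistency_score(sample_model, prompt: str, n: int = 3) -> float:
    # Run the same reasoning step several times; high agreement suggests high confidence.
    answers = [sample_model(prompt) for _ in range(n)]
    most_common = max(set(answers), key=answers.count)
    return answers.count(most_common) / n

def decide_next_action(confidence: float, stakes: str) -> str:
    # Route to the appropriate fallback based on confidence and task stakes.
    if stakes == "high" and confidence < 0.9:
        return "escalate_to_human"
    if confidence < 0.6:
        return "ask_clarifying_question"
    return "proceed"
```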
Input and output guardrails
Guardrails are safety and quality filters applied to agent inputs and outputs. They complement evaluation by providing real-time protection rather than retrospective measurement.
Input guardrails filter or transform incoming requests before they reach the agent. Common input guardrail functions include detecting and blocking prompt injection attempts (where adversarial inputs attempt to override the agent's instructions), classifying inputs as out-of-scope and routing them appropriately, redacting or masking sensitive information (PII, credentials, proprietary data) before it is included in model context, and enforcing input length or format constraints.
Output guardrails filter or transform the agent's outputs before they are returned to the user or passed to downstream systems. Common output guardrail functions include detecting hallucinations or factual inconsistencies, blocking outputs that contain sensitive information that should not be surfaced, enforcing format and schema compliance, and flagging outputs that exceed defined toxicity or policy violation thresholds.
Amazon Bedrock Guardrails provides a managed implementation of both input and output guardrail functions, including PII detection, content filtering, grounding checks, and topic blocking. This is a natural first choice for teams building agent systems on AWS, as it integrates directly with Amazon Bedrock's model invocation APIs without requiring custom infrastructure.
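As a minimal sketch, the snippet below checks an agent output against an existing guardrail using the ApplyGuardrail API in the Bedrock runtime. The guardrail identifier and version are placeholders for a guardrail already created in your account.

```python
# Minimal sketch of evaluating agent output against an Amazon Bedrock Guardrail.
# The guardrail identifier and version below are placeholders.
import boto3

bedrock = boto3.client("bedrock-runtime")

def output_passes_guardrail(text: str) -> bool:
    response = bedrock.apply_guardrail(
        guardrailIdentifier="gr-EXAMPLE123",   # placeholder guardrail ID
        guardrailVersion="1",
        source="OUTPUT",                       # evaluate as an agent output; use "INPUT" for requests
        content=[{"text": {"text": text}}],
    )
    # "GUARDRAIL_INTERVENED" means a filter fired; trigger rates are useful evaluation data in themselves.
    return response["action"] != "GUARDRAIL_INTERVENED"
```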
The relationship between guardrails and evaluation is complementary rather than substitutable. Guardrails provide real-time protection; evaluation provides the measurement and feedback loops needed to improve guardrail configurations over time. Guardrail trigger rates — how often each guardrail fires, and on what inputs — are themselves valuable evaluation data.
Evaluation in the development lifecycle — From prototyping to production
Evaluation is not a phase that happens after development is complete. It is a continuous practice that evolves in form and rigor as an agent moves from early prototype to production deployment. Teams that treat evaluation as a late-stage gate almost always find themselves in a reactive posture — fixing problems that measurement would have caught weeks earlier.
Prototyping: Informal evaluation and rapid iteration
In the prototyping phase, the goal is to establish whether the agent can handle the core task at all. Evaluation at this stage is typically informal: developers examine outputs manually, adjust prompts, swap tools, and observe the effect. The volume of test cases is small, and the criteria for "good enough" are loosely defined.
Even at this early stage, two practices pay significant dividends later. First, keeping a running log of interesting or problematic inputs — the cases that caused unexpected behavior — creates the seed of a regression dataset. Second, documenting the prompt iterations and the rationale for each change establishes a foundation for the prompt versioning discipline that will be needed in production.
Pre-production: Structured testing and baselines
As the agent matures toward a release candidate, evaluation becomes more structured. Golden datasets are formalized, automated evaluation pipelines are established, and baseline scores are recorded for each key metric. Every subsequent change — to prompts, tools, models, or routing logic — is evaluated against these baselines before being promoted.
This is also the stage at which adversarial testing should be conducted systematically. The agent should be tested against edge cases, unusual inputs, and deliberate attempts to elicit failure modes. Any failures discovered at this stage are far cheaper to address than failures discovered in production.
LangSmith's dataset and evaluation run management is well suited to this phase, providing a structured workflow for running evaluation suites against specific agent versions and comparing results across runs. Vellum's prompt comparison and regression testing features serve a similar purpose, with particular strength in managing prompt variants and their associated evaluation scores.
Amazon Bedrock AgentCore Evaluations’ on-demand evaluation mode is particularly well suited to this pre-production phase. Teams specify the exact span or trace IDs they want to assess — whether from a regression suite run, an adversarial test, or a specific interaction that revealed unexpected behavior — and apply any combination of built-in and custom evaluators to those traces precisely. This makes on-demand evaluation an effective tool for validating fixes (“does this prompt change resolve the tool selection errors we saw in traces X, Y, and Z?”), comparing agent versions head-to-head on the same set of traces, and signing off on release candidates against documented quality thresholds before promotion to production.
Production: Continuous monitoring and evaluation-driven improvement
In production, evaluation shifts from periodic testing to continuous monitoring. A representative sample of production traffic is evaluated automatically on an ongoing basis, with results feeding dashboards that surface quality trends, cost metrics, and anomalous behavior in near-real time.
Production evaluation programs typically involve:
- Sampling: evaluating a fraction of all traffic rather than every interaction, both for cost reasons and to avoid evaluator bias from over-indexing on high-volume but unrepresentative cases.
- Human annotation queues: surfacing low-scoring or flagged interactions for human review and annotation, building the production-derived dataset described earlier.
- Automated alerting: triggering notifications when key metrics drop below defined thresholds, enabling rapid response to regressions.
Amazon Bedrock AgentCore Evaluations’ online evaluation mode is the recommended approach for production monitoring on AWS. Teams create an online evaluation configuration that points to the CloudWatch log group where their agent emits traces, selects up to 10 evaluators (built-in and custom), and sets a sampling rate — for example, evaluating 10% or 20% of all sessions continuously. The service then scores live traffic automatically as it arrives, with no custom pipeline to maintain. Results are published in real time to the AgentCore Observability dashboard in Amazon CloudWatch, where teams can create alarms on evaluation scores directly. For example, a team can configure a CloudWatch alarm to trigger if the Builtin.GoalSuccessRate score drops below 0.80 or if Builtin.ToolSelectionAccuracy falls below a threshold that signals systematic routing problems — and wire those alarms to automated notifications or rollback actions.
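A sketch of such an alarm, created with the CloudWatch API, is shown below. The namespace, metric name, and SNS topic are assumptions; use the metric names that your evaluation results actually publish.

```python
# Hedged sketch: alarming on an evaluation score with CloudWatch. The namespace,
# metric name, and SNS topic ARN are assumptions, not documented values.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="agent-goal-success-rate-low",
    Namespace="AgentCore/Evaluations",            # assumed namespace
    MetricName="GoalSuccessRate",                 # assumed metric name
    Statistic="Average",
    Period=3600,                                  # evaluate hourly averages
    EvaluationPeriods=2,                          # require two consecutive breaches
    Threshold=0.80,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:agent-quality-alerts"],  # placeholder SNS topic
)
```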
Evaluation-driven development
Teams that fully internalize the discipline of continuous evaluation tend to converge on a development practice that can be called evaluation-driven development: the practice of defining evaluation criteria and target metrics before making changes, rather than after. This sounds simple but represents a significant cultural shift for most teams.
The payoff is substantial. When changes are made with clear, pre-defined success criteria, improvement is intentional and measurable. When changes are measured against established baselines, regressions are caught immediately. When evaluation data is treated as a first-class product of the development process, the entire team develops a shared, data-grounded understanding of agent quality. Over time, this shared understanding is one of the most valuable assets an agent development team can possess.
Evaluating a real agent end-to-end
The preceding sections have covered the theory and components of agent evaluation in depth. This section brings them together through a concrete, narrative walkthrough of a realistic scenario: evaluating an Agentic AI DevOps Companion — the series’ running reference implementation — that autonomously handles the full software delivery lifecycle on AWS. The techniques and tools referenced are drawn from the ISV ecosystem available in AWS Marketplace, applied in the context where they naturally fit.
The scenario
The agent — referred to throughout this series as the DevOps Companion — is given a source code repository and tasked with doing something that would normally occupy a team of engineers for days: inspecting the codebase to identify its constituent services, determining the infrastructure each service requires, generating the corresponding AWS infrastructure code (using AWS CDK), producing CI/CD pipeline definitions, deploying those services to Amazon EKS, and automatically generating Amazon CloudWatch alarms calibrated to each service’s actual error logging patterns. The agent accomplishes this by executing a think-act-observe loop across a suite of tools that interact with Git, the AWS SDK, and the Kubernetes API.
The business has defined success across four dimensions:
- Service identification accuracy: did the agent correctly identify all services and their dependencies?
- Infrastructure correctness: is the generated CDK code valid, and does it provision what the services actually need?
- Deployment success: did the services start and pass health checks in EKS?
- Observability coverage: do the generated CloudWatch alarms reflect the real error signals present in the service logs?
Step 1: Defining success criteria and metrics
Working with the platform engineering team, the evaluation criteria are defined as follows:
- Service identification accuracy: the fraction of services correctly identified compared to a ground truth inventory of the repository’s services and their declared dependencies.
- Infrastructure correctness: a combination of schema validation (does the generated CDK synthesize without errors?) and semantic validation (does the provisioned infrastructure match what the service actually requires in terms of compute, networking, storage, and IAM permissions?).
- Deployment success: the fraction of services that successfully deploy to EKS and pass readiness probes within a defined time window.
- Observability coverage: the fraction of distinct error signal patterns present in the service logs that are covered by at least one generated CloudWatch alarm, evaluated using an LLM judge against a rubric that assesses alarm relevance, threshold appropriateness, and namespace correctness.
Efficiency metrics — tool calls per pipeline run and total token consumption — are tracked as cost and latency indicators throughout.
These metrics are documented, reviewed with platform engineering leadership, and approved as the official success criteria for the DevOps Companion.
Step 2: Building the evaluation dataset
The team creates an initial golden dataset of 30 representative repositories: a mix of microservice architectures (ranging from 3 to 12 services), monoliths being decomposed into services, and repositories with intentionally complex infrastructure dependencies (RDS, ElastiCache, SQS, MSK). For each repository, the dataset includes the ground truth service inventory, the expected infrastructure resources and their configurations, a reference CDK stack, and an annotated set of expected CloudWatch alarms derived from reviewing the actual error log patterns in each service’s codebase.
An adversarial set of 10 repositories is also created, designed to probe specific failure modes: repositories with unconventional project structures, services with no infrastructure declarations at all, polyglot repos mixing multiple runtimes, and codebases with misleading directory names that do not correspond to discrete services. Dataset versioning is managed in Amazon S3 with versioning enabled, with metadata tracked in Amazon DynamoDB.
Step 3: Instrumenting the agent
The team instruments the DevOps Companion using LangFuse, capturing the full execution trace for every pipeline run: the initial repository URL and branch, each tool call with its arguments and returned data (Git clone, CDK synth, EKS deploy, CloudWatch API calls), the agent’s reasoning at each think step, and the final set of generated artifacts. Token counts and latency are captured at the span level, enabling cost analysis broken down by pipeline stage — repo inspection, CDK generation, deployment, and alarm generation — which helps identify where the most expensive reasoning is occurring.
Step 4: Running the evaluation pipeline
The evaluation pipeline is built in two complementary layers. The first layer handles the deterministic checks that do not require an LLM judge: an AWS Lambda function runs CDK synth on each generated infrastructure stack and compares the resulting CloudFormation template against the ground truth, validates CI/CD pipeline definitions for syntactic correctness, and records deployment success or failure from the EKS dry-run. These checks run on every evaluation cycle and produce binary pass/fail signals that are stored in Amazon S3.
The second layer handles the qualitative assessments using Amazon Bedrock AgentCore Evaluations. The team creates on-demand evaluation runs against the trace IDs captured by LangFuse for each golden dataset execution. Two custom evaluators are defined in AgentCore Evaluations for this use case: one that scores infrastructure correctness by prompting a foundation model to compare the generated CDK stack against the reference specification, and one that scores observability coverage by assessing whether the generated CloudWatch alarms reflect the actual error signal patterns present in each service’s log samples. These custom evaluators specify their own rating scales (a 5-point rubric for each dimension) and use a foundation model from Amazon Bedrock as the judge. All scores are published automatically to the AgentCore Observability dashboard in CloudWatch, giving the team a unified view across both deterministic and qualitative results.
Vellum is used to manage prompt variants during development, enabling the team to compare evaluation scores across different system prompt configurations for each pipeline stage before selecting the best-performing combination.
Step 5: Interpreting results and iterating
The initial evaluation run reveals that service identification accuracy is strong (91%) and CDK synthesis succeeds in 87% of cases, but observability coverage scores are significantly below target (average 2.6 out of 5). The LLM judge’s explanations, combined with trace inspection in LangFuse, identify the root cause: the agent’s alarm generation step is reading only the top-level exception types from log patterns without drilling into the structured log fields that carry the most signal-rich error context. It is generating generic alarms ("ERROR count exceeds threshold") rather than service-specific ones tied to meaningful error categories.
The team iterates in two directions. First, the alarm generation prompt is updated to explicitly instruct the agent to parse structured log fields and group error patterns by category before generating alarms, with few-shot examples drawn from the golden dataset’s annotated alarm sets. Second, a new tool is added that extracts a structured summary of the top 10 error patterns from each service’s log samples before the alarm generation step begins, giving the agent richer input to work from. After four prompt and tool iterations, observability coverage scores improve to 4.3 on average, with no regression in infrastructure correctness or deployment success rates. The winning configuration is promoted to production.
Step 6: Production monitoring
In production, the team uses Amazon Bedrock AgentCore Evaluations’ online evaluation mode for continuous quality monitoring. An online evaluation configuration is created pointing to the CloudWatch log group where the DevOps Companion emits its OpenTelemetry traces, with the two custom evaluators (infrastructure correctness and observability coverage) applied at a 20% sampling rate to manage cost. The Builtin.GoalSuccessRate evaluator is added at full (100%) sampling given its low cost and high signal value — it answers the most important question directly: did the agent successfully complete the end-to-end pipeline for this repository?
All evaluation results surface in the AgentCore Observability dashboard in Amazon CloudWatch. The team creates CloudWatch alarms on the evaluation score metrics: an alert fires if GoalSuccessRate drops below 0.90, if infrastructure correctness scores drop below 3.5, or if observability coverage drops below 3.5. Deterministic checks (CDK synthesis success, deployment success) continue to run on every pipeline execution via the Lambda layer and write their pass/fail results to CloudWatch as custom metrics alongside the AgentCore evaluation scores. The combined view gives the team a complete production quality picture from a single dashboard.
The production trace data feeds into a weekly annotation review where engineers flag repositories where the agent made unusual decisions, building a growing production-derived dataset that keeps the custom evaluators’ rubrics aligned with the real diversity of codebases the DevOps Companion encounters.
Within two months of deployment, the production dataset has expanded to include over 150 annotated pipeline runs, several new adversarial repository patterns have been added based on real failures caught in production, and three separate regressions have been detected and resolved before they affected significant traffic. The evaluation program is working as intended.
Conclusion and looking ahead
Agent evaluation is not glamorous work. It does not generate the kind of excitement that comes from building a new agent capability or integrating an impressive new tool. But it is, arguably, the most important discipline in the agentic AI engineering toolkit. Teams that invest in evaluation early and maintain it consistently build agents that improve over time, catch regressions before they reach users, and develop the shared understanding of quality that makes scaling possible.
This module has covered the full evaluation landscape: the conceptual shift from traditional testing to continuous assessment, the three-layer evaluation framework, metric design and dataset management, automated evaluation techniques, the role of tracing and observability, the decision-making mechanisms that generate the behavior being evaluated, routing approaches and their trade-offs, confidence handling and guardrails, and the integration of evaluation across the development lifecycle. The worked example demonstrated how these pieces fit together in a realistic production scenario.
Two threads run through all of this. The first is measurement: the discipline of defining what good looks like before building, and then measuring relentlessly against that definition. The second is feedback: using measurement results to drive targeted improvement, catching regressions early, and building a compounding understanding of agent quality over time.
In Module 4: Multi-Agent Architectures, the concepts introduced here become both more important and more challenging. When multiple agents operate as a coordinated system, evaluation must extend to the interactions between agents — routing decisions become inter-agent handoffs, trajectory quality must be assessed across agent boundaries, and the failure modes of individual agents can cascade through the system in ways that are difficult to anticipate without systematic measurement. The evaluation foundations built in Module 3 are the prerequisite for navigating that complexity effectively.
Featured AI tools in AWS Marketplace
Build with foundation models, deployment platforms, and AI infrastructure — all available through your AWS account.
Why AWS Marketplace for on-demand cloud tools
Free to try. Deploy in minutes. Pay only for what you use.
Featured tools are designed to plug in to your AWS workflows and integrate with your favorite AWS services.
Subscribe through your AWS account with no upfront commitments, contracts, or approvals.
Try before you commit. Most tools include free trials or developer-tier pricing to support fast prototyping.
Only pay for what you use. Costs are consolidated with AWS billing for simplified payments, cost monitoring, and governance.
A broad selection of tools across observability, security, AI, data, and more can enhance how you build with AWS.
Continue your journey
Module 3 builds on frameworks and introduces evaluation. Continue with multi-agent architectures.