Evaluating AI Agents: A Practical Guide for Product and Technology Leaders

Key Takeaways
AI agents go far beyond chatbots. They are software systems that reason about tasks, make decisions, and take action autonomously, with direct access to sensitive systems and data. They review contracts against policy, run credit checks across multiple systems, update client records, and draft recommendations.
AI agents are moving from pilots into production across professional services and financial services, but most organisations cannot yet measure whether they work.
Some AI agent outputs must be precisely correct, such as quoted regulations or account numbers. Others, such as summaries or recommendations, accept reasonable variation. Each type requires a different quality check, and without deliberate controls, AI agents can fabricate both with equal confidence, producing errors that surface only after damage is done.
Recent benchmarking research shows that even top-performing models exhibit significant reliability drops when asked to perform the same task repeatedly. An AI agent that can do something is not the same as one that will do it consistently.¹³
This article provides a practical framework: how to classify outputs, build checks, catch failures, and monitor quality as deployments grow.
What AI Agents Are, and Why Evaluation Matters Now
For leaders in professional services considering or already deploying AI agents, the critical question is not whether the technology works. It is how you know.
AI agents are software systems that do work autonomously. Unlike chatbots, which respond to individual questions within a single conversation, AI agents reason about tasks, connect to business systems and data sources, make decisions, and take multi-step actions to automate complex business processes. A legal AI agent reviews an entire contract, cross-references clauses against internal policy, and drafts a client summary; a financial services AI agent pulls transaction history, runs credit checks across multiple systems, and produces a recommendation. These are tasks that previously required a qualified professional working over several hours, and AI agents handle them end to end, without human involvement.
Adoption is accelerating across the sector.⁹ Law firms deploy AI agents for contract review and due diligence, accounting firms use them for tax research and audit preparation, and consulting firms have them synthesising market intelligence and drafting proposals. Gartner projects that 40% of enterprise applications will feature task-specific AI agents by the end of 2026, up from less than 5% in 2025.¹ The scale of investment makes this concrete: in early 2026, the Financial Times reported that KPMG negotiated lower fees from its own auditor, arguing partly that AI should make the work more efficient.¹⁰ Separately, Goldman Sachs disclosed it has been working with Anthropic to build AI agents for trade accounting and client onboarding, functions that combine calculations that must be exactly right, qualitative assessments such as client risk profiling, and mandatory regulatory compliance.¹²
These are not experimental projects, and the firms deploying them have direct financial and regulatory exposure to AI agent errors. In our work with professional services firms, we see the same pattern: organisations that have deployed AI agents but have no systematic way to know whether those agents are producing reliable work. The capability that makes AI agents valuable is also what makes them risky.
Three characteristics combine to make structured evaluation essential:
Cascading complexity. AI agents handle complex, multi-step processes where a single error can cascade through every subsequent decision.
Direct system access. They typically have direct access to business systems, client data, and internal documents, meaning an AI agent that takes the wrong action can cause real damage.
Non-determinism. The same instruction can produce different results on different runs. You cannot test an AI agent once, confirm it works, and assume it will always behave the same way.
This fundamental unpredictability means that ongoing, structured evaluation is not optional; it is the only way to know whether your AI agent is performing reliably. Beyond quality assurance, structured evaluation supports broader risk management and governance objectives. For regulated industries, it provides the auditable evidence of due diligence that compliance teams and external auditors increasingly expect when AI systems are making or influencing business decisions.
Why "It Seems to Work" Is Not Enough
When traditional software fails, the failure is usually visible: an error message appears, a process crashes, a dashboard turns red. The system itself signals that something has gone wrong, and teams can respond. AI agents fail differently. They produce outputs that look entirely reasonable while being subtly wrong: the system shows no errors, logs indicate successful completion, and the output is grammatically correct, properly formatted, and entirely confident, but the content itself contains errors, and nothing in the AI agent's behaviour alerts anyone to the fact.
Consider what this looks like in practice. A research AI agent at an accountancy firm summarises a due diligence report but omits a material risk buried on page forty-seven; the summary reads well and hits all the expected sections, but the omission is invisible unless someone reads the original document and notices what was left out. A client-facing assistant at a wealth management firm provides investment guidance based on a regulation that was updated last month; the advice sounds authoritative, but the error only surfaces when compliance reviews the interaction three weeks later. A pricing AI agent calculates a quote using the wrong client tier; the number looks entirely plausible, there is no warning and no exception thrown, and the mistake costs the firm money on every transaction until someone notices the margin compression and traces it back to the AI agent.
None of these failures trigger alerts, yet all of them have consequences, and all of them passed whatever basic checks were in place.
Why Traditional Testing Misses These Problems
Traditional software is deterministic: give it the same input, and it produces the same output. You can test it once, confirm it works, and move on. AI agents do not work this way. Ask an AI agent to perform the same task twice, and you may receive different but equally plausible results. This variability makes it impossible to check an AI agent's output against a single expected answer in the way you would test traditional software. Two summaries of the same document might emphasise different points or structure information differently; both might be acceptable, and either might contain subtle errors that only a domain expert would catch. The challenge is not just that AI agents can be wrong; it is that their outputs occupy a spectrum from precisely correct to subtly flawed, and distinguishing between the two requires a fundamentally different approach to quality assurance.
This is compounded by a persistent and well-documented tendency: without deliberate controls, AI agents can and often do fabricate. Rather than simply making occasional mistakes, AI agents tend to generate plausible-sounding information when they encounter gaps in their knowledge or instructions. An AI agent might invent a statistic, cite a study that does not exist, claim to have checked a source when it has not, or conflate figures from different contexts to produce a number that appears authoritative but is wrong. The industry often calls this "hallucination," but "fabrication" better captures the reality: the output shows no signs of uncertainty, and the AI agent presents information it has generated rather than verified with the same confidence regardless of accuracy.
The question therefore shifts from "did it produce the right answer?" to "is this output good enough, and can we trust it?" This is why AI practitioners use a different term: evaluation (often shortened to "evals" in the industry). Evaluation means measuring quality, not just correctness; it includes checking outputs that must be exact alongside qualitative assessment of outputs where judgment is required. These problems are not intractable, but they do require a different approach to quality assurance, one designed for systems that are non-deterministic, prone to fabrication, and operating across complex multi-step processes. The following sections set out how to build that approach.
Two Types of Output, Two Types of Check
Modern AI agent workflows are multi-dimensional. A single response might combine quoted regulations, numerical calculations, risk assessments, recommendations, and system actions, each with different quality characteristics. Not all of these can be checked the same way, and recognising this distinction is the foundation of effective evaluation.
AI agent outputs fall into two categories. Exact outputs must be precisely correct and can be verified mechanically. Judgment outputs accept reasonable variation and require qualitative evaluation. When a legal AI agent reviews a franchise agreement, quoting clause 14.2 is an exact output; assessing the risk it poses is a judgment output. Most AI agents produce both in a single response, and each type needs a fundamentally different quality check.
Exact Outputs
Certain AI agent outputs must be precisely correct, and the checks for these are correspondingly specific:
Quoted text must match the source character-for-character. A fabricated quote, even one that sounds plausible, is a failure.
Citations and references must be verifiable: the cited study must exist, and it must say what the AI agent claims.
Specific data such as account numbers, client codes, and reference numbers must be accurate.
Calculations must be mathematically correct. An incorrect discount rate compromises every calculation downstream.
System actions must execute as specified: if an AI agent should update a record and notify the relationship manager, both steps must happen.
For these outputs, automated verification is essential, and the mechanism is straightforward: code can check whether a quote appears verbatim in a source document by running a string match against the original text; a separate calculation engine can independently recompute any figures the AI agent produces; and system logs can confirm whether an action was actually executed and completed successfully. Human reviewers cannot scale to check every quote against every source, but machines can and should.
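A verbatim-quote check of this kind takes only a few lines. The Python sketch below assumes whitespace-normalised string matching is an acceptable definition of "character-for-character" (so that line breaks in the source do not cause false failures); the clause text is illustrative:

```python
import re

def quote_is_verbatim(quote: str, source_text: str) -> bool:
    """Check that a quoted passage appears character-for-character in the source.

    Whitespace is collapsed before comparison so line breaks in the source
    do not cause false failures; everything else must match exactly.
    """
    def normalise(s: str) -> str:
        return re.sub(r"\s+", " ", s).strip()
    return normalise(quote) in normalise(source_text)

# Illustrative source clause and two candidate quotes from an agent's output.
source = ("14.2 The franchisee shall not assign this agreement\n"
          "without prior written consent.")
good_quote = "The franchisee shall not assign this agreement without prior written consent."
fabricated = "The franchisee may assign this agreement with notice."
```

Run against every output, a check like this costs effectively nothing and catches fabricated quotes that no human reviewer would have time to verify.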
We have written separately about how we designed our own research AI agents to prevent fabrication.² Rather than relying on instructing the AI agent to be accurate, we built safeguards into the architecture: the AI agent can only select from pre-verified sources, quotes are matched character-by-character against source documents before inclusion, and outputs are validated in real time. Prevention is better, but detection is essential.
Judgment Outputs
Other outputs have no single correct answer, and quality is a matter of judgment:
Summaries and synthesis can be written in many valid ways. The question is whether the summary captures what matters, not whether it matches predetermined text.
Recommendations and advice can be expressed differently while conveying the same substance.
Tone, style, and completeness involve subjective assessment that depends on context, audience, and purpose.
For these outputs, you need clearly defined criteria that specify what "good" looks like, and a separate AI model that reviews the AI agent's work against those criteria, acting as an independent judge. Earlier approaches favoured numerical scoring rubrics, where a judge model would rate outputs on a scale (for example, completeness: 1-5, accuracy: 1-5) across multiple dimensions. In practice, these rubrics are difficult to define well: what exactly distinguishes a 3 from a 4 on "completeness"? Different reviewers, whether human or AI, interpret scales differently, and the problem is compounded early in development when you do not yet fully understand the AI agent's failure modes. Current best practice has therefore moved towards binary pass/fail checks: rather than asking a judge to rate "completeness: 1 to 5," you ask a focused question with a yes-or-no answer, such as "Does this summary identify all material risks disclosed in the source document? Yes or No."¹⁴ Binary checks force clarity in the criteria themselves, produce more consistent results across repeated assessments, and make disagreements between the AI judge and human reviewers easier to diagnose. Each check should evaluate one specific quality dimension in isolation; Anthropic recommends grading each dimension with a separate, focused evaluation rather than asking a single judge to score multiple qualities simultaneously.⁷ Human experts validate that the criteria capture what actually matters, and periodically review a sample to confirm the AI judge's assessments align with expert judgment.
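As a sketch of what this looks like in practice: each binary check pairs one quality dimension with one focused yes-or-no question. The check definitions below are illustrative, and the `judge` callable is a placeholder for a call to a separate AI model:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class BinaryCheck:
    name: str       # exactly one quality dimension per check
    question: str   # a focused yes-or-no question put to the judge model

# Illustrative checks for a document-summary output.
CHECKS: List[BinaryCheck] = [
    BinaryCheck("completeness",
                "Does this summary identify all material risks disclosed in the "
                "source document? Answer Yes or No."),
    BinaryCheck("grounding",
                "Are all factual claims in this summary supported by the cited "
                "evidence? Answer Yes or No."),
]

def run_checks(output: str, judge: Callable[[str, str], str]) -> Dict[str, bool]:
    """Run each binary check through a judge callable; return pass/fail per dimension.

    `judge` stands in for a separate AI model: it receives the question and the
    output under review, and returns a 'Yes' or 'No' answer.
    """
    return {c.name: judge(c.question, output).strip().lower().startswith("yes")
            for c in CHECKS}
```

The structure enforces the discipline the text describes: one dimension per check, a binary verdict, and criteria that are explicit enough for a human reviewer to audit.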
Why This Distinction Matters
When you design evaluations, start by classifying each output as exact or judgment. The boundary between the two is a practical starting point rather than a rigid line; some outputs sit in between, and you will need to decide which type of check best fits based on the specific quality risk. For exact outputs, build specific automated checks: string matching for quotes, independent recalculation for figures, log verification for system actions. For judgment outputs, invest time in defining precise pass/fail criteria. A criterion that says "the summary should be good" tells you nothing. A criterion that asks "Does the summary identify all material risks, state the valuation range with supporting rationale, and flag areas where information was insufficient?" gives you something to measure against. Since most AI agents produce both types in a single response, your evaluation architecture needs to handle both, applying the right type of check to each part of the output.
What Typically Goes Wrong
Understanding how AI agents fail helps you focus evaluation efforts where they will have the most impact. Research has catalogued over a dozen distinct failure patterns in AI agents.⁴ The seven below are the most common across industries and use cases, and the ones most likely to affect professional services deployments. For each, we describe the pattern, give a concrete example, and explain how to catch or prevent it.
The AI Agent Starts Well, Then Fabricates
The AI agent performs correctly in the early steps, then introduces fabrications later in the workflow. Partial correctness builds false confidence: the user sees accurate information early on and assumes the rest is equally reliable. A tax research AI agent, for example, might accurately cite three relevant statutes, then fabricate a fourth that sounds plausible but does not exist.
How to catch it: Log every step of the AI agent's execution using an observability platform (such as Langfuse, Braintrust, or LangSmith) or a structured logging framework, not just the final output. At each step, record what data the AI agent accessed, what it produced, and what decisions it made. Automated checks that only validate the final message will miss fabrications introduced in earlier turns; you need checks at intermediate points throughout the workflow.
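The shape of such a step-level record can be sketched as follows; the field names are illustrative, and a commercial observability platform would capture richer detail automatically:

```python
import json
import time

class AgentTrace:
    """Minimal structured log of every step an agent takes.

    Each record captures what data the agent accessed, what it produced,
    and what it decided, so intermediate checks have evidence to inspect.
    """
    def __init__(self, task_id: str):
        self.task_id = task_id
        self.steps = []

    def record(self, step_name: str, data_accessed: list, output: str, decision: str):
        self.steps.append({
            "step": step_name,
            "timestamp": time.time(),
            "data_accessed": data_accessed,  # which sources the agent touched
            "output": output,                # what it produced at this step
            "decision": decision,            # what it decided to do next
        })

    def to_jsonl(self) -> str:
        """Serialise the trace one JSON record per line for downstream checks."""
        return "\n".join(json.dumps(s) for s in self.steps)
```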
One Mistake Cascades Into Bigger Problems
An early error propagates through subsequent decisions, and the final output might be internally consistent yet fundamentally wrong. A valuation AI agent that uses last year's revenue figure instead of this year's produces a cascade: every ratio, multiple, and recommendation downstream is based on the wrong starting point. In multi-agent systems, the problem multiplies because downstream AI agents receive and trust corrupted outputs from upstream AI agents.
How to catch it: Deliberately inject known errors into the workflow to see whether downstream AI agents catch the problem or propagate it. Build validation checks at handoff points between AI agents; for example, a rule that compares key figures against the source data before passing them to the next stage. If the injected error passes through unchallenged, your validation needs strengthening.
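A minimal version of such a handoff check, with an error-injection example, might look like this; the figure names and values are illustrative:

```python
def validate_handoff(payload: dict, source_figures: dict, tolerance: float = 0.0) -> list:
    """Compare key figures in an agent-to-agent handoff against the source data.

    Returns a list of discrepancies; an empty list means the handoff passed.
    """
    problems = []
    for name, expected in source_figures.items():
        actual = payload.get(name)
        if actual is None:
            problems.append(f"{name}: missing from handoff")
        elif abs(actual - expected) > tolerance:
            problems.append(f"{name}: {actual} does not match source value {expected}")
    return problems

# Error injection: deliberately corrupt one figure and confirm the check catches it.
source = {"revenue": 12_500_000, "ebitda": 2_100_000}
injected = {"revenue": 11_900_000, "ebitda": 2_100_000}  # wrong year's revenue
```

If a deliberately injected figure like the one above passes through unchallenged, the validation at that handoff needs strengthening.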
The AI Agent Works Around Rules Instead of Following Them
AI agents sometimes find creative workarounds that technically comply with instructions while violating the underlying intent. This is optimisation without understanding. Microsoft's taxonomy of agentic failure modes identifies tool misuse as a key risk category.⁵ A client onboarding AI agent told to verify identity documents might access an internal HR system to cross-reference names, technically finding the information but violating data access policies.
How to prevent it: Define what the AI agent cannot do, not just what it should do. An AI agent instructed to "find client contact information" might access systems it should not touch unless those systems were explicitly placed off-limits. Review the AI agent's execution logs regularly to check whether it is accessing only the systems and data sources it should be using. Assume the AI agent will find any gap you leave.
Tools Available Does Not Mean Tools Used
Having access to tools does not guarantee an AI agent will use them. An AI agent connected to a CRM system, a regulatory database, and a document repository can bypass these tools entirely and generate plausible-looking information from its training data instead. Worse, AI agents can claim to have consulted these systems when they have not, presenting fabricated data as though it came from an authoritative source. A client-facing AI agent asked to look up a customer's account status might produce a realistic-looking response complete with account numbers and recent activity, none of which came from the actual CRM.
This failure mode is particularly dangerous because the output looks exactly like what a properly functioning AI agent would produce. The only way to detect it is to independently verify that the AI agent actually made the tool calls it should have, and that the data in its response matches what those tools returned.
How to catch it: Log all tool calls alongside the AI agent's output. Build validation rules that cross-reference: if the AI agent's response contains data that should have come from a specific system, check that the corresponding tool call actually appears in the execution log and that the returned data matches what the AI agent presented. Flag any output containing system-sourced data where no corresponding tool call was made.
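One way to implement that cross-reference, sketched under the assumption that system-sourced identifiers follow a known pattern (the `ACC-` account-number format here is invented for illustration):

```python
import re

ACCOUNT_PATTERN = r"\bACC-\d{6}\b"  # illustrative account-number format

def verify_tool_grounding(response: str, tool_calls: list) -> list:
    """Flag account numbers in a response that never appeared in any tool result.

    `tool_calls` is a list of {"tool": ..., "result": ...} records taken from
    the execution log. Anything the agent presents that no tool returned is
    flagged as potentially fabricated.
    """
    claimed = set(re.findall(ACCOUNT_PATTERN, response))
    returned = set()
    for call in tool_calls:
        returned |= set(re.findall(ACCOUNT_PATTERN, call["result"]))
    return sorted(claimed - returned)  # claimed numbers with no source in the logs
```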
Self-Assessment Cannot Be Relied Upon
A tempting shortcut is to ask the AI agent whether it followed the rules. When asked "did you use only verified sources?" AI agents often claim compliance, even when demonstrably non-compliant. This is not deception in any intentional sense; language models are trained to be helpful, and asserting compliance can feel like the helpful response. We learned this early in our own work: AI agents would self-certify compliance while outputs contained fabricated citations. A compliance AI agent asked to confirm it checked all required regulatory databases might report "all sources verified" despite having accessed only two of the four required databases.
What works instead: Rather than relying on an AI agent's self-assessment, verification must come from outside the AI agent being verified, through a separate, independent process. In practice, this means using automated validators that check the AI agent's claims against objective evidence (for example, comparing the AI agent's citation list against actual source documents, or cross-referencing its "verified" status against the tool call logs to confirm it actually accessed the systems it claims to have checked). A different AI model can serve as an independent reviewer, or human reviewers can spot-check a sample. The principle is simple: the verifier and the verified must always be separate.
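A minimal independent validator along these lines might look as follows; the document index and IDs are illustrative. The key property is that the verdict depends only on objective evidence, never on the agent's own claim:

```python
def verify_citations(cited_ids: list, source_index: dict) -> dict:
    """Check an agent's citation list against an index of real source documents.

    The agent's self-reported compliance is ignored entirely; only the index
    of verified documents determines the verdict.
    """
    verified = [c for c in cited_ids if c in source_index]
    fabricated = [c for c in cited_ids if c not in source_index]
    return {"verified": verified,
            "fabricated": fabricated,
            "compliant": not fabricated}

# Illustrative index of documents known to exist.
SOURCE_INDEX = {"reg-2024-17": "Client Asset Rules",
                "reg-2023-04": "Reporting Standards"}
```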
Quality Can Degrade Over Time
An AI agent that passes evaluation before launch can degrade over weeks or months, a phenomenon practitioners call "drift." It takes several forms:
Data drift occurs when the information the AI agent relies on becomes outdated or changes in character. A regulatory database is updated, client records are restructured, or market conditions shift.
Model drift happens when the underlying AI model is updated by its provider, sometimes without notice, altering how the AI agent responds to the same inputs.
Concept drift means the definition of "good work" itself evolves: client expectations change, new regulations take effect, or business processes are restructured.
Judge drift affects the evaluation layer itself: the AI models used as judges are also updated by their providers, which can shift how they assess quality. An evaluator that was well-calibrated against human judgment in January may score differently in March. Re-align AI judges against human labels periodically to ensure your quality measurements remain trustworthy.
The AI agent you deployed in January might behave differently in March, and even if it behaves identically, "correct" might mean something different. Worse, the evaluations you built to catch problems might themselves shift without warning.
How to stay ahead of it: Define specific quality metrics for your AI agent, such as citation accuracy rate, pass rates per quality check, and task completion rate, and track them over time using your observability platform. Set concrete thresholds: for example, if your citation accuracy rate drops below 95%, or if the pass rate on any quality check falls below a defined minimum, configure the system to trigger an automated alert. When you see a downward trend, investigate which type of drift is responsible: check whether source data has changed (data drift), whether the model provider has released an update (model drift), whether business requirements have shifted (concept drift), or whether your AI judges have drifted from human consensus (judge drift), and address it at the source.
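A threshold check of this kind is straightforward to sketch; the metric names and minimums below are illustrative and would be tuned to your own deployment:

```python
def check_quality_thresholds(metrics: dict, thresholds: dict) -> list:
    """Compare rolling quality metrics against minimum thresholds.

    Returns the list of alerts to raise; an empty list means all metrics
    are within bounds.
    """
    alerts = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is not None and value < minimum:
            alerts.append(f"{name} at {value:.1%}, below minimum {minimum:.1%}")
    return alerts

# Illustrative thresholds: 95% citation accuracy, 90% task completion.
THRESHOLDS = {"citation_accuracy": 0.95, "task_completion": 0.90}
```

In practice this logic would run on a schedule against metrics exported from your observability platform, with alerts routed to whoever owns the AI agent's quality.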
The Reliability Problem
There is a critical distinction between an AI agent that can do something and an AI agent that will do it reliably. Passing an evaluation once demonstrates capability, not reliability, and the gap between the two is larger than most organisations expect. A January 2026 analysis of enterprise AI agent benchmarks found that this reliability gap affects every model tested: AI agents that score well on single-run evaluations show significant performance drops when asked to perform the same task repeatedly, and "all models had runs that derailed completely," not gradual degradation, but sudden coherence breakdowns.¹³ The underlying benchmarks, including τ²-bench and Vending-Bench 2, confirm the pattern across different task types and providers.³
For professional services, this matters enormously. If you are deploying an AI agent to handle client queries at a law firm or accounting practice, inconsistency means clients returning with the same issue may receive different responses. For regulated industries where advice must be consistent and auditable, this variability is unacceptable. As regulators become more active in overseeing AI use in financial services, robust evaluations also contribute to the auditable evidence that organisations will need to demonstrate compliance.¹¹
The distinction shows up in two ways of measuring success. The first asks: can the AI agent do this at all? If you run it multiple times, does it succeed at least once? Researchers call this "pass at k" (written as Pass@k); it is a prototyping metric. If the AI agent cannot succeed even once across several attempts, the task is beyond its current capability and you need a different approach. The second asks: will the AI agent do this reliably? Does it succeed every time across repeated attempts? Researchers call this "pass to the power of k" (written as Pass^k); this is the metric that matters for live deployment.
The difference is dramatic. Consider an AI agent that succeeds on any single attempt with 75% probability. Run it three times and the chance it succeeds at least once (Pass@3) is approximately 98%, which looks reassuring. But the chance it succeeds all three times (Pass^3) is approximately 42%. The prototyping metric says "this works"; the reliability metric says "it fails more often than it succeeds." In a firm processing thousands of client interactions, that gap translates directly into hundreds of errors that the capability metric would have masked.
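The arithmetic behind these two metrics is simple enough to verify directly:

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k runs (capability metric)."""
    return 1 - (1 - p) ** k

def pass_power_k(p: float, k: int) -> float:
    """Probability of success on every one of k runs (reliability metric)."""
    return p ** k

# An agent with a 75% single-run success rate:
p = 0.75
# pass_at_k(p, 3)    -> 0.984375  (~98%: "it works at least once")
# pass_power_k(p, 3) -> 0.421875  (~42%: "it fails more often than it succeeds")
```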
The practical response: Run each evaluation case multiple times to distinguish capability from reliability. The number of trials needed depends on the complexity of the task and the variability of inputs: for simple, well-constrained tasks with limited input variation, a handful of repeated runs may reveal the pattern. For complex workflows processing diverse real-world data, where different client records, document formats, or edge cases activate different paths through the AI agent, you need evaluation across a representative sample of your actual production data, which may require dozens or hundreds of runs to cover the meaningful variations. The key principle is to run enough trials that you are confident you are seeing the AI agent's reliable behaviour, not just its best-case performance.
Compare the results across trials. Consistent success gives you evidence of reliability. If results vary, examine the variation: consistent failure points to a capability gap requiring better instructions, architectural changes, or a different approach entirely. Intermittent failure is more nuanced; it might indicate a prompt that is not specific enough, poor handling of edge cases, or inherently variable model behaviour. Use this diagnostic information to decide where the AI agent can operate autonomously, where it needs human-in-the-loop checkpoints, and where full automation is not yet appropriate. The goal is not to wait for perfection but to know exactly where human oversight is still required.
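The triage described above can be expressed as a small classification over repeated-trial results; the labels are illustrative:

```python
def classify_reliability(trial_results: list) -> str:
    """Classify repeated-trial outcomes for a single evaluation case.

    `trial_results` is a list of booleans, one per run of the same case.
    """
    if all(trial_results):
        return "reliable"        # evidence of reliability at this sample size
    if not any(trial_results):
        return "capability gap"  # needs better instructions or a new approach
    return "intermittent"        # diagnose: prompt, edge cases, or model variance
```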
These reliability challenges are amplified when AI agents operate in multi-step workflows or collaborate with other AI agents, which is where the complexity of evaluation increases substantially.
Evaluating Multi-Step Workflows and Multi-Agent Systems
The failure patterns above become harder to detect as AI agent tasks grow in complexity. In practice, most production AI agents do not simply answer a question. They carry out structured workflows: retrieving information from multiple systems, passing data between processing steps, interacting with databases and CRMs, performing calculations, applying business rules, and producing a final deliverable. Some workflows involve multiple AI agents working together, with one AI agent retrieving data, another analysing it, a third generating recommendations, and a fourth executing actions in external systems.
The sheer complexity of these systems creates evaluation challenges of its own. More complicated workflows mean more potential failure points, not just at individual steps but in the interactions between them. Evaluating a multi-step AI agent requires you to deeply understand the business process being modelled, because your evaluations must be robust enough to cover all the meaningful paths the AI agent might take, the failure modes specific to each stage, and the success criteria at each handoff point. Evaluating these systems means checking quality at multiple points along the way, not just the final output.
Why Multi-Step Evaluation Is Different
In multi-step workflows, problems can hide at intermediate stages even when the final output looks acceptable. Consider a due diligence AI agent reviewing a target company's contracts. The final report might look reasonable, but perhaps the AI agent missed a category of documents during retrieval, miscategorised a material risk as minor, or applied the wrong client's risk thresholds. Evaluating only the final output would miss these intermediate failures.
Two specific technical challenges arise in extended workflows. Context drift occurs when the AI agent gradually loses track of original instructions as its working memory fills with intermediate work. State accumulation means early decisions constrain later ones: if an AI agent decides in step two that a document is not relevant, it will not consider that document in any subsequent step. Both of these problems compound over the length of the workflow, and both require specific evaluation techniques to detect, which the following subsection addresses.
An important distinction applies here: not everything needs to be evaluated; some things need to be observed. Evaluation means grading an output against defined quality criteria: did the AI agent produce the right result? Observation means recording what happened: which systems did the AI agent call, what data did it pass, how long did each step take? Observation feeds into evaluation but is also valuable in its own right for debugging, compliance, and understanding system behaviour. In multi-step workflows, you need both.
Techniques for Multi-Step Evaluation
Evaluate intermediate outputs, not just the final deliverable. Define checkpoints throughout the workflow where you can assess whether the AI agent is on track. For a contract review workflow, checkpoints might include: did the AI agent retrieve all relevant documents? Did it correctly identify the applicable jurisdiction? Did it flag all clauses matching the risk criteria? Each checkpoint has its own evaluation criteria, and each can be checked independently, using automated rules for exact outputs (such as whether the correct documents were retrieved) and AI-as-judge scoring for judgment outputs (such as whether the risk categorisation is appropriate).
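Checkpoints of this kind can be expressed as named predicates over the workflow's recorded state; the contract-review checkpoints and state fields below are illustrative:

```python
def run_checkpoints(workflow_state: dict, checkpoints: list) -> list:
    """Evaluate a workflow's intermediate state at each defined checkpoint.

    Each checkpoint is a (name, predicate) pair; predicates inspect recorded
    state rather than the final deliverable. Returns (name, passed) pairs.
    """
    return [(name, bool(predicate(workflow_state))) for name, predicate in checkpoints]

# Illustrative checkpoints for a contract review workflow.
CONTRACT_REVIEW_CHECKPOINTS = [
    ("all_documents_retrieved",
     lambda s: set(s["retrieved_docs"]) >= set(s["expected_docs"])),
    ("jurisdiction_identified",
     lambda s: s.get("jurisdiction") is not None),
]
```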
Use trajectory evaluation for high-stakes workflows. A trajectory is the complete record of every step the AI agent took: which systems it called, what data it accessed, what reasoning it applied, and what it decided at each point. Most observability platforms (such as Langfuse, Braintrust, or LangSmith) capture this automatically. Reviewing the full trajectory catches cases where the AI agent reached an acceptable answer through flawed reasoning, a result that might look correct this time but is unlikely to be repeatable.
Test for context retention. In long workflows, check whether the AI agent remembers key constraints from earlier in the interaction. A contract review AI agent told to focus on IP provisions should still be focusing on IP provisions ten steps later. Build specific test cases where an early instruction should constrain a later output, then verify that the constraint held; if the AI agent drifts from its original brief, this signals a context retention problem.
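A deliberately crude sketch of such a test, using keyword presence as a stand-in for a judge model's assessment (the terms and outputs are illustrative; a production version would pose the retention question to an AI judge):

```python
def context_retained(early_instruction_terms: list, later_output: str) -> bool:
    """Crude check that a later output still reflects an early constraint.

    An instruction from step one should still constrain step ten; here the
    proxy for 'still constrained' is that the instruction's key terms appear
    in the later output.
    """
    text = later_output.lower()
    return all(term.lower() in text for term in early_instruction_terms)
```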
Inject errors at intermediate steps. Deliberately introduce a known mistake, such as an incorrect figure in a data feed or a contradictory instruction, partway through the workflow. Observe whether the AI agent catches the error, flags it, or silently propagates it. This tests the robustness of the AI agent's validation at each stage.
Evaluate handoffs in multi-agent systems. When multiple AI agents work together, each handoff is a potential failure point. The data passed between AI agents must be complete, correctly structured, and interpreted as intended by the receiving AI agent. Build validation checks at each handoff that confirm the data arrived intact and matches what the sending AI agent produced. A system of individually capable AI agents can still fail as a whole if the handoffs are unreliable.
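A minimal structural check at a handoff might validate completeness and types against an agreed schema; the schema below is an illustrative assumption about what one agent passes to the next:

```python
def validate_handoff_payload(payload: dict, required_fields: dict) -> list:
    """Confirm a payload passed between agents is complete and correctly typed.

    `required_fields` maps field names to expected Python types. Returns a
    list of errors; an empty list means the handoff arrived intact.
    """
    errors = []
    for field, expected_type in required_fields.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            errors.append(f"wrong type for {field}: got {type(payload[field]).__name__}")
    return errors

# Illustrative schema for a research-agent -> analysis-agent handoff.
RESEARCH_TO_ANALYSIS_SCHEMA = {"client_id": str, "documents": list, "risk_flags": list}
```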
Building Evaluations That Work
The previous sections covered what can go wrong and why traditional testing misses it. This section covers the practical side: how to build quality checks that catch problems before they cause damage, what to check, when to check it, and how to scale evaluation as your AI agent deployments grow. The underlying discipline follows a cycle: analyse real outputs to understand how the AI agent fails, measure quality through targeted checks, and improve the AI agent's instructions and architecture based on what the measurements reveal. Each iteration through this cycle strengthens both the AI agent and the evaluations themselves.
Three Types of Quality Check
Effective evaluation draws on three approaches, each suited to different aspects of AI agent output. Most production systems combine all three.
Automated rule checks verify specific requirements: correct format, permitted value ranges, authorised actions, citations that match source documents. In practice, these are scripts or validation functions that run against every AI agent output. For example, a function that checks whether every regulation quoted by the AI agent exists in the regulatory database, or a rule that confirms all numerical outputs fall within expected ranges. They run fast, examine every output, and cost almost nothing at the margin. The limitation is that they cannot assess reasoning quality, completeness, or nuance.
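Both kinds of rule check described above can be sketched in a few lines. This is a minimal sketch in which a set of known regulation identifiers stands in for the regulatory database; the identifiers themselves are invented for illustration.

```python
# Two automated rule checks: citation verification against a known set,
# and a permitted-range check for numerical outputs.

KNOWN_REGULATIONS = {"Reg. A-101", "Reg. B-205", "Reg. C-317"}

def check_citations(cited: list[str]) -> list[str]:
    """Return any citations that do not exist in the regulatory database."""
    return [c for c in cited if c not in KNOWN_REGULATIONS]

def check_range(value: float, low: float, high: float) -> bool:
    """Confirm a numerical output falls within its expected range."""
    return low <= value <= high
```

Checks like these run on every output at negligible marginal cost, which is why they form the base layer of the evaluation stack.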
AI-as-judge checks use a separate AI model to assess outputs against defined criteria, capturing quality dimensions that resist simple rules, such as whether a summary identified the most material risks, or whether a recommendation is appropriately qualified. In practice, each check should focus on one specific quality dimension and return a binary pass/fail verdict. For example, rather than asking a single judge to score completeness, accuracy, and relevance simultaneously on a numerical scale, you run three separate checks: "Does the summary cover all material risks identified in the source? Yes/No." "Are all factual claims supported by the cited evidence? Yes/No." "Is the recommendation relevant to the client's stated objectives? Yes/No."¹⁴ This approach is more reliable than numerical scoring because it gives the judge a narrower, better-defined task, reducing variability and making disagreements between the AI judge and human reviewers easier to diagnose.⁷ The limitation is that they add cost and latency per output.
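One such binary check might look like the sketch below. The judge takes the model call as a parameter so the verifier stays separate from the verified; `stub_model` is purely illustrative scaffolding so the example runs, and a real deployment would pass a call to a separate AI model instead.

```python
# Sketch of a single-dimension, binary AI-as-judge check.

JUDGE_PROMPT = (
    "Does the summary cover all material risks identified in the source? "
    "Answer with exactly one word: Yes or No.\n\n"
    "Source risks: {risks}\nSummary: {summary}"
)

def judge_completeness(summary: str, risks: list[str], call_model) -> bool:
    """Binary pass/fail: does the summary cover all material risks?"""
    prompt = JUDGE_PROMPT.format(risks="; ".join(risks), summary=summary)
    return call_model(prompt).strip().lower().startswith("yes")

def stub_model(prompt: str) -> str:
    """Illustrative stub: says Yes only if every listed risk appears verbatim."""
    lines = prompt.splitlines()
    risks_line = next(x for x in lines if x.startswith("Source risks: "))
    summary_line = next(x for x in lines if x.startswith("Summary: "))
    risks = risks_line.removeprefix("Source risks: ").split("; ")
    summary = summary_line.removeprefix("Summary: ")
    return "Yes" if all(r in summary for r in risks) else "No"
```

Keeping the verdict binary, as here, is what makes disagreements between the AI judge and human reviewers easy to diagnose.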
Human review remains the gold standard for quality assessment. Human experts catch subtleties that neither rules nor AI judges reliably detect, particularly in areas where domain expertise, client context, or regulatory nuance matters. The limitation is that it is expensive, slow, and does not scale to every output.
In practice, you layer all three. Automated checks run on every output. AI judges score a larger sample. Human experts review a smaller sample. The percentages vary by risk: a low-stakes internal tool might use 100% automated, 10% AI judge, and 1% human review. A client-facing advisory tool might use 100%, 50%, and 10% respectively.
Combine Outcome Evaluation with Process Observation
Focus evaluation primarily on what the AI agent produced; the final output is what affects clients and business outcomes.⁷ However, a correct answer reached through flawed or unusual reasoning is a warning sign, not a success. An AI agent that arrives at the right conclusion by chance or through faulty logic is unlikely to do so consistently. In professional services, the reasoning path also matters for auditability: regulators, clients, and internal compliance teams may need to understand not just what the AI agent concluded, but how it got there.
The practical approach is to combine both. Use outcome-based evaluation as the primary measure: did the output meet the quality criteria? Then use trajectory review as a diagnostic tool: was the reasoning sound, did the AI agent use appropriate sources, did it follow the expected process? When an AI agent produces a correct output through questionable reasoning, treat it as a reliability risk to investigate, not a pass.
Build Evaluations From Two Directions
Top-down: define what good looks like. Start from business requirements and work downward to specific evaluation criteria: completeness, accuracy, relevance, appropriate tone. These outcome-based evaluations tell you whether the AI agent is producing good work.
Bottom-up: analyse how the AI agent actually fails. Review a sample of real AI agent outputs, not against a predefined checklist, but with an open question: what went wrong here? Practitioners call this "error analysis," and it is the foundation of effective evaluation design.¹⁴ The process is straightforward: collect 20-50 real outputs as a starting foundation (complex systems benefit from reviewing 50-100 or more for fuller failure coverage), review each one, and write open-ended notes on every problem you observe. Then group similar problems into categories. You will discover failure patterns you did not anticipate, patterns that would never appear in a top-down requirements document. These failures become permanent test cases.
Both directions are necessary. Top-down evaluations ensure the AI agent meets business requirements. Bottom-up error analysis catches the failures you did not know to look for, and these are often the most damaging in practice.⁷
Diagnose Before You Build
Not every failure needs an evaluator. When error analysis reveals a problem, the first question is whether you can even classify the failure clearly. If you cannot, the answer is more error analysis, not more evaluators; go back and review additional outputs until the pattern becomes clear.¹⁴ Once you can classify the failure, ask whether the AI agent's instructions are the cause. If the AI agent mishandles a scenario because the prompt never addressed it, or because the instructions were ambiguous, that is a specification problem: fix the prompt. If the AI agent fails despite clear, specific instructions, that is a generalisation problem: the model cannot reliably perform the task as described, and you need an evaluator to catch the failures, or a different architectural approach.
This distinction matters because building evaluators is more expensive than fixing prompts. In practice, a significant proportion of the failures teams discover through error analysis turn out to be specification problems, fixable by clarifying the AI agent's instructions rather than building new quality checks. The discipline of asking "is this a prompt problem or a capability problem?" before writing evaluation code saves considerable effort and leads to better-performing AI agents.
When to Write Evaluations
The most effective approach is to write evaluations before building the AI agent. This mirrors established practices in both software engineering and product management. In test-driven development, engineers write tests before writing code. In agile product management, product owners define acceptance criteria before development begins: specific, measurable conditions that a feature must meet to be considered complete. The same discipline applies to AI agents.
A consulting firm building a proposal drafting AI agent would define acceptance criteria before writing any code: correct client name throughout, pricing matching the approved rate card, terms complying with contracting policy, all statistics verified against source documents. Each criterion becomes a measurable evaluation check. At Serpin, we formalised this through an agent-specific product lifecycle document that defines evaluation criteria alongside functional requirements from the outset. The question "how will we know it works?" is answered before "how will we build it?"
If you are working with an existing AI agent, start by cataloguing what the AI agent produces, classifying each output type, and building evaluations working backward from desired outcomes.
When to Run Evaluations
During development, run evaluations after every change, multiple times, and examine the distribution of results. After deployment, run evaluations continuously, because performance drifts over time and the only way to detect it is to keep measuring. This creates a feedback loop: problems detected in production become new evaluation cases, which help catch similar problems earlier next time.
Making Quality Visible: Observability and Evaluation at Scale
As AI agent deployments grow, manual evaluation becomes impractical. You need systems that make quality measurement automatic and continuous. The industry term for this capability is "observability": the ability to understand what is happening inside a system by examining its outputs.
For traditional software, observability means recording what happened (logs), measuring performance (metrics), and tracking request flow (traces). AI agent observability extends this to capture what the AI agent decided to do, why, what information it considered, and what it ignored. A growing ecosystem of platforms supports this, including both commercial options (such as Braintrust and LangSmith) and open-source alternatives (such as Langfuse and Arize Phoenix). The specific platform matters less than having the capability, but the capability itself is essential at scale.
Three Components of AI Agent Quality Infrastructure
Effective AI agent observability combines three components, each serving a different purpose. Together, they form the quality infrastructure that makes ongoing evaluation possible.
AI agent tracing records every step the AI agent took: which systems it called, what data it accessed, what reasoning it applied. Think of it as a flight recorder for AI agent interactions. In practice, most observability platforms provide tracing out of the box. You instrument your AI agent code by adding trace-emitting calls at each significant step (for example, wrapping each tool call or model invocation in a tracing context that records inputs, outputs, and timing), and the platform captures, stores, and makes them searchable. Without tracing, debugging a failed AI agent output means guessing what went wrong. With tracing, you can replay the exact sequence of events and identify where the process broke down.
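Absent a platform, the instrumentation pattern can be sketched with a decorator that records inputs, outputs, and timing for each significant step into an in-memory trace. This is a minimal sketch; platforms such as Langfuse or LangSmith provide richer, persistent versions of the same idea.

```python
# A minimal "flight recorder": each traced step appends a record of its
# inputs, output, and duration to a trace that can be replayed later.

import time
from functools import wraps

TRACE: list[dict] = []

def traced(step_name: str):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TRACE.append({
                "step": step_name,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": result,
                "duration_s": time.perf_counter() - start,
            })
            return result
        return wrapper
    return decorator

@traced("lookup_rate_card")  # illustrative tool call
def lookup_rate_card(tier: str) -> float:
    return {"gold": 0.8, "standard": 1.0}[tier]
```

When an output fails, the trace shows exactly which step produced the bad intermediate result.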
Scoring applies quality checks against defined criteria, transforming raw outputs into trackable quality signals. In practice, you configure your observability platform to run automated rule checks (such as citation verification or format validation) and AI-as-judge assessments against each traced output, either in real time or as a post-processing step. A score might indicate that 94% of citations were verified, or that a summary passed four of five quality checks. These scores are logged alongside each trace, creating a continuous quality record that you can query and analyse over time.
Monitoring aggregates scores over time, detects patterns, and alerts when quality degrades. In practice, this means configuring dashboards in your observability platform that display rolling averages of your key quality metrics (citation accuracy, pass rates per quality check, task completion rate) and setting alert thresholds. For example, you might configure a notification if the seven-day rolling average of citation accuracy drops below 95%, or if the proportion of outputs failing any quality check exceeds 10%. A single failed output is an incident. A downward trend in scores across a week reveals a systemic problem that needs investigation.
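The rolling-average alert described above reduces to a few lines of logic. This sketch uses the illustrative values from the text (a seven-day window, a 95% threshold); a real deployment would wire the returned signal into a notification channel.

```python
# Rolling-average quality monitor: records one score per day and reports
# whether the windowed average has fallen below the alert threshold.

from collections import deque

class RollingQualityMonitor:
    def __init__(self, window_days: int = 7, threshold: float = 0.95):
        self.scores = deque(maxlen=window_days)  # old days drop off automatically
        self.threshold = threshold

    def record_daily_score(self, score: float) -> bool:
        """Record a day's score; return True if an alert should fire."""
        self.scores.append(score)
        average = sum(self.scores) / len(self.scores)
        return average < self.threshold
```

Note that a single bad day need not fire the alert; it is the trend that signals a systemic problem.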
The Automation Progression
Building evaluation capability is iterative. Most teams progress through four stages, each building on the previous one.
Start with human evaluation and error analysis to understand what good looks like and how the AI agent actually fails. Have domain experts review a sample of 20-50 real AI agent outputs as a starting foundation (expanding to 50-100 or more for complex systems) with an open question: what is wrong with this? Write free-form notes on every problem observed, then group similar problems into categories. This dual process builds a shared understanding of quality and reveals the AI agent's actual failure patterns, which are often different from what the team expected. Before building evaluators for the failures you discover, check whether each one is a specification problem (fixable by improving the AI agent's instructions) or a generalisation problem (requiring an evaluator to catch). The output of this stage is a set of clearly defined quality criteria, a taxonomy of real failure modes, and an improved prompt.
Codify patterns into automated rules once you understand them. If human reviewers consistently flag the same types of errors, such as missing citations, calculations that do not sum correctly, or responses that exceed authorised scope, write automated checks that catch these patterns. These rules run on every output, instantly and at no marginal cost. The output is a growing library of automated validators that catch known failure modes.
Add AI-as-judge evaluation for nuanced assessment that resists simple rules. Configure a separate AI model to assess outputs against your defined criteria, with each check focused on one specific quality dimension and returning a binary pass/fail verdict.¹⁵ In production monitoring, many of these checks are "reference-free," meaning the judge assesses quality without a pre-defined correct answer, relying instead on the criteria themselves to determine whether the output is acceptable. Run these assessments on a sample of outputs, typically 10-50% depending on risk level. The output is a set of quality signals that track trends over time.
Reserve human review for validation and edge cases. Human experts review a smaller sample to confirm that automated checks and AI judges are calibrated correctly, and to handle cases where automated assessment is uncertain. This is also where you update criteria and rules as the definition of "good" evolves.
Managing Evaluation Costs
Comprehensive evaluation has real costs: compute time for AI-as-judge checks, staff time for human review, and platform fees. In our experience building and deploying AI agent systems, evaluation typically adds 10-20% to the running cost of the agent system itself, but prevents failures that cost multiples of that amount. Three strategies keep costs proportionate to value.
Tiered evaluation matches the level of scrutiny to the level of risk. To implement this, categorise each AI agent output type by its potential business impact. Low-stakes outputs, such as internal status summaries or routine notifications, run only automated rule checks on every output. Medium-stakes outputs, such as client-facing communications or research summaries, add AI-as-judge scoring on a significant sample. High-stakes outputs, such as regulatory filings, financial recommendations, or advisory work where errors could result in penalties or client harm, warrant human expert review. Define these categories during the evaluation design phase, not ad hoc, so that every output is routed to the appropriate level of scrutiny automatically.
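Such routing can be expressed as a simple policy table defined once, during evaluation design. The output types, tiers, and sampling rates below are illustrative assumptions, not fixed recommendations.

```python
# Tiered evaluation routing: each output type maps to a risk tier, and
# each tier declares which checks run and at what sampling rate.

TIER_POLICY = {
    "low":    {"rule_checks": 1.0, "ai_judge": 0.0, "human": 0.0},
    "medium": {"rule_checks": 1.0, "ai_judge": 0.5, "human": 0.0},
    "high":   {"rule_checks": 1.0, "ai_judge": 1.0, "human": 0.1},
}

OUTPUT_TIERS = {
    "status_summary": "low",
    "client_email": "medium",
    "regulatory_filing": "high",
}

def evaluation_plan(output_type: str) -> dict:
    """Look up which checks apply to an output type, and at what rate."""
    return TIER_POLICY[OUTPUT_TIERS[output_type]]
```

Because the mapping is declared up front, every output is routed to the appropriate level of scrutiny automatically rather than ad hoc.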
Sampling strategies reduce cost without sacrificing the ability to detect quality problems. Rather than running AI-as-judge checks on every AI agent output, you run them on a representative sample selected randomly from the full population of responses the AI agent produces during a given period. In our deployments, we find that sampling around 20% of outputs provides a reliable signal for detecting quality trends, though the appropriate rate depends on the volume and variability of your AI agent's workload: higher-volume, lower-variability systems can sample less; lower-volume or highly variable systems may need more. Targeted sampling, where you focus additional scrutiny on outputs that automated checks flag as borderline or unusual, concentrates effort where it is most likely to find problems.
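A minimal sampling sketch, using the ~20% rate mentioned above and a fixed seed so the selection is reproducible for audit purposes:

```python
# Random sampling of outputs for AI-as-judge review. Seeding the generator
# means the same population always yields the same sample.

import random

def sample_for_judging(output_ids: list[str],
                       rate: float = 0.2,
                       seed: int = 0) -> list[str]:
    """Select a reproducible random sample of outputs for judge review."""
    rng = random.Random(seed)
    k = max(1, round(len(output_ids) * rate))
    return rng.sample(output_ids, k)
```

Targeted sampling would extend this by always including outputs that automated checks flagged as borderline, on top of the random baseline.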
Asynchronous evaluation decouples quality measurement from user-facing response times. The user receives their response immediately. Evaluation runs in the background, with results feeding into dashboards and alerting systems. Problems are still detected and flagged, typically within minutes rather than the weeks it might take for a human to notice an issue organically.
Building a Robust Evaluation Process
If you are deploying AI agents, or considering doing so, here is how to apply what we have covered. The starting point is not the AI agent itself; it is the business process the AI agent is intended to support.
Start from the Business Process
Before evaluating what an AI agent produces, define what the business process requires. What are the intended outcomes? What actions must be performed? What standards must be met? A client onboarding process, for example, requires identity verification, risk assessment, regulatory compliance checks, and record creation. A contract review process requires document retrieval, clause analysis, risk identification, and a structured report. These requirements exist independently of whether the work is done by a person or an AI agent. They are your evaluation criteria. The question then becomes: does the AI agent meet these requirements reliably?
Essential Practices
Map the business process the AI agent supports. Define the required outcomes, actions, and quality standards; these become your evaluation criteria. For each step, identify what a correct result looks like and what would constitute a failure.
Classify each output as exact or judgment. Determine which outputs must be precisely correct (quoted text, figures, system actions) and which accept reasonable variation (summaries, recommendations, assessments). This classification determines which type of check you build.
Build specific automated checks for exact outputs. String matching for quotes, independent recalculation for figures, execution log verification for system actions, and cross-referencing tool call responses for retrieved data.
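Two of these exact-output checks can be sketched directly; the rounding tolerance for figures is an assumption that would be set to match your billing precision.

```python
# Exact-output checks: verbatim quote matching against the source document,
# and independent recalculation of a quoted total.

def quote_is_verbatim(quote: str, source_text: str) -> bool:
    """An exact output: the quoted text must appear verbatim in the source."""
    return quote in source_text

def total_is_correct(line_items: list[float], quoted_total: float) -> bool:
    """Recompute the figure independently rather than trusting the agent."""
    return abs(sum(line_items) - quoted_total) < 0.005  # tolerate sub-cent rounding
```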
Define pass/fail criteria for judgment outputs. For each quality dimension that matters, write a specific yes-or-no question: "Does the summary identify all material risks?" "Are all factual claims supported by the cited evidence?" Each question becomes a separate check, evaluated independently. Test your criteria by having both an AI judge and a human expert assess the same outputs, then refine until their pass/fail decisions align consistently.
Conduct error analysis on real outputs before finalising evaluations. Review 20-50 real outputs (more for complex systems) with an open question: what went wrong? Group problems into categories, determine whether each is a specification problem (fix the prompt) or a generalisation problem (build an evaluator), and fix specification problems first. This bottom-up analysis reveals failure patterns that top-down requirements analysis alone would miss.⁷ ¹⁴
Ensure the verifier is always separate from the verified. Never rely on the AI agent's own assessment of its work. Use automated validators, a different AI model as independent judge, or human reviewers, and make the separation explicit in your evaluation architecture.
Building Evaluation Discipline
For new AI agents, write evaluations before building. Define acceptance criteria upfront, while changes are still cheap. Each criterion should be specific enough to be testable: not "the AI agent should handle client queries" but "the AI agent should correctly identify the client's account tier, retrieve the current rate card, and produce a quote within 2% of the manually calculated figure."
Run repeated trials to distinguish capability from reliability. A single successful run proves capability, not reliability. Run enough trials across representative inputs that you are measuring the AI agent's consistent behaviour, not its best-case performance. The number depends on task complexity and input diversity, from a handful for well-constrained tasks to dozens or hundreds for complex workflows with varied real-world data.
Instrument the AI agent with tracing and build continuous monitoring. Record every tool call, model invocation, and decision point using an observability platform or structured logging. Build dashboards tracking rolling averages of key quality metrics and configure automated alerts when those metrics fall below defined thresholds.
Use evaluation results to determine where human oversight is needed. Not every task will be suitable for full automation. Evaluation data reveals where the AI agent performs reliably enough to operate autonomously, where human-in-the-loop checkpoints should be built in, and where full automation is not yet appropriate. Let the data drive these decisions rather than assumptions.
Close the feedback loop. Every production incident becomes a permanent evaluation case. Every edge case discovered in production strengthens the evaluation suite. Schedule regular reviews, monthly at minimum, where the team examines recent failures, updates evaluation cases, and refines criteria based on what the data is showing.
Advanced Practices
These practices typically require dedicated AI engineering capability or specialist support.
Layer your evaluation approaches. Run automated checks on every output, AI-as-judge scoring on a broader sample (typically 10-50% depending on risk level), and human expert review on a smaller sample. Calibrate the AI judge against human reviewers periodically: measure the judge's true positive rate (does it catch the problems humans catch?) and true negative rate (does it pass the outputs humans pass?) separately, rather than relying on raw agreement rate. A judge that always says "pass" will appear highly accurate if most outputs are good, but it will catch zero failures.
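Calibration reduces to two separate rates computed from paired human and judge verdicts on the same outputs. This sketch assumes the calibration sample contains both outputs the humans failed and outputs they passed.

```python
# Judge calibration: compute true positive rate and true negative rate
# separately, rather than a single raw agreement rate. A "positive" here
# is an output the human reviewers failed.

def calibrate_judge(human_fail: list[bool], judge_fail: list[bool]) -> dict:
    """human_fail[i] / judge_fail[i]: did each reviewer fail output i?"""
    pairs = list(zip(human_fail, judge_fail))
    bad = [(h, j) for h, j in pairs if h]       # outputs humans failed
    good = [(h, j) for h, j in pairs if not h]  # outputs humans passed
    return {
        # Of the outputs humans failed, how many did the judge also fail?
        "true_positive_rate": sum(j for _, j in bad) / len(bad),
        # Of the outputs humans passed, how many did the judge also pass?
        "true_negative_rate": sum(not j for _, j in good) / len(good),
    }
```

A judge that always passes everything would score a perfect true negative rate and a true positive rate of zero, which is exactly the failure mode raw agreement hides.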
Evaluate multi-step workflows at intermediate checkpoints, not just final outputs. Define specific evaluation criteria for each stage of the workflow and build validation checks at every handoff point between AI agents or processing stages.
Measure reliability, not just capability. Track whether your AI agent succeeds consistently across repeated trials with diverse inputs, not just whether it can succeed at least once. Use the distinction between capability metrics (Pass@k) and reliability metrics (Pass^k) to guide deployment decisions.
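Both metrics can be computed from the same set of repeated trials. In this sketch, `trials` maps each task to its list of pass/fail results across k runs.

```python
# Capability vs reliability from repeated trials: Pass@k asks whether any
# trial of a task succeeded; Pass^k asks whether every trial did.

def pass_at_k(trials: dict[str, list[bool]]) -> float:
    """Capability: fraction of tasks where at least one trial succeeded."""
    return sum(any(results) for results in trials.values()) / len(trials)

def pass_hat_k(trials: dict[str, list[bool]]) -> float:
    """Reliability: fraction of tasks where every trial succeeded."""
    return sum(all(results) for results in trials.values()) / len(trials)
```

The gap between the two numbers is the reliability gap: an agent with high Pass@k but low Pass^k can do the work, but will not do it consistently.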
Next Steps
The difference between organisations that deploy AI agents successfully and those that encounter costly failures is not the sophistication of their tools or the size of their team. It is evaluation discipline: the willingness to measure quality rigorously, understand what the measurements reveal, and act on them.
AI agents can fail while appearing to succeed. They can produce confident, well-formatted outputs containing errors that only surface after damage is done. They can bypass the tools they were given and fabricate data instead. They can perform brilliantly on Tuesday and fail on Wednesday. The only way to know whether your AI agent is doing good work is to measure it systematically, continuously, and from multiple angles, using checks that are independent of the AI agent itself.
The techniques in this article are not theoretical. They reflect what we have learned building AI agents that handle real client work, where errors have consequences and consistency is required. Evaluation is not a phase of development you complete and move past. It is an ongoing discipline, one that determines whether your AI agents can be trusted with the work you are giving them. The organisations that will succeed with AI agents are those that treat evaluation with the same rigour they apply to financial controls, client quality standards, and regulatory compliance.
For security-specific considerations, see Securing AI Agents: What We've Learned Building Them.⁶ For operational practices, see Running AI Agents: What Changes When the Bot Joins the Team.⁸ If you are building or evaluating AI agents in your organisation, we welcome the conversation.
Sources
Gartner (2025). 'Gartner Predicts 40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026.' Press release, 26 August 2025. Available at: https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025
Serpin (2025). 'How We Designed a Zero-Fabrication Research Agent.' Available at: https://serpin.ai/insights/how-we-designed-a-zero-fabrication-research-agent
Yao, S., Shinn, N., Razavi, P. and Narasimhan, K. (2024). 'τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains.' arXiv:2406.12045. Available at: https://arxiv.org/abs/2406.12045
Maxim AI (2025). 'Diagnosing and Measuring AI Agent Failures: A Complete Guide.' Available at: https://www.getmaxim.ai/articles/diagnosing-and-measuring-ai-agent-failures-a-complete-guide/
Microsoft AI Red Team (2025). 'Taxonomy of Failure Modes in Agentic AI Systems.' April 2025. Available at: https://www.microsoft.com/en-us/security/blog/2025/04/24/new-whitepaper-outlines-the-taxonomy-of-failure-modes-in-ai-agents/
Serpin (2026). 'Securing AI Agents: What We've Learned Building Them.' Available at: https://serpin.ai/insights/securing-ai-agents-what-we-ve-learned-building-them
Anthropic (2026). 'Demystifying Evals for AI Agents.' Engineering blog, 9 January 2026. Available at: https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
Serpin (2026). 'Running AI Agents: What Changes When the Bot Joins the Team.' Available at: https://serpin.ai/insights/running-ai-agents-what-changes-when-the-bot-joins-the-team
Serpin (2025). 'How AI Agents Are Transforming Professional Services — and How to Implement Successfully.' Available at: https://serpin.ai/insights/how-ai-agents-are-transforming-professional-services-and-how-to-implement-successfully
Financial Times (2026). 'KPMG negotiated lower fees from auditor by citing AI efficiency gains.' February 2026. Available at: https://www.ft.com/content/c891c47c-b21f-4e0f-84b3-b80c794eff3d
Serpin (2025). 'More AI Regulation Is Coming in Financial Services.' Available at: https://serpin.ai/insights/more-ai-regulation-is-coming-in-financial-services
CNBC (2026). 'Goldman Sachs taps Anthropic's Claude to automate accounting, compliance roles.' 6 February 2026. Available at: https://www.cnbc.com/2026/02/06/anthropic-goldman-sachs-ai-model-accounting.html
Simmering, P. (2026). 'The Reliability Gap: Agent Benchmarks for Enterprise.' 4 January 2026. Available at: https://simmering.dev/blog/agent-benchmarks/
Husain, H. and Shankar, S. (2026). 'LLM Evals: Everything You Need to Know.' Available at: https://hamel.dev/blog/posts/evals-faq/
Langfuse (2025). 'Automated Evaluations of LLM Applications.' Available at: https://langfuse.com/blog/2025-09-05-automated-evaluations
What AI Agents Are, and Why Evaluation Matters Now
Adoption is accelerating across the sector.⁹ Law firms deploy AI agents for contract review and due diligence, accounting firms use them for tax research and audit preparation, and consulting firms have them synthesising market intelligence and drafting proposals. Gartner projects that 40% of enterprise applications will feature task-specific AI agents by the end of 2026, up from less than 5% in 2025.¹ The scale of investment makes this concrete: in early 2026, the Financial Times reported that KPMG negotiated lower fees from its own auditor, arguing partly that AI should make the work more efficient.¹⁰ Separately, Goldman Sachs disclosed it has been working with Anthropic to build AI agents for trade accounting and client onboarding, functions that combine calculations that must be exactly right, qualitative assessments such as client risk profiling, and mandatory regulatory compliance.¹²
These are not experimental projects, and the firms deploying them have direct financial and regulatory exposure to AI agent errors. In our work with professional services firms, we see the same pattern: organisations that have deployed AI agents but have no systematic way to know whether those agents are producing reliable work. The capability that makes AI agents valuable is also what makes them risky.
Three characteristics combine to make structured evaluation essential:
Cascading complexity. AI agents handle complex, multi-step processes where a single error can cascade through every subsequent decision.
Direct system access. They typically have direct access to business systems, client data, and internal documents, meaning an AI agent that takes the wrong action can cause real damage.
Non-determinism. The same instruction can produce different results on different runs. You cannot test an AI agent once, confirm it works, and assume it will always behave the same way.
This fundamental unpredictability means that ongoing, structured evaluation is not optional; it is the only way to know whether your AI agent is performing reliably. Beyond quality assurance, structured evaluation supports broader risk management and governance objectives. For regulated industries, it provides the auditable evidence of due diligence that compliance teams and external auditors increasingly expect when AI systems are making or influencing business decisions.
Why "It Seems to Work" Is Not Enough
When traditional software fails, the failure is usually visible: an error message appears, a process crashes, a dashboard turns red. The system itself signals that something has gone wrong, and teams can respond. AI agents fail differently. They produce outputs that look entirely reasonable while being subtly wrong: the system shows no errors, the logs indicate successful completion, and the output is grammatically correct, properly formatted, and entirely confident. Yet the content contains errors, and nothing in the AI agent's behaviour alerts anyone to the fact.
Consider what this looks like in practice. A research AI agent at an accountancy firm summarises a due diligence report but omits a material risk buried on page forty-seven; the summary reads well and hits all the expected sections, but the omission is invisible unless someone reads the original document and notices what was left out. A client-facing assistant at a wealth management firm provides investment guidance based on a regulation that was updated last month; the advice sounds authoritative, but the error only surfaces when compliance reviews the interaction three weeks later. A pricing AI agent calculates a quote using the wrong client tier; the number looks entirely plausible, no warning is raised, no exception is thrown, and the mistake costs the firm money on every transaction until someone notices the margin compression and traces it back to the AI agent.
None of these failures trigger alerts, yet all of them have consequences, and all of them passed whatever basic checks were in place.
Why Traditional Testing Misses These Problems
Traditional software is deterministic: give it the same input, and it produces the same output. You can test it once, confirm it works, and move on. AI agents do not work this way. Ask an AI agent to perform the same task twice, and you may receive different but equally plausible results. This variability makes it impossible to check an AI agent's output against a single expected answer in the way you would test traditional software. Two summaries of the same document might emphasise different points or structure information differently; both might be acceptable, or one might contain subtle errors that only a domain expert would catch. The challenge is not just that AI agents can be wrong; it is that their outputs occupy a spectrum from precisely correct to subtly flawed, and distinguishing between the two requires a fundamentally different approach to quality assurance.
This is compounded by a persistent and well-documented tendency: without deliberate controls, AI agents can and often do fabricate. Rather than simply making occasional mistakes, AI agents tend to generate plausible-sounding information when they encounter gaps in their knowledge or instructions. An AI agent might invent a statistic, cite a study that does not exist, claim to have checked a source when it has not, or conflate figures from different contexts to produce a number that appears authoritative but is wrong. The industry often calls this "hallucination," but "fabrication" better captures the reality: the output shows no signs of uncertainty, and the AI agent presents information it has generated rather than verified with the same confidence regardless of accuracy.
The question therefore shifts from "did it produce the right answer?" to "is this output good enough, and can we trust it?" This is why AI practitioners use a different term: evaluation (often shortened to "evals" in the industry). Evaluation means measuring quality, not just correctness; it includes checking outputs that must be exact alongside qualitative assessment of outputs where judgment is required. These problems are not intractable, but they do require a different approach to quality assurance, one designed for systems that are non-deterministic, prone to fabrication, and operating across complex multi-step processes. The following sections set out how to build that approach.
Two Types of Output, Two Types of Check
Modern AI agent workflows are multi-dimensional. A single response might combine quoted regulations, numerical calculations, risk assessments, recommendations, and system actions, each with different quality characteristics. Not all of these can be checked the same way, and recognising this distinction is the foundation of effective evaluation.
AI agent outputs fall into two categories. Exact outputs must be precisely correct and can be verified mechanically. Judgment outputs accept reasonable variation and require qualitative evaluation. When a legal AI agent reviews a franchise agreement, quoting clause 14.2 is an exact output; assessing the risk it poses is a judgment output. Most AI agents produce both in a single response, and each type needs a fundamentally different quality check.
Exact Outputs
Certain AI agent outputs must be precisely correct, and the checks for these are correspondingly specific:
Quoted text must match the source character-for-character. A fabricated quote, even one that sounds plausible, is a failure.
Citations and references must be verifiable: the cited study must exist, and it must say what the AI agent claims.
Specific data such as account numbers, client codes, and reference numbers must be accurate.
Calculations must be mathematically correct. An incorrect discount rate compromises every calculation downstream.
System actions must execute as specified: if an AI agent should update a record and notify the relationship manager, both steps must happen.
For these outputs, automated verification is essential, and the mechanism is straightforward: code can check whether a quote appears verbatim in a source document by running a string match against the original text; a separate calculation engine can independently recompute any figures the AI agent produces; and system logs can confirm whether an action was actually executed and completed successfully. Human reviewers cannot scale to check every quote against every source, but machines can and should.
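These mechanical checks are short to express in code. The sketch below is illustrative rather than tied to any particular system: the function names and the discount formula are assumptions for the example, not part of any real platform.

```python
def quote_is_verbatim(quote: str, source_text: str) -> bool:
    """Exact check: the quoted passage must appear character-for-character
    in the source document (after normalising whitespace)."""
    def normalise(s: str) -> str:
        return " ".join(s.split())
    return normalise(quote) in normalise(source_text)

def recalculated_figure_matches(agent_value: float, inputs: dict,
                                tolerance: float = 1e-9) -> bool:
    """Independent recomputation: rerun the arithmetic the agent claims to
    have done and compare. A simple discounted price stands in here for
    whatever calculation your workflow actually performs."""
    expected = inputs["list_price"] * (1 - inputs["discount_rate"])
    return abs(agent_value - expected) <= tolerance

# A verbatim quote passes; a plausible-sounding fabrication fails.
source = "The supplier shall indemnify the client against all third-party claims."
print(quote_is_verbatim("shall indemnify the client", source))        # True
print(quote_is_verbatim("shall fully indemnify the client", source))  # False
print(recalculated_figure_matches(90.0, {"list_price": 100.0, "discount_rate": 0.10}))  # True
```

The key design choice is that neither check asks the AI agent anything: both compare the output against an independent source of truth.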
We have written separately about how we designed our own research AI agents to prevent fabrication.² Rather than relying on instructing the AI agent to be accurate, we built safeguards into the architecture: the AI agent can only select from pre-verified sources, quotes are matched character-by-character against source documents before inclusion, and outputs are validated in real time. Prevention is better than detection, but detection remains essential.
Judgment Outputs
Other outputs have no single correct answer, and quality is a matter of judgment:
Summaries and synthesis can be written in many valid ways. The question is whether the summary captures what matters, not whether it matches predetermined text.
Recommendations and advice can be expressed differently while conveying the same substance.
Tone, style, and completeness involve subjective assessment that depends on context, audience, and purpose.
For these outputs, you need clearly defined criteria that specify what "good" looks like, and a separate AI model that reviews the AI agent's work against those criteria, acting as an independent judge. Earlier approaches favoured numerical scoring rubrics, where a judge model would rate outputs on a scale (for example, completeness: 1-5, accuracy: 1-5) across multiple dimensions. In practice, these rubrics are difficult to define well: what exactly distinguishes a 3 from a 4 on "completeness"? Different reviewers, whether human or AI, interpret scales differently, and the problem is compounded early in development when you do not yet fully understand the AI agent's failure modes. Current best practice has therefore moved towards binary pass/fail checks: rather than asking a judge to rate "completeness: 1 to 5," you ask a focused question with a yes-or-no answer, such as "Does this summary identify all material risks disclosed in the source document? Yes or No."¹⁴ Binary checks force clarity in the criteria themselves, produce more consistent results across repeated assessments, and make disagreements between the AI judge and human reviewers easier to diagnose. Each check should evaluate one specific quality dimension in isolation; Anthropic recommends grading each dimension with a separate, focused evaluation rather than asking a single judge to score multiple qualities simultaneously.⁷ Human experts validate that the criteria capture what actually matters, and periodically review a sample to confirm the AI judge's assessments align with expert judgment.
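A binary judge check reduces to a focused prompt plus a strict parser. The sketch below is a minimal illustration: `call_model` is a placeholder for whichever model API you use, and the prompt wording is an example, not a recommended template.

```python
JUDGE_PROMPT = """You are reviewing a summary against its source document.
Question: Does this summary identify all material risks disclosed in the source document?
Answer with exactly one word: Yes or No.

Source document:
{source}

Summary:
{summary}
"""

def call_model(prompt: str) -> str:
    """Placeholder for a real judge-model API call via your provider's SDK."""
    raise NotImplementedError

def parse_verdict(raw: str) -> bool:
    """Strict parsing: anything other than a clear Yes is treated as a fail,
    so ambiguous judge output never passes silently."""
    verdict = raw.strip().lower().rstrip(".")
    return verdict == "yes"

def completeness_check(source: str, summary: str) -> bool:
    """One check, one quality dimension, one binary verdict."""
    prompt = JUDGE_PROMPT.format(source=source, summary=summary)
    return parse_verdict(call_model(prompt))
```

Defaulting ambiguous responses to "fail" is deliberate: it surfaces judge misbehaviour for human review rather than letting it inflate pass rates.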
Why This Distinction Matters
When you design evaluations, start by classifying each output as exact or judgment. The boundary between the two is a practical starting point rather than a rigid line; some outputs sit in between, and you will need to decide which type of check best fits based on the specific quality risk. For exact outputs, build specific automated checks: string matching for quotes, independent recalculation for figures, log verification for system actions. For judgment outputs, invest time in defining precise pass/fail criteria. A criterion that says "the summary should be good" tells you nothing. A criterion that asks "Does the summary identify all material risks, state the valuation range with supporting rationale, and flag areas where information was insufficient?" gives you something to measure against. Since most AI agents produce both types in a single response, your evaluation architecture needs to handle both, applying the right type of check to each part of the output.
What Typically Goes Wrong
Understanding how AI agents fail helps you focus evaluation efforts where they will have the most impact. Research has catalogued over a dozen distinct failure patterns in AI agents.⁴ The seven below are the most common across industries and use cases, and the ones most likely to affect professional services deployments. For each, we describe the pattern, give a concrete example, and explain how to catch or prevent it.
The AI Agent Starts Well, Then Fabricates
The AI agent performs correctly in the early steps, then introduces fabrications later in the workflow. Partial correctness builds false confidence: the user sees accurate information early on and assumes the rest is equally reliable. A tax research AI agent, for example, might accurately cite three relevant statutes, then fabricate a fourth that sounds plausible but does not exist.
How to catch it: Log every step of the AI agent's execution using an observability platform (such as Langfuse, Braintrust, or LangSmith) or a structured logging framework, not just the final output. At each step, record what data the AI agent accessed, what it produced, and what decisions it made. Automated checks that only validate the final message will miss fabrications introduced in earlier turns; you need checks at intermediate points throughout the workflow.
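If you are not yet on an observability platform, the same step-level record can be kept with a few lines of structured logging. A minimal sketch, with field names chosen for illustration:

```python
import json
import time
import uuid

def log_step(trace: list, step_name: str, inputs: dict,
             output: str, tool_calls: list) -> None:
    """Append one structured record per agent step: what it accessed, what it
    produced, and which tools it called. The resulting trace can be inspected
    directly or exported to an observability platform later."""
    trace.append({
        "step_id": str(uuid.uuid4()),
        "step": step_name,
        "timestamp": time.time(),
        "inputs": inputs,
        "tool_calls": tool_calls,
        "output": output,
    })

# Example: two steps of a research workflow, logged as they happen.
trace: list = []
log_step(trace, "retrieve_statutes", {"query": "capital allowances"},
         "3 statutes found", ["statute_db.search"])
log_step(trace, "draft_summary", {"statute_count": 3}, "Summary text...", [])
print(json.dumps([t["step"] for t in trace]))  # ["retrieve_statutes", "draft_summary"]
```

Because each record is self-describing, intermediate checks can run over the trace after the fact rather than being wired into the agent itself.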
One Mistake Cascades Into Bigger Problems
An early error propagates through subsequent decisions, and the final output might be internally consistent yet fundamentally wrong. A valuation AI agent that uses last year's revenue figure instead of this year's produces a cascade: every ratio, multiple, and recommendation downstream is based on the wrong starting point. In multi-agent systems, the problem multiplies because downstream AI agents receive and trust corrupted outputs from upstream AI agents.
How to catch it: Deliberately inject known errors into the workflow to see whether downstream AI agents catch the problem or propagate it. Build validation checks at handoff points between AI agents; for example, a rule that compares key figures against the source data before passing them to the next stage. If the injected error passes through unchallenged, your validation needs strengthening.
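An error-injection test and its handoff validation rule might look like the following sketch. The field names and figures are invented for the example:

```python
def validate_handoff(payload: dict, source_data: dict) -> list:
    """Rule run at the handoff point: compare key figures in the payload
    against the system-of-record values before the next stage consumes them.
    Returns a list of discrepancies; empty means the handoff is clean."""
    problems = []
    for field in ("revenue", "ebitda"):
        if payload.get(field) != source_data.get(field):
            problems.append(
                f"{field}: payload={payload.get(field)} source={source_data.get(field)}"
            )
    return problems

# Error-injection test: corrupt one figure and confirm the check fires.
source = {"revenue": 12_400_000, "ebitda": 1_900_000}
clean = dict(source)
injected = {**source, "revenue": 9_800_000}  # deliberately wrong figure
print(validate_handoff(clean, source))     # [] -- passes through
print(validate_handoff(injected, source))  # one discrepancy flagged
```

If the injected figure reaches the next stage without appearing in the discrepancy list, that is the signal your validation needs strengthening.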
The AI Agent Works Around Rules Instead of Following Them
AI agents sometimes find creative workarounds that technically comply with instructions while violating the underlying intent. This is optimisation without understanding. Microsoft's taxonomy of agentic failure modes identifies tool misuse as a key risk category.⁵ A client onboarding AI agent told to verify identity documents might access an internal HR system to cross-reference names, technically finding the information but violating data access policies.
How to prevent it: Define what the AI agent cannot do, not just what it should do. An AI agent instructed to "find client contact information" might access systems it should not touch unless those systems were explicitly placed off-limits. Review the AI agent's execution logs regularly to check whether it is accessing only the systems and data sources it should be using. Assume the AI agent will find any gap you leave.
Tools Available Does Not Mean Tools Used
Having access to tools does not guarantee an AI agent will use them. An AI agent connected to a CRM system, a regulatory database, and a document repository can bypass these tools entirely and generate plausible-looking information from its training data instead. Worse, AI agents can claim to have consulted these systems when they have not, presenting fabricated data as though it came from an authoritative source. A client-facing AI agent asked to look up a customer's account status might produce a realistic-looking response complete with account numbers and recent activity, none of which came from the actual CRM.
This failure mode is particularly dangerous because the output looks exactly like what a properly functioning AI agent would produce. The only way to detect it is to independently verify that the AI agent actually made the tool calls it should have, and that the data in its response matches what those tools returned.
How to catch it: Log all tool calls alongside the AI agent's output. Build validation rules that cross-reference: if the AI agent's response contains data that should have come from a specific system, check that the corresponding tool call actually appears in the execution log and that the returned data matches what the AI agent presented. Flag any output containing system-sourced data where no corresponding tool call was made.
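The cross-referencing rule can be sketched as a set comparison between IDs in the output and IDs actually returned by logged tool calls. The tool name and log shape below are assumptions for the example:

```python
def flag_unverified_system_data(output_account_ids: set, tool_log: list) -> set:
    """Cross-reference: every account ID the agent presented must appear in
    data actually returned by a logged CRM tool call. IDs with no matching
    tool call are flagged as potentially fabricated."""
    returned_ids = set()
    for call in tool_log:
        if call["tool"] == "crm.lookup":
            returned_ids.update(call["returned_account_ids"])
    return output_account_ids - returned_ids

tool_log = [{"tool": "crm.lookup", "returned_account_ids": ["AC-1021"]}]
# The agent's response mentioned two account IDs, but only one came from the CRM.
print(flag_unverified_system_data({"AC-1021", "AC-7744"}, tool_log))  # {'AC-7744'}
```

Any non-empty result means the output contains system-sourced data with no system source, which should block the response or route it to review.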
Self-Assessment Cannot Be Relied Upon
A tempting shortcut is to ask the AI agent whether it followed the rules. When asked "did you use only verified sources?" AI agents often claim compliance, even when demonstrably non-compliant. This is not deception in any intentional sense; language models are trained to be helpful, and asserting compliance can feel like the helpful response. We learned this early in our own work: AI agents would self-certify compliance while outputs contained fabricated citations. A compliance AI agent asked to confirm it checked all required regulatory databases might report "all sources verified" despite having accessed only two of the four required databases.
What works instead: Rather than relying on an AI agent's self-assessment, verification must come from outside the AI agent being verified, through a separate, independent process. In practice, this means using automated validators that check the AI agent's claims against objective evidence (for example, comparing the AI agent's citation list against actual source documents, or cross-referencing its "verified" status against the tool call logs to confirm it actually accessed the systems it claims to have checked). A different AI model can serve as an independent reviewer, or human reviewers can spot-check a sample. The principle is simple: the verifier and the verified must always be separate.
Quality Can Degrade Over Time
An AI agent that passes evaluation before launch can degrade over weeks or months, a phenomenon practitioners call "drift." It takes several forms:
Data drift occurs when the information the AI agent relies on becomes outdated or changes in character. A regulatory database is updated, client records are restructured, or market conditions shift.
Model drift happens when the underlying AI model is updated by its provider, sometimes without notice, altering how the AI agent responds to the same inputs.
Concept drift means the definition of "good work" itself evolves: client expectations change, new regulations take effect, or business processes are restructured.
Judge drift affects the evaluation layer itself: the AI models used as judges are also updated by their providers, which can shift how they assess quality. An evaluator that was well-calibrated against human judgment in January may score differently in March. Re-align AI judges against human labels periodically to ensure your quality measurements remain trustworthy.
The AI agent you deployed in January might behave differently in March, and even if it behaves identically, "correct" might mean something different. Worse, the evaluations you built to catch problems might themselves shift without warning.
How to stay ahead of it: Define specific quality metrics for your AI agent, such as citation accuracy rate, pass rates per quality check, and task completion rate, and track them over time using your observability platform. Set concrete thresholds: for example, if your citation accuracy rate drops below 95%, or if the pass rate on any quality check falls below a defined minimum, configure the system to trigger an automated alert. When you see a downward trend, investigate which type of drift is responsible: check whether source data has changed (data drift), whether the model provider has released an update (model drift), whether business requirements have shifted (concept drift), or whether your AI judges have drifted from human consensus (judge drift), and address it at the source.
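The threshold logic itself is simple; a sketch of the comparison step, with metric names and minimums invented for the example (in production the alert would feed a pager or dashboard rather than a return value):

```python
def check_thresholds(metrics: dict, thresholds: dict) -> list:
    """Compare current metric values against configured minimums and return
    an alert message for every breach."""
    alerts = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is not None and value < minimum:
            alerts.append(f"ALERT: {name} at {value:.1%}, below minimum {minimum:.1%}")
    return alerts

thresholds = {"citation_accuracy": 0.95, "task_completion": 0.90}
print(check_thresholds({"citation_accuracy": 0.97, "task_completion": 0.92}, thresholds))  # []
print(check_thresholds({"citation_accuracy": 0.93, "task_completion": 0.92}, thresholds))
# ['ALERT: citation_accuracy at 93.0%, below minimum 95.0%']
```

The hard work is not the code but choosing thresholds that reflect genuine quality floors rather than current performance.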
The Reliability Problem
There is a critical distinction between an AI agent that can do something and an AI agent that will do it reliably. Passing an evaluation once demonstrates capability, not reliability, and the gap between the two is larger than most organisations expect. A January 2026 analysis of enterprise AI agent benchmarks found that this reliability gap affects every model tested: AI agents that score well on single-run evaluations show significant performance drops when asked to perform the same task repeatedly, and "all models had runs that derailed completely," not gradual degradation, but sudden coherence breakdowns.¹³ The underlying benchmarks, including τ²-bench and Vending-Bench 2, confirm the pattern across different task types and providers.³
For professional services, this matters enormously. If you are deploying an AI agent to handle client queries at a law firm or accounting practice, inconsistency means clients returning with the same issue may receive different responses. For regulated industries where advice must be consistent and auditable, this variability is unacceptable. As regulators become more active in overseeing AI use in financial services, robust evaluations also contribute to the auditable evidence that organisations will need to demonstrate compliance.¹¹
The distinction shows up in two ways of measuring success. The first asks: can the AI agent do this at all? If you run it multiple times, does it succeed at least once? Researchers call this "pass at k" (written as Pass@k); it is a prototyping metric. If the AI agent cannot succeed even once across several attempts, the task is beyond its current capability and you need a different approach. The second asks: will the AI agent do this reliably? Does it succeed every time across repeated attempts? Researchers call this "pass to the power of k" (written as Pass^k); this is the metric that matters for live deployment.
The difference is dramatic. Consider an AI agent that succeeds on any single attempt with 75% probability. Run it three times and the chance it succeeds at least once (Pass@3) is approximately 98%, which looks reassuring. But the chance it succeeds all three times (Pass^3) is approximately 42%. The prototyping metric says "this works"; the reliability metric says "it fails more often than it succeeds." In a firm processing thousands of client interactions, that gap translates directly into hundreds of errors that the capability metric would have masked.
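The arithmetic behind those two figures is a one-liner each, assuming independent attempts:

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts."""
    return 1 - (1 - p) ** k

def pass_to_the_k(p: float, k: int) -> float:
    """Probability of success on every one of k independent attempts."""
    return p ** k

p = 0.75  # single-attempt success probability
print(f"Pass@3 = {pass_at_k(p, 3):.1%}")      # 98.4%
print(f"Pass^3 = {pass_to_the_k(p, 3):.1%}")  # 42.2%
```

The independence assumption is a simplification; in practice, correlated failures across runs can make the reliability picture either better or worse, which is another reason to measure it directly.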
The practical response: Run each evaluation case multiple times to distinguish capability from reliability. The number of trials needed depends on the complexity of the task and the variability of inputs: for simple, well-constrained tasks with limited input variation, a handful of repeated runs may reveal the pattern. For complex workflows processing diverse real-world data, where different client records, document formats, or edge cases activate different paths through the AI agent, you need evaluation across a representative sample of your actual production data, which may require dozens or hundreds of runs to cover the meaningful variations. The key principle is to run enough trials that you are confident you are seeing the AI agent's reliable behaviour, not just its best-case performance.
Compare the results across trials. Consistent success gives you evidence of reliability. If results vary, examine the variation: consistent failure points to a capability gap requiring better instructions, architectural changes, or a different approach entirely. Intermittent failure is more nuanced; it might indicate a prompt that is not specific enough, poor handling of edge cases, or inherently variable model behaviour. Use this diagnostic information to decide where the AI agent can operate autonomously, where it needs human-in-the-loop checkpoints, and where full automation is not yet appropriate. The goal is not to wait for perfection but to know exactly where human oversight is still required.
These reliability challenges are amplified when AI agents operate in multi-step workflows or collaborate with other AI agents, which is where the complexity of evaluation increases substantially.
Evaluating Multi-Step Workflows and Multi-Agent Systems
The failure patterns above become harder to detect as AI agent tasks grow in complexity. In practice, most production AI agents do not simply answer a question. They carry out structured workflows: retrieving information from multiple systems, passing data between processing steps, interacting with databases and CRMs, performing calculations, applying business rules, and producing a final deliverable. Some workflows involve multiple AI agents working together, with one AI agent retrieving data, another analysing it, a third generating recommendations, and a fourth executing actions in external systems.
The sheer complexity of these systems creates its own evaluation challenges. More complicated workflows mean more potential failure points, not just at individual steps but in the interactions between them. Evaluating a multi-step AI agent requires you to deeply understand the business process being modelled, because your evaluations must be robust enough to cover all the meaningful paths the AI agent might take, the failure modes specific to each stage, and the success criteria at each handoff point. Evaluating these systems means checking quality at multiple points along the way, not just the final output.
Why Multi-Step Evaluation Is Different
In multi-step workflows, problems can hide at intermediate stages even when the final output looks acceptable. Consider a due diligence AI agent reviewing a target company's contracts. The final report might look reasonable, but perhaps the AI agent missed a category of documents during retrieval, miscategorised a material risk as minor, or applied the wrong client's risk thresholds. Evaluating only the final output would miss these intermediate failures.
Two specific technical challenges arise in extended workflows. Context drift occurs when the AI agent gradually loses track of original instructions as its working memory fills with intermediate work. State accumulation means early decisions constrain later ones: if an AI agent decides in step two that a document is not relevant, it will not consider that document in any subsequent step. Both of these problems compound over the length of the workflow, and both require specific evaluation techniques to detect, which the following subsection addresses.
An important distinction applies here: not everything needs to be evaluated; some things need to be observed. Evaluation means grading an output against defined quality criteria: did the AI agent produce the right result? Observation means recording what happened: which systems did the AI agent call, what data did it pass, how long did each step take? Observation feeds into evaluation but is also valuable in its own right for debugging, compliance, and understanding system behaviour. In multi-step workflows, you need both.
Techniques for Multi-Step Evaluation
Evaluate intermediate outputs, not just the final deliverable. Define checkpoints throughout the workflow where you can assess whether the AI agent is on track. For a contract review workflow, checkpoints might include: did the AI agent retrieve all relevant documents? Did it correctly identify the applicable jurisdiction? Did it flag all clauses matching the risk criteria? Each checkpoint has its own evaluation criteria, and each can be checked independently, using automated rules for exact outputs (such as whether the correct documents were retrieved) and AI-as-judge scoring for judgment outputs (such as whether the risk categorisation is appropriate).
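A set of checkpoint checks for the contract review example can be kept as independent rules over one run's recorded state, so a failure is localised to the stage that caused it. The run structure and expected values below are hypothetical:

```python
def evaluate_checkpoints(run: dict) -> dict:
    """Apply each checkpoint's check independently; return pass/fail per
    checkpoint rather than a single verdict for the whole workflow."""
    checks = {
        "retrieval_complete":
            lambda r: set(r["retrieved_docs"]) >= set(r["expected_docs"]),
        "jurisdiction_correct":
            lambda r: r["jurisdiction"] == r["expected_jurisdiction"],
        "risk_clauses_flagged":
            lambda r: set(r["flagged_clauses"]) >= set(r["expected_risk_clauses"]),
    }
    return {name: check(run) for name, check in checks.items()}

run = {
    "retrieved_docs": ["msa.pdf", "sow.pdf"],
    "expected_docs": ["msa.pdf", "sow.pdf", "dpa.pdf"],  # one document missed
    "jurisdiction": "England and Wales",
    "expected_jurisdiction": "England and Wales",
    "flagged_clauses": ["14.2", "22.1"],
    "expected_risk_clauses": ["14.2", "22.1"],
}
print(evaluate_checkpoints(run))
# {'retrieval_complete': False, 'jurisdiction_correct': True, 'risk_clauses_flagged': True}
```

Here the final report might still read well, but the per-checkpoint result shows exactly where the workflow went off track: a document was never retrieved.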
Use trajectory evaluation for high-stakes workflows. A trajectory is the complete record of every step the AI agent took: which systems it called, what data it accessed, what reasoning it applied, and what it decided at each point. Most observability platforms (such as Langfuse, Braintrust, or LangSmith) capture this automatically. Reviewing the full trajectory catches cases where the AI agent reached an acceptable answer through flawed reasoning, a result that might look correct this time but is unlikely to be repeatable.
Test for context retention. In long workflows, check whether the AI agent remembers key constraints from earlier in the interaction. A contract review AI agent told to focus on IP provisions should still be focusing on IP provisions ten steps later. Build specific test cases where an early instruction should constrain a later output, then verify that the constraint held; if the AI agent drifts from its original brief, this signals a context retention problem.
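The structure of such a test case can be sketched as follows. The keyword-matching check is deliberately crude, used only to illustrate the shape of the test; a real implementation might use an AI judge to assess whether the later output still honours the early brief:

```python
def constraint_held(early_instruction_terms: set, later_output: str) -> bool:
    """Retention check: a later output should still reflect the terms of an
    instruction given early in the workflow."""
    text = later_output.lower()
    return any(term in text for term in early_instruction_terms)

# Early instruction: "focus on IP provisions". Hypothetical signal terms:
ip_terms = {"intellectual property", "ip provision", "licence", "assignment of rights"}

on_brief = "Clause 9 grants a perpetual licence; recommend narrowing the assignment of rights."
off_brief = "The payment schedule in clause 3 appears standard."
print(constraint_held(ip_terms, on_brief))   # True
print(constraint_held(ip_terms, off_brief))  # False
```

A failing case ten steps into the workflow, where the same check passed at step two, is direct evidence of context drift.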
Inject errors at intermediate steps. Deliberately introduce a known mistake, such as an incorrect figure in a data feed or a contradictory instruction, partway through the workflow. Observe whether the AI agent catches the error, flags it, or silently propagates it. This tests the robustness of the AI agent's validation at each stage.
Evaluate handoffs in multi-agent systems. When multiple AI agents work together, each handoff is a potential failure point. The data passed between AI agents must be complete, correctly structured, and interpreted as intended by the receiving AI agent. Build validation checks at each handoff that confirm the data arrived intact and matches what the sending AI agent produced. A system of individually capable AI agents can still fail as a whole if the handoffs are unreliable.
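One way to sketch receiver-side handoff validation is a structural check plus an integrity digest computed by the sender and re-verified on arrival. The payload fields are invented for the example:

```python
import hashlib
import json

def checksum(payload: dict) -> str:
    """Stable digest of a handoff payload, computed by the sending agent and
    re-verified by the receiver to confirm the data arrived unaltered."""
    canonical = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def receive_handoff(payload: dict, sender_checksum: str,
                    required_fields: set) -> list:
    """Receiver-side validation, run before any work: is the payload
    complete, and does it match what the sender produced?"""
    problems = []
    missing = required_fields - payload.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if checksum(payload) != sender_checksum:
        problems.append("checksum mismatch: payload altered or truncated in transit")
    return problems

payload = {"client_id": "C-204", "risk_rating": "high", "figures": {"revenue": 12.4}}
sent = checksum(payload)
print(receive_handoff(payload, sent, {"client_id", "risk_rating", "figures"}))  # []
truncated = {"client_id": "C-204", "risk_rating": "high"}
print(receive_handoff(truncated, sent, {"client_id", "risk_rating", "figures"}))
```

This does not check whether the content is correct, only that it is intact and complete; content-level checks at the handoff, such as the figure comparison described earlier, layer on top.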
Building Evaluations That Work
The previous sections covered what can go wrong and why traditional testing misses it. This section covers the practical side: how to build quality checks that catch problems before they cause damage, what to check, when to check it, and how to scale evaluation as your AI agent deployments grow. The underlying discipline follows a cycle: analyse real outputs to understand how the AI agent fails, measure quality through targeted checks, and improve the AI agent's instructions and architecture based on what the measurements reveal. Each iteration through this cycle strengthens both the AI agent and the evaluations themselves.
Three Types of Quality Check
Effective evaluation draws on three approaches, each suited to different aspects of AI agent output. Most production systems combine all three.
Automated rule checks verify specific requirements: correct format, permitted value ranges, authorised actions, citations that match source documents. In practice, these are scripts or validation functions that run against every AI agent output. For example, a function that checks whether every regulation quoted by the AI agent exists in the regulatory database, or a rule that confirms all numerical outputs fall within expected ranges. They run fast, examine every output, and cost almost nothing at the margin. The limitation is that they cannot assess reasoning quality, completeness, or nuance.
AI-as-judge checks use a separate AI model to assess outputs against defined criteria, capturing quality dimensions that resist simple rules, such as whether a summary identified the most material risks, or whether a recommendation is appropriately qualified. In practice, each check should focus on one specific quality dimension and return a binary pass/fail verdict. For example, rather than asking a single judge to score completeness, accuracy, and relevance simultaneously on a numerical scale, you run three separate checks: "Does the summary cover all material risks identified in the source? Yes/No." "Are all factual claims supported by the cited evidence? Yes/No." "Is the recommendation relevant to the client's stated objectives? Yes/No."¹⁴ This approach is more reliable than numerical scoring because it gives the judge a narrower, better-defined task, reducing variability and making disagreements between the AI judge and human reviewers easier to diagnose.⁷ The limitation is that they add cost and latency per output.
Human review remains the gold standard for quality assessment. Human experts catch subtleties that neither rules nor AI judges reliably detect, particularly in areas where domain expertise, client context, or regulatory nuance matters. The limitation is that it is expensive, slow, and does not scale to every output.
In practice, you layer all three. Automated checks run on every output. AI judges score a larger sample. Human experts review a smaller sample. The percentages vary by risk: a low-stakes internal tool might use 100% automated, 10% AI judge, and 1% human review. A client-facing advisory tool might use 100%, 50%, and 10% respectively.
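The layering can be implemented as a simple sampling router that decides, per output, which checks apply. The rates below mirror the illustrative percentages above and are not recommendations:

```python
import random

def review_layers(risk_profile: str, rng: random.Random) -> list:
    """Decide which quality checks apply to one output, given per-layer
    sampling rates for the output's risk profile."""
    rates = {
        "low_stakes_internal":   {"automated": 1.00, "ai_judge": 0.10, "human": 0.01},
        "client_facing_advisory": {"automated": 1.00, "ai_judge": 0.50, "human": 0.10},
    }[risk_profile]
    return [layer for layer, rate in rates.items() if rng.random() < rate]

# Over many outputs, the sampled shares converge on the configured rates.
rng = random.Random(42)  # seeded for reproducibility
sample = [review_layers("client_facing_advisory", rng) for _ in range(1000)]
judge_share = sum("ai_judge" in layers for layers in sample) / len(sample)
print(f"AI judge sampling rate over 1,000 outputs: ~{judge_share:.0%}")
```

Note that the automated layer runs at 100% regardless of profile; sampling applies only to the expensive layers.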
Combine Outcome Evaluation with Process Observation
Focus evaluation primarily on what the AI agent produced; the final output is what affects clients and business outcomes.⁷ However, a correct answer reached through flawed or unusual reasoning is a warning sign, not a success. An AI agent that arrives at the right conclusion by chance or through faulty logic is unlikely to do so consistently. In professional services, the reasoning path also matters for auditability: regulators, clients, and internal compliance teams may need to understand not just what the AI agent concluded, but how it got there.
The practical approach is to combine both. Use outcome-based evaluation as the primary measure: did the output meet the quality criteria? Then use trajectory review as a diagnostic tool: was the reasoning sound, did the AI agent use appropriate sources, did it follow the expected process? When an AI agent produces a correct output through questionable reasoning, treat it as a reliability risk to investigate, not a pass.
Build Evaluations From Two Directions
Top-down: define what good looks like. Start from business requirements and work downward to specific evaluation criteria: completeness, accuracy, relevance, appropriate tone. These outcome-based evaluations tell you whether the AI agent is producing good work.
Bottom-up: analyse how the AI agent actually fails. Review a sample of real AI agent outputs, not against a predefined checklist, but with an open question: what went wrong here? Practitioners call this "error analysis," and it is the foundation of effective evaluation design.¹⁴ The process is straightforward: collect 20-50 real outputs as a starting foundation (complex systems benefit from reviewing 50-100 or more for fuller failure coverage), review each one, and write open-ended notes on every problem you observe. Then group similar problems into categories. You will discover failure patterns you did not anticipate, patterns that would never appear in a top-down requirements document. These failures become permanent test cases.
Both directions are necessary. Top-down evaluations ensure the AI agent meets business requirements. Bottom-up error analysis catches the failures you did not know to look for, and these are often the most damaging in practice.⁷
Diagnose Before You Build
Not every failure needs an evaluator. When error analysis reveals a problem, the first question is whether you can even classify the failure clearly. If you cannot, the answer is more error analysis, not more evaluators; go back and review additional outputs until the pattern becomes clear.¹⁴ Once you can classify the failure, ask whether the AI agent's instructions are the cause. If the AI agent mishandles a scenario because the prompt never addressed it, or because the instructions were ambiguous, that is a specification problem: fix the prompt. If the AI agent fails despite clear, specific instructions, that is a generalisation problem: the model cannot reliably perform the task as described, and you need an evaluator to catch the failures, or a different architectural approach.
This distinction matters because building evaluators is more expensive than fixing prompts. In practice, a significant proportion of the failures teams discover through error analysis turn out to be specification problems, fixable by clarifying the AI agent's instructions rather than building new quality checks. The discipline of asking "is this a prompt problem or a capability problem?" before writing evaluation code saves considerable effort and leads to better-performing AI agents.
When to Write Evaluations
The most effective approach is to write evaluations before building the AI agent. This mirrors established practices in both software engineering and product management. In test-driven development, engineers write tests before writing code. In agile product management, product owners define acceptance criteria before development begins: specific, measurable conditions that a feature must meet to be considered complete. The same discipline applies to AI agents.
A consulting firm building a proposal drafting AI agent would define acceptance criteria before writing any code: correct client name throughout, pricing matching the approved rate card, terms complying with contracting policy, all statistics verified against source documents. Each criterion becomes a measurable evaluation check. At Serpin, we formalised this through an agent-specific product lifecycle document that defines evaluation criteria alongside functional requirements from the outset. The question "how will we know it works?" is answered before "how will we build it?"
If you are working with an existing AI agent, start by cataloguing what the AI agent produces, classifying each output type, and building evaluations working backward from desired outcomes.
When to Run Evaluations
During development, run evaluations after every change, multiple times, and examine the distribution of results. After deployment, run evaluations continuously, because performance drifts over time and the only way to detect it is to keep measuring. This creates a feedback loop: problems detected in production become new evaluation cases, which help catch similar problems earlier next time.
Making Quality Visible: Observability and Evaluation at Scale
As AI agent deployments grow, manual evaluation becomes impractical. You need systems that make quality measurement automatic and continuous. The industry term for this capability is "observability": the ability to understand what is happening inside a system by examining its outputs.
For traditional software, observability means recording what happened (logs), measuring performance (metrics), and tracking request flow (traces). AI agent observability extends this to capture what the AI agent decided to do, why, what information it considered, and what it ignored. A growing ecosystem of platforms supports this, including both commercial options (such as Braintrust and LangSmith) and open-source alternatives (such as Langfuse and Arize Phoenix). The specific platform matters less than having the capability, but the capability itself is essential at scale.
Three Components of AI Agent Quality Infrastructure
Effective AI agent observability combines three components, each serving a different purpose. Together, they form the quality infrastructure that makes ongoing evaluation possible.
AI agent tracing records every step the AI agent took: which systems it called, what data it accessed, what reasoning it applied. Think of it as a flight recorder for AI agent interactions. In practice, most observability platforms provide tracing out of the box. You instrument your AI agent code by adding trace-emitting calls at each significant step (for example, wrapping each tool call or model invocation in a tracing context that records inputs, outputs, and timing), and the platform captures, stores, and makes them searchable. Without tracing, debugging a failed AI agent output means guessing what went wrong. With tracing, you can replay the exact sequence of events and identify where the process broke down.
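Most observability platforms provide this instrumentation out of the box, but the underlying idea is simple. As an illustrative sketch only (the `trace_step` helper, the in-memory `TRACE` list, and the `fetch_rate_card` stub are all hypothetical, not any platform's real API):

```python
import time
from contextlib import contextmanager

TRACE = []  # a real platform stores and indexes these; here, an in-memory list

@contextmanager
def trace_step(name, **inputs):
    """Record one agent step: its name, inputs, output, and timing."""
    record = {"step": name, "inputs": inputs, "start": time.time()}
    try:
        yield record  # the caller attaches the step's output to this record
    finally:
        record["duration_s"] = round(time.time() - record["start"], 3)
        TRACE.append(record)

# Example: wrap a (stubbed) tool call so its inputs and outputs are captured.
def fetch_rate_card(client_tier):
    return {"tier": client_tier, "hourly_rate": 250}

with trace_step("fetch_rate_card", client_tier="gold") as rec:
    rec["output"] = fetch_rate_card("gold")
```

After a failure, the `TRACE` list is what you replay: every step, its inputs, its outputs, and how long it took.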
Scoring applies quality checks against defined criteria, transforming raw outputs into trackable quality signals. In practice, you configure your observability platform to run automated rule checks (such as citation verification or format validation) and AI-as-judge assessments against each traced output, either in real time or as a post-processing step. A score might indicate that 94% of citations were verified, or that a summary passed four of five quality checks. These scores are logged alongside each trace, creating a continuous quality record that you can query and analyse over time.
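A rule-based scoring pass can be sketched in a few lines. The two checks below (`has_citations`, `within_length`) are hypothetical examples, not a recommended check set; the point is the shape of the output, one pass/fail per check plus an aggregate rate, logged alongside the trace:

```python
# Hypothetical rule checks run against one traced output; each returns True on pass.
def has_citations(output):
    return "[" in output["text"] and "]" in output["text"]

def within_length(output, max_words=200):
    return len(output["text"].split()) <= max_words

CHECKS = {"has_citations": has_citations, "within_length": within_length}

def score_output(output):
    """Run every rule check; log a pass/fail per check plus an aggregate rate."""
    results = {name: check(output) for name, check in CHECKS.items()}
    passed = sum(results.values())
    return {"checks": results, "pass_rate": passed / len(CHECKS)}

scores = score_output({"text": "Clause 4.2 conflicts with policy [Doc-7]."})
```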
Monitoring aggregates scores over time, detects patterns, and alerts when quality degrades. In practice, this means configuring dashboards in your observability platform that display rolling averages of your key quality metrics (citation accuracy, pass rates per quality check, task completion rate) and setting alert thresholds. For example, you might configure a notification if the seven-day rolling average of citation accuracy drops below 95%, or if the proportion of outputs failing any quality check exceeds 10%. A single failed output is an incident. A downward trend in scores across a week reveals a systemic problem that needs investigation.
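The rolling-average alert described above reduces to a small amount of logic. A minimal sketch, assuming one aggregate score per day and the 0.95 threshold used as an example in the text:

```python
from collections import deque

class RollingQualityMonitor:
    """Track a rolling window of a quality metric; flag threshold breaches."""

    def __init__(self, window=7, threshold=0.95):
        self.scores = deque(maxlen=window)  # oldest score drops off automatically
        self.threshold = threshold

    def record(self, daily_score):
        self.scores.append(daily_score)
        avg = sum(self.scores) / len(self.scores)
        return {"rolling_avg": avg, "alert": avg < self.threshold}

# A week of daily citation-accuracy scores drifting downward.
monitor = RollingQualityMonitor(window=7, threshold=0.95)
for day_score in [0.98, 0.97, 0.96, 0.93, 0.92, 0.91, 0.90]:
    status = monitor.record(day_score)
```

No single day here is catastrophic, but the rolling average ends the week below threshold, which is exactly the systemic drift a per-incident view would miss.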
The Automation Progression
Building evaluation capability is iterative. Most teams progress through four stages, each building on the previous one.
Start with human evaluation and error analysis to understand what good looks like and how the AI agent actually fails. Have domain experts review a sample of 20-50 real AI agent outputs as a starting foundation (expanding to 50-100 or more for complex systems) with an open question: what is wrong with this? Write free-form notes on every problem observed, then group similar problems into categories. This dual process builds a shared understanding of quality and reveals the AI agent's actual failure patterns, which are often different from what the team expected. Before building evaluators for the failures you discover, check whether each one is a specification problem (fixable by improving the AI agent's instructions) or a generalisation problem (requiring an evaluator to catch). The output of this stage is a set of clearly defined quality criteria, a taxonomy of real failure modes, and an improved prompt.
Codify patterns into automated rules once you understand them. If human reviewers consistently flag the same types of errors, such as missing citations, calculations that do not sum correctly, or responses that exceed authorised scope, write automated checks that catch these patterns. These rules run on every output, instantly and at no marginal cost. The output is a growing library of automated validators that catch known failure modes.
Add AI-as-judge evaluation for nuanced assessment that resists simple rules. Configure a separate AI model to assess outputs against your defined criteria, with each check focused on one specific quality dimension and returning a binary pass/fail verdict.¹⁵ In production monitoring, many of these checks are "reference-free," meaning the judge assesses quality without a pre-defined correct answer, relying instead on the criteria themselves to determine whether the output is acceptable. Run these assessments on a sample of outputs, typically 10-50% depending on risk level. The output is a set of quality signals that track trends over time.
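The shape of a single binary judge check can be sketched as follows. The `call_model` function is a stub standing in for whichever model client you use, and the prompt wording is illustrative, not a tested template:

```python
# One judge check = one criterion, one binary verdict.
JUDGE_PROMPT = (
    "You are a strict reviewer. Criterion: {criterion}\n"
    "Output under review:\n{output}\n"
    "Answer with exactly PASS or FAIL."
)

def call_model(prompt):
    """Stub: a real implementation calls a separate model, never the agent itself."""
    return "PASS"

def judge(output_text, criterion):
    verdict = call_model(JUDGE_PROMPT.format(criterion=criterion, output=output_text))
    return verdict.strip().upper() == "PASS"

ok = judge(
    "The summary lists all three material risks identified in the contract.",
    "Does the summary identify all material risks?",
)
```

Keeping each check to one criterion and one binary verdict is what makes the resulting scores trackable: a 4-of-5 pass rate is meaningful, a single 1-10 "quality score" rarely is.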
Reserve human review for validation and edge cases. Human experts review a smaller sample to confirm that automated checks and AI judges are calibrated correctly, and to handle cases where automated assessment is uncertain. This is also where you update criteria and rules as the definition of "good" evolves.
Managing Evaluation Costs
Comprehensive evaluation has real costs: compute time for AI-as-judge checks, staff time for human review, and platform fees. In our experience building and deploying AI agent systems, evaluation typically adds 10-20% to the running cost of the agent system itself, but prevents failures that cost multiples of that amount. Three strategies keep costs proportionate to value.
Tiered evaluation matches the level of scrutiny to the level of risk. To implement this, categorise each AI agent output type by its potential business impact. Low-stakes outputs, such as internal status summaries or routine notifications, run only automated rule checks on every output. Medium-stakes outputs, such as client-facing communications or research summaries, add AI-as-judge scoring on a significant sample. High-stakes outputs, such as regulatory filings, financial recommendations, or advisory work where errors could result in penalties or client harm, warrant human expert review. Define these categories during the evaluation design phase, not ad hoc, so that every output is routed to the appropriate level of scrutiny automatically.
Sampling strategies reduce cost without sacrificing the ability to detect quality problems. Rather than running AI-as-judge checks on every AI agent output, you run them on a representative sample selected randomly from the full population of responses the AI agent produces during a given period. In our deployments, we find that sampling around 20% of outputs provides a reliable signal for detecting quality trends, though the appropriate rate depends on the volume and variability of your AI agent's workload: higher-volume, lower-variability systems can sample less; lower-volume or highly variable systems may need more. Targeted sampling, where you focus additional scrutiny on outputs that automated checks flag as borderline or unusual, concentrates effort where it is most likely to find problems.
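Tiering and sampling combine naturally into a single routing decision per output. A sketch under stated assumptions: the tier mapping and output-type names are hypothetical, and the 20% judge sample rate is the illustrative figure from the text:

```python
import random

TIERS = {  # illustrative mapping from output type to evaluation tier
    "status_summary": "low",
    "client_email": "medium",
    "regulatory_filing": "high",
}

def evaluation_plan(output_type, judge_sample_rate=0.2, rng=random.random):
    """Decide which checks an output gets, based on its risk tier."""
    tier = TIERS.get(output_type, "high")  # unknown types get maximum scrutiny
    plan = {"rule_checks": True, "ai_judge": False, "human_review": False}
    if tier == "medium":
        plan["ai_judge"] = rng() < judge_sample_rate  # sampled, not every output
    elif tier == "high":
        plan["ai_judge"] = True
        plan["human_review"] = True
    return plan
```

Defaulting unknown output types to the highest tier is a deliberate fail-safe: the routing table being out of date should increase scrutiny, never silently reduce it.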
Asynchronous evaluation decouples quality measurement from user-facing response times. The user receives their response immediately. Evaluation runs in the background, with results feeding into dashboards and alerting systems. Problems are still detected and flagged, typically within minutes rather than the weeks it might take for a human to notice an issue organically.
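The decoupling is a standard producer-consumer pattern: the agent enqueues its output and returns to the user, while a background worker scores it. A minimal single-process sketch (a production system would use a proper task queue; the `evaluate` function is a stand-in for the real scoring pipeline):

```python
import queue
import threading

eval_queue = queue.Queue()
results = []

def evaluate(output):
    """Stand-in for the real scoring pipeline (rule checks, AI-as-judge)."""
    return {"output_id": output["id"], "passed": "[" in output["text"]}

def worker():
    while True:
        item = eval_queue.get()
        if item is None:  # sentinel: shut the worker down
            break
        results.append(evaluate(item))

threading.Thread(target=worker, daemon=True).start()

# The agent responds to the user immediately; evaluation happens in the background.
eval_queue.put({"id": 1, "text": "Clause 4.2 conflicts with policy [Doc-7]."})
eval_queue.put(None)
eval_queue.join() if False else None  # real code would join the thread on shutdown
```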
Building a Robust Evaluation Process
If you are deploying AI agents, or considering doing so, here is how to apply what we have covered. The starting point is not the AI agent itself; it is the business process the AI agent is intended to support.
Start from the Business Process
Before evaluating what an AI agent produces, define what the business process requires. What are the intended outcomes? What actions must be performed? What standards must be met? A client onboarding process, for example, requires identity verification, risk assessment, regulatory compliance checks, and record creation. A contract review process requires document retrieval, clause analysis, risk identification, and a structured report. These requirements exist independently of whether the work is done by a person or an AI agent. They are your evaluation criteria. The question then becomes: does the AI agent meet these requirements reliably?
Essential Practices
Map the business process the AI agent supports. Define the required outcomes, actions, and quality standards; these become your evaluation criteria. For each step, identify what a correct result looks like and what would constitute a failure.
Classify each output as exact or judgment. Determine which outputs must be precisely correct (quoted text, figures, system actions) and which accept reasonable variation (summaries, recommendations, assessments). This classification determines which type of check you build.
Build specific automated checks for exact outputs. String matching for quotes, independent recalculation for figures, execution log verification for system actions, and cross-referencing tool call responses for retrieved data.
Define pass/fail criteria for judgment outputs. For each quality dimension that matters, write a specific yes-or-no question: "Does the summary identify all material risks?" "Are all factual claims supported by the cited evidence?" Each question becomes a separate check, evaluated independently. Test your criteria by having both an AI judge and a human expert assess the same outputs, then refine until their pass/fail decisions align consistently.
Conduct error analysis on real outputs before finalising evaluations. Review 20-50 real outputs (more for complex systems) with an open question: what went wrong? Group problems into categories, determine whether each is a specification problem (fix the prompt) or a generalisation problem (build an evaluator), and fix specification problems first. This bottom-up analysis reveals failure patterns that top-down requirements analysis alone would miss.⁷ ¹⁴
Ensure the verifier is always separate from the verified. Never rely on the AI agent's own assessment of its work. Use automated validators, a different AI model as independent judge, or human reviewers, and make the separation explicit in your evaluation architecture.
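The exact-output checks in the practices above are deliberately unglamorous. Two illustrative validators, with hypothetical example data, showing the principle that the verifier consults the source or the execution log, never the agent's own account:

```python
def quote_verbatim(quoted, source_text):
    """Exact outputs need exact checks: a quoted passage must appear
    character-for-character in the source document."""
    return quoted in source_text

def action_logged(expected_action, execution_log):
    """Verify a claimed system action against the execution log,
    not against the agent's report of what it did."""
    return expected_action in execution_log

source = "The supplier shall remedy defects within 30 days of notice."
quote_ok = quote_verbatim("remedy defects within 30 days", source)
quote_bad = quote_verbatim("remedy defects within 14 days", source)  # altered figure
```

A substring check like this will reject a quote the agent has "helpfully" paraphrased, which is precisely the behaviour you want for exact outputs.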
Building Evaluation Discipline
For new AI agents, write evaluations before building. Define acceptance criteria upfront, while changes are still cheap. Each criterion should be specific enough to be testable: not "the AI agent should handle client queries" but "the AI agent should correctly identify the client's account tier, retrieve the current rate card, and produce a quote within 2% of the manually calculated figure."
Run repeated trials to distinguish capability from reliability. A single successful run proves capability, not reliability. Run enough trials across representative inputs that you are measuring the AI agent's consistent behaviour, not its best-case performance. The number depends on task complexity and input diversity, from a handful for well-constrained tasks to dozens or hundreds for complex workflows with varied real-world data.
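The capability/reliability distinction is easy to compute once you have repeated trial results. A minimal sketch over a hypothetical set of ten trials of the same task:

```python
def pass_at_least_once(trial_results):
    """Capability: did any of the k trials succeed?"""
    return any(trial_results)

def pass_every_time(trial_results):
    """Reliability: did all k trials succeed?"""
    return all(trial_results)

# Ten repeated trials of the same task (True = success).
trials = [True, True, False, True, True, True, True, False, True, True]

capability = pass_at_least_once(trials)   # the agent *can* do it
reliability = pass_every_time(trials)     # but it will not do it every time
success_rate = sum(trials) / len(trials)
```

An 80% per-trial success rate looks respectable until you note that across ten runs the agent failed twice, which is the figure that matters if the task runs unattended.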
Instrument the AI agent with tracing and build continuous monitoring. Record every tool call, model invocation, and decision point using an observability platform or structured logging. Build dashboards tracking rolling averages of key quality metrics and configure automated alerts when those metrics fall below defined thresholds.
Use evaluation results to determine where human oversight is needed. Not every task will be suitable for full automation. Evaluation data reveals where the AI agent performs reliably enough to operate autonomously, where human-in-the-loop checkpoints should be built in, and where full automation is not yet appropriate. Let the data drive these decisions rather than assumptions.
Close the feedback loop. Every production incident becomes a permanent evaluation case. Every edge case discovered in production strengthens the evaluation suite. Schedule regular reviews, monthly at minimum, where the team examines recent failures, updates evaluation cases, and refines criteria based on what the data is showing.
Advanced Practices
These practices typically require dedicated AI engineering capability or specialist support.
Layer your evaluation approaches. Run automated checks on every output, AI-as-judge scoring on a broader sample (typically 10-50% depending on risk level), and human expert review on a smaller sample. Calibrate the AI judge against human reviewers periodically: measure the judge's true positive rate (does it catch the problems humans catch?) and true negative rate (does it pass the outputs humans pass?) separately, rather than relying on raw agreement rate. A judge that always says "pass" will appear highly accurate if most outputs are good, but it will catch zero failures.
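The calibration measurement above can be sketched directly. Here "positive" means a failure flagged, and the example data is constructed to show the trap the paragraph describes, a judge that never flags anything:

```python
def judge_calibration(human_fail, judge_fail):
    """Parallel lists of booleans, True = flagged as failing.
    Returns the judge's true positive rate (failures it catches) and
    true negative rate (good outputs it correctly passes), separately."""
    pairs = list(zip(human_fail, judge_fail))
    caught = [j for h, j in pairs if h]          # judge verdicts on human-flagged failures
    cleared = [not j for h, j in pairs if not h]  # judge verdicts on human-passed outputs
    tpr = sum(caught) / len(caught) if caught else None
    tnr = sum(cleared) / len(cleared) if cleared else None
    return {"tpr": tpr, "tnr": tnr}

# A judge that always says "pass": 6-of-8 raw agreement, yet it catches nothing.
stats = judge_calibration(
    human_fail=[True, True, False, False, False, False, False, False],
    judge_fail=[False] * 8,
)
```

The raw agreement rate here is 75%, which sounds acceptable; the TPR of zero is what reveals the judge is useless as a safety net.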
Evaluate multi-step workflows at intermediate checkpoints, not just final outputs. Define specific evaluation criteria for each stage of the workflow and build validation checks at every handoff point between AI agents or processing stages.
Measure reliability, not just capability. Track whether your AI agent succeeds consistently across repeated trials with diverse inputs, not just whether it can succeed at least once. Use the distinction between capability metrics (Pass@k) and reliability metrics (Pass^k) to guide deployment decisions.
Next Steps
The difference between organisations that deploy AI agents successfully and those that encounter costly failures is not the sophistication of their tools or the size of their team. It is evaluation discipline: the willingness to measure quality rigorously, understand what the measurements reveal, and act on them.
AI agents can fail while appearing to succeed. They can produce confident, well-formatted outputs containing errors that only surface after damage is done. They can bypass the tools they were given and fabricate data instead. They can perform brilliantly on Tuesday and fail on Wednesday. The only way to know whether your AI agent is doing good work is to measure it systematically, continuously, and from multiple angles, using checks that are independent of the AI agent itself.
The techniques in this article are not theoretical. They reflect what we have learned building AI agents that handle real client work, where errors have consequences and consistency is required. Evaluation is not a phase of development you complete and move past. It is an ongoing discipline, one that determines whether your AI agents can be trusted with the work you are giving them. The organisations that will succeed with AI agents are those that treat evaluation with the same rigour they apply to financial controls, client quality standards, and regulatory compliance.
For security-specific considerations, see Securing AI Agents: What We've Learned Building Them.⁶ For operational practices, see Running AI Agents: What Changes When the Bot Joins the Team.⁸ If you are building or evaluating AI agents in your organisation, we welcome the conversation.
Sources
1. Gartner (2025). 'Gartner Predicts 40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026.' Press release, 26 August 2025. Available at: https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025
2. Serpin (2025). 'How We Designed a Zero-Fabrication Research Agent.' Available at: https://serpin.ai/insights/how-we-designed-a-zero-fabrication-research-agent
3. Yao, S., Shinn, N., Razavi, P. and Narasimhan, K. (2024). 'τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains.' arXiv:2406.12045. Available at: https://arxiv.org/abs/2406.12045
4. Maxim AI (2025). 'Diagnosing and Measuring AI Agent Failures: A Complete Guide.' Available at: https://www.getmaxim.ai/articles/diagnosing-and-measuring-ai-agent-failures-a-complete-guide/
5. Microsoft AI Red Team (2025). 'Taxonomy of Failure Modes in Agentic AI Systems.' April 2025. Available at: https://www.microsoft.com/en-us/security/blog/2025/04/24/new-whitepaper-outlines-the-taxonomy-of-failure-modes-in-ai-agents/
6. Serpin (2026). 'Securing AI Agents: What We've Learned Building Them.' Available at: https://serpin.ai/insights/securing-ai-agents-what-we-ve-learned-building-them
7. Anthropic (2026). 'Demystifying Evals for AI Agents.' Engineering blog, 9 January 2026. Available at: https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
8. Serpin (2026). 'Running AI Agents: What Changes When the Bot Joins the Team.' Available at: https://serpin.ai/insights/running-ai-agents-what-changes-when-the-bot-joins-the-team
9. Serpin (2025). 'How AI Agents Are Transforming Professional Services — and How to Implement Successfully.' Available at: https://serpin.ai/insights/how-ai-agents-are-transforming-professional-services-and-how-to-implement-successfully
10. Financial Times (2026). 'KPMG negotiated lower fees from auditor by citing AI efficiency gains.' February 2026. Available at: https://www.ft.com/content/c891c47c-b21f-4e0f-84b3-b80c794eff3d
11. Serpin (2025). 'More AI Regulation Is Coming in Financial Services.' Available at: https://serpin.ai/insights/more-ai-regulation-is-coming-in-financial-services
12. CNBC (2026). 'Goldman Sachs taps Anthropic's Claude to automate accounting, compliance roles.' 6 February 2026. Available at: https://www.cnbc.com/2026/02/06/anthropic-goldman-sachs-ai-model-accounting.html
13. Simmering, P. (2026). 'The Reliability Gap: Agent Benchmarks for Enterprise.' 4 January 2026. Available at: https://simmering.dev/blog/agent-benchmarks/
14. Husain, H. and Shankar, S. (2026). 'LLM Evals: Everything You Need to Know.' Available at: https://hamel.dev/blog/posts/evals-faq/
15. Langfuse (2025). 'Automated Evaluations of LLM Applications.' Available at: https://langfuse.com/blog/2025-09-05-automated-evaluations
Written by

Julia Druck
Let's have a conversation.
No pressure. No lengthy pitch deck. Just a straightforward discussion about where you are with AI and whether we can help.
If we're not the right fit, we'll tell you. If you're not ready, we'll say so. Better to find that out in a 30-minute call than after signing a contract.
