Running AI Agents: What Changes When the Bot Joins the Team

Executive Summary

  • AI agents make decisions autonomously. Traditional IT operations can tell you if software is running, not if decisions are correct.

  • Adoption is accelerating rapidly, driving the emergence of AgentOps: practices for managing agents throughout their lifecycle.

  • Deploying decision-making agents requires new governance: autonomy boundaries, human-agent handoffs, escalation paths, and compliance frameworks.

  • The bigger change is how existing roles evolve. Everyone who works with agents becomes responsible for oversight.

AI agents are moving from pilots to production. Law firms use them for contract review. Insurance companies deploy them for claims processing. Consulting firms have them synthesising research and drafting proposals.

Adoption is accelerating. Gartner predicts 40% of enterprise applications will feature task-specific AI agents by the end of 2026, up from less than 5% in 2025.¹ The AI agent market may grow from $7.8 billion to over $52 billion by 2030.² A Lloyds survey found that 59% of UK financial institutions now report productivity gains from AI, nearly double the figure from a year earlier.³

The question is no longer whether to use agents. It is how to run them well. And that turns out to be harder than most organisations expected.

Consider a law firm that deploys an AI agent to review contracts. It processes 500 documents in a day. The IT dashboard shows green across the board: no errors, no downtime, no alerts. Two weeks later, a partner discovers the agent has been missing a category of risk clause in every document it reviewed.

Nothing was broken. The agent was running perfectly. It was just making bad decisions, and nobody had the tools or processes to catch it.

We hear this concern from many clients: if agents need constant supervision, what is the point? Does managing them just create new work that offsets the efficiency gains? This article addresses that concern directly: the answer lies in a combination of organisational design, evolving roles, and automated monitoring tools that make agent oversight scalable.

What Makes Agents Different

An AI agent is not a chatbot with a better marketing name. The distinction matters for understanding what running them involves.

A chatbot responds to queries within a narrow domain. An agent reasons about tasks, uses tools, makes decisions, and takes multi-step actions to achieve goals. Where a chatbot might answer a question about a contract clause, an agent might review an entire franchise agreement, flag every section containing costs and fees, cross-reference each one against internal policy, and draft a summary for the client. In financial services, an agent might pull a customer's transaction history from a CRM, run credit checks across multiple bureaus, update the bid management system with the results, and draft a recommendation, all tasks that previously required an analyst to complete manually over several hours.
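
To make the distinction concrete, here is a deliberately simplified agent loop in Python. It is an illustrative sketch only: the tool names and the fixed plan stand in for the LLM-driven planning a real agent would do, but the shape is the same, with each step choosing a tool and feeding its result into the next.

```python
# Illustrative only: a deliberately simplified agent loop.
# Tool names and the fixed plan are hypothetical, not from any specific product.

from typing import Callable

def fetch_transactions(customer_id: str) -> str:
    return f"transaction history for {customer_id}"

def run_credit_check(customer_id: str) -> str:
    return f"credit score for {customer_id}: 720"

def draft_recommendation(context: list[str]) -> str:
    return "Recommendation drafted from: " + "; ".join(context)

TOOLS: dict[str, Callable] = {
    "fetch_transactions": fetch_transactions,
    "run_credit_check": run_credit_check,
}

def run_agent(goal: str, customer_id: str) -> str:
    """A chatbot answers in one step; an agent plans and chains tool calls."""
    context: list[str] = []
    # In a real agent, an LLM decides which tool to call next; here the plan is fixed.
    plan = ["fetch_transactions", "run_credit_check"]
    for tool_name in plan:
        result = TOOLS[tool_name](customer_id)
        context.append(result)          # each step feeds the next
    return draft_recommendation(context)

print(run_agent("assess loan application", "CUST-042"))
```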

This is powerful. It also requires fundamentally different management.

Traditional software is deterministic. The same input produces the same output, every time. When something breaks, you trace the execution path through predictable code.

Agents are not like that. They reason about problems in ways that can vary between sessions. They choose which tools to use, in what order, for what purpose. The same instruction might produce different results on different days, and both results might be reasonable. Managing this variability requires a combination of organisational controls (clear policies on what agents can decide independently), process design (defined review and escalation steps), and technical safeguards (runtime enforcement, output validation, and continuous monitoring). We explore the technical security dimension in detail in our companion article, Securing AI Agents: What We've Learned Building Them.

This changes what 'operations' means. With traditional software, IT operations teams focus on keeping systems running: server uptime, deployment pipelines, incident response. With agents, you also have to ask "is it doing the right thing?" And answering that question requires different tools, different skills, and often different organisational structures.

Why AgentOps Is Emerging

The challenge described above, monitoring agent quality rather than just availability, is why a new set of practices called AgentOps is emerging. IBM defines AgentOps as "an emerging set of practices focused on the lifecycle management of autonomous AI agents."⁴ In plain terms, AgentOps is about making sure your agents keep doing good work, not just that they keep running.

Why is this a distinct discipline rather than an extension of existing IT operations? Because the operational challenge is fundamentally different. The pattern becomes clear when you look at how operational practices have evolved as technology has become more complex.

DevOps emerged because software development and IT operations used to work in silos. Developers would write code, hand it to operations, and operations would figure out how to run it. DevOps teams bridge that gap. They build automated pipelines that test code, deploy it to servers, and monitor whether it is running correctly. When something breaks at 3am, they get the alert and fix it. Their core question is: "is the software running reliably?"⁵

MLOps does something similar for machine learning models. Unlike traditional software, ML models can degrade over time. A fraud detection model trained on last year's data might miss new fraud patterns this year. MLOps teams monitor model accuracy, retrain models when performance drops, and manage the pipeline from data to deployed model. Their core question is: "is the model still accurate?"⁶

AgentOps extends this to AI agents, but the core question is harder. DevOps monitors whether software is running. MLOps monitors whether models are accurate. AgentOps must monitor whether agents are making good decisions, and unlike uptime or accuracy, 'good decisions' is often a judgment call that requires domain expertise to evaluate.

How AgentOps Adds Value

In our work building and deploying agents, we see four areas where AgentOps practices make a concrete difference:

Visibility: knowing what your agents are actually doing. Without visibility, problems remain invisible until a client complains or an auditor asks questions. Several observability, governance and security platforms now provide step-by-step traces of agent behaviour: which tools the agent used, what data it accessed, what reasoning led to its output, and where it spent time and money.⁷ Think of it like a flight recorder for each agent interaction. If a document review agent misses a key risk in a contract, the trace shows exactly what the agent looked at and where it went wrong, so the team can fix the root cause rather than just correcting the single output.
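
As an illustration of what such a trace captures, here is a minimal, platform-agnostic sketch in Python. It is not the API of LangSmith, Langfuse, or any other product; it simply records each step of an agent run (inputs, outputs, timing) so a reviewer can reconstruct what happened.

```python
# A minimal, platform-agnostic "flight recorder" for agent steps.
# Illustrative only; real observability platforms provide far richer tracing.

import json
import time
import uuid
from contextlib import contextmanager

class AgentTrace:
    def __init__(self, task: str):
        self.trace_id = str(uuid.uuid4())
        self.task = task
        self.steps = []

    @contextmanager
    def step(self, name: str, **inputs):
        record = {"step": name, "inputs": inputs, "started": time.time()}
        try:
            yield record                  # the agent writes its output into this record
        finally:
            record["duration_s"] = round(time.time() - record["started"], 3)
            self.steps.append(record)

    def export(self) -> str:
        return json.dumps({"trace_id": self.trace_id, "task": self.task,
                           "steps": self.steps}, indent=2, default=str)

trace = AgentTrace("review franchise agreement")
with trace.step("clause_extraction", document="agreement.pdf") as s:
    s["output"] = ["fees clause", "termination clause"]   # what the agent produced
with trace.step("policy_check", clauses=2) as s:
    s["output"] = "1 clause flagged against internal policy"
print(trace.export())   # the record a reviewer inspects when something goes wrong
```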

Quality evaluation: judging whether agent outputs are correct. This is where domain expertise matters most. An IT operations team can monitor whether the agent ran without errors, but only a lawyer can judge whether a contract summary is accurate, and only an underwriter can judge whether a risk assessment is sound. Organisations address this in two ways. First, through sampling: selecting a percentage of agent outputs for expert human review. Second, through automated checks: rules that verify specific requirements were met (for example, confirming that a compliance review agent checked every required regulation, or that a document summary included all sections flagged as high-risk). Evaluation platforms allow teams to define custom quality criteria and run automated assessments against them.⁸ The combination of sampling and automation makes quality evaluation scalable, rather than requiring every output to be reviewed manually.
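
The sketch below illustrates both mechanisms on hypothetical data: a deterministic rule that checks every high-risk section made it into the summary, and a sampling step that routes a share of outputs to expert review. The field names and the 10% rate are assumptions for illustration.

```python
# Illustrative quality evaluation: automated rule checks plus sampling for human review.
# Field names and the 10% sampling rate are assumptions, not recommendations.

import random

def check_high_risk_coverage(output: dict) -> list[str]:
    """Automated rule: every section flagged high-risk must appear in the summary."""
    missing = [s for s in output["high_risk_sections"]
               if s not in output["summary_sections"]]
    return [f"missing high-risk section: {s}" for s in missing]

def sample_for_review(outputs: list[dict], rate: float = 0.10) -> list[dict]:
    """Sampling: route a manageable share of outputs to an expert reviewer."""
    k = max(1, int(len(outputs) * rate))
    return random.sample(outputs, k)

outputs = [
    {"doc": "contract_a", "high_risk_sections": ["indemnity", "fees"],
     "summary_sections": ["indemnity", "fees", "term"]},
    {"doc": "contract_b", "high_risk_sections": ["indemnity"],
     "summary_sections": ["term"]},            # misses a high-risk section
]

for o in outputs:
    for issue in check_high_risk_coverage(o):
        print(f"{o['doc']}: {issue}")          # caught automatically, no human needed

print("queued for expert review:", [o["doc"] for o in sample_for_review(outputs)])
```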

Governance: deciding what agents can and cannot do without human approval. This is a business decision, not a technical one. For each agent capability, someone must decide: can the agent act independently, or does it need a human to approve the action first? High-value decisions (approving a loan, sending legal advice to a client) typically require human sign-off. Routine tasks (summarising a document, categorising an email) may not. These autonomy boundaries must be formally defined, documented, and enforced through both process design and technical controls. In regulated industries like financial services, these decisions also need to satisfy compliance requirements for audit trails and accountability.
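
Once the business has made those decisions, they can be enforced in code as well as in process. The sketch below shows one simple pattern, a policy table consulted before any action executes; the action names and the policy itself are illustrative assumptions.

```python
# Illustrative autonomy boundary enforcement: a policy table checked before any action runs.
# Action names, thresholds, and the policy itself are assumptions for illustration.

from enum import Enum

class Autonomy(Enum):
    AUTONOMOUS = "autonomous"            # agent may act without approval
    HUMAN_APPROVAL = "human_approval"    # a human must sign off first

POLICY = {
    "summarise_document": Autonomy.AUTONOMOUS,
    "categorise_email": Autonomy.AUTONOMOUS,
    "send_legal_advice": Autonomy.HUMAN_APPROVAL,
    "approve_loan": Autonomy.HUMAN_APPROVAL,
}

def execute(action: str, payload: dict, approved_by: str | None = None) -> str:
    level = POLICY.get(action, Autonomy.HUMAN_APPROVAL)   # default to the safer option
    if level is Autonomy.HUMAN_APPROVAL and approved_by is None:
        return f"BLOCKED: '{action}' requires human approval before it can run"
    # ...perform the action and write an audit record here...
    return f"EXECUTED: {action} (approved_by={approved_by})"

print(execute("summarise_document", {"doc": "contract_a"}))
print(execute("approve_loan", {"amount": 250_000}))
print(execute("approve_loan", {"amount": 250_000}, approved_by="credit.officer"))
```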

Continuous improvement: making agents better over time. When a lawyer corrects an agent's document summary, that correction should feed back into how the agent works in future. When an underwriter overrides an agent's risk assessment, the system should capture why. Without structured feedback loops, organisations fix the same problems repeatedly. AgentOps tools capture these corrections and use them to evaluate whether agent performance is improving, degrading, or drifting from expected behaviour.

From Single Agents to Multi-Agent Systems

So far we have discussed individual agents. But as organisations scale their use of AI, they increasingly deploy multi-agent systems: multiple specialised agents that work together on complex tasks.

Think of it like a team. Rather than one generalist agent trying to do everything, you might have a research agent that gathers information, an analysis agent that evaluates it, and a drafting agent that produces the output. Each agent is purpose-built for its role.

In financial services, for example, an insurance claims workflow might use one agent for initial claim intake (gathering details from the claimant), a second for damage appraisal (cross-referencing against policy terms and historical data), and a third for settlement calculation.⁹ In legal work, LexisNexis deploys four specialised agents that collaborate on complex research tasks: one handles orchestration, while the others focus on legal research, web search, and customer document analysis respectively.¹⁰
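
A minimal sketch of a pipeline like the claims example is shown below. The three agents are stand-in functions and the handoff validation is deliberately simple; real systems use orchestration frameworks, but the shape, specialist steps passing validated work to one another, is the same.

```python
# Illustrative multi-agent claims pipeline: three specialist "agents" with validated handoffs.
# The agents are stand-in functions; data fields and checks are assumptions for illustration.

def intake_agent(claim_text: str) -> dict:
    return {"claimant": "A. Example", "incident": claim_text, "policy_id": "POL-123"}

def appraisal_agent(intake: dict) -> dict:
    return {**intake, "covered": True, "estimated_damage": 4_200}

def settlement_agent(appraisal: dict) -> dict:
    offer = appraisal["estimated_damage"] if appraisal["covered"] else 0
    return {**appraisal, "settlement_offer": offer}

def validate_handoff(stage: str, payload: dict, required: list[str]) -> dict:
    """Check each handoff before the next agent builds on it; escalate if fields are missing."""
    missing = [f for f in required if f not in payload]
    if missing:
        raise ValueError(f"escalate to human: {stage} handoff missing {missing}")
    return payload

claim = intake_agent("water damage to kitchen, 12 March")
claim = validate_handoff("intake", claim, ["claimant", "policy_id"])
claim = appraisal_agent(claim)
claim = validate_handoff("appraisal", claim, ["covered", "estimated_damage"])
print(settlement_agent(claim))
```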

We use multi-agent architectures in our own work, and the advantages are real: specialised agents perform better than generalists, the system is more modular and easier to update, and complex workflows become manageable. But they also introduce new operational challenges:

  • Agent-to-agent handoffs: When one agent passes work to another, who checks the quality of what was handed over? A mistake by the first agent can cascade through the entire chain.

  • Human-in-the-loop and human-on-the-loop: Some steps require a human to approve before the workflow continues (human-in-the-loop). Others allow the agent to proceed but keep a human monitoring and able to intervene (human-on-the-loop). Designing which steps need which level of oversight is a critical governance decision.

  • Exception handling: What happens when an agent encounters something outside its competence? There must be clear escalation paths, both to other agents and to human experts, so that edge cases do not produce bad outcomes.

These coordination challenges are why dedicated oversight roles are emerging.

What This Means in Practice

The shift from traditional IT operations to agent operations is not about replacing existing capabilities. It is about adding new ones.

With Traditional Software We Focus On... → With AI Agents We Also Need To...

  • Keep systems running → Evaluate whether agent decisions are correct

  • Monitor uptime and errors → Monitor quality of reasoning and detect drift

  • Manage access and permissions → Decide autonomy boundaries (what can agents do alone?)

  • Deploy and update software → Design human-agent handoffs and escalation paths

  • Troubleshoot when things break → Intervene when agents are wrong but not technically broken

A reasonable question: does all this monitoring and evaluation just create new work that cancels out the efficiency gains?

Not if it is done well. The monitoring tools described above automate much of the visibility and quality checking. Automated evaluation rules catch common problems without human involvement. Sampling strategies mean humans review a manageable percentage of outputs, not every one. As organisations learn what 'good' looks like for their specific agents, they can automate more of the quality assurance and reduce the manual overhead. The goal is a level of oversight that is proportionate to the risk, not a permanent bottleneck.

The implication is that agent operations requires collaboration between technical teams (who manage the systems and monitoring tools) and business teams (who define what 'good' looks like for agent outputs). Neither can do it alone.

New Roles and Evolving Responsibilities

This collaboration challenge is why new roles are appearing. Recent Capgemini research found that 48% of financial institutions are creating new roles for employees to supervise AI agents.¹¹

No single existing function owns the full picture. IT operations manages technology but typically lacks domain expertise to judge whether a legal summary is accurate or a claims decision is fair. Business teams understand quality but do not have the technical infrastructure to monitor agent behaviour across the organisation. Change management addresses adoption but not ongoing technical governance. Compliance teams understand regulatory requirements but may not understand how agents actually make decisions. Someone needs to bridge these gaps.

The Head of Agent Operations

Some organisations are creating roles specifically to coordinate how agents, humans, and multi-agent systems work together effectively. The title scales with the organisation: Head of Agent Operations at director level, or a Chief Agent Officer at C-suite level where agents handle a significant share of business processes.

Microsoft's 2025 Work Trend Index found that 81% of leaders expect agents to be integrated into their company's AI strategy within 12 to 18 months. Yet only 67% of leaders are familiar with agents, and just 40% of employees. That gap between executive ambition and workforce readiness is exactly what a dedicated coordination role addresses.¹²

What they actually do:

  • Set autonomy boundaries and governance frameworks. For each agent capability, someone must decide: can the agent act independently, or does it need human approval? A&O Shearman, the first Big Law firm to deploy Harvey AI to more than 3,500 employees, requires every AI output to be audited by a human.¹³ Someone had to make that governance decision, align it with the firm's risk appetite and regulatory obligations, and ensure it is followed in practice.

  • Design human-agent workflows, including handoffs and escalation. Where does agent work end and human work begin? What happens when an agent encounters an exception it cannot handle? At Generali France, the voice assistant handles initial calls and reassures claimants, but a human steps in for complex cases.¹⁴ Designing that handoff, including the criteria for what counts as 'complex', the escalation process, and the feedback mechanism, requires understanding both the technology and the business context.

  • Coordinate across business units and compliance. Agents often serve multiple teams and must satisfy regulatory requirements. In multi-agent systems, this coordination extends to how agents interact with each other: which agent has authority, what data can be shared between agents, and how conflicts are resolved. Someone needs to ensure consistency, manage conflicts, prevent duplication, and maintain audit trails that demonstrate compliance.

  • Build trust and drive adoption through organisational change. New technology fails without organisational buy-in. This includes helping teams understand what agents can and cannot do, designing training programmes, and building confidence through demonstrated results and transparent governance.

As the share of business processes handled by agents grows, this role moves to the executive level, with strategic responsibility for where agents should and should not be used, how they interact with the wider workforce, and how the organisation manages the associated risks.¹⁵

How Existing Roles Evolve

For most organisations, the bigger change is not the creation of entirely new positions but the new responsibilities that existing roles take on.

Technical teams retain their core expertise in infrastructure and systems management. What changes is the addition of agent-specific monitoring. This includes watching for drift (agent behaviour gradually changing as underlying models are updated), degradation (agent quality declining over time as the tasks or data it encounters evolve), and overstep (agents taking actions beyond their intended scope). Dashboards and metrics need to evolve beyond uptime and error rates to include quality scores, decision audit trails, and compliance checks. Technical teams also enforce least privilege: ensuring agents have only the minimum access and autonomy needed for their specific tasks, a principle we explore in depth in Securing AI Agents (https://serpin.ai/insights/securing-ai-agents-what-we-ve-learned-building-them).
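
One simple way to operationalise drift detection is to track a rolling quality score against an agreed baseline and alert when it falls outside tolerance. The sketch below is illustrative; the scores, window size, and threshold are assumptions, not recommendations.

```python
# Illustrative drift check: compare a rolling window of quality scores against a baseline.
# Scores, window size, and tolerance are made-up values for illustration.

from collections import deque
from statistics import mean

class DriftMonitor:
    def __init__(self, baseline: float, window: int = 50, tolerance: float = 0.05):
        self.baseline = baseline              # quality level agreed at sign-off
        self.scores = deque(maxlen=window)
        self.tolerance = tolerance

    def record(self, score: float) -> str | None:
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return None                       # not enough data yet
        rolling = mean(self.scores)
        if rolling < self.baseline - self.tolerance:
            return (f"ALERT: rolling quality {rolling:.2f} has drifted below "
                    f"baseline {self.baseline:.2f}, investigate before continuing")
        return None

monitor = DriftMonitor(baseline=0.92, window=5)
for score in [0.93, 0.88, 0.84, 0.82, 0.80]:  # quality scores from sampled reviews
    alert = monitor.record(score)
    if alert:
        print(alert)
```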

Product managers have a significant role to play. Agents are, in effect, products: they have users, capabilities, performance requirements, and lifecycles. Product managers bring the discipline of requirements definition, user research, and iterative improvement that agent development needs. This includes evolving tools like product requirements documents (PRDs) to account for agent-specific considerations: autonomy boundaries, quality criteria, human oversight requirements, and feedback mechanisms. At Serpin, we are developing the Agent Requirements Document, which adapts the PRD to include these agent-specific considerations.
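
As an illustration only, and not Serpin's actual Agent Requirements Document, the sketch below shows how those agent-specific considerations might be captured as structured fields alongside a conventional PRD.

```python
# Illustrative only: one possible shape for agent-specific requirements,
# NOT Serpin's actual Agent Requirements Document. Field values are hypothetical.

from dataclasses import dataclass

@dataclass
class AgentRequirements:
    name: str
    purpose: str
    autonomy_boundaries: dict[str, str]   # action -> "autonomous" | "human_approval"
    quality_criteria: list[str]           # what 'good' means, judged by domain experts
    oversight: dict[str, str]             # sampling rates, review owners, escalation path
    feedback_mechanisms: list[str]        # how corrections flow back into the agent

contract_review = AgentRequirements(
    name="contract-review-agent",
    purpose="Flag risk clauses and draft summaries for partner review",
    autonomy_boundaries={"draft_summary": "autonomous",
                         "send_to_client": "human_approval"},
    quality_criteria=["all high-risk clause categories identified",
                      "summary consistent with internal policy"],
    oversight={"sampling_rate": "10% partner review", "escalation": "supervising partner"},
    feedback_mechanisms=["partner corrections logged and reviewed weekly"],
)
print(contract_review.name, "-", len(contract_review.quality_criteria), "quality criteria")
```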

Project and change managers face a different expansion. Agent implementation is not just technology deployment. It requires decisions about autonomy (which tasks need human approval?), quality criteria (how do we know the agent is doing good work?), and organisational readiness (how do we help teams use agent outputs appropriately?). These are change management questions, not technical ones. They also require understanding regulatory requirements and ensuring that governance structures satisfy compliance obligations.

Everyone who works with agents becomes what some call a 'day-one manager'.¹⁶ In practical terms:

  • Directing: Telling the agent what to prioritise, what context matters

  • Checking: Reviewing outputs before they go further

  • Intervening: Recognising when the agent is outside its competence and escalating

  • Improving: Providing feedback that makes the agent better over time

These are supervision skills, not programming skills. Most people already have them from working with colleagues. The adjustment is learning how to apply them to AI.

What We Have Learned

At Serpin, we build and deploy AI agents for professional services clients. Several lessons from that work have shaped our approach.

Organisational and process change is as important as technology. We have seen agent projects stall not because the technology failed, but because the organisation was not ready. One client deployed a research agent that produced good outputs, but nobody had defined who was responsible for reviewing them, how exceptions would be escalated, or how the agent's work fitted into existing approval workflows. The technology worked. The process around it did not. The organisations seeing results are redesigning processes to include agent-human handoffs, updating governance structures to account for automated decision-making, defining new escalation paths, and investing in training. Treating agent deployment as a technology project without addressing these structural changes consistently leads to underperformance.

Autonomy decisions are business decisions with regulatory implications. For each agent task, someone must decide: does this need human approval every time, or can the agent act independently? Only high-value items? Only flagged ones? We have found that getting these decisions right early prevents costly rework later. One approach that works well is starting with human-in-the-loop for all outputs, then progressively relaxing oversight as confidence builds and the data supports it. These choices shape risk, efficiency, and adoption. They cannot be delegated to IT alone. In regulated industries, they also need to satisfy compliance frameworks and demonstrate accountability through audit trails.
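
One way to make that progression explicit is to tie the human review rate to measured performance: start at 100% review and step it down only when sampled accuracy stays above agreed thresholds. The sketch below is illustrative; the thresholds and rates are assumptions, and in a regulated context they would be agreed with compliance.

```python
# Illustrative progressive-autonomy policy: the review rate steps down only as measured
# accuracy clears agreed thresholds. Thresholds and rates are assumptions for illustration.

REVIEW_LADDER = [
    (0.00, 1.00),   # below 95% sampled accuracy: review every output (human-in-the-loop)
    (0.95, 0.25),   # 95%+ sustained: review a quarter of outputs
    (0.98, 0.10),   # 98%+ sustained: review one in ten, humans stay "on the loop"
]

def review_rate(sampled_accuracy: float) -> float:
    """Return the share of outputs requiring human review for a given accuracy level."""
    rate = 1.0
    for threshold, r in REVIEW_LADDER:
        if sampled_accuracy >= threshold:
            rate = r
    return rate

for accuracy in (0.90, 0.96, 0.99):
    print(f"sampled accuracy {accuracy:.0%} -> review {review_rate(accuracy):.0%} of outputs")
```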

Verification must be external to the agent. Early in our development, we asked agents to confirm they had followed quality rules: "Did you use only verified sources? Answer true or false." The agents answered "true" regardless of whether they actually complied. This is a known property of large language models: they will assert compliance with any rule you ask them to assert. Verification must come from outside the agent being verified, through automated checks, sampling, or human review. Some organisations use one AI to evaluate another's work, which can help, but for compliance-critical outputs, external verification is essential.
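
A minimal illustration of that principle: rather than asking the model whether it used only verified sources, extract the sources it cited and check them against an allowlist. The source tags and the allowlist here are hypothetical.

```python
# Illustrative external verification: check cited sources against an allowlist
# instead of asking the agent whether it complied. Sources shown are hypothetical.

import re

APPROVED_SOURCES = {"internal-policy-library", "westlaw", "companies-house"}

def extract_cited_sources(agent_output: str) -> set[str]:
    """Pull source tags of the form [source: name] out of the agent's output."""
    return set(re.findall(r"\[source:\s*([\w-]+)\]", agent_output))

def verify_sources(agent_output: str) -> list[str]:
    cited = extract_cited_sources(agent_output)
    unapproved = cited - APPROVED_SOURCES
    return [f"unverified source used: {s}" for s in sorted(unapproved)]

output = ("Clause 4 conflicts with policy [source: internal-policy-library]. "
          "Precedent found [source: random-blog].")
issues = verify_sources(output)
print(issues or "all cited sources verified")   # the check runs outside the agent
```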

Product and project managers need agent literacy. They do not need to become AI engineers. But they do need to understand enough about how agents work, including their non-deterministic nature, their tendency to drift, and their limitations, to make sound decisions about autonomy, oversight, and implementation sequencing. We have seen projects where well-intentioned managers set agent scope too broadly because they did not understand the difference between what an agent could attempt and what it could do reliably. Without this understanding, agent projects become technology deployments without the organisational change that makes them successful.

Start with the workflow, not the technology. The most successful agent implementations we have seen began by mapping the existing human workflow in detail: every decision point, every handoff, every exception. Only then did they identify which steps an agent could handle, which needed human oversight, and where the boundaries should sit. Starting with the technology and trying to fit workflows around it produces fragile implementations that break when they encounter edge cases.

Getting Started

The organisations seeing results from AI agents have started with fundamentals.

Visibility first. Before you can improve agent behaviour, you need to understand what your agents are actually doing. Instrumentation that traces agent reasoning, tool usage, and decision paths is not optional. It is the foundation for everything else.

Define governance early. Decide what agents can do autonomously and what requires human approval before you deploy, not after. Document these decisions and review them regularly as you learn more about agent behaviour in your specific context.

Clear ownership. Someone needs to be accountable for agent operations, with responsibility for performance, quality, governance, and continuous improvement. Whether that is a new role or an evolution of an existing one depends on the organisation. What matters is that the accountability exists.

Invest in people and processes alongside technology. For every pound spent on agent development, consider what is being spent on training, change management, process redesign, and the governance structures that determine whether agents are used effectively.

The organisations that figure out how to run agents well, combining technology, governance, and organisational change, will see better results. Those that treat deployment as the finish line will find themselves rebuilding operational foundations under pressure.

The agents are the visible part. How you run them is what makes them work.

Sources

  1. Gartner (2025) 'Gartner Predicts 40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026, Up from Less Than 5% in 2025.' Press release, 26 August 2025: https://www.gartner.com/en/newsroom/press-releases/2025-08-26-gartner-predicts-40-percent-of-enterprise-apps-will-feature-task-specific-ai-agents-by-2026-up-from-less-than-5-percent-in-2025

  2. Market projection cited in Machine Learning Mastery (2025) '7 Agentic AI Trends to Watch in 2026': https://machinelearningmastery.com/7-agentic-ai-trends-to-watch-in-2026/

  3. Lloyds Banking Group (2025) 'Financial Institutions Sentiment Survey', September 2025. Survey of over 100 senior leaders across UK banks, asset managers, insurers, and financial sponsors: https://www.lloydsbank.com/business/resource-centre/insight/financial-institutions-sentiment-survey.html

  4. IBM (2025) 'What is AgentOps?' https://www.ibm.com/think/topics/agentops

  5. AWS (2025) 'What is DevOps?' https://aws.amazon.com/devops/what-is-devops/

  6. AWS (2025) 'What is MLOps?' https://aws.amazon.com/what-is/mlops/

  7. Agent observability and governance platforms include LangSmith (https://www.langchain.com/langsmith/observability), Langfuse (https://langfuse.com/), Arize AI (https://arize.com/), and others. These provide tracing, evaluation, and monitoring capabilities for agent workflows.

  8. Agent quality evaluation frameworks include LangSmith Evaluation (https://docs.smith.langchain.com/evaluation) and similar capabilities in competing platforms. These allow teams to define custom criteria and run automated assessments against agent outputs.

  9. AWS (2025) 'Agentic AI in Financial Services: Choosing the Right Pattern for Multi-Agent Systems': https://aws.amazon.com/blogs/industries/agentic-ai-financial-services-choosing-the-right-pattern-for-multi-agent-systems/

  10. LexisNexis Protege General AI deployment described in National Law Review (2026) 'Ten AI Predictions for 2026': https://natlawreview.com/article/ten-ai-predictions-2026-what-leading-analysts-say-legal-teams-should-expect

  11. Capgemini Research Institute (2025) 'World Cloud Report for Financial Services 2026'. Survey of 1,100 leaders across 14 markets, June-September 2025: https://www.capgemini.com/us-en/news/press-releases/banks-and-insurers-deploy-ai-agents-to-fight-fraud-and-process-applications-with-plans-for-new-roles-to-supervise-the-ai/

  12. Microsoft (2025) 'Work Trend Index Annual Report 2025: The Year the Frontier Firm is Born'. Survey of 31,000 professionals across 31 countries: https://www.microsoft.com/en-us/worklab/work-trend-index/2025-the-year-the-frontier-firm-is-born

  13. A&O Shearman Harvey AI implementation described in National Law Review (2026) 'The Legal Industry's AI Arms Race Exposed': https://natlawreview.com/article/inside-legal-industrys-ai-arms-race

  14. Generali France case study cited in Capgemini (2025) press release (see source 11) and Microsoft customer story: https://www.microsoft.com/en/customers/story/25382-generali-microsoft-365-copilot

  15. Digital Workforce (2026) 'Will we get a Chief Agent Officer in 2026?': https://digitalworkforce.com/rpa-news/will-we-get-a-chief-agent-officer-in-2026/

  16. Training Industry (2025) 'How to Prepare Your Workforce for Agentic AI', 20 May 2025: https://trainingindustry.com/articles/workforce-development/how-to-prepare-your-workforce-for-agentic-ai/


Written by Scott Druck