Securing AI Agents: What We've Learned Building Them

Executive Summary

AI agents connect to databases, APIs, email, and files. This connectivity makes them powerful—and makes them targets.
Security risks scale with AI capability. The more an AI system can do, the more damage a compromise can cause.
Protective measures called guardrails constrain agent behaviour and reduce security risks, but are not sufficient on their own.
Agent-specific security measures are essential for deployment at scale.
A January 2026 vulnerability in n8n demonstrated the stakes - one flaw potentially exposing 100,000 orchestration servers globally.

Introduction

AI agents that automate workflows need to connect to databases, call APIs, send emails, and access files. This connectivity is what makes them powerful. It's also what makes them targets. When an attacker compromises an orchestration platform, they don't just breach one system; they gain access to everything the agent can touch.

This article examines the security risks organisations face when deploying AI agents, the attack patterns to understand, and practical measures that reduce exposure.

Why Orchestration Platforms Are Targets

Most AI agents do not work in isolation. They need to connect to databases, call APIs, send emails, and access files. Orchestration platforms handle this complexity. They provide a central layer that coordinates the agent's actions across multiple systems, managing credentials, scheduling tasks, and routing data between services.

Platforms like n8n, Langflow, and Make have become popular precisely because they make this easy. You can connect an AI agent to dozens of services in minutes, automate complex workflows without writing code, and spin up prototypes rapidly. These are genuine benefits that we use ourselves.

But the same features that make orchestration platforms convenient also make them attractive targets. They store credentials for every connected system. They often run with broad permissions. And because they abstract away complexity, users may not realise how much access they are granting until something goes wrong.

Attackers know this. They look for vulnerabilities in orchestration layers because a single breach can unlock access to everything the platform connects to. Some attackers also create honeypots: malicious data sources, fake APIs, or poisoned documents designed to be discovered and processed by agents. When an agent retrieves content from a compromised source, it may follow hidden instructions embedded in that content.

A Real-World Example

We use n8n extensively, a workflow automation platform for orchestrating AI agents and connecting them to business systems. It is a capable platform that we recommend for both production workflows and rapid prototyping.

In January 2026, security researchers at Cyera disclosed a vulnerability they codenamed Ni8mare.¹ The vulnerability was received the maximum severity score. It allows an unauthenticated attacker to take complete control of any self-hosted n8n instance exposed to the internet. The researchers' assessment was direct: the blast radius of a compromised n8n is massive.²

Why massive? Because workflow platforms like n8n are integration hubs. They store credentials for every system they connect to: API keys, OAuth tokens, database connections, cloud storage access. A single breach does not compromise one system. It potentially hands attackers the keys to everything. Another security researcher, Censys, identified 26,512 n8n instances exposed on the public internet, and the vulnerability is estimated to impact roughly 100,000 instances globally.³

Fortunately, the n8n team had already released a fix in November 2025, nearly two months before the public disclosure in January 2026. This is responsible disclosure working as intended. But the potential blast radius illustrates a broader point that applies to any agentic system, whether using n8n, other platforms, or bespoke developments: orchestration layers that concentrate credentials and permissions become high-value targets.

This was not an isolated case. Throughout 2025, significant vulnerabilities emerged across the AI workflow ecosystem, including in Langflow and multiple implementations of MCP servers (the protocol that allows AI agents to connect to external tools). The pattern illustrates something we have observed in our own work: security challenges scale dramatically with capability, and with connectivity. The more your agent can do, the more damage a compromise can cause.

Why AI Agents Face Greater Security Risks

Not all AI systems present equal risk. Based on our analysis of the leading security frameworks (including OWASP, MITRE ATLAS, and the NIST AI Risk Management Framework), we categorise AI systems into three levels based on their capabilities and what they have access to.⁴

Level 1: Conversational AI (chatbots, Q&A assistants) faces risks including prompt injection, information leakage, jailbreaking, and denial of service attacks. The AI can be tricked or leak information, but damage is limited because it cannot take actions beyond generating text.

Level 2: Connected AI (systems with document or database access) faces additional risks including retrieval poisoning, context window stuffing, and relevance manipulation. Attackers can now poison the data sources the AI retrieves from, not just the prompts. A malicious instruction hidden in a document becomes an instruction the AI might follow.

Level 3: AI Agents (autonomous systems that take actions) face the broadest range of security risks, including goal hijacking, tool misuse, privilege escalation, memory poisoning, cascading failures, and inter-agent trust exploitation. A tricked agent can cause real damage: sending confidential data to attackers, modifying critical files, or escalating its own privileges.

The n8n vulnerability illustrates a critical point about Level 3 systems: integration hubs that connect AI agents to business systems become high-value targets precisely because they concentrate access. One breach does not just compromise an AI chatbot. It hands attackers the credentials to everything the AI was permitted to touch.

Attack Patterns to Understand

Many attack patterns apply to any agentic system, but multi-agent architectures are increasingly common because they offer real advantages: specialised agents can handle different tasks, systems become more modular, and complex workflows become easier to manage. However, multi-agent systems also introduce specific security concerns. OWASP's 2026 Top 10 for Agentic Applications identifies several patterns worth understanding.⁵

Goal hijacking exploits a fundamental weakness in how language models work: they cannot reliably distinguish data from instructions. Attackers decompose malicious objectives into small, seemingly innocent tasks. An agent asked to "summarise this document" does not know the document contains hidden instructions. It processes what it reads, including commands disguised as content.

Memory and context poisoning plants malicious content that lies dormant until triggered later. Modern agents can remember past interactions, and attackers can inject content during one session that activates harmful behaviour days or weeks later, making attribution difficult.

Cascading failures occur when one agent's problem spreads to others. Research from Galileo AI found that in simulated multi-agent systems, a single compromised agent poisoned 87% of downstream decision-making within four hours.⁶

Inter-agent trust exploitation targets the weakest link. In August 2025, attackers exploited stolen OAuth tokens from Drift's Salesforce integration to access customer environments across more than 700 organisations. The activity appeared legitimate because it came from a trusted SaaS connection rather than a compromised user account.⁷ Security researchers described the blast radius as ten times greater than previous incidents where attackers infiltrated Salesforce directly.

The principle is the same as traditional cybersecurity: attackers do not need to defeat your strongest defences. They need to find one gap.

Toxic capability combinations are worth understanding when evaluating agent risk. Some capabilities are safe on their own but dangerous together. If an agent can read your email and also send data over the internet, an attacker could trick it into forwarding sensitive messages to an external server. Neither capability alone is the problem. The combination is.

We like Simon Willison's formalisation of this as the lethal trifecta: three capabilities that become dangerous when combined.⁸ They are: access to private data, exposure to untrusted content (like web pages or documents from external sources), and the ability to communicate externally (send emails, make API calls, or even just render an image from a URL).

Any two of these are manageable. All three together create an exfiltration path. An attacker plants instructions in a document the agent will read. The agent follows those instructions, accesses your private data, and sends it somewhere the attacker controls. MCP server ecosystems can be particularly vulnerable because they encourage mixing tools from different sources, often combining all three capabilities without the user realising it.

How Protections Work (And Where They Fall Short)

Given these attack patterns, what can organisations do to protect themselves? The industry uses the term "guardrails" to describe protective measures that prevent AI systems from causing harm. Think of them like safety barriers on a mountain road: they do not stop you driving, but they prevent you going off the edge.

Guardrails work at four layers:

Input guardrails screen what goes into the AI: validating user inputs, detecting attempts to manipulate the AI through malicious prompts, filtering known attack patterns, and vetting data sources.

Output guardrails control what comes out: scanning for sensitive information, blocking policy violations, and detecting when the AI reveals instructions it should not.

Behaviour guardrails limit what the AI can do: defining permitted actions, requiring human approval for high-risk operations, and implementing rate limits and spending caps.

System guardrails contain the AI within secure boundaries: sandboxing execution environments, isolating agents from each other, and implementing circuit breakers to stop problems spreading.

All four layers are necessary, but each can be circumvented by the attack patterns described above:

Input guardrails struggle with indirect prompt injection. Malicious instructions hidden in documents, images, or external data sources bypass direct input validation because the attack comes through content the agent retrieves, not content the user types.
Output guardrails cannot prevent actions already taken. If an agent has been manipulated into sending data to an attacker-controlled endpoint, output filtering sees only the confirmation message, not the exfiltration.
Behaviour guardrails can be circumvented through goal hijacking. An attacker who decomposes a prohibited action into many small permitted actions may stay within each individual rule while achieving a harmful outcome.
System guardrails face challenges with cascading failures and inter-agent trust. If agents trust each other's outputs without verification, a compromised agent can poison decisions across the system.

OWASP's 2025 Top 10 for LLM Applications keeps prompt injection as the number one vulnerability and notes that, given the stochastic nature of these models, fully fool-proof prevention methods are not yet known.⁹

This does not mean guardrails are useless. It means they must be deployed in depth, with each layer catching what others miss, and with the understanding that no single control will stop every attack.

What We Have Learned Building Agents

Several lessons from our own development have shaped how we approach agent security.

Prompt injection vulnerability can come from unexpected places. During a security audit of one of our research agents, we identified a prompt injection vulnerability: article content was being interpolated directly into prompts without sanitisation.¹⁰ Any text the agent processed could potentially have contained injection patterns. The fix was systematic: sanitise all external content before it reaches the agent's prompt. Escape XML tags. Filter patterns like "ignore previous instructions." Treat every external text as potentially adversarial.

Self-attestation does not work. Early in our development, we asked agents to confirm they had followed security rules: "Did you only use verified sources? Answer true or false." The agents answered "true" regardless of whether they actually complied.¹¹ LLMs will assert compliance with any rule you ask them to assert, which means verification must be external to the agent being verified.

All security vulnerabilities are blocking. We adopted a policy that no security issue is "low priority" or "nice to have." Path traversal risks, injection vulnerabilities, insufficient input validation: all block deployment until fixed. A single security breach can cost more than the entire development budget.

Regulatory Requirements Are Arriving

The EU AI Act's Article 15 requires high-risk AI systems to be resilient against attempts by unauthorised third parties to alter their use, outputs or performance by exploiting system vulnerabilities.¹²

This includes explicit requirements for measures to prevent:

Data poisoning and model poisoning attacks
Adversarial examples and model evasion
Confidentiality attacks and model flaws

For AI used in credit scoring (classified as high-risk under Annex III), these requirements apply from 2 August 2026. Article 9 mandates risk management systems that identify, evaluate, and mitigate security risks throughout the AI lifecycle.

This is not optional compliance. Organisations deploying AI agents for high-stakes decisions face regulatory obligations that require documented security measures, not just good intentions.

Practical Implementation

IBM's 2025 Cost of a Data Breach Report found that organisations using AI and automation technologies extensively saved an average of $1.9 million per breach.¹³ The business case for investing in AI security is clear.

Start with visibility. Before adding security controls, map your current attack surface: what tools do your agents have access to, what data can they reach, and what actions can they take. Many organisations discover their agents have broader capabilities than intended once they examine this systematically.

Do not overlook shadow AI: the AI systems operating in your organisation without formal governance. This includes staff using unapproved tools like ChatGPT for work tasks, but also AI capabilities quietly embedded in software you already use. Many SaaS applications have added AI features without making it obvious to customers. Both create blind spots in your security posture and potential data leakage paths you have not accounted for.

Apply the principle of least agency: give AI systems only the autonomy they need to complete their task. If an agent only needs to read files, do not give it write access. If it only needs to search, do not give it the ability to send emails.

Start here: Before deploying any AI agent, answer these questions:

Security fundamentals (all AI levels):

How are inputs validated before reaching the AI?
What data does the AI have access to? How is access controlled?
What logging and audit capabilities exist?
What guardrails are in place at each layer (input, output, behaviour, system)?

For autonomous agents specifically:

What actions can the AI take autonomously vs. requiring human approval?
How is the AI prevented from exceeding its intended scope?
What happens if the AI encounters an unexpected situation?
Is there an emergency shutdown capability?
Are agents isolated from each other to prevent cascading failures?
Is inter-agent communication authenticated and validated?

Security as Enabler

There is a temptation to view security as friction that slows AI deployment. The n8n vulnerability suggests the opposite: insufficient security enables attacks at machine speed and scale. One vulnerability, tens of thousands of exposed instances, credentials for everything those servers connect to.

AI agents that cannot be trusted cannot be deployed at scale. Security controls do not prevent deployment. They enable it. The organisations that build security into their agent architectures from the start will deploy further and faster than those scrambling to add it after an incident.

Start with the fundamentals. Apply least privilege and least agency. Build defence in depth. Plan for multi-agent complexity before you need it.

The capabilities are real. The risks are real. Both deserve serious attention.

References

1. Cyera Research Labs (2026). 'Ni8mare: Unauthenticated Remote Code Execution in n8n', CVE-2026-21858 (CVSS 10.0), January. Available at: https://www.cyera.com/research-labs/ni8mare-unauthenticated-remote-code-execution-in-n8n-cve-2026-21858

2. Cyera Research Labs (2026). Assessment from vulnerability disclosure report.

3. Censys (2026). Internet scan identifying 26,512 n8n instances exposed on public internet; vulnerability estimated to impact roughly 100,000 instances globally.

4. Serpin (2026). 'AI Security Risks', v4.4. Analysis consolidating OWASP Top 10 for LLM Applications (2025), OWASP Top 10 for Agentic Applications (2026), OWASP AI Exchange, MITRE ATLAS, and NIST AI RMF.

5. OWASP (2026). 'Top 10 for Agentic Applications'. Available at: https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/

6. Galileo AI (2025). 'Research on multi-agent system failures', December. Cited in Stellar Cyber, 'Top Agentic AI Security Threats in 2026'.

7. Google Cloud/Mandiant and Obsidian Security (2025). 'Widespread Data Theft Targets Salesforce Instances via Salesloft Drift', August. Also documented by Cloud Security Alliance, 'The Salesloft Drift OAuth Supply-Chain Attack', September 2025.

8. Willison, S. (2025). 'The lethal trifecta for AI agents: private data, untrusted content, and external communication', June 16. Available at: https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/

9. OWASP (2025). 'Top 10 for Large Language Model Applications 2025', LLM01: Prompt Injection. Available at: https://owasp.org/www-project-top-10-for-large-language-model-applications/

10. Serpin internal development records (2025). 'J-01: Security Audit for Production Code', December 30. Prompt injection vulnerability identified in article content processing; sanitisation function implemented.

11. Serpin internal development records (2026). 'B-08: Self-Attestation Does Not Work', January 8.

12. European Union (2024). 'Regulation (EU) 2024/1689 (Artificial Intelligence Act)', Article 15: Accuracy, robustness and cybersecurity. Available at: https://artificialintelligenceact.eu/article/15/

13. IBM (2025). 'Cost of a Data Breach Report 2025'. Available at: https://www.ibm.com/reports/data-breach

Executive Summary

AI agents connect to databases, APIs, email, and files. This connectivity makes them powerful—and makes them targets.
Security risks scale with AI capability. The more an AI system can do, the more damage a compromise can cause.
Protective measures called guardrails constrain agent behaviour and reduce security risks, but are not sufficient on their own.
Agent-specific security measures are essential for deployment at scale.
A January 2026 vulnerability in n8n demonstrated the stakes - one flaw potentially exposing 100,000 orchestration servers globally.