Pwning Agentic AI Part I: Your AI Agent Is Already Compromised

Organizations are rapidly connecting AI agents to their databases, document pipelines, and internal tools, creating privileged components that read untrusted input as part of their job. TrendAI™ Research examines how attackers turn these agents against you through return-to-tool (RTT) exploits, and what this means for the future of agentic AI security.

27 de мая de 2026

By Sean Park (Principal Threat Researcher, TrendAI™ Research)

Key takeaways

A new class of AI-era exploitation pattern has emerged. Through RTT exploits, embedded instructions cause the AI agent to call its authorized tools to perform actions the attacker intends it to do.
Successful attacks can lead to theft and exposure of customer records, internal documents, and other assets containing sensitive information.
As RTT can be concealed within any benign-looking text, its potential reach has no clear limits; this, along with other AI-era exploits, demands a security approach that extends beyond traditional defenses.

You did everything right.

You isolated the database inside a Docker container. You put the Model Context Protocol (MCP) server on its own network segment. The agent runs in a sandbox. A web application firewall (WAF) and a reverse proxy sit in front of the application tier. Firewall rules are tight, egress is restricted, and the production credentials never leave the vault. The auditor signed off.

Then you connected an AI agent so that support tickets could be triaged automatically, customer documents could be parsed at scale, and engineers could query production data in natural language. The agent works beautifully.

On Saturday morning, a message lands in your inbox: “Take a look at this. Is this supposed to be happening?” Attached is a screenshot. Every authentication token from your production database is sitting in a public customer comment thread, posted by your AI agent on its own service account, through its approved tools, within the privileges you gave it. No alerts were fired and nothing is out of policy.

How is this possible?

This scenario is not hypothetical. The vulnerable PostgreSQL (Postgres) MCP image behind a scenario like this was pulled more than 100,000 times from Docker Hub. If you run it without additional guardrails, you are likely exposed.

Our article series walks through three production scenarios where database-connected AI agents get compromised in ways your current controls cannot see. The attacks share a single class of exploit that we call return-to-tool (RTT).

RTT is a specific subclass of indirect prompt injection in which the injected instruction causes the agent to call its authorized tools against the principal it serves.

Indirect prompt injection is the delivery mechanism, meaning how untrusted content reaches the model. RTT is the exploitation pattern, meaning how the agent’s approved tools are exploited to whatever end the attacker's prompt dictates.

Think of it as the return-oriented programming (ROP) of the AI era. The agent’s approved tools are the gadgets, and the attacker’s prompt is the chain that strings them together.

This article, Part I of our article series, unpacks what RTT is and why it breaks the security model you inherited from the pre-AI era.

Traditional security does not help in the new AI agent era

"How does a support ticket reach your secrets table?"

The answer isn't that you missed a control. It's that the defenses you rely on don't apply here, for different reasons.

Your perimeter (WAF, reverse proxy, input filters) exists to catch hostile traffic. But in an AI agent attack, what arrives is benign-looking text—no shell metacharacters, no exploit strings, no malformed payloads: nothing a regex or signature can catch. The attacker files a support ticket or uploads a document. It only becomes an instruction later, when it’s sitting in your database and the agent reads it. There was never anything for the WAF to block.

Your container isolation doesn’t help either. It doesn’t matter whether your agent, your database, or both run in locked-down Docker containers. The attack happens entirely inside the trust boundary you drew, in the conversation between the agent and its own tools. Your sandbox is solving a different problem.

Your role-based access control (RBAC) is no different. For 30 years, RBAC has been how we implement the principle of least privilege: you scope each role to the minimum permissions it needs and let the database enforce the rest. You almost certainly did this for your agent. However, RBAC controls which tables the agent can touch, not the rows within those tables.

Authentication tokens exfiltrated by an AI agent into a public support-ticket comment thread, without tripping a single alert.

Figure 1. Authentication tokens exfiltrated by an AI agent into a public support-ticket comment thread, without tripping a single alert.

And because every step of the attack uses the agent’s own credentials and approved tools, conventional monitoring has nothing unusual to flag. The audit log shows the agent performing routine operations. If you cannot tell what is going on, your AI agent may already be compromised.

The agent itself is the new attack surface

"What in your stack becomes exploitable the moment an AI agent is in the loop?"

Put an AI agent between untrusted input and a set of tools, and two things happen to your threat model that were not true a few years ago: plain data can now drive execution, and an attacker can now trigger dormant vulnerabilities sitting quietly in the backend.

Data is executable code in the AI era

In the pre-AI era, making something happen on a system required running code. An attacker had to get a binary onto disk, get a shell, or find a remote code execution (RCE) bug. The entire detection industry is built around that assumption: watch for new processes, new files, or new syscalls, and you catch the attack.

AI agents break the assumption. The ransomware scenario, discussed in a later installment of this series, encrypts every customer email in a Postgres database using nothing more than a crafted prompt hidden in a support ticket—without binary drops, process spawns, or RCE. The agent, reading that ticket during routine triage, turns the attacker’s instructions into database operations on its own. Figure 2 shows the result: every email in the customers table is replaced with a ciphertext that the attacker alone can decrypt.

The email column in the customers table before and after the attack, with every row replaced with pgp_sym_encrypt ciphertext using the attacker's key

Figure 2. The email column in the customers table before and after the attack, with every row replaced with pgp_sym_encrypt ciphertext using the attacker's key

The agent is the glue. It turns mere text into actions that the backend will perform. Take the agent out of the picture, and the same text sits in a database row forever, doing nothing. Put the agent back, and every piece of text the agent reads becomes a candidate instruction.

Dormant vulnerabilities become reachable

Not every vulnerability that an attacker exploits is new. Plenty of known bugs sit in the backend for years without anyone exploiting them. They’re documented somewhere (for example, a CVE, a blog post, or a conference talk), but no one loses sleep over them, because no human is going to stumble into the trigger by accident.

In later installments of this article series, we will open the widely trusted mcp/postgres Docker image and show an SQL read-only bypass that was publicly disclosed more than a year ago. As Figure 3 shows, the image kept shipping the unpatched code anyway. It was pulled more than 100,000 times from Docker Hub and ended up connected to AI agents across countless production deployments. This is not a thought experiment. TrendAI™ Research reported the flaw to Docker in January 2026, and the image was pulled from Docker Hub shortly after.

PostgreSQL read-only Docker container on DockerHub with over 100,000K downloads.

Figure 3. A PostgreSQL read-only Docker container on Docker Hub with more than 100,000 downloads

Here’s what changed. An AI agent will happily issue whatever SQL a support ticket describes, including the sequence that trips the read-only bypass. The bug didn’t change. Its reachability did. An attack that was theoretical yesterday is a working exfiltration path today, with the agent as the delivery mechanism.

Such deployments are now one crafted ticket away from compromise, regardless of whether anyone has taken advantage of it yet.

The agent reasons, but not reliably

It is tempting to believe the agent will catch this. Modern models write working code, pass bar exams, and hold up multi-step reasoning chains that would have seemed impossible a few years ago. Surely something so capable can tell the difference between a customer complaint and an embedded instruction to encrypt the production database. Surely, if the malicious prompt is obvious enough, the agent will refuse.

"If the model refuses nine times out of ten, who wins?"

Non-determinism is an attacker's friend

Large language model (LLM) output is probabilistic. The same intent, expressed in a hundred slightly different ways, lands differently on the model. Some phrasings trigger a refusal, while others result in compliance.

Is there a chance an attacker can break an AI agent? The answer is yes.

Figure 4 shows what that looks like in practice. It is drawn from the know your customer (KYC) document-extraction attack covered in later installments of this article series, where a single crafted passport image exfiltrates other customers’ records. The chart shows the successful injection payloads across 13 frontier models from Anthropic, OpenAI, Google, and DeepSeek.

A chart showing exfiltration success across 13 frontier models from Anthropic, OpenAI, Google, and DeepSeek, with the exploit achieving full or partial success against nearly every vendor

Figure 4. A chart showing exfiltration success across 13 frontier models from Anthropic, OpenAI, Google, and DeepSeek, with the exploit achieving full or partial success against nearly every vendor

LLMs are only as smart as the data they were trained on. Defenders ship a fixed corpus. Attackers write against an open one. Stress-test any of them hard enough, and loopholes appear. We only need a single prompt for the attack to succeed.

Return-to-tool (RTT)

Put the pieces together. An attacker plants instructions in content that the agent is guaranteed to read. The agent, unable to tell data from commands and willing to act on whatever it processes, returns from reading that content by calling its own approved tools—on its own service account and within its documented privileges, but in the direction the attacker chose. We call this AI-era exploit return-to-tool (RTT).

RTT is not a single vulnerability in a particular model, framework, or MCP server. It is a new class of attack that emerges the moment a language model with tool access is exposed to untrusted content, which is much of what is being deployed today.

The next three parts in our article series show RTT in three different environments: a “read-only” Postgres MCP server, a support-ticket triage bot, and a KYC passport pipeline. The attack works the same way in each one. Most agents that read untrusted input and call a tool are exposed.

RTT is AI-native because it appears in AI agents exhibiting all the characteristics we discussed in this article: data as executable instruction, a privileged agent using tools on untrusted input, and a probabilistic decision-maker that cannot reliably refuse. Pre-AI systems had none of them. It is not a bug you can patch, a port you can close, or a signature you can block. RTT is the AI era’s first-order security problem. Miss it, and every control you build sits on top of an exploit your attackers already know how to use.

Your AI agent is already compromised. The exploit is real: it works on the models being deployed today, and most existing security stacks have no way to see it.

What is coming in this series

Throughout this series, we will dissect real-world vulnerabilities, their impact, and the strategies needed to defend against them. Here’s what to expect in the upcoming installments.

When “read-only” Postgres isn’t read-only: A widely used MCP server promises read-only database access. An AI agent, fed a crafted support ticket, turns a year-old SQL bypass into a working exfiltration path.
Database ransomware without malware: Without a binary on disk or a process in memory, a single prompt hidden in a support ticket turns the agent into ransomware that uses the database’s own cryptographic library as its encryption engine.
When passports execute: Hidden text inside a passport image steers a KYC extraction agent into leaking other customers’ records—an attack reproduced across 13 frontier models.
Securing your AI agent: A look at the defenses available against RTT, which ones hold up in practice, and where the gaps remain.

HIDE

Like it? Add this infographic to your site:
1. Click on the box below. 2. Press Ctrl+A to select all. 3. Press Ctrl+C to copy. 4. Paste the code into your page (Ctrl+V).

Pwning Agentic AI Part I: Your AI Agent Is Already Compromised

Key takeaways

Traditional security does not help in the new AI agent era

The agent itself is the new attack surface

Data is executable code in the AI era

Dormant vulnerabilities become reachable

The agent reasons, but not reliably

Non-determinism is an attacker's friend

Return-to-tool (RTT)

What is coming in this series

Связанный

последний

Pекомендуемые

Trend Vision One™ — Проактивная безопасность: точка отправления.

Ресурсы

Поддержка

О компании Trend