Jack, Detection & Response at Anthropic – 45 min, intermediate

To safely deploy AI agents with production access, organizations must understand the “lethal trifecta” of risks, build robust sandboxes with tool proxies, and implement aggressive detection and response mechanisms that leverage agent transcripts.


The Three Threat Models of AI Agents

AI agents typically fail in three ways: well-intentioned but hazardous actions due to lack of context, prompt injection from untrusted inputs, and misalignment. The most common is the negligent insider equivalent.

  • Agents often act like negligent insiders when they lack context or judgment.
  • Prompt injection is akin to phishing for LLMs; the model struggles to distinguish between instructions and untrusted data.
  • Misalignment (where the model’s goals diverge from yours) is rare but possible.

The issue isn’t that the LLM slash intern dropped the production database. The issue was that an operator with insufficient context was given the ability to drop the production database without requiring approval.

Treat prompt injection a lot like phishing, but slightly different because you’re a human, not an LLM.

The Lethal Trifecta

The three ingredients for catastrophic agent failure: network egress, sensitive data access, and untrusted input. Removing any one significantly reduces the attack surface.

  • Network egress allows data exfiltration outside the trust boundary.
  • Sensitive data access provides the target for exfiltration.
  • Untrusted input provides the vector for prompt injection.
  • Scoping data access and egress to the specific user context breaks the trifecta.

When you got all three of these, prompt injection is a genuine real pressing risk.

Make sensitive data access and egress contextual to the untrusted input source. It’s a great mitigator.

Sandboxing and Tool Proxies

Instead of prompting users for permission on every action, agents should operate freely inside a secure sandbox. When they interact with external systems, they go through tool proxies that enforce security policies in code.

  • Constant permission prompts lead to alert fatigue and “YOLO mode” bypasses, similar to Windows Vista UAC.
  • Sandboxes (gVisor, Bubblewrap, Firecracker) contain the agent while allowing autonomous work.
  • Tool proxies separate credentials from the agent and enforce arbitrary business logic (repo allow-lists, branch naming conventions).

We are speed running this exact mistake with AI agents. Claude Code has a command line called dangerously skip permissions. Community calls it YOLO mode.

The agent never has to see the credentials. The Git token lives within the proxy, not within the sandbox.

Detection, Response, and Transcripts

Preventative controls will eventually fail, requiring robust detection and response. When an agent misbehaves, terminate it immediately. Save agent transcripts – they provide the “why” behind actions caught by traditional cloud logs.

  • Kill misbehaving agents thoroughly: terminate pods, revoke credentials, tear down proxies.
  • Do not reuse the transcript of a misbehaving agent for its replacement.
  • Transcripts provide the agent’s chain of thought, acting as a “confession” during investigations.
  • Use LLM-based prompted classifiers to detect policy violations in real-time or retrospectively.
  • Deterministic controls constrain capability; classifiers constrain behavior.

The agent does not get a performance improvement plan. You terminate that process and you do it with excitement.

With an agent, you can read its mind. It wrote everything down for you. It’s the closest thing to a confession you’ll ever be handed.

Q&A Highlights

What if two agents form a trifecta together? If an agent is exposed to untrusted input, any output it sends to another agent must be treated as untrusted. It propagates like “poison ink.”

Multi-turn behavior drift? Multi-turn conversations make jailbreaking easier. Public-facing support bots are highly vulnerable and should be treated as toxic.

Technologies

Claude Code, AWS, GitHub, gVisor, Bubblewrap, Firecracker, Python, Bash, curl


Frequently Asked Questions

What is the lethal trifecta for AI agent security?

The lethal trifecta is the combination of network egress, sensitive data access, and untrusted input. When all three are present, prompt injection becomes a genuine exfiltration risk. Removing any one element significantly reduces the attack surface.

How do tool proxies secure AI agents in production?

Tool proxies sit between the agent sandbox and external systems, enforcing security policies in code. The agent never sees credentials directly – tokens live within the proxy, not the sandbox – and arbitrary business logic like repo allow-lists can be enforced.

Why are AI agent transcripts valuable for security investigations?

Agent transcripts contain the full chain of thought behind every action, effectively serving as a confession during incident investigations. They provide the ‘why’ behind actions that traditional cloud logs only show as API calls.

Back to BSidesSF 2026All Talks