Agentic Threat Hardening: The OWASP Top 10, Applied
OWASP's new Agentic Security Initiative Top 10 is the first threat taxonomy written specifically for systems where LLMs plan, use tools, and persist state across sessions. Here's the full audit — what the harness already handles, where the gaps are, and the mitigations worth implementing today.
The original OWASP Top 10 for LLM Applications models the threat surface of a system in which a user sends a prompt and a model returns a response. That surface is real, but it is also narrow: the model is stateless, the tools are absent, and the damage from any single interaction is bounded by what one request can do. Agentic systems break all three of those assumptions. An agent that can execute code, browse the web, send email, write files, and remember what it did last week operates with a fundamentally different risk profile — and requires a fundamentally different threat model.
A 2026 survey of agent architecture (Huang et al., 2026) characterizes this as the shift from weights to harness: practical agent progress depends less on model capability than on the quality of external cognitive infrastructure — memory stores, skill protocols, and the execution harness itself. That framing has an immediate security corollary: the harness is not only where agent capability lives, it is also where agent vulnerability concentrates. Hardening the harness is hardening the agent.
OWASP's Agentic Security Initiative published its first Top 10 for Agentic Applications in December 2025. The ASI Top 10 defines ten attack classes, ASI01 through ASI10, that span goal hijacking, tool misuse, memory poisoning, supply chain compromise, cascading failures, and rogue agent behavior. Running it against the harness produces a coverage audit: one class is strongly covered, four have partial coverage, and five are largely unaddressed. Each gap maps to a specific, implementable mitigation.
OWASP introduces the principle of Least Agency — the agentic equivalent of least privilege. Deploy autonomous behavior only where it adds value; every unnecessary grant of autonomy expands the attack surface without benefit. Before patching gaps, ask whether the capability that creates them is actually needed.
The Three New Attack Surfaces
Three architectural properties of agentic systems create attack surfaces that have no equivalent in stateless LLM inference:
Tools create privilege. An agent that can send email, push to GitHub, or execute shell commands holds real-world capabilities. A successful prompt injection no longer produces a bad response — it triggers a real action under a real identity with real consequences. The blast radius of a compromised agent is bounded only by what its tools can do.
Memory creates persistence. An agent that stores information across sessions can be poisoned once and stay compromised indefinitely. Web content fetched in session 1 becomes context for session 2; a carefully crafted web page can plant instructions that surface on every future run that retrieves the relevant memory. This is qualitatively different from a single-session attack.
Autonomy removes human checkpoints. A multi-step agent that plans and executes without confirmation gates compounds errors silently. A hallucination in step 3 of a 10-step plan contaminates steps 4 through 10. In single-turn systems, each request is an implicit human checkpoint. In agentic pipelines, checkpoints must be explicitly designed in. Empirical evidence confirms the stakes: a 2026 study of 12 commercial planning and web-use agents (Mathur et al., 2026) found that without explicit safety requests, trip-planning agents bypassed safety constraints in over 92% of cases, and web-use agents executed risky actions almost deterministically, with bypass rates of up to 100%. Even when users expressed soft safety intent, bypass rates remained at 54.7%. The root cause: agents have no stopping rules, so they over-execute workflows, and helpfulness bias fills the gap where checkpoints should be.
Each of the five pipeline zones attracts a distinct subset of the ASI Top 10. Defenses must be applied at every zone independently — a guard at the input layer does not protect the memory layer.
What the Harness Already Has
The harness implements four security patterns in harness/security.py, described in depth in the previous security post. Against the ASI framework, these patterns address:
- ASI05 (Code Execution / RCE): strongly covered. The AST Guard scans LLM-generated Python before execution using ast.parse() and a NodeVisitor that blocks dangerous imports, exec/eval calls, and subprocess invocations. A secondary regex pass catches obfuscation patterns that survive AST parsing.
- ASI02 (Tool Misuse): partially covered. The Path Sandbox enforces an allowlist on all file reads and writes, resolving symlinks before comparison to prevent traversal attacks. The CDP Guard blocks browser navigation to localhost, RFC 1918 private ranges, and file:// URLs. Neither covers rate limits or per-session tool budgets.
- ASI01 (Goal Hijack): partially covered. The Injection Scanner applies ten heuristic regex patterns to external content before it reaches the synthesis prompt. Coverage is real but limited: the patterns target obvious English-language injections and miss encoding tricks, Unicode homoglyphs, and context-split attacks.
- ASI06 (Memory Poisoning): partially covered. memory.py calls scan_for_injection() on every observation before writing it to SQLite or ChromaDB. Observations that trigger the scanner are dropped rather than written. What's missing: source trust scoring, TTL-based expiry, and session isolation between tasks.
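To make the AST Guard pattern concrete, here is a minimal sketch of the technique. It is not the code in harness/security.py: the blocklists and the names BLOCKED_MODULES, BLOCKED_CALLS, and check_generated_code are assumptions chosen for illustration.

```python
import ast

# Hypothetical blocklists; the real guard's lists are not shown in this post.
BLOCKED_MODULES = {"os", "subprocess", "socket", "shutil"}
BLOCKED_CALLS = {"exec", "eval", "compile", "__import__"}


class _Guard(ast.NodeVisitor):
    """Collects a human-readable violation for every blocked construct."""

    def __init__(self) -> None:
        self.violations: list[str] = []

    def visit_Import(self, node: ast.Import) -> None:
        for alias in node.names:
            # "os.path" is blocked because its top-level package is "os".
            if alias.name.split(".")[0] in BLOCKED_MODULES:
                self.violations.append(f"import {alias.name}")
        self.generic_visit(node)

    def visit_ImportFrom(self, node: ast.ImportFrom) -> None:
        if (node.module or "").split(".")[0] in BLOCKED_MODULES:
            self.violations.append(f"from {node.module} import ...")
        self.generic_visit(node)

    def visit_Call(self, node: ast.Call) -> None:
        # Only direct calls by name are caught; aliased calls need the regex pass.
        if isinstance(node.func, ast.Name) and node.func.id in BLOCKED_CALLS:
            self.violations.append(f"call to {node.func.id}()")
        self.generic_visit(node)


def check_generated_code(source: str) -> tuple[bool, list[str]]:
    """Return (allowed, violations). Code that does not parse is rejected."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return False, [f"syntax error: {exc.msg}"]
    guard = _Guard()
    guard.visit(tree)
    return (not guard.violations), guard.violations
```

A name-based visitor like this misses indirect access such as getattr(__builtins__, "eval"), which is one reason a secondary pass is still useful.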
The Coverage Audit
Coverage status reflects the current harness implementation. Priority reflects effort-to-impact ratio — low-effort gaps with high blast radius come first.
| ASI | Threat | Harness Status | Highest-Value Gap |
|---|---|---|---|
| ASI01 | Goal Hijack | Partial | Research context in autoresearch is fetched but not injection-scanned before entering the proposer prompt |
| ASI02 | Tool Misuse | Partial | No per-session tool budget; email and GitHub skills execute without confirmation gates |
| ASI03 | Identity Abuse | Gap | All tasks share the same SQLite memory — no session namespace isolates task A from task B |
| ASI04 | Supply Chain | Gap | Synthesis instructions in agent.py are modified autonomously by autoresearch with no integrity check at startup |
| ASI05 | Code Execution | Strong | AST Guard + regex patterns + CDP Guard provide layered coverage; no container sandbox but risk is low given AST coverage |
| ASI06 | Memory Poisoning | Partial | Memory has no source trust scoring or TTL — web-fetched observations persist indefinitely at the same trust level as human-authored notes |
| ASI07 | Inter-Agent Comms | Gap | MCP dispatch calls remote endpoints over plain HTTP with no bearer token or endpoint validation |
| ASI08 | Cascading Failures | Gap | autoresearch runs indefinitely with no hard-stop on consecutive failures or experiment count |
| ASI09 | Human Trust Exploit | Partial | Wiggum provides a second opinion on outputs, but no confirmation prompt before externally visible actions |
| ASI10 | Rogue Agents | Gap | No structured skill invocation log; no kill switch; behavioral drift from autoresearch is not monitored |
The Priority Queue: Mitigations Worth Implementing
The following mitigations are ordered by effort-to-impact ratio. Each targets a specific gap identified in the audit, and each can be implemented without architectural surgery to the harness.
1. Wire the Injection Scanner to Every Ingestion Point (ASI01)
The injection scanner exists and works — it just isn't wired to every place external content enters the system. In autoresearch.py, gather_proposal_context() fetches arbitrary URLs via MarkItDown and feeds the raw text directly to the proposer prompt. The fix is a one-liner: call strip_injection_candidates() (already in security.py) on each fetched page before appending it to the research brief.
# harness/autoresearch.py, in _fetch_page()
from harness.security import strip_injection_candidates

text = (result.text_content or "").strip()
text, removed = strip_injection_candidates(text)
if removed:
    print(f" [security] stripped {removed} injection candidate lines from {url[:60]}")
if len(text) > max_chars:
    text = text[:max_chars] + "\n[truncated]"
return text
The broader principle: treat every ingestion boundary as a security checkpoint. Web search results, fetched pages, uploaded documents, and peer-agent messages all enter through different code paths; each path needs to call the scanner independently. The scanner's warn / block severity split means you can log without blocking for synthesis targets while still blocking for memory writes.
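The warn / block split can be sketched as follows. This is an illustration under stated assumptions, not the scanner in security.py: the two patterns, the function names scan_line and strip_candidates_sketch, and the min_severity parameter are all hypothetical.

```python
import re

# Illustrative patterns only; the harness's ten patterns are not listed here.
_PATTERNS = [
    (re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I), "block"),
    (re.compile(r"you are now\b", re.I), "warn"),
]


def scan_line(line: str) -> "str | None":
    """Return the highest severity matched by the line, or None if clean."""
    severities = {sev for pattern, sev in _PATTERNS if pattern.search(line)}
    if "block" in severities:
        return "block"
    return "warn" if severities else None


def strip_candidates_sketch(text: str, min_severity: str = "warn") -> tuple[str, int]:
    """Drop lines at or above min_severity; return (clean_text, lines_removed).

    min_severity="warn" is the strict setting (suited to memory writes);
    min_severity="block" keeps warn-level lines (suited to synthesis targets).
    """
    kept: list[str] = []
    removed = 0
    for line in text.splitlines():
        severity = scan_line(line)
        drop = severity == "block" or (severity == "warn" and min_severity == "warn")
        if drop:
            removed += 1
        else:
            kept.append(line)
    return "\n".join(kept), removed
```

Each ingestion path can then pick its own strictness while sharing one pattern list.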
2. Add Per-Session Tool Budgets (ASI02)
Tool over-invocation is not purely an adversarial problem. The PEARL framework (Song et al., 2026), evaluated on multi-hop tool use benchmarks, found that even state-of-the-art agents exhibit weak planning, tool hallucination, and erroneous parameter generation, achieving only 56.5% success on the ToolHop benchmark despite being purpose-built for tool use. A budget limit does not care why a tool is being called too often, so it protects against injected over-invocation and organic agent confusion alike.
A session-scoped budget counter prevents loop amplification — the ASI02 scenario where a planner repeatedly calls a costly API or an injected instruction sends 50 emails. The implementation is a small class added to security.py; each skill imports and checks it at entry.
# harness/security.py
class ToolBudget:
    DEFAULT_LIMITS = {
        "email_send": 3,
        "github_push": 2,
        "browser_navigate": 25,
        "run_python": 10,
        "web_search": 20,
    }

    def __init__(self, limits: dict[str, int] | None = None):
        self._limits = {**self.DEFAULT_LIMITS, **(limits or {})}
        self._counts: dict[str, int] = {}

    def check(self, tool: str) -> tuple[bool, str]:
        n = self._counts.get(tool, 0)
        limit = self._limits.get(tool, 50)  # unlisted tools default to 50 calls
        if n >= limit:
            return False, f"tool budget exceeded: {tool!r} ({n}/{limit} calls this session)"
        self._counts[tool] = n + 1
        return True, "ok"

    def reset(self):
        self._counts.clear()
Instantiate once per run() call in orchestrator.py and pass it down to skills. Reset between runs. The limits above are conservative starting points — tune them based on observed legitimate usage from data/runs.jsonl.
3. Add Source Trust Scoring and TTL to Memory (ASI06)
The most dangerous gap in the harness is that web-fetched content enters memory with the same trust level as deliberately authored observations, and stays there forever. Poisoning of this kind is a formally studied attack surface: a 2025 analysis of minimum-cost poisoning attacks on preference-aligned models (Cheng et al., 2025) showed that an adversary can steer model behavior with a surprisingly small number of targeted label flips. That result concerns training data, not retrieval, but it suggests an analogous risk for agent memory, where poisoned observations from a single fetched page can contaminate every downstream retrieval that touches the same topic. Two schema columns and two filter clauses narrow this gap:
# harness/memory.py: schema migration (run once)
from datetime import datetime

db.execute("ALTER TABLE observations ADD COLUMN source_trust REAL DEFAULT 1.0")
db.execute("ALTER TABLE observations ADD COLUMN expires_at TEXT")

# Trust levels, applied at write time
SOURCE_TRUST = {
    "human": 1.0,        # user-provided, manually curated
    "synthesized": 0.7,  # model output, post-wiggum-review
    "web_fetched": 0.4,  # external search result or crawled page
}

# Context injection: filter at read time
# (session_id is the column introduced in mitigation 4)
def get_context(task: str, session_id: str | None = None) -> list[dict]:
    now = datetime.utcnow().isoformat()
    rows = db.execute("""
        SELECT * FROM observations
        WHERE (expires_at IS NULL OR expires_at > ?)
          AND source_trust >= 0.4
          AND (session_id IS NULL OR session_id = ?)
        ORDER BY created_at DESC LIMIT ?
    """, (now, session_id, MAX_CONTEXT_OBSERVATIONS)).fetchall()
    # Assumes db.row_factory = sqlite3.Row so each row converts to a dict
    return [dict(row) for row in rows]
Web-fetched observations write with source_trust=0.4 and expires_at = now + 7 days. Human-authored notes write with source_trust=1.0 and no expiry. A poisoned web observation therefore has limited reach: it can influence context retrieval for at most 7 days, and because it sits at the lowest trust level, raising the read-time threshold above 0.4 excludes it without touching synthesized or human-authored content.
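The write side of this scheme might look like the sketch below. The table layout, the SOURCE_TTL mapping, and the function names store_observation and live_observations are assumptions for illustration; only the trust values and the 7-day web TTL come from the text. The sketch uses timezone-aware ISO timestamps, and the string comparison only works if stored and queried timestamps share one format.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

SOURCE_TRUST = {"human": 1.0, "synthesized": 0.7, "web_fetched": 0.4}
# Hypothetical TTL table: sources without an entry never expire.
SOURCE_TTL = {"web_fetched": timedelta(days=7)}


def store_observation(db: sqlite3.Connection, text: str, source: str,
                      now: "datetime | None" = None) -> None:
    """Insert an observation with trust and expiry derived from its source.

    An unknown source raises KeyError instead of defaulting to trusted.
    """
    now = now or datetime.now(timezone.utc)
    ttl = SOURCE_TTL.get(source)
    expires_at = (now + ttl).isoformat() if ttl else None
    db.execute(
        "INSERT INTO observations (text, source_trust, expires_at, created_at) "
        "VALUES (?, ?, ?, ?)",
        (text, SOURCE_TRUST[source], expires_at, now.isoformat()),
    )


def live_observations(db: sqlite3.Connection, min_trust: float,
                      now: "datetime | None" = None) -> list[str]:
    """Return unexpired observation texts at or above min_trust, oldest first."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    rows = db.execute(
        "SELECT text FROM observations "
        "WHERE (expires_at IS NULL OR expires_at > ?) AND source_trust >= ? "
        "ORDER BY rowid",
        (stamp, min_trust),
    ).fetchall()
    return [row[0] for row in rows]
```

Deriving both values from the source label at write time means callers cannot accidentally store web content as trusted by omitting an argument.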
4. Scope Memory by Session (ASI03)
The identity abuse vector that is easiest to exploit in the harness is cross-task memory bleed. Task A fetches a web page that contains an adversarial instruction; that observation is written to the shared memory store; Task B retrieves it as context. Adding a session_id column to observations and passing it through the run lifecycle costs one schema column and closes the bleed entirely.
The practical implementation: generate a session_id = uuid4().hex[:12] at the start of each run() call, pass it to compress_and_store() and get_context(). Observations with session_id = NULL are treated as global (human-authored, always-available context). Observations with a session ID are only returned to runs sharing that ID.
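The scoping rule above can be exercised against an in-memory database. This is a sketch, not the harness's compress_and_store() or get_context(): the two-column table and the helper names new_session_id, store, and get_scoped_context are assumptions.

```python
import sqlite3
from uuid import uuid4


def new_session_id() -> str:
    """12 hex characters, as described in the text."""
    return uuid4().hex[:12]


def store(db: sqlite3.Connection, text: str, session_id: "str | None" = None) -> None:
    """session_id=None marks the observation as global."""
    db.execute(
        "INSERT INTO observations (text, session_id) VALUES (?, ?)",
        (text, session_id),
    )


def get_scoped_context(db: sqlite3.Connection, session_id: "str | None") -> list[str]:
    """Return global observations plus those written by this session.

    Passing None returns only global rows, because `session_id = NULL`
    is never true in SQL.
    """
    rows = db.execute(
        "SELECT text FROM observations "
        "WHERE session_id IS NULL OR session_id = ? ORDER BY rowid",
        (session_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

With this filter, a page fetched by task A is invisible to task B unless someone deliberately promotes it to a global observation.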
5. Authenticate MCP Dispatch (ASI07)
mcp_dispatch.py calls remote harness endpoints over plain HTTP with no authentication. Any process on the local network that can see the port can impersonate a valid endpoint. The fix is a single environment variable check:
# harness/mcp_dispatch.py
import os

import httpx

from harness.security import check_cdp_navigate  # reuse private-IP guard

MCP_TOKEN = os.environ.get("HARNESS_MCP_TOKEN", "")

async def _call_run_task(endpoint: str, task: str) -> dict:
    # Validate the endpoint URL before any request is sent
    ok, reason = check_cdp_navigate(endpoint)
    if not ok:
        raise ValueError(f"MCP endpoint blocked: {reason}")
    # Send the bearer token only when one is configured
    headers = {"Authorization": f"Bearer {MCP_TOKEN}"} if MCP_TOKEN else {}
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(f"{endpoint}/run_task",
                                 json={"task": task},
                                 headers=headers)
        resp.raise_for_status()
        return resp.json()
Reusing check_cdp_navigate() on the endpoint URL means private-IP spoofing attacks are blocked by existing infrastructure. The bearer token is optional — setting HARNESS_MCP_TOKEN enables it. Requiring it in production is a one-line configuration change.
6. Gate High-Impact Skills (ASI09)
Email and GitHub push are externally visible and hard to reverse. The Mathur et al. study cited above is the clearest evidence for why a gate is necessary: agents with full autonomy reached 100% execution of risky actions on web-use tasks, and the researchers attributed this directly to the prioritization of helpfulness over safety. Agents cannot be trusted to self-limit on high-stakes operations without an explicit mechanism forcing a pause. Without a confirmation gate, a single injected instruction can send email to arbitrary addresses or push code to a repository under a real identity. The gate is deliberately easy to lift for automation-heavy workflows, which set HARNESS_AUTO_APPROVE=1, but the default is to block until someone explicitly opts in:
# harness/skills/email_skill.py and github_skill.py
import os

def _require_approval(action: str):
    # Only the exact value "1" approves; unset, empty, or "0" all block
    if os.environ.get("HARNESS_AUTO_APPROVE") != "1":
        raise PermissionError(
            f"{action!r} requires explicit approval. "
            "Set HARNESS_AUTO_APPROVE=1 or confirm interactively."
        )
The logged PermissionError also serves as an audit trail: any attempt to send email or push code without the flag set appears in the run log as a blocked action, making it easy to detect injected instructions that tried to exploit these skills.
7. Cap the autoresearch Loop (ASI08)
The autoresearch loop runs indefinitely and autonomously modifies agent.py on each iteration. This is the scenario OWASP classifies as cascading failure via governance drift — a self-modifying loop that degrades without a human observing the trajectory. Infrastructure research adds a second failure mode: a 2025 empirical study of five major LLM inference engines (Zhou et al., 2025) found memory leaks, OOM errors, and performance degradation as the most common failure classes. In a self-modifying loop, these infrastructure bugs compound — a hung inference call that retries indefinitely consumes all available resources before any safety check fires. Hard stops protect against both adversarial governance drift and ordinary operational cascades. Two CLI flags impose them:
# harness/autoresearch.py, in main()
# args is assumed to be the raw CLI argument list (e.g. sys.argv[1:])
max_experiments = int(args[args.index('--max-experiments') + 1]) \
    if '--max-experiments' in args else None
max_failures = int(args[args.index('--max-failures') + 1]) \
    if '--max-failures' in args else 10  # default hard stop
consecutive_failures = 0  # tracks proposer errors + eval errors
experiment = 0
while True:
    # "is not None" so that an explicit --max-experiments 0 is honored
    if max_experiments is not None and experiment >= max_experiments:
        print(f"[autoresearch] max-experiments ({max_experiments}) reached; stopping.")
        break
    if consecutive_failures >= max_failures:
        print(f"[autoresearch] {max_failures} consecutive failures; stopping.")
        break
The --max-failures default of 10 provides a safety net without interfering with normal operation. A healthy run rarely sees more than 3 consecutive proposer failures.
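The counter only works as a safety net if it resets on success and increments on every kind of failure. A minimal, self-contained sketch of that bookkeeping follows; run_loop and the run_experiment callback are hypothetical names, and treating any raised exception as a failure is an assumption about how the real loop classifies errors.

```python
from typing import Callable, Optional


def run_loop(run_experiment: Callable[[int], bool],
             max_experiments: Optional[int] = None,
             max_failures: int = 10) -> tuple[int, str]:
    """Run experiments until a cap is hit; return (experiments_run, stop_reason)."""
    consecutive_failures = 0
    experiment = 0
    while True:
        if max_experiments is not None and experiment >= max_experiments:
            return experiment, "max-experiments"
        if consecutive_failures >= max_failures:
            return experiment, "max-failures"
        try:
            succeeded = run_experiment(experiment)
        except Exception:
            # A crash counts as a failure instead of killing the loop unlogged.
            succeeded = False
        # Reset on success so only an unbroken run of failures triggers the stop.
        consecutive_failures = 0 if succeeded else consecutive_failures + 1
        experiment += 1
```

Returning the stop reason makes it easy to record in the run log why the loop ended.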
8. Hash Synthesis Instructions at Startup (ASI04)
autoresearch commits changes to agent.py via git, which means every instruction change is attributable and reversible. What's missing is a startup check that catches changes made outside of git: a corrupted write, a supply-chain injection that edits agent.py directly, or a merge conflict that resolved incorrectly. Fine-tuning research illuminates why unreviewed changes matter: a 2025 analysis (Wang et al., 2025) found that LLM safety guardrails collapse after downstream fine-tuning when alignment and task data are too similar, reducing safety adherence by up to 10.33%. Autonomous instruction rewriting via autoresearch is loosely analogous: each iteration rewrites the agent's operating instructions, with no mechanism to detect when rewrites have drifted past a safety threshold. A hash does not measure drift, but it does provide an integrity checkpoint without constraining the loop. A hash written to a sidecar file at each autoresearch commit provides the reference:
# harness/autoresearch.py, in git_commit()
import hashlib, json
from datetime import datetime
from pathlib import Path

def _write_instruction_hash(synth: str, synth_prose: str, experiment: int):
    h = hashlib.sha256((synth + synth_prose).encode()).hexdigest()[:16]
    record = {"exp": experiment, "hash": h, "ts": datetime.utcnow().isoformat()}
    # Context manager closes the file handle promptly after the append
    with Path("data/instruction_hashes.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# harness/agent.py, at module load
import hashlib

def _check_instruction_integrity():
    current = read_instructions()
    h = hashlib.sha256((current["synth"] + current["synth_prose"]).encode()).hexdigest()[:16]
    last_hash = _read_last_hash()  # reads last line of data/instruction_hashes.jsonl
    if last_hash and h != last_hash:
        print(f"[security] WARN: instruction hash mismatch: expected {last_hash}, got {h}")
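The _read_last_hash() helper is referenced above but not shown. One possible shape is sketched here under the name read_last_hash; taking the path as a parameter and returning None for a missing or empty file are assumptions, not the harness's confirmed behavior.

```python
import json
from pathlib import Path
from typing import Optional


def read_last_hash(path: Path) -> Optional[str]:
    """Return the "hash" field of the last non-blank JSONL line, or None.

    None means there is no reference yet (file missing or empty). A last
    line that is not valid JSON raises json.JSONDecodeError, so a corrupted
    sidecar file surfaces at startup instead of silently skipping the check.
    """
    if not path.exists():
        return None
    last_line = None
    with path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                last_line = line
    if last_line is None:
        return None
    return json.loads(last_line).get("hash")
```

Note that the integrity check above skips the comparison when no hash is found, so deleting the sidecar file disables it; a stricter deployment might treat a missing file as a failure once the first hash has been written.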
9. Write a Structured Skill Invocation Log (ASI10)
Rogue agent detection requires a baseline of normal behavior against which anomalies are visible. Without a per-invocation log, all you have is the run-level runs.jsonl — too coarse to detect unusual tool-chaining patterns. A lightweight invocations.jsonl sidecar provides the audit trail:
# harness/security.py, add logging helper
import json, time
from pathlib import Path

_INVOCATIONS_PATH = Path("data/invocations.jsonl")

def log_invocation(skill: str, args_summary: str, outcome: str, run_id: str = ""):
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "skill": skill,
        "args": args_summary[:120],
        "outcome": outcome,  # "ok" | "blocked" | "error"
        "run_id": run_id,
    }
    with _INVOCATIONS_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
Each skill calls log_invocation() at entry (before execution) and at exit (with outcome). Blocked calls are logged with outcome="blocked" and the reason from the security check. This log becomes the input for anomaly detection: unusual tool-chaining sequences, budget-exceeded patterns, and unexpected invocation rates are all visible in the flat JSONL without needing a database.
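A first pass at reading that log back could be as simple as the sketch below. The function names summarize_invocations and flag_skills and both thresholds are placeholders for illustration; real thresholds would come from observed baseline behavior.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_invocations(lines: Iterable[str]) -> dict[str, Counter]:
    """Count outcomes per skill from JSONL records; blank lines are skipped."""
    summary: dict[str, Counter] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        summary.setdefault(record["skill"], Counter())[record["outcome"]] += 1
    return summary


def flag_skills(summary: dict[str, Counter],
                max_blocked: int = 0,
                max_calls: int = 50) -> list[str]:
    """Return skills with too many blocked calls or too many calls overall."""
    flagged = []
    for skill, outcomes in sorted(summary.items()):
        total = sum(outcomes.values())
        # Counter returns 0 for outcomes that never occurred.
        if outcomes["blocked"] > max_blocked or total > max_calls:
            flagged.append(skill)
    return flagged
```

Usage would be along the lines of flag_skills(summarize_invocations(open("data/invocations.jsonl"))), run after each session or on a schedule.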
Defense in Depth, Not Perfection
None of these mitigations is a complete solution. The injection scanner has false negatives. The tool budget can be gamed by a patient attacker who stays within limits. The MCP bearer token provides authentication but not authorization. The memory TTL prevents indefinite poisoning but not seven-day poisoning. That is by design: the goal of defense in depth is not to make any individual layer impenetrable but to ensure that an attacker who bypasses one layer immediately encounters another.
The ASI framework is also evolving. OWASP's agentic attack corpus is updated weekly as new real-world exploits are documented — the incidents list in the published document already includes tools you use: Cursor, GitHub Copilot, VS Code agentic workflows, Amazon Q, Salesforce Agentforce. The threat model will outpace any static checklist. What matters is having a layered posture, an audit trail that makes anomalies visible, and a process for translating new documented exploits into updated controls.
The most important mitigation is the one that underpins every item in the list above: logging. Every security event (blocked code execution, stripped injection candidate, budget-exceeded tool call, hash mismatch) should write a structured record to a JSONL file. Without that log, there is no way to distinguish a false positive from a real attack, and no way to know whether the controls are working. Build the audit trail first, then tune the thresholds.