Agentic Tool Use and Planning: What the Literature Says
Fifteen papers on LLM-driven planning and tool use. NL2Plan converts natural language to PDDL without expert input. PEARL trains a dedicated Planner via GRPO and reaches 56.5% success on ToolHop. ToolTree applies Monte Carlo Tree Search to tool selection. And trip-planning agents bypass safety constraints in 92% of cases when no explicit safety instruction is given. Each of these findings maps directly to a pattern in the harness architecture.
Posts 16, 18, 20 & 21 — May 1, 2026
Framing: Planning as the Harness's Hidden Contract
The harness's Planner-First pattern (B1) makes an implicit claim: if you decompose a task into targeted sub-queries before research begins, both retrieval quality and synthesis quality improve. The literature on LLM-driven planning makes this claim explicit, and reveals where it breaks. Classical planners require formal domain models (PDDL), which are expensive to write. LLMs generate flexible natural language plans but often produce semantically incoherent or invalid action sequences. The research program documented in this post addresses the gap between these two failure modes.
The tension between formalism and flexibility appears throughout the harness architecture. The harness uses a JSON schema for task decomposition rather than PDDL — a pragmatic choice that preserves flexibility at the cost of the soundness guarantees classical planning provides. Understanding what those guarantees are worth, and what it costs to approximate them with LLM-generated plans, is the core question this literature addresses.
Plan Generation: From NL to Executable Action
NL2Plan: Soundness Without Expert Input
NL2Plan (arXiv:2405.04215v2) is the first system to generate complete PDDL tasks — both domain and problem — from minimal natural language descriptions without expert input or domain-specific adaptation. PDDL (Planning Domain Definition Language) is the formalism underlying classical AI planning systems: it specifies preconditions and effects for each action in a domain, enabling a planner to verify that a proposed action sequence is valid before execution.
The significance of NL2Plan is not that it replaces classical planners — it uses them as the backend — but that it removes the expert annotation bottleneck that made classical planning impractical for general-purpose agents. Tested across seven planning domains including five excluded from LLM training data, NL2Plan outperforms direct LLM generation combined with a validator. The key insight: using the LLM to model the domain formally, then delegating search to the classical planner, produces better results than using the LLM to search directly.
The harness's Planner-First pattern operates in the same conceptual space but at a lower formality level. Rather than generating PDDL, the harness generates a JSON query plan: a structured decomposition of a research task into targeted sub-queries with associated parameters. The advantage of JSON over PDDL is that it is directly interpretable by the model at each stage without a separate planning engine. The cost is that JSON plan validity is not formally checkable — an ill-formed plan silently produces degraded retrieval rather than a plan-validation error.
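One way to narrow the "silent degradation" gap is to run a structural check on the JSON plan before any retrieval starts. The sketch below is illustrative only: the plan shape (`task`, `sub_queries`, `id`, `query`, `depends_on`) and the function `validate_plan` are assumptions made for this example, not the harness's actual schema. It gives none of PDDL's soundness guarantees, but it turns an ill-formed plan into an explicit error list.

```python
def validate_plan(plan: dict) -> list[str]:
    """Return a list of problems found in a plan; an empty list means it passed.

    Assumed (hypothetical) plan shape:
    {"task": str, "sub_queries": [{"id": str, "query": str, "depends_on": [str]}]}
    """
    errors: list[str] = []

    task = plan.get("task")
    if not isinstance(task, str) or not task.strip():
        errors.append("missing or empty 'task'")

    subs = plan.get("sub_queries")
    if not isinstance(subs, list) or not subs:
        errors.append("'sub_queries' must be a non-empty list")
        return errors

    # First pass: every sub-query needs a unique id and a non-empty query string.
    seen: set[str] = set()
    for i, sq in enumerate(subs):
        if not isinstance(sq, dict):
            errors.append(f"sub_queries[{i}] is not an object")
            continue
        sq_id = sq.get("id")
        if not isinstance(sq_id, str) or not sq_id:
            errors.append(f"sub_queries[{i}] has no 'id'")
        elif sq_id in seen:
            errors.append(f"duplicate id '{sq_id}'")
        else:
            seen.add(sq_id)
        query = sq.get("query")
        if not isinstance(query, str) or not query.strip():
            errors.append(f"sub_queries[{i}] has an empty 'query'")

    # Second pass: dependencies must point at ids that exist in the plan.
    for sq in subs:
        if isinstance(sq, dict):
            for dep in sq.get("depends_on", []):
                if dep not in seen:
                    errors.append(f"'{sq.get('id')}' depends on unknown id '{dep}'")
    return errors
```

A caller would refuse to start retrieval when the returned list is non-empty, which converts a quiet quality drop into a visible plan-validation failure.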
Plan Validity vs. Executability: A Necessary Distinction
A critical clarification from arXiv:2412.10675v1: plan validity (does the plan conform to domain constraints?) and plan executability (can the plan be carried out to completion, producing a goal state?) are distinct properties that standard evaluation metrics conflate. This distinction matters because reinforcement learning strategies that optimize executability can degrade validity, and vice versa.
The paper's RL approach using the "Longest Contiguous Common Subsequence" reward is the most effective strategy found, contributing to improvements in both validity and executability. Standard fine-tuning showed poor generalization. The implication for harness design is that plan quality cannot be reduced to a single scalar: a plan that executes completely but reaches the wrong state is a different failure mode from a plan that is syntactically valid but unexecutable.
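To make the reward concrete, here is a minimal sketch of a longest-contiguous-common-subsequence score between a generated action sequence and a reference plan. The function names and the normalization by reference length are assumptions for illustration; the paper's exact reward formulation is not reproduced here.

```python
def longest_contiguous_common_run(generated: list[str], reference: list[str]) -> int:
    """Length of the longest run of actions that appears contiguously in both plans."""
    best = 0
    # prev[j] = length of the common run ending at the previous generated action
    # and at reference[j - 1].
    prev = [0] * (len(reference) + 1)
    for g in generated:
        curr = [0] * (len(reference) + 1)
        for j, r in enumerate(reference, start=1):
            if g == r:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def lccs_reward(generated: list[str], reference: list[str]) -> float:
    """Normalize the run length to [0, 1] by the reference length (an assumption)."""
    if not reference:
        return 0.0
    return longest_contiguous_common_run(generated, reference) / len(reference)
```

Because the score rewards contiguous agreement, a plan that contains all the right actions in a scrambled order scores lower than one that gets a long stretch exactly right, which is one reason such a signal can track executability more closely than a bag-of-actions overlap.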
Learning to Plan: Iterative Refinement Before Inference
The Learning to Plan method (arXiv:2304.10464v4) uses a two-phase process: generate candidate plans from training errors, then apply the refined plan library during inference. The critical finding is that plans learned by one LLM transfer to and improve the performance of another LLM — a result with direct implications for the harness's autoresearch loop. If the SYNTH_INSTRUCTION variants discovered by the Kimi cloud proposer during session 3 encode generalizable planning strategies, they should transfer across model upgrades without retraining.
Metacognitive Monitoring: Monitor Before You Generate
The metacognitive framework paper (arXiv:2510.16374v1) criticizes the separation between the Monitor-Generate and Generate-Verify paradigms. Current approaches either monitor without verifying or generate without strategic monitoring. The paper proposes a unified three-phase Monitor-Generate-Verify cycle based on Flavell's cognitive monitoring model, achieving 75.42% on GSM8K vs. 68.44% for SELF-REFINE and 67.07% for Self-Verification, with 1.3 vs. 2.0 refinement attempts.
The Wiggum Loop (Post 6) embodies this structure. The evaluator monitors the producer's output before deciding whether to request revision; the revision request is based on dimensional assessment rather than binary pass/fail. This is precisely Flavell's monitoring-before-generation pattern applied to the evaluation stage. The 1.3 vs. 2.0 refinement attempts ratio mirrors experiment-04's mean rounds of 1.25 for T_A — the most effectively monitored task type.
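The control flow of a three-phase cycle can be sketched in a few lines. This is a generic illustration, not the paper's implementation or the Wiggum Loop's code: `monitor`, `generate`, and `verify` are hypothetical callables supplied by the caller, and the `max_rounds` cap is an assumption.

```python
from typing import Any, Callable, Optional, Tuple


def monitor_generate_verify(
    task: str,
    monitor: Callable[[str], Any],
    generate: Callable[[str, Any, Optional[str]], str],
    verify: Callable[[str, str], Tuple[bool, Optional[str]]],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Run one Monitor-Generate-Verify cycle and return (draft, rounds_used)."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")

    # Monitor: assess the task once, before any generation happens.
    strategy = monitor(task)

    feedback: Optional[str] = None
    draft = ""
    for round_no in range(1, max_rounds + 1):
        # Generate: produce a draft, conditioned on the strategy and prior feedback.
        draft = generate(task, strategy, feedback)
        # Verify: accept the draft or return feedback for the next round.
        accepted, feedback = verify(task, draft)
        if accepted:
            return draft, round_no
    # Budget exhausted: return the last draft with the number of rounds spent.
    return draft, max_rounds
```

The returned round count is the quantity comparable to the refinement-attempt figures quoted above.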
Tool Planning: Search, Cost, and Retrieval
The APS Review: 126 Papers on LLM + Symbolic Planners
A comprehensive review of 126 papers on LLMs in Automated Planning and Scheduling (arXiv:2401.02500v2) concludes with a neuro-symbolic proposal: the optimal path forward is not standalone LLM usage but integration of LLMs with traditional symbolic planners. Eight categories of LLM application are identified: language translation, plan generation, model construction, heuristics optimization, plan verification, interactive planning, knowledge engineering, and multi-agent coordination.
The harness architecture covers five of these eight categories. Language translation: the Planner-First pattern converts natural language tasks to structured queries. Plan generation: the task decomposition loop. Model construction: the autoresearch loop constructs and tests instruction variants. Plan verification: the Wiggum Loop evaluates plan execution quality. Interactive planning: the Ralph/Wiggum revision cycle. The three uncovered categories (heuristics optimization, knowledge engineering, and multi-agent coordination) correspond to the gaps that the remaining posts in this series address.
CATP-LLM: Cost-Aware Tool Planning
CATP-LLM (arXiv:2411.16313v3) introduces cost-aware tool planning as a distinct capability: not just selecting which tools to use, but planning when to use them concurrently versus sequentially to optimize the tradeoff between task performance and execution time. Using Llama2-7B, CATP-LLM achieves 1.5%–93.9% improvements in plan quality over GPT-4 depending on the task — a striking result that reflects the value of cost awareness rather than raw model capability.
This connects directly to the harness's Keep-Alive Budget (A4). The Keep-Alive Budget decides which models to hold in memory and which to cold-start based on expected utilization. Cost-aware planning at the tool level — knowing that some tool calls are cheap and parallelizable while others are expensive and serial — is the same optimization at a higher level of abstraction. A harness that plans tool invocations with the same cost-awareness CATP-LLM applies to task scheduling would avoid the latency spikes caused by sequential calls to slow external services.
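The latency argument is easy to check with arithmetic. The sketch below compares a fully sequential schedule with a dependency-respecting concurrent one. The tool names, per-call costs, and the assumption of unlimited parallelism over an acyclic dependency map are all illustrative; this is not CATP-LLM's planner.

```python
def sequential_latency(costs: dict[str, float]) -> float:
    """Total time if every tool call runs one after another."""
    return sum(costs.values())


def concurrent_latency(costs: dict[str, float], deps: dict[str, list[str]]) -> float:
    """Critical-path time if independent calls run in parallel.

    Assumes `deps` is acyclic and maps each tool to the tools it must wait for.
    """
    finish: dict[str, float] = {}

    def finish_time(tool: str) -> float:
        if tool not in finish:
            # A tool starts once its slowest dependency has finished.
            start = max((finish_time(d) for d in deps.get(tool, [])), default=0.0)
            finish[tool] = start + costs[tool]
        return finish[tool]

    return max(finish_time(tool) for tool in costs)
```

With hypothetical costs of 2.0 s for `search`, 3.0 s for `fetch`, and 1.0 s for a `summarize` step that depends on both, the sequential schedule takes 6.0 s and the concurrent one 4.0 s. The gap grows with the number of slow, independent calls.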
PEARL: Offline Exploration + Online RL for Multi-Hop Tool Use
PEARL (arXiv:2601.20439v1) addresses multi-hop tool invocation — tasks where the output of one tool is the input to another across multiple sequential steps. Current LLMs show three failure modes in this setting: weak planning (incorrect tool sequencing), tool hallucination (invoking tools that don't exist), and erroneous parameter generation (wrong arguments to valid tools).
PEARL's two-stage approach combines offline tool exploration (systematic traversal of the tool interaction graph to generate training trajectories) with online reinforcement learning trained via GRPO. A custom reward function provides separate signals for planning quality, tool selection accuracy, and parameter correctness. On ToolHop, PEARL achieves a 56.5% success rate, a new state of the art. On T-Eval, it maintains low invocation error rates.
The RL Rollout pattern (D4, Post 10) is the harness's analogue to PEARL's offline exploration phase. JSONL audit logs from production runs are the trajectory data; the autoresearch composite score is the reward signal. The difference is that PEARL's reward signal is dense (separate per-step signals for planning, selection, and parameters) while the harness's composite score is sparse (one score per run). This sparsity is the primary reason autoresearch sessions require multiple candidate evaluations to find a signal above the 0.1 improvement threshold.
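The difference between the two signal shapes can be illustrated with a toy scorer. Everything here is hypothetical: the `StepScore` fields mirror the three components named above, the equal weighting is an assumption, and neither function reproduces PEARL's reward or the harness's composite score.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass
class StepScore:
    """Per-step scores in [0, 1] for one tool invocation (illustrative components)."""
    planning: float
    tool_selection: float
    parameters: float


def sparse_reward(steps: list[StepScore]) -> float:
    """Collapse a whole run into one number, as a per-run composite score does."""
    return mean(getattr(s, f.name) for s in steps for f in fields(s))


def weakest_component(steps: list[StepScore]) -> str:
    """Use the dense signal to name the component with the lowest mean score."""
    means = {f.name: mean(getattr(s, f.name) for s in steps) for f in fields(StepScore)}
    return min(means, key=means.get)
```

The sparse score says only that a run was mediocre. The dense view says which component dragged it down, which is the information a search over instruction variants otherwise has to recover through repeated runs.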
ToolTree: MCTS for Tool Planning
ToolTree (arXiv:2603.12740v1) applies the Monte Carlo Tree Search algorithm to tool planning, enabling the agent to explore tool usage trajectories before committing to a sequence. Standard greedy tool planning treats each tool selection as a one-step decision; MCTS enables the agent to reason about the downstream consequences of each tool selection before committing. Across four benchmarks, ToolTree achieves approximately 10% improvement over state-of-the-art planning paradigms while maintaining efficiency.
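The selection step at the heart of MCTS can be sketched with the standard UCT rule, which balances a tool's average observed value against how rarely it has been tried. This is the textbook formula, assumed here for illustration; the source does not state which selection rule ToolTree uses, and the exploration constant `c` and the statistics layout are hypothetical.

```python
import math


def uct_score(value_sum: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT: mean value plus an exploration bonus for rarely tried options."""
    if visits == 0:
        return math.inf  # always try an unvisited tool once
    return value_sum / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_tool(children: dict[str, tuple[float, int]], c: float = 1.4) -> str:
    """Pick the next tool to expand from {tool_name: (value_sum, visits)}."""
    parent_visits = sum(visits for _, visits in children.values()) or 1
    return max(children, key=lambda name: uct_score(*children[name], parent_visits, c))
```

A greedy planner would always pick the highest mean. The exploration term is what lets the search revisit a tool whose first rollout happened to go badly.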
The harness's DAG Orchestrator (C3, Post 8) implements a simpler version of this idea: rather than MCTS, it uses static dependency analysis to identify which subtasks can be parallelized and which must be sequential. The cost of this simplification is that the dependency graph must be specified at task decomposition time rather than discovered during execution. ToolTree's MCTS approach would allow the harness to discover tool dependencies dynamically — useful when the tool interaction graph is unknown or task-specific.
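Static dependency analysis of this kind can be sketched with Python's standard-library `graphlib`. The subtask names and the `parallel_batches` helper are hypothetical, and this is not the DAG Orchestrator's code; it only shows how a dependency map fixed at decomposition time yields batches that can run in parallel.

```python
from graphlib import TopologicalSorter


def parallel_batches(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group subtasks into batches; every task in a batch can run concurrently.

    `deps` maps each subtask to the set of subtasks it must wait for.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all tasks whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock its dependents
    return batches
```

For a hypothetical plan where `synthesize` waits on `search_a` and `search_b`, and `report` waits on `synthesize`, the batches are the two searches, then `synthesize`, then `report`. The limitation described above is visible in the signature: `deps` must be complete before execution begins.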
ToolRet: Tool Retrieval Is Not Information Retrieval
ToolRet (arXiv:2503.01763v2) establishes that existing information retrieval models — BERT-based dense retrievers, BM25, etc. — perform poorly on tool retrieval despite strong performance on conventional IR benchmarks. The benchmark covers 7.6k diverse retrieval tasks and a corpus of 43k tools. Key finding: poor retrieval quality directly degrades task pass rates; fine-tuning on 200k tool-retrieval instances substantially improves tool selection.
This finding has a direct implication for the harness's Dual-Backend Memory Store (B3). The current implementation uses vector similarity for semantic retrieval of prior research and SQLite for structured lookup of tool metadata. ToolRet suggests that vector similarity — optimized for document retrieval — may underperform on tool retrieval specifically. A specialized tool retrieval component, potentially a fine-tuned dense retriever over the tool corpus, would improve MCP Dispatch Router (C4) performance on tasks with large tool sets.
Multi-LLM Architecture for Tool Use
Small LLMs Are Weak Tool Learners: The Case for Decomposition
The multi-LLM agent paper (arXiv:2401.07324v3) provides direct experimental evidence for the multi-model architecture that experiment-04 arrived at empirically. The core finding: traditional approaches that train a single LLM to handle all tool-use capabilities — task planning, tool invocation, and result summarization simultaneously — exhibit systematic performance limitations. A modular framework that decomposes these three capabilities into separate LLM instances surpasses single-LLM approaches across tool-use benchmarks.
The three-model architecture from experiment-04 maps onto this decomposition with one extension. The harness uses a planner/compressor (glm4:9b) for task decomposition, with the summarizer role embedded in the compressor; a producer (qwen2.5:32b Q4_K_M) for content generation, analogous to the "caller" role; and a separate evaluator (Qwen3-Coder:30b) for quality assessment. The paper's planner/caller/summarizer decomposition does not include an evaluator, a gap that the Wiggum Loop fills.
Memory-as-a-Tool: Amortizing Refinement Cost
Memory-as-a-Tool (arXiv:2601.05960v2) converts transient critiques into retrievable guidelines using a file-based memory system and agent-controlled tool calls. The motivation is cost: test-time refinement pipelines (generate → evaluate → revise) are expensive because the full refinement loop runs at inference time. By converting critique outputs into persistent memory entries that can be retrieved at the start of future runs, the framework amortizes refinement cost: subsequent runs start closer to the quality threshold without a full revision loop.
The harness's research cache implements a simpler version of this pattern: prior research results are cached by query hash and retrieved on subsequent runs, avoiding redundant search calls. Memory-as-a-Tool extends this to critique outputs: if the evaluator's dimensional feedback for a given task type is persistent, the producer can retrieve it before first-pass generation rather than discovering it through the revision loop. This would reduce wiggum_rounds from a mean of 1.25 (experiment-04 T_A) toward 1.0 — eliminating the revision loop entirely for well-understood task types.
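A task-type-keyed feedback store could look roughly like the following. This is a minimal sketch under stated assumptions: the `FeedbackStore` class, the one-JSON-file-per-task-type layout, and the requirement that task types are filesystem-safe identifiers (such as `T_A`) are all invented for illustration and are not taken from the Memory-as-a-Tool paper or the harness.

```python
import json
import tempfile
from pathlib import Path


class FeedbackStore:
    """File-based store of evaluator guidelines, keyed by task type (hypothetical)."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, task_type: str) -> Path:
        # Assumes task_type is a safe file name, e.g. "T_A".
        return self.root / f"{task_type}.json"

    def get(self, task_type: str) -> list[str]:
        """Return stored guidelines, or an empty list for an unseen task type."""
        path = self._path(task_type)
        return json.loads(path.read_text()) if path.exists() else []

    def add(self, task_type: str, guideline: str) -> None:
        """Persist a guideline once; repeated critiques are not duplicated."""
        items = self.get(task_type)
        if guideline not in items:
            items.append(guideline)
            self._path(task_type).write_text(json.dumps(items, indent=2))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = FeedbackStore(Path(tmp) / "feedback")
        store.add("T_A", "Cite primary sources for every quantitative claim.")
        print(store.get("T_A"))
```

The producer would call `get` before first-pass generation and prepend the guidelines to its prompt; the evaluator would call `add` whenever a revision request surfaces a reusable critique.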
SCALAR: Bidirectional Planning and Deep RL
SCALAR (arXiv:2603.09036v1) addresses the grounding problem: how to translate high-level symbolic plans into low-level executable actions in embodied environments. The bidirectional approach: the LLM proposes skills with preconditions and effects (high-level symbolic planning); a Deep RL agent trains the skills and returns execution feedback to refine the symbolic specification (low-level grounding). On Craftax, SCALAR achieves 88.2% diamond collection rate — 1.9× improvement over the best baseline.
The harness's autoresearch loop is a constrained version of SCALAR's feedback cycle. The autoresearch loop treats the SYNTH_INSTRUCTION as the symbolic plan specification and the composite score as the execution feedback. The difference is that SCALAR's RL agent learns the grounding policy; the harness's revision loop relies on the producer model's in-context learning. A true SCALAR-style implementation would train a dedicated grounding policy on the harness's JSONL audit data.
Graph-Based Tool Planning
The graph-based framework paper (arXiv:2510.24690v1) constructs two knowledge graphs: a tool knowledge graph from API schemas (capturing tool capabilities and inter-tool dependencies) and a domain knowledge graph from internal documents and standard operating procedures. The fusion of these graphs for in-context planning improves exemplar artifact generation — the planning examples that guide the model toward valid tool sequences.
This directly corresponds to the Dual-Backend Memory Store (B3): the vector store holds semantic content (analogous to the domain knowledge graph) while the SQLite store holds structured metadata (analogous to the tool knowledge graph). The graph-based approach suggests an explicit graph structure between these two stores — a tool-knowledge graph that links API schemas to domain documents by shared concepts. The harness currently does not implement this link explicitly; each backend is queried independently.
The Safety Problem: Helpfulness as Vulnerability
The "Too Helpful to Be Safe" paper introduces user-mediated attacks: rather than attacking the agent directly, the attacker exploits the benign user as an intermediary. The user provides content containing untrusted data (a trip recommendation from an attacker-controlled source, a webpage with embedded instructions) and the agent, optimized for helpfulness, processes the content and executes the embedded instructions.
Three findings stand out:
- 92% bypass without explicit safety request: The default behavior of planning agents is to execute, not to verify. This is the "silent overwrite" failure mode from the Harness Thesis (Post 1) — the agent that silently replaces good research with bad because nothing in its objective penalizes the replacement.
- Near-deterministic bypass for web-use agents: Agents with browser access are more vulnerable than trip-planning agents, because web content is inherently attacker-controlled. The CDP Guard pattern (Post 9) addresses this directly — blocking CDP commands that would allow the browser agent to exfiltrate data or execute arbitrary code.
- 7% bypass even with hard safety requests: Explicit safety instructions reduce but do not eliminate the bypass rate. This means safety cannot be implemented purely at the prompt level; it requires structural enforcement — the AST Guard, Path Sandbox, and Injection Scanner patterns.
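Structural enforcement means the check runs in code, outside the model's control, so no instruction embedded in untrusted content can talk its way past it. As one illustration, a path-containment check in the spirit of a path sandbox might look like this. The function name and behavior are assumptions for this sketch, not the harness's Path Sandbox implementation.

```python
from pathlib import Path


def is_inside_sandbox(candidate: str, sandbox_root: str) -> bool:
    """Return True only if `candidate` resolves to a location under `sandbox_root`.

    Resolving first collapses ".." segments and follows symlinks, so traversal
    tricks are judged by where the path actually lands.
    """
    root = Path(sandbox_root).resolve()
    # Joining with an absolute candidate discards the root, so absolute paths
    # outside the sandbox are rejected by the containment test below.
    target = (root / candidate).resolve()
    return target == root or root in target.parents
```

A file-writing tool wrapped in this check rejects `../etc/passwd` regardless of how persuasive the surrounding prompt text is, which is the property prompt-level safety instructions cannot provide.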
The paper's root cause analysis: agents prioritize helpfulness over safety because that is what the training objective optimized for. The multi-objective alignment literature (Post 14) frames this as exactly the scalar-reward collapse problem — when helpfulness is the sole objective, safety is not a constraint but a cost. The safety fine-tuning hazard documented in the alignment literature (10.33% harmfulness increase from high dataset similarity) is the training-time version of the same problem the deployment-time bypass rate reveals.
Mapping the Literature to Harness Patterns
| Paper | Core contribution | Harness pattern | Gap or validation |
|---|---|---|---|
| NL2Plan | NL → PDDL without expert input | B1 Planner-First | Validation: LLM-to-structured-plan conversion is sound when backed by a formal validator |
| Plan validity/executability | Two distinct plan quality dimensions | B1 Planner-First | Gap: harness conflates conformance (validity) with task completion (executability) |
| Learning to Plan | Plan transfer across LLMs | D4 RL Rollout | Validation: autoresearch instruction variants should transfer across producer upgrades |
| Metacognitive framework | Monitor before generate; 1.3 vs 2.0 refinement attempts | D2 Wiggum Loop | Validation: pre-generation monitoring reduces revision rounds (matches exp-04 T_A: 1.25 rounds) |
| CATP-LLM | Cost-aware concurrent vs. serial tool execution | A4 Keep-Alive Budget | Gap: harness does not optimize tool invocation order for latency/cost tradeoff |
| PEARL | Offline exploration + online GRPO RL; 56.5% on ToolHop | D4 RL Rollout | Gap: harness RL signal is sparse (per-run); PEARL's dense per-step signal would accelerate autoresearch |
| ToolTree | MCTS for tool sequence planning; +10% over baselines | C3 DAG Orchestrator | Gap: DAG is static; MCTS would enable dynamic dependency discovery |
| ToolRet | Tool retrieval ≠ document retrieval; specialized models needed | B3 Dual-Backend Memory Store | Gap: vector store optimized for document retrieval may underperform on tool retrieval |
| Multi-LLM Agent | Planner/caller/summarizer decomposition beats single LLM | A2 Model Role Separation | Validation: three-model architecture from exp-04 independently discovered the same decomposition |
| Memory-as-a-Tool | Persistent critique guidelines amortize refinement cost | B3 Dual-Backend Memory Store | Gap: harness caches research, not evaluator feedback; persistent critique would reduce wiggum_rounds |
| SCALAR | Bidirectional LLM planning + Deep RL grounding; 1.9× improvement | D4 RL Rollout | Gap: autoresearch loop uses in-context learning; a trained grounding policy would be more stable |
| Graph-based planning | Tool KG + domain KG fusion for in-context planning | B3 Dual-Backend Memory Store | Gap: backends are queried independently; explicit graph links would improve joint retrieval |
| Too Helpful to Be Safe | 92% safety bypass without explicit instruction; user-mediated attacks | E1–E4 Security patterns | Validation: structural enforcement (AST Guard, Path Sandbox, Injection Scanner) is necessary, not prompt-level safety |
Three Persistent Gaps
Across the thirteen papers mapped in the table above, three gaps in the harness architecture appear repeatedly:
1. Dense reward signals. The harness's autoresearch loop uses a single composite score per run as its reward signal. PEARL's success (56.5% on ToolHop) is built on per-step signals that distinguish planning quality from tool selection from parameter correctness. A harness that generates per-stage scores — plan quality at decomposition, retrieval quality at search, synthesis quality at generation — would locate the bottleneck in a single run rather than across a session of controlled experiments.
2. Persistent evaluator feedback. Memory-as-a-Tool demonstrates that amortizing critique cost across runs reduces refinement rounds. The harness currently caches research outputs (by query hash) but not evaluator feedback (by task type + task hash). A task-type-keyed feedback store would allow the producer to retrieve "what depth feedback on T_A tasks looks like" before first-pass generation, potentially eliminating the revision loop on known task types.
3. Dynamic tool dependency discovery. The DAG Orchestrator uses a static dependency graph specified at task decomposition time. ToolTree's MCTS approach enables dynamic discovery of tool dependencies during execution. For tasks where tool interactions are unknown at planning time — the typical case for novel research queries — dynamic discovery would prevent the silent failure mode where a plan is valid but unexecutable because a tool interaction assumption was wrong.