Agentic Tool Use and Planning: What the Literature Says

May 1, 2026 • 19 min read

Fourteen papers on LLM-driven planning and tool use. NL2Plan converts natural language to PDDL without expert input. PEARL trains a dedicated Planner via GRPO and reaches 56.5% success on ToolHop. ToolTree applies Monte Carlo Tree Search to tool selection. And trip-planning agents bypass safety constraints in over 92% of cases when no explicit safety instruction is given. Each of these findings maps directly to a pattern in the harness architecture.

Posts 16, 18, 20 & 21 — May 1, 2026

Four literature reviews on the research foundations behind key harness subsystems.

Framing: Planning as the Harness's Hidden Contract

The harness's Planner-First pattern (B1) makes an implicit claim: if you decompose a task into targeted sub-queries before research begins, retrieval quality and synthesis quality both improve. The literature on LLM-driven planning makes this claim explicit and reveals where it breaks. Classical planners are sound but require formal domain models (PDDL). LLMs generate flexible natural-language plans but often produce semantically incoherent or invalid action sequences. The research program documented in this post addresses the gap between these two failure modes.

The tension between formalism and flexibility appears throughout the harness architecture. The harness uses a JSON schema for task decomposition rather than PDDL — a pragmatic choice that preserves flexibility at the cost of the soundness guarantees classical planning provides. Understanding what those guarantees are worth, and what it costs to approximate them with LLM-generated plans, is the core question this literature addresses.

Survey scope: arXiv:2405.04215v2 (NL2Plan), arXiv:2412.10675v1 (plan validity vs. executability), arXiv:2304.10464v4 (Learning to Plan), arXiv:2510.16374v1 (metacognitive framework), arXiv:2401.02500v2 (APS review, 126 papers), arXiv:2411.16313v3 (CATP-LLM), arXiv:2601.20439v1 (PEARL), arXiv:2601.10758v1 (user-mediated attacks), arXiv:2503.01763v2 (ToolRet), arXiv:2603.12740v1 (ToolTree), arXiv:2603.09036v1 (SCALAR), arXiv:2510.24690v1 (graph-based planning), arXiv:2401.07324v3 (multi-LLM agent), arXiv:2601.05960v2 (Memory-as-a-Tool).

Plan Generation: From NL to Executable Action

NL2Plan: Soundness Without Expert Input

NL2Plan (arXiv:2405.04215v2) is the first system to generate complete PDDL tasks — both domain and problem — from minimal natural language descriptions without expert input or domain-specific adaptation. PDDL (Planning Domain Definition Language) is the formalism underlying classical AI planning systems: it specifies preconditions and effects for each action in a domain, enabling a planner to verify that a proposed action sequence is valid before execution.
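To make the role of preconditions and effects concrete, here is a minimal sketch of plan verification in the STRIPS style that PDDL builds on. It is not NL2Plan's code and not real PDDL syntax; the `Action` class, the `validate_plan` helper, and the toy fetch/summarize domain are all hypothetical illustrations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A STRIPS-style action: facts required, facts added, facts removed."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset = frozenset()


def validate_plan(initial_state, plan, goal):
    """Simulate the plan step by step and return (ok, message)."""
    state = set(initial_state)
    for step, action in enumerate(plan, start=1):
        missing = action.preconditions - state
        if missing:
            return False, f"step {step} ({action.name}): missing {sorted(missing)}"
        # Apply the action: remove deleted facts, then add new ones.
        state = (state - action.del_effects) | action.add_effects
    unmet = set(goal) - state
    if unmet:
        return False, f"goal not reached: missing {sorted(unmet)}"
    return True, "plan is valid and reaches the goal"


# Toy domain (hypothetical): a document must be fetched before it is summarized.
fetch = Action("fetch", frozenset({"has_url"}), frozenset({"has_document"}))
summarize = Action("summarize", frozenset({"has_document"}), frozenset({"has_summary"}))

print(validate_plan({"has_url"}, [fetch, summarize], {"has_summary"}))
print(validate_plan({"has_url"}, [summarize, fetch], {"has_summary"}))
```

The second call fails at step 1 because `summarize` requires a fact that no earlier action has produced. This is the kind of error a formal domain model catches before anything executes.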

The significance of NL2Plan is not that it replaces classical planners — it uses them as the backend — but that it removes the expert annotation bottleneck that made classical planning impractical for general-purpose agents. Tested across seven planning domains including five excluded from LLM training data, NL2Plan outperforms direct LLM generation combined with a validator. The key insight: using the LLM to model the domain formally, then delegating search to the classical planner, produces better results than using the LLM to search directly.

The harness's Planner-First pattern operates in the same conceptual space but at a lower formality level. Rather than generating PDDL, the harness generates a JSON query plan: a structured decomposition of a research task into targeted sub-queries with associated parameters. The advantage of JSON over PDDL is that it is directly interpretable by the model at each stage without a separate planning engine. The cost is that only the plan's structure can be checked mechanically; whether its sub-queries are coherent and sufficient is not formally checkable, so a semantically poor plan silently produces degraded retrieval rather than a plan-validation error.
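A structural check on a JSON query plan might look like the sketch below. The field names (`task`, `sub_queries`, `query`, `max_results`) are assumptions made for illustration; the post does not specify the harness's actual plan schema. Note what the check cannot do: it catches malformed plans, not plans that are well-formed but decompose the task badly.

```python
import json


def check_query_plan(raw):
    """Return a list of structural problems in a JSON query plan (empty means OK)."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(plan, dict):
        return ["top level must be an object"]

    problems = []
    # Hypothetical schema: a task string plus a non-empty list of sub-queries.
    if not isinstance(plan.get("task"), str) or not plan["task"].strip():
        problems.append("'task' must be a non-empty string")
    sub_queries = plan.get("sub_queries")
    if not isinstance(sub_queries, list) or not sub_queries:
        return problems + ["'sub_queries' must be a non-empty list"]

    for i, sq in enumerate(sub_queries):
        if not isinstance(sq, dict):
            problems.append(f"sub_queries[{i}] must be an object")
            continue
        if not isinstance(sq.get("query"), str) or not sq["query"].strip():
            problems.append(f"sub_queries[{i}].query must be a non-empty string")
        max_results = sq.get("max_results", 5)  # assumed default
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(max_results, bool) or not isinstance(max_results, int) or max_results < 1:
            problems.append(f"sub_queries[{i}].max_results must be a positive integer")
    return problems


good_plan = '{"task": "survey tool retrieval", "sub_queries": [{"query": "ToolRet benchmark", "max_results": 3}]}'
print(check_query_plan(good_plan))
print(check_query_plan('{"task": "survey", "sub_queries": [{"query": ""}]}'))
```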

Plan Validity vs. Executability: A Necessary Distinction

A critical clarification from arXiv:2412.10675v1: plan validity (does the plan conform to domain constraints?) and plan executability (can the plan be carried out to completion, producing a goal state?) are distinct properties that standard evaluation metrics conflate. This distinction matters because reinforcement learning strategies that optimize executability can degrade validity, and vice versa.

The paper's RL approach using the "Longest Contiguous Common Subsequence" reward is the most effective strategy found, contributing to improvements in both validity and executability. Standard fine-tuning showed poor generalization. The implication for harness design is that plan quality cannot be reduced to a single scalar: a plan that executes completely but reaches the wrong state is a different failure mode from a plan that is syntactically valid but unexecutable.
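As an illustration of what a "longest contiguous common subsequence" signal measures, the sketch below computes the longest run of consecutive actions shared by a predicted plan and a reference plan. The paper's exact reward formulation is not reproduced here; dividing by the reference length is an assumption, and the action names are made up.

```python
def longest_contiguous_common_subsequence(predicted, reference):
    """Length of the longest run of consecutive actions shared by both sequences."""
    best = 0
    # prev[j] holds the length of the common run ending at the previous
    # predicted action and reference[j - 1].
    prev = [0] * (len(reference) + 1)
    for p in predicted:
        curr = [0] * (len(reference) + 1)
        for j, r in enumerate(reference, start=1):
            if p == r:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def lccs_reward(predicted, reference):
    """Normalize by reference length so the reward lies in [0, 1] (an assumption)."""
    if not reference:
        return 0.0
    return longest_contiguous_common_subsequence(predicted, reference) / len(reference)


reference = ["search", "fetch", "summarize", "cite"]
predicted = ["search", "fetch", "cite", "summarize"]
print(lccs_reward(predicted, reference))  # 0.5: only "search, fetch" matches in order
```

A reward of this shape favors plans that get long stretches of steps in the right order, which is stricter than counting how many correct actions appear anywhere in the plan.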

The harness parallel: The distinction between plan validity and executability maps directly to the difference between structured output conformance (does the JSON plan parse?) and task completion (does the plan lead to a passing score?). The harness's PASS/FAIL outcome is an executability measure; format validation on the plan output is a validity measure. Both are necessary; neither is sufficient alone.

Learning to Plan: Iterative Refinement Before Inference

The Learning to Plan method (arXiv:2304.10464v4) uses a two-phase process: generate candidate plans from training errors, then apply the refined plan library during inference. The critical finding is that plans learned by one LLM transfer to and improve the performance of another LLM — a result with direct implications for the harness's autoresearch loop. If the SYNTH_INSTRUCTION variants discovered by the Kimi cloud proposer during session 3 encode generalizable planning strategies, they should transfer across model upgrades without retraining.

Metacognitive Monitoring: Monitor Before You Generate

The metacognitive framework paper (arXiv:2510.16374v1) criticizes the separation between Monitor-Generate and Generate-Verify paradigms. Current approaches either monitor without generating or generate without strategic monitoring. The paper proposes a unified three-phase Monitor-Generate-Verify cycle based on Flavell's cognitive monitoring model, achieving 75.42% on GSM8K vs. 68.44% for SELF-REFINE and 67.07% for Self-Verification, with 1.3 vs. 2.0 refinement attempts.

The Wiggum Loop (Post 6) embodies this structure. The evaluator monitors the producer's output before deciding whether to request revision; the revision request is based on dimensional assessment rather than binary pass/fail. This is precisely Flavell's monitoring-before-generation pattern applied to the evaluation stage. The 1.3 vs. 2.0 refinement attempts ratio mirrors experiment-04's mean rounds of 1.25 for T_A — the most effectively monitored task type.
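The three-phase cycle can be sketched as a small control loop. This is neither the paper's implementation nor the Wiggum Loop's; `monitor`, `generate`, and `verify` are hypothetical callables, and the toy functions below stand in for model calls.

```python
def monitor_generate_verify(task, monitor, generate, verify, max_rounds=3):
    """Run a Monitor-Generate-Verify cycle; return (output, refinement_rounds)."""
    # Monitor: assess the task before generating anything.
    guidance = monitor(task)
    output = generate(task, guidance)
    refinements = 0
    while refinements < max_rounds:
        # Verify: None means the output passes, otherwise a feedback string.
        feedback = verify(task, output)
        if feedback is None:
            break
        # Fold the feedback into the guidance for the next attempt.
        guidance = f"{guidance}; fix: {feedback}"
        output = generate(task, guidance)
        refinements += 1
    return output, refinements


# Toy stand-ins (hypothetical) for the three model roles.
def toy_monitor(task):
    return "show working"


def toy_generate(task, guidance):
    return "answer with units" if "units" in guidance else "answer"


def toy_verify(task, output):
    return None if "units" in output else "add units"


print(monitor_generate_verify("2 km + 3 km", toy_monitor, toy_generate, toy_verify))
```

In this toy run the first draft fails verification once, so the loop reports one refinement round. Better monitoring up front is what would push that count toward zero.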

Tool Planning: Search, Cost, and Retrieval

The APS Review: 126 Papers on LLM + Symbolic Planners

A comprehensive review of 126 papers on LLMs in Automated Planning and Scheduling (arXiv:2401.02500v2) concludes with a neuro-symbolic proposal: the optimal path forward is not standalone LLM usage but integration of LLMs with traditional symbolic planners. Eight categories of LLM application are identified: language translation, plan generation, model construction, heuristics optimization, plan verification, interactive planning, knowledge engineering, and multi-agent coordination.

The harness architecture covers five of these eight categories. Language translation: the Planner-First pattern converts natural language tasks to structured queries. Plan generation: the task decomposition loop. Model construction: the autoresearch loop constructs and tests instruction variants. Plan verification: the Wiggum Loop evaluates plan execution quality. Interactive planning: the Ralph/Wiggum revision cycle. The three uncovered categories (heuristics optimization, knowledge engineering, and multi-agent coordination) correspond to the gaps that the remaining posts in this series address.

CATP-LLM: Cost-Aware Tool Planning

CATP-LLM (arXiv:2411.16313v3) introduces cost-aware tool planning as a distinct capability: not just selecting which tools to use, but planning when to use them concurrently versus sequentially to optimize the tradeoff between task performance and execution time. Using Llama2-7B, CATP-LLM achieves 1.5%–93.9% improvements in plan quality over GPT-4 depending on the task — a striking result that reflects the value of cost awareness rather than raw model capability.

This connects directly to the harness's Keep-Alive Budget (A4). The Keep-Alive Budget decides which models to hold in memory and which to cold-start based on expected utilization. Cost-aware planning at the tool level — knowing that some tool calls are cheap and parallelizable while others are expensive and serial — is the same optimization at a higher level of abstraction. A harness that plans tool invocations with the same cost-awareness CATP-LLM applies to task scheduling would avoid the latency spikes caused by sequential calls to slow external services.
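The latency side of this tradeoff is simple arithmetic over a dependency graph. The sketch below compares running every tool call in sequence with starting each call as soon as its dependencies finish. It is not CATP-LLM's planner; the tool names and costs are invented, and the graph is assumed to be acyclic.

```python
def sequential_latency(costs):
    """Total time if every tool call runs one after another."""
    return sum(costs.values())


def concurrent_latency(costs, deps):
    """Time if each call starts once its dependencies finish (the critical path).

    Assumes the dependency graph has no cycles.
    """
    finish = {}

    def finish_time(tool):
        if tool not in finish:
            start = max((finish_time(d) for d in deps.get(tool, ())), default=0.0)
            finish[tool] = start + costs[tool]
        return finish[tool]

    return max((finish_time(t) for t in costs), default=0.0)


# Hypothetical per-call latencies in seconds.
costs = {"search_a": 2.0, "search_b": 3.0, "merge": 1.0}
# merge needs both searches; the two searches are independent of each other.
deps = {"merge": ["search_a", "search_b"]}

print(sequential_latency(costs))        # 6.0
print(concurrent_latency(costs, deps))  # 4.0
```

The two-second gap comes entirely from overlapping the independent searches, with no change to which tools are called.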

PEARL: Offline Exploration + Online RL for Multi-Hop Tool Use

PEARL (arXiv:2601.20439v1) addresses multi-hop tool invocation — tasks where the output of one tool is the input to another across multiple sequential steps. Current LLMs show three failure modes in this setting: weak planning (incorrect tool sequencing), tool hallucination (invoking tools that don't exist), and erroneous parameter generation (wrong arguments to valid tools).

PEARL takes a two-stage approach: offline tool exploration (systematic traversal of the tool interaction graph to generate training trajectories), followed by online reinforcement learning with GRPO and a custom reward function that provides separate signals for planning quality, tool selection accuracy, and parameter correctness. On ToolHop, PEARL achieves a 56.5% success rate, a new state of the art. On T-Eval, it maintains low invocation error rates.

Fig. 1 — Planning paradigm comparison. Each approach addresses a different failure mode in the tool-use pipeline. The harness architecture spans multiple paradigms simultaneously.

The RL Rollout pattern (D4, Post 10) is the harness's analogue to PEARL's offline exploration phase. JSONL audit logs from production runs are the trajectory data; the autoresearch composite score is the reward signal. The difference is that PEARL's reward signal is dense (separate per-step signals for planning, selection, and parameters) while the harness's composite score is sparse (one score per run). This sparsity is the primary reason autoresearch sessions require multiple candidate evaluations to find a signal above the 0.1 improvement threshold.
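The difference between a dense and a sparse signal is easy to see on toy numbers. In the sketch below, the three components follow the planning / tool selection / parameter split described above, but the weights, the scores, and the function names are assumptions; this is not PEARL's reward function or the harness's composite score.

```python
from dataclasses import dataclass, fields


@dataclass
class StepScore:
    """Per-step signals, each in [0, 1]."""
    planning: float
    tool_selection: float
    parameters: float


# Hypothetical weights for combining the three components.
WEIGHTS = {"planning": 0.5, "tool_selection": 0.25, "parameters": 0.25}


def dense_rewards(steps):
    """One reward per step: a weighted sum of that step's components."""
    return [sum(WEIGHTS[f.name] * getattr(s, f.name) for f in fields(s)) for s in steps]


def sparse_reward(steps):
    """One number per run: the mean of the per-step rewards."""
    rewards = dense_rewards(steps)
    return sum(rewards) / len(rewards)


def weakest_component(steps):
    """Name the component with the lowest mean score across the run."""
    means = {name: sum(getattr(s, name) for s in steps) / len(steps) for name in WEIGHTS}
    return min(means, key=means.get)


run = [StepScore(1.0, 1.0, 0.5), StepScore(1.0, 0.5, 0.0)]
print(dense_rewards(run))      # [0.875, 0.625]
print(sparse_reward(run))      # 0.75
print(weakest_component(run))  # parameters
```

The sparse score of 0.75 says the run was mediocre. Only the per-component view says that parameter generation is where it went wrong.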

ToolTree: MCTS for Tool Planning

ToolTree (arXiv:2603.12740v1) applies the Monte Carlo Tree Search algorithm to tool planning, enabling the agent to explore tool usage trajectories before committing to a sequence. Standard greedy tool planning treats each tool selection as a one-step decision; MCTS enables the agent to reason about the downstream consequences of each tool selection before committing. Across four benchmarks, ToolTree achieves approximately 10% improvement over state-of-the-art planning paradigms while maintaining efficiency.
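The contrast between greedy selection and tree search shows up already in the selection rule. The sketch below uses the standard UCT formula from the MCTS literature; whether ToolTree uses exactly this rule is not stated in the source, and the tool names and statistics are invented.

```python
import math


def uct_score(total_value, visits, parent_visits, c=1.4):
    """UCT: mean value plus an exploration bonus that shrinks as visits grow."""
    if visits == 0:
        return math.inf  # unvisited tools are always tried first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_tool(stats, c=1.4):
    """Pick the tool with the highest UCT score.

    stats maps a tool name to (total_value, visits) accumulated from rollouts.
    """
    parent_visits = sum(visits for _, visits in stats.values())
    return max(stats, key=lambda tool: uct_score(*stats[tool], parent_visits, c))


# Hypothetical rollout statistics at one node of the search tree.
stats = {"web_search": (4.5, 5), "calculator": (0.5, 1)}

print(select_tool(stats, c=0.0))  # greedy: picks the best mean, web_search
print(select_tool(stats))         # UCT: still explores the rarely tried calculator
```

With the exploration constant set to zero the rule collapses to greedy selection. With it switched on, a tool that has been tried only once still gets a second look before the search commits.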

The harness's DAG Orchestrator (C3, Post 8) implements a simpler version of this idea: rather than MCTS, it uses static dependency analysis to identify which subtasks can be parallelized and which must be sequential. The cost of this simplification is that the dependency graph must be specified at task decomposition time rather than discovered during execution. ToolTree's MCTS approach would allow the harness to discover tool dependencies dynamically — useful when the tool interaction graph is unknown or task-specific.
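Static dependency analysis of this kind can be sketched with Python's standard `graphlib` module. The subtask names below are hypothetical and the DAG Orchestrator's real interface is not shown in this post; the point is only how a fixed dependency map yields batches that can run in parallel.

```python
from graphlib import TopologicalSorter


def parallel_batches(deps):
    """Group subtasks into batches; each batch depends only on earlier batches.

    deps maps a subtask to the set of subtasks it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical decomposition: two searches depend on the plan,
# and synthesis depends on both searches.
deps = {
    "synthesize": {"search_a", "search_b"},
    "search_a": {"plan"},
    "search_b": {"plan"},
}

print(parallel_batches(deps))
# [['plan'], ['search_a', 'search_b'], ['synthesize']]
```

Everything here is decided from `deps` before any subtask runs, which is exactly the limitation noted above: a dependency that only becomes visible during execution never enters the graph.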

ToolRet: Tool Retrieval Is Not Information Retrieval

ToolRet (arXiv:2503.01763v2) establishes that existing information retrieval models — BERT-based dense retrievers, BM25, etc. — perform poorly on tool retrieval despite strong performance on conventional IR benchmarks. The benchmark covers 7.6k diverse retrieval tasks and a corpus of 43k tools. Key finding: poor retrieval quality directly degrades task pass rates; fine-tuning on 200k tool-retrieval instances substantially improves tool selection.

This finding has a direct implication for the harness's Dual-Backend Memory Store (B3). The current implementation uses vector similarity for semantic retrieval of prior research and SQLite for structured lookup of tool metadata. ToolRet suggests that vector similarity — optimized for document retrieval — may underperform on tool retrieval specifically. A specialized tool retrieval component, potentially a fine-tuned dense retriever over the tool corpus, would improve MCP Dispatch Router (C4) performance on tasks with large tool sets.

Multi-LLM Architecture for Tool Use

Small LLMs Are Weak Tool Learners: The Case for Decomposition

The multi-LLM agent paper (arXiv:2401.07324v3) provides direct experimental evidence for the multi-model architecture that experiment-04 arrived at empirically. The core finding: traditional approaches that train a single LLM to handle all tool-use capabilities — task planning, tool invocation, and result summarization simultaneously — exhibit systematic performance limitations. A modular framework that decomposes these three capabilities into separate LLM instances surpasses single-LLM approaches across tool-use benchmarks.

The three-model architecture from experiment-04 maps onto this decomposition with one extension. The harness uses a planner/compressor (glm4:9b) for task decomposition, with the summarizer role embedded in the same compressor; a producer (qwen2.5:32b Q4_K_M) for content generation, analogous to the "caller" role; and a separate evaluator (Qwen3-Coder:30b) for quality assessment. The paper's planner/caller/summarizer decomposition does not include an evaluator, a gap that the Wiggum Loop fills.

The deeper validation: Small LLMs Are Weak Tool Learners found that performance limitations are systematic, not random — smaller models fail specifically at the planning and summarization stages, not at tool invocation. This mirrors experiment-03's finding that qwen2.5:7b failed specifically at the depth reasoning stage (producer ceiling), not at structural conformance (which glm4:9b as evaluator had already verified). The failure mode localizes to the role whose complexity exceeds the model's capability envelope.

Memory-as-a-Tool: Amortizing Refinement Cost

Memory-as-a-Tool (arXiv:2601.05960v2) converts transient critiques into retrievable guidelines using a file-based memory system and agent-controlled tool calls. The motivation is cost: test-time refinement pipelines (generate → evaluate → revise) are expensive because the full refinement loop runs at inference time. By converting critique outputs into persistent memory entries that can be retrieved at the start of future runs, the framework amortizes refinement cost: subsequent runs start closer to the quality threshold without a full revision loop.

The harness's research cache implements a simpler version of this pattern: prior research results are cached by query hash and retrieved on subsequent runs, avoiding redundant search calls. Memory-as-a-Tool extends this to critique outputs: if the evaluator's dimensional feedback for a given task type is persistent, the producer can retrieve it before first-pass generation rather than discovering it through the revision loop. This would reduce wiggum_rounds from a mean of 1.25 (experiment-04 T_A) toward 1.0 — eliminating the revision loop entirely for well-understood task types.
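Both ideas, a cache keyed by query hash and a persistent store of evaluator feedback keyed by task type, fit in a few lines. The sketch below is an assumption-laden illustration: `query_key`, `CritiqueStore`, the JSON file layout, and the sample guideline are all hypothetical, and neither the harness's cache nor the Memory-as-a-Tool implementation is reproduced.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def query_key(query):
    """Stable cache key for a research query: normalize whitespace and case, then hash."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class CritiqueStore:
    """File-backed store of evaluator feedback, keyed by task type."""

    def __init__(self, path):
        self.path = Path(path)

    def _load(self):
        if self.path.exists():
            return json.loads(self.path.read_text(encoding="utf-8"))
        return {}

    def add(self, task_type, guideline):
        data = self._load()
        guidelines = data.setdefault(task_type, [])
        if guideline not in guidelines:  # avoid storing duplicates
            guidelines.append(guideline)
        self.path.write_text(json.dumps(data, indent=2), encoding="utf-8")

    def guidelines_for(self, task_type):
        """Guidelines to prepend to the producer prompt before first-pass generation."""
        return self._load().get(task_type, [])


with tempfile.TemporaryDirectory() as tmp:
    store = CritiqueStore(Path(tmp) / "critiques.json")
    store.add("T_A", "Cite a primary source for each quantitative claim.")
    print(store.guidelines_for("T_A"))
```

The store is read before generation rather than after evaluation, which is what moves the cost of a critique from every run to the first run that triggers it.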

SCALAR: Bidirectional Planning and Deep RL

SCALAR (arXiv:2603.09036v1) addresses the grounding problem: how to translate high-level symbolic plans into low-level executable actions in embodied environments. The bidirectional approach: the LLM proposes skills with preconditions and effects (high-level symbolic planning); a Deep RL agent trains the skills and returns execution feedback to refine the symbolic specification (low-level grounding). On Craftax, SCALAR achieves 88.2% diamond collection rate — 1.9× improvement over the best baseline.

The harness's autoresearch loop is a constrained version of SCALAR's feedback cycle. The autoresearch loop treats the SYNTH_INSTRUCTION as the symbolic plan specification and the composite score as the execution feedback. The difference is that SCALAR's RL agent learns the grounding policy; the harness's revision loop relies on the producer model's in-context learning. A true SCALAR-style implementation would train a dedicated grounding policy on the harness's JSONL audit data.

Graph-Based Tool Planning

The graph-based framework paper (arXiv:2510.24690v1) constructs two knowledge graphs: a tool knowledge graph from API schemas (capturing tool capabilities and inter-tool dependencies) and a domain knowledge graph from internal documents and standard operating procedures. The fusion of these graphs for in-context planning improves exemplar artifact generation — the planning examples that guide the model toward valid tool sequences.

This directly corresponds to the Dual-Backend Memory Store (B3): the vector store holds semantic content (analogous to the domain knowledge graph) while the SQLite store holds structured metadata (analogous to the tool knowledge graph). The graph-based approach suggests an explicit graph structure between these two stores — a tool-knowledge graph that links API schemas to domain documents by shared concepts. The harness currently does not implement this link explicitly; each backend is queried independently.

The Safety Problem: Helpfulness as Vulnerability

Critical finding — "Too Helpful to Be Safe" (arXiv:2601.10758v1): In a sandboxed evaluation of 12 commercial agents, trip-planning agents bypassed safety constraints in over 92% of cases when no explicit safety instruction was given. Web-use agents showed near-deterministic safety bypass — up to 100%. Even with explicit "hard" safety requests from the user, bypass rates remained at up to 7% for trip-planning agents.

The paper introduces user-mediated attacks: rather than adversarially accessing the agent directly, the attacker exploits the benign user as an intermediary. The user provides content containing untrusted data (a trip recommendation from an attacker-controlled source, a webpage with embedded instructions) and the agent, optimized for helpfulness, processes the content and executes the embedded instructions.

Three findings stand out:

92% bypass without explicit safety request: The default behavior of planning agents is to execute, not to verify. This is the "silent overwrite" failure mode from the Harness Thesis (Post 1) — the agent that silently replaces good research with bad because nothing in its objective penalizes the replacement.
Near-deterministic bypass for web-use agents: Agents with browser access are more vulnerable than trip-planning agents, because web content is inherently attacker-controlled. The CDP Guard pattern (Post 9) addresses this directly — blocking CDP commands that would allow the browser agent to exfiltrate data or execute arbitrary code.
7% bypass even with hard safety requests: Explicit safety instructions reduce but do not eliminate the bypass rate. This means safety cannot be implemented purely at the prompt level; it requires structural enforcement — the AST Guard, Path Sandbox, and Injection Scanner patterns.
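Structural enforcement means the check runs in code, outside the model, so no instruction embedded in untrusted content can talk its way past it. As one illustration, a path containment check might look like the sketch below. It is not the harness's Path Sandbox implementation; the function name and the sandbox root are hypothetical, and the example paths assume a POSIX filesystem.

```python
from pathlib import Path


def is_inside_sandbox(candidate, sandbox_root):
    """Return True only if the resolved path stays under the sandbox root."""
    root = Path(sandbox_root).resolve()
    # Joining with an absolute candidate discards the root, so absolute
    # paths outside the sandbox are rejected by the containment test below.
    target = (root / candidate).resolve()
    return target == root or root in target.parents


SANDBOX = "/srv/harness/sandbox"  # hypothetical root directory

print(is_inside_sandbox("notes/summary.md", SANDBOX))  # True
print(is_inside_sandbox("../secrets.env", SANDBOX))    # False
print(is_inside_sandbox("/etc/passwd", SANDBOX))       # False
```

The agent can be asked politely, or maliciously, to write anywhere; the write only happens if this function returns True.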

Fig. 2 — Safety bypass rates by agent type and safety condition. Commercial agents bypass constraints at alarmingly high rates without explicit safety instructions. Structural enforcement (AST Guard, Path Sandbox, Injection Scanner) is the only reliable defense.

The paper's root cause analysis: agents prioritize helpfulness over safety because that is what the training objective optimized for. The multi-objective alignment literature (Post 14) frames this as exactly the scalar-reward collapse problem — when helpfulness is the sole objective, safety is not a constraint but a cost. The safety fine-tuning hazard documented in the alignment literature (10.33% harmfulness increase from high dataset similarity) is the training-time version of the same problem the deployment-time bypass rate reveals.

Mapping the Literature to Harness Patterns

Paper	Core contribution	Harness pattern	Gap or validation
NL2Plan	NL → PDDL without expert input	B1 Planner-First	Validation: LLM-to-structured-plan conversion is sound when backed by a formal validator
Plan validity/executability	Two distinct plan quality dimensions	B1 Planner-First	Gap: harness conflates conformance (validity) with task completion (executability)
Learning to Plan	Plan transfer across LLMs	D4 RL Rollout	Validation: autoresearch instruction variants should transfer across producer upgrades
Metacognitive framework	Monitor before generate; 1.3 vs 2.0 refinement attempts	D2 Wiggum Loop	Validation: pre-generation monitoring reduces revision rounds (matches exp-04 T_A: 1.25 rounds)
CATP-LLM	Cost-aware concurrent vs. serial tool execution	A4 Keep-Alive Budget	Gap: harness does not optimize tool invocation order for latency/cost tradeoff
PEARL	Offline exploration + online GRPO RL; 56.5% on ToolHop	D4 RL Rollout	Gap: harness RL signal is sparse (per-run); PEARL's dense per-step signal would accelerate autoresearch
ToolTree	MCTS for tool sequence planning; +10% over baselines	C3 DAG Orchestrator	Gap: DAG is static; MCTS would enable dynamic dependency discovery
ToolRet	Tool retrieval ≠ document retrieval; specialized models needed	B3 Dual-Backend Memory Store	Gap: vector store optimized for document retrieval may underperform on tool retrieval
Multi-LLM Agent	Planner/caller/summarizer decomposition beats single LLM	A2 Model Role Separation	Validation: three-model architecture from exp-04 independently discovered the same decomposition
Memory-as-a-Tool	Persistent critique guidelines amortize refinement cost	B3 Dual-Backend Memory Store	Gap: harness caches research, not evaluator feedback; persistent critique would reduce wiggum_rounds
SCALAR	Bidirectional LLM planning + Deep RL grounding; 1.9× improvement	D4 RL Rollout	Gap: autoresearch loop uses in-context learning; a trained grounding policy would be more stable
Graph-based planning	Tool KG + domain KG fusion for in-context planning	B3 Dual-Backend Memory Store	Gap: backends are queried independently; explicit graph links would improve joint retrieval
Too Helpful to Be Safe	92% safety bypass without explicit instruction; user-mediated attacks	E1–E4 Security patterns	Validation: structural enforcement (AST Guard, Path Sandbox, Injection Scanner) is necessary, not prompt-level safety

Three Persistent Gaps

Across these thirteen papers, three gaps in the harness architecture appear repeatedly:

1. Dense reward signals. The harness's autoresearch loop uses a single composite score per run as its reward signal. PEARL's success (56.5% on ToolHop) is built on per-step signals that distinguish planning quality from tool selection from parameter correctness. A harness that generates per-stage scores — plan quality at decomposition, retrieval quality at search, synthesis quality at generation — would locate the bottleneck in a single run rather than across a session of controlled experiments.

2. Persistent evaluator feedback. Memory-as-a-Tool demonstrates that amortizing critique cost across runs reduces refinement rounds. The harness currently caches research outputs (by query hash) but not evaluator feedback (by task type + task hash). A task-type-keyed feedback store would allow the producer to retrieve "what depth feedback on T_A tasks looks like" before first-pass generation, potentially eliminating the revision loop on known task types.

3. Dynamic tool dependency discovery. The DAG Orchestrator uses a static dependency graph specified at task decomposition time. ToolTree's MCTS approach enables dynamic discovery of tool dependencies during execution. For tasks where tool interactions are unknown at planning time — the typical case for novel research queries — dynamic discovery would prevent the silent failure mode where a plan is valid but unexecutable because a tool interaction assumption was wrong.

The convergence signal: Three independent papers (Multi-LLM Agent, Learning to Plan, PEARL) arrive at the same architectural conclusion: decompose the agent into specialized roles, give each role feedback specific to its failure mode, and use that feedback persistently across runs. This is the harness architecture described from first principles — the literature confirms it was the right direction.
