The Case Against Prompt Engineering
This week I watched two local language models argue about what year it is, because we had injected the date into the context window and one of them didn’t believe it. It was funny. It was also a clean demonstration of the thing this post is about: systems whose behavior pivots on incidental properties of their input text are systems we have been engineering with incantations.
I have spent a couple of months arguing that scaffolding beats model selection—that the quality of an agentic system lives in its harness, much of which is, at bottom, carefully engineered prompt text. So this post is written against interest. The rebuttals to prompt engineering are not strawmen, and the literature behind them is stronger than most practitioners realize. There are three, and they escalate: prompt engineering is brittle, it is a Rube Goldberg machine whose complexity outruns its understanding, and its artifacts go vestigial as foundation models advance. I’ll argue each with the receipts—forty-some annotated papers from five literature review runs, and where it hurts most, my own run logs.
Rebuttal One: It Was Always Brittle
The canon here predates the prompt-engineering job title. In 2021, “Calibrate Before Use” (arXiv:2102.09690) showed that GPT-3’s few-shot accuracy swings from near chance to near state-of-the-art depending on which examples you pick and what order you put them in—and diagnosed the mechanism as majority-label, recency, and common-token bias. Its companion result, “Fantastically Ordered Prompts” (arXiv:2104.08786), found that example ordering alone spans the difference between near state-of-the-art and random-guess performance. These are not edge cases; they are the baseline condition of the technique.
The natural rejoinder is that bigger, instruction-tuned models fixed this. The measurement literature says no. FormatSpread (arXiv:2310.11324)—the hub paper of this corpus, cited by nine of the others my pipeline annotated—shows that performance swings driven by formatting alone (separators, casing, spacing) persist through added few-shot examples and instruction tuning. POSIX (arXiv:2410.02185) finds that increasing parameter count does not reliably reduce sensitivity. The Order Effect (arXiv:2502.04134), in 2025, reports the problem unresolved in frontier APIs, with few-shot mitigation “mixed.” And it reaches the evaluation layer too: JudgeSense (arXiv:2604.23478) finds eight of nine LLM judges collapse into always-pick-A position bias under semantically equivalent paraphrases—and that model scale does not predict judging consistency.
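A minimal sketch of what reporting a spread instead of a single number can look like. The `format_spread` helper, the three format strings, and the per-example scores are hypothetical illustrations, not data from FormatSpread or from any run described here; the only idea taken from the text is to score one task under several semantically equivalent formats and report the range.

```python
from statistics import mean

def format_spread(scores_by_format):
    """Return (lowest, highest, spread) of mean accuracy across prompt formats."""
    accuracies = [mean(scores) for scores in scores_by_format.values()]
    low, high = min(accuracies), max(accuracies)
    return low, high, high - low

# Hypothetical per-example correctness (1 = correct) for the same four
# questions under three formats that differ only in separators and casing.
scores = {
    "Q: {q}\nA:": [1, 1, 0, 1],
    "Question: {q} Answer:": [1, 0, 0, 1],
    "QUESTION\t{q}\tANSWER": [0, 0, 0, 1],
}

low, high, spread = format_spread(scores)
print(f"accuracy ranges from {low:.2f} to {high:.2f} (spread {spread:.2f})")
```

Reporting the low, the high, and the spread next to any headline accuracy keeps the formatting dependence visible instead of averaging it away.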
Four years, three model generations, and the headline result is unchanged: the technique’s output depends on properties of the input that no one intended to be load-bearing. That is the definition of brittleness, and scale did not buy our way out of it.
Rebuttal Two: The Rube Goldberg Machine
If hand-written prompts are brittle, the obvious move is to optimize them automatically. That is where the second rebuttal lands: the optimizer makes the machine bigger, not sounder. TextReg (arXiv:2605.21318) names the failure mode prompt distributional overfitting: iterative prompt optimizers produce ever-longer prompts that accumulate narrow, sample-specific rules. Simply regularizing against that accumulation recovers up to +11.8% out-of-distribution accuracy over TextGrad. SESS (arXiv:2601.03493) shows the feedback signal itself is suspect: which evaluation subset the optimizer scores against is usually a random afterthought, and principled selection beats it across reasoning benchmarks. The gradient metaphor the field borrowed is, per one pointed title, a flawed one (arXiv:2512.13598).
Then there is the deployment-scale number. The LENS study (arXiv:2604.17650) measured natural prompt distribution shift across 192 post-deployment settings, 81 models, and 4.68 million prompts: moderate shifts in user behavior produce a 73% average performance loss. Every carefully tuned prompt is tuned against a snapshot of a distribution that is already moving.
I can corroborate from my own basement. The harness’s autoresearch loop—an automated optimizer pointed at the harness’s own synthesis instructions—ran 107 experiments and kept roughly 3% of them. The rest were noise, attractor lock, or baseline contamination: the optimizer rediscovering, then un-discovering, phrasings whose measured gains did not survive re-evaluation. That is what optimizing in a brittle search space looks like from the inside.
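One way to read "did not survive re-evaluation" as an explicit gate, sketched under assumptions: `keep_variant`, the `min_gain` threshold, and the scores below are hypothetical and are not the harness's actual acceptance rule. The point is only that a gain has to show up in an independent second evaluation before it is banked.

```python
from statistics import mean

def keep_variant(baseline, first_eval, re_eval, min_gain=0.1):
    """Keep a variant only if its mean gain over baseline holds in both evaluations."""
    base = mean(baseline)
    first_gain = mean(first_eval) - base
    second_gain = mean(re_eval) - base
    return first_gain >= min_gain and second_gain >= min_gain

# Hypothetical composite scores: the first variant's gain vanishes on re-run,
# the second variant's gain replicates.
baseline = [7.0, 7.0]
vanished = keep_variant(baseline, first_eval=[7.5, 7.5], re_eval=[7.0, 7.0])
replicated = keep_variant(baseline, first_eval=[7.5, 7.5], re_eval=[7.4, 7.6])
print(vanished, replicated)
```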
Rebuttal Three: Vestigial Artifacts
The deepest rebuttal is not that prompt engineering fails, but that its successes expire. The cleanest controlled evidence is the Prompting Inversion paper, titled with admirable directness “You Don’t Need Prompt Engineering Anymore” (arXiv:2510.22251). Its constrained prompting method, Sculpting, beats standard chain-of-thought on gpt-4o (97% vs. 93% on GSM8K)—and then loses to plain CoT on the next model generation (94.0% vs. 96.4%). The authors call it the Guardrail-to-Handcuff transition: constraints that prevented a mid-tier model’s errors induce hyper-literalism in a stronger one. The best practice did not merely stop helping. It started hurting, with no change to a single character of the prompt.
And consider what happened to prompt engineering’s flagship trick. “Think step by step” was the field’s founding demonstration that words could unlock latent capability. Reasoning models then internalized the technique—so thoroughly that there are now two entire surveys (“Stop Overthinking,” arXiv:2503.16419; “Reasoning Economy,” arXiv:2503.24377) cataloging methods to make models do less of it. ENTRA cuts reasoning traces 37–53% with no accuracy loss (arXiv:2601.07123); ARS cuts tokens 53% and energy 58% (arXiv:2510.00071); CROP cuts token consumption 80.6% (arXiv:2604.14214). The technique we engineered prompts to elicit is now a cost center we engineer systems to suppress. A best practice does not get more vestigial than that.
Two more cuts from the same review: a controlled data-distribution study argues CoT gains are pattern-matching that collapses out of distribution—“a brittle mirage” (arXiv:2508.01191)—and a 405-experiment study of legal reasoning found CoT actively degrades performance on some task types while the largest model tested performed worst (arXiv:2603.25944). Neither the technique nor the scaling intuition behind it transfers the way the best practices assumed.
My Own Logs Testify for the Prosecution
It would be convenient to treat this as other people’s problem. My runs.jsonl says otherwise. Three exhibits from the eval suite’s instruction-variant and model-swap runs:
The thinking toggle flips sign across tasks
Same model (qwen3-14b), same toggle, ten runs per cell: enabling extended thinking moved the mean composite +0.41 on the failure-modes task, −0.19 on context-engineering, and +0.14 on cost-management. “Thinking helps” is not a property of the model; it is a property of the (model, task) pair—exactly the task-dependence the legal-reasoning study found at publication scale.
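A sketch of how a per-task delta like this can be computed from a JSONL run log. The record schema (`model`, `task`, `thinking`, `composite`), the model name, and the sample values are assumptions made for illustration; they are not the actual layout or contents of runs.jsonl.

```python
import json
from collections import defaultdict
from statistics import mean

def thinking_deltas(lines, model):
    """Per task: mean composite with thinking on minus mean with thinking off."""
    cells = defaultdict(list)
    for line in lines:
        run = json.loads(line)
        if run["model"] == model:
            cells[(run["task"], run["thinking"])].append(run["composite"])
    tasks = sorted({task for task, _ in cells})
    return {
        task: mean(cells[(task, True)]) - mean(cells[(task, False)])
        for task in tasks
        # Only report tasks that have runs in both toggle states.
        if (task, True) in cells and (task, False) in cells
    }

# Hypothetical log lines, one JSON object per run.
sample = [
    '{"model": "model-x", "task": "a", "thinking": true, "composite": 8.0}',
    '{"model": "model-x", "task": "a", "thinking": true, "composite": 7.0}',
    '{"model": "model-x", "task": "a", "thinking": false, "composite": 7.0}',
    '{"model": "model-x", "task": "a", "thinking": false, "composite": 7.0}',
    '{"model": "model-x", "task": "b", "thinking": true, "composite": 6.0}',
    '{"model": "model-x", "task": "b", "thinking": false, "composite": 6.5}',
]

deltas = thinking_deltas(sample, "model-x")
print(deltas)
```

Keeping the result keyed by task, rather than collapsing it to one number per model, is what lets a sign flip show up at all.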
The promising instruction is a wash at suite level
The prose_depth synthesis instruction showed a +1.223 groundedness delta in a controlled single-task experiment: below the significance bar, but encouraging. Across the full suite its composite effect vs. baseline is +0.06, +0.04, and −0.07 on the three eval tasks. The gain that motivated the variant does not replicate as a general improvement: prompt distributional overfitting, observed at home.
Model swaps preserve the instructions and scramble the ranking
Under identical synthesis instructions on the cost-management task, the 14B producers matched or beat the 32–35B ones (qwen3-14b 7.97, qwen2.5-14b 7.80, pi-qwen-32b 7.76, qwen3.6-35b 7.31). The swap cells are small—treat the numbers as direction, not magnitude—but the direction is the rebuttal: nothing in the prompt stack guaranteed its value would carry across the producer pool, and it didn’t.
And one exhibit the pipeline produced this week without being asked. While assembling the citations for this post, my five-persona curation panel rejected “Fantastically Ordered Prompts,” one of the foundational papers of the field, which the order-sensitivity review’s own citation graph had ranked as its #2 gap candidate, cited by eleven papers in that corpus. The panel scored it 2.80, with the Academic Rigorist persona objecting that the annotation context was truncated. Its sibling, “Calibrate Before Use,” had sailed through the same gate with a score of 4.0 only minutes earlier. The quality gate evaluating the literature on prompt sensitivity exhibited prompt sensitivity. The defense rests; unfortunately, the defense is also the defendant.
What Survives the Rebuttals
Here is where I am obligated to answer my own thesis, because “scaffolding beats model selection” sounds like exactly the position these rebuttals demolish. I think the rebuttals win against prompt engineering as commonly practiced—and lose against a narrower claim that is worth stating precisely.
Look at what the brittleness literature actually built. The answer to example-order chaos was not a better prompt; it was contextual calibration—estimate the model’s bias with a content-free input and subtract it (arXiv:2102.09690), a line that runs through Batch Calibration (arXiv:2309.17249), an inference-time correction requiring no prompt text at all. The answer to format brittleness was measurement: FormatSpread, POSIX, sensitivity indices—report distributions over plausible formats, not a number from one. The answer to optimizer overfitting was regularization and principled evaluation-set selection. None of the durable assets are prompts. They are instruments and enforcement mechanisms wrapped around prompts.
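A minimal sketch of the calibration step, assuming the simple diagonal form: divide each label probability by the probability the model assigns that label on a content-free input, then renormalize. Dividing probabilities is the same as subtracting in log space, which is the "subtract it" above. The `calibrate` function name and all numbers are hypothetical; nothing here is output from a real model.

```python
def calibrate(label_probs, content_free_probs):
    """Rescale label probabilities by the content-free bias estimate and renormalize."""
    adjusted = [p / cf for p, cf in zip(label_probs, content_free_probs)]
    total = sum(adjusted)
    return [a / total for a in adjusted]

# Hypothetical two-label task, ordered [positive, negative].
# On a content-free input such as "N/A" the model already leans positive.
content_free = [0.7, 0.3]
# Raw prediction on a real input: positive wins before calibration.
raw = [0.6, 0.4]

calibrated = calibrate(raw, content_free)
print([round(p, 3) for p in calibrated])
```

With these made-up numbers the raw prediction favors the first label and the calibrated one favors the second, and no prompt text changed in between.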
That is the version of the harness thesis that survives: the harness is not the prompt library, it is the machinery that treats every prompt as a depreciating asset—the evaluation loops, the convergence detectors and ban lists that refuse to bank an unreplicated gain, the regression suite that re-prices the prompt stack every time a model in the pool changes. The Prompting Inversion paper’s own conclusion points here: prompting strategies must co-evolve with model capability. Static prompt artifacts cannot co-evolve. Measurement infrastructure is the thing that notices when they must.
The honest revision of the thesis: scaffolding beats model selection only while the scaffolding measures itself. Prompt text is the most perishable layer of the stack—closer to a cache entry than to source code—and a harness that cannot detect its own prompts going stale is just a Rube Goldberg machine with version control.
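The cache-entry framing can be sketched as an invalidation rule: a measured gain is only trusted for the model pool it was measured against. The `PromptGain` record, its fields, and the model and prompt ids are hypothetical illustrations, not the harness's data model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptGain:
    prompt_id: str
    measured_on: frozenset  # model ids in the pool when the gain was measured
    delta: float            # composite delta vs. baseline at measurement time

def stale_prompts(gains, current_pool):
    """Return ids of prompts whose gain was measured against a different model pool."""
    pool = frozenset(current_pool)
    return [g.prompt_id for g in gains if g.measured_on != pool]

# Hypothetical records: one gain measured on the current pool, one on an older pool.
gains = [
    PromptGain("synthesis-v2", frozenset({"model-a", "model-b"}), 0.06),
    PromptGain("synthesis-v1", frozenset({"model-a"}), 0.41),
]

needs_rerun = stale_prompts(gains, ["model-a", "model-b"])
print(needs_rerun)
```

Anything the check returns goes back through the regression suite before its recorded delta is quoted again.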
Open Questions
- What is a prompt’s depreciation schedule? The Inversion paper gives one data point (one technique, one model family, one generation gap). A harness with logged runs across model swaps could estimate the half-life of a prompt-stack’s measured gains empirically.
- Is there a regression test for the Guardrail-to-Handcuff transition—a canary task where constraint-induced hyper-literalism shows up before it degrades production output?
- How much of the harness’s measured value survives if every hand-written synthesis instruction is replaced with the model default? I have never run that ablation at full scale. The rebuttals say I should.
- If the durable layer is measurement, the sensitivity indices themselves are prompts-in-disguise (rubrics, judge templates). JudgeSense says the instruments are unstable too. What calibrates the calibrators?