Evaluation Uncertainty, Calibration, and Harness Reliability

April 29, 2026 • 17 min read

A point score from an LLM judge is not a measurement; it is a sample from a distribution. Conformal prediction converts that sample into a calibrated interval with a coverage guarantee. Fact-level confidence calibration reduces hallucinations without any retrieval system. On the DevAI benchmark, Agent-as-a-Judge outperforms LLM-as-a-Judge. And the externalization framework paper (arXiv:2604.08224v1) provides the theoretical account of why the harness architecture (memory, skills, protocols, harness engineering) is not an engineering preference but a cognitive necessity.

The Measurement Problem in Automated Evaluation

The harness evaluator produces a number. Experiment-01's evaluator (glm4:9b) produced 9.0 for every run. Experiment-03's evaluator (Qwen3-Coder:30b) produced a mean of 7.07 with a standard deviation of 0.55. Both numbers are point estimates. Neither tells you how confident the evaluator is in that estimate, or how wide the interval of plausible scores is.

This matters because the harness makes a binary decision based on the point estimate: PASS if score_final ≥ threshold, FAIL otherwise. A score of 7.9 against a threshold of 8.0 is a FAIL, but if the evaluator's uncertainty interval is [7.3, 8.5], then FAIL is only the modal outcome, not a confident verdict. The harness currently has no mechanism to express this uncertainty — it treats the point score as definitive.

The literature on evaluation uncertainty provides three complementary approaches to this problem: conformal prediction (convert point scores to statistically valid intervals), verbal uncertainty estimation (have the evaluator express low confidence directly), and Agent-as-a-Judge (replace the single-model evaluator with an agent that can reason about its own uncertainty).

Conformal Prediction for LLM Judges

Conformal prediction (arXiv:2509.18658v1) was introduced in Post 13 as part of the judge reliability cluster. Here the focus is on the statistical machinery and its practical implications for the harness threshold decision.

The standard conformal prediction framework works as follows. Given a calibration set of (input, true_score) pairs and a new input with model score ŝ, conformal prediction constructs a prediction interval [ŝ − q, ŝ + q] where q is chosen such that the interval contains the true score with probability at least 1 − α on future inputs. This guarantee is distribution-free: it needs no parametric assumptions about the model's score distribution, only that the calibration inputs and future inputs are exchangeable (for example, drawn from the same task mix). The guarantee is also marginal: it holds on average over inputs, not separately for each individual input.
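The construction of q can be sketched in a few lines. This is a minimal split-conformal sketch using absolute residuals as the nonconformity score, not the paper's exact procedure; the function names `conformal_halfwidth` and `prediction_interval` are hypothetical, and the calibration data is assumed to be exchangeable with future inputs.

```python
import math

def conformal_halfwidth(judge_scores, true_scores, alpha=0.1):
    """Return q so that [s - q, s + q] covers the true score with
    probability >= 1 - alpha (marginally, under exchangeability)."""
    if len(judge_scores) != len(true_scores) or not judge_scores:
        raise ValueError("calibration lists must be non-empty and equal length")
    # Nonconformity score: absolute error of the judge on each calibration item.
    residuals = sorted(abs(s - t) for s, t in zip(judge_scores, true_scores))
    n = len(residuals)
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest residual.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: only an infinite
        # half-width satisfies the guarantee.
        return math.inf
    return residuals[rank - 1]

def prediction_interval(score, q):
    """Symmetric interval around a new judge score."""
    return (score - q, score + q)
```

The infinite half-width case is worth noticing: with ten calibration pairs, no finite interval can be certified at α = 0.05, which is one concrete reason calibration-set size matters for the harness.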

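For discrete rating tasks (1–10 Likert scales), an ordinal boundary adjustment is required: the interval must align with discrete score boundaries rather than producing fractional endpoints. The paper proposes using the interval midpoint, the average of the lower and upper bounds, as a low-bias estimate: it is less biased than the raw point score because it smooths out evaluator overconfidence. Note that for a purely symmetric interval [ŝ − q, ŝ + q] the midpoint is ŝ itself, so the midpoint can only differ from the raw score when the interval is asymmetric, for example after it has been adjusted or clipped to the score boundaries.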

Harness application: Replace score_r1 ≥ threshold with interval_midpoint(score_r1) ≥ threshold and add an uncertainty gate: if interval_width(score_r1) > W_max, trigger the revision loop regardless of the point score. A score of 9.0 with interval [7.5, 10.0] (the upper bound clipped to the top of the 1–10 scale) is less reliable than a score of 8.2 with interval [7.8, 8.6]. The current harness cannot distinguish these two cases.
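A minimal sketch of that gate, assuming a symmetric half-width q has already been calibrated. The names `gate` and `Verdict`, the "REVISE" label, and the default W_max of 2.0 are illustrative assumptions, not values from the harness.

```python
from dataclasses import dataclass

SCALE_MIN, SCALE_MAX = 1.0, 10.0  # the 1-10 rating scale

@dataclass
class Verdict:
    decision: str    # "PASS", "FAIL", or "REVISE"
    midpoint: float
    width: float

def gate(score, q, threshold=8.0, w_max=2.0):
    """Decide on the interval midpoint, but force a revision when the
    interval is too wide to trust (w_max is an assumed setting)."""
    # Clip the interval to the rating scale before measuring it.
    lo = max(SCALE_MIN, score - q)
    hi = min(SCALE_MAX, score + q)
    midpoint = (lo + hi) / 2
    width = hi - lo
    if width > w_max:
        return Verdict("REVISE", midpoint, width)
    return Verdict("PASS" if midpoint >= threshold else "FAIL", midpoint, width)
```

With these defaults, a 9.0 with half-width 1.5 is routed to revision because its clipped interval is 2.5 wide, while an 8.2 with half-width 0.4 passes on its midpoint.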

The companion paper on VLM judges (arXiv:2604.25235v1) extends this analysis to vision-language models and identifies a failure mode called ranking-scoring decoupling: judges achieve high ranking correlation (correctly identifying which output is better) while producing wide, uninformative absolute intervals. This means you can trust a VLM judge to say "A is better than B" but not to say "A scores 7.5 out of 10." Interval width is driven by task difficulty and annotation quality — on clean, multi-annotator benchmarks, intervals are 4.5× narrower than on noisy single-annotator benchmarks.

The ranking-scoring decoupling has a direct harness implication. The Wiggum Loop's threshold decision uses absolute scores (score_r1 ≥ 8.0). If the evaluator is reliable for ranking but not for absolute scoring, the threshold decision is less reliable than the choice between competing revisions. The harness could exploit this asymmetry: keep absolute scores for the pass/fail decision, but use relative comparisons (which revision is better?) to select among multiple revision candidates.
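Selecting among revision candidates by relative comparison could look like the round-robin sketch below. The `prefer` callback stands in for a pairwise judge call and is hypothetical; the harness's actual evaluator interface is not specified here.

```python
from typing import Callable, Sequence

def select_best(candidates: Sequence[str],
                prefer: Callable[[str, str], bool]) -> str:
    """Return the candidate that wins the most pairwise comparisons.

    prefer(a, b) should return True when the judge ranks a above b.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    wins = [0] * len(candidates)
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            if prefer(candidates[i], candidates[j]):
                wins[i] += 1
            else:
                wins[j] += 1
    # max() returns the first maximal index, so ties go to the earliest candidate.
    best = max(range(len(candidates)), key=wins.__getitem__)
    return candidates[best]
```

A round-robin costs one judge call per pair, which is affordable for the handful of candidates a revision loop produces; no absolute score is consulted at any point.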

Verbal Uncertainty Estimation

The personalized judge paper (arXiv:2406.11657v1) takes a different approach: instead of post-hoc interval construction, it has the evaluator express uncertainty verbally during evaluation. When the judge's confidence is below a threshold, it returns "uncertain" rather than a score; downstream decisions treat "uncertain" outputs as requiring human review. On binary evaluation tasks with this filter applied, agreement with ground truth exceeds 80% on high-certainty samples — and matches or surpasses third-party human performance.

The practical version of this for the harness: if the evaluator's score falls in a narrow band around the threshold (e.g., [7.5, 8.5] for a threshold of 8.0), add a verbal uncertainty probe — a second prompt asking "How confident are you in this score, and what would need to change to push it above/below the threshold?" This uncertainty probe can surface the specific dimensional gap that the point score obscures.
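A minimal sketch of the borderline check and the probe prompt. The function names are hypothetical, the prompt wording follows the question quoted above, and the ±0.5 band matches the example threshold of 8.0.

```python
def needs_probe(score: float, threshold: float = 8.0, band: float = 0.5) -> bool:
    """True when the score sits inside the borderline band around the threshold."""
    return abs(score - threshold) <= band

def build_probe(score: float, threshold: float = 8.0) -> str:
    """Second-pass prompt asking the evaluator to explain its confidence."""
    # Ask about the direction that would flip the current verdict.
    direction = "above" if score < threshold else "below"
    return (
        f"You scored this output {score:.1f} against a threshold of {threshold:.1f}. "
        "How confident are you in this score, and what would need to change "
        f"to push it {direction} the threshold?"
    )
```

Only borderline scores pay for the second evaluator call; a 9.0 or a 6.0 skips the probe entirely.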

Agent-as-a-Judge: Evaluating Agents with Agents

Agent-as-a-Judge (arXiv:2410.10934v2) extends LLM-as-a-Judge by replacing the single-model evaluator with an agentic system that can reason about intermediate steps, access tools, and produce structured feedback. The DevAI benchmark covers 55 realistic automated AI development tasks with 365 hierarchical user requirements. On this benchmark, Agent-as-a-Judge dramatically outperforms LLM-as-a-Judge and achieves reliability comparable to human evaluation.

The failure mode that motivates Agent-as-a-Judge is the same one that experiment-03 exposed: a single-model evaluator cannot reliably assess multi-step agentic outputs because it lacks access to the intermediate state. When the harness's Qwen3-Coder:30b evaluator scored T_A runs, it could only see the final synthesis output — not the retrieval quality, the query plan, or the research cache state at each stage. Its dimensional scores for depth and specificity were inferred from the output text alone.

The Agent-as-a-Judge proposal for the harness: Instead of a single evaluator model that sees the final synthesis, a judging agent that can inspect the full JSONL audit trail — retrieval results, planner output, intermediate compression steps — would score with access to causal evidence rather than just the output. Depth_r1 could then reflect actual retrieval breadth, not just the density of hedging language in the synthesis.
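As a rough illustration of scoring from causal evidence, the sketch below reads an audit trail and counts distinct retrieved sources. The event schema (`stage`, `results`, `source`) is an assumption made for the example; the harness's real JSONL fields may differ.

```python
import json

def retrieval_evidence(jsonl_text: str) -> dict:
    """Summarize an audit trail: which stages ran, how many distinct sources."""
    sources = set()
    stages = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        event = json.loads(line)
        stages.append(event.get("stage", "unknown"))
        if event.get("stage") == "retrieval":
            for doc in event.get("results", []):
                source = doc.get("source")
                if source is not None:
                    sources.add(source)
    return {"stages_seen": stages, "distinct_sources": len(sources)}
```

A judging agent could pass this summary to the evaluator alongside the synthesis, so a depth score can be checked against how many sources were actually retrieved.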

Fact-Level Calibration and Self-Correction

The fact-level calibration paper (arXiv:2411.13343v1) addresses a different aspect of evaluation quality: the calibration of the producer's self-assessed confidence. Current confidence calibration methods assign a single scalar to the entire response. This scalar fails to capture partial correctness in long-form generation — a response that is correct on four of five enumerated items is not well-characterized by a single confidence value.

The Fact-Level Calibration framework calibrates confidence to relevance-weighted correctness at the individual atomic fact level. The associated method, Confidence-Guided Fact-level Self-Correction (ConFix), uses these fact-level confidence signals to selectively revise low-confidence facts without external retrieval. Evaluated across four datasets and six models, ConFix demonstrably reduces hallucinations.

The harness connection: the dimensional rubric in the Wiggum Loop already approximates fact-level calibration at a coarser granularity — scoring six dimensions separately rather than one composite. ConFix suggests going further: score individual claims within each dimension, identify which claims are low-confidence, and target revision at those claims specifically. This would transform the revision prompt from "improve depth and specificity" to "revise claim X (low confidence) and claim Y (low confidence) while preserving claims A, B, C (high confidence)."
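The claim-targeted revision prompt could be assembled as follows. This sketch assumes claim extraction and per-claim confidence scoring have already happened upstream; the 0.6 cutoff and the REVISE/PRESERVE labels are illustrative choices, not part of ConFix.

```python
from typing import List, Optional, Tuple

def build_revision_prompt(claims: List[Tuple[str, float]],
                          min_confidence: float = 0.6) -> Optional[str]:
    """Build a revision prompt that targets only low-confidence claims.

    claims: (claim_text, confidence in [0, 1]) pairs.
    Returns None when every claim meets the cutoff.
    """
    revise = [text for text, conf in claims if conf < min_confidence]
    keep = [text for text, conf in claims if conf >= min_confidence]
    if not revise:
        return None
    lines = ["Revise only the low-confidence claims listed below; "
             "leave preserved claims unchanged."]
    lines += [f"REVISE: {text}" for text in revise]
    lines += [f"PRESERVE: {text}" for text in keep]
    return "\n".join(lines)
```

Returning None when nothing falls below the cutoff lets the caller skip the revision round instead of issuing an empty prompt.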

Bayesian Prompt Ensembles for Multimodal Calibration

The MMB paper (arXiv:2509.08777v1) applies a Bayesian prompt ensemble approach to calibrate multimodal evaluators. Standard prompt ensembling (average scores across multiple prompt variants) fails for multimodal evaluation because the judge's score distribution shifts with image domain. MMB assigns prompt weights dynamically based on image clustering — different prompt weights for different visual contexts — achieving superior calibration on HPSv2 and MJBench.

For text-only harnesses, the multimodal complexity reduces but the core insight applies: a single evaluation prompt may be systematically miscalibrated for certain task types. The autoresearch loop discovered this empirically — T_B tasks consistently scored lower despite comparable output quality, suggesting the evaluation prompt weights the best_practices task type differently. A task-type-conditioned prompt ensemble would separate this confound from genuine quality differences.
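A task-type-conditioned ensemble can be sketched as a weighted average whose weights depend on the task type. This is a fixed-weight simplification, not MMB's Bayesian weighting; the prompt variant names ("strict", "lenient") and every weight below are made up for illustration.

```python
from typing import Dict

# Hypothetical per-task-type prompt weights; real values would be fit on
# calibration data rather than chosen by hand.
WEIGHTS_BY_TASK_TYPE: Dict[str, Dict[str, float]] = {
    "best_practices": {"strict": 0.25, "lenient": 0.75},
    "default": {"strict": 0.5, "lenient": 0.5},
}

def ensemble_score(scores_by_prompt: Dict[str, float],
                   weights: Dict[str, float]) -> float:
    """Weighted average of the scores produced by each prompt variant."""
    total = sum(weights[name] for name in scores_by_prompt)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * s for name, s in scores_by_prompt.items()) / total

def score_for_task(task_type: str, scores_by_prompt: Dict[str, float]) -> float:
    """Pick the weight profile for the task type, falling back to the default."""
    weights = WEIGHTS_BY_TASK_TYPE.get(task_type, WEIGHTS_BY_TASK_TYPE["default"])
    return ensemble_score(scores_by_prompt, weights)
```

If the same output scores differently under the two weight profiles, the gap measures how much of the score comes from the prompt rather than from the output.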

Uncertainty in Multimodal Models

Uncertainty-o (arXiv:2506.07575v1) introduces a model-agnostic framework for estimating uncertainty across diverse large multimodal models regardless of architecture, modality, or capability. Evaluated across 18 benchmarks and 10 models (open- and closed-source), the framework provides reliable uncertainty estimates for hallucination detection, hallucination mitigation, and uncertainty-aware Chain-of-Thought reasoning.

The three design questions Uncertainty-o addresses translate directly to the harness context: (1) How do you unify uncertainty evaluation across different models in the portfolio? (2) How do you prompt models to reveal uncertainty? (3) How do you quantify uncertainty for downstream decisions? The harness currently answers none of these questions — it uses a single point score and a binary threshold. Uncertainty-o's framework provides a path to all three.

Fig. 1 — Current harness score decision (left) vs. conformal prediction interval decision (right). Point scores near the threshold are unreliable; interval midpoints and interval width together provide a more principled PASS/FAIL signal.

Inference Engine Reliability: Where the Bugs Are

The LLM inference engine bugs paper (arXiv:2506.09713v2) provides an empirical analysis of bugs across five widely-used LLM inference engines based on real-world data. The four most common bug classes, in order of prevalence: memory leaks, out-of-memory errors, incorrect tensor shapes or sizes, and performance degradation from suboptimal configuration settings.

These bug classes map directly to the failure modes in the harness's hardware substrate. Memory leaks in the inference engine accumulate over long sessions; the Keep-Alive Budget (A4) guards against this by cycling model processes after a configured number of calls. Out-of-memory errors are the crash signature of KV cache overflow; the Surgical Compressor (C2) guards against this by fitting documents to the context budget before the inference call. Incorrect tensor shapes appear as silent numerical errors or NaN outputs, which the harness's output validation step catches as format conformance failures. Performance degradation from configuration settings is the least visible failure mode: the model responds, but at 3–5× normal latency, which compounds in the Wiggum revision loop.

The practical implication: The harness's Perfetto trace exporter (from Post 10) is the right tool for diagnosing configuration-induced performance degradation — it visualizes latency at the stage level, making it possible to distinguish a slow model from a slow inference engine from a slow network call. Without stage-level tracing, configuration-induced degradation looks identical to model-induced degradation.
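Once per-stage durations are available, flagging the degraded stage is a simple ratio check. The sketch below assumes durations have already been extracted into plain dictionaries (it does not read Perfetto traces), and the stage names and the 3× factor are illustrative.

```python
from typing import Dict, List, Tuple

def slow_stages(observed_ms: Dict[str, float],
                baseline_ms: Dict[str, float],
                factor: float = 3.0) -> List[Tuple[str, float]]:
    """Return (stage, slowdown_ratio) for stages at or above `factor` times
    their baseline latency, worst first."""
    flagged = []
    for stage, ms in observed_ms.items():
        base = baseline_ms.get(stage)
        if base and ms / base >= factor:  # skip stages with no usable baseline
            flagged.append((stage, round(ms / base, 2)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

If only the inference stage is flagged while retrieval and network stay near baseline, the slowdown points at the engine or its configuration rather than at the rest of the pipeline.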

The Externalization Framework: Why the Harness Is the Architecture

The most significant paper in this cluster is "Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering" (arXiv:2604.08224v1). It provides the theoretical framework that explains why the harness series arrived at its architecture — not from engineering preference, but from a cognitive necessity.

The paper's central argument: agent capabilities are increasingly externalized to transform hard cognitive burdens into forms that models can solve more reliably. This externalization proceeds in a historical sequence: from weights (what the model knows parametrically) to context (what the model is given at inference time) to harness (the external infrastructure that manages what gets put in context, what gets stored, what gets routed, and what gets evaluated). The harness is not an add-on to the model — it is the primary locus of agent capability.

Fig. 2 — The externalization progression from weights to context to harness. Each layer externalizes cognitive burdens that the previous layer handles inefficiently or unreliably.

The four-category taxonomy the paper offers — memory, skills, protocols, harness engineering — maps onto the harness architecture exactly:

Externalization category	Harness implementation	Posts
Memory stores	Dual-Backend Memory Store (B3): vector store + SQLite; research cache; JSONL audit log	Posts 5, 10
Reusable skills	Skill Registry (C5): slash-command handlers that extend the agent's capability envelope	Post 8
Interaction protocols	Wiggum Loop (D2): the dimensional rubric evaluation protocol; count_check; task_type_routing	Posts 6, 7
Harness engineering	Inference Shim (A1), Keep-Alive Budget (A4), DAG Orchestrator (C3), MCP Dispatch Router (C4)	Posts 4, 8

The paper identifies emerging directions: self-evolving harnesses and shared agent infrastructure. The autoresearch loop is the harness's implementation of self-evolution — the pipeline modifies its own synthesis instruction based on run outcomes. The MCP Dispatch Router and Skill Registry are the harness's implementation of shared infrastructure — standardized protocols for extending agent capability without retraining.

The paper also identifies the open challenges: evaluation and governance of externalized systems. This is precisely the problem that Post 13 (Judge Reliability), Post 17 (this post), and the autoresearch loop together address. A self-evolving harness requires trustworthy evaluation of its own output to avoid optimizing for evaluator artifacts rather than true quality — the same evaluator calibration problem that experiments 01 and 02 exposed.

The thesis validated: The externalization framework paper was published in April 2026. The harness architecture described in this series was built from 2025 empirical data. The convergence is not coincidental — both the paper and the harness arrived at the same conclusion from different starting points: practical agent progress depends on external cognitive infrastructure, not stronger models. The four experiments in Post 12 are the empirical evidence. The externalization framework is the theoretical explanation.

Five Design Implications

#	Implication	Source	Current harness gap
1	Wrap evaluator point scores in conformal prediction intervals; use interval midpoint for threshold decisions and interval width as a reliability gate	arXiv:2509.18658v1	Current: raw point score vs. threshold; no uncertainty quantification
2	Add verbal uncertainty probe when score falls within ±0.5 of threshold; surface the specific claim that is driving low confidence	arXiv:2406.11657v1	Current: binary PASS/FAIL on point score; no borderline handling
3	Extend revision prompt from dimensional feedback to claim-level feedback using ConFix-style fact-level confidence signals	arXiv:2411.13343v1	Current: revision prompt targets dimensions; does not identify specific low-confidence claims
4	Instrument the inference engine with stage-level latency tracing; distinguish model latency from engine overhead from configuration degradation	arXiv:2506.09713v2	Current: Perfetto tracing is implemented but not systematically monitored for engine overhead patterns
5	Frame harness development as externalization engineering: each new pattern externalizes a cognitive burden previously handled inside the model's context window	arXiv:2604.08224v1	This is the architectural frame the series has been building toward — now grounded in the literature

What the Literature Leaves Open

Several questions raised by this body of research remain unresolved — and bear directly on how the harness should evolve:

What calibration-set size and composition does a harness need to produce statistically valid conformal prediction intervals across heterogeneous task types — does a single calibration pool generalize, or must each task type maintain its own?
At what task complexity does Agent-as-a-Judge begin to outperform LLM-as-a-Judge in a harness pipeline, and does that threshold shift with the number of Wiggum revision rounds?
How does ranking-scoring decoupling behave within a six-dimensional rubric — are some dimensions (Groundedness, Specificity) more vulnerable to wide intervals than others (Relevance, Structure)?
When the evaluator pool introduces variance across pool members, how does the harness distinguish genuine evaluator disagreement from genuine output ambiguity, and which of the two should trigger a revision and which a recalibration?
What is the right recalibration cadence for an evaluator pool operating on a self-improving harness, where the production score distribution shifts as the flywheel improves the producer?
