May 23, 2026 • 16 min read

Judge Benchmarks and Test-Time Scaling: Where LLM Judges Succeed and Where They Don’t

Post 13 established that the best judges lag human agreement by 5 points and show systematic position bias. This post goes further: six papers that benchmark judges across test-time scaling scenarios, multimodal tasks, instruction following, personalization, and mathematical reasoning — and find failure modes that position bias alone doesn’t explain.

Series context. The harness Wiggum evaluator (Post 6) uses a dimensional rubric scored by a judge model. Posts 13 and 17 covered judge reliability and calibration. This post adds the test-time scaling dimension: when judges are used not just for evaluation but as active signals during generation — for reranking candidates, guiding beam search, or critiquing drafts — the failure modes are qualitatively different, and so are the design implications.

Test-Time Scaling and the Judge Gap

Test-time compute scaling — spending more inference budget to improve output quality — has become a major axis of LLM capability improvement. The standard mechanisms are response reranking (generate N candidates, pick the best), step-level beam search (guide generation step by step using a reward signal), and critique-based refinement (generate, critique, revise). Each requires a reliable evaluator signal at inference time.

Reward models trained for this purpose are well understood. LLM judges — which generate natural language evaluations — are more flexible and cheaper to deploy, and their use as test-time scaling evaluators has grown alongside their use in offline evaluation. The question JETTS asks is whether that flexibility comes at a cost.

1. JETTS: The Performance Profile Depends on the Scaling Method

JETTS (Judge Evaluation for Test-Time Scaling, arXiv:2504.15253v2) is the most rigorous comparative study of judges versus reward models in test-time scenarios to date. The benchmark evaluates ten judge models (7B–70B parameters) against eight base generator models (6.7B–72B parameters) across three domains: math reasoning, code generation, and instruction following. Three scaling settings are tested independently:

Response reranking: generate N responses, use judge score to select the best one
Step-level beam search: use process-level signal at each generation step to guide a beam search
Critique-based refinement: generate a critique of an initial response, then generate a revision based on the critique
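
To make the first setting concrete, here is a minimal sketch of best-of-N reranking. It is an illustration under stated assumptions, not the JETTS implementation: `rerank_best_of_n`, `judge_score`, and `toy_judge` are hypothetical names, and the toy scorer stands in for a real judge call.

```python
from typing import Callable, Sequence


def rerank_best_of_n(
    prompt: str,
    candidates: Sequence[str],
    judge_score: Callable[[str, str], float],
) -> str:
    """Return the candidate the judge scores highest (ties go to the earliest)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: judge_score(prompt, c))


# Hypothetical stand-in for a judge model: rewards answers that contain "4".
def toy_judge(prompt: str, response: str) -> float:
    return 1.0 if "4" in response else 0.0


best = rerank_best_of_n("What is 2 + 2?", ["5", "It is 4.", "22"], toy_judge)
print(best)
```

The judge only ever sees complete responses here, which is why this setting is the closest match to how judges are normally used offline.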

JETTS performance profile: judge effectiveness relative to reward models varies sharply by test-time scaling method. Reranking: competitive. Beam search: dominated by PRMs. Refinement: currently ineffective.

The results break cleanly by setting:

Response reranking: judges are competitive with outcome reward models. The task is well-suited to judges because it requires holistic comparison across complete responses — which is what LLM judges were trained to do.
Step-level beam search: judges are consistently outperformed by process reward models (PRMs). Beam search requires granular step-level signals that a PRM can provide but a generative judge cannot reliably produce. The mismatch is architectural, not a matter of scale.
Critique-based refinement: judge-generated critiques are currently ineffective at improving responses. The natural language critiques don’t provide actionable enough signal for the generator to revise meaningfully.
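
The beam search setting is easier to reason about with the loop written out. The sketch below is a generic step-level beam search, not code from the paper; `expand` and `step_score` are hypothetical callables, and the toy problem (pick three numbers that sum to a target) only stands in for step-by-step generation. The point is that `step_score` is called on every partial path, so any noise in the step-level signal compounds across steps.

```python
from typing import Callable, List, Sequence, Tuple

Path = Tuple[int, ...]


def step_beam_search(
    expand: Callable[[Path], Sequence[int]],
    step_score: Callable[[Path], float],
    beam_width: int,
    max_steps: int,
) -> Path:
    """Keep the top `beam_width` partial paths after every step."""
    beams: List[Path] = [()]
    for _ in range(max_steps):
        candidates = [path + (step,) for path in beams for step in expand(path)]
        if not candidates:
            break
        # The step-level evaluator (a PRM or a judge) ranks partial paths.
        candidates.sort(key=step_score, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


TARGET = 7  # toy objective: three steps whose values sum to 7

result = step_beam_search(
    expand=lambda path: [1, 2, 3],
    step_score=lambda path: -abs(TARGET - sum(path)),
    beam_width=2,
    max_steps=3,
)
print(result, sum(result))
```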

Harness implication. The Wiggum evaluator issues holistic dimensional scores and dimensional revision feedback. This maps closest to the reranking setting, the one where judges are competitive. The harness does not use step-level beam search: it operates on whole documents, not on individual generation steps. The JETTS finding validates that design choice: the judge role the harness assigns is the role where judges are strongest.

2. Omni-MATH-2: Benchmark Saturation as a Judge Failure Mode

The conventional explanation for benchmark saturation is that models have memorized training data or exhausted the benchmark’s discriminative range. Omni-MATH-2 (arXiv:2601.19532v1) offers a different explanation: saturation may be an artifact of judge failure, not model capability.

The work audits the Omni-MATH dataset for evaluation quality: each problem was checked for LaTeX compilability, solvability, verifiability, and formatting consistency. The result is two subsets — a clean exact-answer subset (n=4,181) and a tagged non-standard subset (n=247) containing proofs, images, and answers requiring non-standard formats.

Comparing Omni-Judge (the original benchmark judge) against expert human annotations on disagreement cases produces the headline finding:

Omni-Judge is wrong in 96.4% of cases where it disagrees with expert annotations: when the judge and the expert disagree, the expert is almost always right. The judge is not just occasionally unreliable; it is systematically wrong on the cases that matter most, the hard problems near the frontier of model capability, which are precisely the cases that determine whether benchmark saturation has occurred.

The implication is pointed: current judges cannot differentiate between model abilities even before benchmarks saturate. A model appearing to hit a ceiling on Omni-MATH may just be hitting a judge quality ceiling. The paper argues for treating judge correctness as an experimental condition that must be measured and reported alongside benchmark scores.
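
One way to act on that recommendation is to report judge accuracy on an expert-labeled audit sample next to every benchmark score. The sketch below is an assumption about how that could look, not the paper's procedure: `judge_accuracy_with_interval` is a hypothetical helper, the verdict lists are made-up toy data, and the interval is a standard Wilson score interval.

```python
import math
from typing import Sequence, Tuple


def judge_accuracy_with_interval(
    judge_verdicts: Sequence[bool],
    expert_verdicts: Sequence[bool],
    z: float = 1.96,  # about 95% coverage
) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) with a Wilson score interval."""
    n = len(judge_verdicts)
    if n == 0 or n != len(expert_verdicts):
        raise ValueError("verdict lists must be non-empty and the same length")
    correct = sum(j == e for j, e in zip(judge_verdicts, expert_verdicts))
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, centre - half, centre + half


# Toy audit sample (illustrative values only).
judge = [True, True, False, True, False, True, True, True]
expert = [True, False, False, True, True, True, True, True]

acc, lo, hi = judge_accuracy_with_interval(judge, expert)
print(f"judge accuracy {acc:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

With only a handful of audited items the interval is wide, which is itself useful information: a benchmark score reported with an unaudited or barely audited judge carries that uncertainty whether or not it is printed.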

For the harness, this finding extends the Post 13 warning about judge reliability into a new domain: when the harness uses Wiggum scores as a proxy for output quality over time, score plateaus may reflect judge saturation rather than pipeline optimization having converged. The Data Flywheel should track the conditions under which Wiggum scores stop improving — and distinguish judge ceiling from pipeline ceiling.

3. M-JudgeBench: Ten Dimensions, MCTS-Generated Training Data

Multimodal judge benchmarks lag behind their text-only counterparts. M-JudgeBench (arXiv:2603.00546v1) addresses this with a ten-dimensional capability-oriented benchmark for multimodal LLM judges, covering:

Visual reasoning accuracy
Cross-modal consistency
Spatial relationship judgment
Temporal sequence evaluation
Fine-grained attribute assessment
Counterfactual reasoning
Instruction adherence
Factual grounding
Style and format judgment
Compositional scene understanding

Systematic evaluation across existing MLLM judges reveals consistent weaknesses across multiple dimensions — particularly in cross-modal consistency and counterfactual reasoning. Existing judges that perform well on unimodal text tasks show significant degradation when visual content is introduced.

The paper’s second contribution is Judge-MCTS, a data generation framework that uses Monte Carlo Tree Search to generate pairwise reasoning trajectories for training judge models. MCTS explores the judgment reasoning tree explicitly, generating diverse high-quality preference pairs for training. The resulting M-Judger models, trained on these trajectories, outperform existing MLLM-as-a-judge systems on both M-JudgeBench and prior benchmarks.

Why MCTS for judge training. Standard preference data generation relies on human annotators or model outputs that happen to diverge. MCTS generates trajectories by systematically exploring branches of the judgment reasoning tree, producing pairs where the branching point is known. This gives the training signal explicit coverage of the reasoning patterns that distinguish good judges from bad ones — rather than sampling coverage of the output distribution.

4. IF-RewardBench: Instruction-Following Meta-Evaluation

Instruction-following evaluation is a common use case for judge models: given a user instruction and a model response, did the model follow the instruction? The standard approach is pairwise comparison against a reference response, but it has two known weaknesses: pairwise evaluation conflates response quality with instruction adherence, and benchmarks that simplify away multi-constraint instructions do not test the hard cases.

IF-RewardBench (arXiv:2603.04738v2) fixes both problems. The benchmark covers diverse instruction and constraint types, including:

Multi-constraint instructions (follow constraints A and B simultaneously)
Negation constraints (do X but not Y)
Format specifications (respond in exactly N words)
Conditional constraints (if X is true, do Y; otherwise do Z)
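
To see why multi-constraint instructions are harder to judge than a single overall preference, it helps to write the constraints down as separate checks. The sketch below is illustrative only and is not how IF-RewardBench is built: the checker functions are hypothetical, and they cover only constraints that can be verified mechanically, whereas most real constraints need a judge model.

```python
from typing import Callable, Dict

Check = Callable[[str], bool]


def exactly_n_words(n: int) -> Check:
    """Format constraint: the response has exactly n whitespace-separated words."""
    return lambda text: len(text.split()) == n


def must_mention(term: str) -> Check:
    """Inclusion constraint: the response mentions the term (case-insensitive)."""
    return lambda text: term.lower() in text.lower()


def must_not_mention(term: str) -> Check:
    """Negation constraint: the response never mentions the term."""
    return lambda text: term.lower() not in text.lower()


def evaluate(response: str, checks: Dict[str, Check]) -> Dict[str, bool]:
    """Return one verdict per constraint instead of a single overall score."""
    return {name: check(response) for name, check in checks.items()}


checks = {
    "format: exactly 5 words": exactly_n_words(5),
    "include: mentions Paris": must_mention("Paris"),
    "negation: no mention of London": must_not_mention("London"),
}

good = evaluate("Paris is lovely in spring", checks)
bad = evaluate("Paris beats London", checks)
print(all(good.values()), bad)
```

A response that reads well can still fail one constraint, so a judge that produces a single holistic preference has no way to surface the per-constraint verdicts shown here.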

Evaluation on IF-RewardBench reveals significant deficiencies in current judge models on these dimensions. More importantly, IF-RewardBench achieves a stronger positive correlation with downstream task performance than existing benchmarks — meaning that performance on IF-RewardBench predicts real-world judge usefulness better than the benchmarks practitioners currently use to select judges.

Selection implication. The harness Dimensional Rubric (Post 7) specifies multi-constraint evaluation criteria across six dimensions. IF-RewardBench’s finding suggests that judge selection should prioritize performance on multi-constraint instruction-following benchmarks over single-score aggregate benchmarks when the judge will be used for rubric-graded evaluation.

5. Multimodal Judge Model: Text, Audio, Image, Video

Most judge benchmarks are monomodal. The Multimodal Judge Model (arXiv:2601.06106v1) proposes a framework for reliable evaluation across all four major modalities simultaneously — text, audio, image, and video — with diagnostic feedback rather than simple scores.

The design choices address a concrete problem in multimodal evaluation: train-test leakage. When public datasets are used for evaluation, models trained on those datasets may have seen the evaluation data. The framework uses carefully sampled public datasets with fixed seeds to ensure reproducibility and minimize leakage, making evaluation scores comparable across model versions and labs.
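
The fixed-seed part of that design is simple to reproduce. The sketch below shows one plausible way to draw a repeatable evaluation subset; `sample_eval_subset` and the item IDs are hypothetical, and the paper's actual sampling code may differ. Note that a fixed seed gives reproducibility, not leakage protection on its own; that depends on which datasets are sampled.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_eval_subset(items: Sequence[T], k: int, seed: int) -> List[T]:
    """Draw k items without replacement using a private, seeded RNG."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(list(items), k)


dataset = [f"item-{i:03d}" for i in range(100)]  # placeholder IDs

run_a = sample_eval_subset(dataset, k=10, seed=1234)
run_b = sample_eval_subset(dataset, k=10, seed=1234)
print(run_a == run_b)
```

The same seed and the same input order yield the same subset within a given Python version; publishing the sampled IDs alongside the seed is a safer guarantee across versions.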

The framework aggregates judgments across modalities, analyzes output quality and reasoning consistency, and generates diagnostic feedback identifying which modality or reasoning step caused a failure. Results show strong alignment between the Judge Model’s assessments and human scores.

The audio modality coverage is notable: most multimodal evaluation frameworks cover image and video but treat audio as secondary. Including audio in judge evaluation matters for agentic systems that process spoken inputs, transcripts, or audio-grounded documents.

6. Personalized Judge: Verbal Uncertainty Estimation

Most judge evaluation assumes a single ground-truth preference. Personalized Judge (arXiv:2406.11657v1) relaxes this assumption: what if user preferences legitimately vary, and the judge’s task is to evaluate whether a response matches a specific user’s values and style?

Direct application of LLMs to personalized evaluation fails: models asked to judge whether a response matches a user persona show low agreement with human ground truth, below what you’d want for a deployable system. The paper’s intervention is verbal uncertainty estimation: rather than forcing the judge to produce a binary verdict on every example, the judge is allowed to express low confidence on uncertain cases.

Results. By incorporating verbal uncertainty estimation, the method achieves agreement above 80% on high-certainty samples for binary personalization tasks. Human evaluations show the method matches or surpasses third-party human performance on these samples. The catch: the approach scales by routing uncertain cases to human review, so the share of cases the judge marks as uncertain sets the human annotation workload.
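
The trade-off between agreement and annotation workload can be measured with two numbers. The sketch below is a minimal illustration under assumptions, not the paper's evaluation code: the `Judgment` record, the `certain` flag (the judge's verbalized confidence reduced to a boolean), and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Judgment:
    verdict: bool  # judge's binary decision
    certain: bool  # True if the judge verbalized high certainty
    human: bool    # human ground-truth label


def selective_agreement(judgments: Sequence[Judgment]) -> Tuple[float, float]:
    """Return (agreement on high-certainty samples, fraction deferred to humans)."""
    if not judgments:
        raise ValueError("need at least one judgment")
    kept = [j for j in judgments if j.certain]
    deferred = (len(judgments) - len(kept)) / len(judgments)
    if not kept:
        return float("nan"), deferred
    agreement = sum(j.verdict == j.human for j in kept) / len(kept)
    return agreement, deferred


# Toy data: four high-certainty judgments (three correct), one deferred.
sample = [
    Judgment(verdict=True, certain=True, human=True),
    Judgment(verdict=False, certain=True, human=False),
    Judgment(verdict=True, certain=True, human=False),
    Judgment(verdict=True, certain=True, human=True),
    Judgment(verdict=False, certain=False, human=True),
]

agreement, deferred = selective_agreement(sample)
print(agreement, deferred)
```

Reporting both numbers together matters: a high agreement figure means little if most of the samples were deferred to reach it.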

The uncertainty estimation insight connects directly to Post 17’s conformal prediction framework: conformal prediction provides statistically guaranteed coverage for score intervals; verbal uncertainty estimation provides model-expressed uncertainty on binary decisions. They’re complementary instruments for the same underlying reliability problem.

Reading the Benchmark Landscape

Judge benchmark coverage by task type and scaling context. Each benchmark targets a distinct gap in the evaluation landscape.

The six papers collectively map a benchmark landscape that was mostly unmeasured two years ago:

Benchmark	Primary gap addressed	Key finding
JETTS	Test-time scaling (rerank / beam / refine)	Critiques are ineffective; PRMs dominate beam search
Omni-MATH-2	Math reasoning benchmark saturation	Judge wrong in 96.4% of expert disagreements
M-JudgeBench	Multimodal judge capabilities (10 dimensions)	Cross-modal consistency and counterfactual reasoning weak
IF-RewardBench	Instruction-following meta-evaluation	Current judges show significant deficiencies on multi-constraint tasks
Multimodal Judge	Text/audio/image/video evaluation framework	Strong human alignment when evaluation is reproducible (fixed seeds)
Personalized Judge	Persona-aligned preference evaluation	>80% agreement achievable but only on high-certainty subset

Synthesis: What These Papers Mean for Harness Evaluation

The recurring pattern across all six papers is that judge reliability is task-specific in ways that aggregate metrics obscure. A judge that scores well on standard benchmarks may be wrong in 96.4% of its disagreements with expert annotations on math problems (Omni-MATH-2), fail on multi-constraint instructions (IF-RewardBench), degrade on cross-modal consistency (M-JudgeBench), or be unable to guide beam search (JETTS).

For harness design, this means judge selection should match the evaluation task:

For response reranking: LLM judges are competitive. Select on the benchmark dimension closest to your rubric domain.
For step-level guidance: use a dedicated PRM, not a generative judge.
For critique-based revision: the harness’s structured revision prompts (Post 7) are more reliable than open-ended natural language critiques from a judge.
For dimensional rubric scoring: prioritize IF-RewardBench performance when selecting judges for multi-constraint rubrics.
For score interpretation: score plateaus may be judge saturation; use verbal uncertainty or conformal intervals to distinguish true convergence from judge ceiling.

The benchmark gap problem is closing. JETTS, M-JudgeBench, IF-RewardBench, and Omni-MATH-2 were all published in 2025–2026. The field is rapidly building the measurement infrastructure to know which judges are reliable in which contexts. For harness practitioners, the practical takeaway is that this infrastructure now exists and should be used for judge selection — rather than relying on aggregated leaderboard scores.
