Judge Benchmarks and Test-Time Scaling: Where LLM Judges Succeed and Where They Don’t
Post 13 established that the best judges lag human agreement by 5 points and show systematic position bias. This post goes further: six papers that benchmark judges across test-time scaling scenarios, multimodal tasks, instruction following, personalization, and mathematical reasoning — and find failure modes that position bias alone doesn’t explain.
Test-Time Scaling and the Judge Gap
Test-time compute scaling — spending more inference budget to improve output quality — has become a major axis of LLM capability improvement. The standard mechanisms are response reranking (generate N candidates, pick the best), step-level beam search (guide generation step by step using a reward signal), and critique-based refinement (generate, critique, revise). Each requires a reliable evaluator signal at inference time.
Reward models trained for this purpose are well understood. LLM judges — which generate natural language evaluations — are more flexible and cheaper to deploy, and their use as test-time scaling evaluators has grown alongside their use in offline evaluation. The question JETTS asks is whether that flexibility comes at a cost.
1. JETTS: The Performance Profile Depends on the Scaling Method
JETTS (Judge Evaluation for Test-Time Scaling, arXiv:2504.15253v2) is the most rigorous comparative study of judges versus reward models in test-time scenarios to date. The benchmark evaluates ten judge models (7B–70B parameters) against eight base generator models (6.7B–72B parameters) across three domains: math reasoning, code generation, and instruction following. Three scaling settings are tested independently:
- Response reranking: generate N responses, use judge score to select the best one
- Step-level beam search: use process-level signal at each generation step to guide a beam search
- Critique-based refinement: generate a critique of an initial response, then generate a revision based on the critique
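To make the first setting concrete, here is a minimal sketch of response reranking. It is not JETTS code: `rerank` and `toy_judge` are hypothetical names, and the toy judge is a stand-in for a real judge model that would return a score for a prompt and response.

```python
from typing import Callable, Sequence


def rerank(
    prompt: str,
    candidates: Sequence[str],
    judge_score: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest judge score (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: judge_score(prompt, c))


def toy_judge(prompt: str, response: str) -> float:
    # Hypothetical stand-in for a judge model: rewards responses containing "4".
    return 1.0 if "4" in response else 0.0


best = rerank("What is 2 + 2?", ["5", "The answer is 4.", "22"], toy_judge)
print(best)
```

The quality of the selected response is bounded by the quality of `judge_score`, which is why the evaluator matters as much as the generator in this setting.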
The results break cleanly by setting:
- Response reranking: judges are competitive with outcome reward models. The task is well-suited to judges because it requires holistic comparison across complete responses — which is what LLM judges were trained to do.
- Step-level beam search: judges are consistently outperformed by process reward models (PRMs). Beam search requires granular step-level signals that a PRM can provide but a generative judge cannot reliably produce. The mismatch is architectural, not a matter of scale.
- Critique-based refinement: judge-generated critiques are currently ineffective at improving responses. The natural language critiques are not actionable enough for the generator to revise meaningfully.
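The step-level setting is easier to reason about with the loop written out. The sketch below is a generic beam search over partial solutions, assuming a hypothetical `expand` function that proposes next steps and a hypothetical `step_score` function that plays the role of the process-level signal. It illustrates why the scorer is called on every partial trajectory; it is not the JETTS implementation.

```python
from typing import Callable, List, Tuple

Partial = Tuple[str, ...]  # a partial solution: the steps generated so far


def beam_search(
    expand: Callable[[Partial], List[str]],
    step_score: Callable[[Partial], float],
    beam_width: int,
    max_steps: int,
) -> Partial:
    """Keep the `beam_width` best partial solutions after each step."""
    beams: List[Partial] = [()]
    for _ in range(max_steps):
        candidates = [b + (s,) for b in beams for s in expand(b)]
        if not candidates:
            break
        # The scorer is queried on every partial trajectory, at every step.
        candidates.sort(key=step_score, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


# Toy problem (assumed for illustration): each step is "0" or "1",
# and the scorer prefers trajectories with more "1" steps.
result = beam_search(
    expand=lambda partial: ["0", "1"],
    step_score=lambda partial: float(partial.count("1")),
    beam_width=2,
    max_steps=3,
)
print(result)
```

Because `step_score` runs on incomplete trajectories at every step, any noise in the signal compounds across steps, which is one way to read the gap between judges and PRMs in this setting.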
2. Omni-MATH-2: Benchmark Saturation as a Judge Failure Mode
The conventional explanation for benchmark saturation is that models have memorized training data or exhausted the benchmark’s discriminative range. Omni-MATH-2 (arXiv:2601.19532v1) offers a different explanation: saturation may be an artifact of judge failure, not model capability.
The work audits the Omni-MATH dataset for evaluation quality: each problem was checked for LaTeX compilability, solvability, verifiability, and formatting consistency. The result is two subsets — a clean exact-answer subset (n=4,181) and a tagged non-standard subset (n=247) containing proofs, images, and answers requiring non-standard formats.
Comparing Omni-Judge (the original benchmark judge) against expert human annotations on disagreement cases produces the headline finding: Omni-Judge is wrong in 96.4% of them.
The implication is pointed: current judges cannot differentiate between model abilities even before benchmarks saturate. A model appearing to hit a ceiling on Omni-MATH may just be hitting a judge quality ceiling. The paper argues for treating judge correctness as an experimental condition that must be measured and reported alongside benchmark scores.
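One way to treat judge correctness as a measured condition is to report a judge error rate on an expert-audited sample next to the benchmark score. The sketch below assumes a hypothetical `AuditedCase` record, and the four cases are made-up toy data, not figures from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AuditedCase:
    judge_says_correct: bool   # verdict from the automated judge
    expert_says_correct: bool  # verdict from the expert annotator


def judge_error_rate(cases: Iterable[AuditedCase]) -> float:
    """Fraction of audited cases where the judge contradicts the expert."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one audited case")
    wrong = sum(c.judge_says_correct != c.expert_says_correct for c in cases)
    return wrong / len(cases)


# Toy audit sample (invented for illustration only).
audit = [
    AuditedCase(judge_says_correct=False, expert_says_correct=True),
    AuditedCase(judge_says_correct=False, expert_says_correct=True),
    AuditedCase(judge_says_correct=True, expert_says_correct=False),
    AuditedCase(judge_says_correct=True, expert_says_correct=True),
]
rate = judge_error_rate(audit)
print(f"judge error rate on audited cases: {rate:.2f}")
```

Publishing this number with each benchmark score lets a reader tell whether a score difference between two models is larger than the judge's own error.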
For the harness, this finding extends the Post 13 warning about judge reliability into a new domain: when the harness uses Wiggum scores as a proxy for output quality over time, score plateaus may reflect judge saturation rather than pipeline optimization having converged. The Data Flywheel should track the conditions under which Wiggum scores stop improving — and distinguish judge ceiling from pipeline ceiling.
3. M-JudgeBench: Ten Dimensions, MCTS-Generated Training Data
Multimodal judge benchmarks lag behind their text-only counterparts. M-JudgeBench (arXiv:2603.00546v1) addresses this with a ten-dimensional capability-oriented benchmark for multimodal LLM judges, covering:
- Visual reasoning accuracy
- Cross-modal consistency
- Spatial relationship judgment
- Temporal sequence evaluation
- Fine-grained attribute assessment
- Counterfactual reasoning
- Instruction adherence
- Factual grounding
- Style and format judgment
- Compositional scene understanding
Systematic evaluation across existing MLLM judges reveals consistent weaknesses across multiple dimensions — particularly in cross-modal consistency and counterfactual reasoning. Existing judges that perform well on unimodal text tasks show significant degradation when visual content is introduced.
The paper’s second contribution is Judge-MCTS, a data generation framework that uses Monte Carlo Tree Search to generate pairwise reasoning trajectories for training judge models. MCTS explores the judgment reasoning tree explicitly, generating diverse high-quality preference pairs for training. The resulting M-Judger models, trained on these trajectories, outperform existing MLLM-as-a-judge systems on both M-JudgeBench and prior benchmarks.
4. IF-RewardBench: Instruction-Following Meta-Evaluation
Instruction-following evaluation is a common use case for judge models: given a user instruction and a model response, did the model follow the instruction? The standard approach is pairwise comparison against a reference response, but it has two known weaknesses: pairwise evaluation conflates response quality with instruction adherence, and existing benchmarks simplify away multi-constraint instructions, so they never test the hard cases.
IF-RewardBench (arXiv:2603.04738v2) fixes both problems. The benchmark covers diverse instruction and constraint types, including:
- Multi-constraint instructions (follow constraints A and B simultaneously)
- Negation constraints (do X but not Y)
- Format specifications (respond in exactly N words)
- Conditional constraints (if X is true, do Y; otherwise do Z)
Evaluation on IF-RewardBench reveals significant deficiencies in current judge models on these dimensions. More importantly, IF-RewardBench achieves a stronger positive correlation with downstream task performance than existing benchmarks — meaning that performance on IF-RewardBench predicts real-world judge usefulness better than the benchmarks practitioners currently use to select judges.
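Some of the constraint types listed above are mechanically checkable, which gives a cheap reference signal for auditing a judge's verdicts. The sketch below is an illustration under that assumption, not IF-RewardBench's method. All helper names (`exact_word_count`, `must_include`, `must_not_include`, `check_all`) are hypothetical.

```python
from typing import Callable, Dict

Check = Callable[[str], bool]


def exact_word_count(n: int) -> Check:
    # Format specification: respond in exactly n whitespace-separated words.
    return lambda text: len(text.split()) == n


def must_include(phrase: str) -> Check:
    return lambda text: phrase.lower() in text.lower()


def must_not_include(phrase: str) -> Check:
    # Negation constraint: simple substring match, case-insensitive.
    return lambda text: phrase.lower() not in text.lower()


def check_all(response: str, constraints: Dict[str, Check]) -> Dict[str, bool]:
    """Evaluate each constraint separately so partial failures stay visible."""
    return {name: check(response) for name, check in constraints.items()}


constraints = {
    "exactly_five_words": exact_word_count(5),
    "mentions_python": must_include("python"),
    "avoids_java": must_not_include("java"),
}
result = check_all("Python is great for scripting", constraints)
print(result)
```

Reporting one boolean per constraint, rather than a single pass/fail, shows which constraint a judge missed when its verdict disagrees with the checker.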
5. Multimodal Judge Model: Text, Audio, Image, Video
Most judge benchmarks are monomodal. The Multimodal Judge Model (arXiv:2601.06106v1) proposes a framework for reliable evaluation across all four major modalities simultaneously — text, audio, image, and video — with diagnostic feedback rather than simple scores.
The design choices address a concrete problem in multimodal evaluation: train-test leakage. When public datasets are used for evaluation, models trained on those datasets may have seen the evaluation data. The framework uses carefully sampled public datasets with fixed seeds to ensure reproducibility and minimize leakage, making evaluation scores comparable across model versions and labs.
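Fixed-seed sampling is simple to sketch. The function below is a hypothetical helper, not the paper's code: it draws an evaluation subset with a local random generator so that the same seed yields the same subset when run in the same Python environment.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_eval_subset(items: Sequence[T], k: int, seed: int) -> List[T]:
    """Draw k items without replacement using a seed-local RNG."""
    rng = random.Random(seed)  # local generator: global random state is untouched
    return rng.sample(list(items), k)


pool = [f"example-{i}" for i in range(10)]  # placeholder dataset IDs
first = sample_eval_subset(pool, k=3, seed=1234)
second = sample_eval_subset(pool, k=3, seed=1234)
print(first == second)
```

Recording the seed and the dataset version with each reported score is what makes results comparable across runs.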
The framework aggregates judgments across modalities, analyzes output quality and reasoning consistency, and generates diagnostic feedback identifying which modality or reasoning step caused a failure. Results show strong alignment between the Judge Model’s assessments and human scores.
The audio modality coverage is notable: most multimodal evaluation frameworks cover image and video but treat audio as secondary. Including audio in judge evaluation matters for agentic systems that process spoken inputs, transcripts, or audio-grounded documents.
6. Personalized Judge: Verbal Uncertainty Estimation
Most judge evaluation assumes a single ground-truth preference. Personalized Judge (arXiv:2406.11657v1) relaxes this assumption: what if user preferences legitimately vary, and the judge’s task is to evaluate whether a response matches a specific user’s values and style?
Direct application of LLMs to personalized evaluation fails: models asked to judge whether a response matches a user persona show agreement with human ground truth that is too low for a deployable system. The paper’s intervention is verbal uncertainty estimation: rather than forcing the judge to produce a binary verdict on every example, the judge is allowed to express low confidence on uncertain cases. Restricted to the high-certainty subset, agreement with humans exceeds 80%.
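Letting the judge abstain trades coverage for agreement, and both numbers should be reported together. The sketch below assumes a hypothetical `Judgment` record in which the judge's verbalized certainty has already been reduced to a boolean. The five judgments are invented toy data.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Judgment:
    verdict: bool    # judge's binary decision
    confident: bool  # judge's self-reported certainty
    human: bool      # human ground-truth label


def selective_agreement(
    judgments: Sequence[Judgment],
) -> Tuple[float, Optional[float]]:
    """Return (coverage, agreement on the confident subset)."""
    if not judgments:
        raise ValueError("need at least one judgment")
    kept = [j for j in judgments if j.confident]
    coverage = len(kept) / len(judgments)
    if not kept:
        return coverage, None  # judge abstained everywhere
    agreement = sum(j.verdict == j.human for j in kept) / len(kept)
    return coverage, agreement


toy = [
    Judgment(verdict=True, confident=True, human=True),
    Judgment(verdict=False, confident=True, human=False),
    Judgment(verdict=True, confident=True, human=True),
    Judgment(verdict=True, confident=True, human=False),
    Judgment(verdict=False, confident=False, human=True),
]
coverage, agreement = selective_agreement(toy)
print(coverage, agreement)
```

A high agreement figure with low coverage means the judge is only usable on a small share of cases, so neither number is meaningful alone.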
The uncertainty estimation insight connects directly to Post 17’s conformal prediction framework: conformal prediction provides statistically guaranteed coverage for score intervals; verbal uncertainty estimation provides model-expressed uncertainty on binary decisions. They’re complementary instruments for the same underlying reliability problem.
Reading the Benchmark Landscape
The six papers collectively map a benchmark landscape that was mostly unmeasured two years ago:
| Benchmark | Primary gap addressed | Key failure uncovered |
|---|---|---|
| JETTS | Test-time scaling (rerank / beam / refine) | Critiques are ineffective; PRMs dominate beam search |
| Omni-MATH-2 | Math reasoning benchmark saturation | Judge wrong in 96.4% of expert disagreements |
| M-JudgeBench | Multimodal judge capabilities (10 dimensions) | Cross-modal consistency and counterfactual reasoning weak |
| IF-RewardBench | Instruction-following meta-evaluation | Current judges show significant deficiencies on multi-constraint tasks |
| Multimodal Judge | Text/audio/image/video evaluation framework | Strong human alignment when evaluation is reproducible (fixed seeds) |
| Personalized Judge | Persona-aligned preference evaluation | >80% agreement achievable but only on high-certainty subset |
Synthesis: What These Papers Mean for Harness Evaluation
The recurring pattern across all six papers is that judge reliability is task-specific in ways that aggregate metrics obscure. A judge that scores well on standard benchmarks may still be wrong in 96.4% of expert-reviewed disagreement cases on math problems (Omni-MATH-2), fail on multi-constraint instructions (IF-RewardBench), degrade on cross-modal consistency (M-JudgeBench), or be unable to guide beam search (JETTS).
For harness design, this means judge selection should match the evaluation task:
- For response reranking: LLM judges are competitive. Select on the benchmark dimension closest to your rubric domain.
- For step-level guidance: use a dedicated PRM, not a generative judge.
- For critique-based revision: the harness’s structured revision prompts (Post 7) are more reliable than open-ended natural language critiques from a judge.
- For dimensional rubric scoring: prioritize IF-RewardBench performance when selecting judges for multi-constraint rubrics.
- For score interpretation: score plateaus may be judge saturation; use verbal uncertainty or conformal intervals to distinguish true convergence from judge ceiling.
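The selection rules above can be written down as an explicit lookup so the choice of evaluator is visible in configuration rather than buried in pipeline code. This is a sketch only: the `Task` enum and the evaluator labels are hypothetical names, not part of the harness.

```python
from enum import Enum


class Task(Enum):
    RERANK = "rerank"
    STEP_GUIDANCE = "step_guidance"
    REVISION = "revision"
    RUBRIC = "rubric"


# Hypothetical evaluator labels mirroring the recommendations in the list above.
EVALUATOR_FOR_TASK = {
    Task.RERANK: "llm_judge",
    Task.STEP_GUIDANCE: "process_reward_model",
    Task.REVISION: "structured_revision_prompt",
    Task.RUBRIC: "llm_judge_selected_on_if_rewardbench",
}


def pick_evaluator(task: Task) -> str:
    """Look up the evaluator type recommended for a given evaluation task."""
    return EVALUATOR_FOR_TASK[task]


print(pick_evaluator(Task.STEP_GUIDANCE))
```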