Small Language Models and the Efficiency-Accuracy Frontier

May 8, 2026

The model selection problem in an agentic harness is not "which model is best?" It is "which model is best for this role?" SLM-Bench makes the tradeoff measurable across 15 models, 9 tasks, and 4 hardware configurations. Our four-experiment sequence operationalizes the same insight without a benchmark lab — by treating the harness itself as the measurement instrument.

The Model Selection Fallacy

The default intuition when building an agentic system is to pick the most capable model available and use it everywhere. This intuition is wrong in three independent ways. First, the most capable model is not always the most calibrated evaluator: experiments 01 and 02 showed that glm4:9b, given full evaluator responsibility, awarded 9/10 to every structurally complete output regardless of actual depth, effectively removing the evaluation signal. Second, the most capable producer is not always necessary: qwen2.5:7b passed 9/9 runs against the lenient evaluator, so the bottleneck there was calibration, not capability. Third, when the bottleneck is the synthesis instruction rather than the model, upgrading the model changes nothing: T_B's first-pass depth score was 6.0 with the 7B producer and 6.1 with the 32B producer, and its pass rate did not improve (3/3 in exp-03, 4/7 in exp-04).

These three observations share a common structure: the constraint is always somewhere other than where you're looking. SLM-Bench (arXiv:2508.15478v2) provides the benchmark-level evidence for why this happens. Accuracy and energy efficiency are not co-optimized by any single model architecture. The implication is that model selection is not a ranking problem — it is a matching problem between model capability profiles and role requirements.

SLM-Bench (arXiv:2508.15478v2, 2025-08-21) — the first benchmark specifically designed to evaluate small language models across three axes simultaneously: accuracy, computational efficiency, and sustainability metrics. Coverage: 15 models, 9 NLP tasks, 4 hardware configurations. Key finding: diverse trade-offs with no single model dominating all three axes.

What SLM-Bench Measured

The benchmark is systematic in a way that ad hoc model comparisons typically are not. Rather than evaluating models on a single hardware configuration or a single task type, SLM-Bench crosses both dimensions: 15 models × 9 tasks × 4 hardware configurations. This produces a 540-cell evaluation matrix, enough to identify interaction effects that single-condition comparisons miss entirely.

The nine tasks span the NLP task space: classification, question answering, summarization, named entity recognition, sentiment analysis, natural language inference, machine translation, reading comprehension, and commonsense reasoning. The four hardware configurations represent the range from GPU-accelerated inference to CPU-only edge deployment. Sustainability metrics include energy consumption per inference, CO₂ equivalent per 1,000 queries, and peak memory footprint.

Fig. 1 — Accuracy-efficiency tradeoff across SLM model classes. No model dominates both axes. The optimal choice depends on which axis is the binding constraint for a given role.

The empirical findings confirm the theoretical expectation: models that achieve highest accuracy on multi-step reasoning tasks tend to have higher energy footprints; models that achieve lowest energy cost tend to underperform on tasks requiring synthesis across multiple contexts. This is not surprising given that deeper attention patterns and larger parameter counts both improve reasoning quality and increase inference cost. What is surprising is the magnitude of the variance within each axis — different models at similar parameter counts can differ by 2× in energy consumption and by significant margins in accuracy, suggesting that architecture and quantization choices matter as much as raw size.

The central finding: Accuracy and energy efficiency do not co-optimize. Within a given hardware configuration, the most accurate model is rarely the most efficient, and vice versa. This holds across all nine task types, with task-specific intensity: summarization and reasoning tasks show the largest accuracy-efficiency divergence; classification tasks show the smallest.

From Benchmark to Architecture

The practical implication of SLM-Bench's finding is that single-model architectures are suboptimal by construction. If accuracy and efficiency don't co-optimize, you cannot satisfy both by picking any one model. The solution is role separation: assign different models to roles where their capability profile is the binding constraint, and use smaller models where the role's requirements are within their capability envelope.

This is exactly what the Model Role Separation pattern (A2, Post 4) describes in abstract terms. The three-model architecture that emerged from experiment-04 is its concrete instantiation:

| Role | Model | Parameters | Binding constraint | Why this model |
| --- | --- | --- | --- | --- |
| Planner / Compressor | glm4:9b | 9B | Latency, structured output | Fast JSON-mode planning; compression does not require depth reasoning |
| Evaluator | Qwen3-Coder:30b | 30B | Calibration fidelity | Calibrated dimensional rubric scoring; distributes scores across 6–9 range rather than ceiling at 9 |
| Producer | qwen2.5:32b Q4_K_M | 32B (4-bit) | Depth on enumerated tasks | Can respond to depth feedback without revision regression; T_A depth_r1 ≥ 7.2 vs 6.0 with 7B |
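
The role assignments in this table can be expressed as a small routing table. This is a minimal sketch, not the harness's actual Inference Shim: the `RoleSpec` dataclass, the `ROLE_MODELS` mapping, and `model_for` are hypothetical names, and the model tags are copied from the table as plain strings (the exact tag syntax an inference server expects may differ).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleSpec:
    """One row of the role table: which model serves a role, and why."""
    model: str
    binding_constraint: str


# Hypothetical mapping; values are copied from the role table.
ROLE_MODELS = {
    "planner": RoleSpec("glm4:9b", "latency, structured output"),
    "compressor": RoleSpec("glm4:9b", "latency, structured output"),
    "evaluator": RoleSpec("Qwen3-Coder:30b", "calibration fidelity"),
    "producer": RoleSpec("qwen2.5:32b Q4_K_M", "depth on enumerated tasks"),
}


def model_for(role: str) -> str:
    """Return the model assigned to a role, failing loudly on unknown roles."""
    try:
        return ROLE_MODELS[role].model
    except KeyError:
        raise ValueError(f"no model assigned to role {role!r}") from None
```

Keeping the binding constraint next to the model name makes the reason for each assignment visible at the point where a model would be swapped.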

The planner/compressor role is the clearest SLM case. Planning decomposition produces structured JSON from a natural language task description: it requires accurate slot-filling, fast response time, and reliable format adherence. It does not require multi-step synthesis or depth reasoning. glm4:9b satisfies these requirements at the lowest resource cost of the three models, and its latency advantage is meaningful for the Keep-Alive Budget (A4): a fast planner means the producer's context window is filled with research rather than waiting for plan tokens.

The evaluator role has the opposite profile. Evaluation requires calibration — the ability to distribute scores across a rubric rather than collapsing to the extremes. Experiments 01 and 02 established that glm4:9b as evaluator produced a ceiling distribution: every structurally complete output received 9/10. This is not a capability failure in any conventional sense — the model correctly assessed that the outputs met the structural criteria. It is a calibration failure: the model did not penalize for shallow depth or low specificity because those dimensions are harder to evaluate than structure. Qwen3-Coder:30b broke the ceiling. Its mean score_r1 dropped to 7.07 (std=0.55) across all tasks, activating the revision loop in 8/9 runs. This is the role where the larger model's depth reasoning capacity is the binding constraint.
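
One way to catch a ceiling distribution early is to look at the spread of first-pass scores before trusting them. The sketch below is an illustration under stated assumptions: the function name, the `min_std` cutoff of 0.25, and both sample score lists are hypothetical, chosen only to resemble the two behaviours described above.

```python
from statistics import mean, pstdev


def looks_like_ceiling(scores, ceiling=9.0, min_std=0.25):
    """Flag an evaluator whose scores cluster at the top of the scale.

    Returns True when the scores barely vary (population std below
    `min_std`) and their mean sits at or above `ceiling`.
    """
    if not scores:
        raise ValueError("need at least one score")
    return pstdev(scores) < min_std and mean(scores) >= ceiling


# Hypothetical first-pass scores, for illustration only.
lenient = [9, 9, 9, 9, 9, 9, 9, 9, 9]  # every output gets 9/10
calibrated = [6.5, 7.0, 7.5, 6.0, 7.5, 7.0, 8.0, 7.0, 7.0]

print(looks_like_ceiling(lenient))     # True
print(looks_like_ceiling(calibrated))  # False
```

A check like this says nothing about whether the scores are right; it only detects that the evaluator has stopped discriminating between outputs.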

Task-Role Alignment: The Experimental Evidence

The four experiments trace the model selection process empirically. Each experiment changed one variable while holding the others fixed, isolating in sequence the contributions of evaluator calibration, producer capability, and synthesis instruction quality.

Fig. 2: Pass rates by task type across all four experiments. T_A recovered from 0/3 to 4/4 with the producer upgrade. T_B held at 3/3 through exp-03, then fell to 4/7 (57%) after the producer upgrade, evidence that its constraint was the synthesis instruction, not the model.
| Experiment | Evaluator | Producer | T_A pass | T_B pass | T_C pass | Overall |
| --- | --- | --- | --- | --- | --- | --- |
| Exp-01 | glm4:9b | qwen2.5:7b | 3/3 (100%) | 3/3 (100%) | 3/3 (100%) | 9/9 (100%) |
| Exp-02 | glm4:9b (threshold raised) | qwen2.5:7b | 3/3 (100%) | 3/3 (100%) | 3/3 (100%) | 9/9 (100%) |
| Exp-03 | Qwen3-Coder:30b | qwen2.5:7b | 0/3 (0%) | 3/3 (100%) | 1/3 (33%) | 4/9 (44%) |
| Exp-04 | Qwen3-Coder:30b | qwen2.5:32b Q4_K_M | 4/4 (100%) | 4/7 (57%) | 4/5 (80%) | 12/16 (75%) |
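
The overall column can be re-derived from the per-task counts, which is a useful sanity check when run counts differ between tasks, as they do in exp-04. The snippet below hard-codes the counts from the table; the `RESULTS` layout and the `overall` helper are illustrative names, not part of the harness.

```python
# (passes, runs) per task, copied from the results table.
RESULTS = {
    "Exp-01": {"T_A": (3, 3), "T_B": (3, 3), "T_C": (3, 3)},
    "Exp-02": {"T_A": (3, 3), "T_B": (3, 3), "T_C": (3, 3)},
    "Exp-03": {"T_A": (0, 3), "T_B": (3, 3), "T_C": (1, 3)},
    "Exp-04": {"T_A": (4, 4), "T_B": (4, 7), "T_C": (4, 5)},
}


def overall(per_task):
    """Pool passes and runs across tasks; return (passes, runs, rate)."""
    passes = sum(p for p, _ in per_task.values())
    runs = sum(n for _, n in per_task.values())
    return passes, runs, passes / runs


for name, per_task in RESULTS.items():
    passes, runs, rate = overall(per_task)
    print(f"{name}: {passes}/{runs} ({rate:.0%})")
```

Note that the pooled exp-04 figure, 12/16 (75%), is weighted by run count. Because T_B contributes 7 of the 16 runs, it pulls the overall rate below the unweighted mean of the three per-task percentages.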

The T_A trajectory tells the producer-capability story cleanly. With the lenient evaluator, the 7B producer passed T_A 100% of the time in both exp-01 and exp-02, because the evaluator was not testing depth. When the calibrated evaluator entered, T_A went to 0/3 with the 7B producer. The evaluator's dimensional feedback was specific: depth = 6.0, specificity = 6.0 on all first passes. The revision loop fired, but the 7B producer could not respond to depth feedback on five-item enumerated tasks: its revision attempts either regressed to 6.8 or stagnated at 7.0. The 32B producer broke the ceiling: depth_r1 rose to 7.2, and T_A went 4/4.

The revision regression signature (exp-03, T_A): score_r1 = 7.0 → score_final = 6.8 across two of three T_A runs. This is the harness equivalent of an optimization instability: the producer received well-specified feedback and responded by degrading the output. This is not random noise. It reflects a systematic inability of the 7B model to integrate multi-dimensional rubric feedback into a coherent revision strategy on complex enumerated tasks.
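
A revision regression is easy to detect mechanically: the final score is lower than the first-pass score. The following sketch assumes each run records `score_r1` and `score_final` (names taken from the text) and uses a hypothetical `classify_revision` helper with an assumed tolerance of 0.05 for treating two scores as equal.

```python
def classify_revision(score_r1, score_final, tol=0.05):
    """Label the effect of the revision loop on a single run."""
    delta = score_final - score_r1
    if delta < -tol:
        return "regressed"
    if delta > tol:
        return "improved"
    return "stagnated"


# The regression reported for exp-03 T_A, and a run whose score does not move.
print(classify_revision(7.0, 6.8))  # regressed
print(classify_revision(7.0, 7.0))  # stagnated
```

Counting these labels per task and per producer would turn the signature described above into a tracked metric instead of something noticed by reading logs.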

The T_B trajectory tells a different story. T_B is a best_practices task — open-ended synthesis rather than constrained enumeration. With the 7B producer and calibrated evaluator, T_B went 3/3 (100%). With the 32B producer, T_B went 4/7 (57%). This is counterintuitive: upgrading the producer degraded the pass rate. The explanation lies in the output volume change: the 32B producer generates substantially shorter T_B outputs (2,198 bytes mean vs. 3,288 bytes in exp-03), and the calibrated evaluator's depth criterion is sensitive to coverage density. The 32B model produces more structured, less comprehensive T_B outputs — good style, insufficient depth. This is a synthesis instruction problem, not a model problem.

When the Constraint Is the Instruction

T_B's residual failure illustrates the most important lesson from the four-experiment sequence: once the model selection bottleneck is resolved, the instruction becomes the binding constraint. After upgrading to the 32B producer, T_B's depth_r1 across exp-04 was 6.1, essentially identical to exp-03's 6.0. Two experiments, two different producers, same depth score. This is strong evidence that the depth the evaluator is looking for is something the T_B synthesis instruction does not ask for, so neither the 7B nor the 32B producer supplies it.

The autoresearch loop (autoresearch.py) is the harness response to this. Rather than manually diagnosing and rewriting the synthesis instruction, the loop runs controlled experiments with candidate instructions, scores the results via the same composite metric used in normal operation (0.7 × mean_wiggum_r1 + 0.3 × criteria_rate × 10), and retains improvements above a 0.1 threshold via git commit. This is the Data Flywheel pattern (Observability post, Post 10) applied to configuration optimization: the pipeline's own output is used to improve the pipeline.
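
The composite metric and the retention rule are simple enough to write down directly. This is a sketch built from the formula in the text, not the contents of autoresearch.py: the function names are hypothetical, `criteria_rate` is assumed to be a fraction in [0, 1], and "above a 0.1 threshold" is read here as a strict improvement of more than 0.1 over the current best.

```python
def composite_score(mean_wiggum_r1, criteria_rate):
    """0.7 * mean first-pass score + 0.3 * criteria pass rate scaled to 0-10."""
    if not 0.0 <= criteria_rate <= 1.0:
        raise ValueError("criteria_rate must be a fraction in [0, 1]")
    return 0.7 * mean_wiggum_r1 + 0.3 * criteria_rate * 10


def should_retain(candidate, current_best, threshold=0.1):
    """Keep a candidate instruction only if it beats the best by > threshold."""
    return candidate - current_best > threshold
```

Under these assumptions, a variant with a mean first-pass score of 8.0 that meets every criterion scores about 8.6. The 0.7/0.3 weighting means a full point of evaluator score moves the composite more than a ten-percentage-point change in criteria rate does.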

Autoresearch session 3 found that the highest-scoring instruction variant (composite score 8.915) used a "when NOT to use" framing — a constraint discovered by the Kimi cloud proposer, not by manual tuning. The lowest-scoring variant (7.965) added confidence ratings, which caused a −0.950 regression. This variance across instruction variants — 0.95 points on a 10-point scale — is comparable to the variance across model upgrades. The implication is that instruction quality and model selection are joint optimization problems of similar magnitude.

The joint optimization problem: For a fixed task domain, the harness performance ceiling is determined by the minimum of (a) model capability for the role, (b) evaluator calibration, and (c) synthesis instruction quality. Upgrading one dimension without addressing the others produces diminishing returns. The sequence of four experiments traced each bottleneck in isolation.
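
The "minimum of three ceilings" framing can be made concrete with a few lines of code. This is only an illustration of the idea: the `binding_constraint` helper and the per-dimension ceiling values are hypothetical, and in practice each ceiling would have to be estimated from experiments like the ones above.

```python
def binding_constraint(ceilings):
    """Return (name, value) of the lowest ceiling; that one bounds the harness."""
    name = min(ceilings, key=ceilings.get)
    return name, ceilings[name]


# Hypothetical per-dimension ceilings on a 0-10 scale, for illustration only.
ceilings = {
    "model_capability": 8.5,
    "evaluator_calibration": 8.0,
    "synthesis_instruction": 6.1,
}
print(binding_constraint(ceilings))  # ('synthesis_instruction', 6.1)
```

Raising any ceiling other than the one returned leaves the overall bound unchanged, which is the diminishing-returns effect described above.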

Quantization and the Q4_K_M Decision

The 32B producer in experiment-04 is quantized to Q4_K_M, a 4-bit quantization format that reduces the model's effective memory footprint from approximately 64 GB (full BF16) to approximately 20 GB, making it deployable on a single consumer GPU or on-device inference hardware. Q4_K_M is one of the llama.cpp k-quant formats, which quantize weights block-wise, with per-block scales stored inside larger super-blocks. The format is mixed precision: most tensors use 4-bit blocks, while a subset of more sensitive tensors (in llama.cpp's scheme, part of the attention value and feed-forward projections) is kept at 6 bits. The "M" suffix names the medium variant of that mix; Q4_K_S is the smaller one.

The relevant question for harness design is not "what is the quality cost of Q4_K_M quantization?" but "does Q4_K_M retain sufficient quality for this role?" The experimental answer is yes for the producer role on enumerated tasks: depth_r1 = 7.2 on T_A with Q4_K_M, compared to 6.0 with the full-precision 7B model. The 32B Q4_K_M model outperformed the 7B BF16 model on the depth dimension despite the quantization, because the parameter count advantage dominates the precision reduction for this task type.

This connects to the hardware substrate discussion in Post 11. GGUF quantization formats like Q4_K_M are designed for the llama.cpp inference stack, which handles CPU-GPU layer offloading, memory-mapped weight loading, and KV cache management. The critical constraint is VRAM: a 20 GB model footprint leaves roughly 4 GB of headroom for the KV cache on a 24 GB GPU. Running producer and evaluator on the same device requires careful memory scheduling, which is exactly what the Keep-Alive Budget pattern (A4) is designed to manage.
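
The memory figures in this section follow from simple arithmetic. The sketch below reproduces them under explicit assumptions: it counts weights only, treats 1 GB as 10^9 bytes, and backs out an effective bits-per-weight from the quoted 20 GB figure instead of asserting a constant for the format. The helper and variable names are illustrative.

```python
PARAMS = 32e9  # 32B parameters
GB = 1e9       # decimal gigabytes, as used in the text


def weight_footprint_gb(params, bits_per_weight):
    """Memory needed for the weights alone, in GB."""
    return params * bits_per_weight / 8 / GB


bf16_gb = weight_footprint_gb(PARAMS, 16)
# Effective bits per weight implied by the ~20 GB figure quoted in the text.
implied_bits = 20 * GB * 8 / PARAMS
headroom_gb = 24 - 20  # what is left on a 24 GB GPU

print(f"BF16 weights: {bf16_gb:.0f} GB")
print(f"Implied Q4_K_M bits/weight: {implied_bits:.1f}")
print(f"Headroom on a 24 GB GPU: {headroom_gb} GB for KV cache and buffers")
```

The implied figure of about 5 bits per weight, not exactly 4, is consistent with a mixed-precision format that stores block scales alongside the quantized weights.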

SLM-Bench Implications for Harness Design

SLM-Bench's three-axis evaluation framework (accuracy, efficiency, sustainability) maps directly onto the three binding constraints in a multi-model harness architecture:

| SLM-Bench axis | Harness binding constraint | Pattern | Experimental resolution |
| --- | --- | --- | --- |
| Accuracy | Evaluator calibration, producer depth | A2 Model Role Separation | Evaluator upgrade (exp-02→03) exposed producer ceiling; producer upgrade (exp-03→04) broke T_A ceiling |
| Efficiency | Inference latency, VRAM budget | A4 Keep-Alive Budget | glm4:9b planner minimizes planning latency; Q4_K_M quantization makes 32B deployable within VRAM budget |
| Sustainability | Run cost per task | A1 Inference Shim | Role-appropriate model selection reduces unnecessary large-model invocations for planning and compression tasks |

SLM-Bench establishes that no model dominates all three axes. The harness architecture accepts this as a constraint and responds with specialization: use the model whose capability profile matches the role's binding constraint. The Inference Shim (A1) routes calls to the appropriate model per role; the Keep-Alive Budget (A4) manages concurrent model processes to avoid cold-start penalties; Model Role Separation (A2) formalizes which model handles which stage.

The sustainability dimension is often treated as an organizational concern rather than a system design constraint. SLM-Bench argues otherwise: energy consumption per inference varies by 2× or more within the same parameter class depending on architecture and quantization. For a pipeline running thousands of runs, this variance compounds. The harness design choice to use glm4:9b for planning and compression — rather than running the 32B producer for all stages — is a sustainability decision as much as a latency decision.

The Model Portfolio View

The framing that SLM-Bench implicitly supports, and that the four experiments confirm operationally, is that model selection in an agentic system is a portfolio problem rather than a ranking problem. A portfolio of models is selected not by finding the single highest performer but by finding a set of models whose capability profiles complement each other across the roles the harness requires.

In the three-model architecture: glm4:9b contributes fast structured output at low cost; Qwen3-Coder:30b contributes calibrated multidimensional evaluation; qwen2.5:32b Q4_K_M contributes depth reasoning on complex enumerated tasks. No single model in this portfolio is "best" in an absolute sense. glm4:9b as evaluator produced a ceiling distribution that invalidated all exp-01/02 quality signals. Qwen3-Coder:30b as producer would likely generate evaluation-biased outputs. qwen2.5:32b as planner would waste VRAM and latency on a role where 9B parameters are sufficient.

The portfolio is not static. The autoresearch loop represents a second level of optimization: given a fixed model portfolio, find the instruction that best extracts each model's capability for its role. Session 3 found a 0.95-point spread across instruction variants — comparable to a full model upgrade in impact. This suggests that the instruction optimization layer and the model selection layer should be treated as co-equal concerns, not as primary and secondary.

Harness Design Implications

| # | Implication | Experimental grounding |
| --- | --- | --- |
| 1 | Assign models by role requirement, not by global capability ranking. The best model for evaluation is not the best model for production. | Exp-01/02 ceiling effect; exp-03 evaluator calibration breakthrough |
| 2 | Use quantized large models where the role requires depth but latency and VRAM permit. Q4_K_M 32B outperforms BF16 7B on depth dimensions. | Exp-04 T_A depth_r1: 7.2 (32B Q4_K_M) vs 6.0 (7B full) |
| 3 | Identify which bottleneck is active before upgrading models. If the bottleneck is the synthesis instruction, model upgrades produce marginal gains. | Exp-04 T_B: depth_r1 = 6.1 with 32B vs 6.0 with 7B (instruction-bounded) |
| 4 | Use the pipeline's own output to optimize instructions via controlled experiments. Autoresearch composite score separates instruction quality from model noise. | Session 3 spread: 8.915 (best) vs 7.965 (worst), 0.95 points on a 10-point scale |
| 5 | Track energy and VRAM cost per role, not just accuracy. A planner that uses 10% of the VRAM budget frees headroom for the evaluator and producer KV caches. | SLM-Bench sustainability axis; Keep-Alive Budget (A4) pattern |

What Comes Next

The four experiments in this series traced the constraint chain from evaluator calibration through producer capability to synthesis instruction quality. The autoresearch loop addresses the instruction layer. The remaining open questions concern the knowledge layer: how the harness acquires, stores, and retrieves domain knowledge across runs.

The Dual-Backend Memory Store (B3, Post 5) handles this at the run level: vector similarity for semantic retrieval, SQLite for structured lookup. But the literature on retrieval-augmented generation and knowledge graphs points toward a richer architecture — one where domain knowledge is represented explicitly as a structured graph rather than implicitly as dense vectors. That is the subject of the next post.
