May 24, 2026 • 6 min read • Agentic Harness Engineering

The Supervisor: Four Convergence Signals and Advisory Interventions

A read-only monitor that scans runs.jsonl for signs that the pipeline is collapsing toward a fixed point — and tells you exactly which knob to turn.

The harness is a self-improving system. Autoresearch mutates the synthesis instruction, Wiggum evaluates each output, and the best runs feed back into the next generation. But self-improvement loops have a failure mode: they can converge to a local optimum and stop exploring. Outputs start looking the same. Evaluator scores cluster in a tight band. The search loop terminates early on every run. The system appears healthy on any single run but is quietly collapsing in diversity.

supervisor.py detects this. It reads a sliding window of recent runs from runs.jsonl, computes four convergence signals, compares each to a threshold, and prints a colored report with specific intervention recommendations for any signal that fires. It never modifies pipeline behavior — it is advisory only.
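The read-window-then-compare flow described above can be sketched roughly as follows. This is an illustrative sketch, not the actual supervisor.py: the names `THRESHOLDS`, `load_window`, and `status`, the signal keys, and the "N/A" state for a missing value are all assumptions. Only the threshold values, their directions, and the default window of 20 runs come from this post.

```python
import json
from collections import deque
from pathlib import Path

# Threshold values and directions are taken from this post.
# The dict layout and key names are hypothetical.
THRESHOLDS = {
    "score_variance": (0.5, "low_is_bad"),
    "output_size_cv": (0.10, "low_is_bad"),
    "search_utilization": (0.45, "low_is_bad"),
    "content_similarity": (0.65, "high_is_bad"),
}


def load_window(path, n=20):
    """Return the last n parseable JSON records from a JSONL file."""
    window = deque(maxlen=n)  # older records fall off the left side
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                window.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # skip a corrupt or partially written line
    return list(window)


def status(name, value):
    """Return 'OK', 'WARN', or 'N/A' (assumed state) for one signal value."""
    if value is None:
        return "N/A"
    threshold, direction = THRESHOLDS[name]
    if direction == "low_is_bad":
        fired = value < threshold
    else:
        fired = value > threshold
    return "WARN" if fired else "OK"
```

The file is only ever opened for reading, which matches the advisory-only design: nothing in this sketch writes back to the log or to any configuration.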

Four signals

Wiggum score variance

warn < 0.5σ · low is bad

Standard deviation (not variance, despite the signal's name) of final Wiggum scores across recent runs that used the evaluation loop. Low spread means the evaluator has stopped differentiating outputs: every run looks equally good (or equally mediocre) to the judge.

Output size CV

warn < 0.10 · low is bad

Coefficient of variation (std/mean) of output_bytes across research runs. A CV below 0.10 means outputs are converging in length and density — the pipeline is producing the same size document regardless of task complexity.

Search utilization

warn < 0.45 · low is bad

Mean fraction of MAX_SEARCH_ROUNDS (5) actually used. Consistently low utilization means the novelty gate is firing too early — the pipeline is saturating after 1–2 search rounds on tasks that should require more.

Content similarity

warn > 0.65 · high is bad

Mean SequenceMatcher ratio between consecutive final_content values (capped at 4000 chars per side for speed). Rising similarity means successive outputs are converging in structure or vocabulary — the hallmark of mode collapse.

The four signals cover different parts of the pipeline. Score variance catches evaluator collapse. Output size CV catches structural uniformity in the synthesis stage. Search utilization catches premature saturation in the search stage. Content similarity catches mode collapse in the generated text itself. A pipeline that passes all four is genuinely diverse across its recent history.

Interventions

Each signal has three specific interventions, drawn from practical experience tuning the pipeline:

If Wiggum score variance is too low

  1. Rotate the Wiggum rubric — add a novelty-biased evaluator 1-in-5 runs
  2. Temporarily lower PASS_THRESHOLD in wiggum.py to accept more diverse outputs
  3. Add a /deep or higher-temperature synthesis pass to surface different angles

If output size CV is too low

  1. Vary num_predict across runs (e.g. 4096 → 8192 alternating)
  2. Add task-format diversity to the eval suite (additional task types T_G / T_H)
  3. Check whether SYNTH_INSTRUCTION is forcing uniform structure regardless of task — a common cause after aggressive autoresearch convergence

If search utilization is too low

  1. Raise NOVELTY_EPSILON (currently 0.15) to let more sub-threshold rounds through
  2. Lower NOVELTY_THRESHOLD from 3 to 2 to require stronger saturation before stopping
  3. Expand the eval task suite to include topics where the current knowledge_state is sparse

If content similarity is too high

  1. Drop memory influence for 1-in-N runs by skipping memory.get_context()
  2. Rotate compression prompts in compress_and_store() to avoid schema lock-in
  3. Add raw excerpt memory alongside compressed summaries in memory.py to break structural uniformity

Sample report output

=== Supervisor Report  2026-06-01 14:22 UTC  (20 runs analyzed) ===

  [OK]               Wiggum score variance
             value=0.821σ  threshold=0.5σ
  n=17, mean_final_score=8.31, mean_revision_delta=0.44

  [OK]               Output size CV (std/mean)
             value=0.243  threshold=0.10
  n=19, mean_bytes=4821.00

  [WARN]             Search utilization (rounds used / max)
             value=0.380  threshold=0.45
  n=18, min=0.20, max=0.60

  Diagnosis: search_utilization is below threshold.
    1. Raise NOVELTY_EPSILON (currently 0.15) to let more sub-threshold rounds through
    2. Lower NOVELTY_THRESHOLD from 3 to 2 to require stronger saturation before stopping
    3. Expand eval tasks to include topics where current knowledge_state is sparse

  [OK]               Content similarity (sequential pairs)
             value=0.312  threshold=0.65
  n=17, max_pair=0.51, min_pair=0.18

The colored terminal output uses ANSI codes: green for OK, red for WARN. Machine-readable JSON output is available via --json for integration with monitoring scripts.

Signal computation details

Score variance uses only runs that have a wiggum_scores list (i.e. Wiggum ran). It takes the final score from each run's score list, ignoring intermediate revision scores. The mean_revision_delta in the detail line is the mean of (last score − first score) per run — positive deltas indicate the revision loop is improving outputs.
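A minimal sketch of that computation, assuming each run is a dict and `wiggum_scores` is a list of numbers in revision order. The function name `score_signal` and the returned keys are hypothetical. The post does not say whether the sample or population standard deviation is used, so `statistics.stdev` (sample) is an assumption here.

```python
from statistics import mean, stdev


def score_signal(runs):
    """Spread of final Wiggum scores, plus the detail-line statistics."""
    # Keep only runs where Wiggum actually ran (non-empty score list).
    score_lists = [r["wiggum_scores"] for r in runs if r.get("wiggum_scores")]
    if len(score_lists) < 2:
        return None  # a standard deviation needs at least two data points

    finals = [scores[-1] for scores in score_lists]              # final score only
    deltas = [scores[-1] - scores[0] for scores in score_lists]  # last minus first

    return {
        "value": stdev(finals),  # assumption: sample standard deviation
        "n": len(score_lists),
        "mean_final_score": mean(finals),
        "mean_revision_delta": mean(deltas),
    }
```

A run with a single score contributes a revision delta of zero, since its first and last scores are the same.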

Output size CV excludes task types that naturally produce short outputs: email, GitHub, review, recall, queue. Mixed in with long research documents, these would inflate the spread artificially and mask convergence among the research outputs. Only research-type runs are included.
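The exclusion and the CV itself might look like the sketch below. The `task_type` field name, the lowercase spellings in `SHORT_TASK_TYPES`, and the function name are assumptions. `output_bytes` and the list of excluded types come from the post. Sample standard deviation is again an assumption.

```python
from statistics import mean, stdev

# Task types named in the post as naturally short; lowercase keys are assumed.
SHORT_TASK_TYPES = {"email", "github", "review", "recall", "queue"}


def output_size_cv(runs):
    """Coefficient of variation (std / mean) of output_bytes for long-form runs."""
    sizes = [
        r["output_bytes"]
        for r in runs
        if str(r.get("task_type", "")).lower() not in SHORT_TASK_TYPES
        and isinstance(r.get("output_bytes"), (int, float))
        and r["output_bytes"] > 0
    ]
    if len(sizes) < 2:
        return None  # not enough data for a spread estimate
    return stdev(sizes) / mean(sizes)
```

For three research runs of 4000, 5000, and 6000 bytes the result is 0.2, well above the 0.10 warning threshold. Runs with no `task_type` at all are kept by this filter, which is a choice of the sketch rather than documented behavior.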

Search utilization uses two measurement paths: if tool_calls is logged, it counts entries where tool == "web_search". If not, it falls back to estimating rounds from total_search_chars at ~1800 chars per round, capped at MAX_SEARCH_ROUNDS. Orchestrated runs are excluded since they don't follow the standard search loop.
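The two measurement paths can be sketched as follows. The boolean `orchestrated` flag used to exclude orchestrated runs is hypothetical, as are the function names and the use of `round()` for the character-based estimate. `tool_calls`, `total_search_chars`, the 1800-character estimate, and `MAX_SEARCH_ROUNDS = 5` come from the post.

```python
from statistics import mean

MAX_SEARCH_ROUNDS = 5    # value stated in the post
CHARS_PER_ROUND = 1800   # rough per-round estimate stated in the post


def rounds_used(run):
    """Count search rounds from tool_calls, else estimate them from char volume."""
    calls = run.get("tool_calls")
    if calls is not None:
        # Primary path: tool calls were logged, so count the web searches.
        return sum(1 for call in calls if call.get("tool") == "web_search")
    chars = run.get("total_search_chars")
    if chars is None:
        return None  # neither measurement is available for this run
    # Fallback path: estimate, capped at the maximum (rounding is an assumption).
    return min(round(chars / CHARS_PER_ROUND), MAX_SEARCH_ROUNDS)


def search_utilization(runs):
    """Mean fraction of MAX_SEARCH_ROUNDS used across standard-loop runs."""
    fractions = []
    for run in runs:
        if run.get("orchestrated"):  # hypothetical flag for orchestrated runs
            continue
        used = rounds_used(run)
        if used is not None:
            fractions.append(used / MAX_SEARCH_ROUNDS)
    return mean(fractions) if fractions else None
```

Because the maximum is 5, each per-run fraction is a multiple of 0.2, which is why the sample report shows min=0.20 and max=0.60.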

Content similarity computes pairwise SequenceMatcher.ratio() between consecutive final_content values in the analysis window, truncated at 4000 characters per side to bound computation time. The metric is sensitive to vocabulary and structure but not to minor reordering — appropriate for detecting the kind of template lock-in that autoresearch convergence produces.
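A minimal sketch of the sequential-pair comparison using Python's standard `difflib`. The function name and the returned keys are hypothetical; `final_content`, the 4000-character cap, and the use of `SequenceMatcher.ratio()` come from the post. Whether the real implementation disables `autojunk` is not stated, so the library default is left in place here.

```python
from difflib import SequenceMatcher
from statistics import mean

MAX_CHARS = 4000  # per-side cap stated in the post


def content_similarity(runs):
    """Mean SequenceMatcher ratio between consecutive final_content values."""
    # Truncate each text first so every comparison has bounded cost.
    texts = [r["final_content"][:MAX_CHARS] for r in runs if r.get("final_content")]
    # Pair each text with its successor: (t0, t1), (t1, t2), ...
    ratios = [
        SequenceMatcher(None, a, b).ratio()  # default autojunk=True (assumption)
        for a, b in zip(texts, texts[1:])
    ]
    if not ratios:
        return None  # fewer than two usable texts in the window
    return {
        "value": mean(ratios),
        "n": len(ratios),
        "max_pair": max(ratios),
        "min_pair": min(ratios),
    }
```

Comparing only consecutive pairs keeps the cost linear in the window size: n texts yield n − 1 comparisons rather than the n(n − 1)/2 of an all-pairs scan.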

Usage

# Analyze last 20 runs (default)
python supervisor.py

# Analyze last 50 runs
python supervisor.py --n 50

# Filter to research tasks only
python supervisor.py --task-type research

# Machine-readable JSON output
python supervisor.py --json

# No ANSI color (for log files)
python supervisor.py --no-color
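For readers building a similar tool, the flags above could be parsed with `argparse` along these lines. This is a sketch under assumptions, not the actual supervisor.py source: the `build_parser` name and the help strings are hypothetical, while the flag names and the default of 20 come from the usage examples.

```python
import argparse


def build_parser():
    """Build a parser for the four flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        prog="supervisor.py",
        description="Read-only convergence monitor (illustrative sketch).",
    )
    parser.add_argument("--n", type=int, default=20,
                        help="number of most recent runs to analyze")
    parser.add_argument("--task-type", default=None,
                        help="only analyze runs of this task type")
    parser.add_argument("--json", action="store_true",
                        help="emit machine-readable JSON instead of a report")
    parser.add_argument("--no-color", action="store_true",
                        help="disable ANSI color codes (for log files)")
    return parser
```

`argparse` converts dashes in flag names to underscores, so the values are read as `args.task_type` and `args.no_color`.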

The supervisor is advisory only — it never modifies config.ini, wiggum.py, or any pipeline parameter. Interventions require a human decision. This is intentional: automated parameter tuning based on aggregate statistics risks chasing noise, especially with a window of only 20 runs.

Relation to autoresearch and the convergence literature

The four signals were developed in response to the convergence failures documented in "When the Loop Defeats Itself": three nested failure modes across 90 experiments. The supervisor operationalizes those failure modes as measurable signals: score variance catches the evaluator ceiling failure, search utilization catches the novelty saturation failure, and content similarity catches the output homogenization failure.

The GEPA framing from "From Hill-Climbing to Pareto" suggests that autoresearch oscillation is partly explained by the pipeline optimizing one objective (Wiggum score) while neglecting diversity. The supervisor's four signals are a proxy for the diversity half of that Pareto frontier: not optimizable directly, but monitorable.