
The Wiggum Panel: Three-Persona Parallel Evaluation

The Wiggum loop uses a single evaluator model scoring output across six dimensions. That works well for most runs. But a single evaluator—even a capable one at low temperature—brings a single perspective: it cannot simultaneously optimize for technical depth, coverage completeness, and newcomer accessibility. A document can pass the dimensional rubric with a high score while still being impenetrable to a developer encountering the topic for the first time, or thin in the places a domain practitioner would notice first.

The Panel addresses this by running three reviewers concurrently after the initial Wiggum evaluation. Each comes with a distinct system prompt that anchors its perspective. Their issues are merged into the revision context alongside Wiggum’s own issues, giving the producer a richer signal before it rewrites.

The three personas

Domain Practitioner

Senior engineer with 10+ years in production systems. Focuses on whether the content is actionable and production-ready. Flags toy-level examples, vague claims, and missing implementation details. Core question: could a practitioner act on this today?

Critical Reviewer

Technical editor reviewing for completeness and intellectual rigor. Identifies gaps—what an expert would expect to see but is absent. Flags unsupported claims, missing caveats, and one-sided coverage. Reads for what is not there.

Informed Newcomer

Developer with general programming knowledge learning this specific topic for the first time. Tests whether a document is comprehensible without prior domain knowledge. Flags unexplained jargon and concepts assumed without introduction.

The split is deliberate. The Practitioner and Critic tend to find complementary issues: the Practitioner flags what’s missing in depth, the Critic flags what’s missing in breadth. The Newcomer catches problems neither would notice—because both already know the domain. A document can have sufficient technical depth and complete coverage while still being inaccessible to the person most likely to read it.
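The run_panel() snippet in the next section indexes each persona with p["name"], which suggests the personas live in a list of dicts. A minimal sketch of what that list might look like follows. Only the "name" key is confirmed by the post; the "system_prompt" key and the prompt wording are assumptions paraphrased from the descriptions above, not the actual contents of panel.py.

```python
# Hypothetical layout for PANEL_PERSONAS. Only "name" appears in the post's
# code; "system_prompt" and its text are assumptions for illustration.
PANEL_PERSONAS = [
    {
        "name": "Domain Practitioner",
        "system_prompt": (
            "You are a senior engineer with 10+ years in production systems. "
            "Flag toy-level examples, vague claims, and missing implementation "
            "details. Core question: could a practitioner act on this today?"
        ),
    },
    {
        "name": "Critical Reviewer",
        "system_prompt": (
            "You are a technical editor reviewing for completeness and rigor. "
            "Flag unsupported claims, missing caveats, and one-sided coverage. "
            "Read for what is not there."
        ),
    },
    {
        "name": "Informed Newcomer",
        "system_prompt": (
            "You are a developer with general programming knowledge who is new "
            "to this topic. Flag unexplained jargon and concepts that are "
            "assumed without introduction."
        ),
    },
]

# Persona names must be unique because they are used to label merged issues.
assert len({p["name"] for p in PANEL_PERSONAS}) == len(PANEL_PERSONAS)
```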

Execution: parallel threads, single content excerpt

All three personas receive the same content excerpt (capped at 5,000 characters) and the same task description. They run concurrently via ThreadPoolExecutor with one thread per persona, so wall-clock time is roughly that of one LLM call rather than three: the round completes when the slowest persona responds.

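from concurrent.futures import ThreadPoolExecutor, as_completed

def run_panel(task: str, content: str, model: str, trace=None) -> list[dict]:
    content_excerpt = content[:5000]  # every persona sees the same capped excerpt
    reviews = []

    # _run_persona (not shown) makes one LLM call for a persona and returns
    # its parsed review; it presumably returns None when the call fails.
    with ThreadPoolExecutor(max_workers=len(PANEL_PERSONAS)) as pool:
        futures = {pool.submit(_run_persona, p): p["name"] for p in PANEL_PERSONAS}
        for future in as_completed(futures):
            result = future.result()
            if result is not None:  # skip personas that produced no review
                reviews.append(result)
    return reviews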
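
The wall-clock claim can be checked without any model at all. The sketch below replaces the LLM call with a sleep; fake_persona_call and the latency values are hypothetical stand-ins, and the pool pattern mirrors run_panel() above.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical latencies in seconds; a real persona call would be an LLM request.
FAKE_LATENCIES = {
    "Domain Practitioner": 0.1,
    "Critical Reviewer": 0.2,
    "Informed Newcomer": 0.3,
}

def fake_persona_call(name: str) -> dict:
    """Stub reviewer: sleeps for the persona's latency instead of calling a model."""
    time.sleep(FAKE_LATENCIES[name])
    return {"persona": name, "score": 7}

def timed_panel() -> tuple[list[dict], float]:
    """Run all stub personas concurrently and return (reviews, elapsed seconds)."""
    start = time.perf_counter()
    reviews = []
    with ThreadPoolExecutor(max_workers=len(FAKE_LATENCIES)) as pool:
        futures = [pool.submit(fake_persona_call, name) for name in FAKE_LATENCIES]
        for future in as_completed(futures):
            reviews.append(future.result())
    return reviews, time.perf_counter() - start

if __name__ == "__main__":
    reviews, elapsed = timed_panel()
    total = sum(FAKE_LATENCIES.values())
    print(f"{len(reviews)} reviews in {elapsed:.2f}s (serial would be about {total:.2f}s)")
```

Elapsed time should land near the slowest stub (0.3 s) rather than the 0.6 s sum. With a real provider, rate limits or server-side queuing can erode that gain.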

Each persona returns a structured JSON object: score (0–10), issues (list of specific observations), and strengths (list of things done well). A regex fallback parses the response if JSON decoding fails.
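
The post does not show the parser, so the following is only a sketch of how a JSON-first parse with a regex fallback could work. The function name parse_review, the clamping of scores to 0-10, and both regex patterns are assumptions rather than the harness's actual code.

```python
import json
import re
from typing import Optional

def _clamp(score: float) -> float:
    """Keep the score inside the 0-10 range the panel uses."""
    return max(0.0, min(10.0, score))

def parse_review(raw: str) -> Optional[dict]:
    """Parse a persona response into score/issues/strengths, or None if unusable."""
    # Try the whole response first, then the outermost {...} span, which
    # covers models that wrap their JSON in prose.
    candidates = [raw]
    span = re.search(r"\{.*\}", raw, re.DOTALL)
    if span:
        candidates.append(span.group(0))

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and isinstance(parsed.get("score"), (int, float)):
            return {
                "score": _clamp(float(parsed["score"])),
                "issues": [str(i) for i in (parsed.get("issues") or [])],
                "strengths": [str(s) for s in (parsed.get("strengths") or [])],
            }

    # Last resort: recover only a numeric score such as "Score: 6.5".
    match = re.search(r'score"?\s*[:=]\s*(\d+(?:\.\d+)?)', raw, re.IGNORECASE)
    if match:
        return {"score": _clamp(float(match.group(1))), "issues": [], "strengths": []}
    return None
```

Returning None for an unusable response fits the `if result is not None` check in run_panel(), which drops that persona from the round.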

Integration with the Wiggum loop

The panel runs after Wiggum’s evaluate step and before the revision step. When WIGGUM_PANEL=1, wiggum.py calls run_panel() and then panel_issues() to flatten and deduplicate the issue lists across personas:

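# wiggum.py, inside the evaluate() round
if _PANEL_ENABLED:
    panel_reviews = run_panel(task, content, evaluator_model, trace=parent_trace)
    panel_issue_list = panel_issues(panel_reviews)
    # `existing` is not defined in this excerpt; assumed here to be a
    # lowercase set of Wiggum's own issues, used to skip duplicates.
    existing = {i.lower() for i in issues}
    new_panel_issues = [i for i in panel_issue_list if i.lower() not in existing]
    issues = issues + new_panel_issues  # augment Wiggum's issue list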

panel_issues() prefixes each issue with the originating persona name—[Domain Practitioner] entry price field missing concrete example—so the producer knows the evaluative lens behind each critique. Issues already present in Wiggum’s own list are deduplicated before appending.
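
A minimal sketch of what panel_issues() might do, based only on the behavior described here: flatten, deduplicate across personas, and prefix. The "persona" key on each review and the case-insensitive exact-match deduplication are assumptions.

```python
def panel_issues(reviews: list[dict]) -> list[str]:
    """Flatten per-persona issues into one list, deduplicated and persona-prefixed."""
    seen = set()   # lowercase issue text already emitted by any persona
    merged = []
    for review in reviews:
        for issue in review.get("issues") or []:
            text = issue.strip()
            key = text.lower()
            if not text or key in seen:
                continue  # skip blanks and issues another persona already raised
            seen.add(key)
            merged.append(f"[{review['persona']}] {text}")
    return merged

if __name__ == "__main__":
    sample = [
        {"persona": "Domain Practitioner", "issues": ["No retry logic", "Vague claims"]},
        {"persona": "Critical Reviewer", "issues": ["vague claims ", "Missing caveats"]},
    ]
    for line in panel_issues(sample):
        print(line)
```

Exact-match deduplication only catches issues worded identically. Two personas describing the same problem in different words would both survive, which may be acceptable since the prefix tells the producer which lens each came from.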

evaluate() → Wiggum scores + issues
    ↓
run_panel() → 3 threads in parallel
    ↓
panel_issues() → deduplicated, persona-prefixed
    ↓
issues = wiggum_issues + new_panel_issues
    ↓
revise() receives enriched issue list

The panel does not independently gate the revision decision—the pass/fail threshold is still Wiggum’s dimensional score. The panel’s contribution is richer revision context, not a separate gating signal. If Wiggum scores the output above PASS_THRESHOLD, the run completes even if panel reviewers found issues.
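
The gating rule can be stated in a few lines. This is a sketch under assumptions: the PASS_THRESHOLD value and the merge_and_gate name are hypothetical, and "above" is read as a strict comparison.

```python
PASS_THRESHOLD = 8.0  # hypothetical value; the post does not state the real one

def merge_and_gate(
    wiggum_score: float,
    wiggum_issues: list[str],
    panel_issue_list: list[str],
) -> tuple[bool, list[str]]:
    """Return (passed, issues_for_revision) for one evaluation round."""
    existing = {i.lower() for i in wiggum_issues}
    new_panel_issues = [i for i in panel_issue_list if i.lower() not in existing]
    issues = wiggum_issues + new_panel_issues
    # Only Wiggum's score decides pass/fail; panel issues never block a pass.
    passed = wiggum_score > PASS_THRESHOLD
    return passed, issues
```

With these assumptions, a round scoring 9.0 passes even when the panel contributes issues, while a round scoring 6.0 fails and hands the combined list to the revision step.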

Logging and downstream use

Each Wiggum evaluation round record in runs.jsonl includes a panel_reviews field when the panel ran. This stores the full per-persona output—score, issues, strengths, and raw model response—for every round. The structured per-persona scores are useful for preference learning: a document where the Newcomer scores 4/10 but the Practitioner scores 8/10 is a different training signal than one where both score 6/10.
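
To surface rounds where personas disagree, the log can be scanned for the spread between the highest and lowest persona score. The record layout below is an assumption: one JSON object per line, with panel_reviews holding dicts that carry "persona" and "score" keys.

```python
import json
from typing import Iterable, Iterator

def panel_score_spreads(lines: Iterable[str]) -> Iterator[tuple[dict, float]]:
    """Yield (scores_by_persona, max - min) for each record that has panel reviews."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        reviews = record.get("panel_reviews") or []  # absent when the panel did not run
        scores = {r["persona"]: r["score"] for r in reviews if "score" in r}
        if len(scores) < 2:
            continue  # a spread needs at least two personas
        yield scores, max(scores.values()) - min(scores.values())

if __name__ == "__main__":
    sample_lines = [
        json.dumps({"panel_reviews": [
            {"persona": "Informed Newcomer", "score": 4},
            {"persona": "Domain Practitioner", "score": 8},
        ]}),
        json.dumps({"panel_reviews": [
            {"persona": "Informed Newcomer", "score": 6},
            {"persona": "Domain Practitioner", "score": 6},
        ]}),
        json.dumps({"round": 3}),  # a round logged without panel output
    ]
    for scores, spread in panel_score_spreads(sample_lines):
        print(scores, spread)
```

Passing an open file handle for runs.jsonl instead of sample_lines would scan the real log, provided the field names match.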

Configuration and tradeoffs

| Setting | Default | Notes |
|---|---|---|
| WIGGUM_PANEL | 0 (off) | Set to 1 to enable. Off by default because it adds 3 LLM calls per Wiggum round. |
| Content excerpt | 5,000 chars | Each persona sees the first 5,000 characters. Hardcoded in panel.py. |
| Temperature | 0.3 | Slightly higher than Wiggum’s evaluator (0.1) to allow more varied perspectives across personas. |
| Model | Inherits evaluator model | All three personas use the same model. Can be overridden via the standalone CLI. |

The main tradeoff is cost: three extra LLM calls per evaluation round. At MAX_ROUNDS=3, a run that revises twice could generate up to 9 panel calls. In practice the panel fires most usefully on the first round—subsequent revisions are responding to known issues, and the panel adds diminishing signal. A common configuration is to enable the panel only for long-form outputs (literature reviews, design documents) where the multi-perspective coverage is most valuable.
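
The cost arithmetic and the "long-form, first round only" policy can be sketched together. Only the WIGGUM_PANEL variable and the three-persona count come from this post; should_run_panel, min_chars, and first_round_only are hypothetical knobs that the harness does not necessarily expose.

```python
import os
from typing import Mapping, Optional

PERSONA_COUNT = 3  # one LLM call per persona per evaluation round

def max_panel_calls(max_rounds: int, first_round_only: bool = False) -> int:
    """Upper bound on panel LLM calls for a single run."""
    rounds = min(max_rounds, 1) if first_round_only else max_rounds
    return PERSONA_COUNT * rounds

def should_run_panel(
    round_index: int,
    content: str,
    env: Optional[Mapping[str, str]] = None,
    min_chars: int = 8000,          # hypothetical cutoff for "long-form"
    first_round_only: bool = True,  # hypothetical policy switch
) -> bool:
    """Decide whether to spend panel calls on this round."""
    env = os.environ if env is None else env
    if env.get("WIGGUM_PANEL", "0") != "1":
        return False  # panel is off unless explicitly enabled
    if first_round_only and round_index > 0:
        return False  # later rounds respond to already-known issues
    return len(content) >= min_chars
```

Here max_panel_calls(3) gives the 9-call worst case described above, and restricting the panel to the first round brings that down to 3.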

Standalone CLI

The panel can be run independently of the full harness pipeline on any existing document:

python -m harness.panel "design a distributed rate limiter" output.md
python -m harness.panel "summarize RLHF literature" review.md Qwen3-Coder:30b

Output prints per-persona scores, issues, and strengths to stdout, followed by the merged issue list. Useful for one-shot quality audits on documents produced outside the harness pipeline.

The 5,000-character cap means the panel sees only the beginning of a long document. For literature reviews that open with an abstract and methodology before the synthesis, the Practitioner may flag “no concrete findings” because the findings fall outside the excerpt. If the document front-loads boilerplate, consider drawing the excerpt from a later part of the document or running the panel on a summarized version.
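
One possible workaround is to sample the excerpt from several parts of the document instead of only its head. The helper below is a hypothetical sketch, not something panel.py provides; it takes equal slices from the start, middle, and end while staying within the same 5,000-character budget.

```python
def spread_excerpt(
    content: str,
    limit: int = 5000,
    parts: int = 3,
    sep: str = "\n[...]\n",
) -> str:
    """Build an excerpt of at most `limit` chars from evenly spaced slices."""
    if len(content) <= limit or parts < 2:
        return content[:limit]
    # Reserve room for the separators, then split the rest evenly.
    size = (limit - len(sep) * (parts - 1)) // parts
    if size <= 0:
        return content[:limit]
    # Slice start offsets run evenly from the document start to its tail.
    last_start = len(content) - size
    starts = [round(i * last_start / (parts - 1)) for i in range(parts)]
    return sep.join(content[s:s + size] for s in starts)

if __name__ == "__main__":
    doc = "A" * 4000 + "B" * 4000 + "C" * 4000
    excerpt = spread_excerpt(doc)
    print(len(excerpt), excerpt[0], excerpt[-1])
```

The tradeoff is that slices can begin or end mid-sentence, so reviewers may flag abrupt transitions that are artifacts of the sampling rather than of the document.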