The Wiggum Panel: Three-Persona Parallel Evaluation
The Wiggum loop uses a single evaluator model scoring output across six dimensions. That works well for most runs. But a single evaluator—even a capable one at low temperature—brings a single perspective: it cannot simultaneously optimize for technical depth, coverage completeness, and newcomer accessibility. A document can pass the dimensional rubric with a high score while still being impenetrable to a developer encountering the topic for the first time, or thin in the places a domain practitioner would notice first.
The Panel addresses this by running three reviewers concurrently after the initial Wiggum evaluation. Each comes with a distinct system prompt that anchors its perspective. Their issues are merged into the revision context alongside Wiggum’s own issues, giving the producer a richer signal before it rewrites.
The three personas
Domain Practitioner
Senior engineer with 10+ years in production systems. Focuses on whether the content is actionable and production-ready. Flags toy-level examples, vague claims, and missing implementation details. Core question: could a practitioner act on this today?
Critical Reviewer
Technical editor reviewing for completeness and intellectual rigor. Identifies gaps—what an expert would expect to see but is absent. Flags unsupported claims, missing caveats, and one-sided coverage. Reads for what is not there.
Informed Newcomer
Developer with general programming knowledge learning this specific topic for the first time. Tests whether a document is comprehensible without prior domain knowledge. Flags unexplained jargon and concepts assumed without introduction.
The split is deliberate. The Practitioner and Critic tend to find complementary issues: the Practitioner flags what’s missing in depth, the Critic flags what’s missing in breadth. The Newcomer catches problems neither would notice—because both already know the domain. A document can have sufficient technical depth and complete coverage while still being inaccessible to the person most likely to read it.
Execution: parallel threads, single content excerpt
All three personas receive the same content excerpt (capped at 5,000 characters) and the same task description. They run concurrently via ThreadPoolExecutor with one thread per persona. Wall-clock time is approximately 1x a single LLM call, not 3x—the only serial bottleneck is the slowest persona to respond.
def run_panel(task: str, content: str, model: str, trace=None) -> list[dict]:
content_excerpt = content[:5000]
with ThreadPoolExecutor(max_workers=len(PANEL_PERSONAS)) as pool:
futures = {pool.submit(_run_persona, p): p["name"] for p in PANEL_PERSONAS}
for future in as_completed(futures):
result = future.result()
if result is not None:
reviews.append(result)
Each persona returns a structured JSON object: score (0–10), issues (list of specific observations), and strengths (list of things done well). A regex fallback parses the response if JSON decoding fails.
Integration with the Wiggum loop
The panel runs after Wiggum’s evaluate step and before the revision step. When WIGGUM_PANEL=1, wiggum.py calls run_panel() and then panel_issues() to flatten and deduplicate the issue lists across personas:
# wiggum.py, inside the evaluate() round
if _PANEL_ENABLED:
panel_reviews = run_panel(task, content, evaluator_model, trace=parent_trace)
panel_issue_list = panel_issues(panel_reviews)
new_panel_issues = [i for i in panel_issue_list if i.lower() not in existing]
issues = issues + new_panel_issues # augment Wiggum's issue list
panel_issues() prefixes each issue with the originating persona name—[Domain Practitioner] entry price field missing concrete example—so the producer knows the evaluative lens behind each critique. Issues already present in Wiggum’s own list are deduplicated before appending.
↓
run_panel() → 3 threads in parallel
↓
panel_issues() → deduplicated, persona-prefixed
↓
issues = wiggum_issues + new_panel_issues
↓
revise() receives enriched issue list
The panel does not independently gate the revision decision—the pass/fail threshold is still Wiggum’s dimensional score. The panel’s contribution is richer revision context, not a separate gating signal. If Wiggum scores the output above PASS_THRESHOLD, the run completes even if panel reviewers found issues.
Logging and downstream use
Each Wiggum evaluation round record in runs.jsonl includes a panel_reviews field when the panel ran. This stores the full per-persona output—score, issues, strengths, and raw model response—for every round. The structured per-persona scores are useful for preference learning: a document where the Newcomer scores 4/10 but the Practitioner scores 8/10 is a different training signal than one where both score 6/10.
Configuration and tradeoffs
| Setting | Default | Notes |
|---|---|---|
| WIGGUM_PANEL | 0 (off) | Set to 1 to enable. Off by default because it adds 3 LLM calls per Wiggum round. |
| Content excerpt | 5,000 chars | Each persona sees the first 5,000 characters. Hardcoded in panel.py. |
| Temperature | 0.3 | Slightly higher than Wiggum’s evaluator (0.1) to allow more varied perspectives across personas. |
| Model | Inherits evaluator model | All three personas use the same model. Can be overridden via the standalone CLI. |
The main tradeoff is cost: three extra LLM calls per evaluation round. At MAX_ROUNDS=3, a run that revises twice could generate up to 9 panel calls. In practice the panel fires most usefully on the first round—subsequent revisions are responding to known issues, and the panel adds diminishing signal. A common configuration is to enable the panel only for long-form outputs (literature reviews, design documents) where the multi-perspective coverage is most valuable.
Standalone CLI
The panel can be run independently of the full harness pipeline on any existing document:
python -m harness.panel "design a distributed rate limiter" output.md
python -m harness.panel "summarize RLHF literature" review.md Qwen3-Coder:30b
Output prints per-persona scores, issues, and strengths to stdout, followed by the merged issue list. Useful for one-shot quality audits on documents produced outside the harness pipeline.
The 5,000-character content excerpt means the panel only sees the beginning of long documents. For literature reviews that open with an abstract and methodology before the synthesis, the Practitioner may flag “no concrete findings” based on incomplete context. If the document structure front-loads boilerplate, consider shuffling the excerpt or running the panel on a summarized version.