May 19, 2026 • 14 min read • Agentic Harness Engineering Series

Harness vs. Perplexity: Eight Iterations to Parity

The hypothesis was simple: a pipeline with live FRED data, Beige Book RAG, and BEA enrichment should outperform Perplexity on a current-conditions task where freshness is the key differentiator. Eight iterations later, the harness reached a tie. The story is three tooling bugs, a rubric built for code, a two-pass extraction experiment that exposed a Beige Book chunking root cause, and a depth-vs-grounded tradeoff that proved resistant to instruction alone.

Previous experiments in this series used a controlled within-pipeline design: same model, same evaluator, same research context, with one variable changed at a time. The perplexity_vs_harness experiment is different. It uses a frozen external output, a real response from Perplexity AI, as the comparison condition, evaluated through the same Wiggum scorer as every harness output. The isolation is tight: the task string and the evaluate() call are identical, and the document source is the only variable.

The specific task was chosen to favor the harness: current-conditions Fed district inflation analysis as of early 2025, where live FRED series and fresh Beige Book retrieval should provide genuine information advantage. The initial assumption was that Perplexity would be working from training data with a recency ceiling. That assumption turned out to be wrong, and it changes what the experiment actually measured.

Hypothesis: The harness pipeline (FRED data + Beige Book RAG + BEA enrichment + structured synthesis) produces a higher composite Wiggum score than Perplexity on a current-conditions Fed district inflation task. Falsified if: harness composite score ≤ perplexity composite score.

Experiment Design

The task asks about a domain where the harness has a designed advantage — live quantitative data, district-level qualitative retrieval, and structured synthesis instructions all exist specifically to answer questions like this one:

task string (identical for both conditions):

"Assess current inflation dynamics across Federal Reserve districts as of early 2025, identifying which regions show the most persistent price pressures, what the latest Beige Book reports signal about near-term conditions, and how regional variation compares to the national trend."

perplexity condition: frozen Perplexity output (captured 2026-05-31) → wiggum.evaluate()

harness condition: gather_research() + beige_book_rag() + synthesize() → wiggum.evaluate()

The Perplexity output was captured from a live session, copied verbatim, and stored as a frozen constant in the run script. This ensures the perplexity condition never changes across re-runs, and that any delta movement is attributable to changes on the harness side.

The Eight Iterations

The experiment ran eight times across two phases. The first four fixed task-design, tooling, and rubric failures; the next four chased the depth score. Here is what each one found.

v1 — 2022 historical task

Result: tie (delta 0.0 — Perplexity 7.7, Harness 7.7)

The original task asked about 2022 inflation dynamics. The harness tied instead of winning. The reason: FRED obs_limit=24 covers approximately mid-2024 to the present, so the 2022 time window falls entirely outside the live data range. Meanwhile, Perplexity's training data covers 2022 thoroughly. The harness had no information advantage. The task was redesigned to early 2025 current-conditions, where recency is the differentiator.

v2 — bugs cascade

Result: delta −0.3 (Perplexity wins)

With the task redesigned to early 2025, three separate bugs fired in sequence: FRED returned HTTP 429 (too many requests) mid-run; the BEA tool crashed with 'list' object has no attribute 'get'; and — most damaging — the Alpaca intent detector fired on the word "signal" in the task string ("what the latest Beige Book reports signal about near-term conditions"), injecting TSLA/GOOG/NVDA portfolio context into a macro inflation analysis. The harness ran on a partially corrupt research context.

v3 — bugs fixed, rubric problem

Result: delta −0.3 (still)

All three bugs were fixed. The harness now ran cleanly. But the delta didn't move. The evaluator was still penalizing the harness output for missing "implementation notes," "named tools," and "worked examples with parameters" — evaluation criteria from a rubric designed for software engineering best-practices tasks, applied verbatim to an economic research synthesis. The TASK_CRITERIA["research"] override existed in Wiggum but its depth and grounded anchors were being overridden by the base EVAL_PROMPT. The rubric needed a rewrite.

v4 — rubric fixed, RAG added

Result: delta −0.1 (depth=6, grounded=7)

TASK_CRITERIA["research"] was rewritten with explicit research-appropriate definitions for depth and grounded, and explicit instructions not to penalize for missing code/implementation/API content. SYNTH_INSTRUCTION_RESEARCH was added to require inline citations ([FRED:...], [BEA:...], Beige Book (Month Year, District)). An explicit Beige Book RAG retrieval step was added to run_harness(). The delta compressed from −0.3 to −0.1. Inspecting Perplexity's source panel revealed it retrieved 19 live sources, not training-data recall.

v5 — synthesis quote requirement

Result: delta −0.1 (no change)

SYNTH_INSTRUCTION_RESEARCH was updated to add explicit district attribution and direct quote requirements: "for each major finding, include at least one direct quote … format as > "exact quote" — Beige Book (Month Year, District)." The score did not change. Inspecting the output showed zero blockquotes and district references still using vague regional groupings. The instruction was not being followed — root-causing revealed the Beige Book chunks themselves were the problem: the top-scoring chunks were about real estate (Boston housing inventory) and LNG (Atlanta energy), not price sections.

v6 — multi-query Beige Book retrieval

Result: delta 0.0 — tied 7.7/7.7 (structure=9, grounded=7)

The single full-task query was replaced with three targeted sub-queries: "selling prices input costs inflation by district early 2025," "district price pressures tariffs cost pass-through persistent 2025," and "beige book near-term price outlook regional variation national trend 2025." The results were combined and deduplicated across 17 passage blocks. The evaluator's structure score rose from 8 to 9; depth stayed at 6. The composite reached 7.7, tied with Perplexity. This was the best stable result.
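The multi-query step can be sketched roughly as follows. This is an illustration under stated assumptions, not the harness's code: `search` stands in for whatever retriever the harness calls, and `multi_query_retrieve`, the `(chunk_id, text, score)` tuple shape, and the toy corpus are all hypothetical names and data invented for the example. Only the three sub-query strings come from the run described above.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, str, float]  # (chunk_id, text, score); hypothetical shape

SUB_QUERIES = [
    "selling prices input costs inflation by district early 2025",
    "district price pressures tariffs cost pass-through persistent 2025",
    "beige book near-term price outlook regional variation national trend 2025",
]


def multi_query_retrieve(
    search: Callable[[str, int], List[Hit]],
    queries: List[str],
    top_k: int = 6,
) -> List[Hit]:
    """Run each sub-query, then merge hits, keeping the best score per chunk."""
    best: Dict[str, Hit] = {}
    for query in queries:
        for hit in search(query, top_k):
            chunk_id, _text, score = hit
            if chunk_id not in best or score > best[chunk_id][2]:
                best[chunk_id] = hit
    # Highest-scoring chunks first, so the synthesis context leads with them.
    return sorted(best.values(), key=lambda h: h[2], reverse=True)


def toy_search(query: str, top_k: int) -> List[Hit]:
    """Stand-in retriever: scores made-up chunks by word overlap with the query."""
    corpus = {
        "ny-prices": "selling prices rose as firms cited tariffs and freight costs",
        "bos-housing": "housing inventory remained low in the district",
        "atl-prices": "input costs increased and pass-through to prices persisted",
    }
    words = set(query.lower().split())
    hits = [
        (cid, text, float(len(words & set(text.split()))))
        for cid, text in corpus.items()
    ]
    hits = [h for h in hits if h[2] > 0]
    return sorted(hits, key=lambda h: h[2], reverse=True)[:top_k]


passages = multi_query_retrieve(toy_search, SUB_QUERIES)
```

Keying the merge on a chunk identifier rather than on the text keeps one copy of a passage that several sub-queries return, which is the deduplication the v6 change relied on.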

v7 — extraction pass (Beige only)

Result: delta −0.2 (depth=7, but grounded=6, specificity=7)

A two-pass approach: a fast qwen3-8b extraction call over the raw Beige Book chunks produced structured "District — Date / Observation / Quote / Cite as:" entries, which were fed to synthesis instead of the raw passages. Depth moved to 7 — the target. But grounded dropped from 7 to 6 and specificity dropped from 8 to 7. The extracted observations stripped the [FRED:...] citation anchors from the research context, and the synthesis model lost the inline citation discipline that grounded had depended on.

v8 — raw + extraction (both)

Result: delta −0.1 (depth=6, grounded=7 — reverted)

Both raw Beige Book context (for citation anchors and verbatim text) and extracted observations (as a structured guide) were appended to the synthesis context. The hypothesis was that the model would use the extraction as a guide while preserving citation discipline from the raw context. It did not — the raw context dominated and the model reverted to its v6 summarization behavior, ignoring the extraction entirely. Depth returned to 6, grounded to 7.

Bug Deep-Dives

FRED HTTP 429 — Rate Limiting Architecture

The FRED API allows approximately 2 requests per second. The original fred_tool.py added a _SERIES_DELAY = 0.6s sleep between series calls, but get_series() makes two _get() calls internally (one for metadata, one for observations). With 5 series and 2 calls each, that's 10 HTTP requests with only 4 inter-series sleeps — bursts that reliably exceeded the rate limit.

The fix moved the throttle from between-series to per-request, using a module-level timestamp:

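import json
import time
import urllib.error
import urllib.parse
import urllib.request

_BASE_URL = "https://api.stlouisfed.org/fred"  # assumed; not shown in the original excerpt
_TIMEOUT = 15        # seconds; assumed value, not stated in the post
_MAX_RETRIES = 4
_RETRY_BASE = 3.0    # seconds; backoff waits 3s, 6s, 12s
_MIN_INTERVAL = 0.7  # ~1.4 req/s, safely under FRED's ~2 req/s limit
_last_req_t: float = 0.0

def _get(endpoint, params):
    global _last_req_t
    # URL construction is an assumption; the original excerpt omits it.
    url = f"{_BASE_URL}/{endpoint}?{urllib.parse.urlencode(params)}"
    # Throttle every request, whichever function issued it.
    elapsed = time.monotonic() - _last_req_t
    if elapsed < _MIN_INTERVAL:
        time.sleep(_MIN_INTERVAL - elapsed)
    for attempt in range(_MAX_RETRIES):
        try:
            with urllib.request.urlopen(url, timeout=_TIMEOUT) as resp:
                result = json.loads(resp.read().decode("utf-8"))
            _last_req_t = time.monotonic()
            return result
        except urllib.error.HTTPError as e:
            if e.code == 429 and attempt < _MAX_RETRIES - 1:
                # Exponential backoff on rate-limit responses.
                time.sleep(_RETRY_BASE * (2 ** attempt))
                _last_req_t = time.monotonic()
                continue
            raise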

With _MIN_INTERVAL = 0.7s, every individual HTTP call is rate-throttled regardless of which function calls it. The inter-series sleep was removed entirely since the per-request throttle makes it redundant. _MAX_RETRIES = 4 and exponential backoff starting at 3s handle transient overload.
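To see why the between-series sleep bursts while the per-request throttle does not, the two schedules can be simulated without touching the network. The sketch below is a back-of-the-envelope model, not a measurement: the 0.05s per-request latency is an assumed figure, and `old_schedule`, `new_schedule`, and `max_in_window` are hypothetical helper names.

```python
from typing import List, Optional


def old_schedule(n_series: int = 5, calls_per_series: int = 2,
                 series_delay: float = 0.6, latency: float = 0.05) -> List[float]:
    """Request start times when the sleep sits between series only."""
    t, starts = 0.0, []
    for i in range(n_series):
        for _ in range(calls_per_series):
            starts.append(t)
            t += latency              # assumed time for one HTTP round trip
        if i < n_series - 1:
            t += series_delay         # only 4 sleeps for 10 requests
    return starts


def new_schedule(n_requests: int = 10, min_interval: float = 0.7,
                 latency: float = 0.05) -> List[float]:
    """Request start times when every call waits min_interval after the last one."""
    t, starts = 0.0, []
    last_done: Optional[float] = None
    for _ in range(n_requests):
        if last_done is not None and t - last_done < min_interval:
            t = last_done + min_interval
        starts.append(t)
        t += latency
        last_done = t
    return starts


def max_in_window(starts: List[float], window: float = 1.0) -> int:
    """Largest number of request starts inside any window beginning at a request."""
    return max(sum(1 for s in starts if a <= s < a + window) for a in starts)


old_burst = max_in_window(old_schedule())
new_burst = max_in_window(new_schedule())
```

Under these assumptions the old schedule packs four requests into a single one-second window, while the per-request throttle never allows more than two. A slower real-world latency would spread both schedules out, but the old design still fires each series' two calls back to back.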

BEA 'list' object has no attribute 'get'

The BEA API returns BEAAPI.Results as either a dict or a list depending on the dataset. GDPbyIndustry returns a list. It also nests data inside results["GDPbyIndustry"][n]["Data"] rather than the flat results["Data"] path that all other datasets use. Two code paths needed fixing:

# Fix 1: normalize list Results at parse time
results = raw.get("BEAAPI", {}).get("Results", {})
if isinstance(results, list):
    results = results[0] if results else {}

# Fix 2: GDPbyIndustry-specific data extraction
if dataset == "GDPbyIndustry":
    gdpi_list = results.get("GDPbyIndustry", [])
    data = []
    for item in (gdpi_list if isinstance(gdpi_list, list) else []):
        if isinstance(item, dict):
            data.extend(item.get("Data", []))
    if not data:
        data = results.get("Data", [])
else:
    data = results.get("Data", [])
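The two fixes can be folded into one helper and exercised against small payloads. The sketch below is a hedged restatement of the fragment above: `extract_rows` is a hypothetical function name, and the three payloads are synthetic shapes built from the description in this section, not captured BEA responses.

```python
from typing import Any, Dict, List


def extract_rows(raw: Dict[str, Any], dataset: str) -> List[Dict[str, Any]]:
    """Return the data rows from a BEA-style response, whatever shape Results has."""
    results = raw.get("BEAAPI", {}).get("Results", {})
    if isinstance(results, list):      # some datasets wrap Results in a list
        results = results[0] if results else {}
    if not isinstance(results, dict):
        return []
    if dataset == "GDPbyIndustry":
        rows: List[Dict[str, Any]] = []
        nested = results.get("GDPbyIndustry", [])
        for item in nested if isinstance(nested, list) else []:
            if isinstance(item, dict):
                rows.extend(item.get("Data", []))
        if rows:
            return rows
    # Flat path used by the other datasets, and the fallback for GDPbyIndustry.
    return results.get("Data", [])


# Synthetic payloads (made up for illustration).
flat = {"BEAAPI": {"Results": {"Data": [{"value": 1.0}]}}}
nested = {"BEAAPI": {"Results": [
    {"GDPbyIndustry": [{"Data": [{"value": 2.0}]}, {"Data": [{"value": 3.0}]}]}
]}}
empty = {"BEAAPI": {"Results": []}}

flat_rows = extract_rows(flat, "Regional")
nested_rows = extract_rows(nested, "GDPbyIndustry")
empty_rows = extract_rows(empty, "GDPbyIndustry")
```

Normalizing the shape in one place means callers always receive a list of rows and never need to call `.get` on something that might be a list.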

Alpaca False Positive on "signal"

The task string contains the phrase "what the latest Beige Book reports signal about near-term conditions." The original _has_trading_intent() function used a token-level set membership check. The word "signal" was in _TRADING_KEYWORDS. So were "short," "long," "position," and several other words that appear naturally in economic analysis prose.

When Alpaca fired, it injected a prompt preamble about the user's paper trading portfolio (TSLA, GOOG, NVDA holdings, stop-loss levels, position sizing rules) into the front of the research context. The synthesis model then spent part of its output addressing portfolio implications of Fed district inflation — a coherent but completely off-task tangent that hurt depth and specificity on the actual task.

The fix replaced token matching with phrase matching, requiring explicit trading phrases (mostly multi-word) that are unlikely to appear in economic research prose:

_TRADING_PHRASES = frozenset({
    "paper trade", "paper trading", "alpaca",
    "buy signal", "sell signal", "long thesis", "short thesis",
    "trading thesis", "trade thesis", "entry point", "exit point",
    "position size", "portfolio allocation", "place order", "limit order",
    "market order", "trade idea", "trade setup", "trade recommendation",
    "actionable trade", "open position", "long position", "short position",
})

def _has_trading_intent(query: str) -> bool:
    q = query.lower()
    return any(phrase in q for phrase in _TRADING_PHRASES)
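A side-by-side check on the actual task string shows the difference. The token matcher below is a reconstruction for illustration only: the post names four of the original keywords but does not show the original function, so `token_intent` and its four-word keyword set are assumptions. The phrase set is copied from the fix above.

```python
import re

TASK = (
    "Assess current inflation dynamics across Federal Reserve districts as of "
    "early 2025, identifying which regions show the most persistent price "
    "pressures, what the latest Beige Book reports signal about near-term "
    "conditions, and how regional variation compares to the national trend."
)

# Only the four keywords named in the post; the full original set is not shown.
TRADING_KEYWORDS = frozenset({"signal", "short", "long", "position"})

TRADING_PHRASES = frozenset({
    "paper trade", "paper trading", "alpaca",
    "buy signal", "sell signal", "long thesis", "short thesis",
    "trading thesis", "trade thesis", "entry point", "exit point",
    "position size", "portfolio allocation", "place order", "limit order",
    "market order", "trade idea", "trade setup", "trade recommendation",
    "actionable trade", "open position", "long position", "short position",
})


def token_intent(query: str) -> bool:
    """Reconstructed old behavior: fire if any single token is a trading keyword."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    return bool(tokens & TRADING_KEYWORDS)


def phrase_intent(query: str) -> bool:
    """New behavior: fire only on an explicit trading phrase."""
    q = query.lower()
    return any(phrase in q for phrase in TRADING_PHRASES)


old_fires = token_intent(TASK)
new_fires = phrase_intent(TASK)
```

Because the phrase check is a substring match, "short positions" still matches "short position". The same property means a single-word entry such as "alpaca" would match inside any longer word containing it, which is worth keeping in mind when extending the set.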

Evaluator Rubric Mismatch

This was the most subtle failure mode. The Wiggum evaluator's EVAL_PROMPT defines depth partly as: "Does the response include implementation notes, named tools, worked examples with parameters?" That language is appropriate for a software engineering best-practices task. It is not appropriate for an economic research synthesis.

TASK_CRITERIA["research"] existed to override this, but its formulation wasn't strong enough to suppress the base rubric's anchoring effect. The harness output scored depth=6, and the evaluator's feedback explicitly cited "missing implementation notes," a criterion that has no meaningful interpretation for inflation analysis.

The fix added explicit negative constraints alongside research-appropriate positive definitions:

"research": (
    "This is a research synthesis task — NOT a software engineering or coding task.\n"
    "DO NOT penalize for missing 'implementation notes', 'code examples', or "
    "'named tools/APIs' — those criteria do not apply here.\n\n"
    "Reinterpret depth and grounded as follows for this task type:\n"
    "- depth: Does each section provide specific supporting evidence — data points, "
    "statistics, named regions/sectors/entities, direct quotes from sources, or "
    "mechanisms that explain the 'why'? A section that names a region but gives no "
    "figures, dates, or causal explanation is depth=6. A section with specific "
    "numbers, named actors, and causal reasoning is depth=8.\n"
    "- grounded: Are empirical claims traceable to named real-world sources — "
    "specific reports, datasets, agencies, or publications a reader could look up? "
    "Vague references to 'the Beige Book' without district or date are grounded=6. "
    "Named reports with period and district are grounded=8.\n"
)

After this fix, the perplexity output's grounded score dropped from 7 to 6 (its Beige Book references had no specific report dates or districts), while the harness output's specificity rose from 7 to 8 (its [FRED:...] inline citations are exactly the kind of traceable sourcing the corrected rubric rewards).

Final Results

Hypothesis verdict: FALSIFIED. The harness never strictly outperformed Perplexity. Best stable result (v6): Perplexity 7.7 | Harness 7.7 | Delta: 0.0. Practical parity achieved; hypothesis not met.

Best stable result (v6: multi-query retrieval, no extraction):

Condition	Composite	Delta
perplexity	7.7	reference
harness v6	7.7	0.0

Per-dimension breakdown at v6:

Dimension	Weight	Perplexity	Harness v6	Δ (H−P)
relevance	0.20	9	9	0.0
completeness	0.20	8	8	0.0
depth	0.25	7	6	−1.0
grounded	0.15	6	7	+1.0
specificity	0.10	8	8	0.0
structure	0.10	8	9	+1.0

The harness reaches parity through an explicit trade: it loses depth (−1.0, synthesis summarizes rather than quotes district-level specifics) but gains grounded (+1.0, structured inline citations) and structure (+1.0, better-organized output from more targeted retrieval). Perplexity's depth advantage comes from retrieving and synthesizing district-specific cost categories directly from primary source PDFs. The harness groundedness advantage comes from citation discipline enforced by SYNTH_INSTRUCTION_RESEARCH.
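The arithmetic behind the tie can be checked directly from the table. The sketch assumes the composite is a plain weighted sum of the six dimension scores; the post does not show Wiggum's formula, but the table's weights sum to 1.0 and the result matches the reported scores. Weights are held as integer percentages so the comparison is exact.

```python
# Weights from the per-dimension table, as integer percentages (sum to 100).
WEIGHTS_PCT = {
    "relevance": 20, "completeness": 20, "depth": 25,
    "grounded": 15, "specificity": 10, "structure": 10,
}

PERPLEXITY = {
    "relevance": 9, "completeness": 8, "depth": 7,
    "grounded": 6, "specificity": 8, "structure": 8,
}
HARNESS_V6 = {
    "relevance": 9, "completeness": 8, "depth": 6,
    "grounded": 7, "specificity": 8, "structure": 9,
}


def composite_hundredths(scores: dict) -> int:
    """Weighted sum in hundredths of a point (assumed formula, see lead-in)."""
    return sum(WEIGHTS_PCT[dim] * scores[dim] for dim in WEIGHTS_PCT)


def weighted_deltas(a: dict, b: dict) -> dict:
    """Per-dimension contribution of (a minus b) to the composite, in hundredths."""
    return {dim: WEIGHTS_PCT[dim] * (a[dim] - b[dim]) for dim in WEIGHTS_PCT}


perplexity_total = composite_hundredths(PERPLEXITY)
harness_total = composite_hundredths(HARNESS_V6)
deltas = weighted_deltas(HARNESS_V6, PERPLEXITY)
```

Both totals come to 765 hundredths, that is 7.65, consistent with the reported 7.7 if the scorer rounds half up. The trade nets to exactly zero because depth carries a weight of 0.25 while grounded and structure together carry 0.15 + 0.10 = 0.25.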

A Corrected Premise: Perplexity Is Also a Retrieval System

Inspecting Perplexity's source panel showed 19 sources, retrieved live at query time. This is not training-data recall; it is web retrieval, and the source quality is high:

Source	Document
federalreserve.gov	[PDF] Beige Book — Federal Reserve (primary)
bostonfed.org	The Beige Book – First District (Boston)
philadelphiafed.org	[PDF] Federal Reserve Bank of Philadelphia
minneapolisfed.org	Philadelphia: March 2025 | Federal Reserve Bank of Minneapolis
newyorkfed.org	Summer of ‘25: The Data — FEDERAL RESERVE BANK OF NEW YORK
chicagofed.org	What is driving the differences in inflation across U.S. regions?
scribd.com	BeigeBook_20250305 [PDF] (February 2025 full report)
metaintelligence	Beige Book — August 2025
economics.td	U.S. Federal Reserve Beige Book (January 2026)
ainvest (×2)	March 2025 Review; Fed Policy and Inflation Dynamics 2025
binance / mexc	Beige Book March 5, 2025 coverage
suerf.org	[PDF] Consumer Inflation Expectations and Regional Price Changes
YouTube (×2)	February 2025 Beige Book summaries
cmegroup.com	US: Beige Book — CME Group
en.econostrum	Persistent Inflation Pressures Test Federal Reserve’s Resilience in 2025

Several things are notable about this source list. First, it includes actual PDF downloads from district Fed websites: the Boston and Philadelphia Beige Book PDFs, the full March 5, 2025 report, and the primary federalreserve.gov PDF. These are the primary sources, not summaries of them. Second, the temporal range is wider than the task strictly requires: sources include an August 2025 Beige Book summary and a January 2026 Beige Book, even though the task asks about early 2025. Perplexity pulled more temporal context than requested. Third, the Chicago Fed research paper on regional inflation differences (“housing sector is the main driver of regional inflation differences”) directly supports the causal language in the response.

Why the Depth Gap Persists

Both systems were retrieving from the same underlying corpus — the Beige Book PDFs are public documents, and the harness has indexed them in ChromaDB. If the same information is available to both, the depth gap cannot be explained by information access. It has to be downstream: how each system extracts and presents what it retrieved.

There is a retrieval fidelity distinction worth noting: embedding-search over top_k=6 chunks returns what cosine similarity ranks highest against the query embedding. It may not surface the specific passage "New York: selling price increases picked up; firms flagged coffee, eggs, freight, tariff risk" if that passage wasn't among the top-6 ranked chunks. Perplexity retrieving the full Boston and Philadelphia PDFs can scan every sentence. Same corpus, different recall coverage on specific district-level details.

But the larger factor is synthesis behavior. Given whatever district-level passages the harness context contained, the synthesis model summarized them (“regions with high exposure to trade-sensitive industries”) rather than quoting them with the district names and cost categories attached. The evaluator cited exactly this: "it could mention which Federal Reserve districts are experiencing the most persistent price pressures and provide concrete data points or quotes." The information may have been in the context; the synthesis model didn't extract it at the specificity level that scores well on depth.

The real asymmetry is synthesis, not information. Both systems retrieved from the same public corpus. The depth gap reflects that the harness synthesis model, given that context, produces structurally organized summaries rather than district-attributed quotes. Perplexity's synthesis layer commits to specific figures and named cost categories (“coffee, chocolate, eggs, freight, insurance, utilities”); the harness synthesis hedges to generalizations. That is a synthesis quality gap, not a data access gap.

The extraction experiment (v7) confirmed this. When the raw Beige Book chunks were replaced with pre-digested district/quote pairs, depth rose to 7 — the synthesis model used the structured material. But grounded fell from 7 to 6 and specificity from 8 to 7 because the extracted observations didn't carry the [FRED:...] citation anchors the synthesis model had been using from the research context. Appending both raw and extracted context (v8) didn't help: the raw context dominated and the model reverted to summarization, depth back to 6.

The tradeoff exposed by v7 vs. v8 points to the Beige Book index as the actual root cause. The price sections of district reports are chunked together with real estate, agricultural, and banking sections. Embedding similarity for "inflation" and "price pressures" returns whatever scored highest in those mixed chunks — which happened to be home sales inventory (Boston) and LNG pipeline development (Atlanta). If the Beige Book index were re-chunked to isolate each report's "Prices" subsection, the raw context itself would contain the district-level cost-category language needed for depth=7, with no extraction pass required and no citation discipline lost.
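One way the re-chunking could look is sketched below. It is a sketch under assumptions, not the harness's indexer: `chunk_by_section` is a hypothetical helper, the heading list is a guess at typical district-report section names, and the sample report is invented text rather than a real Beige Book excerpt.

```python
from typing import Dict, List, Optional

# Assumed heading names; real district reports may use different or extra headings.
SECTION_HEADINGS = (
    "Summary of Economic Activity",
    "Labor Markets",
    "Prices",
    "Real Estate and Construction",
    "Energy",
)


def chunk_by_section(report: str, district: str, period: str) -> List[Dict[str, str]]:
    """Split one district report into one chunk per section heading."""
    chunks: List[Dict[str, str]] = []
    heading: Optional[str] = None
    lines: List[str] = []
    # The trailing None is a sentinel that flushes the final section.
    for line in report.splitlines() + [None]:
        is_heading = line is not None and line.strip() in SECTION_HEADINGS
        if line is None or is_heading:
            text = " ".join(l.strip() for l in lines if l.strip())
            if heading and text:   # text before the first heading is dropped
                chunks.append({
                    "district": district,
                    "period": period,
                    "section": heading,
                    "text": text,
                })
            heading = line.strip() if line is not None else None
            lines = []
        else:
            lines.append(line)
    return chunks


# Invented sample text, for illustration only.
SAMPLE = """Labor Markets
Employment was flat.
Prices
Selling prices rose moderately. Contacts cited freight and tariff risk.
Real Estate and Construction
Housing inventory remained low.
"""

chunks = chunk_by_section(SAMPLE, "New York", "March 2025")
prices = [c for c in chunks if c["section"] == "Prices"]
```

With a section field attached to every chunk, retrieval for an inflation task could be restricted to "Prices" chunks through a metadata filter, if the vector store supports one, so housing and energy passages no longer compete for the top-k slots.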

What the Experiment Validated

A falsified hypothesis is still an informative result. Several things were confirmed or clarified:

The rubric matters as much as the pipeline. The delta was −0.3 before the evaluator fix and −0.1 after v4. That 0.2-point move is two-thirds of the total 0.3-point improvement across the experiment, and it arrived in the iteration that corrected the evaluation criterion. Because v4 also added the citation instruction and the explicit Beige Book RAG step, the rubric's share cannot be fully isolated, but an evaluator that applies software engineering rubrics to economic research will systematically undercount the harness's real output quality.

Intent detection false positives are silent failures. The Alpaca false positive injected irrelevant content that degraded synthesis quality without any obvious error signal in the run logs. It only became apparent when inspecting the research context and noticing equity ticker references in a macro inflation task. Phrase-level matching, or some equally context-aware check, is needed for any intent detector whose vocabulary overlaps with general analytical language.

API throttle architecture matters at the per-call level. A between-series sleep cannot handle a tool that makes multiple HTTP calls per series. The throttle has to be inside the HTTP call function itself, tied to the last successful request timestamp, not to the outer loop iteration.

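Task design for evaluating freshness. A valid comparison between a retrieval-augmented pipeline and a pre-trained model requires a task that genuinely falls outside the model's training data or near its recency boundary. Tasks with rich historical coverage in training data are effectively measuring synthesis quality against training-data recall, which the harness does not win. It also requires confirming that the comparison system really is answering from training data; Perplexity was retrieving live sources, so this experiment ended up comparing two retrieval systems.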

What Remains

The depth gap is a Beige Book index problem, not a synthesis instruction problem. Eight iterations of instruction tuning, retrieval sub-query design, and two-pass extraction confirmed that the synthesis model can produce depth=7 when given the right material, but cannot reliably extract district-specific cost-category language from chunks that are mostly about real estate and agriculture. Re-chunking the Beige Book index to isolate the "Prices" subsection of each district report would let the retrieval system surface the relevant material without needing the extraction pass — and without the grounded/specificity tradeoff.

The groundedness and structure advantages are real and durable: inline citations with [FRED:{series_id}:{date}] and Beige Book (Month Year, District) make claims traceable in a way that a conversational AI response is not. Perplexity produces better depth; the harness produces better attribution. For use cases where sourcing matters — compliance, audit, downstream fact-checking — the harness advantage on grounded persists regardless of the composite delta.
