May 28, 2026 • Agentic Harness Engineering Series

The Position Swap: Beige Book RAG Results and the DPO Cold-Start Problem

Prepending Federal Reserve Beige Book passages to synthesis context hurt mean composite score by 0.08 points. Appending the same passages helped by 0.13. A falsified hypothesis — and what the position swap reveals about building domain-grounded DPO training data from scratch.

The hypothesis behind the beige_book_rag experiment was straightforward: injecting Federal Reserve Beige Book passages into synthesis context should improve mean composite score by at least 0.5 points, with the largest gains in the grounded and specificity dimensions. The Beige Book is primary-source economic intelligence — district-level field reports from business contacts, qualitative but direct. If any external corpus should improve the groundedness of economic synthesis, it is this one.

The hypothesis was falsified. Mean composite delta for the standard treatment (RAG prepended to context) versus control (web search only) was −0.08. The RAG made things slightly worse.

The position swap condition told a different story. When the same Beige Book passages were appended to the end of context rather than prepended to the front, mean composite delta versus control was +0.13. Same documents, different position, opposite direction of effect.

This post covers what the experiment actually showed, why position matters mechanically, and how a falsified hypothesis generates the exact kind of controlled DPO preference pairs the training pipeline needs.

Experiment Design

The beige_book_rag experiment uses synthesis-only isolation: research context is gathered once per task, cached, then synthesized four times — once per condition — so that any score difference is attributable to the RAG factor alone rather than web search variance. Six tasks cover different economic domains and time periods:

| Task | Domain | Period |
| --- | --- | --- |
| T_BB_A | Regional inflation variation | 2022 |
| T_BB_B | Labor market tightness | 2021–22 |
| T_BB_C | Manufacturing sentiment pre/post COVID | 2019 vs 2021–22 |
| T_BB_D | Consumer spending and credit conditions | 2024–25 |
| T_BB_E | Housing market | 2023–24 |
| T_BB_F | 1999 dot-com parallel evaluation | 2025–26 vs 1999 |

Each task was run under four conditions. The control and standard treatment were the primary comparison; RAG-only and RAG-end were ablations added to isolate position and web-search contributions separately.

Control
[web search context] → synthesis
Treatment (RAG prepended)
[beige book passages] + [web search context] → synthesis
RAG-end (position swap)
[web search context] + [beige book passages] → synthesis
RAG-only (ablation)
[beige book passages only — no web search] → synthesis
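
The four layouts can be sketched as a single assembly function. This is a minimal illustration rather than the harness's own code: build_context, its argument names, and the blank-line separator are hypothetical. Only the condition names and the ordering of the blocks come from the list above.

```python
def build_context(condition: str, web_context: str, passages: list[str]) -> str:
    """Assemble synthesis context for one condition (hypothetical helper)."""
    rag_block = "\n\n".join(passages)
    layouts = {
        "control": [web_context],                # web search only
        "treatment": [rag_block, web_context],   # RAG prepended
        "rag_end": [web_context, rag_block],     # position swap: RAG appended
        "rag_only": [rag_block],                 # ablation: no web search
    }
    if condition not in layouts:
        raise ValueError(f"unknown condition: {condition!r}")
    return "\n\n".join(layouts[condition])


# The cached web context is reused unchanged across all four conditions,
# so only the RAG factor differs between the resulting contexts.
web = "WEB: cached research context"
bb = ["BB: passage one", "BB: passage two"]
contexts = {
    cond: build_context(cond, web, bb)
    for cond in ("control", "treatment", "rag_end", "rag_only")
}
```

Treatment and RAG-end contain exactly the same text in this sketch; only the order differs, which is what lets the position swap isolate placement from content.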

Beige Book retrieval used query_beige_book(task, top_k=5): five passages retrieved from a local FAISS index over Federal Reserve Beige Book releases from 2019 through 2025. Retrieved text ranged from roughly 1,100 to 6,500 characters per task.

Results

The overall composite scores across conditions, averaged over six tasks:

| Condition | Mean composite | Δ vs control |
| --- | --- | --- |
| RAG-end | 7.75 | +0.13 |
| RAG-only | 7.70 | +0.08 |
| Control | 7.62 | baseline |
| Treatment | 7.53 | −0.08 |

Hypothesis verdict: FALSIFIED. The predicted +0.5 gain from treatment was not observed. Observed mean delta: −0.08. The RAG condition with the largest gain was RAG-end (+0.13), not the primary treatment, and the gain falls well short of the +0.5 threshold.

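Per-task breakdown (Δ is treatment minus control; the two ablations are shown alongside):

| Task | Control | Treatment | Δ | RAG-only | RAG-end |
| --- | --- | --- | --- | --- | --- |
| T_BB_A (inflation) | 7.6 | 7.6 | 0.0 | 7.4 | 7.8 |
| T_BB_B (labor) | 7.7 | 7.5 | −0.2 | 7.5 | 7.9 |
| T_BB_C (manufacturing) | 7.9 | 7.5 | −0.4 | 7.9 | 7.9 |
| T_BB_D (consumer/credit) | 7.5 | 7.4 | −0.1 | 7.9 | 7.9 |
| T_BB_E (housing) | 7.5 | 7.6 | +0.1 | 7.6 | 7.5 |
| T_BB_F (dot-com parallel) | 7.5 | 7.6 | +0.1 | 7.9 | 7.5 |
| Mean | 7.62 | 7.53 | −0.08 | 7.70 | 7.75 |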
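
The Mean row can be reproduced from the per-task scores. The snippet below is only a consistency check on the table: the dictionary layout and variable names are illustrative, and the scores are copied from the rows above.

```python
from statistics import mean

# Per-task composite scores copied from the table above
scores = {
    "T_BB_A": {"control": 7.6, "treatment": 7.6, "rag_only": 7.4, "rag_end": 7.8},
    "T_BB_B": {"control": 7.7, "treatment": 7.5, "rag_only": 7.5, "rag_end": 7.9},
    "T_BB_C": {"control": 7.9, "treatment": 7.5, "rag_only": 7.9, "rag_end": 7.9},
    "T_BB_D": {"control": 7.5, "treatment": 7.4, "rag_only": 7.9, "rag_end": 7.9},
    "T_BB_E": {"control": 7.5, "treatment": 7.6, "rag_only": 7.6, "rag_end": 7.5},
    "T_BB_F": {"control": 7.5, "treatment": 7.6, "rag_only": 7.9, "rag_end": 7.5},
}

conditions = ("control", "treatment", "rag_only", "rag_end")

# Mean composite per condition, then each condition's delta against control
means = {c: mean(task[c] for task in scores.values()) for c in conditions}
deltas = {c: means[c] - means["control"] for c in conditions if c != "control"}

for c in conditions:
    label = "baseline" if c == "control" else f"{deltas[c]:+.2f}"
    print(f"{c:<10} {means[c]:.2f}  {label}")
```

The deltas are computed from the unrounded means, which is why treatment comes out at −0.08 rather than the −0.09 that subtracting the rounded figures (7.53 − 7.62) would give.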

The dimension-level picture shows where the treatment diverged. Treatment and control have identical grounded means (7.0 each), even though grounded was one of the hypothesis's two predicted beneficiaries. Specificity, the other, fell by 0.5. The damage is concentrated in depth and specificity:

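| Dimension | Weight | Control | Treatment | Δ | RAG-only | RAG-end |
| --- | --- | --- | --- | --- | --- | --- |
| relevance | 0.20 | 9.0 | 9.0 | 0.0 | 9.0 | 9.0 |
| completeness | 0.20 | 8.0 | 8.0 | 0.0 | 8.0 | 8.0 |
| depth | 0.25 | 6.7 | 6.5 | −0.2 | 6.7 | 6.8 |
| grounded | 0.15 | 7.0 | 7.0 | 0.0 | 7.5 | 7.3 |
| specificity | 0.10 | 6.8 | 6.3 | −0.5 | 6.7 | 7.2 |
| structure | 0.10 | 8.0 | 8.0 | 0.0 | 8.0 | 8.0 |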
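
Assuming the composite is the weighted sum of the six dimension scores (the Weight column suggests this, though the scoring code is not shown here), the dimension table can be checked against the reported composites. The dimension means are rounded to one decimal, so the recomputed values should land near the reported figures rather than exactly on them.

```python
# Weights and dimension means copied from the table above
weights = {
    "relevance": 0.20, "completeness": 0.20, "depth": 0.25,
    "grounded": 0.15, "specificity": 0.10, "structure": 0.10,
}

dimension_means = {
    "control":   {"relevance": 9.0, "completeness": 8.0, "depth": 6.7,
                  "grounded": 7.0, "specificity": 6.8, "structure": 8.0},
    "treatment": {"relevance": 9.0, "completeness": 8.0, "depth": 6.5,
                  "grounded": 7.0, "specificity": 6.3, "structure": 8.0},
    "rag_only":  {"relevance": 9.0, "completeness": 8.0, "depth": 6.7,
                  "grounded": 7.5, "specificity": 6.7, "structure": 8.0},
    "rag_end":   {"relevance": 9.0, "completeness": 8.0, "depth": 6.8,
                  "grounded": 7.3, "specificity": 7.2, "structure": 8.0},
}

# Mean composites as reported in the results table
reported = {"control": 7.62, "treatment": 7.53, "rag_only": 7.70, "rag_end": 7.75}


def composite(dims: dict[str, float]) -> float:
    """Weighted sum of dimension scores (assumed aggregation rule)."""
    return sum(weights[name] * dims[name] for name in weights)


recomputed = {cond: composite(dims) for cond, dims in dimension_means.items()}
for cond in reported:
    print(f"{cond:<10} recomputed {recomputed[cond]:.3f}  reported {reported[cond]:.2f}")
```

Under that assumption the ordering of conditions is preserved, and the gap between treatment and control comes entirely from depth (0.25 × 0.2) and specificity (0.10 × 0.5).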

Why Position Matters

The contrast between treatment (−0.08) and RAG-end (+0.13) from identical document sets points to a context window primacy effect. When the Beige Book passages arrive first in the context window, they shape the model’s framing of the task before it has read the web-sourced research context. The synthesis instruction and task description are already downstream of several thousand characters of Fed district reports. The model is, in some sense, synthesizing a Beige Book summary rather than using the Beige Book to enrich a research synthesis.

The dimension data supports this reading. Treatment scores lower than control on depth (−0.2) and especially specificity (−0.5). These are the dimensions most sensitive to whether the output addresses the actual task question in concrete terms, versus producing thematically relevant but generically organized content. Beige Books are useful as corroboration; they are not substitute research. When positioned first, they appear to anchor the output format.

Position finding: For synthesis tasks that require integrating retrieved domain context with active web research, appended context (+0.13) outperformed prepended context (−0.08) on average and on four of six tasks (T_BB_A through T_BB_D, by 0.2 to 0.5 points); on T_BB_E and T_BB_F, prepending scored 0.1 higher. The retrieved material works better as a supplement to an already-framed synthesis than as a primer for it.

The RAG-only condition (no web search, Beige Book only) also outperformed the standard treatment: +0.08 versus −0.08. Six tasks is a small sample, but the pattern is suggestive: Beige Book passages alone, with no web-search context for them to reframe, scored well above web-search-only on T_BB_D and T_BB_F (both +0.4 vs control, reaching 7.9), matched control on T_BB_C (7.9 each), and edged it on T_BB_E (+0.1). T_BB_C, T_BB_D, and T_BB_F involve direct comparison or evaluation using Fed district data, which is exactly what the Beige Book covers. Tasks requiring point-in-time market data (T_BB_A, T_BB_B) scored 0.2 below control under RAG-only.

The DPO Pipeline

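The training pipeline in scripts/ has three DPO signal sources. Two existed before this experiment: cross-run pairs, drawn from the same task run multiple times with different outcomes, and revision pairs, drawn from multi-round Wiggum evaluation with per-round content stored.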

The experiment adds a third:

_RAG_CONDITIONS = ("rag_end", "treatment", "rag_only")  # preference order

def build_rag_pairs(records: list[dict], min_delta: float) -> list[dict]:
    by_task: dict[str, dict[str, dict]] = {}
    for r in records:
        tid  = r.get("task_id", "")
        cond = r.get("condition", "")
        if tid and cond in ("control", *_RAG_CONDITIONS):
            by_task.setdefault(tid, {})[cond] = r

    pairs = []
    for task_id, pair in by_task.items():
        ctrl = pair.get("control")
        # Pick best available RAG condition in preference order
        rag_cond = next((c for c in _RAG_CONDITIONS if c in pair), None)
        ...
        if trt_score >= ctrl_score:
            chosen, rejected = trt, ctrl
        else:
            chosen, rejected = ctrl, trt  # control was better: reject RAG
        ...

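At the default --min-delta 0.3, the experiment generates one pair: T_BB_D, where RAG-end (7.9) vs. control (7.5) produces a delta of 0.4. At --min-delta 0.1, the harvest expands to include the +0.2 RAG-end cases (T_BB_A and T_BB_B). Because the selection takes the first available condition in preference order, RAG-only results enter only for tasks that lack a RAG-end record; on the per-task scores, T_BB_F with RAG-only (7.9 vs. 7.5) would clear either threshold, while T_BB_C (7.9 vs. 7.9) would clear neither. The choice of threshold is a quality-versus-quantity tradeoff: small deltas carry weaker preference signal and add noise relative to the cross-run and revision sources.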

The inverted pair: For T_BB_B treatment (7.5) vs control (7.7), build_rag_pairs() correctly flips the chosen/rejected assignment — control becomes chosen, treatment becomes rejected. This is the signal that matters: when RAG harms output, the dataset should encode that preference explicitly, not just omit the pair.
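
The harvest can be reproduced end to end with a self-contained sketch. This is not the pipeline's build_rag_pairs(): harvest_pairs and make_records are hypothetical names, the "score" field is an assumed record key, and the rule of thresholding on the absolute delta is an assumption standing in for the elided parts of the excerpt. The scores come from the per-task table.

```python
RAG_CONDITIONS = ("rag_end", "treatment", "rag_only")  # preference order from the excerpt
COLUMNS = ("control", "treatment", "rag_only", "rag_end")

# Composite scores from the per-task table, in COLUMNS order
TABLE = {
    "T_BB_A": (7.6, 7.6, 7.4, 7.8),
    "T_BB_B": (7.7, 7.5, 7.5, 7.9),
    "T_BB_C": (7.9, 7.5, 7.9, 7.9),
    "T_BB_D": (7.5, 7.4, 7.9, 7.9),
    "T_BB_E": (7.5, 7.6, 7.6, 7.5),
    "T_BB_F": (7.5, 7.6, 7.9, 7.5),
}


def make_records(conditions):
    """Expand the table into one record per (task, condition)."""
    return [
        {"task_id": task, "condition": cond, "score": row[COLUMNS.index(cond)]}
        for task, row in TABLE.items()
        for cond in conditions
    ]


def harvest_pairs(records, min_delta):
    """Pair control with the first available RAG condition per task."""
    by_task = {}
    for r in records:
        by_task.setdefault(r["task_id"], {})[r["condition"]] = r

    pairs = []
    for task_id, conds in sorted(by_task.items()):
        ctrl = conds.get("control")
        rag_cond = next((c for c in RAG_CONDITIONS if c in conds), None)
        if ctrl is None or rag_cond is None:
            continue
        rag = conds[rag_cond]
        # Round to sidestep float noise such as 7.9 - 7.5 == 0.40000000000000036
        delta = round(rag["score"] - ctrl["score"], 2)
        if abs(delta) < min_delta:
            continue  # assumed rule: drop pairs that are too close to call
        # Higher score is chosen; a negative delta inverts the pair
        chosen, rejected = (rag, ctrl) if delta >= 0 else (ctrl, rag)
        pairs.append({
            "task_id": task_id,
            "chosen_condition": chosen["condition"],
            "rejected_condition": rejected["condition"],
            "delta": abs(delta),
        })
    return pairs


all_four = make_records(COLUMNS)
strict = harvest_pairs(all_four, min_delta=0.3)
loose = harvest_pairs(all_four, min_delta=0.1)
# Restrict to control + treatment so the inverted pairs surface
inverted = harvest_pairs(make_records(("control", "treatment")), min_delta=0.1)
```

Under these assumptions, strict contains only T_BB_D, loose adds T_BB_A and T_BB_B (all with RAG-end chosen), and the control-plus-treatment run yields T_BB_B with control as chosen and treatment as rejected. When all four conditions are present for a task, the preference order means RAG-end is always the condition compared against control.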

The Cold-Start Gap

The cross-run and revision sources have a structural coverage problem. Cross-run pairs require the same task to have been run multiple times with different outcomes. Revision pairs require multi-round Wiggum evaluation with per-round content stored, which was added in mid-April 2026. Both sources are heavily weighted toward the tasks in the main evaluation suite: T_B (cost envelope management), T_W (production AI observability), and the other six standard tasks. Economic domain tasks — tasks that require integrating primary-source district data, historical comparison, or regional variation analysis — have little or no representation in the pre-experiment DPO dataset.

This is the cold-start problem. A DPO-fine-tuned model trained exclusively on the existing cross-run and revision pairs will have a strong prior for the synthesis style the harness has optimized over 107 experiments: structured, criteria-satisfying, technically grounded. It will have no prior for how to handle primary-source economic documents: whether to cite them directly or integrate them with secondary sources, and how to weight district-level anecdotes against national statistical series.

The Beige Book experiment fills this gap by design. Its six tasks require exactly the synthesis skill the DPO dataset lacks coverage of. The controlled experimental structure — same research context, same synthesizer, only the RAG condition varies — means the preference signal is clean: the only thing that changed between chosen and rejected outputs is what context the model received and in what order.

Coverage finding: RAG-experiment pairs are structurally different from cross-run and revision pairs. They carry explicit source attribution (chosen_condition, rejected_condition), controlled provenance (single variable isolation), and domain coverage that cross-run pairs cannot generate unless the same domain tasks happen to be run repeatedly. For domain-specific cold starts, a designed experiment supplies coverage that opportunistic pairing does not.
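
A pair record with that attribution might look like the sketch below. The prompt, chosen, and rejected keys follow the preference format that TRL's DPOTrainer reads; chosen_condition and rejected_condition are the attribution fields named above. The helper name, the source tag, and the placeholder strings are hypothetical and stand in for real synthesis output.

```python
import io
import json


def to_dpo_record(prompt, chosen_text, rejected_text,
                  chosen_condition, rejected_condition, source):
    """Build one JSONL-ready preference record (hypothetical helper)."""
    return {
        "prompt": prompt,
        "chosen": chosen_text,
        "rejected": rejected_text,
        # Attribution: which experimental condition produced each side
        "chosen_condition": chosen_condition,
        "rejected_condition": rejected_condition,
        # Hypothetical tag distinguishing the three signal sources
        "source": source,
    }


record = to_dpo_record(
    prompt="<task description for T_BB_D>",
    chosen_text="<synthesis produced under rag_end>",
    rejected_text="<synthesis produced under control>",
    chosen_condition="rag_end",
    rejected_condition="control",
    source="rag_experiment",
)

# Write one line of JSONL to an in-memory buffer and read it back
buffer = io.StringIO()
buffer.write(json.dumps(record) + "\n")
round_trip = json.loads(buffer.getvalue().splitlines()[0])
```

Keeping the condition labels on every record makes it possible to filter or reweight the RAG-experiment pairs later without re-deriving where each output came from.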

What Goes Into Training

The training pipeline reads from scripts/hf_datasets/dpo_mixed.jsonl, which is assembled by mixing the three sources. train_dpo.py fine-tunes Qwen/Qwen2.5-32B-Instruct with TRL’s DPOTrainer:

from peft import LoraConfig
from trl import DPOConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

dpo_config = DPOConfig(
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    beta=0.1,                      # KL penalty
    max_length=4096,
    max_prompt_length=1024,
    loss_type="sigmoid",           # standard DPO loss
)

The beta=0.1 KL penalty is TRL's default and sits at the permissive end of the commonly used range: a lower beta anchors the policy less tightly to the reference model, trading conservatism for stronger preference conformity. For a cold-start situation where the base model has no domain prior, a low beta (higher learning signal) is a reasonable choice. A model that already had reasonable domain performance would warrant a higher beta to avoid overfitting to a small set of pairs.
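
The role of beta is easiest to see in the sigmoid DPO loss itself, which for one pair is −log σ(β · (chosen log-ratio − rejected log-ratio)), where each log-ratio is the policy's log-probability of a response minus the reference model's. The sketch below evaluates that expression for illustrative inputs; the function name and the example log-ratio gap of 2.0 are assumptions chosen for the demonstration, not values from the training run.

```python
import math


def dpo_sigmoid_loss(beta: float, chosen_logratio: float, rejected_logratio: float) -> float:
    """Sigmoid DPO loss for a single preference pair.

    Each logratio is log pi_theta(y | x) - log pi_ref(y | x).
    """
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# At initialization the policy equals the reference, so both log-ratios are 0
loss_at_init = dpo_sigmoid_loss(beta=0.1, chosen_logratio=0.0, rejected_logratio=0.0)

# Same drift away from the reference (an illustrative gap of 2.0), different betas
loss_low_beta = dpo_sigmoid_loss(beta=0.1, chosen_logratio=2.0, rejected_logratio=0.0)
loss_high_beta = dpo_sigmoid_loss(beta=0.5, chosen_logratio=2.0, rejected_logratio=0.0)

print(f"init: {loss_at_init:.4f}  beta=0.1: {loss_low_beta:.4f}  beta=0.5: {loss_high_beta:.4f}")
```

For the same drift from the reference, the lower beta leaves more loss on the table, so the optimizer keeps pushing the policy further from the reference before the pair counts as satisfied. That is the sense in which a low beta is the less conservative setting.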

After training, the adapter is merged and converted to GGUF for Ollama registration. The resulting checkpoint replaces pi-qwen-32b as the synthesis model in the harness’s .env configuration. The harness then runs its standard eval suite against the DPO-trained model to measure whether the preference signal transferred — and if it did, whether the improvement holds across tasks not represented in the training data.

What the Position Swap Teaches the Trainer

The most useful aspect of the position swap result for DPO training is that it generates two preferences from the same document set: prefer (RAG-end output) over (control output) and prefer (control output) over (treatment output). T_BB_B shows both in a single task: RAG-end 7.9, control 7.7, treatment 7.5. The model sees both that Beige Book context can help and that its placement can hurt. Neither preference alone teaches the full lesson.

The beige_book_chars field stored per record enables a further refinement: filtering by retrieval size. T_BB_F had only 1,837 characters of Beige Book retrieval versus T_BB_C’s 6,531. The small-retrieval cases may behave differently from large-retrieval cases because the primacy effect of a small passage is weaker than that of a large one. A training pipeline that separates these into different preference buckets — “small RAG prepended,” “large RAG prepended” — could teach a more nuanced routing policy than “never prepend.”
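
One way to sketch that bucketing, assuming each record carries the beige_book_chars field alongside its condition. The 3,000-character cutoff and the function name are hypothetical; only the two retrieval sizes and the bucket labels come from the text.

```python
SIZE_THRESHOLD_CHARS = 3000  # hypothetical cutoff between "small" and "large" retrievals

# Conditions where passage position is meaningful
POSITION_BY_CONDITION = {"treatment": "prepended", "rag_end": "appended"}


def preference_bucket(record: dict, threshold: int = SIZE_THRESHOLD_CHARS):
    """Label a record by retrieval size and RAG position, or None if neither applies."""
    position = POSITION_BY_CONDITION.get(record["condition"])
    if position is None:
        return None  # control has no RAG; rag_only has no position relative to web context
    size = "small" if record["beige_book_chars"] < threshold else "large"
    return f"{size} RAG {position}"


records = [
    {"task_id": "T_BB_F", "condition": "treatment", "beige_book_chars": 1837},
    {"task_id": "T_BB_C", "condition": "treatment", "beige_book_chars": 6531},
    {"task_id": "T_BB_C", "condition": "rag_end", "beige_book_chars": 6531},
    {"task_id": "T_BB_C", "condition": "control", "beige_book_chars": 0},
]
buckets = [preference_bucket(r) for r in records]
```

With only six tasks, any cutoff splits the data into very small groups, so the buckets are better treated as labels for later stratification than as evidence on their own.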

The next experiment: The position swap result suggests a follow-up: does appending RAG context hold its +0.13 advantage consistently, or does it depend on the retrieval quality? A retrieval-quality-stratified A/B on the same six tasks — using BM25 versus semantic retrieval, and varying top-k — would establish whether the position effect is robust or interacts with retrieval precision. That result would also determine whether the DPO pairs from this experiment generalize or need stratified weighting in the training mix.