The Position Swap: Beige Book RAG Results and the DPO Cold-Start Problem
Prepending Federal Reserve Beige Book passages to synthesis context hurt mean composite score by 0.08 points. Appending the same passages helped by 0.13. A falsified hypothesis — and what the position swap reveals about building domain-grounded DPO training data from scratch.
The hypothesis behind the beige_book_rag experiment was straightforward: injecting Federal Reserve Beige Book passages into synthesis context should improve mean composite score by at least 0.5 points, with the largest gains in the grounded and specificity dimensions. The Beige Book is primary-source economic intelligence — district-level field reports from business contacts, qualitative but direct. If any external corpus should improve the groundedness of economic synthesis, it is this one.
The hypothesis was falsified. Mean composite delta for the standard treatment (RAG prepended to context) versus control (web search only) was −0.08. The RAG made things slightly worse.
The position swap condition told a different story. When the same Beige Book passages were appended to the end of context rather than prepended to the front, mean composite delta versus control was +0.13. Same documents, different position, opposite direction of effect.
This post covers what the experiment actually showed, why position matters mechanically, and how a falsified hypothesis generates the exact kind of controlled DPO preference pairs the training pipeline needs.
Experiment Design
The beige_book_rag experiment uses synthesis-only isolation: research context is gathered once per task, cached, then synthesized four times — once per condition — so that any score difference is attributable to the RAG factor alone rather than web search variance. Six tasks cover different economic domains and time periods:
| Task | Domain | Period |
|---|---|---|
| T_BB_A | Regional inflation variation | 2022 |
| T_BB_B | Labor market tightness | 2021–22 |
| T_BB_C | Manufacturing sentiment pre/post COVID | 2019 vs 2021–22 |
| T_BB_D | Consumer spending and credit conditions | 2024–25 |
| T_BB_E | Housing market | 2023–24 |
| T_BB_F | 1999 dot-com parallel evaluation | 2025–26 vs 1999 |
Each task was run under four conditions. The control and standard treatment were the primary comparison; RAG-only and RAG-end were ablations added to isolate position and web-search contributions separately.
Beige Book retrieval used query_beige_book(task, top_k=5): five passages retrieved from a local FAISS index over Federal Reserve Beige Book releases from 2019 through 2025. The retrieved text totaled roughly 1,100 to 6,500 characters per task.
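The four conditions differ only in which blocks of text the synthesizer receives and in what order. The following is a minimal sketch of that assembly, not the harness's actual code: the build_context helper and the plain string concatenation are assumptions, and the condition names follow the labels used in the pipeline code later in this post.

```python
def build_context(condition: str, web_context: str, bb_passages: list[str]) -> str:
    """Assemble synthesis context for one condition (hypothetical helper)."""
    beige_book = "\n\n".join(bb_passages)
    if condition == "control":        # web search only
        parts = [web_context]
    elif condition == "treatment":    # Beige Book prepended to web context
        parts = [beige_book, web_context]
    elif condition == "rag_end":      # Beige Book appended to web context
        parts = [web_context, beige_book]
    elif condition == "rag_only":     # Beige Book only, no web search
        parts = [beige_book]
    else:
        raise ValueError(f"unknown condition: {condition!r}")
    return "\n\n".join(parts)


# The cached web context is reused across conditions, so only the
# Beige Book factor varies between the resulting prompts.
cached_web = "WEB RESEARCH"
passages = ["District report 1", "District report 2"]
contexts = {c: build_context(c, cached_web, passages)
            for c in ("control", "treatment", "rag_only", "rag_end")}
```

Treatment and RAG-end contain the same text in a different order, which is what makes the position swap a single-variable comparison.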
Results
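The overall composite scores across conditions, averaged over six tasks:
| Condition | Context | Mean composite | Δ vs control |
|---|---|---|---|
| Control | Web search only | 7.62 | |
| Treatment | Beige Book prepended to web search | 7.53 | −0.08 |
| RAG-only | Beige Book only, no web search | 7.70 | +0.08 |
| RAG-end | Beige Book appended to web search | 7.75 | +0.13 |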
Hypothesis verdict: FALSIFIED. The predicted +0.5 gain from treatment was not observed. Observed mean delta: −0.08. The RAG condition with the largest gain was RAG-end (+0.13), not the primary treatment, and the gain falls well short of the +0.5 threshold.
Per-task breakdown for all four conditions (Δ is treatment minus control):
| Task | Control | Treatment | Δ | RAG-only | RAG-end |
|---|---|---|---|---|---|
| T_BB_A (inflation) | 7.6 | 7.6 | 0.0 | 7.4 | 7.8 |
| T_BB_B (labor) | 7.7 | 7.5 | −0.2 | 7.5 | 7.9 |
| T_BB_C (manufacturing) | 7.9 | 7.5 | −0.4 | 7.9 | 7.9 |
| T_BB_D (consumer/credit) | 7.5 | 7.4 | −0.1 | 7.9 | 7.9 |
| T_BB_E (housing) | 7.5 | 7.6 | +0.1 | 7.6 | 7.5 |
| T_BB_F (dot-com parallel) | 7.5 | 7.6 | +0.1 | 7.9 | 7.5 |
| Mean | 7.62 | 7.53 | −0.08 | 7.70 | 7.75 |
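The Mean row can be reproduced directly from the per-task scores. A short sketch, assuming nothing beyond the numbers in the table (the SCORES dictionary and helper names are illustrative):

```python
from statistics import mean

# Per-task composite scores copied from the table above.
SCORES = {
    "T_BB_A": {"control": 7.6, "treatment": 7.6, "rag_only": 7.4, "rag_end": 7.8},
    "T_BB_B": {"control": 7.7, "treatment": 7.5, "rag_only": 7.5, "rag_end": 7.9},
    "T_BB_C": {"control": 7.9, "treatment": 7.5, "rag_only": 7.9, "rag_end": 7.9},
    "T_BB_D": {"control": 7.5, "treatment": 7.4, "rag_only": 7.9, "rag_end": 7.9},
    "T_BB_E": {"control": 7.5, "treatment": 7.6, "rag_only": 7.6, "rag_end": 7.5},
    "T_BB_F": {"control": 7.5, "treatment": 7.6, "rag_only": 7.9, "rag_end": 7.5},
}


def condition_means(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each condition's composite score over all tasks."""
    conditions = next(iter(scores.values())).keys()
    return {c: mean(task[c] for task in scores.values()) for c in conditions}


def deltas_vs_control(means: dict[str, float]) -> dict[str, float]:
    """Difference between each RAG condition's mean and the control mean."""
    return {c: m - means["control"] for c, m in means.items() if c != "control"}


means = condition_means(SCORES)
for cond, delta in deltas_vs_control(means).items():
    print(f"{cond:10s} mean={means[cond]:.2f} delta={delta:+.2f}")
```

The deltas are taken between unrounded means, which is why the treatment delta is −0.08 rather than the −0.09 that subtracting the rounded values 7.62 and 7.53 would give.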
The dimension-level picture shows where the treatment diverged. Grounded, one of the two dimensions the hypothesis expected to benefit most, is identical for treatment and control (7.0 each). The primary damage is in depth and specificity:
| Dimension | Weight | Control | Treatment | Δ | RAG-only | RAG-end |
|---|---|---|---|---|---|---|
| relevance | 0.20 | 9.0 | 9.0 | 0.0 | 9.0 | 9.0 |
| completeness | 0.20 | 8.0 | 8.0 | 0.0 | 8.0 | 8.0 |
| depth | 0.25 | 6.7 | 6.5 | −0.2 | 6.7 | 6.8 |
| grounded | 0.15 | 7.0 | 7.0 | 0.0 | 7.5 | 7.3 |
| specificity | 0.10 | 6.8 | 6.3 | −0.5 | 6.7 | 7.2 |
| structure | 0.10 | 8.0 | 8.0 | 0.0 | 8.0 | 8.0 |
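To see how much each dimension moves the composite, the table can be combined with its Weight column. This sketch assumes the composite is a plain weighted sum of the six dimension scores; the scoring code itself is not shown in this post, and the helper names are illustrative. Because the dimension means in the table are rounded, the result only approximates the reported composites.

```python
# Weights and dimension means copied from the table above.
WEIGHTS = {
    "relevance": 0.20, "completeness": 0.20, "depth": 0.25,
    "grounded": 0.15, "specificity": 0.10, "structure": 0.10,
}
DIMENSION_MEANS = {
    "control":   {"relevance": 9.0, "completeness": 8.0, "depth": 6.7,
                  "grounded": 7.0, "specificity": 6.8, "structure": 8.0},
    "treatment": {"relevance": 9.0, "completeness": 8.0, "depth": 6.5,
                  "grounded": 7.0, "specificity": 6.3, "structure": 8.0},
}


def weighted_composite(dims: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of dimension scores (assumed composite formula)."""
    return sum(weights[d] * dims[d] for d in weights)


def delta_contributions(a: dict[str, float], b: dict[str, float],
                        weights: dict[str, float] = WEIGHTS) -> dict[str, float]:
    """Per-dimension contribution of (a - b) to the composite difference."""
    return {d: weights[d] * (a[d] - b[d]) for d in weights}


control = weighted_composite(DIMENSION_MEANS["control"])      # about 7.605
treatment = weighted_composite(DIMENSION_MEANS["treatment"])  # about 7.505
contrib = delta_contributions(DIMENSION_MEANS["treatment"], DIMENSION_MEANS["control"])
```

Under that assumption, depth and specificity each contribute about −0.05 to the composite: the larger raw drop in specificity (−0.5) is offset by its smaller weight (0.10 versus 0.25 for depth).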
Why Position Matters
The contrast between treatment (−0.08) and RAG-end (+0.13) from identical document sets points to a context window primacy effect. When the Beige Book passages arrive first in the context window, they shape the model’s framing of the task before it has read the web-sourced research context. The synthesis instruction and task description are already downstream of several thousand characters of Fed district reports. The model is, in some sense, synthesizing a Beige Book summary rather than using the Beige Book to enrich a research synthesis.
The dimension data supports this reading. Treatment scores lower than control on depth (−0.2) and especially specificity (−0.5). These are the dimensions most sensitive to whether the output addresses the actual task question in concrete terms, versus producing thematically relevant but generically organized content. Beige Books are useful as corroboration; they are not substitute research. When positioned first, they appear to anchor the output format.
The RAG-only condition (no web search, Beige Book only) also outperformed the standard treatment: +0.08 versus −0.08 against control. The sample is small, but the pattern is consistent: the Beige Book passages alone, with no web-search context for them to reframe, produced output comparable to web-search-only on most tasks, and notably better output on T_BB_D and T_BB_F (both +0.4 vs control, reaching 7.9). On T_BB_C, RAG-only matched control at 7.9. Those three tasks involve direct comparison or evaluation using Fed district data, which is exactly what the Beige Book covers. Tasks requiring point-in-time market data (T_BB_A, T_BB_B) scored 0.2 below control under RAG-only.
The DPO Pipeline
The training pipeline in scripts/ has three DPO signal sources. Two existed before this experiment:
- Cross-run pairs (build_dpo_dataset.py): same task, two runs with different scores. Chosen = higher-scoring content. Requires final_content in both runs and score_delta ≥ 0.5.
- Wiggum-revision pairs (build_dpo_dataset.py): within a single run, round-1 content (rejected) versus best-round content (chosen). The Wiggum feedback serves as a natural rationale for the preference. Requires multi-round eval log with per-round content stored.
The experiment adds a third:
- RAG-experiment pairs (extract_rag_dpo.py): for each task, the best available RAG condition, taken in the preference order RAG-end, treatment, RAG-only, is compared to control. If |score_delta| ≥ threshold, the higher-scoring output becomes chosen and the lower-scoring becomes rejected. The pair carries experiment provenance fields: experiment_id, task_id, chosen_condition, rejected_condition.
```python
_RAG_CONDITIONS = ("rag_end", "treatment", "rag_only")  # preference order

def build_rag_pairs(records: list[dict], min_delta: float) -> list[dict]:
    by_task: dict[str, dict[str, dict]] = {}
    for r in records:
        tid = r.get("task_id", "")
        cond = r.get("condition", "")
        if tid and cond in ("control", *_RAG_CONDITIONS):
            by_task.setdefault(tid, {})[cond] = r
    pairs = []
    for task_id, pair in by_task.items():
        ctrl = pair.get("control")
        # Pick best available RAG condition in preference order
        rag_cond = next((c for c in _RAG_CONDITIONS if c in pair), None)
        ...
        if trt_score >= ctrl_score:
            chosen, rejected = trt, ctrl
        else:
            chosen, rejected = ctrl, trt  # control was better: reject RAG
        ...
```
At the default --min-delta 0.3, the experiment generates one pair: T_BB_D, where RAG-end (7.9) vs. control (7.5) produces a delta of 0.4. At --min-delta 0.1, the harvest expands to include the +0.2 RAG-end cases (T_BB_A and T_BB_B). RAG-only scores enter only for a task that has no RAG-end or treatment record, because the selection walks _RAG_CONDITIONS in preference order. The choice of threshold is a quality-versus-quantity tradeoff: small deltas carry weaker preference signal and add noise relative to the cross-run and revision sources.
The inverted pair: When the compared RAG condition scores below control, as the standard treatment does on T_BB_B (7.5 vs. 7.7), build_rag_pairs() flips the chosen/rejected assignment: control becomes chosen, treatment becomes rejected. The 0.2 gap clears the threshold at --min-delta 0.1 but not at the default 0.3. This is the signal that matters: when RAG harms output, the dataset should encode that preference explicitly, not just omit the pair.
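The selection and flip logic can be exercised end to end on the scores from the per-task table. The function below is a sketch that fills the elided parts of the excerpt with assumptions, not the contents of extract_rag_dpo.py: the score and content field names, the rounding step, and the make_records helper are all hypothetical.

```python
RAG_CONDITIONS = ("rag_end", "treatment", "rag_only")  # preference order from the excerpt


def build_rag_pairs_sketch(records: list[dict], min_delta: float) -> list[dict]:
    """Pair control with the first available RAG condition for each task."""
    by_task: dict[str, dict[str, dict]] = {}
    for r in records:
        tid, cond = r.get("task_id", ""), r.get("condition", "")
        if tid and cond in ("control", *RAG_CONDITIONS):
            by_task.setdefault(tid, {})[cond] = r

    pairs = []
    for task_id, conds in by_task.items():
        ctrl = conds.get("control")
        rag_cond = next((c for c in RAG_CONDITIONS if c in conds), None)
        if ctrl is None or rag_cond is None:
            continue  # nothing to compare for this task
        rag = conds[rag_cond]
        # Round away floating-point noise before comparing to the threshold.
        delta = round(rag["score"] - ctrl["score"], 2)
        if abs(delta) < min_delta:
            continue  # preference signal too weak to keep
        # A negative delta means control was better, so the RAG output is rejected.
        chosen, rejected = (rag, ctrl) if delta >= 0 else (ctrl, rag)
        pairs.append({
            "task_id": task_id,
            "chosen": chosen["content"],
            "rejected": rejected["content"],
            "chosen_condition": chosen["condition"],
            "rejected_condition": rejected["condition"],
            "score_delta": abs(delta),
        })
    return pairs


def make_records(scores: dict[str, dict[str, float]]) -> list[dict]:
    """Expand {task: {condition: score}} into records with placeholder content."""
    return [
        {"task_id": t, "condition": c, "score": s, "content": f"<{t}/{c} output>"}
        for t, by_cond in scores.items() for c, s in by_cond.items()
    ]


# Control and RAG-end scores from the per-task table.
RAG_END_RECORDS = make_records({
    "T_BB_A": {"control": 7.6, "rag_end": 7.8},
    "T_BB_B": {"control": 7.7, "rag_end": 7.9},
    "T_BB_C": {"control": 7.9, "rag_end": 7.9},
    "T_BB_D": {"control": 7.5, "rag_end": 7.9},
    "T_BB_E": {"control": 7.5, "rag_end": 7.5},
    "T_BB_F": {"control": 7.5, "rag_end": 7.5},
})
# T_BB_B with only control and treatment records: the inverted case.
INVERTED_RECORDS = make_records({"T_BB_B": {"control": 7.7, "treatment": 7.5}})

print([p["task_id"] for p in build_rag_pairs_sketch(RAG_END_RECORDS, 0.3)])
print(build_rag_pairs_sketch(INVERTED_RECORDS, 0.1)[0]["chosen_condition"])
```

Using the absolute delta for the threshold is what keeps the harmful-RAG pairs in the dataset; a signed comparison would silently drop them.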
The Cold-Start Gap
The cross-run and revision sources have a structural coverage problem. Cross-run pairs require the same task to have been run multiple times with different outcomes. Revision pairs require multi-round Wiggum evaluation with per-round content stored, which was added in mid-April 2026. Both sources are heavily weighted toward the tasks in the main evaluation suite: T_B (cost envelope management), T_W (production AI observability), and the other six standard tasks. Economic domain tasks — tasks that require integrating primary-source district data, historical comparison, or regional variation analysis — have little or no representation in the pre-experiment DPO dataset.
This is the cold-start problem. A DPO-fine-tuned model trained exclusively on the existing cross-run and revision pairs will have a strong prior for the synthesis style the harness has optimized over 107 experiments: structured, criteria-satisfying, technically grounded. It will have no prior for how to handle primary-source economic documents: whether to cite them directly or integrate them with secondary sources, and how to weight district-level anecdotes against national statistical series.
The Beige Book experiment fills this gap by design. Its six tasks require exactly the synthesis skill the DPO dataset lacks coverage of. The controlled experimental structure — same research context, same synthesizer, only the RAG condition varies — means the preference signal is clean: the only thing that changed between chosen and rejected outputs is what context the model received and in what order.
Experiment-derived pairs bring three things: explicit condition labels (chosen_condition, rejected_condition), controlled provenance (single variable isolation), and domain coverage that cross-run pairs cannot generate unless the same domain tasks happen to be run repeatedly. For domain-specific cold starts, designed experiments outperform opportunistic pairing.
What Goes Into Training
The training pipeline reads from scripts/hf_datasets/dpo_mixed.jsonl, which is assembled by mixing the three sources. train_dpo.py fine-tunes Qwen/Qwen2.5-32B-Instruct with TRL’s DPOTrainer:
```python
from peft import LoraConfig
from trl import DPOConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
dpo_config = DPOConfig(
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    beta=0.1,                # KL penalty
    max_length=4096,
    max_prompt_length=1024,
    loss_type="sigmoid",     # standard DPO loss
)
```
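The beta=0.1 setting, which is also TRL's default, is on the aggressive end: beta scales the implicit KL penalty that keeps the policy close to the reference model, so a low value trades conservatism for stronger preference conformity. For a cold-start situation where the base model has no domain prior, a lower beta (a weaker pull toward the reference model, and so a stronger learning signal from each pair) is appropriate. A model that already had reasonable domain performance would warrant a higher beta to avoid over-fitting to a small set of pairs.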
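The role of beta is easiest to see in the loss itself. The sketch below evaluates the standard sigmoid DPO loss for a single pair from summed log-probabilities; it is a plain-Python illustration of the published formula, not TRL's implementation, and the function name and example numbers are made up.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp: float, policy_rejected_logp: float,
                     ref_chosen_logp: float, ref_rejected_logp: float,
                     beta: float) -> float:
    """-log(sigmoid(beta * gap)), where gap is the difference in log-ratios."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative numbers: the policy has moved 2 nats toward the chosen output
# relative to the reference model (log-ratio gap = 1.0 - (-1.0) = 2.0).
args = dict(policy_chosen_logp=-10.0, policy_rejected_logp=-12.0,
            ref_chosen_logp=-11.0, ref_rejected_logp=-11.0)
loss_low_beta = dpo_sigmoid_loss(**args, beta=0.1)
loss_high_beta = dpo_sigmoid_loss(**args, beta=0.5)
```

For the same drift away from the reference model, the higher beta yields the lower loss, so training pressure fades after a smaller departure from the reference. With beta=0.1 the loss stays high for longer and keeps pushing the policy toward the preferred outputs.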
After training, the adapter is merged and converted to GGUF for Ollama registration. The resulting checkpoint replaces pi-qwen-32b as the synthesis model in the harness’s .env configuration. The harness then runs its standard eval suite against the DPO-trained model to measure whether the preference signal transferred — and if it did, whether the improvement holds across tasks not represented in the training data.
What the Position Swap Teaches the Trainer
The most useful aspect of the position swap result for DPO training is that it generates two preferences from the same document set: prefer (RAG-end output) over (control output) and prefer (control output) over (treatment output). The model sees both that Beige Book context can help and that its placement can hurt. Neither preference alone teaches the full lesson.
The beige_book_chars field stored per record enables a further refinement: filtering by retrieval size. T_BB_F had only 1,837 characters of Beige Book retrieval versus T_BB_C’s 6,531. The small-retrieval cases may behave differently from large-retrieval cases because the primacy effect of a small passage is weaker than that of a large one. A training pipeline that separates these into different preference buckets — “small RAG prepended,” “large RAG prepended” — could teach a more nuanced routing policy than “never prepend.”
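One way to sketch that bucketing, assuming each record carries the condition and beige_book_chars fields. The 4,000-character cutoff and the rag_bucket helper are hypothetical choices for illustration, not values taken from the experiment.

```python
from collections import defaultdict


def rag_bucket(record: dict, large_threshold: int = 4000):
    """Label a record by retrieval size and position, or None if not applicable."""
    cond = record.get("condition")
    if cond not in ("treatment", "rag_end"):
        return None  # control and rag_only have no position to label
    size = "large" if record.get("beige_book_chars", 0) >= large_threshold else "small"
    position = "prepended" if cond == "treatment" else "appended"
    return f"{size} RAG {position}"


def group_by_bucket(records: list[dict]) -> dict[str, list[dict]]:
    """Group records into preference buckets, skipping unlabeled ones."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for r in records:
        label = rag_bucket(r)
        if label is not None:
            buckets[label].append(r)
    return dict(buckets)


# Retrieval sizes quoted above for T_BB_F and T_BB_C.
records = [
    {"task_id": "T_BB_F", "condition": "treatment", "beige_book_chars": 1837},
    {"task_id": "T_BB_C", "condition": "treatment", "beige_book_chars": 6531},
    {"task_id": "T_BB_C", "condition": "control", "beige_book_chars": 0},
]
buckets = group_by_bucket(records)
```

With only six tasks, each bucket would hold very few pairs, so the cutoff is better treated as a reporting aid than as a tuned hyperparameter.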
The next experiment: The position swap result raises a follow-up question. Does appending RAG context hold its +0.13 advantage consistently, or does it depend on retrieval quality? A retrieval-quality-stratified A/B on the same six tasks, using BM25 versus semantic retrieval and varying top-k, would establish whether the position effect is robust or interacts with retrieval precision. That result would also determine whether the DPO pairs from this experiment generalize or need stratified weighting in the training mix.