The Audio Data Flywheel
Every voice request the harness handles produces an (audio, raw_transcript, corrected_transcript) triple as a side effect of normal operation. Accumulate enough of them and you have supervised training data for ASR fine-tuning, which closes a second flywheel alongside the text-based one described in Section G.
The text-based data flywheel described in Section G of this series collects (task, output, score) triples from research runs and feeds them into DPO preference training. This post describes a parallel flywheel that operates on audio: the harness voice pipeline already produces LLM-corrected transcripts on every request, and those corrections are, structurally, ground-truth labels for the raw whisper.cpp output.
This is a design proposal, not a fully closed loop. The pipeline produces the data; the fine-tuning target (NVIDIA's NeMo Parakeet) and the training infrastructure (NeMo RL) exist; what's missing is enough accumulated data to cross a useful volume threshold, plus the fine-tuning pipeline that connects the two. The post explains the mechanism, traces the data path, and identifies what's needed to close it.
H1 — The Voice Pipeline
The harness voice interface lives in harness/api/routes/voice.py and handles POST /api/voice. The request carries a browser audio blob; the response carries a structured JSON object with task routing instructions. The pipeline has three stages.
Stage 1 — Conversion. The audio blob (typically WebM from MediaRecorder) is written to a temp file. ffmpeg converts it to 16kHz mono WAV — the format whisper.cpp requires. If ffmpeg isn't on $PATH, the harness falls back to imageio-ffmpeg so no system-level install is required on Windows.
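A minimal sketch of the conversion step, assuming the standard ffmpeg flags for sample rate (`-ar`), channel count (`-ac`), and 16-bit PCM output. The function names are illustrative and are not taken from voice.py; the fallback relies on `imageio_ffmpeg.get_ffmpeg_exe()`, which returns the path of the binary bundled with that package.

```python
import shutil
import subprocess
from pathlib import Path


def find_ffmpeg() -> str:
    """Return an ffmpeg executable, preferring the one on $PATH."""
    exe = shutil.which("ffmpeg")
    if exe:
        return exe
    # Fallback: the binary shipped inside the imageio-ffmpeg package.
    import imageio_ffmpeg
    return imageio_ffmpeg.get_ffmpeg_exe()


def build_ffmpeg_cmd(ffmpeg: str, src: Path, dst: Path) -> list[str]:
    """Build the argument list for a 16 kHz mono 16-bit PCM WAV conversion."""
    return [
        ffmpeg, "-y",          # overwrite the output file if it exists
        "-i", str(src),        # input blob (e.g. WebM from MediaRecorder)
        "-ar", "16000",        # resample to 16 kHz
        "-ac", "1",            # downmix to mono
        "-c:a", "pcm_s16le",   # 16-bit little-endian PCM
        str(dst),
    ]


def convert_to_wav(src: Path, dst: Path) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_cmd(find_ffmpeg(), src, dst),
                   check=True, capture_output=True)
```

Keeping the command builder separate from the subprocess call makes the flags easy to inspect without ffmpeg installed.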
Stage 2 — Transcription. whisper.cpp runs as a subprocess against the WAV file. It returns timestamped segments, one per sentence:
[00:00:00.000 --> 00:00:02.180] run a lit review on open a eye safety
[00:00:02.180 --> 00:00:04.740] after january twenty twenty five save to safety review dot md
The harness parses these segments via a regex against the [HH:MM:SS.mmm --> HH:MM:SS.mmm] line format. Segments are joined into a plain-text transcript and also kept as a structured list for downstream use (note-mode writes the timestamped version to markdown).
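The parsing step could look roughly like the following. This is a sketch under the assumption that each segment sits on its own line in the format shown above; the regex and function names are illustrative rather than the harness's actual code.

```python
import re

# Matches "[HH:MM:SS.mmm --> HH:MM:SS.mmm] text"
SEGMENT_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\]\s*(.*)$"
)


def parse_segments(stdout: str) -> list[dict]:
    """Extract timestamped segments, ignoring lines that do not match."""
    segments = []
    for line in stdout.splitlines():
        match = SEGMENT_RE.match(line.strip())
        if match:
            start, end, text = match.groups()
            segments.append({"start": start, "end": end, "text": text.strip()})
    return segments


def plain_transcript(segments: list[dict]) -> str:
    """Join segment texts into the single string passed to the classifier."""
    return " ".join(s["text"] for s in segments if s["text"])
```

The structured list keeps the timestamps for note mode, while `plain_transcript` produces the flat string used in Stage 3.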
Stage 3 — LLM Classification. The plain transcript is passed to a local LLM (default: qwen3.6-35b) with a structured system prompt that does two things simultaneously: classify the request as "task" or "answer", and correct ASR errors in the transcript. The correction happens as part of the corrected_transcript field in the JSON response.
## ASR correction
Correct common speech-to-text errors:
- "lang chain" → LangChain | "open a eye" → OpenAI | "hug in face" → Hugging Face
- "git hub" → GitHub | "pie torch" → PyTorch | "fast api" → FastAPI
- "lama" / "llama" → Llama | "geo week" → GeoWeek | "esri" → Esri
The LLM returns a JSON object. The field that matters for the flywheel is corrected_transcript, which contains the same utterance as transcript but with ASR errors resolved. For the example above:
{
  "type": "task",
  "corrected_transcript": "run a lit review on OpenAI safety after January 2025, save to safety_review.md",
  "task_string": "/lit-review \"OpenAI safety\" --after 2025-01 save to safety_review.md",
  "suggested_path": "safety_review.md",
  "uses_browser": false
}
The LLM classifier produces corrected_transcript as a side effect of every request — the raw whisper output and its corrected version accumulate in data/transcripts/ on every note-mode call.
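Because the flywheel depends on `corrected_transcript` being present and well-formed, it may be worth validating the classifier's JSON before using it. A minimal sketch follows; which fields are mandatory is an assumption here, since the post shows only one example response.

```python
import json

# Assumption: these two fields are always present in a valid response.
REQUIRED_FIELDS = ("type", "corrected_transcript")
VALID_TYPES = ("task", "answer")


def parse_classification(raw: str) -> dict:
    """Parse the classifier output and reject malformed responses."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("classifier response is not a JSON object")
    missing = [key for key in REQUIRED_FIELDS if key not in data]
    if missing:
        raise ValueError(f"classifier response missing fields: {missing}")
    if data["type"] not in VALID_TYPES:
        raise ValueError(f"unexpected type: {data['type']!r}")
    return data
```

Rejecting malformed responses early keeps bad labels out of whatever log the corrections are later written to.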
H2 — The Flywheel Mechanism
The key observation is structural: the corrected_transcript field is a ground-truth label for what whisper.cpp heard. Every voice request where the LLM made a correction is, implicitly, a labeled training example for ASR:
- Input: the 16kHz WAV file (already saved to notes/op-note-<ts>.wav in note mode)
- Noisy label: transcript from whisper.cpp
- Clean label: corrected_transcript from the LLM classifier
The supervision signal is the LLM's domain knowledge. The ASR correction rules embedded in _CLASSIFICATION_SYSTEM are not heuristics applied post-hoc — they are the LLM's understanding of what was actually said, grounded in the semantic context of the full utterance. When whisper transcribes "lang chain" and the LLM outputs "LangChain", that correction is reliable precisely because the LLM has seen enough context to know that the user is discussing software, not chain management.
This makes the correction signal stronger than a pure edit-distance correction would be, and weaker than a human-verified transcript would be. It's a noisy but systematically biased-toward-correct label — the LLM makes domain errors (it might hallucinate a product name), but it does not make the phonetic errors that whisper.cpp makes on technical vocabulary.
The triples accumulate passively. In note mode, the WAV is already saved. In task mode, the WAV is deleted after classification — but it could be retained with a one-line change to the cleanup path. Either way, the corrected_transcript is returned in the API response and could be logged to a JSONL file with the same append-only pattern used by data/runs.jsonl:
# data/asr_triples.jsonl — one record per corrected voice request
{
  "timestamp": "2026-05-23T14:32:11Z",
  "wav_path": "data/audio/2026-05-23-143211.wav",
  "whisper": "run a lit review on open a eye safety after january twenty twenty five",
  "corrected": "run a lit review on OpenAI safety after January 2025",
  "delta": "open a eye → OpenAI"
}
Each record where whisper != corrected is a training example. Records where they match are still useful as a check: they confirm that whisper.cpp was already accurate on that utterance, although they carry no correction signal of their own.
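A sketch of the append step, assuming one compact JSON object per line on disk (the record above is pretty-printed for readability). The `token_delta` helper is hypothetical: it derives the delta string from a token-level diff using `difflib`, and unlike the sample record it lists every changed span rather than only the vocabulary fix.

```python
import difflib
import json
from datetime import datetime, timezone
from pathlib import Path


def token_delta(whisper: str, corrected: str) -> str:
    """Summarize token-level differences as 'old → new' spans joined by '; '."""
    a, b = whisper.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(f"{' '.join(a[i1:i2])} → {' '.join(b[j1:j2])}")
    return "; ".join(changes)


def append_triple(log_path: Path, wav_path: str,
                  whisper: str, corrected: str) -> dict:
    """Append one record to the JSONL log and return it."""
    record = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "wav_path": wav_path,
        "whisper": whisper,
        "corrected": corrected,
        "delta": token_delta(whisper, corrected),
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Append-only: existing records are never rewritten.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

An empty `delta` marks a record where whisper.cpp and the LLM agreed, which makes later filtering a simple string check.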
The flywheel is already half-built: the corrections are happening, the WAVs are being saved in note mode, and the system prompt is acting as a supervision oracle. What's absent is the JSONL append, the WAV retention in task mode, and the fine-tuning pipeline that consumes the accumulated triples.
H3 — NeMo Parakeet as the Fine-Tuning Target
The natural fine-tuning target for the accumulated triples is NVIDIA's NeMo Speech framework, specifically the Parakeet family. Parakeet-unified-en-0.6b is the current recommended entry point: 0.6B parameters, supports both offline batch transcription and streaming, achieves 160ms minimum latency in streaming mode, and runs on a single GPU (RTX 4090 / RTX 5000 Ada feasible).
The fine-tuning path for Parakeet uses CTC (Connectionist Temporal Classification) loss, which means the training data format is straightforward: (wav_file, text_label) pairs. The clean label from the LLM classifier maps directly to the text_label field. No forced alignment is required; CTC handles the audio-to-text alignment internally.
# NeMo manifest format — one JSON per line
{"audio_filepath": "data/audio/2026-05-23-143211.wav", "duration": 4.74, "text": "run a lit review on OpenAI safety after January 2025"}
{"audio_filepath": "data/audio/2026-05-23-144502.wav", "duration": 3.10, "text": "fetch the arXiv abstract for attention is all you need"}
LoRA fine-tuning is supported in NeMo Speech for adapter-based updates, which keeps the base Parakeet weights frozen and trains only a small adapter layer. This is important for a harness deployment: the base model continues to handle general English, and the adapter specializes for the harness vocabulary (model names, framework names, research terminology) without catastrophic forgetting.
The hardware constraint is real. whisper.cpp runs on CPU — no VRAM required. NeMo with PyTorch requires a CUDA-capable GPU and at least 4–6GB VRAM for Parakeet-0.6b LoRA fine-tuning. The flywheel design handles this asymmetry by separating concerns: whisper.cpp handles online inference (zero-dependency, always available), and NeMo handles offline fine-tuning (scheduled, GPU-gated). Once a fine-tuned adapter is ready, it can be exported back to ONNX or a format compatible with lighter-weight runtimes — or Parakeet can replace whisper.cpp entirely for online use if VRAM is available.
Academic grounding: Radford et al. (2023) showed that Whisper large-v2 achieves near-human WER on general English but degrades on technical vocabulary — particularly proper nouns, product names, and code identifiers. Domain-adaptive fine-tuning on as few as 1–2 hours of in-domain audio has been shown to close this gap substantially (Gris et al., 2023; Peng et al., 2023). The harness accumulates exactly this kind of in-domain audio: one speaker, consistent recording conditions, and a narrow vocabulary of ML/engineering terminology.
H4 — NeMo RL: A Second Loop
NeMo RL extends the flywheel beyond ASR into instruction-following. Where the ASR loop takes (audio, corrected_transcript) pairs and trains a speech model, the RL loop takes (transcript, task_string, harness_score) triples and trains the LLM classifier itself.
The harness already scores its own outputs via the Wiggum evaluator. Every autoresearch run that completes produces a score. If the task originated from a voice request, the score can be attributed back to the task_string that the LLM classifier produced from the user's utterance — which can be traced back to the corrected_transcript and the original audio.
This creates a preference pair naturally:
- Chosen: the task_string from a voice request that produced a high-scoring run
- Rejected: the task_string from a comparable voice request that produced a low-scoring run
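What counts as a "comparable" request is left open above. The sketch below takes the strictest reading: two requests are comparable only when their transcripts match after lowercasing, so the chosen and rejected task strings share a prompt. The record keys, the `min_gap` threshold, and the output field names (`prompt`, `chosen`, `rejected`) are assumptions for illustration, not a schema confirmed for NeMo RL.

```python
from collections import defaultdict
from itertools import combinations


def build_preference_pairs(runs: list[dict], min_gap: float = 0.2) -> list[dict]:
    """Pair task strings that came from the same transcript but scored differently."""
    by_prompt = defaultdict(list)
    for run in runs:
        by_prompt[run["transcript"].strip().lower()].append(run)

    pairs = []
    for prompt, group in by_prompt.items():
        for a, b in combinations(group, 2):
            if a["task_string"] == b["task_string"]:
                continue  # identical outputs carry no preference signal
            hi, lo = (a, b) if a["score"] >= b["score"] else (b, a)
            # Skip near-ties: a small score gap is more likely noise.
            if hi["score"] - lo["score"] >= min_gap:
                pairs.append({
                    "prompt": prompt,
                    "chosen": hi["task_string"],
                    "rejected": lo["task_string"],
                })
    return pairs
```

A looser notion of comparability, such as matching on the invoked skill, would yield more pairs at the cost of mixing prompts within a pair.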
NeMo RL supports DPO and GRPO on arbitrary models, including Qwen and Llama variants. The LLM classifier (currently qwen3.6-35b) could be fine-tuned via DPO on these pairs to produce task strings that are more likely to route to high-scoring runs — improving the voice interface's ability to correctly interpret user intent, not just transcribe it.
# NeMo RL DPO training — voice-to-task preference pairs
uv run python examples/run_dpo.py \
policy.model_name="Qwen/Qwen3-6B-Instruct" \
dpo.train_data_path="data/voice_preference_pairs.jsonl" \
checkpointing.checkpoint_dir="results/voice-classifier-dpo" \
logger.wandb_enabled=True
GRPO lets the reward function be plugged in directly. Rather than using preference pairs, GRPO generates multiple candidate task_string responses per transcript, executes them through the harness, and uses the resulting scores as rewards. This closes the loop without requiring human annotation or even post-hoc pair construction: the harness's own evaluator is the reward model.
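The idea that makes this work is that each candidate's reward is judged relative to the other candidates sampled for the same transcript. A minimal sketch of that group-relative normalization, written independently of NeMo RL's implementation (the function name and the epsilon value are illustrative):

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of candidates for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so
    candidates that beat their siblings get positive values.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every candidate for a transcript receives the same harness score, all advantages are zero and that group contributes no gradient, which is one reason score resolution in the evaluator matters.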
Two nested loops: the inner ASR loop (whisper → LLM corrections → NeMo Parakeet fine-tuning) and the outer instruction loop (task_string → harness score → NeMo RL preference training on the classifier itself).
H5 — Design Tensions
Dependency footprint. whisper.cpp is a single compiled binary with no Python dependencies. It runs on CPU and works on any OS. NeMo requires PyTorch, CUDA, and roughly 4–6GB of VRAM for Parakeet-0.6b. Adding NeMo to requirements.txt would break the harness's current zero-GPU install path. The cleanest resolution is to keep NeMo in an optional extras group (pip install ollama-harness[audio]) and gate the fine-tuning pipeline behind a config flag, while whisper.cpp remains the default online transcription path.
The cold-start problem. Fine-tuning Parakeet on fewer than 30–60 minutes of in-domain audio (roughly 500–1,000 utterances) is unlikely to produce measurable WER improvements. The harness accumulates audio in note mode, but task-mode requests — which make up the majority of voice interactions — currently delete the WAV after classification. Enabling WAV retention in task mode adds persistent disk I/O; a configurable HARNESS_AUDIO_RETAIN=true flag lets operators opt in once they're ready to accumulate.
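The opt-in retention could be sketched as follows. The accepted flag values and the function names are assumptions; only the HARNESS_AUDIO_RETAIN variable name comes from the design above.

```python
import os
import shutil
from pathlib import Path
from typing import Mapping, Optional


def retention_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read the opt-in flag; anything other than 1/true/yes counts as off."""
    env = os.environ if env is None else env
    return env.get("HARNESS_AUDIO_RETAIN", "").strip().lower() in {"1", "true", "yes"}


def cleanup_wav(tmp_path: Path, retain_dir: Path, retain: bool) -> Optional[Path]:
    """Optionally copy the WAV into retain_dir, then always remove the temp file."""
    kept = None
    if retain:
        retain_dir.mkdir(parents=True, exist_ok=True)
        kept = retain_dir / tmp_path.name
        shutil.copy2(tmp_path, kept)
    tmp_path.unlink(missing_ok=True)
    return kept
```

A call such as `cleanup_wav(tmp_path, Path("data/audio"), retention_enabled())` would replace the existing unlink, leaving the default behaviour unchanged when the flag is unset.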
Label noise. The LLM corrects transcripts, but it can introduce its own errors. If the user says "Vespa" (the vector database) and whisper.cpp transcribes "Vespa" correctly, the LLM might still expand or rewrite it. Corrections where the edit distance is zero should be excluded from training; corrections where the LLM changed more than two tokens should be flagged for review before inclusion. A simple pre-filtering step on the accumulated JSONL would handle both cases.
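One way to express that pre-filter is a three-way triage. This sketch counts changed tokens from `difflib` opcodes, taking the larger side of each changed span, which approximates a token-level edit distance rather than computing it exactly. The bucket names and the `max_changed` default are illustrative.

```python
import difflib


def changed_tokens(whisper: str, corrected: str) -> int:
    """Count tokens touched by the correction (larger side of each changed span)."""
    a, b = whisper.split(), corrected.split()
    ops = difflib.SequenceMatcher(a=a, b=b, autojunk=False).get_opcodes()
    return sum(max(i2 - i1, j2 - j1)
               for tag, i1, i2, j1, j2 in ops if tag != "equal")


def triage(record: dict, max_changed: int = 2) -> str:
    """Route a record to 'exclude', 'train', or 'review'."""
    n = changed_tokens(record["whisper"], record["corrected"])
    if n == 0:
        return "exclude"   # nothing was corrected
    if n > max_changed:
        return "review"    # large rewrite: check before training on it
    return "train"
```

Under this counting rule, "open a eye" → "OpenAI" touches three tokens and lands in the review bucket, so the threshold may need tuning for multi-word vocabulary fixes.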
Model swap logistics. The harness uses WHISPER_CPP_CLI and WHISPER_CPP_MODEL env vars to locate whisper. A post-fine-tuning model swap would mean pointing these vars at a NeMo Parakeet runtime instead of whisper.cpp. NeMo Speech can export Parakeet to ONNX, which opens a path back to a lightweight runtime — but the export pipeline adds complexity. The simpler transition is to run both in parallel during evaluation: whisper.cpp for online use, NeMo Parakeet for offline batch re-transcription of accumulated audio, comparing WER on held-out utterances before committing to the swap.
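For the held-out comparison, WER is the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. A self-contained sketch, with function names chosen for illustration; it applies no text normalization, so casing and punctuation differences count as errors unless both sides are normalized first.

```python
def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate for a single utterance."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hypothesis.split()) / len(ref)


def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    """Aggregate WER: total edits divided by total reference words."""
    edits = sum(word_edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    if words == 0:
        raise ValueError("references must contain at least one word")
    return edits / words
```

Running `corpus_wer` once per backend over the same held-out references gives two numbers that can be compared directly.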
H6 — What's Needed to Close the Loop
Four additions would convert this design into a working flywheel:
- WAV retention in task mode. One-line change in voice.py: copy the WAV to data/audio/ before unlinking tmp_path, gated on a HARNESS_AUDIO_RETAIN env var.
- JSONL append on every classified request. Append a (wav_path, whisper, corrected, delta) record to data/asr_triples.jsonl after the LLM classifier returns. This accumulates passively with zero user friction.
- NeMo manifest builder. A script that reads asr_triples.jsonl, filters records where whisper != corrected and edit distance is within bounds, and writes a NeMo-format manifest. This is a 30-line Python script.
- Fine-tuning trigger. A harness skill or cron job that runs when the manifest exceeds a configurable record threshold (e.g., 500 utterances), launches a NeMo Parakeet LoRA fine-tuning run, and saves the adapter to models/parakeet-lora/. On completion, a config flag switches the online transcription backend.
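The manifest builder and the threshold check can be sketched together. This assumes asr_triples.jsonl holds one JSON record per line with the keys shown earlier, reads durations from the WAV headers via the standard `wave` module, and leaves out the edit-distance bound and the actual fine-tuning launch, since the post does not specify that command. Function names are illustrative.

```python
import json
import wave
from pathlib import Path


def wav_duration(path: Path) -> float:
    """Duration in seconds, computed from the WAV header."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()


def build_manifest(triples_path: Path, manifest_path: Path) -> int:
    """Write one manifest line per corrected record; return the count written."""
    written = 0
    with triples_path.open(encoding="utf-8") as src, \
            manifest_path.open("w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue
            rec = json.loads(line)
            if rec["whisper"] == rec["corrected"]:
                continue  # no correction, so no new supervision
            wav = Path(rec["wav_path"])
            if not wav.exists():
                continue  # audio was not retained for this request
            entry = {
                "audio_filepath": str(wav),
                "duration": round(wav_duration(wav), 2),
                "text": rec["corrected"],
            }
            dst.write(json.dumps(entry, ensure_ascii=False) + "\n")
            written += 1
    return written


def should_trigger(record_count: int, threshold: int = 500) -> bool:
    """True once enough utterances have accumulated to justify a run."""
    return record_count >= threshold
```

A cron job could call `build_manifest` on each pass and hand off to the fine-tuning run only when `should_trigger` returns true.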
The NeMo RL layer for the instruction classifier is an optional addition to this core loop — it requires harness scores attributed to voice-originated tasks, which adds an audit-log join step, but the data is already present in data/runs.jsonl.
The pattern generalizes: any pipeline stage that applies a correction or label to its input — ASR, entity normalization, citation formatting — is implicitly generating supervised training data for the model that would otherwise do that stage. The harness voice pipeline makes this especially clean because the correction happens in-band, in structured JSON, with the raw and corrected versions in adjacent fields.
What the Literature Leaves Open
- At what volume of in-domain utterances does LoRA fine-tuning on LLM-corrected ASR transcripts yield statistically significant WER improvements on held-out technical vocabulary, and does the threshold differ between CTC-based models (Parakeet) and attention-based models (Whisper large-v3)?
- When the LLM classifier is itself fine-tuned via GRPO using harness evaluation scores as rewards, does it develop a bias toward task strings that exploit evaluator blind spots rather than producing objectively better task formulations — and how should the reward function be structured to penalize this?
- How much does label noise from LLM-generated ASR corrections degrade fine-tuning outcomes compared to human-verified transcripts, and is a simple edit-distance filter sufficient to remove corrections that introduce errors rather than fix them?
- For a single-speaker, single-domain deployment (one operator, ML/engineering vocabulary), what is the minimum practical fine-tuning cadence — how frequently should a new LoRA adapter be trained and swapped in before the marginal improvement per additional utterance falls below a useful threshold?
- Does running whisper.cpp online and NeMo Parakeet offline for parallel batch re-transcription provide a reliable WER comparison signal, or do differences in decoding strategy (beam search parameters, language model integration) confound the comparison independent of model quality?