May 23, 2026 • 6 min read • Agentic Harness Engineering

The Voice View: Push-to-Talk Notes, Task Dictation, and the Audio Data Flywheel

Hold Space to record, release to transcribe. Note mode saves the transcript and can optionally add the recording to the ASR training corpus. Task mode corrects the transcript with an LLM and drops the result into the run queue. Three result types, one floating panel.

The Voice panel is a floating overlay that sits above whichever view is currently active in the dashboard. It doesn't replace the main view: it appears in the corner, captures a recording, and dismisses itself after you act on the result. The flow is designed to be fast: hold Space, dictate, release, review, then submit or save. For a short task string the whole sequence takes roughly five seconds.

[Figure: Voice panel floating over the MCP view, showing the Note/Task toggle buttons, the Auto-transcribe checkbox, the circular mic FAB, and status text.]

Voice panel in Note mode, idle state. The circular mic FAB and status hint are the only controls visible at rest. The MCP view remains active in the background — the panel overlays without navigating away.

Two Modes

A toggle at the top of the panel switches between Note and Task. The mode determines how the audio is processed after transcription:

Note mode: Transcribe speech to text for freeform capture such as meeting observations, ideas, and reminders. The transcript is editable before saving. An Auto-transcribe to corpus checkbox (Note mode only) controls whether the raw audio and transcript are appended to the Whisper training corpus for future fine-tuning. Saving calls POST /api/notes/save with the text content, timestamp, and optional filename.

Task mode: Transcribe speech and interpret it as a harness research task. The API runs the raw transcript through an LLM correction pass that converts casual speech ("research best practices for RAG and save it to my desktop") into a well-formed harness task string. The corrected string lands in an editable textarea for review before submission to the run queue via POST /api/queue.
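
As a rough sketch of the Note-mode save call: the post names the endpoint and the three pieces of data it carries, but not the wire format. The JSON body and the field names `text`, `timestamp`, and `filename` below are assumptions, as are the helper names.

```typescript
// Hypothetical request shape for POST /api/notes/save.
// Field names are assumed; the post only lists what is sent.
interface NoteSavePayload {
  text: string;
  timestamp: string; // ISO 8601 time of the recording
  filename?: string; // left out when the user gives no filename
}

function buildNoteSavePayload(
  text: string,
  recordedAt: Date,
  filename?: string,
): NoteSavePayload {
  const payload: NoteSavePayload = {
    text: text.trim(),
    timestamp: recordedAt.toISOString(),
  };
  const name = filename?.trim();
  if (name) payload.filename = name;
  return payload;
}

// Sends the payload as JSON (an assumption) and fails loudly on a non-2xx reply.
async function saveNote(payload: NoteSavePayload): Promise<void> {
  const res = await fetch("/api/notes/save", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  if (!res.ok) throw new Error(`Note save failed: ${res.status}`);
}
```

Keeping the payload builder separate from the network call makes the optional-filename rule easy to check without a running server.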

Recording and Transcription

Two ways to start recording:

Click the mic FAB — single click to start, single click to stop.
Hold Space for at least 900 ms: push-to-talk from anywhere in the dashboard, except while focus is inside a text input or textarea. Release Space to stop. The 900 ms hold threshold prevents accidental triggers from quick spacebar presses during normal keyboard use.

During recording, a green frequency-bar waveform renders on a canvas element above the mic button, driven by the Web Audio API AnalyserNode. The bars reflect real-time amplitude — a live visual confirmation that the microphone is capturing audio.
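
One way the bar heights could be derived is sketched below. It assumes the panel reads byte frequency data (values 0 to 255) from the AnalyserNode and averages adjacent bins into a fixed number of bars. The function name, bar count, and canvas height are illustrative and not taken from the dashboard's source.

```typescript
// Collapse byte frequency data (0-255 per bin) into `barCount` bar heights,
// in pixels, by averaging the bins that fall under each bar.
function toBarHeights(
  bins: Uint8Array,
  barCount: number,
  maxHeight: number,
): number[] {
  if (barCount <= 0 || bins.length === 0) return [];
  const heights: number[] = [];
  const binsPerBar = bins.length / barCount;
  for (let i = 0; i < barCount; i++) {
    const start = Math.floor(i * binsPerBar);
    // Always cover at least one bin, even when there are more bars than bins.
    const end = Math.min(
      bins.length,
      Math.max(start + 1, Math.floor((i + 1) * binsPerBar)),
    );
    let sum = 0;
    for (let j = start; j < end; j++) sum += bins[j];
    const average = sum / (end - start);
    heights.push(Math.round((average / 255) * maxHeight));
  }
  return heights;
}
```

In the browser, each animation frame would fill a `Uint8Array` of length `analyser.frequencyBinCount` via `analyser.getByteFrequencyData(...)`, pass it through a function like this, and draw one green rectangle per height on the canvas.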

On stop, the browser packages the recorded chunks into a WebM blob and POSTs it to /api/voice as a multipart form with three fields: audio, mode, and auto_transcribe. The server runs Whisper (on CUDA, per the system config) and returns a structured VoiceResult object.
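
The upload step might look roughly like the following. The three field names come from the description above. The string encoding of the checkbox, the `audio/webm` MIME type, the upload filename, and the function name are assumptions.

```typescript
type VoiceMode = "note" | "task";

// Builds the multipart body for POST /api/voice from the recorded chunks.
function buildVoiceForm(
  chunks: Blob[],
  mode: VoiceMode,
  autoTranscribe: boolean,
): FormData {
  const audio = new Blob(chunks, { type: "audio/webm" });
  const form = new FormData();
  form.append("audio", audio, "recording.webm"); // filename is illustrative
  form.append("mode", mode);
  // Form fields are strings, so the checkbox state is sent as "true" or "false".
  form.append("auto_transcribe", String(autoTranscribe));
  return form;
}
```

When a form like this is passed as the `body` of `fetch`, the Content-Type header should be left unset so the runtime can add the multipart boundary itself.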

Three Result Types

The server classifies the utterance into one of three result types and the panel renders accordingly:

task: The utterance is a research instruction. The panel shows the LLM's reasoning for the correction (e.g. "added output path, inferred task type"), a pre-filled editable textarea with the corrected task string, and a Run task button. Edit the string if needed, then click Run task to submit it to the queue and close the panel.

answer: The utterance is a question the harness can answer directly (e.g. "what was my last run score?"). The panel shows the corrected transcript and the LLM's response inline, and no pipeline run is triggered. A Copy button copies the answer text.

note: The utterance is a freeform note. The panel shows an editable textarea with the transcript and an optional filename input. Save note writes the content to notes/ with the recorded timestamp. If auto-transcribe was enabled, the audio and transcript pair was already saved to the corpus at transcription time.
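
The three result types map naturally onto a discriminated union. The sketch below is one possible shape: only the type names `task`, `answer`, and `note` come from the post, while the other field names and the `PanelView` structure are hypothetical.

```typescript
// Hypothetical VoiceResult shape; the real field names are not documented here.
type VoiceResult =
  | { type: "task"; reasoning: string; task: string }
  | { type: "answer"; transcript: string; answer: string }
  | { type: "note"; transcript: string };

interface PanelView {
  editableText: string | null; // textarea contents, or null when nothing is editable
  readOnlyText: string[]; // lines shown for reference only
  primaryAction: "Run task" | "Copy" | "Save note";
}

// Decide what the panel shows for each result type.
function viewFor(result: VoiceResult): PanelView {
  switch (result.type) {
    case "task":
      return {
        editableText: result.task,
        readOnlyText: [result.reasoning],
        primaryAction: "Run task",
      };
    case "answer":
      return {
        editableText: null,
        readOnlyText: [result.transcript, result.answer],
        primaryAction: "Copy",
      };
    case "note":
      return {
        editableText: result.transcript,
        readOnlyText: [],
        primaryAction: "Save note",
      };
  }
}
```

With a union like this, the compiler flags any result type that the panel forgets to handle.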

The Auto-Transcribe Flywheel

The Auto-transcribe to corpus checkbox is the voice panel's contribution to the self-improvement loop. When enabled in Note mode, each recording feeds the loop in four steps:

1. The audio file is saved as .wav with the recording timestamp.

2. The Whisper transcript is saved as the paired text label.

3. Both are registered in the ASR training corpus used to fine-tune the local Whisper model via NeMo RL.

4. More recordings → better Whisper accuracy on the operator's voice, vocabulary, and domain terms → faster task dictation.

This is the audio equivalent of the RLHF preference loop that drives DPO fine-tuning in the Fine-tune view. Voice recordings accumulate in the background; periodically a NeMo RL training pass uses them to improve the local ASR model's accuracy on the operator's specific speech patterns and technical vocabulary.
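
To make the pairing concrete, here is a minimal sketch of how one recording could be named and registered. The post does not describe the corpus layout, so everything here is an assumption: the timestamp-based file stem, the `.txt` label file, and the JSON-lines manifest entry with `audio_filepath`, `text`, and `duration` fields (a layout modeled on the manifests NeMo's ASR tooling reads).

```typescript
interface CorpusPair {
  wavPath: string;
  txtPath: string;
  manifestLine: string; // one JSON object, to be appended to a manifest file
}

// 2026-05-23T10:15:30.000Z -> 20260523T101530Z (safe to use in a filename)
function corpusStem(recordedAt: Date): string {
  return recordedAt
    .toISOString()
    .replace(/[-:]/g, "")
    .replace(/\.\d{3}Z$/, "Z");
}

// Derive the audio path, label path, and manifest entry for one recording.
function corpusPair(
  dir: string,
  recordedAt: Date,
  transcript: string,
  durationSec: number,
): CorpusPair {
  const stem = corpusStem(recordedAt);
  const wavPath = `${dir}/${stem}.wav`;
  const txtPath = `${dir}/${stem}.txt`;
  const manifestLine = JSON.stringify({
    audio_filepath: wavPath,
    text: transcript.trim(),
    duration: durationSec,
  });
  return { wavPath, txtPath, manifestLine };
}
```

Deriving both paths from the same timestamp keeps each audio file and its label together without a separate index.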

The practical benefit is domain adaptation: the local Whisper model learns to correctly transcribe harness-specific terms like "wiggum", "SYNTH_INSTRUCTION", "DPO", "autoresearch", and model names like "qwen3.6-35b" — terms that a general-purpose ASR model reliably garbles.

Floating panel design

Voice renders as an overlay rather than replacing the active view. This means you can initiate a recording while reviewing a run in the Runs view, dictate a follow-up task, and return to the same run without any navigation. The panel is stateless — dismissing it (with the X button or after submitting) restores full focus to the background view.

LLM correction pass

Task mode runs an extra LLM inference step beyond raw Whisper transcription. The corrector interprets casual speech into valid task syntax — adding output file paths, converting "do a deep dive on X" to the correct /deep research X save to ~/Desktop/out.md form, and normalizing skill flags.
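
The correction itself is an LLM call, but a cheap client-side check on its output can be sketched. The parser below assumes the general form "/skill body save to path", inferred from the single example above. The harness's real task grammar is not documented in this post, so treat the pattern and the names as illustrative only.

```typescript
interface ParsedTask {
  skill: string | null; // e.g. "deep" for a leading "/deep"
  body: string; // the instruction itself
  outputPath: string | null; // target of a trailing "save to <path>"
}

// Split a corrected task string into skill flag, body, and output path.
function parseTaskString(task: string): ParsedTask {
  const trimmed = task.trim();
  const match = /^(?:\/(\S+)\s+)?(.*?)(?:\s+save to\s+(\S+))?$/i.exec(trimmed);
  if (!match) return { skill: null, body: trimmed, outputPath: null };
  return {
    skill: match[1] ?? null,
    body: match[2],
    outputPath: match[3] ?? null,
  };
}
```

A check like this could, for example, warn before submission when the corrected string has no output path.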

Push-to-talk threshold

The 900 ms Space hold threshold is intentional. Quick spacebar presses during typing should not trigger recording, so the timer waits for a sustained hold before opening the microphone. Releasing Space before 900 ms cancels without recording.
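
The hold logic amounts to a small state machine, sketched below with explicit timestamps so it can run without a browser. The class and method names are hypothetical, and the real panel presumably wires the same idea to keydown and keyup events plus a timer.

```typescript
const HOLD_THRESHOLD_MS = 900;

type HoldState = "idle" | "pending" | "recording";

class PushToTalk {
  private state: HoldState = "idle";
  private downAt = 0;
  private readonly onStart: () => void;
  private readonly onStop: () => void;

  constructor(onStart: () => void, onStop: () => void) {
    this.onStart = onStart;
    this.onStop = onStop;
  }

  // Space pressed. Ignored inside text fields and on key auto-repeat.
  keyDown(now: number, inTextField: boolean): void {
    if (inTextField || this.state !== "idle") return;
    this.state = "pending";
    this.downAt = now;
  }

  // Called from a timer while the key is held; opens the mic after 900 ms.
  tick(now: number): void {
    if (this.state === "pending" && now - this.downAt >= HOLD_THRESHOLD_MS) {
      this.state = "recording";
      this.onStart();
    }
  }

  // Space released. Stops an active recording, or cancels a short press.
  keyUp(): void {
    if (this.state === "recording") this.onStop();
    this.state = "idle";
  }
}
```

A quick press never reaches the "recording" state, so the microphone is never opened and nothing needs to be discarded.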

CUDA transcription

The Whisper inference runs on the local GPU (WHISPER_DEVICE=cuda in the active config). Transcription latency for a 5-second clip is typically under 1 second on the current hardware — fast enough to feel synchronous with the recording.
