May 25, 2026 • 5 min read • Agentic Harness Engineering

YouTube and Media Transcription: Two Paths, One Research Input

Auto-captions without downloading, a Whisper fallback for videos without captions, and direct ffmpeg extraction for any audio or video URL — the harness can ingest spoken content as research context.

Web research tasks occasionally surface video content: conference talks, technical demos, interview podcasts. Without transcription, those sources are opaque to the text-based synthesis pipeline. youtube_transcribe.py adds two ingest paths — one for YouTube URLs and one for direct media files — that return a labeled transcript string ready to inject into the research context.

Two strategies for YouTube

Strategy B: youtube-transcript-api (preferred)

Uses YouTubeTranscriptApi.get_transcript(video_id) to fetch auto-generated or manually uploaded captions directly from YouTube's subtitle API. No audio download. Returns in under a second for most videos. The joined transcript is prepended with a source label and returned immediately if it contains text.

Strategy B fails when captions are disabled, the video is age-restricted, or the video is unavailable. The skill catches any exception without surfacing it and falls through to Strategy A.
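A minimal sketch of the caption path, assuming a youtube-transcript-api release that still exposes the get_transcript call named above (newer releases of the library changed this interface). The helper names extract_video_id and fetch_captions, and the injectable fetch parameter, are hypothetical and exist only so the sketch can be exercised offline.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url: str) -> str:
    """Pull the video ID from common YouTube URL shapes (hypothetical helper)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.endswith("youtu.be"):
        return parsed.path.lstrip("/").split("/")[0]
    query_id = parse_qs(parsed.query).get("v")
    if query_id:
        return query_id[0]
    # Shapes such as /embed/<id> or /shorts/<id>
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) >= 2 and parts[0] in {"embed", "shorts", "live"}:
        return parts[1]
    raise ValueError(f"no video id in {url!r}")


def fetch_captions(url: str, fetch=None):
    """Return a labeled transcript, or None when the captions are empty.

    Exceptions from the fetch call are deliberately not caught here, so the
    caller can treat any failure as the signal to try the Whisper fallback.
    """
    if fetch is None:
        # Imported lazily so the sketch loads without the package installed.
        from youtube_transcript_api import YouTubeTranscriptApi
        fetch = YouTubeTranscriptApi.get_transcript
    segments = fetch(extract_video_id(url))  # list of dicts with a "text" key
    text = " ".join(s["text"].strip() for s in segments if s["text"].strip())
    if not text:
        return None
    # \u2014 is the em dash used in the skill's source labels.
    return f"[YouTube transcript \u2014 {url}]\n\n{text}"
```

Returning None for an empty transcript mirrors the "returned immediately if it contains text" rule: the caller can treat None the same way as an exception and move on to the fallback.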

Strategy A: pytubefix + Whisper (fallback)

Downloads the highest-bitrate audio-only stream via pytubefix.YouTube, then runs openai-whisper for transcription. Audio is pre-converted to 16 kHz mono WAV via ffmpeg before being passed to Whisper — this avoids relying on Whisper's own ffmpeg call, which can fail if the binary isn't on the system PATH.

All audio files are created in a tempfile.TemporaryDirectory() and deleted on exit. The WAV conversion artifact is deleted in a finally block even if transcription fails.
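The cleanup behavior described above can be sketched as follows. This is an illustration under assumptions, not the skill's actual code: the download, convert, and transcribe callables are hypothetical stand-ins for pytubefix, ffmpeg, and Whisper, passed in so that the temporary-directory and finally logic can be shown (and tested) without network access or a model.

```python
import os
import tempfile
from pathlib import Path


def transcribe_via_audio(url, download, convert, transcribe):
    """Download audio, convert it to WAV, transcribe it, and clean up.

    download(url, directory) -> path of the downloaded audio file
    convert(src, dst)        -> writes a 16 kHz mono WAV to dst
    transcribe(wav_path)     -> transcript text
    All three are hypothetical callables supplied by the caller.
    """
    # Leaving the with-block deletes the directory and the original download.
    with tempfile.TemporaryDirectory() as tmp:
        audio_path = download(url, tmp)
        wav_path = os.path.join(tmp, "audio_16k.wav")
        try:
            convert(audio_path, wav_path)
            return transcribe(wav_path)
        finally:
            # Remove the WAV even when conversion or transcription raises.
            Path(wav_path).unlink(missing_ok=True)
```

With pytubefix, the download callable might be as small as YouTube(url).streams.get_audio_only().download(output_path=directory), though that exact call is an assumption about how the skill selects its stream.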

The naming (B preferred, A fallback) reflects implementation order rather than quality ranking. Strategy B requires no local compute and returns the original human-authored or YouTube auto-generated captions, which are usually cleaner than Whisper output for well-captioned content. Strategy A covers the tail of videos that YouTube hasn't captioned or whose captions are inaccessible.
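The ordering of the two strategies can be sketched as a small dispatcher. The function name transcribe_youtube, the (name, callable) pair shape, and the choice to return None when every strategy fails are assumptions made for illustration; the source does not say what the skill returns in that case.

```python
import logging

log = logging.getLogger("youtube_transcribe")


def transcribe_youtube(url, strategies):
    """Try each (name, callable) strategy in order; return the first transcript.

    A strategy signals failure by raising or by returning an empty value.
    """
    for name, strategy in strategies:
        try:
            transcript = strategy(url)
        except Exception as exc:  # any failure means "try the next path"
            log.debug("strategy %s failed for %s: %s", name, url, exc)
            continue
        if transcript:
            return transcript
    return None
```

Passing the caption strategy first and the Whisper strategy second reproduces the "B preferred, A fallback" order while keeping each strategy independently testable.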

Direct media URL path

For non-YouTube URLs with media extensions (.mp4, .mp3, .wav, .webm, .ogg, .m4a, .mkv, .flv, .avi), the skill uses ffmpeg to extract audio and convert it to 16 kHz mono WAV in a single command:

ffmpeg -i <media_url> -ar 16000 -ac 1 -vn audio.wav -y

ffmpeg can read HTTP URLs directly, so no separate download step is needed for most hosted media files. The WAV file is then passed to Whisper for transcription. This path handles conference talk recordings, podcast episodes, and any other direct audio or video link that the browser skill might encounter during research.
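A short sketch of the extension check and the ffmpeg invocation, assuming the extension list and flags given above. The function names and the timeout value are hypothetical; the source does not state whether the skill applies a timeout.

```python
import subprocess
from pathlib import PurePosixPath
from urllib.parse import urlparse

MEDIA_EXTENSIONS = {".mp4", ".mp3", ".wav", ".webm", ".ogg",
                    ".m4a", ".mkv", ".flv", ".avi"}


def is_media_url(url: str) -> bool:
    """True when the URL path (ignoring the query string) ends in a media extension."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix in MEDIA_EXTENSIONS


def build_ffmpeg_cmd(ffmpeg: str, source: str, wav_path: str) -> list:
    """Argument list equivalent to the command shown above."""
    return [ffmpeg, "-i", source, "-ar", "16000", "-ac", "1", "-vn", wav_path, "-y"]


def extract_audio(ffmpeg: str, source: str, wav_path: str, timeout: float = 600.0) -> None:
    """Run ffmpeg; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_ffmpeg_cmd(ffmpeg, source, wav_path),
                   check=True, capture_output=True, timeout=timeout)
```

Passing the arguments as a list rather than a shell string keeps URLs that contain characters such as & or ? from being interpreted by a shell.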

ffmpeg discovery

The skill resolves ffmpeg through a three-level fallback:

1. shutil.which("ffmpeg")
2. imageio_ffmpeg.get_ffmpeg_exe()
3. "ffmpeg" as a last resort (lets the error surface)

imageio-ffmpeg ships a pre-compiled ffmpeg binary as a Python package — no system-level install required. When shutil.which finds nothing, the skill injects the imageio_ffmpeg binary directory into PATH so that Whisper's own subprocess calls also find it. This makes the skill self-contained on machines that don't have ffmpeg system-wide.
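The three-level lookup might look like the sketch below. The function name resolve_ffmpeg and the injectable which parameter are hypothetical; the calls to shutil.which and imageio_ffmpeg.get_ffmpeg_exe are the ones named above.

```python
import os
import shutil


def resolve_ffmpeg(which=shutil.which) -> str:
    """Return an ffmpeg executable path using the three-level fallback."""
    found = which("ffmpeg")
    if found:
        return found
    try:
        import imageio_ffmpeg  # optional dependency bundling an ffmpeg binary
        exe = imageio_ffmpeg.get_ffmpeg_exe()
    except Exception:
        # Package missing or no usable binary: return the bare name so the
        # eventual subprocess call surfaces a clear "not found" error.
        return "ffmpeg"
    # Prepend the binary's directory to PATH for subprocesses started by
    # other libraries in this process.
    bin_dir = os.path.dirname(exe)
    paths = os.environ.get("PATH", "").split(os.pathsep)
    if bin_dir not in paths:
        os.environ["PATH"] = os.pathsep.join([bin_dir] + [p for p in paths if p])
    return exe
```

One caveat worth checking in a real deployment: the binary bundled by imageio-ffmpeg typically carries a platform and version suffix in its filename, so a PATH entry alone may not satisfy callers that invoke plain ffmpeg by name. Whether the skill also creates an alias or copy is not stated here.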

Whisper configuration

WHISPER_MODEL: model size, one of tiny, base, small, medium, large, or turbo. Default: base. Larger models are more accurate but slower; turbo is a recent Whisper variant optimized for speed at near-large quality.
WHISPER_DEVICE: compute device, cpu or cuda. Default: cpu. The Voice view (which also uses Whisper, for real-time push-to-talk transcription) runs with WHISPER_DEVICE=cuda, and because the setting is shared, that value applies to the media transcription skill as well.

The detected transcript language is logged after successful transcription (language=en 4312 chars) so you can verify the model handled multilingual content correctly.
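A sketch of how the two settings and the log line could be handled, assuming they are read from environment variables of the same names. The function names and the validation step are hypothetical (the source does not say whether invalid values are rejected); whisper.load_model and model.transcribe are the openai-whisper entry points, imported lazily because the package is heavy.

```python
import os

VALID_MODELS = {"tiny", "base", "small", "medium", "large", "turbo"}
VALID_DEVICES = {"cpu", "cuda"}


def whisper_settings(env=None):
    """Read WHISPER_MODEL and WHISPER_DEVICE, applying the documented defaults."""
    env = os.environ if env is None else env
    model = env.get("WHISPER_MODEL", "base").strip().lower()
    device = env.get("WHISPER_DEVICE", "cpu").strip().lower()
    if model not in VALID_MODELS:
        raise ValueError(f"unsupported WHISPER_MODEL: {model!r}")
    if device not in VALID_DEVICES:
        raise ValueError(f"unsupported WHISPER_DEVICE: {device!r}")
    return model, device


def summarize_result(result: dict) -> str:
    """Build the log line from a Whisper result dict ("language" and "text" keys)."""
    return f"language={result.get('language', '?')} {len(result['text'])} chars"


def run_whisper(wav_path: str, env=None) -> dict:
    """Load the configured model and transcribe one file."""
    import whisper  # openai-whisper
    model_name, device = whisper_settings(env)
    model = whisper.load_model(model_name, device=device)
    return model.transcribe(wav_path)
```

Rejecting unknown values early turns a typo in the environment into an immediate error instead of a failure deep inside model loading.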

Output format

Both paths return a labeled string that identifies the source URL and the transcription method:

[YouTube transcript — https://youtube.com/watch?v=abc123]

The key insight from the paper is that the retrieval step...

[Media transcript — https://example.com/talk.mp4]

So the main contribution here is a new evaluation framework...

The source label lets the synthesis pipeline cite the video as a distinct source in the output, rather than attributing the content to the page it was linked from. It also allows the security layer to flag transcript content for injection scanning — video transcripts from web research are a plausible vector for prompt injection payloads embedded in spoken content.
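On the consuming side, the label can be split back into its parts so the pipeline can cite the URL and route the body to a scanner. This is a sketch under assumptions: the names LabeledTranscript, parse_labeled_transcript, and needs_injection_scan are hypothetical, and the "scan every labeled transcript" policy is an illustrative guess at how the security layer might use the label.

```python
import re
from typing import NamedTuple, Optional

# \u2014 is the em dash used in the labels shown above.
LABEL_RE = re.compile(r"^\[(YouTube|Media) transcript \u2014 (\S+)\]\n+")


class LabeledTranscript(NamedTuple):
    kind: str   # "YouTube" or "Media"
    url: str
    body: str


def parse_labeled_transcript(text: str) -> Optional[LabeledTranscript]:
    """Split a labeled transcript into source kind, URL, and body text."""
    match = LABEL_RE.match(text)
    if match is None:
        return None
    return LabeledTranscript(match.group(1), match.group(2), text[match.end():])


def needs_injection_scan(text: str) -> bool:
    """Treat every labeled transcript as untrusted input (an assumed policy)."""
    return parse_labeled_transcript(text) is not None
```

Text without a recognized label returns None, so ordinary page content passes through this check unchanged.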

The voice transcription flywheel documented in The Voice View uses the same Whisper model and device settings. Audio from both paths — push-to-talk voice requests and media URL transcription — is added to the ASR training corpus that feeds NeMo RL fine-tuning.