Blog

Agentic systems, ML, computer vision, and AI research

The Case Against Prompt Engineering

June 12, 2026 • 12 min read • Agentic Harness Engineering

Written against interest: the three strongest rebuttals to prompt engineering — format brittleness that scale never fixed, optimizers that overfit their own feedback, and best practices that turn vestigial across model generations (the Guardrail-to-Handcuff transition) — argued with 40+ annotated papers and the harness’s own run logs, which testify for the prosecution.

Cost Envelope Management for Production AI Agents

June 12, 2026 • 10 min read • Agentic Harness Engineering

The harness ran the same cost-management eval task 239 times; this post is the answer it was searching for, written from the telemetry it left behind. Four nested budget levels — per-call, per-run, per-model, per-fleet — plus hierarchical cost attribution from runs.jsonl and the 7:1 input-to-output token ratio that makes verification, not generation, the dominant spend.
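
One way to picture the four nested budget levels is a charge that must clear every level before any of them is debited. The sketch below is illustrative only: `Budget`, `CostEnvelope`, `charge`, and the dollar limits are hypothetical names and values, not the harness's own code.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    """An accumulating spend cap (hypothetical helper)."""
    limit_usd: float
    spent_usd: float = 0.0

    def can_afford(self, cost: float) -> bool:
        return self.spent_usd + cost <= self.limit_usd


@dataclass
class CostEnvelope:
    """Four nested levels: per-call, per-run, per-model, per-fleet."""
    per_call_limit: float          # cap on any single call
    per_run: Budget
    per_model: dict[str, Budget]
    per_fleet: Budget

    def charge(self, model: str, cost: float) -> bool:
        """Debit every level, or none if any level would be exceeded."""
        levels = [self.per_run, self.per_model[model], self.per_fleet]
        if cost > self.per_call_limit:
            return False
        if not all(level.can_afford(cost) for level in levels):
            return False
        for level in levels:
            level.spent_usd += cost
        return True
```

Checking all levels before debiting any keeps the ledgers consistent when an inner level passes but an outer one refuses.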

Closing the SkillOpt Gaps: What Actually Shipped

June 2, 2026 • 9 min read • Agentic Harness Engineering

Three gaps identified against SkillOpt: _validate_proposal() shipped as an adapted validation gate (routing check + ban-list patterns), MINIBATCH_FLOOR is a partial fast gate, and the skill artifact gap remains open. Kimi unblocking emerged as an alternative to the global-exit design.

The Wiggum Panel: Three-Persona Parallel Evaluation

June 2, 2026 • 8 min read • Agentic Harness Engineering

Domain Practitioner, Critical Reviewer, and Informed Newcomer run in parallel threads via ThreadPoolExecutor. Their deduplicated, persona-prefixed issues augment Wiggum’s revision context before the first rewrite. Enabled via WIGGUM_PANEL=1; panel reviews logged to runs.jsonl for preference learning.
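
A minimal sketch of the parallel panel, assuming each persona review is a function that returns a list of issue strings. `run_panel` and the `reviewer` callable are hypothetical stand-ins for the LLM calls; only `ThreadPoolExecutor` and the three persona names come from the post.

```python
from concurrent.futures import ThreadPoolExecutor

PERSONAS = ["Domain Practitioner", "Critical Reviewer", "Informed Newcomer"]


def run_panel(draft, reviewer):
    """Run one review per persona in parallel, then merge the issues.

    `reviewer(persona, draft)` is a hypothetical callable standing in for
    an LLM call; it must return a list of issue strings.
    """
    with ThreadPoolExecutor(max_workers=len(PERSONAS)) as pool:
        # pool.map preserves the order of PERSONAS in its results.
        results = list(pool.map(lambda p: (p, reviewer(p, draft)), PERSONAS))

    seen, merged = set(), []
    for persona, issues in results:
        for issue in issues:
            key = issue.strip().lower()   # naive dedup key (assumption)
            if key not in seen:
                seen.add(key)
                merged.append(f"[{persona}] {issue.strip()}")
    return merged
```

The merged list is what would be appended to the revision context; an issue raised by more than one persona is kept once, under the first persona that raised it.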

The Developer Utility Skills: /scratchpad and /test-harness

June 1, 2026 • 5 min read • Agentic Harness Engineering

/scratchpad forces the Python tool loop on for any task and injects a synthesis instruction forbidding LLM estimation — all numbers must come from executed code. Prior scratch results are injected as additional context after a security scan. /test-harness runs the pytest suite, saves to latest.txt, and returns a structured result dict.

The Trading Skills: /validate-trades and /execute-trades

June 1, 2026 • 6 min read • Agentic Harness Engineering

/validate-trades runs eight deterministic TA checks against every thesis citation — momentum rank, Hurst exponent, BB z-score, SMA cross, volatility, R/R ratio, position sizing, price anchoring — with no LLM. /execute-trades builds GTC bracket orders from passing theses and submits to Alpaca, dry-run by default.

The Plugin System: /forge:plugin and /forge:list

June 1, 2026 • 5 min read • Agentic Harness Engineering

Generate a harness plugin from a natural-language description — the LLM produces a JSON spec, files are written to plugins/<name>/, and the plugin hot-loads into skills.REGISTRY without a restart. Skill files auto-inject domain knowledge into synthesis; command templates activate as /plugin:command slash tokens.

The Interview Skills: /grill-me and /onboarding

June 1, 2026 • 6 min read • Agentic Harness Engineering

/grill-me mirrors the research loop — plan_question() generates follow-ups, assess_novelty() gates rounds, compress_knowledge() accumulates state — producing a five-section knowledge brief. /onboarding extends it with a fixed 3-question scaffold, persistent TOML config, and ChromaDB user context seeding.

The Site Generation Skills: /design and /build-page

May 31, 2026 • 7 min read • Agentic Harness Engineering

/design visits a URL with Playwright, extracts CSS custom properties and computed styles, analyzes screenshots with a vision model, and synthesizes a 10-section design system. /build-page generates a themed HTML page from markdown files in three passes: content clustering, shell generation, per-file card injection.

The Site Generation Skills: /site and /deck

May 31, 2026 • 5 min read • Agentic Harness Engineering

/site chains /design and /build-page into a single command. /deck generates a themed .pptx from the same inputs — a design source (URL or .md) and a content source (folder, URL, or PDF) — using python-pptx with colors and fonts parsed directly from the design system tokens.

The Diagnostic Skills: /debug and /troubleshoot

May 31, 2026 • 6 min read • Agentic Harness Engineering

/debug loads the last two matching ERROR or FAIL runs, reads their Chrome Trace event sequences, maps task_type to relevant source anchors, and returns a structured Diagnosis / Evidence / Fix. /troubleshoot combines that with project state for a four-part report in one call.

The Navigation Skills: /suggest and /re-orient

May 31, 2026 • 5 min read • Agentic Harness Engineering

/suggest synthesizes the single most valuable next task from orientation cache, last 8 runs, git log, and autoresearch state — with a whitelist-constrained runnable command. /re-orient fetches commits, PRs, issues, and CI runs in parallel and synthesizes a fast project snapshot against a focus question.

The Explorer View: Per-Run Pipeline DAG Inspector

May 30, 2026 • 6 min read • Agentic Harness Engineering

Every completed run rendered as a clickable DAG — Task, Memory, Plan, Search × N, Synthesis, Eval rounds, Output. Click any node for a detail inspector: token counts, Wiggum dimension bars, evaluator reasoning, and an inline RLHF feedback panel per stage.

The Pipeline View: Data Enrichment DAG

May 30, 2026 • 5 min read • Agentic Harness Engineering

Static SVG DAG showing the harness enrichment architecture for financial and economic tasks. Seven layers (web search, Beige Book, FRED, BEA, Market Signals, yfinance, Alpaca) converge at a Context Merge node before LLM synthesis. A dashed conditional edge marks Alpaca order execution for trading theses.

The Page Feedback Widget: Closing the Loop Between Browser and Agent

May 30, 2026 • 6 min read • Agentic Harness Engineering

A Chrome extension injects a floating feedback panel on every localhost and file:/// page. Notes are written to localStorage and POSTed to /api/page-feedback, landing in data/page_feedback.jsonl for agents to read, act on, and clear. Shadow DOM isolation, badge counter, completion checklist, and a null-origin CORS fix for local HTML files.

The Research History View: A Unified Activity Log

May 30, 2026 • 6 min read • Agentic Harness Engineering

Every search query, browser visit, and research output in one reverse-chronological feed. Filterable by type and time window, with score bars, visit counts, and direct links from timeline entries into the Memory store.

The Skills Registry: Hook Points, Auto-Activation, and 38 Built-In Skills

May 29, 2026 • 6 min read • Agentic Harness Engineering

skills.py wires 38 capabilities into six hook points across the research pipeline: standalone, pre_research, pre_synthesis, post_synthesis, post_wiggum, and modifier. Seven skills activate automatically via predicate lambdas: keyword detection, regex matching task text against self-referential patterns, plan complexity checks, macroeconomic term matching, and trading thesis detection. Lazy loading defers each skill's import cost until that skill is actually invoked.
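
Predicate-based auto-activation can be sketched as a registry in which each skill carries a hook point and an optional predicate over the task text. The six hook names are from the post; the `Skill` dataclass, the three example skills, and `auto_activated` are hypothetical illustrations, not the contents of skills.py.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

HOOK_POINTS = ("standalone", "pre_research", "pre_synthesis",
               "post_synthesis", "post_wiggum", "modifier")


@dataclass(frozen=True)
class Skill:
    name: str
    hook: str
    predicate: Optional[Callable[[str], bool]] = None  # None: manual only


# Hypothetical term list and registry entries, for illustration only.
MACRO_TERMS = re.compile(r"\b(inflation|gdp|unemployment)\b", re.IGNORECASE)

REGISTRY = [
    Skill("macro_enrich", "pre_research",
          lambda task: bool(MACRO_TERMS.search(task))),
    Skill("thesis_check", "post_synthesis",
          lambda task: "trading thesis" in task.lower()),
    Skill("deck", "standalone"),
]


def auto_activated(task: str, hook: str) -> list[str]:
    """Names of skills at `hook` whose predicate matches the task text."""
    if hook not in HOOK_POINTS:
        raise ValueError(f"unknown hook point: {hook}")
    return [s.name for s in REGISTRY
            if s.hook == hook and s.predicate is not None and s.predicate(task)]
```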

harness/api: The FastAPI Backend and Dashboard Server

May 29, 2026 • 6 min read • Agentic Harness Engineering

Async-native FastAPI replaces Flask: modular routers per concern, WebSocket live streaming that tails runs.jsonl instead of SSE thread hacks, static Vite dashboard served at root, CORS regex for any localhost port, and auto OpenAPI docs at /docs. Lifespan context manager handles startup and teardown cleanly.

inference.py: The Unified LLM Backend Shim

May 29, 2026 • 6 min read • Agentic Harness Engineering

A drop-in import ollama replacement that routes to Ollama, vLLM, or llama-server via a three-level priority stack: per-model endpoint JSON, hybrid routing map, or global backend flag. Adapters normalize response shapes across backends and measure real wall-clock TTFT, while a 14-entry name translation table keeps model tags consistent across the codebase.
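
The three-level priority stack reduces to a lookup that falls through from most to least specific. This is a sketch under assumptions: `resolve_backend`, its parameter names, and the example model tags are hypothetical, and each level is simplified to a mapping from model tag to backend name.

```python
def resolve_backend(model, per_model_endpoints, hybrid_map, global_backend="ollama"):
    """Pick a backend for `model`, most specific setting first.

    1. per-model endpoint config  2. hybrid routing map  3. global flag
    """
    if model in per_model_endpoints:
        return per_model_endpoints[model]
    if model in hybrid_map:
        return hybrid_map[model]
    return global_backend


# Hypothetical configuration for illustration.
endpoints = {"big-model": "vllm"}
hybrid = {"big-model": "llama-server", "fast-model": "ollama"}
```

With this configuration, `big-model` resolves to `vllm` because the per-model entry outranks the hybrid map, and an unlisted model falls back to the global backend.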

The Search Cache: SQLite TTL Caching for DDGS Queries and Research Contexts

May 28, 2026 • 5 min read • Agentic Harness Engineering

Two SQLite tables, two use cases: search_cache deduplicates DDGS queries across all runs (always on, 24-hour TTL); research_cache stores full research contexts for autoresearch experiments (opt-in via RESEARCH_CACHE=1). SHA-256 keys, lazy eviction on write, schema migration on connect.
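
A rough sketch of the search_cache idea using only the standard library: SHA-256 keys, a 24-hour TTL, and eviction performed lazily on write. The column names, the query normalization, and the function names are assumptions made for illustration; the real schema may differ.

```python
import hashlib
import json
import sqlite3
import time

TTL_SECONDS = 24 * 3600  # 24-hour TTL, as stated in the post


def connect(path=":memory:"):
    conn = sqlite3.connect(path)
    # Hypothetical schema: key, JSON payload, creation timestamp.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS search_cache ("
        "key TEXT PRIMARY KEY, payload TEXT NOT NULL, created_at REAL NOT NULL)"
    )
    return conn


def cache_key(query):
    # Normalizing case and whitespace before hashing is an assumption.
    return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()


def put(conn, query, results, now=None):
    now = time.time() if now is None else now
    # Lazy eviction: expired rows are only removed when something is written.
    conn.execute("DELETE FROM search_cache WHERE created_at < ?", (now - TTL_SECONDS,))
    conn.execute(
        "INSERT OR REPLACE INTO search_cache VALUES (?, ?, ?)",
        (cache_key(query), json.dumps(results), now),
    )
    conn.commit()


def get(conn, query, now=None):
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT payload, created_at FROM search_cache WHERE key = ?",
        (cache_key(query),),
    ).fetchone()
    if row is None or now - row[1] > TTL_SECONDS:
        return None  # miss, or present but expired
    return json.loads(row[0])
```

Reads check the timestamp themselves, so an expired row that has not yet been evicted still counts as a miss.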

The Planner: Two-Pass Pre-Research Analysis

May 28, 2026 • 6 min read • Agentic Harness Engineering

Before a single web search runs, two fast glm4:9b calls assess prior knowledge and produce a structured plan. The prior knowledge pass identifies what the model already knows and what gaps need search; the main plan pass generates targeted queries, classifies the task, extracts section counts, and injects planner notes into synthesis.

Inside agent.py: The Three-Turn Research Pipeline

May 28, 2026 • 8 min read • Agentic Harness Engineering

Turn 1 gathers via novelty-gated web search (2–5 rounds, ε-greedy pass-through, URL enrichment); Turn 2 synthesizes with SYNTH_INSTRUCTION selected by task classifier (technical / count / prose); Turn 3 runs the Wiggum loop until PASS. Plus keep_alive estimation from run history, thinking-model detection, and a Python code-execution tool.

Deploying the Harness with Docker: CPU, GPU, and Compose Variants

May 27, 2026 • 5 min read • Agentic Harness Engineering

A CPU slim image for development, a CUDA 11.8 GPU image for production inference, and a three-service Compose stack (vLLM for large models, Ollama for fast models, harness dashboard) with health-check dependency ordering, GPU passthrough, and a live code mount so no rebuild is needed during iteration.

Synthetic Eval Task Generation with TinyTroupe Personas

May 27, 2026 • 5 min read • Agentic Harness Engineering

Eight practitioner archetypes — DevOps, data scientist, PM, security engineer, ML infra, and others — generate role-grounded research tasks via Microsoft TinyTroupe (with raw Ollama fallback). Evaluation criteria are auto-derived from the generated text: count detection triggers exact_sections(N); structural and quality checks apply unconditionally.

The Subagent Demo Suite: Orchestrating Multi-Task Research Portfolios

May 27, 2026 • 6 min read • Agentic Harness Engineering

Two orchestration scripts show sequential and parallel multi-task execution: v1 submits six literature-review tasks to the FastAPI queue and has the agent render its own landing page; v2 runs five data-grounded self-analysis tasks (reading real repo files) with an optional parallel mode through the MCP HTTP server.

The GitHub Skill: LLM-Assisted Git Operations from the Agent Loop

May 27, 2026 • 6 min read • Agentic Harness Engineering

Twelve operations auto-detected from a task string: push with an LLM-generated commit message, PR creation with an LLM title and body from the branch diff, issue creation from a plain-text problem description, read-only status, list, and view operations, and merge. Three distinct system prompts, one /github command.

Mining a Ground-Truth Knowledge Base for the Eval Suite

May 26, 2026 • 5 min read • Agentic Harness Engineering

Five deep research runs with the novelty gate disabled build authoritative reference documents — one per eval task — injected as file_context when the eval suite runs. /deep forces MAX_SEARCH_ROUNDS; --no-wiggum skips revision to cut mining time by 50–70%.

The Wiki Sync Skill: Deterministic Source Extraction and Gap-Targeted Code Injection

May 26, 2026 • 7 min read • Agentic Harness Engineering

No LLM required — regex reads six source files and writes an idempotent Implementation Reference section into wiki/pipeline.md. A second mode, triggered by Wiggum FAIL cycles, matches the evaluator’s issue text to nine trigger patterns and injects the relevant function bodies directly into the wiki.

The Email Skill: Personalized Outreach Drafts from Conference Speaker CSVs

May 26, 2026 • 6 min read • Agentic Harness Engineering

Two LLM calls per contact — a subject line and a warm, specific body referencing the speaker’s talk — with slide content fetched via MarkItDown, per-contact JSON output, a manifest file, and every draft logged to runs.jsonl as task_type: email_draft for full dashboard visibility.

The Lit-Review Skill: Seven Steps from ArXiv Query to Rendered Survey

May 26, 2026 • 8 min read • Agentic Harness Engineering

ArXiv fetch → Semantic Scholar citation enrichment (hub scores) → five-persona curation → per-paper annotation with Wiggum → LLM thematic clustering → cross-cluster synthesis with open questions → Jinja2 render. Three output templates: survey, gaps, executive. Per-paper checkpointing makes long runs resumable.

The Playwright Skill: LLM-Guided Navigation via ARIA Snapshots

May 25, 2026 • 8 min read • Agentic Harness Engineering

ARIA accessibility-tree snapshots instead of raw DOM, pre-navigation sitemap planning that picks the best URL before the browser opens, a completeness oracle that pulls up to three pages when coverage falls below the 7/10 threshold, blocked-click memory, SPA URL-change polling, and detached Chromium persistence via CDP for multi-task sessions.

YouTube and Media Transcription: Two Paths, One Research Input

May 25, 2026 • 5 min read • Agentic Harness Engineering

Auto-captions via youtube-transcript-api (no download) with a pytubefix + Whisper fallback for uncaptioned videos, plus direct ffmpeg extraction for any mp4, mp3, webm, or other media URL. imageio_ffmpeg bundles its own binary so no system ffmpeg install is required.

SBOM and AIBOM for Agentic Systems

May 25, 2026 • 12 min read • Agentic Harness Engineering

pip freeze doesn't know about kimi-k2.5:cloud. An AIBOM does. Supply chain transparency for the full stack: Python packages, local GGUF models, custom Modelfiles with system-prompt overlays, and cloud endpoints that appear nowhere in a traditional SBOM.

Agentic Threat Hardening: The OWASP Top 10, Applied

May 25, 2026 • 25 min read • Agentic Harness Engineering

OWASP's Agentic Security Initiative Top 10 maps ten attack classes that emerge when LLMs gain tools, memory, and autonomy. The full coverage audit against the harness—four defenses covered, four partial, two gaps—with nine prioritized mitigations, research citations, and code.

Leverage: What the Metric Measures, and Why the Replacement Framing Gets the Math Wrong

May 25, 2026 • 12 min read • Analysis

The harness computes a leverage value on every run. A close reading of the formula—token amortization, the CapEx gap, the TAC calibration assumption—produces a quantitative argument against the corporate narrative that AI “naturally” targets lower-value human capital first.

Agentic Harness Engineering: The Architecture Series

May 25, 2026 • 12 posts • 27 patterns

Twelve posts covering the complete harness design across eight categories—plus a pattern catalog presenting all 27 named agentic system design patterns in textbook structure: Intent, Problem, Solution, Structure, and Related patterns.

The Supervisor: Four Convergence Signals and Advisory Interventions

May 24, 2026 • 6 min read • Agentic Harness Engineering

A read-only convergence monitor that scans runs.jsonl for four collapse signals — Wiggum score variance, output size CV, search utilization, and content similarity — and recommends specific interventions when thresholds are crossed. Advisory only; never modifies pipeline behavior.
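
Two of the four signals, score variance and output-size coefficient of variation (CV), are simple to compute. The sketch below assumes that collapse shows up as values that are too low; the thresholds, function names, and advisory wording are hypothetical, and the search-utilization and content-similarity signals are not shown.

```python
from statistics import mean, pstdev, pvariance


def score_variance(scores):
    """Population variance of evaluator scores across recent runs."""
    return pvariance(scores)


def size_cv(sizes):
    """Coefficient of variation: standard deviation divided by the mean."""
    avg = mean(sizes)
    return pstdev(sizes) / avg if avg else 0.0


def advisories(scores, sizes, min_variance=0.05, min_cv=0.05):
    """Return advisory notes only; nothing here changes pipeline behavior.

    The thresholds are hypothetical, as is the assumption that low
    variance and low CV are what indicate collapse.
    """
    notes = []
    if score_variance(scores) < min_variance:
        notes.append("score variance below threshold")
    if size_cv(sizes) < min_cv:
        notes.append("output size CV below threshold")
    return notes
```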

Five Personas, One Veto: Consensus Filtering for Fine-Tuning Data

May 24, 2026 • 7 min read • Agentic Harness Engineering

Five LLM personas — Pragmatic Engineer, Academic Rigorist, Synthesis Thinker, Contrarian, Newcomer — each score every annotated paper 1–5. A paper reaches the fine-tuning dataset only if the mean is ≥ 3.5 and no single persona scores below a veto floor of 2. The Contrarian is the most common vetoer.
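
The acceptance rule is compact enough to state directly. A minimal sketch, assuming scores arrive as a persona-to-score mapping; the function name `accept` is hypothetical, while the 3.5 mean threshold and the veto floor of 2 are the figures given above.

```python
from statistics import mean

MEAN_THRESHOLD = 3.5  # minimum mean score across personas
VETO_FLOOR = 2        # any single score below this vetoes the paper


def accept(scores: dict[str, int]) -> bool:
    """True if the mean is >= 3.5 and no persona scores below 2."""
    values = list(scores.values())
    return mean(values) >= MEAN_THRESHOLD and min(values) >= VETO_FLOOR
```

A paper scored 5, 5, 5, 5, 1 has a mean of 4.2 but is still rejected, which is the point of the veto: one strong objection outweighs a high average.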

The op CLI: A Rich REPL for the Research Harness

May 24, 2026 • 6 min read • Agentic Harness Engineering

Two invocation modes (interactive REPL and single-task), eight slash commands covering browser navigation, sitemap discovery, paper annotation, Gmail drafting, and free-form research, four browser flags, and a pyfiglet isometric splash screen with live endpoint display. Persistent ~/.op_history for up-arrow recall across sessions.

Seven Principles and a Moving Frontier: The Harness Roadmap

May 24, 2026 • 14 min read • Agentic Harness Engineering

The goals that stayed constant, the milestones that multiplied, and what three rounds of roadmap revision reveal about building self-improving systems. How each completed milestone made the next constraint visible.

The MCP View: Exposing the Harness as a Tool Server

May 23, 2026 • 5 min read • Agentic Harness Engineering

Three tools — run_task, run_orchestrated, get_run — expose the full harness pipeline to any MCP client. Tool manifest with required/optional parameter badges and a live task log auto-refreshing every five seconds.

The Voice View: Push-to-Talk, Waveform Capture, and an ASR Training Flywheel

May 23, 2026 • 6 min read • Agentic Harness Engineering

Floating panel overlay with Note/Task mode toggle, 900ms push-to-talk threshold, WebAudio waveform canvas during recording, and three result types (task, answer, note). Every voice request is auto-transcribed into a growing ASR corpus that feeds NeMo RL fine-tuning.

The Security View: 36 Events, Six Layers, and Real Injection Payloads

May 23, 2026 • 7 min read • Agentic Harness Engineering

28 blocks and 8 warnings across six defense layers — injection scanner, Python scanner, file sandbox, output sandbox, CDP guard, and scratch guard. Real injection payloads surfaced from web research content, including an output sandbox block on ~/.Desktop/.env.local.

The Harness Data Model: Schemas, Entities, and Query Patterns

May 23, 2026 • 7 subsections • Agentic Harness Engineering

A complete reference for the five-file JSONL schema at the core of the harness: entity hierarchy, ID format, per-stage token accounting, message role taxonomy, and querying patterns in jq, pandas, and DuckDB.

Experiments and Alignment Foundations

May 23, 2026 • 2 posts

Four production experiments that exposed the evaluator ceiling, the producer ceiling, and the synthesis instruction bottleneck—plus multi-objective alignment methods beyond scalar rewards.

The Sessions and Artifacts Views: CLI Audit Log and 1,053 Output Files

May 22, 2026 • 5 min read • Agentic Harness Engineering

59 CLI sessions logged since April 27 with per-session token accounting, duration, and run counts. 1,053 output files totaling 5,718 KB, color-coded by extension with inline Markdown preview and run-ID linkage back to the source pipeline execution.

The System View: Governance Docs, Active Configuration, and 38 Skills

May 21, 2026 • 8 min read • Agentic Harness Engineering

Editable AGENTS.md and four wiki files, a live config pane showing active inference endpoints and runtime settings (pass threshold 8.0, wiggum max rounds 3, research cache on), and a 38-skill registry with hook-type filter chips covering standalone, pre-synthesis, pre-research, post-synthesis, post-wiggum, and modifier tiers.

The Submit View: Queuing Tasks and Watching the Pipeline Execute in Real Time

May 21, 2026 • 7 min read • Agentic Harness Engineering

Task form with producer model override and skip-wiggum flag, live SSE event feed rendering memory hits, plan queries, search rounds, synthesis checkpoints, and wiggum scores as typed cards, an optional plan gate for human approval before research begins, expandable chain-of-thought accordions per pipeline stage, and a result card with rendered Markdown output.

The OSINT Skill: 11-Layer Target Enrichment for Research Tasks

May 21, 2026 • 9 min read • Agentic Harness Engineering

Automatic target detection from task strings, 11 parallel enrichment layers (DNS through HIBP breach check), LLM-generated dork queries with advanced operators, and citation-tagged markdown injected directly into the synthesis context window.

The Fine-tune View: DPO Training Runs, RL Data, and the Preference Feedback Loop

May 21, 2026 • 8 min read • Agentic Harness Engineering

A DPO training run taken to COMPLETE: 2,145 steps, 0.4665 final loss, 87.3% token accuracy, 39 minutes. The RL DATA tab surfaces the full preference dataset (163 pairs, 76 reward signals, 40 ORPO examples), which accumulates from autoresearch experiments and RLHF quality signals.

The Autoresearch View: 40 Experiments, One Keep, and a 3% Signal Rate

May 20, 2026 • 8 min read • Agentic Harness Engineering

Real-time supervision of the SYNTH_INSTRUCTION optimizer: 40 experiments run, 1 kept, 39 discarded, 3% keep rate, best score 8.530, average delta −1.038. The experiment table surfaces every mutation description, score delta, and discard count — the full hill-climbing log in one view.

The Analytics View: Score Trends, Token Spend, and Run Distribution

May 20, 2026 • 6 min read • Agentic Harness Engineering

Score trend across all 1,442 runs, daily volume and pass-rate charts, daily token spend with input/output split peaking at 285k, a score distribution histogram concentrated at 8, task type breakdown, and 15,855,026 lifetime input tokens.

The Runs View: Pipeline Stage Visualization and Live Run Monitoring

May 20, 2026 • 7 min read • Agentic Harness Engineering

1,442 runs filterable by PASS / FAIL / ERROR, a real-time pipeline stage map, a per-stage COMPUTE table, and a Context Window treemap that shows model context fill by stage — so you can see at a glance whether a run had headroom or was close to truncating its own prompt.

The Memory View: 2,173 Observations, Quality Signals, and an Ontology Graph

May 19, 2026 • 8 min read • Agentic Harness Engineering

A tour of the harness Memory UI: a semantic observation store that accumulates every run as structured facts, filterable by quality signal, searchable by task, provenance-linked to the run that created it, and visualized as a force-directed ontology graph.

The GitHub View: Repo Health at a Glance

May 19, 2026 • 6 min read • Agentic Harness Engineering

Branch status, ahead/dirty counts, a 107-commit year-long heatmap, a searchable commit log, open PRs and issues, and a full CI run history with pass/fail status — all live in the harness dashboard without leaving the experiment environment.

Harness vs. Perplexity: Eight Iterations to Parity

May 19, 2026 • 18 min read • Agentic Harness Engineering

Eight iterations against a frozen Perplexity output on a current-conditions Fed district inflation task. Best stable result: tied 7.7/7.7. Four bugs fixed, a rubric rewritten for economic research, a two-pass extraction experiment that proved depth=7 is achievable but trades away the grounded score to get it, and a Beige Book chunking root cause that explains why.

The Alt-Data Pipeline: From Beige Book to Paper Trading Thesis

May 18, 2026 • 15 min read • Agentic Harness Engineering

Five enrichment layers close the loop from macro narrative to actionable paper trading theses: BEA economic accounts, cross-sectional market signals (Hurst, momentum rank, cointegration), yfinance fundamentals, Alpaca portfolio context, and a structured [THESIS:...] citation template. The report gates every trade.

Small Language Models and the Efficiency-Accuracy Frontier

May 8, 2026 • 16 min read

SLM-Bench: 15 models, 9 tasks, 4 hardware configs. Accuracy and energy efficiency don't co-optimize—so model selection is a portfolio problem, not a ranking problem. How the three-model architecture from experiment-04 operationalizes this insight.

Literature Reviews: Tools, Knowledge Graphs, Security, and Fine-Tuning

May 1, 2026 • 4 posts

Four surveys covering agentic tool use and planning, structured knowledge extraction and graph retrieval, prompt injection attack patterns, and fine-tuning and alignment deep cuts—and what each means for harness design.

Literature Reviews: Evaluation, Judges, and Structured Queries

April 29, 2026 • 5 posts

Five surveys: benchmark contamination and judge reliability, evaluation uncertainty and calibration, automated evaluation robustness, SPARQL-grounded knowledge queries, and judge benchmarks with test-time scaling.

Circuit Extraction: Interpreting Object Detectors

January 13, 2026 • 14 min read

Using activation patching and co-activation analysis to extract the minimal computational circuit for pot detection in Faster R-CNN.

Object Detection on Drone Orthomosaics with SAM

January 10, 2026 • 8 min read

An overview of using Meta's Segment Anything Model for automated object detection in high-resolution aerial imagery, with applications in precision agriculture.

Sparse Linear Probing for Efficient Detection

January 9, 2026 • 10 min read

Using L1-regularized linear probes to identify minimal feature subsets from SAM and Faster R-CNN that are sufficient for pot detection.

Extracting Features from Vision Model Backbones

January 7, 2026 • 12 min read

A technical guide to extracting and visualizing internal representations from SAM and Faster R-CNN for interpretability research.

Mechanistic Interpretability for Agricultural AI

January 5, 2026 • 10 min read

Exploring how mechanistic interpretability techniques can help us understand what vision models learn about agricultural environments and build more trustworthy AI systems.

SAM vs Faster R-CNN: A Practical Comparison

January 3, 2026 • 10 min read

Comparing Segment Anything Model and Faster R-CNN for aerial object detection—architecture, fine-tuning approaches, and when to use each.

Fine-Tuning Vision Foundation Models

December 28, 2025 • 12 min read

A practical guide to fine-tuning strategies for vision models like SAM and Faster R-CNN, with insights on data efficiency and domain adaptation.

Building a GeoTIFF Object Detection Web App

December 28, 2025 • 5 min read

A walkthrough of building a web application for running Faster R-CNN inference on geospatial imagery with FastAPI, WebSockets, and Leaflet.

Training Faster R-CNN for Geospatial Object Detection

December 20, 2025 • 8 min read

A deep dive into training object detection models on aerial imagery, from SAM masks to production-ready Faster R-CNN with hard negative mining.