May 29, 2026 • 6 min read • Agentic Harness Engineering

inference.py: The Unified LLM Backend Shim

A drop-in replacement for import ollama that transparently routes calls to Ollama, vLLM, llama-server, or any OpenAI-compatible endpoint — with no changes required anywhere else in the codebase.

When the harness started, every file imported Ollama directly. Switching to vLLM for larger models meant touching every call site. inference.py resolves this: one module that speaks Ollama's API on the outside and translates to whatever backend is configured. Add vLLM, llama.cpp, or a remote API by setting an environment variable — the rest of the codebase never knows.

Three-level routing priority

1. HARNESS_ENDPOINTS (per-model registry): highest priority; explicit URL + backend per model tag

2. VLLM_MODEL_MAP (hybrid routing): listed models → vLLM; everything else → Ollama

3. INFERENCE_BACKEND (global switch): ollama (default) or vllm; applies to all models

HARNESS_ENDPOINTS is a JSON dict mapping model tags to endpoint configs. It enables scenarios like running vLLM on GPU 0 for the 14B producer and llama.cpp on GPU 1 for a 7B utility model simultaneously — each gets its own URL, served model name, and backend type:

HARNESS_ENDPOINTS='{
  "qwen3-14b": {"url": "http://localhost:8000/v1", "model_id": "qwen3-14b", "backend": "vllm"},
  "phi4-mini":  {"url": "http://localhost:8001/v1", "model_id": "phi-4-mini-q8", "backend": "llamacpp"}
}'

VLLM_MODEL_MAP is a lighter hybrid routing mechanism: list only the models that should go to vLLM. Any model tag not in the map falls back to Ollama. This is the typical production setup — the 32B producer via vLLM, glm4:9b and llama3.2:3b via Ollama for planning and utility tasks.
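The three levels can be read as a short decision chain. The sketch below is one reading of the priority order described above, not the code in inference.py: resolve_route is a hypothetical name, and it assumes VLLM_MODEL_MAP is a JSON dict of Ollama tag to vLLM model ID, which this post does not spell out.

```python
import json


def resolve_route(model, env):
    """Return (backend, url_or_None, served_model_id) for an Ollama-style tag.

    `env` is any mapping of environment variables, e.g. os.environ.
    """
    # Level 1: the explicit per-model registry wins outright.
    endpoints = json.loads(env.get("HARNESS_ENDPOINTS", "{}"))
    if model in endpoints:
        cfg = endpoints[model]
        return cfg["backend"], cfg["url"], cfg.get("model_id", model)

    # Level 2: hybrid routing. Assumed format: JSON dict of tag -> vLLM model ID.
    vllm_map = json.loads(env.get("VLLM_MODEL_MAP", "{}"))
    if vllm_map:
        if model in vllm_map:
            return "vllm", None, vllm_map[model]
        # Tags missing from the map fall back to Ollama.
        return "ollama", None, model

    # Level 3: global switch, defaulting to Ollama.
    return env.get("INFERENCE_BACKEND", "ollama"), None, model
```

Passing the environment in as a plain mapping, rather than reading os.environ inside the function, keeps the routing decision easy to test without touching process state.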

Ollama tag → HuggingFace model ID translation

Ollama uses short tags (pi-qwen-32b, glm4:9b). vLLM serves models by their HuggingFace repo ID or --served-model-name. _MODEL_MAP translates between them:

Ollama tag	vLLM model ID
`pi-qwen-32b`	`Qwen/Qwen2.5-32B-Instruct`
`pi-qwen3-32b`	`pi-qwen3-32b` (--served-model-name)
`glm4:9b`	`THUDM/glm-4-9b-chat`
`llama3.2:3b`	`meta-llama/Llama-3.2-3B-Instruct`
`nomic-embed-text`	`nomic-ai/nomic-embed-text-v1.5`
`phi4:14b`	`microsoft/phi-4`
…14 entries total; VLLM_MODEL_MAP merges additional overrides at runtime

If a tag isn't in the map, it passes through unchanged — useful for models served with --served-model-name set to the same string as the Ollama tag.

Response adapters

The rest of the codebase accesses LLM responses two ways: attribute access (response.message.content) and dict-style access (response["message"]["content"]). Both patterns were established when Ollama was the only backend. Two adapter classes make vLLM responses look identical:

_OllamaMessage wraps an OpenAI ChatCompletionMessage and handles <think>...</think> extraction. Some vLLM setups move thinking content into reasoning_content; others leave the tags inline. The adapter handles both cases and exposes a .thinking attribute that logger.py reads for the thinking-chars metric.
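A rough sketch of how such an adapter could cover both thinking formats and both access styles. MessageSketch and split_thinking are hypothetical names, not the actual _OllamaMessage implementation, and the choice to prefer reasoning_content over inline tags when both are present is an assumption:

```python
import re

# Non-greedy match so multiple <think> blocks are captured separately.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content, reasoning_content=None):
    """Return (visible_content, thinking) for either server behavior."""
    content = content or ""
    inline = "\n".join(part.strip() for part in _THINK_RE.findall(content))
    visible = _THINK_RE.sub("", content).strip()
    # Assumption: a separate reasoning_content field takes precedence if set.
    thinking = (reasoning_content or "").strip() or inline
    return visible, thinking


class MessageSketch:
    """Supports message.content and message["content"] alike."""

    def __init__(self, content, reasoning_content=None, role="assistant"):
        self.role = role
        self.content, self.thinking = split_thinking(content, reasoning_content)

    def __getitem__(self, key):
        try:
            return getattr(self, key)
        except AttributeError:
            raise KeyError(key) from None
```

For example, MessageSketch("<think>plan</think>Done.") yields content "Done." and thinking "plan" through either access style.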

_OllamaResponse wraps the accumulated streaming output with real wall-clock timing:

prompt_eval_duration — time from request start to first content token (TTFT / prefill latency)
eval_duration — time from first token to stream end (pure generation)
total_duration — end-to-end wall time

These are measurements taken during the streaming loop, not estimates. They feed directly into runs.jsonl, the dashboard tok/s charts, and benchmark comparisons, so the latency numbers in the analytics view reflect what the harness observed end to end rather than vLLM's self-reported internal counters.
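The three timings fall out of two clock reads around the stream plus one at the first non-empty chunk. The sketch below assumes the stream has already been reduced to an iterable of text pieces; timed_stream is a hypothetical name. Durations are in nanoseconds, matching the unit Ollama uses for these fields.

```python
import time


def timed_stream(chunks, clock=time.perf_counter_ns):
    """Consume an iterable of text chunks and return Ollama-style timings.

    If `chunks` is a lazy generator that issues the request on first
    iteration, the start timestamp is effectively the request start.
    """
    start = clock()
    first_token_at = None
    parts = []
    for text in chunks:
        # Only a non-empty chunk counts as the first content token.
        if text and first_token_at is None:
            first_token_at = clock()
        parts.append(text)
    end = clock()
    if first_token_at is None:
        # No content arrived: attribute all time to prefill.
        first_token_at = end
    return {
        "content": "".join(parts),
        "prompt_eval_duration": first_token_at - start,
        "eval_duration": end - first_token_at,
        "total_duration": end - start,
    }
```

The injectable clock parameter exists only so the arithmetic can be checked deterministically; in normal use the default monotonic clock is what you want.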

Migration pattern

Three import patterns cover the full codebase:

# Most files (skills, utilities) — full Ollama compatibility at module level
import inference as ollama
response = ollama.chat(model="glm4:9b", messages=[...])

# Files that need keep_alive per-instance (agent.py, wiggum.py, autoresearch.py)
from inference import OllamaLike
ollama = OllamaLike(keep_alive=_KEEP_ALIVE)

# Files calling with keyword args only (email_skill, lit_review, etc.)
from inference import chat as _llm_chat
_llm_chat(model="pi-qwen-32b", messages=[...], options={...})

The OllamaLike wrapper attaches a fixed keep_alive value to every call, which is how agent.py's per-run keep_alive estimation (described in Inside agent.py) propagates through to the actual API call without threading it through every function signature.
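In outline, a wrapper like this only needs to hold the value and inject it on each call. The sketch below is illustrative rather than the real OllamaLike: OllamaLikeSketch is a hypothetical name, and chat_fn is an injected stand-in for whatever module-level chat function the wrapper forwards to.

```python
class OllamaLikeSketch:
    """Binds a keep_alive value once and forwards it on every chat call."""

    def __init__(self, keep_alive, chat_fn):
        self._keep_alive = keep_alive
        self._chat_fn = chat_fn  # stand-in for the module-level chat function

    def chat(self, model, messages, **kwargs):
        # The bound value always wins, so callers never pass keep_alive.
        kwargs["keep_alive"] = self._keep_alive
        return self._chat_fn(model=model, messages=messages, **kwargs)
```

Call sites keep the familiar shape, ollama.chat(model=..., messages=...), while the per-run value travels with the instance instead of through function signatures.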

When VLLM_MODEL_MAP is set, only the listed model tags route to vLLM. Any call with a tag not in the map falls back to Ollama silently. This means adding a new model to the harness without updating the map will use Ollama — which is usually the right default, but can surprise you if you expect vLLM throughput for a new model tag.
