inference.py: The Unified LLM Backend Shim
A drop-in replacement for import ollama that transparently routes calls to Ollama, vLLM, llama-server, or any OpenAI-compatible endpoint — with no changes required anywhere else in the codebase.
Early in the harness's life, every file imported ollama directly. Switching to vLLM for larger models meant touching every call site. inference.py resolves this with one module that speaks Ollama's API on the outside and translates to whatever backend is configured. Adding vLLM, llama.cpp, or a remote API is a matter of setting an environment variable; the rest of the codebase never knows.
Three-level routing priority
1. HARNESS_ENDPOINTS (per-model registry): highest priority; an explicit URL and backend per model tag.
2. VLLM_MODEL_MAP (hybrid routing): listed models go to vLLM; everything else goes to Ollama.
3. INFERENCE_BACKEND (global switch): ollama (default) or vllm; applies to all models.
HARNESS_ENDPOINTS is a JSON dict mapping model tags to endpoint configs. It enables scenarios like running vLLM on GPU 0 for the 14B producer and llama.cpp on GPU 1 for a 7B utility model simultaneously — each gets its own URL, served model name, and backend type:
HARNESS_ENDPOINTS='{
"qwen3-14b": {"url": "http://localhost:8000/v1", "model_id": "qwen3-14b", "backend": "vllm"},
"phi4-mini": {"url": "http://localhost:8001/v1", "model_id": "phi-4-mini-q8", "backend": "llamacpp"}
}'
VLLM_MODEL_MAP is a lighter hybrid routing mechanism: list only the models that should go to vLLM. Any model tag not in the map falls back to Ollama. This is the typical production setup — the 32B producer via vLLM, glm4:9b and llama3.2:3b via Ollama for planning and utility tasks.
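The three-level priority can be sketched as a single lookup function. This is a minimal illustration, not the module's actual code: the function name resolve_backend is hypothetical, VLLM_MODEL_MAP is assumed to be a JSON dict of Ollama tag to vLLM model ID, and a url of None stands for "use that backend's default URL".

```python
import json
import os


def resolve_backend(model_tag, env=None):
    """Pick a backend for model_tag using the three-level priority (sketch)."""
    env = os.environ if env is None else env

    # Level 1: HARNESS_ENDPOINTS, an explicit per-model registry.
    endpoints = json.loads(env.get("HARNESS_ENDPOINTS") or "{}")
    if model_tag in endpoints:
        cfg = endpoints[model_tag]
        return {
            "backend": cfg["backend"],
            "url": cfg["url"],
            "model_id": cfg.get("model_id", model_tag),
        }

    # Level 2: VLLM_MODEL_MAP (assumed format: JSON dict of tag -> vLLM model ID).
    # Once the map is set, unlisted tags fall back to Ollama.
    vllm_map = json.loads(env.get("VLLM_MODEL_MAP") or "{}")
    if vllm_map:
        if model_tag in vllm_map:
            return {"backend": "vllm", "url": None, "model_id": vllm_map[model_tag]}
        return {"backend": "ollama", "url": None, "model_id": model_tag}

    # Level 3: INFERENCE_BACKEND, a global switch that defaults to ollama.
    backend = env.get("INFERENCE_BACKEND") or "ollama"
    return {"backend": backend, "url": None, "model_id": model_tag}
```

Passing env as a plain dict keeps the sketch easy to exercise without touching the real process environment.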
Ollama tag → HuggingFace model ID translation
Ollama uses short tags (pi-qwen-32b, glm4:9b). vLLM serves models by their HuggingFace repo ID or --served-model-name. _MODEL_MAP translates between them:
| Ollama tag | vLLM model ID |
|---|---|
| pi-qwen-32b | Qwen/Qwen2.5-32B-Instruct |
| pi-qwen3-32b | pi-qwen3-32b (--served-model-name) |
| glm4:9b | THUDM/glm-4-9b-chat |
| llama3.2:3b | meta-llama/Llama-3.2-3B-Instruct |
| nomic-embed-text | nomic-ai/nomic-embed-text-v1.5 |
| phi4:14b | microsoft/phi-4 |
| …14 entries total; VLLM_MODEL_MAP merges additional overrides at runtime | |
If a tag isn't in the map, it passes through unchanged — useful for models served with --served-model-name set to the same string as the Ollama tag.
Response adapters
The rest of the codebase accesses LLM responses two ways: attribute access (response.message.content) and dict-style access (response["message"]["content"]). Both patterns were established when Ollama was the only backend. Two adapter classes make vLLM responses look identical:
_OllamaMessage wraps an OpenAI ChatCompletionMessage. It handles <think>...</think> tag extraction — some vLLM setups strip thinking content into reasoning_content, others leave tags inline. The adapter handles both cases, exposing a .thinking attribute that logger.py reads for the thinking-chars metric.
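One way to cover both cases is to prefer a separate reasoning_content value when the server supplies one and otherwise pull inline tags out with a regular expression. This is a sketch under those assumptions, not the adapter's real code; split_thinking is a hypothetical helper name.

```python
import re

# Non-greedy match so multiple <think> blocks are captured separately.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content, reasoning_content=None):
    """Return (visible_content, thinking) for one assistant message."""
    content = content or ""
    # Case 1: the server already moved the thinking text into its own field.
    if reasoning_content:
        return content, reasoning_content
    # Case 2: the tags are still inline, so extract them and strip them out.
    thinking = "\n".join(part.strip() for part in _THINK_RE.findall(content))
    visible = _THINK_RE.sub("", content).strip()
    return visible, thinking
```

The length of the second element is the kind of value a thinking-chars metric could be computed from.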
_OllamaResponse wraps the accumulated streaming output with real wall-clock timing:
- prompt_eval_duration: time from request start to first content token (TTFT / prefill latency)
- eval_duration: time from first token to stream end (pure generation)
- total_duration: end-to-end wall time
These are actual measurements taken during the streaming loop, not approximations or estimates. They feed directly into runs.jsonl, the dashboard tok/s charts, and benchmark comparisons — so the latency numbers in the analytics view reflect real hardware behavior rather than vLLM's self-reported internal counters.
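The three durations can be captured with two timestamps around the streaming loop plus one taken at the first non-empty chunk. The sketch below is illustrative: timed_stream is a hypothetical name, chunks stands in for the text deltas of a streaming response, and nanosecond units are chosen to match Ollama's duration fields (whether the shim does the same is an assumption).

```python
import time


def timed_stream(chunks, clock=time.perf_counter_ns):
    """Consume an iterable of text chunks and measure wall-clock timing."""
    start = clock()
    first_token_at = None
    parts = []
    for text in chunks:
        # Record the moment the first non-empty content chunk arrives.
        if text and first_token_at is None:
            first_token_at = clock()
        parts.append(text)
    end = clock()
    # No content at all: attribute the whole wait to prefill.
    if first_token_at is None:
        first_token_at = end
    return {
        "content": "".join(parts),
        "prompt_eval_duration": first_token_at - start,
        "eval_duration": end - first_token_at,
        "total_duration": end - start,
    }
```

The injectable clock keeps the timing logic testable with a deterministic fake.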
Migration pattern
Three import patterns cover the full codebase:
# Most files (skills, utilities) — full Ollama compatibility at module level
import inference as ollama
response = ollama.chat(model="glm4:9b", messages=[...])
# Files that need keep_alive per-instance (agent.py, wiggum.py, autoresearch.py)
from inference import OllamaLike
ollama = OllamaLike(keep_alive=_KEEP_ALIVE)
# Files calling with keyword args only (email_skill, lit_review, etc.)
from inference import chat as _llm_chat
_llm_chat(model="pi-qwen-32b", messages=[...], options={...})
The OllamaLike wrapper attaches a fixed keep_alive value to every call, which is how agent.py's per-run keep_alive estimation (described in Inside agent.py) propagates through to the actual API call without threading it through every function signature.
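The idea behind the wrapper can be sketched in a few lines. The class below is a hypothetical stand-in, not the real OllamaLike: it takes the underlying chat function as a constructor argument (an assumption made so the example is self-contained) and stamps the same keep_alive onto every call.

```python
class OllamaLikeSketch:
    """Illustrative wrapper that pins one keep_alive value to every chat call."""

    def __init__(self, keep_alive, chat_fn):
        self._keep_alive = keep_alive
        self._chat_fn = chat_fn  # the module-level chat function being wrapped

    def chat(self, **kwargs):
        # Callers never pass keep_alive themselves; the instance supplies it.
        kwargs["keep_alive"] = self._keep_alive
        return self._chat_fn(**kwargs)
```

Because the value lives on the instance, a per-run estimate only has to be computed once, at construction time.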
When VLLM_MODEL_MAP is set, only the listed model tags route to vLLM. Any call with a tag not in the map falls back to Ollama silently. This means adding a new model to the harness without updating the map will use Ollama — which is usually the right default, but can surprise you if you expect vLLM throughput for a new model tag.