Parallel Inference: Hardware Substrates for LLM Workloads
GPU memory hierarchy, SM architecture, llama.cpp, vLLM, GGUF quantization, and KV cache pressure — the hardware realities that determine whether your agentic pipeline is fast or broken.
A cold llama.cpp process loading a 13B GGUF model from disk takes eight to fifteen seconds before it can process a single token. The same model, already resident in an Ollama keep-alive session, responds in under 200 milliseconds. The 50× latency gap is not a software artifact — it is physics. Model weights must traverse the PCIe bus from storage, fill VRAM, and warm up the SM register files before the first matmul can execute.
Most agentic pipeline practitioners treat this interface as a black box. They set OLLAMA_KEEP_ALIVE=30m, observe that cold starts hurt, and move on. But the black box is not opaque — and understanding what is inside it changes how you design the entire substrate layer. It explains why Pattern A4 (Keep-Alive Budget) exists, why A1 (Inference Shim) routes llama.cpp and vLLM differently, why the Wiggum Loop's evaluator must be a permanently-warm smaller model, and why RunTrace flame graphs show flat 400ms spans that have nothing to do with the model's reasoning quality.
This post draws from Kirk & Hwu's Programming Massively Parallel Processors (PMPP) for the GPU architecture fundamentals, and maps those fundamentals directly to the inference runtime landscape: llama.cpp, vLLM, GGUF quantization, and KV cache management.
Two Design Philosophies, One Workload
The CPU and GPU represent fundamentally different answers to the same question: how do you move data from memory to arithmetic units as fast as possible? The CPU's answer is to make each memory access fast — large L3 caches, hardware prefetchers, out-of-order execution pipelines that can reorder hundreds of in-flight instructions to hide latency. A modern CPU might have 16 cores, each with a 2MB L2 cache and access to 32MB of shared LLC. It is optimized for low-latency, single-thread performance.
The GPU's answer is the opposite: don't try to hide individual memory latency — instead, maintain so many threads simultaneously that while one group waits on a DRAM fetch, a thousand others are executing. NVIDIA's GT200 architecture (the first to match a petaflop) deployed 30 Streaming Multiprocessors (SMs), each capable of holding 8 concurrent thread blocks and up to 1,024 threads, for a total of roughly 30,000 simultaneously-scheduled threads. The G80 before it sustained 86.4 GB/s of DRAM bandwidth, compared to the ~50 GB/s of contemporary CPU DDR3 systems — and the gap widens every GPU generation.
CPU optimizes for single-thread latency via large caches and OOO execution. GPU optimizes for throughput via massive multithreading — thousands of threads hide DRAM latency by always having another warp ready to execute.
For transformer inference, the GPU's design is nearly optimal. Every token generation step is dominated by matrix-vector multiplications: the query, key, and value projections for attention, the feed-forward up/down projections, and the final unembedding. These operations are embarrassingly parallel — each output element can be computed independently — and they are bandwidth-bound, not compute-bound. The bottleneck is how fast you can stream model weights from VRAM to the arithmetic units. GPU memory bandwidth is the key figure.
Callback → Post 4 (A1 Inference Shim): The shim's backend routing isn't just an abstraction convenience. llama.cpp and vLLM have fundamentally different latency profiles because they make different hardware tradeoffs. The shim needs to route to them differently because their response contracts — single-token streaming vs. full-response batched — reflect those hardware realities.
The GPU Memory Hierarchy
PMPP Chapter 3 describes the GPU memory model in terms of four distinct levels, each with different scope, size, latency, and programmer-visibility. Understanding these four levels is the foundation for understanding why inference runtimes make the choices they do.
Memory closer to the compute units is faster but smaller and scoped to fewer threads. Global DRAM is the primary bottleneck — high bandwidth but 200–400 cycle access latency, tolerated by warp switching.
Registers are the fastest storage on the chip, allocated per-thread by the compiler. The G80 gave each SM 8,192 32-bit registers; the GT200 doubled this to 16,384. They are invisible to any other thread and have near-zero access latency. For inference kernels, the per-thread accumulator registers hold the running dot-product sum during a matmul iteration.
Shared memory (also called scratchpad or L1-mode) is the most important optimization target for compute-bound kernels. It is partitioned per SM (48–96KB depending on architecture), visible to all threads within the same block, and accessible in roughly 4 cycles. Tiled matrix multiplication — the canonical CUDA optimization — works by having threads collaboratively load sub-tiles of the input matrices from global memory into shared memory, then computing from shared memory rather than issuing per-element global loads. This slashes global memory traffic by the tile size factor.
Global DRAM is the GPU's main memory: gigabytes of HBM (High Bandwidth Memory) on modern parts, accessible by all threads on the device. It is the bottleneck. A GT200 sustained 150 GB/s to its global DRAM; a modern H100 sustains 3.35 TB/s. Individual accesses to global memory incur 200–400 clock cycle latency, which the GPU tolerates not by caching but by switching to another warp whose operands are already available. This is the latency-tolerance mechanism, and it is the key to understanding why GPU inference scales with batch size in a way CPU inference does not.
Streaming Multiprocessors and Warp Execution
The SM is the GPU's fundamental compute unit. Each SM contains a set of Streaming Processors (SPs, the individual ALUs), a shared memory bank, a register file, and a warp scheduler. In PMPP's terminology, threads are organized into a three-level hierarchy: grids contain blocks, blocks contain threads. A block is assigned to exactly one SM for its entire lifetime; the SM's warp scheduler then time-slices execution across the 32-thread warps that make up that block.
The SPMD (Single Program, Multiple Data) execution model means all threads in a warp execute the same instruction at the same clock cycle, on different data. When a warp issues a load to global memory, the scheduler marks it as waiting and immediately schedules a different warp whose operands are ready. If the SM has enough resident warps, the DRAM latency is completely hidden — by the time the memory request returns, the scheduler has rotated through all other warps and is ready to resume the blocked one.
This is transparent scalability in action: the same kernel code that runs on an 8-SM consumer GPU automatically parallelizes across 30 SMs on a workstation GPU or 132 SMs on an H100 data-center card. The programmer specifies work in terms of thread counts; the hardware distributes blocks across however many SMs are available. Kirk & Hwu call this the "write once, scale automatically" property, and it is precisely why vLLM's attention kernels work on a 3090 and an H100 without code changes.
Warp divergence: when threads within the same warp take different branches (e.g., in a masked attention kernel), the warp must execute both paths serially, deactivating threads that shouldn't take each branch. Flash Attention 2 specifically restructures the attention computation to eliminate intra-warp divergence in the softmax normalization step — this is one of the primary reasons it achieves 2–4× throughput over naive attention implementation.
The Inference Runtime Landscape
Two runtimes dominate the self-hosted LLM inference landscape. They represent different bets on the hardware tradeoffs above, and the right choice depends on VRAM headroom, concurrency requirements, and whether the deployment is a developer workstation or a production serving cluster.
Runtime selection is primarily determined by VRAM headroom and concurrency requirements. Ollama wraps llama.cpp with process lifecycle management, implementing Pattern A4 at the OS level.
llama.cpp CPU-First Inference
Written in C++ with no mandatory GPU dependency. Loads GGUF-format quantized models into system RAM, with optional GPU offload via --n-gpu-layers N. Each transformer block can be independently assigned to GPU or CPU. Single-user, streaming, low-latency per-token generation. Ollama wraps it with keep-alive process management.
vLLM GPU-First Serving
GPU-only inference engine with PagedAttention for KV cache management and continuous (iterative) batching. Processes multiple concurrent requests by interleaving token generation at the per-step level. Achieves 2–4× higher throughput than static batching. Requires full model weights to fit in VRAM.
Ollama is not a separate inference engine — it is a process manager and HTTP API layer wrapped around llama.cpp. Its key contribution to the harness is the OLLAMA_KEEP_ALIVE configuration, which maps directly to Pattern A4. When a model's keep-alive timer expires, Ollama unloads it from VRAM. The next request then pays the full cold-start cost: GGUF file read from disk, weight dequantization, VRAM allocation, SM register initialization. This is the 8–15 second gap that Pattern A4 exists to prevent.
Callback → Post 4 (A4 Keep-Alive Budget): The Keep-Alive Budget is not a quality-of-life feature — it is hardware cost amortization. The 8–15 second cold-start penalty is VRAM allocation time plus PCIe transfer time for model weights. A 7B Q4_K_M model is ~4GB on disk; at PCIe 4.0 x16 bandwidth (64 GB/s theoretical, ~20 GB/s practical for sequential reads), that's 200ms just for the transfer, before any compute begins.
GGUF Quantization and the VRAM Budget
The GGUF format (GGML Unified Format, successor to GGML's earlier binary formats) stores quantized model weights in a self-describing container. Quantization reduces the number of bits used to represent each weight, trading a small amount of model quality for large reductions in memory footprint and, critically, in memory bandwidth consumption during inference.
The K-quant variants (Q4_K_M, Q5_K_M, Q6_K) use block quantization with two learned scales per 256-weight super-block. The _M suffix ("medium") applies higher-precision quantization to attention layers while using lower precision in feed-forward layers — a heuristic that targets the layers most sensitive to quantization error. This is materially different from uniform quantization and accounts for K-quants consistently outperforming naive Q4/Q5 on downstream benchmarks.
VRAM requirements for four common model sizes across quantization tiers. Horizontal lines mark consumer GPU tiers. The KV cache (not shown) adds 1–8GB on top of model weights at typical batch sizes.
| Format | Bits/Weight | 7B VRAM | 13B VRAM | Quality vs FP16 | Typical Use |
|---|---|---|---|---|---|
FP16 |
16 | 14.0 GB | 26.0 GB | Baseline | vLLM evaluator (A2), highest-fidelity scoring |
Q8_0 |
8 | 7.7 GB | 14.4 GB | −0.1–0.3% | Evaluator on consumer GPU (C2 rubric scoring) |
Q5_K_M |
≈5.5 | 5.3 GB | 9.8 GB | −0.5–1.0% | High-quality producer on 8GB GPU |
Q4_K_M |
≈4.5 | 4.4 GB | 8.1 GB | −1.0–2.5% | Harness default producer; bandwidth-efficient |
Q3_K_M |
≈3.5 | 3.5 GB | 6.4 GB | −4.0–8.0% | Avoid for production; noticeable coherence loss |
The harness default is Q4_K_M for the producer and Q8_0 for the evaluator where VRAM permits. This is a deliberate asymmetry: the producer's output is revised if it scores below threshold (the Wiggum Loop will catch and correct quality regressions from quantization), but the evaluator's judgment must be reliable. An evaluator running at Q3_K_M or below introduces systematic scoring noise that cannot be compensated for by rubric design.
Callback → Post 6 (Wiggum Loop): The evaluator model quality floor matters more than the producer quality floor. A producer running Q4_K_M and scoring 6.8/10 can be revised to 7.5/10. An evaluator running Q3_K_M and scoring a 6.8 as a 7.5 silently poisons the entire evaluate→revise feedback loop. Keep the evaluator at Q8_0 minimum.
KV Cache: VRAM's Hidden Tenant
VRAM does not hold only model weights. Every token in every active context window requires a corresponding Key and Value tensor from each attention layer — the KV cache. For a model with 32 attention layers, 32 heads, 128 head dimensions, at FP16, a single token requires:
bytes_per_token = 2 (K and V) × 32 (layers) × 32 (heads) × 128 (head_dim) × 2 (FP16)
= 2 × 32 × 32 × 128 × 2
= 524,288 bytes ≈ 512 KB per token
For a context window of 8,192 tokens, that's 4GB of KV cache for a single request. A batch of four concurrent requests at 4K context each adds another 8GB on top of model weights. This is why a 24GB GPU that appears to have plenty of room for a 7B Q4_K_M model (4.4GB) can still run out of VRAM mid-inference: the KV cache is dynamic, grows with context length, and competes with the weights for the same DRAM.
vLLM's PagedAttention solves this with a virtual memory abstraction borrowed from operating systems. Rather than pre-allocating a contiguous KV cache block for each request's maximum sequence length, PagedAttention divides the KV cache into 16-token pages and allocates them lazily as sequences extend. Pages are freed when requests complete. This eliminates internal fragmentation, allows sequences with common prefixes to share pages, and enables the scheduler to accept new requests as long as any page is free — rather than rejecting requests whenever any single pre-allocated block is exhausted.
Callback → Post 5 (B3 Dual-Backend Memory Store): The Dual-Backend Memory Store's design — semantic vectors in a GPU-resident embedding index, lexical BM25 in CPU RAM — is partially motivated by this KV cache pressure. Keeping the embedding store GPU-resident is valuable for retrieval latency, but the store must give way to KV cache allocation when inference begins. The hybrid design ensures retrieval can fall back to CPU-resident lexical search when VRAM is under pressure.
Amdahl's Law and the Hybrid Inference Ceiling
llama.cpp's --n-gpu-layers flag implements a hybrid execution model: some transformer blocks run on GPU, the rest on CPU, with activations shuttling across the PCIe bus at each block boundary. Amdahl's Law governs the speedup ceiling of this arrangement. If the GPU fraction of compute delivers a 10× speedup over CPU, and we offload fraction p of layers to GPU:
speedup(p, s=10) = 1 / ((1 - p) + p/s)
p = 0.50 → speedup = 1 / (0.50 + 0.05) = 1.82×
p = 0.75 → speedup = 1 / (0.25 + 0.075) ≈ 3.1×
p = 0.90 → speedup = 1 / (0.10 + 0.01) ≈ 9.1×
p = 0.95 → speedup = 1 / (0.05 + 0.005) ≈ 18×
The practical implication: partial offload (50–75% of layers) delivers only 2–3× speedup despite the GPU being 10× faster. The PCIe bus becomes the bottleneck as layer boundaries multiply activation transfers. PCIe 4.0 x16 provides ~64 GB/s theoretical, but practical sustained bandwidth for small activation tensors (typically 4–16 KB per boundary) runs far lower due to per-transfer overhead. The optimal strategy is usually all-or-nothing: either fit the entire model on GPU (full GPU inference via vLLM or full Ollama layer offload) or accept CPU-only inference for models that don't fit.
Kirk & Hwu document this phenomenon directly: in PMPP Chapter 1, they note that even expertly optimized CUDA code achieves only 10–15× speedup over CPU for workloads where PCIe data transfer is on the critical path, but 45–105× when the workload is structured to amortize transfers across large batches. For LLM inference, "large batches" means more concurrent users — which is exactly vLLM's continuous batching argument.
Callback → Post 4 (A4 Keep-Alive Budget): The Keep-Alive Budget's warm-pool strategy amortizes the model-load transfer cost across many inference calls. But Amdahl's ceiling on partial offload explains why the budget also specifies which models get GPU priority: if a model doesn't fit in VRAM in full, the budget should allocate VRAM to the model that does fit fully, not split it for partial offload of the larger model.
Continuous Batching and Latency Tolerance
The GPU's warp-switching mechanism — tolerating DRAM latency by maintaining thousands of in-flight threads — has a direct software analog in vLLM's continuous batching. Traditional static batching processes all requests in a batch together, from prompt to end-of-sequence, then starts the next batch. This creates a "convoy effect": a short request that finishes in 50 tokens must wait for a long request running 2,000 tokens before the GPU moves to the next item.
Continuous (iterative) batching schedules at the token level rather than the request level. After each token generation step, the scheduler checks whether any request has completed (hit EOS) and immediately fills the freed batch slot with a new request. The GPU is never idle waiting for slow requests to finish. New requests begin generating tokens within one forward-pass cycle of their arrival. This is latency tolerance at the software layer, directly analogous to warp switching at the hardware layer.
# Conceptual continuous batching loop
while active_requests or pending_requests:
# Fill batch to capacity from pending
while len(batch) < MAX_BATCH and pending_requests:
batch.append(pending_requests.pop())
# Single forward pass across entire batch
next_tokens = model.forward(batch)
# Update sequences, evict completed requests
for req in batch:
req.append_token(next_tokens[req.id])
if req.is_done():
yield req.output
batch.remove(req) # slot freed immediately
Callback → Post 10 (F1 RunTrace / F3 Chrome Trace Exporter): When you export a harness run to Chrome Trace format and open it in Perfetto, the inference spans tell the hardware story. A flat 400ms span at the start of a run labeled model_load is a cold-start PCIe transfer. A series of short, uniform spans is continuous-batching token generation. A single long span followed by nothing is a static batch waiting for the longest request to finish. The flame graph is the hardware manifest.
Choosing a Runtime: A Decision Framework
The choice between llama.cpp/Ollama and vLLM is determined by three parameters: available VRAM, required concurrency, and deployment environment.
| Condition | Recommended Runtime | Reasoning |
|---|---|---|
| VRAM < model FP16 size; single user | llama.cpp / Ollama |
GGUF quantization fits model; no concurrency pressure; keep-alive handles cold start |
| VRAM ≥ model size (full load); single user | Ollama (full GPU offload) | All layers on GPU; eliminate PCIe hop; Amdahl ceiling at 100% parallel fraction |
| Multiple concurrent users; dedicated GPU server | vLLM |
PagedAttention + continuous batching; 3–4× throughput over static batch; scales to hundreds of req/s |
| Model too large for any single GPU | vLLM with tensor parallelism | Shards attention heads across GPUs; requires NVLink for acceptable inter-GPU bandwidth |
| No GPU available | llama.cpp CPU-only |
AMX/AVX-512 optimizations on modern CPUs; Q4_K_M maximizes throughput-per-watt; accept 10–20× slower than GPU |
The harness Inference Shim (Pattern A1) abstracts this decision behind a single chat(model, messages) call. The HARNESS_ENDPOINTS config maps model name tags to backend URLs and types. Switching from llama.cpp to vLLM for the producer model is a one-line config change — but it represents a fundamentally different hardware utilization pattern, and the RunTrace will show it.
What the Hardware Tells You
The patterns in this series were not designed in the abstract. They were extracted from a real pipeline where inference latency was the dominant runtime cost, where VRAM pressure caused OOM crashes during long Wiggum Loop revisions, and where partial GPU offload delivered disappointing speedups that only became explicable through Amdahl's Law. The hardware is not a detail — it is the constraint that shapes every design decision.
Pattern A4 (Keep-Alive Budget) exists because PCIe transfers are expensive and VRAM is limited. Pattern A2 (Model Role Separation) is feasible because keeping two different small models warm costs less VRAM than one large model. Pattern C3 (Surgical Compressor) reduces evaluation payloads below 6,000 characters partly because long contexts grow the KV cache during evaluation, burning VRAM that the producer needs for generation. Pattern D1 (DAG Orchestrator) spawns parallel agents partly because GPU throughput scales with concurrency — running two 256-token evaluations in parallel on vLLM costs less wall-clock time than running them sequentially.
Callback → Post 7 (C3 Surgical Compressor): The Surgical Compressor's 6,000-character evaluation threshold is not arbitrary. At that length, with a 32-head 7B model, the evaluator's KV cache during a single forward pass consumes roughly 512MB of additional VRAM. Above 6,000 characters, the probability of VRAM pressure during the evaluate→revise cycle rises sharply on consumer-grade hardware.
Research grounding: The runtime selection decision is complicated by a documented reliability gap in the inference engines themselves. An empirical study of five widely-adopted LLM inference engines found systematic bugs in production: memory leaks, out-of-memory errors, incorrect tensor shapes, and performance degradation from suboptimal configuration settings — with no single engine uniformly stable across all workloads. (arXiv:2506.09713) This is the practical motivation behind the harness's RunTrace pattern (Post 10, F1): inference engine failures are documented production failure modes, not hypothetical edge cases. The Inference Shim (A1) provides the abstraction layer that makes engine substitution feasible when a specific backend exhibits instability — without it, a memory leak in one engine's Python bindings requires a surgical rewrite across every inference call site.
Understanding the GPU memory hierarchy, the warp execution model, and Amdahl's Law won't make you write better Python. But it will make you understand why the patterns exist, why their thresholds are where they are, and why the Wiggum Loop's latency budget is tight. The physics does not negotiate. The patterns are a negotiation with it.