Cost Envelope Management for Production AI Agents
Between April 7 and April 27, the harness ran the same eval task 239 times: “Search for best practices for cost envelope management in production AI agents.” It issued 414 web searches, averaged 1.5 evaluation rounds per run, and passed its own quality bar 15 times. This post is the answer that task was looking for—written from the telemetry those 239 runs left behind. It is both the findings and the bill.
A cost envelope is the set of resource budgets an agent must operate inside, plus the enforcement that keeps it there. For an API-metered agent the envelope is denominated in dollars. For a local-first harness the currencies are different but the discipline is identical: tokens (context in, generation out), seconds (wall-clock inference time), and VRAM (which models are resident, and for how long). Every mechanism below generalizes to the dollar-denominated case—multiply tokens by a unit price and the math carries over.
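As a minimal sketch of that definition, the envelope can be written down as a set of limits plus a check against measured usage. The class names, the limit values, and the per-million-token prices below are illustrative assumptions, not the harness's actual types; VRAM is left out to keep the example short.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Usage:
    """Resources one run actually consumed."""
    tokens_in: int = 0
    tokens_out: int = 0
    seconds: float = 0.0


@dataclass(frozen=True)
class Envelope:
    """Hypothetical per-run limits; the numbers are chosen by the operator."""
    max_tokens_in: int
    max_tokens_out: int
    max_seconds: float

    def exceeded(self, usage: Usage) -> List[str]:
        """Return the name of every budget the usage has gone over."""
        over = []
        if usage.tokens_in > self.max_tokens_in:
            over.append("tokens_in")
        if usage.tokens_out > self.max_tokens_out:
            over.append("tokens_out")
        if usage.seconds > self.max_seconds:
            over.append("seconds")
        return over


def dollars(usage: Usage, price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Convert token usage to dollars with placeholder unit prices."""
    cost = usage.tokens_in * price_in_per_mtok + usage.tokens_out * price_out_per_mtok
    return cost / 1_000_000
```

The same `Usage` record serves both cases: a local-first harness checks it against `Envelope`, and a metered deployment additionally passes it through `dollars` with its provider's real prices.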
The Commodity Answer
Before the real mechanics, it is worth showing what the agent actually found out there. One of the 239 runs—a 14B producer in a controlled experiment—synthesized the web’s consensus into advice like this:
# Example: Measuring AI cost per inference
import time
start_time = time.time()
result = model(prompt, context="...")
end_time = time.time()
cost = end_time - start_time
print("Cost of inference:", cost, "seconds")
Cost per inference, measured in units of seconds—with a budget elsewhere in the same document specified as CPU: 100 GHz, and every section closing with “use Prometheus for monitoring and Grafana for visualization.” This is not the model failing; it is the model faithfully compressing what ranks for the query. The search traces show the agent repeatedly reformulating—“best practices for cost envelope management beyond observability and budget limits”, “…beyond right-sizing and idle resource elimination”—trying to escape that commodity layer. What follows is the answer it could not find: an envelope described as enforcement points in a running system, with the run logs to show where the costs actually accumulate.
The Envelope Has Four Levels
Budget enforcement that lives in one place fails. A per-request token cap does nothing about a revision loop that issues fifteen requests; a loop cap does nothing about three models squatting in VRAM at 0.1× generation speed. The envelope is four nested budgets, each with its own enforcement point.
L1 — Per-Call tokens
Explicit num_ctx and num_predict overrides on every call site, so
no Modelfile default can silently set the context or generation budget. Context entering the
call is itself budgeted: wholesale document injection was replaced with gap-targeted
extraction after a single 14.8K-character wiki file was found bloating synthesis context past
27K characters—paying full price, in tokens and degraded output, for content the task
did not need.
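One way to make the per-call budget impossible to forget is a request builder whose budget arguments are keyword-only and have no defaults. This is a sketch under assumptions: it presumes an Ollama-style request body (the names num_ctx, num_predict, and Modelfile suggest that runtime, but the harness's real call sites are not shown), and `build_generate_payload` and `max_context_chars` are hypothetical names.

```python
from typing import Any, Dict


def build_generate_payload(
    model: str,
    prompt: str,
    *,
    num_ctx: int,
    num_predict: int,
    max_context_chars: int,
) -> Dict[str, Any]:
    """Build a request body with explicit context and generation budgets.

    The budget arguments are keyword-only and required, so a call site
    cannot fall back to a Modelfile default by omission.
    """
    if len(prompt) > max_context_chars:
        # Refuse oversized context instead of silently paying for it.
        raise ValueError(
            f"prompt is {len(prompt)} chars, budget is {max_context_chars}"
        )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "num_predict": num_predict},
    }
```

Rejecting an oversized prompt is the bluntest possible policy. The gap-targeted extraction described above is the better fix, since it shrinks the context before the check ever runs.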
L2 — Per-Run rounds × tokens
The Wiggum loop is capped at a maximum round count, and—more importantly—exits early when spending more would buy nothing. Cycling detection compares consecutive rounds’ scores and dimensional breakdowns; a producer repeating itself is cut off, saving roughly 1,300 s of inference per stuck run. Best-round restoration ensures that when the loop exhausts its budget without passing, the highest-scoring round is what ships—so the spend buys the best available output rather than the last one.
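The three controls in that paragraph fit in one small loop. The sketch below is not the code in wiggum.py: the function name, the callable signatures, and the default round cap are assumptions, and cycling is detected here by exact equality of consecutive scores, which is the simplest possible comparison. The 8.0 pass bar is the one reported later in this post.

```python
from typing import Callable, Optional, Tuple

Evaluation = Tuple[float, Tuple[float, ...]]  # (composite, per-dimension scores)


def run_rounds(
    produce: Callable[[Optional[str]], str],
    evaluate: Callable[[str], Evaluation],
    max_rounds: int = 3,
    pass_bar: float = 8.0,
) -> Tuple[Optional[str], float, int]:
    """Run a capped revise loop and return (best_output, best_score, rounds_used)."""
    best_output: Optional[str] = None
    best_score = float("-inf")
    previous: Optional[Evaluation] = None
    feedback: Optional[str] = None
    rounds_used = 0

    for round_no in range(1, max_rounds + 1):
        output = produce(feedback)
        composite, dims = evaluate(output)
        rounds_used = round_no

        # Best-round restoration: remember the highest-scoring round so far.
        if composite > best_score:
            best_output, best_score = output, composite

        if composite >= pass_bar:
            break  # passed the quality gate, stop spending
        if previous == (composite, dims):
            break  # cycling: same scores as last round, more rounds buy nothing

        previous = (composite, dims)
        feedback = f"composite {composite:.2f}, dimensions {dims}"

    return best_output, best_score, rounds_used
```

Note that the loop returns the best round even when it exits by exhausting the cap, so a late regression costs tokens but never ships.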
Enforced by: round caps, cycling detection, best-round restoration in wiggum.py
L3 — Per-Model VRAM × seconds
The Keep-Alive Budget treats VRAM residency as a spend decision. _estimate_keep_alive() sets each model’s keep_alive adaptively from how soon that role (producer, evaluator, planner) will be called again, instead of pinning everything resident with -1. The failure mode it prevents is expensive precisely because it is quiet: an over-committed GPU does not crash, it generates at a tenth of normal speed, and every token in the pipeline silently costs ten times as much wall-clock.
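A residency rule of this kind can be very small. The function below is a hypothetical stand-in, not the harness's _estimate_keep_alive(): its name, the `max_hold_seconds` ceiling, and the `margin` factor are all assumptions. It also assumes Ollama-style semantics, where keep_alive given as a number is a duration in seconds, 0 unloads the model right after the response, and a negative value keeps it loaded indefinitely.

```python
import math
from typing import Optional


def estimate_keep_alive_sketch(
    seconds_until_next_call: Optional[float],
    max_hold_seconds: float = 300.0,
    margin: float = 1.5,
) -> int:
    """Return a keep_alive value in seconds for one model role.

    Hold the model only if its next call is expected soon; otherwise
    release the VRAM so other roles are not starved.
    """
    if seconds_until_next_call is None:
        return 0  # no known next call: unload immediately
    if seconds_until_next_call > max_hold_seconds:
        return 0  # too long to sit idle in VRAM: cheaper to reload later
    # Pad the expected gap so a slightly late call still finds the model loaded.
    return math.ceil(seconds_until_next_call * margin)
```

The design choice is that the function never returns -1. Pinning becomes something an operator has to do deliberately, not the default a role inherits.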
L4 — Per-Fleet cost avoidance
The cheapest call is the one not made. The search cache fronts DDGS queries with a SQLite TTL cache, so the 239 runs’ 414 web searches were a floor, not the true demand—repeat queries within the TTL window cost nothing. The same logic applies to research contexts and Semantic Scholar lookups. At fleet level, cache hit rate is a budget lever as real as any cap.
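A TTL cache in front of a search function takes a few dozen lines with the standard library. This is a sketch, not the harness's cache: the class name, table schema, whitespace-and-case key normalization, and the injectable `clock` are assumptions, and `fetch` stands in for whatever function actually performs the web search.

```python
import json
import sqlite3
import time
from typing import Any, Callable, List


class SearchCache:
    """SQLite-backed cache that serves repeat queries inside a TTL window."""

    def __init__(
        self,
        path: str = ":memory:",
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "query TEXT PRIMARY KEY, results TEXT NOT NULL, stored_at REAL NOT NULL)"
        )
        self.ttl = ttl_seconds
        self.clock = clock
        self.hits = 0
        self.misses = 0

    def get_or_fetch(self, query: str, fetch: Callable[[str], List[Any]]) -> List[Any]:
        # Normalize case and whitespace so trivially different queries share a key.
        key = " ".join(query.lower().split())
        now = self.clock()
        row = self.db.execute(
            "SELECT results, stored_at FROM cache WHERE query = ?", (key,)
        ).fetchone()
        if row is not None and now - row[1] < self.ttl:
            self.hits += 1
            return json.loads(row[0])

        # Miss or expired entry: pay for the real call and store the result.
        self.misses += 1
        results = fetch(query)
        self.db.execute(
            "INSERT OR REPLACE INTO cache (query, results, stored_at) VALUES (?, ?, ?)",
            (key, json.dumps(results), now),
        )
        self.db.commit()
        return results
```

Tracking `hits` and `misses` on the cache itself is what turns hit rate into a number the fleet-level budget can actually watch.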
Enforced by: SQLite TTL Search Cache, prompt/context reuse
The Anatomy of One Expensive Run
Aggregates hide structure, so here is a single fully-itemized run: a /lit-review
invocation that fetched 26 arXiv candidates, curated them through a five-persona panel,
annotated the 4 survivors, clustered, synthesized, and rendered a survey.
- Total: 1,823 s — 60,940 tokens in / 8,335 tokens out at ~10 tok/s
- Input dominates output 7:1. Curation is the reason: 26 papers × 5 personas, each persona re-reading the paper annotation. The cost driver is not generation, it is repeatedly showing things to models.
- Each annotation’s evaluation round: ~127–132 s at 7 tok/s. Quality gates are a second model reading the first model’s output—verification spend scales with however much you produce.
- The follow-up run was 10× cheaper. Annotating one known paper ID cost 174 s, 4,949 in / 1,162 out—because retrieval and curation were already done. Knowing what to read is most of the budget.
The ratio worth internalizing: in an agentic pipeline with quality gates, input tokens are the budget. Generation is the visible cost; re-reading—by evaluators, panels, and revision rounds—is the dominant one. Context engineering and cost management are the same discipline viewed from different dashboards.
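The ratios quoted above follow directly from the itemized figures. The short calculation below uses only numbers from the two runs described in this section; the dictionary layout and function names are illustrative.

```python
# Figures taken from the two runs itemized above.
full_run = {"seconds": 1823, "tokens_in": 60940, "tokens_out": 8335}
follow_up = {"seconds": 174, "tokens_in": 4949, "tokens_out": 1162}


def input_output_ratio(run: dict) -> float:
    """Input tokens consumed per output token generated."""
    return run["tokens_in"] / run["tokens_out"]


def wall_clock_factor(expensive: dict, cheap: dict) -> float:
    """How many times longer the expensive run took."""
    return expensive["seconds"] / cheap["seconds"]


# Each of the 26 candidates is read once by each of the 5 personas.
curation_reads = 26 * 5

print(f"input:output = {input_output_ratio(full_run):.1f}:1")
print(f"follow-up is {wall_clock_factor(full_run, follow_up):.1f}x faster")
print(f"curation reads = {curation_reads}")
```

The input-to-output ratio comes out at about 7.3:1, the wall-clock factor at about 10.5, and curation alone accounts for 130 separate paper readings before a single annotation is written.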
What 239 Runs Say About Where the Budget Goes
The run log for the cost-management task itself, April 7–27, across five producer models (qwen3.6-35b carried 144 of the runs):
- 239 runs, 414 web searches, average 1.5 Wiggum rounds per run — every round beyond the first is a full re-read and partial re-write of the output.
- Average final composite 7.48 against an 8.0 pass bar — the marginal revision round was usually spent closing a half-point gap.
- Of 86 multi-round runs, 8 regressed — the final score came in below round one. Those rounds were worse than wasted: they paid tokens to damage the output, which is why best-round restoration is a cost control and not just a quality control.
- On enumerated tasks generally, count_check_retry fires in 25% of runs and the retried output averages −0.39 composite — a measured price for satisfying a count constraint. Budgeting means knowing which retries have negative expected value.
This is hierarchical cost attribution in practice, and it requires no
platform—only logging discipline. Every run appends to runs.jsonl with its
model assignments, tool calls, searches, round count, scores, and output size; attribution
rolls up from call → stage → run → task type → fleet with a dozen lines of
pandas. The token-spend dashboard reads the same file. And
the same telemetry drives auto-throttling: cycling detection, count-check
gating, and round caps are throttles triggered by run-level signals, not human review.
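The rollup itself is short. The sketch below uses only the standard library in place of pandas so it runs anywhere, and the field names (`task_type`, `model`, `round_scores`) are assumptions about the log schema, not the actual layout of runs.jsonl. A run counts as regressed when its final score is below its first-round score, matching the definition used in the list above.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable


def roll_up(jsonl_lines: Iterable[str], key: str = "task_type") -> Dict[str, dict]:
    """Aggregate per-run records into per-group round and regression counts."""
    groups = defaultdict(
        lambda: {"runs": 0, "rounds": 0, "multi_round": 0, "regressed": 0}
    )
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in an append-only log
        run = json.loads(line)
        scores = run["round_scores"]  # hypothetical field: one composite per round
        group = groups[run[key]]
        group["runs"] += 1
        group["rounds"] += len(scores)
        if len(scores) > 1:
            group["multi_round"] += 1
            if scores[-1] < scores[0]:
                group["regressed"] += 1  # later rounds damaged the output

    return {
        name: {**group, "avg_rounds": group["rounds"] / group["runs"]}
        for name, group in groups.items()
    }
```

Calling `roll_up(open("runs.jsonl"), key="model")` would attribute the same spend per producer model. Changing `key` is the whole mechanism behind moving between levels of the hierarchy.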
Where the Hosted Tooling Fits
The query traces that motivated this post kept asking for tools by name—AgentOps,
Langfuse, Arize, Prometheus, Grafana—and for “FinOps integration.” The honest
placement: those platforms supply the ledger—per-call traces, token-to-dollar
attribution, dashboards, alerts. They are the hosted equivalent of runs.jsonl plus
the analytics view, and if your agents run against metered APIs, adopt one early; rebuilding
trace plumbing is undifferentiated work. What no ledger supplies is the enforcement:
round caps, cycling detection, residency budgets, and cache policy live inside the agent loop,
at the four levels above. A dashboard that watches an unbounded loop is a bill with better
graphs.
Academic grounding: the research frontier is pushing cost from a constraint into the planner’s objective. CATP-LLM (arXiv:2411.16313) trains LLMs for cost-aware tool planning—choosing tool sequences under explicit cost terms rather than treating spend as someone else’s problem after the plan is drawn. That is the envelope’s logical endpoint: L1–L4 enforce budgets around the model; cost-aware planning puts the budget inside the model’s decision process.
Open Questions
- Round caps are static; the regression data suggests they should not be. Can the supervisor learn a per-task-type stopping rule from the marginal-score-per-round curve—spend a third round on tasks where round 3 historically gains +0.4, never on tasks where it gains nothing?
- The 7:1 input-to-output ratio prices verification at a multiple of generation. At what quality bar does a cheaper evaluator model—accepting some scoring noise—beat a better one, net of the re-rounds the noise causes?
- Cache TTL is a cost-freshness trade made once, globally. Should TTL vary with the query’s volatility—economic data series going stale in days, arXiv metadata in months?
- What is the local-first equivalent of a dollar alert—a watt-hour budget per run? The telemetry has everything needed except the power draw.
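The first question above, a learned stopping rule, can at least be prototyped from data the run log already holds. This is a speculative sketch, not something the harness does: the function names and the `min_gain` threshold are assumptions, and the rule is the simplest one available, a mean of historical round-over-round score changes per round number.

```python
from collections import defaultdict
from typing import Dict, List


def marginal_gain_by_round(histories: List[List[float]]) -> Dict[int, float]:
    """Mean score change contributed by each round number for one task type.

    `histories` holds one list of per-round composite scores per past run.
    """
    gains = defaultdict(list)
    for scores in histories:
        for i in range(1, len(scores)):
            # Round i + 1 is credited with the change it made over round i.
            gains[i + 1].append(scores[i] - scores[i - 1])
    return {round_no: sum(values) / len(values) for round_no, values in gains.items()}


def worth_another_round(
    next_round: int, mean_gain: Dict[int, float], min_gain: float = 0.1
) -> bool:
    """Spend the next round only if it has historically paid for itself."""
    # With no history for this round number, assume it gains nothing.
    return mean_gain.get(next_round, 0.0) >= min_gain
```

A supervisor could consult `worth_another_round` before each revision, keeping the static round cap as a hard ceiling. Whether a plain mean is a good enough estimator with only a few dozen runs per task type is part of the open question.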