May 25, 2026 • 12 min read • Analysis

Leverage: What the Metric Measures, What It Doesn’t, and Why the Replacement Framing Gets the Math Wrong

A corporate narrative is taking root: AI “naturally” displaces lower-value human capital first. The harness has been tracking a quantity called leverage for 1,500+ runs. The formula is more interesting than the narrative, and it points in a different direction.

The Formula

Every completed agentic run in the harness logs a leverage value to runs.jsonl. It is computed in RunTrace.finish() and printed to stdout:

leverage = (tac_s × quality_norm) / max(runtime_s + cost_s, 1.0)
tac_s — time a skilled human would need, in seconds
quality_norm — final WIGGUM score ÷ 10
runtime_s — actual wall-clock time of the run
cost_s — dollar cost converted to equivalent human-labor seconds

In concrete terms: a run with tac=3.0h, quality 8/10, and runtime=714s on local hardware (cost ≈ $0.00) produces:

tac_s       = 3.0 × 3600 = 10,800s
quality_norm = 8.0 / 10.0  = 0.8
runtime_s   = 714s
cost_s      = ($0.00 / $75/hr) × 3600 = 0s

leverage = (10,800 × 0.8) / (714 + 0) = 8,640 / 714 ≈ 12.1×

The harness produced about 2.4 hours of quality-adjusted skilled research (8,640 s) in roughly 12 minutes of machine time. That is what 12.1× means.
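The worked example can be reproduced with a few lines of Python. This is a minimal sketch of the formula as stated above, not the harness's actual RunTrace.finish() code; the function name and argument names are assumptions made for illustration.

```python
def leverage(tac_hours, quality, runtime_s, cost_usd=0.0, hourly_rate=75.0):
    """Quality-adjusted skilled-human seconds per second of machine time plus cost."""
    tac_s = tac_hours * 3600.0                   # skilled-human time, in seconds
    quality_norm = quality / 10.0                # WIGGUM score scaled to 0..1
    cost_s = (cost_usd / hourly_rate) * 3600.0   # dollars converted to human-labor seconds
    # The max(..., 1.0) floor guards against a zero or near-zero denominator.
    return (tac_s * quality_norm) / max(runtime_s + cost_s, 1.0)


example = leverage(tac_hours=3.0, quality=8.0, runtime_s=714.0)
print(f"{example:.1f}x")  # 12.1x
```

The denominator floor only matters for degenerate sub-second runs; for any realistic runtime the ratio is simply value over time spent.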

tac_s comes from two sources. The WIGGUM evaluator estimates it as part of its scoring response, using a calibration scale from 0.25h (trivial lookup) to 24h (novel research requiring primary data collection). A separate _estimate_tac_hours() call in agent.py breaks the estimate into search_h + read_h + write_h components for cross-validation. The two estimates are averaged. The denominator reflects what was actually spent, in machine time plus machine cost. The ratio is leverage.
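The averaging step is simple enough to sketch. The function names below are hypothetical, and the disagreement ratio is an added illustration of what cross-validation could look like, not something the harness is known to compute.

```python
def combined_tac_hours(evaluator_h, search_h, read_h, write_h):
    """Average the evaluator's estimate with the component-based estimate."""
    component_h = search_h + read_h + write_h
    return (evaluator_h + component_h) / 2.0


def tac_disagreement(evaluator_h, search_h, read_h, write_h):
    """Ratio of the larger estimate to the smaller; 1.0 means perfect agreement."""
    component_h = search_h + read_h + write_h
    low, high = sorted((evaluator_h, component_h))
    return high / low if low > 0 else float("inf")


tac_h = combined_tac_hours(evaluator_h=3.0, search_h=1.0, read_h=1.5, write_h=1.0)
print(tac_h, tac_h * 3600.0)  # hours, then the tac_s value fed to the formula
```

A large disagreement ratio would be a reasonable signal that one of the two estimates deserves a second look before it enters the numerator.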

Where Hourly Rate Enters the Formula

The cost_s term converts dollar costs to time-equivalent seconds:

cost_s = ((energy_cost + infer_cost) / hourly_rate) × 3600

hourly_rate defaults to $75/hr and is configurable via HARNESS_HOURLY_RATE. Its role is to put machine cost on the same scale as human time — making the denominator a unified "cost in human-labor-seconds."
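A sketch of how the rate might be read and applied follows. Only the HARNESS_HOURLY_RATE variable name and the $75 default come from the text above; the helper names and the validation are assumptions.

```python
import os

DEFAULT_HOURLY_RATE = 75.0  # dollars per hour, the documented default


def hourly_rate_from_env(environ=os.environ):
    """Read HARNESS_HOURLY_RATE, falling back to the default when unset or empty."""
    raw = environ.get("HARNESS_HOURLY_RATE")
    rate = float(raw) if raw else DEFAULT_HOURLY_RATE
    if rate <= 0:
        raise ValueError("HARNESS_HOURLY_RATE must be positive")
    return rate


def cost_seconds(energy_cost, infer_cost, hourly_rate):
    """Convert dollar costs into equivalent human-labor seconds."""
    return ((energy_cost + infer_cost) / hourly_rate) * 3600.0


rate = hourly_rate_from_env()
print(cost_seconds(energy_cost=0.01, infer_cost=0.0, hourly_rate=rate))
```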

For a run with $0.50 in total machine cost, the cost penalty at three wage points:

| Hourly rate | cost_s | cost_s as % of 714s runtime | Leverage (3h TAC, quality 8/10) |
| --- | --- | --- | --- |
| $15/hr (min wage) | 120s | 16.8% | 10.4× |
| $75/hr (default) | 24s | 3.4% | 11.7× |
| $200/hr (senior engineer) | 9s | 1.3% | 12.0× |

The table shows that for a $0.50 run (already an expensive case, since local inference on the harness costs roughly $0.01 per run) leverage in the low-wage scenario is about 13% lower than in the high-wage scenario. Not catastrophic. And for the actual harness running locally with near-zero cost, cost_s ≈ 0 at every wage level, and the table entries converge to the same number: the 12.1× of the worked example.
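The three rows can be recomputed directly from the formula. The sketch below assumes the same run as the worked example (3h TAC, quality 8/10, 714s runtime) with $0.50 of total machine cost; the function name is illustrative.

```python
def leverage_at_rate(hourly_rate, cost_usd=0.50, tac_hours=3.0,
                     quality=8.0, runtime_s=714.0):
    """Return (cost_s, leverage) for one wage assumption."""
    cost_s = cost_usd / hourly_rate * 3600.0
    value_s = tac_hours * 3600.0 * (quality / 10.0)
    return cost_s, value_s / max(runtime_s + cost_s, 1.0)


for rate in (15.0, 75.0, 200.0):
    cost_s, lev = leverage_at_rate(rate)
    share = cost_s / 714.0  # cost penalty relative to the wall-clock runtime
    print(f"${rate:>5.0f}/hr  cost_s={cost_s:6.1f}s  share={share:6.1%}  leverage={lev:5.2f}x")
```

Setting cost_usd to 0.0 makes every row identical, which is the convergence the paragraph above describes.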

On 145 empirical runs, recalculating leverage at $15, $75, and $200/hr produces statistically indistinguishable distributions. The box plots are nearly identical. Local inference collapses the formula to:

leverage ≈ (tac_hours × quality_norm × 3600) / runtime_s

When cost_s ≈ 0 (local inference), hourly rate vanishes from the computation entirely.
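The same recalculation can be sketched for a list of logged runs. The record field names here (tac_s, quality_norm, runtime_s, energy_cost, infer_cost) are assumptions about what a runs.jsonl entry might contain, and the two records are illustrative values, not real log data.

```python
from statistics import median


def recompute_leverage(run, hourly_rate):
    """Recompute leverage for one run record under a given wage assumption."""
    cost_usd = run["energy_cost"] + run["infer_cost"]
    cost_s = cost_usd / hourly_rate * 3600.0
    return run["tac_s"] * run["quality_norm"] / max(run["runtime_s"] + cost_s, 1.0)


def median_leverage_by_rate(runs, rates=(15.0, 75.0, 200.0)):
    """Median leverage across runs for each hourly rate."""
    return {rate: median(recompute_leverage(r, rate) for r in runs) for rate in rates}


# Illustrative zero-cost (local inference) records.
sample_runs = [
    {"tac_s": 10800, "quality_norm": 0.8, "runtime_s": 714, "energy_cost": 0.0, "infer_cost": 0.0},
    {"tac_s": 7200, "quality_norm": 0.9, "runtime_s": 500, "energy_cost": 0.0, "infer_cost": 0.0},
]
print(median_leverage_by_rate(sample_runs))
```

With zero machine cost the three medians are identical by construction, because the hourly rate only ever multiplies a zero.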

The leverage metric, as actually computed across 1,500+ logged runs on local hardware, is effectively independent of the operator's wage. It is a pure time multiplier.

The Asymmetry That Gets Weaponized

There is a real asymmetry in the formula, and it is worth being precise about what it is.

tac_s is calibrated to the time a skilled researcher or engineer would need. The WIGGUM prompt specifies this explicitly: a 3h TAC estimate means "a skilled practitioner, familiar with the domain, working efficiently, would take 3 hours." It does not mean "an average worker" or "anyone asked to do this task."

This creates a numerator asymmetry, distinct from the cost_s term analyzed above. If the person whose work is being replaced is not a skilled researcher — if the real-world task would take them 8 hours instead of 3 — the TAC estimate understates the value of the displacement. The formula gives conservative leverage numbers when used to evaluate the displacement of less-experienced workers, not inflated ones.
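The size of that understatement is easy to quantify. The sketch below assumes a zero-cost local run and the same output quality in both cases; the function name is illustrative.

```python
def leverage_vs_worker(worker_hours, quality, runtime_s):
    """Leverage relative to a specific human baseline, assuming cost_s = 0."""
    return worker_hours * 3600.0 * (quality / 10.0) / max(runtime_s, 1.0)


skilled = leverage_vs_worker(3.0, 8.0, 714.0)  # the TAC-calibrated figure
slower = leverage_vs_worker(8.0, 8.0, 714.0)   # same run, 8-hour human baseline

print(f"skilled baseline: {skilled:.1f}x, slower baseline: {slower:.1f}x")
print(f"reported leverage understates the multiple by a factor of {slower / skilled:.2f}")
```

The understatement factor is just the ratio of the two human times, 8/3 here, because everything else in the formula is unchanged.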

This is the asymmetry that corporate framings reach for when they talk about AI "naturally" targeting lower-value labor. The argument runs: automation makes economic sense first where the time multiplier over the displaced worker is largest, and a less-experienced worker who needs 8 hours for a 3-hour task is exactly that case, therefore the displacement pressure falls first on cheaper workers. The formula can be read to support this: the same run that yields 12× against a skilled practitioner yields a larger multiple against anyone slower.

The argument is coherent and partially true. It is also incomplete in ways that matter.

What the Formula Doesn’t Capture

The leverage metric answers a specific question: given that the harness runs a task, how much time does it save relative to a skilled human doing the same task? It does not answer:

The formula's most important implicit assumption: tac_s is calibrated to skilled labor, which means the numerator always represents the value of skilled work — regardless of who is operating the system. The formula does not ask for the operator's credentials before computing leverage.

Token Amortization: What the Formula Can’t See

The leverage formula sees runtime_s — total wall-clock time. It does not see how that time was spent, or which tokens were productive. This matters because the token budget inside a typical run is not evenly productive. From trace analysis across the logged run history, wall time breaks down approximately as follows:

| Stage | % of wall time | Nature |
| --- | --- | --- |
| synth + synth_count | ~53% | Productive: directly creates the output |
| wiggum_revise | ~30% | Quality improvement: rewrites to raise score |
| gather_research / compress_knowledge | ~15% | Overhead: preprocessing, not producing |
| wiggum_eval | ~2% | Overhead: scoring only, no output produced |

Roughly half the tokens are infrastructure. A concrete example: in a typical run, the tool loop consumed 5,773 tokens (45% of the total) as overhead for search, retrieval, and compression, while synthesis consumed 5,581 tokens (44%) and directly produced the output. The leverage formula counts the wall time of all of this equally.

This creates a subtle but important accounting issue: the formula measures leverage over the full pipeline, not over the productive portion. A run that spends 80% of its tokens on overhead gets the same leverage as a run that spends 80% on productive synthesis, provided both finish in the same wall time with the same final quality score. The lever the formula actually measures is the end-to-end pipeline efficiency, not the efficiency of any individual stage.

The revision tokens are the most interesting case. When a revision round raises the WIGGUM score from 6.4 to 8.3, those tokens produce measurable value: the quality_norm term improves, and leverage rises as long as the gain outweighs the added runtime. When a revision round produces a regression (which happens in about 10% of multi-round runs, per the run logs), those tokens actively destroy value. The formula captures this via the final quality_norm, but it cannot distinguish "revision helped" from "revision hurt" during the run. That distinction is visible only after the fact, in the score.
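If per-round scores were logged, telling the two cases apart after the fact would be straightforward. The sketch below assumes a hypothetical list of WIGGUM scores, one per draft, in the order they were produced; nothing here is taken from the harness's code.

```python
def revision_deltas(scores):
    """Score change contributed by each revision round."""
    return [round(after - before, 2) for before, after in zip(scores, scores[1:])]


def has_regression(scores):
    """True if any revision round lowered the score."""
    return any(delta < 0 for delta in revision_deltas(scores))


helped = [6.4, 8.3]        # the improving case described above
hurt = [7.8, 7.1]          # an illustrative regression, not real log data

print(revision_deltas(helped), has_regression(helped))
print(revision_deltas(hurt), has_regression(hurt))
```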

What this means for the labor economics discussion: a token-level analysis of leverage would show higher effective multipliers than the formula currently computes, because the overhead tokens are charged against the output quality of the productive tokens. This biases the leverage numbers downward. Every estimate of "12× leverage" is a conservative one: the true productivity of the synthesis stage, isolated, is higher.
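One crude way to put a number on this is to divide the pipeline leverage by the share of wall time spent in the stages counted as productive. This is a sketch under strong assumptions: cost_s ≈ 0, the approximate stage shares from the table above, and full credit for the output assigned to the chosen stages, which overstates their contribution since synthesis depends on the gathered material. The dictionary keys are shortened labels, not harness identifiers.

```python
# Approximate wall-time shares from the trace analysis above.
STAGE_SHARE = {
    "synth": 0.53,
    "wiggum_revise": 0.30,
    "gather_compress": 0.15,
    "wiggum_eval": 0.02,
}


def stage_isolated_leverage(pipeline_leverage, productive_stages=("synth",)):
    """Leverage if only the productive stages' wall time sat in the denominator."""
    share = sum(STAGE_SHARE[stage] for stage in productive_stages)
    return pipeline_leverage / share


print(f"{stage_isolated_leverage(12.1):.1f}x")                            # synthesis only
print(f"{stage_isolated_leverage(12.1, ('synth', 'wiggum_revise')):.1f}x")  # plus revision
```

Which stages count as productive is a judgment call; including revision alongside synthesis gives a smaller correction than isolating synthesis alone.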

The CapEx Gap: When the Platform Is Free but the Hardware Isn’t

The harness is open-source. Ollama is open-source. The models are open-weights. The total software cost of the stack is zero. But running it at the quality level that produces 12× leverage requires hardware that is not free.

The harness runs on a $4,000 consumer workstation. The dashboard computes a running breakeven comparison between local inference and cloud alternatives. Using the cloud pricing tiers the harness tracks, on a typical 12,815-token run:

| Cloud tier | Approx. cost/run | Local electricity/run | Savings/run (local vs. cloud) | Breakeven runs (on $4K hardware) |
| --- | --- | --- | --- | --- |
| Claude Sonnet (~$3/$15 per 1M tok) | ~$0.07 | ~$0.01 | $0.06 | ~67,000 |
| GPT-4o (~$2.50/$10 per 1M tok) | ~$0.06 | ~$0.01 | $0.05 | ~80,000 |
| Gemini Flash (~$0.075/$0.30 per 1M tok) | ~$0.002 | ~$0.01 | −$0.008 (local is more expensive) | Never |

The breakeven calculation depends entirely on which cloud tier you are comparing against. For premium API calls (Sonnet, GPT-4o), the local hardware pays for itself after 67,000–80,000 runs. For the cheapest commodity cloud inference (Gemini Flash, Haiku), local hardware never reaches breakeven — the electricity alone exceeds the API cost per run.
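The breakeven column is a single division with one guard for the case where local is not cheaper. The sketch below is illustrative rather than the dashboard's actual code; it uses exact fractions so that cent-level inputs do not pick up floating-point rounding error.

```python
import math
from fractions import Fraction


def breakeven_runs(hardware_usd, cloud_cost_per_run, local_cost_per_run):
    """Runs needed for per-run savings to repay the hardware, or None if never."""
    # Converting via str() keeps decimal inputs such as 0.07 exact.
    savings = Fraction(str(cloud_cost_per_run)) - Fraction(str(local_cost_per_run))
    if savings <= 0:
        return None  # local costs at least as much per run, so it never pays back
    return math.ceil(Fraction(str(hardware_usd)) / savings)


for tier, cloud_cost in (("Claude Sonnet", 0.07), ("GPT-4o", 0.06), ("Gemini Flash", 0.002)):
    print(tier, breakeven_runs(4000, cloud_cost, 0.01))
```

Returning None rather than infinity keeps the "never" case explicit for whatever consumes the result.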

The current formula accounts for energy_cost and infer_cost as per-run OpEx items. It does not account for hardware depreciation. If you add a depreciation term, a $4,000 machine over a 3-year useful life contributes about $0.03 per 700-second run at round-the-clock utilization (and about $0.09 at 8 hours a day), so the effective per-run cost rises from $0.01 to $0.04 or more. The corrected leverage numbers are still compelling for Sonnet/GPT-4o displacement, but the formula is currently giving itself credit it hasn't fully earned.
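A straight-line depreciation term could be sketched as follows. The function and its parameters are assumptions, not part of the current formula; hours of use per day is left as a parameter because the per-run figure scales inversely with utilization.

```python
def depreciation_per_run(hardware_usd, life_years, hours_per_day, run_seconds):
    """Straight-line hardware cost attributed to one run."""
    active_seconds = life_years * 365 * hours_per_day * 3600
    return hardware_usd * run_seconds / active_seconds


def cost_seconds_with_capex(opex_usd, capex_usd, hourly_rate=75.0):
    """cost_s with the depreciation share added to the per-run operating cost."""
    return (opex_usd + capex_usd) / hourly_rate * 3600.0


for hours in (24, 8):
    capex = depreciation_per_run(4000, life_years=3, hours_per_day=hours, run_seconds=700)
    cost_s = cost_seconds_with_capex(opex_usd=0.01, capex_usd=capex)
    print(f"{hours:>2}h/day: depreciation=${capex:.3f}/run, cost_s={cost_s:.1f}s")
```

Even with the depreciation share included, cost_s stays at a few seconds against a runtime of several hundred, so the leverage ratio itself moves very little; the larger effect is on the breakeven comparison with cloud pricing.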

This matters for the access discussion in a specific way. The hardware barrier is real but it is also a one-time investment, not a recurring cost. The leverage it produces is cumulative. After 67,000 runs against a Sonnet-tier API, the hardware has paid for itself in saved inference costs alone — without counting the value of the work it produced. And the $4,000 entry point is declining: comparable hardware cost roughly twice as much two years ago.

The "open-source but requires expensive computers" framing is accurate but incomplete. The full framing is: open-source, requires a one-time hardware investment, produces leverage that compounds across every subsequent run, at a cost that is declining over time, on hardware that has dual-use value outside the harness (training, development, other workloads). The CapEx barrier is real. It is also not permanent, and it is not a structural argument for who the tool is designed to benefit.

The Replacement Narrative’s Actual Claim

To be fair to the corporate framing: it is not purely wrong. Automation does historically eliminate job categories more readily than it creates them in the short term. The clerical job losses from spreadsheet software in the 1980s were real. Call center job losses from IVR are real. The argument that current AI follows the same pattern is at least defensible.

What is specifically wrong — and worth objecting to — is the phrase lower-value human capital. The phrase conflates two things that the formula treats as separate:

  1. The current market price of a person's labor — what they are currently paid, for the tasks they currently do, in a market that already exists.
  2. The person's potential leverage-adjusted output — what they could produce with access to tools that multiply their effective throughput.

"Lower-value human capital" collapses these into a single fixed quantity attached to the person. It treats the current wage as a property of the worker, not as the output of a specific task assignment in a specific market at a specific moment. It then uses this fixed quantity to predict automation priority — and from there, to predict that some people are natural candidates for displacement rather than augmentation.

This framing is not just uncharitable. It is empirically wrong about how leverage works. The leverage formula has a HARNESS_HOURLY_RATE parameter that you can set to any value. There is no version of the harness that checks whether the operator is worth amplifying.

A Concrete Counter-Example

The harness underlying this blog series was built by one person, running on local hardware, without a research team or institutional resources. Over the course of its development, it has logged over 1,500 production runs, run 90+ controlled autoresearch experiments, maintained a structured knowledge graph of retrieved literature, generated a full 12-post technical series, and produced a pattern catalog of 27 named engineering designs.

This is not a triumph of the expensive-labor category. It is a demonstration that the leverage the formula describes is available to whoever chooses to operate the system. The TAC estimates in the run logs represent thousands of hours of skilled research equivalent. The quality scores average above 8/10. The leverage values across 1,500 runs are consistently in the range of 10–20× on cognitively demanding tasks.

None of that leverage required institutional backing. It required a $4,000 workstation, open-source software, and the methodological discipline to run controlled experiments and measure what changed. The formula gives the same numbers whether the person running it has a PhD or doesn't, whether they're employed by a corporation or operating independently, whether their current market rate is $15/hr or $200/hr.

The Access Problem Is Real

None of the above means access barriers don’t exist. They do, and they matter.

The hardware cost is real: $4,000 is not trivial. Cloud alternatives bring the upfront cost to zero but introduce per-run costs that change the economics at scale — the breakeven analysis depends heavily on which cloud tier and how many runs you're doing. The barrier to entry is lower than it was in 2023 and lower still than it was in 2020, but it is not zero.

The methodological investment is real. The harness is not a button you press to generate leverage. It is a system with a task registry, criterion functions, a multi-stage evaluation pipeline, and a self-improvement loop. Running it well requires understanding what the composite score measures and what it doesn't — exactly the kind of understanding that takes time and effort to develop.

The TAC calibration itself is biased toward skilled researcher output. The system is designed to replicate what a skilled researcher would produce. If the user wants to produce something categorically different — creative work, physical-world work, relationship-dependent work — the leverage metric either doesn't apply or requires recalibration.

These are genuine constraints. What they add up to is an access problem, not a structural argument that some people are better candidates for leverage than others. An access problem can be addressed: through open-source tooling (the harness is public), through declining hardware costs, through documentation that reduces the methodological learning curve.

A structural argument that some people have lower human capital and therefore face natural displacement — that argument cannot be addressed by changing anything about the tools, because it has located the problem in the person rather than in the task allocation or the access conditions.

What the Math Actually Shows

The leverage formula, as implemented and measured, shows the following:

The formula does not have a field for the operator’s worth. It has a field for the task’s complexity, the output’s quality, the runtime’s duration, and the machine’s cost. Leverage is a property of the task and the tool, not a property of the person running either.

The corporate framing that AI "naturally" displaces lower-value workers first is a claim about market dynamics and corporate decision-making — who companies choose to give access to, and which job categories they choose to automate. That is a policy and power question, not a mathematical one. The mathematics of leverage, as actually computed across 1,500 runs of a production agentic harness, points in the opposite direction: the tool amplifies whoever uses it, calibrated to skilled output, at a leverage ratio that is largely independent of the operator’s current wage.

The interesting question is not who gets displaced. It is who gets access to the tool that makes the displacement question irrelevant.

What the Literature Leaves Open

“Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering” (2604.08224, April 2026) provides direct academic framing for the leverage argument. The paper’s central claim: “practical agent progress depends on better external cognitive infrastructure rather than just stronger models.” It categorizes agent cognitive artifacts into memory stores, reusable skills, interaction protocols, and harness engineering — the exact taxonomy this harness implements — and identifies self-evolving harnesses as the frontier direction.

The paper’s framing of external infrastructure as the locus of capability gains is the academic version of the leverage argument: the multiplier is in the scaffold, not in the model weights. This matters for the labor economics discussion because it locates capability in the tooling rather than in innate properties of the operator — which is precisely the point the leverage formula makes when HARNESS_HOURLY_RATE drops out of the computation at near-zero inference cost.

Open question the paper raises: How do models and external infrastructure co-evolve over time? As models improve, do they reduce the leverage provided by harness engineering (because the model needs less scaffolding), or increase it (because better models extract more value from the same scaffold)? The harness’s 1,500-run history, spanning multiple model upgrades (7B → 32B producer), is a small empirical dataset on exactly this question.