May 27, 2026 • 5 min read • Agentic Harness Engineering

Synthetic Eval Task Generation with TinyTroupe Personas

Eight practitioner archetypes — each with a defined role and perspective — generate diverse research task requests. TinyTroupe runs the simulation when available; raw Ollama handles the fallback. Evaluation criteria are auto-derived from the generated task text, not hand-authored.

The five fixed eval tasks in eval_suite.py (T_A through T_E) measure consistent dimensions of harness performance, but they're frozen. The autoresearch loop can overfit to them: an instruction change that improves all five may still fail on tasks with different structural requirements or audience expectations. tinytroupe_tasks.py extends the eval surface by generating tasks from role-grounded personas, covering needs the fixed suite wasn't designed to test.

Eight personas

ID	Name	Role	Focus
P_devops	Jordan	Senior DevOps Engineer	CI/CD, Kubernetes, reliability, cost control
P_datasci	Priya	Staff Data Scientist	ML pipelines, reproducibility, explainability
P_backend	Marcus	Backend Engineer	high-throughput APIs, latency, fault tolerance
P_pm	Sofia	Product Manager	AI product roadmap, trade-offs, risk management
P_security	Alex	Security Engineer	threat modelling, attack surface, compliance
P_mleng	Tanaka	ML Infra Engineer	serving, batching, quantization, GPU utilization
P_founder	Elena	Technical Founder	cost efficiency, time-to-market, differentiation
P_techlead	Kwame	Tech Lead	architecture decisions, maintainability, team velocity

Each persona definition includes a name, a role title, and a context paragraph that establishes the professional perspective. The prompt asks: "What would you most want an AI research agent to investigate and write up for you right now?" — eliciting tasks grounded in real practitioner concerns rather than abstract capability tests.
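As a rough sketch of how a persona definition might feed the prompt: the `name`, `role`, and `context` keys follow the fields described above, and the question text is the one quoted. The `id` key, the `build_prompt` helper, the shortened context string, and the wording around the question are assumptions for illustration, not the script's actual code.

```python
# Hypothetical persona record; the real context is a full paragraph.
PERSONA = {
    "id": "P_devops",
    "name": "Jordan",
    "role": "Senior DevOps Engineer",
    "context": "CI/CD, Kubernetes, reliability, cost control",
}

QUESTION = (
    "What would you most want an AI research agent to investigate "
    "and write up for you right now?"
)


def build_prompt(persona: dict) -> str:
    """Combine the persona's role and context with the elicitation question."""
    return (
        f"You are {persona['name']}, a {persona['role']}. "
        f"Your focus: {persona['context']}.\n"
        f"A colleague asks: '{QUESTION}'"
    )


print(build_prompt(PERSONA))
```

Keeping the question fixed and varying only the persona fields means differences between generated tasks come from the role, not from prompt drift.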

Two generation backends

The script tries TinyTroupe first and falls back to raw Ollama on import failure or runtime error. Both receive the same prompt; the difference is that TinyTroupe runs the response through a full persona simulation step before returning.

# TinyTroupe path (if installed)
from tinytroupe.agent import TinyPerson
person = TinyPerson(persona["name"])
person.define("role", persona["role"])
person.define("personality_traits", [persona["context"]])
person.listen_and_act("A colleague asks: 'What would you most want...'")
actions = person.pop_actions_and_get_contents_for("TALK")

# Ollama fallback (always available)
import ollama
response = ollama.chat(model=MODEL, messages=[{"role": "user", "content": prompt}],
                       options={"temperature": 0.8})
text = response["message"]["content"]  # the generated task text

TinyTroupe is not on PyPI — it must be installed from GitHub (pip install git+https://github.com/microsoft/TinyTroupe.git@main). The script checks for availability at import time and prints which backend it's using at startup. The Ollama path with temperature 0.8 produces similar diversity to TinyTroupe for this task, so the fallback is a practical substitute for most use cases.
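The try-first, fall-back-on-failure behaviour can be sketched with two interchangeable callables. This is a minimal illustration under assumptions: `generate_with_fallback`, `broken_backend`, and `stub_backend` are hypothetical names, and the stubs stand in for the real TinyTroupe and Ollama calls so the example runs without either installed.

```python
from typing import Callable, Tuple


def generate_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (backend_name, text), preferring the primary backend.

    Catching Exception covers both a missing package (ImportError raised
    by a lazy import) and any error raised while the backend is running.
    """
    try:
        return "primary", primary(prompt)
    except Exception as exc:
        print(f"primary backend failed ({exc!r}); using fallback")
        return "fallback", fallback(prompt)


def broken_backend(prompt: str) -> str:
    # Stand-in for the TinyTroupe path when the package is not installed.
    raise ImportError("No module named 'tinytroupe'")


def stub_backend(prompt: str) -> str:
    # Stand-in for the Ollama path; a real version would call ollama.chat.
    return f"TASK: {prompt}"


print(generate_with_fallback("top 5 strategies", broken_backend, stub_backend))
```

Returning the backend name alongside the text makes it easy to record which path produced each task.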

Criteria auto-derivation

The fixed eval tasks have hand-authored criteria. Generated tasks don't — so build_criteria() derives them from the task text at generation time. The derivation uses two signals:

Count detection — if the task asks for "top 5", "best 3", or "N most", _extract_count() extracts N via regex. A count in the 2–7 range triggers exact_sections(N) — the output must have exactly that many non-structural H2 sections. Tasks without a count get min_sections(3) instead.
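A minimal sketch of the count rule, assuming a regex of roughly this shape. The pattern, the `extract_count` and `section_spec` names, and the choice to treat out-of-range counts like missing ones are illustrative assumptions; the script's own `_extract_count()` may differ.

```python
import re
from typing import Optional

# Assumed pattern: "top 5", "best 3", or "3 most"; the real regex may differ.
COUNT_RE = re.compile(r"\b(?:top|best)\s+(\d+)\b|\b(\d+)\s+most\b", re.IGNORECASE)


def extract_count(task: str) -> Optional[int]:
    """Return the number from the first count phrase, or None if absent."""
    match = COUNT_RE.search(task)
    if match is None:
        return None
    # Exactly one of the two groups is set, depending on which form matched.
    return int(match.group(1) or match.group(2))


def section_spec(task: str) -> dict:
    """Pick exact_sections(N) for counts in 2..7, else min_sections(3)."""
    count = extract_count(task)
    if count is not None and 2 <= count <= 7:
        return {"type": "exact_sections", "n": count}
    # Assumption: counts outside 2..7 fall back to the no-count rule.
    return {"type": "min_sections", "n": 3}


print(section_spec("Search for the top 5 Kubernetes cost optimization strategies"))
```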

Structural and quality criteria are applied regardless:

Criterion	What it checks
`min_bytes(800)` or `min_bytes(600)`	Minimum output length in UTF-8 bytes; lower threshold for counted-section tasks
`min_lines(15)` or `min_lines(10)`	Minimum line count; lower threshold for counted-section tasks
`no_placeholders()`	Rejects outputs containing "TODO", "[placeholder]", "add example here", etc.
`has_impl_notes()`	Requires at least one of: code block, "implementation note:", "example:" — signals concrete content
`no_file_path_refs()`	Rejects outputs that echo the save path back — "saved to ~/Desktop/..." is a producer artifact, not content
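The criteria in the table can be pictured as small factories that each return a predicate over the output text. The sketch below is an approximation under assumptions: the marker lists, the "saved to" regex, and the use of `splitlines()` for line counting are guesses at reasonable behaviour, not the checks in eval_suite.py.

```python
import re
from typing import Callable

Criterion = Callable[[str], bool]
FENCE = "`" * 3  # code-fence marker, built this way to keep the example tidy


def min_bytes(n: int) -> Criterion:
    return lambda text: len(text.encode("utf-8")) >= n


def min_lines(n: int) -> Criterion:
    return lambda text: len(text.splitlines()) >= n


def no_placeholders() -> Criterion:
    phrases = ("[placeholder]", "add example here")  # assumed, not exhaustive

    def check(text: str) -> bool:
        lowered = text.lower()
        # "TODO" is matched case-sensitively to avoid hits inside other words.
        return "TODO" not in text and not any(p in lowered for p in phrases)

    return check


def has_impl_notes() -> Criterion:
    signals = (FENCE, "implementation note:", "example:")
    return lambda text: any(s in text.lower() for s in signals)


def no_file_path_refs() -> Criterion:
    # Assumed heuristic: "saved to" followed by a home-relative or absolute path.
    pattern = re.compile(r"saved to\s+[~/]", re.IGNORECASE)
    return lambda text: pattern.search(text) is None


sample = "## Strategy\n" + "Example: run the job nightly.\n" * 30
checks = [min_bytes(600), min_lines(10), no_placeholders(),
          has_impl_notes(), no_file_path_refs()]
print(all(check(sample) for check in checks))
```

Because every criterion has the same `str -> bool` shape, a task passes when `all(...)` over its criteria is true, and a failing criterion can be reported by name.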

Output format

Generated tasks are saved as JSON in eval_suite.py's SUITE format — each task dict has the same keys as the hand-authored tasks, plus a criteria_specs field (serializable criterion specs) and a raw_response field for debugging. The eval_suite.py loader converts criteria_specs back to callable criterion functions at run time.

{
  "id": "P_devops_1",
  "desc": "generated: Senior DevOps Engineer",
  "task": "Search for the top 5 Kubernetes cost optimization strategies and save to .../eval-k8s-cost.md",
  "output": "~/Desktop/harness-engineering/eval-k8s-cost.md",
  "criteria_specs": [
    {"type": "exact_sections", "n": 5},
    {"type": "min_bytes", "n": 600},
    {"type": "min_lines", "n": 10},
    {"type": "no_placeholders"},
    {"type": "has_impl_notes"},
    {"type": "no_file_path_refs"}
  ],
  "persona": "Jordan",
  "raw_response": "TASK: Search for the top 5 ..."
}
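One way the loader might turn `criteria_specs` back into callables is a registry keyed by the spec's `type` field. This is a sketch, not the eval_suite.py implementation: `FACTORIES` and `specs_to_functions` are hypothetical names, only two simplified factories are registered, and raising on unknown types is an assumed design choice.

```python
from typing import Callable, Dict, List

Criterion = Callable[[str], bool]

# Registry of factories keyed by spec type. Both entries are simplified
# stand-ins for illustration only.
FACTORIES: Dict[str, Callable[..., Criterion]] = {
    "min_bytes": lambda n: (lambda text: len(text.encode("utf-8")) >= n),
    "min_lines": lambda n: (lambda text: len(text.splitlines()) >= n),
}


def specs_to_functions(specs: List[dict]) -> List[Criterion]:
    """Turn serializable specs into callables; fail loudly on unknown types."""
    functions = []
    for spec in specs:
        # Every key except "type" is passed to the factory as an argument.
        kwargs = {k: v for k, v in spec.items() if k != "type"}
        try:
            factory = FACTORIES[spec["type"]]
        except KeyError:
            raise ValueError(f"unknown criterion type: {spec['type']!r}") from None
        functions.append(factory(**kwargs))
    return functions


criteria = specs_to_functions([
    {"type": "min_bytes", "n": 600},
    {"type": "min_lines", "n": 10},
])
print([check("too short") for check in criteria])
```

Failing on an unrecognized type, rather than skipping it, keeps a generated task from silently passing with fewer checks than its spec lists.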

Usage

# Generate one task per persona (default), save to generated_tasks.json
python tinytroupe_tasks.py

# Two tasks per persona
python tinytroupe_tasks.py --count 2

# Custom output path
python tinytroupe_tasks.py --out eval_generated.json

# Preview without saving
python tinytroupe_tasks.py --dry-run
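For orientation, the three flags above could be declared with `argparse` roughly as follows. The flag names and the defaults (one task per persona, `generated_tasks.json`) come from the usage lines; the help strings and the `build_parser` name are assumptions, and the real script may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Generate persona-driven eval tasks")
    parser.add_argument("--count", type=int, default=1, help="tasks per persona")
    parser.add_argument("--out", default="generated_tasks.json", help="output JSON path")
    parser.add_argument("--dry-run", action="store_true", help="preview without saving")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--count", "2", "--dry-run"])
print(args.count, args.out, args.dry_run)
```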

The criteria functions in tinytroupe_tasks.py are local copies of the factories in eval_suite.py, kept separate to avoid a circular import. If you add a new criterion type to eval_suite.py, add it here as well, and add it to criteria_to_functions(), so that the generated-task runner recognizes it.
