Synthetic Eval Task Generation with TinyTroupe Personas
Eight practitioner archetypes — each with a defined role and perspective — generate diverse research task requests. TinyTroupe runs the simulation when available; raw Ollama handles the fallback. Evaluation criteria are auto-derived from the generated task text, not hand-authored.
The five fixed eval tasks in eval_suite.py (T_A through T_E) measure consistent dimensions of harness performance, but they're frozen. The autoresearch loop can overfit to them: an instruction change that improves all five may still fail on tasks with different structural requirements or audience expectations. tinytroupe_tasks.py extends the eval surface by generating tasks from role-grounded personas, covering needs the fixed suite wasn't designed to test.
Eight personas
Each persona definition includes a name, a role title, and a context paragraph that establishes the professional perspective. The prompt asks: "What would you most want an AI research agent to investigate and write up for you right now?" — eliciting tasks grounded in real practitioner concerns rather than abstract capability tests.
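As a rough illustration of the structure described above, a persona entry and the prompt built from it might look like the sketch below. The keys name, role, and context follow the description; the context text and the build_prompt helper are hypothetical and not taken from tinytroupe_tasks.py.

```python
# Hypothetical persona entry: the keys mirror the fields described above,
# and the context paragraph is invented for illustration.
PERSONA = {
    "name": "Jordan",
    "role": "Senior DevOps Engineer",
    "context": (
        "Runs Kubernetes clusters for a mid-size SaaS company and is "
        "under pressure to reduce cloud spend."
    ),
}

# The elicitation question quoted in the text.
QUESTION = (
    "What would you most want an AI research agent to investigate "
    "and write up for you right now?"
)


def build_prompt(persona: dict) -> str:
    """Combine the persona's role and context with the elicitation question."""
    return (
        f"You are {persona['name']}, a {persona['role']}.\n"
        f"{persona['context']}\n\n"
        f"A colleague asks: '{QUESTION}'"
    )


prompt = build_prompt(PERSONA)
```

Keeping the question fixed and varying only the persona fields means that differences between generated tasks come from the role and context rather than from prompt wording.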
Two generation backends
The script tries TinyTroupe first and falls back to raw Ollama on import failure or runtime error. Both receive the same prompt; the difference is that TinyTroupe runs the response through a full persona simulation step before returning.
# TinyTroupe path (if installed)
from tinytroupe.agent import TinyPerson
person = TinyPerson(persona["name"])
person.define("role", persona["role"])
person.define("personality_traits", [persona["context"]])
person.listen_and_act("A colleague asks: 'What would you most want...'")
actions = person.pop_actions_and_get_contents_for("TALK")
# Ollama fallback (always available)
import ollama
response = ollama.chat(model=MODEL, messages=[{"role": "user", "content": prompt}],
                       options={"temperature": 0.8})
TinyTroupe must be installed from GitHub rather than PyPI (pip install git+https://github.com/microsoft/TinyTroupe.git@main). The script checks for availability at import time and prints which backend it is using at startup. For this task, the Ollama path at temperature 0.8 produces diversity similar to TinyTroupe's, so the fallback is a practical substitute in most cases.
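The try-first, fall-back-on-failure behavior can be sketched as follows. This is a minimal illustration, not the script's code: detect_backend and generate_with_fallback are hypothetical names, the two backends are passed in as plain callables, and the availability check uses importlib.util.find_spec where the real script may simply wrap the import in try/except.

```python
import importlib.util
from typing import Callable


def detect_backend() -> str:
    """Report which backend would be used, without importing TinyTroupe."""
    if importlib.util.find_spec("tinytroupe") is not None:
        return "tinytroupe"
    return "ollama"


def generate_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> str:
    """Call the primary generator; on any runtime error, use the fallback."""
    try:
        return primary(prompt)
    except Exception as exc:
        print(f"primary backend failed ({exc!r}); using fallback")
        return fallback(prompt)


print(f"backend: {detect_backend()}")
```

Passing the backends as callables keeps the fallback logic testable without either TinyTroupe or an Ollama server installed.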
Criteria auto-derivation
The fixed eval tasks have hand-authored criteria; generated tasks don't, so build_criteria() derives them from the task text at generation time. The derivation has two parts:
Count detection — if the task asks for "top 5", "best 3", or "N most", _extract_count() extracts N via regex. A count in the 2–7 range triggers exact_sections(N) — the output must have exactly that many non-structural H2 sections. Tasks without a count get min_sections(3) instead.
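The count-detection rule above can be sketched as follows. The regex and the function names are assumptions made for illustration, not the script's actual _extract_count() implementation. The sketch also assumes that a count outside the 2 to 7 range falls back to min_sections(3), which the text does not state explicitly, and that min_sections serializes in the same shape as the other criterion specs.

```python
import re
from typing import Optional

# Hypothetical pattern covering "top 5", "best 3", and "5 most ...".
_COUNT_RE = re.compile(
    r"\b(?:top|best)\s+(\d+)\b|\b(\d+)\s+most\b", re.IGNORECASE
)


def extract_count(task: str) -> Optional[int]:
    """Return the first requested item count in the task text, if any."""
    match = _COUNT_RE.search(task)
    if match is None:
        return None
    # Exactly one of the two groups is populated, depending on the phrasing.
    return int(match.group(1) or match.group(2))


def section_criterion(task: str) -> dict:
    """Pick exact_sections(N) for counts in 2..7, else min_sections(3)."""
    n = extract_count(task)
    if n is not None and 2 <= n <= 7:
        return {"type": "exact_sections", "n": n}
    return {"type": "min_sections", "n": 3}
```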
Structural and quality criteria are applied regardless:
| Criterion | What it checks |
|---|---|
| min_bytes(800) or min_bytes(600) | Minimum output length in UTF-8 bytes; the lower threshold applies to counted-section tasks |
| min_lines(15) or min_lines(10) | Minimum line count; the lower threshold applies to counted-section tasks |
| no_placeholders() | Rejects outputs containing "TODO", "[placeholder]", "add example here", etc. |
| has_impl_notes() | Requires at least one of: code block, "implementation note:", "example:"; signals concrete content |
| no_file_path_refs() | Rejects outputs that echo the save path back; "saved to ~/Desktop/..." is a producer artifact, not content |
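One plausible shape for these criteria is a set of factories that each return a predicate over the output text. The sketch below is an assumption about that shape: the factory names come from the table, but the str-to-bool signature, the marker lists, and the path regex are illustrative guesses, not the code in tinytroupe_tasks.py.

```python
import re
from typing import Callable

# Assumed signature: a criterion takes the output text and returns pass/fail.
Criterion = Callable[[str], bool]


def min_bytes(n: int) -> Criterion:
    """Pass when the UTF-8 encoded output is at least n bytes long."""
    return lambda text: len(text.encode("utf-8")) >= n


def min_lines(n: int) -> Criterion:
    """Pass when the output has at least n lines."""
    return lambda text: len(text.splitlines()) >= n


def no_placeholders() -> Criterion:
    """Fail when a known placeholder marker appears (case-insensitive)."""
    markers = ("todo", "[placeholder]", "add example here")
    return lambda text: not any(m in text.lower() for m in markers)


def has_impl_notes() -> Criterion:
    """Pass when the output has a code fence or an explicit example marker."""
    fence = "`" * 3  # built indirectly so this listing stays a valid fence
    signals = (fence, "implementation note:", "example:")
    return lambda text: any(s in text.lower() for s in signals)


def no_file_path_refs() -> Criterion:
    """Fail when the output echoes a save path such as 'saved to ~/...'."""
    pattern = re.compile(r"saved to\s+[~/]", re.IGNORECASE)
    return lambda text: pattern.search(text) is None
```

Because each factory returns a plain function, a task's criteria can be evaluated with a single all(check(text) for check in checks) call.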
Output format
Generated tasks are saved as JSON in eval_suite.py's SUITE format. Each task dict has the same keys as the hand-authored tasks, plus three extra fields: criteria_specs (serializable criterion specs), persona (the name of the persona that produced the task), and raw_response (the model's raw reply, kept for debugging). The eval_suite.py loader converts criteria_specs back to callable criterion functions at run time.
{
  "id": "P_devops_1",
  "desc": "generated: Senior DevOps Engineer",
  "task": "Search for the top 5 Kubernetes cost optimization strategies and save to .../eval-k8s-cost.md",
  "output": "~/Desktop/harness-engineering/eval-k8s-cost.md",
  "criteria_specs": [
    {"type": "exact_sections", "n": 5},
    {"type": "min_bytes", "n": 600},
    {"type": "min_lines", "n": 10},
    {"type": "no_placeholders"},
    {"type": "has_impl_notes"},
    {"type": "no_file_path_refs"}
  ],
  "persona": "Jordan",
  "raw_response": "TASK: Search for the top 5 ..."
}
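Turning criteria_specs back into callables could work roughly as sketched below. The registry and the specs_to_functions name are hypothetical stand-ins for whatever the eval_suite.py loader actually does, and only two factories are included to keep the example short.

```python
from typing import Callable, Dict, List

Criterion = Callable[[str], bool]


def min_bytes(n: int) -> Criterion:
    return lambda text: len(text.encode("utf-8")) >= n


def min_lines(n: int) -> Criterion:
    return lambda text: len(text.splitlines()) >= n


# Hypothetical registry: maps each spec "type" to its factory.
FACTORIES: Dict[str, Callable[..., Criterion]] = {
    "min_bytes": min_bytes,
    "min_lines": min_lines,
}


def specs_to_functions(specs: List[dict]) -> List[Criterion]:
    """Rebuild callable criteria from their JSON-serializable specs."""
    functions = []
    for spec in specs:
        kind = spec["type"]
        if kind not in FACTORIES:
            raise ValueError(f"unknown criterion type: {kind!r}")
        # Every key other than "type" is passed to the factory as a keyword.
        params = {k: v for k, v in spec.items() if k != "type"}
        functions.append(FACTORIES[kind](**params))
    return functions


checks = specs_to_functions(
    [{"type": "min_bytes", "n": 3}, {"type": "min_lines", "n": 2}]
)
passed = all(check("ab\ncd") for check in checks)
```

Raising on an unknown type, rather than skipping it, makes a missing registration visible as soon as a generated task file is loaded.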
Usage
# Generate one task per persona (default), save to generated_tasks.json
python tinytroupe_tasks.py
# Two tasks per persona
python tinytroupe_tasks.py --count 2
# Custom output path
python tinytroupe_tasks.py --out eval_generated.json
# Preview without saving
python tinytroupe_tasks.py --dry-run
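For reference, the flags shown above could be declared with argparse along these lines. This is a sketch under assumptions: the flag names and the generated_tasks.json default come from the usage examples, while the help strings, the default count of 1, and the build_parser name are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three flags used in the usage examples."""
    parser = argparse.ArgumentParser(
        description="Generate eval tasks from personas"
    )
    parser.add_argument("--count", type=int, default=1,
                        help="tasks to generate per persona")
    parser.add_argument("--out", default="generated_tasks.json",
                        help="path of the JSON file to write")
    parser.add_argument("--dry-run", action="store_true",
                        help="print generated tasks without saving")
    return parser


# Parse an explicit argument list so the example runs outside a shell.
args = build_parser().parse_args(["--count", "2", "--dry-run"])
```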
The criteria functions in tinytroupe_tasks.py are local copies of the same factories in eval_suite.py, kept separate to avoid a circular import. If you add a new criterion type to eval_suite.py, add it here as well, and to criteria_to_functions(), so that the generated-task runner recognizes it.