May 26, 2026 • 8 min read • Agentic Harness Engineering

The Lit-Review Skill: Seven Steps from ArXiv Query to Rendered Survey

A single slash command chains ArXiv fetch, Semantic Scholar citation enrichment, five-persona curation, per-paper Wiggum annotation, LLM clustering, cross-cluster synthesis, and Jinja2 rendering into a complete structured literature review.

The harness accumulates ArXiv papers as part of its research corpus — but raw PDFs and abstracts aren't immediately useful as training signal or reference material. lit_review_skill.py closes the loop: it takes a query string and produces a structured Markdown survey, handling every step automatically from fetch through final render. The skill is invoked as /lit-review <topic> from the agent pipeline, or run standalone from the CLI.

Seven steps

Fetch from ArXiv

Calls arxiv_fetch.py to pull up to 100 papers for the query, newest first (the --max-fetch flag used in the examples under Bypass flags changes this cap). Supports --after and --before date filters. Skippable with --csv if an existing annotated CSV is available.
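The post does not show how arxiv_fetch.py builds its request, so the following is only a sketch of how a query with date filters could map onto the public arXiv API. The function name build_arxiv_url and the choice of the export.arxiv.org endpoint are assumptions, not the harness's actual code.

```python
from datetime import date
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"  # public arXiv API endpoint


def _stamp(iso_day, hhmm):
    """Turn '2024-06-01' into the YYYYMMDDHHMM form used by submittedDate ranges."""
    return date.fromisoformat(iso_day).strftime("%Y%m%d") + hhmm


def build_arxiv_url(query, max_results=100, after=None, before=None):
    """Hypothetical helper: newest-first arXiv query with an optional date window."""
    search = f'all:"{query}"'
    if after or before:
        lo = _stamp(after, "0000") if after else "199101010000"
        hi = _stamp(before, "2359") if before else date.today().strftime("%Y%m%d") + "2359"
        search += f" AND submittedDate:[{lo} TO {hi}]"
    params = {
        "search_query": search,
        "sortBy": "submittedDate",   # newest first
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


print(build_arxiv_url("prompt injection", max_results=200, after="2024-06-01"))
```

The sketch only builds the URL; fetching and parsing the Atom response is left out.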

Enrich via Semantic Scholar

Builds a citation graph via semantic_scholar.py, computing hub_score (how many in-corpus papers cite this one) and ref_count for each paper. Hub scores are used in Step 3 to prioritize annotation of the most-cited papers first. Skippable with --no-s2.
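As a rough illustration of the two metrics, the sketch below computes them from a plain mapping of paper ID to cited IDs. The input shape and the function name citation_scores are assumptions; semantic_scholar.py may structure this differently, and ref_count is taken here to mean the total number of references a paper lists, which the post does not define.

```python
from collections import Counter


def citation_scores(references):
    """references: {paper_id: [ids it cites]} (assumed shape).

    hub_score = number of in-corpus papers citing this paper.
    ref_count = number of references the paper lists (assumed meaning).
    """
    corpus = set(references)
    hub = Counter()
    for citing, cited_ids in references.items():
        for cited in set(cited_ids):              # count each citing paper once
            if cited in corpus and cited != citing:
                hub[cited] += 1
    return {
        pid: {"hub_score": hub[pid], "ref_count": len(refs)}
        for pid, refs in references.items()
    }


refs = {"A": ["B", "C", "X"], "B": ["C"], "C": []}
print(citation_scores(refs))
```

Here "X" is cited but not in the corpus, so it contributes to A's ref_count without receiving a hub_score of its own.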

Persona curation

Calls curator.py's score_paper() for each paper and keeps only those that clear the curation gate: mean ≥ 3.5 and veto floor > 2. Surviving papers are sorted by hub_score descending before annotation, so the most-referenced papers in the corpus are annotated first. Skippable with --no-curate.
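The gate can be read as two conditions on the five persona scores. The sketch below assumes "veto floor > 2" means every individual persona score must be above 2; that reading, the function name passes_gate, and the list-of-numbers input are assumptions rather than curator.py's real interface.

```python
from statistics import mean

MEAN_THRESHOLD = 3.5  # from the post
VETO_FLOOR = 2        # from the post; strict inequality assumed


def passes_gate(persona_scores):
    """True when the mean clears 3.5 and no single persona scores 2 or lower."""
    if not persona_scores:
        return False
    return mean(persona_scores) >= MEAN_THRESHOLD and min(persona_scores) > VETO_FLOOR


print(passes_gate([4, 4, 4, 4, 4]))  # solid across the board
print(passes_gate([5, 5, 5, 5, 2]))  # high mean, but one persona vetoes
```

Under this reading a single low score rejects a paper even when the mean is well above the threshold.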

Annotate with Wiggum

Runs each paper through run_annotate_standalone() to produce an 8-section structured annotation (Topic, Motivation, Contribution, Detail/Nuance, Evidence, Weaker Result, Narrow Impact, Broad Impact). Optionally runs the Wiggum revision loop on each annotation. Results are checkpointed per paper in .lit_review_cache/, so after a crash mid-run a rerun resumes from the last completed paper.

Cluster into themes

An LLM groups papers into 3–5 thematic clusters based on title and Contribution sentences. Outputs cluster names (5–7 words each) with their paper ID lists as JSON. Falls back to a single "All Papers" cluster if JSON parsing fails.
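The fallback behaviour can be sketched as a defensive parser. The JSON shape used here (a list of objects with "name" and "paper_ids") and the function name parse_clusters are assumptions; the post only says the output is cluster names with paper ID lists.

```python
import json


def parse_clusters(llm_output, paper_ids):
    """Parse the clustering response; fall back to one 'All Papers' cluster."""
    fallback = [{"name": "All Papers", "paper_ids": list(paper_ids)}]
    try:
        data = json.loads(llm_output)
    except (json.JSONDecodeError, TypeError):
        return fallback
    if not isinstance(data, list):
        return fallback

    known = set(paper_ids)
    clusters = []
    for item in data:
        if not isinstance(item, dict) or not isinstance(item.get("paper_ids"), list):
            return fallback
        # Drop IDs the model invented; keep only papers actually in the run.
        ids = [pid for pid in item["paper_ids"] if pid in known]
        if ids:
            clusters.append({"name": str(item.get("name", "Untitled")), "paper_ids": ids})
    return clusters or fallback


print(parse_clusters("not json at all", ["2401.00001", "2401.00002"]))
```

Filtering out unknown IDs is an extra safeguard added in this sketch, not something the post describes.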

Synthesize

Two synthesis passes: (a) a 2–3 sentence paragraph per cluster naming specific techniques and how papers relate; (b) a 3–4 sentence cross-cluster overview plus 3–5 open research questions the literature hasn't fully answered. The open-questions list is a distinct output field used by the gaps template.

Render via Jinja2

Assembles a template context with clusters, per-paper annotations, hub-paper highlights, synthesis paragraphs, open questions, and citation gap candidates. Renders to .md using a named Jinja2 template from the templates/ directory.
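A minimal sketch of the render step follows. To stay self-contained it loads an inline template through Jinja2's DictLoader, whereas the skill reads named templates from templates/ (for which FileSystemLoader would be the usual choice). The template body, the context keys, and the function name render_review are illustrative assumptions.

```python
from jinja2 import DictLoader, Environment

# Inline stand-in for a file such as templates/survey (contents are hypothetical).
TEMPLATES = {
    "survey": (
        "# {{ title }}\n\n"
        "{{ overview }}\n\n"
        "{% for cluster in clusters %}"
        "## {{ cluster.name }}\n\n"
        "{{ cluster.summary }}\n\n"
        "{% for paper in cluster.papers %}"
        "- {{ paper.title }} (hub_score {{ paper.hub_score }})\n"
        "{% endfor %}\n"
        "{% endfor %}"
        "## Open questions\n\n"
        "{% for q in open_questions %}- {{ q }}\n{% endfor %}"
    )
}


def render_review(template_name, context):
    """Render a named template to Markdown text."""
    env = Environment(loader=DictLoader(TEMPLATES), autoescape=False)
    return env.get_template(template_name).render(**context)


context = {
    "title": "Prompt injection: a survey",
    "overview": "Cross-cluster overview goes here.",
    "clusters": [{
        "name": "Tool use",
        "summary": "Per-cluster synthesis paragraph.",
        "papers": [{"title": "Paper A", "hub_score": 3}],
    }],
    "open_questions": ["How robust are current defenses?"],
}
print(render_review("survey", context))
```

Autoescaping is turned off because the output is Markdown rather than HTML.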

Three output templates

survey

Full academic survey format: per-cluster summaries, hub paper highlighted per cluster, all annotated papers with 8-section breakdowns, cross-cluster synthesis, open questions, and gap candidates.

gaps

Research gap focus: clusters condensed, open questions promoted to the top section, citation gaps (papers cited by the corpus but not in it) surfaced prominently. Useful for identifying where to extend the literature.
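Citation gaps follow directly from the citation graph: collect every cited ID that is not itself in the corpus. The sketch below also ranks candidates by how many corpus papers cite them, which is an assumption about ordering; the input shape and the name citation_gaps are hypothetical as well.

```python
from collections import Counter


def citation_gaps(references, top_n=10):
    """references: {paper_id: [ids it cites]} (assumed shape).

    Returns (paper_id, citing_count) pairs for papers that the corpus cites
    but does not contain, most-cited first.
    """
    corpus = set(references)
    counts = Counter(
        cited
        for cited_ids in references.values()
        for cited in set(cited_ids)      # one vote per citing paper
        if cited not in corpus
    )
    return counts.most_common(top_n)


refs = {"A": ["X", "Y", "B"], "B": ["X"], "C": ["X", "Y"]}
print(citation_gaps(refs))
```

In this toy graph "X" is cited by all three corpus papers and "Y" by two, so both surface as gap candidates while "B" does not.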

executive

One-page summary format: synthesis overview only, top 3 papers from each cluster, open questions as bullet points. Designed for a reader who needs the landscape without the full annotation detail.

Bypass flags

Each step can be skipped independently. Common combinations:

# Use an existing annotated CSV, skip fetch and curation, run with executive template
python lit_review_skill.py --csv papers.csv --no-fetch --no-curate --template executive --out exec.md

# Full run with date filter
python lit_review_skill.py "prompt injection" --after 2024-06-01 --max-fetch 200 --max-annotate 30 --out review.md

# Fast annotation check — skip Wiggum eval per paper
python lit_review_skill.py "agentic LLM" --no-wiggum --out quick.md

Flags: --no-fetch, --no-curate, --no-wiggum, --no-s2, --csv FILE, --checkpoint DIR
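For orientation, the flags that appear in this post could be declared with argparse roughly as follows. This is a sketch, not the skill's actual parser: the defaults (100 for --max-fetch, survey for --template, .lit_review_cache for --checkpoint) and the optional positional topic are assumptions inferred from the text and the examples.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags named in the post."""
    p = argparse.ArgumentParser(prog="lit_review_skill.py")
    p.add_argument("topic", nargs="?", help="ArXiv query string")
    p.add_argument("--csv", metavar="FILE", help="existing annotated CSV")
    p.add_argument("--checkpoint", metavar="DIR", default=".lit_review_cache")
    p.add_argument("--template", choices=["survey", "gaps", "executive"], default="survey")
    p.add_argument("--out", metavar="FILE")
    p.add_argument("--after", metavar="YYYY-MM-DD")
    p.add_argument("--before", metavar="YYYY-MM-DD")
    p.add_argument("--max-fetch", type=int, default=100)
    p.add_argument("--max-annotate", type=int, default=None)
    # One independent skip switch per bypassable step.
    for step in ("fetch", "curate", "wiggum", "s2"):
        p.add_argument(f"--no-{step}", action="store_true")
    return p


args = build_parser().parse_args(
    ["--csv", "papers.csv", "--no-fetch", "--no-curate", "--template", "executive", "--out", "exec.md"]
)
print(args.no_fetch, args.no_curate, args.no_wiggum, args.template)
```

Modelling each bypass as its own store_true switch keeps the steps independently skippable, matching the combinations shown above.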

Checkpointing and idempotency

Step 4 is the most expensive: each paper requires a full annotation LLM call, optionally followed by the Wiggum revision loop. A 20-paper run with Wiggum enabled can take 20–40 minutes. To make interruptions recoverable, each paper's annotation is checkpointed to .lit_review_cache/{arxiv_id}.json immediately after completion. Re-running the same skill invocation skips already-annotated papers and resumes from where it left off.

The checkpoint file stores the 8-section annotation dict, the raw annotation text, and the Wiggum score (if run). The same checkpoint directory is used across different review invocations for the same paper corpus — so a paper annotated for one review is immediately available for another.
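The resume behaviour amounts to a read-through cache keyed by arXiv ID. The sketch below assumes a caller-supplied annotate function and a record dict; both are stand-ins, as is the function name annotate_with_checkpoints. Writing through a temporary file is an extra precaution added here so a crash cannot leave a half-written checkpoint, and is not something the post describes.

```python
import json
from pathlib import Path


def annotate_with_checkpoints(papers, annotate, cache_dir=".lit_review_cache"):
    """Annotate each paper once; reuse the JSON checkpoint on later runs."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    results = {}
    for paper in papers:
        arxiv_id = paper["arxiv_id"]
        # Old-style IDs such as "cs/0501001" contain a slash; keep filenames flat.
        path = cache / f"{arxiv_id.replace('/', '_')}.json"
        if path.exists():
            results[arxiv_id] = json.loads(path.read_text(encoding="utf-8"))
            continue
        record = annotate(paper)                     # the expensive LLM step
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_text(json.dumps(record), encoding="utf-8")
        tmp.replace(path)                            # atomic rename on the same filesystem
        results[arxiv_id] = record
    return results
```

Calling this twice with the same directory invokes annotate only for papers that have no checkpoint yet, which is also how one review can reuse annotations produced for another.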

Hub-score prioritization

After curation, surviving papers are sorted by hub_score descending before annotation begins. This means the most-cited papers in the in-corpus citation graph get annotated first, and if max_annotate truncates the list, the highest-influence papers are always included. A paper with hub_score = 0 (no in-corpus citations) ranks below any paper with at least one, and is the most likely to be dropped when the budget is tight.
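The ordering and truncation described above can be sketched in a few lines. The dict-per-paper shape and the function name select_for_annotation are assumptions.

```python
def select_for_annotation(papers, max_annotate=None):
    """Sort by hub_score descending, then keep at most max_annotate papers."""
    # sorted() is stable, so papers with equal hub_score keep their input order.
    ranked = sorted(papers, key=lambda p: p.get("hub_score", 0), reverse=True)
    return ranked if max_annotate is None else ranked[:max_annotate]


papers = [
    {"arxiv_id": "a", "hub_score": 0},
    {"arxiv_id": "b", "hub_score": 3},
    {"arxiv_id": "c", "hub_score": 1},
]
print([p["arxiv_id"] for p in select_for_annotation(papers, max_annotate=2)])
```

With a budget of two, the paper with hub_score 0 is the one that gets dropped.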

The hub paper per cluster (the highest hub_score paper in that cluster) is surfaced as a highlighted entry in the rendered output — the implicit "most important paper to read" signal within that thematic group.
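Picking the hub paper is a per-cluster maximum over hub_score. In the sketch below the cluster and score shapes are assumed, the name hub_papers is hypothetical, and ties go to whichever paper is listed first in the cluster, a tie-break the post does not specify.

```python
def hub_papers(clusters, hub_scores):
    """Map each cluster name to its highest-hub_score paper ID."""
    return {
        c["name"]: max(c["paper_ids"], key=lambda pid: hub_scores.get(pid, 0))
        for c in clusters
        if c["paper_ids"]                  # skip empty clusters
    }


clusters = [
    {"name": "Tool use", "paper_ids": ["a", "b"]},
    {"name": "Security", "paper_ids": ["c"]},
]
print(hub_papers(clusters, {"a": 1, "b": 4, "c": 0}))
```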

The /lit-review skill is the automated backend for the literature review posts in the blog's Literature Reviews series. Those posts were generated from this pipeline against specific ArXiv queries for tool use, security, evaluation, and fine-tuning topics.

Integration with the fine-tuning pipeline

The lit-review pipeline is the upstream source for the DPO training dataset. The flow is: lit_review_skill.py produces arxiv_*_annotated.csv → curator.py filters to arxiv_*_curated.csv → build_finetune_from_annotations.py constructs preference pairs for DPO training. A literature review that scores well on Wiggum annotation quality feeds stronger training signal into the evaluator model. For the full downstream pipeline, see Five Personas, One Veto and The Fine-tune View.
