The Playwright Skill: LLM-Guided Navigation via ARIA Snapshots
The skill combines four ideas: ARIA accessibility trees instead of raw DOM, sitemap-based pre-planning, a completeness oracle that pulls additional pages when coverage is insufficient, and blocked-click memory that prevents the LLM from revisiting dead ends.
Most LLM web-navigation agents hand the model a raw dump of HTML links and hope for the best. playwright_skill.py takes a different approach: it uses Playwright's page.aria_snapshot() to give the model a structured, role-and-name view of the page, the same interface a screen reader sees. Because actions use semantic locators (get_by_role, get_by_text) rather than CSS selectors, they stay stable across minor HTML changes. The navigation loop runs up to 12 steps, with explicit backtracking, stuck detection, and per-step screenshots.
Why ARIA, not DOM
Raw DOM is noisy: class names, IDs, and nesting structure that mean nothing to a reading agent. The ARIA tree distills a page to its semantic skeleton — roles (searchbox, link, heading, button), accessible names, and interactive state. A page that takes 80 KB of HTML to represent fits in 4,000 characters of ARIA snapshot. More importantly, the roles and names are the natural vocabulary for the action verbs the LLM produces: "fill the searchbox named 'Query'", "click the link 'Best practices guide'".
If the ARIA tree is empty (JavaScript-heavy pages that haven't rendered roles yet), the skill falls back to page.inner_text("body") with whitespace normalization — a graceful degradation that avoids a hard failure on SPAs.
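The fallback described above can be sketched as follows. This is a minimal illustration, not the skill's actual code: the function name page_view and the max_chars cap are assumptions, and the sketch takes the snapshot through page.locator("body").aria_snapshot() (the Locator-level call in Playwright's Python API), whereas the article names page.aria_snapshot().

```python
import re


def page_view(page, max_chars=4000):
    """Return an ARIA snapshot, or normalized body text if the snapshot is empty.

    `page` is expected to behave like a Playwright sync Page.
    `max_chars` is an illustrative cap, not a documented setting.
    """
    try:
        snapshot = page.locator("body").aria_snapshot()
    except Exception:
        # Broad catch is a simplification: treat any failure as "no snapshot".
        snapshot = ""
    if snapshot and snapshot.strip():
        return snapshot[:max_chars]
    # Fallback: collapse every run of whitespace into a single space.
    text = page.inner_text("body")
    return re.sub(r"\s+", " ", text).strip()[:max_chars]
```

Either branch returns a plain string, so the prompt-building code does not need to know which one was taken.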
The six actions
fill
Fill an input by ARIA role + accessible name. If the locator fails, falls back to placeholder regex match, then CSS input type selectors. Auto-submits search inputs (presses Enter) to avoid the autocomplete-reopen loop where the LLM re-fills instead of submitting.
press
Press a key on the focused element. Dispatches to :focus first so autocomplete dropdowns receive the keypress rather than the global keyboard handler stealing it.
click
Click by visible text. Tries get_by_role("link") first, then falls back to get_by_text. Blocked clicks (links that previously led to backtracked pages) are silently skipped; the skill re-prompts the LLM with an updated snapshot so it can choose a different link.
goto
Navigate directly to a known URL. Checks the visited-URL set and skips if already in history. Used when the LLM knows the destination with high confidence (e.g. the planner pre-selected it).
backtrack
Return to the most recent non-backtracked ancestor in history. If no valid ancestor exists (the planner sent us directly to this page and it was the best available), raises immediately rather than looping.
extract
Declare the current page as the answer and extract full body text (up to 16,000 chars). Triggers the saturation oracle before returning — if coverage is insufficient, pulls additional pages.
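As one illustration of how an action executor might look, here is a hedged sketch of the click action: blocked texts are skipped, the role-based locator is tried first, and the text locator is the fallback. The function name execute_click and the returned status strings are assumptions made for this example; get_by_role, get_by_text, first, and click are Playwright's own locator calls.

```python
def execute_click(page, text, blocked_clicks):
    """Click a link by visible text and return a short status string.

    `blocked_clicks` is the set of link texts that previously led to
    backtracked pages. The status strings are illustrative only.
    """
    if text in blocked_clicks:
        # Known dead end: do nothing so the next prompt can pick another link.
        return "skipped"
    try:
        page.get_by_role("link", name=text).first.click()
        return "clicked:role"
    except Exception:
        # Simplification: a real executor would catch Playwright's error types.
        page.get_by_text(text).first.click()
        return "clicked:text"
```

Returning a status instead of raising on a blocked click keeps the navigation loop simple: the caller just takes a fresh snapshot and asks the model again.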
Pre-navigation sitemap planning
Before the navigation loop starts, the skill calls sitemap_skill.discover_pages() to crawl the domain for up to 80 pages (quick mode). The top 15 pages ranked by goal relevance are passed to a one-shot LLM planner that selects the single best URL and two alternatives:
{
  "action": "goto",
  "url": "https://docs.example.com/best-practices",
  "alternatives": ["https://docs.example.com/guides", "https://docs.example.com/overview"],
  "reason": "path contains 'best-practices', matches broad goal intent"
}
The planner system prompt encodes a key heuristic: for broad goals ("best practices", "overview", "getting started"), prefer high-level hub pages over deep feature-specific pages even if the keyword match is better on the deep page. A page at /docs/en/section/specific-feature answers "what is X?" — not "what are best practices across the topic?".
If the planner returns {"action": "fail"}, the skill raises immediately without even launching the browser — the content is not on this site. This saves the full navigation budget for sites that actually have the answer.
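A small sketch of how the planner's reply could be handled before any browser is launched. The names PlannerFail and parse_plan are hypothetical; only the JSON fields (action, url, alternatives, reason) come from the example above.

```python
import json


class PlannerFail(Exception):
    """Raised when the planner reports that the site lacks the content."""


def parse_plan(raw):
    """Parse the planner's JSON reply into (url, alternatives)."""
    plan = json.loads(raw)
    if plan.get("action") == "fail":
        # Fail fast: the caller should not launch a browser at all.
        raise PlannerFail(plan.get("reason", "planner found no suitable page"))
    if plan.get("action") != "goto" or not plan.get("url"):
        raise ValueError(f"unexpected planner response: {plan!r}")
    # Keep at most two alternatives, matching the planner contract above.
    alternatives = list(plan.get("alternatives", []))[:2]
    return plan["url"], alternatives
```

Raising a dedicated exception for the fail case lets the caller distinguish "not on this site" from a malformed reply.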
Saturation extraction
When the LLM decides to extract, the skill doesn't just return the first page. A completeness oracle scores how well the extracted content answers the goal on a 0–10 scale:
{
  "score": 6,
  "missing": "specific cost breakdown examples and comparison to alternatives"
}
If the score is below 7 (the SATURATION_THRESHOLD), the skill pulls additional pages: first from the planner's ranked alternatives, then from a re-ranked sitemap query that incorporates the oracle's "missing" description into the goal. Up to 3 pages in total are merged, and the oracle re-scores the combined content after each addition. The score is constrained to be monotonically non-decreasing: the oracle is instructed that its score must be ≥ the prior score, and the skill enforces the same bound client-side.
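The loop can be sketched roughly as below. Several things here are assumptions: the function name saturate, the fetch and score_fn callables, and the reading that the 3-page limit includes the first page. The sketch also takes one pre-ordered candidate list rather than modelling the switch from planner alternatives to a re-ranked sitemap query.

```python
SATURATION_THRESHOLD = 7
MAX_PAGES = 3  # assumed to include the first extracted page


def saturate(first_text, candidates, fetch, score_fn):
    """Merge extra pages until the oracle's score reaches the threshold.

    fetch(url) returns page text; score_fn(pages, prior) returns a 0-10 score.
    Both are hypothetical stand-ins for the browser and the oracle call.
    """
    pages = [first_text]
    score = score_fn(pages, 0)
    for url in candidates:
        if score >= SATURATION_THRESHOLD or len(pages) >= MAX_PAGES:
            break
        pages.append(fetch(url))
        # Client-side clamp: the score may never fall below its prior value.
        score = max(score, score_fn(pages, score))
    return pages, score
```

The max() call is what makes the non-decreasing guarantee hold even when the oracle ignores its instruction and returns a lower number.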
Multiple source pages are concatenated with separator markers:
... first page content ...
---
Source: https://docs.example.com/guides
... additional page content ...
The merged text is returned to the caller with the final page's URL as the canonical source. The downstream synthesis step works with the full multi-page extract.
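A minimal sketch of the concatenation, following the separator layout shown above. The function name merge_pages and the (url, text) pair format are assumptions for illustration.

```python
def merge_pages(first_text, extras):
    """Join page texts, prefixing each extra page with a separator marker.

    `extras` is a list of (url, text) pairs for the additional pages.
    """
    parts = [first_text]
    for url, text in extras:
        parts.append(f"---\nSource: {url}\n{text}")
    return "\n".join(parts)
```

Keeping the source URL inside each marker lets a later step attribute a passage to the page it came from.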
Navigation memory and stuck detection
The skill maintains three persistence structures across steps:
| Structure | Role |
| --- | --- |
| history[] | Per-page dict with URL, title, arrival note, and the link text used to get there (via). The last 6 entries are included in every LLM prompt so the model knows where it has been. |
| blocked_clicks | Set of link texts that led to backtracked pages. Injected into the prompt as a "DO NOT click" list and silently enforced in the executor. Prevents the LLM from cycling through the same dead-end links. |
| url_visit_count | URL → visit count. If the same URL is seen 4 times without progress, the skill raises an auto-fail. Prevents infinite loops on pages that keep redirecting back to themselves. |
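One way these three structures might fit together is sketched below. The class and method names (NavMemory, record_arrival, record_backtrack, prompt_context) and the StuckError exception are hypothetical. As a simplification, the sketch counts every visit rather than tracking whether progress was made in between.

```python
from collections import Counter

MAX_VISITS = 4       # auto-fail on the 4th visit to the same URL
HISTORY_WINDOW = 6   # history entries included in each prompt


class StuckError(Exception):
    """Raised when the same URL keeps coming back."""


class NavMemory:
    def __init__(self):
        self.history = []
        self.blocked_clicks = set()
        self.url_visit_count = Counter()

    def record_arrival(self, url, title, note="", via=None):
        self.url_visit_count[url] += 1
        if self.url_visit_count[url] >= MAX_VISITS:
            raise StuckError(f"{url} seen {MAX_VISITS} times")
        self.history.append({"url": url, "title": title, "note": note, "via": via})

    def record_backtrack(self):
        # Block the link text that led to the page being abandoned.
        via = self.history[-1]["via"] if self.history else None
        if via:
            self.blocked_clicks.add(via)

    def prompt_context(self):
        return {
            "recent": self.history[-HISTORY_WINDOW:],
            "do_not_click": sorted(self.blocked_clicks),
        }
```

The same blocked_clicks set serves both purposes described in the table: it is rendered into the prompt and can be checked again by the executor before a click is attempted.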
SPA client-side routing requires special handling: after a click, the URL may not change immediately because the load event fires before the router updates the address bar. The skill polls for up to 6 seconds for an actual URL change. If the URL remains unchanged (modal, anchor link, or dead element), it injects a note into the next prompt explaining what happened and warning the LLM not to click the same element again.
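The polling step could look something like this sketch. The function name and its parameters are assumptions; the clock and sleep functions are injectable only to make the helper easy to test. With Playwright, get_url would typically be `lambda: page.url`.

```python
import time


def wait_for_url_change(get_url, before, timeout=6.0, interval=0.1,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll get_url() until it differs from `before` or the timeout expires.

    Returns the new URL, or None if the URL never changed.
    """
    deadline = clock() + timeout
    while True:
        current = get_url()
        if current != before:
            return current
        if clock() >= deadline:
            # Unchanged: likely a modal, an anchor link, or a dead element.
            return None
        sleep(interval)
```

A None result is the signal to add the explanatory note to the next prompt instead of treating the click as a successful navigation.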
Browser persistence and CDP reconnection
The skill supports three browser lifetime modes:
Ephemeral (default)
Playwright owns the browser. Chromium launches headless at the start of each task and closes when _teardown() is called. No persistence between tasks.
Keep-alive (--keep-browser / keep=True)
Chromium is launched as a detached OS subprocess (DETACHED_PROCESS on Windows, start_new_session=True on POSIX) connected via CDP on port 9222. Browser state is written to a JSON file in the system temp directory. The browser outlives the agent subprocess and is available for the next task.
Reuse (--reuse-browser / HARNESS_REUSE_BROWSER=1)
Reconnects to an existing browser via pw.chromium.connect_over_cdp() using the port recorded in the state file. Falls back to launching a fresh browser if the CDP endpoint doesn't respond. Useful for multi-step tasks that need to share session cookies across agent invocations.
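For the keep-alive mode, the platform-specific detachment can be sketched as below. The helper names are hypothetical, and the executable path is left to the caller. subprocess.DETACHED_PROCESS is only defined on Windows, which is why it is referenced inside the platform branch. The --remote-debugging-port switch is Chromium's standard flag for exposing a CDP endpoint.

```python
import subprocess
import sys


def detached_popen_kwargs():
    """Return Popen keyword arguments that let the child outlive the parent."""
    if sys.platform == "win32":
        return {"creationflags": subprocess.DETACHED_PROCESS}
    # POSIX: start the child in its own session.
    return {"start_new_session": True}


def chromium_command(executable, port=9222):
    """Build a Chromium command line that exposes CDP on `port`."""
    return [executable, f"--remote-debugging-port={port}"]
```

A caller would combine the two as `subprocess.Popen(chromium_command(path), **detached_popen_kwargs())` and then record the port in the state file for later reuse.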
Screenshots
Every step produces a screenshot saved to screenshots/{run_id}/step{N:02d}_{action}_{url_slug}.png. Extract steps use full_page=True to capture the complete scrollable content. The run_id parameter ties screenshots to the runs.jsonl entry for the task, so the dashboard's Artifacts view can surface them alongside the text output.
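The path template can be sketched as follows. The slug rule in url_slug (drop the scheme, replace non-alphanumeric runs with hyphens, cap the length) is an assumption; the article gives only the overall filename pattern.

```python
import re
from pathlib import Path


def url_slug(url, max_len=40):
    """Turn a URL into a filesystem-safe fragment (illustrative rule)."""
    slug = re.sub(r"^https?://", "", url)
    slug = re.sub(r"[^A-Za-z0-9]+", "-", slug).strip("-")
    return slug[:max_len] or "page"


def screenshot_path(run_id, step, action, url):
    """Build screenshots/{run_id}/step{N:02d}_{action}_{url_slug}.png."""
    name = f"step{step:02d}_{action}_{url_slug(url)}.png"
    return Path("screenshots") / run_id / name
```

With Playwright, the result would be passed as `page.screenshot(path=str(p), full_page=True)` on extract steps, after creating the parent directory.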
The skill is invoked as /browser <url> <goal> from both the op CLI and the dashboard Submit view. The --headed flag passes headed=True, which makes the browser window visible during navigation — useful for debugging what the ARIA tree shows at each step.