May 28, 2026 • 18 min read • Agentic Harness Engineering

OSINT Enrichment: Nine Layers of Passive Reconnaissance

The harness now automatically enriches any research task that mentions a domain, IP address, or email address with nine parallel passive-reconnaissance layers, from DNS records to certificate transparency logs to threat intelligence feeds. Zero configuration is required for the first seven. The enrichment block is appended to the synthesis context using the same position-end pattern that delivered the largest gains in both the Beige Book and FRED experiments.


TL;DR: harness/osint_tool.py adds a nine-layer enrichment pipeline to gather_research(). It fires automatically when a domain, IP, or email appears in the task string. All seven domain-level fetchers run in parallel via ThreadPoolExecutor. The seven zero-config layers (1 through 7) work with no API keys. Two optional layers (urlscan.io and OTX AlienVault) unlock richer tech-stack and threat-intel data.

Motivation

The RAG experiments demonstrated that appending factual, cited context to the synthesis prompt improves output quality. FRED series data raised composite scores by +0.40; Beige Book prose by +0.13. The pattern held: structured, authoritative data beats narrative prose, and end-of-context beats front-loading.

OSINT follows the same logic. When the task involves a domain, IP, or email address, a large volume of machine-readable public data exists — registrant details, DNS records, certificate history, open ports, historical screenshots, and threat assessments — that the harness was previously leaving on the table. Instead of routing that lookup to a human or treating it as a separate workflow, the enrichment pipeline collects it automatically before synthesis begins.

The design constraint was "no new dependencies required for the baseline." The harness already uses urllib and the standard library. All seven zero-config layers use only those. Optional dnspython enriches DNS results if installed. Key-gated layers require their respective API keys in .env.

Architecture

The pipeline has three phases: detection, fetching, and injection. Detection and injection happen in agent.py; fetching happens entirely in osint_tool.py.

gather_research(task, ...)
│
├─ ① OSINT pre-flight                    ← before search loop
│     _osint_themes(task)                intent gate (keyword list)
│     generate_dork_queries(task)        LLM → dork queries
│     planned_queries = dorks            replaces empty search plan
│
├─ [...web search loop, FRED block, memory injection...]
│
└─ ② OSINT enrichment block              ← after FRED, before synthesis
      _detect_targets(task)              regex → {domains, ips, emails}
      query_osint(task)                  parallel 9-layer fetch
      merged_text += osint_block         append to context (position-end)

Phase 1: Intent Detection

_osint_themes(query) checks for keywords that signal an OSINT-relevant task: whois, domain, ip address, dns, registrant, certificate, ssl, hosting, threat intel, ioc, malware, phishing, dark web, shodan, recon, investigate, cyberattack, breach, vulnerability, portscan, infrastructure. Returning a non-empty list triggers dork generation. This gate exists to avoid spurious LLM calls on tasks like "summarize this paper."
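A minimal sketch of such a gate, assuming whole-word matching against the keyword list above. The name `osint_themes` and the matching strategy are illustrative assumptions; the actual `_osint_themes` in osint_tool.py may be implemented differently.

```python
import re

# Keyword list copied from the description above.
OSINT_KEYWORDS = (
    "whois", "domain", "ip address", "dns", "registrant", "certificate",
    "ssl", "hosting", "threat intel", "ioc", "malware", "phishing",
    "dark web", "shodan", "recon", "investigate", "cyberattack", "breach",
    "vulnerability", "portscan", "infrastructure",
)

# Whole-word patterns (an assumption) so that "ioc" does not match "biochemistry".
_PATTERNS = [(kw, re.compile(rf"\b{re.escape(kw)}\b")) for kw in OSINT_KEYWORDS]


def osint_themes(query: str) -> list[str]:
    """Return the OSINT keywords found in the query; an empty list keeps the gate closed."""
    text = query.lower()
    return [kw for kw, pattern in _PATTERNS if pattern.search(text)]
```

Whole-word matching is one way to keep short keywords such as "ioc" or "ssl" from firing on unrelated words; plain substring matching would be simpler but noisier.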

Phase 1b: Dork Query Generation

When the intent gate fires and no planned_queries exist, the harness makes a single LLM call using COMPRESS_MODEL to generate 2–3 targeted search queries with Google operators (site:, filetype:, intitle:, inurl:). These replace the default search plan for the run, allowing the web search loop to pull in more operationally relevant results than a naive keyword search would produce.

# Example dork output for: "investigate lukasdgreen.com"
[
  "site:lukasdgreen.com OR inurl:lukasdgreen.com",
  "\"lukasdgreen.com\" filetype:pdf OR filetype:txt",
  "intitle:lukasdgreen site:linkedin.com OR site:twitter.com"
]
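The model's reply still has to be turned into a clean list like the one above. The sketch below shows one tolerant way to do that, assuming the model is asked for a JSON array but may wrap it in extra prose. `parse_dork_reply` is a hypothetical helper, not a function confirmed to exist in osint_tool.py.

```python
import json
import re


def parse_dork_reply(reply: str, n: int = 3) -> list[str]:
    """Extract up to n non-empty query strings from an LLM reply containing a JSON array."""
    match = re.search(r"\[.*\]", reply, flags=re.DOTALL)  # first "[" through last "]"
    if not match:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []  # malformed reply: the caller falls back to the default search plan
    if not isinstance(items, list):
        return []
    queries = [q.strip() for q in items if isinstance(q, str) and q.strip()]
    return queries[:n]
```

Returning an empty list on any parse failure matches the behavior of the pre-flight hook, which only replaces the search plan when dorks are actually produced.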

Phase 2: Target Extraction

_detect_targets(text) applies three regex passes over the full task string to extract:

Domains: valid FQDNs with known TLDs; excludes common false positives such as the abbreviation "e.g." and filenames
IPs: IPv4 only (RFC 1918 private ranges are excluded, since they are not publicly routable)
Emails: extracts the domain portion for a subsequent RDAP lookup

Up to max_domains=2 and max_ips=2 targets are enriched per call. This caps latency and prevents runaway fetching on tasks that enumerate many hosts.
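The three passes and the caps can be sketched as follows. This is an illustrative approximation, not the actual `_detect_targets`: the regexes, the tiny `KNOWN_TLDS` allowlist, and the name `detect_targets` are all assumptions.

```python
import ipaddress
import re

# Illustrative allowlist only; a real gate would use a much fuller TLD list.
KNOWN_TLDS = {"com", "org", "net", "io", "dev", "edu", "gov", "info"}

_EMAIL_RE = re.compile(r"[\w.+-]+@(?:[a-z0-9-]+\.)+[a-z]{2,}", re.I)
_IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
_DOMAIN_RE = re.compile(r"\b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}\b", re.I)


def detect_targets(text: str, max_domains: int = 2, max_ips: int = 2) -> dict:
    """Extract capped, de-duplicated domains, public IPv4 addresses, and emails."""
    emails = [m.group(0).lower() for m in _EMAIL_RE.finditer(text)]

    ips = []
    for m in _IP_RE.finditer(text):
        try:
            addr = ipaddress.IPv4Address(m.group(0))
        except ValueError:
            continue  # not a valid address, e.g. 999.1.1.1
        # is_private covers RFC 1918 plus loopback and other non-global ranges.
        if not addr.is_private and str(addr) not in ips:
            ips.append(str(addr))

    domains = []
    for m in _DOMAIN_RE.finditer(text):
        name = m.group(0).lower()
        # The TLD check drops filenames such as report.pdf; email domains pass through here too.
        if name.rsplit(".", 1)[1] in KNOWN_TLDS and name not in domains:
            domains.append(name)

    return {"domains": domains[:max_domains], "ips": ips[:max_ips], "emails": emails}
```

Because the domain pattern also matches the host part of an email address, email domains land in the `domains` list without a separate step in this sketch.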

Phase 3: Parallel Fetch

All seven domain-level fetchers run concurrently via ThreadPoolExecutor(max_workers=8). Each fetcher is independent and has its own timeout (_TIMEOUT = 10s). A fetcher that times out or raises any exception returns an empty dict and is silently skipped — the enrichment block is never absent due to a single failed layer.
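A minimal sketch of this fan-out pattern, assuming each fetcher is a callable that takes the target and returns a dict. `run_layers` and `_safe` are hypothetical names; per-request timeouts are assumed to be enforced inside each fetcher (for example via `urllib.request.urlopen(url, timeout=10)`).

```python
from concurrent.futures import ThreadPoolExecutor


def _safe(fetcher, target):
    """Run one fetcher; any exception or non-dict result becomes an empty dict."""
    try:
        result = fetcher(target)
        return result if isinstance(result, dict) else {}
    except Exception:
        return {}


def run_layers(target, fetchers, max_workers=8):
    """Run all fetchers concurrently and return {layer_name: result_dict}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(_safe, fn, target) for name, fn in fetchers.items()}
        # result() cannot raise here because _safe swallows fetcher errors.
        return {name: future.result() for name, future in futures.items()}
```

Catching exceptions inside the worker rather than around `future.result()` keeps one failing layer from affecting how the others are collected.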

The Nine Layers

1 DNS Records free

A/AAAA, MX, NS, and TXT records via the system resolver (or dnspython if installed for richer output). TXT records frequently expose SPF configuration, DMARC policy, and domain-ownership verification tokens. The MX chain often reveals the email provider (Google Workspace, Microsoft 365, Proofpoint, Mimecast) without any key.
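As a rough illustration of how raw records might be turned into these signals, the sketch below classifies TXT values and maps an MX host to a provider. The helper names and the suffix table are assumptions, and the verification-token check is a deliberately simple heuristic. Note that DMARC policies are published at `_dmarc.{domain}`, so those TXT records have to be fetched from that name.

```python
# Hypothetical suffix table; the hostnames are the providers' usual MX suffixes.
MX_PROVIDERS = {
    "google.com": "Google Workspace",
    "googlemail.com": "Google Workspace",
    "outlook.com": "Microsoft 365",
    "pphosted.com": "Proofpoint",
    "mimecast.com": "Mimecast",
}


def mx_provider(mx_host: str):
    """Map an MX hostname to a provider name, or None if the suffix is unknown."""
    host = mx_host.lower().rstrip(".")
    for suffix, provider in MX_PROVIDERS.items():
        if host == suffix or host.endswith("." + suffix):
            return provider
    return None


def classify_txt(records: list[str]) -> dict:
    """Sort TXT values into SPF, DMARC, and ownership-verification buckets."""
    out = {"spf": None, "dmarc": None, "verification": []}
    for record in records:
        value = record.strip().strip('"')
        lower = value.lower()
        if lower.startswith("v=spf1"):
            out["spf"] = value
        elif lower.startswith("v=dmarc1"):
            out["dmarc"] = value
        elif "verification" in lower:  # e.g. google-site-verification=...
            out["verification"].append(value)
    return out
```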

2 HTTP Headers + Security Files free

A HEAD request to https://{domain} captures Server, X-Powered-By, X-Frame-Options, Content-Security-Policy, and Strict-Transport-Security, along with the TLS certificate CN/SAN fields from the handshake. The fetcher then issues GET requests for /robots.txt and /.well-known/security.txt; the latter often names a security contact, a disclosure policy URL, and a PGP key.
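Once the security.txt body has been fetched, extracting the contact, policy, and key fields is a matter of parsing `Field: value` lines (the format defined in RFC 9116). The sketch below is a hypothetical helper, not the parser used in osint_tool.py.

```python
def parse_security_txt(body: str) -> dict[str, list[str]]:
    """Parse 'Field: value' lines into {lowercased_field: [values]}, skipping comments."""
    fields: dict[str, list[str]] = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, value = line.split(":", 1)  # split once so URLs keep their colons
        fields.setdefault(name.strip().lower(), []).append(value.strip())
    return fields
```

Fields may legitimately repeat (several `Contact` lines, for instance), which is why each value is collected into a list. A PGP-signed file will also surface armor headers such as `Hash`, which a stricter parser would filter out.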

3 RDAP (Registration Data) free

Replaces legacy WHOIS with the structured RDAP JSON API. Returns registrant org, registrar, creation/expiration dates, nameservers, and status flags (clientTransferProhibited, serverHold, etc.). No API key. Rate-limited by the registry, but a single lookup per domain is well within limits.
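To illustrate what the structured response looks like in practice, the sketch below pulls these fields out of an already-decoded RDAP domain object. The field names (`events`, `entities`, `vcardArray`, `nameservers`, `ldhName`, `status`) follow the RDAP JSON format in RFC 9083; `summarize_rdap` itself is a hypothetical helper.

```python
def _vcard_fn(entity: dict):
    """Return the 'fn' (formatted name) value from an entity's jCard, if present."""
    vcard = entity.get("vcardArray") or []
    props = vcard[1] if len(vcard) > 1 else []
    for prop in props:
        # jCard properties look like ["fn", {}, "text", "Example Registrar"].
        if len(prop) >= 4 and prop[0] == "fn":
            return prop[3]
    return None


def summarize_rdap(doc: dict) -> dict:
    """Reduce an RDAP domain response to registrar, dates, nameservers, and status."""
    events = {e.get("eventAction"): e.get("eventDate") for e in doc.get("events", [])}
    registrar = next(
        (_vcard_fn(e) for e in doc.get("entities", []) if "registrar" in e.get("roles", [])),
        None,
    )
    return {
        "registrar": registrar,
        "created": events.get("registration"),
        "expires": events.get("expiration"),
        "nameservers": sorted(
            ns["ldhName"].lower() for ns in doc.get("nameservers", []) if ns.get("ldhName")
        ),
        "status": list(doc.get("status", [])),
    }
```

RDAP responses spell status values with spaces (for example "client transfer prohibited"), so a formatter that wants the EPP-style camelCase names would need to map them.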

4 Certificate Transparency (crt.sh) free

Queries crt.sh/json?q=%.{domain} to retrieve the certificate issuance history across all CT logs. Returns issuer CA, subject CNs, and SANs for the 20 most recent certificates. The SAN list enumerates subdomains, and the issuer history shows how often certificates were reissued and whether the domain has switched CAs (e.g., from DigiCert to Let's Encrypt, which often signals a CDN migration).
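A small sketch of the post-processing step, assuming the crt.sh JSON rows have already been downloaded and decoded. It relies on the `name_value` field (newline-separated names) and the `issuer_name` field in each row; the helper names are hypothetical.

```python
def subdomains_from_crtsh(rows: list[dict], domain: str) -> list[str]:
    """Collect unique names under `domain` from the name_value field of each row."""
    domain = domain.lower()
    names = set()
    for row in rows:
        for name in str(row.get("name_value", "")).splitlines():
            name = name.strip().lower().removeprefix("*.")  # fold wildcards into the parent name
            if name == domain or name.endswith("." + domain):
                names.add(name)
    return sorted(names)


def distinct_issuers(rows: list[dict]) -> list[str]:
    """Return issuer names in first-seen order; more than one suggests a CA switch."""
    issuers = []
    for row in rows:
        issuer = row.get("issuer_name")
        if issuer and issuer not in issuers:
            issuers.append(issuer)
    return issuers
```

The suffix check with a leading dot keeps lookalike names such as `evil-example.com` out of the result when the query was for `example.com`.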

5 Wayback Machine (CDX API) free

Queries the Internet Archive's CDX API for the first snapshot timestamp, most recent snapshot, and total snapshot count. A domain with 0 snapshots is either very new or has used disallow in robots.txt. A domain with a first snapshot from 2001 and a registration date from 2024 has been re-registered — a common indicator of domain squatting or brand impersonation.
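The re-registration signal reduces to a date comparison between two layers. The sketch below assumes the first snapshot arrives as a 14-digit CDX timestamp (YYYYMMDDhhmmss) and the creation date as an ISO 8601 string from RDAP; `looks_reregistered` is a hypothetical helper.

```python
from datetime import datetime, timezone


def looks_reregistered(first_snapshot: str, rdap_created: str) -> bool:
    """True when the archive saw the domain before its current registration date."""
    snapshot = datetime.strptime(first_snapshot, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    created = datetime.fromisoformat(rdap_created.replace("Z", "+00:00"))
    if created.tzinfo is None:
        created = created.replace(tzinfo=timezone.utc)  # date-only values are treated as UTC
    return snapshot < created
```

A true result is only a prompt for closer inspection: legitimate ownership transfers produce the same pattern as squatting.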

6 IP Geolocation (ipinfo.io) free, token boosts quota

Resolves the domain's A record to an IP, then queries ipinfo.io/{ip}/json for country, region, city, ASN, and org string. The free tier handles 50k requests/month. Setting IPINFO_TOKEN in .env raises the limit. Useful for attributing hosting to a specific cloud provider or CDN edge PoP.
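The org string in an ipinfo.io response combines the ASN and the organization name (for example "AS64500 Example Hosting"), so a formatter usually splits it. A minimal sketch, assuming the JSON response has already been decoded; `summarize_ipinfo` is a hypothetical helper and the sample values in the usage note are made up.

```python
def summarize_ipinfo(doc: dict) -> dict:
    """Split the org string into ASN and name, and keep the location fields."""
    org = doc.get("org", "") or ""
    asn, _, name = org.partition(" ")
    if not (asn.startswith("AS") and asn[2:].isdigit()):
        # No leading ASN token: keep the whole string as the org name.
        asn, name = None, org
    return {
        "country": doc.get("country"),
        "region": doc.get("region"),
        "city": doc.get("city"),
        "asn": asn,
        "org": name or None,
    }
```

For a response like `{"country": "US", "org": "AS64500 Example Hosting"}`, the summary separates `AS64500` from `Example Hosting`, which makes it easier to match hosting against a known list of cloud or CDN ASNs.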

7 Shodan InternetDB free

internetdb.shodan.io/{ip} returns open ports, CPEs (Common Platform Enumerations), and known CVEs for any public IP — no API key required, no scanning initiated. This is a read-only lookup against Shodan's historical scan database. A server exposing port 3389 (RDP), 22 (SSH), and 6379 (Redis) with a known CVE in its Apache version is a very different risk profile than a CDN IP with only 443.
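To make the contrast between those two risk profiles concrete, the sketch below condenses an InternetDB response (keys `ports`, `cpes`, `vulns`) into a short summary. The `RISKY_PORTS` table and `summarize_internetdb` are illustrative assumptions, not part of osint_tool.py.

```python
# Hypothetical shortlist of ports that usually should not face the internet.
RISKY_PORTS = {22: "SSH", 3389: "RDP", 6379: "Redis"}


def summarize_internetdb(doc: dict) -> dict:
    """Condense an InternetDB response into ports, exposed services, and CVE count."""
    ports = sorted(doc.get("ports", []))
    return {
        "ports": ports,
        "exposed_services": {p: RISKY_PORTS[p] for p in ports if p in RISKY_PORTS},
        "vuln_count": len(doc.get("vulns", [])),
        # True only when every open port is a standard web port.
        "web_only": bool(ports) and set(ports) <= {80, 443},
    }
```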

8 urlscan.io Tech Stack URLSCAN_API_KEY

Submits the domain to the urlscan.io API for a headless browser scan. Returns detected technologies (CMS, analytics, CDN, ad networks), final redirected URL, screenshot URL, and security flags (phishing verdicts, suspicious content). The free tier allows 100 scans/day. This layer fires only if URLSCAN_API_KEY is set and the target is a domain (not a raw IP).

9 OTX AlienVault Threat Pulses OTX_API_KEY

Queries the OTX (Open Threat Exchange) API for threat intelligence associated with the domain or IP: pulse count, malware families, adversary names, and most recent indicator submission date. A domain with 12 active pulses linked to a ransomware family is context the synthesizer needs before writing its response. Free API key available at otx.alienvault.com.

Citation Format

Each data point in the enrichment block carries a structured citation in the same format used by the FRED and Semantic Scholar blocks:

[OSINT:DNS:github.com:2026-05-28]
[OSINT:RDAP:github.com:2026-05-28]
[OSINT:CRT_SH:github.com:2026-05-28]
[OSINT:SHODAN:140.82.121.4:2026-05-28]
[OSINT:OTX:lukasdgreen.com:2026-05-28]

The SOURCE component matches the fetcher name. The target is the exact domain or IP that was queried. This lets the evaluator trace any specific claim back to its source layer and query date, and lets the DPO curation pipeline filter or weight OSINT-grounded responses separately from web-search-grounded ones.
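A short sketch of how such tags could be produced and later recovered by an evaluator or curation step. The regex and the function names are assumptions based only on the examples above.

```python
import re
from datetime import date

# Matches tags shaped like [OSINT:SOURCE:target:YYYY-MM-DD].
_CITE_RE = re.compile(r"\[OSINT:([A-Z_]+):([^\]\s]+):(\d{4}-\d{2}-\d{2})\]")


def make_citation(source: str, target: str, when: date) -> str:
    """Build a citation tag for one layer and target."""
    return f"[OSINT:{source}:{target}:{when.isoformat()}]"


def parse_citations(text: str) -> list[tuple[str, str, str]]:
    """Return (source, target, date) tuples for every OSINT tag found in text."""
    return [m.groups() for m in _CITE_RE.finditer(text)]
```

Parsing the tags back out is what would let a downstream filter separate OSINT-grounded claims from web-search-grounded ones.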

Sample Output

Running the CLI on a domain produces a formatted markdown block that is appended verbatim to merged_text:

$ python -m harness.osint_tool lukasdgreen.com

## OSINT Enrichment — lukasdgreen.com

**DNS** [OSINT:DNS:lukasdgreen.com:2026-05-28]
- A: 185.199.108.153 (GitHub Pages CDN)
- MX: (none — no inbound email configured)
- NS: ns1.hover.com, ns2.hover.com
- TXT: v=spf1 -all

**HTTP Headers** [OSINT:HTTP:lukasdgreen.com:2026-05-28]
- Server: GitHub.com
- X-Fastly-Request-ID: present
- Strict-Transport-Security: max-age=31557600
- robots.txt: disallow /

**RDAP** [OSINT:RDAP:lukasdgreen.com:2026-05-28]
- Registrar: Hover / Tucows
- Created: 2023-06-14
- Expires: 2027-06-14
- Status: clientTransferProhibited

**Certificate Transparency** [OSINT:CRT_SH:lukasdgreen.com:2026-05-28]
- 4 certificates issued
- Most recent: Let's Encrypt (2026-04-17)
- SANs: lukasdgreen.com, www.lukasdgreen.com

**Wayback Machine** [OSINT:WAYBACK:lukasdgreen.com:2026-05-28]
- First snapshot: 2023-09-02
- Total snapshots: 14
- Last archived: 2026-05-10

**IP Geolocation** [OSINT:IPINFO:185.199.108.153:2026-05-28]
- 185.199.108.153 → US / San Francisco / AS36459 (GitHub)

Zero-config baseline: the above output requires no API keys. Seven of nine layers (1 through 7) run on any public domain with only Python's standard library.

Integration Points in `agent.py`

The OSINT pipeline hooks into gather_research() at two points. Both are guarded by os.environ.get("HARNESS_OSINT_DISABLE") != "1", the same escape-hatch pattern used by the FRED integration.

Point 1: Dork Pre-flight (before the search loop)

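# Excerpt from gather_research(); assumes os and sys are already imported in agent.py.
# Only generate dorks when the caller supplied no search plan.
if not planned_queries and not force_deep and \
        os.environ.get("HARNESS_OSINT_DISABLE") != "1":
    try:
        from harness.osint_tool import _osint_themes, generate_dork_queries
        if _osint_themes(task):  # intent gate: skip the LLM call for non-OSINT tasks
            _dorks = generate_dork_queries(task, model=producer_model, n=2)
            if _dorks:
                planned_queries = _dorks
                print(f"  [osint] dork queries: {_dorks}")
    except Exception as _osint_pre_err:
        # Dork generation is best-effort; fall through to the default search plan.
        print(f"  [osint] dork generation skipped: {_osint_pre_err}",
              file=sys.stderr)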

This fires only if planned_queries is empty (i.e., no explicit search plan was passed in) and the task passes the intent gate. The effect is that OSINT-relevant tasks automatically receive operator-enriched search queries rather than raw keyword queries.

Point 2: Enrichment Block (after FRED, before synthesis)

# Excerpt from gather_research(); assumes os, sys, and trace are already in scope.
if os.environ.get("HARNESS_OSINT_DISABLE") != "1":
    try:
        from harness.osint_tool import (
            query_osint as _query_osint,
            _detect_targets as _osint_targets,
        )
        _osint_tgts = _osint_targets(task)
        # Structural gate: fetch only when a real domain or IP was extracted.
        if _osint_tgts["domains"] or _osint_tgts["ips"]:
            print("  [osint] targets detected, fetching enrichment data...")
            _osint_block, _ = _query_osint(task)
            if _osint_block:
                # Append at the end of the context (position-end pattern).
                merged_text = merged_text + "\n\n" + _osint_block
                print(f"  [osint] appended {len(_osint_block)} chars of OSINT data")
                trace.log_tool_call("osint", task, len(_osint_block),
                                    result_preview=_osint_block[:600], error=None)
    except Exception as _osint_err:
        # Any failure degrades to a warning; the research run continues without OSINT data.
        print(f"  [osint] skipped: {_osint_err}", file=sys.stderr)

Unlike the FRED block (which fires on economic keyword detection), the enrichment block fires on structural target detection — only when a regex confirms an actual domain or IP appears in the task. The block is appended at the end of merged_text, consistent with the position-end advantage documented in the RAG experiments.

Cross-Skill Integrations

OSINT enrichment is most useful when combined with skills that already have a URL or domain as an input. Six natural integration points exist:

email skill

Domain portion of an address → RDAP gives registrant org, domain age, and abuse contact. Useful for persona research, email validation, or header analysis.

design / site skill

URL → HTTP headers expose framework (X-Powered-By), CDN (Cloudflare/Fastly), and security posture before Playwright renders the page. Sets visual analysis in context.

crawl / sitemap skill

Registrant org from RDAP → related domains with the same registrant. A site map becomes a registrant map, revealing sibling properties.

beige-book / econ skill

Server IP → ipinfo.io geo → Federal Reserve district. Grounds economic commentary in the correct regional Fed district without manual lookup.

lit-review skill

RDAP registrant org → legal entity name. Lets literature reviews on institutional papers cite the correct parent organization, not just the journal domain.

cite skill

OSINT facts carry [OSINT:SOURCE:target:date] citations natively, so the cite skill can include them in bibliographies as first-class sources without extra processing.

Configuration Reference

Env var	Layer	Required?	Notes
`IPINFO_TOKEN`	6 — IP Geolocation	No	Raises quota from 50k to 150k/month
`URLSCAN_API_KEY`	8 — urlscan.io	Yes (for layer 8)	Free tier: 100 scans/day at urlscan.io
`OTX_API_KEY`	9 — OTX AlienVault	Yes (for layer 9)	Free at otx.alienvault.com
`HARNESS_OSINT_DISABLE=1`	all	N/A	Bypasses both injection points entirely

All keys are read via the same _key(env_var) helper used by fred_tool.py: it loads directly from .env at the repo root, falling back to os.environ.get(). The pattern is consistent across all tool integrations.
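The lookup order described here (the .env file first, then the process environment) can be sketched as below. This is an assumption-laden approximation of `_key`, not its actual source: the name `load_key`, the `env_path` parameter, and the quote handling are illustrative.

```python
import os
from pathlib import Path


def load_key(env_var: str, env_path=Path(".env")):
    """Return the value of env_var from a .env file, else from os.environ, else None."""
    path = Path(env_path)
    if path.is_file():
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            name, value = line.split("=", 1)
            if name.strip() == env_var:
                value = value.strip().strip("'\"")
                if value:
                    return value
    return os.environ.get(env_var) or None
```

Returning `None` for a missing or empty key gives the key-gated layers a simple truthiness check for deciding whether to run.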

Performance

With all nine layers enabled, a single domain lookup completes in approximately 10–16 seconds wall-clock time. The bottleneck is urlscan.io, which initiates a headless browser scan (typically 8–12s). The five zero-config network layers (RDAP, crt.sh, Wayback, ipinfo.io, Shodan InternetDB) each resolve in 1–3s and are fully overlapped by the ThreadPoolExecutor. Without the urlscan layer, total latency drops to 3–5s.

The fetch runs in parallel with the tail end of the web search loop, so in practice the OSINT block adds less than 5 seconds of perceived latency to a full research run (which takes 30–90s for the search and compression phases alone).

Graceful Degradation

Every fetcher is wrapped in a try/except Exception. The return type is always dict, and the formatter skips any layer whose dict is empty or whose key fields are missing. This means:

A timeout on crt.sh does not prevent DNS results from appearing
A missing API key for OTX silently skips layer 9
A domain with no Wayback history returns an empty Wayback section, not an error
A completely inaccessible target still produces the enrichment block header with a note, not a traceback

The outer injection block in agent.py also wraps the entire query_osint() call in try/except, so an unexpected failure in osint_tool.py degrades gracefully to a console warning and continues the research run without OSINT data rather than crashing.
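The formatter side of this contract can be sketched as follows, assuming each layer has already been reduced to a flat dict of display fields. `format_block`, the header punctuation, and the wording of the fallback note are illustrative; the real formatter's output may differ.

```python
def format_block(target: str, layers: dict, today: str) -> str:
    """Render non-empty layers as markdown sections; empty layers are skipped."""
    lines = [f"## OSINT Enrichment: {target}"]
    for name, data in layers.items():
        # Drop fields with no usable value, then skip the layer if nothing is left.
        rows = [f"- {key}: {value}" for key, value in data.items()
                if value not in (None, "", [], {})]
        if not rows:
            continue
        lines.append("")
        lines.append(f"**{name}** [OSINT:{name.upper()}:{target}:{today}]")
        lines.extend(rows)
    if len(lines) == 1:
        # Nothing came back at all: keep the header and add a note instead of failing.
        lines.extend(["", "_No layers returned data for this target._"])
    return "\n".join(lines)
```

Keeping the header even when every layer is empty means the synthesizer can tell that enrichment was attempted and found nothing, rather than never run.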

CLI Usage

The module can be run directly for development and debugging:

# Single domain enrichment
python -m harness.osint_tool example.com

# Dork query generation only
python -m harness.osint_tool --dorks "investigate infrastructure behind lukasdgreen.com"

# IP enrichment
python -m harness.osint_tool 185.199.108.153

# With API keys in environment
URLSCAN_API_KEY=... OTX_API_KEY=... python -m harness.osint_tool target.com

Scope boundary: this is a passive-only pipeline. No active scanning (port probes, fuzzing, directory brute-force) is performed. All data is retrieved from public APIs and pre-indexed sources. The distinction matters for legal and operational reasons: passive reconnaissance queries public databases; active scanning touches the target's infrastructure.

What's Next

Several natural extensions haven't been implemented yet:

Registrant graph walk — RDAP registrant org → query for other domains with the same registrant → expand the target set one hop out
Email permutation + haveibeenpwned — for tasks involving a person's name and domain, generate likely email formats and check breach exposure
Passive DNS history — OTX and SecurityTrails both expose historical IP-to-domain mappings; useful for tracking infrastructure migration
Abuse.ch feeds — MalwareBazaar and Feodo Tracker provide free, real-time IP and hash blocklists for threat correlation without an API key
Eval experiment — the RAG experiments measured context enrichment lift against a scoring rubric. OSINT enrichment deserves the same treatment: an OSINT-enabled vs OSINT-disabled condition on a benchmark of domain investigation tasks, with specificity and accuracy dimensions scored by the evaluator.
