SBOM and AIBOM for Agentic Systems
pip freeze knows about fastapi and ollama. It does not know about kimi-k2.5:cloud. An AI Bill of Materials fills that gap — enumerating the model artifacts, custom Modelfiles, and cloud endpoints that constitute the other half of an agentic system's supply chain.
The NTIA minimum elements for a Software Bill of Materials — supplier name, component name, version, unique identifier, dependency relationship, author, timestamp — were designed for a world where software components are versioned packages with cryptographic hashes. AI models are not that. A GGUF file has a hash; a cloud endpoint has a name and a terms-of-service page. A custom Modelfile has a base model, a quantization, and a system prompt baked in. None of these appear in a pyproject.toml or a pip freeze output.
This post documents what a complete supply chain picture looks like for the harness — both the software layer (SBOM, covering Python packages and system binaries) and the AI layer (AIBOM, covering model artifacts). The actual AIBOM.md document is committed to the repo. This post explains the design decisions behind it and the gaps that remain open.
I1 — The Three Supply Chain Layers
An agentic harness has three supply chain layers (software, model, and cloud), and a traditional SBOM covers only one of them:
Traditional SBOM covers only the software layer. The model and cloud layers require an AIBOM. Runtime binaries (Ollama, llama.cpp, ffmpeg) span both documents.
| Layer | Examples | Tracking mechanism | Covered by SBOM? |
|---|---|---|---|
| Software | fastapi, ollama, chromadb, jinja2 | pip/uv lock, PyPI hash | Yes — fully |
| Runtime binaries | Ollama, llama.cpp, whisper.cpp, ffmpeg, Tesseract | Git submodule SHA, system package version | Partial — submodules yes; system deps no |
| Local models | pi-qwen-32b, Qwen3-Coder:30b, atla/selene-mini | Ollama manifest ID (SHA256 prefix) | No |
| Custom Modelfiles | pi-qwen-32b (system prompt overlay), nanda-annotator-v2-q4km | Ollama ID + system prompt hash | No |
| Cloud endpoints ⚠ untracked | kimi-k2.5:cloud (Moonshot AI), glm-5.1:cloud (Zhipu AI) | Name only — no content hash, no pin | No |
The cloud endpoint row is the one that keeps appearing in security conversations and disappearing from tooling. kimi-k2.5:cloud is a first-class runtime dependency of autoresearch.py — the loop calls it on every stuck episode. It is not in pyproject.toml; it is not in uv.lock; it does not appear in pip freeze. The only artifact that surfaces it is the AIBOM.
I2 — The Software SBOM Layer
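The harness's Python dependencies are declared in pyproject.toml and resolved to pinned hashes in uv.lock. Once the CycloneDX generator is installed, producing a machine-readable SBOM is one command. cyclonedx-py environment inspects the installed environment rather than uv.lock itself, so run it inside the environment synced from the lock file:

```shell
pip install cyclonedx-bom
cyclonedx-py environment --output-format json > sbom-python.cdx.json
```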
The CycloneDX 1.5 JSON output satisfies the NTIA minimum elements for the Python layer. Each component entry carries a PURL (Package URL), the declared version, and the PyPI hash. The runtime binaries that are not Python packages require separate entries:
| Component | Version pin | NTIA identifier | License |
|---|---|---|---|
| Ollama | ≥ 0.4 (runtime); not pinned | GitHub release tag | MIT |
| llama.cpp | Git submodule SHA (pinned) | git SHA | MIT |
| whisper.cpp | Git submodule SHA (pinned) | git SHA | MIT |
| ffmpeg | System (≥ 6.0); not pinned | OS package name + version | LGPL 2.1 |
| Tesseract OCR | System (≥ 5.0); not pinned | OS package name + version | Apache 2.0 |
| Chromium (Playwright) | Playwright-bundled (pinned transitively) | Playwright version + browser hash | BSD / project-specific |
The gap in the software SBOM is system dependencies: ffmpeg and Tesseract are installed outside Python's package manager. Their versions are not in any lock file. In a containerized deployment, the Dockerfile pins them; in a bare-metal deployment like this one, they are environment assumptions rather than declared dependencies.
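One way to turn those environment assumptions into recorded facts is to probe the binaries at run time and store the result next to the SBOM. The following is a minimal sketch, not part of the harness: `BINARY_PROBES`, `parse_version`, and `probe_binaries` are hypothetical names, and the version parsing is a heuristic that assumes each tool prints a dotted version number in its banner.

```python
import re
import shutil
import subprocess
from typing import Dict, Optional

# Hypothetical probe table: binary name -> flag that prints its version banner.
BINARY_PROBES = {"ffmpeg": "-version", "tesseract": "--version"}

# Heuristic: the first dotted number in the banner is taken as the version.
VERSION_RE = re.compile(r"\d+(?:\.\d+)+")


def parse_version(banner: str) -> Optional[str]:
    """Return the first dotted version number in a banner, or None."""
    match = VERSION_RE.search(banner)
    return match.group(0) if match else None


def probe_binaries(probes: Dict[str, str] = BINARY_PROBES) -> Dict[str, Optional[str]]:
    """Record the installed version of each system binary, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name, flag in probes.items():
        if shutil.which(name) is None:
            found[name] = None  # a missing binary is itself worth recording
            continue
        result = subprocess.run(
            [name, flag], capture_output=True, text=True, check=False
        )
        # Some tools print the banner on stderr, so search both streams.
        found[name] = parse_version(result.stdout + result.stderr)
    return found


if __name__ == "__main__":
    for name, version in probe_binaries().items():
        print(f"{name}: {version or 'not found'}")
```

Development builds that report a git revision instead of a release number will not parse correctly with this heuristic; for those, recording the raw banner is the safer choice.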
I3 — The AIBOM: Local Models
The harness currently runs eight distinct model roles across six Ollama tags. Three are used as pulled from the registry; three are custom Modelfiles. The distinction matters for the AIBOM because a custom Modelfile is not just a base model with a tag: it bakes in a system prompt that shapes every inference call, which makes the system prompt part of the effective model identity.
| Model | Role | Arch / params | Quant | Custom Modelfile |
|---|---|---|---|---|
| pi-qwen-32b | Primary producer / agent | Qwen2, 32.8B | Q4_K_M | Yes: task-completing agent persona |
| pi-qwen3.6 | Alternate producer | Qwen3 MoE, 36.0B / 3.6B active | Q4_K_M | Yes: system prompt + sampling overrides |
| atla/selene-mini | Wiggum evaluator (judge) | Llama, 8.0B | Q4_K_M | No: evaluation-specialist fine-tune |
| Qwen3-Coder:30b | Autoresearch proposer | Qwen3 MoE, 30.5B | Q4_K_M | No |
| nanda-annotator-v2-q4km | Lit-review annotator | Qwen2, 7.6B | Q4_K_M | Yes: annotation persona, structured output |
| qwen3:8b | Eval suite, general tasks | Qwen3, 8.0B | Standard | No |
The NTIA minimum element set maps cleanly to Ollama model metadata. The unique identifier is the Ollama manifest ID — a SHA256 prefix of the manifest blob that Ollama uses internally. It is not a content hash of the weights themselves (which are stored as separate blobs), but it is a stable identifier that changes when the model is updated via ollama pull. For the purposes of an AIBOM, it is the most actionable identifier available without external registry tooling.
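Collecting those identifiers can be scripted. The sketch below parses the text printed by `ollama list` into a tag-to-ID mapping. It assumes the output is a whitespace-separated table with a header row and the columns NAME and ID first; that layout is an observation about current Ollama releases, not a documented contract, and the function names are hypothetical.

```python
import subprocess
from typing import Dict


def parse_ollama_list(text: str) -> Dict[str, str]:
    """Map each model tag to its manifest ID from `ollama list` output."""
    ids: Dict[str, str] = {}
    for line in text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2:
            # Assumed column order: NAME first, ID second.
            ids[fields[0]] = fields[1]
    return ids


def current_manifest_ids() -> Dict[str, str]:
    """Run `ollama list` and return the tag -> manifest ID mapping."""
    result = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True, check=True
    )
    return parse_ollama_list(result.stdout)


if __name__ == "__main__":
    for tag, manifest_id in sorted(current_manifest_ids().items()):
        print(f"{tag}\t{manifest_id}")
```

Keeping the parser separate from the subprocess call means the parsing logic can be tested against saved output without a running Ollama daemon.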
The Modelfile-as-Artifact Problem
A custom Modelfile is a derived artifact: it takes a base model and overlays a system prompt, sampling parameters, and stop tokens. From the AIBOM perspective, the base model and the overlay are both supply chain inputs. The effective behavior of pi-qwen-32b depends on both:
```python
import hashlib

# Effective model identity for pi-qwen-32b has two components:
SYSTEM_PROMPT = "..."  # placeholder for the task-completing agent persona text
base_model_id = "edee0c094406"  # Ollama manifest ID (qwen2.5-32b-instruct-q4_K_M)
system_prompt_hash = hashlib.sha256(SYSTEM_PROMPT.encode("utf-8")).hexdigest()
# A change to either component changes the model's effective behavior.
# Only the base_model_id appears in `ollama list`.
```
Standard SBOM tooling captures neither. The AIBOM tracks both. The current AIBOM.md records the Ollama ID and describes the system prompt's role; a hardened version would include the system prompt SHA256 and re-verify it on each Modelfile rebuild.
If a system prompt is part of the model's effective identity, then changing the system prompt without updating the AIBOM creates a silent divergence between the documented model and the running one. This is the same class of problem as updating a dependency without updating the lock file — the declared state and the runtime state drift apart.
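That drift can be checked mechanically once both components are recorded. A minimal sketch, assuming the AIBOM stores the base model ID and the SHA256 of the system prompt text; `ModelIdentity`, `identity_of`, and `drifted_fields` are hypothetical names, not part of the harness:

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelIdentity:
    """The two inputs that define a custom Modelfile's effective identity."""
    base_model_id: str
    system_prompt_sha256: str


def identity_of(base_model_id: str, system_prompt: str) -> ModelIdentity:
    """Build an identity from a manifest ID and the raw system prompt text."""
    digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
    return ModelIdentity(base_model_id, digest)


def drifted_fields(recorded: ModelIdentity, running: ModelIdentity) -> List[str]:
    """List which identity components differ between AIBOM and runtime."""
    changed = []
    if recorded.base_model_id != running.base_model_id:
        changed.append("base_model_id")
    if recorded.system_prompt_sha256 != running.system_prompt_sha256:
        changed.append("system_prompt_sha256")
    return changed
```

An empty list means the documented model and the running one agree. Note that the hash is sensitive to whitespace, so the prompt must be hashed exactly as it appears in the Modelfile.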
I4 — The Cloud Endpoint Gap
Two models in the Ollama registry are cloud endpoints with no local weights:
| Tag | Provider | Role | Local size | Pinnable? |
|---|---|---|---|---|
| kimi-k2.5:cloud | Moonshot AI | Autoresearch Kimi unblock | n/a | No |
| glm-5.1:cloud | Zhipu AI | Registered; not currently wired | n/a | No |
When autoresearch.py calls ollama.chat(model=KIMI_MODEL, ...), the Ollama daemon forwards the request to Moonshot AI's API. The model that responds may differ from one call to the next — providers update cloud models without version-bump guarantees, and the Ollama cloud-model manifest carries only a routing entry, not a content hash. There is no mechanism to verify that the model responding today is the same as the one that responded yesterday.
Supply chain risk: A cloud model that is updated by its provider between two autoresearch runs can produce different instruction proposals for the same prompt. If those proposals produce different eval scores, the autoresearch loop may accept or reject based on model behavior that has changed underneath it — not based on the instruction change being tested. This is the same threat model as a mutable dependency: the environment changes while the code stays constant.
The AIBOM is the only artifact that surfaces this risk. A standard SBOM audit of the harness repo would find nothing wrong: kimi-k2.5:cloud does not appear in pyproject.toml, uv.lock, or any Python import. It appears only in autoresearch.py as a string default:
```python
KIMI_MODEL = os.environ.get("KIMI_MODEL", "kimi-k2.5:cloud")
```
This is the canonical gap between an SBOM and an AIBOM: one audits what the package manager knows; the other audits what the running system actually calls.
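Because the reference lives in source code rather than in a manifest, one rough way to surface it is to scan the source for string literals shaped like Ollama tags. The sketch below is a heuristic under stated assumptions: it only finds quoted `name:tag` strings, so it misses tags written without a colon (such as pi-qwen-32b) and will also flag unrelated strings like "localhost:11434". `MODEL_TAG_RE` and `find_model_tags` are hypothetical names.

```python
import re
from typing import Set

# Heuristic: a quoted string shaped like "name:tag", e.g. "kimi-k2.5:cloud".
MODEL_TAG_RE = re.compile(r"""["']([A-Za-z0-9][\w./-]*:[\w.-]+)["']""")


def find_model_tags(source: str) -> Set[str]:
    """Return candidate model tags found in string literals of source text."""
    return set(MODEL_TAG_RE.findall(source))


if __name__ == "__main__":
    import sys
    from pathlib import Path

    # Usage: python find_tags.py autoresearch.py other_module.py
    for path in sys.argv[1:]:
        text = Path(path).read_text(encoding="utf-8")
        for tag in sorted(find_model_tags(text)):
            print(f"{path}: {tag}")
```

The output is a list of candidates for human review, not an authoritative inventory; every hit still has to be checked against the AIBOM by hand.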
What Mitigation Looks Like
Full mitigation of the cloud model risk would require provider-side versioning — a kimi-k2.5:cloud@2026-05-25 endpoint that Moonshot AI commits not to update. Providers do not generally offer this. The practical mitigations available on the client side are:
- Log the model name and timestamp of every cloud call in the run log (runs.jsonl). If behavior changes, the log identifies which run was affected.
- Treat cloud model calls as externally sourced content, not as trusted inference. The Kimi unblock suggestion is injected into the proposer prompt as guidance, not as a direct commit decision; the local proposer still generates the actual candidate.
- Cap cloud model invocations to advisory roles. The harness never commits an instruction directly from a cloud model; Kimi's output is one input among several to the local proposer. This limits blast radius if the cloud model is compromised or updated adversarially.
- Include cloud endpoints in the AIBOM and flag them explicitly as unverifiable external dependencies, much as a software SBOM keeps dependencies with known vulnerabilities listed and marked rather than removing them.
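The first mitigation, logging every cloud call, is simple to sketch. The record fields below (`ts`, `model`, `role`, `trust`) are assumptions for illustration; the actual schema of runs.jsonl is not shown in this post, and `log_cloud_call` is a hypothetical helper.

```python
import json
import time
from pathlib import Path


def log_cloud_call(log_path: Path, model: str, role: str) -> dict:
    """Append one JSON line describing a cloud model call and return it."""
    record = {
        # UTC timestamp, so runs on different machines stay comparable.
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "model": model,
        "role": role,
        # Marks the response as externally sourced, not trusted inference.
        "trust": "external",
    }
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


if __name__ == "__main__":
    log_cloud_call(Path("runs.jsonl"), "kimi-k2.5:cloud", "autoresearch-unblock")
```

The log cannot prove which model version answered, since the provider exposes no content hash. It only narrows down when a behavior change could have entered the loop.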
I5 — AIBOM Format and Tooling State
There is no settled standard for AIBOMs yet. The active tracks as of mid-2026:
| Initiative | Status | Relevance to agentic systems |
|---|---|---|
| CISA / NTIA AI SBOM working group | Framing documents published; minimum elements draft pending | Likely to extend NTIA SBOM minimum elements to cover model artifacts |
| CycloneDX 1.5 ML-BOM | Released; type: machine-learning-model + modelCard fields | Best available machine-readable format; covers architecture, quantization, intended use, training data provenance |
| OWASP CycloneDX extensions | Active; model card schema, pedigree (base model chain) | Covers the Modelfile-as-derived-artifact problem via pedigree.ancestors |
| Hugging Face model cards | Widespread; not a BOM format | Useful as provenance source for AIBOM entries; not machine-readable for audit tooling |
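To make the CycloneDX row concrete, here is a sketch of what a hand-built ML-BOM entry for a custom Modelfile might look like. The component type, `pedigree.ancestors`, and `properties` fields come from the CycloneDX 1.5 schema; the `ollama:` bom-ref scheme and the `aibom:` property names are invented for this example, and using the manifest ID as the version is an assumption rather than a convention.

```python
import json
from typing import List, Optional


def mlbom_component(name: str, manifest_id: str, base_name: str,
                    prompt_sha256: Optional[str] = None) -> dict:
    """Build one CycloneDX component entry for an Ollama-served model."""
    component = {
        "type": "machine-learning-model",
        "bom-ref": f"ollama:{name}@{manifest_id}",  # hypothetical ref scheme
        "name": name,
        # The manifest ID stands in for a version; it is not a weights hash.
        "version": manifest_id,
        "pedigree": {
            "ancestors": [{"type": "machine-learning-model", "name": base_name}]
        },
        "properties": [
            {"name": "aibom:ollama-manifest-id", "value": manifest_id}
        ],
    }
    if prompt_sha256 is not None:
        component["properties"].append(
            {"name": "aibom:system-prompt-sha256", "value": prompt_sha256}
        )
    return component


def mlbom_document(components: List[dict]) -> str:
    """Wrap components in a minimal CycloneDX 1.5 BOM and serialize it."""
    bom = {
        "bomFormat": "CycloneDX",
        "specVersion": "1.5",
        "version": 1,
        "components": components,
    }
    return json.dumps(bom, indent=2)


if __name__ == "__main__":
    entry = mlbom_component(
        "pi-qwen-32b", "edee0c094406", "qwen2.5-32b-instruct-q4_K_M"
    )
    print(mlbom_document([entry]))
```

This output has not been validated against the official schema here; running it through a CycloneDX validator would be the next step before relying on it in audit tooling.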
The harness AIBOM is a Markdown document rather than a CycloneDX JSON, for two reasons. First, CycloneDX 1.5 ML-BOM tooling for Ollama-managed models does not exist — there is no cyclonedx-py equivalent that introspects ollama list and emits a compliant JSON. Second, the most important entries in the harness AIBOM are the cloud endpoints and custom Modelfiles, both of which require human-in-the-loop curation that automated tooling cannot provide. A Markdown document is maintainable manually; a JSON document generated by a tool that doesn't understand Modelfiles is not.
The right long-term state is a hybrid: generate the Python layer automatically from uv.lock via CycloneDX, and maintain the model layer manually in Markdown with a validation script that cross-checks Ollama manifest IDs against the document's recorded values. Neither half is complete without the other.
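The cross-check half of that hybrid is small. A minimal sketch, assuming the recorded IDs have already been extracted from the Markdown document into a dict and the current IDs have been collected from Ollama; how AIBOM.md is parsed is left out because its exact layout is not shown here, and `find_stale_entries` is a hypothetical name.

```python
from typing import Dict


def find_stale_entries(recorded: Dict[str, str],
                       current: Dict[str, str]) -> Dict[str, str]:
    """Return one finding per tag where the AIBOM and the runtime disagree."""
    findings: Dict[str, str] = {}
    for tag, recorded_id in recorded.items():
        current_id = current.get(tag)
        if current_id is None:
            findings[tag] = "recorded in AIBOM but not installed"
        elif current_id != recorded_id:
            findings[tag] = f"manifest ID changed: {recorded_id} -> {current_id}"
    # Tags that are installed but undocumented are findings too.
    for tag in current.keys() - recorded.keys():
        findings[tag] = "installed but missing from AIBOM"
    return findings


if __name__ == "__main__":
    # Illustrative values only; the second ID is made up.
    recorded = {"pi-qwen-32b": "edee0c094406"}
    current = {"pi-qwen-32b": "edee0c094406", "qwen3:8b": "0123456789ab"}
    for tag, finding in sorted(find_stale_entries(recorded, current).items()):
        print(f"{tag}: {finding}")
```

Run in CI or as a pre-commit hook, a non-empty result would fail the check and force the AIBOM to be regenerated before the change lands.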
The regeneration test: An AIBOM is only useful if it stays current. The AIBOM.md in the repo includes a "How to Regenerate" section that lists the three steps: ollama list for current IDs, ollama show <tag> for architecture verification, and SHA256 of the system prompt text for custom Modelfiles. If a model is updated via ollama pull and the AIBOM isn't regenerated, the manifest ID will be stale. That staleness is detectable — which is the point.
What the Literature Leaves Open
- CycloneDX 1.5 defines a pedigree.ancestors field for ML models, intended to capture the base model chain (e.g., Qwen2.5-32B-Instruct → pi-qwen-32b Modelfile). For a model served through Ollama without a Hugging Face model card, the ancestor chain relies on the Modelfile's FROM directive, which may reference another Ollama tag rather than a canonical Hugging Face repo ID. How should AIBOM tooling handle indirect provenance chains where each link is an Ollama tag rather than a versioned artifact?
- Cloud endpoint models (kimi-k2.5:cloud, glm-5.1:cloud) are referenced by name but have no content hash available to the client. ISO/IEC 5962:2021 (SPDX) and CycloneDX both require a unique identifier for each component. A cloud model name is not a unique identifier in the cryptographic sense. What is the right AIBOM entry for a component that can only be identified, not verified?
- Custom Modelfiles bake a system prompt into the model definition. From a supply chain perspective, is the system prompt a configuration artifact (like a config file, tracked separately from the model) or a model artifact (part of the model's effective identity, requiring its own AIBOM entry)? The distinction affects how prompt changes are tracked in audit logs and whether they trigger a new AIBOM version.
- The harness's autoresearch loop modifies SYNTH_INSTRUCTION, a runtime prompt that shapes agent behavior, through an automated optimization process. Each kept experiment is a new effective model configuration. Should the AIBOM track every accepted autoresearch candidate as a new model version? If so, the AIBOM would have 100+ entries for a single training run. If not, only the final committed state is tracked, and intermediate optimization history is invisible to supply chain auditors.
- The harness uses atla/selene-mini as the Wiggum evaluator. The evaluator's outputs (per-dimension scores, issue flags) feed directly into the accept/reject decision for new instruction candidates. If ATLA updates the model, the scoring distribution shifts and autoresearch's optimization target silently changes. This is a supply chain dependency not on a software component but on an evaluative function. How should AIBOMs represent dependencies on evaluator models whose behavioral drift has a direct feedback effect on the system being evaluated?