May 27, 2026 • 5 min read • Agentic Harness Engineering

Deploying the Harness with Docker: CPU, GPU, and Compose Variants

Three Docker artifacts cover the full deployment range — a CPU slim image for development, a CUDA GPU image for production inference, and a three-service Compose stack that brings up vLLM, Ollama, and the harness dashboard together.

Running the harness locally with a Conda environment and a native Ollama install is the fastest path for development. For reproducible deployments (on a rented GPU instance, a shared inference server, or a CI environment), Docker provides the isolation and repeatability that conda activate can't. The harness ships with two Dockerfiles and a Compose configuration that cover both CPU-only and GPU hosts.

Two Dockerfiles

| File | Base image | PyTorch | Use case |
| --- | --- | --- | --- |
| `Dockerfile` | `python:3.11-slim` | CPU build from `download.pytorch.org/whl/cpu` | Development, CI, no GPU required |
| `Dockerfile.gpu` | `pytorch/pytorch:2.7.1-cuda11.8-cudnn9-runtime` | Pre-installed in base image (CUDA 11.8) | Production inference, sentence-transformers on GPU |

Both Dockerfiles follow the same layer order: system packages (git, build-essential, curl, ffmpeg), Python dependencies from requirements.txt, Playwright browser installation, then the project source. The CPU variant installs PyTorch in its own layer before requirements.txt — PyTorch is large and changes rarely, so it benefits from Docker layer caching during iterative development. The GPU variant skips this step because PyTorch is already present in the base image.

Both expose port 8765 and start the server with:

CMD ["python", "server.py", "--host", "0.0.0.0", "--port", "8765"]

The ffmpeg system package is included to support youtube_transcribe.py, which needs ffmpeg available as a system binary for its Whisper fallback path.
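
Because the dependency is a system binary rather than a Python package, a missing ffmpeg would only surface when the Whisper fallback actually runs. A minimal sketch of a startup check follows; `binary_available` is a hypothetical helper for illustration, not something the harness is known to contain.

```python
import shutil


def binary_available(name: str) -> bool:
    """Return True if an executable called `name` can be found on PATH."""
    return shutil.which(name) is not None


# Hypothetical usage at startup: warn early instead of failing mid-transcription.
if not binary_available("ffmpeg"):
    print("warning: ffmpeg not found on PATH; the Whisper fallback may fail")
```

Inside either image this check should pass, since both Dockerfiles install ffmpeg as a system package.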

The three-service Compose stack

vllm

Image: vllm/vllm-openai:latest
runtime: nvidia, GPU passthrough
Model from VLLM_MODEL env var (default: Qwen/Qwen2.5-14B-Instruct)
bfloat16, 16k max context
HuggingFace cache mounted from ./model-cache
Health: /health endpoint, 20 retries, 120s start

ollama

Image: ollama/ollama:latest
Optional GPU passthrough
Manages its own model downloads
Handles smaller/fast models (planner, evaluator, glm4:9b)
Data mounted from ./ollama-data
Health: root endpoint, 10 retries, 30s start

harness

Built from Dockerfile (or Dockerfile.gpu)
Live code mount: .:/app — no rebuild during dev
Persists ./runs and ./memory-db as volumes
Env vars: VLLM_BASE_URL and OLLAMA_HOST set for service discovery
Depends on vllm and ollama (both healthy) before starting
restart: on-failure
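
On the harness side, service discovery through VLLM_BASE_URL and OLLAMA_HOST could be read roughly as sketched below. The `Endpoints` class, the `load_endpoints` helper, and both default URLs are assumptions for illustration: the hostnames reuse the Compose service names and the ports listed in the quick start, and the `/v1` suffix follows vLLM's OpenAI-compatible server convention.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Endpoints:
    vllm_base_url: str
    ollama_host: str


def load_endpoints(env: Optional[Mapping[str, str]] = None) -> Endpoints:
    """Read service URLs from the environment, falling back to assumed defaults."""
    env = os.environ if env is None else env
    return Endpoints(
        # Assumed default: Compose service name "vllm" on port 8000.
        vllm_base_url=env.get("VLLM_BASE_URL", "http://vllm:8000/v1"),
        # Assumed default: Compose service name "ollama" on port 11434.
        ollama_host=env.get("OLLAMA_HOST", "http://ollama:11434"),
    )


print(load_endpoints({"OLLAMA_HOST": "http://localhost:11434"}))
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test outside a container.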

Service dependency and startup

The Compose file uses condition: service_healthy for both vLLM and Ollama before the harness starts. This matters because vLLM can take 1–2 minutes to load a 14B model into GPU memory before it's ready to serve requests. Without the health check, the harness would start, attempt an LLM call during initialization, get a connection refused, and fail.
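
To make concrete what the health-check gate saves the harness from doing itself, here is a minimal sketch of the equivalent retry loop in application code. `http_ok` and `wait_until_healthy` are hypothetical helpers, and the in-network URL `http://vllm:8000/health` in the comment is an assumption based on the service name and port used elsewhere in this post.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and HTTP errors all count as "not ready".
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    retries: int = 20,
    interval: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `retries` times, sleeping `interval` seconds between tries."""
    for attempt in range(retries):
        if probe():
            return True
        if attempt < retries - 1:
            sleep(interval)
    return False


# Hypothetical usage (not executed here, since it would block until vLLM answers):
#   ready = wait_until_healthy(lambda: http_ok("http://vllm:8000/health"))
```

With `condition: service_healthy` in place, Compose performs this waiting before the harness container is started, so none of this logic needs to live in the harness.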

The vLLM service has a generous startup window: start_period: 120s with retries: 20 at 15-second intervals. On a cold start with model weights not yet cached, the first pull from HuggingFace can extend this further — but Docker health checks don't count failed checks during start_period, so the 120s window absorbs most first-run delays.
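
As a rough worked example of how long vLLM has before Docker marks it unhealthy: failures inside the start period are ignored, and after it the container needs `retries` consecutive failed checks. The helper below is a back-of-the-envelope estimate that ignores the per-check timeout; its name is hypothetical.

```python
def unhealthy_deadline_s(start_period_s: int, interval_s: int, retries: int) -> int:
    """Approximate seconds from container start until it can be marked unhealthy."""
    return start_period_s + interval_s * retries


# vLLM settings from the Compose file: 120s start period, 15s interval, 20 retries.
print(unhealthy_deadline_s(120, 15, 20))  # 420
```

So the effective budget is about 7 minutes rather than 2, which is where a slow first download from HuggingFace gets its extra headroom.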

Live code mount

The harness service mounts the project directory at /app:

volumes:
  - .:/app                    # live code — no rebuild during dev
  - ./runs:/app/runs          # persist run data
  - ./memory-db:/app/memory-db  # persist ChromaDB + SQLite

The live mount means that editing a Python file on the host is immediately reflected in the container — no docker compose build required. For a workflow that involves frequent edits to agent.py or skill files, this makes the Compose stack nearly as fast to iterate on as running the harness natively. The separate volume mounts for runs/ and memory-db/ ensure run history and the ChromaDB memory store persist across container restarts.

Switching to the GPU image

# In docker-compose.yml, change:
harness:
  build:
    context: .
    dockerfile: Dockerfile.gpu   # was: Dockerfile
  runtime: nvidia                # add this line
  environment:
    - NVIDIA_VISIBLE_DEVICES=all

The GPU variant is worth switching to when sentence-transformers (used by the memory store for embedding generation) is a throughput bottleneck. Embedding generation on CPU can add 200–500ms per research step when the memory store is active; on GPU it's typically under 20ms.
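
For the GPU image to help, the embedding code has to actually place the model on the GPU. A minimal sketch of device selection is shown below, assuming the memory store builds its model with sentence-transformers; `pick_embedding_device` is a hypothetical helper, and how the harness really chooses its device is not stated here.

```python
def pick_embedding_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # No PyTorch installed at all: nothing to accelerate.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_embedding_device()
print(f"embedding device: {device}")
# The result could be passed as the `device=` argument when constructing a
# sentence_transformers.SentenceTransformer model.
```

In the CPU image this returns "cpu"; in the GPU image it returns "cuda" only if the container was started with GPU access.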

Quick start

# Copy environment template
cp .env.example .env

# Set your HuggingFace token (for model downloads)
echo "HF_TOKEN=your_token_here" >> .env

# Launch all three services
docker compose up

# Dashboard at http://localhost:8765
# vLLM at http://localhost:8000
# Ollama at http://localhost:11434

The VLLM_MODEL default is Qwen/Qwen2.5-14B-Instruct, which needs roughly 28 GB of VRAM for the bfloat16 weights alone, before any KV cache. Set VLLM_MODEL=Qwen/Qwen2.5-7B-Instruct in .env for a 24 GB card, or use a quantized GGUF variant via Ollama if you don't have a second GPU to dedicate to vLLM.
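
The 28 GB figure is consistent with a simple weights-only estimate: parameter count times bytes per parameter, with bfloat16 taking 2 bytes. The sketch below uses the nominal sizes from the model names (14B and 7B) as an assumption; real parameter counts differ slightly, and KV cache and activations come on top.

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Weights-only estimate in decimal GB (1e9 params * bytes / 1e9 bytes per GB)."""
    return params_billion * bytes_per_param


for name, size_b in [
    ("Qwen/Qwen2.5-14B-Instruct", 14),
    ("Qwen/Qwen2.5-7B-Instruct", 7),
]:
    print(f"{name}: ~{weight_vram_gb(size_b):.0f} GB of weights in bfloat16")
```

At roughly 14 GB of weights, the 7B model leaves room on a 24 GB card for the KV cache, which the 14B model does not.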
