Deploying the Harness with Docker: CPU, GPU, and Compose Variants
Three Docker artifacts cover the full deployment range — a CPU slim image for development, a CUDA GPU image for production inference, and a three-service Compose stack that brings up vLLM, Ollama, and the harness dashboard together.
Running the harness locally with a Conda environment and a native Ollama install is the fastest path for development. For reproducible deployments (a rented GPU instance, a shared inference server, or a CI environment), Docker provides the isolation and repeatability that conda activate can't. The harness ships with two Dockerfiles and a Compose configuration that cover both CPU-only and GPU hosts.
Two Dockerfiles
| File | Base image | PyTorch | Use case |
|---|---|---|---|
| Dockerfile | python:3.11-slim | CPU build from download.pytorch.org/whl/cpu | Development, CI, no GPU required |
| Dockerfile.gpu | pytorch/pytorch:2.7.1-cuda11.8-cudnn9-runtime | Pre-installed in base image (CUDA 11.8) | Production inference, sentence-transformers on GPU |
Both Dockerfiles follow the same layer order: system packages (git, build-essential, curl, ffmpeg), Python dependencies from requirements.txt, Playwright browser installation, then the project source. The CPU variant installs PyTorch in its own layer before requirements.txt — PyTorch is large and changes rarely, so it benefits from Docker layer caching during iterative development. The GPU variant skips this step because PyTorch is already present in the base image.
Both expose port 8765 and start the server with:
CMD ["python", "server.py", "--host", "0.0.0.0", "--port", "8765"]
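Binding to 0.0.0.0 matters inside a container: a server that listens only on the loopback interface is not reachable through the published port. The real server.py is not shown here, so the following is only a minimal sketch of how the two flags might be parsed, assuming argparse is used; the parse_args helper and the loopback default are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the --host and --port flags passed by the Dockerfile CMD."""
    parser = argparse.ArgumentParser(description="harness server (illustrative)")
    # Hypothetical default: loopback is a safe choice outside a container,
    # which is why the container CMD overrides it with 0.0.0.0.
    parser.add_argument("--host", default="127.0.0.1")
    parser.add_argument("--port", type=int, default=8765)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["--host", "0.0.0.0", "--port", "8765"])
    print(f"would listen on {args.host}:{args.port}")
```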
The ffmpeg system package is included to support youtube_transcribe.py, which needs ffmpeg available as a system binary for its Whisper fallback path.
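Because ffmpeg has to be present as a system binary rather than as a Python package, a quick check inside the container can confirm the image was built correctly. This is a sketch using only the standard library; find_binary is a hypothetical helper, not part of youtube_transcribe.py.

```python
import shutil
from typing import Optional


def find_binary(name: str) -> Optional[str]:
    """Return the absolute path of `name` if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_binary("ffmpeg")
    if path:
        print(f"ffmpeg found at {path}")
    else:
        print("ffmpeg not found on PATH; the Whisper fallback would fail")
```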
The three-service Compose stack
vllm
- Image: vllm/vllm-openai:latest
- runtime: nvidia, GPU passthrough
- Model from VLLM_MODEL env var (default: Qwen/Qwen2.5-14B-Instruct)
- bfloat16, 16k max context
- HuggingFace cache mounted from ./model-cache
- Health: /health endpoint, 20 retries, 120s start

ollama
- Image: ollama/ollama:latest
- Optional GPU passthrough
- Manages its own model downloads
- Handles smaller/fast models (planner, evaluator, glm4:9b)
- Data mounted from ./ollama-data
- Health: root endpoint, 10 retries, 30s start

harness
- Built from Dockerfile (or Dockerfile.gpu)
- Live code mount: .:/app (no rebuild during dev)
- Persists ./runs and ./memory-db as volumes
- Env vars: VLLM_BASE_URL and OLLAMA_HOST set for service discovery
- Depends on vllm and ollama (both healthy) before starting
- restart: on-failure
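Service discovery through VLLM_BASE_URL and OLLAMA_HOST can be sketched as follows. How the harness actually reads these variables is not shown in this article; resolve_endpoints and both default URLs are assumptions. They rely on Compose service names resolving as hostnames on the stack's network, and on vLLM serving its OpenAI-compatible API under /v1.

```python
import os
from typing import Dict, Mapping

# Assumed defaults: service names double as hostnames inside the Compose
# network; the ports match those listed in the quick start.
DEFAULT_VLLM_BASE_URL = "http://vllm:8000/v1"
DEFAULT_OLLAMA_HOST = "http://ollama:11434"


def resolve_endpoints(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read both backend URLs from the environment, falling back to defaults."""
    return {
        "vllm": env.get("VLLM_BASE_URL", DEFAULT_VLLM_BASE_URL).rstrip("/"),
        "ollama": env.get("OLLAMA_HOST", DEFAULT_OLLAMA_HOST).rstrip("/"),
    }


if __name__ == "__main__":
    for name, url in resolve_endpoints().items():
        print(f"{name}: {url}")
```

Passing the environment as a mapping keeps the function easy to exercise with a plain dict, for example when running the harness natively against localhost.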
Service dependency and startup
The Compose file uses condition: service_healthy for both vLLM and Ollama before the harness starts. This matters because vLLM can take 1–2 minutes to load a 14B model into GPU memory before it's ready to serve requests. Without the health check, the harness would start, attempt an LLM call during initialization, get a connection refused, and fail.
The vLLM service has a generous startup window: start_period: 120s with retries: 20 at 15-second intervals. On a cold start with model weights not yet cached, the first pull from HuggingFace can extend startup further. Docker does not count failed checks during start_period toward the retry limit, so the 120s window absorbs the normal model load, and the 20 retries that follow allow roughly five more minutes before the container is marked unhealthy.
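The same waiting logic can be approximated on the client side, for example in a CI script that talks to the stack from outside Compose. The sketch below is not Docker's implementation; it only mimics the rule that failures during the start period are not counted. The helper names (http_ok, wait_until_healthy) are hypothetical, and the probe, sleep, and clock are injectable so the loop can be tested without a network.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, HTTPError, and connection refused
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    start_period: float,
    interval: float,
    retries: int,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it succeeds or `retries` counted failures occur."""
    started = clock()
    failures = 0
    while True:
        if probe():
            return True
        # Failures inside the start period are ignored, as with Docker.
        if clock() - started >= start_period:
            failures += 1
            if failures >= retries:
                return False
        sleep(interval)


# Usage with the vLLM settings described above:
# wait_until_healthy(lambda: http_ok("http://localhost:8000/health"),
#                    start_period=120, interval=15, retries=20)
```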
Live code mount
The harness service mounts the project directory at /app:
volumes:
  - .:/app                      # live code, no rebuild during dev
  - ./runs:/app/runs            # persist run data
  - ./memory-db:/app/memory-db  # persist ChromaDB + SQLite
The live mount means that editing a Python file on the host is immediately reflected in the container — no docker compose build required. For a workflow that involves frequent edits to agent.py or skill files, this makes the Compose stack nearly as fast to iterate on as running the harness natively. The separate volume mounts for runs/ and memory-db/ ensure run history and the ChromaDB memory store persist across container restarts.
Switching to the GPU image
# In docker-compose.yml, change:
harness:
  build:
    context: .
    dockerfile: Dockerfile.gpu   # was: Dockerfile
  runtime: nvidia                # add this line
  environment:
    - NVIDIA_VISIBLE_DEVICES=all
The GPU variant is worth switching to when sentence-transformers (used by the memory store for embedding generation) is a throughput bottleneck. Embedding generation on CPU can add 200–500ms per research step when the memory store is active; on GPU it's typically under 20ms.
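For the GPU image to help, the embedding model has to be placed on the GPU. A minimal sketch of device selection follows; it assumes the memory store loads its model through the sentence-transformers SentenceTransformer class, which accepts a device argument. The pick_device helper and the model name in the usage comment are placeholders, not taken from the harness.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so fall back to CPU
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"embedding device: {pick_device()}")

# Usage sketch (the model name is a placeholder):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2", device=pick_device())
```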
Quick start
# Copy environment template
cp .env.example .env
# Set your HuggingFace token (for model downloads)
echo "HF_TOKEN=your_token_here" >> .env
# Launch all three services
docker compose up
# Dashboard at http://localhost:8765
# vLLM at http://localhost:8000
# Ollama at http://localhost:11434
The VLLM_MODEL default is Qwen/Qwen2.5-14B-Instruct, which needs roughly 28 GB of VRAM in bfloat16 for the weights alone (about 2 bytes per parameter), with the KV cache on top. Set VLLM_MODEL=Qwen/Qwen2.5-7B-Instruct in .env for a 24 GB card, or use a quantized GGUF variant via Ollama if you don't have a second GPU to dedicate to vLLM.
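The VRAM figure follows from simple arithmetic: bfloat16 stores each parameter in 2 bytes. The sketch below uses the nominal 14B and 7B sizes from the model names (the actual parameter counts may differ slightly) and a hypothetical 4 GB headroom for the KV cache and runtime overhead; real usage depends on context length and vLLM settings.

```python
def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Estimate weight memory in GB: billions of parameters x bytes each."""
    return params_billion * bytes_per_param


def fits(params_billion: float, vram_gb: float, headroom_gb: float = 4.0) -> bool:
    """Rough check: weights plus an assumed headroom must fit in VRAM."""
    return weights_gb(params_billion) + headroom_gb <= vram_gb


if __name__ == "__main__":
    for size in (14, 7):
        print(f"{size}B: ~{weights_gb(size):.0f} GB weights, "
              f"fits on 24 GB card: {fits(size, 24)}")
```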