# SparkBroker Sprint Bible

**The definitive guide for running autonomous AI coding sprints on GB10 hardware.**

---

## Table of Contents

1. [Architecture Overview](#1-architecture-overview)
2. [Infrastructure Map](#2-infrastructure-map)
3. [Self-Healing Systems](#3-self-healing-systems)
4. [Common Failures & Fixes](#4-common-failures--fixes)
5. [The Golden Rules](#5-the-golden-rules)
6. [Dispatcher Deep Dive](#6-dispatcher-deep-dive)
7. [vLLM Management](#7-vllm-management)
8. [Database & Board Management](#8-database--board-management)
9. [VPS & Traefik](#9-vps--traefik)
10. [Tailscale Networking](#10-tailscale-networking)
11. [Monitoring & Dashboards](#11-monitoring--dashboards)
12. [Emergency Playbook](#12-emergency-playbook)
13. [Configuration Reference](#13-configuration-reference)

---

## 1. Architecture Overview

```
┌──────────────────────────────────────────────────────┐
│  VPS (72.61.125.54)                                  │
│  asyllym.cloud                                       │
│  ┌──────────┐  ┌──────────┐   ┌──────────────────┐   │
│  │ Traefik  │  │ Hermes   │   │ Kanban Proxy     │   │
│  │ :80/:443 │  │ Agent    │   │ Nginx → :19300   │   │
│  └──────────┘  └──────────┘   └────────┬─────────┘   │
│                                        │             │
│  Tailscale: 100.73.83.64               │             │
└────────────────────────────────────────┼─────────────┘
                                         │ Tailscale
                                         │ :3000
┌────────────────────────────────────────┼─────────────┐
│  Spark 2 (GB10)                        │             │
│  spark-05b8 @ 100.97.213.69            │             │
│  ┌──────────┐  ┌──────────┐   ┌────────▼─────────┐   │
│  │ vLLM     │  │ Hermes   │   │ Kanban Server    │   │
│  │ :8000    │  │ Gateway  │   │ Python :3000     │   │
│  │ Qwen3-VL │  │          │   │ SQLite DB        │   │
│  └──────────┘  └──────────┘   └──────────────────┘   │
│                                                      │
│  Crontab (9 jobs):                                   │
│  • Dispatcher (2min)  • Watchdog (5min)              │
│  • Reclaimer (3min)  • CTO Review (10min)            │
│  • Redis (5min)  • Export (5min)  • Push (5min)      │
│  • SSH Tunnel (2min)  • Kanban Server (5min)         │
└──────────────────────────────────────────────────────┘
```

### Key Components

| Component | Port | Purpose |
|-----------|------|---------|
| vLLM | 8000 | LLM inference (Qwen3-VL-30B) |
| Kanban Server | 3000 | Board API + Web UI |
| Hermes Gateway | 4860 | Agent orchestration |
| Traefik | 80/443 | SSL termination + routing |
| Kanban Proxy | 19300 | VPS → Spark 2 proxy |

---

## 2. Infrastructure Map

### Machines

| Name | IP (Tailscale) | Role | Specs |
|------|----------------|------|-------|
| VPS (srv1681243) | 100.73.83.64 | Web + Traefik + Proxy | 8GB RAM, 96GB SSD |
| Spark 2 (spark-05b8) | 100.97.213.69 | Build Agent + vLLM | GB10 GPU, 120GB VRAM |
| Spark 1 (spark-4da6) | 100.113.65.53 | Secondary Agent | GB10 GPU |
| Hermes Host | 100.119.144.59 | Main Hermes Instance | 7.8GB RAM |

### SSH Access

```bash
# VPS
ssh -i ~/.ssh/hermes_root root@72.61.125.54

# Spark 2 (from VPS)
ssh spark-05b8@100.97.213.69

# Spark 2 (from Hermes Host via socks5)
ssh spark-05b8  # uses ProxyCommand socks5-proxy
```

### Domains

| Domain | Routes To | Purpose |
|--------|-----------|---------|
| asyllym.cloud | VPS Web | Dashboard |
| spark.asyllym.cloud | VPS → Spark 2 :3000 | Kanban Board |

---

## 3. Self-Healing Systems

### 3.1 Watchdog (every 5 min)

**Script:** `/home/spark-05b8/brokerage/scripts/watchdog.sh`

**Checks:**
- vLLM health (port 8000)
- GPU zombie detection (>100GB VRAM)
- Stale tasks (dead PIDs, >2h timeout)
- Disk usage (>85% triggers cleanup)
- Hermes gateway process
- FastAPI process

**Auto-fix actions:**
- vLLM down → runs `start-vllm-optimized.sh`
- GPU zombie → `kill -9` the offending process
- Stale task → reclaims to `ready` status
- Disk full → deletes logs >7 days, old workspaces
- Gateway down → restarts via hermes binary

**Alert cooldown:** 30 minutes per service (prevents Telegram spam)

### 3.2 Stale Task Reclaimer (every 3 min)

**Script:** `/home/spark-05b8/brokerage/scripts/reclaim_stale.sh`

**Logic:**

```
for each task with status='running':
    if PID is dead:
        → reclaim task (status → ready)
        → close run record
        → log event
    if running > 2 hours:
        → same reclaim logic
```

**Why this matters:** Without this, dead agents block the dispatcher forever.
The dispatcher sees "1/1 running" and skips all new tasks.

### 3.3 Dispatcher (every 2 min)

**Script:** `/home/spark-05b8/.hermes/cron/dispatcher.sh`

**Features:**
- `flock` file lock prevents concurrent dispatches
- Counts running tasks before dispatching
- Respects `max_in_progress` config (set to 1 for GB10)
- Uses `hermes kanban dispatch --max 1`

### 3.4 Other Crons

| Cron | Interval | Purpose |
|------|----------|---------|
| Redis Keepalive | 5 min | Restarts Redis if down |
| Kanban Export | 5 min | Exports board for remote monitoring |
| Push to VPS | 5 min | Syncs board data to VPS |
| SSH Tunnel | 2 min | Keeps reverse tunnel alive |
| Kanban Server | 5 min | Restarts if crashed |
| CTO Review | 10 min | Analyzes progress, detects phase boundaries |

---

## 4. Common Failures & Fixes

### 4.1 "At capacity: 1/1 running. Skipping." (but no task is actually running)

**Cause:** The agent crashed, but its task still shows `running` status because the dead PID has not been reclaimed yet.

**Fix:**

```bash
# Manual reclaim (the relative path means this must run from the directory holding kanban.db)
python3 - <<'EOF'
import os, sqlite3, time

conn = sqlite3.connect('kanban.db')
now = int(time.time())
rows = conn.execute(
    "SELECT t.id, t.current_run_id, tr.worker_pid "
    "FROM tasks t JOIN task_runs tr ON t.id = tr.task_id "
    "WHERE t.status = 'running' AND tr.ended_at IS NULL"
).fetchall()
for task_id, run_id, pid in rows:
    try:
        os.kill(pid, 0)  # signal 0 only checks whether the PID exists
        continue         # worker is still alive, leave the task alone
    except PermissionError:
        continue         # PID exists but belongs to another user
    except (ProcessLookupError, TypeError):
        pass             # PID is dead, or no PID was recorded
    conn.execute("UPDATE task_runs SET status='reclaimed', ended_at=? WHERE id=?", (now, run_id))
    conn.execute("UPDATE tasks SET status='ready', current_run_id=NULL WHERE id=?", (task_id,))
    print(f'Reclaimed: {task_id}')
conn.commit()
EOF
```

**Prevention:** The 3-min reclaimer handles this automatically.

### 4.2 vLLM Crashes (GPU OOM)

**Cause:** Zombie vLLM process holds GPU memory after crash.

**Symptoms:**
- `nvidia-smi` shows >100GB used
- Port 8000 not responding
- Agents exit with code 0 (can't reach model)

**Fix:**

```bash
# Kill zombie
pkill -9 -f vllm.entrypoints

# Wait for GPU memory to clear
sleep 10

# Restart vLLM
bash /home/spark-05b8/start-vllm-optimized.sh
```

**Prevention:** Watchdog checks GPU memory every 5 min.

### 4.3 Gateway Stopped While Agent Working

**Cause:** Hermes gateway crashes mid-task.

**Symptoms:**
- Tasks show "Task cancelled: gateway stopped"
- Agent process orphaned

**Fix:** The watchdog auto-restarts the gateway. Orphaned tasks are reclaimed by the reclaimer.

### 4.4 "No such file or directory" Errors

**Cause:** Workspace directory missing or agent tried to access non-existent path.

**Fix:** Usually self-resolving: the next dispatch creates a fresh workspace. If persistent:

```bash
# Check workspace exists
ls -la /home/spark-05b8/.hermes/kanban/boards/sparkbroker/workspaces/

# If missing, task will be re-dispatched with fresh workspace
```

### 4.5 Traefik Can't Start (Port 80 in Use)

**Cause:** Nginx or another service is occupying port 80.

**Fix:**

```bash
# On VPS
systemctl stop nginx
systemctl disable nginx
docker restart traefik-traefik-1
```

### 4.6 Kanban Board Shows Stale Data

**Cause:** Kanban server cached old DB state, or nginx caching.

**Fix:**

```bash
# Restart kanban server
pkill -f kanban_server.py
cd /home/spark-05b8/brokerage
setsid python3 kanban_server.py >> logs/kanban_server.log 2>&1 &

# Clear nginx cache (VPS)
docker restart hermes-agent-oxet-spark-kanban-1
```

### 4.7 Close Button Not Working on Mobile

**Cause:** Missing `onclick` handler on modal close button.

**Fix:** Ensure `