# SparkBroker Sprint Bible

**The definitive guide for running autonomous AI coding sprints on GB10 hardware.**

---

## Table of Contents

1. [Architecture Overview](#1-architecture-overview)
2. [Infrastructure Map](#2-infrastructure-map)
3. [Self-Healing Systems](#3-self-healing-systems)
4. [Common Failures & Fixes](#4-common-failures--fixes)
5. [The Golden Rules](#5-the-golden-rules)
6. [Dispatcher Deep Dive](#6-dispatcher-deep-dive)
7. [vLLM Management](#7-vllm-management)
8. [Database & Board Management](#8-database--board-management)
9. [VPS & Traefik](#9-vps--traefik)
10. [Tailscale Networking](#10-tailscale-networking)
11. [Monitoring & Dashboards](#11-monitoring--dashboards)
12. [Emergency Playbook](#12-emergency-playbook)
13. [Configuration Reference](#13-configuration-reference)

---

## 1. Architecture Overview

```
┌─────────────────────────────────────────────────────┐
│                    VPS (72.61.125.54)                │
│  asyllym.cloud                                       │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────┐  │
│  │ Traefik  │  │  Hermes  │  │  Kanban Proxy    │  │
│  │ :80/:443 │  │  Agent   │  │  Nginx → :19300  │  │
│  └──────────┘  └──────────┘  └────────┬─────────┘  │
│                                        │             │
│  Tailscale: 100.73.83.64              │             │
└────────────────────────────────────────┼─────────────┘
                                         │ Tailscale
                                         │ :3000
┌────────────────────────────────────────┼─────────────┐
│              Spark 2 (GB10)            │             │
│  spark-05b8 @ 100.97.213.69           │             │
│  ┌──────────┐  ┌──────────┐  ┌───────▼──────────┐  │
│  │  vLLM    │  │ Hermes   │  │  Kanban Server   │  │
│  │  :8000   │  │ Gateway  │  │  Python :3000    │  │
│  │ Qwen3-VL │  │          │  │  SQLite DB       │  │
│  └──────────┘  └──────────┘  └──────────────────┘  │
│                                                      │
│  Crontab (9 jobs):                                   │
│  • Dispatcher (2min) • Watchdog (5min)               │
│  • Reclaimer (3min)  • CTO Review (10min)           │
│  • Redis (5min) • Export (5min) • Push (5min)       │
│  • SSH Tunnel (2min) • Kanban Server (5min)         │
└──────────────────────────────────────────────────────┘
```

### Key Components

| Component | Port | Purpose |
|-----------|------|---------|
| vLLM | 8000 | LLM inference (Qwen3-VL-30B) |
| Kanban Server | 3000 | Board API + Web UI |
| Hermes Gateway | 4860 | Agent orchestration |
| Traefik | 80/443 | SSL termination + routing |
| Kanban Proxy | 19300 | VPS → Spark 2 proxy |

---

## 2. Infrastructure Map

### Machines

| Name | IP (Tailscale) | Role | Specs |
|------|----------------|------|-------|
| VPS (srv1681243) | 100.73.83.64 | Web + Traefik + Proxy | 8GB RAM, 96GB SSD |
| Spark 2 (spark-05b8) | 100.97.213.69 | Build Agent + vLLM | GB10 GPU, 120GB VRAM |
| Spark 1 (spark-4da6) | 100.113.65.53 | Secondary Agent | GB10 GPU |
| Hermes Host | 100.119.144.59 | Main Hermes Instance | 7.8GB RAM |

### SSH Access

```bash
# VPS
ssh -i ~/.ssh/hermes_root root@72.61.125.54

# Spark 2 (from VPS)
ssh spark-05b8@100.97.213.69

# Spark 2 (from Hermes Host via socks5)
ssh spark-05b8  # uses ProxyCommand socks5-proxy
```

### Domains

| Domain | Routes To | Purpose |
|--------|-----------|---------|
| asyllym.cloud | VPS Web | Dashboard |
| spark.asyllym.cloud | VPS → Spark 2 :3000 | Kanban Board |

---

## 3. Self-Healing Systems

### 3.1 Watchdog (every 5 min)

**Script:** `/home/spark-05b8/brokerage/scripts/watchdog.sh`

**Checks:**
- vLLM health (port 8000)
- GPU zombie detection (>100GB VRAM)
- Stale tasks (dead PIDs, >2h timeout)
- Disk usage (>85% triggers cleanup)
- Hermes gateway process
- FastAPI process

**Auto-fix actions:**
- vLLM down → runs `start-vllm-optimized.sh`
- GPU zombie → `kill -9` the offending process
- Stale task → reclaims to `ready` status
- Disk full → deletes logs >7 days, old workspaces
- Gateway down → restarts via hermes binary

**Alert cooldown:** 30 minutes per service (prevents Telegram spam)

### 3.2 Stale Task Reclaimer (every 3 min)

**Script:** `/home/spark-05b8/brokerage/scripts/reclaim_stale.sh`

**Logic:**
```text
for each task with status='running':
    if PID is dead:
        → reclaim task (status → ready)
        → close run record
        → log event
    if running > 2 hours:
        → same reclaim logic
```

**Why this matters:** Without this, dead agents block the dispatcher forever. The dispatcher sees "1/1 running" and skips all new tasks.
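
The reclaim decision above can be sketched as a small pure function. This is an illustration under stated assumptions, not the contents of `reclaim_stale.sh`: the names `pid_alive` and `should_reclaim` are hypothetical, and the 2-hour limit is taken from the logic above.

```python
import os

MAX_RUNTIME_SECONDS = 2 * 60 * 60  # 2-hour timeout from the logic above


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 checks existence without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def should_reclaim(alive: bool, started_at: int, now: int) -> bool:
    """A running task is reclaimed if its worker died or it ran too long."""
    return (not alive) or (now - started_at > MAX_RUNTIME_SECONDS)
```

Catching only `ProcessLookupError` as "dead" matters: a bare `except` would also treat a permission error as a dead worker and reclaim a task that is still running.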

### 3.3 Dispatcher (every 2 min)

**Script:** `/home/spark-05b8/.hermes/cron/dispatcher.sh`

**Features:**
- `flock` file lock prevents concurrent dispatches
- Counts running tasks before dispatching
- Respects `max_in_progress` config (set to 1 for GB10)
- Uses `hermes kanban dispatch --max 1`

### 3.4 Other Crons

| Cron | Interval | Purpose |
|------|----------|---------|
| Redis Keepalive | 5 min | Restarts Redis if down |
| Kanban Export | 5 min | Exports board for remote monitoring |
| Push to VPS | 5 min | Syncs board data to VPS |
| SSH Tunnel | 2 min | Keeps reverse tunnel alive |
| Kanban Server | 5 min | Restarts if crashed |
| CTO Review | 10 min | Analyzes progress, detects phase boundaries |

---

## 4. Common Failures & Fixes

### 4.1 "At capacity: 1/1 running. Skipping." (but no task is actually running)

**Cause:** Agent crashed but task still shows `running` status. Dead PID not reclaimed yet.

**Fix:**
```bash
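# Manual reclaim (run from the board directory that contains kanban.db)
python3 - <<'EOF'
import os, sqlite3, time

conn = sqlite3.connect('kanban.db')
now = int(time.time())
rows = conn.execute(
    "SELECT t.id, t.current_run_id, tr.worker_pid FROM tasks t "
    "JOIN task_runs tr ON tr.task_id = t.id AND tr.id = t.current_run_id "
    "WHERE t.status = 'running' AND tr.ended_at IS NULL"
).fetchall()
for task_id, run_id, pid in rows:
    try:
        os.kill(pid, 0)  # signal 0 only checks that the PID exists
    except PermissionError:
        continue  # PID exists but belongs to another user: leave it alone
    except (ProcessLookupError, TypeError):  # dead worker, or no PID recorded
        conn.execute("UPDATE task_runs SET status='reclaimed', ended_at=? WHERE id=?", (now, run_id))
        conn.execute("UPDATE tasks SET status='ready', current_run_id=NULL WHERE id=?", (task_id,))
        print(f'Reclaimed: {task_id}')
conn.commit()
EOF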
```

**Prevention:** The 3-min reclaimer handles this automatically.

### 4.2 vLLM Crashes (GPU OOM)

**Cause:** Zombie vLLM process holds GPU memory after crash.

**Symptoms:**
- `nvidia-smi` shows >100GB used
- Port 8000 not responding
- Agents exit with code 0 (can't reach model)

**Fix:**
```bash
# Kill zombie
pkill -9 -f vllm.entrypoints

# Wait for GPU memory to clear
sleep 10

# Restart vLLM
bash /home/spark-05b8/start-vllm-optimized.sh
```

**Prevention:** Watchdog checks GPU memory every 5 min.

### 4.3 Gateway Stopped While Agent Working

**Cause:** Hermes gateway crashes mid-task.

**Symptoms:**
- Tasks show "Task cancelled: gateway stopped"
- Agent process orphaned

**Fix:** Watchdog auto-restarts gateway. Orphaned tasks get reclaimed by reclaimer.

### 4.4 "No such file or directory" Errors

**Cause:** Workspace directory missing or agent tried to access non-existent path.

**Fix:** Usually self-resolving — next dispatch creates fresh workspace. If persistent:
```bash
# Check workspace exists
ls -la /home/spark-05b8/.hermes/kanban/boards/sparkbroker/workspaces/<task_id>

# If missing, task will be re-dispatched with fresh workspace
```

### 4.5 Traefik Can't Start (Port 80 in Use)

**Cause:** Nginx or another service hogging port 80.

**Fix:**
```bash
# On VPS
systemctl stop nginx
systemctl disable nginx
docker restart traefik-traefik-1
```

### 4.6 Kanban Board Shows Stale Data

**Cause:** Kanban server cached old DB state, or nginx caching.

**Fix:**
```bash
# Restart kanban server
pkill -f kanban_server.py
cd /home/spark-05b8/brokerage
setsid python3 kanban_server.py >> logs/kanban_server.log 2>&1 &

# Clear nginx cache (VPS)
docker restart hermes-agent-oxet-spark-kanban-1
```

### 4.7 Close Button Not Working on Mobile

**Cause:** Missing `onclick` handler on modal close button.

**Fix:** Ensure `<button class="modal-close" onclick="closeModal()">` in HTML.

---

## 5. The Golden Rules

### Rule 1: NEVER Run More Than 1 Agent on GB10

```yaml
# config.yaml
kanban:
  max_in_progress: 1  # NOT max_concurrent!
```

**Why:** GB10 has 120GB VRAM and vLLM uses ~92GB. One agent plus its tool calls adds ~20GB (~112GB total), which fits. A second agent pushes the total to ~132GB and triggers an OOM crash.

**Critical:** The config key is `max_in_progress`, NOT `max_concurrent`. The dispatch code reads `max_in_progress`.
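
The memory arithmetic behind this rule can be checked with a few lines of Python. This is a back-of-the-envelope sketch using the approximate figures quoted above; `max_agents` is a hypothetical helper, not part of Hermes.

```python
def max_agents(total_gb: float, vllm_gb: float, per_agent_gb: float) -> int:
    """How many agents fit in the memory left after vLLM is loaded."""
    headroom = total_gb - vllm_gb
    return max(0, int(headroom // per_agent_gb))


# Figures quoted in Rule 1: 120 GB total, ~92 GB for vLLM, ~20 GB per agent.
print(max_agents(120, 92, 20))  # 28 GB headroom // 20 GB per agent -> 1
```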

### Rule 2: Always Use `flock` in Dispatcher

```bash
exec 9>"/tmp/sparkbroker-dispatch.lock"
flock -n 9 || { echo "Already running"; exit 0; }
```

**Why:** Cron fires every 2 min. Without flock, overlapping dispatches spawn multiple agents.
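
The same non-blocking lock pattern can be sketched in Python with the standard `fcntl` module (Unix only). The helper name `try_lock` is hypothetical; the dispatcher itself uses the shell `flock` shown above.

```python
import fcntl


def try_lock(path: str):
    """Return an open, locked file object, or None if the lock is already held."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting,
        # which mirrors `flock -n` in the shell version.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle  # keep it open: closing the file releases the lock
```

A caller would exit quietly when `try_lock` returns `None`, the same way the shell version prints "Already running" and exits 0.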

### Rule 3: Never Let Tasks Stay in 'running' with Dead PIDs

The 3-min reclaimer handles this, but always verify after crashes:
```bash
# Quick check: count live gateway processes (0 means the gateway is down and its tasks are orphaned)
ps aux | grep "hermes.*gateway" | grep -v grep | wc -l
```

### Rule 4: Watch GPU Memory

```bash
# Check VRAM usage
nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits

# If >100GB and no healthy vLLM process → zombie
pkill -9 -f vllm.entrypoints
```
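
The zombie test in the comment above can be written as a small decision function. This is a sketch under assumptions: it expects the text printed by the `nvidia-smi` query shown above (one integer in MiB per GPU), and it expresses the ">100GB" threshold as `100 * 1024` MiB. The function names are hypothetical.

```python
ZOMBIE_THRESHOLD_MIB = 100 * 1024  # assumption: the ">100GB" rule expressed in MiB


def used_mib(nvidia_smi_output: str) -> int:
    """Sum memory.used across GPUs from csv,noheader,nounits output."""
    return sum(int(line) for line in nvidia_smi_output.splitlines() if line.strip())


def looks_like_zombie(nvidia_smi_output: str, vllm_healthy: bool) -> bool:
    """High memory use only indicates a zombie when vLLM is not healthy."""
    return used_mib(nvidia_smi_output) > ZOMBIE_THRESHOLD_MIB and not vllm_healthy
```

Requiring both conditions keeps a healthy but heavily loaded vLLM from being killed.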

### Rule 5: Always Verify SSH Before Critical Operations

```bash
# Test connectivity first
ssh spark-05b8@100.97.213.69 'echo OK'

# If fails, check Tailscale
tailscale status
```

### Rule 6: Backup Crontab Before Editing

```bash
crontab -l > /tmp/crontab-backup.txt
cp /tmp/crontab-backup.txt /tmp/crontab-new.txt
# ... edit /tmp/crontab-new.txt ...
crontab /tmp/crontab-new.txt
```

### Rule 7: Use Python Scripts, Not Shell One-Liners

Shell quoting through SSH is fragile. Write Python scripts, SCP them over, then execute:
```bash
# Bad (quoting hell)
ssh spark-05b8 'python3 -c "import sqlite3; c=sqlite3.connect(\"db\"); ..."'

# Good
scp fix.py spark-05b8:/tmp/fix.py
ssh spark-05b8 'python3 /tmp/fix.py'
```

### Rule 8: Log Everything

Every self-healing action should be logged:
```bash
echo "[$(date)] RECLAIMED stale task: $TASK_ID" >> /home/spark-05b8/brokerage/logs/reclaim.log
```

### Rule 9: Telegram Alerts Need Cooldown

Without cooldown, a flapping service sends hundreds of alerts:
```bash
ALERT_COOLDOWN=1800  # 30 min
alert_once() {
    local lockfile="/tmp/sparkbroker-alert-$1"
    if [[ -f "$lockfile" ]]; then
        local age=$(( $(date +%s) - $(stat -c %Y "$lockfile") ))
        [[ "$age" -lt "$ALERT_COOLDOWN" ]] && return 1
    fi
    touch "$lockfile"
    return 0
}
```

### Rule 10: Test Before Deploy

Always verify scripts work before putting them in cron:
```bash
# Test watchdog
/home/spark-05b8/brokerage/scripts/watchdog.sh 2>&1

# Check output
tail -5 /home/spark-05b8/brokerage/logs/watchdog.log
```

---

## 6. Dispatcher Deep Dive

### How It Works

1. Cron fires every 2 min
2. Dispatcher acquires flock (prevents overlap)
3. Counts running tasks in SQLite
4. If running < max_in_progress:
   - Calls `hermes kanban dispatch --max 1`
   - Spawns one agent process
   - Agent creates workspace dir
   - Agent works on task
   - Agent calls `kanban_complete` on success
   - Task moves to `done`
5. If running >= max_in_progress:
   - Logs "At capacity" and exits
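
Steps 3 to 5 reduce to one count and one comparison. The sketch below assumes only the `tasks` table and `status` column described in this guide; `running_count` and `should_dispatch` are hypothetical names, and the real dispatcher may implement the check differently.

```python
import sqlite3


def running_count(conn: sqlite3.Connection) -> int:
    """Number of tasks currently marked as running."""
    row = conn.execute("SELECT COUNT(*) FROM tasks WHERE status = 'running'").fetchone()
    return row[0]


def should_dispatch(conn: sqlite3.Connection, max_in_progress: int = 1) -> bool:
    """Dispatch only while the running count is below the configured limit."""
    return running_count(conn) < max_in_progress
```

This is also why a stale `running` row is so damaging: with `max_in_progress` set to 1, a single dead task makes `should_dispatch` return `False` on every cron tick.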

### Task Lifecycle

```
todo → ready → running → done
                ↓
              blocked → ready (after unblock)
                ↓
            reclaimed → ready (after crash/timeout)
```
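
One way to read the diagram is as a transition table. The sketch below encodes that reading, on the assumption that both `blocked` and `reclaimed` are reached from `running`; the names `ALLOWED` and `can_transition` are hypothetical.

```python
# Allowed status transitions, as read from the lifecycle diagram above.
ALLOWED = {
    "todo": {"ready"},
    "ready": {"running"},
    "running": {"done", "blocked", "reclaimed"},
    "blocked": {"ready"},      # after unblock
    "reclaimed": {"ready"},    # after crash/timeout
    "done": set(),             # terminal state
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the lifecycle permits moving from current to target."""
    return target in ALLOWED.get(current, set())
```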

### Parent-Child Relationships

```json
{
  "parent_id": "t_adc426c3",
  "child_id": "t_adc426c3_s1"
}
```

- Parent stays `todo` until all children are `done`
- Children with incomplete parents stay `todo` (not `ready`)
- When all children complete, parent auto-promotes to `ready`

---

## 7. vLLM Management

### Start vLLM

```bash
bash /home/spark-05b8/start-vllm-optimized.sh
```

### Check Health

```bash
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8000/health
# Should print 200
```
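
For scripts that need a yes/no answer, the same check can be done from Python's standard library. This is a minimal sketch; `is_healthy` is a hypothetical helper, and it treats any status other than 200, or any connection error, as unhealthy.

```python
import urllib.error
import urllib.request


def is_healthy(url: str = "http://localhost:8000/health", timeout: float = 5.0) -> bool:
    """Return True only if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx responses.
        return False
```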

### Check GPU Usage

```bash
nvidia-smi
# vLLM should use ~92GB VRAM
# If >100GB → zombie process
```

### Kill Zombie

```bash
# Find vLLM PIDs
pgrep -f vllm.entrypoints

# Kill all
pkill -9 -f vllm.entrypoints

# Verify GPU cleared
nvidia-smi  # Should show 0MB used
```

### Model Info

- **Model:** Qwen3-VL-30B
- **VRAM:** ~92GB
- **Port:** 8000
- **Features:** FlashInfer, MoE (CUTLASS), auto tool choice, XML parser

---

## 8. Database & Board Management

### Database Location

```
/home/spark-05b8/.hermes/kanban/boards/sparkbroker/kanban.db
```

### Key Tables

| Table | Purpose |
|-------|---------|
| tasks | All tasks with status, assignee, errors |
| task_runs | Execution history for each task |
| task_events | Event log (spawned, claimed, completed, etc.) |
| task_comments | Comments on tasks |
| task_links | Parent-child relationships |

### Useful Queries

```sql
-- Board summary
SELECT status, COUNT(*) FROM tasks GROUP BY status;

-- Running tasks
SELECT id, title, current_run_id FROM tasks WHERE status='running';

-- Blocked tasks
SELECT id, title, last_failure_error FROM tasks WHERE status='blocked';

-- Recent failures
SELECT task_id, outcome, summary FROM task_runs 
WHERE ended_at > strftime('%s','now') - 7200 
AND outcome IN ('crashed','error','failed');

-- Tasks with most failures
SELECT id, title, consecutive_failures FROM tasks 
WHERE consecutive_failures > 0 ORDER BY consecutive_failures DESC;
```
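
The board summary query is the one most often run from scripts. A minimal Python wrapper might look like this, assuming only the `tasks` table and `status` column listed above; `board_summary` is a hypothetical name.

```python
import sqlite3


def board_summary(conn: sqlite3.Connection) -> dict:
    """Map each status to its task count."""
    rows = conn.execute("SELECT status, COUNT(*) FROM tasks GROUP BY status").fetchall()
    return dict(rows)
```

The `GROUP BY status` clause is what produces one row per status; without it SQLite returns a single row with the total count.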

### Manual Unblock

```python
import sqlite3, time
conn = sqlite3.connect('kanban.db')
now = int(time.time())
task_id = 't_xxxxxxxx'

# Close stale run
conn.execute("UPDATE task_runs SET status='reclaimed', ended_at=? WHERE id=(SELECT current_run_id FROM tasks WHERE id=?)", (now, task_id))

# Reset task
conn.execute("UPDATE tasks SET status='ready', current_run_id=? WHERE id=?", (None, task_id))

# Log event
conn.execute("INSERT INTO task_events (task_id, kind, payload, created_at) VALUES (?, 'unblocked', ?, ?)", (task_id, '{"reason":"manual"}', now))

conn.commit()
```

---

## 9. VPS & Traefik

### Docker Containers

```bash
# List all
docker ps --format '{{.Names}}: {{.Status}}'

# Restart specific
docker restart traefik-traefik-1
docker restart hermes-agent-oxet-spark-kanban-1
docker restart hermes-agent-oxet-web-1
```

### Traefik Config

```yaml
# /docker/traefik/docker-compose.yml
services:
  traefik:
    image: traefik:latest
    network_mode: host
    command:
      - --providers.docker=true
      - --providers.docker.exposedbydefault=false
      - --entrypoints.web.address=:80
      - --entrypoints.websecure.address=:443
      - --certificatesresolvers.letsencrypt.acme.httpchallenge=true
      - --entrypoints.web.http.redirections.entrypoint.to=websecure
```

### Common VPS Issues

| Issue | Fix |
|-------|-----|
| Port 80 in use | `systemctl stop nginx && systemctl disable nginx` |
| Traefik not starting | Check `docker logs traefik-traefik-1` |
| SSL not renewing | Check ACME logs, ensure port 80 reachable |
| 504 Gateway Timeout | Check if Tailscale tunnel is alive |

### Nginx Proxy Config

```nginx
# /tmp/spark-kanban-nginx.conf
server {
    listen 19300;
    location / {
        proxy_pass http://100.97.213.69:3000;  # Spark 2 kanban
        proxy_set_header Host $host;
    }
}
```

---

## 10. Tailscale Networking

### Check Status

```bash
tailscale status
# All machines should show IP and "active" or "idle"
```

### SSH Between Machines

```bash
# From VPS to Spark 2
ssh spark-05b8@100.97.213.69 'command'

# From Hermes Host (via socks5 proxy)
ssh spark-05b8 'command'
```

### Reverse SSH Tunnel

```bash
# Keeps Spark 2 accessible from VPS
ssh -i ~/.ssh/id_ed25519 -N -R 13000:localhost:3000 root@72.61.125.54
```

### Troubleshooting

```bash
# Can't resolve hostname
tailscale status  # Check IP directly

# Connection refused
ss -tlnp | grep :3000  # Check if service listening

# Tunnel dead
ps aux | grep "ssh.*-R" | grep -v grep
# If missing, cron will restart it in <2 min
```

---

## 11. Monitoring & Dashboards

### VPS Dashboard

**URL:** https://asyllym.cloud

**Features:**
- System stats (auto-refresh from API)
- Docker services status
- Tailscale mesh status
- SparkBroker progress
- Quick links to all services

**API:** https://asyllym.cloud/api/system-stats.json

**Refresh:** Cron updates stats every minute:
```bash
* * * * * bash /tmp/system-stats.sh > /docker/hermes-agent-oxet/data/infographic/api/system-stats.json
```

### SparkBroker Board

**URL:** https://spark.asyllym.cloud

**Features:**
- Real-time board view
- Task detail modal
- Unblock buttons
- Auto-refresh every 30s

### Log Locations

| Log | Location |
|-----|----------|
| Watchdog | `/home/spark-05b8/brokerage/logs/watchdog.log` |
| Dispatcher | `/home/spark-05b8/brokerage/logs/dispatch.log` |
| Reclaimer | `/home/spark-05b8/brokerage/logs/reclaim.log` |
| CTO Review | `/home/spark-05b8/.hermes/logs/cto_review.log` |
| Kanban Server | `/home/spark-05b8/brokerage/logs/kanban_server.log` |
| vLLM Restart | `/home/spark-05b8/brokerage/logs/vllm_restart.log` |

---

## 12. Emergency Playbook

### Scenario: All Tasks Blocked

```bash
# 1. Check board state
python3 -c "import sqlite3; print({r[0]:r[1] for r in sqlite3.connect('kanban.db').execute('SELECT status,COUNT(*) FROM tasks GROUP BY status').fetchall()})"

# 2. Unblock all
python3 - <<'EOF'
import sqlite3
conn = sqlite3.connect('kanban.db')
conn.execute("UPDATE tasks SET status='ready', current_run_id=NULL WHERE status='blocked'")
conn.commit()
print('All unblocked')
EOF

# 3. Verify dispatcher picks up
tail -f /home/spark-05b8/brokerage/logs/dispatch.log
```

### Scenario: vLLM Dead

```bash
# 1. Check
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8000/health

# 2. Kill zombies
pkill -9 -f vllm.entrypoints
sleep 10

# 3. Restart
bash /home/spark-05b8/start-vllm-optimized.sh

# 4. Wait for model load (3-5 min); ready once this prints 200
watch -n 5 "curl -s -o /dev/null -w '%{http_code}' http://localhost:8000/health"
```

### Scenario: Gateway Dead

```bash
# 1. Check
pgrep -f "hermes.*gateway"

# 2. Restart
nohup /home/spark-05b8/.hermes/hermes-agent/venv/bin/hermes gateway run --replace >> logs/gateway.log 2>&1 &

# 3. Verify
sleep 5
pgrep -f "hermes.*gateway"
```

### Scenario: Board Stale on VPS

```bash
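# 1. Restart kanban server on Spark 2
# Two SSH calls: `pkill -f` in a combined command would also kill the remote shell,
# because that shell's command line contains "kanban_server.py".
ssh spark-05b8 'pkill -f "[k]anban_server.py"'
ssh spark-05b8 'cd /home/spark-05b8/brokerage && setsid python3 kanban_server.py >> logs/kanban_server.log 2>&1 < /dev/null &'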

# 2. Restart proxy on VPS
docker restart hermes-agent-oxet-spark-kanban-1
```

### Scenario: Disk Full

```bash
# 1. Check
df -h /

# 2. Clean
find /home/spark-05b8/brokerage/logs -name "*.log" -mtime +7 -delete
find /home/spark-05b8/.hermes/kanban/boards/sparkbroker/workspaces -mindepth 1 -maxdepth 1 -type d -mtime +3 -exec rm -rf {} \;

# 3. Verify
df -h /
```
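
Before deleting anything, it can help to list what a cleanup would remove. The sketch below is a dry-run helper under stated assumptions: `old_logs` is a hypothetical name, it only looks at `*.log` files directly inside one directory (it does not recurse like `find`), and it never deletes.

```python
import os
import time


def old_logs(directory: str, max_age_days: int = 7, now: float = None) -> list:
    """Return *.log files in `directory` last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        entry.path
        for entry in os.scandir(directory)
        if entry.is_file()
        and entry.name.endswith(".log")
        and entry.stat().st_mtime < cutoff
    )
```

Calling `old_logs("/home/spark-05b8/brokerage/logs")` and reviewing the result gives a preview of the step 2 cleanup without touching the disk.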

---

## 13. Configuration Reference

### Spark 2 Config (`~/.hermes/config.yaml`)

```yaml
kanban:
  dispatch_in_gateway: false      # Use external dispatcher
  dispatch_interval: 120          # 2 minutes
  max_in_progress: 1              # GB10: ONE agent at a time
  default_claim_ttl: 1800         # 30 min claim TTL
  stale_threshold: 600            # 10 min stale threshold
```

**CRITICAL:** The key is `max_in_progress`, NOT `max_concurrent`. The dispatch code reads `max_in_progress`.
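
A pre-flight check can catch the wrong key before a sprint starts. This sketch assumes the YAML has already been parsed into a dictionary (for example with a YAML library); `agent_limit` is a hypothetical helper, and raising on a missing key is a choice made here, not documented Hermes behavior.

```python
def agent_limit(config: dict) -> int:
    """Return kanban.max_in_progress, rejecting the look-alike max_concurrent key."""
    kanban = config.get("kanban", {})
    if "max_in_progress" not in kanban:
        if "max_concurrent" in kanban:
            raise ValueError(
                "found kanban.max_concurrent; the dispatcher reads kanban.max_in_progress"
            )
        raise ValueError("kanban.max_in_progress is not set")
    return int(kanban["max_in_progress"])
```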

### VPS Docker Compose (`/docker/hermes-agent-oxet/docker-compose.yml`)

```yaml
services:
  spark-kanban:
    image: nginx:alpine
    network_mode: host
    volumes:
      - /tmp/spark-kanban-nginx.conf:/etc/nginx/conf.d/default.conf:ro
    labels:
      - traefik.enable=true
      - traefik.http.routers.spark-kanban.rule=Host(`spark.asyllym.cloud`)
      - traefik.http.routers.spark-kanban.entrypoints=websecure
      - traefik.http.routers.spark-kanban.tls.certresolver=letsencrypt
      - traefik.http.services.spark-kanban.loadbalancer.server.port=19300
```

### Crontab (Spark 2)

```bash
# Watchdog — every 5 min
*/5 * * * * /home/spark-05b8/brokerage/scripts/watchdog.sh >> /home/spark-05b8/brokerage/logs/watchdog.log 2>&1

# Dispatcher — every 2 min
*/2 * * * * /home/spark-05b8/.hermes/cron/dispatcher.sh >> /home/spark-05b8/brokerage/logs/dispatch.log 2>&1

# Stale Reclaimer — every 3 min
*/3 * * * * /home/spark-05b8/brokerage/scripts/reclaim_stale.sh 2>&1

# CTO Review — every 10 min
*/10 * * * * su - spark-05b8 -c 'python3 ~/.hermes/scripts/cto_review.py' >> /home/spark-05b8/.hermes/logs/cto_review.log 2>&1

# Redis Keepalive — every 5 min
*/5 * * * * redis-cli ping || redis-server --daemonize yes

# Kanban Export — every 5 min
*/5 * * * * python3 /home/spark-05b8/brokerage/scripts/export_kanban.py >> /home/spark-05b8/brokerage/logs/export.log 2>&1

# Push to VPS — every 5 min
*/5 * * * * /home/spark-05b8/brokerage/scripts/push_kanban.sh >> /home/spark-05b8/brokerage/logs/push_kanban.log 2>&1

# SSH Tunnel — every 2 min
*/2 * * * * pgrep -f "^ssh .*-R 13000:localhost:3000" > /dev/null || ssh -f -N -o ExitOnForwardFailure=yes -i ~/.ssh/id_ed25519 -R 13000:localhost:3000 root@72.61.125.54 >> /tmp/tunnel.log 2>&1

# Kanban Server — every 5 min
*/5 * * * * pgrep -f "^python3 .*kanban_server\.py" > /dev/null || (cd /home/spark-05b8/brokerage && setsid python3 kanban_server.py >> logs/kanban_server.log 2>&1 &)
```

---

## Appendix A: File Locations

```
/home/spark-05b8/
├── .hermes/
│   ├── config.yaml                    # Main config
│   ├── kanban/boards/sparkbroker/
│   │   ├── kanban.db                  # SQLite database
│   │   ├── board.json                 # Board metadata
│   │   └── workspaces/                # Agent workspaces
│   └── cron/dispatcher.sh             # Dispatcher script
├── brokerage/
│   ├── kanban_server.py               # Board API server
│   ├── kanban_index.html              # Board web UI
│   ├── scripts/
│   │   ├── watchdog.sh                # Health watchdog
│   │   ├── reclaim_stale.sh           # Stale task reclaimer
│   │   ├── export_kanban.py           # Board exporter
│   │   └── push_kanban.sh             # VPS sync
│   └── logs/
│       ├── watchdog.log
│       ├── dispatch.log
│       ├── reclaim.log
│       ├── kanban_server.log
│       └── vllm_restart.log
├── models/qwen35-35b/                 # vLLM model
├── start-vllm-optimized.sh           # vLLM launcher
└── vllm-setup-env/                    # vLLM Python env
```

---

## Appendix B: Key Commands Cheat Sheet

```bash
# Board status
python3 -c "import sqlite3; print({r[0]:r[1] for r in sqlite3.connect('kanban.db').execute('SELECT status,COUNT(*) FROM tasks GROUP BY status').fetchall()})"

# Check running
ps aux | grep "hermes\|vllm\|kanban" | grep -v grep

# Check ports
ss -tlnp | grep -E ":8000|:3000|:4860"

# Kill zombie vLLM
pkill -9 -f vllm.entrypoints

# Reclaim stale tasks
python3 /tmp/fix_stale.py

# Restart everything
pkill -f kanban_server.py; pkill -9 -f vllm.entrypoints
sleep 10  # wait for GPU memory to clear
bash /home/spark-05b8/start-vllm-optimized.sh
cd /home/spark-05b8/brokerage && setsid python3 kanban_server.py >> logs/kanban_server.log 2>&1 &

# Check Tailscale
tailscale status

# Check VPS
ssh -i ~/.ssh/hermes_root root@72.61.125.54 'docker ps'
```

---

**Last updated:** 2026-06-13
**Maintained by:** Hermes Agent (SparkBroker autonomous system)