docs(broker): production deployment spec

Authoritative runtime contract for the broker. Documents:
- HTTP + WS routes (single-port architecture)
- Required + optional env vars (DATABASE_URL, caps, TTLs, limits)
- /health and /metrics semantics, including 503 behavior on DB drop
- SIGTERM/SIGINT graceful shutdown sequence
- Recommended multi-stage Docker build (node:slim for pnpm, oven/bun
  for runtime) with GIT_SHA build-arg convention
- Signal/grace-period guidance for orchestrators
- Prometheus metric names + suggested alert thresholds
- CI pattern for the test suite (needs a live Postgres)
- Deployment target hand-off to the deploy lane

Complements the existing Dockerfile (claudemesh-3's work) with the
runtime contract the Dockerfile implements.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Alejandro Gutiérrez
2026-04-04 22:15:24 +01:00
parent 5bf815b304
commit 5f8567614a

apps/broker/DEPLOY_SPEC.md

# @claudemesh/broker — Deployment Spec
Runtime contract for deploying the broker. Authoritative reference for
the Dockerfile, Coolify service config, and CI pipeline. Owned by the
broker lane; consumed by the deploy lane.
## Runtime
- **Entry point**: `bun apps/broker/src/index.ts` (TypeScript executed
directly by Bun, no compile step).
- **Single process**. Stateless — all persistence is in Postgres.
- **Single port**: HTTP + WebSocket multiplexed over one TCP port.
WS upgrades match path `/ws`; all other requests route to HTTP.
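
The single-port rule above can be sketched as a small classifier. This is a minimal illustration, not the broker's actual dispatch code: `classifyRequest` and `Target` are hypothetical names, and treating a plain (non-upgrade) `GET /ws` as an ordinary HTTP request is an assumption.

```typescript
// Hypothetical helper: decides which side of the single port handles a request.
type Target = "websocket" | "http";

function classifyRequest(
  method: string,
  pathname: string,
  upgradeHeader: string | null,
): Target {
  // A WebSocket handshake is a GET carrying "Upgrade: websocket" (case-insensitive).
  const wantsUpgrade =
    method === "GET" && (upgradeHeader ?? "").toLowerCase() === "websocket";
  // Only upgrades on /ws reach the WebSocket side; everything else is plain HTTP.
  return wantsUpgrade && pathname === "/ws" ? "websocket" : "http";
}
```

For example, `classifyRequest("GET", "/ws", "websocket")` yields `"websocket"`, while `classifyRequest("GET", "/health", null)` yields `"http"`.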
## Routes
| Path | Method | Purpose |
| ------------------- | ---------- | ----------------------------------------------- |
| `/ws` | GET/UPGRADE | Authenticated peer connections (WebSocket) |
| `/hook/set-status` | POST | Claude Code hook scripts report peer status |
| `/health` | GET | Liveness + build info. 503 if Postgres is down. |
| `/metrics` | GET | Prometheus plaintext metrics |
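
`/hook/set-status` is rate limited with a token bucket keyed by (pid, cwd), sized by `HOOK_RATE_LIMIT_PER_MIN`. The sketch below shows one way such a limiter could work. `TokenBucket` and `HookRateLimiter` are hypothetical names, and the continuous-refill policy and full initial bucket are assumptions; this spec does not state how the broker refills buckets or what it returns when a request is limited.

```typescript
// Assumed semantics: capacity equals the per-minute limit, refilled continuously.
class TokenBucket {
  private tokens: number;
  private last: number;
  private readonly capacity: number;
  private readonly refillPerMs: number;

  constructor(capacity: number, refillPerMs: number, now: number) {
    this.capacity = capacity;
    this.refillPerMs = refillPerMs;
    this.tokens = capacity; // start full
    this.last = now;
  }

  // Refill for the elapsed time, then try to spend one token.
  tryTake(now: number): boolean {
    const elapsed = Math.max(0, now - this.last);
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerMs);
    this.last = now;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

class HookRateLimiter {
  private readonly buckets = new Map<string, TokenBucket>();
  private readonly perMinute: number;

  constructor(perMinute: number) {
    this.perMinute = perMinute;
  }

  // One bucket per (pid, cwd) pair; `now` is in milliseconds.
  allow(pid: number, cwd: string, now: number = Date.now()): boolean {
    const key = `${pid}:${cwd}`;
    let bucket = this.buckets.get(key);
    if (!bucket) {
      bucket = new TokenBucket(this.perMinute, this.perMinute / 60_000, now);
      this.buckets.set(key, bucket);
    }
    return bucket.tryTake(now);
  }
}
```

A production limiter would also evict idle buckets so the map does not grow without bound; that is left out here for brevity.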
## Environment variables
### Required
| Var | Format | Notes |
| -------------- | ----------------------------------------- | ---------------------------- |
| `DATABASE_URL` | `postgres://user:pass@host:port/db` | Must use postgres:// scheme |
### Optional (with defaults)
| Var | Default | Range | Purpose |
| --------------------------- | ------- | ------------------ | ---------------------------------------------------- |
| `BROKER_PORT` | `7900` | any free port | Single port for HTTP + WS |
| `STATUS_TTL_SECONDS` | `60` | > 0 | Flip stuck "working" peers to idle after this TTL |
| `HOOK_FRESH_WINDOW_SECONDS` | `30` | > 0 | Window during which a hook signal takes precedence over JSONL-inferred status |
| `MAX_CONNECTIONS_PER_MESH` | `100` | > 0 | Refuse new WS at capacity with close code 1008 |
| `MAX_MESSAGE_BYTES` | `65536` | > 0 | Max WS payload and hook POST body size |
| `HOOK_RATE_LIMIT_PER_MIN` | `30` | > 0 | Per-(pid,cwd) token bucket on /hook/set-status |
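| `NODE_ENV` | `development` | `development` / `production` / `test` | Standard Node.js environment mode |
| `GIT_SHA` | (none) | hex string | Commit reported by `/health`; takes precedence over the `git rev-parse` fallback. Set it in image builds. |

No secrets are baked into the image; all configuration is supplied through environment variables at runtime.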
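
The tables above can be read as a small parsing and validation step at startup. The following is a minimal sketch under stated assumptions: `loadConfig`, `BrokerConfig`, and `positiveInt` are hypothetical names, and rejecting non-integer values is an assumption (the tables only require `> 0`). The broker's real parser may differ.

```typescript
type Env = Record<string, string | undefined>;

interface BrokerConfig {
  databaseUrl: string;
  port: number;
  statusTtlSeconds: number;
  hookFreshWindowSeconds: number;
  maxConnectionsPerMesh: number;
  maxMessageBytes: number;
  hookRateLimitPerMin: number;
  nodeEnv: string;
  gitSha: string | undefined;
}

// Reads a numeric variable, falling back to the documented default when unset.
function positiveInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}

function loadConfig(env: Env): BrokerConfig {
  const databaseUrl = env["DATABASE_URL"];
  if (!databaseUrl || !databaseUrl.startsWith("postgres://")) {
    throw new Error("DATABASE_URL is required and must use the postgres:// scheme");
  }
  return {
    databaseUrl,
    port: positiveInt(env, "BROKER_PORT", 7900),
    statusTtlSeconds: positiveInt(env, "STATUS_TTL_SECONDS", 60),
    hookFreshWindowSeconds: positiveInt(env, "HOOK_FRESH_WINDOW_SECONDS", 30),
    maxConnectionsPerMesh: positiveInt(env, "MAX_CONNECTIONS_PER_MESH", 100),
    maxMessageBytes: positiveInt(env, "MAX_MESSAGE_BYTES", 65536),
    hookRateLimitPerMin: positiveInt(env, "HOOK_RATE_LIMIT_PER_MIN", 30),
    nodeEnv: env["NODE_ENV"] ?? "development",
    gitSha: env["GIT_SHA"],
  };
}
```

Calling `loadConfig(process.env)` once at startup and failing fast on a bad value keeps misconfiguration out of the running process.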
## Healthcheck
Container healthcheck SHOULD hit `/health`:
```dockerfile
HEALTHCHECK --interval=15s --timeout=5s --start-period=10s --retries=3 \
  CMD bun -e "fetch('http://localhost:'+(process.env.BROKER_PORT||7900)+'/health').then(r=>process.exit(r.ok?0:1)).catch(()=>process.exit(1))"
```
`/health` returns `200` with:
```json
{
  "status": "ok",
  "db": "up",
  "version": "0.1.0",
  "gitSha": "84e14ff",
  "uptime": 123
}
```
When Postgres is unreachable, `/health` returns `503` with
`"status":"degraded","db":"down"`. The broker does NOT exit on transient
DB failures; it keeps serving and recovers automatically when the DB
comes back.
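
The `200`/`503` behavior can be sketched as a pure function of the last DB ping result. `buildHealthResponse`, `BuildInfo`, and `HealthBody` are hypothetical names. Two details are assumptions: that the degraded body carries the same build fields as the healthy one, and that `uptime` is measured in seconds.

```typescript
interface BuildInfo {
  version: string;
  gitSha: string;
}

interface HealthBody {
  status: "ok" | "degraded";
  db: "up" | "down";
  version: string;
  gitSha: string;
  uptime: number; // assumed to be seconds since process start
}

// Maps DB reachability to the HTTP status code and JSON body for /health.
function buildHealthResponse(
  dbUp: boolean,
  build: BuildInfo,
  uptimeSeconds: number,
): { statusCode: number; body: HealthBody } {
  return {
    statusCode: dbUp ? 200 : 503,
    body: {
      status: dbUp ? "ok" : "degraded",
      db: dbUp ? "up" : "down",
      version: build.version,
      gitSha: build.gitSha,
      uptime: uptimeSeconds,
    },
  };
}
```

Keeping the mapping pure means the handler only has to read the most recent ping result, so a slow or failing database never blocks the health endpoint itself.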
## Signals
- `SIGTERM` and `SIGINT` → graceful shutdown:
  1. Stop background sweepers (TTL, pending-status, DB ping).
  2. Close all WS connections with code `1001`.
  3. Mark all active presences as `disconnectedAt=now` in Postgres.
  4. Close HTTP server.
  5. Exit 0.

Grace period: ~5s typical. Orchestrators should allow ≥10s before
sending SIGKILL.
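
The shutdown sequence above can be sketched as an ordered handler over injected dependencies. `createShutdownHandler` and `ShutdownDeps` are hypothetical names, not the broker's real API, and ignoring a second signal while shutdown is in progress is an assumption.

```typescript
// Hypothetical dependency surface; the real broker would wire these to its own
// sweepers, socket registry, Postgres client, and HTTP server.
interface ShutdownDeps {
  stopSweepers(): void;
  closeAllSockets(code: number): void;
  markPresencesDisconnected(at: Date): Promise<void>;
  closeHttpServer(): Promise<void>;
  exit(code: number): void;
}

function createShutdownHandler(deps: ShutdownDeps): () => Promise<void> {
  let started = false;
  return async () => {
    if (started) return; // assumption: repeated signals are ignored
    started = true;
    deps.stopSweepers(); // 1. stop TTL, pending-status, and DB ping sweepers
    deps.closeAllSockets(1001); // 2. close WS connections with code 1001
    await deps.markPresencesDisconnected(new Date()); // 3. disconnectedAt = now
    await deps.closeHttpServer(); // 4. stop accepting HTTP requests
    deps.exit(0); // 5. exit cleanly
  };
}
```

Wiring could then be as simple as registering the same handler for both signals with `process.on("SIGTERM", handler)` and `process.on("SIGINT", handler)`.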
## Image
- **Base**: `oven/bun:1.2-slim` for runtime (Bun executes TS directly).
pnpm-install stage can use a separate `node:22-slim` image.
- **User**: non-root. `oven/bun` ships with UID 1000 `bun` user.
- **Target size**: <200MB compressed.
- **Volumes**: none. Broker is stateless.
### Build stages (recommended)
1. **deps**: Node + pnpm + full workspace → `pnpm install --frozen-lockfile --ignore-scripts`
2. **runtime**: Bun + copy node_modules + copy only needed workspace packages:
   - `apps/broker/`
   - `packages/db/`
   - `packages/shared/`
   - `tooling/typescript/`
   - root metadata (`package.json`, `pnpm-workspace.yaml`, `pnpm-lock.yaml`, `tsconfig.json`)
### Build args
- `GIT_SHA` SHOULD be passed at build time and forwarded as ENV so
`/health` surfaces the image commit:
```dockerfile
ARG GIT_SHA
ENV GIT_SHA=$GIT_SHA
```
CI should set `--build-arg GIT_SHA=${GITHUB_SHA:0:7}` (or equivalent).
## Dependencies
Runtime needs reachable:
- **Postgres 15+** with `pgvector` extension enabled (the broker itself
doesn't use vector, but shared migrations do — if you deploy the
broker-only migration subset you can drop pgvector).
- No other external services. No Redis, no queue, no cache.
## Deployment targets (authoritative lane)
- **Production**: OVH VPS via Coolify, Traefik-fronted. Internal port
7900 → Traefik → `ic.claudemesh.com:443`. Separate deploy lane owns
Traefik labels, TLS, DNS, compose.
- **Test DB on CI**: spin up `pgvector/pgvector:pg17`, create the
`claudemesh_test` database, run migrations, then run `pnpm test` in
`apps/broker`. See "CI integration" below.
## CI integration
The test suite requires a live Postgres. Suggested GitHub Actions job fragment:
```yaml
# Fragment of a job definition. Assumes pnpm is already available on the runner.
services:
  postgres:
    image: pgvector/pgvector:pg17
    env:
      POSTGRES_USER: turbostarter
      POSTGRES_PASSWORD: turbostarter
      POSTGRES_DB: claudemesh_test
    ports: ['5440:5432']
    options: >-
      --health-cmd="pg_isready -U turbostarter"
      --health-interval=5s
steps:
  - uses: actions/checkout@v4
  - run: pnpm install --frozen-lockfile
  # DATABASE_URL uses the postgres:// scheme required by the broker.
  - run: cd packages/db && pnpm exec drizzle-kit migrate
    env: { DATABASE_URL: 'postgres://turbostarter:turbostarter@127.0.0.1:5440/claudemesh_test' }
  - run: cd apps/broker && pnpm test
    env: { DATABASE_URL: 'postgres://turbostarter:turbostarter@127.0.0.1:5440/claudemesh_test' }
```
## Metrics
Scraped by Prometheus via `GET /metrics`. Key series:
- `broker_connections_active` (gauge)
- `broker_connections_total` (counter)
- `broker_connections_rejected_total{reason}` (counter: capacity, unauthorized)
- `broker_messages_routed_total{priority}` (counter: now, next, low)
- `broker_messages_rejected_total{reason}` (counter)
- `broker_queue_depth` (gauge — undelivered messages)
- `broker_ttl_sweeps_total{flipped}` (counter)
- `broker_hook_requests_total` (counter)
- `broker_hook_requests_rate_limited_total` (counter)
- `broker_db_healthy` (gauge: 0 or 1)

Alert recommendations:
- `broker_db_healthy == 0` for > 60s → page oncall
- `broker_queue_depth > 10000` → investigate
- `broker_connections_rejected_total{reason="capacity"}` rising → scale
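
The series listed above are served in the Prometheus text exposition format. The sketch below shows how a handful of samples could be rendered into that format. `renderMetrics`, `Sample`, and `escapeLabel` are hypothetical names, the sketch assumes the caller passes samples already grouped by metric name, and it says nothing about how the broker actually collects its counters.

```typescript
type Labels = Record<string, string>;

interface Sample {
  name: string;
  type: "gauge" | "counter";
  labels?: Labels;
  value: number;
}

// Escapes backslash, double quote, and newline inside a label value.
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Renders samples as text exposition lines, with one TYPE line per metric name.
function renderMetrics(samples: Sample[]): string {
  const lines: string[] = [];
  const typed = new Set<string>();
  for (const sample of samples) {
    if (!typed.has(sample.name)) {
      lines.push(`# TYPE ${sample.name} ${sample.type}`);
      typed.add(sample.name);
    }
    const pairs = Object.entries(sample.labels ?? {}).map(
      ([key, value]) => `${key}="${escapeLabel(value)}"`,
    );
    const labelText = pairs.length > 0 ? `{${pairs.join(",")}}` : "";
    lines.push(`${sample.name}${labelText} ${sample.value}`);
  }
  return lines.join("\n") + "\n";
}
```

A labeled counter such as `broker_connections_rejected_total` with `reason="capacity"` then renders as a single line that the alert expressions above can match on.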
## Logs
Structured JSON, one line per event, written to stderr. No in-container
log shipper is required: the orchestrator's stderr capture can feed
Loki, Datadog, or CloudWatch directly, with no custom parsing.

Key events: `broker listening`, `ws hello`, `ws close`, `ws set_status`,
`hook` (with `cwd`, `pid`, `status`, `presence_id`, `pending`), `shutdown signal`,
`shutdown complete`, `db healthy`, `db ping failed`.
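
A one-line-per-event JSON logger of this kind could look like the sketch below. `formatLogLine` and `logEvent` are hypothetical names, and the `ts` and `event` field names are assumptions; this spec lists event names and the `hook` fields but does not define the full log schema.

```typescript
// Builds a single-line JSON record. JSON.stringify escapes embedded newlines,
// so each event always occupies exactly one line.
function formatLogLine(
  event: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  // `ts` and `event` are written last so caller-supplied fields cannot override them.
  return JSON.stringify({ ...fields, ts: now.toISOString(), event });
}

// console.error writes to stderr in both Bun and Node.
function logEvent(event: string, fields: Record<string, unknown> = {}): void {
  console.error(formatLogLine(event, fields));
}
```

For instance, `logEvent("hook", { cwd: "/repo", pid: 123, status: "working" })` emits one JSON object on stderr that log collectors can ingest without a parsing rule.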