From 5f8567614a123f18fdabaecf0709e43022f3558e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Alejandro=20Guti=C3=A9rrez?= <35082514+alezmad@users.noreply.github.com> Date: Sat, 4 Apr 2026 22:15:24 +0100 Subject: [PATCH] docs(broker): production deployment spec Authoritative runtime contract for the broker. Documents: - HTTP + WS routes (single-port architecture) - Required + optional env vars (DATABASE_URL, caps, TTLs, limits) - /health and /metrics semantics, including 503 behavior on DB drop - SIGTERM/SIGINT graceful shutdown sequence - Recommended multi-stage Docker build (node:slim for pnpm, oven/bun for runtime) with GIT_SHA build-arg convention - Signal/grace-period guidance for orchestrators - Prometheus metric names + suggested alert thresholds - CI pattern for the test suite (needs a live Postgres) - Deployment target hand-off to the deploy lane Complements the existing Dockerfile (claudemesh-3's work) with the runtime contract the Dockerfile implements. Co-Authored-By: Claude Opus 4.6 (1M context) --- apps/broker/DEPLOY_SPEC.md | 182 +++++++++++++++++++++++++++++++++++++ 1 file changed, 182 insertions(+) create mode 100644 apps/broker/DEPLOY_SPEC.md diff --git a/apps/broker/DEPLOY_SPEC.md b/apps/broker/DEPLOY_SPEC.md new file mode 100644 index 0000000..411cd9b --- /dev/null +++ b/apps/broker/DEPLOY_SPEC.md @@ -0,0 +1,182 @@ +# @claudemesh/broker — Deployment Spec + +Runtime contract for deploying the broker. Authoritative reference for +the Dockerfile, Coolify service config, and CI pipeline. Owned by the +broker lane; consumed by the deploy lane. + +## Runtime + +- **Entry point**: `bun apps/broker/src/index.ts` (TypeScript executed + directly by Bun, no compile step). +- **Single process**. Stateless — all persistence is in Postgres. +- **Single port**: HTTP + WebSocket multiplexed over one TCP port. + WS upgrades match path `/ws`; all other requests route to HTTP. + +## Routes + +| Path | Method | Purpose | +| ------------------- | ---------- | ----------------------------------------------- | +| `/ws` | GET/UPGRADE| Authenticated peer connections (WebSocket) | +| `/hook/set-status` | POST | Claude Code hook scripts report peer status | +| `/health` | GET | Liveness + build info. 503 if Postgres is down. | +| `/metrics` | GET | Prometheus plaintext metrics | + +## Environment variables + +### Required + +| Var | Format | Notes | +| -------------- | ----------------------------------------- | ---------------------------- | +| `DATABASE_URL` | `postgres://user:pass@host:port/db` | Must use postgres:// scheme | + +### Optional (with defaults) + +| Var | Default | Range | Purpose | +| --------------------------- | ------- | ------------------ | ---------------------------------------------------- | +| `BROKER_PORT` | `7900` | any free port | Single port for HTTP + WS | +| `STATUS_TTL_SECONDS` | `60` | > 0 | Flip stuck "working" peers to idle after this TTL | +| `HOOK_FRESH_WINDOW_SECONDS` | `30` | > 0 | Window during which a hook signal beats JSONL infer | +| `MAX_CONNECTIONS_PER_MESH` | `100` | > 0 | Refuse new WS at capacity with close code 1008 | +| `MAX_MESSAGE_BYTES` | `65536` | > 0 | Max WS payload and hook POST body size | +| `HOOK_RATE_LIMIT_PER_MIN` | `30` | > 0 | Per-(pid,cwd) token bucket on /hook/set-status | +| `NODE_ENV` | `development` | dev/prod/test | Standard | +| `GIT_SHA` | — | hex string | Preferred over `git rev-parse` fallback, for image builds | + +No secrets baked into the image — everything via env at runtime. + +## Healthcheck + +Container healthcheck SHOULD hit `/health`: + +```dockerfile +HEALTHCHECK --interval=15s --timeout=5s --start-period=10s --retries=3 \ + CMD bun -e "fetch('http://localhost:7900/health').then(r=>{process.exit(r.ok?0:1)}).catch(()=>process.exit(1))" +``` + +`/health` returns `200` with: +```json +{ + "status": "ok", + "db": "up", + "version": "0.1.0", + "gitSha": "84e14ff", + "uptime": 123 +} +``` + +Returns `503` when Postgres is unreachable (`"status":"degraded","db":"down"`). +The broker does NOT exit on transient DB failures — it keeps serving +and recovers automatically when the DB comes back. + +## Signals + +- `SIGTERM` and `SIGINT` → graceful shutdown: + 1. Stop background sweepers (TTL, pending-status, DB ping). + 2. Close all WS connections with code `1001`. + 3. Mark all active presences as `disconnectedAt=now` in Postgres. + 4. Close HTTP server. + 5. Exit 0. + +Grace period: ~5s typical. Orchestrators should allow ≥10s before +sending SIGKILL. + +## Image + +- **Base**: `oven/bun:1.2-slim` for runtime (Bun executes TS directly). + pnpm-install stage can use a separate `node:22-slim` image. +- **User**: non-root. `oven/bun` ships with UID 1000 `bun` user. +- **Target size**: <200MB compressed. +- **Volumes**: none. Broker is stateless. + +### Build stages (recommended) + +1. **deps**: Node + pnpm + full workspace → `pnpm install --frozen-lockfile --ignore-scripts` +2. **runtime**: Bun + copy node_modules + copy only needed workspace packages: + - `apps/broker/` + - `packages/db/` + - `packages/shared/` + - `tooling/typescript/` + - root metadata (`package.json`, `pnpm-workspace.yaml`, `pnpm-lock.yaml`, `tsconfig.json`) + +### Build args + +- `GIT_SHA` SHOULD be passed at build time and forwarded as ENV so + `/health` surfaces the image commit: + ```dockerfile + ARG GIT_SHA + ENV GIT_SHA=$GIT_SHA + ``` + CI should set `--build-arg GIT_SHA=${GITHUB_SHA:0:7}` (or equivalent). + +## Dependencies + +Runtime needs reachable: +- **Postgres 15+** with `pgvector` extension enabled (the broker itself + doesn't use vector, but shared migrations do — if you deploy the + broker-only migration subset you can drop pgvector). +- No other external services. No Redis, no queue, no cache. + +## Deployment targets (authoritative lane) + +- **Production**: OVH VPS via Coolify, Traefik-fronted. Internal port + 7900 → Traefik → `ic.claudemesh.com:443`. Separate deploy lane owns + Traefik labels, TLS, DNS, compose. +- **Test DB on CI**: spin up pgvector/pgvector:pg17, create + `claudemesh_test` database, run migrations, then `pnpm test` in + `apps/broker`. See below. + +## CI integration + +Test suite requires a live Postgres. Suggested GitHub Actions step: + +```yaml +services: + postgres: + image: pgvector/pgvector:pg17 + env: + POSTGRES_USER: turbostarter + POSTGRES_PASSWORD: turbostarter + POSTGRES_DB: claudemesh_test + ports: ['5440:5432'] + options: >- + --health-cmd="pg_isready -U turbostarter" + --health-interval=5s + +steps: + - uses: actions/checkout@v4 + - run: pnpm install --frozen-lockfile + - run: cd packages/db && pnpm exec drizzle-kit migrate + env: { DATABASE_URL: 'postgresql://turbostarter:turbostarter@127.0.0.1:5440/claudemesh_test' } + - run: cd apps/broker && pnpm test + env: { DATABASE_URL: 'postgresql://turbostarter:turbostarter@127.0.0.1:5440/claudemesh_test' } +``` + +## Metrics + +Scraped by Prometheus via `GET /metrics`. Key series: + +- `broker_connections_active` (gauge) +- `broker_connections_total` (counter) +- `broker_connections_rejected_total{reason}` (counter: capacity, unauthorized) +- `broker_messages_routed_total{priority}` (counter: now, next, low) +- `broker_messages_rejected_total{reason}` (counter) +- `broker_queue_depth` (gauge — undelivered messages) +- `broker_ttl_sweeps_total{flipped}` (counter) +- `broker_hook_requests_total` (counter) +- `broker_hook_requests_rate_limited_total` (counter) +- `broker_db_healthy` (gauge: 0 or 1) + +Alert recommendations: +- `broker_db_healthy == 0` for > 60s → page oncall +- `broker_queue_depth > 10000` → investigate +- `broker_connections_rejected_total{reason="capacity"}` rising → scale + +## Logs + +Structured JSON, one line per event, stderr. No log aggregation +required — suitable for stdout/stderr capture and direct ingestion +into Loki/Datadog/CloudWatch without parsing. + +Key events: `broker listening`, `ws hello`, `ws close`, `ws set_status`, +`hook` (with `cwd`, `pid`, `status`, `presence_id`, `pending`), `shutdown signal`, +`shutdown complete`, `db healthy`, `db ping failed`.