Files
claudemesh/apps/broker/DEPLOY_SPEC.md
Alejandro Gutiérrez 5f8567614a docs(broker): production deployment spec
Authoritative runtime contract for the broker. Documents:
- HTTP + WS routes (single-port architecture)
- Required + optional env vars (DATABASE_URL, caps, TTLs, limits)
- /health and /metrics semantics, including 503 behavior on DB drop
- SIGTERM/SIGINT graceful shutdown sequence
- Recommended multi-stage Docker build (node:slim for pnpm, oven/bun
  for runtime) with GIT_SHA build-arg convention
- Signal/grace-period guidance for orchestrators
- Prometheus metric names + suggested alert thresholds
- CI pattern for the test suite (needs a live Postgres)
- Deployment target hand-off to the deploy lane

Complements the existing Dockerfile (claudemesh-3's work) with the
runtime contract the Dockerfile implements.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:15:24 +01:00

7.0 KiB

@claudemesh/broker — Deployment Spec

Runtime contract for deploying the broker. Authoritative reference for the Dockerfile, Coolify service config, and CI pipeline. Owned by the broker lane; consumed by the deploy lane.

Runtime

  • Entry point: bun apps/broker/src/index.ts (TypeScript executed directly by Bun, no compile step).
  • Single process. Stateless — all persistence is in Postgres.
  • Single port: HTTP + WebSocket multiplexed over one TCP port. WS upgrades match path /ws; all other requests route to HTTP.

Routes

Path Method Purpose
/ws GET/UPGRADE Authenticated peer connections (WebSocket)
/hook/set-status POST Claude Code hook scripts report peer status
/health GET Liveness + build info. 503 if Postgres is down.
/metrics GET Prometheus plaintext metrics

Environment variables

Required

Var Format Notes
DATABASE_URL postgres://user:pass@host:port/db Must use postgres:// scheme

Optional (with defaults)

Var Default Range Purpose
BROKER_PORT 7900 any free port Single port for HTTP + WS
STATUS_TTL_SECONDS 60 > 0 Flip stuck "working" peers to idle after this TTL
HOOK_FRESH_WINDOW_SECONDS 30 > 0 Window during which a hook signal beats JSONL infer
MAX_CONNECTIONS_PER_MESH 100 > 0 Refuse new WS at capacity with close code 1008
MAX_MESSAGE_BYTES 65536 > 0 Max WS payload and hook POST body size
HOOK_RATE_LIMIT_PER_MIN 30 > 0 Per-(pid,cwd) token bucket on /hook/set-status
NODE_ENV development dev/prod/test Standard
GIT_SHA hex string Preferred over git rev-parse fallback, for image builds

No secrets baked into the image — everything via env at runtime.

Healthcheck

Container healthcheck SHOULD hit /health:

HEALTHCHECK --interval=15s --timeout=5s --start-period=10s --retries=3 \
  CMD bun -e "fetch('http://localhost:7900/health').then(r=>{process.exit(r.ok?0:1)}).catch(()=>process.exit(1))"

/health returns 200 with:

{
  "status": "ok",
  "db": "up",
  "version": "0.1.0",
  "gitSha": "84e14ff",
  "uptime": 123
}

Returns 503 when Postgres is unreachable ("status":"degraded","db":"down"). The broker does NOT exit on transient DB failures — it keeps serving and recovers automatically when the DB comes back.

Signals

  • SIGTERM and SIGINT → graceful shutdown:
    1. Stop background sweepers (TTL, pending-status, DB ping).
    2. Close all WS connections with code 1001.
    3. Mark all active presences as disconnectedAt=now in Postgres.
    4. Close HTTP server.
    5. Exit 0.

Grace period: ~5s typical. Orchestrators should allow ≥10s before sending SIGKILL.

Image

  • Base: oven/bun:1.2-slim for runtime (Bun executes TS directly). pnpm-install stage can use a separate node:22-slim image.
  • User: non-root. oven/bun ships with UID 1000 bun user.
  • Target size: <200MB compressed.
  • Volumes: none. Broker is stateless.
  1. deps: Node + pnpm + full workspace → pnpm install --frozen-lockfile --ignore-scripts
  2. runtime: Bun + copy node_modules + copy only needed workspace packages:
    • apps/broker/
    • packages/db/
    • packages/shared/
    • tooling/typescript/
    • root metadata (package.json, pnpm-workspace.yaml, pnpm-lock.yaml, tsconfig.json)

Build args

  • GIT_SHA SHOULD be passed at build time and forwarded as ENV so /health surfaces the image commit:
    ARG GIT_SHA
    ENV GIT_SHA=$GIT_SHA
    
    CI should set --build-arg GIT_SHA=${GITHUB_SHA:0:7} (or equivalent).

Dependencies

Runtime needs reachable:

  • Postgres 15+ with pgvector extension enabled (the broker itself doesn't use vector, but shared migrations do — if you deploy the broker-only migration subset you can drop pgvector).
  • No other external services. No Redis, no queue, no cache.

Deployment targets (authoritative lane)

  • Production: OVH VPS via Coolify, Traefik-fronted. Internal port 7900 → Traefik → ic.claudemesh.com:443. Separate deploy lane owns Traefik labels, TLS, DNS, compose.
  • Test DB on CI: spin up pgvector/pgvector:pg17, create claudemesh_test database, run migrations, then pnpm test in apps/broker. See below.

CI integration

Test suite requires a live Postgres. Suggested GitHub Actions step:

services:
  postgres:
    image: pgvector/pgvector:pg17
    env:
      POSTGRES_USER: turbostarter
      POSTGRES_PASSWORD: turbostarter
      POSTGRES_DB: claudemesh_test
    ports: ['5440:5432']
    options: >-
      --health-cmd="pg_isready -U turbostarter"
      --health-interval=5s

steps:
  - uses: actions/checkout@v4
  - run: pnpm install --frozen-lockfile
  - run: cd packages/db && pnpm exec drizzle-kit migrate
    env: { DATABASE_URL: 'postgresql://turbostarter:turbostarter@127.0.0.1:5440/claudemesh_test' }
  - run: cd apps/broker && pnpm test
    env: { DATABASE_URL: 'postgresql://turbostarter:turbostarter@127.0.0.1:5440/claudemesh_test' }

Metrics

Scraped by Prometheus via GET /metrics. Key series:

  • broker_connections_active (gauge)
  • broker_connections_total (counter)
  • broker_connections_rejected_total{reason} (counter: capacity, unauthorized)
  • broker_messages_routed_total{priority} (counter: now, next, low)
  • broker_messages_rejected_total{reason} (counter)
  • broker_queue_depth (gauge — undelivered messages)
  • broker_ttl_sweeps_total{flipped} (counter)
  • broker_hook_requests_total (counter)
  • broker_hook_requests_rate_limited_total (counter)
  • broker_db_healthy (gauge: 0 or 1)

Alert recommendations:

  • broker_db_healthy == 0 for > 60s → page oncall
  • broker_queue_depth > 10000 → investigate
  • broker_connections_rejected_total{reason="capacity"} rising → scale

Logs

Structured JSON, one line per event, stderr. No log aggregation required — suitable for stdout/stderr capture and direct ingestion into Loki/Datadog/CloudWatch without parsing.

Key events: broker listening, ws hello, ws close, ws set_status, hook (with cwd, pid, status, presence_id, pending), shutdown signal, shutdown complete, db healthy, db ping failed.