Backwards compat shim (task 27)
- requireCliAuth() falls back to body.user_id when BROKER_LEGACY_AUTH=1 and no bearer present. Sets Deprecation + Warning headers + bumps a broker_legacy_auth_hits_total metric so operators can watch the legacy traffic drain to 0 before removing the shim.
- All handlers parse body BEFORE requireCliAuth so the fallback can read user_id out of it.

HA readiness (task 29)
- .artifacts/specs/2026-04-15-broker-ha-statelessness-audit.md documents every in-memory symbol and rollout plan (phase 0-4).
- packaging/docker-compose.ha-local.yml spins up 2 broker replicas behind Traefik sticky sessions for local smoke testing.
- apps/broker/src/audit.ts now wraps writes in a transaction that takes pg_advisory_xact_lock(meshId) and re-reads the tail hash inside the txn. Concurrent broker replicas can no longer fork the audit chain.

Deploy gate (task 30)
- /health stays permissive (200 even on transient DB blips) so Docker doesn't kill the container on a glitch.
- New /health/ready checks DB + optional EXPECTED_MIGRATION pin, returns 503 if either fails. External deploy gate can poll this and refuse to promote a broken deploy.

Metrics dashboard (task 32)
- packaging/grafana/claudemesh-broker.json: ready-to-import Grafana dashboard covering active conns, queue depth, routed/rejected rates, grant drops, legacy-auth hits, conn rejects.

Tests (task 28)
- audit-canonical.test.ts (4 tests) pins canonical JSON semantics.
- grants-enforcement.test.ts (6 tests) covers the member-then-session-pubkey lookup with default/explicit/blocked branches.

Docs (task 34)
- docs/env-vars.md catalogues every env var the broker + CLI read.

Crypto review prep (task 35)
- .artifacts/specs/2026-04-15-crypto-review-packet.md: reviewer brief, threat model, scope, test coverage list, deliverables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Broker HA readiness — statelessness audit
Single-instance broker is the biggest GA blocker. Moving to 2+ replicas behind a load balancer requires first understanding which state the broker holds in-process that breaks if split across nodes.
Current in-process state (apps/broker/src/index.ts)
| Symbol | Line | Per-node? | Survives HA? | Notes |
|---|---|---|---|---|
| `connections` | 147 | yes (WS state) | ✅ naturally per-node | WS connections are pinned to a node by L7 routing. Each node holds only its own connections. OK as long as the LB uses sticky sessions or cross-node fan-out. |
| `connectionsPerMesh` | 148 | yes | 🟡 per-node count, not global | Used for capacity cap. Global cap requires Redis. |
| `tgTokenRateLimit` | 151 | yes | 🟡 per-node | Telegram bot rate limiting; tolerable as per-node. |
| `urlWatches` | 173 | yes | 🔴 stuck on one node | If peer disconnects from node A and reconnects on B, the watch stays orphaned on A. Needs DB/Redis, or "pin to owning node". Acceptable risk if watches are per-session ephemeral. |
| `streamSubscriptions` | 259 | yes | 🔴 multi-node broken | Sub on A, publish on B → message never reaches A's subscribers. Needs Redis pub/sub for HA. |
| `meshClocks` | 270 | yes | 🔴 multi-node broken | Simulated clocks must be single-authority. Solve by pinning one node as clock leader (simple leader election) or by moving clock state to DB. |
| `mcpRegistry` | 327 | yes | 🔴 multi-node broken | MCP server catalog cached in memory. If deployed on A but called on B, B doesn't know it exists. Must be DB-backed (partly is already; see mesh_service table). Audit the cache/DB sync path. |
| `mcpCallResolvers` | 338 | yes | ✅ per-call ephemeral | In-flight callback resolvers; WS sticks to owning node so this is fine. |
| `scheduledMessages` | 359 | yes | 🔴 multi-node broken | Scheduled delivery timers live in-process. Restart loses them. Persistence exists (scheduled_message table) + recovery on startup, but two nodes could both fire the same timer. Needs a leader lock or per-schedule pg_advisory_lock on fire. |
| `sendRateLimit` | index.ts:494 | yes | 🟡 per-node | Each node enforces its own quota; a client spread across nodes could 2x the limit. Tolerable if sticky sessions hold. |
| `hookRateLimit` | index.ts:482 | yes | 🟡 per-node | Same as sendRateLimit. |
| `lastHash` | audit.ts:22 | yes | 🔴 broken on write | Two nodes writing audit rows concurrently will BOTH read the same last hash, BOTH compute a new hash, and both INSERT: the chain forks. Needs SELECT FOR UPDATE or a single audit writer. |
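
To make the streamSubscriptions row concrete, here is a minimal in-memory sketch of the gap and of the fan-out that closes it. Every name in it (`BrokerNode`, `SharedBus`, `runScenario`) is hypothetical and not taken from the broker source, and `SharedBus` is only a stand-in for an external channel such as Redis pub/sub.

```typescript
// Hypothetical model of two broker nodes; none of these names exist in index.ts.
type Handler = (payload: string) => void;
type BusListener = (topic: string, payload: string) => void;

// Stand-in for an external pub/sub channel shared by all nodes.
class SharedBus {
  private listeners: BusListener[] = [];

  attach(listener: BusListener): void {
    this.listeners.push(listener);
  }

  broadcast(topic: string, payload: string): void {
    this.listeners.forEach((listener) => listener(topic, payload));
  }
}

class BrokerNode {
  // Per-process map, analogous to the in-memory streamSubscriptions symbol.
  private subs = new Map<string, Set<Handler>>();
  private bus: SharedBus | undefined;

  constructor(bus?: SharedBus) {
    this.bus = bus;
    // With a bus, each node relays bus traffic to its own local subscribers.
    if (bus) bus.attach((topic, payload) => this.deliverLocal(topic, payload));
  }

  subscribe(topic: string, handler: Handler): void {
    const set = this.subs.get(topic) ?? new Set<Handler>();
    set.add(handler);
    this.subs.set(topic, set);
  }

  publish(topic: string, payload: string): void {
    if (this.bus) this.bus.broadcast(topic, payload); // fan out to every node
    else this.deliverLocal(topic, payload); // local subscribers only
  }

  private deliverLocal(topic: string, payload: string): void {
    this.subs.get(topic)?.forEach((handler) => handler(payload));
  }
}

// Subscribe on node A, publish on node B, and report what A's subscriber saw.
function runScenario(bus?: SharedBus): string[] {
  const received: string[] = [];
  const nodeA = new BrokerNode(bus);
  const nodeB = new BrokerNode(bus);
  nodeA.subscribe("mesh-1/stream", (payload) => received.push(payload));
  nodeB.publish("mesh-1/stream", "hello");
  return received;
}

const withoutBus = runScenario(); // per-node maps only: the message is lost
const withBus = runScenario(new SharedBus()); // shared channel: A's subscriber is reached
```

The same shape applies to any symbol marked 🔴 for fan-out reasons: the per-node map can stay, as long as publishes travel through something every node listens to.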
Conclusion
The current broker is NOT HA-safe. Six symbols break under multi-instance deployment:
`urlWatches`, `streamSubscriptions`, `meshClocks`, the `mcpRegistry` cache,
`scheduledMessages`, and `lastHash`. None are unsolvable, but none are
trivial.
Rollout plan for HA
Phase 0 (now) — sticky sessions
Deploy a single broker behind Traefik with `loadBalancer.sticky.cookie`
enabled. The WS upgrade request carries the cookie, so reconnects land on the
same node. This gives us one node of safe HA headroom (i.e., one deploy
rollover without user-visible disconnection) without any code changes.
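
The effect of a sticky cookie on reconnects can be modelled in a few lines. This is a simplified illustration of cookie-based stickiness in general, not Traefik's implementation; `StickyBalancer`, its methods, and the node names are all hypothetical.

```typescript
// Hypothetical model of cookie-based stickiness; not how Traefik is implemented.
interface RoutingResult {
  node: string;
  setCookie?: string; // present only when the balancer assigns a new node
}

class StickyBalancer {
  private nodes: string[];
  private next = 0;

  constructor(nodes: string[]) {
    this.nodes = nodes.slice();
  }

  route(cookie?: string): RoutingResult {
    if (this.nodes.length === 0) throw new Error("no healthy nodes");
    // A request whose cookie names a live node goes back to that node.
    if (cookie !== undefined && this.nodes.indexOf(cookie) !== -1) {
      return { node: cookie };
    }
    // Otherwise pick round-robin and hand the client a cookie for next time.
    const node = this.nodes[this.next % this.nodes.length];
    this.next += 1;
    return { node, setCookie: node };
  }

  // Simulates a node dropping out, for example during a deploy rollover.
  remove(node: string): void {
    this.nodes = this.nodes.filter((n) => n !== node);
  }
}

const lb = new StickyBalancer(["broker-a", "broker-b"]);
const firstConnect = lb.route(); // no cookie yet: assigned a node
const reconnect = lb.route(firstConnect.setCookie); // cookie presented: same node
lb.remove("broker-a");
const afterFailure = lb.route(firstConnect.setCookie); // stale cookie: reassigned
```

The last call is the case that matters for the in-memory symbols above: once the owning node is gone, the client lands elsewhere and any state that lived only on the old node is lost.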
Phase 1 — Active/passive
Two replicas. Traefik routes all traffic to primary; secondary is warm. Primary fails → secondary takes over, all WS connections reset. No code change needed; clients auto-reconnect.
Phase 2 — Active/active for stateless routes
HTTP-only routes (/cli/*, /download, /hook) can round-robin across
any number of replicas today. WS routes stay sticky per mesh via Traefik
`sticky.cookie`. The HTTP routes are already backed by Postgres, so each
replica reads the same mesh/member/invite rows.
Phase 3 — Full active/active
Migrate the 6 problematic in-memory symbols:
- `streamSubscriptions` → Redis pub/sub
- `meshClocks` → leader-elect via Postgres advisory lock on mesh_id
- `scheduledMessages` → single-writer pattern: whichever replica holds `pg_advisory_xact_lock(schedule_id)` fires
- `urlWatches` → DB-backed + each replica owns watches where `presence.node_id = this_node`
- `mcpRegistry` → rely on `mesh_service` table, drop the in-memory cache
- `lastHash` → wrap audit.ts writes in a transaction that `SELECT hash FROM audit_log ... ORDER BY id DESC FOR UPDATE`, making concurrent inserts serialize.
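
The single-writer pattern for scheduledMessages can be sketched as follows. Two things here are assumptions rather than the broker's code: the sketch uses the non-blocking `pg_try_advisory_xact_lock` where the plan names the blocking `pg_advisory_xact_lock`, and the `fired_at` column, the `ScheduleStore` interface, and `fireOnce` are hypothetical. An in-memory store stands in for Postgres so the logic can be read and run on its own.

```typescript
// Against Postgres, the steps below would run inside one transaction, roughly:
//   SELECT pg_try_advisory_xact_lock($1);                        -- $1 = numeric schedule id
//   SELECT fired_at FROM scheduled_message WHERE id = $1;         -- fired_at is an assumed column
//   UPDATE scheduled_message SET fired_at = now() WHERE id = $1;
// Transaction-level advisory locks are released automatically at commit or rollback.

interface ScheduleStore {
  tryLock(scheduleId: number, owner: string): boolean;
  unlock(scheduleId: number, owner: string): void;
  isFired(scheduleId: number): boolean;
  markFired(scheduleId: number): void;
}

// In-memory stand-in for the database, used only to exercise the logic.
class InMemoryScheduleStore implements ScheduleStore {
  private locks = new Map<number, string>();
  private fired = new Set<number>();

  tryLock(scheduleId: number, owner: string): boolean {
    const holder = this.locks.get(scheduleId);
    if (holder !== undefined && holder !== owner) return false;
    this.locks.set(scheduleId, owner);
    return true;
  }

  unlock(scheduleId: number, owner: string): void {
    if (this.locks.get(scheduleId) === owner) this.locks.delete(scheduleId);
  }

  isFired(scheduleId: number): boolean {
    return this.fired.has(scheduleId);
  }

  markFired(scheduleId: number): void {
    this.fired.add(scheduleId);
  }
}

// Returns true only for the replica that actually delivered the message.
function fireOnce(
  store: ScheduleStore,
  replica: string,
  scheduleId: number,
  deliver: () => void,
): boolean {
  if (!store.tryLock(scheduleId, replica)) return false; // another replica holds the lock
  try {
    if (store.isFired(scheduleId)) return false; // an earlier holder already delivered
    deliver();
    store.markFired(scheduleId);
    return true;
  } finally {
    store.unlock(scheduleId, replica); // mirrors lock release at transaction end
  }
}

const scheduleStore = new InMemoryScheduleStore();
let deliveries = 0;
const countDelivery = (): void => {
  deliveries += 1;
};

// Both replicas recovered the same timer on startup and both try to fire it.
const firedByA = fireOnce(scheduleStore, "replica-a", 42, countDelivery);
const firedByB = fireOnce(scheduleStore, "replica-b", 42, countDelivery);

// While replica-a still holds the lock for another schedule, replica-b backs off.
scheduleStore.tryLock(7, "replica-a");
const skippedWhileLocked = fireOnce(scheduleStore, "replica-b", 7, countDelivery);
```

The re-check of the fired flag under the lock is what makes the pattern safe with the blocking lock variant too: a replica that waited for the lock still finds the schedule already fired and skips it. Note that a crash between `deliver()` and `markFired()` would lead to a second delivery, so this gives at-least-once rather than exactly-once semantics.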
Phase 4 — Multi-region
The deployment has a single point of failure at Frankfurt (OVH). Move to a managed Postgres with read replicas, one broker cluster per region, and global DNS geo-routing. Out of scope for v1.0.0.
Immediate ship: local docker-compose for 2-replica smoke test
packaging/docker-compose.ha-local.yml (TODO) spins up:
- 2x broker (same DATABASE_URL)
- 1x postgres
- 1x traefik with sticky cookie
- 1x locust / synthetic client
Tests:
- Send to peer connected on node A → delivered.
- Subscribe on A, publish on B → expect failure (documents the gap).
- Kill node A → client reconnects to B within Xs.
- Audit chain verify after concurrent writes from both nodes → expect a fork (documents the gap).
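
The audit-chain check in the last test can be illustrated with a small in-memory model. It is a sketch under stated assumptions: the row shape, `toyHash`, `insertRow`, and `findBreak` are hypothetical, and `toyHash` is a non-cryptographic FNV-1a stand-in chosen only to keep the example import-free. It is not the hash that audit.ts uses.

```typescript
// Hypothetical row shape; the real audit_log schema may differ.
interface AuditRow {
  id: number;
  prevHash: string;
  hash: string;
  payload: string;
}

const GENESIS = "genesis";

// FNV-1a, 32-bit. NOT cryptographic: a stand-in so the sketch needs no imports.
function toyHash(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ("00000000" + h.toString(16)).slice(-8);
}

// Hash of the newest row, or the genesis marker for an empty log.
function tailHash(log: AuditRow[]): string {
  return log.length === 0 ? GENESIS : log[log.length - 1].hash;
}

// Appends a row that claims prevHash as its predecessor.
function insertRow(log: AuditRow[], prevHash: string, payload: string): void {
  log.push({
    id: log.length + 1,
    prevHash,
    hash: toyHash(prevHash + "|" + payload),
    payload,
  });
}

// Returns the id of the first row that does not extend its predecessor, or null if intact.
function findBreak(log: AuditRow[]): number | null {
  let expectedPrev = GENESIS;
  for (const row of log) {
    const recomputed = toyHash(row.prevHash + "|" + row.payload);
    if (row.prevHash !== expectedPrev || row.hash !== recomputed) return row.id;
    expectedPrev = row.hash;
  }
  return null;
}

// Racy writers: both nodes read the tail before either one inserts.
const racyLog: AuditRow[] = [];
insertRow(racyLog, tailHash(racyLog), "boot");
const tailSeenByA = tailHash(racyLog);
const tailSeenByB = tailHash(racyLog);
insertRow(racyLog, tailSeenByA, "event from node A");
insertRow(racyLog, tailSeenByB, "event from node B");

// Serialized writers: each one re-reads the tail immediately before inserting,
// which is what a lock held across read and insert would enforce.
const serializedLog: AuditRow[] = [];
insertRow(serializedLog, tailHash(serializedLog), "boot");
insertRow(serializedLog, tailHash(serializedLog), "event from node A");
insertRow(serializedLog, tailHash(serializedLog), "event from node B");

const racyBreak = findBreak(racyLog); // id of the row that forked the chain
const serializedBreak = findBreak(serializedLog); // null: chain is intact
```

In the racy log, the row from node B still points at the boot row, so verification flags it as the fork point. In the smoke test, the equivalent check would run against the real audit table after both replicas have written concurrently.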
Decision
Ship v1.0.0 on sticky-session single-writer (Phase 0 + Phase 1 warm standby). That closes the "what happens on deploy" story. Phase 3 full HA is v1.1.0 work.