alezmad/claudemesh

Alejandro Gutiérrez 05729ad8a4

feat(ga): close remaining GA blockers (backcompat, HA prep, tests, docs)

Backwards compat shim (task 27)
- requireCliAuth() falls back to body.user_id when BROKER_LEGACY_AUTH=1
  and no bearer present. Sets Deprecation + Warning headers + bumps a
  broker_legacy_auth_hits_total metric so operators can watch the
  legacy traffic drain to 0 before removing the shim.
- All handlers parse body BEFORE requireCliAuth so the fallback can
  read user_id out of it.

HA readiness (task 29)
- .artifacts/specs/2026-04-15-broker-ha-statelessness-audit.md
  documents every in-memory symbol and rollout plan (phase 0-4).
- packaging/docker-compose.ha-local.yml spins up 2 broker replicas
  behind Traefik sticky sessions for local smoke testing.
- apps/broker/src/audit.ts now wraps writes in a transaction that
  takes pg_advisory_xact_lock(meshId) and re-reads the tail hash
  inside the txn. Concurrent broker replicas can no longer fork the
  audit chain.

Deploy gate (task 30)
- /health stays permissive (200 even on transient DB blips) so
  Docker doesn't kill the container on a glitch.
- New /health/ready checks DB + optional EXPECTED_MIGRATION pin,
  returns 503 if either fails. External deploy gate can poll this
  and refuse to promote a broken deploy.

Metrics dashboard (task 32)
- packaging/grafana/claudemesh-broker.json: ready-to-import Grafana
  dashboard covering active conns, queue depth, routed/rejected
  rates, grant drops, legacy-auth hits, conn rejects.

Tests (task 28)
- audit-canonical.test.ts (4 tests) pins canonical JSON semantics.
- grants-enforcement.test.ts (6 tests) covers the member-then-
  session-pubkey lookup with default/explicit/blocked branches.

Docs (task 34)
- docs/env-vars.md catalogues every env var the broker + CLI read.

Crypto review prep (task 35)
- .artifacts/specs/2026-04-15-crypto-review-packet.md: reviewer
  brief, threat model, scope, test coverage list, deliverables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-15 23:51:28 +01:00

Broker HA readiness — statelessness audit

Single-instance broker is the biggest GA blocker. Moving to 2+ replicas behind a load balancer requires first understanding which state the broker holds in-process that breaks if split across nodes.

Current in-process state (apps/broker/src/index.ts)

| Symbol | Line | Per-node? | Survives HA? | Notes |
|---|---|---|---|---|
| `connections` | 147 | yes (WS state) | ✅ naturally per-node | WS connections are pinned to a node by L7 routing. Each node holds only its own connections. OK as long as the LB uses sticky sessions or cross-node fan-out. |
| `connectionsPerMesh` | 148 | yes | 🟡 per-node count, not global | Used for capacity cap. Global cap requires Redis. |
| `tgTokenRateLimit` | 151 | yes | 🟡 per-node | Telegram bot rate limiting; tolerable as per-node. |
| `urlWatches` | 173 | yes | 🔴 stuck on one node | If peer disconnects from node A and reconnects on B, the watch stays orphaned on A. Needs DB/Redis, or "pin to owning node". Acceptable risk if watches are per-session ephemeral. |
| `streamSubscriptions` | 259 | yes | 🔴 multi-node broken | Sub on A, publish on B → message never reaches A's subscribers. Needs Redis pub/sub for HA. |
| `meshClocks` | 270 | yes | 🔴 multi-node broken | Simulated clocks must be single-authority. Solve by pinning one node as clock leader (simple leader election) or by moving clock state to DB. |
| `mcpRegistry` | 327 | yes | 🔴 multi-node broken | MCP server catalog cached in memory. If deployed on A but called on B, B doesn't know it exists. Must be DB-backed (partly is already, see `mesh_service` table). Audit the cache/DB sync path. |
| `mcpCallResolvers` | 338 | yes | ✅ per-call ephemeral | In-flight callback resolvers; WS sticks to owning node so this is fine. |
| `scheduledMessages` | 359 | yes | 🔴 multi-node broken | Scheduled delivery timers live in-process. Restart loses them. Persistence exists (`scheduled_message` table) + recovery on startup, but two nodes could both fire the same timer. Needs a leader lock or per-schedule pg_advisory_lock on fire. |
| `sendRateLimit` | index.ts:494 | yes | 🟡 per-node | Each node enforces its own quota; a client spread across nodes could 2x the limit. Tolerable if sticky sessions hold. |
| `hookRateLimit` | index.ts:482 | yes | 🟡 per-node | Same as sendRateLimit. |
| `lastHash` | audit.ts:22 | yes | 🔴 broken on write | Two nodes writing audit rows concurrently will BOTH read the same last hash, BOTH compute a new hash, and both INSERT, so the chain forks. Needs `SELECT FOR UPDATE` or a single audit writer. |
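
The `lastHash` failure mode can be reproduced without a database. The sketch below is a simplified model, not the code in audit.ts: the `AuditRow` shape, the `nextRow` helper, and the hash input (previous hash concatenated with the payload) are all assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";

// Hypothetical row shape; the real audit_log columns may differ.
interface AuditRow {
  id: number;
  prevHash: string;
  hash: string;
  payload: string;
}

const sha256 = (s: string): string =>
  createHash("sha256").update(s).digest("hex");

// Build the next row from a tail hash that the caller read earlier.
function nextRow(id: number, tailHash: string, payload: string): AuditRow {
  return { id, prevHash: tailHash, hash: sha256(tailHash + payload), payload };
}

const genesis = nextRow(1, "", "genesis");

// Both nodes read the tail hash before either one has inserted.
const tailSeenByA = genesis.hash;
const tailSeenByB = genesis.hash;

const rowA = nextRow(2, tailSeenByA, "event from node A");
const rowB = nextRow(3, tailSeenByB, "event from node B");

// In a linear chain, row 3 would point at row 2. Here it points at row 1.
const forked = rowB.prevHash !== rowA.hash;
console.log(forked); // true
```

Both rows are individually valid, which is why the fork only shows up when the chain is verified end to end.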

Conclusion

The current broker is NOT HA-safe. Six symbols break under multi-instance deployment: urlWatches, streamSubscriptions, meshClocks, mcpRegistry cache, scheduledMessages, lastHash. None are unsolvable, but none are trivial.
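
To make one of these gaps concrete, here is a toy model of the streamSubscriptions problem. `BrokerNode` is a hypothetical stand-in for one replica and does not mirror the real data structure in index.ts; it only assumes that each replica keeps its own in-process map of stream name to handlers.

```typescript
type Handler = (msg: string) => void;

// Models one broker replica with its own in-process subscription map.
class BrokerNode {
  private subs = new Map<string, Set<Handler>>();

  subscribe(stream: string, handler: Handler): void {
    const set = this.subs.get(stream) ?? new Set<Handler>();
    set.add(handler);
    this.subs.set(stream, set);
  }

  // Delivers only to subscribers registered on THIS node.
  // Returns how many handlers were invoked.
  publish(stream: string, msg: string): number {
    const set = this.subs.get(stream);
    if (!set) return 0;
    set.forEach((handler) => handler(msg));
    return set.size;
  }
}

const nodeA = new BrokerNode();
const nodeB = new BrokerNode();
const received: string[] = [];

nodeA.subscribe("builds", (m) => received.push(m));

// Publisher happens to be connected to node B: nobody hears it.
const deliveredViaB = nodeB.publish("builds", "hello");
// Same publish through node A reaches the subscriber.
const deliveredViaA = nodeA.publish("builds", "hello again");

console.log(deliveredViaB, deliveredViaA, received);
```

A shared bus between the replicas (Redis pub/sub in the plan below) is what would let node B's publish reach node A's subscribers.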

Rollout plan for HA

Phase 0 (now) — sticky sessions

Deploy a single broker behind Traefik with loadBalancer.sticky.cookie enabled. The WS upgrade inherits the cookie, so reconnects land on the same node. This gives us one node of safe HA headroom (i.e., one deploy rollover without user-visible disconnection) without any code changes.

Phase 1 — Active/passive

Two replicas. Traefik routes all traffic to primary; secondary is warm. Primary fails → secondary takes over, all WS connections reset. No code change needed; clients auto-reconnect.

Phase 2 — Active/active for stateless routes

HTTP-only routes (/cli/*, /download, /hook) can round-robin across any number of replicas today. WS routes stay sticky per mesh via Traefik sticky.cookie. The HTTP routes are already backed by Postgres, so each replica reads the same mesh/member/invite rows.

Phase 3 — Full active/active

Migrate the 6 problematic in-memory symbols:

- streamSubscriptions → Redis pub/sub
- meshClocks → leader-elect via Postgres advisory lock on mesh_id
- scheduledMessages → single-writer pattern: whichever replica holds pg_advisory_xact_lock(schedule_id) fires
- urlWatches → DB-backed + each replica owns watches where presence.node_id = this_node
- mcpRegistry → rely on mesh_service table, drop the in-memory cache
- lastHash → wrap audit.ts writes in a transaction that runs SELECT hash FROM audit_log ... ORDER BY id DESC FOR UPDATE, so that concurrent inserts serialize.
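
The scheduledMessages item can be sketched as follows. This is a minimal illustration under stated assumptions, not the planned implementation: `fireOnce`, the `Db` interface, and the `fired_at` column are hypothetical (only the `scheduled_message` table name and `pg_advisory_xact_lock` come from this document). The `Db` shape is chosen so that a checked-out node-postgres client would fit it; `fakeDb` is an in-memory stand-in so the sketch runs without Postgres.

```typescript
// Minimal query surface. All calls must go through ONE connection,
// because the advisory lock is scoped to the surrounding transaction.
interface Db {
  query(sql: string, params?: unknown[]): Promise<{ rows: Record<string, unknown>[] }>;
}

// Returns true if this replica fired the schedule, false if another one already did.
async function fireOnce(
  db: Db,
  scheduleId: number,
  deliver: () => Promise<void>,
): Promise<boolean> {
  await db.query("BEGIN");
  try {
    // Blocks while another transaction holds the same key; released on COMMIT/ROLLBACK.
    await db.query("SELECT pg_advisory_xact_lock($1::bigint)", [scheduleId]);
    // Hypothetical fired_at column: the claim succeeds only for the first replica.
    const claimed = await db.query(
      "UPDATE scheduled_message SET fired_at = now() WHERE id = $1 AND fired_at IS NULL RETURNING id",
      [scheduleId],
    );
    if (claimed.rows.length === 0) {
      await db.query("ROLLBACK");
      return false;
    }
    await deliver();
    await db.query("COMMIT");
    return true;
  } catch (err) {
    await db.query("ROLLBACK");
    throw err;
  }
}

// In-memory stand-in that only understands the UPDATE above (no real rollback).
function fakeDb(fired: Set<number>): Db {
  return {
    async query(sql, params = []) {
      if (sql.startsWith("UPDATE")) {
        const id = params[0] as number;
        if (fired.has(id)) return { rows: [] };
        fired.add(id);
        return { rows: [{ id }] };
      }
      return { rows: [] };
    },
  };
}
```

With a shared `fakeDb`, a first `fireOnce(db, 7, deliver)` resolves to `true` and a second call for the same id resolves to `false` without calling `deliver`. Note that delivery happens inside the transaction here, so a crash after `deliver()` but before `COMMIT` could lead to a repeat delivery on recovery; whether that is acceptable is a design decision this sketch does not settle.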

Phase 4 — Multi-region

Frankfurt (OVH) is a single point of failure. Move to a managed Postgres with read replicas, one broker cluster per region, global DNS geo-routing. Out of scope for v1.0.0.

Immediate ship: local docker-compose for 2-replica smoke test

packaging/docker-compose.ha-local.yml (TODO) spins up:

- 2x broker (same DATABASE_URL)
- 1x postgres
- 1x traefik with sticky cookie
- 1x locust / synthetic client

Tests:

- Send to peer connected on node A → delivered.
- Subscribe on A, publish on B → expect failure (documents the gap).
- Kill node A → client reconnects to B within Xs.
- Audit chain verify after concurrent writes from both nodes → expect a fork (documents the gap).
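
The audit chain check in the last test could look roughly like this. It is a sketch only: `firstBrokenLink`, the `AuditRow` shape, the empty-string genesis parent, and the hash input are assumptions, and the real verifier would need to use the broker's canonical JSON encoding rather than plain string concatenation.

```typescript
import { createHash } from "node:crypto";

// Hypothetical row shape; the real audit_log columns may differ.
interface AuditRow {
  id: number;
  prevHash: string;
  hash: string;
  payload: string;
}

const sha256 = (s: string): string =>
  createHash("sha256").update(s).digest("hex");

// Walks rows in id order. Returns the id of the first row that does not
// link to its predecessor or whose hash does not match, or null if linear.
function firstBrokenLink(rows: AuditRow[]): number | null {
  let expectedPrev = ""; // assumed parent of the genesis row
  for (const row of rows) {
    if (row.prevHash !== expectedPrev) return row.id;
    if (row.hash !== sha256(row.prevHash + row.payload)) return row.id;
    expectedPrev = row.hash;
  }
  return null;
}

// Helper for building example rows.
const mk = (id: number, prevHash: string, payload: string): AuditRow => ({
  id,
  prevHash,
  payload,
  hash: sha256(prevHash + payload),
});

const r1 = mk(1, "", "genesis");
const r2 = mk(2, r1.hash, "from node A");
const r3Linear = mk(3, r2.hash, "from node B");
const r3Forked = mk(3, r1.hash, "from node B"); // node B read a stale tail

console.log(firstBrokenLink([r1, r2, r3Linear])); // null
console.log(firstBrokenLink([r1, r2, r3Forked])); // 3
```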

Decision

Ship v1.0.0 on sticky-session single-writer (Phase 0 + Phase 1 warm standby). That closes the "what happens on deploy" story. Phase 3 full HA is v1.1.0 work.
