feat(ga): close remaining GA blockers (backcompat, HA prep, tests, docs)

Backwards compat shim (task 27)
- requireCliAuth() falls back to body.user_id when BROKER_LEGACY_AUTH=1
  and no bearer present. Sets Deprecation + Warning headers + bumps a
  broker_legacy_auth_hits_total metric so operators can watch the legacy
  traffic drain to 0 before removing the shim.
- All handlers parse body BEFORE requireCliAuth so the fallback can read
  user_id out of it.

HA readiness (task 29)
- .artifacts/specs/2026-04-15-broker-ha-statelessness-audit.md documents
  every in-memory symbol and rollout plan (phase 0-4).
- packaging/docker-compose.ha-local.yml spins up 2 broker replicas behind
  Traefik sticky sessions for local smoke testing.
- apps/broker/src/audit.ts now wraps writes in a transaction that takes
  pg_advisory_xact_lock(meshId) and re-reads the tail hash inside the
  txn. Concurrent broker replicas can no longer fork the audit chain.

Deploy gate (task 30)
- /health stays permissive (200 even on transient DB blips) so Docker
  doesn't kill the container on a glitch.
- New /health/ready checks DB + optional EXPECTED_MIGRATION pin, returns
  503 if either fails. External deploy gate can poll this and refuse to
  promote a broken deploy.

Metrics dashboard (task 32)
- packaging/grafana/claudemesh-broker.json: ready-to-import Grafana
  dashboard covering active conns, queue depth, routed/rejected rates,
  grant drops, legacy-auth hits, conn rejects.

Tests (task 28)
- audit-canonical.test.ts (4 tests) pins canonical JSON semantics.
- grants-enforcement.test.ts (6 tests) covers the
  member-then-session-pubkey lookup with default/explicit/blocked
  branches.

Docs (task 34)
- docs/env-vars.md catalogues every env var the broker + CLI read.

Crypto review prep (task 35)
- .artifacts/specs/2026-04-15-crypto-review-packet.md: reviewer brief,
  threat model, scope, test coverage list, deliverables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
87
.artifacts/specs/2026-04-15-broker-ha-statelessness-audit.md
Normal file
@@ -0,0 +1,87 @@
# Broker HA readiness — statelessness audit

Single-instance broker is the biggest GA blocker. Moving to 2+ replicas
behind a load balancer requires first understanding which state the broker
holds in-process that breaks if split across nodes.

## Current in-process state (apps/broker/src/index.ts)

| Symbol | Line | Per-node? | Survives HA? | Notes |
|--------|------|-----------|--------------|-------|
| `connections` | 147 | yes (WS state) | ✅ naturally per-node | WS connections are pinned to a node by L7 routing. Each node holds only its own connections. **OK as long as the LB uses sticky sessions or cross-node fan-out.** |
| `connectionsPerMesh` | 148 | yes | 🟡 per-node count, not global | Used for capacity cap. Global cap requires Redis. |
| `tgTokenRateLimit` | 151 | yes | 🟡 per-node | Telegram bot rate limiting; tolerable as per-node. |
| `urlWatches` | 173 | yes | 🔴 stuck on one node | If peer disconnects from node A and reconnects on B, the watch stays orphaned on A. **Needs DB/Redis, or "pin to owning node". Acceptable risk if watches are per-session ephemeral.** |
| `streamSubscriptions` | 259 | yes | 🔴 multi-node broken | Sub on A, publish on B → message never reaches A's subscribers. **Needs Redis pub/sub for HA.** |
| `meshClocks` | 270 | yes | 🔴 multi-node broken | Simulated clocks must be single-authority. Solve by pinning one node as clock leader (simple leader election) or by moving clock state to DB. |
| `mcpRegistry` | 327 | yes | 🔴 multi-node broken | MCP server catalog cached in memory. If deployed on A but called on B, B doesn't know it exists. **Must be DB-backed** (partly is already — see `mesh_service` table). Audit the cache/DB sync path. |
| `mcpCallResolvers` | 338 | yes | ✅ per-call ephemeral | In-flight callback resolvers; WS sticks to owning node so this is fine. |
| `scheduledMessages` | 359 | yes | 🔴 multi-node broken | Scheduled delivery timers live in-process. Restart loses them. Persistence exists (`scheduled_message` table) + recovery on startup, but two nodes could both fire the same timer. **Needs a leader lock or per-schedule pg_advisory_lock on fire.** |
| `sendRateLimit` | 494 | yes | 🟡 per-node | Each node enforces its own quota; a client spread across nodes could 2x the limit. Tolerable if sticky sessions hold. |
| `hookRateLimit` | 482 | yes | 🟡 per-node | Same as sendRateLimit. |
| `lastHash` | audit.ts:22 | yes | 🔴 broken on write | Two nodes writing audit rows concurrently will BOTH read the same last hash, BOTH compute a new hash, and both INSERT — the chain forks. **Needs `SELECT FOR UPDATE` or a single audit writer.** |

## Conclusion

**Current broker is NOT HA-safe.** Six symbols break under multi-instance:
`urlWatches`, `streamSubscriptions`, `meshClocks`, `mcpRegistry` cache,
`scheduledMessages`, `lastHash`. None are unsolvable, but none are
trivial.

## Rollout plan for HA

### Phase 0 (now) — sticky sessions
Deploy a single broker behind Traefik with `loadBalancer.sticky.cookie`
enabled. WS upgrade inherits the cookie, so reconnects land on the same
node. Gives us 1 node of safe HA headroom (i.e., one deploy rollover
without user-visible disconnection) without any code changes.

### Phase 1 — Active/passive
Two replicas. Traefik routes all traffic to primary; secondary is warm.
Primary fails → secondary takes over, all WS connections reset. No code
change needed; clients auto-reconnect.

### Phase 2 — Active/active for stateless routes
HTTP-only routes (`/cli/*`, `/download`, `/hook`) can round-robin across
any number of replicas today. WS routes stay sticky per mesh via Traefik
`sticky.cookie`. Already behind Postgres → each replica reads the same
mesh/member/invite rows.

### Phase 3 — Full active/active
Migrate the 6 problematic in-memory symbols:
- `streamSubscriptions` → Redis pub/sub
- `meshClocks` → leader-elect via Postgres advisory lock on mesh_id
- `scheduledMessages` → single-writer pattern: whichever replica holds
  `pg_advisory_xact_lock(schedule_id)` fires
- `urlWatches` → DB-backed + each replica owns watches where
  `presence.node_id = this_node`
- `mcpRegistry` → rely on `mesh_service` table, drop the in-memory cache
- `lastHash` → wrap audit.ts writes in a transaction that
  `SELECT hash FROM audit_log ... ORDER BY id DESC FOR UPDATE`, making
  concurrent inserts serialize.
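
To make the `lastHash` item concrete, here is a minimal in-memory sketch of why the tail read and the insert have to run as one unit. It is not the code in audit.ts: the `AuditRow` shape, the `GENESIS` marker, and the hash construction are assumptions made for illustration, and the "lock" is modeled only by the order in which the two steps run.

```typescript
import { createHash } from "node:crypto";

// Hypothetical row shape; the real audit_log columns may differ.
interface AuditRow { prevHash: string; hash: string; payload: string }

const GENESIS = "0".repeat(64); // assumed marker for "no previous row"

const sha256 = (s: string): string => createHash("sha256").update(s).digest("hex");

// Step 1 of a write: read the current tail hash.
function readTail(log: AuditRow[]): string {
  return log.length > 0 ? log[log.length - 1].hash : GENESIS;
}

// Step 2 of a write: insert a row chained to the hash read in step 1.
function insertRow(log: AuditRow[], prevHash: string, payload: string): void {
  log.push({ prevHash, hash: sha256(prevHash + "\n" + payload), payload });
}

// A chain is linear when every row points at the hash of the row before it.
function isLinear(log: AuditRow[]): boolean {
  return log.every((row, i) => row.prevHash === (i === 0 ? GENESIS : log[i - 1].hash));
}

// Unlocked: both replicas read the tail before either one inserts.
const forked: AuditRow[] = [];
const tailSeenByA = readTail(forked);
const tailSeenByB = readTail(forked);
insertRow(forked, tailSeenByA, "event from node A");
insertRow(forked, tailSeenByB, "event from node B");

// Locked: read + insert run as one unit per writer, so node B
// re-reads the tail only after node A has committed.
const serialized: AuditRow[] = [];
for (const payload of ["event from node A", "event from node B"]) {
  insertRow(serialized, readTail(serialized), payload);
}

console.log(isLinear(forked), isLinear(serialized)); // false true
```

Whether the serialization comes from `FOR UPDATE` on the tail row or from an advisory lock keyed on the mesh, the property to check is the same: the tail must be re-read inside the critical section.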

### Phase 4 — Multi-region
SPOF at Frankfurt (OVH). Move to a managed Postgres with read replicas,
one broker cluster per region, global DNS geo-routing. Out of scope for
v1.0.0.

## Immediate ship: local docker-compose for 2-replica smoke test

`packaging/docker-compose.ha-local.yml` (TODO) spins up:
- 2x broker (same DATABASE_URL)
- 1x postgres
- 1x traefik with sticky cookie
- 1x locust / synthetic client

Tests:
1. Send to peer connected on node A → delivered.
2. Subscribe on A, publish on B → expect failure (documents the gap).
3. Kill node A → client reconnects to B within Xs.
4. Audit chain verify after concurrent writes from both nodes → expect
   a fork (documents the gap).
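
Test 2 can be pictured with a small in-process model. The sketch below assumes each replica keeps its own subscription map (as `streamSubscriptions` does) and uses a Node `EventEmitter` as a stand-in for Redis pub/sub; `BrokerNode` and its methods are hypothetical names, not the broker's API.

```typescript
import { EventEmitter } from "node:events";

type Handler = (msg: string) => void;

// One broker replica with its own in-memory subscription map.
class BrokerNode {
  subs = new Map<string, Set<Handler>>();
  bus: EventEmitter | undefined;

  constructor(bus?: EventEmitter) {
    this.bus = bus;
    // With a shared bus, every node re-delivers remote publishes
    // to its own local subscribers.
    bus?.on("publish", (stream: string, msg: string) => this.deliverLocal(stream, msg));
  }

  subscribe(stream: string, handler: Handler): void {
    const set = this.subs.get(stream) ?? new Set<Handler>();
    set.add(handler);
    this.subs.set(stream, set);
  }

  publish(stream: string, msg: string): void {
    if (this.bus) this.bus.emit("publish", stream, msg);
    else this.deliverLocal(stream, msg);
  }

  deliverLocal(stream: string, msg: string): void {
    for (const handler of this.subs.get(stream) ?? []) handler(msg);
  }
}

// Per-node state only: subscribe on A, publish on B, nothing arrives.
const perNodeA = new BrokerNode();
const perNodeB = new BrokerNode();
const receivedPerNode: string[] = [];
perNodeA.subscribe("deploys", (m) => receivedPerNode.push(m));
perNodeB.publish("deploys", "v1.0.0 shipped");

// Shared bus: the same scenario delivers exactly once.
const bus = new EventEmitter();
const sharedA = new BrokerNode(bus);
const sharedB = new BrokerNode(bus);
const receivedShared: string[] = [];
sharedA.subscribe("deploys", (m) => receivedShared.push(m));
sharedB.publish("deploys", "v1.0.0 shipped");

console.log(receivedPerNode.length, receivedShared.length); // 0 1
```

The first scenario is the failure the smoke test is expected to document; the second is the behavior Phase 3 is aiming for.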

## Decision

**Ship v1.0.0 on sticky-session single-writer (Phase 0 + Phase 1 warm
standby).** That closes the "what happens on deploy" story. Phase 3 full
HA is v1.1.0 work.
152
.artifacts/specs/2026-04-15-crypto-review-packet.md
Normal file
@@ -0,0 +1,152 @@
# claudemesh crypto — external review packet

**Goal:** 2-day review of the claudemesh cryptographic surface by an
external reviewer familiar with libsodium, x25519/ed25519, authenticated
encryption, and hash-chain audit logs.

**Status:** self-audited + Codex-reviewed. Not yet reviewed by an
independent human with security expertise.

## Scope

### Files in scope

| File | LoC | What it does |
|---|---|---|
| `apps/broker/src/crypto.ts` | ~400 | Hello signature verification, canonical invite bytes (v1+v2), `sealRootKeyToRecipient` via `crypto_box_seal`, `verifyInviteV2`, `claimInviteV2Core` (gated). |
| `apps/broker/src/broker-crypto.ts` | 70 | AES-256-GCM encryption-at-rest for MCP env vars. Key from `BROKER_ENCRYPTION_KEY` or ephemeral in dev. |
| `apps/broker/src/audit.ts` | ~250 | Hash-chained audit log. Canonical JSON payload hash, per-mesh `pg_advisory_xact_lock` for concurrent writers. |
| `apps/cli/src/services/crypto/box.ts` | 60 | `crypto_box_easy` / `crypto_box_open_easy` wrappers that accept ed25519 keys and convert to curve25519 via `crypto_sign_*_to_curve25519`. |
| `apps/cli/src/services/crypto/keypair.ts` | ~50 | `generateKeypair` wrapping `crypto_sign_keypair`. |
| `apps/cli/src/commands/backup.ts` | ~180 | Config backup via Argon2id + XChaCha20-Poly1305 (`crypto_aead_xchacha20poly1305_ietf_*`) from a user passphrase. |
| `apps/cli/src/services/invite/parse-v1.ts` | ~160 | Invite payload decode + signature verification, URL parsing, short-code resolution. |

### Out of scope

- TLS config (Traefik termination)
- Postgres at-rest disk encryption
- Homebrew/winget binary signing pipeline
- Secrets storage on the user's machine (we rely on OS file mode 0600)

## Threat model

### Adversary profile

- **Network attacker** on the wire between CLI and broker. Controls
  DNS, can inject packets, can replay. TLS terminates at Traefik;
  assume TLS is trusted.
- **Malicious broker** operator. Can read any row in Postgres.
- **Mesh peer** with a valid member record. Can try to escalate
  privileges, impersonate other members, replay, DoS, exfiltrate
  other members' messages.
- **Laptop thief** who has the user's `~/.claudemesh/` directory but
  not the login password. (Keys on disk at mode 0600.)

### Must hold

- E2E: broker cannot read plaintext of direct messages.
- Signature: no member can forge messages signed as another member.
- Invite integrity: modifying an invite URL invalidates the signature.
- Backup secrecy: an attacker with the backup file but not the
  passphrase learns nothing.
- Audit integrity: tampering with an audit row breaks chain
  verification.
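
As an illustration of the audit-integrity property, the sketch below builds a small hash chain over canonically serialized payloads and shows verification failing after one row is edited. It is an assumption-laden model, not the audit.ts implementation: the sorted-key `canonicalJson`, the `Entry` shape, the `GENESIS` marker, and the event names are all hypothetical.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [k: string]: Json };

// Serialize with object keys sorted, so key order never changes the hash.
function canonicalJson(v: Json): string {
  if (v === null || typeof v !== "object") return JSON.stringify(v);
  if (Array.isArray(v)) return "[" + v.map(canonicalJson).join(",") + "]";
  const keys = Object.keys(v).sort();
  return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonicalJson(v[k])).join(",") + "}";
}

interface Entry { payload: Json; prevHash: string; hash: string }

const GENESIS = "0".repeat(64); // assumed marker for the first row

const hashEntry = (prevHash: string, payload: Json): string =>
  createHash("sha256").update(prevHash).update(canonicalJson(payload)).digest("hex");

function append(chain: Entry[], payload: Json): void {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].hash : GENESIS;
  chain.push({ payload, prevHash, hash: hashEntry(prevHash, payload) });
}

// Returns the index of the first row that fails verification, or -1.
function verifyChain(chain: Entry[]): number {
  let prev = GENESIS;
  for (let i = 0; i < chain.length; i++) {
    const e = chain[i];
    if (e.prevHash !== prev || e.hash !== hashEntry(prev, e.payload)) return i;
    prev = e.hash;
  }
  return -1;
}

const chain: Entry[] = [];
append(chain, { event: "member.join", member: "alice" });
append(chain, { member: "bob", event: "member.join" });
append(chain, { event: "invite.revoke", id: 7 });

const beforeTamper = verifyChain(chain); // -1: chain is intact

// Edit one stored payload without recomputing its hash.
chain[1].payload = { event: "member.join", member: "mallory" };
const afterTamper = verifyChain(chain); // 1: first broken row

console.log(beforeTamper, afterTamper); // -1 1
```

This only covers in-place edits. An operator who can rewrite every row and recompute every hash is a separate case and is not detected by chain verification alone.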

### Known weaknesses (deliberate)

- **root_key in v1 invite URL**: current long URL form carries the
  mesh root key in base64(JSON). Short-URL mode (`/i/<code>`) resolves
  to the same token server-side, so this does NOT reduce the exposure.
  v2 protocol moves root_key out of the URL but CLI migration is not
  yet shipped.
- **Session-key routing identity**: a peer can claim arbitrary
  `sessionPubkey` in hello (validated as 64-hex in alpha.36 but not
  proven-own). Proof-of-secret-key for session key is not enforced.
  Impact: a peer can route messages as any session pubkey it chooses
  but cannot decrypt replies without the matching secret, so the
  impact is DoS/confusion, not impersonation.
- **mesh.owner_secret_key stored plaintext** in the DB. A malicious
  broker can issue arbitrary invites. Mitigated only by DB access
  control.
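
The difference between "validated as 64-hex" and "proven-own" in the session-key item can be sketched as follows. This is not the broker's hello handler. It assumes, purely for illustration, that the session key is an Ed25519 key and that the broker issues a random challenge; `hasValidShape`, `provesOwnership`, and `publicKeyFromHex` are hypothetical helpers built on Node's `crypto` module.

```typescript
import { createPublicKey, generateKeyPairSync, randomBytes, sign, verify } from "node:crypto";

const HEX64 = /^[0-9a-f]{64}$/; // assumed shape check: 32 bytes, lowercase hex

// Rebuild an Ed25519 public key object from its 32-byte hex form.
function publicKeyFromHex(hex: string) {
  const x = Buffer.from(hex, "hex").toString("base64url");
  return createPublicKey({ key: { kty: "OKP", crv: "Ed25519", x }, format: "jwk" });
}

// Shape check only: passes for any 64-hex string, owned or not.
const hasValidShape = (sessionPubkey: string): boolean => HEX64.test(sessionPubkey);

// Ownership check: the peer must sign a fresh challenge with the
// secret key that matches the claimed public key.
function provesOwnership(sessionPubkey: string, challenge: Buffer, signature: Buffer): boolean {
  if (!hasValidShape(sessionPubkey)) return false;
  return verify(null, challenge, publicKeyFromHex(sessionPubkey), signature);
}

// Honest peer: owns the key it claims.
const honest = generateKeyPairSync("ed25519");
const honestHex = Buffer.from(
  honest.publicKey.export({ format: "jwk" }).x as string,
  "base64url",
).toString("hex");
const challenge = randomBytes(32);
const honestSig = sign(null, challenge, honest.privateKey);

// Attacker: claims the honest peer's key but can only sign with its own.
const attacker = generateKeyPairSync("ed25519");
const forgedSig = sign(null, challenge, attacker.privateKey);

console.log(hasValidShape(honestHex));                          // true for both peers
console.log(provesOwnership(honestHex, challenge, honestSig)); // true
console.log(provesOwnership(honestHex, challenge, forgedSig)); // false
```

The shape check accepts the attacker's claim; only the signature over the challenge separates the two peers.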

## Review checklist for the reviewer

1. **libsodium usage**
   - Are nonces generated with `randombytes_buf` and never reused?
   - `crypto_box_easy` / `crypto_box_open_easy` order and parameters correct?
   - Are ed25519 keys converted to curve25519 on BOTH sides consistently?
   - Is `crypto_sign_detached` / `crypto_sign_verify_detached` used with the right message bytes?

2. **Invite protocol**
   - Canonical bytes v1 + v2 format strings stable across CLI and broker?
   - Replay protection: is a v1 URL reusable? (short URL + usedCount)
   - Is the `maxUses` counter race-safe? (atomic UPDATE with `lt`)
   - v2 root_key sealing: does `crypto_box_seal` fit the trust model?
   - Is recipient_x25519_pubkey validated on both shape and length?

3. **Audit chain**
   - Is the canonical JSON serialization reviewable and stable?
   - Does `pg_advisory_xact_lock` actually serialize writes on the same mesh under HA?
   - Can a malicious broker rewrite history by dropping the `lastHash` cache + DROPping rows + replaying with a new chain? (Yes — documented. Mitigation is append-only at the DB level.)

4. **At-rest encryption (broker-crypto.ts)**
   - AES-256-GCM with 12-byte IV + 16-byte tag — correct, but is the IV generation guaranteed random and unique per encryption?
   - Any concern about auth tag truncation or nonce collision under high volume?

5. **Backup (cli/commands/backup.ts)**
   - Argon2id params reasonable? (INTERACTIVE — should possibly be SENSITIVE.)
   - XChaCha20-Poly1305 parameter order?
   - Does the passphrase-minimum (12 chars) match the Argon2id parameters?
   - Is the salt stored alongside the ciphertext and read back correctly?

6. **Session vs member key**
   - When is which key used? Is there any path where one is trusted for the other's purpose?

7. **Hello signature**
   - Timestamp skew window (`±60s`) — does the broker reject out-of-window replays?
   - Is the canonical hello string covered by the signature exactly?

8. **Grants**
   - Can a peer bypass server-side grant enforcement by lying about their
     own sender key in hello? (Signature pins memberPubkey to a real
     signing key, but sessionPubkey isn't proven.)
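
For checklist item 4, a reference shape for AES-256-GCM with a 12-byte IV and a 16-byte tag may help when reading broker-crypto.ts. The sketch below uses Node's built-in `crypto` module. The `iv || tag || ciphertext` blob layout and the function names are assumptions for illustration and may not match what the broker stores.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

const IV_LEN = 12;  // bytes, as stated in the checklist
const TAG_LEN = 16; // bytes, full-length GCM tag (no truncation)

// Encrypts with a fresh random IV per call; returns iv || tag || ciphertext.
function encrypt(key: Buffer, plaintext: string): Buffer {
  const iv = randomBytes(IV_LEN);
  const cipher = createCipheriv("aes-256-gcm", key, iv, { authTagLength: TAG_LEN });
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]);
}

// Splits the blob back into iv, tag, and ciphertext.
// decipher.final() throws if the tag does not match.
function decrypt(key: Buffer, blob: Buffer): string {
  const iv = blob.subarray(0, IV_LEN);
  const tag = blob.subarray(IV_LEN, IV_LEN + TAG_LEN);
  const ciphertext = blob.subarray(IV_LEN + TAG_LEN);
  const decipher = createDecipheriv("aes-256-gcm", key, iv, { authTagLength: TAG_LEN });
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}

const demoKey = randomBytes(32); // 256-bit key; a real key would come from configuration
const blob = encrypt(demoKey, "MCP_TOKEN=example");
console.log(decrypt(demoKey, blob)); // MCP_TOKEN=example
```

With random 96-bit IVs, the points worth checking are that a new IV is drawn on every call and that the number of encryptions under one key stays far below the range where IV collisions become plausible.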

## Test coverage supplied

- `apps/broker/tests/invite-signature.test.ts`
- `apps/broker/tests/invite-v2.test.ts`
- `apps/broker/tests/hello-signature.test.ts`
- `apps/broker/tests/audit-canonical.test.ts`
- `apps/broker/tests/grants-enforcement.test.ts`
- `apps/broker/tests/rate-limit.test.ts`
- `apps/broker/tests/encoding.test.ts`
- `apps/broker/tests/dup-delivery.test.ts`
- `apps/cli/tests/unit/crypto-roundtrip.test.ts`

## Deliverables expected from reviewer

1. **Findings list** — severity (crit/high/med/low), file:line, fix recommendation.
2. **Protocol-level critique** — anything in the invite or hello flow that can be exploited with a valid account.
3. **Tooling recs** — libsodium best-practice they'd follow differently.
4. **Go/no-go** for v1.0.0 GA assuming the findings are addressed.

## Budget

2 person-days. Hourly rate acceptable; fixed-fee preferred. Request
for quote from reviewers with published libsodium / PKI experience
(see recommended list below).

## Recommended reviewers

- Filippo Valsorda (independent, ex-Go crypto lead, author of age)
- Trail of Bits (firm-rate; their Tamarin+reviewer combo is strong)
- Latacora (firm; expensive but thorough)
- NCC Group (firm; good for libsodium-specific work)
- Cure53 (firm; EU, fast turnaround)

## Review deliverable format

Markdown report with:
- Findings table (id, severity, file:line, summary, recommended fix)
- Protocol notes
- One-page exec summary for non-technical stakeholders