Commit Graph

10 Commits

Author SHA1 Message Date
Alejandro Gutiérrez
a2a53ff355 feat(cli,broker): 1.34.14 + 1.34.15 — env-var fallback, peer list scope, kick refuses control-plane
Three follow-ups from the 1.34.x multi-session correctness train,
all backwards-compatible.

1.34.14 — stale CLAUDEMESH_CONFIG_DIR falls back. The launch flow
exposes CLAUDEMESH_CONFIG_DIR=<tmpdir> to its spawned claude; if a
later claudemesh invocation inherited that env (Bash tool inside
Claude Code, tmux update-environment, exported var), the inherited
path pointed at a tmpdir that no longer existed and readConfig()
silently returned empty. paths.ts now memoizes resolution: env unset
→ default; env points at a real dir → trust it; env set but dir gone
→ TTY-only stderr warning with shell-specific unset hint, fall back
to ~/.claudemesh.

1.34.15 — peer list --mesh actually scopes. peers.ts and launch.ts
were calling tryListPeersViaDaemon() with no argument; the daemon's
?mesh= filter (server-side, since 1.26.0) was already correct, the
CLI just wasn't passing the slug. Forwarding fixed in both sites;
send.ts cross-mesh hex-prefix resolution intentionally untouched.

1.34.15 — kick refuses no-op kicks on control-plane. Pre-1.34.15
kicking a daemon's member-WS just closed the socket and triggered
auto-reconnect — a no-op with a misleading "session ended" message.
Broker now skips peers where peerRole === "control-plane" and
surfaces them in a new additive ack field skipped_control_plane;
the CLI reads it and prints a clearer hint pointing at ban / daemon
down. Soft disconnect verb keeps old behavior. PeerConn gains a
peerRole slot populated at both connections.set sites.

Tests: 4 new for paths-stale-env, 5 for kick-control-plane-skip.
CLI 87/87 green; broker 55/55 unit green (integration tests
pre-existing infra failure on this machine).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:59:06 +01:00
Alejandro Gutiérrez
033a2d37e1 feat(broker): canonical session-hello + parent-attestation helpers
Adds the crypto primitives the 1.30.0 per-session broker presence flow
needs: canonicalSessionAttestation/canonicalSessionHello bytes, and
verifySessionAttestation/verifySessionHelloSignature with TTL bounds
(≤24h) plus standard ed25519 + skew checks.

10 unit tests cover the hostile cases — expired attestation, over-TTL,
wrong-key signing, tampered fields, and the "attacker captured the
attestation but doesn't hold the session secret key" scenario.

No wire changes yet — types and dispatch land in the next two commits.
Spec: .artifacts/specs/2026-05-04-per-session-presence.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 12:57:28 +01:00
Alejandro Gutiérrez
05729ad8a4 feat(ga): close remaining GA blockers (backcompat, HA prep, tests, docs)
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Backwards compat shim (task 27)
- requireCliAuth() falls back to body.user_id when BROKER_LEGACY_AUTH=1
  and no bearer present. Sets Deprecation + Warning headers + bumps a
  broker_legacy_auth_hits_total metric so operators can watch the
  legacy traffic drain to 0 before removing the shim.
- All handlers parse body BEFORE requireCliAuth so the fallback can
  read user_id out of it.

HA readiness (task 29)
- .artifacts/specs/2026-04-15-broker-ha-statelessness-audit.md
  documents every in-memory symbol and rollout plan (phase 0-4).
- packaging/docker-compose.ha-local.yml spins up 2 broker replicas
  behind Traefik sticky sessions for local smoke testing.
- apps/broker/src/audit.ts now wraps writes in a transaction that
  takes pg_advisory_xact_lock(meshId) and re-reads the tail hash
  inside the txn. Concurrent broker replicas can no longer fork the
  audit chain.

Deploy gate (task 30)
- /health stays permissive (200 even on transient DB blips) so
  Docker doesn't kill the container on a glitch.
- New /health/ready checks DB + optional EXPECTED_MIGRATION pin,
  returns 503 if either fails. External deploy gate can poll this
  and refuse to promote a broken deploy.

Metrics dashboard (task 32)
- packaging/grafana/claudemesh-broker.json: ready-to-import Grafana
  dashboard covering active conns, queue depth, routed/rejected
  rates, grant drops, legacy-auth hits, conn rejects.

Tests (task 28)
- audit-canonical.test.ts (4 tests) pins canonical JSON semantics.
- grants-enforcement.test.ts (6 tests) covers the member-then-
  session-pubkey lookup with default/explicit/blocked branches.

Docs (task 34)
- docs/env-vars.md catalogues every env var the broker + CLI read.

Crypto review prep (task 35)
- .artifacts/specs/2026-04-15-crypto-review-packet.md: reviewer
  brief, threat model, scope, test coverage list, deliverables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 23:51:28 +01:00
Alejandro Gutiérrez
c1fa3bcb5c feat: anthropic-style mesh + invite redesign (wave 1 checkpoint)
Ships the user-visible friction fixes and the foundation for the v2
invite protocol. API wiring + CLI client + email UI ship in wave 2.

Meshes — shipped
- Drop global UNIQUE on mesh.slug; mesh.id is canonical everywhere
- Server derives slug from name; create form has no slug field
- Two users can freely name their mesh "platform"; no collision errors
- Migration 0017

Invites v1 — shipped (URL shortener, backward compatible)
- New invite.code column (base62, 8 chars, nullable unique index)
- createMyInvite mints both token + short code; returns shortUrl
- GET /api/public/invite-code/:code resolves short code to token
- New route /i/[code] server-redirects to /join/[token]
- Invite generator UI shows short URL; QR encodes short URL
- Advanced fields (role/maxUses/expiresInDays) collapsed under disclosure
- Migration 0018

Invites v2 — foundation (broker + DB only; API+CLI+Web wiring in wave 2)
- Broker: canonicalInviteV2, verifyInviteV2, sealRootKeyToRecipient
- Broker: POST /invites/:code/claim endpoint (atomic single-use accounting)
- Broker tests: invite-v2.test.ts (signature, expiry, revocation, exhaustion)
- DB: mesh.invite gains version/capabilityV2/claimedByPubkey columns
- DB: new mesh.pending_invite table for email invites
- Migration 0019
- Contract locked in docs/protocol.md §v2 + SPEC.md §14b

Consent landing — shipped
- /join/[token] redesigned: explicit role, inviter, mesh stats, consent
- New server components: invite-card, role-badge, inviter-line, consent-summary
- "Join [mesh] as [Role]" primary action (not just "Join")

Error surfacing — shipped
- handle() now parses {error} responses from hono route catch blocks
- onError fallback includes timestamp so handle() can match apiErrorSchema
- Real error messages reach the UI instead of "Something went wrong"

Docs
- SPEC.md §14b: v2 invite protocol
- docs/protocol.md: v2 claim wire format
- docs/roadmap.md: status
- .artifacts/specs/2026-04-10-anthropic-vision-meshes-invites.md

Deferred to wave 2/3
- API claim route wiring (packages/api)
- createMyInvite v2 capability generation
- Email invite mutation + Postmark delivery
- CLI v2 join flow (x25519 keypair + unseal)
- Web invite-generator email field + v2 display

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 13:41:11 +01:00
Alejandro Gutiérrez
0c4a9591fa feat(broker): invite signature verification + atomic one-time-use
Completes the v0.1.0 security model. Every /join is now gated by a
signed invite that the broker re-verifies against the mesh owner's
ed25519 pubkey, plus an atomic single-use counter.

schema (migrations/0001_demonic_karnak.sql):
- mesh.mesh.owner_pubkey: ed25519 hex of the invite signer
- mesh.invite.token_bytes: canonical signed bytes (for re-verification)
Both nullable; required for new meshes going forward.

canonical invite format (signed bytes):
  `${v}|${mesh_id}|${mesh_slug}|${broker_url}|${expires_at}|
   ${mesh_root_key}|${role}|${owner_pubkey}`

wire format — invite payload in ic://join/<base64url(JSON)> now has:
  owner_pubkey: "<64 hex>"
  signature:    "<128 hex>"

broker joinMesh() (apps/broker/src/broker.ts):
1. verify ed25519 signature over canonical bytes using payload's
   owner_pubkey → else invite_bad_signature
2. load mesh, ensure mesh.owner_pubkey matches payload's owner_pubkey
   → else invite_owner_mismatch (prevents a malicious admin from
   substituting their own owner key)
3. load invite row by token, verify mesh_id matches → else
   invite_mesh_mismatch
4. expiry check → else invite_expired
5. revoked check → else invite_revoked
6. idempotency: if pubkey is already a member, return existing id
   WITHOUT burning an invite use
7. atomic CAS: UPDATE used_count = used_count + 1 WHERE used_count <
   max_uses → if 0 rows affected, return invite_exhausted
8. insert member with role from payload

cli side:
- apps/cli/src/invite/parse.ts: zod-validated owner_pubkey + signature
  fields; client verifies signature immediately and rejects tampered
  links (fail-fast before even touching the broker)
- buildSignedInvite() helper: owners sign invites client-side
- enrollWithBroker sends {invite_token, invite_payload, peer_pubkey,
  display_name} (was: {mesh_id, peer_pubkey, display_name, role})
- parseInviteLink is now async (libsodium ready + verify)

seed-test-mesh.ts generates an owner keypair, sets mesh.owner_pubkey,
builds + signs an invite, stores the invite row, emits ownerPubkey +
ownerSecretKey + inviteToken + inviteLink in the output JSON.

tests — invite-signature.test.ts (9 new):
- valid signed invite → join succeeds
- tampered payload → invite_bad_signature
- signer not the mesh owner → invite_owner_mismatch
- expired invite → invite_expired
- revoked invite → invite_revoked
- exhausted (maxUses=2, 3rd join) → invite_exhausted
- idempotent re-join doesn't burn a use
- atomic single-use: 5 concurrent joins → exactly 1 success, 4 exhausted
- mesh_id payload vs DB row mismatch → invite_mesh_mismatch

verified live: tampered link blocked client-side with a clear error.
Unmodified link joins cleanly end-to-end (roundtrip.ts + join-roundtrip.ts
both pass). 64/64 tests green.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 23:02:12 +01:00
Alejandro Gutiérrez
9d3dbcecaf feat(broker): verify ed25519 hello signature against member pubkey
WS handshake is now authenticated end-to-end. The broker proves that
every connected peer actually holds the secret key for the pubkey
they claim as identity — not just that they know the pubkey.

wire format change:
  {type:"hello", meshId, memberId, pubkey, sessionId, pid, cwd,
   timestamp, signature}
  where signature = ed25519_sign(canonical, secretKey)
  and canonical = `${meshId}|${memberId}|${pubkey}|${timestamp}`

broker verifies on every hello:
1. timestamp within ±60s of broker clock → else close(1008, timestamp_skew)
2. pubkey is 64 hex chars, signature is 128 hex chars → else malformed
3. crypto_sign_verify_detached(signature, canonical, pubkey) → else bad_signature
4. (existing) mesh.member row exists for (meshId, pubkey) → else unauthorized

All rejection paths close the WS with code 1008 + structured error
message + metrics counter increment (connections_rejected_total by
reason).

new modules:
- apps/broker/src/crypto.ts: canonicalHello, verifyHelloSignature,
  HELLO_SKEW_MS constant
- apps/cli/src/crypto/hello-sig.ts: matching signHello helper

clients updated:
- apps/cli/src/ws/client.ts: signs hello before send
- apps/broker/scripts/{peer-a,peer-b}.ts (smoke-test): sign hellos
  with seed-provided secret keys

new regression tests — tests/hello-signature.test.ts (7):
- valid signature accepted
- bad signature (signed with wrong key) rejected
- timestamp too old rejected (>60s)
- timestamp too far in future rejected (>60s)
- tampered canonical field (different meshId at verify time) rejected
- malformed hex pubkey rejected
- malformed signature length rejected

verified live:
- apps/broker/scripts/smoke-test.sh: full hello+ack+send+push flow
- apps/cli/scripts/roundtrip.ts: signed hello + encrypted message
- 55/55 tests pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:53:40 +01:00
Alejandro Gutiérrez
81a8d0714b feat(crypto): client-side direct-message encryption with crypto_box
Direct messages between peers are now end-to-end encrypted. The
broker only ever sees {nonce, ciphertext} — plaintext lives on the
two endpoints.

apps/cli/src/crypto/envelope.ts:
- encryptDirect(message, recipientPubkeyHex, senderSecretKeyHex)
  → {nonce, ciphertext} via crypto_box_easy, 24-byte fresh nonce
- decryptDirect(envelope, senderPubkeyHex, recipientSecretKeyHex)
  → plaintext or null (null on MAC failure / malformed input)
- ed25519 keys (from Step 17) are converted to X25519 on the fly via
  crypto_sign_ed25519_{pk,sk}_to_curve25519 — one signing keypair
  covers both signing + encryption roles.

BrokerClient.send():
- if targetSpec is a 64-hex pubkey → encrypt via crypto_box
- else (broadcast "*" or channel "#foo") → base64-wrapped plaintext
  (shared-key encryption for channels lands in a later step)

InboundPush now carries:
- plaintext: string | null   (decrypted body, null if decryption failed
                              OR it's a non-direct message)
- kind: "direct" | "broadcast" | "channel" | "unknown"
MCP check_messages formatter reads plaintext directly.

side-fixes pulled in during 18a:
- apps/broker/scripts/seed-test-mesh.ts now generates real ed25519
  keypairs (the previous "aaaa…" / "bbbb…" fillers weren't valid
  curve points, so crypto_sign_ed25519_pk_to_curve25519 rejected
  them). Seed output now includes secretKey for each peer.
- apps/broker/src/broker.ts drainForMember wraps the atomic claim in
  a CTE + outer ORDER BY so FIFO ordering is SQL-sourced, not
  JS-sorted (Postgres microsecond timestamps collapse to the same
  Date.getTime() milliseconds otherwise).
- vitest.config.ts fileParallelism: false — test files share
  DB state via cleanupAllTestMeshes afterAll, so running them in
  parallel caused one file's cleanup to race another's inserts.
- integration/health.test.ts "returns 200" now uses waitFullyHealthy
  (a 200-only waiter) instead of waitHealthyOrAny — prevents a race
  with the startup DB ping.

verified live:
- apps/cli/scripts/roundtrip.ts (direct A→B): ciphertext in DB is
  opaque bytes (not base64-plaintext), decrypted correctly on arrival
- apps/cli/scripts/join-roundtrip.ts (full join → encrypted send):
  PASSED
- 48/48 broker tests green

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:48:33 +01:00
Alejandro Gutiérrez
cd389c6bdd fix(broker): atomic message claim to prevent duplicate delivery
drainForMember previously ran SELECT undelivered rows, THEN UPDATE
delivered_at. Two concurrent callers (e.g. WS fan-out on send +
handleHello's own drain for the target) could both SELECT the same
row before either UPDATEd, pushing the same envelope twice.

now: single atomic UPDATE ... FROM member ... WHERE id IN (
  SELECT id ... FOR UPDATE SKIP LOCKED
) RETURNING mq.*, m.peer_pubkey AS sender_pubkey.

FOR UPDATE SKIP LOCKED is the key primitive — concurrent callers
each claim DISJOINT sets, so a message can never be drained twice.
Union of all concurrent drains still covers every eligible row.

re-sorts RETURNING rows by created_at client-side (Postgres makes no
FIFO guarantee on the RETURNING clause's output order), and normalizes
created_at to Date since raw-sql results can come back as ISO strings.

regression: tests/dup-delivery.test.ts (4 tests)
- two concurrent drains produce disjoint result sets
- six concurrent drains partition cleanly (20 messages, each drained once)
- subsequent drain after success returns empty
- FIFO ordering preserved within a single drain

48/48 tests pass. Live round-trip no longer logs the double-push.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:39:48 +01:00
Alejandro Gutiérrez
3458860c1f test(broker): coverage for hardening modules — caps, limits, metrics, health, logs
Adds 23 tests across 4 files, taking total broker coverage from
21 → 44 passing in ~2.5s.

Unit tests (no I/O):
- tests/rate-limit.test.ts (6): TokenBucket capacity, refill rate,
  no-overflow cap, independent buckets per key, sweep GC.
- tests/metrics.test.ts (5): all 10 series present in /metrics,
  counter increment semantics, labelled series produce distinct lines,
  gauge set overwrites, Prometheus format well-formed.
- tests/logging.test.ts (5): JSON per line, required fields (ts, level,
  component, msg), context merging, level preservation, no plain-text
  escape hatches.

Integration tests (spawn real broker subprocesses on random ports):
- tests/integration/health.test.ts (7):
  * GET /health 200 + {status, db, version, gitSha, uptime} (healthy DB)
  * GET /health 503 + {status:degraded, db:down} (unreachable DB)
  * GET /metrics 200 text/plain with all expected series
  * GET /nope → 404
  * POST /hook/set-status oversized body → 413
  * POST /hook/set-status 6th req/min → 429
  * Rate limit isolation by (pid, cwd) key

Integration tests use node:child_process (vitest runs under Node, not
Bun — Bun.spawn isn't available). Each suite spawns its own broker
subprocess with a random port + tailored env vars.

Not yet covered (flagged for follow-up):
- WebSocket connection caps (needs seeded mesh + WS client setup)
- WebSocket message-size rejection (ws.maxPayload behavior)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:19:14 +01:00
Alejandro Gutiérrez
e25115f1b0 test(broker): port test suite from claude-intercom to drizzle/postgres
21 integration tests (14 broker behavior + 7 path encoding), all
passing in ~1s against a real Postgres (claudemesh_test database on
the dev container).

Test infrastructure:
- apps/broker/vitest.config.ts extends @turbostarter/vitest-config/base
- tests/helpers.ts: setupTestMesh() creates a fresh mesh + 2 members
  per test with a unique slug, returns cleanup function that cascades
  the delete. cleanupAllTestMeshes() as an afterAll safety net.
- Mesh isolation in broker logic means tests don't interfere even when
  they share a database — no per-test TRUNCATE needed.

Ported behavior tests (broker.test.ts, 14 tests):
- hook flips status + queued "next" messages unblock
- "now"-priority bypasses the working gate
- DND is sacred (hooks cannot unset it)
- hook source stays fresh through jsonl refresh
- source decays to jsonl when hook signal goes stale
- isHookFresh freshness window + source-type rules
- TTL sweep flips stuck "working" → idle
- TTL sweep leaves DND alone
- first-turn race: hook fired pre-connect stashed in pending_status
- applyPendingHookStatus picks newest matching entry
- expired pending entries are ignored on connect
- broadcast targetSpec (*) reaches all members
- pubkey mismatch → message not drained
- mesh isolation: peer in mesh X doesn't drain from mesh Y

Ported encoding tests (encoding.test.ts, 7 tests):
- macOS, Linux, Windows path encoding first-candidate correctness
- Roberto's H:\Claude → H--Claude regression test (2026-04-04)
- Candidate dedup, drive-stripped fallback, leading-dash fallback

How to run: from apps/broker,
  DATABASE_URL="postgresql://.../claudemesh_test" pnpm test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 22:09:06 +01:00