172 Commits

Author SHA1 Message Date
Alejandro Gutiérrez
1b28550f30 docs(roadmap): v1.34.16 + broker — continuous presence shipped
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Watchdogs (75s stale detect) and lease model (90s grace window for
silent reconnects) both shipped 2026-05-05.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:41:25 +01:00
Alejandro Gutiérrez
9d1b4f3d4c feat(broker): lease model — 90s grace window across WS reconnects
Continuous presence: peers no longer see peer_left/peer_joined for
transient WS reconnects. After a WS close, the connection enters a
90s grace window in offline-leased state. If the same session
reconnects (matched by sessionPubkey, or sessionId+memberPubkey for
member-WS) within grace, it silently swaps the WS reference, restores
online state, drains queued DMs, and resets the DB row. No peer ever
sees the session leave.

Mechanics:
- PeerConn gains leaseState ("online"|"offline"), leaseUntil, evictionTimer
- ws.on("close") starts grace instead of immediate cleanup; old
  socket close after a reattach is detected (conn.ws !== ws) and
  ignored, since the lease is already healthy on the new socket
- handleHello / handleSessionHello check for offline-leased entry
  matching the stable identity BEFORE running session-id dedup;
  reattach swaps ws, resets state, returns silent: true
- The hello dispatcher skips peer_joined broadcast when result.silent
- evictPresenceFully extracted from the close handler — runs the
  peer_left broadcast + cleanup (URL watches, streams, MCP registry,
  clock auto-pause). Called by evictionTimer after 90s, or directly
  if lease wasn't online (defensive)
- Stale-pong watchdog skips offline-leased entries (their WS is
  intentionally dead during grace)
- broker.ts exports restorePresence(presenceId) — clears
  disconnectedAt + bumps lastPingAt, called on reattach to undo any
  damage the DB-level stale-presence sweeper may have done during
  grace
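
A minimal in-memory sketch of the lease mechanics listed above, not the broker's actual code. The Map keyed by a stable identity string, the `onHello` / `onClose` names, and the plain-object stand-in for the socket are assumptions for illustration; only the field names and the 90s window come from this commit message.

```typescript
// Hypothetical model of the lease state machine; names are illustrative.
type LeaseState = "online" | "offline";

interface LeasedConn {
  identity: string;          // stable identity, e.g. sessionPubkey (assumed key)
  ws: object | null;         // stand-in for the WebSocket reference
  leaseState: LeaseState;
  leaseUntil: number | null; // epoch ms at which the grace window ends
}

const GRACE_MS = 90_000; // 90s grace window from the commit message

// On WS close: enter offline-leased state, unless the close comes from an
// old socket that a reattach already replaced (conn.ws !== ws).
function onClose(conn: LeasedConn, ws: object, now: number): void {
  if (conn.ws !== ws) return; // stale close after reattach: ignore
  conn.ws = null;
  conn.leaseState = "offline";
  conn.leaseUntil = now + GRACE_MS;
}

// On hello: reattach silently if a matching offline-leased entry is still
// inside its grace window; otherwise register a fresh connection.
function onHello(
  conns: Map<string, LeasedConn>,
  identity: string,
  ws: object,
  now: number,
): { silent: boolean } {
  const existing = conns.get(identity);
  if (
    existing !== undefined &&
    existing.leaseState === "offline" &&
    existing.leaseUntil !== null &&
    now < existing.leaseUntil
  ) {
    existing.ws = ws;
    existing.leaseState = "online";
    existing.leaseUntil = null;
    return { silent: true }; // caller skips the peer_joined broadcast
  }
  conns.set(identity, { identity, ws, leaseState: "online", leaseUntil: null });
  return { silent: false };
}
```

The eviction timer, DM drain, and DB restore are left out; the sketch only shows why a late close from the old socket must not reopen the grace window.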

DMs sent to a session in grace fall through to today's existing
queueing path (sendToPeer no-ops on dead WS, the message_queue row
sits with deliveredAt=NULL, drained on reattach via the existing
maybePushQueuedMessages call). No protocol change. No DB schema
change. Backward compatible — old daemons against this broker get
silent reconnects within 90s, full peer_joined cycle beyond.

Layer 2 of the continuous-presence work; spec at
.artifacts/specs/2026-05-05-continuous-presence.md. Layer 3
(daemon-side resume token storage + send) is optional polish, not
needed for the user-visible behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:31:55 +01:00
Alejandro Gutiérrez
ffd0621ccc feat(broker,cli): liveness watchdogs — 75s stale-pong terminate
Both sides now actively detect half-dead WS connections instead of
waiting for kernel TCP keepalive (~2hrs default on Linux). User-reported
bug: "claudemesh peer list" shows zero peers despite running
sessions, because NAT/CGNAT silently dropped the WS flow but neither
side noticed.

Broker (apps/broker/src/index.ts):
- Add lastPongAt to PeerConn, populate at connections.set sites,
  bump in ws.on("pong").
- 30s ping loop now also terminates conns whose pong is >75s stale.
  ws.terminate() fires the close handler → existing peer_left path.

Daemon (apps/cli/src/daemon/ws-lifecycle.ts):
- Add idle watchdog at 30s cadence, started after hello-ack.
- Bumps lastActivity on incoming message, ping, and pong frames.
- Sends sock.ping() if recent activity, terminates if idle >75s.
- Watchdog cleared on close handler + explicit close().
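
Both watchdogs above reduce to the same staleness comparison run on a 30s tick. The sketch below shows that check as pure functions with an injected clock. `PongTracked`, `stalePeers`, and `watchdogTick` are hypothetical names, and the real code acts on sockets (ws.terminate(), sock.ping()) rather than returning values.

```typescript
// Hypothetical helpers; only lastPongAt and the 75s limit come from the commit.
interface PongTracked {
  id: string;
  lastPongAt: number; // epoch ms, bumped on every pong
}

const STALE_MS = 75_000; // stale threshold used on both sides

// Broker side: ids of connections whose last pong is more than 75s old.
// The caller would terminate each of them inside the 30s ping loop.
function stalePeers(conns: PongTracked[], now: number): string[] {
  return conns.filter((c) => now - c.lastPongAt > STALE_MS).map((c) => c.id);
}

type WatchdogAction = "ping" | "terminate";

// Daemon side: decide what one watchdog tick should do, given the time of
// the last incoming message, ping, or pong frame.
function watchdogTick(lastActivity: number, now: number): WatchdogAction {
  return now - lastActivity > STALE_MS ? "terminate" : "ping";
}
```

With a 30s cadence and a 75s limit, a dead connection is noticed on the third tick after its last sign of life, so detection takes at most about 105s.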

CLI 1.34.15 → 1.34.16. Broker stays 0.1.0 (deploys from main).

Spec: .artifacts/specs/2026-05-05-continuous-presence.md (full lease
model + resume token, this commit ships only the watchdogs — first
of four progressive layers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:22:15 +01:00
Alejandro Gutiérrez
b9ecbe79ad feat(web): refresh Latest News toaster — current shipped work
Replace four April-vintage entries (claudemesh launch v0.1.4, Mesh
Dashboard placeholder, MCP bridge placeholder, "SQLite-backed"
self-host) with the four most recent shipped milestones: kick refuses
control-plane (v1.34.15), 1.34.x multi-session correctness train,
per-session presence (v1.30.0), multi-mesh daemon (v1.26.0). All
entries link to /changelog instead of dead "#" hrefs or the old
github.com/alezmad/claudemesh-cli repo.

Copy passes Strunk: active voice, concrete versions, no puffery.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 04:48:24 +01:00
Alejandro Gutiérrez
33051b95bf feat(web): marketing audit — Agent Teams positioning, MCP/dashboard claims fixed
Comprehensive review of all home-page marketing components against
the post-correction positioning. Five surgical fixes, zero hand-waving.

CTA copy. The previous "Anthropic built Claude Code per developer.
The next unlock is between developers." was a strong line in 2025
but Anthropic Agent Teams (Feb 2026) IS now between-developers
within one machine. Replaced with the accurate distinction:
"Anthropic Agent Teams stops at the edge of one laptop. claudemesh
starts there — across machines, users, and organizations."

WhereMeshFits — new "vs. Agent Teams" comparison card. The single
most important card the page can have right now. Most readers
arriving in May 2026 know about Agent Teams; the comparison they
want to read is exactly this one. Also tightened the "What
claudemesh is" claim card to lean into "across machines, users,
orgs" instead of the narrower "peer network for Claude Code"
framing.

FAQ (four updates):
  1. "How is this different from MCP?" was claiming "43 tools that
     let peers message, share files…" which contradicted v1.5.0's
     ship of tool-less MCP (tools/list returns []). Replaced with
     the actual current architecture: thin push-pipe + resource-
     noun-verb CLI bundled as a skill.
  2. New entry "How is this different from Anthropic's Agent
     Teams?" — the biggest gap in the FAQ given the new ecosystem.
     Same shape as the WhereMeshFits card so the messaging stays
     consistent across surfaces.
  3. "Can a peer be in multiple meshes?" updated to reflect
     v1.26.0's universal multi-mesh daemon (was speaking about it
     as roadmap; it's been shipped for ~2 days). Bridge peers
     promoted from "v0.2 roadmap" to "shipped in v0.2.0 (v1.6.0)".
  4. "Free during public beta" no longer claims paid tiers launch
     "when the dashboard ships" — dashboard already shipped (v1.5+
     web chat, v1.7 demo cut). Replaced with team-scale features
     (SSO, audit retention, dedicated brokers) as the pricing
     trigger.

Pricing card — same "dashboard ships" → "team-scale features"
language fix as the FAQ pricing entry. Single source of truth
maintained between FAQ + Pricing card.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 23:10:27 +01:00
Alejandro Gutiérrez
64d9f9f6f9 feat(web): refresh marketing site — accurate timeline, live changelog, cross-boundary positioning
The site had drifted ~6 months behind the product. Three problems
addressed in one push:

1. Timeline ("Shipped, not promised") topped out at v0.6–0.8 and
   claimed "66 npm releases" — both stale. Adds a v0.9 → 1.34 tier
   covering daemon, multi-mesh, multi-session correctness train,
   refuse-to-kick on control-plane, env-var fallback. Updates count
   to "120+ npm releases through v1.34.15." Rewrites the "next"
   block from the now-shipped "Daemon redesign · per-topic
   encryption" to the actually-pending "HKDF cross-machine identity
   · session capabilities · A2A interop · self-host packaging ·
   federation."

2. Hero subhead leaned into the original "Claude Code peer mesh"
   framing, which is undercut by Anthropic Agent Teams (Feb 2026,
   single-machine native mailbox). Now reframes claudemesh as the
   encrypted backbone where Claude Code sessions, autonomous
   agents, and humans coordinate "across machines, across users,
   across organizations": the three phrases that distinguish the
   product from anything Anthropic can structurally ship from
   inside Claude Code.

3. /changelog had three entries from April 2026 (v0.1.2 → v0.1.4)
   and was 70+ versions out of date. Replaced with a curated
   16-entry timeline from v0.1.0 → v1.34.15, hand-picked to tell
   the story (load-bearing ships, not every patch). Adds links
   back to docs/roadmap.md, .artifacts/specs/, and GitHub Releases.

New module: apps/web/src/modules/marketing/home/changelog-data.ts
holds the curated entries as a single source of truth. Imported by
both the /changelog page and a new home-page component
LatestReleases (compact 5-entry strip, slotted between Timeline
and Pricing) so they never disagree.

Misc fixes pulled in:
- timeline.tsx had glyph="layers" which isn't in SectionIcon's
  valid set; switched to "grid" (changelog-data.ts uses same).
- changelog data extracted to a non-route module so Next.js's
  route-export validator stops complaining about exporting
  CHANGELOG_ENTRIES from app/.../changelog/page.tsx.

Pre-existing typecheck noise in packages/ui/web/sidebar.tsx
(csstype version mismatch) and the billing modules is unrelated to
this change. My files all typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:55:30 +01:00
Alejandro Gutiérrez
7f61a711f1 docs(roadmap): mark 1.34.x triage gaps 1-3 shipped, gap 4 spec'd
Updates the "Known gaps tracked for follow-ups" subsection of the
v1.34.x section to reflect the 2026-05-04 follow-up sprint:

- Gap 1 (stale CLAUDEMESH_CONFIG_DIR) shipped in 1.34.14.
- Gap 2 (peer list --mesh scope) shipped in 1.34.15. Notes the
  diagnosis correction — bug was CLI-side, not broker.
- Gap 3 (kick no-op on control-plane) shipped in 1.34.15 as
  refuse-with-hint. Richer presence-pause verb deferred.
- Gap 4 (session capabilities) has a written spec at
  .artifacts/specs/2026-05-04-session-capabilities.md;
  implementation queued behind v0.3.0 topic-encryption.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 22:05:30 +01:00
Alejandro Gutiérrez
96520394ff docs(spec): session capabilities — first-class concept
Spec for the gap #4 follow-up from the 1.34.x triage. Builds on
2026-04-15-per-peer-capabilities.md (member-keyed recipient grants)
by adding a sender-side cap subset on session attestations: parent
member signs {session_pubkey, allowed_caps[], expires_at}, broker
enforces intersection of recipient grants × session caps on every
protected operation.

v2 attestation alongside v1 (different canonical prefix
"claudemesh-session-attest-v2|..." → no collision). Default when
no caps subset is declared = full member caps (today's behavior;
opt-in restriction, not breaking).
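
The intersection rule and the default described above can be sketched as follows. This is an illustration under assumptions: capabilities are modelled as plain strings, `sessionCaps` and `isAllowed` are hypothetical names, and attestation signing and `expires_at` handling are left out.

```typescript
// Hypothetical capability model; the spec's real types are not shown here.
type Cap = string; // e.g. "dm", "read", "state-write"

// Caps a session carries. No declared subset means full member caps
// (today's behavior); a declared subset can only narrow them.
function sessionCaps(memberCaps: Cap[], allowedCaps?: Cap[]): Cap[] {
  if (allowedCaps === undefined) return memberCaps.slice();
  return memberCaps.filter((c) => allowedCaps.indexOf(c) !== -1);
}

// A protected operation passes only when the needed cap appears in both
// the recipient's grants and the sender session's caps (the intersection).
function isAllowed(recipientGrants: Cap[], session: Cap[], needed: Cap): boolean {
  return recipientGrants.indexOf(needed) !== -1 && session.indexOf(needed) !== -1;
}
```

Under this model a session launched with `--caps dm,read` cannot write shared state even when the recipient grants it, because the session side of the intersection lacks the cap.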

CLI surface: claudemesh launch --caps dm,read. Bonus: set_state
gate (state-write cap) ships in the same release — closes the
"any session can clobber shared keys like current-pr" footgun.

Migration: dry-run mode for one release before flipping
enforcement. Mirrors the original per-peer-capabilities rollout.

Estimate: ~1 sprint + 1 week dry-run window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:59:18 +01:00
Alejandro Gutiérrez
a2a53ff355 feat(cli,broker): 1.34.14 + 1.34.15 — env-var fallback, peer list scope, kick refuses control-plane
Three follow-ups from the 1.34.x multi-session correctness train,
all backwards-compatible.

1.34.14 — stale CLAUDEMESH_CONFIG_DIR falls back. The launch flow
exposes CLAUDEMESH_CONFIG_DIR=<tmpdir> to its spawned claude; if a
later claudemesh invocation inherited that env (Bash tool inside
Claude Code, tmux update-environment, exported var), the inherited
path pointed at a tmpdir that no longer existed and readConfig()
silently returned empty. paths.ts now memoizes resolution: env unset
→ default; env points at a real dir → trust it; env set but dir gone
→ TTY-only stderr warning with shell-specific unset hint, fall back
to ~/.claudemesh.
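
The three-way resolution described above can be sketched as a pure function. This is not the code in paths.ts: `resolveConfigDir`, the injected `dirExists` check, and the warning text are assumptions, and memoization plus the TTY-only gate are omitted.

```typescript
// Hypothetical sketch of the fallback rule; names are illustrative.
interface ConfigDirResolution {
  dir: string;
  warning?: string; // set only when the env var pointed at a missing dir
}

function resolveConfigDir(
  envValue: string | undefined,        // value of CLAUDEMESH_CONFIG_DIR
  dirExists: (path: string) => boolean, // injected so the rule is testable
  defaultDir: string,                   // e.g. the expanded ~/.claudemesh
): ConfigDirResolution {
  // Env unset: use the default.
  if (envValue === undefined || envValue === "") return { dir: defaultDir };
  // Env points at a real directory: trust it.
  if (dirExists(envValue)) return { dir: envValue };
  // Env set but directory gone: warn and fall back to the default.
  return {
    dir: defaultDir,
    warning: "CLAUDEMESH_CONFIG_DIR points at a missing directory: " + envValue,
  };
}
```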

1.34.15 — peer list --mesh actually scopes. peers.ts and launch.ts
were calling tryListPeersViaDaemon() with no argument; the daemon's
?mesh= filter (server-side, since 1.26.0) was already correct, the
CLI just wasn't passing the slug. Forwarding fixed in both sites;
send.ts cross-mesh hex-prefix resolution intentionally untouched.

1.34.15 — kick refuses no-op kicks on control-plane. Pre-1.34.15
kicking a daemon's member-WS just closed the socket and triggered
auto-reconnect — a no-op with a misleading "session ended" message.
Broker now skips peers where peerRole === "control-plane" and
surfaces them in a new additive ack field skipped_control_plane;
the CLI reads it and prints a clearer hint pointing at ban / daemon
down. Soft disconnect verb keeps old behavior. PeerConn gains a
peerRole slot populated at both connections.set sites.
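
A sketch of the skip rule, assuming the ack lists pubkeys. The commit names the `skipped_control_plane` field but not its shape, so the string arrays, the `kicked` field, and the `kickPeers` name are all hypothetical.

```typescript
// Hypothetical shapes; only peerRole and skipped_control_plane are from the commit.
interface KickTarget {
  pubkey: string;
  peerRole: "control-plane" | "session" | "service";
}

interface KickAck {
  kicked: string[];                // assumed field name
  skipped_control_plane: string[]; // additive field; array shape assumed
}

// Kicking a control-plane peer would only trigger its auto-reconnect, so
// those peers are reported back to the caller instead of being closed.
function kickPeers(targets: KickTarget[]): KickAck {
  const ack: KickAck = { kicked: [], skipped_control_plane: [] };
  for (const peer of targets) {
    if (peer.peerRole === "control-plane") {
      ack.skipped_control_plane.push(peer.pubkey);
    } else {
      ack.kicked.push(peer.pubkey);
    }
  }
  return ack;
}
```

Because the field is additive, an older CLI that ignores it keeps working; a newer CLI can print the ban / daemon down hint whenever the skipped list is non-empty.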

Tests: 4 new for paths-stale-env, 5 for kick-control-plane-skip.
CLI 87/87 green; broker 55/55 unit green (integration tests fail on
this machine because of a pre-existing infra gap).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:59:06 +01:00
Alejandro Gutiérrez
6780899185 feat(cli): 1.34.7 → 1.34.13 — multi-session correctness train
Seven-ship sequence that took the daemon from "works for one session"
to "internally consistent for N sessions on one daemon." Architecture
invariant after 1.34.13: every shared store / channel scopes by
recipient (SSE demux at bind layer + token forwarding, inbox per-
recipient columns, outbox sender-session routing).

- 1.34.7  inbox flush + delete commands
- 1.34.8  seen_at column + TTL prune + first echo guard
- 1.34.9  broader echo guard + system-event polish + staleness warning
- 1.34.10 per-session SSE demux (SseFilterOptions) + universal daemon
          (--mesh / --name deprecated) + daemon_started version stamp
- 1.34.11 inbox per-recipient column (storage half of 1.34.10)
- 1.34.12 daemon up detaches by default (logs to ~/.claudemesh/daemon/
          daemon.log; service units explicitly pass --foreground)
- 1.34.13 MCP forwards session token on /v1/events — the actual fix
          that activates 1.34.10's demux. Without this header the
          daemon's session resolved null, filter was empty, every MCP
          received the unfiltered global stream.

Roadmap entry at docs/roadmap.md captures the timeline + the four
known gaps tracked for follow-ups (launch env-var leak, broker
listPeers mesh-filter, kick on control-plane no-op, session caps as
first-class concept).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 21:10:07 +01:00
Alejandro Gutiérrez
cba4a938ec chore(cli): keep WS lifecycle diagnostic logs
Five info-level log points across the WS lifecycle helper:
ws_open_attempt / ws_open_ok / ws_hello_sent / ws_hello_acked /
ws_closed (with status + close code/reason).

Surfaced during M1 smoke testing — without these the only visible
signal was "presence row missing on broker," which made it hard to
distinguish "WS never opened" / "opened but hello rejected" /
"acked then closed by broker."

Both clients prefix the helper-emitted msg ("session_broker_*",
"broker_*") so log greps stay clean per role.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:34:18 +01:00
Alejandro Gutiérrez
706e681d6e feat: 1.33.0 — m1 ship: peerRole rename + client_ack wired + version bump
Resolves the merge of m1-broker-drain-race-and-presence-role and
m1-cli-lifecycle-and-role-peer-list into main:

* Rename wire-level role classification field `role` → `peerRole`
  to avoid collision with 1.31.5's top-level `role` lift of
  `profile.role` (user-supplied string consumed by the agent-vibes
  claudemesh skill). `peerRole` is the broker presence taxonomy
  (control-plane/session/service); top-level `role` keeps its 1.31.5
  semantics.
  - apps/broker/src/broker.ts (listPeersInMesh return)
  - apps/broker/src/index.ts (peers_list response)
  - apps/broker/src/types.ts (WSPeersListMessage)
  - apps/cli/src/commands/peers.ts (PeerRecord + filter + lift)

* Wire CLI client_ack emission: handleBrokerPush gains
  ackClientMessage callback; daemon-WS and session-WS each got a
  sendClientAck() method that frames {type:"client_ack",
  clientMessageId, brokerMessageId?} and forwards via the lifecycle
  helper. run.ts wires the callback into both onPush paths.
  Receiver dedupes against existing inbox row first then acks
  unconditionally — broker needs the ack regardless of dedupe to
  release its claim lease.
  - apps/cli/src/daemon/inbound.ts (ackClientMessage in InboundContext)
  - apps/cli/src/daemon/broker.ts + session-broker.ts (sendClientAck)
  - apps/cli/src/daemon/run.ts (wire-up)

* Version bump 1.32.1 → 1.33.0; CHANGELOG entry replaces "Unreleased"
  with full m1 description.
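
The client_ack bullet above can be sketched as follows. The frame shape is taken from the commit message. `buildClientAck`, `receivePush`, and the Set-based inbox are hypothetical stand-ins for sendClientAck() and the real inbox store.

```typescript
// Frame shape from the commit; helper names are illustrative.
interface ClientAck {
  type: "client_ack";
  clientMessageId: string;
  brokerMessageId?: string;
}

function buildClientAck(clientMessageId: string, brokerMessageId?: string): ClientAck {
  const ack: ClientAck = { type: "client_ack", clientMessageId };
  if (brokerMessageId !== undefined) ack.brokerMessageId = brokerMessageId;
  return ack;
}

// Dedupe against the inbox first, then ack unconditionally: the broker
// needs the ack even for a duplicate so it can release its claim lease.
// Returns true when the message was new.
function receivePush(
  inbox: Set<string>,
  clientMessageId: string,
  send: (ack: ClientAck) => void,
  brokerMessageId?: string,
): boolean {
  const isNew = !inbox.has(clientMessageId);
  if (isNew) inbox.add(clientMessageId);
  send(buildClientAck(clientMessageId, brokerMessageId));
  return isNew;
}
```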

Verification: tsc clean across cli + broker; CLI 83/83 unit tests
pass; broker 50 unit tests pass (5 integration test files require a
live Postgres and were skipped — pre-existing infra gap, not a
regression). CLI bundle rebuilt; version 1.33.0 baked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:17:45 +01:00
Alejandro Gutiérrez
c036f759c3 Merge m1-cli-lifecycle-and-role-peer-list into main
Milestone 1 CLI side:
- New apps/cli/src/daemon/ws-lifecycle.ts: connectWsWithBackoff helper
- DaemonBrokerClient + SessionBrokerClient refactored to use the helper
- DaemonBrokerClient: stray sessionPubkey + getSessionKeys() removed
- daemon-WS onPush no longer carries session secret (member-only decrypt)
- IPC send paths now sign with mesh member secret
- peers.ts: filters role==='control-plane' by default; --all opts in;
  JSON output exposes role field

NOTE: a follow-up commit on main renames the wire-level field 'role'
to 'peerRole' to avoid collision with 1.31.5's profile.role lift.
2026-05-04 18:11:47 +01:00
Alejandro Gutiérrez
54e00109ab Merge m1-broker-drain-race-and-presence-role into main
Milestone 1 broker side:
- Schema: claimedAt + claimId + claimExpiresAt on message_queue,
  role on presence (default 'session')
- Migration 0029_drain_lease_and_presence_role.sql
- drainForMember rewritten for two-phase claim/deliver with 30s lease
- New markDelivered() called on receipt of client_ack
- New sweepExpiredClaims() running every 15s
- handleHello sets role='control-plane', handleSessionHello sets 'session'
- listPeersInMesh returns role
- WSClientAckMessage type added; broker accepts and dispatches client_ack
2026-05-04 18:11:47 +01:00
Alejandro Gutiérrez
16c148a87f docs(specs): m1 — agentic-comms architecture spec (v1 + v2 frozen)
v1: initial 3-layer architecture proposal, reviewed by Codex GPT-5.2 (high)
v2: full end-state with hybrid P2P data plane, broker as coordination
    plane only, 6 layers, 8 architectural milestones, Codex-2 corrections
    (at-least-once requires client_ack, service_pubkey explicit, meta
    required in v2 envelope, streamId required for stream channel,
    explicit revocation flow). v2 is frozen for implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:11:29 +01:00
Alejandro Gutiérrez
b57e47ed65 feat(broker): m1 — two-phase claim/deliver + client_ack + role-tagged presence
Three correctness fixes on top of the m1 schema migration:

1) Fix the drainForMember claim-then-push race
   ----------------------------------------------------------------
   Previously the claim CTE set delivered_at = NOW() *before* the WS
   send. If readyState !== OPEN at push time, the row was marked
   delivered and the message dropped silently — at-most-once with no
   retry hook.

   The new flow:
     - claim sets (claimed_at, claim_id, claim_expires_at = NOW()+30s)
     - delivered_at stays NULL until the recipient acks
     - re-eligibility predicate now also accepts rows whose lease
       expired, so dropped pushes redeliver (at-least-once)

   Adds two helpers:
     - markDelivered() — scoped to (mesh_id, recipient pubkey) so a
       peer can only ack its own messages
     - sweepExpiredClaims() — clears expired (claimed_at, claim_id,
       claim_expires_at) every 15s, wired into startSweepers

2) Accept `client_ack` from recipients
   ----------------------------------------------------------------
   New WS message type handled in the dispatcher right after `send`.
   Lookups by clientMessageId or brokerMessageId; either is fine. Until
   the daemon (apps/cli, separate worktree) starts emitting acks, leases
   will simply expire and re-deliver — which is the desired retry
   behaviour.

3) Tag presence rows with `role`
   ----------------------------------------------------------------
   handleHello (member-keyed, used by the long-lived daemon WS) →
     role: 'control-plane'
   handleSessionHello (per-Claude-Code session WS) →
     role: 'session'

   listPeersInMesh exposes the new field; the peers_list response
   surfaces it. WSPeersListMessage type adds an optional `role` plus the
   long-undocumented `memberPubkey`. CLI-side filter swap from peerType
   to role lands in a follow-up worktree — that's why the CLI is
   untouched here per the M1 spec.
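
The two-phase flow from item 1 can be modelled in memory to show why a dropped push redelivers. This is a sketch, not the SQL in drainForMember: the row array replaces the message_queue table, the function bodies are assumptions, and the (mesh_id, recipient pubkey) scoping of markDelivered is omitted.

```typescript
// Hypothetical in-memory stand-in for message_queue rows.
interface QueueRow {
  id: string;
  deliveredAt: number | null;
  claimedAt: number | null;
  claimId: string | null;
  claimExpiresAt: number | null;
}

const LEASE_MS = 30_000; // 30s claim lease from the commit message

// Phase 1: claim undelivered rows that are unclaimed or whose lease expired.
// deliveredAt stays null until the recipient acks.
function claim(rows: QueueRow[], claimId: string, now: number): QueueRow[] {
  const claimed: QueueRow[] = [];
  for (const row of rows) {
    const leaseFree = row.claimExpiresAt === null || row.claimExpiresAt <= now;
    if (row.deliveredAt === null && leaseFree) {
      row.claimedAt = now;
      row.claimId = claimId;
      row.claimExpiresAt = now + LEASE_MS;
      claimed.push(row);
    }
  }
  return claimed;
}

// Phase 2: mark a row delivered on receipt of client_ack.
function markDelivered(rows: QueueRow[], id: string, now: number): boolean {
  for (const row of rows) {
    if (row.id === id && row.deliveredAt === null) {
      row.deliveredAt = now;
      return true;
    }
  }
  return false;
}

// Sweeper: clear expired claims on undelivered rows; returns how many.
function sweepExpiredClaims(rows: QueueRow[], now: number): number {
  let cleared = 0;
  for (const row of rows) {
    if (row.deliveredAt === null && row.claimExpiresAt !== null && row.claimExpiresAt <= now) {
      row.claimedAt = null;
      row.claimId = null;
      row.claimExpiresAt = null;
      cleared++;
    }
  }
  return cleared;
}
```

A row that is claimed but never acked becomes claimable again once its lease expires, which gives the at-least-once retry; the receiver-side dedupe absorbs the resulting duplicates.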

Typechecks clean (apps/broker tsc --noEmit, packages/db tsc --noEmit).
Test suite needs a real DB so wasn't run in this worktree; existing
dup-delivery and broker tests use drainForMember positionally and the
new claimerPresenceId arg is optional, so they should continue to pass.
2026-05-04 18:10:25 +01:00
Alejandro Gutiérrez
5a8db796a0 feat(db): m1 — message_queue claim lease + presence.role columns
Schema groundwork for v2 agentic-comms milestone 1.

mesh.message_queue gets three nullable columns (claimed_at, claim_id,
claim_expires_at) so drainForMember can move from "claim-and-deliver in
one UPDATE" to a two-phase claim/lease + recipient-ack model. This is
the at-least-once retry hook the broker has been missing.

mesh.presence gets a typed `role` column ('control-plane' | 'session'
| 'service') with default 'session' so legacy hellos keep working. The
CLI's hidden-daemon hack (peerType === 'claudemesh-daemon') will swap
to a role-based filter in a follow-up worktree.

Migration is hand-authored as 0029_*.sql to match the existing pattern
(drizzle-kit's _journal.json drifted long ago — the runtime migrator
in apps/broker/src/migrate.ts tracks files lexicographically via
mesh.__cmh_migrations, not the journal).
2026-05-04 18:10:04 +01:00
Alejandro Gutiérrez
dab80f475e refactor(cli): m1 lifecycle + role-aware peer list
Foundational cleanups before agentic-comms architecture work
(.artifacts/specs/2026-05-04-agentic-comms-architecture-v2.md).
All behavior-preserving.

1. Extract `connectWsWithBackoff` into apps/cli/src/daemon/ws-lifecycle.ts.
   Both DaemonBrokerClient and SessionBrokerClient now share one
   lifecycle implementation (connect, hello-handshake, ack-timeout,
   close + backoff reconnect). Each client provides its own buildHello
   / isHelloAck / onMessage hooks and keeps its own RPC bookkeeping
   (pendingAcks, peerListResolvers, onPush). Composition over
   inheritance per Codex's review; no protocol shape changes.

2. Drop daemon-WS ephemeral session pubkey. DaemonBrokerClient no
   longer mints + sends a per-reconnect ephemeral keypair in its
   hello. Session-targeted DMs land on SessionBrokerClient since
   1.32.1, not the member-keyed daemon-WS, so the field was
   vestigial. Send-encrypt path now signs DMs with the stable mesh
   member secret. handleBrokerPush invocations from daemon-WS only
   pass the member secret — session decryption is the session-WS's
   job.

3. Role-aware peer list. `peer list` now hides peers whose
   broker-emitted `role` is `'control-plane'`. `--all` opts back in.
   JSON output emits `role` at top level. Older brokers that don't
   emit role yet default to 'session', so legacy peer rows stay
   visible without the broker-side change shipped first. Replaces
   the prior `peerType === 'claudemesh-daemon'` channel-name hack.
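
The role-aware filter from item 3 can be sketched like this. `PeerRow` and `visiblePeers` are hypothetical names; the default of 'session' for brokers that do not emit `role` follows the text above.

```typescript
// Hypothetical peer row; only the role semantics come from the commit.
interface PeerRow {
  displayName: string;
  role?: string; // absent when an older broker does not emit it
}

// Default a missing role to "session", then hide control-plane peers
// unless the caller passed --all.
function visiblePeers(
  peers: PeerRow[],
  showAll: boolean,
): Array<{ displayName: string; role: string }> {
  return peers
    .map((p) => ({ displayName: p.displayName, role: p.role ?? "session" }))
    .filter((p) => showAll || p.role !== "control-plane");
}
```

Defaulting before filtering is what keeps legacy rows visible: a row with no role can never match 'control-plane'.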

Typecheck + tests + build all green.
2026-05-04 18:08:32 +01:00
Alejandro Gutiérrez
a25102a79f fix(cli): 1.32.1 — DMs to session pubkeys finally land in inbox
SessionBrokerClient (daemon-side, since 1.30.0) was constructed
without a push handler and silently dropped every inbound `push` /
`inbound` frame. Header docstring claimed it handled "inbound DM
delivery for messages targeted at the session pubkey" but the
callback was never wired.

Net effect: any DM sent to a peer's session pubkey (everything
`peer list` returns now) was queued, broker-acked, marked
delivered_at on the broker, and thrown away by the recipient
daemon. inbox.db stayed at zero rows; `claudemesh inbox` reported
"no messages" no matter what arrived.

Two-session smoke surfaced this — sender outbox status=done with
broker_message_id, recipient inbox empty.

Fix: wire SessionBrokerClient to forward push/inbound frames to
the same handleBrokerPush the member-keyed broker already uses.
Pass the per-session secret key as sessionSecretKeyHex so
decryptOrFallback tries it first; member key remains the fallback
for legacy member-targeted traffic.

Verified end-to-end with two registered sessions sending in both
directions — inbox.db row count went 0 → 2.

Files: apps/cli/src/daemon/session-broker.ts,
apps/cli/src/daemon/run.ts. No broker change required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:33:18 +01:00
Alejandro Gutiérrez
7460d34335 feat(cli): 1.32.0 — multi-session UX bundle (self-identity, --self fan-out, broker welcome)
Nine UX bugs surfaced from a real two-session interconnect smoke
test, shipped together.

Self-identity is visible
- peer list now shows the caller as (this session), sorted to top.
  Daemon path resolves session pubkey via /v1/sessions/me so
  isThisSession is set correctly warm.
- whoami shows session pubkey, session id, mesh, role, groups, cwd,
  pid when run inside a launched session.

Sibling-session disambiguation
- peer list rows carry sid:<short> tag so visually-identical rows
  can be told apart at a glance.

Daemon hidden by default
- claudemesh-daemon presence rows hidden from peer list by default.
  --all opts back in. Header shows N daemon hidden when applicable.

--self flag works end-to-end
- Argv parser was greedy: --self ate the next arg as its value.
  BOOLEAN_FLAGS set in cli/argv.ts now lists known no-value switches.
- message send subcommand now passes self through (only legacy send
  was wired before).
- Help text lists --self.
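
The greedy-parser fix can be illustrated with a small parser that consults a set of no-value switches before consuming the next argument. This is a sketch, not cli/argv.ts: the set contents shown and the `parseArgv` shape are assumptions.

```typescript
// Illustrative contents; the real BOOLEAN_FLAGS list is not shown in the commit.
const BOOLEAN_FLAGS = new Set<string>(["self", "all", "json"]);

interface ParsedArgs {
  flags: Record<string, string | boolean>;
  positionals: string[];
}

function parseArgv(argv: string[]): ParsedArgs {
  const out: ParsedArgs = { flags: {}, positionals: [] };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg.indexOf("--") !== 0) {
      out.positionals.push(arg);
      continue;
    }
    const name = arg.slice(2);
    const next = argv[i + 1];
    // Known switches never consume the next argument; other flags take it
    // as their value unless it is missing or is itself a flag.
    if (BOOLEAN_FLAGS.has(name) || next === undefined || next.indexOf("--") === 0) {
      out.flags[name] = true;
    } else {
      out.flags[name] = next;
      i++; // skip the consumed value
    }
  }
  return out;
}
```

Without the set lookup, `--self` would swallow the recipient that follows it, which is the bug described above.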

Member-pubkey fan-out
- Sending to your own member pubkey with --self now resolves to every
  connected sibling session and sends one message per recipient.
  Required because the broker drain matches target_spec only against
  full session pubkeys; member-pubkey sends queued but never drained.

Broker welcome at launch
- After the launch banner, one line confirms WS state, peer count,
  and unread inbox count. Best-effort — falls back gracefully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:02:28 +01:00
Alejandro Gutiérrez
25586d298f fix(cli): 1.31.6 — resolve hex prefix to full pubkey before send so messages actually deliver
claudemesh send <16-hex-prefix> would ack with sent to <prefix> (daemon)
but the recipient never received the message. Broker pre-flight and
the drain query both exact-match on full 64-char pubkey, so a prefix
queued successfully but no recipient drain ever fetched the row.
Sender saw sent, recipient saw nothing — silent drop.

Fix: CLI resolves any hex prefix (4-63 chars, not full 64) to the
full pubkey via the daemon peer list before submitting. Outcomes:

- unique match: canonicalize and continue
- no match: clear error + list of online peer display names
- multiple: clear error + candidate list + hint to lengthen prefix
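
The three outcomes above map naturally onto a tagged union. The sketch below assumes the peer list is available as an array of full pubkeys; `resolvePrefix`, `isHexPrefix`, and the result shape are hypothetical names, not the CLI's.

```typescript
// Hypothetical result type covering the three outcomes listed above.
type PrefixResolution =
  | { kind: "unique"; pubkey: string }
  | { kind: "none" }
  | { kind: "ambiguous"; candidates: string[] };

// A prefix is 4 to 63 hex chars; a full 64-char key needs no resolution.
function isHexPrefix(input: string): boolean {
  return /^[0-9a-f]{4,63}$/i.test(input);
}

function resolvePrefix(input: string, pubkeys: string[]): PrefixResolution {
  const prefix = input.toLowerCase();
  const matches = pubkeys.filter((k) => k.toLowerCase().indexOf(prefix) === 0);
  if (matches.length === 1) return { kind: "unique", pubkey: matches[0] };
  if (matches.length === 0) return { kind: "none" };
  return { kind: "ambiguous", candidates: matches };
}
```

Only the unique case continues to the send; the other two would surface an error before anything is queued, which avoids the silent drop.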

The 16-hex prefix shown in peer list rows is now safe to paste
straight into claudemesh send.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 16:45:09 +01:00
Alejandro Gutiérrez
a852a9df18 feat(cli): 1.31.5 — JSON peer list lifts role to top level + skill renders it
After 1.31.4 the human renderer surfaced role and groups, but launched-
session LLMs still dropped them when they called peer list --json and
built their own tables.

- Top-level role field. The broker returns role nested under
  profile.role; the CLI now lifts it to a top-level role field at
  parse time so it is the second-most-visible JSON field after
  displayName. profile.role is preserved.
- Updated claudemesh skill SKILL.md peer-list section with the full
  JSON shape (memberPubkey, sessionId, role, profile, isSelf,
  isThisSession) plus explicit guidance to render role + groups in
  any peer table inside a launched session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 16:36:23 +01:00
Alejandro Gutiérrez
4cfb682eab feat(cli): 1.31.4 — peer list shows profile.role and groups
claudemesh peer list now surfaces each peer's profile-level role
(set via claudemesh profile) and any joined groups inline next to
the display name, e.g.

  ● mou [role:lead, @flexicar:reviewer, @oncall] (ai) · 0d215762…

When both are empty, an explicit footer is added so absence is
unambiguous:

  ● peer [...]
     role: (none)  groups: (none)

JSON output is unchanged — the broker has been returning profile
and groups all along, only the human renderer was missing the role.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 16:31:30 +01:00
Alejandro Gutiérrez
0958463998 chore(cli): 1.31.3 — clean rebuild of 1.31.2 with correct VERSION baked in
1.31.2 published with the right code change (DAEMON_PATHS no longer
follow CLAUDEMESH_CONFIG_DIR) but a stale baked-in VERSION constant
because the build ran before the version bump. Same fix, rebuilt
cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 14:29:38 +01:00
Alejandro Gutiérrez
088a4efaa3 fix(cli): 1.31.2 — daemon paths no longer follow per-session CLAUDEMESH_CONFIG_DIR
Real production bug observed in 1.31.0 / 1.31.1: every CLI verb from
inside a claudemesh launch-spawned session printed

  [claudemesh] warn service-managed daemon not responding within 8000ms

even when the launchd-managed daemon was healthy and answering
direct UDS probes in 10ms.

Root cause: claudemesh launch exports CLAUDEMESH_CONFIG_DIR to a
per-session tmpdir so joined-mesh state and the IPC session token
stay isolated. DAEMON_PATHS read from the same env, so inside a
launched session the CLI looked for daemon.sock at
/var/folders/.../claudemesh-XXXX/daemon/daemon.sock — which never
exists. The CLI declared the daemon down, fell into the service-
managed wait branch, and timed out.

The daemon is a per-machine singleton serving every session; its
files live at ~/.claudemesh/daemon/ regardless of overlays. Pin
DAEMON_PATHS.DAEMON_DIR to that location. New CLAUDEMESH_DAEMON_DIR
override is preserved for tests and multi-daemon dev setups.
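
As a rough illustration of the pinning, the lookup could be reduced to a function like the one below. The environment variable names come from the message above; `resolveDaemonDir` and its parameters are hypothetical, and the environment and home directory are passed in so the logic can be exercised without touching the real process.

```typescript
// Sketch only: the real DAEMON_PATHS module is not shown in this log.
function resolveDaemonDir(
  env: Record<string, string | undefined>,
  homeDir: string,
): string {
  // Explicit override, kept for tests and multi-daemon dev setups.
  const override = env["CLAUDEMESH_DAEMON_DIR"];
  if (override) return override;
  // CLAUDEMESH_CONFIG_DIR is deliberately ignored here: the daemon is a
  // per-machine singleton, so a per-session overlay must not move its files.
  return `${homeDir}/.claudemesh/daemon`;
}
```

With this shape, a launched session that exports a tmpdir `CLAUDEMESH_CONFIG_DIR` still resolves the socket under the user's home directory.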

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 14:28:10 +01:00
Alejandro Gutiérrez
15b7920b2a fix(cli): 1.31.1 — reaper no longer blocks the daemon event loop
1.31.0 introduced a session reaper that called execFileSync(ps) once
per registered session every 5s. With many sessions registered, the
daemon's event loop stalled for hundreds of ms — long enough that
incoming /v1/version probes from the CLI timed out against a healthy
daemon and the new service-managed warning fired.

Fix:

- getProcessStartTime is now async (execFile + promisify); never
  blocks the event loop
- New getProcessStartTimes(pids) issues one batched ps for all
  survivors instead of N separate forks. Sweep cost is fixed
  regardless of session count.
- registerSession stays sync; start-time capture is fire-and-forget
- reapDead is now async; the setInterval wrapper voids it so a
  rejected sweep cannot crash the daemon

Behavior is otherwise unchanged from 1.31.0: same 5s cadence, same
PID-reuse guard semantics, same broker-WS teardown via the registry
hook. 83/83 tests still green.
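
To make the batching concrete: one way to get every survivor's start-time from a single fork is an invocation along the lines of `ps -o pid=,lstart= -p <pid,pid,...>` followed by a parse step. The exact command the daemon runs is not shown here, so both that invocation and the `parseStartTimes` helper below are assumptions. The sketch covers only the parsing half and leaves the async `execFile` call out.

```typescript
// Parses output shaped like "  123 Mon May  4 14:05:44 2026", one process
// per line. Start-times are kept as opaque strings and only ever compared
// for equality, never interpreted as dates.
function parseStartTimes(psOutput: string): Map<number, string> {
  const result = new Map<number, string>();
  for (const line of psOutput.split("\n")) {
    const match = /^\s*(\d+)\s+(.+?)\s*$/.exec(line);
    if (!match) continue; // blank or unexpected line
    const pid = match[1];
    const started = match[2];
    if (pid === undefined || started === undefined) continue;
    result.set(Number(pid), started);
  }
  return result;
}
```

A pid that is missing from the returned map was not reported by `ps`, which the sweep can treat as dead.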

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 14:15:48 +01:00
Alejandro Gutiérrez
b0c1348a0a chore: raise commitlint body limits — disable nonsensical 100-char total cap, allow 200-char lines
The default body-max-length=100 was firing a warning on every
substantive commit because 100 chars total can't fit a real changelog
message. Disabled (level 0). body-max-line-length bumped to 200 so
long URLs / paths / pasted errors don't trip a warning that adds
nothing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 14:06:50 +01:00
Alejandro Gutiérrez
1a14cef1e0 feat(cli): 1.31.0 — session autoclean + broker verification + service path
Three operability fixes for users running the daemon under launchd or
systemd.

PID-watcher autoclean
=====================

The session reaper already dropped registry entries with dead pids on
a 30s loop, but had two real-world gaps:

- 30s sweep let stale presence linger on the broker for half a minute
- bare process.kill(pid, 0) trusts a recycled pid; a registry entry
  could survive its real owner's death whenever the OS rolled the
  pid number forward to a new program

Process-exit IPC from claude-code is best-effort and skipped on
SIGKILL / OOM / segfault / panic, so it cannot replace the sweep.

Fix:

- New process-info.ts captures opaque per-process start-times via
  ps -o lstart= (works on macOS and Linux, ~1 ms per call)
- registerSession stores the start-time alongside the pid
- reapDead drops entries when pid is dead OR start-time changed
  since register
- Sweep cadence 30s -> 5s
- Best-effort fallback to bare liveness when start-time capture
  fails at register time

Registry hooks already close the per-session broker WS on
deregister, so peer list rebuilds within one sweep of any session
exit.
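
The reap rule in the list above can be written as a small pure function. This is a sketch under assumptions: `SessionEntry` and `shouldReap` are illustrative names, and the case where the pid is alive but its current start-time cannot be read is not specified above, so the sketch keeps the entry in that case.

```typescript
interface SessionEntry {
  pid: number;
  // Opaque start-time captured at register time; null if capture failed.
  startTime: string | null;
}

function shouldReap(
  entry: SessionEntry,
  pidAlive: boolean,
  currentStartTime: string | null,
): boolean {
  if (!pidAlive) return true; // owner process is gone
  // Fallback: nothing was captured at register, so only liveness is known.
  if (entry.startTime === null) return false;
  // Assumption: an unreadable current start-time is not treated as a mismatch.
  if (currentStartTime === null) return false;
  // Same pid, different start-time: the OS recycled the pid for a new program.
  return currentStartTime !== entry.startTime;
}
```

Comparing the strings for equality only is what lets the start-time stay opaque across macOS and Linux `ps` formats.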

Service-managed daemon: no more "spawn failed" false alarms
===========================================================

After claudemesh install (which writes a launchd plist or systemd
unit with KeepAlive=true), users routinely saw

  [claudemesh] warn daemon spawn failed: socket did not appear
  within 3000ms

even when the daemon was running fine. Two contributing causes:

1. Probe timeout was 800ms — the first IPC after a launchd-driven
   restart can take longer (SQLite migration + broker WS opens) and
   tripped it. Bumped to 2500ms.
2. On a failed probe the CLI tried its own detached spawn, which
   collided with launchd's KeepAlive restart cycle (singleton lock
   fails, child exits) and we'd then time out polling for a socket
   that was actually about to come up.

Now: when the launchd plist or systemd unit exists, the CLI does not
attempt a spawn. It waits up to 8s for the OS-managed unit to bring
the socket up. New service-not-ready state distinguishes "OS hasn't
restarted it yet" from "we tried to spawn and it failed".
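
One way to picture the new branch is as a decision taken after a failed probe. The 8s wait and the `service-not-ready` state come from the text above; the 3000ms spawn budget is taken from the quoted warning; the type names, `afterFailedProbe`, `stateOnTimeout` and the `spawn-failed` label are hypothetical.

```typescript
type NextStep =
  | { action: "wait-for-service"; budgetMs: number }
  | { action: "spawn-detached"; budgetMs: number };

// Decide what to do once the IPC probe has failed.
function afterFailedProbe(serviceUnitInstalled: boolean): NextStep {
  return serviceUnitInstalled
    ? { action: "wait-for-service", budgetMs: 8000 } // never race the OS restart
    : { action: "spawn-detached", budgetMs: 3000 }; // no unit: spawn ourselves
}

// Name the failure so the two cases can be reported differently.
function stateOnTimeout(step: NextStep): "service-not-ready" | "spawn-failed" {
  return step.action === "wait-for-service" ? "service-not-ready" : "spawn-failed";
}
```

Keeping the two timeouts distinct is what allows the warning to say whether the OS is still restarting the unit or the CLI's own spawn did not come up.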

Install verifies broker connectivity, not just process start
============================================================

Previously install ended once launchctl reported the unit loaded —
a daemon that boots but cannot reach the broker (blocked :443,
expired TLS, DNS, broker outage) only surfaced on the user's first
peer list or send.

/v1/health now includes per-mesh broker WS state. install polls it
for up to 15s after service boot and prints either "broker
connected (mesh=...)" or a warning naming the meshes still in
connecting state, with a hint at common causes.

The verification is best-effort and does not fail the install — it
just surfaces the issue early.
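
A hedged sketch of such a polling loop follows. The real `/v1/health` payload is not shown in this log, so the per-mesh state map, the `waitForBroker` name and the injected `fetchHealth`, `now` and `sleep` functions are all assumptions chosen to keep the example self-contained.

```typescript
type BrokerState = "connected" | "connecting";
type MeshStates = Record<string, BrokerState>; // mesh slug -> broker WS state

interface PollOptions {
  budgetMs: number; // e.g. 15000 for the install check
  intervalMs: number;
  now: () => number;
  sleep: (ms: number) => Promise<void>;
}

async function waitForBroker(
  fetchHealth: () => Promise<MeshStates>,
  opts: PollOptions,
): Promise<{ ok: boolean; pending: string[] }> {
  const deadline = opts.now() + opts.budgetMs;
  for (;;) {
    const states = await fetchHealth();
    const pending = Object.keys(states).filter((slug) => states[slug] !== "connected");
    if (pending.length === 0) return { ok: true, pending };
    // Out of budget: report which meshes are still connecting instead of failing.
    if (opts.now() >= deadline) return { ok: false, pending };
    await opts.sleep(opts.intervalMs);
  }
}
```

The caller would print a warning naming `pending` when `ok` is false and still exit successfully, in line with the best-effort behaviour described above.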

Tests
=====

4 new vitest cases cover the reaper paths: dead pid, live pid plus
matching start-time, live pid plus mismatched start-time (PID
reuse), and the no-start-time fallback. 83 of 83 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 14:05:44 +01:00
Alejandro Gutiérrez
71f7f81880 fix(cli): 1.30.2 — daemon service unit attaches to every joined mesh
claudemesh install was baking --mesh <primary> into the launchd plist /
systemd unit, locking the daemon to a single mesh and contradicting
1.26.0's multi-mesh design. users with >1 joined mesh fell off the
daemon path on every non-primary verb (cold-WS fallback, peer list
returning all meshes because the server-side filter ran against zero
attached state, "daemon spawn failed: socket did not appear" from
launched sessions in sibling meshes).

now: meshSlug is optional in InstallArgs; claudemesh install omits it
so the unit runs `claudemesh daemon up` with no flag, which attaches
to every joined mesh. `claudemesh daemon install-service --mesh <slug>`
is preserved as opt-in for single-mesh hosts and CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:44:11 +01:00
Alejandro Gutiérrez
052f65149d fix(cli): 1.30.1 — daemon install upgrade-safe + node-pinned
two install-path fixes that bit on first 1.30.0 upgrade:

- pin node by absolute path in launchd plist / systemd unit. shebang's
  /usr/bin/env node resolved against the service environment PATH and
  picked up system Node 22.x, which lacks node:sqlite (experimental)
  → daemon died with ERR_UNKNOWN_BUILTIN_MODULE. process.execPath now
  goes first, so the daemon always runs under the same Node that ran
  claudemesh install.
- tear down the old daemon before bootstrapping. claudemesh install on
  a machine with an already-running daemon hit Bootstrap failed: 5:
  Input/output error (launchctl refuses to re-bootstrap a loaded unit
  + old daemon held the singleton lock). Now we run launchctl bootout
  (systemd: systemctl --user stop) first, plus SIGTERM to any orphan
  pid in daemon.pid, so subsequent installs replace cleanly.

both fixes apply to darwin and linux paths. windows path is unchanged
— it doesn't have a service-install today (daemon-install-service
errors with "unsupported platform" on win32).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:31:27 +01:00
Alejandro Gutiérrez
0b3014e7eb docs(roadmap): mark 1.30.0 shipped
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:26:50 +01:00
Alejandro Gutiérrez
cef246a34a chore(cli): typecheck clean (10 → 0)
- broker-actions: msg-status section header used out-of-scope `id`
  variable; was a real bug (renders "message undefined…" on the JSON
  path). Fixed to use the in-scope lookupId.
- exit-codes: add IO_ERROR (10) — referenced in three places by
  platform-actions but never declared.
- types/text-import.d.ts: declare wildcard `*.md` module so Bun's
  text-import attribute used by skill.ts typechecks.
- ipc/server: cast PeerSummary/SkillSummary through unknown before
  spreading into Record<string, unknown>.
- mcp/server: typed JSON.parse for SSE events.
- bridge/daemon-route: import path with .ts → .js (esm).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:23:55 +01:00
Alejandro Gutiérrez
f013436541 chore(broker): typecheck clean (77 → 0)
paid down the broker's accumulated type debt. zero behavioral changes,
purely type-system tightening:

- broker.ts: row extraction helper for postgres-js result vs pg shape;
  findMemberByPubkey defaultGroups null-coalescing.
- env.ts: zod default ordered before transform (zod v4 ordering).
- index.ts: typed JSON.parse for the tg/token, upload-auth, file-upload,
  member patch and mesh-settings handlers; export SelfEditablePolicy
  from member-api; added bodyVersion to WSSendMessage; added the
  disconnect/kick/ban/unban/list_bans message types to WSClientMessage;
  String(key) cast for neo4j record symbol-typed keys.
- jwt.ts, paths.ts, telegram-token.ts: typed JSON.parse results.
- service-manager.ts: typed package.json + MCP JSON-RPC reader.
- telegram-bridge.ts: typed WS message handler; missing log import;
  null-tolerant BridgeRow + skip rows missing memberId/displayName;
  typed e in catch.
- types.ts: bodyVersion on WSSendMessage, manifest on WSSkillData,
  five new admin message types (kick/disconnect/ban/unban/list_bans).
- packages/db/server.ts: drizzle constructor positional args + scoped
  ts-expect-error for the namespace-bag schema generic mismatch.

apps/broker/src/types.ts will eventually want a real audit pass to
catch every WS verb and surface the orphans, but this clears the path
for 1.30.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:22:09 +01:00
Alejandro Gutiérrez
6d981976c0 refactor(cli): drop CLAUDEMESH_SESSION_PRESENCE flag
per-session presence is small and uncomplicated enough that a rollback
flag isn't load-bearing. backwards compat is already covered at the
protocol layer — older brokers reply unknown_message_type to
session_hello and the SessionBrokerClient marks itself closed for that
mesh, which is the same outcome the flag would have given. removing
the flag, the helper, and the conditional from the registry hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:12:11 +01:00
Alejandro Gutiérrez
f7d7d391c9 feat(cli): 1.30.0 — per-session broker presence
flips CLAUDEMESH_SESSION_PRESENCE default to ON. With the broker side
already shipped (the session_hello handler from earlier in this sprint
A wave), every claudemesh launch now gets its own long-lived broker
presence row owned by the daemon and identified by a per-launch
ephemeral keypair vouched by the member's stable key. Two sessions in
the same cwd finally see each other in peer list — the symptom users
have been hitting since 1.28.0 dropped the bridge tier.

Bumps roadmap: 1.30.0 = presence (was queued for 1.30/wizard); the
launch-wizard refactor moves to 1.31.0, setup wizard to 1.32.0, the
mesh→workspace rename to 1.33.0. Verification smoke documented in the
1.30.0 changelog entry.

Rollback: CLAUDEMESH_SESSION_PRESENCE=0 (also accepts "false"/"off").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:10:51 +01:00
Alejandro Gutiérrez
ff2aa8bf7c feat(cli): launch mints session keypair + parent attestation
claudemesh launch now also generates a per-launch ed25519 keypair and a
parent-vouched attestation (12h TTL), included in the body of POST
/v1/sessions/register under body.presence. The daemon stores it on
SessionInfo and, with CLAUDEMESH_SESSION_PRESENCE=1, opens a long-lived
broker WS so the session has its own presence row.

Also fixes a latent 1.29.0 bug: claudeSessionId was referenced before
its const declaration, hitting the TDZ → ReferenceError silently
swallowed by the surrounding catch. Net: the IPC session-token
registration has been failing every launch since 1.29.0, falling back
to user-level scope for every session. Hoisted the declaration up so
the registration actually runs.

The presence payload is forward-compat: older daemons ignore unknown
body fields, so 1.30.0 CLIs work fine against unupgraded daemons.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:08:15 +01:00
Alejandro Gutiérrez
4d42185b0f test(cli): tolerate exit 2 in whoami --json golden
whoami --json exits with EXIT.AUTH_FAILED (=2) when not signed in.
The JSON output is the contract under test, valid regardless of exit
code — execSync was throwing on exit 2 so the assertion never ran.
Switch to spawnSync, accept {0,2}, parse stdout independently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:06:40 +01:00
Alejandro Gutiérrez
d62b3f45d2 feat(cli): sessionbrokerclient + registry hooks (flag-gated)
daemon-side half of 1.30.0 per-session broker presence. behind
CLAUDEMESH_SESSION_PRESENCE=1 (default OFF this cycle so the broker
side bakes before the flag flips).

- SessionBrokerClient (apps/cli/src/daemon/session-broker.ts) — slim
  WS that opens with session_hello, presence-only, no outbox drain.
- session-hello-sig.ts — signParentAttestation (12h TTL, ≤24h cap) and
  signSessionHello, mirroring the broker canonical formats.
- session-registry: optional presence field on SessionInfo;
  setRegistryHooks for onRegister/onDeregister callbacks. Hook errors
  are caught so they can never throttle registry mutations.
- IPC POST /v1/sessions/register accepts the presence material under
  body.presence (session_pubkey, session_secret_key, parent_attestation).
  Older callers without it stay scoped + supported.
- run.ts wires the registry hooks: on register, opens a SessionBrokerClient
  for the matching mesh; on deregister (explicit or reaper), closes it.
  Shutdown closes any remaining session WSes before the IPC server.

8 new unit tests cover registry lifecycle (replace/throw/presence
roundtrip) and signature canonical-bytes verification against libsodium.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:05:33 +01:00
Alejandro Gutiérrez
e688f66791 feat(broker): session_hello WS handler for per-launch presence
The 1.30.0 daemon-multiplexed presence flow needs a way for the daemon
to open a WS keyed on a per-launch ephemeral pubkey. This commit adds:

- WSSessionHelloMessage in types.ts (additive — older clients still use
  WSHelloMessage; older brokers reply with unknown_message_type so newer
  clients can fall back).
- handleSessionHello in index.ts: validates parentAttestation (TTL ≤24h,
  ed25519 by parent), session signature (skew + ed25519 by session),
  parent membership in mesh.member, and parentMemberId/pubkey coherence.
- Inserts a presence row keyed on sessionPubkey but member_id from the
  parent — member-targeted operations (revocation, send-by-member-pubkey)
  keep working unchanged.
- Broadcasts peer_joined to ALL siblings in the mesh, including the
  same-member ones (the regular hello path skips those to avoid self-
  spam, but session_hello explicitly wants sibling visibility).

Behavior parity tests will land alongside the daemon SessionBrokerClient.
The unit tests added in the previous commit cover the crypto layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:00:11 +01:00
Alejandro Gutiérrez
033a2d37e1 feat(broker): canonical session-hello + parent-attestation helpers
Adds the crypto primitives the 1.30.0 per-session broker presence flow
needs: canonicalSessionAttestation/canonicalSessionHello bytes, and
verifySessionAttestation/verifySessionHelloSignature with TTL bounds
(≤24h) plus standard ed25519 + skew checks.

10 unit tests cover the hostile cases — expired attestation, over-TTL,
wrong-key signing, tampered fields, and the "attacker captured the
attestation but doesn't hold the session secret key" scenario.
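
The TTL bounds can be illustrated separately from the signature checks. In the sketch below the field names `issuedAt` and `expiresAt`, the `Verdict` labels and the function name are hypothetical; only the 24h cap comes from the text above, and ed25519 verification and the skew check are deliberately left out.

```typescript
const MAX_TTL_MS = 24 * 60 * 60 * 1000; // hard cap from the text above

// Hypothetical field names, both in epoch milliseconds.
interface AttestationWindow {
  issuedAt: number;
  expiresAt: number;
}

type Verdict = "ok" | "expired" | "over-ttl" | "malformed";

function checkAttestationWindow(a: AttestationWindow, nowMs: number): Verdict {
  if (!(a.expiresAt > a.issuedAt)) return "malformed";
  // Reject attestations minted with a lifetime longer than the cap, even if
  // they have not expired yet.
  if (a.expiresAt - a.issuedAt > MAX_TTL_MS) return "over-ttl";
  if (nowMs >= a.expiresAt) return "expired";
  return "ok";
}
```

Checking the declared lifetime as well as the expiry is what separates the "over-TTL" test case from the plain "expired" one.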

No wire changes yet — types and dispatch land in the next two commits.
Spec: .artifacts/specs/2026-05-04-per-session-presence.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 12:57:28 +01:00
Alejandro Gutiérrez
364178d95b docs(spec): per-session broker presence (queued for 1.30.0)
records the design for daemon-multiplexed broker presence — every
launched claude session gets its own long-lived presence row owned
by the daemon, identified by a per-launch ephemeral keypair vouched
by the member's stable keypair.

resolves the "two sibling sessions can't see each other in peer list"
gap that surfaced when the bridge tier was deleted in 1.28.0. covers
state machine, broker session_hello handler, parent-attestation
signing, ipc route extension, sequencing (broker first, daemon
flagged, cli third), compat with older builds, and verification
smoke.

~440 loc estimate across cli + daemon + broker. queued for 1.30.0
alongside the launch-wizard refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 12:47:31 +01:00
Alejandro Gutiérrez
f91871c71d docs(roadmap): record sprint A ships (1.26.0 through 1.29.0)
extend the v0.9.x section with a new "v1.26.0 → v1.29.0 — sprint A
toward v2" block listing what each release delivered. trim the
v2.0.0 section to just the remaining HKDF identity work; everything
else from the original v2 spec is now shipped.

queue 1.30.0 (launch wizard), 1.31.0 (setup wizard), 1.32.0 (full
workspace rename) as the explicit remaining items before HKDF
ships as 2.0.0 in its own sprint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 12:36:34 +01:00
Alejandro Gutiérrez
92cac16c91 feat(cli): 1.29.0 — per-session IPC tokens + auto-scoping
every claudemesh launch-spawned session now mints a 32-byte random
token, writes it under tmpdir (mode 0600), and registers it with the
daemon. cli invocations from inside that session inherit
CLAUDEMESH_IPC_TOKEN_FILE in env, attach the token via Authorization:
ClaudeMesh-Session <hex>, and the daemon resolves it to a SessionInfo.

server-side: every read route that filters by mesh now uses meshFromCtx —
explicit query/body wins, session default fills in when missing. write
routes follow the same pattern.
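
The precedence rule could be sketched as below. `meshFromCtx` is the name used above, but its real signature is not shown in this log; the `RequestCtx` shape and the choice to consult the query before the body are assumptions.

```typescript
// Hypothetical request context, reduced to the fields the rule needs.
interface RequestCtx {
  queryMesh?: string; // ?mesh=<slug>
  bodyMesh?: string; // { "mesh": "<slug>" }
  session?: { mesh: string }; // resolved from the session token, if any
}

function meshFromCtx(ctx: RequestCtx): string | undefined {
  // Explicit input wins; the session default only fills the gap.
  return ctx.queryMesh ?? ctx.bodyMesh ?? ctx.session?.mesh;
}
```

An `undefined` result corresponds to a tokenless caller that passed no mesh, which keeps the pre-1.29.0 behaviour of operating across all joined meshes.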

cli-side: peers.ts (and other multi-mesh-iterating verbs in future)
prefers session-token mesh over all joined meshes when the user didn't
pass --mesh explicitly.

backward-compatible in both directions — tokenless callers behave
exactly as before. registry is in-memory; daemon restart loses it but
the 30s reaper handles dead pids and most callers re-register on next
launch.

verified end-to-end: peer list with token returns 4 prueba1 peers,
without token returns 3 meshes' peers (aggregate).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 12:33:06 +01:00
Alejandro Gutiérrez
81f0e4f7ac feat(cli): 1.28.0 — bridge deletion + daemon-policy flags
drop the orphaned bridge tier (~600 LoC). client/server/protocol
files deleted; tryBridge had returned null in production for seven
releases since the 1.24.0 mcp shim rewrite stopped opening the
sockets. each verb now has two paths: daemon (with 1.27.3's
auto-spawn) → cold ws.

add per-process daemon policy: --strict (error instead of cold
fallback) and --no-daemon (skip daemon entirely). enforcement at
withMesh so a single chokepoint covers every verb. env equivalents
CLAUDEMESH_STRICT_DAEMON / CLAUDEMESH_NO_DAEMON. flag wins.
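
A small sketch of how the flag and env inputs might collapse into one policy. The flag and variable names are from the text above; the `DaemonPolicy` labels, the accepted env values ("1" or "true") and the tie-break when both flags are set are assumptions.

```typescript
type DaemonPolicy = "auto" | "strict" | "no-daemon";

interface PolicyFlags {
  strict?: boolean; // --strict
  noDaemon?: boolean; // --no-daemon
}

// Assumption: only "1" and "true" switch an env var on.
const isOn = (v: string | undefined): boolean => v === "1" || v === "true";

function resolvePolicy(
  flags: PolicyFlags,
  env: Record<string, string | undefined>,
): DaemonPolicy {
  // Flags are consulted first, so a flag always beats the env equivalent.
  if (flags.noDaemon) return "no-daemon";
  if (flags.strict) return "strict";
  if (isOn(env["CLAUDEMESH_NO_DAEMON"])) return "no-daemon";
  if (isOn(env["CLAUDEMESH_STRICT_DAEMON"])) return "strict";
  return "auto"; // daemon first, cold ws fallback
}
```

Resolving the policy once and enforcing it at a single chokepoint such as `withMesh` means individual verbs never need to look at the flags themselves.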

net -394 loc; the daemon-up case ships ~600 loc lighter and the
fallback story is one tier simpler. first sprint A drop; per-session
ipc tokens and the wizard refactors follow in 1.29.0+.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 12:23:04 +01:00
Alejandro Gutiérrez
2b6cf2c14b feat(cli): self-healing daemon lifecycle
every daemon-routed verb now probes the ipc socket via /v1/version
(instead of trusting existsSync), cleans up stale sock/pid files left
by a crashed daemon, and auto-spawns a detached `claudemesh daemon up`
under a file-lock when the daemon is down. polls for liveness up to a
budget (3s for ad-hoc verbs, 10s for launch) before falling through to
cold path.

includes a per-process result cache (script doing 50 sends pays spawn
cost at most once), a 30s recently-failed marker (no thundering-herd
retries on crash-loop), a spawn-lock (concurrent invocations share one
attempt), and a recursion guard env var (nested cli calls inside the
daemon process skip auto-spawn).

fixes the stale-socket bug where launch's ensureDaemonRunning returned
early on a left-over socket file from a crashed daemon, silently
breaking the spawned claude session's mcp shim.

deferred to 1.28.0: --strict / --no-daemon flags, lazy-loading of
cold-path code, per-session ipc tokens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 11:17:32 +01:00
Alejandro Gutiérrez
8a5469a5df docs(skill): canonical fully-populated launch template
adds a kitchen-sink "every flag set explicitly" recipe under
wizard-free spawn templates, with a per-position annotation table.
agents copy this verbatim instead of stitching flags from the table
when spawning unattended sessions.

corrects two stale items: --system-prompt forwards to claude
--system-prompt (not --append-system-prompt), and -q is currently a
no-op (only --quiet is wired).

flags the 1.27.1 cutoff: all twelve launch flags are only end-to-end
wired from that version on; older builds silently dropped half of them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 10:15:28 +01:00
Alejandro Gutiérrez
e128a6ae5f fix(cli): wire missing launch flags through entrypoint
six flags declared on `LaunchFlags` were silently dropped at the CLI
layer — `--role`, `--groups`, `--message-mode`, `--system-prompt`,
`--continue`, and `--quiet`. each was honored inside `runLaunch` if it
arrived, but the four call sites in the entrypoint forwarded a hardcoded
5-key subset.

now forwarded at every entry: bare command, bare invite URL, the
launch/connect verb, and the new workspace launch alias. pure plumbing;
no behaviour change for users who weren't passing these flags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 10:08:41 +01:00
Alejandro Gutiérrez
3753a6e137 feat(cli): 1.27.0 — state/memory through daemon + workspace alias
extend the daemon thin-client surface to two more verb families: state
get/set/list now routes through `/v1/state`, and remember/recall/forget
through `/v1/memory`. same warm-path pattern as 1.25.0 — try the unix
socket first, fall back to the cold ws path when the daemon is absent.
multi-mesh aware (aggregates on read, requires `--mesh` for writes
when ambiguous).

also ships an early `claudemesh workspace <verb>` alias surface — bare
teaser for the 1.28.0 mesh→workspace public rename. no-arg falls
through to launch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 09:41:18 +01:00
Alejandro Gutiérrez
cb90f1ca60 feat(daemon): multi-mesh — attach to all joined meshes simultaneously
The 1.26.0 step that finally delivers ambient mode for multi-mesh
users. Daemon holds Map<slug, DaemonBrokerClient>; one process, one
PID per user, all your meshes online concurrently.

run.ts: claudemesh daemon up with no --mesh attaches to every joined
mesh from config. --mesh <slug> still scopes to one (legacy mode).
The daemon_started log line reports meshes: [...] instead of mesh.

drain.ts: dispatches each outbox row to the broker keyed by row.mesh
(column added in 1.25.0). Legacy rows with mesh=NULL fall back to the
only broker if there's exactly one, otherwise mark dead with a clear
error.
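
The per-row dispatch rule might be expressed like this. The `Broker` interface, the `Route` type and `brokerForRow` are illustrative stand-ins for the daemon's real types, and treating a row whose mesh is no longer attached as an error is an assumption the text above does not cover.

```typescript
interface Broker {
  slug: string; // stand-in for DaemonBrokerClient
}

type Route = { ok: true; broker: Broker } | { ok: false; error: string };

function brokerForRow(rowMesh: string | null, brokers: Map<string, Broker>): Route {
  if (rowMesh !== null) {
    const broker = brokers.get(rowMesh);
    if (broker !== undefined) return { ok: true, broker };
    return { ok: false, error: `mesh "${rowMesh}" is not attached` }; // assumption
  }
  // Legacy row with mesh=NULL: only safe when there is exactly one broker.
  const all = Array.from(brokers.values());
  const only = all[0];
  if (all.length === 1 && only !== undefined) return { ok: true, broker: only };
  return {
    ok: false,
    error: `legacy row has no mesh and ${all.length} brokers are attached`,
  };
}
```

A row that comes back with `ok: false` is the one the drain worker would mark dead, carrying the error text with it.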

ipc/server.ts:
- GET /v1/peers aggregates across all attached meshes; each peer
  record gains a mesh field. ?mesh=<slug> narrows server-side.
- GET /v1/skills aggregates similarly; /v1/skills/:name walks meshes
  and returns first match.
- POST /v1/send requires mesh field on multi-mesh daemons; auto-picks
  on single-mesh; returns 400 with attached list if ambiguous.
- POST /v1/profile accepts optional mesh; without it, fans out to all
  attached meshes (consistent presence).
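
For the POST /v1/send rule in particular, a sketch of the mesh selection is shown below. The response shape and the `resolveSendMesh` name are hypothetical, and rejecting an explicit mesh that is not attached with the same 400 is an assumption.

```typescript
type SendMesh =
  | { status: 200; mesh: string }
  | { status: 400; error: string; attached: string[] };

function resolveSendMesh(requested: string | undefined, attached: string[]): SendMesh {
  if (requested !== undefined) {
    return attached.includes(requested)
      ? { status: 200, mesh: requested }
      : { status: 400, error: `mesh "${requested}" is not attached`, attached };
  }
  // No mesh in the body: auto-pick only when the choice is unambiguous.
  const only = attached[0];
  if (attached.length === 1 && only !== undefined) return { status: 200, mesh: only };
  return { status: 400, error: "mesh is required on a multi-mesh daemon", attached };
}
```

Echoing the attached list in the 400 body gives the caller enough to retry with a valid `--mesh`.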

CLI: trySendViaDaemon now forwards expectedMesh as the body's mesh
field (was informational, now authoritative). claudemesh send
--mesh A and --mesh B from the same shell both route to the right
broker via the same daemon process.

Verified: aggregated peer list across 3 attached meshes; cross-mesh
sends from CLI reach status=done with correct broker_message_ids.

Released as 1.26.0 on npm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 02:14:43 +01:00
Alejandro Gutiérrez
0e3a5babd9 feat(daemon): sprint 4 outbound routing + CLI thin-client + ambient mode
Daemon outbox now stores resolved target_spec + crypto_box ciphertext
+ nonce per row. Drain worker is a forwarder; no per-row resolution at
drain time. Outbound routing is no longer a placeholder.

Schema additions (additive, NULL allowed for legacy rows): outbox.mesh,
target_spec, nonce, ciphertext, priority. v0.9.0 rows keep draining via
the broadcast fallback so existing in-flight rows finish cleanly.

IPC /v1/send resolves the user-friendly `to` value (display name, hex prefix,
full pubkey, @group, *, #topicId) into a broker-format target_spec at
accept time. DMs encrypt via crypto_box; broadcast/topic/group base64
the plaintext. Hex prefixes (16+ chars) match against connected peers.
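
The first step of that resolution, telling the accepted `to` forms apart, can be sketched as follows. This shows classification only; the broker-format target_spec is not described in this log, so the `Target` union and `classifyTarget` are illustrative, and treating any 16 to 63 character hex string as a prefix rather than a display name is an assumption.

```typescript
type Target =
  | { kind: "broadcast" }
  | { kind: "group"; name: string }
  | { kind: "topic"; topicId: string }
  | { kind: "pubkey"; pubkey: string }
  | { kind: "prefix"; prefix: string }
  | { kind: "name"; displayName: string };

function classifyTarget(to: string): Target {
  if (to === "*") return { kind: "broadcast" };
  if (to.startsWith("@")) return { kind: "group", name: to.slice(1) };
  if (to.startsWith("#")) return { kind: "topic", topicId: to.slice(1) };
  if (/^[0-9a-f]{64}$/i.test(to)) return { kind: "pubkey", pubkey: to.toLowerCase() };
  // 16+ hex chars but shorter than a full key: match against connected peers.
  if (/^[0-9a-f]{16,63}$/i.test(to)) return { kind: "prefix", prefix: to.toLowerCase() };
  return { kind: "name", displayName: to };
}
```

Only the `pubkey`, `prefix` and `name` kinds lead to a DM and therefore to crypto_box encryption; the other three carry base64 plaintext as described above.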

CLI thin-client routing extends trySendViaDaemon pattern to peer list
and skill list/get. Three new helpers in services/bridge/daemon-route.ts.

SKILL.md gains ambient mode section: after claudemesh install, raw
claude works for the daemon's attached mesh. Launch stays as the
override path.

Spec at .artifacts/specs/2026-05-04-v2-roadmap-completion.md orders
the remaining v2.0.0 work: multi-mesh daemon (1.26), CLI-to-thin-client
(1.27), mesh-to-workspace rename (1.28), HKDF identity (2.0).

Released as 1.25.0 on npm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 01:36:16 +01:00
Alejandro Gutiérrez
6794aa8512 feat(daemon+mcp): daemon required for in-Claude-Code use; thin MCP shim
The architectural convergence v0.9.0 was building toward. CLI keeps
working without a daemon (claudemesh send/peer/inbox/...), but the MCP
push-pipe — which Claude Code uses for mid-turn channel emits, slash
commands, and resources — now requires the daemon. There is no fallback.

Daemon (additive):
- /v1/skills (list) and /v1/skills/:name (get) IPC endpoints, so the
  MCP shim can surface mesh skills without holding its own broker WS.
- listSkills() / getSkill() on DaemonBrokerClient.
- SSE 'message' event now carries plaintext body, sender_member_pubkey,
  priority, and subtype — full payload the MCP shim needs to render a
  channel notification.

MCP server: 979 → 469 LoC (~270 of the remaining 469 is the unrelated
mesh-service proxy mode; the push-pipe path is ~200 LoC including
boilerplate).
- Probes ~/.claudemesh/daemon/daemon.sock at boot. Bails loudly with
  actionable instructions if missing.
- Subscribes to /v1/events SSE and translates each event into a
  notifications/claude/channel emit.
- Fetches mesh skills from the daemon for ListPrompts/GetPrompt and
  ListResources/ReadResource. ListTools returns []; the CLI is the API.
- No broker WS, no decryption, no reconnect logic. Daemon owns all of it.

claudemesh install: auto-installs and starts the daemon service for the
user's primary mesh (launchd / systemd-user). Pass --no-service to skip.

claudemesh launch: probes the daemon socket; if absent, spawns
'claudemesh daemon up --mesh <slug>' detached and waits up to 10s for
the socket. Surfaces a clear warning on timeout but doesn't block —
Claude Code's MCP shim will print the same error if the daemon really
isn't there.

Bundle: dist/entrypoints/mcp.js drops from 154KB → 104KB (gzipped 34KB
→ 19KB). Test: MCP boots cleanly via stdio, declares correct
capabilities, talks JSON-RPC; daemon /v1/skills returns the empty list
as expected on a mesh with no skills.

Released as 1.24.0 on npm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 23:43:02 +01:00
Alejandro Gutiérrez
c56910bfcf feat(cli): vault set / watch add / webhook create + prune dead MCP stubs
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Closes the last functional gaps where the MCP tool registry exposed
write verbs the CLI didn't:

- vault set <k> <v> [--type env|file --mount <path> --description ...]
  Client-side crypto_secretbox_easy with a fresh symmetric key sealed
  to the member's own pubkey via crypto_box_seal — same pattern used
  for file shares. Pairs with the existing vault list/delete.
- watch add <url> [--label --interval --mode --extract --notify-on]
  Pairs with watch list/remove.
- webhook create <name> — pairs with webhook list/delete.

Cleanup: deletes 22 stub files under apps/cli/src/mcp/tools/* plus
router.ts, middleware/, handlers/ (~120 LoC). These were FAMILY/TOOLS
metadata-only re-exports left over from before the 1.5.0 tool-less
push-pipe flip; nothing imports them. The legitimate MCP surfaces
stay: the inbound <channel> push pipe, mesh skills as prompts and
skill:// resources, and the mesh-service proxy mode.

Released as 1.23.0 on npm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:53:25 +01:00
Alejandro Gutiérrez
4eff4f5a20 docs(cli): daemon coverage in --help, daemon usage block, SKILL.md
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
- Root --help now lists the daemon subcommand suite (was missing).
- claudemesh daemon (no subcommand) prints a usage block instead of
  silently launching the foreground daemon. Adds help|--help|-h aliases.
- SKILL.md gains a "Daemon path (v0.9.0, opt-in, fastest)" section
  explaining the runtime, lifecycle, and that it's independent from
  claudemesh install.

Released as 1.22.1 on npm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:33:17 +01:00
Alejandro Gutiérrez
a2568ad9f4 chore(release): cli 1.22.0 — daemon v0.9.0 + housekeeping
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
- Bump apps/cli/package.json to 1.22.0 (additive feature: claudemesh
  daemon long-lived runtime).
- CHANGELOG entry for 1.22.0 covering subcommands, idempotency wiring,
  crash recovery, and the deferred Sprint 7 broker hardening.
- Roadmap entry for v0.9.0 daemon foundation right above the v2.0.0
  daemon redesign section, so the bridge release is documented as the
  shipped step toward the larger architectural shift.
- Move shipped daemon specs (v1..v10 iteration trail + locked v0.9.0
  spec + broker-hardening followups) from .artifacts/specs/ to
  .artifacts/shipped/ per the project artifact-pipeline convention.

Not in this commit: npm publish and the cli-v1.22.0 GitHub release tag
— both are public-distribution actions and require explicit user
approval.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:24:32 +01:00
Alejandro Gutiérrez
bf22afb0ed feat(broker): record daemon idempotency fields on message_queue
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Additive plumbing for v0.9.0 daemon spec §4.2/§4.4. Adds two nullable
columns to mesh.message_queue — client_message_id (caller-supplied) and
request_fingerprint (canonical sha256 of the send shape) — and threads
them through the broker:

  - handleSend reads them off the wire envelope when present
  - queueMessage persists them on the row
  - drainForMember projects them onto the push so receiving daemons
    can dedupe their local inbox by client_message_id

Columns stay nullable so legacy traffic (launch CLI, dashboard chat)
continues to flow uninterrupted. Sprint 7 (broker hardening) will add
the partial unique index and the client_message_dedupe atomic-accept
table once we're ready to enforce dedupe broker-side.
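
A minimal sketch of the receiver-side dedupe this plumbing enables,
assuming an in-memory set stands in for the daemon's local inbox.
InboxPush and acceptPush are hypothetical names for illustration, not
the daemon's actual API.

```typescript
// Hypothetical shape of a push as seen by a receiving daemon.
interface InboxPush {
  clientMessageId: string | null; // null for legacy traffic
  body: string;
}

// Returns true if the push should be stored, false if it is a duplicate.
function acceptPush(seen: Set<string>, push: InboxPush): boolean {
  // Legacy senders supply no id, so there is nothing to dedupe on.
  if (push.clientMessageId === null) return true;
  if (seen.has(push.clientMessageId)) return false;
  seen.add(push.clientMessageId);
  return true;
}
```

Letting null ids through mirrors the nullable columns: legacy traffic
is never rejected, only daemon traffic that carries an id is deduped.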

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:05:36 +01:00
Alejandro Gutiérrez
abaa4bcf87 feat(cli): claudemesh daemon — peer mesh runtime (v0.9.0)
Long-lived process that holds a persistent WS to the broker and exposes
a local IPC surface (UDS + bearer-auth TCP loopback). Implements the
v0.9.0 spec under .artifacts/specs/.

Core:
- daemon up | status | version | down | accept-host
- daemon outbox list [--failed|--pending|--inflight|--done|--aborted]
- daemon outbox requeue <id> [--new-client-id <id>]
- daemon install-service / uninstall-service (macOS launchd, Linux systemd)

IPC routes:
- /v1/version, /v1/health
- /v1/send  (POST)  — full §4.5.1 idempotency lookup table
- /v1/inbox (GET)   — paged history
- /v1/events        — SSE stream of message/peer_join/peer_leave/broker_status
- /v1/peers         — broker passthrough
- /v1/profile       — summary/status/visible/avatar/title/bio/capabilities
- /v1/outbox + /v1/outbox/requeue — operator recovery

Storage (SQLite via node:sqlite / bun:sqlite):
- outbox.db: pending/inflight/done/dead/aborted with audit columns
- inbox.db: dedupe by client_message_id, decrypts DMs via existing crypto
- BEGIN IMMEDIATE serialization for daemon-local accept races

Identity:
- host_fingerprint.json (machine-id || first-stable-mac)
- refuse-on-mismatch policy with `daemon accept-host` recovery

CLI integration:
- claudemesh send detects the daemon and routes through /v1/send when
  present, falling back to bridge socket / cold path otherwise

Tests: 15-case coverage of the §4.5.1 IPC duplicate lookup table.

Spec arc preserved at .artifacts/specs/2026-05-03-daemon-{v1..v10}.md;
v0.9.0 implementation target locked at 2026-05-03-daemon-spec-v0.9.0.md;
deferred items at 2026-05-03-daemon-spec-broker-hardening-followups.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:03:05 +01:00
Alejandro Gutiérrez
65e63b0b27 fix(rename): surface duplicate-slug 409 instead of 500 (v1.21.1)
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
mesh.slug actually carries a UNIQUE constraint (mesh_slug_unique)
even though the schema comment claimed otherwise. Trying to rename
to a slug another mesh already owns blew up as a generic 500.
Now: caught at the route, surfaced as 409 with body
{"error":"slug \"<x>\" is already taken"}; CLI maps it to
EXIT.ALREADY_EXISTS and prints the message.

Schema comment corrected to match DB reality.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 15:49:28 +01:00
Alejandro Gutiérrez
5785454ac9 feat: collapse mesh.name and mesh.slug into one identifier (v1.21.0)
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Pre-launch fix: every visible surface already keyed on slug, so
"name" was a parallel string that only existed to confuse users
on rename ("I renamed but nothing visible changed").

Now slug IS the identifier. claudemesh rename <old> <new> is the
whole rename surface. PATCH /api/cli/meshes/:slug body becomes
{ slug } and the route writes both columns to keep them in sync.
Mesh create derives slug from input.name and stores name = slug.
Pickers drop the (parens). The claudemesh slug verb shipped 15
min ago is removed — merged into rename.

The mesh.name DB column stays for now to avoid touching ~25
reader sites; a follow-up migration drops it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 15:23:04 +01:00
Alejandro Gutiérrez
03cff156e2 fix(launch): welcome picker shows mesh name + slug (v1.20.1)
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
The launch welcome flow's menuSelect was rendering opts.meshes.map(
m => m.slug) — so even after rename writes the new name to local
config, the picker still only showed the slug. Renders as
"name  (slug)" when they differ; falls back to slug alone when
they match (default for never-renamed meshes).
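
The rendering rule above can be sketched as a small pure function.
MeshEntry and pickerLabel are hypothetical names; the real picker code
is not shown in this log.

```typescript
// Hypothetical local-config entry for a joined mesh.
interface MeshEntry {
  name: string;
  slug: string;
}

// "name  (slug)" when the two differ, the slug alone when they match.
function pickerLabel(mesh: MeshEntry): string {
  return mesh.name === mesh.slug ? mesh.slug : `${mesh.name}  (${mesh.slug})`;
}
```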

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 15:12:23 +01:00
Alejandro Gutiérrez
e84914b25b feat: claudemesh slug <old> <new> — change a mesh's slug (v1.20.0)
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Slugs are not globally unique (mesh.id is canonical) so the route
only validates the regex and updates the row. CLI refuses a local
collision (two joined meshes sharing a slug would make the picker
ambiguous) and rewrites ~/.claudemesh/config.json on success.
Other peers pick up the new slug on next claudemesh sync.

Server: PATCH /api/cli/meshes/:slug body now accepts { name?, slug? }
— same route, just optional both fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 15:08:32 +01:00
Alejandro Gutiérrez
5a1d5d6a49 fix(cli): rename syncs local config + picker shows display name
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
After renaming the mesh display name on the server, the launch
picker still showed the slug ("flexicar-2") because (a) local
config.json was not updated and (b) the picker only printed
mesh.slug. Now: rename writes the new name back into config.json
on success, and the picker prints "name (slug)" when they differ.
Also surfaces a hint that slugs are immutable (today).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:52:50 +01:00
Alejandro Gutiérrez
f3649d761f fix(rename): split 404 vs 403 + surface API error body (v1.19.2)
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
The rename route was collapsing "mesh doesn't exist" and "exists
but you don't own it" into a single 404 with body
{"error":"mesh not found or you are not the owner"}, and the CLI
was throwing that body away — the user only saw "API error 404:
Not Found", which is actively misleading when they have multiple
accounts and signed in to the wrong one.

Server: separate lookup-then-update. 404 only when the slug is
missing; 403 with an actionable message when the caller is not
the owner.

CLI: parse the {error} body off ApiError and print it instead of
the bare statusText. Map status codes to specific exit codes
(401 -> AUTH_FAILED, 403 -> PERMISSION_DENIED, 404 -> NOT_FOUND).
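
A rough sketch of the CLI-side mapping, assuming the exit code names
from this message. The numeric values, the GENERIC fallback, and both
function names are illustrative assumptions, not the CLI's real
constants.

```typescript
// Names come from the commit message; numeric values are assumed.
const EXIT = {
  GENERIC: 1, // hypothetical fallback, not named in the commit
  AUTH_FAILED: 3,
  PERMISSION_DENIED: 4,
  NOT_FOUND: 5,
} as const;

type ExitCode = (typeof EXIT)[keyof typeof EXIT];

function exitCodeForStatus(status: number): ExitCode {
  switch (status) {
    case 401:
      return EXIT.AUTH_FAILED;
    case 403:
      return EXIT.PERMISSION_DENIED;
    case 404:
      return EXIT.NOT_FOUND;
    default:
      return EXIT.GENERIC;
  }
}

// Prefer the server's {error} body over the bare status text.
function errorMessage(status: number, statusText: string, body: unknown): string {
  if (typeof body === "object" && body !== null && "error" in body) {
    const err = (body as { error: unknown }).error;
    if (typeof err === "string") return err;
  }
  return `API error ${status}: ${statusText}`;
}
```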

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:44:22 +01:00
Alejandro Gutiérrez
79485898cf fix(ci): force fresh build on web deploy
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Coolify's last deploy reused the cached image — the new
/api/cli/meshes/[slug] route never made it into .next/server.
Adding force=true to the deploy API call so Coolify rebuilds
from the current commit instead of replaying the cache.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:17:22 +01:00
Alejandro Gutiérrez
b69df75f0c fix(cli+web): claudemesh rename via inline-JWT route (v1.19.1)
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
The /api/my/meshes/:slug PATCH route was never implemented and
better-auth's enforceAuth middleware can't validate the CLI's
device-code JWT (signed with CLI_SYNC_SECRET, not a better-auth
session). Adds /api/cli/meshes/:slug on the web app — verifies
the HS256 JWT inline, scopes the rename to (slug, ownerUserId).
CLI now calls the new path. Mirrors the cli-sync-token pattern.

Closes the "API error 401: Unauthorized" hit after a successful
claudemesh login.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:11:31 +01:00
Alejandro Gutiérrez
3a3d2a6c4c feat(cli): file share / file get + same-host fast path (v1.19.0)
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Two new CLI verbs for the file-sharing surface that already existed
on the broker (HTTP /upload + WS get_file/list_files) but was only
reachable through MCP-style docstrings referencing tools that do
not in fact exist:

  claudemesh file share <path> [--to peer] [--message "..."]
  claudemesh file get <id> [--out path]

Same-host fast path: when --to resolves to a session on the same
hostname, skip MinIO and DM the absolute filepath. The receiver
reads it off disk directly. No bucket roundtrip, no 50 MB cap.
Falls back to encrypted upload when the peer is remote or --upload
is set.
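
The transport decision can be sketched roughly as follows. ShareOptions,
Transport, and chooseTransport are hypothetical names, and how the CLI
actually resolves the peer's hostname is not shown here.

```typescript
type Transport = "same-host-path" | "encrypted-upload";

interface ShareOptions {
  localHostname: string;
  peerHostname?: string; // undefined when --to is omitted or unresolved
  forceUpload: boolean; // corresponds to --upload
}

function chooseTransport(opts: ShareOptions): Transport {
  // --upload always wins, even for a same-host peer.
  if (opts.forceUpload) return "encrypted-upload";
  // Same hostname: DM the absolute path and skip the bucket round trip.
  if (opts.peerHostname !== undefined && opts.peerHostname === opts.localHostname) {
    return "same-host-path";
  }
  return "encrypted-upload";
}
```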

Routes the same-host DM by session pubkey, not displayName, so
sibling sessions of the same member do not trip the v0.5.1
self-DM guard.

Updates the bundled SKILL.md and the MCP server instructions to
reference the real CLI verbs instead of the fictional share_file()
/ get_file() tool calls.

Also: rename.ts now distinguishes mesh-membership from web-account
auth and points users at claudemesh login + the dashboard rather
than emitting a bare "Not signed in".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 14:05:24 +01:00
Alejandro Gutiérrez
f9ed3fa286 feat(cli): claudemesh skill prints bundled SKILL.md (v1.18.0)
Some checks failed
CI / Docker build (linux/amd64) (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
Zero-install access to the protocol reference: a fresh `npm i -g
claudemesh-cli` user (or someone running the prebuilt binary) can
now `claudemesh skill | claude --skill-add -` without copying any
files into ~/.claude/skills. The skill markdown is embedded into
the CLI bundle at build time via Bun's text-import attribute.

Also replaces two `<> ALL(...)` raw SQL fragments in the dashboard
unread-count queries with drizzle's notInArray() helper — matches
the same fix already applied to /v1/me/topics in the API package.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 12:24:45 +01:00
Alejandro Gutiérrez
50b2ae97c2 feat(cli): peer list self-marking + send self-DM guard
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
closes the "DM looped back to my own inbox" footgun.

what was happening: peer list returns one row per presence,
including the caller's own session AND its sibling sessions.
the cli filtered out the exact-session row but left siblings
unlabeled — copying their pubkey from peer list silently
targeted your own sibling, and the message arrived in "your
own inbox" because the sender was you.

fix is two-part.

(1) peer list — tag rows whose memberPubkey matches the
caller's stable JoinedMesh.pubkey:
  ● displayName (this session) — the exact session running
                                 the cli call
  ● displayName (your other session) — sibling session of
                                       your own member
visually identical otherwise; just the marker.

(2) claudemesh send — refuse a target that exactly matches the
caller's own member pubkey on the mesh, with a hint pointing
at --self for the rare intentional sibling-DM case.

both changes additive — existing scripts that pass display
names or other peers' pubkeys behave identically.
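
a minimal sketch of both parts, assuming each peer row carries a member
pubkey and a session pubkey. PeerRow, Self, peerLabel and
shouldRefuseSend are hypothetical names, not the cli's real ones.

```typescript
interface PeerRow {
  displayName: string;
  memberPubkey: string;
  sessionPubkey: string;
}

interface Self {
  memberPubkey: string; // stable JoinedMesh.pubkey
  sessionPubkey: string; // pubkey of the session running the cli call
}

// Part 1: tag rows that belong to the caller's own member.
function peerLabel(row: PeerRow, self: Self): string {
  if (row.memberPubkey !== self.memberPubkey) return row.displayName;
  return row.sessionPubkey === self.sessionPubkey
    ? `${row.displayName} (this session)`
    : `${row.displayName} (your other session)`;
}

// Part 2: refuse a send aimed at the caller's own member pubkey
// unless --self was passed.
function shouldRefuseSend(targetPubkey: string, self: Self, allowSelf: boolean): boolean {
  return targetPubkey === self.memberPubkey && !allowSelf;
}
```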

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 11:45:03 +01:00
Alejandro Gutiérrez
4b459622e4 fix(api): /v1/me/tasks query — completedAt-based window + iso cast
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
the previous form had drizzle render the date param as a js
toString() value which postgres rejected (Fri Apr 03 2026
GMT+0000 doesn't parse as timestamp without help). fix:
serialize to iso then cast ::timestamp inside the sql tag.

simplified the where clause too — the prior conditional dance
emitted "status != completed" three times redundantly. one
"completed_at IS NULL OR > window" covers active + recent-done
in one clause; status filtering happens client-side via the
existing statusSet pass.
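
a sketch of the two ideas in plain typescript, with no drizzle involved:
serialize the window start as iso, and keep a row when completed_at is
null or newer than the window. TaskRow, toIsoParam and inRecentWindow
are hypothetical names.

```typescript
interface TaskRow {
  status: string;
  completedAt: Date | null;
}

// Date.prototype.toString() yields a locale-style string that postgres
// does not accept as a timestamp; toISOString() is unambiguous.
function toIsoParam(d: Date): string {
  return d.toISOString();
}

// Mirrors "completed_at IS NULL OR completed_at > window".
function inRecentWindow(task: TaskRow, windowStart: Date): boolean {
  return task.completedAt === null || task.completedAt.getTime() > windowStart.getTime();
}
```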

also cleans up the debug probe scaffolding.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 10:29:13 +01:00
Alejandro Gutiérrez
f679b49b6c feat(workspace): default-aggregation for task/state/memory
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
ships v0.5.0 phase 2.

api: three new aggregator endpoints for the per-mesh subsystems
that didn't have one yet.
- GET /v1/me/tasks — open + claimed by default; ?status=all
  surfaces completed (30d window). sorted open > claimed > done.
- GET /v1/me/state — every (key, value) row across the user's
  meshes, sorted by recency. ?key=foo filters to one key.
- GET /v1/me/memory?q=... — ilike on content + tags, no q
  returns the last 30 days. excludes forgotten rows.
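
the task ordering (open > claimed > done) can be sketched with a rank
table. the status strings and the names Task, STATUS_RANK and sortTasks
are assumptions for illustration.

```typescript
type TaskStatus = "open" | "claimed" | "completed";

interface Task {
  id: string;
  status: TaskStatus;
}

const STATUS_RANK: Record<TaskStatus, number> = { open: 0, claimed: 1, completed: 2 };

// Copies before sorting so the caller's array is left untouched.
function sortTasks(tasks: readonly Task[]): Task[] {
  return [...tasks].sort((a, b) => STATUS_RANK[a.status] - STATUS_RANK[b.status]);
}
```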

cli (1.16.0): task list, state list, recall now route through
the matching aggregator when --mesh is omitted. --mesh foo
still scopes to one mesh (existing behavior preserved).

with this, every per-mesh read verb in the cli either has a
cross-mesh aggregator or doesn't need one. v0.5.0 substrate is
complete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 10:17:41 +01:00
Alejandro Gutiérrez
5ceb311d74 feat(cli): default-aggregation for topic list + notification list
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
ships v0.5.0 phase 1.

omitting --mesh on these read verbs now routes through
/v1/me/topics and /v1/me/notifications instead of prompting
the user to pick a mesh. behavior preserved for explicit
--mesh foo.

implementation: resolveMeshForMint helper in commands/me.ts
silently picks the first joined mesh for apikey-mint when
flags.mesh is null. /v1/me/* endpoints resolve the user from
the apikey issuer regardless of which mesh issued the key, so
mint location is irrelevant — only the user identity matters.

help text updated to reflect the new default.

phase 2 (task list, state list, memory recall) needs /v1/me/*
aggregator endpoints first; deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 04:56:33 +01:00
Alejandro Gutiérrez
e60980cfd7 feat(workspace): claudemesh me search + dashboard parity
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
ships v0.4.0 phase 5 — final aggregating verb. v0.4.0 substrate
is complete after this.

api: GET /v1/me/search?q=... matches against topic names +
sender display names + v1 message snippets (base64 decode then
ilike). v2 ciphertext matches only on topic/sender — server has
no topic keys. 30-day window on messages, capped at 50 hits per
category.

cli (1.14.0): claudemesh me search <query> renders topic + msg
sections with inline yellow highlighting. min 2 chars; --json
returns the raw response.

web: /dashboard/search adds an autofocused input + mark
highlighting on every match site (topic name, sender, snippet).
sidebar gets a search entry between activity and invites.

roadmap: phase 5 marked shipped, v0.5.0 default-aggregation
behavior added as the natural next track.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 04:45:54 +01:00
Alejandro Gutiérrez
ff3d11d42d feat(workspace): claudemesh me activity + dashboard parity
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
ships v0.4.0 phase 4. final aggregating verb after this is
me search (phase 5).

api: GET /v1/me/activity returns topic messages across every
mesh the user belongs to in a 24h default window (?since=iso
override), excluding messages the caller authored themselves.
"what is happening that i missed", capped at 200.

cli (1.13.0): claudemesh me activity prints a condensed feed
with mesh + topic + sender + relative timestamp + snippet (or
[encrypted] for v2 ciphertext).
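
the selection rule for GET /v1/me/activity (24h default window, ?since
override, own messages excluded, cap of 200) can be sketched as below.
TopicMessage and activityFeed are hypothetical names, and newest-first
ordering before the cap is an assumption.

```typescript
interface TopicMessage {
  senderId: string;
  createdAt: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;
const MAX_ITEMS = 200;

function activityFeed(
  messages: readonly TopicMessage[],
  callerId: string,
  now: Date,
  since?: Date,
): TopicMessage[] {
  // Default window is the last 24h unless the caller supplies ?since.
  const cutoff = (since ?? new Date(now.getTime() - DAY_MS)).getTime();
  return messages
    .filter((m) => m.senderId !== callerId && m.createdAt.getTime() >= cutoff)
    .sort((a, b) => b.createdAt.getTime() - a.createdAt.getTime()) // assumed order
    .slice(0, MAX_ITEMS);
}
```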

web: /dashboard/activity clusters consecutive messages from the
same topic into thread blocks for readability. sidebar gains an
activity entry between notifications and invites.
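
the clustering of consecutive same-topic messages could look roughly
like this. FeedMessage, ThreadBlock and clusterByTopic are hypothetical
names, not the dashboard's actual code.

```typescript
interface FeedMessage {
  topicId: string;
  text: string;
}

interface ThreadBlock {
  topicId: string;
  messages: FeedMessage[];
}

// Groups runs of adjacent messages that share a topic. A topic that
// reappears later in the feed starts a new block.
function clusterByTopic(feed: readonly FeedMessage[]): ThreadBlock[] {
  const blocks: ThreadBlock[] = [];
  for (const msg of feed) {
    const last = blocks[blocks.length - 1];
    if (last !== undefined && last.topicId === msg.topicId) {
      last.messages.push(msg);
    } else {
      blocks.push({ topicId: msg.topicId, messages: [msg] });
    }
  }
  return blocks;
}
```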

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 04:35:52 +01:00
Alejandro Gutiérrez
43e429f204 feat(workspace): claudemesh me notifications + dashboard parity
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
ships v0.4.0 phase 3.

api: GET /v1/me/notifications aggregates the mesh.notification
table across every joined mesh in a 7-day window (?since=iso
overrides, ?include=all surfaces already-read). returns sender +
topic + mesh context plus a 240-char snippet for v1 plaintext
messages or raw ciphertext for v2 (the dashboard topic-key cache
decrypts client-side).

cli (1.12.0): claudemesh me notifications — terse unread feed
with @ dot, --all to include read, --since for custom window.

web: /dashboard/notifications mirrors the cli view in card form,
adds a notifications entry to the dashboard sidebar between
topics and invites. each card links straight to the topic chat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 02:35:57 +01:00
Alejandro Gutiérrez
1c335e8daa ci(web): auto-deploy claudemesh-web to coolify on push to main
new workflow joins the tailnet via tailscale oauth then triggers
the coolify deploy endpoint. path filter scoped to web app + every
package transpiled into it, so broker/cli/docs changes skip it.
concurrency group coalesces rapid pushes.

requires three repo secrets: COOLIFY_TOKEN, TS_OAUTH_CLIENT_ID,
TS_OAUTH_SECRET (the OAuth client needs the devices:write scope and
the tag:ci tag in tailnet ACL tagOwners).

inline coolify token removed from CLAUDE.md — it now references
the repo secret. broker deploy is unchanged: it runs through the
gitea-vps webhook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 02:06:29 +01:00
Alejandro Gutiérrez
397ddb4c45 docs: mark v0.4.0 phase 2 shipped + record web deploy trick
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
roadmap entry for the me-topics + dashboard-topics ship.

claude.md gets the long-overdue note that apps/web is on coolify
on the ovh vps, not vercel — it does not auto-deploy on push to
gitea-vps the way the broker does, and that mismatch cost a
session of debugging. records the manual deploy command so the
next time we ship a web change we don't rediscover the issue.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 01:36:05 +01:00
Alejandro Gutiérrez
354c47c3d6 chore: remove diagnostic endpoint + debug probe scaffolding
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
restores api + cli to clean state after isolating the v0.4.0
phase 2 deploy issue (web app needed an explicit coolify deploy
trigger — it does not auto-deploy from gitea-vps push the way the
broker does).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 01:35:08 +01:00
Alejandro Gutiérrez
2262564680 chore(api): rename diagnostic to a unique path to defeat any stale routing cache
2026-05-03 01:28:36 +01:00
Alejandro Gutiérrez
c18891191e chore(api): add /v1/me/ping sanity probe
confirms whether new GET routes under /me/* deploy correctly to
vercel — diagnostic in the middle of the /me/topics 404 chase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 01:26:58 +01:00
Alejandro Gutiérrez
eb021a8a6f chore: trigger vercel rebuild
2026-05-03 01:16:58 +01:00
Alejandro Gutiérrez
3964de4962 fix(api): use notInArray + inArray in unread-count subqueries
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
the sql.join() form of NOT IN crashed the route handler before
it could respond — vercel surfaced the crash as a plaintext 404
instead of going through hono's exception handler. switching to
drizzle's notInArray() / inArray() emits stable parameter
bindings and resolves both /v1/me/topics (fresh endpoint) and
/v1/topics (older endpoint with the same ANY() pattern bug).

also cleans up debug instrumentation that was added while
chasing the 404.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 01:05:42 +01:00
Alejandro Gutiérrez
c795df4fd4 feat(workspace): claudemesh me topics + dashboard topics page
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
ships v0.4.0 phase 2: a cross-mesh topic feed.

api: GET /v1/me/topics aggregates topics across every mesh the
caller belongs to with per-topic unread counts (vs the user's
member-row last_read_at) and last-message timestamps. Sorted by
last activity.

cli (1.11.0): claudemesh me topics renders the feed; --unread
filters to topics with pending reads; --json returns raw.

web: /dashboard/topics ssr's the same view server-side (direct
db queries, no apikey-mint roundtrip) and adds a Topics entry
to the dashboard sidebar between Meshes and Invites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:39:58 +01:00
Alejandro Gutiérrez
aa6c7be4eb build(sdk): add exports.bun condition pointing at src for compile
bun build --compile in the cli release workflow couldn't resolve
@claudemesh/sdk because dist/ never gets built (--ignore-scripts).
adding exports.bun -> ./src/index.ts lets bun consume the typescript
sources directly while npm consumers keep using dist/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:04:35 +01:00
Alejandro Gutiérrez
3da06d357e docs(roadmap): mark v0.4.0 phase 1 shipped (claudemesh me)
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
split the v0.4.0 entry into phase 1 (the me/workspace endpoint
+ verb that just shipped in CLI 1.10.0) and phase 2+ (remaining
me topics/notifications/activity/search verbs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:03:22 +01:00
Alejandro Gutiérrez
075df6db08 fix(api): correct online count in /v1/me/workspace
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
count distinct members with disconnectedAt is null instead of
all presence rows — a member can have many sessions, plus stale
rows from prior runs.
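
an in-memory sketch of the corrected count, assuming each presence row
carries a member id and a nullable disconnectedAt. PresenceRow and
countOnlineMembers are hypothetical names; the real fix lives in the
sql query.

```typescript
interface PresenceRow {
  memberId: string;
  disconnectedAt: Date | null;
}

// Counts distinct members with at least one live session, rather
// than counting every presence row.
function countOnlineMembers(rows: readonly PresenceRow[]): number {
  const online = new Set<string>();
  for (const row of rows) {
    if (row.disconnectedAt === null) online.add(row.memberId);
  }
  return online.size;
}
```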

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 23:53:12 +01:00
Alejandro Gutiérrez
c7ce92f35b fix(api): use inArray for /v1/me/workspace mesh-id filters
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
drizzle's sql template literal interpolated meshIds as a tuple
(($1, $2, $3, ...)) instead of an array, breaking ANY() and
returning HTTP 500. inArray() emits the right binding shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 23:46:50 +01:00
Alejandro Gutiérrez
7de13cbb71 feat(cli): claudemesh me — cross-mesh workspace overview (v0.4.0)
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
2026-05-02 23:35:01 +01:00
Alejandro Gutiérrez
ad70782171 feat(api): cross-mesh workspace overview endpoint at /v1/me/workspace
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
2026-05-02 23:31:44 +01:00
Alejandro Gutiérrez
646d4fa3f1 fix(ui): chat footer reflects per-topic encryption state
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
2026-05-02 23:24:49 +01:00
Alejandro Gutiérrez
7f6af0137d feat(api+web): browser claims + re-seals encryption on v1 topics
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Closes the last gap from phase 3.5: web-created topics start as v1
plaintext (mutations.ts ensureGeneralTopic doesn't generate a key,
because the dashboard owner has a throwaway pubkey with no secret).
Once the browser identity is registered via /v1/me/peer-pubkey, the
chat panel can lazily upgrade the topic to v2.

API (POST /v1/topics/:name/claim-key)
- Atomic claim: only succeeds when topic.encrypted_key_pubkey IS
  NULL. Body carries the new senderPubkey + the caller's sealed copy
  of the freshly-generated topic key. Race losers get 409 with the
  winning senderPubkey so they fall through to the regular fetch
  path. Idempotent at topic_member_key level.
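
A minimal in-memory sketch of the claim semantics described above, assuming a single topic row; `TopicRow`, `ClaimResult`, and `tryClaim` are hypothetical names. In a real database the check and the write would have to be one conditional update for the claim to be atomic.

```typescript
// Hypothetical in-memory stand-in for the topic row.
interface TopicRow {
  name: string;
  encryptedKeyPubkey: string | null;
}

type ClaimResult =
  | { status: 200; senderPubkey: string }
  | { status: 409; winningSenderPubkey: string };

// Succeeds only while no key pubkey is recorded; later callers get a 409
// carrying the winner's pubkey so they can fall back to the fetch path.
function tryClaim(topic: TopicRow, senderPubkey: string): ClaimResult {
  if (topic.encryptedKeyPubkey !== null) {
    return { status: 409, winningSenderPubkey: topic.encryptedKeyPubkey };
  }
  topic.encryptedKeyPubkey = senderPubkey;
  return { status: 200, senderPubkey };
}
```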

Web
- claimTopicKey() in services/crypto/topic-key.ts: generates a fresh
  32-byte symmetric key, seals for self, POSTs the claim. Returns
  the in-memory key so the caller can encrypt immediately without a
  follow-up GET /key round-trip.
- sealTopicKeyFor(): mirrors the CLI helper so a browser holder can
  re-seal for newcomers (CLI peers, other browsers) instead of the
  topic going dark when only a browser has the key.
- TopicChatPanel: when keyState === "topic_unencrypted", composer
  now shows a "🔓 plaintext (v1) — encryption not yet enabled" line
  with an "enable encryption" button. Click → claimTopicKey → state
  flips to "ready" → 🔒 v0.3.0 banner appears. On race-lost, falls
  through to fetch.
- New 30s re-seal loop fires while holding the key: polls
  /pending-seals, seals via sealTopicKeyFor for each pending target,
  POSTs to /seal. Same cadence + soft-fail discipline as the CLI.

Net effect: any dashboard user can convert legacy v1 topics to v2
with a single click, and CLI peers joining later will receive a
sealed copy from the browser's re-seal loop without manual action.
2026-05-02 23:22:26 +01:00
Alejandro Gutiérrez
2e57173ed9 fix(api): /v1/me/peer-pubkey only updates web-managed members
Adds a 409 not_web_member guard to POST /v1/me/peer-pubkey: the
endpoint will only rewrite peer_pubkey on members that have
dashboard_user_id set. CLI members own their on-disk keypair —
overwriting their stored peer_pubkey would break the next WS hello
because the signature verification would fail against the new
pubkey.

In practice this restriction is invisible to the legitimate browser
flow: the dashboard always mints its apikey against the web member
(dashboard_user_id is non-null by construction in mutations.ts).
Guard ensures a misuse (e.g. a CLI-minted apikey being used to call
peer-pubkey) gets a clear 409 instead of silently breaking the CLI's
auth.

Discovered during phase 3.5 smoke when a CLI-minted apikey clobbered
the only openclaw member (CLI-owned) and the user's CLI signature
would have stopped verifying on the next launch.
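
The guard can be pictured as follows. This is a sketch under the assumption that a member row exposes `dashboardUserId` and `peerPubkey`; `MemberRow`, `PubkeyUpdateResult`, and `updatePeerPubkey` are hypothetical names.

```typescript
// Hypothetical member row; only the two fields the guard cares about.
interface MemberRow {
  id: string;
  peerPubkey: string;
  dashboardUserId: string | null;
}

type PubkeyUpdateResult =
  | { status: 200 }
  | { status: 409; error: "not_web_member" };

// CLI-owned members (no dashboard user) keep their on-disk keypair, so the
// stored pubkey is never rewritten for them.
function updatePeerPubkey(member: MemberRow, newPubkey: string): PubkeyUpdateResult {
  if (member.dashboardUserId === null) {
    return { status: 409, error: "not_web_member" };
  }
  member.peerPubkey = newPubkey;
  return { status: 200 };
}
```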
2026-05-02 23:08:50 +01:00
Alejandro Gutiérrez
95b16a23fc docs(roadmap): mark v0.3.0 phase 3.5 (web encryption) shipped
2026-05-02 22:59:33 +01:00
Alejandro Gutiérrez
a3cf9b938e feat(web+api): browser-side per-topic encryption (v0.3.0 phase 3.5)
Closes the v1-vs-v2 split between CLI and dashboard. The web chat
panel now reads and writes the same crypto_secretbox-under-topic-key
ciphertext that CLI 1.8.0+ writes — every encrypted topic finally
renders correctly from the browser.

API
- POST /v1/me/peer-pubkey replaces the throwaway pubkey that
  mutations.ts mints at mesh-create time with one whose secret the
  browser actually holds. Idempotent; auth via the dashboard apikey
  whose issuedByMemberId is the row to update.

Web
- apps/web/src/services/crypto/identity.ts — IndexedDB-backed
  ed25519 identity, lazy-init on first use. Generates once per
  browser-profile; survives reload. ed25519 → x25519 derivation for
  crypto_box decrypt. Module-cached after first call.
- apps/web/src/services/crypto/topic-key.ts — mirrors the CLI
  topic-key service. Fetches GET /v1/topics/:name/key, decrypts the
  sealed copy with our x25519 secret, caches the 32-byte symmetric
  key in-memory keyed by (apikey-prefix, topic). encryptMessage /
  decryptMessage map directly onto crypto_secretbox{,_open}.
- apps/web/src/modules/mesh/topic-chat-panel.tsx — on mount:
  registers our pubkey, fetches the topic key, polls /key every 5s
  while not_sealed (matching the CLI's 30s re-seal cadence). Render
  branches on bodyVersion: v2 -> decrypted-cache, v1 -> legacy
  base64. Send branches: encrypts under the topic key when key is
  ready, falls back to v1 plaintext on legacy or not-yet-sealed
  topics. Composer shows a 🔒 v0.3.0 / "waiting for re-seal" badge.

Adds libsodium-wrappers + @types to apps/web. Browser bundle picks
up its own copy; the existing CLI/broker/API copies are untouched.

Threat model: IndexedDB is per-origin and not exfiltratable from
other sites; XSS or a malicious extension still wins, same as for
any browser-stored secret. Documented divergence from the CLI's
~/.claudemesh-stored keypair in the identity module's preamble.
2026-05-02 22:59:08 +01:00
Alejandro Gutiérrez
ce321c0a21 docs(skill): add Windows pane-spawn primitives for launch
Adds Windows Terminal (wt.exe new-tab + split-pane), PowerShell
Start-Process, cmd.exe start, and WSL routing examples to the
"Spawning new sessions" section. Plus the platform's gotchas:
single-quote nesting in cmd.exe, -NoExit semantics, WSL ~/.claudemesh
path-vs-host divergence, and pwsh / --profile selectors for Windows
Terminal. Bumps CLI to 1.9.5.
2026-05-02 22:48:16 +01:00
Alejandro Gutiérrez
9ecf2d65af docs(skill): wizard-free launch patterns for spawning peer sessions
Adds a "Spawning new sessions (no wizard)" section to the bundled
claudemesh skill. Documents every flag of `claudemesh launch`
(--name, --mesh, --join, --groups, --role, --message-mode,
--system-prompt, --resume, --continue, -y, -q, plus -- pass-through),
shows wizard-free spawn templates from minimal to cold-start-with-
join, and the canonical pane-creation primitives (tmux send-keys,
iTerm2 osascript, Terminal.app, gnome-terminal, screen) that wrap
the verb when spawning into a fresh terminal pane or window.

Closes the gap where Claude knew the verb existed but had no
playbook for "how do I start another peer in a new pane without an
interactive prompt firing." Bumps CLI to 1.9.4 so the skill ships
on `claudemesh install`.
2026-05-02 22:44:00 +01:00
Alejandro Gutiérrez
80755dbf9b feat(cli+broker): structured argument validation, msg-status prefixes (v1.9.3)
Adds apps/cli/src/cli/validators.ts — a small module of shape
validators (pubkey, pubkey prefix, message id, mesh slug) that return
discriminated results so callers can distinguish "shape is wrong"
(INVALID_ARGS exit) from "value is well-shaped, lookup failed"
(NOT_FOUND exit). Includes renderValidationError() for a consistent
three-tier error contract: what's wrong, what would be valid, closest
valid alternative.

First adopter is `claudemesh msg-status`:
- Validates id locally before opening WS — typos return immediately.
- Accepts 8-32 char prefixes (full ids are 32). Pastes that get
  copy-truncated by the terminal still work.
- Distinct error messages for malformed input vs not-in-queue vs
  ambiguous prefix; --json emits the structured shape.
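
A shape validator with a discriminated result might look like the sketch below. It is not the contents of validators.ts; `Validation` and `validateMessageId` are hypothetical names, and only the 8-32 base62 rule is taken from this commit.

```typescript
// Discriminated result: callers branch on `ok` and never parse error strings.
type Validation =
  | { ok: true; value: string }
  | { ok: false; problem: string; expected: string };

const BASE62 = /^[0-9A-Za-z]+$/;

// Checks shape only. Whether the id exists is a separate lookup, which is
// what lets the caller tell INVALID_ARGS apart from NOT_FOUND.
function validateMessageId(input: string): Validation {
  const value = input.trim();
  const expected = "8-32 base62 characters";
  if (value.length < 8 || value.length > 32) {
    return { ok: false, problem: `length ${value.length} is out of range`, expected };
  }
  if (!BASE62.test(value)) {
    return { ok: false, problem: "contains non-base62 characters", expected };
  }
  return { ok: true, value };
}
```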

Broker side: WS message_status handler validates idStr is 8-32
base62 before querying. Prefix lookups use LIKE 'prefix%' scoped to
the caller's mesh (no cross-mesh leak). Returns ambiguous_prefix
when more than one match.
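
The mesh-scoped prefix lookup can be sketched over an in-memory list. The broker runs this as a SQL query; `QueueRow`, `PrefixLookup`, and `lookupByPrefix` are hypothetical names used only for illustration.

```typescript
interface QueueRow {
  id: string;
  meshId: string;
}

type PrefixLookup =
  | { kind: "found"; id: string }
  | { kind: "not_found" }
  | { kind: "ambiguous_prefix"; matches: number };

// Only rows in the caller's mesh are considered, so a prefix can never
// reveal ids from another mesh.
function lookupByPrefix(rows: QueueRow[], meshId: string, prefix: string): PrefixLookup {
  const matches = rows.filter((r) => r.meshId === meshId && r.id.startsWith(prefix));
  const [first, ...rest] = matches;
  if (first === undefined) return { kind: "not_found" };
  if (rest.length > 0) return { kind: "ambiguous_prefix", matches: matches.length };
  return { kind: "found", id: first.id };
}
```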

Establishes the canonical pattern; rolling out to send / grant /
revoke / topic post --reply-to in subsequent patches.
2026-05-02 22:40:45 +01:00
Alejandro Gutiérrez
82ee89d0dc feat(cli+docs): colorize --help output + workspace view spec
Help text was a wall of monochrome ASCII. Now section headers print
bold-clay, the program title is brand-orange, each verb's syntax is
tinted cyan, and `(alias: ...)` parentheticals are dimmed so they
read as secondary metadata. The styles helper already gates on TTY +
NO_COLOR, so non-interactive output stays unchanged.

Adds .artifacts/specs/2026-05-02-workspace-view.md — the v0.4.0
spec for a per-user virtual workspace that aggregates reads across
all joined meshes while keeping writes mesh-scoped. Roadmap entry
added under v0.3.0.
2026-05-02 22:28:46 +01:00
Alejandro Gutiérrez
8697c1c032 fix(api+cli): topic post messageId is the durable historyId (v1.9.2)
Previously POST /v1/messages returned the message_queue row id as
`messageId`. Topic posts ARE durable (in topic_message); the queue
entry drains on delivery. Pasting that id into `--reply-to` failed
because the broker validates parents against topic_message, not the
queue. Now `messageId` aliases `historyId` for topic posts; both
`historyId` and `queueId` remain available as explicit fields.

Roadmap and CLI README updated with v0.3.1 reply-to + v0.3.2
multi-session entries.
2026-05-02 22:10:13 +01:00
Alejandro Gutiérrez
716e674473 fix(broker+cli): multi-session DM routing + broadcast self-loopback (v0.3.2)
Two related bugs surfaced in multi-session production use of 1.8.0:

1. Replies via `claudemesh send <from_id>` rejected with "no connected
   peer for target" when the original sender's session had rotated
   (Claude Code restart, /resume). Root cause: from_id carried the
   ephemeral session pubkey, which disappears the moment the session
   ends. Fix: handleSend pre-flight now also resolves the target
   pubkey against the persistent meshMember table and routes to the
   owning member's live session(s); MCP push channel now sets from_id
   to the stable member pubkey and exposes the ephemeral one under
   from_session_pubkey.

2. Broadcast/* and @group sends looped back to the sender's *sibling*
   sessions (same member, different session keypair), surfacing a
   spurious "tampered or wrong keypair" decrypt warning on the
   sender's own inboxes. Fix: broadcast/group fan-out now skips by
   memberPubkey, not just by presence_id, so the entire sender member
   is excluded — direct sends keep per-presence skip so a member can
   still DM their own sibling session intentionally.

Push envelope now also carries senderMemberPubkey alongside
senderPubkey so any other client of the WS channel can choose the
right one.
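
The two skip rules from fix 2 can be sketched as a filter. `Presence` and `fanOutTargets` are hypothetical names, and the real broker works on its own connection objects rather than plain records.

```typescript
interface Presence {
  presenceId: string;
  memberPubkey: string;
}

// Broadcast and group sends exclude every session of the sending member;
// direct sends exclude only the sending session, so a member can still
// message a sibling session on purpose.
function fanOutTargets(
  all: Presence[],
  sender: Presence,
  kind: "broadcast" | "direct",
): Presence[] {
  return all.filter((p) =>
    kind === "broadcast"
      ? p.memberPubkey !== sender.memberPubkey
      : p.presenceId !== sender.presenceId,
  );
}
```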
2026-05-02 22:05:11 +01:00
Alejandro Gutiérrez
038a5b5bf7 feat(broker+api+cli): topic message reply-to threading (v0.3.1)
Adds a reply_to_id column (self-FK on topic_message) plus end-to-end
plumbing so a message can mark itself as a reply to a previous one in
the same topic.

- Schema: 0027_topic_message_reply_to.sql adds reply_to_id with
  ON DELETE SET NULL + index for backlink lookup.
- Broker: appendTopicMessage validates parent shares the topic, writes
  reply_to_id; topicHistory + topic_history_response surface it; WS
  push envelope now carries senderMemberId, senderName, topic name,
  reply_to_id, and message_id so recipients have everything they need
  to reply without a follow-up query.
- REST: POST /v1/messages accepts replyToId (validated server-side);
  GET /messages and SSE /stream emit it per row.
- CLI: `topic post --reply-to <id|prefix>` resolves prefixes against
  recent history; `topic tail` renders an "↳ in reply to <name>:
  <snippet>" line above replies and shows a copyable #shortid tag on
  every row.
- MCP push pipe: channel attributes now include from_pubkey,
  from_member_id, message_id, topic, reply_to_id — the recipient can
  thread a reply directly from the inbound notification.
- Skill + identity prompt updated to teach Claude how to use the new
  attributes for replies.

Bumped CLI to 1.9.0.
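
The parent check on append can be sketched like this, assuming messages are held in a simple array; `ThreadedMessage` and `appendWithReply` are hypothetical names, not the broker's appendTopicMessage.

```typescript
interface ThreadedMessage {
  id: string;
  topicId: string;
  replyToId: string | null;
}

// Appends a message, rejecting a reply whose parent is missing or lives in
// a different topic.
function appendWithReply(
  history: ThreadedMessage[],
  id: string,
  topicId: string,
  replyToId?: string,
): ThreadedMessage {
  if (replyToId !== undefined) {
    const parent = history.find((m) => m.id === replyToId);
    if (parent === undefined || parent.topicId !== topicId) {
      throw new Error("reply parent must exist in the same topic");
    }
  }
  const message: ThreadedMessage = { id, topicId, replyToId: replyToId ?? null };
  history.push(message);
  return message;
}
```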
2026-05-02 21:58:21 +01:00
Alejandro Gutiérrez
d871988084 fix(broker): libsodium dynamic import — extract .default for bun
await import('libsodium-wrappers') returns the namespace object in
bun, not the sodium API. randombytes_buf et al. live on .default.
Without this, every topic_create on the deployed broker errored
with 'sodium.randombytes_buf is not a function', the WS handler
silently dropped the request, and the CLI saw a 5s timeout.

Confirmed via broker docker logs:
  warn ws message error: sodium.randombytes_buf is not a function

Same destructure pattern as crypto.ts (which uses the synchronous
default import).
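
The unwrap step can be isolated in a small helper. `unwrapDefault` is a hypothetical name and this is a sketch rather than the broker's code; it accepts either the namespace object or the API object itself.

```typescript
// Returns the object that carries the API, whether the dynamic import
// resolved to a namespace object ({ default: api }) or to the api directly.
function unwrapDefault<T extends object>(mod: T | { default: T }): T {
  const maybe = mod as unknown as { default?: T };
  return maybe.default !== undefined ? maybe.default : (mod as T);
}

// Intended use (assumption, not executed here):
//   const sodium = unwrapDefault(await import("libsodium-wrappers"));
```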

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 21:15:37 +01:00
Alejandro Gutiérrez
3c35932191 docs(skill): cover topic tail/post + member list + notification list
Adds v1.7.0 (terminal parity) and v1.8.0 (per-topic encryption)
verbs to the bundled claudemesh skill so Claude Code sessions
discover them via the auto-installed SKILL.md instead of the
README-only path.

Sections added:
  - topic tail / topic post under the topic block
  - member resource (distinct from peer)
  - notification resource
  - per-topic encryption block — explains v2 ciphertext marker,
    re-seal flow, and 404 behaviour

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 21:12:55 +01:00
Alejandro Gutiérrez
b08daadbdc fix(broker): topic_create no longer rejects on creator-seal failure
A bad ed25519 pubkey on the creator member (legacy data) made
sealTopicKeyForMember throw, which propagated up through createTopic
and made the WS topic_create handler never send a topic_created
frame. CLI saw a 5s timeout and printed 'topic create failed'.

Wraps the seal call in try/catch — topic creation succeeds even if
no copy gets sealed for the creator. They'll see GET /v1/topics/:name/key
return 404 until they re-seal (or a holder does it for them via
the phase-3 background loop).
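
A sketch of the tolerant creation path, with the seal step injected as a callback; `CreatedTopic` and `createTopicTolerant` are hypothetical names and the real createTopic does considerably more.

```typescript
interface CreatedTopic {
  name: string;
  creatorSealed: boolean;
}

// A failing creator seal no longer aborts creation; the topic is returned
// and the missing sealed copy is left for a later re-seal.
function createTopicTolerant(name: string, sealForCreator: () => void): CreatedTopic {
  let creatorSealed = true;
  try {
    sealForCreator();
  } catch {
    creatorSealed = false;
  }
  return { name, creatorSealed };
}
```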

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 21:11:55 +01:00
Alejandro Gutiérrez
cb5faca920 docs(roadmap): v0.3.0 phase 3 (CLI) shipped, phase 3.5 (web) added
CLI v1.8.0 on npm. Web stays on v1 plaintext pending the IndexedDB
identity work tracked as phase 3.5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 21:03:47 +01:00
Alejandro Gutiérrez
77f4316f2d feat(broker+api+cli): per-topic E2E encryption — v0.3.0 phase 3 (CLI)
Wire format:
  topic_member_key.encrypted_key = base64(
    <32-byte sender x25519 pubkey> || crypto_box(topic_key)
  )

Embedding sender pubkey inline lets re-sealed copies (carrying a
different sender than the original creator-seal) decode the same
way as creator copies, without an extra schema column or join.
topic.encrypted_key_pubkey stays for backwards-compat metadata
but the wire truth is the inline prefix.
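
The inline-sender layout can be sketched on raw bytes. Base64 decoding and the crypto_box call are left out, and `joinSealedKey` and `splitSealedKey` are hypothetical helper names.

```typescript
const SENDER_PUBKEY_BYTES = 32;

// Builds <32-byte sender pubkey> || <boxed topic key>.
function joinSealedKey(senderPubkey: Uint8Array, boxed: Uint8Array): Uint8Array {
  if (senderPubkey.length !== SENDER_PUBKEY_BYTES) {
    throw new Error("sender pubkey must be 32 bytes");
  }
  const out = new Uint8Array(SENDER_PUBKEY_BYTES + boxed.length);
  out.set(senderPubkey, 0);
  out.set(boxed, SENDER_PUBKEY_BYTES);
  return out;
}

// Splits a decoded payload back into its sender prefix and boxed remainder.
function splitSealedKey(decoded: Uint8Array): { senderPubkey: Uint8Array; boxed: Uint8Array } {
  if (decoded.length <= SENDER_PUBKEY_BYTES) {
    throw new Error("sealed key is too short to carry a sender prefix");
  }
  return {
    senderPubkey: decoded.subarray(0, SENDER_PUBKEY_BYTES),
    boxed: decoded.subarray(SENDER_PUBKEY_BYTES),
  };
}
```

Because the sender travels with the payload, a creator copy and a re-sealed copy go through the same split before decryption.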

API (phase 3):
  GET  /v1/topics/:name/pending-seals  list members without keys
  POST /v1/topics/:name/seal           submit a re-sealed copy
  POST /v1/messages now accepts bodyVersion (1|2); v2 skips the
  regex mention extraction (server can't read v2 ciphertext).
  GET  /messages + /stream now return bodyVersion per row.

Broker + web mutations updated to use the inline-sender format
when sealing. ensureGeneralTopic (web) also generates topic keys
per the bugfix that landed earlier today; both producers now
share one wire format.

CLI (claudemesh-cli@1.8.0):
  + apps/cli/src/services/crypto/topic-key.ts — fetch/decrypt/encrypt/seal
  + claudemesh topic post <name> <msg> — encrypted REST send (v2)
  * claudemesh topic tail <name> — decrypts v2 on render, runs a
    30s background re-seal loop for pending joiners

Web client stays on v1 plaintext until phase 3.5 (browser-side
persistent identity in IndexedDB). Mention fan-out from phase 1
already works for both versions, so /v1/notifications keeps
working through the cutover.

Spec at .artifacts/specs/2026-05-02-topic-key-onboarding.md
updated with the implemented inline-sender format and the
phase 3.5 web plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 21:03:11 +01:00
Alejandro Gutiérrez
82ebd2b6be chore(broker): wire mentions through WS topic_send + dedupe imports
WSSendMessage gains an optional mentions field; the broker forwards
it into appendTopicMessage so WS-driven topic sends get the same
write-time fan-out path as REST POST /v1/messages. v1 messages
(today's plaintext-base64) still fall back to a body regex when the
field is omitted, so existing CLIs aren't broken; v2 ciphertext
clients in phase 3 will populate it.

Also drops the duplicate meshMember import (kept the meshMember-as-
memberTable alias which the rest of the file uses).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 20:45:57 +01:00
Alejandro Gutiérrez
b70536195a fix(api): ensureGeneralTopic generates a topic key + seals for owner
The web mesh-creation path went straight through db.insert(meshTopic)
and bypassed the broker's createTopic, so the v0.3.0 phase-2 key
generation never ran for #general topics created via the dashboard.
Result: GET /v1/topics/general/key returned 409 topic_unencrypted
on every web-created mesh.

Mirrors the broker's createTopic flow inline: generate a 32-byte
topic key + ephemeral x25519 sender keypair, persist the public
half on topic.encrypted_key_pubkey, seal a copy for the oldest
non-revoked member (the owner — owner-as-member rows are minted
at mesh creation per a prior fix), and let the topicKey leave
memory.

Existing meshes with already-created (and unencrypted) #general
topics aren't backfilled; they stay v0.2.0 plaintext until the
phase 3 client encrypt path lands. New meshes get encrypted
topics from this commit forward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 20:44:26 +01:00
Alejandro Gutiérrez
39929eb7fe docs(roadmap): expand v0.3.0 per-topic encryption into three phases
Phase 1 (notification table) and phase 2 (schema + creator seal)
shipped today. Phase 3 (member-driven re-seal + client-side
encrypt/decrypt) is the cut that actually flips the broker to
ciphertext-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 20:28:37 +01:00
Alejandro Gutiérrez
da5103a315 feat(broker+api): per-topic symmetric keys — schema + creator seal
Phase 2 (infra layer) of v0.3.0. Topics now generate a 32-byte
XSalsa20-Poly1305 key on creation; the broker seals one copy via
crypto_box for the topic creator using an ephemeral x25519
sender keypair (whose public half lives on
topic.encrypted_key_pubkey). Topic key plaintext leaves memory
immediately after the creator's seal — the broker can't read it.

Schema 0026:
  + topic.encrypted_key_pubkey (text, nullable for legacy v0.2.0)
  + topic_message.body_version  (integer, 1=plaintext / 2=v2 cipher)
  + topic_member_key            (id, topic_id, member_id,
                                 encrypted_key, nonce, rotated_at)

API:
  + GET /v1/topics/:name/key — return the calling member's sealed
    copy. 404 if no copy exists yet (joined post-creation, no peer
    has re-sealed). 409 if the topic is legacy unencrypted.

Open question parked: how new joiners get their sealed copy
without ceding plaintext to the broker. Spec at
.artifacts/specs/2026-05-02-topic-key-onboarding.md picks
member-driven re-seal (Option B). Pending-seals endpoint, seal
POST, and the actual on-the-wire encryption ship in phase 3.

Mention fan-out from phase 1 (notification table) is decoupled
from ciphertext, so /v1/notifications + MentionsSection keep
working unchanged through both phases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 20:28:10 +01:00
Alejandro Gutiérrez
1a238d4178 feat(api+broker+web): write-time mention fan-out via notification table
Phase 1 of v0.3.0 — replaces the regex-on-decoded-ciphertext scan
in /v1/notifications and the dashboard MentionsSection with reads
from a new mesh.notification table populated at write time.

Schema 0025: mesh.notification (id, mesh_id, topic_id, message_id,
recipient_member_id, sender_member_id, kind, created_at, read_at)
with a unique (message_id, recipient) so a re-fanned message yields
one row per recipient. Backfills existing v0.2.0 messages by
regex-matching the (still-base64-plaintext) bodies — guarded with
a base64 + length check so binary ciphertext doesn't crash the
migration.

Writers (POST /v1/messages + broker appendTopicMessage) now
extract @-mentions from either an explicit `mentions: string[]`
on the request OR a regex over the base64 plaintext (transitional
fallback). Targets are intersected with the mesh roster + capped
at 32 per message. Web chat panel sends the explicit array now so
it keeps working after phase 2 lands.
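
The writer-side resolution can be sketched as below. The mention regex is an assumption (the commit does not show the real one), the plaintext is taken as already base64-decoded, and `resolveMentions` is a hypothetical name; only the roster intersection and the cap of 32 come from this commit.

```typescript
const MAX_MENTIONS = 32;

// Prefers the explicit list; falls back to scanning the decoded plaintext.
// Results are de-duplicated, limited to roster members, and capped.
function resolveMentions(
  explicit: string[] | undefined,
  plaintext: string | null,
  roster: Set<string>,
): string[] {
  const scanned = plaintext === null ? [] : plaintext.match(/@[\w-]+/g) ?? [];
  const candidates = explicit ?? scanned.map((token) => token.slice(1));
  const unique = Array.from(new Set(candidates));
  return unique.filter((name) => roster.has(name)).slice(0, MAX_MENTIONS);
}
```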

Readers switch to JOIN-on-notification:
  /v1/notifications      — table-backed, supports ?unread=1
  POST /v1/notifications/read  — new, mark by ids or all-up-to
  MentionsSection (RSC) — same JOIN, returns readAt for each row

GET /v1/notifications also gains a read_at field per row so a
future bell UI can show unread vs read.

Once per-topic encryption (phase 2) lands, the regex fallback
becomes a no-op for v2 messages — clients MUST send `mentions`,
which they already do.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 20:23:50 +01:00
Alejandro Gutiérrez
81f8066f99 docs(roadmap): mark v1.7.0 CLI parity shipped
Adds the terminal verbs (topic tail / member list / notification
list) explicitly to v1.7.0 so the demo cut summary matches what's
on npm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 20:02:59 +01:00
Alejandro Gutiérrez
dd80d4e946 feat(cli): v1.7.0 — terminal parity for SSE + members + mentions
Three new verbs that wrap the v1.6.x REST surface:

  claudemesh topic tail <name>  → live SSE consumer with N-message backfill
  claudemesh member list        → mesh roster decorated with online state
  claudemesh notification list  → recent @-mentions of you across topics

Each command auto-mints a 5-minute read-only apikey via the WS
broker and revokes on exit, so users don't manage tokens. SSE
client uses fetch + ReadableStream so the bearer stays in the
Authorization header.
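
Reading SSE over fetch means the client has to split the byte stream into events itself. A minimal sketch of that step, assuming LF line endings and `data:` fields only; `parseSseFrames` is a hypothetical name, not the CLI's parser.

```typescript
// Splits buffered text into complete events plus an unfinished remainder.
// Events are separated by a blank line; each event's data lines are joined.
function parseSseFrames(buffer: string): { events: string[]; rest: string } {
  const frames = buffer.split("\n\n");
  const rest = frames.pop() ?? "";
  const events = frames
    .map((frame) =>
      frame
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).replace(/^ /, ""))
        .join("\n"),
    )
    .filter((data) => data.length > 0);
  return { events, rest };
}
```

The caller would append each decoded chunk to `rest` and parse again, so an event split across two chunks is only emitted once it is complete.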

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 20:02:29 +01:00
Alejandro Gutiérrez
c31a591681 docs(handoff): 2026-05-02 evening — v1.6.x + v1.7.0 demo cut state
Companion to the morning handoff. Captures the 12 commits shipped
this evening, live deployment status, the CLI/UI surface gap, three
known risks (chiefly: mentions query depends on plaintext-base64
ciphertext + crashes on non-UTF8 bytes), and three branches for
the next session ranked by leverage: record the demo, wire CLI
verbs to the new endpoints, then v0.3.0 per-topic encryption.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:35:12 +01:00
Alejandro Gutiérrez
a2ab7de60a docs(marketing): refresh timeline 'what's next' for v2.0.0 + v0.3.0
Old next-block listed dashboard (shipped), slack bridge (still
v0.3.0), self-host (v0.3.0), SSO (out of scope). Replaces with
the actual roadmap horizon: daemon redesign, per-topic crypto,
self-host packaging, federation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:33:51 +01:00
Alejandro Gutiérrez
69cf39bc9f docs(blog+demo): v1.7.0 launch post + 90s demo script
Blog post "Agents and humans in the same chat" walks through what
shipped in the v1.7.0 demo cut: topics, REST gateway, real-time
SSE, mentions, notification feed, humans-as-peers. Linked from
the blog index above the original protocol post.

Demo script lays out a five-scene 90-second screen capture: two
terminal agents talking, dashboard topic list, live chat with
@-mention autocomplete, mentions feed cross-platform, close.
Production notes + distribution checklist included.

Marketing screenshots and the actual recording are still TODO.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:32:35 +01:00
Alejandro Gutiérrez
0ab2bea045 docs(roadmap): mark /v1/peers humans-as-peers as shipped
Bridge smoke test is the last remaining v1.6.x item.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:29:03 +01:00
Alejandro Gutiérrez
f4601f4d9c feat(api): humans-as-peers in /v1/peers
Recently-active apikey holders (used in the last 5 minutes) appear
in the peer list alongside WS-connected sessions. The dashboard
chat user now becomes visible to CLI peers calling list_peers,
closing the v1.6.0 humans-as-peers loop.

Presence rows take precedence when both exist; rest-only rows
get via:"rest" flag and idle status (no presence channel to
infer working/dnd from).
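
The merge rule can be sketched as follows; `PresenceRow`, `PeerEntry`, and `mergePeers` are hypothetical names, and the five-minute activity window is assumed to have been applied before this step.

```typescript
interface PresenceRow {
  memberId: string;
  status: string;
}

interface PeerEntry {
  memberId: string;
  status: string;
  via?: "rest";
}

// Presence rows win; members known only through recent apikey use are
// appended as idle REST peers.
function mergePeers(presence: PresenceRow[], recentApikeyMemberIds: string[]): PeerEntry[] {
  const connected = new Set(presence.map((p) => p.memberId));
  const merged: PeerEntry[] = presence.map(
    (p): PeerEntry => ({ memberId: p.memberId, status: p.status }),
  );
  for (const memberId of recentApikeyMemberIds) {
    if (connected.has(memberId)) continue;
    connected.add(memberId);
    merged.push({ memberId, status: "idle", via: "rest" });
  }
  return merged;
}
```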

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:28:47 +01:00
Alejandro Gutiérrez
a83133a4c6 docs(roadmap): mark v1.6.x SSE/unread + v1.7.0 sidebar/mentions/feed shipped
Updates v1.6.x and v1.7.0 sections with concrete endpoints + client
behaviour for what landed this session. Bridge smoke test and
/v1/peers humans remain open under v1.6.x.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:27:44 +01:00
Alejandro Gutiérrez
a9160a0965 feat(api+web): notification feed — recent @-mentions across meshes
Universe dashboard gets a "Recent mentions" section listing every
topic_message from the last 7 days that references the viewer via
`@<displayName>` (per-mesh — a user can carry different display
names in different meshes). One union'd OR query, capped at 20.

Each mention card links straight into the topic chat at the right
mesh. Snippet is the first 240 chars of the decoded ciphertext with
@-tokens highlighted in clay, matching the in-chat renderer.

GET /v1/notifications mirrors the same scan for api-key-authed
clients (CLI, bots) — accepts ?since=<ISO> for incremental polling.
Both paths use Postgres regex on the decoded base64 plaintext;
when per-topic encryption lands in v0.3.0 they'll move to a
notification table populated at write time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:26:02 +01:00
Alejandro Gutiérrez
00c25d9803 feat(web): client-side search filter in topic chat
A "search" toggle in the chat header opens a small input that
client-filters loaded messages by plaintext match on body or
sender name. Live tail auto-scroll suspends while a query is
active so matches stay visible when new messages arrive.

Server-side fulltext search lands when ciphertext moves to
per-topic symmetric keys in v0.3.0 — until then there's no
server index to query, and the loaded window (last 100 plus
forward stream) covers most "find that thing from earlier"
needs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:23:21 +01:00
Alejandro Gutiérrez
35a289b64a feat(web): @-mention autocomplete + highlight in topic chat
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Typing `@` in the compose box opens a dropdown of matching mesh
members fed by /v1/members. Filters live by displayName prefix
(case-insensitive); online members rank above offline; shorter
names rank higher; capped at 8 entries.
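
The ranking rules above can be sketched as a pure function. This is a minimal illustration, not the shipped component: the `Member` shape, the `rankMentionCandidates` name, and the alphabetical tie-break are assumptions.

```typescript
// Hypothetical member shape; the real /v1/members payload may carry more fields.
interface Member {
  displayName: string;
  online: boolean;
}

const MAX_SUGGESTIONS = 8; // "capped at 8 entries"

function rankMentionCandidates(members: Member[], query: string): Member[] {
  const q = query.toLowerCase();
  return members
    // Case-insensitive displayName prefix match.
    .filter((m) => m.displayName.toLowerCase().startsWith(q))
    .sort((a, b) => {
      // Online members rank above offline ones.
      if (a.online !== b.online) return a.online ? -1 : 1;
      // Shorter names rank higher.
      if (a.displayName.length !== b.displayName.length) {
        return a.displayName.length - b.displayName.length;
      }
      // Assumed tie-break: alphabetical, to keep the order stable.
      return a.displayName.localeCompare(b.displayName);
    })
    .slice(0, MAX_SUGGESTIONS);
}
```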

Keyboard: ArrowUp/Down to navigate, Enter or Tab to insert,
Escape to dismiss. Mouse hover updates the selection; mousedown
inserts (mousedown so the textarea doesn't lose focus first).

Rendered messages now highlight @mentions in clay so they're
visually distinct from plain text — same regex the autocomplete
uses, so the round trip is consistent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:21:19 +01:00
Alejandro Gutiérrez
7af61e121e fix(web): stop SSE reconnect loop on 4xx errors
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
A 4xx response from GET /v1/.../stream (revoked api key, missing
topic) used to throw inside the catch and bounce through the backoff loop
forever. Now any 4xx response terminates the loop and surfaces the
status + body in the panel error so the user sees the real cause.
5xx and network errors still reconnect.
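
A rough sketch of the reconnect decision described above. The `StreamOutcome` type, the function names, and the backoff constants (1s base, 30s cap) are assumptions for illustration; only the 4xx-stops, 5xx-and-network-retries rule comes from this commit.

```typescript
// Hypothetical description of how a stream attempt ended.
type StreamOutcome =
  | { kind: "http"; status: number }
  | { kind: "network" };

function shouldReconnect(outcome: StreamOutcome): boolean {
  // Network failures are treated as transient.
  if (outcome.kind === "network") return true;
  // Any 4xx is terminal: retrying a revoked key or a missing topic cannot succeed.
  const isClientError = outcome.status >= 400 && outcome.status < 500;
  return !isClientError;
}

// Exponential backoff with a cap; both constants are assumed values.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```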

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:19:25 +01:00
Alejandro Gutiérrez
a75483b3c2 feat(api+web): member sidebar in topic chat with live presence
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
GET /v1/members lists every non-revoked member of the api key's
mesh, decorated with online state from presence rows. Distinct from
/v1/peers (active sessions) — sidebars want roster + live dot, not
just whoever is currently connected.

Chat panel splits into a 2-column layout (>=lg) with a 180px
sidebar that polls the roster every 20s. Online members go up top
with status-coloured dots (idle=green, working=clay, dnd=fig);
offline members fade below at 50% opacity. Bots get a "bot" tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:10:26 +01:00
Alejandro Gutiérrez
541440c357 feat(web): unread badge on dashboard mesh cards
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Universe page aggregates unread topic_message rows per mesh for the
viewing user. Counts messages newer than topic_member.last_read_at
(or all messages if the viewer never opened the topic) and excludes
anything the viewer authored. One JOIN-grouped query, not N+1.

Mesh card surfaces the count as a clay-rounded badge to the left of
the role chip — matches the per-topic badge style on the mesh detail
page so unread is the same visual idiom across the dashboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:08:11 +01:00
Alejandro Gutiérrez
a80eb6fcca feat(api+web): unread counts per topic + PATCH /read mark-as-read
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
PATCH /v1/topics/:name/read upserts topic_member.last_read_at for the
api key's issuing member. The chat panel calls it on mount and on
every inbound SSE message (5s debounce so we don't hammer it).

GET /v1/topics now returns unread per topic — counts messages newer
than last_read_at and not authored by the viewer. Mesh detail page
shows a clay-rounded badge next to each topic name with the count
(99+ ceiling).
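
The counting rule can be illustrated in memory. The real count runs as a database query; the `TopicMessage` shape and both function names below are hypothetical, and hiding the badge at zero is an assumption.

```typescript
// Hypothetical in-memory view of a topic_message row.
interface TopicMessage {
  authorMemberId: string;
  createdAt: Date;
}

function countUnread(
  messages: TopicMessage[],
  viewerMemberId: string,
  lastReadAt: Date | null, // null when the viewer has never opened the topic
): number {
  return messages.filter(
    (m) =>
      // Messages the viewer authored never count as unread.
      m.authorMemberId !== viewerMemberId &&
      // Without a read marker, every remaining message is unread.
      (lastReadAt === null || m.createdAt.getTime() > lastReadAt.getTime()),
  ).length;
}

// Badge text with the 99+ ceiling; returns null when nothing is unread (assumed).
function formatUnreadBadge(count: number): string | null {
  if (count <= 0) return null;
  return count > 99 ? "99+" : String(count);
}
```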

AuthedApiKey gains issuedByMemberId so endpoints can attribute
side-effects to the minting member. Required because external api
keys aren't tied to a specific peer member; only dashboard- and
CLI-minted keys carry one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:06:01 +01:00
Alejandro Gutiérrez
7e71a61db4 feat(api+web): stream topic chat live over server-sent events
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
GET /v1/topics/:name/stream opens an SSE firehose, polled server-side
every 2s and streamed as `message` events. Forward-only — clients
hit /messages once for backfill, then live from connect-time onward.
Heartbeats every 30s keep the connection through proxies.

Web chat panel reads the stream via fetch + ReadableStream so the
bearer token stays in the Authorization header (EventSource can't
set custom headers, which would force token-in-URL leaks). Auto-
reconnect with exponential backoff. setInterval polling removed.
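
Reading SSE through fetch means the client has to split the byte stream into events itself. A minimal sketch of that parsing step, assuming heartbeats arrive as SSE comment lines (the commit does not say how they are encoded); `parseSseBuffer` and `SseEvent` are illustrative names:

```typescript
interface SseEvent {
  event: string;
  data: string;
}

// Splits accumulated text into complete SSE events plus the unfinished tail.
// Callers append each decoded chunk to `rest` and call this again.
function parseSseBuffer(buffer: string): { events: SseEvent[]; rest: string } {
  const events: SseEvent[] = [];
  const frames = buffer.replace(/\r\n/g, "\n").split("\n\n");
  // The last piece is incomplete until its blank-line terminator arrives.
  const rest = frames.pop() ?? "";
  for (const frame of frames) {
    let event = "message"; // default event type per the SSE format
    const data: string[] = [];
    for (const line of frame.split("\n")) {
      // Lines starting with ":" are comments (assumed heartbeat form).
      if (line === "" || line.startsWith(":")) continue;
      const idx = line.indexOf(":");
      const field = idx === -1 ? line : line.slice(0, idx);
      let value = idx === -1 ? "" : line.slice(idx + 1);
      if (value.startsWith(" ")) value = value.slice(1);
      if (field === "event") event = value;
      else if (field === "data") data.push(value);
    }
    // Frames without data (for example heartbeats) produce no event.
    if (data.length > 0) events.push({ event, data: data.join("\n") });
  }
  return { events, rest };
}
```

In a client, each chunk read from `response.body.getReader()` would be decoded with a `TextDecoder` and appended to the previous `rest` before parsing, with the bearer token passed in the `Authorization` header of the `fetch` call.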

Vercel maxDuration bumped to 300s on the catch-all API route so
streams aren't cut at the 10s default.

drizzle migrations/meta/ deleted — superseded by the filename-
tracked custom runner in apps/broker/src/migrate.ts (c2cd67a).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:02:38 +01:00
Alejandro Gutiérrez
d7cef45640 chore(release): claudemesh-cli@1.6.1
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Patch release on top of 1.6.0:

- Revoke-by-id-prefix bug fix (broker.revokeApiKey now returns
  structured status; CLI surfaces not_found / not_unique). Pasting
  the 8-char prefix from `apikey list` output now works as users
  expect, instead of silently no-op'ing with a misleading "✔
  revoked" message. Already deployed to broker.
- whoami falls back to local mesh-config view when no web session
  is signed in. Users who joined via invite (and never ran
  `claudemesh login`) now see their member ids and pubkey prefixes
  per mesh, instead of a "Not signed in" dead end.
- README updated: REST surface lives at claudemesh.com/api/v1/*
  (web app), NOT ic.claudemesh.com/api/v1/* (broker). Surfaced
  during CLI-only smoke test against prod when curl on the broker
  host returned 404.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 18:50:22 +01:00
Alejandro Gutiérrez
0f32529370 fix(apikey): revoke must verify a row was actually updated
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
claudemesh apikey revoke <id> reported success even when the input
didn't match any row in mesh.api_key. The CLI's `apikey list` shows
truncated 8-char prefixes; users naturally paste those; broker did
exact-id match against meshApiKey.id; UPDATE affected 0 rows; old
revokeApiKey returned void so the CLI couldn't tell. Discovered via
end-to-end CLI smoke test against prod (roadmap validation pass).

Three-part fix:

- broker.revokeApiKey now returns
  { status: "revoked"|"not_found"|"not_unique"; id?, matches? } and
  accepts either the full id or a unique prefix (>=6 chars). Prefix
  matching is bounded to the caller's mesh and only succeeds if
  exactly one row matches; ambiguous prefixes return not_unique so
  we never silently revoke the wrong key.

- New WSApiKeyRevokeResponseMessage carries the structured status
  back to the CLI. Old apikey_revoke_ok type removed before being
  released — never shipped to users. The error path is no longer
  used for not_found/not_unique cases; the unified response carries
  both outcomes.

- CLI's apiKeyRevoke now resolves with { ok, id } | { ok: false,
  code, message }. runApiKeyRevoke surfaces the code/message and
  exits non-zero on failure (NOT_FOUND for missing, INVALID_ARGS
  for ambiguous prefix).

Net effect: pasting `claudemesh apikey revoke vq0fwjdX` now actually
revokes the key whose id starts with vq0fwjdX (or fails loud if 0
or >1 keys match). Verified against prod via the new branch's CLI
binary before commit.
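
The prefix-resolution rule can be sketched without the database. This is an illustration only: `resolveRevokeTarget` is a hypothetical name, `meshKeyIds` stands in for the key ids of the caller's mesh, and treating `matches` as a count is an assumption.

```typescript
type RevokeResult =
  | { status: "revoked"; id: string }
  | { status: "not_found" }
  | { status: "not_unique"; matches: number };

const MIN_PREFIX_LENGTH = 6; // "unique prefix (>=6 chars)"

// Decides which key an input refers to; the caller would then run the UPDATE.
function resolveRevokeTarget(meshKeyIds: string[], input: string): RevokeResult {
  // A full id always wins.
  if (meshKeyIds.includes(input)) return { status: "revoked", id: input };
  // Too-short prefixes are never matched.
  if (input.length < MIN_PREFIX_LENGTH) return { status: "not_found" };
  const matches = meshKeyIds.filter((id) => id.startsWith(input));
  const [only] = matches;
  if (only === undefined) return { status: "not_found" };
  // Ambiguous prefixes fail loudly instead of revoking the wrong key.
  if (matches.length > 1) return { status: "not_unique", matches: matches.length };
  return { status: "revoked", id: only };
}
```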

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 18:39:25 +01:00
Alejandro Gutiérrez
7d1538d743 docs(roadmap): correct v3.0.0 — opt-in stays, only the form changes
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Earlier wording claimed --dangerously-load-development-channels "goes
away" at v3.0.0. That overstated what we know. Some opt-in mechanism
is always required for Claude Code to accept external runtime events
from a third-party process — that's a security invariant, not a quirk
of today's flag.

What changes at v3.0.0 is the FORM of the opt-in (stable settings
entry, native transport subscription, etc.), not its existence. The
"dangerously" / "experimental" / "development" framing is what
disappears, because the underlying API graduates from experimental
to stable. The flag itself, or its successor, lives on as a normal
config entry that claudemesh install writes once.

Public roadmap and internal spec both updated to reflect this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 18:29:59 +01:00
Alejandro Gutiérrez
dc7e0e826d docs(roadmap): refresh after v1.6.0 ships + add daemon redesign target
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Public docs/roadmap.md gets the v1.6.0 cut moved to shipped, drops the
v0.2.0-as-next section in favor of a v1.6.x patch line + v1.7.0 demo
cut + v2.0.0 daemon redesign + v3.0.0 native-channels migration target.
Items that were in v0.2.0-next migrate down: gateways and tag routing
land in v0.3.0 alongside per-topic encryption and self-hosted broker.

The detailed strategic version lives at
.artifacts/specs/2026-05-02-roadmap.md — schedule, cost estimates,
migration paths, deliberate exclusions, the load-bearing principle for
the daemon shift ("the user is the unit, not the Claude session").
The public file stays marketing-tone; the artifact captures internal
planning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 18:27:50 +01:00
Alejandro Gutiérrez
2aa21fe07c fix(api): mint owner peer-identity row at mesh creation
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Web-first owners had no mesh.member row because the broker only ever
created one on first WS hello (CLI flow). The topic chat page server
component requires that row to issue a dashboard apikey
(issuedByMemberId is a FK to mesh.member), so visiting the chat for a
web-only mesh hit notFound() on the owner's own room.

Forward fix: createMyMesh now generates a fresh ed25519 peer keypair,
inserts a mesh.member row with role=admin and dashboardUserId=userId,
and subscribes the owner to the auto-created #general topic as 'lead'.
The peer secret key is intentionally discarded — web users don't sign
anything in v0.2.0 (no DMs, base64 plaintext on topics). If the same
user later runs the CLI, the broker mints a separate member row from
its own keypair; both work for their respective surfaces.

Backfill: apps/broker/scripts/backfill-owner-members.ts walks every
non-archived mesh whose owner has no member row, generates real
ed25519 keypairs via libsodium, inserts the rows in a transaction,
and subscribes each as 'lead' on #general. Already run against prod
— 13 owner rows minted, ddtest verified end-to-end via playwriter
(send → poll → render round-trip ok).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 17:02:40 +01:00
Alejandro Gutiérrez
6de5e275fa chore(broker): comment migrate skip flag as break-glass only
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Now that the filename-tracked runner is in place and prod is bootstrapped,
BROKER_SKIP_MIGRATE=1 is no longer needed. Removed from Coolify env;
the comment is updated to reflect that the flag is a break-glass for
ops, not the steady-state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 16:45:36 +01:00
Alejandro Gutiérrez
c2cd67a885 feat(broker): filename-tracked migration runner replaces drizzle's
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
drizzle's _journal.json drifted to idx=11 while the file system had 25
.sql files; the prod drizzle.__drizzle_migrations table was further
behind with 3 rows. The runtime migrator silently skipped anything
outside the journal, so every new schema change required psql -f by
hand.

The new runner tracks applied files in mesh.__cmh_migrations
(filename PK + sha256 + applied_at). On startup it bootstraps the
tracking table inline, lists migrations/*.sql lexicographically,
filters out already-applied files, and runs the rest in transaction
order under the existing pg_advisory_lock. SHA mismatches on
already-applied files emit a warning but don't fail (cosmetic edits
are common); production drift detection lives elsewhere.
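
The selection step of such a runner can be sketched as a pure function. The names (`planMigrations`, `MigrationFile`, `AppliedRow`) are hypothetical and the database access, transactions, and advisory lock are left out; only the sort, filter, and warn-on-mismatch behavior is shown.

```typescript
import { createHash } from "node:crypto";

interface MigrationFile {
  filename: string;
  sql: string;
}

// Hypothetical view of a tracking-table row (filename PK + sha256).
interface AppliedRow {
  filename: string;
  sha256: string;
}

function sha256Hex(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

function planMigrations(
  files: MigrationFile[],
  applied: AppliedRow[],
): { pending: MigrationFile[]; warnings: string[] } {
  const appliedByName = new Map<string, string>();
  for (const row of applied) appliedByName.set(row.filename, row.sha256);

  // Plain code-unit comparison gives lexicographic filename order.
  const sorted = [...files].sort((a, b) =>
    a.filename < b.filename ? -1 : a.filename > b.filename ? 1 : 0,
  );

  const pending: MigrationFile[] = [];
  const warnings: string[] = [];
  for (const file of sorted) {
    const recorded = appliedByName.get(file.filename);
    if (recorded === undefined) {
      pending.push(file); // never applied: run it
    } else if (recorded !== sha256Hex(file.sql)) {
      // Edited after being applied: warn, do not fail.
      warnings.push(`sha256 mismatch for already-applied ${file.filename}`);
    }
  }
  return { pending, warnings };
}
```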

Bootstrap script at apps/broker/scripts/bootstrap-cmh-migrations.ts
computes file hashes and seeds the tracking table — already run
against prod with all 25 current files registered as applied. Future
deploys pick up only truly new migrations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 16:41:51 +01:00
Alejandro Gutiérrez
4ebd138a68 fix(migrations): explicit id + enum cast for 0024 backfill
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
- mesh.topic.id has no PG-side default (drizzle $defaultFn is ORM-only)
- mesh.topic_member.role needs an explicit cast to the enum type

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 16:34:28 +01:00
Alejandro Gutiérrez
2e97a0eeee feat(broker+api): every mesh ships with a default #general topic
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
The web chat surface needed a guaranteed landing room — a topic that
exists for every mesh from creation onward so the dashboard always has
somewhere to drop the user. #general is the convention; ephemeral DMs
remain ephemeral (mesh.message_queue) so agentic privacy is unchanged.

Three hooks plus a backfill:

- packages/api/src/modules/mesh/mutations.ts — createMyMesh now calls
  ensureGeneralTopic() right after the mesh insert. New helper is
  idempotent via the unique (mesh_id, name) index.
- apps/broker/src/index.ts — handleMeshCreate (CLI claudemesh new)
  inserts #general + subscribes the owner member as 'lead' in the
  same handler.
- apps/broker/src/crypto.ts — invite-claim flow auto-subscribes the
  newly minted member to #general as 'member', defensively ensuring
  the topic exists if the mesh predates this change.
- packages/db/migrations/0024_general_topic_backfill.sql — one-shot
  backfill: creates #general for every active mesh that doesn't have
  one, subscribes every active member, and marks the mesh owner as
  'lead' based on owner_user_id == member.user_id. Idempotent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 16:32:16 +01:00
Alejandro Gutiérrez
f727620d16 feat(web): topic discoverability — counts on cards + inline creation
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Two UX wins for the v0.2.0 chat surface:

- Mesh cards on /dashboard now show topic count alongside members and
  tier ("3 MEMBERS · 2 TOPICS · FREE"). Active topics render in clay,
  zero in tertiary. One aggregate query, not N+1.
- Mesh detail page replaces the CLI-hint empty state with an inline
  CreateTopicForm. Non-empty topic lists get a compact "+ new topic"
  pill in the section header. Server action validates name format
  (lowercase letters/digits/dashes, 1-50 chars), inserts via the
  unique (meshId, name) index, auto-subscribes the creator as topic
  lead, then redirects into the chat.
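
The name-format rule quoted above maps onto a single regular expression. A minimal sketch; the constant and function names are hypothetical, and the commit does not say whether leading or trailing dashes are rejected, so this version allows them.

```typescript
// Lowercase letters, digits, and dashes; 1 to 50 characters.
const TOPIC_NAME_PATTERN = /^[a-z0-9-]{1,50}$/;

function isValidTopicName(name: string): boolean {
  return TOPIC_NAME_PATTERN.test(name);
}
```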

Sidebar audit — kept platform/manage/dev structure as is. Topics are
mesh-scoped so a top-level "topics" entry would have nothing to land
on without a mesh chosen first. Discoverability lives on the mesh
cards instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 16:27:19 +01:00
Alejandro Gutiérrez
c801afd2ab style(web): topic chat panel matches mesh-panel idiom
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Audit against peer-graph-panel, live-stream-panel, state-timeline-panel,
and resource-panel showed the chat used generic shadcn Card chrome
instead of the established panel pattern. Refactor swaps the wrapper
to the canonical idiom:

- rounded-[var(--cm-radius-lg)] + border-[var(--cm-border)] + bg-[var(--cm-bg)]
- mono header strip with clay-pulse fetch dot, 11px label, 10px metadata
- mono 9px footer status bar (mesh slug · poll cadence · key expiry)
- Anthropic Mono via var(--cm-font-mono) on chrome, sans on message body
- compose textarea uses cm-bg-elevated + cm-border-hover focus state
- error line in cm-fig (#c46686) instead of generic destructive

No behavior change — only chrome. Polling, send path, decode logic
unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 16:22:22 +01:00
Alejandro Gutiérrez
b60daff886 feat(web): topic chat UI over /api/v1/* (v0.2.0)
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
New dashboard route at /dashboard/meshes/[id]/topics/[name] gives signed-in
users a thin chat client over the v0.2.0 REST surface. The mesh detail page
now lists topics with one-click links into the chat. Backend layout:

- packages/api/src/modules/mesh/api-key-auth.ts — exports
  createDashboardApiKey() that mints a 24h read+send key scoped to a single
  topic for the caller's member id. The page server component calls this on
  every render and embeds the secret in the props of the client component;
  the secret never touches sessionStorage so a tab close = key effectively
  abandoned (the row remains until expiresAt).
- apps/web/.../topics/[name]/page.tsx — server component, NextAuth gate,
  resolves the user's meshMember.id, mints the key, renders the shell.
- apps/web/src/modules/mesh/topic-chat-panel.tsx — client component, polls
  GET /v1/topics/:name/messages every 5s, sends via POST /v1/messages.
  Encoding wraps base64(plaintext) into the ciphertext field — matches the
  current broker contract until per-topic HKDF lands in v0.3.0.
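
The interim encoding is plain base64 over UTF-8 text, not encryption. A small sketch of both directions using Node's `Buffer`; the helper names are hypothetical, and the browser client may use a different base64 routine.

```typescript
import { Buffer } from "node:buffer";

// Wraps plaintext as base64 for the `ciphertext` field (no secrecy implied).
function encodeBody(plaintext: string): string {
  return Buffer.from(plaintext, "utf8").toString("base64");
}

// Reverses encodeBody when rendering a message.
function decodeBody(ciphertext: string): string {
  return Buffer.from(ciphertext, "base64").toString("utf8");
}
```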

The mesh detail page gains a Topics section with empty-state copy that
points users at the CLI verb (claudemesh topic create) for now; topic
creation from the web UI is a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 16:19:38 +01:00
Alejandro Gutiérrez
7d35c779f4 chore(release): claudemesh-cli@1.6.0
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
The v0.2.0 backend cut. Topics, API keys, REST /api/v1/*, and bridge
peers — all in one CLI release. Adds three new verb namespaces:
topic (channel pub/sub), apikey (REST client auth), bridge (cross-mesh
forwarding).

Also pins @claudemesh/sdk as a workspace devDependency so the bridge
implementation is bundled by Bun at build time and doesn't leak into
the npm tarball's runtime deps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 16:11:56 +01:00
Alejandro Gutiérrez
f08d6c9f0c docs(handoff): 2026-05-02 — state after 1.5.0 + v0.2.0 backend
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Three pending sessions ranked by leverage: ship 1.6.0 npm release, fix migration drift, build web chat UI.
2026-05-02 15:55:53 +01:00
Alejandro Gutiérrez
9dd1e401b0 feat(sdk+cli): bridge peer — forward a topic between two meshes
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
A bridge holds memberships in two meshes and relays messages on a
single topic between them. Federation-lite without a broker-to-broker
protocol.

SDK additions:
- Bridge class (start, stop, EventEmitter for forwarded/dropped/error)
- MeshClient.joinTopic / leaveTopic / createTopic methods
- Loop prevention: plaintext hop counter prefix __cmh<n>: with maxHops
  default 2; echo guard via senderPubkey == own session pubkey

CLI additions:
- claudemesh bridge run <config.yaml> long-lived process
- claudemesh bridge init prints config template
- Zero-dep YAML parser for the flat bridge config shape

The hop prefix is visible in message bodies — minor wart, fixed in
v0.3.0 by moving loop tracking into broker primitives.
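
A sketch of how a `__cmh<n>:` hop prefix can bound forwarding. The function names and the exact drop condition (drop once the counter has reached `maxHops`) are assumptions; the SDK's Bridge class may differ in detail.

```typescript
const HOP_PREFIX = /^__cmh(\d+):/;
const DEFAULT_MAX_HOPS = 2;

// Extracts the hop counter and the original text from a message body.
function readHops(body: string): { hops: number; text: string } {
  const match = HOP_PREFIX.exec(body);
  if (match === null) return { hops: 0, text: body };
  // The prefix ends at the first colon, because the pattern is anchored at 0.
  return { hops: Number(match[1] ?? "0"), text: body.slice(body.indexOf(":") + 1) };
}

// Returns the body to forward, or null when the message should be dropped.
function nextForwardedBody(body: string, maxHops = DEFAULT_MAX_HOPS): string | null {
  const { hops, text } = readHops(body);
  if (hops >= maxHops) return null; // assumed drop rule
  return `__cmh${hops + 1}:${text}`;
}
```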

SDK kept as devDependency since Bun bundles it into dist; no impact
on npm publish or runtime resolution.

Spec: .artifacts/specs/2026-05-02-v0.2.0-scope.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 13:41:50 +01:00
Alejandro Gutiérrez
9418d0ee30 fix(api): dedupe /v1/peers by member (one row per active session)
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
2026-05-02 02:27:50 +01:00
Alejandro Gutiérrez
8b5708a604 fix(api): mount /v1 router via .route, not basePath
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
2026-05-02 02:22:08 +01:00
Alejandro Gutiérrez
56d7cc1c48 feat(api): /v1 REST surface for external clients (v0.2.0)
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Bearer-auth REST endpoints for humans, scripts, bots — anyone without
browser-side ed25519. Same key model as broker WS, scoped by capability
and optional topic whitelist.

Endpoints (v0.2.0 minimum):
- POST /v1/messages
- GET  /v1/topics
- GET  /v1/topics/:name/messages (limit, before cursor)
- GET  /v1/peers

Auth: Authorization: Bearer cm_<secret>. Middleware verifies prefix +
SHA-256 hash with constant-time compare; capability + topic-scope
asserted per route. Cross-mesh isolation: every endpoint scopes to
apiKey.meshId.
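
The prefix check, SHA-256 hash, and constant-time compare can be sketched with Node's `crypto` module. Function names are hypothetical, and hashing the whole token (including the `cm_` prefix) and storing the hash as hex are assumptions about the storage format.

```typescript
import { Buffer } from "node:buffer";
import { createHash, timingSafeEqual } from "node:crypto";

const KEY_PREFIX = "cm_";

function hashToken(token: string): Buffer {
  return createHash("sha256").update(token, "utf8").digest();
}

// Pulls the token out of an Authorization header, or null if malformed.
function parseBearer(header: string | undefined): string | null {
  if (header === undefined || !header.startsWith("Bearer ")) return null;
  const token = header.slice("Bearer ".length);
  if (!token.startsWith(KEY_PREFIX) || token.length === KEY_PREFIX.length) return null;
  return token;
}

function verifyBearer(header: string | undefined, storedHashHex: string): boolean {
  const token = parseBearer(header);
  if (token === null) return false;
  const expected = Buffer.from(storedHashHex, "hex");
  const actual = hashToken(token);
  // timingSafeEqual throws on length mismatch, so check lengths first.
  if (expected.length !== actual.length) return false;
  return timingSafeEqual(actual, expected);
}
```

A real middleware would look the row up first and then assert capability and topic scope per route; that part is omitted here.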

Live delivery: writes to messageQueue + topic_message; broker's
existing pendingTimer drains and pushes to live peers. Real-time
push from REST writes is a follow-up.

Spec: .artifacts/specs/2026-05-02-v0.2.0-scope.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 02:19:12 +01:00
Alejandro Gutiérrez
13d691980a feat(broker+cli): apikey create/list/revoke verbs (v0.2.0 #71)
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Issuance flow over WS for now (REST endpoints come next slice).
Plaintext secret returned ONCE on create — never recoverable.

- broker: 3 WS handlers (apikey_create/list/revoke), wire types in
  union, audit log on issuance + revoke
- ws-client: apiKeyCreate/List/Revoke with resolver maps, response
  dispatch
- CLI: claudemesh apikey create <label> [--cap a,b] [--topic c,d]
  [--expires ISO]; list shows status, scope, last-used; revoke by id
- policy: apikey create + revoke prompt by default (issuing or
  disabling a credential is meaningful)

Default capability set is "send,read" — least privilege for unscoped
keys (admin must explicitly opt-in).

Spec: .artifacts/specs/2026-05-02-v0.2.0-scope.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 02:13:12 +01:00
Alejandro Gutiérrez
f45380d231 feat(broker): api key schema and helpers
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Foundation for v0.2.0 REST + external WS auth.

Bearer tokens are stored as SHA-256 hashes; secrets are 256-bit CSPRNG
output, so a slow KDF such as Argon2 would add cost without security gain.

Adds mesh.api_key table, migration 0023 applied manually to prod, and
helpers: createApiKey, listApiKeys, revokeApiKey, verifyApiKey.

Next slices: CLI apikey verbs and REST endpoints in apps/web router.

Spec: .artifacts/specs/2026-05-02-v0.2.0-scope.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 02:09:44 +01:00
Alejandro Gutiérrez
f71218c1e1 docs(spec): v0.2.0 — humans-in-mesh interface is REST, not browser WS
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Broker already plumbs peer_type. Real blocker is browser-side ed25519
hello-sig — sidestepped by exposing REST API for humans (and external
scripts/bots), with web chat UI as a thin REST client using dashboard
session auth. Collapses #2 (humans) and #3 (REST) into one deliverable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 02:06:29 +01:00
Alejandro Gutiérrez
f98c2de5a3 fix(broker): topic-tagged sends bypass direct-target pre-flight
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
handleSend's pre-flight check rejected #<topicId> sends because the
target wasn't matched by @group / * / pubkey, so it fell into the
"direct" branch and looked for a peer with that pubkey. Topic targets
need their own class — delivery happens via topic_member, not by
matching connected peers.
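
A minimal sketch of target classification with topics as their own class. `classifyTarget` and the class names are illustrative, not the broker's actual identifiers.

```typescript
type TargetClass = "broadcast" | "group" | "topic" | "direct";

function classifyTarget(spec: string): TargetClass {
  if (spec === "*") return "broadcast";
  if (spec.startsWith("@")) return "group";
  // Topic targets are delivered via topic_member rows, so they must not
  // fall through to the direct branch and its connected-peer lookup.
  if (spec.startsWith("#")) return "topic";
  return "direct"; // anything else is treated as a peer pubkey
}
```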

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 02:01:35 +01:00
Alejandro Gutiérrez
1afae7a507 feat(broker+cli): topics — conversation scope within a mesh (v0.2.0)
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Adds the third axis of mesh organization: mesh = trust boundary,
group = identity tag, topic = conversation scope. Topic-tagged
messages filter delivery by topic_member rows and persist to a
topic_message history table for back-scroll on reconnect.

Schema (additive):
- mesh.topic, mesh.topic_member, mesh.topic_message tables
- topic_visibility (public|private|dm) and topic_member_role
  (lead|member|observer) enums
- migration 0022_topics.sql, hand-written following project convention
  (drizzle journal has been drifting since 0011)

Broker:
- 10 helpers (createTopic, listTopics, findTopicByName, joinTopic,
  leaveTopic, topicMembers, getMemberTopicIds, appendTopicMessage,
  topicHistory, markTopicRead)
- drainForMember matches "#<topicId>" target_specs via member's
  topic memberships
- 7 WS handlers (topic_create/list/join/leave/members/history/mark_read)
  + resolveTopicId helper accepting id-or-name
- handleSend auto-persists topic-tagged messages to history

CLI:
- claudemesh topic create/list/join/leave/members/history/read
- claudemesh send "#deploys" "..." resolves topic name to id
- bundled skill teaches Claude the DM/group/topic decision matrix
- policy-classify recognizes topic create/join/leave as writes

Spec: .artifacts/specs/2026-05-02-v0.2.0-scope.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 01:53:42 +01:00
Alejandro Gutiérrez
b4f457fceb feat(cli): 1.5.0 — CLI-first architecture, tool-less MCP, policy engine
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
CLI becomes the API; MCP becomes a tool-less push-pipe. Bundle -42%
(250 KB → 146 KB) after stripping ~1700 lines of dead tool handlers.

- Tool-less MCP: tools/list returns []. Inbound peer messages still
  arrive as experimental.claude/channel notifications mid-turn.
- Resource-noun-verb CLI: peer list, message send, memory recall, etc.
  Legacy flat verbs (peers, send, remember) remain as aliases.
- Bundled claudemesh skill auto-installed by `claudemesh install` —
  sole CLI-discoverability surface for Claude.
- Unix-socket bridge: CLI invocations dial the push-pipe's warm WS
  (~220 ms warm vs ~600 ms cold).
- --mesh <slug> flag: connect a session to multiple meshes.
- Policy engine: every broker-touching verb runs through a YAML gate
  at ~/.claudemesh/policy.yaml (auto-created). Destructive verbs
  prompt; non-TTY auto-denies. Audit log at ~/.claudemesh/audit.log.
- --approval-mode plan|read-only|write|yolo + --policy <path>.

Spec: .artifacts/specs/2026-05-02-architecture-north-star.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 01:18:19 +01:00
Alejandro Gutiérrez
ff551ccf3d chore(cli): release 1.0.1
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Ships disconnect/kick/ban three-tier peer removal, a friendly
revoked-hello error with a 'contact mesh owner' message, and WS close
codes 4001 (kicked) and 4002 (banned) that stop CLI auto-reconnect.

latest + alpha dist-tags both → 1.0.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 09:59:52 +01:00
Alejandro Gutiérrez
b49e9a9b61 feat(cli+broker): three-tier peer removal: disconnect, kick, ban
Broker (apps/broker/src/index.ts)
- Unified disconnect/kick handler uses close code 1000 for disconnect
  (CLI auto-reconnects) vs 4001 for kick (CLI exits, no reconnect).
- Ban now closes with code 4002.
- Hello handler: revoked members get a specific 'revoked' error with a
  'Contact the mesh owner to rejoin' message, then ws.close(4002).
  Previously, banned users saw the generic 'unauthorized' error.
- list_bans handler returns { name, pubkey, revokedAt } for each
  revoked member.

CLI (apps/cli)
- ws-client: close codes 4001 and 4002 set .closed = true and stash
  .terminalClose so callers can surface a friendly message instead of
  the low-level 'ws terminal close' error. Revoked error in hello is
  also captured as a terminal close.
- withMesh catches terminalClose and prints:
  4001 → 'Kicked from this mesh. Run claudemesh to rejoin.'
  4002 → the broker's 'Contact the mesh owner to rejoin.' message
- kick.ts now exports runDisconnect + runKick with clear hints:
  'disconnect' → 'They will auto-reconnect within seconds.'
  'kick'       → 'They can rejoin anytime by running claudemesh.'
- cli.ts adds 'disconnect' dispatch; HELP updated.

Semantics:
  disconnect: session reset, no DB state, auto-reconnects
  kick      : session ends, no DB state, user must manually rejoin
  ban       : session ends + revokedAt set, cannot rejoin until unban
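
The three tiers above come down to how the client reacts to a WebSocket close code. A minimal sketch, assuming a hypothetical `classifyClose` helper (not the real ws-client API); treating every code other than 4001/4002 as reconnectable is also an assumption:

```typescript
// Hypothetical helper: the names here are illustrative, not the real ws-client API.
type CloseAction =
  | { reconnect: true }
  | { reconnect: false; reason: "kicked" | "banned"; message: string };

// Codes from the commit message: 1000 = disconnect, 4001 = kick, 4002 = ban.
const CLOSE_KICKED = 4001;
const CLOSE_BANNED = 4002;

function classifyClose(code: number, brokerMessage?: string): CloseAction {
  switch (code) {
    case CLOSE_KICKED:
      return {
        reconnect: false,
        reason: "kicked",
        message: "Kicked from this mesh. Run claudemesh to rejoin.",
      };
    case CLOSE_BANNED:
      return {
        reconnect: false,
        reason: "banned",
        // Prefer the broker's own wording when it sent one.
        message: brokerMessage ?? "Contact the mesh owner to rejoin.",
      };
    default:
      // 1000 (disconnect) and, by assumption, any other code: reconnect.
      return { reconnect: true };
  }
}
```

A caller would stop its reconnect loop whenever `reconnect` is false and print `message` instead of a low-level socket error.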

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 09:55:05 +01:00
Alejandro Gutiérrez
163e1be70a chore(cli): release 1.0.0 — out of alpha
Promote CLI from 1.0.0-alpha.42 to stable 1.0.0 so
`npm i -g claudemesh-cli` installs the current release without
needing the @alpha dist-tag.

Both dist-tags now point at 1.0.0 — `@alpha` kept as an alias for
continuity so existing docs, install scripts, and scheduled upgrade
commands keep working.

upgrade + doctor commands updated to prefer the `latest` dist-tag
(falling back to `alpha`) and to suggest `npm i -g claudemesh-cli`
without the @alpha suffix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 02:06:11 +01:00
Alejandro Gutiérrez
3d2ab0cb4b fix(cli): production-grade peer disambiguation (alpha.42)
Three bugs compounding when multiple peers share a display name:

1. list_peers (MCP + CLI) truncated pubkey to 12 hex chars with an
   ellipsis. A truncated pubkey cannot be used as a routing key, so
   the caller had no way to disambiguate visually.

2. send_message required the full 64-hex pubkey and refused prefix
   input, forcing callers to rely on --json output to get a full key.

3. Name-based resolution returned the first exact match without
   filtering the caller's own session — so "send to <my-own-name>"
   would bounce against the broker's self-send guard when another
   session of the same user was the intended target.

Fixes:
- list_peers now prints 16-char pubkey prefix labelled "pubkey: …"
  (MCP) and appends it to CLI output
- send_message accepts any 8–64 hex-char prefix and resolves against
  live peer lists across joined meshes; unique match routes, multi-
  match returns a disambiguation error listing each candidate's
  displayName + pubkey + cwd
- Name matches now skip the caller's own session pubkey; multiple
  same-named matches fail loudly with a copy-pasteable pubkey
  disambiguation hint instead of silently picking one
- Full 64-char pubkeys without a live match still queue at the
  broker (preserves offline-delivery semantics)
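
The prefix rules in the fix list can be sketched as a pure function. This is an illustration under assumptions: `Peer`, `Resolution`, and `resolveByPubkeyPrefix` are hypothetical names, and the live peer list is assumed to hold one entry per distinct pubkey.

```typescript
// Hypothetical types: the real CLI structures may differ.
interface Peer {
  displayName: string;
  pubkey: string; // 64 hex chars
  cwd: string;
}

type Resolution =
  | { kind: "route"; pubkey: string }
  | { kind: "queue"; pubkey: string }
  | { kind: "ambiguous"; candidates: Peer[] }
  | { kind: "unknown" };

// Any 8 to 64 hex-char prefix is accepted, as described above.
const HEX_PREFIX = /^[0-9a-f]{8,64}$/i;

function resolveByPubkeyPrefix(input: string, livePeers: Peer[]): Resolution {
  if (!HEX_PREFIX.test(input)) return { kind: "unknown" };
  const needle = input.toLowerCase();
  const matches = livePeers.filter((p) => p.pubkey.toLowerCase().startsWith(needle));
  if (matches.length === 1) return { kind: "route", pubkey: matches[0].pubkey };
  if (matches.length > 1) return { kind: "ambiguous", candidates: matches };
  // No live match: only a full 64-char key can still be queued at the broker.
  return needle.length === 64 ? { kind: "queue", pubkey: needle } : { kind: "unknown" };
}
```

The `ambiguous` branch carries the candidates so the caller can print each displayName, pubkey, and cwd in the disambiguation error.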

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 22:56:41 +01:00
Alejandro Gutiérrez
0664180a54 feat(web): universe dashboard — meshes + incoming invitations
New /dashboard landing that surfaces meshes and invitations-to-you
in one view. Replaces the simple mesh grid at /dashboard (preserved
at /dashboard/legacy).

Backend additions:
- GET /api/my/invites/incoming — pending_invite rows addressed to
  the authed user's email, joined with invite for role + expiry and
  user/mesh for display. Unaccepted + unrevoked + unexpired only.
- DELETE /api/my/invites/incoming/:id — dismiss a pending invite
  (revokes the pending_invite row only; underlying invite code stays
  valid so the inviter can re-send).

Web additions (all under apps/web/src/modules/dashboard/universe/):
- welcome.tsx — editorial serif header with mesh + invite counts
- invitations.tsx — client card with Accept (→ /i/:code claim flow)
  and optimistic Decline
- meshes-grid.tsx — hero card + compact grid, linked to mesh detail
- reveal.tsx — fade-up motion matching marketing _reveal.tsx

Styling uses the existing claudemesh design tokens (--cm-clay,
--cm-bg-elevated, Anthropic Sans/Serif/Mono) — nothing redefined.

Onboarding redirect (0 meshes → /meshes/new?onboarding=1) preserved,
now gated on 0 invitations too so users with pending invites still
land on the dashboard.

Sidebar icon switched to Atom for the "universe" concept.

Standalone prototype saved at prototypes/live-dashboard.html for
reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 21:31:15 +01:00
Alejandro Gutiérrez
2abf86d540 fix(cli): short-circuit join <slug> when already a member (alpha.41)
If the arg isn't a URL and matches a mesh already in local config,
print a hint pointing at `launch --mesh <slug>` instead of treating
the slug as an invite code. Avoids the 501 invite_v2_disabled confusion
when users try to "enter" a mesh they already own.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 20:11:46 +01:00
Alejandro Gutiérrez
a5347cebc0 fix(cli): silence "session restored" log for one-shot commands (alpha.40)
Add quiet opt to BrokerClient; withMesh passes quiet:true so commands
like peers/state/info/remind no longer print per-mesh restore chatter.
Long-running paths (launch, MCP) stay verbose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 19:54:53 +01:00
Alejandro Gutiérrez
622ea569ad fix(cli): filter self from claudemesh peers output (alpha.39)
The peers command opens its own WS to each mesh, which briefly appears
as a hostname-PID peer. Filter it out by session pubkey.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 19:50:18 +01:00
Alejandro Gutiérrez
d7f381a1e8 fix(cli): surface broker error messages in ban/unban (alpha.38 fix)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 08:57:08 +01:00
Alejandro Gutiérrez
3ceac68e67 feat(cli+broker): kick, ban, unban, bans commands
Broker WS handlers:
- kick: disconnect peer(s) by name, --stale duration, or --all.
  Authz: owner or admin only. Closes WS + marks presence disconnected.
- ban: kick + set revokedAt on mesh.member. Hello already rejects
  revoked members, so ban is instant and permanent until unban.
- unban: clear revokedAt. Peer can rejoin with their existing keypair.
- list_bans: return all revoked members for a mesh.

Session-id dedup (previous commit): handleHello disconnects ghost
presences with matching (meshId, sessionId) before inserting the new
one. Eliminates duplicate entries after broker restarts.

CLI (alpha.37):
- claudemesh kick <peer|--stale 30m|--all>
- claudemesh ban/unban <peer>
- claudemesh bans [--json]
- Uses new sendAndWait() on ws-client for request-response pattern
  over WS (generic _reqId resolver).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 08:37:38 +01:00
Alejandro Gutiérrez
5ddb11b2d5 fix(broker): dedup presences by session_id on hello
When a client reconnects with the same session_id before the 90s
stale sweeper runs, the old ghost presence stays in the connections
map. Result: duplicate entries in list_peers for the same Claude
Code instance.

Now: handleHello iterates connections for matching (meshId, sessionId),
closes the old WS, deletes from map, marks disconnected in DB.
One session_id = one presence, always.
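
A rough sketch of the dedup step, assuming a hypothetical in-memory `Conn` shape and a `dedupOnHello` helper; the DB update that marks the old presence disconnected is left out:

```typescript
// Hypothetical shape: the real PeerConn carries more state.
interface Conn {
  meshId: string;
  sessionId: string;
  close: () => void; // closes the underlying WS
}

// Evicts every existing presence with the same (meshId, sessionId),
// then registers the new one. Returns how many ghosts were removed.
function dedupOnHello(
  connections: Map<string, Conn>,
  presenceId: string,
  incoming: Conn,
): number {
  let evicted = 0;
  connections.forEach((conn, id) => {
    if (conn.meshId === incoming.meshId && conn.sessionId === incoming.sessionId) {
      conn.close();
      connections.delete(id); // deleting during Map iteration is safe in JS
      evicted++;
    }
  });
  connections.set(presenceId, incoming);
  return evicted;
}
```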

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 21:40:25 +01:00
Alejandro Gutiérrez
2edbfce7d3 fix(broker): add BROKER_SKIP_MIGRATE=1 escape hatch for manual-migrated DBs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 12:59:28 +01:00
Alejandro Gutiérrez
9f3a82dd63 fix(broker): use sql.unsafe for SET lock_timeout in migrate
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 12:55:04 +01:00
Alejandro Gutiérrez
05729ad8a4 feat(ga): close remaining GA blockers (backcompat, HA prep, tests, docs)
Backwards compat shim (task 27)
- requireCliAuth() falls back to body.user_id when BROKER_LEGACY_AUTH=1
  and no bearer present. Sets Deprecation + Warning headers + bumps a
  broker_legacy_auth_hits_total metric so operators can watch the
  legacy traffic drain to 0 before removing the shim.
- All handlers parse body BEFORE requireCliAuth so the fallback can
  read user_id out of it.

HA readiness (task 29)
- .artifacts/specs/2026-04-15-broker-ha-statelessness-audit.md
  documents every in-memory symbol and rollout plan (phase 0-4).
- packaging/docker-compose.ha-local.yml spins up 2 broker replicas
  behind Traefik sticky sessions for local smoke testing.
- apps/broker/src/audit.ts now wraps writes in a transaction that
  takes pg_advisory_xact_lock(meshId) and re-reads the tail hash
  inside the txn. Concurrent broker replicas can no longer fork the
  audit chain.

Deploy gate (task 30)
- /health stays permissive (200 even on transient DB blips) so
  Docker doesn't kill the container on a glitch.
- New /health/ready checks DB + optional EXPECTED_MIGRATION pin,
  returns 503 if either fails. External deploy gate can poll this
  and refuse to promote a broken deploy.

Metrics dashboard (task 32)
- packaging/grafana/claudemesh-broker.json: ready-to-import Grafana
  dashboard covering active conns, queue depth, routed/rejected
  rates, grant drops, legacy-auth hits, conn rejects.

Tests (task 28)
- audit-canonical.test.ts (4 tests) pins canonical JSON semantics.
- grants-enforcement.test.ts (6 tests) covers the member-then-
  session-pubkey lookup with default/explicit/blocked branches.

Docs (task 34)
- docs/env-vars.md catalogues every env var the broker + CLI read.

Crypto review prep (task 35)
- .artifacts/specs/2026-04-15-crypto-review-packet.md: reviewer
  brief, threat model, scope, test coverage list, deliverables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 23:51:28 +01:00
Alejandro Gutiérrez
49e0af0fc0 chore(cli): bump to alpha.36 with security fixes
2026-04-15 19:18:57 +01:00
Alejandro Gutiérrez
2be5e9dccb fix(security): resolve all 17 codex findings — auth, grants, crypto, ops
Critical: broker HTTP auth via cli_session bearer token on all /cli/*;
file download requires auth+membership; v2 claim gated; duplicate
claimInviteV2Core removed; grant enforcement tries member then
session pubkey; audit hash uses canonical sorted-keys JSON.

High: rate limit args fixed (burst 10, 60/min) + both buckets swept;
BROKER_ENCRYPTION_KEY fail-fast in prod; migrate uses pg_try +
lock_timeout; hello validates sessionPubkey hex; blocked DMs rejected
pre-queue; watch timers cleaned on disconnect.

Medium: inbound pushes serialized; reconnect jitter + timer guard;
hardcoded URLs through env; v2 claim path configurable.

Low: WSHelloMessage optional protocolVersion+capabilities.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 19:18:25 +01:00
Alejandro Gutiérrez
1a7a059e75 fix: queue TTL + per-member send rate limit + size cap + no-recipient reject + ack.error
Broker (all need redeploy):
- sweepOrphanMessages: DELETE undelivered message_queue rows older
  than 7 days; hourly sweep. Stops unbounded growth when a sender
  typos a name (queued forever, never claimed).
- Per-member send rate limit: TokenBucket(60/min, burst 10) keyed on
  memberId so reconnecting can't bypass. Surfaces as queued=false,
  error='rate_limit: ...'.
- Pre-flight size cap: reject at handleSend if nonce+ciphertext+
  targetSpec exceeds env.MAX_MESSAGE_BYTES with a clear error
  instead of silent WSS frame-level kill.
- No-recipient reject: for direct sends, check any matching peer
  is connected BEFORE queueing. Kills the self-send silent drop
  (sending to your own pubkey when you only have one session
  connected) and typo-to-offline-peer silent drops.
- WSAckMessage.error field added for structured failure reasons.
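
The per-member limiter can be pictured as a standard token bucket. A minimal sketch, assuming a continuous refill and a hypothetical `allowSend` wrapper; the broker's actual `TokenBucket` signature is not shown in this commit:

```typescript
// Sketch of a token bucket with continuous refill (an assumption about
// how the broker's TokenBucket behaves).
class TokenBucket {
  private tokens: number;
  private last: number;
  private ratePerMin: number;
  private burst: number;

  constructor(ratePerMin: number, burst: number, now: number = Date.now()) {
    this.ratePerMin = ratePerMin;
    this.burst = burst;
    this.tokens = burst; // start full so a fresh member can burst
    this.last = now;
  }

  tryTake(now: number = Date.now()): boolean {
    const elapsedMs = Math.max(0, now - this.last);
    const refill = (elapsedMs / 60_000) * this.ratePerMin;
    this.tokens = Math.min(this.burst, this.tokens + refill);
    this.last = now;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// Keyed on memberId, so reconnecting with a new socket does not reset the budget.
const buckets = new Map<string, TokenBucket>();

function allowSend(memberId: string, now: number = Date.now()): boolean {
  let bucket = buckets.get(memberId);
  if (!bucket) {
    bucket = new TokenBucket(60, 10, now); // 60/min, burst 10, as in the commit
    buckets.set(memberId, bucket);
  }
  return bucket.tryTake(now);
}
```

A `false` result would surface to the sender as `queued=false` with a `rate_limit` error.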

CLI:
- ws-client ack handler reads msg.queued and msg.error; surfaces
  rate_limit / too_large / no_recipient to callers instead of
  returning ok:true with a dummy messageId.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 14:44:09 +01:00
Alejandro Gutiérrez
39fe296aaa fix(cli): decrypt falls back to member secret key when session key fails
When Alice's session-A encrypts a direct message to Bob (target = Bob's
stable member pubkey) and Bob's session-B receives it, Bob has BOTH an
ephemeral session secret key and the member secret key. The old code
only tried session_sk, then silently failed with '⚠ message from
<sender> failed to decrypt' even though the message was valid —
just encrypted to the member key.

Now: try the session key first and fall back to the member key on null.
This matches the sender's freedom to encrypt to either key.
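
The fallback order is easy to sketch if the actual decryption primitive is passed in. Here `Decrypt` and `decryptWithFallback` are hypothetical names, and the injected function stands in for whatever libsodium call the CLI really uses:

```typescript
// Stand-in for the real decryption call: returns null when the key does not fit.
type Decrypt = (ciphertext: Uint8Array, secretKey: Uint8Array) => Uint8Array | null;

function decryptWithFallback(
  ciphertext: Uint8Array,
  sessionSk: Uint8Array,
  memberSk: Uint8Array,
  decrypt: Decrypt,
): Uint8Array | null {
  // Session key first; only on null do we retry with the stable member key.
  return decrypt(ciphertext, sessionSk) ?? decrypt(ciphertext, memberSk);
}
```

A null from both attempts is the only case that should still produce the "failed to decrypt" warning.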

Repros when: user opens multiple Claude Code sessions (all use the
same member key but each generates its own session key), and one
session sends to another by display-name resolution which returns
the member pubkey.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 14:37:36 +01:00
Alejandro Gutiérrez
3dfab0f792 fix(broker): don't broadcast peer_joined/peer_left/peer_returned to same-pubkey sessions
When a user opens multiple Claude Code instances on one laptop they
all share the same memberPubkey (one identity, one config.json). The
broker was broadcasting each Claude Code start/stop to every OTHER
session of the same user — showing as 'peer agutierrez left / joined'
spam in every active claude terminal.

Now: skip broadcast to presences whose memberPubkey equals the joining
or leaving presence's memberPubkey. Other actual peers on the mesh
still see the event.
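
In sketch form, the filter is a single comparison inside the broadcast loop. `Presence` and `broadcastPresenceEvent` are hypothetical names used only for illustration:

```typescript
// Hypothetical presence shape for illustration.
interface Presence {
  memberPubkey: string;
  send: (event: string) => void;
}

// Sends a peer_joined / peer_left style event to everyone except
// sessions that share the subject's member identity.
function broadcastPresenceEvent(all: Presence[], subject: Presence, event: string): number {
  let delivered = 0;
  for (const p of all) {
    if (p.memberPubkey === subject.memberPubkey) continue; // same user, other terminal
    p.send(event);
    delivered++;
  }
  return delivered;
}
```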

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 14:28:57 +01:00
Alejandro Gutiérrez
6f4a44e281 fix(db): realign audit_log schema — actor_member_id, prev_hash, hash chain
The broker code moved to an append-only hash-chained audit log
(actor_member_id / actor_display_name / payload / prev_hash / hash
with integer GENERATED ALWAYS AS IDENTITY id) but prod still had
the original 0000-migration shape (actor_peer_id / metadata /
text id). Every peer_joined / peer_left event logged 'audit log
insert failed' — no audit trail captured at all.

Applied manually on prod already; committing the migration so
future environments converge.
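
For readers unfamiliar with the pattern, a hash-chained log with canonical (sorted-key) JSON can be sketched as follows. SHA-256, the field names, and the empty-string genesis hash are assumptions; the commit does not state them:

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [k: string]: Json };

// Canonical form: object keys sorted, no whitespace, so equal payloads hash equally.
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map((v) => canonicalJson(v)).join(",") + "]";
  const keys = Object.keys(value).sort();
  return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonicalJson(value[k])).join(",") + "}";
}

interface AuditRow {
  payload: Json;
  prevHash: string;
  hash: string;
}

// Assumed hash function: SHA-256 over prevHash followed by the canonical payload.
function hashRow(prevHash: string, payload: Json): string {
  return createHash("sha256").update(prevHash).update(canonicalJson(payload)).digest("hex");
}

function appendAudit(chain: AuditRow[], payload: Json): AuditRow {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].hash : "";
  const row: AuditRow = { payload, prevHash, hash: hashRow(prevHash, payload) };
  chain.push(row);
  return row;
}

// Recomputes every link; any edited payload or reordered row breaks the chain.
function verifyChain(chain: AuditRow[]): boolean {
  let prev = "";
  for (const row of chain) {
    if (row.prevHash !== prev || row.hash !== hashRow(prev, row.payload)) return false;
    prev = row.hash;
  }
  return true;
}
```

In the broker the tail read and the insert would also have to happen inside one transaction; this sketch covers only the hashing.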

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 14:26:48 +01:00
Alejandro Gutiérrez
4bc3c045ae fix(cli): send_message hard-fails on unknown peer name; dedup-annotate list_peers
Two bugs that combined to make Claude's peer-send look successful even
when the recipient didn't exist:

1. resolveClient fell through to 'let the broker try' when a single
   mesh was joined and the name didn't match any peer. The broker
   queued the message against the literal unknown string, matched no
   peer in fan-out, but returned a messageId — so the CLI reported
   '✓ lezg → msgId' for a peer that was never there.

   Now: refuse to send, list the known peer names.

2. list_peers showed the same pubkey multiple times with different
   display_names (one per live session) without hinting that they
   were the same member — so Claude treated them as distinct people.

   Now: annotate with '[shares key with N other session(s)]' so the
   caller understands one pubkey = one identity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 14:10:47 +01:00
Alejandro Gutiérrez
94e914f476 fix(broker): reject mesh create without valid pubkey
Older CLIs sometimes called POST /cli/mesh/create without a pubkey,
and the broker stored the string 'pending' as peer_pubkey on the
owner's mesh.member row. Every subsequent hello from the real CLI
failed the membership lookup silently, leaving the connection in
'reconnecting' forever with no useful log line.

Now: validate pubkey is 64 hex chars before creating the owner
member row. Existing 'pending' rows on prod were patched manually.
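
The check itself is small. A sketch, assuming a hypothetical `isValidPubkey` guard and that upper-case hex is also accepted (the commit only says "64 hex chars"):

```typescript
// 64 hex characters; accepting upper-case is an assumption.
const PUBKEY_HEX = /^[0-9a-f]{64}$/i;

function isValidPubkey(value: unknown): value is string {
  return typeof value === "string" && PUBKEY_HEX.test(value);
}
```

With this guard in front of the insert, a placeholder such as 'pending' is rejected before an owner row is ever created.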

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 12:50:11 +01:00
Alejandro Gutiérrez
1bb702e481 chore(cli): bump to alpha.32
2026-04-15 08:54:26 +01:00
269 changed files with 42946 additions and 40613 deletions

@@ -0,0 +1,158 @@
HACKATHON — THE DAY-ONE "WOW" SCENARIO
======================================
Date: 2026-04-19
Follow-up to: 2026-04-19-hackathon-proposal.txt
THE SHORT ANSWER
----------------
Yes — it's exactly as simple as run one command, join a mesh, and
immediately inherit your team's tools, skills, MCPs, and context.
No config copying. No API key juggling. No "let me send you my
.mcp.json". Zero setup.
That's the thing that has never existed before: Claude Code sessions
that share capability at the speed of a chat invite.
THE 60-SECOND STORY (rough, but close to real)
----------------------------------------------
Picture Ana at the hackathon. Her teammate David has been working on
their project for two days — wired up a Linear MCP, a Figma MCP, a
custom "brand-asset" skill, shared project context, a few API keys
in the team vault. She shows up at the table, opens her laptop, has
never touched the project.
1. David runs one command:
     $ claudemesh share ana@team.com
   Ana gets a link: https://claudemesh.com/i/5SLJ7F95
2. Ana runs one command:
     $ claudemesh https://claudemesh.com/i/5SLJ7F95
   (No separate install: the CLI self-installs if missing.
   Takes under 10 seconds.)
3. Claude Code opens automatically, connected to the mesh. No
   further setup.
4. Ana types into Claude Code:
     "what are we building?"
   Claude (HER local Claude, on HER laptop) answers with the
   team's current brief, pulled from the mesh's shared context
   that David set earlier. It knows the repo, the deadline, the
   stack, who's on the team, what's done, what's open.
5. Ana says:
     "pull the latest tickets from Linear"
   Her Claude uses the Linear MCP. Ana never installed it. She has
   no Linear API key on her machine. The MCP was deployed to the
   mesh by David on day one; the moment Ana joined, it became
   callable from her Claude Code as if it were local. Ciphertext
   routes through the broker; tool calls execute on the peer that
   owns the integration.
6. She asks:
     "generate launch-day assets in our brand"
   Her Claude invokes the /brand-asset skill that David authored
   two days ago. Skills are portable in the mesh: calling one
   remotely is indistinguishable from having it installed locally.
7. She hits a wall on a type error. Instead of pinging David in
   Slack, she types:
     "ask the mesh"
   The question fans out to every teammate's Claude. Thirty seconds
   later she has three answers with three different repo contexts,
   synthesized into one reply, with attributions. This is the
   fan-out demo from the main proposal.
TOTAL ELAPSED TIME: under 90 seconds from "I don't have anything
set up" to "my Claude knows our project and can use my team's tools."
WHY THIS IS THE HEADLINE
------------------------
Every other developer tool in 2026 still demands:
- install this package
- set these env vars
- copy this config
- get an API key approved
- restart your editor
- re-index your repo
claudemesh replaces all of that with a single click on an invite
link. The mesh IS the onboarding.
The shorter way to say it: every Claude Code session you onboard,
you onboard your team's entire AI toolchain in one shot.
WHAT THE USER ACTUALLY SEES
---------------------------
Terminal (Ana):
$ claudemesh https://claudemesh.com/i/5SLJ7F95
✔ Joined "launch-team" as Ana
4 peers online: David, Nedas, Lug-Nut, Juan
12 tools available from the mesh
3 shared skills
context: "launch-day assets — due Friday"
✔ Launching Claude Code…
Claude Code:
> connected to mesh: launch-team
> inherited: 12 tools, 3 skills, shared context, 14 memories
Dashboard (claudemesh.com):
Ana's node appears on the live topology. Packets animate along
edges as her first message flies. David's screen gets a presence
ping: "Ana joined — ready".
That's the wow. Not a pitch deck, not a feature matrix — a literal
before-and-after experience that takes under two minutes and looks
impossible to anyone who's ever onboarded a new developer onto a
project the old way.
WHAT WE'RE BUILDING THIS WEEK TO MAKE THIS REAL
-----------------------------------------------
Most of the primitives exist. The hackathon week is the glue:
• Tool inheritance — a peer's deployed MCPs become callable from
other peers as if installed locally. Today: partially shipped.
Hackathon goal: make it automatic, zero-config, visible in the
universe dashboard.
• Skill sharing — same story, for skills (already has an alpha).
Hackathon goal: polish, auto-discovery, one-line invoke.
• Context inheritance — joining a mesh automatically loads the
mesh's shared context into the new Claude's session so it
"knows what we're working on" from minute one. Today: state
exists, auto-pull on join does not.
• "Ask the mesh" fan-out — the broadcast + synthesize primitive
from the main proposal.
• The onboarding CLI flow — make the invite-link-to-Claude-ready
path bulletproof and under 10 seconds on a fresh machine.
THE DEMO ARTIFACT
-----------------
A single 90-second screencast. Split screen: Ana's terminal on the
left, the claudemesh.com live universe dashboard on the right.
She joins. Her node appears on the mesh. She asks a question. Tools
fire. Skills execute. Answer comes back. No text overlays needed —
the UX itself is the argument.
That's the video that goes at the top of claudemesh.com on demo
day.

@@ -0,0 +1,147 @@
HACKATHON PROPOSAL — CLAUDEMESH
===============================
Date: 2026-04-19
Author: Alejandro Gutiérrez
THE SHORT ANSWER
----------------
I'm going with claudemesh — not the Flexicar voice assistant, not a fresh
blend. claudemesh is already a real product with a real backbone (CLI,
MCP server, broker, E2E crypto, web dashboard), and what it still lacks
is the one thing a hackathon is perfect for: a single headline capability
that makes its existence obvious in ten seconds.
So I'm using the week to push claudemesh from "useful infra for people
who already get it" → "demo that makes someone say, oh, that's what this
is for."
WHAT'S ALREADY THERE (SO YOU KNOW WHAT I'M BUILDING ON, NOT FROM ZERO)
----------------------------------------------------------------------
- CLI + MCP server (claudemesh-cli), 40+ alpha releases shipped
- Broker on wss://ic.claudemesh.com/ws with libsodium E2E encryption —
broker routes ciphertext, never reads messages
- Shared primitives: direct messages, group broadcasts, shared state,
memory, file sharing, skill sharing, MCP deployment to the mesh
- Telegram bridge with a Haiku-4.5 AI layer so you can talk to the mesh
from your phone (shipped this week)
- Web dashboard with per-mesh live panel (peers, envelope stream,
audit chain)
- Brand-new "Universe" dashboard landing (shipped today) — meshes +
incoming invitations in one view
WHAT I'M BUILDING DURING THE HACKATHON
---------------------------------------
Headline: AGENT-TO-AGENT DELEGATION WITH LIVE STREAMING
Right now a Claude Code session can SEND a message to another session
in the mesh. That's primitive-level. What's missing — and what makes
the whole thing click — is DELEGATION: one Claude hands off a task to
another, waits for the real answer (not a "sure, I'll do that later"
acknowledgement), and composes it into its own response, with the
user watching the whole thing happen live.
Why this is the right hackathon target:
- It requires NO new physical infrastructure. The broker, the crypto,
the transport are all there.
- It's the unlock that turns claudemesh from "chat for Claudes" into
"distributed cognition layer for Claude Code."
- It's demoable in 60 seconds and the value is self-evident.
DAY-BY-DAY PLAN (REALISTIC, NOT ASPIRATIONAL)
---------------------------------------------
DAY 1 — Protocol + primitive
• Design `mesh_delegate(to, task, timeout)` MCP tool — one call from
the local Claude, returns the remote Claude's answer synchronously
from the caller's perspective
• Broker-side: new message type `delegation_request` / `_response`
with correlation IDs so responses route back to the originator
• Remote Claude receives delegation → runs in a sandboxed subcontext
→ emits structured response (text + artifacts)
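
One way the correlation-ID idea from Day 1 could look on the caller's side is sketched below. Everything here is hypothetical (the `Delegator` class, the message fields, the ID format); it only illustrates how a response is matched back to the waiting call:

```typescript
// Hypothetical wire shapes for the proposed message types.
interface DelegationRequest {
  type: "delegation_request";
  correlationId: string;
  to: string;
  task: string;
}

interface DelegationResponse {
  type: "delegation_response";
  correlationId: string;
  text: string;
}

type Pending = { resolve: (text: string) => void; timer: ReturnType<typeof setTimeout> };

class Delegator {
  private pending = new Map<string, Pending>();
  private nextId = 0;
  private send: (msg: DelegationRequest) => void;

  constructor(send: (msg: DelegationRequest) => void) {
    this.send = send;
  }

  // Looks synchronous to the caller: the promise settles when the matching
  // response arrives, or rejects after timeoutMs.
  delegate(to: string, task: string, timeoutMs: number): Promise<string> {
    const correlationId = `d-${++this.nextId}`;
    return new Promise<string>((resolve, reject) => {
      const timer = setTimeout(() => {
        this.pending.delete(correlationId);
        reject(new Error("delegation timed out"));
      }, timeoutMs);
      this.pending.set(correlationId, { resolve, timer });
      this.send({ type: "delegation_request", correlationId, to, task });
    });
  }

  // Called for every inbound response; unknown or late IDs are ignored.
  onResponse(msg: DelegationResponse): void {
    const entry = this.pending.get(msg.correlationId);
    if (!entry) return;
    clearTimeout(entry.timer);
    this.pending.delete(msg.correlationId);
    entry.resolve(msg.text);
  }

  pendingCount(): number {
    return this.pending.size;
  }
}
```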
DAY 2 — Live streaming of remote work
• While remote Claude works, stream its tool calls + thinking back
through the mesh as `delegation_progress` events
• Caller's dashboard lights up with "Nedas is reading src/auth.ts…"
in real time
• The "wow" moment: watching another Claude think, from your terminal
DAY 3 — Multi-peer fan-out
• `mesh_ask_all(question)` — broadcast a question to @group, gather
answers in parallel, synthesize
• This is the Slack-killer: one question, three Claudes with
different repo contexts, one merged answer
• Add to the universe dashboard: inline "ask your mesh" prompt
DAY 4 — Voice control (stretch, uses my Pipecat/Cartesia background)
• Phone → Telegram voice note → AI layer already in place →
mesh_delegate or mesh_ask_all fires
• "Hey mesh, which of you is closest to the payments bug?" — the
mesh answers with the Claude that has the most recent auth.ts edits
• Ties the Flexicar voice work into claudemesh without fragmenting
the proposal
DAY 5 — Live schematic on the dashboard
• Build the animated mesh-topology view from my prototype
(SVG nodes + packets in flight) using REAL delegation traffic
• When a delegation fires, you literally see a packet fly from one
node to another on the dashboard
• This is the screenshot/video artifact for the demo day
DAY 6 — Demo recording + narrative
• 90-second video: single person, three terminals, one dashboard.
Asks a question in terminal 1, two other Claudes answer, dashboard
animates, final answer synthesized
• Landing page update with the video above the fold
• Changelog post
DAY 7 — Buffer, polish, publish alpha
WHAT MAKES THIS TAILORED FOR A HACKATHON (NOT JUST ROADMAP WORK)
-----------------------------------------------------------------
1. Visible. Three terminals + one dashboard = immediately legible.
2. Ambitious. Going from "pub/sub messaging" to "synchronous distributed
delegation" is a real protocol-level step up — it's the difference
between email and RPC.
3. Native to the event. Hackathon judges are the exact target user:
people with multiple Claude Code sessions open, wanting them to
coordinate. Dogfood-able during the week itself.
4. Leverages what I already built. I'm not rebuilding the transport,
the crypto, the auth, the dashboard shell — just adding the one
missing primitive that ties it all together.
5. Stretch goal (voice) reuses my Flexicar/Pipecat expertise without
making the proposal disjointed; it's one coherent pitch with a
multimodal cherry on top if time allows.
WHAT I'M EXPLICITLY NOT DOING
------------------------------
- Not rewriting the Flexicar assistant as a mesh app. It's a great
product, wrong scope for one week.
- Not building federation (mesh-to-mesh). Powerful but too abstract
to demo cleanly.
- Not building a self-hosted broker. Infra work, no hackathon payoff.
- Not building a mobile app. Telegram already covers the "mesh from
anywhere" story.
THE PITCH IN ONE SENTENCE
-------------------------
By the end of the week, one Claude will delegate a real coding task to
another Claude running on a different machine, get a real answer back,
and the whole thing will happen in sixty seconds with the mesh
topology animating live on claudemesh.com.
That's the demo. Everything else in the week is in service of making
those sixty seconds watertight.


@@ -0,0 +1,551 @@
# `claudemesh daemon` — Final Spec v10
> **Round 10.** v9 was reviewed by codex (round 9). The two-layer ID
> model (5/5) and §4.1 wording (4/5) were closed cleanly, but rate-limit
> placement created a worse failure: putting B1 limiter before dedupe
> lookup means **idempotent retries burn rate-limit budget** and a
> daemon retry of an already-committed message during a saturated
> window can get rate-limit-rejected → daemon marks `dead` → split-brain
> (broker has the message, daemon believes failure).
>
> **v10 fixes**:
>
> 1. New **Phase B0 dedupe fast-path** — read dedupe table BEFORE rate
> limit. Existing id (match or mismatch) returns immediately without
> touching rate-limit budget.
> 2. **Idempotent rate-limiter** keyed by `(mesh_id, client_message_id,
> window_bucket)` so even if two same-id requests race past B0, only
> the first one consumes budget.
> 3. **§4.11 stale text** — rate-limit moved out of B2 failure mode.
> 4. **§4.7.2 pseudocode reordered** to show B0 → B1 → BEGIN → claim →
> B2 → B3.
>
> **Intent §0 unchanged from v2.** v10 only revises §4.
---
## 0. Intent — unchanged, see v2 §0
## 1. Process model — unchanged
## 2. Identity — unchanged from v5 §2
## 3. IPC surface — unchanged from v4 §3
---
## 4. Delivery contract — `aborted` clarified, broker phasing, SQLite locking
### 4.1 The contract (precise — v9, two-layer ID model)
> **Two-layer ID rules** (NEW v9 — codex r8):
>
> - **Daemon-layer**: a `client_message_id` is **daemon-consumed** iff an
> outbox row exists for it. Daemon-mediated callers can never reuse a
> daemon-consumed id, regardless of whether the broker ever saw it.
> The daemon's outbox is the single authority for "this id was issued
> by my caller against this daemon."
> - **Broker-layer**: a `client_message_id` is **broker-consumed** iff a
> dedupe row exists for `(mesh_id, client_message_id)` in
> `mesh.client_message_dedupe`. Direct broker callers (none in
> v0.9.0; reserved for future SDK paths that bypass the daemon) can
> reuse a broker-non-consumed id freely.
> - In v0.9.0 there are no daemon-bypass clients, so for practical
> purposes "daemon-consumed" is the operative rule.
>
> **Local guarantee**: each successful `POST /v1/send` returns a stable
> `client_message_id`. The send is durably persisted to `outbox.db`
> before the response returns. The daemon enforces request-fingerprint
> idempotency at the IPC layer (§4.5.1).
>
> **Local audit guarantee**: a `client_message_id` once written to
> `outbox.db` is **never released** (daemon-layer rule). Operator
> recovery via `requeue` always mints a fresh id; the old row stays in
> `aborted` for audit. There is no daemon-side path to free a used id.
>
> **Broker guarantee** (v9 — tightened): a dedupe row exists iff the
> broker accept transaction **committed** (Phase B3 reached). Phase B1
> rejections never insert dedupe rows. Phase B2 rejections roll the
> transaction back, so any partial dedupe row is unwound. Direct
> broker callers retrying after B1/B2 rejection see no dedupe row and
> may reuse the id.
>
> **Atomicity guarantee**: same as v8 §4.1.
>
> **End-to-end guarantee**: at-least-once.
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
### 4.3 Broker schema — unchanged from v6 §4.3
### 4.4 Request fingerprint canonical form — unchanged from v6 §4.4
### 4.5 Daemon-local idempotency at the IPC layer (v8 — `aborted` added, SQLite locking)
#### 4.5.1 IPC accept algorithm (v8)
On `POST /v1/send`:
1. Validate request envelope (auth, schema, size limits, destination
resolvable). Failures here return `4xx` immediately. **No outbox row
is written; the `client_message_id` is not consumed.**
2. Compute `request_fingerprint` (§4.4).
3. Open a SQLite transaction with `BEGIN IMMEDIATE` (v8 — codex r7) so
a concurrent IPC accept on the same id serializes against this one.
`BEGIN IMMEDIATE` acquires the RESERVED lock at transaction start, so
no other connection can begin a write transaction on the same database
until this one ends (readers are not blocked). SQLite has no row-level
locks and does not support `SELECT FOR UPDATE`.
4. `SELECT id, request_fingerprint, status, broker_message_id,
last_error FROM outbox WHERE client_message_id = ?`.
5. Apply the lookup table below. For the "(no row)" case, INSERT the
new row inside the same transaction.
6. COMMIT.
| Existing row state | Fingerprint match? | Daemon response |
|---|---|---|
| (no row) | — | INSERT new outbox row in `pending`; return `202 accepted, queued` |
| `pending` | match | Return `202 accepted, queued`. No mutation |
| `pending` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_pending_fingerprint_mismatch"`. No mutation |
| `inflight` | match | Return `202 accepted, inflight`. No mutation |
| `inflight` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_inflight_fingerprint_mismatch"` |
| `done` | match | Return `200 ok, duplicate: true, broker_message_id, history_id`. No broker call |
| `done` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_done_fingerprint_mismatch", broker_message_id` |
| `dead` | match | Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_match", reason: "<last_error>"`. Same id never auto-retried |
| `dead` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_mismatch"` |
| **`aborted`** (NEW v8) | **match** | Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_match"`. The id was retired by operator action; never reusable |
| **`aborted`** (NEW v8) | **mismatch** | Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_mismatch"` |
**Rule (v8 — codex r7)**: every IPC `409` carries the daemon's
`request_fingerprint` (8-byte hex prefix) so callers can debug
client/server canonical-form drift. **Every state in the table returns
something deterministic, including `aborted`.** A `client_message_id`
written to `outbox.db` is permanently bound to that row's lifecycle —
the only "free" state is "no row exists".
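The accept steps and lookup table above can be sketched against an in-memory SQLite database. This is a minimal illustration, not the daemon's implementation: the trimmed `outbox` schema, the `open_outbox`, `ipc_accept` and `lookup` names, the SHA-256 stand-in for the §4.4 canonical fingerprint, and the response body shapes are all assumptions made for the example. It also assumes the fingerprint echoed on a `409` is the one the daemon computed for the incoming request, and it omits `history_id`.

```python
import hashlib
import sqlite3


def open_outbox(path=":memory:"):
    # isolation_level=None: the sqlite3 module issues no implicit BEGIN,
    # so the explicit BEGIN IMMEDIATE below controls the transaction.
    db = sqlite3.connect(path, isolation_level=None)
    db.execute("""CREATE TABLE IF NOT EXISTS outbox (
        id TEXT PRIMARY KEY,
        client_message_id TEXT NOT NULL UNIQUE,
        request_fingerprint BLOB NOT NULL,
        payload BLOB NOT NULL,
        status TEXT NOT NULL,
        last_error TEXT,
        broker_message_id TEXT)""")
    return db


def ipc_accept(db, client_message_id, payload):
    # Hypothetical stand-in; the real canonical form is defined in §4.4.
    fingerprint = hashlib.sha256(payload).digest()
    db.execute("BEGIN IMMEDIATE")  # step 3: serialize concurrent accepts
    try:
        row = db.execute(  # step 4
            "SELECT request_fingerprint, status, broker_message_id, last_error "
            "FROM outbox WHERE client_message_id = ?",
            (client_message_id,)).fetchone()
        if row is None:  # step 5, "(no row)" case
            db.execute(
                "INSERT INTO outbox (id, client_message_id, request_fingerprint,"
                " payload, status) VALUES (?, ?, ?, ?, 'pending')",
                ("row-" + client_message_id, client_message_id,
                 fingerprint, payload))
            result = (202, {"status": "queued"})
        else:
            result = lookup(row, fingerprint)
        db.execute("COMMIT")  # step 6
        return result
    except Exception:
        db.execute("ROLLBACK")
        raise


def lookup(row, fingerprint):
    stored_fp, status, broker_message_id, last_error = row
    match = stored_fp == fingerprint
    if match and status == "pending":
        return 202, {"status": "queued"}
    if match and status == "inflight":
        return 202, {"status": "inflight"}
    if match and status == "done":
        return 200, {"duplicate": True, "broker_message_id": broker_message_id}
    # Every remaining cell of the table is a 409.
    body = {
        "error": "idempotency_key_reused",
        "conflict": f"outbox_{status}_fingerprint_{'match' if match else 'mismatch'}",
        "request_fingerprint": fingerprint[:8].hex(),  # 8-byte hex prefix
    }
    if status == "dead" and match:
        body["reason"] = last_error
    if status == "done":  # only reachable on mismatch
        body["broker_message_id"] = broker_message_id
    return 409, body
```

Because only three cells of the table are non-`409`, the sketch handles those first and derives the `conflict` string for everything else from the row's status and the match result.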
#### 4.5.2 Outbox table — fingerprint required
```sql
CREATE TABLE outbox (
id TEXT PRIMARY KEY,
client_message_id TEXT NOT NULL UNIQUE,
request_fingerprint BLOB NOT NULL, -- 32 bytes
payload BLOB NOT NULL,
enqueued_at INTEGER NOT NULL,
attempts INTEGER DEFAULT 0,
next_attempt_at INTEGER NOT NULL,
status TEXT CHECK(status IN
('pending','inflight','done','dead','aborted')),
last_error TEXT,
delivered_at INTEGER,
broker_message_id TEXT,
aborted_at INTEGER, -- NEW v8
aborted_by TEXT, -- NEW v8: operator/auto
superseded_by TEXT -- NEW v8: id of the requeue successor row, if any
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
CREATE INDEX outbox_aborted ON outbox(status, aborted_at) WHERE status = 'aborted';
```
`aborted_at`, `aborted_by`, `superseded_by` give operators a clear
audit trail. `superseded_by` lets `outbox inspect` show the chain when
a row was requeued multiple times.
`request_fingerprint` is computed once at IPC accept time and frozen
forever for the row's lifecycle. Daemon never recomputes from
`payload`.
### 4.6 Rejected-request semantics — two-layer rules + rate-limit moved to B1 (v9 — codex r8)
> **Two-layer rule (v9)**: a `client_message_id` is **daemon-consumed**
> iff an outbox row exists for it; **broker-consumed** iff a dedupe row
> exists. Daemon-mediated callers see daemon-layer authority (the only
> path in v0.9.0). Pre-validation failures at any layer consume nothing
> at that layer. The two layers are independent: a daemon-consumed id
> may or may not be broker-consumed (depending on whether the send
> reached B3); a daemon-non-consumed id can never be broker-consumed
> (no outbox row ⇒ no broker call from the daemon).
#### 4.6.1 Daemon-side rejection phasing (v9)
| Phase | When daemon rejects | Outbox row? | Daemon-consumed? | Same daemon caller may reuse id? |
|---|---|---|---|---|
| **A. IPC validation** (auth, schema, size, destination resolvable) | Before §4.5.1 step 3 | No | No | Yes — id never written locally |
| **B. Outbox stored, broker network/transient failure** | After IPC accept, broker `5xx` or timeout | `pending` → retried | Yes | N/A — daemon owns retries |
| **C. Outbox stored, broker permanent rejection** | Broker returns `4xx` after IPC accept | `dead` | Yes | No — rotate via `requeue` |
| **D. Operator retirement** | Operator runs `requeue` on `dead` or `pending` row | `aborted` (audit) + new row with fresh id | Yes (still consumed) | Old id NEVER reusable; new id is fresh |
The "daemon-consumed?" column is the daemon-layer authority. It does
not depend on whether the broker ever saw the request — phase C above
shows the broker has not committed a dedupe row, but the daemon still
holds the id in `dead` state.
#### 4.6.2 Broker-side rejection phasing (v10 — B0 dedupe fast-path added)
The broker validates in **four phases** relative to dedupe-row
insertion. Phase B0 (NEW v10 — codex r9) makes idempotent retries
free of rate-limit budget so a daemon retry of an already-committed
message can never get rate-limit-rejected:
| Phase | Validation | Side effects | Result for direct broker callers |
|---|---|---|---|
| **B0. Dedupe fast-path** (NEW v10) | Read `mesh.client_message_dedupe` for `(mesh_id, client_message_id)`. **Does not touch rate-limit budget.** | None | If row exists & fingerprint matches → `200 duplicate` with original `broker_message_id`. If row exists & fingerprint mismatches → `409 idempotency_key_reused`. If row absent → continue to B1 |
| **B1. Pre-dedupe-claim** (atomic, external) | Auth (mesh membership), schema, size, mesh exists, member exists, destination kind valid, payload bytes ≤ `max_payload.inline_bytes`, **rate limit not exceeded** (idempotent external limiter — see §4.6.4) | None | `4xx` returned. No dedupe row, no broker-consumed id. Caller may retry with same id once condition clears |
| **B2. Post-dedupe-claim** (in-tx) | Conditions that require the accept transaction to be in progress: destination_ref existence (topic exists, member subscribed, etc.) | INSERT into dedupe rolled back | `4xx` returned, transaction rolled back, no dedupe row remains. Caller may retry with same id |
| **B3. Accepted** | All side effects commit atomically | Dedupe row, message row, history row, delivery_queue rows, mention_index rows | `201` returned with `broker_message_id`. Id is broker-consumed |
**Why B0 is correct (codex r9)**: idempotent retries should never be
distinguishable from "the call worked" from the caller's perspective.
A retry that the broker can resolve to the original accept must do so
before any operation that could fail (rate limit, capacity check,
auth-quota, etc.). B0 is a non-mutating read outside any transaction, so
running it on the strictly-new-id path costs only one indexed PK lookup
against the dedupe table.
**Race semantics for new ids (v10 — codex r9)**: B0 is a non-locking
read; two same-id requests can both miss B0 simultaneously. Without
care, both would consume rate-limit budget. v10 requires the limiter
to be **idempotent over `(mesh_id, client_message_id, window)`**:
budget is consumed at most once per id-window pair regardless of
concurrent retries (§4.6.4). The "second" retry that misses B0 still
sees its `INCR` short-circuited by the limiter and proceeds to B2/B3
without budget impact. Whichever request wins the dedupe `INSERT`
commits; the loser sees fingerprint match (rollback to `200
duplicate`) or mismatch (`409`).
**Daemon-mediated callers**: in v0.9.0 the daemon is the only B-phase
caller. Daemon-mediated callers see only the daemon-layer rules
(§4.6.1). The broker's "may retry with same id" wording in the table
above applies to direct broker callers only (none in v0.9.0; reserved
for future SDK paths).
**Critical guarantee (v9 — tightened from v8)**: a dedupe row exists
**iff the broker accept transaction committed (B3)**. There is no
broker code path where a permanent 4xx leaves a dedupe row behind.
If the broker decides post-commit that an accepted message is invalid
(async content-policy job, async moderation, etc.), that's NOT a
permanent rejection — it's a follow-up event that operates on the
`broker_message_id`, not on the dedupe key.
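The phase ordering in §4.6.2 can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not broker code: the `Broker` class, the single fixed window, and the concrete `429` and `404` status codes are hypothetical (the spec only says `4xx`), and every B1 check other than the rate limit is omitted.

```python
class Broker:
    """Toy model of B0 -> B1 -> B2 -> B3 for a single rate-limit window."""

    def __init__(self, limit_per_window, topics):
        self.limit = limit_per_window
        self.topics = set(topics)
        self.dedupe = {}      # (mesh_id, client_message_id) -> (broker_message_id, fingerprint)
        self.counted = set()  # ids that already consumed budget in this window
        self.used = 0
        self.next_id = 0

    def accept(self, mesh_id, client_message_id, fingerprint, topic):
        key = (mesh_id, client_message_id)

        # B0: dedupe fast-path, before any rate-limit accounting.
        row = self.dedupe.get(key)
        if row is not None:
            broker_message_id, stored_fp = row
            if stored_fp == fingerprint:
                return 200, broker_message_id
            return 409, "idempotency_key_reused"

        # B1: idempotent rate limit; budget is consumed at most once per id.
        if key not in self.counted:
            if self.used >= self.limit:
                return 429, "rate_limited"      # hypothetical status code
            self.used += 1
            self.counted.add(key)

        # B2: in-transaction validation; a failure leaves no dedupe row.
        if topic not in self.topics:
            return 404, "topic_not_found"       # hypothetical status code

        # B3: commit; only now does the id become broker-consumed.
        self.next_id += 1
        broker_message_id = f"msg-{self.next_id}"
        self.dedupe[key] = (broker_message_id, fingerprint)
        return 201, broker_message_id
```

With `limit_per_window=1`, a first send commits and saturates the window, a second id is rejected at B1, and a retry of the first id still resolves at B0 to the original `broker_message_id`, which is the case that produced the v9 split-brain.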
#### 4.6.4 Rate limiter — idempotent over `(mesh_id, client_message_id, window_bucket)` (v10 — codex r9)
Codex r9 caught: v9's plain `INCR` limiter would let idempotent
retries burn budget. A daemon retry of an already-committed message
that gets rate-limit-rejected creates a split-brain (broker has it,
daemon marks dead). v10 makes the limiter idempotent over
`(mesh_id, client_message_id, window_bucket)` so retries are free.
- **Authority**: same external Redis-style limiter used elsewhere in
claudemesh, but called via an idempotency-aware wrapper:
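```lua
-- consume_budget(mesh_id, client_message_id, window_bucket) → "ok" | "denied"
-- Redis Lua script (atomic, single round-trip); WATCH/MULTI is the alternative.
-- KEYS[1] = "rl:"  .. mesh_id .. ":" .. window_bucket
-- KEYS[2] = "rli:" .. mesh_id .. ":" .. client_message_id .. ":" .. window_bucket
-- ARGV[1] = limit_per_window, ARGV[2] = window_seconds
if redis.call("EXISTS", KEYS[2]) == 1 then
  return "ok"                                  -- already counted
end
if redis.call("INCR", KEYS[1]) > tonumber(ARGV[1]) then
  redis.call("DECR", KEYS[1])                  -- refund this attempt
  return "denied"
end
-- short TTL for repeat-detection
redis.call("SET", KEYS[2], 1, "EX", 2 * tonumber(ARGV[2]))
return "ok"
```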
The `idem` key TTL is small (2× window) to keep memory bounded;
outside the window, retries that arrive late count as new traffic
(which is correct — the original `INCR` row has rolled out of the
window too).
- **Race semantics**: two same-id requests racing past B0 both arrive
at `consume_budget`. Whichever Redis call lands first runs the
conditional `INCR`+`SET idem`; the second sees `EXISTS idem` and
returns `ok` without `INCR`. Each id-window pair consumes at most
one budget unit. Implemented in Lua (single round-trip, atomic).
- **B2 rollback non-refund**: if the limiter accepts but the in-tx
Phase B2 then rejects (e.g. topic not found), the consumed budget
is **not** refunded. Counter
`cm_broker_rate_limit_consumed_then_b2_rejected_total` exposes the
delta. Refunding would require a coordinated rollback across the DB
tx and the limiter, which we don't want to build.
- **Async counters**: `mesh.rate_limit_counter` (or any DB-resident
view of "messages-per-mesh-per-window") is **non-authoritative** —
metrics/telemetry only, rebuilt from the authoritative limiter and
from message-history. Used for dashboards, not for accept decisions.
This split — idempotent atomic external limiter for enforcement,
async DB counters for telemetry — keeps idempotent retries free of
budget impact, prevents the v9 split-brain, and stays inside the
existing claudemesh rate-limit infrastructure.
**Why B0 still matters even with the idempotent limiter**: the
idempotent limiter prevents budget over-consumption, but it does NOT
make the limiter itself the dedupe authority. B0 is a non-mutating DB
read that resolves committed dedupe rows (the truth) without any
limiter or DB-write side effects at all. For the common retry case
(daemon timeout after broker B3 commit), B0 returns `200 duplicate`
without ever calling the limiter. B0 + idempotent limiter together
mean: idempotent retries are O(1 PK lookup), free, and never visible
to rate-limit accounting.
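The `consume_budget` semantics in this section can be modelled in a few lines of Python. This is a single-threaded sketch for reasoning about the counting rules only: the `IdempotentLimiter` class and its injectable `clock` are hypothetical, and the in-process dictionaries stand in for Redis keys, so it says nothing about the atomicity the Lua script provides.

```python
import time


class IdempotentLimiter:
    """In-memory model of the limiter: at most one budget unit per id-window pair."""

    def __init__(self, limit_per_window, window_seconds, clock=time.monotonic):
        self.limit = limit_per_window
        self.window_seconds = window_seconds
        self.clock = clock
        self.counters = {}  # "rl:..." key  -> count for that window bucket
        self.idem = {}      # "rli:..." key -> expiry timestamp of the marker

    def consume_budget(self, mesh_id, client_message_id, window_bucket):
        now = self.clock()
        key = f"rl:{mesh_id}:{window_bucket}"
        idem = f"rli:{mesh_id}:{client_message_id}:{window_bucket}"

        expiry = self.idem.get(idem)
        if expiry is not None and expiry > now:
            return "ok"  # already counted; retry is free

        count = self.counters.get(key, 0) + 1
        if count > self.limit:
            # Equivalent to INCR followed by DECR: the counter is unchanged.
            return "denied"

        self.counters[key] = count
        self.idem[idem] = now + 2 * self.window_seconds  # short TTL marker
        return "ok"
```

A denied request leaves no `idem` marker, so the same id is re-evaluated on its next attempt; an accepted one is free to retry until the marker expires.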
#### 4.6.3 Operator recovery via `requeue` (corrected v8)
To unstick a `dead` or `pending`-but-stuck row, operator runs:
```
claudemesh daemon outbox requeue --id <outbox_row_id>
[--new-client-id <id> | --auto]
[--patch-payload <path>]
```
This atomically (single SQLite transaction):
1. Sets the existing row's status to `aborted`, with `aborted_at = now` and
`aborted_by = "operator"`. The row is **never deleted**; the audit trail
is permanent.
2. Mints a fresh `client_message_id` (caller-supplied via `--new-client-id`
or auto-ulid'd via `--auto`).
3. Inserts a new outbox row in `pending` with the fresh id and the same
payload (or patched payload if `--patch-payload` was given).
4. Sets `superseded_by = <new_row_id>` on the old row so
`outbox inspect <old_id>` displays the chain.
**The old `client_message_id` is permanently dead** — `outbox.db` still
holds it via the `aborted` row's `UNIQUE` constraint, and any caller
re-using it gets `409 outbox_aborted_*` per §4.5.1.
If the broker ever accepted the old id (it reached B3), the broker's
dedupe row is also permanent: a duplicate send to the broker with the
old id returns `409` on fingerprint mismatch, or the original
`broker_message_id` on a matching fingerprint. The daemon-side
`aborted` row and the broker-side dedupe row are independent records of
"this id was used"; neither releases the id.
This is the resolution to v7's contradiction: there is **no path** for
an id to "become free again." If the operator wants to retry the
payload, they get a new id. The old id stays buried.
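The four requeue steps can be sketched as one SQLite transaction. This is an illustration under assumptions, not the CLI's code: the `requeue` function, the trimmed `SCHEMA`, and the use of `uuid4` in place of the ULID that `--auto` mints are all hypothetical.

```python
import sqlite3
import time
import uuid

# Trimmed, hypothetical version of the §4.5.2 table: only columns the sketch touches.
SCHEMA = """CREATE TABLE outbox (
    id TEXT PRIMARY KEY,
    client_message_id TEXT NOT NULL UNIQUE,
    payload BLOB NOT NULL,
    status TEXT NOT NULL,
    aborted_at INTEGER,
    aborted_by TEXT,
    superseded_by TEXT)"""


def requeue(db, outbox_row_id, new_client_id=None, patched_payload=None):
    """Retire the old row and insert its successor atomically."""
    db.execute("BEGIN IMMEDIATE")
    try:
        row = db.execute(
            "SELECT payload, status FROM outbox WHERE id = ?",
            (outbox_row_id,)).fetchone()
        if row is None or row[1] not in ("dead", "pending"):
            raise ValueError("row is not requeueable")
        payload = patched_payload if patched_payload is not None else row[0]
        new_row_id = uuid.uuid4().hex
        # uuid4 stands in for the auto-minted ULID.
        fresh_id = new_client_id or uuid.uuid4().hex

        # Steps 1 and 4: retire the old row and link it to its successor.
        db.execute(
            "UPDATE outbox SET status = 'aborted', aborted_at = ?, "
            "aborted_by = 'operator', superseded_by = ? WHERE id = ?",
            (int(time.time()), new_row_id, outbox_row_id))
        # Steps 2 and 3: new row, fresh id, same (or patched) payload.
        # The UNIQUE constraint rejects any attempt to reuse a retired id.
        db.execute(
            "INSERT INTO outbox (id, client_message_id, payload, status) "
            "VALUES (?, ?, ?, 'pending')",
            (new_row_id, fresh_id, payload))
        db.execute("COMMIT")
        return new_row_id, fresh_id
    except Exception:
        if db.in_transaction:
            db.execute("ROLLBACK")
        raise
```

If the caller-supplied id collides with any existing row, including the one being retired, the `INSERT` fails and the rollback also undoes the `UPDATE`, so the old row is left exactly as it was.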
### 4.7 Broker atomicity contract — side-effect classification (v9)
#### 4.7.1 Side effects (v9 — rate limit moved to B1 external)
Every successful broker accept atomically commits these durable
state changes in **one transaction**:
| Effect | Table | In-tx? | Why |
|---|---|---|---|
| Dedupe record | `mesh.client_message_dedupe` | **Yes** | Idempotency authority |
| Message body | `mesh.topic_message` / `mesh.message_queue` | **Yes** | Authoritative store |
| History row | `mesh.message_history` | **Yes** | Replay log; lost-on-rollback would break ordered replay |
| Fan-out work | `mesh.delivery_queue` | **Yes** | Each recipient must see exactly the messages that committed |
| Mention index entries | `mesh.mention_index` | **Yes** | Mention-query reads must match committed messages |
**Outside the transaction** — non-authoritative or rebuildable, with
explicit rationale per item:
| Effect | Where | Why outside |
|---|---|---|
| WS push to live subscribers | Async after COMMIT | Live notifications are best-effort; receivers re-fetch from history on reconnect |
| Webhook fan-out | Async via `delivery_queue` workers | Off-band; consumes committed `delivery_queue` rows |
| Rate-limit **counters** (telemetry only) | Async, eventually consistent | Authoritative limiter is the external Redis-style INCR in B1 (§4.6.4); the DB counter is rebuilt for dashboards, not consulted for accept |
| Audit log entries | Async append-only stream | Audit log can be rebuilt from message history; in-tx writes hurt p99 |
| Search/FTS index updates | Async via outbox-pattern worker | Index can be rebuilt from authoritative tables |
| Metrics | Prometheus, pull-based | Always non-authoritative |
If any in-transaction insert fails, the transaction rolls back
completely. The broker returns `5xx` to the daemon, and the daemon
retries. No partial state.
The async side effects are driven off the in-transaction
`delivery_queue` and `message_history` rows, so they cannot get ahead
of committed state — only lag behind.
#### 4.7.2 Pseudocode — corrected and final (v10)
```sql
-- =========================================================================
-- Phase B0: dedupe fast-path (NEW v10 — codex r9). Non-mutating.
-- Resolves idempotent retries WITHOUT touching rate-limit budget.
-- =========================================================================
SELECT broker_message_id, request_fingerprint, history_available, first_seen_at
FROM mesh.client_message_dedupe
WHERE mesh_id = $mesh_id AND client_message_id = $client_id;
-- If row exists:
-- fingerprint match → return 200 duplicate (broker_message_id, history_available). Done.
-- fingerprint mismatch → return 409 idempotency_key_reused. Done.
-- Otherwise: row absent → continue.
-- =========================================================================
-- Phase B1: schema/auth/size validation + idempotent rate-limit consume.
-- All before any DB transaction. Failures here return 4xx without opening a tx.
-- =========================================================================
-- consume_budget(mesh_id, client_id, window_bucket) — Lua/Redis (§4.6.4).
-- Idempotent over (mesh_id, client_id, window_bucket): retries within window
-- consume at most once.
-- =========================================================================
-- Phase B2 + B3: in-transaction claim and side effects.
-- =========================================================================
BEGIN;
INSERT INTO mesh.client_message_dedupe
(mesh_id, client_message_id, broker_message_id, request_fingerprint,
destination_kind, destination_ref, expires_at)
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
$dest_kind, $dest_ref, $expires_at)
ON CONFLICT (mesh_id, client_message_id) DO NOTHING;
-- Inspect the row that's actually there now (ours or a racer's).
SELECT broker_message_id, request_fingerprint, destination_kind,
destination_ref, history_available, first_seen_at
FROM mesh.client_message_dedupe
WHERE mesh_id = $mesh_id AND client_message_id = $client_id
FOR SHARE;
-- Branch:
-- row.broker_message_id == $msg_id → we won the race; continue to side effects.
-- row.broker_message_id != $msg_id → racer won. Compare fingerprints:
-- fingerprint match → ROLLBACK; return 200 duplicate (the rare race-vs-B0 case
-- where two concurrent first-time-but-same-id requests
-- both missed B0 and one beat the other to the INSERT).
-- fingerprint mismatch → ROLLBACK; return 409 idempotency_key_reused.
-- Phase B2 validation: destination_ref existence (topic exists,
-- member subscribed, etc.). Rate limit is NOT here — it was checked
-- in B1 (§4.6.4) before this transaction opened.
-- If B2 fails → ROLLBACK; return 4xx (no dedupe row remains).
-- Step 4: insert all in-tx side effects (§4.7.1).
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
VALUES ($msg_id, $mesh_id, $client_id, ...);
INSERT INTO mesh.message_history (broker_message_id, mesh_id, ...)
VALUES ($msg_id, $mesh_id, ...);
INSERT INTO mesh.delivery_queue (broker_message_id, recipient_pubkey, ...)
SELECT $msg_id, member_pubkey, ...
FROM mesh.topic_subscription
WHERE topic = $dest_ref AND mesh_id = $mesh_id;
INSERT INTO mesh.mention_index (broker_message_id, mentioned_pubkey, ...)
SELECT $msg_id, mention_pubkey, ...
FROM unnest($mention_list);
COMMIT;
-- After COMMIT, async workers consume delivery_queue and update
-- search indexes, audit logs, rate-limit counters, etc.
```
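The claim-then-inspect pattern and the B2 rollback can be exercised end to end with SQLite standing in for Postgres. This is a sketch under stated assumptions: the table names are unqualified and trimmed, `FOR SHARE` has no SQLite equivalent and is dropped, only one in-tx side effect is modelled, and the `accept_in_tx` name and the `404` status code are hypothetical.

```python
import sqlite3


def make_broker_db():
    # isolation_level=None so the explicit BEGIN/COMMIT/ROLLBACK below apply.
    db = sqlite3.connect(":memory:", isolation_level=None)
    db.execute("""CREATE TABLE client_message_dedupe (
        mesh_id TEXT NOT NULL, client_message_id TEXT NOT NULL,
        broker_message_id TEXT NOT NULL, request_fingerprint BLOB NOT NULL,
        PRIMARY KEY (mesh_id, client_message_id))""")
    db.execute("CREATE TABLE topic (mesh_id TEXT, name TEXT)")
    db.execute("CREATE TABLE topic_message (id TEXT PRIMARY KEY, mesh_id TEXT, body BLOB)")
    return db


def accept_in_tx(db, mesh_id, client_id, msg_id, fingerprint, topic, body):
    db.execute("BEGIN")
    try:
        # Claim the id; a racer's existing row makes this a no-op.
        db.execute(
            "INSERT INTO client_message_dedupe VALUES (?, ?, ?, ?) "
            "ON CONFLICT (mesh_id, client_message_id) DO NOTHING",
            (mesh_id, client_id, msg_id, fingerprint))
        # Inspect the row that is actually there now (ours or a racer's).
        winner_id, winner_fp = db.execute(
            "SELECT broker_message_id, request_fingerprint "
            "FROM client_message_dedupe "
            "WHERE mesh_id = ? AND client_message_id = ?",
            (mesh_id, client_id)).fetchone()
        if winner_id != msg_id:
            db.execute("ROLLBACK")
            if winner_fp == fingerprint:
                return 200, winner_id
            return 409, "idempotency_key_reused"
        # Phase B2: destination must exist; failure unwinds the claim.
        if db.execute("SELECT 1 FROM topic WHERE mesh_id = ? AND name = ?",
                      (mesh_id, topic)).fetchone() is None:
            db.execute("ROLLBACK")
            return 404, "topic_not_found"  # hypothetical status code
        # In-tx side effect (one of several in §4.7.1).
        db.execute("INSERT INTO topic_message VALUES (?, ?, ?)",
                   (msg_id, mesh_id, body))
        db.execute("COMMIT")
        return 201, msg_id
    except sqlite3.Error:
        if db.in_transaction:
            db.execute("ROLLBACK")
        raise
```

After a B2 rejection the dedupe table has no row for the id, which is the "dedupe row exists iff B3 committed" property the spec relies on.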
#### 4.7.3 Orphan check — same as v7 §4.7.3
Extended over the side-effect inventory to verify that the in-tx items are consistent with each other.
### 4.8 Outbox max-age math — unchanged from v7 §4.8
Minimum `dedupe_retention_days = 7`. The derived `max_age_hours = window -
safety_margin` is strictly less than the window, and `safety_margin` has a
floor of 24h.
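As a worked example of the arithmetic, here is a small helper. It assumes that "window" means the dedupe retention period expressed in hours (`dedupe_retention_days * 24`); that reading, the function name, and the constant names are assumptions, since the full definition lives in v7 §4.8.

```python
MIN_DEDUPE_RETENTION_DAYS = 7
SAFETY_MARGIN_FLOOR_HOURS = 24


def max_age_hours(dedupe_retention_days, safety_margin_hours=SAFETY_MARGIN_FLOOR_HOURS):
    """Derive the outbox max age so it always expires before the dedupe window."""
    if dedupe_retention_days < MIN_DEDUPE_RETENTION_DAYS:
        raise ValueError("dedupe_retention_days below the 7-day minimum")
    if safety_margin_hours < SAFETY_MARGIN_FLOOR_HOURS:
        raise ValueError("safety_margin below the 24h floor")
    window_hours = dedupe_retention_days * 24  # assumed meaning of "window"
    return window_hours - safety_margin_hours
```

With the minimum settings this gives `7 * 24 - 24 = 144` hours, so an outbox row ages out a full day before the broker could sweep its dedupe row.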
### 4.9 Inbox schema — unchanged from v3 §4.5
### 4.10 Crash recovery — unchanged from v3 §4.6
### 4.11 Failure modes — B0/B1/B2 distinction (v10)
- **IPC accept fingerprint-mismatch on duplicate id** (any state):
returns 409 with `conflict` field per §4.5.1. Caller must use a new id.
- **IPC accept against `aborted` row, fingerprint match**: returns 409
per §4.5.1. Caller must use a new id; the old id is permanently retired.
- **Outbox row stuck in `dead`**: operator runs `outbox requeue` per
§4.6.3; old id stays in `aborted`, new id is fresh.
- **Broker fingerprint mismatch on retry**: at B0 → returns 409
immediately (no rate-limit consumed). Daemon marks `dead`; operator
requeue path.
- **Idempotent retry of an already-committed id during a saturated
rate-limit window** (NEW v10): B0 fast-path returns `200 duplicate`
with the original `broker_message_id`. Rate-limit budget is NOT
consumed. Daemon transitions outbox row from `pending`/`inflight`
to `done`. **No split-brain.** This is the key correctness fix
from codex r9.
- **Daemon retry after dedupe row hard-deleted by broker retention
sweep**: cannot happen unless operator overrode `max_age_hours`.
- **Broker phase B1 rejection (rate limit, schema, size, etc.)**: no
dedupe row exists; daemon receives 4xx; idempotent limiter ensures
retries within window don't re-consume budget. If the rejection is
permanent (size, schema), daemon marks `dead`. If transient (rate
limit), daemon retries with exponential backoff until window clears
or `max_age_hours` exhausted.
- **Broker phase B2 rejection on retry**: same id reaches B2 and the
in-tx condition fails (topic deleted, member unsubscribed). B2
rolls back the dedupe insert; no dedupe row remains. Daemon
receives 4xx → marks `dead`. Operator can `requeue` if condition
clears (note: `requeue` mints a fresh id per §4.6.3, so the old id
stays `aborted`).
- **Atomicity violation found by orphan check**: alerts ops.
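The daemon-side reactions listed above reduce to a small transition function. This is a sketch, not the daemon's state machine: the `BrokerOutcome` enum and `next_outbox_status` are hypothetical names, the daemon is assumed to be able to tell the outcome kinds apart, and treating max-age exhaustion as `dead` for `5xx`/timeout as well as for rate limiting is an assumption (the spec states it only for the rate-limit case).

```python
from enum import Enum


class BrokerOutcome(Enum):
    ACCEPTED = "B3 accepted"
    DUPLICATE = "B0 duplicate, fingerprint match"
    FINGERPRINT_MISMATCH = "B0 409 idempotency_key_reused"
    B1_PERMANENT = "B1 rejection: schema, size, auth"
    B1_RATE_LIMITED = "B1 rejection: rate limit"
    B2_REJECTED = "B2 rejection: destination_ref"
    SERVER_ERROR_OR_TIMEOUT = "5xx or timeout"


TRANSIENT = {BrokerOutcome.B1_RATE_LIMITED, BrokerOutcome.SERVER_ERROR_OR_TIMEOUT}
SUCCESS = {BrokerOutcome.ACCEPTED, BrokerOutcome.DUPLICATE}


def next_outbox_status(outcome, age_hours, max_age_hours):
    """Map a broker outcome to the outbox row's next status."""
    if outcome in SUCCESS:
        # A duplicate is indistinguishable from success: no split-brain.
        return "done"
    if outcome in TRANSIENT:
        # Retry with backoff until max_age_hours is exhausted.
        return "pending" if age_hours < max_age_hours else "dead"
    # Permanent rejections wait for an operator `requeue`.
    return "dead"
```

For example, `next_outbox_status(BrokerOutcome.DUPLICATE, 1, 144)` yields `"done"`, which is the v10 behaviour for a retry of an already-committed id during a saturated window.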
---
## 5-13. — unchanged from v4
## 14. Lifecycle — unchanged from v5 §14
## 15. Version compat — unchanged from v7 §15
## 16. Threat model — unchanged
---
## 17. Migration — v8 outbox columns + broker phase B2 (v8)
Broker side, deploy order: same as v7 §17, with one addition:
- Step 4.5: explicitly split broker accept into Phase B1 (pre-dedupe
validation, returns 4xx without writing) and Phase B2/B3 (within the
accept transaction). Implementation: refactor handler to validate
Phase B1 conditions before opening the DB transaction.
Daemon side:
- Outbox schema gains `aborted_at`, `aborted_by`, `superseded_by`
columns and the `aborted` enum value (§4.5.2). Migration applies via
`INSERT INTO new SELECT * FROM old` recreation if needed; v0.9.0 is
greenfield.
- IPC accept switches to `BEGIN IMMEDIATE` for SQLite serialization
(§4.5.1 step 3).
- IPC accept handles `aborted` rows per §4.5.1 (always 409).
- `claudemesh daemon outbox requeue` always mints a fresh
`client_message_id`; never frees the old id. `--new-client-id <id>`
and `--auto` are the only modes; the old `client_message_id`
argument is removed.
---
## What changed v8 → v9 (codex round-8 actionable items)
| Codex r8 item | v9 fix | Section |
|---|---|---|
| Cross-layer ID-consumed authority contradiction | Two-layer model: daemon-consumed iff outbox row; broker-consumed iff dedupe row committed; daemon-mediated callers see only daemon-layer authority | §4.1, §4.6.1, §4.6.2 |
| Rate-limit authority muddled (B2 vs async counters) | Rate limit moved to B1 via external atomic limiter (Redis-style INCR with TTL); DB rate-limit counters demoted to telemetry-only | §4.6.2, §4.6.4, §4.7.1 |
| §4.1 broker guarantee fuzzy | Tightened: "dedupe row exists iff broker accept transaction committed (B3)" | §4.1, §4.6.2 |
(Earlier rounds' fixes preserved unchanged.)
---
## What needs review (round 9)
1. **Two-layer ID model (§4.1, §4.6.1)** — is the daemon-vs-broker
authority split clear, or does it create more confusion for
operators reading "consumed" in different contexts? Should we use
different verbs (e.g. "claimed" at daemon, "committed" at broker)?
2. **Rate-limit external limiter (§4.6.4)** — is "atomic external
limiter" specified concretely enough? Is the over-counting on
limiter-accepted-then-B2-rejected acceptable?
3. **B2 contents after rate-limit move** — B2 now only has
`destination_ref existence`. Worth keeping a B2 phase at all, or
collapse into B1+B3?
4. **Anything else still wrong?** Read it as if you were going to
operate this for a year.
Three options:
- **(a) v9 is shippable**: lock the spec, start coding the frozen core.
- **(b) v10 needed**: list the must-fix items.
- **(c) the architecture itself is wrong**: what would you do differently?
Be ruthless.


@@ -0,0 +1,853 @@
# `claudemesh daemon` — Final Spec v2
> **Round 2 after a critical first-pass review.** v1 of this spec was reviewed
> by another model and pushed back on identity model, no-auth IPC, "exactly-once"
> overclaim, hook credentials, surface bloat, and missing operational flows
> (rotation, image clones, schema migration, threat model). v2 incorporates all
> of those.
---
## 0. Intent — what this is, what it isn't
### 0.1 The product reality
claudemesh today is a **peer mesh runtime for Claude Code sessions**. Each
session runs `claudemesh launch`, opens a WebSocket to a managed broker, gets
ephemeral identity, sends/receives DMs and topic messages with other Claude Code
sessions, posts to shared state, deploys MCP servers / skills / files,
participates in tasks, schedules reminders. Everything is E2E encrypted with
crypto_box envelopes for DMs and per-topic symmetric keys for topics. The broker
is a routing/persistence layer; peers do the actual work.
The CLI is the canonical surface — every operation is a `claudemesh <verb>`.
The MCP server is a "tool-less push pipe" that surfaces inbound messages to
Claude Code as channel notifications. There is also a web dashboard, an `/v1/*`
REST API, and an existing apikey auth model for external integrations.
### 0.2 The gap
Anything that **isn't a Claude Code session** is a second-class citizen:
- A RunPod handler that wants to alert a peer when an OOM happens has only
one option: curl an apikey-authed REST endpoint. One-way only. The handler
is not a peer — it can't be DM'd back, can't be `@-mentioned`, can't be in
`peer list`, can't claim a task assigned to it, can't host an MCP service or
share a skill. It's a webhook spoke, not a participant.
- A Temporal worker that wants to track its own progress in shared mesh state,
publish to a `#alerts` topic, and listen for "retry now" instructions has
no good shape. Either it shells out to `claudemesh send` cold-path
(a fresh WS handshake per message — ~1s latency, broker churn, no inbound
path) or it speaks the WS protocol manually (significant code, no SDK).
- A long-running CI runner, an IoT box, a phone app, a future Python or Go
service — none can be **first-class peers** without writing the same WS
reconnect / queue / encryption / presence code that the existing CLI already
has, plus an IPC surface so the host's apps can use it without re-implementing
any of that.
### 0.3 What this daemon is
A long-running process — the same `claudemesh-cli` binary in `daemon` mode —
that turns any host into a **first-class peer**:
- Stable identity across restarts (the host *is* a member of the mesh, not a
series of disconnected sessions).
- Persistent WS to the broker, with reconnect, queue, dedupe.
- Local IPC surface (UDS + loopback HTTP + SSE) that any local app can hit
to send, subscribe, query — without learning the broker protocol or carrying
long-lived secrets in app code.
- Hooks: shell scripts that fire on events. They let the host reply to DMs,
auto-claim tasks, and escalate errors without the app being involved.
- Same security primitives as `claudemesh launch` (mesh keypair, crypto_box,
per-topic keys). No new auth model toward the broker.
The daemon **is the runtime**. The CLI in cold-path mode is a fallback. The
Claude Code MCP integration is one client of the daemon (eventually).
### 0.4 What this daemon is NOT
- **Not a webhook gateway.** `/v1/notify` and apikeys remain the path for
systems that can't host the runtime (third-party SaaS, monitoring tools).
The daemon is for systems that *can* run a process — code you control.
- **Not a generic message broker.** It speaks claudemesh protocol to one
managed broker. It is not a substitute for NATS, Redis, Kafka, RabbitMQ.
- **Not a Slack replacement.** Topics, DMs, mentions exist because *AI
sessions* use them. Humans interact via the dashboard or a Claude Code
session, not by reading the daemon's inbox directly.
- **Not a fleet manager.** One daemon manages one mesh on one host. Multi-mesh
on one host is supported (one daemon per mesh, supervised). Cross-host
supervision is an external concern (systemd, k8s, etc.) — the daemon doesn't
reach across hosts.
### 0.5 Who deploys this
- A developer running `claudemesh daemon up` on their laptop so their open
Claude Code sessions all share one persistent connection (instead of each
opening its own ephemeral WS).
- The same developer running `claudemesh daemon install-service` on their VPS,
RunPod pod, Temporal worker, CI runner — turning each into an
addressable peer that scripts on that host can talk to via local IPC.
- Eventually: language SDKs (Python / Go / TypeScript) talking to the daemon
on `localhost`, exposing claudemesh as a first-class API for any app the
developer writes.
### 0.6 Pre-launch posture
No users yet. We can break protocol, schema, surface, anything. Optimize for
the architecture we want to live with for years, not for the smallest
shippable cut. Codex pushed back on v1 on this exact axis: do not ship
graph/vector/MCP/skills/tasks on day one — freeze a small, hardened core,
expand deliberately.
---
## 1. Process model
**One daemon per (user, mesh)**. Persistent. Survives reboots via OS
supervisor. Serves multiple local apps concurrently.
```
~/.claudemesh/daemon/<mesh-slug>/
pid 0600 pidfile, cleaned on shutdown
sock 0600 unix domain socket (primary IPC)
http.port 0644 auto-allocated loopback port
local_token 0600 per-daemon bearer for HTTP/TCP transports
keypair.json 0600 persistent ed25519 + x25519 — daemon identity
host_fingerprint.json 0600 machine-id + boot-id + interface mac digest
config.toml 0644 user-editable runtime tuning
outbox.db 0600 SQLite — durable outbound queue
inbox.db 0600 SQLite — N-day inbound history, FTS-indexed
schema_version 0644 integer; gates online migrations
daemon.log 0644 JSON-lines, rotating (100 MB / 14 d)
hooks/ 0700 user-managed event scripts
```
**Resource caps (defaults, configurable):**
| Resource | Default | Why |
|---|---|---|
| RSS | 256 MB | Most workloads stay under 50 MB; cap protects multi-mesh hosts |
| CPU | unlimited | Hook fan-out can spike briefly; rely on OS scheduler |
| Outbox DB | 5 GB | At 1 KB avg msg, that's 5M queued. Disk-full handling per §4.4 |
| Inbox DB | 5 GB | Same |
| File descriptors | 1024 | UDS clients + SSE streams + DB handles + WS |
| SSE concurrent | 32 streams | DoS protection; configurable up |
| IPC concurrent | 64 in-flight | Backpressure beyond this returns `429 daemon_busy` |
| Hook concurrency | 8 | Bounded pool; overflow queues |
Single binary. Same `claudemesh-cli` package; `daemon` is one of its modes.
## 2. Identity — persistent member by default, ephemeral on opt-in, clone-aware
### 2.1 Modes
```
claudemesh daemon up # default: persistent member
claudemesh daemon up --ephemeral # session-shaped, no keypair persisted
claudemesh daemon up --ephemeral --ttl=2h # auto-shutdown after TTL
```
- **Persistent (default)**: ed25519 + x25519 keypair stored in `keypair.json`.
Same identity across restarts, reconnects, supervisor cycles. Right for
servers, workers, addressable peers.
- **Ephemeral**: keypair generated in memory, never written. Daemon exits =
identity gone. Right for CI jobs, preview environments, disposable RunPod
pods, test harnesses, build agents, anything that should not leave a peer
ghost in the broker after teardown.
- **`--ttl <duration>`** on ephemeral mode: auto-shutdown after the duration,
or after `claudemesh daemon down`, whichever comes first. Broker member record
cleaned up on shutdown.
### 2.2 Image-clone detection
Two daemons booting with the same `keypair.json` (VM image clone, container
copy, restored backup) is a serious failure mode — broker sees connection
collisions, presence flickers, encrypted messages route to the wrong host.
Handled in three places:
1. **Daemon side**: `host_fingerprint.json` is written on first startup —
`sha256(machine-id || boot-id || mac-of-default-iface || hostname)`. On every
subsequent startup, the fingerprint is recomputed and compared. If it
differs, the daemon **refuses to start** unless `--accept-cloned-identity`
is passed (writes a fresh fingerprint and continues with the same keypair —
for legitimate hardware migrations) or `--remint` is passed (mints fresh
keypair, registers as a new member, broker reaps the old member after
grace period).
2. **Broker side**: tracks `lastSeenHostFingerprint` per member. On
reconnection from a different fingerprint, broker emits a
`member_clone_suspected` security event to the mesh owner's dashboard.
Connection itself is allowed (legitimate hardware swaps happen) but visible
for audit.
3. **Mesh owner**: `claudemesh member revoke <pubkey>` revokes the keypair
server-side; daemon receives `keypair_revoked` push event on next
connection and self-disables.
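
The daemon-side check in step 1 can be sketched as follows. This is an illustration under assumptions, not the daemon's code: the function names, the `StartupDecision` values, the length-prefixed encoding of the inputs, and the precedence of `--remint` over `--accept-cloned-identity` are invented for the example. Only the four hashed inputs and the two override flags come from the text above.

```python
import hashlib
from enum import Enum


class StartupDecision(Enum):
    CONTINUE = "continue"            # stored and current fingerprints match
    REFUSE = "refuse"                # mismatch and no override flag
    ACCEPT_CLONED = "accept_cloned"  # --accept-cloned-identity: new fingerprint, same keypair
    REMINT = "remint"                # --remint: fresh keypair, new member


def host_fingerprint(machine_id: str, boot_id: str, mac: str, hostname: str) -> str:
    """sha256 over the four host attributes listed in the spec.

    Assumption: each field is length-prefixed so that ("ab", "c") and
    ("a", "bc") cannot collide; the spec only writes `||`.
    """
    h = hashlib.sha256()
    for field in (machine_id, boot_id, mac, hostname):
        data = field.encode("utf-8")
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.hexdigest()


def startup_decision(stored: str, current: str,
                     accept_cloned: bool = False,
                     remint: bool = False) -> StartupDecision:
    """Compare the persisted fingerprint with the recomputed one."""
    if stored == current:
        return StartupDecision.CONTINUE
    if remint:  # assumed to win if both flags are passed
        return StartupDecision.REMINT
    if accept_cloned:
        return StartupDecision.ACCEPT_CLONED
    return StartupDecision.REFUSE
```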
### 2.3 Rename
`--name` is taken at first `daemon up`; subsequent runs read the keypair file
and ignore `--name` unless `--rename` is passed (which produces a
`member_renamed` event the broker propagates to peers).
## 3. IPC surface — stable core only in v0.9.0
### 3.1 Frozen core surface (v0.9.0)
Codex's feedback: do not ship every CLI verb on day one. A small hardened core
first, expand under explicit capability gates.
```
# Messaging — durable, tested
POST /v1/send {to, message, priority?, meta?, replyToId?}
POST /v1/topic/post {topic, message, priority?, mentions?}
POST /v1/topic/subscribe {topic} (idempotent)
POST /v1/topic/unsubscribe {topic}
GET /v1/topic/list
GET /v1/inbox ?since=<iso>&topic=<n>&from=<peer>&limit=<n>
GET /v1/inbox/search ?q=<fts-query>&limit=<n> (FTS5)
# Peers + presence — read-only on day one
GET /v1/peers ?mesh=<slug>
POST /v1/profile {summary?, status?, visible?} (limited fields)
# Files — already production in CLI
POST /v1/file/share {path, to?, message?, persistent?}
GET /v1/file/get ?id=<fileId>&out=<path>
GET /v1/file/list
# Events — push
GET /v1/events text/event-stream
core events: message, peer_join, peer_leave, file_shared,
daemon_disconnect, daemon_reconnect, hook_executed
# Control plane
GET /v1/health {connected, lag_ms, queue_depth, inflight,
mesh, member_pubkey, uptime_s, schema_version,
daemon_version, broker_version}
GET /v1/metrics Prometheus exposition
GET /v1/version {daemon, schema, ipc_api} (negotiation)
POST /v1/heartbeat {} (caller-side liveness signal)
```
That's it: 17 endpoints. Battle-test these before adding more.
### 3.2 Capability-gated future surface (v0.9.x roadmap)
Behind explicit feature flags in `config.toml`, post-v0.9.0:
```toml
[capabilities]
state = false # /v1/state/{set,get,list}
memory = false # /v1/memory/{remember,recall}
vector = false # /v1/vector/{store,search,delete}
graph = false # /v1/graph/query
tasks = false # /v1/task/{create,claim,complete}
scheduling = false # /v1/scheduling/remind
mcp_host = false # /v1/mcp/{register,call} (LARGEST surface; treat as v1.0)
skill_share = false # /v1/skill/{deploy,share}
```
Each capability is its own ship: design review, security review, test
coverage, capability-token model, then enable. None enabled in v0.9.0.
### 3.3 Local IPC authentication
Codex was right: loopback TCP without auth is an attack surface (browser SSRF,
container side-channels, sandboxed apps with network but no FS access, WSL
host-shared loopback).
| Transport | Auth | Rationale |
|---|---|---|
| UDS | None (relies on FS perms 0600) | Reaching the socket = same UID = can read keypair anyway |
| TCP loopback | **Required**: `Authorization: Bearer <local_token>` | Browser/container/sandbox can reach loopback without FS access |
| SSE | Required: `Authorization: Bearer <local_token>` | Same |
`local_token` is 32 bytes of `crypto.randomBytes` (256 bits), encoded base64url,
written to `local_token` mode 0600 at daemon init. Rotated on `claudemesh
daemon rotate-token`. SDKs auto-discover the token by reading the file (same
mechanism as discovering the socket path).
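
A minimal Python sketch of the token file handling, as an SDK or test harness might mirror it. The function names and the temp-file-then-rename write are assumptions for illustration, as is stripping the base64 padding. The daemon itself is described as using Node's `crypto.randomBytes`.

```python
import base64
import os
import secrets
from pathlib import Path


def write_local_token(path: Path) -> str:
    """Generate a 32-byte token, base64url-encode it, store it with mode 0600."""
    raw = secrets.token_bytes(32)
    token = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    tmp = path.with_name(path.name + ".tmp")
    # Create with an explicit 0600 mode so the file is never world-readable,
    # even briefly; O_EXCL refuses to reuse a leftover temp file.
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(token)
    os.replace(tmp, path)  # atomic rename on POSIX
    return token


def read_local_token(path: Path) -> str:
    """What an SDK does during auto-discovery: read the file, trim whitespace."""
    return path.read_text().strip()
```

Thirty-two random bytes encode to 43 base64url characters once the single padding character is removed.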
**Additional defenses:**
- HTTP listener binds **127.0.0.1 only**. Refuses to bind elsewhere unless
`[ipc] http_bind = "..."` is set explicitly **and** `[ipc] http_external_auth = "..."`
points to a separate token file (escape hatch for advanced users; never the default).
- `Origin` header check: rejects requests with `Origin` set unless it's
explicitly allowlisted in config (default: empty allowlist). Defends against
browser SSRF.
- `Host` header check: must be `localhost` or `127.0.0.1`. Defends against DNS
rebinding.
- CORS: `Access-Control-Allow-Origin` never echoed; preflight returns `403`.
- `User-Agent` required (rejects empty UA — mild signal against simple SSRF).
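
The header checks above compose into one gate in front of every loopback request. A hedged sketch: `check_request` and the specific status codes (403 / 400 / 401) are assumptions for illustration; the spec only fixes `403` for CORS preflight. Stripping the port from `Host` is also an assumption.

```python
import hmac
from typing import Mapping, Optional, Set

ALLOWED_HOSTS = {"localhost", "127.0.0.1"}


def check_request(headers: Mapping[str, str], local_token: str,
                  origin_allowlist: Optional[Set[str]] = None) -> int:
    """Return 200 if the request passes every check, else an HTTP error status."""
    h = {k.lower(): v for k, v in headers.items()}

    # Host check (DNS rebinding): compare the name without an optional :port.
    host = h.get("host", "").split(":", 1)[0]
    if host not in ALLOWED_HOSTS:
        return 403

    # Origin check (browser SSRF): any Origin must be explicitly allowlisted.
    origin = h.get("origin")
    if origin is not None and origin not in (origin_allowlist or set()):
        return 403

    # Empty or missing User-Agent is rejected.
    if not h.get("user-agent"):
        return 400

    # Bearer token, compared in constant time.
    scheme, _, presented = h.get("authorization", "").partition(" ")
    if scheme != "Bearer" or not hmac.compare_digest(
            presented.encode("utf-8"), local_token.encode("utf-8")):
        return 401
    return 200
```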
### 3.4 Request limits + backpressure
- Max request body: **1 MB** (override per endpoint; file uploads use a separate
streaming endpoint).
- Max response body: **10 MB**; truncated with `Link: rel=next` cursor.
- Max in-flight IPC requests: **64**. Beyond → `429 daemon_busy`.
- Max SSE concurrent streams: **32**. Beyond → `429 too_many_streams`.
- Per-token rate limit: **100 req/sec** sustained, 1000/sec burst (token
bucket). Tunable.
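
For reference, the per-token limit maps onto a standard token bucket. A sketch, reading "1000/sec burst" as a bucket capacity of 1000 tokens refilled at 100 per second (that reading, and the class and parameter names, are assumptions):

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: `rate` tokens/sec refill, at most `burst` tokens stored."""

    def __init__(self, rate: float = 100.0, burst: float = 1000.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = burst
        self.tokens = burst          # start full, so an initial burst is allowed
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then try to spend `cost` tokens."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

One bucket per `local_token` (or per capability token) would give the per-token behavior described; a request for which `allow()` returns `False` would be answered with a 429.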
## 4. Delivery contract — durable at-least-once with idempotent send
Codex was right: "exactly-once" is a lie. Replacing the claim with a precise
contract.
### 4.1 The contract
> **The daemon guarantees: each successful send call eventually produces
> exactly one row at the broker, identified by a stable `messageId`. The daemon
> does not guarantee that downstream peers process the message exactly once —
> that is the receiver's responsibility, aided by the propagated
> `idempotency_key`.**
Concretely:
- **Caller → daemon**: caller may supply `Idempotency-Key`; daemon dedupes
identical keys for 24h. Without one, daemon mints `ulid` and returns it as
`messageId`.
- **Daemon → broker**: each outbox row has at-most-one inflight transmit.
Daemon retries with exponential backoff until broker ACKs OR row hits TTL
(7d default → moves to `dead`).
- **Broker → peer**: existing claudemesh delivery semantics. Broker dedupes by
`messageId`. Peer receives ≥1 copy.
- **Peer hooks**: hooks see `idempotency_key` in the event JSON. Idempotent
hook implementations are the receiver's responsibility.
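
The caller → daemon leg can be sketched against SQLite. The table layout and function names are hypothetical, and `uuid4` stands in for the ULID the spec mints; the 24h window is from the contract above.

```python
import sqlite3
import time
import uuid
from typing import Optional

DEDUPE_WINDOW_S = 24 * 3600


def open_outbox(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("""
        CREATE TABLE IF NOT EXISTS outbox (
            message_id      TEXT PRIMARY KEY,
            idempotency_key TEXT NOT NULL,
            payload         TEXT NOT NULL,
            status          TEXT NOT NULL DEFAULT 'pending',
            created_at      REAL NOT NULL
        )""")
    db.execute("CREATE INDEX IF NOT EXISTS outbox_idem ON outbox(idempotency_key)")
    return db


def enqueue(db: sqlite3.Connection, payload: str,
            idempotency_key: Optional[str] = None,
            now: Optional[float] = None) -> str:
    """Return the messageId; a repeated key inside the window returns the original id."""
    now = time.time() if now is None else now
    message_id = uuid.uuid4().hex  # stand-in for a ULID
    key = idempotency_key or message_id
    with db:  # commit on success, roll back on error
        row = db.execute(
            "SELECT message_id FROM outbox "
            "WHERE idempotency_key = ? AND created_at > ? "
            "ORDER BY created_at DESC LIMIT 1",
            (key, now - DEDUPE_WINDOW_S)).fetchone()
        if row:
            return row[0]  # duplicate send: no new row
        db.execute(
            "INSERT INTO outbox (message_id, idempotency_key, payload, created_at) "
            "VALUES (?, ?, ?, ?)",
            (message_id, key, payload, now))
    return message_id
```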
### 4.2 Outbox row state machine
```
┌────────────┐
send call → │ pending │
└─────┬──────┘
│ daemon picks up batch
┌────────────┐
│ inflight │ ← attempts++, last_error written
└─┬────┬─────┘
│ │ broker NACK / network err
broker ACK │ └──────────► back to pending (with exp. backoff)
┌────────────┐
│ done │ ← delivered_at set, broker_message_id stored
└────────────┘
age > max_age_hours:
┌────────────┐
│ dead │ ← surfaces in `daemon outbox --failed`
└────────────┘
```
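
The diagram translates into a small transition table plus a backoff function. A sketch only: the diagram does not say from which states a row may age into `dead`, so allowing it from both `pending` and `inflight` is an assumption, as are the backoff base and cap.

```python
import random
from enum import Enum


class RowState(str, Enum):
    PENDING = "pending"
    INFLIGHT = "inflight"
    DONE = "done"
    DEAD = "dead"


# Allowed transitions, read off the diagram. DONE and DEAD are terminal.
TRANSITIONS = {
    RowState.PENDING: {RowState.INFLIGHT, RowState.DEAD},
    RowState.INFLIGHT: {RowState.DONE, RowState.PENDING, RowState.DEAD},
    RowState.DONE: set(),
    RowState.DEAD: set(),
}


def transition(current: RowState, target: RowState) -> RowState:
    """Return `target` if the move is legal, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal outbox transition {current.value} -> {target.value}")
    return target


def backoff_seconds(attempts: int, base: float = 1.0, cap: float = 300.0,
                    jitter: float = 0.0) -> float:
    """Exponential backoff: base * 2^(attempts-1), capped, with +/- jitter fraction."""
    delay = min(cap, base * 2 ** max(0, attempts - 1))
    return delay * (1 + random.uniform(-jitter, jitter))
```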
### 4.3 Crash recovery
On daemon startup:
1. Any rows in `inflight` are reset to `pending` with `attempts++` and
`next_attempt_at = now + min_backoff`. Note: this MAY cause double-delivery
of a message that was actually ACK'd by the broker but the ACK didn't
persist locally before crash. The `idempotency_key` propagates to broker
(via message `meta`) so the broker dedupes by key.
2. `outbox.db` integrity check (`PRAGMA integrity_check`); if it fails, the
daemon refuses to start and points the user at `claudemesh daemon recover`.
3. `inbox.db` integrity check; on failure, drops to `inbox.db.corrupt-<ts>`,
creates fresh empty inbox, logs `inbox_corruption_recovered` (does not
block startup — inbox is a cache).
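
Steps 1 and 2 in SQLite terms, as a sketch. The column names `attempts` and `next_attempt_at` follow the text; the `MIN_BACKOFF_S` value and the function names are assumptions.

```python
import sqlite3
import time
from typing import Optional

MIN_BACKOFF_S = 1.0  # assumed value for min_backoff


def integrity_ok(db: sqlite3.Connection) -> bool:
    """PRAGMA integrity_check returns a single row ('ok',) on a healthy file."""
    return db.execute("PRAGMA integrity_check").fetchall() == [("ok",)]


def recover_inflight(db: sqlite3.Connection, now: Optional[float] = None) -> int:
    """Reset rows stranded in 'inflight' by a crash; return how many were reset."""
    now = time.time() if now is None else now
    with db:
        cur = db.execute(
            "UPDATE outbox "
            "SET status = 'pending', attempts = attempts + 1, next_attempt_at = ? "
            "WHERE status = 'inflight'",
            (now + MIN_BACKOFF_S,))
    return cur.rowcount
```

Because a reset row may already have been ACK'd, the retransmit carries the same `idempotency_key`, which is what makes the double send harmless.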
### 4.4 Disk-full
- At 80% of `outbox.max_queue_size` or 80% of `[disk] reserved_bytes`: daemon
emits `outbox_pressure_high` event + Prometheus gauge. Sends still accept.
- At 95%: new sends return `507 insufficient_storage`. Existing inflight
drains.
- At 100%: daemon enters degraded mode — refuses sends, refuses new SSE
streams, holds open WS for inbound only. `daemon status` shows degraded.
- Recovery: drain via broker reconnect (drains `done` rows older than
retention window) or `claudemesh daemon outbox prune --confirm`.
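
The three thresholds reduce to a pure function from usage to operating mode. A sketch; the enum and function names are assumptions.

```python
from enum import Enum


class DiskMode(Enum):
    NORMAL = "normal"
    PRESSURE = "pressure"          # >= 80%: emit outbox_pressure_high, still accept sends
    REJECT_SENDS = "reject_sends"  # >= 95%: new sends get 507 insufficient_storage
    DEGRADED = "degraded"          # >= 100%: no sends, no new SSE, inbound WS only


def disk_mode(used_bytes: int, limit_bytes: int) -> DiskMode:
    """Map outbox usage to a mode. Integer math avoids float rounding at the edges."""
    if used_bytes >= limit_bytes:
        return DiskMode.DEGRADED
    if used_bytes * 100 >= limit_bytes * 95:
        return DiskMode.REJECT_SENDS
    if used_bytes * 100 >= limit_bytes * 80:
        return DiskMode.PRESSURE
    return DiskMode.NORMAL
```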
### 4.5 Schema migration
`schema_version` file holds an integer. On startup:
1. If `schema_version` matches binary's expected version → continue.
2. If version is older → run `apps/cli/src/daemon/migrations/<from>-<to>.sql`
in a transaction, write new version on success.
3. If version is newer (downgrade) → daemon refuses to start, error points at
re-installing matching version.
Migrations are forward-only. Each migration is ≤ 1 transaction. Test coverage
required: every migration has a snapshot test from prior schema.
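
The startup decision can be sketched as a planner that returns the migration files to run. It assumes versions advance one step at a time, which the `<from>-<to>.sql` naming suggests but does not state; the function name is hypothetical.

```python
from typing import List


def plan_migrations(on_disk: int, expected: int) -> List[str]:
    """Return migration file names to apply, oldest first.

    Raises RuntimeError on a downgrade, mirroring "daemon refuses to start".
    """
    if on_disk > expected:
        raise RuntimeError(
            f"schema_version {on_disk} is newer than this binary supports "
            f"({expected}); install a matching daemon version")
    return [f"{v}-{v + 1}.sql" for v in range(on_disk, expected)]
```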
## 5. Inbound — durable history with FTS
Every inbound message is written to `inbox.db` before any hook fires:
```sql
CREATE VIRTUAL TABLE inbox USING fts5(
message_id UNINDEXED, mesh UNINDEXED, topic, sender_pubkey UNINDEXED,
sender_name, body, meta, idempotency_key UNINDEXED,
received_at UNINDEXED, replied_to_id UNINDEXED
);
CREATE INDEX inbox_received_at ON inbox(received_at);
CREATE INDEX inbox_idem ON inbox(idempotency_key);
```
- **Receiver-side dedupe**: on insert, `INSERT OR IGNORE` on `idempotency_key`.
Duplicate broker delivery becomes a no-op locally + `cm_daemon_dedupe_total`
counter increments.
- 30-day rolling retention (configurable). `VACUUM` weekly during low-traffic
window.
- `claudemesh daemon search "OOM"` queries the FTS index.
- Apps connecting mid-stream replay history via `?since=<iso>`.
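
A sketch of the receiver-side dedupe. SQLite does not allow `UNIQUE` constraints or `CREATE INDEX` on an FTS5 virtual table, so this example keeps the dedupe key in an ordinary table (hypothetically named `inbox_rows`, with a reduced column set) and leaves the full-text index out; that split is an assumption of the example, not part of the schema above.

```python
import sqlite3


def open_inbox(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("""
        CREATE TABLE IF NOT EXISTS inbox_rows (
            message_id      TEXT NOT NULL,
            topic           TEXT,
            sender_name     TEXT,
            body            TEXT,
            idempotency_key TEXT NOT NULL UNIQUE,
            received_at     TEXT NOT NULL
        )""")
    db.execute("CREATE INDEX IF NOT EXISTS inbox_received_at ON inbox_rows(received_at)")
    return db


def record_inbound(db: sqlite3.Connection, msg: dict) -> bool:
    """Insert once per idempotency_key. Returns False for a duplicate delivery."""
    with db:
        cur = db.execute(
            "INSERT OR IGNORE INTO inbox_rows "
            "(message_id, topic, sender_name, body, idempotency_key, received_at) "
            "VALUES (:message_id, :topic, :sender_name, :body, :idempotency_key, :received_at)",
            msg)
    return cur.rowcount == 1  # 0 rows changed means the key already existed
```

A `False` return is where `cm_daemon_dedupe_total` would be incremented and hook dispatch skipped.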
## 6. Hooks — first-class but tightly bounded
Codex was right: hooks were underspecified, and putting `CLAUDEMESH_TOKEN` in
every hook env was a serious exfil footgun.
### 6.1 Hook directory & contract
```
hooks/
on-message.sh every inbound message (DM + topic)
on-dm.sh DMs only
on-mention.sh when @<my-name> appears anywhere
on-topic-<name>.sh a specific topic
on-file-share.sh file shared with me
on-disconnect.sh WS dropped
on-reconnect.sh reconnected
on-startup.sh daemon up
pre-send.sh filter / mutate outbound (last gate)
hooks.toml per-hook policy (auth, redaction, env, timeout)
```
`hooks.toml` (mandatory; daemon refuses to invoke hooks without it):
```toml
[on-mention]
enabled = true
timeout_s = 30
output_size_limit = 65536
redact_payload = ["body.password", "meta.api_key"] # JSONPath
allow_reply = true # if false, stdout reply ignored
capability_token_scope = ["topic:alerts:post"] # scoped, NOT broker session token
network_policy = "deny" # 'deny' | 'allow' | 'allowlist'
network_allowlist = [] # only if policy = 'allowlist'
fs_policy = "readonly" # 'readonly' | 'rw' | 'sandbox'
killpg_on_timeout = true # SIGTERM process group, not just child
audit = true # log every invocation
```
### 6.2 Credentials passed to hooks
**Default: nothing.** No `CLAUDEMESH_TOKEN`, no broker session, nothing that
lets the hook impersonate the daemon's identity broadly.
**Opt-in per hook**: `capability_token_scope = ["topic:alerts:post"]` mints a
**short-lived (5 min) capability token** scoped to exactly that capability.
The hook can use it to call back into the daemon's IPC ("post a reply to
#alerts") but cannot use it to read state, read inbox, deploy MCP, etc. Token
expires when the hook process exits OR after 5 min, whichever comes first.
Capability tokens are local-only — they authorize against the daemon's IPC
surface, never the broker directly. Daemon translates capability calls into
broker calls.
Env variables the hook DOES get:
- `CLAUDEMESH_MESH=<slug>`
- `CLAUDEMESH_HOOK_NAME=on-mention`
- `CLAUDEMESH_EVENT_ID=<ulid>`
- `CLAUDEMESH_CAPABILITY_TOKEN=<token>` (only if scope was configured; else absent)
- `CLAUDEMESH_DAEMON_SOCK=<path>` (so SDKs can connect for capability calls)
- `PATH=/usr/bin:/bin` (locked down)
### 6.3 Payload redaction
Hook stdin receives event JSON minus paths listed in `redact_payload`. Default
redaction: nothing. Mesh owner / daemon admin opts in.
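
A sketch of redaction for the dotted paths shown in `hooks.toml`. It handles only simple `a.b.c` paths through nested objects, not full JSONPath; the function name is an assumption.

```python
import copy
from typing import Any, Dict, Iterable


def redact(event: Dict[str, Any], paths: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of `event` with each dotted path removed, if present."""
    out = copy.deepcopy(event)  # never mutate the event other consumers see
    for path in paths:
        *parents, leaf = path.split(".")
        node: Any = out
        for key in parents:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break  # path does not exist in this event
        if isinstance(node, dict):
            node.pop(leaf, None)
    return out
```

The redacted copy is what would be serialized to the hook's stdin; the stored inbox row is untouched.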
### 6.4 Timeout & cleanup
- Per-hook `timeout_s` (default 30s). On timeout, daemon sends SIGTERM to the
hook's process group (`killpg_on_timeout=true`), waits 5s, then SIGKILL.
Catches forked grandchildren that were trying to keep things alive.
- Hook stdout/stderr captured, truncated at `output_size_limit`. Larger
outputs log a warning and discard the overflow.
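
The timeout sequence (SIGTERM to the process group, grace period, SIGKILL) looks roughly like this on POSIX. A sketch: `run_hook` and its return shape are assumptions, and stderr is discarded here for brevity where the daemon would capture it.

```python
import contextlib
import os
import signal
import subprocess
from typing import List, Tuple


def run_hook(argv: List[str], stdin_data: bytes, timeout_s: float = 30.0,
             grace_s: float = 5.0, output_size_limit: int = 65536
             ) -> Tuple[int, bytes, bool]:
    """Run a hook; return (exit_status, truncated_stdout, timed_out)."""
    proc = subprocess.Popen(
        argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        start_new_session=True)  # new session => child leads its own process group
    timed_out = False
    try:
        out, _ = proc.communicate(stdin_data, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        timed_out = True
        with contextlib.suppress(ProcessLookupError):
            os.killpg(proc.pid, signal.SIGTERM)  # whole group, incl. grandchildren
        try:
            out, _ = proc.communicate(timeout=grace_s)
        except subprocess.TimeoutExpired:
            with contextlib.suppress(ProcessLookupError):
                os.killpg(proc.pid, signal.SIGKILL)
            out, _ = proc.communicate()
    return proc.returncode, out[:output_size_limit], timed_out
```

A negative exit status is Python's convention for "killed by signal N", which maps naturally onto the `exit` field of the audit record.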
### 6.5 Audit log
Every hook invocation logs:
```json
{"hook":"on-mention","event_id":"01H8…","exit":0,"duration_ms":47,
"stdout_bytes":120,"stderr_bytes":0,"replied":true,"capability_calls":1,
"ts":"2026-05-03T14:00:00Z"}
```
Stored in `daemon.log`; metrics exposed via `cm_daemon_hook_*`.
### 6.6 Sandboxing — supported, not required
The contract supports sandboxing without mandating it (mandating breaks too
many real workflows):
- Linux: opt-in `sandbox = "bubblewrap"` in `hooks.toml` runs the hook under
`bwrap` with no network (unless `network_policy != "deny"`), readonly FS
except `/tmp/<hook-id>`, no DBus, no /proc.
- macOS: opt-in `sandbox = "sandbox-exec"` with similar profile.
- Default: no sandbox; rely on Unix permissions + `network_policy=deny` (which
is enforced via `unshare --net` on Linux when available, otherwise
best-effort firewall rule).
## 7. Multi-mesh — daemon-per-mesh, supervised by a thin shell
### 7.1 The decision
One daemon per mesh, coordinated by a supervisor script. Codex pushed back —
"why not one daemon serving all meshes?". Going daemon-per-mesh because:
- **Crash isolation**: a panic in `prod` mesh's WS reader can't corrupt
`dev` mesh's outbox.
- **Resource accounting**: per-mesh RSS, per-mesh metrics, per-mesh disk
budget — easy to attribute, easy to cap.
- **Independent identity**: each mesh has its own keypair, host fingerprint,
capability gates. Conflating into one process forces shared trust.
- **Independent upgrades**: rolling daemon restarts per mesh, no downtime
across all meshes.
- **Simpler code**: zero cross-mesh routing logic in the daemon body.
The cost (process count, log fan-out) is real but bounded: typical user has
1–3 meshes. Heavy users (10–20) get a `claudemesh daemon ps` + `--all` UX that
treats them as a fleet.
### 7.2 Resource caps for fleet hosts
`config.toml` has `[fleet]` section read by `daemon up --all`:
```toml
[fleet]
max_daemons = 10
total_memory_budget = "2GB" # divided across daemons; each gets budget/N RSS cap
total_disk_budget = "20GB" # divided across outbox + inbox per daemon
```
If a user hits `max_daemons`, `daemon up <next>` errors with a clear message
pointing at the cap.
### 7.3 Commands
```
claudemesh daemon up --mesh <slug> # one mesh
claudemesh daemon up --all # all joined meshes (respects fleet caps)
claudemesh daemon down --mesh <slug>
claudemesh daemon down --all
claudemesh daemon status # all daemons, table view
claudemesh daemon status --json # machine-readable
claudemesh daemon ps # alias of status
claudemesh daemon logs --mesh <slug> [-f]
claudemesh daemon restart --mesh <slug>
```
## 8. Auto-routing — clarified, not transparent
Codex pushed back: "no behavior difference" was hand-waving. Persistent
identity, queueing, hooks, profile state — these legitimately change behavior.
### 8.1 What changes when a daemon is up
| Behavior | Cold-path CLI | Daemon-routed CLI |
|---|---|---|
| Sender attribution | Ephemeral session pubkey for that invocation | Daemon's persistent member pubkey |
| Latency | ~1s (fresh WS handshake) | <10ms (local UDS round-trip) |
| Send durability | None — if broker is unreachable, command fails | Outbox queue retries until TTL |
| Inbound visibility | Not available (cold path closes WS) | `claudemesh inbox` reads daemon's inbox.db |
| Hooks | Not invoked | Invoked on every event |
| Presence | Brief flicker as session connects+disconnects | Continuous; daemon's status reflected |
| `peer list` shows me as | A new ephemeral session each invocation | The daemon's persistent member |
### 8.2 Detection logic — connect, don't trust pidfile
```
1. Check ~/.claudemesh/daemon/<slug>/sock exists.
2. Attempt UDS connect with 100ms timeout.
3. If connect succeeds: send GET /v1/version.
4. If response is well-formed AND mesh matches AND daemon_version is
compatible → use this daemon.
5. Otherwise → cold path.
```
PID liveness check is unreliable (PID reuse, process orphaned). Socket
handshake is canonical.
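
The detection steps, sketched from the CLI's side. Several details are assumptions: that the daemon speaks plain HTTP/1.1 over the UDS and closes after a non-chunked JSON body, that the version response carries a `mesh` field, and that compatibility is judged on `ipc_api` (as §15.2 describes) rather than on the full `daemon_version`.

```python
import json
import socket
from pathlib import Path
from typing import Optional


def probe_daemon(sock_path: Path, timeout_s: float = 0.1) -> Optional[dict]:
    """Steps 1-3: return the parsed /v1/version body, or None if nothing answers."""
    if not sock_path.exists():
        return None
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.settimeout(timeout_s)
            s.connect(str(sock_path))
            s.sendall(b"GET /v1/version HTTP/1.1\r\n"
                      b"Host: localhost\r\nConnection: close\r\n\r\n")
            chunks = []
            while True:
                chunk = s.recv(4096)
                if not chunk:
                    break
                chunks.append(chunk)
        head, _, body = b"".join(chunks).partition(b"\r\n\r\n")
        if not head.startswith(b"HTTP/1.1 200"):
            return None
        return json.loads(body)
    except (OSError, ValueError):  # refused, timed out, stale socket, bad JSON
        return None


def use_daemon(info: Optional[dict], mesh: str, cli_ipc_api: str = "v1") -> bool:
    """Steps 4-5: route through the daemon only if every condition holds."""
    return (isinstance(info, dict)
            and info.get("mesh") == mesh
            and info.get("ipc_api") == cli_ipc_api)
```

A stale socket file left by a crashed daemon fails at `connect`, so the CLI falls through to the cold path without consulting the pidfile.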
### 8.3 Coexistence with `claudemesh launch`
Both can be running for the same mesh:
- Daemon connected as persistent member `runpod-worker-3`.
- A separate `claudemesh launch` connects as ephemeral session of the same
member. Visible to peers as "another session of runpod-worker-3"
(sibling-session relationship via `memberPubkey`).
- CLI verbs from inside `claudemesh launch` route through the launch session,
NOT the daemon (preserves "this Claude Code session has its own ephemeral
identity" semantics).
- CLI verbs from a separate shell route through the daemon (faster, durable).
This is consistent with the v0.5.1 self-DM guard and sibling-session
semantics already shipped.
## 9. Service installation
```bash
claudemesh daemon install-service # writes systemd unit / launchd plist / Windows SC
claudemesh daemon uninstall-service
claudemesh daemon install-service --user # user-scope unit (default; no root)
claudemesh daemon install-service --system # system-scope unit (root; multi-user host)
```
Unit defaults:
- `Restart=on-failure`, `RestartSec=5s`, `StartLimitBurst=5`,
`StartLimitIntervalSec=5min`
- `MemoryMax=<resource cap>`, `TasksMax=128`, `LimitNOFILE=4096`
- `StandardOutput=journal`, `StandardError=journal`
- `NoNewPrivileges=yes`, `PrivateTmp=yes`, `ProtectSystem=strict`,
`ProtectHome=read-only` with `ReadWritePaths=~/.claudemesh`
- For systemd `--user`, runs as the invoking user (no root needed).
`claudemesh install` (the existing setup verb) gains an opt-in prompt:
*"Install as a background service that always runs?"* Defaults differently
based on detected environment (TTY vs no-TTY, presence of systemd, etc.).
## 10. Observability
Standard CLI surface unchanged from v1, with the new gauges/counters:
```
cm_daemon_connected{mesh} 0/1
cm_daemon_reconnects_total{mesh,reason}
cm_daemon_lag_ms{mesh} last broker round-trip
cm_daemon_outbox_depth{mesh,status} pending|inflight|dead
cm_daemon_outbox_age_seconds{mesh} oldest pending row
cm_daemon_dedupe_total{mesh,direction} out|in
cm_daemon_disk_pct{mesh,kind} outbox|inbox
cm_daemon_send_total{mesh,kind,status}
cm_daemon_recv_total{mesh,kind,from_type}
cm_daemon_hook_invocations_total{hook,exit}
cm_daemon_hook_duration_seconds{hook} histogram
cm_daemon_hook_capability_calls_total{hook,scope}
cm_daemon_ipc_request_total{endpoint,status,transport}
cm_daemon_ipc_duration_seconds{endpoint} histogram
cm_daemon_local_token_rotations_total
cm_daemon_clone_suspected_total
```
Tracing: optional OpenTelemetry export.
## 11. SDKs — three, slim, core-API only
Same shape as v1 but only target the **frozen core surface** (§3.1). State /
memory / vector / graph / tasks / MCP / skills are NOT in v0.9.0 SDKs — they
ship per capability gate.
Each SDK auto-discovers the daemon: reads `sock` path, `http.port`,
`local_token`. SDKs versioned in lockstep with the daemon's `/v1` surface.
## 12. Security model — explicit boundaries
| Boundary | Trust | Mechanism |
|---|---|---|
| App ↔ Daemon (UDS) | OS user, FS perms | UDS 0600 |
| App ↔ Daemon (TCP/SSE) | OS user + bearer token | 127.0.0.1 only + `local_token` + Origin/Host check |
| Hook ↔ Daemon | Capability scope | Short-lived capability token, never broker session |
| Daemon ↔ Broker | Mesh keypair | WSS + ed25519 hello + crypto_box DM + per-topic keys |
| Daemon ↔ Disk | OS user | All daemon files mode 0600/0644 under `~/.claudemesh/daemon/` |
| Cloned identity | Host fingerprint check | Daemon refuses to start; dashboard audit event |
## 13. Configuration
`config.toml` — same shape as v1 plus:
- `[capabilities]` (§3.2)
- `[fleet]` (§7.2)
- `[disk] reserved_bytes` (§4.4)
- `[clone] policy = "refuse" | "warn" | "allow"` (§2.2)
User-editable. `claudemesh daemon reload` re-reads it without dropping the WS.
## 14. Lifecycle — the operational flows v1 was missing
### 14.1 Key rotation
```
claudemesh daemon rotate-keypair
```
Mints fresh ed25519 + x25519. Registers new pubkey with broker as a `member_keypair_rotated` operation (broker associates new pubkey with same member id). Old pubkey is held server-side for 24h grace (decrypts in-flight messages encrypted to old pubkey), then revoked.
### 14.2 Local token rotation
```
claudemesh daemon rotate-token
```
Atomically writes a new `local_token` and keeps accepting the old one alongside
the new one for a 60s grace period. SDKs that already hold the old token finish
in-flight requests; new requests use the new token. After 60s, the old token is
rejected.
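
The grace window amounts to holding two tokens, one with an expiry. A sketch with an injectable clock; the class and attribute names are assumptions, and persistence to the `local_token` file is omitted.

```python
import hmac
import secrets
import time
from typing import Callable, Optional


class LocalTokens:
    """Current token plus the previous one, valid for GRACE_S after rotation."""

    GRACE_S = 60.0

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self.clock = clock
        self.current = secrets.token_urlsafe(32)
        self.previous: Optional[str] = None
        self.previous_expires = 0.0

    def rotate(self) -> str:
        """Mint a new token; the old one stays valid until the grace window ends."""
        self.previous = self.current
        self.previous_expires = self.clock() + self.GRACE_S
        self.current = secrets.token_urlsafe(32)
        return self.current

    def is_valid(self, presented: str) -> bool:
        p = presented.encode("utf-8")
        if hmac.compare_digest(p, self.current.encode("utf-8")):
            return True
        return (self.previous is not None
                and self.clock() < self.previous_expires
                and hmac.compare_digest(p, self.previous.encode("utf-8")))
```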
### 14.3 Compromised host revocation
From the dashboard or another mesh-owner session:
```
claudemesh member revoke <pubkey>
```
Broker marks member as revoked. Connected daemon receives `member_revoked`
push, self-disables (refuses new IPC, closes WS), exits with non-zero status,
logs forensic event.
### 14.4 Image-clone lifecycle
Covered in §2.2. Three policies (`refuse`, `warn`, `allow` — settable per-host
via `config.toml`).
### 14.5 Backup & restore
```
claudemesh daemon backup --out <path> # dumps keypair, config, schema_version
claudemesh daemon restore --in <path> # writes them; refuses if a daemon is running
```
Backup is encrypted with a passphrase (Argon2id KDF + crypto_secretbox). The
intent: "I'm reformatting my laptop, I want my mesh memberships back without
re-joining." NOT for "deploy this same identity on 10 servers" (that's the
clone problem above).
### 14.6 Uninstall / reset
```
claudemesh daemon uninstall # full purge: stops, deregisters from broker, wipes ~/.claudemesh/daemon/<slug>
claudemesh daemon reset # wipes local state, keeps broker member registration (for restoring)
```
Uninstall calls broker's `POST /v1/me/members/:pubkey/leave` so member doesn't
linger as ghost. Reset is local-only, no broker contact.
### 14.7 Disk corruption recovery
```
claudemesh daemon recover # interactive: integrity check + offer rebuild paths
```
Detects corrupt `outbox.db` / `inbox.db`. Options:
- Restore from local journal-only inbox (read-only mode; sends disabled).
- Wipe + rebuild from broker (fetches last N days of message history if
available; topics need re-subscribe; outbox is irrecoverable, queued sends are
lost).
- Wipe + start fresh.
## 15. Version compatibility
### 15.1 Negotiation handshake
On daemon connect to broker AND on every IPC request:
```
GET /v1/version
{
"daemon_version": "0.9.0",
"ipc_api": "v1",
"ipc_minor": 3, # additive minor
"schema_version": 7,
"broker_protocol_min": "0.7",
"broker_protocol_max": "0.9"
}
```
### 15.2 Compat policy
| Across | Policy |
|---|---|
| Daemon ↔ Broker | Daemon refuses to connect if broker version < daemon's `broker_protocol_min`. Broker logs warning. Pre-1.0 we may break this with notice; post-1.0 we maintain backward compat for ≥6 months. |
| CLI ↔ Daemon | CLI checks daemon's `ipc_api`. Same major = OK. Different major = CLI falls back to cold-path with warning. |
| SDK ↔ Daemon | SDK negotiates `ipc_minor`; uses minimum of (SDK's, daemon's). |
| Daemon binary ↔ schema | Binary refuses to start on unknown schema; migrations run forward-only; no automatic downgrade. |
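
The first three rows of the policy table are simple comparisons. A sketch, assuming dotted numeric version strings and hypothetical function names:

```python
from typing import Tuple


def parse_version(v: str) -> Tuple[int, ...]:
    """'0.7.2' -> (0, 7, 2). Assumes purely numeric dotted components."""
    return tuple(int(part) for part in v.split("."))


def broker_ok(broker_version: str, broker_protocol_min: str) -> bool:
    """Daemon <-> Broker: refuse to connect below the daemon's minimum."""
    return parse_version(broker_version) >= parse_version(broker_protocol_min)


def cli_can_use_daemon(cli_ipc_api: str, daemon_ipc_api: str) -> bool:
    """CLI <-> Daemon: same major ('v1' == 'v1') or fall back to the cold path."""
    return cli_ipc_api == daemon_ipc_api


def negotiated_minor(sdk_minor: int, daemon_minor: int) -> int:
    """SDK <-> Daemon: both sides use the lower additive minor."""
    return min(sdk_minor, daemon_minor)
```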
### 15.3 Compatibility matrix (published in docs, machine-readable JSON at /v1/compat)
```json
{
"daemon": "0.9.0",
"compatible_brokers": ["0.7.x", "0.8.x", "0.9.x"],
"compatible_clis": ["0.9.x"],
"compatible_sdks": {
"python": ">=0.9.0,<1.0.0",
"go": ">=0.9.0,<1.0.0",
"ts": ">=0.9.0,<1.0.0"
}
}
```
## 16. Threat model
### 16.1 Attacker classes
| Attacker | Has | Wants | Mitigations |
|---|---|---|---|
| Local same-user shell | OS user creds | Send / read mesh messages | None needed — they already have FS access to keypair; daemon is no worse |
| Local different-user shell | Different OS user | Read this user's daemon | UDS 0600 + TCP loopback + token. Requires OS exploit to escalate |
| Browser SSRF | Loopback HTTP | Send messages, read inbox | `local_token` + Origin/Host check + non-default port. SSRF without token cannot succeed |
| Container side-channel | Same loopback namespace | Read another container's daemon | Containers share host loopback only if explicitly net=host. `local_token` defends. Recommended: bind UDS only inside containers |
| Compromised hook | Capability token in env | Use that scope | Capability tokens are scoped + short-lived; cannot escalate |
| Compromised broker | Full mesh visibility on its side | Deliver malicious messages, identity-impersonate | E2E encryption (crypto_box DMs, per-topic keys) — broker can't read content. Out-of-scope for daemon |
| Cloned VM image | Same keypair on two hosts | Identity collision | Host fingerprint detection + dashboard audit + `--remint` flow |
| Stolen laptop | Disk access | Mesh impersonation forever | `member revoke` from dashboard. Without disk encryption, this is the user's laptop security; documented in security guide |
| Untrusted hook author | Hook script content | Exfil mesh data | Hook is on disk YOU control. If you ran `git pull` on a malicious hooks/ repo, that's a code-supply-chain attack out of scope for the daemon |
### 16.2 Out of scope
- Defending against an attacker with root on the daemon host. They can read
`keypair.json` directly.
- Defending against malicious peers in the same mesh sending malformed
payloads. Daemon validates structure but trusts mesh members.
- Defending against compromised broker. Out-of-scope for daemon; mesh-level
E2E protects content but not metadata.
## 17. Migration — what changes for existing users
Same as v1. Additive. No DB migration on broker. Existing
`~/.claudemesh/config.json` consumed unchanged. `claudemesh launch` keeps
working; daemon is opt-in.
---
## What needs review (round 2)
Round 1 produced: identity model needs `--ephemeral` + clone-detect, IPC needs
local token, "exactly-once" was a lie, hooks needed scoped credentials, surface
needed shrinking, missing rotation/recovery/migration/threat-model.
This v2 attempts to address all of them. Specifically critique:
1. **Has the identity model fully closed the clone problem?** Refuses-on-fingerprint-mismatch
plus broker audit plus mesh-owner revoke — does this catch a sophisticated
attacker who copies `host_fingerprint.json` along with the keypair?
2. **Is the local-token model sufficient for browser-SSRF defense?**
Token + Origin + Host checks + 127.0.0.1-only. Anything else needed?
3. **The delivery contract** (§4) — is it now defensible? Does the inflight-recovery
semantics + idempotency-key propagation produce the guarantees claimed?
4. **Hook capability tokens** (§6.2) — short-lived, scoped, expire on hook exit.
Does this fully eliminate the exfil footgun? What capability scopes are
actually needed for v0.9.0 hooks?
5. **Frozen v0.9.0 surface** (§3.1) — is the cut right? Should `peer list` be
in core or capability-gated? Should `inbox/search` ship in v0.9.0?
6. **Threat model** (§16) — anything missing? Specifically thinking about CI
environments where the daemon's host is a fleet shared across many users'
builds.
7. **Lifecycle flows** (§14) — image clones, key rotation, host moves, disk
corruption, uninstall semantics. Anything still missing?
8. **Version compat** (§15) — is the negotiation handshake sufficient, or do
we need stronger guarantees (e.g. semver-strict, or a feature-bit
negotiation rather than version numbers)?
Score 1–5 each. Top 3 changes you'd insist on for v3, if any. If you think v2
is shippable, say so explicitly — over-engineering is a real risk.


@@ -0,0 +1,648 @@
# `claudemesh daemon` — Final Spec v3
> **Round 3.** v2 of this spec was reviewed by another model and pushed back on
> identity/clone semantics (boot-id false-positives), delivery contract (broker
> must dedupe on client-supplied id — protocol change), CI shared-runner threat
> model, version negotiation (need feature bits, not ranges), key rotation
> crypto, hook scope granularity, inbox schema correctness, and ~7 smaller
> polish items. v3 incorporates all of them.
>
> **The intent §0 from v2 is unchanged and still authoritative — read it
> there.** v3 only revises what changed.
---
## 0. Intent — unchanged, see v2 §0
Pre-launch peer-mesh runtime. Servers/laptops become first-class peers.
Stable identity, persistent WS, local IPC, hooks. Not a webhook gateway, not
a generic broker. We can break anything.
**One claim retracted from v1/v2**: "exactly-once" delivery. Replaced with a
precise contract in §4 below.
---
## 1. Process model — same as v2 §1
Resource caps, file layout, single-binary unchanged.
---
## 2. Identity — accidental-clone detection only, plus broker dedupe
Codex was right: v2's clone detection was both too weak (anyone copying
`host_fingerprint.json` along with `keypair.json` defeats it) and too noisy
(boot-id flips every reboot → false-positives on every legitimate restart).
### 2.1 Modes
```
claudemesh daemon up # default: persistent member
claudemesh daemon up --ephemeral # in-memory keypair, never written
claudemesh daemon up --ephemeral --ttl 2h # auto-shutdown after duration
```
**CI auto-detection** (NEW): if any of the following env vars are set
(`CI=true`, `GITHUB_ACTIONS`, `GITLAB_CI`, `BUILDKITE`, `CIRCLECI`,
`JENKINS_URL`, `RUNPOD_POD_ID`, `KUBERNETES_SERVICE_HOST`), AND `--persistent`
is not explicitly passed, daemon defaults to `--ephemeral`. Rationale in §16.
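
A minimal sketch of the auto-detection rule above, in Python for illustration only. The function name `default_identity_mode` and the choice to treat an empty-string variable as unset are assumptions; the variable list is the one given in this section.

```python
from typing import Mapping

# Variables whose presence signals a CI-like environment (list from this section).
CI_PRESENCE_VARS = (
    "GITHUB_ACTIONS", "GITLAB_CI", "BUILDKITE", "CIRCLECI",
    "JENKINS_URL", "RUNPOD_POD_ID", "KUBERNETES_SERVICE_HOST",
)


def default_identity_mode(env: Mapping[str, str], persistent_flag: bool = False) -> str:
    """Return "persistent" or "ephemeral" for a `daemon up` with no --ephemeral flag."""
    if persistent_flag:
        # An explicit --persistent always overrides auto-detection.
        return "persistent"
    # Assumption: CI must equal "true"; the other variables only need a non-empty value.
    in_ci = env.get("CI") == "true" or any(env.get(name) for name in CI_PRESENCE_VARS)
    return "ephemeral" if in_ci else "persistent"
```

In a real daemon the mapping passed in would be `os.environ`; taking it as a parameter keeps the rule testable.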
### 2.2 Accidental-clone detection (NOT attacker-grade)
Frame change: this catches **image clones, restored backups, copy-pasted
homedirs** — accidents made by humans operating at human speed. It does not
defend against an attacker who copies both `keypair.json` and
`host_fingerprint.json`. The threat model (§16) says this explicitly.
Persisted fingerprint = `sha256(machine-id || first-stable-mac)`. Notably:
- **No boot-id** — that flips on every reboot and would false-positive
every legitimate restart.
- **No hostname** — laptops legitimately rename themselves.
- **`first-stable-mac`** = MAC of the lexicographically first non-loopback,
non-virtual interface present at first daemon boot. Frozen at first run;
not recomputed.
Behavior on mismatch:
- Default policy: refuse to start. Print: *"This keypair was created on a
different host. If you legitimately moved hardware, run
`claudemesh daemon accept-host` (writes a fresh fingerprint, keeps keypair).
If this is a clone of an existing daemon, run `claudemesh daemon remint`
(mints fresh keypair, registers as a new member)."*
- `[clone] policy = "refuse" | "warn" | "allow"` overrides per host.
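
The fingerprint and mismatch policy can be sketched as follows. This is illustrative Python, not the daemon's code: it assumes `||` means plain concatenation of the UTF-8 strings, and the names `host_fingerprint` and `startup_decision` and the returned labels are hypothetical.

```python
import hashlib


def host_fingerprint(machine_id: str, first_stable_mac: str) -> str:
    """sha256(machine-id || first-stable-mac), hex-encoded."""
    # Assumption: "||" is byte concatenation with no separator.
    data = machine_id.encode("utf-8") + first_stable_mac.encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def startup_decision(stored_fp: str, current_fp: str, policy: str = "refuse") -> str:
    """Map a fingerprint comparison and the `[clone] policy` value to an action."""
    if stored_fp == current_fp:
        return "start"
    if policy == "refuse":
        return "refuse"              # print the accept-host / remint guidance
    if policy == "warn":
        return "start_with_warning"
    if policy == "allow":
        return "start"
    raise ValueError(f"unknown clone policy: {policy!r}")
```

Note that the stored fingerprint is frozen at first run, so only `current_fp` is recomputed at startup.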
### 2.3 Concurrent-duplicate-identity broker policy (NEW — protocol change)
When the broker receives two WS connections claiming the same member pubkey:
- **`prefer_newest`** (default): older connection is closed with code 4003
`replaced_by_newer_connection`. New connection takes over presence/inbox
delivery. Daemon-side: receives the close code, logs forensic event, exits
with non-zero status (lets supervisor restart it; if the *other* host is
the legitimate one, supervisor restart-loops are noisy enough to alert).
- **`prefer_oldest`**: new connection is rejected with code 4004
`member_already_connected`. The new daemon refuses to start.
- **`allow_concurrent`** (new mode, server-side feature flag): both
connections accepted; broker tracks both as sibling sessions of the same
member (same model as `claudemesh launch` siblings today). Useful when a
user really does want one keypair on multiple hosts (e.g. failover pairs).
Configured per-mesh in `mesh.cloneConcurrencyPolicy`. Default:
`prefer_newest`. Broker emits `member_concurrent_connection` audit event in
all cases.
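
As a hedged illustration of the three policies, the broker-side decision could look like the sketch below. The `Decision` shape and `resolve_duplicate` name are hypothetical; the close codes, reasons, and audit event name are the ones listed above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Decision:
    accept_new: bool
    close_old: Optional[Tuple[int, str]]    # (close code, reason) for the older socket
    reject_new: Optional[Tuple[int, str]]   # (close code, reason) for the newer socket
    audit_event: str = "member_concurrent_connection"  # emitted in all cases


def resolve_duplicate(policy: str = "prefer_newest") -> Decision:
    """Decide what happens when a second WS claims an already-connected member pubkey."""
    if policy == "prefer_newest":
        return Decision(True, (4003, "replaced_by_newer_connection"), None)
    if policy == "prefer_oldest":
        return Decision(False, None, (4004, "member_already_connected"))
    if policy == "allow_concurrent":
        # Both sockets stay; the broker tracks them as sibling sessions.
        return Decision(True, None, None)
    raise ValueError(f"unknown cloneConcurrencyPolicy: {policy!r}")
```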
### 2.4 Rename, key rotation — see §14
---
## 3. IPC surface — frozen core, hardened auth
### 3.1 Frozen core (v0.9.0) — slight cut from v2
Codex agreed v2's cut was mostly right, except: defer FTS-search to a
capability gate, keep `peer list` in core, drop redundancies.
```
# Messaging
POST /v1/send {to, message, priority?, meta?, replyToId?,
client_message_id?}
POST /v1/topic/post {topic, message, priority?, mentions?,
client_message_id?}
POST /v1/topic/subscribe {topic}
POST /v1/topic/unsubscribe {topic}
GET /v1/topic/list
GET /v1/inbox ?since=<iso>&topic=<n>&from=<peer>&limit=<n>
# plain SQL paging; NO FTS in v0.9.0
# Peers + presence (kept in core — central to "first-class peer")
GET /v1/peers ?mesh=<slug>
POST /v1/profile {summary?, status?, visible?}
# Files (already production)
POST /v1/file/share {path, to?, message?, persistent?}
GET /v1/file/get ?id=<fileId>&out=<path>
GET /v1/file/list
# Events — push
GET /v1/events text/event-stream
core events: message, peer_join, peer_leave, file_shared,
daemon_disconnect, daemon_reconnect, hook_executed,
feature_negotiation_failed
# Control plane
GET /v1/health (auth required by default — see §3.3)
GET /v1/metrics (auth required by default)
GET /v1/version (auth required by default)
POST /v1/heartbeat {}
```
`inbox/search` with FTS deferred to v0.9.x capability gate `inbox_fts`.
### 3.2 Capability-gated future surface (v0.9.x)
Same as v2 §3.2 — state, memory, vector, graph, tasks, scheduling,
mcp_host, skill_share, plus new `inbox_fts`. None enabled in v0.9.0.
### 3.3 Local IPC authentication — tightened
Same shape as v2 §3.3 but with codex's polish folded in:
| Transport | Auth | Notes |
|---|---|---|
| UDS | None (FS perms 0600) | Reaching socket = same UID |
| TCP loopback | `Authorization: Bearer <local_token>` REQUIRED | 127.0.0.1 only |
| SSE | `Authorization: Bearer <local_token>` REQUIRED | same |
**Token plumbing rules (NEW):**
- `local_token` MUST be in the `Authorization` header. **Never** accepted in
query string. Endpoint that sees a `?token=...` query param logs a security
event and returns 400.
- `local_token` MUST be redacted from access logs (`Authorization: Bearer
***` in logs).
- `local_token` rotation atomically writes a new file; SDKs hold the OLD
token valid for 60s grace, then it's rejected.
**Endpoint default auth (NEW — codex):**
- Every IPC endpoint requires the local token by default, **including**
`/v1/health`, `/v1/metrics`, `/v1/version`. `[ipc] public_health_check =
true` opts in to public `/v1/health` for k8s probes etc.
**Container default (NEW — codex):**
- If `KUBERNETES_SERVICE_HOST` is set OR `/.dockerenv` exists OR
`/proc/1/cgroup` indicates a container OR explicit `--container` flag,
daemon defaults to **UDS-only** (`[ipc] tcp_enabled = false`). Containers
share host loopback when `network_mode: host`; UDS-only avoids the
side-channel.
**Origin/Host policy:**
- `Host` header must be `localhost`, `127.0.0.1`, `[::1]` or empty. Else 403.
- `Origin` header: explicit allowlist (default: empty). SSRF-from-browser
bounce-attack defense.
- `User-Agent` requirement DROPPED (codex called it theatre — correct).
- CORS: never echo `Access-Control-Allow-Origin`; preflight returns 403.
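
The token and Origin/Host rules for the TCP loopback transport can be sketched as a single request check. This is an illustration under stated assumptions: the function name, the order of checks, tolerating a `:port` suffix on `Host`, and the 401 status for a bad or missing token are not specified by the spec. Header names are assumed to be already normalized.

```python
import hmac
from typing import Collection, Mapping

ALLOWED_HOSTS = {"localhost", "127.0.0.1", "[::1]", ""}


def _host_without_port(host: str) -> str:
    # Assumption: a trailing ":port" is tolerated and ignored.
    if host.startswith("["):
        end = host.find("]")
        return host[: end + 1] if end != -1 else host
    return host.split(":", 1)[0]


def check_tcp_request(
    headers: Mapping[str, str],
    query: Mapping[str, str],
    local_token: str,
    allowed_origins: Collection[str] = (),
) -> int:
    """Return 200 if the request may proceed, else the HTTP status to reject with."""
    if "token" in query:
        return 400  # token in query string: log a security event, reject
    if _host_without_port(headers.get("Host", "")) not in ALLOWED_HOSTS:
        return 403
    origin = headers.get("Origin")
    if origin is not None and origin not in allowed_origins:
        return 403  # default allowlist is empty
    supplied = headers.get("Authorization", "")
    expected = f"Bearer {local_token}"
    # Constant-time comparison to avoid leaking the token through timing.
    if not hmac.compare_digest(supplied.encode(), expected.encode()):
        return 401  # assumption: the spec does not name this status
    return 200
```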
### 3.4 Request limits & backpressure — same as v2
---
## 4. Delivery contract — at-least-once, broker-dedupes-on-client-id
Codex caught the real protocol gap: idempotency only works if the broker
dedupes on the **caller's** id, not its own. This requires a broker change.
### 4.1 The contract (precise)
> **Local guarantee**: each successful `POST /v1/send` returns a stable
> `client_message_id`. The send is durably persisted to `outbox.db` before
> the response returns.
>
> **Broker guarantee**: the broker dedupes on `client_message_id` for a
> 24h window. Multiple inflight retries from the daemon for the same
> `client_message_id` produce **at most one** broker-accepted row.
>
> **End-to-end guarantee**: at-least-once delivery to subscribers, with
> `client_message_id` propagated in the inbound envelope so receivers can
> dedupe locally on their side. We do **not** guarantee at-most-once
> end-to-end — that requires receiver-side dedupe, which the daemon's
> inbox.db provides for daemon-hosted peers.
### 4.2 Daemon-supplied `client_message_id` (NEW — broker protocol change)
Every send has a stable id minted **on the daemon**, not the broker:
- Caller-supplied via `Idempotency-Key` header → wins.
- Caller-supplied in body as `client_message_id` field → second.
- Else daemon mints a `ulid` → last.
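
The precedence above is small enough to sketch directly. The function names are hypothetical, and because the Python standard library has no ULID generator, the sketch substitutes a random UUID hex string for the daemon-minted `ulid`.

```python
import uuid
from typing import Any, Callable, Mapping


def _mint_id() -> str:
    # Stand-in only: the spec calls for a ULID; a UUID4 hex string is used here.
    return uuid.uuid4().hex


def resolve_client_message_id(
    headers: Mapping[str, str],
    body: Mapping[str, Any],
    mint: Callable[[], str] = _mint_id,
) -> str:
    """Pick the id: Idempotency-Key header, then body field, then daemon-minted."""
    header_id = headers.get("Idempotency-Key")
    if header_id:
        return header_id
    body_id = body.get("client_message_id")
    if body_id:
        return str(body_id)
    return mint()
```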
The id is:
- Returned in the IPC response.
- Stored in `outbox.db` as a UNIQUE NOT NULL column (real dedupe, not
`INSERT OR IGNORE` on nullable — codex caught this).
- Propagated to the broker on every retry (`client_message_id` field in the
WS send envelope and in `POST /v1/messages`).
- Stored in the broker's `meshTopicMessage.client_message_id` column with a
`UNIQUE` constraint scoped to `(meshId, client_message_id)`.
- Propagated in the inbound delivery to receivers' inboxes.
**Broker behavior on duplicate `client_message_id`**: returns the
already-stored `messageId` and `historyId` from the prior insertion. No new
row, no new fan-out, idempotent.
### 4.3 Broker schema delta (NEW)
```sql
ALTER TABLE mesh.topic_message
ADD COLUMN client_message_id TEXT;
ALTER TABLE mesh.message_queue
ADD COLUMN client_message_id TEXT;
CREATE UNIQUE INDEX topic_message_client_id_idx
ON mesh.topic_message(mesh_id, client_message_id)
WHERE client_message_id IS NOT NULL;
CREATE UNIQUE INDEX message_queue_client_id_idx
ON mesh.message_queue(mesh_id, client_message_id)
WHERE client_message_id IS NOT NULL;
```
Partial unique index — legacy traffic without `client_message_id` (from
`claudemesh launch`, dashboard chat, web posts) is unaffected.
### 4.4 Outbox schema (corrected)
```sql
CREATE TABLE outbox (
id TEXT PRIMARY KEY, -- ulid (local row id)
client_message_id TEXT NOT NULL UNIQUE, -- propagated to broker
payload BLOB NOT NULL,
enqueued_at INTEGER NOT NULL,
attempts INTEGER DEFAULT 0,
next_attempt_at INTEGER NOT NULL,
status TEXT CHECK(status IN ('pending','inflight','done','dead')),
last_error TEXT,
delivered_at INTEGER,
broker_message_id TEXT -- set on ACK
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
```
`UNIQUE NOT NULL` on `client_message_id`: caller retries with the same id
collide locally and become a no-op.
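
To illustrate the local no-op, here is a sketch against an in-memory SQLite database using the schema above. The helper names `open_outbox` and `enqueue` are hypothetical, the local row id uses a UUID as a stand-in for a ULID, and `ON CONFLICT ... DO NOTHING` assumes SQLite 3.24 or newer.

```python
import sqlite3
import time
import uuid
from typing import Optional

OUTBOX_SCHEMA = """
CREATE TABLE outbox (
  id                 TEXT PRIMARY KEY,
  client_message_id  TEXT NOT NULL UNIQUE,
  payload            BLOB NOT NULL,
  enqueued_at        INTEGER NOT NULL,
  attempts           INTEGER DEFAULT 0,
  next_attempt_at    INTEGER NOT NULL,
  status             TEXT CHECK(status IN ('pending','inflight','done','dead')),
  last_error         TEXT,
  delivered_at       INTEGER,
  broker_message_id  TEXT
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
"""


def open_outbox(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(OUTBOX_SCHEMA)
    return conn


def enqueue(conn: sqlite3.Connection, client_message_id: str, payload: bytes,
            now: Optional[int] = None) -> bool:
    """Insert a pending send; return False if this client_message_id already exists."""
    now = int(time.time()) if now is None else now
    cur = conn.execute(
        """INSERT INTO outbox
             (id, client_message_id, payload, enqueued_at, next_attempt_at, status)
           VALUES (?, ?, ?, ?, ?, 'pending')
           ON CONFLICT(client_message_id) DO NOTHING""",
        (uuid.uuid4().hex, client_message_id, payload, now, now),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows changed means the retry collided
```

A caller retry with the same id returns `False` and leaves the original row, including its `attempts` and `status`, untouched.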
### 4.5 Inbox schema (corrected — content table + FTS index)
Codex caught: FTS5 virtual tables are not where you put `CREATE INDEX`.
Real shape:
```sql
-- Content table — the durable store
CREATE TABLE inbox (
id TEXT PRIMARY KEY, -- ulid (local row id)
client_message_id TEXT NOT NULL UNIQUE, -- dedupe key
broker_message_id TEXT,
mesh TEXT NOT NULL,
topic TEXT,
sender_pubkey TEXT NOT NULL,
sender_name TEXT NOT NULL,
body TEXT,
meta TEXT, -- JSON
received_at INTEGER NOT NULL,
reply_to_id TEXT
);
CREATE INDEX inbox_received_at ON inbox(received_at);
CREATE INDEX inbox_topic ON inbox(topic);
CREATE INDEX inbox_sender ON inbox(sender_pubkey);
-- FTS5 index — gated behind capability `inbox_fts` (deferred to v0.9.x)
-- When enabled, populated via triggers; absent in v0.9.0.
```
Insert path: `INSERT INTO inbox(...) ON CONFLICT(client_message_id) DO
NOTHING RETURNING id`. The `RETURNING` clause tells us whether a new row
landed; only new rows trigger hooks.
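
A sketch of that insert path, using Python's `sqlite3` against the content table above. For portability to SQLite builds older than 3.35, which lack `RETURNING`, this sketch checks `rowcount` instead; that substitution, the helper name `store_inbound`, and the `fire_hook` callback are assumptions for illustration.

```python
import sqlite3
import uuid
from typing import Any, Callable, Mapping

INBOX_SCHEMA = """
CREATE TABLE inbox (
  id                 TEXT PRIMARY KEY,
  client_message_id  TEXT NOT NULL UNIQUE,
  broker_message_id  TEXT,
  mesh               TEXT NOT NULL,
  topic              TEXT,
  sender_pubkey      TEXT NOT NULL,
  sender_name        TEXT NOT NULL,
  body               TEXT,
  meta               TEXT,
  received_at        INTEGER NOT NULL,
  reply_to_id        TEXT
);
"""


def store_inbound(conn: sqlite3.Connection, msg: Mapping[str, Any],
                  fire_hook: Callable[[Mapping[str, Any]], None]) -> bool:
    """Persist an inbound message; fire hooks only when a new row landed."""
    cur = conn.execute(
        """INSERT INTO inbox
             (id, client_message_id, mesh, sender_pubkey, sender_name, body, received_at)
           VALUES (?, ?, ?, ?, ?, ?, ?)
           ON CONFLICT(client_message_id) DO NOTHING""",
        (uuid.uuid4().hex, msg["client_message_id"], msg["mesh"],
         msg["sender_pubkey"], msg["sender_name"], msg.get("body"),
         msg["received_at"]),
    )
    conn.commit()
    is_new = cur.rowcount == 1
    if is_new:
        fire_hook(msg)  # duplicates (redelivered messages) never reach hooks
    return is_new
```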
### 4.6 Crash recovery — explicit semantics
On daemon startup:
1. Rows in `inflight` reset to `pending` with `attempts++`,
`next_attempt_at = now + min_backoff`. **Note:** these may double-deliver
if the broker actually accepted before the local ACK persisted. The
`client_message_id` propagation ensures the broker dedupes the retry —
net result: exactly one broker-accepted row, possibly two daemon-side
`inflight → done` transitions.
2. `outbox.db` PRAGMA integrity_check; failure → daemon refuses to start,
point at `claudemesh daemon recover`.
3. `inbox.db` integrity check; failure → move to `inbox.db.corrupt-<ts>`,
create fresh empty inbox, log `inbox_corruption_recovered`. Inbox is a
cache; recoverable from broker history.
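
Steps 1 and 2 can be sketched with SQLite's own primitives. The helper names and the `min_backoff` parameter (in the same time unit as `next_attempt_at`) are assumptions; `PRAGMA integrity_check` returning a single `ok` row on a healthy database is standard SQLite behavior.

```python
import sqlite3


def integrity_ok(conn: sqlite3.Connection) -> bool:
    """True when SQLite reports the database file as structurally sound."""
    return conn.execute("PRAGMA integrity_check").fetchone()[0] == "ok"


def requeue_inflight(conn: sqlite3.Connection, now: int, min_backoff: int) -> int:
    """Reset rows stuck in 'inflight' after a crash; return how many were reset."""
    cur = conn.execute(
        """UPDATE outbox
              SET status = 'pending',
                  attempts = attempts + 1,
                  next_attempt_at = ?
            WHERE status = 'inflight'""",
        (now + min_backoff,),
    )
    conn.commit()
    return cur.rowcount
```

Rows already in `done` or `dead` are left alone; only work that was mid-flight at crash time is retried, and the retry carries the same `client_message_id`.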
### 4.7 Failure modes the spec is honest about
- **Broker dedupe window expired**: if the daemon retries a send older than
the broker's 24h dedupe window (say, 25h old), the broker accepts it again
as if new. With the previous outbox default of `max_age_hours` = 168h
(7d) this was possible, so the default is REDUCED to **23h** to stay
inside the broker dedupe window. It can be raised only if the operator
explicitly accepts the risk.
- **`dead` rows**: surface in `claudemesh daemon outbox --failed`. User
manually requeues (`outbox requeue <id>`) or drops (`outbox drop <id>`).
- **Receiver-side dedupe failure**: only daemon-hosted receivers dedupe.
`claudemesh launch` and dashboard chat clients DO NOT dedupe today —
fixing them is post-v0.9.0.
---
## 5. Inbound — schema corrected (see §4.5), retention as v2
30-day rolling retention (configurable). Weekly VACUUM.
`claudemesh daemon search` deferred to `inbox_fts` capability.
---
## 6. Hooks — scopes tightened, exfiltration acknowledged
Codex was right: capability tokens removed the broad-token footgun, not
exfiltration. Running untrusted hook payloads under `network_policy=deny`
is not reliable across platforms. The spec is now honest about that.
### 6.1 Hooks contract — same shape as v2 §6, with tighter defaults
### 6.2 Capability scopes — narrowed for v0.9.0
Codex pushed: scopes were too coarse. v0.9.0 scopes are exactly:
| Scope | Capability | Notes |
|---|---|---|
| `reply:event` | Reply to the specific event that triggered this hook | Bound to `event_id`; daemon validates target; expires on hook exit |
| `dm:send:<sender_pubkey>` | Send DM only to the specific sender | Bound to one pubkey from event; not a write to anyone |
| `topic:<name>:post` | Post to the specific topic that fired | Bound to topic from event; can't write elsewhere |
**No read scopes in v0.9.0.** A hook cannot read state, inbox, peers, etc.
If a hook wants to consult mesh data to compose its reply, it does so via
the *event payload* (which the daemon redacted appropriately) or via shell
out to a fresh `claudemesh <verb>` call (which uses the user's existing
config and is subject to its own auth). No daemon-mediated read tokens.
### 6.3 Sandboxing — supported, not promised
Codex caught: "network_policy=deny" sounds reliable but isn't cross-platform.
Spec now says explicitly:
- `network_policy = "deny"` is **best-effort**:
- Linux: enforced via `unshare --net` if available; else firewall rule via
`iptables -m owner` if available; else daemon logs warning that policy
cannot be enforced and the hook STILL runs.
- macOS: enforced via `sandbox-exec` profile if available; else warning + run.
- Windows: not enforced; warning + run.
- Operators on hostile networks should set `enabled = false` for hooks they
don't trust.
- Daemon `cm_daemon_hook_unenforceable_total` counter exposes the count of
hooks that ran with weakened sandbox.
### 6.4 Payload size & truncation — NEW
Stdin payloads to hooks capped at 256 KB (configurable). Larger payloads
truncated with `_truncated: true` flag in the JSON event. Hook stdout
captured up to `output_size_limit` (default 64 KB).
### 6.5 Audit log + killpg — same as v2
---
## 7. Multi-mesh — same as v2 §7
---
## 8. Auto-routing — same as v2 §8 (codex agreed it was clarified correctly)
---
## 9. Service installation — same as v2 §9
Add: when `claudemesh daemon install-service` runs in CI-detected
environment, prints `Refusing to install persistent service in CI; ephemeral
mode only.` and exits non-zero unless `--allow-ci-persistent` is passed.
---
## 10. Observability — same as v2 §10
Add metric: `cm_daemon_hook_unenforceable_total{hook,reason}` (§6.3).
---
## 11. SDKs — same shape as v2, bound to frozen core only
---
## 12. Security model — same boundaries, plus dedupe + feature negotiation
| Boundary | Trust | Mechanism |
|---|---|---|
| App ↔ Daemon (UDS) | OS user | UDS 0600 |
| App ↔ Daemon (TCP/SSE) | OS user + bearer token | 127.0.0.1 + `local_token` + Origin/Host |
| Hook ↔ Daemon | Capability scope | Short-lived token bound to event; no read scopes |
| Daemon ↔ Broker | Mesh keypair + feature bits | WSS + ed25519 + crypto_box + per-topic keys + feature negotiation (§15) |
| Daemon ↔ Disk | OS user | All files 0600/0644 |
| Cloned identity | First-mac fingerprint | Accidental-clone detection only; broker concurrent-connection policy in §2.3 |
---
## 13. Configuration — same shape as v2 §13, plus `[features]`
```toml
[features]
require = ["client_message_id_dedupe", "concurrent_connection_policy"]
optional = ["mesh_skill_share", "mcp_host"]
# Daemon refuses to start if broker doesn't advertise all `require` bits.
```
---
## 14. Lifecycle — key rotation crypto fixed
### 14.1 Key rotation (CORRECTED — codex)
v2 said: *"old pubkey held server-side for 24h grace (decrypts in-flight
messages encrypted to old pubkey)"*. **Wrong** — only the daemon has the
private key. Broker can't decrypt.
Real semantics:
- `claudemesh daemon rotate-keypair` mints fresh ed25519 + x25519, registers
the new pubkey with the broker as `member_keypair_rotated`.
- Broker associates the new pubkey with the same member id, marks the old
pubkey as `rotated_out` (not revoked).
- **Daemon-side**: the OLD x25519 private key is retained in
`keypair-archive.json` (mode 0600, durable) for a `key_grace_period`
(default 7 days). During the grace window, daemon will attempt to decrypt
inbound messages with the new private key first, falling back to archived
keys (one or more). Messages encrypted to the old pubkey by senders who
haven't yet seen the rotation event continue to decrypt cleanly.
- After the grace period, archived keys are zeroed and the file is deleted.
Messages encrypted to a stale pubkey after the grace window fail to
decrypt and are logged as `cm_daemon_decrypt_stale_total`.
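
The decrypt order during the grace window can be sketched independently of the actual crypto. Everything named here is hypothetical: `ArchivedKey`, its `expires_at` field, and the injected `try_decrypt(key, ciphertext)` callable, which stands in for the real x25519/crypto_box open and returns `None` on failure.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class ArchivedKey:
    private_key: bytes
    expires_at: int  # hypothetical: unix seconds at which the grace period ends


def decrypt_with_fallback(
    ciphertext: bytes,
    current_key: bytes,
    archive: Sequence[ArchivedKey],
    try_decrypt: Callable[[bytes, bytes], Optional[bytes]],
    now: int,
) -> Optional[bytes]:
    """Try the new private key first, then any archived key still inside grace."""
    plaintext = try_decrypt(current_key, ciphertext)
    if plaintext is not None:
        return plaintext
    for entry in archive:
        if entry.expires_at <= now:
            continue  # past grace: this key should already be zeroed
        plaintext = try_decrypt(entry.private_key, ciphertext)
        if plaintext is not None:
            return plaintext
    return None  # caller would increment cm_daemon_decrypt_stale_total
```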
### 14.2 Backup includes topic state (CORRECTED)
`claudemesh daemon backup` now packages:
- `keypair.json` (current)
- `keypair-archive.json` (any in-grace-window archived keys)
- `host_fingerprint.json`
- `config.toml`
- `local_token` is NOT included (the token is regenerated on restore)
- `topic_subscriptions.json` (which topics this daemon subscribes to)
- `topic_keys.json` (per-topic symmetric keys this member holds)
- `key_epoch.json` (current epoch number per topic; relevant when the mesh
rotates topic keys)
- `schema_version`
Backup file: encrypted with a passphrase (Argon2id KDF + crypto_secretbox).
Restore writes everything except `local_token` (regenerated). On first run
after restore, daemon performs `accept-host` if fingerprint mismatches
(restore is by definition a host change).
### 14.3 Local token rotation, compromised host revocation, image-clone, uninstall, recovery — same as v2 §14
---
## 15. Version compat — feature-bit negotiation (REPLACES v2 §15)
Codex was right: version ranges aren't enough when daemon depends on
specific broker capabilities (client-supplied IDs, concurrent-connection
policy, key epochs).
### 15.1 Feature bits
Each protocol-relevant capability gets a stable string identifier:
```
client_message_id_dedupe broker dedupes on client_message_id (§4.2)
concurrent_connection_policy broker honours mesh.cloneConcurrencyPolicy (§2.3)
member_keypair_rotated_event broker emits the event (§14.1)
key_epoch per-topic key epochs supported (§14.2)
mesh_skill_share post-v0.9, future
mcp_host post-v0.9, future
```
### 15.2 Negotiation handshake
On WS connect (after hello, before normal traffic):
```
→ daemon: feature_negotiation_request
{ require: ["client_message_id_dedupe",
"concurrent_connection_policy"],
optional: ["mesh_skill_share","mcp_host"] }
← broker: feature_negotiation_response
{ supported: ["client_message_id_dedupe",
"concurrent_connection_policy",
"member_keypair_rotated_event"],
missing_required: [] }
```
If `missing_required` is non-empty, daemon closes the connection with code
4010 `feature_unavailable`, logs forensic event, exits with non-zero status.
Supervisor sees a restart-loop → operator alerted via configured
mechanisms.
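
A minimal sketch of both sides of the handshake, assuming (as the example response suggests) that `supported` is the broker's full advertised list rather than an echo of the request. The function names are hypothetical; the close code and reason are from the text above.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def build_negotiation_response(
    broker_supported: Sequence[str], require: Sequence[str]
) -> Dict[str, List[str]]:
    """Broker side: advertise what is supported and name any required gaps."""
    supported = list(broker_supported)
    missing = [bit for bit in require if bit not in supported]
    return {"supported": supported, "missing_required": missing}


def daemon_close_code(response: Dict[str, List[str]]) -> Optional[Tuple[int, str]]:
    """Daemon side: return the WS close (code, reason), or None to proceed."""
    if response["missing_required"]:
        return (4010, "feature_unavailable")
    return None
```

Optional bits never appear in `missing_required`; the daemon simply checks `supported` before using them.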
### 15.3 IPC negotiation (CLI/SDK ↔ daemon)
`GET /v1/version` returns:
```json
{
"daemon_version": "0.9.0",
"ipc_api": "v1",
"ipc_features": ["send","topic","peers","files","events","health"],
"schema_version": 7,
"broker_features_negotiated": ["client_message_id_dedupe", ...]
}
```
CLI/SDK matches `ipc_features` against required. Missing required →
fall-back to cold-path with warning OR fail explicitly (CLI verb's choice).
### 15.4 Compatibility matrix — published
`GET /v1/compat` returns:
```json
{
"daemon": "0.9.0",
"compatible_brokers": ["0.7.x","0.8.x","0.9.x"],
"required_broker_features": ["client_message_id_dedupe",
"concurrent_connection_policy"],
"compatible_clis": ["0.9.x"],
"compatible_sdks": {
"python": ">=0.9.0,<1.0.0",
"go": ">=0.9.0,<1.0.0",
"ts": ">=0.9.0,<1.0.0"
}
}
```
---
## 16. Threat model — shared-CI reality folded in
### 16.1 Attacker classes — same matrix as v2 §16, plus:
| Attacker | Has | Wants | Mitigations |
|---|---|---|---|
| **Shared CI runner** (NEW) | Same Unix UID as other untrusted jobs | Read this user's persistent keypair across job boundaries | Auto-detect CI envs (§2.1) → ephemeral default + UDS-only + isolated `$HOME`. If operator overrides with `--persistent`, log warning `persistent_keypair_in_ci_environment`. |
| **Malicious mesh peer** (PROMOTED from out-of-scope to in-scope) | Mesh membership | Send malformed payload to crash daemon | Every inbound shape validated against schema before any processing. Daemon refuses unknown fields (defense-in-depth) and emits `cm_daemon_invalid_inbound_total`. Crashes from inbound payloads are bugs. |
### 16.2 Stated explicitly out of scope
- Root attacker on daemon host (can read keypair directly).
- Compromised broker (E2E content protection still holds; metadata is not
protected by daemon — that's mesh-level).
- Sophisticated attacker who copies BOTH `keypair.json` and
`host_fingerprint.json` (§2.2 calls this out).
- Receivers other than daemon-hosted peers deduping inbound traffic
(post-v0.9.0).
### 16.3 Container & CI defaults table (NEW)
| Environment | Identity | IPC | Hooks |
|---|---|---|---|
| Bare metal / VM (default) | Persistent (clone-detected) | UDS + TCP loopback | Enabled |
| Docker container (`/.dockerenv`) | Persistent | UDS-only by default | Enabled |
| Kubernetes (`KUBERNETES_SERVICE_HOST`) | Persistent | UDS-only | Enabled |
| CI (`CI=true`, `GITHUB_ACTIONS`, etc.) | Ephemeral | UDS-only | Disabled by default (`[hooks] enabled = false` until opted-in) |
| RunPod (`RUNPOD_POD_ID`) | Ephemeral | UDS-only | Enabled |
Operator overrides any default with explicit flags; warning logged for
non-default-secure choices.
---
## 17. Migration — same as v2 §17, plus broker schema add
Broker needs the schema delta in §4.3 (additive, partial unique indexes —
safe for online migration). Coordinated with daemon rollout: broker first,
then daemon. Daemon refuses to start against a broker that lacks
`client_message_id_dedupe` feature bit (§15).
---
## What needs review (round 3)
Round 1 → identity, IPC auth, exactly-once lie, hook tokens, surface bloat,
missing rotation/recovery/migration/threat-model.
Round 2 → boot-id false-positive, broker must dedupe on client id (protocol
change), CI shared-runner reality, feature-bit negotiation, key rotation
crypto, hook scopes, FTS schema, ~7 polish items.
This v3 attempts to address all of those. Specifically critique:
1. **Accidental-clone framing (§2.2)** — does the honest framing close the
issue, or does removing boot-id make the detection so weak it's not worth
shipping at all? Should we drop fingerprint detection entirely and rely on
broker concurrent-connection policy?
2. **Broker schema delta (§4.3)** — is this the smallest correct change?
Partial unique indexes feel right; anything else needed (audit table,
gc job)?
3. **`max_age_hours` reduced to 23h** — codex's logic says daemon outbox TTL
must be inside broker dedupe window. Is 23h vs 24h tight enough? Should
the broker advertise its dedupe window as a feature parameter so the
daemon configures itself?
4. **Hook scopes (§6.2)** — too tight? `reply:event` + `dm:send:<sender>` +
`topic:<name>:post`. Does this cover real use cases for v0.9.0 hooks
(auto-reply, escalate-to-oncall, file-receipt-ack)?
5. **Feature-bit negotiation (§15)** — is the scheme right? Should
feature-bits be string identifiers (current) or numeric bit positions in
a bitmask (denser, more brittle)?
6. **CI defaults (§16.3)** — is the table accurate? Anything wrong about
defaulting hooks-disabled in CI?
7. **Key rotation grace-key archive (§14.1)** — is 7d the right default? Is
storing archived private keys on disk (mode 0600) acceptable, or should
they be encrypted at rest with a passphrase?
8. **Anything still wrong?** Read it as if you were going to operate this
daemon for a year — what falls down?
Three options after this review:
- **(a) v3 is shippable**: lock the spec, start coding the frozen core.
- **(b) v4 needed**: list the must-fix items.
- **(c) the architecture itself is wrong**: what would you do differently?
Be ruthless. We can break anything.


@@ -0,0 +1,538 @@
# `claudemesh daemon` — Final Spec v4
> **Round 4.** v3 was reviewed by codex (round 3) and got an overall pass on
> architecture but flagged three precision gaps: (1) broker dedupe window
> semantics — permanent or windowed? schema as drawn was permanent but the
> prose said 24h; (2) feature-bit negotiation should carry parameters, not
> just booleans (so daemon can derive its outbox TTL from broker policy
> instead of hardcoding 23h); (3) key-archive record format and retention
> behavior were unspecified. Plus minor polish: document machine-id/MAC
> source precedence per OS, explicitly defer arbitrary outbound hook sends,
> resolve RunPod identity-vs-hooks inconsistency.
>
> **The intent §0 is unchanged from v2 — read it there.** v4 only revises
> what changed from v3.
---
## 0. Intent — unchanged, see v2 §0
Pre-launch peer-mesh runtime. Servers/laptops become first-class peers.
Stable identity, persistent WS, local IPC, hooks. Not a webhook gateway, not
a generic broker. We can break anything.
**One claim retracted from v1/v2**: "exactly-once" delivery. Replaced with a
precise contract in §4 below.
---
## 1. Process model — unchanged from v3 §1 / v2 §1
Resource caps, file layout, single-binary unchanged.
---
## 2. Identity — accidental-clone detection only, plus broker dedupe
Codex round-2 fix retained: no boot-id (false-positives every reboot).
Codex round-3 polish: spell out fingerprint sources per OS so we don't ship
a brittle "machine-id || first-mac" with no precedence rules.
### 2.1 Modes
```
claudemesh daemon up # default: persistent member
claudemesh daemon up --ephemeral # in-memory keypair, never written
claudemesh daemon up --ephemeral --ttl 2h # auto-shutdown after duration
```
**CI auto-detection**: if any of these env vars are set (`CI=true`,
`GITHUB_ACTIONS`, `GITLAB_CI`, `BUILDKITE`, `CIRCLECI`, `JENKINS_URL`,
`KUBERNETES_SERVICE_HOST`), AND `--persistent` is not explicitly passed,
daemon defaults to `--ephemeral`. Rationale in §16.
`RUNPOD_POD_ID` removed from auto-CI list (was inconsistent — see §16.3).
### 2.2 Accidental-clone detection (NOT attacker-grade)
This catches **image clones, restored backups, copy-pasted homedirs**:
accidents made by humans. It does not defend against an attacker who copies
both `keypair.json` and `host_fingerprint.json`. The threat model (§16) says
this explicitly.
#### 2.2.1 Fingerprint source precedence (NEW — codex r3)
`host_fingerprint.json` stores `sha256(host_id || stable_mac)` where the
inputs are computed from the OS-specific table below, in order:
| OS | `host_id` (try in order) | `stable_mac` |
|---|---|---|
| Linux | `/etc/machine-id` → `/var/lib/dbus/machine-id` → first stable MAC | First non-loopback non-virtual interface, lex-sorted by name (`en…`/`eth…` before `wl…`); `docker0/veth*/br-*/lo` excluded |
| macOS | `IOPlatformUUID` (`ioreg -rd1 -c IOPlatformExpertDevice`) | First non-loopback non-virtual interface (`en0` typical) |
| Windows | `HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid` | First physical adapter (`Get-NetAdapter -Physical`), MAC sorted lex by adapter name |
| BSD | `kern.hostuuid` (`sysctl -n kern.hostuuid`) | Same MAC rule as Linux |
**Excluded interfaces** (cross-platform): loopback, point-to-point tunnels
(`tailscale*`, `wg*`, `utun*`, `ppp*`), docker (`docker0`, `br-*`, `veth*`), VPN
(`tap*`/`tun*`), VM bridges (`vboxnet*`, `vmnet*`), Apple awdl/llw bridges.
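
The exclusion list and lex-sort rule can be sketched as a pure function over an interface-name to MAC mapping. How that mapping is obtained per OS is out of scope here; the function name and the exact glob spellings for loopback (`lo`, `lo0`) and the Apple bridges (`awdl*`, `llw*`) are assumptions.

```python
from fnmatch import fnmatchcase
from typing import Mapping, Optional

# Glob patterns for interfaces that must never feed the fingerprint.
EXCLUDED_PATTERNS = (
    "lo", "lo[0-9]*",                       # loopback (Linux lo, macOS lo0)
    "tailscale*", "wg*", "utun*", "ppp*",   # point-to-point tunnels
    "docker0", "br-*", "veth*",             # docker
    "tap*", "tun*",                         # VPN
    "vboxnet*", "vmnet*",                   # VM bridges
    "awdl*", "llw*",                        # Apple awdl/llw bridges
)


def pick_stable_mac(interfaces: Mapping[str, str]) -> Optional[str]:
    """Return the MAC of the lexicographically first eligible interface."""
    eligible = sorted(
        name for name in interfaces
        if not any(fnmatchcase(name, pattern) for pattern in EXCLUDED_PATTERNS)
    )
    # None means no usable NIC; the caller falls back as the spec describes.
    return interfaces[eligible[0]] if eligible else None
```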
**Cloud-image false-positive note**: bare AMIs/Azure images regenerate
`/etc/machine-id` on first boot via cloud-init; for those, the first-boot
fingerprint is what we keep. If an operator clones a *running* VM
post-cloud-init, both `host_id` AND first-MAC will collide → the daemon
correctly flags this as an accidental clone.
If `host_id` cannot be read on the host's OS, daemon logs
`fingerprint_host_id_unavailable` and falls back to MAC-only. If MAC also
unavailable (truly headless container with no NIC), daemon logs
`fingerprint_unavailable`, persists a random UUID as `host_id`, and the
clone-detection feature is effectively disabled for this host (broker
concurrent-connection policy still works).
Behavior on mismatch (unchanged from v3): refuse / `accept-host` / `remint`.
`[clone] policy = "refuse" | "warn" | "allow"` overrides per host.
### 2.3 Concurrent-duplicate-identity broker policy — unchanged from v3 §2.3
`prefer_newest` (default), `prefer_oldest`, `allow_concurrent`. Configured
per-mesh in `mesh.cloneConcurrencyPolicy`.
### 2.4 Rename, key rotation — see §14
---
## 3. IPC surface — unchanged from v3 §3
Same frozen core, same auth model (UDS 0600 / TCP+SSE bearer / no token in
query / all endpoints auth by default / UDS-only in containers / Origin/Host
checks / no User-Agent theatre).
---
## 4. Delivery contract — at-least-once, **permanent** broker dedupe
Codex round 3 caught: v3's prose said "24h dedupe window" but the schema
(partial unique indexes with no `created_at`) gave **permanent** dedupe. We
have to pick. v4 chooses **permanent dedupe** because:
- It's the simplest correct choice. No GC job, no edge case where a
long-asleep daemon's retry slips past the window and double-sends.
- The unique index storage cost is bounded: at 1 KB per row × 100k
messages/day × 365 = ~36 GB/year of broker storage, which is well within
the broker's existing message-retention budget. Older message rows
themselves can still be GC'd by the existing message retention policy
(currently 365d) — only the `client_message_id` column on retained rows
has to live as long as that row does.
- It eliminates the daemon-side `max_age_hours = 23h` hack. Daemon outbox
TTL becomes "however long you want to keep retrying"; default 7d.
- It removes a class of "where exactly is the dedupe window edge?" bugs.
If broker storage growth becomes a real concern post-v0.9.0, we can convert
to a windowed scheme via a feature-bit upgrade (§15) — but we'd own the
correct migration semantics then.
### 4.1 The contract (precise)
> **Local guarantee**: each successful `POST /v1/send` returns a stable
> `client_message_id`. The send is durably persisted to `outbox.db` before
> the response returns.
>
> **Broker guarantee**: the broker dedupes on `client_message_id`
> **permanently within the lifetime of the row**. Multiple inflight retries
> from the daemon for the same `client_message_id` produce **at most one**
> broker-accepted row, regardless of time elapsed (subject to message-row
> retention policy on the broker). This is advertised via the
> `client_message_id_dedupe` feature-bit with `{ mode: "permanent" }`
> parameter (§15).
>
> **End-to-end guarantee**: at-least-once delivery to subscribers, with
> `client_message_id` propagated in the inbound envelope so receivers can
> dedupe locally. We do **not** guarantee at-most-once end-to-end —
> receiver-side dedupe is the receiver's job. The daemon's `inbox.db`
> provides it for daemon-hosted peers.
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
Sources: `Idempotency-Key` header → body `client_message_id` → daemon-minted
ulid. Stored in outbox UNIQUE NOT NULL, propagated to broker, propagated to
receivers.
### 4.3 Broker schema delta — clarified as permanent dedupe
```sql
ALTER TABLE mesh.topic_message
ADD COLUMN client_message_id TEXT;
ALTER TABLE mesh.message_queue
ADD COLUMN client_message_id TEXT;
CREATE UNIQUE INDEX topic_message_client_id_idx
ON mesh.topic_message(mesh_id, client_message_id)
WHERE client_message_id IS NOT NULL;
CREATE UNIQUE INDEX message_queue_client_id_idx
ON mesh.message_queue(mesh_id, client_message_id)
WHERE client_message_id IS NOT NULL;
-- No created_at column needed for dedupe; the existing message row's
-- created_at handles row-level retention. Dedupe is permanent for the row's
-- lifetime, then naturally GC'd when the row is purged.
```
Partial unique indexes — legacy traffic without `client_message_id` (from
`claudemesh launch`, dashboard chat, web posts) is unaffected.
**Migration**: additive-only. On Postgres, `ALTER TABLE ... ADD COLUMN` with
no default is a metadata-only change that holds an `ACCESS EXCLUSIVE` lock
only briefly. Build the unique indexes with `CREATE UNIQUE INDEX
CONCURRENTLY` so writes are not blocked during the index build. Deploy
order: schema migration → broker code that reads/writes `client_message_id`
→ daemon code that sends it → daemon enforces feature bit.
### 4.4 Outbox schema — unchanged from v3 §4.4
`UNIQUE NOT NULL` on `client_message_id`. Default `max_age_hours` raised
back to **168h (7d)** because broker dedupe is permanent — no need to stay
inside a 24h window.
### 4.5 Inbox schema — unchanged from v3 §4.5
Content table + indexes; FTS5 deferred.
### 4.6 Crash recovery — unchanged from v3 §4.6
### 4.7 Failure modes — windowed-broker case removed
The "broker dedupe window expired" failure mode in v3 §4.7 is **deleted**
because dedupe is permanent. Remaining cases:
- **`dead` rows**: surface in `claudemesh daemon outbox --failed`. User
manually requeues (`outbox requeue <id>`) or drops (`outbox drop <id>`).
- **Receiver-side dedupe**: only daemon-hosted receivers dedupe.
`claudemesh launch` and dashboard chat don't dedupe today; adding it is
deferred to post-v0.9.0.
- **Broker row already GC'd, daemon retries**: daemon retry hits the
partial unique index → 23505 conflict. Broker treats as already-accepted,
returns the original `messageId` from a soft-delete tombstone OR (if the
row was hard-deleted by retention) returns `client_id_unknown`. Daemon
treats `client_id_unknown` as "delivered, history may have been pruned"
and marks `done`. Tombstone strategy is a broker implementation choice
(advertised via `client_message_id_dedupe.tombstone_retention_days` in
§15.1).
---
## 5. Inbound — unchanged from v3 §5
---
## 6. Hooks — scopes tightened (codex r2), explicit deferment of arbitrary sends (codex r3)
### 6.1 Hooks contract — unchanged from v2 §6 / v3 §6.1
### 6.2 Capability scopes — narrowed for v0.9.0
| Scope | Capability | Notes |
|---|---|---|
| `reply:event` | Reply to the specific event that triggered this hook | Bound to `event_id`; daemon validates target; expires on hook exit |
| `dm:send:<sender_pubkey>` | Send DM only to the specific sender | Bound to one pubkey from event; not a write to anyone |
| `topic:<name>:post` | Post to the specific topic that fired | Bound to topic from event; can't write elsewhere |
**No read scopes in v0.9.0.** Hooks read via the event payload (which the
daemon redacts appropriately), not via daemon-mediated reads.
**Explicitly deferred to post-v0.9.0** (codex r3 — say it out loud so use
cases don't pile up against an undocumented limit):
- **Arbitrary outbound `dm:send` to anyone other than the event sender** —
no scope grant for this. "Escalate to oncall" hooks must shell out to
`claudemesh send <oncall>` with the user's normal config; the daemon
doesn't issue capability tokens for arbitrary recipients.
- **Cross-topic post** — a hook firing on `topic:alerts` cannot post to
`topic:incidents`. Same reason.
- **Mesh-cross post** — hooks see one mesh at a time.
- **Reading state/inbox/peers** — covered above.
If a real use case demands cross-topic or arbitrary-recipient hooks
post-v0.9.0, we add scopes like `dm:send:*` (wildcard) or
`topic:*:post` (wildcard) and gate them behind explicit operator opt-in in
config (`[hooks.<name>] dangerous_wildcards = true`). Not in v0.9.0.
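A minimal sketch of how the event-bound scopes in the table above could be minted and checked. The `HookEvent` shape and the `mint_scopes` / `is_allowed` helpers are hypothetical names for illustration, not the daemon's API; only the three scope strings come from this spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HookEvent:
    # Hypothetical event shape; field names are assumptions.
    event_id: str
    sender_pubkey: str
    topic: Optional[str] = None  # set only for topic-triggered hooks


def mint_scopes(event: HookEvent) -> frozenset:
    """Return the v0.9.0 scopes bound to this one event."""
    scopes = {"reply:event", f"dm:send:{event.sender_pubkey}"}
    if event.topic is not None:
        scopes.add(f"topic:{event.topic}:post")
    return frozenset(scopes)


def is_allowed(scopes: frozenset, action: str, target: str) -> bool:
    """Check one requested action ("dm" or "topic") against the scopes.

    Wildcard targets are refused outright, mirroring the v0.9.0
    deferment of `dm:send:*` and `topic:*:post`.
    """
    if "*" in target:
        return False
    if action == "dm":
        return f"dm:send:{target}" in scopes
    if action == "topic":
        return f"topic:{target}:post" in scopes
    return False
```

Under this sketch, a hook fired by a post on `alerts` from one sender can DM that sender and post to `alerts`, and nothing else.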
### 6.3 Sandboxing — unchanged from v3 §6.3
Best-effort `network_policy = "deny"`; cross-platform unenforceability
acknowledged; counter `cm_daemon_hook_unenforceable_total` exposed.
### 6.4 Payload size & truncation — unchanged from v3 §6.4
### 6.5 Audit log + killpg — unchanged
---
## 7. Multi-mesh — unchanged
## 8. Auto-routing — unchanged
## 9. Service installation — unchanged
## 10. Observability — unchanged
## 11. SDKs — unchanged
## 12. Security model — unchanged
---
## 13. Configuration — unchanged shape, plus parameterized features
```toml
[features]
require = [
"client_message_id_dedupe", # broker provides §4.1 contract
"concurrent_connection_policy", # broker honours mesh.cloneConcurrencyPolicy
]
optional = ["mesh_skill_share", "mcp_host"]
# Daemon refuses to start if broker doesn't advertise all `require` bits.
# Broker advertises feature parameters in the negotiation response (§15.1)
# — daemon picks up `dedupe_mode` and `tombstone_retention_days` from there
# and writes them to its runtime view, not config.
```
---
## 14. Lifecycle — key rotation crypto fixed (codex r2), archive format spec'd (codex r3)
### 14.1 Key rotation — crypto correct (codex r2)
`claudemesh daemon rotate-keypair`:
- Mints fresh ed25519 + x25519 keypairs.
- Registers new pubkeys with the broker as `member_keypair_rotated` event.
- Broker associates the new pubkey with the same member id, marks the old
pubkey as `rotated_out` (not revoked); senders who haven't received the
rotation event continue to encrypt to the old pubkey for a grace window.
- Daemon retains the old x25519 **private** key (only x25519 — ed25519 is
for signing, doesn't need a grace window) in `keypair-archive.json`.
- During grace, decrypt path: try current private key first; on
`crypto_box_open_easy` failure, walk archived keys in order. Successful
archived-key decrypts increment `cm_daemon_decrypt_archived_total`.
- After grace expiry, archived keys are zeroed and the file is rewritten
without them. Messages still encrypted to a fully-expired pubkey fail to
decrypt and increment `cm_daemon_decrypt_stale_total`.
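The grace-window decrypt path above can be sketched as follows. The `open_box` callable is a hypothetical stand-in for `crypto_box_open_easy` (it returns `None` on failure), and counting every total failure as "stale" is a simplification assumed here, not something the spec states.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical decrypt primitive: (ciphertext, privkey) -> plaintext or None.
OpenBox = Callable[[bytes, bytes], Optional[bytes]]


def _bump(counters: Dict[str, int], name: str) -> None:
    counters[name] = counters.get(name, 0) + 1


def decrypt_with_grace(
    ciphertext: bytes,
    current_privkey: bytes,
    archived_privkeys: List[bytes],
    open_box: OpenBox,
    counters: Dict[str, int],
) -> Optional[bytes]:
    """Try the current key first, then walk archived keys in order."""
    plaintext = open_box(ciphertext, current_privkey)
    if plaintext is not None:
        return plaintext
    for privkey in archived_privkeys:
        plaintext = open_box(ciphertext, privkey)
        if plaintext is not None:
            _bump(counters, "cm_daemon_decrypt_archived_total")
            return plaintext
    # No retained key opened the message (assumed: encrypted to an
    # already-expired key).
    _bump(counters, "cm_daemon_decrypt_stale_total")
    return None
```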
#### 14.1.1 Archive record format (NEW — codex r3)
`keypair-archive.json` (mode 0600, atomic-rename writes):
```json
{
"schema_version": 1,
"max_archived_keys": 8,
"keys": [
{
"pubkey": "ed25519-base64...",
"x25519_pubkey": "base64...",
"x25519_privkey": "base64...", // sensitive; whole file is 0600
"key_id": "k_01HQX...", // ulid; matches broker's record
"created_at": "2026-04-12T11:00:00Z",
"rotated_out_at": "2026-05-03T16:00:00Z",
"expires_at": "2026-05-10T16:00:00Z" // rotated_out_at + grace
}
]
}
```
Rules:
- **`max_archived_keys`** (default 8): cap on archive size. If a rotation
would push the archive past the cap, the oldest entry is force-expired
(zeroed + removed) regardless of `expires_at`. Force-expiry increments
`cm_daemon_archive_force_expired_total{key_id}`. An operator who rotates
more than 8 keys within one grace window is intentionally accepting
decryption gaps for very late inbound messages encrypted to those keys.
- **Grace period default**: 7 days. Configurable via
`[crypto] key_grace_period_days = 7`. Hard cap 30 days (codex review:
unbounded grace = unbounded archive on disk = bigger blast radius if
daemon host is compromised mid-life).
- **Cleanup**: scheduled daily at midnight local time + on-demand via
`claudemesh daemon archive-cleanup`. Walks `keys[]`, drops anything with
`expires_at < now`. If file is empty after cleanup, file is deleted.
- **Archive write failure**: rotation is aborted. Daemon refuses to commit
the new keypair if the archive can't be written durably. Logged as
`key_rotation_aborted_archive_write_failed`. New keypair is in memory
only; restart returns to old keypair. This is intentional: the archive
write is the durability point of rotation.
- **At-rest encryption**: archive file is mode 0600 plaintext, same threat
model as `keypair.json` (root-on-host can read both anyway). Operators
who want disk-level encryption can put `~/.claudemesh/` on an encrypted
volume; we don't reinvent that. Documented in the threat model (§16).
Future option `--archive-passphrase` deferred — adds passphrase prompt to
rotation/decrypt path, but breaks unattended daemon restart.
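The cleanup and force-expiry rules above can be sketched as two pure functions over the `keys[]` array. The function names are hypothetical, and treating "oldest" as "earliest `rotated_out_at`" is an assumption; the spec does not say which timestamp orders the archive.

```python
from datetime import datetime, timezone
from typing import List, Tuple


def _parse(ts: str) -> datetime:
    # Accept the trailing "Z" used in the archive example.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def cleanup_expired(keys: List[dict], now: datetime) -> List[dict]:
    """Drop every entry with expires_at < now."""
    return [k for k in keys if _parse(k["expires_at"]) >= now]


def archive_with_cap(
    keys: List[dict], new_entry: dict, max_archived_keys: int = 8
) -> Tuple[List[dict], List[str]]:
    """Add new_entry, force-expiring the oldest entries beyond the cap.

    Returns (kept_entries, force_expired_key_ids), regardless of each
    dropped entry's expires_at.
    """
    merged = sorted(keys + [new_entry], key=lambda k: _parse(k["rotated_out_at"]))
    overflow = max(0, len(merged) - max_archived_keys)
    forced = [k["key_id"] for k in merged[:overflow]]
    return merged[overflow:], forced
```

A caller would increment `cm_daemon_archive_force_expired_total` once per id in the second return value, then persist the kept entries with an atomic-rename write.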
### 14.2 Backup includes topic state — unchanged from v3 §14.2
`keypair.json`, `keypair-archive.json` (with all archived keys),
`host_fingerprint.json`, `config.toml`, `topic_subscriptions.json`,
`topic_keys.json`, `key_epoch.json`, `schema_version`.
`local_token` NOT included; regenerated on restore.
### 14.3 Local token rotation, compromised host revocation, image-clone, uninstall, recovery — unchanged from v2 §14.3
---
## 15. Version compat — feature-bit negotiation with **parameters** (codex r3)
v3's feature bits were boolean. Codex r3: dedupe-window, max-payload, key
epochs all need parameters. v4 makes feature bits string-keyed entries that
optionally carry a value.
### 15.1 Feature bits with parameters
| Bit | Type | Parameters | Notes |
|---|---|---|---|
| `client_message_id_dedupe` | object | `{ mode: "permanent"\|"windowed", window_hours?: int, tombstone_retention_days: int }` | Daemon reads `mode` to decide whether to enforce its own outbox max-age cap. `tombstone_retention_days` (broker-controlled) tells daemon how long it can expect "already-accepted" replies after the source row is GC'd |
| `concurrent_connection_policy` | bool | — | Broker honours `mesh.cloneConcurrencyPolicy` |
| `member_keypair_rotated_event` | bool | — | Broker emits the event |
| `key_epoch` | object | `{ max_concurrent_epochs: int }` | Per-topic key epochs supported |
| `max_payload` | object | `{ inline_bytes: int, blob_bytes: int }` | Hard limits broker enforces |
| `mesh_skill_share` | bool | — | Future |
| `mcp_host` | bool | — | Future |
### 15.2 Negotiation handshake (parameterized)
On WS connect, after hello, before normal traffic:
```
→ daemon: feature_negotiation_request
{
require: ["client_message_id_dedupe",
"concurrent_connection_policy"],
optional: ["mesh_skill_share","mcp_host","max_payload"]
}
← broker: feature_negotiation_response
{
supported: {
"client_message_id_dedupe": {
"mode": "permanent",
"tombstone_retention_days": 30
},
"concurrent_connection_policy": true,
"member_keypair_rotated_event": true,
"max_payload": {
"inline_bytes": 65536,
"blob_bytes": 524288000
}
},
missing_required: []
}
```
If `missing_required` is non-empty, daemon closes the connection with code
4010 `feature_unavailable`, logs forensic event, exits non-zero. Supervisor
sees a restart-loop → operator alert.
If `client_message_id_dedupe.mode == "windowed"`, daemon reads
`window_hours` and configures its outbox `max_age_hours` to
`window_hours - 1` (margin) instead of the 168h default. Permanent mode →
daemon uses the config default, no override.
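As a rough illustration of the v4 rules in this subsection, the daemon-side handling of the negotiation response might look like the sketch below. Both function names are hypothetical; the `supported` dictionary mirrors the response shape shown above.

```python
from typing import Any, Dict, List

DEFAULT_MAX_AGE_HOURS = 168  # 7d, the v4 config default


def v4_missing_required(require: List[str], supported: Dict[str, Any]) -> List[str]:
    """Required bits absent from the broker's `supported` map.

    A non-empty result corresponds to closing with 4010 feature_unavailable.
    """
    return [bit for bit in require if bit not in supported]


def v4_outbox_max_age(supported: Dict[str, Any]) -> int:
    """Outbox max_age_hours: window_hours - 1 when windowed, else default."""
    dedupe = supported["client_message_id_dedupe"]
    if dedupe["mode"] == "windowed":
        return dedupe["window_hours"] - 1
    return DEFAULT_MAX_AGE_HOURS
```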
### 15.3 IPC negotiation — unchanged from v3 §15.3
`GET /v1/version` returns daemon version, IPC features, schema version, and
the **parsed** broker feature parameters (so SDKs querying the daemon can
display them).
### 15.4 Compatibility matrix — unchanged from v3 §15.4
Published at `GET /v1/compat`.
---
## 16. Threat model — unchanged from v3 §16, plus RunPod fix
### 16.1 Attacker classes — unchanged
### 16.2 Out of scope — unchanged
### 16.3 Container & CI defaults table (RunPod inconsistency fixed)
| Environment | Identity | IPC | Hooks | Rationale |
|---|---|---|---|---|
| Bare metal / VM (default) | Persistent (clone-detected) | UDS + TCP loopback | Enabled | Trusted operator-owned host |
| Docker container (`/.dockerenv`) | Persistent | UDS-only by default | Enabled | Single-tenant container, host loopback shared |
| Kubernetes (`KUBERNETES_SERVICE_HOST`) | Persistent | UDS-only | Enabled | Single pod = single tenant |
| CI (`CI=true`, `GITHUB_ACTIONS`, etc.) | Ephemeral | UDS-only | Disabled by default (`[hooks] enabled = false`) | Multi-tenant runner; arbitrary code; ephemeral identity = no cross-job leak; hooks disabled because CI workloads are arbitrary user code |
| RunPod (`RUNPOD_POD_ID`) | Persistent | UDS-only | Enabled | Long-lived single-tenant sandbox; user owns the pod for its lifetime; identical trust model to a Docker container, NOT to a CI runner |
**RunPod resolution (codex r3)**: v3 listed RunPod under both "ephemeral
identity" and "hooks enabled" which was contradictory. v4 treats RunPod as
a **single-tenant container** (Docker-like): persistent identity, UDS-only,
hooks enabled. RunPod is removed from the CI auto-detect list (§2.1).
Operators who run RunPod as multi-tenant sandbox-as-CI can opt in with
`--ephemeral` + `[hooks] enabled = false` explicitly.
Operator overrides any default with explicit flags; warning logged for
non-default-secure choices.
---
## 17. Migration — unchanged from v3 §17
Broker schema delta (additive partial unique indexes, safe online),
deployed before daemon. Daemon refuses to start if `client_message_id_dedupe`
feature bit is missing from broker's negotiation response.
---
## What changed v3 → v4 (codex round-3 actionable items)
| Codex r3 item | v4 fix | Section |
|---|---|---|
| Broker dedupe window: permanent vs windowed? | **Picked permanent**; schema clarified; outbox `max_age_hours` raised back to 168h | §4 |
| Feature bits should be parameterized | All feature bits are string-keyed with optional value object | §15.1, §15.2 |
| Key archive record format unspecified | Full schema with `key_id`, timestamps, `max_archived_keys`, force-expiry rule, write-failure semantics | §14.1.1 |
| Document fingerprint source precedence per OS | Per-OS table for `host_id` and stable MAC; cloud-image false-positive note | §2.2.1 |
| Explicit deferment of arbitrary outbound hook sends | Listed deferred capabilities + escape hatch path post-v0.9.0 | §6.2 |
| RunPod ephemeral-but-hooks-enabled inconsistency | RunPod treated as single-tenant container; removed from CI auto-detect | §2.1, §16.3 |
---
## What needs review (round 4)
Round 1 → identity, IPC auth, exactly-once lie, hook tokens, surface bloat,
missing rotation/recovery/migration/threat-model.
Round 2 → boot-id false-positive, broker must dedupe on client id, CI
shared-runner reality, feature-bit negotiation, key rotation crypto, hook
scopes, FTS schema, ~7 polish items.
Round 3 → dedupe window semantics, feature-bit parameters, key archive
record format, fingerprint source precedence, deferred hook scopes, RunPod
inconsistency.
This v4 attempts to address all of round 3. Specifically:
1. **Permanent dedupe choice (§4)** — does the storage-cost calculus hold?
Is the tombstone path (`client_id_unknown` after row GC) actually
workable, or does it need to be a real tombstone table?
2. **Feature parameter shape (§15.1)** — is the type system right (object
with optional value)? Should it be a flat key-value list instead?
Versioning of parameters within a feature?
3. **Archive record format (§14.1.1)** — anything missing? Is
`max_archived_keys=8` a sensible default, or should it be unbounded with
a force-expiry on storage size instead of count?
4. **Fingerprint per-OS table (§2.2.1)** — accurate? Is BSD worth listing
if we're not actively building for FreeBSD in v0.9.0?
5. **Hook deferment list (§6.2)** — does it cover all the realistic v0.9.0
asks? Is the "shell out to `claudemesh send`" workaround for escalation
ergonomically acceptable?
6. **RunPod resolution (§16.3)** — agree with treating RunPod as
single-tenant container? Or are there real multi-tenant RunPod
deployments we should default-guard against?
7. **Anything else still wrong?** Read it as if you were going to operate
this for a year. What falls down?
Three options after this review:
- **(a) v4 is shippable**: lock the spec, start coding the frozen core.
- **(b) v5 needed**: list the must-fix items.
- **(c) the architecture itself is wrong**: what would you do differently?
Be ruthless. We can break anything.

@@ -0,0 +1,468 @@
# `claudemesh daemon` — Final Spec v5
> **Round 5.** v4 was reviewed by codex (round 4) and got an architectural
> pass but flagged one blocker plus four polish items.
>
> **Blocker**: §4 called dedupe "permanent" while also saying it disappears
> when retained rows are hard-deleted. Internally inconsistent. Fix: real
> broker-side dedupe/tombstone table independent of message retention.
>
> **Polish**: (a) rename `mode: "permanent"` to `retention_scoped`; (b)
> deterministic duplicate-response shape; (c) feature-parameter schema
> validation rules + per-feature parameter version; (d) drop
> "zeroed/secure-delete" promises in archive cleanup, define malformed-archive
> startup behavior; plus Linux MAC||MAC self-collision noted, RunPod warning
> log on persistent default.
>
> **Intent §0 unchanged from v2.** v5 only revises what changed from v4.
---
## 0. Intent — unchanged, see v2 §0
Pre-launch peer-mesh runtime. Servers/laptops become first-class peers.
Stable identity, persistent WS, local IPC, hooks. Not a webhook gateway, not
a generic broker. We can break anything.
**One claim retracted from v1/v2**: "exactly-once" delivery. Replaced with a
precise contract in §4.
---
## 1. Process model — unchanged from v3 §1 / v2 §1
---
## 2. Identity — accidental-clone detection only
### 2.1 Modes — unchanged from v4 §2.1, RunPod warning added
When `RUNPOD_POD_ID` is set and identity is persistent (the default for
RunPod under v4 §16.3), daemon logs `runpod_persistent_default_assumed` at
INFO. Operators running RunPod as a multi-tenant CI surface set `--ephemeral`
explicitly; the log line makes the default visible in case the assumption
doesn't fit their deployment.
### 2.2 Accidental-clone detection — unchanged from v4 §2.2
#### 2.2.1 Fingerprint source precedence — unchanged from v4 §2.2.1, with self-collision note
**Linux MAC-only fallback (NEW note)**: when `/etc/machine-id` is unreadable
and we fall back to MAC-only as `host_id`, the resulting fingerprint is
effectively `sha256(mac || mac)`. This is acceptable for clone detection
(still uniquely identifies *this* host's first-NIC MAC) but reduces entropy
to ~48 bits. Operators who want stronger fingerprinting in degraded
environments can persist a generated UUID via `host_fingerprint.id_override`
in config; documented but not required.
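To make the self-collision concrete, here is a small sketch that assumes the fingerprint is `sha256(host_id || mac)` over ASCII bytes. The byte encoding, the lowercase normalization, and the `host_fingerprint` name are all assumptions; the authoritative precedence table lives in v4 §2.2.1.

```python
import hashlib
from typing import Optional


def host_fingerprint(machine_id: Optional[str], first_nic_mac: str) -> str:
    """Hex sha256(host_id || mac).

    When machine_id is unavailable, host_id falls back to the MAC, so the
    hashed input degenerates to mac || mac and carries only the MAC's
    48 bits of entropy.
    """
    mac = first_nic_mac.lower().encode("ascii")
    host_id = machine_id.strip().encode("ascii") if machine_id else mac
    return hashlib.sha256(host_id + mac).hexdigest()
```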
### 2.3 Concurrent-duplicate-identity broker policy — unchanged from v3 §2.3
### 2.4 Rename, key rotation — see §14
---
## 3. IPC surface — unchanged from v4 §3
---
## 4. Delivery contract — at-least-once, **dedupe table**, retention-scoped
Codex round 4 caught: v4 said "permanent" but also said dedupe disappears
when message rows are hard-deleted. That's `retention_scoped`, not
permanent — and worse, the partial-unique-index design fails when the row
itself is gone. v5 introduces a real broker-side dedupe table with its own
retention policy, independent of message retention.
### 4.1 The contract (precise)
> **Local guarantee**: each successful `POST /v1/send` returns a stable
> `client_message_id`. The send is durably persisted to `outbox.db` before
> the response returns.
>
> **Broker guarantee**: the broker maintains a dedupe record for every
> accepted `client_message_id` in a dedicated table
> (`mesh.client_message_dedupe`). The dedupe record outlives the message
> row when the dedupe-retention policy is longer than the
> message-retention policy. While the dedupe record exists, all retries
> with that `client_message_id` collapse to the original
> `broker_message_id` deterministically. After the dedupe record expires,
> a retry would create a new message — but daemon outbox `max_age_hours`
> is configured against the broker's advertised `dedupe_retention_days`
> with margin (§15.1), so this should not happen in practice.
>
> **End-to-end guarantee**: at-least-once delivery to subscribers, with
> `client_message_id` propagated in the inbound envelope. Receiver-side
> dedupe is the receiver's job; the daemon's `inbox.db` provides it for
> daemon-hosted peers.
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
Sources: `Idempotency-Key` header → body `client_message_id` → daemon ulid.
Stored in outbox UNIQUE NOT NULL, propagated to broker, propagated to
receivers in inbound envelope.
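The source precedence can be sketched as a single resolver. `resolve_client_message_id` and `mint_id` are hypothetical names, and `mint_id` uses `uuid4` purely as a standard-library stand-in: it is not a ulid, which is what the daemon mints.

```python
import uuid
from typing import Mapping


def mint_id() -> str:
    # Stand-in only: the daemon mints a ulid; uuid4 keeps this runnable
    # without third-party packages.
    return "cmid_" + uuid.uuid4().hex


def resolve_client_message_id(
    headers: Mapping[str, str], body: Mapping[str, object]
) -> str:
    """Header first, then body field, then a daemon-minted id."""
    # HTTP header names are case-insensitive.
    lowered = {k.lower(): v for k, v in headers.items()}
    header_value = lowered.get("idempotency-key")
    if header_value:
        return header_value
    body_value = body.get("client_message_id")
    if isinstance(body_value, str) and body_value:
        return body_value
    return mint_id()
```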
### 4.3 Broker schema — dedupe table separate from message rows (v5)
```sql
-- The dedupe authority. One row per (mesh, client_message_id) accepted
-- by the broker. Outlives mesh.topic_message rows when retention >
-- message retention.
CREATE TABLE mesh.client_message_dedupe (
mesh_id UUID NOT NULL REFERENCES mesh.mesh(id) ON DELETE CASCADE,
client_message_id TEXT NOT NULL,
broker_message_id UUID NOT NULL, -- the original accepted message id
destination_kind TEXT NOT NULL CHECK(destination_kind IN ('topic','dm','queue')),
destination_ref TEXT NOT NULL, -- topic name, recipient pubkey, etc.
first_seen_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
expires_at TIMESTAMPTZ, -- NULL = never expires (operator opt-in)
status TEXT NOT NULL CHECK(status IN ('accepted','rejected')),
history_available BOOLEAN NOT NULL DEFAULT TRUE, -- flipped FALSE when message row GC'd
PRIMARY KEY (mesh_id, client_message_id)
);
CREATE INDEX client_message_dedupe_expires_idx
ON mesh.client_message_dedupe(expires_at)
WHERE expires_at IS NOT NULL;
-- Existing tables get the convenience back-pointer (for receiver
-- inclusion in delivered envelopes); UNIQUE NOT enforced here — the
-- dedupe table is the authority.
ALTER TABLE mesh.topic_message ADD COLUMN client_message_id TEXT;
ALTER TABLE mesh.message_queue ADD COLUMN client_message_id TEXT;
```
**Retention semantics**:
- `expires_at = NULL` → dedupe row never expires unless mesh is deleted.
Operator opts in via mesh setting `dedupeRetentionMode = "permanent"`.
- `expires_at = first_seen_at + dedupe_retention_days` → default
`retention_scoped` mode. Default value: 365 days. Configurable per-mesh.
- A nightly broker job deletes rows where `expires_at < NOW()`.
- A separate broker job, fired when the message-retention sweep hard-deletes
a `mesh.topic_message` or `mesh.message_queue` row, sets the corresponding
dedupe row's `history_available = FALSE`. The dedupe row stays — only the
payload is gone. Retries still collapse correctly; receiver requests for
history return "row pruned" deterministically (§4.4 below).
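The retention rules above, sketched over in-memory rows rather than Postgres. The function names are hypothetical and each dedupe row is modelled as a plain dict with the column names from the schema.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def compute_expires_at(
    first_seen_at: datetime, mode: str, dedupe_retention_days: int = 365
) -> Optional[datetime]:
    """None means the dedupe row never expires (permanent mode)."""
    if mode == "permanent":
        return None
    if mode == "retention_scoped":
        return first_seen_at + timedelta(days=dedupe_retention_days)
    raise ValueError(f"unknown dedupe retention mode: {mode}")


def nightly_purge(rows: List[dict], now: datetime) -> List[dict]:
    """Keep rows that are permanent or not yet expired."""
    return [r for r in rows if r["expires_at"] is None or r["expires_at"] >= now]


def mark_history_pruned(rows: List[dict], pruned_broker_message_id: str) -> None:
    """The message row was hard-deleted: keep the dedupe row, flip the flag."""
    for r in rows:
        if r["broker_message_id"] == pruned_broker_message_id:
            r["history_available"] = False
```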
**Migration**: additive-only. Daemon refuses to start unless broker
advertises feature `client_message_id_dedupe` with `mode` of
`retention_scoped` or `permanent` (§15.1).
### 4.4 Duplicate response — deterministic shape (NEW v5 — codex r4)
When the broker sees a send with a `client_message_id` already in
`mesh.client_message_dedupe`, the response is deterministic:
```json
{
"broker_message_id": "msg_01HQX...",
"client_message_id": "cmid_01HQX...",
"duplicate": true,
"history_available": true, // false if message row was GC'd
"first_seen_at": "2026-05-03T11:42:00Z",
"destination_kind": "topic",
"destination_ref": "alerts"
}
```
Daemon outcomes:
- `duplicate: true, history_available: true` → mark outbox row `done`,
store `broker_message_id`. No re-fanout (broker did the work the first
time).
- `duplicate: true, history_available: false` → mark outbox row `done` but
increment `cm_daemon_dedupe_history_pruned_total`. The message *did* deliver
the first time; we just can't show it in history. Receivers who needed
it have it; receivers who didn't have already missed their window.
- No more `client_id_unknown` — that response code is removed.
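The daemon outcomes above reduce to a small state update. This sketch assumes the response is an acceptance (first-time or duplicate); rejection handling is out of scope, and `apply_send_response` is a hypothetical name.

```python
from typing import Dict


def apply_send_response(
    outbox_row: dict, response: dict, counters: Dict[str, int]
) -> dict:
    """Mark the outbox row done and record the broker's message id.

    A duplicate whose history was pruned still counts as delivered; it
    only bumps the pruned-history counter.
    """
    outbox_row["status"] = "done"
    outbox_row["broker_message_id"] = response["broker_message_id"]
    if response.get("duplicate") and not response.get("history_available", True):
        name = "cm_daemon_dedupe_history_pruned_total"
        counters[name] = counters.get(name, 0) + 1
    return outbox_row
```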
### 4.5 Outbox schema — daemon-side max-age derived (v5)
```sql
CREATE TABLE outbox (
id TEXT PRIMARY KEY,
client_message_id TEXT NOT NULL UNIQUE,
payload BLOB NOT NULL,
enqueued_at INTEGER NOT NULL,
attempts INTEGER DEFAULT 0,
next_attempt_at INTEGER NOT NULL,
status TEXT CHECK(status IN ('pending','inflight','done','dead')),
last_error TEXT,
delivered_at INTEGER,
broker_message_id TEXT
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
```
Daemon `max_age_hours` is **derived** from the broker-advertised `mode` and
`dedupe_retention_days` parameters:
- `permanent` → daemon default 168h (7d), capped at 30d. (Daemon doesn't
hold sends forever — that's an outbox bug surface.)
- `retention_scoped, dedupe_retention_days = N` → daemon
`max_age_hours = (N * 24) - safety_margin_hours`. Default
`safety_margin_hours = 24`.
- Operator override permitted but logged as
`outbox_max_age_above_broker_window` if it exceeds broker safe range.
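The derivation can be written down directly. In this sketch `derive_max_age_hours` and the `configured_hours` argument are hypothetical, and reading the 30d cap as a ceiling on an operator-configured value in `permanent` mode is an interpretation, not something the spec spells out.

```python
from typing import Optional

DEFAULT_PERMANENT_HOURS = 168   # 7d
PERMANENT_CAP_HOURS = 30 * 24   # 30d


def derive_max_age_hours(
    mode: str,
    dedupe_retention_days: Optional[int] = None,
    safety_margin_hours: int = 24,
    configured_hours: Optional[int] = None,
) -> int:
    """Outbox max age implied by the broker's advertised dedupe params."""
    if mode == "permanent":
        wanted = (
            configured_hours
            if configured_hours is not None
            else DEFAULT_PERMANENT_HOURS
        )
        return min(wanted, PERMANENT_CAP_HOURS)
    if mode == "retention_scoped":
        if dedupe_retention_days is None:
            raise ValueError("retention_scoped requires dedupe_retention_days")
        return dedupe_retention_days * 24 - safety_margin_hours
    raise ValueError(f"unknown mode: {mode}")
```

With the default 365-day retention this gives `365 * 24 - 24 = 8736` hours. With `dedupe_retention_days = 1` and the default margin the result is 0, an edge the spec does not address.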
### 4.6 Inbox schema — unchanged from v3 §4.5
### 4.7 Crash recovery — unchanged from v3 §4.6
### 4.8 Failure modes — corrected for dedupe-table model
- **`dead` rows**: surface in `claudemesh daemon outbox --failed`. Same as v4.
- **Receiver-side dedupe**: only daemon-hosted receivers dedupe. Same as v4.
- **Daemon retry after dedupe row expired AND message row GC'd**: in
`retention_scoped` mode this can only happen if the daemon outbox row
was older than `dedupe_retention_days - safety_margin`. Daemon will
refuse to send rows older than its computed `max_age_hours` (§4.5) —
they go to `dead` first, surfaced for human action. So this edge is
closed by daemon-side gating, not broker-side dedupe.
- **Daemon retry after dedupe row expired BUT message row still alive**:
doesn't happen by design — dedupe retention is always ≥ message
retention in operator-sane configs. If misconfigured, message row
persists with NULL `client_message_id` reference, retry creates a new
message, broker emits `cm_broker_dedupe_misconfig_total` with
`(mesh_id, retention_dedupe_days, retention_message_days)` labels.
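The daemon-side gating that closes the first edge could be sketched as below. `gate_outbox_row` is a hypothetical helper, and treating `enqueued_at` as Unix seconds is an assumption (the schema only declares it `INTEGER`).

```python
def gate_outbox_row(row: dict, now_s: int, max_age_hours: int) -> str:
    """Return the status a row should have before a send attempt.

    Rows older than max_age_hours are moved to 'dead' instead of being
    retried, so they never reach a broker whose dedupe row may be gone.
    """
    if row["status"] not in ("pending", "inflight"):
        return row["status"]
    age_s = now_s - row["enqueued_at"]
    if age_s > max_age_hours * 3600:
        return "dead"
    return row["status"]
```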
---
## 5. Inbound — unchanged from v3 §5
---
## 6. Hooks — unchanged from v4 §6
---
## 7-13. Multi-mesh, auto-routing, service install, observability, SDKs, security model, configuration — unchanged from v4
---
## 14. Lifecycle — archive cleanup wording corrected (codex r4)
### 14.1 Key rotation — unchanged crypto from v4 §14.1
### 14.1.1 Archive record format — corrected wording (v5)
`keypair-archive.json` (mode 0600, atomic-rename writes):
```json
{
"schema_version": 1,
"max_archived_keys": 8,
"keys": [
{
"ed25519_pubkey": "base64...", // metadata only; matches the rotated-out signing key for that key_id
"x25519_pubkey": "base64...", // matches the retained private key
"x25519_privkey": "base64...", // sensitive; whole file is 0600
"key_id": "k_01HQX...",
"created_at": "2026-04-12T11:00:00Z",
"rotated_out_at": "2026-05-03T16:00:00Z",
"expires_at": "2026-05-10T16:00:00Z"
}
]
}
```
**Field clarifications (codex r4)**:
- `ed25519_pubkey` is metadata — the daemon does not retain the old ed25519
*private* key. Stored to bind `key_id` ↔ old signing identity for audit
reconstruction (e.g. "this archived x25519 was the recipient half of a
member who at the time signed messages with the matching ed25519").
- `x25519_pubkey` MUST match the public half of `x25519_privkey`. Daemon
validates on archive load; mismatch → quarantine (see corruption rules).
**Cleanup wording (codex r4)**:
- On `expires_at < now`: entry is removed from the live archive file via
atomic-rename rewrite. **Secure deletion of the prior file's data is not
guaranteed** on modern filesystems (journals, COW snapshots, SSD wear
leveling, atomic-rename leaving stale inodes). Operators who need
cryptographic erasure must operate on encrypted volumes or reissue
hardware. Documented in threat model §16.
- "Force-expiry" when `max_archived_keys` is exceeded uses the same
removal mechanism; same caveat applies. Counter
`cm_daemon_archive_force_expired_total{key_id}` exposed.
**Duplicate `key_id` handling (NEW v5)**:
- Archive load rejects any file whose `keys[]` contains two records with
the same `key_id`. Quarantine to `keypair-archive.json.malformed-<ts>`,
start with empty archive, log `keypair_archive_duplicate_key_id`. Daemon
continues to start (we don't want archive corruption to be a permanent
outage). Old in-flight messages encrypted to the lost archived keys
fail to decrypt and are counted in `cm_daemon_decrypt_stale_total`.
**Malformed archive on startup (NEW v5)**:
- File present but JSON parse fails OR schema fails OR pubkey/privkey pair
fails validation: quarantine as above, start with empty archive, log
`keypair_archive_malformed`. Same continue-startup behavior.
- File missing entirely: treated as empty archive (normal first run /
post-cleanup state), no warning.
- File present but mode != 0600: log `keypair_archive_perms` warning,
read anyway. The warning alerts operators; the daemon doesn't auto-chmod
(they should fix their pipeline).
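The startup rules above, sketched as one loader. `load_archive` is a hypothetical name, `pair_matches` is a hypothetical callback standing in for the real x25519 public/private consistency check, and the schema check here covers only the fields this sketch needs.

```python
import json
import os
import stat
import time
from typing import Callable, List


def load_archive(
    path: str,
    pair_matches: Callable[[str, str], bool],
    log: Callable[[str], None] = print,
) -> List[dict]:
    """Load keypair-archive.json; quarantine and start empty if malformed."""
    if not os.path.exists(path):
        return []  # normal first run or post-cleanup state, no warning

    if stat.S_IMODE(os.stat(path).st_mode) != 0o600:
        log("keypair_archive_perms")  # warn, read anyway, never auto-chmod

    def quarantine(event: str) -> List[dict]:
        os.replace(path, f"{path}.malformed-{int(time.time())}")
        log(event)
        return []

    try:
        with open(path, "r", encoding="utf-8") as fh:
            doc = json.load(fh)
        keys = doc["keys"]
        if doc["schema_version"] != 1 or not isinstance(keys, list):
            raise ValueError("schema check failed")
        for k in keys:
            if not pair_matches(k["x25519_pubkey"], k["x25519_privkey"]):
                raise ValueError("pubkey/privkey mismatch")
        ids = [str(k["key_id"]) for k in keys]
    except (ValueError, KeyError, TypeError):
        # json.JSONDecodeError is a ValueError subclass.
        return quarantine("keypair_archive_malformed")

    if len(ids) != len(set(ids)):
        return quarantine("keypair_archive_duplicate_key_id")
    return keys
```

In every quarantine case the daemon continues to start with an empty archive, which matches the "archive corruption must not be a permanent outage" intent.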
### 14.2 Backup — unchanged from v4 §14.2
### 14.3 Local token rotation, compromised host revocation, image-clone, uninstall, recovery — unchanged
---
## 15. Version compat — feature-bit schema validation (v5)
Codex r4: feature parameters need explicit schema-validation rules and
per-feature versioning so we don't paint ourselves into a corner when a
parameter shape evolves.
### 15.1 Feature bits with parameters and versions
Each feature bit's parameters are versioned independently of broker version:
| Bit | `params.version` | Required parameters | Optional parameters |
|---|---|---|---|
| `client_message_id_dedupe` | `1` | `mode: "retention_scoped"\|"permanent"`, `dedupe_retention_days: int (>= 1)` (when mode=retention_scoped) | `tombstone_history_pruned_window_days: int` |
| `concurrent_connection_policy` | `1` | (no parameters) | `default_policy: "prefer_newest"\|"prefer_oldest"\|"allow_concurrent"` |
| `member_keypair_rotated_event` | `1` | (no parameters) | — |
| `key_epoch` | `1` | `max_concurrent_epochs: int (>= 1)` | — |
| `max_payload` | `1` | `inline_bytes: int (>= 1024)`, `blob_bytes: int (>= 1024)` | — |
| `mesh_skill_share` | future | — | — |
| `mcp_host` | future | — | — |
**Validation rules (NEW v5)**:
When the broker advertises feature parameters in
`feature_negotiation_response`, the daemon validates against the
parameter schema for that `params.version`. Validation failures:
- **Required parameter missing**: treated identically to "feature missing
from `supported`" — if the feature is in daemon's `require[]`, daemon
closes WS with code 4010 `feature_unavailable` and exits non-zero.
- **Required parameter out of bounds** (e.g. `dedupe_retention_days = -5`,
`inline_bytes = 0`): same — treated as "feature missing from
`supported`."
- **Unknown `params.version`**: if daemon doesn't recognize the version,
treated as "feature missing." Daemon does NOT silently degrade.
- **Optional parameter missing or invalid**: daemon uses its own default,
logs `feature_optional_param_invalid{feature, param, reason}`, continues.
- **Unknown `mode` for `client_message_id_dedupe`** (not "retention_scoped"
or "permanent"): treated as "feature missing." Future modes require a
`params.version` bump.
Validation is NOT silent: every feature_negotiation_response is logged
fully (with sensitive parameters redacted, though we don't currently have
any) at DEBUG, and a single line at INFO summarizes negotiated capabilities
on each successful negotiation.
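A sketch of these validation rules for one feature, `client_message_id_dedupe`. `validate_dedupe_params` is a hypothetical name; returning `None` stands for "treat the feature as missing", and the 30-day fallback for the optional parameter is an invented daemon-side default used only for illustration.

```python
from typing import Callable, Optional

KNOWN_MODES = ("retention_scoped", "permanent")
DEFAULT_PRUNED_WINDOW_DAYS = 30  # hypothetical daemon-side default
OPTIONAL_PARAM = "tombstone_history_pruned_window_days"


def _is_int(v: object) -> bool:
    # bool is a subclass of int in Python; exclude it explicitly.
    return isinstance(v, int) and not isinstance(v, bool)


def validate_dedupe_params(
    params: dict, log: Callable[[str], None] = print
) -> Optional[dict]:
    """Return normalized params, or None if the feature counts as missing."""
    version = params.get("version")
    if not _is_int(version) or version != 1:
        return None  # unknown params.version: no silent degrade
    mode = params.get("mode")
    if mode not in KNOWN_MODES:
        return None  # future modes require a params.version bump
    out = {"version": 1, "mode": mode}
    if mode == "retention_scoped":
        days = params.get("dedupe_retention_days")
        if not _is_int(days) or days < 1:
            return None  # required parameter missing or out of bounds
        out["dedupe_retention_days"] = days
    window = params.get(OPTIONAL_PARAM)
    if _is_int(window) and window >= 0:
        out[OPTIONAL_PARAM] = window
    else:
        reason = "missing" if window is None else "invalid"
        log(
            "feature_optional_param_invalid "
            f"feature=client_message_id_dedupe param={OPTIONAL_PARAM} reason={reason}"
        )
        out[OPTIONAL_PARAM] = DEFAULT_PRUNED_WINDOW_DAYS
    return out
```

If the feature is in the daemon's `require[]`, a `None` result maps to closing the WS with 4010 and exiting non-zero.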
### 15.2 Negotiation handshake — shape updated (v5)
```
→ daemon: feature_negotiation_request
{
require: ["client_message_id_dedupe",
"concurrent_connection_policy"],
optional: ["mesh_skill_share","mcp_host","max_payload"]
}
← broker: feature_negotiation_response
{
supported: {
"client_message_id_dedupe": {
"params": {
"version": 1,
"mode": "retention_scoped",
"dedupe_retention_days": 365,
"tombstone_history_pruned_window_days": 30
}
},
"concurrent_connection_policy": {
"params": { "version": 1, "default_policy": "prefer_newest" }
},
"member_keypair_rotated_event": { "params": { "version": 1 } },
"max_payload": {
"params": { "version": 1, "inline_bytes": 65536, "blob_bytes": 524288000 }
}
},
missing_required: []
}
```
If `missing_required` is non-empty after broker's response OR after daemon
parameter validation, daemon closes with 4010 and exits non-zero.
### 15.3 IPC negotiation — unchanged from v3 §15.3
### 15.4 Compatibility matrix — unchanged from v3 §15.4
---
## 16. Threat model — unchanged from v4 §16
Plus archive-secure-delete clarification under §14.1.1.
---
## 17. Migration — broker dedupe table is the new prereq
Broker side, deploy order:
1. `CREATE TABLE mesh.client_message_dedupe` + supporting indexes
(additive, online-safe).
2. `ALTER TABLE mesh.topic_message ADD COLUMN client_message_id`, and the same
for `mesh.message_queue` (already in v3/v4 plan).
3. Broker code: every `INSERT` into `topic_message` / `message_queue` is
preceded by an `INSERT ... ON CONFLICT DO UPDATE RETURNING` into
`client_message_dedupe`. The conflict path returns the existing
`broker_message_id` instead of creating a new row.
4. Broker code: nightly job to delete `client_message_dedupe` rows where
`expires_at < NOW()`.
5. Broker code: hook into the existing message-retention sweep to set
`history_available = FALSE` on dedupe rows whose message row has been
pruned.
6. Broker advertises `client_message_id_dedupe` feature bit in negotiation
response.
7. Daemon refuses to start unless that feature bit is advertised with valid
params.
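Step 3 is the core of the design, so here is a runnable sketch of the claim-then-write flow. It swaps in SQLite and `INSERT OR IGNORE` followed by a `SELECT`, not the Postgres `INSERT ... ON CONFLICT DO UPDATE RETURNING` the plan names; the table is a reduced stand-in for `mesh.client_message_dedupe`, and `open_db` / `accept_send` are hypothetical names.

```python
import sqlite3
import uuid
from typing import Tuple


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    # Reduced stand-in for mesh.client_message_dedupe.
    db.execute(
        """CREATE TABLE client_message_dedupe (
               mesh_id TEXT NOT NULL,
               client_message_id TEXT NOT NULL,
               broker_message_id TEXT NOT NULL,
               PRIMARY KEY (mesh_id, client_message_id))"""
    )
    db.execute(
        "CREATE TABLE topic_message "
        "(id TEXT PRIMARY KEY, client_message_id TEXT, body TEXT)"
    )
    return db


def accept_send(
    db: sqlite3.Connection, mesh_id: str, client_message_id: str, body: str
) -> Tuple[str, bool]:
    """Return (broker_message_id, duplicate)."""
    candidate = str(uuid.uuid4())
    with db:  # one transaction: claim the id, then write the message
        cur = db.execute(
            "INSERT OR IGNORE INTO client_message_dedupe VALUES (?, ?, ?)",
            (mesh_id, client_message_id, candidate),
        )
        if cur.rowcount == 1:
            # First time this id is seen: create the message row.
            db.execute(
                "INSERT INTO topic_message VALUES (?, ?, ?)",
                (candidate, client_message_id, body),
            )
            return candidate, False
        # Conflict path: collapse to the original broker_message_id.
        row = db.execute(
            "SELECT broker_message_id FROM client_message_dedupe "
            "WHERE mesh_id = ? AND client_message_id = ?",
            (mesh_id, client_message_id),
        ).fetchone()
        return row[0], True
```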
---
## What changed v4 → v5 (codex round-4 actionable items)
| Codex r4 item | v5 fix | Section |
|---|---|---|
| Dedupe must be retention-scoped, not "permanent" with row-deletion gap | Real `mesh.client_message_dedupe` table; retention independent of message rows; `permanent` becomes opt-in mode meaning "no expires_at" | §4.1, §4.3 |
| Rename misleading mode | `retention_scoped` is the default; `permanent` reserved for explicit opt-in | §4.3, §15.1 |
| Deterministic duplicate response | New shape with `duplicate`, `broker_message_id`, `history_available`; removed `client_id_unknown` | §4.4 |
| Feature parameter validation rules | `params.version` per feature; required-param failure = treated as missing-required-feature; daemon closes WS 4010, exits non-zero | §15.1 |
| Drop "zeroed/secure-delete" promise | Replaced with "removed from live archive; secure deletion not guaranteed"; threat model documents | §14.1.1 |
| Duplicate `key_id` handling | Archive load rejects, quarantine, start empty, continue | §14.1.1 |
| Malformed archive startup behavior | Quarantine, start empty, continue; mode-mismatch warns but reads | §14.1.1 |
| Linux MAC||MAC self-collision | Documented; `host_fingerprint.id_override` escape hatch | §2.2.1 |
| RunPod warning on persistent default | Logged at INFO so default is visible | §2.1 |
---
## What needs review (round 5)
1. **Dedupe table design (§4.3)** — is `(mesh_id, client_message_id)`
PRIMARY KEY enough, or do we need versioning of the dedupe row itself
(e.g. when destination changes mid-retry)? Is `destination_kind` /
`destination_ref` needed at all, or just for audit?
2. **`history_available = FALSE` semantics (§4.4)** — does it actually fix
the case where receivers ask for history of a pruned message? Or does
the receiver need its own dedupe-with-history-pruned pathway?
3. **Daemon outbox max-age math (§4.5)** — is `dedupe_retention_days * 24
- 24` margin correct? Should the margin be a percentage instead of a
fixed 24h?
4. **Feature param validation (§15.1)** — does treating "invalid required
param" as "missing required feature" lose useful diagnostic detail?
Should we have a 4011 `feature_param_invalid` close code separately?
5. **Archive quarantine (§14.1.1)** — is "continue startup with empty
archive" the right call, or should it be opt-in / refuse-by-default?
6. **Anything else still wrong?** Read it as if you were going to operate
this for a year.
Three options:
- **(a) v5 is shippable**: lock the spec, start coding the frozen core.
- **(b) v6 needed**: list the must-fix items.
- **(c) the architecture itself is wrong**: what would you do differently?
Be ruthless.


@@ -0,0 +1,447 @@
# `claudemesh daemon` — Final Spec v6
> **Round 6.** v5 was reviewed by codex (round 5) which found the dedupe
> table architecture sound but called out four idempotency-correctness
> issues that would silently corrupt sends in production:
>
> 1. **Idempotency key reuse with different payload/destination** — v5
> silently collapsed a different send onto the original. Need a request
> fingerprint.
> 2. **`status = 'rejected'` underspecified** — schema allowed it, semantics
> didn't. Either fully define or drop.
> 3. **Outbox max-age math edges** — `dedupe_retention_days = 1` minus 24h
> margin = 0 hours, which is undefined.
> 4. **Broker atomicity not stated** — dedupe insert and message insert
> must be one transaction or you produce orphan dedupe rows.
>
> v6 fixes all four. **Intent §0 unchanged from v2.** v6 only revises
> idempotency semantics in §4 and migration in §17.
---
## 0. Intent — unchanged, see v2 §0
---
## 1. Process model — unchanged from v3 §1 / v2 §1
---
## 2. Identity — unchanged from v5 §2
---
## 3. IPC surface — unchanged from v4 §3
---
## 4. Delivery contract — at-least-once with **request-fingerprinted** dedupe
Codex r5: dedupe must compare the *whole request shape*, not just
`(mesh, client_message_id)`. Otherwise a caller who reuses an idempotency
key with a different destination or body silently drops the new send and
gets the old send's metadata back.
### 4.1 The contract (precise — v6)
> **Local guarantee**: each successful `POST /v1/send` returns a stable
> `client_message_id`. The send is durably persisted to `outbox.db` before
> the response returns.
>
> **Broker guarantee**: the broker maintains a dedupe record per accepted
> `(mesh_id, client_message_id)` in `mesh.client_message_dedupe`. Each
> dedupe record carries a canonical `request_fingerprint`. Retries with
> the same `client_message_id` AND matching fingerprint collapse to the
> original `broker_message_id`. Retries with the same `client_message_id`
> but a different fingerprint return a deterministic conflict
> (`409 idempotency_key_reused`) and do **not** create a new message.
>
> **Atomicity guarantee**: dedupe row insertion and message row insertion
> happen in one broker DB transaction. Either both land, or neither. No
> orphan dedupe rows. If the broker crashes between dedupe insert and
> message insert, the rollback unwinds both.
>
> **End-to-end guarantee**: at-least-once delivery, with
> `client_message_id` propagated to receivers' inboxes.
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
### 4.3 Broker schema — request fingerprint added (v6)
```sql
CREATE TABLE mesh.client_message_dedupe (
mesh_id UUID NOT NULL REFERENCES mesh.mesh(id) ON DELETE CASCADE,
client_message_id TEXT NOT NULL,
-- The original accepted message; FK NOT enforced because the message row
-- may be GC'd by retention sweeps before the dedupe row expires.
broker_message_id UUID NOT NULL,
-- Canonical fingerprint of the original request. Recomputed on every
-- duplicate retry; mismatch → 409 idempotency_key_reused. Schema in §4.4.
request_fingerprint BYTEA NOT NULL, -- 32-byte sha256
destination_kind TEXT NOT NULL CHECK(destination_kind IN ('topic','dm','queue')),
destination_ref TEXT NOT NULL,
first_seen_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
expires_at TIMESTAMPTZ, -- NULL = `permanent` mode
history_available BOOLEAN NOT NULL DEFAULT TRUE, -- flipped FALSE when message row GC'd
PRIMARY KEY (mesh_id, client_message_id)
);
CREATE INDEX client_message_dedupe_expires_idx
ON mesh.client_message_dedupe(expires_at)
WHERE expires_at IS NOT NULL;
ALTER TABLE mesh.topic_message ADD COLUMN client_message_id TEXT;
ALTER TABLE mesh.message_queue ADD COLUMN client_message_id TEXT;
```
**`status` column dropped (codex r5)**. Rejected requests do **not**
consume idempotency keys. Rationale below in §4.6.
### 4.4 Request fingerprint — canonical form (NEW v6)
The fingerprint covers everything that makes a send semantically distinct.
A retry must reproduce the same fingerprint bit-for-bit; anything else is
a different send and must not be collapsed.
```
request_fingerprint = sha256(
envelope_version || 0x00 ||
destination_kind || 0x00 ||
destination_ref || 0x00 ||
reply_to_id_or_empty || 0x00 ||
priority || 0x00 ||
meta_canonical_json || 0x00 ||
body_hash
)
```
Where:
- `envelope_version`: integer string (e.g. `"1"`). Bumps when the envelope
shape changes.
- `destination_kind`: `topic`, `dm`, or `queue`.
- `destination_ref`: topic name, recipient ed25519 pubkey hex, or queue id.
- `reply_to_id_or_empty`: original `broker_message_id` or empty string.
- `priority`: `now`, `next`, or `low`.
- `meta_canonical_json`: the `meta` field, serialized with sorted keys,
no whitespace, escape-canonical (RFC 8785 JCS). Empty meta = empty string.
- `body_hash`: sha256(body bytes), hex.
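The canonical form above can be sketched in a few lines of Python. This is a minimal illustration, not the daemon's implementation: the function names are hypothetical, every field is assumed to be UTF-8 encoded before hashing, and `json.dumps` with sorted keys only approximates RFC 8785 JCS (it matches for strings, integers, booleans and null; floats need a real JCS library).

```python
import hashlib
import json


def canonical_meta(meta):
    """Approximate JCS: sorted keys, no whitespace. Empty meta -> empty string."""
    if not meta:
        return ""
    # Assumption: json.dumps stands in for a vetted RFC 8785 implementation.
    return json.dumps(meta, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def request_fingerprint(envelope_version, destination_kind, destination_ref,
                        reply_to_id, priority, meta, body):
    """Return the 32-byte sha256 over the NUL-separated canonical fields."""
    body_hash = hashlib.sha256(body).hexdigest()  # hex, per the field list
    fields = [
        str(envelope_version),   # integer string, e.g. "1"
        destination_kind,        # topic | dm | queue
        destination_ref,
        reply_to_id or "",       # reply_to_id_or_empty
        priority,                # now | next | low
        canonical_meta(meta),
        body_hash,
    ]
    joined = b"\x00".join(field.encode("utf-8") for field in fields)
    return hashlib.sha256(joined).digest()
```

Key order in `meta` does not change the result, while any change to destination, priority, or body does. That is the property the 409 path relies on.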
The fingerprint is computed:
1. **Daemon-side** before durable outbox persistence — stored as
`outbox.request_fingerprint` (NEW column) so retries always produce
the same fingerprint regardless of caller behavior.
2. **Broker-side** on first receipt — stored in
`client_message_dedupe.request_fingerprint`.
3. **Broker-side** on every duplicate retry — recomputed and compared
byte-equal to the stored value.
If the daemon and broker disagree on the canonical form (e.g. JCS
implementation drift), the broker emits
`cm_broker_dedupe_fingerprint_mismatch_total{client_id, mesh_id}` and
returns `409 idempotency_key_reused` with a body that includes a hex
prefix of the broker's fingerprint for debugging (§4.5). Daemons that see
this should log it loudly and stop retrying that outbox row (it goes to `dead`).
### 4.5 Duplicate response — three cases (v6)
| Case | HTTP/WS code | Body |
|---|---|---|
| First insert | `201 created` | `{ broker_message_id, client_message_id, history_id, duplicate: false }` |
| Duplicate, fingerprint match | `200 ok` | `{ broker_message_id, client_message_id, history_id, duplicate: true, history_available, first_seen_at }` |
| Duplicate, fingerprint mismatch | `409 idempotency_key_reused` | `{ client_message_id, conflict: "request_fingerprint_mismatch", broker_fingerprint_prefix: "ab12cd34..." }` (first 8 bytes hex) |
Daemon outcomes:
- `201` → mark outbox row `done`, store `broker_message_id`. Normal path.
- `200 duplicate` with `history_available: true` → mark `done`, no
re-fanout, log at INFO.
- `200 duplicate` with `history_available: false` → mark `done`, log at
WARN. The original delivery succeeded; receivers got it.
- `409 idempotency_key_reused` → mark outbox row `dead`, surface in
`claudemesh daemon outbox --failed`. Operator must rotate the
idempotency key by hand and resubmit (`outbox requeue --new-id <id>`,
NEW v6 subcommand). Daemon does NOT auto-rotate to avoid masking caller
bugs.
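As a rough sketch, the daemon outcomes above reduce to a small pure function. The function name and the tuple return shape are hypothetical. The `ERROR` level on the 409 path is an assumption: the spec only says the row is surfaced in the `--failed` view.

```python
def outbox_outcome(status_code, body):
    """Map a broker send response to (new_outbox_status, log_level_or_None)."""
    if status_code == 201:
        # Normal path: first insert accepted by the broker.
        return ("done", None)
    if status_code == 200 and body.get("duplicate") is True:
        # Duplicate with matching fingerprint: never re-fan-out.
        level = "INFO" if body.get("history_available") else "WARN"
        return ("done", level)
    if status_code == 409:
        # Fingerprint mismatch: no auto-rotation, operator must requeue.
        return ("dead", "ERROR")  # log level is an assumption
    raise ValueError(f"response {status_code} is outside the duplicate-response table")
```

Keeping this mapping free of I/O makes it easy to cover each of the three table rows with a unit test.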
### 4.6 Why rejected requests don't consume idempotency keys (v6)
`status` was in v5's schema but underspecified. Two scenarios:
- **Transient broker error** (DB down, queue full, network blip): daemon
retries. If we'd persisted a `rejected` row on the first attempt, the
retry would fail forever. Bad.
- **Permanent validation error** (payload too large, destination not
found, auth missing): broker returns the appropriate `4xx` immediately
without inserting a dedupe row. Daemon either fixes the request and
retries (different fingerprint → fingerprint mismatch → `409` per §4.5)
or marks dead. Persisting a "rejected" row buys nothing — the daemon
isn't going to send the same broken request again with the same key.
Net result: `client_message_dedupe` rows only exist when the broker
**successfully** accepted a message and committed it. The single source
of truth for "was this idempotency key consumed?" is the existence of
the dedupe row. No status enum, no ambiguous states.
### 4.7 Broker atomicity contract (NEW v6)
Every accept path runs in one DB transaction with the following shape:
```sql
BEGIN;
-- Pre-generate broker_message_id outside the transaction; pass in.
INSERT INTO mesh.client_message_dedupe
(mesh_id, client_message_id, broker_message_id, request_fingerprint,
destination_kind, destination_ref, expires_at)
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
$dest_kind, $dest_ref, $expires_at)
ON CONFLICT (mesh_id, client_message_id) DO NOTHING
RETURNING broker_message_id, request_fingerprint, history_available, first_seen_at;
-- If RETURNING was empty (conflict), do a SELECT to fetch the original
-- and exit the transaction with a duplicate response.
-- If RETURNING produced a row AND $fingerprint != returned.fingerprint,
-- that's the §4.5 mismatch path — also exit with 409.
-- Otherwise, this is the first insert. Insert the message row.
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
VALUES ($msg_id, $mesh_id, $client_id, ...);
-- Optional: enqueue fan-out work, etc.
COMMIT;
```
Failure modes:
- Crash before `COMMIT`: both rows roll back. Next daemon retry inserts
cleanly.
- Crash after `COMMIT` but before WS ACK: dedupe row exists, message row
exists. Daemon retries → fingerprint matches → `200 duplicate`. Net:
exactly one broker-accepted row, one daemon `done` transition.
- Constraint violation on message row insert (e.g. unique violation on
some other column): rolls back the dedupe insert. Returns `5xx` to
daemon. Daemon retries; same fingerprint reproduces the same constraint
violation; daemon eventually marks `dead`. No orphan dedupe row.
A nightly orphan-check job (counter `cm_broker_dedupe_orphan_check_total`)
validates that every `client_message_dedupe` row has a matching
`topic_message` or `message_queue` row OR that the matching message row has
been retention-pruned (in which case `history_available = FALSE` was set).
Any row failing both conditions is logged as
`cm_broker_dedupe_orphan_found{mesh_id}` for human review. The count should
be zero in steady state.
### 4.8 Outbox schema — fingerprint stored alongside (v6)
```sql
CREATE TABLE outbox (
id TEXT PRIMARY KEY,
client_message_id TEXT NOT NULL UNIQUE,
request_fingerprint BLOB NOT NULL, -- 32 bytes
payload BLOB NOT NULL,
enqueued_at INTEGER NOT NULL,
attempts INTEGER DEFAULT 0,
next_attempt_at INTEGER NOT NULL,
status TEXT CHECK(status IN ('pending','inflight','done','dead')),
last_error TEXT,
delivered_at INTEGER,
broker_message_id TEXT
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
```
`request_fingerprint` is computed at IPC accept time and stored. Every
retry sends the same bytes. The daemon never recomputes from `payload`
post-enqueue (would produce drift if envelope_version changes between
daemon runs).
### 4.9 Outbox max-age math — bounded (v6)
Codex r5: the v5 formula `(dedupe_retention_days * 24) - 24h_margin`
yields zero at `dedupe_retention_days = 1` and a negative value below
that. Neither is a usable max age.
v6 formula and bounds:
- **Minimum supported broker dedupe retention**: 3 days. Daemon refuses
to start if broker advertises `dedupe_retention_days < 3` (treats it
as `feature_param_below_floor`, closes with 4012 per §15.5).
- **Daemon `max_age_hours` derivation**:
- `permanent` mode → daemon uses config default (168h = 7d), cap 720h
(30d).
- `retention_scoped` mode → daemon `max_age_hours = max(72,
(dedupe_retention_days * 24) - safety_margin_hours)` where
`safety_margin_hours = max(24, ceil(dedupe_retention_days * 0.1 *
24))`. For `dedupe_retention_days=3` this gives
`max(72, 72-24) = 72h`. For 30 days: `max(72, 720-72) = 648h`. For
365 days: `max(72, 8760-876) = 7884h`.
- The 72h floor prevents the daemon outbox from being uselessly short
— three days is enough margin for normal operator response to a
paged outage.
- Operator override allowed via `[outbox] max_age_hours_override = N`,
but if `N` exceeds `dedupe_retention_days * 24 - 1` daemon refuses to
start with `outbox_max_age_above_dedupe_window`. The override exists
for the rare case of a much-shorter-than-default outbox; it does not
exist to circumvent the broker's dedupe window.
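The derivation above can be checked with a short sketch. Function and constant names are hypothetical, and applying the 720h cap to a `configured_hours` argument in `permanent` mode is an assumption about how the cap is meant to work. The 10% margin uses integer arithmetic so that `ceil` is not affected by float rounding.

```python
MIN_RETENTION_DAYS = 3
FLOOR_HOURS = 72
PERMANENT_DEFAULT_HOURS = 168  # 7d
PERMANENT_CAP_HOURS = 720      # 30d


def safety_margin_hours(days):
    """max(24, ceil(days * 0.1 * 24)), computed without floats."""
    ten_percent = -(-(days * 24) // 10)  # integer ceil of days*24/10
    return max(24, ten_percent)


def max_age_hours(mode, dedupe_retention_days=None,
                  configured_hours=PERMANENT_DEFAULT_HOURS, override_hours=None):
    if mode == "permanent":
        return min(configured_hours, PERMANENT_CAP_HOURS)
    if dedupe_retention_days < MIN_RETENTION_DAYS:
        raise ValueError("dedupe_retention_days below daemon floor")
    window_hours = dedupe_retention_days * 24
    if override_hours is not None:
        # The override may shorten the outbox, never outlive the dedupe window.
        if override_hours > window_hours - 1:
            raise ValueError("outbox_max_age_above_dedupe_window")
        return override_hours
    return max(FLOOR_HOURS, window_hours - safety_margin_hours(dedupe_retention_days))
```

For retention of 3, 30 and 365 days this reproduces the 72h, 648h and 7884h figures quoted in the list.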
### 4.10 Inbox schema — unchanged from v3 §4.5
### 4.11 Crash recovery — unchanged from v3 §4.6
### 4.12 Failure modes — corrected for fingerprint model (v6)
- **Fingerprint mismatch on retry** (`409 idempotency_key_reused`): outbox
row marked `dead`. Surfaced in `--failed` view. Operator command
`outbox requeue --new-id <id>` rotates `client_message_id` and retries.
- **Daemon retry after dedupe row hard-deleted by retention sweep**: in
`retention_scoped` mode, daemon `max_age_hours` is bounded inside the
retention window (§4.9), so this can only happen via operator override.
In that case the retry creates a NEW dedupe row + new message — the
caller chose this risk explicitly. Counter
`cm_daemon_retry_after_dedupe_expired_total`.
- **Daemon retry after dedupe row hard-deleted in `permanent` mode**:
cannot happen by definition — `permanent` means no `expires_at`. Only
mesh deletion removes dedupe rows.
- **Duplicate row, history pruned**: as v5 §4.4. Mark `done`, log
`cm_daemon_dedupe_history_pruned_total`.
---
## 5. Inbound — unchanged from v3 §5
---
## 6. Hooks — unchanged from v4 §6
---
## 7-13. Multi-mesh, auto-routing, service install, observability, SDKs, security model, configuration — unchanged from v4
---
## 14. Lifecycle — unchanged from v5 §14
---
## 15. Version compat — feature param updated for new dedupe semantics
### 15.1 Feature bits with parameters (v6 update)
| Bit | `params.version` | Required parameters | Optional parameters |
|---|---|---|---|
| `client_message_id_dedupe` | `2` | `mode: "retention_scoped"\|"permanent"`, `dedupe_retention_days: int (>= 3)` (when mode=retention_scoped), `request_fingerprint: bool == true` | `tombstone_history_pruned_window_days: int` |
| `concurrent_connection_policy` | `1` | (no parameters) | `default_policy: "prefer_newest"\|"prefer_oldest"\|"allow_concurrent"` |
| `member_keypair_rotated_event` | `1` | (no parameters) | — |
| `key_epoch` | `1` | `max_concurrent_epochs: int (>= 1)` | — |
| `max_payload` | `1` | `inline_bytes: int (>= 1024)`, `blob_bytes: int (>= 1024)` | — |
`client_message_id_dedupe` bumped to `params.version = 2` because it now
requires `request_fingerprint = true`. A broker still on version 1
(no fingerprint comparison) is treated as "feature missing" and the
daemon refuses to start. That's intentional — v0.9.0 daemons require
fingerprint enforcement for safe idempotency.
`dedupe_retention_days` minimum raised to 3 (matches the §4.9 floor).
### 15.2 Negotiation handshake — unchanged shape from v5 §15.2
### 15.3 IPC negotiation — unchanged from v3 §15.3
### 15.4 Compatibility matrix — unchanged from v3 §15.4
### 15.5 Diagnostic close codes (NEW v6 — codex r5)
WebSocket close codes are split for diagnostic clarity:
| Code | Reason | When |
|---|---|---|
| `4010` | `feature_unavailable` | Required feature missing from broker's `supported` |
| `4011` | `feature_param_invalid` | Required feature present but parameters fail validation (missing required, out of bounds, unknown version) |
| `4012` | `feature_param_below_floor` | Required feature parameter below daemon's hard floor (e.g. `dedupe_retention_days < 3`) |
Daemon logs the full negotiation payload at WARN before exiting; supervisor
and alerting catch the restart loop.
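A sketch of how a daemon might pick among the three codes for the `client_message_id_dedupe` feature, following the §15.1 parameter table and the close-code table above. The function name and the `(code, reason)` return shape are hypothetical; `supported` is assumed to be the `supported` object from the negotiation response.

```python
def dedupe_close_code(supported):
    """Return None if the feature is acceptable, else (close_code, reason)."""
    feature = supported.get("client_message_id_dedupe")
    if feature is None:
        return (4010, "feature_unavailable")

    params = feature.get("params", {})
    invalid = (4011, "feature_param_invalid")
    if params.get("version") != 2:
        return invalid  # unknown or older params.version
    mode = params.get("mode")
    if mode not in ("retention_scoped", "permanent"):
        return invalid
    if params.get("request_fingerprint") is not True:
        return invalid
    if mode == "retention_scoped":
        days = params.get("dedupe_retention_days")
        if not isinstance(days, int) or isinstance(days, bool):
            return invalid  # missing or wrong type
        if days < 3:
            return (4012, "feature_param_below_floor")
    return None
```

A broker still advertising `params.version = 1` lands on 4011 here, since the close-code table lists "unknown version" under `feature_param_invalid`.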
---
## 16. Threat model — unchanged from v4 §16
---
## 17. Migration — broker dedupe table + atomicity (v6)
Broker side, deploy order:
1. `CREATE TABLE mesh.client_message_dedupe` with v6 schema (additive,
online-safe).
2. `ALTER TABLE mesh.topic_message ADD COLUMN client_message_id`.
3. `ALTER TABLE mesh.message_queue ADD COLUMN client_message_id`.
4. Broker code refactor: every accept path wraps dedupe insert + message
insert in **one transaction** (§4.7). Pre-generated
`broker_message_id` (ulid in code) passed in.
5. Broker code: nightly job to delete dedupe rows where `expires_at <
NOW()` (skip in `permanent` mode).
6. Broker code: hook into the message-retention sweep — when a
`topic_message` or `message_queue` row is hard-deleted, find the
matching dedupe row by `client_message_id` and set `history_available
= FALSE`. (Note: `client_message_id` is nullable on those tables for
legacy traffic; nullable rows have no dedupe row to update.)
7. Broker code: nightly orphan-check job (§4.7); alerts on non-zero.
8. Broker advertises `client_message_id_dedupe` feature with
`params.version = 2` and `request_fingerprint: true`.
9. Daemon refuses to start unless that feature bit is advertised with
valid v2 params.
Rollback plan: feature flag disables fingerprint enforcement broker-side
(falls back to existing pre-v6 behavior — no dedupe). Daemons that
require fingerprint refuse to start. Operator switches off the feature
flag, reverts the daemon, restarts. No data loss; pending dedupe rows
remain in place for the next forward roll.
---
## What changed v5 → v6 (codex round-5 actionable items)
| Codex r5 item | v6 fix | Section |
|---|---|---|
| Idempotency key reuse with different payload silently collapses | `request_fingerprint` BYTEA in dedupe table; canonical form per §4.4; 409 on mismatch | §4.3, §4.4, §4.5 |
| `status='rejected'` underspecified | Dropped `status` column; rejected requests don't consume keys; existence of dedupe row = "key consumed" | §4.3, §4.6 |
| Outbox max-age math edges at low retention | 72h floor; min `dedupe_retention_days = 3`; percentage-based safety margin; explicit override gating | §4.9, §15.1 |
| Broker atomicity not stated | One transaction per accept path; orphan-check job; rollback semantics | §4.7 |
| Diagnostic detail on feature param failures | New close codes 4011 / 4012 separate from 4010 | §15.5 |
| Outbox stores fingerprint | NEW column `outbox.request_fingerprint` BLOB; computed once at IPC accept | §4.8 |
| Operator command for fingerprint-mismatch recovery | NEW `outbox requeue --new-id <id>` to rotate idempotency key | §4.5 |
---
## What needs review (round 6)
1. **Request fingerprint canonical form (§4.4)** — does JCS work
cross-language for `meta_canonical_json` (Python json.dumps,
Go encoding/json, JS JSON.stringify all behave differently)? Should
we ship a vetted JCS lib in each SDK or fall back to a simpler
"sorted keys + no spaces + escape-as-stored" rule with conformance
tests?
2. **Atomicity contract (§4.7)** — is the orphan-check sufficient, or
does a violation mean we need a "broker rebuild dedupe from messages"
recovery tool? The latter is destructive but useful for ops emergencies.
3. **Max-age formula (§4.9)** — is the 72h floor correct? Is the
percentage-based safety margin (`max(24, ceil(0.1 * dedupe_window))`)
the right shape? Or simpler to say "always 24h"?
4. **`409 idempotency_key_reused` recovery flow (§4.5)** — is sending the
row to `dead` and surfacing it via `outbox --failed` enough? Should
the daemon emit a high-priority event for the SSE stream so operators
are paged immediately?
5. **Diagnostic close codes (§15.5)** — is splitting 4010/4011/4012
useful, or does it just push complexity onto operators? Should we
collapse to 4010 with structured close-reason JSON instead?
6. **Anything else still wrong?** Read it as if you were going to
operate this for a year. What falls down?
Three options:
- **(a) v6 is shippable**: lock the spec, start coding the frozen core.
- **(b) v7 needed**: list the must-fix items.
- **(c) the architecture itself is wrong**: what would you do differently?
Be ruthless.


@@ -0,0 +1,439 @@
# `claudemesh daemon` — Final Spec v7
> **Round 7.** v6 was reviewed by codex (round 6) which found the broker
> layer largely correct but caught five daemon-side and broker-tx
> correctness gaps:
>
> 1. **Daemon-local duplicate POST semantics** undefined — local fingerprint
> comparison missing across `pending` / `inflight` / `done` / `dead`.
> 2. **§4.6 rejected-request contradiction** — talked about both "fix and
> retry" and "fingerprint mismatch → 409". Only one of those can be true.
> 3. **§4.7 pseudocode bug** — `ON CONFLICT DO NOTHING RETURNING` returns
> nothing on conflict; the fingerprint comparison was in the wrong branch.
> 4. **Max-age math floor consumes margin** — at min retention (3 days),
> daemon max-age 72h equals broker window 72h. Not inside the window.
> 5. **Broker transaction boundary incomplete** — fan-out/queue/history side
> effects not stated as in-transaction; "optional" wording was wrong.
>
> v7 fixes all five. **Intent §0 unchanged from v2.** v7 revises only §4
> (delivery contract), §15 (feature param min), and §17 (migration).
---
## 0. Intent — unchanged, see v2 §0
---
## 1. Process model — unchanged
## 2. Identity — unchanged from v5 §2
## 3. IPC surface — unchanged from v4 §3
---
## 4. Delivery contract — at-least-once, fingerprinted at IPC and broker layers
### 4.1 The contract (precise — v7)
> **Local guarantee**: each successful `POST /v1/send` returns a stable
> `client_message_id`. The send is durably persisted to `outbox.db` before
> the response returns. The daemon enforces request-fingerprint
> idempotency at the IPC layer: a duplicate `POST` with the same
> `client_message_id` and matching `request_fingerprint` returns the
> stable prior result; with a mismatched fingerprint it returns local
> `409 idempotency_key_reused` and the new request is **not** persisted.
>
> **Broker guarantee**: the broker maintains a dedupe record per
> accepted `(mesh_id, client_message_id)` in `mesh.client_message_dedupe`
> with `request_fingerprint`. Retries with matching fingerprint collapse;
> retries with mismatched fingerprint return `409
> idempotency_key_reused` without creating a new message.
>
> **Atomicity guarantee**: every durable side effect of a successful
> accept (dedupe row, message row, fan-out work, history row, queue
> insertion) lands in the same broker DB transaction. Either all commit
> or none do.
>
> **End-to-end guarantee**: at-least-once delivery, with
> `client_message_id` propagated to receivers' inboxes.
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
### 4.3 Broker schema — unchanged from v6 §4.3
(`mesh.client_message_dedupe` table with `request_fingerprint BYTEA`, no
`status` column.)
### 4.4 Request fingerprint canonical form — unchanged from v6 §4.4
### 4.5 Daemon-local idempotency at the IPC layer (NEW v7 — codex r6)
The daemon enforces fingerprint idempotency **before** the request hits
`outbox.db` so a caller bug never creates duplicate-key/mismatch-payload
state at all.
#### 4.5.1 IPC accept algorithm
On `POST /v1/send`:
1. Validate request envelope (auth, schema, size limits). Failures
here return `4xx` immediately. **No outbox row is written.** The
`client_message_id` (whether caller-supplied or daemon-minted) is
**not consumed** — the same id may be reused by the caller for a
subsequent valid send.
2. Compute `request_fingerprint` (§4.4).
3. Look up existing outbox row by `client_message_id`:
| Existing row state | Fingerprint match? | Daemon response |
|---|---|---|
| (no row) | — | Insert new outbox row in `pending`; return `202 accepted, queued` with `client_message_id` |
| `pending` | match | Return `202 accepted, queued` with the existing `client_message_id`. No new row. Idempotent retry of an in-progress send |
| `pending` | mismatch | Return `409 idempotency_key_reused` with `conflict: "outbox_pending_fingerprint_mismatch"`. **No mutation of the existing row.** |
| `inflight` | match | Return `202 accepted, inflight`. No new row. Caller is retrying mid-broker-roundtrip |
| `inflight` | mismatch | Return `409 idempotency_key_reused` with `conflict: "outbox_inflight_fingerprint_mismatch"` |
| `done` | match | Return `200 ok, duplicate: true, broker_message_id, history_id`. No new row, no broker call |
| `done` | mismatch | Return `409 idempotency_key_reused` with `conflict: "outbox_done_fingerprint_mismatch", broker_message_id` |
| `dead` | match | Return `409 idempotency_key_reused` with `conflict: "outbox_dead_fingerprint_match", reason: "<last_error>"`. Caller must rotate the id (see §4.6.3) — daemon refuses to re-attempt a dead row's exact bytes. |
| `dead` | mismatch | Return `409 idempotency_key_reused` with `conflict: "outbox_dead_fingerprint_mismatch"` |
Rule: any IPC `409` carries the daemon's `request_fingerprint` (8-byte
hex prefix) so callers can debug client/server canonical-form drift.
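The decision table above is small enough to express as a pure function. This is a sketch with a hypothetical name and return shape, `(http_status, conflict_or_None)`; the response bodies (fingerprint prefix, `broker_message_id`, `reason`) are left out.

```python
KNOWN_STATES = ("pending", "inflight", "done", "dead")


def ipc_decision(existing_status, fingerprint_matches):
    """existing_status is None when no outbox row exists for the id."""
    if existing_status is None:
        return (202, None)  # insert new row in pending
    if existing_status not in KNOWN_STATES:
        raise ValueError(f"unknown outbox status: {existing_status}")
    if not fingerprint_matches:
        # Every mismatch is a 409; the existing row is never mutated.
        return (409, f"outbox_{existing_status}_fingerprint_mismatch")
    if existing_status in ("pending", "inflight"):
        return (202, None)  # idempotent retry of an in-progress send
    if existing_status == "done":
        return (200, None)  # duplicate: true, no broker call
    # dead + match: daemon refuses to re-attempt the same bytes
    return (409, "outbox_dead_fingerprint_match")
```

Only the `dead` state returns a 409 on a matching fingerprint. Every other state treats a match as a safe retry.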
#### 4.5.2 Outbox table — fingerprint required, atomic UPSERT removed
```sql
CREATE TABLE outbox (
id TEXT PRIMARY KEY,
client_message_id TEXT NOT NULL UNIQUE,
request_fingerprint BLOB NOT NULL, -- 32 bytes
payload BLOB NOT NULL,
enqueued_at INTEGER NOT NULL,
attempts INTEGER DEFAULT 0,
next_attempt_at INTEGER NOT NULL,
status TEXT CHECK(status IN ('pending','inflight','done','dead')),
last_error TEXT,
delivered_at INTEGER,
broker_message_id TEXT
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
```
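Insertion is an explicit lock + check + insert in one transaction, not
`INSERT OR IGNORE`. Assuming `outbox.db` is SQLite (as the `BLOB` columns
and `INSERT OR IGNORE` suggest), there is no `SELECT ... FOR UPDATE`; the
equivalent is `BEGIN IMMEDIATE; SELECT; if-no-row INSERT; COMMIT;`, which
takes the write lock up front. The daemon never auto-mutates an existing
row's `request_fingerprint` or `payload`; mismatches are 409s, not silent
overwrites.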
`request_fingerprint` is computed once at IPC accept time and frozen.
Retries to the broker re-send the same bytes from `payload` and the
same `request_fingerprint`. Daemon does not recompute post-enqueue.
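A minimal sketch of the lock + check + insert path using Python's `sqlite3`, assuming `outbox.db` is SQLite. SQLite has no row-level `FOR UPDATE`, so the sketch uses `BEGIN IMMEDIATE` to take the write lock before the lookup. `open_outbox`, `accept`, and the `uuid4` row id (standing in for a ulid) are hypothetical.

```python
import sqlite3
import time
import uuid

SCHEMA = """
CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  payload             BLOB NOT NULL,
  enqueued_at         INTEGER NOT NULL,
  attempts            INTEGER DEFAULT 0,
  next_attempt_at     INTEGER NOT NULL,
  status              TEXT CHECK(status IN ('pending','inflight','done','dead')),
  last_error          TEXT,
  delivered_at        INTEGER,
  broker_message_id   TEXT
)
"""


def open_outbox(path=":memory:"):
    # isolation_level=None: we issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(SCHEMA)
    return conn


def accept(conn, client_message_id, fingerprint, payload):
    """Return ("inserted" | "match" | "mismatch", row_status)."""
    conn.execute("BEGIN IMMEDIATE")  # write lock before the check
    try:
        row = conn.execute(
            "SELECT status, request_fingerprint FROM outbox"
            " WHERE client_message_id = ?",
            (client_message_id,),
        ).fetchone()
        if row is None:
            now = int(time.time())
            conn.execute(
                "INSERT INTO outbox (id, client_message_id, request_fingerprint,"
                " payload, enqueued_at, next_attempt_at, status)"
                " VALUES (?, ?, ?, ?, ?, ?, 'pending')",
                (uuid.uuid4().hex, client_message_id, fingerprint, payload, now, now),
            )
            result = ("inserted", "pending")
        elif bytes(row[1]) == fingerprint:
            result = ("match", row[0])
        else:
            result = ("mismatch", row[0])  # existing row left untouched
        conn.execute("COMMIT")
        return result
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise
```

The mismatch branch writes nothing, so a caller bug cannot overwrite a stored `payload` or `request_fingerprint`.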
### 4.6 Rejected-request semantics — pick one rule (NEW v7 — codex r6)
> **Rule: the `client_message_id` is consumed iff the daemon writes an
> outbox row. Anything that fails before outbox insertion (validation,
> auth, size) leaves the id untouched and freely reusable.**
This makes §4.6 internally consistent with §4.5:
#### 4.6.1 IPC validation failure (no outbox row written)
- Schema/auth/size/destination-not-resolvable failures return `4xx`
immediately. The `client_message_id` is **not** stored anywhere on
the daemon. Caller may re-send with the same id and a fixed payload;
it will be treated as a fresh request because no outbox row exists.
#### 4.6.2 Outbox row exists, broker permanent rejection (4xx response)
- Daemon receives `4xx` from broker (e.g. payload size delta between
daemon and broker advertised limits, mesh-level reject). Outbox row
transitions to `dead` with `last_error` populated.
- Caller retrying with same `client_message_id` → daemon returns
`409 idempotency_key_reused, conflict: "outbox_dead_*"` per §4.5.1.
- The id is consumed (row is locked in `dead`) until operator action.
#### 4.6.3 Operator recovery: rotating an idempotency key
To unstick a `dead` row whose payload needs to change, operator runs:
```
claudemesh daemon outbox requeue --id <outbox_id> --new-client-id [auto|<id>]
```
This atomically:
1. Marks the existing `dead` row as `aborted` (terminal, never retried).
2. Creates a new outbox row with a fresh `client_message_id` (caller-
supplied or daemon-ulid'd) and the SAME or a CALLER-PATCHED payload.
3. The old `client_message_id` becomes free again at the daemon layer
but is still locked at the broker layer if the broker had ever
accepted it (its dedupe row stays). For a row that died before
broker acceptance, the id is fully reusable end-to-end.
Operators see a clear distinction between `dead` (needs operator
attention) and `aborted` (intentionally retired). Add `aborted` to the
status CHECK constraint:
```sql
status TEXT CHECK(status IN ('pending','inflight','done','dead','aborted'))
```
### 4.7 Broker atomicity contract — corrected pseudocode + side-effect inventory (v7 — codex r6)
#### 4.7.1 Side effects inside the transaction
Every successful broker accept atomically commits the following durable
state in **one transaction**:
| Effect | Table | Notes |
|---|---|---|
| Dedupe record | `mesh.client_message_dedupe` | NEW row keyed by `(mesh_id, client_message_id)` |
| Message body | `mesh.topic_message` OR `mesh.message_queue` | NEW row keyed by `broker_message_id` (pre-generated ulid) |
| History row | `mesh.message_history` | NEW row pointing at `broker_message_id` for ordered replay |
| Fan-out work | `mesh.delivery_queue` | One row per intended recipient (member subscribed to topic, recipient of DM, etc.) |
Effects **outside** the transaction (committed after ACK to daemon):
- WebSocket pushes to currently-connected subscribers — these are best-
effort live notifications; on failure subscribers fetch from history
on next connect.
- Webhook fan-out (post-v0.9.0 feature) — runs asynchronously off the
`delivery_queue` rows committed inside the transaction.
If any in-transaction insert fails (constraint violation, DB error),
the transaction rolls back: no dedupe row, no message row, no history,
no delivery queue rows. Broker returns `5xx` to daemon; daemon retries.
#### 4.7.2 Corrected pseudocode (codex r6)
The fingerprint comparison must happen on the conflict-select branch,
not the `RETURNING` branch:
```sql
BEGIN;
-- Pre-generate broker_message_id (ulid) outside the transaction, pass in.
-- Step 1: try to claim the idempotency key.
INSERT INTO mesh.client_message_dedupe
(mesh_id, client_message_id, broker_message_id, request_fingerprint,
destination_kind, destination_ref, expires_at)
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
$dest_kind, $dest_ref, $expires_at)
ON CONFLICT (mesh_id, client_message_id) DO NOTHING;
-- Step 2: was it our insert?
SELECT broker_message_id, request_fingerprint, destination_kind,
destination_ref, history_available, first_seen_at
FROM mesh.client_message_dedupe
WHERE mesh_id = $mesh_id AND client_message_id = $client_id
FOR SHARE;
-- If returned.broker_message_id == $msg_id (our pre-generated id),
-- this was the first insert. Continue to step 3.
-- If returned.broker_message_id != $msg_id AND
-- returned.request_fingerprint == $fingerprint,
-- this is a duplicate retry. ROLLBACK; return 200 duplicate.
-- If returned.broker_message_id != $msg_id AND
-- returned.request_fingerprint != $fingerprint,
-- ROLLBACK; return 409 idempotency_key_reused.
-- Step 3: insert message row, history, fan-out queue.
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
VALUES ($msg_id, $mesh_id, $client_id, ...);
INSERT INTO mesh.message_history (broker_message_id, mesh_id, ...)
VALUES ($msg_id, $mesh_id, ...);
INSERT INTO mesh.delivery_queue (broker_message_id, recipient_pubkey, ...)
SELECT $msg_id, member_pubkey, ...
FROM mesh.topic_subscription
WHERE topic = $dest_ref AND mesh_id = $mesh_id;
COMMIT;
```
The branch logic determines the response shape (`201` vs `200
duplicate` vs `409 idempotency_key_reused`) before COMMIT. The
duplicate and 409 branches always ROLLBACK because nothing else
needs to commit on those paths.
`SELECT … FOR SHARE` prevents concurrent transactions from updating or
deleting the dedupe row until our transaction ends. A concurrent insert
with the same key blocks on the unique index until our transaction
commits or rolls back.
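
The three-way branch after the step-2 `SELECT` can be sketched as a pure function. This is a minimal illustration, not the broker's code: `AcceptOutcome` and `classify_dedupe_row` are hypothetical names, and the fingerprint is assumed to be compared as raw bytes.

```python
from enum import Enum


class AcceptOutcome(Enum):
    """Hypothetical labels for the three response shapes in §4.7.2."""
    FIRST_INSERT = "201"
    DUPLICATE = "200 duplicate"
    KEY_REUSED = "409 idempotency_key_reused"


def classify_dedupe_row(row_msg_id: str, row_fingerprint: bytes,
                        our_msg_id: str, our_fingerprint: bytes) -> AcceptOutcome:
    """Decide the response from the row returned by the step-2 SELECT."""
    if row_msg_id == our_msg_id:
        # The row carries our pre-generated id, so our INSERT won the claim.
        return AcceptOutcome.FIRST_INSERT
    if row_fingerprint == our_fingerprint:
        # The key was claimed earlier by an identical request: a retry.
        return AcceptOutcome.DUPLICATE
    # Same key, different request content.
    return AcceptOutcome.KEY_REUSED
```

Only `FIRST_INSERT` proceeds to step 3; the other two outcomes roll back.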
#### 4.7.3 Orphan check — covers full inventory now
The nightly `cm_broker_dedupe_orphan_check_total` job (v6 §4.7) is
extended to verify all four in-transaction effects. For each
`client_message_dedupe` row:
- Either the corresponding `topic_message` / `message_queue` row exists,
OR `history_available = FALSE` AND a deleted-tombstone is recorded.
- AND a corresponding `message_history` row exists (or has been pruned
per history retention).
- AND zero outstanding `delivery_queue` rows older than fan-out timeout
reference a `broker_message_id` whose dedupe row is missing.
Any inconsistency is logged as `cm_broker_atomicity_violation_found` for
human review. The count should be zero in steady state.
### 4.8 Outbox max-age math — strictly inside broker window (v7 — codex r6)
Codex r6: at v6's 3-day minimum, daemon max_age (72h) **equaled** broker
window (72h). That isn't "inside the window."
v7 raises the floor and tightens the formula:
- **Minimum supported broker `dedupe_retention_days`**: **7** (was 3 in
v6). Below this, daemon refuses to start with `4012
feature_param_below_floor`.
- **Daemon `max_age_hours` derivation** (`retention_scoped` mode):
```
safety_margin_hours = max(24, ceil(dedupe_retention_days * 0.1 * 24))
max_age_hours = (dedupe_retention_days * 24) - safety_margin_hours
```
At minimum (7 days): `safety_margin = max(24, 17) = 24h`; `max_age =
168 - 24 = 144h`. Daemon outbox ≤144h, broker window ≥168h, gap ≥24h.
- **Daemon `max_age_hours` derivation** (`permanent` mode):
```
max_age_hours = config.outbox.max_age_hours_default (168h)
capped at config.outbox.max_age_hours_cap (720h)
```
- **Operator override**: `[outbox] max_age_hours_override = N` is accepted
iff `N <= dedupe_retention_days * 24 - 24`. Above that, the daemon
refuses to start with a clear-text `outbox_max_age_above_dedupe_window` error.
- The 72h floor from v6 is **dropped** because the new 7-day broker
minimum already produces a 144h derived max-age — well above any
realistic floor concern.
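
The `retention_scoped` derivation and the override check can be sketched as follows. This is an illustrative sketch under stated assumptions: the function and constant names are hypothetical, errors are modelled as `ValueError` carrying the spec's error names, and the 10% term uses integer arithmetic so that `ceil` does not depend on float rounding.

```python
MIN_RETENTION_DAYS = 7        # v7 floor for dedupe_retention_days
MIN_SAFETY_MARGIN_HOURS = 24  # safety margin never drops below one day


def derive_max_age_hours(dedupe_retention_days: int) -> int:
    """Derive the daemon outbox max age for retention_scoped mode."""
    if dedupe_retention_days < MIN_RETENTION_DAYS:
        raise ValueError("feature_param_below_floor")
    window_hours = dedupe_retention_days * 24
    # ceil(dedupe_retention_days * 0.1 * 24), computed with integers only.
    ten_percent_hours = (window_hours + 9) // 10
    safety_margin_hours = max(MIN_SAFETY_MARGIN_HOURS, ten_percent_hours)
    return window_hours - safety_margin_hours


def validate_override(override_hours: int, dedupe_retention_days: int) -> None:
    """Reject an operator override that is not at least 24h inside the window."""
    if override_hours > dedupe_retention_days * 24 - 24:
        raise ValueError("outbox_max_age_above_dedupe_window")
```

At the 7-day floor `derive_max_age_hours(7)` gives 144, matching the worked figure above.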
### 4.9 Inbox schema — unchanged from v3 §4.5
### 4.10 Crash recovery — unchanged from v3 §4.6
### 4.11 Failure modes — unchanged from v6 §4.12, with §4.5/§4.6 added
- **IPC accept fingerprint-mismatch on duplicate id**: returns 409 with
`conflict` field per §4.5.1. Caller must rotate id.
- **Outbox row stuck in `dead`**: operator runs `outbox requeue
--new-client-id` per §4.6.3.
- **Broker fingerprint mismatch on retry**: as v6 §4.5. Daemon marks
`dead`, surfaces in `outbox --failed`.
- **Daemon retry after dedupe row hard-deleted by broker retention
sweep**: cannot happen unless operator overrode `max_age_hours`
beyond the safety margin. In `permanent` mode cannot happen at all.
- **Atomicity violation found by orphan check**: alerts ops; broker
team investigates. Should be zero.
---
## 5. Inbound — unchanged from v3 §5
## 6. Hooks — unchanged from v4 §6
## 7-13. — unchanged from v4
## 14. Lifecycle — unchanged from v5 §14
---
## 15. Version compat — minimum dedupe_retention_days raised
### 15.1 Feature bits with parameters (v7 update)
Only one row changes from v6 §15.1:
| Bit | `params.version` | Required parameters | Optional parameters |
|---|---|---|---|
| `client_message_id_dedupe` | `2` | `mode: "retention_scoped"\|"permanent"`, `dedupe_retention_days: int (>= 7)` (when mode=retention_scoped), `request_fingerprint: bool == true` | `tombstone_history_pruned_window_days: int` |
`dedupe_retention_days` minimum raised from 3 to 7 to keep daemon
outbox max-age strictly inside the broker window with margin (§4.8).
### 15.2 — 15.5 unchanged from v6 §15
(`feature_negotiation_request/response`, IPC negotiation, compat
matrix, diagnostic close codes 4010 / 4011 / 4012.)
---
## 16. Threat model — unchanged from v4 §16
---
## 17. Migration — broker dedupe + atomicity + corrected pseudocode (v7)
Broker side, deploy order:
1. `CREATE TABLE mesh.client_message_dedupe` (v6 §4.3 schema, unchanged
in v7).
2. `ALTER TABLE mesh.topic_message ADD COLUMN client_message_id`.
3. `ALTER TABLE mesh.message_queue ADD COLUMN client_message_id`.
4. Broker code refactor: every accept path runs the v7 §4.7.2 corrected
pseudocode in **one transaction** with the side-effect inventory
from §4.7.1 — dedupe row, message row, history row, delivery_queue
rows all in-tx.
5. Broker code: existing fan-out workers consume `delivery_queue` rows
committed by the accept transaction.
6. Broker code: nightly retention sweep + `history_available` flip on
message-row pruning (unchanged from v6 §17 step 5+6).
7. Broker code: extended orphan-check job (v7 §4.7.3) — alerts on
atomicity violations across full inventory.
8. Broker advertises `client_message_id_dedupe` feature with
`params.version = 2`, `request_fingerprint: true`,
`dedupe_retention_days >= 7` (was 3).
9. Daemon refuses to start unless above is advertised.
Daemon side:
- Outbox table gains `aborted` status (§4.6.3). At startup the migration
alters the CHECK constraint in place if the installed SQLite version's
DDL supports that without a recreate; otherwise it recreates the table
via `INSERT INTO new SELECT * FROM old`. v0.9.0 daemons are fresh
installs by definition, so no existing outboxes need migrating.
- IPC accept path implements §4.5.1 lookup table.
- IPC error envelope adds `conflict` and `daemon_fingerprint_prefix`
fields for 409 responses.
- New CLI verb `claudemesh daemon outbox requeue --id <id>
--new-client-id [auto|<id>]` (§4.6.3).
---
## What changed v6 → v7 (codex round-6 actionable items)
| Codex r6 item | v7 fix | Section |
|---|---|---|
| Daemon-local duplicate POST semantics undefined | Full lookup table for pending/inflight/done/dead × match/mismatch; `409 idempotency_key_reused` at IPC layer with `conflict` field | §4.5 |
| §4.6 rejected-request contradiction | Single rule: id consumed iff outbox row written; pre-outbox failures leave id untouched; broker-rejected outbox row goes to `dead`, requires `requeue --new-client-id` | §4.6 |
| §4.7 pseudocode wrong | Corrected: `INSERT ON CONFLICT DO NOTHING`, then `SELECT FOR SHARE`, then branch on returned `broker_message_id` and `fingerprint` | §4.7.2 |
| Max-age math equals window at min | Min `dedupe_retention_days` raised to 7; safety margin always >= 24h; derived max-age strictly < window | §4.8, §15.1 |
| Broker atomicity scope incomplete | Side-effect inventory: dedupe + message + history + delivery_queue all in-tx; WS push and webhook fan-out explicitly outside-tx; orphan check extended | §4.7.1, §4.7.3 |
| New `aborted` outbox status | Distinguishes operator-retired rows from dead rows | §4.6.3 |
---
## What needs review (round 7)
1. **IPC lookup table (§4.5.1)** — does it cover all the realistic
client races? The "inflight + match" return is `202 accepted,
inflight` — should it be `200 ok` with the broker response if the
broker has already responded? Or does the daemon prefer to respond
from local state always?
2. **Aborted vs dead vs done (§4.6.3)** — is the three-state terminal
distinction useful, or noisy? Would `dead` + an `aborted_at`
timestamp suffice?
3. **§4.7.2 transaction shape** — `SELECT FOR SHARE` after `INSERT ON
CONFLICT DO NOTHING` is two round-trips. Could it be one with
`INSERT ... ON CONFLICT DO UPDATE SET ... RETURNING xmax = 0` or
similar Postgres-specific trick? Worth optimizing here?
4. **Max-age formula at higher windows** — at 365 days,
`safety_margin = ceil(0.1 * 365 * 24) = 876h ≈ 36.5 days`. Daemon
max-age = `8760 - 876 = 7884h ≈ 328 days`. Is that the right shape,
or should the safety margin be capped (e.g. `min(72, ceil(0.1 * w))`)?
5. **Side-effect inventory (§4.7.1)**: anything missing? E.g.
broker-side rate-limit counters, audit-log entries, mention-fanout-search?
6. **Anything else still wrong?** Read it as if you were going to
operate this for a year. What falls down?
Three options:
- **(a) v7 is shippable**: lock the spec, start coding the frozen core.
- **(b) v8 needed**: list the must-fix items.
- **(c) the architecture itself is wrong**: what would you do differently?
Be ruthless.


@@ -0,0 +1,401 @@
# `claudemesh daemon` — Final Spec v8
> **Round 8.** v7 was reviewed by codex (round 7) which found four
> remaining correctness problems, one of them new in v7:
>
> 1. **`aborted` semantics not in §4.5.1** and contradiction with `UNIQUE`
> constraint — v7 said the old id "becomes free again at the daemon
> layer," but `client_message_id TEXT NOT NULL UNIQUE` makes that
> impossible without DELETE.
> 2. **Broker permanent-rejection ordering underspec** — v7 didn't state
> when (relative to dedupe insertion) permanent 4xx fires.
> 3. **SQLite `SELECT FOR UPDATE`** — SQLite doesn't support it; needs
> `BEGIN IMMEDIATE` for daemon-local serialization.
> 4. **Side-effect inventory still ambiguous** — rate-limit counters,
> audit logs, mention/search indexes need explicit
> in-tx/non-authoritative classification.
>
> v8 fixes all four. **Intent §0 unchanged from v2.** v8 only revises §4
> (delivery contract).
---
## 0. Intent — unchanged, see v2 §0
## 1. Process model — unchanged
## 2. Identity — unchanged from v5 §2
## 3. IPC surface — unchanged from v4 §3
---
## 4. Delivery contract — `aborted` clarified, broker phasing, SQLite locking
### 4.1 The contract (precise — v8)
> **Local guarantee**: each successful `POST /v1/send` returns a stable
> `client_message_id`. The send is durably persisted to `outbox.db` before
> the response returns. The daemon enforces request-fingerprint
> idempotency at the IPC layer: a duplicate `POST` with the same
> `client_message_id` returns `409 idempotency_key_reused` if the
> fingerprint mismatches, regardless of outbox row state.
>
> **Local audit guarantee (NEW v8)**: a `client_message_id` once written
> to `outbox.db` is **never released**. Operator recovery via
> `requeue --new-client-id` always mints a fresh id; the old row stays
> in `aborted` for audit. There is no daemon-side path to free a used
> id.
>
> **Broker guarantee**: same as v7 §4.1. Dedupe row exists iff the
> broker reached the post-validation accept phase (§4.7.1).
>
> **Atomicity guarantee**: same as v7 §4.1.
>
> **End-to-end guarantee**: at-least-once.
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
### 4.3 Broker schema — unchanged from v6 §4.3
### 4.4 Request fingerprint canonical form — unchanged from v6 §4.4
### 4.5 Daemon-local idempotency at the IPC layer (v8 — `aborted` added, SQLite locking)
#### 4.5.1 IPC accept algorithm (v8)
On `POST /v1/send`:
1. Validate request envelope (auth, schema, size limits, destination
resolvable). Failures here return `4xx` immediately. **No outbox row
is written; the `client_message_id` is not consumed.**
2. Compute `request_fingerprint` (§4.4).
3. Open a SQLite transaction with `BEGIN IMMEDIATE` (v8 — codex r7) so
a concurrent IPC accept on the same id serializes against this one.
`BEGIN IMMEDIATE` acquires the RESERVED lock at transaction start, so
no other connection can begin a write transaction on the same database
until this one ends (readers are not blocked); SQLite has no row-level
lock and `SELECT FOR UPDATE` is not supported.
4. `SELECT id, request_fingerprint, status, broker_message_id,
last_error FROM outbox WHERE client_message_id = ?`.
5. Apply the lookup table below. For the "(no row)" case, INSERT the
new row inside the same transaction.
6. COMMIT.
| Existing row state | Fingerprint match? | Daemon response |
|---|---|---|
| (no row) | — | INSERT new outbox row in `pending`; return `202 accepted, queued` |
| `pending` | match | Return `202 accepted, queued`. No mutation |
| `pending` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_pending_fingerprint_mismatch"`. No mutation |
| `inflight` | match | Return `202 accepted, inflight`. No mutation |
| `inflight` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_inflight_fingerprint_mismatch"` |
| `done` | match | Return `200 ok, duplicate: true, broker_message_id, history_id`. No broker call |
| `done` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_done_fingerprint_mismatch", broker_message_id` |
| `dead` | match | Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_match", reason: "<last_error>"`. Same id never auto-retried |
| `dead` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_mismatch"` |
| **`aborted`** (NEW v8) | **match** | Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_match"`. The id was retired by operator action; never reusable |
| **`aborted`** (NEW v8) | **mismatch** | Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_mismatch"` |
**Rule (v8 — codex r7)**: every IPC `409` carries the daemon's
`request_fingerprint` (8-byte hex prefix) so callers can debug
client/server canonical-form drift. **Every state in the table returns
something deterministic, including `aborted`.** A `client_message_id`
written to `outbox.db` is permanently bound to that row's lifecycle —
the only "free" state is "no row exists".
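
Steps 3 to 6 and the lookup table can be sketched with Python's standard `sqlite3` module. This is a minimal sketch under stated assumptions, not the daemon's implementation: the schema is trimmed to the columns the lookup needs, step 1 validation is assumed to have passed, the response bodies are simplified (`state` is a hypothetical field name, `history_id` is omitted), and the "8-byte hex prefix" is interpreted as the first 8 bytes of the stored fingerprint, hex-encoded.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE outbox (
  id TEXT PRIMARY KEY,
  client_message_id TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  payload BLOB NOT NULL,
  enqueued_at INTEGER NOT NULL,
  next_attempt_at INTEGER NOT NULL,
  status TEXT CHECK(status IN ('pending','inflight','done','dead','aborted')),
  last_error TEXT,
  broker_message_id TEXT
);
"""


def open_outbox(path=":memory:"):
    # isolation_level=None stops the sqlite3 module from issuing implicit
    # BEGINs, so the explicit BEGIN IMMEDIATE below owns the transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.executescript(SCHEMA)
    return conn


def lookup(row, fingerprint):
    """Apply the lookup table to an existing outbox row."""
    stored_fp, status, broker_message_id, last_error = row
    match = stored_fp == fingerprint
    if match and status in ("pending", "inflight"):
        return 202, {"state": "queued" if status == "pending" else "inflight"}
    if match and status == "done":
        return 200, {"duplicate": True, "broker_message_id": broker_message_id}
    # Every remaining combination is a 409, including dead/aborted + match.
    body = {
        "error": "idempotency_key_reused",
        "conflict": f"outbox_{status}_fingerprint_{'match' if match else 'mismatch'}",
        "daemon_fingerprint_prefix": stored_fp[:8].hex(),
    }
    if match and status == "dead":
        body["reason"] = last_error
    if not match and status == "done":
        body["broker_message_id"] = broker_message_id
    return 409, body


def ipc_accept(conn, row_id, client_message_id, fingerprint, payload):
    """Return (http_status, body) for an already-validated POST /v1/send."""
    conn.execute("BEGIN IMMEDIATE")  # serializes concurrent accepts
    try:
        row = conn.execute(
            "SELECT request_fingerprint, status, broker_message_id, last_error"
            " FROM outbox WHERE client_message_id = ?",
            (client_message_id,),
        ).fetchone()
        if row is None:
            now = int(time.time())
            conn.execute(
                "INSERT INTO outbox (id, client_message_id, request_fingerprint,"
                " payload, enqueued_at, next_attempt_at, status)"
                " VALUES (?, ?, ?, ?, ?, ?, 'pending')",
                (row_id, client_message_id, fingerprint, payload, now, now),
            )
            result = (202, {"state": "queued"})
        else:
            result = lookup(row, fingerprint)
    except Exception:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT")
    return result
```

Because the `SELECT` and the conditional `INSERT` run inside one `BEGIN IMMEDIATE` transaction, two concurrent accepts on the same id cannot both observe "(no row)".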
#### 4.5.2 Outbox table — fingerprint required
```sql
CREATE TABLE outbox (
id TEXT PRIMARY KEY,
client_message_id TEXT NOT NULL UNIQUE,
request_fingerprint BLOB NOT NULL, -- 32 bytes
payload BLOB NOT NULL,
enqueued_at INTEGER NOT NULL,
attempts INTEGER DEFAULT 0,
next_attempt_at INTEGER NOT NULL,
status TEXT CHECK(status IN
('pending','inflight','done','dead','aborted')),
last_error TEXT,
delivered_at INTEGER,
broker_message_id TEXT,
aborted_at INTEGER, -- NEW v8
aborted_by TEXT, -- NEW v8: operator/auto
superseded_by TEXT -- NEW v8: id of the requeue successor row, if any
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
CREATE INDEX outbox_aborted ON outbox(status, aborted_at) WHERE status = 'aborted';
```
`aborted_at`, `aborted_by`, `superseded_by` give operators a clear
audit trail. `superseded_by` lets `outbox inspect` show the chain when
a row was requeued multiple times.
`request_fingerprint` is computed once at IPC accept time and frozen
forever for the row's lifecycle. Daemon never recomputes from
`payload`.
### 4.6 Rejected-request semantics — phasing made explicit (v8 — codex r7)
> **Single rule, phased**: a `client_message_id` is consumed iff a
> dedupe row exists. The dedupe row is the durable evidence that a
> request reached the post-validation accept phase. Pre-validation
> failures consume nothing; the caller may freely retry the same id with
> a corrected payload.
#### 4.6.1 Daemon-side rejection phasing
| Phase | When daemon rejects | Outbox row? | Caller may reuse id? |
|---|---|---|---|
| **A. IPC validation** (auth, schema, size, destination resolvable) | Before §4.5.1 step 3 | No | Yes — id never consumed |
| **B. Outbox stored, broker network/transient failure** | After IPC accept, broker `5xx` or timeout | `pending` → retried | N/A — daemon owns retries |
| **C. Outbox stored, broker permanent rejection** | Broker returns `4xx` after IPC accept | `dead` | No — rotate via `requeue --new-client-id` |
| **D. Operator retirement** | Operator runs `requeue --new-client-id` on `dead` or `pending` row | `aborted` (audit) + new row with fresh id | Old id NEVER reusable; new id is fresh |
#### 4.6.2 Broker-side rejection phasing (NEW v8 — codex r7)
The broker validates in two phases relative to dedupe-row insertion:
| Phase | Validation | Result |
|---|---|---|
| **B1. Pre-dedupe-claim** (NEW — explicit) | Auth (mesh membership), schema, size, mesh exists, member exists, destination kind valid, payload bytes ≤ `max_payload.inline_bytes` | `4xx` returned. **No dedupe row inserted.** Caller may retry with same id and corrected payload. |
| **B2. Post-dedupe-claim** | Anything that requires the dedupe-claim transaction to be in progress: destination_ref existence (topic exists, member subscribed, etc.), per-mesh rate limit not exceeded | `4xx` returned, transaction rolled back, **no dedupe row remains**. Caller may retry with same id. |
| **B3. Accepted** | All side effects (dedupe row, message row, history row, delivery_queue rows) commit atomically | `201` returned with `broker_message_id` |
**Critical guarantee (v8)**: there is no broker code path where a
permanent rejection (4xx) leaves a dedupe row behind. Either the
request committed and a dedupe row exists (B3), or it didn't and no
dedupe row exists (B1, B2). This makes "dedupe row exists" the single
unambiguous signal of "id consumed at the broker layer."
If broker decides post-commit that an accepted message is invalid
(e.g. an async content-policy job runs on accepted messages), that's
NOT a permanent rejection — that's a follow-up moderation event that
operates on the broker_message_id, not on the dedupe key.
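
The guarantee that a B2 rejection leaves no dedupe row can be demonstrated with a small simulation. This is a hedged sketch, not the broker's code: SQLite (via Python's `sqlite3`) stands in for Postgres, so `FOR SHARE` is absent; the tables are cut down to the columns needed; `broker_accept` and its `b2_ok` flag are hypothetical; and the outcomes are plain strings because the spec does not fix a specific `4xx` code.

```python
import sqlite3

SCHEMA = """
CREATE TABLE client_message_dedupe (
  mesh_id TEXT NOT NULL,
  client_message_id TEXT NOT NULL,
  broker_message_id TEXT NOT NULL,
  request_fingerprint BLOB NOT NULL,
  PRIMARY KEY (mesh_id, client_message_id)
);
CREATE TABLE topic_message (
  id TEXT PRIMARY KEY,
  mesh_id TEXT NOT NULL,
  client_message_id TEXT NOT NULL
);
"""


def broker_accept(conn, mesh_id, client_id, msg_id, fingerprint, b2_ok):
    """Simulate phases B2/B3; phase B1 is assumed to have passed already."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        # Try to claim the idempotency key.
        conn.execute(
            "INSERT INTO client_message_dedupe"
            " (mesh_id, client_message_id, broker_message_id, request_fingerprint)"
            " VALUES (?, ?, ?, ?)"
            " ON CONFLICT (mesh_id, client_message_id) DO NOTHING",
            (mesh_id, client_id, msg_id, fingerprint),
        )
        # Inspect the row that is there now (ours or an earlier claim).
        row_msg_id, row_fp = conn.execute(
            "SELECT broker_message_id, request_fingerprint"
            " FROM client_message_dedupe"
            " WHERE mesh_id = ? AND client_message_id = ?",
            (mesh_id, client_id),
        ).fetchone()
        if row_msg_id != msg_id:
            outcome = ("200 duplicate" if row_fp == fingerprint
                       else "409 idempotency_key_reused")
        elif not b2_ok:
            outcome = "4xx"  # phase B2 rejection
        else:
            conn.execute(
                "INSERT INTO topic_message (id, mesh_id, client_message_id)"
                " VALUES (?, ?, ?)",
                (msg_id, mesh_id, client_id),
            )
            outcome = "201"  # phase B3
    except Exception:
        conn.execute("ROLLBACK")
        raise
    # Only an accepted request commits; every other path unwinds the claim.
    conn.execute("COMMIT" if outcome == "201" else "ROLLBACK")
    return outcome
```

After a call that returns `"4xx"`, the dedupe table is empty again, so a retry with the same id can still reach B3.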
#### 4.6.3 Operator recovery via `requeue` (corrected v8)
To unstick a `dead` or `pending`-but-stuck row, operator runs:
```
claudemesh daemon outbox requeue --id <outbox_row_id>
[--new-client-id <id> | --auto]
[--patch-payload <path>]
```
This atomically (single SQLite transaction):
1. Marks the existing row's status to `aborted`, sets `aborted_at = now`,
`aborted_by = "operator"`. Row is **never deleted** — audit trail
permanent.
2. Mints a fresh `client_message_id` (caller-supplied via `--new-client-id`
or auto-ulid'd via `--auto`).
3. Inserts a new outbox row in `pending` with the fresh id and the same
payload (or patched payload if `--patch-payload` was given).
4. Sets `superseded_by = <new_row_id>` on the old row so
`outbox inspect <old_id>` displays the chain.
**The old `client_message_id` is permanently dead** — `outbox.db` still
holds it via the `aborted` row's `UNIQUE` constraint, and any caller
re-using it gets `409 outbox_aborted_*` per §4.5.1.
If broker had ever accepted the old id (it reached B3), the broker's
dedupe row is also permanent — duplicate sends to broker with the old
id would also `409` for fingerprint mismatch (or return the original
`broker_message_id` for matching fingerprint). The daemon-side
`aborted` row and the broker-side dedupe row are independent records of
"this id was used"; neither releases the id.
This is the resolution to v7's contradiction: there is **no path** for
an id to "become free again." If the operator wants to retry the
payload, they get a new id. The old id stays buried.
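
The four requeue steps can be sketched as one SQLite transaction. This is an illustrative sketch under stated assumptions: `requeue` is a hypothetical helper rather than the CLI's implementation, the schema is trimmed, `uuid4` stands in for the ulid generator, and the successor row's fingerprint is passed in by the caller because this section does not say how it is derived.

```python
import sqlite3
import time
import uuid

SCHEMA = """
CREATE TABLE outbox (
  id TEXT PRIMARY KEY,
  client_message_id TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  payload BLOB NOT NULL,
  enqueued_at INTEGER NOT NULL,
  next_attempt_at INTEGER NOT NULL,
  status TEXT CHECK(status IN ('pending','inflight','done','dead','aborted')),
  aborted_at INTEGER,
  aborted_by TEXT,
  superseded_by TEXT
);
"""


def requeue(conn, old_row_id, new_fingerprint,
            new_client_id=None, patched_payload=None):
    """Retire one outbox row and enqueue a successor under a fresh id."""
    new_row_id = uuid.uuid4().hex
    new_client_id = new_client_id or uuid.uuid4().hex  # the --auto case
    now = int(time.time())
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT status, payload FROM outbox WHERE id = ?", (old_row_id,)
        ).fetchone()
        if row is None or row[0] not in ("dead", "pending"):
            raise ValueError("only dead or pending rows can be requeued")
        payload = patched_payload if patched_payload is not None else row[1]
        # Steps 1 and 4: retire the old row and link it to its successor.
        conn.execute(
            "UPDATE outbox SET status = 'aborted', aborted_at = ?,"
            " aborted_by = 'operator', superseded_by = ? WHERE id = ?",
            (now, new_row_id, old_row_id),
        )
        # Steps 2 and 3: insert the successor; the UNIQUE constraint rejects
        # any attempt to reuse a client_message_id that is already present.
        conn.execute(
            "INSERT INTO outbox (id, client_message_id, request_fingerprint,"
            " payload, enqueued_at, next_attempt_at, status)"
            " VALUES (?, ?, ?, ?, ?, ?, 'pending')",
            (new_row_id, new_client_id, new_fingerprint, payload, now, now),
        )
    except Exception:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT")
    return new_row_id, new_client_id
```

If the supplied `new_client_id` collides with any existing row, including the one being retired, the insert fails and the rollback leaves the old row untouched.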
### 4.7 Broker atomicity contract — side-effect classification (v8 — codex r7)
#### 4.7.1 Side effects (v8 — explicit classification)
Every successful broker accept atomically commits these durable
state changes in **one transaction**:
| Effect | Table | In-tx? | Why |
|---|---|---|---|
| Dedupe record | `mesh.client_message_dedupe` | **Yes** | Idempotency authority |
| Message body | `mesh.topic_message` / `mesh.message_queue` | **Yes** | Authoritative store |
| History row | `mesh.message_history` | **Yes** | Replay log; lost-on-rollback would break ordered replay |
| Fan-out work | `mesh.delivery_queue` | **Yes** | Each recipient must see exactly the messages that committed |
| Mention index entries | `mesh.mention_index` | **Yes** | Reads off mention queries must match committed messages |
**Outside the transaction** — non-authoritative or rebuildable, with
explicit rationale per item:
| Effect | Where | Why outside |
|---|---|---|
| WS push to live subscribers | Async after COMMIT | Live notifications are best-effort; receivers re-fetch from history on reconnect |
| Webhook fan-out | Async via `delivery_queue` workers | Off-band; consumes committed `delivery_queue` rows |
| Rate-limit counters | Async, eventually consistent | Counters are an estimate; over-counting on retry is preferable to under-counting |
| Audit log entries | Async append-only stream | Audit log can be rebuilt from message history; in-tx writes hurt p99 |
| Search/FTS index updates | Async via outbox-pattern worker | Index can be rebuilt from authoritative tables |
| Metrics | Prometheus, pull-based | Always non-authoritative |
If any in-transaction insert fails, the transaction rolls back
completely. The broker returns `5xx` to the daemon, and the daemon
retries. No partial state is left behind.
The async side effects are driven off the in-transaction
`delivery_queue` and `message_history` rows, so they cannot get ahead
of committed state — only lag behind.
#### 4.7.2 Pseudocode — corrected and final (v8)
```sql
BEGIN;
-- Phase B1 already passed (see §4.6.2).
-- Phase B2 + B3: try to claim the idempotency key.
INSERT INTO mesh.client_message_dedupe
(mesh_id, client_message_id, broker_message_id, request_fingerprint,
destination_kind, destination_ref, expires_at)
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
$dest_kind, $dest_ref, $expires_at)
ON CONFLICT (mesh_id, client_message_id) DO NOTHING;
-- Inspect the row that's actually there now (ours or someone else's).
SELECT broker_message_id, request_fingerprint, destination_kind,
destination_ref, history_available, first_seen_at
FROM mesh.client_message_dedupe
WHERE mesh_id = $mesh_id AND client_message_id = $client_id
FOR SHARE;
-- Branch:
-- row.broker_message_id == $msg_id → first insert; continue to step 3.
-- row.broker_message_id != $msg_id → duplicate. Compare fingerprints:
-- fingerprint match → ROLLBACK; return 200 duplicate.
-- fingerprint mismatch → ROLLBACK; return 409 idempotency_key_reused.
-- Step 3: validate Phase B2 (subscribers exist, rate limit not exceeded, etc.)
-- If B2 fails → ROLLBACK; return 4xx (no dedupe row remains).
-- Step 4: insert all in-tx side effects (§4.7.1).
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
VALUES ($msg_id, $mesh_id, $client_id, ...);
INSERT INTO mesh.message_history (broker_message_id, mesh_id, ...)
VALUES ($msg_id, $mesh_id, ...);
INSERT INTO mesh.delivery_queue (broker_message_id, recipient_pubkey, ...)
SELECT $msg_id, member_pubkey, ...
FROM mesh.topic_subscription
WHERE topic = $dest_ref AND mesh_id = $mesh_id;
INSERT INTO mesh.mention_index (broker_message_id, mentioned_pubkey, ...)
SELECT $msg_id, mention_pubkey, ...
FROM unnest($mention_list) AS m(mention_pubkey);
COMMIT;
-- After COMMIT, async workers consume delivery_queue and update
-- search indexes, audit logs, rate-limit counters, etc.
```
#### 4.7.3 Orphan check — same as v7 §4.7.3
Extended to cover the v8 side-effect inventory (§4.7.1), verifying that the in-transaction items are mutually consistent.
### 4.8 Outbox max-age math — unchanged from v7 §4.8
Min `dedupe_retention_days = 7`; derived `max_age_hours = window -
safety_margin` strictly < window; safety_margin floor 24h.
### 4.9 Inbox schema — unchanged from v3 §4.5
### 4.10 Crash recovery — unchanged from v3 §4.6
### 4.11 Failure modes — `aborted` semantics added (v8)
- **IPC accept fingerprint-mismatch on duplicate id** (any state):
returns 409 with `conflict` field per §4.5.1. Caller must use a new id.
- **IPC accept against `aborted` row, fingerprint match**: returns 409
per §4.5.1 (NEW v8). Caller must use a new id; the old id is
permanently retired.
- **Outbox row stuck in `dead`**: operator runs `outbox requeue` per
§4.6.3; old id stays in `aborted`, new id is fresh.
- **Broker fingerprint mismatch on retry**: as v6/v7. Daemon marks
`dead`; operator requeue path.
- **Daemon retry after dedupe row hard-deleted by broker retention
sweep**: cannot happen unless operator overrode `max_age_hours`.
- **Broker phase B2 rejection on retry**: same id, same fingerprint,
but B2 condition has changed (e.g. mesh rate-limit now exceeded).
Daemon receives 4xx → marks `dead`. Operator can `requeue` once
conditions clear.
- **Atomicity violation found by orphan check**: alerts ops.
---
## 5-13. — unchanged from v4
## 14. Lifecycle — unchanged from v5 §14
## 15. Version compat — unchanged from v7 §15
## 16. Threat model — unchanged
---
## 17. Migration — v8 outbox columns + broker phase B2 (v8)
Broker side, deploy order: same as v7 §17, with one addition:
- Step 4.5: explicitly split broker accept into Phase B1 (pre-dedupe
validation, returns 4xx without writing) and Phase B2/B3 (within the
accept transaction). Implementation: refactor handler to validate
Phase B1 conditions before opening the DB transaction.
Daemon side:
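- Outbox schema gains `aborted_at`, `aborted_by`, `superseded_by`
columns and the `aborted` enum value (§4.5.2). If needed, the migration
recreates the table and copies rows with an explicit column list
(`INSERT INTO new (...) SELECT ... FROM old`), since a bare `SELECT *`
would not match the widened schema; v0.9.0 is greenfield.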
- IPC accept switches to `BEGIN IMMEDIATE` for SQLite serialization
(§4.5.1 step 3).
- IPC accept handles `aborted` rows per §4.5.1 (always 409).
- `claudemesh daemon outbox requeue` always mints a fresh
`client_message_id`; never frees the old id. `--new-client-id <id>`
and `--auto` are the only modes; the old `client_message_id`
argument is removed.
---
## What changed v7 → v8 (codex round-7 actionable items)
| Codex r7 item | v8 fix | Section |
|---|---|---|
| `aborted` not in §4.5.1; `UNIQUE` contradiction | Added two `aborted` rows (match/mismatch) to lookup table; old id never reusable; new audit columns `aborted_at`/`aborted_by`/`superseded_by` | §4.5.1, §4.5.2, §4.6.3 |
| Broker permanent-rejection ordering vague | Three-phase model B1 (pre-dedupe), B2 (post-claim, in-tx), B3 (accepted); permanent 4xx never leaves dedupe row | §4.6.2 |
| SQLite `SELECT FOR UPDATE` invalid | Replaced with `BEGIN IMMEDIATE` for daemon-local serialization | §4.5.1 |
| Side-effect inventory ambiguous on rate-limit/audit/search | Explicit in-tx vs outside-tx table with rationale per item | §4.7.1 |
| Operator id reuse semantics | Old id permanently retired in `aborted`; requeue always mints fresh id; no daemon-side path to release used ids | §4.6.3 |
---
## What needs review (round 8)
1. **`aborted` permanence (§4.5.1, §4.6.3)** — is "old id permanently
dead" correct, or is there a real operational case where releasing
an id (e.g. caller mistyped a uuid) is worth the audit-trail loss?
2. **Phase B1/B2/B3 split (§4.6.2)** — clean enough? Is rate-limiting
in B2 (in-tx) the right call, or should it be B1 (cheaper to enforce
pre-tx)?
3. **In-tx mention_index (§4.7.1)** — agree it should be in-tx, or
should mention indexing be async like search?
4. **`BEGIN IMMEDIATE` (§4.5.1)** — correct SQLite primitive, or should
it be `BEGIN EXCLUSIVE` to also block readers? (Probably not — readers
should see committed-pending rows, but worth confirming.)
5. **Anything else still wrong?** Read it as if you were going to
operate this for a year.
Three options:
- **(a) v8 is shippable**: lock the spec, start coding the frozen core.
- **(b) v9 needed**: list the must-fix items.
- **(c) the architecture itself is wrong**: what would you do differently?
Be ruthless.


@@ -0,0 +1,473 @@
# `claudemesh daemon` — Final Spec v9
> **Round 9.** v8 was reviewed by codex (round 8) which closed
> aborted/UNIQUE (5/5) and SQLite locking (5/5) cleanly, but flagged
> three spec-level correctness problems:
>
> 1. **Cross-layer ID-consumed authority contradiction** — v8 §4.1
> said "id consumed iff dedupe row exists" while §4.6.1 says a
> daemon-rejected id stays consumed locally with no broker dedupe
> row. Two incompatible authorities.
> 2. **Rate-limit authority muddled** — v8 listed rate limit in B2
> (in-tx authoritative) but classified rate-limit counters as
> async/non-authoritative in §4.7.1.
> 3. **§4.1 broker guarantee wording** — "post-validation accept
> phase" was fuzzy because B2 rolls back. Tighten to "accept
> committed."
>
> v9 fixes all three with **two-layer ID rules** (daemon vs broker),
> rate-limit moved to B1 via an external atomic limiter, and §4.1
> tightened. **Intent §0 unchanged from v2.** v9 only revises §4.
---
## 0. Intent — unchanged, see v2 §0
## 1. Process model — unchanged
## 2. Identity — unchanged from v5 §2
## 3. IPC surface — unchanged from v4 §3
---
## 4. Delivery contract — `aborted` clarified, broker phasing, SQLite locking
### 4.1 The contract (precise — v9, two-layer ID model)
> **Two-layer ID rules** (NEW v9 — codex r8):
>
> - **Daemon-layer**: a `client_message_id` is **daemon-consumed** iff an
> outbox row exists for it. Daemon-mediated callers can never reuse a
> daemon-consumed id, regardless of whether the broker ever saw it.
> The daemon's outbox is the single authority for "this id was issued
> by my caller against this daemon."
> - **Broker-layer**: a `client_message_id` is **broker-consumed** iff a
> dedupe row exists for `(mesh_id, client_message_id)` in
> `mesh.client_message_dedupe`. Direct broker callers (none in
> v0.9.0; reserved for future SDK paths that bypass the daemon) can
> reuse a broker-non-consumed id freely.
> - In v0.9.0 there are no daemon-bypass clients, so for practical
> purposes "daemon-consumed" is the operative rule.
>
> **Local guarantee**: each successful `POST /v1/send` returns a stable
> `client_message_id`. The send is durably persisted to `outbox.db`
> before the response returns. The daemon enforces request-fingerprint
> idempotency at the IPC layer (§4.5.1).
>
> **Local audit guarantee**: a `client_message_id` once written to
> `outbox.db` is **never released** (daemon-layer rule). Operator
> recovery via `requeue` always mints a fresh id; the old row stays in
> `aborted` for audit. There is no daemon-side path to free a used id.
>
> **Broker guarantee** (v9 — tightened): a dedupe row exists iff the
> broker accept transaction **committed** (Phase B3 reached). Phase B1
> rejections never insert dedupe rows. Phase B2 rejections roll the
> transaction back, so any partial dedupe row is unwound. Direct
> broker callers retrying after B1/B2 rejection see no dedupe row and
> may reuse the id.
>
> **Atomicity guarantee**: same as v8 §4.1.
>
> **End-to-end guarantee**: at-least-once.
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
### 4.3 Broker schema — unchanged from v6 §4.3
### 4.4 Request fingerprint canonical form — unchanged from v6 §4.4
### 4.5 Daemon-local idempotency at the IPC layer (v8 — `aborted` added, SQLite locking)
#### 4.5.1 IPC accept algorithm (v8)
On `POST /v1/send`:
1. Validate request envelope (auth, schema, size limits, destination
resolvable). Failures here return `4xx` immediately. **No outbox row
is written; the `client_message_id` is not consumed.**
2. Compute `request_fingerprint` (§4.4).
3. Open a SQLite transaction with `BEGIN IMMEDIATE` (v8 — codex r7) so
a concurrent IPC accept on the same id serializes against this one.
`BEGIN IMMEDIATE` acquires the RESERVED lock at transaction start, so
no other connection can begin a write transaction on the same database
until this one ends (readers are not blocked); SQLite has no row-level
lock and `SELECT FOR UPDATE` is not supported.
4. `SELECT id, request_fingerprint, status, broker_message_id,
last_error FROM outbox WHERE client_message_id = ?`.
5. Apply the lookup table below. For the "(no row)" case, INSERT the
new row inside the same transaction.
6. COMMIT.
| Existing row state | Fingerprint match? | Daemon response |
|---|---|---|
| (no row) | — | INSERT new outbox row in `pending`; return `202 accepted, queued` |
| `pending` | match | Return `202 accepted, queued`. No mutation |
| `pending` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_pending_fingerprint_mismatch"`. No mutation |
| `inflight` | match | Return `202 accepted, inflight`. No mutation |
| `inflight` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_inflight_fingerprint_mismatch"` |
| `done` | match | Return `200 ok, duplicate: true, broker_message_id, history_id`. No broker call |
| `done` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_done_fingerprint_mismatch", broker_message_id` |
| `dead` | match | Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_match", reason: "<last_error>"`. Same id never auto-retried |
| `dead` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_mismatch"` |
| **`aborted`** (NEW v8) | **match** | Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_match"`. The id was retired by operator action; never reusable |
| **`aborted`** (NEW v8) | **mismatch** | Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_mismatch"` |
**Rule (v8 — codex r7)**: every IPC `409` carries the daemon's
`request_fingerprint` (8-byte hex prefix) so callers can debug
client/server canonical-form drift. **Every state in the table returns
something deterministic, including `aborted`.** A `client_message_id`
written to `outbox.db` is permanently bound to that row's lifecycle —
the only "free" state is "no row exists".
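The accept algorithm and lookup table above can be sketched with Python's stdlib `sqlite3`. This is a minimal illustration, not the daemon's implementation: the trimmed schema, the function names (`open_outbox`, `ipc_accept`, `_lookup`) and the response body shapes are assumptions, and `history_id` is omitted because the sketch has no history table.

```python
import sqlite3

# Hypothetical, trimmed-down outbox: only the columns the lookup needs.
SCHEMA = """
CREATE TABLE outbox (
  id TEXT PRIMARY KEY,
  client_message_id TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  status TEXT CHECK(status IN ('pending','inflight','done','dead','aborted')),
  broker_message_id TEXT,
  last_error TEXT
);
"""


def open_outbox(path=":memory:"):
    # isolation_level=None disables the module's implicit BEGIN, so the
    # explicit BEGIN IMMEDIATE below is the only transaction control.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.executescript(SCHEMA)
    return conn


def _lookup(status, match, broker_id, last_error, fingerprint):
    """Apply the lookup table for an existing row."""
    if match and status in ("pending", "inflight"):
        return 202, {"status": "queued" if status == "pending" else "inflight"}
    if match and status == "done":
        return 200, {"duplicate": True, "broker_message_id": broker_id}
    kind = "match" if match else "mismatch"
    body = {
        "error": "idempotency_key_reused",
        "conflict": f"outbox_{status}_fingerprint_{kind}",
        # Every 409 carries an 8-byte hex prefix of the fingerprint.
        "request_fingerprint": fingerprint[:8].hex(),
    }
    if status == "done":
        body["broker_message_id"] = broker_id
    if status == "dead" and match:
        body["reason"] = last_error
    return 409, body


def ipc_accept(conn, row_id, client_message_id, fingerprint):
    """Return (http_status, body) for one already-validated POST /v1/send."""
    conn.execute("BEGIN IMMEDIATE")  # step 3: serialize concurrent accepts
    try:
        row = conn.execute(  # step 4
            "SELECT request_fingerprint, status, broker_message_id, last_error "
            "FROM outbox WHERE client_message_id = ?",
            (client_message_id,),
        ).fetchone()
        if row is None:  # step 5, "(no row)" case
            conn.execute(
                "INSERT INTO outbox "
                "(id, client_message_id, request_fingerprint, status) "
                "VALUES (?, ?, ?, 'pending')",
                (row_id, client_message_id, fingerprint),
            )
            result = (202, {"status": "queued"})
        else:
            stored_fp, status, broker_id, last_error = row
            result = _lookup(status, stored_fp == fingerprint,
                             broker_id, last_error, fingerprint)
        conn.execute("COMMIT")  # step 6
        return result
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Step 1 (envelope validation) and step 2 (fingerprint computation) are assumed to have happened before `ipc_accept` is called, which is why a validation failure never touches the table.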
#### 4.5.2 Outbox table — fingerprint required
```sql
CREATE TABLE outbox (
id TEXT PRIMARY KEY,
client_message_id TEXT NOT NULL UNIQUE,
request_fingerprint BLOB NOT NULL, -- 32 bytes
payload BLOB NOT NULL,
enqueued_at INTEGER NOT NULL,
attempts INTEGER DEFAULT 0,
next_attempt_at INTEGER NOT NULL,
status TEXT CHECK(status IN
('pending','inflight','done','dead','aborted')),
last_error TEXT,
delivered_at INTEGER,
broker_message_id TEXT,
aborted_at INTEGER, -- NEW v8
aborted_by TEXT, -- NEW v8: operator/auto
superseded_by TEXT -- NEW v8: id of the requeue successor row, if any
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
CREATE INDEX outbox_aborted ON outbox(status, aborted_at) WHERE status = 'aborted';
```
`aborted_at`, `aborted_by`, `superseded_by` give operators a clear
audit trail. `superseded_by` lets `outbox inspect` show the chain when
a row was requeued multiple times.
`request_fingerprint` is computed once at IPC accept time and frozen
for the row's lifecycle. The daemon never recomputes it from
`payload`.
### 4.6 Rejected-request semantics — two-layer rules + rate-limit moved to B1 (v9 — codex r8)
> **Two-layer rule (v9)**: a `client_message_id` is **daemon-consumed**
> iff an outbox row exists for it; **broker-consumed** iff a dedupe row
> exists. Daemon-mediated callers see daemon-layer authority (the only
> path in v0.9.0). Pre-validation failures at any layer consume nothing
> at that layer. The two layers are independent: a daemon-consumed id
> may or may not be broker-consumed (depending on whether the send
> reached B3); a daemon-non-consumed id can never be broker-consumed
> (no outbox row ⇒ no broker call from the daemon).
#### 4.6.1 Daemon-side rejection phasing (v9)
| Phase | When daemon rejects | Outbox row? | Daemon-consumed? | Same daemon caller may reuse id? |
|---|---|---|---|---|
| **A. IPC validation** (auth, schema, size, destination resolvable) | Before §4.5.1 step 3 | No | No | Yes — id never written locally |
| **B. Outbox stored, broker network/transient failure** | After IPC accept, broker `5xx` or timeout | `pending` → retried | Yes | N/A — daemon owns retries |
| **C. Outbox stored, broker permanent rejection** | Broker returns `4xx` after IPC accept | `dead` | Yes | No — rotate via `requeue` |
| **D. Operator retirement** | Operator runs `requeue` on `dead` or `pending` row | `aborted` (audit) + new row with fresh id | Yes (still consumed) | Old id NEVER reusable; new id is fresh |
The "daemon-consumed?" column is the daemon-layer authority. It does
not depend on whether the broker ever saw the request — phase C above
shows the broker has not committed a dedupe row, but the daemon still
holds the id in `dead` state.
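Phases B and C of the table reduce to a small mapping from the broker's reply to the next outbox status. The sketch below is illustrative only: `next_outbox_status` is a hypothetical helper, `None` stands for a timeout, and treating a `200` duplicate reply as `done` is an assumption drawn from the broker pseudocode in §4.7.2.

```python
def next_outbox_status(broker_status):
    """Map a broker reply (HTTP status int, or None for a timeout)
    to the outbox status the daemon would record next."""
    if broker_status is None or 500 <= broker_status < 600:
        # Phase B: transient failure; the daemon retries with the same id.
        return "pending"
    if 400 <= broker_status < 500:
        # Phase C: permanent rejection; only `requeue` can move on from here.
        return "dead"
    if broker_status in (200, 201):
        # 201 = accepted (B3); 200 = duplicate of an earlier accept (assumed).
        return "done"
    raise ValueError(f"unexpected broker status: {broker_status}")
```

In every branch the id stays daemon-consumed, because the outbox row already exists by the time the broker is called.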
#### 4.6.2 Broker-side rejection phasing (v9 — rate limit moved to B1)
The broker validates in two phases (B1 and B2) relative to dedupe-row insertion; B3 is the commit that follows:
| Phase | Validation | Side effects | Result for direct broker callers |
|---|---|---|---|
| **B1. Pre-dedupe-claim** (atomic, external) | Auth (mesh membership), schema, size, mesh exists, member exists, destination kind valid, payload bytes ≤ `max_payload.inline_bytes`, **rate limit not exceeded** (atomic external limiter — see §4.6.4) | None | `4xx` returned. No dedupe row, no broker-consumed id. Caller may retry with same id once condition clears |
| **B2. Post-dedupe-claim** (in-tx) | Conditions that require the accept transaction to be in progress: destination_ref existence (topic exists, member subscribed, etc.) | INSERT into dedupe rolled back | `4xx` returned, transaction rolled back, no dedupe row remains. Caller may retry with same id |
| **B3. Accepted** | All side effects commit atomically | Dedupe row, message row, history row, delivery_queue rows, mention_index rows | `201` returned with `broker_message_id`. Id is broker-consumed |
**Daemon-mediated callers**: in v0.9.0 the daemon is the only B-phase
caller. Daemon-mediated callers see only the daemon-layer rules
(§4.6.1). The broker's "may retry with same id" wording in the table
above applies to direct broker callers only (none in v0.9.0; reserved
for future SDK paths).
**Critical guarantee (v9 — tightened from v8)**: a dedupe row exists
**iff the broker accept transaction committed (B3)**. There is no
broker code path where a permanent 4xx leaves a dedupe row behind.
If the broker decides post-commit that an accepted message is invalid
(async content-policy job, async moderation, etc.), that's NOT a
permanent rejection — it's a follow-up event that operates on the
`broker_message_id`, not on the dedupe key.
#### 4.6.4 Rate limiter — atomic, external, B1 (NEW v9 — codex r8)
Codex r8 caught: v8 listed rate-limit enforcement in B2 (in-tx) but
classified rate-limit *counters* as async/non-authoritative. Both
can't be true. v9 resolves it by moving rate-limit enforcement to B1
backed by an atomic external limiter:
- **Authority**: the broker's existing Redis (or equivalent
fixed-window limiter) used for `claudemesh launch` rate-limiting is
the authority for accept-time rate-limit enforcement. `INCR` with
TTL is atomic; the broker checks the result before opening the
Phase B2/B3 transaction.
- **Idempotency interaction**: rate-limit `INCR` happens **before** the
dedupe-claim INSERT. If the limiter rejects, no DB transaction is
opened, no dedupe row exists. If the limiter accepts but the in-tx
Phase B2 then rejects (e.g. topic not found), the limiter `INCR` is
not refunded. This is intentional: refunding would require a
reliable distributed counter, and the over-counting risk is
acceptable. Counter
`cm_broker_rate_limit_consumed_then_rejected_total` exposes the
delta for ops awareness.
- **Retries**: a daemon retry with the same `client_message_id` after a
B1 rate-limit rejection produces another `INCR`. To avoid burning
rate-limit budget on retries-of-rejected-ids, the broker can
optionally short-circuit `INCR` if the rate-limit subsystem can
cheaply detect "this exact `client_message_id` was rejected for
rate-limit in the last N seconds" — but this is an optimization,
not a correctness requirement.
- **Async counters**: `mesh.rate_limit_counter` (or any DB-resident
view of "messages-per-mesh-per-window") is **non-authoritative** —
it's metrics/telemetry rebuilt from the authoritative limiter and
from message-history. Used for dashboards, not for accept decisions.
This split — atomic external limiter for enforcement, async DB
counters for telemetry — matches how every other rate-limited
subsystem in claudemesh works (`claudemesh launch`, dashboard chat
posts, etc.). No new infrastructure required.
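The ordering described in this section (limiter first, transaction second, no refund on a B2 rejection) can be illustrated with an in-memory fixed-window counter. This is a sketch under stated assumptions: `FixedWindowLimiter` stands in for the external `INCR`-with-TTL store and is not itself atomic across processes, `accept` is a hypothetical name, and the concrete `429` and `404` codes are placeholders for the spec's unspecified `4xx`.

```python
import time


class FixedWindowLimiter:
    """In-memory stand-in for an atomic INCR-with-TTL counter."""

    def __init__(self, limit, window_s, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._counts = {}  # key -> (window_start, count)

    def incr_and_check(self, key):
        """Increment the counter for key; return True while under the limit."""
        now = self.clock()
        start, count = self._counts.get(key, (now, 0))
        if now - start >= self.window_s:  # the "TTL" expired: new window
            start, count = now, 0
        count += 1
        self._counts[key] = (start, count)
        return count <= self.limit


def accept(limiter, mesh_id, b2_passes, metrics):
    """Return the HTTP status of one accept attempt.

    b2_passes stands in for the in-transaction B2 checks.
    """
    if not limiter.incr_and_check(mesh_id):
        return 429  # B1 rejection: no transaction opened, no dedupe row
    if not b2_passes:
        # The INCR is not refunded; count it for ops awareness
        # (cm_broker_rate_limit_consumed_then_rejected_total).
        metrics["consumed_then_rejected"] += 1
        return 404
    return 201  # B3: committed
```

Injecting `clock` keeps the window logic testable without sleeping; a real deployment would rely on the external store's own expiry instead.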
#### 4.6.3 Operator recovery via `requeue` (corrected v8)
To unstick a `dead` or `pending`-but-stuck row, operator runs:
```
claudemesh daemon outbox requeue --id <outbox_row_id>
[--new-client-id <id> | --auto]
[--patch-payload <path>]
```
This atomically (single SQLite transaction):
1. Sets the existing row's status to `aborted`, with `aborted_at = now` and
`aborted_by = "operator"`. The row is **never deleted**, so the audit
trail is permanent.
2. Mints a fresh `client_message_id` (caller-supplied via `--new-client-id`
or auto-ulid'd via `--auto`).
3. Inserts a new outbox row in `pending` with the fresh id and the same
payload (or patched payload if `--patch-payload` was given).
4. Sets `superseded_by = <new_row_id>` on the old row so
`outbox inspect <old_id>` displays the chain.
**The old `client_message_id` is permanently dead** — `outbox.db` still
holds it via the `aborted` row's `UNIQUE` constraint, and any caller
re-using it gets `409 outbox_aborted_*` per §4.5.1.
If broker had ever accepted the old id (it reached B3), the broker's
dedupe row is also permanent — duplicate sends to broker with the old
id would also `409` for fingerprint mismatch (or return the original
`broker_message_id` for matching fingerprint). Daemon-side
`aborted` and broker-side dedupe row are independent records of "this
id was used"; neither releases the id.
This is the resolution to v7's contradiction: there is **no path** for
an id to "become free again." If the operator wants to retry the
payload, they get a new id. The old id stays buried.
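The four requeue steps fit in one SQLite transaction. The following sketch uses the §4.5.2 schema with stdlib `sqlite3`; the `requeue` signature is hypothetical, `uuid4` stands in for a ULID, and `new_fingerprint` is passed in by the caller because the canonical form from §4.4 is not reproduced here.

```python
import sqlite3
import time
import uuid

SCHEMA = """
CREATE TABLE outbox (
  id TEXT PRIMARY KEY,
  client_message_id TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  payload BLOB NOT NULL,
  enqueued_at INTEGER NOT NULL,
  attempts INTEGER DEFAULT 0,
  next_attempt_at INTEGER NOT NULL,
  status TEXT CHECK(status IN ('pending','inflight','done','dead','aborted')),
  last_error TEXT,
  delivered_at INTEGER,
  broker_message_id TEXT,
  aborted_at INTEGER,
  aborted_by TEXT,
  superseded_by TEXT
);
"""


def requeue(conn, old_row_id, new_fingerprint, new_client_id=None,
            patched_payload=None, now=None):
    """Retire old_row_id as 'aborted' and enqueue a successor row.

    conn must be opened with isolation_level=None. Returns the new row id.
    """
    now = int(time.time()) if now is None else now
    new_client_id = new_client_id or uuid.uuid4().hex  # stand-in for a ULID
    new_row_id = uuid.uuid4().hex
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT payload, status FROM outbox WHERE id = ?", (old_row_id,)
        ).fetchone()
        if row is None or row[1] not in ("dead", "pending"):
            raise ValueError("only dead or pending rows can be requeued")
        payload = row[0] if patched_payload is None else patched_payload
        # Steps 1 and 4: retire the old row and point it at its successor.
        conn.execute(
            "UPDATE outbox SET status = 'aborted', aborted_at = ?, "
            "aborted_by = 'operator', superseded_by = ? WHERE id = ?",
            (now, new_row_id, old_row_id),
        )
        # Steps 2 and 3: insert the successor under the fresh id.
        conn.execute(
            "INSERT INTO outbox (id, client_message_id, request_fingerprint, "
            "payload, enqueued_at, next_attempt_at, status) "
            "VALUES (?, ?, ?, ?, ?, ?, 'pending')",
            (new_row_id, new_client_id, new_fingerprint, payload, now, now),
        )
        conn.execute("COMMIT")
        return new_row_id
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Because the old row is kept, the `UNIQUE` constraint on `client_message_id` rejects any attempt to pass a retired id as the new one, and the whole transaction rolls back.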
### 4.7 Broker atomicity contract — side-effect classification (v9)
#### 4.7.1 Side effects (v9 — rate limit moved to B1 external)
Every successful broker accept atomically commits these durable
state changes in **one transaction**:
| Effect | Table | In-tx? | Why |
|---|---|---|---|
| Dedupe record | `mesh.client_message_dedupe` | **Yes** | Idempotency authority |
| Message body | `mesh.topic_message` / `mesh.message_queue` | **Yes** | Authoritative store |
| History row | `mesh.message_history` | **Yes** | Replay log; lost-on-rollback would break ordered replay |
| Fan-out work | `mesh.delivery_queue` | **Yes** | Each recipient must see exactly the messages that committed |
| Mention index entries | `mesh.mention_index` | **Yes** | Mention-query reads must match committed messages |
**Outside the transaction** — non-authoritative or rebuildable, with
explicit rationale per item:
| Effect | Where | Why outside |
|---|---|---|
| WS push to live subscribers | Async after COMMIT | Live notifications are best-effort; receivers re-fetch from history on reconnect |
| Webhook fan-out | Async via `delivery_queue` workers | Off-band; consumes committed `delivery_queue` rows |
| Rate-limit **counters** (telemetry only) | Async, eventually consistent | Authoritative limiter is the external Redis-style INCR in B1 (§4.6.4); the DB counter is rebuilt for dashboards, not consulted for accept |
| Audit log entries | Async append-only stream | Audit log can be rebuilt from message history; in-tx writes hurt p99 |
| Search/FTS index updates | Async via outbox-pattern worker | Index can be rebuilt from authoritative tables |
| Metrics | Prometheus, pull-based | Always non-authoritative |
If any in-transaction insert fails, the transaction rolls back
completely. The broker returns `5xx` to the daemon, which retries. No
partial state remains.
The async side effects are driven off the in-transaction
`delivery_queue` and `message_history` rows, so they cannot get ahead
of committed state — only lag behind.
#### 4.7.2 Pseudocode — corrected and final (v8)
```sql
-- Phase B1 already passed (see §4.6.2). This includes:
-- - schema/auth/size validation
-- - external atomic rate-limit INCR (§4.6.4)
-- Anything that fails B1 returns 4xx without ever opening this tx.
BEGIN;
-- Phase B2 + B3: try to claim the idempotency key.
INSERT INTO mesh.client_message_dedupe
(mesh_id, client_message_id, broker_message_id, request_fingerprint,
destination_kind, destination_ref, expires_at)
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
$dest_kind, $dest_ref, $expires_at)
ON CONFLICT (mesh_id, client_message_id) DO NOTHING;
-- Inspect the row that's actually there now (ours or someone else's).
SELECT broker_message_id, request_fingerprint, destination_kind,
destination_ref, history_available, first_seen_at
FROM mesh.client_message_dedupe
WHERE mesh_id = $mesh_id AND client_message_id = $client_id
FOR SHARE;
-- Branch:
-- row.broker_message_id == $msg_id → first insert; continue to step 3.
-- row.broker_message_id != $msg_id → duplicate. Compare fingerprints:
-- fingerprint match → ROLLBACK; return 200 duplicate.
-- fingerprint mismatch → ROLLBACK; return 409 idempotency_key_reused.
-- Step 3: validate Phase B2 (destination_ref existence: topic exists,
-- member subscribed, etc.). Rate limit is NOT here — it was checked
-- atomically in B1 via the external limiter (§4.6.4) before this
-- transaction opened.
-- If B2 fails → ROLLBACK; return 4xx (no dedupe row remains).
-- Step 4: insert all in-tx side effects (§4.7.1).
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
VALUES ($msg_id, $mesh_id, $client_id, ...);
INSERT INTO mesh.message_history (broker_message_id, mesh_id, ...)
VALUES ($msg_id, $mesh_id, ...);
INSERT INTO mesh.delivery_queue (broker_message_id, recipient_pubkey, ...)
SELECT $msg_id, member_pubkey, ...
FROM mesh.topic_subscription
WHERE topic = $dest_ref AND mesh_id = $mesh_id;
INSERT INTO mesh.mention_index (broker_message_id, mentioned_pubkey, ...)
SELECT $msg_id, mention_pubkey, ...
FROM unnest($mention_list) AS mention_pubkey;
COMMIT;
-- After COMMIT, async workers consume delivery_queue and update
-- search indexes, audit logs, rate-limit counters, etc.
```
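The claim-then-validate flow above can be exercised locally with SQLite as a stand-in for Postgres. This is a sketch only: the three-table schema is a hypothetical reduction of the real one, `INSERT OR IGNORE` plays the role of `ON CONFLICT ... DO NOTHING`, `BEGIN IMMEDIATE` replaces the row-level `FOR SHARE`, and `404` is a placeholder for the unspecified B2 `4xx`.

```python
import sqlite3

# Hypothetical reduction of the broker schema for illustration.
SCHEMA = """
CREATE TABLE dedupe (
  mesh_id TEXT, client_message_id TEXT,
  broker_message_id TEXT, request_fingerprint BLOB,
  PRIMARY KEY (mesh_id, client_message_id)
);
CREATE TABLE topic (mesh_id TEXT, name TEXT);
CREATE TABLE message (id TEXT PRIMARY KEY, mesh_id TEXT, topic TEXT, body TEXT);
"""


def broker_accept(conn, mesh_id, client_id, msg_id, fingerprint, topic, body):
    """Return (http_status, broker_message_id or None).

    conn must be opened with isolation_level=None.
    """
    conn.execute("BEGIN IMMEDIATE")
    try:
        # Try to claim the idempotency key.
        conn.execute(
            "INSERT OR IGNORE INTO dedupe VALUES (?, ?, ?, ?)",
            (mesh_id, client_id, msg_id, fingerprint),
        )
        existing_id, existing_fp = conn.execute(
            "SELECT broker_message_id, request_fingerprint FROM dedupe "
            "WHERE mesh_id = ? AND client_message_id = ?",
            (mesh_id, client_id),
        ).fetchone()
        if existing_id != msg_id:  # an earlier accept already committed this id
            conn.execute("ROLLBACK")
            if existing_fp == fingerprint:
                return 200, existing_id  # duplicate
            return 409, None  # idempotency_key_reused
        # Phase B2: destination_ref must exist.
        found = conn.execute(
            "SELECT 1 FROM topic WHERE mesh_id = ? AND name = ?",
            (mesh_id, topic),
        ).fetchone()
        if found is None:
            conn.execute("ROLLBACK")  # unwinds our own dedupe claim
            return 404, None
        conn.execute(
            "INSERT INTO message VALUES (?, ?, ?, ?)",
            (msg_id, mesh_id, topic, body),
        )
        conn.execute("COMMIT")  # Phase B3
        return 201, msg_id
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

The property worth checking is the "iff" from §4.6.2: after a B2 rejection the dedupe table is empty again, so the same id can still be accepted later.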
#### 4.7.3 Orphan check — same as v7 §4.7.3
Extended to cover the side-effect inventory in §4.7.1, verifying that the in-transaction items are consistent with each other.
### 4.8 Outbox max-age math — unchanged from v7 §4.8
Min `dedupe_retention_days = 7`; derived `max_age_hours = window -
safety_margin` strictly < window; safety_margin floor 24h.
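As a worked example of the derivation, the sketch below assumes that `window` is `dedupe_retention_days` expressed in hours; the function name `derive_max_age_hours` is hypothetical.

```python
def derive_max_age_hours(dedupe_retention_days, safety_margin_hours=24):
    """Derive the outbox max age from the broker's dedupe retention window."""
    if dedupe_retention_days < 7:
        raise ValueError("dedupe_retention_days must be at least 7")
    if safety_margin_hours < 24:
        raise ValueError("safety_margin floor is 24h")
    window_hours = dedupe_retention_days * 24  # assumption: window in hours
    if safety_margin_hours >= window_hours:
        raise ValueError("safety margin must be smaller than the window")
    # Strictly less than the window because the margin is at least 24h.
    return window_hours - safety_margin_hours
```

With the minimum 7-day retention and the 24h floor, the window is 168h and the derived max age is 144h.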
### 4.9 Inbox schema — unchanged from v3 §4.5
### 4.10 Crash recovery — unchanged from v3 §4.6
### 4.11 Failure modes — `aborted` semantics added (v8)
- **IPC accept fingerprint-mismatch on duplicate id** (any state):
returns 409 with `conflict` field per §4.5.1. Caller must use a new id.
- **IPC accept against `aborted` row, fingerprint match**: returns 409
per §4.5.1 (NEW v8). Caller must use a new id; the old id is
permanently retired.
- **Outbox row stuck in `dead`**: operator runs `outbox requeue` per
§4.6.3; old id stays in `aborted`, new id is fresh.
- **Broker fingerprint mismatch on retry**: as v6/v7. Daemon marks
`dead`; operator requeue path.
- **Daemon retry after dedupe row hard-deleted by broker retention
sweep**: cannot happen unless operator overrode `max_age_hours`.
- **Broker phase B2 rejection on retry**: same id, same fingerprint,
but a B2 condition has changed (e.g. the destination topic no longer
exists). Daemon receives 4xx → marks `dead`. Operator can `requeue` once
conditions clear.
- **Atomicity violation found by orphan check**: alerts ops.
---
## 5-13. — unchanged from v4
## 14. Lifecycle — unchanged from v5 §14
## 15. Version compat — unchanged from v7 §15
## 16. Threat model — unchanged
---
## 17. Migration — v8 outbox columns + broker phase B2 (v8)
Broker side, deploy order: same as v7 §17, with one addition:
- Step 4.5: explicitly split broker accept into Phase B1 (pre-dedupe
validation, returns 4xx without writing) and Phase B2/B3 (within the
accept transaction). Implementation: refactor handler to validate
Phase B1 conditions before opening the DB transaction.
Daemon side:
- Outbox schema gains `aborted_at`, `aborted_by`, `superseded_by`
columns and the `aborted` enum value (§4.5.2). Migration applies via
`INSERT INTO new SELECT * FROM old` recreation if needed; v0.9.0 is
greenfield.
- IPC accept switches to `BEGIN IMMEDIATE` for SQLite serialization
(§4.5.1 step 3).
- IPC accept handles `aborted` rows per §4.5.1 (always 409).
- `claudemesh daemon outbox requeue` always mints a fresh
`client_message_id`; never frees the old id. `--new-client-id <id>`
and `--auto` are the only modes; the old `client_message_id`
argument is removed.
---
## What changed v8 → v9 (codex round-8 actionable items)
| Codex r8 item | v9 fix | Section |
|---|---|---|
| Cross-layer ID-consumed authority contradiction | Two-layer model: daemon-consumed iff outbox row; broker-consumed iff dedupe row committed; daemon-mediated callers see only daemon-layer authority | §4.1, §4.6.1, §4.6.2 |
| Rate-limit authority muddled (B2 vs async counters) | Rate limit moved to B1 via external atomic limiter (Redis-style INCR with TTL); DB rate-limit counters demoted to telemetry-only | §4.6.2, §4.6.4, §4.7.1 |
| §4.1 broker guarantee fuzzy | Tightened: "dedupe row exists iff broker accept transaction committed (B3)" | §4.1, §4.6.2 |
(Earlier rounds' fixes preserved unchanged.)
---
## What needs review (round 9)
1. **Two-layer ID model (§4.1, §4.6.1)** — is the daemon-vs-broker
authority split clear, or does it create more confusion for
operators reading "consumed" in different contexts? Should we use
different verbs (e.g. "claimed" at daemon, "committed" at broker)?
2. **Rate-limit external limiter (§4.6.4)** — is "atomic external
limiter" specified concretely enough? Is the over-counting on
limiter-accepted-then-B2-rejected acceptable?
3. **B2 contents after rate-limit move** — B2 now only has
`destination_ref existence`. Worth keeping a B2 phase at all, or
collapse into B1+B3?
4. **Anything else still wrong?** Read it as if you were going to
operate this for a year.
Three options:
- **(a) v9 is shippable**: lock the spec, start coding the frozen core.
- **(b) v10 needed**: list the must-fix items.
- **(c) the architecture itself is wrong**: what would you do differently?
Be ruthless.


@@ -0,0 +1,374 @@
# `claudemesh daemon` — Final Spec
> Context for the reviewer: claudemesh is a peer mesh runtime for Claude Code
> sessions. Existing infrastructure: a managed broker (`wss://ic.claudemesh.com/ws`,
> Bun + Drizzle + Postgres) that handles routing, presence, topics, files,
> per-mesh apikeys, etc. There is also a CLI (`claudemesh-cli`, npm) and a web
> dashboard. Each session today is short-lived: `claudemesh launch` opens a WS,
> stays up while Claude Code is running, then closes. Server-side
> integrations (RunPod handlers, Temporal workers, CI jobs) currently have no
> first-class way to participate in a mesh — they'd either curl an apikey-auth
> REST endpoint (one-way) or shell out to the CLI cold-path (slow, no inbound).
>
> This spec proposes a `claudemesh daemon` mode that turns any host (laptop,
> server, RunPod pod) into a persistent, addressable peer with a local IPC
> surface that apps can talk to without dealing with the broker directly.
>
> The user has explicitly said: pre-launch, no users yet, optimize for the
> right architecture not the smallest first cut. They want the FINAL spec, not
> phased MVPs.
---
## 1. Process model
**One daemon per (user, mesh)**. Persistent. Survives reboots via OS supervisor (systemd / launchd / SCM). Serves multiple local apps concurrently.
```
~/.claudemesh/daemon/<mesh-slug>/
pid 0600 pidfile, cleaned on shutdown
sock 0600 unix domain socket (primary IPC)
http.port 0644 auto-allocated loopback port (Windows / Docker fallback)
keypair.json 0600 persistent ed25519 + x25519 — daemon identity
config.toml 0644 user-editable runtime tuning
outbox.db 0600 SQLite — durable outbound queue + dedupe ledger
inbox.db 0600 SQLite — 30-day inbound history, FTS-indexed
daemon.log 0644 JSON-lines, rotating (100 MB / 14 d)
hooks/ 0700 user-managed event scripts
```
Single binary. No external runtime beyond the existing CLI dependencies. The daemon *is* the CLI in long-running mode — `claudemesh daemon up` is a flag on the same binary.
## 2. Identity — persistent member, not ephemeral session
The daemon mints a stable ed25519 + x25519 keypair on first startup, stored in `keypair.json`. Registers with the broker as a **persistent member** — same identity across restarts, reconnects, host migrations. `runpod-worker-3` is `runpod-worker-3` forever, until you `claudemesh daemon reset` or revoke the keypair.
`--name` is taken at first `daemon up`; subsequent runs read the keypair file and ignore `--name` unless `--rename` is passed (which produces a `member_renamed` event the broker propagates to peers).
This is the default. It's the right thing for servers. There is no `--ephemeral` mode.
## 3. IPC surface — single versioned API, three transports
**Transports**, all serving identical JSON:
- **UDS** at `~/.claudemesh/daemon/<slug>/sock` (primary, default)
- **TCP loopback** on auto-allocated port written to `http.port` (Docker / Windows clients)
- **Server-Sent Events** stream at `GET /v1/events` for push (real-time inbound)
**No auth on local IPC.** Trust boundary is the OS — UDS is mode 0600, TCP listens on 127.0.0.1 only. If you can reach the socket, you're already running as the right user; the daemon's `keypair.json` is also reachable, so adding a token would be theatre.
**Endpoint surface — exactly mirrors CLI verbs:**
```
# messaging
POST /v1/send {to, message, priority?, meta?, replyToId?}
POST /v1/topic/post {topic, message, priority?, mentions?}
POST /v1/topic/subscribe {topic}
GET /v1/topic/list
GET /v1/inbox ?since=<iso>&topic=<n>&from=<peer>&limit=<n>
POST /v1/broadcast {message, scope: "*"|"@group"|...}
# peers + presence
GET /v1/peers ?mesh=<slug>
POST /v1/profile {summary?, status?, visible?, avatar?, ...}
POST /v1/groups/join {name, role?}
POST /v1/groups/leave {name}
# state, memory, vector, graph — full mesh-services platform
POST /v1/state/set {key, value, scope?: "mesh"|"member"}
GET /v1/state/get ?key=...
GET /v1/state/list
POST /v1/memory/remember {content, tags?}
GET /v1/memory/recall ?q=<query>
POST /v1/vector/store {collection, text, metadata?}
GET /v1/vector/search ?collection=<c>&q=<query>&limit=<n>
POST /v1/graph/query {cypher, params?}
# files
POST /v1/file/share {path, to?, message?, persistent?}
GET /v1/file/get ?id=<fileId>&out=<path>
GET /v1/file/list
# tasks + scheduling
POST /v1/task/create {title, assignee?, priority?, tags?}
POST /v1/task/claim {id}
POST /v1/task/complete {id, result?}
POST /v1/scheduling/remind {at|in|cron, message, to?}
# skills + MCP services (full peer participation)
POST /v1/skill/deploy {path}
POST /v1/skill/share {name, manifest}
POST /v1/mcp/register {server_name, description, tools, transport}
POST /v1/mcp/call {server, tool, args}
# events (push)
GET /v1/events text/event-stream
events: message, peer_join, peer_leave, file_shared, task_assigned,
state_changed, mcp_deployed, skill_shared, hook_executed,
disconnect, reconnect
# control plane
GET /v1/health {connected, lag_ms, queue_depth, mesh, member_pubkey, uptime_s}
GET /v1/metrics Prometheus exposition
POST /v1/heartbeat {} (caller asserts it's alive — daemon may set status="working")
```
Every CLI verb the platform offers has a daemon endpoint. No second-class features. Apps written against the daemon get the same surface as Claude Code itself.
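For a sense of what a local app sees, here is a minimal stdlib-only sketch of calling an endpoint over the unix domain socket. The helper names (`UnixHTTPConnection`, `daemon_request`, `default_socket_path`) are hypothetical, and the sketch assumes the daemon speaks plain HTTP/1.1 with JSON bodies on the socket, which the spec implies but does not spell out.

```python
import http.client
import json
import os
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that dials a unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=5.0):
        # "localhost" only feeds the Host header; no TCP connection is made.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def default_socket_path(mesh_slug):
    """Socket location from the process-model layout in section 1."""
    return os.path.expanduser(f"~/.claudemesh/daemon/{mesh_slug}/sock")


def daemon_request(socket_path, method, path, body=None):
    """Send one request and return (status, decoded JSON body or None)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        payload = None if body is None else json.dumps(body)
        headers = {"Content-Type": "application/json"} if payload else {}
        conn.request(method, path, body=payload, headers=headers)
        resp = conn.getresponse()
        return resp.status, json.loads(resp.read() or b"null")
    finally:
        conn.close()
```

A call such as `daemon_request(default_socket_path("prod"), "GET", "/v1/health")` would then return the health object listed under the control plane, assuming a daemon for the `prod` mesh is running.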
## 4. Outbound — exactly-once via SQLite + idempotency keys
Sends route through `outbox.db` first, then to the broker. Schema:
```sql
CREATE TABLE outbox (
id TEXT PRIMARY KEY, -- ulid
idempotency_key TEXT UNIQUE, -- caller-provided or autogen
payload BLOB NOT NULL, -- serialized envelope
enqueued_at INTEGER NOT NULL,
attempts INTEGER DEFAULT 0,
next_attempt_at INTEGER NOT NULL,
status TEXT CHECK(status IN ('pending','inflight','done','dead')),
last_error TEXT,
delivered_at INTEGER
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
```
- WAL mode, `synchronous=NORMAL` — durable enough, ~10k inserts/sec.
- Caller-supplied `Idempotency-Key` header dedupes retries (24h window).
- Exponential backoff with jitter; 7-day max retention; `dead` rows surface in `claudemesh daemon outbox --failed`.
- `delivered_at` set when broker ACKs the queue row, not when daemon sends — gives true at-least-once with explicit dedupe → effectively exactly-once.
## 5. Inbound — durable history with FTS
Every inbound message is written to `inbox.db` before any hook fires:
```sql
CREATE VIRTUAL TABLE inbox USING fts5(
message_id UNINDEXED, mesh UNINDEXED, topic, sender_pubkey UNINDEXED,
sender_name, body, meta, received_at UNINDEXED, replied_to_id UNINDEXED
);
```
- 30-day rolling retention (configurable).
- `claudemesh daemon search "OOM"` queries the FTS index (instant, offline-capable).
- Apps that connect mid-stream replay history via `?since=<iso>`.
- Exposed in metrics: `cm_daemon_inbox_rows`, `cm_daemon_inbox_bytes`.
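The inbox table and a `search`-style query can be tried with stdlib `sqlite3`, provided the local SQLite build includes FTS5. The helper names (`open_inbox`, `record`, `search`) are hypothetical, and filtering `since` by comparing ISO-8601 strings is an assumption about how `received_at` is stored.

```python
import sqlite3


def open_inbox(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE VIRTUAL TABLE IF NOT EXISTS inbox USING fts5(
            message_id UNINDEXED, mesh UNINDEXED, topic,
            sender_pubkey UNINDEXED, sender_name, body, meta,
            received_at UNINDEXED, replied_to_id UNINDEXED
        )"""
    )
    return conn


def record(conn, message_id, mesh, topic, sender_pubkey, sender_name,
           body, received_at):
    """Persist one inbound message (call this before firing any hook)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO inbox (message_id, mesh, topic, sender_pubkey, "
            "sender_name, body, meta, received_at, replied_to_id) "
            "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
            (message_id, mesh, topic, sender_pubkey, sender_name,
             body, "", received_at, None),
        )


def search(conn, query, since=None, limit=50):
    """Full-text search, best match first; `since` is an ISO-8601 string."""
    sql = "SELECT message_id, sender_name, body FROM inbox WHERE inbox MATCH ?"
    params = [query]
    if since is not None:
        sql += " AND received_at >= ?"  # string comparison on ISO timestamps
        params.append(since)
    sql += " ORDER BY rank LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()
```

String comparison only orders timestamps correctly when they all share one format and timezone, so a real implementation would need to normalise `received_at` on write.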
## 6. Hooks — first-class scripted reactions
Hooks turn the daemon from a passive relay into an autonomous peer. Files in `hooks/`:
```
hooks/
on-message.sh every inbound message (DM + topic)
on-dm.sh DMs only
on-mention.sh when @<my-name> appears anywhere
on-topic-<name>.sh a specific topic (e.g. on-topic-alerts.sh)
on-file-share.sh file shared with me
on-task-assigned.sh task assigned to me
on-disconnect.sh WS dropped (informational)
on-reconnect.sh reconnected (informational)
on-startup.sh daemon up
pre-send.sh filter / mutate outbound (last gate)
```
**Contract:**
- Stdin: full event JSON.
- Stdout (if non-empty, JSON object): used as a structured response. For inbound messages, `{reply: "..."}` posts a reply automatically.
- Exit 0 = success; non-zero logs + counts but does not retry.
- Timeout: 30s default, override via `# claudemesh:timeout=120s` shebang comment.
- Env: `PATH=/usr/bin:/bin`, `CLAUDEMESH_MESH=<slug>`, `CLAUDEMESH_MEMBER=<pubkey>`, `CLAUDEMESH_HOME=<config-dir>`, plus the daemon's own broker session token in `CLAUDEMESH_TOKEN` so the script can call `claudemesh send` without re-authenticating.
- Concurrent execution: bounded pool (default 8) — overflow queues, never blocks the WS reader.
This makes a server a real participant: it auto-replies to "@worker-3 status?", auto-acks file shares, auto-claims tasks, escalates errors to oncall — all configured by dropping shell scripts in a directory.
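The stdin/stdout contract is language-agnostic, so a hook does not have to be a shell script. The sketch below shows the shape of an `on-mention` handler in Python; the event field `body`, the member name `worker-3` and the reply text are assumptions for illustration, since the spec does not define the event schema here.

```python
import json
import sys


def handle_event(event, my_name="worker-3"):
    """Return the object to print on stdout, or None to stay silent."""
    body = event.get("body", "")
    if f"@{my_name}" in body and "status?" in body:
        # For inbound messages, {"reply": ...} posts a reply automatically.
        return {"reply": f"{my_name} is up"}
    return None


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read one event from stdin, write an optional JSON response."""
    event = json.load(stdin)
    response = handle_event(event)
    if response is not None:
        json.dump(response, stdout)
    return 0  # exit 0 = success; non-zero is logged but not retried

# A real hook file would end with: sys.exit(main())
```

Keeping the decision logic in `handle_event` makes the hook testable without a daemon; only `main` touches the process streams.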
## 7. Multi-mesh — one daemon per mesh, coordinated by a supervisor
Multi-mesh handled by **one daemon per mesh** (no shared state, no cross-mesh leakage). Coordinated by:
```
claudemesh daemon up --all # spawns one daemon per joined mesh
claudemesh daemon down --all
claudemesh daemon status --all # JSON table of every daemon
claudemesh daemon ps # alias of status
```
CLI verbs without `--mesh` continue to use their existing aggregator routing (`/v1/me/...`); in addition, each daemon contributes inbound state to the aggregator.
## 8. Auto-routing — every CLI verb prefers the daemon
The CLI's `withMesh` helper is replaced by `viaDaemonOrMesh`:
1. Read `~/.claudemesh/daemon/<slug>/pid`.
2. If alive → call the daemon's UDS endpoint.
3. Else → cold path (existing `withMesh` flow, opens its own short-lived WS).
Transparent to the user. `claudemesh send X "msg"` from a script becomes a sub-millisecond local UDS call when a daemon is up, instead of a 1-second broker handshake.
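The three routing steps can be sketched as a pidfile liveness check. This is a POSIX-only illustration with hypothetical helper names (`daemon_dir`, `read_live_pid`, `choose_route`); it also assumes the pidfile holds a bare decimal pid. A live pid does not prove the process is the daemon, since pids can be reused, so treat this as the naive baseline rather than a complete stale-pidfile strategy.

```python
import os


def daemon_dir(mesh_slug, home=None):
    home = home or os.path.expanduser("~")
    return os.path.join(home, ".claudemesh", "daemon", mesh_slug)


def read_live_pid(pidfile):
    """Return the pid in pidfile if such a process exists, else None."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return None  # missing, unreadable, or garbage pidfile
    if pid <= 0:
        return None
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return None  # stale pidfile
    except PermissionError:
        pass  # process exists but belongs to another user
    return pid


def choose_route(mesh_slug, home=None):
    """Return ("daemon", socket_path) or ("cold", None)."""
    d = daemon_dir(mesh_slug, home)
    if read_live_pid(os.path.join(d, "pid")) is not None:
        return ("daemon", os.path.join(d, "sock"))
    return ("cold", None)
```

A more robust variant might attempt the UDS connection directly and fall back to the cold path on connection failure, which sidesteps pid reuse entirely.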
## 9. Service installation
```bash
claudemesh daemon install-service # writes systemd unit / launchd plist / Windows SC
claudemesh daemon uninstall-service
```
Generated unit:
- `Restart=on-failure`, `RestartSec=5s`
- `MemoryMax=512M` (a ceiling; the daemon is expected to stay well below it)
- `StandardOutput/Error=journal`
- For systemd, runs as the invoking user (no root needed).
`claudemesh install` (the existing setup verb) gains an opt-in prompt: *"Install as a background service that always runs?"* For interactive users this is opt-in; for `--yes` it defaults to yes on Linux servers (detected by absence of TTY + presence of systemd).
## 10. Observability
```
claudemesh daemon status human-readable: connected, lag, queue, hooks fired
claudemesh daemon status --json machine-readable
claudemesh daemon logs [-f] tail daemon.log
claudemesh daemon outbox pending sends + dead-letter queue
claudemesh daemon inbox recent received messages (FTS-searchable)
claudemesh daemon metrics prints /v1/metrics
# Prometheus counters/gauges:
cm_daemon_connected{mesh} 0/1
cm_daemon_reconnects_total{mesh,reason}
cm_daemon_lag_ms{mesh} last broker round-trip
cm_daemon_outbox_depth{mesh}
cm_daemon_outbox_dead_total{mesh}
cm_daemon_send_total{mesh,kind=topic|dm|broadcast,status}
cm_daemon_recv_total{mesh,kind=topic|dm,from_type=peer|apikey|webhook}
cm_daemon_hook_invocations_total{hook,exit}
cm_daemon_hook_duration_seconds{hook} histogram
cm_daemon_ipc_request_total{endpoint,status}
cm_daemon_ipc_duration_seconds{endpoint} histogram
```
Tracing: optional OpenTelemetry export (`config.toml: [otel] endpoint = ...`) — emits spans for every IPC request + downstream broker call.
## 11. SDKs — three, all thin
The daemon's HTTP+UDS surface is the API; SDKs are convenience wrappers, not new surfaces.
**Python** (single file, stdlib only — no `requests`, no `aiohttp`):
```python
from claudemesh import Daemon
cm = Daemon() # auto-discovers running daemon for current cwd's mesh
cm.send("@oncall", "OOM detected")
cm.topic.post("alerts", "build done", mentions=["alice"])
for evt in cm.events(): # SSE stream, blocking iterator
if evt.kind == "message" and "@me" in evt.body:
cm.send(evt.from_pubkey, "got it, on it")
```
**Go** (single file, stdlib only — no third-party deps):
```go
cm, _ := claudemesh.Connect()
cm.Send(ctx, "@oncall", "OOM detected")
for evt := range cm.Events(ctx) { ... }
```
**TypeScript / Node** (zero runtime deps, ESM only):
```ts
import { Daemon } from "@claudemesh/daemon-client";
const cm = await Daemon.connect();
await cm.send("@oncall", "OOM detected");
for await (const evt of cm.events()) { ... }
```
Each is ~300 lines. All three are versioned in lockstep with the daemon's `/v1` surface. A `/v2` surface (when it eventually exists) keeps `/v1` alive indefinitely — old SDKs never break.
## 12. Security model — explicit boundaries
| Boundary | Trust | Mechanism |
|---|---|---|
| App ↔ Daemon (local) | OS user | UDS 0600, TCP loopback only |
| Daemon ↔ Broker | Mesh keypair | WSS + ed25519 hello sig + crypto_box DM envelopes + per-topic keys (existing model) |
| Hook ↔ Daemon (env) | OS user + filesystem | `hooks/` dir mode 0700; only files there execute; no remote install |
| Daemon ↔ Disk | OS user | All daemon files mode 0600/0644 under `~/.claudemesh/daemon/` |
**No new attack surface introduced by the daemon** — apps that previously could read `~/.claudemesh/config.json` directly already had full mesh access; the daemon just adds an IPC layer on top.
**Hook RCE consideration**: a peer cannot install a hook on your daemon. Hooks are files YOU put on disk. Inbound messages can only trigger hooks that already exist with content you wrote. The broker has no path to your hook directory.
## 13. Configuration — `config.toml`
```toml
[daemon]
mesh = "prod" # set on `daemon up --mesh`; immutable thereafter
display_name = "runpod-worker-3"
log_level = "info"
[ipc]
http_port = 0 # 0 = auto-allocate
http_bind = "127.0.0.1" # never 0.0.0.0; explicit if you know what you're doing
uds_mode = "0600"
[outbox]
max_queue_size = 10000
max_age_hours = 168 # 7 days
fsync_mode = "batched_50ms" # 'strict' | 'batched_50ms' | 'off'
[inbox]
retention_days = 30
fts_enabled = true
[reconnect]
initial_backoff_ms = 500
max_backoff_ms = 30000
backoff_multiplier = 2.0
jitter_pct = 25
[hooks]
enabled = true
concurrency = 8
default_timeout_s = 30
[metrics]
prometheus_enabled = true
otel_endpoint = "" # empty = disabled
```
User-editable. `claudemesh daemon reload` re-reads it without dropping the WS.
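To make the `[reconnect]` values concrete, the sketch below computes the delay before each retry. The formula is an assumption: the config names the knobs but not the exact schedule, so symmetric jitter of plus or minus `jitter_pct` around a capped exponential base is one plausible reading, and `backoff_ms` is a hypothetical name.

```python
import random


def backoff_ms(attempt, initial_ms=500, max_ms=30000, multiplier=2.0,
               jitter_pct=25, rng=random.random):
    """Delay in milliseconds before retry number `attempt` (0-based)."""
    try:
        base = min(max_ms, initial_ms * multiplier ** attempt)
    except OverflowError:
        base = max_ms  # very large attempt counts saturate at the cap
    spread = base * jitter_pct / 100.0
    # rng() is in [0, 1), so the result lies in [base - spread, base + spread).
    return base + (2 * rng() - 1) * spread
```

With the defaults shown in the config, the un-jittered base runs 500, 1000, 2000, 4000 ms and so on, reaching the 30000 ms cap at the sixth retry.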
## 14. Migration — what changes for existing users
- `claudemesh launch` (Claude Code mode) is unchanged. It can optionally `--via-daemon` to share the WS with a running daemon, but defaults to its own session (preserves "ephemeral session" semantics that Claude Code expects).
- `claudemesh send X "msg"` and every other cold-path verb gets a transparent speedup when a daemon is up. No flag, no opt-in, no behavior difference visible to the user.
- Existing `~/.claudemesh/config.json` is consumed unchanged by the daemon.
- No DB migration. No broker changes. The daemon talks to the existing `/v1` HTTPS + WSS surfaces — broker doesn't even know whether a connection is `claudemesh launch` or `claudemesh daemon`.
---
## What needs review
Please critically review this spec for the v0.9.0 anchor. Specifically I want
your hardest pushback on:
1. **Identity model** — persistent member by default vs ephemeral session. Have I
missed a case where ephemeral is the right answer for a daemon? Should
`--ephemeral` exist?
2. **No-auth local IPC** — UDS 0600 + TCP loopback. Is "OS-trust is enough"
actually safe in shared-tenant Linux (multi-user host, container
side-channel)? Should there be a per-daemon token even locally?
3. **SQLite outbox/inbox** — single writer, WAL, batched fsync. Is the
exactly-once-via-idempotency-key claim defensible? What's the failure mode
I'm glossing over?
4. **Hooks fork-execing scripts** — RCE/data-exfil concerns I'm dismissing too
easily? Should hooks be sandboxed (seccomp, no network, …)?
5. **Auto-routing CLI verbs through daemon** — does this break composability
with existing `claudemesh launch`? Race conditions when both are running?
What about pidfile-stale detection?
6. **One daemon per mesh** — why not one daemon serving all meshes, with mesh
selection per-request? What does single-daemon actually buy beyond "fewer
processes"?
7. **The IPC surface duplicates the broker REST surface** — am I solving a
problem the broker REST + per-mesh apikey already solves, with extra
complexity for caching + queueing?
8. **What's missing entirely** — auth boundaries, recovery flows, on-disk
secret rotation, anything else a production daemon shipped with this spec
would lack?
Score the spec on each axis: 1 = serious flaw, 5 = sound. Then list the
top 3 changes you'd insist on before I write any code. Be ruthless — pre-launch
window means I can break anything.

@@ -0,0 +1,218 @@
# `claudemesh daemon` — broker-hardening followups
> **Purpose**: refinements found during the v6 → v10 codex review series
> that are real improvements but **not** v0.9.0 blockers. The
> implementation target is `2026-05-03-daemon-spec-v0.9.0.md`. This
> document lists what was deferred, why, and the trigger that promotes
> each item to "must-do."
>
> **Background**: codex reviewed the daemon spec across 9 rounds (v1
> through v10). Rounds 1–4 found load-bearing architectural issues
> (identity, IPC auth, exactly-once lie, hook tokens, rotation, etc.).
> Rounds 5–9 found progressively finer correctness issues inside one
> subsystem (broker idempotency mechanics). v6 closed the architectural
> review; v7–v10 are increasingly fine-grained idempotency-correctness
> shavings on the same layer. Pre-launch (no users) doesn't need v7–v10
> level rigor. We pulled the cheap wins into v0.9.0; the rest waits.
---
## 1. B0 dedupe fast-path before rate-limit (v10)
**What v10 said**: read `mesh.client_message_dedupe` BEFORE consulting
the rate limiter. Existing id (match or mismatch) returns immediately
without touching rate-limit budget.
**Why deferred**: v0.9.0 doesn't have meaningful rate-limit pressure on
the daemon path. The split-brain failure (broker accepted, daemon
believes failure due to rate-limit-rejection-on-retry) requires
sustained saturated rate-limit windows, which don't exist pre-launch.
**Promote when**: any single mesh sees rate-limit rejections AND has
daemon retries against committed ids. Telemetry to watch:
`cm_broker_rate_limit_rejection_total` per mesh > 0 sustained.
**Implementation cost**: small — one indexed PK lookup before the
existing limiter call. The work is mostly testing the race semantics.
---
## 2. Lua-scripted idempotent rate limiter (v10)
**What v10 said**: limiter keyed by `(mesh_id, client_message_id,
window_bucket)` so retries-within-window consume budget at most once.
**Why deferred**: depends on (1) above. Without B0 fast-path this is
incremental complexity for marginal benefit. With B0 it becomes the
right belt-and-suspenders fix for the rare race where two same-id
requests both miss B0 simultaneously.
**Promote when**: B0 ships. Same trigger.
**Implementation cost**: medium — Lua script in Redis, careful TTL
tuning, integration with existing limiter call sites.
---
## 3. In-tx `mesh.mention_index` (v8)
**What v8 said**: mention-fanout index updates should commit inside the
broker accept transaction so mention-search reads can never see a
mention pointing at an uncommitted message.
**Why deferred**: the lag between accept-commit and async
mention-indexer is small (single-digit milliseconds in expected
deployment). Stale-read window during mention search is acceptable for
v0.9.0; receivers learn of mentions via the `mention` event in their
inbox stream regardless.
**Promote when**: real users complain about "I was mentioned but the
mention search doesn't show it" with reproducible cases that don't
self-heal in seconds.
**Implementation cost**: small — add `INSERT INTO mesh.mention_index`
to the accept transaction. The async indexer becomes a backfill
fallback rather than the primary path.
---
## 4. 4011 / 4012 close-code split (v6 §15.5)
**What v6 said**: split `4010 feature_unavailable` into three codes:
`4010` (missing), `4011` (params invalid), `4012` (params below floor).
**Why deferred**: v0.9.0 ships single `4010` with structured
`close_reason` JSON containing `kind`, `feature`, `detail`. Same
diagnostic information, simpler protocol surface.
**Promote when**: ops tooling or external monitoring needs distinct
status codes (e.g. PagerDuty rules that fire on 4012-only). Probably
never; structured JSON is parseable.
**Implementation cost**: trivial — three constants and a switch on
`close_reason.kind`.
---
## 5. Elaborate per-OS fingerprint precedence table (v8 §2.2.1)
**What v8 said**: comprehensive per-OS table covering Linux machine-id
sources, macOS `IOPlatformUUID`, Windows `MachineGuid`, BSD
`kern.hostuuid`, plus interface exclusion rules.
**Why deferred**: v0.9.0 ships with the simpler "machine-id ||
first-stable-mac" rule from v6. Edge cases (cloud images,
machine-id-not-readable, etc.) are documented when first hit.
**Promote when**: operators report fingerprint false-positives we can't
explain from the v6 rule. Each report adds one row to the per-OS
table.
**Implementation cost**: incremental — each OS-specific source is a
small probe function with a fallback chain.
---
## 6. `request_fingerprint` schema-version-2 in feature negotiation (v6 §15.1)
**What v6 said**: `client_message_id_dedupe` feature parameters
versioned independently. v0.9.0 ships at version 1 with a single
`request_fingerprint: bool` flag.
**Why deferred**: we don't yet need parameterized fingerprint variants
(different canonical forms, different hash algos). Version-bump path
is documented; we'll use it when we add the second fingerprint mode.
**Promote when**: we want a fingerprint algo other than sha256/JCS
(e.g. a faster hash, or a normalized canonical form).
**Implementation cost**: small — single feature-bit version bump
following the documented pattern.
---
## 7. Force-expiry / quarantine semantics for `keypair-archive.json` (v8 §14.1.1)
**What v8 said**: `max_archived_keys` cap with force-expiry; explicit
quarantine of malformed archive (`keypair-archive.json.malformed-<ts>`);
duplicate `key_id` rejection; mode-mismatch warning behavior.
**Why deferred**: v0.9.0 ships the simpler v6 rule — drop expired
entries on cleanup pass; refuse to start on malformed archive (loud,
operator-actionable). The v8 elaboration makes archive corruption
non-blocking, which is operationally nicer but trades off audit
clarity.
**Promote when**: a real operator hits an archive corruption that
shouldn't have brought the daemon down (e.g. mid-rotation crash leaves
a partially-written archive).
**Implementation cost**: small — quarantine logic + one extra startup
check.
---
## 8. Cross-language JCS conformance for `request_fingerprint` (v6 §4.4 round-6 question)
**What v6 asked**: does JCS work cross-language for
`meta_canonical_json`? Python json.dumps, Go encoding/json, and JS
JSON.stringify all behave differently. Should we ship a vetted JCS lib
in each SDK?
**Why deferred from v0.9.0**: the daemon ships in TypeScript only for
v0.9.0 (the `claudemesh-cli` package). Single-language JCS is trivial.
SDK ports come post-v0.9.0.
**Promote when**: we ship the Python or Go SDK. Each SDK port gets a
JCS conformance test against a corpus of envelopes.
**Implementation cost**: small per-language — a conformance fixture
file and a unit test.
---
## Sprint 7 (this session) — what landed vs deferred
**Landed in code** (not yet deployed):
- `packages/db/migrations/0028_message_queue_idempotency_fields.sql` adds
nullable `client_message_id` and `request_fingerprint` columns to
`mesh.message_queue` (additive, online-safe).
- `apps/broker/src/broker.ts`: `queueMessage` and `drainForMember`
thread the new columns through.
- `apps/broker/src/index.ts`: `handleSend` picks them up from the
daemon's wire envelope; outbound push echoes them back so receiving
daemons can dedupe.
- `apps/broker/src/types.ts`: `WSPushMessage` declares the optional
fields.
**Deployment plan (not auto-applied)**:
1. Apply migration against prod DB (the broker's filename-tracked
migrator picks up `0028_*.sql` on next startup).
2. Deploy the broker with the code changes via Coolify.
3. Verify a daemon-originated send shows non-null `client_message_id`
in `mesh.message_queue` afterwards.
**Still deferred** (full broker hardening):
- `mesh.client_message_dedupe` table with `request_fingerprint BYTEA`
and atomic accept transaction (spec §4.7).
- Feature-bit advertisement on hello_ack of
`client_message_id_dedupe` v1, with daemon-side enforcement (spec §15).
- Partial unique index `(mesh_id, client_message_id) WHERE NOT NULL`.
These sit behind the same trigger as the followups above: do them when
real users hit operational corners that the landed changes don't cover.
---
## How to use this document
When picking up post-v0.9.0 work on the daemon:
1. Check whether any of the "promote when" triggers above have fired.
2. If yes, consult the corresponding versioned spec (v6/v7/v8/v9/v10)
for the full proposed change.
3. Implement the lift, update `daemon-spec-v0.9.0.md` to reflect the
merge, and remove the item from this followups list.
The versioned specs live in `.artifacts/specs/` indefinitely as a
review-trail audit.

@@ -0,0 +1,680 @@
# `claudemesh daemon` — Implementation spec v0.9.0
> **Implementation target.** Locked from the v1–v10 codex-reviewed spec
> series. This document is what we build for v0.9.0 of the daemon.
>
> **Base**: v6 (the round where the architecture passed codex's
> structural review — request_fingerprint, dedupe table, atomicity
> contract, feature-bit negotiation, key archive format).
>
> **Pulled in from v7–v9**: six cheap, load-bearing fixes that close
> real v0.9.0-era bugs (not future-scale concerns):
>
> 1. `aborted` outbox status + audit columns (operator recovery without
> destroying audit trail) — v7 §4.5.2
> 2. `BEGIN IMMEDIATE` for daemon-local SQLite serialization (v6's
> `SELECT FOR UPDATE` is invalid SQLite anyway) — v7 §4.5.1
> 3. Daemon-local IPC duplicate lookup table over outbox states ×
> fingerprint match/mismatch — v8 §4.5.1
> 4. Phase B1/B2/B3 broker validation split (the concept; we don't need
> the elaborate phase tables) — v7 §4.6.2
> 5. Side-effect inventory (in-tx vs async) as an implementation comment
> block — v8 §4.7.1
> 6. Two-layer ID model wording: daemon-consumed iff outbox row,
> broker-consumed iff dedupe row — v9 §4.1
>
> **Deferred to broker-hardening followups** (see
> `2026-05-03-daemon-spec-broker-hardening-followups.md` for the full list and
> rationale): B0 dedupe fast-path, Lua-scripted idempotent rate
> limiter, in-tx mention_index, 4011/4012 close-code split, per-OS
> fingerprint precedence table, request-fingerprint schema-v2 in
> feature negotiation. These are real improvements but not v0.9.0
> blockers; they land as the broker matures.
>
> **Intent §0 unchanged from v2.**
---
## 0. Intent — unchanged, see v2 §0
---
## 1. Process model — unchanged from v3 §1 / v2 §1
---
## 2. Identity — unchanged from v5 §2
---
## 3. IPC surface — unchanged from v4 §3
---
## 4. Delivery contract — at-least-once with **request-fingerprinted** dedupe
Codex r5: dedupe must compare the *whole request shape*, not just
`(mesh, client_message_id)`. Otherwise a caller who reuses an idempotency
key with a different destination or body silently drops the new send and
gets the old send's metadata back.
### 4.1 The contract (precise)
> **Two-layer ID rule** (from v9): a `client_message_id` is
> **daemon-consumed** iff an outbox row exists for it; **broker-consumed**
> iff a dedupe row exists in `mesh.client_message_dedupe`. The two layers
> are independent: a daemon-consumed id may or may not be broker-consumed
> (depending on whether the send reached broker commit). In v0.9.0 there
> are no daemon-bypass clients, so for practical purposes "daemon-consumed"
> is the operative rule.
>
> **Local guarantee**: each successful `POST /v1/send` returns a stable
> `client_message_id`. The send is durably persisted to `outbox.db` before
> the response returns. The daemon enforces request-fingerprint
> idempotency at the IPC layer (§4.5).
>
> **Local audit guarantee**: a `client_message_id` once written to
> `outbox.db` is never released. Operator recovery via `requeue` always
> mints a fresh id; the old row stays in `aborted` for audit. There is
> no daemon-side path to free a used id.
>
> **Broker guarantee**: the broker maintains a dedupe record per accepted
> `(mesh_id, client_message_id)` in `mesh.client_message_dedupe`. Each
> dedupe record carries a canonical `request_fingerprint`. Retries with
> the same id AND matching fingerprint collapse to the original
> `broker_message_id`. Retries with mismatched fingerprint return
> `409 idempotency_key_reused` and do **not** create a new message.
>
> **Atomicity guarantee**: dedupe row insertion, message row insertion,
> and history row insertion happen in one broker DB transaction. Either
> all land, or none do. No orphan dedupe rows.
>
> **End-to-end guarantee**: at-least-once delivery, with
> `client_message_id` propagated to receivers' inboxes.
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
### 4.3 Broker schema — request fingerprint added (v6)
```sql
CREATE TABLE mesh.client_message_dedupe (
mesh_id UUID NOT NULL REFERENCES mesh.mesh(id) ON DELETE CASCADE,
client_message_id TEXT NOT NULL,
-- The original accepted message; FK NOT enforced because the message row
-- may be GC'd by retention sweeps before the dedupe row expires.
broker_message_id UUID NOT NULL,
-- Canonical fingerprint of the original request. Recomputed on every
-- duplicate retry; mismatch → 409 idempotency_key_reused. Schema in §4.4.
request_fingerprint BYTEA NOT NULL, -- 32-byte sha256
destination_kind TEXT NOT NULL CHECK(destination_kind IN ('topic','dm','queue')),
destination_ref TEXT NOT NULL,
first_seen_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
expires_at TIMESTAMPTZ, -- NULL = `permanent` mode
history_available BOOLEAN NOT NULL DEFAULT TRUE, -- flipped FALSE when message row GC'd
PRIMARY KEY (mesh_id, client_message_id)
);
CREATE INDEX client_message_dedupe_expires_idx
ON mesh.client_message_dedupe(expires_at)
WHERE expires_at IS NOT NULL;
ALTER TABLE mesh.topic_message ADD COLUMN client_message_id TEXT;
ALTER TABLE mesh.message_queue ADD COLUMN client_message_id TEXT;
```
**`status` column dropped (codex r5)**. Rejected requests do **not**
consume idempotency keys. Rationale below in §4.6.
### 4.4 Request fingerprint — canonical form (NEW v6)
The fingerprint covers everything that makes a send semantically distinct.
A retry must reproduce the same fingerprint bit-for-bit; anything else is
a different send and must not be collapsed.
```
request_fingerprint = sha256(
envelope_version || 0x00 ||
destination_kind || 0x00 ||
destination_ref || 0x00 ||
reply_to_id_or_empty || 0x00 ||
priority || 0x00 ||
meta_canonical_json || 0x00 ||
body_hash
)
```
Where:
- `envelope_version`: integer string (e.g. `"1"`). Bumps when the envelope
shape changes.
- `destination_kind`: `topic`, `dm`, or `queue`.
- `destination_ref`: topic name, recipient ed25519 pubkey hex, or queue id.
- `reply_to_id_or_empty`: original `broker_message_id` or empty string.
- `priority`: `now`, `next`, or `low`.
- `meta_canonical_json`: the `meta` field, serialized with sorted keys,
no whitespace, escape-canonical (RFC 8785 JCS). Empty meta = empty string.
- `body_hash`: sha256(body bytes), hex.
The fingerprint is computed:
1. **Daemon-side** before durable outbox persistence — stored as
`outbox.request_fingerprint` (NEW column) so retries always produce
the same fingerprint regardless of caller behavior.
2. **Broker-side** on first receipt — stored in
`client_message_dedupe.request_fingerprint`.
3. **Broker-side** on every duplicate retry — recomputed and compared
byte-equal to the stored value.
If the daemon and broker disagree on the canonical form (e.g. JCS
implementation drift), the broker emits
`cm_broker_dedupe_fingerprint_mismatch_total{client_id, mesh_id}` and
returns `409 idempotency_key_reused` with a body that includes the
broker's fingerprint hex for debugging. Daemons that see this should
log it loudly and stop retrying that outbox row (it goes to `dead`).
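The canonical form in §4.4 can be sketched in TypeScript as follows. This is an illustration under stated assumptions: the `SendRequest` shape and both function names are hypothetical, text fields are assumed to be UTF-8 encoded before hashing, and `canonicalJson` is a compact stand-in for a vetted RFC 8785 library rather than a conformance-tested implementation.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Sorted keys, no whitespace. Primitives go through JSON.stringify.
// A sketch only; a vetted JCS library is preferable in production.
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map((v) => canonicalJson(v)).join(",") + "]";
  const entries = Object.entries(value).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return "{" + entries.map(([k, v]) => JSON.stringify(k) + ":" + canonicalJson(v)).join(",") + "}";
}

// Hypothetical request shape; field names are illustrative.
interface SendRequest {
  envelopeVersion: number;
  destinationKind: "topic" | "dm" | "queue";
  destinationRef: string;
  replyToId?: string;
  priority: "now" | "next" | "low";
  meta?: { [key: string]: Json };
  body: Uint8Array;
}

function requestFingerprint(req: SendRequest): Buffer {
  // Empty meta serializes as the empty string, per §4.4.
  const meta = req.meta && Object.keys(req.meta).length > 0 ? canonicalJson(req.meta) : "";
  const bodyHash = createHash("sha256").update(req.body).digest("hex");
  const fields = [
    String(req.envelopeVersion),
    req.destinationKind,
    req.destinationRef,
    req.replyToId ?? "",
    req.priority,
    meta,
    bodyHash,
  ];
  const hash = createHash("sha256");
  fields.forEach((field, i) => {
    if (i > 0) hash.update(Buffer.from([0x00])); // 0x00 separator between fields
    hash.update(field, "utf8");
  });
  return hash.digest(); // 32 bytes
}
```

Two requests that differ only in `meta` key order produce the same fingerprint, while any change to destination, priority, or body produces a different one.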
### 4.5 Daemon-local idempotency at the IPC layer (from v8)
The daemon enforces fingerprint idempotency **before** the request hits
`outbox.db` so a caller bug never creates duplicate-key/mismatch-payload
state at all.
#### 4.5.1 IPC accept algorithm
On `POST /v1/send`:
1. Validate request envelope (auth, schema, size limits, destination
resolvable). Failures here return `4xx` immediately. **No outbox
row is written; the `client_message_id` is not consumed.**
2. Compute `request_fingerprint` (§4.4).
3. Open a SQLite transaction with `BEGIN IMMEDIATE` so a concurrent IPC
accept on the same id serializes against this one. `BEGIN IMMEDIATE`
acquires the RESERVED lock at transaction start; SQLite has no
row-level lock and `SELECT FOR UPDATE` is not supported.
4. `SELECT id, request_fingerprint, status, broker_message_id,
last_error FROM outbox WHERE client_message_id = ?`.
5. Apply the lookup table below. For the "(no row)" case, INSERT inside
the same transaction.
6. COMMIT.
| Existing row state | Fingerprint | Daemon response |
|---|---|---|
| (no row) | — | INSERT new outbox row `pending`; return `202 accepted, queued` |
| `pending` | match | Return `202 accepted, queued`. No mutation |
| `pending` | mismatch | Return `409`, `conflict: "outbox_pending_fingerprint_mismatch"` |
| `inflight` | match | Return `202 accepted, inflight`. No mutation |
| `inflight` | mismatch | Return `409`, `conflict: "outbox_inflight_fingerprint_mismatch"` |
| `done` | match | Return `200 ok, duplicate: true, broker_message_id, history_id`. No broker call |
| `done` | mismatch | Return `409`, `conflict: "outbox_done_fingerprint_mismatch", broker_message_id` |
| `dead` | match | Return `409`, `conflict: "outbox_dead_fingerprint_match", reason: "<last_error>"` |
| `dead` | mismatch | Return `409`, `conflict: "outbox_dead_fingerprint_mismatch"` |
| `aborted` | match | Return `409`, `conflict: "outbox_aborted_fingerprint_match"`. Operator-retired id, never reusable |
| `aborted` | mismatch | Return `409`, `conflict: "outbox_aborted_fingerprint_mismatch"` |
Every `409` carries the daemon's `request_fingerprint` (8-byte hex
prefix) for client/server canonical-form-drift debugging. A
`client_message_id` written to `outbox.db` is permanently bound to that
row's lifecycle — the only "free" state is "no row exists".
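The lookup table above reduces to a small pure function. The sketch below assumes a hypothetical `OutboxRow` view and `IpcDecision` result shape (neither is defined by the spec); only the status names, HTTP codes, and `conflict` strings come from the table.

```typescript
type OutboxStatus = "pending" | "inflight" | "done" | "dead" | "aborted";

// Hypothetical projection of the row selected in step 4.
interface OutboxRow {
  status: OutboxStatus;
  fingerprintHex: string;
  brokerMessageId: string | null;
  lastError: string | null;
}

// Hypothetical result shape; `detail` is "queued", "inflight",
// "duplicate", or one of the conflict codes from the table.
interface IpcDecision {
  http: 200 | 202 | 409;
  insert: boolean;
  detail: string;
  brokerMessageId: string | null;
  reason: string | null;
}

function decideIpcAccept(row: OutboxRow | undefined, fingerprintHex: string): IpcDecision {
  const base = { insert: false, brokerMessageId: null, reason: null };
  // No row: the id is free, so INSERT a pending row in the same transaction.
  if (!row) return { ...base, http: 202, insert: true, detail: "queued" };

  if (row.fingerprintHex !== fingerprintHex) {
    return {
      ...base,
      http: 409,
      detail: `outbox_${row.status}_fingerprint_mismatch`,
      // Only the `done` mismatch row carries broker_message_id in the table.
      brokerMessageId: row.status === "done" ? row.brokerMessageId : null,
    };
  }

  // Fingerprint matches: the response depends on the row state alone.
  if (row.status === "pending") return { ...base, http: 202, detail: "queued" };
  if (row.status === "inflight") return { ...base, http: 202, detail: "inflight" };
  if (row.status === "done") {
    return { ...base, http: 200, detail: "duplicate", brokerMessageId: row.brokerMessageId };
  }
  if (row.status === "dead") {
    return { ...base, http: 409, detail: "outbox_dead_fingerprint_match", reason: row.lastError };
  }
  return { ...base, http: 409, detail: "outbox_aborted_fingerprint_match" };
}
```

Keeping the decision free of I/O means it can be unit-tested against every table row without opening a SQLite transaction.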
#### 4.5.2 Outbox table
```sql
CREATE TABLE outbox (
id TEXT PRIMARY KEY,
client_message_id TEXT NOT NULL UNIQUE,
request_fingerprint BLOB NOT NULL, -- 32 bytes
payload BLOB NOT NULL,
enqueued_at INTEGER NOT NULL,
attempts INTEGER DEFAULT 0,
next_attempt_at INTEGER NOT NULL,
status TEXT CHECK(status IN
('pending','inflight','done','dead','aborted')),
last_error TEXT,
delivered_at INTEGER,
broker_message_id TEXT,
aborted_at INTEGER, -- v7
aborted_by TEXT, -- v7: operator/auto
superseded_by TEXT -- v7: id of requeue successor
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
CREATE INDEX outbox_aborted ON outbox(status, aborted_at) WHERE status = 'aborted';
```
`aborted_at` / `aborted_by` / `superseded_by` give operators a clear
audit trail. `superseded_by` lets `outbox inspect` show the chain when
a row is requeued multiple times. `request_fingerprint` is computed
once at IPC accept time and frozen for the row's lifecycle.
#### 4.5.3 Operator recovery via `requeue`
```
claudemesh daemon outbox requeue --id <outbox_row_id>
[--new-client-id <id> | --auto]
[--patch-payload <path>]
```
Atomically (single SQLite transaction):
1. Marks the existing row `aborted`, sets `aborted_at = now`,
`aborted_by = "operator"`. Row is **never deleted** — audit trail
permanent.
2. Mints a fresh `client_message_id` (caller-supplied or auto-ulid).
3. Inserts a new outbox row `pending` with the fresh id and the same
payload (or patched if `--patch-payload`).
4. Sets `superseded_by = <new_row_id>` on the old row.
The old `client_message_id` is permanently dead. There is no path for
an id to become free again.
### 4.5b Broker duplicate response — three cases
| Case | HTTP/WS code | Body |
|---|---|---|
| First insert | `201 created` | `{ broker_message_id, client_message_id, history_id, duplicate: false }` |
| Duplicate, fingerprint match | `200 ok` | `{ broker_message_id, client_message_id, history_id, duplicate: true, history_available, first_seen_at }` |
| Duplicate, fingerprint mismatch | `409 idempotency_key_reused` | `{ client_message_id, conflict: "request_fingerprint_mismatch", broker_fingerprint_prefix: "ab12cd34..." }` (first 8 bytes hex) |
Daemon outcomes:
- `201` → mark outbox row `done`, store `broker_message_id`.
- `200 duplicate` with `history_available: true` → mark `done`, log INFO.
- `200 duplicate` with `history_available: false` → mark `done`, log WARN.
- `409 idempotency_key_reused` → mark outbox row `dead`. Operator runs
`outbox requeue` (§4.5.3); old id stays `aborted`, new id is fresh.
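The daemon outcomes listed above can be expressed as a mapping from broker reply to outbox transition. This is a sketch: the `BrokerReply` and `OutboxTransition` types are hypothetical, and the `"error"` log level for the `409` case is an assumption based on the "log it loudly" wording in §4.4.

```typescript
// Hypothetical, trimmed view of the three broker responses in §4.5b.
type BrokerReply =
  | { code: 201; brokerMessageId: string }
  | { code: 200; brokerMessageId: string; historyAvailable: boolean }
  | { code: 409 };

// Hypothetical description of what the daemon does to the outbox row.
interface OutboxTransition {
  status: "done" | "dead";
  brokerMessageId: string | null;
  logLevel: "info" | "warn" | "error" | null;
}

function applyBrokerReply(reply: BrokerReply): OutboxTransition {
  if (reply.code === 201) {
    // First insert: record the broker id, nothing to log.
    return { status: "done", brokerMessageId: reply.brokerMessageId, logLevel: null };
  }
  if (reply.code === 200) {
    // Duplicate with matching fingerprint: done either way,
    // WARN when the original message row was already pruned.
    return {
      status: "done",
      brokerMessageId: reply.brokerMessageId,
      logLevel: reply.historyAvailable ? "info" : "warn",
    };
  }
  // 409 idempotency_key_reused: stop retrying; the operator must requeue.
  return { status: "dead", brokerMessageId: null, logLevel: "error" };
}
```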
### 4.6 Rejected-request semantics — id consumed iff outbox row written
> **Rule**: a `client_message_id` is daemon-consumed iff the daemon
> writes an outbox row. Anything that fails before outbox insertion
> (auth, schema, size, destination not resolvable) leaves the id
> untouched and freely reusable.
#### 4.6.1 Daemon-side rejection phasing
| Phase | When daemon rejects | Outbox row? | Caller may reuse id? |
|---|---|---|---|
| **A. IPC validation** (auth, schema, size, destination resolvable) | Before §4.5.1 step 3 | No | Yes — id never consumed |
| **B. Outbox stored, broker network/transient failure** | After IPC accept, broker `5xx` or timeout | `pending` → retried | N/A — daemon owns retries |
| **C. Outbox stored, broker permanent rejection** | Broker returns `4xx` after IPC accept | `dead` | No — rotate via `requeue` |
| **D. Operator retirement** | Operator runs `requeue` on `dead` or `pending` row | `aborted` (audit) + new row with fresh id | Old id NEVER reusable; new id is fresh |
#### 4.6.2 Broker-side rejection phasing (B1 / B2 / B3)
The broker validates in three phases relative to dedupe-row insertion:
| Phase | Validation | Side effects | Result for direct broker callers (none in v0.9.0) |
|---|---|---|---|
| **B1. Pre-dedupe-claim** | Auth (mesh membership), schema, size, mesh exists, member exists, destination kind valid, payload bytes ≤ `max_payload.inline_bytes`, rate limit not exceeded | None | `4xx`. No dedupe row. Direct broker caller may retry with same id |
| **B2. Post-dedupe-claim** (in-tx) | destination_ref existence (topic exists, member subscribed, etc.) | INSERT into dedupe rolled back | `4xx`, transaction rolled back, no dedupe row remains. Direct broker caller may retry with same id |
| **B3. Accepted** | All side effects commit atomically | Dedupe row, message row, history row, delivery_queue rows | `201` with `broker_message_id` |
**Daemon-mediated callers (the only path in v0.9.0)** see only the
daemon-layer rules of §4.6.1: any broker `4xx` after IPC accept lands
the outbox row in `dead`. Daemon-mediated callers MUST rotate via
`requeue` (§4.5.3); the daemon-consumed id is never reusable
regardless of whether the broker layer sees a dedupe row. The "may
retry with same id" wording above describes broker-bypass callers
only, which v0.9.0 does not have.
**Critical guarantee**: there is no broker code path where a permanent
4xx leaves a dedupe row behind. Either the request committed and a
dedupe row exists (B3), or it didn't and no dedupe row exists (B1, B2).
"Dedupe row exists" is the unambiguous signal of "id consumed at the
broker layer."
If the broker decides post-commit that an accepted message is invalid
(async content-policy job), that's NOT a permanent rejection — it's a
follow-up moderation event that operates on the `broker_message_id`,
not on the dedupe key.
Net result: `client_message_dedupe` rows only exist when the broker
**successfully** accepted a message and committed it. The single source
of truth for "was this idempotency key consumed?" is the existence of
the dedupe row. No status enum, no ambiguous states.
### 4.7 Broker atomicity contract
#### 4.7.1 Side-effect inventory
Every successful broker accept atomically commits these durable state
changes in **one transaction**:
| Effect | Table | Why in-tx |
|---|---|---|
| Dedupe record | `mesh.client_message_dedupe` | Idempotency authority |
| Message body | `mesh.topic_message` / `mesh.message_queue` | Authoritative store |
| History row | `mesh.message_history` | Replay log; lost-on-rollback breaks ordered replay |
| Fan-out work | `mesh.delivery_queue` | Each recipient must see exactly committed messages |
**Outside the transaction** (non-authoritative or rebuildable):
- WS push to live subscribers — best-effort live notifications.
- Webhook fan-out — async via `delivery_queue` workers.
- Rate-limit counters — telemetry only; authority is the external
limiter checked in B1.
- Audit log entries — append-only stream; rebuildable from history.
- Search/FTS index updates — async via outbox-pattern worker.
- Mention index updates — async (deferred in-tx promotion to followups
doc).
- Metrics — Prometheus, pull-based.
If any in-transaction insert fails, the transaction rolls back
completely. The broker returns `5xx` to the daemon, and the daemon
retries. No partial state.
#### 4.7.2 Pseudocode
```sql
-- Pre-generate broker_message_id (ulid) in code, pass in.
BEGIN;
-- Step 1: try to claim the idempotency key.
INSERT INTO mesh.client_message_dedupe
(mesh_id, client_message_id, broker_message_id, request_fingerprint,
destination_kind, destination_ref, expires_at)
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
$dest_kind, $dest_ref, $expires_at)
ON CONFLICT (mesh_id, client_message_id) DO NOTHING;
-- Step 2: inspect what's actually there now (ours or someone else's).
SELECT broker_message_id, request_fingerprint, destination_kind,
destination_ref, history_available, first_seen_at
FROM mesh.client_message_dedupe
WHERE mesh_id = $mesh_id AND client_message_id = $client_id
FOR SHARE;
-- Branch:
-- row.broker_message_id == $msg_id → first insert; continue.
-- row.broker_message_id != $msg_id → duplicate. Compare fingerprints:
-- match → ROLLBACK; return 200 duplicate.
-- mismatch → ROLLBACK; return 409 idempotency_key_reused.
-- Step 3: validate Phase B2 (destination_ref existence — topic exists,
-- member subscribed, etc.). If B2 fails → ROLLBACK; return 4xx (no
-- dedupe row remains).
-- Step 4: insert in-tx side effects (§4.7.1).
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
VALUES ($msg_id, $mesh_id, $client_id, ...);
INSERT INTO mesh.message_history (broker_message_id, mesh_id, ...)
VALUES ($msg_id, $mesh_id, ...);
INSERT INTO mesh.delivery_queue (broker_message_id, recipient_pubkey, ...)
SELECT $msg_id, member_pubkey, ...
FROM mesh.topic_subscription
WHERE topic = $dest_ref AND mesh_id = $mesh_id;
COMMIT;
```
The branch logic determines the response shape (`201` / `200 duplicate`
/ `409 idempotency_key_reused`) before COMMIT. The duplicate and 409
branches always ROLLBACK because nothing else needs to commit.
`SELECT … FOR SHARE` blocks concurrent writers from updating or deleting
the same dedupe row mid-transaction.
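The branch after step 2 of the pseudocode can be isolated as a pure classification over the row the `SELECT` returned. The names `DedupeRow`, `ClaimOutcome`, and `classifyClaim` are hypothetical; the sketch assumes fingerprints arrive as raw 32-byte arrays.

```typescript
// Hypothetical projection of the row returned by the step-2 SELECT.
interface DedupeRow {
  brokerMessageId: string;
  fingerprint: Uint8Array;
}

type ClaimOutcome =
  | "first_insert" // our INSERT won; continue to B2 validation and side effects
  | "duplicate_match" // ROLLBACK, respond 200 duplicate
  | "key_reused"; // ROLLBACK, respond 409 idempotency_key_reused

// Byte-equal comparison, as §4.4 requires for duplicate retries.
function bytesEqual(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}

function classifyClaim(
  row: DedupeRow,
  ourMessageId: string,
  ourFingerprint: Uint8Array,
): ClaimOutcome {
  // If the stored broker_message_id is the one we pre-generated,
  // the ON CONFLICT DO NOTHING insert actually inserted our row.
  if (row.brokerMessageId === ourMessageId) return "first_insert";
  return bytesEqual(row.fingerprint, ourFingerprint) ? "duplicate_match" : "key_reused";
}
```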
#### 4.7.3 Failure modes
- Crash before `COMMIT`: all rows roll back. Next daemon retry inserts
cleanly.
- Crash after `COMMIT` but before WS ACK: dedupe row exists. Daemon
retries → fingerprint matches → `200 duplicate`. Net: exactly one
broker-accepted row, one daemon `done` transition.
- Constraint violation on message row insert: rolls back the whole tx.
`5xx` to daemon. Same fingerprint reproduces; daemon eventually
marks `dead`. No orphan dedupe row.
A nightly consistency check, counted by `cm_broker_dedupe_orphan_check_total`,
validates that every `client_message_dedupe` row has a matching
`topic_message` / `message_queue` row OR the matching row has been
retention-pruned (`history_available = FALSE`). Inconsistencies are logged
as `cm_broker_dedupe_orphan_found{mesh_id}` for human review.
### 4.8 Outbox schema
The authoritative outbox schema for v0.9.0 is in §4.5.2 (includes
`aborted` status and audit columns from the v7 pull). `request_fingerprint`
is computed at IPC accept time and frozen for the row's lifecycle —
the daemon never recomputes from `payload` post-enqueue (would produce
drift if envelope_version changes between daemon runs).
### 4.9 Outbox max-age math — bounded (v6)
Codex r5: the v5 formula `(dedupe_retention_days * 24) - 24h_margin`
breaks at `dedupe_retention_days = 1` (yields zero) and goes negative
below that.
v6 formula and bounds:
- **Minimum supported broker dedupe retention**: 3 days. Daemon refuses
to start if broker advertises `dedupe_retention_days < 3` (treats it
as `feature_param_invalid`, exits 4010).
- **Daemon `max_age_hours` derivation**:
- `permanent` mode → daemon uses config default (168h = 7d), cap 720h
(30d).
- `retention_scoped` mode → daemon `max_age_hours = max(72,
(dedupe_retention_days * 24) - safety_margin_hours)` where
`safety_margin_hours = max(24, ceil(dedupe_retention_days * 0.1 *
24))`. For `dedupe_retention_days=3` this gives
`max(72, 72-24) = 72h`. For 30 days: `max(72, 720-72) = 648h`. For
365 days: `max(72, 8760-876) = 7884h`.
- The 72h floor prevents the daemon outbox from being uselessly short
— three days is enough margin for normal operator response to a
paged outage.
- Operator override allowed via `[outbox] max_age_hours_override = N`,
but if `N` exceeds `dedupe_retention_days * 24 - 1` daemon refuses to
start with `outbox_max_age_above_dedupe_window`. The override exists
for the rare case of a much-shorter-than-default outbox; it does not
exist to circumvent the broker's dedupe window.
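The derivation above can be checked with a short worked sketch. The function names are hypothetical, the margin uses integer arithmetic (`days * 24 / 10`, equivalent to `days * 0.1 * 24`) to sidestep floating-point rounding, and the override check is assumed to apply to `retention_scoped` mode only, since the spec does not say how it interacts with `permanent`.

```typescript
type DedupeMode = "permanent" | "retention_scoped";

// safety_margin_hours = max(24, ceil(dedupe_retention_days * 0.1 * 24))
function safetyMarginHours(retentionDays: number): number {
  return Math.max(24, Math.ceil((retentionDays * 24) / 10));
}

function deriveMaxAgeHours(
  mode: DedupeMode,
  retentionDays: number,
  configDefaultHours: number = 168,
): number {
  // permanent mode: config default (168h), capped at 720h (30d).
  if (mode === "permanent") return Math.min(configDefaultHours, 720);
  // Below the 3-day minimum the daemon refuses to start.
  if (retentionDays < 3) throw new Error("feature_param_invalid");
  // retention_scoped mode: stay inside the dedupe window, floor at 72h.
  return Math.max(72, retentionDays * 24 - safetyMarginHours(retentionDays));
}

// Operator override must stay strictly inside the broker's dedupe window.
function checkOverride(overrideHours: number, retentionDays: number): void {
  if (overrideHours > retentionDays * 24 - 1) {
    throw new Error("outbox_max_age_above_dedupe_window");
  }
}
```

For 3, 30, and 365 retention days this reproduces the 72h, 648h, and 7884h figures quoted in the list.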
### 4.10 Inbox schema — unchanged from v3 §4.5
### 4.11 Crash recovery — unchanged from v3 §4.6
### 4.12 Failure modes — corrected for fingerprint model (v6)
- **Fingerprint mismatch on retry** (`409 idempotency_key_reused`): outbox
row marked `dead`. Surfaced in `--failed` view. Operator command
`outbox requeue --id <outbox_row_id> --new-client-id <id>` (§4.5.3) rotates
`client_message_id` and retries.
- **Daemon retry after dedupe row hard-deleted by retention sweep**: in
`retention_scoped` mode, daemon `max_age_hours` is bounded inside the
retention window (§4.9), so this can only happen via operator override.
In that case the retry creates a NEW dedupe row + new message — the
caller chose this risk explicitly. Counter
`cm_daemon_retry_after_dedupe_expired_total`.
- **Daemon retry after dedupe row hard-deleted in `permanent` mode**:
cannot happen by definition — `permanent` means no `expires_at`. Only
mesh deletion removes dedupe rows.
- **Duplicate row, history pruned**: as v5 §4.4. Mark `done`, log
`cm_daemon_dedupe_history_pruned_total`.
---
## 5. Inbound — unchanged from v3 §5
---
## 6. Hooks — unchanged from v4 §6
---
## 7-13. Multi-mesh, auto-routing, service install, observability, SDKs, security model, configuration — unchanged from v4
---
## 14. Lifecycle — unchanged from v5 §14
---
## 15. Version compat — feature param updated for new dedupe semantics
### 15.1 Feature bits with parameters (v6 update)
| Bit | `params.version` | Required parameters | Optional parameters |
|---|---|---|---|
| `client_message_id_dedupe` | `1` | `mode: "retention_scoped"\|"permanent"`, `dedupe_retention_days: int (>= 3)` (when mode=retention_scoped), `request_fingerprint: bool == true` | `tombstone_history_pruned_window_days: int` |
| `concurrent_connection_policy` | `1` | (no parameters) | `default_policy: "prefer_newest"\|"prefer_oldest"\|"allow_concurrent"` |
| `member_keypair_rotated_event` | `1` | (no parameters) | — |
| `key_epoch` | `1` | `max_concurrent_epochs: int (>= 1)` | — |
| `max_payload` | `1` | `inline_bytes: int (>= 1024)`, `blob_bytes: int (>= 1024)` | — |
`client_message_id_dedupe` ships at `params.version = 1` with
`request_fingerprint: bool == true` as a required parameter. A broker
that doesn't advertise the feature, or advertises it without
`request_fingerprint: true`, is treated as "feature missing" and the
daemon refuses to start. That's intentional — v0.9.0 daemons require
fingerprint enforcement for safe idempotency.
The schema-version-2 evolution (parameters that need versioning) is
deferred (see followups doc).
`dedupe_retention_days` minimum is 3 (matches the §4.9 floor).
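
As a rough sketch of how a daemon might classify an advertised `client_message_id_dedupe` entry against the rules above: the function name, the input shape, and the order of checks are assumptions, and the mapping to kinds follows the §15.5 table.

```typescript
// Hypothetical validator for the advertised feature params (not the real daemon API).
type NegotiationFailure =
  | "feature_unavailable"
  | "feature_param_invalid"
  | "feature_param_below_floor";

interface AdvertisedDedupeParams {
  version?: unknown;
  mode?: unknown;
  dedupe_retention_days?: unknown;
  request_fingerprint?: unknown;
}

// Returns null when the advertisement is acceptable, otherwise the failure kind.
function checkDedupeFeature(
  params: AdvertisedDedupeParams | undefined,
): NegotiationFailure | null {
  // Not advertised, or advertised without fingerprint enforcement: "feature missing".
  if (!params || params.request_fingerprint !== true) return "feature_unavailable";
  if (params.version !== 1) return "feature_param_invalid";
  if (params.mode === "permanent") return null;
  if (params.mode !== "retention_scoped") return "feature_param_invalid";

  // retention_scoped requires an integer retention of at least 3 days.
  const days = params.dedupe_retention_days;
  if (typeof days !== "number" || !Number.isInteger(days)) {
    return "feature_param_invalid";
  }
  return days < 3 ? "feature_param_below_floor" : null;
}
```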
### 15.2 Negotiation handshake — unchanged shape from v5 §15.2
### 15.3 IPC negotiation — unchanged from v3 §15.3
### 15.4 Compatibility matrix — unchanged from v3 §15.4
### 15.5 Diagnostic close code (v0.9.0)
v0.9.0 ships a single WebSocket close code with a structured
`close_reason` JSON payload that distinguishes the underlying cause:
| Code | Reason | `close_reason.kind` values |
|---|---|---|
| `4010` | `feature_unavailable` | `feature_unavailable` (feature missing from broker's `supported`) · `feature_param_invalid` (params fail validation: missing required, out of bounds, unknown version) · `feature_param_below_floor` (param below daemon's hard floor, e.g. `dedupe_retention_days < 3`) |
`close_reason` payload shape:
```json
{
  "kind": "feature_unavailable" | "feature_param_invalid" | "feature_param_below_floor",
  "feature": "client_message_id_dedupe",
  "detail": "..."
}
```
Daemon logs the full negotiation payload at WARN before exiting;
supervisor + alerting catches the restart loop. The split into
4011/4012 codes is deferred (see followups doc).
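
On the receiving side, the payload can be decoded defensively before logging. The sketch below assumes the flat §15.5 shape arrives as the WebSocket close reason string; `parseCloseReason` and its types are illustrative names, not part of the spec.

```typescript
// Hypothetical parser for the 4010 close_reason payload (flat §15.5 shape).
const CLOSE_KINDS = [
  "feature_unavailable",
  "feature_param_invalid",
  "feature_param_below_floor",
] as const;
type CloseKind = (typeof CLOSE_KINDS)[number];

interface CloseReason {
  kind: CloseKind;
  feature: string;
  detail: string;
}

// Returns null for other close codes or for payloads that do not match the shape.
function parseCloseReason(code: number, reason: string): CloseReason | null {
  if (code !== 4010) return null;
  let raw: unknown;
  try {
    raw = JSON.parse(reason);
  } catch {
    return null;
  }
  if (typeof raw !== "object" || raw === null) return null;
  const fields = raw as Record<string, unknown>;
  const kind = CLOSE_KINDS.find((k) => k === fields.kind);
  if (!kind || typeof fields.feature !== "string") return null;
  return {
    kind,
    feature: fields.feature,
    detail: typeof fields.detail === "string" ? fields.detail : "",
  };
}
```

One practical constraint worth keeping in mind: a WebSocket close reason is limited to 123 bytes of UTF-8, so `detail` has to stay short if the JSON travels in the close frame itself.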
---
## 16. Threat model — unchanged from v4 §16
---
## 17. Migration — broker dedupe table + atomicity (v6)
Broker side, deploy order:
1. `CREATE TABLE mesh.client_message_dedupe` with v6 schema (additive,
online-safe).
2. `ALTER TABLE mesh.topic_message ADD COLUMN client_message_id`.
3. `ALTER TABLE mesh.message_queue ADD COLUMN client_message_id`.
4. Broker code refactor: every accept path wraps dedupe insert + message
insert in **one transaction** (§4.7). Pre-generated
`broker_message_id` (ulid in code) passed in.
5. Broker code: nightly job to delete dedupe rows where `expires_at <
NOW()` (skip in `permanent` mode).
6. Broker code: hook into the message-retention sweep — when a
`topic_message` or `message_queue` row is hard-deleted, find the
matching dedupe row by `client_message_id` and set `history_available
= FALSE`. (Note: `client_message_id` is nullable on those tables for
legacy traffic; nullable rows have no dedupe row to update.)
7. Broker code: nightly orphan-check job (§4.7); alerts on non-zero.
8. Broker advertises `client_message_id_dedupe` feature with
`params.version = 1` and `request_fingerprint: true`.
9. Daemon refuses to start unless that feature bit is advertised with
valid v1 params.
Rollback plan: feature flag disables fingerprint enforcement broker-side
(falls back to existing pre-v6 behavior — no dedupe). Daemons that
require fingerprint refuse to start. Operator switches off the feature
flag, reverts the daemon, restarts. No data loss; pending dedupe rows
remain in place for the next forward roll.
---
## v0.9.0 lock — what's in vs deferred
**In** (this document): everything codex r1-r4 ratified plus the six
sweet-spot pulls from v7-v9 enumerated at the top: `aborted` outbox
status, `BEGIN IMMEDIATE`, IPC duplicate lookup table, B1/B2/B3 phasing
concept, side-effect inventory, two-layer ID model.
**Deferred** (see `2026-05-03-daemon-spec-broker-hardening-followups.md`):
- B0 dedupe fast-path before rate-limit (v10).
- Lua-scripted idempotent rate limiter keyed by
`(mesh, client_id, window)` (v10).
- In-tx `mesh.mention_index` (v8).
- 4011 / 4012 close-code split (v6 §15.5 — collapsed to 4010 with
structured reason JSON for v0.9.0).
- Per-OS fingerprint precedence elaborate table (v8 §2.2.1).
- `request_fingerprint` schema-version-2 in feature negotiation (v6
§15.1 ships at version 1 with `request_fingerprint: bool`).
- Force-expiry / quarantine semantics for `keypair-archive.json`
(v8 §14.1.1).
These deferrals are real improvements but not v0.9.0 blockers. They
land as the broker matures and we have actual scale-load to optimize
against.
---
## Cross-spec note: §15.5 close-code collapse
For v0.9.0 we ship a single `4010 feature_unavailable` close code with
a structured `close_reason` JSON payload that distinguishes the
underlying cause:
```json
{
  "close_reason": {
    "kind": "feature_unavailable" | "feature_param_invalid" | "feature_param_below_floor",
    "feature": "client_message_id_dedupe",
    "detail": "..."
  }
}
```
The 4011/4012 split is deferred to followups.
---
## NON-NORMATIVE: round-6 review trailer (preserved for audit only)
> **Not part of the v0.9.0 contract.** Preserved verbatim from the
> v6 source spec as a record of the open questions at the time of the
> codex round-6 review. Items below have either been resolved in this
> merged document, deferred to the followups doc, or superseded.
> Do NOT use this section as a checklist for implementation.
1. **Request fingerprint canonical form (§4.4)** — does JCS work
cross-language for `meta_canonical_json` (Python json.dumps,
Go encoding/json, JS JSON.stringify all behave differently)? Should
we ship a vetted JCS lib in each SDK or fall back to a simpler
"sorted keys + no spaces + escape-as-stored" rule with conformance
tests?
2. **Atomicity contract (§4.7)** — is the orphan-check sufficient, or
does a violation mean we need a "broker rebuild dedupe from messages"
recovery tool? The latter is destructive but useful for ops emergencies.
3. **Max-age formula (§4.9)** — is the 72h floor correct? Is the
percentage-based safety margin (`max(24, ceil(0.1 * dedupe_window))`)
the right shape? Or simpler to say "always 24h"?
4. **`409 idempotency_key_reused` recovery flow (§4.5)** — is sending the
row to `dead` and surfacing it via `outbox --failed` enough? Should
the daemon emit a high-priority event for the SSE stream so operators
are paged immediately?
5. **Diagnostic close codes (§15.5)** — is splitting 4010/4011/4012
useful, or does it just push complexity onto operators? Should we
collapse to 4010 with structured close-reason JSON instead?
6. **Anything else still wrong?** Read it as if you were going to
operate this for a year. What falls down?
Three options:
- **(a) v6 is shippable**: lock the spec, start coding the frozen core.
- **(b) v7 needed**: list the must-fix items.
- **(c) the architecture itself is wrong**: what would you do differently?
Be ruthless.


@@ -0,0 +1,87 @@
# Broker HA readiness — statelessness audit
Single-instance broker is the biggest GA blocker. Moving to 2+ replicas
behind a load balancer requires first understanding which state the broker
holds in-process that breaks if split across nodes.
## Current in-process state (apps/broker/src/index.ts)
| Symbol | Line | Per-node? | Survives HA? | Notes |
|--------|------|-----------|--------------|-------|
| `connections` | 147 | yes (WS state) | ✅ naturally per-node | WS connections are pinned to a node by L7 routing. Each node holds only its own connections. **OK as long as the LB uses sticky sessions or cross-node fan-out.** |
| `connectionsPerMesh` | 148 | yes | 🟡 per-node count, not global | Used for capacity cap. Global cap requires Redis. |
| `tgTokenRateLimit` | 151 | yes | 🟡 per-node | Telegram bot rate limiting; tolerable as per-node. |
| `urlWatches` | 173 | yes | 🔴 stuck on one node | If peer disconnects from node A and reconnects on B, the watch stays orphaned on A. **Needs DB/Redis, or "pin to owning node". Acceptable risk if watches are per-session ephemeral.** |
| `streamSubscriptions` | 259 | yes | 🔴 multi-node broken | Sub on A, publish on B → message never reaches A's subscribers. **Needs Redis pub/sub for HA.** |
| `meshClocks` | 270 | yes | 🔴 multi-node broken | Simulated clocks must be single-authority. Solve by pinning one node as clock leader (simple leader election) or by moving clock state to DB. |
| `mcpRegistry` | 327 | yes | 🔴 multi-node broken | MCP server catalog cached in memory. If deployed on A but called on B, B doesn't know it exists. **Must be DB-backed** (partly is already — see `mesh_service` table). Audit the cache/DB sync path. |
| `mcpCallResolvers` | 338 | yes | ✅ per-call ephemeral | In-flight callback resolvers; WS sticks to owning node so this is fine. |
| `scheduledMessages` | 359 | yes | 🔴 multi-node broken | Scheduled delivery timers live in-process. Restart loses them. Persistence exists (`scheduled_message` table) + recovery on startup, but two nodes could both fire the same timer. **Needs a leader lock or per-schedule pg_advisory_lock on fire.** |
| `sendRateLimit` | 494 | yes | 🟡 per-node | Each node enforces its own quota; a client spread across nodes could 2x the limit. Tolerable if sticky sessions hold. |
| `hookRateLimit` | 482 | yes | 🟡 per-node | Same as sendRateLimit. |
| `lastHash` | audit.ts:22 | yes | 🔴 broken on write | Two nodes writing audit rows concurrently will BOTH read the same last hash, BOTH compute a new hash, and both INSERT, so the chain forks. **Needs `SELECT FOR UPDATE` or a single audit writer.** |
## Conclusion
**Current broker is NOT HA-safe.** Six symbols break under multi-instance:
`urlWatches`, `streamSubscriptions`, `meshClocks`, `mcpRegistry` cache,
`scheduledMessages`, `lastHash`. None are unsolvable, but none are
trivial.
## Rollout plan for HA
### Phase 0 (now) — sticky sessions
Deploy a single broker behind Traefik with `loadBalancer.sticky.cookie`
enabled. WS upgrade inherits the cookie, so reconnects land on the same
node. Gives us 1 node of safe HA headroom (i.e., one deploy rollover
without user-visible disconnection) without any code changes.
### Phase 1 — Active/passive
Two replicas. Traefik routes all traffic to primary; secondary is warm.
Primary fails → secondary takes over, all WS connections reset. No code
change needed; clients auto-reconnect.
### Phase 2 — Active/active for stateless routes
HTTP-only routes (`/cli/*`, `/download`, `/hook`) can round-robin across
any number of replicas today. WS routes stay sticky per mesh via Traefik
`sticky.cookie`. Already behind Postgres → each replica reads the same
mesh/member/invite rows.
### Phase 3 — Full active/active
Migrate the 6 problematic in-memory symbols:
- `streamSubscriptions` → Redis pub/sub
- `meshClocks` → leader-elect via Postgres advisory lock on mesh_id
- `scheduledMessages` → single-writer pattern: whichever replica holds
`pg_advisory_xact_lock(schedule_id)` fires
- `urlWatches` → DB-backed + each replica owns watches where
`presence.node_id = this_node`
- `mcpRegistry` → rely on `mesh_service` table, drop the in-memory cache
- `lastHash` → wrap audit.ts writes in a transaction that
`SELECT hash FROM audit_log ... ORDER BY id DESC FOR UPDATE`, making
concurrent inserts serialize.
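
To make the `streamSubscriptions` migration concrete, here is a small in-memory sketch. `SharedBus` is only a stand-in for Redis pub/sub, and both class names are hypothetical. The point it illustrates is that each node keeps its per-node subscriber map but routes every publish through the shared channel.

```typescript
type Handler = (payload: string) => void;
type BusListener = (stream: string, payload: string) => void;

// Stand-in for Redis pub/sub: every attached node sees every publish.
class SharedBus {
  private listeners: BusListener[] = [];

  attach(listener: BusListener): void {
    this.listeners.push(listener);
  }

  publish(stream: string, payload: string): void {
    for (const listener of this.listeners) listener(stream, payload);
  }
}

class BrokerNode {
  // Per-node map, analogous to today's in-process `streamSubscriptions`.
  private local = new Map<string, Set<Handler>>();
  private bus: SharedBus;

  constructor(bus: SharedBus) {
    this.bus = bus;
    bus.attach((stream, payload) => this.deliverLocal(stream, payload));
  }

  subscribe(stream: string, handler: Handler): void {
    let handlers = this.local.get(stream);
    if (!handlers) {
      handlers = new Set();
      this.local.set(stream, handlers);
    }
    handlers.add(handler);
  }

  // Publishing goes through the bus instead of touching `local` directly,
  // so subscribers on other nodes are reached as well.
  publish(stream: string, payload: string): void {
    this.bus.publish(stream, payload);
  }

  private deliverLocal(stream: string, payload: string): void {
    for (const handler of this.local.get(stream) ?? []) handler(payload);
  }
}
```

With two `BrokerNode` instances sharing one bus, a subscription on node A receives a message published on node B, which is the case that fails today.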
### Phase 4 — Multi-region
SPOF at Frankfurt (OVH). Move to a managed Postgres with read replicas,
one broker cluster per region, global DNS geo-routing. Out of scope for
v1.0.0.
## Immediate ship: local docker-compose for 2-replica smoke test
`packaging/docker-compose.ha-local.yml` (TODO) spins up:
- 2x broker (same DATABASE_URL)
- 1x postgres
- 1x traefik with sticky cookie
- 1x locust / synthetic client
Tests:
1. Send to peer connected on node A → delivered.
2. Subscribe on A, publish on B → expect failure (documents the gap).
3. Kill node A → client reconnects to B within Xs.
4. Audit chain verify after concurrent writes from both nodes → expect
a fork (documents the gap).
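
Test 4 needs a verifier that can tell a fork from a healthy chain. The following is a simplified sketch: the row shape and the `sha256(prevHash + "\n" + payload)` hashing rule are assumptions made for illustration, not the canonical-JSON scheme used by `audit.ts`.

```typescript
import { createHash } from "node:crypto";

interface AuditRow {
  id: number;
  prevHash: string;
  payload: string;
  hash: string;
}

const GENESIS = "0".repeat(64);

// Assumed hashing rule for this sketch only.
function rowHash(prevHash: string, payload: string): string {
  return createHash("sha256").update(prevHash).update("\n").update(payload).digest("hex");
}

function makeRow(id: number, prevHash: string, payload: string): AuditRow {
  return { id, prevHash, payload, hash: rowHash(prevHash, payload) };
}

type Verdict =
  | { ok: true }
  | { ok: false; atId: number; reason: "bad_hash" | "fork" };

// Walks rows in id order: each row must hash correctly and must point at
// the hash of the row immediately before it.
function verifyChain(rows: AuditRow[]): Verdict {
  let expectedPrev = GENESIS;
  for (const row of [...rows].sort((a, b) => a.id - b.id)) {
    if (row.hash !== rowHash(row.prevHash, row.payload)) {
      return { ok: false, atId: row.id, reason: "bad_hash" };
    }
    if (row.prevHash !== expectedPrev) {
      return { ok: false, atId: row.id, reason: "fork" };
    }
    expectedPrev = row.hash;
  }
  return { ok: true };
}
```

If two nodes both read the same last hash and each insert a row, the second of those rows points at a stale predecessor, and `verifyChain` reports a `fork` at its id.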
## Decision
**Ship v1.0.0 on sticky-session single-writer (Phase 0 + Phase 1 warm
standby).** That closes the "what happens on deploy" story. Phase 3 full
HA is v1.1.0 work.


@@ -0,0 +1,152 @@
# claudemesh crypto — external review packet
**Goal:** 2-day review of the claudemesh cryptographic surface by an
external reviewer familiar with libsodium, x25519/ed25519, authenticated
encryption, and hash-chain audit logs.
**Status:** self-audited + Codex-reviewed. Not yet reviewed by an
independent human with security expertise.
## Scope
### Files in scope
| File | LoC | What it does |
|---|---|---|
| `apps/broker/src/crypto.ts` | ~400 | Hello signature verification, canonical invite bytes (v1+v2), `sealRootKeyToRecipient` via `crypto_box_seal`, `verifyInviteV2`, `claimInviteV2Core` (gated). |
| `apps/broker/src/broker-crypto.ts` | 70 | AES-256-GCM encryption-at-rest for MCP env vars. Key from `BROKER_ENCRYPTION_KEY` or ephemeral in dev. |
| `apps/broker/src/audit.ts` | ~250 | Hash-chained audit log. Canonical JSON payload hash, per-mesh `pg_advisory_xact_lock` for concurrent writers. |
| `apps/cli/src/services/crypto/box.ts` | 60 | `crypto_box_easy` / `crypto_box_open_easy` wrappers that accept ed25519 keys and convert to curve25519 via `crypto_sign_*_to_curve25519`. |
| `apps/cli/src/services/crypto/keypair.ts` | ~50 | `generateKeypair` wrapping `crypto_sign_keypair`. |
| `apps/cli/src/commands/backup.ts` | ~180 | Config backup via Argon2id + XChaCha20-Poly1305 (`crypto_aead_xchacha20poly1305_ietf_*`) from a user passphrase. |
| `apps/cli/src/services/invite/parse-v1.ts` | ~160 | Invite payload decode + signature verification, URL parsing, short-code resolution. |
### Out of scope
- TLS config (Traefik termination)
- Postgres at-rest disk encryption
- Homebrew/winget binary signing pipeline
- Secrets storage on the user's machine (we rely on OS file mode 0600)
## Threat model
### Adversary profile
- **Network attacker** on the wire between CLI and broker. Controls
DNS, can inject packets, can replay. TLS terminates at Traefik;
assume TLS is trusted.
- **Malicious broker** operator. Can read any row in Postgres.
- **Mesh peer** with a valid member record. Can try to escalate
privileges, impersonate other members, replay, DoS, exfiltrate
other members' messages.
- **Laptop thief** who has the user's `~/.claudemesh/` directory but
not the login password. (Keys on disk at mode 0600.)
### Must hold
- E2E: broker cannot read plaintext of direct messages.
- Signature: no member can forge messages signed as another member.
- Invite integrity: modifying an invite URL invalidates the signature.
- Backup secrecy: an attacker with the backup file but not the
passphrase learns nothing.
- Audit integrity: tampering with an audit row breaks chain
verification.
### Known weaknesses (deliberate)
- **root_key in v1 invite URL**: current long URL form carries the
mesh root key in base64(JSON). Short-URL mode (`/i/<code>`) resolves
to the same token server-side, so this does NOT reduce the exposure.
v2 protocol moves root_key out of the URL but CLI migration is not
yet shipped.
- **Session-key routing identity**: a peer can claim arbitrary
`sessionPubkey` in hello (validated as 64-hex in alpha.36 but not
proven-own). Proof-of-secret-key for session key is not enforced.
Impact: a peer can route messages as any session pubkey it chooses
but cannot decrypt replies without the matching secret, so the
impact is DoS/confusion, not impersonation.
- **mesh.owner_secret_key stored plaintext** in the DB. A malicious
broker can issue arbitrary invites. Mitigated only by DB access
control.
## Review checklist for the reviewer
1. **libsodium usage**
- Are nonces generated with `randombytes_buf` and never reused?
- `crypto_box_easy` / `crypto_box_open_easy` order and parameters correct?
- Are ed25519 keys converted to curve25519 on BOTH sides consistently?
- Is `crypto_sign_detached` / `crypto_sign_verify_detached` used with the right message bytes?
2. **Invite protocol**
- Canonical bytes v1 + v2 format strings stable across CLI and broker?
- Replay protection: is a v1 URL reusable? (short URL + usedCount)
- Is the `maxUses` counter race-safe? (atomic UPDATE with `lt`)
- v2 root_key sealing: does `crypto_box_seal` fit the trust model?
- Is recipient_x25519_pubkey validated on both shape and length?
3. **Audit chain**
- Is the canonical JSON serialization reviewable and stable?
- Does `pg_advisory_xact_lock` actually serialize writes on the same mesh under HA?
- Can a malicious broker rewrite history by dropping the `lastHash` cache + DROPping rows + replaying with a new chain? (Yes — documented. Mitigation is append-only at the DB level.)
4. **At-rest encryption (broker-crypto.ts)**
- AES-256-GCM with 12-byte IV + 16-byte tag — correct, but is the IV generation guaranteed random and unique per encryption?
- Any concern about auth tag truncation or nonce collision under high volume?
5. **Backup (cli/commands/backup.ts)**
- Argon2id params reasonable? (INTERACTIVE — should possibly be SENSITIVE.)
- XChaCha20-Poly1305 parameter order?
- Does the passphrase-minimum (12 chars) match the Argon2id parameters?
- Is the salt stored alongside the ciphertext and read back correctly?
6. **Session vs member key**
- When is which key used? Is there any path where one is trusted for the other's purpose?
7. **Hello signature**
- Timestamp skew window (`±60s`) — does the broker reject out-of-window replays?
- Is the canonical hello string covered by the signature exactly?
8. **Grants**
- Can a peer bypass server-side grant enforcement by lying about their
own sender key in hello? (Signature pins memberPubkey to a real
signing key, but sessionPubkey isn't proven.)
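
For checklist item 4, the pattern a reviewer would expect looks roughly like the sketch below, written against Node's built-in `crypto` module. It is not the contents of `broker-crypto.ts`: the function names and the `iv || tag || ciphertext` layout are assumptions for illustration.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

const IV_BYTES = 12;
const TAG_BYTES = 16;

// Assumed blob layout for this sketch: iv || tag || ciphertext.
function encryptAtRest(key: Buffer, plaintext: string): Buffer {
  const iv = randomBytes(IV_BYTES); // fresh random IV on every call
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]);
}

function decryptAtRest(key: Buffer, blob: Buffer): string {
  const iv = blob.subarray(0, IV_BYTES);
  const tag = blob.subarray(IV_BYTES, IV_BYTES + TAG_BYTES);
  const ciphertext = blob.subarray(IV_BYTES + TAG_BYTES);
  // Pinning authTagLength rejects truncated tags instead of accepting them.
  const decipher = createDecipheriv("aes-256-gcm", key, iv, { authTagLength: TAG_BYTES });
  decipher.setAuthTag(tag);
  // final() throws if the tag does not verify.
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

With random 96-bit IVs, NIST SP 800-38D caps usage at 2^32 encryptions per key, which is the bound to compare against expected volume when answering the nonce-collision question.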
## Test coverage supplied
- `apps/broker/tests/invite-signature.test.ts`
- `apps/broker/tests/invite-v2.test.ts`
- `apps/broker/tests/hello-signature.test.ts`
- `apps/broker/tests/audit-canonical.test.ts`
- `apps/broker/tests/grants-enforcement.test.ts`
- `apps/broker/tests/rate-limit.test.ts`
- `apps/broker/tests/encoding.test.ts`
- `apps/broker/tests/dup-delivery.test.ts`
- `apps/cli/tests/unit/crypto-roundtrip.test.ts`
## Deliverables expected from reviewer
1. **Findings list** — severity (crit/high/med/low), file:line, fix recommendation.
2. **Protocol-level critique** — anything in the invite or hello flow that can be exploited with a valid account.
3. **Tooling recs** — libsodium best-practice they'd follow differently.
4. **Go/no-go** for v1.0.0 GA assuming the findings are addressed.
## Budget
2 person-days. Hourly rate acceptable; fixed-fee preferred. Request
for quote from reviewers with published libsodium / PKI experience
(see recommended list below).
## Recommended reviewers
- Filippo Valsorda (independent, ex-Go crypto lead, known for age/tink reviews)
- Trail of Bits (firm-rate; their Tamarin+reviewer combo is strong)
- Latacora (firm; expensive but thorough)
- NCC Group (firm; good for libsodium-specific)
- Cure53 (firm; EU, fast turnaround)
## Review deliverable format
Markdown report with:
- Findings table (id, severity, file:line, summary, recommended fix)
- Protocol notes
- One-page exec summary for non-technical stakeholders


@@ -0,0 +1,162 @@
---
title: MCP tool surface trim + multi-mesh push
status: proposed
target: claudemesh-cli 1.1.0
author: Alejandro
date: 2026-05-01
---
# MCP tool surface trim + multi-mesh push
## Problem
Two issues with the current `claudemesh mcp` server:
1. **80+ tools registered.** Every Claude session that has claudemesh installed pays the deferred-tool-list cost (~80 entries surfacing in `ToolSearch`). Most of those tools are CLI-verb-wrappers that already have a perfect Bash equivalent — no structured I/O is gained by exposing them as MCP tools.
2. **Single-mesh push only.** A session launched with `claudemesh launch` opens its WS to one mesh. Peer messages from any other joined mesh arrive only if the user manually runs `claudemesh inbox`. The MCP push pipeline doesn't fan out across meshes.
The cleanest framing: **MCP earns its keep when a tool returns structured data Claude reads. CLI is better for fire-and-forget verbs.** Today's tool surface ignores that distinction.
## Non-goals
- **Don't redesign the architecture as "CLI-only with a daemon."** That trades warm-WS sends (~5ms in-process) for cold Bash spawns (~300-500ms) and forces a Unix-socket bridge to recover state coherence. See discussion 2026-05-01 — the platform vision (vectors, graph, files, mesh-services) genuinely benefits from typed tool I/O.
- **Don't break MCP backward compat in 1.x.** Existing scripts calling `mcp__claudemesh__send_message` keep working until 2.0; in 1.1 they're soft-deprecated with a stderr warning.
## Proposal
Three patches, ship together as 1.1.0:
### Patch 1: `--mesh <slug>` flag on `claudemesh mcp`
Today `claudemesh mcp` calls `readConfig()` and `startClients(config)` — connects to every mesh in `~/.claudemesh/config.json`. The `claudemesh launch` flow writes a per-session tmpdir config with one mesh, so practically the MCP server binds to one mesh per session.
Add an explicit flag for non-launch contexts (manual `~/.claude.json` editing):
```ts
// apps/cli/src/mcp/server.ts, near line 244
export async function startMcpServer(): Promise<void> {
  const serviceIdx = process.argv.indexOf("--service");
  if (serviceIdx !== -1 && process.argv[serviceIdx + 1]) {
    return startServiceProxy(process.argv[serviceIdx + 1]!);
  }
  const meshIdx = process.argv.indexOf("--mesh");
  const onlyMesh = meshIdx !== -1 ? process.argv[meshIdx + 1] : null;
  const config = readConfig();
  if (onlyMesh) {
    // Capture the slugs before filtering; after the filter the list is
    // empty in the error path, so the message would always say "none".
    const available = config.meshes.map((m) => m.slug);
    config.meshes = config.meshes.filter((m) => m.slug === onlyMesh);
    if (config.meshes.length === 0) {
      throw new Error(
        `--mesh "${onlyMesh}" not found in config (have: ${
          available.join(", ") || "none"
        })`,
      );
    }
  }
  // ...rest unchanged
}
```
Enables this `~/.claude.json` pattern for users who want push from N meshes simultaneously without launching N Claude sessions:
```json
{
  "mcpServers": {
    "claudemesh:flexicar": { "command": "claudemesh", "args": ["mcp", "--mesh", "flexicar"] },
    "claudemesh:openclaw": { "command": "claudemesh", "args": ["mcp", "--mesh", "openclaw"] },
    "claudemesh:prueba1":  { "command": "claudemesh", "args": ["mcp", "--mesh", "prueba1"] }
  }
}
```
Each instance opens one WS, holds it for the session, decrypts and forwards `claude/channel` notifications independently. Channel events already carry `[meshSlug]` in `formatPush()` (server.ts:240), so Claude knows which mesh a message came from.
**LoC:** ~10. **Risk:** very low — additive flag, default behavior unchanged.
### Patch 2: trim 25 messaging tools from MCP surface
Move these tools from "registered MCP tool" to "soft-deprecated CLI shim":
| Module | Tool | CLI replacement | Rationale |
|---|---|---|---|
| messaging.ts | `send_message` | `claudemesh send <to> <msg> [--mesh X] [--priority Y]` | Pure verb, no structured return. |
| messaging.ts | `list_peers` | `claudemesh peers --json` | One-shot, easy to parse. |
| messaging.ts | `check_messages` | `claudemesh inbox --json` | One-shot. |
| messaging.ts | `message_status` | `claudemesh msg-status <id>` (new) | One-shot lookup. |
| profile.ts | `set_profile` | `claudemesh profile --avatar X --bio Y ...` | Pure write. |
| profile.ts | `set_status` | `claudemesh status set <state>` (new) | Pure write. |
| profile.ts | `set_summary` | `claudemesh summary <text>` (new) | Pure write. |
| profile.ts | `set_visible` | `claudemesh visible <true\|false>` (new) | Pure write. |
| groups.ts | `join_group` | `claudemesh group join @<name> [--role X]` (new) | Pure write. |
| groups.ts | `leave_group` | `claudemesh group leave @<name>` (new) | Pure write. |
| state.ts | `get_state` | `claudemesh state get <key> --json` | Already exists. |
| state.ts | `set_state` | `claudemesh state set <key> <value>` | Already exists. |
| state.ts | `list_state` | `claudemesh state list --json` | Already exists. |
| memory.ts | `remember` | `claudemesh remember <text>` | Already exists. |
| memory.ts | `recall` | `claudemesh recall <query> --json` | Already exists. |
| memory.ts | `forget` | `claudemesh forget <id>` (new) | Pure write. |
| scheduling.ts | `schedule_reminder` | `claudemesh remind <msg> --in/--at/--cron` | Already exists. |
| scheduling.ts | `list_scheduled` | `claudemesh remind list --json` | Already exists. |
| scheduling.ts | `cancel_scheduled` | `claudemesh remind cancel <id>` | Already exists. |
| mesh-meta.ts | `mesh_info` | `claudemesh info --json` | One-shot read. |
| mesh-meta.ts | `mesh_stats` | `claudemesh stats --json` (new) | One-shot read. |
| mesh-meta.ts | `mesh_clock` | `claudemesh clock --json` (new) | One-shot read. |
| mesh-meta.ts | `ping_mesh` | `claudemesh ping` (new) | Pure verb. |
| tasks.ts | `claim_task` / `complete_task` | `claudemesh task claim/complete <id>` (new) | Pure write. |
**Keep as MCP tools (~50):**
- **vault.ts** — `vault_set / vault_list / vault_delete` (encrypted, structured payloads).
- **vectors.ts** — `vector_store / vector_search / vector_delete` (typed embeddings, ranked results Claude reasons over).
- **graph.ts** — `graph_query / graph_execute` (returns structured graph results).
- **files.ts** — `share_file / get_file / list_files / list_peer_files / read_peer_file / grant_file_access / file_status / delete_file` (binary payloads, ACL semantics).
- **skills.ts** — `share_skill / list_skills / get_skill / remove_skill / mesh_skill_deploy` (typed skill metadata).
- **streams.ts** — `create_stream / list_streams / publish / subscribe` (event stream cursor semantics).
- **contexts.ts** — `share_context / get_context / list_contexts` (context-passing payloads).
- **mcp-registry-*.ts** — `mesh_mcp_*` (the ~14 mesh-MCP-services tools — these are platform-defining, MCP-native).
- **clock-write.ts** — `mesh_set_clock / mesh_pause_clock / mesh_resume_clock` (logical-clock writes that Claude composes with reads).
- **sql.ts** — `mesh_query / mesh_schema / mesh_execute` (typed SQL results).
- **webhooks.ts** — `create_webhook / list_webhooks / delete_webhook` (typed webhook metadata).
- **url-watch.ts** — `mesh_watch / mesh_unwatch / mesh_watches` (returns watch state).
- **tasks.ts** — `create_task / list_tasks` (typed task records — only the writes go to CLI).
### Patch 3: tool-call → CLI shim with deprecation warning
For the trimmed tools, keep the registration but route through the CLI:
```ts
// apps/cli/src/mcp/tools/messaging.ts (sketch)
async function sendMessageDeprecated(args: SendMessageArgs): Promise<ToolResult> {
  process.stderr.write(
    `[claudemesh] mcp__claudemesh__send_message is soft-deprecated in 1.1. ` +
      `Use \`claudemesh send\` via Bash instead — it's faster and cleaner.\n`,
  );
  return originalSendMessageHandler(args); // unchanged behavior
}
```
In 2.0 the registrations get deleted entirely.
## Migration plan
1. **1.1.0** — ship all three patches. Existing users see deprecation warnings; nothing breaks.
2. **1.1.x** — collect feedback. If anyone has scripts hard-wired to the deprecated tools, surface in CHANGELOG.
3. **1.2.0** (~6 weeks later) — flip deprecation warnings to "removal in 2.0" messaging.
4. **2.0.0** — delete the 25 tool registrations. ToolSearch surface drops to ~50 entries.
## Open questions
- **Do we need a Unix-socket bridge between CLI sends and the MCP push-pipe** so they share one WS connection per mesh per session? Probably yes for `claudemesh send` warm-path performance, but it's a separate spec — file under `socket-bridge` after this lands.
- **Should `claudemesh launch` keep writing one MCP server entry** (current behavior, default for new users) or switch to the per-mesh-N-entries pattern from Patch 1? Recommend keeping single-entry default — Patch 1 is for advanced users who manually edit `~/.claude.json`.
- **Do `mesh_mcp_*` tools really belong in the keep list?** They're MCP-on-mesh management — their bias is RPC-shaped, not stream-shaped. Provisional yes; revisit if 1.1 reduces their use.
## Effort
- Patch 1: ~10 LoC + 1 test. ~30 min.
- Patch 2: ~25 tool-handler refactors (registration removed, CLI verb confirmed/added). Some new verbs (`status set`, `summary`, `visible`, `group join/leave`, `forget`, `stats`, `clock`, `ping`, `task claim/complete`, `msg-status`) need wiring through to existing broker-client methods. ~150 LoC, half a day.
- Patch 3: deprecation shim per trimmed tool. ~50 LoC, 1 hour.
**Total:** ~1 dev-day for 1.1.0. ToolSearch surface drops by ~30%, multi-mesh push works, no architectural disruption, platform tools stay typed.


@@ -0,0 +1,234 @@
---
title: claudemesh North Star — CLI-first with claude/channel push-pipe
status: canonical
target: 2.0.0
author: Alejandro
date: 2026-05-02
supersedes: none
references:
- 2026-05-01-mcp-tool-surface-trim.md (first cut at the trim)
- SPEC.md
- docs/protocol.md
---
# claudemesh North Star
## The commitment, in one sentence
> **CLI is the canonical surface for every claudemesh operation. MCP exists for one thing: to deliver `claude/channel` push notifications mid-turn. That's the killer feature, and it's the only reason an MCP server runs at all.**
Everything else — sending messages, listing peers, sharing files, deploying mesh-MCPs, running graph queries, scheduling jobs, publishing skills — is invoked from the CLI, by humans, scripts, cron, hooks, or by Claude itself via Bash.
## Why this shape
1. **Mid-turn interrupt is the differentiator.** When peer A sends to peer B, B's Claude session pauses what it's doing and reads the message immediately. That requires `claude/channel` notifications routed through an MCP transport — Claude Code only watches MCP server connections for those events. **Lose that, and claudemesh becomes another inbox-polling pattern.** Every other primitive can degrade to "delivered at next tool boundary"; this one cannot.
2. **CLI is universal.** Bash works in scripts, hooks, cron, CI, terminals, automation, and Claude itself (via Bash tool calls). A primitive that exists as both an MCP tool and a CLI verb is double-maintenance with one calling convention nobody actually wants.
3. **JSON-on-stdout is enough structure.** Claude reads `claudemesh peers --json` exactly as well as it reads a typed MCP tool return. The CLI man page is the schema. The "MCP gives structured I/O" advantage was real when we were paying for nothing else, but warm-WS via socket bridge (below) closes the cost gap.
4. **Surface shrinks where it matters.** ToolSearch deferred-tool list drops from ~80 entries to ~0 entries (push-pipe registers no tools). Massive context-budget win for every Claude session.
## Prior art (this is not novel architecture)
The "live-state daemon + thin scriptable CLI talking via Unix socket" pattern is the canonical shape for CLIs in this category. Reviewers should not treat this as bespoke design:
- **Docker** — `dockerd` daemon, CLI talks via `/var/run/docker.sock`. `DOCKER_HOST` env override. `docker context` for multi-daemon switching.
- **Tailscale** — `tailscaled` daemon, `tailscale` CLI via socket. Per-key ACL identity model. Same peer-mesh-with-keypairs shape as claudemesh.
- **Stripe `listen`** — long-running CLI daemon receives webhook push, forwards to local consumer. Same push-pipe-as-CLI-subcommand shape.
- **Obsidian CLI** — talks to a running Obsidian instance via REST. **Notable: ships a Claude skill (`~/.claude/skills/obsidian-cli/SKILL.md`) that documents every verb and flag for Claude consumption — replacing MCP tool introspection entirely.**
Claudemesh's CLI-first + push-pipe + socket-bridge architecture is exactly this pattern. We are following the well-trodden path, not inventing a new one.
## The seven architectural commitments
### 1. **MCP server is a push-pipe, full stop.**
The MCP entrypoint (`claudemesh mcp [--mesh <slug>]`) does exactly three things:
- Holds a WS connection to the broker for the meshes it's bound to.
- Decrypts inbound peer messages.
- Emits them as `claude/channel` notifications to the parent Claude Code session.
It registers **zero tools**. It advertises only `experimental: { "claude/channel": {} }`. Its `tools/list` returns an empty array. There is no surface to discover, search, or call.
One push-pipe per joined mesh, registered in `~/.claude.json` via `claudemesh install` (or auto-injected by `claudemesh launch`). The `--mesh` flag (shipped 1.0.3) makes this trivial.
### 2. **CLI is the canonical surface for every primitive.**
Every resource has uniform CLI verbs:
| Resource | Verbs |
|---|---|
| peer | `claudemesh peers [--json] [--mesh X]` |
| group | `claudemesh group join/leave @<n> [--role X]` |
| message | `claudemesh send <to> <msg>`, `claudemesh inbox`, `claudemesh msg-status <id>` |
| state | `claudemesh state get/set/list [--json]` |
| memory | `claudemesh remember/recall/forget` |
| task | `claudemesh task create/claim/complete/list` |
| file | `claudemesh file put/get/list/grant/delete` |
| vector | `claudemesh vector store/search/delete` |
| graph | `claudemesh graph query/execute/watch` |
| stream | `claudemesh stream create/publish/subscribe/list` |
| context | `claudemesh context share/get/list` |
| skill | `claudemesh skill publish/list/get/remove` |
| schedule | `claudemesh schedule msg/webhook/tool/list/cancel` |
| webhook | `claudemesh webhook create/list/delete` |
| watch | `claudemesh watch create/list/unwatch` |
| mcp | `claudemesh mesh-mcp deploy/list/call/undeploy/catalog` |
| clock | `claudemesh clock get/set/pause/resume` |
| sql | `claudemesh sql query/schema/execute` |
| vault | `claudemesh vault set/get/list/delete` |
| profile | `claudemesh profile/summary/visible/status set` |
**Every verb supports `--json`** for structured consumption. **Every verb supports `--mesh <slug>`** for targeting (default: pick first or interactive picker). Verbs share one broker-call implementation — no duplication between CLI and MCP.
### 3. **Warm path via Unix socket bridge** (load-bearing for 2.0).
A push-pipe holds a live WS connection. CLI invocations should reuse that connection rather than opening their own (which costs ~300-500ms cold-start).
Mechanism:
- On startup, push-pipe creates `~/.claudemesh/sockets/<mesh-slug>.sock` (Unix domain socket, mode 0600).
- CLI verbs that need broker round-trip first try to dial that socket.
- If alive: forward request, get response back over socket (~5ms).
- If absent / stale: open ephemeral WS, do the op, close (~300ms — fine for cron/scripts where there's no parent push-pipe).
The push-pipe owns one WS and all ops go through it, so the broker sees ONE session per mesh per host (no duplicate hellos). On exit, a handler unlinks the socket file; if a crash skips that handler, the CLI detects the stale socket when `connect()` fails with ECONNREFUSED.
This is **mandatory for 2.0** — without it, every CLI op pays cold-start, and CLI-first becomes unusably slow for tight loops.
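The warm-path/cold-path decision above could be sketched roughly as follows. This is a minimal illustration, not the shipped bridge: the names `classifyDialError` and `dialPushPipe`, the `DialOutcome` shape, and the 200ms dial timeout are all assumptions. Only the ENOENT/ECONNREFUSED handling reflects the stale-socket behaviour described in the text.

```typescript
import { connect, Socket } from "node:net";

// Outcome of trying the warm path before falling back to a cold WS.
type DialOutcome =
  | { path: "warm"; socket: Socket }
  | { path: "cold"; reason: "absent" | "stale" };

// Map a connect() error code to a fallback reason, or null when the
// error is unexpected and should be surfaced to the caller.
function classifyDialError(code: string | undefined): "absent" | "stale" | null {
  if (code === "ENOENT") return "absent"; // no push-pipe created the socket file
  if (code === "ECONNREFUSED") return "stale"; // file exists, nothing is listening
  return null;
}

// Try the Unix socket first. Resolves with path "cold" when the caller
// should open its own ephemeral WS instead. The timeout is hypothetical.
function dialPushPipe(socketPath: string, timeoutMs = 200): Promise<DialOutcome> {
  return new Promise((resolve, reject) => {
    const socket = connect(socketPath);
    const timer = setTimeout(() => {
      socket.destroy();
      resolve({ path: "cold", reason: "stale" });
    }, timeoutMs);
    socket.once("connect", () => {
      clearTimeout(timer);
      resolve({ path: "warm", socket });
    });
    socket.once("error", (err: Error & { code?: string }) => {
      clearTimeout(timer);
      const reason = classifyDialError(err.code);
      if (reason) resolve({ path: "cold", reason });
      else reject(err);
    });
  });
}
```

A CLI verb would call `dialPushPipe` with the per-mesh socket path and open its own WS only when the result is `cold`. Removing a stale socket file is left to whichever process next creates it.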
### 4. **JSON output is the schema, with field selection and streaming.**
Every CLI verb has a deterministic `--json` output shape, documented in `docs/cli-schemas.md`, validated by zod parsers in tests. Claude reads `claudemesh vector search "x" --json` and gets a typed-array shape it can reason over identically to a tool return.
**Three output modes, mandatory across every read-shaped verb** (modeled on `gh` and `gemini`):
- `--json` — full record, all fields
- `--json <fields>` — field-selected projection (e.g. `claudemesh peers --json name,pubkey,status`)
- `--output-format stream-json` — incremental JSONL for long-running ops (mesh-MCP calls fanning across peers, `vector search` against large indexes, `schedule list` with many entries). One object per line, Claude consumes incrementally.
Plus convenience output:
- `--jq <expr>` — native jq filter pipeline
- `--template '{{.field}}'` — Go template formatting
`schema_version: "1.0"` field on every JSON output — mandatory. Bumps when shape changes. Old code paths can pin with `--schema-version=1.0`.
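As a rough sketch of how field-selected `--json` output and the mandatory `schema_version` field might fit together: the envelope shape (`schema_version` plus a `data` array) and the helper names `parseFieldList`, `projectRecord`, and `renderJson` are assumptions made for illustration. The actual shapes are whatever `docs/cli-schemas.md` defines.

```typescript
// Hypothetical record shape; real schemas are documented elsewhere.
type JsonRecord = Record<string, unknown>;

const SCHEMA_VERSION = "1.0";

// Parse the optional argument of `--json <fields>` into a field list.
// A missing or empty argument means "all fields".
function parseFieldList(arg: string | undefined): string[] | null {
  if (arg === undefined || arg.trim() === "") return null;
  return arg
    .split(",")
    .map((f) => f.trim())
    .filter((f) => f.length > 0);
}

// Keep only the requested fields, in the requested order.
function projectRecord(record: JsonRecord, fields: string[] | null): JsonRecord {
  if (fields === null) return { ...record };
  const projected: JsonRecord = {};
  for (const field of fields) {
    if (field in record) projected[field] = record[field];
  }
  return projected;
}

// Wrap projected rows in an envelope that carries schema_version.
function renderJson(rows: JsonRecord[], fieldArg?: string): string {
  const fields = parseFieldList(fieldArg);
  return JSON.stringify({
    schema_version: SCHEMA_VERSION,
    data: rows.map((row) => projectRecord(row, fields)),
  });
}
```

With this shape, `renderJson(rows, "name,status")` would be the implementation behind something like `claudemesh peers --json name,status`, and unknown field names are silently skipped rather than rejected. A stricter CLI might prefer to fail on them.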
### 5. **All features stay. Nothing is removed.**
This is **not a feature trim**. Every primitive in the current 80-tool surface gets a CLI verb. Vectors, graphs, mesh-MCP, files, vault, SQL — all of it. The user-facing pitch is unchanged: "claudemesh gives your Claude session a name, a network, shared memory, shared compute, shared skills, scheduled actions." The change is *how you call it*.
### 6. **The Claude skill IS the schema.** *(load-bearing for CLI-first to work)*
Stripping MCP tool introspection (`tools/list`) costs Claude its discoverability. The replacement: a packaged `claudemesh` skill at `~/.claude/skills/claudemesh/SKILL.md` written by `claudemesh install`, documenting every verb, flag, JSON shape, and gotcha. Claude reads it on demand via the Skill tool — **not on every session, not pre-loaded into deferred-tool-list**. This is exactly how `obsidian-cli` works today and it works perfectly.
The skill replaces three things at once:
- **Tool discovery** — Claude knows the verb-set after one Skill invocation. No `tools/list` needed.
- **Output schemas** — every JSON shape is documented in the skill, so Claude knows what to expect from `--json` without parsing TypeScript types at runtime.
- **Behavioral conventions** — the skill teaches "preview before delete," "confirm peer match before kick," "use `--mesh` for cross-mesh ops" — soft guardrails that complement the policy engine's hard rules.
Topic-shards for size: `claudemesh` (core), `claudemesh-platform` (vault/vectors/graph/sql/mesh-mcp), `claudemesh-schedule` (cron/webhooks/watches), `claudemesh-admin` (kick/ban/grants/install). Each shard is independently loadable.
**This is the answer to the "JSON-on-stdout is a worse schema" caveat.** It's not — when Claude has a documented skill to load, the CLI surface is *more* discoverable than 80 deferred MCP tools that bloat ToolSearch silently.
### 7. **Pluggable policy engine, not binary `--yes`.** *(answers the Bash-blast-radius caveat)*
Modeled on `gemini --policy / --admin-policy` and `codex --sandbox`. Replace the current binary `-y/--yes` with:
- **`--approval-mode plan|read-only|write|yolo`** — four levels (read-only blocks all writes, plan blocks all side effects, write prompts on dangerous verbs, yolo skips all confirmation).
- **`--policy <file>`** — YAML allow/deny rules per resource × verb × peer. Sample:
```yaml
# ~/.claudemesh/policy.yaml
default: prompt
rules:
  - resource: send
    verb: "*"
    decision: allow
  - resource: sql
    verb: execute
    decision: prompt
  - resource: file
    verb: delete
    decision: deny
  - resource: mesh-mcp
    verb: call
    peers: ["@trusted"]
    decision: allow
```
Policy decisions log to a tamper-evident audit file. Org admin can ship `--admin-policy` that overrides user config. **This is the real answer to "Bash carries unrestricted blast-radius once allowed" — claudemesh's own policy engine kicks in before the broker call, regardless of what shell permissions are.**
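To make the evaluation order concrete, here is one possible evaluator for the sample policy, written against the YAML after parsing. It is a sketch under stated assumptions: first-matching-rule-wins precedence, literal string matching on `peers` (so `@trusted` is compared as-is rather than expanded as a group), and the type and function names are all hypothetical. The spec itself does not define precedence.

```typescript
type Decision = "allow" | "prompt" | "deny";

interface PolicyRule {
  resource: string;
  verb: string; // "*" matches any verb
  peers?: string[]; // when present, the rule applies only to these peers
  decision: Decision;
}

interface Policy {
  default: Decision;
  rules: PolicyRule[];
}

interface PolicyRequest {
  resource: string;
  verb: string;
  peer?: string;
}

// First matching rule wins (assumed); otherwise use the policy default.
function evaluatePolicy(policy: Policy, req: PolicyRequest): Decision {
  for (const rule of policy.rules) {
    if (rule.resource !== req.resource) continue;
    if (rule.verb !== "*" && rule.verb !== req.verb) continue;
    if (rule.peers && (req.peer === undefined || !rule.peers.includes(req.peer))) continue;
    return rule.decision;
  }
  return policy.default;
}

// The sample policy.yaml, as it would look once parsed.
const samplePolicy: Policy = {
  default: "prompt",
  rules: [
    { resource: "send", verb: "*", decision: "allow" },
    { resource: "sql", verb: "execute", decision: "prompt" },
    { resource: "file", verb: "delete", decision: "deny" },
    { resource: "mesh-mcp", verb: "call", peers: ["@trusted"], decision: "allow" },
  ],
};
```

Under these assumptions, `evaluatePolicy(samplePolicy, { resource: "file", verb: "delete" })` yields `deny` before any broker call is made, and a `mesh-mcp call` aimed at a peer outside the list falls through to the `prompt` default.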
## What this means for `claude/channel`
When peer A's CLI runs `claudemesh send peer-B "hello"`:
1. CLI dials `~/.claudemesh/sockets/<mesh>.sock` (warm path) or opens its own WS (cold).
2. Encrypts message with peer-B's pubkey via crypto_box.
3. Broker receives `send` envelope, forwards encrypted blob to peer-B's connected push-pipe.
4. Peer-B's push-pipe decrypts and emits a `claude/channel` notification.
5. Claude Code mid-turn-injects the message as a `<channel source="claudemesh" ...>` reminder.
6. Claude responds immediately per the system prompt convention.
Step 5 is the **only step that requires MCP**. Steps 1-4 are pure CLI + broker. The architecture is "CLI for everything, MCP for the one thing it's irreplaceable for."
## Migration path from 1.1.0
| Version | Ships | Behavior |
|---|---|---|
| **1.2.0** | Unix socket bridge. CLI verbs auto-detect push-pipe and use warm path. **Field-selectable JSON (`--json a,b,c`)** + `--jq` + `--template` adopted. | All existing MCP tools still work. Nothing breaks. |
| **1.2.1** | Ships `~/.claude/skills/claudemesh/SKILL.md` written by `claudemesh install`. Includes full verb reference + output schemas + gotchas. Topic-shards (`-platform`, `-schedule`, `-admin`). | Skill auto-installs on `claudemesh install`. |
| **1.3.0** | Schedule unification (`schedule msg/webhook/tool`). All remaining missing CLI verbs (file, vector, graph, mesh-mcp, vault, sql, stream, context, skill, watch). **`--output-format stream-json`** for long-running ops. | All existing MCP tools still work. New verbs additive. |
| **1.4.0** | Resource-model rename pass — every CLI verb is `<resource> <verb>`. Old verbs become aliases. | All existing MCP tools still work. Old CLI verbs aliased forever. |
| **1.5.0** | **Pluggable policy engine** (`--approval-mode`, `--policy`, `--admin-policy`). MCP `tools/list` shrinks to configurable allowlist (default: empty). `CLAUDEMESH_MCP_FAT=1` for users who need typed tool surface. | Default 1.5 install: MCP exposes zero tools. Push-pipe-only. Policy engine gates all writes. |
| **2.0.0** | MCP server hardcoded to push-pipe-only. Strip all tool registrations + handlers. | **Old MCP tool calls return tool-not-found.** Users must update scripts to CLI verbs. Old CLI verbs (1.4 aliases) still work. |
## What stays exactly the same
- Crypto: ed25519 sign + x25519 sealing + crypto_box for DMs. No change.
- Broker protocol: WS frame format, hello flow, audit log. No change.
- Membership / mesh-scope / capability grants. No change.
- Web app, dashboard, Telegram bridge, OAuth. No change.
- The platform vision (vault, vectors, graph, files, skills, mesh-MCPs, scheduled jobs). All shipped, all stay.
## What changes for users
- `~/.claude.json` simplifies: `"claudemesh": { "command": "claudemesh", "args": ["mcp"] }` becomes one entry per joined mesh after `claudemesh install`. Multi-mesh push works out of the box.
- ToolSearch loses ~80 deferred entries. Sessions are lighter.
- Scripts that called `mcp__claudemesh__*` get a deprecation warning in 1.x, break in 2.0 — replaced by `claudemesh <verb> --json` + `jq`.
- Claude Code system prompt for the MCP server gets shorter (no tool catalog), focused only on "RESPOND IMMEDIATELY to channel events."
## Open questions parked for future specs
- **Federation** — broker-to-broker encrypted relay so peers on different brokers can talk. Not in 2.0 scope.
- **Offline-with-TTL inbox** — persist `now` priority messages on broker if recipient is offline, with explicit TTL. Reasonable for 2.x.
- **Compute attribution** — when peer X invokes a mesh-MCP that peer Y deployed, who pays for broker compute / outbound calls? Pre-empts the eventual billing question. 2.x.
- **Universal hash-chained audit** — every state mutation per mesh is hash-chained, replayable, verifiable. Today only some events are; making it universal is its own spec.
- **ACP (Agent Communication Protocol) interop with Gemini CLI.** Gemini CLI exposes `--acp` for agent-to-agent comms — the same problem domain claudemesh occupies. Research question: is ACP a documented standard claudemesh can speak (making claudemesh peers and Gemini peers cross-talk in the same mesh), or is it Google-proprietary? If standard, implementing it is a major platform expansion. File as separate research spec before 2.x.
## What this spec is NOT
- Not a redesign of the broker. The broker stays as-is.
- Not a redesign of crypto. Crypto stays as-is.
- Not a feature deprecation. Every feature stays.
- Not optional. This is the canonical 2.0 architecture; intermediate versions migrate toward it.
## Effort estimate to 2.0
Sequential, single dev (revised after caveats survey — original estimate was rosy):
- **1.2.0** (socket bridge + field-JSON): 1-2 weeks. Socket bridge is real distributed-systems work (stale-cleanup, version negotiation, NFS/Windows edge cases) — not 2-3 days.
- **1.2.1** (claudemesh skill + topic shards): 2-3 days. Mostly content writing once schemas are documented.
- **1.3.0** (schedule unification + remaining verbs + stream-json): 1 week. Each of the ~10 missing verbs is small but adds up.
- **1.4.0** (resource-model rename + alias compat): 2-3 days.
- **1.5.0** (policy engine + MCP allowlist): 4-5 days. Policy engine is its own subsystem — parser, evaluator, audit log, admin override.
- **2.0.0** (strip tool handlers + cutover): 2 days.
Total: **~5-6 weeks of focused work** spread over 3-4 months calendar. Each release is independently shippable; the policy engine specifically can land later than 1.5 if needed.
## Acceptance signals — how we know it worked
- **ToolSearch** in a freshly-installed claudemesh session shows zero `mcp__claudemesh__*` entries by default (vs ~80 today).
- **`claudemesh peers --json name,status`** projects exactly two fields, no extra noise.
- **`claudemesh send <peer> "hi"`** from a Bash call inside a Claude session round-trips in <50ms (warm path via socket bridge) on localhost-broker, <250ms on EU-from-US.
- **`Skill: claudemesh`** loaded once teaches Claude the entire mesh surface; subsequent CLI calls require no further introspection.
- **A policy file with `decision: deny` for `file delete`** blocks the call before it hits the broker, with a clear stderr explanation.
- **`claudemesh status set working` from cron** opens its own WS (no daemon), succeeds in <1s, no orphan connections on broker.

@@ -0,0 +1,155 @@
# claudemesh handoff — 2026-05-02 (evening)
Companion to the morning handoff (`2026-05-02-handoff.md`). Captures
what shipped through the v1.6.x patch line and the v1.7.0 demo cut.
Read before the next session.
---
## What shipped this evening
### v1.6.x patch line — closed except bridge smoke test
| Feature | Endpoint / file | Commit |
|---|---|---|
| SSE topic stream | `GET /api/v1/topics/:name/stream` | `7e71a61` |
| Unread counts | `PATCH /v1/topics/:name/read`, `unread` on `GET /v1/topics` | `a80eb6f` |
| Mesh-card unread badges | `apps/web/src/app/[locale]/dashboard/(user)/page.tsx` | `541440c` |
| Member sidebar | `GET /v1/members`, chat panel right rail | `a75483b` |
| SSE 4xx-stop fix | `apps/web/src/modules/mesh/topic-chat-panel.tsx` | `7af61e1` |
| Humans-as-peers | `GET /v1/peers` includes recent apikey users | `f4601f4` |
### v1.7.0 demo cut — 4 of 5 items shipped
| Item | Code | Commit |
|---|---|---|
| Member sidebar in chat | `apps/web/src/modules/mesh/topic-chat-panel.tsx` (+sidebar) | `a75483b` |
| Topic search + autocomplete | Same file (+ search toggle, mention dropdown, clay highlight) | `35a289b`, `00c25d9` |
| Notification feed | `MentionsSection` on universe + `GET /v1/notifications` | `a9160a0` |
| Public blog post | `apps/web/src/app/[locale]/(marketing)/blog/agents-and-humans-same-chat/` | `69cf39b` |
| Demo video script | `docs/demo-v1.7.0-script.md` (90s, 5 scenes) | `69cf39b` |
| Marketing site refresh | Timeline next-block updated | `a2ab7de` |
| **Recorded demo video** | — | **TODO (needs human + iTerm + Chrome)** |
| **Marketing screenshots** | — | **TODO (needs Chrome session)** |
### Roadmap state
- `docs/roadmap.md` updated. v1.6.x marks every endpoint shipped except
bridge smoke test. v1.7.0 marks sidebar/mentions/search/feed/blog
shipped; recording + screenshots open.
- v2.0.0 (daemon redesign) and v0.3.0 (operator layer / per-topic
encryption) untouched — both still architectural specs.
---
## Live status
- **Broker** (`wss://ic.claudemesh.com/ws`): autodeployed via Coolify
off the gitea-vps push. The custom migration runner from earlier
this session is the one moving migrations forward. No new
migrations shipped today — all v1.6.x work was code-only against
the v0.2.0 schema.
- **Web** (`claudemesh.com`): autodeployed via Vercel off the github
push. Verified `/v1/notifications`, `/v1/peers`, `/v1/members`,
`/v1/topics/general/stream`, `/v1/topics/general/read` all
return 401 with bad bearer (i.e. they exist + auth works).
Authenticated browser smoke not run — no Playwriter session
available during this handoff write.
- **CLI** (`claudemesh-cli@1.6.1` on npm): unchanged this session.
All v1.6.x work was server + web only; CLI doesn't yet consume
the new endpoints.
### CLI gap — worth noting
The new endpoints have NO CLI surface yet:
- `GET /v1/notifications` → `claudemesh notification list` could show
recent mentions in the terminal. ~30 LoC.
- `GET /v1/members` → `claudemesh member list` shows roster + online
state. Distinct from `peer list` which shows live sessions.
- `PATCH /v1/topics/:name/read` — could be implicit (called by
`topic show <name>`) or explicit (`claudemesh topic read <name>`).
- SSE stream — `claudemesh topic tail <name>` would tail messages
in the terminal. High demo value.
Wiring these is a small CLI release (v1.7.0). Not blocking anything
but worth doing before the recording so the demo includes a
"terminal tail" cut.
---
## Known issues / risks
1. **Mentions notification endpoint depends on plaintext-base64
ciphertext** that v0.2.0 ships. When per-topic encryption lands
in v0.3.0, both `GET /v1/notifications` and the universe-page
`MentionsSection` query break. Migration plan is documented in
the blog post + the inline comment: move to a
`mesh.notification` table populated at write time.
2. **Postgres `convert_from(decode(ciphertext, 'base64'), 'UTF8')`
throws on any ciphertext that isn't valid base64-of-UTF8.** All
current writers (broker WS path, REST POST /messages, web chat
panel) emit base64-of-plaintext-UTF8, so this works. If a future
writer emits binary ciphertext, the mention queries crash. Add a
safe-base64 guard or migrate to per-write notification table
before that happens.
3. **No live SSE smoke test in this session.** Endpoints respond
401 to bad bearer. Browser-authenticated test was deferred — no
Playwriter session was reachable during the run. Worth a
manual smoke before recording the demo.
4. **CSRF middleware blocks PATCH/POST without an Origin header.**
This is correct behaviour but trips up curl users. Documented
in the smoke notes; not a bug.
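For issue 2, one possible shape of the safe-base64 guard, applied in TypeScript before a row is handed to the mentions query, is sketched below. The function name `decodePlaintextBase64` is hypothetical, and the check only approximates what Postgres accepts (for example, it rejects embedded whitespace that `decode` may tolerate), so treat it as a conservative filter rather than an exact mirror.

```typescript
import { Buffer } from "node:buffer";

// Returns the decoded text when `ciphertext` is padded base64 of valid
// UTF-8, otherwise null, so callers can skip rows that would make
// convert_from(decode(...), 'UTF8') throw.
function decodePlaintextBase64(ciphertext: string): string | null {
  // Buffer.from(..., "base64") is lenient and drops invalid characters,
  // so validate alphabet and padding explicitly first.
  if (ciphertext.length % 4 !== 0 || !/^[A-Za-z0-9+/]*={0,2}$/.test(ciphertext)) {
    return null;
  }
  const bytes = Buffer.from(ciphertext, "base64");
  try {
    // fatal: true makes invalid UTF-8 throw instead of yielding U+FFFD.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    return null;
  }
}
```

A writer could call this at insert time and flag rows that return `null`, which keeps binary ciphertext out of the mentions path until the notification table exists.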
---
## Next session — three branches
### A. Record + ship the v1.7.0 launch (~2 hours, all human work)
1. Spin a fresh demo mesh + two iTerm panes running
`claudemesh launch --name Mou` and `--name Alexis`.
2. Run the demo script in `docs/demo-v1.7.0-script.md`.
3. Cut to 90s, upload to `claudemesh.com/media/demo-v170.mp4`.
4. Take 4-6 screenshots (universe, mesh detail, chat with sidebar,
mentions feed, mobile view) for the blog hero + Twitter card.
5. Cross-post per the script's distribution checklist.
### B. Wire CLI verbs to v1.6.x endpoints (~3 hours, code)
1. `claudemesh notification list [--since]` → `GET /v1/notifications`.
2. `claudemesh member list` → `GET /v1/members`.
3. `claudemesh topic tail <name>` → SSE consumer. Print as messages
arrive. Highest demo value.
4. `claudemesh topic read <name>` → `PATCH /v1/topics/:name/read`.
5. Bump `apps/cli/package.json` to 1.7.0, publish.
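The core of a `topic tail` consumer is an incremental Server-Sent Events parser, since network chunks do not line up with event boundaries. The sketch below is illustrative only: the `SseParser` class is hypothetical, it handles `event` and `data` fields but not `id` or `retry`, and it treats a bare `\r` as ordinary text. What the stream's `data` payload contains is not specified here, so callers would parse it according to the endpoint's own format.

```typescript
interface SseEvent {
  event: string; // defaults to "message" per the SSE format
  data: string;
}

// Parse one blank-line-delimited block into an event, or null when the
// block carries no data (for example a keep-alive comment).
function parseBlock(block: string): SseEvent | null {
  let event = "message";
  const data: string[] = [];
  for (const line of block.split("\n")) {
    if (line === "" || line.startsWith(":")) continue; // comment or keep-alive
    const colon = line.indexOf(":");
    const field = colon === -1 ? line : line.slice(0, colon);
    let value = colon === -1 ? "" : line.slice(colon + 1);
    if (value.startsWith(" ")) value = value.slice(1);
    if (field === "event") event = value;
    else if (field === "data") data.push(value);
  }
  return data.length > 0 ? { event, data: data.join("\n") } : null;
}

// Feed raw text chunks in, get completed events out. Incomplete trailing
// input stays buffered until the next chunk arrives.
class SseParser {
  private buffer = "";

  push(chunk: string): SseEvent[] {
    this.buffer = (this.buffer + chunk).replace(/\r\n/g, "\n");
    const events: SseEvent[] = [];
    let boundary = this.buffer.indexOf("\n\n");
    while (boundary !== -1) {
      const parsed = parseBlock(this.buffer.slice(0, boundary));
      this.buffer = this.buffer.slice(boundary + 2);
      if (parsed) events.push(parsed);
      boundary = this.buffer.indexOf("\n\n");
    }
    return events;
  }
}
```

The CLI would pipe decoded response chunks from `GET /v1/topics/:name/stream` into `push` and print each returned event as it arrives.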
### C. v0.3.0 first slice — per-topic encryption (~5 hours, code)
This is the next architectural cut.
1. Schema: add `mesh.topic.encrypted_key` (encrypted-to-mesh-root).
2. Broker: derive symmetric key on first message via HKDF; cache.
3. Client: per-topic key fetch + `crypto_secretbox` over body.
4. `ciphertext` column stops being plaintext-base64 → mentions
query needs the notification table from issue #1.
Highest leverage right now is **A** (the recording is what turns
shipped code into shipped product), then **B** (CLI parity makes
the demo fuller). **C** is the next session for someone with
2+ uninterrupted hours.
---
## Repo state
- `main` ahead of `gitea-vps/main` and `github/main` by 0 commits
at handoff time — both pushed.
- 12 commits this evening session (sse → unread → grid → sidebar →
ssefix → mentions → search → notifications → roadmap → humans →
roadmap2 → blog+demo → timeline).
- No open PRs; everything went to main directly.
- No `.skip` / TODO files / temp commits left behind.
---
*Last handoff: this file. Previous: `2026-05-02-handoff.md` (morning).*

View File

@@ -0,0 +1,106 @@
# claudemesh handoff — 2026-05-02
State of the world after a long session that shipped 1.5.0 and the v0.2.0 backend. Read this before the next session — it captures what's done, what's deployed where, what's not, and the architectural decisions worth knowing.
---
## Where things stand
### Released to npm
- **`claudemesh-cli@1.5.0`** (latest tag, published earlier today). CLI-first architecture lock-in: zero-tool MCP, policy engine, bundled `claudemesh` skill. Verified install + smoke-tested via clean `npm i -g`.
### In `main` but NOT released yet
Everything below is committed, deployed to the broker (`wss://ic.claudemesh.com/ws`) and the web app (Vercel `claudemesh.com`), but **`claudemesh-cli@1.5.0` on npm doesn't have any of it**. Users won't see it until v1.6.0 publishes.
| Feature | Code path | Verified live? |
|---|---|---|
| Topics (schema, broker routing, CLI verbs, skill) | `packages/db/src/schema/mesh.ts`, `apps/broker/src/broker.ts`, `apps/cli/src/commands/topic.ts` | ✅ created `#deploys-test`, sent + persisted |
| `apikey create/list/revoke` (CLI + broker WS) | `apps/cli/src/commands/apikey.ts`, broker dispatch | ✅ full lifecycle exercised |
| REST `/api/v1/*` (messages, topics, peers, history) | `packages/api/src/modules/mesh/v1-router.ts` + `api-key-auth.ts` | ✅ posted via curl, history round-trips |
| Bridge peer (SDK + CLI) | `packages/sdk/src/bridge.ts`, `apps/cli/src/commands/bridge.ts` | ⚠️ code only — never run end-to-end |
### Architectural commitments locked this session
- **CLI-first, MCP push-pipe** (1.5.0): MCP `tools/list = []`. Inbound peer messages still arrive as `experimental.claude/channel` notifications. The bundled skill is the sole CLI-discoverability surface for Claude.
- **Topics complement groups, don't replace them** (v0.2.0): mesh = trust boundary, group = identity tag, topic = conversation scope. Three orthogonal axes.
- **Humans use REST + apikey, not browser WS** (v0.2.0): the broker already plumbs `peer_type: "human"`. The real blocker was browser-side ed25519, which we sidestep by exposing REST. Web chat UI = thin client over `/v1/*` using dashboard session auth.
- **Spec lives at**: `.artifacts/specs/2026-05-02-architecture-north-star.md` (1.5.0) and `.artifacts/specs/2026-05-02-v0.2.0-scope.md` (v0.2.0 cut + design sketches).
---
## Three pending sessions, ranked by leverage
### Session A — Ship v1.6.0 npm release (~30 min, highest leverage)
**Why first**: backend is feature-complete but unreleased. Users still get the no-topics 1.5.0.
Steps:
1. Bump `apps/cli/package.json` 1.5.0 → 1.6.0.
2. Update `apps/cli/README.md` migration note (mention topics, apikey, bridge).
3. Add `## v1.6.0` section to `docs/roadmap.md`.
4. Build + verify: `cd apps/cli && pnpm build && node dist/entrypoints/cli.js --version`.
5. `npm publish --tag latest --access public --no-git-checks --ignore-scripts`.
6. `git tag cli-v1.6.0 && git push github cli-v1.6.0` — workflow builds 5 binaries + auto-bumps Homebrew/winget tap.
7. Verify on a clean prefix: `PREFIX=/tmp/cm16; mkdir -p "$PREFIX" && npm install -g --prefix "$PREFIX" claudemesh-cli@1.6.0 && "$PREFIX/bin/claudemesh" --help | grep -E "topic|apikey|bridge"`.
### Session B — Migration drift fix (~1 day, highest pain reduction)
**Why second**: every schema change today requires manual `psql -f migration.sql` against prod. The drizzle `_journal.json` stops at idx 11, runtime migrator silently skips anything not in journal. Today's `0022_topics.sql` and `0023_api_keys.sql` were applied by hand. **Future migrations will keep needing this until fixed.**
Recommended approach:
1. Replace `drizzle-orm/postgres-js/migrator` in `apps/broker/src/migrate.ts` with a custom runner.
2. Scan `migrations/*.sql` lexicographically (already named `NNNN_*.sql`).
3. Track applied filenames in a new `mesh.__cmh_migrations` table (filename + sha256 + applied_at).
4. On startup: filter unapplied files, run them in transaction order under `pg_try_advisory_lock`. Fail loud on hash mismatch (catches edits after deploy).
5. Backfill the table with all 0000-0023 entries one-time so prod is consistent.
6. Drop the drizzle journal usage entirely (`migrations/meta/_journal.json` becomes dead state).
This unblocks every future feature touching DB.
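The selection logic of the recommended runner (steps 2 to 4) can be sketched as a pure function, leaving out the database I/O, the transaction, and the advisory lock. The names `MigrationFile`, `AppliedMigration`, and `planMigrations` are hypothetical; the applied list stands in for rows read from the tracking table.

```typescript
import { createHash } from "node:crypto";

interface MigrationFile {
  filename: string; // e.g. "0022_topics.sql"
  sql: string;
}

interface AppliedMigration {
  filename: string;
  sha256: string;
}

function sha256Hex(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Decide which files still need to run, in lexicographic filename order.
// Throws when an already-applied file was edited after deploy.
function planMigrations(files: MigrationFile[], applied: AppliedMigration[]): MigrationFile[] {
  const appliedHashes = new Map<string, string>();
  for (const m of applied) appliedHashes.set(m.filename, m.sha256);

  const sorted = [...files].sort((a, b) =>
    a.filename < b.filename ? -1 : a.filename > b.filename ? 1 : 0,
  );

  const pending: MigrationFile[] = [];
  for (const file of sorted) {
    const recorded = appliedHashes.get(file.filename);
    if (recorded === undefined) {
      pending.push(file);
    } else if (recorded !== sha256Hex(file.sql)) {
      throw new Error(`migration ${file.filename} changed after it was applied`);
    }
  }
  return pending;
}
```

Keeping the planning step pure makes the hash-mismatch behaviour easy to unit-test. The runner itself would then execute the returned files and insert one tracking row per file.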
### Session C — Web chat UI (~2-3 days, highest visibility)
**Why third**: the demo. Backend is ready; this is pure React + REST.
Path: `apps/web/src/app/[locale]/dashboard/(user)/meshes/[id]/topics/[name]/page.tsx` (new).
Components needed:
- Topic header (members count, settings button).
- Message stream — `GET /api/v1/topics/:name/messages?limit=50`. Poll every 5s for new (no WS yet — REST polling is fine for v0.2.0).
- Compose box — `POST /api/v1/messages` with `{topic, ciphertext, nonce}`.
- Members sidebar — `GET /api/v1/peers`.
- Apikey lifecycle: on first load, server-side issue an apikey for the dashboard user (using their existing NextAuth session) scoped to `read,send` on this topic. Stash in browser session storage.
Server-side helper for apikey issuance lives in `packages/api/src/modules/mesh/api-key-auth.ts` — refactor `verifyBearer` to also expose a `createApiKeyForUser(userId, meshId, scope)` helper for the dashboard handler.
---
## Three less-urgent followups (don't block sessions A-C)
1. **Bridge end-to-end smoke test**: never actually run between two meshes. Needs second test mesh + bridge member onboarding ritual. Worth doing before any blog post / external demo.
2. **`/v1/peers` includes only WS-connected agents**, not humans (since humans are REST-only and never appear in `presence`). Decide: synthetic presence rows for active apikey sessions? Or document that `/v1/peers` is "agents online"?
3. **Topic ciphertext is plaintext base64** in the current implementation — no actual encryption. The schema names it `ciphertext` for forward-compat, but the code base64-encodes UTF-8. Real per-topic symmetric key derivation (HKDF from mesh root_key + topic_id) is a v0.3.0 item.
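For followup 3, the per-topic derivation could look roughly like this using Node's built-in HKDF. This is a sketch, not the v0.3.0 design: the all-zero salt, the `claudemesh/topic/` info prefix, the 32-byte output length, and the function name `deriveTopicKey` are assumptions chosen for illustration.

```typescript
import { hkdfSync } from "node:crypto";
import { Buffer } from "node:buffer";

// Derive a 32-byte symmetric key for one topic from the mesh root key.
// Salt and info layout here are illustrative assumptions.
function deriveTopicKey(rootKey: Uint8Array, topicId: string): Buffer {
  const salt = Buffer.alloc(32); // fixed all-zero salt
  const info = Buffer.from(`claudemesh/topic/${topicId}`, "utf8");
  // hkdfSync returns an ArrayBuffer; wrap it for convenient comparison.
  return Buffer.from(hkdfSync("sha256", rootKey, salt, info, 32));
}
```

Because the derivation is deterministic, any member holding the root key can recompute a topic's key from its id, and binding the topic id into `info` keeps keys for different topics independent.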
---
## Production state worth knowing
- **Broker**: `wss://ic.claudemesh.com/ws`, deployed via Coolify on OVHcloud VPS. Auto-redeploys on push to `gitea-vps main`. Deploy ETA ~3 min.
- **Web**: `claudemesh.com`, Vercel auto-deploy on push to `github main`. Deploy ETA ~2 min.
- **Postgres**: container `eo1f5gydsgrg19b57e9s4zw7` on the VPS. SSH via `ssh ovh`, then `docker exec eo1f5gydsgrg19b57e9s4zw7 psql -U claudemesh -d claudemesh`.
- **Test mesh**: `openclaw` on the same broker has 5 active peers and one topic (`#deploys-test`).
- **Active apikey** (from earlier today's smoke): `cm_OC12dRti…` was revoked. None active right now.
---
## Files most worth reading first in next session
1. `.artifacts/specs/2026-05-02-architecture-north-star.md` — the 7 architectural commitments.
2. `.artifacts/specs/2026-05-02-v0.2.0-scope.md` — design sketches for topics, REST, bridge.
3. `apps/cli/skills/claudemesh/SKILL.md` — the canonical CLI surface; ships in npm tarball.
4. This file.
---
## Memory not yet captured
Worth adding to `~/.claude/projects/-Users-agutierrez-Desktop-claudemesh/memory/MEMORY.md` next session:
- **Drizzle journal drift is a recurring trap** — manual psql until session B lands. Save the exact apply ritual: `scp migrations/NNNN.sql ovh:/tmp/ && ssh ovh "docker cp /tmp/NNNN.sql <pg-container>:/tmp/ && docker exec <pg-container> psql -U claudemesh -d claudemesh -f /tmp/NNNN.sql"`.
- **`workspace:*` deps break `npm publish`** — keep SDK as devDependency in `apps/cli/package.json`; Bun bundles it into dist so runtime doesn't need it. Same trick for any other workspace-only build deps.
- **Commitlint hard-caps body lines at 100 chars** — use `git commit -F /tmp/cm-commit.txt` rather than `-m` heredocs. Heredocs that exceed the limit fail the husky hook silently.

View File

@@ -0,0 +1,227 @@
# claudemesh internal roadmap — 2026-05-02
Strategic counterpart to `docs/roadmap.md` (which is the public, marketing-tone roadmap). This file captures the *why*, the dependencies, the costs, and the things we deliberately won't do.
Anchored in the v0.2.0 backend cut + `#general` auto-creation + filename-tracked migrator + owner-member backfill that all shipped 2026-05-02.
---
## Forcing function
> **Ship v1.6.x in 2 weeks. Ship v1.7.0 in a month. Make the demo. Then commit the daemon.**
Each release stands on its own — usable and shippable even if the next slips. That's the property to optimize for, not "fastest path to v3.0.0."
---
## Schedule
| When | Version | Theme | Status |
|---|---|---|---|
| Now | 1.6.0 | v0.2.0 backend cut | ✅ shipped 2026-05-02 |
| +2w | 1.6.x | Demo polish (SSE, unread, sidebar) | Active |
| +5w | 1.7.0 | First marketing-ready version | Planned |
| +9w | 2.0.0 | Daemon redesign | Planned |
| +15w | 0.3.0 | Self-hosted + per-topic encryption + gateways | Planned |
| TBD | 3.0.0 | Native Claude channels | Anthropic-gated |
≈4 months from today to a teams-can-self-host shape. The MCP bridge stays load-bearing the whole time but stops being the user's problem at v2.0.0.
---
## v1.6.x patch line — 0-2 weeks, polish what's deployed
| Item | Effort | Why now |
|---|---|---|
| Real-time push (SSE on `/api/v1/topics/:name/stream`) | 2 days | Chat lag is the only user-visible v0.2.0 wart. Replaces 5s polling. |
| Unread counts via `last_read_at` | ½ day | Schema column already exists. PATCH on scroll-to-bottom + chip on topic list. |
| Bridge end-to-end smoke (two-mesh forwarding test) | ½ day | Feature shipped, never validated. Catches obvious bugs before any external demo. |
| Drizzle journal + `meta/` cleanup | 1 hour | Inert dead files since the new runner. Low-risk cosmetic. |
| `/v1/peers` includes humans (synthetic presence rows for active apikeys) | 1 day | Today the dashboard chat user is invisible to other peers. |
Total: ~1 week of focused work. Closes the v0.2.0 backend chapter cleanly.
---
## v1.7.0 — 2-3 weeks, the demo cut
The release that turns claudemesh into a thing you can record and show.
**Scope:**
- Member sidebar in the chat panel — names, online dots, presence summaries. Comes nearly free with SSE from v1.6.x.
- Topic search + member-mention autocomplete — `@Mou` hot-keys to `claudemesh send Mou ...`.
- Notification feed at `/dashboard` — "you have N unread in #deploys, 2 mentions in #incident." Purely aggregate; no new schema.
- One-line marketing site refresh — capture screenshots from the now-real-time UI, drop the v0.2.0 stamp from the chat footer, update README/landing.
- First public blog post + recorded demo — "claudemesh in 90 seconds" video. Triggers the first proper user-acquisition push.
**Not in scope:** any architectural change. v1.7.0 is pure UX polish on top of the v1.6.x foundation. Architecture work waits for v2.0.0.
**Why this comes before v2.0.0:** without users, the daemon is a solution for nobody. v1.7.0 produces the first real user signal so v2.0.0 has data to optimize against.
---
## v2.0.0 — 3-4 weeks, the daemon redesign
The single largest architectural shift on the roadmap. Background and rationale captured at length elsewhere this session; summary here.
### Single load-bearing principle
> **The user is the unit of mesh participation, not the Claude session.**
Every weird edge case from this session — the launch tax, the orphan owner, the per-session keypair churn, the MCP install/uninstall ritual, multi-Claude config corruption — comes from getting this one thing wrong today. Fix it once, structurally, and 70% of accumulated complexity vanishes.
### Architecture
```
claudemesh.com (web identity + workspace admin)
        ▼ JWT
broker (unchanged) — wss://ic.claudemesh.com/ws
        ▼ ws per workspace
claudemesh-daemon (per user, launchd/systemd, persistent)
        ▼ unix socket
   ┌────┴────┐
   ▼         ▼
CLI verbs   MCP push-pipe (~50 LoC)
            claude (any number of sessions)
```
### What v2.0.0 ships
- **`claudemesh-daemon`** — long-lived per-user process. One WS per workspace, kept alive across Claude session lifetimes. Listens on `~/.claudemesh/sockets/<workspace>.sock`. Started by `claudemesh login`, persists across reboots.
- **HKDF-derived peer keypairs from JWT** — same identity across machines, no key copy ritual. Web sign-up = CLI sign-up = same row in `mesh_member`.
- **Stateless CLI verbs** — each existing command (`send`, `peers`, `topic`, `apikey`, `bridge`, `state`, `remember`, etc.) retargeted to dial the daemon socket. ~3000 LoC of plumbing deleted, ~500 LoC of glue added.
- **50-line MCP server** — dial daemon, forward inbound peer messages as `experimental.claude/channel` notifications. The push-pipe shrinks from ~150 LoC to ~50.
- **`claudemesh launch` deprecated** — replaced by ambient mode: `claude` with no flags. Launch becomes a one-line alias that prints "ambient mode now, just run `claude`" and exits.
- **"Mesh" → "workspace"** in the public surface. DB tables keep `mesh_*` names for migration sanity.
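The HKDF-derived identity in the list above could look roughly like this. It is a minimal sketch, assuming the JWT carries a stable per-user secret (`userSecret` is a hypothetical name) and that the workspace slug is mixed in as HKDF context. The salt and info labels are illustrative, not the shipped values, and whether the key is per workspace or global is also an assumption. It uses Node's built-in `hkdfSync`.

```typescript
import { hkdfSync } from "node:crypto";

// Hypothetical domain-separation label; the real daemon may use another value.
const IDENTITY_SALT = "claudemesh-identity-v1";

/**
 * Derive a 32-byte seed for a peer keypair from stable user key material.
 * The same inputs on any machine yield the same seed, so no key has to be
 * copied between devices.
 */
function derivePeerSeed(userSecret: Uint8Array, workspace: string): Uint8Array {
  const info = `peer-keypair:${workspace}`; // assumed context string
  return new Uint8Array(hkdfSync("sha256", userSecret, IDENTITY_SALT, info, 32));
}
```

The seed would then be fed to whatever ed25519/x25519 key generation the CLI already uses; that step is left out because it depends on the crypto library in use.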
### What v2.0.0 kills
- `claudemesh launch` command — the 8-thing bootstrap was paying for state the daemon now owns persistently.
- `--dangerously-skip-permissions` — set once at install in `settings.json` allowedTools, never seen by the user again.
- `--dangerously-load-development-channels` — written into `~/.claude.json` once at install, never seen again.
- Per-session `CLAUDEMESH_CONFIG_DIR` tmpdir — daemon owns config.
- Per-session `CLAUDEMESH_DISPLAY_NAME` env var — daemon stores it.
- MCP install/uninstall ritual on every launch — MCP entry is permanent.
- Multi-Claude config corruption — only the daemon writes config.
- Orphan-owner bug (just fixed via backfill) — structurally impossible because web sign-up creates the member row.
### What v2.0.0 keeps
- Wire protocol, crypto primitives, broker schema — 100% unchanged.
- All CLI verb names — 100% unchanged (just retargeted).
- REST `/api/v1/*` surface — 100% unchanged.
- Web chat UI — 100% unchanged.
- Bridge peer feature — 100% unchanged.
- Topic semantics, ciphertext field, ephemeral DMs — 100% unchanged.
### Cost
- ~3 weeks focused engineering
- ~30% LoC reduction in the CLI package
- ~80% reduction in support load for "launch flags," "config corruption," "peer keypair lost," "owner has no member row"
- ~0 cost to broker, web app, schema, protocol — none of the deep parts change
### Migration path (backwards-compatible at every step)
1. **Week 1** — daemon binary + unix socket protocol + retarget two CLI verbs (`send`, `peers`) as the smoke test. Ship to alpha testers.
2. **Week 2** — retarget remaining verbs. HKDF-keypair migration with a one-shot `claudemesh migrate-identity` command for existing users.
3. **Week 3** — `claudemesh launch` becomes a deprecated alias. MCP server retargeted to daemon socket. Backfill: every existing user's daemon spins up on first `claudemesh` invocation.
4. **Cut v2.0.0**: remove deprecated launch alias one minor release later (v2.1.0) once metrics show no one's hitting it.
---
## v0.3.0 — 4-6 weeks, the operator chapter
For teams that want to run their own broker, encrypt at the topic level, or wire claudemesh to messaging surfaces beyond Claude Code.
- **Per-topic HKDF encryption** — kills the "broker can read your messages" wart. Symmetric key derived from `mesh.root_key + topic.id`. Web client gets the topic key from the sealed root_key it already holds.
- **Self-hosted broker packaging** — single `docker-compose.yml`, postgres included. CLI accepts `--broker wss://...` to point anywhere. Federation primer.
- **WhatsApp gateway** — peer bot that forwards a topic to a WhatsApp group.
- **Telegram gateway** — same pattern.
- **Tag routing** — `claudemesh send tag:repo:billing "deployed"` lands at every peer working on that repo. Already protocol-supported, needs CLI ergonomics + dashboard surface.
v0.3.0 is when teams that want to run their own broker can do so without paying us. Counterintuitively important: it's also when we can charge for hosted with a clean conscience.
---
## v3.0.0 — Anthropic-blessed cut (conditional)
Conditional on Anthropic shipping first-class agent-to-agent channels in Claude Code. We don't control the timing.
### What's load-bearing about today's flag
`--dangerously-load-development-channels server:claudemesh` does two things:
1. Loads the claudemesh MCP server.
2. Tells Claude Code to treat its `experimental.claude/channel` notifications as runtime channel events.
The flag is named `dangerously-load-development-channels` *specifically because* the channel API is experimental and unstable. Some opt-in mechanism will always be required for Claude Code to receive external events from a third-party process — that's a security-model invariant, not a quirk of today's flag. What changes at v3.0.0 is the *form* of the opt-in, not its existence.
### Two scenarios depending on Anthropic's choice
**Scenario A — MCP-channel API graduates.** The same MCP-based push primitive becomes stable.
- MCP wrapper stays (still translates `ws://broker → MCP notification`).
- The `--dangerously-load-development-channels` flag is replaced by a stable settings.json entry — e.g. `mcpServers.claudemesh.acceptChannelNotifications = true`.
- The `experimental.` prefix on the notification namespace goes away.
- Net user-visible change: nothing, because we already write the flag once at install and the user never sees it. The migration is internal: swap the install logic to write the new settings entry instead of the old flag.
**Scenario B — non-MCP transport ships.** Anthropic introduces a sidecar IPC, a native WebSocket subscription declared in settings, or some other primitive.
- The 50-line MCP wrapper from v2.0.0 disappears.
- The daemon plugs into the new transport directly.
- Some opt-in config is still required (settings.json entry, environment variable, etc.) — Claude Code must know to subscribe to the daemon's channel.
- Net user-visible change: still nothing if our `claudemesh install` adapts to write the new opt-in form.
### What disappears regardless
- The `experimental.` prefix on the channel API (it stabilizes).
- The `dangerously-` framing of the flag (the API is no longer experimental).
- The "you have to pass a launch flag to load development channels" mental model.
### What stays regardless
- An opt-in mechanism somewhere (security model invariant).
- The daemon as the lifecycle owner.
- The protocol, schema, broker, topics, web chat — all unchanged.
### Marketing pivot
claudemesh becomes a "hosted backend for Claude's native multi-agent feature" rather than a "Claude Code extension." The product story simplifies regardless of which shape ships, because the user no longer has to think about MCP servers, dangerous flags, or experimental APIs — claudemesh is just there.
Until v3.0.0 lands, v2.x ships with the MCP bridge under the existing flag. v3.0.0 is the migration target, not a planned feature.
---
## Cross-cutting tracks (always-on, not version-gated)
| Track | What it covers | Target version |
|---|---|---|
| Mobile | iOS peer app (thin: push + reply, same JWT identity) | v2.x |
| Browser peer (proper) | IndexedDB ed25519 + WebCrypto crypto_box for the dashboard. Today's web is REST-only; this makes it a true peer. | v2.x |
| Peer transcript queries | "Hey Claude2, what have you touched in the last hour?" cross-session memory primitive | v0.3.0+ |
| Mesh analytics | Volume, presence, handoff latency dashboards | v0.3.0 |
| Slack peer (first-party) | Today: build-your-own. Goal: shipped natively. | v0.3.0 |
---
## Deliberate exclusions
| Idea | Why deferred |
|---|---|
| Custom bot framework / plugin marketplace | Premature — claudemesh barely has organic users. Build the user base first, then platform. |
| Voice channels | Out of scope. Different product. |
| Video chat | Same. |
| Email-as-peer (incoming SMTP → mesh) | Has demand from one user; ship if 3+ ask. |
| AI summarization of channels | LLM cost + scope creep. Users can wire their own with the existing message API. |
| Mobile push notifications via APNs/FCM | Wait for the iOS peer app, then revisit. |
| Reactions / threading | Not yet — would muddle the protocol surface for marginal value. Reconsider after v0.3.0 user feedback. |
---
## Single-sentence summary
**Polish v1.6.x → ship v1.7.0 demo → commit v2.0.0 daemon → open the operator chapter at v0.3.0 → plug into native channels at v3.0.0 when Anthropic ships them.** Each release stands on its own. The protocol, the schema, the broker, and the topics are all already correct — what changes is the lifecycle owner around them.

@@ -0,0 +1,178 @@
# Topic-key onboarding — v0.3.0 phase 2
The schema for per-topic encryption is shipped (migration 0026). The
broker generates a 32-byte XSalsa20-Poly1305 key when a topic is
created and seals one copy for the creator via `crypto_box`. The open
question is **how new joiners get their sealed copy** without giving
the broker the plaintext.
This spec covers the three live options, picks one for v0.3.0 phase 2,
and parks the rest as future cuts. Implementation is **not in this
spec** — that follows once we ship the chosen flow.
---
## The constraint
The broker holds:
- `topic.encrypted_key_pubkey` — the ephemeral x25519 pubkey used to
seal each member's copy. Public. The matching secret is **discarded
immediately after creation** — only the topic creator's session
knows the topic key briefly during sealing, then it leaves memory.
- `topic_member_key.(encrypted_key, nonce)` — per-member sealed
ciphertext.
The broker **must not** be able to decrypt any sealed copy. So when a
new member joins a topic that already exists, the broker can't seal a
copy for them by itself.
## Option A — server-side escrow (REJECTED)
Broker holds the topic key encrypted under its own service key + per-
member sealed copies. Re-sealing for new members is a server-only
operation.
**Why rejected:** the broker can read every message in every topic
forever. Calling that "per-topic encryption" misleads users. Worse
than today's plaintext-base64 because it implies a security property
the design doesn't deliver.
## Option B — member-driven re-seal (CHOSEN for phase 2)
When a new member joins, an existing member's CLIENT decrypts their
own sealed copy of the topic key, then seals a new copy for the
joiner and POSTs it to the broker.
**Wire:**
1. New member joins via `claudemesh topic join <topic>` — broker
   inserts `topic_member` row, no `topic_member_key` row.
2. New member calls `GET /v1/topics/:name/key` → 404 with
   `key_not_sealed_for_member`.
3. Existing online members (any of them) periodically poll
   `GET /v1/topics/:name/pending-seals` (new endpoint) and see the
   new joiner.
4. Existing member's client:
   - Decrypts their own sealed copy via `crypto_box_open` with their
     x25519 secret + `topic.encrypted_key_pubkey`.
   - Generates a fresh ephemeral x25519 keypair.
   - Seals the topic key for the joiner via `crypto_box` with the
     joiner's pubkey + the new ephemeral.
   - POSTs the result to `POST /v1/topics/:name/seal`.
5. Broker stores the new `topic_member_key` row.
6. New member's `GET /v1/topics/:name/key` now returns 200.
**Trust model:** broker never sees plaintext. Assumes at least one
existing member is online when the joiner connects. Worst case the
joiner waits — UI shows "waiting for a peer to share the topic key"
until somebody seals.
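The client-side part of this flow (step 4) can be sketched as a pure function that turns a pending-seals list into submissions. This is a sketch under stated assumptions: the crypto is injected so it stays library-agnostic, and the names `PendingJoiner`, `SealSubmission`, `SealOps`, and `buildSealSubmissions` are hypothetical, as is the payload shape. In practice `openOwnCopy` and `sealFor` would wrap `crypto_box_open` and `crypto_box` from a libsodium binding.

```typescript
type Bytes = Uint8Array;

interface PendingJoiner { memberId: string; pubkey: Bytes }
interface SealSubmission { memberId: string; encryptedKey: Bytes; nonce: Bytes }

// Injected crypto operations (assumed interface, not the real CLI helpers).
interface SealOps {
  /** Decrypt this member's own sealed copy; null if we hold no key yet. */
  openOwnCopy(): Bytes | null;
  /** Seal the topic key for one joiner using a fresh ephemeral keypair. */
  sealFor(joinerPubkey: Bytes, topicKey: Bytes): { encryptedKey: Bytes; nonce: Bytes };
}

/** Build one seal submission per pending joiner, or none if we cannot decrypt. */
function buildSealSubmissions(pending: PendingJoiner[], ops: SealOps): SealSubmission[] {
  const topicKey = ops.openOwnCopy();
  if (topicKey === null) return []; // we are still waiting for a key ourselves
  return pending.map((joiner) => ({
    memberId: joiner.memberId,
    ...ops.sealFor(joiner.pubkey, topicKey),
  }));
}
```

Each returned submission would be sent with `POST /v1/topics/:name/seal`. Keeping the planning step pure makes the "no key yet, do nothing" case easy to test without a broker.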
**Open detail — sender pubkey identity:** each re-seal uses a fresh
ephemeral pubkey. Either:
(a) Store ALL ephemeral pubkeys ever used to seal copies of this
topic, indexed by member, so the joiner can pick the right one
when decrypting. Adds a new table.
(b) Embed the ephemeral pubkey in the sealed payload itself (
`encrypted_key` becomes `<32-byte ephem_pubkey><crypto_box_easy>`).
Decoder pulls the prefix, uses it as the sender pubkey. No schema
change beyond what 0026 already ships.
**(b) wins on simplicity. Phase 3 implementation ships it. Both the
broker creator-seal and the CLI re-seal write the
`<32-byte sender pubkey><cipher>` blob.** `topic.encrypted_key_pubkey`
becomes informational only — the wire-format truth is the inline prefix.
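A minimal sketch of the inline-prefix blob from option (b), assuming a 32-byte x25519 sender pubkey followed directly by the `crypto_box` output. The helper names `packSealedKey` and `unpackSealedKey` are hypothetical and are not taken from `services/crypto/topic-key.ts`.

```typescript
const PUBKEY_LEN = 32; // x25519 public key size in bytes

/** Prefix the sealed ciphertext with the ephemeral sender pubkey. */
function packSealedKey(senderPubkey: Uint8Array, cipher: Uint8Array): Uint8Array {
  if (senderPubkey.length !== PUBKEY_LEN) {
    throw new Error(`sender pubkey must be ${PUBKEY_LEN} bytes`);
  }
  const blob = new Uint8Array(PUBKEY_LEN + cipher.length);
  blob.set(senderPubkey, 0);
  blob.set(cipher, PUBKEY_LEN);
  return blob;
}

/** Split a stored blob back into the sender pubkey and the ciphertext. */
function unpackSealedKey(blob: Uint8Array): { senderPubkey: Uint8Array; cipher: Uint8Array } {
  if (blob.length <= PUBKEY_LEN) throw new Error("blob too short to hold a pubkey and ciphertext");
  return {
    senderPubkey: blob.slice(0, PUBKEY_LEN),
    cipher: blob.slice(PUBKEY_LEN),
  };
}
```

The decoder never consults `topic.encrypted_key_pubkey`, which matches the note that the inline prefix is the wire-format truth.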
## Web client gap (phase 3.5)
The CLI side of phase 3 ships in this cut. The web side does NOT —
because web member rows have `peerPubkey` registered server-side but
the corresponding ed25519 SECRET is discarded immediately after
generation (see `mutations.ts:createMyMesh`). Without the secret the
browser can't `crypto_box_open` its sealed topic key.
Three fixes, in increasing order of effort:
1. **Browser-side persistent identity (recommended)** — generate an
ed25519 keypair in the browser on first dashboard visit, store the
secret in IndexedDB, sync the public half to `mesh.member.peerPubkey`
via a new `POST /v1/me/peer-pubkey` endpoint. Topic keys then seal
to the new pubkey; web user decrypts locally. Existing #general
topics need a re-seal cycle (the v0.3.0 phase-3 re-seal loop in
the CLI already does this for any pending member, including web
ones). Spec lift: ~3 hours, mostly browser code + a sync endpoint.
2. **Server-held secret** — keep the member's ed25519 secret server-
side. Trivial to implement, but the broker can read everything,
defeating the security claim. **Rejected.**
3. **JWT-derived keys** — derive the member's keypair from a stable
user-secret (e.g. PBKDF2 over their session JWT). Means cross-
device same key, but needs the JWT to include ~32 bytes of stable
key material. Tied to v2.0.0 daemon redesign. **Deferred.**
Phase 3 ships with option 1 deferred to 3.5; web stays on v1 plaintext until then.
The CLI re-seal loop in `topic tail` already handles re-sealing for
web members ONCE they have a real pubkey — no broker work needed
when 3.5 lands.
## Option C — leaderless protocol (DEFERRED)
MLS, TreeKEM, or similar continuous group key agreement. Right answer
for groups >50 members. Overkill for v0.3.0 — implementation cost is
4-6 weeks of focused work, and the threat model gain over Option B
only matters if we believe a member's machine can be silently
compromised long enough to leak the topic key but short enough that
they aren't kicked from the topic.
Park for v0.4.0 or v0.5.0. Revisit when we onboard a customer that
asks for FS (forward secrecy) on group chat.
---
## Implementation checklist
Schema (0026 — done):
- [x] `topic.encrypted_key_pubkey` (informational; wire truth is the
inline 32-byte prefix on each `topic_member_key.encryptedKey`)
- [x] `topic_member_key.(encrypted_key, nonce)`
- [x] `topic_message.body_version` (1 = plaintext, 2 = v2 ciphertext)
API (phase 3 — done):
- [x] `GET /v1/topics/:name/key` — fetch the calling member's sealed copy
- [x] `GET /v1/topics/:name/pending-seals` — list members without keys
- [x] `POST /v1/topics/:name/seal` — submit a re-sealed copy
- [x] `GET /v1/topics/:name/messages` returns `bodyVersion`
- [x] `GET /v1/topics/:name/stream` emits `bodyVersion`
- [x] `POST /v1/messages` accepts `bodyVersion` (1|2) + skips regex
mention extraction on v2
Broker / web mutation (phase 3 — done):
- [x] `createTopic` generates topic key + seals for creator with
inline-sender-pubkey blob format
- [x] `ensureGeneralTopic` (web) mirrors the same flow
Client — CLI (phase 3 — done):
- [x] `services/crypto/topic-key.ts` — fetch + decrypt + encrypt + reseal helpers
- [x] `topic tail` decrypts v2 messages on render
- [x] `topic post` encrypts v2 on send via REST POST /v1/messages
- [x] Background re-seal loop in `topic tail` (30s cadence)
Client — web (phase 3.5 — DEFERRED):
- [ ] Browser-side persistent identity (IndexedDB)
- [ ] `POST /v1/me/peer-pubkey` sync endpoint
- [ ] Web chat panel encrypt-on-send + decrypt-on-render (currently v1)
UX surfaces (phase 3 — done in CLI):
- [x] "waiting for a peer to share the topic key" warning on tail
- [ ] (web) "your encryption keys are pending — pair this browser"
banner once 3.5 lands
Mention fan-out from phase 1 already works for both v1 and v2
messages, so `/v1/notifications` keeps working through the cutover.
The phase-3 cut ships full CLI encryption + re-seal flow. Web remains
on v1 plaintext until 3.5 lands the browser identity layer. Mixed
CLI+web meshes in the meantime should keep using v1 sends OR accept
that web members can't read v2 messages.

@@ -0,0 +1,273 @@
# claudemesh v0.2.0 — scope
**Date:** 2026-05-02
**Status:** draft
**Predecessor:** [`2026-05-02-architecture-north-star.md`](./2026-05-02-architecture-north-star.md) (1.5.0 architecture lock)
---
## Cut
**Theme: from agent-only mesh to mesh of agents, humans, and external systems — with conversation context.**
| # | Feature | Effort | Spine |
|---|---------|--------|-------|
| 1 | **Topics** (channels/rooms within a mesh) | 2-3 d | yes |
| 2 | **Humans in the mesh** (web chat panel) | 2-3 d | depends on #1 |
| 3 | **REST API + external WS** (API keys per mesh) | 2-3 d | depends on #1 |
| 4 | **Bridge peer** (forwards one topic between meshes) | 1 d | depends on #1 |
Optional pickup if all four ship early:
- **Local peer aliases** (~0.5 d) — IRC-style local labels for hard-to-remember displayNames.
- **Semantic peer search** (~0.5 d) — already in vision doc; useful once topics exist.
Total: 7-9 days plus 1-2 days slack. Targeting **release window: 2026-05-12 to 2026-05-16**.
---
## Why this cut
The 1.5.0 architecture (CLI-first, tool-less MCP, policy engine) is finished. The next bottleneck is **product surface**, not engineering.
The current taxonomy `mesh + group + role` is the right *organizational* structure but lacks a *conversational* primitive. Every message is either a DM or an `@group` broadcast: there's no continuity for "the deploys conversation," no scoped state/memory/files, no way for a human to join a topic without joining the whole mesh, and no way for a bridge to forward a single thread of work.
**Topics fix this.** They are the spine of v0.2.0:
- Without topics, "humans in mesh" floods every human with every peer's chatter.
- Without topics, "bridge" forwards everything (loop risk, signal-to-noise problem).
- Without topics, REST API endpoints have no natural sub-mesh scope.
Once topics exist, humans + REST + bridge each become 50% smaller because they slot into a clean primitive instead of inventing one.
---
## Deferred
| Item | Why later |
|---|---|
| **Federation** (broker-to-broker) | Bridges prototype it. Learn from real use first. |
| **Sandboxes** (E2B / Modal) | Orthogonal capability. Separate release. |
| **Sim SDK** (`@claudemesh/sim`) | Niche audience; long-tail. v0.3.0+. |
| **Welcome back / persistent MCP** | Already in progress as 1.6.0 patch. |
| **Mesh telemetry** | Pre-PMF telemetry is busywork; users first. |
---
## Design sketches
### 1. Topics
**Mental model:** mesh is *who you trust*; group is *who you are*; topic is *what you're talking about*. Three orthogonal axes.
**Wire shape:**
```yaml
topic:
  id: <ulid>
  mesh_slug: openclaw
  name: deploys              # unique within mesh
  description: "deploy + on-call"
  visibility: public         # public | private (invite-only) | dm (1:1, autocreated)
  created_by: <pubkey>
  created_at: <ts>
```
**Membership:**
```yaml
topic_member:
  topic_id: <ulid>
  pubkey: <hex>              # session pubkey OR member_pubkey for durable identity
  role: lead | member | observer
  joined_at: <ts>
  last_read_at: <ts>         # for unread counts
```
**Messages reference a topic, not just a target:**
```jsonc
// existing send_message envelope gains a `topic` field
{
  "to": "#deploys",       // or topic id, or peer name (DM)
  "topic": "deploys",     // optional explicit, inferred from `to: #<topic>`
  "message": "...",
  "priority": "next"
}
```
**Resolution rules:**
- `to: "alice"` → DM to peer alice (no topic).
- `to: "@frontend"` → group broadcast (no topic — backwards compatible with 1.5.0).
- `to: "#deploys"` → topic message; delivered only to topic subscribers.
- `to: "*"` → mesh-wide broadcast (kept; lower-priority than topic for new comms).
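The resolution rules above can be sketched as a small parser. This is an illustration only: the `Target` type and `resolveTarget` name are hypothetical, and rejecting empty names is an assumption the spec does not state.

```typescript
type Target =
  | { kind: "broadcast" }
  | { kind: "topic"; name: string }
  | { kind: "group"; name: string }
  | { kind: "dm"; peer: string };

/** Classify the `to` field of a send_message envelope. */
function resolveTarget(to: string): Target {
  if (to === "*") return { kind: "broadcast" };
  const sigil = to.charAt(0);
  const rest = to.slice(1);
  if (sigil === "#" || sigil === "@") {
    // Assumed behavior: a bare sigil with no name is an error.
    if (rest.length === 0) throw new Error(`empty name in target "${to}"`);
    return sigil === "#" ? { kind: "topic", name: rest } : { kind: "group", name: rest };
  }
  if (to.length === 0) throw new Error("empty target");
  return { kind: "dm", peer: to }; // plain name: DM, no topic
}
```

Because `@group` and `#topic` are distinguished purely by sigil, existing 1.5.0 senders that only use peer names, `@group`, or `*` resolve exactly as before.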
**State/memory/files scoping:**
- `claudemesh state set <k> <v> --topic deploys` — namespace under topic.
- `claudemesh remember "..." --topic deploys` — topic-scoped memory.
- `claudemesh file list --topic deploys` — files visible only to topic members.
**CLI:**
```bash
claudemesh topic create deploys --description "deploy + on-call"
claudemesh topic list # all topics in mesh
claudemesh topic join deploys
claudemesh topic leave deploys
claudemesh topic invite deploys <peer> # private topics
claudemesh topic members deploys
claudemesh topic delete deploys # creator/admin only
claudemesh send "#deploys" "rolling out 1.5.1"
```
**MCP `claude/channel` notification gains `topic`** as an attribute so peers know which conversation an inbound message belongs to.
**Effort breakdown:** schema + drizzle migration + CLI verbs + broker routing changes (filter by topic membership) + skill update. ~250 LoC across CLI + ~200 LoC broker.
---
### 2. Humans in the mesh
**Mental model:** a human is a peer with `peer_type: "human"` whose presence is durable (no session pubkey rotation; identity tied to an account). They join *topics*, not the whole mesh — so they only see relevant traffic.
> **Implementation update (2026-05-02):** `peer_type: "ai" | "human" | "connector"` is already plumbed end-to-end in the broker (hello envelope, ConnectedPeer, list_peers). What was missing wasn't broker support — it's the **interface** for humans, who don't have browser-side ed25519 to do hello-sig. Realistic path: **REST API is the human interface** (rolled into #3 below). The web chat panel becomes a thin client that posts/reads via REST using the dashboard user's session auth — not its own keypair. This collapses #2 and #3 into a single deliverable: REST → UI on top.
**Wire:**
```jsonc
// hello envelope gains:
{
  "peer_type": "human",
  "session_pubkey": <ephemeral, per browser tab>,
  "member_pubkey": <durable, account-tied>,
  "display_name": "Alejandro"
}
```
**Web panel (`apps/web`):**
```
/dashboard/mesh/<slug>/topic/<topic-name>
├── topic header (members, settings)
├── message stream (WS-driven, infinite scroll on history)
├── compose box (typing indicator broadcast on focus)
└── members sidebar (presence, profile, last_read_at)
```
**Backend changes:**
- Persistent message history per topic (drizzle table `topic_messages`; existing direct messages stay ephemeral by design).
- Topic-scoped read receipts (`topic_member.last_read_at`).
- Typing indicator: short-lived broadcast on the topic channel (`{type: "typing", peer: "..."}`).
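As a rough illustration of how `topic_member.last_read_at` yields an unread count, the sketch below counts persisted messages newer than the read marker. The `TopicMessage` shape and the rule that a missing `last_read_at` means "everything is unread" are assumptions; a real implementation would more likely do this as a SQL count.

```typescript
interface TopicMessage { id: string; createdAt: Date }

/** Number of messages created strictly after the member's read marker. */
function unreadCount(messages: TopicMessage[], lastReadAt: Date | null): number {
  if (lastReadAt === null) return messages.length; // never read (assumed semantics)
  const marker = lastReadAt.getTime();
  return messages.filter((m) => m.createdAt.getTime() > marker).length;
}
```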
**Privacy invariant:** a human in `#deploys` sees only `#deploys` traffic + DMs sent to them. Never the whole mesh. This is the *whole reason* topics come first.
**Effort:** WS endpoint already exists (broker side). Add: topic_messages table, history endpoint, web UI components (compose, stream, members). ~3 days.
---
### 3. REST API + external WS
**Auth:** API keys per mesh, scoped by capability + topic.
```yaml
api_key:
  id: <ulid>
  mesh_slug: openclaw
  label: "ci-bot"
  hash: <argon2id>
  capabilities: ["send", "read"]
  topic_scopes: ["#deploys"]   # null = all topics; explicit = whitelist
  created_at: <ts>
  last_used_at: <ts>
  revoked_at: <ts | null>
```
**CLI for issuance (admin only):**
```bash
claudemesh apikey create --label "ci-bot" --topic deploys --cap send,read
claudemesh apikey list
claudemesh apikey revoke <id>
```
**REST endpoints (claudemesh.com/api/v1):**
```
POST   /v1/messages                Send a message (auth: api key).
GET    /v1/topics/:name/messages   History (with pagination cursor).
GET    /v1/peers                   List online peers (filtered by key scope).
GET    /v1/state                   Read mesh state.
POST   /v1/state                   Write mesh state.
```
**External WS:** `wss://ic.claudemesh.com/ws?api_key=...&topic=deploys` — connects with `peer_type: "external"`. Push-pipe parity with internal sessions; can subscribe to topic streams.
**Why REST keys not session keypairs:** external clients (Zapier, GitHub Actions, mobile apps, Slack workspace bots) need long-lived bearer-like creds, not ephemeral keypairs. Different threat model — scope tightly via topic + capability.
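The capability + topic scoping described here can be sketched as a single check. The field names mirror the `api_key` shape above in camelCase; the `ApiKey` interface and `isAllowed` function are hypothetical, not the broker's actual code.

```typescript
interface ApiKey {
  capabilities: string[];        // e.g. ["send", "read"]
  topicScopes: string[] | null;  // null = all topics; explicit list = whitelist
  revokedAt: Date | null;
}

/** True when the key may perform `capability` on `topic`. */
function isAllowed(key: ApiKey, capability: string, topic: string): boolean {
  if (key.revokedAt !== null) return false; // revoked keys never pass
  if (!key.capabilities.includes(capability)) return false;
  return key.topicScopes === null || key.topicScopes.includes(topic);
}
```

With a key scoped to `["#deploys"]`, a send to `#deploys` passes and a send to `#secrets` is rejected, which is the behavior the third acceptance signal expects.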
**Effort:** ~3 days. Mostly broker work; CLI gets the issuance verbs.
---
### 4. Bridge peer
**Mental model:** a bridge is a peer that holds memberships in two meshes and forwards traffic on a single topic between them. SDK-only (no broker changes).
**Implementation (uses existing `@claudemesh/sdk`):**
```typescript
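import { Bridge } from "@claudemesh/sdk";

// Forward a single topic between two meshes the bridge is a member of.
const bridge = new Bridge({
  meshes: ["work", "external"],
  topic: "incidents",
  // Drop anything tagged internal before it crosses the mesh boundary.
  filter: (msg) => !msg.tags.includes("internal-only"),
  loop_prevention: { tag: "via-bridge", max_hops: 2 },
});

await bridge.start();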
```
**Loop prevention:** every forwarded message gets a `bridge_hop_<n>` tag; bridges drop messages that already carry their own tag (prevents echo) and any message with `max_hops` exceeded.
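One way the hop-tag rule could work is sketched below. The exact tag shapes are assumptions: the sketch uses `via-bridge:<bridgeId>` as each bridge's own tag and `bridge_hop_<n>` as the hop counter, and `prepareForward` is a hypothetical helper, not an SDK function.

```typescript
interface BridgedMessage { body: string; tags: string[] }

const HOP_TAG = /^bridge_hop_(\d+)$/;

/** Highest hop number recorded on the message, 0 if it was never bridged. */
function hopCount(tags: string[]): number {
  let hops = 0;
  for (const tag of tags) {
    const match = HOP_TAG.exec(tag);
    if (match) hops = Math.max(hops, Number(match[1]));
  }
  return hops;
}

/** Return the message to forward, or null when it must be dropped. */
function prepareForward(msg: BridgedMessage, bridgeId: string, maxHops: number): BridgedMessage | null {
  const ownTag = `via-bridge:${bridgeId}`; // assumed per-bridge tag format
  if (msg.tags.includes(ownTag)) return null; // already passed through us: echo
  const hops = hopCount(msg.tags);
  if (hops >= maxHops) return null; // hop budget exhausted
  const kept = msg.tags.filter((tag) => !HOP_TAG.test(tag));
  return { ...msg, tags: [...kept, ownTag, `bridge_hop_${hops + 1}`] };
}
```

Carrying both a per-bridge tag and a hop counter covers the two failure modes separately: the own-tag check stops direct echoes, and the counter bounds longer cycles through several bridges.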
**CLI:** `claudemesh bridge run <config.yaml>` — runs an SDK bridge as a long-lived process. Useful for "run a bridge inside a docker container or systemd unit."
**What it deliberately doesn't do:**
- Cross-broker federation (that's a separate broker-to-broker protocol).
- Bidirectional state/memory sync (only messages on a single topic).
- Identity unification (a peer in mesh A is *not* the same peer in mesh B; the bridge appears as the messenger).
**Effort:** ~1 day on top of the existing SDK.
---
## Acceptance signals
v0.2.0 ships when all four are demonstrable end-to-end:
1. A peer creates `#deploys`, two other peers join it, traffic is topic-scoped, mesh-wide chat doesn't see it.
2. A human signs in at `claudemesh.com`, joins `#deploys`, sends a message, a Claude session in the mesh receives it as a `<channel>` interrupt with `topic="deploys"`.
3. A `curl` POST against `/v1/messages` with an API key delivers a message into `#deploys`; the same API key is rejected on `#secrets`.
4. A bridge peer running locally forwards `#incidents` between two test meshes; loop is prevented; one-shot demo recorded.
---
## Out of scope (explicitly)
- Topic hierarchy / nesting (flat namespace per mesh; revisit at scale).
- Topic-scoped capability grants (`grant <peer> read:#topic`) — solvable later via capability extension.
- Threads-within-topics (Slack-style). Defer.
- Voice / video / file-upload UX for humans — text only in v0.2.0.
- Federation, sandboxes, sim-sdk — explicitly deferred above.
---
## Risks
- **Topics retrofit risk** — existing 1.5.0 message envelope assumes "to" is peer/group/star. Adding `topic` is additive on the wire but changes routing logic. Test path: backfill existing meshes with a default `#general` topic; opt-in to topic-only routing.
- **Web chat session lifecycle** — humans expect "I closed the tab and came back, my place is preserved." Ephemeral session pubkeys break that. Workaround: tie human peer identity to `member_pubkey` + last_read_at on the topic; session pubkey rotates per tab but membership is durable.
- **API key abuse** — leaked keys = anyone can post. Mitigations: capability + topic scoping; rate limits per key; `last_used_at` + audit trail; revoke verb is fast.
---
## Open questions
1. Do existing `@group` semantics survive intact, or do we collapse `@group` and `#topic` into one primitive? (Answer favored: keep both — different axes.)
2. Should topics persist messages by default, or be opt-in? (Default: yes for `peer_type: "human"`-touched topics; configurable per topic for agent-only ones.)
3. Where does mesh-MCP discovery live in the topic model — per topic or per mesh? (Likely per mesh; mesh-MCP is infrastructure, not conversation.)

@@ -0,0 +1,204 @@
# Workspace view — per-user superset over joined meshes
**Status:** spec / not started
**Target:** v0.4.0
**Author:** Alejandro
**Date:** 2026-05-02
## Why
Users routinely belong to multiple meshes — work, personal, side
projects, ECIJA + flexicar + openclaw + prueba1 in our own dogfood.
Today's CLI is mesh-scoped: every read or write either auto-picks the
default mesh or forces an interactive picker. Common questions like
*"who's online across all my meshes?"* or *"any new @-mentions
anywhere?"* require N round-trips, one per mesh.
A few verbs already aggregate implicitly (`peer list`, `inbox`,
`list`), but the surface is patchy and inconsistent.
We want the equivalent of "all my Slacks in one sidebar" — without
breaking the per-mesh trust model that v0.3.0 was built around.
## What it is NOT
- **Not a literal universal mesh.** A single global mesh everyone
joins collapses the trust boundary, blows up broadcast fan-out
(O(users²)), and turns into spam. See the universal-mesh discussion
rejected in this same session.
- **Not federation.** Federation is the broker-side equivalent
(already roadmapped under v0.3.0). Workspace is purely client-side.
- **Not identity stitching for *other* peers.** `Mou@openclaw` and
`Mou@flexicar-2` may or may not be the same human. Don't auto-merge.
Stitching MY identities is fine — local config knows.
## What it IS
A virtual layer that aggregates reads across the meshes the user has
joined, while keeping writes mesh-scoped. Pure projection over
existing per-mesh tables. Zero broker changes. Zero protocol changes.
```
    ┌──────────────────────────────┐
    │          workspace           │
    │   (per-user view, client)    │
    └─┬────────┬────────┬─────────┬┘
      │        │        │         │
┌─────▼──┐ ┌───▼──┐ ┌───▼──┐ ┌────▼──┐
│ mesh A │ │  B   │ │  C   │ │  ...  │
└────────┘ └──────┘ └──────┘ └───────┘
(each remains its own crypto + trust domain)
```
## Surface
### New verbs (all read-only, all aggregating)
```bash
claudemesh me # overview: meshes, online peers, unread, tasks
claudemesh me topics # all subscribed topics, namespaced
claudemesh me notifications # cross-mesh @-mentions feed
claudemesh me activity # cross-mesh recent send/recv/topic-post
claudemesh me search "<q>" # full-text across memory + topics + tasks
```
`claudemesh me` (no subcommand) prints a one-screen dashboard:
```
workspace — agutmou (4 meshes · 23 peers visible · 2 unread @you)
  meshes
    openclaw     7 peers · 3 topics · last activity 2m
    flexicar-2   5 peers · 1 topic · last activity 18m
    prueba1      4 peers · idle
    ECIJA        7 peers · 2 topics · 1 @you · last activity 4h
  unread @-mentions
    ECIJA · #incident-2026-05-02 · 1 from coronel-abos
    openclaw · #deploys · 1 from claudemesh-2
  pending tasks (3)
    ECIJA   ship-F4-cliente   high   claimed by you
    ...
```
### Default-aggregation rule for existing verbs
When `--mesh` is omitted on a *read-only* verb, aggregate. When
`--mesh` is omitted on a *write* verb, fall back to current behavior
(default mesh or interactive picker). Already-aggregating verbs keep
working unchanged.
| Verb | Today | After workspace |
|---|---|---|
| `peer list` | aggregates ✅ | unchanged |
| `inbox` | aggregates ✅ | unchanged |
| `list` | aggregates ✅ (lists meshes) | unchanged |
| `notification list` | mesh-scoped | aggregates by default |
| `topic list` | mesh-scoped | aggregates with namespacing |
| `task list` | mesh-scoped | aggregates by default |
| `state list` | mesh-scoped | aggregates by default |
| `memory recall` | mesh-scoped | aggregates by default |
| `info` / `stats` / `ping` | mesh-scoped | unchanged (per-mesh diagnostics) |
| `send`, `topic post`, `state set`, `remember`, ... | mesh-scoped | unchanged (writes pick a mesh) |
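The default-aggregation rule and the table above could be encoded as follows. This is a sketch: the verb list is copied from the table, while the `Scope` type and `resolveScope` function are hypothetical names.

```typescript
// Read-only verbs that aggregate across meshes when --mesh is omitted.
const AGGREGATING_VERBS = new Set([
  "peer list", "inbox", "list",
  "notification list", "topic list", "task list", "state list", "memory recall",
]);

type Scope =
  | { mode: "mesh"; mesh: string }   // explicit --mesh always wins
  | { mode: "aggregate" }            // fan out over all joined meshes
  | { mode: "default-or-picker" };   // current behavior: default mesh or picker

/** Decide how a verb is scoped given the optional --mesh flag. */
function resolveScope(verb: string, meshFlag?: string): Scope {
  if (meshFlag !== undefined) return { mode: "mesh", mesh: meshFlag };
  if (AGGREGATING_VERBS.has(verb)) return { mode: "aggregate" };
  return { mode: "default-or-picker" };
}
```

Writes and per-mesh diagnostics (`info`, `stats`, `ping`) fall through to the last branch, so they keep today's behavior without needing their own list.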
### Rendering rules for aggregated views
1. **Topic namespacing.** `#deploys` exists in two meshes — they're
different rooms. Render as `openclaw/#deploys`. Inside a
mesh-scoped command, keep the bare `#deploys` shorthand.
2. **Peer name collisions.** `Mou@openclaw` notation when the same
display name resolves in more than one mesh. Single resolution =
bare name.
3. **Time-grouped activity.** `me activity` sorts globally by ts
descending; mesh tag is shown as a dim suffix.
4. **Unread roll-up.** `me notifications` is a per-row
`[mesh][topic][snippet]` list, newest first.
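Rules 1 and 2 can be sketched as two pure rendering helpers. The `TopicRef` and `PeerRef` shapes and both function names are assumptions made for illustration; the spec does not define them.

```typescript
// Hypothetical shapes; `topic` includes its leading '#'.
interface TopicRef { mesh: string; topic: string }
interface PeerRef { mesh: string; displayName: string }

// Rule 1: in aggregated views a topic is qualified by its mesh;
// inside a mesh-scoped command the bare shorthand is kept.
function renderTopic(t: TopicRef, aggregated: boolean): string {
  return aggregated ? `${t.mesh}/${t.topic}` : t.topic;
}

// Rule 2: qualify a peer name only when the same display name
// resolves in more than one mesh.
function renderPeers(peers: PeerRef[]): string[] {
  const meshesByName = new Map<string, Set<string>>();
  for (const p of peers) {
    const meshes = meshesByName.get(p.displayName) ?? new Set<string>();
    meshes.add(p.mesh);
    meshesByName.set(p.displayName, meshes);
  }
  return peers.map((p) =>
    (meshesByName.get(p.displayName)?.size ?? 0) > 1
      ? `${p.displayName}@${p.mesh}`
      : p.displayName,
  );
}
```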
## API surface (REST)
Mirror the read aggregations server-side so the dashboard + future
mobile/web UIs share the same endpoints.
```
GET /v1/me # workspace overview
GET /v1/me/meshes # joined meshes + summary stats
GET /v1/me/topics # all subscribed topics, all meshes
GET /v1/me/notifications # cross-mesh @-mentions
GET /v1/me/activity # unified activity feed
GET /v1/me/peers # already implicit; formalize
GET /v1/me/search?q=... # full-text across tables
```
Auth: needs a *user-scoped* api key (one issued per user, sees all
their meshes), which we don't have today — current keys are
mesh-scoped. Two options:
- **(a) Per-user key.** New token type `cm_u_...` issued by the
dashboard, scopes to all meshes the issuing user belongs to. Cheaper
to build; harder to reason about because the blast radius is
larger if leaked.
- **(b) Multi-mesh aggregation.** Accept N mesh-scoped keys
concurrently; CLI auto-mints them via the existing `withRestKey`
pattern, one per joined mesh. No new key type. More round-trips on
cold start, but rotation/revocation stays simple.
**Recommendation: (b).** Reuses today's auth model, doesn't widen the
blast radius, and the ephemeral keys we already mint per-command keep
the surface area minimal. The CLI orchestrates the fan-out
client-side.
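A minimal sketch of option (b)'s client-side fan-out, assuming one mesh-scoped key per joined mesh is already in hand. The `FetchActivity` callback stands in for the real REST call, and every type and function name here is hypothetical; how keys are minted (`withRestKey`) is out of the sketch's scope.

```typescript
// Hypothetical shapes; the real response types are not specified in this spec.
interface ActivityItem { ts: number; text: string }
interface MeshResult { mesh: string; items: ActivityItem[] }
interface TaggedItem extends ActivityItem { mesh: string }
interface MeshKey { mesh: string; apiKey: string }
type FetchActivity = (mesh: string, apiKey: string) => Promise<ActivityItem[]>;

// Merge per-mesh results into one feed, newest first, each row tagged with its mesh.
function mergeByTsDesc(results: MeshResult[]): TaggedItem[] {
  const rows: TaggedItem[] = [];
  for (const r of results) {
    for (const item of r.items) rows.push({ ts: item.ts, text: item.text, mesh: r.mesh });
  }
  return rows.sort((a, b) => b.ts - a.ts);
}

// Fan out one request per joined mesh, each with its own mesh-scoped key.
// Assumption: a failing mesh contributes an empty list rather than failing the whole view.
async function fetchWorkspaceActivity(
  keys: MeshKey[],
  fetchActivity: FetchActivity,
): Promise<TaggedItem[]> {
  const results = await Promise.all(
    keys.map((k): Promise<MeshResult> =>
      fetchActivity(k.mesh, k.apiKey).then(
        (items) => ({ mesh: k.mesh, items }),
        () => ({ mesh: k.mesh, items: [] }),
      ),
    ),
  );
  return mergeByTsDesc(results);
}
```

Whether a failed mesh should be silently skipped or surfaced as a degraded row is a design choice the spec leaves open; the sketch picks the simpler one.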
## Storage
Pure projection at first. The cross-mesh queries are SELECT joins
over `mesh_member`, `mesh_topic`, `mesh_topic_member`,
`mesh_notification`, `mesh_topic_message`, `mesh_task`, `presence`.
If `me` queries become hot (likely once dashboards land), add a
materialized `user_workspace_view` refreshed on writes. Don't
optimize early.
## Effort
| Component | Effort |
|---|---|
| CLI verbs (`me`, `me topics`, etc.) | 1.5 days |
| Default-aggregation rule across existing verbs | 0.5 day |
| REST endpoints `/v1/me/*` | 1 day |
| Multi-mesh apikey orchestration in `withRestKey` | 0.5 day |
| Tests + docs | 0.5 day |
| **Total** | **~4 days** |
## Open questions
1. **`me` as namespace vs. flag.** Could be `claudemesh --workspace
topics` instead of `claudemesh me topics`. The verb form is
shorter and reads better; sticking with it.
2. **Notification ordering.** All notifications globally interleaved
by ts, or per-mesh sections? Default to **interleaved** with mesh
tag prefix; users can `--by-mesh` to group.
3. **Search relevance.** Cross-mesh full-text search is easy when each
mesh has its own pg full-text index. Cross-mesh ranking is the
harder problem (IDF varies). Punt to v0.4.1 — start with simple
tied-rank merge.
4. **Web dashboard.** Should the web dashboard's main view become a
workspace view by default? Yes, but that's downstream of this
spec — once `/v1/me/*` exists, the web rewrite is the obvious
next step.
## Out of scope (v0.4.0)
- Federation / cross-broker workspace.
- Identity stitching for non-self peers.
- Cross-mesh search ranking sophistication.
- Cross-mesh write fan-out (`me broadcast` is intentionally NOT a
verb — too easy to misuse).
- Mobile/web parity beyond the REST endpoints.
## Why we ship this
Because "I want one Slack-like sidebar for all my claudemesh meshes"
is the highest-frequency UX gap users hit, and the answer is about
four days of plumbing (see Effort) on top of what already exists.
Federation is the right answer for cross-organization reach; workspace
is the right answer for *one user, many meshes*. Both compose.

@@ -0,0 +1,506 @@
---
title: claudemesh — full end-state architecture for agentic peer communication
status: draft (v2 — supersedes v1: removes time-boxed phasing, adds P2P data plane, applies Codex-2 correctness/scope-gap edits)
target: end-state (architectural milestones, not version timelines)
author: Alejandro + Claude (Codex GPT-5.2 cross-checked twice)
date: 2026-05-04
supersedes: 2026-05-04-agentic-comms-architecture.md (v1)
references:
- 2026-05-02-architecture-north-star.md (CLI-first commitment, push-pipe)
- 2026-05-04-per-session-presence.md (per-launch session pubkey + attestation)
- apps/cli/CHANGELOG.md (1.30.0 → 1.32.1 history)
---
# claudemesh — agentic peer communication, full end-state
## What this document is
The end-state architecture for claudemesh as a transport-agnostic agentic peer-comms platform. Not a release plan, not a sprint roadmap — the **shape** the system needs to converge on. Implementation order at the end is a *suggestion*, not a contract; time estimates are deliberately omitted because the surface is too cross-cutting to phase by weeks.
v1 of this spec (same date, no `-v2` suffix) treated the broker as the sole data plane. v2 corrects that: **the broker is a coordination plane (signaling, discovery, offline queue, fan-out, registry, revocation); the data plane is hybrid P2P** with broker fallback for the cases P2P can't cover. Closer to how Tailscale, libp2p, LiveKit, and modern WebRTC stacks work in production.
## TL;DR
- **Identity** — three keypair types (member, session, service) all rooted in a member's secret key. Member is durable, session is per-launch, service is a member-scoped delegate for non-Claude integrations. Every service has its own pubkey and explicit revocation.
- **Coordination plane** — broker handles signaling, peer discovery, offline message queue, group/topic fan-out, mesh state authority, revocation gossip. Always reachable.
- **Data plane** — hybrid:
- **P2P first** (WebRTC data channels, future: QUIC) when both peers online + NAT-traversable.
- **Broker-relayed** when peers are NAT-blocked, when one peer is offline, or for group/topic/broadcast where fan-out at the broker is structurally cheaper than N-way sender-side fan-out.
- **Pure broker** for service identities that can't run a P2P stack (HTTP webhook senders, OpenAI Assistants, browser SDKs without WebRTC).
- **Channels** — typed envelope (dm, group, topic, rpc, system, stream). Channel type drives crypto, routing, and transport selection. `meta` is required in v2 envelope.
- **Transports** — pluggable adapters under one interface: WS-to-broker (today), WebRTC P2P, HTTP webhook, future LiveKit/QUIC/etc. Broker negotiates which adapter a peer pair uses.
- **Crypto** — every direct message is E2E encrypted to recipient's pubkey regardless of transport. Broker never sees plaintext. P2P doesn't get any extra trust just because it's direct.
- **Delivery** — at-least-once **requires receiver ack** before broker marks `delivered_at`. The retry path before that is best-effort with idempotent dedupe at the receiver.
The CLI-first commitment from the North Star spec stays intact. Every channel type and every transport is invocable from `claudemesh <verb>`. MCP serves only `claude/channel` mid-turn push.
---
## The forcing functions (why this shape, not a smaller one)
1. **Multi-session interconnect already broke** (1.30.0 → 1.32.1) because the per-session WS subsystem shipped without a push handler. Symptom of "broker is the data plane and we keep bolting on" thinking. Need to formalize roles and transport adapters before the next bolt-on.
2. **Codex review surfaced a correctness bug** in `drainForMember`: it sets `delivered_at = NOW()` *before* the WS push succeeds, so if `ws.readyState !== OPEN` the row is marked delivered and the message is lost. That is at-most-once with no retry, and every channel/transport added later inherits it unless it is fixed at the foundation.
3. **The agentic-comms domain has standardized on hybrid P2P + central coordinator.** Tailscale (control plane + WireGuard P2P), LiveKit (signaling + SFU + P2P data channels), libp2p (DHT discovery + multi-transport), Iroh (gossip + QUIC P2P). Pure-broker is a 2010s pattern; pure-P2P is academic. Hybrid is the norm.
4. **claudemesh's pricing/economics demand P2P.** Every byte through the broker is our cost. Voice transcripts, file transfers, real-time tool I/O are all bandwidth-heavy. A P2P data plane lets the broker scale linearly with peer count, not message volume.
5. **Privacy/sovereignty matters as the agent ecosystem grows.** "Your agents talk to my agents" should default to peer-to-peer paths when possible. Broker as relay is fine; broker as forced middleman is not.
---
## Audience for this architecture
| Peer type | Identity | Online presence | Data plane preference | Notes |
|---|---|---|---|---|
| **Claude Code session** | Per-launch session pubkey, member-attested | WS to broker (control + signaling) | P2P first, broker fallback | Mid-turn push via MCP `claude/channel` |
| **Daemon, no launch** (idle Mac with daemon running) | Member pubkey | WS to broker | Broker only (no P2P partner unless launched) | Receives broadcasts + member-targeted DMs |
| **Voice agent** (LiveKit, Pipecat) | Service identity, member-signed | LiveKit room + bridge | LiveKit room data channels intra-room; bridge over broker for cross-mesh | Side-car bridges room ↔ broker |
| **OpenAI Assistant / Anthropic Skill** | Service identity, scoped token | HTTP outbound, webhook inbound | Broker only (can't run P2P) | Daemon does delegated re-encryption |
| **Browser-based peer** (web dashboard, SDK) | Member or service identity | WS to broker, WebRTC for P2P | P2P-where-possible (browsers ARE WebRTC-native) | Full feature parity once on-mesh |
| **Webhook consumer** (Stripe-style passive) | Service identity | HTTP webhook inbound only | Broker only | Topic subscriptions; receive-only, no send path into the mesh |
| **Bridge** (Slack, WhatsApp, IRC, Matrix) | Service identity per bridge + per-end-user delegated | WS to broker | Broker only for bridge ↔ broker; native protocol for bridge ↔ external | Trust delegated to bridge operator |
| **Cron / scheduled actor** | Member pubkey or service identity | Ephemeral; HTTP send only | Broker only | No long-lived connection |
| **CLI-only user** (no Claude Code) | Member pubkey | Ephemeral on each `claudemesh send` | Broker only | Command-line agent, queues via outbox |
Every row in this table works without changing the broker's coordination plane.
---
## Layer 1: Identity
Three keypair types, one auth model.
### Member identity (durable)
- Ed25519 keypair, generated at `claudemesh join <invite>`. Held in `~/.claudemesh/config.json` per mesh.
- The auth boundary — grants, kicks, bans operate on members.
- Used for hello signature on the daemon's control-plane WS.
- Used as cryptographic root of trust for sibling sessions and service identities.
### Session identity (ephemeral, per-launch)
- Ed25519 keypair generated by each `claudemesh launch`. Held in process memory only.
- Parent-signed attestation vouches for it (TTL 12h, broker cap 24h). Rotation = new launch.
- Used for hello signature on the per-session WS, and as routing key for DMs targeted at *this specific launched session*.
- Session secret never touches disk; lives only in the daemon's `sessionBrokers` map keyed by IPC token.
### Service identity (third type, additive)
For non-Claude integrations that can't or shouldn't use a per-launch session.
```typescript
// Field types are assumptions (string-encoded keys, numeric expiry); the spec names the fields only.
interface ServiceIdentity {
  service_id: string;               // Stable string id ("openai-assistant-foo", "livekit-room-bar")
  service_pubkey: string;           // Ed25519 pubkey — the cryptographic identity. crypto_box targets this.
  member_id: string;                // The mesh member that owns this service (auth boundary)
  service_type: string;             // "openai-assistant" | "livekit-room" | "webhook" | "voice-agent" | ...
  scopes: string[];                 // ["dm:read", "topic:write", "rpc:invoke", ...]
  attestation: {                    // member-signed
    service_id: string;
    service_pubkey: string;
    scopes: string[];
    expires_at: number;
    signature: string;
  };
  transport_hint: "ws" | "http-webhook" | "sse" | "livekit"; // informs how the broker reaches it
  delegate_daemon_pubkey?: string;  // Optional. Set when the daemon holds the service's secret on its behalf.
}
```
Two flavors:
- **Holds-secret service** — has its own keypair (`service_pubkey` + service-secret kept by the service itself). Runs E2E crypto end-to-end. Voice agent side-cars, browser SDK, MQTT bridges.
- **Delegated service** — daemon holds the service-secret on the service's behalf. Senders still encrypt to `service_pubkey`; daemon decrypts on receipt and forwards plaintext (or re-signs) to the service via its `transport_hint`. Used by HTTP webhook consumers, OpenAI Assistants. Trust is in the daemon owner. `delegate_daemon_pubkey` records who's holding.
All three identity types resolve to a `member_id` for authorization. They differ in liveness (member = always; session = per-launch; service = scoped) and transport hint (member/session = WS-resident; service = polymorphic).
### Identity revocation (explicit)
Existing v1 left this implicit. v2 makes it concrete:
- **CLI verb:** `claudemesh service revoke <service_id>` (also `claudemesh peer revoke <pubkey>` for member revocation).
- **Broker effect:** add row to `revocation` table with `(mesh_id, revoked_pubkey, revoked_at, revoked_by, reason?)`. Drop any active WS for that pubkey (close 4002 "revoked"). Reject future helloes.
- **Drain effect:** `drainForMember` checks revocation list at drain time; ciphertext-in-flight from the revoked sender is dropped (sender already broker-acked, but recipient never sees it).
- **Gossip:** revocation events publish on the `system` channel (highest priority). Online peers cache; offline peers see on reconnect. Required so P2P sessions also honor revoke (otherwise a revoked peer's stored attestations could keep working over direct paths).
- **Latency target:** <30s for online peers to receive and apply.
- **Expiry vs revoke distinction:** `expires_at` is graceful (predictable, scheduled rotation); revoke is emergency (leaked secret, fired employee, compromised host). Both use the same revocation table; `expires_at` enforces silently when reached, revoke is logged as an audit event.
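The expiry-versus-revoke distinction can be sketched as a peer-side cache that gossip events feed into. This is an illustration under assumptions: the event fields mirror the revocation table columns listed above, timestamps are assumed to be epoch milliseconds, and `RevocationCache`, `Attestation`, and `Verdict` are hypothetical names.

```typescript
// Hypothetical event shape mirroring the revocation table columns.
interface RevocationEvent {
  meshId: string;
  revokedPubkey: string;
  revokedAt: number;   // epoch ms (assumed unit)
  revokedBy: string;
  reason?: string;
}

interface Attestation { pubkey: string; expiresAt: number }  // epoch ms (assumed unit)

type Verdict = "ok" | "expired" | "revoked";

class RevocationCache {
  private revoked = new Map<string, RevocationEvent>();

  // Apply a gossip event; re-applying the same event is a no-op.
  apply(ev: RevocationEvent): void {
    const key = `${ev.meshId}:${ev.revokedPubkey}`;
    if (!this.revoked.has(key)) this.revoked.set(key, ev);
  }

  // Revocation is checked before expiry so an emergency revoke is
  // always reported as such, even on an already-expired attestation.
  check(meshId: string, att: Attestation, now: number): Verdict {
    if (this.revoked.has(`${meshId}:${att.pubkey}`)) return "revoked";
    if (now >= att.expiresAt) return "expired";
    return "ok";
  }
}
```

A P2P receiver would run `check` before accepting a message from a direct path, which is what lets revoke hold even when the broker is out of the loop.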
---
## Layer 2: Coordination plane (the broker, properly scoped)
The broker is **not** the data plane. Its real responsibilities:
1. **Mesh state authority** — member roster, group memberships, topic registry, service registrations, revocation list. Source of truth for who's in a mesh and what they can do.
2. **Peer discovery** — `list_peers` returns currently-online presences. Broker is the only system that knows which peers are reachable now and over which transports.
3. **Signaling for P2P upgrades** — when peer A wants to open a P2P connection to peer B, A sends an SDP offer through the broker; B responds with an SDP answer through the broker. Once the data channel is up, broker is out of the path. Same as WebRTC signaling.
4. **Offline message queue** — when recipient is offline, broker stores the (encrypted) message until they reconnect. P2P can't do this without an "always-on peer" model, which is awkward to bootstrap.
5. **Group / topic / broadcast fan-out** — broker is the cheap fan-out point. Sender publishes once; broker delivers to N recipients. P2P fan-out (gossipsub) is possible but adds significant complexity for a feature most meshes won't need at scale.
6. **TURN-style relay for NAT-blocked pairs** — when P2P negotiation fails (symmetric NAT, restrictive corporate firewall), broker carries the data. Functionally equivalent to TURN.
7. **Revocation gossip publisher** — broker pushes revocation events to all online peers via the `system` channel; peers cache them.
8. **Audit log + persistence layer** — encrypted message metadata for compliance. Bodies are E2E-encrypted, so audit is over (sender, recipient, channel, timestamp, size), not content.
The broker is **NOT**:
- The default path for online-online direct messages (P2P should win).
- The decryptor for any direct message (E2E means broker sees ciphertext only).
- A bottleneck on bulk data (file transfer, voice, screen share — these go P2P or fail).
- The sole identity authority for active sessions (P2P sessions verify attestations locally via cached mesh state).
### Two roles per mesh on the WS layer (Codex-1 correction, kept)
Within the broker's WS surface, the daemon holds two roles per mesh, not one connection per launch:
- **Control-plane connection** — one per mesh, member-keyed. Carries: signaling + outbox drain + RPCs + broadcast/member-targeted inbound + revocation gossip subscription.
- **Session connections** — N per mesh, session-keyed. Carries: presence row keyed on session pubkey + signaling for P2P upgrades involving this session + inbound for session-targeted DMs that arrive via broker fallback.
A peer who's purely on the broker (no P2P) functions exactly as today. A peer who upgrades to P2P with another peer keeps its broker WS for the other roles.
---
## Layer 3: Data plane (hybrid P2P + broker fallback)
The data plane is what carries actual message bodies. Three modes, selected per (sender, recipient, channel) tuple:
### Mode 1: Direct P2P (preferred when possible)
Two peers run a WebRTC data channel (or QUIC stream — pluggable, see Layer 4) between their daemons. Established via signaling through the broker; once up, broker is out of the path.
**When P2P is selected:**
- Both peers are online (have an active broker WS).
- Both peers' transports advertise P2P capability (WebRTC available; not a webhook-only service identity; not a browser without `RTCPeerConnection`).
- ICE negotiation succeeds (at least one candidate pair works — direct, server-reflexive, or peer-reflexive).
- Channel type is `dm`, `rpc`, or `stream` (the 1:1 cases).
**P2P session lifecycle:**
- Established lazily on first message (warm-up cost ~200ms; dominated by ICE + DTLS handshake). Subsequent messages reuse the channel.
- Idle timeout: 5min of no traffic → tear down. Re-established on next message.
- Hard timeout: 1h max regardless of activity, then re-handshake. Limits damage of compromised session keys.
- Either side can demote to broker-relay at any time; broker is the fallback always.
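The two timeouts above can be sketched as a pure decision over session timestamps. The constants come from the bullets; `P2pSession`, `SessionAction`, and `nextAction` are hypothetical names, and checking the hard cap before the idle timeout is an assumption of this sketch.

```typescript
const IDLE_TIMEOUT_MS = 5 * 60 * 1000;   // 5 min without traffic
const HARD_TIMEOUT_MS = 60 * 60 * 1000;  // 1 h regardless of activity

interface P2pSession {
  establishedAt: number;  // epoch ms
  lastTrafficAt: number;  // epoch ms
}

type SessionAction = "reuse" | "teardown-idle" | "rehandshake";

function nextAction(s: P2pSession, now: number): SessionAction {
  // Hard cap first: even a busy session re-handshakes after 1 h.
  if (now - s.establishedAt >= HARD_TIMEOUT_MS) return "rehandshake";
  if (now - s.lastTrafficAt >= IDLE_TIMEOUT_MS) return "teardown-idle";
  return "reuse";
}
```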
**Crypto on P2P:**
- DTLS handshake provides transport encryption (forward secrecy; recipient pubkey verified via cached attestation chain).
- Application-layer crypto_box ALSO runs on top — same as broker-relayed messages — so the wire format and decryption path are identical on the receiver side. Defense in depth, no special-case code.
### Mode 2: Broker-relayed (fallback)
The current path. Sender encrypts to recipient pubkey (member or session or service), pushes to broker via WS, broker queues, recipient pulls (or broker pushes to recipient's WS).
**When broker-relay is selected:**
- One peer offline → broker queues, delivers on reconnect.
- ICE negotiation fails → broker becomes the relay.
- Channel type is `group`, `topic`, or `broadcast` → broker fan-out is structurally cheaper than P2P fan-out for any group >2.
- Service identity at either end can't run P2P → broker is the only path.
**Crypto:** unchanged from today — E2E crypto_box, broker sees ciphertext only.
### Mode 3: Direct webhook (broker as broker, not as relay)
For service identities advertising `transport_hint: "http-webhook"`. Sender encrypts to service's `service_pubkey` (or to delegate-daemon's pubkey for delegated services), broker POSTs the ciphertext to the service's registered URL with HMAC signature + retry. No long-lived connection on the service side.
This is functionally a "broker queue, custom delivery transport" — broker still mediates, but delivery is HTTP not WS.
### Selection logic (deterministic, sender-side)
```
function pickTransport(sender, recipient, channel) -> Transport:
if channel in [group, topic, broadcast]:
return broker.relay # fan-out semantics
if recipient.transport_hint == "http-webhook":
return broker.relay # broker calls webhook
if recipient is offline:
return broker.queue # store-and-forward
if !recipient.capabilities.p2p:
return broker.relay # one-end can't P2P
if !sender.capabilities.p2p:
return broker.relay # we can't P2P
if has_active_p2p_session(sender, recipient):
return p2p.session # warm path
attempt_p2p_handshake(sender, recipient, timeout=2s) ->
if ok: return p2p.session
else: return broker.relay # fall through, log degraded
```
Policy lives in the daemon's send path. Broker doesn't know or care — it sees only the messages that actually go through it.
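The pseudocode above can be sketched as a typed, synchronous decision function. To keep it pure, the sketch returns `"p2p.handshake"` instead of performing the handshake; the caller would attempt it with the 2s timeout and fall back to broker relay on failure. `PeerView`, `Decision`, and the parameter names are assumptions, and `"broadcast"` is included as a channel only because the pseudocode lists it.

```typescript
type Channel = "dm" | "group" | "topic" | "rpc" | "stream" | "broadcast";
type Decision = "broker.relay" | "broker.queue" | "p2p.session" | "p2p.handshake";

// Hypothetical view of a peer as seen by the sender's daemon.
interface PeerView {
  online: boolean;
  p2pCapable: boolean;
  transportHint: "ws" | "http-webhook" | "sse" | "livekit";
}

function pickTransport(
  sender: PeerView,
  recipient: PeerView,
  channel: Channel,
  hasActiveP2pSession: boolean,
): Decision {
  // Fan-out semantics always go through the broker.
  if (channel === "group" || channel === "topic" || channel === "broadcast") return "broker.relay";
  // Webhook recipients are reached by the broker calling their URL.
  if (recipient.transportHint === "http-webhook") return "broker.relay";
  // Store-and-forward for offline recipients.
  if (!recipient.online) return "broker.queue";
  // Either end lacking P2P support forces the relay.
  if (!recipient.p2pCapable || !sender.p2pCapable) return "broker.relay";
  // Warm path: reuse an existing session.
  if (hasActiveP2pSession) return "p2p.session";
  // Cold path: caller attempts the handshake and falls back to broker.relay on failure.
  return "p2p.handshake";
}
```

The check order matters: the webhook test precedes the offline test, so an offline webhook recipient still resolves to `broker.relay`, as in the pseudocode.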
---
## Layer 4: Transport adapters (pluggable)
A transport adapter is an implementation of how *one peer pair* moves bytes. Defined by an interface; new adapters added without touching upper layers.
```typescript
interface PeerTransport {
readonly kind: string; // "ws-broker" | "webrtc-p2p" | "http-webhook" | ...
readonly capabilities: {
p2p: boolean;
bidirectional: boolean;
midTurnPush: boolean;
maxMessageBytes: number;
streamingChunks: boolean;
};
open(opts: TransportOpenOpts): Promise<TransportSession>;
send(envelope: Envelope): Promise<TransportSendResult>;
inbound(): AsyncIterable<Envelope>;
heartbeat(): Promise<boolean>;
close(reason?: string): Promise<void>;
}
```
### Concrete adapters at end-state
1. **`WsBrokerTransport`** — current code. WebSocket to `wss://ic.claudemesh.com/ws`. Underpins both broker-relay (Mode 2) and signaling for P2P upgrades.
2. **`WebRtcP2pTransport`** — RTCPeerConnection + RTCDataChannel. Browser, Node (`node-datachannel` or similar), CLI all supported. Chunking handled at envelope layer for `stream` channel.
3. **`HttpWebhookTransport`** — outbound HTTP POST to broker `/v1/send`; inbound HTTP POST to a registered webhook URL. Unidirectional from peer's perspective. Mid-turn push: no.
4. **`LiveKitRoomTransport`** — for voice agents. Side-car bridges a LiveKit room to claudemesh. Maps a LiveKit participant → claudemesh service identity.
Future adapters TBD as concrete needs surface — no commitments here. (v1 listed MQTT/gRPC/SSE as future named adapters; v2 drops the named list per Codex-2 should-cut feedback.)
The peer's daemon advertises transport capabilities at hello time; broker stores them in the presence row; senders consult them via `list_peers` (capability fields added to the response).
---
## Layer 5: Channels (typed envelope)
Channels define **semantics**: what the message means, what crypto to apply, what delivery guarantees, what fan-out, what backpressure.
```typescript
type ChannelType =
| "dm" // 1:1 direct, encrypted to recipient pubkey, at-least-once with ack
| "group" // post to named group, per-recipient encrypt or symmetric, at-least-once with ack
| "topic" // pub/sub topic, persisted history, per-topic symmetric key, at-least-once with ack
| "rpc" // request/response with correlation id + timeout, exactly-once via dedupe
| "system" // peer_joined / peer_left / topology / lifecycle / revocation (broker-originated)
| "stream"; // long-lived ordered chunks, idempotent per (stream_id, chunk_id)
interface Envelope {
v: 2;
channel: ChannelType;
/** Routing target — meaning depends on channel:
* dm: recipient pubkey (member, session, or service)
* group: group name (e.g. "@admins")
* topic: topic id (e.g. "#abc123")
* rpc: recipient pubkey
* system: ignored (sender-determined fan-out; broker fills in)
* stream: recipient pubkey (the stream_id is in meta.streamId — see below) */
target: string;
/** Sender identity pubkey (member, session, or service). */
from: string;
/** Encrypted payload. Channel + recipient determines crypto recipe:
* dm/rpc/stream: crypto_box to recipient pubkey
* group: per-recipient seal (or symmetric in v3)
* topic: per-topic symmetric key (v0.2.0 spec)
* system: broker-signed, plaintext metadata (event has no body) */
body: { nonce: string; ciphertext: string; bodyVersion: number };
/** Required in v2 (was optional in v1). Even minimal envelopes must carry
* clientMessageId for idempotent dedupe. */
meta: {
clientMessageId: string; // REQUIRED — idempotency id (spec §4.2)
requestFingerprint?: string;
priority?: "now" | "next" | "low"; // dm: gates mid-turn push; group/topic: fan-out priority
timeoutMs?: number; // rpc only
streamId?: string; // REQUIRED for channel:"stream"; identifies the stream
streamChunkId?: number; // stream only; monotonic; receiver dedupes
streamTerminator?: boolean; // stream only; signals end
rpcCorrelationId?: string; // rpc only; back-edge for response
rpcResponse?: boolean; // rpc only; this is a response, not request
replyToId?: string; // dm/topic threading
mentions?: string[]; // dm/topic; @-callouts
expiresAt?: number; // any; broker drops past this; default 7d for queued
};
/** Sender Ed25519 signature over canonical bytes. Verified by recipient
* (and by broker for system-message origin). */
signature: string;
}
```
### Stream concurrency
For `channel: "stream"`, **`meta.streamId` is required**. Two concurrent streams to the same recipient pubkey use distinct streamIds; receiver demuxes by `(from, streamId)`. Without this, multi-stream voice transcripts or file transfers from the same peer would collide.
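Receiver-side demux by `(from, streamId)` can be sketched as below. This is a simplified illustration: it assumes chunks arrive in order on an ordered transport, that `chunkId` starts at 0 or higher, and that payloads are strings; `StreamChunk` and `StreamDemux` are hypothetical names, and gap handling is left out.

```typescript
interface StreamChunk {
  from: string;
  streamId: string;
  chunkId: number;     // monotonic per stream
  terminator: boolean;
  data: string;
}

class StreamDemux {
  // Highest chunkId accepted so far, per (from, streamId).
  private lastChunk = new Map<string, number>();
  private buffers = new Map<string, string[]>();

  // Returns the assembled stream when the terminator arrives, otherwise null.
  accept(c: StreamChunk): string | null {
    const key = `${c.from}\u0000${c.streamId}`;
    const last = this.lastChunk.get(key) ?? -1;
    if (c.chunkId <= last) return null;   // duplicate or replayed chunk: drop
    this.lastChunk.set(key, c.chunkId);
    const buf = this.buffers.get(key) ?? [];
    buf.push(c.data);
    this.buffers.set(key, buf);
    if (!c.terminator) return null;
    this.buffers.delete(key);             // keep lastChunk so late replays stay dropped
    return buf.join("");
  }
}
```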
### Crypto by channel
- `dm`, `rpc`, `stream` → crypto_box(plaintext, recipient_pubkey, sender_secretkey). Receiver verifies attestation chain to ensure recipient_pubkey is a valid identity rooted in a current member.
- `group` → for now: per-recipient crypto_box (sender encrypts N times, broker fans out). Future: hybrid Curve25519 → AES-GCM with sender key wrap, like Signal Sender Keys.
- `topic` → per-topic symmetric key (already in v0.2.0 spec). Key rotation = new topic + members re-subscribe. Keys distributed via DM at join time, encrypted to each member's pubkey.
- `system` → broker is the signer; receivers verify against the broker's published Ed25519 pubkey. Plaintext bodies allowed since these are operational events.
### Delivery semantics (Codex-2 correction applied)
**At-least-once requires receiver ack.** Today's broker sets `delivered_at = NOW()` inside the claim CTE before WS push succeeds — that's at-most-once with no retry. The end-state behavior:
1. Sender's daemon writes to outbox (durable).
2. Drain worker sends to broker; broker acks with `client_message_id` echo (this is sender → broker delivery ack, NOT end-to-end).
3. Broker queues with `claimed_at` NULL, `delivered_at` NULL.
4. On recipient hello / push opportunity: broker claims by setting `claimed_at = NOW(), claim_id = <presenceId>` (lease 30s).
5. Broker `sendToPeer` writes to WS / P2P / webhook.
6. Receiver processes envelope and emits `client_ack { clientMessageId }` back to broker.
7. Broker sets `delivered_at = NOW()` ON ACK RECEIPT.
8. If lease expires without ack → broker re-eligible to claim and re-deliver.
9. Receiver dedupes by `clientMessageId` (idempotent insert into inbox).
Until ack is wired, the accurate label for this transitional state is **best-effort retry with idempotent dedupe**, not at-least-once. The outbox + claim/lease + dedupe combination upgrades to at-least-once when the ack path is in place.
`rpc` exactly-once is the same path, with the addition that the response carries the `rpcCorrelationId`: the sender retries the request until a response is received OR `timeoutMs` elapses, and receiver-side dedupe ensures the handler runs at most once.
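Steps 3 to 9 above can be sketched as an in-memory model of the claim/lease/ack cycle plus receiver-side dedupe. This is an illustration, not the broker's SQL: `BrokerQueue`, `Inbox`, and the method names are hypothetical, and the real implementation would express `claim` as a database update over the `claimed_at` / `claim_id` columns.

```typescript
const LEASE_MS = 30 * 1000;  // claim lease from step 4

interface QueuedMessage {
  clientMessageId: string;
  claimedAt: number | null;
  claimId: string | null;
  deliveredAt: number | null;
}

class BrokerQueue {
  private rows = new Map<string, QueuedMessage>();

  // Idempotent: a sender retry with the same id does not create a second row.
  enqueue(clientMessageId: string): void {
    if (!this.rows.has(clientMessageId)) {
      this.rows.set(clientMessageId, { clientMessageId, claimedAt: null, claimId: null, deliveredAt: null });
    }
  }

  // Claim every undelivered row that is unclaimed or whose lease has expired.
  claim(presenceId: string, now: number): string[] {
    const claimed: string[] = [];
    this.rows.forEach((row) => {
      const leaseExpired = row.claimedAt !== null && now - row.claimedAt >= LEASE_MS;
      if (row.deliveredAt === null && (row.claimedAt === null || leaseExpired)) {
        row.claimedAt = now;
        row.claimId = presenceId;
        claimed.push(row.clientMessageId);
      }
    });
    return claimed;
  }

  // delivered_at is set only when the receiver's client_ack arrives.
  ack(clientMessageId: string, now: number): boolean {
    const row = this.rows.get(clientMessageId);
    if (!row || row.deliveredAt !== null) return false;
    row.deliveredAt = now;
    return true;
  }
}

// Receiver side: idempotent insert keyed on clientMessageId.
class Inbox {
  private seen = new Set<string>();
  insert(clientMessageId: string): boolean {
    if (this.seen.has(clientMessageId)) return false;
    this.seen.add(clientMessageId);
    return true;
  }
}
```

A push that never gets acked leaves `deliveredAt` null, so the row becomes claimable again after 30s; the redelivery is then absorbed by `Inbox.insert` returning `false`.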
### Mid-turn push
`channel: "dm"` with `meta.priority: "now"` and recipient is a launched Claude Code session → recipient's daemon emits `claude/channel` MCP push; the session's Claude Code reads it mid-turn. Other priorities deliver via `claudemesh inbox` poll or at next tool boundary.
### Reply threading + mentions
Uniform across `dm` and `topic`: `meta.replyToId` references the original message's `clientMessageId`. `meta.mentions` is an array of pubkeys (or `@<group>`) — UI/CLI surfaces them; broker doesn't enforce.
---
## Layer 6: Mesh state — broker authority + signed gossip
The mesh state (members, groups, topics, services, revocations, policies) needs both:
- **Authority** — single source of truth. The broker DB. Mutations (add member, revoke, change policy) go through broker, signed by mesh owner / admin.
- **Replication** — every peer needs a current-enough copy to authorize incoming P2P messages locally (otherwise revoke can't be enforced when peers chat directly).
End-state: broker publishes signed mesh-state-update events on the `system` channel; peers cache and apply. Conflict resolution is trivial because broker is authority — peers merge updates by version vector. Eventually consistent in seconds, not the open-ended convergence of CRDT-only systems.
For peer revocation specifically: revocation gossip is highest priority and must propagate within 30s to all online peers. Offline peers see it on reconnect.
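A peer-side cache for broker-published state updates might look like the sketch below. It simplifies the version vector to a single broker-assigned counter, which is an assumption (reasonable only because the broker is the sole authority), and it takes signature verification as an injected function rather than calling a real Ed25519 library. `MeshStateUpdate`, `MeshStateCache`, and `VerifyFn` are hypothetical names.

```typescript
interface MeshStateUpdate {
  version: number;                          // assumed: broker-assigned, strictly increasing
  kind: "member_added" | "member_revoked";
  pubkey: string;
  signature: string;
}

// Stand-in for verifying the broker's signature over the update.
type VerifyFn = (update: MeshStateUpdate) => boolean;

class MeshStateCache {
  private version = 0;
  private members = new Set<string>();
  private verify: VerifyFn;

  constructor(verify: VerifyFn) {
    this.verify = verify;
  }

  // Returns true when the update was applied.
  apply(u: MeshStateUpdate): boolean {
    if (!this.verify(u)) return false;             // not signed by the broker
    if (u.version <= this.version) return false;   // stale or duplicate
    if (u.kind === "member_added") this.members.add(u.pubkey);
    else this.members.delete(u.pubkey);
    this.version = u.version;
    return true;
  }

  // The per-message check a P2P receiver runs: a set lookup.
  isCurrentMember(pubkey: string): boolean {
    return this.members.has(pubkey);
  }
}
```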
---
## Crypto — what doesn't change vs what does
### Doesn't change
- Per-peer Ed25519 keypairs (member + session + service).
- crypto_box (Curve25519 + XSalsa20 + Poly1305) for DMs/RPC/stream.
- Parent-attestation flow for sessions and services.
### Does change (additive)
- DTLS layer underneath WebRTC P2P (transport-level encryption for fingerprint binding).
- Per-topic symmetric keys (v0.2.0 baseline; v2 makes it a hard requirement for topics).
- Broker signing key for `system` channel events (single Ed25519 keypair the broker holds; pubkey published in mesh state).
- Service identity attestations carry `service_pubkey` + `scopes`.
- Forward-secrecy for long-lived P2P sessions: post-handshake, derive a fresh symmetric key per session epoch (1h max); rotate.
---
## Migration order (architectural milestones, NO time estimates)
The end-state above doesn't ship in one PR. The following ordering minimizes regression risk and lets each milestone be useful on its own. **No weeks/sprints attached** — work proceeds when the prior milestone is stable.
### Milestone 1 — Foundational correctness
*Required before anything else. Without this, every later milestone inherits the bugs.*
- Extract `connectWsWithBackoff` helper. Refactor `DaemonBrokerClient` and `SessionBrokerClient` to use it. Eliminates the drift bug class.
- Drop daemon's stray `sessionPubkey` field (or rename + document).
- Tighten daemon-WS inbound filter — `*` broadcasts and member-targeted DMs only; session-targeted DMs land on session WS exclusively.
- Add `presence.role` column at broker (`control-plane | session | service`); list_peers + fan-out + reconnect honor it.
- **Fix broker drain race** — schema migration adds `claimed_at`, `claim_id`, `claim_expires_at` columns. Rewrite `drainForMember` for two-phase claim/deliver. Re-claim if `claimed_at` older than lease (30s).
- Receiver-side `client_ack` for at-least-once with ack (Codex-2 correction). Without ack wiring this stays at "best-effort retry with idempotent dedupe."
- Receiver-side dedupe: idempotent insert on `clientMessageId`; finish the implementation and make the field required for v2 envelopes.
### Milestone 2 — Capability advertisement + transport abstraction
*Sets up the interface. No new transport yet.*
- Define `PeerTransport` interface; refactor existing WS code to be the first implementation. No behavioral change.
- Add capabilities field to hello payload + presence row + `list_peers` response.
- Define `Envelope v2` schema with `meta` required + `streamId` requirement on `stream` channel. Broker accepts both v1 and v2 (v1 auto-upgraded server-side by inferring `channel` from `targetSpec` shape). Senders start emitting v2.
### Milestone 3 — Service identity + HTTP webhook transport
*First non-WS transport. Validates abstraction. Includes revocation.*
- Service identity registration: `claudemesh service register --type webhook --pubkey <hex> --scopes ...` mints attestation, stores broker-side. Service pubkey explicit in attestation.
- Service revocation: `claudemesh service revoke <service_id>` writes broker denylist + closes any active connections + publishes `system` revocation event.
- Add `HttpWebhookTransport` (broker-side outbound: POST with HMAC + retry; daemon-side inbound: HTTP server receives webhook callbacks → handleBrokerPush).
- Add `/v1/send` HTTP POST endpoint on broker (today broker is WS-only for sends).
- Demo: cron job using only `curl` posts to mesh; webhook subscriber receives.
- (`SseTransport` deferred — Codex-2 should-cut feedback. Pull in when concrete browser need arises.)
### Milestone 4 — Typed channels: rpc, stream, system
*Channel layer becomes real.*
- `channel: "rpc"` end-to-end: correlation id routing through any transport, response timeout, `claudemesh rpc <peer> <method> <args>` CLI verb.
- `channel: "stream"` end-to-end: chunked + ordered + idempotent, multi-stream demux via `meta.streamId`, `claudemesh stream <peer> <stream-id>` CLI verb.
- `channel: "system"` formalized (broker-signed events for peer_joined, peer_left, topology, revocation, mesh-state-updates).
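The correlation-id routing and response timeout for `channel: "rpc"` could look roughly like the table below. This is a sketch under assumptions: `RpcCorrelationTable`, `RpcOutcome`, and the explicit `nowMs` clock are hypothetical, and the real daemon would drive `expire` from a timer.

```typescript
type RpcOutcome =
  | { kind: "result"; correlationId: string; payload: string }
  | { kind: "timeout"; correlationId: string };

interface PendingRpc {
  deadlineMs: number;
  settle: (outcome: RpcOutcome) => void;
}

class RpcCorrelationTable {
  private pending = new Map<string, PendingRpc>();

  /** Register an outgoing request; `settle` fires at most once. */
  register(correlationId: string, nowMs: number, timeoutMs: number, settle: (o: RpcOutcome) => void): void {
    this.pending.set(correlationId, { deadlineMs: nowMs + timeoutMs, settle });
  }

  /** Route an inbound response by correlation id. Returns false for unknown or late responses. */
  resolve(correlationId: string, payload: string): boolean {
    const entry = this.pending.get(correlationId);
    if (entry === undefined) return false;
    this.pending.delete(correlationId);
    entry.settle({ kind: "result", correlationId, payload });
    return true;
  }

  /** Time out every request whose deadline has passed; returns how many expired. */
  expire(nowMs: number): number {
    const expiredIds: string[] = [];
    this.pending.forEach((entry, id) => {
      if (entry.deadlineMs <= nowMs) expiredIds.push(id);
    });
    for (const id of expiredIds) {
      const entry = this.pending.get(id);
      this.pending.delete(id);
      if (entry !== undefined) entry.settle({ kind: "timeout", correlationId: id });
    }
    return expiredIds.length;
  }
}
```

Because the table is keyed only on the correlation id, the response can arrive over any transport and still reach the caller.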
### Milestone 5 — P2P data plane (WebRTC adapter)
*The big architectural shift. Broker becomes coordinator, not data path.*
- Add `WebRtcP2pTransport` adapter. Uses `node-datachannel` (or libdatachannel binding) on Node; native WebRTC in browser.
- Add signaling protocol over the existing broker WS:
  - `p2p_offer` (sender → broker → recipient): SDP offer + ICE candidates.
  - `p2p_answer` (recipient → broker → sender): SDP answer + ICE candidates.
  - `p2p_candidate` (either way): trickle ICE candidates.
  - All signaling messages are broker-attested (only valid sender/recipient pairs).
- Add `pickTransport()` policy in daemon send path.
- Add P2P session manager: warm-cache, idle timeout, hard timeout, demote-to-broker on failure.
- Tag broker-relayed messages that *could have* gone P2P with a metric, so degradation rate is observable.
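One possible shape for the `pickTransport()` policy, as a sketch only. The field names, the `"webrtc"` capability string, and the thresholds are all assumptions, and the hard timeout is left out. `couldHaveBeenP2p` is the flag the degradation metric would count.

```typescript
type TransportKind = "p2p" | "broker";

interface P2pSessionState {
  established: boolean;        // a data channel is currently open
  lastUsedMs: number;          // last send or receive on the channel
  consecutiveFailures: number; // failed sends since the last success
}

interface PeerInfo {
  online: boolean;
  capabilities: string[];      // capability names are assumptions, e.g. ["ws", "webrtc"]
}

interface PickPolicy {
  idleTimeoutMs: number;
  maxFailures: number;
}

interface PickResult {
  transport: TransportKind;
  couldHaveBeenP2p: boolean;   // true when a P2P-capable peer was routed via the broker
}

function pickTransport(
  peer: PeerInfo,
  session: P2pSessionState | undefined,
  nowMs: number,
  policy: PickPolicy,
): PickResult {
  // Offline peers always go through the broker's offline queue.
  if (!peer.online) return { transport: "broker", couldHaveBeenP2p: false };
  // Peers that never advertised WebRTC cannot take the P2P path at all.
  if (peer.capabilities.indexOf("webrtc") === -1) return { transport: "broker", couldHaveBeenP2p: false };
  // No warm session yet: relay now, let the session manager dial in the background.
  if (session === undefined || !session.established) return { transport: "broker", couldHaveBeenP2p: true };
  // Demote to broker after repeated failures.
  if (session.consecutiveFailures >= policy.maxFailures) return { transport: "broker", couldHaveBeenP2p: true };
  // Idle sessions are treated as cold.
  if (nowMs - session.lastUsedMs > policy.idleTimeoutMs) return { transport: "broker", couldHaveBeenP2p: true };
  return { transport: "p2p", couldHaveBeenP2p: false };
}
```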
### Milestone 6 — Mesh state replication + revocation gossip
*Required before P2P is safe at scale.*
- Broker publishes signed `system` events for all mesh state mutations.
- Peers subscribe; cache and apply.
- Revocation propagation latency target: <30s for online peers.
- P2P sessions verify peer identity against cached state on every message (cheap, just a map lookup).
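The per-message identity check can stay a map lookup if peers keep a small replicated cache. The sketch below assumes two hypothetical event kinds (`member_added`, `revoked`) and skips verification of the broker signature on each event, which a real peer must do before applying it. It also treats revocation as sticky, which is my assumption, not something the spec states.

```typescript
type MeshStateEvent =
  | { kind: "member_added"; pubkey: string; memberId: string }
  | { kind: "revoked"; pubkey: string };

class MeshStateCache {
  private members = new Map<string, string>(); // pubkey -> memberId
  private revoked = new Set<string>();

  /** Apply one replicated event. Signature verification is omitted in this sketch. */
  apply(ev: MeshStateEvent): void {
    if (ev.kind === "member_added") {
      // Assumption: a revoked key never comes back under the same pubkey.
      if (!this.revoked.has(ev.pubkey)) this.members.set(ev.pubkey, ev.memberId);
    } else {
      this.revoked.add(ev.pubkey);
      this.members.delete(ev.pubkey);
    }
  }

  /** Cheap check run on every inbound P2P message. */
  isTrusted(pubkey: string): boolean {
    return this.members.has(pubkey) && !this.revoked.has(pubkey);
  }
}
```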
### Milestone 7 — External integrations (proof points, parallel)
*One PoC per category to validate the architecture, opportunistically.*
- LiveKit side-car (validates LiveKit room transport).
- OpenAI Assistant (validates delegated-key crypto + webhook transport).
- WhatsApp / Slack bridge (validates human-bridge service identity).
- Browser SDK (validates browser as a peer; uses WebRTC adapter natively).
### Milestone 8 — Group/topic crypto upgrade
*Group fan-out crypto efficiency.*
- Sender Keys protocol for group: sender derives group key, encrypts content once, encrypts group key per-recipient. Avoids N-way encryption per message.
- Per-topic key rotation policy (member join → optional re-key; member leave → forced re-key).
### Beyond Milestone 8
- Future transport adapters as concrete needs surface (no commitments).
- Multi-broker federation (mesh spans multiple brokers; gossip across).
- Onion routing option for adversarial environments.
---
## Non-goals (explicit)
- **Replacing Slack / Discord / Matrix as a human chat product.** claudemesh is for agent coordination; humans participate via bridges or direct DMs but UX is CLI-first.
- **Pure-P2P with no central coordinator.** The broker stays — for offline queue, group fan-out, mesh authority, revocation. "P2P-first hybrid" is the commitment, not "P2P-only."
- **Replacing the MCP `claude/channel` push-pipe.** Mid-turn interrupt stays MCP. The data-plane changes don't touch the daemon-to-Claude-Code path.
- **Real-time media (audio/video) directly in claudemesh data channels.** Bandwidth-heavy media goes through dedicated stacks (LiveKit, WebRTC SFU). claudemesh metadata + signaling glues them.
---
## Open questions
1. **Mid-turn push when sender is on P2P session.** P2P delivery to recipient's daemon → daemon emits MCP push. Same shape as broker-delivered. Confirm the MCP push respects per-session targeting (different session pubkey siblings of the same member).
2. **Browser peers and NAT traversal.** Browser ↔ browser via WebRTC works. Browser ↔ daemon (Node WebRTC binding) — needs testing under symmetric NAT. May require running a STUN server (Google's for now; eventually self-hosted). TURN fallback uses the broker WS.
3. **Backpressure on stream channel.** WebRTC data channels have built-in flow control. Broker-relayed streams need per-stream backpressure signaling to avoid OOM at the broker. Proposal: receiver advertises `stream_window_bytes` periodically; sender pauses once the advertised window is used up.
4. **Multi-region brokers.** Today single broker. If we add a second broker (or federation), how do peers in mesh A on broker 1 talk to peers in mesh A on broker 2? Out of scope here; separate spec when forced.
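For question 3, the proposed window could behave like simple credit-based flow control on the sender. A sketch, assuming each advertisement replaces (rather than adds to) the remaining credit; `WindowedStreamSender` and `StreamChunk` are hypothetical names.

```typescript
interface StreamChunk {
  chunkId: number;
  bytes: number;
}

class WindowedStreamSender {
  private creditBytes = 0;
  private backlog: StreamChunk[] = [];
  private transmit: (chunk: StreamChunk) => void;

  constructor(transmit: (chunk: StreamChunk) => void) {
    this.transmit = transmit;
  }

  /** Receiver advertised `stream_window_bytes`: reset credit and flush what fits. */
  onWindowAdvertised(streamWindowBytes: number): void {
    this.creditBytes = streamWindowBytes;
    this.flush();
  }

  /** Queue a chunk; it is sent immediately only if credit allows. */
  enqueue(chunk: StreamChunk): void {
    this.backlog.push(chunk);
    this.flush();
  }

  /** True while at least one chunk is waiting for credit. */
  paused(): boolean {
    return this.backlog.length > 0;
  }

  private flush(): void {
    while (this.backlog.length > 0) {
      const head = this.backlog[0];
      if (head === undefined || head.bytes > this.creditBytes) break;
      this.backlog.shift();
      this.creditBytes -= head.bytes;
      this.transmit(head);
    }
  }
}
```

Chunks are never reordered: a large chunk at the head blocks smaller ones behind it, which keeps the ordered-delivery guarantee of the stream channel intact.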
---
## Acknowledgements
**Codex-1 (initial architecture review of existing code) caught:**
- "Remove daemon-WS inbound entirely" idea silently loses broadcasts + member-targeted DMs whenever zero launches exist. Corrected → retained.
- Inheritance for the dup'd lifecycle would become a god class. Composition via helper kept.
- Drain race needs `claimed_at` + delivered-on-success; "check OPEN before claim" still drops on crash. Kept.
- Token-keyed registry is correct (token = auth boundary), not a smell. Kept.
**Codex-2 (single-pass review of v1 of this spec) caught:**
- At-least-once requires receiver ack, not just "set delivered_at on success." → Layer 5 delivery semantics rewritten to require client_ack.
- Service identity needs explicit `service_pubkey` field, included in attestation. → Added to ServiceIdentity definition.
- v2 envelope `meta` should be non-optional with `clientMessageId` always present. → meta is now required.
- Service identity needed explicit revocation/disable story. → New CLI verb `claudemesh service revoke`, broker denylist, system-channel gossip propagation.
- `streamId` location ambiguous; concurrent streams to same peer would collide. → `meta.streamId` made REQUIRED for `channel: "stream"`.
- Defer `SseTransport` from Milestone 3. → Done.
- Drop named future-adapter list (MQTT/gRPC) to avoid false commitments. → Done.
The hybrid P2P data plane, transport adapter abstraction, typed channel envelope, mesh state replication, and milestone reordering are mine. Codex's reviews were targeted at correctness/scope-gap/should-cut, not redesign.
**This spec is now frozen for implementation.** No further architectural drift; deviations during implementation surface as new spec-deltas with explicit rationale, not silent edits to this document.


@@ -0,0 +1,360 @@
---
title: claudemesh as agentic communication platform — architecture spec
status: draft
target: 2.0.0 (foundational cleanup) → 2.1.0 (transport adapters) → 2.2.0 (channel typing)
author: Alejandro + Claude (cross-checked with Codex GPT-5.2)
date: 2026-05-04
supersedes: none
references:
- 2026-05-02-architecture-north-star.md (CLI-first commitment, push-pipe)
- 2026-05-04-per-session-presence.md (per-launch session pubkey + attestation)
- apps/cli/CHANGELOG.md (1.30.0 → 1.32.1 history)
---
# claudemesh as agentic communication platform
## TL;DR
Today claudemesh is a **peer mesh for Claude Code sessions** — broker + CLI + per-session WS, encrypted DMs, peer list, mid-turn push via MCP. Tomorrow it has to be a **transport-agnostic agentic communication platform** that:
- treats Claude Code as **one channel type** among many (with first-class support for mid-turn interrupts via `claude/channel`)
- accepts **non-Claude agents** as peers — voice agents (LiveKit/Pipecat), OpenAI Assistants, raw HTTP webhook consumers, scheduled cron actors, human IM bridges
- exposes **typed channels** (DM, group, topic, RPC, system event, stream) so message semantics aren't shoved through one `targetSpec` string
- has a **pluggable transport layer** so a peer can join the mesh over WS, HTTP webhook, SSE, MQTT, or gRPC without changing the broker's data plane
- preserves **end-to-end encryption** as a non-negotiable for direct messages
This document specifies the architecture in three layers (identity, transport, channel), the foundational cleanup needed before adding any of it (Codex caught a few sharp issues), and the migration path that gets us there without a "v2 rewrite" event.
The CLI-first commitment from the North Star spec stays intact — every channel type and transport adapter must be invocable from `claudemesh <verb>` first, with MCP serving only `claude/channel` push.
---
## Why now
Three forcing functions:
1. **Multi-session interconnect already broke** (1.30.0 → 1.32.1). The per-session WS subsystem shipped without a push handler because the architecture assumed "one daemon WS per mesh handles everything" and then we bolted session WSes on top without finishing the inbound side. The shape is right; the wiring was incomplete. We need to formalize the role split before adding more transports.
2. **Codex review surfaced a correctness bug in the broker's drain.** `drainForMember` claims rows by setting `delivered_at = NOW()` *before* the WS push succeeds. If `ws.readyState !== OPEN` at push time, the row is marked delivered and the message is gone. This is at-most-once with no retry. Any future channel type or transport adapter inherits this bug if we don't fix it at the foundation.
3. **The agentic-comms market is becoming a thing.** Voice agents (LiveKit, Pipecat, ElevenLabs Conversational), OpenAI Assistants threads, MCP servers acting as autonomous workers, scheduled cron actors — they all need a "mesh" to coordinate. claudemesh has the right primitives (E2E crypto, peer presence, typed routing); it just needs the architecture to admit non-Claude peers without forking the codebase.
---
## Audience for this architecture
| Peer type | Identity | Transport | Channels they speak |
|---|---|---|---|
| **Claude Code session** (today) | Per-launch session pubkey, parent-attested by member key | WS to broker | DM, group, topic, system events; receives mid-turn push via MCP `claude/channel` |
| **Headless agent** (e.g. cron job, Hermes/OpenClaw worker) | Member pubkey (no per-launch session) | WS to broker, OR HTTP webhook outbound | DM, group, topic; no mid-turn push (polls inbox) |
| **Voice agent** (LiveKit/Pipecat call) | Service identity (signed by mesh owner) | WS to broker, possibly via TURN relay | DM (transcript stream), group (call participants), system events (call lifecycle) |
| **OpenAI Assistant / Anthropic Agent** (Skill SDK) | Service identity, OAuth-style scoped token | HTTP webhook (server-side push) OR WS | DM, RPC (tool-style request/response) |
| **Human via Slack/WhatsApp bridge** | Service identity for the bridge, end-user mapped via membership | WS (bridge to broker) | DM, topic |
| **Webhook consumer** (Stripe-style passive listener) | Service identity, scoped to one channel | HTTP webhook outbound only | Topic (subscribe to events) |
Every row in this table needs to work without changing the broker's data plane.
---
## Layer 1: Identity
### Today
Two identity types coexist:
- **Member identity** — stable Ed25519 keypair held in `~/.claudemesh/config.json`. One per joined mesh. Used for hello signature on the daemon's main WS; used as the cryptographic root of trust for sibling sessions.
- **Session identity** — ephemeral Ed25519 keypair generated per `claudemesh launch`. Parent-signed attestation vouches for it (TTL 12h, broker cap 24h). Used for hello signature on the per-session WS; used as the routing key for DMs targeted at *this specific launched session*.
This is enough for Claude Code peers. It's not enough for the audience table above.
### Proposed: third identity type — **service identity**
A service identity is what a non-Claude integration uses to authenticate:
```
ServiceIdentity {
  member_id       // The mesh member that owns this service (auth boundary)
  service_id      // Stable id for the service ("openai-assistant-foo", "livekit-room-bar")
  service_type    // "openai-assistant" | "livekit-room" | "webhook" | "voice-agent" | ...
  scopes          // ["dm:read", "topic:write", "rpc:invoke", ...]
  attestation     // member-signed: { service_id, scopes, expires_at, signature }
  transport_hint  // "ws" | "http-webhook" | "sse" — informs how the broker reaches it
}
```
**Three identity types, one auth model:**
- All identities resolve to a `member_id` (the auth boundary — grants, kicks, bans operate on members).
- Identities differ in *liveness* (member = always; session = per-launch; service = scoped/scheduled) and in *transport hint* (member/session = WS-resident; service = polymorphic).
**Backward compatibility:** existing member + session identities are unchanged. Service identity is additive.
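As a rough illustration of how the broker might gate a service request on its attestation, here is a sketch that checks expiry and scope only. It assumes `expires_at` is epoch milliseconds and does not verify `signature`; `scopeAllowed` is a hypothetical helper.

```typescript
interface ServiceAttestation {
  service_id: string;
  scopes: string[];
  expires_at: number; // assumption: epoch milliseconds
  signature: string;  // member signature; NOT verified in this sketch
}

/** True when the attestation is unexpired and lists the required scope verbatim. */
function scopeAllowed(att: ServiceAttestation, requiredScope: string, nowMs: number): boolean {
  if (nowMs >= att.expires_at) return false;
  return att.scopes.indexOf(requiredScope) !== -1;
}
```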
### Cryptographic implications
- E2E encryption (`crypto_box`) targets a public key. Member pubkey, session pubkey, service pubkey all work the same way.
- A service that can't hold a long-lived secret (e.g. OpenAI Assistant calling out via HTTPS) gets a **delegated identity** the daemon holds — sender encrypts to the daemon's per-member key, daemon re-encrypts and forwards over the service's webhook. This adds trust in the daemon, but it's the only way to bridge to non-crypto-native peers without giving them raw secrets.
---
## Layer 2: Transport
### Today
One transport: **WebSocket to broker** (`wss://ic.claudemesh.com/ws`). Everything goes through it — hello, send, push, RPC. The CLI's daemon holds two WS instances per mesh (member-keyed `DaemonBrokerClient` + per-launch `SessionBrokerClient`).
### Proposed: transport adapter interface
```typescript
interface BrokerTransport {
  /** One-time hello + auth handshake. Identity is opaque to the transport. */
  connect(opts: TransportConnectOpts): Promise<TransportSession>;
  /** Send a typed envelope. Returns a delivery promise (ack or terminal failure). */
  send(envelope: Envelope): Promise<SendResult>;
  /** Stream of inbound envelopes. Pull-model so a transport can be a webhook,
   *  not just a long-lived socket. */
  inbound(): AsyncIterable<Envelope>;
  /** Close cleanly. */
  close(reason?: string): Promise<void>;
  /** Capabilities surfaced to the daemon — broker uses this to decide
   *  whether mid-turn push is possible, whether RPC blocks are
   *  supported, etc. */
  capabilities: TransportCapabilities;
}
```
**Concrete adapters at v2.1.0:**
1. **`WsBrokerTransport`** — current WS implementation. The `DaemonBrokerClient` and `SessionBrokerClient` are recast as two roles using this transport with different hello payloads.
2. **`HttpWebhookTransport`** — for service identities that can't hold a WS open. Outbound: HTTP POST to the broker's `/v1/send`. Inbound: broker calls back to a registered webhook URL with retry + signature. Mid-turn push is not possible (degrades gracefully).
3. **`SseTransport`** — for browsers / restricted environments. Outbound: HTTP POST. Inbound: SSE stream from broker to client.
**Future adapters (v2.3+):**
4. **`LiveKitTransport`** — for voice agents. The "broker" is a LiveKit room; messages are LiveKit data-channel packets. Bridges to the central broker via a daemon side-car.
5. **`MqttTransport`** — for IoT / fleet scenarios.
6. **`GrpcTransport`** — for low-latency intra-cluster.
Any new adapter implements the same interface; broker logic is transport-agnostic at the API boundary.
### The two-role model (Codex's correction)
Even within one transport, the daemon holds **two roles per mesh**, not one connection per launch:
- **Control-plane connection** — one per mesh, member-keyed. Carries: outbox drain (one queue, can't race), `list_peers`/state/memory/skill RPCs, inbound for `*` broadcasts and member-targeted DMs (legacy traffic + zero-launch state).
- **Session connections** — N per mesh, session-keyed. Carries: presence row keyed on session pubkey, inbound for session-targeted DMs.
This is what we have today; the spec just makes the role split explicit. The mistake in 1.30.0–1.32.0 was treating session connections as "presence-only" instead of "second-class peers." 1.32.1 corrects that.
### Foundational cleanup (ship first, before any new transport)
1. **Extract `connectWsWithBackoff` helper** — current `DaemonBrokerClient` and `SessionBrokerClient` duplicate the WS lifecycle (open, hello, ack-timeout, close, backoff, reconnect). Codex's recommendation: composition, not inheritance. A single helper takes `{ url, buildHello, onMessage, onStatusChange }` and both clients call it. Eliminates the drift bug class that produced session_replaced thrashing.
2. **Drop the daemon's stray `sessionPubkey`** (`apps/cli/src/daemon/broker.ts:113`). It's a leftover from the era when the daemon WS was the only WS. The session role now owns session pubkeys. If we want the daemon itself to be addressable by a stable pubkey, rename it `daemonPubkey` and document it; today it's dead ballast.
3. **Tighten daemon-WS inbound filter, don't remove it** (Codex's correction to my prior take). Daemon WS should still receive `*` broadcasts and member-targeted DMs (legacy senders, zero-launch state). It should NOT decrypt session-targeted DMs (that's the session WS's job, and decryption requires the session secret which the daemon WS doesn't have anyway).
4. **Fix the broker drain race** (`apps/broker/src/broker.ts:2399-2402`). Add `claimed_at` + `claim_id` columns; claim sets `claimed_at = NOW()` (NOT `delivered_at`); push runs; `delivered_at = NOW()` is set ONLY after `ws.send` succeeds. Re-eligible if `claimed_at` is older than the lease timeout (e.g. 30s). Combined with `client_message_id` dedupe on the receiver side, this gives at-least-once semantics, which is what an agentic comms platform needs.
5. **Decouple presence-WS-role from session-WS-role at the broker.** Today `connectPresence` is called from both `handleHello` and `handleSessionHello`. The two paths diverge in identity (member vs session pubkey) and dedup key (sessionId in both cases). Make the role explicit on the presence row (`role: "control-plane" | "session" | "service"`) so list_peers, fan-out, and reconnect can reason about it. Hidden `claudemesh-daemon` rows in 1.32.0's `peer list` are a hack covering for missing typing.
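Cleanup item 4 (the drain race) comes down to separating "claimed" from "delivered". The sketch below models it in memory; in the broker this would be SQL against the queue table. `QueuedMessage`, `claim`, `deliver`, and the `push` callback are hypothetical stand-ins, and the 30s lease is the example value from the text.

```typescript
interface QueuedMessage {
  id: string;
  claimedAt: number | null;
  claimId: string | null;
  deliveredAt: number | null;
}

const LEASE_MS = 30000; // example lease timeout from the spec text

/** Phase 1: claim undelivered rows whose lease is free or stale. Never touches deliveredAt. */
function claim(rows: QueuedMessage[], claimId: string, nowMs: number): QueuedMessage[] {
  const claimed: QueuedMessage[] = [];
  for (const row of rows) {
    if (row.deliveredAt !== null) continue;
    const leaseHeld = row.claimedAt !== null && nowMs - row.claimedAt <= LEASE_MS;
    if (leaseHeld) continue;
    row.claimedAt = nowMs;
    row.claimId = claimId;
    claimed.push(row);
  }
  return claimed;
}

/** Phase 2: mark delivered ONLY after the push succeeds and the claim is still ours. */
function deliver(row: QueuedMessage, claimId: string, nowMs: number, push: (id: string) => boolean): boolean {
  if (row.claimId !== claimId) return false; // another drain re-claimed the row
  if (!push(row.id)) return false;           // row stays claimed; eligible again after the lease
  row.deliveredAt = nowMs;
  return true;
}
```

A crash between `claim` and `deliver` leaves `deliveredAt` null, so the next drain after the lease expires picks the row up again. That is the redelivery the crash-mid-push test is meant to exercise.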
---
## Layer 3: Channels
### Today
One channel type: **direct messages with target-spec routing**. `targetSpec` is a string that the broker pattern-matches:
- `<64-hex-pubkey>` → DM to that member or session
- `*` → broadcast to mesh
- `@<groupname>` → group post
- `#<topicId>` → topic post
This works but it's overloaded — the same `send` verb covers DMs, broadcasts, groups, topics, and (since v0.9) tagged messages. As we add agentic peers, the semantics matter and the routing key string can't carry them.
### Proposed: typed channel envelope
```typescript
type ChannelType =
  | "dm"      // 1:1 message, encrypted to recipient pubkey
  | "group"   // post to named group, encrypted per-recipient (today: base64 plaintext)
  | "topic"   // pub/sub topic, persisted, history available, per-topic symmetric key
  | "rpc"     // request/response, correlation id, timeout, structured result
  | "system"  // peer_joined / peer_left / topology / lifecycle events
  | "stream"; // long-lived data stream (voice transcript, log tail, file transfer chunks)

interface Envelope {
  /** Schema version. v1 = current opaque shape. v2 = this typed shape. */
  v: 2;
  /** What semantics the receiver should apply. */
  channel: ChannelType;
  /** Target — pubkey for dm, group name for group, topic id for topic, etc.
   *  Same wire format as today's targetSpec, but typed. */
  target: string;
  /** Sender identity (member, session, or service pubkey). */
  from: string;
  /** Encrypted payload + crypto envelope. Channel type drives crypto:
   *  - dm: crypto_box to recipient pubkey
   *  - group: per-recipient seal (today: plaintext)
   *  - topic: symmetric key (today: plaintext, v0.2.0+ adds per-topic key)
   *  - rpc / system / stream: same as DM (crypto_box) */
  body: { nonce: string; ciphertext: string; bodyVersion: number };
  /** Optional metadata, varies by channel type. */
  meta?: {
    /** Stable client-supplied id for dedupe (existing field, made required for v2). */
    clientMessageId: string;
    /** Sender's canonical fingerprint per spec §4.4 (existing field). */
    requestFingerprint?: string;
    /** dm/group: priority gate (now/next/low). rpc: timeout_ms. stream: chunk_id. */
    priority?: "now" | "next" | "low";
    timeoutMs?: number;
    streamChunkId?: number;
    /** dm/topic: replyTo for threading. */
    replyToId?: string;
    /** topic: mentions list (existing field). */
    mentions?: string[];
    /** rpc: correlation back-edge so the broker can route the response. */
    rpcCorrelationId?: string;
  };
  /** Sender signature over (channel, target, from, nonce, ciphertext, meta). */
  signature?: string;
}
```
**Why this matters for agentic peers:**
- A voice agent sending a partial transcript wants `channel: "stream"` semantics — high-frequency, small chunks, idempotent, no per-message ack required.
- An OpenAI Assistant calling a tool wants `channel: "rpc"` — request-response with timeout, correlation back-edge so the response routes.
- A scheduled cron actor reporting completion wants `channel: "topic"` — fire-and-forget, persisted history.
- Today all of these get bolted onto `dm` with conventions; v2 envelope makes them first-class.
### Claude Code channels — first-class support
Three specific channel features for Claude Code:
1. **Mid-turn interrupt** (`claude/channel` push). Already implemented via the MCP push-pipe. The new envelope makes it explicit: `channel: "dm"` with `meta.priority: "now"` triggers MCP push to a launched session. Other priorities deliver at next inbox poll.
2. **Reply threading** (`meta.replyToId`). Already partially supported on topics; v2 makes it work uniformly across `dm` and `topic`. The receiver Claude Code session sees a structured reply thread instead of flat history.
3. **Mentions** (`meta.mentions`). Already supported on topics; v2 surfaces them on `dm` too — useful for `@<peer>` callouts in groups even when the message body is encrypted.
### Backward compatibility
Envelope v1 (today's shape) stays accepted by the broker until v3.x. v1 envelopes are auto-upgraded server-side: `channel` inferred from `targetSpec` shape (`*` → group/broadcast, `#` → topic, hex → dm). Existing CLIs keep working.
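The server-side upgrade could infer the channel along these lines. A sketch, not the broker's actual matcher: `inferChannel` is a hypothetical name, and mapping `@<groupname>` to `group` is my reading of the target-spec list above rather than something the upgrade rule states.

```typescript
type InferredChannel = "dm" | "group" | "topic";

/** Infer a v2 channel from a v1 targetSpec; null means the spec is not recognized. */
function inferChannel(targetSpec: string): InferredChannel | null {
  if (targetSpec === "*") return "group"; // broadcast is treated as group/broadcast
  if (targetSpec.length > 1 && targetSpec.charAt(0) === "#") return "topic";
  if (targetSpec.length > 1 && targetSpec.charAt(0) === "@") return "group"; // assumption
  if (/^[0-9a-f]{64}$/i.test(targetSpec)) return "dm"; // 64-hex pubkey
  return null;
}
```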
---
## Future integrations (concrete)
These are not part of v2.0 — they're the test cases the architecture must support:
### LiveKit voice agent
- Service identity: `livekit-room-<id>`, signed by mesh owner.
- Transport: dedicated daemon side-car hosts a LiveKit participant; data-channel packets bridge to the central broker via WS.
- Channels: `stream` for transcript chunks, `system` for call lifecycle (joined/left/muted), `dm` for sidebar text.
- E2E: per-call ephemeral keypair held by the side-car; participants' member keys are discovered via mesh peer list.
### OpenAI Assistant integration
- Service identity: `openai-assistant-<id>`, scoped to one or more topics + RPC.
- Transport: HTTP webhook out (broker → assistant API), HTTP POST in (assistant → broker `/v1/send`).
- Channels: `rpc` for tool-style invocations from claudemesh peers, `topic` for assistant-published events.
- Crypto: delegated to daemon (assistant can't hold a libsodium secret; daemon re-encrypts on its behalf).
### Generic webhook consumer (Stripe-style)
- Service identity: `webhook-<consumer-id>`, scoped to subscribed topics.
- Transport: HTTP webhook out only. No inbound — it's a passive sink.
- Channels: `topic` only.
- Crypto: not E2E; webhook bodies are signed (HMAC-SHA256, sender = mesh) but plaintext.
### Human-via-WhatsApp bridge
- Service identity: `whatsapp-bridge`, with member-mapping for each end-user.
- Transport: WS (bridge holds long connection to broker), bridges to WhatsApp Business API.
- Channels: `dm` (1:1 chat → WhatsApp DM), `topic` (claudemesh topic → WhatsApp group).
- E2E: bridge holds a per-end-user delegated key; not "true" E2E to the WhatsApp side, but signaled clearly in UX.
---
## Migration plan
### v2.0.0 — Foundational cleanup (no new external surface)
**Target: 1–2 weeks**
- [ ] Extract `connectWsWithBackoff` helper, refactor `DaemonBrokerClient` + `SessionBrokerClient` to use it.
- [ ] Drop daemon's stray `sessionPubkey` (or rename + document).
- [ ] Tighten daemon-WS inbound filter (broadcast + member-targeted only).
- [ ] Add `presence.role` column (`control-plane | session | service`); broker fan-out + list_peers honor it.
- [ ] **Fix drain race**: schema migration adds `claimed_at`, `claim_id`, `claim_expires_at` columns; rewrite `drainForMember` for two-phase claim/deliver; add re-claim path for stale leases.
- [ ] Receiver-side: harden `client_message_id` dedupe (already partial in 1.32.x; finish for at-least-once). Add idempotent insert that returns existing row on conflict.
**Success criteria:**
- Two-session smoke test still passes (1.32.1 baseline).
- Crash-mid-push test: kill broker between claim and send; verify message redelivers on broker restart + recipient reconnect.
- Reconnect storm test: 100 reconnect cycles per session over 60s; zero message loss.
### v2.1.0 — Transport adapter interface
**Target: 2–3 weeks after v2.0.0**
- [ ] Define `BrokerTransport` interface; refactor existing WS code to be the first implementation.
- [ ] Add `HttpWebhookTransport` adapter (broker side: outbound HTTP POST with retry + HMAC signature; daemon side: HTTP server that receives webhook callbacks and inserts into inbox).
- [ ] Add `/v1/send` HTTP endpoint on the broker (today the broker is WS-only for sends).
- [ ] Service identity registration flow: `claudemesh service register --type webhook --scopes dm:read,topic:write` mints attestation, stores it locally + on broker.
- [ ] Basic `SseTransport` for browser/CI use cases.
**Success criteria:**
- A scheduled cron job using only `curl` can send to the mesh (no daemon required).
- A webhook consumer subscribed to a topic receives messages within 5s of post.
### v2.2.0 — Typed channels (envelope v2)
**Target: 2–3 weeks after v2.1.0**
- [ ] Define `Envelope v2` schema; broker accepts both v1 and v2; sender-side code emits v2.
- [ ] `channel: "rpc"` end-to-end: correlation id routing, response timeout, `claudemesh rpc <peer> <method> <args>` CLI verb.
- [ ] `channel: "stream"` end-to-end: chunked delivery, ordered, idempotent, `claudemesh stream <peer> <stream-id>` CLI verb.
- [ ] Mid-turn push (`claude/channel`) honors `channel: "dm"` with `meta.priority: "now"` only.
- [ ] Mentions + replyToId surface uniformly across dm and topic.
**Success criteria:**
- Demo: a Claude Code session sends an `rpc` to another Claude Code session, gets a structured response.
- Demo: a voice-agent prototype sends `stream` chunks; another peer receives them in order with no gaps.
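"In order with no gaps" on the receiving side implies some reassembly step. A minimal sketch, assuming chunk ids start at 0 and increase by 1 per stream (the spec does not fix the numbering); `StreamReassembler` is a hypothetical name, and a real receiver would keep one instance per stream id.

```typescript
class StreamReassembler {
  private nextChunkId = 0;                      // assumption: ids start at 0, step 1
  private pending = new Map<number, string>();  // out-of-order chunks waiting for a gap to fill
  private deliverFn: (data: string) => void;

  constructor(deliverFn: (data: string) => void) {
    this.deliverFn = deliverFn;
  }

  /** Accept a chunk in any order; duplicates are dropped, delivery is strictly in order. */
  accept(chunkId: number, data: string): void {
    if (chunkId < this.nextChunkId || this.pending.has(chunkId)) return; // duplicate
    this.pending.set(chunkId, data);
    let ready = this.pending.get(this.nextChunkId);
    while (ready !== undefined) {
      this.pending.delete(this.nextChunkId);
      this.deliverFn(ready);
      this.nextChunkId += 1;
      ready = this.pending.get(this.nextChunkId);
    }
  }
}
```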
### v2.3+ — Concrete external integrations
**Target: opportunistic**
- LiveKit side-car (one PoC integration to validate the architecture).
- OpenAI Assistant integration (validate delegated-key crypto path).
- WhatsApp bridge (validate human-bridge service identity).
These are not on the critical path for the architecture; they prove it.
---
## Non-goals (explicit)
- **Replacing Slack / Discord.** claudemesh is for agent coordination. Human chat is a side-effect, not the headline.
- **Federation across multiple brokers.** v2.0 stays single-broker per mesh. Multi-broker (gossip / federation) is a separate spec, post-v3.
- **Sync-only / no-broker P2P.** Direct peer-to-peer (without the central broker) is a different architecture (libp2p, Iroh). Not in scope.
- **Replacing the MCP push-pipe.** Mid-turn interrupt stays MCP-based. The transport-adapter layer is broker-side; MCP is daemon-to-Claude-Code, untouched.
---
## Open questions
1. **How does a service identity prove liveness?** WS gives us implicit liveness via the connection. HTTP webhook services need an explicit heartbeat / health-check. Proposal: broker periodically POSTs to `<webhook>/health`; service is marked offline after 3 consecutive failures.
2. **RPC routing through offline peers — what's the failure mode?** If `claudemesh rpc <peer> ...` and the peer is offline, do we (a) queue and wait (DM semantics) or (b) fail fast (REST semantics)? Proposal: RPC fails fast with `peer_offline` after a 5s probe; explicit `--wait` flag opts into DM-style queue.
3. **Per-topic symmetric key rotation.** Existing v0.2.0 spec mentions per-topic keys. Rotation policy (when, who triggers, how members re-sync) is unsolved. Defer to a separate spec; v2.2.0 ships with one-shot keys (rotate by re-creating topic).
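The liveness proposal in question 1 reduces to a consecutive-failure counter per webhook service. A sketch with a hypothetical `WebhookLiveness` class; the probe itself (the POST to `<webhook>/health`) is left out, and the assumption that a single success brings the service back online is mine.

```typescript
type LivenessState = "online" | "offline";

class WebhookLiveness {
  private consecutiveFailures = 0;
  private threshold: number;

  constructor(threshold: number = 3) {
    this.threshold = threshold;
  }

  /** Record one health-probe result and return the resulting state. */
  record(probeSucceeded: boolean): LivenessState {
    this.consecutiveFailures = probeSucceeded ? 0 : this.consecutiveFailures + 1;
    return this.consecutiveFailures >= this.threshold ? "offline" : "online";
  }
}
```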
---
## Acknowledgements
Cross-checked with Codex (GPT-5.2, high reasoning) on the foundational cleanup section. Codex caught:
- The "remove daemon-WS inbound entirely" idea would silently lose broadcasts + member-targeted DMs whenever zero launches exist. Corrected.
- Inheritance for the dup'd lifecycle would become a god class. Composition via helper is the right call.
- The drain race needs a `claimed_at` + delivered-on-success fix; "check OPEN before claim" still drops on crash.
- Token-keyed registry is correct (token = auth boundary), not a smell.
The agentic-comms / typed-channels / transport-adapter layers are mine — Codex didn't touch those because the question I asked was about the existing architecture's smells, not the future roadmap.


@@ -0,0 +1,282 @@
# Per-session broker presence — daemon-multiplexed
**Status:** spec, queued for 1.30.0 (alongside launch-wizard refactor).
**Owner:** alezmad
**Author:** Claude (Sprint A planning, 2026-05-04)
**Related:** `2026-05-04-v2-roadmap-completion.md` (Sprint A overview),
1.29.0 session-registry CHANGELOG entry.
## Problem
After 1.28.0 dropped the bridge tier, **launched `claude` sessions have
no persistent broker presence**. Only the daemon does.
Concretely: two `claudemesh launch` sessions in the same cwd, querying
`peer list` 2 s apart, **never see each other**. Each `claudemesh peer
list` opens a short-lived cold-path WS that creates a `presence` row
for the duration of the query and tears it down. The "this session"
row everyone sees in their own snapshot is created by the snapshot
itself; sibling sessions' queries miss it because their WS-lifetimes
don't overlap.
Confirmed empirically (2026-05-04, same-cwd ECIJA-Intranet test):
| Snapshot | timestamp | self pubkey | self `connectedAt` |
|---|---|---|---|
| Session A | 11:42:37Z | `61d96106cb499208` | 11:42:38Z (= query time) |
| Session B | 11:42:39Z | `ce77188aba02827d` | 11:42:38Z (= query time) |
Each saw 5 long-lived peers (the daemon and unrelated other sessions)
plus its own ephemeral row. Neither saw the other.
## Goal
Every launched `claude` session has a long-lived broker presence row
**owned by the daemon**, identified by the session's per-launch
keypair. Siblings see each other in `peer list` immediately and
continuously, not as snapshot artifacts.
## Non-goals
- Cross-machine session sync (waiting on 2.0.0 HKDF identity).
- Replacing the daemon's own presence row — the daemon stays as a
separate row for "the user on this machine, no specific session."
- Persistence of the session-presence link across daemon restarts. It
is acceptable for a daemon restart to require launched sessions to
re-register (same compromise as the in-memory session registry from
1.29.0).
## Design
### State machine
The 1.29.0 session registry already tracks `Map<token, SessionInfo>`
inside the daemon. Extend it to own a per-session broker connection.
```
session lifecycle:
  POST /v1/sessions/register
    → registry.set(token, info)
    → daemon.openSessionWs(info)              ← NEW
    → broker creates presence row owned by session.pubkey
  DELETE /v1/sessions/:token
    → registry.delete(token)
    → daemon.closeSessionWs(token)            ← NEW
    → broker marks presence.disconnectedAt = now()
  reaper (30 s tick): pid dead?
    → registry.delete(token)
    → daemon.closeSessionWs(token)
```
### Daemon-side: per-session `BrokerClient`
Today the daemon holds `Map<meshSlug, DaemonBrokerClient>` (one WS per
attached mesh). Add a parallel `Map<token, SessionBrokerClient>` for
the per-launch ephemeral connections.
`SessionBrokerClient` is the existing `BrokerClient` reused, configured
with the session's per-launch keypair instead of the member's stable
keypair. It registers presence (`presence_join`) and stays connected
until `closeSessionWs(token)` fires. It does **not** drain the outbox
— that's the member-keypair `DaemonBrokerClient`'s job. It only carries
presence + receives DMs targeted at the session pubkey.
### Broker-side: parent-vouched presence auth
Today's broker accepts hello-sig auth where:
- Caller signs the broker's nonce with their `mesh_member` keypair.
- Broker looks up `mesh_member.peer_pubkey == sig.pubkey`.
For per-session keypairs, the session pubkey is **not** in `mesh_member`
— it's freshly generated by `claudemesh launch`. We need a new
attestation flow:
```
hello {
type: "session_hello",
session_pubkey: <fresh keypair>,
parent_member_pubkey: <member keypair from config>,
display_name, cwd, role, groups,
parent_signature: ed25519_sign(member_priv,
"claudemesh-session/" || session_pubkey || "/" || nonce),
nonce_challenge: <broker nonce>,
}
```
Broker validates:
1. `parent_member_pubkey` exists in `mesh.member` for the target mesh.
2. `parent_signature` validates against `parent_member_pubkey` over the
canonical message above.
3. Broker inserts a presence row keyed on `session_pubkey` but
`member_id` pointing at the parent member's `mesh.member.id`.
This is the OAuth-style refresh-vs-access pattern: the parent member
key vouches "this ephemeral session pubkey belongs to me." The broker
binds the row to the parent member but uses the session pubkey for
routing (so DMs targeted at the session pubkey land at this WS).
### CLI-side: launch.ts produces the parent signature
`claudemesh launch` already mints the session keypair and writes the
session-token file. Extend it to also produce a `parent_signature`
that the daemon can present when opening the session WS:
```ts
const sessionPubkey = sessionKeypair.publicKey;
const parentSig = ed25519_sign(
mesh.secretKey,
Buffer.concat([
Buffer.from("claudemesh-session/"),
sessionPubkey,
Buffer.from("/"),
/* nonce comes from broker — handled at WS-connect time */
]),
);
```
Actually, the nonce is broker-issued at hello time, so the signature
needs to be produced fresh per WS-connect. Simpler approach: the
`POST /v1/sessions/register` body carries the *member secret key* (or
a derived signing capability) so the daemon can sign nonces on behalf
of the session.
That's a key-leak risk. Better: register carries a **pre-signed
attestation** good for a TTL window:
```
register body adds:
parent_attestation: {
session_pubkey: hex,
parent_member_pubkey: hex,
expires_at: ISO,
signature: ed25519_sign(member_priv,
"claudemesh-session-attest/" ||
session_pubkey || "/" ||
expires_at),
}
```
Daemon presents this attestation in `session_hello`; broker validates
expiry and signature, then issues a nonce challenge that the daemon
can satisfy with the session keypair (which IS held by the daemon
for the lifetime of the registration). Two-stage: parent vouches the
session; session signs the nonce.
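A rough sketch of the broker-side attestation check described above, under stated assumptions: the canonical message follows the register-body sketch in this section, signature verification and the member lookup are injected as callbacks (`verifySig`, `isMember`, both hypothetical), and the result codes are illustrative names only. The 24 h bound is the one listed under "Wire changes".

```ts
interface ParentAttestation {
  session_pubkey: string;        // hex
  parent_member_pubkey: string;  // hex
  expires_at: string;            // ISO 8601
  signature: string;             // produced with the member key
}

// Canonical message, per the register-body sketch in this section.
function attestationMessage(a: ParentAttestation): string {
  return "claudemesh-session-attest/" + a.session_pubkey + "/" + a.expires_at;
}

const MAX_ATTESTATION_TTL_MS = 24 * 60 * 60 * 1000;

// Illustrative result codes, not the broker's real error names.
type AttestationResult =
  | "ok"
  | "bad_expiry"
  | "expired"
  | "ttl_too_long"
  | "unknown_member"
  | "bad_signature";

function validateAttestation(
  a: ParentAttestation,
  nowMs: number,
  isMember: (pubkey: string) => boolean,
  verifySig: (pubkey: string, message: string, signature: string) => boolean,
): AttestationResult {
  const exp = Date.parse(a.expires_at);
  if (isNaN(exp)) return "bad_expiry";
  if (exp <= nowMs) return "expired";
  // Bound attestation reuse: expires_at <= now + 24h.
  if (exp > nowMs + MAX_ATTESTATION_TTL_MS) return "ttl_too_long";
  if (!isMember(a.parent_member_pubkey)) return "unknown_member";
  if (!verifySig(a.parent_member_pubkey, attestationMessage(a), a.signature)) {
    return "bad_signature";
  }
  return "ok";
}
```

Only after this returns `"ok"` would the broker issue the nonce challenge that the session keypair signs (the second stage).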
### Registry persistence
For now, in-memory only (matching 1.29.0). Daemon restart drops all
session WSes; launched `claude` processes are responsible for
re-registering on next CLI invocation. Acceptable v1 behaviour;
revisit when sqlite persistence lands for the registry.
## Wire changes
### Broker
- New `session_hello` message type (additive; existing `hello` for
member auth unchanged).
- `presence` row schema unchanged — `member_id` still required, but
`session_pubkey` differs from member's stable pubkey.
- Validate `parent_attestation.expires_at <= now() + 24h` to bound
attestation reuse.
### Daemon
- New `SessionBrokerClient` factory — wraps `BrokerClient` with
session-mode hello.
- `Map<token, SessionBrokerClient>` alongside the existing
`Map<slug, DaemonBrokerClient>`.
- IPC routes:
- `POST /v1/sessions/register` — extend body schema with
`parent_attestation`.
- `DELETE /v1/sessions/:token` — close the session WS first, then
drop registry entry.
### CLI (`claudemesh launch`)
- Mint session keypair (today only writes the session token; need to
add ed25519 keypair generation per launch and write the privkey
alongside the token).
- Sign `parent_attestation` with the member key from the joined-mesh
config.
- POST register with both the new keypair and the attestation.
## LoC estimate
- Daemon `SessionBrokerClient` + registry hook: ~120 LoC.
- IPC route schema extension + validation: ~40 LoC.
- Broker `session_hello` handler + tests: ~140 LoC.
- CLI `claudemesh launch` keypair + attestation: ~60 LoC.
- Tests + smoke: ~80 LoC.
Total: **~440 LoC** across CLI + daemon + broker.
## Risks
| Risk | Mitigation |
|---|---|
| Member private key never leaves the user's machine, but the **attestation** (signed token) can be replayed within its TTL. | TTL bound 24h; refresh on launch; revocation path = drop the parent member's mesh enrollment (nuclear, but works). |
| Cascading WS connections — N launches = N+1 broker WSes per user. | Acceptable up to 10-20 concurrent sessions; if it ever becomes a problem, multiplex per-session at the protocol level (one WS, multiple presence rows). Out of scope for v1. |
| Daemon restart kills all session WSes — `peer list` from inside a launched session sees the remaining 5 peers but not its own siblings until they re-register. | Same as 1.29.0 registry. The registry could persist to sqlite later; for v1, accepted. |
| Broker schema cost: every new presence row has a different `session_pubkey`, growing the table faster. | Already accepted — broker prunes disconnected rows on a 30-day window. Per-session keys triple the row count at peak but stay within the prune budget. |
## Compatibility
- **Older brokers** can't validate `session_hello`. Sessions will
attempt the new hello, get back `unknown_message_type`, and fall
back to the existing member-keyed hello (no per-session presence,
but everything still works as 1.28.0). Add the broker change first,
let it deploy, then ship the CLI side.
- **Older CLIs** continue to work unchanged — they don't open
per-session WSes. They appear as ephemeral cold-path rows just like
today, and lose the symmetric-visibility property between siblings.
- **Backward visible:** users on 1.30.0+ on the same mesh as users on
≤1.29.x will see the older users as one row (their daemon) instead
of one row per session. Acceptable — opt-in to the new visibility
by upgrading.
## Sequencing
1. **Broker change ships first.** Add `session_hello` handler, deploy,
bake for ~24h. No CLI behaviour change yet.
2. **Daemon `SessionBrokerClient` ships next** behind a feature flag
(`CLAUDEMESH_SESSION_PRESENCE=1`). Manually test with two launched
sessions in the same cwd; verify both see each other.
3. **CLI keypair-mint + attestation in `launch.ts` ships last**, behind
the same flag.
4. Flip the flag default in 1.30.0 release; document rollback via env.
## Verification
End-to-end smoke (paste into 1.30.0's CHANGELOG):
```
$ # In two different shells, both cd ~/Desktop/foo:
$ claudemesh launch --name SessionA -y # shell 1
$ claudemesh launch --name SessionB -y # shell 2
$
$ # In a third shell:
$ claudemesh peer list --json --mesh foo | jq '.[] | {n: .displayName, c: .cwd}'
{ "n": "SessionA", "c": "/.../foo" } ← persistent, not query-induced
{ "n": "SessionB", "c": "/.../foo" }
$
$ # In SessionA's shell:
$ claudemesh peer list --mesh foo
should include SessionB.
$
$ # Kill SessionB (Ctrl-C in shell 2). Wait <30s.
$ claudemesh peer list --mesh foo
should NOT include SessionB (reaper closed its WS).
```
## Open questions
- Should the per-session WS also drain *its own* outbox subset, or stay
presence-only? Recommend presence-only for v1 — keeps state machines
simple, daemon's member-keyed WS handles all sends. Can be revisited
when per-session policy DSL ships.
- Should the parent attestation be revocable mid-session? Could add an
IPC route on the daemon. Out of scope for v1; revoke = drop the
whole member enrollment.


@@ -0,0 +1,288 @@
# Session capabilities — first-class concept
**Status:** spec, queued behind v0.3.0 topic-encryption work.
**Owner:** alezmad
**Author:** Claude (Sprint B follow-up, 2026-05-04)
**Related:** `2026-04-15-per-peer-capabilities.md` (existing per-peer
caps system, member-keyed), `2026-05-04-per-session-presence.md`
(per-launch session presence — what we're now restricting).
## Problem
Per-peer capability grants (`apps/broker/src/index.ts:2178+, 2309+`)
are keyed on the sender's **stable member pubkey**. The grant model
gives the recipient fine-grained control: "alice can DM me",
"bob can read state but not broadcast", etc.
But: as of v1.30.0 (`per-session-presence`), every `claudemesh
launch` mints a per-launch ephemeral keypair with a parent attestation
binding it to the member identity. The launched session inherits **all**
the member's capabilities transitively, because cap enforcement always
falls through to the member key.
Concretely:
- Member `alice` is in mesh `flexicar`, granted `dm + state-read +
state-write` by everyone.
- Alice launches a session with `claudemesh launch` to do an automated
task — say, run a Claude Code agent that iterates over PRs.
- That session has full member privileges. It can DM peers, write
shared state keys (e.g. clobber `current-pr`), grant new caps, ban
members, etc. — none of which the user wanted to delegate.
There is no way to express "this session can DM peers but cannot
deploy services or grant caps." The parent attestation is a binary
existence proof — "this session was vouched by a member" — with no
capability subset.
Plus an adjacent footgun: `set_state` (`apps/broker/src/index.ts:2949`)
has **no cap check at all**. Anyone in the mesh can write any key. The
spec at `2026-04-15-per-peer-capabilities.md` lists `state-write` as a
planned cap but it was never wired into the broker. Shared keys like
`current-pr` are write-anyone today.
## Goal
A launched session can be issued **a capability subset** of its
parent member, signed by the parent at launch time, and the broker
enforces the **intersection** of recipient grants × session caps on
every protected operation.
## Non-goals
- Changing the existing per-peer cap model. Member-keyed grants stay
authoritative for "who is allowed to talk to me."
- Cross-machine session caps (waiting on 2.0.0 HKDF identity).
- Per-tool granularity inside the Claude Code MCP surface — this
spec only covers the broker-enforceable verbs (dm, broadcast,
state-read, state-write, grant, kick, ban, profile-write,
service-deploy).
- Delegation: a session cannot re-vouch a sub-session with its own
cap subset. Only members can attest sessions. (Could be lifted in
a future spec; today's launch flow doesn't need it.)
## Design
### Capability vocabulary
Existing (today, member-level):
| Capability | Effect when GRANTED on a recipient → sender pair |
|---------------|---------------------------------------------------|
| `read` | Sender appears in recipient's `list_peers` |
| `dm` | Sender can DM recipient |
| `broadcast` | Sender's broadcasts reach recipient |
| `state-read` | Sender can read shared state |
| `state-write` | (planned) Sender can write shared state |
| `file-read` | Sender can fetch files recipient shared |
New (session-level — cap subset on the attestation):
These are the **verbs the session is allowed to invoke**, NOT what
peers can do TO it. A session attestation declaring `["dm", "read"]`
means the session can SEND dm/read-list operations; it cannot
broadcast, write state, grant, etc.
| Session cap | Gates which broker operations |
|-------------------|------------------------------------------------|
| `dm` | `send` with single recipient |
| `broadcast` | `send` with `*`, `@group`, `#topic` |
| `state-read` | `get_state`, `list_state` |
| `state-write` | `set_state` |
| `grant` | `grant`, `revoke`, `block` |
| `kick` | `kick`, `disconnect` |
| `ban` | `ban`, `unban` |
| `profile-write` | `set_profile`, `set_summary`, `set_status` |
| `service-deploy` | `mesh_service_register`, `_unregister` |
The default cap set when no subset is declared: the **full member
set** (today's behavior — opt-in restriction, not breaking).
### Attestation v2
Existing v1 (`apps/cli/src/services/broker/session-hello-sig.ts`):
```
canonical = `claudemesh-session-attest|<parent>|<session>|<expires>`
```
New v2 (additive — broker accepts both):
```
canonical = `claudemesh-session-attest-v2|<parent>|<session>|<expires>|<sorted-caps-csv>`
```
Where `<sorted-caps-csv>` is the lower-cased, comma-joined,
ASCII-sorted cap list. An omitted list = full member caps (default,
back-compat); an explicitly empty list grants no caps (see Test plan).
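As an illustration, the two canonical strings could be built as below. The function names are hypothetical, and the sketch only covers string construction: signing with `crypto_sign_detached` is left to the caller, and how an omitted cap list is represented is not addressed here.

```ts
// Lower-case, ASCII-sort, and comma-join the cap list.
function normalizeCaps(caps: string[]): string {
  const lowered = caps.map((c) => c.toLowerCase());
  // The default sort compares UTF-16 code units, which is ASCII order
  // for ASCII-only input such as cap names.
  return lowered.sort().join(",");
}

function canonicalV1(parent: string, session: string, expires: string): string {
  return ["claudemesh-session-attest", parent, session, expires].join("|");
}

function canonicalV2(
  parent: string,
  session: string,
  expires: string,
  caps: string[],
): string {
  return [
    "claudemesh-session-attest-v2",
    parent,
    session,
    expires,
    normalizeCaps(caps),
  ].join("|");
}
```

Sorting before joining means `--caps dm,broadcast` and `--caps broadcast,dm` produce the same canonical string and therefore the same signature input.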
**Wire shape additions on `session_hello`:**
```ts
{
type: "session_hello",
...existing fields...,
parentAttestation: {
sessionPubkey,
parentMemberPubkey,
expiresAt,
signature,
// NEW:
allowed_caps?: string[], // omitted = full member set
version?: 2, // omitted = v1
},
}
```
The broker version-detects: `version === 2` → verify v2 canonical
including `allowed_caps`. Default behavior is unchanged for clients
that don't pass it.
### Enforcement
Add `allowed_caps: string[] | null` to the in-memory `PeerConn`
shape (`apps/broker/src/index.ts:131`). Populated from
`handleSessionHello` (the v2 attestation supplies it) and from
`handleHello` (control-plane / member connection — set to `null`,
meaning "full member caps").
**Effective cap check** for a sending peer needing `cap`:
```ts
function senderHasCap(conn: PeerConn, cap: string): boolean {
if (conn.allowed_caps === null) return true; // member-level, no subset
return conn.allowed_caps.includes(cap);
}
```
Wire this into every broker operation in the table above. The
existing per-peer recipient-cap check at `2178+, 2309+` stays —
session caps gate the **sender side**, recipient grants gate the
**receive side**, and both must allow:
```
allowed = senderHasCap(conn, capNeeded) && recipientGrants[sender][capNeeded]
```
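One way to wire the session-cap table into a single lookup is sketched below. It is an assumption-laden illustration: `CAP_FOR_OP`, `capForOp`, `capForSend`, and `isAllowed` are hypothetical names, `mesh_service_unregister` expands the table's `_unregister` shorthand, and the recipient-grant lookup is reduced to a boolean argument.

```ts
type SessionCap =
  | "dm" | "broadcast" | "state-read" | "state-write"
  | "grant" | "kick" | "ban" | "profile-write" | "service-deploy";

// Operation -> required session cap, transcribed from the table above.
const CAP_FOR_OP: { [op: string]: SessionCap } = {
  get_state: "state-read",
  list_state: "state-read",
  set_state: "state-write",
  grant: "grant",
  revoke: "grant",
  block: "grant",
  kick: "kick",
  disconnect: "kick",
  ban: "ban",
  unban: "ban",
  set_profile: "profile-write",
  set_summary: "profile-write",
  set_status: "profile-write",
  mesh_service_register: "service-deploy",
  mesh_service_unregister: "service-deploy", // assumed full name
};

function capForOp(op: string): SessionCap | undefined {
  return Object.prototype.hasOwnProperty.call(CAP_FOR_OP, op)
    ? CAP_FOR_OP[op]
    : undefined;
}

// `send` needs `broadcast` for `*`, `@group`, `#topic`; otherwise `dm`.
function capForSend(target: string): SessionCap {
  const first = target.charAt(0);
  return target === "*" || first === "@" || first === "#" ? "broadcast" : "dm";
}

interface ConnCaps {
  allowed_caps: string[] | null; // null = member-level, no subset
}

function senderHasCap(conn: ConnCaps, cap: string): boolean {
  if (conn.allowed_caps === null) return true;
  return conn.allowed_caps.indexOf(cap) !== -1;
}

// Both sides must allow: session caps (sender) and recipient grants.
function isAllowed(conn: ConnCaps, cap: SessionCap, recipientGrants: boolean): boolean {
  return senderHasCap(conn, cap) && recipientGrants;
}
```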
### `set_state` gate (bonus, ship together)
Today: no cap check. After this spec: `set_state` requires
`state-write` on the sender side. Migration: existing members
default to having `state-write` in their member caps (no recipient
grant model for state-write — it's a sender-side gate only, mesh-
wide). New attestations can omit it to forbid the session.
The recipient-side analog (per-peer state-write grants) is left for
a future spec — today the value of guarding state-write is
session-level (avoid an automated session clobbering shared keys),
not peer-level.
### CLI surface
```
claudemesh launch --caps dm,read # tight: read-only chat agent
claudemesh launch --caps dm,broadcast # send-only, no state writes
claudemesh launch # default: full member caps
```
`claudemesh launch --caps ?` prints the table above with descriptions.
`claudemesh peer list --json` includes `allowed_caps` per row when
present (`null` = full member). Lets users audit what their running
sessions can actually do.
### Migration plan (mirrors `2026-04-15-per-peer-capabilities.md` §"Migration plan")
1. **Broker schema additive** — `PeerConn.allowed_caps` in-memory
only; no DB column. Reload-on-reconnect is fine because the
attestation is re-sent on every WS open (it's the proof of
identity).
2. **CLI ships v2 attestation alongside v1.** New `--caps` flag
defaults to omitted (= v1 attestation, full caps). Older
brokers ignore the new fields entirely.
3. **Broker accepts v2.** When `allowed_caps` arrives, store it.
No enforcement yet — log denied operations as `cap_check_dryrun`
metric counter, still allow them through.
4. **Dry-run release.** Ship one CLI + broker release that emits
the metric but doesn't enforce. Watch for false positives in
real meshes for ≥ 1 week.
5. **Flip enforcement on.** Broker rejects operations failing the
cap check with `forbidden: missing session capability "<cap>"`.
Default ("no caps declared = full member") keeps existing
sessions unaffected.
6. **`set_state` gate** ships in step 5 alongside the rest. Default
member caps include `state-write`, so flipping it on doesn't
break existing flows. Only sessions that explicitly omit
`state-write` from `--caps` lose write access.
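Steps 3 to 5 amount to one check with two modes. A minimal sketch, assuming a hypothetical `CapGate` class whose `dryrunDenials` field stands in for the `cap_check_dryrun` metric counter:

```ts
type CapMode = "dryrun" | "enforce";

interface CapDecision {
  allow: boolean;
  error?: string;
}

class CapGate {
  mode: CapMode;
  dryrunDenials = 0; // stand-in for the cap_check_dryrun counter

  constructor(mode: CapMode) {
    this.mode = mode;
  }

  check(allowedCaps: string[] | null, cap: string): CapDecision {
    // null = no caps declared = full member set (existing sessions).
    const has = allowedCaps === null || allowedCaps.indexOf(cap) !== -1;
    if (has) return { allow: true };
    if (this.mode === "dryrun") {
      // Steps 3-4: count the would-be denial, still let it through.
      this.dryrunDenials += 1;
      return { allow: true };
    }
    // Step 5: enforcement on.
    return {
      allow: false,
      error: 'forbidden: missing session capability "' + cap + '"',
    };
  }
}
```

Keeping the mode on the gate rather than at each call site means flipping enforcement is a single configuration change.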
### Crypto notes
- v2 attestation re-uses `crypto_sign_detached` over the new
canonical string; same parent member secret key, same TTL caps
(≤24 h), same `expiresAt` semantics.
- v1 signatures are NOT v2 signatures: collision is impossible
because the canonical strings diverge immediately after the shared
`claudemesh-session-attest` stem (`|` in v1, `-v2|` in v2), so no
v1 string can equal a v2 string. Domain separation is intrinsic.
- Like the existing per-peer cap system: caps are server-enforced
metadata, not capability tokens. A malicious broker can ignore
them. This is about UX trust + footgun prevention, not protocol-
level security.
## Open questions
1. **Should the session attestation also bind to a fingerprint of
the launched binary / Claude version?** Would let a member say
"this session is constrained to Claude Code v1.34.15" so a
compromised launched-binary doesn't get reused. Probably no — too
much friction for the threat model.
2. **What's the right default for `claudemesh launch` going forward?**
Once enforcement ships, do we change the default `--caps` from
"full member" to "dm + read + state-read"? Tighter but breaks
existing automation that writes state. Probably worth a one-
release deprecation warning ("your session will lose state-write
in v2.0.0 unless you pass --caps state-write") and then flip in
v2.0.0.
3. **Does `--caps` belong in `~/.claudemesh/config.json` per-mesh
defaults too?** A user who always launches read-only agents
wants `caps: ["dm", "read"]` as a personal default. Easy add;
defer until users ask for it.
4. **Per-tool MCP cap surface?** Out of scope here, but: a `claudemesh
launch --tools peer:read,memory:write` would be a finer cut than
broker-verb caps. The broker can't enforce that — it'd live in the
MCP wrapper / Claude Code's allowedTools. Different layer.
## Test plan
- Pure-logic tests on `senderHasCap` (member-level → always true,
empty caps → always false, declared caps → exact match).
- Broker integration: launch a session with `--caps dm`, attempt
`set_state` → expect `forbidden: missing session capability
"state-write"`.
- v1 attestation still accepted, no `allowed_caps` set, all caps
permitted (back-compat).
- v2 attestation with empty `allowed_caps` array → broker treats
as "explicitly empty, no caps allowed" (NOT "full member"). The
full-member default is "field omitted entirely". Test both.
- Dry-run mode: cap fail increments the counter but the operation
proceeds. Smoke-test before flipping enforcement.
## Estimate
- Spec review + open-question resolution: 1-2 days.
- Broker change (PeerConn field, attestation v2 accept, per-verb
enforcement, dry-run mode): 2-3 days.
- CLI change (`--caps` flag, attestation builder, peer list
surface): 1 day.
- Tests: 1 day.
- Dry-run release window: ≥ 1 week.
Total: ~1 sprint of focused work, plus a dry-run window.


@@ -0,0 +1,104 @@
# v2.0.0 Daemon Redesign — Completion Roadmap
**Date:** 2026-05-04
**Owner:** alezmad
**Status:** in-progress (1.24.0 + 1.25.0 land most of it; remainder is two follow-up arcs)
## What's done
| v2.0.0 bullet | Version | Status |
|---|---|---|
| `claudemesh-daemon` long-lived launchd / systemd unit | 1.22.0 | ✅ Done |
| MCP server shrinks to thin daemon adapter | 1.24.0 | ✅ Done — 979 → ~200 LoC of push-pipe, daemon-required, no fallback |
| `claudemesh install` auto-installs + starts daemon | 1.24.0 | ✅ Done |
| `claudemesh launch` ensures daemon | 1.24.0 | ✅ Done |
| Daemon outbound routing (Sprint 4: real targets + crypto) | 1.25.0 | ✅ Done — outbox stores `mesh`, `target_spec`, `nonce`, `ciphertext`, `priority`; resolution + `crypto_box` happens at IPC accept time; drain is a forwarder |
| CLI thin-client routing for read verbs | 1.25.0 | ✅ Partial — `peer list`, `skill list/get` route through daemon when present; same `trySendViaDaemon` fallback shape |
| Ambient mode (raw `claude` Just Works) | 1.25.0 | ✅ Documented + functional for the daemon's attached mesh |
## What remains (in dependency order)
### A. Daemon multi-mesh (the prerequisite for "ambient mode for everything")
**Why it's the critical path:** ambient mode today only works for the single mesh the daemon is attached to. Users with N meshes either run N daemons (different sock paths) or restart the daemon to switch. Neither is acceptable for the v2.0.0 promise.
**What it takes:**
- Daemon holds `Map<slug, DaemonBrokerClient>` instead of one broker.
- Outbox row's `mesh` column (1.25.0 added) is the dispatch key.
- IPC `/v1/send` requires `mesh` field (or infers from target prefix `<slug>:<target>`).
- IPC read endpoints (`/v1/peers`, `/v1/skills`, `/v1/profile`) accept `?mesh=<slug>` or return mesh-grouped results.
- SSE event payloads already include `mesh` slug; no change needed.
- Drain worker selects broker by row's `mesh` column.
- `daemon up` with no `--mesh` attaches to all joined meshes; with `--mesh X` restricts to X (legacy mode for explicit single-mesh).
- Inbox dedupe keeps using `client_message_id` UNIQUE; mesh column for filtering only.
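
The `/v1/send` rule (explicit `mesh` field, else infer from a `<slug>:<target>` prefix) might look like the sketch below. `resolveSendTarget`, the result shape, and the error strings are hypothetical; the sketch also assumes an explicit `mesh` field wins over any prefix.

```ts
type SendTarget =
  | { kind: "ok"; mesh: string; target: string }
  | { kind: "error"; error: string };

function resolveSendTarget(
  body: { mesh?: string; target: string },
  attached: string[], // slugs the daemon currently holds a broker client for
): SendTarget {
  if (body.mesh !== undefined) {
    // Explicit field: use as-is, but only for an attached mesh.
    return attached.indexOf(body.mesh) !== -1
      ? { kind: "ok", mesh: body.mesh, target: body.target }
      : { kind: "error", error: "mesh not attached: " + body.mesh };
  }
  // No field: try the `<slug>:<target>` prefix form.
  const colon = body.target.indexOf(":");
  if (colon > 0) {
    const slug = body.target.slice(0, colon);
    if (attached.indexOf(slug) !== -1) {
      return { kind: "ok", mesh: slug, target: body.target.slice(colon + 1) };
    }
  }
  return { kind: "error", error: "mesh required" };
}
```

The resolved `mesh` is what would be written to the outbox row's `mesh` column, which the drain worker then uses as its dispatch key.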
**Estimated effort:** 1 week. ~600 LoC across `run.ts`, `drain.ts`, `ipc/server.ts`, plus tests for per-mesh dispatch.
**Risk:** medium. The single-mesh assumption is baked into a few places (peer-list response shape, skill-list response shape). Need to choose: per-mesh tagged responses (breaking) or array-of-meshes wrapped responses (additive). Recommend the latter for back-compat.
### B. HKDF-derived peer keypairs (cross-machine identity)
**Why it matters:** today each install per machine = fresh keypair = different mesh member identity. User signs in on laptop and desktop and shows up as two different members. v2.0.0 promised "same identity across machines."
**What it takes:**
- `HKDF(account_secret, info: "claudemesh/mesh/<mesh_id>/peer", salt: <user_id>)` derives a deterministic ed25519 keypair per mesh.
- `account_secret` derives from the user's authenticated session — needs broker-side endpoint to vend it on first install.
- Enrollment flow changes: instead of generating a fresh keypair, derive it. Subsequent installs find the same pubkey already in `mesh.member` and skip enrollment.
- Migration: existing members keep their old keypairs (they're stored in config). Only new joins use HKDF. Optional: opt-in re-enrollment for users who want cross-machine sync.
- Broker hello-sig protocol unchanged (still ed25519 sign).
**Estimated effort:** 2-3 weeks. Touches enrollment, broker auth, dashboard, security review.
**Risk:** high. Crypto change with security implications. Needs design review (account_secret distribution security, HKDF salt choice, key compromise recovery story).
### C. Mesh → workspace public surface rename
**Why it matters:** "mesh" is internal jargon for what users experience as "a workspace." v2.0.0 calls for the rename to align UX language.
**What it takes:**
- All CLI verbs gain `workspace` aliases (`claudemesh workspace list` as an alias for `claudemesh list`).
- Help text, docs, README, marketing site updated.
- DB tables stay `mesh_*` (migration cost prohibitive; not user-visible).
- Wire protocol stays `mesh_*` (broker change too disruptive).
- Eventually deprecate the `mesh` aliases (~2 minor versions later).
**Estimated effort:** 3-4 days. Mostly rote search/replace + new aliases.
**Risk:** low. Cosmetic.
### D. Full CLI-to-thin-client conversion
**Why it matters:** today the CLI has bridge + cold-path code that duplicates ~3000 LoC of broker WS / crypto / decode logic that the daemon also has. Once daemon is multi-mesh, every verb can become "open IPC, send request, render response."
**What it takes:**
- Each verb: replace `withMesh(...)` (which opens its own broker WS) with `daemonOnly(...)` (calls IPC, errors if daemon down).
- Drop `bridge/server.ts`, `bridge/client.ts`, `bridge/socket-broker.ts` entirely.
- Drop most of `services/broker/ws-client.ts` from the CLI build (kept only for daemon's internal use).
- CLI binary shrinks ~30-40%.
- Daemon becomes the only broker WS holder per user.
**Estimated effort:** 1 week. Mostly mechanical; strict typescript catches most issues.
**Risk:** medium. Breaks workflows where CLI is used without daemon (CI environments, headless scripts). Need to keep a `--no-daemon` escape hatch or document the constraint.
## Recommended sequencing
```
1.25.0 (today): Sprint 4 outbound routing + CLI thin-client read paths + ambient mode docs
1.26.0 (next): A. Daemon multi-mesh — "ambient mode for everything"
1.27.0: D. CLI-to-thin-client conversion — drops ~3000 LoC
1.28.0: C. Mesh → workspace rename (aliases shipped, no removal yet)
2.0.0: B. HKDF identity (separate security-reviewed arc)
```
A → D → C → B is the right order:
- A unblocks ambient mode for multi-mesh users (highest UX value).
- D unblocks the LoC reduction the v2.0.0 promise mentioned ("3000 LoC removed").
- C is cosmetic; do it once D has stabilized.
- B is the most security-sensitive; do it last, with proper review.
## Out of scope for the v2.0.0 endpoint
- **Topic crypto (Sprint 5+).** Topics still ship as base64 plaintext. Real per-topic encryption is a v0.3.0 operator-layer item, parallel track.
- **Broker hardening for daemon idempotency (Sprint 7).** Partial unique index on `(mesh_id, client_message_id) WHERE NOT NULL` and the `mesh.client_message_dedupe` table. Documented in `2026-05-03-daemon-spec-broker-hardening-followups.md`.
- **`launch` deprecation.** 1.25.0 docs now recommend ambient mode for default cases; `launch` stays as the override path. Full deprecation is a 2.x decision.


@@ -0,0 +1,350 @@
# Continuous presence — lease model + resume token
**Status:** spec, ready for v0.3.0.
**Owner:** alezmad
**Author:** Claude (2026-05-05, follow-up to user-reported "after hours claudemesh disconnects")
**Related:** `2026-05-04-per-session-presence.md` (per-launch ephemeral keypair), `apps/broker/src/index.ts:5430-5436` (current 30s ping loop), `apps/cli/src/daemon/ws-lifecycle.ts` (current backoff reconnect).
## Problem
Today, presence is fused to a single TCP/WS connection. When the
connection breaks — half-dead NAT entries, ISP route changes, laptop
sleep, broker restart — the broker tears down the presence row, fires
`peer_left`, and waits for the daemon to dial a fresh socket and run
the full attestation hello again. Other peers see the user blink
offline → back online. Messages sent to the session during the gap are
either dropped (if it's a `now`/`next` priority DM with no recipient
match) or held in `message_queue` for `low` only.
Concrete symptom (user-reported): `claudemesh peer list` shows zero
peers despite multiple sessions being "up" — they're stuck on
half-dead TCP connections. Daemon hasn't noticed because no `close`
fired. Hours later, kernel TCP keepalive (default Linux: 7200s idle +
9 × 75s probes ≈ 2h11m) finally RSTs the socket, daemon's existing
backoff reconnects, peers reappear. Until then: zombie session.
Two coupled bugs:
1. **No application-layer staleness detection.** Broker pings every
30s (line 5431) and updates `lastPingAt` on pong, but never
`terminate()`s a connection that stops returning pongs. Daemon
doesn't ping at all. Both sides trust the kernel for liveness,
which only fires after hours.
2. **Presence == connection.** Even once the staleness IS detected
and the daemon reconnects, peers see a full `peer_left` /
`peer_joined` cycle for a network blip that took 1-30 seconds.
Outbound messages during the gap that target the session by
pubkey route to nothing.
The user's ask: peers should never see a gap during transient
disconnects. Presence should be continuous as long as the *session
intent* is alive, regardless of how many sockets carried it.
## Goal
Presence is a **lease** keyed off the session's stable identity
(`sessionPubkey`), held in broker memory + DB, with a TTL refreshed
on every keepalive. Sockets come and go beneath the lease. Other peers
see continuous online status across reconnects up to the lease TTL.
Specifically:
- A daemon (or per-session WS) can drop and re-establish the WS
within a configurable grace window (default 90s) without any peer
observing `peer_left` / `peer_joined`.
- Messages sent to a session while its socket is mid-flap are queued,
delivered on the next reattach, ordered.
- Reconnect itself is sub-second on the wire when a `resume_token` is
presented — broker recognises the session, restores the slot, no
re-attestation round-trip.
- After the grace window expires, the broker fires `peer_left`
exactly once; on a later reconnect it fires `peer_joined` exactly
once. No flapping.
## Non-goals
- **Multi-broker handoff.** Out of scope. If the broker process
restarts, leases are lost and we fall back to today's behavior
(clean reconnect, peers see one cycle). A future spec can address
this with a shared lease store (Redis / Postgres LISTEN).
- **Dual-socket on the daemon.** Useful gold-plating but not required
for the user-facing problem. Single-socket with watchdog +
resume-token covers the failure modes actually observed (NAT drops,
ISP blips, sleep <90s).
- **Manual `claudemesh reconnect` CLI.** Not needed; the lease model
makes it redundant. Re-evaluate if real support cases surface.
## Design
### Lease model
```
sessionPubkey → { transport: "online" | "offline",
leaseUntil: Date,
ws: WebSocket | null,
...existing PeerConn fields }
```
Today the `connections` Map IS keyed by `presenceId`, which is a fresh
UUID per WS. We change that key to `sessionPubkey` (member-WS:
`memberPubkey`; session-WS: `sessionPubkey`). The PeerConn struct
gains:
```ts
transport: "online" | "offline";
leaseUntil: Date; // Date.now() + LEASE_TTL_MS
evictionTimer: NodeJS.Timeout | null;
```
### State transitions
**On WS open + hello accepted (initial):**
- Insert into `connections` with `transport: "online"`,
`leaseUntil: now + 90s`, `evictionTimer: null`.
- Broadcast `peer_joined` (today's behavior).
- Issue `resume_token` (see below) in the `hello_ack`.
**On WS open + hello carries valid `resume_token`:**
- Look up by `sessionPubkey`, verify token signature + freshness
(TTL <= LEASE_TTL_MS). If valid AND entry exists with
`transport: "offline"`:
- Cancel `evictionTimer`.
- Swap `ws` reference.
- Set `transport: "online"`, refresh `leaseUntil`.
- **Do NOT** broadcast `peer_joined`. The lease never expired.
- Drain any queued DMs accumulated during offline window.
- Reply `hello_ack` with new `resume_token`.
- If entry exists with `transport: "online"` (token replay attack or
rapid reconnect race): close old `ws` with `1000, "session_replaced"`
before swapping. Same as today's `oldConn.ws.close(1000, ...)`
pattern at lines 1768/1996.
- If no entry exists or token is stale: treat as a fresh hello,
broadcast `peer_joined`. Token expired = same as a cold start.
**On WS close (any reason):**
- Look up by `sessionPubkey`. If not found, no-op (already evicted).
- Set `transport: "offline"`, clear `ws` reference.
- Start `evictionTimer = setTimeout(evict, GRACE_MS)`.
- **Do NOT** broadcast `peer_left`. **Do NOT** delete the entry.
- **Do NOT** call `disconnectPresence(presenceId)` yet.
**On `evictionTimer` fire (lease expired without reattach):**
- Delete from `connections`.
- Broadcast `peer_left` (today's behavior at lines 5167-5189).
- `decMeshCount`.
- `disconnectPresence(presenceId)`.
- Clean up URL watches, stream subs, MCP registry — same as today's
close handler.
- Audit `peer_left`.
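
The transitions above can be modelled with an explicit clock instead of real timers, which makes them easy to unit-test. This is a sketch under assumptions, not the broker's implementation: `LeaseTable` is a hypothetical name, `socketId` stands in for the `ws` reference, `evictAt` stands in for `evictionTimer`, and `resumeValid` is the outcome of resume-token verification done elsewhere. The stale-socket guard in `onClose` mirrors the `conn.ws !== ws` check mentioned in the commit message.

```ts
const GRACE_MS = 90_000;

interface Lease {
  transport: "online" | "offline";
  socketId: string | null; // stands in for the WebSocket reference
  evictAt: number | null;  // stands in for evictionTimer (epoch ms)
}

class LeaseTable {
  leases = new Map<string, Lease>();
  events: string[] = []; // broadcasts that peers would observe

  onHello(key: string, socketId: string, resumeValid: boolean): void {
    const lease = this.leases.get(key);
    if (lease && resumeValid) {
      // Reattach (offline) or replacement (online): swap, no broadcast.
      lease.transport = "online";
      lease.socketId = socketId;
      lease.evictAt = null;
      return;
    }
    // Assumption: a leftover lease with a stale token is evicted first so
    // joins and leaves stay paired; the spec does not cover this case.
    if (lease) this.evict(key);
    this.leases.set(key, { transport: "online", socketId, evictAt: null });
    this.events.push("peer_joined:" + key);
  }

  onClose(key: string, socketId: string, now: number): void {
    const lease = this.leases.get(key);
    // Already evicted, or a close from a socket that was already replaced.
    if (!lease || lease.socketId !== socketId) return;
    lease.transport = "offline";
    lease.socketId = null;
    lease.evictAt = now + GRACE_MS; // no peer_left yet
  }

  // Called periodically; plays the role of the eviction timers firing.
  sweep(now: number): void {
    const expired: string[] = [];
    this.leases.forEach((lease, key) => {
      if (lease.evictAt !== null && lease.evictAt <= now) expired.push(key);
    });
    expired.forEach((key) => this.evict(key));
  }

  private evict(key: string): void {
    this.leases.delete(key);
    this.events.push("peer_left:" + key);
  }
}
```

A reconnect inside the grace window leaves `events` untouched, which is the continuous-presence property the spec is after.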
**Watchdog (broker):**
- The 30s ping loop (line 5431) gains a staleness check: if any
conn's `transport === "online"` and `lastPingAt < now - 75s`, call
`ws.terminate()`. This converts the half-dead socket into a clean
`close` event, which fires the lease-offline transition above.
- Same logic on the daemon side (see § Daemon changes).
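
A minimal sketch of the staleness check, assuming a hypothetical `WatchedConn` shape whose `terminate()` plays the role of `ws.terminate()`. The comparison matches the rule above (`lastPingAt < now - 75s`), so a connection exactly at the threshold is left alone.

```ts
const STALE_PONG_THRESHOLD_MS = 75_000; // 2.5x the 30 s ping interval

interface WatchedConn {
  id: string;
  transport: "online" | "offline";
  lastPingAt: number; // epoch ms of the last pong
  terminate(): void;  // stands in for ws.terminate()
}

// Run from the existing ping loop; returns the ids that were terminated.
function sweepStale(conns: WatchedConn[], now: number): string[] {
  const terminated: string[] = [];
  for (const conn of conns) {
    const stale = conn.lastPingAt < now - STALE_PONG_THRESHOLD_MS;
    if (conn.transport === "online" && stale) {
      // Forces a `close` event, which starts the lease grace window.
      conn.terminate();
      terminated.push(conn.id);
    }
  }
  return terminated;
}
```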
### Resume token
A short opaque string the broker hands the daemon in `hello_ack`.
Format: `mesh-resume.v1.<base64url(JSON-payload)>.<base64url(sig)>`
where `JSON-payload = { sub: <sessionPubkey>, mid: <meshId>, exp:
<unix-ms>, iat: <unix-ms> }` and `sig = ed25519(brokerSigningKey,
JSON-payload)`.
- **Why a token, not just sessionPubkey?** A session needs to prove
it's the holder of an existing lease without re-running the full
attestation handshake (which involves member key + parent
attestation lookup). The token is a server-issued cookie: cheap to
verify, scoped to a single session, expires with the lease.
- **Storage:** broker keeps the signing key in env (`RESUME_TOKEN_KEY`,
generated on first boot if missing, persisted to a config row). No
DB column needed for the tokens themselves — they're verified by
signature alone.
- **TTL:** equal to `LEASE_TTL_MS` (90s). After that the daemon must
re-handshake with full attestation. The token is refreshed on every
successful reattach.
- **Daemon storage:** in-memory only. Lost on daemon restart, which
is correct: a daemon restart is a real reconnect and should run
the full hello.
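A minimal sketch of issuing and verifying a token in the format above, assuming Node's built-in `node:crypto` Ed25519 support rather than whatever signing helper the broker actually uses. `issueResumeToken` and `verifyResumeToken` are hypothetical names, and the key pair is generated inline purely for illustration (the spec loads it from `RESUME_TOKEN_KEY`).

```ts
import { generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

const TOKEN_PREFIX = "mesh-resume.v1.";

interface ResumePayload {
  sub: string; // sessionPubkey
  mid: string; // meshId
  exp: number; // unix ms
  iat: number; // unix ms
}

function issueResumeToken(
  key: KeyObject,
  sub: string,
  mid: string,
  now: number,
  ttlMs = 90_000,
): string {
  const payload: ResumePayload = { sub, mid, exp: now + ttlMs, iat: now };
  const body = Buffer.from(JSON.stringify(payload), "utf8");
  const sig = sign(null, body, key); // Ed25519 takes no digest algorithm
  return TOKEN_PREFIX + body.toString("base64url") + "." + sig.toString("base64url");
}

// Returns the payload when the signature is valid and the token has not
// expired; returns null for anything malformed, forged, or stale.
function verifyResumeToken(
  key: KeyObject,
  token: string,
  now: number,
): ResumePayload | null {
  if (!token.startsWith(TOKEN_PREFIX)) return null;
  const parts = token.slice(TOKEN_PREFIX.length).split(".");
  if (parts.length !== 2) return null;
  const body = Buffer.from(parts[0] ?? "", "base64url");
  const sig = Buffer.from(parts[1] ?? "", "base64url");
  try {
    if (!verify(null, body, key, sig)) return null;
    const p = JSON.parse(body.toString("utf8")) as Partial<ResumePayload>;
    if (typeof p.sub !== "string" || typeof p.mid !== "string") return null;
    if (typeof p.exp !== "number" || typeof p.iat !== "number") return null;
    if (p.exp <= now) return null;
    return { sub: p.sub, mid: p.mid, exp: p.exp, iat: p.iat };
  } catch {
    return null;
  }
}

// Illustration only: a throwaway key pair standing in for the broker key.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");
const issuedAt = Date.now();
const resumeToken = issueResumeToken(privateKey, "session-pubkey-hex", "mesh-1", issuedAt);
const verified = verifyResumeToken(publicKey, resumeToken, issuedAt + 1_000);
```

Verification is done over the exact bytes that were signed (the decoded payload segment), so the broker never has to re-serialize the JSON and key order cannot cause a mismatch.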
### Wire protocol additions
`hello` (member-WS, session-WS, fresh-launch hello — all three):
```diff
 {
   type: "hello",
   memberPubkey: "...",
   sessionPubkey: "...",   // session-WS only
   attestation: "...",     // session-WS only
   signature: "...",
+  resumeToken?: "mesh-resume.v1...",  // optional; presence = reattach attempt
   ...
 }
```
`hello_ack`:
```diff
 {
   type: "hello_ack",
   presenceId: "...",
   ...
+  resumeToken: "mesh-resume.v1...",  // always issued; replaces prior on reattach
+  leaseTtlMs: 90000,                 // informational; daemon may use for ping cadence
 }
```
No new message types. Old daemons that don't send `resumeToken` get
today's full-handshake behavior — fully backward compatible.
### Message queue during grace window
Today: DMs to a presence whose WS is closed → routed to
`message_queue` only for `priority: low`; `now`/`next` either route
to a different connected session of the same member or drop.
Change: when the broker would route to a session whose
`transport === "offline"` (lease still valid), enqueue regardless of
priority. On reattach, the existing inbox-drain path
(`maybePushQueuedMessages` at line 967) flushes them in order. The
`message_queue` already has the schema for this; we're just relaxing
the priority gate when the target is in grace.
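The relaxed gate can be summarised as a small decision function. This is a sketch under assumptions: `routeDm`, `RouteAction`, and the `leaseValid` flag are hypothetical, and the real router works against the `connections` Map and `message_queue` rather than returning a label.

```ts
type Priority = "now" | "next" | "low";
type RouteAction = "deliver" | "enqueue" | "reroute_or_drop";

// Hypothetical minimal view of the target session.
interface RouteTarget {
  transport: "online" | "offline";
  leaseValid: boolean; // true while the grace window has not expired
}

function routeDm(target: RouteTarget | undefined, priority: Priority): RouteAction {
  if (target && target.transport === "online") return "deliver";
  // New behavior: a session in grace gets everything queued,
  // regardless of priority, and drained on reattach.
  if (target && target.transport === "offline" && target.leaseValid) return "enqueue";
  // Today's behavior for a session that is really gone: only low
  // priority is queued; now/next go to another session or are dropped.
  return priority === "low" ? "enqueue" : "reroute_or_drop";
}
```

Keeping the pre-existing fallback as the last branch means the change only widens queueing for sessions that still hold a lease.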
### Constants
```ts
const LEASE_TTL_MS = 90_000; // grace window after WS close
const PING_INTERVAL_MS = 30_000; // unchanged
const STALE_PONG_THRESHOLD_MS = 75_000; // 2.5x ping interval
const RESUME_TOKEN_TTL_MS = LEASE_TTL_MS;
```
Rationale for `LEASE_TTL_MS` = 90s: long enough to absorb a sleep/resume
cycle, NAT timeout, ISP route flap, or mobile-to-Wi-Fi handover, yet short
enough that a true crash (daemon killed, machine off) clears the
session within 90s, so peers don't see a ghost online status forever.
Configurable via env (`LEASE_TTL_MS`) for self-hosted brokers.
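One way the env override might be read defensively is sketched below. `readLeaseTtlMs` is a hypothetical helper, and falling back to the default on bad input (rather than refusing to start) is an assumption, not something the spec decides.

```ts
const DEFAULT_LEASE_TTL_MS = 90_000;

// Reads LEASE_TTL_MS from an env-like object. Anything that is not a
// positive integer number of milliseconds falls back to the default.
function readLeaseTtlMs(
  env: Record<string, string | undefined>,
  fallback: number = DEFAULT_LEASE_TTL_MS,
): number {
  const raw = env["LEASE_TTL_MS"];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}
```

A self-hosted operator who shortens the lease below the 75s stale threshold changes the timing assumptions in the test plan, so a real implementation may want to validate the two values together.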
## Daemon changes
### Watchdog
In `ws-lifecycle.ts`, add an `idleWatchdog` parallel to the existing
backoff/reconnect machinery:
```ts
let lastActivity = Date.now(); // bumped on every incoming message + pong
const watchdog = setInterval(() => {
  if (Date.now() - lastActivity > STALE_THRESHOLD_MS) {
    log("warn", "ws_stale_terminate", { url: opts.url });
    sock.terminate(); // fires existing close handler → reconnect path
  } else if (sock.readyState === sock.OPEN) {
    sock.ping(); // matches broker's 30s cadence, gives broker a pong
  }
}, PING_INTERVAL_MS);
sock.on("message", () => { lastActivity = Date.now(); });
sock.on("pong", () => { lastActivity = Date.now(); });
```
Clean up with `clearInterval(watchdog)` in both the close handler and the
explicit `close()` path, so the interval does not keep firing against a
dead socket.
### Resume token in hello
`apps/cli/src/daemon/broker.ts:136` and equivalent in
`session-broker.ts`: persist the `resumeToken` from each successful
`hello_ack` into a private field, include it in the next
`buildHello()` call. On daemon restart the field is empty → cold
start, exactly today's behavior.
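A sketch of the in-memory token handling described above. `ResumeTokenStore`, `HelloAckLike`, and `HelloLike` are hypothetical names; the real daemon builds its hello in `buildHello()` and its message types carry more fields.

```ts
// Hypothetical, trimmed-down message shapes.
interface HelloAckLike {
  type: "hello_ack";
  resumeToken?: string; // absent when talking to an old broker
}

interface HelloLike {
  type: "hello";
  memberPubkey: string;
  signature: string;
  resumeToken?: string;
}

class ResumeTokenStore {
  // In-memory only: a daemon restart starts with no token (cold start).
  private token: string | null = null;

  onHelloAck(ack: HelloAckLike): void {
    // Each ack replaces the previous token; an ack without one clears it.
    this.token = ack.resumeToken ?? null;
  }

  attach(hello: HelloLike): HelloLike {
    return this.token === null ? hello : { ...hello, resumeToken: this.token };
  }
}
```

Omitting the field entirely when no token is held keeps the hello byte-for-byte identical to today's message, which is what makes the old-broker path a no-op.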
### No CLI changes
`claudemesh peer list` keeps reading the broker's `connections` Map
which now reflects continuous presence. Users see online sessions as
online during transient blips. No UX surface changes.
## Migration
- New broker is fully backward compatible with old daemons (resume
token is optional, defaults fall through to today's path).
- New daemons against an old broker: token is sent but ignored, full
handshake runs each reconnect — same as today.
- DB migration: none. `presence` table semantics unchanged. The
`disconnectedAt` column is now set only on lease eviction (>90s),
not on every WS close. This is a behavioral change but not a
schema change.
- Add env var `RESUME_TOKEN_KEY` (the broker generates it on first boot if
unset and persists it to a singleton config row).
## Test plan
1. **Sleep test:** kill -STOP the daemon for 60s, then kill -CONT.
Expect: peers never see `peer_left`. Daemon's WS is dead-on-arrival
when it wakes; watchdog terminates it; reconnect with resume_token
succeeds within 1-2s; lease was at ~30s of its 90s TTL when the
daemon resumed.
2. **Hard offline:** kill -STOP for 120s, kill -CONT. Expect: peers
see exactly one `peer_left` at t=90s, then exactly one
`peer_joined` after the daemon resumes and reconnects (resume
token is now stale; full handshake runs).
3. **NAT drop simulation:** `iptables -A OUTPUT -p tcp --dport 443
-j DROP` for 60s on the daemon host, then remove the rule. Expect:
broker pings stop landing, broker-side watchdog calls
`ws.terminate()` at t=75s, lease enters grace, daemon's own
watchdog fires within ~30s, daemon reconnects with resume_token,
peers never see a flap.
4. **Message-during-grace:** while a target session is in grace
(offline, lease valid), send a `priority: now` DM. Expect: queued
in `message_queue`, delivered exactly once on reattach, no
`peer_left` visible to sender, ack returns delivered.
5. **Replay attack:** capture a resume_token in flight, replay it
against a different broker connection while the original session
is still online. Expect: broker treats it as a reconnect for an
already-online session → closes old WS with `session_replaced`,
new WS takes over. Equivalent to today's session-replacement
semantics; the original session detects the close and either
reconnects (if it's still alive) or gives up.
6. **Token forgery:** send a `resumeToken` not signed by the broker.
Expect: signature check fails, broker treats hello as a fresh
handshake (or rejects if the rest of the hello is invalid).
## Open questions
- **Should `peer list` expose a `transport` field** so callers can
distinguish "leased but offline" from "online"? Default no — the
abstraction we're selling is "they're online." But debugging may
want it; gate it behind `--all` or `--debug`.
- **What about the broker-side `mcpRegistry` cleanup?** Today we
delete non-persistent MCP entries on WS close (line 5217). With
leases, we should defer that to lease eviction, not WS close.
Otherwise an MCP server registered by a session disappears every
time its WS reconnects.
## Build order
1. **Broker lease model** — change `connections` keying, add
`transport`/`leaseUntil`/`evictionTimer`, refactor close handler
to start grace timer instead of immediate teardown, refactor
eviction path. (~80 lines.)
2. **Resume token** — signing key bootstrap, token issue/verify,
wire format, hello_ack changes. (~50 lines + 1 config row.)
3. **Daemon watchdog** — `ws-lifecycle.ts` adds `idleWatchdog` and
stores `resumeToken` from acks. (~25 lines.)
4. **Daemon hello** — pass `resumeToken` in next `buildHello()`.
(~10 lines across `broker.ts` + `session-broker.ts`.)
5. **Broker watchdog** — extend the 30s ping loop with
`terminate()`-on-stale logic. (~15 lines.)
6. **Queue-during-grace** — relax priority gate in DM routing.
(~5 lines.)
7. **Spec docs** — update `docs/protocol.md` with resume_token,
lease semantics. (~30 lines.)
8. **Tests** — six scenarios above. Likely ~3 new test files.
Estimated total: one focused day. The broker lease model is the
load-bearing change; everything else slots in cleanly once that's done.


@@ -1 +1 @@
{"sessionId":"ae5dbe38-9c56-4d07-9fb6-a38cb8a250a6","pid":4612,"acquiredAt":1776217467441}
{"sessionId":"ae5dbe38-9c56-4d07-9fb6-a38cb8a250a6","pid":3633,"procStart":"Fri May 1 22:40:56 2026","acquiredAt":1777683244936}

.github/workflows/deploy-web.yml

@@ -0,0 +1,71 @@
name: Deploy claudemesh-web
# Triggers a Coolify deploy of the apps/web Next.js app on the OVH VPS.
# Coolify only auto-deploys the broker (it watches the gitea-vps mirror);
# the web app needs an explicit poke. This workflow is the poke.
#
# The Coolify dashboard is bound to a Tailscale-only address
# (100.122.34.28:8000), so the runner first joins the tailnet via
# an OAuth-issued ephemeral node, then hits Coolify's deploy API.
#
# Path filter: redeploy on changes to the web app, the API package
# (bundled into the web build), or any shared package the web app
# transpiles. Anything else (broker-only, cli-only, docs) skips it.
on:
  push:
    branches: [main]
    paths:
      - "apps/web/**"
      - "packages/api/**"
      - "packages/db/**"
      - "packages/auth/**"
      - "packages/ui/**"
      - "packages/i18n/**"
      - "packages/shared/**"
      - "packages/email/**"
      - "packages/billing/**"
      - "packages/storage/**"
      - "packages/monitoring-web/**"
      - "pnpm-lock.yaml"
      - ".github/workflows/deploy-web.yml"
  workflow_dispatch:
# Coalesce rapid pushes — only one deploy in flight at a time, and
# if a newer push lands while one is queued, the older one is
# cancelled. Avoids the "5 commits, 5 deploys" stampede.
concurrency:
  group: deploy-web
  cancel-in-progress: true
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Connect to Tailscale
        uses: tailscale/github-action@v3
        with:
          oauth-client-id: ${{ secrets.TS_OAUTH_CLIENT_ID }}
          oauth-secret: ${{ secrets.TS_OAUTH_SECRET }}
          tags: tag:ci
      - name: Trigger Coolify deploy
        env:
          COOLIFY_TOKEN: ${{ secrets.COOLIFY_TOKEN }}
          APP_UUID: p68x1e3k4xmrjmblca5ybe09
        run: |
          if [ -z "$COOLIFY_TOKEN" ]; then
            echo "::error::COOLIFY_TOKEN secret is not set"
            exit 1
          fi
          response=$(curl -sS -w "\n%{http_code}" -X GET \
            "http://100.122.34.28:8000/api/v1/deploy?uuid=${APP_UUID}&force=true" \
            -H "Authorization: Bearer ${COOLIFY_TOKEN}")
          status=$(echo "$response" | tail -n1)
          body=$(echo "$response" | sed '$d')
          echo "HTTP $status"
          echo "$body"
          if [ "$status" != "200" ]; then
            echo "::error::Coolify returned HTTP $status"
            exit 1
          fi


@@ -20,11 +20,12 @@ Peer mesh for Claude Code sessions. Broker + CLI + MCP server.
## Deploy
- **Broker:** `git push gitea-vps main` triggers Coolify auto-deploy. Manual: `curl -s -X GET "http://100.122.34.28:8000/api/v1/deploy?uuid=mcn8m74tbxfxbplmyb40b2ia" -H "Authorization: Bearer 3|K2vkSJzdUA69rj22CKZc5z0YB6pkY43GLEonti3UzcnqVJj6WhrqqYTAng6DzMUi"`. Pending migrations apply automatically on startup.
- **Broker:** `git push gitea-vps main` triggers Coolify auto-deploy via the gitea webhook. Pending migrations apply automatically on startup.
- **Web:** Coolify on the OVH VPS (`claudemesh.com` resolves to `135.125.191.245`, NOT Vercel — the `apps/web/Dockerfile` is what Coolify builds). Auto-deploys via `.github/workflows/deploy-web.yml` on push to `main` when paths under `apps/web/**` or `packages/{api,db,auth,ui,i18n,shared,email,billing,storage,monitoring-web}/**` change. The workflow joins the tailnet via Tailscale OAuth, then hits the Coolify API.
- **Manual deploy** (if the workflow is broken or the path filter missed something) — Coolify dashboard at `http://100.122.34.28:8000` (Tailscale only). Token in `COOLIFY_TOKEN` repo secret. App UUIDs: broker `mcn8m74tbxfxbplmyb40b2ia`, web `p68x1e3k4xmrjmblca5ybe09`.
- **CLI:**
- npm: `cd apps/cli && npm publish --tag alpha --access public --no-git-checks --ignore-scripts`
- npm: `cd apps/cli && npm publish --access public --no-git-checks --ignore-scripts`
- Binaries: `git tag cli-v<version> && git push github cli-v<version>` — workflow builds 5 platforms.
- **Web:** Vercel auto-deploy on push to GitHub
## Dev


@@ -41,6 +41,11 @@ COPY --from=deps --chown=bun:bun /app/packages/db/migrations /app/migrations
EXPOSE 7900
# Liveness (Docker HEALTHCHECK) hits /health — permissive, tolerates
# transient DB blips so the container isn't killed during brief DB
# restarts. Deploy-time readiness is a separate /health/ready endpoint
# which checks DB + migration version; an external gate should poll
# that after container start and fail the deploy if not green.
HEALTHCHECK --interval=10s --timeout=5s --start-period=30s --retries=5 \
CMD bun -e "fetch('http://localhost:7900/health').then(r=>{process.exit(r.ok?0:1)}).catch(()=>process.exit(1))"


@@ -0,0 +1,101 @@
/**
* One-shot backfill: every active mesh whose owner has no peer-identity
* member row gets one minted via a fresh ed25519 keypair. Without this,
* web-first owners (who never connected via CLI) can't access the chat
* surface — issueDashboardApiKey is a FK to mesh.member, and the topic
* page server component's owner branch picks the oldest member row in
* the mesh (which is null if none exist).
*
* Idempotent. Safe to re-run. Each run prints per-mesh status.
*
* Owner identification: a member is the "owner's row" when its user_id
* matches mesh.owner_user_id. The script targets meshes that have zero
* such matching rows (regardless of total member count — a mesh with
* peers but no owner member also gets a fresh owner row).
*
* The owner row is also auto-subscribed to #general as 'lead' so the
* unread/role accounting matches CLI-flow meshes.
*
* Usage:
* DATABASE_URL=... bun apps/broker/scripts/backfill-owner-members.ts
*/
import postgres from "postgres";
import sodium from "libsodium-wrappers";
interface Orphan {
meshId: string;
slug: string;
ownerUserId: string;
meshName: string;
}
async function main() {
const url = process.env.DATABASE_URL;
if (!url) {
console.error("DATABASE_URL not set");
process.exit(2);
}
await sodium.ready;
const sql = postgres(url, { max: 1, onnotice: () => {} });
try {
const orphans = await sql<Orphan[]>`
SELECT m.id AS "meshId", m.slug, m.owner_user_id AS "ownerUserId", m.name AS "meshName"
FROM mesh.mesh m
WHERE m.archived_at IS NULL
AND NOT EXISTS (
SELECT 1 FROM mesh.member mm
WHERE mm.mesh_id = m.id
AND mm.revoked_at IS NULL
AND mm.user_id = m.owner_user_id
)
ORDER BY m.created_at
`;
console.log(`backfill · ${orphans.length} meshes need an owner member row`);
let inserted = 0;
for (const o of orphans) {
const kp = sodium.crypto_sign_keypair();
const peerPubkey = sodium.to_hex(kp.publicKey);
const id = sodium.to_hex(sodium.randombytes_buf(16));
try {
await sql.begin(async (tx) => {
await tx`
INSERT INTO mesh.member (
id, mesh_id, peer_pubkey, display_name, role,
user_id, dashboard_user_id
)
VALUES (
${id}, ${o.meshId}, ${peerPubkey},
${o.meshName + "-owner"}, ${"admin"}::mesh.role,
${o.ownerUserId}, ${o.ownerUserId}
)
`;
// Subscribe to #general as 'lead' if the topic exists.
await tx`
INSERT INTO mesh.topic_member (topic_id, member_id, role)
SELECT t.id, ${id}, ${"lead"}::mesh.topic_member_role
FROM mesh.topic t
WHERE t.mesh_id = ${o.meshId} AND t.name = 'general'
ON CONFLICT (topic_id, member_id) DO NOTHING
`;
});
inserted += 1;
console.log(` + ${o.slug.padEnd(20)} owner=${o.ownerUserId.slice(0, 8)}… member=${id.slice(0, 8)}… pk=${peerPubkey.slice(0, 12)}`);
} catch (e) {
console.error(`${o.slug}: ${(e as Error).message}`);
throw e;
}
}
console.log(`backfill done · ${inserted} owner member rows inserted`);
} finally {
await sql.end({ timeout: 5 });
}
}
main().catch((e) => {
console.error("backfill failed:", e);
process.exit(1);
});


@@ -0,0 +1,87 @@
/**
* One-shot bootstrap for the new mesh.__cmh_migrations tracking table.
*
* Run this against an EXISTING prod DB exactly once before deploying
* the new runtime migrator. It:
* 1. Creates mesh.__cmh_migrations if it doesn't exist
* 2. Hashes every .sql file in packages/db/migrations
* 3. Inserts a row per file (filename + sha256) with applied_at = NOW()
* 4. ON CONFLICT (filename) DO NOTHING — safe to re-run
*
* The script does NOT execute any migration SQL — it only seeds the
* tracking table to reflect the schema state that was previously
* applied by drizzle (or by hand). After this runs, the broker's
* startup migrator will treat 0000..N as already-applied and only
* apply truly new files going forward.
*
* Usage:
* DATABASE_URL=... bun apps/broker/scripts/bootstrap-cmh-migrations.ts
*
* Safe to run multiple times. Output prints per-file status.
*/
import postgres from "postgres";
import { join } from "node:path";
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { createHash } from "node:crypto";
async function main() {
const url = process.env.DATABASE_URL;
if (!url) {
console.error("DATABASE_URL not set");
process.exit(2);
}
const candidates = [
join(process.cwd(), "..", "..", "packages", "db", "migrations"),
join(process.cwd(), "packages", "db", "migrations"),
"/app/migrations",
];
const folder = candidates.find((p) => existsSync(p));
if (!folder) {
console.error("migrations folder not found");
process.exit(2);
}
const files = readdirSync(folder).filter((f) => f.endsWith(".sql")).sort();
console.log(`bootstrap · ${files.length} files at ${folder}`);
const sql = postgres(url, { max: 1, onnotice: () => {} });
try {
await sql.unsafe(`
CREATE SCHEMA IF NOT EXISTS mesh;
CREATE TABLE IF NOT EXISTS mesh.__cmh_migrations (
filename TEXT PRIMARY KEY,
sha256 TEXT NOT NULL,
applied_at TIMESTAMP NOT NULL DEFAULT NOW()
);
`);
let inserted = 0;
let skipped = 0;
for (const f of files) {
const content = readFileSync(join(folder, f), "utf8");
const sha = createHash("sha256").update(content).digest("hex");
const result = await sql`
INSERT INTO mesh.__cmh_migrations (filename, sha256)
VALUES (${f}, ${sha})
ON CONFLICT (filename) DO NOTHING
RETURNING filename
`;
if (result.length > 0) {
inserted += 1;
console.log(` + ${f} ${sha.slice(0, 12)}`);
} else {
skipped += 1;
}
}
console.log(`bootstrap done · ${inserted} inserted, ${skipped} already tracked`);
} finally {
await sql.end({ timeout: 5 });
}
}
main().catch((e) => {
console.error("bootstrap failed:", e);
process.exit(1);
});


@@ -25,6 +25,29 @@ const lastHash = new Map<string, string>();
// Core audit logging
// ---------------------------------------------------------------------------
/**
* Deterministic JSON serialization: keys sorted recursively. The store
* is JSONB, which does NOT preserve key order, so hashing a naive
* JSON.stringify(row.payload) on verify can yield a different string
* from insert-time — false tamper reports. Canonical form guarantees
* both sides agree.
*/
function canonicalJson(value: unknown): string {
if (value === null || typeof value !== "object") return JSON.stringify(value);
if (Array.isArray(value)) {
return "[" + value.map(canonicalJson).join(",") + "]";
}
const obj = value as Record<string, unknown>;
const keys = Object.keys(obj).sort();
return (
"{" +
keys
.map((k) => JSON.stringify(k) + ":" + canonicalJson(obj[k]))
.join(",") +
"}"
);
}
function computeHash(
prevHash: string,
meshId: string,
@@ -33,15 +56,31 @@ function computeHash(
payload: Record<string, unknown>,
createdAt: Date,
): string {
const input = `${prevHash}|${meshId}|${eventType}|${actorMemberId}|${JSON.stringify(payload)}|${createdAt.toISOString()}`;
const input = `${prevHash}|${meshId}|${eventType}|${actorMemberId}|${canonicalJson(payload)}|${createdAt.toISOString()}`;
return createHash("sha256").update(input).digest("hex");
}
/**
* Stable 63-bit lock key per mesh for audit serialization under HA.
* Use the audit lock space; keep distinct from migrate's 74737_73831.
*/
function meshLockKey(meshId: string): bigint {
const digest = createHash("sha256").update("audit:" + meshId).digest();
const unsigned = digest.readBigUInt64BE(0);
return unsigned & 0x7fffffffffffffffn;
}
/**
* Append an audit entry for a mesh event.
*
* Fire-and-forget safe — callers should `void audit(...)` or
* `.catch(log.warn)` to avoid blocking the hot path.
*
* Concurrency under HA: wraps the write in a transaction that takes
* `pg_advisory_xact_lock(meshLockKey(meshId))` before reading the
* tail hash from the DB. This serializes all concurrent writers to
* the same mesh and prevents the chain from forking. The in-memory
* `lastHash` cache is updated after a successful commit.
*/
export async function audit(
meshId: string,
@@ -50,22 +89,31 @@ export async function audit(
actorDisplayName: string | null,
payload: Record<string, unknown>,
): Promise<void> {
const prevHash = lastHash.get(meshId) ?? "genesis";
const createdAt = new Date();
const hash = computeHash(prevHash, meshId, eventType, actorMemberId, payload, createdAt);
try {
await db.insert(auditLog).values({
meshId,
eventType,
actorMemberId,
actorDisplayName,
payload,
prevHash,
hash,
createdAt,
await db.transaction(async (tx) => {
const key = meshLockKey(meshId);
await tx.execute(sql`SELECT pg_advisory_xact_lock(${key}::bigint)`);
const [latest] = await tx
.select({ hash: auditLog.hash })
.from(auditLog)
.where(eq(auditLog.meshId, meshId))
.orderBy(desc(auditLog.id))
.limit(1);
const prevHash = latest?.hash ?? "genesis";
const hash = computeHash(prevHash, meshId, eventType, actorMemberId, payload, createdAt);
await tx.insert(auditLog).values({
meshId,
eventType,
actorMemberId,
actorDisplayName,
payload,
prevHash,
hash,
createdAt,
});
lastHash.set(meshId, hash);
});
lastHash.set(meshId, hash);
} catch (e) {
log.warn("audit log insert failed", {
mesh_id: meshId,


@@ -23,13 +23,23 @@ let _key: Buffer | null = null;
function getKey(): Buffer {
if (_key) return _key;
if (env.BROKER_ENCRYPTION_KEY && env.BROKER_ENCRYPTION_KEY.length === 64) {
if (env.BROKER_ENCRYPTION_KEY && /^[0-9a-f]{64}$/i.test(env.BROKER_ENCRYPTION_KEY)) {
_key = Buffer.from(env.BROKER_ENCRYPTION_KEY, "hex");
} else {
_key = randomBytes(32);
log.warn("BROKER_ENCRYPTION_KEY not set — generated ephemeral key. " +
"Set BROKER_ENCRYPTION_KEY=" + _key.toString("hex") + " to persist across restarts.");
return _key;
}
// In production, refuse to start without a persistent key. Silently
// generating a random one meant every restart invalidated all encrypted
// rows on disk — and the ephemeral key was logged in clear, which is
// itself a leak.
if (process.env.NODE_ENV === "production") {
log.error("BROKER_ENCRYPTION_KEY is missing or malformed (need 64 hex chars) — refusing to start in production");
process.exit(1);
}
// Dev only: generate a stable per-process key. Never log the value.
_key = randomBytes(32);
log.warn("BROKER_ENCRYPTION_KEY not set — using ephemeral key for this dev process (encrypted data WILL NOT survive restarts). Set BROKER_ENCRYPTION_KEY to a 64-hex-char value for persistence.");
return _key;
}
@@ -62,7 +72,11 @@ export function decryptFromStorage(packed: string): string | null {
decipher.setAuthTag(tag);
const decrypted = Buffer.concat([decipher.update(ciphertext), decipher.final()]);
return decrypted.toString("utf8");
} catch {
} catch (e) {
// Loud failure: if a stored row fails to decrypt the key changed or
// data is corrupt — don't silently return null and let downstream
// code assume "no value".
log.error("decryptFromStorage failed", { err: e instanceof Error ? e.message : String(e) });
return null;
}
}

File diff suppressed because it is too large.


@@ -10,7 +10,7 @@
import { and, eq, isNull, lt, sql } from "drizzle-orm";
import sodium from "libsodium-wrappers";
import { db } from "./db";
import { invite as inviteTable, mesh, meshMember } from "@turbostarter/db/schema/mesh";
import { invite as inviteTable, mesh, meshMember, meshTopic, meshTopicMember } from "@turbostarter/db/schema/mesh";
let ready = false;
async function ensureSodium(): Promise<typeof sodium> {
@@ -138,6 +138,128 @@ export async function sealRootKeyToRecipient(params: {
export const HELLO_SKEW_MS = 60_000;
/** Maximum lifetime of a parent attestation (24h). */
export const SESSION_ATTESTATION_MAX_TTL_MS = 24 * 60 * 60 * 1000;
/**
* Canonical bytes for a parent-vouches-session attestation.
*
* The parent member signs this with their stable ed25519 secret key when
* minting an attestation in `claudemesh launch`. The broker recomputes
* the same string at session_hello time and verifies the signature
* against `parent_member_pubkey`.
*
* Format: `claudemesh-session-attest|<parent_pubkey>|<session_pubkey>|<expires_at_ms>`
*/
export function canonicalSessionAttestation(
parentMemberPubkey: string,
sessionPubkey: string,
expiresAt: number,
): string {
return `claudemesh-session-attest|${parentMemberPubkey}|${sessionPubkey}|${expiresAt}`;
}
/**
* Canonical bytes for the session_hello signature.
*
* The session keypair (held by the daemon for the lifetime of the
* registration) signs this fresh on every WS connect, proving liveness +
* possession of the session secret key. Without this stage, an attacker
* who captured an attestation could replay it from any machine.
*
* Format: `claudemesh-session-hello|<mesh_id>|<parent_pubkey>|<session_pubkey>|<timestamp_ms>`
*/
export function canonicalSessionHello(
meshId: string,
parentMemberPubkey: string,
sessionPubkey: string,
timestamp: number,
): string {
return `claudemesh-session-hello|${meshId}|${parentMemberPubkey}|${sessionPubkey}|${timestamp}`;
}
/**
* Validate a parent-vouches-session attestation: lifetime bound + signature.
* Returns `{ ok: true }` on success or `{ ok: false, reason }` on failure.
*
* The TTL ceiling (24h) bounds replay damage if an attestation leaks; the
* lower bound (already in the past) blocks reuse of expired ones.
*/
export async function verifySessionAttestation(args: {
parentMemberPubkey: string;
sessionPubkey: string;
expiresAt: number;
signature: string;
now?: number;
}): Promise<
| { ok: true }
| { ok: false; reason: "expired" | "ttl_too_long" | "bad_signature" | "malformed" }
> {
const now = args.now ?? Date.now();
if (!Number.isFinite(args.expiresAt)) {
return { ok: false, reason: "malformed" };
}
if (args.expiresAt <= now) {
return { ok: false, reason: "expired" };
}
if (args.expiresAt > now + SESSION_ATTESTATION_MAX_TTL_MS) {
return { ok: false, reason: "ttl_too_long" };
}
if (
!/^[0-9a-f]{64}$/i.test(args.parentMemberPubkey) ||
!/^[0-9a-f]{64}$/i.test(args.sessionPubkey) ||
!/^[0-9a-f]{128}$/i.test(args.signature)
) {
return { ok: false, reason: "malformed" };
}
const canonical = canonicalSessionAttestation(
args.parentMemberPubkey,
args.sessionPubkey,
args.expiresAt,
);
const ok = await verifyEd25519(canonical, args.signature, args.parentMemberPubkey);
return ok ? { ok: true } : { ok: false, reason: "bad_signature" };
}
/**
* Validate the session-side hello signature: timestamp skew + signature
* by the session keypair over canonical session-hello bytes.
*/
export async function verifySessionHelloSignature(args: {
meshId: string;
parentMemberPubkey: string;
sessionPubkey: string;
timestamp: number;
signature: string;
now?: number;
}): Promise<
| { ok: true }
| { ok: false; reason: "timestamp_skew" | "bad_signature" | "malformed" }
> {
const now = args.now ?? Date.now();
if (
!Number.isFinite(args.timestamp) ||
Math.abs(now - args.timestamp) > HELLO_SKEW_MS
) {
return { ok: false, reason: "timestamp_skew" };
}
if (
!/^[0-9a-f]{64}$/i.test(args.parentMemberPubkey) ||
!/^[0-9a-f]{64}$/i.test(args.sessionPubkey) ||
!/^[0-9a-f]{128}$/i.test(args.signature)
) {
return { ok: false, reason: "malformed" };
}
const canonical = canonicalSessionHello(
args.meshId,
args.parentMemberPubkey,
args.sessionPubkey,
args.timestamp,
);
const ok = await verifyEd25519(canonical, args.signature, args.sessionPubkey);
return ok ? { ok: true } : { ok: false, reason: "bad_signature" };
}
/**
* Verify a hello's ed25519 signature + timestamp skew.
* Returns { ok: true } on success, or { ok: false, reason } describing
@@ -344,6 +466,32 @@ export async function claimInviteV2Core(params: {
return { ok: false, status: 400, body: { error: "malformed" } };
}
// 6b. Auto-subscribe the new member to #general (the default mesh-wide
// room). Idempotent via unique (topic_id, member_id). If the mesh was
// created before #general auto-creation existed, ensure it now via a
// best-effort INSERT … ON CONFLICT — backfill migration handles the
// bulk case so this is just a safety net.
await db
.insert(meshTopic)
.values({
meshId: inv.meshId,
name: "general",
description: "Default mesh-wide channel. Every member can read and post.",
visibility: "public",
})
.onConflictDoNothing();
const [generalTopic] = await db
.select({ id: meshTopic.id })
.from(meshTopic)
.where(and(eq(meshTopic.meshId, inv.meshId), eq(meshTopic.name, "general")))
.limit(1);
if (generalTopic) {
await db
.insert(meshTopicMember)
.values({ topicId: generalTopic.id, memberId: row.id, role: "member" })
.onConflictDoNothing();
}
// 7. Seal the mesh root_key to the recipient's x25519 pubkey.
let sealed: string;
try {


@@ -23,7 +23,7 @@ const envSchema = z.object({
MINIO_ENDPOINT: z.string().default("minio:9000"),
MINIO_ACCESS_KEY: z.string().default("claudemesh"),
MINIO_SECRET_KEY: z.string().default("changeme"),
MINIO_USE_SSL: z.enum(["true", "false", ""]).transform(v => v === "true").default("false"),
MINIO_USE_SSL: z.enum(["true", "false", ""]).default("false").transform(v => v === "true"),
QDRANT_URL: z.string().default("http://qdrant:6333"),
NEO4J_URL: z.string().default("bolt://neo4j:7687"),
NEO4J_USER: z.string().default("neo4j"),

File diff suppressed because it is too large.


@@ -86,7 +86,7 @@ export async function verifySyncToken(
}
// Decode header — must be HS256
const header = JSON.parse(new TextDecoder().decode(base64UrlDecode(headerB64)));
const header = JSON.parse(new TextDecoder().decode(base64UrlDecode(headerB64))) as { alg?: string };
if (header.alg !== "HS256") {
return { ok: false, error: `unsupported algorithm: ${header.alg}` };
}


@@ -31,7 +31,7 @@ export interface MemberPermissionUpdate {
export type MemberUpdateRequest = MemberProfileUpdate & MemberPermissionUpdate;
interface SelfEditablePolicy {
export interface SelfEditablePolicy {
displayName: boolean;
roleTag: boolean;
groups: boolean;


@@ -94,6 +94,10 @@ export const metrics = {
"broker_messages_dropped_by_grant_total",
"Messages silently dropped because recipient didn't grant sender the required capability",
),
brokerLegacyAuthHitsTotal: new Counter(
"broker_legacy_auth_hits_total",
"Pre-alpha.36 clients authenticating via body.user_id fallback (remove shim when near zero)",
),
queueDepth: new Gauge(
"broker_queue_depth",
"Undelivered messages currently in the queue",


@@ -1,19 +1,56 @@
/**
* Runtime migrations on broker startup.
*
* Runs pending drizzle migrations against DATABASE_URL before the broker
* listens. Uses pg_advisory_lock so a multi-instance deploy doesn't race.
* If migrations fail, the process exits non-zero so the orchestrator (Coolify
* healthcheck) sees the container as broken and doesn't route traffic.
* Replaced drizzle's migrator with a filename-tracked runner because
* drizzle's _journal.json drifted on the filesystem (last entry was
* idx=11; idx 12-24 were never recorded), and the prod
* drizzle.__drizzle_migrations table was even further behind (3 rows
* for 25 files). The runtime migrator silently skipped anything
* outside the journal, so every new schema change required `psql -f`
* by hand.
*
* The new runner tracks applied files in `mesh.__cmh_migrations`
* (filename + sha256 + applied_at). On startup:
* 1. Acquire advisory lock (unchanged)
* 2. CREATE TABLE IF NOT EXISTS for the tracking table
* 3. Read applied filenames from the table
* 4. List `migrations/*.sql` lexicographically; filter out applied
* 5. For each unapplied: BEGIN; execute file; INSERT row; COMMIT
* 6. For each applied: optionally verify sha matches; warn (don't
* fail) on mismatch — devs reformat migrations sometimes
*
* Bootstrap: run `apps/broker/scripts/bootstrap-cmh-migrations.ts`
* against an existing prod DB to seed the tracking table with the
* currently-applied set. Without that, the runner would try to
* re-apply 0000-0024 and fail on duplicate-table errors.
*
* Failure modes (all exit non-zero so Coolify healthcheck fails closed):
* - DATABASE_URL missing
* - lock acquisition timeout
* - migration SQL error mid-application
*/
import { drizzle } from "drizzle-orm/postgres-js";
import { migrate } from "drizzle-orm/postgres-js/migrator";
import postgres from "postgres";
import { dirname, join } from "node:path";
import { existsSync, readdirSync } from "node:fs";
import { join } from "node:path";
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { createHash } from "node:crypto";
const LOCK_ID = 74737_73831; // "cmsh" ascii — stable magic constant
const LOCK_ACQUIRE_TIMEOUT_MS = 60_000;
const LOCK_RETRY_INTERVAL_MS = 2_000;
const TRACKING_TABLE_DDL = `
CREATE SCHEMA IF NOT EXISTS mesh;
CREATE TABLE IF NOT EXISTS mesh.__cmh_migrations (
filename TEXT PRIMARY KEY,
sha256 TEXT NOT NULL,
applied_at TIMESTAMP NOT NULL DEFAULT NOW()
);
`;
function sha256Hex(content: string): string {
return createHash("sha256").update(content).digest("hex");
}
export async function runMigrationsOnStartup(): Promise<void> {
const url = process.env.DATABASE_URL;
@@ -22,8 +59,6 @@ export async function runMigrationsOnStartup(): Promise<void> {
return;
}
// Resolve the migrations folder — it's shipped inside @turbostarter/db's
// deploy subset in the runtime image. Dev path also works.
const candidates = [
"/app/migrations",
"/app/node_modules/@turbostarter/db/migrations",
@@ -35,18 +70,88 @@ export async function runMigrationsOnStartup(): Promise<void> {
console.error("[migrate] migrations folder not found — skipping. Searched:", candidates);
return;
}
const count = readdirSync(migrationsFolder).filter((f) => f.endsWith(".sql")).length;
console.log(`[migrate] ${count} migration files at ${migrationsFolder}`);
const sql = postgres(url, { max: 1, onnotice: () => { /* quiet */ } });
const allFiles = readdirSync(migrationsFolder)
.filter((f) => f.endsWith(".sql"))
.sort(); // lexicographic = numeric for 0000_*..9999_*
console.log(`[migrate] ${allFiles.length} migration files at ${migrationsFolder}`);
const sql = postgres(url, { max: 1, onnotice: () => {} });
try {
// Advisory lock so parallel instances serialise.
await sql`SELECT pg_advisory_lock(${LOCK_ID})`;
await sql.unsafe(`SET lock_timeout = '${LOCK_ACQUIRE_TIMEOUT_MS}ms'`);
const deadline = Date.now() + LOCK_ACQUIRE_TIMEOUT_MS;
let locked = false;
while (Date.now() < deadline) {
const [row] = await sql<{ locked: boolean }[]>`
SELECT pg_try_advisory_lock(${LOCK_ID}) AS locked
`;
if (row?.locked) {
locked = true;
break;
}
console.log("[migrate] advisory lock held — retrying in 2s");
await new Promise((r) => setTimeout(r, LOCK_RETRY_INTERVAL_MS));
}
if (!locked) {
console.error(`[migrate] could not acquire advisory lock within ${LOCK_ACQUIRE_TIMEOUT_MS}ms — aborting`);
process.exit(1);
}
try {
const db = drizzle(sql);
const start = Date.now();
await migrate(db, { migrationsFolder });
console.log(`[migrate] ok (${Date.now() - start}ms)`);
// Bootstrap the tracking table itself. Idempotent.
await sql.unsafe(TRACKING_TABLE_DDL);
const applied = await sql<{ filename: string; sha256: string }[]>`
SELECT filename, sha256 FROM mesh.__cmh_migrations
`;
const appliedMap = new Map(applied.map((r) => [r.filename, r.sha256]));
const pending: Array<{ filename: string; sha: string; content: string }> = [];
for (const filename of allFiles) {
const path = join(migrationsFolder, filename);
const content = readFileSync(path, "utf8");
const sha = sha256Hex(content);
const knownSha = appliedMap.get(filename);
if (!knownSha) {
pending.push({ filename, sha, content });
} else if (knownSha !== sha) {
// File content changed after application. Don't re-run; warn.
// Hard-fail would block legit cosmetic edits (whitespace,
// comments). Production drift detection lives elsewhere.
console.warn(
`[migrate] sha mismatch for ${filename} — file modified post-apply (was ${knownSha.slice(0, 12)}…, now ${sha.slice(0, 12)}…)`,
);
}
}
if (pending.length === 0) {
console.log(`[migrate] up to date · ${applied.length} applied`);
} else {
console.log(`[migrate] applying ${pending.length} pending: ${pending.map((p) => p.filename).join(", ")}`);
for (const m of pending) {
const start = Date.now();
try {
await sql.begin(async (tx) => {
// drizzle migrations separate statements with
// `--> statement-breakpoint`, which is a plain SQL comment, so
// the whole file can run as one multi-statement script via
// .unsafe(). sql.begin() wraps the file and the tracking-row
// INSERT in a single transaction, so a failure rolls back both.
await tx.unsafe(m.content);
await tx`
INSERT INTO mesh.__cmh_migrations (filename, sha256)
VALUES (${m.filename}, ${m.sha})
`;
});
console.log(`[migrate] ✓ ${m.filename} (${Date.now() - start}ms)`);
} catch (e) {
console.error(`[migrate] ✗ ${m.filename}:`, e instanceof Error ? e.message : e);
throw e;
}
}
console.log(`[migrate] ok`);
}
} finally {
await sql`SELECT pg_advisory_unlock(${LOCK_ID})`;
}
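
The planning step of the runner above (read the applied set, hash every file, split into pending and modified-after-apply) can be sketched as a pure function. This is a minimal sketch, not the broker's code: `planMigrations`, `MigrationFile` and `Plan` are hypothetical names introduced for illustration, and the database and filesystem are replaced by in-memory inputs.

```typescript
import { createHash } from "node:crypto";

/** sha256 of a migration file's content, hex-encoded (same idea as the runner's helper). */
function sha256Hex(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

// Hypothetical shapes, introduced only for this sketch.
interface MigrationFile {
  filename: string;
  content: string;
}
interface Plan {
  pending: Array<{ filename: string; sha: string; content: string }>;
  mismatched: string[]; // applied files whose content changed afterwards
}

/**
 * Splits migration files into "not yet applied" and "applied but modified".
 * `applied` maps filename -> sha256 as recorded in the tracking table.
 */
function planMigrations(files: MigrationFile[], applied: Map<string, string>): Plan {
  const plan: Plan = { pending: [], mismatched: [] };
  // Lexicographic order equals numeric order for zero-padded 0000_* prefixes.
  const sorted = [...files].sort((a, b) =>
    a.filename < b.filename ? -1 : a.filename > b.filename ? 1 : 0,
  );
  for (const f of sorted) {
    const sha = sha256Hex(f.content);
    const knownSha = applied.get(f.filename);
    if (knownSha === undefined) {
      plan.pending.push({ filename: f.filename, sha, content: f.content });
    } else if (knownSha !== sha) {
      // Warn-only case in the runner: never re-run an applied file.
      plan.mismatched.push(f.filename);
    }
  }
  return plan;
}
```

Keeping this step free of I/O makes the "warn, don't fail" rule for sha mismatches easy to unit-test without a Postgres instance.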


@@ -115,11 +115,11 @@ function lastAssistantHasToolUse(filePath: string): boolean {
if (!line) continue;
if (!line.includes('"assistant"')) continue;
try {
const d = JSON.parse(line);
const d = JSON.parse(line) as { type?: string; message?: { content?: unknown } };
if (d.type !== "assistant") continue;
const content = d.message?.content;
if (!Array.isArray(content)) continue;
return content.some((c: { type?: string }) => c.type === "tool_use");
return (content as Array<{ type?: string }>).some((c) => c.type === "tool_use");
} catch {
/* malformed line, skip */
}


@@ -169,7 +169,7 @@ function detectEntry(
try {
const pkg = JSON.parse(
readFileSync(join(sourcePath, "package.json"), "utf-8"),
);
) as { main?: string; bin?: string | Record<string, string> };
if (pkg.main) return { command: cmd, args: [pkg.main] };
if (pkg.bin) {
const bin =
@@ -372,7 +372,7 @@ function spawnService(svc: ManagedService): void {
const rl = createInterface({ input: child.stdout! });
rl.on("line", (line) => {
try {
const msg = JSON.parse(line);
const msg = JSON.parse(line) as { id?: string | number; error?: { message?: string }; result?: unknown };
if (msg.id && svc.pendingCalls.has(String(msg.id))) {
const pending = svc.pendingCalls.get(String(msg.id))!;
clearTimeout(pending.timer);


@@ -13,6 +13,7 @@ import { Bot, InputFile } from "grammy";
import WebSocket from "ws";
import sodium from "libsodium-wrappers";
import { validateTelegramConnectToken } from "./telegram-token";
import { log } from "./logger";
// ---------------------------------------------------------------------------
// Types
@@ -22,11 +23,12 @@ export interface BridgeRow {
chatId: number;
meshId: string;
meshSlug?: string;
memberId: string;
/** memberId can be null until the bridge claims a mesh.member row. */
memberId: string | null;
pubkey: string;
secretKey: string;
displayName: string;
chatType: string;
displayName: string | null;
chatType: string | null;
chatTitle: string | null;
}
@@ -228,7 +230,7 @@ class MeshConnection {
ws.on("message", async (raw) => {
try {
const msg = JSON.parse(raw.toString());
const msg = JSON.parse(raw.toString()) as Record<string, any>;
if (msg.type === "hello_ack") {
clearTimeout(helloTimeout);
@@ -674,8 +676,8 @@ function createPushHandler(bot: Bot) {
for (const chatId of chatIds) {
bot.api
.sendMessage(chatId, formatted)
.catch((e) => {
console.error(`[tg-bridge] send to chat ${chatId} failed:`, e.message);
.catch((e: unknown) => {
console.error(`[tg-bridge] send to chat ${chatId} failed:`, e instanceof Error ? e.message : String(e));
});
}
};
@@ -1729,11 +1731,12 @@ async function executeAiToolCall(
for (const meshId of meshIds) {
const services = await listDbMeshServices(meshId);
for (const s of services) {
const sx = s as Record<string, unknown>;
allServices.push({
name: s.name,
type: s.type ?? "mcp",
tools: s.tool_count ?? 0,
status: s.status ?? "running",
name: String(sx.name ?? ""),
type: String(sx.type ?? "mcp"),
tools: Number(sx.tool_count ?? 0),
status: String(sx.status ?? "running"),
});
}
}
@@ -1841,6 +1844,9 @@ export async function bootTelegramBridge(
for (const [meshId, meshRows] of byMesh) {
const first = meshRows[0]!;
try {
// memberId/displayName come back from DB nullable; bridge only
// works once both are populated, so skip rows missing either.
if (!first.memberId || !first.displayName) continue;
await ensureMeshConnection(
{
meshId,


@@ -102,11 +102,11 @@ export function validateTelegramConnectToken(
if (!timingSafeEqual(a, b)) return null;
// Verify header algorithm
const header = JSON.parse(base64urlDecode(headerB64));
const header = JSON.parse(base64urlDecode(headerB64)) as { alg?: string };
if (header.alg !== "HS256") return null;
// Decode and validate claims
const claims: JwtClaims = JSON.parse(base64urlDecode(payloadB64));
const claims = JSON.parse(base64urlDecode(payloadB64)) as JwtClaims;
// Check subject
if (claims.sub !== "telegram-connect") return null;


@@ -46,9 +46,25 @@ export interface HookSetStatusResponse {
// --- WebSocket protocol envelopes ---
/**
* Wire protocol version. Bump ONLY on breaking changes to the hello or
* push envelope shape. Clients send their highest supported version;
* broker picks the minimum of its own and the client's and echoes it
* on hello_ack. Backward-compat fields can be gated on this.
* 1 = initial release
*/
export const WS_PROTOCOL_VERSION = 1 as const;
/** Sent by client on connect to authenticate. */
export interface WSHelloMessage {
type: "hello";
/** Highest WS protocol version the client understands. Optional —
* pre-alpha.36 clients omit it and the broker treats missing as 1. */
protocolVersion?: number;
/** Optional feature strings the client supports. Broker uses this to
* avoid emitting envelopes the client can't parse. Examples: "grants",
* "channels", "streams". Unknown capabilities ignored. */
capabilities?: string[];
meshId: string;
memberId: string;
pubkey: string; // must match mesh.member.peerPubkey
@@ -74,6 +90,66 @@ export interface WSHelloMessage {
signature: string;
}
/**
* Client → broker: per-launch session hello, vouched by the parent member.
*
* Used by the daemon's per-session WebSocket connections (1.30.0+) so that
* each `claudemesh launch`-spawned session has its own long-lived presence
* row owned by an ephemeral session keypair. The parent member key vouches
* (out-of-band) that the session pubkey is theirs; the session keypair
* proves liveness on every connect.
*
* Two-stage proof:
* 1. `parentAttestation.signature` — ed25519 over
* `claudemesh-session-attest|<parent_pubkey>|<session_pubkey>|<expires_at_ms>`
* signed by the parent member's stable secret key. TTL ≤ 24h.
* 2. `signature` — ed25519 over
* `claudemesh-session-hello|<mesh_id>|<parent_pubkey>|<session_pubkey>|<timestamp>`
* signed by the session secret key (held by the daemon for the
* lifetime of the session registration).
*
* Older brokers don't recognize this message type and reply with
* `unknown_message_type`; clients fall back to the legacy `hello` flow.
*/
export interface WSSessionHelloMessage {
type: "session_hello";
/** Highest WS protocol version the client understands. */
protocolVersion?: number;
/** Optional feature strings the client supports. */
capabilities?: string[];
meshId: string;
/** Parent member's id (mesh.member.id) — used for revocation lookup. */
parentMemberId: string;
/** Parent member's stable ed25519 pubkey (hex), as found in mesh.member. */
parentMemberPubkey: string;
/** Per-launch ephemeral ed25519 pubkey (hex). Routes presence + DMs. */
sessionPubkey: string;
/** Pre-signed attestation by the parent member, presented per session. */
parentAttestation: {
sessionPubkey: string;
parentMemberPubkey: string;
/** Unix ms; broker rejects past or > now+24h. */
expiresAt: number;
signature: string;
};
/** Display name override for this session (optional, falls back to member). */
displayName?: string;
sessionId: string;
pid: number;
cwd: string;
hostname?: string;
peerType?: "ai" | "human" | "connector";
channel?: string;
model?: string;
groups?: Array<{ name: string; role?: string }>;
/** Initial role tag for the session. */
role?: string;
/** ms epoch; broker rejects if outside ±60s of its own clock. */
timestamp: number;
/** ed25519 signature (hex) by the SESSION secret key over canonical bytes. */
signature: string;
}
/** Client → broker: send an E2E-encrypted envelope to a target. */
export interface WSSendMessage {
type: "send";
@@ -82,6 +158,22 @@ export interface WSSendMessage {
nonce: string; // base64
ciphertext: string; // base64
id?: string; // client-side correlation id
/**
* Optional client-extracted `@-mention` display names (lowercased,
* no `@` prefix, max 16). Required once the `body_version: 2` cipher
* lands in v0.3.0 phase 3, because the server can't read v2
* ciphertext to regex-match mentions. Today's v1 plaintext path
* falls back to a regex on the body when this field is absent.
*/
mentions?: string[];
/** Optional id of a previous topic message this one replies to.
* Server validates same-topic membership; FK is set null if parent
* later disappears. Ignored for non-topic targets. */
replyToId?: string;
/** Optional ciphertext-format version. 1 = v1 plaintext base64;
* 2 = v0.3.0 phase 3 per-topic encrypted body. Server passes this
* through verbatim into topic_message.body_version. */
bodyVersion?: number;
}
/** Broker → client: an envelope addressed to this peer. */
@@ -89,11 +181,34 @@ export interface WSPushMessage {
type: "push";
messageId: string;
meshId: string;
/** Sender's *session* pubkey — ephemeral, rotates on session restart.
* DMs are sealed against the recipient's session key paired with this.
* For replies prefer `senderMemberPubkey` / `senderMemberId`. */
senderPubkey: string;
/** Sender's *member* pubkey — stable across reconnects/restarts.
* Use this as the canonical reply target. */
senderMemberPubkey?: string;
/** Stable mesh.member id of the sender — survives display-name changes,
* use this as the canonical reply target when set. Optional for
* legacy/non-topic broker paths that haven't been wired yet. */
senderMemberId?: string;
/** Sender's current display name as a convenience for renderers. */
senderName?: string;
/** Topic name when the push originates from a topic post (vs DM). */
topic?: string;
/** Server-side message id of the parent message when this push is a
* reply, so the recipient can render thread context and re-thread. */
replyToId?: string;
priority: Priority;
nonce: string;
ciphertext: string;
createdAt: string;
/** v0.9.0 daemon fields. Echoed when the sender's send envelope
* carried them (spec §4.2/§4.4). Receivers use `client_message_id`
* for idempotent inbox dedupe and `request_fingerprint` for
* defense-in-depth verification. Both null on legacy traffic. */
client_message_id?: string | null;
request_fingerprint?: string | null;
/** Optional semantic tag — "reminder" when delivered by the scheduler,
* "system" for broker-originated topology events (peer join/leave). */
subtype?: "reminder" | "system";
@@ -109,6 +224,26 @@ export interface WSSetStatusMessage {
status: PeerStatus;
}
/**
* Client → broker: confirm receipt of a previously pushed envelope so the
* broker can mark the message_queue row delivered.
*
* v2 agentic-comms (M1): pairs with the two-phase claim/lease introduced
* in `drainForMember`. Without this ack, the lease expires after 30s and
* the message is re-claimed and re-pushed (at-least-once retry).
*
* Either id is accepted; daemons that track inbox dedupe by clientMessageId
* should send that one. brokerMessageId is the row primary key, useful when
* the original send didn't carry a client_message_id (legacy traffic).
*/
export interface WSClientAckMessage {
type: "client_ack";
/** Original caller-supplied idempotency id from the `send` envelope. */
clientMessageId?: string;
/** Broker-side row id (the `messageId` field on the inbound `push`). */
brokerMessageId?: string;
}
/** Client → broker: request list of connected peers in the same mesh. */
export interface WSListPeersMessage {
type: "list_peers";
@@ -163,6 +298,175 @@ export interface WSLeaveGroupMessage {
name: string;
}
// ── API keys (v0.2.0) ───────────────────────────────────────────────
// Issuance/management of bearer tokens for REST + external WS. Only the
// mesh admin can issue; keys are scoped by capability + optional topic
// whitelist. Spec: .artifacts/specs/2026-05-02-v0.2.0-scope.md
export interface WSApiKeyCreateMessage {
type: "apikey_create";
label: string;
capabilities: Array<"send" | "read" | "state_write" | "admin">;
topicScopes?: string[];
expiresAt?: string;
_reqId?: string;
}
export interface WSApiKeyListMessage {
type: "apikey_list";
_reqId?: string;
}
export interface WSApiKeyRevokeMessage {
type: "apikey_revoke";
id: string;
_reqId?: string;
}
export interface WSApiKeyCreatedMessage {
type: "apikey_created";
id: string;
/** Plaintext secret — shown ONCE, never returned again. */
secret: string;
label: string;
prefix: string;
capabilities: Array<"send" | "read" | "state_write" | "admin">;
topicScopes: string[] | null;
createdAt: string;
_reqId?: string;
}
export interface WSApiKeyListResponseMessage {
type: "apikey_list_response";
keys: Array<{
id: string;
label: string;
prefix: string;
capabilities: Array<"send" | "read" | "state_write" | "admin">;
topicScopes: string[] | null;
createdAt: string;
lastUsedAt: string | null;
revokedAt: string | null;
expiresAt: string | null;
}>;
_reqId?: string;
}
export interface WSApiKeyRevokeResponseMessage {
type: "apikey_revoke_response";
status: "revoked" | "not_found" | "not_unique";
/** Full id of the revoked key on success (may differ from input if a prefix was sent). */
id?: string;
/** How many keys matched on not_unique. */
matches?: number;
_reqId?: string;
}
// ── Topics (v0.2.0) ─────────────────────────────────────────────────
// Topics complement groups: groups are identity tags, topics are
// conversation scopes. targetSpec for topic-tagged messages is
// "#<topicId>". Spec: .artifacts/specs/2026-05-02-v0.2.0-scope.md
export interface WSTopicCreateMessage {
type: "topic_create";
name: string;
description?: string;
visibility?: "public" | "private" | "dm";
_reqId?: string;
}
export interface WSTopicListMessage {
type: "topic_list";
_reqId?: string;
}
export interface WSTopicJoinMessage {
type: "topic_join";
/** Topic id OR name. Server resolves. */
topic: string;
role?: "lead" | "member" | "observer";
_reqId?: string;
}
export interface WSTopicLeaveMessage {
type: "topic_leave";
topic: string;
_reqId?: string;
}
export interface WSTopicMembersMessage {
type: "topic_members";
topic: string;
_reqId?: string;
}
export interface WSTopicHistoryMessage {
type: "topic_history";
topic: string;
limit?: number;
beforeId?: string;
_reqId?: string;
}
export interface WSTopicMarkReadMessage {
type: "topic_mark_read";
topic: string;
_reqId?: string;
}
// Server → client topic responses
export interface WSTopicCreatedMessage {
type: "topic_created";
topic: { id: string; name: string; visibility: "public" | "private" | "dm" };
created: boolean;
_reqId?: string;
}
export interface WSTopicListResponseMessage {
type: "topic_list_response";
topics: Array<{
id: string;
name: string;
description: string | null;
visibility: "public" | "private" | "dm";
memberCount: number;
createdAt: string;
}>;
_reqId?: string;
}
export interface WSTopicMembersResponseMessage {
type: "topic_members_response";
topic: string;
members: Array<{
memberId: string;
pubkey: string;
displayName: string;
role: "lead" | "member" | "observer";
joinedAt: string;
lastReadAt: string | null;
}>;
_reqId?: string;
}
export interface WSTopicHistoryResponseMessage {
type: "topic_history_response";
topic: string;
messages: Array<{
id: string;
senderPubkey: string;
senderMemberId?: string;
senderName?: string;
nonce: string;
ciphertext: string;
bodyVersion?: number;
replyToId?: string | null;
createdAt: string;
}>;
_reqId?: string;
}
/** Client → broker: set a shared state key-value. */
export interface WSSetStateMessage {
type: "set_state";
@@ -206,6 +510,8 @@ export interface WSAckMessage {
id: string; // echoes client-side correlation id
messageId: string;
queued: boolean;
/** Populated when queued=false to explain why (rate_limit, too_large, etc.). */
error?: string;
_reqId?: string;
}
@@ -232,6 +538,8 @@ export interface WSPeersListMessage {
type: "peers_list";
peers: Array<{
pubkey: string;
/** Stable member pubkey — present on M1+ broker responses. */
memberPubkey?: string;
displayName: string;
status: PeerStatus;
summary: string | null;
@@ -239,6 +547,13 @@ export interface WSPeersListMessage {
sessionId: string;
connectedAt: string;
cwd?: string;
/** v2 agentic-comms (M1): typed connection role. CLI uses this to
* filter control-plane daemons out of user-facing peer lists.
* Optional for clients talking to a pre-M1 broker. Wire field is
* `peerRole` to avoid collision with 1.31.5's top-level `role`
* (which is a lift of `profile.role`, the user-supplied string
* like "lead" / "reviewer" / "human"). */
peerRole?: "control-plane" | "session" | "service";
hostname?: string;
peerType?: "ai" | "human" | "connector";
channel?: string;
@@ -1108,6 +1423,16 @@ export interface WSVaultGetMessage { type: "vault_get"; keys: string[]; _reqId?:
export interface WSWatchMessage { type: "watch"; url: string; mode?: "hash" | "json" | "status"; extract?: string; interval?: number; notify_on?: string; headers?: Record<string, string>; label?: string; _reqId?: string; }
/** Client → broker: stop watching. */
export interface WSUnwatchMessage { type: "unwatch"; watchId: string; _reqId?: string; }
/** Client → broker: soft-disconnect a peer (1000; CLI auto-reconnects). */
export interface WSDisconnectMessage { type: "disconnect"; target?: string; stale?: number; all?: boolean; _reqId?: string; }
/** Client → broker: hard-kick a peer (4001; CLI exits). */
export interface WSKickMessage { type: "kick"; target?: string; stale?: number; all?: boolean; _reqId?: string; }
/** Client → broker: ban a member by pubkey or display name. */
export interface WSBanMessage { type: "ban"; target: string; reason?: string; _reqId?: string; }
/** Client → broker: lift a ban. */
export interface WSUnbanMessage { type: "unban"; target: string; _reqId?: string; }
/** Client → broker: list active bans on the caller's mesh. */
export interface WSListBansMessage { type: "list_bans"; _reqId?: string; }
/** Client → broker: list active watches. */
export interface WSWatchListMessage { type: "watch_list"; _reqId?: string; }
/** Broker → client: watch created acknowledgement. */
@@ -1119,7 +1444,9 @@ export interface WSWatchTriggeredMessage { type: "watch_triggered"; watchId: str
export type WSClientMessage =
| WSHelloMessage
| WSSessionHelloMessage
| WSSendMessage
| WSClientAckMessage
| WSSetStatusMessage
| WSListPeersMessage
| WSSetSummaryMessage
@@ -1127,6 +1454,16 @@ export type WSClientMessage =
| WSSetProfileMessage
| WSJoinGroupMessage
| WSLeaveGroupMessage
| WSTopicCreateMessage
| WSTopicListMessage
| WSTopicJoinMessage
| WSTopicLeaveMessage
| WSTopicMembersMessage
| WSTopicHistoryMessage
| WSTopicMarkReadMessage
| WSApiKeyCreateMessage
| WSApiKeyListMessage
| WSApiKeyRevokeMessage
| WSSetStateMessage
| WSGetStateMessage
| WSListStateMessage
@@ -1201,7 +1538,12 @@ export type WSClientMessage =
| WSVaultGetMessage
| WSWatchMessage
| WSUnwatchMessage
| WSWatchListMessage;
| WSWatchListMessage
| WSDisconnectMessage
| WSKickMessage
| WSBanMessage
| WSUnbanMessage
| WSListBansMessage;
// --- Skill messages ---
@@ -1253,6 +1595,8 @@ export interface WSSkillDataMessage {
instructions: string;
tags: string[];
author: string;
/** Optional opaque metadata stored alongside the skill body. */
manifest?: unknown;
createdAt: string;
} | null;
_reqId?: string;
@@ -1295,6 +1639,13 @@ export type WSServerMessage =
| WSPushMessage
| WSAckMessage
| WSPeersListMessage
| WSTopicCreatedMessage
| WSTopicListResponseMessage
| WSTopicMembersResponseMessage
| WSTopicHistoryResponseMessage
| WSApiKeyCreatedMessage
| WSApiKeyListResponseMessage
| WSApiKeyRevokeResponseMessage
| WSStateChangeMessage
| WSStateResultMessage
| WSStateListMessage
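
The version rule documented on `WS_PROTOCOL_VERSION` (client sends its highest supported version, broker picks the minimum of its own and the client's, missing means 1) can be sketched as below. This is an illustrative sketch only: `negotiateProtocolVersion` is a hypothetical helper, and treating a malformed value the same as a missing one is an assumption, not something the diff states.

```typescript
// Mirrors WS_PROTOCOL_VERSION from the types file; redefined here so the
// sketch is self-contained.
const BROKER_PROTOCOL_VERSION = 1;

/**
 * Hypothetical helper: returns the version the broker would echo on hello_ack.
 * A missing clientVersion is treated as 1, as the doc comment describes.
 * Assumption: non-integer or < 1 values are also treated as 1.
 */
function negotiateProtocolVersion(
  clientVersion: number | undefined,
  brokerVersion: number = BROKER_PROTOCOL_VERSION,
): number {
  const client =
    typeof clientVersion === "number" && Number.isInteger(clientVersion) && clientVersion >= 1
      ? clientVersion
      : 1;
  return Math.min(client, brokerVersion);
}
```

With this shape, a newer client talking to an older broker is downgraded to the broker's version, and a pre-alpha.36 client that omits the field is handled without a special case at the call site.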


@@ -0,0 +1,53 @@
/**
* Audit hash chain uses canonical JSON (sorted keys) so JSONB key
* order can't break verification. This test pins the contract.
*/
import { describe, expect, test } from "vitest";
import { createHash } from "node:crypto";
// Re-derive canonicalJson for the test (duplicate of audit.ts internal).
function canonicalJson(value: unknown): string {
if (value === null || typeof value !== "object") return JSON.stringify(value);
if (Array.isArray(value)) return "[" + value.map(canonicalJson).join(",") + "]";
const obj = value as Record<string, unknown>;
const keys = Object.keys(obj).sort();
return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonicalJson(obj[k])).join(",") + "}";
}
function hash(prev: string, meshId: string, eventType: string, actor: string | null, payload: Record<string, unknown>, createdAt: Date): string {
const input = `${prev}|${meshId}|${eventType}|${actor}|${canonicalJson(payload)}|${createdAt.toISOString()}`;
return createHash("sha256").update(input).digest("hex");
}
describe("audit canonical json hash", () => {
test("key order does not affect the computed hash", () => {
const createdAt = new Date("2026-04-15T12:00:00Z");
const a = hash("prev", "mesh1", "peer_joined", "actor", { groups: [], pubkey: "abc", restored: true }, createdAt);
const b = hash("prev", "mesh1", "peer_joined", "actor", { restored: true, pubkey: "abc", groups: [] }, createdAt);
const c = hash("prev", "mesh1", "peer_joined", "actor", { pubkey: "abc", groups: [], restored: true }, createdAt);
expect(a).toBe(b);
expect(b).toBe(c);
});
test("nested object key order also irrelevant", () => {
const createdAt = new Date("2026-04-15T12:00:00Z");
const a = hash("x", "m", "e", null, { outer: { inner: { a: 1, b: 2 } } }, createdAt);
const b = hash("x", "m", "e", null, { outer: { inner: { b: 2, a: 1 } } }, createdAt);
expect(a).toBe(b);
});
test("array order IS significant", () => {
const createdAt = new Date("2026-04-15T12:00:00Z");
const a = hash("x", "m", "e", null, { list: [1, 2, 3] }, createdAt);
const b = hash("x", "m", "e", null, { list: [3, 2, 1] }, createdAt);
expect(a).not.toBe(b);
});
test("changing payload value changes the hash", () => {
const createdAt = new Date("2026-04-15T12:00:00Z");
const a = hash("x", "m", "e", null, { k: "v1" }, createdAt);
const b = hash("x", "m", "e", null, { k: "v2" }, createdAt);
expect(a).not.toBe(b);
});
});
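
The test above pins the per-row hash. How a verifier might walk a whole chain with it can be sketched as follows. This is a hedged sketch: `AuditEvent`, `AuditRow`, `rowHash`, `firstBrokenIndex` and the empty-string genesis value are hypothetical and chosen for illustration; only `canonicalJson` and the hash input layout are copied from the test.

```typescript
import { createHash } from "node:crypto";

// Same canonical JSON as in the test: object keys sorted, arrays kept in order.
function canonicalJson(value: unknown): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map((v) => canonicalJson(v)).join(",") + "]";
  const obj = value as Record<string, unknown>;
  const keys = Object.keys(obj).sort();
  return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonicalJson(obj[k])).join(",") + "}";
}

// Hypothetical row shapes for this sketch.
interface AuditEvent {
  meshId: string;
  eventType: string;
  actor: string | null;
  payload: Record<string, unknown>;
  createdAt: Date;
}
interface AuditRow extends AuditEvent {
  hash: string; // hash stored alongside the row
}

/** Hash of one event given the previous row's hash (same input layout as the test). */
function rowHash(prev: string, e: AuditEvent): string {
  const input = `${prev}|${e.meshId}|${e.eventType}|${e.actor}|${canonicalJson(e.payload)}|${e.createdAt.toISOString()}`;
  return createHash("sha256").update(input).digest("hex");
}

/**
 * Returns the index of the first row whose stored hash does not match the
 * recomputed one, or -1 if the chain is intact.
 * Assumption: the first row chains from `genesisPrev` (empty string here).
 */
function firstBrokenIndex(rows: AuditRow[], genesisPrev = ""): number {
  let prev = genesisPrev;
  for (let i = 0; i < rows.length; i++) {
    const row = rows[i]!;
    if (rowHash(prev, row) !== row.hash) return i;
    prev = row.hash;
  }
  return -1;
}
```

Because the payload is canonicalized before hashing, a JSONB round trip that reorders keys leaves the chain valid, while any change to a value breaks it at that row.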


@@ -0,0 +1,66 @@
/**
* Grant enforcement: the sender+recipient lookup tries member pubkey
* first, then session pubkey (backwards compat for CLI clients that
* stored grants keyed on session key).
*
* This is a pure logic test over the grant map shape — no WS/broker
* needed. The function signature mirrors the branch inside handleSend.
*/
import { describe, expect, test } from "vitest";
const DEFAULT_CAPS = ["read", "dm", "broadcast", "state-read"] as const;
function allowed(
grants: Record<string, string[]>,
senderMemberKey: string,
senderSessionKey: string | null,
capNeeded: "dm" | "broadcast",
): boolean {
const memberEntry = grants[senderMemberKey];
if (memberEntry !== undefined) return memberEntry.includes(capNeeded);
if (senderSessionKey) {
const sessionEntry = grants[senderSessionKey];
if (sessionEntry !== undefined) return sessionEntry.includes(capNeeded);
}
return (DEFAULT_CAPS as readonly string[]).includes(capNeeded);
}
describe("grant enforcement (member-then-session lookup)", () => {
test("no entry → default caps allow dm + broadcast", () => {
expect(allowed({}, "memberK", null, "dm")).toBe(true);
expect(allowed({}, "memberK", null, "broadcast")).toBe(true);
});
test("explicit member-key entry wins over default", () => {
const grants = { memberK: ["read"] }; // dm NOT granted
expect(allowed(grants, "memberK", "sessK", "dm")).toBe(false);
});
test("empty array for member key = blocked", () => {
const grants = { memberK: [] };
expect(allowed(grants, "memberK", null, "dm")).toBe(false);
expect(allowed(grants, "memberK", null, "broadcast")).toBe(false);
});
test("falls back to session key when member key missing", () => {
const grants = { sessK: ["dm"] }; // grants keyed on session
expect(allowed(grants, "memberK", "sessK", "dm")).toBe(true);
expect(allowed(grants, "memberK", "sessK", "broadcast")).toBe(false);
});
test("member entry always wins over session entry", () => {
const grants = {
memberK: [], // member says blocked
sessK: ["dm", "broadcast"], // session says allowed
};
expect(allowed(grants, "memberK", "sessK", "dm")).toBe(false);
expect(allowed(grants, "memberK", "sessK", "broadcast")).toBe(false);
});
test("session fallback only triggers when session key present", () => {
const grants = { sessK: ["dm"] };
// Without a session key on the caller, falls through to defaults
expect(allowed(grants, "memberK", null, "dm")).toBe(true);
});
});


@@ -0,0 +1,47 @@
/**
* Kick control-plane skip: 1.34.15 (gap #3a) refuses to close
* long-lived control-plane connections (claudemesh daemon, dashboard)
* via `kick`, because they auto-reconnect within seconds and the verb
* was effectively a no-op. The soft `disconnect` verb keeps the old
* behavior so users can still nudge a control-plane peer to
* re-authenticate.
*
* Pure-logic test — mirrors the branch inside handleSend's kick case
* without spinning up a broker. Same pattern as
* grants-enforcement.test.ts.
*/
import { describe, expect, test } from "vitest";
type PeerRole = "control-plane" | "session" | "service";
/** Mirrors the predicate inserted into the kick handler. */
function shouldSkipKick(args: {
verb: "kick" | "disconnect";
peerRole: PeerRole;
}): boolean {
const skipControlPlane = args.verb === "kick";
return skipControlPlane && args.peerRole === "control-plane";
}
describe("kick control-plane skip (gap #3a)", () => {
test("kick on control-plane → skipped (would auto-reconnect)", () => {
expect(shouldSkipKick({ verb: "kick", peerRole: "control-plane" })).toBe(true);
});
test("kick on session → not skipped (closes user session)", () => {
expect(shouldSkipKick({ verb: "kick", peerRole: "session" })).toBe(false);
});
test("kick on service → not skipped", () => {
expect(shouldSkipKick({ verb: "kick", peerRole: "service" })).toBe(false);
});
test("disconnect on control-plane → not skipped (intentional nudge)", () => {
expect(shouldSkipKick({ verb: "disconnect", peerRole: "control-plane" })).toBe(false);
});
test("disconnect on session → not skipped", () => {
expect(shouldSkipKick({ verb: "disconnect", peerRole: "session" })).toBe(false);
});
});
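
The two canonical strings signed in the session-hello proof, as documented on `WSSessionHelloMessage`, and the time-window checks around them can be sketched without any crypto. This is a sketch under stated assumptions: the function names are hypothetical stand-ins (the real helpers live in `../src/crypto` and are not shown in this diff), the constant values are inferred from the "TTL ≤ 24h" and "±60s" comments, and the exact boundary behaviour is a guess.

```typescript
// Assumed values, inferred from the doc comments ("TTL <= 24h", "+/-60s").
const SESSION_ATTESTATION_MAX_TTL_MS = 24 * 60 * 60 * 1000;
const HELLO_SKEW_MS = 60_000;

/** Hypothetical stand-in: bytes the parent member key signs (stage 1). */
function attestationCanonicalSketch(
  parentPubkey: string,
  sessionPubkey: string,
  expiresAtMs: number,
): string {
  return `claudemesh-session-attest|${parentPubkey}|${sessionPubkey}|${expiresAtMs}`;
}

/** Hypothetical stand-in: bytes the session key signs on every connect (stage 2). */
function helloCanonicalSketch(
  meshId: string,
  parentPubkey: string,
  sessionPubkey: string,
  timestampMs: number,
): string {
  return `claudemesh-session-hello|${meshId}|${parentPubkey}|${sessionPubkey}|${timestampMs}`;
}

/** Rejects attestations that are already expired or valid for longer than the max TTL. */
function checkAttestationWindow(
  expiresAtMs: number,
  nowMs: number,
): "ok" | "expired" | "ttl_too_long" {
  if (expiresAtMs <= nowMs) return "expired";
  if (expiresAtMs > nowMs + SESSION_ATTESTATION_MAX_TTL_MS) return "ttl_too_long";
  return "ok";
}

/** True when the hello timestamp is within the allowed clock skew. */
function helloTimestampInSkew(timestampMs: number, nowMs: number): boolean {
  return Math.abs(nowMs - timestampMs) <= HELLO_SKEW_MS;
}
```

Binding both pubkeys and the expiry into the attestation string is what makes a tampered `sessionPubkey` fail verification: the signature was made over a different canonical string.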


@@ -0,0 +1,218 @@
/**
* Session-hello signature + parent-attestation verification.
*
* Two-stage proof:
* 1. Parent member signs `canonicalSessionAttestation` (long-lived, ≤24h
* TTL) — vouches that the session pubkey belongs to them.
* 2. Session keypair signs `canonicalSessionHello` per WS-connect — proves
* liveness + possession.
*
* The broker rejects on any of: expired/over-TTL attestation, bad signature,
* timestamp skew, malformed hex, or a session signature made with the
* wrong key (covers the "attestation leaked, attacker tries to use it
* without the session secret key" case).
*/
import { beforeAll, describe, expect, test } from "vitest";
import sodium from "libsodium-wrappers";
import {
canonicalSessionAttestation,
canonicalSessionHello,
verifySessionAttestation,
verifySessionHelloSignature,
SESSION_ATTESTATION_MAX_TTL_MS,
HELLO_SKEW_MS,
} from "../src/crypto";
interface Keypair {
publicKey: string;
secretKey: string;
}
async function makeKeypair(): Promise<Keypair> {
await sodium.ready;
const kp = sodium.crypto_sign_keypair();
return {
publicKey: sodium.to_hex(kp.publicKey),
secretKey: sodium.to_hex(kp.privateKey),
};
}
function sign(canonical: string, secretKeyHex: string): string {
return sodium.to_hex(
sodium.crypto_sign_detached(
sodium.from_string(canonical),
sodium.from_hex(secretKeyHex),
),
);
}
describe("verifySessionAttestation", () => {
  let parent: Keypair;
  let session: Keypair;
  beforeAll(async () => {
    parent = await makeKeypair();
    session = await makeKeypair();
  });
  test("valid attestation accepted", async () => {
    const expiresAt = Date.now() + 60 * 60 * 1000;
    const canonical = canonicalSessionAttestation(parent.publicKey, session.publicKey, expiresAt);
    const signature = sign(canonical, parent.secretKey);
    const result = await verifySessionAttestation({
      parentMemberPubkey: parent.publicKey,
      sessionPubkey: session.publicKey,
      expiresAt,
      signature,
    });
    expect(result.ok).toBe(true);
  });
  test("expired attestation rejected", async () => {
    const expiresAt = Date.now() - 1_000;
    const canonical = canonicalSessionAttestation(parent.publicKey, session.publicKey, expiresAt);
    const signature = sign(canonical, parent.secretKey);
    const result = await verifySessionAttestation({
      parentMemberPubkey: parent.publicKey,
      sessionPubkey: session.publicKey,
      expiresAt,
      signature,
    });
    expect(result.ok).toBe(false);
    if (!result.ok) expect(result.reason).toBe("expired");
  });
  test("over-24h TTL rejected", async () => {
    const expiresAt = Date.now() + SESSION_ATTESTATION_MAX_TTL_MS + 60_000;
    const canonical = canonicalSessionAttestation(parent.publicKey, session.publicKey, expiresAt);
    const signature = sign(canonical, parent.secretKey);
    const result = await verifySessionAttestation({
      parentMemberPubkey: parent.publicKey,
      sessionPubkey: session.publicKey,
      expiresAt,
      signature,
    });
    expect(result.ok).toBe(false);
    if (!result.ok) expect(result.reason).toBe("ttl_too_long");
  });
  test("attestation signed by wrong key rejected", async () => {
    const other = await makeKeypair();
    const expiresAt = Date.now() + 60 * 60 * 1000;
    const canonical = canonicalSessionAttestation(parent.publicKey, session.publicKey, expiresAt);
    // Sign with a different parent. The verifier still checks against the
    // claimed parentMemberPubkey, so it should fail.
    const signature = sign(canonical, other.secretKey);
    const result = await verifySessionAttestation({
      parentMemberPubkey: parent.publicKey,
      sessionPubkey: session.publicKey,
      expiresAt,
      signature,
    });
    expect(result.ok).toBe(false);
    if (!result.ok) expect(result.reason).toBe("bad_signature");
  });
  test("tampered session_pubkey fails (canonical mismatch)", async () => {
    const expiresAt = Date.now() + 60 * 60 * 1000;
    const canonical = canonicalSessionAttestation(parent.publicKey, session.publicKey, expiresAt);
    const signature = sign(canonical, parent.secretKey);
    const evil = await makeKeypair();
    const result = await verifySessionAttestation({
      parentMemberPubkey: parent.publicKey,
      sessionPubkey: evil.publicKey, // claim a different session pubkey
      expiresAt,
      signature,
    });
    expect(result.ok).toBe(false);
    if (!result.ok) expect(result.reason).toBe("bad_signature");
  });
  test("malformed hex rejected", async () => {
    const expiresAt = Date.now() + 60 * 60 * 1000;
    const result = await verifySessionAttestation({
      parentMemberPubkey: "not-hex",
      sessionPubkey: session.publicKey,
      expiresAt,
      signature: "a".repeat(128),
    });
    expect(result.ok).toBe(false);
    if (!result.ok) expect(result.reason).toBe("malformed");
  });
});

describe("verifySessionHelloSignature", () => {
  let parent: Keypair;
  let session: Keypair;
  beforeAll(async () => {
    parent = await makeKeypair();
    session = await makeKeypair();
  });
  test("valid session-hello signature accepted", async () => {
    const meshId = "mesh-x";
    const timestamp = Date.now();
    const canonical = canonicalSessionHello(meshId, parent.publicKey, session.publicKey, timestamp);
    const signature = sign(canonical, session.secretKey);
    const result = await verifySessionHelloSignature({
      meshId,
      parentMemberPubkey: parent.publicKey,
      sessionPubkey: session.publicKey,
      timestamp,
      signature,
    });
    expect(result.ok).toBe(true);
  });
  test("attacker without session secret key cannot forge session-hello", async () => {
    // The hostile case: attacker captured a valid attestation but doesn't
    // hold the session secret key. They try to sign session_hello with the
    // parent's key. The broker checks the signature against sessionPubkey,
    // which fails because the parent didn't sign with the session key.
    const meshId = "mesh-x";
    const timestamp = Date.now();
    const canonical = canonicalSessionHello(meshId, parent.publicKey, session.publicKey, timestamp);
    const signature = sign(canonical, parent.secretKey); // wrong secret key
    const result = await verifySessionHelloSignature({
      meshId,
      parentMemberPubkey: parent.publicKey,
      sessionPubkey: session.publicKey,
      timestamp,
      signature,
    });
    expect(result.ok).toBe(false);
    if (!result.ok) expect(result.reason).toBe("bad_signature");
  });
  test("timestamp skew rejected", async () => {
    const timestamp = Date.now() - HELLO_SKEW_MS - 1_000;
    const canonical = canonicalSessionHello("mesh-x", parent.publicKey, session.publicKey, timestamp);
    const signature = sign(canonical, session.secretKey);
    const result = await verifySessionHelloSignature({
      meshId: "mesh-x",
      parentMemberPubkey: parent.publicKey,
      sessionPubkey: session.publicKey,
      timestamp,
      signature,
    });
    expect(result.ok).toBe(false);
    if (!result.ok) expect(result.reason).toBe("timestamp_skew");
  });
  test("tampered meshId fails verification", async () => {
    const timestamp = Date.now();
    const canonical = canonicalSessionHello("mesh-A", parent.publicKey, session.publicKey, timestamp);
    const signature = sign(canonical, session.secretKey);
    const result = await verifySessionHelloSignature({
      meshId: "mesh-B", // claim a different mesh
      parentMemberPubkey: parent.publicKey,
      sessionPubkey: session.publicKey,
      timestamp,
      signature,
    });
    expect(result.ok).toBe(false);
    if (!result.ok) expect(result.reason).toBe("bad_signature");
  });
});

File diff suppressed because it is too large.


@@ -1,6 +1,16 @@
# claudemesh-cli
Peer mesh for Claude Code sessions. Connect multiple Claude Code instances into a shared mesh with real-time messaging, shared state, memory, file sharing, and 79 MCP tools.
Peer mesh for Claude Code sessions. Connect multiple Claude Code instances into a shared mesh with real-time messaging, shared state, memory, file sharing, vector store, scheduled jobs, and more — all driven from the `claudemesh` CLI. The MCP server is a tool-less push-pipe that delivers inbound peer messages to Claude as `<channel>` interrupts; everything else lives behind CLI verbs that Claude learns from the auto-installed `claudemesh` skill.
> **What's new in 1.9.x:** topic threading + multi-session reliability fixes. `claudemesh topic post <topic> <msg> --reply-to <id>` threads a reply onto a previous topic message (full id or 8+ char prefix); `topic tail` renders `↳ in reply to <name>: "<snippet>"` above replies and shows a copyable `#xxxxxxxx` short id on every row. `<channel>` MCP attrs now carry `from_member_id`, `from_pubkey` (stable), `from_session_pubkey` (ephemeral), `message_id`, `topic`, `reply_to_id` — everything the recipient needs to reply directly. Broker fixes (v0.3.2): replies to a stale session pubkey now resolve to the owning member's live session instead of bouncing with "not online", and broadcast `*` no longer loopbacks decrypt-fail warnings to the sender's sibling sessions.
>
> **What was new in 1.8.0:** per-topic end-to-end encryption (v0.3.0 phase 3, CLI side). `claudemesh topic post <topic> <msg>` encrypts the body with `crypto_secretbox` under the topic's symmetric key — broker stores ciphertext only. `claudemesh topic tail` now decrypts v2 messages on render and runs a background re-seal loop every 30s, so new topic joiners get their sealed keys without manual action. `topic-key` cache is process-only — kill the CLI, the key forgets. Web dashboard reads v1 plaintext for now (phase 3.5 brings browser-side identity).
>
> **What was new in 1.7.0:** terminal parity for the v1.6.x server features. New verbs: `claudemesh topic tail` (live SSE message stream — Ctrl-C to exit), `claudemesh notification list` (recent `@you` mentions across topics), `claudemesh member list` (mesh roster with online dots, distinct from `peer list`'s live-session view). Each command auto-mints a 5-minute read-only apikey via the WebSocket and revokes it on exit, so no token plumbing is needed.
>
> **What was new in 1.6.0:** topics (channel pub/sub), API keys for human/REST clients, and bridge peers that forward a topic between two meshes. New verbs: `claudemesh topic`, `claudemesh apikey`, `claudemesh bridge`. A REST surface at `https://claudemesh.com/api/v1/*` (messages, topics, peers, history) accepts `Authorization: Bearer cm_...` keys, so any HTTPS client can participate without WebSocket + ed25519 plumbing. **Note**: REST lives on the web host (`claudemesh.com`), not the broker host (`ic.claudemesh.com`) — the broker only speaks WebSocket.
>
> **Migration note (1.5.0):** the previous 79 MCP tools (`send_message`, `list_peers`, `remember`, …) are removed. Use the matching CLI verbs (`claudemesh send`, `claudemesh peers`, `claudemesh remember`). Run `claudemesh install` and the bundled skill teaches Claude the full surface.
## Install
@@ -38,6 +48,14 @@ USAGE
claudemesh remind ... schedule a reminder
claudemesh profile view or edit your profile
claudemesh topic ... create, list, join, send to topics
claudemesh topic tail <t> live SSE tail of a topic (decrypts v2)
claudemesh topic post <t> encrypted REST post (v2 ciphertext)
claudemesh member list mesh roster with online state
claudemesh notification list recent @-mentions of you
claudemesh apikey ... issue, list, revoke API keys (REST clients)
claudemesh bridge ... forward a topic between two meshes
claudemesh doctor diagnose issues
claudemesh whoami show current identity
claudemesh status check broker connectivity
@@ -67,7 +85,7 @@ src/
│ ├── api/ typed HTTP client for claudemesh.com
│ ├── health/ 6 diagnostic checks
│ └── ... device, clipboard, spawn, telemetry, i18n, logger
├── mcp/ MCP server with 79 tools across 21 families
├── mcp/ MCP server (tool-less push-pipe; emits claude/channel notifications)
├── ui/ TUI: styles, spinner, welcome wizard, launch flow
├── constants/ exit codes, paths, URLs, timings
├── types/ API, mesh, peer interfaces


@@ -1,6 +1,6 @@
{
"name": "claudemesh-cli",
"version": "1.0.0-alpha.31",
"version": "1.34.16",
"description": "Peer mesh for Claude Code sessions — CLI + MCP server.",
"keywords": [
"claude-code",
@@ -24,6 +24,7 @@
},
"files": [
"dist",
"skills",
"README.md",
"LICENSE"
],
@@ -54,6 +55,7 @@
"zod": "4.1.13"
},
"devDependencies": {
"@claudemesh/sdk": "workspace:*",
"@turbostarter/eslint-config": "workspace:*",
"@turbostarter/prettier-config": "workspace:*",
"@turbostarter/tsconfig": "workspace:*",


@@ -0,0 +1,702 @@
---
name: claudemesh
description: Use when the user asks to send a message to a peer Claude session, list mesh peers, share state across meshes, schedule cross-session reminders, or otherwise interact with claudemesh — a peer mesh runtime for Claude Code sessions. Provides the canonical reference for every `claudemesh` CLI verb, its flags, JSON output shape, and common patterns.
---
# claudemesh skill
`claudemesh` is the peer mesh runtime for Claude Code sessions. Each session in a mesh has a name, identity (ed25519 keypair), and shared resources: peers, messages, state, memory, files, vectors, scheduled jobs, skills.
**You invoke claudemesh exclusively through the `claudemesh` CLI via Bash.** There are no MCP tools to call — `tools/list` returns empty for the claudemesh MCP server. The MCP server exists only to deliver inbound peer messages as `<channel source="claudemesh">` interrupts mid-turn. Everything else is CLI.
## Launch welcome (`kind: "welcome"`) — 1.34.2+
5 seconds after Claude Code attaches to claudemesh via `claudemesh launch`, the MCP server emits ONE `<channel source="claudemesh">` push with `meta.kind: "welcome"`. It carries identity (`self_display_name`, `self_session_pubkey`, `self_role`), the active `mesh_slug`, live `peer_count` + `peer_names`, recent `unread_count` + `latest_message_ids`, and a CLI hint line. Treat it as the "mesh is connected" handshake — read it once, internalize identity + peers + inbox state, and use it to decide whether to act on unread items right away. Do NOT reply to a welcome push the way you reply to a DM; it has no sender.
## When you receive a `<channel source="claudemesh">` message
Respond IMMEDIATELY (unless `meta.kind` is `"welcome"` or `"system"` — those are informational, no reply needed). Pause your current task, reply via `claudemesh send`, then resume. Read `from_name`, `mesh_slug`, and `priority` from the channel attributes. Reply by setting `<to>` to the sender's `from_name`. Do not ignore low-priority messages — acknowledge them briefly even if you defer action. If the channel meta contains `subtype: reminder`, this is a scheduled reminder you set yourself — act on it.
### Channel attributes (everything you need to reply is in the push)
The `<channel>` interrupt carries these attributes — no lookup needed:
| Attribute | What it is |
|---|---|
| `from_name` | Sender's display name. **Use as `to` in your reply** for DMs. Empty/absent on `kind: "welcome"` and `kind: "system"`. |
| `from_pubkey` | Sender's **session pubkey** (hex, ephemeral per-launch). Since 1.34.0 this is the session pubkey of the launched session that originated the send, NOT the daemon's stable member pubkey — sibling sessions of the same human are correctly disambiguated. |
| `from_session_pubkey` | Same as `from_pubkey` for session-originated DMs. Kept as a separate key so the model never confuses session vs member identity when a control-plane source is involved. |
| `from_member_id` / `from_member_pubkey` | Sender's stable mesh.member id / pubkey. Survives display-name and session rotation. Use to recognize "the same human across multiple Claude Code windows". |
| `mesh_slug` | Mesh the message arrived on. Pass via `--mesh <slug>` if the parent isn't on the same mesh. |
| `priority` | `now` / `next` / `low`. |
| `message_id` | Server-side id of THIS message. **Pass to `--reply-to <id>` to thread your reply** in topic posts. |
| `client_message_id` | Sender-stable idempotency id (UUID). Survives broker restarts; safe to log. |
| `topic` | Set when the source is a topic post. Reply via `topic post <topic> --reply-to <message_id>`. |
| `reply_to_id` | Set when the message itself is a reply to a previous one — render thread context. |
| `kind` (welcome/system meta only) | `"welcome"` for the launch handshake, `"system"` for peer_join/peer_leave/etc. — neither needs a reply. |
**Reply patterns:**
```bash
# DM → use from_name as the target
claudemesh send "<from_name>" "ack — looking now"
# Topic reply → thread it onto the message you got
claudemesh topic post "<topic>" "yep, looks good" --reply-to <message_id>
# When the sender is on a different mesh you've joined
claudemesh send "<from_name>" "..." --mesh "<mesh_slug>"
```
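The reply patterns above can be sketched as a small decision function. This is an illustrative helper, not part of claudemesh: `ChannelAttrs` and `buildReplyArgv` are hypothetical names, and only the attribute names and CLI verbs are taken from the table and examples above.

```typescript
// Hypothetical helper: picks the reply command for an inbound <channel> push.
// Attribute names mirror the table above; nothing here is claudemesh API.
interface ChannelAttrs {
  from_name?: string;
  mesh_slug?: string;
  message_id?: string;
  topic?: string;
  kind?: string;
}

function buildReplyArgv(attrs: ChannelAttrs, body: string): string[] | null {
  // welcome/system pushes are informational and have no sender to reply to.
  if (attrs.kind === "welcome" || attrs.kind === "system") return null;

  // Topic posts are answered in-thread via `topic post --reply-to`.
  if (attrs.topic && attrs.message_id) {
    return ["claudemesh", "topic", "post", attrs.topic, body, "--reply-to", attrs.message_id];
  }

  // Plain DM: address the sender by display name.
  if (!attrs.from_name) return null;
  const argv = ["claudemesh", "send", attrs.from_name, body];
  if (attrs.mesh_slug) argv.push("--mesh", attrs.mesh_slug);
  return argv;
}
```

Returning an argv array rather than a shell string sidesteps quoting problems when `from_name` contains spaces, as in "Lug Nut".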
## Performance model (warm vs cold path)
If the parent Claude session was launched via `claudemesh launch`, an MCP push-pipe is running and holds the per-mesh WS connection. CLI invocations dial `~/.claudemesh/sockets/<mesh-slug>.sock` and reuse that warm connection (~200ms total round-trip including Node.js startup). If no push-pipe is running (cron, scripts, hooks fired outside a session), the CLI opens its own WS, which takes ~500-700ms cold. **You don't manage this** — every verb auto-detects and falls through.
### Daemon path (v1.24.0+, REQUIRED for in-Claude-Code use)
`claudemesh daemon up [--mesh <slug>]` starts a persistent per-user runtime that holds the broker WS, a durable SQLite outbox/inbox, and listens on `~/.claudemesh/daemon/daemon.sock` (UDS) plus an optional loopback TCP. When the daemon socket is present, every verb routes through it first (~1ms IPC) before falling back to bridge / cold paths. The send envelope carries a caller-stable `client_message_id`, so a `claudemesh send` that started before a daemon crash survives the restart via the on-disk outbox.
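The routing order described above (daemon socket first, then the push-pipe bridge socket, then a cold WS dial) might look roughly like the sketch below. The two socket paths are the ones quoted in this section; `pickTransport`, `Transport`, and the injected `exists` check are hypothetical names used only for illustration.

```typescript
// Illustrative only: models the documented fall-through order, not the real CLI code.
type Transport = "daemon" | "bridge" | "cold";

function pickTransport(
  meshSlug: string,
  home: string,
  exists: (path: string) => boolean, // injected so the sketch stays testable
): Transport {
  // 1. Daemon UDS, the fastest path when the daemon is up.
  if (exists(`${home}/.claudemesh/daemon/daemon.sock`)) return "daemon";
  // 2. Per-mesh socket held by a running MCP push-pipe.
  if (exists(`${home}/.claudemesh/sockets/${meshSlug}.sock`)) return "bridge";
  // 3. Nothing warm is available, so the CLI opens its own WS.
  return "cold";
}
```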
Lifecycle:
```bash
claudemesh daemon up --mesh <slug> # foreground
claudemesh daemon install-service --mesh <slug> # macOS launchd / Linux systemd-user
claudemesh daemon status [--json] # health + pid
claudemesh daemon outbox list [--failed|--pending|...] # local queue inspection
claudemesh daemon outbox requeue <id> # re-enqueue an aborted/dead row
claudemesh daemon down # SIGTERM + wait
```
As of 1.24.0 `claudemesh install` registers the MCP entry **and** installs/starts the daemon service for the user's primary mesh. The MCP shim hard-requires the daemon to be running — it bails at boot with actionable instructions if the socket isn't present. There is no fallback. CLI verbs (`send`, `peer list`, `inbox`, `skill list/get`, etc.) keep working without a daemon via bridge or cold paths, but for any in-Claude-Code use the daemon must be up.
### Ambient mode (1.25.0+)
Once `claudemesh install` has run (registers MCP entry + starts daemon service), **raw `claude` Just Works** for the daemon's attached mesh. No `claudemesh launch` ceremony, no manual flags, no per-session keypair. Channel push, slash commands, and resources all flow through the daemon-backed MCP shim. Use `claudemesh launch` only when you need to override defaults (different mesh, custom display name, system-prompt injection, headless modes).
## Spawning new sessions (no wizard)
`claudemesh launch` remains useful for non-default cases: explicit mesh selection, fresh display name, headless `--quiet` runs, system-prompt injection, multi-mesh users with one daemon attached to mesh A who want to spawn into mesh B. For the common case (single joined mesh, daemon installed), prefer raw `claude`. Pass every required flag up front so no interactive prompt fires — that's what makes the verb scriptable from tmux send-keys, AppleScript/iTerm spawn helpers, hooks, cron, and the `claudemesh launch` you call from inside another session.
### Full flag surface
| Flag | What it skips | Notes |
|---|---|---|
| `--name <display-name>` | the "What's your name?" prompt | required when spawning unattended; persists as the session's display name and `from_name` in inbound channels |
| `--mesh <slug>` | the multi-mesh picker | required when the user has joined >1 mesh; otherwise the single mesh is auto-selected |
| `--join <invite-url>` | the "join a mesh first" branch | run join + launch in one step; pair with `-y` for fully non-interactive |
| `--groups "name:role,name2:role2,all"` | the group selection prompt | comma-separated `<groupname>:<role>` entries; the literal `all` joins `@all` |
| `--role <lead\|member\|observer>` | the role prompt | applied to all groups in `--groups` that didn't specify their own |
| `--message-mode <push\|inbox>` | the message-mode prompt | `push` (default) emits `<channel>` notifications mid-turn; `inbox` only buffers — quieter for headless agents |
| `--system-prompt <text>` | nothing — pure pass-through | forwarded to `claude --system-prompt` (overrides default; pass a string, not a path) |
| `--resume <session-id>` | nothing — pure pass-through | forwarded to `claude --resume` to continue a prior Claude Code session |
| `--continue` | nothing — pure pass-through | forwarded to `claude --continue` (resumes the last session in this cwd) |
| `-y` / `--yes` | every confirmation prompt | including the "you'll skip ALL permission prompts" gate. **Use for autonomous agents; omit for shared/multi-person meshes.** |
| `--quiet` | the wizard + welcome banner | suppresses the launch wizard and banner. Combine with `-y` for true headless: `--quiet` alone won't bypass Claude's permission prompts, so a script using only `--quiet` will hang on the first tool call. |
| `--` | (separator) | everything after `--` is forwarded verbatim to `claude`. Example: `claudemesh launch --name X -y -- --resume abc123 --model opus` |
> **All twelve flags are end-to-end wired as of `claudemesh-cli@1.27.1`.** Earlier builds silently dropped `--role`, `--groups`, `--message-mode`, `--system-prompt`, `--continue`, and `--quiet` at the CLI entrypoint — they were declared but never reached `runLaunch`. If a script targets older versions, those flags are no-ops.
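To make the `--groups` and `--role` interaction concrete, here is a minimal parsing sketch. `parseGroups` and `GroupMembership` are hypothetical names, and the rule that a bare entry inherits the `--role` value is an assumption drawn from the table wording, not confirmed CLI behavior.

```typescript
// Hypothetical parser for a --groups spec such as "frontend:lead,reviewers:observer,all".
interface GroupMembership {
  name: string;
  role: string | null;
}

function parseGroups(spec: string, defaultRole: string | null = null): GroupMembership[] {
  return spec
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((entry) => {
      const sep = entry.indexOf(":");
      if (sep === -1) {
        // Bare entry (for example `all`): assumed to fall back to --role, if given.
        return { name: entry, role: defaultRole };
      }
      // An explicit role wins; an empty role after the colon falls back too.
      return { name: entry.slice(0, sep), role: entry.slice(sep + 1) || defaultRole };
    });
}
```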
### Wizard-free spawn templates
#### Canonical fully-populated spawn (every flag set explicitly)
The kitchen-sink form — copy, set every value, and the session boots without a single interactive prompt or banner. Use as a base when scripting from cron, hooks, CI, or another agent:
```bash
claudemesh launch \
--name "ci-bot" \
--mesh openclaw \
--role member \
--groups "frontend:lead,reviewers:observer,all" \
--message-mode inbox \
--system-prompt "$(cat ~/agents/ci-bot.md)" \
--quiet \
-y \
-- \
--model opus \
--resume "$LAST_SESSION_ID"
```
Annotated:
| Position | Value | Effect |
|---|---|---|
| `--name "ci-bot"` | identity | what peers see in `peer list` and `<channel from_name>` — pin so peers always see the same name across machines |
| `--mesh openclaw` | workspace | required when you have ≥2 joined meshes; safe to include even with 1 (becomes a no-op assertion) |
| `--role member` | session label | free-form tag used by group conventions; common values: `lead`, `member`, `observer`, `bot`, `oncall` |
| `--groups "frontend:lead,..."` | group memberships | comma-separated `<group>:<role>` pairs; bare `all` joins `@all` with no role |
| `--message-mode inbox` | delivery | `push` interrupts mid-turn (default); `inbox` buffers silently; `off` disables messages but keeps tool calls |
| `--system-prompt "..."` | claude system prompt | overrides Claude's default. Pass a string, not a path — wrap with `$(cat …)` if you keep prompts in files |
| `--quiet` | output | suppress the wizard and banner — clean stdout for the spawning script |
| `-y` | consent | skips every permission prompt: claudemesh's policy gate **and** Claude's own prompts (via `--dangerously-skip-permissions`). Required for true headless |
| `--` | separator | everything after is passed verbatim to `claude` |
| `--model opus` | claude flag | example claude-side override |
| `--resume "$LAST_SESSION_ID"` | claude flag | resume a prior Claude session inside this mesh identity |
**Rule of thumb:** for any unattended spawn, the minimum is `--name + --mesh + -y + --quiet`. Add `--system-prompt` to seed task context, `--message-mode inbox` to keep the bot quiet, and `--role` + `--groups` so peers know how to address it. Drop `--quiet` when a human is watching the script's stdout.
#### Trimmed templates
```bash
# Minimal — single joined mesh, fresh agent, autonomous:
claudemesh launch --name "Lug Nut" -y
# Multi-mesh user — pick mesh explicitly:
claudemesh launch --name "Mou" --mesh openclaw -y
# Cold-start a peer who hasn't joined the mesh yet:
claudemesh launch \
--name "Lug Nut" \
--join "https://claudemesh.com/i/abc123" \
--groups "frontend:member,reviewers:observer,all" \
--message-mode push \
-y
# Resume a specific Claude session inside claudemesh:
claudemesh launch --name "Mou" --mesh openclaw -y -- --resume abc123-...
# Quiet, headless, system-prompt loaded — for cron / hooks:
claudemesh launch --name "ci-bot" --mesh openclaw \
--system-prompt "$(cat ~/agents/ci-bot.md)" \
--message-mode inbox \
--quiet -y
```
If any required flag is missing AND stdin is a TTY, `launch` falls back to its prompt for that single field. **In a non-TTY context (Bash tool, cron, AppleScript pipe), a missing flag makes the verb fail closed: it never silently uses a default that affects identity.**
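The fail-closed rule can be sketched as follows. `resolveRequired` and its parameters are hypothetical; the real `launch` implementation is not shown in this document, so treat this as one plausible shape of the logic rather than the actual code.

```typescript
// Hypothetical sketch of "prompt on a TTY, fail closed otherwise".
type Prompt = (question: string) => string;

function resolveRequired(
  flag: string,
  value: string | undefined,
  isTTY: boolean,
  prompt: Prompt,
): string {
  // A flag passed up front always wins, so no prompt fires.
  if (value !== undefined && value !== "") return value;
  // Interactive terminal: ask for just this one field.
  if (isTTY) return prompt(`${flag}?`);
  // Non-interactive: refuse instead of guessing an identity-affecting default.
  throw new Error(`missing required flag ${flag} in a non-interactive context`);
}
```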
### Spawning into new terminal panes/windows
The launch verb itself is just a shell command — wrap it in whatever pane-creation primitive the host platform uses. The patterns that work today:
```bash
# tmux — send into a pane you control. NEVER send-keys into a pane
# you didn't create; you risk typing into another live TUI.
tmux new-window -t "$SESSION" -n claudemesh-lugnut
tmux send-keys -t "$SESSION:claudemesh-lugnut" \
'claudemesh launch --name "Lug Nut" --mesh openclaw -y' Enter
# macOS iTerm2 (open a new tab in the current window):
osascript <<'OSA'
tell application "iTerm2"
  tell current window
    create tab with default profile
    tell current session of current tab
      write text "claudemesh launch --name \"Lug Nut\" --mesh openclaw -y"
    end tell
  end tell
end tell
OSA
# macOS Terminal.app (new window):
osascript -e 'tell application "Terminal" to do script "claudemesh launch --name \"Lug Nut\" --mesh openclaw -y"'
# GNOME Terminal / generic Linux:
gnome-terminal -- bash -lc 'claudemesh launch --name "Lug Nut" --mesh openclaw -y'
# screen detached:
screen -dmS lugnut bash -lc 'claudemesh launch --name "Lug Nut" --mesh openclaw -y'
# Windows Terminal (wt.exe) — open a new tab:
wt.exe new-tab --title claudemesh-lugnut powershell -NoExit -Command "claudemesh launch --name 'Lug Nut' --mesh openclaw -y"
# Windows Terminal — split the current pane vertically instead:
wt.exe split-pane -V powershell -NoExit -Command "claudemesh launch --name 'Lug Nut' --mesh openclaw -y"
# PowerShell — spawn a detached window of the user's default shell:
Start-Process powershell -ArgumentList '-NoExit','-Command','claudemesh launch --name "Lug Nut" --mesh openclaw -y'
# cmd.exe — start a new console window:
start "claudemesh-lugnut" cmd /k "claudemesh launch --name ""Lug Nut"" --mesh openclaw -y"
# WSL from a Windows host — same launch verb, just route through wsl.exe:
wsl.exe -- bash -lc 'claudemesh launch --name "Lug Nut" --mesh openclaw -y'
```
Windows-specific gotchas:
- **cmd.exe does not treat single quotes as quoting characters.** Use `""` to escape inner double quotes (see the `cmd /k` example) or move to PowerShell, where single quotes work normally.
- **`-NoExit`** tells PowerShell to stay open after running the `-Command` string instead of exiting. Without it, the window closes when `claudemesh launch` exits.
- **WSL paths.** If you spawn from a Windows-side script into WSL, the `claudemesh` CLI in WSL writes to `~/.claudemesh/` on the Linux side, *not* `%USERPROFILE%\.claudemesh\`. The two installs are independent — match the spawn host to the install host.
- **Windows Terminal profile names.** Replace `powershell` with `pwsh` for PowerShell 7+, or use `--profile "<name>"` to target a configured profile (e.g. one preconfigured with WSL Ubuntu + a starting directory).
The user's environment may also have these pre-built helpers (CLAUDE.md will tell you):
- `~/tools/scripts/spawn-iterm-panes.sh` and `spawn-iterm-window.sh` — safer iTerm spawners that only write into sessions they themselves created.
- `~/tools/scripts/claude-peers.sh` — tmux wrapper that opens a split running `claudemesh launch` with sensible defaults.
Prefer those when available — they handle pane ownership / cleanup correctly.
### Sanity rules for unattended spawns
1. **Always pass `--name`.** A nameless session falls back to `<hostname>-<pid>`, which makes peer attribution opaque in `peer list` and inbound channels.
2. **Always pass `--mesh` when the user has multiple meshes joined.** Otherwise the picker fires and the spawn hangs waiting for stdin.
3. **Pass `-y` only when you understand the consent it grants.** It skips every permission gate — fine for an autonomous agent on a private mesh, dangerous on a shared mesh where peers can drive your file system.
4. **For long-running daemonised peers, use `--message-mode inbox`** so they don't fire `<channel>` interrupts on every received DM. They poll `claudemesh inbox` on their own cadence.
5. **Confirm the spawn worked** by waiting a few seconds and running `claudemesh peer list` — the new peer's `displayName` should appear with `status: "idle"`.
## Universal flags
| Flag | Behavior |
|---|---|
| `--mesh <slug>` | Target a specific mesh. Required when the user has multiple meshes joined. Default: first/only joined mesh, or interactive picker. |
| `--json` | Emit JSON instead of human-readable text. Use this when you need to parse the output. |
| `--json field1,field2` | Project specific fields (modeled on `gh --json`). Friendly aliases like `name` → `displayName` are resolved automatically. |
| `--approval-mode <mode>` | `plan` / `read-only` deny all writes; `write` (default) prompts on destructive verbs from the policy file; `yolo` bypasses every prompt. |
| `--policy <path>` | Override the policy file (default `~/.claudemesh/policy.yaml`, auto-created on first run). |
| `-y` / `--yes` | Auto-approve any policy prompt. Equivalent to `--approval-mode yolo` for the current invocation. |
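As an illustration of `--json field1,field2` projection with alias resolution, consider the sketch below. Only the `name` → `displayName` alias comes from the table; `project`, `ALIASES`, and the choice to key the output by the canonical field name are assumptions.

```typescript
// Hypothetical projection helper; not the CLI's actual implementation.
const ALIASES: Record<string, string> = { name: "displayName" };

function project(record: Record<string, unknown>, fields: string): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const raw of fields.split(",")) {
    const field = raw.trim();
    if (!field) continue;
    // Resolve a friendly alias to the canonical key before reading the record.
    const key = ALIASES[field] ?? field;
    out[key] = record[key];
  }
  return out;
}
```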
## Policy & confirmation
Every broker-touching verb runs through a policy gate before dispatch. The default policy allows reads and prompts on destructive writes (`peer kick/ban/disconnect`, `file delete`, `vector/vault delete`, `memory forget`, `skill remove`, `webhook delete`, `watch remove`, `sql/graph execute`, `mesh delete`). When you call `claudemesh` from a non-interactive context (cron, scripts, Claude's Bash tool), prompts auto-deny — pass `-y` or `--approval-mode yolo` for verbs you've vetted, or edit `~/.claudemesh/policy.yaml` to mark them `decision: allow`. Every gate decision is appended to `~/.claudemesh/audit.log` (newline-JSON).
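The gate's decision logic, as described here and in the `--approval-mode` row, could be modeled like this. `gate`, `ApprovalMode`, and `Decision` are hypothetical names, and per-verb overrides from `policy.yaml` are deliberately left out of the sketch.

```typescript
// Simplified model of the policy gate; ignores policy.yaml per-verb overrides.
type ApprovalMode = "plan" | "read-only" | "write" | "yolo";
type Decision = "allow" | "deny" | "prompt";

function gate(
  mode: ApprovalMode,
  isWrite: boolean,
  isDestructive: boolean,
  interactive: boolean,
): Decision {
  if (!isWrite) return "allow"; // reads are allowed by the default policy
  if (mode === "plan" || mode === "read-only") return "deny"; // deny all writes
  if (mode === "yolo") return "allow"; // bypasses every prompt
  if (!isDestructive) return "allow"; // write mode only gates destructive verbs
  // Destructive write in write mode: prompt, which auto-denies without a TTY.
  return interactive ? "prompt" : "deny";
}
```

Under this model, a destructive verb such as `peer kick` invoked from cron in the default mode maps to `deny`, which matches the auto-deny behavior described above.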
## Resources and verbs
**Convention:** every operation is `claudemesh <resource> <verb>`. Legacy short forms (`send`, `peers`, `kick`, `remember`, ...) are aliases that keep working forever; prefer the resource form for new code.
### `topic` — conversation scope within a mesh (v0.2.0)
A topic is a named conversation inside a mesh. Mesh = trust boundary. Group = identity tag. **Topic = what you're talking about.** Subscribers receive topic-tagged messages; non-subscribers don't. Topics also persist message history so humans (and opting-in agents) can fetch back-scroll on reconnect.
```bash
claudemesh topic create deploys --description "deploy + on-call"
claudemesh topic create incident-2026-05-02 --visibility private
claudemesh topic list # all topics in mesh
claudemesh topic join deploys # subscribe (by name or id)
claudemesh topic join deploys --role lead # join as lead
claudemesh topic leave deploys
claudemesh topic members deploys # list subscribers
claudemesh topic history deploys --limit 50 # fetch back-scroll
claudemesh topic history deploys --before <msg-id> # paginate older
claudemesh topic read deploys # mark all as read
# Send to a topic — same `send` verb, target starts with # (WS, v1 plaintext)
claudemesh send "#deploys" "rolling out 1.5.1 to staging"
# v1.7.0+: live tail in the terminal — backfill last N + then SSE forward.
# Decrypts v2 messages on render. Runs a 30s re-seal loop while held.
claudemesh topic tail deploys --limit 50
# v1.8.0+: encrypted REST send (body_version 2). Falls back to v1
# automatically for legacy unencrypted topics. --plaintext forces v1.
claudemesh topic post deploys "rolling out, cc @Alexis stay around"
# v1.9.0+: thread a reply onto a previous topic message. Accepts the
# full id or an 8+ char prefix; resolved against recent history.
claudemesh topic post deploys "yes — same here" --reply-to 7XtIeF7o
```
In `topic tail` output, replies render with a `↳ in reply to <name>: "<snippet>"` line above the message and every row shows a short id tag (`#xxxxxxxx`) so you can copy-paste into `--reply-to`.
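Resolving a `--reply-to` value against recent history might work along these lines. `resolveReplyTo` is a hypothetical helper; the 8-character minimum comes from the text above, while rejecting ambiguous prefixes is an assumption about how collisions would be handled.

```typescript
// Hypothetical resolver: accepts a full message id or an 8+ character prefix.
function resolveReplyTo(idOrPrefix: string, recentIds: string[]): string {
  // An exact id always resolves to itself.
  if (recentIds.includes(idOrPrefix)) return idOrPrefix;
  if (idOrPrefix.length < 8) {
    throw new Error("prefix must be at least 8 characters");
  }
  const [first, ...rest] = recentIds.filter((id) => id.startsWith(idOrPrefix));
  if (first === undefined) throw new Error("no recent message matches that prefix");
  // Assumption: an ambiguous prefix is rejected rather than guessed.
  if (rest.length > 0) throw new Error("prefix matches more than one message");
  return first;
}
```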
When to use topics vs groups vs DM:
- **DM** (`send <peer>`) — 1:1, ephemeral.
- **Group** (`send "@frontend"`) — addresses everyone in a group; ephemeral; for coordinating teams.
- **Topic** (`send "#deploys"`) — durable conversation room; for ongoing work threads, incident channels, build-status feeds.
### `member` — mesh roster + online state (v1.7.0)
Distinct from `peer list`: `member list` shows the static roster (every joined member of a mesh, online or not), while `peer list` shows the live WS-connected sessions plus REST-active humans.
```bash
claudemesh member list # everyone, with status dots
claudemesh member list --online # only online
claudemesh member list --mesh deploys --json
```
Status glyphs: `●` emerald = idle, `●` clay = working, `●` red = dnd, `○` dim = offline. `bot` tag appears on non-human members.
### `notification` — recent @-mentions (v1.7.0)
Server-side write-time fan-out from `mesh.notification` — one row per recipient per matching `@-mention`. Works for both v1 plaintext and v2 ciphertext (clients send the mention list explicitly on v2).
```bash
claudemesh notification list # last 24h, all mentions of you
claudemesh notification list --since 2026-05-01T00:00Z # incremental for polling
claudemesh notification list --json # parseable
```
### Per-topic encryption (v0.3.0 / CLI 1.8.0)
Topics created on or after CLI 1.8.0 generate a 32-byte XSalsa20-Poly1305 symmetric key sealed for each member via `crypto_box`. The broker holds ciphertext only. `topic post` encrypts; `topic tail` decrypts. The `🔒 v2` glyph in tail output marks ciphertext rows. v1 plaintext topics keep working unchanged.
When a new member joins an encrypted topic, they get a 404 from `GET /v1/topics/:name/key` until any holder re-seals for them. `topic tail` runs a 30s background loop that does the re-seal automatically while the tail is open. Otherwise the joiner waits for someone with the key to log in.
### `peer` — read connected peers + admin (kick / ban / verify)
```bash
claudemesh peer list # human-readable (alias: peers)
claudemesh peer list --json # full record
claudemesh peer list --json name,status # field projection
claudemesh peer list --mesh openclaw --json # specific mesh
claudemesh peer kick <peer> # end session, manual rejoin
claudemesh peer disconnect <peer> # soft, peer auto-reconnects
claudemesh peer ban <peer> # kick + revoke membership
claudemesh peer unban <peer>
claudemesh peer bans # list banned members
claudemesh peer verify [peer] # 6×5-digit safety numbers
```
JSON shape (per peer) — **render `role` and `groups` whenever you build a table for the user**, they're the highest-signal fields after `displayName`:
```json
{
"displayName": "Mou",
"pubkey": "abc123...", // session pubkey (rotates per claudemesh launch)
"memberPubkey": "def456...", // stable identity (same across all sibling sessions)
"sessionId": "uuid",
"status": "idle | working | dnd",
"summary": "string or null",
"role": "lead | reviewer | bot | ...", // 1.31.5+: top-level alias of profile.role
"groups": [{ "name": "reviewers", "role": "lead" }],
"profile": {
"role": "lead",
"title": "string or null",
"bio": "string or null",
"avatar": "emoji or null",
"capabilities": ["..."]
},
"peerType": "claude | telegram | ai | human | connector | ...",
"channel": "claude-code | api | ...",
"model": "claude-opus-4-7 | ...",
"cwd": "/path/to/working/dir or null",
"isSelf": true, // peer is one of the caller's own sessions
"isThisSession": false, // peer is the exact session running the cli
"stats": { "messagesIn": 0, "messagesOut": 0, "toolCalls": 0, "errors": 0, "uptime": 1200 }
}
```
**When asked to "list peers" inside a launched session, prefer the human renderer (`claudemesh peer list`, no `--json`) — it already prints role + groups inline next to the name with an explicit `(none)` footer when both are absent. If you do need JSON for parsing, always include `role` and `groups` columns in any rendered table; the user's primary question is usually "who's in what role" and dropping those fields hides the answer.**
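If you do fall back to `--json`, a small formatter along these lines keeps `role` and `groups` in every row. This is a minimal sketch, not the CLI's own renderer: `PeerRow`, `formatPeerRow`, and `formatPeerTable` are hypothetical names, the interface covers only a subset of the record above, and the `@name:role` group notation is an illustrative display choice.

```typescript
// Hypothetical subset of the per-peer JSON record documented above.
interface PeerRow {
  displayName: string;
  status: string;
  role?: string | null;
  groups?: { name: string; role?: string | null }[];
}

// Format one peer as a pipe-separated row. Role and groups are always
// present in the output; "(none)" marks an absent value.
function formatPeerRow(p: PeerRow): string {
  const role = p.role ?? "(none)";
  const groups =
    p.groups && p.groups.length > 0
      ? p.groups
          .map((g) => (g.role ? `@${g.name}:${g.role}` : `@${g.name}`))
          .join(", ")
      : "(none)";
  return `${p.displayName} | ${p.status} | ${role} | ${groups}`;
}

// Build the whole table with a header line.
function formatPeerTable(peers: PeerRow[]): string {
  const header = "name | status | role | groups";
  return [header, ...peers.map(formatPeerRow)].join("\n");
}
```

Feed it the parsed output of `claudemesh peer list --json` and print the result.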
### `message` — send and inspect messages
```bash
# send (alias: claudemesh send <to> <msg>)
claudemesh message send <peer-name|@group|*|pubkey> "message text"
claudemesh message send Mou "hi" # by display name
claudemesh message send "@reviewers" "ready for review"
claudemesh message send "*" "broadcast"
claudemesh message send <p> "..." --priority now # bypass busy gates
claudemesh message send <p> "..." --priority next # default
claudemesh message send <p> "..." --priority low # pull-only
# inbox (alias: claudemesh inbox) — 1.34.0+ reads from inbox.db via daemon IPC
claudemesh inbox # all attached meshes, last 100
claudemesh inbox --mesh <slug> # scoped to one mesh
claudemesh inbox --mesh <slug> --limit 20 # custom cap
claudemesh inbox --json # full row (sender_pubkey, mesh, body, received_at, seen_at, …)
claudemesh inbox --unread # 1.34.8+ only rows whose seen_at IS NULL
# inbox flush + delete — 1.34.7+
claudemesh inbox flush --mesh <slug> # delete all rows on one mesh
claudemesh inbox flush --before <iso-timestamp> # delete rows older than timestamp
claudemesh inbox flush --all # delete every row on every mesh (required guard)
claudemesh inbox delete <id> # delete one inbox row by id (alias: rm)
claudemesh inbox flush --mesh <slug> --json # JSON: { ok: true, removed: N }
# delivery status (alias: claudemesh msg-status <id>)
claudemesh message status <message-id>
claudemesh message status <message-id> --json
```
**Inbox source (1.34.0+):** `claudemesh inbox` queries the daemon's persistent `~/.claudemesh/daemon/inbox.db` over IPC — it is NOT a fresh broker-WS buffer drain. Rows survive daemon restarts. Sender attribution is the actual session pubkey of the launched session that originated the send (NOT the stable member pubkey of the sender's daemon), so two sibling sessions of the same human appear as distinct rows.
**Read-state (1.34.8+):** every inbox row carries a `seen_at` timestamp. `null` = never surfaced; an ISO string = first surfaced at that moment. The flag flips automatically when (a) the row is returned by an interactive `claudemesh inbox` listing, or (b) the MCP server emits a live `<channel>` reminder for it. The launch welcome push uses `unread_only=true` to surface only rows the user hasn't seen — so a session relaunched a day later sees what it actually missed, not the same 24h batch every time. Use `claudemesh inbox --unread` to get the same filter from the CLI.
**Self-echo guard (1.34.8+):** broker fan-out paths sometimes mirror an outbound DM back to the originating session-WS. The daemon now drops those at the WS boundary (matching on `senderPubkey === own.session_pubkey`), so the sender no longer sees their own `claudemesh send` arrive as a `← claudemesh: <self>: ...` channel push immediately after dispatching it.
**Inbox TTL (1.34.8+):** the daemon runs an hourly prune that deletes rows older than 30 days. Without this the inbox grew unbounded; now it self-trims while preserving "I went on holiday and want to see what I missed" recovery for a generous window. No CLI knob — it's a built-in retention policy. To override, manually `claudemesh inbox flush --before <iso>`.
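The read-state filter and the 30-day retention window can be pictured with the sketch below. It is an illustration only: `InboxRow`, `unreadOnly`, and `pruneExpired` are hypothetical names, and measuring age from `received_at` is an assumption, since the text above does not say which timestamp the daemon's prune uses.

```typescript
// Hypothetical subset of an inbox row (field names taken from the --json note above).
interface InboxRow {
  id: string;
  received_at: string; // ISO timestamp
  seen_at: string | null; // null = never surfaced
}

// 30 days, matching the retention window described above.
const RETENTION_MS = 30 * 24 * 60 * 60 * 1000;

// Same filter as `claudemesh inbox --unread`: rows whose seen_at is null.
function unreadOnly(rows: InboxRow[]): InboxRow[] {
  return rows.filter((r) => r.seen_at === null);
}

// Keep rows received within the retention window (assumption: age is
// measured from received_at).
function pruneExpired(rows: InboxRow[], now: Date): InboxRow[] {
  const cutoff = now.getTime() - RETENTION_MS;
  return rows.filter((r) => Date.parse(r.received_at) >= cutoff);
}
```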
`send` JSON output: `{"ok": true, "messageId": "...", "target": "..."}`. Errors: `{"ok": false, "error": "..."}`.
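A caller that shells out to `claudemesh send ... --json` can model the two shapes as a discriminated union. The sketch below assumes the output is exactly one JSON object on stdout; `SendResult` and `parseSendResult` are hypothetical names, not part of the CLI.

```typescript
// The two documented output shapes of `send --json`.
type SendResult =
  | { ok: true; messageId: string; target: string }
  | { ok: false; error: string };

// Parse stdout defensively: anything that is not a well-formed success
// object is reported as a failure.
function parseSendResult(stdout: string): SendResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(stdout);
  } catch {
    return { ok: false, error: "output was not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, error: "unexpected output shape" };
  }
  const obj = parsed as Record<string, unknown>;
  const messageId = obj["messageId"];
  const target = obj["target"];
  if (obj["ok"] === true && typeof messageId === "string" && typeof target === "string") {
    return { ok: true, messageId, target };
  }
  const error = obj["error"];
  return { ok: false, error: typeof error === "string" ? error : "unknown error" };
}
```

Branching on `result.ok` then gives typed access to `messageId` on success and `error` on failure.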
### `state` — shared per-mesh key-value store
```bash
claudemesh state set <key> <value> # value can be JSON or string
claudemesh state get <key>
claudemesh state get <key> --json # includes updatedBy, updatedAt
claudemesh state list
claudemesh state list --json
```
State is broadcast to all peers when changed. Use it for shared scratch space: status flags, current focus, agreed-on values.
### `memory` — recall-able knowledge per mesh
```bash
claudemesh memory remember "fact text" --tags tag1,tag2 # alias: remember
claudemesh memory recall "search query" # alias: recall
claudemesh memory recall "search query" --json
claudemesh memory forget <memory-id> # alias: forget
```
Memories are searchable across the mesh. Use for shared documentation, decisions, lessons learned.
### `task` — typed work-units claim/complete
```bash
claudemesh task create "<title>" --assignee <peer> --priority <p> --tags a,b
claudemesh task list [--status open|claimed|done] [--assignee <peer>] [--json]
claudemesh task claim <task-id>
claudemesh task complete <task-id> [result text]
```
Tasks are exactly-once: claiming is atomic at the broker. Use for work coordination across peers.
### `schedule` — time-based delivery
```bash
# one-shot or recurring (alias: claudemesh remind ...)
claudemesh schedule msg "ping" --in 30m # fires in 30 min
claudemesh schedule msg "ping" --at 15:00 # next 15:00
claudemesh schedule msg "ping" --cron "0 9 * * *" # 9am daily
claudemesh schedule msg "to peer" --to <peer-name>
claudemesh schedule list --json
claudemesh schedule cancel <reminder-id>
# webhook + tool schedules arrive in a later release (broker work pending).
```
### `profile / group` — peer presence
```bash
claudemesh profile # view/edit your profile
claudemesh profile summary "what you're working on" # broadcast (alias: summary)
claudemesh profile status set idle|working|dnd # alias: status set
claudemesh profile visible true|false # alias: visible
claudemesh group join @reviewers --role lead
claudemesh group leave @reviewers
```
### `vector` — embedding store + similarity search
```bash
claudemesh vector store <collection> "<text>" [--metadata '<json>']
claudemesh vector search <collection> "<query>" [--limit N] [--json]
claudemesh vector delete <collection> <id>
claudemesh vector collections # list collection names
```
Search returns `[{id, text, score, metadata}]` ranked by cosine similarity.
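For reference, cosine similarity between two embedding vectors is the dot product divided by the product of their norms. The sketch below shows the standard formula and a descending sort by `score`; it is not the broker's implementation, and `cosineSimilarity` and `rankByScore` are hypothetical names.

```typescript
// Standard cosine similarity: dot(a, b) / (|a| * |b|).
// Returns 0 when either vector has zero length to avoid dividing by zero.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i]!;
    const y = b[i]!;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Sort hits from most to least similar without mutating the input.
function rankByScore<T extends { score: number }>(hits: T[]): T[] {
  return [...hits].sort((p, q) => q.score - p.score);
}
```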
### `graph` — Cypher queries against per-mesh graph
```bash
claudemesh graph query "MATCH (n) RETURN n LIMIT 10" # read
claudemesh graph execute "CREATE (n:Foo {x: 1})" # write
```
Returns rows as `[{...}, ...]`. Queries that return no rows render "(no rows)".
### `context` — share work-context summaries between peers
```bash
claudemesh context share "summary text" --files a.ts,b.ts --findings "x,y" --tags spec,review
claudemesh context get "search query"
claudemesh context list
```
Use to broadcast "what I just did and what I learned" so peers don't duplicate effort.
### `stream` — pub/sub event bus
```bash
claudemesh stream create <name>
claudemesh stream publish <name> '<json-or-text>'
claudemesh stream list
```
For event broadcasting (build-events, deploy-notifications, sensor data). Subscribers receive via push.
### `sql` — typed SQL against per-mesh tables
```bash
claudemesh sql query "SELECT * FROM <table>" # SELECT only
claudemesh sql execute "INSERT INTO ..." # writes
claudemesh sql schema # list tables + columns
```
Returns `{columns, rows, rowCount}` for queries. Each mesh has its own SQL namespace.
### `skill` — discover + manage mesh-published Claude skills
```bash
claudemesh skill list [search-query]
claudemesh skill get <skill-name>
claudemesh skill remove <skill-name>
```
Published skills appear as `/claudemesh:<name>` slash commands across all connected sessions.
### `vault` — encrypted per-mesh secrets
```bash
claudemesh vault list # list keys (values stay encrypted on disk)
claudemesh vault delete <key>
# claudemesh vault set/get currently goes through MCP — needs E2E crypto round-trip
```
### `watch` — URL change watchers
```bash
claudemesh watch list # list active watches
claudemesh watch remove <watch-id>
# Watch creation currently via MCP `mesh_watch` — config-heavy
```
### `webhook` — outbound HTTP triggers
```bash
claudemesh webhook list # list configured webhooks
claudemesh webhook delete <name>
# Webhook creation currently via MCP `create_webhook`
```
### `file` — shared mesh files
```bash
claudemesh file share <path> # upload to mesh (visible to all members)
claudemesh file share <path> --to <peer> # share with one peer (same-host fast path if co-located)
claudemesh file share <path> --to <peer> --message "see line 42"
claudemesh file share <path> --upload # force network upload, skip same-host fast path
claudemesh file get <file-id> # download by id (saves to ./<name>)
claudemesh file get <file-id> --out /tmp/foo.bin # download to explicit path
claudemesh file list [search-query] # browse mesh files
claudemesh file status <file-id> # who has accessed
claudemesh file delete <file-id>
```
**Same-host fast path** (v0.6.0+): when `--to <peer>` resolves to a session
running on the same hostname as you, `claudemesh file share` skips MinIO
entirely and sends a DM with the absolute filepath. The receiver reads it
directly off disk. No 50 MB cap, no upload latency, nothing in the bucket.
Falls back to encrypted upload when the peer is remote, or always when
`--upload` is set. Routes by session pubkey, so sibling sessions of the
same member work without tripping the self-DM guard.
**Network upload cap**: 50 MB. Same-host fast path has no cap.
**`--to` accepts**: display name, member pubkey, session pubkey, or any
≥8-char prefix of a pubkey. Prefer pubkey when multiple peers share a name.
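The resolution rules for `--to` can be sketched as follows. This is an assumption-laden illustration, not the CLI's resolver: `PeerRef`, `Resolution`, and `resolveTarget` are hypothetical names, and treating several hits as "ambiguous" is one possible policy.

```typescript
// Hypothetical subset of a peer record.
interface PeerRef {
  displayName: string;
  pubkey: string; // session pubkey
  memberPubkey: string; // stable member pubkey
}

type Resolution =
  | { kind: "match"; peer: PeerRef }
  | { kind: "ambiguous"; candidates: PeerRef[] }
  | { kind: "none" };

// Match by display name (case-insensitive, exact) or by a hex prefix of
// at least 8 chars against either pubkey.
function resolveTarget(input: string, peers: PeerRef[]): Resolution {
  const needle = input.toLowerCase();
  const isHexPrefix = /^[0-9a-f]{8,64}$/.test(needle);
  const hits = peers.filter(
    (p) =>
      p.displayName.toLowerCase() === needle ||
      (isHexPrefix &&
        (p.pubkey.toLowerCase().startsWith(needle) ||
          p.memberPubkey.toLowerCase().startsWith(needle))),
  );
  if (hits.length === 0) return { kind: "none" };
  if (hits.length === 1) return { kind: "match", peer: hits[0]! };
  return { kind: "ambiguous", candidates: hits };
}
```

Two sibling sessions share a display name and a member pubkey, so only a session pubkey (or a long enough prefix of it) singles one out.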
### `mesh-mcp` — call MCP servers other peers deployed to the mesh
```bash
claudemesh mesh-mcp list # which servers are deployed
claudemesh mesh-mcp call <server> <tool> '<json-args>'
claudemesh mesh-mcp catalog # full catalog with schemas
```
Mesh-deployed MCPs let peer X call a tool that peer Y maintains, without local install.
### `clock` — mesh logical clock
```bash
claudemesh clock # current state
claudemesh clock set <speed> # speed: 0=paused, 1=realtime, 60=60× faster
claudemesh clock pause
claudemesh clock resume
```
Used for simulations / tests that need a controlled time axis shared across peers.
### `mesh` — mesh-level introspection
```bash
claudemesh info --json # mesh overview: peers, groups, state keys, ...
claudemesh stats --json # per-peer activity counters
claudemesh clock --json # mesh logical clock (speed/tick/sim_time)
claudemesh ping --json # diagnostic — ws status, peer count, push buffer
claudemesh peers --mesh X # peers on a specific mesh
```
### `mesh management` — admin ops
```bash
claudemesh list # all your meshes
claudemesh create <name> # create a new mesh
claudemesh share [email] # generate invite link
claudemesh disconnect <peer> # soft disconnect (auto-reconnects)
claudemesh kick <peer> # kick (must rejoin manually)
claudemesh ban <peer> # ban (revoked, can't rejoin)
claudemesh unban <peer>
claudemesh bans # list banned members
claudemesh delete <slug> # delete a mesh
claudemesh rename <slug> <name>
```
### `verify` — safety numbers (Signal-style MITM detection)
```bash
claudemesh verify <peer> # show 6×5-digit fingerprint
claudemesh verify <peer> --json
```
Compare digits with the peer out-of-band (call, in person — not chat). If they match, the channel is not being intercepted.
### `auth` — sign-in
```bash
claudemesh login # browser or paste-token
claudemesh whoami # current identity
claudemesh logout
```
## Common workflows
### "Send a message to peer X with a confirmation"
```bash
result=$(claudemesh send "X" "ping" --json)
echo "$result" | jq -r '.messageId'
```
### "List peers who are currently working"
```bash
claudemesh peers --json name,status | jq '[.[] | select(.status == "working")]'
```
### "Send to all reviewers"
```bash
claudemesh send "@reviewers" "PR ready: <url>"
```
### "Set my summary so peers know what I'm doing"
```bash
claudemesh summary "drafting the auth migration spec"
```
### "Schedule a daily ping at 9am"
```bash
claudemesh remind "morning standup time" --cron "0 9 * * *"
```
### "Check who I'm verified with"
```bash
claudemesh verify <peer-name>
# Compare the 6×5-digit number with peer over voice or in person.
```
## Gotchas
- **`<peer-name>` resolution is case-insensitive but exact-match only.** Don't fuzzy-match. If a peer is named "Mou-2", use that exact string. Use `claudemesh peers --json name` to confirm.
- **`@group` requires the leading `@`.** Without it, claudemesh treats the string as a peer name lookup.
- **`*` means broadcast.** Use carefully — it goes to every peer on the mesh.
- **`--priority now` bypasses busy gates** (peers in DND still receive). Use only for genuine interruptions.
- **`claudemesh launch` writes a per-session config to a tmpdir.** Don't edit `~/.claudemesh/config.json` while a session is running — changes won't take effect until the next launch.
- **The `claudemesh mcp` server registers ZERO tools.** Never search ToolSearch for `mcp__claudemesh__*` — there are none. All operations go through Bash + CLI.
- **Soft-deprecated MCP tools (1.1.x).** If you previously called `mcp__claudemesh__send_message`, use `claudemesh send` via Bash instead. The deprecated tools still work in 1.x but print a stderr warning. They're removed in 2.0.
- **Field aliases in `--json`.** `name` resolves to `displayName`. Other aliases may be added in future versions; check `--json` output to confirm field names.
- **`claudemesh send` to a name that's not online** errors with the list of online peers. Use `claudemesh peers --json` first if uncertain.
- **Pass `--mesh <slug>` whenever the user has joined multiple meshes.** Without it, the CLI either picks the first mesh deterministically or shows an interactive picker (depending on context), so the command may target a mesh you did not intend.
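The addressing rules in the first three gotchas can be summarized in a short sketch. It is illustrative only: `Target`, `classifyTarget`, and `matchPeerName` are hypothetical names, and the real CLI may order or validate these checks differently.

```typescript
type Target =
  | { kind: "broadcast" }
  | { kind: "group"; name: string }
  | { kind: "peer"; name: string };

// "*" is broadcast, a leading "@" is a group, anything else is treated
// as a peer-name lookup.
function classifyTarget(raw: string): Target {
  if (raw === "*") return { kind: "broadcast" };
  if (raw.startsWith("@") && raw.length > 1) {
    return { kind: "group", name: raw.slice(1) };
  }
  return { kind: "peer", name: raw };
}

// Case-insensitive but exact: "mou-2" matches "Mou-2", "mo" matches nothing.
function matchPeerName(input: string, names: string[]): string | undefined {
  const needle = input.toLowerCase();
  return names.find((n) => n.toLowerCase() === needle);
}
```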
## Behavioral conventions
- **Confirm before destructive ops** (`kick`, `ban`, `delete`, `forget`). Show the user what you're about to do.
- **Preview peer-name matches before sending** when the name is ambiguous. `claudemesh peers --json name,pubkey | jq` is the right tool for disambiguation.
- **Don't broadcast (`*`) for trivial messages.** It pings every peer mid-task. Prefer DM or `@group`.
- **Don't poll `inbox`.** Messages are pushed via `<channel source="claudemesh">` automatically. Only call `inbox --json` if you suspect a buffered message is stuck.
- **Echo the messageId in JSON contexts** so the caller can `msg-status` it later.
## Related
- Spec: `.artifacts/specs/2026-05-02-architecture-north-star.md` (architecture rationale)
- Source: `~/Desktop/claudemesh/apps/cli/`
- Broker: `wss://ic.claudemesh.com/ws`
- Dashboard: `https://claudemesh.com/dashboard`


@@ -2,6 +2,43 @@ import { defineCommand, runMain } from "citty";
export interface ParsedArgs { command: string; positionals: string[]; flags: Record<string, string | boolean | undefined>; }
/**
* Flags that NEVER take a value. The parser's default behavior is greedy
* (any `--flag` consumes the next non-`-` arg as its value), which is
* fine for `--mesh foo` and `--priority now` but breaks for booleans:
* `claudemesh send --self <pubkey> "msg"` was eating the pubkey as the
* value of --self, leaving zero positionals and triggering Usage errors.
*
* Adding to this set: any new boolean / no-arg switch.
*/
const BOOLEAN_FLAGS = new Set([
"self",
"json", // also accepts --json=a,b,c form below
"all",
"yes", "y",
"help", "h",
"version", "v",
"quiet",
"strict",
"continue",
"no-daemon",
"no-color",
"debug",
"allow-ci-persistent",
"force",
"dry-run",
"verbose",
"skip-service",
// 1.34.8: `--unread` filters `claudemesh inbox` to rows whose
// seen_at is NULL. No value — pure switch.
"unread",
// 1.34.12: `--foreground` keeps `claudemesh daemon up` attached
// to the terminal (pre-1.34.12 behavior). Default is detached now.
"foreground",
"no-tcp",
"public-health",
]);
export function parseArgv(argv: string[]): ParsedArgs {
const args = argv.slice(2);
const flags: Record<string, string | boolean | undefined> = {};
@@ -10,14 +47,26 @@ export function parseArgv(argv: string[]): ParsedArgs {
for (let i = 0; i < args.length; i++) {
const arg = args[i]!;
// --flag=value (always parsed as a value, regardless of boolean set)
if (arg.startsWith("--") && arg.includes("=")) {
const eq = arg.indexOf("=");
const key = arg.slice(2, eq);
flags[key] = arg.slice(eq + 1);
continue;
}
if (arg.startsWith("--")) {
const key = arg.slice(2);
// Known boolean → never consume the next token as a value.
if (BOOLEAN_FLAGS.has(key)) { flags[key] = true; continue; }
const next = args[i + 1];
if (next !== undefined && !next.startsWith("-")) { flags[key] = next; i++; }
else flags[key] = true;
} else if (arg.startsWith("-") && arg.length === 2) {
const key = arg.slice(1);
if (BOOLEAN_FLAGS.has(key)) { flags[key] = true; continue; }
const next = args[i + 1];
if (next !== undefined && !next.startsWith("-")) { flags[key] = next; i++; }
else flags[key] = true;
} else if (!command) {
command = arg;
} else {
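
The effect of the boolean-flag set in the hunk above can be reproduced in isolation. The following is a condensed, self-contained sketch rather than the file's actual code: `KNOWN_BOOLEANS` and `parseFlags` are hypothetical names, only long `--flag` forms are handled, and command extraction is left out.

```typescript
// Hypothetical, shortened stand-in for BOOLEAN_FLAGS.
const KNOWN_BOOLEANS = new Set(["self", "json", "all"]);

interface Parsed {
  flags: Record<string, string | boolean>;
  positionals: string[];
}

function parseFlags(args: string[]): Parsed {
  const flags: Record<string, string | boolean> = {};
  const positionals: string[] = [];
  for (let i = 0; i < args.length; i++) {
    const arg = args[i]!;
    if (!arg.startsWith("--")) {
      positionals.push(arg);
      continue;
    }
    const body = arg.slice(2);
    const eq = body.indexOf("=");
    // --flag=value always carries a value, even for known booleans.
    if (eq !== -1) {
      flags[body.slice(0, eq)] = body.slice(eq + 1);
      continue;
    }
    // Known boolean: never consume the next token.
    if (KNOWN_BOOLEANS.has(body)) {
      flags[body] = true;
      continue;
    }
    // Otherwise greedy: take the next token unless it looks like a flag.
    const next = args[i + 1];
    if (next !== undefined && !next.startsWith("-")) {
      flags[body] = next;
      i++;
    } else {
      flags[body] = true;
    }
  }
  return { flags, positionals };
}
```

With `self` in the set, `parseFlags(["send", "--self", "abc", "msg"])` keeps `abc` and `msg` as positionals instead of swallowing `abc` as the value of `--self`.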


@@ -0,0 +1,170 @@
/**
* Translate the parsed CLI invocation (command + positionals) into the
* (resource, verb, isWrite) shape that the policy engine evaluates.
*
* Returns `null` for commands that are not subject to policy gating:
* - local-only ops (help, version, list, doctor, sync, completions)
* - auth (login, logout, whoami, register)
* - setup (install, uninstall, url-handler, status-line, backup, restore)
* - launch / connect (no broker mutation by themselves)
*
* Spec: .artifacts/specs/2026-05-02-architecture-north-star.md commitment #7.
*/
export interface InvocationClass {
resource: string;
verb: string;
isWrite: boolean;
}
/** Commands the policy engine never evaluates. Local or auth-only. */
const SKIP = new Set([
"", "help", "version",
"login", "register", "logout", "whoami",
"install", "uninstall", "doctor", "sync", "completions", "url-handler", "status-line",
"backup", "restore", "upgrade", "update",
"list", "ls", // local mesh list
"launch", "connect", // launches Claude — no broker write
"status", // broker connectivity diagnostic
"test", "mcp", "hook", "seed-test-mesh",
"disconnect", // duplicate alias only — top-level "disconnect" prints message
]);
/** Verbs that mutate broker state (used by --approval-mode plan / read-only). */
const WRITE_VERBS = new Set([
"create", "send", "remember", "forget", "remind", "schedule", "summary",
"visible", "join", "leave", "kick", "ban", "unban", "disconnect", "delete",
"rename", "share", "invite", "store", "publish", "execute", "set", "remove",
"pause", "resume", "claim", "complete", "grant", "revoke", "block", "call",
]);
function isWrite(verb: string): boolean {
return WRITE_VERBS.has(verb);
}
/**
* Map (command, positionals) → invocation classification.
* The mapping mirrors the resource/verb namespace used in DEFAULT_POLICY so a
* `peer kick` rule actually matches both `peer kick` and the legacy `kick`.
*/
export function classifyInvocation(command: string, positionals: string[]): InvocationClass | null {
if (SKIP.has(command)) return null;
const sub = positionals[0] ?? "";
// ── Resource-form commands ───────────────────────────────────────────────
switch (command) {
case "peer": {
const verb = sub || "list";
return { resource: "peer", verb, isWrite: isWrite(verb) };
}
case "message": {
const verb = sub || "inbox";
return { resource: "message", verb, isWrite: isWrite(verb) };
}
case "memory": {
const verb = sub || "recall";
return { resource: "memory", verb, isWrite: isWrite(verb) };
}
case "profile": {
// `profile` (no sub) is read; `profile summary/visible/status set` are writes.
if (!sub) return { resource: "profile", verb: "view", isWrite: false };
if (sub === "status") {
return positionals[1] === "set"
? { resource: "profile", verb: "status", isWrite: true }
: { resource: "profile", verb: "view", isWrite: false };
}
return { resource: "profile", verb: sub, isWrite: true };
}
case "schedule": {
const verb = sub || "list";
return { resource: "schedule", verb, isWrite: verb !== "list" };
}
case "group": {
return { resource: "group", verb: sub || "list", isWrite: sub === "join" || sub === "leave" };
}
case "task": {
return { resource: "task", verb: sub || "list", isWrite: isWrite(sub) };
}
case "topic": {
// topic verbs: create | list | join | leave | members | history | read
// writes: create, join, leave; reads: list, members, history, read
const verb = sub || "list";
const writeVerbs = new Set(["create", "join", "leave"]);
return { resource: "topic", verb, isWrite: writeVerbs.has(verb) };
}
case "apikey": case "api-key": {
// apikey verbs: create | list | revoke. create issues a credential —
// strongly destructive in security terms; revoke is also a write.
const verb = sub || "list";
const writeVerbs = new Set(["create", "revoke"]);
return { resource: "apikey", verb, isWrite: writeVerbs.has(verb) };
}
case "bridge": {
// bridge verbs: run (long-lived forwarder) | init (print template).
// `run` is a write at the mesh level since it joins both meshes
// and posts messages on their topics.
const verb = sub || "init";
return { resource: "bridge", verb, isWrite: verb === "run" };
}
// Platform — sub is the verb.
case "vector": case "graph": case "context": case "stream":
case "sql": case "skill": case "vault": case "watch":
case "webhook": case "file": case "mesh-mcp": case "clock": {
const verb = sub || "list";
return { resource: command, verb, isWrite: isWrite(verb) };
}
case "state": {
const verb = sub === "set" ? "set" : sub === "list" ? "list" : "get";
return { resource: "state", verb, isWrite: verb === "set" };
}
}
// ── Legacy / flat verb form ──────────────────────────────────────────────
switch (command) {
// Mesh management
case "create": case "new": return { resource: "mesh", verb: "create", isWrite: true };
case "join": case "add": return { resource: "mesh", verb: "join", isWrite: true };
case "delete": case "rm": return { resource: "mesh", verb: "delete", isWrite: true };
case "rename": return { resource: "mesh", verb: "rename", isWrite: true };
case "share": case "invite": return { resource: "mesh", verb: "share", isWrite: true };
case "info": return { resource: "mesh", verb: "info", isWrite: false };
// Peer ops (legacy verbs)
case "peers": return { resource: "peer", verb: "list", isWrite: false };
case "kick": return { resource: "peer", verb: "kick", isWrite: true };
case "ban": return { resource: "peer", verb: "ban", isWrite: true };
case "unban": return { resource: "peer", verb: "unban", isWrite: true };
case "bans": return { resource: "peer", verb: "bans", isWrite: false };
case "verify": return { resource: "peer", verb: "verify", isWrite: false };
// Messaging
case "send": return { resource: "message", verb: "send", isWrite: true };
case "inbox": return { resource: "message", verb: "inbox", isWrite: false };
case "msg-status": return { resource: "message", verb: "status", isWrite: false };
// Memory
case "remember": return { resource: "memory", verb: "remember", isWrite: true };
case "recall": return { resource: "memory", verb: "recall", isWrite: false };
case "forget": return { resource: "memory", verb: "forget", isWrite: true };
case "remind": return { resource: "schedule", verb: "msg", isWrite: true };
// Presence
case "summary": return { resource: "profile", verb: "summary", isWrite: true };
case "visible": return { resource: "profile", verb: "visible", isWrite: true };
// Diagnostics
case "stats": return { resource: "mesh", verb: "stats", isWrite: false };
case "ping": return { resource: "mesh", verb: "ping", isWrite: false };
// Security
case "grant": return { resource: "grant", verb: "grant", isWrite: true };
case "revoke": return { resource: "grant", verb: "revoke", isWrite: true };
case "block": return { resource: "grant", verb: "block", isWrite: true };
case "grants": return { resource: "grant", verb: "list", isWrite: false };
}
// Unknown command — let the dispatcher's default branch handle it.
return null;
}
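
A policy check that consumes this classification might look like the sketch below. It is a guess at the shape, not the policy engine from the spec: `Decision` and `gateReadOnly` are hypothetical names, and only a read-only mode that rejects writes is modeled, since the file does not show how `plan` mode behaves.

```typescript
// Same shape as the InvocationClass interface above, repeated so the
// sketch is self-contained.
interface InvocationClass {
  resource: string;
  verb: string;
  isWrite: boolean;
}

type Decision = { allow: true } | { allow: false; reason: string };

// null means the command is not subject to policy gating, so it passes.
// In read-only mode, any invocation classified as a write is rejected.
function gateReadOnly(inv: InvocationClass | null, readOnly: boolean): Decision {
  if (inv === null) return { allow: true };
  if (readOnly && inv.isWrite) {
    return {
      allow: false,
      reason: `${inv.resource} ${inv.verb} mutates broker state`,
    };
  }
  return { allow: true };
}
```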


@@ -0,0 +1,198 @@
/**
* Argument validators — fail loud at the boundary, with specific reasons.
*
* Each validator returns a discriminated `ValidationResult` so callers can
* branch cleanly between "shape is wrong" (INVALID_ARGS exit) vs "value
* is well-shaped, do the lookup" (proceed). Hints (`reason`, `expected`,
* `nearest`) drive the three-tier error message contract:
*
* 1. WHAT'S WRONG — the failed assertion.
* 2. WHAT WOULD BE VALID — the canonical shape.
* 3. CLOSEST VALID ALTERNATIVE — best-effort suggestion.
*
* Use these instead of throwing strings or returning `null` for malformed
* input. They make argument errors structurally distinct from "thing
* doesn't exist" errors, which today's CLI conflates.
*/
export type ValidationResult<T = string> =
| { ok: true; value: T }
| { ok: false; code: string; reason: string; expected?: string };
const HEX_RE = /^[0-9a-f]+$/i;
const BASE62_RE = /^[A-Za-z0-9]+$/;
const SLUG_RE = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;
/**
* 64-char hex peer pubkey (member or session).
* Input is case-insensitive; the returned value is normalized to lowercase.
*/
export function validatePubkey(input: string | undefined): ValidationResult {
if (!input) {
return {
ok: false,
code: "missing",
reason: "pubkey is required",
expected: "64 lowercase hex chars",
};
}
if (input.length !== 64) {
return {
ok: false,
code: "wrong_length",
reason: `pubkey is ${input.length} chars, expected 64`,
expected: "64 lowercase hex chars (try `claudemesh peer list --json`)",
};
}
if (!HEX_RE.test(input)) {
return {
ok: false,
code: "non_hex",
reason: "pubkey contains non-hex characters",
expected: "characters [0-9a-f] only",
};
}
return { ok: true, value: input.toLowerCase() };
}
/**
* Hex pubkey *prefix* — used for short-form references. Min 8 chars
* to keep collisions vanishingly rare on a per-mesh roster, max 64.
*/
export function validatePubkeyPrefix(
input: string | undefined,
{ min = 8 }: { min?: number } = {},
): ValidationResult {
if (!input) {
return {
ok: false,
code: "missing",
reason: "pubkey prefix is required",
expected: `${min}-64 lowercase hex chars`,
};
}
if (input.length < min) {
return {
ok: false,
code: "too_short",
reason: `prefix is ${input.length} chars, needs ≥${min}`,
expected: `${min}+ hex chars (full pubkey is 64)`,
};
}
if (input.length > 64) {
return {
ok: false,
code: "too_long",
reason: `prefix is ${input.length} chars, max 64`,
expected: "drop trailing characters",
};
}
if (!HEX_RE.test(input)) {
return {
ok: false,
code: "non_hex",
reason: "prefix contains non-hex characters",
expected: "characters [0-9a-f] only",
};
}
return { ok: true, value: input.toLowerCase() };
}
/**
* Message id — base62, 32 chars exact, OR a prefix of ≥8 chars.
* Returns `{ value, isPrefix }` so callers can decide whether to
* resolve via lookup or treat as full id.
*/
export function validateMessageId(
input: string | undefined,
): ValidationResult<{ value: string; isPrefix: boolean }> {
if (!input) {
return {
ok: false,
code: "missing",
reason: "message id is required",
expected: "32-char base62 id, or ≥8-char prefix",
};
}
if (input.length < 8) {
return {
ok: false,
code: "too_short",
reason: `id is ${input.length} chars, needs ≥8`,
expected: "8+ chars (paste from a previous send/post output)",
};
}
if (input.length > 32) {
return {
ok: false,
code: "too_long",
reason: `id is ${input.length} chars, max 32`,
expected: "trim trailing characters",
};
}
if (!BASE62_RE.test(input)) {
return {
ok: false,
code: "bad_charset",
reason: "id contains characters outside [A-Za-z0-9]",
expected: "base62 only",
};
}
return { ok: true, value: { value: input, isPrefix: input.length < 32 } };
}
/**
* Mesh slug — kebab-case, lowercase, 2-64 chars.
*/
export function validateMeshSlug(input: string | undefined): ValidationResult {
if (!input) {
return {
ok: false,
code: "missing",
reason: "mesh slug is required",
expected: "kebab-case slug (e.g. `openclaw`)",
};
}
if (input.length < 2 || input.length > 64) {
return {
ok: false,
code: "wrong_length",
reason: `slug is ${input.length} chars, expected 2-64`,
expected: "lowercase kebab-case",
};
}
if (!SLUG_RE.test(input)) {
return {
ok: false,
code: "bad_format",
reason: "slug must be lowercase letters, digits, and hyphens (no leading/trailing hyphen)",
expected: "e.g. `team-alpha`, `flexicar-2`",
};
}
return { ok: true, value: input };
}
/**
* Render a structured validation error to stderr in the canonical
* three-line shape: `✘ <verb> <input>` / ` <reason>` / ` <expected>`.
*
* Optional fourth line for `nearest` when a fuzzy suggestion is available.
*/
export function renderValidationError(
args: {
verb: string;
input: string;
result: Extract<ValidationResult, { ok: false }>;
nearest?: string;
},
write: (s: string) => void = (s) => process.stderr.write(s),
): void {
write(` \x1b[31m✘\x1b[0m ${args.verb} ${args.input}\n`);
write(` ${args.result.reason}.\n`);
if (args.result.expected) {
write(` expected: ${args.result.expected}\n`);
}
if (args.nearest) {
write(` did you mean: \x1b[36m${args.nearest}\x1b[0m\n`);
}
}
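
A caller can branch on the discriminated result and pass its own `write` sink, which also makes the error path easy to test. The sketch below is self-contained and hedged: `checkSlug` is a condensed stand-in for `validateMeshSlug`, `handleMeshArg` is a hypothetical caller, and the exit code value `2` is an assumption because the real `EXIT.INVALID_ARGS` value is not shown here.

```typescript
type ValidationResult<T = string> =
  | { ok: true; value: T }
  | { ok: false; code: string; reason: string; expected?: string };

// Assumed value; the real constant lives in ~/constants/exit-codes.js.
const EXIT_INVALID_ARGS = 2;
const EXIT_SUCCESS = 0;

// Condensed stand-in for validateMeshSlug.
function checkSlug(input: string | undefined): ValidationResult {
  if (!input) {
    return { ok: false, code: "missing", reason: "mesh slug is required", expected: "kebab-case slug" };
  }
  if (!/^[a-z0-9]+(?:-[a-z0-9]+)*$/.test(input)) {
    return { ok: false, code: "bad_format", reason: "slug must be lowercase kebab-case" };
  }
  return { ok: true, value: input };
}

// Hypothetical caller: shape errors exit early, valid input proceeds.
function handleMeshArg(input: string | undefined, write: (s: string) => void): number {
  const result = checkSlug(input);
  if (!result.ok) {
    write(`${result.reason}\n`);
    if (result.expected) write(`expected: ${result.expected}\n`);
    return EXIT_INVALID_ARGS;
  }
  // result.value is narrowed to string here.
  write(`using mesh ${result.value}\n`);
  return EXIT_SUCCESS;
}
```

Injecting `write` mirrors the second parameter of `renderValidationError`, so tests can capture output in an array instead of touching stderr.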


@@ -0,0 +1,132 @@
/**
* `claudemesh apikey <verb>` — manage REST + external WS bearer tokens.
*
* The plaintext secret is shown ONCE on creation and never returned
* again — there's no recovery, only revoke + re-issue. Capabilities
* (send/read/state_write/admin) and topic scopes constrain what the key
* can do; a CI bot key with `--cap send,read --topic deploys` can only
* post and read on `#deploys`, never the whole mesh.
*
* Spec: .artifacts/specs/2026-05-02-v0.2.0-scope.md
*/
import { withMesh } from "./connect.js";
import { render } from "~/ui/render.js";
import { bold, clay, dim, green, red, yellow } from "~/ui/styles.js";
import { EXIT } from "~/constants/exit-codes.js";
type Capability = "send" | "read" | "state_write" | "admin";
export interface ApiKeyFlags {
mesh?: string;
json?: boolean;
/** Comma-separated capabilities: send,read,state_write,admin */
cap?: string;
/** Comma-separated topic names (without #) — empty = all topics */
topic?: string;
/** ISO 8601 expiry timestamp */
expires?: string;
}
function parseCapabilities(raw?: string): Capability[] {
if (!raw) return ["send", "read"]; // sensible default
const parts = raw.split(",").map((s) => s.trim()).filter(Boolean);
const valid = new Set<Capability>(["send", "read", "state_write", "admin"]);
return parts.filter((p): p is Capability => valid.has(p as Capability));
}
export async function runApiKeyCreate(label: string, flags: ApiKeyFlags): Promise<number> {
if (!label) {
render.err("Usage: claudemesh apikey create <label> [--cap send,read] [--topic deploys]");
return EXIT.INVALID_ARGS;
}
const caps = parseCapabilities(flags.cap);
if (caps.length === 0) {
render.err("at least one capability required: --cap send,read,state_write,admin");
return EXIT.INVALID_ARGS;
}
const topicScopes = flags.topic
? flags.topic.split(",").map((s) => s.trim()).filter(Boolean)
: undefined;
return await withMesh({ meshSlug: flags.mesh ?? null }, async (client) => {
const result = await client.apiKeyCreate({
label,
capabilities: caps,
topicScopes,
expiresAt: flags.expires,
});
if (!result) {
render.err("apikey create failed");
return EXIT.INTERNAL_ERROR;
}
if (flags.json) {
console.log(JSON.stringify(result, null, 2));
return EXIT.SUCCESS;
}
render.ok("created", `${bold(result.label)} ${dim(result.id.slice(0, 8))}`);
process.stdout.write(`\n ${yellow("⚠ secret shown once — copy it now:")}\n\n`);
process.stdout.write(` ${green(result.secret)}\n\n`);
process.stdout.write(` ${dim(`capabilities: ${result.capabilities.join(", ")}`)}\n`);
if (result.topicScopes?.length) {
process.stdout.write(` ${dim(`topics: ${result.topicScopes.map((t) => "#" + t).join(", ")}`)}\n`);
} else {
process.stdout.write(` ${dim("topics: all (no scope)")}\n`);
}
return EXIT.SUCCESS;
});
}
export async function runApiKeyList(flags: ApiKeyFlags): Promise<number> {
return await withMesh({ meshSlug: flags.mesh ?? null }, async (client) => {
const keys = await client.apiKeyList();
if (flags.json) {
console.log(JSON.stringify(keys, null, 2));
return EXIT.SUCCESS;
}
if (keys.length === 0) {
render.info(dim("no api keys in this mesh."));
return EXIT.SUCCESS;
}
render.section(`api keys (${keys.length})`);
for (const k of keys) {
const status = k.revokedAt
? red("revoked")
: k.expiresAt && new Date(k.expiresAt) < new Date()
? yellow("expired")
: green("active");
const lastUsed = k.lastUsedAt ? new Date(k.lastUsedAt).toLocaleDateString() : "never";
const scope = k.topicScopes?.length ? k.topicScopes.map((t) => "#" + t).join(",") : "all topics";
process.stdout.write(` ${bold(k.label)} ${status} ${dim(k.id.slice(0, 8))}\n`);
process.stdout.write(` ${dim(`${k.prefix}… caps: ${k.capabilities.join(",")} scope: ${scope} last_used: ${lastUsed}`)}\n`);
}
return EXIT.SUCCESS;
});
}
export async function runApiKeyRevoke(id: string, flags: ApiKeyFlags): Promise<number> {
if (!id) {
render.err("Usage: claudemesh apikey revoke <id>");
return EXIT.INVALID_ARGS;
}
return await withMesh({ meshSlug: flags.mesh ?? null }, async (client) => {
const result = await client.apiKeyRevoke(id);
if (!result.ok) {
if (flags.json) {
console.log(JSON.stringify({ ok: false, code: result.code, message: result.message }));
} else {
render.err(`${result.code}: ${result.message}`);
}
return result.code === "not_found"
? EXIT.NOT_FOUND
: result.code === "not_unique"
? EXIT.INVALID_ARGS
: EXIT.INTERNAL_ERROR;
}
if (flags.json) console.log(JSON.stringify({ revoked: result.id }));
else render.ok("revoked", clay(result.id.slice(0, 8)));
return EXIT.SUCCESS;
});
}
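
Review note: `parseCapabilities` filters out names it does not recognise, so `--cap sned,read` quietly becomes a `read`-only key; the empty-list check in `runApiKeyCreate` only fires when every name is invalid. A stricter variant could report the unknown names instead. The sketch below is one possible shape, not the shipped behaviour; `parseCapabilitiesStrict` and `CapParse` are hypothetical names.

```typescript
type Capability = "send" | "read" | "state_write" | "admin";

const VALID_CAPS: ReadonlySet<string> = new Set(["send", "read", "state_write", "admin"]);

type CapParse =
  | { ok: true; caps: Capability[] }
  | { ok: false; unknown: string[] };

/** Hypothetical strict variant: reports unknown names instead of dropping them. */
function parseCapabilitiesStrict(raw?: string): CapParse {
  if (!raw) return { ok: true, caps: ["send", "read"] }; // same default as the original
  const parts = raw.split(",").map((s) => s.trim()).filter(Boolean);
  const unknown = parts.filter((p) => !VALID_CAPS.has(p));
  if (unknown.length > 0) return { ok: false, unknown };
  // De-duplicate while preserving first-seen order.
  const caps = Array.from(new Set(parts)) as Capability[];
  return { ok: true, caps };
}
```

A caller could turn the `unknown` list into an `EXIT.INVALID_ARGS` error that names the offending capability, which matters here because a typo otherwise narrows the key's permissions without any warning.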


@@ -0,0 +1,80 @@
/**
* `claudemesh ban <peer>` — kick + permanently revoke member (can't reconnect)
* `claudemesh unban <peer>` — clear revocation, peer can rejoin
* `claudemesh bans` — list banned members
*/
import { withMesh } from "./connect.js";
import { readConfig } from "~/services/config/facade.js";
import { render } from "~/ui/render.js";
import { EXIT } from "~/constants/exit-codes.js";
export async function runBan(
target: string | undefined,
opts: { mesh?: string } = {},
): Promise<number> {
if (!target) { render.err("Usage: claudemesh ban <peer-name-or-pubkey>"); return EXIT.INVALID_ARGS; }
const config = readConfig();
const meshSlug = opts.mesh ?? config.meshes[0]?.slug;
if (!meshSlug) { render.err("No mesh joined."); return EXIT.NOT_FOUND; }
return await withMesh({ meshSlug }, async (client) => {
const result = await client.sendAndWait({ type: "ban", target }) as { banned?: string; error?: string; message?: string; code?: string };
if (result?.banned) {
render.ok(`Banned ${result.banned} from ${meshSlug}. They cannot reconnect until unbanned.`);
render.hint(`Undo: claudemesh unban ${result.banned} --mesh ${meshSlug}`);
} else {
render.err(result?.message ?? result?.error ?? result?.code ?? "ban failed");
}
return result?.banned ? EXIT.SUCCESS : EXIT.INTERNAL_ERROR;
});
}
export async function runUnban(
target: string | undefined,
opts: { mesh?: string } = {},
): Promise<number> {
if (!target) { render.err("Usage: claudemesh unban <peer-name-or-pubkey>"); return EXIT.INVALID_ARGS; }
const config = readConfig();
const meshSlug = opts.mesh ?? config.meshes[0]?.slug;
if (!meshSlug) { render.err("No mesh joined."); return EXIT.NOT_FOUND; }
return await withMesh({ meshSlug }, async (client) => {
const result = await client.sendAndWait({ type: "unban", target }) as { unbanned?: string; error?: string; message?: string; code?: string };
if (result?.unbanned) {
render.ok(`Unbanned ${result.unbanned} from ${meshSlug}. They can rejoin.`);
} else {
render.err(result?.message ?? result?.error ?? result?.code ?? "unban failed");
}
return result?.unbanned ? EXIT.SUCCESS : EXIT.INTERNAL_ERROR;
});
}
export async function runBans(
opts: { mesh?: string; json?: boolean } = {},
): Promise<number> {
const config = readConfig();
const meshSlug = opts.mesh ?? config.meshes[0]?.slug;
if (!meshSlug) { render.err("No mesh joined."); return EXIT.NOT_FOUND; }
return await withMesh({ meshSlug }, async (client) => {
const result = await client.sendAndWait({ type: "list_bans" }) as { bans?: Array<{ name: string; pubkey: string; revokedAt: string }> };
const bans = result?.bans ?? [];
if (opts.json) {
process.stdout.write(JSON.stringify(bans, null, 2) + "\n");
return EXIT.SUCCESS;
}
if (bans.length === 0) {
render.info("No banned members.");
return EXIT.SUCCESS;
}
render.section(`banned members on ${meshSlug}`);
for (const b of bans) {
render.kv([[b.name, `${b.pubkey.slice(0, 16)}… · banned ${new Date(b.revokedAt).toLocaleDateString()}`]]);
}
return EXIT.SUCCESS;
});
}


@@ -0,0 +1,213 @@
/**
* `claudemesh bridge run <config.yaml>` — long-lived process that joins
* two meshes and forwards a single topic between them.
*
* The CLI has no static import of @claudemesh/sdk, which avoids a
* workspace coupling at publish time; `runBridge` lazy-loads the SDK's
* Bridge with a dynamic import instead. The bridge config file specifies
* broker URLs, mesh ids, memberships (private keys), and the topic name
* on each side.
*
* Spec: .artifacts/specs/2026-05-02-v0.2.0-scope.md
*/
import { readFileSync, existsSync } from "node:fs";
import { render } from "~/ui/render.js";
import { bold, clay, dim, green, red, yellow } from "~/ui/styles.js";
import { EXIT } from "~/constants/exit-codes.js";
interface BridgeConfigSide {
broker_url: string;
mesh_id: string;
member_id: string;
/** Hex-encoded ed25519 public key. */
pubkey: string;
/** Hex-encoded ed25519 secret key (64 bytes). */
secret_key: string;
topic: string;
display_name?: string;
role?: "lead" | "member" | "observer";
}
interface BridgeConfig {
a: BridgeConfigSide;
b: BridgeConfigSide;
max_hops?: number;
}
/** Tiny YAML parser that handles only the flat shape `bridge run` accepts.
 * For complex configs, callers can pass JSON instead (detected by a
 * leading `{`, not by file extension). */
function parseConfig(text: string): BridgeConfig {
const trimmed = text.trim();
if (trimmed.startsWith("{")) return JSON.parse(trimmed) as BridgeConfig;
const root: Record<string, Record<string, unknown> | number> = {};
let cursor: Record<string, unknown> | null = null;
for (const raw of text.split("\n")) {
const line = raw.replace(/#.*$/, "").trimEnd();
if (!line.trim()) continue;
const top = line.match(/^(a|b)\s*:\s*$/);
if (top) {
cursor = {};
root[top[1]!] = cursor;
continue;
}
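// Allow leading whitespace: `line` is only right-trimmed, so the indented
// keys under `a:` / `b:` would never match a pattern anchored at `^(\w+)`.
const flat = line.match(/^\s*(\w+)\s*:\s*(.+)$/);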
if (flat && /^\s/.test(line) && cursor) {
cursor[flat[1]!] = parseScalar(flat[2]!);
} else if (flat) {
const v = parseScalar(flat[2]!);
// top-level scalars (e.g. max_hops) — only number/string supported
if (typeof v === "number") root[flat[1]!] = v;
}
}
return root as unknown as BridgeConfig;
}
function parseScalar(raw: string): string | number | boolean {
const v = raw.trim().replace(/^["'](.*)["']$/, "$1");
if (v === "true") return true;
if (v === "false") return false;
if (/^-?\d+(\.\d+)?$/.test(v)) return Number(v);
return v;
}
export async function runBridge(configPath: string): Promise<number> {
if (!configPath) {
render.err("Usage: claudemesh bridge run <config.yaml>");
return EXIT.INVALID_ARGS;
}
if (!existsSync(configPath)) {
render.err(`config file not found: ${configPath}`);
return EXIT.NOT_FOUND;
}
let cfg: BridgeConfig;
try {
cfg = parseConfig(readFileSync(configPath, "utf-8"));
} catch (e) {
render.err(`failed to parse ${configPath}: ${e instanceof Error ? e.message : String(e)}`);
return EXIT.INVALID_ARGS;
}
if (!cfg.a || !cfg.b) {
render.err("config must define 'a:' and 'b:' sections");
return EXIT.INVALID_ARGS;
}
for (const [name, side] of [["a", cfg.a], ["b", cfg.b]] as const) {
if (!side.broker_url || !side.mesh_id || !side.member_id || !side.pubkey || !side.secret_key || !side.topic) {
render.err(`config side '${name}' missing required fields: broker_url, mesh_id, member_id, pubkey, secret_key, topic`);
return EXIT.INVALID_ARGS;
}
}
// Lazy-load SDK so the CLI bundle stays trim for users who never
// bridge.
const { Bridge } = await import("@claudemesh/sdk");
const bridge = new Bridge({
a: {
client: {
brokerUrl: cfg.a.broker_url,
meshId: cfg.a.mesh_id,
memberId: cfg.a.member_id,
pubkey: cfg.a.pubkey,
secretKey: cfg.a.secret_key,
displayName: cfg.a.display_name ?? "bridge",
peerType: "connector",
channel: "bridge",
},
topic: cfg.a.topic,
role: cfg.a.role,
},
b: {
client: {
brokerUrl: cfg.b.broker_url,
meshId: cfg.b.mesh_id,
memberId: cfg.b.member_id,
pubkey: cfg.b.pubkey,
secretKey: cfg.b.secret_key,
displayName: cfg.b.display_name ?? "bridge",
peerType: "connector",
channel: "bridge",
},
topic: cfg.b.topic,
role: cfg.b.role,
},
maxHops: cfg.max_hops,
});
bridge.on("forwarded", (e) => {
process.stdout.write(
`${dim(new Date().toISOString())} ${green("→")} ${e.from} → ${e.to} hop=${e.hop} ${dim(`${e.bytes}b`)}\n`,
);
});
bridge.on("dropped", (e) => {
process.stdout.write(
`${dim(new Date().toISOString())} ${yellow("·")} drop from=${e.from} reason=${e.reason}${e.hop >= 0 ? ` hop=${e.hop}` : ""}\n`,
);
});
bridge.on("error", (e) => {
process.stderr.write(`${red("✘")} ${e.message}\n`);
});
try {
await bridge.start();
} catch (e) {
render.err(`bridge failed to start: ${e instanceof Error ? e.message : String(e)}`);
return EXIT.NETWORK_ERROR;
}
render.ok(
"bridge running",
`${clay("#" + cfg.a.topic)} ${dim("⟷")} ${clay("#" + cfg.b.topic)}`,
);
process.stderr.write(`${dim(` meshes: ${cfg.a.mesh_id.slice(0, 8)} ⟷ ${cfg.b.mesh_id.slice(0, 8)} max_hops: ${cfg.max_hops ?? 2}`)}\n`);
process.stderr.write(`${dim(" Ctrl-C to stop.")}\n\n`);
// Keep the process alive; bridge runs forever.
await new Promise<void>((resolve) => {
const stop = async (): Promise<void> => {
process.stderr.write(`\n${dim("stopping bridge...")}\n`);
await bridge.stop();
resolve();
};
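// `once`, so a second Ctrl-C does not re-enter stop() while bridge.stop()
// is still pending; it falls through to the default signal behavior.
process.once("SIGINT", stop);
process.once("SIGTERM", stop);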
});
return EXIT.SUCCESS;
}
/** Generate a config skeleton for the user to fill in. */
export function bridgeConfigTemplate(): string {
return `# claudemesh bridge config
# Spec: .artifacts/specs/2026-05-02-v0.2.0-scope.md
#
# A bridge holds memberships in two meshes and forwards messages on a
# single topic between them. Loop prevention via plaintext hop counter
# (visible in message body — minor wart, fixed in v0.3.0).
#
# Tip: \`claudemesh peer verify\` shows the keys/ids you need below.
max_hops: 2
a:
  broker_url: wss://ic.claudemesh.com/ws
  mesh_id: <mesh A id>
  member_id: <bridge member id in mesh A>
  pubkey: <ed25519 public key hex, 32 bytes>
  secret_key: <ed25519 secret key hex, 64 bytes>
  topic: incidents
  display_name: bridge
  role: member
b:
  broker_url: wss://ic.claudemesh.com/ws
  mesh_id: <mesh B id>
  member_id: <bridge member id in mesh B>
  pubkey: <ed25519 public key hex>
  secret_key: <ed25519 secret key hex>
  topic: incidents
`;
}
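
Review note: the flat YAML shape that `bridge run` accepts can be exercised in isolation. The sketch below is a standalone approximation of the parsing approach, not the shipped parser: `parseFlatYaml` and `parseScalarSketch` are hypothetical names, any `name:` line with no value opens a section (the original only accepts `a` and `b`), and indented keys are matched by allowing leading whitespace.

```typescript
type Scalar = string | number | boolean;
type FlatConfig = Record<string, Scalar | Record<string, Scalar>>;

function parseScalarSketch(raw: string): Scalar {
  // Strip one pair of surrounding quotes, then coerce booleans and numbers.
  const v = raw.trim().replace(/^["'](.*)["']$/, "$1");
  if (v === "true") return true;
  if (v === "false") return false;
  if (/^-?\d+(\.\d+)?$/.test(v)) return Number(v);
  return v;
}

function parseFlatYaml(text: string): FlatConfig {
  const root: FlatConfig = {};
  let cursor: Record<string, Scalar> | null = null;
  for (const raw of text.split("\n")) {
    const line = raw.replace(/#.*$/, "").trimEnd(); // drop comments, keep indentation
    if (!line.trim()) continue;
    const section = line.match(/^(\w+)\s*:\s*$/);
    if (section) {
      cursor = {};
      root[section[1]!] = cursor;
      continue;
    }
    const kv = line.match(/^(\s*)(\w+)\s*:\s*(.+)$/);
    if (!kv) continue;
    const indented = kv[1]!.length > 0;
    if (indented && cursor) cursor[kv[2]!] = parseScalarSketch(kv[3]!);
    else if (!indented) root[kv[2]!] = parseScalarSketch(kv[3]!);
  }
  return root;
}

const parsed = parseFlatYaml(
  ["max_hops: 2", "a:", "  topic: incidents   # trailing comment", "  role: member", "b:", '  topic: "ops"'].join("\n"),
);
```

One property worth knowing: quotes are stripped before the numeric test, so in this sketch a quoted all-digit value such as `"123"` still becomes the number `123`. An id that happens to be all digits would therefore arrive as a number unless the scalar rule is changed.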


@@ -0,0 +1,324 @@
/**
* Small broker-side action verbs that previously lived only as MCP tools.
*
* These are the CLI replacements for the soft-deprecated tools
* (set_status / set_summary / set_visible / set_profile / join_group /
* leave_group / forget / message_status / mesh_clock / mesh_stats /
* ping_mesh / claim_task / complete_task).
*
* Each verb runs against ONE mesh — pick with --mesh <slug>, or let the
* picker prompt when multiple meshes are joined. This is the deliberate
* difference from the MCP tools' fan-out-across-all-meshes behavior:
* the CLI invocation model binds one connection per call.
*
* Spec: .artifacts/specs/2026-05-01-mcp-tool-surface-trim.md
*/
import { withMesh } from "./connect.js";
import { tryForgetViaDaemon } from "~/services/bridge/daemon-route.js";
import { render } from "~/ui/render.js";
import { bold, clay, dim } from "~/ui/styles.js";
import { EXIT } from "~/constants/exit-codes.js";
import { validateMessageId, renderValidationError } from "~/cli/validators.js";
type StateFlags = { mesh?: string; json?: boolean };
type PeerStatus = "idle" | "working" | "dnd";
// --- status ---
export async function runStatusSet(state: string, opts: StateFlags): Promise<number> {
const valid: PeerStatus[] = ["idle", "working", "dnd"];
if (!valid.includes(state as PeerStatus)) {
render.err(`Invalid status: ${state}`, `must be one of: ${valid.join(", ")}`);
return EXIT.INVALID_ARGS;
}
// Bridge tier deleted in 1.28.0 (dead code; the orphaned warm-path
// socket was never opened by anyone). Daemon route would belong here;
// adding it for status/summary/visible is queued for 1.29.0.
await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
await client.setStatus(state as PeerStatus);
});
if (opts.json) console.log(JSON.stringify({ status: state }));
else render.ok(`status set to ${bold(state)}`);
return EXIT.SUCCESS;
}
// --- summary ---
export async function runSummary(text: string, opts: StateFlags): Promise<number> {
if (!text) {
render.err("Usage: claudemesh summary <text>");
return EXIT.INVALID_ARGS;
}
await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
await client.setSummary(text);
});
if (opts.json) console.log(JSON.stringify({ summary: text }));
else render.ok("summary set", dim(text));
return EXIT.SUCCESS;
}
// --- visible ---
export async function runVisible(value: string | undefined, opts: StateFlags): Promise<number> {
let visible: boolean;
if (value === "true" || value === "1" || value === "yes") visible = true;
else if (value === "false" || value === "0" || value === "no") visible = false;
else {
render.err("Usage: claudemesh visible <true|false>");
return EXIT.INVALID_ARGS;
}
await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
await client.setVisible(visible);
});
if (opts.json) console.log(JSON.stringify({ visible }));
else render.ok(visible ? "you are now visible to peers" : "you are now hidden", visible ? undefined : "direct messages still reach you");
return EXIT.SUCCESS;
}
// --- group ---
export async function runGroupJoin(name: string | undefined, opts: StateFlags & { role?: string }): Promise<number> {
if (!name) {
render.err("Usage: claudemesh group join @<name> [--role X]");
return EXIT.INVALID_ARGS;
}
const cleanName = name.startsWith("@") ? name.slice(1) : name;
await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
await client.joinGroup(cleanName, opts.role);
});
if (opts.json) {
console.log(JSON.stringify({ group: cleanName, role: opts.role ?? null }));
return EXIT.SUCCESS;
}
render.ok(`joined ${clay("@" + cleanName)}`, opts.role ? `as ${opts.role}` : undefined);
return EXIT.SUCCESS;
}
export async function runGroupLeave(name: string | undefined, opts: StateFlags): Promise<number> {
if (!name) {
render.err("Usage: claudemesh group leave @<name>");
return EXIT.INVALID_ARGS;
}
const cleanName = name.startsWith("@") ? name.slice(1) : name;
await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
await client.leaveGroup(cleanName);
});
if (opts.json) {
console.log(JSON.stringify({ group: cleanName, left: true }));
return EXIT.SUCCESS;
}
render.ok(`left ${clay("@" + cleanName)}`);
return EXIT.SUCCESS;
}
// --- forget ---
export async function runForget(id: string | undefined, opts: StateFlags): Promise<number> {
if (!id) {
render.err("Usage: claudemesh forget <memory-id>");
return EXIT.INVALID_ARGS;
}
// Daemon path first.
if (await tryForgetViaDaemon(id, opts.mesh)) {
if (opts.json) { console.log(JSON.stringify({ id, forgotten: true })); return EXIT.SUCCESS; }
render.ok(`forgot ${dim(id.slice(0, 8))}`);
return EXIT.SUCCESS;
}
await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
await client.forget(id);
});
if (opts.json) {
console.log(JSON.stringify({ id, forgotten: true }));
return EXIT.SUCCESS;
}
render.ok(`forgot ${dim(id.slice(0, 8))}`);
return EXIT.SUCCESS;
}
// --- msg-status ---
export async function runMsgStatus(id: string | undefined, opts: StateFlags): Promise<number> {
// Validate input shape *before* we open a WS connection, so a typo
// returns a structured error instead of "not found or timed out".
const v = validateMessageId(id);
if (!v.ok) {
if (opts.json) {
console.log(
JSON.stringify({
ok: false,
error: "invalid_argument",
field: "messageId",
code: v.code,
reason: v.reason,
expected: v.expected,
}),
);
} else {
renderValidationError({
verb: "msg-status",
input: id ?? "(missing)",
result: v,
});
}
return EXIT.INVALID_ARGS;
}
const lookupId = v.value.value;
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const result = await client.messageStatus(lookupId);
if (!result) {
if (opts.json) {
console.log(
JSON.stringify({
ok: false,
error: "not_found",
id: lookupId,
isPrefix: v.value.isPrefix,
}),
);
} else {
const hint = v.value.isPrefix
? ` no message id starts with ${dim("\"" + lookupId + "\"")} in this mesh.\n try: claudemesh msg-status <full-32-char-id>`
: ` message ${dim(lookupId.slice(0, 12) + "…")} not in queue (already drained, expired, or never sent in this mesh).`;
render.err(`message not found`);
process.stderr.write(hint + "\n");
}
return EXIT.NOT_FOUND;
}
if (opts.json) {
console.log(JSON.stringify(result, null, 2));
return EXIT.SUCCESS;
}
render.section(`message ${lookupId.slice(0, 12)}`);
render.kv([
["target", result.targetSpec],
["delivered", result.delivered ? "yes" : "no"],
["delivered_at", result.deliveredAt ?? dim("—")],
]);
if (result.recipients.length > 0) {
render.blank();
render.heading("recipients");
for (const r of result.recipients) {
process.stdout.write(` ${bold(r.name)} ${dim(r.pubkey.slice(0, 12) + "…")} ${dim("·")} ${r.status}\n`);
}
}
return EXIT.SUCCESS;
});
}
// --- clock ---
export async function runClock(opts: StateFlags): Promise<number> {
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const result = await client.getClock();
if (!result) {
if (opts.json) console.log(JSON.stringify({ error: "timed out" }));
else render.err("Clock query timed out");
return EXIT.INTERNAL_ERROR;
}
if (opts.json) {
console.log(JSON.stringify(result, null, 2));
return EXIT.SUCCESS;
}
const statusLabel = result.speed === 0 ? "not started" : result.paused ? "paused" : "running";
render.section(`mesh clock — ${statusLabel}`);
render.kv([
["speed", `x${result.speed}`],
["tick", String(result.tick)],
["sim_time", result.simTime],
["started_at", result.startedAt],
]);
return EXIT.SUCCESS;
});
}
// --- stats ---
export async function runStats(opts: StateFlags): Promise<number> {
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const peers = await client.listPeers();
if (opts.json) {
console.log(JSON.stringify({
mesh: client.meshSlug,
peers: peers.map((p) => ({ name: p.displayName, pubkey: p.pubkey, stats: p.stats ?? null })),
}, null, 2));
return EXIT.SUCCESS;
}
render.section(client.meshSlug);
for (const p of peers) {
const s = p.stats;
if (!s) {
process.stdout.write(` ${bold(p.displayName)} ${dim("(no stats)")}\n`);
continue;
}
const up = s.uptime != null ? `${Math.floor(s.uptime / 60)}m` : "—";
process.stdout.write(
` ${bold(p.displayName)} ${dim(`in:${s.messagesIn ?? 0} out:${s.messagesOut ?? 0} tools:${s.toolCalls ?? 0} up:${up} err:${s.errors ?? 0}`)}\n`,
);
}
return EXIT.SUCCESS;
});
}
// --- ping ---
export async function runPing(opts: StateFlags): Promise<number> {
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const peers = await client.listPeers();
if (opts.json) {
console.log(JSON.stringify({
mesh: client.meshSlug,
ws_status: client.status,
peers_online: peers.length,
push_buffer: client.pushHistory.length,
}, null, 2));
return EXIT.SUCCESS;
}
render.section(`ping ${client.meshSlug}`);
render.kv([
["ws_status", client.status],
["peers_online", String(peers.length)],
["push_buffer", String(client.pushHistory.length)],
]);
return EXIT.SUCCESS;
});
}
// --- task ---
export async function runTaskClaim(id: string | undefined, opts: StateFlags): Promise<number> {
if (!id) {
render.err("Usage: claudemesh task claim <id>");
return EXIT.INVALID_ARGS;
}
await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
await client.claimTask(id);
});
if (opts.json) {
console.log(JSON.stringify({ id, claimed: true }));
return EXIT.SUCCESS;
}
render.ok(`claimed ${dim(id.slice(0, 8))}`);
return EXIT.SUCCESS;
}
export async function runTaskComplete(id: string | undefined, result: string | undefined, opts: StateFlags): Promise<number> {
if (!id) {
render.err("Usage: claudemesh task complete <id> [result]");
return EXIT.INVALID_ARGS;
}
await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
await client.completeTask(id, result);
});
if (opts.json) {
console.log(JSON.stringify({ id, completed: true, result: result ?? null }));
return EXIT.SUCCESS;
}
render.ok(`completed ${dim(id.slice(0, 8))}`, result);
return EXIT.SUCCESS;
}
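
Review note: `runMsgStatus` validates the id before opening a WS connection, so a typo yields a structured error rather than a timeout. The real `validateMessageId` lives in `~/cli/validators.js` and is not part of this excerpt; the sketch below is a hypothetical stand-in that only mirrors the result shape used above (`v.ok`, `v.code`, `v.reason`, `v.expected`, `v.value.value`, `v.value.isPrefix`). The 32-character length comes from the `<full-32-char-id>` hint; the alphanumeric charset, the lowercasing, and the minimum prefix length are assumptions.

```typescript
type MessageIdResult =
  | { ok: true; value: { value: string; isPrefix: boolean } }
  | { ok: false; code: "missing" | "bad_format"; reason: string; expected: string };

const FULL_LEN = 32; // taken from the "<full-32-char-id>" hint
const MIN_PREFIX_LEN = 4; // hypothetical lower bound for prefix lookups

function validateMessageIdSketch(input: string | undefined): MessageIdResult {
  const expected = `${FULL_LEN}-char id, or a prefix of at least ${MIN_PREFIX_LEN} chars`;
  if (!input) {
    return { ok: false, code: "missing", reason: "message id is required", expected };
  }
  const id = input.trim().toLowerCase();
  // Assumed charset: lowercase letters and digits only.
  if (!/^[a-z0-9]+$/.test(id) || id.length < MIN_PREFIX_LEN || id.length > FULL_LEN) {
    return { ok: false, code: "bad_format", reason: `"${input}" is not a valid message id`, expected };
  }
  return { ok: true, value: { value: id, isPrefix: id.length < FULL_LEN } };
}
```

Returning `isPrefix` alongside the normalised value is what lets the not-found branch choose between the "no message id starts with" hint and the "not in queue" hint.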


@@ -8,6 +8,7 @@
*/
import { EXIT } from "~/constants/exit-codes.js";
+import { render } from "~/ui/render.js";
const COMMANDS = [
"create", "new", "join", "add", "launch", "connect", "disconnect",
@@ -102,7 +103,7 @@ complete -c claudemesh -l join -d 'invite url'
export async function runCompletions(shell: string | undefined): Promise<number> {
if (!shell) {
-console.error("Usage: claudemesh completions <bash|zsh|fish>");
+render.err("Usage: claudemesh completions <bash|zsh|fish>");
return EXIT.INVALID_ARGS;
}
switch (shell.toLowerCase()) {
@@ -116,7 +117,7 @@ export async function runCompletions(shell: string | undefined): Promise<number>
process.stdout.write(fish());
return EXIT.SUCCESS;
default:
-console.error(`Unsupported shell: ${shell}. Use bash, zsh, or fish.`);
+render.err(`Unsupported shell: ${shell}`, "use bash, zsh, or fish.");
return EXIT.INVALID_ARGS;
}
}


@@ -1,22 +1,23 @@
import { readConfig } from "~/services/config/facade.js";
+import { render } from "~/ui/render.js";
+import { dim } from "~/ui/styles.js";
export async function connectTelegram(args: string[]): Promise<void> {
const config = readConfig();
if (config.meshes.length === 0) {
-console.error("No meshes joined. Run 'claudemesh join' first.");
+render.err("No meshes joined.", "Run `claudemesh join` first.");
process.exit(1);
}
const mesh = config.meshes[0]!;
const linkOnly = args.includes("--link");
// Convert WS broker URL to HTTP
const brokerHttp = mesh.brokerUrl
.replace("wss://", "https://")
.replace("ws://", "http://")
.replace("/ws", "");
-console.log("Requesting Telegram connect token...");
+render.info(dim("Requesting Telegram connect token…"));
const res = await fetch(`${brokerHttp}/tg/token`, {
method: "POST",
@@ -32,7 +33,7 @@ export async function connectTelegram(args: string[]): Promise<void> {
if (!res.ok) {
const err = await res.json().catch(() => ({}));
-console.error(`Failed: ${(err as any).error ?? res.statusText}`);
+render.err(`Failed: ${(err as any).error ?? res.statusText}`);
process.exit(1);
}
@@ -46,20 +47,18 @@ export async function connectTelegram(args: string[]): Promise<void> {
return;
}
// Print QR code using simple block characters
-console.log("\n Connect Telegram to your mesh:\n");
-console.log(` ${deepLink}\n`);
-console.log(" Open this link on your phone, or scan the QR code");
-console.log(" with your Telegram camera.\n");
+render.section("connect Telegram to your mesh");
+render.link(deepLink);
+render.blank();
+render.info(dim("Open this link on your phone, or scan the QR code with your Telegram camera."));
+render.blank();
// Try to generate QR with qrcode-terminal if available
try {
const QRCode = require("qrcode-terminal");
QRCode.generate(deepLink, { small: true }, (code: string) => {
console.log(code);
});
} catch {
// qrcode-terminal not available, link is enough
-console.log(" (Install qrcode-terminal for QR code display)");
+render.info(dim("(Install qrcode-terminal for QR code display)"));
}
}
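
Review note: the WS-to-HTTP conversion in `connectTelegram` uses chained `String.replace` calls with string patterns, which replace only the first occurrence. For a host that itself starts with `ws` (for example `wss://ws.example.com/ws`), the final `.replace("/ws", "")` matches inside `//ws.example.com` rather than the trailing path. The sketch below shows one alternative using the standard `URL` class; `brokerHttpBase` is a hypothetical helper name, not something in the codebase.

```typescript
/** Derive the broker's HTTP base from its WebSocket URL (hypothetical helper). */
function brokerHttpBase(wsUrl: string): string {
  const url = new URL(wsUrl);
  if (url.protocol === "wss:") url.protocol = "https:";
  else if (url.protocol === "ws:") url.protocol = "http:";
  else throw new Error(`expected a ws:// or wss:// URL, got ${url.protocol}`);
  // Drop only a trailing "/ws" path segment, never a "/ws" elsewhere in the URL.
  url.pathname = url.pathname.replace(/\/ws\/?$/, "");
  url.search = "";
  url.hash = "";
  // Trim the trailing slash so callers can append "/tg/token" directly.
  return url.toString().replace(/\/$/, "");
}

// The chained-replace approach from the diff, kept here for comparison.
function chainedReplace(wsUrl: string): string {
  return wsUrl.replace("wss://", "https://").replace("ws://", "http://").replace("/ws", "");
}
```

For the default broker URL both approaches agree; they diverge only on hosts or paths that contain an earlier `/ws`, which is the case the anchored pattern guards against.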


@@ -10,6 +10,7 @@ import { createInterface } from "node:readline";
import { BrokerClient } from "~/services/broker/facade.js";
import { readConfig } from "~/services/config/facade.js";
import type { JoinedMesh } from "~/services/config/facade.js";
+import { getDaemonPolicy } from "~/services/daemon/policy.js";
export interface ConnectOpts {
/** Mesh slug to connect to. Auto-selects if only one mesh joined. */
@@ -46,6 +47,18 @@ export async function withMesh<T>(
opts: ConnectOpts,
fn: (client: BrokerClient, mesh: JoinedMesh) => Promise<T>,
): Promise<T> {
+// --strict gate: every cold-path verb funnels through here, so a single
+// policy check covers the whole CLI surface. The daemon-routing helpers
+// already returned null (auto-spawn failed); under --strict we refuse
+// the cold-path fallback and exit loudly instead.
+if (getDaemonPolicy().mode === "strict") {
+console.error(
+"\n ✘ daemon not reachable — --strict refuses cold-path fallback.\n" +
+" run `claudemesh daemon up` (or `claudemesh doctor`) and retry.\n",
+);
+process.exit(1);
+}
const config = readConfig();
if (config.meshes.length === 0) {
console.error("No meshes joined. Run `claudemesh join <url>` first.");
@@ -69,12 +82,27 @@ export async function withMesh<T>(
}
const displayName = opts.displayName ?? config.displayName ?? `${hostname()}-${process.pid}`;
-const client = new BrokerClient(mesh, { displayName });
+const client = new BrokerClient(mesh, { displayName, quiet: true });
try {
await client.connect();
const result = await fn(client, mesh);
return result;
+} catch (e) {
+// Terminal close from the broker (banned / kicked). Give the user
+// a clear message instead of the low-level ws error.
+if (client.terminalClose) {
+const { code, reason } = client.terminalClose;
+if (code === 4002) {
+console.error(`\n ✘ ${reason}\n`);
+} else if (code === 4001) {
+console.error(`\n ✘ Kicked from this mesh. Run \`claudemesh\` to rejoin.\n`);
+} else {
+console.error(`\n ✘ Broker closed connection: ${reason}\n`);
+}
+process.exit(1);
+}
+throw e;
} finally {
client.close();
}
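
Review note: the close-code branching in the new `catch` block could be factored into a pure function, which keeps `withMesh` focused on connection lifecycle and makes the mapping testable. The sketch below assumes `terminalClose` carries `{ code, reason }` as destructured above. `terminalCloseMessage` is a hypothetical name, and reading 4002 as the ban case is an inference from the comment and from 4001 being the kick case, not something the diff states.

```typescript
interface TerminalClose {
  code: number;
  reason: string;
}

/** Map a terminal WebSocket close to the message shown to the user. */
function terminalCloseMessage(close: TerminalClose): string {
  switch (close.code) {
    case 4002: // presumably the ban case: the broker's reason is shown verbatim
      return close.reason;
    case 4001: // kick
      return "Kicked from this mesh. Run `claudemesh` to rejoin.";
    default:
      return `Broker closed connection: ${close.reason}`;
  }
}
```

Both codes fall in the 4000-4999 range that RFC 6455 leaves to applications, so they cannot collide with protocol-defined close codes.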


@@ -0,0 +1,431 @@
import { spawn } from "node:child_process";
import { existsSync, openSync, mkdirSync } from "node:fs";
import { join } from "node:path";
import { runDaemon } from "~/daemon/run.js";
import { ipc, IpcError } from "~/daemon/ipc/client.js";
import { readRunningPid } from "~/daemon/lock.js";
import { DAEMON_PATHS } from "~/daemon/paths.js";
export interface DaemonOptions {
json?: boolean;
noTcp?: boolean;
publicHealth?: boolean;
mesh?: string;
displayName?: string;
/** 1.34.12: keep the daemon attached to the current shell instead
* of double-forking. Default behavior changed in 1.34.12 — `up`
* now detaches by default and writes JSON logs to
* ~/.claudemesh/daemon/daemon.log. Pass `--foreground` to get the
* pre-1.34.12 behavior (logs streaming to stdout, blocks the
* terminal until Ctrl-C). install-service and `claudemesh launch`'s
* auto-spawn path always pass --foreground because their parents
* (launchd / the launch helper) own the lifecycle. */
foreground?: boolean;
/** outbox-list status filter, set from boolean flags --failed/--pending/etc. */
outboxStatus?: "pending" | "inflight" | "done" | "dead" | "aborted";
/** outbox requeue: optional id to mint a fresh client_message_id with. */
newClientId?: string;
}
export async function runDaemonCommand(
sub: string | undefined,
opts: DaemonOptions,
rest: string[] = [],
): Promise<number> {
switch (sub) {
case undefined:
return printDaemonUsage();
case "up":
case "start":
// 1.34.10: `--mesh` and `--name` deprecated.
// --mesh: daemon attaches to every joined mesh automatically;
// pinning at start time blocks new meshes from being picked up.
// --name: overrides the daemon-WS display name GLOBALLY across
// every mesh, but each mesh has its own per-mesh display name
// in config.json (set at `claudemesh join` time). Passing one
// name flattens that out. Sessions advertise their own
// CLAUDEMESH_DISPLAY_NAME at `claudemesh launch` time anyway,
// and the daemon-WS presence is hidden from peer lists since
// 1.32, so the daemon's display name isn't user-visible.
if (opts.mesh) {
process.stderr.write(
`[claudemesh] --mesh on \`daemon up\` is deprecated; the daemon attaches to every joined mesh automatically. ` +
`Ignoring --mesh ${opts.mesh}.\n`,
);
}
if (opts.displayName) {
process.stderr.write(
`[claudemesh] --name on \`daemon up\` is deprecated; per-mesh display names live in config.json (set at join time), ` +
`and session display names come from \`claudemesh launch --name\`. Ignoring --name ${opts.displayName}.\n`,
);
}
// 1.34.12: detach by default. The pre-1.34.12 behavior streamed
// JSON logs to the controlling terminal and blocked the shell —
// fine for debugging, surprising for users who just want the
// daemon "up." `--foreground` opts back into the old behavior;
// launchd / systemd-user units always pass it because the unit
// manager owns lifecycle and stdio redirection.
if (!opts.foreground) {
return spawnDetachedDaemon(opts);
}
return runDaemon({
tcpEnabled: !opts.noTcp,
publicHealthCheck: opts.publicHealth,
});
case "help":
case "--help":
case "-h":
return printDaemonUsage();
case "status":
return runStatus(opts);
case "version":
return runVersion(opts);
case "down":
case "stop":
return runStop(opts);
case "accept-host":
return runAcceptHost(opts);
case "outbox":
return runOutbox(rest, opts);
case "install-service":
return runInstallService(opts);
case "uninstall-service":
return runUninstallService(opts);
default:
process.stderr.write(`unknown daemon subcommand: ${sub}\n\n`);
printDaemonUsage(process.stderr);
return 2;
}
}
function printDaemonUsage(stream: NodeJS.WritableStream = process.stdout): number {
stream.write(`claudemesh daemon — long-lived peer mesh runtime (v0.9.0)
USAGE
claudemesh daemon <command> [options]
COMMANDS
up | start start the daemon (detached by default)
status show running pid + IPC health
version ipc + schema version of the running daemon
down | stop stop the running daemon (SIGTERM, then wait)
accept-host pin the current host fingerprint
outbox list list local outbox rows (newest first)
outbox requeue <id> re-enqueue an aborted / dead outbox row
install-service write launchd (macOS) / systemd-user (Linux) unit
uninstall-service remove the platform service unit
OPTIONS
--foreground keep daemon attached to terminal, JSON logs to stdout (1.34.12+)
--no-tcp disable the loopback TCP listener (UDS only)
--public-health expose /v1/health unauthenticated on TCP
--json machine-readable output where supported
OUTBOX FLAGS (for 'daemon outbox list')
--pending --inflight --done --failed --aborted filter by status
OUTBOX FLAGS (for 'daemon outbox requeue')
--new-client-id <id> mint the new row with this client_message_id
See ${"https://claudemesh.com/docs"} for the full daemon spec.
`);
return 0;
}
interface OutboxRowResp {
id: string;
client_message_id: string;
status: string;
attempts: number;
enqueued_at: string;
next_attempt_at: string;
delivered_at: string | null;
broker_message_id: string | null;
last_error: string | null;
aborted_at: string | null;
aborted_by: string | null;
superseded_by: string | null;
payload_bytes: number;
}
async function runOutbox(rest: string[], opts: DaemonOptions): Promise<number> {
const sub = rest[0];
switch (sub) {
case undefined:
case "list": {
const status = opts.outboxStatus;
const path = `/v1/outbox${status ? `?status=${encodeURIComponent(status)}` : ""}`;
try {
const res = await ipc<{ items: OutboxRowResp[] }>({ path });
if (opts.json) {
process.stdout.write(JSON.stringify(res.body) + "\n");
return 0;
}
if (!res.body.items?.length) {
process.stdout.write("(empty)\n");
return 0;
}
for (const r of res.body.items) {
const tag = r.status.padEnd(8);
// Leading separator so the broker id doesn't run into `attempts=N`.
const bm = r.broker_message_id ? ` broker=${r.broker_message_id}` : "";
const err = r.last_error ? ` last_error="${r.last_error.slice(0, 60)}"` : "";
process.stdout.write(`${tag} ${r.id} cid=${r.client_message_id} attempts=${r.attempts}${bm}${err}\n`);
}
return 0;
} catch (err) {
process.stderr.write(`daemon unreachable: ${String(err)}\n`);
return 1;
}
}
case "requeue": {
const id = rest[1];
if (!id) { process.stderr.write("usage: claudemesh daemon outbox requeue <id> [--new-client-id <id>]\n"); return 2; }
const newClientMessageId = opts.newClientId;
try {
const res = await ipc<{
aborted_row_id: string; new_row_id: string; new_client_message_id: string; error?: string;
}>({
method: "POST",
path: "/v1/outbox/requeue",
body: { id, new_client_message_id: newClientMessageId },
});
if (res.status === 200) {
if (opts.json) process.stdout.write(JSON.stringify(res.body) + "\n");
else process.stdout.write(
`requeued: aborted ${res.body.aborted_row_id} → new ${res.body.new_row_id} ` +
`(client_message_id=${res.body.new_client_message_id})\n`,
);
return 0;
}
process.stderr.write(`requeue failed (${res.status}): ${res.body.error ?? "unknown"}\n`);
return 1;
} catch (err) {
process.stderr.write(`daemon unreachable: ${String(err)}\n`);
return 1;
}
}
default:
process.stderr.write(`unknown outbox subcommand: ${sub}\n`);
process.stderr.write(`usage: claudemesh daemon outbox [list|requeue <id>]\n`);
return 2;
}
}
async function runInstallService(opts: DaemonOptions): Promise<number> {
const { installService, detectPlatform } = await import("~/daemon/service-install.js");
const platform = detectPlatform();
if (!platform) {
process.stderr.write(`unsupported platform: ${process.platform}\n`);
return 2;
}
// Resolve the binary path. Prefer the running argv[0] when it's an
// installed claudemesh binary; fall back to whichever `claudemesh` is
// first on PATH.
// 1.34.10: install-service no longer bakes --mesh into the unit. The
// daemon attaches to every joined mesh by default, and pinning the
// unit to one slug at install time was the source of the "joined a
// new mesh but my service ignores it" footgun. If the user passes
// --mesh anyway, we warn + ignore.
let binary = process.argv[1] ?? "";
if (!binary || /\.ts$/.test(binary) || /node_modules|src\/entrypoints/.test(binary)) {
try {
const { execSync } = await import("node:child_process");
binary = execSync("which claudemesh", { encoding: "utf8" }).trim();
} catch {
process.stderr.write(`couldn't resolve a 'claudemesh' binary on PATH; install via npm/homebrew first\n`);
return 1;
}
}
if (opts.mesh) {
process.stderr.write(
`[claudemesh] --mesh on \`daemon install-service\` is deprecated and ignored; the daemon attaches to every joined mesh.\n`,
);
}
if (opts.displayName) {
process.stderr.write(
`[claudemesh] --name on \`daemon install-service\` is deprecated and ignored; per-mesh names live in config.json, session names come from \`claudemesh launch --name\`.\n`,
);
}
try {
const r = installService({
binaryPath: binary,
});
if (opts.json) {
process.stdout.write(JSON.stringify({ ok: true, ...r }) + "\n");
} else {
process.stdout.write(`installed ${r.platform} service unit: ${r.unitPath}\n`);
process.stdout.write(`bring it up now: ${r.bootCommand}\n`);
}
return 0;
} catch (err) {
process.stderr.write(`install-service failed: ${String(err)}\n`);
return 1;
}
}
async function runUninstallService(opts: DaemonOptions): Promise<number> {
const { uninstallService } = await import("~/daemon/service-install.js");
const r = uninstallService();
if (opts.json) process.stdout.write(JSON.stringify(r) + "\n");
else if (r.removed.length === 0) process.stdout.write("no service unit installed\n");
else process.stdout.write(`removed: ${r.removed.join(", ")}\n`);
return 0;
}
async function runAcceptHost(opts: DaemonOptions): Promise<number> {
const { acceptCurrentHost } = await import("~/daemon/identity.js");
const fp = acceptCurrentHost();
if (opts.json) process.stdout.write(JSON.stringify({ ok: true, fingerprint_prefix: fp.fingerprint.slice(0, 16) }) + "\n");
else process.stdout.write(`host fingerprint accepted: ${fp.fingerprint.slice(0, 16)}\n`);
return 0;
}
async function runStatus(opts: DaemonOptions): Promise<number> {
const pid = readRunningPid();
if (!pid) {
if (opts.json) process.stdout.write(JSON.stringify({ running: false }) + "\n");
else process.stdout.write("daemon: not running\n");
return 1;
}
try {
const res = await ipc<{ ok: boolean; pid: number }>({ path: "/v1/health" });
if (opts.json) {
process.stdout.write(JSON.stringify({ running: true, pid, health: res.body }) + "\n");
} else {
process.stdout.write(`daemon: running (pid ${pid})\n`);
process.stdout.write(`socket: ${DAEMON_PATHS.SOCK_FILE}\n`);
}
return 0;
} catch (err) {
if (opts.json) process.stdout.write(JSON.stringify({ running: true, pid, ipc_error: String(err) }) + "\n");
else process.stdout.write(`daemon: pid ${pid} alive but IPC unreachable (${String(err)})\n`);
return 1;
}
}
async function runVersion(opts: DaemonOptions): Promise<number> {
try {
const res = await ipc<Record<string, unknown>>({ path: "/v1/version" });
if (opts.json) process.stdout.write(JSON.stringify(res.body) + "\n");
else {
const v = res.body as { daemon_version?: string; ipc_api?: string; schema_version?: number };
process.stdout.write(`daemon ${v.daemon_version ?? "unknown"} (ipc ${v.ipc_api ?? "?"}, schema ${v.schema_version ?? "?"})\n`);
}
return 0;
} catch (err) {
if (err instanceof IpcError) {
process.stderr.write(`${err.message}\n`);
return err.status === 401 ? 3 : 1;
}
process.stderr.write(`daemon unreachable: ${String(err)}\n`);
return 1;
}
}
async function runStop(opts: DaemonOptions): Promise<number> {
const pid = readRunningPid();
if (!pid) {
if (opts.json) process.stdout.write(JSON.stringify({ stopped: false, reason: "not_running" }) + "\n");
else process.stdout.write("daemon: not running\n");
return 0;
}
try {
process.kill(pid, "SIGTERM");
} catch (err) {
process.stderr.write(`failed to signal pid ${pid}: ${String(err)}\n`);
return 1;
}
// Poll for up to 5s (50 × 100ms) for the daemon to release its lock cleanly.
for (let i = 0; i < 50; i++) {
await new Promise<void>((r) => setTimeout(r, 100));
if (!readRunningPid()) {
if (opts.json) process.stdout.write(JSON.stringify({ stopped: true, pid }) + "\n");
else process.stdout.write(`daemon: stopped (was pid ${pid})\n`);
return 0;
}
}
if (opts.json) process.stdout.write(JSON.stringify({ stopped: false, pid, reason: "shutdown_timeout" }) + "\n");
else process.stdout.write(`daemon: signaled but did not exit within 5s (pid ${pid})\n`);
return 1;
}
/**
* 1.34.12: spawn the daemon as a detached background process. Re-execs
* the same `claudemesh` binary with `daemon up --foreground` (so the
* child runs the long-lived loop), redirects stdout/stderr to
* ~/.claudemesh/daemon/daemon.log, and `unref()`s so the parent shell
* can exit cleanly.
*
* The parent waits up to ~3s for the UDS socket to appear before
* declaring success — that's the same liveness check `claudemesh launch`
* uses, and it catches the "child crashed during boot" case (config
* read failed, port bind failed, etc.) with an actionable error
* pointing at the log file rather than silent loss.
*/
async function spawnDetachedDaemon(opts: DaemonOptions): Promise<number> {
// Ensure the log directory exists before opening the FDs.
mkdirSync(DAEMON_PATHS.DAEMON_DIR, { recursive: true, mode: 0o700 });
const logPath = join(DAEMON_PATHS.DAEMON_DIR, "daemon.log");
// The CLI binary path. process.argv[1] is the entrypoint script the
// node runtime is currently executing — for an installed CLI that's
// .../bin/claudemesh, for `bun run` dev that's the local dist file.
// Either way it's the right thing to re-exec.
const binary = process.argv[1] ?? "claudemesh";
const args = ["daemon", "up", "--foreground"];
if (opts.noTcp) args.push("--no-tcp");
if (opts.publicHealth) args.push("--public-health");
const out = openSync(logPath, "a");
const err = openSync(logPath, "a");
const child = spawn(process.execPath, [binary, ...args], {
detached: true,
stdio: ["ignore", out, err],
env: process.env,
});
// `detached: true` above makes the child the leader of its own process
// group, so closing the shell doesn't SIGHUP the daemon; unref() lets
// the parent exit without waiting on the child.
child.unref();
// Wait for the socket to appear — the daemon's IPC listener binds
// ~immediately after the broker WS handshake starts, so socket
// existence is a reliable "the daemon is alive enough to accept
// requests" signal.
const sockPath = DAEMON_PATHS.SOCK_FILE;
const startedAt = Date.now();
while (Date.now() - startedAt < 3_000) {
if (existsSync(sockPath)) {
if (opts.json) {
process.stdout.write(JSON.stringify({ ok: true, detached: true, pid: child.pid, log: logPath }) + "\n");
} else {
process.stdout.write(` ✔ daemon started (pid ${child.pid})\n`);
process.stdout.write(` → log: ${logPath}\n`);
process.stdout.write(` → stop: claudemesh daemon down\n`);
}
return 0;
}
await new Promise<void>((r) => setTimeout(r, 100));
}
if (opts.json) {
process.stdout.write(JSON.stringify({ ok: false, detached: true, pid: child.pid, reason: "socket_not_appeared", log: logPath }) + "\n");
} else {
process.stderr.write(` ✘ daemon spawn timeout: socket did not appear within 3s\n`);
process.stderr.write(` → check log: ${logPath}\n`);
process.stderr.write(` → run foreground for live output: claudemesh daemon up --foreground\n`);
}
return 1;
}
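
The socket wait in `spawnDetachedDaemon` is a plain deadline-polling loop. A minimal sketch of the same loop with the clock, probe, and sleep injected, so it can be exercised without real timers or a real socket file. `PollDeps` and `pollUntil` are hypothetical names, and this version is synchronous for testability, whereas the code above awaits `setTimeout`:

```typescript
// Hypothetical helper: all three callbacks are injected so the loop can run
// against a fake clock instead of real timers and a real socket file.
interface PollDeps {
  now: () => number;            // current time in ms
  probe: () => boolean;         // e.g. "does the socket file exist yet?"
  sleep: (ms: number) => void;  // blocks, or advances a fake clock, for `ms`
}

// Returns true as soon as probe() succeeds, false once timeoutMs has elapsed.
function pollUntil(deps: PollDeps, timeoutMs: number, intervalMs: number): boolean {
  const startedAt = deps.now();
  while (deps.now() - startedAt < timeoutMs) {
    if (deps.probe()) return true;
    deps.sleep(intervalMs);
  }
  return false;
}
```

With a 3000 ms budget and a 100 ms interval, a probe that never succeeds is called 30 times before the loop gives up, which matches the "within 3s" wording in the timeout message.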

@@ -4,7 +4,8 @@ import { leave as leaveMesh } from "~/services/mesh/facade.js";
import { getStoredToken } from "~/services/auth/facade.js";
import { request } from "~/services/api/facade.js";
import { URLS } from "~/constants/urls.js";
import { green, red, bold, dim, yellow, icons } from "~/ui/styles.js";
import { render } from "~/ui/render.js";
import { bold, clay, dim, red } from "~/ui/styles.js";
import { EXIT } from "~/constants/exit-codes.js";
const BROKER_HTTP = URLS.BROKER.replace("wss://", "https://").replace("ws://", "http://").replace("/ws", "");
@@ -23,34 +24,34 @@ function getUserId(token: string): string {
} catch { return ""; }
}
async function isOwner(slug: string, userId: string): Promise<boolean> {
async function isOwner(slug: string, auth: { session_token: string }): Promise<boolean> {
try {
const res = await request<{ meshes: Array<{ slug: string; is_owner: boolean }> }>({
path: `/cli/meshes?user_id=${userId}`,
path: `/cli/meshes`,
baseUrl: BROKER_HTTP,
token: auth.session_token,
});
return res.meshes?.find(m => m.slug === slug)?.is_owner ?? false;
return res.meshes?.find((m) => m.slug === slug)?.is_owner ?? false;
} catch { return false; }
}
export async function deleteMesh(slug: string, opts: { yes?: boolean } = {}): Promise<number> {
const config = readConfig();
// Mesh picker if no slug given
if (!slug) {
if (config.meshes.length === 0) {
console.error(" No meshes to remove.");
render.err("No meshes to remove.");
return EXIT.NOT_FOUND;
}
console.log("\n Select mesh to remove:\n");
render.section("select mesh to remove");
config.meshes.forEach((m, i) => {
console.log(` ${bold(String(i + 1) + ")")} ${m.slug} ${dim("(" + m.name + ")")}`);
process.stdout.write(` ${bold(String(i + 1) + ")")} ${clay(m.slug)}\n`);
});
console.log("");
const choice = await prompt(" Choice: ");
render.blank();
const choice = await prompt(` ${dim("choice:")} `);
const idx = parseInt(choice, 10) - 1;
if (idx < 0 || idx >= config.meshes.length) {
console.log(" Cancelled.");
render.info(dim("cancelled."));
return EXIT.USER_CANCELLED;
}
slug = config.meshes[idx]!.slug;
@@ -58,28 +59,27 @@ export async function deleteMesh(slug: string, opts: { yes?: boolean } = {}): Pr
const auth = getStoredToken();
const userId = auth ? getUserId(auth.session_token) : "";
const ownerCheck = userId ? await isOwner(slug, userId) : false;
const ownerCheck = auth ? await isOwner(slug, auth) : false;
// Ask what to do
if (!opts.yes) {
console.log(`\n ${bold(slug)}\n`);
render.section(slug);
if (ownerCheck) {
console.log(` ${bold("1)")} Remove from this device only ${dim("(keep on server)")}`);
console.log(` ${bold("2)")} ${red("Delete everywhere")} ${dim("(removes for all members)")}`);
console.log(` ${bold("3)")} Cancel`);
console.log("");
process.stdout.write(` ${bold("1)")} remove from this device only ${dim("(keep on server)")}\n`);
process.stdout.write(` ${bold("2)")} ${red("delete everywhere")} ${dim("(removes for all members)")}\n`);
process.stdout.write(` ${bold("3)")} cancel\n`);
render.blank();
const choice = await prompt(" Choice [1]: ") || "1";
const choice = await prompt(` ${dim("choice [1]:")} `) || "1";
if (choice === "3") { console.log(" Cancelled."); return EXIT.USER_CANCELLED; }
if (choice === "3") { render.info(dim("cancelled.")); return EXIT.USER_CANCELLED; }
if (choice === "2") {
// Server-side delete — require confirmation
console.log(`\n ${red("Warning:")} This will delete ${bold(slug)} for all members.`);
const confirm = await prompt(` Type "${slug}" to confirm: `);
render.blank();
render.warn(`this will delete ${bold(slug)} for all members.`);
const confirm = await prompt(` ${dim(`type "${slug}" to confirm:`)} `);
if (confirm.toLowerCase() !== slug.toLowerCase()) {
console.log(" Cancelled.");
render.info(dim("cancelled."));
return EXIT.USER_CANCELLED;
}
@@ -87,42 +87,39 @@ export async function deleteMesh(slug: string, opts: { yes?: boolean } = {}): Pr
await request({
path: `/cli/mesh/${slug}`,
method: "DELETE",
body: { user_id: userId },
baseUrl: BROKER_HTTP,
token: auth?.session_token,
body: { user_id: userId },
});
console.log(` ${green(icons.check)} Deleted "${slug}" from server.`);
render.ok(`deleted ${bold(slug)} from server.`);
} catch (err) {
const msg = err instanceof Error ? err.message : String(err);
console.error(` ${icons.cross} Server delete failed: ${msg}`);
render.err(`server delete failed: ${err instanceof Error ? err.message : String(err)}`);
}
leaveMesh(slug);
console.log(` ${green(icons.check)} Removed from local config.`);
render.ok("removed from local config.");
return EXIT.SUCCESS;
}
// choice === "1" — local only, fall through
} else {
// Not owner — can only remove locally
console.log(` ${bold("1)")} Remove from this device ${dim("(you can re-add later)")}`);
console.log(` ${bold("2)")} Cancel`);
if (!ownerCheck && userId) {
console.log(dim(`\n ${yellow(icons.warn)} Only the mesh owner can delete it from the server.`));
process.stdout.write(` ${bold("1)")} remove from this device ${dim("(you can re-add later)")}\n`);
process.stdout.write(` ${bold("2)")} cancel\n`);
if (userId) {
render.blank();
render.warn("only the mesh owner can delete it from the server.");
}
console.log("");
render.blank();
const choice = await prompt(" Choice [1]: ") || "1";
if (choice === "2") { console.log(" Cancelled."); return EXIT.USER_CANCELLED; }
const choice = await prompt(` ${dim("choice [1]:")} `) || "1";
if (choice === "2") { render.info(dim("cancelled.")); return EXIT.USER_CANCELLED; }
}
}
// Local-only removal
const removed = leaveMesh(slug);
if (removed) {
console.log(` ${green(icons.check)} Removed "${slug}" from this device.`);
console.log(dim(` Re-add anytime with: claudemesh mesh add <invite-url>`));
render.ok(`removed ${bold(slug)} from this device.`);
render.hint(`re-add anytime with: ${bold("claudemesh")} ${clay("<invite-url>")}`);
} else {
console.error(` Mesh "${slug}" not found in local config.`);
render.err(`mesh "${slug}" not found in local config.`);
}
return EXIT.SUCCESS;
}

@@ -225,14 +225,14 @@ async function checkNpmLatest(): Promise<Check> {
return { name: "CLI up-to-date", pass: true, detail: `npm unreachable (${res.status}) — skipped` };
}
const body = (await res.json()) as { "dist-tags"?: { alpha?: string; latest?: string } };
const latest = body["dist-tags"]?.alpha ?? body["dist-tags"]?.latest;
const latest = body["dist-tags"]?.latest ?? body["dist-tags"]?.alpha;
if (!latest) return { name: "CLI up-to-date", pass: true, detail: "no dist-tag — skipped" };
const up = latest === VERSION;
return {
name: "CLI up-to-date",
pass: up,
detail: up ? `latest ${latest}` : `installed ${VERSION} → latest ${latest}`,
fix: up ? undefined : "npm i -g claudemesh-cli@alpha",
fix: up ? undefined : "npm i -g claudemesh-cli",
};
} catch {
return { name: "CLI up-to-date", pass: true, detail: "npm check skipped" };
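
The hunk above flips the dist-tag preference from `alpha` first to `latest` first. A small sketch of that selection as a pure function; `DistTags` and `pickPublishedVersion` are illustrative names, not part of the CLI:

```typescript
// Shape of the `dist-tags` object in the registry response, reduced to the
// two tags the check reads.
interface DistTags {
  latest?: string;
  alpha?: string;
}

// Prefer the stable `latest` tag; fall back to `alpha` only when no stable
// tag is present. Returns undefined when neither tag exists.
function pickPublishedVersion(tags: DistTags | undefined): string | undefined {
  return tags?.latest ?? tags?.alpha;
}
```

The check compares the result to the installed version with strict string equality, so under that logic an install newer than the `latest` tag would also be reported as not up to date.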

@@ -0,0 +1,166 @@
/**
* `claudemesh file share <path>` — upload a file to the mesh.
* `claudemesh file get <id>` — download a file by id.
*
* Same-host fast path: when `--to <peer>` is provided and the target
* peer's `hostname` matches this machine's, we skip the MinIO upload
* entirely and send a DM containing the absolute path. The receiver
* reads it directly off the local filesystem. Saves bandwidth + bucket
* space for the common "two Claude sessions on the same laptop" case.
*
* Falls back to encrypted MinIO upload + grant when:
* - `--to` not provided (sharing with the whole mesh)
* - target peer is on a different host
* - `--upload` flag forces the network path
*/
import { hostname as osHostname } from "node:os";
import { resolve as resolvePath, basename, dirname } from "node:path";
import { statSync, existsSync, writeFileSync, mkdirSync } from "node:fs";
import { withMesh } from "./connect.js";
import { render } from "~/ui/render.js";
import { bold, dim, green } from "~/ui/styles.js";
import { EXIT } from "~/constants/exit-codes.js";
// Broker enforces 50 MB on /upload (apps/broker/src/index.ts ~line 1204).
// We mirror it client-side so users get a clear error before bytes go on the wire.
const MAX_FILE_BYTES = 50 * 1024 * 1024;
type Flags = {
mesh?: string;
json?: boolean;
to?: string;
tags?: string;
out?: string;
upload?: boolean; // force network upload, skip same-host fast path
message?: string; // optional note attached to the share DM
};
function emitJson(data: unknown): void {
console.log(JSON.stringify(data, null, 2));
}
function formatSize(bytes: number): string {
if (bytes < 1024) return `${bytes} B`;
if (bytes < 1024 * 1024) return `${(bytes / 1024).toFixed(1)} KB`;
return `${(bytes / (1024 * 1024)).toFixed(1)} MB`;
}
export async function runFileShare(filePath: string, opts: Flags): Promise<number> {
if (!filePath) {
render.err("Usage: claudemesh file share <path> [--to <peer>] [--tags a,b] [--message \"...\"] [--upload]");
return EXIT.INVALID_ARGS;
}
const absPath = resolvePath(filePath);
if (!existsSync(absPath)) {
render.err(`File not found: ${absPath}`);
return EXIT.INVALID_ARGS;
}
const stat = statSync(absPath);
if (!stat.isFile()) {
render.err(`Not a regular file: ${absPath}`);
return EXIT.INVALID_ARGS;
}
// Network upload has a 50 MB cap (broker-enforced). The same-host fast
// path doesn't transfer bytes — it sends a filepath — so it has no cap.
const tags = opts.tags ? opts.tags.split(",").map((t) => t.trim()).filter(Boolean) : [];
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client, mesh) => {
// ── Same-host fast path ─────────────────────────────────────────────
// If --to points at a peer running on this same machine, just DM the
// absolute path. No upload, no MinIO, no presigned URLs.
if (opts.to && !opts.upload) {
const peers = await client.listPeers();
const myHost = osHostname();
const target = peers.find((p) => {
if (!p.hostname || p.hostname !== myHost) return false;
return (
p.displayName === opts.to ||
(p as { memberPubkey?: string }).memberPubkey === opts.to ||
p.pubkey === opts.to ||
(typeof opts.to === "string" && opts.to.length >= 8 && p.pubkey.startsWith(opts.to))
);
});
if (target) {
const note = opts.message ? `\n${opts.message}` : "";
const body = `📎 file://${absPath} (${formatSize(stat.size)} · same host, no upload)${note}`;
// Route by session pubkey, not displayName — sibling sessions of
// the same member share the displayName (and the v0.5.1 self-DM
// guard would otherwise reject sends targeting our own member).
const result = await client.send(target.pubkey, body, "next");
if (!result.ok) {
render.err(`Send failed: ${result.error ?? "unknown"}`);
return EXIT.NETWORK_ERROR;
}
if (opts.json) {
emitJson({ mode: "local", path: absPath, to: target.displayName, hostname: myHost, sizeBytes: stat.size });
} else {
render.ok(`shared ${bold(basename(absPath))} ${dim(`(${formatSize(stat.size)})`)} → ${green(target.displayName)} ${dim("[same host, no upload]")}`);
}
return EXIT.SUCCESS;
}
// No same-host match — fall through to upload path.
}
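// ── Network upload path ─────────────────────────────────────────────
// Mirror the broker's 50 MB cap so the user gets a clear error before
// any bytes go on the wire.
if (stat.size > MAX_FILE_BYTES) {
render.err(`File too large to upload: ${formatSize(stat.size)} (limit ${formatSize(MAX_FILE_BYTES)})`);
return EXIT.INVALID_ARGS;
}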
const fileId = await client.uploadFile(absPath, mesh.meshId, mesh.memberId, {
name: basename(absPath),
tags,
persistent: true,
targetSpec: opts.to,
});
// If --to was set, drop a DM so the recipient is notified + has the id.
if (opts.to) {
const note = opts.message ? `\n${opts.message}` : "";
const body = `📎 ${basename(absPath)} (${formatSize(stat.size)})\nclaudemesh file get ${fileId}${note}`;
await client.send(opts.to, body, "next");
}
if (opts.json) {
emitJson({ mode: "upload", fileId, name: basename(absPath), sizeBytes: stat.size, to: opts.to ?? null });
} else {
render.ok(`uploaded ${bold(basename(absPath))} ${dim(`(${formatSize(stat.size)})`)} ${dim("· id=" + fileId.slice(0, 12))}`);
if (opts.to) render.info(dim(` notified ${opts.to}`));
else render.info(dim(` retrieve: claudemesh file get ${fileId}`));
}
return EXIT.SUCCESS;
});
}
export async function runFileGet(fileId: string, opts: Flags): Promise<number> {
if (!fileId) {
render.err("Usage: claudemesh file get <file-id> [--out <path>]");
return EXIT.INVALID_ARGS;
}
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const meta = await client.getFile(fileId);
if (!meta) {
render.err(`File not found or not accessible: ${fileId}`);
return EXIT.NOT_FOUND;
}
const res = await fetch(meta.url, { signal: AbortSignal.timeout(60_000) });
if (!res.ok) {
render.err(`Download failed: HTTP ${res.status}`);
return EXIT.NETWORK_ERROR;
}
const buf = Buffer.from(await res.arrayBuffer());
const outPath = opts.out
? resolvePath(opts.out)
: resolvePath(process.cwd(), meta.name);
mkdirSync(dirname(outPath), { recursive: true });
writeFileSync(outPath, buf);
if (opts.json) {
emitJson({ fileId, name: meta.name, savedTo: outPath, sizeBytes: buf.length });
} else {
render.ok(`saved ${bold(meta.name)} ${dim(`(${formatSize(buf.length)})`)} → ${dim(outPath)}`);
}
return EXIT.SUCCESS;
});
}
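
The same-host fast path depends on one predicate: the `--to` value must identify a peer whose `hostname` equals this machine's. A sketch of that match as a standalone function, assuming a reduced `Peer` shape (the real peer type is not shown in this diff) and a hypothetical name `findSameHostPeer`:

```typescript
// Reduced, assumed peer shape; only the fields the match reads.
interface Peer {
  displayName: string;
  pubkey: string;
  memberPubkey?: string;
  hostname?: string;
}

// Returns the first peer on `myHost` that `to` identifies by display name,
// member pubkey, full session pubkey, or a session-pubkey prefix of at
// least 8 characters. Peers on other hosts never match.
function findSameHostPeer(peers: Peer[], to: string, myHost: string): Peer | undefined {
  return peers.find((p) => {
    if (!p.hostname || p.hostname !== myHost) return false;
    return (
      p.displayName === to ||
      p.memberPubkey === to ||
      p.pubkey === to ||
      (to.length >= 8 && p.pubkey.startsWith(to))
    );
  });
}
```

When this returns `undefined`, the command falls through to the upload path, so a display name shared with a peer on another host still results in a network share.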

@@ -35,18 +35,13 @@ const BROKER_HTTP = URLS.BROKER.replace("wss://", "https://").replace("ws://", "
async function syncToBroker(meshSlug: string, grants: Record<string, string[] | null>): Promise<void> {
const auth = getStoredToken();
if (!auth) return;
let userId = "";
try {
const payload = JSON.parse(Buffer.from(auth.session_token.split(".")[1]!, "base64url").toString()) as { sub?: string };
userId = payload.sub ?? "";
} catch { return; }
if (!userId) return;
try {
await request<{ ok: true }>({
path: `/cli/mesh/${meshSlug}/grants`,
method: "POST",
body: { user_id: userId, grants },
body: { grants },
baseUrl: BROKER_HTTP,
token: auth.session_token,
});
} catch (e) {
render.warn(`broker grant sync failed — client filter still active: ${e instanceof Error ? e.message : e}`);
@@ -91,8 +86,20 @@ function resolveCaps(input: string[]): Capability[] {
async function resolvePeer(meshSlug: string, name: string): Promise<{ displayName: string; pubkey: string } | null> {
return await withMesh({ meshSlug }, async (client) => {
const peers = await client.listPeers();
const match = peers.find((p) => p.displayName === name || p.pubkey === name || p.pubkey.startsWith(name));
return match ? { displayName: match.displayName, pubkey: match.pubkey } : null;
const match = peers.find(
(p) =>
p.displayName === name ||
p.pubkey === name ||
p.pubkey.startsWith(name) ||
p.memberPubkey === name ||
(p.memberPubkey && p.memberPubkey.startsWith(name)),
);
if (!match) return null;
// Prefer the stable member pubkey for grant keys — session pubkey
// rotates on every reconnect and would invalidate the grant entry.
// Broker falls back to session-key lookup for pre-alpha.36 clients.
const key = match.memberPubkey ?? match.pubkey;
return { displayName: match.displayName, pubkey: key };
});
}
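
The reason for keying grants on the member pubkey can be shown with a toy grant table. This is a sketch only: `PeerKeys`, `grantKey`, `capsFor`, and the `"read"` capability name are assumptions for illustration, not the CLI's actual types:

```typescript
// Assumed minimal peer identity: a per-connection session key plus an
// optional stable member key.
interface PeerKeys {
  pubkey: string;
  memberPubkey?: string;
}

type Grants = Record<string, string[]>;

// Prefer the stable member key; fall back to the session key for peers
// that don't advertise one.
function grantKey(peer: PeerKeys): string {
  return peer.memberPubkey ?? peer.pubkey;
}

// Look up a peer's capabilities; an unknown key yields no capabilities.
function capsFor(grants: Grants, peer: PeerKeys): string[] {
  return grants[grantKey(peer)] ?? [];
}
```

A grant stored before a reconnect is still found afterwards because only `pubkey` changes; keyed on the session key, the same lookup would come back empty.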

@@ -0,0 +1,91 @@
/**
* `claudemesh inbox flush` and `claudemesh inbox delete <id>` —
* mutate the daemon's persistent inbox store
* (`~/.claudemesh/daemon/inbox.db`) over IPC.
*
* 1.34.7: until this version, the only way to clean the inbox was a
* raw `sqlite3 inbox.db "DELETE FROM inbox"` against the daemon's
* private DB. That works but bypasses the IPC layer (and any future
* lifecycle hooks on row removal), and is invisible to a user who
* doesn't know the schema. These two verbs make the operation visible
* + safe + scriptable.
*/
import {
tryFlushInboxViaDaemon,
tryDeleteInboxRowViaDaemon,
} from "~/services/bridge/daemon-route.js";
import { render } from "~/ui/render.js";
import { dim } from "~/ui/styles.js";
export interface InboxFlushFlags {
mesh?: string;
/** ISO-8601 timestamp; deletes rows received_at < before. */
before?: string;
/** Required when neither --mesh nor --before is set, to prevent an
* accidental "delete every row on every mesh". */
all?: boolean;
json?: boolean;
}
export async function runInboxFlush(flags: InboxFlushFlags): Promise<void> {
const hasFilter = !!(flags.mesh || flags.before);
if (!hasFilter && !flags.all) {
if (flags.json) { process.stdout.write(JSON.stringify({ ok: false, error: "missing_filter" }) + "\n"); return; }
render.info(dim(
"Refusing to flush every row on every mesh.\n" +
" Re-run with --mesh <slug>, --before <iso-timestamp>, or --all to confirm.",
));
process.exit(1);
}
const removed = await tryFlushInboxViaDaemon({
...(flags.mesh ? { mesh: flags.mesh } : {}),
...(flags.before ? { beforeIso: flags.before } : {}),
});
if (removed === null) {
if (flags.json) { process.stdout.write(JSON.stringify({ ok: false, error: "daemon_unreachable" }) + "\n"); return; }
render.info(dim("Daemon not reachable. Run `claudemesh daemon up` and retry."));
process.exit(1);
}
if (flags.json) {
process.stdout.write(JSON.stringify({ ok: true, removed }) + "\n");
return;
}
const scope = flags.mesh
? `mesh "${flags.mesh}"`
: flags.before
? `older than ${flags.before}`
: "all meshes";
render.info(`✔ Flushed ${removed} message${removed === 1 ? "" : "s"} from ${scope}.`);
}
export interface InboxDeleteFlags {
json?: boolean;
}
export async function runInboxDelete(id: string, flags: InboxDeleteFlags): Promise<void> {
if (!id) {
if (flags.json) { process.stdout.write(JSON.stringify({ ok: false, error: "missing_id" }) + "\n"); return; }
render.info(dim("Usage: claudemesh inbox delete <message-id>"));
process.exit(1);
}
const ok = await tryDeleteInboxRowViaDaemon(id);
if (ok === null) {
if (flags.json) { process.stdout.write(JSON.stringify({ ok: false, error: "daemon_unreachable" }) + "\n"); return; }
render.info(dim("Daemon not reachable. Run `claudemesh daemon up` and retry."));
process.exit(1);
}
if (!ok) {
if (flags.json) { process.stdout.write(JSON.stringify({ ok: false, error: "not_found", id }) + "\n"); return; }
render.info(dim(`No inbox row with id "${id}".`));
process.exit(1);
}
if (flags.json) {
process.stdout.write(JSON.stringify({ ok: true, id }) + "\n");
return;
}
render.info(`✔ Deleted inbox row ${id}.`);
}
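
The safety guard in `runInboxFlush` can be separated from the IPC call and the output code. A sketch of the decision as a pure function, with `FlushDecision` and `decideFlush` as hypothetical names; the scope strings follow the ones printed above:

```typescript
type FlushDecision =
  | { ok: true; scope: string }
  | { ok: false; error: "missing_filter" };

// Refuse an unfiltered flush unless `all` is set; otherwise describe the
// scope the same way the success message does (mesh wins over before).
function decideFlush(flags: { mesh?: string; before?: string; all?: boolean }): FlushDecision {
  const hasFilter = !!(flags.mesh || flags.before);
  if (!hasFilter && !flags.all) return { ok: false, error: "missing_filter" };
  const scope = flags.mesh
    ? `mesh "${flags.mesh}"`
    : flags.before
      ? `older than ${flags.before}`
      : "all meshes";
  return { ok: true, scope };
}
```

Note that when both `mesh` and `before` are given, the scope text mentions only the mesh, even though both filters are forwarded to the daemon.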

@@ -1,49 +1,101 @@
/**
* `claudemesh inbox` — read pending peer messages.
* `claudemesh inbox` — read pending peer messages from the daemon's
* persisted inbox (`~/.claudemesh/daemon/inbox.db`).
*
* Connects, waits briefly for push delivery, drains the buffer, prints.
* Works best when message-mode is "inbox" or "off" (messages held at broker).
* 1.34.0: switched from the legacy cold-path "open fresh broker WS,
* drain in-memory buffer" flow to a daemon IPC read against `/v1/inbox`.
* The cold path was structurally broken — the persistent inbox lives in
* the daemon, and pushes land on its session-WS, not on a freshly-opened
* standalone WS. The daemon-route `tryListInboxViaDaemon` returns rows
* persisted across daemon restarts and surfaces them with the correct
* mesh scoping (server-side mesh filter added in 1.34.0).
*
* Cold-path fallback removed: when the daemon isn't reachable, the
* prior implementation returned an empty list anyway (no broker state
* = no buffered pushes), so removing that path doesn't lose any
* functionality. Strict mode emits a clear error via daemon-route.
*/
import { withMesh } from "./connect.js";
import type { InboundPush } from "~/services/broker/facade.js";
import { tryListInboxViaDaemon } from "~/services/bridge/daemon-route.js";
import { render } from "~/ui/render.js";
import { bold, dim } from "~/ui/styles.js";
export interface InboxFlags {
mesh?: string;
json?: boolean;
wait?: number;
/** Cap the number of rows returned by the daemon. Default 100. */
limit?: number;
/** 1.34.8: only show rows whose seen_at is NULL (i.e., never
* surfaced via an interactive listing or live channel reminder).
* When omitted, every row is returned and an interactive listing
* stamps them seen as a side effect. */
unread?: boolean;
}
function formatMessage(msg: InboundPush): string {
const text = msg.plaintext ?? `[encrypted: ${msg.ciphertext.slice(0, 32)}…]`;
const from = msg.senderPubkey.slice(0, 8);
const time = new Date(msg.createdAt).toLocaleTimeString();
const kindTag = msg.kind === "direct" ? "→ direct" : msg.kind;
return ` ${bold(from)} ${dim(`[${kindTag}] ${time}`)}\n ${text}`;
interface FormattedItem {
sender_pubkey: string;
sender_name: string;
body: string | null;
topic: string | null;
received_at: string;
mesh: string;
}
function formatMessage(msg: FormattedItem, includeMesh: boolean): string {
const text = msg.body ?? "[encrypted]";
const from = msg.sender_name && msg.sender_name !== msg.sender_pubkey.slice(0, 8)
? `${msg.sender_name} (${msg.sender_pubkey.slice(0, 8)})`
: msg.sender_pubkey.slice(0, 8);
const time = new Date(msg.received_at).toLocaleTimeString();
const topicTag = msg.topic ? ` (#${msg.topic})` : "";
const meshTag = includeMesh ? ` [${msg.mesh}]` : "";
return ` ${bold(from)} ${dim(`${meshTag}${topicTag} ${time}`)}\n ${text}`;
}
export async function runInbox(flags: InboxFlags): Promise<void> {
const waitMs = (flags.wait ?? 1) * 1000;
// Mesh resolution is owned by the daemon (it knows which meshes are
// attached) — the CLI just forwards the user's --mesh flag through.
// When omitted, the daemon's `/v1/inbox` honors the session-default
// mesh on auth-token requests; out-of-session callers see rows from
// every attached mesh. We don't pre-validate the mesh slug here so
// the command works even from a launch tmpdir whose local
// `config.json` only knows about the launch's mesh.
const meshSlug = flags.mesh;
await withMesh({ meshSlug: flags.mesh ?? null }, async (client, mesh) => {
await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
const messages = client.drainPushBuffer();
if (flags.json) {
process.stdout.write(JSON.stringify(messages, null, 2) + "\n");
return;
}
if (messages.length === 0) {
render.info(dim(`No messages on mesh "${mesh.slug}".`));
return;
}
render.section(`inbox — ${mesh.slug} (${messages.length} message${messages.length === 1 ? "" : "s"})`);
for (const msg of messages) {
process.stdout.write(formatMessage(msg) + "\n\n");
}
const items = await tryListInboxViaDaemon(meshSlug, flags.limit ?? 100, {
unreadOnly: flags.unread === true,
// CLI is the canonical "I'm reading my inbox" path — let the daemon
// auto-stamp seen_at on the rows we just rendered. The MCP welcome
// path passes mark_seen=false instead and stamps explicitly after
// the channel notification succeeds.
markSeen: true,
});
if (items === null) {
if (flags.json) { process.stdout.write("[]\n"); return; }
render.info(dim("Daemon not reachable. Run `claudemesh daemon up` and retry."));
return;
}
if (flags.json) {
process.stdout.write(JSON.stringify(items, null, 2) + "\n");
return;
}
if (items.length === 0) {
const scope = meshSlug ? `mesh "${meshSlug}"` : "any mesh";
const filter = flags.unread ? "unread " : "";
render.info(dim(`No ${filter}messages on ${scope}.`));
return;
}
const filterTag = flags.unread ? " unread" : "";
const heading = meshSlug
? `inbox — ${meshSlug} (${items.length}${filterTag} message${items.length === 1 ? "" : "s"})`
: `inbox (${items.length}${filterTag} message${items.length === 1 ? "" : "s"})`;
render.section(heading);
// When the user didn't filter by mesh, surface the mesh slug per row
// so they can tell apart rows from different meshes at a glance.
for (const msg of items) {
process.stdout.write(formatMessage(msg, !meshSlug) + "\n\n");
}
}
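
The row-formatting rules in the new `formatMessage` can be looked at in isolation. The following is a minimal sketch, not the file's actual code: `InboxRow`, `senderLabel`, and `rowTags` are hypothetical names introduced here, and only the field names are taken from `FormattedItem` above.

```typescript
// Hypothetical, trimmed-down row shape; field names mirror FormattedItem.
interface InboxRow {
  sender_pubkey: string;
  sender_name: string;
  topic: string | null;
  mesh: string;
}

// Show "name (prefix)" only when the name adds information beyond the
// 8-character pubkey prefix; otherwise fall back to the bare prefix.
function senderLabel(row: InboxRow): string {
  const prefix = row.sender_pubkey.slice(0, 8);
  return row.sender_name && row.sender_name !== prefix
    ? `${row.sender_name} (${prefix})`
    : prefix;
}

// Build the " [mesh] (#topic)" tag. The mesh part is included only when the
// listing is not already filtered to a single mesh.
function rowTags(row: InboxRow, includeMesh: boolean): string {
  const meshTag = includeMesh ? ` [${row.mesh}]` : "";
  const topicTag = row.topic ? ` (#${row.topic})` : "";
  return `${meshTag}${topicTag}`;
}
```

Keeping the label logic separate from the ANSI styling makes the fallback cases (empty name, name equal to the key prefix) easy to check without a terminal.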

@@ -26,4 +26,8 @@ export { runWelcome } from "./welcome.js";
export { runHook } from "./hook.js";
export { runMcp } from "./mcp.js";
export { runSeedTestMesh } from "./seed-test-mesh.js";
export { runNotificationList } from "./notification.js";
export { runMemberList } from "./member.js";
export { runTopicTail } from "./topic-tail.js";
export { runTopicPost } from "./topic-post.js";
export { withMesh } from "./connect.js";

@@ -30,6 +30,8 @@ import { dirname, join, resolve } from "node:path";
import { fileURLToPath } from "node:url";
import { spawnSync } from "node:child_process";
import { readConfig } from "~/services/config/facade.js";
import { render } from "~/ui/render.js";
import { bold, clay, dim, yellow } from "~/ui/styles.js";
const MCP_NAME = "claudemesh";
const CLAUDE_CONFIG = join(homedir(), ".claude.json");
@@ -149,18 +151,80 @@ function bunAvailable(): boolean {
return res.status === 0;
}
/** Is this file running from a bundled `dist/` directory? */
function isBundledFile(p: string): boolean {
// Match any file under dist/ — e.g. dist/index.js or dist/entrypoints/cli.js.
return /[/\\]dist[/\\]/.test(p);
}
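
To make the matching rule concrete, here is a small sketch that copies the regex from `isBundledFile` and runs it against a few paths. The paths are made up for illustration and are not taken from the repository.

```typescript
// Same pattern as in the diff: a path segment named exactly "dist",
// delimited by either POSIX or Windows separators.
function isBundledFile(p: string): boolean {
  return /[/\\]dist[/\\]/.test(p);
}

// Hypothetical sample paths.
const samples: Array<[string, boolean]> = [
  ["/usr/lib/node_modules/claudemesh-cli/dist/index.js", true],
  ["C:\\pkg\\dist\\entrypoints\\cli.js", true],
  ["/repo/apps/cli/src/index.ts", false],
  // "distribution" is not a "dist" segment, so this does not match.
  ["/repo/distribution/index.js", false],
  // Caveat: a source checkout that lives under a directory named "dist"
  // is also classified as bundled.
  ["/home/me/dist/repo/src/index.ts", true],
];

for (const [path, expected] of samples) {
  console.log(path, isBundledFile(path) === expected ? "ok" : "unexpected");
}
```

The last sample is the trade-off of matching any `dist` segment rather than a fixed suffix such as `/dist/index.js`: it covers nested entrypoints, at the cost of a false positive for unusual checkout locations.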
/** Absolute path to this CLI's entry file. */
function resolveEntry(): string {
const here = fileURLToPath(import.meta.url);
// When bundled (dist/index.js), this file IS the entry; return self.
// When running from source (src/index.ts via bun), walk up to the
// dir + resolve index.ts.
if (here.endsWith("/dist/index.js") || here.endsWith("\\dist\\index.js")) {
return here;
}
// Bundled: this file IS reachable as the entry; return self.
// Source: walk up to apps/cli/src/index.ts (legacy) or fall back.
if (isBundledFile(here)) return here;
return resolve(dirname(here), "..", "index.ts");
}
/** Find the bundled `skills/` directory at install time. Walks up from
* the entry file: dist/entrypoints/cli.js → dist/ → package root → skills/. */
function resolveBundledSkillsDir(): string | null {
const here = fileURLToPath(import.meta.url);
// Bundled: <pkg>/dist/entrypoints/cli.js → walk up two levels to <pkg>
// Source: <pkg>/src/commands/install.ts → walk up two levels to <pkg>
const pkgRoot = resolve(dirname(here), "..", "..");
const skillsDir = join(pkgRoot, "skills");
if (existsSync(skillsDir)) return skillsDir;
return null;
}
/** ~/.claude/skills/ — where Claude Code looks for user-scoped skills. */
const CLAUDE_SKILLS_ROOT = join(homedir(), ".claude", "skills");
/**
* Copy bundled skills into ~/.claude/skills/. Idempotent — overwrites
* existing files (so updates flow through on `claudemesh install` re-run).
* Returns the list of skill names installed.
*/
function installSkills(): string[] {
const src = resolveBundledSkillsDir();
if (!src) return [];
// Each subdirectory of skills/ is one skill (matches Claude Code convention).
const fs = require("node:fs") as typeof import("node:fs");
const installed: string[] = [];
for (const entry of fs.readdirSync(src, { withFileTypes: true })) {
if (!entry.isDirectory()) continue;
const srcDir = join(src, entry.name);
const dstDir = join(CLAUDE_SKILLS_ROOT, entry.name);
mkdirSync(dstDir, { recursive: true });
for (const file of fs.readdirSync(srcDir, { withFileTypes: true })) {
if (!file.isFile()) continue;
copyFileSync(join(srcDir, file.name), join(dstDir, file.name));
}
installed.push(entry.name);
}
return installed;
}
/** Remove claudemesh-shipped skills from ~/.claude/skills/. Returns names removed. */
function uninstallSkills(): string[] {
const src = resolveBundledSkillsDir();
if (!src) return [];
const fs = require("node:fs") as typeof import("node:fs");
const removed: string[] = [];
for (const entry of fs.readdirSync(src, { withFileTypes: true })) {
if (!entry.isDirectory()) continue;
const dstDir = join(CLAUDE_SKILLS_ROOT, entry.name);
if (existsSync(dstDir)) {
try {
fs.rmSync(dstDir, { recursive: true, force: true });
removed.push(entry.name);
} catch { /* best effort */ }
}
}
return removed;
}
/**
* Build the MCP server entry for Claude Code's config.
*
@@ -170,9 +234,7 @@ function resolveEntry(): string {
* - Local dev (bun apps/cli/src/index.ts): use `bun <absolute-path>`.
*/
function buildMcpEntry(entryPath: string): McpEntry {
const isBundled = entryPath.endsWith("/dist/index.js") ||
entryPath.endsWith("\\dist\\index.js");
if (isBundled) {
if (isBundledFile(entryPath)) {
return {
command: "claudemesh",
args: ["mcp"],
@@ -372,193 +434,314 @@ function installStatusLine(): { installed: boolean } {
return { installed: true };
}
export function runInstall(args: string[] = []): void {
export async function runInstall(args: string[] = []): Promise<void> {
const skipHooks = args.includes("--no-hooks");
const skipSkill = args.includes("--no-skill");
const skipService = args.includes("--no-service");
const wantStatusLine = args.includes("--status-line");
console.log("claudemesh install");
console.log("------------------");
render.section("claudemesh install");
const entry = resolveEntry();
const isBundled = entry.endsWith("/dist/index.js") ||
entry.endsWith("\\dist\\index.js");
const bundled = isBundledFile(entry);
// Dev mode (running from src/) requires bun on PATH; bundled mode
// (npm install -g) just uses node + the claudemesh bin shim.
if (!isBundled && !bunAvailable()) {
console.error(
"✗ `bun` is not on PATH. Install Bun first: https://bun.com",
);
if (!bundled && !bunAvailable()) {
render.err("`bun` is not on PATH.", "Install Bun first: https://bun.com");
process.exit(1);
}
if (!existsSync(entry)) {
console.error(`MCP entry not found at ${entry}`);
render.err(`MCP entry not found at ${entry}`);
process.exit(1);
}
const desired = buildMcpEntry(entry);
const action = patchMcpServer(desired);
// Read-back verification.
const verify = readClaudeConfig();
const verifyServers = (verify.mcpServers ?? {}) as Record<string, McpEntry>;
const stored = verifyServers[MCP_NAME];
if (!stored || !entriesEqual(stored, desired)) {
console.error(
`✗ post-write verification failed — ${CLAUDE_CONFIG} may be corrupt`,
);
render.err("post-write verification failed", `${CLAUDE_CONFIG} may be corrupt`);
process.exit(1);
}
// ANSI color helpers — stick to 8-color set so terminals without
// truecolor still render. Fall back to plain if NO_COLOR or dumb TERM.
const useColor =
!process.env.NO_COLOR && process.env.TERM !== "dumb" && process.stdout.isTTY;
const bold = (s: string) => (useColor ? `\x1b[1m${s}\x1b[22m` : s);
const yellow = (s: string) => (useColor ? `\x1b[33m${s}\x1b[39m` : s);
const dim = (s: string) => (useColor ? `\x1b[2m${s}\x1b[22m` : s);
render.ok(`MCP server "${bold(MCP_NAME)}" ${action}`);
render.kv([
["config", dim(CLAUDE_CONFIG)],
["command", dim(`${desired.command}${desired.args?.length ? " " + desired.args.join(" ") : ""}`)],
]);
console.log(`✓ MCP server "${MCP_NAME}" ${action}`);
console.log(dim(` config: ${CLAUDE_CONFIG}`));
console.log(
dim(
` command: ${desired.command}${desired.args?.length ? " " + desired.args.join(" ") : ""}`,
),
);
// allowedTools — pre-approve claudemesh MCP tools so peers don't need
// --dangerously-skip-permissions just to call mesh tools.
try {
const { added, unchanged } = installAllowedTools();
if (added.length > 0) {
console.log(
`allowedTools: ${added.length} claudemesh tools pre-approved${unchanged > 0 ? `, ${unchanged} already present` : ""}`,
render.ok(
`allowedTools: ${added.length} claudemesh tools pre-approved`,
unchanged > 0 ? `${unchanged} already present` : undefined,
);
console.log(dim(` This lets claudemesh tools run without --dangerously-skip-permissions.`));
console.log(dim(` Your existing allowedTools entries were preserved.`));
render.info(dim("This lets claudemesh tools run without --dangerously-skip-permissions."));
render.info(dim("Your existing allowedTools entries were preserved."));
} else {
console.log(`allowedTools: all ${unchanged} claudemesh tools already pre-approved`);
render.ok(`allowedTools: all ${unchanged} claudemesh tools already pre-approved`);
}
console.log(dim(` config: ${CLAUDE_SETTINGS}`));
render.info(dim(` config: ${CLAUDE_SETTINGS}`));
} catch (e) {
console.error(
`⚠ allowedTools update failed: ${e instanceof Error ? e.message : String(e)}`,
);
render.warn(`allowedTools update failed: ${e instanceof Error ? e.message : String(e)}`);
}
// Hooks — status accuracy (Stop/UserPromptSubmit → POST /hook/set-status).
if (!skipHooks) {
try {
const { added, unchanged } = installHooks();
if (added > 0) {
console.log(
`Hooks registered (Stop + UserPromptSubmit): ${added} added, ${unchanged} already present`,
render.ok(
`Hooks registered (Stop + UserPromptSubmit)`,
`${added} added, ${unchanged} already present`,
);
} else {
console.log(`Hooks already registered (${unchanged} present)`);
render.ok(`Hooks already registered`, `${unchanged} present`);
}
console.log(dim(` config: ${CLAUDE_SETTINGS}`));
render.info(dim(` config: ${CLAUDE_SETTINGS}`));
} catch (e) {
console.error(
`hook registration failed: ${e instanceof Error ? e.message : String(e)}`,
);
console.error(
" (MCP is still installed — hooks just skip. Retry with --no-hooks to suppress.)",
render.warn(
`hook registration failed: ${e instanceof Error ? e.message : String(e)}`,
"MCP is still installed — hooks just skip. Retry with --no-hooks to suppress.",
);
}
} else {
console.log(dim("· Hooks skipped (--no-hooks)"));
render.info(dim("· Hooks skipped (--no-hooks)"));
}
// Claude skill — discoverability replacement for the (now-empty) MCP
// tool surface. Claude reads ~/.claude/skills/claudemesh/SKILL.md on
// demand, learns every CLI verb, JSON shape, and gotcha. See spec
// 2026-05-02 commitment #6.
if (!skipSkill) {
try {
const installed = installSkills();
if (installed.length > 0) {
render.ok(
`Claude skill${installed.length === 1 ? "" : "s"} installed`,
installed.join(", "),
);
render.info(dim(` ${join(CLAUDE_SKILLS_ROOT, installed[0]!)}/SKILL.md`));
}
} catch (e) {
render.warn(`skill install failed: ${e instanceof Error ? e.message : String(e)}`);
}
} else {
render.info(dim("· Skill install skipped (--no-skill)"));
}
// Opt-in status line (shows mesh + peer count in Claude Code).
if (wantStatusLine) {
try {
const { installed } = installStatusLine();
if (installed) {
console.log(`Claude Code statusLine → \`claudemesh status-line\``);
console.log(dim(` Shows: ◇ <mesh> · <online>/<total> online · <you>`));
render.ok(`Claude Code statusLine → ${clay("claudemesh status-line")}`);
render.info(dim(" Shows: ◇ <mesh> · <online>/<total> online · <you>"));
} else {
console.log(dim("· statusLine already set to a custom command — left alone"));
render.info(dim("· statusLine already set to a custom command — left alone"));
}
} catch (e) {
console.error(`statusLine install failed: ${e instanceof Error ? e.message : String(e)}`);
render.warn(`statusLine install failed: ${e instanceof Error ? e.message : String(e)}`);
}
}
// Check if user has any meshes joined — nudge them if not.
let hasMeshes = false;
try {
const meshConfig = readConfig();
hasMeshes = meshConfig.meshes.length > 0;
} catch {
// Config missing or corrupt — treat as no meshes.
} catch {}
// Daemon service install — required for MCP integration as of 1.24.0.
// The daemon owns the broker WS and feeds the MCP push-pipe via SSE;
// skipping it leaves channel push, slash commands, and resources broken.
// 1.30.2: install no longer locks the unit to a single mesh; the
// daemon attaches to every joined mesh on boot (1.26.0 multi-mesh
// design). Users who want single-mesh can pass `claudemesh daemon
// install-service --mesh <slug>` explicitly.
if (!skipService && hasMeshes) {
try {
await installDaemonService(entry);
} catch (e) {
render.warn(
`daemon service install failed: ${e instanceof Error ? e.message : String(e)}`,
"Run `claudemesh daemon install-service` to retry.",
);
}
} else if (skipService) {
render.info(dim("· Daemon service skipped (--no-service)"));
render.info(dim(" MCP integration will fail at boot until you start the daemon manually:"));
render.info(dim(" claudemesh daemon up --mesh <slug>"));
} else if (!hasMeshes) {
render.info(dim("· Daemon service deferred — join a mesh first, then run install again."));
}
console.log("");
console.log(yellow(bold("RESTART CLAUDE CODE")) + yellow(" for MCP tools to appear."));
render.blank();
render.warn(`${bold("RESTART CLAUDE CODE")} ${yellow("for MCP tools to appear.")}`);
if (!hasMeshes) {
console.log("");
console.log(yellow("No meshes joined.") + " To connect with peers:");
console.log(
` ${bold("claudemesh <invite-url>")}` +
dim(" — joins + launches in one step"),
);
console.log(
` ${dim("Create one at")} ${bold("https://claudemesh.com/dashboard")}`,
);
render.blank();
render.info(`${yellow("No meshes joined.")} To connect with peers:`);
render.info(` ${bold("claudemesh <invite-url>")}${dim(" — joins + launches in one step")}`);
render.info(` ${dim("Create one at")} ${bold("https://claudemesh.com/dashboard")}`);
} else {
console.log("");
console.log(
`Next: ${bold("claudemesh")}` + dim(" — launch with your joined mesh"),
);
render.blank();
render.info(`Next: ${bold("claudemesh")}${dim(" — launch with your joined mesh")}`);
}
console.log("");
console.log(dim("Optional:"));
console.log(dim(` claudemesh url-handler install # click-to-launch from email`));
console.log(dim(` claudemesh install --status-line # live peer count in Claude Code`));
console.log(dim(` claudemesh completions zsh # shell completions`));
render.blank();
render.info(dim("Optional:"));
render.info(dim(` claudemesh url-handler install # click-to-launch from email`));
render.info(dim(` claudemesh install --status-line # live peer count in Claude Code`));
render.info(dim(` claudemesh completions zsh # shell completions`));
}
/**
* Install + start the per-user daemon service (attaches to every joined mesh).
*
* Refuses on CI hosts (the service-install module guards this); falls
* back to a friendly message and lets the install otherwise succeed.
* The MCP push-pipe will fail loudly if the daemon isn't reachable, so
* the user knows there's a problem before it shows up as "no messages
* arriving."
*/
async function installDaemonService(binaryEntry: string): Promise<void> {
const {
installService,
detectPlatform,
} = require("~/daemon/service-install.js") as typeof import("../daemon/service-install.js");
const platform = detectPlatform();
if (!platform) {
render.info(dim(`· Daemon service skipped — unsupported platform: ${process.platform}`));
return;
}
// Resolve the binary the service unit should launch. When invoked from a
// bundled binary, argv[1] is correct. When invoked under tsx / dev, fall
// back to whatever `claudemesh` resolves to on PATH so the unit launches
// a shipped binary, not a dev script.
let binary = process.argv[1] ?? binaryEntry;
if (!binary || /\.ts$/.test(binary) || /node_modules|src\/entrypoints/.test(binary)) {
try {
const { execSync } = require("node:child_process") as typeof import("node:child_process");
binary = execSync("which claudemesh", { encoding: "utf8" }).trim();
} catch {
render.warn(
"couldn't resolve a 'claudemesh' binary on PATH; daemon service skipped",
"Install via npm/homebrew, then run `claudemesh daemon install-service`",
);
return;
}
}
const r = installService({ binaryPath: binary });
render.ok(`daemon service installed (${r.platform})`);
render.kv([
["unit", dim(r.unitPath)],
["mesh", dim("(all joined meshes)")],
]);
// Boot the unit immediately so MCP has a daemon to attach to on next
// Claude Code launch. Best-effort: if launchctl/systemctl errors out we
// log and continue — the user can run the boot command manually.
try {
const { execSync } = require("node:child_process") as typeof import("node:child_process");
execSync(r.bootCommand, { stdio: "ignore" });
render.ok("daemon started");
} catch (e) {
render.warn(
`daemon service installed but failed to start: ${e instanceof Error ? e.message : String(e)}`,
`Run manually: ${r.bootCommand}`,
);
return;
}
// 1.31.0 — post-flight: verify the daemon actually establishes a
// broker WebSocket. Boots that fail silently here (DNS, expired TLS,
// outbound :443 blocked, broker outage) used to surface only when
// the user's first `peer list` or `send` failed half an hour later.
// Polling /v1/health gives a clear, install-time signal.
await verifyBrokerConnectivity();
}
async function verifyBrokerConnectivity(): Promise<void> {
const VERIFY_BUDGET_MS = 15_000;
const POLL_INTERVAL_MS = 500;
const { ipc } = await import("~/daemon/ipc/client.js");
const start = Date.now();
let lastBrokers: Record<string, string> = {};
while (Date.now() - start < VERIFY_BUDGET_MS) {
try {
const res = await ipc<{ ok: boolean; brokers?: Record<string, string> }>({
path: "/v1/health",
timeoutMs: 2_000,
});
lastBrokers = res.body?.brokers ?? {};
const openMesh = Object.entries(lastBrokers).find(([, s]) => s === "open");
if (openMesh) {
const others = Object.entries(lastBrokers).filter(([slug]) => slug !== openMesh[0]);
const tail = others.length > 0 ? `, ${others.length} other mesh${others.length === 1 ? "" : "es"} attaching` : "";
render.ok(`broker connected (mesh=${openMesh[0]}${tail})`);
return;
}
} catch { /* daemon may still be starting up; keep polling */ }
await new Promise((r) => setTimeout(r, POLL_INTERVAL_MS));
}
// Timed out without a single broker reaching `open`. Surface what we
// saw last so the user can act — this is exactly the bug class we
// want to catch at install time, not at first send.
const states = Object.keys(lastBrokers).length === 0
? "no health response from daemon"
: Object.entries(lastBrokers).map(([m, s]) => `${m}=${s}`).join(", ");
render.warn(
`broker did not reach open within ${Math.round(VERIFY_BUDGET_MS / 1000)}s (${states})`,
"Check ~/.claudemesh/daemon/daemon.log for connect errors. Common causes: outbound :443 blocked, expired TLS, DNS resolution.",
);
}
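
The two messages `verifyBrokerConnectivity` can end with are both derived from the `brokers` map returned by `/v1/health`. The sketch below pulls that derivation into two pure functions so the wording can be checked without a running daemon. `connectedSummary` and `timeoutSummary` are hypothetical names, and the state strings other than `"open"` are placeholders, since the diff does not list the possible values.

```typescript
type BrokerStates = Record<string, string>;

// Returns the success line as soon as any mesh reports "open", else null
// (meaning: keep polling).
function connectedSummary(brokers: BrokerStates): string | null {
  const open = Object.entries(brokers).find(([, state]) => state === "open");
  if (!open) return null;
  const others = Object.keys(brokers).length - 1;
  const tail =
    others > 0 ? `, ${others} other mesh${others === 1 ? "" : "es"} attaching` : "";
  return `broker connected (mesh=${open[0]}${tail})`;
}

// Describes the last observed states for the timeout warning.
function timeoutSummary(brokers: BrokerStates): string {
  const entries = Object.entries(brokers);
  return entries.length === 0
    ? "no health response from daemon"
    : entries.map(([mesh, state]) => `${mesh}=${state}`).join(", ");
}
```

As in the original loop, the "other meshes attaching" count covers every mesh except the first open one, so it may include meshes that are already open as well.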
export function runUninstall(): void {
console.log("claudemesh uninstall");
console.log("--------------------");
render.section("claudemesh uninstall");
// MCP entry — only removes claudemesh, never touches other servers.
if (removeMcpServer()) {
console.log(`MCP server "${MCP_NAME}" removed`);
render.ok(`MCP server "${bold(MCP_NAME)}" removed`);
} else {
console.log(`· MCP server "${MCP_NAME}" not present`);
render.info(dim(`· MCP server "${MCP_NAME}" not present`));
}
// allowedTools
try {
const removed = uninstallAllowedTools();
if (removed > 0) {
console.log(`allowedTools: ${removed} claudemesh tools removed`);
render.ok(`allowedTools: ${removed} claudemesh tools removed`);
} else {
console.log("· No claudemesh allowedTools to remove");
render.info(dim("· No claudemesh allowedTools to remove"));
}
} catch (e) {
console.error(
`⚠ allowedTools removal failed: ${e instanceof Error ? e.message : String(e)}`,
);
render.warn(`allowedTools removal failed: ${e instanceof Error ? e.message : String(e)}`);
}
// Hooks
try {
const removed = uninstallHooks();
if (removed > 0) {
console.log(`Hooks removed (${removed} entries)`);
render.ok(`Hooks removed`, `${removed} entries`);
} else {
console.log("· No claudemesh hooks to remove");
render.info(dim("· No claudemesh hooks to remove"));
}
} catch (e) {
console.error(
`⚠ hook removal failed: ${e instanceof Error ? e.message : String(e)}`,
);
render.warn(`hook removal failed: ${e instanceof Error ? e.message : String(e)}`);
}
console.log("");
console.log("Restart Claude Code to drop the MCP connection + hooks.");
try {
const removed = uninstallSkills();
if (removed.length > 0) {
render.ok(`Skill${removed.length === 1 ? "" : "s"} removed`, removed.join(", "));
} else {
render.info(dim("· No claudemesh skills to remove"));
}
} catch (e) {
render.warn(`skill removal failed: ${e instanceof Error ? e.message : String(e)}`);
}
render.blank();
render.info("Restart Claude Code to drop the MCP connection + hooks.");
}

@@ -39,7 +39,7 @@ export async function invite(
// Show picker
console.log("\n Select mesh to share:\n");
config.meshes.forEach((m, i) => {
console.log(` ${bold(String(i + 1) + ")")} ${m.slug} ${dim("(" + m.name + ")")}`);
console.log(` ${bold(String(i + 1) + ")")} ${m.slug}`);
});
console.log("");
const choice = await prompt(" Choice [1]: ") || "1";

@@ -111,6 +111,24 @@ export async function runJoin(args: string[]): Promise<void> {
process.exit(1);
}
// Short-circuit: if the arg matches a mesh already in local config (slug
// or name), we're already joined. Don't go through the invite flow.
if (!link.includes("://")) {
const existing = readConfig().meshes.find(
(m) => m.slug === link || m.name === link,
);
if (existing) {
console.log(`Already in "${existing.slug}" on this machine.`);
console.log("");
console.log(`Use it in the current directory:`);
console.log(` claudemesh launch --mesh ${existing.slug}`);
console.log("");
console.log(`Or list peers:`);
console.log(` claudemesh peers --mesh ${existing.slug}`);
return;
}
}
// Try v2 first — short code / `/i/<code>` URL.
const v2Code = parseV2InviteInput(link);
if (v2Code) {

@@ -0,0 +1,108 @@
/**
* `claudemesh disconnect` — soft disconnect (session reset, auto-reconnects).
* `claudemesh kick` — hard kick (session ends, no auto-reconnect).
*
* claudemesh disconnect <peer> # nudge, reconnects in seconds
* claudemesh kick <peer> # stop session, user runs claudemesh to rejoin
* claudemesh kick --stale 30m # kick peers idle > 30m
* claudemesh kick --all # kick everyone except yourself
*
* Ban (permanent, revokes membership) is in ban.ts.
*/
import { withMesh } from "./connect.js";
import { readConfig } from "~/services/config/facade.js";
import { render } from "~/ui/render.js";
import { EXIT } from "~/constants/exit-codes.js";
function parseStaleMs(input: string): number | null {
const m = input.match(/^(\d+)(s|m|h)$/);
if (!m) return null;
const val = parseInt(m[1]!, 10);
const unit = m[2]!;
if (unit === "s") return val * 1000;
if (unit === "m") return val * 60_000;
if (unit === "h") return val * 3600_000;
return null;
}
function buildPayload(
kind: "disconnect" | "kick",
target: string | undefined,
opts: { stale?: string; all?: boolean },
): Record<string, unknown> | { error: string } {
if (opts.all) return { type: kind, all: true };
if (opts.stale) {
const ms = parseStaleMs(opts.stale);
if (!ms) return { error: `Invalid stale duration: "${opts.stale}". Use e.g. 30m, 1h, 300s.` };
return { type: kind, stale: ms };
}
if (target) return { type: kind, target };
return { error: `Usage: claudemesh ${kind} <peer> | --stale 30m | --all` };
}
export async function runDisconnect(
target: string | undefined,
opts: { mesh?: string; stale?: string; all?: boolean } = {},
): Promise<number> {
const config = readConfig();
const meshSlug = opts.mesh ?? config.meshes[0]?.slug;
if (!meshSlug) { render.err("No mesh joined."); return EXIT.NOT_FOUND; }
const built = buildPayload("disconnect", target, opts);
if ("error" in built) { render.err(String(built.error)); return EXIT.INVALID_ARGS; }
return await withMesh({ meshSlug }, async (client) => {
const result = await client.sendAndWait(built as Record<string, unknown>) as { affected?: string[]; kicked?: string[] };
const peers = result?.affected ?? result?.kicked ?? [];
if (peers.length === 0) render.info("No peers matched.");
else {
render.ok(`Disconnected ${peers.length} peer(s): ${peers.join(", ")}`);
render.hint("They will auto-reconnect within seconds. For a session-ending kick, use `claudemesh kick`.");
}
return EXIT.SUCCESS;
});
}
export async function runKick(
target: string | undefined,
opts: { mesh?: string; stale?: string; all?: boolean } = {},
): Promise<number> {
const config = readConfig();
const meshSlug = opts.mesh ?? config.meshes[0]?.slug;
if (!meshSlug) { render.err("No mesh joined."); return EXIT.NOT_FOUND; }
const built = buildPayload("kick", target, opts);
if ("error" in built) { render.err(String(built.error)); return EXIT.INVALID_ARGS; }
return await withMesh({ meshSlug }, async (client) => {
const result = await client.sendAndWait(built as Record<string, unknown>) as {
affected?: string[];
kicked?: string[];
// 1.34.15: broker refuses to kick control-plane WSes (they'd
// just auto-reconnect). Older brokers don't emit this field.
skipped_control_plane?: string[];
};
const peers = result?.affected ?? result?.kicked ?? [];
const skipped = result?.skipped_control_plane ?? [];
if (peers.length === 0 && skipped.length === 0) {
render.info("No peers matched.");
} else if (peers.length === 0 && skipped.length > 0) {
render.warn(
`${skipped.length} match(es) refused: ${skipped.join(", ")} — control-plane connections (daemon / dashboard) auto-reconnect, so kick is a no-op.`,
"To take a daemon offline locally, run `claudemesh daemon down` on that machine. To remove a member from the mesh, use `claudemesh ban <peer>`.",
);
} else {
render.ok(`Kicked ${peers.length} peer(s): ${peers.join(", ")}`);
render.hint("Their Claude Code session ended. They can rejoin anytime by running `claudemesh`.");
if (skipped.length > 0) {
render.warn(
`(also refused ${skipped.length} control-plane connection(s): ${skipped.join(", ")})`,
"Daemon / dashboard connections auto-reconnect; kick is a no-op against them. Use `claudemesh ban <peer>` to remove a member entirely.",
);
}
}
return EXIT.SUCCESS;
});
}
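
The `--stale` duration grammar is small enough to illustrate on its own. The sketch below is a table-driven variant of `parseStaleMs`, not the code from this commit: `UNIT_MS` is a name introduced here, and the accepted inputs are the same (`<digits>` followed by `s`, `m`, or `h`).

```typescript
// Milliseconds per supported unit suffix.
const UNIT_MS: Record<string, number> = { s: 1_000, m: 60_000, h: 3_600_000 };

// Parse "30m", "1h", "300s" into milliseconds; anything else yields null.
function parseStaleMs(input: string): number | null {
  const m = /^(\d+)(s|m|h)$/.exec(input);
  if (!m) return null;
  return parseInt(m[1]!, 10) * UNIT_MS[m[2]!]!;
}

console.log(parseStaleMs("30m"));  // 1800000
console.log(parseStaleMs("1h"));   // 3600000
console.log(parseStaleMs("300s")); // 300000
console.log(parseStaleMs("1.5h")); // null (fractions are not accepted)
console.log(parseStaleMs("0m"));   // 0
```

Note that `buildPayload` checks the result with `!ms`, so a zero duration such as `0m` parses to `0` but is still reported as an invalid stale duration.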

@@ -25,6 +25,7 @@ import type { Config, JoinedMesh, GroupEntry } from "~/services/config/facade.js
import { startCallbackListener, generatePairingCode } from "~/services/auth/facade.js";
import { openBrowser } from "~/services/spawn/facade.js";
import { BrokerClient } from "~/services/broker/facade.js";
import { render } from "~/ui/render.js";
// Flags as parsed by citty (index.ts is the source of truth for definitions).
export interface LaunchFlags {
@@ -43,6 +44,62 @@ export interface LaunchFlags {
// --- Interactive mesh picker ---
/**
* Ensure the per-user daemon is running before we hand off to Claude Code.
*
* As of 1.24.0 the daemon owns the broker WS and feeds the MCP push-pipe
* over IPC SSE. If the socket is absent when Claude boots its MCP shim,
* the shim bails (no fallback). Delegates to the shared lifecycle helper
* (services/daemon/lifecycle.ts) which probes the socket properly
* (avoiding the stale-socket bug where existsSync was a false positive
* after a daemon crash), spawns under a file-lock, and polls for liveness.
*/
async function ensureDaemonRunning(meshSlug: string, quiet: boolean): Promise<void> {
const { ensureDaemonReady } = await import("~/services/daemon/lifecycle.js");
if (!quiet) render.info("ensuring claudemesh daemon is running…");
// Larger budget for `launch` — it's a one-shot flow where the user
// is actively waiting; cold node start + broker hello can take
// longer than the default 3s budget for ad-hoc verbs.
const res = await ensureDaemonReady({ budgetMs: 10_000, mesh: meshSlug });
if (res.state === "up") {
if (!quiet) render.ok("daemon already running");
await warnIfDaemonStale(quiet);
return;
}
if (res.state === "started") {
if (!quiet) render.ok(`daemon ready (${res.durationMs}ms)`);
return;
}
render.warn(
`daemon ${res.state}${res.reason ? `: ${res.reason}` : ""}`,
"Run `claudemesh daemon up` manually, then re-launch.",
);
}
/** 1.34.9: warn when the running daemon's version doesn't match the CLI
* that's about to launch a session. `npm i -g claudemesh-cli` upgrades
* the binaries on disk but doesn't restart a launchd / systemd-user
* service or a foreground `claudemesh daemon up`, so users routinely
* ship a fix to the CLI side and never see it because the WS lifecycle,
* echo guards, and self-join filters all live in the long-running
* daemon process. We probe `/v1/version` and emit a one-shot stderr
* warning when CLI ≠ daemon. Best-effort; failures are silent. */
async function warnIfDaemonStale(quiet: boolean): Promise<void> {
if (quiet) return;
try {
const { ipc } = await import("~/daemon/ipc/client.js");
const { VERSION } = await import("~/constants/urls.js");
const res = await ipc<{ daemon_version?: string }>({ path: "/v1/version", timeoutMs: 1_500 });
if (res.status !== 200) return;
const daemonVersion = res.body.daemon_version ?? "";
if (!daemonVersion || daemonVersion === VERSION) return;
render.warn(
`daemon is ${daemonVersion}, CLI is ${VERSION} — restart to pick up new fixes.`,
"Run: `claudemesh daemon down && claudemesh daemon up` (no --mesh — daemon attaches to every joined mesh; restart the launchd / systemd-user unit if you installed one).",
);
} catch { /* swallow — version probe is best-effort */ }
}
async function pickMesh(meshes: JoinedMesh[]): Promise<JoinedMesh> {
if (meshes.length === 1) return meshes[0]!;
@@ -217,7 +274,7 @@ async function runLaunchWizard(opts: {
spinner.stop();
const choice = await menuSelect({
title: "Select mesh",
items: opts.meshes.map(m => m.slug),
items: opts.meshes.map((m) => m.slug),
row,
});
mesh = opts.meshes[choice]!;
@@ -317,6 +374,66 @@ async function runLaunchWizard(opts: {
return { mesh, role, groups, messageMode, skipPermissions };
}
/**
* 1.32.0 — broker welcome line printed right after the launch banner.
* Polls the daemon's /v1/health (per-mesh broker WS state) and tries
* to fetch the inbox + peer count via daemon-route helpers. Best-effort:
* if any call fails the welcome simply prints what it knows and moves
* on — never blocks the launch path.
*/
async function printBrokerWelcome(meshSlug: string): Promise<void> {
const useColor = !process.env.NO_COLOR && process.env.TERM !== "dumb" && process.stdout.isTTY;
const dim = (s: string): string => (useColor ? `\x1b[2m${s}\x1b[22m` : s);
// Foreground colors reset with SGR 39 (default foreground); 22 only resets bold/dim.
const green = (s: string): string => (useColor ? `\x1b[32m${s}\x1b[39m` : s);
const yellow = (s: string): string => (useColor ? `\x1b[33m${s}\x1b[39m` : s);
// Probe daemon health for broker WS state.
let brokerState = "unknown";
try {
const { ipc } = await import("~/daemon/ipc/client.js");
const res = await ipc<{ ok?: boolean; brokers?: Record<string, string> }>({
path: "/v1/health",
timeoutMs: 1_500,
});
if (res.status === 200 && res.body?.brokers) {
brokerState = res.body.brokers[meshSlug] ?? "unknown";
}
} catch { /* daemon unreachable — not fatal */ }
// Peer count (best-effort). 1.34.15: scope to the launched mesh so
// multi-mesh daemons don't inflate the welcome banner with peers
// from other meshes the user didn't just attach to.
let peerCount = -1;
try {
const { tryListPeersViaDaemon } = await import("~/services/bridge/daemon-route.js");
const peers = (await tryListPeersViaDaemon(meshSlug)) ?? [];
peerCount = peers.filter((p) =>
(p as { channel?: string }).channel !== "claudemesh-daemon",
).length;
} catch { /* skip peer count */ }
// Unread inbox count (best-effort).
let unread = -1;
try {
const { ipc } = await import("~/daemon/ipc/client.js");
const res = await ipc<{ messages?: unknown[] }>({
path: "/v1/inbox",
timeoutMs: 1_500,
});
if (res.status === 200 && Array.isArray(res.body?.messages)) {
unread = res.body.messages.length;
}
} catch { /* skip unread */ }
const dot = brokerState === "open" ? green("●") : yellow("●");
const parts: string[] = [];
parts.push(`broker ${brokerState === "open" ? "connected" : brokerState}`);
if (peerCount >= 0) parts.push(`${peerCount} peer${peerCount === 1 ? "" : "s"} online`);
if (unread >= 0) parts.push(`${unread} unread`);
console.log(`${dot} ${parts.join(dim(" · "))}`);
console.log("");
}
function printBanner(name: string, meshSlug: string, role: string | null, groups: GroupEntry[], messageMode: "push" | "inbox" | "off"): void {
const useColor =
!process.env.NO_COLOR && process.env.TERM !== "dumb" && process.stdout.isTTY;
@@ -371,7 +488,7 @@ export async function runLaunch(flags: LaunchFlags, rawArgs: string[]): Promise<
// 1. If --join, run join flow first.
if (args.joinLink) {
console.log("Joining mesh...");
render.info(tDim("Joining mesh…"));
const invite = await parseInviteLink(args.joinLink);
const keypair = await generateKeypair();
const displayName = (args.name ?? process.env.USER ?? process.env.USERNAME ?? hostname());
@@ -398,8 +515,9 @@ export async function runLaunch(flags: LaunchFlags, rawArgs: string[]): Promise<
});
const { writeConfig } = await import("~/services/config/facade.js");
writeConfig(config);
console.log(
`✓ Joined "${invite.payload.mesh_slug}"${enroll.alreadyMember ? " (already member)" : ""}`,
render.ok(
`joined ${tBold(invite.payload.mesh_slug)}`,
enroll.alreadyMember ? "already member" : undefined,
);
}
@@ -483,7 +601,7 @@ export async function runLaunch(flags: LaunchFlags, rawArgs: string[]): Promise<
}
if (config.meshes.length === 0) {
console.error("No meshes joined. Run `claudemesh join <url>` or use --join <url>.");
render.err("No meshes joined.", "Run `claudemesh join <url>` or use --join <url>.");
process.exit(1);
}
@@ -492,8 +610,9 @@ export async function runLaunch(flags: LaunchFlags, rawArgs: string[]): Promise<
if (args.meshSlug) {
const found = config.meshes.find((m) => m.slug === args.meshSlug);
if (!found) {
console.error(
`Mesh "${args.meshSlug}" not found. Joined: ${config.meshes.map((m) => m.slug).join(", ")}`,
render.err(
`Mesh "${args.meshSlug}" not found.`,
`Joined: ${config.meshes.map((m) => m.slug).join(", ")}`,
);
process.exit(1);
}
@@ -547,6 +666,12 @@ export async function runLaunch(flags: LaunchFlags, rawArgs: string[]): Promise<
}
} catch { /* best effort */ }
// Ensure the daemon is running before we spawn Claude. The MCP shim
// (loaded by --dangerously-load-development-channels server:claudemesh)
// requires the daemon's UDS to be reachable at boot — if it isn't,
// channel push, slash commands, and resources fail.
await ensureDaemonRunning(mesh.slug, args.quiet);
// Clean up stale mesh MCP entries from crashed sessions
try {
const claudeConfigPath = join(homedir(), ".claude.json");
@@ -613,9 +738,109 @@ export async function runLaunch(flags: LaunchFlags, rawArgs: string[]): Promise<
"utf-8",
);
// 4b. Mint a per-session IPC token, persist it under tmpDir, and
// register it with the daemon. The token's path is exposed to
// the spawned claude (and all its descendants) via env so
// CLI invocations from inside the session auto-attribute to it.
//
// 1.30.0: also mint an ephemeral ed25519 session keypair and a
// parent-vouched attestation. The daemon uses these to open a
// long-lived broker WebSocket per session (presence row keyed on
// the session pubkey, member_id from the parent), so sibling
// sessions in the same mesh see each other in `peer list`.
//
// Session-id resolution: 1.29.0 referenced `claudeSessionId`
// before its `const` declaration further down the file, hitting
// the temporal dead zone (TDZ). The resulting ReferenceError was
// swallowed by the surrounding catch, so the IPC registration has
// been silently failing on every launch since 1.29.0. Hoist the
// declaration up so it actually runs.
const isResume = args.resume !== null || args.continueSession;
const claudeSessionId = isResume ? undefined : randomUUID();
let sessionTokenFilePath: string | null = null;
let sessionTokenForCleanup: string | null = null;
try {
const { mintSessionToken, TOKEN_FILE_ENV } = await import("~/services/session/token.js");
const minted = mintSessionToken(tmpDir);
sessionTokenFilePath = minted.filePath;
sessionTokenForCleanup = minted.token;
// Per-session ephemeral keypair + parent attestation (1.30.0+).
// Older daemons ignore unknown body fields, so it is safe to
// always send the presence material.
let presencePayload: {
session_pubkey: string;
session_secret_key: string;
parent_attestation: {
session_pubkey: string;
parent_member_pubkey: string;
expires_at: number;
signature: string;
};
} | undefined;
try {
const { generateKeypair } = await import("~/services/crypto/facade.js");
const { signParentAttestation } = await import("~/services/broker/session-hello-sig.js");
const sessionKp = await generateKeypair();
const att = await signParentAttestation({
parentMemberPubkey: mesh.pubkey,
parentSecretKey: mesh.secretKey,
sessionPubkey: sessionKp.publicKey,
});
presencePayload = {
session_pubkey: sessionKp.publicKey,
session_secret_key: sessionKp.secretKey,
parent_attestation: {
session_pubkey: att.sessionPubkey,
parent_member_pubkey: att.parentMemberPubkey,
expires_at: att.expiresAt,
signature: att.signature,
},
};
} catch {
// Keypair / attestation failure — proceed without per-session
// presence. The session still registers; only the broker-side
// presence row is skipped.
}
// Register with the daemon. Best-effort: a daemon failure here
// means the session falls back to user-level scope, which is fine.
const { ipc } = await import("~/daemon/ipc/client.js");
const sessionIdForRegister = claudeSessionId ?? randomUUID();
await ipc({
method: "POST",
path: "/v1/sessions/register",
timeoutMs: 3_000,
body: {
token: minted.token,
session_id: sessionIdForRegister,
mesh: mesh.slug,
display_name: displayName,
pid: process.pid,
cwd: process.cwd(),
...(role ? { role } : {}),
...(parsedGroups.length > 0 ? { groups: parsedGroups.map((g) => `@${g.name}${g.role ? `:${g.role}` : ""}`) } : {}),
...(presencePayload ? { presence: presencePayload } : {}),
},
}).catch(() => null);
// Pin the env name on a global so the spawn block below can pick it up.
(process as unknown as { _claudemeshTokenEnv?: { name: string; value: string } })._claudemeshTokenEnv = {
name: TOKEN_FILE_ENV,
value: minted.filePath,
};
} catch {
// Token mint or registration failed — proceed without per-session
// attribution. CLI invocations from the session will still work,
// they'll just default to user-level scope.
}
// 5. Print summary banner (wizard already handled all interactive config).
if (!args.quiet) {
printBanner(displayName, mesh.slug, role, parsedGroups, messageMode);
// 1.32.0+: broker welcome — confirm the per-session WS is actually
// attached and surface peer count + unread inbox so the user lands
// in claude code with a clear state instead of silent assumptions.
await printBrokerWelcome(mesh.slug);
}
// --- Install native MCP entries for deployed mesh services ---
@@ -686,10 +911,8 @@ export async function runLaunch(flags: LaunchFlags, rawArgs: string[]): Promise<
// passes -y / --yes. Without it, claudemesh tools still work because
// `claudemesh install` pre-approves them via allowedTools in settings.json.
// This keeps permissions tight for multi-person meshes.
// Session identity: --resume reuses existing session, otherwise generate new.
// When resuming, Claude Code reuses the session ID so the mesh peer identity persists.
const isResume = args.resume !== null || args.continueSession;
const claudeSessionId = isResume ? undefined : randomUUID();
// Session identity: claudeSessionId was generated above (4b) so the
// session-token registration could include it. Reuse here.
const claudeArgs = [
"--dangerously-load-development-channels",
@@ -734,7 +957,14 @@ export async function runLaunch(flags: LaunchFlags, rawArgs: string[]): Promise<
writeFileSync(claudeConfigPath, JSON.stringify(claudeConfig, null, 2) + "\n", "utf-8");
} catch { /* best effort */ }
}
// Ephemeral config dir
// The session-token file lives inside tmpDir, so the rmSync below
// removes the secret from disk. The daemon's session reaper notices
// within 30s that the launched session's pid is gone and drops the
// registry entry. An explicit DELETE on /v1/sessions is feasible only
// from an async exit hook, which adds complexity for ~30s of memory
// the reaper will reclaim anyway. Leaving as-is; revisit if the
// registry ever grows persistence.
// Ephemeral config dir (also drops the session-token file)
try {
rmSync(tmpDir, { recursive: true, force: true });
} catch { /* best effort */ }
@@ -796,6 +1026,7 @@ export async function runLaunch(flags: LaunchFlags, rawArgs: string[]): Promise<
CLAUDEMESH_CONFIG_DIR: tmpDir,
CLAUDEMESH_DISPLAY_NAME: displayName,
...(claudeSessionId ? { CLAUDEMESH_SESSION_ID: claudeSessionId } : {}),
...(sessionTokenFilePath ? { CLAUDEMESH_IPC_TOKEN_FILE: sessionTokenFilePath } : {}),
MCP_TIMEOUT: process.env.MCP_TIMEOUT ?? "30000",
MAX_MCP_OUTPUT_TOKENS: process.env.MAX_MCP_OUTPUT_TOKENS ?? "50000",
...(role ? { CLAUDEMESH_ROLE: role } : {}),
@@ -806,9 +1037,9 @@ export async function runLaunch(flags: LaunchFlags, rawArgs: string[]): Promise<
if (result.error) {
const err = result.error as NodeJS.ErrnoException;
if (err.code === "ENOENT") {
console.error("`claude` not found on PATH. Install Claude Code first.");
render.err("`claude` not found on PATH.", "Install Claude Code first.");
} else {
console.error(`failed to launch claude: ${err.message}`);
render.err(`failed to launch claude: ${err.message}`);
}
process.exit(1);
}
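
The welcome line built in `printBrokerWelcome` above treats a count of `-1` as "probe failed" and simply omits that part. A minimal sketch of just the formatting step, pulled out as a pure function so it can be exercised without a running daemon. `formatWelcomeLine` and its parameters are hypothetical names for illustration, not part of the commit, and ANSI coloring is left out:

```typescript
// Hypothetical helper: mirrors the string assembly in printBrokerWelcome,
// without colors. A count of -1 means "probe failed, skip this part".
function formatWelcomeLine(
  brokerState: string,
  peerCount: number,
  unread: number,
): string {
  const parts: string[] = [];
  // "open" is the only state rendered as "connected"; others print verbatim.
  parts.push(`broker ${brokerState === "open" ? "connected" : brokerState}`);
  if (peerCount >= 0) {
    parts.push(`${peerCount} peer${peerCount === 1 ? "" : "s"} online`);
  }
  if (unread >= 0) {
    parts.push(`${unread} unread`);
  }
  return parts.join(" · ");
}
```

Keeping the sentinel check next to each part means a failed probe degrades the line instead of blocking the launch path, which matches the best-effort intent stated in the doc comment.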


@@ -6,20 +6,24 @@
*/
import { readConfig, writeConfig } from "~/services/config/facade.js";
import { render } from "~/ui/render.js";
import { bold, dim } from "~/ui/styles.js";
import { EXIT } from "~/constants/exit-codes.js";
export function runLeave(args: string[]): void {
export function runLeave(args: string[]): number {
const slug = args[0];
if (!slug) {
console.error("Usage: claudemesh leave <slug>");
process.exit(1);
render.err("Usage: claudemesh leave <slug>");
return EXIT.INVALID_ARGS;
}
const config = readConfig();
const before = config.meshes.length;
config.meshes = config.meshes.filter((m) => m.slug !== slug);
if (config.meshes.length === before) {
console.error(`claudemesh: no joined mesh with slug "${slug}"`);
process.exit(1);
render.err(`no joined mesh with slug "${slug}"`);
return EXIT.NOT_FOUND;
}
writeConfig(config);
console.log(`Left mesh "${slug}". Remaining: ${config.meshes.length}`);
render.ok(`left ${bold(slug)}`, dim(`remaining: ${config.meshes.length}`));
return EXIT.SUCCESS;
}
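
The `runLeave` change above swaps `process.exit(1)` for returned exit codes. A rough sketch of why that helps: the decision logic becomes a pure function that can be checked without killing the process or touching disk. The `EXIT` values and the `leaveMesh` name below are assumptions for illustration; the real constants live in `~/constants/exit-codes.js` and are not shown in this diff.

```typescript
// Hypothetical exit codes, chosen only for this sketch.
const EXIT = { SUCCESS: 0, INVALID_ARGS: 2, NOT_FOUND: 3 } as const;

interface MeshConfig {
  meshes: { slug: string }[];
}

// Pure variant of the runLeave logic: returns the exit code and the
// updated config instead of calling process.exit or writeConfig.
function leaveMesh(
  config: MeshConfig,
  slug: string | undefined,
): { code: number; config: MeshConfig } {
  if (!slug) return { code: EXIT.INVALID_ARGS, config };
  const meshes = config.meshes.filter((m) => m.slug !== slug);
  // Nothing was filtered out, so the slug was never joined.
  if (meshes.length === config.meshes.length) {
    return { code: EXIT.NOT_FOUND, config };
  }
  return { code: EXIT.SUCCESS, config: { ...config, meshes } };
}
```

A caller at the CLI entry point could then set `process.exitCode` from the returned code, leaving a single place where the process actually terminates.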


@@ -6,7 +6,8 @@ import { readConfig, getConfigPath } from "~/services/config/facade.js";
import { getStoredToken } from "~/services/auth/facade.js";
import { request } from "~/services/api/facade.js";
import { URLS } from "~/constants/urls.js";
import { bold, dim, green, yellow, red } from "~/ui/styles.js";
import { bold, clay, dim, green, yellow } from "~/ui/styles.js";
import { render } from "~/ui/render.js";
const BROKER_HTTP = URLS.BROKER.replace("wss://", "https://").replace("ws://", "http://").replace("/ws", "");
@@ -25,23 +26,16 @@ export async function runList(): Promise<void> {
const config = readConfig();
const auth = getStoredToken();
// Try to fetch from server
// Try to fetch from server. Broker authenticates via Bearer token.
let serverMeshes: ServerMesh[] = [];
if (auth) {
try {
let userId = "";
try {
const payload = JSON.parse(Buffer.from(auth.session_token.split(".")[1]!, "base64url").toString()) as { sub?: string };
userId = payload.sub ?? "";
} catch {}
if (userId) {
const res = await request<{ meshes: ServerMesh[] }>({
path: `/cli/meshes?user_id=${userId}`,
baseUrl: BROKER_HTTP,
});
serverMeshes = res.meshes ?? [];
}
const res = await request<{ meshes: ServerMesh[] }>({
path: `/cli/meshes`,
baseUrl: BROKER_HTTP,
token: auth.session_token,
});
serverMeshes = res.meshes ?? [];
} catch {}
}
@@ -52,26 +46,26 @@ export async function runList(): Promise<void> {
const allSlugs = new Set([...localSlugs, ...serverSlugs]);
if (allSlugs.size === 0) {
console.log("\n No meshes yet.\n");
console.log(" Create one: claudemesh mesh create <name>");
console.log(" Join one: claudemesh mesh add <invite-url>\n");
render.section("no meshes yet");
render.info(`${dim("create one:")} ${bold("claudemesh create")} ${clay("<name>")}`);
render.info(`${dim("join one:")} ${bold("claudemesh")} ${clay("<invite-url>")}`);
render.blank();
return;
}
console.log("\n Your meshes:\n");
render.section(`your meshes (${allSlugs.size})`);
for (const slug of allSlugs) {
const local = config.meshes.find(m => m.slug === slug);
const server = serverMeshes.find(m => m.slug === slug);
const local = config.meshes.find((m) => m.slug === slug);
const server = serverMeshes.find((m) => m.slug === slug);
const name = server?.name ?? local?.name ?? slug;
const role = server?.role ?? "member";
const isOwner = server?.is_owner ?? false;
const roleLabel = isOwner ? "owner" : role;
const roleLabel = isOwner ? clay("owner") : dim(role);
const memberCount = server?.member_count;
const activePeers = server?.active_peers ?? 0;
// Status indicator
const inLocal = localSlugs.has(slug);
const inServer = serverSlugs.has(slug);
let status: string;
@@ -91,14 +85,14 @@ export async function runList(): Promise<void> {
const memberInfo = memberCount ? dim(`${memberCount} member${memberCount !== 1 ? "s" : ""}`) : "";
const parts = [roleLabel, memberInfo, status].filter(Boolean);
console.log(` ${icon} ${bold(name)} ${dim(slug)}`);
console.log(` ${parts.join(" · ")}`);
process.stdout.write(` ${icon} ${bold(name)} ${dim(slug)}\n`);
process.stdout.write(` ${parts.join(dim(" · "))}\n`);
}
console.log("");
if (serverMeshes.some(m => !localSlugs.has(m.slug))) {
console.log(dim(" ○ = server only — run `claudemesh mesh add` to use locally"));
process.stdout.write("\n");
if (serverMeshes.some((m) => !localSlugs.has(m.slug))) {
render.hint(`${dim("○")} = server only — run ${bold("claudemesh join")} to use locally`);
}
console.log(dim(` Config: ${getConfigPath()}`));
console.log("");
render.hint(`config: ${dim(getConfigPath())}`);
render.blank();
}
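
`BROKER_HTTP` above is derived from the broker WebSocket URL with chained string replaces. Because `.replace("/ws", "")` removes the first `/ws` it finds, a host that itself starts with `ws` (for example `ws.example.test`) would be mangled. The following is a sketch of an alternative derivation using the standard `URL` parser; it is not what the commit does, and `brokerHttpBase` is a hypothetical name:

```typescript
// Hypothetical alternative to the chained .replace() calls: parse the
// WebSocket URL, map the scheme, and strip only a trailing /ws path.
function brokerHttpBase(wsUrl: string): string {
  const u = new URL(wsUrl);
  // wss -> https; anything else (ws) -> http.
  const scheme = u.protocol === "wss:" ? "https:" : "http:";
  // Remove "/ws" (optionally followed by "/") only at the end of the path.
  const path = u.pathname.replace(/\/ws\/?$/, "");
  // u.host keeps a non-default port, e.g. "localhost:8787".
  return `${scheme}//${u.host}${path}`;
}
```

For the URL shapes the original expression handles correctly, this should produce the same result; the difference only shows up for hosts or paths that contain `/ws` earlier in the string.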


@@ -1,7 +1,8 @@
import { createInterface } from "node:readline";
import { loginWithDeviceCode, getStoredToken, clearToken, storeToken } from "~/services/auth/facade.js";
import { my } from "~/services/api/facade.js";
import { green, dim, bold, icons } from "~/ui/styles.js";
import { render } from "~/ui/render.js";
import { bold, dim } from "~/ui/styles.js";
import { EXIT } from "~/constants/exit-codes.js";
import { URLS } from "~/constants/urls.js";
@@ -13,16 +14,17 @@ function prompt(question: string): Promise<string> {
}
async function loginWithToken(): Promise<number> {
console.log(`\n Paste a token from ${dim(URLS.API_BASE + "/token")}`);
console.log(` ${dim("Generate one in your browser, then paste it here.")}\n`);
render.blank();
render.info(`Paste a token from ${dim(URLS.API_BASE + "/token")}`);
render.info(dim("Generate one in your browser, then paste it here."));
render.blank();
const token = await prompt(" Token: ");
if (!token) {
console.error(` ${icons.cross} No token provided.`);
render.err("No token provided.");
return EXIT.AUTH_FAILED;
}
// Decode JWT to get user info
let user = { id: "", display_name: "", email: "" };
try {
const parts = token.split(".");
@@ -31,7 +33,7 @@ async function loginWithToken(): Promise<number> {
sub?: string; email?: string; name?: string; exp?: number;
};
if (payload.exp && payload.exp < Date.now() / 1000) {
console.error(` ${icons.cross} Token expired. Generate a new one.`);
render.err("Token expired.", "Generate a new one.");
return EXIT.AUTH_FAILED;
}
user = {
@@ -41,12 +43,12 @@ async function loginWithToken(): Promise<number> {
};
}
} catch {
console.error(` ${icons.cross} Invalid token format.`);
render.err("Invalid token format.");
return EXIT.AUTH_FAILED;
}
storeToken({ session_token: token, user, token_source: "manual" });
console.log(` ${green(icons.check)} Signed in as ${user.display_name || user.email || "user"}.`);
render.ok(`signed in as ${bold(user.display_name || user.email || "user")}`);
return EXIT.SUCCESS;
}
@@ -55,7 +57,7 @@ async function syncMeshes(token: string): Promise<void> {
const meshes = await my.getMeshes(token);
if (meshes.length > 0) {
const names = meshes.map((m) => m.slug).join(", ");
console.log(` ${green(icons.check)} Synced ${meshes.length} mesh${meshes.length === 1 ? "" : "es"}: ${names}`);
render.ok(`synced ${meshes.length} mesh${meshes.length === 1 ? "" : "es"}`, names);
}
} catch {}
}
@@ -64,55 +66,55 @@ export async function login(): Promise<number> {
const existing = getStoredToken();
if (existing) {
const name = existing.user.display_name || existing.user.email || "unknown";
console.log(`\n Already signed in as ${bold(name)}.`);
console.log("");
console.log(` ${bold("1)")} Continue as ${name}`);
console.log(` ${bold("2)")} Sign in via browser`);
console.log(` ${bold("3)")} Paste a token from ${dim("claudemesh.com/token")}`);
console.log(` ${bold("4)")} Sign out`);
console.log("");
render.blank();
render.info(`Already signed in as ${bold(name)}.`);
render.blank();
process.stdout.write(` ${bold("1)")} Continue as ${name}\n`);
process.stdout.write(` ${bold("2)")} Sign in via browser\n`);
process.stdout.write(` ${bold("3)")} Paste a token from ${dim("claudemesh.com/token")}\n`);
process.stdout.write(` ${bold("4)")} Sign out\n`);
render.blank();
const choice = await prompt(" Choice [1]: ") || "1";
if (choice === "1") {
console.log(`\n ${green(icons.check)} Continuing as ${name}.`);
render.blank();
render.ok(`continuing as ${bold(name)}`);
return EXIT.SUCCESS;
}
if (choice === "4") {
clearToken();
console.log(` ${green(icons.check)} Signed out.`);
render.ok("signed out");
return EXIT.SUCCESS;
}
if (choice === "3") {
clearToken();
return loginWithToken();
}
// choice === "2" → fall through to browser login
clearToken();
console.log(` ${dim("Signing in…")}`);
render.info(dim("Signing in…"));
} else {
// Not logged in — show auth options
console.log(`\n ${bold("claudemesh")} — sign in to connect your terminal`);
console.log("");
console.log(` ${bold("1)")} Sign in via browser ${dim("(opens automatically)")}`);
console.log(` ${bold("2)")} Paste a token from ${dim("claudemesh.com/token")}`);
console.log("");
render.blank();
render.heading(`${bold("claudemesh")} — sign in to connect your terminal`);
render.blank();
process.stdout.write(` ${bold("1)")} Sign in via browser ${dim("(opens automatically)")}\n`);
process.stdout.write(` ${bold("2)")} Paste a token from ${dim("claudemesh.com/token")}\n`);
render.blank();
const choice = await prompt(" Choice [1]: ") || "1";
if (choice === "2") {
return loginWithToken();
}
// choice === "1" → fall through to browser login
}
try {
const result = await loginWithDeviceCode();
console.log(` ${green(icons.check)} Signed in as ${result.user.display_name}.`);
render.ok(`signed in as ${bold(result.user.display_name)}`);
await syncMeshes(result.session_token);
return EXIT.SUCCESS;
} catch (err) {
console.error(` ${icons.cross} Login failed: ${err instanceof Error ? err.message : err}`);
render.err(`Login failed: ${err instanceof Error ? err.message : err}`);
return EXIT.AUTH_FAILED;
}
}
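
`loginWithToken` above decodes the pasted JWT's payload to read user info and reject expired tokens. A minimal sketch of that decode-and-check step in isolation, assuming a three-part token with a base64url JSON payload. `decodeJwtPayload` and `isExpired` are hypothetical helper names, and nothing here verifies the signature (the server remains the authority on validity):

```typescript
import { Buffer } from "node:buffer";

interface JwtClaims {
  sub?: string;
  email?: string;
  name?: string;
  exp?: number; // seconds since the Unix epoch
}

// Decode the middle (payload) segment of a JWT. Returns null when the
// token is not three dot-separated parts or the payload is not a JSON object.
function decodeJwtPayload(token: string): JwtClaims | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  try {
    const json = Buffer.from(parts[1]!, "base64url").toString("utf-8");
    const parsed: unknown = JSON.parse(json);
    return typeof parsed === "object" && parsed !== null
      ? (parsed as JwtClaims)
      : null;
  } catch {
    return null;
  }
}

// exp is in seconds while Date.now() is in milliseconds, hence the / 1000.
function isExpired(claims: JwtClaims, nowMs: number = Date.now()): boolean {
  return claims.exp !== undefined && claims.exp < nowMs / 1000;
}
```

Passing `nowMs` explicitly keeps the expiry check deterministic in tests; production callers can rely on the default.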

apps/cli/src/commands/me.ts Normal file

@@ -0,0 +1,755 @@
/**
* `claudemesh me` — cross-mesh workspace overview for the caller's user.
*
* Calls GET /v1/me/workspace which aggregates over every mesh the
* authenticated user belongs to: peer count, online count, topic count,
* unread @-mention count per mesh + global totals.
*
* Auth: mints a temporary read-scoped REST apikey on whichever mesh
* the user has joined first (any mesh works — the endpoint resolves
* to the issuing user, not the apikey's mesh).
*
* v0.4.0 substrate. Future verbs (`me topics`, `me notifications`,
* `me activity`, `me search`) layer on top of similar aggregating
* endpoints once they ship.
*/
import { withRestKey } from "~/services/api/with-rest-key.js";
import { request } from "~/services/api/client.js";
import { readConfig } from "~/services/config/facade.js";
import { render } from "~/ui/render.js";
import { bold, clay, cyan, dim, green, yellow } from "~/ui/styles.js";
import { EXIT } from "~/constants/exit-codes.js";
/**
* /v1/me/* endpoints resolve the caller's user from the apikey issuer
* regardless of which mesh issued the key — every mesh works. When the
* user didn't pass --mesh, silently pick the first joined mesh for
* apikey-mint instead of prompting; the endpoint sees the same user.
*/
function resolveMeshForMint(explicit: string | null | undefined): string | null {
if (explicit) return explicit;
const cfg = readConfig();
return cfg.meshes[0]?.slug ?? null;
}
interface WorkspaceMesh {
meshId: string;
slug: string;
name: string;
memberId: string;
myRole: string;
joinedAt: string;
peers: number;
online: number;
topics: number;
unreadMentions: number;
}
interface WorkspaceResponse {
userId: string;
meshes: WorkspaceMesh[];
totals: {
meshes: number;
peers: number;
online: number;
topics: number;
unreadMentions: number;
};
}
export interface MeFlags {
mesh?: string;
json?: boolean;
}
export async function runMe(flags: MeFlags): Promise<number> {
return withRestKey(
{
meshSlug: resolveMeshForMint(flags.mesh),
purpose: "workspace-overview",
capabilities: ["read"],
},
async ({ secret }) => {
const ws = await request<WorkspaceResponse>({
path: "/api/v1/me/workspace",
token: secret,
});
if (flags.json) {
console.log(JSON.stringify(ws, null, 2));
return EXIT.SUCCESS;
}
render.section(
`${clay("workspace")}${bold(ws.userId.slice(0, 8))} ${dim(
`· ${ws.totals.meshes} mesh${ws.totals.meshes === 1 ? "" : "es"}`,
)}`,
);
const totalsLine = [
`${green(String(ws.totals.online))}/${ws.totals.peers} online`,
`${ws.totals.topics} topic${ws.totals.topics === 1 ? "" : "s"}`,
ws.totals.unreadMentions > 0
? yellow(`${ws.totals.unreadMentions} unread @you`)
: dim("0 unread @you"),
].join(dim(" · "));
process.stdout.write(" " + totalsLine + "\n\n");
if (ws.meshes.length === 0) {
process.stdout.write(
dim(" no meshes joined — run `claudemesh new` or accept an invite\n"),
);
return EXIT.SUCCESS;
}
const slugWidth = Math.max(...ws.meshes.map((m) => m.slug.length), 8);
for (const m of ws.meshes) {
const slug = cyan(m.slug.padEnd(slugWidth));
const peers = `${m.online}/${m.peers}`;
const role = dim(m.myRole);
const unread =
m.unreadMentions > 0
? " " + yellow(`${m.unreadMentions} @you`)
: "";
process.stdout.write(
` ${slug} ${peers.padStart(5)} online ${dim(
String(m.topics).padStart(2) + " topics",
)} ${role}${unread}\n`,
);
}
return EXIT.SUCCESS;
},
);
}
interface WorkspaceTopic {
topicId: string;
name: string;
description: string | null;
visibility: string;
createdAt: string;
meshId: string;
meshSlug: string;
meshName: string;
memberId: string;
unread: number;
lastMessageAt: string | null;
}
interface WorkspaceTopicsResponse {
topics: WorkspaceTopic[];
totals: { topics: number; unread: number };
}
export interface MeTopicsFlags extends MeFlags {
unread?: boolean;
}
export async function runMeTopics(flags: MeTopicsFlags): Promise<number> {
return withRestKey(
{
meshSlug: resolveMeshForMint(flags.mesh),
purpose: "workspace-topics",
capabilities: ["read"],
},
async ({ secret }) => {
const ws = await request<WorkspaceTopicsResponse>({
path: "/api/v1/me/topics",
token: secret,
});
const visible = flags.unread
? ws.topics.filter((t) => t.unread > 0)
: ws.topics;
if (flags.json) {
console.log(
JSON.stringify(
{ topics: visible, totals: ws.totals },
null,
2,
),
);
return EXIT.SUCCESS;
}
render.section(
`${clay("topics")}${ws.totals.topics} across all meshes ${dim(
ws.totals.unread > 0
? `· ${ws.totals.unread} unread`
: "· all read",
)}`,
);
if (visible.length === 0) {
process.stdout.write(
dim(
flags.unread
? " no unread topics\n"
: " no topics — run `claudemesh topic create #general`\n",
),
);
return EXIT.SUCCESS;
}
const slugWidth = Math.max(...visible.map((t) => t.meshSlug.length), 6);
const nameWidth = Math.max(...visible.map((t) => t.name.length), 8);
for (const t of visible) {
const slug = dim(t.meshSlug.padEnd(slugWidth));
const name = cyan(t.name.padEnd(nameWidth));
const unread =
t.unread > 0
? yellow(`${t.unread} unread`.padStart(10))
: dim("·".padStart(10));
const last = t.lastMessageAt
? dim(formatRelativeTime(t.lastMessageAt))
: dim("never");
process.stdout.write(` ${slug} ${name} ${unread} ${last}\n`);
}
return EXIT.SUCCESS;
},
);
}
interface WorkspaceNotification {
notificationId: string;
messageId: string;
topicId: string;
topicName: string;
meshId: string;
meshSlug: string;
meshName: string;
senderName: string | null;
snippet: string | null;
ciphertext: string | null;
bodyVersion: number;
read: boolean;
readAt: string | null;
createdAt: string;
}
interface WorkspaceNotificationsResponse {
notifications: WorkspaceNotification[];
totals: { unread: number; total: number };
}
export interface MeNotificationsFlags extends MeFlags {
all?: boolean;
since?: string;
}
export async function runMeNotifications(
flags: MeNotificationsFlags,
): Promise<number> {
return withRestKey(
{
meshSlug: resolveMeshForMint(flags.mesh),
purpose: "workspace-notifications",
capabilities: ["read"],
},
async ({ secret }) => {
const params = new URLSearchParams();
if (flags.all) params.set("include", "all");
if (flags.since) params.set("since", flags.since);
const path =
"/api/v1/me/notifications" +
(params.toString() ? `?${params.toString()}` : "");
const ws = await request<WorkspaceNotificationsResponse>({
path,
token: secret,
});
if (flags.json) {
console.log(JSON.stringify(ws, null, 2));
return EXIT.SUCCESS;
}
const headerLabel = flags.all ? "@-mentions (all)" : "@-mentions (unread)";
render.section(
`${clay(headerLabel)}${ws.totals.total} ${dim(
ws.totals.unread > 0 ? `· ${ws.totals.unread} unread` : "· nothing pending",
)}`,
);
if (ws.notifications.length === 0) {
process.stdout.write(
dim(
flags.all
? " no @-mentions in window\n"
: " inbox zero — nothing waiting\n",
),
);
return EXIT.SUCCESS;
}
const slugWidth = Math.max(
...ws.notifications.map((n) => n.meshSlug.length),
6,
);
for (const n of ws.notifications) {
const slug = dim(n.meshSlug.padEnd(slugWidth));
const topic = cyan(`#${n.topicName}`);
const sender = n.senderName ? `from ${n.senderName}` : "from ?";
const ago = formatRelativeTime(n.createdAt);
const dot = n.read ? dim("·") : yellow("●");
const snippet =
n.snippet ?? (n.ciphertext ? dim("[encrypted]") : dim("[empty]"));
process.stdout.write(
` ${dot} ${slug} ${topic} ${dim(sender)} ${dim(ago)}\n` +
` ${snippet.length > 200 ? snippet.slice(0, 200) + "…" : snippet}\n`,
);
}
return EXIT.SUCCESS;
},
);
}
interface WorkspaceActivity {
messageId: string;
topicId: string;
topicName: string;
meshId: string;
meshSlug: string;
meshName: string;
senderName: string;
senderMemberId: string;
snippet: string | null;
ciphertext: string | null;
bodyVersion: number;
createdAt: string;
}
interface WorkspaceActivityResponse {
activity: WorkspaceActivity[];
totals: { events: number };
}
export interface MeActivityFlags extends MeFlags {
since?: string;
}
export async function runMeActivity(flags: MeActivityFlags): Promise<number> {
return withRestKey(
{
meshSlug: resolveMeshForMint(flags.mesh),
purpose: "workspace-activity",
capabilities: ["read"],
},
async ({ secret }) => {
const params = new URLSearchParams();
if (flags.since) params.set("since", flags.since);
const path =
"/api/v1/me/activity" +
(params.toString() ? `?${params.toString()}` : "");
const ws = await request<WorkspaceActivityResponse>({
path,
token: secret,
});
if (flags.json) {
console.log(JSON.stringify(ws, null, 2));
return EXIT.SUCCESS;
}
render.section(
`${clay("activity")}${ws.totals.events} ${dim(
flags.since ? `since ${flags.since}` : "in the last 24h",
)}`,
);
if (ws.activity.length === 0) {
process.stdout.write(dim(" quiet — no activity in window\n"));
return EXIT.SUCCESS;
}
const slugWidth = Math.max(
...ws.activity.map((a) => a.meshSlug.length),
6,
);
for (const a of ws.activity) {
const slug = dim(a.meshSlug.padEnd(slugWidth));
const topic = cyan(`#${a.topicName}`);
const sender = a.senderName ?? "?";
const ago = formatRelativeTime(a.createdAt);
const snippet =
a.snippet ?? (a.ciphertext ? dim("[encrypted]") : dim("[empty]"));
process.stdout.write(
` ${slug} ${topic} ${dim(sender + " ·")} ${dim(ago)}\n` +
` ${snippet.length > 200 ? snippet.slice(0, 200) + "…" : snippet}\n`,
);
}
return EXIT.SUCCESS;
},
);
}
interface WorkspaceSearchTopicHit {
id: string;
name: string;
description: string | null;
meshId: string;
meshSlug: string;
meshName: string;
}
interface WorkspaceSearchMessageHit {
messageId: string;
topicId: string;
topicName: string;
meshId: string;
meshSlug: string;
senderName: string;
snippet: string | null;
bodyVersion: number;
createdAt: string;
}
interface WorkspaceSearchResponse {
query: string;
topics: WorkspaceSearchTopicHit[];
messages: WorkspaceSearchMessageHit[];
totals: { topics: number; messages: number };
}
export interface MeSearchFlags extends MeFlags {
query: string;
}
export async function runMeSearch(flags: MeSearchFlags): Promise<number> {
if (!flags.query || flags.query.length < 2) {
process.stderr.write(
"Usage: claudemesh me search <query> (min 2 chars)\n",
);
return EXIT.INVALID_ARGS;
}
return withRestKey(
{
meshSlug: resolveMeshForMint(flags.mesh),
purpose: "workspace-search",
capabilities: ["read"],
},
async ({ secret }) => {
const params = new URLSearchParams({ q: flags.query });
const ws = await request<WorkspaceSearchResponse>({
path: `/api/v1/me/search?${params.toString()}`,
token: secret,
});
if (flags.json) {
console.log(JSON.stringify(ws, null, 2));
return EXIT.SUCCESS;
}
render.section(
`${clay("search")} — "${flags.query}" ${dim(
`${ws.totals.topics} topic${ws.totals.topics === 1 ? "" : "s"}, ` +
`${ws.totals.messages} message${ws.totals.messages === 1 ? "" : "s"}`,
)}`,
);
if (ws.topics.length === 0 && ws.messages.length === 0) {
process.stdout.write(dim(" no matches\n"));
return EXIT.SUCCESS;
}
if (ws.topics.length > 0) {
process.stdout.write(dim("\n topics\n"));
const slugWidth = Math.max(
...ws.topics.map((t) => t.meshSlug.length),
6,
);
for (const t of ws.topics) {
const slug = dim(t.meshSlug.padEnd(slugWidth));
const name = cyan(`#${t.name}`);
const desc = t.description ? dim(`${t.description}`) : "";
process.stdout.write(` ${slug} ${name}${desc}\n`);
}
}
if (ws.messages.length > 0) {
process.stdout.write(dim("\n messages\n"));
const slugWidth = Math.max(
...ws.messages.map((m) => m.meshSlug.length),
6,
);
for (const m of ws.messages) {
const slug = dim(m.meshSlug.padEnd(slugWidth));
const topic = cyan(`#${m.topicName}`);
const sender = m.senderName;
const ago = formatRelativeTime(m.createdAt);
const snippet =
m.snippet ??
(m.bodyVersion === 2 ? dim("[encrypted — open the topic to decrypt]") : dim("[empty]"));
const highlighted =
m.snippet
? highlightMatch(snippet, flags.query)
: snippet;
process.stdout.write(
` ${slug} ${topic} ${dim(sender + " ·")} ${dim(ago)}\n` +
` ${highlighted}\n`,
);
}
}
return EXIT.SUCCESS;
},
);
}
function highlightMatch(text: string, query: string): string {
if (!query) return text;
const idx = text.toLowerCase().indexOf(query.toLowerCase());
if (idx === -1) return text;
const before = text.slice(0, idx);
const match = text.slice(idx, idx + query.length);
const after = text.slice(idx + query.length);
return `${before}${yellow(match)}${after}`;
}
interface WorkspaceTask {
id: string;
meshId: string;
meshSlug: string;
title: string;
assignee: string | null;
claimedByName: string | null;
priority: string;
status: string;
tags: string[];
result: string | null;
createdByName: string | null;
createdAt: string;
claimedAt: string | null;
completedAt: string | null;
}
interface WorkspaceTasksResponse {
tasks: WorkspaceTask[];
totals: { open: number; claimed: number; completed: number };
}
export interface MeTasksFlags extends MeFlags {
status?: string;
}
export async function runMeTasks(flags: MeTasksFlags): Promise<number> {
return withRestKey(
{
meshSlug: resolveMeshForMint(flags.mesh),
purpose: "workspace-tasks",
capabilities: ["read"],
},
async ({ secret }) => {
const params = new URLSearchParams();
if (flags.status) params.set("status", flags.status);
const path =
"/api/v1/me/tasks" +
(params.toString() ? `?${params.toString()}` : "");
const ws = await request<WorkspaceTasksResponse>({
path,
token: secret,
});
if (flags.json) {
console.log(JSON.stringify(ws, null, 2));
return EXIT.SUCCESS;
}
render.section(
`${clay("tasks")} ${dim(
`${ws.totals.open} open · ${ws.totals.claimed} in-flight · ${ws.totals.completed} done`,
)}`,
);
if (ws.tasks.length === 0) {
process.stdout.write(dim(" no tasks in window\n"));
return EXIT.SUCCESS;
}
const slugWidth = Math.max(...ws.tasks.map((t) => t.meshSlug.length), 6);
for (const t of ws.tasks) {
const slug = dim(t.meshSlug.padEnd(slugWidth));
const status =
t.status === "open"
? yellow("open".padEnd(8))
: t.status === "claimed"
? cyan("working".padEnd(8))
: green("done".padEnd(8));
const prio =
t.priority === "urgent"
? yellow("!")
: t.priority === "low"
? dim("·")
: " ";
const claimer = t.claimedByName ? dim(` ${t.claimedByName}`) : "";
process.stdout.write(
` ${slug} ${prio} ${status} ${t.title}${claimer}\n`,
);
}
return EXIT.SUCCESS;
},
);
}
interface WorkspaceStateEntry {
meshId: string;
meshSlug: string;
key: string;
value: unknown;
updatedByName: string | null;
updatedAt: string;
}
interface WorkspaceStateResponse {
entries: WorkspaceStateEntry[];
totals: { entries: number; meshes: number };
}
export interface MeStateFlags extends MeFlags {
key?: string;
}
export async function runMeState(flags: MeStateFlags): Promise<number> {
return withRestKey(
{
meshSlug: resolveMeshForMint(flags.mesh),
purpose: "workspace-state",
capabilities: ["read"],
},
async ({ secret }) => {
const params = new URLSearchParams();
if (flags.key) params.set("key", flags.key);
const path =
"/api/v1/me/state" +
(params.toString() ? `?${params.toString()}` : "");
const ws = await request<WorkspaceStateResponse>({
path,
token: secret,
});
if (flags.json) {
console.log(JSON.stringify(ws, null, 2));
return EXIT.SUCCESS;
}
render.section(
`${clay("state")} ${ws.totals.entries} entr${ws.totals.entries === 1 ? "y" : "ies"} ${dim(
`across ${ws.totals.meshes} mesh${ws.totals.meshes === 1 ? "" : "es"}`,
)}`,
);
if (ws.entries.length === 0) {
process.stdout.write(dim(" no state entries\n"));
return EXIT.SUCCESS;
}
const slugWidth = Math.max(...ws.entries.map((e) => e.meshSlug.length), 6);
const keyWidth = Math.max(...ws.entries.map((e) => e.key.length), 8);
for (const e of ws.entries) {
const slug = dim(e.meshSlug.padEnd(slugWidth));
const key = cyan(e.key.padEnd(keyWidth));
const valueStr =
typeof e.value === "string"
? e.value
: JSON.stringify(e.value);
const trimmed =
valueStr.length > 80 ? valueStr.slice(0, 80) + "…" : valueStr;
const ago = dim(formatRelativeTime(e.updatedAt));
process.stdout.write(` ${slug} ${key} ${trimmed} ${ago}\n`);
}
return EXIT.SUCCESS;
},
);
}
interface WorkspaceMemory {
id: string;
meshId: string;
meshSlug: string;
content: string;
tags: string[];
rememberedByName: string | null;
rememberedAt: string;
}
interface WorkspaceMemoryResponse {
query: string;
memories: WorkspaceMemory[];
totals: { entries: number };
}
export interface MeMemoryFlags extends MeFlags {
query?: string;
}
export async function runMeMemory(flags: MeMemoryFlags): Promise<number> {
return withRestKey(
{
meshSlug: resolveMeshForMint(flags.mesh),
purpose: "workspace-memory",
capabilities: ["read"],
},
async ({ secret }) => {
const params = new URLSearchParams();
if (flags.query) params.set("q", flags.query);
const path =
"/api/v1/me/memory" +
(params.toString() ? `?${params.toString()}` : "");
const ws = await request<WorkspaceMemoryResponse>({
path,
token: secret,
});
if (flags.json) {
console.log(JSON.stringify(ws, null, 2));
return EXIT.SUCCESS;
}
const headerLabel = flags.query
? `recall — "${flags.query}"`
: "recall — last 30 days";
render.section(
`${clay(headerLabel)} ${dim(`${ws.totals.entries} match${ws.totals.entries === 1 ? "" : "es"}`)}`,
);
if (ws.memories.length === 0) {
process.stdout.write(dim(" no memories\n"));
return EXIT.SUCCESS;
}
const slugWidth = Math.max(
...ws.memories.map((m) => m.meshSlug.length),
6,
);
for (const m of ws.memories) {
const slug = dim(m.meshSlug.padEnd(slugWidth));
const ago = dim(formatRelativeTime(m.rememberedAt));
const tags =
m.tags.length > 0
? " " + dim("[" + m.tags.join(", ") + "]")
: "";
const content =
m.content.length > 240 ? m.content.slice(0, 240) + "…" : m.content;
process.stdout.write(` ${slug} ${ago}${tags}\n ${content}\n`);
}
return EXIT.SUCCESS;
},
);
}
function formatRelativeTime(iso: string): string {
const then = new Date(iso).getTime();
const now = Date.now();
const sec = Math.max(0, Math.floor((now - then) / 1000));
if (sec < 60) return `${sec}s ago`;
if (sec < 3600) return `${Math.floor(sec / 60)}m ago`;
if (sec < 86_400) return `${Math.floor(sec / 3600)}h ago`;
if (sec < 86_400 * 30) return `${Math.floor(sec / 86_400)}d ago`;
if (sec < 86_400 * 365)
return `${Math.floor(sec / (86_400 * 30))}mo ago`;
return `${Math.floor(sec / (86_400 * 365))}y ago`;
}
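
The relative-time buckets used by formatRelativeTime above can be sketched as a pure function. This is a hypothetical variant, not part of the commit: the name formatRelativeTimeAt, the injectable `now` parameter, and the "unknown" fallback are assumptions added for illustration. As written, the original returns "NaNy ago" when the timestamp cannot be parsed, because every comparison against NaN is false.

```typescript
// Hypothetical variant of formatRelativeTime (names and fallback are assumptions).
// `now` is passed in so the function is deterministic and easy to test.
function formatRelativeTimeAt(iso: string, now: number): string {
  const then = new Date(iso).getTime();
  // An unparseable timestamp yields NaN; report it explicitly.
  if (Number.isNaN(then)) return "unknown";
  // Clamp future timestamps (clock skew) to zero seconds.
  const sec = Math.max(0, Math.floor((now - then) / 1000));
  const DAY = 86_400;
  if (sec < 60) return `${sec}s ago`;
  if (sec < 3600) return `${Math.floor(sec / 60)}m ago`;
  if (sec < DAY) return `${Math.floor(sec / 3600)}h ago`;
  if (sec < DAY * 30) return `${Math.floor(sec / DAY)}d ago`;
  // Months are approximated as 30 days, years as 365 days, as in the original.
  if (sec < DAY * 365) return `${Math.floor(sec / (DAY * 30))}mo ago`;
  return `${Math.floor(sec / (DAY * 365))}y ago`;
}
```

Passing `Date.now()` as the second argument gives the same buckets as the original for valid timestamps.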

@@ -0,0 +1,78 @@
/**
* `claudemesh member list` — every (non-revoked) member of the chosen
* mesh, decorated with online state. Distinct from `peer list`: `peer list`
* shows live WS sessions, while `member list` shows the full roster.
*/
import { withRestKey } from "~/services/api/with-rest-key.js";
import { request } from "~/services/api/client.js";
import { render } from "~/ui/render.js";
import { bold, clay, dim, green, red, yellow } from "~/ui/styles.js";
import { EXIT } from "~/constants/exit-codes.js";
export interface MemberFlags {
mesh?: string;
json?: boolean;
/** Show only online members. */
online?: boolean;
}
interface MemberRow {
memberId: string;
pubkey: string;
displayName: string;
role: string;
isHuman: boolean;
joinedAt: string;
online: boolean;
status: string;
summary: string | null;
}
function statusGlyph(m: MemberRow): string {
if (!m.online) return dim("○");
if (m.status === "dnd") return red("●");
if (m.status === "working") return yellow("●");
return green("●");
}
export async function runMemberList(flags: MemberFlags): Promise<number> {
return withRestKey(
{ meshSlug: flags.mesh ?? null, purpose: "members" },
async ({ secret, meshSlug }) => {
const result = await request<{ members: MemberRow[] }>({
path: "/api/v1/members",
token: secret,
});
const filtered = flags.online
? result.members.filter((m) => m.online)
: result.members;
if (flags.json) {
console.log(JSON.stringify({ members: filtered }, null, 2));
return EXIT.SUCCESS;
}
if (filtered.length === 0) {
render.info(
dim(flags.online ? `no online members in ${meshSlug}.` : `no members in ${meshSlug}.`),
);
return EXIT.SUCCESS;
}
const onlineCount = result.members.filter((m) => m.online).length;
render.section(
`${clay(meshSlug)} members (${onlineCount}/${result.members.length} online)`,
);
for (const m of filtered) {
const tag = m.isHuman ? dim("human") : dim("bot");
const summary = m.summary ? ` ${dim(m.summary)}` : "";
process.stdout.write(
` ${statusGlyph(m)} ${bold(m.displayName)} ${tag} ${dim(m.role)} ${dim(m.pubkey.slice(0, 8))}${summary}\n`,
);
}
return EXIT.SUCCESS;
},
);
}
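
One detail of runMemberList above is easy to miss: the `--online` flag narrows the printed rows, but the header count is computed over the unfiltered roster. A minimal sketch of that split follows. The MemberLike type and the selectMembers helper are hypothetical names introduced here, and the header string omits the mesh slug and colors.

```typescript
// Hypothetical helper (not in the commit) isolating the filter/count logic.
interface MemberLike {
  displayName: string;
  online: boolean;
}

function selectMembers(
  members: MemberLike[],
  onlineOnly: boolean,
): { rows: MemberLike[]; header: string } {
  // Rows respect the --online filter.
  const rows = onlineOnly ? members.filter((m) => m.online) : members;
  // The header always counts against the full roster, filtered or not.
  const onlineCount = members.filter((m) => m.online).length;
  return { rows, header: `members (${onlineCount}/${members.length} online)` };
}
```

Keeping the two computations separate means the header reads "1/3 online" even when only the single online row is listed.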


@@ -1,6 +1,7 @@
import { create as createMesh } from "~/services/mesh/facade.js";
import { getStoredToken } from "~/services/auth/facade.js";
import { green, dim, icons } from "~/ui/styles.js";
import { render } from "~/ui/render.js";
import { bold, clay, dim } from "~/ui/styles.js";
import { EXIT } from "~/constants/exit-codes.js";
export async function newMesh(
@@ -8,16 +9,17 @@ export async function newMesh(
opts: { template?: string; description?: string; json?: boolean },
): Promise<number> {
if (!name) {
console.error(" Usage: claudemesh mesh create <name>");
render.err("Usage: claudemesh create <name>");
return EXIT.INVALID_ARGS;
}
if (!getStoredToken()) {
console.log(dim(" Not signed in — starting login…\n"));
render.info(dim("not signed in — starting login…"));
render.blank();
const { login } = await import("./login.js");
const loginResult = await login();
if (loginResult !== EXIT.SUCCESS) return loginResult;
console.log("");
render.blank();
}
try {
@@ -28,20 +30,26 @@ export async function newMesh(
if (opts.json) {
console.log(JSON.stringify({ schema_version: "1.0", ...result }, null, 2));
} else {
console.log(`\n ${green(icons.check)} Created "${result.slug}" (id: ${result.id})`);
console.log(` ${green(icons.check)} You're the owner`);
console.log(` ${green(icons.check)} Joined locally`);
console.log(`\n Share with: claudemesh mesh share\n`);
return EXIT.SUCCESS;
}
render.section(`created ${bold(result.slug)}`);
render.kv([
["id", dim(result.id)],
["role", clay("owner")],
["local", "joined"],
]);
render.blank();
render.hint(`share with: ${bold("claudemesh share")}`);
render.blank();
return EXIT.SUCCESS;
} catch (err) {
const msg = err instanceof Error ? err.message : String(err);
if (msg.includes("409") || msg.includes("already exists")) {
console.error(` ${icons.cross} A mesh with this name already exists. Try a different name.`);
render.err("A mesh with this name already exists.", "Try a different name.");
} else {
console.error(` ${icons.cross} Failed: ${msg}`);
render.err(`Failed: ${msg}`);
}
return EXIT.INTERNAL_ERROR;
}


@@ -0,0 +1,93 @@
/**
* `claudemesh notification list` — recent @-mentions of the viewer
* across topics in the chosen mesh. Server-side regex match over the
* v0.2.0 plaintext-base64 ciphertext; the v0.3.0 per-topic encryption
* cut will move this to a notification table populated at write time.
*/
import { withRestKey } from "~/services/api/with-rest-key.js";
import { request } from "~/services/api/client.js";
import { render } from "~/ui/render.js";
import { bold, clay, dim } from "~/ui/styles.js";
import { EXIT } from "~/constants/exit-codes.js";
export interface NotificationFlags {
mesh?: string;
json?: boolean;
since?: string;
}
interface NotificationRow {
id: string;
topicId: string;
topicName: string;
senderName: string;
senderPubkey: string;
ciphertext: string;
createdAt: string;
}
interface NotificationsResponse {
notifications: NotificationRow[];
since: string;
mentionedAs: string;
}
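function decodeCiphertext(b64: string): string {
// Node's base64 decoder skips invalid characters instead of throwing, so
// the catch below is only a defensive fallback, not the normal failure path.
try {
return Buffer.from(b64, "base64").toString("utf-8");
} catch {
return "[decode failed]";
}
}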
function fmtRelative(iso: string): string {
const ms = Date.now() - new Date(iso).getTime();
if (ms < 60_000) return "now";
if (ms < 3_600_000) return `${Math.floor(ms / 60_000)}m`;
if (ms < 86_400_000) return `${Math.floor(ms / 3_600_000)}h`;
return `${Math.floor(ms / 86_400_000)}d`;
}
export async function runNotificationList(flags: NotificationFlags): Promise<number> {
return withRestKey(
{ meshSlug: flags.mesh ?? null, purpose: "notifications" },
async ({ secret }) => {
const qs = flags.since ? `?since=${encodeURIComponent(flags.since)}` : "";
const result = await request<NotificationsResponse>({
path: `/api/v1/notifications${qs}`,
token: secret,
});
if (flags.json) {
const decoded = result.notifications.map((n) => ({
...n,
message: decodeCiphertext(n.ciphertext),
}));
console.log(JSON.stringify({ ...result, notifications: decoded }, null, 2));
return EXIT.SUCCESS;
}
if (result.notifications.length === 0) {
render.info(
dim(`no mentions of @${result.mentionedAs} since ${result.since}.`),
);
return EXIT.SUCCESS;
}
render.section(
`mentions of @${bold(result.mentionedAs)} (${result.notifications.length})`,
);
for (const n of result.notifications) {
const when = fmtRelative(n.createdAt);
const msg = decodeCiphertext(n.ciphertext).replace(/\s+/g, " ").trim();
const snippet = msg.length > 100 ? msg.slice(0, 97) + "…" : msg;
process.stdout.write(
` ${clay("#" + n.topicName)} ${dim(when)} ${bold(n.senderName)}\n`,
);
process.stdout.write(` ${snippet}\n`);
}
return EXIT.SUCCESS;
},
);
}
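
The snippet shown for each mention in runNotificationList above is built in two steps: whitespace is collapsed to single spaces, then anything longer than 100 characters is cut to 97 characters plus an ellipsis. A small sketch of that step in isolation, where toSnippet is a hypothetical name used only for illustration:

```typescript
// Hypothetical helper mirroring the inline snippet logic of runNotificationList.
function toSnippet(message: string): string {
  // Collapse newlines, tabs, and runs of spaces so the snippet fits one line.
  const flat = message.replace(/\s+/g, " ").trim();
  // Messages over 100 characters keep their first 97 plus a one-character ellipsis.
  return flat.length > 100 ? flat.slice(0, 97) + "…" : flat;
}
```

Note that a truncated snippet is 98 characters long, not 100, since the ellipsis is a single character.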


@@ -2,6 +2,14 @@
* `claudemesh peers` — list connected peers in the mesh.
*
* Shows all meshes by default, or filter with --mesh.
*
* Warm path: dials the per-mesh bridge socket the push-pipe holds open.
* Cold path: opens its own WS via `withMesh`. Bridge fall-through is
* transparent — output is identical.
*
* `--json` accepts an optional comma-separated field list:
* claudemesh peers --json (full record)
* claudemesh peers --json name,pubkey,status (projection)
*/
import { withMesh } from "./connect.js";
@@ -11,12 +19,208 @@ import { bold, dim, green, yellow } from "~/ui/styles.js";
export interface PeersFlags {
mesh?: string;
json?: boolean;
/** `true`/`undefined` = full record; comma-separated string = field projection. */
json?: boolean | string;
/** When false (default), hide control-plane presence rows from the
* human renderer — they're infrastructure (daemon-WS member-keyed
* presence), not interactive peers, and confused users into thinking
* the daemon counted as a "peer". The JSON output still includes them
* so scripts that need a full inventory can opt in via --all (or
* just consume JSON).
*
* Source of truth is the broker-side `role` field
* (`'control-plane' | 'session' | 'service'`). Older brokers don't
* emit `role` yet — this code falls back to treating missing role as
* `'session'` so legacy peer rows stay visible. */
all?: boolean;
}
/**
* Broker-emitted peer classification, added 2026-05-04. Older brokers
* may omit it — treat missing as 'session' so legacy meshes still
* render their peers (and don't accidentally hide them all). The CLI
* never emits 'control-plane' on its own; that comes from the broker.
*/
export type PeerRole = "control-plane" | "session" | "service";
interface PeerRecord {
pubkey: string;
/** Stable member pubkey (independent of session). Two rows that share
* this value belong to the same member, so a sender reaching either one
* is talking to the same person across all their open sessions. */
memberPubkey?: string;
/** Per-launch session identifier (uuid). Used by the renderer to
* disambiguate sibling sessions of the same member that otherwise
* look identical (same name, same cwd). */
sessionId?: string;
displayName: string;
status?: string;
summary?: string;
groups: Array<{ name: string; role?: string }>;
/** Top-level convenience alias for `profile.role`, lifted by the CLI
* since 1.31.5 so JSON consumers (the agent-vibes claudemesh skill,
* launched-session LLMs) see the user-supplied role string at the
* shape's top level. Same value as `profile.role`. Distinct from
* `peerRole` below — that's the broker's presence-class taxonomy. */
role?: string;
/** Broker-emitted presence classification: 'control-plane' | 'session'
* | 'service'. Source of truth for the --all visibility filter and
* the default-hide rule. Older brokers omit this; the CLI fills
* missing values with 'session' so legacy peer rows stay visible.
*
* Renamed from `role` to avoid collision with 1.31.5's profile.role
* lift above. Wire-level field on the broker is also `peerRole`. */
peerRole?: PeerRole;
peerType?: string;
channel?: string;
model?: string;
cwd?: string;
/** Peer-level profile metadata (set via `claudemesh profile`). The
* broker passes this through verbatim; the most common field is
* `role` ("lead", "reviewer", "human", etc.) but capabilities, bio,
* avatar, and title also live here when set. */
profile?: {
role?: string;
title?: string;
bio?: string;
avatar?: string;
capabilities?: string[];
[k: string]: unknown;
};
/** True when this peer is one of the caller's own member's sessions.
* Set in the cli (not the broker) by comparing memberPubkey against
* the caller's stable JoinedMesh.pubkey. */
isSelf?: boolean;
/** When isSelf is true, true if this is the exact session running
* the command (vs a sibling session of the same member). */
isThisSession?: boolean;
[k: string]: unknown;
}
/** Friendly aliases — `name` is what users will type; broker calls it `displayName`. */
const FIELD_ALIAS: Record<string, string> = {
name: "displayName",
};
function projectFields(record: PeerRecord, fields: string[]): Record<string, unknown> {
const out: Record<string, unknown> = {};
for (const f of fields) {
const sourceKey = FIELD_ALIAS[f] ?? f;
out[f] = (record as Record<string, unknown>)[sourceKey];
}
return out;
}
async function listPeersForMesh(slug: string): Promise<PeerRecord[]> {
const config = readConfig();
const joined = config.meshes.find((m) => m.slug === slug);
const selfMemberPubkey = joined?.pubkey ?? null;
// Resolve our own session pubkey via the daemon's /v1/sessions/me when
// we're inside a launched session. Without this, isThisSession can't
// be set on the daemon path (only on the cold path where a fresh WS
// creates the keypair), and the renderer can't tell the user which
// row in `peer list` is them.
let selfSessionPubkey: string | null = null;
try {
const { getSessionInfo } = await import("~/services/session/resolve.js");
const sess = await getSessionInfo();
if (sess && sess.mesh === slug && sess.presence?.sessionPubkey) {
selfSessionPubkey = sess.presence.sessionPubkey;
}
} catch { /* not in a launched session; isThisSession stays false */ }
// Daemon path — preferred when running. Same routing pattern as send.ts:
// ~1 ms IPC round-trip; broker WS already warm in the daemon. The
// lifecycle helper inside tryListPeersViaDaemon auto-spawns the
// daemon if it's down and probes it for liveness — no separate bridge
// tier is needed any more (1.28.0).
//
// 1.34.15: forward `slug` to the daemon as `?mesh=<slug>` so the
// server-side aggregator narrows to the requested mesh. Pre-1.34.15
// we called this with no argument, so a multi-mesh daemon returned
// peers from every attached mesh and the renderer printed "peers on
// flexicar" with cross-mesh rows mixed in. The daemon's
// `meshFromCtx` already does the right scoping when the slug is
// passed; the CLI just wasn't passing it.
try {
const { tryListPeersViaDaemon } = await import("~/services/bridge/daemon-route.js");
const dr = await tryListPeersViaDaemon(slug);
if (dr !== null) {
return dr.map((p) => annotateSelf(p as PeerRecord, selfMemberPubkey, selfSessionPubkey));
}
} catch { /* daemon route helper not available; fall through */ }
// Cold path — open our own WS. Reached only when the lifecycle helper
// could not bring the daemon up.
let result: PeerRecord[] = [];
await withMesh({ meshSlug: slug }, async (client) => {
const all = (await client.listPeers()) as unknown as PeerRecord[];
const selfSessionPubkey = client.getSessionPubkey();
result = all.map((p) =>
annotateSelf(p, selfMemberPubkey, selfSessionPubkey),
);
});
return result;
}
/**
* Tag each peer record with `isSelf` / `isThisSession` so the renderer
* (and downstream code that picks targets, e.g. `claudemesh send`) can
* tell sender's own sessions from real peers. The broker has always
* surfaced a sender's siblings as separate rows because they're separate
* presence rows; the cli just hadn't been making that visible.
*
* Also normalizes the broker's `peerRole` classification: missing
* values (older brokers) default to 'session' so legacy peer rows stay
* visible under the default `--all=false` filter.
*
* And lifts `profile.role` to a top-level `role` field — the 1.31.5
* convenience alias for JSON consumers (skill SKILL.md, launched-session
* LLMs, jq pipelines). Same value as profile.role; distinct from
* peerRole (presence taxonomy).
*/
function annotateSelf(
peer: PeerRecord,
selfMemberPubkey: string | null,
selfSessionPubkey: string | null,
): PeerRecord {
const isSelf = !!(
selfMemberPubkey &&
peer.memberPubkey &&
peer.memberPubkey === selfMemberPubkey
);
const isThisSession = !!(
isSelf &&
selfSessionPubkey &&
peer.pubkey === selfSessionPubkey
);
const peerRole: PeerRole = peer.peerRole ?? "session";
const profileRole = peer.profile?.role?.trim() || undefined;
return {
...peer,
...(profileRole ? { role: profileRole } : {}),
peerRole,
isSelf,
isThisSession,
};
}
export async function runPeers(flags: PeersFlags): Promise<void> {
const config = readConfig();
const slugs = flags.mesh ? [flags.mesh] : config.meshes.map((m) => m.slug);
// Mesh selection precedence:
// 1. explicit --mesh <slug> (always wins)
// 2. session-token mesh (when invoked from inside a launched session)
// 3. all joined meshes (default for bare shells)
let slugs: string[];
if (flags.mesh) {
slugs = [flags.mesh];
} else {
const { getSessionInfo } = await import("~/services/session/resolve.js");
const sess = await getSessionInfo();
slugs = sess ? [sess.mesh] : config.meshes.map((m) => m.slug);
}
if (slugs.length === 0) {
render.err("No meshes joined.");
@@ -24,51 +228,120 @@ export async function runPeers(flags: PeersFlags): Promise<void> {
process.exit(1);
}
// Field projection: --json a,b,c
const fieldList: string[] | null =
typeof flags.json === "string" && flags.json.length > 0
? flags.json.split(",").map((s) => s.trim()).filter(Boolean)
: null;
const wantsJson = flags.json !== undefined && flags.json !== false;
const allJson: Array<{ mesh: string; peers: unknown[] }> = [];
for (const slug of slugs) {
try {
await withMesh({ meshSlug: slug }, async (client, mesh) => {
const peers = await client.listPeers();
const peers = await listPeersForMesh(slug);
if (flags.json) {
allJson.push({ mesh: mesh.slug, peers });
return;
}
if (wantsJson) {
const projected = fieldList
? peers.map((p) => projectFields(p, fieldList))
: peers;
allJson.push({ mesh: slug, peers: projected });
continue;
}
render.section(`peers on ${mesh.slug} (${peers.length})`);
// Hide control-plane rows by default — they're infrastructure
// (daemon-WS member-keyed presence), not interactive peers, and
// they confused users into thinking the daemon counted as a
// separate peer. --all opts back in for debugging.
//
// Source of truth: broker-emitted `peerRole` field (added
// 2026-05-04). annotateSelf() filled in 'session' for older
// brokers that don't emit peerRole yet, so this filter is
// backwards-compatible by construction — legacy rows show up.
const visible = flags.all
? peers
: peers.filter((p) => p.peerRole !== "control-plane");
if (peers.length === 0) {
render.info(dim(" (no peers connected)"));
return;
}
for (const p of peers) {
const groups = p.groups.length
? " [" +
p.groups
.map((g: { name: string; role?: string }) => `@${g.name}${g.role ? `:${g.role}` : ""}`)
.join(", ") +
"]"
: "";
const statusDot = p.status === "working" ? yellow("●") : green("●");
const name = bold(p.displayName);
const meta: string[] = [];
if (p.peerType) meta.push(p.peerType);
if (p.channel) meta.push(p.channel);
if (p.model) meta.push(p.model);
const metaStr = meta.length ? dim(` (${meta.join(", ")})`) : "";
const summary = p.summary ? dim(`${p.summary}`) : "";
render.info(`${statusDot} ${name}${groups}${metaStr}${summary}`);
if (p.cwd) render.info(dim(` cwd: ${p.cwd}`));
}
// Sort: this-session first, then your-other-sessions, then real
// peers. Inside each group the broker order is kept: the sort is
// stable and compares only the self score, so status is not a sort
// key. The point is: when you run peer list, the row that's YOU is
// row 1.
const sorted = visible.slice().sort((a, b) => {
const score = (p: PeerRecord) =>
p.isThisSession ? 0 : p.isSelf ? 1 : 2;
return score(a) - score(b);
});
const hiddenControlPlane = peers.length - visible.length;
const header = hiddenControlPlane > 0
? `peers on ${slug} (${sorted.length}, ${hiddenControlPlane} control-plane hidden — use --all)`
: `peers on ${slug} (${sorted.length})`;
render.section(header);
if (sorted.length === 0) {
render.info(dim(" (no peers connected)"));
continue;
}
for (const p of sorted) {
const statusDot = p.status === "working" ? yellow("●") : green("●");
const name = bold(p.displayName);
const meta: string[] = [];
if (p.peerType) meta.push(p.peerType);
if (p.channel) meta.push(p.channel);
if (p.model) meta.push(p.model);
const metaStr = meta.length ? dim(` (${meta.join(", ")})`) : "";
const summary = p.summary ? dim(` ${p.summary}`) : "";
const pubkeyTag = dim(` · ${p.pubkey.slice(0, 16)}`);
// Short sessionId tag — appears for sibling sessions of the same
// member that would otherwise be visually identical (same name,
// same cwd, only the truncated pubkey on the right differs).
const sidTag = p.sessionId
? dim(` · sid:${p.sessionId.slice(0, 8)}`)
: "";
const selfTag = p.isThisSession
? dim(" ") + yellow("(this session)")
: p.isSelf
? dim(" ") + yellow("(your other session)")
: "";
// Inline tags ("role:lead [@flexicar:reviewer, @oncall]") so the
// first thing the user sees beside the name is the access /
// affiliation context. Empty role + empty groups → omit the
// bracket entirely (the dim summary line below carries the
// explicit "(no role / no groups)" so JSON output is unaffected
// and screen readers don't get spammed with literal "no").
const inlineTags: string[] = [];
const peerRole = p.profile?.role?.trim();
if (peerRole) inlineTags.push(`role:${peerRole}`);
if (p.groups.length) {
inlineTags.push(
...p.groups.map((g) => `@${g.name}${g.role ? `:${g.role}` : ""}`),
);
}
const tagsStr = inlineTags.length ? " [" + inlineTags.join(", ") + "]" : "";
render.info(
`${statusDot} ${name}${selfTag}${tagsStr}${metaStr}${pubkeyTag}${sidTag}${summary}`,
);
// Second line: cwd + an explicit role/groups footer when both
// are absent. Surfacing the absence is important — the previous
// renderer hid it, so users couldn't tell "no role set" from
// "the cli isn't showing roles".
if (p.cwd) render.info(dim(` cwd: ${p.cwd}`));
if (!peerRole && p.groups.length === 0) {
render.info(dim(" role: (none) groups: (none)"));
}
}
} catch (e) {
render.err(`${slug}: ${e instanceof Error ? e.message : String(e)}`);
}
}
if (flags.json) {
process.stdout.write(JSON.stringify(slugs.length === 1 ? allJson[0]?.peers : allJson, null, 2) + "\n");
if (wantsJson) {
process.stdout.write(
JSON.stringify(slugs.length === 1 ? allJson[0]?.peers : allJson, null, 2) + "\n",
);
}
}
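
The `--json name,pubkey,status` projection and the self-first ordering in runPeers above can be exercised without a broker. The sketch below restates them as pure functions. PeerLike is a trimmed stand-in for PeerRecord, and parseFieldList and selfFirst are hypothetical names for logic that the commit keeps inline.

```typescript
// Trimmed stand-in for PeerRecord (assumption: only the fields used here).
interface PeerLike {
  pubkey: string;
  displayName: string;
  isSelf?: boolean;
  isThisSession?: boolean;
  [k: string]: unknown;
}

// Users type `name`; the broker calls the field `displayName`.
const FIELD_ALIAS: Record<string, string> = { name: "displayName" };

// `--json` alone (true) means full records; a string is a comma-separated field list.
function parseFieldList(json: boolean | string | undefined): string[] | null {
  return typeof json === "string" && json.length > 0
    ? json.split(",").map((s) => s.trim()).filter(Boolean)
    : null;
}

// Output keys keep the name the user typed; values are read via the alias.
function projectFields(record: PeerLike, fields: string[]): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const f of fields) out[f] = record[FIELD_ALIAS[f] ?? f];
  return out;
}

// This session first, then sibling sessions, then everyone else.
// Array.prototype.sort is stable, so ties keep their incoming order.
function selfFirst<T extends PeerLike>(peers: T[]): T[] {
  const score = (p: PeerLike) => (p.isThisSession ? 0 : p.isSelf ? 1 : 2);
  return peers.slice().sort((a, b) => score(a) - score(b));
}
```

A field that is absent on a record projects to `undefined`, which JSON.stringify drops from the output object.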


@@ -0,0 +1,724 @@
/**
* Platform CLI verbs — vector / graph / context / stream / sql / skill /
* vault / watch / webhook / task / clock. These wrap broker methods that
* previously were only callable via MCP tools.
*
* All verbs run cold-path (open own WS via `withMesh`). Bridge expansion
* for high-frequency reads (vector_search, graph_query, sql_query) lands
* in 1.3.1.
*
* Spec: .artifacts/specs/2026-05-02-architecture-north-star.md
*/
import { withMesh } from "./connect.js";
import { render } from "~/ui/render.js";
import { bold, clay, dim } from "~/ui/styles.js";
import { EXIT } from "~/constants/exit-codes.js";
type Flags = { mesh?: string; json?: boolean };
function emitJson(data: unknown): void {
console.log(JSON.stringify(data, null, 2));
}
// ════════════════════════════════════════════════════════════════════════
// vector — embedding store + similarity search
// ════════════════════════════════════════════════════════════════════════
export async function runVectorStore(
collection: string,
text: string,
opts: Flags & { metadata?: string },
): Promise<number> {
if (!collection || !text) {
render.err("Usage: claudemesh vector store <collection> <text> [--metadata <json>]");
return EXIT.INVALID_ARGS;
}
let metadata: Record<string, unknown> | undefined;
if (opts.metadata) {
try { metadata = JSON.parse(opts.metadata) as Record<string, unknown>; }
catch { render.err("--metadata must be JSON"); return EXIT.INVALID_ARGS; }
}
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const id = await client.vectorStore(collection, text, metadata);
if (!id) { render.err("store failed"); return EXIT.INTERNAL_ERROR; }
if (opts.json) emitJson({ id, collection });
else render.ok(`stored in ${clay(collection)}`, dim(id));
return EXIT.SUCCESS;
});
}
export async function runVectorSearch(
collection: string,
query: string,
opts: Flags & { limit?: string },
): Promise<number> {
if (!collection || !query) {
render.err("Usage: claudemesh vector search <collection> <query> [--limit N]");
return EXIT.INVALID_ARGS;
}
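const limit = opts.limit ? parseInt(opts.limit, 10) : undefined;
// Reject non-numeric or non-positive values instead of forwarding NaN to the broker.
if (limit !== undefined && (!Number.isInteger(limit) || limit <= 0)) {
render.err("--limit must be a positive integer");
return EXIT.INVALID_ARGS;
}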
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const hits = await client.vectorSearch(collection, query, limit);
if (opts.json) { emitJson(hits); return EXIT.SUCCESS; }
if (hits.length === 0) { render.info(dim("(no matches)")); return EXIT.SUCCESS; }
render.section(`${hits.length} match${hits.length === 1 ? "" : "es"} in ${clay(collection)}`);
for (const h of hits) {
process.stdout.write(` ${bold(h.score.toFixed(3))} ${dim(h.id.slice(0, 8))} ${h.text}\n`);
}
return EXIT.SUCCESS;
});
}
export async function runVectorDelete(
collection: string,
id: string,
opts: Flags,
): Promise<number> {
if (!collection || !id) {
render.err("Usage: claudemesh vector delete <collection> <id>");
return EXIT.INVALID_ARGS;
}
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
await client.vectorDelete(collection, id);
if (opts.json) emitJson({ id, deleted: true });
else render.ok(`deleted ${dim(id.slice(0, 8))}`);
return EXIT.SUCCESS;
});
}
export async function runVectorCollections(opts: Flags): Promise<number> {
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const cols = await client.listCollections();
if (opts.json) { emitJson(cols); return EXIT.SUCCESS; }
if (cols.length === 0) { render.info(dim("(no collections)")); return EXIT.SUCCESS; }
render.section(`vector collections (${cols.length})`);
for (const c of cols) process.stdout.write(` ${clay(c)}\n`);
return EXIT.SUCCESS;
});
}
// ════════════════════════════════════════════════════════════════════════
// graph — Cypher query / execute
// ════════════════════════════════════════════════════════════════════════
export async function runGraphQuery(cypher: string, opts: Flags): Promise<number> {
if (!cypher) { render.err("Usage: claudemesh graph query \"<cypher>\""); return EXIT.INVALID_ARGS; }
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const rows = await client.graphQuery(cypher);
if (opts.json) { emitJson(rows); return EXIT.SUCCESS; }
if (rows.length === 0) { render.info(dim("(no rows)")); return EXIT.SUCCESS; }
render.section(`${rows.length} row${rows.length === 1 ? "" : "s"}`);
for (const r of rows) process.stdout.write(` ${JSON.stringify(r)}\n`);
return EXIT.SUCCESS;
});
}
export async function runGraphExecute(cypher: string, opts: Flags): Promise<number> {
if (!cypher) { render.err("Usage: claudemesh graph execute \"<cypher>\""); return EXIT.INVALID_ARGS; }
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const rows = await client.graphExecute(cypher);
if (opts.json) { emitJson(rows); return EXIT.SUCCESS; }
render.ok("executed", `${rows.length} row${rows.length === 1 ? "" : "s"} affected`);
return EXIT.SUCCESS;
});
}
// ════════════════════════════════════════════════════════════════════════
// context — share work-context summaries
// ════════════════════════════════════════════════════════════════════════
export async function runContextShare(
summary: string,
opts: Flags & { files?: string; findings?: string; tags?: string },
): Promise<number> {
if (!summary) {
render.err("Usage: claudemesh context share \"<summary>\" [--files a,b] [--findings x,y] [--tags t1,t2]");
return EXIT.INVALID_ARGS;
}
const files = opts.files?.split(",").map((s) => s.trim()).filter(Boolean);
const findings = opts.findings?.split(",").map((s) => s.trim()).filter(Boolean);
const tags = opts.tags?.split(",").map((s) => s.trim()).filter(Boolean);
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
await client.shareContext(summary, files, findings, tags);
if (opts.json) emitJson({ shared: true, summary });
else render.ok("context shared");
return EXIT.SUCCESS;
});
}
export async function runContextGet(query: string, opts: Flags): Promise<number> {
if (!query) { render.err("Usage: claudemesh context get \"<query>\""); return EXIT.INVALID_ARGS; }
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const ctxs = await client.getContext(query);
if (opts.json) { emitJson(ctxs); return EXIT.SUCCESS; }
if (ctxs.length === 0) { render.info(dim("(no matches)")); return EXIT.SUCCESS; }
render.section(`${ctxs.length} context${ctxs.length === 1 ? "" : "s"}`);
for (const c of ctxs) {
process.stdout.write(` ${bold(c.peerName)} ${dim("·")} ${c.updatedAt}\n`);
process.stdout.write(` ${c.summary}\n`);
if (c.tags.length) process.stdout.write(` ${dim("tags: " + c.tags.join(", "))}\n`);
}
return EXIT.SUCCESS;
});
}
export async function runContextList(opts: Flags): Promise<number> {
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const ctxs = await client.listContexts();
if (opts.json) { emitJson(ctxs); return EXIT.SUCCESS; }
if (ctxs.length === 0) { render.info(dim("(no contexts)")); return EXIT.SUCCESS; }
render.section(`shared contexts (${ctxs.length})`);
for (const c of ctxs) {
process.stdout.write(` ${bold(c.peerName)} ${dim("·")} ${c.updatedAt}\n`);
process.stdout.write(` ${c.summary}\n`);
}
return EXIT.SUCCESS;
});
}
// ════════════════════════════════════════════════════════════════════════
// stream — pub/sub event bus per mesh
// ════════════════════════════════════════════════════════════════════════
export async function runStreamCreate(name: string, opts: Flags): Promise<number> {
if (!name) { render.err("Usage: claudemesh stream create <name>"); return EXIT.INVALID_ARGS; }
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const id = await client.createStream(name);
if (!id) { render.err("create failed"); return EXIT.INTERNAL_ERROR; }
if (opts.json) emitJson({ id, name });
else render.ok(`created ${clay(name)}`, dim(id));
return EXIT.SUCCESS;
});
}
export async function runStreamPublish(name: string, dataRaw: string, opts: Flags): Promise<number> {
if (!name || dataRaw === undefined) {
render.err("Usage: claudemesh stream publish <name> <json-or-text>");
return EXIT.INVALID_ARGS;
}
let data: unknown;
try { data = JSON.parse(dataRaw); } catch { data = dataRaw; }
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
await client.publish(name, data);
if (opts.json) emitJson({ published: true, name });
else render.ok(`published to ${clay(name)}`);
return EXIT.SUCCESS;
});
}
export async function runStreamList(opts: Flags): Promise<number> {
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const streams = await client.listStreams();
if (opts.json) { emitJson(streams); return EXIT.SUCCESS; }
if (streams.length === 0) { render.info(dim("(no streams)")); return EXIT.SUCCESS; }
render.section(`streams (${streams.length})`);
for (const s of streams) {
process.stdout.write(` ${clay(s.name)} ${dim(`· ${s.subscriberCount} subscriber${s.subscriberCount === 1 ? "" : "s"} · by ${s.createdBy}`)}\n`);
}
return EXIT.SUCCESS;
});
}
// ════════════════════════════════════════════════════════════════════════
// sql — typed query against per-mesh tables
// ════════════════════════════════════════════════════════════════════════
export async function runSqlQuery(sql: string, opts: Flags): Promise<number> {
if (!sql) { render.err("Usage: claudemesh sql query \"<select>\""); return EXIT.INVALID_ARGS; }
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const result = await client.meshQuery(sql);
if (!result) { render.err("query timed out"); return EXIT.INTERNAL_ERROR; }
if (opts.json) { emitJson(result); return EXIT.SUCCESS; }
render.section(`${result.rowCount} row${result.rowCount === 1 ? "" : "s"}`);
if (result.columns.length > 0) {
process.stdout.write(` ${dim(result.columns.join(" "))}\n`);
for (const row of result.rows) {
process.stdout.write(` ${result.columns.map((c) => String(row[c] ?? "")).join(" ")}\n`);
}
}
return EXIT.SUCCESS;
});
}
export async function runSqlExecute(sql: string, opts: Flags): Promise<number> {
if (!sql) { render.err("Usage: claudemesh sql execute \"<statement>\""); return EXIT.INVALID_ARGS; }
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
await client.meshExecute(sql);
if (opts.json) emitJson({ executed: true });
else render.ok("executed");
return EXIT.SUCCESS;
});
}
export async function runSqlSchema(opts: Flags): Promise<number> {
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const tables = await client.meshSchema();
if (opts.json) { emitJson(tables); return EXIT.SUCCESS; }
if (tables.length === 0) { render.info(dim("(no tables)")); return EXIT.SUCCESS; }
render.section(`mesh tables (${tables.length})`);
for (const t of tables) {
process.stdout.write(` ${bold(t.name)}\n`);
for (const c of t.columns) {
const nullable = c.nullable ? "" : " not null";
process.stdout.write(` ${c.name} ${dim(c.type + nullable)}\n`);
}
}
return EXIT.SUCCESS;
});
}
// ════════════════════════════════════════════════════════════════════════
// skill — list / get / remove (publish currently goes through MCP)
// ════════════════════════════════════════════════════════════════════════
export async function runSkillList(opts: Flags & { query?: string }): Promise<number> {
// Daemon path — preferred when running. Mirror trySendViaDaemon shape.
try {
const { tryListSkillsViaDaemon } = await import("~/services/bridge/daemon-route.js");
const dr = await tryListSkillsViaDaemon();
if (dr !== null) {
const skills = dr as Array<{ name: string; description: string; author: string; tags: string[] }>;
if (opts.json) { emitJson(skills); return EXIT.SUCCESS; }
if (skills.length === 0) { render.info(dim("(no skills)")); return EXIT.SUCCESS; }
render.section(`mesh skills (${skills.length})`);
for (const s of skills) {
process.stdout.write(` ${bold(s.name)} ${dim("· by " + s.author)}\n`);
process.stdout.write(` ${s.description}\n`);
if (s.tags?.length) process.stdout.write(` ${dim("tags: " + s.tags.join(", "))}\n`);
}
return EXIT.SUCCESS;
}
} catch { /* fall through to cold path */ }
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const skills = await client.listSkills(opts.query);
if (opts.json) { emitJson(skills); return EXIT.SUCCESS; }
if (skills.length === 0) { render.info(dim("(no skills)")); return EXIT.SUCCESS; }
render.section(`mesh skills (${skills.length})`);
for (const s of skills) {
process.stdout.write(` ${bold(s.name)} ${dim("· by " + s.author)}\n`);
process.stdout.write(` ${s.description}\n`);
if (s.tags.length) process.stdout.write(` ${dim("tags: " + s.tags.join(", "))}\n`);
}
return EXIT.SUCCESS;
});
}
export async function runSkillGet(name: string, opts: Flags): Promise<number> {
if (!name) { render.err("Usage: claudemesh skill get <name>"); return EXIT.INVALID_ARGS; }
// Daemon path first.
try {
const { tryGetSkillViaDaemon } = await import("~/services/bridge/daemon-route.js");
const dr = await tryGetSkillViaDaemon(name);
if (dr !== null) {
const skill = dr as { name: string; description: string; instructions: string; tags: string[]; author: string; createdAt: string };
if (opts.json) { emitJson(skill); return EXIT.SUCCESS; }
render.section(skill.name);
render.kv([
["author", skill.author],
["created", skill.createdAt],
["tags", skill.tags?.join(", ") || dim("(none)")],
]);
render.blank();
render.info(skill.description);
render.blank();
process.stdout.write(skill.instructions + "\n");
return EXIT.SUCCESS;
}
} catch { /* fall through */ }
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const skill = await client.getSkill(name);
if (!skill) { render.err(`skill "${name}" not found`); return EXIT.NOT_FOUND; }
if (opts.json) { emitJson(skill); return EXIT.SUCCESS; }
render.section(skill.name);
render.kv([
["author", skill.author],
["created", skill.createdAt],
["tags", skill.tags.join(", ") || dim("(none)")],
]);
render.blank();
render.info(skill.description);
render.blank();
process.stdout.write(skill.instructions + "\n");
return EXIT.SUCCESS;
});
}
export async function runSkillRemove(name: string, opts: Flags): Promise<number> {
if (!name) { render.err("Usage: claudemesh skill remove <name>"); return EXIT.INVALID_ARGS; }
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const removed = await client.removeSkill(name);
if (opts.json) emitJson({ name, removed });
else if (removed) render.ok(`removed ${bold(name)}`);
else render.err(`skill "${name}" not found`);
return removed ? EXIT.SUCCESS : EXIT.NOT_FOUND;
});
}
// ════════════════════════════════════════════════════════════════════════
// vault — encrypted per-mesh secrets list / delete (set/get need crypto)
// ════════════════════════════════════════════════════════════════════════
export async function runVaultList(opts: Flags): Promise<number> {
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const entries = await client.vaultList();
if (opts.json) { emitJson(entries); return EXIT.SUCCESS; }
if (!entries || entries.length === 0) { render.info(dim("(vault empty)")); return EXIT.SUCCESS; }
render.section(`vault (${entries.length})`);
for (const e of entries) {
const k = String((e as any)?.key ?? "?");
const t = String((e as any)?.entry_type ?? "");
process.stdout.write(` ${bold(k)} ${dim(t)}\n`);
}
return EXIT.SUCCESS;
});
}
export async function runVaultDelete(key: string, opts: Flags): Promise<number> {
if (!key) { render.err("Usage: claudemesh vault delete <key>"); return EXIT.INVALID_ARGS; }
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const ok = await client.vaultDelete(key);
if (opts.json) emitJson({ key, deleted: ok });
else if (ok) render.ok(`deleted ${bold(key)}`);
else render.err(`vault key "${key}" not found`);
return ok ? EXIT.SUCCESS : EXIT.NOT_FOUND;
});
}
export interface VaultSetOpts extends Flags {
entryType?: "env" | "file";
mountPath?: string;
description?: string;
}
export async function runVaultSet(key: string, value: string, opts: VaultSetOpts): Promise<number> {
if (!key || value == null) {
render.err("Usage: claudemesh vault set <key> <value> [--type env|file] [--mount /path] [--description ...]");
return EXIT.INVALID_ARGS;
}
const { encryptFile, sealKeyForPeer } = await import("~/services/crypto/file-crypto.js");
const { getMeshConfig, readConfig } = await import("~/services/config/facade.js");
const config = readConfig();
const slug = opts.mesh ?? (config.meshes.length === 1 ? config.meshes[0]!.slug : null);
if (!slug) {
render.err("multiple meshes joined; pass --mesh <slug>");
return EXIT.INVALID_ARGS;
}
const mesh = getMeshConfig(slug);
if (!mesh) { render.err(`not joined to mesh "${slug}"`); return EXIT.NOT_FOUND; }
const plaintext = new TextEncoder().encode(value);
const enc = await encryptFile(plaintext);
const ciphertextB64 = Buffer.from(enc.ciphertext).toString("base64");
const sealed = await sealKeyForPeer(enc.key, mesh.pubkey);
return await withMesh({ meshSlug: slug }, async (client) => {
const ok = await client.vaultSet(
key,
ciphertextB64,
enc.nonce,
sealed,
opts.entryType ?? "env",
opts.mountPath,
opts.description,
);
if (opts.json) emitJson({ key, stored: ok });
else if (ok) render.ok(`vault[${bold(key)}] stored`, dim(`(${ciphertextB64.length}b)`));
else render.err(`vault set failed for "${key}"`);
return ok ? EXIT.SUCCESS : EXIT.IO_ERROR;
});
}
// ════════════════════════════════════════════════════════════════════════
// watch — URL change watchers
// ════════════════════════════════════════════════════════════════════════
export async function runWatchList(opts: Flags): Promise<number> {
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const watches = await client.watchList();
if (opts.json) { emitJson(watches); return EXIT.SUCCESS; }
if (!watches || watches.length === 0) { render.info(dim("(no watches)")); return EXIT.SUCCESS; }
render.section(`url watches (${watches.length})`);
for (const w of watches) {
const id = String((w as any).id ?? "?");
const url = String((w as any).url ?? "");
const label = (w as any).label ? ` ${dim("(" + (w as any).label + ")")}` : "";
process.stdout.write(` ${dim(id.slice(0, 8))} ${clay(url)}${label}\n`);
}
return EXIT.SUCCESS;
});
}
export interface WatchAddOpts extends Flags {
label?: string;
interval?: number;
mode?: string;
extract?: string;
notifyOn?: string;
}
export async function runWatchAdd(url: string, opts: WatchAddOpts): Promise<number> {
if (!url) {
render.err("Usage: claudemesh watch add <url> [--label ...] [--interval <sec>] [--extract <css>] [--notify-on changed|always]");
return EXIT.INVALID_ARGS;
}
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const result = await client.watch(url, {
label: opts.label,
interval: opts.interval,
mode: opts.mode,
extract: opts.extract,
notify_on: opts.notifyOn,
});
if (result?.error) {
if (opts.json) emitJson({ ok: false, error: result.error });
else render.err(`watch add failed: ${result.error}`);
return EXIT.IO_ERROR;
}
const id = String((result as any)?.id ?? (result as any)?.watch_id ?? "?");
if (opts.json) emitJson({ ok: true, id, url, ...(opts.label ? { label: opts.label } : {}) });
else render.ok(`watching ${clay(url)}`, dim(id.slice(0, 8)));
return EXIT.SUCCESS;
});
}
export async function runUnwatch(id: string, opts: Flags): Promise<number> {
if (!id) { render.err("Usage: claudemesh watch remove <id>"); return EXIT.INVALID_ARGS; }
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const ok = await client.unwatch(id);
if (opts.json) emitJson({ id, removed: ok });
else if (ok) render.ok(`unwatched ${dim(id.slice(0, 8))}`);
else render.err(`watch "${id}" not found`);
return ok ? EXIT.SUCCESS : EXIT.NOT_FOUND;
});
}
// ════════════════════════════════════════════════════════════════════════
// webhook — outbound HTTP triggers
// ════════════════════════════════════════════════════════════════════════
export async function runWebhookList(opts: Flags): Promise<number> {
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const hooks = await client.listWebhooks();
if (opts.json) { emitJson(hooks); return EXIT.SUCCESS; }
if (hooks.length === 0) { render.info(dim("(no webhooks)")); return EXIT.SUCCESS; }
render.section(`webhooks (${hooks.length})`);
for (const h of hooks) {
const dot = h.active ? "●" : dim("○");
process.stdout.write(` ${dot} ${bold(h.name)} ${dim("· " + h.url)}\n`);
}
return EXIT.SUCCESS;
});
}
export async function runWebhookCreate(name: string, opts: Flags): Promise<number> {
if (!name) {
render.err("Usage: claudemesh webhook create <name>");
return EXIT.INVALID_ARGS;
}
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const created = await client.createWebhook(name);
if (!created) {
if (opts.json) emitJson({ ok: false, error: "create failed (timeout or duplicate)" });
else render.err(`webhook create "${name}" failed`);
return EXIT.IO_ERROR;
}
if (opts.json) emitJson({ ok: true, ...created });
else {
render.ok(`created webhook ${bold(created.name)}`);
process.stdout.write(` url: ${clay(created.url)}\n`);
process.stdout.write(` secret: ${dim(created.secret)} ${dim("(shown once)")}\n`);
}
return EXIT.SUCCESS;
});
}
export async function runWebhookDelete(name: string, opts: Flags): Promise<number> {
if (!name) { render.err("Usage: claudemesh webhook delete <name>"); return EXIT.INVALID_ARGS; }
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const ok = await client.deleteWebhook(name);
if (opts.json) emitJson({ name, deleted: ok });
else if (ok) render.ok(`deleted ${bold(name)}`);
else render.err(`webhook "${name}" not found`);
return ok ? EXIT.SUCCESS : EXIT.NOT_FOUND;
});
}
// ════════════════════════════════════════════════════════════════════════
// task — list / create (claim / complete already in broker-actions.ts)
// ════════════════════════════════════════════════════════════════════════
export async function runTaskList(opts: Flags & { status?: string; assignee?: string }): Promise<number> {
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const tasks = await client.listTasks(opts.status, opts.assignee);
if (opts.json) { emitJson(tasks); return EXIT.SUCCESS; }
if (tasks.length === 0) { render.info(dim("(no tasks)")); return EXIT.SUCCESS; }
render.section(`tasks (${tasks.length})`);
for (const t of tasks) {
const dot = t.status === "done" ? "●" : t.status === "claimed" ? clay("●") : dim("○");
const assignee = t.assignee ? dim(" · " + t.assignee) : "";
process.stdout.write(` ${dot} ${dim(t.id.slice(0, 8))} ${bold(t.title)}${assignee}\n`);
}
return EXIT.SUCCESS;
});
}
export async function runTaskCreate(
title: string,
opts: Flags & { assignee?: string; priority?: string; tags?: string },
): Promise<number> {
if (!title) { render.err("Usage: claudemesh task create <title> [--assignee X] [--priority P]"); return EXIT.INVALID_ARGS; }
const tags = opts.tags?.split(",").map((s) => s.trim()).filter(Boolean);
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const id = await client.createTask(title, opts.assignee, opts.priority, tags);
if (!id) { render.err("create failed"); return EXIT.INTERNAL_ERROR; }
if (opts.json) emitJson({ id, title });
else render.ok(`created ${dim(id.slice(0, 8))}`, title);
return EXIT.SUCCESS;
});
}
// ════════════════════════════════════════════════════════════════════════
// clock — set / pause / resume (get already in broker-actions.ts)
// ════════════════════════════════════════════════════════════════════════
export async function runClockSet(speed: string, opts: Flags): Promise<number> {
const s = parseFloat(speed);
if (!Number.isFinite(s) || s < 0) {
render.err("Usage: claudemesh clock set <speed>", "speed is a non-negative number, e.g. 1.0 = realtime, 0 = paused, 60 = 60× faster");
return EXIT.INVALID_ARGS;
}
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const r = await client.setClock(s);
if (!r) { render.err("clock set failed"); return EXIT.INTERNAL_ERROR; }
if (opts.json) emitJson(r);
else render.ok(`clock set to ${bold("x" + r.speed)}`);
return EXIT.SUCCESS;
});
}
export async function runClockPause(opts: Flags): Promise<number> {
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const r = await client.pauseClock();
if (!r) { render.err("pause failed"); return EXIT.INTERNAL_ERROR; }
if (opts.json) emitJson(r);
else render.ok("clock paused");
return EXIT.SUCCESS;
});
}
export async function runClockResume(opts: Flags): Promise<number> {
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const r = await client.resumeClock();
if (!r) { render.err("resume failed"); return EXIT.INTERNAL_ERROR; }
if (opts.json) emitJson(r);
else render.ok("clock resumed");
return EXIT.SUCCESS;
});
}
// ════════════════════════════════════════════════════════════════════════
// mesh-mcp — list deployed mesh-MCP servers, call tools, view catalog
// ════════════════════════════════════════════════════════════════════════
export async function runMeshMcpList(opts: Flags): Promise<number> {
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const servers = await client.mcpList();
if (opts.json) { emitJson(servers); return EXIT.SUCCESS; }
if (servers.length === 0) { render.info(dim("(no mesh-MCP servers)")); return EXIT.SUCCESS; }
render.section(`mesh-MCP servers (${servers.length})`);
for (const s of servers) {
process.stdout.write(` ${bold(s.name)} ${dim("· hosted by " + s.hostedBy)}\n`);
process.stdout.write(` ${s.description}\n`);
if (s.tools.length) process.stdout.write(` ${dim("tools: " + s.tools.map((t) => t.name).join(", "))}\n`);
}
return EXIT.SUCCESS;
});
}
export async function runMeshMcpCall(
serverName: string,
toolName: string,
argsRaw: string,
opts: Flags,
): Promise<number> {
if (!serverName || !toolName) {
render.err("Usage: claudemesh mesh-mcp call <server> <tool> [json-args]");
return EXIT.INVALID_ARGS;
}
let args: Record<string, unknown> = {};
if (argsRaw) {
try { args = JSON.parse(argsRaw) as Record<string, unknown>; }
catch { render.err("args must be JSON"); return EXIT.INVALID_ARGS; }
}
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const r = await client.mcpCall(serverName, toolName, args);
if (r.error) {
if (opts.json) emitJson({ ok: false, error: r.error });
else render.err(r.error);
return EXIT.INTERNAL_ERROR;
}
if (opts.json) emitJson({ ok: true, result: r.result });
else process.stdout.write(JSON.stringify(r.result, null, 2) + "\n");
return EXIT.SUCCESS;
});
}
export async function runMeshMcpCatalog(opts: Flags): Promise<number> {
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const cat = await client.mcpCatalog();
if (opts.json) { emitJson(cat); return EXIT.SUCCESS; }
if (!cat || cat.length === 0) { render.info(dim("(catalog empty)")); return EXIT.SUCCESS; }
render.section(`mesh-MCP catalog (${cat.length})`);
for (const c of cat as Array<Record<string, unknown>>) {
process.stdout.write(` ${bold(String(c.name ?? "?"))} ${dim(String(c.status ?? ""))}\n`);
if (c.description) process.stdout.write(` ${String(c.description)}\n`);
}
return EXIT.SUCCESS;
});
}
// ════════════════════════════════════════════════════════════════════════
// file — list / status / delete (upload / get-by-name go through MCP for now)
// ════════════════════════════════════════════════════════════════════════
export async function runFileList(opts: Flags & { query?: string }): Promise<number> {
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const files = await client.listFiles(opts.query);
if (opts.json) { emitJson(files); return EXIT.SUCCESS; }
if (files.length === 0) { render.info(dim("(no files)")); return EXIT.SUCCESS; }
render.section(`mesh files (${files.length})`);
for (const f of files) {
const sizeKb = (f.size / 1024).toFixed(1);
process.stdout.write(` ${bold(f.name)} ${dim(`· ${sizeKb} KB · by ${f.uploadedBy}`)}\n`);
if (f.tags.length) process.stdout.write(` ${dim("tags: " + f.tags.join(", "))}\n`);
}
return EXIT.SUCCESS;
});
}
export async function runFileStatus(id: string, opts: Flags): Promise<number> {
if (!id) { render.err("Usage: claudemesh file status <file-id>"); return EXIT.INVALID_ARGS; }
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const accessors = await client.fileStatus(id);
if (opts.json) { emitJson(accessors); return EXIT.SUCCESS; }
if (accessors.length === 0) { render.info(dim("(no accesses recorded)")); return EXIT.SUCCESS; }
render.section(`accesses for ${id.slice(0, 8)}`);
for (const a of accessors) process.stdout.write(` ${bold(a.peerName)} ${dim("· " + a.accessedAt)}\n`);
return EXIT.SUCCESS;
});
}
export async function runFileDelete(id: string, opts: Flags): Promise<number> {
if (!id) { render.err("Usage: claudemesh file delete <file-id>"); return EXIT.INVALID_ARGS; }
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
await client.deleteFile(id);
if (opts.json) emitJson({ id, deleted: true });
else render.ok(`deleted ${dim(id.slice(0, 8))}`);
return EXIT.SUCCESS;
});
}
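
Several commands above repeat two small input-handling idioms: splitting a comma-separated flag (`--files`, `--findings`, `--tags`) into trimmed, non-empty items, and treating the `stream publish` payload as JSON when it parses and as plain text otherwise. A minimal sketch of both follows. The helper names `parseCsvFlag` and `parseJsonOrText` are hypothetical and do not appear in the codebase.

```typescript
// Hypothetical helpers for illustration; the names are not part of the codebase.

/** Split a comma-separated flag value into trimmed, non-empty items. */
function parseCsvFlag(raw: string | undefined): string[] | undefined {
  // An absent flag stays undefined; an empty string yields an empty list.
  return raw?.split(",").map((s) => s.trim()).filter(Boolean);
}

/** Parse a payload as JSON, falling back to the raw string on a syntax error. */
function parseJsonOrText(raw: string): unknown {
  try {
    return JSON.parse(raw);
  } catch {
    return raw;
  }
}
```

One consequence of the JSON-first rule is that a bare `42` or `true` is published as a number or boolean rather than as a string, so a caller who wants the literal text would have to quote it as JSON (`"\"42\""`).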


@@ -1,35 +1,53 @@
import { allClients } from "~/services/broker/facade.js";
import { dim, bold } from "~/ui/styles.js";
import { withMesh } from "./connect.js";
import { tryRecallViaDaemon } from "~/services/bridge/daemon-route.js";
import { render } from "~/ui/render.js";
import { bold, clay, dim } from "~/ui/styles.js";
import { EXIT } from "~/constants/exit-codes.js";
export async function recall(
query: string,
opts: { mesh?: string; json?: boolean } = {},
): Promise<number> {
const client = allClients()[0];
if (!client) {
console.error("Not connected to any mesh.");
return EXIT.NETWORK_ERROR;
if (!query) {
render.err("Usage: claudemesh recall <query>");
return EXIT.INVALID_ARGS;
}
const memories = await client.recall(query);
if (opts.json) {
console.log(JSON.stringify(memories, null, 2));
// Daemon path first.
const daemonMatches = await tryRecallViaDaemon(query, opts.mesh);
if (daemonMatches !== null) {
if (opts.json) { console.log(JSON.stringify(daemonMatches, null, 2)); return EXIT.SUCCESS; }
if (daemonMatches.length === 0) { render.info(dim("no memories found.")); return EXIT.SUCCESS; }
render.section(`memories (${daemonMatches.length})`);
for (const m of daemonMatches) {
const tags = m.tags.length ? dim(` [${m.tags.map((t) => clay(t)).join(dim(", "))}]`) : "";
process.stdout.write(` ${bold(m.id.slice(0, 8))}${tags}\n`);
process.stdout.write(` ${m.content}\n`);
process.stdout.write(` ${dim(m.rememberedBy + " · " + new Date(m.rememberedAt).toLocaleString())}\n\n`);
}
return EXIT.SUCCESS;
}
if (memories.length === 0) {
console.log(dim("No memories found."));
return EXIT.SUCCESS;
}
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const memories = await client.recall(query);
for (const m of memories) {
const tags = m.tags.length ? dim(` [${m.tags.join(", ")}]`) : "";
console.log(`${bold(m.id.slice(0, 8))}${tags}`);
console.log(` ${m.content}`);
console.log(dim(` ${m.rememberedBy} \u00B7 ${new Date(m.rememberedAt).toLocaleString()}`));
console.log("");
}
return EXIT.SUCCESS;
if (opts.json) {
console.log(JSON.stringify(memories, null, 2));
return EXIT.SUCCESS;
}
if (memories.length === 0) {
render.info(dim("no memories found."));
return EXIT.SUCCESS;
}
render.section(`memories (${memories.length})`);
for (const m of memories) {
const tags = m.tags.length ? dim(` [${m.tags.map((t) => clay(t)).join(dim(", "))}]`) : "";
process.stdout.write(` ${bold(m.id.slice(0, 8))}${tags}\n`);
process.stdout.write(` ${m.content}\n`);
process.stdout.write(` ${dim(m.rememberedBy + " · " + new Date(m.rememberedAt).toLocaleString())}\n\n`);
}
return EXIT.SUCCESS;
});
}
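
The new `recall` tries the daemon first and only opens a mesh connection through `withMesh` when the daemon returns `null`. The same shape appears in `runSkillList` and `runSkillGet`, which additionally wrap the daemon call in `try`/`catch`. A generic sketch of that pattern is below. `withFallback` is a hypothetical name, and treating a thrown error like a `null` result is an assumption borrowed from the skill commands, not something `recall` does as shown.

```typescript
// Hypothetical sketch: prefer a fast path that signals "unavailable" with null,
// then fall back to a slower path that always produces a result.
async function withFallback<T>(
  fast: () => Promise<T | null>,
  cold: () => Promise<T>,
): Promise<T> {
  try {
    const hit = await fast();
    if (hit !== null) return hit;
  } catch {
    // Assumption: a failing fast path is treated like an unavailable one.
  }
  return cold();
}
```

Note that an empty array from the fast path is a valid hit, not a miss. Only `null` triggers the fallback, which matches the `daemonMatches !== null` check above.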


@@ -1,28 +1,43 @@
import { allClients } from "~/services/broker/facade.js";
import { withMesh } from "./connect.js";
import { tryRememberViaDaemon } from "~/services/bridge/daemon-route.js";
import { render } from "~/ui/render.js";
import { dim } from "~/ui/styles.js";
import { EXIT } from "~/constants/exit-codes.js";
export async function remember(
content: string,
opts: { mesh?: string; tags?: string; json?: boolean } = {},
): Promise<number> {
const client = allClients()[0];
if (!client) {
console.error("Not connected to any mesh.");
return EXIT.NETWORK_ERROR;
if (!content) {
render.err("Usage: claudemesh remember <text>");
return EXIT.INVALID_ARGS;
}
const tags = opts.tags?.split(",").map((t) => t.trim()).filter(Boolean);
const id = await client.remember(content, tags);
if (opts.json) {
console.log(JSON.stringify({ id, content, tags }));
// Daemon path first.
const daemonRes = await tryRememberViaDaemon(content, tags, opts.mesh);
if (daemonRes) {
if (opts.json) {
console.log(JSON.stringify({ id: daemonRes.id, content, tags, mesh: daemonRes.mesh }));
return EXIT.SUCCESS;
}
render.ok("remembered", dim(daemonRes.id.slice(0, 8)));
return EXIT.SUCCESS;
}
if (id) {
console.log(`\u2713 Remembered (${id.slice(0, 8)})`);
return EXIT.SUCCESS;
}
console.error("\u2717 Failed to store memory");
return EXIT.INTERNAL_ERROR;
return await withMesh({ meshSlug: opts.mesh ?? null }, async (client) => {
const id = await client.remember(content, tags);
if (opts.json) {
console.log(JSON.stringify({ id, content, tags }));
return EXIT.SUCCESS;
}
if (id) {
render.ok("remembered", dim(id.slice(0, 8)));
return EXIT.SUCCESS;
}
render.err("failed to store memory");
return EXIT.INTERNAL_ERROR;
});
}


@@ -7,6 +7,8 @@
*/
import { withMesh } from "./connect.js";
import { render } from "~/ui/render.js";
import { bold, clay, dim } from "~/ui/styles.js";
export interface RemindFlags {
mesh?: string;
@@ -35,13 +37,12 @@ function parseDeliverAt(flags: RemindFlags): number | null {
return Date.now() + ms;
}
if (flags.at) {
// Try HH:MM first
const hm = flags.at.match(/^(\d{1,2}):(\d{2})$/);
if (hm) {
const now = new Date();
const target = new Date(now);
target.setHours(parseInt(hm[1]!, 10), parseInt(hm[2]!, 10), 0, 0);
if (target <= now) target.setDate(target.getDate() + 1); // next occurrence
if (target <= now) target.setDate(target.getDate() + 1);
return target.getTime();
}
const ts = Date.parse(flags.at);
@@ -54,61 +55,53 @@ export async function runRemind(
flags: RemindFlags,
positional: string[],
): Promise<void> {
const useColor = !process.env.NO_COLOR && process.env.TERM !== "dumb" && process.stdout.isTTY;
const dim = (s: string) => (useColor ? `\x1b[2m${s}\x1b[22m` : s);
const bold = (s: string) => (useColor ? `\x1b[1m${s}\x1b[22m` : s);
const action = positional[0];
// claudemesh remind list
if (action === "list") {
await withMesh({ meshSlug: flags.mesh ?? null }, async (client) => {
const scheduled = await client.listScheduled();
if (flags.json) { console.log(JSON.stringify(scheduled, null, 2)); return; }
if (scheduled.length === 0) { console.log(dim("No pending reminders.")); return; }
if (scheduled.length === 0) { render.info(dim("No pending reminders.")); return; }
render.section(`reminders (${scheduled.length})`);
for (const m of scheduled) {
const when = new Date(m.deliverAt).toLocaleString();
const to = m.to === client.getSessionPubkey() ? dim("(self)") : m.to;
console.log(` ${bold(m.id.slice(0, 8))} ${to} at ${when}`);
console.log(` ${dim(m.message.slice(0, 80))}`);
console.log("");
process.stdout.write(` ${bold(m.id.slice(0, 8))} ${dim("→")} ${to} ${dim("at")} ${when}\n`);
process.stdout.write(` ${dim(m.message.slice(0, 80))}\n\n`);
}
});
return;
}
// claudemesh remind cancel <id>
if (action === "cancel") {
const id = positional[1];
if (!id) { console.error("Usage: claudemesh remind cancel <id>"); process.exit(1); }
if (!id) { render.err("Usage: claudemesh remind cancel <id>"); process.exit(1); }
await withMesh({ meshSlug: flags.mesh ?? null }, async (client) => {
const ok = await client.cancelScheduled(id);
if (ok) console.log(`✓ Cancelled ${id}`);
else { console.error(`✗ Not found or already fired: ${id}`); process.exit(1); }
if (ok) render.ok(`cancelled ${bold(id.slice(0, 8))}`);
else { render.err(`not found or already fired: ${id}`); process.exit(1); }
});
return;
}
// claudemesh remind <message> --in <duration> | --at <time> | --cron <expr>
const message = action ?? positional.join(" ");
if (!message) {
console.error("Usage: claudemesh remind <message> --in <duration>");
console.error(" claudemesh remind <message> --at <time>");
console.error(' claudemesh remind <message> --cron "0 */2 * * *"');
console.error(" claudemesh remind list");
console.error(" claudemesh remind cancel <id>");
render.err("Usage: claudemesh remind <message> --in <duration>");
render.info(dim(" claudemesh remind <message> --at <time>"));
render.info(dim(' claudemesh remind <message> --cron "0 */2 * * *"'));
render.info(dim(" claudemesh remind list"));
render.info(dim(" claudemesh remind cancel <id>"));
process.exit(1);
}
const isCron = !!flags.cron;
const deliverAt = isCron ? 0 : parseDeliverAt(flags);
if (!isCron && deliverAt === null) {
console.error('Specify when: --in <duration> (e.g. "2h", "30m"), --at <time> (e.g. "15:00"), or --cron <expression>');
render.err('Specify when', 'use --in <duration> (e.g. "2h", "30m"), --at <time> (e.g. "15:00"), or --cron <expression>');
process.exit(1);
}
await withMesh({ meshSlug: flags.mesh ?? null }, async (client) => {
// Determine target: --to flag or self
let targetSpec: string;
if (flags.to && flags.to !== "self") {
if (flags.to.startsWith("@") || flags.to === "*" || /^[0-9a-f]{64}$/i.test(flags.to)) {
@@ -117,7 +110,7 @@ export async function runRemind(
const peers = await client.listPeers();
const match = peers.find((p) => p.displayName.toLowerCase() === flags.to!.toLowerCase());
if (!match) {
console.error(`Peer "${flags.to}" not found. Online: ${peers.map((p) => p.displayName).join(", ") || "(none)"}`);
render.err(`Peer "${flags.to}" not found`, `online: ${peers.map((p) => p.displayName).join(", ") || "(none)"}`);
process.exit(1);
}
targetSpec = match.pubkey;
@@ -127,16 +120,22 @@ export async function runRemind(
}
const result = await client.scheduleMessage(targetSpec, message, deliverAt ?? 0, false, flags.cron);
if (!result) { console.error("Broker did not acknowledge — check connection"); process.exit(1); }
if (!result) { render.err("Broker did not acknowledge — check connection"); process.exit(1); }
if (flags.json) { console.log(JSON.stringify(result)); return; }
const toLabel = !flags.to || flags.to === "self" ? "yourself" : flags.to;
if (isCron) {
const nextFire = new Date(result.deliverAt).toLocaleString();
console.log(`✓ Recurring reminder set (${result.scheduledId.slice(0, 8)}): "${message}" → ${toLabel} — cron: ${flags.cron}, next fire: ${nextFire}`);
render.ok(
`recurring reminder set`,
`${result.scheduledId.slice(0, 8)} · ${clay(message)}${toLabel} · cron ${flags.cron} · next ${nextFire}`,
);
} else {
const when = new Date(result.deliverAt).toLocaleString();
console.log(`✓ Reminder set (${result.scheduledId.slice(0, 8)}): "${message}" → ${toLabel} at ${when}`);
render.ok(
`reminder set`,
`${result.scheduledId.slice(0, 8)} · ${clay(message)}${toLabel} at ${when}`,
);
}
});
}
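The `--in <duration>` flag above is resolved by `parseDeliverAt`, whose body is not part of this diff. The sketch below shows one way such a duration could be turned into an absolute delivery time. The names `parseDurationMs` and `deliverAtFromIn`, and the set of supported units, are assumptions for illustration and not the project's actual helpers.

```typescript
// Hypothetical helpers: the real parseDeliverAt is not shown in this diff.
const UNIT_MS: Record<string, number> = {
  s: 1_000,
  m: 60_000,
  h: 3_600_000,
  d: 86_400_000,
};

/** Parse strings such as "30m", "2h" or "1d" into milliseconds; null when malformed. */
function parseDurationMs(input: string): number | null {
  const match = /^(\d+)([smhd])$/.exec(input.trim().toLowerCase());
  if (!match) return null;
  const amount = Number(match[1]);
  const unit = UNIT_MS[match[2]!];
  if (unit === undefined || !Number.isSafeInteger(amount)) return null;
  return amount * unit;
}

/** Resolve an --in value to an absolute epoch-ms delivery time relative to `now`. */
function deliverAtFromIn(input: string, now: number = Date.now()): number | null {
  const ms = parseDurationMs(input);
  return ms === null ? null : now + ms;
}
```

Returning `null` for malformed input matches how the command treats `deliverAt === null`: it prints the "Specify when" hint and exits with status 1.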


@@ -1,13 +1,72 @@
import { rename as renameMesh } from "~/services/mesh/facade.js";
import { green, icons } from "~/ui/styles.js";
/**
* `claudemesh rename <old-slug> <new-slug>` — change a mesh's identifier.
*
* v0.7.0 collapse: slug IS the identifier — there is no separate
* "display name". Pre-launch we collapsed the model so users only ever
* deal with one identifier per mesh. The mesh.name column on the DB is
* kept for now (avoids touching ~25 reader sites) but is always synced
* to slug; a follow-up migration drops it.
*/
import { reslug as reslugMesh } from "~/services/mesh/facade.js";
import { getStoredToken } from "~/services/auth/facade.js";
import { ApiError } from "~/services/api/facade.js";
import { readConfig, setMeshConfig, removeMeshConfig } from "~/services/config/facade.js";
import { bold, dim, green, icons } from "~/ui/styles.js";
import { EXIT } from "~/constants/exit-codes.js";
export async function rename(slug: string, newName: string): Promise<number> {
const SLUG_RE = /^[a-z0-9][a-z0-9-]{1,31}$/;
export async function rename(oldSlug: string, newSlug: string): Promise<number> {
if (!oldSlug || !newSlug) {
console.error(` ${icons.cross} Usage: ${bold("claudemesh rename")} <old-slug> <new-slug>`);
return EXIT.INVALID_ARGS;
}
if (!SLUG_RE.test(newSlug)) {
console.error(` ${icons.cross} Invalid slug: must be 2-32 chars, lowercase alnum + hyphens, start with alnum`);
return EXIT.INVALID_ARGS;
}
if (oldSlug === newSlug) {
console.error(` ${icons.cross} Old and new slug are the same.`);
return EXIT.INVALID_ARGS;
}
const auth = getStoredToken();
if (!auth) {
console.error(` ${icons.cross} Renaming a mesh requires a claudemesh.com account session.`);
console.error(` ${dim("Run")} ${bold("claudemesh login")} ${dim("first.")}`);
return EXIT.AUTH_FAILED;
}
const cfg = readConfig();
const collision = cfg.meshes.find((m) => m.slug === newSlug && m.slug !== oldSlug);
if (collision) {
console.error(` ${icons.cross} Slug "${newSlug}" already used locally by another joined mesh.`);
console.error(` ${dim("Pick a different slug, or leave the other mesh first.")}`);
return EXIT.ALREADY_EXISTS;
}
try {
await renameMesh(slug, newName);
console.log(` ${green(icons.check)} Renamed "${slug}" to "${newName}"`);
const updated = await reslugMesh(oldSlug, newSlug);
const local = cfg.meshes.find((m) => m.slug === oldSlug);
if (local) {
removeMeshConfig(oldSlug);
setMeshConfig(updated.slug, { ...local, slug: updated.slug, name: updated.slug });
}
console.log(` ${green(icons.check)} Renamed: "${oldSlug}" → "${updated.slug}"`);
console.log(` ${dim("Other peers will pick up the new identifier after they run")} ${bold("claudemesh sync")}`);
return EXIT.SUCCESS;
} catch (err) {
if (err instanceof ApiError) {
const body = err.body as { error?: string } | undefined;
console.error(` ${icons.cross} ${body?.error ?? err.statusText}`);
if (err.status === 401) return EXIT.AUTH_FAILED;
if (err.status === 403) return EXIT.PERMISSION_DENIED;
if (err.status === 404) return EXIT.NOT_FOUND;
if (err.status === 409) return EXIT.ALREADY_EXISTS;
if (err.status === 400) return EXIT.INVALID_ARGS;
return EXIT.INTERNAL_ERROR;
}
console.error(` ${icons.cross} Failed: ${err instanceof Error ? err.message : err}`);
return EXIT.INTERNAL_ERROR;
}
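The local checks that `rename` runs before calling the API (argument presence, slug shape, no-op rename, local collision) can be isolated as a pure function. This is a minimal sketch for illustration: `checkRename` and its result type are hypothetical names, while `SLUG_RE` is copied from the diff above.

```typescript
// SLUG_RE is taken from the diff: 2-32 chars, lowercase alphanumerics and
// hyphens, first character alphanumeric.
const SLUG_RE = /^[a-z0-9][a-z0-9-]{1,31}$/;

// Hypothetical result type for this sketch.
type RenameCheck = { ok: true } | { ok: false; reason: string };

/** Run the pre-flight checks in the same order as the command does. */
function checkRename(oldSlug: string, newSlug: string, localSlugs: string[]): RenameCheck {
  if (!oldSlug || !newSlug) return { ok: false, reason: "missing argument" };
  if (!SLUG_RE.test(newSlug)) return { ok: false, reason: "invalid slug" };
  if (oldSlug === newSlug) return { ok: false, reason: "same slug" };
  // Another locally joined mesh already uses the requested slug.
  if (localSlugs.some((s) => s === newSlug && s !== oldSlug)) {
    return { ok: false, reason: "local collision" };
  }
  return { ok: true };
}
```

Because the first character is matched separately and the remainder allows 1 to 31 characters, a single-character slug is rejected and the maximum length is 32.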


@@ -10,21 +10,17 @@
*/
import { readConfig, writeConfig } from "~/services/config/facade.js";
import { render } from "~/ui/render.js";
import { bold, dim } from "~/ui/styles.js";
export function runSeedTestMesh(args: string[]): void {
const [brokerUrl, meshId, memberId, pubkey, slug] = args;
if (!brokerUrl || !meshId || !memberId || !pubkey || !slug) {
console.error(
"Usage: claudemesh seed-test-mesh <broker-ws-url> <mesh-id> <member-id> <pubkey> <slug>",
);
console.error("");
console.error(
'Example: claudemesh seed-test-mesh "ws://localhost:7900/ws" mesh-123 member-abc aaa..aaa smoke-test',
);
render.err("Usage: claudemesh seed-test-mesh <broker-ws-url> <mesh-id> <member-id> <pubkey> <slug>");
render.info(dim('Example: claudemesh seed-test-mesh "ws://localhost:7900/ws" mesh-123 member-abc aaa..aaa smoke-test'));
process.exit(1);
}
const config = readConfig();
// Remove any prior entry with same slug (idempotent).
config.meshes = config.meshes.filter((m) => m.slug !== slug);
config.meshes.push({
meshId,
@@ -32,13 +28,11 @@ export function runSeedTestMesh(args: string[]): void {
slug,
name: `Test: ${slug}`,
pubkey,
secretKey: "dev-only-stub", // real keypair generated during join in Step 17
secretKey: "dev-only-stub",
brokerUrl,
joinedAt: new Date().toISOString(),
});
writeConfig(config);
console.log(`Seeded mesh "${slug}" (${meshId}) into local config.`);
console.log(
`Run \`claudemesh mcp\` to connect, or register with Claude Code via \`claudemesh install\`.`,
);
render.ok(`seeded ${bold(slug)}`, dim(meshId));
render.hint(`run ${bold("claudemesh mcp")} to connect, or register with Claude Code via ${bold("claudemesh install")}`);
}
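`runSeedTestMesh` stays idempotent by filtering out any entry with the same slug before pushing the new one. The same pattern can be written as a small non-mutating helper; `upsertBySlug` is a hypothetical name used only for this sketch.

```typescript
/**
 * Hypothetical helper: replace-or-append keyed on `slug`.
 * Returns a new array and leaves the input untouched.
 */
function upsertBySlug<T extends { slug: string }>(entries: readonly T[], next: T): T[] {
  return [...entries.filter((e) => e.slug !== next.slug), next];
}

// Seeding the same slug twice leaves exactly one entry, holding the latest values.
const first = upsertBySlug([], { slug: "smoke-test", meshId: "mesh-123" });
const second = upsertBySlug(first, { slug: "smoke-test", meshId: "mesh-456" });
```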


@@ -6,35 +6,229 @@
* - a pubkey hex ("abc123...")
* - @group ("@flexicar")
* - * (broadcast to all)
*
* Warm path: dials the per-mesh bridge socket the push-pipe holds open
* (~5ms). Cold path: opens its own WS via `withMesh` (~300-700ms).
*/
import { withMesh } from "./connect.js";
import { readConfig } from "~/services/config/facade.js";
import { trySendViaDaemon } from "~/services/bridge/daemon-route.js";
import type { Priority } from "~/services/broker/facade.js";
import { render } from "~/ui/render.js";
import { dim } from "~/ui/styles.js";
export interface SendFlags {
mesh?: string;
priority?: string;
json?: boolean;
/** Allow sending to a target that resolves to one of the caller's
* own sessions. Off by default — trying to message your own
* sibling session is almost always an accident (copying a hex
* pubkey from `peer list` without realizing it was your own row). */
self?: boolean;
}
export async function runSend(flags: SendFlags, to: string, message: string): Promise<void> {
if (!to || !message) {
render.err("Usage: claudemesh send <to> <message>");
process.exit(1);
}
const priority: Priority =
flags.priority === "now" ? "now"
: flags.priority === "low" ? "low"
: "next";
// Resolve which mesh to use. With --mesh, target it directly.
// Without it, default to the only joined mesh when exactly one is
// joined; otherwise leave meshSlug null.
const config = readConfig();
const meshSlug =
flags.mesh ??
(config.meshes.length === 1 ? config.meshes[0]!.slug : null);
// 1.31.6: hex-prefix resolution. If `to` looks like hex but isn't a
// full 64-char pubkey, resolve it against the peer list and replace
// it with the matching full pubkey. The broker stores `targetSpec`
// verbatim and the drain query at apps/broker/src/broker.ts:2408
// matches only on full pubkeys, so a 16-hex prefix would queue
// successfully but never fetch — sender saw "sent", recipient saw
// nothing. Resolving here makes the CLI's prefix UX work end-to-end
// and surfaces ambiguous / unmatched prefixes with a clear error
// instead of a silent drop.
if (
!to.startsWith("@") &&
!to.startsWith("#") &&
to !== "*" &&
/^[0-9a-f]{4,63}$/i.test(to)
) {
try {
const { tryListPeersViaDaemon } = await import("~/services/bridge/daemon-route.js");
const peers = (await tryListPeersViaDaemon()) ?? [];
const lower = to.toLowerCase();
const matches = peers.filter((p) => {
const pk = (p as { pubkey?: string }).pubkey ?? "";
const mpk = (p as { memberPubkey?: string }).memberPubkey ?? "";
return pk.toLowerCase().startsWith(lower) || mpk.toLowerCase().startsWith(lower);
});
if (matches.length === 0) {
render.err(`No peer matches hex prefix "${to}".`);
const names = peers
.map((p) => (p as { displayName?: string }).displayName)
.filter(Boolean)
.join(", ");
if (names) render.hint(`online: ${names}`);
process.exit(1);
}
if (matches.length > 1) {
const candidates = matches
.map((p) => {
const pk = (p as { pubkey?: string }).pubkey ?? "";
const dn = (p as { displayName?: string }).displayName ?? "?";
return `${dn} ${pk.slice(0, 16)}`;
})
.join(", ");
render.err(`Ambiguous hex prefix "${to}" — matches ${matches.length} peers.`);
render.hint(`candidates: ${candidates}`);
render.hint("Use a longer prefix or paste the full 64-char pubkey.");
process.exit(1);
}
to = (matches[0] as { pubkey?: string }).pubkey ?? to;
} catch {
// Daemon unreachable — fall through; cold path will try a name
// lookup and surface its own error if that also fails.
}
}
// Self-DM safety check: if target is a 64-char hex that matches the
// caller's own member pubkey, refuse without --self. Catches the
// common pasted-from-peer-list-not-realizing-it-was-mine footgun.
// With --self, member-pubkey targeting fans out to every connected
// sibling session of your member (the broker's drain only matches
// exact session pubkeys, so we resolve here in the CLI).
if (meshSlug) {
const joined = config.meshes.find((m) => m.slug === meshSlug);
const isOwnMemberKey =
joined && /^[0-9a-f]{64}$/i.test(to) && to.toLowerCase() === joined.pubkey.toLowerCase();
if (isOwnMemberKey && !flags.self) {
render.err(
`Target "${to.slice(0, 16)}…" is your own member pubkey on mesh "${meshSlug}".`,
);
render.hint(
"Pass --self to message a sibling session of your own member, or pick a different peer's pubkey.",
);
process.exit(1);
}
if (isOwnMemberKey && flags.self) {
// Member-pubkey fan-out: resolve to every connected sibling
// session pubkey and send one message per recipient. Required
// because the broker's drain query at apps/broker/src/broker.ts
// matches target_spec only against full session pubkeys —
// sending to a member pubkey would queue successfully but no
// drain would fetch.
try {
const { tryListPeersViaDaemon } = await import("~/services/bridge/daemon-route.js");
const { getSessionInfo } = await import("~/services/session/resolve.js");
const peers = (await tryListPeersViaDaemon()) ?? [];
const session = await getSessionInfo();
const ownSessionPk = session?.presence?.sessionPubkey?.toLowerCase();
const siblings = peers.filter((p) => {
const r = p as { memberPubkey?: string; pubkey?: string; channel?: string };
if (!r.pubkey) return false;
if (ownSessionPk && r.pubkey.toLowerCase() === ownSessionPk) return false;
if (r.channel === "claudemesh-daemon") return false;
return r.memberPubkey?.toLowerCase() === to.toLowerCase();
});
if (siblings.length === 0) {
render.err(`--self fan-out: no other sibling sessions of your member online.`);
process.exit(1);
}
const results: Array<{ pubkey: string; ok: boolean; messageId?: string; error?: string }> = [];
for (const peer of siblings) {
const pk = (peer as { pubkey: string }).pubkey;
const dr = await trySendViaDaemon({ to: pk, message, priority, expectedMesh: meshSlug ?? undefined });
if (dr === null) {
results.push({ pubkey: pk, ok: false, error: "daemon path unavailable" });
continue;
}
if (dr.ok) {
results.push({
pubkey: pk,
ok: true,
...(dr.messageId ? { messageId: dr.messageId } : {}),
});
} else {
results.push({ pubkey: pk, ok: false, error: dr.error });
}
}
const okCount = results.filter((r) => r.ok).length;
if (flags.json) {
console.log(JSON.stringify({ ok: okCount > 0, fanout: results, via: "daemon" }));
} else if (okCount === results.length) {
render.ok(`fanned out to ${okCount} sibling session${okCount === 1 ? "" : "s"} (daemon)`);
for (const r of results) render.info(dim(`${r.pubkey.slice(0, 16)}${r.messageId ? dim(r.messageId.slice(0, 8)) : ""}`));
} else {
render.warn(`fanned out: ${okCount}/${results.length} delivered`);
for (const r of results) {
const tag = r.ok ? "✔" : "✘";
render.info(` ${tag} ${r.pubkey.slice(0, 16)}${r.error ? dim(`${r.error}`) : ""}`);
}
}
return;
} catch (e) {
render.err(`--self fan-out failed: ${e instanceof Error ? e.message : String(e)}`);
process.exit(1);
}
}
}
// Daemon path — preferred when a long-lived daemon is local. UDS at
// ~/.claudemesh/daemon/daemon.sock; ~1ms round-trip; persists outbox
// across CLI invocations so a `claudemesh send` survives a daemon
// crash via the on-disk outbox.
{
const dr = await trySendViaDaemon({ to, message, priority, expectedMesh: meshSlug ?? undefined });
if (dr !== null) {
if (dr.ok) {
if (flags.json) console.log(JSON.stringify({ ok: true, messageId: dr.messageId, target: to, via: "daemon", duplicate: !!dr.duplicate }));
else render.ok(`sent to ${to} (daemon)`, dr.messageId ? dim(dr.messageId.slice(0, 8)) : undefined);
return;
}
// Daemon answered but rejected (409 idempotency, 400 schema). Surface; do not fall through.
if (flags.json) console.log(JSON.stringify({ ok: false, error: dr.error, via: "daemon" }));
else render.err(`send failed (daemon): ${dr.error}`);
process.exit(1);
}
// dr === null → daemon not running and lifecycle couldn't auto-
// spawn it; fall through to cold path. The orphaned bridge tier
// was removed in 1.28.0.
}
// Cold path — open our own WS, encrypt locally, fire envelope.
await withMesh({ meshSlug: flags.mesh ?? null }, async (client) => {
// Resolve display name → pubkey for direct messages.
// If `to` starts with @, *, or looks like a hex pubkey, use as-is.
let targetSpec = to;
if (!to.startsWith("@") && to !== "*" && !/^[0-9a-f]{64}$/i.test(to)) {
// Treat as display name — look up pubkey via list_peers.
if (to.startsWith("#") && !/^#[0-9a-z_-]{20,}$/i.test(to)) {
// Topic by name → resolve to "#<topicId>" via topicList. The broker
// wire format is "#<topicId>"; users type "#<name>" for ergonomics.
const name = to.slice(1);
const topics = await client.topicList();
const match = topics.find((t) => t.name === name);
if (!match) {
const names = topics.map((t) => "#" + t.name).join(", ");
render.err(`Topic "${to}" not found.`, `topics: ${names || "(none)"}`);
process.exit(1);
}
targetSpec = "#" + match.id;
} else if (!to.startsWith("@") && !to.startsWith("#") && to !== "*" && !/^[0-9a-f]{64}$/i.test(to)) {
const peers = await client.listPeers();
const match = peers.find(
(p) => p.displayName.toLowerCase() === to.toLowerCase(),
);
if (!match) {
const names = peers.map((p) => p.displayName).join(", ");
console.error(`Peer "${to}" not found. Online: ${names || "(none)"}`);
render.err(`Peer "${to}" not found.`, `online: ${names || "(none)"}`);
process.exit(1);
}
targetSpec = match.pubkey;
@@ -42,9 +236,17 @@ export async function runSend(flags: SendFlags, to: string, message: string): Pr
const result = await client.send(targetSpec, message, priority);
if (result.ok) {
console.log(`✓ Sent to ${to}${result.messageId ? ` (${result.messageId.slice(0, 8)})` : ""}`);
if (flags.json) {
console.log(JSON.stringify({ ok: true, messageId: result.messageId, target: to }));
} else {
render.ok(`sent to ${to}`, result.messageId ? dim(result.messageId.slice(0, 8)) : undefined);
}
} else {
console.error(`✗ Send failed: ${result.error ?? "unknown error"}`);
if (flags.json) {
console.log(JSON.stringify({ ok: false, error: result.error ?? "unknown" }));
} else {
render.err(`send failed: ${result.error ?? "unknown error"}`);
}
process.exit(1);
}
});
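The hex-prefix resolution added in `runSend` has four outcomes: the input is not a prefix at all, no peer matches, several peers match, or exactly one matches. A minimal sketch of that decision as a pure function follows. `PeerRow`, `PrefixResult` and `resolveHexPrefix` are names assumed for illustration; the regular expression and the matching against both `pubkey` and `memberPubkey` follow the diff.

```typescript
// Hypothetical shapes for this sketch; the real peer type lives elsewhere.
interface PeerRow {
  displayName?: string;
  pubkey?: string;
  memberPubkey?: string;
}

type PrefixResult =
  | { kind: "not-a-prefix" }
  | { kind: "none" }
  | { kind: "ambiguous"; candidates: string[] }
  | { kind: "resolved"; pubkey: string };

function resolveHexPrefix(to: string, peers: PeerRow[]): PrefixResult {
  // 4 to 63 hex chars: shorter is too vague, 64 is already a full pubkey.
  if (!/^[0-9a-f]{4,63}$/i.test(to)) return { kind: "not-a-prefix" };
  const lower = to.toLowerCase();
  const matches = peers.filter(
    (p) =>
      (p.pubkey ?? "").toLowerCase().startsWith(lower) ||
      (p.memberPubkey ?? "").toLowerCase().startsWith(lower),
  );
  if (matches.length === 0) return { kind: "none" };
  if (matches.length > 1) {
    const candidates = matches.map(
      (p) => `${p.displayName ?? "?"} ${(p.pubkey ?? "").slice(0, 16)}`,
    );
    return { kind: "ambiguous", candidates };
  }
  // As in the diff, fall back to the typed value if the row has no pubkey.
  return { kind: "resolved", pubkey: matches[0]!.pubkey ?? to };
}
```

Modelling the outcome as a discriminated union keeps the `process.exit(1)` calls and `render` output in the command itself, so the matching logic can be tested without a daemon.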


@@ -0,0 +1,21 @@
/**
* `claudemesh skill` — print the bundled SKILL.md to stdout.
*
* Zero-install access: the skill is embedded into the binary at build
* time via Bun's text-import attribute, so a fresh `npm i -g` user
* (or someone running the prebuilt binary) can pipe the contents into
* Claude Code (or anywhere else) without copying files into
* ~/.claude/skills.
*
* claudemesh skill | claude --skill-add -
* claudemesh skill > /tmp/cm.md
*/
import skillContent from "../../skills/claudemesh/SKILL.md" with { type: "text" };
import { EXIT } from "~/constants/exit-codes.js";
export async function runSkill(): Promise<number> {
process.stdout.write(skillContent);
if (!skillContent.endsWith("\n")) process.stdout.write("\n");
return EXIT.SUCCESS;
}


@@ -5,6 +5,9 @@
*/
import { withMesh } from "./connect.js";
import { tryGetStateViaDaemon, tryListStateViaDaemon, trySetStateViaDaemon } from "~/services/bridge/daemon-route.js";
import { render } from "~/ui/render.js";
import { bold, dim } from "~/ui/styles.js";
export interface StateFlags {
mesh?: string;
@@ -12,14 +15,20 @@ export interface StateFlags {
}
export async function runStateGet(flags: StateFlags, key: string): Promise<void> {
const useColor =
!process.env.NO_COLOR && process.env.TERM !== "dumb" && process.stdout.isTTY;
const dim = (s: string) => (useColor ? `\x1b[2m${s}\x1b[22m` : s);
// Daemon path first.
const daemonEntry = await tryGetStateViaDaemon(key, flags.mesh);
if (daemonEntry !== null) {
if (!daemonEntry) { render.info(dim("(not set)")); return; }
if (flags.json) { console.log(JSON.stringify(daemonEntry, null, 2)); return; }
const val = typeof daemonEntry.value === "string" ? daemonEntry.value : JSON.stringify(daemonEntry.value);
render.info(val);
render.info(dim(` set by ${daemonEntry.updatedBy} at ${new Date(daemonEntry.updatedAt).toLocaleString()}`));
return;
}
await withMesh({ meshSlug: flags.mesh ?? null }, async (client) => {
const entry = await client.getState(key);
if (!entry) {
console.log(dim(`(not set)`));
render.info(dim("(not set)"));
return;
}
if (flags.json) {
@@ -27,13 +36,12 @@ export async function runStateGet(flags: StateFlags, key: string): Promise<void>
return;
}
const val = typeof entry.value === "string" ? entry.value : JSON.stringify(entry.value);
console.log(val);
console.log(dim(` set by ${entry.updatedBy} at ${new Date(entry.updatedAt).toLocaleString()}`));
render.info(val);
render.info(dim(` set by ${entry.updatedBy} at ${new Date(entry.updatedAt).toLocaleString()}`));
});
}
export async function runStateSet(flags: StateFlags, key: string, value: string): Promise<void> {
// Try to parse as JSON so numbers/booleans/objects work; fall back to string.
let parsed: unknown;
try {
parsed = JSON.parse(value);
@@ -41,18 +49,32 @@ export async function runStateSet(flags: StateFlags, key: string, value: string)
parsed = value;
}
// Daemon path first.
const daemonOk = await trySetStateViaDaemon(key, parsed, flags.mesh);
if (daemonOk) {
render.ok(`${bold(key)} = ${JSON.stringify(parsed)}`);
return;
}
await withMesh({ meshSlug: flags.mesh ?? null }, async (client) => {
await client.setState(key, parsed);
console.log(`${key} = ${JSON.stringify(parsed)}`);
render.ok(`${bold(key)} = ${JSON.stringify(parsed)}`);
});
}
export async function runStateList(flags: StateFlags): Promise<void> {
const useColor =
!process.env.NO_COLOR && process.env.TERM !== "dumb" && process.stdout.isTTY;
const dim = (s: string) => (useColor ? `\x1b[2m${s}\x1b[22m` : s);
const bold = (s: string) => (useColor ? `\x1b[1m${s}\x1b[22m` : s);
// Daemon path first.
const daemonRows = await tryListStateViaDaemon(flags.mesh);
if (daemonRows !== null) {
if (flags.json) { console.log(JSON.stringify(daemonRows, null, 2)); return; }
if (daemonRows.length === 0) { render.info(dim("(no state)")); return; }
render.section(`state (${daemonRows.length})`);
for (const e of daemonRows) {
const val = typeof e.value === "string" ? e.value : JSON.stringify(e.value);
process.stdout.write(` ${bold(e.key)}: ${val}\n`);
process.stdout.write(` ${dim(e.updatedBy + " · " + new Date(e.updatedAt).toLocaleString())}\n`);
}
return;
}
await withMesh({ meshSlug: flags.mesh ?? null }, async (client, mesh) => {
const entries = await client.listState();
@@ -62,14 +84,15 @@ export async function runStateList(flags: StateFlags): Promise<void> {
}
if (entries.length === 0) {
console.log(dim(`No state on mesh "${mesh.slug}".`));
render.info(dim(`No state on mesh "${mesh.slug}".`));
return;
}
render.section(`state (${entries.length})`);
for (const e of entries) {
const val = typeof e.value === "string" ? e.value : JSON.stringify(e.value);
console.log(`${bold(e.key)}: ${val}`);
console.log(dim(` ${e.updatedBy} · ${new Date(e.updatedAt).toLocaleString()}`));
process.stdout.write(` ${bold(e.key)}: ${val}\n`);
process.stdout.write(` ${dim(e.updatedBy + " · " + new Date(e.updatedAt).toLocaleString())}\n`);
}
});
}
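`runStateSet` tries `JSON.parse` on the raw argument and falls back to the raw string, and the read paths print strings verbatim while JSON-encoding everything else. A small sketch of that round trip is shown here; the function names `parseStateValue` and `renderStateValue` are assumptions for illustration.

```typescript
/** Interpret a CLI argument as JSON when possible, else keep it as a string. */
function parseStateValue(raw: string): unknown {
  try {
    return JSON.parse(raw);
  } catch {
    return raw;
  }
}

/** Mirror the display rule used by the get and list commands. */
function renderStateValue(value: unknown): string {
  return typeof value === "string" ? value : JSON.stringify(value);
}
```

One consequence of this design is that `42` is stored as a number and `true` as a boolean. To store the literal string "42", the argument itself has to be valid JSON for a string, that is, `"42"` including the quotes.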


@@ -0,0 +1,167 @@
/**
* `claudemesh topic post <name> <message>` — REST-encrypted send.
*
* Distinct from `claudemesh topic send` (WS-based, currently v1
* plaintext). This verb:
* 1. Mints an ephemeral REST apikey scoped to the topic.
* 2. Fetches + decrypts the topic key (crypto_box).
* 3. Encrypts the body with crypto_secretbox under the topic key.
* 4. POSTs body_version: 2 ciphertext to /api/v1/messages.
* 5. Revokes the apikey.
*
* If the topic doesn't yet have a sealed key for this member (404
* not_sealed) we surface a clear error and skip — the user must wait
* for a holder to re-seal.
*/
import { withRestKey } from "~/services/api/with-rest-key.js";
import { request } from "~/services/api/client.js";
import {
getTopicKey,
encryptMessage,
} from "~/services/crypto/topic-key.js";
import { render } from "~/ui/render.js";
import { clay, dim, green } from "~/ui/styles.js";
import { EXIT } from "~/constants/exit-codes.js";
export interface TopicPostFlags {
mesh?: string;
json?: boolean;
/** Force v1 plaintext send even if the topic is encrypted. */
plaintext?: boolean;
/** Reply-to message id (full id, or a prefix of at least 6 chars). */
replyTo?: string;
}
interface PostResponse {
messageId: string | null;
historyId: string | null;
topic: string;
topicId: string;
notifications: number;
replyToId?: string | null;
}
export async function runTopicPost(
topicName: string,
message: string,
flags: TopicPostFlags,
): Promise<number> {
if (!topicName || !message) {
render.err("Usage: claudemesh topic post <topic> <message>");
return EXIT.INVALID_ARGS;
}
const cleanName = topicName.replace(/^#/, "");
// Extract @-mention tokens for write-time fan-out so the server can
// populate notifications without reading ciphertext.
const mentions: string[] = [];
const mentionRe = /(^|[^A-Za-z0-9_-])@([A-Za-z0-9_-]{1,64})(?=$|[^A-Za-z0-9_-])/g;
let m: RegExpExecArray | null;
while ((m = mentionRe.exec(message)) !== null) {
mentions.push(m[2]!.toLowerCase());
if (mentions.length >= 16) break;
}
return withRestKey(
{
meshSlug: flags.mesh ?? null,
purpose: `post-${cleanName}`,
capabilities: ["read", "send"],
topicScopes: [cleanName],
},
async ({ secret, mesh }) => {
let bodyVersion: 1 | 2 = 1;
let ciphertext: string;
let nonce: string;
if (flags.plaintext) {
// Explicit v1: caller wants plaintext. Encode UTF-8 → base64.
ciphertext = Buffer.from(message, "utf-8").toString("base64");
nonce = Buffer.from(new Uint8Array(24)).toString("base64");
} else {
const keyResult = await getTopicKey({
apiKeySecret: secret,
memberSecretKeyHex: mesh.secretKey,
topicName: cleanName,
});
if (keyResult.ok && keyResult.topicKey) {
const enc = await encryptMessage(keyResult.topicKey, message);
ciphertext = enc.ciphertext;
nonce = enc.nonce;
bodyVersion = 2;
} else if (keyResult.error === "topic_unencrypted") {
// Legacy v0.2.0 topic — fall back to v1 plaintext.
ciphertext = Buffer.from(message, "utf-8").toString("base64");
nonce = Buffer.from(new Uint8Array(24)).toString("base64");
} else {
render.err(
`cannot encrypt for #${cleanName}: ${keyResult.error ?? "unknown"}${
keyResult.message ? " — " + keyResult.message : ""
}`,
);
return EXIT.INTERNAL_ERROR;
}
}
// Resolve reply-to: values of 16+ chars are passed through as a full id;
// 6-15 char prefixes are matched against recent history with a single
// query. Server validates same-topic membership.
let replyToId: string | undefined;
if (flags.replyTo) {
if (flags.replyTo.length >= 16) {
replyToId = flags.replyTo;
} else if (flags.replyTo.length >= 6) {
const recent = await request<{
messages: Array<{ id: string }>;
}>({
path: `/api/v1/topics/${encodeURIComponent(cleanName)}/messages?limit=200`,
method: "GET",
token: secret,
});
const hit = recent.messages?.find((r) =>
r.id.startsWith(flags.replyTo!),
);
if (!hit) {
render.err(
`--reply-to ${flags.replyTo}: no recent message id starts with that prefix`,
);
return EXIT.INVALID_ARGS;
}
replyToId = hit.id;
} else {
render.err("--reply-to needs at least 6 characters of the message id");
return EXIT.INVALID_ARGS;
}
}
const result = await request<PostResponse>({
path: "/api/v1/messages",
method: "POST",
token: secret,
body: {
topic: cleanName,
ciphertext,
nonce,
bodyVersion,
...(mentions.length > 0 ? { mentions } : {}),
...(replyToId ? { replyToId } : {}),
},
});
if (flags.json) {
console.log(JSON.stringify({ ...result, bodyVersion, mentions }));
return EXIT.SUCCESS;
}
const versionTag = bodyVersion === 2 ? green("🔒 v2") : dim("v1");
const replyTag = result.replyToId
? ` ${dim("↳ " + result.replyToId.slice(0, 8))}`
: "";
render.ok(
"posted",
`${clay("#" + cleanName)} ${versionTag}${replyTag} ${dim(`(${result.notifications} mentions)`)}`,
);
return EXIT.SUCCESS;
},
);
}
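The mention scan in `runTopicPost` can be lifted into a standalone function to see which tokens it picks up. The regular expression and the cap of 16 are taken from the diff; the wrapper `extractMentions` and the constant name `MAX_MENTIONS` are assumptions for this sketch.

```typescript
const MAX_MENTIONS = 16; // same cap as the loop in runTopicPost

/** Collect lowercased @-mention names, skipping '@' that sits inside a word. */
function extractMentions(message: string): string[] {
  const mentions: string[] = [];
  // Group 1 requires start-of-string or a non-name character before '@',
  // so addresses like user@example.com are not treated as mentions.
  const mentionRe = /(^|[^A-Za-z0-9_-])@([A-Za-z0-9_-]{1,64})(?=$|[^A-Za-z0-9_-])/g;
  let m: RegExpExecArray | null;
  while ((m = mentionRe.exec(message)) !== null) {
    mentions.push(m[2]!.toLowerCase());
    if (mentions.length >= MAX_MENTIONS) break;
  }
  return mentions;
}

// Names are lowercased and duplicates are kept, as in the original loop.
const found = extractMentions("hi @Alice and @bob-2!");
```

The trailing boundary is a lookahead, so the character after one mention is not consumed and can serve as the leading boundary of the next, which lets `@a @b` yield both names.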
