diff --git a/.artifacts/specs/2026-05-04-per-session-presence.md b/.artifacts/specs/2026-05-04-per-session-presence.md
new file mode 100644
index 0000000..30fb0f0
--- /dev/null
+++ b/.artifacts/specs/2026-05-04-per-session-presence.md
@@ -0,0 +1,282 @@
+# Per-session broker presence — daemon-multiplexed
+
+**Status:** spec, queued for 1.30.0 (alongside launch-wizard refactor).
+**Owner:** alezmad
+**Author:** Claude (Sprint A planning, 2026-05-04)
+**Related:** `2026-05-04-v2-roadmap-completion.md` (Sprint A overview),
+1.29.0 session-registry CHANGELOG entry.
+
+## Problem
+
+After 1.28.0 dropped the bridge tier, **launched `claude` sessions have
+no persistent broker presence**. Only the daemon does.
+
+Concretely: two `claudemesh launch` sessions in the same cwd, querying
+`peer list` 2 s apart, **never see each other**. Each `claudemesh peer
+list` opens a short-lived cold-path WS that creates a `presence` row
+for the duration of the query and tears it down. The "this session"
+row everyone sees in their own snapshot is created by the snapshot
+itself; sibling sessions' queries miss it because their WS-lifetimes
+don't overlap.
+
+Confirmed empirically (2026-05-04, same-cwd ECIJA-Intranet test):
+
+| Snapshot | timestamp | self pubkey | self `connectedAt` |
+|---|---|---|---|
+| Session A | 11:42:37Z | `61d96106cb499208` | 11:42:38Z (= query time) |
+| Session B | 11:42:39Z | `ce77188aba02827d` | 11:42:38Z (= query time) |
+
+Each saw 5 long-lived peers (the daemon and unrelated other sessions)
+plus its own ephemeral row. Neither saw the other.
+
+## Goal
+
+Every launched `claude` session has a long-lived broker presence row
+**owned by the daemon**, identified by the session's per-launch
+keypair. Siblings see each other in `peer list` immediately and
+continuously, not as snapshot artifacts.
+
+## Non-goals
+
+- Cross-machine session sync (waiting on 2.0.0 HKDF identity).
+- Replacing the daemon's own presence row — the daemon stays as a
+  separate row for "the user on this machine, no specific session."
+- Persistence of the session-presence link across daemon restarts —
+  a daemon restart may require launched sessions to re-register
+  (same compromise as the in-memory session registry from 1.29.0).
+
+## Design
+
+### State machine
+
+The 1.29.0 session registry already tracks a `Map` from session token
+to session info inside the daemon. Extend it to own a per-session
+broker connection.
+
+```
+session lifecycle:
+  POST /v1/sessions/register
+    → registry.set(token, info)
+    → daemon.openSessionWs(info)        ← NEW
+    → broker creates presence row owned by session.pubkey
+
+  DELETE /v1/sessions/:token
+    → registry.delete(token)
+    → daemon.closeSessionWs(token)      ← NEW
+    → broker marks presence.disconnectedAt = now()
+
+  reaper (30 s tick): pid dead?
+    → registry.delete(token)
+    → daemon.closeSessionWs(token)
+```
+
+### Daemon-side: per-session `BrokerClient`
+
+Today the daemon holds a `Map` of `DaemonBrokerClient`s (one WS per
+attached mesh). Add a parallel `Map` of `SessionBrokerClient`s for
+the per-launch ephemeral connections.
+
+`SessionBrokerClient` is the existing `BrokerClient` reused, configured
+with the session's per-launch keypair instead of the member's stable
+keypair. It registers presence (`presence_join`) and stays connected
+until `closeSessionWs(token)` fires. It does **not** drain the outbox
+— that's the member-keypair `DaemonBrokerClient`'s job. It only carries
+presence and receives DMs targeted at the session pubkey.
+
+### Broker-side: parent-vouched presence auth
+
+Today's broker accepts hello-sig auth where:
+- Caller signs the broker's nonce with their `mesh_member` keypair.
+- Broker looks up `mesh_member.peer_pubkey == sig.pubkey`.
+
+For per-session keypairs, the session pubkey is **not** in `mesh_member`
+— it's freshly generated by `claudemesh launch`.
+We need a new attestation flow:
+
+```
+hello {
+  type: "session_hello",
+  session_pubkey: ,
+  parent_member_pubkey: ,
+  display_name, cwd, role, groups,
+  parent_signature: ed25519_sign(member_priv,
+    "claudemesh-session/" || session_pubkey || "/" || nonce),
+  nonce_challenge: ,
+}
+```
+
+Broker validates:
+1. `parent_member_pubkey` exists in `mesh.member` for the target mesh.
+2. `parent_signature` validates against `parent_member_pubkey` over the
+   canonical message above.
+3. Broker inserts a presence row keyed on `session_pubkey` but
+   `member_id` pointing at the parent member's `mesh.member.id`.
+
+This is the OAuth-style refresh-vs-access pattern: the parent member
+key vouches "this ephemeral session pubkey belongs to me." The broker
+binds the row to the parent member but uses the session pubkey for
+routing (so DMs targeted at the session pubkey land at this WS).
+
+### CLI-side: launch.ts produces the parent signature
+
+`claudemesh launch` already mints the session keypair and writes the
+session-token file. Extend it to also produce a `parent_signature`
+that the daemon can present when opening the session WS:
+
+```ts
+const sessionPubkey = sessionKeypair.publicKey;
+const parentSig = ed25519_sign(
+  mesh.secretKey,
+  Buffer.concat([
+    Buffer.from("claudemesh-session/"),
+    sessionPubkey,
+    Buffer.from("/"),
+    /* nonce comes from broker — handled at WS-connect time */
+  ]),
+);
+```
+
+Actually, the nonce is broker-issued at hello time, so the signature
+needs to be produced fresh per WS-connect. Simpler approach: the
+`POST /v1/sessions/register` body carries the *member secret key* (or
+a derived signing capability) so the daemon can sign nonces on behalf
+of the session.
+
+That's a key-leak risk.
+Better: register carries a **pre-signed attestation** good for a TTL window:
+
+```
+register body adds:
+  parent_attestation: {
+    session_pubkey: hex,
+    parent_member_pubkey: hex,
+    expires_at: ISO,
+    signature: ed25519_sign(member_priv,
+      "claudemesh-session-attest/" ||
+      session_pubkey || "/" ||
+      expires_at),
+  }
+```
+
+Daemon presents this attestation in `session_hello`; broker validates
+expiry and signature, then issues a nonce challenge that the daemon
+can satisfy with the session keypair (which IS held by the daemon
+for the lifetime of the registration). Two-stage: parent vouches the
+session; session signs the nonce.
+
+### Registry persistence
+
+For now, in-memory only (matching 1.29.0). Daemon restart drops all
+session WSes; launched `claude` processes are responsible for
+re-registering on next CLI invocation. Acceptable v1 behaviour;
+revisit when sqlite persistence lands for the registry.
+
+## Wire changes
+
+### Broker
+
+- New `session_hello` message type (additive; existing `hello` for
+  member auth unchanged).
+- `presence` row schema unchanged — `member_id` still required, but
+  `session_pubkey` differs from member's stable pubkey.
+- Validate `parent_attestation.expires_at <= now() + 24h` to bound
+  attestation reuse.
+
+### Daemon
+
+- New `SessionBrokerClient` factory — wraps `BrokerClient` with
+  session-mode hello.
+- A `Map` of `SessionBrokerClient`s alongside the existing
+  `Map` of `DaemonBrokerClient`s.
+- IPC routes:
+  - `POST /v1/sessions/register` — extend body schema with
+    `parent_attestation`.
+  - `DELETE /v1/sessions/:token` — close the session WS first, then
+    drop registry entry.
+
+### CLI (`claudemesh launch`)
+
+- Mint session keypair (today only writes the session token; need to
+  add ed25519 keypair generation per launch and write the privkey
+  alongside the token).
+- Sign `parent_attestation` with the member key from the joined-mesh
+  config.
+- POST register with both the new keypair and the attestation.
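
The two-stage check described above (the parent member vouches for the session, then the session key signs the broker's nonce) can be sketched with Node's built-in `crypto` module. This is a minimal sketch, not the broker's implementation. The camelCase field names, the helper functions, and the choice to sign the hex pubkey and ISO timestamp as UTF-8 text are assumptions; the spec only fixes the `claudemesh-session-attest/` prefix and the 24h bound.

```typescript
import {
  createPublicKey, generateKeyPairSync, randomBytes, sign, verify, KeyObject,
} from "node:crypto";

// Hypothetical shape mirroring the `parent_attestation` register body.
interface ParentAttestation {
  sessionPubkey: string;      // hex, raw 32-byte ed25519 key
  parentMemberPubkey: string; // hex, raw 32-byte ed25519 key
  expiresAt: string;          // ISO 8601
  signature: string;          // hex
}

const MAX_TTL_MS = 24 * 60 * 60 * 1000; // broker-side bound on attestation lifetime

// Convert between Node KeyObjects and raw hex pubkeys via JWK.
function pubkeyToHex(key: KeyObject): string {
  const jwk = key.export({ format: "jwk" });
  return Buffer.from(jwk.x as string, "base64url").toString("hex");
}

function pubkeyFromHex(hex: string): KeyObject {
  const x = Buffer.from(hex, "hex").toString("base64url");
  return createPublicKey({ key: { kty: "OKP", crv: "Ed25519", x }, format: "jwk" });
}

// Assumed canonical encoding: prefix, hex pubkey, "/", ISO expiry, as UTF-8.
function attestMessage(sessionPubkeyHex: string, expiresAt: string): Buffer {
  return Buffer.from(`claudemesh-session-attest/${sessionPubkeyHex}/${expiresAt}`, "utf8");
}

// CLI side: the member key signs once per launch and is never sent to the daemon.
function signAttestation(
  memberPriv: KeyObject, memberPub: KeyObject, sessionPub: KeyObject,
  ttlMs: number, now: number = Date.now(),
): ParentAttestation {
  const sessionPubkey = pubkeyToHex(sessionPub);
  const expiresAt = new Date(now + ttlMs).toISOString();
  const signature = sign(null, attestMessage(sessionPubkey, expiresAt), memberPriv);
  return {
    sessionPubkey,
    parentMemberPubkey: pubkeyToHex(memberPub),
    expiresAt,
    signature: signature.toString("hex"),
  };
}

// Broker side, stage 1: known parent, unexpired, within the 24h bound, valid signature.
function verifyAttestation(
  att: ParentAttestation, knownMembers: Set<string>, now: number = Date.now(),
): boolean {
  if (!knownMembers.has(att.parentMemberPubkey)) return false;
  const expires = Date.parse(att.expiresAt);
  if (Number.isNaN(expires) || expires <= now || expires > now + MAX_TTL_MS) return false;
  return verify(
    null,
    attestMessage(att.sessionPubkey, att.expiresAt),
    pubkeyFromHex(att.parentMemberPubkey),
    Buffer.from(att.signature, "hex"),
  );
}

// Broker side, stage 2: the session key must sign the broker-issued nonce.
function verifyNonce(att: ParentAttestation, nonce: Buffer, nonceSig: Buffer): boolean {
  return verify(null, nonce, pubkeyFromHex(att.sessionPubkey), nonceSig);
}

// Usage: one launch, one WS connect.
const member = generateKeyPairSync("ed25519");
const session = generateKeyPairSync("ed25519");
const att = signAttestation(member.privateKey, member.publicKey, session.publicKey, 60 * 60 * 1000);
const nonce = randomBytes(32);                          // broker-issued challenge
const nonceSig = sign(null, nonce, session.privateKey); // daemon holds the session key
const ok = verifyAttestation(att, new Set([att.parentMemberPubkey])) && verifyNonce(att, nonce, nonceSig);
console.log("session_hello accepted:", ok);
```

Because the nonce is signed with the session key, the daemon can reconnect with a fresh broker nonce at any time without the member secret key; only the attestation expiry forces a new launch-side signature.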
+
+## LoC estimate
+
+- Daemon `SessionBrokerClient` + registry hook: ~120 LoC.
+- IPC route schema extension + validation: ~40 LoC.
+- Broker `session_hello` handler + tests: ~140 LoC.
+- CLI `claudemesh launch` keypair + attestation: ~60 LoC.
+- Tests + smoke: ~80 LoC.
+
+Total: **~440 LoC** across CLI + daemon + broker.
+
+## Risks
+
+| Risk | Mitigation |
+|---|---|
+| Member private key never leaves the user's machine, but the **attestation** (signed token) can be replayed within its TTL. | TTL bound 24h; refresh on launch; revocation path = drop the parent member's mesh enrollment (nuclear, but works). |
+| Cascading WS connections — N launches = N+1 broker WSes per user. | Acceptable up to 10-20 concurrent sessions; if it ever becomes a problem, multiplex per-session at the protocol level (one WS, multiple presence rows). Out of scope for v1. |
+| Daemon restart kills all session WSes — `peer list` from inside a launched session sees the remaining 5 peers but not its own siblings until they re-register. | Same as 1.29.0 registry. The registry could persist to sqlite later; for v1, accepted. |
+| Broker schema cost: every new presence row has a different `session_pubkey`, growing the table faster. | Already accepted — broker prunes disconnected rows on a 30-day window. Per-session keys triple the row count at peak but stay within the prune budget. |
+
+## Compatibility
+
+- **Older brokers** can't validate `session_hello`. Sessions will
+  attempt the new hello, get back `unknown_message_type`, and fall
+  back to the existing member-keyed hello (no per-session presence,
+  but everything still works as 1.28.0). Add the broker change first,
+  let it deploy, then ship the CLI side.
+- **Older CLIs** continue to work unchanged — they don't open
+  per-session WSes. They appear as ephemeral cold-path rows just like
+  today, and lose the symmetric-visibility property between siblings.
+- **Backward visible:** users on 1.30.0+ on the same mesh as users on
+  ≤1.29.x will see the older users as one row (their daemon) instead
+  of one row per session. Acceptable — opt in to the new visibility
+  by upgrading.
+
+## Sequencing
+
+1. **Broker change ships first.** Add `session_hello` handler, deploy,
+   bake for ~24h. No CLI behaviour change yet.
+2. **Daemon `SessionBrokerClient` ships next** behind a feature flag
+   (`CLAUDEMESH_SESSION_PRESENCE=1`). Manually test with two launched
+   sessions in the same cwd; verify both see each other.
+3. **CLI keypair-mint + attestation in `launch.ts` ships last**, behind
+   the same flag.
+4. Flip the flag default in 1.30.0 release; document rollback via env.
+
+## Verification
+
+End-to-end smoke (paste into 1.30.0's CHANGELOG):
+
+```
+$ # In two different shells, both cd ~/Desktop/foo:
+$ claudemesh launch --name SessionA -y    # shell 1
+$ claudemesh launch --name SessionB -y    # shell 2
+$
+$ # In a third shell:
+$ claudemesh peer list --json --mesh foo | jq '.[] | {n: .displayName, c: .cwd}'
+{ "n": "SessionA", "c": "/.../foo" }   ← persistent, not query-induced
+{ "n": "SessionB", "c": "/.../foo" }
+$
+$ # In SessionA's shell:
+$ claudemesh peer list --mesh foo
+should include SessionB.
+$
+$ # Kill SessionB (Ctrl-C in shell 2). Wait up to 30 s for the reaper tick.
+$ claudemesh peer list --mesh foo
+should NOT include SessionB (reaper closed its WS).
+```
+
+## Open questions
+
+- Should the per-session WS also drain *its own* outbox subset, or stay
+  presence-only? Recommend presence-only for v1 — keeps state machines
+  simple, daemon's member-keyed WS handles all sends. Can be revisited
+  when per-session policy DSL ships.
+- Should the parent attestation be revocable mid-session? Could add an
+  IPC route on the daemon. Out of scope for v1; revoke = drop the
+  whole member enrollment.
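
For reference, the registry behaviour that the smoke test in this spec exercises (register opens a session WS; delete and the reaper close it) could be sketched as follows. All names here (`SessionRegistry`, `SessionWs`, `isPidAlive`, the `SessionInfo` fields) are hypothetical and not taken from the daemon code. The WS factory and the pid check are injected so the sketch runs without a broker.

```typescript
// Hypothetical types for illustration only.
interface SessionInfo { token: string; pid: number; sessionPubkey: string; }
interface SessionWs { close(): void; }

class SessionRegistry {
  private sessions = new Map<string, SessionInfo>();
  private sockets = new Map<string, SessionWs>();
  private openSessionWs: (info: SessionInfo) => SessionWs;
  private isPidAlive: (pid: number) => boolean;

  constructor(
    openSessionWs: (info: SessionInfo) => SessionWs,
    isPidAlive: (pid: number) => boolean,
  ) {
    this.openSessionWs = openSessionWs;
    this.isPidAlive = isPidAlive;
  }

  // POST /v1/sessions/register: record the session, then open its WS.
  register(info: SessionInfo): void {
    this.unregister(info.token); // re-register replaces any stale entry
    this.sessions.set(info.token, info);
    this.sockets.set(info.token, this.openSessionWs(info));
  }

  // DELETE /v1/sessions/:token: close the WS and drop the entry.
  unregister(token: string): boolean {
    const ws = this.sockets.get(token);
    if (ws) ws.close();
    this.sockets.delete(token);
    return this.sessions.delete(token);
  }

  // Reaper tick: drop every session whose owning process has exited.
  reap(): string[] {
    const dead: string[] = [];
    for (const [token, info] of this.sessions) {
      if (!this.isPidAlive(info.pid)) dead.push(token);
    }
    for (const token of dead) this.unregister(token);
    return dead;
  }

  tokens(): string[] {
    return [...this.sessions.keys()];
  }
}

// Usage with fake pids and a fake WS that records closes.
const alive = new Set<number>([101, 102]);
const closed: string[] = [];
const registry = new SessionRegistry(
  (info) => ({ close: () => { closed.push(info.token); } }),
  (pid) => alive.has(pid),
);
registry.register({ token: "tok-a", pid: 101, sessionPubkey: "aa" });
registry.register({ token: "tok-b", pid: 102, sessionPubkey: "bb" });
alive.delete(102); // SessionB exits
registry.reap();
console.log(registry.tokens(), closed);
```

In a real daemon, `reap()` would presumably run on the 30 s tick via `setInterval`, and `isPidAlive` could wrap `process.kill(pid, 0)` in a try/catch, which is the usual Node idiom for probing whether a pid exists.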