docs(spec): per-session broker presence (queued for 1.30.0)

records the design for daemon-multiplexed broker presence — every
launched claude session gets its own long-lived presence row owned
by the daemon, identified by a per-launch ephemeral keypair vouched
by the member's stable keypair.

resolves the "two sibling sessions can't see each other in peer list"
gap that surfaced when the bridge tier was deleted in 1.28.0. covers
state machine, broker session_hello handler, parent-attestation
signing, ipc route extension, sequencing (broker first, daemon
flagged, cli third), compat with older builds, and verification
smoke.

~440 loc estimate across cli + daemon + broker. queued for 1.30.0
alongside the launch-wizard refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Alejandro Gutiérrez
2026-05-04 12:47:31 +01:00
parent f91871c71d
commit 364178d95b


@@ -0,0 +1,282 @@
# Per-session broker presence — daemon-multiplexed
**Status:** spec, queued for 1.30.0 (alongside launch-wizard refactor).
**Owner:** alezmad
**Author:** Claude (Sprint A planning, 2026-05-04)
**Related:** `2026-05-04-v2-roadmap-completion.md` (Sprint A overview),
1.29.0 session-registry CHANGELOG entry.
## Problem
After 1.28.0 dropped the bridge tier, **launched `claude` sessions have
no persistent broker presence**. Only the daemon does.
Concretely: two `claudemesh launch` sessions in the same cwd, querying
`peer list` 2 s apart, **never see each other**. Each `claudemesh peer
list` opens a short-lived cold-path WS that creates a `presence` row
for the duration of the query and tears it down. The "this session"
row everyone sees in their own snapshot is created by the snapshot
itself; sibling sessions' queries miss it because their WS-lifetimes
don't overlap.
Confirmed empirically (2026-05-04, same-cwd ECIJA-Intranet test):
| Snapshot | timestamp | self pubkey | self `connectedAt` |
|---|---|---|---|
| Session A | 11:42:37Z | `61d96106cb499208` | 11:42:38Z (= query time) |
| Session B | 11:42:39Z | `ce77188aba02827d` | 11:42:38Z (= query time) |
Each saw 5 long-lived peers (the daemon and unrelated other sessions)
plus its own ephemeral row. Neither saw the other.
## Goal
Every launched `claude` session has a long-lived broker presence row
**owned by the daemon**, identified by the session's per-launch
keypair. Siblings see each other in `peer list` immediately and
continuously, not as snapshot artifacts.
## Non-goals
- Cross-machine session sync (waiting on 2.0.0 HKDF identity).
- Replacing the daemon's own presence row — the daemon stays as a
separate row for "the user on this machine, no specific session."
- Persistence of the session-presence link across daemon restarts: a
daemon restart may require launched sessions to re-register (the
same trade-off as the in-memory session registry from 1.29.0).
## Design
### State machine
The 1.29.0 session registry already tracks `Map<token, SessionInfo>`
inside the daemon. Extend it to own a per-session broker connection.
```
session lifecycle:
  POST /v1/sessions/register
    → registry.set(token, info)
    → daemon.openSessionWs(info)        ← NEW
      → broker creates presence row owned by session.pubkey
  DELETE /v1/sessions/:token
    → registry.delete(token)
    → daemon.closeSessionWs(token)      ← NEW
      → broker marks presence.disconnectedAt = now()
  reaper (30 s tick): pid dead?
    → registry.delete(token)
    → daemon.closeSessionWs(token)
```
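The lifecycle above could be wired up roughly as follows. This is a minimal sketch, not the daemon's actual code: the fields of `SessionInfo`, the `SessionWs` interface, and the injected `openSessionWs` / `isPidAlive` callbacks are assumptions made for illustration.

```ts
// Hypothetical shapes; the real SessionInfo in the 1.29.0 registry may differ.
interface SessionInfo {
  token: string;
  pid: number;
  sessionPubkey: string;
}

interface SessionWs {
  close(): void;
}

class SessionRegistry {
  private readonly sessions = new Map<string, SessionInfo>();
  private readonly sockets = new Map<string, SessionWs>();
  private readonly openSessionWs: (info: SessionInfo) => SessionWs;
  private readonly isPidAlive: (pid: number) => boolean;

  constructor(
    openSessionWs: (info: SessionInfo) => SessionWs,
    isPidAlive: (pid: number) => boolean,
  ) {
    this.openSessionWs = openSessionWs;
    this.isPidAlive = isPidAlive;
  }

  // POST /v1/sessions/register
  register(info: SessionInfo): void {
    this.unregister(info.token); // a re-register replaces any stale socket
    this.sessions.set(info.token, info);
    this.sockets.set(info.token, this.openSessionWs(info));
  }

  // DELETE /v1/sessions/:token
  unregister(token: string): boolean {
    const existed = this.sessions.delete(token);
    const ws = this.sockets.get(token);
    if (ws) ws.close();
    this.sockets.delete(token);
    return existed;
  }

  // Reaper tick: drop every session whose process has exited.
  reap(): string[] {
    const dead: string[] = [];
    this.sessions.forEach((info, token) => {
      if (!this.isPidAlive(info.pid)) dead.push(token);
    });
    dead.forEach((token) => this.unregister(token));
    return dead;
  }

  has(token: string): boolean {
    return this.sessions.has(token);
  }
}
```

In a daemon, `reap()` would run on the 30 s tick (for example from `setInterval`), and `isPidAlive` could be built on `process.kill(pid, 0)` wrapped in a try/catch.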
### Daemon-side: per-session `BrokerClient`
Today the daemon holds `Map<meshSlug, DaemonBrokerClient>` (one WS per
attached mesh). Add a parallel `Map<token, SessionBrokerClient>` for
the per-launch ephemeral connections.
`SessionBrokerClient` is the existing `BrokerClient` reused, configured
with the session's per-launch keypair instead of the member's stable
keypair. It registers presence (`presence_join`) and stays connected
until `closeSessionWs(token)` fires. It does **not** drain the outbox
— that's the member-keypair `DaemonBrokerClient`'s job. It only carries
presence + receives DMs targeted at the session pubkey.
### Broker-side: parent-vouched presence auth
Today's broker accepts hello-sig auth where:
- Caller signs the broker's nonce with their `mesh_member` keypair.
- Broker looks up `mesh_member.peer_pubkey == sig.pubkey`.
For per-session keypairs, the session pubkey is **not** in `mesh_member`
— it's freshly generated by `claudemesh launch`. We need a new
attestation flow:
```
hello {
  type: "session_hello",
  session_pubkey: <fresh keypair>,
  parent_member_pubkey: <member keypair from config>,
  display_name, cwd, role, groups,
  parent_signature: ed25519_sign(member_priv,
    "claudemesh-session/" || session_pubkey || "/" || nonce),
  nonce_challenge: <broker nonce>,
}
```
On receipt, the broker:
1. Checks that `parent_member_pubkey` exists in `mesh.member` for the
target mesh.
2. Verifies `parent_signature` against `parent_member_pubkey` over the
canonical message above.
3. Inserts a presence row keyed on `session_pubkey`, with `member_id`
pointing at the parent member's `mesh.member.id`.
This is the OAuth-style refresh-vs-access pattern: the parent member
key vouches "this ephemeral session pubkey belongs to me." The broker
binds the row to the parent member but uses the session pubkey for
routing (so DMs targeted at the session pubkey land at this WS).
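The split between ownership (`member_id`) and routing (`session_pubkey`) can be illustrated with an in-memory stand-in for the presence table. This is a sketch under assumptions: `PresenceTable`, its method names, and the row fields other than `connectedAt` / `disconnectedAt` are hypothetical and do not describe the broker's real schema or storage.

```ts
// Hypothetical in-memory stand-in for the broker's presence table.
interface PresenceRow {
  sessionPubkey: string; // routing key: DMs are addressed to this
  memberId: string; // owner: the parent member's id
  connectedAt: Date;
  disconnectedAt?: Date;
}

class PresenceTable {
  private readonly rows = new Map<string, PresenceRow>();

  // Called once the session hello has been validated.
  join(sessionPubkey: string, memberId: string, now: Date): PresenceRow {
    const row: PresenceRow = { sessionPubkey, memberId, connectedAt: now };
    this.rows.set(sessionPubkey, row);
    return row;
  }

  // Called when the session WS closes.
  leave(sessionPubkey: string, now: Date): void {
    const row = this.rows.get(sessionPubkey);
    if (row) row.disconnectedAt = now;
  }

  // Routing looks only at the session pubkey and skips disconnected rows.
  route(targetPubkey: string): PresenceRow | undefined {
    const row = this.rows.get(targetPubkey);
    return row && !row.disconnectedAt ? row : undefined;
  }

  // Ownership queries group sibling sessions under their parent member.
  liveSessionsOf(memberId: string): string[] {
    const out: string[] = [];
    this.rows.forEach((row) => {
      if (row.memberId === memberId && !row.disconnectedAt) {
        out.push(row.sessionPubkey);
      }
    });
    return out;
  }
}
```

Two sibling sessions launched by the same member would appear as two rows with the same `memberId` and different `sessionPubkey` values, which is what lets them see each other in `peer list`.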
### CLI-side: launch.ts produces the parent signature
`claudemesh launch` already mints the session keypair and writes the
session-token file. Extend it to also produce a `parent_signature`
that the daemon can present when opening the session WS:
```ts
const sessionPubkey = sessionKeypair.publicKey;
const parentSig = ed25519_sign(
  mesh.secretKey,
  Buffer.concat([
    Buffer.from("claudemesh-session/"),
    sessionPubkey,
    Buffer.from("/"),
    /* nonce comes from broker; handled at WS-connect time */
  ]),
);
```
That sketch cannot be completed at launch time: the nonce is
broker-issued at hello time, so the signature must be produced fresh
on every WS connect. One way around it is for the
`POST /v1/sessions/register` body to carry the *member secret key* (or
a derived signing capability) so the daemon can sign nonces on behalf
of the session.
That's a key-leak risk. Better: register carries a **pre-signed
attestation** good for a TTL window:
```
register body adds:
  parent_attestation: {
    session_pubkey: hex,
    parent_member_pubkey: hex,
    expires_at: ISO,
    signature: ed25519_sign(member_priv,
      "claudemesh-session-attest/" ||
      session_pubkey || "/" ||
      expires_at),
  }
```
Daemon presents this attestation in `session_hello`; broker validates
expiry and signature, then issues a nonce challenge that the daemon
can satisfy with the session keypair (which IS held by the daemon
for the lifetime of the registration). Two-stage: parent vouches the
session; session signs the nonce.
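The two-stage check can be sketched with Node's built-in `crypto` module. Several things here are assumptions rather than part of the spec: the helper names, the choice to sign the UTF-8 bytes of the hex pubkey and ISO timestamp, and the use of `node:crypto` instead of whatever Ed25519 library the project actually uses. The membership lookup for `parent_member_pubkey` is omitted.

```ts
import { createPublicKey, generateKeyPairSync, sign, verify } from "node:crypto";
import type { KeyObject } from "node:crypto";

// DER prefix of an Ed25519 SubjectPublicKeyInfo; the raw 32-byte key follows it.
const ED25519_SPKI_PREFIX = Buffer.from("302a300506032b6570032100", "hex");

// Hypothetical stand-in for the per-launch keypair mint.
function newKeypair() {
  return generateKeyPairSync("ed25519");
}

function pubkeyToHex(key: KeyObject): string {
  const der = key.export({ format: "der", type: "spki" });
  return der.subarray(der.length - 32).toString("hex");
}

function pubkeyFromHex(hex: string): KeyObject {
  const der = Buffer.concat([ED25519_SPKI_PREFIX, Buffer.from(hex, "hex")]);
  return createPublicKey({ key: der, format: "der", type: "spki" });
}

interface ParentAttestation {
  session_pubkey: string;
  parent_member_pubkey: string;
  expires_at: string;
  signature: string;
}

// Assumed encoding: UTF-8 bytes of "claudemesh-session-attest/<hex>/<ISO>".
function attestationMessage(sessionPubkeyHex: string, expiresAt: string): Buffer {
  return Buffer.from(`claudemesh-session-attest/${sessionPubkeyHex}/${expiresAt}`);
}

// CLI side: the member key vouches for the session key until expiresAt.
function signAttestation(
  memberPrivateKey: KeyObject,
  memberPublicKey: KeyObject,
  sessionPublicKey: KeyObject,
  expiresAt: Date,
): ParentAttestation {
  const session_pubkey = pubkeyToHex(sessionPublicKey);
  const expires_at = expiresAt.toISOString();
  const signature = sign(null, attestationMessage(session_pubkey, expires_at), memberPrivateKey);
  return {
    session_pubkey,
    parent_member_pubkey: pubkeyToHex(memberPublicKey),
    expires_at,
    signature: signature.toString("hex"),
  };
}

// Broker side, stage 1: the attestation is unexpired and signed by the parent.
function verifyAttestation(att: ParentAttestation, now: Date): boolean {
  const expiresMs = Date.parse(att.expires_at);
  if (Number.isNaN(expiresMs) || expiresMs <= now.getTime()) return false;
  return verify(
    null,
    attestationMessage(att.session_pubkey, att.expires_at),
    pubkeyFromHex(att.parent_member_pubkey),
    Buffer.from(att.signature, "hex"),
  );
}

// Broker side, stage 2: the nonce challenge is signed by the session key.
function verifySessionNonce(att: ParentAttestation, nonce: Buffer, nonceSig: Buffer): boolean {
  return verify(null, nonce, pubkeyFromHex(att.session_pubkey), nonceSig);
}
```

With this split the member private key is only needed once, at launch, while the daemon answers each nonce challenge with the session private key it already holds.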
### Registry persistence
For now, in-memory only (matching 1.29.0). Daemon restart drops all
session WSes; launched `claude` processes are responsible for
re-registering on next CLI invocation. Acceptable v1 behaviour;
revisit when sqlite persistence lands for the registry.
## Wire changes
### Broker
- New `session_hello` message type (additive; existing `hello` for
member auth unchanged).
- `presence` row schema unchanged — `member_id` still required, but
`session_pubkey` differs from member's stable pubkey.
- Validate `now() < parent_attestation.expires_at <= now() + 24h`:
reject expired attestations and bound attestation reuse.
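The expiry rule can be written as a small pure function. This is an illustrative sketch; the function and constant names are hypothetical, and it assumes `expires_at` is an ISO-8601 string that `Date.parse` accepts.

```ts
// Upper bound on how far in the future an attestation may expire (24h).
const MAX_ATTESTATION_TTL_MS = 24 * 60 * 60 * 1000;

// True only if the attestation has not expired yet and does not
// claim a lifetime longer than the 24h bound.
function isExpiryAcceptable(expiresAtIso: string, nowMs: number): boolean {
  const expiresMs = Date.parse(expiresAtIso);
  if (Number.isNaN(expiresMs)) return false; // unparseable timestamp
  return expiresMs > nowMs && expiresMs <= nowMs + MAX_ATTESTATION_TTL_MS;
}
```

Taking `nowMs` as a parameter instead of calling `Date.now()` inside keeps the check deterministic in tests.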
### Daemon
- New `SessionBrokerClient` factory — wraps `BrokerClient` with
session-mode hello.
- `Map<token, SessionBrokerClient>` alongside the existing
`Map<slug, DaemonBrokerClient>`.
- IPC routes:
- `POST /v1/sessions/register` — extend body schema with
`parent_attestation`.
- `DELETE /v1/sessions/:token` — close the session WS first, then
drop registry entry.
### CLI (`claudemesh launch`)
- Mint session keypair (today only writes the session token; need to
add ed25519 keypair generation per launch and write the privkey
alongside the token).
- Sign `parent_attestation` with the member key from the joined-mesh
config.
- POST register with both the new keypair and the attestation.
## LoC estimate
- Daemon `SessionBrokerClient` + registry hook: ~120 LoC.
- IPC route schema extension + validation: ~40 LoC.
- Broker `session_hello` handler + tests: ~140 LoC.
- CLI `claudemesh launch` keypair + attestation: ~60 LoC.
- Tests + smoke: ~80 LoC.
Total: **~440 LoC** across CLI + daemon + broker.
## Risks
| Risk | Mitigation |
|---|---|
| Member private key never leaves the user's machine, but the **attestation** (signed token) can be replayed within its TTL. | TTL bound 24h; refresh on launch; revocation path = drop the parent member's mesh enrollment (nuclear, but works). |
| Cascading WS connections — N launches = N+1 broker WSes per user. | Acceptable up to 10-20 concurrent sessions; if it ever becomes a problem, multiplex per-session at the protocol level (one WS, multiple presence rows). Out of scope for v1. |
| Daemon restart kills all session WSes — `peer list` from inside a launched session sees the remaining 5 peers but not its own siblings until they re-register. | Same as 1.29.0 registry. The registry could persist to sqlite later; for v1, accepted. |
| Broker schema cost: every new presence row has a different `session_pubkey`, growing the table faster. | Already accepted — broker prunes disconnected rows on a 30-day window. Per-session keys triple the row count at peak but stay within the prune budget. |
## Compatibility
- **Older brokers** can't validate `session_hello`. Sessions will
attempt the new hello, get back `unknown_message_type`, and fall
back to the existing member-keyed hello (no per-session presence,
but everything else still works as it did in 1.28.0). Add the broker
change first,
let it deploy, then ship the CLI side.
- **Older CLIs** continue to work unchanged — they don't open
per-session WSes. They appear as ephemeral cold-path rows just like
today, and lose the symmetric-visibility property between siblings.
- **Mixed-version meshes:** users on 1.30.0+ on the same mesh as users
on ≤1.29.x will see the older users as one row (their daemon) instead
of one row per session. Acceptable: users opt in to the new
visibility by upgrading.
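The fallback against older brokers could look roughly like this. It is a sketch under assumptions: the `HelloTransport` interface and the result shape are hypothetical, and only the `unknown_message_type` error string comes from the text above.

```ts
type HelloResult = { ok: true } | { ok: false; error: string };

// Hypothetical transport; each method performs one hello round-trip.
interface HelloTransport {
  sessionHello(): Promise<HelloResult>;
  memberHello(): Promise<HelloResult>;
}

type PresenceMode = "per-session" | "member-only";

// Try the new hello first; fall back only when the broker does not
// know the message type. Any other rejection is surfaced as an error.
async function negotiateHello(transport: HelloTransport): Promise<PresenceMode> {
  const first = await transport.sessionHello();
  if (first.ok) return "per-session";
  if (first.error !== "unknown_message_type") {
    throw new Error(`session_hello rejected: ${first.error}`);
  }
  const fallback = await transport.memberHello();
  if (!fallback.ok) {
    throw new Error(`member hello rejected: ${fallback.error}`);
  }
  return "member-only";
}
```

Restricting the fallback to `unknown_message_type` means a broker that understands `session_hello` but rejects the attestation is not silently downgraded to member-only presence.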
## Sequencing
1. **Broker change ships first.** Add `session_hello` handler, deploy,
bake for ~24h. No CLI behaviour change yet.
2. **Daemon `SessionBrokerClient` ships next** behind a feature flag
(`CLAUDEMESH_SESSION_PRESENCE=1`). Manually test with two launched
sessions in the same cwd; verify both see each other.
3. **CLI keypair-mint + attestation in `launch.ts` ships last**, behind
the same flag.
4. Flip the flag default in 1.30.0 release; document rollback via env.
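The flag check behind steps 2 to 4 might be as small as the sketch below. Only the variable name `CLAUDEMESH_SESSION_PRESENCE` and the value `1` come from the text above; the function name, the `defaultOn` parameter, and treating any other non-empty value as "off" are assumptions.

```ts
// Resolve the feature flag from an environment map.
// Unset or empty falls back to the release default; "1" enables;
// any other value disables (assumed rollback convention).
function sessionPresenceEnabled(
  env: Record<string, string | undefined>,
  defaultOn: boolean,
): boolean {
  const raw = env["CLAUDEMESH_SESSION_PRESENCE"];
  if (raw === undefined || raw === "") return defaultOn;
  return raw === "1";
}
```

Before the default flips, callers would pass `defaultOn = false`; in the 1.30.0 release it becomes `true`, and setting the variable to something other than `1` acts as the rollback switch.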
## Verification
End-to-end smoke (paste into 1.30.0's CHANGELOG):
```
$ # In two different shells, both cd ~/Desktop/foo:
$ claudemesh launch --name SessionA -y # shell 1
$ claudemesh launch --name SessionB -y # shell 2
$
$ # In a third shell:
$ claudemesh peer list --json --mesh foo | jq '.[] | {n: .displayName, c: .cwd}'
{ "n": "SessionA", "c": "/.../foo" } ← persistent, not query-induced
{ "n": "SessionB", "c": "/.../foo" }
$
$ # In SessionA's shell:
$ claudemesh peer list --mesh foo
should include SessionB.
$
$ # Kill SessionB (Ctrl-C in shell 2). Wait at least 30s (one reaper tick).
$ claudemesh peer list --mesh foo
should NOT include SessionB (reaper closed its WS).
```
## Open questions
- Should the per-session WS also drain *its own* outbox subset, or stay
presence-only? Recommend presence-only for v1 — keeps state machines
simple, daemon's member-keyed WS handles all sends. Can be revisited
when per-session policy DSL ships.
- Should the parent attestation be revocable mid-session? Could add an
IPC route on the daemon. Out of scope for v1; revoke = drop the
whole member enrollment.