Files
claudemesh/.artifacts/specs/2026-05-05-continuous-presence.md
Alejandro Gutiérrez ffd0621ccc
Some checks failed
CI / Typecheck (push) Has been cancelled
CI / Lint (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
feat(broker,cli): liveness watchdogs — 75s stale-pong terminate
Both sides now actively detect half-dead WS connections instead of
waiting for kernel TCP keepalive (~2hrs default on Linux). Bug user
reported: "claudemesh peer list" shows zero peers despite running
sessions, because NAT/CGNAT silently dropped the WS flow but neither
side noticed.

Broker (apps/broker/src/index.ts):
- Add lastPongAt to PeerConn, populate at connections.set sites,
  bump in ws.on("pong").
- 30s ping loop now also terminates conns whose pong is >75s stale.
  ws.terminate() fires the close handler → existing peer_left path.

Daemon (apps/cli/src/daemon/ws-lifecycle.ts):
- Add idle watchdog at 30s cadence, started after hello-ack.
- Bumps lastActivity on incoming message, ping, and pong frames.
- Sends sock.ping() if recent activity, terminates if idle >75s.
- Watchdog cleared on close handler + explicit close().

CLI 1.34.15 → 1.34.16. Broker stays 0.1.0 (deploys from main).

Spec: .artifacts/specs/2026-05-05-continuous-presence.md (full lease
model + resume token, this commit ships only the watchdogs — first
of four progressive layers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:22:15 +01:00

351 lines
14 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Continuous presence — lease model + resume token
**Status:** spec, ready for v0.3.0.
**Owner:** alezmad
**Author:** Claude (2026-05-05, follow-up to user-reported "after hours claudemesh disconnects")
**Related:** `2026-05-04-per-session-presence.md` (per-launch ephemeral keypair), `apps/broker/src/index.ts:5430-5436` (current 30s ping loop), `apps/cli/src/daemon/ws-lifecycle.ts` (current backoff reconnect).
## Problem
Today, presence is fused to a single TCP/WS connection. When the
connection breaks — half-dead NAT entries, ISP route changes, laptop
sleep, broker restart — the broker tears down the presence row, fires
`peer_left`, and waits for the daemon to dial a fresh socket and run
the full attestation hello again. Other peers see the user blink
offline → back online. Messages sent to the session during the gap are
either dropped (if it's a `now`/`next` priority DM with no recipient
match) or held in `message_queue` for `low` only.
Concrete symptom (user-reported): `claudemesh peer list` shows zero
peers despite multiple sessions being "up" — they're stuck on
half-dead TCP connections. Daemon hasn't noticed because no `close`
fired. Hours later, kernel TCP keepalive (default Linux: 7200s idle +
9 × 75s probes ≈ 2h11m) finally RSTs the socket, daemon's existing
backoff reconnects, peers reappear. Until then: zombie session.
Two coupled bugs:
1. **No application-layer staleness detection.** Broker pings every
30s (line 5431) and updates `lastPingAt` on pong, but never
`terminate()`s a connection that stops returning pongs. Daemon
doesn't ping at all. Both sides trust the kernel for liveness,
which only fires after hours.
2. **Presence == connection.** Even once the staleness IS detected
and the daemon reconnects, peers see a full `peer_left` /
`peer_joined` cycle for a network blip that took 130 seconds.
Outbound messages during the gap that target the session by
pubkey route to nothing.
The user's ask: peers should never see a gap during transient
disconnects. Presence should be continuous as long as the *session
intent* is alive, regardless of how many sockets carried it.
## Goal
Presence is a **lease** keyed off the session's stable identity
(`sessionPubkey`), held in broker memory + DB, with a TTL refreshed
on every keepalive. Sockets come and go beneath the lease. Other peers
see continuous online status across reconnects up to the lease TTL.
Specifically:
- A daemon (or per-session WS) can drop and re-establish the WS
within a configurable grace window (default 90s) without any peer
observing `peer_left` / `peer_joined`.
- Messages sent to a session while its socket is mid-flap are queued,
delivered on the next reattach, ordered.
- Reconnect itself is sub-second on the wire when a `resume_token` is
presented — broker recognises the session, restores the slot, no
re-attestation round-trip.
- After the grace window expires, the broker fires `peer_left`
exactly once; on a later reconnect it fires `peer_joined` exactly
once. No flapping.
## Non-goals
- **Multi-broker handoff.** Out of scope. If the broker process
restarts, leases are lost and we fall back to today's behavior
(clean reconnect, peers see one cycle). A future spec can address
this with a shared lease store (Redis / Postgres LISTEN).
- **Dual-socket on the daemon.** Useful gold-plating but not required
for the user-facing problem. Single-socket with watchdog +
resume-token covers the failure modes actually observed (NAT drops,
ISP blips, sleep <90s).
- **Manual `claudemesh reconnect` CLI.** Not needed; the lease model
makes it redundant. Re-evaluate if real support cases surface.
## Design
### Lease model
```
sessionPubkey → { transport: "online" | "offline",
leaseUntil: Date,
ws: WebSocket | null,
...existing PeerConn fields }
```
Today the `connections` Map IS keyed by `presenceId`, which is a fresh
UUID per WS. We change that key to `sessionPubkey` (member-WS:
`memberPubkey`; session-WS: `sessionPubkey`). The PeerConn struct
gains:
```ts
transport: "online" | "offline";
leaseUntil: Date; // Date.now() + LEASE_TTL_MS
evictionTimer: NodeJS.Timeout | null;
```
### State transitions
**On WS open + hello accepted (initial):**
- Insert into `connections` with `transport: "online"`,
`leaseUntil: now + 90s`, `evictionTimer: null`.
- Broadcast `peer_joined` (today's behavior).
- Issue `resume_token` (see below) in the `hello_ack`.
**On WS open + hello carries valid `resume_token`:**
- Look up by `sessionPubkey`, verify token signature + freshness
(TTL <= LEASE_TTL_MS). If valid AND entry exists with
`transport: "offline"`:
- Cancel `evictionTimer`.
- Swap `ws` reference.
- Set `transport: "online"`, refresh `leaseUntil`.
- **Do NOT** broadcast `peer_joined`. The lease never expired.
- Drain any queued DMs accumulated during offline window.
- Reply `hello_ack` with new `resume_token`.
- If entry exists with `transport: "online"` (token replay attack or
rapid reconnect race): close old `ws` with `1000, "session_replaced"`
before swapping. Same as today's `oldConn.ws.close(1000, ...)`
pattern at lines 1768/1996.
- If no entry exists or token is stale: treat as a fresh hello,
broadcast `peer_joined`. Token expired = same as a cold start.
**On WS close (any reason):**
- Look up by `sessionPubkey`. If not found, no-op (already evicted).
- Set `transport: "offline"`, clear `ws` reference.
- Start `evictionTimer = setTimeout(evict, GRACE_MS)`.
- **Do NOT** broadcast `peer_left`. **Do NOT** delete the entry.
- **Do NOT** call `disconnectPresence(presenceId)` yet.
**On `evictionTimer` fire (lease expired without reattach):**
- Delete from `connections`.
- Broadcast `peer_left` (today's behavior at lines 5167-5189).
- `decMeshCount`.
- `disconnectPresence(presenceId)`.
- Clean up URL watches, stream subs, MCP registry — same as today's
close handler.
- Audit `peer_left`.
**Watchdog (broker):**
- The 30s ping loop (line 5431) gains a staleness check: if any
conn's `transport === "online"` and `lastPingAt < now - 75s`, call
`ws.terminate()`. This converts the half-dead socket into a clean
`close` event, which fires the lease-offline transition above.
- Same logic on the daemon side (see § Daemon changes).
### Resume token
A short opaque string the broker hands the daemon in `hello_ack`.
Format: `mesh-resume.v1.<base64url(JSON-payload)>.<base64url(sig)>`
where `JSON-payload = { sub: <sessionPubkey>, mid: <meshId>, exp:
<unix-ms>, iat: <unix-ms> }` and `sig = ed25519(brokerSigningKey,
JSON-payload)`.
- **Why a token, not just sessionPubkey?** A session needs to prove
it's the holder of an existing lease without re-running the full
attestation handshake (which involves member key + parent
attestation lookup). The token is a server-issued cookie: cheap to
verify, scoped to a single session, expires with the lease.
- **Storage:** broker keeps the signing key in env (`RESUME_TOKEN_KEY`,
generated on first boot if missing, persisted to a config row). No
DB column needed for the tokens themselves — they're verified by
signature alone.
- **TTL:** equal to LEASE_TTL_MS (90s). After that the daemon must
re-handshake with full attestation. Refreshed on every successful
reattach.
- **Daemon storage:** in-memory only. Lost on daemon restart, which
is correct: a daemon restart is a real reconnect and should run
the full hello.
### Wire protocol additions
`hello` (member-WS, session-WS, fresh-launch hello — all three):
```diff
{
type: "hello",
memberPubkey: "...",
sessionPubkey: "...", // session-WS only
attestation: "...", // session-WS only
signature: "...",
+ resumeToken?: "mesh-resume.v1...", // optional; presence = reattach attempt
...
}
```
`hello_ack`:
```diff
{
type: "hello_ack",
presenceId: "...",
...
+ resumeToken: "mesh-resume.v1...", // always issued; replaces prior on reattach
+ leaseTtlMs: 90000, // informational; daemon may use for ping cadence
}
```
No new message types. Old daemons that don't send `resumeToken` get
today's full-handshake behavior — fully backward compatible.
### Message queue during grace window
Today: DMs to a presence whose WS is closed → routed to
`message_queue` only for `priority: low`; `now`/`next` either route
to a different connected session of the same member or drop.
Change: when broker would route to a session whose
`transport === "offline"` (lease still valid), enqueue regardless of
priority. On reattach, the existing inbox-drain path
(`maybePushQueuedMessages` at line 967) flushes them in order. The
`message_queue` already has the schema for this; we're just relaxing
the priority gate when the target is in grace.
### Constants
```ts
const LEASE_TTL_MS = 90_000; // grace window after WS close
const PING_INTERVAL_MS = 30_000; // unchanged
const STALE_PONG_THRESHOLD_MS = 75_000; // 2.5x ping interval
const RESUME_TOKEN_TTL_MS = LEASE_TTL_MS;
```
`LEASE_TTL_MS` = 90s rationale: long enough to absorb a sleep/resume
cycle, NAT timeout, ISP route flap, mobile→wifi handover. Short
enough that a true crash (daemon killed, machine off) clears the
session within 90s — peers don't see ghost online status forever.
Configurable via env (`LEASE_TTL_MS`) for self-hosted brokers.
## Daemon changes
### Watchdog
In `ws-lifecycle.ts`, add an `idleWatchdog` parallel to the existing
backoff/reconnect machinery:
```ts
let lastActivity = Date.now(); // bumped on every incoming message + pong
const watchdog = setInterval(() => {
if (Date.now() - lastActivity > STALE_THRESHOLD_MS) {
log("warn", "ws_stale_terminate", { url: opts.url });
sock.terminate(); // fires existing close handler → reconnect path
} else if (sock.readyState === sock.OPEN) {
sock.ping(); // matches broker's 30s cadence, gives broker a pong
}
}, PING_INTERVAL_MS);
sock.on("message", () => { lastActivity = Date.now(); });
sock.on("pong", () => { lastActivity = Date.now(); });
```
Cleanup `clearInterval(watchdog)` in the close handler and explicit
`close()` path.
### Resume token in hello
`apps/cli/src/daemon/broker.ts:136` and equivalent in
`session-broker.ts`: persist the `resumeToken` from each successful
`hello_ack` into a private field, include it in the next
`buildHello()` call. On daemon restart the field is empty → cold
start, exactly today's behavior.
### No CLI changes
`claudemesh peer list` keeps reading the broker's `connections` Map
which now reflects continuous presence. Users see online sessions as
online during transient blips. No UX surface changes.
## Migration
- New broker is fully backward compatible with old daemons (resume
token is optional, defaults fall through to today's path).
- New daemons against an old broker: token is sent but ignored, full
handshake runs each reconnect — same as today.
- DB migration: none. `presence` table semantics unchanged. The
`disconnectedAt` column is now set only on lease eviction (>90s),
not on every WS close. This is a behavioral change but not a
schema change.
- Add ENV var `RESUME_TOKEN_KEY` (broker generates on first boot if
unset, persists to a singleton config row).
## Test plan
1. **Sleep test:** kill -STOP the daemon for 60s, then kill -CONT.
Expect: peers never see `peer_left`. Daemon's WS is dead-on-arrival
when it wakes; watchdog terminates it; reconnect with resume_token
succeeds within 1-2s; lease was at ~30s of its 90s TTL when the
daemon resumed.
2. **Hard offline:** kill -STOP for 120s, kill -CONT. Expect: peers
see exactly one `peer_left` at t=90s, then exactly one
`peer_joined` after the daemon resumes and reconnects (resume
token is now stale; full handshake runs).
3. **NAT drop simulation:** `iptables -A OUTPUT -p tcp --dport 443
-j DROP` for 60s on the daemon host, then remove the rule. Expect:
broker pings stop landing, broker-side watchdog calls
`ws.terminate()` at t=75s, lease enters grace, daemon's own
watchdog fires within ~30s, daemon reconnects with resume_token,
peers never see a flap.
4. **Message-during-grace:** while a target session is in grace
(offline, lease valid), send a `priority: now` DM. Expect: queued
in `message_queue`, delivered exactly once on reattach, no
`peer_left` visible to sender, ack returns delivered.
5. **Replay attack:** capture a resume_token in flight, replay it
against a different broker connection while the original session
is still online. Expect: broker treats it as a reconnect for an
already-online session → closes old WS with `session_replaced`,
new WS takes over. Equivalent to today's session-replacement
semantics; the original session detects the close and either
reconnects (if it's still alive) or gives up.
6. **Token forgery:** send a `resumeToken` not signed by the broker.
Expect: signature check fails, broker treats hello as a fresh
handshake (or rejects if the rest of the hello is invalid).
## Open questions
- **Should `peer list` expose a `transport` field** so callers can
distinguish "leased but offline" from "online"? Default no — the
abstraction we're selling is "they're online." But debugging may
want it; gate it behind `--all` or `--debug`.
- **What about the broker-side `mcpRegistry` cleanup?** Today we
delete non-persistent MCP entries on WS close (line 5217). With
leases, we should defer that to lease eviction, not WS close.
Otherwise an MCP server registered by a session disappears every
time its WS reconnects.
## Build order
1. **Broker lease model** — change `connections` keying, add
`transport`/`leaseUntil`/`evictionTimer`, refactor close handler
to start grace timer instead of immediate teardown, refactor
eviction path. (~80 lines.)
2. **Resume token** — signing key bootstrap, token issue/verify,
wire format, hello_ack changes. (~50 lines + 1 config row.)
3. **Daemon watchdog** — `ws-lifecycle.ts` adds `idleWatchdog` and
stores `resumeToken` from acks. (~25 lines.)
4. **Daemon hello** — pass `resumeToken` in next `buildHello()`.
(~10 lines across `broker.ts` + `session-broker.ts`.)
5. **Broker watchdog** — extend the 30s ping loop with
`terminate()`-on-stale logic. (~15 lines.)
6. **Queue-during-grace** — relax priority gate in DM routing.
(~5 lines.)
7. **Spec docs** — update `docs/protocol.md` with resume_token,
lease semantics. (~30 lines.)
8. **Tests** — six scenarios above. Likely ~3 new test files.
Estimated total: one focused day. The broker lease model is the load-
bearing change; everything else slots in cleanly once that's done.