Both sides now actively detect half-dead WS connections instead of
waiting for kernel TCP keepalive (~2hrs default on Linux). User-reported
bug: "claudemesh peer list" shows zero peers despite running
sessions, because NAT/CGNAT silently dropped the WS flow but neither
side noticed.
Broker (apps/broker/src/index.ts):
- Add lastPongAt to PeerConn, populate at connections.set sites,
bump in ws.on("pong").
- 30s ping loop now also terminates conns whose pong is >75s stale.
ws.terminate() fires the close handler → existing peer_left path.
Daemon (apps/cli/src/daemon/ws-lifecycle.ts):
- Add idle watchdog at 30s cadence, started after hello-ack.
- Bumps lastActivity on incoming message, ping, and pong frames.
- Sends sock.ping() if recent activity, terminates if idle >75s.
- Watchdog cleared on close handler + explicit close().
CLI 1.34.15 → 1.34.16. Broker stays 0.1.0 (deploys from main).
Spec: .artifacts/specs/2026-05-05-continuous-presence.md (full lease
model + resume token, this commit ships only the watchdogs — first
of four progressive layers).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Continuous presence — lease model + resume token
Status: spec, ready for v0.3.0.
Owner: alezmad
Author: Claude (2026-05-05, follow-up to user-reported "after hours claudemesh disconnects")
Related: 2026-05-04-per-session-presence.md (per-launch ephemeral keypair), apps/broker/src/index.ts:5430-5436 (current 30s ping loop), apps/cli/src/daemon/ws-lifecycle.ts (current backoff reconnect).
Problem
Today, presence is fused to a single TCP/WS connection. When the
connection breaks — half-dead NAT entries, ISP route changes, laptop
sleep, broker restart — the broker tears down the presence row, fires
peer_left, and waits for the daemon to dial a fresh socket and run
the full attestation hello again. Other peers see the user blink
offline → back online. Messages sent to the session during the gap are
either dropped (if it's a now/next priority DM with no recipient
match) or held in message_queue for low only.
Concrete symptom (user-reported): claudemesh peer list shows zero
peers despite multiple sessions being "up" — they're stuck on
half-dead TCP connections. Daemon hasn't noticed because no close
fired. Hours later, kernel TCP keepalive (default Linux: 7200s idle +
9 × 75s probes ≈ 2h11m) finally RSTs the socket, daemon's existing
backoff reconnects, peers reappear. Until then: zombie session.
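The ≈ 2h11m figure can be checked with a short sketch. The three values are the stock Linux keepalive settings quoted above; the function and constant names are illustrative only and do not come from the claudemesh codebase.

```typescript
// Stock Linux keepalive settings quoted in the paragraph above.
const KEEPALIVE_IDLE_S = 7200;   // idle time before the first probe
const KEEPALIVE_PROBES = 9;      // unanswered probes before giving up
const KEEPALIVE_INTERVAL_S = 75; // spacing between probes

// Seconds until the kernel gives up on a silent peer.
function kernelDetectionSeconds(idleS: number, probes: number, intervalS: number): number {
  return idleS + probes * intervalS;
}

// Render a second count as e.g. "2h11m15s".
function formatHms(totalSeconds: number): string {
  const h = Math.floor(totalSeconds / 3600);
  const m = Math.floor((totalSeconds % 3600) / 60);
  const s = totalSeconds % 60;
  return `${h}h${m}m${s}s`;
}

const kernelSeconds = kernelDetectionSeconds(KEEPALIVE_IDLE_S, KEEPALIVE_PROBES, KEEPALIVE_INTERVAL_S);
console.log(kernelSeconds, formatHms(kernelSeconds)); // 7875 2h11m15s
```

Against that, a 75s application-layer threshold detects the same dead socket roughly a hundred times sooner.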
Two coupled bugs:
- No application-layer staleness detection. Broker pings every 30s (line 5431) and updates lastPingAt on pong, but never terminate()s a connection that stops returning pongs. Daemon doesn't ping at all. Both sides trust the kernel for liveness, which only fires after hours.
- Presence == connection. Even once the staleness IS detected and the daemon reconnects, peers see a full peer_left/peer_joined cycle for a network blip that took 1–30 seconds. Outbound messages during the gap that target the session by pubkey route to nothing.
The user's ask: peers should never see a gap during transient disconnects. Presence should be continuous as long as the session intent is alive, regardless of how many sockets carried it.
Goal
Presence is a lease keyed off the session's stable identity
(sessionPubkey), held in broker memory + DB, with a TTL refreshed
on every keepalive. Sockets come and go beneath the lease. Other peers
see continuous online status across reconnects up to the lease TTL.
Specifically:
- A daemon (or per-session WS) can drop and re-establish the WS within a configurable grace window (default 90s) without any peer observing peer_left/peer_joined.
- Messages sent to a session while its socket is mid-flap are queued, delivered on the next reattach, ordered.
- Reconnect itself is sub-second on the wire when a resume_token is presented: broker recognises the session, restores the slot, no re-attestation round-trip.
- After the grace window expires, the broker fires peer_left exactly once; on a later reconnect it fires peer_joined exactly once. No flapping.
Non-goals
- Multi-broker handoff. Out of scope. If the broker process restarts, leases are lost and we fall back to today's behavior (clean reconnect, peers see one cycle). A future spec can address this with a shared lease store (Redis / Postgres LISTEN).
- Dual-socket on the daemon. Useful gold-plating but not required for the user-facing problem. Single-socket with watchdog + resume-token covers the failure modes actually observed (NAT drops, ISP blips, sleep <90s).
- Manual claudemesh reconnect CLI. Not needed; the lease model makes it redundant. Re-evaluate if real support cases surface.
Design
Lease model
sessionPubkey → { transport: "online" | "offline",
leaseUntil: Date,
ws: WebSocket | null,
...existing PeerConn fields }
Today the connections Map IS keyed by presenceId, which is a fresh
UUID per WS. We change that key to sessionPubkey (member-WS:
memberPubkey; session-WS: sessionPubkey). The PeerConn struct
gains:
transport: "online" | "offline";
leaseUntil: Date; // Date.now() + LEASE_TTL_MS
evictionTimer: NodeJS.Timeout | null;
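A minimal sketch of the re-keyed entry, assuming the three new fields sit next to the existing PeerConn fields (omitted here). leaseKeyFor and newLeaseFields are hypothetical helper names, and the hello shape is reduced to the two pubkeys.

```typescript
type Transport = "online" | "offline";

// Only the fields this spec adds; the existing PeerConn fields are omitted.
interface LeaseFields {
  transport: Transport;
  leaseUntil: Date;
  evictionTimer: ReturnType<typeof setTimeout> | null;
}

const LEASE_TTL_MS = 90_000;

// Hypothetical helper: member-WS keys by memberPubkey, session-WS by sessionPubkey.
function leaseKeyFor(hello: { memberPubkey: string; sessionPubkey?: string }): string {
  return hello.sessionPubkey ?? hello.memberPubkey;
}

// Hypothetical helper: lease fields for a freshly accepted hello.
function newLeaseFields(nowMs: number): LeaseFields {
  return {
    transport: "online",
    leaseUntil: new Date(nowMs + LEASE_TTL_MS),
    evictionTimer: null,
  };
}
```

Because the key no longer changes per socket, a reconnecting session lands on the same Map entry instead of creating a second one.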
State transitions
On WS open + hello accepted (initial):
- Insert into connections with transport: "online", leaseUntil: now + 90s, evictionTimer: null.
- Broadcast peer_joined (today's behavior).
- Issue resume_token (see below) in the hello_ack.
On WS open + hello carries valid resume_token:
- Look up by sessionPubkey, verify token signature + freshness (TTL <= LEASE_TTL_MS). If valid AND entry exists with transport: "offline":
  - Cancel evictionTimer.
  - Swap ws reference.
  - Set transport: "online", refresh leaseUntil.
  - Do NOT broadcast peer_joined. The lease never expired.
  - Drain any queued DMs accumulated during offline window.
  - Reply hello_ack with new resume_token.
- If entry exists with transport: "online" (token replay attack or rapid reconnect race): close old ws with 1000, "session_replaced" before swapping. Same as today's oldConn.ws.close(1000, ...) pattern at lines 1768/1996.
- If no entry exists or token is stale: treat as a fresh hello, broadcast peer_joined. Token expired = same as a cold start.
On WS close (any reason):
- Look up by sessionPubkey. If not found, no-op (already evicted).
- Set transport: "offline", clear ws reference.
- Start evictionTimer = setTimeout(evict, GRACE_MS).
- Do NOT broadcast peer_left. Do NOT delete the entry.
- Do NOT call disconnectPresence(presenceId) yet.
On evictionTimer fire (lease expired without reattach):
- Delete from connections.
- Broadcast peer_left (today's behavior at lines 5167-5189).
- decMeshCount.
- disconnectPresence(presenceId).
- Clean up URL watches, stream subs, MCP registry: same as today's close handler.
- Audit peer_left.
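The hello, close, and eviction transitions above can be sketched as a small state machine. This is an illustration under stated assumptions, not the broker's code: createLeaseTable and LeaseDeps are hypothetical names, the timer and broadcast functions are injected so the logic can be exercised without real sockets, and the session_replaced close, DM drain, and cleanup steps are left out. Two details are assumptions of the sketch rather than statements in the spec: a close event from a socket that has already been replaced is ignored, and a fresh hello over a lingering entry evicts it first.

```typescript
type PresenceEvent = "peer_joined" | "peer_left";

interface Lease {
  transport: "online" | "offline";
  leaseUntil: number;     // epoch ms
  ws: unknown;            // stand-in for the WebSocket reference, null when offline
  evictionTimer: unknown; // handle returned by deps.schedule, or null
}

interface LeaseDeps {
  graceMs: number;
  now: () => number;
  schedule: (fn: () => void, ms: number) => unknown;
  cancel: (handle: unknown) => void;
  broadcast: (event: PresenceEvent, key: string) => void;
}

function createLeaseTable(deps: LeaseDeps) {
  const leases = new Map<string, Lease>();

  // Lease expired without reattach: drop the entry and tell peers once.
  function evict(key: string): void {
    if (!leases.delete(key)) return; // already evicted
    deps.broadcast("peer_left", key);
  }

  function onHello(key: string, ws: unknown, tokenValid: boolean): "joined" | "resumed" {
    const existing = leases.get(key);
    if (existing && tokenValid) {
      // Reattach: the lease never expired, so peers hear nothing.
      if (existing.evictionTimer !== null) deps.cancel(existing.evictionTimer);
      existing.evictionTimer = null;
      existing.ws = ws;
      existing.transport = "online";
      existing.leaseUntil = deps.now() + deps.graceMs;
      return "resumed";
    }
    if (existing) {
      // Assumption of this sketch: evict the lingering entry so peers
      // see a matched peer_left / peer_joined pair.
      if (existing.evictionTimer !== null) deps.cancel(existing.evictionTimer);
      evict(key);
    }
    leases.set(key, {
      transport: "online",
      leaseUntil: deps.now() + deps.graceMs,
      ws,
      evictionTimer: null,
    });
    deps.broadcast("peer_joined", key);
    return "joined";
  }

  function onClose(key: string, ws: unknown): void {
    const lease = leases.get(key);
    if (!lease || lease.ws !== ws) return; // evicted, or a replaced socket closing late
    lease.transport = "offline";
    lease.ws = null;
    lease.evictionTimer = deps.schedule(() => evict(key), deps.graceMs);
  }

  return { onHello, onClose, get: (key: string) => leases.get(key) };
}
```

In a real broker, schedule and cancel would wrap setTimeout and clearTimeout; injecting them keeps the transition logic testable with a fake clock.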
Watchdog (broker):
- The 30s ping loop (line 5431) gains a staleness check: if any conn's transport === "online" and lastPingAt < now - 75s, call ws.terminate(). This converts the half-dead socket into a clean close event, which fires the lease-offline transition above.
- Same logic on the daemon side (see § Daemon changes).
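A sketch of one pass of that ping loop, assuming each connection records the time of its last pong (the commit message calls this field lastPongAt). sweepConnections and WatchedConn are hypothetical names; the ws member is reduced to the ping() and terminate() methods that the ws library's WebSocket provides, so the function can be run against stubs.

```typescript
interface WatchedConn {
  transport: "online" | "offline";
  lastPongAt: number; // epoch ms of the most recent pong
  ws: { ping(): void; terminate(): void };
}

const STALE_PONG_THRESHOLD_MS = 75_000;

// One pass of the 30s loop: terminate stale sockets, ping the rest.
// Returns how many sockets were terminated.
function sweepConnections(conns: WatchedConn[], nowMs: number): number {
  let terminated = 0;
  for (const conn of conns) {
    if (conn.transport !== "online") continue; // already in grace, no socket to probe
    if (nowMs - conn.lastPongAt > STALE_PONG_THRESHOLD_MS) {
      conn.ws.terminate(); // emits close, which starts the grace timer
      terminated++;
    } else {
      conn.ws.ping();
    }
  }
  return terminated;
}
```

With a 30s cadence and a 75s threshold, a socket that stops answering is terminated on the third pass after its last pong, so detection takes between 75s and about 105s.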
Resume token
A short opaque string the broker hands the daemon in hello_ack.
Format: mesh-resume.v1.<base64url(JSON-payload)>.<base64url(sig)>
where JSON-payload = { sub: <sessionPubkey>, mid: <meshId>, exp: <unix-ms>, iat: <unix-ms> } and sig = ed25519(brokerSigningKey, JSON-payload).
- Why a token, not just sessionPubkey? A session needs to prove it's the holder of an existing lease without re-running the full attestation handshake (which involves member key + parent attestation lookup). The token is a server-issued cookie: cheap to verify, scoped to a single session, expires with the lease.
- Storage: broker keeps the signing key in env (RESUME_TOKEN_KEY, generated on first boot if missing, persisted to a config row). No DB column needed for the tokens themselves; they're verified by signature alone.
- TTL: equal to LEASE_TTL_MS (90s). After that the daemon must re-handshake with full attestation. Refreshed on every successful reattach.
- Daemon storage: in-memory only. Lost on daemon restart, which is correct: a daemon restart is a real reconnect and should run the full hello.
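A minimal sketch of issuing and verifying a token in this format with Node's built-in crypto module. It assumes the signature covers the UTF-8 bytes of the serialized JSON payload, and it generates a throwaway key pair in-process; loading RESUME_TOKEN_KEY from env or a config row is out of scope. issueResumeToken and verifyResumeToken are hypothetical names, and only sub and exp are checked here.

```typescript
import { generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

interface ResumePayload { sub: string; mid: string; exp: number; iat: number }

const TOKEN_PREFIX = "mesh-resume.v1";

function issueResumeToken(privateKey: KeyObject, sub: string, mid: string, nowMs: number, ttlMs: number): string {
  const payload: ResumePayload = { sub, mid, iat: nowMs, exp: nowMs + ttlMs };
  const body = Buffer.from(JSON.stringify(payload), "utf8");
  const sig = sign(null, body, privateKey); // Ed25519 takes no digest algorithm
  return `${TOKEN_PREFIX}.${body.toString("base64url")}.${sig.toString("base64url")}`;
}

// Returns the payload if the token is well-formed, signed by the broker key,
// issued for expectedSub, and not yet expired; otherwise null.
function verifyResumeToken(publicKey: KeyObject, token: string, expectedSub: string, nowMs: number): ResumePayload | null {
  if (!token.startsWith(TOKEN_PREFIX + ".")) return null;
  // The prefix itself contains a dot, so strip it before splitting.
  const parts = token.slice(TOKEN_PREFIX.length + 1).split(".");
  if (parts.length !== 2) return null;
  try {
    const body = Buffer.from(parts[0], "base64url");
    const sig = Buffer.from(parts[1], "base64url");
    if (!verify(null, body, publicKey, sig)) return null;
    const payload = JSON.parse(body.toString("utf8")) as ResumePayload;
    if (payload.sub !== expectedSub || payload.exp <= nowMs) return null;
    return payload;
  } catch {
    return null; // malformed input is treated like a missing token
  }
}

const { publicKey, privateKey } = generateKeyPairSync("ed25519");
const token = issueResumeToken(privateKey, "session-pubkey", "mesh-1", Date.now(), 90_000);
console.log(verifyResumeToken(publicKey, token, "session-pubkey", Date.now()) !== null); // true
```

Verifying the signature over the received bytes, rather than re-serializing the parsed JSON, avoids any dependence on key order. A null result maps to the fresh-hello path described above.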
Wire protocol additions
hello (member-WS, session-WS, fresh-launch hello — all three):
{
type: "hello",
memberPubkey: "...",
sessionPubkey: "...", // session-WS only
attestation: "...", // session-WS only
signature: "...",
+ resumeToken?: "mesh-resume.v1...", // optional; if set, this hello is a reattach attempt
...
}
hello_ack:
{
type: "hello_ack",
presenceId: "...",
...
+ resumeToken: "mesh-resume.v1...", // always issued; replaces prior on reattach
+ leaseTtlMs: 90000, // informational; daemon may use for ping cadence
}
No new message types. Old daemons that don't send resumeToken get
today's full-handshake behavior — fully backward compatible.
Message queue during grace window
Today: DMs to a presence whose WS is closed → routed to
message_queue only for priority: low; now/next either route
to a different connected session of the same member or drop.
Change: when broker would route to a session whose
transport === "offline" (lease still valid), enqueue regardless of
priority. On reattach, the existing inbox-drain path
(maybePushQueuedMessages at line 967) flushes them in order. The
message_queue already has the schema for this; we're just relaxing
the priority gate when the target is in grace.
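The relaxed gate can be sketched as a pure routing decision. routeDm and its types are hypothetical, and the no-lease branch is simplified: per the description above, the real broker may first try another connected session of the same member before dropping a now/next DM.

```typescript
type Priority = "now" | "next" | "low";
type RouteDecision = "deliver" | "enqueue" | "drop";

// undefined means no entry in connections (never joined, or lease evicted).
type TargetState = { transport: "online" | "offline" } | undefined;

function routeDm(target: TargetState, priority: Priority): RouteDecision {
  if (target?.transport === "online") return "deliver";
  // In grace: an entry that is still present but offline has not been evicted,
  // so the DM is queued regardless of priority.
  if (target?.transport === "offline") return "enqueue";
  // No lease: today's priority gate, simplified.
  return priority === "low" ? "enqueue" : "drop";
}
```

The only behavioral difference from today is the middle branch; online delivery and the no-lease path keep their current outcomes.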
Constants
const LEASE_TTL_MS = 90_000; // grace window after WS close
const PING_INTERVAL_MS = 30_000; // unchanged
const STALE_PONG_THRESHOLD_MS = 75_000; // 2.5x ping interval
const RESUME_TOKEN_TTL_MS = LEASE_TTL_MS;
LEASE_TTL_MS = 90s rationale: long enough to absorb a sleep/resume
cycle, NAT timeout, ISP route flap, mobile→wifi handover. Short
enough that a true crash (daemon killed, machine off) clears the
session within 90s — peers don't see ghost online status forever.
Configurable via env (LEASE_TTL_MS) for self-hosted brokers.
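One way the env override might be read, as a sketch: readLeaseTtlMs is a hypothetical helper, and falling back to the default on a missing or malformed value is an assumption rather than something the spec states.

```typescript
const DEFAULT_LEASE_TTL_MS = 90_000;

// Parse LEASE_TTL_MS from an env-like record; fall back to the default
// when the variable is unset, empty, non-numeric, or not a positive integer.
function readLeaseTtlMs(env: Record<string, string | undefined>): number {
  const raw = env.LEASE_TTL_MS;
  if (raw === undefined || raw.trim() === "") return DEFAULT_LEASE_TTL_MS;
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : DEFAULT_LEASE_TTL_MS;
}
```

A self-hosted broker would call it as readLeaseTtlMs(process.env) once at startup.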
Daemon changes
Watchdog
In ws-lifecycle.ts, add an idleWatchdog parallel to the existing
backoff/reconnect machinery:
let lastActivity = Date.now(); // bumped on every incoming message + pong
const watchdog = setInterval(() => {
if (Date.now() - lastActivity > STALE_THRESHOLD_MS) {
log("warn", "ws_stale_terminate", { url: opts.url });
sock.terminate(); // fires existing close handler → reconnect path
} else if (sock.readyState === sock.OPEN) {
sock.ping(); // matches broker's 30s cadence, gives broker a pong
}
}, PING_INTERVAL_MS);
sock.on("message", () => { lastActivity = Date.now(); });
sock.on("pong", () => { lastActivity = Date.now(); });
Clear the interval with clearInterval(watchdog) in the close handler and in the
explicit close() path.
Resume token in hello
apps/cli/src/daemon/broker.ts:136 and equivalent in
session-broker.ts: persist the resumeToken from each successful
hello_ack into a private field, include it in the next
buildHello() call. On daemon restart the field is empty → cold
start, exactly today's behavior.
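A sketch of that private field, assuming a much-reduced hello shape. createHelloState is a hypothetical wrapper; the real buildHello() in broker.ts and session-broker.ts also carries signatures and attestation, which are omitted here.

```typescript
interface Hello { type: "hello"; memberPubkey: string; resumeToken?: string }
interface HelloAck { type: "hello_ack"; presenceId: string; resumeToken?: string }

function createHelloState(memberPubkey: string) {
  // In-memory only: a daemon restart starts with no token and runs a cold hello.
  let resumeToken: string | undefined;

  return {
    buildHello(): Hello {
      const hello: Hello = { type: "hello", memberPubkey };
      if (resumeToken !== undefined) hello.resumeToken = resumeToken;
      return hello;
    },
    onHelloAck(ack: HelloAck): void {
      // An old broker sends no token, which clears the field and
      // keeps the next reconnect on the full-handshake path.
      resumeToken = ack.resumeToken;
    },
  };
}
```

Each ack replaces the stored token, matching the rule that the broker issues a new one on every successful reattach.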
No CLI changes
claudemesh peer list keeps reading the broker's connections Map
which now reflects continuous presence. Users see online sessions as
online during transient blips. No UX surface changes.
Migration
- New broker is fully backward compatible with old daemons (resume token is optional, defaults fall through to today's path).
- New daemons against an old broker: token is sent but ignored, full handshake runs each reconnect — same as today.
- DB migration: none.
presencetable semantics unchanged. ThedisconnectedAtcolumn is now set only on lease eviction (>90s), not on every WS close. This is a behavioral change but not a schema change. - Add ENV var
RESUME_TOKEN_KEY(broker generates on first boot if unset, persists to a singleton config row).
Test plan
- Sleep test: kill -STOP the daemon for 60s, then kill -CONT. Expect: peers never see peer_left. Daemon's WS is dead-on-arrival when it wakes; watchdog terminates it; reconnect with resume_token succeeds within 1-2s; lease was at ~30s of its 90s TTL when the daemon resumed.
- Hard offline: kill -STOP for 120s, kill -CONT. Expect: peers see exactly one peer_left at t=90s, then exactly one peer_joined after the daemon resumes and reconnects (resume token is now stale; full handshake runs).
- NAT drop simulation: iptables -A OUTPUT -p tcp --dport 443 -j DROP for 60s on the daemon host, then remove the rule. Expect: broker pings stop landing, broker-side watchdog calls ws.terminate() at t=75s, lease enters grace, daemon's own watchdog fires within ~30s, daemon reconnects with resume_token, peers never see a flap.
- Message-during-grace: while a target session is in grace (offline, lease valid), send a priority: now DM. Expect: queued in message_queue, delivered exactly once on reattach, no peer_left visible to sender, ack returns delivered.
- Replay attack: capture a resume_token in flight, replay it against a different broker connection while the original session is still online. Expect: broker treats it as a reconnect for an already-online session → closes old WS with session_replaced, new WS takes over. Equivalent to today's session-replacement semantics; the original session detects the close and either reconnects (if it's still alive) or gives up.
- Token forgery: send a resumeToken not signed by the broker. Expect: signature check fails, broker treats hello as a fresh handshake (or rejects if the rest of the hello is invalid).
Open questions
- Should peer list expose a transport field so callers can distinguish "leased but offline" from "online"? Default no: the abstraction we're selling is "they're online." But debugging may want it; gate it behind --all or --debug.
- What about the broker-side mcpRegistry cleanup? Today we delete non-persistent MCP entries on WS close (line 5217). With leases, we should defer that to lease eviction, not WS close. Otherwise an MCP server registered by a session disappears every time its WS reconnects.
Build order
- Broker lease model: change connections keying, add transport/leaseUntil/evictionTimer, refactor close handler to start grace timer instead of immediate teardown, refactor eviction path. (~80 lines.)
- Resume token: signing key bootstrap, token issue/verify, wire format, hello_ack changes. (~50 lines + 1 config row.)
- Daemon watchdog: ws-lifecycle.ts adds idleWatchdog and stores resumeToken from acks. (~25 lines.)
- Daemon hello: pass resumeToken in next buildHello(). (~10 lines across broker.ts + session-broker.ts.)
- Broker watchdog: extend the 30s ping loop with terminate()-on-stale logic. (~15 lines.)
- Queue-during-grace: relax priority gate in DM routing. (~5 lines.)
- Spec docs: update docs/protocol.md with resume_token, lease semantics. (~30 lines.)
- Tests: six scenarios above. Likely ~3 new test files.
Estimated total: one focused day. The broker lease model is the load-bearing change; everything else slots in cleanly once that's done.