feat(broker,cli): liveness watchdogs — 75s stale-pong terminate

Both sides now actively detect half-dead WS connections instead of
waiting for kernel TCP keepalive (~2 hrs by default on Linux). The
user-reported bug: "claudemesh peer list" showed zero peers despite
running sessions, because NAT/CGNAT had silently dropped the WS flow
and neither side noticed.

Broker (apps/broker/src/index.ts):
- Add lastPongAt to PeerConn, populate at connections.set sites,
  bump in ws.on("pong").
- 30s ping loop now also terminates conns whose pong is >75s stale.
  ws.terminate() fires the close handler → existing peer_left path.

Daemon (apps/cli/src/daemon/ws-lifecycle.ts):
- Add idle watchdog at 30s cadence, started after hello-ack.
- Bumps lastActivity on incoming message, ping, and pong frames.
- Sends sock.ping() while activity is recent; terminates the socket
  once it has been idle >75s.
- Watchdog cleared on close handler + explicit close().
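
A minimal sketch of the daemon-side watchdog described in the list
above, assuming a socket that exposes ping() and terminate() the way
the `ws` package does. IdleWatchdog, PingableSocket, bump() and tick()
are hypothetical names chosen for illustration, not the identifiers
used in ws-lifecycle.ts, and the injected clock exists only so the
logic can be exercised without real timers.

```typescript
// Hypothetical minimal socket surface (assumption: mirrors ping()/terminate() from `ws`).
interface PingableSocket {
  ping(): void;
  terminate(): void;
}

const WATCHDOG_INTERVAL_MS = 30_000; // cadence stated in the commit message
const IDLE_THRESHOLD_MS = 75_000;    // idle limit stated in the commit message

class IdleWatchdog {
  private lastActivity: number;
  private timer: ReturnType<typeof setInterval> | null = null;

  constructor(
    private readonly sock: PingableSocket,
    // Injected clock (illustrative): defaults to wall-clock time.
    private readonly now: () => number = () => Date.now(),
  ) {
    this.lastActivity = now();
  }

  /** Call from the message, ping, and pong handlers. */
  bump(): void {
    this.lastActivity = this.now();
  }

  /** One watchdog pass: terminate if idle past the threshold, otherwise ping. */
  tick(): "ping" | "terminate" {
    if (this.now() - this.lastActivity > IDLE_THRESHOLD_MS) {
      this.stop();
      this.sock.terminate();
      return "terminate";
    }
    this.sock.ping();
    return "ping";
  }

  /** Start once the hello-ack has been received. */
  start(): void {
    this.stop();
    this.lastActivity = this.now();
    this.timer = setInterval(() => this.tick(), WATCHDOG_INTERVAL_MS);
  }

  /** Call from the close handler and from an explicit close(). */
  stop(): void {
    if (this.timer !== null) {
      clearInterval(this.timer);
      this.timer = null;
    }
  }
}
```

In this sketch the ping doubles as a probe: a healthy peer answers
with a pong, which calls bump() and keeps the connection under the
threshold, while a silently dropped flow never bumps and is terminated
on a later pass.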

CLI 1.34.15 → 1.34.16. Broker stays 0.1.0 (deploys from main).

Spec: .artifacts/specs/2026-05-05-continuous-presence.md (full lease
model + resume token; this commit ships only the watchdogs, the first
of four progressive layers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Alejandro Gutiérrez
2026-05-05 11:22:15 +01:00
parent b9ecbe79ad
commit ffd0621ccc
4 changed files with 420 additions and 5 deletions

@@ -161,6 +161,11 @@ interface PeerConn {
    * on long-lived control-plane connections (daemon, dashboard) that
    * would just auto-reconnect. */
   peerRole: "control-plane" | "session" | "service";
+  /** Last time this connection's WS replied to a broker ping. Bumped
+   * in the `pong` handler. Used by the staleness watchdog to detect
+   * half-dead TCP/NAT-dropped connections that the kernel hasn't yet
+   * RST'd (Linux default keepalive ≈ 2hrs). */
+  lastPongAt: number;
 }
 const connections = new Map<string, PeerConn>();
@@ -1803,6 +1808,7 @@ async function handleHello(
     visible: saved?.visible ?? true,
     profile: saved?.profile ?? {},
     peerRole: "control-plane",
+    lastPongAt: Date.now(),
   });
   incMeshCount(hello.meshId);
   void audit(hello.meshId, "peer_joined", member.id, effectiveDisplayName, {
@@ -2029,6 +2035,7 @@ async function handleSessionHello(
     visible: true,
     profile: {},
     peerRole: "session",
+    lastPongAt: Date.now(),
   });
   incMeshCount(hello.meshId);
   void audit(hello.meshId, "peer_joined", member.id, effectiveDisplayName, {
@@ -5235,7 +5242,11 @@ function handleConnection(ws: WebSocket): void {
     log.warn("ws error", { error: err.message });
   });
   ws.on("pong", () => {
-    if (presenceId) void heartbeat(presenceId);
+    if (presenceId) {
+      const conn = connections.get(presenceId);
+      if (conn) conn.lastPongAt = Date.now();
+      void heartbeat(presenceId);
+    }
   });
 }
@@ -5427,10 +5438,26 @@ async function main(): Promise<void> {
     });
   });
-  // WS heartbeat ping every 30s; clients reply with pong → bumps lastPingAt.
+  // WS heartbeat ping every 30s; clients reply with pong → bumps
+  // lastPongAt. Connections whose pong is older than 75s (2.5x the
+  // ping interval) are considered half-dead — kernel hasn't yet RST'd
+  // the socket but no application traffic is flowing. Force-terminate
+  // them to fire the close handler and free the connection slot.
+  const STALE_PONG_THRESHOLD_MS = 75_000;
   const pingInterval = setInterval(() => {
-    for (const { ws } of connections.values()) {
-      if (ws.readyState === ws.OPEN) ws.ping();
+    const now = Date.now();
+    for (const [pid, conn] of connections) {
+      const { ws } = conn;
+      if (ws.readyState !== ws.OPEN) continue;
+      if (now - conn.lastPongAt > STALE_PONG_THRESHOLD_MS) {
+        log.warn("ws stale terminate", {
+          presence_id: pid,
+          last_pong_ago_ms: now - conn.lastPongAt,
+        });
+        try { ws.terminate(); } catch { /* socket already gone */ }
+        continue;
+      }
+      ws.ping();
     }
   }, 30_000);
   pingInterval.unref();
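
Because staleness is only checked on the 30s tick, the 75s threshold
is effectively rounded up to the next tick. The sketch below works
through that arithmetic under a simplifying assumption: ticks fire at
exact multiples of the interval, and timer drift is ignored.
firstTerminatingTick is a hypothetical helper for illustration, not
part of the broker.

```typescript
const PING_INTERVAL_MS = 30_000;        // sweep cadence from the hunk above
const STALE_PONG_THRESHOLD_MS = 75_000; // staleness limit from the hunk above

/**
 * Returns the time of the first sweep tick that would terminate a
 * connection whose last pong arrived at `lastPongAt` (ms), assuming
 * ticks fire at exact multiples of `intervalMs`.
 */
function firstTerminatingTick(
  lastPongAt: number,
  intervalMs: number = PING_INTERVAL_MS,
  thresholdMs: number = STALE_PONG_THRESHOLD_MS,
): number {
  // First tick at or after the last pong.
  let tick = Math.ceil(lastPongAt / intervalMs) * intervalMs;
  // Same strict comparison as the sweep: terminate only when age > threshold.
  while (tick - lastPongAt <= thresholdMs) tick += intervalMs;
  return tick;
}

// A pong that arrives 50 ms after the tick at t=0 is 29.95s old at 30s and
// 59.95s old at 60s (both under the limit), then 89.95s old at 90s.
console.log(firstTerminatingTick(50)); // 90000
```

With a small round-trip time, a connection that goes silent is
therefore terminated on the third tick after its last answered ping,
about 90s after its last pong. Two consecutive missed pongs are
tolerated before the third tick closes the socket.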