Commit Graph

3 Commits

Alejandro Gutiérrez
ffd0621ccc feat(broker,cli): liveness watchdogs — 75s stale-pong terminate
Both sides now actively detect half-dead WS connections instead of
waiting for kernel TCP keepalive (~2 hours by default on Linux).
User-reported bug: "claudemesh peer list" showed zero peers despite
running sessions, because NAT/CGNAT had silently dropped the WS flow
and neither side noticed.

Broker (apps/broker/src/index.ts):
- Add lastPongAt to PeerConn; initialize it at every connections.set
  site and bump it in ws.on("pong").
- 30s ping loop now also terminates conns whose pong is >75s stale.
  ws.terminate() fires the close handler → existing peer_left path.
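
The broker-side check above can be sketched roughly as follows. This is a minimal TypeScript sketch, not the code in apps/broker/src/index.ts: PingableSocket, sweepPeers, onPong, startPingLoop and the constant names are hypothetical, and the socket is reduced to the two methods the loop needs. The 30s cadence, the 75s threshold and the lastPongAt field come from the message.

```typescript
// Hypothetical minimal socket surface; the real broker uses its WS library's socket.
interface PingableSocket {
  ping(): void;
  terminate(): void;
}

interface PeerConn {
  ws: PingableSocket;
  lastPongAt: number; // ms timestamp, set on connect and bumped on every pong
}

const PING_INTERVAL_MS = 30_000; // ping loop cadence
const STALE_PONG_MS = 75_000; // terminate once the last pong is older than this

// One tick of the ping loop: terminate conns with a stale pong, ping the rest.
// Returns the ids that were terminated (handy for logging and tests).
function sweepPeers(connections: Map<string, PeerConn>, now: number): string[] {
  const terminated: string[] = [];
  for (const [id, conn] of connections) {
    if (now - conn.lastPongAt > STALE_PONG_MS) {
      // terminate() fires the close handler, which owns the peer_left cleanup.
      conn.ws.terminate();
      terminated.push(id);
    } else {
      conn.ws.ping();
    }
  }
  return terminated;
}

// Call from the socket's "pong" handler.
function onPong(conn: PeerConn, now: number): void {
  conn.lastPongAt = now;
}

// Wire the sweep to a timer; returns the handle so shutdown can clear it.
function startPingLoop(connections: Map<string, PeerConn>): ReturnType<typeof setInterval> {
  return setInterval(() => sweepPeers(connections, Date.now()), PING_INTERVAL_MS);
}
```

In this sketch sweepPeers never deletes map entries itself; removal is left to the close handler that terminate() triggers, matching the "existing peer_left path" described above.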

Daemon (apps/cli/src/daemon/ws-lifecycle.ts):
- Add idle watchdog at 30s cadence, started after hello-ack.
- Bumps lastActivity on incoming message, ping, and pong frames.
- Each tick sends sock.ping() while activity is recent and terminates
  the socket once it has been idle >75s.
- Watchdog is cleared in the close handler and on explicit close().
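
A sketch of the daemon watchdog under the same caveats: IdleWatchdog, WatchedSocket and the injectable clock are hypothetical (the clock exists only to make the logic testable), and reading "recent activity" as "idle for at most 75s" is an assumption about ws-lifecycle.ts.

```typescript
// Hypothetical minimal socket surface standing in for the daemon's real WS client.
interface WatchedSocket {
  ping(): void;
  terminate(): void;
}

const WATCHDOG_TICK_MS = 30_000; // watchdog cadence
const IDLE_LIMIT_MS = 75_000; // terminate once idle for longer than this

class IdleWatchdog {
  private readonly sock: WatchedSocket;
  private readonly now: () => number;
  private lastActivity: number;
  private timer: ReturnType<typeof setInterval> | null = null;

  constructor(sock: WatchedSocket, now: () => number = Date.now) {
    this.sock = sock;
    this.now = now;
    this.lastActivity = now();
  }

  // Call from the message, ping, and pong handlers.
  bump(): void {
    this.lastActivity = this.now();
  }

  // One tick: terminate a socket idle for >75s, otherwise probe it with a ping.
  tick(): "ping" | "terminate" {
    if (this.now() - this.lastActivity > IDLE_LIMIT_MS) {
      this.stop();
      this.sock.terminate();
      return "terminate";
    }
    this.sock.ping();
    return "ping";
  }

  // Start once the hello-ack has been received.
  start(): void {
    this.stop();
    this.lastActivity = this.now();
    this.timer = setInterval(() => this.tick(), WATCHDOG_TICK_MS);
  }

  // Call from the close handler and from an explicit close().
  stop(): void {
    if (this.timer !== null) {
      clearInterval(this.timer);
      this.timer = null;
    }
  }
}
```

Because every tick either pings or terminates, a healthy peer's pong keeps bumping lastActivity even when no application messages flow.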

CLI 1.34.15 → 1.34.16. Broker stays 0.1.0 (deploys from main).

Spec: .artifacts/specs/2026-05-05-continuous-presence.md (covers the
full lease model + resume token; this commit ships only the watchdogs,
the first of four progressive layers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 11:22:15 +01:00
Alejandro Gutiérrez
cba4a938ec chore(cli): keep WS lifecycle diagnostic logs
Five info-level log points across the WS lifecycle helper:
ws_open_attempt / ws_open_ok / ws_hello_sent / ws_hello_acked /
ws_closed (with status + close code/reason).
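
One way to emit these five log points from a single helper while letting each client tag them with its role is a small logger factory. Sketch only: makeLifecycleLogger, LogFn and infoSink are hypothetical, and composing the final name as prefix + event (e.g. "session_broker_ws_closed") is an assumption about how the per-role prefixes described in this commit combine with the event names.

```typescript
// The five lifecycle events listed above.
type WsLifecycleEvent =
  | "ws_open_attempt"
  | "ws_open_ok"
  | "ws_hello_sent"
  | "ws_hello_acked"
  | "ws_closed";

// Hypothetical structured-log sink: message plus optional fields.
type LogFn = (msg: string, fields?: Record<string, unknown>) => void;

// Each client builds its own logger with a role prefix, so every line the
// shared helper emits carries that client's prefix.
function makeLifecycleLogger(
  prefix: string,
  sink: LogFn,
): (event: WsLifecycleEvent, fields?: Record<string, unknown>) => void {
  return (event, fields) => sink(`${prefix}${event}`, fields);
}

// Example info-level sink writing one JSON line per event.
const infoSink: LogFn = (msg, fields) =>
  console.info(JSON.stringify({ level: "info", msg, ...fields }));
```

A session client would then call something like makeLifecycleLogger("session_broker_", infoSink)("ws_closed", { status, code, reason }); the field names there are illustrative.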

Surfaced during M1 smoke testing: without these, the only visible
signal was "presence row missing on broker," which made it hard to
distinguish "WS never opened" / "opened but hello rejected" /
"acked then closed by broker."

Both clients prefix the helper-emitted msg ("session_broker_*",
"broker_*") so log greps stay clean per role.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:34:18 +01:00
Alejandro Gutiérrez
dab80f475e refactor(cli): m1 lifecycle + role-aware peer list
Foundational cleanups before agentic-comms architecture work
(.artifacts/specs/2026-05-04-agentic-comms-architecture-v2.md).
All behavior-preserving.

1. Extract `connectWsWithBackoff` into apps/cli/src/daemon/ws-lifecycle.ts.
   Both DaemonBrokerClient and SessionBrokerClient now share one
   lifecycle implementation (connect, hello-handshake, ack-timeout,
   close + backoff reconnect). Each client provides its own buildHello
   / isHelloAck / onMessage hooks and keeps its own RPC bookkeeping
   (pendingAcks, peerListResolvers, onPush). Composition over
   inheritance per Codex's review; no protocol shape changes.

2. Drop daemon-WS ephemeral session pubkey. DaemonBrokerClient no
   longer mints a per-reconnect ephemeral keypair or sends its pubkey
   in its hello. Session-targeted DMs have landed on SessionBrokerClient
   since 1.32.1, not on the member-keyed daemon-WS, so the field was
   vestigial. The send-encrypt path now signs DMs with the stable mesh
   member secret. handleBrokerPush invocations from daemon-WS pass
   only the member secret; session decryption is the session-WS's
   job.

3. Role-aware peer list. `peer list` now hides peers whose
   broker-emitted `role` is `'control-plane'`. `--all` opts back in.
   JSON output emits `role` at top level. Peer rows from older
   brokers that don't emit `role` yet default to 'session', so they
   stay visible without requiring the broker-side change to ship
   first. Replaces the prior `peerType === 'claudemesh-daemon'`
   channel-name hack.
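
The filtering rule in item 3 reduces to a default plus a predicate. A sketch assuming a hypothetical row shape: only `role`, the 'session' default, the 'control-plane' value and the `--all` opt-in come from the message; `name`, RawPeer, Peer, visiblePeers and peerListJson are illustrative.

```typescript
// Row as received from the broker; older brokers omit `role` entirely.
interface RawPeer {
  name: string;
  role?: string;
}

// Row after normalization: `role` is always present.
interface Peer {
  name: string;
  role: string;
}

// Legacy rows without a role default to 'session' so they stay visible.
function normalizePeer(raw: RawPeer): Peer {
  return { ...raw, role: raw.role ?? "session" };
}

// Hide control-plane peers unless `all` (the --all flag) is set.
function visiblePeers(raw: RawPeer[], all = false): Peer[] {
  const peers = raw.map(normalizePeer);
  return all ? peers : peers.filter((p) => p.role !== "control-plane");
}

// JSON output keeps `role` as a top-level field on each row.
function peerListJson(raw: RawPeer[], all = false): string {
  return JSON.stringify(visiblePeers(raw, all));
}
```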

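For item 1, the split between the shared lifecycle and the per-client hooks might look like the following. This is a hedged sketch, not the actual connectWsWithBackoff: only the hook names buildHello, isHelloAck and onMessage come from the message; LifecycleHooks, HandshakeGate, backoffDelayMs, the base/cap delays, and dropping frames that arrive before the ack are assumptions. Socket I/O and the ack timeout are left out.

```typescript
// Per-client hooks; names from the commit message, signatures assumed.
interface LifecycleHooks<Hello> {
  buildHello(): Hello;
  isHelloAck(msg: unknown): boolean;
  onMessage(msg: unknown): void;
}

// Capped exponential backoff between reconnect attempts (illustrative values).
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Shared per-connection handshake logic (a fresh gate per connection attempt):
// before the ack only an ack is accepted; afterwards every frame goes to the
// client's own onMessage hook.
class HandshakeGate<Hello> {
  private readonly hooks: LifecycleHooks<Hello>;
  private acked = false;

  constructor(hooks: LifecycleHooks<Hello>) {
    this.hooks = hooks;
  }

  // Frame to send right after the socket opens.
  helloFrame(): Hello {
    return this.hooks.buildHello();
  }

  handle(msg: unknown): "acked" | "dispatched" | "ignored" {
    if (!this.acked) {
      if (this.hooks.isHelloAck(msg)) {
        this.acked = true;
        return "acked";
      }
      return "ignored"; // assumption: pre-ack frames are dropped
    }
    this.hooks.onMessage(msg);
    return "dispatched";
  }

  get isAcked(): boolean {
    return this.acked;
  }
}
```

Keeping RPC state (pendingAcks, peerListResolvers, onPush) inside each client's onMessage, rather than in the gate, is what lets both clients reuse one lifecycle without a shared base class.
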
Typecheck + tests + build all green.
2026-05-04 18:08:32 +01:00