Commit Graph

3 Commits

Author SHA1 Message Date
Alejandro Gutiérrez
1a14cef1e0 feat(cli): 1.31.0 — session autoclean + broker verification + service path
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
Three operability fixes for users running the daemon under launchd or
systemd.

PID-watcher autoclean
=====================

The session reaper already dropped registry entries with dead pids on
a 30s loop, but had two real-world gaps:

- 30s sweep let stale presence linger on the broker for half a minute
- bare process.kill(pid, 0) trusts a recycled pid; a registry entry
  could survive its real owner's death whenever the OS rolled the
  pid number forward to a new program

Process-exit IPC from claude-code is best-effort and skipped on
SIGKILL / OOM / segfault / panic, so it cannot replace the sweep.

Fix:

- New process-info.ts captures opaque per-process start-times via
  ps -o lstart= (works on macOS and Linux, ~1 ms per call)
- registerSession stores the start-time alongside the pid
- reapDead drops entries when pid is dead OR start-time changed
  since register
- Sweep cadence 30s -> 5s
- Best-effort fallback to bare liveness when start-time capture
  fails at register time

Registry hooks already close the per-session broker WS on
deregister, so peer list rebuilds within one sweep of any session
exit.

Service-managed daemon: no more "spawn failed" false alarms
===========================================================

After claudemesh install (which writes a launchd plist or systemd
unit with KeepAlive=true), users routinely saw

  [claudemesh] warn daemon spawn failed: socket did not appear
  within 3000ms

even when the daemon was running fine. Two contributing causes:

1. Probe timeout was 800ms — the first IPC after a launchd-driven
   restart can take longer (SQLite migration + broker WS opens) and
   tripped it. Bumped to 2500ms.
2. On a failed probe the CLI tried its own detached spawn, which
   collided with launchd's KeepAlive restart cycle (singleton lock
   fails, child exits) and we'd then time out polling for a socket
   that was actually about to come up.

Now: when the launchd plist or systemd unit exists, the CLI does not
attempt a spawn. It waits up to 8s for the OS-managed unit to bring
the socket up. New service-not-ready state distinguishes "OS hasn't
restarted it yet" from "we tried to spawn and it failed".

Install verifies broker connectivity, not just process start
============================================================

Previously install ended once launchctl reported the unit loaded —
a daemon that boots but cannot reach the broker (blocked :443,
expired TLS, DNS, broker outage) only surfaced on the user's first
peer list or send.

/v1/health now includes per-mesh broker WS state. install polls it
for up to 15s after service boot and prints either "broker
connected (mesh=...)" or a warning naming the meshes still in
connecting state, with a hint at common causes.

The verification is best-effort and does not fail the install — it
just surfaces the issue early.

Tests
=====

4 new vitest cases cover the reaper paths: dead pid, live pid plus
matching start-time, live pid plus mismatched start-time (PID
reuse), and the no-start-time fallback. 83 of 83 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 14:05:44 +01:00
Alejandro Gutiérrez
81f0e4f7ac feat(cli): 1.28.0 — bridge deletion + daemon-policy flags
drop the orphaned bridge tier (~600 LoC). client/server/protocol
files deleted; tryBridge had returned null in production for seven
releases since the 1.24.0 mcp shim rewrite stopped opening the
sockets. each verb now has two paths: daemon (with 1.27.3's
auto-spawn) → cold ws.

add per-process daemon policy: --strict (error instead of cold
fallback) and --no-daemon (skip daemon entirely). enforcement at
withMesh so a single chokepoint covers every verb. env equivalents
CLAUDEMESH_STRICT_DAEMON / CLAUDEMESH_NO_DAEMON. flag wins.

net -394 loc; the daemon-up case ships ~600 loc lighter and the
fallback story is one tier simpler. first sprint A drop; per-session
ipc tokens and the wizard refactors follow in 1.29.0+.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 12:23:04 +01:00
Alejandro Gutiérrez
2b6cf2c14b feat(cli): self-healing daemon lifecycle
every daemon-routed verb now probes the ipc socket via /v1/version
(instead of trusting existsSync), cleans up stale sock/pid files left
by a crashed daemon, and auto-spawns a detached `claudemesh daemon up`
under a file-lock when the daemon is down. polls for liveness up to a
budget (3s for ad-hoc verbs, 10s for launch) before falling through to
cold path.

includes a per-process result cache (script doing 50 sends pays spawn
cost at most once), a 30s recently-failed marker (no thundering-herd
retries on crash-loop), a spawn-lock (concurrent invocations share one
attempt), and a recursion guard env var (nested cli calls inside the
daemon process skip auto-spawn).

fixes the stale-socket bug where launch's ensureDaemonRunning returned
early on a left-over socket file from a crashed daemon, silently
breaking the spawned claude session's mcp shim.

deferred to 1.28.0: --strict / --no-daemon flags, lazy-loading of
cold-path code, per-session ipc tokens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 11:17:32 +01:00