1.31.0 introduced a session reaper that called execFileSync(ps) once
per registered session every 5s. With many sessions registered, the
daemon's event loop stalled for hundreds of ms — long enough that
incoming /v1/version probes from the CLI timed out against a healthy
daemon and the new service-managed warning fired.
Fix:
- getProcessStartTime is now async (execFile + promisify); never
blocks the event loop
- New getProcessStartTimes(pids) issues one batched ps for all
survivors instead of N separate forks. Sweep cost is fixed
regardless of session count.
- registerSession stays sync; start-time capture is fire-and-forget
- reapDead is now async; the setInterval wrapper voids it so a
rejected sweep cannot crash the daemon
Behavior is otherwise unchanged from 1.31.0: same 5s cadence, same
PID-reuse guard semantics, same broker-WS teardown via the registry
hook. 83/83 tests still green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three operability fixes for users running the daemon under launchd or
systemd.
PID-watcher autoclean
=====================
The session reaper already dropped registry entries with dead pids on
a 30s loop, but had two real-world gaps:
- 30s sweep let stale presence linger on the broker for half a minute
- bare process.kill(pid, 0) trusts a recycled pid; a registry entry
could survive its real owner's death whenever the OS rolled the
pid number forward to a new program
Process-exit IPC from claude-code is best-effort and skipped on
SIGKILL / OOM / segfault / panic, so it cannot replace the sweep.
Fix:
- New process-info.ts captures opaque per-process start-times via
ps -o lstart= (works on macOS and Linux, ~1 ms per call)
- registerSession stores the start-time alongside the pid
- reapDead drops entries when pid is dead OR start-time changed
since register
- Sweep cadence 30s -> 5s
- Best-effort fallback to bare liveness when start-time capture
fails at register time
Registry hooks already close the per-session broker WS on
deregister, so peer list rebuilds within one sweep of any session
exit.
Service-managed daemon: no more "spawn failed" false alarms
===========================================================
After claudemesh install (which writes a launchd plist or systemd
unit with KeepAlive=true), users routinely saw
[claudemesh] warn daemon spawn failed: socket did not appear
within 3000ms
even when the daemon was running fine. Two contributing causes:
1. Probe timeout was 800ms — the first IPC after a launchd-driven
restart can take longer (SQLite migration + broker WS opens) and
tripped it. Bumped to 2500ms.
2. On a failed probe the CLI tried its own detached spawn, which
collided with launchd's KeepAlive restart cycle (singleton lock
fails, child exits) and we'd then time out polling for a socket
that was actually about to come up.
Now: when the launchd plist or systemd unit exists, the CLI does not
attempt a spawn. It waits up to 8s for the OS-managed unit to bring
the socket up. New service-not-ready state distinguishes "OS hasn't
restarted it yet" from "we tried to spawn and it failed".
Install verifies broker connectivity, not just process start
============================================================
Previously install ended once launchctl reported the unit loaded —
a daemon that boots but cannot reach the broker (blocked :443,
expired TLS, DNS, broker outage) only surfaced on the user's first
peer list or send.
/v1/health now includes per-mesh broker WS state. install polls it
for up to 15s after service boot and prints either "broker
connected (mesh=...)" or a warning naming the meshes still in
connecting state, with a hint at common causes.
The verification is best-effort and does not fail the install — it
just surfaces the issue early.
Tests
=====
4 new vitest cases cover the reaper paths: dead pid, live pid plus
matching start-time, live pid plus mismatched start-time (PID
reuse), and the no-start-time fallback. 83 of 83 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>