1.31.0 introduced a session reaper that called execFileSync(ps) once
per registered session every 5s. With many sessions registered, the
daemon's event loop stalled for hundreds of ms — long enough that
incoming /v1/version probes from the CLI timed out against a healthy
daemon and the new service-managed warning fired.
Fix:
- getProcessStartTime is now async (execFile + promisify); never
blocks the event loop
- New getProcessStartTimes(pids) issues one batched ps for all
survivors instead of N separate forks. Sweep cost is fixed
regardless of session count.
- registerSession stays sync; start-time capture is fire-and-forget
- reapDead is now async; the setInterval wrapper voids it so a
rejected sweep cannot crash the daemon
Behavior is otherwise unchanged from 1.31.0: same 5s cadence, same
PID-reuse guard semantics, same broker-WS teardown via the registry
hook. 83/83 tests still green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three operability fixes for users running the daemon under launchd or
systemd.
PID-watcher autoclean
=====================
The session reaper already dropped registry entries with dead pids on
a 30s loop, but had two real-world gaps:
- 30s sweep let stale presence linger on the broker for half a minute
- bare process.kill(pid, 0) trusts a recycled pid; a registry entry
could survive its real owner's death whenever the OS rolled the
pid number forward to a new program
Process-exit IPC from claude-code is best-effort and skipped on
SIGKILL / OOM / segfault / panic, so it cannot replace the sweep.
Fix:
- New process-info.ts captures opaque per-process start-times via
ps -o lstart= (works on macOS and Linux, ~1 ms per call)
- registerSession stores the start-time alongside the pid
- reapDead drops entries when pid is dead OR start-time changed
since register
- Sweep cadence 30s -> 5s
- Best-effort fallback to bare liveness when start-time capture
fails at register time
Registry hooks already close the per-session broker WS on
deregister, so peer list rebuilds within one sweep of any session
exit.
Service-managed daemon: no more "spawn failed" false alarms
===========================================================
After claudemesh install (which writes a launchd plist or systemd
unit with KeepAlive=true), users routinely saw
[claudemesh] warn daemon spawn failed: socket did not appear
within 3000ms
even when the daemon was running fine. Two contributing causes:
1. Probe timeout was 800ms — the first IPC after a launchd-driven
restart can take longer (SQLite migration + broker WS opens) and
tripped it. Bumped to 2500ms.
2. On a failed probe the CLI tried its own detached spawn, which
collided with launchd's KeepAlive restart cycle (singleton lock
fails, child exits) and we'd then time out polling for a socket
that was actually about to come up.
Now: when the launchd plist or systemd unit exists, the CLI does not
attempt a spawn. It waits up to 8s for the OS-managed unit to bring
the socket up. New service-not-ready state distinguishes "OS hasn't
restarted it yet" from "we tried to spawn and it failed".
Install verifies broker connectivity, not just process start
============================================================
Previously install ended once launchctl reported the unit loaded —
a daemon that boots but cannot reach the broker (blocked :443,
expired TLS, DNS, broker outage) only surfaced on the user's first
peer list or send.
/v1/health now includes per-mesh broker WS state. install polls it
for up to 15s after service boot and prints either "broker
connected (mesh=...)" or a warning naming the meshes still in
connecting state, with a hint at common causes.
The verification is best-effort and does not fail the install — it
just surfaces the issue early.
Tests
=====
4 new vitest cases cover the reaper paths: dead pid, live pid plus
matching start-time, live pid plus mismatched start-time (PID
reuse), and the no-start-time fallback. 83 of 83 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
daemon-side half of 1.30.0 per-session broker presence. behind
CLAUDEMESH_SESSION_PRESENCE=1 (default OFF this cycle so the broker
side bakes before the flag flips).
- SessionBrokerClient (apps/cli/src/daemon/session-broker.ts) — slim
WS that opens with session_hello, presence-only, no outbox drain.
- session-hello-sig.ts — signParentAttestation (12h TTL, ≤24h cap) and
signSessionHello, mirroring the broker canonical formats.
- session-registry: optional presence field on SessionInfo;
setRegistryHooks for onRegister/onDeregister callbacks. Hook errors
are caught so they can never throttle registry mutations.
- IPC POST /v1/sessions/register accepts the presence material under
body.presence (session_pubkey, session_secret_key, parent_attestation).
Older callers without it stay scoped + supported.
- run.ts wires the registry hooks: on register, opens a SessionBrokerClient
for the matching mesh; on deregister (explicit or reaper), closes it.
Shutdown closes any remaining session WSes before the IPC server.
8 new unit tests cover registry lifecycle (replace/throw/presence
roundtrip) and signature canonical-bytes verification against libsodium.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
every claudemesh launch-spawned session now mints a 32-byte random
token, writes it under tmpdir (mode 0600), and registers it with the
daemon. cli invocations from inside that session inherit
CLAUDEMESH_IPC_TOKEN_FILE in env, attach the token via Authorization:
ClaudeMesh-Session <hex>, and the daemon resolves it to a SessionInfo.
server-side: every read route that filters by mesh now uses meshFromCtx —
explicit query/body wins, session default fills in when missing. write
routes follow the same pattern.
cli-side: peers.ts (and other multi-mesh-iterating verbs in future)
prefers session-token mesh over all joined meshes when the user didn't
pass --mesh explicitly.
backward-compatible in both directions — tokenless callers behave
exactly as before. registry is in-memory; daemon restart loses it but
the 30s reaper handles dead pids and most callers re-register on next
launch.
verified end-to-end: peer list with token returns 4 prueba1 peers,
without token returns 3 meshes' peers (aggregate).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>