alezmad/claudemesh

Commit Graph

Author	SHA1	Message	Date

Alejandro Gutiérrez

15b7920b2a

fix(cli): 1.31.1 — reaper no longer blocks the daemon event loop

CI (push), all cancelled: Lint, Typecheck, Broker tests (Postgres), Docker build (linux/amd64)

1.31.0 introduced a session reaper that called execFileSync(ps) once
per registered session every 5s. With many sessions registered, the
daemon's event loop stalled for hundreds of ms — long enough that
incoming /v1/version probes from the CLI timed out against a healthy
daemon and the new service-managed warning fired.

Fix:

- getProcessStartTime is now async (execFile + promisify); never
  blocks the event loop
- New getProcessStartTimes(pids) issues one batched ps for all
  survivors instead of N separate forks, so each sweep costs one
  fork regardless of session count.
- registerSession stays sync; start-time capture is fire-and-forget
- reapDead is now async; the setInterval wrapper voids it so a
  rejected sweep cannot crash the daemon
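
The batched lookup described above could look roughly like the sketch below. It is an illustration, not the code from process-info.ts: only the name getProcessStartTimes comes from the commit message, while parsePsStartTimes, startSweep, and the exact ps arguments are assumptions. The two separate -o flags are used because BSD-style ps treats everything after the first "=" in a single -o argument as header text. The explicit .catch is also an addition of this sketch, since void alone does not handle a rejection.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

/**
 * Parse `ps -o pid= -o lstart=` output into a pid -> start-time map.
 * Each line looks like "  4242 Mon May  4 14:15:48 2026"; the start-time
 * stays an opaque string that is only ever compared for equality.
 * (Hypothetical helper, not named in the commit.)
 */
export function parsePsStartTimes(stdout: string): Map<number, string> {
  const result = new Map<number, string>();
  for (const line of stdout.split("\n")) {
    const match = /^\s*(\d+)\s+(.+?)\s*$/.exec(line);
    if (match) result.set(Number(match[1]), match[2]);
  }
  return result;
}

/**
 * One non-blocking `ps` call for every pid in the list.
 * Pids missing from the returned map are dead or unreadable.
 */
export async function getProcessStartTimes(
  pids: number[],
): Promise<Map<number, string>> {
  if (pids.length === 0) return new Map();
  try {
    const { stdout } = await execFileAsync("ps", [
      "-o", "pid=",
      "-o", "lstart=",
      "-p", pids.join(","),
    ]);
    return parsePsStartTimes(stdout);
  } catch (err) {
    // ps can exit non-zero when no pid matched; the promisified execFile
    // attaches whatever stdout was captured to the error object.
    const stdout = (err as { stdout?: unknown }).stdout;
    return parsePsStartTimes(typeof stdout === "string" ? stdout : "");
  }
}

/** Run the async sweep on a timer without letting a rejection escape. */
export function startSweep(sweep: () => Promise<void>, everyMs = 5000) {
  return setInterval(() => {
    void sweep().catch(() => {
      // A failed sweep is simply retried on the next tick.
    });
  }, everyMs);
}
```

Keeping the parser separate from the process call means the output handling can be unit-tested without forking ps.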

Behavior is otherwise unchanged from 1.31.0: same 5s cadence, same
PID-reuse guard semantics, same broker-WS teardown via the registry
hook. 83/83 tests still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-04 14:15:48 +01:00

Alejandro Gutiérrez

1a14cef1e0

feat(cli): 1.31.0 — session autoclean + broker verification + service path

CI (push), all cancelled: Lint, Typecheck, Broker tests (Postgres), Docker build (linux/amd64)

Three operability fixes for users running the daemon under launchd or
systemd.

PID-watcher autoclean
=====================

The session reaper already dropped registry entries with dead pids on
a 30s loop, but had two real-world gaps:

- A 30s sweep let stale presence linger on the broker for up to half
  a minute
- bare process.kill(pid, 0) trusts a recycled pid; a registry entry
  could survive its real owner's death whenever the OS rolled the
  pid number forward to a new program

Process-exit IPC from claude-code is best-effort and skipped on
SIGKILL / OOM / segfault / panic, so it cannot replace the sweep.

Fix:

- New process-info.ts captures opaque per-process start-times via
  ps -o lstart= (works on macOS and Linux, ~1 ms per call)
- registerSession stores the start-time alongside the pid
- reapDead drops entries when pid is dead OR start-time changed
  since register
- Sweep cadence 30s -> 5s
- Best-effort fallback to bare liveness when start-time capture
  fails at register time
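
The reap decision in the list above can be sketched as a small pure function. This is a minimal illustration under stated assumptions: SessionEntry, pidAlive, and shouldReap are hypothetical names, and keeping an entry when the current start-time cannot be read is a choice made here, not something the commit specifies.

```typescript
interface SessionEntry {
  pid: number;
  /** Opaque start-time captured at register time; undefined if capture failed. */
  startTime?: string;
}

/** Signal 0 checks that the pid exists without delivering a signal. */
function pidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

/**
 * Decide whether a registry entry should be dropped.
 * - dead pid: reap
 * - no start-time stored at register: fall back to bare liveness
 * - start-time differs from the stored one: the pid was recycled, reap
 * - current start-time unavailable: keep (assumption of this sketch)
 */
function shouldReap(
  entry: SessionEntry,
  alive: boolean,
  currentStartTime: string | undefined,
): boolean {
  if (!alive) return true;
  if (entry.startTime === undefined) return false;
  return currentStartTime !== undefined && currentStartTime !== entry.startTime;
}
```

Passing liveness and the current start-time in as arguments keeps the decision free of process calls, so the four paths named under Tests (dead pid, matching start-time, mismatched start-time, no-start-time fallback) can each be checked with plain values.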

Registry hooks already close the per-session broker WS on
deregister, so peer list rebuilds within one sweep of any session
exit.

Service-managed daemon: no more "spawn failed" false alarms
===========================================================

After claudemesh install (which writes a launchd plist or systemd
unit with KeepAlive=true), users routinely saw

  [claudemesh] warn daemon spawn failed: socket did not appear
  within 3000ms

even when the daemon was running fine. Two contributing causes:

1. Probe timeout was 800ms — the first IPC after a launchd-driven
   restart can take longer (SQLite migration + broker WS opens) and
   tripped it. Bumped to 2500ms.
2. On a failed probe the CLI tried its own detached spawn, which
   collided with launchd's KeepAlive restart cycle (singleton lock
   fails, child exits) and we'd then time out polling for a socket
   that was actually about to come up.

Now: when the launchd plist or systemd unit exists, the CLI does not
attempt a spawn. It waits up to 8s for the OS-managed unit to bring
the socket up. New service-not-ready state distinguishes "OS hasn't
restarted it yet" from "we tried to spawn and it failed".
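
A rough sketch of that decision follows. The 8s wait and the service-not-ready state come from the text above; the function names, the other state names, and the 100 ms poll interval are assumptions for illustration only.

```typescript
/** What the CLI does after probing the daemon socket (illustrative names). */
type StartupAction = "use-daemon" | "wait-for-service" | "spawn-detached";

function chooseStartupAction(
  probeOk: boolean,
  serviceUnitExists: boolean,
): StartupAction {
  if (probeOk) return "use-daemon";
  // With a launchd plist or systemd unit installed, never race the OS restart.
  return serviceUnitExists ? "wait-for-service" : "spawn-detached";
}

type WaitResult = "ready" | "service-not-ready";

/**
 * Poll until the socket appears or the deadline passes.
 * `socketExists` is injected so the caller decides how to check.
 */
async function waitForSocket(
  socketExists: () => boolean,
  timeoutMs = 8000,
  pollMs = 100,
): Promise<WaitResult> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (socketExists()) return "ready";
    await new Promise<void>((resolve) => setTimeout(resolve, pollMs));
  }
  // One last check so a socket that appeared during the final sleep counts.
  return socketExists() ? "ready" : "service-not-ready";
}
```

A caller might pass something like `() => existsSync(socketPath)`, where socketPath stands in for wherever the daemon places its socket.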

Install verifies broker connectivity, not just process start
============================================================

Previously install ended once launchctl reported the unit loaded —
a daemon that boots but cannot reach the broker (blocked :443,
expired TLS, DNS, broker outage) only surfaced on the user's first
peer list or send.

/v1/health now includes per-mesh broker WS state. install polls it
for up to 15s after service boot and prints either "broker
connected (mesh=...)" or a warning naming the meshes still in
connecting state, with a hint at common causes.
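
The polling step might be structured as below. The commit does not show the /v1/health payload, so the HealthSnapshot shape, the state names, and the warning wording are all hypothetical; only the 15s budget and the "broker connected (mesh=...)" message come from the text above.

```typescript
/** Hypothetical payload shape: mesh name -> broker WebSocket state. */
type BrokerState = "connected" | "connecting";
interface HealthSnapshot {
  brokers: Record<string, BrokerState>;
}

interface BrokerReport {
  ok: boolean;
  message: string;
}

/** Turn one health snapshot into the line install would print. */
function summarizeBrokers(health: HealthSnapshot): BrokerReport {
  const meshes = Object.keys(health.brokers);
  const pending = meshes.filter((m) => health.brokers[m] !== "connected");
  if (pending.length === 0) {
    return { ok: true, message: `broker connected (mesh=${meshes.join(",")})` };
  }
  return {
    ok: false,
    message:
      `warning: broker still connecting for mesh=${pending.join(",")}; ` +
      "check outbound :443, TLS certificates, DNS, or broker status",
  };
}

/**
 * Poll the injected health fetcher until every mesh is connected or the
 * deadline passes. Never throws, so verification cannot fail the install.
 */
async function verifyBroker(
  fetchHealth: () => Promise<HealthSnapshot>,
  timeoutMs = 15000,
  pollMs = 500,
): Promise<BrokerReport> {
  const deadline = Date.now() + timeoutMs;
  let report: BrokerReport = {
    ok: false,
    message: "warning: daemon health not available",
  };
  while (Date.now() < deadline) {
    try {
      report = summarizeBrokers(await fetchHealth());
      if (report.ok) return report;
    } catch {
      // The daemon may still be booting; keep polling until the deadline.
    }
    await new Promise<void>((resolve) => setTimeout(resolve, pollMs));
  }
  return report;
}
```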

The verification is best-effort and does not fail the install — it
just surfaces the issue early.

Tests
=====

Four new vitest cases cover the reaper paths: dead pid, live pid plus
matching start-time, live pid plus mismatched start-time (PID
reuse), and the no-start-time fallback. All 83 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-04 14:05:44 +01:00

2 Commits