feat(cli): 1.31.0 — session autoclean + broker verification + service path

Three operability fixes for users running the daemon under launchd or
systemd.

PID-watcher autoclean
=====================

The session reaper already dropped registry entries with dead pids on
a 30s loop, but had two real-world gaps:

- 30s sweep let stale presence linger on the broker for up to half
  a minute
- bare process.kill(pid, 0) trusts a recycled pid; a registry entry
  could survive its real owner's death whenever the OS rolled the
  pid number forward to a new program

Process-exit IPC from claude-code is best-effort and skipped on
SIGKILL / OOM / segfault / panic, so it cannot replace the sweep.

Fix:

- New process-info.ts captures opaque per-process start-times via
  ps -o lstart= (works on macOS and Linux, ~1 ms per call)
- registerSession stores the start-time alongside the pid
- reapDead drops entries when pid is dead OR start-time changed
  since register
- Sweep cadence 30s -> 5s
- Best-effort fallback to bare liveness when start-time capture
  fails at register time
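
A minimal sketch of the reap decision described in the list above,
not the code in this commit. The Entry shape and the names
captureStartTime, isAlive and shouldReap are hypothetical; only the
ps -o lstart= probe and the signal-0 liveness check come from the
text. Keeping the entry when the probe fails at sweep time is an
assumption.

```typescript
import { execFileSync } from "node:child_process";

// Hypothetical entry shape; the real registry type is not shown here.
interface Entry {
  pid: number;
  startTime?: string; // opaque string from ps, absent if capture failed
}

// Capture the opaque start-time of a pid, or undefined if ps fails.
function captureStartTime(pid: number): string | undefined {
  try {
    const out = execFileSync("ps", ["-o", "lstart=", "-p", String(pid)], {
      encoding: "utf8",
      stdio: ["ignore", "pipe", "ignore"],
    }).trim();
    return out.length > 0 ? out : undefined;
  } catch {
    return undefined; // pid gone, or ps unavailable
  }
}

// Signal 0 delivers nothing; it only checks that the pid exists.
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Reap when the pid is dead OR its start-time differs from the one
// recorded at register time. The probe is injectable for testing.
function shouldReap(
  entry: Entry,
  probe: (pid: number) => string | undefined = captureStartTime,
): boolean {
  if (!isAlive(entry.pid)) return true;
  // No start-time at register: fall back to bare liveness.
  if (entry.startTime === undefined) return false;
  const current = probe(entry.pid);
  // Assumption: a failed probe at sweep time keeps the entry.
  if (current === undefined) return false;
  return current !== entry.startTime;
}
```

The start-time string is only ever compared for equality, never
parsed, so locale or format differences between macOS and Linux ps
do not matter as long as one host produces both values.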

Registry hooks already close the per-session broker WS on
deregister, so peer list rebuilds within one sweep of any session
exit.

Service-managed daemon: no more "spawn failed" false alarms
===========================================================

After claudemesh install (which writes a launchd plist or systemd
unit with KeepAlive=true), users routinely saw

  [claudemesh] warn daemon spawn failed: socket did not appear
  within 3000ms

even when the daemon was running fine. Two contributing causes:

1. Probe timeout was 800ms — the first IPC after a launchd-driven
   restart can take longer (SQLite migration + broker WS opens) and
   tripped it. Bumped to 2500ms.
2. On a failed probe the CLI tried its own detached spawn, which
   collided with launchd's KeepAlive restart cycle (singleton lock
   fails, child exits) and we'd then time out polling for a socket
   that was actually about to come up.

Now: when the launchd plist or systemd unit exists, the CLI does not
attempt a spawn. It waits up to 8s for the OS-managed unit to bring
the socket up. New service-not-ready state distinguishes "OS hasn't
restarted it yet" from "we tried to spawn and it failed".
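
The decision above can be sketched as a small pure function. This is
an illustration under assumptions, not the CLI's code: planStartup,
the Plan type and the "spawn-failed" label are hypothetical names,
while the 8s service wait, the 3000ms spawn wait and the
service-not-ready state are taken from the text.

```typescript
const SERVICE_WAIT_MS = 8000; // wait for the OS-managed unit
const SPAWN_WAIT_MS = 3000; // from the "within 3000ms" warning

// Hypothetical plan type: what the CLI does next, and which failure
// state it reports if the socket never appears within the budget.
type Plan =
  | { kind: "use-socket" }
  | { kind: "wait-for-service"; budgetMs: number; onTimeout: "service-not-ready" }
  | { kind: "spawn-detached"; budgetMs: number; onTimeout: "spawn-failed" };

function planStartup(probeOk: boolean, serviceUnitExists: boolean): Plan {
  // Daemon already answered the probe: nothing to start.
  if (probeOk) return { kind: "use-socket" };
  // A launchd plist or systemd unit owns the daemon: never spawn,
  // just wait for the service manager to bring the socket up.
  if (serviceUnitExists) {
    return {
      kind: "wait-for-service",
      budgetMs: SERVICE_WAIT_MS,
      onTimeout: "service-not-ready",
    };
  }
  // No service unit: the CLI is responsible for spawning.
  return {
    kind: "spawn-detached",
    budgetMs: SPAWN_WAIT_MS,
    onTimeout: "spawn-failed",
  };
}
```

Carrying the timeout state inside the plan keeps the two failure
messages from being mixed up at the call site.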

Install verifies broker connectivity, not just process start
============================================================

Previously install ended once launchctl reported the unit loaded —
a daemon that boots but cannot reach the broker (blocked :443,
expired TLS, DNS, broker outage) only surfaced on the user's first
peer list or send.

/v1/health now includes per-mesh broker WS state. install polls it
for up to 15s after service boot and prints either "broker
connected (mesh=...)" or a warning naming the meshes still in
connecting state, with a hint at common causes.

The verification is best-effort and does not fail the install — it
just surfaces the issue early.
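
A hedged sketch of the post-install check. The response shape of
/v1/health is not shown in this message, so HealthReport, BrokerState,
summarizeHealth and verifyBroker are all hypothetical, and the health
fetch is injected rather than tied to a specific URL or socket. Only
the 15s budget and the two kinds of output come from the text.

```typescript
// Hypothetical shape: one broker WS state per mesh name.
type BrokerState = "connected" | "connecting";
interface HealthReport {
  meshes: Record<string, BrokerState>;
}

interface Verdict {
  ok: boolean;
  message: string;
}

// Turn one health report into the line install would print.
function summarizeHealth(report: HealthReport): Verdict {
  const names = Object.keys(report.meshes);
  const pending = names.filter((name) => report.meshes[name] !== "connected");
  if (names.length > 0 && pending.length === 0) {
    return { ok: true, message: `broker connected (mesh=${names.join(",")})` };
  }
  const which = pending.length > 0 ? pending.join(", ") : "(none reported)";
  return {
    ok: false,
    message: `warning: broker not connected for: ${which}; check outbound :443, TLS, DNS, or broker status`,
  };
}

// Poll until every mesh is connected or the budget runs out.
// Never throws: the check is best-effort and must not fail install.
async function verifyBroker(
  fetchHealth: () => Promise<HealthReport>,
  budgetMs = 15_000,
  intervalMs = 500,
): Promise<Verdict> {
  const deadline = Date.now() + budgetMs;
  let last: Verdict = {
    ok: false,
    message: "warning: daemon health endpoint not reachable yet",
  };
  while (Date.now() < deadline) {
    try {
      last = summarizeHealth(await fetchHealth());
      if (last.ok) return last;
    } catch {
      // Daemon may still be booting; retry until the deadline.
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return last;
}
```

Because verifyBroker returns a verdict instead of throwing, the
caller can print the message and still exit successfully.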

Tests
=====

Four new vitest cases cover the reaper paths: dead pid, live pid
plus matching start-time, live pid plus mismatched start-time (PID
reuse), and the no-start-time fallback. 83 of 83 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Alejandro Gutiérrez
2026-05-04 14:05:44 +01:00
parent 71f7f81880
commit 1a14cef1e0
10 changed files with 436 additions and 15 deletions


@@ -0,0 +1,127 @@
/**
 * Session reaper: PID-watcher autoclean (1.31.0).
 *
 * Verifies that registry entries are dropped when:
 *   1. their pid is no longer alive,
 *   2. their pid is alive but its start-time changed since register
 *      (PID reuse: original process gone, OS recycled the number).
 *
 * The reaper is the autoclean source-of-truth: process-exit IPC from
 * the launched session is best-effort (skipped on SIGKILL, OOM, hard
 * crash, kernel panic) so this sweep is what actually keeps the
 * broker presence honest. Both signals must work or stale "ghost"
 * sessions linger on the broker.
 */
import { afterEach, describe, expect, test, vi } from "vitest";

import {
  _resetRegistry,
  _runReaperOnce,
  listSessions,
  registerSession,
  setRegistryHooks,
  type SessionInfo,
} from "../../src/daemon/session-registry.js";

afterEach(() => {
  _resetRegistry();
  vi.restoreAllMocks();
});

describe("session reaper", () => {
  test("drops entry when pid is dead", () => {
    const onDeregister = vi.fn();
    setRegistryHooks({ onDeregister });

    // Use a high pid that is exceedingly unlikely to be alive on any
    // host. The liveness check sends signal 0, which fails with ESRCH
    // for unused pids.
    registerSession({
      token: "a".repeat(64),
      sessionId: "sess-dead",
      mesh: "m",
      displayName: "x",
      pid: 999_999,
      startTime: "Fri May 1 09:00:00 2026",
    });
    expect(listSessions()).toHaveLength(1);

    _runReaperOnce();

    expect(listSessions()).toHaveLength(0);
    expect(onDeregister).toHaveBeenCalledTimes(1);
    const arg = onDeregister.mock.calls[0]![0] as SessionInfo;
    expect(arg.sessionId).toBe("sess-dead");
  });

  test("keeps entry when pid is alive and start-time matches", () => {
    const onDeregister = vi.fn();
    setRegistryHooks({ onDeregister });

    // Use the test runner's own pid (process.pid is always alive here)
    // and capture its real start-time so the start-time guard sees a
    // match. Without pre-seeding startTime, registerSession would
    // probe ps and we'd race with that; an explicit value keeps the
    // test deterministic.
    const { execFileSync } = require("node:child_process");
    const realStart = execFileSync("ps", ["-o", "lstart=", "-p", String(process.pid)], {
      encoding: "utf8",
    }).trim();

    registerSession({
      token: "b".repeat(64),
      sessionId: "sess-live",
      mesh: "m",
      displayName: "x",
      pid: process.pid,
      startTime: realStart,
    });

    _runReaperOnce();

    expect(listSessions()).toHaveLength(1);
    expect(onDeregister).not.toHaveBeenCalled();
  });

  test("drops entry when pid is alive but start-time mismatched (PID reuse)", () => {
    const onDeregister = vi.fn();
    setRegistryHooks({ onDeregister });

    // Pid IS alive (process.pid) but we register a fake start-time
    // that won't match. Reaper must reap.
    registerSession({
      token: "c".repeat(64),
      sessionId: "sess-reused",
      mesh: "m",
      displayName: "x",
      pid: process.pid,
      startTime: "Sat Jan 1 00:00:00 1980",
    });

    _runReaperOnce();

    expect(listSessions()).toHaveLength(0);
    expect(onDeregister).toHaveBeenCalledTimes(1);
  });

  test("keeps entry when start-time wasn't captured (best-effort fallback)", () => {
    const onDeregister = vi.fn();
    setRegistryHooks({ onDeregister });

    // Register without startTime → reaper falls back to bare liveness.
    // process.pid is alive, so the entry must survive.
    registerSession({
      token: "d".repeat(64),
      sessionId: "sess-no-start",
      mesh: "m",
      displayName: "x",
      pid: process.pid,
    });

    _runReaperOnce();

    expect(listSessions()).toHaveLength(1);
    expect(onDeregister).not.toHaveBeenCalled();
  });
});