feat(cli): claudemesh daemon — peer mesh runtime (v0.9.0)

Long-lived process that holds a persistent WS to the broker and exposes
a local IPC surface (UDS + bearer-auth TCP loopback). Implements the
v0.9.0 spec under .artifacts/specs/.

Core:
- daemon up | status | version | down | accept-host
- daemon outbox list [--failed|--pending|--inflight|--done|--aborted]
- daemon outbox requeue <id> [--new-client-id <id>]
- daemon install-service / uninstall-service (macOS launchd, Linux systemd)

IPC routes:
- /v1/version, /v1/health
- /v1/send  (POST)  — full §4.5.1 idempotency lookup table
- /v1/inbox (GET)   — paged history
- /v1/events        — SSE stream of message/peer_join/peer_leave/broker_status
- /v1/peers         — broker passthrough
- /v1/profile       — summary/status/visible/avatar/title/bio/capabilities
- /v1/outbox + /v1/outbox/requeue — operator recovery

Storage (SQLite via node:sqlite / bun:sqlite):
- outbox.db: pending/inflight/done/dead/aborted with audit columns
- inbox.db: dedupe by client_message_id, decrypts DMs via existing crypto
- BEGIN IMMEDIATE serialization for daemon-local accept races

Identity:
- host_fingerprint.json (machine-id || first-stable-mac)
- refuse-on-mismatch policy with `daemon accept-host` recovery

CLI integration:
- claudemesh send detects the daemon and routes through /v1/send when
  present, falling back to bridge socket / cold path otherwise

Tests: 15-case coverage of the §4.5.1 IPC duplicate lookup table.

Spec arc preserved at .artifacts/specs/2026-05-03-daemon-{v1..v10}.md;
v0.9.0 implementation target locked at 2026-05-03-daemon-spec-v0.9.0.md;
deferred items at 2026-05-03-daemon-spec-broker-hardening-followups.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Alejandro Gutiérrez
2026-05-03 20:03:05 +01:00
parent 65e63b0b27
commit abaa4bcf87
34 changed files with 9067 additions and 0 deletions


@@ -0,0 +1,218 @@
# `claudemesh daemon` — broker-hardening followups
> **Purpose**: refinements found during the v6 → v10 codex review series
> that are real improvements but **not** v0.9.0 blockers. The
> implementation target is `2026-05-03-daemon-spec-v0.9.0.md`. This
> document lists what was deferred, why, and the trigger that promotes
> each item to "must-do."
>
> **Background**: codex reviewed the daemon spec across 9 rounds (v1
> through v10). Rounds 1-4 found load-bearing architectural issues
> (identity, IPC auth, exactly-once lie, hook tokens, rotation, etc.).
> Rounds 5-9 found progressively finer correctness issues inside one
> subsystem (broker idempotency mechanics). v6 closed the architectural
> review; v7-v10 are increasingly fine-grained idempotency-correctness
> shavings on the same layer. Pre-launch (no users) doesn't need v7-v10
> level rigor. We pulled the cheap wins into v0.9.0; the rest waits.
---
## 1. B0 dedupe fast-path before rate-limit (v10)
**What v10 said**: read `mesh.client_message_dedupe` BEFORE consulting
the rate limiter. Existing id (match or mismatch) returns immediately
without touching rate-limit budget.
**Why deferred**: v0.9.0 doesn't have meaningful rate-limit pressure on
the daemon path. The split-brain failure (broker accepted, daemon
believes failure due to rate-limit-rejection-on-retry) requires
sustained saturated rate-limit windows, which don't exist pre-launch.
**Promote when**: any single mesh sees rate-limit rejections AND has
daemon retries against committed ids. Telemetry to watch:
`cm_broker_rate_limit_rejection_total` per mesh > 0 sustained.
**Implementation cost**: small — one indexed PK lookup before the
existing limiter call. The work is mostly testing the race semantics.
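
The ordering v10 asks for can be sketched in a few lines. This is a minimal in-memory illustration, not the broker's code: the `Broker` class, the `AcceptResult` shape, and the counter-based budget are all hypothetical stand-ins for `mesh.client_message_dedupe` and the real limiter.

```typescript
// Hypothetical result shape; the real broker's response types may differ.
type AcceptResult =
  | { kind: "accepted" }
  | { kind: "duplicate"; fingerprintMatches: boolean }
  | { kind: "rate_limited" };

class Broker {
  // Stand-in for mesh.client_message_dedupe:
  // "<meshId>:<clientMessageId>" -> fingerprint of the committed send.
  private dedupe = new Map<string, string>();
  // Stand-in for the rate limiter: a plain counter for one window.
  private budget: number;

  constructor(budgetPerWindow: number) {
    this.budget = budgetPerWindow;
  }

  accept(meshId: string, clientMessageId: string, fingerprint: string): AcceptResult {
    const key = `${meshId}:${clientMessageId}`;

    // B0: dedupe lookup first. A known id (match or mismatch) returns
    // immediately and never touches the rate-limit budget.
    const committed = this.dedupe.get(key);
    if (committed !== undefined) {
      return { kind: "duplicate", fingerprintMatches: committed === fingerprint };
    }

    // Only first-time ids reach the limiter.
    if (this.budget <= 0) {
      return { kind: "rate_limited" };
    }
    this.budget -= 1;

    this.dedupe.set(key, fingerprint);
    return { kind: "accepted" };
  }
}
```

With this ordering, a daemon retrying an id the broker already committed gets a `duplicate` answer even when the window is saturated, which is the split-brain case described above.
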
---
## 2. Lua-scripted idempotent rate limiter (v10)
**What v10 said**: limiter keyed by `(mesh_id, client_message_id,
window_bucket)` so retries-within-window consume budget at most once.
**Why deferred**: depends on (1) above. Without B0 fast-path this is
incremental complexity for marginal benefit. With B0 it becomes the
right belt-and-suspenders fix for the rare race where two same-id
requests both miss B0 simultaneously.
**Promote when**: B0 ships. Same trigger.
**Implementation cost**: medium — Lua script in Redis, careful TTL
tuning, integration with existing limiter call sites.
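
The keying scheme can be illustrated without Redis. The sketch below is an in-memory TypeScript stand-in, not the Lua script itself: `IdempotentLimiter` and its fields are hypothetical, and it never expires old keys, which a real limiter would handle with TTLs. It only shows that a retry inside the same window is charged at most once.

```typescript
class IdempotentLimiter {
  // Charge keys already paid for: "<meshId>:<clientMessageId>:<bucket>".
  private charged = new Set<string>();
  // Budget consumed per "<meshId>:<bucket>".
  private used = new Map<string, number>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request may proceed.
  tryConsume(meshId: string, clientMessageId: string, nowMs: number): boolean {
    const bucket = Math.floor(nowMs / this.windowMs);
    const chargeKey = `${meshId}:${clientMessageId}:${bucket}`;

    // Retry within the same window: already paid, do not charge again.
    if (this.charged.has(chargeKey)) {
      return true;
    }

    const windowKey = `${meshId}:${bucket}`;
    const used = this.used.get(windowKey) ?? 0;
    if (used >= this.limit) {
      return false;
    }

    // Check-and-charge happens in one synchronous step here; in Redis the
    // same atomicity is what the Lua script would provide.
    this.used.set(windowKey, used + 1);
    this.charged.add(chargeKey);
    return true;
  }
}
```
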
---
## 3. In-tx `mesh.mention_index` (v8)
**What v8 said**: mention-fanout index updates should commit inside the
broker accept transaction so mention-search reads can never see a
mention pointing at an uncommitted message.
**Why deferred**: the lag between accept-commit and async
mention-indexer is small (single-digit milliseconds in expected
deployment). Stale-read window during mention search is acceptable for
v0.9.0; receivers learn of mentions via the `mention` event in their
inbox stream regardless.
**Promote when**: real users complain about "I was mentioned but the
mention search doesn't show it" with reproducible cases that don't
self-heal in seconds.
**Implementation cost**: small — add `INSERT INTO mesh.mention_index`
to the accept transaction. The async indexer becomes a backfill
fallback rather than the primary path.
---
## 4. 4011 / 4012 close-code split (v6 §15.5)
**What v6 said**: split `4010 feature_unavailable` into three codes:
`4010` (missing), `4011` (params invalid), `4012` (params below floor).
**Why deferred**: v0.9.0 ships single `4010` with structured
`close_reason` JSON containing `kind`, `feature`, `detail`. Same
diagnostic information, simpler protocol surface.
**Promote when**: ops tooling or external monitoring needs distinct
status codes (e.g. PagerDuty rules that fire on 4012-only). Probably
never; structured JSON is parseable.
**Implementation cost**: trivial — three constants and a switch on
`close_reason.kind`.
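
A rough sketch of that switch, should the split ever be promoted. The three codes and the `kind` / `feature` / `detail` fields come from the text above; the `kind` string values, the `CloseReason` interface, and `closeCodeFor` are assumptions made for illustration.

```typescript
// Shape of the structured close_reason JSON; field names from the spec text,
// exact types assumed.
interface CloseReason {
  kind: string;
  feature: string;
  detail: string;
}

const CLOSE_FEATURE_MISSING = 4010;
const CLOSE_PARAMS_INVALID = 4011;
const CLOSE_PARAMS_BELOW_FLOOR = 4012;

// Hypothetical kind strings: "params_invalid" and "params_below_floor".
// Anything else falls back to the single code v0.9.0 already uses.
function closeCodeFor(reason: CloseReason): number {
  switch (reason.kind) {
    case "params_invalid":
      return CLOSE_PARAMS_INVALID;
    case "params_below_floor":
      return CLOSE_PARAMS_BELOW_FLOOR;
    default:
      return CLOSE_FEATURE_MISSING;
  }
}
```
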
---
## 5. Elaborate per-OS fingerprint precedence table (v8 §2.2.1)
**What v8 said**: comprehensive per-OS table covering Linux machine-id
sources, macOS `IOPlatformUUID`, Windows `MachineGuid`, BSD
`kern.hostuuid`, plus interface exclusion rules.
**Why deferred**: v0.9.0 ships with the simpler "machine-id ||
first-stable-mac" rule from v6. Edge cases (cloud images,
machine-id-not-readable, etc.) are documented when first hit.
**Promote when**: operators report fingerprint false-positives we can't
explain from the v6 rule. Each report adds one row to the per-OS
table.
**Implementation cost**: incremental — each OS-specific source is a
small probe function with a fallback chain.
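
One way the probe-plus-fallback shape could look. This is a sketch under assumptions: the `Probe` type and `firstFingerprint` are hypothetical, and the probes are passed in as plain functions so that no OS-specific path or API is implied here.

```typescript
// A probe names its source and returns a value, or null when unavailable.
type Probe = { source: string; read: () => string | null };

// Walks the chain in precedence order and returns the first usable value.
// A probe that throws (e.g. unreadable file) is treated as unavailable.
function firstFingerprint(probes: Probe[]): { source: string; value: string } | null {
  for (const probe of probes) {
    let value: string | null = null;
    try {
      value = probe.read();
    } catch {
      value = null;
    }
    if (value !== null && value.trim() !== "") {
      return { source: probe.source, value: value.trim() };
    }
  }
  return null;
}
```

Under this shape, the v6 "machine-id || first-stable-mac" rule is a two-entry chain, and each new row in the per-OS table becomes one more entry.
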
---
## 6. `request_fingerprint` schema-version-2 in feature negotiation (v6 §15.1)
**What v6 said**: `client_message_id_dedupe` feature parameters
versioned independently. v0.9.0 ships at version 1 with a single
`request_fingerprint: bool` flag.
**Why deferred**: we don't yet need parameterized fingerprint variants
(different canonical forms, different hash algos). Version-bump path
is documented; we'll use it when we add the second fingerprint mode.
**Promote when**: we want a fingerprint algo other than sha256/JCS
(e.g. a faster hash, or a normalized canonical form).
**Implementation cost**: small — single feature-bit version bump
following the documented pattern.
---
## 7. Force-expiry / quarantine semantics for `keypair-archive.json` (v8 §14.1.1)
**What v8 said**: `max_archived_keys` cap with force-expiry; explicit
quarantine of malformed archive (`keypair-archive.json.malformed-<ts>`);
duplicate `key_id` rejection; mode-mismatch warning behavior.
**Why deferred**: v0.9.0 ships the simpler v6 rule — drop expired
entries on cleanup pass; refuse to start on malformed archive (loud,
operator-actionable). The v8 elaboration makes archive corruption
non-blocking, which is operationally nicer but trades off audit
clarity.
**Promote when**: a real operator hits an archive corruption that
shouldn't have brought the daemon down (e.g. mid-rotation crash leaves
a partially-written archive).
**Implementation cost**: small — quarantine logic + one extra startup
check.
---
## 8. Cross-language JCS conformance for `request_fingerprint` (v6 §4.4 round-6 question)
**What v6 asked**: does JCS work cross-language for
`meta_canonical_json`? Python json.dumps, Go encoding/json, and JS
JSON.stringify all behave differently. Should we ship a vetted JCS lib
in each SDK?
**Why deferred from v0.9.0**: the daemon ships in TypeScript only for
v0.9.0 (the `claudemesh-cli` package). Single-language JCS is trivial.
SDK ports come post-v0.9.0.
**Promote when**: we ship the Python or Go SDK. Each SDK port gets a
JCS conformance test against a corpus of envelopes.
**Implementation cost**: small per-language — a conformance fixture
file and a unit test.
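
For the single-language case, a JCS-style canonicalizer is short. The sketch below is illustrative rather than a vetted RFC 8785 implementation (it does not reject lone surrogates, for example), and the `Json` type and `canonicalize` name are assumptions. It shows the two properties a conformance corpus would pin down: sorted keys and no insignificant whitespace.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") {
    if (typeof value === "number" && !Number.isFinite(value)) {
      throw new Error("NaN and Infinity are not representable in JSON");
    }
    // Primitives: JSON.stringify already emits the ECMAScript number and
    // string forms that JCS builds on.
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    // Arrays keep their element order.
    return "[" + value.map((item) => canonicalize(item)).join(",") + "]";
  }
  // Objects: sort keys. The default sort compares UTF-16 code units.
  const keys = Object.keys(value).sort();
  const members = keys.map((k) => JSON.stringify(k) + ":" + canonicalize(value[k] as Json));
  return "{" + members.join(",") + "}";
}
```

Two envelopes that differ only in key order or whitespace canonicalize to the same string, so hashing the result gives the same fingerprint for both.
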
---
## Sprint 7 (this session) — what landed vs deferred
**Landed in code** (not yet deployed):
- `packages/db/migrations/0028_message_queue_idempotency_fields.sql` adds
nullable `client_message_id` and `request_fingerprint` columns to
`mesh.message_queue` (additive, online-safe).
- `apps/broker/src/broker.ts`: `queueMessage` and `drainForMember`
thread the new columns through.
- `apps/broker/src/index.ts`: `handleSend` picks them up from the
daemon's wire envelope; outbound push echoes them back so receiving
daemons can dedupe.
- `apps/broker/src/types.ts`: `WSPushMessage` declares the optional
fields.
**Deployment plan (not auto-applied)**:
1. Apply migration against prod DB (the broker's filename-tracked
migrator picks up `0028_*.sql` on next startup).
2. Deploy the broker with the code changes via Coolify.
3. Verify a daemon-originated send shows non-null `client_message_id`
in `mesh.message_queue` afterwards.
**Still deferred** (full broker hardening):
- `mesh.client_message_dedupe` table with `request_fingerprint BYTEA`
and atomic accept transaction (spec §4.7).
- Feature-bit advertisement on hello_ack of
`client_message_id_dedupe` v1, with daemon-side enforcement (spec §15).
- Partial unique index `(mesh_id, client_message_id) WHERE NOT NULL`.
These sit behind the same trigger as the followups above: do them when
real users hit operational corners that the current approach doesn't cover.
---
## How to use this document
When picking up post-v0.9.0 work on the daemon:
1. Check whether any of the "promote when" triggers above have fired.
2. If yes, consult the corresponding versioned spec (v6/v7/v8/v9/v10)
for the full proposed change.
3. Implement the lift, update `daemon-spec-v0.9.0.md` to reflect the
merge, and remove the item from this followups list.
The versioned specs live in `.artifacts/specs/` indefinitely as a
review-trail audit.