claudemesh/.artifacts/specs/2026-05-03-daemon-spec-broker-hardening-followups.md
Alejandro Gutiérrez abaa4bcf87 feat(cli): claudemesh daemon — peer mesh runtime (v0.9.0)
Long-lived process that holds a persistent WS to the broker and exposes
a local IPC surface (UDS + bearer-auth TCP loopback). Implements the
v0.9.0 spec under .artifacts/specs/.

Core:
- daemon up | status | version | down | accept-host
- daemon outbox list [--failed|--pending|--inflight|--done|--aborted]
- daemon outbox requeue <id> [--new-client-id <id>]
- daemon install-service / uninstall-service (macOS launchd, Linux systemd)

IPC routes:
- /v1/version, /v1/health
- /v1/send  (POST)  — full §4.5.1 idempotency lookup table
- /v1/inbox (GET)   — paged history
- /v1/events        — SSE stream of message/peer_join/peer_leave/broker_status
- /v1/peers         — broker passthrough
- /v1/profile       — summary/status/visible/avatar/title/bio/capabilities
- /v1/outbox + /v1/outbox/requeue — operator recovery

Storage (SQLite via node:sqlite / bun:sqlite):
- outbox.db: pending/inflight/done/dead/aborted with audit columns
- inbox.db: dedupe by client_message_id, decrypts DMs via existing crypto
- BEGIN IMMEDIATE serialization for daemon-local accept races

Identity:
- host_fingerprint.json (machine-id || first-stable-mac)
- refuse-on-mismatch policy with `daemon accept-host` recovery

CLI integration:
- claudemesh send detects the daemon and routes through /v1/send when
  present, falling back to bridge socket / cold path otherwise

Tests: 15-case coverage of the §4.5.1 IPC duplicate lookup table.

Spec arc preserved at .artifacts/specs/2026-05-03-daemon-{v1..v10}.md;
v0.9.0 implementation target locked at 2026-05-03-daemon-spec-v0.9.0.md;
deferred items at 2026-05-03-daemon-spec-broker-hardening-followups.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:03:05 +01:00

# `claudemesh daemon` — broker-hardening followups
> **Purpose**: refinements found during the v6 → v10 codex review series
> that are real improvements but **not** v0.9.0 blockers. The
> implementation target is `2026-05-03-daemon-spec-v0.9.0.md`. This
> document lists what was deferred, why, and the trigger that promotes
> each item to "must-do."
>
> **Background**: codex reviewed the daemon spec across 9 rounds (v1
> through v10). Rounds 1-4 found load-bearing architectural issues
> (identity, IPC auth, exactly-once lie, hook tokens, rotation, etc.).
> Rounds 5-9 found progressively finer correctness issues inside one
> subsystem (broker idempotency mechanics). v6 closed the architectural
> review; v7-v10 are increasingly fine-grained idempotency-correctness
> shavings on the same layer. Pre-launch (no users) doesn't need
> v7-v10-level rigor. We pulled the cheap wins into v0.9.0; the rest waits.
---
## 1. B0 dedupe fast-path before rate-limit (v10)
**What v10 said**: read `mesh.client_message_dedupe` BEFORE consulting
the rate limiter. Existing id (match or mismatch) returns immediately
without touching rate-limit budget.
**Why deferred**: v0.9.0 doesn't have meaningful rate-limit pressure on
the daemon path. The split-brain failure (broker accepted, daemon
believes failure due to rate-limit-rejection-on-retry) requires
sustained saturated rate-limit windows, which don't exist pre-launch.
**Promote when**: any single mesh sees rate-limit rejections AND has
daemon retries against committed ids. Telemetry to watch:
`cm_broker_rate_limit_rejection_total` per mesh > 0 sustained.
**Implementation cost**: small — one indexed PK lookup before the
existing limiter call. The work is mostly testing the race semantics.
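
A minimal sketch of the B0 ordering, assuming in-memory stand-ins for the dedupe table and the limiter. `acceptSend`, `FixedBudgetLimiter`, and the result status names are hypothetical and not taken from the broker code; the only point illustrated is that the dedupe lookup runs before any budget is consumed.

```typescript
// Hypothetical in-memory stand-ins for mesh.client_message_dedupe and the limiter.
type DedupeRow = { requestFingerprint: string };
type SendStatus = "accepted" | "duplicate_match" | "duplicate_mismatch" | "rate_limited";

class FixedBudgetLimiter {
  private used = 0;
  private readonly budget: number;

  constructor(budget: number) {
    this.budget = budget;
  }

  // Consumes one unit of budget; returns false once the budget is exhausted.
  tryConsume(): boolean {
    if (this.used >= this.budget) return false;
    this.used += 1;
    return true;
  }

  get consumed(): number {
    return this.used;
  }
}

function acceptSend(
  dedupe: Map<string, DedupeRow>,
  limiter: FixedBudgetLimiter,
  meshId: string,
  clientMessageId: string,
  requestFingerprint: string,
): SendStatus {
  const key = JSON.stringify([meshId, clientMessageId]);

  // B0: consult the dedupe table BEFORE the limiter. A known id returns
  // immediately (match or mismatch) and never touches rate-limit budget.
  const existing = dedupe.get(key);
  if (existing !== undefined) {
    return existing.requestFingerprint === requestFingerprint
      ? "duplicate_match"
      : "duplicate_mismatch";
  }

  // Only first-time ids are charged against the limiter.
  if (!limiter.tryConsume()) return "rate_limited";
  dedupe.set(key, { requestFingerprint });
  return "accepted";
}
```

With this ordering, a daemon retry of an already-committed id gets a duplicate answer even while the window is saturated, which is the split-brain case described above. The sketch is single-threaded and does not model the race semantics that the real change would need to test.
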
---
## 2. Lua-scripted idempotent rate limiter (v10)
**What v10 said**: limiter keyed by `(mesh_id, client_message_id,
window_bucket)` so retries-within-window consume budget at most once.
**Why deferred**: depends on (1) above. Without B0 fast-path this is
incremental complexity for marginal benefit. With B0 it becomes the
right belt-and-suspenders fix for the rare race where two same-id
requests both miss B0 simultaneously.
**Promote when**: B0 ships. Same trigger.
**Implementation cost**: medium — Lua script in Redis, careful TTL
tuning, integration with existing limiter call sites.
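
The keying can be sketched in memory as follows. This is an illustrative model only: `IdempotentWindowLimiter` and its methods are hypothetical names, the real proposal is a Lua script in Redis that would perform the same check-and-charge atomically, and TTL-based expiry of old buckets is omitted here.

```typescript
// Hypothetical in-memory model of a limiter keyed by
// (mesh_id, client_message_id, window_bucket).
class IdempotentWindowLimiter {
  // (mesh, bucket) key -> ids already charged in that window.
  private readonly charged = new Map<string, Set<string>>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  private idsFor(meshId: string, nowMs: number): Set<string> {
    const bucket = Math.floor(nowMs / this.windowMs);
    const key = JSON.stringify([meshId, bucket]);
    let ids = this.charged.get(key);
    if (ids === undefined) {
      ids = new Set<string>();
      this.charged.set(key, ids);
    }
    return ids;
  }

  // Returns true if the request may proceed. A retry of an id that was
  // already charged in the current window passes without consuming budget.
  tryConsume(meshId: string, clientMessageId: string, nowMs: number): boolean {
    const ids = this.idsFor(meshId, nowMs);
    if (ids.has(clientMessageId)) return true;
    if (ids.size >= this.limit) return false;
    ids.add(clientMessageId);
    return true;
  }

  // Number of distinct ids charged to this mesh in the current window.
  usedInWindow(meshId: string, nowMs: number): number {
    return this.idsFor(meshId, nowMs).size;
  }
}
```

Two same-id requests that both miss the B0 lookup would still be charged only once within a window under this scheme.
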
---
## 3. In-tx `mesh.mention_index` (v8)
**What v8 said**: mention-fanout index updates should commit inside the
broker accept transaction so mention-search reads can never see a
mention pointing at an uncommitted message.
**Why deferred**: the lag between accept-commit and async
mention-indexer is small (single-digit milliseconds in expected
deployment). Stale-read window during mention search is acceptable for
v0.9.0; receivers learn of mentions via the `mention` event in their
inbox stream regardless.
**Promote when**: real users complain about "I was mentioned but the
mention search doesn't show it" with reproducible cases that don't
self-heal in seconds.
**Implementation cost**: small — add `INSERT INTO mesh.mention_index`
to the accept transaction. The async indexer becomes a backfill
fallback rather than the primary path.
---
## 4. 4011 / 4012 close-code split (v6 §15.5)
**What v6 said**: split `4010 feature_unavailable` into three codes:
`4010` (missing), `4011` (params invalid), `4012` (params below floor).
**Why deferred**: v0.9.0 ships single `4010` with structured
`close_reason` JSON containing `kind`, `feature`, `detail`. Same
diagnostic information, simpler protocol surface.
**Promote when**: ops tooling or external monitoring needs distinct
status codes (e.g. PagerDuty rules that fire on 4012-only). Probably
never; structured JSON is parseable.
**Implementation cost**: trivial — three constants and a switch on
`close_reason.kind`.
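
For illustration, the three constants and the switch might look like the sketch below. The `kind` values (`missing`, `params_invalid`, `params_below_floor`), the constant names, and the `splitCodes` toggle are assumptions; the text only establishes that `close_reason` carries `kind`, `feature`, and `detail`.

```typescript
// Hypothetical kind values; only the field names come from the spec text.
type CloseReasonKind = "missing" | "params_invalid" | "params_below_floor";

interface CloseReason {
  kind: CloseReasonKind;
  feature: string;
  detail: string;
}

const CLOSE_FEATURE_UNAVAILABLE = 4010;
const CLOSE_FEATURE_PARAMS_INVALID = 4011;
const CLOSE_FEATURE_PARAMS_BELOW_FLOOR = 4012;

// With splitCodes=false this models v0.9.0 (single 4010 plus structured JSON);
// with splitCodes=true it models the deferred three-way split.
function closeCodeFor(reason: CloseReason, splitCodes: boolean): number {
  if (!splitCodes) return CLOSE_FEATURE_UNAVAILABLE;
  switch (reason.kind) {
    case "missing":
      return CLOSE_FEATURE_UNAVAILABLE;
    case "params_invalid":
      return CLOSE_FEATURE_PARAMS_INVALID;
    case "params_below_floor":
      return CLOSE_FEATURE_PARAMS_BELOW_FLOOR;
  }
}
```
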
---
## 5. Elaborate per-OS fingerprint precedence table (v8 §2.2.1)
**What v8 said**: comprehensive per-OS table covering Linux machine-id
sources, macOS `IOPlatformUUID`, Windows `MachineGuid`, BSD
`kern.hostuuid`, plus interface exclusion rules.
**Why deferred**: v0.9.0 ships with the simpler "machine-id ||
first-stable-mac" rule from v6. Edge cases (cloud images,
machine-id-not-readable, etc.) are documented when first hit.
**Promote when**: operators report fingerprint false-positives we can't
explain from the v6 rule. Each report adds one row to the per-OS
table.
**Implementation cost**: incremental — each OS-specific source is a
small probe function with a fallback chain.
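
A fallback chain of probes could be sketched like this. `Probe` and `resolveFingerprint` are hypothetical names, and the probes are injected rather than reading real OS sources, so the sketch makes no claim about how the daemon actually reads machine-id or MAC addresses.

```typescript
// Hypothetical probe shape: each OS-specific source is a small function that
// returns a value, returns null, or throws when the source is unreadable.
interface Probe {
  source: string;
  read: () => string | null;
}

interface Fingerprint {
  source: string;
  value: string;
}

// Walks the probes in precedence order and returns the first usable value.
function resolveFingerprint(probes: Probe[]): Fingerprint | null {
  for (const probe of probes) {
    let value: string | null = null;
    try {
      value = probe.read();
    } catch {
      value = null; // unreadable source: fall through to the next probe
    }
    if (value !== null && value.trim() !== "") {
      return { source: probe.source, value: value.trim() };
    }
  }
  return null;
}
```

The v6 rule corresponds to a two-element chain (machine-id, then first stable MAC); each operator report would add or reorder probes rather than change the resolver.
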
---
## 6. `request_fingerprint` schema-version-2 in feature negotiation (v6 §15.1)
**What v6 said**: `client_message_id_dedupe` feature parameters
versioned independently. v0.9.0 ships at version 1 with a single
`request_fingerprint: bool` flag.
**Why deferred**: we don't yet need parameterized fingerprint variants
(different canonical forms, different hash algos). Version-bump path
is documented; we'll use it when we add the second fingerprint mode.
**Promote when**: we want a fingerprint algo other than sha256/JCS
(e.g. a faster hash, or a normalized canonical form).
**Implementation cost**: small — single feature-bit version bump
following the documented pattern.
---
## 7. Force-expiry / quarantine semantics for `keypair-archive.json` (v8 §14.1.1)
**What v8 said**: `max_archived_keys` cap with force-expiry; explicit
quarantine of malformed archive (`keypair-archive.json.malformed-<ts>`);
duplicate `key_id` rejection; mode-mismatch warning behavior.
**Why deferred**: v0.9.0 ships the simpler v6 rule — drop expired
entries on cleanup pass; refuse to start on malformed archive (loud,
operator-actionable). The v8 elaboration makes archive corruption
non-blocking, which is operationally nicer but trades off audit
clarity.
**Promote when**: a real operator hits an archive corruption that
shouldn't have brought the daemon down (e.g. mid-rotation crash leaves
a partially-written archive).
**Implementation cost**: small — quarantine logic + one extra startup
check.
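
One possible reading of the v8 behaviour, as a pure function over the archive contents. The entry shape (`key_id`, `expires_at`), the choice to keep the first occurrence of a duplicate `key_id`, and the choice to force-expire the soonest-expiring keys when over the cap are all assumptions made for this sketch; mode-mismatch handling is not modelled.

```typescript
// Hypothetical entry shape for keypair-archive.json.
interface ArchivedKey {
  key_id: string;
  expires_at: number; // epoch milliseconds
}

type ArchiveLoad =
  | { kind: "ok"; keys: ArchivedKey[]; rejectedDuplicates: string[]; forceExpired: string[] }
  | { kind: "quarantined"; renameTo: string };

function isArchivedKey(v: unknown): v is ArchivedKey {
  if (typeof v !== "object" || v === null) return false;
  const r = v as Record<string, unknown>;
  return typeof r.key_id === "string" && typeof r.expires_at === "number";
}

function loadArchive(raw: string, nowMs: number, maxArchivedKeys: number): ArchiveLoad {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    parsed = undefined;
  }
  // Malformed archive: report the quarantine target instead of refusing to start.
  if (!Array.isArray(parsed) || !parsed.every(isArchivedKey)) {
    return { kind: "quarantined", renameTo: `keypair-archive.json.malformed-${nowMs}` };
  }

  const entries = parsed as ArchivedKey[];
  const seen = new Set<string>();
  const rejectedDuplicates: string[] = [];
  const live: ArchivedKey[] = [];
  for (const key of entries) {
    if (seen.has(key.key_id)) {
      rejectedDuplicates.push(key.key_id); // assumption: first occurrence wins
      continue;
    }
    seen.add(key.key_id);
    if (key.expires_at > nowMs) live.push(key); // drop already-expired entries
  }

  // Assumption: when over the cap, keep the keys that expire last.
  live.sort((a, b) => b.expires_at - a.expires_at);
  const forceExpired = live.slice(maxArchivedKeys).map((k) => k.key_id);
  return { kind: "ok", keys: live.slice(0, maxArchivedKeys), rejectedDuplicates, forceExpired };
}
```

The caller would perform the actual rename and continue startup with an empty archive, which is the non-blocking behaviour the v8 elaboration trades against audit clarity.
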
---
## 8. Cross-language JCS conformance for `request_fingerprint` (v6 §4.4 round-6 question)
**What v6 asked**: does JCS work cross-language for
`meta_canonical_json`? Python json.dumps, Go encoding/json, and JS
JSON.stringify all behave differently. Should we ship a vetted JCS lib
in each SDK?
**Why deferred from v0.9.0**: the daemon ships in TypeScript only for
v0.9.0 (the `claudemesh-cli` package). Single-language JCS is trivial.
SDK ports come post-v0.9.0.
**Promote when**: we ship the Python or Go SDK. Each SDK port gets a
JCS conformance test against a corpus of envelopes.
**Implementation cost**: small per-language — a conformance fixture
file and a unit test.
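
A conformance fixture and its check might be shaped like the sketch below. `canonicalize` is a simplified JCS-style (RFC 8785) serializer written for illustration, not a vetted library: it sorts object keys by UTF-16 code units and relies on `JSON.stringify` for primitives, and it does not reject lone surrogates. The corpus entries are made-up values, not real envelopes.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Simplified JCS-style canonicalization (illustrative, not a vetted implementation).
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") {
    if (typeof value === "number" && !Number.isFinite(value)) {
      throw new Error("non-finite numbers cannot be canonicalized");
    }
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalize(item)).join(",")}]`;
  }
  // Default sort() orders strings by UTF-16 code units.
  const keys = Object.keys(value).sort();
  const members = keys.map((k) => `${JSON.stringify(k)}:${canonicalize(value[k] as Json)}`);
  return `{${members.join(",")}}`;
}

// Hypothetical fixture format: every SDK port must produce `expected` for `input`.
interface ConformanceCase {
  name: string;
  input: Json;
  expected: string;
}

const corpus: ConformanceCase[] = [
  { name: "key order", input: { b: 1, a: [true, null, "x"] }, expected: '{"a":[true,null,"x"],"b":1}' },
  { name: "nested objects", input: { z: { y: 2, x: 1 } }, expected: '{"z":{"x":1,"y":2}}' },
  { name: "number forms", input: { n: 1.0, e: 1e21 }, expected: '{"e":1e+21,"n":1}' },
];

// Returns the names of the cases whose canonical form differs from the fixture.
function runConformance(cases: ConformanceCase[]): string[] {
  return cases.filter((c) => canonicalize(c.input) !== c.expected).map((c) => c.name);
}
```

The same fixture file would be loaded by the Python and Go tests, so a divergence in key ordering or number formatting shows up as a named failing case before it shows up as a `request_fingerprint` mismatch.
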
---
## Sprint 7 (this session) — what landed vs deferred
**Landed in code** (not yet deployed):
- `packages/db/migrations/0028_message_queue_idempotency_fields.sql` adds
nullable `client_message_id` and `request_fingerprint` columns to
`mesh.message_queue` (additive, online-safe).
- `apps/broker/src/broker.ts`: `queueMessage` and `drainForMember`
thread the new columns through.
- `apps/broker/src/index.ts`: `handleSend` picks them up from the
daemon's wire envelope; outbound push echoes them back so receiving
daemons can dedupe.
- `apps/broker/src/types.ts`: `WSPushMessage` declares the optional
fields.
**Deployment plan (not auto-applied)**:
1. Apply migration against prod DB (the broker's filename-tracked
migrator picks up `0028_*.sql` on next startup).
2. Deploy the broker with the code changes via Coolify.
3. Verify a daemon-originated send shows non-null `client_message_id`
in `mesh.message_queue` afterwards.
**Still deferred** (full broker hardening):
- `mesh.client_message_dedupe` table with `request_fingerprint BYTEA`
and atomic accept transaction (spec §4.7).
- Feature-bit advertisement on hello_ack of
`client_message_id_dedupe` v1, with daemon-side enforcement (spec §15).
- Partial unique index `(mesh_id, client_message_id) WHERE client_message_id IS NOT NULL`.
These sit behind the same trigger as the followups above: do them when
real users hit operational corners that the landed columns alone don't cover.
---
## How to use this document
When picking up post-v0.9.0 work on the daemon:
1. Check whether any of the "promote when" triggers above have fired.
2. If yes, consult the corresponding versioned spec (v6/v7/v8/v9/v10)
for the full proposed change.
3. Implement the lift, update `daemon-spec-v0.9.0.md` to reflect the
merge, and remove the item from this followups list.
The versioned specs live in `.artifacts/specs/` indefinitely as a
review-trail audit.