Alejandro Gutiérrez abaa4bcf87 feat(cli): claudemesh daemon — peer mesh runtime (v0.9.0)

Long-lived process that holds a persistent WS to the broker and exposes
a local IPC surface (UDS + bearer-auth TCP loopback). Implements the
v0.9.0 spec under .artifacts/specs/.

Core:
- daemon up | status | version | down | accept-host
- daemon outbox list [--failed|--pending|--inflight|--done|--aborted]
- daemon outbox requeue <id> [--new-client-id <id>]
- daemon install-service / uninstall-service (macOS launchd, Linux systemd)

IPC routes:
- /v1/version, /v1/health
- /v1/send  (POST)  — full §4.5.1 idempotency lookup table
- /v1/inbox (GET)   — paged history
- /v1/events        — SSE stream of message/peer_join/peer_leave/broker_status
- /v1/peers         — broker passthrough
- /v1/profile       — summary/status/visible/avatar/title/bio/capabilities
- /v1/outbox + /v1/outbox/requeue — operator recovery

Storage (SQLite via node:sqlite / bun:sqlite):
- outbox.db: pending/inflight/done/dead/aborted with audit columns
- inbox.db: dedupe by client_message_id, decrypts DMs via existing crypto
- BEGIN IMMEDIATE serialization for daemon-local accept races

Identity:
- host_fingerprint.json (machine-id || first-stable-mac)
- refuse-on-mismatch policy with `daemon accept-host` recovery

CLI integration:
- claudemesh send detects the daemon and routes through /v1/send when
  present, falling back to bridge socket / cold path otherwise

Tests: 15-case coverage of the §4.5.1 IPC duplicate lookup table.

Spec arc preserved at .artifacts/specs/2026-05-03-daemon-{v1..v10}.md;
v0.9.0 implementation target locked at 2026-05-03-daemon-spec-v0.9.0.md;
deferred items at 2026-05-03-daemon-spec-broker-hardening-followups.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-03 20:03:05 +01:00

23 KiB

`claudemesh daemon` — Final Spec v9

Round 9. v8 was reviewed by codex (round 8) which closed aborted/UNIQUE (5/5) and SQLite locking (5/5) cleanly, but flagged three spec-level correctness problems:

Cross-layer ID-consumed authority contradiction — v8 §4.1 said "id consumed iff dedupe row exists" while §4.6.1 says a daemon-rejected id stays consumed locally with no broker dedupe row. Two incompatible authorities.

Rate-limit authority muddled — v8 listed rate limit in B2 (in-tx authoritative) but classified rate-limit counters as async/non-authoritative in §4.7.1.

§4.1 broker guarantee wording — "post-validation accept phase" was fuzzy because B2 rolls back. Tighten to "accept committed."

v9 fixes all three with two-layer ID rules (daemon vs broker), rate-limit moved to B1 via an external atomic limiter, and §4.1 tightened. Intent §0 unchanged from v2. v9 only revises §4.

0. Intent — unchanged, see v2 §0

1. Process model — unchanged

2. Identity — unchanged from v5 §2

3. IPC surface — unchanged from v4 §3

4. Delivery contract — `aborted` clarified, broker phasing, SQLite locking

4.1 The contract (precise — v9, two-layer ID model)

Two-layer ID rules (NEW v9 — codex r8):

Daemon-layer: a client_message_id is daemon-consumed iff an outbox row exists for it. Daemon-mediated callers can never reuse a daemon-consumed id, regardless of whether the broker ever saw it. The daemon's outbox is the single authority for "this id was issued by my caller against this daemon."

Broker-layer: a client_message_id is broker-consumed iff a dedupe row exists for (mesh_id, client_message_id) in mesh.client_message_dedupe. Direct broker callers (none in v0.9.0; reserved for future SDK paths that bypass the daemon) can reuse a broker-non-consumed id freely.

In v0.9.0 there are no daemon-bypass clients, so for practical purposes "daemon-consumed" is the operative rule.

Local guarantee: each successful POST /v1/send returns a stable client_message_id. The send is durably persisted to outbox.db before the response returns. The daemon enforces request-fingerprint idempotency at the IPC layer (§4.5.1).

Local audit guarantee: a client_message_id once written to outbox.db is never released (daemon-layer rule). Operator recovery via requeue always mints a fresh id; the old row stays in aborted for audit. There is no daemon-side path to free a used id.

Broker guarantee (v9 — tightened): a dedupe row exists iff the broker accept transaction committed (Phase B3 reached). Phase B1 rejections never insert dedupe rows. Phase B2 rejections roll the transaction back, so any partial dedupe row is unwound. Direct broker callers retrying after B1/B2 rejection see no dedupe row and may reuse the id.

Atomicity guarantee: same as v8 §4.1.

End-to-end guarantee: at-least-once.

4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2

4.3 Broker schema — unchanged from v6 §4.3

4.4 Request fingerprint canonical form — unchanged from v6 §4.4

4.5 Daemon-local idempotency at the IPC layer (v8 — `aborted` added, SQLite locking)

4.5.1 IPC accept algorithm (v8)

On POST /v1/send:

1. Validate request envelope (auth, schema, size limits, destination resolvable). Failures here return 4xx immediately. No outbox row is written; the client_message_id is not consumed.
2. Compute request_fingerprint (§4.4).
3. Open a SQLite transaction with BEGIN IMMEDIATE (v8 — codex r7) so a concurrent IPC accept on the same id serializes against this one. BEGIN IMMEDIATE acquires the RESERVED lock at transaction start, preventing any other connection from starting a write transaction on the same database (readers are not blocked); SQLite has no row-level lock and SELECT FOR UPDATE is not supported.
4. SELECT id, request_fingerprint, status, broker_message_id, last_error FROM outbox WHERE client_message_id = ?.
5. Apply the lookup table below. For the "(no row)" case, INSERT the new row inside the same transaction.
6. COMMIT.

Existing row state	Fingerprint match?	Daemon response
(no row)	—	INSERT new outbox row in `pending`; return `202 accepted, queued`
`pending`	match	Return `202 accepted, queued`. No mutation
`pending`	mismatch	Return `409 idempotency_key_reused`, `conflict: "outbox_pending_fingerprint_mismatch"`. No mutation
`inflight`	match	Return `202 accepted, inflight`. No mutation
`inflight`	mismatch	Return `409 idempotency_key_reused`, `conflict: "outbox_inflight_fingerprint_mismatch"`
`done`	match	Return `200 ok, duplicate: true, broker_message_id, history_id`. No broker call
`done`	mismatch	Return `409 idempotency_key_reused`, `conflict: "outbox_done_fingerprint_mismatch", broker_message_id`
`dead`	match	Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_match", reason: "<last_error>"`. Same id never auto-retried
`dead`	mismatch	Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_mismatch"`
`aborted` (NEW v8)	match	Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_match"`. The id was retired by operator action; never reusable
`aborted` (NEW v8)	mismatch	Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_mismatch"`

Rule (v8 — codex r7): every IPC 409 carries the daemon's request_fingerprint (8-byte hex prefix) so callers can debug client/server canonical-form drift. Every state in the table returns something deterministic, including aborted. A client_message_id written to outbox.db is permanently bound to that row's lifecycle — the only "free" state is "no row exists".
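
The lookup table can be read as a pure decision function evaluated inside the BEGIN IMMEDIATE transaction. The following is a minimal sketch under stated assumptions, not the daemon's implementation: the names `OutboxRowView`, `IpcDecision`, and `decideIpcAccept` are hypothetical, and fingerprints are assumed to be compared as hex strings.

```typescript
// Sketch of the §4.5.1 lookup table as a pure function.
// All type, field, and function names are illustrative assumptions.
type OutboxStatus = "pending" | "inflight" | "done" | "dead" | "aborted";

interface OutboxRowView {
  status: OutboxStatus;
  requestFingerprintHex: string; // hex of the 32-byte fingerprint
  brokerMessageId?: string;
  lastError?: string;
}

interface IpcDecision {
  http: 200 | 202 | 409;
  insertNewRow: boolean;
  state?: "queued" | "inflight";
  duplicate?: boolean;
  conflict?: string;
  brokerMessageId?: string;
  reason?: string;
  daemonFingerprintPrefix?: string; // carried on every 409
}

function decideIpcAccept(
  row: OutboxRowView | undefined,
  incomingFingerprintHex: string,
): IpcDecision {
  // "(no row)": the only state in which the id is free.
  if (row === undefined) {
    return { http: 202, insertNewRow: true, state: "queued" };
  }
  const r: OutboxRowView = row;
  const match = r.requestFingerprintHex === incomingFingerprintHex;
  // 8-byte prefix = first 16 hex characters of the stored fingerprint.
  const prefix = r.requestFingerprintHex.slice(0, 16);
  const conflict = (suffix: "match" | "mismatch"): IpcDecision => ({
    http: 409,
    insertNewRow: false,
    conflict: `outbox_${r.status}_fingerprint_${suffix}`,
    daemonFingerprintPrefix: prefix,
  });

  if (!match) {
    const decision = conflict("mismatch");
    if (r.status === "done") decision.brokerMessageId = r.brokerMessageId;
    return decision;
  }
  switch (r.status) {
    case "pending":
      return { http: 202, insertNewRow: false, state: "queued" };
    case "inflight":
      return { http: 202, insertNewRow: false, state: "inflight" };
    case "done":
      return { http: 200, insertNewRow: false, duplicate: true, brokerMessageId: r.brokerMessageId };
    case "dead":
      return { ...conflict("match"), reason: r.lastError };
    case "aborted":
      return conflict("match");
  }
}
```

Keeping the decision free of I/O makes each row of the table checkable on its own; the caller remains responsible for running the SELECT, this function, and any INSERT inside one transaction.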

4.5.2 Outbox table — fingerprint required

CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,                          -- 32 bytes
  payload             BLOB NOT NULL,
  enqueued_at         INTEGER NOT NULL,
  attempts            INTEGER DEFAULT 0,
  next_attempt_at     INTEGER NOT NULL,
  status              TEXT CHECK(status IN
                        ('pending','inflight','done','dead','aborted')),
  last_error          TEXT,
  delivered_at        INTEGER,
  broker_message_id   TEXT,
  aborted_at          INTEGER,                                -- NEW v8
  aborted_by          TEXT,                                   -- NEW v8: operator/auto
  superseded_by       TEXT                                    -- NEW v8: id of the requeue successor row, if any
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
CREATE INDEX outbox_aborted ON outbox(status, aborted_at) WHERE status = 'aborted';

aborted_at, aborted_by, superseded_by give operators a clear audit trail. superseded_by lets outbox inspect show the chain when a row was requeued multiple times.

request_fingerprint is computed once at IPC accept time and frozen forever for the row's lifecycle. Daemon never recomputes from payload.

4.6 Rejected-request semantics — two-layer rules + rate-limit moved to B1 (v9 — codex r8)

Two-layer rule (v9): a client_message_id is daemon-consumed iff an outbox row exists for it; broker-consumed iff a dedupe row exists. Daemon-mediated callers see daemon-layer authority (the only path in v0.9.0). Pre-validation failures at any layer consume nothing at that layer. The two layers are independent: a daemon-consumed id may or may not be broker-consumed (depending on whether the send reached B3); a daemon-non-consumed id can never be broker-consumed (no outbox row ⇒ no broker call from the daemon).

4.6.1 Daemon-side rejection phasing (v9)

Phase	When daemon rejects	Outbox row?	Daemon-consumed?	Same daemon caller may reuse id?
A. IPC validation (auth, schema, size, destination resolvable)	Before §4.5.1 step 3	No	No	Yes — id never written locally
B. Outbox stored, broker network/transient failure	After IPC accept, broker `5xx` or timeout	`pending` → retried	Yes	N/A — daemon owns retries
C. Outbox stored, broker permanent rejection	Broker returns `4xx` after IPC accept	`dead`	Yes	No — rotate via `requeue`
D. Operator retirement	Operator runs `requeue` on `dead` or `pending` row	`aborted` (audit) + new row with fresh id	Yes (still consumed)	Old id NEVER reusable; new id is fresh

The "daemon-consumed?" column is the daemon-layer authority. It does not depend on whether the broker ever saw the request — phase C above shows the broker has not committed a dedupe row, but the daemon still holds the id in dead state.
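
Phases B and C above amount to a mapping from broker outcome to outbox state. A minimal sketch of that mapping, assuming hypothetical names (`BrokerOutcome`, `OutboxTransition`, `transitionAfterBrokerCall`) and assuming that a broker accept moves the row to `done`:

```typescript
// Sketch: outbox state transition after one broker call (§4.6.1).
// Names and the outcome encoding are illustrative assumptions.
type BrokerOutcome =
  | { kind: "accepted"; brokerMessageId: string } // broker committed (B3)
  | { kind: "http"; status: number }              // broker returned an error status
  | { kind: "timeout" };                          // network failure or timeout

type OutboxTransition =
  | { status: "done"; brokerMessageId: string }
  | { status: "pending"; retry: true }
  | { status: "dead"; lastError: string };

function transitionAfterBrokerCall(outcome: BrokerOutcome): OutboxTransition {
  if (outcome.kind === "accepted") {
    return { status: "done", brokerMessageId: outcome.brokerMessageId };
  }
  // Phase B: transient failure. The row stays pending and the daemon owns retries.
  if (outcome.kind === "timeout" || outcome.status >= 500) {
    return { status: "pending", retry: true };
  }
  // Phase C: permanent rejection. The id stays daemon-consumed in the dead row.
  if (outcome.status >= 400) {
    return { status: "dead", lastError: `broker returned ${outcome.status}` };
  }
  // Assumption: any other status is unexpected for this call.
  throw new Error(`unexpected broker status ${outcome.status}`);
}
```

In every branch the outbox row survives, which is why the id remains daemon-consumed regardless of what the broker did.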

4.6.2 Broker-side rejection phasing (v9 — rate limit moved to B1)

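The broker validates in two phases relative to dedupe-row insertion (B1 before the claim, B2 after it, inside the transaction); B3 is the commit that follows successful validation: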

Phase	Validation	Side effects	Result for direct broker callers
B1. Pre-dedupe-claim (atomic, external)	Auth (mesh membership), schema, size, mesh exists, member exists, destination kind valid, payload bytes ≤ `max_payload.inline_bytes`, rate limit not exceeded (atomic external limiter — see §4.6.4)	None	`4xx` returned. No dedupe row, no broker-consumed id. Caller may retry with same id once condition clears
B2. Post-dedupe-claim (in-tx)	Conditions that require the accept transaction to be in progress: destination_ref existence (topic exists, member subscribed, etc.)	INSERT into dedupe rolled back	`4xx` returned, transaction rolled back, no dedupe row remains. Caller may retry with same id
B3. Accepted	All side effects commit atomically	Dedupe row, message row, history row, delivery_queue rows, mention_index rows	`201` returned with `broker_message_id`. Id is broker-consumed

Daemon-mediated callers: in v0.9.0 the daemon is the only B-phase caller. Daemon-mediated callers see only the daemon-layer rules (§4.6.1). The broker's "may retry with same id" wording in the table above applies to direct broker callers only (none in v0.9.0; reserved for future SDK paths).

Critical guarantee (v9 — tightened from v8): a dedupe row exists iff the broker accept transaction committed (B3). There is no broker code path where a permanent 4xx leaves a dedupe row behind.

If the broker decides post-commit that an accepted message is invalid (async content-policy job, async moderation, etc.), that's NOT a permanent rejection — it's a follow-up event that operates on the broker_message_id, not on the dedupe key.
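
The B1/B2/B3 phasing and the critical guarantee can be exercised with a small in-memory model. This is a sketch under stated assumptions, not broker code: `BrokerModel` and its fields are hypothetical, the transaction is modeled as a list of staged writes applied only at commit, and the specific status codes 413, 429, and 404 are assumptions (the spec only says `4xx`).

```typescript
// In-memory model of broker accept phasing (§4.6.2). Illustrative only.
interface SendRequest {
  meshId: string;
  clientMessageId: string;
  fingerprint: string;
  topic: string;
  payloadBytes: number;
}

interface BrokerResult {
  status: number;
  phase: "B1" | "B2" | "B3" | "duplicate";
  brokerMessageId?: string;
}

class BrokerModel {
  dedupe = new Map<string, { brokerMessageId: string; fingerprint: string }>();
  topics = new Set<string>();
  history: string[] = [];
  private nextId = 1;
  private maxInlineBytes: number;
  private limiterAllows: (meshId: string) => boolean;

  constructor(maxInlineBytes: number, limiterAllows: (meshId: string) => boolean) {
    this.maxInlineBytes = maxInlineBytes;
    this.limiterAllows = limiterAllows;
  }

  accept(req: SendRequest): BrokerResult {
    // Phase B1: no transaction is opened, nothing is written.
    if (req.payloadBytes > this.maxInlineBytes) return { status: 413, phase: "B1" };
    if (!this.limiterAllows(req.meshId)) return { status: 429, phase: "B1" };

    // BEGIN: writes are staged and only become visible at COMMIT.
    const staged: Array<() => void> = [];
    const key = JSON.stringify([req.meshId, req.clientMessageId]);
    const existing = this.dedupe.get(key);
    if (existing !== undefined) {
      // Duplicate id: roll back and answer from the committed dedupe row.
      return existing.fingerprint === req.fingerprint
        ? { status: 200, phase: "duplicate", brokerMessageId: existing.brokerMessageId }
        : { status: 409, phase: "duplicate" };
    }
    const brokerMessageId = `msg-${this.nextId}`;
    staged.push(() => this.dedupe.set(key, { brokerMessageId, fingerprint: req.fingerprint }));

    // Phase B2: in-tx validation. Returning here drops the staged writes (ROLLBACK).
    if (!this.topics.has(req.topic)) return { status: 404, phase: "B2" };

    // Phase B3: COMMIT applies every staged write together.
    staged.push(() => this.history.push(brokerMessageId));
    for (const apply of staged) apply();
    this.nextId += 1;
    return { status: 201, phase: "B3", brokerMessageId };
  }
}
```

For example, a send to a missing topic is rejected in B2 and leaves `dedupe` empty; once the topic exists, the same id is accepted, and a later resend with the same fingerprint returns the original `brokerMessageId`.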

4.6.4 Rate limiter — atomic, external, B1 (NEW v9 — codex r8)

Codex r8 caught: v8 listed rate-limit enforcement in B2 (in-tx) but classified rate-limit counters as async/non-authoritative. Both can't be true. v9 resolves it by moving rate-limit enforcement to B1 backed by an atomic external limiter:

Authority: the broker's existing Redis (or equivalent fixed-window limiter), already used for claudemesh launch rate-limiting, is the authority for accept-time rate-limit enforcement. INCR with TTL is atomic; the broker checks the result in Phase B1, before opening the Phase B2/B3 transaction.
Idempotency interaction: rate-limit INCR happens before the dedupe-claim INSERT. If the limiter rejects, no DB transaction is opened, no dedupe row exists. If the limiter accepts but the in-tx Phase B2 then rejects (e.g. topic not found), the limiter INCR is not refunded. This is intentional: refunding would require a reliable distributed counter, and the over-counting risk is acceptable. Counter cm_broker_rate_limit_consumed_then_rejected_total exposes the delta for ops awareness.
Retries: a daemon retry with the same client_message_id after a B1 rate-limit rejection produces another INCR. To avoid burning rate-limit budget on retries-of-rejected-ids, the broker can optionally short-circuit INCR if the rate-limit subsystem can cheaply detect "this exact client_message_id was rejected for rate-limit in the last N seconds" — but this is an optimization, not a correctness requirement.
Async counters: mesh.rate_limit_counter (or any DB-resident view of "messages-per-mesh-per-window") is non-authoritative — it's metrics/telemetry rebuilt from the authoritative limiter and from message-history. Used for dashboards, not for accept decisions.

This split — atomic external limiter for enforcement, async DB counters for telemetry — matches how every other rate-limited subsystem in claudemesh works (claudemesh launch, dashboard chat posts, etc.). No new infrastructure required.
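
To make the limiter semantics concrete, here is a minimal in-memory sketch of a fixed-window limiter with increment-then-compare behavior, plus the intentional non-refund on a later B2 rejection. It is a model, not the Redis integration: `FixedWindowLimiter`, `acceptWithLimiter`, and the `consumedThenRejected` variable (standing in for the cm_broker_rate_limit_consumed_then_rejected_total counter) are hypothetical names.

```typescript
// Sketch: fixed-window limiter modeling INCR-with-TTL (§4.6.4). Illustrative only.
class FixedWindowLimiter {
  private counters = new Map<string, { count: number; expiresAtMs: number }>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Increments first, then compares, so a rejected call still counts.
  tryConsume(key: string, nowMs: number): boolean {
    const current = this.counters.get(key);
    if (current === undefined || current.expiresAtMs <= nowMs) {
      // First hit in a new window: start the counter and its TTL.
      this.counters.set(key, { count: 1, expiresAtMs: nowMs + this.windowMs });
      return 1 <= this.limit;
    }
    current.count += 1;
    return current.count <= this.limit;
  }
}

// Stand-in for the ops counter that exposes over-counting.
let consumedThenRejected = 0;

function acceptWithLimiter(
  limiter: FixedWindowLimiter,
  meshId: string,
  nowMs: number,
  b2Passes: boolean,
): "rate_limited" | "b2_rejected" | "accepted" {
  // B1: limiter rejection means no DB transaction is ever opened.
  if (!limiter.tryConsume(meshId, nowMs)) return "rate_limited";
  // B2 rejection after a successful INCR: budget is spent and not refunded.
  if (!b2Passes) {
    consumedThenRejected += 1;
    return "b2_rejected";
  }
  return "accepted";
}
```

With a limit of 2 per window, one B2-rejected send followed by one accepted send exhausts the budget, so a third send in the same window is rate-limited even though only one message committed.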

4.6.3 Operator recovery via `requeue` (corrected v8)

To unstick a dead or pending-but-stuck row, operator runs:

claudemesh daemon outbox requeue --id <outbox_row_id>
                                  [--new-client-id <id> | --auto]
                                  [--patch-payload <path>]

This atomically (single SQLite transaction):

Marks the existing row's status to aborted, sets aborted_at = now, aborted_by = "operator". Row is never deleted — audit trail permanent.
Mints a fresh client_message_id (caller-supplied via --new-client-id or auto-ulid'd via --auto).
Inserts a new outbox row in pending with the fresh id and the same payload (or patched payload if --patch-payload was given).
Sets superseded_by = <new_row_id> on the old row so outbox inspect <old_id> displays the chain.

The old client_message_id is permanently dead — outbox.db still holds it via the aborted row's UNIQUE constraint, and any caller re-using it gets 409 outbox_aborted_* per §4.5.1.

If the broker ever accepted the old id (the send reached B3), the broker's dedupe row is also permanent: a duplicate send to the broker with the old id returns 409 on fingerprint mismatch, or the original broker_message_id on a matching fingerprint. The daemon-side aborted row and the broker-side dedupe row are independent records of "this id was used"; neither releases the id.

This is the resolution to v7's contradiction: there is no path for an id to "become free again." If the operator wants to retry the payload, they get a new id. The old id stays buried.
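
The four requeue steps can be sketched against an in-memory outbox. This is an illustration under stated assumptions rather than the daemon's code: `RequeueRow` and `requeue` are hypothetical names, the Map stands in for outbox.db, and restricting requeue to `dead` or `pending` rows follows the wording above. All checks run before any write, which models the single-transaction requirement.

```typescript
// Sketch: operator requeue as one all-or-nothing step (§4.6.3). Illustrative only.
type RowStatus = "pending" | "inflight" | "done" | "dead" | "aborted";

interface RequeueRow {
  id: string;
  clientMessageId: string;
  status: RowStatus;
  payload: string;
  abortedAt?: number;
  abortedBy?: string;
  supersededBy?: string;
}

function requeue(
  outbox: Map<string, RequeueRow>,
  rowId: string,
  newRowId: string,
  newClientMessageId: string,
  nowMs: number,
): RequeueRow {
  const old = outbox.get(rowId);
  if (old === undefined) throw new Error(`no outbox row ${rowId}`);
  if (old.status !== "dead" && old.status !== "pending") {
    throw new Error(`cannot requeue a row in status ${old.status}`);
  }
  if (outbox.has(newRowId)) throw new Error(`row id ${newRowId} already exists`);
  // Models the UNIQUE constraint: a consumed id can never be minted again.
  const idTaken = Array.from(outbox.values()).some(
    (row) => row.clientMessageId === newClientMessageId,
  );
  if (idTaken) throw new Error(`client_message_id ${newClientMessageId} already consumed`);

  // The old row is kept for audit and linked to its successor.
  const aborted: RequeueRow = {
    ...old,
    status: "aborted",
    abortedAt: nowMs,
    abortedBy: "operator",
    supersededBy: newRowId,
  };
  // The successor carries the same payload under a fresh id.
  const successor: RequeueRow = {
    id: newRowId,
    clientMessageId: newClientMessageId,
    status: "pending",
    payload: old.payload,
  };
  outbox.set(old.id, aborted);
  outbox.set(successor.id, successor);
  return successor;
}
```

Because the old row is never deleted, an attempt to requeue the successor back onto the old client_message_id fails the uniqueness check, matching the "stays buried" rule.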

4.7 Broker atomicity contract — side-effect classification (v9)

4.7.1 Side effects (v9 — rate limit moved to B1 external)

Every successful broker accept atomically commits these durable state changes in one transaction:

Effect	Table	In-tx?	Why
Dedupe record	`mesh.client_message_dedupe`	Yes	Idempotency authority
Message body	`mesh.topic_message` / `mesh.message_queue`	Yes	Authoritative store
History row	`mesh.message_history`	Yes	Replay log; lost-on-rollback would break ordered replay
Fan-out work	`mesh.delivery_queue`	Yes	Each recipient must see exactly the messages that committed
Mention index entries	`mesh.mention_index`	Yes	Reads off mention queries must match committed messages

Outside the transaction — non-authoritative or rebuildable, with explicit rationale per item:

Effect	Where	Why outside
WS push to live subscribers	Async after COMMIT	Live notifications are best-effort; receivers re-fetch from history on reconnect
Webhook fan-out	Async via `delivery_queue` workers	Off-band; consumes committed `delivery_queue` rows
Rate-limit counters (telemetry only)	Async, eventually consistent	Authoritative limiter is the external Redis-style INCR in B1 (§4.6.4); the DB counter is rebuilt for dashboards, not consulted for accept
Audit log entries	Async append-only stream	Audit log can be rebuilt from message history; in-tx writes hurt p99
Search/FTS index updates	Async via outbox-pattern worker	Index can be rebuilt from authoritative tables
Metrics	Prometheus, pull-based	Always non-authoritative

If any in-transaction insert fails, the transaction rolls back completely. The broker returns 5xx to the daemon, and the daemon retries. No partial state.

The async side effects are driven off the in-transaction delivery_queue and message_history rows, so they cannot get ahead of committed state — only lag behind.

4.7.2 Pseudocode — corrected and final (v8)

-- Phase B1 already passed (see §4.6.2). This includes:
--   - schema/auth/size validation
--   - external atomic rate-limit INCR (§4.6.4)
-- Anything that fails B1 returns 4xx without ever opening this tx.

BEGIN;

-- Phase B2 + B3: try to claim the idempotency key.
INSERT INTO mesh.client_message_dedupe
  (mesh_id, client_message_id, broker_message_id, request_fingerprint,
   destination_kind, destination_ref, expires_at)
  VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
          $dest_kind, $dest_ref, $expires_at)
  ON CONFLICT (mesh_id, client_message_id) DO NOTHING;

-- Inspect the row that's actually there now (ours or someone else's).
SELECT broker_message_id, request_fingerprint, destination_kind,
       destination_ref, history_available, first_seen_at
  FROM mesh.client_message_dedupe
  WHERE mesh_id = $mesh_id AND client_message_id = $client_id
  FOR SHARE;

-- Branch:
--   row.broker_message_id == $msg_id  → first insert; continue to step 3.
--   row.broker_message_id != $msg_id  → duplicate. Compare fingerprints:
--     fingerprint match    → ROLLBACK; return 200 duplicate.
--     fingerprint mismatch → ROLLBACK; return 409 idempotency_key_reused.

-- Step 3: validate Phase B2 (destination_ref existence: topic exists,
-- member subscribed, etc.). Rate limit is NOT here — it was checked
-- atomically in B1 via the external limiter (§4.6.4) before this
-- transaction opened.
-- If B2 fails → ROLLBACK; return 4xx (no dedupe row remains).

-- Step 4: insert all in-tx side effects (§4.7.1).
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
  VALUES ($msg_id, $mesh_id, $client_id, ...);

INSERT INTO mesh.message_history (broker_message_id, mesh_id, ...)
  VALUES ($msg_id, $mesh_id, ...);

INSERT INTO mesh.delivery_queue (broker_message_id, recipient_pubkey, ...)
  SELECT $msg_id, member_pubkey, ...
    FROM mesh.topic_subscription
    WHERE topic = $dest_ref AND mesh_id = $mesh_id;

INSERT INTO mesh.mention_index (broker_message_id, mentioned_pubkey, ...)
  SELECT $msg_id, mention_pubkey, ...
    FROM unnest($mention_list);

COMMIT;

-- After COMMIT, async workers consume delivery_queue and update
-- search indexes, audit logs, rate-limit counters, etc.

4.7.3 Orphan check — same as v7 §4.7.3

Extended to cover the full side-effect inventory (§4.7.1), verifying that the in-transaction items are consistent with one another.

4.8 Outbox max-age math — unchanged from v7 §4.8

Minimum dedupe_retention_days = 7. The derived max_age_hours = window - safety_margin, which is strictly less than the window; safety_margin has a floor of 24h.
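
As a worked example of the arithmetic, based only on the one-line summary above (v7 §4.8 itself is not reproduced here): with the minimum retention of 7 days the window is 168 hours, and with the 24-hour floor on the safety margin the derived max age is 144 hours. The function name `deriveMaxAgeHours` is hypothetical.

```typescript
// Sketch: derive the outbox max age from the broker dedupe retention window.
// Function and parameter names are illustrative assumptions.
function deriveMaxAgeHours(dedupeRetentionDays: number, safetyMarginHours: number = 24): number {
  if (dedupeRetentionDays < 7) {
    throw new Error("dedupe_retention_days must be at least 7");
  }
  // The safety margin can be raised but never drops below 24 hours.
  const margin = Math.max(24, safetyMarginHours);
  const windowHours = dedupeRetentionDays * 24;
  // Always strictly below the window because margin >= 24.
  return windowHours - margin;
}

console.log(deriveMaxAgeHours(7)); // 7 * 24 - 24 = 144
```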

4.9 Inbox schema — unchanged from v3 §4.5

4.10 Crash recovery — unchanged from v3 §4.6

4.11 Failure modes — `aborted` semantics added (v8)

IPC accept fingerprint-mismatch on duplicate id (any state): returns 409 with conflict field per §4.5.1. Caller must use a new id.
IPC accept against aborted row, fingerprint match: returns 409 per §4.5.1 (NEW v8). Caller must use a new id; the old id is permanently retired.
Outbox row stuck in dead: operator runs outbox requeue per §4.6.3; old id stays in aborted, new id is fresh.
Broker fingerprint mismatch on retry: as v6/v7. Daemon marks dead; operator requeue path.
Daemon retry after dedupe row hard-deleted by broker retention sweep: cannot happen unless operator overrode max_age_hours.
Broker phase B2 rejection on retry: same id, same fingerprint, but a B2 condition has changed (e.g. the destination topic no longer exists; rate limiting is a B1 check as of v9, see §4.6.4). Daemon receives 4xx → marks dead. Operator can requeue once conditions clear.
Atomicity violation found by orphan check: alerts ops.

5-13. — unchanged from v4

14. Lifecycle — unchanged from v5 §14

15. Version compat — unchanged from v7 §15

16. Threat model — unchanged

17. Migration — v8 outbox columns + broker phase B2 (v8)

Broker side, deploy order: same as v7 §17, with one addition:

Step 4.5: explicitly split broker accept into Phase B1 (pre-dedupe validation, returns 4xx without writing) and Phase B2/B3 (within the accept transaction). Implementation: refactor handler to validate Phase B1 conditions before opening the DB transaction.

Daemon side:

Outbox schema gains aborted_at, aborted_by, superseded_by columns and the aborted enum value (§4.5.2). Where a migration is needed, it is applied by table recreation (INSERT INTO new SELECT * FROM old); v0.9.0 is greenfield, so no existing data has to be migrated.
IPC accept switches to BEGIN IMMEDIATE for SQLite serialization (§4.5.1 step 3).
IPC accept handles aborted rows per §4.5.1 (always 409).
claudemesh daemon outbox requeue always mints a fresh client_message_id; never frees the old id. --new-client-id <id> and --auto are the only modes; the old client_message_id argument is removed.

What changed v8 → v9 (codex round-8 actionable items)

Codex r8 item	v9 fix	Section
Cross-layer ID-consumed authority contradiction	Two-layer model: daemon-consumed iff outbox row; broker-consumed iff dedupe row committed; daemon-mediated callers see only daemon-layer authority	§4.1, §4.6.1, §4.6.2
Rate-limit authority muddled (B2 vs async counters)	Rate limit moved to B1 via external atomic limiter (Redis-style INCR with TTL); DB rate-limit counters demoted to telemetry-only	§4.6.2, §4.6.4, §4.7.1
§4.1 broker guarantee fuzzy	Tightened: "dedupe row exists iff broker accept transaction committed (B3)"	§4.1, §4.6.2

(Earlier rounds' fixes preserved unchanged.)

What needs review (round 9)

Two-layer ID model (§4.1, §4.6.1) — is the daemon-vs-broker authority split clear, or does it create more confusion for operators reading "consumed" in different contexts? Should we use different verbs (e.g. "claimed" at daemon, "committed" at broker)?
Rate-limit external limiter (§4.6.4) — is "atomic external limiter" specified concretely enough? Is the over-counting on limiter-accepted-then-B2-rejected acceptable?
B2 contents after rate-limit move — B2 now only has destination_ref existence. Worth keeping a B2 phase at all, or collapse into B1+B3?
Anything else still wrong? Read it as if you were going to operate this for a year.

Three options:

(a) v9 is shippable: lock the spec, start coding the frozen core.
(b) v10 needed: list the must-fix items.
(c) the architecture itself is wrong: what would you do differently?

Be ruthless.
