claudemesh/.artifacts/shipped/2026-05-03-daemon-final-spec-v6.md
Alejandro Gutiérrez a2568ad9f4
chore(release): cli 1.22.0 — daemon v0.9.0 + housekeeping
- Bump apps/cli/package.json to 1.22.0 (additive feature: claudemesh
  daemon long-lived runtime).
- CHANGELOG entry for 1.22.0 covering subcommands, idempotency wiring,
  crash recovery, and the deferred Sprint 7 broker hardening.
- Roadmap entry for v0.9.0 daemon foundation right above the v2.0.0
  daemon redesign section, so the bridge release is documented as the
  shipped step toward the larger architectural shift.
- Move shipped daemon specs (v1..v10 iteration trail + locked v0.9.0
  spec + broker-hardening followups) from .artifacts/specs/ to
  .artifacts/shipped/ per the project artifact-pipeline convention.

Not in this commit: npm publish and the cli-v1.22.0 GitHub release tag
— both are public-distribution actions and require explicit user
approval.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:24:32 +01:00

claudemesh daemon — Final Spec v6

Round 6. v5 was reviewed by codex (round 5), which found the dedupe table architecture sound but called out four idempotency-correctness issues that would silently corrupt sends in production:

  1. Idempotency key reuse with different payload/destination — v5 silently collapsed a different send onto the original. Need a request fingerprint.
  2. status = 'rejected' underspecified — schema allowed it, semantics didn't. Either fully define or drop.
  3. Outbox max-age math edges: dedupe_retention_days = 1 minus 24h margin = 0 hours, which is undefined.
  4. Broker atomicity not stated — dedupe insert and message insert must be one transaction or you produce orphan dedupe rows.

v6 fixes all four. Intent §0 is unchanged from v2. v6 revises idempotency semantics in §4, the dedupe feature parameters and diagnostic close codes in §15, and migration in §17.


0. Intent — unchanged, see v2 §0


1. Process model — unchanged from v3 §1 / v2 §1


2. Identity — unchanged from v5 §2


3. IPC surface — unchanged from v4 §3


4. Delivery contract — at-least-once with request-fingerprinted dedupe

Codex r5: dedupe must compare the whole request shape, not just (mesh, client_message_id). Otherwise a caller who reuses an idempotency key with a different destination or body silently drops the new send and gets the old send's metadata back.

4.1 The contract (precise — v6)

Local guarantee: each successful POST /v1/send returns a stable client_message_id. The send is durably persisted to outbox.db before the response returns.

Broker guarantee: the broker maintains a dedupe record per accepted (mesh_id, client_message_id) in mesh.client_message_dedupe. Each dedupe record carries a canonical request_fingerprint. Retries with the same client_message_id AND matching fingerprint collapse to the original broker_message_id. Retries with the same client_message_id but a different fingerprint return a deterministic conflict (409 idempotency_key_reused) and do not create a new message.

Atomicity guarantee: dedupe row insertion and message row insertion happen in one broker DB transaction. Either both land, or neither. No orphan dedupe rows. If the broker crashes between dedupe insert and message insert, the rollback unwinds both.

End-to-end guarantee: at-least-once delivery, with client_message_id propagated to receivers' inboxes.

4.2 Daemon-supplied client_message_id — unchanged from v3 §4.2

4.3 Broker schema — request fingerprint added (v6)

CREATE TABLE mesh.client_message_dedupe (
  mesh_id              UUID    NOT NULL REFERENCES mesh.mesh(id) ON DELETE CASCADE,
  client_message_id    TEXT    NOT NULL,

  -- The original accepted message; FK NOT enforced because the message row
  -- may be GC'd by retention sweeps before the dedupe row expires.
  broker_message_id    UUID    NOT NULL,

  -- Canonical fingerprint of the original request. Recomputed on every
  -- duplicate retry; mismatch → 409 idempotency_key_reused. Schema in §4.4.
  request_fingerprint  BYTEA   NOT NULL,                    -- 32-byte sha256

  destination_kind     TEXT    NOT NULL CHECK(destination_kind IN ('topic','dm','queue')),
  destination_ref      TEXT    NOT NULL,
  first_seen_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  expires_at           TIMESTAMPTZ,                          -- NULL = `permanent` mode
  history_available    BOOLEAN NOT NULL DEFAULT TRUE,        -- flipped FALSE when message row GC'd

  PRIMARY KEY (mesh_id, client_message_id)
);

CREATE INDEX client_message_dedupe_expires_idx
  ON mesh.client_message_dedupe(expires_at)
  WHERE expires_at IS NOT NULL;

ALTER TABLE mesh.topic_message ADD COLUMN client_message_id TEXT;
ALTER TABLE mesh.message_queue ADD COLUMN client_message_id TEXT;

status column dropped (codex r5). Rejected requests do not consume idempotency keys. Rationale below in §4.6.

4.4 Request fingerprint — canonical form (NEW v6)

The fingerprint covers everything that makes a send semantically distinct. A retry must reproduce the same fingerprint bit-for-bit; anything else is a different send and must not be collapsed.

request_fingerprint = sha256(
  envelope_version || 0x00 ||
  destination_kind || 0x00 ||
  destination_ref  || 0x00 ||
  reply_to_id_or_empty || 0x00 ||
  priority         || 0x00 ||
  meta_canonical_json || 0x00 ||
  body_hash
)

Where:

  • envelope_version: integer string (e.g. "1"). Bumps when the envelope shape changes.
  • destination_kind: topic, dm, or queue.
  • destination_ref: topic name, recipient ed25519 pubkey hex, or queue id.
  • reply_to_id_or_empty: original broker_message_id or empty string.
  • priority: now, next, or low.
  • meta_canonical_json: the meta field, serialized with sorted keys, no whitespace, escape-canonical (RFC 8785 JCS). Empty meta = empty string.
  • body_hash: sha256(body bytes), hex.
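The canonical form above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the daemon's implementation: the function names `canonical_meta` and `request_fingerprint` are hypothetical, and `json.dumps` with sorted keys and compact separators only approximates RFC 8785 (it is not a conforming JCS implementation; number formatting and key ordering for some Unicode keys can differ).

```python
import hashlib
import json

FIELD_SEPARATOR = b"\x00"


def canonical_meta(meta):
    """Approximate JCS: sorted keys, no whitespace. Empty meta -> empty string."""
    if not meta:
        return ""
    # ASSUMPTION: stand-in for a vetted RFC 8785 library.
    return json.dumps(meta, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def request_fingerprint(envelope_version, destination_kind, destination_ref,
                        reply_to_id, priority, meta, body):
    """Return the 32-byte sha256 over the 0x00-separated fields of section 4.4."""
    body_hash = hashlib.sha256(body).hexdigest()  # hex, as in the field list
    fields = [
        str(envelope_version),      # integer string, e.g. "1"
        destination_kind,           # topic, dm, or queue
        destination_ref,
        reply_to_id or "",          # empty string when there is no reply target
        priority,                   # now, next, or low
        canonical_meta(meta),
        body_hash,
    ]
    joined = FIELD_SEPARATOR.join(f.encode("utf-8") for f in fields)
    return hashlib.sha256(joined).digest()
```

Two calls that differ only in the key order of `meta` produce the same digest, while changing any field (for example the destination) produces a different one, which is the property the 409 path relies on.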

The fingerprint is computed:

  1. Daemon-side before durable outbox persistence — stored as outbox.request_fingerprint (NEW column) so retries always produce the same fingerprint regardless of caller behavior.
  2. Broker-side on first receipt — stored in client_message_dedupe.request_fingerprint.
  3. Broker-side on every duplicate retry — recomputed and compared byte-equal to the stored value.

If the daemon and broker disagree on the canonical form (e.g. JCS implementation drift), the broker emits cm_broker_dedupe_fingerprint_mismatch_total{client_id, mesh_id} and returns 409 idempotency_key_reused with a body that includes the broker's fingerprint hex for debugging. Daemons that see this should log it loudly and stop retrying that outbox row (it goes to dead).

4.5 Duplicate response — three cases (v6)

| Case | HTTP/WS code | Body |
|---|---|---|
| First insert | 201 created | { broker_message_id, client_message_id, history_id, duplicate: false } |
| Duplicate, fingerprint match | 200 ok | { broker_message_id, client_message_id, history_id, duplicate: true, history_available, first_seen_at } |
| Duplicate, fingerprint mismatch | 409 idempotency_key_reused | { client_message_id, conflict: "request_fingerprint_mismatch", broker_fingerprint_prefix: "ab12cd34..." } (first 8 bytes hex) |

Daemon outcomes:

  • 201 → mark outbox row done, store broker_message_id. Normal path.
  • 200 duplicate with history_available: true → mark done, no re-fanout, log at INFO.
  • 200 duplicate with history_available: false → mark done, log at WARN. The original delivery succeeded; receivers got it.
  • 409 idempotency_key_reused → mark outbox row dead, surface in claudemesh daemon outbox --failed. Operator must rotate the idempotency key by hand and resubmit (outbox requeue --new-id <id>, NEW v6 subcommand). Daemon does NOT auto-rotate to avoid masking caller bugs.
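The four daemon outcomes can be read as a small pure function from broker response to outbox transition. The sketch below is illustrative only: `OutboxTransition` and `transition_for` are hypothetical names, the log level for the 409 case is an assumption (the spec only says to log it loudly), and statuses outside the table are left to retry logic that is not shown.

```python
from typing import NamedTuple, Optional


class OutboxTransition(NamedTuple):
    status: str                       # 'done' or 'dead'
    log_level: Optional[str]          # None when the spec names no level
    broker_message_id: Optional[str]


def transition_for(http_status: int, body: dict) -> OutboxTransition:
    """Map a broker response (section 4.5) to the outbox row's next state."""
    if http_status == 201:
        # Normal path: first insert accepted.
        return OutboxTransition("done", None, body["broker_message_id"])
    if http_status == 200 and body.get("duplicate") is True:
        # Duplicate with matching fingerprint: never re-fanout.
        level = "INFO" if body.get("history_available") else "WARN"
        return OutboxTransition("done", level, body["broker_message_id"])
    if http_status == 409:
        # Key reused with a different request: no auto-rotation, operator decides.
        # ASSUMPTION: "ERROR" stands in for "log it loudly".
        return OutboxTransition("dead", "ERROR", None)
    raise ValueError(f"status {http_status} is not covered by the duplicate-response table")
```

Keeping the mapping free of I/O makes it easy to test each row of the table in isolation.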

4.6 Why rejected requests don't consume idempotency keys (v6)

status was in v5's schema but underspecified. Two scenarios:

  • Transient broker error (DB down, queue full, network blip): daemon retries. If we'd persisted a rejected row on the first attempt, the retry would fail forever. Bad.
  • Permanent validation error (payload too large, destination not found, auth missing): broker returns the appropriate 4xx immediately without inserting a dedupe row. Daemon either fixes the request and retries (the key was never consumed, so the corrected request is accepted as a first insert per §4.5) or marks dead. Persisting a "rejected" row buys nothing: the daemon isn't going to send the same broken request again with the same key.

Net result: client_message_dedupe rows only exist when the broker successfully accepted a message and committed it. The single source of truth for "was this idempotency key consumed?" is the existence of the dedupe row. No status enum, no ambiguous states.

4.7 Broker atomicity contract (NEW v6)

Every accept path runs in one DB transaction with the following shape:

BEGIN;
  -- Pre-generate broker_message_id outside the transaction; pass in.
  INSERT INTO mesh.client_message_dedupe
    (mesh_id, client_message_id, broker_message_id, request_fingerprint,
     destination_kind, destination_ref, expires_at)
    VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
            $dest_kind, $dest_ref, $expires_at)
    ON CONFLICT (mesh_id, client_message_id) DO NOTHING
    RETURNING broker_message_id, request_fingerprint, history_available, first_seen_at;

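  -- If RETURNING was empty (conflict), SELECT the original dedupe row.
  --   Stored request_fingerprint = $fingerprint → exit the transaction
  --   with a duplicate response (200, §4.5).
  --   Stored request_fingerprint != $fingerprint → exit with 409
  --   idempotency_key_reused (§4.5 mismatch path).
  -- A row produced by RETURNING is always the one just inserted, so no
  -- fingerprint comparison is needed on that path.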

  -- Otherwise, this is the first insert. Insert the message row.
  INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
    VALUES ($msg_id, $mesh_id, $client_id, ...);

  -- Optional: enqueue fan-out work, etc.
COMMIT;

Failure modes:

  • Crash before COMMIT: both rows roll back. Next daemon retry inserts cleanly.
  • Crash after COMMIT but before WS ACK: dedupe row exists, message row exists. Daemon retries → fingerprint matches → 200 duplicate. Net: exactly one broker-accepted row, one daemon done transition.
  • Constraint violation on message row insert (e.g. unique violation on some other column): rolls back the dedupe insert. Returns 5xx to daemon. Daemon retries; same fingerprint reproduces the same constraint violation; daemon eventually marks dead. No orphan dedupe row.
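The accept path and its rollback behavior can be exercised end to end with a small stand-in. The sketch below uses Python's built-in `sqlite3` with an in-memory database instead of Postgres, trims both tables to the columns needed, and uses hypothetical names (`make_db`, `accept`); it illustrates the transaction shape and is not the broker's code.

```python
import sqlite3
import uuid


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE client_message_dedupe (
            mesh_id TEXT NOT NULL,
            client_message_id TEXT NOT NULL,
            broker_message_id TEXT NOT NULL,
            request_fingerprint BLOB NOT NULL,
            PRIMARY KEY (mesh_id, client_message_id));
        CREATE TABLE topic_message (
            id TEXT PRIMARY KEY,
            mesh_id TEXT NOT NULL,
            client_message_id TEXT,
            body BLOB NOT NULL);
    """)
    return conn


def accept(conn, mesh_id, client_id, fingerprint, body):
    """Return (status, broker_message_id) for one send attempt."""
    msg_id = str(uuid.uuid4())  # pre-generated outside the transaction
    with conn:  # commits on normal exit, rolls back if an exception escapes
        cur = conn.execute(
            "INSERT INTO client_message_dedupe VALUES (?, ?, ?, ?) "
            "ON CONFLICT (mesh_id, client_message_id) DO NOTHING",
            (mesh_id, client_id, msg_id, fingerprint))
        if cur.rowcount == 0:
            # Conflict: compare against the stored fingerprint.
            stored_id, stored_fp = conn.execute(
                "SELECT broker_message_id, request_fingerprint "
                "FROM client_message_dedupe "
                "WHERE mesh_id = ? AND client_message_id = ?",
                (mesh_id, client_id)).fetchone()
            if stored_fp != fingerprint:
                return 409, None
            return 200, stored_id
        # First insert: the message row lands in the same transaction.
        conn.execute("INSERT INTO topic_message VALUES (?, ?, ?, ?)",
                     (msg_id, mesh_id, client_id, body))
    return 201, msg_id
```

If the message insert raises (for instance a NOT NULL violation on `body`), the exception leaves the `with conn:` block and the dedupe insert is rolled back with it, so no orphan dedupe row remains.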

A nightly orphan-check job, tracked by the counter cm_broker_dedupe_orphan_check_total, validates that every client_message_dedupe row either has a matching topic_message or message_queue row, or refers to a message row that has been retention-pruned (in which case history_available = FALSE was set). Any row failing both conditions is logged as cm_broker_dedupe_orphan_found{mesh_id} for human review. The orphan count should be zero in steady state.
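One way to express the orphan condition is a single anti-join query. The sketch below runs it against an in-memory SQLite stand-in; the trimmed schema, the assumption that both message tables are keyed by an `id` equal to `broker_message_id`, and the names `ORPHAN_QUERY` and `find_orphans` are all illustrative rather than taken from the broker.

```python
import sqlite3

# ASSUMPTION: trimmed tables; message rows are keyed by broker_message_id.
SCHEMA = """
CREATE TABLE client_message_dedupe (
    mesh_id TEXT NOT NULL,
    client_message_id TEXT NOT NULL,
    broker_message_id TEXT NOT NULL,
    history_available INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (mesh_id, client_message_id));
CREATE TABLE topic_message (id TEXT PRIMARY KEY);
CREATE TABLE message_queue (id TEXT PRIMARY KEY);
"""

# A dedupe row is an orphan when no message row exists AND it was never
# marked as retention-pruned (history_available is still true).
ORPHAN_QUERY = """
SELECT d.mesh_id, d.client_message_id
FROM client_message_dedupe AS d
WHERE d.history_available = 1
  AND NOT EXISTS (SELECT 1 FROM topic_message AS t WHERE t.id = d.broker_message_id)
  AND NOT EXISTS (SELECT 1 FROM message_queue AS q WHERE q.id = d.broker_message_id)
ORDER BY d.mesh_id, d.client_message_id
"""


def find_orphans(conn):
    """Return (mesh_id, client_message_id) pairs that fail both conditions."""
    return conn.execute(ORPHAN_QUERY).fetchall()
```

A row whose message was pruned and whose `history_available` was flipped to false is deliberately not reported, matching the second condition of the check.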

4.8 Outbox schema — fingerprint stored alongside (v6)

CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,                          -- 32 bytes
  payload             BLOB NOT NULL,
  enqueued_at         INTEGER NOT NULL,
  attempts            INTEGER DEFAULT 0,
  next_attempt_at     INTEGER NOT NULL,
  status              TEXT CHECK(status IN ('pending','inflight','done','dead')),
  last_error          TEXT,
  delivered_at        INTEGER,
  broker_message_id   TEXT
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);

request_fingerprint is computed at IPC accept time and stored. Every retry sends the same bytes. The daemon never recomputes from payload post-enqueue (would produce drift if envelope_version changes between daemon runs).
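The "compute once, replay stored bytes" rule can be sketched against the outbox schema above using Python's built-in `sqlite3`. The helper names (`open_outbox`, `enqueue`, `load_for_retry`) are hypothetical, and `now` is passed in as an integer because the spec does not state the timestamp unit.

```python
import sqlite3
import uuid

OUTBOX_SCHEMA = """
CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  payload             BLOB NOT NULL,
  enqueued_at         INTEGER NOT NULL,
  attempts            INTEGER DEFAULT 0,
  next_attempt_at     INTEGER NOT NULL,
  status              TEXT CHECK(status IN ('pending','inflight','done','dead')),
  last_error          TEXT,
  delivered_at        INTEGER,
  broker_message_id   TEXT
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
"""


def open_outbox(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(OUTBOX_SCHEMA)
    return conn


def enqueue(conn, client_message_id, fingerprint, payload, now):
    """Persist the send with its fingerprint, computed once by the caller."""
    row_id = uuid.uuid4().hex
    with conn:  # durable before the IPC response would be returned
        conn.execute(
            "INSERT INTO outbox (id, client_message_id, request_fingerprint, "
            "payload, enqueued_at, next_attempt_at, status) "
            "VALUES (?, ?, ?, ?, ?, ?, 'pending')",
            (row_id, client_message_id, fingerprint, payload, now, now))
    return row_id


def load_for_retry(conn, row_id):
    """Mark the row inflight and return the stored bytes; nothing is recomputed."""
    with conn:
        conn.execute(
            "UPDATE outbox SET attempts = attempts + 1, status = 'inflight' "
            "WHERE id = ?", (row_id,))
    return conn.execute(
        "SELECT client_message_id, request_fingerprint, payload, attempts "
        "FROM outbox WHERE id = ?", (row_id,)).fetchone()
```

Because every attempt reads `request_fingerprint` back from the row, a change of `envelope_version` between daemon runs cannot alter what a retry sends.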

4.9 Outbox max-age math — bounded (v6)

Codex r5: the v5 formula (dedupe_retention_days * 24) - 24h_margin yields zero at dedupe_retention_days = 1 and a negative value below that, so the max age is undefined for dedupe_retention_days <= 1.

v6 formula and bounds:

  • Minimum supported broker dedupe retention: 3 days. Daemon refuses to start if broker advertises dedupe_retention_days < 3 (treats it as feature_param_below_floor, close code 4012 per §15.5).

  • Daemon max_age_hours derivation:

    • permanent mode → daemon uses config default (168h = 7d), cap 720h (30d).
    • retention_scoped mode → daemon max_age_hours = max(72, (dedupe_retention_days * 24) - safety_margin_hours) where safety_margin_hours = max(24, ceil(dedupe_retention_days * 0.1 * 24)). For dedupe_retention_days=3 this gives max(72, 72-24) = 72h. For 30 days: max(72, 720-72) = 648h. For 365 days: max(72, 8760-876) = 7884h.
    • The 72h floor prevents the daemon outbox from being uselessly short — three days is enough margin for normal operator response to a paged outage.
  • Operator override allowed via [outbox] max_age_hours_override = N, but if N exceeds dedupe_retention_days * 24 - 1 daemon refuses to start with outbox_max_age_above_dedupe_window. The override exists for the rare case of a much-shorter-than-default outbox; it does not exist to circumvent the broker's dedupe window.
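The derivation above is easy to check with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, errors are modeled as `ValueError` carrying the reason string, and override handling in permanent mode is left out because §4.9 does not specify it.

```python
from typing import Optional

MIN_RETENTION_DAYS = 3
FLOOR_HOURS = 72
PERMANENT_DEFAULT_HOURS = 168   # 7d
PERMANENT_CAP_HOURS = 720       # 30d


def safety_margin_hours(retention_days: int) -> int:
    # ceil(retention_days * 0.1 * 24), done in integers to avoid float rounding.
    ten_percent = (retention_days * 24 + 9) // 10
    return max(24, ten_percent)


def max_age_hours(mode: str,
                  retention_days: Optional[int] = None,
                  configured_hours: int = PERMANENT_DEFAULT_HOURS,
                  override_hours: Optional[int] = None) -> int:
    if mode == "permanent":
        return min(configured_hours, PERMANENT_CAP_HOURS)
    if mode != "retention_scoped" or retention_days is None:
        raise ValueError("feature_param_invalid")
    if retention_days < MIN_RETENTION_DAYS:
        raise ValueError("feature_param_below_floor")
    window_hours = retention_days * 24
    if override_hours is not None:
        # The override may shorten the outbox, never outlive the dedupe window.
        if override_hours > window_hours - 1:
            raise ValueError("outbox_max_age_above_dedupe_window")
        return override_hours
    return max(FLOOR_HOURS, window_hours - safety_margin_hours(retention_days))
```

For example, `max_age_hours("retention_scoped", 30)` evaluates to 648, matching the worked figure for a 30-day retention window.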

4.10 Inbox schema — unchanged from v3 §4.5

4.11 Crash recovery — unchanged from v3 §4.6

4.12 Failure modes — corrected for fingerprint model (v6)

  • Fingerprint mismatch on retry (409 idempotency_key_reused): outbox row marked dead. Surfaced in --failed view. Operator command outbox requeue --new-id <id> rotates client_message_id and retries.
  • Daemon retry after dedupe row hard-deleted by retention sweep: in retention_scoped mode, daemon max_age_hours is bounded inside the retention window (§4.9), so this can only happen via operator override. In that case the retry creates a NEW dedupe row + new message — the caller chose this risk explicitly. Counter cm_daemon_retry_after_dedupe_expired_total.
  • Daemon retry after dedupe row hard-deleted in permanent mode: cannot happen by definition — permanent means no expires_at. Only mesh deletion removes dedupe rows.
  • Duplicate row, history pruned: as v5 §4.4. Mark done, log cm_daemon_dedupe_history_pruned_total.

5. Inbound — unchanged from v3 §5


6. Hooks — unchanged from v4 §6


7-13. Multi-mesh, auto-routing, service install, observability, SDKs, security model, configuration — unchanged from v4


14. Lifecycle — unchanged from v5 §14


15. Version compat — feature param updated for new dedupe semantics

15.1 Feature bits with parameters (v6 update)

| Bit | params.version | Required parameters | Optional parameters |
|---|---|---|---|
| client_message_id_dedupe | 2 | mode: "retention_scoped"\|"permanent", dedupe_retention_days: int (>= 3) (when mode=retention_scoped), request_fingerprint: bool == true | tombstone_history_pruned_window_days: int |
| concurrent_connection_policy | 1 | (no parameters) | default_policy: "prefer_newest"\|"prefer_oldest"\|"allow_concurrent" |
| member_keypair_rotated_event | 1 | (no parameters) | |
| key_epoch | 1 | max_concurrent_epochs: int (>= 1) | |
| max_payload | 1 | inline_bytes: int (>= 1024), blob_bytes: int (>= 1024) | |

client_message_id_dedupe bumped to params.version = 2 because it now requires request_fingerprint = true. A broker still on version 1 (no fingerprint comparison) is treated as "feature missing" and the daemon refuses to start. That's intentional — v0.9.0 daemons require fingerprint enforcement for safe idempotency.

dedupe_retention_days minimum raised to 3 (matches the §4.9 floor).

15.2 Negotiation handshake — unchanged shape from v5 §15.2

15.3 IPC negotiation — unchanged from v3 §15.3

15.4 Compatibility matrix — unchanged from v3 §15.4

15.5 Diagnostic close codes (NEW v6 — codex r5)

WebSocket close codes are split for diagnostic clarity:

| Code | Reason | When |
|---|---|---|
| 4010 | feature_unavailable | Required feature missing from broker's supported |
| 4011 | feature_param_invalid | Required feature present but parameters fail validation (missing required, out of bounds, unknown version) |
| 4012 | feature_param_below_floor | Required feature parameter below daemon's hard floor (e.g. dedupe_retention_days < 3) |

Daemon logs the full negotiation payload at WARN before exiting; supervisor + alerting catches the restart loop.
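The split between the three close codes can be illustrated with a small validator for the client_message_id_dedupe feature. The payload shape (a dict of feature name to `{"params": {...}}`) and the function name `close_code_for` are assumptions made for this sketch; it also follows one reading of §15.1, in which a broker still on params.version 1 counts as "feature missing" (4010) while any other unexpected version is an invalid parameter (4011).

```python
from typing import Optional

REQUIRED_VERSION = 2
RETENTION_FLOOR_DAYS = 3


def close_code_for(supported: dict) -> Optional[int]:
    """Return a diagnostic close code, or None if negotiation may proceed."""
    feature = supported.get("client_message_id_dedupe")
    if feature is None:
        return 4010  # feature_unavailable
    params = feature.get("params", {})
    version = params.get("version")
    if version == 1:
        return 4010  # pre-fingerprint broker, treated as feature missing
    if version != REQUIRED_VERSION:
        return 4011  # feature_param_invalid: unknown version
    if params.get("request_fingerprint") is not True:
        return 4011
    mode = params.get("mode")
    if mode not in ("retention_scoped", "permanent"):
        return 4011
    if mode == "retention_scoped":
        days = params.get("dedupe_retention_days")
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(days, int) or isinstance(days, bool):
            return 4011  # missing or wrongly typed required parameter
        if days < RETENTION_FLOOR_DAYS:
            return 4012  # feature_param_below_floor
    return None
```

Separating "structurally invalid" from "valid but below the floor" is what lets an operator tell a misconfigured broker from one that is merely configured with too short a retention window.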

16. Threat model — unchanged from v4 §16


17. Migration — broker dedupe table + atomicity (v6)

Broker side, deploy order:

  1. CREATE TABLE mesh.client_message_dedupe with v6 schema (additive, online-safe).
  2. ALTER TABLE mesh.topic_message ADD COLUMN client_message_id.
  3. ALTER TABLE mesh.message_queue ADD COLUMN client_message_id.
  4. Broker code refactor: every accept path wraps dedupe insert + message insert in one transaction (§4.7). Pre-generated broker_message_id (ulid in code) passed in.
  5. Broker code: nightly job to delete dedupe rows where expires_at < NOW() (skip in permanent mode).
  6. Broker code: hook into the message-retention sweep. When a topic_message or message_queue row is hard-deleted, find the matching dedupe row by (mesh_id, client_message_id) and set history_available = FALSE. (Note: client_message_id is nullable on those tables for legacy traffic; rows where it is NULL have no dedupe row to update.)
  7. Broker code: nightly orphan-check job (§4.7); alerts on non-zero.
  8. Broker advertises client_message_id_dedupe feature with params.version = 2 and request_fingerprint: true.
  9. Daemon refuses to start unless that feature bit is advertised with valid v2 params.

Rollback plan: feature flag disables fingerprint enforcement broker-side (falls back to existing pre-v6 behavior — no dedupe). Daemons that require fingerprint refuse to start. Operator switches off the feature flag, reverts the daemon, restarts. No data loss; pending dedupe rows remain in place for the next forward roll.


What changed v5 → v6 (codex round-5 actionable items)

| Codex r5 item | v6 fix | Section |
|---|---|---|
| Idempotency key reuse with different payload silently collapses | request_fingerprint BYTEA in dedupe table; canonical form per §4.4; 409 on mismatch | §4.3, §4.4, §4.5 |
| status='rejected' underspecified | Dropped status column; rejected requests don't consume keys; existence of dedupe row = "key consumed" | §4.3, §4.6 |
| Outbox max-age math edges at low retention | 72h floor; min dedupe_retention_days = 3; percentage-based safety margin; explicit override gating | §4.9, §15.1 |
| Broker atomicity not stated | One transaction per accept path; orphan-check job; rollback semantics | §4.7 |
| Diagnostic detail on feature param failures | New close codes 4011 / 4012 separate from 4010 | §15.5 |
| Outbox stores fingerprint | NEW column outbox.request_fingerprint BLOB; computed once at IPC accept | §4.8 |
| Operator command for fingerprint-mismatch recovery | NEW outbox requeue --new-id <id> to rotate idempotency key | §4.5 |

What needs review (round 6)

  1. Request fingerprint canonical form (§4.4) — does JCS work cross-language for meta_canonical_json (Python json.dumps, Go encoding/json, JS JSON.stringify all behave differently)? Should we ship a vetted JCS lib in each SDK or fall back to a simpler "sorted keys + no spaces + escape-as-stored" rule with conformance tests?
  2. Atomicity contract (§4.7) — is the orphan-check sufficient, or does a violation mean we need a "broker rebuild dedupe from messages" recovery tool? The latter is destructive but useful for ops emergencies.
  3. Max-age formula (§4.9) — is the 72h floor correct? Is the percentage-based safety margin (max(24, ceil(0.1 * dedupe_window))) the right shape? Or simpler to say "always 24h"?
  4. 409 idempotency_key_reused recovery flow (§4.5) — is sending the row to dead and surfacing it via outbox --failed enough? Should the daemon emit a high-priority event for the SSE stream so operators are paged immediately?
  5. Diagnostic close codes (§15.5) — is splitting 4010/4011/4012 useful, or does it just push complexity onto operators? Should we collapse to 4010 with structured close-reason JSON instead?
  6. Anything else still wrong? Read it as if you were going to operate this for a year. What falls down?

Three options:

  • (a) v6 is shippable: lock the spec, start coding the frozen core.
  • (b) v7 needed: list the must-fix items.
  • (c) the architecture itself is wrong: what would you do differently?

Be ruthless.