claudemesh daemon — Final Spec v6
Round 6. v5 was reviewed by codex (round 5) which found the dedupe table architecture sound but called out four idempotency-correctness issues that would silently corrupt sends in production:
- Idempotency key reuse with different payload/destination: v5 silently collapsed a different send onto the original. Need a request fingerprint.
- `status = 'rejected'` underspecified: schema allowed it, semantics didn't. Either fully define or drop.
- Outbox max-age math edges: `dedupe_retention_days = 1` minus 24h margin = 0 hours, which is undefined.
- Broker atomicity not stated: dedupe insert and message insert must be one transaction or you produce orphan dedupe rows.
v6 fixes all four. Intent §0 unchanged from v2. v6 only revises idempotency semantics in §4 and migration in §17.
0. Intent — unchanged, see v2 §0
1. Process model — unchanged from v3 §1 / v2 §1
2. Identity — unchanged from v5 §2
3. IPC surface — unchanged from v4 §3
4. Delivery contract — at-least-once with request-fingerprinted dedupe
Codex r5: dedupe must compare the whole request shape, not just
(mesh, client_message_id). Otherwise a caller who reuses an idempotency
key with a different destination or body silently drops the new send and
gets the old send's metadata back.
4.1 The contract (precise — v6)
- Local guarantee: each successful `POST /v1/send` returns a stable `client_message_id`. The send is durably persisted to `outbox.db` before the response returns.
- Broker guarantee: the broker maintains a dedupe record per accepted `(mesh_id, client_message_id)` in `mesh.client_message_dedupe`. Each dedupe record carries a canonical `request_fingerprint`. Retries with the same `client_message_id` AND matching fingerprint collapse to the original `broker_message_id`. Retries with the same `client_message_id` but a different fingerprint return a deterministic conflict (`409 idempotency_key_reused`) and do not create a new message.
- Atomicity guarantee: dedupe row insertion and message row insertion happen in one broker DB transaction. Either both land, or neither. No orphan dedupe rows. If the broker crashes between dedupe insert and message insert, the rollback unwinds both.
- End-to-end guarantee: at-least-once delivery, with `client_message_id` propagated to receivers' inboxes.
4.2 Daemon-supplied client_message_id — unchanged from v3 §4.2
4.3 Broker schema — request fingerprint added (v6)
CREATE TABLE mesh.client_message_dedupe (
mesh_id UUID NOT NULL REFERENCES mesh.mesh(id) ON DELETE CASCADE,
client_message_id TEXT NOT NULL,
-- The original accepted message; FK NOT enforced because the message row
-- may be GC'd by retention sweeps before the dedupe row expires.
broker_message_id UUID NOT NULL,
-- Canonical fingerprint of the original request. Recomputed on every
-- duplicate retry; mismatch → 409 idempotency_key_reused. Schema in §4.4.
request_fingerprint BYTEA NOT NULL, -- 32-byte sha256
destination_kind TEXT NOT NULL CHECK(destination_kind IN ('topic','dm','queue')),
destination_ref TEXT NOT NULL,
first_seen_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
expires_at TIMESTAMPTZ, -- NULL = `permanent` mode
history_available BOOLEAN NOT NULL DEFAULT TRUE, -- flipped FALSE when message row GC'd
PRIMARY KEY (mesh_id, client_message_id)
);
CREATE INDEX client_message_dedupe_expires_idx
ON mesh.client_message_dedupe(expires_at)
WHERE expires_at IS NOT NULL;
ALTER TABLE mesh.topic_message ADD COLUMN client_message_id TEXT;
ALTER TABLE mesh.message_queue ADD COLUMN client_message_id TEXT;
status column dropped (codex r5). Rejected requests do not
consume idempotency keys. Rationale below in §4.6.
4.4 Request fingerprint — canonical form (NEW v6)
The fingerprint covers everything that makes a send semantically distinct. A retry must reproduce the same fingerprint bit-for-bit; anything else is a different send and must not be collapsed.
request_fingerprint = sha256(
envelope_version || 0x00 ||
destination_kind || 0x00 ||
destination_ref || 0x00 ||
reply_to_id_or_empty || 0x00 ||
priority || 0x00 ||
meta_canonical_json || 0x00 ||
body_hash
)
Where:
- `envelope_version`: integer string (e.g. `"1"`). Bumps when the envelope shape changes.
- `destination_kind`: `topic`, `dm`, or `queue`.
- `destination_ref`: topic name, recipient ed25519 pubkey hex, or queue id.
- `reply_to_id_or_empty`: original `broker_message_id` or empty string.
- `priority`: `now`, `next`, or `low`.
- `meta_canonical_json`: the `meta` field, serialized with sorted keys, no whitespace, escape-canonical (RFC 8785 JCS). Empty meta = empty string.
- `body_hash`: sha256(body bytes), hex.
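A minimal Python sketch of the fingerprint construction follows, under stated assumptions: every field is encoded as UTF-8 before joining with `0x00`, and `json.dumps` with sorted keys and compact separators stands in for a real RFC 8785 (JCS) serializer. That stand-in is not JCS-conformant (number formatting and key ordering can differ), so treat it as an illustration of the shape only. The function names are hypothetical.

```python
import hashlib
import json

SEP = b"\x00"


def canonical_meta(meta):
    """Approximate canonical JSON: sorted keys, no whitespace.

    ASSUMPTION: stand-in for an RFC 8785 (JCS) serializer; not conformant.
    Empty or missing meta maps to the empty string, as the spec states.
    """
    if not meta:
        return ""
    return json.dumps(meta, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def request_fingerprint(envelope_version, destination_kind, destination_ref,
                        reply_to_id, priority, meta, body):
    """Return the 32-byte sha256 over the 0x00-separated canonical fields."""
    body_hash = hashlib.sha256(body).hexdigest()  # hex, per the field list
    fields = [
        str(envelope_version),
        destination_kind,
        destination_ref,
        reply_to_id or "",  # reply_to_id_or_empty
        priority,
        canonical_meta(meta),
        body_hash,
    ]
    # ASSUMPTION: each field is UTF-8 encoded before concatenation.
    return hashlib.sha256(SEP.join(f.encode("utf-8") for f in fields)).digest()
```

Two calls whose `meta` dicts differ only in key order produce the same digest, while changing `destination_ref` or `body` produces a different one, which is the property the dedupe comparison relies on.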
The fingerprint is computed:
- Daemon-side, before durable outbox persistence: stored as `outbox.request_fingerprint` (NEW column) so retries always produce the same fingerprint regardless of caller behavior.
- Broker-side, on first receipt: stored in `client_message_dedupe.request_fingerprint`.
- Broker-side, on every duplicate retry: recomputed and compared byte-equal to the stored value.
If the daemon and broker disagree on the canonical form (e.g. JCS
implementation drift), the broker emits
`cm_broker_dedupe_fingerprint_mismatch_total{client_id, mesh_id}` and
returns `409 idempotency_key_reused` with a body that includes the first
8 bytes of the broker's fingerprint as hex (`broker_fingerprint_prefix`,
§4.5) for debugging. Daemons that see this should log it loudly and stop
retrying that outbox row (it goes to `dead`).
4.5 Duplicate response — three cases (v6)
| Case | HTTP/WS code | Body |
|---|---|---|
| First insert | `201 created` | `{ broker_message_id, client_message_id, history_id, duplicate: false }` |
| Duplicate, fingerprint match | `200 ok` | `{ broker_message_id, client_message_id, history_id, duplicate: true, history_available, first_seen_at }` |
| Duplicate, fingerprint mismatch | `409 idempotency_key_reused` | `{ client_message_id, conflict: "request_fingerprint_mismatch", broker_fingerprint_prefix: "ab12cd34..." }` (first 8 bytes hex) |
Daemon outcomes:
- `201` → mark outbox row `done`, store `broker_message_id`. Normal path.
- `200 duplicate` with `history_available: true` → mark `done`, no re-fanout, log at INFO.
- `200 duplicate` with `history_available: false` → mark `done`, log at WARN. The original delivery succeeded; receivers got it.
- `409 idempotency_key_reused` → mark outbox row `dead`, surface in `claudemesh daemon outbox --failed`. Operator must rotate the idempotency key by hand and resubmit (`outbox requeue --new-id <id>`, NEW v6 subcommand). Daemon does NOT auto-rotate to avoid masking caller bugs.
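The daemon outcomes above can be sketched as a small mapping from broker response to outbox transition. This is an illustrative Python sketch, not the daemon's code: the function name, the `(status, log_level)` return shape, the `ERROR` level for `409`, and the "stay `pending`" fallback for any other response are assumptions.

```python
def outbox_transition(status_code, body):
    """Map a broker response to (new_outbox_status, log_level).

    log_level is None where the spec names no level.
    """
    if status_code == 201:
        # First insert: normal path, store broker_message_id alongside.
        return ("done", None)
    if status_code == 200 and body.get("duplicate"):
        # Duplicate with matching fingerprint: never re-fan-out.
        level = "INFO" if body.get("history_available") else "WARN"
        return ("done", level)
    if status_code == 409:
        # Fingerprint mismatch: no auto-rotation of the key.
        # ASSUMPTION: "log it loudly" is rendered as ERROR here.
        return ("dead", "ERROR")
    # ASSUMPTION: anything else is treated as retryable and stays pending.
    return ("pending", "WARN")
```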
4.6 Why rejected requests don't consume idempotency keys (v6)
status was in v5's schema but underspecified. Two scenarios:
- Transient broker error (DB down, queue full, network blip): daemon retries. If we'd persisted a `rejected` row on the first attempt, the retry would fail forever. Bad.
- Permanent validation error (payload too large, destination not found, auth missing): broker returns the appropriate `4xx` immediately without inserting a dedupe row. Daemon either fixes the request and retries (no dedupe row exists for the key, so the corrected request is accepted as a first insert per §4.5) or marks dead. Persisting a "rejected" row buys nothing: the daemon isn't going to send the same broken request again with the same key.
Net result: client_message_dedupe rows only exist when the broker
successfully accepted a message and committed it. The single source
of truth for "was this idempotency key consumed?" is the existence of
the dedupe row. No status enum, no ambiguous states.
4.7 Broker atomicity contract (NEW v6)
Every accept path runs in one DB transaction with the following shape:
BEGIN;
-- Pre-generate broker_message_id outside the transaction; pass in.
INSERT INTO mesh.client_message_dedupe
(mesh_id, client_message_id, broker_message_id, request_fingerprint,
destination_kind, destination_ref, expires_at)
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
$dest_kind, $dest_ref, $expires_at)
ON CONFLICT (mesh_id, client_message_id) DO NOTHING
RETURNING broker_message_id, request_fingerprint, history_available, first_seen_at;
-- If RETURNING was empty (conflict), SELECT the original row and exit the
-- transaction: stored request_fingerprint = $fingerprint → duplicate
-- response; stored request_fingerprint != $fingerprint → the §4.5
-- mismatch path, 409.
-- If RETURNING produced a row, this is the first insert. Insert the message row.
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
VALUES ($msg_id, $mesh_id, $client_id, ...);
-- Optional: enqueue fan-out work, etc.
COMMIT;
Failure modes:
- Crash before `COMMIT`: both rows roll back. Next daemon retry inserts cleanly.
- Crash after `COMMIT` but before WS ACK: dedupe row exists, message row exists. Daemon retries → fingerprint matches → `200 duplicate`. Net: exactly one broker-accepted row, one daemon `done` transition.
- Constraint violation on message row insert (e.g. unique violation on some other column): rolls back the dedupe insert. Returns `5xx` to daemon. Daemon retries; same fingerprint reproduces the same constraint violation; daemon eventually marks `dead`. No orphan dedupe row.
A nightly orphan-check job (counted by `cm_broker_dedupe_orphan_check_total`)
validates that every `client_message_dedupe` row has a matching
`topic_message` or `message_queue` row OR the matching message row has been
retention-pruned (in which case `history_available = FALSE` was set). Any
row failing both conditions is logged as
`cm_broker_dedupe_orphan_found{mesh_id}` for human review. Should be zero
in steady state.
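To make the accept-path shape concrete, here is a hedged Python sketch that uses an in-memory SQLite database as a stand-in for the broker's Postgres. The table layouts are trimmed to the columns that matter, `accept` is a hypothetical name, and a `NOT NULL` violation on `body` plays the role of "constraint violation on message row insert". It detects the conflict through `rowcount` instead of `RETURNING`, which is a substitution for portability rather than the spec's exact statement.

```python
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE client_message_dedupe (
    mesh_id TEXT NOT NULL,
    client_message_id TEXT NOT NULL,
    broker_message_id TEXT NOT NULL,
    request_fingerprint BLOB NOT NULL,
    PRIMARY KEY (mesh_id, client_message_id)
);
CREATE TABLE topic_message (
    id TEXT PRIMARY KEY,
    mesh_id TEXT NOT NULL,
    client_message_id TEXT,
    body BLOB NOT NULL
);
"""


def accept(conn, mesh_id, client_message_id, fingerprint, body):
    """Return (http_code, broker_message_id) for one send attempt."""
    msg_id = str(uuid.uuid4())  # pre-generated outside the transaction
    try:
        with conn:  # one transaction: commit on success, rollback on error
            cur = conn.execute(
                "INSERT INTO client_message_dedupe "
                "(mesh_id, client_message_id, broker_message_id, request_fingerprint) "
                "VALUES (?, ?, ?, ?) "
                "ON CONFLICT (mesh_id, client_message_id) DO NOTHING",
                (mesh_id, client_message_id, msg_id, fingerprint),
            )
            if cur.rowcount == 0:
                # Key already consumed: compare against the stored fingerprint.
                row = conn.execute(
                    "SELECT broker_message_id, request_fingerprint "
                    "FROM client_message_dedupe "
                    "WHERE mesh_id = ? AND client_message_id = ?",
                    (mesh_id, client_message_id),
                ).fetchone()
                if bytes(row[1]) != fingerprint:
                    return (409, None)  # idempotency_key_reused
                return (200, row[0])  # duplicate, original id
            # First insert: the message row lands in the same transaction.
            conn.execute(
                "INSERT INTO topic_message (id, mesh_id, client_message_id, body) "
                "VALUES (?, ?, ?, ?)",
                (msg_id, mesh_id, client_message_id, body),
            )
            return (201, msg_id)
    except sqlite3.IntegrityError:
        # Message insert failed: the dedupe insert was rolled back with it.
        return (500, None)
```

Calling `accept` with `body=None` returns `500` and leaves `client_message_dedupe` empty, which is the "no orphan dedupe row" property; a later call with a valid body and the same key is then treated as a first insert.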
4.8 Outbox schema — fingerprint stored alongside (v6)
CREATE TABLE outbox (
id TEXT PRIMARY KEY,
client_message_id TEXT NOT NULL UNIQUE,
request_fingerprint BLOB NOT NULL, -- 32 bytes
payload BLOB NOT NULL,
enqueued_at INTEGER NOT NULL,
attempts INTEGER DEFAULT 0,
next_attempt_at INTEGER NOT NULL,
status TEXT CHECK(status IN ('pending','inflight','done','dead')),
last_error TEXT,
delivered_at INTEGER,
broker_message_id TEXT
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
request_fingerprint is computed at IPC accept time and stored. Every
retry sends the same bytes. The daemon never recomputes from payload
post-enqueue (would produce drift if envelope_version changes between
daemon runs).
4.9 Outbox max-age math — bounded (v6)
Codex r5: the v5 formula `(dedupe_retention_days * 24) - 24h_margin`
breaks at `dedupe_retention_days = 1` (yields zero hours) and goes
negative below that.
v6 formula and bounds:
- Minimum supported broker dedupe retention: 3 days. Daemon refuses to start if broker advertises `dedupe_retention_days < 3` (treats it as `feature_param_below_floor`, close code 4012 per §15.5).
- Daemon `max_age_hours` derivation:
  - `permanent` mode → daemon uses config default (168h = 7d), cap 720h (30d).
  - `retention_scoped` mode → daemon `max_age_hours = max(72, (dedupe_retention_days * 24) - safety_margin_hours)` where `safety_margin_hours = max(24, ceil(dedupe_retention_days * 0.1 * 24))`. For `dedupe_retention_days=3` this gives `max(72, 72-24) = 72h`. For 30 days: `max(72, 720-72) = 648h`. For 365 days: `max(72, 8760-876) = 7884h`.
  - The 72h floor prevents the daemon outbox from being uselessly short: three days is enough margin for normal operator response to a paged outage.
- Operator override allowed via `[outbox] max_age_hours_override = N`, but if `N` exceeds `dedupe_retention_days * 24 - 1` daemon refuses to start with `outbox_max_age_above_dedupe_window`. The override exists for the rare case of a much-shorter-than-default outbox; it does not exist to circumvent the broker's dedupe window.
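The derivation above can be checked with a short Python sketch. The function and exception names are hypothetical. The 10% margin is computed with integer arithmetic instead of `ceil(days * 0.1 * 24)` to sidestep floating-point rounding. Override handling in `permanent` mode is not specified in this section, so the sketch ignores the override there.

```python
DEFAULT_MAX_AGE_HOURS = 168   # 7d config default
PERMANENT_CAP_HOURS = 720     # 30d cap in permanent mode
FLOOR_HOURS = 72
MIN_RETENTION_DAYS = 3


class StartupRefused(Exception):
    """Hypothetical: raised where the daemon would refuse to start."""


def max_age_hours(mode, dedupe_retention_days=None,
                  configured_hours=DEFAULT_MAX_AGE_HOURS, override_hours=None):
    if mode == "permanent":
        # ASSUMPTION: override is ignored here; the spec does not cover it.
        return min(configured_hours, PERMANENT_CAP_HOURS)
    if mode != "retention_scoped":
        raise ValueError(f"unknown mode: {mode!r}")
    if dedupe_retention_days < MIN_RETENTION_DAYS:
        raise StartupRefused("dedupe_retention_days below the 3-day minimum")
    window = dedupe_retention_days * 24
    if override_hours is not None:
        if override_hours > window - 1:
            raise StartupRefused("outbox_max_age_above_dedupe_window")
        return override_hours
    # ceil(window / 10) in integer arithmetic, then the 24h minimum margin.
    margin = max(24, -(-window // 10))
    return max(FLOOR_HOURS, window - margin)


if __name__ == "__main__":
    for days in (3, 30, 365):
        print(days, max_age_hours("retention_scoped", days))
```

Run as a script, this prints 72, 648, and 7884 hours for 3, 30, and 365 days, matching the worked figures in the text.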
4.10 Inbox schema — unchanged from v3 §4.5
4.11 Crash recovery — unchanged from v3 §4.6
4.12 Failure modes — corrected for fingerprint model (v6)
- Fingerprint mismatch on retry (`409 idempotency_key_reused`): outbox row marked `dead`. Surfaced in `--failed` view. Operator command `outbox requeue --new-id <id>` rotates `client_message_id` and retries.
- Daemon retry after dedupe row hard-deleted by retention sweep: in `retention_scoped` mode, daemon `max_age_hours` is bounded inside the retention window (§4.9), so this can only happen via operator override. In that case the retry creates a NEW dedupe row + new message; the caller chose this risk explicitly. Counter `cm_daemon_retry_after_dedupe_expired_total`.
- Daemon retry after dedupe row hard-deleted in `permanent` mode: cannot happen by definition, since `permanent` means no `expires_at`. Only mesh deletion removes dedupe rows.
- Duplicate row, history pruned: as v5 §4.4. Mark `done`, log `cm_daemon_dedupe_history_pruned_total`.
5. Inbound — unchanged from v3 §5
6. Hooks — unchanged from v4 §6
7-13. Multi-mesh, auto-routing, service install, observability, SDKs, security model, configuration — unchanged from v4
14. Lifecycle — unchanged from v5 §14
15. Version compat — feature param updated for new dedupe semantics
15.1 Feature bits with parameters (v6 update)
| Bit | `params.version` | Required parameters | Optional parameters |
|---|---|---|---|
| `client_message_id_dedupe` | 2 | `mode: "retention_scoped"\|"permanent"`, `dedupe_retention_days: int (>= 3)` (when `mode=retention_scoped`), `request_fingerprint: bool == true` | `tombstone_history_pruned_window_days: int` |
| `concurrent_connection_policy` | 1 | (no parameters) | `default_policy: "prefer_newest"\|"prefer_oldest"\|"allow_concurrent"` |
| `member_keypair_rotated_event` | 1 | (no parameters) | (none) |
| `key_epoch` | 1 | `max_concurrent_epochs: int (>= 1)` | (none) |
| `max_payload` | 1 | `inline_bytes: int (>= 1024)`, `blob_bytes: int (>= 1024)` | (none) |
client_message_id_dedupe bumped to params.version = 2 because it now
requires request_fingerprint = true. A broker still on version 1
(no fingerprint comparison) is treated as "feature missing" and the
daemon refuses to start. That's intentional — v0.9.0 daemons require
fingerprint enforcement for safe idempotency.
dedupe_retention_days minimum raised to 3 (matches the §4.9 floor).
15.2 Negotiation handshake — unchanged shape from v5 §15.2
15.3 IPC negotiation — unchanged from v3 §15.3
15.4 Compatibility matrix — unchanged from v3 §15.4
15.5 Diagnostic close codes (NEW v6 — codex r5)
WebSocket close codes are split for diagnostic clarity:
| Code | Reason | When |
|---|---|---|
| 4010 | `feature_unavailable` | Required feature missing from broker's `supported` |
| 4011 | `feature_param_invalid` | Required feature present but parameters fail validation (missing required, out of bounds, unknown version) |
| 4012 | `feature_param_below_floor` | Required feature parameter below daemon's hard floor (e.g. `dedupe_retention_days < 3`) |

Daemon logs the full negotiation payload at WARN before exiting; supervisor
+ alerting catches the restart loop.
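As an illustration of how a daemon might pick between the three codes for the `client_message_id_dedupe` bit, here is a hedged Python sketch. The shape of the advertised payload (a dict of feature name to `{"version", "params"}`) and the function name are assumptions, since the handshake shape is defined elsewhere. Following §15.1, a broker still on `params.version = 1` is treated as "feature missing".

```python
REQUIRED_FEATURE = "client_message_id_dedupe"
REQUIRED_VERSION = 2
RETENTION_FLOOR_DAYS = 3


def check_dedupe_feature(supported):
    """Return None if acceptable, else (close_code, reason).

    ASSUMPTION: `supported` maps feature name -> {"version": int, "params": dict}.
    """
    feature = supported.get(REQUIRED_FEATURE)
    if feature is None or feature.get("version") == 1:
        # Missing, or pre-fingerprint version treated as missing.
        return (4010, "feature_unavailable")
    if feature.get("version") != REQUIRED_VERSION:
        return (4011, "feature_param_invalid")  # unknown version
    params = feature.get("params") or {}
    mode = params.get("mode")
    if mode not in ("retention_scoped", "permanent"):
        return (4011, "feature_param_invalid")
    if params.get("request_fingerprint") is not True:
        return (4011, "feature_param_invalid")
    if mode == "retention_scoped":
        days = params.get("dedupe_retention_days")
        if type(days) is not int:
            return (4011, "feature_param_invalid")  # missing or wrong type
        if days < RETENTION_FLOOR_DAYS:
            return (4012, "feature_param_below_floor")
    return None
```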
16. Threat model — unchanged from v4 §16
17. Migration — broker dedupe table + atomicity (v6)
Broker side, deploy order:
1. `CREATE TABLE mesh.client_message_dedupe` with v6 schema (additive, online-safe).
2. `ALTER TABLE mesh.topic_message ADD COLUMN client_message_id`.
3. `ALTER TABLE mesh.message_queue ADD COLUMN client_message_id`.
4. Broker code refactor: every accept path wraps dedupe insert + message insert in one transaction (§4.7). Pre-generated `broker_message_id` (ulid in code) passed in.
5. Broker code: nightly job to delete dedupe rows where `expires_at < NOW()` (skip in `permanent` mode).
6. Broker code: hook into the message-retention sweep. When a `topic_message` or `message_queue` row is hard-deleted, find the matching dedupe row by `(mesh_id, client_message_id)` and set `history_available = FALSE`. (Note: `client_message_id` is nullable on those tables for legacy traffic; nullable rows have no dedupe row to update.)
7. Broker code: nightly orphan-check job (§4.7); alerts on non-zero.
8. Broker advertises `client_message_id_dedupe` feature with `params.version = 2` and `request_fingerprint: true`.
9. Daemon refuses to start unless that feature bit is advertised with valid v2 params.
Rollback plan: feature flag disables fingerprint enforcement broker-side (falls back to existing pre-v6 behavior — no dedupe). Daemons that require fingerprint refuse to start. Operator switches off the feature flag, reverts the daemon, restarts. No data loss; pending dedupe rows remain in place for the next forward roll.
What changed v5 → v6 (codex round-5 actionable items)
| Codex r5 item | v6 fix | Section |
|---|---|---|
| Idempotency key reuse with different payload silently collapses | `request_fingerprint BYTEA` in dedupe table; canonical form per §4.4; `409` on mismatch | §4.3, §4.4, §4.5 |
| `status='rejected'` underspecified | Dropped `status` column; rejected requests don't consume keys; existence of dedupe row = "key consumed" | §4.3, §4.6 |
| Outbox max-age math edges at low retention | 72h floor; min `dedupe_retention_days = 3`; percentage-based safety margin; explicit override gating | §4.9, §15.1 |
| Broker atomicity not stated | One transaction per accept path; orphan-check job; rollback semantics | §4.7 |
| Diagnostic detail on feature param failures | New close codes 4011 / 4012 separate from 4010 | §15.5 |
| Outbox stores fingerprint | NEW column `outbox.request_fingerprint BLOB`; computed once at IPC accept | §4.8 |
| Operator command for fingerprint-mismatch recovery | NEW `outbox requeue --new-id <id>` to rotate idempotency key | §4.5 |
What needs review (round 6)
- Request fingerprint canonical form (§4.4): does JCS work cross-language for `meta_canonical_json` (Python json.dumps, Go encoding/json, JS JSON.stringify all behave differently)? Should we ship a vetted JCS lib in each SDK or fall back to a simpler "sorted keys + no spaces + escape-as-stored" rule with conformance tests?
- Atomicity contract (§4.7): is the orphan-check sufficient, or does a violation mean we need a "broker rebuild dedupe from messages" recovery tool? The latter is destructive but useful for ops emergencies.
- Max-age formula (§4.9): is the 72h floor correct? Is the percentage-based safety margin (`max(24, ceil(0.1 * dedupe_window))`) the right shape? Or simpler to say "always 24h"?
- `409 idempotency_key_reused` recovery flow (§4.5): is sending the row to `dead` and surfacing it via `outbox --failed` enough? Should the daemon emit a high-priority event for the SSE stream so operators are paged immediately?
- Diagnostic close codes (§15.5): is splitting 4010/4011/4012 useful, or does it just push complexity onto operators? Should we collapse to 4010 with structured close-reason JSON instead?
- Anything else still wrong? Read it as if you were going to operate this for a year. What falls down?
Three options:
- (a) v6 is shippable: lock the spec, start coding the frozen core.
- (b) v7 needed: list the must-fix items.
- (c) the architecture itself is wrong: what would you do differently?
Be ruthless.