Long-lived process that holds a persistent WS to the broker and exposes
a local IPC surface (UDS + bearer-auth TCP loopback). Implements the
v0.9.0 spec under .artifacts/specs/.

Core:
- daemon up | status | version | down | accept-host
- daemon outbox list [--failed|--pending|--inflight|--done|--aborted]
- daemon outbox requeue <id> [--new-client-id <id>]
- daemon install-service / uninstall-service (macOS launchd, Linux systemd)

IPC routes:
- /v1/version, /v1/health
- /v1/send (POST) — full §4.5.1 idempotency lookup table
- /v1/inbox (GET) — paged history
- /v1/events — SSE stream of message/peer_join/peer_leave/broker_status
- /v1/peers — broker passthrough
- /v1/profile — summary/status/visible/avatar/title/bio/capabilities
- /v1/outbox + /v1/outbox/requeue — operator recovery

Storage (SQLite via node:sqlite / bun:sqlite):
- outbox.db: pending/inflight/done/dead/aborted with audit columns
- inbox.db: dedupe by client_message_id, decrypts DMs via existing crypto
- BEGIN IMMEDIATE serialization for daemon-local accept races

Identity:
- host_fingerprint.json (machine-id || first-stable-mac)
- refuse-on-mismatch policy with `daemon accept-host` recovery

CLI integration:
- claudemesh send detects the daemon and routes through /v1/send when
  present, falling back to bridge socket / cold path otherwise

Tests: 15-case coverage of the §4.5.1 IPC duplicate lookup table.

Spec arc preserved at .artifacts/specs/2026-05-03-daemon-{v1..v10}.md;
v0.9.0 implementation target locked at 2026-05-03-daemon-spec-v0.9.0.md;
deferred items at 2026-05-03-daemon-spec-broker-hardening-followups.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
681 lines
30 KiB
Markdown
# `claudemesh daemon` — Implementation spec v0.9.0

> **Implementation target.** Locked from the v1–v10 codex-reviewed spec
> series. This document is what we build for v0.9.0 of the daemon.
>
> **Base**: v6 (the round where the architecture passed codex's
> structural review — request_fingerprint, dedupe table, atomicity
> contract, feature-bit negotiation, key archive format).
>
> **Pulled in from v7–v9**: six cheap, load-bearing fixes that close
> real v0.9.0-era bugs (not future-scale concerns):
>
> 1. `aborted` outbox status + audit columns (operator recovery without
>    destroying audit trail) — v7 §4.5.2
> 2. `BEGIN IMMEDIATE` for daemon-local SQLite serialization (v6's
>    `SELECT FOR UPDATE` is invalid SQLite anyway) — v7 §4.5.1
> 3. Daemon-local IPC duplicate lookup table over outbox states ×
>    fingerprint match/mismatch — v8 §4.5.1
> 4. Phase B1/B2/B3 broker validation split (the concept; we don't need
>    the elaborate phase tables) — v7 §4.6.2
> 5. Side-effect inventory (in-tx vs async) as an implementation comment
>    block — v8 §4.7.1
> 6. Two-layer ID model wording: daemon-consumed iff outbox row,
>    broker-consumed iff dedupe row — v9 §4.1
>
> **Deferred to broker-hardening followups** (see
> `2026-05-03-daemon-spec-broker-hardening-followups.md` for the full list and
> rationale): B0 dedupe fast-path, Lua-scripted idempotent rate
> limiter, in-tx mention_index, 4011/4012 close-code split, per-OS
> fingerprint precedence table, request-fingerprint schema-v2 in
> feature negotiation. These are real improvements but not v0.9.0
> blockers; they land as the broker matures.
>
> **Intent §0 unchanged from v2.**

---

## 0. Intent — unchanged, see v2 §0

---

## 1. Process model — unchanged from v3 §1 / v2 §1

---

## 2. Identity — unchanged from v5 §2

---

## 3. IPC surface — unchanged from v4 §3

---

## 4. Delivery contract — at-least-once with **request-fingerprinted** dedupe

Codex r5: dedupe must compare the *whole request shape*, not just
`(mesh, client_message_id)`. Otherwise a caller who reuses an idempotency
key with a different destination or body silently drops the new send and
gets the old send's metadata back.

### 4.1 The contract (precise)

> **Two-layer ID rule** (from v9): a `client_message_id` is
> **daemon-consumed** iff an outbox row exists for it; **broker-consumed**
> iff a dedupe row exists in `mesh.client_message_dedupe`. The two layers
> are independent: a daemon-consumed id may or may not be broker-consumed
> (depending on whether the send reached broker commit). In v0.9.0 there
> are no daemon-bypass clients, so for practical purposes "daemon-consumed"
> is the operative rule.
>
> **Local guarantee**: each successful `POST /v1/send` returns a stable
> `client_message_id`. The send is durably persisted to `outbox.db` before
> the response returns. The daemon enforces request-fingerprint
> idempotency at the IPC layer (§4.5).
>
> **Local audit guarantee**: a `client_message_id` once written to
> `outbox.db` is never released. Operator recovery via `requeue` always
> mints a fresh id; the old row stays in `aborted` for audit. There is
> no daemon-side path to free a used id.
>
> **Broker guarantee**: the broker maintains a dedupe record per accepted
> `(mesh_id, client_message_id)` in `mesh.client_message_dedupe`. Each
> dedupe record carries a canonical `request_fingerprint`. Retries with
> the same id AND matching fingerprint collapse to the original
> `broker_message_id`. Retries with mismatched fingerprint return
> `409 idempotency_key_reused` and do **not** create a new message.
>
> **Atomicity guarantee**: dedupe row insertion, message row insertion,
> and history row insertion happen in one broker DB transaction. Either
> all land, or none do. No orphan dedupe rows.
>
> **End-to-end guarantee**: at-least-once delivery, with
> `client_message_id` propagated to receivers' inboxes.

### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2

### 4.3 Broker schema — request fingerprint added (v6)

```sql
CREATE TABLE mesh.client_message_dedupe (
  mesh_id             UUID NOT NULL REFERENCES mesh.mesh(id) ON DELETE CASCADE,
  client_message_id   TEXT NOT NULL,

  -- The original accepted message; FK NOT enforced because the message row
  -- may be GC'd by retention sweeps before the dedupe row expires.
  broker_message_id   UUID NOT NULL,

  -- Canonical fingerprint of the original request. Recomputed on every
  -- duplicate retry; mismatch → 409 idempotency_key_reused. Schema in §4.4.
  request_fingerprint BYTEA NOT NULL,  -- 32-byte sha256

  destination_kind    TEXT NOT NULL CHECK(destination_kind IN ('topic','dm','queue')),
  destination_ref     TEXT NOT NULL,
  first_seen_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  expires_at          TIMESTAMPTZ,  -- NULL = `permanent` mode
  history_available   BOOLEAN NOT NULL DEFAULT TRUE,  -- flipped FALSE when message row GC'd

  PRIMARY KEY (mesh_id, client_message_id)
);

CREATE INDEX client_message_dedupe_expires_idx
  ON mesh.client_message_dedupe(expires_at)
  WHERE expires_at IS NOT NULL;

ALTER TABLE mesh.topic_message ADD COLUMN client_message_id TEXT;
ALTER TABLE mesh.message_queue ADD COLUMN client_message_id TEXT;
```

**`status` column dropped (codex r5)**. Rejected requests do **not**
consume idempotency keys. Rationale below in §4.6.

### 4.4 Request fingerprint — canonical form (NEW v6)

The fingerprint covers everything that makes a send semantically distinct.
A retry must reproduce the same fingerprint bit-for-bit; anything else is
a different send and must not be collapsed.

```
request_fingerprint = sha256(
  envelope_version || 0x00 ||
  destination_kind || 0x00 ||
  destination_ref || 0x00 ||
  reply_to_id_or_empty || 0x00 ||
  priority || 0x00 ||
  meta_canonical_json || 0x00 ||
  body_hash
)
```

Where:
- `envelope_version`: integer string (e.g. `"1"`). Bumps when the envelope
  shape changes.
- `destination_kind`: `topic`, `dm`, or `queue`.
- `destination_ref`: topic name, recipient ed25519 pubkey hex, or queue id.
- `reply_to_id_or_empty`: original `broker_message_id` or empty string.
- `priority`: `now`, `next`, or `low`.
- `meta_canonical_json`: the `meta` field, serialized with sorted keys,
  no whitespace, escape-canonical (RFC 8785 JCS). Empty meta = empty string.
- `body_hash`: sha256(body bytes), hex.

The fingerprint is computed:
1. **Daemon-side** before durable outbox persistence — stored as
   `outbox.request_fingerprint` (NEW column) so retries always produce
   the same fingerprint regardless of caller behavior.
2. **Broker-side** on first receipt — stored in
   `client_message_dedupe.request_fingerprint`.
3. **Broker-side** on every duplicate retry — recomputed and compared
   byte-equal to the stored value.

If the daemon and broker disagree on the canonical form (e.g. JCS
implementation drift), the broker emits
`cm_broker_dedupe_fingerprint_mismatch_total{client_id, mesh_id}` and
returns `409 idempotency_key_reused` with a body that includes the
broker's fingerprint hex for debugging. Daemons that see this should
log it loudly and stop retrying that outbox row (it goes to `dead`).

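
The canonical form defined in §4.4 can be sketched in Python as follows. This is an illustrative sketch, not the daemon's implementation: the function names `canonical_meta` and `request_fingerprint` are hypothetical, and `json.dumps` with sorted keys and compact separators only approximates RFC 8785 for simple values (strings, integers, booleans, nested objects). Floats and some Unicode key orderings would need a vetted JCS library.

```python
import hashlib
import json
from typing import Optional


def canonical_meta(meta: Optional[dict]) -> str:
    """Approximate RFC 8785 (JCS): sorted keys, no whitespace.

    Assumption: json.dumps matches JCS only for simple values; it is
    a stand-in for a real JCS implementation.
    """
    if not meta:
        return ""  # the spec defines empty meta as the empty string
    return json.dumps(meta, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)


def request_fingerprint(*, envelope_version: int, destination_kind: str,
                        destination_ref: str, reply_to_id: Optional[str],
                        priority: str, meta: Optional[dict],
                        body: bytes) -> bytes:
    """Return the 32-byte sha256 over the NUL-separated fields."""
    fields = [
        str(envelope_version),             # integer string, e.g. "1"
        destination_kind,                  # topic | dm | queue
        destination_ref,
        reply_to_id or "",                 # reply_to_id_or_empty
        priority,                          # now | next | low
        canonical_meta(meta),
        hashlib.sha256(body).hexdigest(),  # body_hash, hex
    ]
    joined = b"\x00".join(f.encode("utf-8") for f in fields)
    return hashlib.sha256(joined).digest()
```

Because `meta` is canonicalized before hashing, two requests whose `meta` differs only in key order produce the same fingerprint, while any change to destination, priority, or body produces a different one.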
### 4.5 Daemon-local idempotency at the IPC layer (from v8)

The daemon enforces fingerprint idempotency **before** the request hits
`outbox.db` so a caller bug never creates duplicate-key/mismatch-payload
state at all.

#### 4.5.1 IPC accept algorithm

On `POST /v1/send`:

1. Validate request envelope (auth, schema, size limits, destination
   resolvable). Failures here return `4xx` immediately. **No outbox
   row is written; the `client_message_id` is not consumed.**
2. Compute `request_fingerprint` (§4.4).
3. Open a SQLite transaction with `BEGIN IMMEDIATE` so a concurrent IPC
   accept on the same id serializes against this one. `BEGIN IMMEDIATE`
   acquires the RESERVED lock at transaction start; SQLite has no
   row-level lock and `SELECT FOR UPDATE` is not supported.
4. `SELECT id, request_fingerprint, status, broker_message_id,
   last_error FROM outbox WHERE client_message_id = ?`.
5. Apply the lookup table below. For the "(no row)" case, INSERT inside
   the same transaction.
6. COMMIT.

| Existing row state | Fingerprint | Daemon response |
|---|---|---|
| (no row) | — | INSERT new outbox row `pending`; return `202 accepted, queued` |
| `pending` | match | Return `202 accepted, queued`. No mutation |
| `pending` | mismatch | Return `409`, `conflict: "outbox_pending_fingerprint_mismatch"` |
| `inflight` | match | Return `202 accepted, inflight`. No mutation |
| `inflight` | mismatch | Return `409`, `conflict: "outbox_inflight_fingerprint_mismatch"` |
| `done` | match | Return `200 ok, duplicate: true, broker_message_id, history_id`. No broker call |
| `done` | mismatch | Return `409`, `conflict: "outbox_done_fingerprint_mismatch", broker_message_id` |
| `dead` | match | Return `409`, `conflict: "outbox_dead_fingerprint_match", reason: "<last_error>"` |
| `dead` | mismatch | Return `409`, `conflict: "outbox_dead_fingerprint_mismatch"` |
| `aborted` | match | Return `409`, `conflict: "outbox_aborted_fingerprint_match"`. Operator-retired id, never reusable |
| `aborted` | mismatch | Return `409`, `conflict: "outbox_aborted_fingerprint_mismatch"` |

Every `409` carries the daemon's `request_fingerprint` (8-byte hex
prefix) for client/server canonical-form-drift debugging. A
`client_message_id` written to `outbox.db` is permanently bound to that
row's lifecycle — the only "free" state is "no row exists".

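
Steps 3 to 6 of the accept algorithm and the lookup table can be sketched with Python's built-in `sqlite3` module. This is a minimal sketch under stated assumptions, not the daemon's code (which targets `node:sqlite` / `bun:sqlite`): the table is trimmed to the columns the lookup reads, the helper names `open_outbox`, `lookup`, and `accept_send` are hypothetical, the `request_fingerprint_prefix` field name is invented for illustration, and `history_id` is omitted because the outbox table has no such column.

```python
import sqlite3

# Trimmed-down outbox table: only the columns the lookup needs.
SCHEMA = """
CREATE TABLE IF NOT EXISTS outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  payload             BLOB NOT NULL,
  status              TEXT NOT NULL,
  broker_message_id   TEXT,
  last_error          TEXT
)"""


def open_outbox(path=":memory:"):
    # isolation_level=None turns off the sqlite3 module's implicit
    # transactions, so BEGIN IMMEDIATE / COMMIT are issued explicitly.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(SCHEMA)
    return conn


def lookup(row, fingerprint):
    """Apply the lookup table to an existing outbox row."""
    stored_fp, status, broker_message_id, last_error = row
    matched = bytes(stored_fp) == fingerprint
    if matched and status == "pending":
        return 202, {"status": "queued"}
    if matched and status == "inflight":
        return 202, {"status": "inflight"}
    if matched and status == "done":
        return 200, {"duplicate": True, "broker_message_id": broker_message_id}
    suffix = "match" if matched else "mismatch"
    body = {
        "conflict": f"outbox_{status}_fingerprint_{suffix}",
        # Assumption: the prefix comes from the stored row's fingerprint.
        "request_fingerprint_prefix": bytes(stored_fp)[:8].hex(),
    }
    if status == "done":
        body["broker_message_id"] = broker_message_id
    if status == "dead" and matched:
        body["reason"] = last_error
    return 409, body


def accept_send(conn, row_id, client_message_id, fingerprint, payload):
    """Steps 3 to 6 of the accept algorithm; returns (http_status, body)."""
    conn.execute("BEGIN IMMEDIATE")  # takes the RESERVED lock up front
    try:
        row = conn.execute(
            "SELECT request_fingerprint, status, broker_message_id, last_error"
            " FROM outbox WHERE client_message_id = ?",
            (client_message_id,),
        ).fetchone()
        if row is None:
            conn.execute(
                "INSERT INTO outbox (id, client_message_id, request_fingerprint,"
                " payload, status) VALUES (?, ?, ?, ?, 'pending')",
                (row_id, client_message_id, fingerprint, payload),
            )
            result = (202, {"status": "queued"})
        else:
            result = lookup(row, fingerprint)
        conn.execute("COMMIT")
        return result
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Only the "(no row)" branch mutates the table; every other branch reads the existing row and answers from it, which is what keeps a reused id from ever overwriting an earlier send.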
#### 4.5.2 Outbox table

```sql
CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,  -- 32 bytes
  payload             BLOB NOT NULL,
  enqueued_at         INTEGER NOT NULL,
  attempts            INTEGER DEFAULT 0,
  next_attempt_at     INTEGER NOT NULL,
  status              TEXT CHECK(status IN
                        ('pending','inflight','done','dead','aborted')),
  last_error          TEXT,
  delivered_at        INTEGER,
  broker_message_id   TEXT,
  aborted_at          INTEGER,  -- v7
  aborted_by          TEXT,     -- v7: operator/auto
  superseded_by       TEXT      -- v7: id of requeue successor
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
CREATE INDEX outbox_aborted ON outbox(status, aborted_at) WHERE status = 'aborted';
```

`aborted_at` / `aborted_by` / `superseded_by` give operators a clear
audit trail. `superseded_by` lets `outbox inspect` show the chain when
a row is requeued multiple times. `request_fingerprint` is computed
once at IPC accept time and frozen for the row's lifecycle.

#### 4.5.3 Operator recovery via `requeue`

```
claudemesh daemon outbox requeue --id <outbox_row_id>
    [--new-client-id <id> | --auto]
    [--patch-payload <path>]
```

Atomically (single SQLite transaction):
1. Marks the existing row `aborted`, sets `aborted_at = now`,
   `aborted_by = "operator"`. Row is **never deleted** — audit trail
   permanent.
2. Mints a fresh `client_message_id` (caller-supplied or auto-ulid).
3. Inserts a new outbox row `pending` with the fresh id and the same
   payload (or patched if `--patch-payload`).
4. Sets `superseded_by = <new_row_id>` on the old row.

The old `client_message_id` is permanently dead. There is no path for
an id to become free again.

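
The four requeue steps can be sketched as one SQLite transaction. This is a hedged illustration rather than the shipped command: the function name `requeue` and its parameters are hypothetical, `uuid4` stands in for the ULID generator, timestamps are assumed to be epoch milliseconds, and the rule that a patched payload arrives with a freshly computed fingerprint is an assumption the spec does not state. The restriction to `dead` and `pending` rows follows the operator-retirement row of §4.6.1.

```python
import sqlite3
import time
import uuid

# Trimmed outbox table: audit columns plus what requeue copies forward.
SCHEMA = """
CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  payload             BLOB NOT NULL,
  status              TEXT NOT NULL,
  aborted_at          INTEGER,
  aborted_by          TEXT,
  superseded_by       TEXT
)"""


def requeue(conn, old_row_id, new_client_id=None, patch=None):
    """Retire old_row_id and enqueue a successor; returns the new row id.

    patch, if given, is a (payload, request_fingerprint) pair.
    conn must have been opened with isolation_level=None.
    """
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT payload, request_fingerprint, status FROM outbox WHERE id = ?",
            (old_row_id,),
        ).fetchone()
        if row is None:
            raise KeyError(old_row_id)
        payload, fingerprint, status = row
        if status not in ("dead", "pending"):
            raise ValueError(f"cannot requeue a row in status {status!r}")
        if patch is not None:
            payload, fingerprint = patch
        new_row_id = uuid.uuid4().hex                      # stand-in for a ULID
        new_client_id = new_client_id or uuid.uuid4().hex  # --auto behaviour
        now_ms = int(time.time() * 1000)                   # assumed unit
        # Steps 1 and 4: retire the old row and point it at its successor.
        conn.execute(
            "UPDATE outbox SET status = 'aborted', aborted_at = ?,"
            " aborted_by = 'operator', superseded_by = ? WHERE id = ?",
            (now_ms, new_row_id, old_row_id),
        )
        # Steps 2 and 3: insert the successor under the fresh id.
        conn.execute(
            "INSERT INTO outbox (id, client_message_id, request_fingerprint,"
            " payload, status) VALUES (?, ?, ?, ?, 'pending')",
            (new_row_id, new_client_id, fingerprint, payload),
        )
        conn.execute("COMMIT")
        return new_row_id
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

The `UNIQUE` constraint on `client_message_id` does the enforcement work here: supplying an id that was ever used makes the insert fail, the transaction rolls back, and the old row is left untouched.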
### 4.5b Broker duplicate response — three cases

| Case | HTTP/WS code | Body |
|---|---|---|
| First insert | `201 created` | `{ broker_message_id, client_message_id, history_id, duplicate: false }` |
| Duplicate, fingerprint match | `200 ok` | `{ broker_message_id, client_message_id, history_id, duplicate: true, history_available, first_seen_at }` |
| Duplicate, fingerprint mismatch | `409 idempotency_key_reused` | `{ client_message_id, conflict: "request_fingerprint_mismatch", broker_fingerprint_prefix: "ab12cd34..." }` (first 8 bytes hex) |

Daemon outcomes:
- `201` → mark outbox row `done`, store `broker_message_id`.
- `200 duplicate` with `history_available: true` → mark `done`, log INFO.
- `200 duplicate` with `history_available: false` → mark `done`, log WARN.
- `409 idempotency_key_reused` → mark outbox row `dead`. Operator runs
  `outbox requeue` (§4.5.3); old id stays `aborted`, new id is fresh.

### 4.6 Rejected-request semantics — id consumed iff outbox row written

> **Rule**: a `client_message_id` is daemon-consumed iff the daemon
> writes an outbox row. Anything that fails before outbox insertion
> (auth, schema, size, destination not resolvable) leaves the id
> untouched and freely reusable.

#### 4.6.1 Daemon-side rejection phasing

| Phase | When daemon rejects | Outbox row? | Caller may reuse id? |
|---|---|---|---|
| **A. IPC validation** (auth, schema, size, destination resolvable) | Before §4.5.1 step 3 | No | Yes — id never consumed |
| **B. Outbox stored, broker network/transient failure** | After IPC accept, broker `5xx` or timeout | `pending` → retried | N/A — daemon owns retries |
| **C. Outbox stored, broker permanent rejection** | Broker returns `4xx` after IPC accept | `dead` | No — rotate via `requeue` |
| **D. Operator retirement** | Operator runs `requeue` on `dead` or `pending` row | `aborted` (audit) + new row with fresh id | Old id NEVER reusable; new id is fresh |

#### 4.6.2 Broker-side rejection phasing (B1 / B2 / B3)

The broker validates in three phases relative to dedupe-row insertion:

| Phase | Validation | Side effects | Result for direct broker callers (none in v0.9.0) |
|---|---|---|---|
| **B1. Pre-dedupe-claim** | Auth (mesh membership), schema, size, mesh exists, member exists, destination kind valid, payload bytes ≤ `max_payload.inline_bytes`, rate limit not exceeded | None | `4xx`. No dedupe row. Direct broker caller may retry with same id |
| **B2. Post-dedupe-claim** (in-tx) | destination_ref existence (topic exists, member subscribed, etc.) | INSERT into dedupe rolled back | `4xx`, transaction rolled back, no dedupe row remains. Direct broker caller may retry with same id |
| **B3. Accepted** | All side effects commit atomically | Dedupe row, message row, history row, delivery_queue rows | `201` with `broker_message_id` |

**Daemon-mediated callers (the only path in v0.9.0)** see only the
daemon-layer rules of §4.6.1: any broker `4xx` after IPC accept lands
the outbox row in `dead`. Daemon-mediated callers MUST rotate via
`requeue` (§4.5.3); the daemon-consumed id is never reusable
regardless of whether the broker layer sees a dedupe row. The "may
retry with same id" wording above describes broker-bypass callers
only, which v0.9.0 does not have.

**Critical guarantee**: there is no broker code path where a permanent
4xx leaves a dedupe row behind. Either the request committed and a
dedupe row exists (B3), or it didn't and no dedupe row exists (B1, B2).
"Dedupe row exists" is the unambiguous signal of "id consumed at the
broker layer."

If the broker decides post-commit that an accepted message is invalid
(async content-policy job), that's NOT a permanent rejection — it's a
follow-up moderation event that operates on the `broker_message_id`,
not on the dedupe key.

Net result: `client_message_dedupe` rows only exist when the broker
**successfully** accepted a message and committed it. The single source
of truth for "was this idempotency key consumed?" is the existence of
the dedupe row. No status enum, no ambiguous states.

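
The daemon-side handling of a broker response (the daemon outcomes in §4.5b combined with phases B and C of §4.6.1) reduces to a small mapping. The sketch below is illustrative only: `outbox_transition` is a hypothetical name, a timeout is modelled as `None`, and the `ERROR` log level for the `dead` transition is an assumption (the spec asks only that fingerprint mismatches be logged loudly).

```python
from typing import Optional, Tuple


def outbox_transition(broker_status: Optional[int],
                      history_available: bool = True) -> Tuple[str, Optional[str]]:
    """Map a broker response to (next outbox status, log level).

    broker_status is None for a timeout or a dropped connection.
    """
    if broker_status is None or 500 <= broker_status <= 599:
        return "pending", None   # phase B: transient failure, daemon retries
    if broker_status == 201:
        return "done", None      # first insert
    if broker_status == 200:
        # Duplicate with matching fingerprint; WARN if history was pruned.
        return "done", "INFO" if history_available else "WARN"
    if 400 <= broker_status <= 499:
        return "dead", "ERROR"   # phase C; the log level is an assumption
    raise ValueError(f"unexpected broker status {broker_status}")
```

Note that no branch returns the row to a state where its `client_message_id` could be reused: `dead` rows leave only through `requeue`, which mints a fresh id.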
### 4.7 Broker atomicity contract

#### 4.7.1 Side-effect inventory

Every successful broker accept atomically commits these durable state
changes in **one transaction**:

| Effect | Table | Why in-tx |
|---|---|---|
| Dedupe record | `mesh.client_message_dedupe` | Idempotency authority |
| Message body | `mesh.topic_message` / `mesh.message_queue` | Authoritative store |
| History row | `mesh.message_history` | Replay log; lost-on-rollback breaks ordered replay |
| Fan-out work | `mesh.delivery_queue` | Each recipient must see exactly committed messages |

**Outside the transaction** (non-authoritative or rebuildable):
- WS push to live subscribers — best-effort live notifications.
- Webhook fan-out — async via `delivery_queue` workers.
- Rate-limit counters — telemetry only; authority is the external
  limiter checked in B1.
- Audit log entries — append-only stream; rebuildable from history.
- Search/FTS index updates — async via outbox-pattern worker.
- Mention index updates — async (deferred in-tx promotion to followups
  doc).
- Metrics — Prometheus, pull-based.

If any in-transaction insert fails, the transaction rolls back
completely. The broker returns `5xx` to the daemon, and the daemon
retries. No partial state.

#### 4.7.2 Pseudocode

```sql
-- Pre-generate broker_message_id (ulid) in code, pass in.
BEGIN;

-- Step 1: try to claim the idempotency key.
INSERT INTO mesh.client_message_dedupe
  (mesh_id, client_message_id, broker_message_id, request_fingerprint,
   destination_kind, destination_ref, expires_at)
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
        $dest_kind, $dest_ref, $expires_at)
ON CONFLICT (mesh_id, client_message_id) DO NOTHING;

-- Step 2: inspect what's actually there now (ours or someone else's).
SELECT broker_message_id, request_fingerprint, destination_kind,
       destination_ref, history_available, first_seen_at
FROM mesh.client_message_dedupe
WHERE mesh_id = $mesh_id AND client_message_id = $client_id
FOR SHARE;

-- Branch:
--   row.broker_message_id == $msg_id → first insert; continue.
--   row.broker_message_id != $msg_id → duplicate. Compare fingerprints:
--     match    → ROLLBACK; return 200 duplicate.
--     mismatch → ROLLBACK; return 409 idempotency_key_reused.

-- Step 3: validate Phase B2 (destination_ref existence — topic exists,
-- member subscribed, etc.). If B2 fails → ROLLBACK; return 4xx (no
-- dedupe row remains).

-- Step 4: insert in-tx side effects (§4.7.1).
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
VALUES ($msg_id, $mesh_id, $client_id, ...);

INSERT INTO mesh.message_history (broker_message_id, mesh_id, ...)
VALUES ($msg_id, $mesh_id, ...);

INSERT INTO mesh.delivery_queue (broker_message_id, recipient_pubkey, ...)
SELECT $msg_id, member_pubkey, ...
FROM mesh.topic_subscription
WHERE topic = $dest_ref AND mesh_id = $mesh_id;

COMMIT;
```

The branch logic determines the response shape (`201` / `200 duplicate`
/ `409 idempotency_key_reused`) before COMMIT. The duplicate and 409
branches always ROLLBACK because nothing else needs to commit.
`SELECT … FOR SHARE` blocks concurrent writers from updating or
deleting the same dedupe row mid-transaction.

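
The claim-then-inspect branch logic of §4.7.2 can be exercised locally with a small sketch. Important assumptions: SQLite (via Python's `sqlite3`) stands in for the broker's Postgres database, so `BEGIN IMMEDIATE` replaces `FOR SHARE` row locking; the tables are reduced to a few columns; only the message row is written in step 4; `uuid4` replaces the ULID; and the `404` returned for a B2 failure is a placeholder for the unspecified `4xx`. The names `open_broker_db` and `accept` are hypothetical.

```python
import sqlite3
import uuid


def open_broker_db():
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript("""
        CREATE TABLE client_message_dedupe (
          mesh_id             TEXT NOT NULL,
          client_message_id   TEXT NOT NULL,
          broker_message_id   TEXT NOT NULL,
          request_fingerprint BLOB NOT NULL,
          PRIMARY KEY (mesh_id, client_message_id)
        );
        CREATE TABLE topic_message (
          id                TEXT PRIMARY KEY,
          mesh_id           TEXT NOT NULL,
          client_message_id TEXT,
          body              BLOB NOT NULL
        );
    """)
    return conn


def accept(conn, mesh_id, client_id, fingerprint, body, destination_ok=True):
    msg_id = uuid.uuid4().hex  # stand-in for the pre-generated ULID
    conn.execute("BEGIN IMMEDIATE")
    try:
        # Step 1: try to claim the idempotency key.
        conn.execute(
            "INSERT INTO client_message_dedupe VALUES (?, ?, ?, ?)"
            " ON CONFLICT (mesh_id, client_message_id) DO NOTHING",
            (mesh_id, client_id, msg_id, fingerprint),
        )
        # Step 2: inspect the row that is there now (ours or an earlier one).
        stored_id, stored_fp = conn.execute(
            "SELECT broker_message_id, request_fingerprint"
            " FROM client_message_dedupe"
            " WHERE mesh_id = ? AND client_message_id = ?",
            (mesh_id, client_id),
        ).fetchone()
        if stored_id != msg_id:  # duplicate: nothing of ours should commit
            conn.execute("ROLLBACK")
            if bytes(stored_fp) == fingerprint:
                return 200, {"broker_message_id": stored_id, "duplicate": True}
            return 409, {"conflict": "request_fingerprint_mismatch"}
        # Step 3: phase B2 check; failure rolls the dedupe claim back.
        if not destination_ok:
            conn.execute("ROLLBACK")
            return 404, {}  # placeholder 4xx
        # Step 4: in-tx side effects (only the message row in this sketch).
        conn.execute(
            "INSERT INTO topic_message VALUES (?, ?, ?, ?)",
            (msg_id, mesh_id, client_id, body),
        )
        conn.execute("COMMIT")
        return 201, {"broker_message_id": msg_id, "duplicate": False}
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Comparing the stored `broker_message_id` with the pre-generated one is what distinguishes "our insert won" from "an earlier accept already holds this key" without a separate existence check before the insert.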
#### 4.7.3 Failure modes

- Crash before `COMMIT`: all rows roll back. Next daemon retry inserts
  cleanly.
- Crash after `COMMIT` but before WS ACK: dedupe row exists. Daemon
  retries → fingerprint matches → `200 duplicate`. Net: exactly one
  broker-accepted row, one daemon `done` transition.
- Constraint violation on message row insert: rolls back the whole tx.
  `5xx` to daemon. Same fingerprint reproduces; daemon eventually
  marks `dead`. No orphan dedupe row.

A nightly orphan-check job (counter `cm_broker_dedupe_orphan_check_total`)
validates that every `client_message_dedupe` row has a matching
`topic_message` / `message_queue` row OR that the matching row has been
retention-pruned (`history_available = FALSE`). Inconsistencies are logged
as `cm_broker_dedupe_orphan_found{mesh_id}` for human review.

### 4.8 Outbox schema

The authoritative outbox schema for v0.9.0 is in §4.5.2 (includes
`aborted` status and audit columns from the v7 pull). `request_fingerprint`
is computed at IPC accept time and frozen for the row's lifecycle —
the daemon never recomputes from `payload` post-enqueue (would produce
drift if envelope_version changes between daemon runs).

### 4.9 Outbox max-age math — bounded (v6)

Codex r5: the v5 formula `(dedupe_retention_days * 24) - 24h_margin`
yields zero at `dedupe_retention_days = 1` and a negative max-age below
that, so it has no defined behavior for `dedupe_retention_days <= 1`.

v6 formula and bounds:

- **Minimum supported broker dedupe retention**: 3 days. Daemon refuses
  to start if broker advertises `dedupe_retention_days < 3` (treats it
  as `feature_param_below_floor` and closes with `4010`, see §15.5).
- **Daemon `max_age_hours` derivation**:
  - `permanent` mode → daemon uses config default (168h = 7d), cap 720h
    (30d).
  - `retention_scoped` mode → daemon `max_age_hours = max(72,
    (dedupe_retention_days * 24) - safety_margin_hours)` where
    `safety_margin_hours = max(24, ceil(dedupe_retention_days * 0.1 *
    24))`. For `dedupe_retention_days=3` this gives
    `max(72, 72-24) = 72h`. For 30 days: `max(72, 720-72) = 648h`. For
    365 days: `max(72, 8760-876) = 7884h`.
  - The 72h floor prevents the daemon outbox from being uselessly short
    — three days is enough margin for normal operator response to a
    paged outage.

- Operator override allowed via `[outbox] max_age_hours_override = N`,
  but if `N` exceeds `dedupe_retention_days * 24 - 1` daemon refuses to
  start with `outbox_max_age_above_dedupe_window`. The override exists
  for the rare case of a much-shorter-than-default outbox; it does not
  exist to circumvent the broker's dedupe window.

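
The derivation in §4.9 is easy to check numerically. The following sketch reproduces the three worked values; the function name `max_age_hours`, its parameters, and the use of `ValueError` for the refuse-to-start cases are illustrative assumptions, and applying the override only in `retention_scoped` mode is a reading of the text rather than something it states. The safety margin uses integer ceiling division to avoid floating-point drift in `0.1 * 24`.

```python
from typing import Optional

MIN_RETENTION_DAYS = 3
FLOOR_HOURS = 72
PERMANENT_DEFAULT_HOURS = 168  # 7d
PERMANENT_CAP_HOURS = 720      # 30d


def max_age_hours(mode: str,
                  dedupe_retention_days: Optional[int] = None,
                  override: Optional[int] = None,
                  configured: int = PERMANENT_DEFAULT_HOURS) -> int:
    """Derive the daemon outbox max age, in hours."""
    if mode == "permanent":
        return min(configured, PERMANENT_CAP_HOURS)
    if mode != "retention_scoped":
        raise ValueError(f"unknown mode {mode!r}")
    if dedupe_retention_days is None or dedupe_retention_days < MIN_RETENTION_DAYS:
        raise ValueError("dedupe_retention_days below floor of 3")
    window = dedupe_retention_days * 24
    if override is not None:
        if override > window - 1:
            raise ValueError("outbox_max_age_above_dedupe_window")
        return override
    # ceil(window * 0.1) computed with integers: -(-a // b) == ceil(a / b).
    safety_margin = max(24, -(-window // 10))
    return max(FLOOR_HOURS, window - safety_margin)


for days in (3, 30, 365):
    print(days, max_age_hours("retention_scoped", days))
```

Running the loop prints 72, 648, and 7884 hours for 3, 30, and 365 days respectively, matching the figures in the text.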
### 4.10 Inbox schema — unchanged from v3 §4.5

### 4.11 Crash recovery — unchanged from v3 §4.6

### 4.12 Failure modes — corrected for fingerprint model (v6)

- **Fingerprint mismatch on retry** (`409 idempotency_key_reused`): outbox
  row marked `dead`. Surfaced in `--failed` view. Operator command
  `outbox requeue --new-client-id <id>` rotates `client_message_id` and
  retries.
- **Daemon retry after dedupe row hard-deleted by retention sweep**: in
  `retention_scoped` mode, daemon `max_age_hours` is bounded inside the
  retention window (§4.9), so this can only happen via operator override.
  In that case the retry creates a NEW dedupe row + new message — the
  caller chose this risk explicitly. Counter
  `cm_daemon_retry_after_dedupe_expired_total`.
- **Daemon retry after dedupe row hard-deleted in `permanent` mode**:
  cannot happen by definition — `permanent` means no `expires_at`. Only
  mesh deletion removes dedupe rows.
- **Duplicate row, history pruned**: as v5 §4.4. Mark `done`, log
  `cm_daemon_dedupe_history_pruned_total`.

---

## 5. Inbound — unchanged from v3 §5

---

## 6. Hooks — unchanged from v4 §6

---

## 7-13. Multi-mesh, auto-routing, service install, observability, SDKs, security model, configuration — unchanged from v4

---

## 14. Lifecycle — unchanged from v5 §14

---

## 15. Version compat — feature param updated for new dedupe semantics

### 15.1 Feature bits with parameters (v6 update)

| Bit | `params.version` | Required parameters | Optional parameters |
|---|---|---|---|
| `client_message_id_dedupe` | `1` | `mode: "retention_scoped"\|"permanent"`, `dedupe_retention_days: int (>= 3)` (when mode=retention_scoped), `request_fingerprint: bool == true` | `tombstone_history_pruned_window_days: int` |
| `concurrent_connection_policy` | `1` | (no parameters) | `default_policy: "prefer_newest"\|"prefer_oldest"\|"allow_concurrent"` |
| `member_keypair_rotated_event` | `1` | (no parameters) | — |
| `key_epoch` | `1` | `max_concurrent_epochs: int (>= 1)` | — |
| `max_payload` | `1` | `inline_bytes: int (>= 1024)`, `blob_bytes: int (>= 1024)` | — |

`client_message_id_dedupe` ships at `params.version = 1` with
`request_fingerprint: bool == true` as a required parameter. A broker
that doesn't advertise the feature, or advertises it without
`request_fingerprint: true`, is treated as "feature missing" and the
daemon refuses to start. That's intentional — v0.9.0 daemons require
fingerprint enforcement for safe idempotency.

The schema-version-2 evolution (parameters that need versioning) is
deferred (see followups doc).

`dedupe_retention_days` minimum is 3 (matches the §4.9 floor).

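
A daemon-side check of the advertised `client_message_id_dedupe` parameters might look like the sketch below, which returns either `None` (acceptable) or a `close_reason` payload using the `kind` values of §15.5. The shape of the `supported` mapping (feature name to a parameter dict with a `version` key) is an assumption, since the negotiation handshake itself is specified in v5 §15.2 and not reproduced here; `check_dedupe_feature` is a hypothetical name and the `detail` strings are placeholders.

```python
from typing import Optional

FEATURE = "client_message_id_dedupe"
RETENTION_FLOOR_DAYS = 3


def check_dedupe_feature(supported: dict) -> Optional[dict]:
    """Return None if the feature is usable, else a close_reason payload."""

    def reason(kind: str, detail: str) -> dict:
        return {"kind": kind, "feature": FEATURE, "detail": detail}

    params = supported.get(FEATURE)
    # Missing feature, or advertised without request_fingerprint: true,
    # is treated as "feature missing".
    if params is None or params.get("request_fingerprint") is not True:
        return reason("feature_unavailable",
                      "feature missing or request_fingerprint not true")
    if params.get("version") != 1:
        return reason("feature_param_invalid", "unknown params.version")
    mode = params.get("mode")
    if mode not in ("retention_scoped", "permanent"):
        return reason("feature_param_invalid", "mode missing or unknown")
    if mode == "retention_scoped":
        days = params.get("dedupe_retention_days")
        if not isinstance(days, int) or isinstance(days, bool):
            return reason("feature_param_invalid",
                          "dedupe_retention_days missing or not an int")
        if days < RETENTION_FLOOR_DAYS:
            return reason("feature_param_below_floor",
                          "dedupe_retention_days below 3")
    return None
```

A non-`None` result corresponds to the daemon closing with `4010` and refusing to start; checking the floor last keeps `feature_param_below_floor` reserved for parameters that are otherwise well-formed.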
### 15.2 Negotiation handshake — unchanged shape from v5 §15.2

### 15.3 IPC negotiation — unchanged from v3 §15.3

### 15.4 Compatibility matrix — unchanged from v3 §15.4

### 15.5 Diagnostic close code (v0.9.0)

v0.9.0 ships a single WebSocket close code with a structured
`close_reason` JSON payload that distinguishes the underlying cause:

| Code | Reason | `close_reason.kind` values |
|---|---|---|
| `4010` | `feature_unavailable` | `feature_unavailable` (feature missing from broker's `supported`) · `feature_param_invalid` (params fail validation: missing required, out of bounds, unknown version) · `feature_param_below_floor` (param below daemon's hard floor, e.g. `dedupe_retention_days < 3`) |

`close_reason` payload shape:
```json
{
  "kind": "feature_unavailable" | "feature_param_invalid" | "feature_param_below_floor",
  "feature": "client_message_id_dedupe",
  "detail": "..."
}
```

Daemon logs the full negotiation payload at WARN before exiting;
supervisor + alerting catches the restart loop. The split into
4011/4012 codes is deferred (see followups doc).

---

## 16. Threat model — unchanged from v4 §16

---

## 17. Migration — broker dedupe table + atomicity (v6)

Broker side, deploy order:

1. `CREATE TABLE mesh.client_message_dedupe` with v6 schema (additive,
   online-safe).
2. `ALTER TABLE mesh.topic_message ADD COLUMN client_message_id`.
3. `ALTER TABLE mesh.message_queue ADD COLUMN client_message_id`.
4. Broker code refactor: every accept path wraps dedupe insert + message
   insert in **one transaction** (§4.7). Pre-generated
   `broker_message_id` (ulid in code) passed in.
5. Broker code: nightly job to delete dedupe rows where `expires_at <
   NOW()` (skip in `permanent` mode).
6. Broker code: hook into the message-retention sweep — when a
   `topic_message` or `message_queue` row is hard-deleted, find the
   matching dedupe row by `client_message_id` and set `history_available
   = FALSE`. (Note: `client_message_id` is nullable on those tables for
   legacy traffic; nullable rows have no dedupe row to update.)
7. Broker code: nightly orphan-check job (§4.7); alerts on non-zero.
8. Broker advertises `client_message_id_dedupe` feature with
   `params.version = 1` and `request_fingerprint: true`.
9. Daemon refuses to start unless that feature bit is advertised with
   valid v1 params.

Rollback plan: feature flag disables fingerprint enforcement broker-side
(falls back to existing pre-v6 behavior — no dedupe). Daemons that
require fingerprint refuse to start. Operator switches off the feature
flag, reverts the daemon, restarts. No data loss; pending dedupe rows
remain in place for the next forward roll.

---

## v0.9.0 lock — what's in vs deferred

**In** (this document): everything codex r1–r4 ratified plus the six
sweet-spot pulls from v7–v9 enumerated at the top — `aborted` outbox
status, `BEGIN IMMEDIATE`, IPC duplicate lookup table, B1/B2/B3 phasing
concept, side-effect inventory, two-layer ID model.

**Deferred** (see `2026-05-03-daemon-spec-broker-hardening-followups.md`):
- B0 dedupe fast-path before rate-limit (v10).
- Lua-scripted idempotent rate limiter keyed by
  `(mesh, client_id, window)` (v10).
- In-tx `mesh.mention_index` (v8).
- 4011 / 4012 close-code split (v6 §15.5 — collapsed to 4010 with
  structured reason JSON for v0.9.0).
- Per-OS fingerprint precedence elaborate table (v8 §2.2.1).
- `request_fingerprint` schema-version-2 in feature negotiation (v6
  §15.1 ships at version 1 with `request_fingerprint: bool`).
- Force-expiry / quarantine semantics for `keypair-archive.json`
  (v8 §14.1.1).

These deferrals are real improvements but not v0.9.0 blockers. They
land as the broker matures and we have actual scale-load to optimize
against.

---

## Cross-spec note: §15.5 close-code collapse

For v0.9.0 we ship a single `4010 feature_unavailable` close code with
a structured `close_reason` JSON payload that distinguishes the
underlying cause:

```json
{
  "close_reason": {
    "kind": "feature_unavailable" | "feature_param_invalid" | "feature_param_below_floor",
    "feature": "client_message_id_dedupe",
    "detail": "..."
  }
}
```

The 4011/4012 split is deferred to followups.

---

## NON-NORMATIVE: round-6 review trailer (preserved for audit only)

> **Not part of the v0.9.0 contract.** Preserved verbatim from the
> v6 source spec as a record of the open questions at the time of the
> codex round-6 review. Items below have either been resolved in this
> merged document, deferred to the followups doc, or superseded.
> Do NOT use this section as a checklist for implementation.

1. **Request fingerprint canonical form (§4.4)** — does JCS work
   cross-language for `meta_canonical_json` (Python json.dumps,
   Go encoding/json, JS JSON.stringify all behave differently)? Should
   we ship a vetted JCS lib in each SDK or fall back to a simpler
   "sorted keys + no spaces + escape-as-stored" rule with conformance
   tests?
2. **Atomicity contract (§4.7)** — is the orphan-check sufficient, or
   does a violation mean we need a "broker rebuild dedupe from messages"
   recovery tool? The latter is destructive but useful for ops emergencies.
3. **Max-age formula (§4.9)** — is the 72h floor correct? Is the
   percentage-based safety margin (`max(24, ceil(0.1 * dedupe_window))`)
   the right shape? Or simpler to say "always 24h"?
4. **`409 idempotency_key_reused` recovery flow (§4.5)** — is sending the
   row to `dead` and surfacing it via `outbox --failed` enough? Should
   the daemon emit a high-priority event for the SSE stream so operators
   are paged immediately?
5. **Diagnostic close codes (§15.5)** — is splitting 4010/4011/4012
   useful, or does it just push complexity onto operators? Should we
   collapse to 4010 with structured close-reason JSON instead?
6. **Anything else still wrong?** Read it as if you were going to
   operate this for a year. What falls down?

Three options:
- **(a) v6 is shippable**: lock the spec, start coding the frozen core.
- **(b) v7 needed**: list the must-fix items.
- **(c) the architecture itself is wrong**: what would you do differently?

Be ruthless.