chore(release): cli 1.22.0 — daemon v0.9.0 + housekeeping
- Bump apps/cli/package.json to 1.22.0 (additive feature: claudemesh daemon long-lived runtime).
- CHANGELOG entry for 1.22.0 covering subcommands, idempotency wiring, crash recovery, and the deferred Sprint 7 broker hardening.
- Roadmap entry for v0.9.0 daemon foundation right above the v2.0.0 daemon redesign section, so the bridge release is documented as the shipped step toward the larger architectural shift.
- Move shipped daemon specs (v1..v10 iteration trail + locked v0.9.0 spec + broker-hardening followups) from .artifacts/specs/ to .artifacts/shipped/ per the project artifact-pipeline convention.

Not in this commit: npm publish and the cli-v1.22.0 GitHub release tag — both are public-distribution actions and require explicit user approval.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
551
.artifacts/shipped/2026-05-03-daemon-final-spec-v10.md
Normal file
@@ -0,0 +1,551 @@
# `claudemesh daemon` — Final Spec v10

> **Round 10.** v9 was reviewed by codex (round 9). The two-layer ID
> model (5/5) and §4.1 wording (4/5) were closed cleanly, but rate-limit
> placement created a worse failure: putting the B1 limiter before the
> dedupe lookup means **idempotent retries burn rate-limit budget**, and a
> daemon retry of an already-committed message during a saturated
> window can be rate-limit-rejected → daemon marks the row `dead` →
> split-brain (the broker has the message, the daemon believes it failed).
>
> **v10 fixes**:
>
> 1. New **Phase B0 dedupe fast-path**: read the dedupe table BEFORE the
>    rate limit. An existing id (match or mismatch) returns immediately
>    without touching rate-limit budget.
> 2. **Idempotent rate-limiter** keyed by `(mesh_id, client_message_id,
>    window_bucket)` so even if two same-id requests race past B0, only
>    the first one consumes budget.
> 3. **§4.11 stale text**: rate limit moved out of the B2 failure mode.
> 4. **§4.7.2 pseudocode reordered** to show B0 → B1 → BEGIN → claim →
>    B2 → B3.
>
> **Intent §0 unchanged from v2.** v10 only revises §4.

---

## 0. Intent — unchanged, see v2 §0

## 1. Process model — unchanged

## 2. Identity — unchanged from v5 §2

## 3. IPC surface — unchanged from v4 §3

---

## 4. Delivery contract — `aborted` clarified, broker phasing, SQLite locking

### 4.1 The contract (precise — v9, two-layer ID model)

> **Two-layer ID rules** (NEW v9 — codex r8):
>
> - **Daemon-layer**: a `client_message_id` is **daemon-consumed** iff an
>   outbox row exists for it. Daemon-mediated callers can never reuse a
>   daemon-consumed id, regardless of whether the broker ever saw it.
>   The daemon's outbox is the single authority for "this id was issued
>   by my caller against this daemon."
> - **Broker-layer**: a `client_message_id` is **broker-consumed** iff a
>   dedupe row exists for `(mesh_id, client_message_id)` in
>   `mesh.client_message_dedupe`. Direct broker callers (none in
>   v0.9.0; reserved for future SDK paths that bypass the daemon) can
>   reuse a broker-non-consumed id freely.
> - In v0.9.0 there are no daemon-bypass clients, so for practical
>   purposes "daemon-consumed" is the operative rule.
>
> **Local guarantee**: each successful `POST /v1/send` returns a stable
> `client_message_id`. The send is durably persisted to `outbox.db`
> before the response returns. The daemon enforces request-fingerprint
> idempotency at the IPC layer (§4.5.1).
>
> **Local audit guarantee**: a `client_message_id` once written to
> `outbox.db` is **never released** (daemon-layer rule). Operator
> recovery via `requeue` always mints a fresh id; the old row stays in
> `aborted` for audit. There is no daemon-side path to free a used id.
>
> **Broker guarantee** (v9 — tightened): a dedupe row exists iff the
> broker accept transaction **committed** (Phase B3 reached). Phase B1
> rejections never insert dedupe rows. Phase B2 rejections roll the
> transaction back, so any partial dedupe row is unwound. Direct
> broker callers retrying after B1/B2 rejection see no dedupe row and
> may reuse the id.
>
> **Atomicity guarantee**: same as v8 §4.1.
>
> **End-to-end guarantee**: at-least-once.

### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2

### 4.3 Broker schema — unchanged from v6 §4.3

### 4.4 Request fingerprint canonical form — unchanged from v6 §4.4

### 4.5 Daemon-local idempotency at the IPC layer (v8 — `aborted` added, SQLite locking)

#### 4.5.1 IPC accept algorithm (v8)

On `POST /v1/send`:

1. Validate request envelope (auth, schema, size limits, destination
   resolvable). Failures here return `4xx` immediately. **No outbox row
   is written; the `client_message_id` is not consumed.**
2. Compute `request_fingerprint` (§4.4).
3. Open a SQLite transaction with `BEGIN IMMEDIATE` (v8 — codex r7) so
   a concurrent IPC accept on the same id serializes against this one.
   `BEGIN IMMEDIATE` acquires the RESERVED lock at transaction start,
   preventing any other connection from starting a write transaction on
   the same database; SQLite has no row-level lock and `SELECT FOR UPDATE`
   is not supported.
4. `SELECT id, request_fingerprint, status, broker_message_id,
   last_error FROM outbox WHERE client_message_id = ?`.
5. Apply the lookup table below. For the "(no row)" case, INSERT the
   new row inside the same transaction.
6. COMMIT.

| Existing row state | Fingerprint match? | Daemon response |
|---|---|---|
| (no row) | — | INSERT new outbox row in `pending`; return `202 accepted, queued` |
| `pending` | match | Return `202 accepted, queued`. No mutation |
| `pending` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_pending_fingerprint_mismatch"`. No mutation |
| `inflight` | match | Return `202 accepted, inflight`. No mutation |
| `inflight` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_inflight_fingerprint_mismatch"` |
| `done` | match | Return `200 ok, duplicate: true, broker_message_id, history_id`. No broker call |
| `done` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_done_fingerprint_mismatch", broker_message_id` |
| `dead` | match | Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_match", reason: "<last_error>"`. Same id never auto-retried |
| `dead` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_mismatch"` |
| **`aborted`** (NEW v8) | **match** | Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_match"`. The id was retired by operator action; never reusable |
| **`aborted`** (NEW v8) | **mismatch** | Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_mismatch"` |

**Rule (v8 — codex r7)**: every IPC `409` carries the daemon's
`request_fingerprint` (8-byte hex prefix) so callers can debug
client/server canonical-form drift. **Every state in the table returns
something deterministic, including `aborted`.** A `client_message_id`
written to `outbox.db` is permanently bound to that row's lifecycle;
the only "free" state is "no row exists".
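The accept steps and lookup table above can be sketched against an in-memory SQLite database. This is a minimal illustration, not the daemon's actual code: `open_outbox`, `ipc_accept`, the trimmed schema, the `row-` id prefix, the response body shapes, and the SHA-256 `fingerprint` stand-in for the §4.4 canonical form are all assumptions made for the example. Envelope validation (step 1) is omitted.

```python
import hashlib
import sqlite3
import time

# Trimmed-down outbox table for the sketch; the full schema is in §4.5.2.
SCHEMA = """
CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  payload             BLOB NOT NULL,
  enqueued_at         INTEGER NOT NULL,
  next_attempt_at     INTEGER NOT NULL,
  status              TEXT NOT NULL,
  last_error          TEXT,
  broker_message_id   TEXT
);
"""


def open_outbox(path=":memory:"):
    # isolation_level=None puts sqlite3 in autocommit mode, so BEGIN/COMMIT are explicit.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.executescript(SCHEMA)
    return conn


def fingerprint(payload: bytes) -> bytes:
    # Hypothetical stand-in for the canonical form of §4.4.
    return hashlib.sha256(payload).digest()


def ipc_accept(conn, client_message_id: str, payload: bytes):
    """Steps 2-6 of the accept algorithm; returns (http_status, body)."""
    fp = fingerprint(payload)
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before the lookup
    try:
        row = conn.execute(
            "SELECT request_fingerprint, status, broker_message_id, last_error"
            " FROM outbox WHERE client_message_id = ?",
            (client_message_id,),
        ).fetchone()
        if row is None:
            now = int(time.time())
            conn.execute(
                "INSERT INTO outbox (id, client_message_id, request_fingerprint,"
                " payload, enqueued_at, next_attempt_at, status)"
                " VALUES (?, ?, ?, ?, ?, ?, 'pending')",
                ("row-" + client_message_id, client_message_id, fp, payload, now, now),
            )
            result = (202, {"state": "queued"})
        else:
            stored_fp, status, broker_id, last_error = row
            match = bytes(stored_fp) == fp
            if match and status == "pending":
                result = (202, {"state": "queued"})
            elif match and status == "inflight":
                result = (202, {"state": "inflight"})
            elif match and status == "done":
                result = (200, {"duplicate": True, "broker_message_id": broker_id})
            else:
                # Every remaining cell of the table is a 409 with a named conflict.
                suffix = "match" if match else "mismatch"
                body = {
                    "error": "idempotency_key_reused",
                    "conflict": f"outbox_{status}_fingerprint_{suffix}",
                    "request_fingerprint": fp[:8].hex(),  # 8-byte hex prefix
                }
                if status == "done":
                    body["broker_message_id"] = broker_id
                if status == "dead" and match:
                    body["reason"] = last_error
                result = (409, body)
        conn.execute("COMMIT")
        return result
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Calling `ipc_accept(conn, "m1", b"hello")` twice returns `202` both times and leaves a single row; a third call with a different payload returns `409` with `outbox_pending_fingerprint_mismatch`.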

#### 4.5.2 Outbox table — fingerprint required

```sql
CREATE TABLE outbox (
  id                   TEXT PRIMARY KEY,
  client_message_id    TEXT NOT NULL UNIQUE,
  request_fingerprint  BLOB NOT NULL,         -- 32 bytes
  payload              BLOB NOT NULL,
  enqueued_at          INTEGER NOT NULL,
  attempts             INTEGER DEFAULT 0,
  next_attempt_at      INTEGER NOT NULL,
  status               TEXT CHECK(status IN
                         ('pending','inflight','done','dead','aborted')),
  last_error           TEXT,
  delivered_at         INTEGER,
  broker_message_id    TEXT,
  aborted_at           INTEGER,               -- NEW v8
  aborted_by           TEXT,                  -- NEW v8: operator/auto
  superseded_by        TEXT                   -- NEW v8: id of the requeue successor row, if any
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
CREATE INDEX outbox_aborted ON outbox(status, aborted_at) WHERE status = 'aborted';
```

`aborted_at`, `aborted_by`, `superseded_by` give operators a clear
audit trail. `superseded_by` lets `outbox inspect` show the chain when
a row was requeued multiple times.

`request_fingerprint` is computed once at IPC accept time and frozen
for the row's lifecycle. The daemon never recomputes it from
`payload`.

### 4.6 Rejected-request semantics — two-layer rules + rate-limit moved to B1 (v9 — codex r8)

> **Two-layer rule (v9)**: a `client_message_id` is **daemon-consumed**
> iff an outbox row exists for it; **broker-consumed** iff a dedupe row
> exists. Daemon-mediated callers see daemon-layer authority (the only
> path in v0.9.0). Pre-validation failures at any layer consume nothing
> at that layer. The two layers are independent: a daemon-consumed id
> may or may not be broker-consumed (depending on whether the send
> reached B3); a daemon-non-consumed id can never be broker-consumed
> (no outbox row ⇒ no broker call from the daemon).

#### 4.6.1 Daemon-side rejection phasing (v9)

| Phase | When daemon rejects | Outbox row? | Daemon-consumed? | Same daemon caller may reuse id? |
|---|---|---|---|---|
| **A. IPC validation** (auth, schema, size, destination resolvable) | Before §4.5.1 step 3 | No | No | Yes — id never written locally |
| **B. Outbox stored, broker network/transient failure** | After IPC accept, broker `5xx` or timeout | `pending` → retried | Yes | N/A — daemon owns retries |
| **C. Outbox stored, broker permanent rejection** | Broker returns `4xx` after IPC accept | `dead` | Yes | No — rotate via `requeue` |
| **D. Operator retirement** | Operator runs `requeue` on `dead` or `pending` row | `aborted` (audit) + new row with fresh id | Yes (still consumed) | Old id NEVER reusable; new id is fresh |

The "daemon-consumed?" column is the daemon-layer authority. It does
not depend on whether the broker ever saw the request: in phase C the
broker has not committed a dedupe row, yet the daemon still holds the
id in `dead` state.

#### 4.6.2 Broker-side rejection phasing (v10 — B0 dedupe fast-path added)

The broker validates in **four phases** relative to dedupe-row
insertion. Phase B0 (NEW v10 — codex r9) makes idempotent retries
free of rate-limit budget so a daemon retry of an already-committed
message can never get rate-limit-rejected:

| Phase | Validation | Side effects | Result for direct broker callers |
|---|---|---|---|
| **B0. Dedupe fast-path** (NEW v10) | Read `mesh.client_message_dedupe` for `(mesh_id, client_message_id)`. **Does not touch rate-limit budget.** | None | If row exists & fingerprint matches → `200 duplicate` with original `broker_message_id`. If row exists & fingerprint mismatches → `409 idempotency_key_reused`. If row absent → continue to B1 |
| **B1. Pre-dedupe-claim** (atomic, external) | Auth (mesh membership), schema, size, mesh exists, member exists, destination kind valid, payload bytes ≤ `max_payload.inline_bytes`, **rate limit not exceeded** (idempotent external limiter — see §4.6.4) | None | `4xx` returned. No dedupe row, no broker-consumed id. Caller may retry with same id once condition clears |
| **B2. Post-dedupe-claim** (in-tx) | Conditions that require the accept transaction to be in progress: destination_ref existence (topic exists, member subscribed, etc.) | INSERT into dedupe rolled back | `4xx` returned, transaction rolled back, no dedupe row remains. Caller may retry with same id |
| **B3. Accepted** | All side effects commit atomically | Dedupe row, message row, history row, delivery_queue rows, mention_index rows | `201` returned with `broker_message_id`. Id is broker-consumed |

**Why B0 is correct (codex r9)**: from the caller's perspective, an
idempotent retry should never be distinguishable from "the call worked".
A retry that the broker can resolve to the original accept must be
resolved before any operation that could fail (rate limit, capacity
check, auth-quota, etc.). B0 is a read (non-mutating, no transaction),
so on the strictly-new-id path it falls through at negligible cost (one
indexed PK lookup against the dedupe table).

**Race semantics for new ids (v10 — codex r9)**: B0 is a non-locking
read; two same-id requests can both miss B0 simultaneously. Without
care, both would consume rate-limit budget. v10 requires the limiter
to be **idempotent over `(mesh_id, client_message_id, window)`**:
budget is consumed at most once per id-window pair regardless of
concurrent retries (§4.6.4). The "second" retry that misses B0 still
sees its `INCR` short-circuited by the limiter and proceeds to B2/B3
without budget impact. Whichever request wins the dedupe `INSERT`
commits; the loser rolls back and returns `200 duplicate` on
fingerprint match or `409` on mismatch.

**Daemon-mediated callers**: in v0.9.0 the daemon is the only B-phase
caller. Daemon-mediated callers see only the daemon-layer rules
(§4.6.1). The broker's "may retry with same id" wording in the table
above applies to direct broker callers only (none in v0.9.0; reserved
for future SDK paths).

**Critical guarantee (v9 — tightened from v8)**: a dedupe row exists
**iff the broker accept transaction committed (B3)**. There is no
broker code path where a permanent 4xx leaves a dedupe row behind.

If the broker decides post-commit that an accepted message is invalid
(async content-policy job, async moderation, etc.), that is NOT a
permanent rejection. It is a follow-up event that operates on the
`broker_message_id`, not on the dedupe key.
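The four-phase ordering can be modelled with a small in-memory broker. This is a sketch under stated assumptions, not the broker implementation: the `Broker` class, its fields, the single-window budget, and the concrete `429`/`404` codes are hypothetical (the spec only says `4xx`). It shows that a retry of a committed id resolves at B0 even when the rate-limit window is saturated.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Broker:
    limit_per_window: int
    topics: set
    # (mesh_id, client_message_id) -> (fingerprint, broker_message_id)
    dedupe: dict = field(default_factory=dict)
    budget_used: int = 0
    # ids that already consumed budget in the (single) window of this sketch
    counted: set = field(default_factory=set)
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1))

    def accept(self, mesh_id, client_id, fingerprint, topic):
        key = (mesh_id, client_id)

        # B0: non-mutating dedupe lookup, before any rate-limit accounting.
        hit = self.dedupe.get(key)
        if hit is not None:
            stored_fp, broker_id = hit
            if stored_fp == fingerprint:
                return 200, {"duplicate": True, "broker_message_id": broker_id}
            return 409, {"error": "idempotency_key_reused"}

        # B1: rate limit, consumed at most once per id (see §4.6.4).
        if key not in self.counted:
            if self.budget_used >= self.limit_per_window:
                return 429, {"error": "rate_limited"}  # assumed status code
            self.budget_used += 1
            self.counted.add(key)

        # B2: in-transaction validation; a failure leaves no dedupe row.
        if topic not in self.topics:
            return 404, {"error": "topic_not_found"}  # assumed status code

        # B3: commit the dedupe row (the real broker commits all side effects here).
        broker_id = f"msg-{next(self._ids)}"
        self.dedupe[key] = (fingerprint, broker_id)
        return 201, {"broker_message_id": broker_id}
```

With `limit_per_window=1`, a first send commits, a second id is rejected at B1, and a retry of the first id still returns `200` with the original `broker_message_id`.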

#### 4.6.4 Rate limiter — idempotent over `(mesh_id, client_message_id, window_bucket)` (v10 — codex r9)

Codex r9 caught: v9's plain `INCR` limiter would let idempotent
retries burn budget. A daemon retry of an already-committed message
that gets rate-limit-rejected creates a split-brain (broker has it,
daemon marks dead). v10 makes the limiter idempotent over
`(mesh_id, client_message_id, window_bucket)` so retries are free.

- **Authority**: same external Redis-style limiter used elsewhere in
  claudemesh, but called via an idempotency-aware wrapper:
  ```
  consume_budget(mesh_id, client_message_id, window_bucket) → {ok, denied}
    Lua / WATCH-MULTI on Redis:
      key  = "rl:"  + mesh_id + ":" + window_bucket
      idem = "rli:" + mesh_id + ":" + client_message_id + ":" + window_bucket
      if EXISTS idem → return ok                  -- already counted
      if INCR key > limit_per_window
        DECR key                                  -- refund this attempt
        return denied
      SET idem 1 EX 2*window_seconds              -- short TTL for repeat-detection
      return ok
  ```
  The `idem` key TTL is small (2× window) to keep memory bounded;
  outside the window, retries that arrive late count as new traffic
  (which is correct, because the original `INCR` count has rolled out
  of the window too).
- **Race semantics**: two same-id requests racing past B0 both arrive
  at `consume_budget`. Whichever Redis call lands first runs the
  conditional `INCR`+`SET idem`; the second sees `EXISTS idem` and
  returns `ok` without `INCR`. Each id-window pair consumes at most
  one budget unit. Implemented in Lua (single round-trip, atomic).
- **B2 rollback non-refund**: if the limiter accepts but the in-tx
  Phase B2 then rejects (e.g. topic not found), the consumed budget
  is **not** refunded. Counter
  `cm_broker_rate_limit_consumed_then_b2_rejected_total` exposes the
  delta. Refunding would require a coordinated rollback across the DB
  tx and the limiter, which we don't want to build.
- **Async counters**: `mesh.rate_limit_counter` (or any DB-resident
  view of "messages-per-mesh-per-window") is **non-authoritative**:
  metrics/telemetry only, rebuilt from the authoritative limiter and
  from message-history. Used for dashboards, not for accept decisions.

This split (idempotent atomic external limiter for enforcement,
async DB counters for telemetry) keeps idempotent retries free of
budget impact, prevents the v9 split-brain, and stays inside the
existing claudemesh rate-limit infrastructure.

**Why B0 still matters even with the idempotent limiter**: the
idempotent limiter prevents budget over-consumption, but it does NOT
make the limiter itself the dedupe authority. B0 is a non-mutating DB
read that resolves committed dedupe rows (the truth) without any
limiter or DB-write side effects at all. For the common retry case
(daemon timeout after broker B3 commit), B0 returns `200 duplicate`
without ever calling the limiter. B0 + idempotent limiter together
mean: idempotent retries are O(1 PK lookup), free, and never visible
to rate-limit accounting.
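The `consume_budget` logic can be exercised without Redis through a plain in-memory model. This is only a sketch of the semantics: the `IdempotentLimiter` class, its dictionaries, and the explicit `now` argument are assumptions for illustration, and a real deployment would need the same steps to run atomically (for example in a single Lua script), which a Python dict does not give you across processes.

```python
class IdempotentLimiter:
    """In-memory model of the idempotent limiter; not atomic across processes."""

    def __init__(self, limit_per_window: int, window_seconds: int):
        self.limit = limit_per_window
        self.window = window_seconds
        self.counters = {}  # "rl:<mesh>:<bucket>" -> count
        self.idem = {}      # "rli:<mesh>:<id>:<bucket>" -> expiry timestamp

    def consume_budget(self, mesh_id: str, client_message_id: str, now: float) -> str:
        bucket = int(now // self.window)
        key = f"rl:{mesh_id}:{bucket}"
        idem = f"rli:{mesh_id}:{client_message_id}:{bucket}"

        # EXISTS idem: this id already consumed budget in this window.
        if self.idem.get(idem, 0) > now:
            return "ok"

        # INCR followed by a refund on overflow nets out to "do not store
        # the incremented value when it exceeds the limit".
        count = self.counters.get(key, 0) + 1
        if count > self.limit:
            return "denied"
        self.counters[key] = count

        # SET idem 1 EX 2*window_seconds
        self.idem[idem] = now + 2 * self.window
        return "ok"
```

For example, with `limit_per_window=2`, sending id `a` twice and id `b` once uses exactly two units; a third id `c` is denied in that window and accepted again once the next window bucket starts.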

#### 4.6.3 Operator recovery via `requeue` (corrected v8)

To unstick a `dead` or `pending`-but-stuck row, operator runs:

```
claudemesh daemon outbox requeue --id <outbox_row_id>
    [--new-client-id <id> | --auto]
    [--patch-payload <path>]
```

This atomically (single SQLite transaction):

1. Sets the existing row's status to `aborted`, sets `aborted_at = now`,
   `aborted_by = "operator"`. Row is **never deleted**; the audit trail
   is permanent.
2. Mints a fresh `client_message_id` (caller-supplied via `--new-client-id`
   or an auto-generated ULID via `--auto`).
3. Inserts a new outbox row in `pending` with the fresh id and the same
   payload (or patched payload if `--patch-payload` was given).
4. Sets `superseded_by = <new_row_id>` on the old row so
   `outbox inspect <old_id>` displays the chain.

**The old `client_message_id` is permanently dead**: `outbox.db` still
holds it via the `aborted` row's `UNIQUE` constraint, and any caller
re-using it gets `409 outbox_aborted_*` per §4.5.1.

If the broker had ever accepted the old id (it reached B3), the broker's
dedupe row is also permanent: duplicate sends to the broker with the old
id would also `409` on fingerprint mismatch (or return the original
`broker_message_id` on matching fingerprint). The daemon-side
`aborted` row and the broker-side dedupe row are independent records of
"this id was used"; neither releases the id.

This is the resolution to v7's contradiction: there is **no path** for
an id to "become free again." If the operator wants to retry the
payload, they get a new id. The old id stays buried.
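The four requeue steps fit in one SQLite transaction. The following is a hedged sketch, not the CLI's implementation: the `requeue` function, its parameters, the `uuid4`-based ids (the spec mints a ULID for `--auto`), and the SHA-256 `sha256_fingerprint` stand-in for §4.4 are assumptions. The table definition repeats §4.5.2 so the example is self-contained.

```python
import hashlib
import sqlite3
import time
import uuid

OUTBOX_SCHEMA = """
CREATE TABLE outbox (
  id TEXT PRIMARY KEY,
  client_message_id TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  payload BLOB NOT NULL,
  enqueued_at INTEGER NOT NULL,
  attempts INTEGER DEFAULT 0,
  next_attempt_at INTEGER NOT NULL,
  status TEXT CHECK(status IN ('pending','inflight','done','dead','aborted')),
  last_error TEXT,
  delivered_at INTEGER,
  broker_message_id TEXT,
  aborted_at INTEGER,
  aborted_by TEXT,
  superseded_by TEXT
);
"""


def sha256_fingerprint(payload: bytes) -> bytes:
    # Hypothetical stand-in for the canonical fingerprint of §4.4.
    return hashlib.sha256(payload).digest()


def requeue(conn, outbox_row_id, new_client_id=None, patched_payload=None,
            fingerprint_fn=sha256_fingerprint):
    """Abort a dead/pending row and enqueue its payload under a fresh id.

    `conn` must be in autocommit mode (isolation_level=None).
    Returns (new_row_id, new_client_message_id).
    """
    fresh_id = new_client_id or uuid.uuid4().hex  # assumption: real CLI mints a ULID
    new_row_id = uuid.uuid4().hex
    now = int(time.time())
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT status, payload FROM outbox WHERE id = ?", (outbox_row_id,)
        ).fetchone()
        if row is None or row[0] not in ("dead", "pending"):
            raise ValueError("only dead or pending rows can be requeued")
        payload = patched_payload if patched_payload is not None else row[1]
        # Steps 1 and 4: retire the old row, keep it for audit, link the successor.
        conn.execute(
            "UPDATE outbox SET status = 'aborted', aborted_at = ?,"
            " aborted_by = 'operator', superseded_by = ? WHERE id = ?",
            (now, new_row_id, outbox_row_id),
        )
        # Steps 2 and 3: new row with the fresh id; UNIQUE rejects any reused id.
        conn.execute(
            "INSERT INTO outbox (id, client_message_id, request_fingerprint,"
            " payload, enqueued_at, next_attempt_at, status)"
            " VALUES (?, ?, ?, ?, ?, ?, 'pending')",
            (new_row_id, fresh_id, fingerprint_fn(payload), payload, now, now),
        )
        conn.execute("COMMIT")
        return new_row_id, fresh_id
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Passing an already-used `client_message_id` as `new_client_id` fails on the `UNIQUE` constraint and rolls the whole operation back, so the old row is not aborted unless the successor row was actually written.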

### 4.7 Broker atomicity contract — side-effect classification (v9)

#### 4.7.1 Side effects (v9 — rate limit moved to B1 external)

Every successful broker accept atomically commits these durable
state changes in **one transaction**:

| Effect | Table | In-tx? | Why |
|---|---|---|---|
| Dedupe record | `mesh.client_message_dedupe` | **Yes** | Idempotency authority |
| Message body | `mesh.topic_message` / `mesh.message_queue` | **Yes** | Authoritative store |
| History row | `mesh.message_history` | **Yes** | Replay log; lost-on-rollback would break ordered replay |
| Fan-out work | `mesh.delivery_queue` | **Yes** | Each recipient must see exactly the messages that committed |
| Mention index entries | `mesh.mention_index` | **Yes** | Mention-query reads must match committed messages |

**Outside the transaction** — non-authoritative or rebuildable, with
explicit rationale per item:

| Effect | Where | Why outside |
|---|---|---|
| WS push to live subscribers | Async after COMMIT | Live notifications are best-effort; receivers re-fetch from history on reconnect |
| Webhook fan-out | Async via `delivery_queue` workers | Off-band; consumes committed `delivery_queue` rows |
| Rate-limit **counters** (telemetry only) | Async, eventually consistent | Authoritative limiter is the external Redis-style INCR in B1 (§4.6.4); the DB counter is rebuilt for dashboards, not consulted for accept |
| Audit log entries | Async append-only stream | Audit log can be rebuilt from message history; in-tx writes hurt p99 |
| Search/FTS index updates | Async via outbox-pattern worker | Index can be rebuilt from authoritative tables |
| Metrics | Prometheus, pull-based | Always non-authoritative |

If any in-transaction insert fails, the transaction rolls back
completely. The broker returns `5xx` to the daemon, and the daemon
retries. No partial state.

The async side effects are driven off the in-transaction
`delivery_queue` and `message_history` rows, so they cannot get ahead
of committed state; they can only lag behind.

#### 4.7.2 Pseudocode — corrected and final (v10)

```sql
-- =========================================================================
-- Phase B0: dedupe fast-path (NEW v10 — codex r9). Non-mutating.
-- Resolves idempotent retries WITHOUT touching rate-limit budget.
-- =========================================================================
SELECT broker_message_id, request_fingerprint, history_available, first_seen_at
  FROM mesh.client_message_dedupe
 WHERE mesh_id = $mesh_id AND client_message_id = $client_id;

-- If row exists:
--   fingerprint match    → return 200 duplicate (broker_message_id, history_available). Done.
--   fingerprint mismatch → return 409 idempotency_key_reused. Done.
-- Otherwise: row absent → continue.

-- =========================================================================
-- Phase B1: schema/auth/size validation + idempotent rate-limit consume.
-- All before any DB transaction. Failures here return 4xx without opening a tx.
-- =========================================================================
-- consume_budget(mesh_id, client_id, window_bucket) — Lua/Redis (§4.6.4).
-- Idempotent over (mesh_id, client_id, window_bucket): retries within window
-- consume at most once.

-- =========================================================================
-- Phase B2 + B3: in-transaction claim and side effects.
-- =========================================================================
BEGIN;

INSERT INTO mesh.client_message_dedupe
    (mesh_id, client_message_id, broker_message_id, request_fingerprint,
     destination_kind, destination_ref, expires_at)
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
        $dest_kind, $dest_ref, $expires_at)
ON CONFLICT (mesh_id, client_message_id) DO NOTHING;

-- Inspect the row that's actually there now (ours or a racer's).
SELECT broker_message_id, request_fingerprint, destination_kind,
       destination_ref, history_available, first_seen_at
  FROM mesh.client_message_dedupe
 WHERE mesh_id = $mesh_id AND client_message_id = $client_id
 FOR SHARE;

-- Branch:
--   row.broker_message_id == $msg_id → we won the race; continue to side effects.
--   row.broker_message_id != $msg_id → racer won. Compare fingerprints:
--     fingerprint match    → ROLLBACK; return 200 duplicate (the rare race-vs-B0 case
--                            where two concurrent first-time-but-same-id requests
--                            both missed B0 and one beat the other to the INSERT).
--     fingerprint mismatch → ROLLBACK; return 409 idempotency_key_reused.

-- Phase B2 validation: destination_ref existence (topic exists,
-- member subscribed, etc.). Rate limit is NOT here — it was checked
-- in B1 (§4.6.4) before this transaction opened.
-- If B2 fails → ROLLBACK; return 4xx (no dedupe row remains).

-- Phase B3: insert all in-tx side effects (§4.7.1).
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
VALUES ($msg_id, $mesh_id, $client_id, ...);

INSERT INTO mesh.message_history (broker_message_id, mesh_id, ...)
VALUES ($msg_id, $mesh_id, ...);

INSERT INTO mesh.delivery_queue (broker_message_id, recipient_pubkey, ...)
SELECT $msg_id, member_pubkey, ...
  FROM mesh.topic_subscription
 WHERE topic = $dest_ref AND mesh_id = $mesh_id;

INSERT INTO mesh.mention_index (broker_message_id, mentioned_pubkey, ...)
SELECT $msg_id, mention_pubkey, ...
  FROM unnest($mention_list);

COMMIT;

-- After COMMIT, async workers consume delivery_queue and update
-- search indexes, audit logs, rate-limit counters, etc.
```

#### 4.7.3 Orphan check — same as v7 §4.7.3

Extended over the side-effect inventory to verify that the in-tx items
are consistent.

### 4.8 Outbox max-age math — unchanged from v7 §4.8

Minimum `dedupe_retention_days = 7`; the derived `max_age_hours = window -
safety_margin` is strictly less than the window; the `safety_margin`
floor is 24h.

### 4.9 Inbox schema — unchanged from v3 §4.5

### 4.10 Crash recovery — unchanged from v3 §4.6

### 4.11 Failure modes — B0/B1/B2 distinction (v10)

- **IPC accept fingerprint-mismatch on duplicate id** (any state):
  returns 409 with `conflict` field per §4.5.1. Caller must use a new id.
- **IPC accept against `aborted` row, fingerprint match**: returns 409
  per §4.5.1. Caller must use a new id; the old id is permanently retired.
- **Outbox row stuck in `dead`**: operator runs `outbox requeue` per
  §4.6.3; old id stays in `aborted`, new id is fresh.
- **Broker fingerprint mismatch on retry**: at B0 → returns 409
  immediately (no rate-limit consumed). Daemon marks `dead`; operator
  requeue path.
- **Idempotent retry of an already-committed id during a saturated
  rate-limit window** (NEW v10): B0 fast-path returns `200 duplicate`
  with the original `broker_message_id`. Rate-limit budget is NOT
  consumed. Daemon transitions outbox row from `pending`/`inflight`
  to `done`. **No split-brain.** This is the key correctness fix
  from codex r9.
- **Daemon retry after dedupe row hard-deleted by broker retention
  sweep**: cannot happen unless the operator overrode `max_age_hours`.
- **Broker phase B1 rejection (rate limit, schema, size, etc.)**: no
  dedupe row exists; daemon receives 4xx; idempotent limiter ensures
  retries within window don't re-consume budget. If the rejection is
  permanent (size, schema), daemon marks `dead`. If transient (rate
  limit), daemon retries with exponential backoff until the window
  clears or `max_age_hours` is exhausted.
- **Broker phase B2 rejection on retry**: same id reaches B2 and the
  in-tx condition fails (topic deleted, member unsubscribed). B2
  rolls back the dedupe insert; no dedupe row remains. Daemon
  receives 4xx → marks `dead`. Operator can `requeue` if the condition
  clears (note: `requeue` mints a fresh id per §4.6.3, so the old id
  stays `aborted`).
- **Atomicity violation found by orphan check**: alerts ops.
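The daemon-side reaction to each broker outcome listed above can be summarised as a small classifier. This is a sketch under assumptions, not the daemon's code: the spec only distinguishes transient from permanent `4xx`, so treating `429` as the rate-limit signal is an assumption, as are the names `Outcome`, `classify_broker_response`, `next_attempt_delay` and the backoff constants.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    DONE = "done"      # broker holds the message (201, or 200 duplicate via B0)
    RETRY = "pending"  # transient: keep the row and schedule another attempt
    DEAD = "dead"      # permanent rejection: operator must requeue


def classify_broker_response(status: Optional[int]) -> Outcome:
    """Map a broker HTTP status (None = timeout/network error) to an outbox state."""
    if status is None or status >= 500:
        return Outcome.RETRY
    if status in (200, 201):
        return Outcome.DONE
    if status == 429:  # assumed code for the transient B1 rate-limit rejection
        return Outcome.RETRY
    if 400 <= status < 500:
        return Outcome.DEAD  # includes 409 idempotency_key_reused and B2 rejections
    raise ValueError(f"unexpected broker status: {status}")


def next_attempt_delay(attempts: int, base_seconds: float = 1.0,
                       cap_seconds: float = 300.0) -> float:
    """Exponential backoff with a cap; the constants are illustrative only."""
    return min(cap_seconds, base_seconds * 2 ** attempts)
```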

---

## 5-13. — unchanged from v4

## 14. Lifecycle — unchanged from v5 §14

## 15. Version compat — unchanged from v7 §15

## 16. Threat model — unchanged

---

## 17. Migration — v8 outbox columns + broker phase B2 (v8)

Broker side, deploy order: same as v7 §17, with one addition:

- Step 4.5: explicitly split broker accept into Phase B1 (pre-dedupe
  validation, returns 4xx without writing) and Phase B2/B3 (within the
  accept transaction). Implementation: refactor handler to validate
  Phase B1 conditions before opening the DB transaction.

Daemon side:

- Outbox schema gains `aborted_at`, `aborted_by`, `superseded_by`
  columns and the `aborted` enum value (§4.5.2). Migration applies via
  `INSERT INTO new SELECT * FROM old` recreation if needed; v0.9.0 is
  greenfield.
- IPC accept switches to `BEGIN IMMEDIATE` for SQLite serialization
  (§4.5.1 step 3).
- IPC accept handles `aborted` rows per §4.5.1 (always 409).
- `claudemesh daemon outbox requeue` always mints a fresh
  `client_message_id`; never frees the old id. `--new-client-id <id>`
  and `--auto` are the only modes; the old `client_message_id`
  argument is removed.

---

## What changed v8 → v9 (codex round-8 actionable items)

| Codex r8 item | v9 fix | Section |
|---|---|---|
| Cross-layer ID-consumed authority contradiction | Two-layer model: daemon-consumed iff outbox row; broker-consumed iff dedupe row committed; daemon-mediated callers see only daemon-layer authority | §4.1, §4.6.1, §4.6.2 |
| Rate-limit authority muddled (B2 vs async counters) | Rate limit moved to B1 via external atomic limiter (Redis-style INCR with TTL); DB rate-limit counters demoted to telemetry-only | §4.6.2, §4.6.4, §4.7.1 |
| §4.1 broker guarantee fuzzy | Tightened: "dedupe row exists iff broker accept transaction committed (B3)" | §4.1, §4.6.2 |

(Earlier rounds' fixes preserved unchanged.)

---

## What needs review (round 9)

1. **Two-layer ID model (§4.1, §4.6.1)** — is the daemon-vs-broker
   authority split clear, or does it create more confusion for
   operators reading "consumed" in different contexts? Should we use
   different verbs (e.g. "claimed" at daemon, "committed" at broker)?
2. **Rate-limit external limiter (§4.6.4)** — is "atomic external
   limiter" specified concretely enough? Is the over-counting on
   limiter-accepted-then-B2-rejected acceptable?
3. **B2 contents after rate-limit move** — B2 now only has
   `destination_ref existence`. Worth keeping a B2 phase at all, or
   collapse into B1+B3?
4. **Anything else still wrong?** Read it as if you were going to
   operate this for a year.

Three options:

- **(a) v9 is shippable**: lock the spec, start coding the frozen core.
- **(b) v10 needed**: list the must-fix items.
- **(c) the architecture itself is wrong**: what would you do differently?

Be ruthless.
853
.artifacts/shipped/2026-05-03-daemon-final-spec-v2.md
Normal file
@@ -0,0 +1,853 @@
|
||||
# `claudemesh daemon` — Final Spec v2
|
||||
|
||||
> **Round 2 after a critical first-pass review.** v1 of this spec was reviewed
|
||||
> by another model and pushed back on identity model, no-auth IPC, "exactly-once"
|
||||
> overclaim, hook credentials, surface bloat, and missing operational flows
|
||||
> (rotation, image clones, schema migration, threat model). v2 incorporates all
|
||||
> of those.
|
||||
|
||||
---
|
||||
|
||||
## 0. Intent — what this is, what it isn't
|
||||
|
||||
### 0.1 The product reality
|
||||
|
||||
claudemesh today is a **peer mesh runtime for Claude Code sessions**. Each
|
||||
session runs `claudemesh launch`, opens a WebSocket to a managed broker, gets
|
||||
ephemeral identity, sends/receives DMs and topic messages with other Claude Code
|
||||
sessions, posts to shared state, deploys MCP servers / skills / files,
|
||||
participates in tasks, schedules reminders. Everything is E2E encrypted with
|
||||
crypto_box envelopes for DMs and per-topic symmetric keys for topics. The broker
|
||||
is a routing/persistence layer; peers do the actual work.
|
||||
|
||||
The CLI is the canonical surface — every operation is a `claudemesh <verb>`.
|
||||
The MCP server is a "tool-less push pipe" that surfaces inbound messages to
|
||||
Claude Code as channel notifications. There is also a web dashboard, an `/v1/*`
|
||||
REST API, and an existing apikey auth model for external integrations.
|
||||
|
||||
### 0.2 The gap
|
||||
|
||||
Anything that **isn't a Claude Code session** is a second-class citizen:
|
||||
|
||||
- A RunPod handler that wants to alert a peer when an OOM happens has only
|
||||
one option: curl an apikey-authed REST endpoint. One-way only. The handler
|
||||
is not a peer — it can't be DM'd back, can't be `@-mentioned`, can't be in
|
||||
`peer list`, can't claim a task assigned to it, can't host an MCP service or
|
||||
share a skill. It's a webhook spoke, not a participant.
|
||||
|
||||
- A Temporal worker that wants to track its own progress in shared mesh state,
|
||||
publish to a `#alerts` topic, and listen for "retry now" instructions has
|
||||
no good shape. Either it shells out to `claudemesh send` cold-path
|
||||
(a fresh WS handshake per message — ~1s latency, broker churn, no inbound
|
||||
path) or it speaks the WS protocol manually (significant code, no SDK).
|
||||
|
||||
- A long-running CI runner, an IoT box, a phone app, a future Python or Go
|
||||
service — none can be **first-class peers** without writing the same WS
|
||||
reconnect / queue / encryption / presence code that the existing CLI already
|
||||
has, plus an IPC surface so the host's apps can use it without re-implementing
|
||||
any of that.
|
||||
|
||||
### 0.3 What this daemon is
|
||||
|
||||
A long-running process — the same `claudemesh-cli` binary in `daemon` mode —
|
||||
that turns any host into a **first-class peer**:
|
||||
|
||||
- Stable identity across restarts (the host *is* a member of the mesh, not a
|
||||
series of disconnected sessions).
|
||||
- Persistent WS to the broker, with reconnect, queue, dedupe.
|
||||
- Local IPC surface (UDS + loopback HTTP + SSE) that any local app can hit
|
||||
to send, subscribe, query — without learning the broker protocol or carrying
|
||||
long-lived secrets in app code.
|
||||
- Hooks: shell scripts that fire on events. Server replies to DMs, auto-claims
|
||||
tasks, escalates errors — without the app being involved.
|
||||
- Same security primitives as `claudemesh launch` (mesh keypair, crypto_box,
|
||||
per-topic keys). No new auth model toward the broker.
|
||||
|
||||
The daemon **is the runtime**. The CLI in cold-path mode is a fallback. The
|
||||
Claude Code MCP integration is one client of the daemon (eventually).
|
||||
|
||||
### 0.4 What this daemon is NOT
|
||||
|
||||
- **Not a webhook gateway.** `/v1/notify` and apikeys remain the path for
|
||||
systems that can't host the runtime (third-party SaaS, monitoring tools).
|
||||
The daemon is for systems that *can* run a process — code you control.
|
||||
|
||||
- **Not a generic message broker.** It speaks claudemesh protocol to one
|
||||
managed broker. It is not a substitute for NATS, Redis, Kafka, RabbitMQ.
|
||||
|
||||
- **Not a Slack replacement.** Topics, DMs, mentions exist because *AI
|
||||
sessions* use them. Humans interact via the dashboard or a Claude Code
|
||||
session, not by reading the daemon's inbox directly.
|
||||
|
||||
- **Not a fleet manager.** One daemon manages one mesh on one host. Multi-mesh
|
||||
on one host is supported (one daemon per mesh, supervised). Cross-host
|
||||
supervision is an external concern (systemd, k8s, etc.) — the daemon doesn't
|
||||
reach across hosts.
|
||||
|
||||
### 0.5 Who deploys this
|
||||
|
||||
- A developer running `claudemesh daemon up` on their laptop so their open
|
||||
Claude Code sessions all share one persistent connection (instead of each
|
||||
opening its own ephemeral WS).
|
||||
- The same developer running `claudemesh daemon install-service` on their VPS,
|
||||
RunPod pod, Temporal worker, CI runner — turning each into an
|
||||
addressable peer that scripts on that host can talk to via local IPC.
|
||||
- Eventually: language SDKs (Python / Go / TypeScript) talking to the daemon
|
||||
on `localhost`, exposing claudemesh as a first-class API for any app the
|
||||
developer writes.
|
||||
|
||||
### 0.6 Pre-launch posture
|
||||
|
||||
No users yet. We can break protocol, schema, surface, anything. Optimize for
|
||||
the architecture we want to live with for years, not for the smallest
|
||||
shippable cut. Codex pushed back on v1 on this exact axis: do not ship
|
||||
graph/vector/MCP/skills/tasks on day one — freeze a small, hardened core,
|
||||
expand deliberately.
|
||||
|
||||
---
|
||||
|
||||
## 1. Process model
|
||||
|
||||
**One daemon per (user, mesh)**. Persistent. Survives reboots via OS
|
||||
supervisor. Serves multiple local apps concurrently.
|
||||
|
||||
```
|
||||
~/.claudemesh/daemon/<mesh-slug>/
|
||||
pid 0600 pidfile, cleaned on shutdown
|
||||
sock 0600 unix domain socket (primary IPC)
|
||||
http.port 0644 auto-allocated loopback port
|
||||
local_token 0600 per-daemon bearer for HTTP/TCP transports
|
||||
keypair.json 0600 persistent ed25519 + x25519 — daemon identity
|
||||
host_fingerprint.json 0600 machine-id + boot-id + interface mac digest
|
||||
config.toml 0644 user-editable runtime tuning
|
||||
outbox.db 0600 SQLite — durable outbound queue
|
||||
inbox.db 0600 SQLite — N-day inbound history, FTS-indexed
|
||||
schema_version 0644 integer; gates online migrations
|
||||
daemon.log 0644 JSON-lines, rotating (100 MB / 14 d)
|
||||
hooks/ 0700 user-managed event scripts
|
||||
```
|
||||
|
||||
**Resource caps (defaults, configurable):**
|
||||
|
||||
| Resource | Default | Why |
|---|---|---|
| RSS | 256 MB | Most workloads stay under 50 MB; cap protects multi-mesh hosts |
| CPU | unlimited | Hook fan-out can spike briefly; rely on OS scheduler |
| Outbox DB | 5 GB | At 1 KB avg msg, that's 5M queued. Disk-full handling per §4.4 |
| Inbox DB | 5 GB | Same |
| File descriptors | 1024 | UDS clients + SSE streams + DB handles + WS |
| SSE concurrent | 32 streams | DoS protection; configurable up |
| IPC concurrent | 64 in-flight | Backpressure beyond this returns `429 daemon_busy` |
| Hook concurrency | 8 | Bounded pool; overflow queues |
|
||||
|
||||
Single binary. Same `claudemesh-cli` package; `daemon` is one of its modes.
|
||||
|
||||
## 2. Identity — persistent member by default, ephemeral on opt-in, clone-aware
|
||||
|
||||
### 2.1 Modes
|
||||
|
||||
```
|
||||
claudemesh daemon up # default: persistent member
|
||||
claudemesh daemon up --ephemeral # session-shaped, no keypair persisted
|
||||
claudemesh daemon up --ephemeral --ttl=2h # auto-shutdown after TTL
|
||||
```
|
||||
|
||||
- **Persistent (default)**: ed25519 + x25519 keypair stored in `keypair.json`.
|
||||
Same identity across restarts, reconnects, supervisor cycles. Right for
|
||||
servers, workers, addressable peers.
|
||||
- **Ephemeral**: keypair generated in memory, never written. Daemon exits =
|
||||
identity gone. Right for CI jobs, preview environments, disposable RunPod
|
||||
pods, test harnesses, build agents, anything that should not leave a peer
|
||||
ghost in the broker after teardown.
|
||||
- **`--ttl <duration>`** on ephemeral mode: auto-shutdown after the duration,
|
||||
or after `claudemesh daemon down`, whichever first. Broker member record
|
||||
cleaned up on shutdown.
|
||||
|
||||
### 2.2 Image-clone detection
|
||||
|
||||
Two daemons booting with the same `keypair.json` (VM image clone, container
|
||||
copy, restored backup) is a serious failure mode — broker sees connection
|
||||
collisions, presence flickers, encrypted messages route to the wrong host.
|
||||
|
||||
Handled in three places:
|
||||
|
||||
1. **Daemon side**: `host_fingerprint.json` is written on first startup —
|
||||
`sha256(machine-id || boot-id || mac-of-default-iface || hostname)`. On every
|
||||
subsequent startup, the fingerprint is recomputed and compared. If it
|
||||
differs, the daemon **refuses to start** unless `--accept-cloned-identity`
|
||||
is passed (writes a fresh fingerprint and continues with the same keypair —
|
||||
for legitimate hardware migrations) or `--remint` is passed (mints fresh
|
||||
keypair, registers as a new member, broker reaps the old member after
|
||||
grace period).
|
||||
2. **Broker side**: tracks `lastSeenHostFingerprint` per member. On
|
||||
reconnection from a different fingerprint, broker emits a
|
||||
`member_clone_suspected` security event to the mesh owner's dashboard.
|
||||
Connection itself is allowed (legitimate hardware swaps happen) but visible
|
||||
for audit.
|
||||
3. **Mesh owner**: `claudemesh member revoke <pubkey>` revokes the keypair
|
||||
server-side; daemon receives `keypair_revoked` push event on next
|
||||
connection and self-disables.
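
A minimal sketch of the daemon-side check in step 1, in Python for brevity. The function names, the JSON layout of `host_fingerprint.json`, and the NUL separator between fields are assumptions for illustration; the `--remint` path is not shown.

```python
import hashlib
import json
from pathlib import Path


def compute_fingerprint(machine_id: str, boot_id: str, mac: str, hostname: str) -> str:
    """sha256 over the four host attributes (NUL-separated; the separator is an assumption)."""
    material = "\x00".join([machine_id, boot_id, mac, hostname]).encode("utf-8")
    return hashlib.sha256(material).hexdigest()


def check_fingerprint(path: Path, current: str, accept_cloned: bool = False) -> str:
    """Return 'first-start', 'match', or 'accepted'; raise on an unacknowledged mismatch."""
    if not path.exists():
        path.write_text(json.dumps({"fingerprint": current}))
        path.chmod(0o600)
        return "first-start"
    stored = json.loads(path.read_text())["fingerprint"]
    if stored == current:
        return "match"
    if accept_cloned:
        # Mirrors --accept-cloned-identity: keep the keypair, record the new host.
        path.write_text(json.dumps({"fingerprint": current}))
        return "accepted"
    raise RuntimeError("host fingerprint changed; refusing to start")
```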
|
||||
|
||||
### 2.3 Rename
|
||||
|
||||
`--name` is taken at first `daemon up`; subsequent runs read the keypair file
|
||||
and ignore `--name` unless `--rename` is passed (which produces a
|
||||
`member_renamed` event the broker propagates to peers).
|
||||
|
||||
## 3. IPC surface — stable core only in v0.9.0
|
||||
|
||||
### 3.1 Frozen core surface (v0.9.0)
|
||||
|
||||
Codex's feedback: do not ship every CLI verb on day one. A small hardened core
|
||||
first, expand under explicit capability gates.
|
||||
|
||||
```
|
||||
# Messaging — durable, tested
|
||||
POST /v1/send {to, message, priority?, meta?, replyToId?}
|
||||
POST /v1/topic/post {topic, message, priority?, mentions?}
|
||||
POST /v1/topic/subscribe {topic} (idempotent)
|
||||
POST /v1/topic/unsubscribe {topic}
|
||||
GET /v1/topic/list
|
||||
GET /v1/inbox ?since=<iso>&topic=<n>&from=<peer>&limit=<n>
|
||||
GET /v1/inbox/search ?q=<fts-query>&limit=<n> (FTS5)
|
||||
|
||||
# Peers + presence — read-only on day one
|
||||
GET /v1/peers ?mesh=<slug>
|
||||
POST /v1/profile {summary?, status?, visible?} (limited fields)
|
||||
|
||||
# Files — already production in CLI
|
||||
POST /v1/file/share {path, to?, message?, persistent?}
|
||||
GET /v1/file/get ?id=<fileId>&out=<path>
|
||||
GET /v1/file/list
|
||||
|
||||
# Events — push
|
||||
GET /v1/events text/event-stream
|
||||
core events: message, peer_join, peer_leave, file_shared,
|
||||
daemon_disconnect, daemon_reconnect, hook_executed
|
||||
|
||||
# Control plane
|
||||
GET /v1/health {connected, lag_ms, queue_depth, inflight,
|
||||
mesh, member_pubkey, uptime_s, schema_version,
|
||||
daemon_version, broker_version}
|
||||
GET /v1/metrics Prometheus exposition
|
||||
GET /v1/version {daemon, schema, ipc_api} (negotiation)
|
||||
POST /v1/heartbeat {} (caller-side liveness signal)
|
||||
```
|
||||
|
||||
That's it. ~20 endpoints. Battle-test these before adding more.
|
||||
|
||||
### 3.2 Capability-gated future surface (v0.9.x roadmap)
|
||||
|
||||
Behind explicit feature flags in `config.toml`, post-v0.9.0:
|
||||
|
||||
```toml
|
||||
[capabilities]
|
||||
state = false # /v1/state/{set,get,list}
|
||||
memory = false # /v1/memory/{remember,recall}
|
||||
vector = false # /v1/vector/{store,search,delete}
|
||||
graph = false # /v1/graph/query
|
||||
tasks = false # /v1/task/{create,claim,complete}
|
||||
scheduling = false # /v1/scheduling/remind
|
||||
mcp_host = false # /v1/mcp/{register,call} (LARGEST surface; treat as v1.0)
|
||||
skill_share = false # /v1/skill/{deploy,share}
|
||||
```
|
||||
|
||||
Each capability is its own ship: design review, security review, test
|
||||
coverage, capability-token model, then enable. None enabled in v0.9.0.
|
||||
|
||||
### 3.3 Local IPC authentication
|
||||
|
||||
Codex was right: loopback TCP without auth is an attack surface (browser SSRF,
|
||||
container side-channels, sandboxed apps with network but no FS access, WSL
|
||||
host-shared loopback).
|
||||
|
||||
| Transport | Auth | Rationale |
|---|---|---|
| UDS | None (relies on FS perms 0600) | Reaching the socket = same UID = can read keypair anyway |
| TCP loopback | **Required**: `Authorization: Bearer <local_token>` | Browser/container/sandbox can reach loopback without FS access |
| SSE | Required: `Authorization: Bearer <local_token>` | Same |
|
||||
|
||||
`local_token` is 32 bytes of `crypto.randomBytes` (256 bits), encoded base64url,
|
||||
written to `local_token` mode 0600 at daemon init. Rotated on `claudemesh
|
||||
daemon rotate-token`. SDKs auto-discover the token by reading the file (same
|
||||
mechanism as discovering the socket path).
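
As a rough illustration of the token handling described above, the following Python sketch generates a 32-byte base64url token, writes it with owner-only permissions, and checks a bearer header in constant time. The helper names are hypothetical, and the daemon itself would use `crypto.randomBytes` rather than Python's `secrets`.

```python
import hmac
import os
import secrets
from pathlib import Path


def write_local_token(path: Path) -> str:
    """Generate 32 random bytes, base64url-encode them, and create the file with mode 0600."""
    token = secrets.token_urlsafe(32)  # 32 bytes of entropy, unpadded base64url
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(token)
    return token


def bearer_ok(authorization_header: str, expected_token: str) -> bool:
    """Constant-time check of an `Authorization: Bearer <token>` header value."""
    scheme, _, presented = authorization_header.partition(" ")
    return scheme == "Bearer" and hmac.compare_digest(
        presented.encode(), expected_token.encode()
    )
```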
|
||||
|
||||
**Additional defenses:**
|
||||
- HTTP listener binds **127.0.0.1 only**. Refuses to bind elsewhere unless
|
||||
`[ipc] http_bind = "..."` is set explicitly **and** `[ipc] http_external_auth = "..."`
|
||||
points to a separate token file (escape hatch for advanced users; never the default).
|
||||
- `Origin` header check: rejects requests with `Origin` set unless it's
|
||||
explicitly allowlisted in config (default: empty allowlist). Defends against
|
||||
browser SSRF.
|
||||
- `Host` header check: must be `localhost` or `127.0.0.1`. Defends against DNS
|
||||
rebinding.
|
||||
- CORS: `Access-Control-Allow-Origin` never echoed; preflight returns `403`.
|
||||
- `User-Agent` required (rejects empty UA — mild signal against simple SSRF).
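
The header checks above could be sketched as a single gate function. This is an illustration only: the function name and reason strings are made up, and tolerating a `:port` suffix on `Host` is an assumption the text does not state.

```python
from typing import Mapping, Optional, Set


def reject_reason(headers: Mapping[str, str], origin_allowlist: Set[str]) -> Optional[str]:
    """Return None if the request passes the header checks, else a short reason."""
    h = {k.lower(): v for k, v in headers.items()}
    # Host must be localhost or 127.0.0.1 (defends against DNS rebinding).
    host = h.get("host", "").split(":", 1)[0]
    if host not in ("localhost", "127.0.0.1"):
        return "bad_host"
    # Any Origin header is rejected unless explicitly allowlisted (default: empty).
    origin = h.get("origin")
    if origin is not None and origin not in origin_allowlist:
        return "origin_not_allowed"
    # Empty or missing User-Agent is rejected.
    if not h.get("user-agent", "").strip():
        return "missing_user_agent"
    return None
```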
|
||||
|
||||
### 3.4 Request limits + backpressure
|
||||
|
||||
- Max request body: **1 MB** (override per endpoint; file uploads use a separate
|
||||
streaming endpoint).
|
||||
- Max response body: **10 MB**; truncated with `Link: rel=next` cursor.
|
||||
- Max in-flight IPC requests: **64**. Beyond → `429 daemon_busy`.
|
||||
- Max SSE concurrent streams: **32**. Beyond → `429 too_many_streams`.
|
||||
- Per-token rate limit: **100 req/sec** sustained, 1000/sec burst (token
|
||||
bucket). Tunable.
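
One way to read the per-token limit is a bucket that refills at 100 tokens/sec and holds at most 1000. That reading is an assumption, as is the class below; the clock is injectable so the behaviour can be checked without sleeping.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: `rate` tokens per second sustained, up to `burst` stored."""

    def __init__(self, rate: float = 100.0, burst: float = 1000.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate, self.burst, self.clock = rate, burst, clock
        self.tokens = burst
        self.updated = clock()

    def allow(self) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```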
|
||||
|
||||
## 4. Delivery contract — durable at-least-once with idempotent send
|
||||
|
||||
Codex was right: "exactly-once" is a lie. Replacing the claim with a precise
|
||||
contract.
|
||||
|
||||
### 4.1 The contract
|
||||
|
||||
> **The daemon guarantees: each successful send call enqueues exactly one
> outbox row, identified by a stable `messageId`, and the daemon retries that
> row until the broker ACKs it or it expires to `dead`. The daemon does not
> guarantee that downstream peers process the message exactly once; that is
> the receiver's responsibility, aided by the propagated `idempotency_key`.**
|
||||
|
||||
Concretely:
|
||||
|
||||
- **Caller → daemon**: caller may supply `Idempotency-Key`; daemon dedupes
|
||||
identical keys for 24h. Without one, daemon mints `ulid` and returns it as
|
||||
`messageId`.
|
||||
- **Daemon → broker**: each outbox row has at-most-one inflight transmit.
|
||||
Daemon retries with exponential backoff until broker ACKs OR row hits TTL
|
||||
(7d default → moves to `dead`).
|
||||
- **Broker → peer**: existing claudemesh delivery semantics. Broker dedupes by
|
||||
`messageId`. Peer receives ≥1 copy.
|
||||
- **Peer hooks**: hooks see `idempotency_key` in the event JSON. Idempotent
|
||||
hook implementations are the receiver's responsibility.
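
The caller-to-daemon leg of the contract can be sketched as follows. This is an in-memory illustration under stated assumptions: the class name is hypothetical, a real daemon would persist the mapping in `outbox.db`, and `uuid4` stands in for the ULID the daemon mints.

```python
import time
import uuid
from typing import Callable, Dict, Optional, Tuple

DEDUPE_WINDOW_S = 24 * 3600  # identical keys are deduped for 24h


class SendDeduper:
    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self.clock = clock
        self.seen: Dict[str, Tuple[str, float]] = {}  # key -> (message_id, first_seen)

    def accept(self, idempotency_key: Optional[str]) -> Tuple[str, bool]:
        """Return (message_id, is_new). A repeated key inside the window maps to the same id."""
        now = self.clock()
        if idempotency_key is None:
            return uuid.uuid4().hex, True  # stand-in for a freshly minted ULID
        hit = self.seen.get(idempotency_key)
        if hit is not None and now - hit[1] < DEDUPE_WINDOW_S:
            return hit[0], False
        message_id = uuid.uuid4().hex
        self.seen[idempotency_key] = (message_id, now)
        return message_id, True
```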
|
||||
|
||||
### 4.2 Outbox row state machine
|
||||
|
||||
```
                ┌────────────┐
  send call →   │  pending   │
                └─────┬──────┘
                      │ daemon picks up batch
                      ▼
                ┌────────────┐
                │  inflight  │ ← attempts++, last_error written
                └─┬────┬─────┘
                  │    │ broker NACK / network err
     broker ACK   │    └──────────► back to pending (with exp. backoff)
                  ▼
                ┌────────────┐
                │    done    │ ← delivered_at set, broker_message_id stored
                └────────────┘

  age > max_age_hours:
                ┌────────────┐
                │    dead    │ ← surfaces in `daemon outbox --failed`
                └────────────┘
```
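
The diagram can be written down as a transition table plus a backoff helper. The event names, the base/cap values, and the choice to let both `pending` and `inflight` rows expire to `dead` are assumptions; the diagram does not say which state the age check applies to.

```python
import random

# (current state, event) -> next state
TRANSITIONS = {
    ("pending", "picked_up"): "inflight",
    ("inflight", "ack"): "done",
    ("inflight", "nack"): "pending",
    ("inflight", "network_error"): "pending",
    ("pending", "expired"): "dead",
    ("inflight", "expired"): "dead",
}


def next_state(state: str, event: str) -> str:
    """Apply one event; anything not in the table is an illegal transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} on {event}") from None


def backoff_seconds(attempts: int, base: float = 1.0, cap: float = 300.0,
                    jitter: float = 0.0) -> float:
    """Exponential backoff: base * 2^(attempts-1), capped, plus optional proportional jitter."""
    delay = min(cap, base * (2 ** max(0, attempts - 1)))
    return delay + random.uniform(0.0, jitter * delay)
```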
|
||||
|
||||
### 4.3 Crash recovery
|
||||
|
||||
On daemon startup:
|
||||
|
||||
1. Any rows in `inflight` are reset to `pending` with `attempts++` and
   `next_attempt_at = now + min_backoff`. Note: this MAY cause double-delivery
   of a message that the broker had already ACKed if the ACK was not persisted
   locally before the crash. The `idempotency_key` propagates to the broker
   (via message `meta`), so the broker dedupes by key.
|
||||
2. `outbox.db` integrity check (`PRAGMA integrity_check`); if fails, daemon
|
||||
refuses to start, points user at `claudemesh daemon recover`.
|
||||
3. `inbox.db` integrity check; on failure, drops to `inbox.db.corrupt-<ts>`,
|
||||
creates fresh empty inbox, logs `inbox_corruption_recovered` (does not
|
||||
block startup — inbox is a cache).
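
A minimal sketch of the outbox part of startup recovery, using Python's `sqlite3`. The `outbox` column names are assumptions (the spec does not give the schema), and the sketch runs the integrity check before touching rows, which is one possible ordering of steps 1 and 2.

```python
import sqlite3
import time


def recover_outbox(db: sqlite3.Connection, min_backoff_s: float = 1.0) -> int:
    """Integrity-check the outbox, then return inflight rows to pending. Returns rows reset."""
    result = db.execute("PRAGMA integrity_check").fetchone()[0]
    if result != "ok":
        # The daemon would refuse to start and point at `claudemesh daemon recover`.
        raise RuntimeError(f"outbox integrity check failed: {result}")
    with db:  # commit on success, roll back on error
        cur = db.execute(
            "UPDATE outbox SET status = 'pending', attempts = attempts + 1, "
            "next_attempt_at = ? WHERE status = 'inflight'",
            (time.time() + min_backoff_s,),
        )
    return cur.rowcount
```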
|
||||
|
||||
### 4.4 Disk-full
|
||||
|
||||
- At 80% of `outbox.max_queue_size` or 80% of `[disk] reserved_bytes`: daemon
|
||||
emits `outbox_pressure_high` event + Prometheus gauge. Sends still accept.
|
||||
- At 95%: new sends return `507 insufficient_storage`. Existing inflight
|
||||
drains.
|
||||
- At 100%: daemon enters degraded mode — refuses sends, refuses new SSE
|
||||
streams, holds open WS for inbound only. `daemon status` shows degraded.
|
||||
- Recovery: drain via broker reconnect (drains `done` rows older than
|
||||
retention window) or `claudemesh daemon outbox prune --confirm`.
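
The thresholds above amount to a small classifier. In this sketch the mode names are made up, and applying "whichever budget is fuller" to the 95% and 100% steps is an assumption (the text only states the either/or rule for 80%). Integer arithmetic avoids float rounding at the boundaries.

```python
def outbox_pressure(queue_used: int, queue_max: int,
                    disk_used: int, disk_reserved: int) -> str:
    """Classify outbox pressure from the queue budget and the disk budget."""

    def reached(pct: int) -> bool:
        # True if either budget is at or above pct percent.
        return (queue_used * 100 >= pct * queue_max
                or disk_used * 100 >= pct * disk_reserved)

    if reached(100):
        return "degraded"        # refuse sends and new SSE streams; inbound WS only
    if reached(95):
        return "reject_sends"    # new sends get 507 insufficient_storage
    if reached(80):
        return "pressure_high"   # emit outbox_pressure_high; sends still accepted
    return "ok"
```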
|
||||
|
||||
### 4.5 Schema migration
|
||||
|
||||
`schema_version` file holds an integer. On startup:
|
||||
1. If `schema_version` matches binary's expected version → continue.
|
||||
2. If version is older → run `apps/cli/src/daemon/migrations/<from>-<to>.sql`
|
||||
in a transaction, write new version on success.
|
||||
3. If version is newer (downgrade) → daemon refuses to start, error points at
|
||||
re-installing matching version.
|
||||
|
||||
Migrations are forward-only. Each migration is ≤ 1 transaction. Test coverage
|
||||
required: every migration has a snapshot test from prior schema.
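
The startup decision can be sketched as a planner that lists migration files to run. It assumes one file per single-version step, named `<from>-<to>.sql` as above; whether multi-version jumps are ever shipped as a single file is not stated.

```python
from typing import List


def plan_migrations(on_disk: int, expected: int) -> List[str]:
    """List the migration files to run, in order, or raise on a downgrade."""
    if on_disk > expected:
        raise RuntimeError(
            f"schema_version {on_disk} is newer than this binary ({expected}); "
            "install a matching daemon version"
        )
    # Forward-only: one step per version; an empty list means already current.
    return [f"{v}-{v + 1}.sql" for v in range(on_disk, expected)]
```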
|
||||
|
||||
## 5. Inbound — durable history with FTS
|
||||
|
||||
Every inbound message is written to `inbox.db` before any hook fires:
|
||||
|
||||
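```sql
-- Rows live in a plain table: FTS5 virtual tables accept neither CREATE INDEX
-- nor UNIQUE constraints, and `INSERT OR IGNORE` needs the UNIQUE key to dedupe.
CREATE TABLE inbox (
  id INTEGER PRIMARY KEY, message_id TEXT, mesh TEXT, topic TEXT,
  sender_pubkey TEXT, sender_name TEXT, body TEXT, meta TEXT,
  idempotency_key TEXT UNIQUE, received_at TEXT, replied_to_id TEXT
);
CREATE INDEX inbox_received_at ON inbox(received_at);
-- External-content FTS5 index over the searchable columns; insert/delete
-- triggers (not shown) keep it in sync with `inbox`.
CREATE VIRTUAL TABLE inbox_fts USING fts5(
  topic, sender_name, body, meta, content='inbox', content_rowid='id'
);
```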
|
||||
|
||||
- **Receiver-side dedupe**: on insert, `INSERT OR IGNORE` on `idempotency_key`.
|
||||
Duplicate broker delivery becomes a no-op locally + `cm_daemon_dedupe_total`
|
||||
counter increments.
|
||||
- 30-day rolling retention (configurable). `VACUUM` weekly during low-traffic
|
||||
window.
|
||||
- `claudemesh daemon search "OOM"` queries the FTS index.
|
||||
- Apps connecting mid-stream replay history via `?since=<iso>`.
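
Receiver-side dedupe can be illustrated with a plain SQLite table carrying a UNIQUE `idempotency_key`. The table name `inbox_rows` and its reduced column set are hypothetical and exist only for this sketch.

```python
import sqlite3

INBOX_SCHEMA = """
CREATE TABLE inbox_rows (
  message_id TEXT,
  idempotency_key TEXT UNIQUE,
  body TEXT
)
"""


def store_inbound(db: sqlite3.Connection, message_id: str,
                  idempotency_key: str, body: str) -> bool:
    """Insert a message; return False if the key was already stored (duplicate delivery)."""
    cur = db.execute(
        "INSERT OR IGNORE INTO inbox_rows (message_id, idempotency_key, body) "
        "VALUES (?, ?, ?)",
        (message_id, idempotency_key, body),
    )
    db.commit()
    # rowcount is 0 when the UNIQUE constraint caused the insert to be ignored;
    # that is where a dedupe counter would be incremented.
    return cur.rowcount == 1
```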
|
||||
|
||||
## 6. Hooks — first-class but tightly bounded
|
||||
|
||||
Codex was right: hooks were underspecified, and putting `CLAUDEMESH_TOKEN` in
|
||||
every hook env was a serious exfil footgun.
|
||||
|
||||
### 6.1 Hook directory & contract
|
||||
|
||||
```
|
||||
hooks/
|
||||
on-message.sh every inbound message (DM + topic)
|
||||
on-dm.sh DMs only
|
||||
on-mention.sh when @<my-name> appears anywhere
|
||||
on-topic-<name>.sh a specific topic
|
||||
on-file-share.sh file shared with me
|
||||
on-disconnect.sh WS dropped
|
||||
on-reconnect.sh reconnected
|
||||
on-startup.sh daemon up
|
||||
pre-send.sh filter / mutate outbound (last gate)
|
||||
hooks.toml per-hook policy (auth, redaction, env, timeout)
|
||||
```
|
||||
|
||||
`hooks.toml` (mandatory; daemon refuses to invoke hooks without it):
|
||||
|
||||
```toml
|
||||
[on-mention]
|
||||
enabled = true
|
||||
timeout_s = 30
|
||||
output_size_limit = 65536
|
||||
redact_payload = ["body.password", "meta.api_key"] # JSONPath
|
||||
allow_reply = true # if false, stdout reply ignored
|
||||
capability_token_scope = ["topic:alerts:post"] # scoped, NOT broker session token
|
||||
network_policy = "deny" # 'deny' | 'allow' | 'allowlist'
|
||||
network_allowlist = [] # only if policy = 'allowlist'
|
||||
fs_policy = "readonly" # 'readonly' | 'rw' | 'sandbox'
|
||||
killpg_on_timeout = true # SIGTERM process group, not just child
|
||||
audit = true # log every invocation
|
||||
```
|
||||
|
||||
### 6.2 Credentials passed to hooks
|
||||
|
||||
**Default: nothing.** No `CLAUDEMESH_TOKEN`, no broker session, nothing that
|
||||
lets the hook impersonate the daemon's identity broadly.
|
||||
|
||||
**Opt-in per hook**: `capability_token_scope = ["topic:alerts:post"]` mints a
|
||||
**short-lived (5 min) capability token** scoped to exactly that capability.
|
||||
The hook can use it to call back into the daemon's IPC ("post a reply to
|
||||
#alerts") but cannot use it to read state, read inbox, deploy MCP, etc. Token
|
||||
expires when hook process exits OR after 5 min, whichever first.
|
||||
|
||||
Capability tokens are local-only — they authorize against the daemon's IPC
|
||||
surface, never the broker directly. Daemon translates capability calls into
|
||||
broker calls.
|
||||
|
||||
Env variables the hook DOES get:
|
||||
- `CLAUDEMESH_MESH=<slug>`
|
||||
- `CLAUDEMESH_HOOK_NAME=on-mention`
|
||||
- `CLAUDEMESH_EVENT_ID=<ulid>`
|
||||
- `CLAUDEMESH_CAPABILITY_TOKEN=<token>` (only if scope was configured; else absent)
|
||||
- `CLAUDEMESH_DAEMON_SOCK=<path>` (so SDKs can connect for capability calls)
|
||||
- `PATH=/usr/bin:/bin` (locked down)
|
||||
|
||||
### 6.3 Payload redaction
|
||||
|
||||
Hook stdin receives event JSON minus paths listed in `redact_payload`. Default
|
||||
redaction: nothing. Mesh owner / daemon admin opts in.
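
A small sketch of the redaction step. It handles only dotted paths such as `body.password`, which is a simplification: the `hooks.toml` comment says JSONPath, and full JSONPath support is not attempted here. The function name is hypothetical.

```python
import copy
from typing import Any, Dict, Iterable


def redact(event: Dict[str, Any], paths: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of the event with each dotted path removed; missing paths are ignored."""
    out = copy.deepcopy(event)
    for path in paths:
        *parents, leaf = path.split(".")
        node: Any = out
        for key in parents:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict):
            node.pop(leaf, None)
    return out
```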
|
||||
|
||||
### 6.4 Timeout & cleanup
|
||||
|
||||
- Per-hook `timeout_s` (default 30s). On timeout, daemon sends SIGTERM to the
|
||||
hook's process group (`killpg_on_timeout=true`), waits 5s, then SIGKILL.
|
||||
Catches forked grandchildren that were trying to keep things alive.
|
||||
- Hook stdout/stderr captured, truncated at `output_size_limit`. Larger
|
||||
outputs log a warning and discard the overflow.
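
The timeout behaviour might look roughly like this on POSIX, sketched in Python. The function name and return shape are assumptions, and stderr is discarded here for brevity although the spec captures it.

```python
import os
import signal
import subprocess
from typing import List, Tuple


def run_hook(argv: List[str], stdin_bytes: bytes, timeout_s: float = 30.0,
             grace_s: float = 5.0, output_size_limit: int = 65536) -> Tuple[int, bytes, bool]:
    """Run a hook in its own process group. Returns (exit_code, truncated_stdout, timed_out)."""
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL, start_new_session=True)
    timed_out = False
    try:
        out, _ = proc.communicate(stdin_bytes, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        timed_out = True
        try:
            # start_new_session makes the child a group leader, so pgid == pid.
            os.killpg(proc.pid, signal.SIGTERM)
        except ProcessLookupError:
            pass
        try:
            out, _ = proc.communicate(timeout=grace_s)
        except subprocess.TimeoutExpired:
            os.killpg(proc.pid, signal.SIGKILL)  # grace period elapsed
            out, _ = proc.communicate()
    return proc.returncode, out[:output_size_limit], timed_out
```

Signalling the group rather than the child is what catches forked grandchildren.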
|
||||
|
||||
### 6.5 Audit log
|
||||
|
||||
Every hook invocation logs:
|
||||
```json
{"hook":"on-mention","event_id":"01H8…","exit":0,"duration_ms":47,
 "stdout_bytes":120,"stderr_bytes":0,"replied":true,"capability_calls":1,
 "ts":"2026-05-03T14:00:00Z"}
```
|
||||
|
||||
Stored in `daemon.log`; metrics exposed via `cm_daemon_hook_*`.
|
||||
|
||||
### 6.6 Sandboxing — supported, not required
|
||||
|
||||
The contract supports sandboxing without mandating it (mandating breaks too
|
||||
many real workflows):
|
||||
|
||||
- Linux: opt-in `sandbox = "bubblewrap"` in `hooks.toml` runs the hook under
|
||||
`bwrap` with no network (unless `network_policy != "deny"`), readonly FS
|
||||
except `/tmp/<hook-id>`, no DBus, no /proc.
|
||||
- macOS: opt-in `sandbox = "sandbox-exec"` with similar profile.
|
||||
- Default: no sandbox; rely on Unix permissions + `network_policy=deny` (which
|
||||
is enforced via `unshare --net` on Linux when available, otherwise
|
||||
best-effort firewall rule).
|
||||
|
||||
## 7. Multi-mesh — daemon-per-mesh, supervised by a thin shell
|
||||
|
||||
### 7.1 The decision
|
||||
|
||||
One daemon per mesh, coordinated by a supervisor script. Codex pushed back —
|
||||
"why not one daemon serving all meshes?". Going daemon-per-mesh because:
|
||||
|
||||
- **Crash isolation**: a panic in `prod` mesh's WS reader can't corrupt
|
||||
`dev` mesh's outbox.
|
||||
- **Resource accounting**: per-mesh RSS, per-mesh metrics, per-mesh disk
|
||||
budget — easy to attribute, easy to cap.
|
||||
- **Independent identity**: each mesh has its own keypair, host fingerprint,
|
||||
capability gates. Conflating into one process forces shared trust.
|
||||
- **Independent upgrades**: rolling daemon restarts per mesh, no downtime
|
||||
across all meshes.
|
||||
- **Simpler code**: zero cross-mesh routing logic in the daemon body.
|
||||
|
||||
The cost (process count, log fan-out) is real but bounded: typical user has
|
||||
1–3 meshes. Heavy users (10–20) get a `claudemesh daemon ps` + `--all` UX that
|
||||
treats them as a fleet.
|
||||
|
||||
### 7.2 Resource caps for fleet hosts
|
||||
|
||||
`config.toml` has `[fleet]` section read by `daemon up --all`:
|
||||
|
||||
```toml
|
||||
[fleet]
|
||||
max_daemons = 10
|
||||
total_memory_budget = "2GB" # divided across daemons; each gets budget/N RSS cap
|
||||
total_disk_budget = "20GB" # divided across outbox + inbox per daemon
|
||||
```
|
||||
|
||||
If a user hits `max_daemons`, `daemon up <next>` errors with a clear message
|
||||
pointing at the cap.
|
||||
|
||||
### 7.3 Commands
|
||||
|
||||
```
|
||||
claudemesh daemon up --mesh <slug> # one mesh
|
||||
claudemesh daemon up --all # all joined meshes (respects fleet caps)
|
||||
claudemesh daemon down --mesh <slug>
|
||||
claudemesh daemon down --all
|
||||
claudemesh daemon status # all daemons, table view
|
||||
claudemesh daemon status --json # machine-readable
|
||||
claudemesh daemon ps # alias of status
|
||||
claudemesh daemon logs --mesh <slug> [-f]
|
||||
claudemesh daemon restart --mesh <slug>
|
||||
```
|
||||
|
||||
## 8. Auto-routing — clarified, not transparent
|
||||
|
||||
Codex pushed back: "no behavior difference" was hand-waving. Persistent
|
||||
identity, queueing, hooks, profile state — these legitimately change behavior.
|
||||
|
||||
### 8.1 What changes when a daemon is up
|
||||
|
||||
| Behavior | Cold-path CLI | Daemon-routed CLI |
|---|---|---|
| Sender attribution | Ephemeral session pubkey for that invocation | Daemon's persistent member pubkey |
| Latency | ~1s (fresh WS handshake) | <10ms (local UDS round-trip) |
| Send durability | None — if broker is unreachable, command fails | Outbox queue retries until TTL |
| Inbound visibility | Not available (cold path closes WS) | `claudemesh inbox` reads daemon's inbox.db |
| Hooks | Not invoked | Invoked on every event |
| Presence | Brief flicker as session connects+disconnects | Continuous; daemon's status reflected |
| `peer list` shows me as | A new ephemeral session each invocation | The daemon's persistent member |
|
||||
|
||||
### 8.2 Detection logic — connect, don't trust pidfile
|
||||
|
||||
```
1. Check ~/.claudemesh/daemon/<slug>/sock exists.
2. Attempt UDS connect with 100ms timeout.
3. If connect succeeds: send GET /v1/version.
4. If response is well-formed AND mesh matches AND daemon_version is
   compatible → use this daemon.
5. Otherwise → cold path.
```
|
||||
|
||||
PID liveness check is unreliable (PID reuse, process orphaned). Socket
|
||||
handshake is canonical.
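
The detection steps could be sketched as a probe over the Unix socket. Assumptions: the function name, a `mesh` field in the `/v1/version` body (the listed fields are `{daemon, schema, ipc_api}`), and a plain non-chunked HTTP/1.1 response. The version-compatibility check from step 4 is left out.

```python
import json
import socket
from pathlib import Path
from typing import Optional


def probe_daemon(sock_path: Path, mesh: str, timeout_s: float = 0.1) -> Optional[dict]:
    """Return the parsed /v1/version body if a daemon for `mesh` answers, else None (cold path)."""
    if not sock_path.exists():
        return None
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.settimeout(timeout_s)
            s.connect(str(sock_path))
            s.sendall(b"GET /v1/version HTTP/1.1\r\nHost: localhost\r\n"
                      b"User-Agent: probe-sketch\r\nConnection: close\r\n\r\n")
            raw = b""
            while chunk := s.recv(4096):
                raw += chunk
        head, _, body = raw.partition(b"\r\n\r\n")
        if not head.startswith(b"HTTP/1.1 200"):
            return None
        info = json.loads(body)
    except (OSError, ValueError):
        # Stale socket file, refused connection, timeout, or malformed body.
        return None
    return info if isinstance(info, dict) and info.get("mesh") == mesh else None
```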
|
||||
|
||||
### 8.3 Coexistence with `claudemesh launch`
|
||||
|
||||
Both can be running for the same mesh:
|
||||
- Daemon connected as persistent member `runpod-worker-3`.
|
||||
- A separate `claudemesh launch` connects as ephemeral session of the same
|
||||
member. Visible to peers as "another session of runpod-worker-3"
|
||||
(sibling-session relationship via `memberPubkey`).
|
||||
- CLI verbs from inside `claudemesh launch` route through the launch session,
|
||||
NOT the daemon (preserves "this Claude Code session has its own ephemeral
|
||||
identity" semantics).
|
||||
- CLI verbs from a separate shell route through the daemon (faster, durable).
|
||||
|
||||
This is consistent with the v0.5.1 self-DM guard and sibling-session
|
||||
semantics already shipped.
|
||||
|
||||
## 9. Service installation
|
||||
|
||||
```bash
|
||||
claudemesh daemon install-service # writes systemd unit / launchd plist / Windows SC
|
||||
claudemesh daemon uninstall-service
|
||||
claudemesh daemon install-service --user # user-scope unit (default; no root)
|
||||
claudemesh daemon install-service --system # system-scope unit (root; multi-user host)
|
||||
```
|
||||
|
||||
Unit defaults:
|
||||
- `Restart=on-failure`, `RestartSec=5s`, `StartLimitBurst=5`, `StartLimitIntervalSec=5min`
|
||||
- `MemoryMax=<resource cap>`, `TasksMax=128`, `LimitNOFILE=4096`
|
||||
- `StandardOutput=journal`, `StandardError=journal`
|
||||
- `NoNewPrivileges=yes`, `PrivateTmp=yes`, `ProtectSystem=strict`,
|
||||
  `ProtectHome=read-only` with `ReadWritePaths=%h/.claudemesh`
|
||||
- For systemd `--user`, runs as the invoking user (no root needed).
|
||||
|
||||
`claudemesh install` (the existing setup verb) gains an opt-in prompt:
|
||||
*"Install as a background service that always runs?"* Defaults differently
|
||||
based on detected environment (TTY vs no-TTY, presence of systemd, etc.).
|
||||
|
||||
## 10. Observability
|
||||
|
||||
Standard CLI surface unchanged from v1, with the new gauges/counters:
|
||||
|
||||
```
|
||||
cm_daemon_connected{mesh} 0/1
|
||||
cm_daemon_reconnects_total{mesh,reason}
|
||||
cm_daemon_lag_ms{mesh} last broker round-trip
|
||||
cm_daemon_outbox_depth{mesh,status} pending|inflight|dead
|
||||
cm_daemon_outbox_age_seconds{mesh} oldest pending row
|
||||
cm_daemon_dedupe_total{mesh,direction} out|in
|
||||
cm_daemon_disk_pct{mesh,kind} outbox|inbox
|
||||
cm_daemon_send_total{mesh,kind,status}
|
||||
cm_daemon_recv_total{mesh,kind,from_type}
|
||||
cm_daemon_hook_invocations_total{hook,exit}
|
||||
cm_daemon_hook_duration_seconds{hook} histogram
|
||||
cm_daemon_hook_capability_calls_total{hook,scope}
|
||||
cm_daemon_ipc_request_total{endpoint,status,transport}
|
||||
cm_daemon_ipc_duration_seconds{endpoint} histogram
|
||||
cm_daemon_local_token_rotations_total
|
||||
cm_daemon_clone_suspected_total
|
||||
```
|
||||
|
||||
Tracing: optional OpenTelemetry export.
|
||||
|
||||
## 11. SDKs — three, slim, core-API only
|
||||
|
||||
Same shape as v1 but only target the **frozen core surface** (§3.1). State /
|
||||
memory / vector / graph / tasks / MCP / skills are NOT in v0.9.0 SDKs — they
|
||||
ship per capability gate.
|
||||
|
||||
Each SDK auto-discovers the daemon: reads `sock` path, `http.port`,
|
||||
`local_token`. SDKs versioned in lockstep with the daemon's `/v1` surface.
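
A sketch of what SDK auto-discovery might look like, in Python. The helper names and the assumption that `http.port` holds a plain decimal port are illustrative. The request is only built here, not sent.

```python
import json
import urllib.request
from pathlib import Path


def discover(mesh_slug: str, root: Path = Path.home() / ".claudemesh" / "daemon") -> dict:
    """Read the per-mesh daemon directory: socket path, loopback port, bearer token."""
    base = root / mesh_slug
    return {
        "sock": base / "sock",
        "http_port": int((base / "http.port").read_text().strip()),
        "local_token": (base / "local_token").read_text().strip(),
    }


def build_send_request(info: dict, to: str, message: str) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/send request for the loopback HTTP transport."""
    body = json.dumps({"to": to, "message": message}).encode("utf-8")
    return urllib.request.Request(
        f"http://127.0.0.1:{info['http_port']}/v1/send",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {info['local_token']}",
            "Content-Type": "application/json",
            "User-Agent": "claudemesh-sdk-sketch/0",
        },
    )
```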
|
||||
|
||||
## 12. Security model — explicit boundaries
|
||||
|
||||
| Boundary | Trust | Mechanism |
|---|---|---|
| App ↔ Daemon (UDS) | OS user, FS perms | UDS 0600 |
| App ↔ Daemon (TCP/SSE) | OS user + bearer token | 127.0.0.1 only + `local_token` + Origin/Host check |
| Hook ↔ Daemon | Capability scope | Short-lived capability token, never broker session |
| Daemon ↔ Broker | Mesh keypair | WSS + ed25519 hello + crypto_box DM + per-topic keys |
| Daemon ↔ Disk | OS user | All daemon files mode 0600/0644 under `~/.claudemesh/daemon/` |
| Cloned identity | Host fingerprint check | Daemon refuses to start; dashboard audit event |
|
||||
|
||||
## 13. Configuration

`config.toml` — same shape as v1 plus:
- `[capabilities]` (§3.2)
- `[fleet]` (§7.2)
- `[disk] reserved_bytes` (§4.4)
- `[clone] policy = "refuse" | "warn" | "allow"` (§2.2)

User-editable. `claudemesh daemon reload` re-reads it without dropping the WS.

## 14. Lifecycle — the operational flows v1 was missing

### 14.1 Key rotation

```
claudemesh daemon rotate-keypair
```

Mints fresh ed25519 + x25519. Registers new pubkey with broker as a `member_keypair_rotated` operation (broker associates new pubkey with same member id). Old pubkey is held server-side for 24h grace (decrypts in-flight messages encrypted to old pubkey), then revoked.

### 14.2 Local token rotation

```
claudemesh daemon rotate-token
```

Atomically writes a new `local_token` and keeps accepting the old one
alongside the new one for a 60s grace period. SDKs that already have the old
token finish in-flight requests; new requests use the new token. After 60s,
the old token is rejected.

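The grace window can be sketched as follows. This is a minimal illustration, not the daemon's implementation: `TokenStore`, its method names, and the injectable clock are all hypothetical, and the token is held in memory rather than written to the `local_token` file.

```python
import hmac
import secrets
import time

GRACE_SECONDS = 60  # grace window named in the text above


class TokenStore:
    """Hypothetical in-memory holder for the current and previous local_token."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.current = secrets.token_urlsafe(32)
        self.previous = None
        self.previous_expires_at = 0.0

    def rotate(self):
        """Mint a new token and keep the old one valid for GRACE_SECONDS."""
        self.previous = self.current
        self.previous_expires_at = self._clock() + GRACE_SECONDS
        self.current = secrets.token_urlsafe(32)
        return self.current

    def is_valid(self, presented):
        """Constant-time check against the current token, then the old one."""
        candidate = presented.encode("utf-8")
        if hmac.compare_digest(candidate, self.current.encode("utf-8")):
            return True
        in_grace = self.previous is not None and self._clock() < self.previous_expires_at
        if in_grace:
            return hmac.compare_digest(candidate, self.previous.encode("utf-8"))
        return False
```

Passing the clock in as a parameter is only a testing convenience; it lets the 60s window be exercised without sleeping.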
### 14.3 Compromised host revocation

From the dashboard or another mesh-owner session:

```
claudemesh member revoke <pubkey>
```

Broker marks member as revoked. Connected daemon receives `member_revoked`
push, self-disables (refuses new IPC, closes WS), exits with non-zero status,
logs forensic event.

### 14.4 Image-clone lifecycle

Covered in §2.2. Three policies (`refuse`, `warn`, `allow` — settable per-host
via `config.toml`).

### 14.5 Backup & restore

```
claudemesh daemon backup --out <path>     # dumps keypair, config, schema_version
claudemesh daemon restore --in <path>     # writes them; refuses if a daemon is running
```

Backup is encrypted with a passphrase (Argon2id KDF + crypto_secretbox). The
intent: "I'm reformatting my laptop, I want my mesh memberships back without
re-joining." NOT for "deploy this same identity on 10 servers" (that's the
clone problem above).

### 14.6 Uninstall / reset

```
claudemesh daemon uninstall   # full purge: stops, deregisters from broker, wipes ~/.claudemesh/daemon/<slug>
claudemesh daemon reset       # wipes local state, keeps broker member registration (for restoring)
```

Uninstall calls broker's `POST /v1/me/members/:pubkey/leave` so member doesn't
linger as ghost. Reset is local-only, no broker contact.

### 14.7 Disk corruption recovery

```
claudemesh daemon recover     # interactive: integrity check + offer rebuild paths
```

Detects corrupt `outbox.db` / `inbox.db`. Options:
- Restore from local journal-only inbox (read-only mode; sends disabled).
- Wipe + rebuild from broker (fetches last N days of message history if
  available; topics need re-subscribe; outbox is irrecoverable, queued sends are
  lost).
- Wipe + start fresh.

## 15. Version compatibility

### 15.1 Negotiation handshake

On daemon connect to broker AND on every IPC request:

```
GET /v1/version
{
  "daemon_version": "0.9.0",
  "ipc_api": "v1",
  "ipc_minor": 3,                 # additive minor
  "schema_version": 7,
  "broker_protocol_min": "0.7",
  "broker_protocol_max": "0.9"
}
```

### 15.2 Compat policy

| Across | Policy |
|---|---|
| Daemon ↔ Broker | Daemon refuses to connect if broker version < daemon's `broker_protocol_min`. Broker logs warning. Pre-1.0 we may break this with notice; post-1.0 we maintain backward compat for ≥6 months. |
| CLI ↔ Daemon | CLI checks daemon's `ipc_api`. Same major = OK. Different major = CLI falls back to cold-path with warning. |
| SDK ↔ Daemon | SDK negotiates `ipc_minor`; uses minimum of (SDK's, daemon's). |
| Daemon binary ↔ schema | Binary refuses to start on unknown schema; migrations run forward-only; no automatic downgrade. |

### 15.3 Compatibility matrix (published in docs, machine-readable JSON at /v1/compat)

```json
{
  "daemon": "0.9.0",
  "compatible_brokers": ["0.7.x", "0.8.x", "0.9.x"],
  "compatible_clis": ["0.9.x"],
  "compatible_sdks": {
    "python": ">=0.9.0,<1.0.0",
    "go": ">=0.9.0,<1.0.0",
    "ts": ">=0.9.0,<1.0.0"
  }
}
```

## 16. Threat model

### 16.1 Attacker classes

| Attacker | Has | Wants | Mitigations |
|---|---|---|---|
| Local same-user shell | OS user creds | Send / read mesh messages | None needed — they already have FS access to keypair; daemon is no worse |
| Local different-user shell | Different OS user | Read this user's daemon | UDS 0600 + TCP loopback + token. Requires OS exploit to escalate |
| Browser SSRF | Loopback HTTP | Send messages, read inbox | `local_token` + Origin/Host check + non-default port. SSRF without token cannot succeed |
| Container side-channel | Same loopback namespace | Read another container's daemon | Containers share host loopback only if explicitly net=host. `local_token` defends. Recommended: bind UDS only inside containers |
| Compromised hook | Capability token in env | Use that scope | Capability tokens are scoped + short-lived; cannot escalate |
| Compromised broker | Full mesh visibility on its side | Deliver malicious messages, identity-impersonate | E2E encryption (crypto_box DMs, per-topic keys) — broker can't read content. Out-of-scope for daemon |
| Cloned VM image | Same keypair on two hosts | Identity collision | Host fingerprint detection + dashboard audit + `--remint` flow |
| Stolen laptop | Disk access | Mesh impersonation forever | `member revoke` from dashboard. Without disk encryption, this is the user's laptop security; documented in security guide |
| Untrusted hook author | Hook script content | Exfil mesh data | Hook is on disk YOU control. If you ran `git pull` on a malicious hooks/ repo, that's a code-supply-chain attack out of scope for the daemon |

### 16.2 Out of scope

- Defending against an attacker with root on the daemon host. They can read
  `keypair.json` directly.
- Defending against malicious peers in the same mesh sending malformed
  payloads. Daemon validates structure but trusts mesh members.
- Defending against compromised broker. Out-of-scope for daemon; mesh-level
  E2E protects content but not metadata.

## 17. Migration — what changes for existing users

Same as v1. Additive. No DB migration on broker. Existing
`~/.claudemesh/config.json` consumed unchanged. `claudemesh launch` keeps
working; daemon is opt-in.

---

## What needs review (round 2)

Round 1 produced: identity model needs `--ephemeral` + clone-detect, IPC needs
local token, "exactly-once" was a lie, hooks needed scoped credentials, surface
needed shrinking, missing rotation/recovery/migration/threat-model.

This v2 attempts to address all of them. Specifically critique:

1. **Has the identity model fully closed the clone problem?** Refuses-on-fingerprint-mismatch
   plus broker audit plus mesh-owner revoke — does this catch a sophisticated
   attacker who copies `host_fingerprint.json` along with the keypair?
2. **Is the local-token model sufficient for browser-SSRF defense?**
   Token + Origin + Host checks + 127.0.0.1-only. Anything else needed?
3. **The delivery contract** (§4) — is it now defensible? Does the inflight-recovery
   semantics + idempotency-key propagation produce the guarantees claimed?
4. **Hook capability tokens** (§6.2) — short-lived, scoped, expire on hook exit.
   Does this fully eliminate the exfil footgun? What capability scopes are
   actually needed for v0.9.0 hooks?
5. **Frozen v0.9.0 surface** (§3.1) — is the cut right? Should `peer list` be
   in core or capability-gated? Should `inbox/search` ship in v0.9.0?
6. **Threat model** (§16) — anything missing? Specifically thinking about CI
   environments where the daemon's host is a fleet shared across many users'
   builds.
7. **Lifecycle flows** (§14) — image clones, key rotation, host moves, disk
   corruption, uninstall semantics. Anything still missing?
8. **Version compat** (§15) — is the negotiation handshake sufficient, or do
   we need stronger guarantees (e.g. semver-strict, or a feature-bit
   negotiation rather than version numbers)?

Score 1–5 each. Top 3 changes you'd insist on for v3, if any. If you think v2
is shippable, say so explicitly — over-engineering is a real risk.
|
||||
648
.artifacts/shipped/2026-05-03-daemon-final-spec-v3.md
Normal file
@@ -0,0 +1,648 @@
# `claudemesh daemon` — Final Spec v3

> **Round 3.** v2 of this spec was reviewed by another model, which pushed back on
> identity/clone semantics (boot-id false-positives), delivery contract (broker
> must dedupe on client-supplied id — protocol change), CI shared-runner threat
> model, version negotiation (need feature bits, not ranges), key rotation
> crypto, hook scope granularity, inbox schema correctness, and ~7 smaller
> polish items. v3 incorporates all of them.
>
> **The intent §0 from v2 is unchanged and still authoritative — read it
> there.** v3 only revises what changed.

---

## 0. Intent — unchanged, see v2 §0

Pre-launch peer-mesh runtime. Servers/laptops become first-class peers.
Stable identity, persistent WS, local IPC, hooks. Not a webhook gateway, not
a generic broker. We can break anything.

**One claim retracted from v1/v2**: "exactly-once" delivery. Replaced with a
precise contract in §4 below.

---

## 1. Process model — same as v2 §1

Resource caps, file layout, single-binary unchanged.

---

## 2. Identity — accidental-clone detection only, plus broker dedupe

Codex was right: v2's clone detection was both too weak (anyone copying
`host_fingerprint.json` along with `keypair.json` defeats it) and too noisy
(boot-id flips every reboot → false-positives on every legitimate restart).

### 2.1 Modes

```
claudemesh daemon up                        # default: persistent member
claudemesh daemon up --ephemeral            # in-memory keypair, never written
claudemesh daemon up --ephemeral --ttl 2h   # auto-shutdown after duration
```

**CI auto-detection** (NEW): if any of the following env vars are set
(`CI=true`, `GITHUB_ACTIONS`, `GITLAB_CI`, `BUILDKITE`, `CIRCLECI`,
`JENKINS_URL`, `RUNPOD_POD_ID`, `KUBERNETES_SERVICE_HOST`), AND `--persistent`
is not explicitly passed, daemon defaults to `--ephemeral`. Rationale in §16.

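The auto-detection rule above might look like the sketch below. The helper names `looks_like_ci` and `resolve_mode` are hypothetical, and treating an empty variable as "not set" is an assumption the spec does not settle.

```python
import os

# Variables listed in the text above; CI is handled separately because the
# spec names the specific value CI=true.
CI_ENV_VARS = (
    "GITHUB_ACTIONS", "GITLAB_CI", "BUILDKITE", "CIRCLECI",
    "JENKINS_URL", "RUNPOD_POD_ID", "KUBERNETES_SERVICE_HOST",
)


def looks_like_ci(env):
    """True if CI=true or any known CI/orchestrator variable is non-empty."""
    if env.get("CI", "").lower() == "true":
        return True
    return any(env.get(name) for name in CI_ENV_VARS)


def resolve_mode(env, persistent_flag=False, ephemeral_flag=False):
    """Pick 'ephemeral' or 'persistent'; explicit flags beat auto-detection."""
    if ephemeral_flag:
        return "ephemeral"
    if persistent_flag:
        return "persistent"
    return "ephemeral" if looks_like_ci(env) else "persistent"


if __name__ == "__main__":
    print(resolve_mode(os.environ))
```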
### 2.2 Accidental-clone detection (NOT attacker-grade)

Frame change: this catches **image clones, restored backups, copy-pasted
homedirs** — accidents made by humans operating at human speed. It does not
defend against an attacker who copies both `keypair.json` and
`host_fingerprint.json`. The threat model (§16) says this explicitly.

Persisted fingerprint = `sha256(machine-id || first-stable-mac)`. Notably:
- **No boot-id** — that flips on every reboot and would false-positive
  every legitimate restart.
- **No hostname** — laptops legitimately rename themselves.
- **`first-stable-mac`** = MAC of the lexicographically first non-loopback,
  non-virtual interface present at first daemon boot. Frozen at first run;
  not recomputed.

Behavior on mismatch:
- Default policy: refuse to start. Print: *"This keypair was created on a
  different host. If you legitimately moved hardware, run
  `claudemesh daemon accept-host` (writes a fresh fingerprint, keeps keypair).
  If this is a clone of an existing daemon, run `claudemesh daemon remint`
  (mints fresh keypair, registers as a new member)."*
- `[clone] policy = "refuse" | "warn" | "allow"` overrides per host.

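A small sketch of the fingerprint and the mismatch policy follows. The spec only gives `sha256(machine-id || first-stable-mac)`; the UTF-8 encoding, the lowercasing of the MAC, and plain concatenation without a separator are assumptions made here, and `host_fingerprint` / `check_host` are hypothetical names. How the two inputs are read from the host is out of scope.

```python
import hashlib


def host_fingerprint(machine_id, first_stable_mac):
    """Hex sha256 over machine-id followed by the MAC (encoding is assumed)."""
    mac = first_stable_mac.strip().lower().replace("-", ":")
    digest = hashlib.sha256()
    digest.update(machine_id.strip().encode("utf-8"))
    digest.update(mac.encode("utf-8"))
    return digest.hexdigest()


def check_host(stored_fingerprint, machine_id, first_stable_mac, policy="refuse"):
    """Return 'ok', 'warn', or 'refuse' according to the [clone] policy."""
    if host_fingerprint(machine_id, first_stable_mac) == stored_fingerprint:
        return "ok"
    if policy == "allow":
        return "ok"
    return policy  # "warn" or "refuse"
```

As the section says, this only catches accidental clones: anyone who copies the stored fingerprint together with the keypair passes the check.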
### 2.3 Concurrent-duplicate-identity broker policy (NEW — protocol change)

When the broker receives two WS connections claiming the same member pubkey:

- **`prefer_newest`** (default): older connection is closed with code 4003
  `replaced_by_newer_connection`. New connection takes over presence/inbox
  delivery. Daemon-side: receives the close code, logs forensic event, exits
  with non-zero status (lets supervisor restart it; if the *other* host is
  the legitimate one, supervisor restart-loops are noisy enough to alert).
- **`prefer_oldest`**: new connection is rejected with code 4004
  `member_already_connected`. The new daemon refuses to start.
- **`allow_concurrent`** (new mode, server-side feature flag): both
  connections accepted; broker tracks both as sibling sessions of the same
  member (same model as `claudemesh launch` siblings today). Useful when a
  user really does want one keypair on multiple hosts (e.g. failover pairs).

Configured per-mesh in `mesh.cloneConcurrencyPolicy`. Default:
`prefer_newest`. Broker emits `member_concurrent_connection` audit event in
all cases.

### 2.4 Rename, key rotation — see §14

---

## 3. IPC surface — frozen core, hardened auth

### 3.1 Frozen core (v0.9.0) — slight cut from v2

Codex agreed v2's cut was mostly right, except: defer FTS-search to a
capability gate, keep `peer list` in core, drop redundancies.

```
# Messaging
POST   /v1/send               {to, message, priority?, meta?, replyToId?,
                               client_message_id?}
POST   /v1/topic/post         {topic, message, priority?, mentions?,
                               client_message_id?}
POST   /v1/topic/subscribe    {topic}
POST   /v1/topic/unsubscribe  {topic}
GET    /v1/topic/list
GET    /v1/inbox              ?since=<iso>&topic=<n>&from=<peer>&limit=<n>
                              # plain SQL paging; NO FTS in v0.9.0

# Peers + presence (kept in core — central to "first-class peer")
GET    /v1/peers              ?mesh=<slug>
POST   /v1/profile            {summary?, status?, visible?}

# Files (already production)
POST   /v1/file/share         {path, to?, message?, persistent?}
GET    /v1/file/get           ?id=<fileId>&out=<path>
GET    /v1/file/list

# Events — push
GET    /v1/events             text/event-stream
       core events: message, peer_join, peer_leave, file_shared,
                    daemon_disconnect, daemon_reconnect, hook_executed,
                    feature_negotiation_failed

# Control plane
GET    /v1/health             (auth required by default — see §3.3)
GET    /v1/metrics            (auth required by default)
GET    /v1/version            (auth required by default)
POST   /v1/heartbeat          {}
```

`inbox/search` with FTS deferred to v0.9.x capability gate `inbox_fts`.

### 3.2 Capability-gated future surface (v0.9.x)

Same as v2 §3.2 — state, memory, vector, graph, tasks, scheduling,
mcp_host, skill_share, plus new `inbox_fts`. None enabled in v0.9.0.

### 3.3 Local IPC authentication — tightened

Same shape as v2 §3.3 but with codex's polish folded in:

| Transport | Auth | Notes |
|---|---|---|
| UDS | None (FS perms 0600) | Reaching socket = same UID |
| TCP loopback | `Authorization: Bearer <local_token>` REQUIRED | 127.0.0.1 only |
| SSE | `Authorization: Bearer <local_token>` REQUIRED | same |

**Token plumbing rules (NEW):**
- `local_token` MUST be in the `Authorization` header. **Never** accepted in
  query string. Endpoint that sees a `?token=...` query param logs a security
  event and returns 400.
- `local_token` MUST be redacted from access logs (`Authorization: Bearer
  ***` in logs).
- `local_token` rotation atomically writes a new file; the OLD token stays
  valid for a 60s grace period so SDKs still holding it can finish, then it's
  rejected.

**Endpoint default auth (NEW — codex):**
- Every IPC endpoint requires the local token by default, **including**
  `/v1/health`, `/v1/metrics`, `/v1/version`. `[ipc] public_health_check =
  true` opts in to public `/v1/health` for k8s probes etc.

**Container default (NEW — codex):**
- If `KUBERNETES_SERVICE_HOST` is set OR `/.dockerenv` exists OR
  `/proc/1/cgroup` indicates a container OR explicit `--container` flag,
  daemon defaults to **UDS-only** (`[ipc] tcp_enabled = false`). Containers
  share host loopback when `network_mode: host`; UDS-only avoids the
  side-channel.

**Origin/Host policy:**
- `Host` header must be `localhost`, `127.0.0.1`, `[::1]` or empty. Else 403.
- `Origin` header: explicit allowlist (default: empty). SSRF-from-browser
  bounce-attack defense.
- `User-Agent` requirement DROPPED (codex called it theatre — correct).
- CORS: never echo `Access-Control-Allow-Origin`; preflight returns 403.

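The TCP-side rules above can be read as one ordered check, sketched here. Several details are assumptions rather than spec: the function name `check_request`, stripping a `:port` suffix before comparing `Host`, returning 401 for a bad or missing bearer token, and header keys arriving in canonical case.

```python
import hmac

ALLOWED_HOSTS = {"localhost", "127.0.0.1", "[::1]", ""}


def strip_port(host_header):
    """Drop a trailing :port, leaving bracketed IPv6 literals intact."""
    if host_header.startswith("["):
        end = host_header.find("]")
        return host_header[: end + 1] if end != -1 else host_header
    return host_header.split(":", 1)[0]


def check_request(headers, query, local_token, allowed_origins=frozenset()):
    """Return an HTTP status code; 200 means the request may proceed."""
    if "token" in query:
        return 400  # token in query string: reject (and log a security event)
    if strip_port(headers.get("Host", "")) not in ALLOWED_HOSTS:
        return 403
    origin = headers.get("Origin")
    if origin is not None and origin not in allowed_origins:
        return 403  # Origin allowlist is empty by default
    expected = ("Bearer " + local_token).encode("utf-8")
    presented = headers.get("Authorization", "").encode("utf-8")
    if not hmac.compare_digest(presented, expected):
        return 401  # status code is an assumption; the spec does not name one
    return 200
```

Checking the query string first means a leaked `?token=` never reaches the comparison path, which matches the "logs a security event and returns 400" rule.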
### 3.4 Request limits & backpressure — same as v2

---

## 4. Delivery contract — at-least-once, broker-dedupes-on-client-id

Codex caught the real protocol gap: idempotency only works if the broker
dedupes on the **caller's** id, not its own. This requires a broker change.

### 4.1 The contract (precise)

> **Local guarantee**: each successful `POST /v1/send` returns a stable
> `client_message_id`. The send is durably persisted to `outbox.db` before
> the response returns.
>
> **Broker guarantee**: the broker dedupes on `client_message_id` for a
> 24h window. Multiple inflight retries from the daemon for the same
> `client_message_id` produce **at most one** broker-accepted row.
>
> **End-to-end guarantee**: at-least-once delivery to subscribers, with
> `client_message_id` propagated in the inbound envelope so receivers can
> dedupe locally on their side. We do **not** guarantee at-most-once
> end-to-end — that requires receiver-side dedupe, which the daemon's
> inbox.db provides for daemon-hosted peers.

### 4.2 Daemon-supplied `client_message_id` (NEW — broker protocol change)

Every send has a stable id minted **on the daemon**, not the broker:
- Caller-supplied via `Idempotency-Key` header → wins.
- Caller-supplied in body as `client_message_id` field → second.
- Else daemon mints a `ulid` → last.

The id is:
- Returned in the IPC response.
- Stored in `outbox.db` as a UNIQUE NOT NULL column (real dedupe, not
  `INSERT OR IGNORE` on nullable — codex caught this).
- Propagated to the broker on every retry (`client_message_id` field in the
  WS send envelope and in `POST /v1/messages`).
- Stored in the broker's `meshTopicMessage.client_message_id` column with a
  `UNIQUE` constraint scoped to `(meshId, client_message_id)`.
- Propagated in the inbound delivery to receivers' inboxes.

**Broker behavior on duplicate `client_message_id`**: returns the
already-stored `messageId` and `historyId` from the prior insertion. No new
row, no new fan-out, idempotent.

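The precedence rule and the broker's duplicate handling can be sketched together. This is illustrative only: an in-memory SQLite table stands in for the broker's store (the schema delta in §4.3 is written against a different database), `uuid4().hex` stands in for a ULID, and the function names are hypothetical.

```python
import sqlite3
import uuid


def resolve_client_message_id(headers, body):
    """Idempotency-Key header wins, then the body field, else mint a fresh id."""
    return (
        headers.get("Idempotency-Key")
        or body.get("client_message_id")
        or uuid.uuid4().hex  # stand-in for a ULID
    )


def broker_accept(db, mesh_id, client_message_id, payload):
    """Insert at most once per (mesh_id, client_message_id).

    Returns (message_id, is_new); a duplicate returns the original row's id.
    """
    cur = db.execute(
        "INSERT INTO topic_message(mesh_id, client_message_id, payload)"
        " VALUES (?, ?, ?)"
        " ON CONFLICT(mesh_id, client_message_id) DO NOTHING",
        (mesh_id, client_message_id, payload),
    )
    is_new = cur.rowcount == 1
    row = db.execute(
        "SELECT id FROM topic_message WHERE mesh_id = ? AND client_message_id = ?",
        (mesh_id, client_message_id),
    ).fetchone()
    db.commit()
    return row[0], is_new


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE topic_message ("
    " id INTEGER PRIMARY KEY,"
    " mesh_id TEXT NOT NULL,"
    " client_message_id TEXT NOT NULL,"
    " payload TEXT,"
    " UNIQUE(mesh_id, client_message_id))"
)
```

Only a call that returns `is_new == True` would go on to fan out; a retry gets the stored id back and does nothing else.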
### 4.3 Broker schema delta (NEW)

```sql
ALTER TABLE mesh.topic_message
  ADD COLUMN client_message_id TEXT;
ALTER TABLE mesh.message_queue
  ADD COLUMN client_message_id TEXT;

CREATE UNIQUE INDEX topic_message_client_id_idx
  ON mesh.topic_message(mesh_id, client_message_id)
  WHERE client_message_id IS NOT NULL;
CREATE UNIQUE INDEX message_queue_client_id_idx
  ON mesh.message_queue(mesh_id, client_message_id)
  WHERE client_message_id IS NOT NULL;
```

Partial unique index — legacy traffic without `client_message_id` (from
`claudemesh launch`, dashboard chat, web posts) is unaffected.

### 4.4 Outbox schema (corrected)

```sql
CREATE TABLE outbox (
  id                TEXT PRIMARY KEY,          -- ulid (local row id)
  client_message_id TEXT NOT NULL UNIQUE,      -- propagated to broker
  payload           BLOB NOT NULL,
  enqueued_at       INTEGER NOT NULL,
  attempts          INTEGER DEFAULT 0,
  next_attempt_at   INTEGER NOT NULL,
  status            TEXT CHECK(status IN ('pending','inflight','done','dead')),
  last_error        TEXT,
  delivered_at      INTEGER,
  broker_message_id TEXT                       -- set on ACK
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
```

`UNIQUE NOT NULL` on `client_message_id`: caller retries with the same id
collide locally and become a no-op.

### 4.5 Inbox schema (corrected — content table + FTS index)

Codex caught: FTS5 virtual tables are not where you put `CREATE INDEX`.
Real shape:

```sql
-- Content table — the durable store
CREATE TABLE inbox (
  id                TEXT PRIMARY KEY,          -- ulid (local row id)
  client_message_id TEXT NOT NULL UNIQUE,      -- dedupe key
  broker_message_id TEXT,
  mesh              TEXT NOT NULL,
  topic             TEXT,
  sender_pubkey     TEXT NOT NULL,
  sender_name       TEXT NOT NULL,
  body              TEXT,
  meta              TEXT,                      -- JSON
  received_at       INTEGER NOT NULL,
  reply_to_id       TEXT
);
CREATE INDEX inbox_received_at ON inbox(received_at);
CREATE INDEX inbox_topic       ON inbox(topic);
CREATE INDEX inbox_sender      ON inbox(sender_pubkey);

-- FTS5 index — gated behind capability `inbox_fts` (deferred to v0.9.x)
-- When enabled, populated via triggers; absent in v0.9.0.
```

Insert path: `INSERT INTO inbox(...) ON CONFLICT(client_message_id) DO
NOTHING RETURNING id`. The `RETURNING` clause tells us whether a new row
landed; only new rows trigger hooks.

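The insert path can be tried out against an in-memory database. This sketch uses a trimmed copy of the inbox table, and `store_inbound` / `make_message` are hypothetical helpers with made-up sample values. Because `RETURNING` needs SQLite 3.35 or newer, the sketch falls back to `cursor.rowcount` on older builds; that fallback is an addition here, not part of the spec.

```python
import sqlite3

# RETURNING needs SQLite 3.35+; older builds fall back to cursor.rowcount.
HAS_RETURNING = sqlite3.sqlite_version_info >= (3, 35, 0)

INBOX_DDL = """
CREATE TABLE inbox (
  id                TEXT PRIMARY KEY,
  client_message_id TEXT NOT NULL UNIQUE,
  mesh              TEXT NOT NULL,
  sender_pubkey     TEXT NOT NULL,
  sender_name       TEXT NOT NULL,
  body              TEXT,
  received_at       INTEGER NOT NULL
)
"""

INSERT_SQL = (
    "INSERT INTO inbox(id, client_message_id, mesh, sender_pubkey,"
    " sender_name, body, received_at)"
    " VALUES (:id, :client_message_id, :mesh, :sender_pubkey,"
    " :sender_name, :body, :received_at)"
    " ON CONFLICT(client_message_id) DO NOTHING"
)


def store_inbound(db, msg):
    """Insert one inbound message; return True only if a new row landed."""
    if HAS_RETURNING:
        rows = db.execute(INSERT_SQL + " RETURNING id", msg).fetchall()
        inserted = len(rows) == 1
    else:
        inserted = db.execute(INSERT_SQL, msg).rowcount == 1
    db.commit()
    return inserted


def make_message(row_id, client_message_id):
    """Build a sample row; every value here is invented for the demo."""
    return {
        "id": row_id,
        "client_message_id": client_message_id,
        "mesh": "demo-mesh",
        "sender_pubkey": "pk-demo",
        "sender_name": "demo-sender",
        "body": "hello",
        "received_at": 0,
    }
```

A caller would run hooks only when `store_inbound` returns `True`, so a redelivered message with a known `client_message_id` is stored once and triggers nothing.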
### 4.6 Crash recovery — explicit semantics

On daemon startup:
1. Rows in `inflight` reset to `pending` with `attempts++`,
   `next_attempt_at = now + min_backoff`. **Note:** these may double-deliver
   if the broker actually accepted before the local ACK persisted. The
   `client_message_id` propagation ensures the broker dedupes the retry —
   net result: exactly one broker-accepted row, possibly two daemon-side
   `inflight → done` transitions.
2. `outbox.db` PRAGMA integrity_check; failure → daemon refuses to start,
   point at `claudemesh daemon recover`.
3. `inbox.db` integrity check; failure → move to `inbox.db.corrupt-<ts>`,
   create fresh empty inbox, log `inbox_corruption_recovered`. Inbox is a
   cache; recoverable from broker history.

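Steps 1 and 2 above can be sketched against a trimmed outbox table. Assumptions: timestamps are whole seconds, `MIN_BACKOFF_SECONDS` is an invented value (the spec only names `min_backoff`), and `recover_inflight` / `integrity_ok` are hypothetical helper names.

```python
import sqlite3
import time

MIN_BACKOFF_SECONDS = 5  # hypothetical; the spec only names `min_backoff`

OUTBOX_DDL = """
CREATE TABLE outbox (
  id                TEXT PRIMARY KEY,
  client_message_id TEXT NOT NULL UNIQUE,
  attempts          INTEGER DEFAULT 0,
  next_attempt_at   INTEGER NOT NULL,
  status            TEXT CHECK(status IN ('pending','inflight','done','dead'))
)
"""


def integrity_ok(db):
    """Run PRAGMA integrity_check; a healthy database reports a single 'ok'."""
    return db.execute("PRAGMA integrity_check").fetchall() == [("ok",)]


def recover_inflight(db, now=None):
    """Reset inflight rows to pending, bump attempts, schedule the retry.

    Returns the number of rows that were reset.
    """
    now = int(time.time()) if now is None else now
    cur = db.execute(
        "UPDATE outbox"
        " SET status = 'pending',"
        "     attempts = attempts + 1,"
        "     next_attempt_at = ?"
        " WHERE status = 'inflight'",
        (now + MIN_BACKOFF_SECONDS,),
    )
    db.commit()
    return cur.rowcount
```

The retry keeps the row's original `client_message_id`, which is what lets the broker collapse a send it had already accepted before the crash.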
### 4.7 Failure modes the spec is honest about

- **Broker dedupe window expired**: daemon retries a 25h-old send. Broker
  accepts it again as if new (no dedupe). This was possible because the
  daemon's previous outbox `max_age_hours` default (168h = 7d) exceeded the
  broker dedupe window (24h). The default `max_age_hours` is therefore
  REDUCED to **23h** to stay inside the broker dedupe window. It can be
  raised only if the operator accepts the risk explicitly.
- **`dead` rows**: surface in `claudemesh daemon outbox --failed`. User
  manually requeues (`outbox requeue <id>`) or drops (`outbox drop <id>`).
- **Receiver-side dedupe failure**: only daemon-hosted receivers dedupe.
  `claudemesh launch` and dashboard chat clients DO NOT dedupe today —
  fixing them is post-v0.9.0.

---

## 5. Inbound — schema corrected (see §4.5), retention as v2

30-day rolling retention (configurable). Weekly VACUUM.
`claudemesh daemon search` deferred to `inbox_fts` capability.

---

## 6. Hooks — scopes tightened, exfiltration acknowledged

Codex was right: capability tokens removed the broad-token footgun, not
exfiltration. An untrusted hook payload plus `network_policy=deny` is not
reliable across platforms. Spec is now honest about that.

### 6.1 Hooks contract — same shape as v2 §6, with tighter defaults

### 6.2 Capability scopes — narrowed for v0.9.0

Codex pushed: scopes were too coarse. v0.9.0 scopes are exactly:

| Scope | Capability | Notes |
|---|---|---|
| `reply:event` | Reply to the specific event that triggered this hook | Bound to `event_id`; daemon validates target; expires on hook exit |
| `dm:send:<sender_pubkey>` | Send DM only to the specific sender | Bound to one pubkey from event; not a write to anyone |
| `topic:<name>:post` | Post to the specific topic that fired | Bound to topic from event; can't write elsewhere |

**No read scopes in v0.9.0.** A hook cannot read state, inbox, peers, etc.
If a hook wants to consult mesh data to compose its reply, it does so via
the *event payload* (which the daemon redacted appropriately) or by shelling
out to a fresh `claudemesh <verb>` call (which uses the user's existing
config and is subject to its own auth). No daemon-mediated read tokens.

### 6.3 Sandboxing — supported, not promised

Codex caught: "network_policy=deny" sounds reliable but isn't cross-platform.
Spec now says explicitly:

- `network_policy = "deny"` is **best-effort**:
  - Linux: enforced via `unshare --net` if available; else firewall rule via
    `iptables -m owner` if available; else daemon logs warning that policy
    cannot be enforced and the hook STILL runs.
  - macOS: enforced via `sandbox-exec` profile if available; else warning + run.
  - Windows: not enforced; warning + run.
- Operators on hostile networks should set `enabled = false` for hooks they
  don't trust.
- Daemon `cm_daemon_hook_unenforceable_total` counter exposes the count of
  hooks that ran with weakened sandbox.

### 6.4 Payload size & truncation — NEW

Stdin payloads to hooks capped at 256 KB (configurable). Larger payloads
truncated with `_truncated: true` flag in the JSON event. Hook stdout
captured up to `output_size_limit` (default 64 KB).

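The spec states the cap and the `_truncated` flag but not which part of the event is cut. The sketch below assumes the `body` field is the one shortened so the result stays valid JSON; that choice, the shrink loop, and the name `build_hook_stdin` are all assumptions.

```python
import json

STDIN_LIMIT = 256 * 1024  # 256 KB cap from the text above


def build_hook_stdin(event, limit=STDIN_LIMIT):
    """Serialize an event for hook stdin, shortening `body` if it is too big.

    If the event is still over the limit with an empty body (other fields are
    too large), the oversized payload is returned as-is; a real implementation
    would need a policy for that case.
    """
    raw = json.dumps(event).encode("utf-8")
    if len(raw) <= limit:
        return raw

    trimmed = dict(event, _truncated=True, body="")
    overhead = len(json.dumps(trimmed).encode("utf-8"))
    body = str(event.get("body", ""))[: max(limit - overhead, 0)]

    while True:
        trimmed["body"] = body
        raw = json.dumps(trimmed).encode("utf-8")
        if len(raw) <= limit or not body:
            return raw
        # JSON escaping can inflate the body, so keep shrinking until it fits.
        body = body[: len(body) * 3 // 4]
```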
### 6.5 Audit log + killpg — same as v2

---

## 7. Multi-mesh — same as v2 §7

---

## 8. Auto-routing — same as v2 §8 (codex agreed it was clarified correctly)

---

## 9. Service installation — same as v2 §9

Add: when `claudemesh daemon install-service` runs in a CI-detected
environment, it prints `Refusing to install persistent service in CI; ephemeral
mode only.` and exits non-zero unless `--allow-ci-persistent` is passed.

---

## 10. Observability — same as v2 §10

Add metric: `cm_daemon_hook_unenforceable_total{hook,reason}` (§6.3).

---

## 11. SDKs — same shape as v2, bound to frozen core only

---

## 12. Security model — same boundaries, plus dedupe + feature negotiation

| Boundary | Trust | Mechanism |
|---|---|---|
| App ↔ Daemon (UDS) | OS user | UDS 0600 |
| App ↔ Daemon (TCP/SSE) | OS user + bearer token | 127.0.0.1 + `local_token` + Origin/Host |
| Hook ↔ Daemon | Capability scope | Short-lived token bound to event; no read scopes |
| Daemon ↔ Broker | Mesh keypair + feature bits | WSS + ed25519 + crypto_box + per-topic keys + feature negotiation (§15) |
| Daemon ↔ Disk | OS user | All files 0600/0644 |
| Cloned identity | First-mac fingerprint | Accidental-clone detection only; broker concurrent-connection policy in §2.3 |

---

## 13. Configuration — same shape as v2 §13, plus `[features]`

```toml
[features]
require  = ["client_message_id_dedupe", "concurrent_connection_policy"]
optional = ["mesh_skill_share", "mcp_host"]
# Daemon refuses to start if broker doesn't advertise all `require` bits.
```

---

## 14. Lifecycle — key rotation crypto fixed

### 14.1 Key rotation (CORRECTED — codex)

v2 said: *"old pubkey held server-side for 24h grace (decrypts in-flight
messages encrypted to old pubkey)"*. **Wrong** — only the daemon has the
private key. Broker can't decrypt.

Real semantics:

- `claudemesh daemon rotate-keypair` mints fresh ed25519 + x25519, registers
  the new pubkey with the broker as `member_keypair_rotated`.
- Broker associates the new pubkey with the same member id, marks the old
  pubkey as `rotated_out` (not revoked).
- **Daemon-side**: the OLD x25519 private key is retained in
  `keypair-archive.json` (mode 0600, durable) for a `key_grace_period`
  (default 7 days). During the grace window, daemon will attempt to decrypt
  inbound messages with the new private key first, falling back to archived
  keys (one or more). Messages encrypted to the old pubkey by senders who
  haven't yet seen the rotation event continue to decrypt cleanly.
- After the grace period, archived keys are zeroed and the file is deleted.
  Messages encrypted to a stale pubkey after the grace window fail to
  decrypt and are logged as `cm_daemon_decrypt_stale_total`.

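The daemon-side fallback order can be sketched without committing to a crypto library. Here `decrypt` is a caller-supplied function (a hypothetical interface that raises `ValueError` on failure), archive entries are assumed to carry a `rotated_at` timestamp in seconds, and both function names are invented for illustration.

```python
GRACE_SECONDS = 7 * 24 * 3600  # default key_grace_period of 7 days


def decrypt_with_fallback(ciphertext, current_key, archived_keys, decrypt):
    """Try the current private key first, then archived keys in order.

    `decrypt(key, ciphertext)` must return the plaintext or raise ValueError.
    """
    for key in [current_key, *archived_keys]:
        try:
            return decrypt(key, ciphertext)
        except ValueError:
            continue
    # A real daemon would count this under cm_daemon_decrypt_stale_total.
    raise ValueError("message encrypted to a stale or unknown key")


def prune_archive(archive, now, grace_seconds=GRACE_SECONDS):
    """Keep only archived keys that are still inside their grace period."""
    return [entry for entry in archive if now - entry["rotated_at"] < grace_seconds]
```

Trying the current key first keeps the common case to a single decrypt attempt; the archive is consulted only for senders who have not yet seen the rotation.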
### 14.2 Backup includes topic state (CORRECTED)

`claudemesh daemon backup` now packages:
- `keypair.json` (current)
- `keypair-archive.json` (any in-grace-window archived keys)
- `host_fingerprint.json`
- `config.toml`
- NOT `local_token` (excluded; the token is regenerated on restore)
- `topic_subscriptions.json` (which topics this daemon subscribes to)
- `topic_keys.json` (per-topic symmetric keys this member holds)
- `key_epoch.json` (current epoch number per topic; relevant when the mesh
  rotates topic keys)
- `schema_version`

Backup file: encrypted with a passphrase (Argon2id KDF + crypto_secretbox).
Restore writes everything except `local_token` (regenerated). On first run
after restore, daemon performs `accept-host` if fingerprint mismatches
(restore is by definition a host change).

### 14.3 Local token rotation, compromised host revocation, image-clone, uninstall, recovery — same as v2 §14

---

## 15. Version compat — feature-bit negotiation (REPLACES v2 §15)
|
||||
|
||||
Codex was right: version ranges aren't enough when daemon depends on
|
||||
specific broker capabilities (client-supplied IDs, concurrent-connection
|
||||
policy, key epochs).
|
||||
|
||||
### 15.1 Feature bits
|
||||
|
||||
Each protocol-relevant capability gets a stable string identifier:
|
||||
|
||||
```
|
||||
client_message_id_dedupe broker dedupes on client_message_id (§4.2)
|
||||
concurrent_connection_policy broker honours mesh.cloneConcurrencyPolicy (§2.3)
|
||||
member_keypair_rotated_event broker emits the event (§14.1)
|
||||
key_epoch per-topic key epochs supported (§14.2)
|
||||
mesh_skill_share post-v0.9, future
|
||||
mcp_host post-v0.9, future
|
||||
```
|
||||
|
||||
### 15.2 Negotiation handshake
|
||||
|
||||
On WS connect (after hello, before normal traffic):
|
||||
|
||||
```
|
||||
→ daemon: feature_negotiation_request
|
||||
{ require: ["client_message_id_dedupe",
|
||||
"concurrent_connection_policy"],
|
||||
optional: ["mesh_skill_share","mcp_host"] }
|
||||
|
||||
← broker: feature_negotiation_response
|
||||
{ supported: ["client_message_id_dedupe",
|
||||
"concurrent_connection_policy",
|
||||
"member_keypair_rotated_event"],
|
||||
missing_required: [] }
|
||||
```
|
||||
|
||||
If `missing_required` is non-empty, daemon closes the connection with code
|
||||
4010 `feature_unavailable`, logs forensic event, exits with non-zero status.
|
||||
Supervisor sees a restart-loop → operator alerted via configured
|
||||
mechanisms.
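The handshake above can be sketched as two small functions: one for the broker's response and one for the daemon's close decision. This is a minimal illustration, not the broker's actual code; the function names `negotiate` and `daemon_close_code` are hypothetical, and it assumes the broker reports every feature it supports rather than only the ones requested (as the example response suggests).

```python
# Close code named in the spec for a missing required feature.
CLOSE_FEATURE_UNAVAILABLE = 4010


def negotiate(require, optional, broker_supported):
    """Build a feature_negotiation_response (hypothetical helper).

    `optional` is accepted for completeness; optional features never
    contribute to `missing_required`.
    """
    supported_set = set(broker_supported)
    missing_required = [f for f in require if f not in supported_set]
    return {
        "supported": list(broker_supported),
        "missing_required": missing_required,
    }


def daemon_close_code(response):
    """Return the WS close code the daemon should use, or None to proceed."""
    if response["missing_required"]:
        return CLOSE_FEATURE_UNAVAILABLE
    return None
```

With the request and broker feature set from the example, `missing_required` is empty and the daemon proceeds; a broker that lacks `client_message_id_dedupe` yields close code 4010.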
|
||||
|
||||
### 15.3 IPC negotiation (CLI/SDK ↔ daemon)
|
||||
|
||||
`GET /v1/version` returns:
|
||||
```json
|
||||
{
|
||||
"daemon_version": "0.9.0",
|
||||
"ipc_api": "v1",
|
||||
"ipc_features": ["send","topic","peers","files","events","health"],
|
||||
"schema_version": 7,
|
||||
"broker_features_negotiated": ["client_message_id_dedupe", ...]
|
||||
}
|
||||
```
|
||||
|
||||
CLI/SDK matches `ipc_features` against required. Missing required →
|
||||
fall back to the cold path with a warning OR fail explicitly (the CLI verb's choice).
|
||||
|
||||
### 15.4 Compatibility matrix — published
|
||||
|
||||
Response of `GET /v1/compat`:

```json
|
||||
{
|
||||
"daemon": "0.9.0",
|
||||
"compatible_brokers": ["0.7.x","0.8.x","0.9.x"],
|
||||
"required_broker_features": ["client_message_id_dedupe",
|
||||
"concurrent_connection_policy"],
|
||||
"compatible_clis": ["0.9.x"],
|
||||
"compatible_sdks": {
|
||||
"python": ">=0.9.0,<1.0.0",
|
||||
"go": ">=0.9.0,<1.0.0",
|
||||
"ts": ">=0.9.0,<1.0.0"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 16. Threat model — shared-CI reality folded in
|
||||
|
||||
### 16.1 Attacker classes — same matrix as v2 §16, plus:
|
||||
|
||||
| Attacker | Has | Wants | Mitigations |
|
||||
|---|---|---|---|
|
||||
| **Shared CI runner** (NEW) | Same Unix UID as other untrusted jobs | Read this user's persistent keypair across job boundaries | Auto-detect CI envs (§2.1) → ephemeral default + UDS-only + isolated `$HOME`. If operator overrides with `--persistent`, log warning `persistent_keypair_in_ci_environment`. |
|
||||
| **Malicious mesh peer** (PROMOTED from out-of-scope to in-scope) | Mesh membership | Send malformed payload to crash daemon | Every inbound shape validated against schema before any processing. Daemon refuses unknown fields (defense-in-depth) and emits `cm_daemon_invalid_inbound_total`. Crashes from inbound payloads are bugs. |
|
||||
|
||||
### 16.2 Stated explicitly out of scope
|
||||
|
||||
- Root attacker on daemon host (can read keypair directly).
|
||||
- Compromised broker (E2E content protection still holds; metadata is not
|
||||
protected by daemon — that's mesh-level).
|
||||
- Sophisticated attacker who copies BOTH `keypair.json` and
|
||||
`host_fingerprint.json` (§2.2 calls this out).
|
||||
- Receivers other than daemon-hosted peers deduping inbound traffic
|
||||
(post-v0.9.0).
|
||||
|
||||
### 16.3 Container & CI defaults table (NEW)
|
||||
|
||||
| Environment | Identity | IPC | Hooks |
|
||||
|---|---|---|---|
|
||||
| Bare metal / VM (default) | Persistent (clone-detected) | UDS + TCP loopback | Enabled |
|
||||
| Docker container (`/.dockerenv`) | Persistent | UDS-only by default | Enabled |
|
||||
| Kubernetes (`KUBERNETES_SERVICE_HOST`) | Persistent | UDS-only | Enabled |
|
||||
| CI (`CI=true`, `GITHUB_ACTIONS`, etc.) | Ephemeral | UDS-only | Disabled by default (`[hooks] enabled = false` until opted-in) |
|
||||
| RunPod (`RUNPOD_POD_ID`) | Ephemeral | UDS-only | Enabled |
|
||||
|
||||
Operator overrides any default with explicit flags; warning logged for
|
||||
non-default-secure choices.
|
||||
|
||||
---
|
||||
|
||||
## 17. Migration — same as v2 §17, plus broker schema add
|
||||
|
||||
Broker needs the schema delta in §4.3 (additive, partial unique indexes —
|
||||
safe for online migration). Coordinated with daemon rollout: broker first,
|
||||
then daemon. Daemon refuses to start against a broker that lacks
|
||||
`client_message_id_dedupe` feature bit (§15).
|
||||
|
||||
---
|
||||
|
||||
## What needs review (round 3)
|
||||
|
||||
Round 1 → identity, IPC auth, exactly-once lie, hook tokens, surface bloat,
|
||||
missing rotation/recovery/migration/threat-model.
|
||||
|
||||
Round 2 → boot-id false-positive, broker must dedupe on client id (protocol
|
||||
change), CI shared-runner reality, feature-bit negotiation, key rotation
|
||||
crypto, hook scopes, FTS schema, ~7 polish items.
|
||||
|
||||
This v3 attempts to address all of those. Specifically critique:
|
||||
|
||||
1. **Accidental-clone framing (§2.2)** — does the honest framing close the
|
||||
issue, or does removing boot-id make the detection so weak it's not worth
|
||||
shipping at all? Should we drop fingerprint detection entirely and rely on
|
||||
broker concurrent-connection policy?
|
||||
2. **Broker schema delta (§4.3)** — is this the smallest correct change?
|
||||
Partial unique indexes feel right; anything else needed (audit table,
|
||||
gc job)?
|
||||
3. **`max_age_hours` reduced to 23h** — codex's logic says daemon outbox TTL
|
||||
must be inside broker dedupe window. Is 23h vs 24h tight enough? Should
|
||||
the broker advertise its dedupe window as a feature parameter so the
|
||||
daemon configures itself?
|
||||
4. **Hook scopes (§6.2)** — too tight? `reply:event` + `dm:send:<sender>` +
|
||||
`topic:<name>:post`. Does this cover real use cases for v0.9.0 hooks
|
||||
(auto-reply, escalate-to-oncall, file-receipt-ack)?
|
||||
5. **Feature-bit negotiation (§15)** — is the scheme right? Should
|
||||
feature-bits be string identifiers (current) or numeric bit positions in
|
||||
a bitmask (denser, more brittle)?
|
||||
6. **CI defaults (§16.3)** — is the table accurate? Anything wrong about
|
||||
defaulting hooks-disabled in CI?
|
||||
7. **Key rotation grace-key archive (§14.1)** — is 7d the right default? Is
|
||||
storing archived private keys on disk (mode 0600) acceptable, or should
|
||||
they be encrypted at rest with a passphrase?
|
||||
8. **Anything still wrong?** Read it as if you were going to operate this
|
||||
daemon for a year — what falls down?
|
||||
|
||||
Three options after this review:
|
||||
- **(a) v3 is shippable**: lock the spec, start coding the frozen core.
|
||||
- **(b) v4 needed**: list the must-fix items.
|
||||
- **(c) the architecture itself is wrong**: what would you do differently?
|
||||
|
||||
Be ruthless. We can break anything.
|
||||
538
.artifacts/shipped/2026-05-03-daemon-final-spec-v4.md
Normal file
@@ -0,0 +1,538 @@
|
||||
# `claudemesh daemon` — Final Spec v4
|
||||
|
||||
> **Round 4.** v3 was reviewed by codex (round 3) and got an overall pass on
|
||||
> architecture but flagged three precision gaps: (1) broker dedupe window
|
||||
> semantics — permanent or windowed? schema as drawn was permanent but the
|
||||
> prose said 24h; (2) feature-bit negotiation should carry parameters, not
|
||||
> just booleans (so daemon can derive its outbox TTL from broker policy
|
||||
> instead of hardcoding 23h); (3) key-archive record format and retention
|
||||
> behavior were unspecified. Plus minor polish: document machine-id/MAC
|
||||
> source precedence per OS, explicitly defer arbitrary outbound hook sends,
|
||||
> resolve RunPod identity-vs-hooks inconsistency.
|
||||
>
|
||||
> **The intent §0 is unchanged from v2 — read it there.** v4 only revises
|
||||
> what changed from v3.
|
||||
|
||||
---
|
||||
|
||||
## 0. Intent — unchanged, see v2 §0
|
||||
|
||||
Pre-launch peer-mesh runtime. Servers/laptops become first-class peers.
|
||||
Stable identity, persistent WS, local IPC, hooks. Not a webhook gateway, not
|
||||
a generic broker. We can break anything.
|
||||
|
||||
**One claim retracted from v1/v2**: "exactly-once" delivery. Replaced with a
|
||||
precise contract in §4 below.
|
||||
|
||||
---
|
||||
|
||||
## 1. Process model — unchanged from v3 §1 / v2 §1
|
||||
|
||||
Resource caps, file layout, single-binary unchanged.
|
||||
|
||||
---
|
||||
|
||||
## 2. Identity — accidental-clone detection only, plus broker dedupe
|
||||
|
||||
Codex round-2 fix retained: no boot-id (false-positives every reboot).
|
||||
Codex round-3 polish: spell out fingerprint sources per OS so we don't ship
|
||||
a brittle "machine-id || first-mac" with no precedence rules.
|
||||
|
||||
### 2.1 Modes
|
||||
|
||||
```
|
||||
claudemesh daemon up # default: persistent member
|
||||
claudemesh daemon up --ephemeral # in-memory keypair, never written
|
||||
claudemesh daemon up --ephemeral --ttl 2h # auto-shutdown after duration
|
||||
```
|
||||
|
||||
**CI auto-detection**: if any of these env vars are set (`CI=true`,
|
||||
`GITHUB_ACTIONS`, `GITLAB_CI`, `BUILDKITE`, `CIRCLECI`, `JENKINS_URL`,
|
||||
`KUBERNETES_SERVICE_HOST`), AND `--persistent` is not explicitly passed,
|
||||
daemon defaults to `--ephemeral`. Rationale in §16.
|
||||
|
||||
`RUNPOD_POD_ID` removed from auto-CI list (was inconsistent — see §16.3).
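The auto-detection rule can be sketched as a pure function over the environment. This is an illustrative sketch only: the function name and flag parameters are hypothetical, and it assumes that "set" means a non-empty value and that `CI` must equal `true` exactly.

```python
# Env vars from the CI auto-detect list above (CI itself is checked separately).
CI_ENV_VARS = (
    "GITHUB_ACTIONS",
    "GITLAB_CI",
    "BUILDKITE",
    "CIRCLECI",
    "JENKINS_URL",
    "KUBERNETES_SERVICE_HOST",
)


def default_identity_mode(env, persistent_flag=False, ephemeral_flag=False):
    """Pick "persistent" or "ephemeral" (hypothetical helper).

    Explicit flags always win; auto-detection only applies when neither
    --persistent nor --ephemeral was passed.
    """
    if ephemeral_flag:
        return "ephemeral"
    if persistent_flag:
        return "persistent"
    in_ci = env.get("CI") == "true" or any(env.get(name) for name in CI_ENV_VARS)
    return "ephemeral" if in_ci else "persistent"
```

A caller would pass `os.environ` as `env`. Note that `RUNPOD_POD_ID` is deliberately absent from the list, so a RunPod host resolves to `persistent`.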
|
||||
|
||||
### 2.2 Accidental-clone detection (NOT attacker-grade)
|
||||
|
||||
This catches **image clones, restored backups, copy-pasted homedirs** —
|
||||
accidents made by humans. It does not defend against an attacker who copies
|
||||
both `keypair.json` and `host_fingerprint.json`. The threat model (§16) says
|
||||
this explicitly.
|
||||
|
||||
#### 2.2.1 Fingerprint source precedence (NEW — codex r3)
|
||||
|
||||
`host_fingerprint.json` stores `sha256(host_id || stable_mac)` where the
|
||||
inputs are computed from the OS-specific table below, in order:
|
||||
|
||||
| OS | `host_id` (try in order) | `stable_mac` |
|
||||
|---|---|---|
|
||||
| Linux | `/etc/machine-id` → `/var/lib/dbus/machine-id` → first stable MAC | First non-loopback non-virtual interface, lex-sorted by name (`en…`/`eth…` before `wl…`); `docker0/veth*/br-*/lo` excluded |
|
||||
| macOS | `IOPlatformUUID` (`ioreg -rd1 -c IOPlatformExpertDevice`) | First non-loopback non-virtual interface (`en0` typical) |
|
||||
| Windows | `HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid` | First physical adapter (`Get-NetAdapter -Physical`), MAC sorted lex by adapter name |
|
||||
| BSD | `kern.hostuuid` (`sysctl -n kern.hostuuid`) | Same MAC rule as Linux |
|
||||
|
||||
**Excluded interfaces** (cross-platform): loopback, point-to-point tunnels
(`tailscale*`, `wg*`, `utun*`, `ppp*`), Docker (`docker0`, `br-*`, `veth*`), VPN
(`tap*`/`tun*`), VM bridges (`vboxnet*`, `vmnet*`), Apple `awdl`/`llw` bridges.
|
||||
|
||||
**Cloud-image false-positive note**: bare AMIs/Azure images regenerate
|
||||
`/etc/machine-id` on first boot via cloud-init; for those, the first-boot
|
||||
fingerprint is what we keep. If an operator clones a *running* VM
|
||||
post-cloud-init, both `host_id` AND first-MAC will collide → the daemon
|
||||
correctly flags this as an accidental clone.
|
||||
|
||||
If `host_id` cannot be read on the host's OS, daemon logs
|
||||
`fingerprint_host_id_unavailable` and falls back to MAC-only. If the MAC is also
|
||||
unavailable (truly headless container with no NIC), daemon logs
|
||||
`fingerprint_unavailable`, persists a random UUID as `host_id`, and the
|
||||
clone-detection feature is effectively disabled for this host (broker
|
||||
concurrent-connection policy still works).
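The interface filter and fallback chain in this section can be sketched as follows. This is a hedged illustration, not the daemon's implementation: the helper names are hypothetical, `||` is assumed to mean plain string concatenation of UTF-8 text, an absent MAC alongside a readable `host_id` is assumed to contribute an empty string, and reading `host_id` from the OS is left out entirely.

```python
import hashlib
import uuid
from fnmatch import fnmatch

# Glob patterns for the excluded interface families listed above.
EXCLUDED_PATTERNS = (
    "lo", "lo0", "tailscale*", "wg*", "utun*", "ppp*",
    "docker0", "br-*", "veth*", "tap*", "tun*",
    "vboxnet*", "vmnet*", "awdl*", "llw*",
)


def pick_stable_mac(interfaces):
    """interfaces maps name -> MAC. Return the MAC of the first eligible
    interface in lexicographic name order, or None if none qualifies."""
    eligible = sorted(
        name for name in interfaces
        if not any(fnmatch(name, pattern) for pattern in EXCLUDED_PATTERNS)
    )
    return interfaces[eligible[0]] if eligible else None


def compute_fingerprint(host_id, stable_mac):
    """Return (sha256 hex digest, list of log events to emit)."""
    events = []
    if host_id is None and stable_mac is None:
        # No usable input: persist a random UUID; clone detection is
        # effectively disabled for this host.
        events.append("fingerprint_unavailable")
        host_id = str(uuid.uuid4())
    elif host_id is None:
        events.append("fingerprint_host_id_unavailable")
    material = (host_id or "") + (stable_mac or "")
    return hashlib.sha256(material.encode("utf-8")).hexdigest(), events
```

Plain lexicographic sorting already places `en…`/`eth…` names before `wl…` names, which matches the ordering rule in the Linux row.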
|
||||
|
||||
Behavior on mismatch (unchanged from v3): refuse / `accept-host` / `remint`.
|
||||
`[clone] policy = "refuse" | "warn" | "allow"` overrides per host.
|
||||
|
||||
### 2.3 Concurrent-duplicate-identity broker policy — unchanged from v3 §2.3
|
||||
|
||||
`prefer_newest` (default), `prefer_oldest`, `allow_concurrent`. Configured
|
||||
per-mesh in `mesh.cloneConcurrencyPolicy`.
|
||||
|
||||
### 2.4 Rename, key rotation — see §14
|
||||
|
||||
---
|
||||
|
||||
## 3. IPC surface — unchanged from v3 §3
|
||||
|
||||
Same frozen core, same auth model (UDS 0600 / TCP+SSE bearer / no token in
|
||||
query / all endpoints auth by default / UDS-only in containers / Origin/Host
|
||||
checks / no User-Agent theatre).
|
||||
|
||||
---
|
||||
|
||||
## 4. Delivery contract — at-least-once, **permanent** broker dedupe
|
||||
|
||||
Codex round 3 caught: v3's prose said "24h dedupe window" but the schema
|
||||
(partial unique indexes with no `created_at`) gave **permanent** dedupe. We
|
||||
have to pick. v4 chooses **permanent dedupe** because:
|
||||
|
||||
- It's the simplest correct choice. No GC job, no edge case where a
|
||||
long-asleep daemon's retry slips past the window and double-sends.
|
||||
- The unique index storage cost is bounded: at 1 KB per row × 100k
|
||||
messages/day × 365 = ~36 GB/year of broker storage, which is well within
|
||||
the broker's existing message-retention budget. Older message rows
|
||||
themselves can still be GC'd by the existing message retention policy
|
||||
(currently 365d) — only the `client_message_id` column on retained rows
|
||||
has to live as long as that row does.
|
||||
- It eliminates the daemon-side `max_age_hours = 23h` hack. Daemon outbox
|
||||
TTL becomes "however long you want to keep retrying"; default 7d.
|
||||
- It removes a class of "where exactly is the dedupe window edge?" bugs.
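For reference, the storage estimate in the second bullet works out as below. The figures are the spec's own; the only assumption is decimal units (1 KB = 1000 bytes, 1 GB = 10^9 bytes).

```python
ROW_BYTES = 1_000            # 1 KB per row, decimal units assumed
MESSAGES_PER_DAY = 100_000
RETENTION_DAYS = 365

total_bytes = ROW_BYTES * MESSAGES_PER_DAY * RETENTION_DAYS
total_gb = total_bytes / 1e9

print(f"{total_gb:.1f} GB per year")  # prints "36.5 GB per year"
```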
|
||||
|
||||
If broker storage growth becomes a real concern post-v0.9.0, we can convert
|
||||
to a windowed scheme via a feature-bit upgrade (§15) — but we'd own the
|
||||
correct migration semantics then.
|
||||
|
||||
### 4.1 The contract (precise)
|
||||
|
||||
> **Local guarantee**: each successful `POST /v1/send` returns a stable
|
||||
> `client_message_id`. The send is durably persisted to `outbox.db` before
|
||||
> the response returns.
|
||||
>
|
||||
> **Broker guarantee**: the broker dedupes on `client_message_id`
|
||||
> **permanently within the lifetime of the row**. Multiple inflight retries
|
||||
> from the daemon for the same `client_message_id` produce **at most one**
|
||||
> broker-accepted row, regardless of time elapsed (subject to message-row
|
||||
> retention policy on the broker). This is advertised via the
|
||||
> `client_message_id_dedupe` feature-bit with `{ mode: "permanent" }`
|
||||
> parameter (§15).
|
||||
>
|
||||
> **End-to-end guarantee**: at-least-once delivery to subscribers, with
|
||||
> `client_message_id` propagated in the inbound envelope so receivers can
|
||||
> dedupe locally. We do **not** guarantee at-most-once end-to-end —
|
||||
> receiver-side dedupe is the receiver's job. The daemon's `inbox.db`
|
||||
> provides it for daemon-hosted peers.
|
||||
|
||||
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
|
||||
|
||||
Sources: `Idempotency-Key` header → body `client_message_id` → daemon-minted
|
||||
ulid. Stored in outbox UNIQUE NOT NULL, propagated to broker, propagated to
|
||||
receivers.
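The source precedence reads naturally as a three-step lookup. The sketch below is illustrative: the function name is hypothetical, header lookup is made case-insensitive because HTTP header names are, and `uuid.uuid4().hex` is only a stand-in for the daemon-minted ULID (the Python standard library has no ULID generator).

```python
import uuid


def resolve_client_message_id(headers, body):
    """Return (client_message_id, source) following the precedence
    header -> body -> minted. Hypothetical helper."""
    normalized = {name.lower(): value for name, value in headers.items()}
    from_header = normalized.get("idempotency-key")
    if from_header:
        return from_header, "header"
    from_body = body.get("client_message_id")
    if from_body:
        return from_body, "body"
    # Stand-in for a ULID; a real daemon would mint a sortable ULID here.
    return uuid.uuid4().hex, "minted"
```

Whatever the source, the resolved value is what gets stored in the outbox and forwarded to the broker.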
|
||||
|
||||
### 4.3 Broker schema delta — clarified as permanent dedupe
|
||||
|
||||
```sql
|
||||
ALTER TABLE mesh.topic_message
|
||||
ADD COLUMN client_message_id TEXT;
|
||||
ALTER TABLE mesh.message_queue
|
||||
ADD COLUMN client_message_id TEXT;
|
||||
|
||||
CREATE UNIQUE INDEX topic_message_client_id_idx
|
||||
ON mesh.topic_message(mesh_id, client_message_id)
|
||||
WHERE client_message_id IS NOT NULL;
|
||||
CREATE UNIQUE INDEX message_queue_client_id_idx
|
||||
ON mesh.message_queue(mesh_id, client_message_id)
|
||||
WHERE client_message_id IS NOT NULL;
|
||||
|
||||
-- No created_at column needed for dedupe; the existing message row's
|
||||
-- created_at handles row-level retention. Dedupe is permanent for the row's
|
||||
-- lifetime, then naturally GC'd when the row is purged.
|
||||
```
|
||||
|
||||
Partial unique indexes — legacy traffic without `client_message_id` (from
|
||||
`claudemesh launch`, dashboard chat, web posts) is unaffected.
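To see how the partial unique index behaves, here is a small self-contained demonstration. It uses in-memory SQLite as a stand-in for Postgres (both support partial unique indexes), drops the `mesh.` schema prefix, and uses a hypothetical `accept_message` helper; the real broker's insert path may look quite different.

```python
import sqlite3


def open_broker_db():
    """Create an in-memory table shaped like the delta above (simplified)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE topic_message (
               id INTEGER PRIMARY KEY,
               mesh_id TEXT NOT NULL,
               body TEXT NOT NULL,
               client_message_id TEXT
           )"""
    )
    db.execute(
        """CREATE UNIQUE INDEX topic_message_client_id_idx
               ON topic_message(mesh_id, client_message_id)
               WHERE client_message_id IS NOT NULL"""
    )
    return db


def accept_message(db, mesh_id, body, client_message_id=None):
    """Insert a message; on a duplicate client id return the original row.

    Returns (message_id, created).
    """
    try:
        with db:  # commits on success, rolls back on error
            cur = db.execute(
                "INSERT INTO topic_message (mesh_id, body, client_message_id) "
                "VALUES (?, ?, ?)",
                (mesh_id, body, client_message_id),
            )
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        row = db.execute(
            "SELECT id FROM topic_message "
            "WHERE mesh_id = ? AND client_message_id = ?",
            (mesh_id, client_message_id),
        ).fetchone()
        return row[0], False
```

A retry with the same `(mesh_id, client_message_id)` returns the original id with `created=False`, while legacy inserts with a NULL id never conflict with each other.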
|
||||
|
||||
**Migration**: additive-only. On Postgres, `ALTER TABLE ... ADD COLUMN` for a
nullable column with no default is a metadata-only change that holds an
`ACCESS EXCLUSIVE` table lock only briefly. Build the indexes with `CREATE
UNIQUE INDEX CONCURRENTLY` (which must run outside a transaction block) so
writes are not blocked during the build. Deploy order: schema migration →
broker code that reads/writes `client_message_id` → daemon code that sends it
→ daemon enforces feature bit.
|
||||
|
||||
### 4.4 Outbox schema — unchanged from v3 §4.4
|
||||
|
||||
`UNIQUE NOT NULL` on `client_message_id`. Default `max_age_hours` raised
|
||||
back to **168h (7d)** because broker dedupe is permanent — no need to stay
|
||||
inside a 24h window.
|
||||
|
||||
### 4.5 Inbox schema — unchanged from v3 §4.5
|
||||
|
||||
Content table + indexes; FTS5 deferred.
|
||||
|
||||
### 4.6 Crash recovery — unchanged from v3 §4.6
|
||||
|
||||
### 4.7 Failure modes — windowed-broker case removed
|
||||
|
||||
The "broker dedupe window expired" failure mode in v3 §4.7 is **deleted**
|
||||
because dedupe is permanent. Remaining cases:
|
||||
|
||||
- **`dead` rows**: surface in `claudemesh daemon outbox --failed`. User
|
||||
manually requeues (`outbox requeue <id>`) or drops (`outbox drop <id>`).
|
||||
- **Receiver-side dedupe**: only daemon-hosted receivers dedupe.
|
||||
`claudemesh launch` and dashboard chat don't dedupe today; post-v0.9.0.
|
||||
- **Broker row already GC'd, daemon retries**: daemon retry hits the
|
||||
partial unique index → 23505 conflict. Broker treats as already-accepted,
|
||||
returns the original `messageId` from a soft-delete tombstone OR (if the
|
||||
row was hard-deleted by retention) returns `client_id_unknown`. Daemon
|
||||
treats `client_id_unknown` as "delivered, history may have been pruned"
|
||||
and marks `done`. Tombstone strategy is a broker implementation choice
|
||||
(advertised via `client_message_id_dedupe.tombstone_retention_days` in
|
||||
§15.1).
|
||||
|
||||
---
|
||||
|
||||
## 5. Inbound — unchanged from v3 §5
|
||||
|
||||
---
|
||||
|
||||
## 6. Hooks — scopes tightened (codex r2), explicit deferment of arbitrary sends (codex r3)
|
||||
|
||||
### 6.1 Hooks contract — unchanged from v2 §6 / v3 §6.1
|
||||
|
||||
### 6.2 Capability scopes — narrowed for v0.9.0
|
||||
|
||||
| Scope | Capability | Notes |
|
||||
|---|---|---|
|
||||
| `reply:event` | Reply to the specific event that triggered this hook | Bound to `event_id`; daemon validates target; expires on hook exit |
|
||||
| `dm:send:<sender_pubkey>` | Send DM only to the specific sender | Bound to one pubkey from the event; not a grant to send to anyone else |
|
||||
| `topic:<name>:post` | Post to the specific topic that fired | Bound to topic from event; can't write elsewhere |
|
||||
|
||||
**No read scopes in v0.9.0.** Hooks read via the event payload (which the
|
||||
daemon redacts appropriately), not via daemon-mediated reads.
|
||||
|
||||
**Explicitly deferred to post-v0.9.0** (codex r3 — say it out loud so use
|
||||
cases don't pile up against an undocumented limit):
|
||||
|
||||
- **Arbitrary outbound `dm:send` to anyone other than the event sender** —
|
||||
no scope grant for this. "Escalate to oncall" hooks must shell out to
|
||||
`claudemesh send <oncall>` with the user's normal config; the daemon
|
||||
doesn't issue capability tokens for arbitrary recipients.
|
||||
- **Cross-topic post** — a hook firing on `topic:alerts` cannot post to
|
||||
`topic:incidents`. Same reason.
|
||||
- **Mesh-cross post** — hooks see one mesh at a time.
|
||||
- **Reading state/inbox/peers** — covered above.
|
||||
|
||||
If a real use case demands cross-topic or arbitrary-recipient hooks
|
||||
post-v0.9.0, we add scopes like `dm:send:*` (wildcard) or
|
||||
`topic:*:post` (wildcard) and gate them behind explicit operator opt-in in
|
||||
config (`[hooks.<name>] dangerous_wildcards = true`). Not in v0.9.0.
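One way to picture the v0.9.0 scope model is as exact-string grants derived from the triggering event. The sketch below is an assumption-laden illustration: the event field names (`event_id`, `sender_pubkey`, `topic`), the action names, and both helpers are hypothetical, and it grants each scope whenever the corresponding event field is present.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HookGrant:
    event_id: str
    scopes: frozenset


def grant_for_event(event):
    """Derive the narrow scope set for one hook invocation."""
    scopes = {"reply:event"}
    if event.get("sender_pubkey"):
        scopes.add(f"dm:send:{event['sender_pubkey']}")
    if event.get("topic"):
        scopes.add(f"topic:{event['topic']}:post")
    return HookGrant(event_id=event["event_id"], scopes=frozenset(scopes))


def is_allowed(grant, action, target):
    """Exact-match check; there are no wildcard scopes in this model."""
    if action == "reply":
        return "reply:event" in grant.scopes and target == grant.event_id
    if action == "dm":
        return f"dm:send:{target}" in grant.scopes
    if action == "topic_post":
        return f"topic:{target}:post" in grant.scopes
    return False
```

Because scopes are matched as exact strings, a hook fired on `topic:alerts` has no way to express a post to `incidents`, which is the deferred cross-topic case.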
|
||||
|
||||
### 6.3 Sandboxing — unchanged from v3 §6.3
|
||||
|
||||
Best-effort `network_policy = "deny"`; cross-platform unenforceability
|
||||
acknowledged; counter `cm_daemon_hook_unenforceable_total` exposed.
|
||||
|
||||
### 6.4 Payload size & truncation — unchanged from v3 §6.4
|
||||
|
||||
### 6.5 Audit log + killpg — unchanged
|
||||
|
||||
---
|
||||
|
||||
## 7. Multi-mesh — unchanged
|
||||
|
||||
## 8. Auto-routing — unchanged
|
||||
|
||||
## 9. Service installation — unchanged
|
||||
|
||||
## 10. Observability — unchanged
|
||||
|
||||
## 11. SDKs — unchanged
|
||||
|
||||
## 12. Security model — unchanged
|
||||
|
||||
---
|
||||
|
||||
## 13. Configuration — unchanged shape, plus parameterized features
|
||||
|
||||
```toml
|
||||
[features]
|
||||
require = [
|
||||
"client_message_id_dedupe", # broker provides §4.1 contract
|
||||
"concurrent_connection_policy", # broker honours mesh.cloneConcurrencyPolicy
|
||||
]
|
||||
optional = ["mesh_skill_share", "mcp_host"]
|
||||
# Daemon refuses to start if broker doesn't advertise all `require` bits.
|
||||
# Broker advertises feature parameters in the negotiation response (§15.1)
|
||||
# — daemon picks up `dedupe_mode` and `tombstone_retention_days` from there
|
||||
# and writes them to its runtime view, not config.
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 14. Lifecycle — key rotation crypto fixed (codex r2), archive format spec'd (codex r3)
|
||||
|
||||
### 14.1 Key rotation — crypto correct (codex r2)
|
||||
|
||||
`claudemesh daemon rotate-keypair`:
|
||||
|
||||
- Mints fresh ed25519 + x25519 keypairs.
|
||||
- Registers new pubkeys with the broker as `member_keypair_rotated` event.
|
||||
- Broker associates the new pubkey with the same member id, marks the old
|
||||
pubkey as `rotated_out` (not revoked); senders who haven't received the
|
||||
rotation event continue to encrypt to the old pubkey for a grace window.
|
||||
- Daemon retains the old x25519 **private** key (only x25519 — ed25519 is
|
||||
for signing, doesn't need a grace window) in `keypair-archive.json`.
|
||||
- During grace, decrypt path: try current private key first; on
|
||||
`crypto_box_open_easy` failure, walk archived keys in order. Successful
|
||||
archived-key decrypts increment `cm_daemon_decrypt_archived_total`.
|
||||
- After grace expiry, archived keys are zeroed and the file is rewritten
|
||||
without them. Messages still encrypted to a fully-expired pubkey fail to
|
||||
decrypt and increment `cm_daemon_decrypt_stale_total`.
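The grace-window decrypt path can be sketched independently of the crypto library by injecting the box-open function. Everything here is illustrative: `decrypt_with_grace`, `DecryptError`, and `toy_open_box` are hypothetical names, the toy function merely stands in for `crypto_box_open_easy`, and counting every total failure as "stale" is a simplification (a real daemon would likely distinguish corrupted ciphertext).

```python
from collections import Counter


class DecryptError(Exception):
    """Raised when a key cannot open a ciphertext."""


def toy_open_box(ciphertext, key):
    """Stand-in for crypto_box_open_easy: ciphertext is (key, plaintext)."""
    sealed_for, plaintext = ciphertext
    if sealed_for != key:
        raise DecryptError("wrong key")
    return plaintext


def decrypt_with_grace(ciphertext, current_key, archived_keys, open_box, metrics):
    """Try the current key first, then archived keys in order."""
    try:
        return open_box(ciphertext, current_key)
    except DecryptError:
        pass
    for key in archived_keys:
        try:
            plaintext = open_box(ciphertext, key)
        except DecryptError:
            continue
        metrics["cm_daemon_decrypt_archived_total"] += 1
        return plaintext
    metrics["cm_daemon_decrypt_stale_total"] += 1
    raise DecryptError("no current or archived key opens this message")
```

Passing a `Counter()` as `metrics` is enough to observe which path a message took.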
|
||||
|
||||
#### 14.1.1 Archive record format (NEW — codex r3)
|
||||
|
||||
`keypair-archive.json` (mode 0600, atomic-rename writes):
|
||||
|
||||
```json
|
||||
{
|
||||
"schema_version": 1,
|
||||
"max_archived_keys": 8,
|
||||
"keys": [
|
||||
{
|
||||
"pubkey": "ed25519-base64...",
|
||||
"x25519_pubkey": "base64...",
|
||||
"x25519_privkey": "base64...", // sensitive; whole file is 0600
|
||||
"key_id": "k_01HQX...", // ulid; matches broker's record
|
||||
"created_at": "2026-04-12T11:00:00Z",
|
||||
"rotated_out_at": "2026-05-03T16:00:00Z",
|
||||
"expires_at": "2026-05-10T16:00:00Z" // rotated_out_at + grace
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Rules:
|
||||
|
||||
- **`max_archived_keys`** (default 8): cap on archive size. If a rotation
|
||||
would push the archive past the cap, the oldest entry is force-expired
|
||||
(zeroed + removed) regardless of `expires_at`. Force-expiry increments
|
||||
`cm_daemon_archive_force_expired_total{key_id}`. An operator who rotates
more than 8 keys (the cap) within one grace period is intentionally accepting
decryption gaps for very late inbound messages encrypted to those keys.
|
||||
- **Grace period default**: 7 days. Configurable via
|
||||
`[crypto] key_grace_period_days = 7`. Hard cap 30 days (codex review:
|
||||
unbounded grace = unbounded archive on disk = bigger blast radius if
|
||||
daemon host is compromised mid-life).
|
||||
- **Cleanup**: scheduled daily at midnight local time + on-demand via
|
||||
`claudemesh daemon archive-cleanup`. Walks `keys[]`, drops anything with
|
||||
`expires_at < now`. If file is empty after cleanup, file is deleted.
|
||||
- **Archive write failure**: rotation is aborted. Daemon refuses to commit
|
||||
the new keypair if the archive can't be written durably. Logged as
|
||||
`key_rotation_aborted_archive_write_failed`. New keypair is in memory
|
||||
only; restart returns to old keypair. This is intentional: the archive
|
||||
write is the durability point of rotation.
|
||||
- **At-rest encryption**: archive file is mode 0600 plaintext, same threat
|
||||
model as `keypair.json` (root-on-host can read both anyway). Operators
|
||||
who want disk-level encryption can put `~/.claudemesh/` on an encrypted
|
||||
volume; we don't reinvent that. Documented in the threat model (§16).
|
||||
Future option `--archive-passphrase` deferred — adds passphrase prompt to
|
||||
rotation/decrypt path, but breaks unattended daemon restart.
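The expiry and force-expiry rules can be sketched as a single pruning pass over `keys[]`. This is a minimal sketch under stated assumptions: the helper names are hypothetical, "oldest entry" is taken to mean the earliest `rotated_out_at`, timestamps use the `...Z` format from the example record, and zeroing key material and the atomic-rename write are not shown.

```python
from datetime import datetime, timezone


def parse_ts(value):
    """Parse timestamps shaped like 2026-05-03T16:00:00Z as UTC."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def prune_archive(keys, now, max_archived_keys=8):
    """Return (kept, expired_key_ids, force_expired_key_ids)."""
    live, expired = [], []
    for record in keys:
        if parse_ts(record["expires_at"]) < now:
            expired.append(record)
        else:
            live.append(record)
    # Oldest first, so any overflow beyond the cap is taken from the front.
    live.sort(key=lambda record: parse_ts(record["rotated_out_at"]))
    overflow = max(0, len(live) - max_archived_keys)
    force_expired, kept = live[:overflow], live[overflow:]
    return (
        kept,
        [record["key_id"] for record in expired],
        [record["key_id"] for record in force_expired],
    )
```

Each id in the third list would correspond to one increment of `cm_daemon_archive_force_expired_total`; an empty `kept` list is the case where the file is deleted.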
|
||||
|
||||
### 14.2 Backup includes topic state — unchanged from v3 §14.2
|
||||
|
||||
`keypair.json`, `keypair-archive.json` (with all archived keys),
|
||||
`host_fingerprint.json`, `config.toml`, `topic_subscriptions.json`,
|
||||
`topic_keys.json`, `key_epoch.json`, `schema_version`.
|
||||
|
||||
`local_token` NOT included; regenerated on restore.
|
||||
|
||||
### 14.3 Local token rotation, compromised host revocation, image-clone, uninstall, recovery — unchanged from v2 §14.3
|
||||
|
||||
---
|
||||
|
||||
## 15. Version compat — feature-bit negotiation with **parameters** (codex r3)
|
||||
|
||||
v3's feature bits were boolean. Codex r3: dedupe-window, max-payload, key
|
||||
epochs all need parameters. v4 makes feature bits string-keyed entries that
|
||||
optionally carry a value.
|
||||
|
||||
### 15.1 Feature bits with parameters
|
||||
|
||||
| Bit | Type | Parameters | Notes |
|
||||
|---|---|---|---|
|
||||
| `client_message_id_dedupe` | object | `{ mode: "permanent"\|"windowed", window_hours?: int, tombstone_retention_days: int }` | Daemon reads `mode` to decide whether to enforce its own outbox max-age cap. `tombstone_retention_days` (broker-controlled) tells daemon how long it can expect "already-accepted" replies after the source row is GC'd |
|
||||
| `concurrent_connection_policy` | bool | — | Broker honours `mesh.cloneConcurrencyPolicy` |
|
||||
| `member_keypair_rotated_event` | bool | — | Broker emits the event |
|
||||
| `key_epoch` | object | `{ max_concurrent_epochs: int }` | Per-topic key epochs supported |
|
||||
| `max_payload` | object | `{ inline_bytes: int, blob_bytes: int }` | Hard limits broker enforces |
|
||||
| `mesh_skill_share` | bool | — | Future |
|
||||
| `mcp_host` | bool | — | Future |
|
||||
|
||||
### 15.2 Negotiation handshake (parameterized)
|
||||
|
||||
On WS connect, after hello, before normal traffic:
|
||||
|
||||
```
|
||||
→ daemon: feature_negotiation_request
|
||||
{
|
||||
require: ["client_message_id_dedupe",
|
||||
"concurrent_connection_policy"],
|
||||
optional: ["mesh_skill_share","mcp_host","max_payload"]
|
||||
}
|
||||
|
||||
← broker: feature_negotiation_response
|
||||
{
|
||||
supported: {
|
||||
"client_message_id_dedupe": {
|
||||
"mode": "permanent",
|
||||
"tombstone_retention_days": 30
|
||||
},
|
||||
"concurrent_connection_policy": true,
|
||||
"member_keypair_rotated_event": true,
|
||||
"max_payload": {
|
||||
"inline_bytes": 65536,
|
||||
"blob_bytes": 524288000
|
||||
}
|
||||
},
|
||||
missing_required: []
|
||||
}
|
||||
```
|
||||
|
||||
If `missing_required` is non-empty, daemon closes the connection with code
|
||||
4010 `feature_unavailable`, logs forensic event, exits non-zero. Supervisor
|
||||
sees a restart-loop → operator alert.
|
||||
|
||||
If `client_message_id_dedupe.mode == "windowed"`, daemon reads
|
||||
`window_hours` and configures its outbox `max_age_hours` to
|
||||
`window_hours - 1` (margin) instead of the 168h default. Permanent mode →
|
||||
daemon uses the config default, no override.
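The outbox TTL derivation described above is small enough to sketch directly. The function name is hypothetical, and rejecting a `window_hours` below 2 is an added assumption (so that the one-hour margin never produces a non-positive TTL); the spec itself only states the `window_hours - 1` rule.

```python
DEFAULT_MAX_AGE_HOURS = 168  # 7d config default from §4.4


def outbox_max_age_hours(supported, configured=DEFAULT_MAX_AGE_HOURS):
    """Derive the outbox max_age_hours from the negotiated feature map."""
    dedupe = supported.get("client_message_id_dedupe")
    if not isinstance(dedupe, dict):
        raise ValueError("client_message_id_dedupe parameters missing")
    mode = dedupe.get("mode")
    if mode == "permanent":
        return configured
    if mode == "windowed":
        window = dedupe.get("window_hours")
        if not isinstance(window, int) or window < 2:
            raise ValueError("windowed mode needs window_hours >= 2")
        return window - 1  # one-hour safety margin inside the broker window
    raise ValueError(f"unknown dedupe mode: {mode!r}")
```

With the example response above (`mode: "permanent"`), the daemon keeps the 168h default; a broker advertising a 24h window would yield 23h.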
|
||||
|
||||
### 15.3 IPC negotiation — unchanged from v3 §15.3
|
||||
|
||||
`GET /v1/version` returns daemon version, IPC features, schema version, and
|
||||
the **parsed** broker feature parameters (so SDKs querying the daemon can
|
||||
display them).
|
||||
|
||||
### 15.4 Compatibility matrix — unchanged from v3 §15.4
|
||||
|
||||
Published at `GET /v1/compat`.
|
||||
|
||||
---
|
||||
|
||||
## 16. Threat model — unchanged from v3 §16, plus RunPod fix
|
||||
|
||||
### 16.1 Attacker classes — unchanged
|
||||
|
||||
### 16.2 Out of scope — unchanged
|
||||
|
||||
### 16.3 Container & CI defaults table (RunPod inconsistency fixed)
|
||||
|
||||
| Environment | Identity | IPC | Hooks | Rationale |
|
||||
|---|---|---|---|---|
|
||||
| Bare metal / VM (default) | Persistent (clone-detected) | UDS + TCP loopback | Enabled | Trusted operator-owned host |
|
||||
| Docker container (`/.dockerenv`) | Persistent | UDS-only by default | Enabled | Single-tenant container, host loopback shared |
|
||||
| Kubernetes (`KUBERNETES_SERVICE_HOST`) | Persistent | UDS-only | Enabled | Single pod = single tenant |
|
||||
| CI (`CI=true`, `GITHUB_ACTIONS`, etc.) | Ephemeral | UDS-only | Disabled by default (`[hooks] enabled = false`) | Multi-tenant runner; arbitrary code; ephemeral identity = no cross-job leak; hooks disabled because CI workloads are arbitrary user code |
|
||||
| RunPod (`RUNPOD_POD_ID`) | Persistent | UDS-only | Enabled | Long-lived single-tenant sandbox; user owns the pod for its lifetime; identical trust model to a Docker container, NOT to a CI runner |
|
||||
|
||||
**RunPod resolution (codex r3)**: v3 listed RunPod under both "ephemeral
|
||||
identity" and "hooks enabled" which was contradictory. v4 treats RunPod as
|
||||
a **single-tenant container** (Docker-like): persistent identity, UDS-only,
|
||||
hooks enabled. RunPod is removed from the CI auto-detect list (§2.1).
|
||||
Operators who run RunPod as multi-tenant sandbox-as-CI can opt in with
|
||||
`--ephemeral` + `[hooks] enabled = false` explicitly.
|
||||
|
||||
Operator overrides any default with explicit flags; warning logged for
|
||||
non-default-secure choices.
|
||||
|
||||
---
|
||||
|
||||
## 17. Migration — unchanged from v3 §17
|
||||
|
||||
Broker schema delta (additive partial unique indexes, safe online),
|
||||
deployed before daemon. Daemon refuses to start if `client_message_id_dedupe`
|
||||
feature bit is missing from broker's negotiation response.
|
||||
|
||||
---
|
||||
|
||||
## What changed v3 → v4 (codex round-3 actionable items)
|
||||
|
||||
| Codex r3 item | v4 fix | Section |
|
||||
|---|---|---|
|
||||
| Broker dedupe window: permanent vs windowed? | **Picked permanent**; schema clarified; outbox `max_age_hours` raised back to 168h | §4 |
|
||||
| Feature bits should be parameterized | All feature bits are string-keyed with optional value object | §15.1, §15.2 |
|
||||
| Key archive record format unspecified | Full schema with `key_id`, timestamps, `max_archived_keys`, force-expiry rule, write-failure semantics | §14.1.1 |
|
||||
| Document fingerprint source precedence per OS | Per-OS table for `host_id` and stable MAC; cloud-image false-positive note | §2.2.1 |
|
||||
| Explicit deferment of arbitrary outbound hook sends | Listed deferred capabilities + escape hatch path post-v0.9.0 | §6.2 |
|
||||
| RunPod ephemeral-but-hooks-enabled inconsistency | RunPod treated as single-tenant container; removed from CI auto-detect | §2.1, §16.3 |
|
||||
|
||||
---
|
||||
|
||||
## What needs review (round 4)
|
||||
|
||||
Round 1 → identity, IPC auth, exactly-once lie, hook tokens, surface bloat,
|
||||
missing rotation/recovery/migration/threat-model.
|
||||
|
||||
Round 2 → boot-id false-positive, broker must dedupe on client id, CI
|
||||
shared-runner reality, feature-bit negotiation, key rotation crypto, hook
|
||||
scopes, FTS schema, ~7 polish items.
|
||||
|
||||
Round 3 → dedupe window semantics, feature-bit parameters, key archive
|
||||
record format, fingerprint source precedence, deferred hook scopes, RunPod
|
||||
inconsistency.
|
||||
|
||||
This v4 attempts to address all of round 3. Specifically:
|
||||
|
||||
1. **Permanent dedupe choice (§4)** — does the storage-cost calculus hold?
|
||||
Is the tombstone path (`client_id_unknown` after row GC) actually
|
||||
workable, or does it need to be a real tombstone table?
|
||||
2. **Feature parameter shape (§15.1)** — is the type system right (object
|
||||
with optional value)? Should it be a flat key-value list instead?
|
||||
Versioning of parameters within a feature?
|
||||
3. **Archive record format (§14.1.1)** — anything missing? Is
|
||||
`max_archived_keys=8` a sensible default, or should it be unbounded with
|
||||
a force-expiry on storage size instead of count?
|
||||
4. **Fingerprint per-OS table (§2.2.1)** — accurate? Is BSD worth listing
|
||||
if we're not actively building for FreeBSD in v0.9.0?
|
||||
5. **Hook deferment list (§6.2)** — does it cover all the realistic v0.9.0
|
||||
ask? Is the "shell out to `claudemesh send`" workaround for escalation
|
||||
ergonomically acceptable?
|
||||
6. **RunPod resolution (§16.3)** — agree with treating RunPod as
|
||||
single-tenant container? Or are there real multi-tenant RunPod
|
||||
deployments we should default-guard against?
|
||||
7. **Anything else still wrong?** Read it as if you were going to operate
|
||||
this for a year. What falls down?
|
||||
|
||||
Three options after this review:
|
||||
- **(a) v4 is shippable**: lock the spec, start coding the frozen core.
|
||||
- **(b) v5 needed**: list the must-fix items.
|
||||
- **(c) the architecture itself is wrong**: what would you do differently?
|
||||
|
||||
Be ruthless. We can break anything.
|
||||
468
.artifacts/shipped/2026-05-03-daemon-final-spec-v5.md
Normal file
@@ -0,0 +1,468 @@
|
||||
# `claudemesh daemon` — Final Spec v5
|
||||
|
||||
> **Round 5.** v4 was reviewed by codex (round 4) and got an architectural
|
||||
> pass but flagged one blocker plus four polish items.
|
||||
>
|
||||
> **Blocker**: §4 called dedupe "permanent" while also saying it disappears
|
||||
> when retained rows are hard-deleted. Internally inconsistent. Fix: real
|
||||
> broker-side dedupe/tombstone table independent of message retention.
|
||||
>
|
||||
> **Polish**: (a) rename `mode: "permanent"` to `retention_scoped`; (b)
|
||||
> deterministic duplicate-response shape; (c) feature-parameter schema
|
||||
> validation rules + per-feature parameter version; (d) drop
|
||||
> "zeroed/secure-delete" promises in archive cleanup, define malformed-archive
|
||||
> startup behavior; plus Linux MAC||MAC self-collision noted, RunPod warning
|
||||
> log on persistent default.
|
||||
>
|
||||
> **Intent §0 unchanged from v2.** v5 only revises what changed from v4.
|
||||
|
||||
---
|
||||
|
||||
## 0. Intent — unchanged, see v2 §0
|
||||
|
||||
Pre-launch peer-mesh runtime. Servers/laptops become first-class peers.
|
||||
Stable identity, persistent WS, local IPC, hooks. Not a webhook gateway, not
|
||||
a generic broker. We can break anything.
|
||||
|
||||
**One claim retracted from v1/v2**: "exactly-once" delivery. Replaced with a
|
||||
precise contract in §4.
|
||||
|
||||
---
|
||||
|
||||
## 1. Process model — unchanged from v3 §1 / v2 §1
|
||||
|
||||
---
|
||||
|
||||
## 2. Identity — accidental-clone detection only
|
||||
|
||||
### 2.1 Modes — unchanged from v4 §2.1, RunPod warning added
|
||||
|
||||
When `RUNPOD_POD_ID` is set and identity is persistent (the default for
|
||||
RunPod under v4 §16.3), daemon logs `runpod_persistent_default_assumed` at
|
||||
INFO. Operators running RunPod as a multi-tenant CI surface should set `--ephemeral`
|
||||
explicitly; the warning makes the default visible in case the assumption
|
||||
doesn't fit their deployment.
|
||||
|
||||
### 2.2 Accidental-clone detection — unchanged from v4 §2.2
|
||||
|
||||
#### 2.2.1 Fingerprint source precedence — unchanged from v4 §2.2.1, with self-collision note
|
||||
|
||||
**Linux MAC-only fallback (NEW note)**: when `/etc/machine-id` is unreadable
|
||||
and we fall back to MAC-only as `host_id`, the resulting fingerprint is
|
||||
effectively `sha256(mac || mac)`. This is acceptable for clone detection
|
||||
(still uniquely identifies *this* host's first-NIC MAC) but reduces entropy
|
||||
to ~48 bits. Operators who want stronger fingerprinting in degraded
|
||||
environments can persist a generated UUID via `host_fingerprint.id_override`
|
||||
in config; documented but not required.
|
||||
|
||||
### 2.3 Concurrent-duplicate-identity broker policy — unchanged from v3 §2.3
|
||||
|
||||
### 2.4 Rename, key rotation — see §14
|
||||
|
||||
---
|
||||
|
||||
## 3. IPC surface — unchanged from v4 §3
|
||||
|
||||
---
|
||||
|
||||
## 4. Delivery contract — at-least-once, **dedupe table**, retention-scoped
|
||||
|
||||
Codex round 4 caught: v4 said "permanent" but also said dedupe disappears
|
||||
when message rows are hard-deleted. That's `retention_scoped`, not
|
||||
permanent — and worse, the partial-unique-index design fails when the row
|
||||
itself is gone. v5 introduces a real broker-side dedupe table with its own
|
||||
retention policy, independent of message retention.
|
||||
|
||||
### 4.1 The contract (precise)
|
||||
|
||||
> **Local guarantee**: each successful `POST /v1/send` returns a stable
|
||||
> `client_message_id`. The send is durably persisted to `outbox.db` before
|
||||
> the response returns.
|
||||
>
|
||||
> **Broker guarantee**: the broker maintains a dedupe record for every
|
||||
> accepted `client_message_id` in a dedicated table
|
||||
> (`mesh.client_message_dedupe`). The dedupe record outlives the message
|
||||
> row when the dedupe-retention policy is longer than the
|
||||
> message-retention policy. While the dedupe record exists, all retries
|
||||
> with that `client_message_id` collapse to the original
|
||||
> `broker_message_id` deterministically. After the dedupe record expires,
|
||||
> a retry would create a new message — but daemon outbox `max_age_hours`
|
||||
> is configured against the broker's advertised `dedupe_retention_days`
|
||||
> with margin (§15.1), so this should not happen in practice.
|
||||
>
|
||||
> **End-to-end guarantee**: at-least-once delivery to subscribers, with
|
||||
> `client_message_id` propagated in the inbound envelope. Receiver-side
|
||||
> dedupe is the receiver's job; the daemon's `inbox.db` provides it for
|
||||
> daemon-hosted peers.
|
||||
|
||||
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
|
||||
|
||||
Sources, in fallback order: `Idempotency-Key` header → body `client_message_id` → daemon-generated ULID.
|
||||
Stored in outbox UNIQUE NOT NULL, propagated to broker, propagated to
|
||||
receivers in inbound envelope.
|
||||
|
||||
### 4.3 Broker schema — dedupe table separate from message rows (v5)
|
||||
|
||||
```sql
-- The dedupe authority. One row per (mesh, client_message_id) accepted
-- by the broker. Outlives mesh.topic_message rows when retention >
-- message retention.
CREATE TABLE mesh.client_message_dedupe (
  mesh_id            UUID NOT NULL REFERENCES mesh.mesh(id) ON DELETE CASCADE,
  client_message_id  TEXT NOT NULL,
  broker_message_id  UUID NOT NULL,  -- the original accepted message id
  destination_kind   TEXT NOT NULL CHECK(destination_kind IN ('topic','dm','queue')),
  destination_ref    TEXT NOT NULL,  -- topic name, recipient pubkey, etc.
  first_seen_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  expires_at         TIMESTAMPTZ,    -- NULL = never expires (operator opt-in)
  status             TEXT NOT NULL CHECK(status IN ('accepted','rejected')),
  history_available  BOOLEAN NOT NULL DEFAULT TRUE,  -- flipped FALSE when message row GC'd
  PRIMARY KEY (mesh_id, client_message_id)
);

CREATE INDEX client_message_dedupe_expires_idx
  ON mesh.client_message_dedupe(expires_at)
  WHERE expires_at IS NOT NULL;

-- Existing tables get the convenience back-pointer (for receiver
-- inclusion in delivered envelopes); UNIQUE NOT enforced here — the
-- dedupe table is the authority.
ALTER TABLE mesh.topic_message ADD COLUMN client_message_id TEXT;
ALTER TABLE mesh.message_queue ADD COLUMN client_message_id TEXT;
```
|
||||
|
||||
**Retention semantics**:
|
||||
|
||||
- `expires_at = NULL` → dedupe row never expires unless mesh is deleted.
|
||||
Operator opts in via mesh setting `dedupeRetentionMode = "permanent"`.
|
||||
- `expires_at = first_seen_at + dedupe_retention_days` → default
|
||||
`retention_scoped` mode. Default value: 365 days. Configurable per-mesh.
|
||||
- A nightly broker job deletes rows where `expires_at < NOW()`.
|
||||
- A separate broker job, fired when the message-retention sweep hard-deletes
|
||||
a `mesh.topic_message` or `mesh.message_queue` row, sets the corresponding
|
||||
dedupe row's `history_available = FALSE`. The dedupe row stays — only the
|
||||
payload is gone. Retries still collapse correctly; receiver requests for
|
||||
history return "row pruned" deterministically (§4.4 below).
|
||||
|
||||
**Migration**: additive-only. Daemon refuses to start unless broker
|
||||
advertises feature `client_message_id_dedupe` with `mode` of
|
||||
`retention_scoped` or `permanent` (§15.1).
|
||||
|
||||
### 4.4 Duplicate response — deterministic shape (NEW v5 — codex r4)
|
||||
|
||||
When the broker sees a send with a `client_message_id` already in
|
||||
`mesh.client_message_dedupe`, the response is deterministic:
|
||||
|
||||
```jsonc
{
  "broker_message_id": "msg_01HQX...",
  "client_message_id": "cmid_01HQX...",
  "duplicate": true,
  "history_available": true,  // false if message row was GC'd
  "first_seen_at": "2026-05-03T11:42:00Z",
  "destination_kind": "topic",
  "destination_ref": "alerts"
}
```
|
||||
|
||||
Daemon outcomes:
|
||||
|
||||
- `duplicate: true, history_available: true` → mark outbox row `done`,
|
||||
store `broker_message_id`. No re-fanout (broker did the work the first
|
||||
time).
|
||||
- `duplicate: true, history_available: false` → mark outbox row `done` but
|
||||
log `cm_daemon_dedupe_history_pruned_total`. The message *did* deliver
|
||||
the first time; we just can't show it in history. Receivers who needed
|
||||
it have it; receivers who didn't have already missed their window.
|
||||
- No more `client_id_unknown` — that response code is removed.
|
||||
|
||||
### 4.5 Outbox schema — daemon-side max-age derived (v5)
|
||||
|
||||
```sql
CREATE TABLE outbox (
  id                 TEXT PRIMARY KEY,
  client_message_id  TEXT NOT NULL UNIQUE,
  payload            BLOB NOT NULL,
  enqueued_at        INTEGER NOT NULL,
  attempts           INTEGER DEFAULT 0,
  next_attempt_at    INTEGER NOT NULL,
  status             TEXT CHECK(status IN ('pending','inflight','done','dead')),
  last_error         TEXT,
  delivered_at       INTEGER,
  broker_message_id  TEXT
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
```
|
||||
|
||||
Daemon `max_age_hours` is **derived** from the broker-advertised dedupe
`mode` and `dedupe_retention_days` parameters:
|
||||
- `permanent` → daemon default 168h (7d), capped at 30d. (Daemon doesn't
|
||||
hold sends forever — that's an outbox bug surface.)
|
||||
- `retention_scoped, dedupe_retention_days = N` → daemon
|
||||
`max_age_hours = (N * 24) - safety_margin_hours`. Default
|
||||
`safety_margin_hours = 24`.
|
||||
- Operator override permitted but logged as
|
||||
`outbox_max_age_above_broker_window` if it exceeds broker safe range.
|
||||
|
||||
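The derivation rules above can be sketched as follows. This is a minimal illustration, not the daemon's actual code: the function names are hypothetical, and treating the 30-day cap as the "broker safe range" in `permanent` mode is an assumption made for the sketch.

```python
from typing import Optional

HOURS_PER_DAY = 24
PERMANENT_DEFAULT_HOURS = 168   # 7 days: daemon default in `permanent` mode
PERMANENT_CAP_HOURS = 30 * 24   # 30-day cap in `permanent` mode


def derive_max_age_hours(mode: str,
                         dedupe_retention_days: Optional[int] = None,
                         safety_margin_hours: int = 24) -> int:
    """Derive the outbox max age from the broker-advertised dedupe params."""
    if mode == "permanent":
        return PERMANENT_DEFAULT_HOURS
    if mode == "retention_scoped":
        if dedupe_retention_days is None or dedupe_retention_days < 1:
            raise ValueError("retention_scoped requires dedupe_retention_days >= 1")
        return dedupe_retention_days * HOURS_PER_DAY - safety_margin_hours
    raise ValueError(f"unknown dedupe mode: {mode!r}")


def exceeds_broker_window(mode: str,
                          override_hours: int,
                          dedupe_retention_days: Optional[int] = None,
                          safety_margin_hours: int = 24) -> bool:
    """True when an operator override should be logged as
    `outbox_max_age_above_broker_window` (limit choice is an assumption)."""
    if mode == "permanent":
        limit = PERMANENT_CAP_HOURS
    else:
        limit = derive_max_age_hours(mode, dedupe_retention_days, safety_margin_hours)
    return override_hours > limit
```

With the default 365-day retention this yields 8736 hours. Note that `dedupe_retention_days = 1` with the default margin yields 0 hours, an edge the formula as written does not handle.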
### 4.6 Inbox schema — unchanged from v3 §4.5
|
||||
|
||||
### 4.7 Crash recovery — unchanged from v3 §4.6
|
||||
|
||||
### 4.8 Failure modes — corrected for dedupe-table model
|
||||
|
||||
- **`dead` rows**: surface in `claudemesh daemon outbox --failed`. Same as v4.
|
||||
- **Receiver-side dedupe**: only daemon-hosted receivers dedupe. Same as v4.
|
||||
- **Daemon retry after dedupe row expired AND message row GC'd**: in
|
||||
`retention_scoped` mode this can only happen if the daemon outbox row
|
||||
was older than `dedupe_retention_days - safety_margin`. Daemon will
|
||||
refuse to send rows older than its computed `max_age_hours` (§4.5) —
|
||||
they go to `dead` first, surfaced for human action. So this edge is
|
||||
closed by daemon-side gating, not broker-side dedupe.
|
||||
- **Daemon retry after dedupe row expired BUT message row still alive**:
|
||||
doesn't happen by design — dedupe retention is always ≥ message
|
||||
retention in operator-sane configs. If misconfigured, message row
|
||||
persists with NULL `client_message_id` reference, retry creates a new
|
||||
message, broker emits `cm_broker_dedupe_misconfig_total` with
|
||||
`(mesh_id, retention_dedupe_days, retention_message_days)` labels.
|
||||
|
||||
---
|
||||
|
||||
## 5. Inbound — unchanged from v3 §5
|
||||
|
||||
---
|
||||
|
||||
## 6. Hooks — unchanged from v4 §6
|
||||
|
||||
---
|
||||
|
||||
## 7-13. Multi-mesh, auto-routing, service install, observability, SDKs, security model, configuration — unchanged from v4
|
||||
|
||||
---
|
||||
|
||||
## 14. Lifecycle — archive cleanup wording corrected (codex r4)
|
||||
|
||||
### 14.1 Key rotation — unchanged crypto from v4 §14.1
|
||||
|
||||
### 14.1.1 Archive record format — corrected wording (v5)
|
||||
|
||||
`keypair-archive.json` (mode 0600, atomic-rename writes):
|
||||
|
||||
```jsonc
{
  "schema_version": 1,
  "max_archived_keys": 8,
  "keys": [
    {
      "ed25519_pubkey": "base64...",  // metadata only; matches the rotated-out signing key for that key_id
      "x25519_pubkey": "base64...",   // matches the retained private key
      "x25519_privkey": "base64...",  // sensitive; whole file is 0600
      "key_id": "k_01HQX...",
      "created_at": "2026-04-12T11:00:00Z",
      "rotated_out_at": "2026-05-03T16:00:00Z",
      "expires_at": "2026-05-10T16:00:00Z"
    }
  ]
}
```
|
||||
|
||||
**Field clarifications (codex r4)**:
|
||||
- `ed25519_pubkey` is metadata — the daemon does not retain the old ed25519
|
||||
*private* key. Stored to bind `key_id` ↔ old signing identity for audit
|
||||
reconstruction (e.g. "this archived x25519 was the recipient half of a
|
||||
member who at the time signed messages with the matching ed25519").
|
||||
- `x25519_pubkey` MUST match the public half of `x25519_privkey`. Daemon
|
||||
validates on archive load; mismatch → quarantine (see corruption rules).
|
||||
|
||||
**Cleanup wording (codex r4)**:
|
||||
- On `expires_at < now`: entry is removed from the live archive file via
|
||||
atomic-rename rewrite. **Secure deletion of the prior file's data is not
|
||||
guaranteed** on modern filesystems (journals, COW snapshots, SSD wear
|
||||
leveling, atomic-rename leaving stale inodes). Operators who need
|
||||
cryptographic erasure must operate on encrypted volumes or reissue
|
||||
hardware. Documented in threat model §16.
|
||||
- "Force-expiry" when `max_archived_keys` is exceeded uses the same
|
||||
removal mechanism; same caveat applies. Counter
|
||||
`cm_daemon_archive_force_expired_total{key_id}` exposed.
|
||||
|
||||
**Duplicate `key_id` handling (NEW v5)**:
|
||||
- Archive load rejects any file whose `keys[]` contains two records with
|
||||
the same `key_id`. Quarantine to `keypair-archive.json.malformed-<ts>`,
|
||||
start with empty archive, log `keypair_archive_duplicate_key_id`. Daemon
|
||||
continues to start (we don't want archive corruption to be a permanent
|
||||
outage). Old in-flight messages encrypted to the lost archived keys
|
||||
fail to decrypt and are counted in `cm_daemon_decrypt_stale_total`.
|
||||
|
||||
**Malformed archive on startup (NEW v5)**:
|
||||
- File present but JSON parse fails OR schema fails OR pubkey/privkey pair
|
||||
fails validation: quarantine as above, start with empty archive, log
|
||||
`keypair_archive_malformed`. Same continue-startup behavior.
|
||||
- File missing entirely: treated as empty archive (normal first run /
|
||||
post-cleanup state), no warning.
|
||||
- File present but mode != 0600: log `keypair_archive_perms` warning,
  read anyway. The problem is surfaced to operators; the daemon doesn't
  auto-chmod (they should fix their pipeline).
|
||||
|
||||
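The startup rules above (missing file, bad mode, malformed content, duplicate `key_id`) can be sketched roughly as below. This is an illustrative sketch only: `load_archive` is a hypothetical name, the epoch-seconds quarantine suffix is an assumed format for `<ts>`, and the x25519 pubkey/privkey pair check is omitted because it needs a crypto library.

```python
import json
import os
import stat
import time
from typing import Any, Dict, List, Tuple


def _quarantine(path: str) -> None:
    # Move the bad file aside; the suffix format is an assumption.
    os.replace(path, f"{path}.malformed-{int(time.time())}")


def load_archive(path: str) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Return (archived keys, log events). Never raises on a bad archive,
    so archive corruption cannot block daemon startup."""
    events: List[str] = []
    if not os.path.exists(path):
        return [], events                      # normal first run, no warning
    if stat.S_IMODE(os.stat(path).st_mode) != 0o600:
        events.append("keypair_archive_perms")  # warn, but read anyway
    try:
        with open(path, "r", encoding="utf-8") as fh:
            doc = json.load(fh)
        keys = doc["keys"]
        if doc["schema_version"] != 1 or not isinstance(keys, list):
            raise ValueError("schema check failed")
        key_ids = [entry["key_id"] for entry in keys]
        if not all(isinstance(key_id, str) for key_id in key_ids):
            raise ValueError("key_id must be a string")
        # A real loader would also verify x25519_pubkey against x25519_privkey.
    except (ValueError, KeyError, TypeError):
        _quarantine(path)
        events.append("keypair_archive_malformed")
        return [], events
    if len(set(key_ids)) != len(key_ids):
        _quarantine(path)
        events.append("keypair_archive_duplicate_key_id")
        return [], events
    return keys, events
```

Returning the events alongside the keys keeps the sketch free of any particular logging setup; the caller decides how to emit them.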
### 14.2 Backup — unchanged from v4 §14.2
|
||||
|
||||
### 14.3 Local token rotation, compromised host revocation, image-clone, uninstall, recovery — unchanged
|
||||
|
||||
---
|
||||
|
||||
## 15. Version compat — feature-bit schema validation (v5)
|
||||
|
||||
Codex r4: feature parameters need explicit schema-validation rules and
|
||||
per-feature versioning so we don't paint ourselves into a corner when a
|
||||
parameter shape evolves.
|
||||
|
||||
### 15.1 Feature bits with parameters and versions
|
||||
|
||||
Each feature bit's parameters are versioned independently of broker version:
|
||||
|
||||
| Bit | `params.version` | Required parameters | Optional parameters |
|---|---|---|---|
| `client_message_id_dedupe` | `1` | `mode: "retention_scoped"\|"permanent"`, `dedupe_retention_days: int (>= 1)` (when mode=retention_scoped) | `tombstone_history_pruned_window_days: int` |
| `concurrent_connection_policy` | `1` | (no parameters) | `default_policy: "prefer_newest"\|"prefer_oldest"\|"allow_concurrent"` |
| `member_keypair_rotated_event` | `1` | (no parameters) | — |
| `key_epoch` | `1` | `max_concurrent_epochs: int (>= 1)` | — |
| `max_payload` | `1` | `inline_bytes: int (>= 1024)`, `blob_bytes: int (>= 1024)` | — |
| `mesh_skill_share` | future | — | — |
| `mcp_host` | future | — | — |
|
||||
|
||||
**Validation rules (NEW v5)**:
|
||||
|
||||
When the broker advertises feature parameters in
|
||||
`feature_negotiation_response`, the daemon validates against the
|
||||
parameter schema for that `params.version`. Validation failures:
|
||||
|
||||
- **Required parameter missing**: treated identically to "feature missing
|
||||
from `supported`" — if the feature is in daemon's `require[]`, daemon
|
||||
closes WS with code 4010 `feature_unavailable` and exits non-zero.
|
||||
- **Required parameter out of bounds** (e.g. `dedupe_retention_days = -5`,
|
||||
`inline_bytes = 0`): same — treated as "feature missing from
|
||||
`supported`."
|
||||
- **Unknown `params.version`**: if daemon doesn't recognize the version,
|
||||
treated as "feature missing." Daemon does NOT silently degrade.
|
||||
- **Optional parameter missing or invalid**: daemon uses its own default,
|
||||
logs `feature_optional_param_invalid{feature, param, reason}`, continues.
|
||||
- **Unknown `mode` for `client_message_id_dedupe`** (not "retention_scoped"
|
||||
or "permanent"): treated as "feature missing." Future modes require a
|
||||
`params.version` bump.
|
||||
|
||||
Validation is NOT silent: every feature_negotiation_response is logged
|
||||
fully (with sensitive parameters redacted, though we don't currently have
|
||||
any) at DEBUG, and a single line at INFO summarizes negotiated capabilities
|
||||
on each successful negotiation.
|
||||
|
||||
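As an illustration of the required-parameter rules for `client_message_id_dedupe`, here is a hedged sketch of the daemon-side check. The function name is hypothetical, and optional-parameter handling (fall back to a default and log) is left out for brevity.

```python
from collections.abc import Mapping
from typing import Any, Optional

VALID_MODES = {"retention_scoped", "permanent"}
KNOWN_PARAMS_VERSION = 1


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python; reject it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def dedupe_feature_usable(entry: Optional[Mapping]) -> bool:
    """Return False whenever the feature must be treated as
    'missing from supported' under the validation rules."""
    if not isinstance(entry, Mapping):
        return False                        # feature not advertised
    params = entry.get("params")
    if not isinstance(params, Mapping):
        return False
    version = params.get("version")
    if not _is_int(version) or version != KNOWN_PARAMS_VERSION:
        return False                        # unknown params.version
    mode = params.get("mode")
    if not isinstance(mode, str) or mode not in VALID_MODES:
        return False                        # unknown or missing mode
    if mode == "retention_scoped":
        days = params.get("dedupe_retention_days")
        if not _is_int(days) or days < 1:
            return False                    # required param missing or out of bounds
    return True
```

A daemon that lists this feature in `require[]` would close the WS with 4010 and exit non-zero whenever this check returns `False`.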
### 15.2 Negotiation handshake — shape updated (v5)
|
||||
|
||||
```
→ daemon: feature_negotiation_request
{
  require: ["client_message_id_dedupe",
            "concurrent_connection_policy"],
  optional: ["mesh_skill_share","mcp_host","max_payload"]
}

← broker: feature_negotiation_response
{
  supported: {
    "client_message_id_dedupe": {
      "params": {
        "version": 1,
        "mode": "retention_scoped",
        "dedupe_retention_days": 365,
        "tombstone_history_pruned_window_days": 30
      }
    },
    "concurrent_connection_policy": {
      "params": { "version": 1, "default_policy": "prefer_newest" }
    },
    "member_keypair_rotated_event": { "params": { "version": 1 } },
    "max_payload": {
      "params": { "version": 1, "inline_bytes": 65536, "blob_bytes": 524288000 }
    }
  },
  missing_required: []
}
```
|
||||
|
||||
If `missing_required` is non-empty after broker's response OR after daemon
|
||||
parameter validation, daemon closes with 4010 and exits non-zero.
|
||||
|
||||
### 15.3 IPC negotiation — unchanged from v3 §15.3
|
||||
|
||||
### 15.4 Compatibility matrix — unchanged from v3 §15.4
|
||||
|
||||
---
|
||||
|
||||
## 16. Threat model — unchanged from v4 §16
|
||||
|
||||
Plus archive-secure-delete clarification under §14.1.1.
|
||||
|
||||
---
|
||||
|
||||
## 17. Migration — broker dedupe table is the new prereq
|
||||
|
||||
Broker side, deploy order:
|
||||
1. `CREATE TABLE mesh.client_message_dedupe` + supporting indexes
|
||||
(additive, online-safe).
|
||||
2. `ALTER TABLE mesh.topic_message ADD COLUMN client_message_id` (already
|
||||
in v3/v4 plan).
|
||||
3. Broker code: every `INSERT` into `topic_message` / `message_queue` first
|
||||
`INSERT ... ON CONFLICT DO UPDATE RETURNING` into
|
||||
`client_message_dedupe`. The conflict path returns existing
|
||||
`broker_message_id` instead of creating a new row.
|
||||
4. Broker code: nightly job to delete `client_message_dedupe` rows where
|
||||
`expires_at < NOW()`.
|
||||
5. Broker code: hook into the existing message-retention sweep to set
|
||||
`history_available = FALSE` on dedupe rows whose message row has been
|
||||
pruned.
|
||||
6. Broker advertises `client_message_id_dedupe` feature bit in negotiation
|
||||
response.
|
||||
7. Daemon refuses to start unless that feature bit is advertised with valid
|
||||
params.
|
||||
|
||||
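Step 3 is the heart of the migration, so a small sketch may help. It uses an in-memory SQLite database as a stand-in for the broker's Postgres, with the tables reduced to the columns needed here; `INSERT OR IGNORE` followed by a `SELECT` plays the role of `INSERT ... ON CONFLICT ... RETURNING`. The helper names are hypothetical, and wrapping both inserts in one transaction is an assumption of this sketch.

```python
import sqlite3
import uuid
from typing import Tuple


def open_demo_db() -> sqlite3.Connection:
    """Create reduced stand-ins for the dedupe and message tables."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE client_message_dedupe ("
        " mesh_id TEXT NOT NULL,"
        " client_message_id TEXT NOT NULL,"
        " broker_message_id TEXT NOT NULL,"
        " PRIMARY KEY (mesh_id, client_message_id))"
    )
    db.execute(
        "CREATE TABLE topic_message ("
        " id TEXT PRIMARY KEY,"
        " client_message_id TEXT,"
        " body BLOB NOT NULL)"
    )
    return db


def accept_send(db: sqlite3.Connection, mesh_id: str,
                client_message_id: str, body: bytes) -> Tuple[str, bool]:
    """Return (broker_message_id, duplicate). The dedupe insert runs first;
    on conflict the existing id is returned and no message row is created."""
    new_id = str(uuid.uuid4())
    with db:  # commits on success, rolls back if anything raises
        cur = db.execute(
            "INSERT OR IGNORE INTO client_message_dedupe"
            " (mesh_id, client_message_id, broker_message_id) VALUES (?, ?, ?)",
            (mesh_id, client_message_id, new_id),
        )
        if cur.rowcount == 0:  # conflict path: key already accepted
            row = db.execute(
                "SELECT broker_message_id FROM client_message_dedupe"
                " WHERE mesh_id = ? AND client_message_id = ?",
                (mesh_id, client_message_id),
            ).fetchone()
            return row[0], True
        db.execute(
            "INSERT INTO topic_message (id, client_message_id, body) VALUES (?, ?, ?)",
            (new_id, client_message_id, body),
        )
        return new_id, False
```

Calling `accept_send` twice with the same `(mesh_id, client_message_id)` returns the same id both times and leaves a single row in `topic_message`.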
---
|
||||
|
||||
## What changed v4 → v5 (codex round-4 actionable items)
|
||||
|
||||
| Codex r4 item | v5 fix | Section |
|---|---|---|
| Dedupe must be retention-scoped, not "permanent" with row-deletion gap | Real `mesh.client_message_dedupe` table; retention independent of message rows; `permanent` becomes opt-in mode meaning "no expires_at" | §4.1, §4.3 |
| Rename misleading mode | `retention_scoped` is the default; `permanent` reserved for explicit opt-in | §4.3, §15.1 |
| Deterministic duplicate response | New shape with `duplicate`, `broker_message_id`, `history_available`; removed `client_id_unknown` | §4.4 |
| Feature parameter validation rules | `params.version` per feature; required-param failure = treated as missing-required-feature; daemon closes WS 4010, exits non-zero | §15.1 |
| Drop "zeroed/secure-delete" promise | Replaced with "removed from live archive; secure deletion not guaranteed"; threat model documents | §14.1.1 |
| Duplicate `key_id` handling | Archive load rejects, quarantine, start empty, continue | §14.1.1 |
| Malformed archive startup behavior | Quarantine, start empty, continue; mode-mismatch warns but reads | §14.1.1 |
| Linux MAC\|\|MAC self-collision | Documented; `host_fingerprint.id_override` escape hatch | §2.2.1 |
| RunPod warning on persistent default | Logged at INFO so default is visible | §2.1 |
|
||||
|
||||
---
|
||||
|
||||
## What needs review (round 5)
|
||||
|
||||
1. **Dedupe table design (§4.3)** — is `(mesh_id, client_message_id)`
|
||||
PRIMARY KEY enough, or do we need versioning of the dedupe row itself
|
||||
(e.g. when destination changes mid-retry)? Is `destination_kind` /
|
||||
`destination_ref` needed at all, or just for audit?
|
||||
2. **`history_available = FALSE` semantics (§4.4)** — does it actually fix
|
||||
the case where receivers ask for history of a pruned message? Or does
|
||||
the receiver need its own dedupe-with-history-pruned pathway?
|
||||
3. **Daemon outbox max-age math (§4.5)** — is the
   `dedupe_retention_days * 24 - 24` margin correct? Should the margin be
   a percentage instead of a fixed 24h?
|
||||
4. **Feature param validation (§15.1)** — does treating "invalid required
|
||||
param" as "missing required feature" lose useful diagnostic detail?
|
||||
Should we have a 4011 `feature_param_invalid` close code separately?
|
||||
5. **Archive quarantine (§14.1.1)** — is "continue startup with empty
|
||||
archive" the right call, or should it be opt-in / refuse-by-default?
|
||||
6. **Anything else still wrong?** Read it as if you were going to operate
|
||||
this for a year.
|
||||
|
||||
Three options:
|
||||
- **(a) v5 is shippable**: lock the spec, start coding the frozen core.
|
||||
- **(b) v6 needed**: list the must-fix items.
|
||||
- **(c) the architecture itself is wrong**: what would you do differently?
|
||||
|
||||
Be ruthless.
|
||||
447
.artifacts/shipped/2026-05-03-daemon-final-spec-v6.md
Normal file
@@ -0,0 +1,447 @@
|
||||
# `claudemesh daemon` — Final Spec v6
|
||||
|
||||
> **Round 6.** v5 was reviewed by codex (round 5) which found the dedupe
|
||||
> table architecture sound but called out four idempotency-correctness
|
||||
> issues that would silently corrupt sends in production:
|
||||
>
|
||||
> 1. **Idempotency key reuse with different payload/destination** — v5
|
||||
> silently collapsed a different send onto the original. Need a request
|
||||
> fingerprint.
|
||||
> 2. **`status = 'rejected'` underspecified** — schema allowed it, semantics
|
||||
> didn't. Either fully define or drop.
|
||||
> 3. **Outbox max-age math edges** — `dedupe_retention_days = 1` minus 24h
|
||||
> margin = 0 hours, which is undefined.
|
||||
> 4. **Broker atomicity not stated** — dedupe insert and message insert
|
||||
> must be one transaction or you produce orphan dedupe rows.
|
||||
>
|
||||
> v6 fixes all four. **Intent §0 unchanged from v2.** v6 only revises
|
||||
> idempotency semantics in §4 and migration in §17.
|
||||
|
||||
---
|
||||
|
||||
## 0. Intent — unchanged, see v2 §0
|
||||
|
||||
---
|
||||
|
||||
## 1. Process model — unchanged from v3 §1 / v2 §1
|
||||
|
||||
---
|
||||
|
||||
## 2. Identity — unchanged from v5 §2
|
||||
|
||||
---
|
||||
|
||||
## 3. IPC surface — unchanged from v4 §3
|
||||
|
||||
---
|
||||
|
||||
## 4. Delivery contract — at-least-once with **request-fingerprinted** dedupe
|
||||
|
||||
Codex r5: dedupe must compare the *whole request shape*, not just
|
||||
`(mesh, client_message_id)`. Otherwise a caller who reuses an idempotency
|
||||
key with a different destination or body silently drops the new send and
|
||||
gets the old send's metadata back.
|
||||
|
||||
### 4.1 The contract (precise — v6)
|
||||
|
||||
> **Local guarantee**: each successful `POST /v1/send` returns a stable
|
||||
> `client_message_id`. The send is durably persisted to `outbox.db` before
|
||||
> the response returns.
|
||||
>
|
||||
> **Broker guarantee**: the broker maintains a dedupe record per accepted
|
||||
> `(mesh_id, client_message_id)` in `mesh.client_message_dedupe`. Each
|
||||
> dedupe record carries a canonical `request_fingerprint`. Retries with
|
||||
> the same `client_message_id` AND matching fingerprint collapse to the
|
||||
> original `broker_message_id`. Retries with the same `client_message_id`
|
||||
> but a different fingerprint return a deterministic conflict
|
||||
> (`409 idempotency_key_reused`) and do **not** create a new message.
|
||||
>
|
||||
> **Atomicity guarantee**: dedupe row insertion and message row insertion
|
||||
> happen in one broker DB transaction. Either both land, or neither. No
|
||||
> orphan dedupe rows. If the broker crashes between dedupe insert and
|
||||
> message insert, the rollback unwinds both.
|
||||
>
|
||||
> **End-to-end guarantee**: at-least-once delivery, with
|
||||
> `client_message_id` propagated to receivers' inboxes.
|
||||
|
||||
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
|
||||
|
||||
### 4.3 Broker schema — request fingerprint added (v6)
|
||||
|
||||
```sql
CREATE TABLE mesh.client_message_dedupe (
  mesh_id              UUID NOT NULL REFERENCES mesh.mesh(id) ON DELETE CASCADE,
  client_message_id    TEXT NOT NULL,

  -- The original accepted message; FK NOT enforced because the message row
  -- may be GC'd by retention sweeps before the dedupe row expires.
  broker_message_id    UUID NOT NULL,

  -- Canonical fingerprint of the original request. Recomputed on every
  -- duplicate retry; mismatch → 409 idempotency_key_reused. Schema in §4.4.
  request_fingerprint  BYTEA NOT NULL,  -- 32-byte sha256

  destination_kind     TEXT NOT NULL CHECK(destination_kind IN ('topic','dm','queue')),
  destination_ref      TEXT NOT NULL,
  first_seen_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  expires_at           TIMESTAMPTZ,     -- NULL = `permanent` mode
  history_available    BOOLEAN NOT NULL DEFAULT TRUE,  -- flipped FALSE when message row GC'd

  PRIMARY KEY (mesh_id, client_message_id)
);

CREATE INDEX client_message_dedupe_expires_idx
  ON mesh.client_message_dedupe(expires_at)
  WHERE expires_at IS NOT NULL;

ALTER TABLE mesh.topic_message ADD COLUMN client_message_id TEXT;
ALTER TABLE mesh.message_queue ADD COLUMN client_message_id TEXT;
```
|
||||
|
||||
**`status` column dropped (codex r5)**. Rejected requests do **not**
|
||||
consume idempotency keys. Rationale below in §4.6.
|
||||
|
||||
### 4.4 Request fingerprint — canonical form (NEW v6)
|
||||
|
||||
The fingerprint covers everything that makes a send semantically distinct.
|
||||
A retry must reproduce the same fingerprint bit-for-bit; anything else is
|
||||
a different send and must not be collapsed.
|
||||
|
||||
```
request_fingerprint = sha256(
  envelope_version     || 0x00 ||
  destination_kind     || 0x00 ||
  destination_ref      || 0x00 ||
  reply_to_id_or_empty || 0x00 ||
  priority             || 0x00 ||
  meta_canonical_json  || 0x00 ||
  body_hash
)
```
|
||||
|
||||
Where:
|
||||
- `envelope_version`: integer string (e.g. `"1"`). Bumps when the envelope
|
||||
shape changes.
|
||||
- `destination_kind`: `topic`, `dm`, or `queue`.
|
||||
- `destination_ref`: topic name, recipient ed25519 pubkey hex, or queue id.
|
||||
- `reply_to_id_or_empty`: original `broker_message_id` or empty string.
|
||||
- `priority`: `now`, `next`, or `low`.
|
||||
- `meta_canonical_json`: the `meta` field, serialized with sorted keys,
|
||||
no whitespace, escape-canonical (RFC 8785 JCS). Empty meta = empty string.
|
||||
- `body_hash`: sha256(body bytes), hex.
|
||||
|
||||
The fingerprint is computed:
|
||||
1. **Daemon-side** before durable outbox persistence — stored as
|
||||
`outbox.request_fingerprint` (NEW column) so retries always produce
|
||||
the same fingerprint regardless of caller behavior.
|
||||
2. **Broker-side** on first receipt — stored in
|
||||
`client_message_dedupe.request_fingerprint`.
|
||||
3. **Broker-side** on every duplicate retry — recomputed and compared
|
||||
byte-equal to the stored value.
|
||||
|
||||
If the daemon and broker disagree on the canonical form (e.g. JCS
|
||||
implementation drift), the broker emits
|
||||
`cm_broker_dedupe_fingerprint_mismatch_total{client_id, mesh_id}` and
|
||||
returns `409 idempotency_key_reused` with a body that includes the
|
||||
broker's fingerprint hex for debugging. Daemons that see this should
|
||||
log it loudly and stop retrying that outbox row (it goes to `dead`).
|
||||
|
||||
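To make the canonical form concrete, here is a small sketch of the fingerprint computation. It rests on a few assumptions: fields are UTF-8 encoded before joining, an empty `meta` object counts as "empty meta", and `json.dumps` with sorted keys and compact separators only approximates RFC 8785 JCS (number formatting and key ordering can differ for unusual inputs). The function names are hypothetical.

```python
import hashlib
import json
from typing import Any, Mapping, Optional

SEP = b"\x00"  # the 0x00 field separator from the formula


def canonical_meta(meta: Optional[Mapping[str, Any]]) -> str:
    """Approximate JCS: sorted keys, no whitespace. Empty meta -> ""."""
    if not meta:
        return ""
    return json.dumps(meta, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)


def request_fingerprint(envelope_version: int,
                        destination_kind: str,
                        destination_ref: str,
                        reply_to_id: Optional[str],
                        priority: str,
                        meta: Optional[Mapping[str, Any]],
                        body: bytes) -> bytes:
    """Return the 32-byte sha256 over the 0x00-joined request fields."""
    body_hash = hashlib.sha256(body).hexdigest()  # sha256(body bytes), hex
    fields = [
        str(envelope_version),   # integer string, e.g. "1"
        destination_kind,        # "topic", "dm", or "queue"
        destination_ref,
        reply_to_id or "",       # empty string when not a reply
        priority,                # "now", "next", or "low"
        canonical_meta(meta),
        body_hash,
    ]
    return hashlib.sha256(SEP.join(f.encode("utf-8") for f in fields)).digest()
```

Because the meta is canonicalized, two requests whose `meta` differs only in key order produce the same fingerprint, while any change to destination, priority, or body produces a different one.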
### 4.5 Duplicate response — three cases (v6)
|
||||
|
||||
| Case | HTTP/WS code | Body |
|---|---|---|
| First insert | `201 created` | `{ broker_message_id, client_message_id, history_id, duplicate: false }` |
| Duplicate, fingerprint match | `200 ok` | `{ broker_message_id, client_message_id, history_id, duplicate: true, history_available, first_seen_at }` |
| Duplicate, fingerprint mismatch | `409 idempotency_key_reused` | `{ client_message_id, conflict: "request_fingerprint_mismatch", broker_fingerprint_prefix: "ab12cd34..." }` (first 8 bytes hex) |
|
||||
|
||||
Daemon outcomes:
|
||||
- `201` → mark outbox row `done`, store `broker_message_id`. Normal path.
|
||||
- `200 duplicate` with `history_available: true` → mark `done`, no
|
||||
re-fanout, log at INFO.
|
||||
- `200 duplicate` with `history_available: false` → mark `done`, log at
|
||||
WARN. The original delivery succeeded; receivers got it.
|
||||
- `409 idempotency_key_reused` → mark outbox row `dead`, surface in
|
||||
`claudemesh daemon outbox --failed`. Operator must rotate the
|
||||
idempotency key by hand and resubmit (`outbox requeue --new-id <id>`,
|
||||
NEW v6 subcommand). Daemon does NOT auto-rotate to avoid masking caller
|
||||
bugs.
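
The daemon outcomes above can be sketched as a small mapping from broker response to outbox transition. This is an illustration only: the function name, the tuple return shape, the `ERROR` level for `409`, and the fallback branch for other status codes are assumptions, not spec text.

```python
def outbox_transition(status_code, body):
    """Map a broker response to (new_outbox_status, log_level)."""
    if status_code == 201:
        # First insert: normal path. The spec names no log level here.
        return ("done", None)
    if status_code == 200 and body.get("duplicate"):
        # Duplicate with matching fingerprint: never re-fanout.
        level = "INFO" if body.get("history_available") else "WARN"
        return ("done", level)
    if status_code == 409:
        # Fingerprint mismatch: no auto-rotation, operator must requeue.
        return ("dead", "ERROR")  # assumption: "log loudly" mapped to ERROR
    # Assumption: anything else stays retryable.
    return ("pending", "WARN")
```

Keeping this as a pure function makes the three-case table easy to unit-test separately from the retry loop.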
|
||||
|
||||
### 4.6 Why rejected requests don't consume idempotency keys (v6)
|
||||
|
||||
`status` was in v5's schema but underspecified. Two scenarios:
|
||||
|
||||
- **Transient broker error** (DB down, queue full, network blip): daemon
|
||||
retries. If we'd persisted a `rejected` row on the first attempt, the
|
||||
retry would fail forever. Bad.
|
||||
- **Permanent validation error** (payload too large, destination not
|
||||
found, auth missing): broker returns the appropriate `4xx` immediately
|
||||
without inserting a dedupe row. Daemon either fixes the request and
|
||||
retries (different fingerprint → fingerprint mismatch → `409` per §4.5)
|
||||
or marks dead. Persisting a "rejected" row buys nothing — the daemon
|
||||
isn't going to send the same broken request again with the same key.
|
||||
|
||||
Net result: `client_message_dedupe` rows only exist when the broker
|
||||
**successfully** accepted a message and committed it. The single source
|
||||
of truth for "was this idempotency key consumed?" is the existence of
|
||||
the dedupe row. No status enum, no ambiguous states.
|
||||
|
||||
### 4.7 Broker atomicity contract (NEW v6)
|
||||
|
||||
Every accept path runs in one DB transaction with the following shape:
|
||||
|
||||
```sql
BEGIN;
-- Pre-generate broker_message_id outside the transaction; pass in.
INSERT INTO mesh.client_message_dedupe
  (mesh_id, client_message_id, broker_message_id, request_fingerprint,
   destination_kind, destination_ref, expires_at)
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
        $dest_kind, $dest_ref, $expires_at)
ON CONFLICT (mesh_id, client_message_id) DO NOTHING
RETURNING broker_message_id, request_fingerprint, history_available, first_seen_at;

-- If RETURNING was empty (conflict), do a SELECT to fetch the original
-- and exit the transaction with a duplicate response.
-- If RETURNING produced a row AND $fingerprint != returned.fingerprint,
-- that's the §4.5 mismatch path — also exit with 409.

-- Otherwise, this is the first insert. Insert the message row.
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
VALUES ($msg_id, $mesh_id, $client_id, ...);

-- Optional: enqueue fan-out work, etc.
COMMIT;
```
|
||||
|
||||
Failure modes:
|
||||
- Crash before `COMMIT`: both rows roll back. Next daemon retry inserts
|
||||
cleanly.
|
||||
- Crash after `COMMIT` but before WS ACK: dedupe row exists, message row
|
||||
exists. Daemon retries → fingerprint matches → `200 duplicate`. Net:
|
||||
exactly one broker-accepted row, one daemon `done` transition.
|
||||
- Constraint violation on message row insert (e.g. unique violation on
|
||||
some other column): rolls back the dedupe insert. Returns `5xx` to
|
||||
daemon. Daemon retries; same fingerprint reproduces the same constraint
|
||||
violation; daemon eventually marks `dead`. No orphan dedupe row.
|
||||
|
||||
A nightly orphan-check job (counter `cm_broker_dedupe_orphan_check_total`)
validates that every `client_message_dedupe` row either has a matching
`topic_message` or `message_queue` row, or that the matching message row
has been retention-pruned (in which case `history_available = FALSE` was
set). Any row failing both conditions is logged as
`cm_broker_dedupe_orphan_found{mesh_id}` for human review. The count
should be zero in steady state.
|
||||
|
||||
### 4.8 Outbox schema — fingerprint stored alongside (v6)
|
||||
|
||||
```sql
CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,  -- 32 bytes
  payload             BLOB NOT NULL,
  enqueued_at         INTEGER NOT NULL,
  attempts            INTEGER DEFAULT 0,
  next_attempt_at     INTEGER NOT NULL,
  status              TEXT CHECK(status IN ('pending','inflight','done','dead')),
  last_error          TEXT,
  delivered_at        INTEGER,
  broker_message_id   TEXT
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
```
|
||||
|
||||
`request_fingerprint` is computed at IPC accept time and stored. Every
|
||||
retry sends the same bytes. The daemon never recomputes from `payload`
|
||||
post-enqueue (would produce drift if envelope_version changes between
|
||||
daemon runs).
|
||||
|
||||
### 4.9 Outbox max-age math — bounded (v6)
|
||||
|
||||
Codex r5: the v5 formula `(dedupe_retention_days * 24) - 24h_margin`
breaks at `dedupe_retention_days = 1` (yields zero) and goes negative
below that.
|
||||
|
||||
v6 formula and bounds:
|
||||
|
||||
- **Minimum supported broker dedupe retention**: 3 days. Daemon refuses
to start if broker advertises `dedupe_retention_days < 3` (treats it
as `feature_param_below_floor`, close code 4012 per §15.5).
|
||||
- **Daemon `max_age_hours` derivation**:
|
||||
- `permanent` mode → daemon uses config default (168h = 7d), cap 720h
|
||||
(30d).
|
||||
- `retention_scoped` mode → daemon `max_age_hours = max(72,
|
||||
(dedupe_retention_days * 24) - safety_margin_hours)` where
|
||||
`safety_margin_hours = max(24, ceil(dedupe_retention_days * 0.1 *
|
||||
24))`. For `dedupe_retention_days=3` this gives
|
||||
`max(72, 72-24) = 72h`. For 30 days: `max(72, 720-72) = 648h`. For
|
||||
365 days: `max(72, 8760-876) = 7884h`.
|
||||
- The 72h floor prevents the daemon outbox from being uselessly short
|
||||
— three days is enough margin for normal operator response to a
|
||||
paged outage.
|
||||
|
||||
- Operator override allowed via `[outbox] max_age_hours_override = N`,
|
||||
but if `N` exceeds `dedupe_retention_days * 24 - 1` daemon refuses to
|
||||
start with `outbox_max_age_above_dedupe_window`. The override exists
|
||||
for the rare case of a much-shorter-than-default outbox; it does not
|
||||
exist to circumvent the broker's dedupe window.
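
As a worked check of the v6 formula and the override gate, here is a small Python sketch. It uses integer arithmetic for the 10% margin because `ceil(30 * 0.1 * 24)` evaluated in binary floating point can round up to 73 instead of 72. The function names are illustrative.

```python
def v6_max_age_hours(dedupe_retention_days):
    """retention_scoped mode: max(72, window - safety_margin), all in hours."""
    if dedupe_retention_days < 3:
        raise ValueError("dedupe_retention_days below the 3-day floor")
    window_hours = dedupe_retention_days * 24
    ten_percent = -(-window_hours // 10)  # integer ceil(window_hours * 0.1)
    safety_margin_hours = max(24, ten_percent)
    return max(72, window_hours - safety_margin_hours)


def v6_override_allowed(n_hours, dedupe_retention_days):
    """The override is refused if N exceeds dedupe_retention_days * 24 - 1."""
    return n_hours <= dedupe_retention_days * 24 - 1


print(v6_max_age_hours(3), v6_max_age_hours(30), v6_max_age_hours(365))
# prints: 72 648 7884
```

At the 3-day minimum the derived max-age (72h) equals the broker window (72h), so the outbox is not strictly inside the window at that setting.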
|
||||
|
||||
### 4.10 Inbox schema — unchanged from v3 §4.5
|
||||
|
||||
### 4.11 Crash recovery — unchanged from v3 §4.6
|
||||
|
||||
### 4.12 Failure modes — corrected for fingerprint model (v6)
|
||||
|
||||
- **Fingerprint mismatch on retry** (`409 idempotency_key_reused`): outbox
|
||||
row marked `dead`. Surfaced in `--failed` view. Operator command
|
||||
`outbox requeue --new-id <id>` rotates `client_message_id` and retries.
|
||||
- **Daemon retry after dedupe row hard-deleted by retention sweep**: in
|
||||
`retention_scoped` mode, daemon `max_age_hours` is bounded inside the
|
||||
retention window (§4.9), so this can only happen via operator override.
|
||||
In that case the retry creates a NEW dedupe row + new message — the
|
||||
caller chose this risk explicitly. Counter
|
||||
`cm_daemon_retry_after_dedupe_expired_total`.
|
||||
- **Daemon retry after dedupe row hard-deleted in `permanent` mode**:
|
||||
cannot happen by definition — `permanent` means no `expires_at`. Only
|
||||
mesh deletion removes dedupe rows.
|
||||
- **Duplicate row, history pruned**: as v5 §4.4. Mark `done`, log
|
||||
`cm_daemon_dedupe_history_pruned_total`.
|
||||
|
||||
---
|
||||
|
||||
## 5. Inbound — unchanged from v3 §5
|
||||
|
||||
---
|
||||
|
||||
## 6. Hooks — unchanged from v4 §6
|
||||
|
||||
---
|
||||
|
||||
## 7-13. Multi-mesh, auto-routing, service install, observability, SDKs, security model, configuration — unchanged from v4
|
||||
|
||||
---
|
||||
|
||||
## 14. Lifecycle — unchanged from v5 §14
|
||||
|
||||
---
|
||||
|
||||
## 15. Version compat — feature param updated for new dedupe semantics
|
||||
|
||||
### 15.1 Feature bits with parameters (v6 update)
|
||||
|
||||
| Bit | `params.version` | Required parameters | Optional parameters |
|---|---|---|---|
| `client_message_id_dedupe` | `2` | `mode: "retention_scoped"\|"permanent"`, `dedupe_retention_days: int (>= 3)` (when mode=retention_scoped), `request_fingerprint: bool == true` | `tombstone_history_pruned_window_days: int` |
| `concurrent_connection_policy` | `1` | (no parameters) | `default_policy: "prefer_newest"\|"prefer_oldest"\|"allow_concurrent"` |
| `member_keypair_rotated_event` | `1` | (no parameters) | — |
| `key_epoch` | `1` | `max_concurrent_epochs: int (>= 1)` | — |
| `max_payload` | `1` | `inline_bytes: int (>= 1024)`, `blob_bytes: int (>= 1024)` | — |
|
||||
|
||||
`client_message_id_dedupe` bumped to `params.version = 2` because it now
|
||||
requires `request_fingerprint = true`. A broker still on version 1
|
||||
(no fingerprint comparison) is treated as "feature missing" and the
|
||||
daemon refuses to start. That's intentional — v0.9.0 daemons require
|
||||
fingerprint enforcement for safe idempotency.
|
||||
|
||||
`dedupe_retention_days` minimum raised to 3 (matches the §4.9 floor).
|
||||
|
||||
### 15.2 Negotiation handshake — unchanged shape from v5 §15.2
|
||||
|
||||
### 15.3 IPC negotiation — unchanged from v3 §15.3
|
||||
|
||||
### 15.4 Compatibility matrix — unchanged from v3 §15.4
|
||||
|
||||
### 15.5 Diagnostic close codes (NEW v6 — codex r5)
|
||||
|
||||
WebSocket close codes are split for diagnostic clarity:
|
||||
|
||||
| Code | Reason | When |
|---|---|---|
| `4010` | `feature_unavailable` | Required feature missing from broker's `supported` |
| `4011` | `feature_param_invalid` | Required feature present but parameters fail validation (missing required, out of bounds, unknown version) |
| `4012` | `feature_param_below_floor` | Required feature parameter below daemon's hard floor (e.g. `dedupe_retention_days < 3`) |
|
||||
|
||||
Daemon logs the full negotiation payload at WARN before exiting; the
supervisor and alerting catch the restart loop.
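
A rough sketch of how a daemon might pick among the three close codes for the `client_message_id_dedupe` feature. The shape of the `supported` mapping (feature name to a params dict with a `version` key) is an assumption made for illustration; the spec does not show the negotiation payload here. Treating version 1 as "feature missing" follows §15.1.

```python
def dedupe_close_code(supported, floor_days=3):
    """Return 4010, 4011, 4012, or None when the feature is acceptable."""
    params = supported.get("client_message_id_dedupe")
    if params is None:
        return 4010  # feature_unavailable
    version = params.get("version")
    if version == 1:
        return 4010  # no fingerprint comparison: treated as feature missing
    if version != 2:
        return 4011  # unknown version
    if params.get("request_fingerprint") is not True:
        return 4011
    mode = params.get("mode")
    if mode not in ("retention_scoped", "permanent"):
        return 4011
    if mode == "retention_scoped":
        days = params.get("dedupe_retention_days")
        if not isinstance(days, int) or isinstance(days, bool):
            return 4011  # missing or wrongly typed required parameter
        if days < floor_days:
            return 4012  # feature_param_below_floor
    return None
```

`floor_days` is a parameter because the floor is a daemon-side constant that later spec revisions may raise.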
|
||||
|
||||
---
|
||||
|
||||
## 16. Threat model — unchanged from v4 §16
|
||||
|
||||
---
|
||||
|
||||
## 17. Migration — broker dedupe table + atomicity (v6)
|
||||
|
||||
Broker side, deploy order:
|
||||
|
||||
1. `CREATE TABLE mesh.client_message_dedupe` with v6 schema (additive,
|
||||
online-safe).
|
||||
2. `ALTER TABLE mesh.topic_message ADD COLUMN client_message_id`.
|
||||
3. `ALTER TABLE mesh.message_queue ADD COLUMN client_message_id`.
|
||||
4. Broker code refactor: every accept path wraps dedupe insert + message
|
||||
insert in **one transaction** (§4.7). Pre-generated
|
||||
`broker_message_id` (ulid in code) passed in.
|
||||
5. Broker code: nightly job to delete dedupe rows where `expires_at <
|
||||
NOW()` (skip in `permanent` mode).
|
||||
6. Broker code: hook into the message-retention sweep — when a
|
||||
`topic_message` or `message_queue` row is hard-deleted, find the
|
||||
matching dedupe row by `client_message_id` and set `history_available
|
||||
= FALSE`. (Note: `client_message_id` is nullable on those tables for
|
||||
legacy traffic; nullable rows have no dedupe row to update.)
|
||||
7. Broker code: nightly orphan-check job (§4.7); alerts on non-zero.
|
||||
8. Broker advertises `client_message_id_dedupe` feature with
|
||||
`params.version = 2` and `request_fingerprint: true`.
|
||||
9. Daemon refuses to start unless that feature bit is advertised with
|
||||
valid v2 params.
|
||||
|
||||
Rollback plan: feature flag disables fingerprint enforcement broker-side
|
||||
(falls back to existing pre-v6 behavior — no dedupe). Daemons that
|
||||
require fingerprint refuse to start. Operator switches off the feature
|
||||
flag, reverts the daemon, restarts. No data loss; pending dedupe rows
|
||||
remain in place for the next forward roll.
|
||||
|
||||
---
|
||||
|
||||
## What changed v5 → v6 (codex round-5 actionable items)
|
||||
|
||||
| Codex r5 item | v6 fix | Section |
|---|---|---|
| Idempotency key reuse with different payload silently collapses | `request_fingerprint` BYTEA in dedupe table; canonical form per §4.4; 409 on mismatch | §4.3, §4.4, §4.5 |
| `status='rejected'` underspecified | Dropped `status` column; rejected requests don't consume keys; existence of dedupe row = "key consumed" | §4.3, §4.6 |
| Outbox max-age math edges at low retention | 72h floor; min `dedupe_retention_days = 3`; percentage-based safety margin; explicit override gating | §4.9, §15.1 |
| Broker atomicity not stated | One transaction per accept path; orphan-check job; rollback semantics | §4.7 |
| Diagnostic detail on feature param failures | New close codes 4011 / 4012 separate from 4010 | §15.5 |
| Outbox stores fingerprint | NEW column `outbox.request_fingerprint` BLOB; computed once at IPC accept | §4.8 |
| Operator command for fingerprint-mismatch recovery | NEW `outbox requeue --new-id <id>` to rotate idempotency key | §4.5 |
|
||||
|
||||
---
|
||||
|
||||
## What needs review (round 6)
|
||||
|
||||
1. **Request fingerprint canonical form (§4.4)** — does JCS work
|
||||
cross-language for `meta_canonical_json` (Python json.dumps,
|
||||
Go encoding/json, JS JSON.stringify all behave differently)? Should
|
||||
we ship a vetted JCS lib in each SDK or fall back to a simpler
|
||||
"sorted keys + no spaces + escape-as-stored" rule with conformance
|
||||
tests?
|
||||
2. **Atomicity contract (§4.7)** — is the orphan-check sufficient, or
|
||||
does a violation mean we need a "broker rebuild dedupe from messages"
|
||||
recovery tool? The latter is destructive but useful for ops emergencies.
|
||||
3. **Max-age formula (§4.9)** — is the 72h floor correct? Is the
|
||||
percentage-based safety margin (`max(24, ceil(0.1 * dedupe_window))`)
|
||||
the right shape? Or simpler to say "always 24h"?
|
||||
4. **`409 idempotency_key_reused` recovery flow (§4.5)** — is sending the
|
||||
row to `dead` and surfacing it via `outbox --failed` enough? Should
|
||||
the daemon emit a high-priority event for the SSE stream so operators
|
||||
are paged immediately?
|
||||
5. **Diagnostic close codes (§15.5)** — is splitting 4010/4011/4012
|
||||
useful, or does it just push complexity onto operators? Should we
|
||||
collapse to 4010 with structured close-reason JSON instead?
|
||||
6. **Anything else still wrong?** Read it as if you were going to
|
||||
operate this for a year. What falls down?
|
||||
|
||||
Three options:
|
||||
- **(a) v6 is shippable**: lock the spec, start coding the frozen core.
|
||||
- **(b) v7 needed**: list the must-fix items.
|
||||
- **(c) the architecture itself is wrong**: what would you do differently?
|
||||
|
||||
Be ruthless.
|
||||
439
.artifacts/shipped/2026-05-03-daemon-final-spec-v7.md
Normal file
@@ -0,0 +1,439 @@
|
||||
# `claudemesh daemon` — Final Spec v7
|
||||
|
||||
> **Round 7.** v6 was reviewed by codex (round 6) which found the broker
|
||||
> layer largely correct but caught five daemon-side and broker-tx
|
||||
> correctness gaps:
|
||||
>
|
||||
> 1. **Daemon-local duplicate POST semantics** undefined — local fingerprint
|
||||
> comparison missing across `pending` / `inflight` / `done` / `dead`.
|
||||
> 2. **§4.6 rejected-request contradiction** — talked about both "fix and
|
||||
> retry" and "fingerprint mismatch → 409". Only one of those can be true.
|
||||
> 3. **§4.7 pseudocode bug** — `ON CONFLICT DO NOTHING RETURNING` returns
|
||||
> nothing on conflict; the fingerprint comparison was in the wrong branch.
|
||||
> 4. **Max-age math floor consumes margin** — at min retention (3 days),
|
||||
> daemon max-age 72h equals broker window 72h. Not inside the window.
|
||||
> 5. **Broker transaction boundary incomplete** — fan-out/queue/history side
|
||||
> effects not stated as in-transaction; "optional" wording was wrong.
|
||||
>
|
||||
> v7 fixes all five. **Intent §0 unchanged from v2.** v7 only revises §4
|
||||
> (delivery contract) and §15 (feature param min) and §17 (migration).
|
||||
|
||||
---
|
||||
|
||||
## 0. Intent — unchanged, see v2 §0
|
||||
|
||||
---
|
||||
|
||||
## 1. Process model — unchanged
|
||||
|
||||
## 2. Identity — unchanged from v5 §2
|
||||
|
||||
## 3. IPC surface — unchanged from v4 §3
|
||||
|
||||
---
|
||||
|
||||
## 4. Delivery contract — at-least-once, fingerprinted at IPC and broker layers
|
||||
|
||||
### 4.1 The contract (precise — v7)
|
||||
|
||||
> **Local guarantee**: each successful `POST /v1/send` returns a stable
|
||||
> `client_message_id`. The send is durably persisted to `outbox.db` before
|
||||
> the response returns. The daemon enforces request-fingerprint
|
||||
> idempotency at the IPC layer: a duplicate `POST` with the same
|
||||
> `client_message_id` and matching `request_fingerprint` returns the
|
||||
> stable prior result; with a mismatched fingerprint it returns local
|
||||
> `409 idempotency_key_reused` and the new request is **not** persisted.
|
||||
>
|
||||
> **Broker guarantee**: the broker maintains a dedupe record per
|
||||
> accepted `(mesh_id, client_message_id)` in `mesh.client_message_dedupe`
|
||||
> with `request_fingerprint`. Retries with matching fingerprint collapse;
|
||||
> retries with mismatched fingerprint return `409
|
||||
> idempotency_key_reused` without creating a new message.
|
||||
>
|
||||
> **Atomicity guarantee**: every durable side effect of a successful
|
||||
> accept (dedupe row, message row, fan-out work, history row, queue
|
||||
> insertion) lands in the same broker DB transaction. Either all commit
|
||||
> or none do.
|
||||
>
|
||||
> **End-to-end guarantee**: at-least-once delivery, with
|
||||
> `client_message_id` propagated to receivers' inboxes.
|
||||
|
||||
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
|
||||
|
||||
### 4.3 Broker schema — unchanged from v6 §4.3
|
||||
|
||||
(`mesh.client_message_dedupe` table with `request_fingerprint BYTEA`, no
|
||||
`status` column.)
|
||||
|
||||
### 4.4 Request fingerprint canonical form — unchanged from v6 §4.4
|
||||
|
||||
### 4.5 Daemon-local idempotency at the IPC layer (NEW v7 — codex r6)
|
||||
|
||||
The daemon enforces fingerprint idempotency **before** the request hits
|
||||
`outbox.db` so a caller bug never creates duplicate-key/mismatch-payload
|
||||
state at all.
|
||||
|
||||
#### 4.5.1 IPC accept algorithm
|
||||
|
||||
On `POST /v1/send`:
|
||||
|
||||
1. Validate request envelope (auth, schema, size limits). Failures
|
||||
here return `4xx` immediately. **No outbox row is written.** The
|
||||
`client_message_id` (whether caller-supplied or daemon-minted) is
|
||||
**not consumed** — the same id may be reused by the caller for a
|
||||
subsequent valid send.
|
||||
2. Compute `request_fingerprint` (§4.4).
|
||||
3. Look up existing outbox row by `client_message_id`:
|
||||
|
||||
| Existing row state | Fingerprint match? | Daemon response |
|---|---|---|
| (no row) | — | Insert new outbox row in `pending`; return `202 accepted, queued` with `client_message_id` |
| `pending` | match | Return `202 accepted, queued` with the existing `client_message_id`. No new row. Idempotent retry of an in-progress send |
| `pending` | mismatch | Return `409 idempotency_key_reused` with `conflict: "outbox_pending_fingerprint_mismatch"`. **No mutation of the existing row.** |
| `inflight` | match | Return `202 accepted, inflight`. No new row. Caller is retrying mid-broker-roundtrip |
| `inflight` | mismatch | Return `409 idempotency_key_reused` with `conflict: "outbox_inflight_fingerprint_mismatch"` |
| `done` | match | Return `200 ok, duplicate: true, broker_message_id, history_id`. No new row, no broker call |
| `done` | mismatch | Return `409 idempotency_key_reused` with `conflict: "outbox_done_fingerprint_mismatch", broker_message_id` |
| `dead` | match | Return `409 idempotency_key_reused` with `conflict: "outbox_dead_fingerprint_match", reason: "<last_error>"`. Caller must rotate the id (see §4.6.3) — daemon refuses to re-attempt a dead row's exact bytes. |
| `dead` | mismatch | Return `409 idempotency_key_reused` with `conflict: "outbox_dead_fingerprint_mismatch"` |
|
||||
|
||||
Rule: any IPC `409` carries the daemon's `request_fingerprint` (8-byte
|
||||
hex prefix) so callers can debug client/server canonical-form drift.
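
The decision table above can be sketched as a pure function over the existing outbox row. The row is modeled as a plain dict, the return shape `(http_status, body, insert_new_row)` is invented for illustration, and the sketch assumes the fingerprint prefix carried on a `409` is taken from the stored row; the rule above does not say which side's fingerprint is meant.

```python
def ipc_accept(existing, fingerprint):
    """Decide the IPC response for POST /v1/send after step 3's lookup."""
    if existing is None:
        return (202, {"state": "queued"}, True)  # insert new row in 'pending'

    status = existing["status"]
    match = existing["request_fingerprint"] == fingerprint

    if status == "dead" or not match:
        # Every mismatch is a 409, and so is a dead row with matching bytes.
        suffix = "match" if match else "mismatch"
        body = {
            "conflict": f"outbox_{status}_fingerprint_{suffix}",
            # Assumption: 8-byte hex prefix of the stored fingerprint.
            "fingerprint_prefix": existing["request_fingerprint"][:8].hex(),
        }
        if status == "done":
            body["broker_message_id"] = existing["broker_message_id"]
        if status == "dead" and match:
            body["reason"] = existing.get("last_error")
        return (409, body, False)

    if status == "done":
        return (200, {"duplicate": True,
                      "broker_message_id": existing["broker_message_id"]}, False)
    # pending -> "queued", inflight -> "inflight"; no new row either way.
    return (202, {"state": "queued" if status == "pending" else "inflight"}, False)
```

No branch mutates `existing`, which mirrors the table's "no mutation of the existing row" rule.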
|
||||
|
||||
#### 4.5.2 Outbox table — fingerprint required, atomic UPSERT removed
|
||||
|
||||
```sql
CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,  -- 32 bytes
  payload             BLOB NOT NULL,
  enqueued_at         INTEGER NOT NULL,
  attempts            INTEGER DEFAULT 0,
  next_attempt_at     INTEGER NOT NULL,
  status              TEXT CHECK(status IN ('pending','inflight','done','dead')),
  last_error          TEXT,
  delivered_at        INTEGER,
  broker_message_id   TEXT
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
```
|
||||
|
||||
Insertion is `BEGIN; SELECT FOR UPDATE; if-no-row INSERT; COMMIT;` —
|
||||
explicit lock + check + insert, not `INSERT OR IGNORE`. The daemon
|
||||
never auto-mutates an existing row's `request_fingerprint` or
|
||||
`payload`; mismatches are 409s, not silent overwrites.
|
||||
|
||||
`request_fingerprint` is computed once at IPC accept time and frozen.
|
||||
Retries to the broker re-send the same bytes from `payload` and the
|
||||
same `request_fingerprint`. Daemon does not recompute post-enqueue.
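
A sketch of the lock, check, insert sequence, assuming `outbox.db` is SQLite (suggested by the file name and the `INSERT OR IGNORE` reference, but not stated outright). SQLite has no `SELECT ... FOR UPDATE`, so this sketch substitutes `BEGIN IMMEDIATE`, which takes the write lock before the lookup. That is a swapped-in technique, not spec text, and the helper name `claim` is hypothetical.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE outbox (
  id TEXT PRIMARY KEY,
  client_message_id TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  payload BLOB NOT NULL,
  enqueued_at INTEGER NOT NULL,
  attempts INTEGER DEFAULT 0,
  next_attempt_at INTEGER NOT NULL,
  status TEXT CHECK(status IN ('pending','inflight','done','dead')),
  last_error TEXT,
  delivered_at INTEGER,
  broker_message_id TEXT
);
"""


def claim(conn, row_id, client_message_id, fingerprint, payload):
    """Return ('inserted', None) or ('exists', (status, stored_fingerprint))."""
    now = int(time.time())
    conn.execute("BEGIN IMMEDIATE")  # acquire the write lock before checking
    try:
        row = conn.execute(
            "SELECT status, request_fingerprint FROM outbox"
            " WHERE client_message_id = ?",
            (client_message_id,),
        ).fetchone()
        if row is None:
            conn.execute(
                "INSERT INTO outbox (id, client_message_id, request_fingerprint,"
                " payload, enqueued_at, next_attempt_at, status)"
                " VALUES (?, ?, ?, ?, ?, ?, 'pending')",
                (row_id, client_message_id, fingerprint, payload, now, now),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    if row is None:
        return ("inserted", None)
    return ("exists", (row[0], bytes(row[1])))  # existing row is never mutated
```

The connection should be opened with `isolation_level=None` so the `sqlite3` module does not start implicit transactions around the explicit `BEGIN IMMEDIATE`.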
|
||||
|
||||
### 4.6 Rejected-request semantics — pick one rule (NEW v7 — codex r6)
|
||||
|
||||
> **Rule: the `client_message_id` is consumed iff the daemon writes an
|
||||
> outbox row. Anything that fails before outbox insertion (validation,
|
||||
> auth, size) leaves the id untouched and freely reusable.**
|
||||
|
||||
This makes §4.6 internally consistent with §4.5:
|
||||
|
||||
#### 4.6.1 IPC validation failure (no outbox row written)
|
||||
|
||||
- Schema/auth/size/destination-not-resolvable failures return `4xx`
|
||||
immediately. The `client_message_id` is **not** stored anywhere on
|
||||
the daemon. Caller may re-send with the same id and a fixed payload;
|
||||
it will be treated as a fresh request because no outbox row exists.
|
||||
|
||||
#### 4.6.2 Outbox row exists, broker permanent rejection (4xx response)
|
||||
|
||||
- Daemon receives `4xx` from broker (e.g. payload size delta between
|
||||
daemon and broker advertised limits, mesh-level reject). Outbox row
|
||||
transitions to `dead` with `last_error` populated.
|
||||
- Caller retrying with same `client_message_id` → daemon returns
|
||||
`409 idempotency_key_reused, conflict: "outbox_dead_*"` per §4.5.1.
|
||||
- The id is consumed (row is locked in `dead`) until operator action.
|
||||
|
||||
#### 4.6.3 Operator recovery: rotating an idempotency key
|
||||
|
||||
To unstick a `dead` row whose payload needs to change, operator runs:
|
||||
|
||||
```
claudemesh daemon outbox requeue --id <outbox_id> --new-client-id [auto|<id>]
```
|
||||
|
||||
This atomically:
|
||||
1. Marks the existing `dead` row as `aborted` (terminal, never retried).
|
||||
2. Creates a new outbox row with a fresh `client_message_id` (caller-
|
||||
supplied or daemon-ulid'd) and the SAME or a CALLER-PATCHED payload.
|
||||
3. The old `client_message_id` becomes free again at the daemon layer
|
||||
but is still locked at the broker layer if the broker had ever
|
||||
accepted it (its dedupe row stays). For a row that died before
|
||||
broker acceptance, the id is fully reusable end-to-end.
|
||||
|
||||
Operators see a clear distinction between `dead` (needs operator
|
||||
attention) and `aborted` (intentionally retired). Add `aborted` to the
|
||||
status CHECK constraint:
|
||||
|
||||
```sql
status TEXT CHECK(status IN ('pending','inflight','done','dead','aborted'))
```
|
||||
|
||||
### 4.7 Broker atomicity contract — corrected pseudocode + side-effect inventory (v7 — codex r6)
|
||||
|
||||
#### 4.7.1 Side effects inside the transaction
|
||||
|
||||
Every successful broker accept atomically commits the following durable
|
||||
state in **one transaction**:
|
||||
|
||||
| Effect | Table | Notes |
|---|---|---|
| Dedupe record | `mesh.client_message_dedupe` | NEW row keyed by `(mesh_id, client_message_id)` |
| Message body | `mesh.topic_message` OR `mesh.message_queue` | NEW row keyed by `broker_message_id` (pre-generated ulid) |
| History row | `mesh.message_history` | NEW row pointing at `broker_message_id` for ordered replay |
| Fan-out work | `mesh.delivery_queue` | One row per intended recipient (member subscribed to topic, recipient of DM, etc.) |
|
||||
|
||||
Effects **outside** the transaction (committed after ACK to daemon):
|
||||
- WebSocket pushes to currently-connected subscribers — these are best-
|
||||
effort live notifications; on failure subscribers fetch from history
|
||||
on next connect.
|
||||
- Webhook fan-out (post-v0.9.0 feature) — runs asynchronously off the
|
||||
`delivery_queue` rows committed inside the transaction.
|
||||
|
||||
If any in-transaction insert fails (constraint violation, DB error),
|
||||
the transaction rolls back: no dedupe row, no message row, no history,
|
||||
no delivery queue rows. Broker returns `5xx` to daemon; daemon retries.
|
||||
|
||||
#### 4.7.2 Corrected pseudocode (codex r6)
|
||||
|
||||
The fingerprint comparison must happen on the conflict-select branch,
|
||||
not the `RETURNING` branch:
|
||||
|
||||
```sql
BEGIN;

-- Pre-generate broker_message_id (ulid) outside the transaction, pass in.

-- Step 1: try to claim the idempotency key.
INSERT INTO mesh.client_message_dedupe
  (mesh_id, client_message_id, broker_message_id, request_fingerprint,
   destination_kind, destination_ref, expires_at)
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
        $dest_kind, $dest_ref, $expires_at)
ON CONFLICT (mesh_id, client_message_id) DO NOTHING;

-- Step 2: was it our insert?
SELECT broker_message_id, request_fingerprint, destination_kind,
       destination_ref, history_available, first_seen_at
FROM mesh.client_message_dedupe
WHERE mesh_id = $mesh_id AND client_message_id = $client_id
FOR SHARE;

-- If returned.broker_message_id == $msg_id (our pre-generated id),
--   this was the first insert. Continue to step 3.
-- If returned.broker_message_id != $msg_id AND
--    returned.request_fingerprint == $fingerprint,
--   this is a duplicate retry. ROLLBACK; return 200 duplicate.
-- If returned.broker_message_id != $msg_id AND
--    returned.request_fingerprint != $fingerprint,
--   ROLLBACK; return 409 idempotency_key_reused.

-- Step 3: insert message row, history, fan-out queue.
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
VALUES ($msg_id, $mesh_id, $client_id, ...);

INSERT INTO mesh.message_history (broker_message_id, mesh_id, ...)
VALUES ($msg_id, $mesh_id, ...);

INSERT INTO mesh.delivery_queue (broker_message_id, recipient_pubkey, ...)
SELECT $msg_id, member_pubkey, ...
FROM mesh.topic_subscription
WHERE topic = $dest_ref AND mesh_id = $mesh_id;

COMMIT;
```
|
||||
|
||||
The branch logic determines the response shape (`201` vs `200
|
||||
duplicate` vs `409 idempotency_key_reused`) before COMMIT. The
|
||||
duplicate and 409 branches always ROLLBACK because nothing else
|
||||
needs to commit on those paths.
|
||||
|
||||
`SELECT … FOR SHARE` blocks concurrent writers from upgrading the
|
||||
same dedupe row mid-transaction; a concurrent insert with the same
|
||||
key will block until our transaction completes.
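
The three-way branch after the Step 2 `SELECT` can be written as a small Python function. The dict-shaped row and the `(action, http_status)` return value are illustrative assumptions.

```python
def accept_branch(our_msg_id, our_fingerprint, row):
    """Classify the Step 2 row as first insert, duplicate, or key reuse."""
    if row["broker_message_id"] == our_msg_id:
        # The row carries our pre-generated id, so our INSERT won the key.
        return ("continue_to_step_3", 201)
    if row["request_fingerprint"] == our_fingerprint:
        # Someone else's row, same request: idempotent retry.
        return ("rollback", 200)
    # Someone else's row, different request: idempotency key reused.
    return ("rollback", 409)
```

Comparing `broker_message_id` against the pre-generated id lets the broker tell "our insert" from "conflict" without relying on `RETURNING`, which yields no row on `ON CONFLICT DO NOTHING`.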
|
||||
|
||||
#### 4.7.3 Orphan check — covers full inventory now
|
||||
|
||||
The nightly `cm_broker_dedupe_orphan_check_total` job (v6 §4.7) is
|
||||
extended to verify all four in-transaction effects. For each
|
||||
`client_message_dedupe` row:
|
||||
- Either the corresponding `topic_message` / `message_queue` row exists,
|
||||
OR `history_available = FALSE` AND a deleted-tombstone is recorded.
|
||||
- AND a corresponding `message_history` row exists (or has been pruned
|
||||
per history retention).
|
||||
- AND zero outstanding `delivery_queue` rows older than fan-out timeout
|
||||
reference a `broker_message_id` whose dedupe row is missing.
|
||||
|
||||
Any inconsistency is logged as `cm_broker_atomicity_violation_found` for
human review. The count should be zero in steady state.
|
||||
|
||||
### 4.8 Outbox max-age math — strictly inside broker window (v7 — codex r6)
|
||||
|
||||
Codex r6: at v6's 3-day minimum, daemon max_age (72h) **equaled** broker
|
||||
window (72h). That isn't "inside the window."
|
||||
|
||||
v7 raises the floor and tightens the formula:
|
||||
|
||||
- **Minimum supported broker `dedupe_retention_days`**: **7** (was 3 in
|
||||
v6). Below this, daemon refuses to start with `4012
|
||||
feature_param_below_floor`.
|
||||
- **Daemon `max_age_hours` derivation** (`retention_scoped` mode):
|
||||
  ```
  safety_margin_hours = max(24, ceil(dedupe_retention_days * 0.1 * 24))
  max_age_hours       = (dedupe_retention_days * 24) - safety_margin_hours
  ```
|
||||
At minimum (7 days): `safety_margin = max(24, 17) = 24h`; `max_age =
|
||||
168 - 24 = 144h`. Daemon outbox ≤144h, broker window ≥168h, gap ≥24h.
|
||||
- **Daemon `max_age_hours` derivation** (`permanent` mode):
|
||||
  ```
  max_age_hours = config.outbox.max_age_hours_default (168h)
                  capped at config.outbox.max_age_hours_cap (720h)
  ```
|
||||
- **Operator override**: `[outbox] max_age_hours_override = N` accepted
iff `N <= dedupe_retention_days * 24 - 24`. Above that, the daemon
refuses to start with a clear-text `outbox_max_age_above_dedupe_window`
error.
|
||||
- The 72h floor from v6 is **dropped** because the new 7-day broker
|
||||
minimum already produces a 144h derived max-age — well above any
|
||||
realistic floor concern.
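
A short Python sketch of the v7 derivation and override gate. Function names and keyword defaults are illustrative; the defaults mirror the 168h and 720h values quoted above. Integer arithmetic is used for the 10% margin to avoid floating-point rounding in `ceil`.

```python
def v7_max_age_hours(mode, dedupe_retention_days=None,
                     default_hours=168, cap_hours=720):
    """Derive the daemon outbox max-age in hours for either dedupe mode."""
    if mode == "permanent":
        return min(default_hours, cap_hours)
    if mode != "retention_scoped":
        raise ValueError(f"unknown mode: {mode!r}")
    if dedupe_retention_days is None or dedupe_retention_days < 7:
        raise ValueError("dedupe_retention_days below the 7-day floor")
    window_hours = dedupe_retention_days * 24
    safety_margin_hours = max(24, -(-window_hours // 10))  # integer ceil of 10%
    return window_hours - safety_margin_hours


def v7_override_allowed(n_hours, dedupe_retention_days):
    """Override accepted iff N <= dedupe_retention_days * 24 - 24."""
    return n_hours <= dedupe_retention_days * 24 - 24


print(v7_max_age_hours("retention_scoped", 7))  # prints: 144
```

At the 7-day minimum the derived value (144h) sits 24h inside the 168h broker window, and the override gate enforces the same 24h gap.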
|
||||
|
||||
### 4.9 Inbox schema — unchanged from v3 §4.5
|
||||
|
||||
### 4.10 Crash recovery — unchanged from v3 §4.6
|
||||
|
||||
### 4.11 Failure modes — unchanged from v6 §4.12, with §4.5/§4.6 added
|
||||
|
||||
- **IPC accept fingerprint-mismatch on duplicate id**: returns 409 with
|
||||
`conflict` field per §4.5.1. Caller must rotate id.
|
||||
- **Outbox row stuck in `dead`**: operator runs `outbox requeue
|
||||
--new-client-id` per §4.6.3.
|
||||
- **Broker fingerprint mismatch on retry**: as v6 §4.5. Daemon marks
|
||||
`dead`, surfaces in `outbox --failed`.
|
||||
- **Daemon retry after dedupe row hard-deleted by broker retention
|
||||
sweep**: cannot happen unless operator overrode `max_age_hours`
|
||||
beyond the safety margin. In `permanent` mode cannot happen at all.
|
||||
- **Atomicity violation found by orphan check**: alerts ops; broker
|
||||
team investigates. Should be zero.
|
||||
|
||||
---
|
||||
|
||||
## 5. Inbound — unchanged from v3 §5
|
||||
|
||||
## 6. Hooks — unchanged from v4 §6
|
||||
|
||||
## 7-13. — unchanged from v4
|
||||
|
||||
## 14. Lifecycle — unchanged from v5 §14
|
||||
|
||||
---
|
||||
|
||||
## 15. Version compat — minimum dedupe_retention_days raised
|
||||
|
||||
### 15.1 Feature bits with parameters (v7 update)
|
||||
|
||||
Only one row changes from v6 §15.1:
|
||||
|
||||
| Bit | `params.version` | Required parameters | Optional parameters |
|
||||
|---|---|---|---|
|
||||
| `client_message_id_dedupe` | `2` | `mode: "retention_scoped"\|"permanent"`, `dedupe_retention_days: int (>= 7)` (when mode=retention_scoped), `request_fingerprint: bool == true` | `tombstone_history_pruned_window_days: int` |
|
||||
|
||||
`dedupe_retention_days` minimum raised from 3 to 7 to keep daemon
|
||||
outbox max-age strictly inside the broker window with margin (§4.8).
|
||||
|
||||
### 15.2 — 15.5 unchanged from v6 §15
|
||||
|
||||
(`feature_negotiation_request/response`, IPC negotiation, compat
|
||||
matrix, diagnostic close codes 4010 / 4011 / 4012.)
|
||||
|
||||
---
|
||||
|
||||
## 16. Threat model — unchanged from v4 §16
|
||||
|
||||
---
|
||||
|
||||
## 17. Migration — broker dedupe + atomicity + corrected pseudocode (v7)
|
||||
|
||||
Broker side, deploy order:
|
||||
|
||||
1. `CREATE TABLE mesh.client_message_dedupe` (v6 §4.3 schema, unchanged
|
||||
in v7).
|
||||
2. `ALTER TABLE mesh.topic_message ADD COLUMN client_message_id`.
|
||||
3. `ALTER TABLE mesh.message_queue ADD COLUMN client_message_id`.
|
||||
4. Broker code refactor: every accept path runs the v7 §4.7.2 corrected
|
||||
pseudocode in **one transaction** with the side-effect inventory
|
||||
from §4.7.1 — dedupe row, message row, history row, delivery_queue
|
||||
rows all in-tx.
|
||||
5. Broker code: existing fan-out workers consume `delivery_queue` rows
|
||||
committed by the accept transaction.
|
||||
6. Broker code: nightly retention sweep + `history_available` flip on
|
||||
message-row pruning (unchanged from v6 §17 step 5+6).
|
||||
7. Broker code: extended orphan-check job (v7 §4.7.3) — alerts on
|
||||
atomicity violations across full inventory.
|
||||
8. Broker advertises `client_message_id_dedupe` feature with
|
||||
`params.version = 2`, `request_fingerprint: true`,
|
||||
`dedupe_retention_days >= 7` (was 3).
|
||||
9. Daemon refuses to start unless above is advertised.
|
||||
|
||||
Daemon side:
|
||||
- Outbox table gains `aborted` status (§4.6.3). At startup the migration
  alters the CHECK constraint in place if the installed SQLite version's
  DDL supports that without a recreate; otherwise it recreates the table
  via `INSERT INTO new SELECT * FROM old`. v0.9.0 daemons are fresh
  installs by definition, so there are no existing outboxes to migrate.
|
||||
- IPC accept path implements §4.5.1 lookup table.
|
||||
- IPC error envelope adds `conflict` and `daemon_fingerprint_prefix`
|
||||
fields for 409 responses.
|
||||
- New CLI verb `claudemesh daemon outbox requeue --id <id>
|
||||
--new-client-id [auto|<id>]` (§4.6.3).
|
||||
|
||||
---
|
||||
|
||||
## What changed v6 → v7 (codex round-6 actionable items)
|
||||
|
||||
| Codex r6 item | v7 fix | Section |
|
||||
|---|---|---|
|
||||
| Daemon-local duplicate POST semantics undefined | Full lookup table for pending/inflight/done/dead × match/mismatch; `409 idempotency_key_reused` at IPC layer with `conflict` field | §4.5 |
|
||||
| §4.6 rejected-request contradiction | Single rule: id consumed iff outbox row written; pre-outbox failures leave id untouched; broker-rejected outbox row goes to `dead`, requires `requeue --new-client-id` | §4.6 |
|
||||
| §4.7 pseudocode wrong | Corrected: `INSERT ON CONFLICT DO NOTHING`, then `SELECT FOR SHARE`, then branch on returned `broker_message_id` and `fingerprint` | §4.7.2 |
|
||||
| Max-age math equals window at min | Min `dedupe_retention_days` raised to 7; safety margin always >= 24h; derived max-age strictly < window | §4.8, §15.1 |
|
||||
| Broker atomicity scope incomplete | Side-effect inventory: dedupe + message + history + delivery_queue all in-tx; WS push and webhook fan-out explicitly outside-tx; orphan check extended | §4.7.1, §4.7.3 |
|
||||
| New `aborted` outbox status | Distinguishes operator-retired rows from dead rows | §4.6.3 |
|
||||
|
||||
---
|
||||
|
||||
## What needs review (round 7)
|
||||
|
||||
1. **IPC lookup table (§4.5.1)** — does it cover all the realistic
|
||||
client races? The "inflight + match" return is `202 accepted,
|
||||
inflight` — should it be `200 ok` with the broker response if the
|
||||
broker has already responded? Or does the daemon prefer to respond
|
||||
from local state always?
|
||||
2. **Aborted vs dead vs done (§4.6.3)** — is the three-state terminal
|
||||
distinction useful, or noisy? Would `dead` + an `aborted_at`
|
||||
timestamp suffice?
|
||||
3. **§4.7.2 transaction shape** — `SELECT FOR SHARE` after `INSERT ON
|
||||
CONFLICT DO NOTHING` is two round-trips. Could it be one with
|
||||
`INSERT ... ON CONFLICT DO UPDATE SET ... RETURNING xmax = 0` or
|
||||
similar Postgres-specific trick? Worth optimizing here?
|
||||
4. **Max-age formula at higher windows** — at 365 days,
|
||||
`safety_margin = ceil(0.1 * 365 * 24) = 876h ≈ 36.5 days`. Daemon
|
||||
max-age = `8760 - 876 = 7884h ≈ 328 days`. Is that the right shape,
|
||||
or should the safety margin be capped (e.g. `min(72, ceil(0.1 * w))`)?
|
||||
5. **Side-effect inventory (§4.7.1)** — anything missing? E.g. broker-
|
||||
side rate-limit counters, audit-log entries, mention-fanout-search?
|
||||
6. **Anything else still wrong?** Read it as if you were going to
|
||||
operate this for a year. What falls down?
|
||||
|
||||
Three options:
|
||||
- **(a) v7 is shippable**: lock the spec, start coding the frozen core.
|
||||
- **(b) v8 needed**: list the must-fix items.
|
||||
- **(c) the architecture itself is wrong**: what would you do differently?
|
||||
|
||||
Be ruthless.
|
||||
401
.artifacts/shipped/2026-05-03-daemon-final-spec-v8.md
Normal file
@@ -0,0 +1,401 @@
|
||||
# `claudemesh daemon` — Final Spec v8
|
||||
|
||||
> **Round 8.** v7 was reviewed by codex (round 7) which found four
|
||||
> remaining correctness problems, one of them new in v7:
|
||||
>
|
||||
> 1. **`aborted` semantics not in §4.5.1** and contradiction with `UNIQUE`
|
||||
> constraint — v7 said the old id "becomes free again at the daemon
|
||||
> layer," but `client_message_id TEXT NOT NULL UNIQUE` makes that
|
||||
> impossible without DELETE.
|
||||
> 2. **Broker permanent-rejection ordering underspec** — v7 didn't state
|
||||
> when (relative to dedupe insertion) permanent 4xx fires.
|
||||
> 3. **SQLite `SELECT FOR UPDATE`** — SQLite doesn't support it; needs
|
||||
> `BEGIN IMMEDIATE` for daemon-local serialization.
|
||||
> 4. **Side-effect inventory still ambiguous** — rate-limit counters,
|
||||
> audit logs, mention/search indexes need explicit
|
||||
> in-tx/non-authoritative classification.
|
||||
>
|
||||
> v8 fixes all four. **Intent §0 unchanged from v2.** v8 only revises §4
|
||||
> (delivery contract).
|
||||
|
||||
---
|
||||
|
||||
## 0. Intent — unchanged, see v2 §0
|
||||
|
||||
## 1. Process model — unchanged
|
||||
|
||||
## 2. Identity — unchanged from v5 §2
|
||||
|
||||
## 3. IPC surface — unchanged from v4 §3
|
||||
|
||||
---
|
||||
|
||||
## 4. Delivery contract — `aborted` clarified, broker phasing, SQLite locking
|
||||
|
||||
### 4.1 The contract (precise — v8)
|
||||
|
||||
> **Local guarantee**: each successful `POST /v1/send` returns a stable
|
||||
> `client_message_id`. The send is durably persisted to `outbox.db` before
|
||||
> the response returns. The daemon enforces request-fingerprint
|
||||
> idempotency at the IPC layer: a duplicate `POST` with the same
|
||||
> `client_message_id` returns `409 idempotency_key_reused` if the
|
||||
> fingerprint mismatches, regardless of outbox row state.
|
||||
>
|
||||
> **Local audit guarantee (NEW v8)**: a `client_message_id` once written
|
||||
> to `outbox.db` is **never released**. Operator recovery via
|
||||
> `requeue --new-client-id` always mints a fresh id; the old row stays
|
||||
> in `aborted` for audit. There is no daemon-side path to free a used
|
||||
> id.
|
||||
>
|
||||
> **Broker guarantee**: same as v7 §4.1. Dedupe row exists iff the
|
||||
> broker reached the post-validation accept phase (§4.7.1).
|
||||
>
|
||||
> **Atomicity guarantee**: same as v7 §4.1.
|
||||
>
|
||||
> **End-to-end guarantee**: at-least-once.
|
||||
|
||||
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
|
||||
|
||||
### 4.3 Broker schema — unchanged from v6 §4.3
|
||||
|
||||
### 4.4 Request fingerprint canonical form — unchanged from v6 §4.4
|
||||
|
||||
### 4.5 Daemon-local idempotency at the IPC layer (v8 — `aborted` added, SQLite locking)
|
||||
|
||||
#### 4.5.1 IPC accept algorithm (v8)
|
||||
|
||||
On `POST /v1/send`:
|
||||
|
||||
1. Validate request envelope (auth, schema, size limits, destination
|
||||
resolvable). Failures here return `4xx` immediately. **No outbox row
|
||||
is written; the `client_message_id` is not consumed.**
|
||||
2. Compute `request_fingerprint` (§4.4).
|
||||
3. Open a SQLite transaction with `BEGIN IMMEDIATE` (v8 — codex r7) so
|
||||
a concurrent IPC accept on the same id serializes against this one.
|
||||
   `BEGIN IMMEDIATE` takes the RESERVED (write) lock at transaction start,
   so no other connection can begin a write transaction on the same
   database until this one ends; readers are not blocked. SQLite has no
   row-level locks and does not support `SELECT FOR UPDATE`.
|
||||
4. `SELECT id, request_fingerprint, status, broker_message_id,
|
||||
last_error FROM outbox WHERE client_message_id = ?`.
|
||||
5. Apply the lookup table below. For the "(no row)" case, INSERT the
|
||||
new row inside the same transaction.
|
||||
6. COMMIT.
|
||||
|
||||
| Existing row state | Fingerprint match? | Daemon response |
|---|---|---|
| (no row) | — | INSERT new outbox row in `pending`; return `202 accepted, queued` |
| `pending` | match | Return `202 accepted, queued`. No mutation |
| `pending` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_pending_fingerprint_mismatch"`. No mutation |
| `inflight` | match | Return `202 accepted, inflight`. No mutation |
| `inflight` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_inflight_fingerprint_mismatch"` |
| `done` | match | Return `200 ok, duplicate: true, broker_message_id, history_id`. No broker call |
| `done` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_done_fingerprint_mismatch", broker_message_id` |
| `dead` | match | Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_match", reason: "<last_error>"`. Same id never auto-retried |
| `dead` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_mismatch"` |
| **`aborted`** (NEW v8) | **match** | Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_match"`. The id was retired by operator action; never reusable |
| **`aborted`** (NEW v8) | **mismatch** | Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_mismatch"` |
|
||||
|
||||
**Rule (v8 — codex r7)**: every IPC `409` carries the daemon's
|
||||
`request_fingerprint` (8-byte hex prefix) so callers can debug
|
||||
client/server canonical-form drift. **Every state in the table returns
|
||||
something deterministic, including `aborted`.** A `client_message_id`
|
||||
written to `outbox.db` is permanently bound to that row's lifecycle —
|
||||
the only "free" state is "no row exists".
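
The lookup table can be read as a pure function of the existing row's status and the fingerprint comparison. The sketch below is illustrative only: the function names and the response dictionary shape are assumptions, and "8-byte hex prefix" is interpreted here as the first 8 bytes of the fingerprint, hex-encoded. The `daemon_fingerprint_prefix` field name is taken from the v7 migration notes.

```python
from typing import Dict, Optional, Tuple


def fingerprint_prefix(fingerprint: bytes) -> str:
    # Assumption: "8-byte hex prefix" means the first 8 bytes, hex-encoded.
    return fingerprint[:8].hex()


def ipc_decision(status: Optional[str],
                 fingerprint_matches: bool,
                 stored_fingerprint: bytes = b"") -> Tuple[int, Dict[str, object]]:
    """Map (row state, fingerprint match) to an IPC response per the table."""
    if status is None:
        # No row: caller inserts a new `pending` row in the same transaction.
        return 202, {"result": "queued", "insert": True}

    def conflict(kind: str) -> Tuple[int, Dict[str, object]]:
        return 409, {
            "error": "idempotency_key_reused",
            "conflict": f"outbox_{status}_fingerprint_{kind}",
            "daemon_fingerprint_prefix": fingerprint_prefix(stored_fingerprint),
        }

    if not fingerprint_matches:
        return conflict("mismatch")          # every state: 409 on mismatch
    if status == "pending":
        return 202, {"result": "queued"}
    if status == "inflight":
        return 202, {"result": "inflight"}
    if status == "done":
        return 200, {"duplicate": True}
    if status in ("dead", "aborted"):
        return conflict("match")             # terminal ids are never reused
    raise ValueError(f"unknown outbox status: {status}")
```

Every combination returns a deterministic response, and the only branch that writes is the "no row" case.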
|
||||
|
||||
#### 4.5.2 Outbox table — fingerprint required
|
||||
|
||||
```sql
CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,          -- 32 bytes
  payload             BLOB NOT NULL,
  enqueued_at         INTEGER NOT NULL,
  attempts            INTEGER DEFAULT 0,
  next_attempt_at     INTEGER NOT NULL,
  status              TEXT CHECK(status IN
                        ('pending','inflight','done','dead','aborted')),
  last_error          TEXT,
  delivered_at        INTEGER,
  broker_message_id   TEXT,
  aborted_at          INTEGER,                -- NEW v8
  aborted_by          TEXT,                   -- NEW v8: operator/auto
  superseded_by       TEXT                    -- NEW v8: id of the requeue successor row, if any
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
CREATE INDEX outbox_aborted ON outbox(status, aborted_at) WHERE status = 'aborted';
```
|
||||
|
||||
`aborted_at`, `aborted_by`, `superseded_by` give operators a clear
|
||||
audit trail. `superseded_by` lets `outbox inspect` show the chain when
|
||||
a row was requeued multiple times.
|
||||
|
||||
`request_fingerprint` is computed once at IPC accept time and frozen
|
||||
forever for the row's lifecycle. Daemon never recomputes from
|
||||
`payload`.
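
A minimal sketch of the §4.5.1 accept path against a reduced version of this table, using Python's standard `sqlite3` module. Several things here are assumptions made for illustration: the fingerprint is a plain SHA-256 of the payload (the canonical form in §4.4 is not reproduced), row ids are `uuid4` hex strings, and the function returns a small tuple that a caller would feed into the lookup table.

```python
import hashlib
import sqlite3
import time
import uuid
from typing import Tuple

SCHEMA = """
CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  payload             BLOB NOT NULL,
  enqueued_at         INTEGER NOT NULL,
  next_attempt_at     INTEGER NOT NULL,
  status              TEXT CHECK(status IN
                        ('pending','inflight','done','dead','aborted'))
);
"""


def open_outbox(path: str = ":memory:") -> sqlite3.Connection:
    # isolation_level=None: we issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.executescript(SCHEMA)
    return conn


def accept_send(conn: sqlite3.Connection,
                client_message_id: str,
                payload: bytes) -> Tuple[str, str]:
    fingerprint = hashlib.sha256(payload).digest()  # stand-in for §4.4
    conn.execute("BEGIN IMMEDIATE")                 # serialize writers up front
    try:
        row = conn.execute(
            "SELECT request_fingerprint, status FROM outbox "
            "WHERE client_message_id = ?", (client_message_id,)).fetchone()
        if row is None:
            now_ms = int(time.time() * 1000)
            conn.execute(
                "INSERT INTO outbox (id, client_message_id, request_fingerprint,"
                " payload, enqueued_at, next_attempt_at, status)"
                " VALUES (?, ?, ?, ?, ?, ?, 'pending')",
                (uuid.uuid4().hex, client_message_id, fingerprint,
                 payload, now_ms, now_ms))
            result = ("inserted", "pending")
        elif bytes(row[0]) == fingerprint:
            result = ("match", row[1])
        else:
            result = ("mismatch", row[1])
        conn.execute("COMMIT")
        return result
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise
```

The stored fingerprint is compared as-is and never recomputed from the stored payload, which mirrors the "frozen for the row's lifecycle" rule.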
|
||||
|
||||
### 4.6 Rejected-request semantics — phasing made explicit (v8 — codex r7)
|
||||
|
||||
> **Single rule, phased**: a `client_message_id` is consumed iff a
|
||||
> dedupe row exists. The dedupe row is the durable evidence that a
|
||||
> request reached the post-validation accept phase. Pre-validation
|
||||
> failures consume nothing — caller may freely retry the same id with
|
||||
> a fixed payload.
|
||||
|
||||
#### 4.6.1 Daemon-side rejection phasing
|
||||
|
||||
| Phase | When daemon rejects | Outbox row? | Caller may reuse id? |
|
||||
|---|---|---|---|
|
||||
| **A. IPC validation** (auth, schema, size, destination resolvable) | Before §4.5.1 step 3 | No | Yes — id never consumed |
|
||||
| **B. Outbox stored, broker network/transient failure** | After IPC accept, broker `5xx` or timeout | `pending` → retried | N/A — daemon owns retries |
|
||||
| **C. Outbox stored, broker permanent rejection** | Broker returns `4xx` after IPC accept | `dead` | No — rotate via `requeue --new-client-id` |
|
||||
| **D. Operator retirement** | Operator runs `requeue --new-client-id` on `dead` or `pending` row | `aborted` (audit) + new row with fresh id | Old id NEVER reusable; new id is fresh |
|
||||
|
||||
#### 4.6.2 Broker-side rejection phasing (NEW v8 — codex r7)
|
||||
|
||||
The broker validates in two phases relative to dedupe-row insertion:
|
||||
|
||||
| Phase | Validation | Result |
|
||||
|---|---|---|
|
||||
| **B1. Pre-dedupe-claim** (NEW — explicit) | Auth (mesh membership), schema, size, mesh exists, member exists, destination kind valid, payload bytes ≤ `max_payload.inline_bytes` | `4xx` returned. **No dedupe row inserted.** Caller may retry with same id and corrected payload. |
|
||||
| **B2. Post-dedupe-claim** | Anything that requires the dedupe-claim transaction to be in progress: destination_ref existence (topic exists, member subscribed, etc.), per-mesh rate limit not exceeded | `4xx` returned, transaction rolled back, **no dedupe row remains**. Caller may retry with same id. |
|
||||
| **B3. Accepted** | All side effects (dedupe row, message row, history row, delivery_queue rows) commit atomically | `201` returned with `broker_message_id` |
|
||||
|
||||
**Critical guarantee (v8)**: there is no broker code path where a
|
||||
permanent rejection (4xx) leaves a dedupe row behind. Either the
|
||||
request committed and a dedupe row exists (B3), or it didn't and no
|
||||
dedupe row exists (B1, B2). This makes "dedupe row exists" the single
|
||||
unambiguous signal of "id consumed at the broker layer."
|
||||
|
||||
If broker decides post-commit that an accepted message is invalid
|
||||
(e.g. an async content-policy job runs on accepted messages), that's
|
||||
NOT a permanent rejection — that's a follow-up moderation event that
|
||||
operates on the broker_message_id, not on the dedupe key.
|
||||
|
||||
#### 4.6.3 Operator recovery via `requeue` (corrected v8)
|
||||
|
||||
To unstick a `dead` or `pending`-but-stuck row, operator runs:
|
||||
|
||||
```
claudemesh daemon outbox requeue --id <outbox_row_id>
    [--new-client-id <id> | --auto]
    [--patch-payload <path>]
```
|
||||
|
||||
This atomically (single SQLite transaction):
|
||||
|
||||
1. Sets the existing row's status to `aborted`, with `aborted_at = now`
   and `aborted_by = "operator"`. The row is **never deleted**, so the
   audit trail is permanent.
|
||||
2. Mints a fresh `client_message_id` (caller-supplied via `--new-client-id`
|
||||
or auto-ulid'd via `--auto`).
|
||||
3. Inserts a new outbox row in `pending` with the fresh id and the same
|
||||
payload (or patched payload if `--patch-payload` was given).
|
||||
4. Sets `superseded_by = <new_row_id>` on the old row so
|
||||
`outbox inspect <old_id>` displays the chain.
|
||||
|
||||
**The old `client_message_id` is permanently dead** — `outbox.db` still
|
||||
holds it via the `aborted` row's `UNIQUE` constraint, and any caller
|
||||
re-using it gets `409 outbox_aborted_*` per §4.5.1.
|
||||
|
||||
If the broker ever accepted the old id (it reached B3), the broker's
dedupe row is also permanent: a duplicate send to the broker with the
old id returns `409` on a fingerprint mismatch, or the original
`broker_message_id` on a fingerprint match. The daemon-side `aborted`
row and the broker-side dedupe row are independent records of "this id
was used"; neither releases the id.
|
||||
|
||||
This is the resolution to v7's contradiction: there is **no path** for
|
||||
an id to "become free again." If the operator wants to retry the
|
||||
payload, they get a new id. The old id stays buried.
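
The requeue steps can be sketched as one SQLite transaction. This is an illustration under assumptions rather than the CLI's actual code: the function name `requeue` is hypothetical, `uuid4` stands in for the auto-generated ULID, the successor row's fingerprint is a plain SHA-256 of the payload, and the table is reduced to the columns the steps touch.

```python
import hashlib
import sqlite3
import time
import uuid
from typing import Optional

SCHEMA = """
CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  payload             BLOB NOT NULL,
  status              TEXT CHECK(status IN
                        ('pending','inflight','done','dead','aborted')),
  aborted_at          INTEGER,
  aborted_by          TEXT,
  superseded_by       TEXT
);
"""


def requeue(conn: sqlite3.Connection,
            outbox_row_id: str,
            new_client_id: Optional[str] = None,
            patched_payload: Optional[bytes] = None) -> str:
    """Retire the old row as `aborted` and insert a successor; returns its id."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT payload, status FROM outbox WHERE id = ?",
            (outbox_row_id,)).fetchone()
        if row is None or row[1] not in ("dead", "pending"):
            raise ValueError("requeue applies only to dead or pending rows")
        payload = patched_payload if patched_payload is not None else bytes(row[0])
        new_row_id = uuid.uuid4().hex
        fresh_client_id = new_client_id or uuid.uuid4().hex  # stand-in for a ULID
        # Steps 1 and 4: retire the old row and link it to its successor.
        conn.execute(
            "UPDATE outbox SET status = 'aborted', aborted_at = ?,"
            " aborted_by = 'operator', superseded_by = ? WHERE id = ?",
            (int(time.time() * 1000), new_row_id, outbox_row_id))
        # Steps 2 and 3: new row, fresh id, same or patched payload.
        conn.execute(
            "INSERT INTO outbox (id, client_message_id, request_fingerprint,"
            " payload, status) VALUES (?, ?, ?, ?, 'pending')",
            (new_row_id, fresh_client_id,
             hashlib.sha256(payload).digest(), payload))
        conn.execute("COMMIT")
        return new_row_id
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise
```

Because the old row keeps its `client_message_id`, supplying a previously used id as the new one fails the `UNIQUE` constraint and the whole transaction rolls back, leaving the row being requeued untouched.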
|
||||
|
||||
### 4.7 Broker atomicity contract — side-effect classification (v8 — codex r7)
|
||||
|
||||
#### 4.7.1 Side effects (v8 — explicit classification)
|
||||
|
||||
Every successful broker accept atomically commits these durable
|
||||
state changes in **one transaction**:
|
||||
|
||||
| Effect | Table | In-tx? | Why |
|---|---|---|---|
| Dedupe record | `mesh.client_message_dedupe` | **Yes** | Idempotency authority |
| Message body | `mesh.topic_message` / `mesh.message_queue` | **Yes** | Authoritative store |
| History row | `mesh.message_history` | **Yes** | Replay log; lost-on-rollback would break ordered replay |
| Fan-out work | `mesh.delivery_queue` | **Yes** | Each recipient must see exactly the messages that committed |
| Mention index entries | `mesh.mention_index` | **Yes** | Mention-query reads must match committed messages |
|
||||
|
||||
**Outside the transaction** — non-authoritative or rebuildable, with
|
||||
explicit rationale per item:
|
||||
|
||||
| Effect | Where | Why outside |
|---|---|---|
| WS push to live subscribers | Async after COMMIT | Live notifications are best-effort; receivers re-fetch from history on reconnect |
| Webhook fan-out | Async via `delivery_queue` workers | Off-band; consumes committed `delivery_queue` rows |
| Rate-limit counters | Async, eventually consistent | Counters are an estimate; over-counting on retry is preferable to under-counting |
| Audit log entries | Async append-only stream | Audit log can be rebuilt from message history; in-tx writes hurt p99 |
| Search/FTS index updates | Async via outbox-pattern worker | Index can be rebuilt from authoritative tables |
| Metrics | Prometheus, pull-based | Always non-authoritative |
|
||||
|
||||
If any in-transaction insert fails, the transaction rolls back
completely. The broker returns `5xx` to the daemon, which retries. No
partial state remains.
|
||||
|
||||
The async side effects are driven off the in-transaction
|
||||
`delivery_queue` and `message_history` rows, so they cannot get ahead
|
||||
of committed state — only lag behind.
|
||||
|
||||
#### 4.7.2 Pseudocode — corrected and final (v8)
|
||||
|
||||
```sql
BEGIN;

-- Phase B1 already passed (see §4.6.2).

-- Phase B2 + B3: try to claim the idempotency key.
INSERT INTO mesh.client_message_dedupe
  (mesh_id, client_message_id, broker_message_id, request_fingerprint,
   destination_kind, destination_ref, expires_at)
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
        $dest_kind, $dest_ref, $expires_at)
ON CONFLICT (mesh_id, client_message_id) DO NOTHING;

-- Inspect the row that's actually there now (ours or someone else's).
SELECT broker_message_id, request_fingerprint, destination_kind,
       destination_ref, history_available, first_seen_at
FROM mesh.client_message_dedupe
WHERE mesh_id = $mesh_id AND client_message_id = $client_id
FOR SHARE;

-- Branch:
--   row.broker_message_id == $msg_id → first insert; continue to step 3.
--   row.broker_message_id != $msg_id → duplicate. Compare fingerprints:
--     fingerprint match    → ROLLBACK; return 200 duplicate.
--     fingerprint mismatch → ROLLBACK; return 409 idempotency_key_reused.

-- Step 3: validate Phase B2 (subscribers exist, rate limit not exceeded, etc.)
--   If B2 fails → ROLLBACK; return 4xx (no dedupe row remains).

-- Step 4: insert all in-tx side effects (§4.7.1).
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
VALUES ($msg_id, $mesh_id, $client_id, ...);

INSERT INTO mesh.message_history (broker_message_id, mesh_id, ...)
VALUES ($msg_id, $mesh_id, ...);

INSERT INTO mesh.delivery_queue (broker_message_id, recipient_pubkey, ...)
SELECT $msg_id, member_pubkey, ...
FROM mesh.topic_subscription
WHERE topic = $dest_ref AND mesh_id = $mesh_id;

-- unnest() needs a column alias so mention_pubkey resolves.
INSERT INTO mesh.mention_index (broker_message_id, mentioned_pubkey, ...)
SELECT $msg_id, mention_pubkey, ...
FROM unnest($mention_list) AS m(mention_pubkey);

COMMIT;

-- After COMMIT, async workers consume delivery_queue and update
-- search indexes, audit logs, rate-limit counters, etc.
```
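
The claim-then-inspect shape of this pseudocode can be exercised locally. The sketch below uses SQLite (via Python's `sqlite3`) purely as a stand-in for Postgres, so `FOR SHARE` is dropped and `BEGIN IMMEDIATE` provides the serialization instead. The function name, the reduced two-table schema, the `b2_ok` flag, and the outcome strings are all hypothetical. `ON CONFLICT ... DO NOTHING` needs SQLite 3.24 or newer.

```python
import sqlite3
from typing import Optional, Tuple

SCHEMA = """
CREATE TABLE client_message_dedupe (
  mesh_id             TEXT NOT NULL,
  client_message_id   TEXT NOT NULL,
  broker_message_id   TEXT NOT NULL,
  request_fingerprint BLOB NOT NULL,
  PRIMARY KEY (mesh_id, client_message_id)
);
CREATE TABLE topic_message (
  id                TEXT PRIMARY KEY,
  mesh_id           TEXT NOT NULL,
  client_message_id TEXT NOT NULL,
  body              BLOB NOT NULL
);
"""


def broker_accept(conn: sqlite3.Connection, mesh_id: str, client_id: str,
                  msg_id: str, fingerprint: bytes, body: bytes,
                  b2_ok: bool = True) -> Tuple[str, Optional[str]]:
    conn.execute("BEGIN IMMEDIATE")
    try:
        # Try to claim the idempotency key; a no-op if it is already taken.
        conn.execute(
            "INSERT INTO client_message_dedupe (mesh_id, client_message_id,"
            " broker_message_id, request_fingerprint) VALUES (?, ?, ?, ?)"
            " ON CONFLICT (mesh_id, client_message_id) DO NOTHING",
            (mesh_id, client_id, msg_id, fingerprint))
        existing_id, existing_fp = conn.execute(
            "SELECT broker_message_id, request_fingerprint"
            " FROM client_message_dedupe"
            " WHERE mesh_id = ? AND client_message_id = ?",
            (mesh_id, client_id)).fetchone()
        if existing_id != msg_id:
            # Someone else's row: duplicate, decided by fingerprint.
            conn.execute("ROLLBACK")
            if bytes(existing_fp) == fingerprint:
                return ("duplicate", existing_id)
            return ("idempotency_key_reused", existing_id)
        if not b2_ok:
            # Phase B2 rejection: rollback unwinds the dedupe claim.
            conn.execute("ROLLBACK")
            return ("b2_rejected", None)
        conn.execute(
            "INSERT INTO topic_message (id, mesh_id, client_message_id, body)"
            " VALUES (?, ?, ?, ?)", (msg_id, mesh_id, client_id, body))
        conn.execute("COMMIT")
        return ("accepted", msg_id)
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise
```

A B2 rejection followed by a retry with the same id succeeds in this sketch, which is the "no dedupe row remains" property that §4.6.2 relies on.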
|
||||
|
||||
#### 4.7.3 Orphan check — same as v7 §4.7.3
|
||||
|
||||
Extended across the side-effect inventory to verify that the in-tx items are mutually consistent.
|
||||
|
||||
### 4.8 Outbox max-age math — unchanged from v7 §4.8
|
||||
|
||||
Min `dedupe_retention_days = 7`; derived `max_age_hours = window -
|
||||
safety_margin` strictly < window; safety_margin floor 24h.
|
||||
|
||||
### 4.9 Inbox schema — unchanged from v3 §4.5
|
||||
|
||||
### 4.10 Crash recovery — unchanged from v3 §4.6
|
||||
|
||||
### 4.11 Failure modes — `aborted` semantics added (v8)
|
||||
|
||||
- **IPC accept fingerprint-mismatch on duplicate id** (any state):
|
||||
returns 409 with `conflict` field per §4.5.1. Caller must use a new id.
|
||||
- **IPC accept against `aborted` row, fingerprint match**: returns 409
|
||||
per §4.5.1 (NEW v8). Caller must use a new id; the old id is
|
||||
permanently retired.
|
||||
- **Outbox row stuck in `dead`**: operator runs `outbox requeue` per
|
||||
§4.6.3; old id stays in `aborted`, new id is fresh.
|
||||
- **Broker fingerprint mismatch on retry**: as v6/v7. Daemon marks
|
||||
`dead`; operator requeue path.
|
||||
- **Daemon retry after dedupe row hard-deleted by broker retention
|
||||
sweep**: cannot happen unless operator overrode `max_age_hours`.
|
||||
- **Broker phase B2 rejection on retry**: same id, same fingerprint,
|
||||
but B2 condition has changed (e.g. mesh rate-limit now exceeded).
|
||||
Daemon receives 4xx → marks `dead`. Operator can `requeue` once
|
||||
conditions clear.
|
||||
- **Atomicity violation found by orphan check**: alerts ops.
|
||||
|
||||
---
|
||||
|
||||
## 5-13. — unchanged from v4
|
||||
|
||||
## 14. Lifecycle — unchanged from v5 §14
|
||||
|
||||
## 15. Version compat — unchanged from v7 §15
|
||||
|
||||
## 16. Threat model — unchanged
|
||||
|
||||
---
|
||||
|
||||
## 17. Migration — v8 outbox columns + broker phase B2 (v8)
|
||||
|
||||
Broker side, deploy order: same as v7 §17, with one addition:
|
||||
- Step 4.5: explicitly split broker accept into Phase B1 (pre-dedupe
|
||||
validation, returns 4xx without writing) and Phase B2/B3 (within the
|
||||
accept transaction). Implementation: refactor handler to validate
|
||||
Phase B1 conditions before opening the DB transaction.
|
||||
|
||||
Daemon side:
|
||||
- Outbox schema gains `aborted_at`, `aborted_by`, `superseded_by`
|
||||
columns and the `aborted` enum value (§4.5.2). Migration applies via
|
||||
`INSERT INTO new SELECT * FROM old` recreation if needed; v0.9.0 is
|
||||
greenfield.
|
||||
- IPC accept switches to `BEGIN IMMEDIATE` for SQLite serialization
|
||||
(§4.5.1 step 3).
|
||||
- IPC accept handles `aborted` rows per §4.5.1 (always 409).
|
||||
- `claudemesh daemon outbox requeue` always mints a fresh
|
||||
`client_message_id`; never frees the old id. `--new-client-id <id>`
|
||||
and `--auto` are the only modes; the old `client_message_id`
|
||||
argument is removed.
|
||||
|
||||
---
|
||||
|
||||
## What changed v7 → v8 (codex round-7 actionable items)
|
||||
|
||||
| Codex r7 item | v8 fix | Section |
|
||||
|---|---|---|
|
||||
| `aborted` not in §4.5.1; `UNIQUE` contradiction | Added two `aborted` rows (match/mismatch) to lookup table; old id never reusable; new audit columns `aborted_at`/`aborted_by`/`superseded_by` | §4.5.1, §4.5.2, §4.6.3 |
|
||||
| Broker permanent-rejection ordering vague | Three-phase model B1 (pre-dedupe), B2 (post-claim, in-tx), B3 (accepted); permanent 4xx never leaves dedupe row | §4.6.2 |
|
||||
| SQLite `SELECT FOR UPDATE` invalid | Replaced with `BEGIN IMMEDIATE` for daemon-local serialization | §4.5.1 |
|
||||
| Side-effect inventory ambiguous on rate-limit/audit/search | Explicit in-tx vs outside-tx table with rationale per item | §4.7.1 |
|
||||
| Operator id reuse semantics | Old id permanently retired in `aborted`; requeue always mints fresh id; no daemon-side path to release used ids | §4.6.3 |
|
||||
|
||||
---
|
||||
|
||||
## What needs review (round 8)
|
||||
|
||||
1. **`aborted` permanence (§4.5.1, §4.6.3)** — is "old id permanently
|
||||
dead" correct, or is there a real operational case where releasing
|
||||
an id (e.g. caller mistyped a uuid) is worth the audit-trail loss?
|
||||
2. **Phase B1/B2/B3 split (§4.6.2)** — clean enough? Is rate-limiting
|
||||
in B2 (in-tx) the right call, or should it be B1 (cheaper to enforce
|
||||
pre-tx)?
|
||||
3. **In-tx mention_index (§4.7.1)** — agree it should be in-tx, or
|
||||
should mention indexing be async like search?
|
||||
4. **`BEGIN IMMEDIATE` (§4.5.1)** — correct SQLite primitive, or should
|
||||
it be `BEGIN EXCLUSIVE` to also block readers? (Probably not — readers
|
||||
should see committed-pending rows, but worth confirming.)
|
||||
5. **Anything else still wrong?** Read it as if you were going to
|
||||
operate this for a year.
|
||||
|
||||
Three options:
|
||||
- **(a) v8 is shippable**: lock the spec, start coding the frozen core.
|
||||
- **(b) v9 needed**: list the must-fix items.
|
||||
- **(c) the architecture itself is wrong**: what would you do differently?
|
||||
|
||||
Be ruthless.
|
||||
473
.artifacts/shipped/2026-05-03-daemon-final-spec-v9.md
Normal file
@@ -0,0 +1,473 @@
|
||||
# `claudemesh daemon` — Final Spec v9
|
||||
|
||||
> **Round 9.** v8 was reviewed by codex (round 8) which closed
|
||||
> aborted/UNIQUE (5/5) and SQLite locking (5/5) cleanly, but flagged
|
||||
> three spec-level correctness problems:
|
||||
>
|
||||
> 1. **Cross-layer ID-consumed authority contradiction** — v8 §4.1
|
||||
> said "id consumed iff dedupe row exists" while §4.6.1 says a
|
||||
> daemon-rejected id stays consumed locally with no broker dedupe
|
||||
> row. Two incompatible authorities.
|
||||
> 2. **Rate-limit authority muddled** — v8 listed rate limit in B2
|
||||
> (in-tx authoritative) but classified rate-limit counters as
|
||||
> async/non-authoritative in §4.7.1.
|
||||
> 3. **§4.1 broker guarantee wording** — "post-validation accept
|
||||
> phase" was fuzzy because B2 rolls back. Tighten to "accept
|
||||
> committed."
|
||||
>
|
||||
> v9 fixes all three with **two-layer ID rules** (daemon vs broker),
|
||||
> rate-limit moved to B1 via an external atomic limiter, and §4.1
|
||||
> tightened. **Intent §0 unchanged from v2.** v9 only revises §4.
|
||||
|
||||
---
|
||||
|
||||
## 0. Intent — unchanged, see v2 §0
|
||||
|
||||
## 1. Process model — unchanged
|
||||
|
||||
## 2. Identity — unchanged from v5 §2
|
||||
|
||||
## 3. IPC surface — unchanged from v4 §3
|
||||
|
||||
---
|
||||
|
||||
## 4. Delivery contract — `aborted` clarified, broker phasing, SQLite locking
|
||||
|
||||
### 4.1 The contract (precise — v9, two-layer ID model)
|
||||
|
||||
> **Two-layer ID rules** (NEW v9 — codex r8):
|
||||
>
|
||||
> - **Daemon-layer**: a `client_message_id` is **daemon-consumed** iff an
|
||||
> outbox row exists for it. Daemon-mediated callers can never reuse a
|
||||
> daemon-consumed id, regardless of whether the broker ever saw it.
|
||||
> The daemon's outbox is the single authority for "this id was issued
|
||||
> by my caller against this daemon."
|
||||
> - **Broker-layer**: a `client_message_id` is **broker-consumed** iff a
|
||||
> dedupe row exists for `(mesh_id, client_message_id)` in
|
||||
> `mesh.client_message_dedupe`. Direct broker callers (none in
|
||||
> v0.9.0; reserved for future SDK paths that bypass the daemon) can
|
||||
> reuse a broker-non-consumed id freely.
|
||||
> - In v0.9.0 there are no daemon-bypass clients, so for practical
|
||||
> purposes "daemon-consumed" is the operative rule.
|
||||
>
|
||||
> **Local guarantee**: each successful `POST /v1/send` returns a stable
|
||||
> `client_message_id`. The send is durably persisted to `outbox.db`
|
||||
> before the response returns. The daemon enforces request-fingerprint
|
||||
> idempotency at the IPC layer (§4.5.1).
|
||||
>
|
||||
> **Local audit guarantee**: a `client_message_id` once written to
|
||||
> `outbox.db` is **never released** (daemon-layer rule). Operator
|
||||
> recovery via `requeue` always mints a fresh id; the old row stays in
|
||||
> `aborted` for audit. There is no daemon-side path to free a used id.
|
||||
>
|
||||
> **Broker guarantee** (v9 — tightened): a dedupe row exists iff the
|
||||
> broker accept transaction **committed** (Phase B3 reached). Phase B1
|
||||
> rejections never insert dedupe rows. Phase B2 rejections roll the
|
||||
> transaction back, so any partial dedupe row is unwound. Direct
|
||||
> broker callers retrying after B1/B2 rejection see no dedupe row and
|
||||
> may reuse the id.
|
||||
>
|
||||
> **Atomicity guarantee**: same as v8 §4.1.
|
||||
>
|
||||
> **End-to-end guarantee**: at-least-once.
|
||||
|
||||
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
|
||||
|
||||
### 4.3 Broker schema — unchanged from v6 §4.3
|
||||
|
||||
### 4.4 Request fingerprint canonical form — unchanged from v6 §4.4
|
||||
|
||||
### 4.5 Daemon-local idempotency at the IPC layer (v8 — `aborted` added, SQLite locking)
|
||||
|
||||
#### 4.5.1 IPC accept algorithm (v8)
|
||||
|
||||
On `POST /v1/send`:
|
||||
|
||||
1. Validate request envelope (auth, schema, size limits, destination
|
||||
resolvable). Failures here return `4xx` immediately. **No outbox row
|
||||
is written; the `client_message_id` is not consumed.**
|
||||
2. Compute `request_fingerprint` (§4.4).
|
||||
3. Open a SQLite transaction with `BEGIN IMMEDIATE` (v8 — codex r7) so
|
||||
a concurrent IPC accept on the same id serializes against this one.
|
||||
`BEGIN IMMEDIATE` acquires the RESERVED lock at transaction start,
so no other connection can start a write transaction on the same
database until this one ends (readers are unaffected). SQLite has no
row-level locks and does not support `SELECT FOR UPDATE`.
|
||||
4. `SELECT id, request_fingerprint, status, broker_message_id,
|
||||
last_error FROM outbox WHERE client_message_id = ?`.
|
||||
5. Apply the lookup table below. For the "(no row)" case, INSERT the
|
||||
new row inside the same transaction.
|
||||
6. COMMIT.
|
||||
|
||||
| Existing row state | Fingerprint match? | Daemon response |
|
||||
|---|---|---|
|
||||
| (no row) | — | INSERT new outbox row in `pending`; return `202 accepted, queued` |
|
||||
| `pending` | match | Return `202 accepted, queued`. No mutation |
|
||||
| `pending` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_pending_fingerprint_mismatch"`. No mutation |
|
||||
| `inflight` | match | Return `202 accepted, inflight`. No mutation |
|
||||
| `inflight` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_inflight_fingerprint_mismatch"` |
|
||||
| `done` | match | Return `200 ok, duplicate: true, broker_message_id, history_id`. No broker call |
|
||||
| `done` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_done_fingerprint_mismatch", broker_message_id` |
|
||||
| `dead` | match | Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_match", reason: "<last_error>"`. Same id never auto-retried |
|
||||
| `dead` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_mismatch"` |
|
||||
| **`aborted`** (NEW v8) | **match** | Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_match"`. The id was retired by operator action; never reusable |
|
||||
| **`aborted`** (NEW v8) | **mismatch** | Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_mismatch"` |
|
||||
|
||||
**Rule (v8 — codex r7)**: every IPC `409` carries the daemon's
|
||||
`request_fingerprint` (8-byte hex prefix) so callers can debug
|
||||
client/server canonical-form drift. **Every state in the table returns
|
||||
something deterministic, including `aborted`.** A `client_message_id`
|
||||
written to `outbox.db` is permanently bound to that row's lifecycle —
|
||||
the only "free" state is "no row exists".
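The accept algorithm and lookup table above can be sketched against an in-memory SQLite database. This is a minimal illustration, not the daemon's implementation: the SHA-256 fingerprint is a stand-in for the canonical form in §4.4, the response bodies are simplified (no `history_id`), the helper names `open_outbox`, `ipc_accept` and `respond_to_existing` are hypothetical, and returning the *stored* fingerprint prefix on `409` is an assumption about what "the daemon's `request_fingerprint`" means.

```python
import hashlib
import sqlite3
import time

# Trimmed copy of the §4.5.2 outbox table (only the columns this sketch touches).
SCHEMA = """
CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  payload             BLOB NOT NULL,
  enqueued_at         INTEGER NOT NULL,
  next_attempt_at     INTEGER NOT NULL,
  status              TEXT CHECK(status IN
                        ('pending','inflight','done','dead','aborted')),
  last_error          TEXT,
  broker_message_id   TEXT
);
"""


def open_outbox(path=":memory:"):
    # isolation_level=None disables implicit transactions, so the
    # explicit BEGIN IMMEDIATE / COMMIT below are the only ones issued.
    db = sqlite3.connect(path, isolation_level=None)
    db.executescript(SCHEMA)
    return db


def respond_to_existing(stored_fp, status, broker_message_id, last_error, fingerprint):
    """Lookup table for the cases where an outbox row already exists."""
    match = bytes(stored_fp) == fingerprint
    if match and status == "pending":
        return 202, {"status": "queued"}
    if match and status == "inflight":
        return 202, {"status": "inflight"}
    if match and status == "done":
        return 200, {"duplicate": True, "broker_message_id": broker_message_id}
    # Everything else is a 409, including dead/aborted with a matching fingerprint.
    body = {
        "error": "idempotency_key_reused",
        "conflict": f"outbox_{status}_fingerprint_{'match' if match else 'mismatch'}",
        "request_fingerprint": bytes(stored_fp)[:8].hex(),  # assumption: stored value
    }
    if match and status == "dead":
        body["reason"] = last_error
    return 409, body


def ipc_accept(db, client_message_id, payload):
    """Steps 2-6 of the accept algorithm; step 1 (validation) is assumed done."""
    fingerprint = hashlib.sha256(payload).digest()  # stand-in for §4.4
    db.execute("BEGIN IMMEDIATE")  # step 3: serialize concurrent accepts
    try:
        row = db.execute(
            "SELECT request_fingerprint, status, broker_message_id, last_error"
            " FROM outbox WHERE client_message_id = ?",
            (client_message_id,),
        ).fetchone()  # step 4
        if row is None:  # step 5, "(no row)" case
            now = int(time.time())
            db.execute(
                "INSERT INTO outbox (id, client_message_id, request_fingerprint,"
                " payload, enqueued_at, next_attempt_at, status)"
                " VALUES (?, ?, ?, ?, ?, ?, 'pending')",
                ("row-" + client_message_id, client_message_id,
                 fingerprint, payload, now, now),
            )
            result = (202, {"status": "queued"})
        else:
            result = respond_to_existing(*row, fingerprint)
        db.execute("COMMIT")  # step 6
        return result
    except Exception:
        db.execute("ROLLBACK")
        raise
```

Keeping the lookup in a pure function makes the "every state returns something deterministic" rule easy to test without a database.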
|
||||
|
||||
#### 4.5.2 Outbox table — fingerprint required
|
||||
|
||||
```sql
|
||||
CREATE TABLE outbox (
|
||||
id TEXT PRIMARY KEY,
|
||||
client_message_id TEXT NOT NULL UNIQUE,
|
||||
request_fingerprint BLOB NOT NULL, -- 32 bytes
|
||||
payload BLOB NOT NULL,
|
||||
enqueued_at INTEGER NOT NULL,
|
||||
attempts INTEGER DEFAULT 0,
|
||||
next_attempt_at INTEGER NOT NULL,
|
||||
status TEXT CHECK(status IN
|
||||
('pending','inflight','done','dead','aborted')),
|
||||
last_error TEXT,
|
||||
delivered_at INTEGER,
|
||||
broker_message_id TEXT,
|
||||
aborted_at INTEGER, -- NEW v8
|
||||
aborted_by TEXT, -- NEW v8: operator/auto
|
||||
superseded_by TEXT -- NEW v8: id of the requeue successor row, if any
|
||||
);
|
||||
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
|
||||
CREATE INDEX outbox_aborted ON outbox(status, aborted_at) WHERE status = 'aborted';
|
||||
```
|
||||
|
||||
`aborted_at`, `aborted_by`, `superseded_by` give operators a clear
|
||||
audit trail. `superseded_by` lets `outbox inspect` show the chain when
|
||||
a row was requeued multiple times.
|
||||
|
||||
`request_fingerprint` is computed once at IPC accept time and frozen
|
||||
forever for the row's lifecycle. Daemon never recomputes from
|
||||
`payload`.
|
||||
|
||||
### 4.6 Rejected-request semantics — two-layer rules + rate-limit moved to B1 (v9 — codex r8)
|
||||
|
||||
> **Two-layer rule (v9)**: a `client_message_id` is **daemon-consumed**
|
||||
> iff an outbox row exists for it; **broker-consumed** iff a dedupe row
|
||||
> exists. Daemon-mediated callers see daemon-layer authority (the only
|
||||
> path in v0.9.0). Pre-validation failures at any layer consume nothing
|
||||
> at that layer. The two layers are independent: a daemon-consumed id
|
||||
> may or may not be broker-consumed (depending on whether the send
|
||||
> reached B3); a daemon-non-consumed id can never be broker-consumed
|
||||
> (no outbox row ⇒ no broker call from the daemon).
|
||||
|
||||
#### 4.6.1 Daemon-side rejection phasing (v9)
|
||||
|
||||
| Phase | When daemon rejects | Outbox row? | Daemon-consumed? | Same daemon caller may reuse id? |
|
||||
|---|---|---|---|---|
|
||||
| **A. IPC validation** (auth, schema, size, destination resolvable) | Before §4.5.1 step 3 | No | No | Yes — id never written locally |
|
||||
| **B. Outbox stored, broker network/transient failure** | After IPC accept, broker `5xx` or timeout | `pending` → retried | Yes | N/A — daemon owns retries |
|
||||
| **C. Outbox stored, broker permanent rejection** | Broker returns `4xx` after IPC accept | `dead` | Yes | No — rotate via `requeue` |
|
||||
| **D. Operator retirement** | Operator runs `requeue` on `dead` or `pending` row | `aborted` (audit) + new row with fresh id | Yes (still consumed) | Old id NEVER reusable; new id is fresh |
|
||||
|
||||
The "daemon-consumed?" column is the daemon-layer authority. It does
|
||||
not depend on whether the broker ever saw the request — phase C above
|
||||
shows the broker has not committed a dedupe row, but the daemon still
|
||||
holds the id in `dead` state.
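As a rough illustration of phases B and C in the table above, the mapping from a broker outcome to the next outbox status might look like the following. The function name `next_outbox_status` is hypothetical, and treating `None` as "timeout or network failure" is a convention of this sketch only.

```python
def next_outbox_status(http_status):
    """Map a broker outcome to the outbox status the daemon moves the row to.

    http_status is the broker's HTTP status code, or None when the request
    timed out or never reached the broker (sketch convention).
    """
    if http_status is None or 500 <= http_status <= 599:
        return "pending"  # phase B: transient failure, daemon owns the retry
    if 400 <= http_status <= 499:
        return "dead"  # phase C: permanent rejection, id stays daemon-consumed
    if 200 <= http_status <= 299:
        return "done"  # broker accepted (or reported a matching duplicate)
    raise ValueError(f"unexpected broker status: {http_status}")
```

In every branch the id remains daemon-consumed, because the outbox row already exists by the time the broker is called.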
|
||||
|
||||
#### 4.6.2 Broker-side rejection phasing (v9 — rate limit moved to B1)
|
||||
|
||||
The broker validates in two phases relative to dedupe-row insertion:
|
||||
|
||||
| Phase | Validation | Side effects | Result for direct broker callers |
|
||||
|---|---|---|---|
|
||||
| **B1. Pre-dedupe-claim** (atomic, external) | Auth (mesh membership), schema, size, mesh exists, member exists, destination kind valid, payload bytes ≤ `max_payload.inline_bytes`, **rate limit not exceeded** (atomic external limiter — see §4.6.4) | None | `4xx` returned. No dedupe row, no broker-consumed id. Caller may retry with same id once condition clears |
|
||||
| **B2. Post-dedupe-claim** (in-tx) | Conditions that require the accept transaction to be in progress: destination_ref existence (topic exists, member subscribed, etc.) | INSERT into dedupe rolled back | `4xx` returned, transaction rolled back, no dedupe row remains. Caller may retry with same id |
|
||||
| **B3. Accepted** | All side effects commit atomically | Dedupe row, message row, history row, delivery_queue rows, mention_index rows | `201` returned with `broker_message_id`. Id is broker-consumed |
|
||||
|
||||
**Daemon-mediated callers**: in v0.9.0 the daemon is the only B-phase
|
||||
caller. Daemon-mediated callers see only the daemon-layer rules
|
||||
(§4.6.1). The broker's "may retry with same id" wording in the table
|
||||
above applies to direct broker callers only (none in v0.9.0; reserved
|
||||
for future SDK paths).
|
||||
|
||||
**Critical guarantee (v9 — tightened from v8)**: a dedupe row exists
|
||||
**iff the broker accept transaction committed (B3)**. There is no
|
||||
broker code path where a permanent 4xx leaves a dedupe row behind.
|
||||
|
||||
If the broker decides post-commit that an accepted message is invalid
|
||||
(async content-policy job, async moderation, etc.), that's NOT a
|
||||
permanent rejection — it's a follow-up event that operates on the
|
||||
`broker_message_id`, not on the dedupe key.
|
||||
|
||||
#### 4.6.4 Rate limiter — atomic, external, B1 (NEW v9 — codex r8)
|
||||
|
||||
Codex r8 caught: v8 listed rate-limit enforcement in B2 (in-tx) but
|
||||
classified rate-limit *counters* as async/non-authoritative. Both
|
||||
can't be true. v9 resolves it by moving rate-limit enforcement to B1
|
||||
backed by an atomic external limiter:
|
||||
|
||||
- **Authority**: the broker's existing Redis (or equivalent
|
||||
fixed-window limiter) used for `claudemesh launch` rate-limiting is
|
||||
the authority for accept-time rate-limit enforcement. `INCR` with
|
||||
TTL is atomic; the broker checks the result before committing the
|
||||
Phase B2/B3 transaction.
|
||||
- **Idempotency interaction**: rate-limit `INCR` happens **before** the
|
||||
dedupe-claim INSERT. If the limiter rejects, no DB transaction is
|
||||
opened, no dedupe row exists. If the limiter accepts but the in-tx
|
||||
Phase B2 then rejects (e.g. topic not found), the limiter `INCR` is
|
||||
not refunded. This is intentional: refunding would require a
|
||||
reliable distributed counter, and the over-counting risk is
|
||||
acceptable. Counter
|
||||
`cm_broker_rate_limit_consumed_then_rejected_total` exposes the
|
||||
delta for ops awareness.
|
||||
- **Retries**: a daemon retry with the same `client_message_id` after a
|
||||
B1 rate-limit rejection produces another `INCR`. To avoid burning
|
||||
rate-limit budget on retries-of-rejected-ids, the broker can
|
||||
optionally short-circuit `INCR` if the rate-limit subsystem can
|
||||
cheaply detect "this exact `client_message_id` was rejected for
|
||||
rate-limit in the last N seconds" — but this is an optimization,
|
||||
not a correctness requirement.
|
||||
- **Async counters**: `mesh.rate_limit_counter` (or any DB-resident
|
||||
view of "messages-per-mesh-per-window") is **non-authoritative** —
|
||||
it's metrics/telemetry rebuilt from the authoritative limiter and
|
||||
from message-history. Used for dashboards, not for accept decisions.
|
||||
|
||||
This split — atomic external limiter for enforcement, async DB
|
||||
counters for telemetry — matches how every other rate-limited
|
||||
subsystem in claudemesh works (`claudemesh launch`, dashboard chat
|
||||
posts, etc.). No new infrastructure required.
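The limiter semantics described in this section (increment first, compare second, never refund) can be illustrated with an in-process stand-in. This is only a sketch: the real authority is the external Redis-style `INCR` with TTL, a Python dict is not atomic across processes, and the class name, parameters and injectable clock are hypothetical.

```python
import time


class FixedWindowLimiter:
    """In-process stand-in for an INCR-with-TTL fixed-window limiter."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.counts = {}  # (mesh_id, window_bucket) -> count

    def allow(self, mesh_id):
        bucket = int(self.clock() // self.window_seconds)
        # Emulate TTL expiry by dropping buckets older than the current one.
        self.counts = {k: v for k, v in self.counts.items() if k[1] >= bucket}
        key = (mesh_id, bucket)
        # INCR equivalent: the counter moves even if the request is then
        # rejected here or later in Phase B2; there is no refund path.
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= self.limit
```

Because the increment is unconditional, a retry of a rejected request consumes budget again, which is the behavior the "Retries" bullet above describes.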
|
||||
|
||||
#### 4.6.3 Operator recovery via `requeue` (corrected v8)
|
||||
|
||||
To unstick a `dead` or `pending`-but-stuck row, operator runs:
|
||||
|
||||
```
|
||||
claudemesh daemon outbox requeue --id <outbox_row_id>
|
||||
[--new-client-id <id> | --auto]
|
||||
[--patch-payload <path>]
|
||||
```
|
||||
|
||||
This atomically (single SQLite transaction):
|
||||
|
||||
1. Marks the existing row's status to `aborted`, sets `aborted_at = now`,
|
||||
`aborted_by = "operator"`. Row is **never deleted** — audit trail
|
||||
permanent.
|
||||
2. Mints a fresh `client_message_id` (caller-supplied via `--new-client-id`
|
||||
or auto-ulid'd via `--auto`).
|
||||
3. Inserts a new outbox row in `pending` with the fresh id and the same
|
||||
payload (or patched payload if `--patch-payload` was given).
|
||||
4. Sets `superseded_by = <new_row_id>` on the old row so
|
||||
`outbox inspect <old_id>` displays the chain.
|
||||
|
||||
**The old `client_message_id` is permanently dead** — `outbox.db` still
|
||||
holds it via the `aborted` row's `UNIQUE` constraint, and any caller
|
||||
re-using it gets `409 outbox_aborted_*` per §4.5.1.
|
||||
|
||||
If the broker ever accepted the old id (it reached B3), the broker's
dedupe row is also permanent: a duplicate send to the broker with the
old id gets `409` on fingerprint mismatch, or the original
`broker_message_id` on fingerprint match. The daemon-side `aborted` row
and the broker-side dedupe row are independent records of "this id was
used"; neither releases the id.
|
||||
|
||||
This is the resolution to v7's contradiction: there is **no path** for
|
||||
an id to "become free again." If the operator wants to retry the
|
||||
payload, they get a new id. The old id stays buried.
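The four requeue steps can be sketched as a single SQLite transaction. This is an illustration under stated assumptions: `uuid4` stands in for the ULID that `--auto` would mint, the successor row's fingerprint is passed in by the caller because this section does not say how it is derived, and the function name `requeue` and the trimmed schema are hypothetical.

```python
import sqlite3
import time
import uuid

# Trimmed copy of the §4.5.2 outbox table.
SCHEMA = """
CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  payload             BLOB NOT NULL,
  enqueued_at         INTEGER NOT NULL,
  next_attempt_at     INTEGER NOT NULL,
  status              TEXT CHECK(status IN
                        ('pending','inflight','done','dead','aborted')),
  aborted_at          INTEGER,
  aborted_by          TEXT,
  superseded_by       TEXT
);
"""


def requeue(db, outbox_row_id, fingerprint, new_client_id=None, patched_payload=None):
    """Retire the old row as 'aborted' and insert a successor with a fresh id."""
    db.execute("BEGIN IMMEDIATE")
    try:
        row = db.execute(
            "SELECT payload, status FROM outbox WHERE id = ?", (outbox_row_id,)
        ).fetchone()
        if row is None or row[1] not in ("dead", "pending"):
            raise ValueError("row must exist and be 'dead' or 'pending'")
        payload = patched_payload if patched_payload is not None else row[0]
        now = int(time.time())
        new_row_id = uuid.uuid4().hex
        fresh_id = new_client_id or uuid.uuid4().hex  # stand-in for an auto ULID
        # Steps 1 and 4: mark aborted (never delete) and link to the successor.
        db.execute(
            "UPDATE outbox SET status = 'aborted', aborted_at = ?,"
            " aborted_by = 'operator', superseded_by = ? WHERE id = ?",
            (now, new_row_id, outbox_row_id),
        )
        # Steps 2 and 3: new pending row under the fresh client_message_id.
        db.execute(
            "INSERT INTO outbox (id, client_message_id, request_fingerprint,"
            " payload, enqueued_at, next_attempt_at, status)"
            " VALUES (?, ?, ?, ?, ?, ?, 'pending')",
            (new_row_id, fresh_id, fingerprint, payload, now, now),
        )
        db.execute("COMMIT")
        return new_row_id, fresh_id
    except Exception:
        db.execute("ROLLBACK")
        raise
```

If the caller supplies a `new_client_id` that was already used, the `UNIQUE` constraint fails the insert and the whole transaction rolls back, so the old row is left untouched.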
|
||||
|
||||
### 4.7 Broker atomicity contract — side-effect classification (v9)
|
||||
|
||||
#### 4.7.1 Side effects (v9 — rate limit moved to B1 external)
|
||||
|
||||
Every successful broker accept atomically commits these durable
|
||||
state changes in **one transaction**:
|
||||
|
||||
| Effect | Table | In-tx? | Why |
|
||||
|---|---|---|---|
|
||||
| Dedupe record | `mesh.client_message_dedupe` | **Yes** | Idempotency authority |
|
||||
| Message body | `mesh.topic_message` / `mesh.message_queue` | **Yes** | Authoritative store |
|
||||
| History row | `mesh.message_history` | **Yes** | Replay log; lost-on-rollback would break ordered replay |
|
||||
| Fan-out work | `mesh.delivery_queue` | **Yes** | Each recipient must see exactly the messages that committed |
|
||||
| Mention index entries | `mesh.mention_index` | **Yes** | Reads off mention queries must match committed messages |
|
||||
|
||||
**Outside the transaction** — non-authoritative or rebuildable, with
|
||||
explicit rationale per item:
|
||||
|
||||
| Effect | Where | Why outside |
|
||||
|---|---|---|
|
||||
| WS push to live subscribers | Async after COMMIT | Live notifications are best-effort; receivers re-fetch from history on reconnect |
|
||||
| Webhook fan-out | Async via `delivery_queue` workers | Off-band; consumes committed `delivery_queue` rows |
|
||||
| Rate-limit **counters** (telemetry only) | Async, eventually consistent | Authoritative limiter is the external Redis-style INCR in B1 (§4.6.4); the DB counter is rebuilt for dashboards, not consulted for accept |
|
||||
| Audit log entries | Async append-only stream | Audit log can be rebuilt from message history; in-tx writes hurt p99 |
|
||||
| Search/FTS index updates | Async via outbox-pattern worker | Index can be rebuilt from authoritative tables |
|
||||
| Metrics | Prometheus, pull-based | Always non-authoritative |
|
||||
|
||||
If any in-transaction insert fails, the transaction rolls back
completely. The broker returns `5xx` to the daemon, and the daemon
retries. No partial state is left behind.
|
||||
|
||||
The async side effects are driven off the in-transaction
|
||||
`delivery_queue` and `message_history` rows, so they cannot get ahead
|
||||
of committed state — only lag behind.
|
||||
|
||||
#### 4.7.2 Pseudocode — corrected and final (v8)
|
||||
|
||||
```sql
|
||||
-- Phase B1 already passed (see §4.6.2). This includes:
|
||||
-- - schema/auth/size validation
|
||||
-- - external atomic rate-limit INCR (§4.6.4)
|
||||
-- Anything that fails B1 returns 4xx without ever opening this tx.
|
||||
|
||||
BEGIN;
|
||||
|
||||
-- Phase B2 + B3: try to claim the idempotency key.
|
||||
INSERT INTO mesh.client_message_dedupe
|
||||
(mesh_id, client_message_id, broker_message_id, request_fingerprint,
|
||||
destination_kind, destination_ref, expires_at)
|
||||
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
|
||||
$dest_kind, $dest_ref, $expires_at)
|
||||
ON CONFLICT (mesh_id, client_message_id) DO NOTHING;
|
||||
|
||||
-- Inspect the row that's actually there now (ours or someone else's).
|
||||
SELECT broker_message_id, request_fingerprint, destination_kind,
|
||||
destination_ref, history_available, first_seen_at
|
||||
FROM mesh.client_message_dedupe
|
||||
WHERE mesh_id = $mesh_id AND client_message_id = $client_id
|
||||
FOR SHARE;
|
||||
|
||||
-- Branch:
|
||||
-- row.broker_message_id == $msg_id → first insert; continue to step 3.
|
||||
-- row.broker_message_id != $msg_id → duplicate. Compare fingerprints:
|
||||
-- fingerprint match → ROLLBACK; return 200 duplicate.
|
||||
-- fingerprint mismatch → ROLLBACK; return 409 idempotency_key_reused.
|
||||
|
||||
-- Step 3: validate Phase B2 (destination_ref existence: topic exists,
|
||||
-- member subscribed, etc.). Rate limit is NOT here — it was checked
|
||||
-- atomically in B1 via the external limiter (§4.6.4) before this
|
||||
-- transaction opened.
|
||||
-- If B2 fails → ROLLBACK; return 4xx (no dedupe row remains).
|
||||
|
||||
-- Step 4: insert all in-tx side effects (§4.7.1).
|
||||
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
|
||||
VALUES ($msg_id, $mesh_id, $client_id, ...);
|
||||
|
||||
INSERT INTO mesh.message_history (broker_message_id, mesh_id, ...)
|
||||
VALUES ($msg_id, $mesh_id, ...);
|
||||
|
||||
INSERT INTO mesh.delivery_queue (broker_message_id, recipient_pubkey, ...)
|
||||
SELECT $msg_id, member_pubkey, ...
|
||||
FROM mesh.topic_subscription
|
||||
WHERE topic = $dest_ref AND mesh_id = $mesh_id;
|
||||
|
||||
INSERT INTO mesh.mention_index (broker_message_id, mentioned_pubkey, ...)
|
||||
SELECT $msg_id, mention_pubkey, ...
|
||||
FROM unnest($mention_list);
|
||||
|
||||
COMMIT;
|
||||
|
||||
-- After COMMIT, async workers consume delivery_queue and update
|
||||
-- search indexes, audit logs, rate-limit counters, etc.
|
||||
```
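The claim-then-inspect branch in the pseudocode above can be exercised with SQLite standing in for Postgres. This is a sketch only: `FOR SHARE` has no SQLite equivalent and is dropped, the in-tx side effects of step 4 are omitted, the `b2_ok` flag and the `400` placeholder for a B2 rejection are hypothetical, and `ON CONFLICT ... DO NOTHING` requires SQLite 3.24 or newer.

```python
import sqlite3

DEDUPE_SCHEMA = """
CREATE TABLE client_message_dedupe (
  mesh_id             TEXT NOT NULL,
  client_message_id   TEXT NOT NULL,
  broker_message_id   TEXT NOT NULL,
  request_fingerprint BLOB NOT NULL,
  PRIMARY KEY (mesh_id, client_message_id)
);
"""


def broker_accept(db, mesh_id, client_id, msg_id, fingerprint, b2_ok=True):
    """Claim the idempotency key, then branch on whose row is actually there."""
    db.execute("BEGIN")
    db.execute(
        "INSERT INTO client_message_dedupe"
        " (mesh_id, client_message_id, broker_message_id, request_fingerprint)"
        " VALUES (?, ?, ?, ?)"
        " ON CONFLICT (mesh_id, client_message_id) DO NOTHING",
        (mesh_id, client_id, msg_id, fingerprint),
    )
    row = db.execute(
        "SELECT broker_message_id, request_fingerprint"
        " FROM client_message_dedupe"
        " WHERE mesh_id = ? AND client_message_id = ?",
        (mesh_id, client_id),
    ).fetchone()
    if row[0] != msg_id:
        # Someone else's row: this request is a duplicate of a committed accept.
        db.execute("ROLLBACK")
        if bytes(row[1]) == fingerprint:
            return 200, row[0]  # duplicate, original broker_message_id
        return 409, None  # idempotency_key_reused
    if not b2_ok:
        # Phase B2 rejection: rolling back unwinds our own dedupe row.
        db.execute("ROLLBACK")
        return 400, None  # placeholder for "some 4xx"
    # Step 4 (message, history, delivery_queue, mention_index) would go here.
    db.execute("COMMIT")
    return 201, msg_id
```

The B2 branch is what keeps the "dedupe row exists iff B3 committed" guarantee: a rejected request leaves nothing behind, so the same id can be accepted later.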
|
||||
|
||||
#### 4.7.3 Orphan check — same as v7 §4.7.3
|
||||
|
||||
The orphan check is extended over the side-effect inventory (§4.7.1) to verify that the in-transaction items are consistent with one another.
|
||||
|
||||
### 4.8 Outbox max-age math — unchanged from v7 §4.8
|
||||
|
||||
Min `dedupe_retention_days = 7`; derived `max_age_hours = window -
|
||||
safety_margin` strictly < window; safety_margin floor 24h.
|
||||
|
||||
### 4.9 Inbox schema — unchanged from v3 §4.5
|
||||
|
||||
### 4.10 Crash recovery — unchanged from v3 §4.6
|
||||
|
||||
### 4.11 Failure modes — `aborted` semantics added (v8)
|
||||
|
||||
- **IPC accept fingerprint-mismatch on duplicate id** (any state):
|
||||
returns 409 with `conflict` field per §4.5.1. Caller must use a new id.
|
||||
- **IPC accept against `aborted` row, fingerprint match**: returns 409
|
||||
per §4.5.1 (NEW v8). Caller must use a new id; the old id is
|
||||
permanently retired.
|
||||
- **Outbox row stuck in `dead`**: operator runs `outbox requeue` per
|
||||
§4.6.3; old id stays in `aborted`, new id is fresh.
|
||||
- **Broker fingerprint mismatch on retry**: as v6/v7. Daemon marks
|
||||
`dead`; operator requeue path.
|
||||
- **Daemon retry after dedupe row hard-deleted by broker retention
|
||||
sweep**: cannot happen unless operator overrode `max_age_hours`.
|
||||
- **Broker phase B2 rejection on retry**: same id, same fingerprint,
but a B2 condition has changed (e.g. the destination topic was deleted
or the member is no longer subscribed). Daemon receives 4xx → marks
`dead`. Operator can `requeue` once conditions clear.
|
||||
- **Atomicity violation found by orphan check**: alerts ops.
|
||||
|
||||
---
|
||||
|
||||
## 5-13. — unchanged from v4
|
||||
|
||||
## 14. Lifecycle — unchanged from v5 §14
|
||||
|
||||
## 15. Version compat — unchanged from v7 §15
|
||||
|
||||
## 16. Threat model — unchanged
|
||||
|
||||
---
|
||||
|
||||
## 17. Migration — v8 outbox columns + broker phase B2 (v8)
|
||||
|
||||
Broker side, deploy order: same as v7 §17, with one addition:
|
||||
- Step 4.5: explicitly split broker accept into Phase B1 (pre-dedupe
|
||||
validation, returns 4xx without writing) and Phase B2/B3 (within the
|
||||
accept transaction). Implementation: refactor handler to validate
|
||||
Phase B1 conditions before opening the DB transaction.
|
||||
|
||||
Daemon side:
|
||||
- Outbox schema gains `aborted_at`, `aborted_by`, `superseded_by`
columns and the `aborted` enum value (§4.5.2). If needed, migration
recreates the table and copies rows with an explicit column list
(`INSERT INTO new (...) SELECT ... FROM old`; a bare `SELECT *` would
not match the widened table); v0.9.0 is greenfield.
|
||||
- IPC accept switches to `BEGIN IMMEDIATE` for SQLite serialization
|
||||
(§4.5.1 step 3).
|
||||
- IPC accept handles `aborted` rows per §4.5.1 (always 409).
|
||||
- `claudemesh daemon outbox requeue` always mints a fresh
|
||||
`client_message_id`; never frees the old id. `--new-client-id <id>`
|
||||
and `--auto` are the only modes; the old `client_message_id`
|
||||
argument is removed.
|
||||
|
||||
---
|
||||
|
||||
## What changed v8 → v9 (codex round-8 actionable items)
|
||||
|
||||
| Codex r8 item | v9 fix | Section |
|
||||
|---|---|---|
|
||||
| Cross-layer ID-consumed authority contradiction | Two-layer model: daemon-consumed iff outbox row; broker-consumed iff dedupe row committed; daemon-mediated callers see only daemon-layer authority | §4.1, §4.6.1, §4.6.2 |
|
||||
| Rate-limit authority muddled (B2 vs async counters) | Rate limit moved to B1 via external atomic limiter (Redis-style INCR with TTL); DB rate-limit counters demoted to telemetry-only | §4.6.2, §4.6.4, §4.7.1 |
|
||||
| §4.1 broker guarantee fuzzy | Tightened: "dedupe row exists iff broker accept transaction committed (B3)" | §4.1, §4.6.2 |
|
||||
|
||||
(Earlier rounds' fixes preserved unchanged.)
|
||||
|
||||
---
|
||||
|
||||
## What needs review (round 9)
|
||||
|
||||
1. **Two-layer ID model (§4.1, §4.6.1)** — is the daemon-vs-broker
|
||||
authority split clear, or does it create more confusion for
|
||||
operators reading "consumed" in different contexts? Should we use
|
||||
different verbs (e.g. "claimed" at daemon, "committed" at broker)?
|
||||
2. **Rate-limit external limiter (§4.6.4)** — is "atomic external
|
||||
limiter" specified concretely enough? Is the over-counting on
|
||||
limiter-accepted-then-B2-rejected acceptable?
|
||||
3. **B2 contents after rate-limit move** — B2 now only has
|
||||
`destination_ref existence`. Worth keeping a B2 phase at all, or
|
||||
collapse into B1+B3?
|
||||
4. **Anything else still wrong?** Read it as if you were going to
|
||||
operate this for a year.
|
||||
|
||||
Three options:
|
||||
- **(a) v9 is shippable**: lock the spec, start coding the frozen core.
|
||||
- **(b) v10 needed**: list the must-fix items.
|
||||
- **(c) the architecture itself is wrong**: what would you do differently?
|
||||
|
||||
Be ruthless.
|
||||
374
.artifacts/shipped/2026-05-03-daemon-final-spec.md
Normal file
@@ -0,0 +1,374 @@
|
||||
# `claudemesh daemon` — Final Spec
|
||||
|
||||
> Context for the reviewer: claudemesh is a peer mesh runtime for Claude Code
|
||||
> sessions. Existing infrastructure: a managed broker (`wss://ic.claudemesh.com/ws`,
|
||||
> Bun + Drizzle + Postgres) that handles routing, presence, topics, files,
|
||||
> per-mesh apikeys, etc. There is also a CLI (`claudemesh-cli`, npm) and a web
|
||||
> dashboard. Each session today is short-lived: `claudemesh launch` opens a WS,
|
||||
> stays up while Claude Code is running, then closes. Server-side
|
||||
> integrations (RunPod handlers, Temporal workers, CI jobs) currently have no
|
||||
> first-class way to participate in a mesh — they'd either curl an apikey-auth
|
||||
> REST endpoint (one-way) or shell out to the CLI cold-path (slow, no inbound).
|
||||
>
|
||||
> This spec proposes a `claudemesh daemon` mode that turns any host (laptop,
|
||||
> server, RunPod pod) into a persistent, addressable peer with a local IPC
|
||||
> surface that apps can talk to without dealing with the broker directly.
|
||||
>
|
||||
> The user has explicitly said: pre-launch, no users yet, optimize for the
|
||||
> right architecture not the smallest first cut. They want the FINAL spec, not
|
||||
> phased MVPs.
|
||||
|
||||
---
|
||||
|
||||
## 1. Process model
|
||||
|
||||
**One daemon per (user, mesh)**. Persistent. Survives reboots via OS supervisor (systemd / launchd / SCM). Serves multiple local apps concurrently.
|
||||
|
||||
```
|
||||
~/.claudemesh/daemon/<mesh-slug>/
|
||||
pid 0600 pidfile, cleaned on shutdown
|
||||
sock 0600 unix domain socket (primary IPC)
|
||||
http.port 0644 auto-allocated loopback port (Windows / Docker fallback)
|
||||
keypair.json 0600 persistent ed25519 + x25519 — daemon identity
|
||||
config.toml 0644 user-editable runtime tuning
|
||||
outbox.db 0600 SQLite — durable outbound queue + dedupe ledger
|
||||
inbox.db 0600 SQLite — 30-day inbound history, FTS-indexed
|
||||
daemon.log 0644 JSON-lines, rotating (100 MB / 14 d)
|
||||
hooks/ 0700 user-managed event scripts
|
||||
```
|
||||
|
||||
Single binary. No external runtime beyond the existing CLI dependencies. The daemon *is* the CLI in long-running mode: `claudemesh daemon up` is a subcommand of the same binary.
|
||||
|
||||
## 2. Identity — persistent member, not ephemeral session
|
||||
|
||||
The daemon mints a stable ed25519 + x25519 keypair on first startup, stored in `keypair.json`. Registers with the broker as a **persistent member** — same identity across restarts, reconnects, host migrations. `runpod-worker-3` is `runpod-worker-3` forever, until you `claudemesh daemon reset` or revoke the keypair.
|
||||
|
||||
`--name` is taken at first `daemon up`; subsequent runs read the keypair file and ignore `--name` unless `--rename` is passed (which produces a `member_renamed` event the broker propagates to peers).
|
||||
|
||||
This is the default. It's the right thing for servers. There is no `--ephemeral` mode.
|
||||
|
||||
## 3. IPC surface — single versioned API, three transports
|
||||
|
||||
**Transports**, all serving identical JSON:
|
||||
- **UDS** at `~/.claudemesh/daemon/<slug>/sock` (primary, default)
|
||||
- **TCP loopback** on auto-allocated port written to `http.port` (Docker / Windows clients)
|
||||
- **Server-Sent Events** stream at `GET /v1/events` for push (real-time inbound)
|
||||
|
||||
**No auth on local IPC.** Trust boundary is the OS — UDS is mode 0600, TCP listens on 127.0.0.1 only. If you can reach the socket, you're already running as the right user; the daemon's `keypair.json` is also reachable, so adding a token would be theatre.
|
||||
|
||||
**Endpoint surface — exactly mirrors CLI verbs:**
|
||||
|
||||
```
|
||||
# messaging
|
||||
POST /v1/send {to, message, priority?, meta?, replyToId?}
|
||||
POST /v1/topic/post {topic, message, priority?, mentions?}
|
||||
POST /v1/topic/subscribe {topic}
|
||||
GET /v1/topic/list
|
||||
GET /v1/inbox ?since=<iso>&topic=<n>&from=<peer>&limit=<n>
|
||||
POST /v1/broadcast {message, scope: "*"|"@group"|...}
|
||||
|
||||
# peers + presence
|
||||
GET /v1/peers ?mesh=<slug>
|
||||
POST /v1/profile {summary?, status?, visible?, avatar?, ...}
|
||||
POST /v1/groups/join {name, role?}
|
||||
POST /v1/groups/leave {name}
|
||||
|
||||
# state, memory, vector, graph — full mesh-services platform
|
||||
POST /v1/state/set {key, value, scope?: "mesh"|"member"}
|
||||
GET /v1/state/get ?key=...
|
||||
GET /v1/state/list
|
||||
POST /v1/memory/remember {content, tags?}
|
||||
GET /v1/memory/recall ?q=<query>
|
||||
POST /v1/vector/store {collection, text, metadata?}
|
||||
GET /v1/vector/search ?collection=<c>&q=<query>&limit=<n>
|
||||
POST /v1/graph/query {cypher, params?}
|
||||
|
||||
# files
|
||||
POST /v1/file/share {path, to?, message?, persistent?}
|
||||
GET /v1/file/get ?id=<fileId>&out=<path>
|
||||
GET /v1/file/list
|
||||
|
||||
# tasks + scheduling
|
||||
POST /v1/task/create {title, assignee?, priority?, tags?}
|
||||
POST /v1/task/claim {id}
|
||||
POST /v1/task/complete {id, result?}
|
||||
POST /v1/scheduling/remind {at|in|cron, message, to?}
|
||||
|
||||
# skills + MCP services (full peer participation)
|
||||
POST /v1/skill/deploy {path}
|
||||
POST /v1/skill/share {name, manifest}
|
||||
POST /v1/mcp/register {server_name, description, tools, transport}
|
||||
POST /v1/mcp/call {server, tool, args}
|
||||
|
||||
# events (push)
|
||||
GET /v1/events text/event-stream
|
||||
events: message, peer_join, peer_leave, file_shared, task_assigned,
|
||||
state_changed, mcp_deployed, skill_shared, hook_executed,
|
||||
disconnect, reconnect
|
||||
|
||||
# control plane
|
||||
GET /v1/health {connected, lag_ms, queue_depth, mesh, member_pubkey, uptime_s}
|
||||
GET /v1/metrics Prometheus exposition
|
||||
POST /v1/heartbeat {} (caller asserts it's alive — daemon may set status="working")
|
||||
```
|
||||
|
||||
Every CLI verb the platform offers has a daemon endpoint. No second-class features. Apps written against the daemon get the same surface as Claude Code itself.
|
||||
|
||||
## 4. Outbound — exactly-once via SQLite + idempotency keys
|
||||
|
||||
Sends route through `outbox.db` first, then to the broker. Schema:
|
||||
|
||||
```sql
|
||||
CREATE TABLE outbox (
|
||||
id TEXT PRIMARY KEY, -- ulid
|
||||
idempotency_key TEXT UNIQUE, -- caller-provided or autogen
|
||||
payload BLOB NOT NULL, -- serialized envelope
|
||||
enqueued_at INTEGER NOT NULL,
|
||||
attempts INTEGER DEFAULT 0,
|
||||
next_attempt_at INTEGER NOT NULL,
|
||||
status TEXT CHECK(status IN ('pending','inflight','done','dead')),
|
||||
last_error TEXT,
|
||||
delivered_at INTEGER
|
||||
);
|
||||
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
|
||||
```
|
||||
|
||||
- WAL mode, `synchronous=NORMAL` — durable enough, ~10k inserts/sec.
|
||||
- Caller-supplied `Idempotency-Key` header dedupes retries (24h window).
|
||||
- Exponential backoff with jitter; 7-day max retention; `dead` rows surface in `claudemesh daemon outbox --failed`.
|
||||
- `delivered_at` set when broker ACKs the queue row, not when daemon sends — gives true at-least-once with explicit dedupe → effectively exactly-once.
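One common way to realize "exponential backoff with jitter" is the full-jitter variant sketched below. The spec does not name a variant or constants, so the base delay, the cap, and the reading of "7-day max retention" as "mark the row `dead` after 7 days" are all assumptions of this sketch.

```python
import random

MAX_RETENTION_SECONDS = 7 * 24 * 3600  # 7-day max retention from the list above


def next_delay_seconds(attempts, base=1.0, cap=3600.0, rng=random):
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempts)]."""
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    ceiling = min(cap, base * (2 ** min(attempts, 32)))
    return rng.uniform(0.0, ceiling)


def should_mark_dead(enqueued_at, now):
    """True once a row has been in the outbox for the full retention period."""
    return now - enqueued_at >= MAX_RETENTION_SECONDS
```

A sender would set `next_attempt_at = now + next_delay_seconds(attempts)` after each failed attempt; passing a seeded `random.Random` as `rng` makes the schedule reproducible in tests.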
|
||||
|
||||
## 5. Inbound — durable history with FTS
|
||||
|
||||
Every inbound message is written to `inbox.db` before any hook fires:
|
||||
|
||||
```sql
|
||||
CREATE VIRTUAL TABLE inbox USING fts5(
|
||||
message_id UNINDEXED, mesh UNINDEXED, topic, sender_pubkey UNINDEXED,
|
||||
sender_name, body, meta, received_at UNINDEXED, replied_to_id UNINDEXED
|
||||
);
|
||||
```
|
||||
|
||||
- 30-day rolling retention (configurable).
|
||||
- `claudemesh daemon search "OOM"` queries the FTS index (instant, offline-capable).
|
||||
- Apps that connect mid-stream replay history via `?since=<iso>`.
|
||||
- Exposed in metrics: `cm_daemon_inbox_rows`, `cm_daemon_inbox_bytes`.
|
||||
|
||||
## 6. Hooks — first-class scripted reactions
|
||||
|
||||
Hooks turn the daemon from a passive relay into an autonomous peer. Files in `hooks/`:
|
||||
|
||||
```
|
||||
hooks/
|
||||
on-message.sh every inbound message (DM + topic)
|
||||
on-dm.sh DMs only
|
||||
on-mention.sh when @<my-name> appears anywhere
|
||||
on-topic-<name>.sh a specific topic (e.g. on-topic-alerts.sh)
|
||||
on-file-share.sh file shared with me
|
||||
on-task-assigned.sh task assigned to me
|
||||
on-disconnect.sh WS dropped (informational)
|
||||
on-reconnect.sh reconnected (informational)
|
||||
on-startup.sh daemon up
|
||||
pre-send.sh filter / mutate outbound (last gate)
|
||||
```
|
||||
|
||||
**Contract:**
|
||||
- Stdin: full event JSON.
|
||||
- Stdout (if non-empty, JSON object): used as a structured response. For inbound messages, `{reply: "..."}` posts a reply automatically.
|
||||
- Exit 0 = success; non-zero logs + counts but does not retry.
|
||||
- Timeout: 30s default, override via a `# claudemesh:timeout=120s` comment at the top of the script (next to the shebang line).
|
||||
- Env: `PATH=/usr/bin:/bin`, `CLAUDEMESH_MESH=<slug>`, `CLAUDEMESH_MEMBER=<pubkey>`, `CLAUDEMESH_HOME=<config-dir>`, plus the daemon's own broker session token in `CLAUDEMESH_TOKEN` so the script can call `claudemesh send` without re-authenticating.
|
||||
- Concurrent execution: bounded pool (default 8) — overflow queues, never blocks the WS reader.
|
||||
|
||||
This makes a server a real participant: it auto-replies to "@worker-3 status?", auto-acks file shares, auto-claims tasks, escalates errors to oncall — all configured by dropping shell scripts in a directory.
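To make the contract concrete, here is a sketch of the logic an `on-mention` hook might run, written as a Python helper that a hook script could call. The event field name `body` and the reply text are hypothetical (the spec only says stdin carries "full event JSON"); the only part taken from the contract is that a non-empty JSON object on stdout with a `reply` key posts a reply.

```python
import json


def handle(event):
    """Decide on a structured response for one inbound event, or None."""
    body = event.get("body", "")  # hypothetical field name
    if "status?" in body:
        return {"reply": "worker is healthy"}
    return None  # empty stdout: the daemon takes no further action


def run(stdin_text):
    """Turn the hook's stdin text into the text it should print on stdout."""
    event = json.loads(stdin_text)
    response = handle(event)
    return "" if response is None else json.dumps(response)


# A real hook would finish with something like:
#   sys.stdout.write(run(sys.stdin.read()))
# and exit 0 on success.
```

Keeping the decision in a pure function means the hook can be tested without a running daemon.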
|
||||
|
||||
## 7. Multi-mesh — one daemon per mesh, coordinated by a supervisor
|
||||
|
||||
Multi-mesh is handled by **one daemon per mesh** (no shared state, no cross-mesh leakage), coordinated by:
|
||||
|
||||
```
|
||||
claudemesh daemon up --all # spawns one daemon per joined mesh
|
||||
claudemesh daemon down --all
|
||||
claudemesh daemon status --all # JSON table of every daemon
|
||||
claudemesh daemon ps # alias of status
|
||||
```
|
||||
|
||||
CLI verbs without `--mesh` keep their existing aggregator routing (`/v1/me/...`); in addition, each daemon contributes its inbound state to the aggregator.
|
||||
|
||||
## 8. Auto-routing — every CLI verb prefers the daemon
|
||||
|
||||
The CLI's `withMesh` helper is replaced by `viaDaemonOrMesh`:
|
||||
|
||||
1. Read `~/.claudemesh/daemon/<slug>/pid`.
|
||||
2. If alive → call the daemon's UDS endpoint.
|
||||
3. Else → cold path (existing `withMesh` flow, opens its own short-lived WS).
|
||||
|
||||
Transparent to the user. `claudemesh send X "msg"` from a script becomes a sub-millisecond local UDS call when a daemon is up, instead of a 1-second broker handshake.
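
The three routing steps above could look roughly like the following. This is a sketch under stated assumptions: `live_daemon_pid` and `choose_transport` are hypothetical names, the pidfile is assumed to hold a bare decimal pid, and the liveness probe (`os.kill(pid, 0)`) is POSIX-only.

```python
import os
from pathlib import Path
from typing import Optional


def live_daemon_pid(pid_file: Path) -> Optional[int]:
    """Return the pid named by the pidfile if that process exists, else None."""
    try:
        pid = int(pid_file.read_text().strip())
    except (OSError, ValueError):
        return None                      # missing or malformed pidfile
    if pid <= 0:
        return None                      # never probe pid 0 or process groups
    try:
        os.kill(pid, 0)                  # POSIX: signal 0 checks existence, sends nothing
    except (ProcessLookupError, OverflowError):
        return None                      # stale pidfile
    except PermissionError:
        return pid                       # process exists but belongs to another user
    return pid


def choose_transport(mesh_slug: str, home: Path) -> str:
    """'uds' when a daemon looks alive for this mesh, otherwise 'cold'."""
    pid_file = home / "daemon" / mesh_slug / "pid"
    return "uds" if live_daemon_pid(pid_file) is not None else "cold"
```

A live pid does not prove the process is the daemon, since pids get reused; a caller would still need to fall back to the cold path when the UDS connect fails.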
|
||||
|
||||
## 9. Service installation
|
||||
|
||||
```bash
|
||||
claudemesh daemon install-service # writes systemd unit / launchd plist / Windows SC
|
||||
claudemesh daemon uninstall-service
|
||||
```
|
||||
|
||||
Generated unit:
|
||||
- `Restart=on-failure`, `RestartSec=5s`
|
||||
- `MemoryMax=512M` (a ceiling the daemon should rarely approach)
|
||||
- `StandardOutput/Error=journal`
|
||||
- For systemd, runs as the invoking user (no root needed).
|
||||
|
||||
`claudemesh install` (the existing setup verb) gains an opt-in prompt: *"Install as a background service that always runs?"* For interactive users this is opt-in; for `--yes` it defaults to yes on Linux servers (detected by absence of TTY + presence of systemd).
|
||||
|
||||
## 10. Observability
|
||||
|
||||
```
claudemesh daemon status                human-readable: connected, lag, queue, hooks fired
claudemesh daemon status --json         machine-readable
claudemesh daemon logs [-f]             tail daemon.log
claudemesh daemon outbox                pending sends + dead-letter queue
claudemesh daemon inbox                 recent received messages (FTS-searchable)
claudemesh daemon metrics               prints /v1/metrics

# Prometheus counters/gauges:
cm_daemon_connected{mesh}               0/1
cm_daemon_reconnects_total{mesh,reason}
cm_daemon_lag_ms{mesh}                  last broker round-trip
cm_daemon_outbox_depth{mesh}
cm_daemon_outbox_dead_total{mesh}
cm_daemon_send_total{mesh,kind=topic|dm|broadcast,status}
cm_daemon_recv_total{mesh,kind=topic|dm,from_type=peer|apikey|webhook}
cm_daemon_hook_invocations_total{hook,exit}
cm_daemon_hook_duration_seconds{hook}   histogram
cm_daemon_ipc_request_total{endpoint,status}
cm_daemon_ipc_duration_seconds{endpoint} histogram
```
|
||||
|
||||
Tracing: optional OpenTelemetry export (`config.toml: [otel] endpoint = ...`) — emits spans for every IPC request + downstream broker call.
|
||||
|
||||
## 11. SDKs — three, all thin
|
||||
|
||||
The daemon's HTTP+UDS surface is the API; SDKs are convenience wrappers, not new surfaces.
|
||||
|
||||
**Python** (single file, stdlib only — no `requests`, no `aiohttp`):
|
||||
```python
from claudemesh import Daemon
cm = Daemon()                        # auto-discovers running daemon for current cwd's mesh
cm.send("@oncall", "OOM detected")
cm.topic.post("alerts", "build done", mentions=["alice"])
for evt in cm.events():              # SSE stream, blocking iterator
    if evt.kind == "message" and "@me" in evt.body:
        cm.send(evt.from_pubkey, "got it, on it")
```
|
||||
|
||||
**Go** (single file, stdlib only — no third-party deps):
|
||||
```go
|
||||
cm, _ := claudemesh.Connect()
|
||||
cm.Send(ctx, "@oncall", "OOM detected")
|
||||
for evt := range cm.Events(ctx) { ... }
|
||||
```
|
||||
|
||||
**TypeScript / Node** (zero runtime deps, ESM only):
|
||||
```ts
|
||||
import { Daemon } from "@claudemesh/daemon-client";
|
||||
const cm = await Daemon.connect();
|
||||
await cm.send("@oncall", "OOM detected");
|
||||
for await (const evt of cm.events()) { ... }
|
||||
```
|
||||
|
||||
Each is ~300 lines. All three are versioned in lockstep with the daemon's `/v1` surface. A `/v2` surface (when it eventually exists) keeps `/v1` alive indefinitely — old SDKs never break.
|
||||
|
||||
## 12. Security model — explicit boundaries
|
||||
|
||||
| Boundary | Trust | Mechanism |
|---|---|---|
| App ↔ Daemon (local) | OS user | UDS 0600, TCP loopback only |
| Daemon ↔ Broker | Mesh keypair | WSS + ed25519 hello sig + crypto_box DM envelopes + per-topic keys (existing model) |
| Hook ↔ Daemon (env) | OS user + filesystem | `hooks/` dir mode 0700; only files there execute; no remote install |
| Daemon ↔ Disk | OS user | All daemon files mode 0600/0644 under `~/.claudemesh/daemon/` |
|
||||
|
||||
**No new attack surface introduced by the daemon** — apps that previously could read `~/.claudemesh/config.json` directly already had full mesh access; the daemon just adds an IPC layer on top.
|
||||
|
||||
**Hook RCE consideration**: a peer cannot install a hook on your daemon. Hooks are files YOU put on disk. Inbound messages can only trigger hooks that already exist with content you wrote. The broker has no path to your hook directory.
|
||||
|
||||
## 13. Configuration — `config.toml`
|
||||
|
||||
```toml
|
||||
[daemon]
|
||||
mesh = "prod" # set on `daemon up --mesh`; immutable thereafter
|
||||
display_name = "runpod-worker-3"
|
||||
log_level = "info"
|
||||
|
||||
[ipc]
|
||||
http_port = 0 # 0 = auto-allocate
|
||||
http_bind = "127.0.0.1" # never 0.0.0.0; explicit if you know what you're doing
|
||||
uds_mode = "0600"
|
||||
|
||||
[outbox]
|
||||
max_queue_size = 10000
|
||||
max_age_hours = 168 # 7 days
|
||||
fsync_mode = "batched_50ms" # 'strict' | 'batched_50ms' | 'off'
|
||||
|
||||
[inbox]
|
||||
retention_days = 30
|
||||
fts_enabled = true
|
||||
|
||||
[reconnect]
|
||||
initial_backoff_ms = 500
|
||||
max_backoff_ms = 30000
|
||||
backoff_multiplier = 2.0
|
||||
jitter_pct = 25
|
||||
|
||||
[hooks]
|
||||
enabled = true
|
||||
concurrency = 8
|
||||
default_timeout_s = 30
|
||||
|
||||
[metrics]
|
||||
prometheus_enabled = true
|
||||
otel_endpoint = "" # empty = disabled
|
||||
```
|
||||
|
||||
User-editable. `claudemesh daemon reload` re-reads it without dropping the WS.
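
As an illustration of how the `[reconnect]` values might combine, here is a hedged sketch of the delay computation. The spec does not say how `jitter_pct` is applied; this version assumes symmetric jitter of plus or minus that percentage, applied after the cap. `backoff_ms` is a hypothetical name.

```python
import random
from typing import Optional


def backoff_ms(attempt: int, *, initial_ms: int = 500, max_ms: int = 30000,
               multiplier: float = 2.0, jitter_pct: int = 25,
               rng: Optional[random.Random] = None) -> float:
    """Delay before reconnect attempt number `attempt` (0-based)."""
    attempt = min(attempt, 64)           # keep the exponent small enough to avoid float overflow
    base = min(max_ms, initial_ms * multiplier ** attempt)
    if jitter_pct == 0:
        return base
    rng = rng or random.Random()
    spread = base * jitter_pct / 100     # ASSUMPTION: symmetric +/- jitter around the capped base
    return base + rng.uniform(-spread, spread)
```

With the defaults and jitter disabled, the sequence starts 500, 1000, 2000, 4000 and plateaus at 30000.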
|
||||
|
||||
## 14. Migration — what changes for existing users
|
||||
|
||||
- `claudemesh launch` (Claude Code mode) is unchanged. It can optionally `--via-daemon` to share the WS with a running daemon, but defaults to its own session (preserves "ephemeral session" semantics that Claude Code expects).
|
||||
- `claudemesh send X "msg"` and every other cold-path verb gets a transparent speedup when a daemon is up. No flag, no opt-in, no behavior difference visible to the user.
|
||||
- Existing `~/.claudemesh/config.json` is consumed unchanged by the daemon.
|
||||
- No DB migration. No broker changes. The daemon talks to the existing `/v1` HTTPS + WSS surfaces — broker doesn't even know whether a connection is `claudemesh launch` or `claudemesh daemon`.
|
||||
|
||||
---
|
||||
|
||||
## What needs review
|
||||
|
||||
Please critically review this spec for the v0.9.0 anchor. Specifically I want
|
||||
your hardest pushback on:
|
||||
|
||||
1. **Identity model** — persistent member by default vs ephemeral session. Have I
|
||||
missed a case where ephemeral is the right answer for a daemon? Should
|
||||
`--ephemeral` exist?
|
||||
2. **No-auth local IPC** — UDS 0600 + TCP loopback. Is "OS-trust is enough"
|
||||
actually safe in shared-tenant Linux (multi-user host, container
|
||||
side-channel)? Should there be a per-daemon token even locally?
|
||||
3. **SQLite outbox/inbox** — single writer, WAL, batched fsync. Is the
|
||||
exactly-once-via-idempotency-key claim defensible? What's the failure mode
|
||||
I'm glossing over?
|
||||
4. **Hooks fork-execing scripts** — RCE/data-exfil concerns I'm dismissing too
|
||||
easily? Should hooks be sandboxed (seccomp, no network, …)?
|
||||
5. **Auto-routing CLI verbs through daemon** — does this break composability
|
||||
with existing `claudemesh launch`? Race conditions when both are running?
|
||||
What about pidfile-stale detection?
|
||||
6. **One daemon per mesh** — why not one daemon serving all meshes, with mesh
|
||||
selection per-request? What does single-daemon actually buy beyond "fewer
|
||||
processes"?
|
||||
7. **The IPC surface duplicates the broker REST surface** — am I solving a
|
||||
problem the broker REST + per-mesh apikey already solves, with extra
|
||||
complexity for caching + queueing?
|
||||
8. **What's missing entirely** — auth boundaries, recovery flows, on-disk
|
||||
secret rotation, anything else a production daemon shipped with this spec
|
||||
would lack?
|
||||
|
||||
Score the spec on each axis: 1 = serious flaw, 5 = sound. Then list the
|
||||
top 3 changes you'd insist on before I write any code. Be ruthless — pre-launch
|
||||
window means I can break anything.
|
||||
@@ -0,0 +1,218 @@
|
||||
# `claudemesh daemon` — broker-hardening followups
|
||||
|
||||
> **Purpose**: refinements found during the v6 → v10 codex review series
|
||||
> that are real improvements but **not** v0.9.0 blockers. The
|
||||
> implementation target is `2026-05-03-daemon-spec-v0.9.0.md`. This
|
||||
> document lists what was deferred, why, and the trigger that promotes
|
||||
> each item to "must-do."
|
||||
>
|
||||
> **Background**: codex reviewed the daemon spec across 9 rounds (v1
|
||||
> through v10). Rounds 1–4 found load-bearing architectural issues
|
||||
> (identity, IPC auth, exactly-once lie, hook tokens, rotation, etc.).
|
||||
> Rounds 5–9 found progressively finer correctness issues inside one
|
||||
> subsystem (broker idempotency mechanics). v6 closed the architectural
|
||||
> review; v7–v10 are increasingly fine-grained idempotency-correctness
|
||||
> shavings on the same layer. Pre-launch (no users) doesn't need v7–v10
|
||||
> level rigor. We pulled the cheap wins into v0.9.0; the rest waits.
|
||||
|
||||
---
|
||||
|
||||
## 1. B0 dedupe fast-path before rate-limit (v10)
|
||||
|
||||
**What v10 said**: read `mesh.client_message_dedupe` BEFORE consulting
|
||||
the rate limiter. Existing id (match or mismatch) returns immediately
|
||||
without touching rate-limit budget.
|
||||
|
||||
**Why deferred**: v0.9.0 doesn't have meaningful rate-limit pressure on
|
||||
the daemon path. The split-brain failure (broker accepted, daemon
|
||||
believes failure due to rate-limit-rejection-on-retry) requires
|
||||
sustained saturated rate-limit windows, which don't exist pre-launch.
|
||||
|
||||
**Promote when**: any single mesh sees rate-limit rejections AND has
|
||||
daemon retries against committed ids. Telemetry to watch:
|
||||
`cm_broker_rate_limit_rejection_total` per mesh > 0 sustained.
|
||||
|
||||
**Implementation cost**: small — one indexed PK lookup before the
|
||||
existing limiter call. The work is mostly testing the race semantics.
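
A toy in-memory model of the ordering, to make the split-brain argument concrete. It is a sketch only: `BrokerSketch` is hypothetical, the rate limiter is reduced to a single counter, and the `429` status for a rate-limit rejection is an assumption (the spec does not name the code).

```python
import uuid
from typing import Dict, Optional, Tuple


class BrokerSketch:
    """Dedupe lookup (B0) runs before the rate limiter."""

    def __init__(self, budget: int) -> None:
        self.budget = budget             # remaining sends in the current window
        self.dedupe: Dict[Tuple[str, str], Tuple[str, bytes]] = {}

    def accept(self, mesh_id: str, client_message_id: str,
               fingerprint: bytes) -> Tuple[int, Optional[str]]:
        existing = self.dedupe.get((mesh_id, client_message_id))
        if existing is not None:         # B0: known id returns without touching the budget
            broker_message_id, stored = existing
            if stored == fingerprint:
                return (200, broker_message_id)
            return (409, None)
        if self.budget <= 0:             # limiter consulted only for unseen ids
            return (429, None)           # ASSUMPTION: 429 stands in for the rejection code
        self.budget -= 1
        broker_message_id = uuid.uuid4().hex
        self.dedupe[(mesh_id, client_message_id)] = (broker_message_id, fingerprint)
        return (201, broker_message_id)
```

With `budget=1`, a retry of the committed id still gets `200` after the window is exhausted, which is exactly the case the deferred fix protects.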
|
||||
|
||||
---
|
||||
|
||||
## 2. Lua-scripted idempotent rate limiter (v10)
|
||||
|
||||
**What v10 said**: limiter keyed by `(mesh_id, client_message_id,
|
||||
window_bucket)` so retries-within-window consume budget at most once.
|
||||
|
||||
**Why deferred**: depends on (1) above. Without B0 fast-path this is
|
||||
incremental complexity for marginal benefit. With B0 it becomes the
|
||||
right belt-and-suspenders fix for the rare race where two same-id
|
||||
requests both miss B0 simultaneously.
|
||||
|
||||
**Promote when**: B0 ships. Same trigger.
|
||||
|
||||
**Implementation cost**: medium — Lua script in Redis, careful TTL
|
||||
tuning, integration with existing limiter call sites.
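
The keying idea can be shown without Redis. The following is an in-memory stand-in, not the Lua script: `IdempotentLimiter` is a hypothetical name, the budget is assumed to be per mesh per window, and there is no TTL eviction of old buckets.

```python
from typing import Dict, Set, Tuple


class IdempotentLimiter:
    """Charges each (mesh_id, client_message_id, window_bucket) at most once."""

    def __init__(self, limit_per_window: int, window_s: int) -> None:
        self.limit = limit_per_window
        self.window_s = window_s
        self.charged: Set[Tuple[str, str, int]] = set()
        self.used: Dict[Tuple[str, int], int] = {}

    def try_consume(self, mesh_id: str, client_message_id: str, now_s: float) -> bool:
        bucket = int(now_s // self.window_s)
        key = (mesh_id, client_message_id, bucket)
        if key in self.charged:
            return True                  # retry inside the same window: already paid for
        count = self.used.get((mesh_id, bucket), 0)
        if count >= self.limit:
            return False                 # window saturated for new ids
        self.used[(mesh_id, bucket)] = count + 1
        self.charged.add(key)
        return True
```

In Redis the check-and-charge pair would have to be atomic, which is presumably why the proposal calls for a Lua script rather than two round trips.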
|
||||
|
||||
---
|
||||
|
||||
## 3. In-tx `mesh.mention_index` (v8)
|
||||
|
||||
**What v8 said**: mention-fanout index updates should commit inside the
|
||||
broker accept transaction so mention-search reads can never see a
|
||||
mention pointing at an uncommitted message.
|
||||
|
||||
**Why deferred**: the lag between accept-commit and async
|
||||
mention-indexer is small (single-digit milliseconds in expected
|
||||
deployment). Stale-read window during mention search is acceptable for
|
||||
v0.9.0; receivers learn of mentions via the `mention` event in their
|
||||
inbox stream regardless.
|
||||
|
||||
**Promote when**: real users complain about "I was mentioned but the
|
||||
mention search doesn't show it" with reproducible cases that don't
|
||||
self-heal in seconds.
|
||||
|
||||
**Implementation cost**: small — add `INSERT INTO mesh.mention_index`
|
||||
to the accept transaction. The async indexer becomes a backfill
|
||||
fallback rather than the primary path.
|
||||
|
||||
---
|
||||
|
||||
## 4. 4011 / 4012 close-code split (v6 §15.5)
|
||||
|
||||
**What v6 said**: split `4010 feature_unavailable` into three codes:
|
||||
`4010` (missing), `4011` (params invalid), `4012` (params below floor).
|
||||
|
||||
**Why deferred**: v0.9.0 ships single `4010` with structured
|
||||
`close_reason` JSON containing `kind`, `feature`, `detail`. Same
|
||||
diagnostic information, simpler protocol surface.
|
||||
|
||||
**Promote when**: ops tooling or external monitoring needs distinct
|
||||
status codes (e.g. PagerDuty rules that fire on 4012-only). Probably
|
||||
never; structured JSON is parseable.
|
||||
|
||||
**Implementation cost**: trivial — three constants and a switch on
|
||||
`close_reason.kind`.
|
||||
|
||||
---
|
||||
|
||||
## 5. Per-OS fingerprint precedence elaborate table (v8 §2.2.1)
|
||||
|
||||
**What v8 said**: comprehensive per-OS table covering Linux machine-id
|
||||
sources, macOS `IOPlatformUUID`, Windows `MachineGuid`, BSD
|
||||
`kern.hostuuid`, plus interface exclusion rules.
|
||||
|
||||
**Why deferred**: v0.9.0 ships with the simpler "machine-id ||
|
||||
first-stable-mac" rule from v6. Edge cases (cloud images,
|
||||
machine-id-not-readable, etc.) are documented when first hit.
|
||||
|
||||
**Promote when**: operators report fingerprint false-positives we can't
|
||||
explain from the v6 rule. Each report adds one row to the per-OS
|
||||
table.
|
||||
|
||||
**Implementation cost**: incremental — each OS-specific source is a
|
||||
small probe function with a fallback chain.
|
||||
|
||||
---
|
||||
|
||||
## 6. `request_fingerprint` schema-version-2 in feature negotiation (v6 §15.1)
|
||||
|
||||
**What v6 said**: `client_message_id_dedupe` feature parameters
|
||||
versioned independently. v0.9.0 ships at version 1 with a single
|
||||
`request_fingerprint: bool` flag.
|
||||
|
||||
**Why deferred**: we don't yet need parameterized fingerprint variants
|
||||
(different canonical forms, different hash algos). Version-bump path
|
||||
is documented; we'll use it when we add the second fingerprint mode.
|
||||
|
||||
**Promote when**: we want a fingerprint algo other than sha256/JCS
|
||||
(e.g. a faster hash, or a normalized canonical form).
|
||||
|
||||
**Implementation cost**: small — single feature-bit version bump
|
||||
following the documented pattern.
|
||||
|
||||
---
|
||||
|
||||
## 7. Force-expiry / quarantine semantics for `keypair-archive.json` (v8 §14.1.1)
|
||||
|
||||
**What v8 said**: `max_archived_keys` cap with force-expiry; explicit
|
||||
quarantine of malformed archive (`keypair-archive.json.malformed-<ts>`);
|
||||
duplicate `key_id` rejection; mode-mismatch warning behavior.
|
||||
|
||||
**Why deferred**: v0.9.0 ships the simpler v6 rule — drop expired
|
||||
entries on cleanup pass; refuse to start on malformed archive (loud,
|
||||
operator-actionable). The v8 elaboration makes archive corruption
|
||||
non-blocking, which is operationally nicer but trades off audit
|
||||
clarity.
|
||||
|
||||
**Promote when**: a real operator hits an archive corruption that
|
||||
shouldn't have brought the daemon down (e.g. mid-rotation crash leaves
|
||||
a partially-written archive).
|
||||
|
||||
**Implementation cost**: small — quarantine logic + one extra startup
|
||||
check.
|
||||
|
||||
---
|
||||
|
||||
## 8. Cross-language JCS conformance for `request_fingerprint` (v6 §4.4 round-6 question)
|
||||
|
||||
**What v6 asked**: does JCS work cross-language for
|
||||
`meta_canonical_json`? Python json.dumps, Go encoding/json, and JS
|
||||
JSON.stringify all behave differently. Should we ship a vetted JCS lib
|
||||
in each SDK?
|
||||
|
||||
**Why deferred from v0.9.0**: the daemon ships in TypeScript only for
|
||||
v0.9.0 (the `claudemesh-cli` package). Single-language JCS is trivial.
|
||||
SDK ports come post-v0.9.0.
|
||||
|
||||
**Promote when**: we ship the Python or Go SDK. Each SDK port gets a
|
||||
JCS conformance test against a corpus of envelopes.
|
||||
|
||||
**Implementation cost**: small per-language — a conformance fixture
|
||||
file and a unit test.
|
||||
|
||||
---
|
||||
|
||||
## Sprint 7 (this session) — what landed vs deferred
|
||||
|
||||
**Landed in code** (not yet deployed):
|
||||
- `packages/db/migrations/0028_message_queue_idempotency_fields.sql` adds
|
||||
nullable `client_message_id` and `request_fingerprint` columns to
|
||||
`mesh.message_queue` (additive, online-safe).
|
||||
- `apps/broker/src/broker.ts` — `queueMessage` and `drainForMember`
|
||||
thread the new columns through.
|
||||
- `apps/broker/src/index.ts` — `handleSend` picks them up from the
|
||||
daemon's wire envelope; outbound push echoes them back so receiving
|
||||
daemons can dedupe.
|
||||
- `apps/broker/src/types.ts` — `WSPushMessage` declares the optional
|
||||
fields.
|
||||
|
||||
**Deployment plan (not auto-applied)**:
|
||||
1. Apply migration against prod DB (the broker's filename-tracked
|
||||
migrator picks up `0028_*.sql` on next startup).
|
||||
2. Deploy the broker with the code changes via Coolify.
|
||||
3. Verify a daemon-originated send shows non-null `client_message_id`
|
||||
in `mesh.message_queue` afterwards.
|
||||
|
||||
**Still deferred** (full broker hardening):
|
||||
- `mesh.client_message_dedupe` table with `request_fingerprint BYTEA`
|
||||
and atomic accept transaction (spec §4.7).
|
||||
- Feature-bit advertisement on hello_ack of
|
||||
`client_message_id_dedupe` v1, with daemon-side enforcement (spec §15).
|
||||
- Partial unique index `(mesh_id, client_message_id) WHERE NOT NULL`.
|
||||
|
||||
These sit behind the same trigger as the followups above: do them when
|
||||
real users hit operational corners that this addressing doesn't cover.
|
||||
|
||||
---
|
||||
|
||||
## How to use this document
|
||||
|
||||
When picking up post-v0.9.0 work on the daemon:
|
||||
|
||||
1. Check whether any of the "promote when" triggers above have fired.
|
||||
2. If yes, consult the corresponding versioned spec (v6/v7/v8/v9/v10)
|
||||
for the full proposed change.
|
||||
3. Implement the lift, update `daemon-spec-v0.9.0.md` to reflect the
|
||||
merge, and remove the item from this followups list.
|
||||
|
||||
The versioned specs live in `.artifacts/specs/` indefinitely as a
|
||||
review-trail audit.
|
||||
680
.artifacts/shipped/2026-05-03-daemon-spec-v0.9.0.md
Normal file
@@ -0,0 +1,680 @@
|
||||
# `claudemesh daemon` — Implementation spec v0.9.0
|
||||
|
||||
> **Implementation target.** Locked from the v1–v10 codex-reviewed spec
|
||||
> series. This document is what we build for v0.9.0 of the daemon.
|
||||
>
|
||||
> **Base**: v6 (the round where the architecture passed codex's
|
||||
> structural review — request_fingerprint, dedupe table, atomicity
|
||||
> contract, feature-bit negotiation, key archive format).
|
||||
>
|
||||
> **Pulled in from v7–v9**: six cheap, load-bearing fixes that close
|
||||
> real v0.9.0-era bugs (not future-scale concerns):
|
||||
>
|
||||
> 1. `aborted` outbox status + audit columns (operator recovery without
|
||||
> destroying audit trail) — v7 §4.5.2
|
||||
> 2. `BEGIN IMMEDIATE` for daemon-local SQLite serialization (v6's
|
||||
> `SELECT FOR UPDATE` is invalid SQLite anyway) — v7 §4.5.1
|
||||
> 3. Daemon-local IPC duplicate lookup table over outbox states ×
|
||||
> fingerprint match/mismatch — v8 §4.5.1
|
||||
> 4. Phase B1/B2/B3 broker validation split (the concept; we don't need
|
||||
> the elaborate phase tables) — v7 §4.6.2
|
||||
> 5. Side-effect inventory (in-tx vs async) as an implementation comment
|
||||
> block — v8 §4.7.1
|
||||
> 6. Two-layer ID model wording: daemon-consumed iff outbox row,
|
||||
> broker-consumed iff dedupe row — v9 §4.1
|
||||
>
|
||||
> **Deferred to broker-hardening followups** (see
|
||||
> `2026-05-03-daemon-spec-broker-hardening-followups.md` for the full list and
|
||||
> rationale): B0 dedupe fast-path, Lua-scripted idempotent rate
|
||||
> limiter, in-tx mention_index, 4011/4012 close-code split, per-OS
|
||||
> fingerprint precedence table, request-fingerprint schema-v2 in
|
||||
> feature negotiation. These are real improvements but not v0.9.0
|
||||
> blockers; they land as the broker matures.
|
||||
>
|
||||
> **Intent §0 unchanged from v2.**
|
||||
|
||||
---
|
||||
|
||||
## 0. Intent — unchanged, see v2 §0
|
||||
|
||||
---
|
||||
|
||||
## 1. Process model — unchanged from v3 §1 / v2 §1
|
||||
|
||||
---
|
||||
|
||||
## 2. Identity — unchanged from v5 §2
|
||||
|
||||
---
|
||||
|
||||
## 3. IPC surface — unchanged from v4 §3
|
||||
|
||||
---
|
||||
|
||||
## 4. Delivery contract — at-least-once with **request-fingerprinted** dedupe
|
||||
|
||||
Codex r5: dedupe must compare the *whole request shape*, not just
|
||||
`(mesh, client_message_id)`. Otherwise a caller who reuses an idempotency
|
||||
key with a different destination or body silently drops the new send and
|
||||
gets the old send's metadata back.
|
||||
|
||||
### 4.1 The contract (precise)
|
||||
|
||||
> **Two-layer ID rule** (from v9): a `client_message_id` is
|
||||
> **daemon-consumed** iff an outbox row exists for it; **broker-consumed**
|
||||
> iff a dedupe row exists in `mesh.client_message_dedupe`. The two layers
|
||||
> are independent: a daemon-consumed id may or may not be broker-consumed
|
||||
> (depending on whether the send reached broker commit). In v0.9.0 there
|
||||
> are no daemon-bypass clients, so for practical purposes "daemon-consumed"
|
||||
> is the operative rule.
|
||||
>
|
||||
> **Local guarantee**: each successful `POST /v1/send` returns a stable
|
||||
> `client_message_id`. The send is durably persisted to `outbox.db` before
|
||||
> the response returns. The daemon enforces request-fingerprint
|
||||
> idempotency at the IPC layer (§4.5).
|
||||
>
|
||||
> **Local audit guarantee**: a `client_message_id` once written to
|
||||
> `outbox.db` is never released. Operator recovery via `requeue` always
|
||||
> mints a fresh id; the old row stays in `aborted` for audit. There is
|
||||
> no daemon-side path to free a used id.
|
||||
>
|
||||
> **Broker guarantee**: the broker maintains a dedupe record per accepted
|
||||
> `(mesh_id, client_message_id)` in `mesh.client_message_dedupe`. Each
|
||||
> dedupe record carries a canonical `request_fingerprint`. Retries with
|
||||
> the same id AND matching fingerprint collapse to the original
|
||||
> `broker_message_id`. Retries with mismatched fingerprint return
|
||||
> `409 idempotency_key_reused` and do **not** create a new message.
|
||||
>
|
||||
> **Atomicity guarantee**: dedupe row insertion, message row insertion,
|
||||
> and history row insertion happen in one broker DB transaction. Either
|
||||
> all land, or none do. No orphan dedupe rows.
|
||||
>
|
||||
> **End-to-end guarantee**: at-least-once delivery, with
|
||||
> `client_message_id` propagated to receivers' inboxes.
|
||||
|
||||
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
|
||||
|
||||
### 4.3 Broker schema — request fingerprint added (v6)
|
||||
|
||||
```sql
CREATE TABLE mesh.client_message_dedupe (
  mesh_id             UUID NOT NULL REFERENCES mesh.mesh(id) ON DELETE CASCADE,
  client_message_id   TEXT NOT NULL,

  -- The original accepted message; FK NOT enforced because the message row
  -- may be GC'd by retention sweeps before the dedupe row expires.
  broker_message_id   UUID NOT NULL,

  -- Canonical fingerprint of the original request. Recomputed on every
  -- duplicate retry; mismatch → 409 idempotency_key_reused. Schema in §4.4.
  request_fingerprint BYTEA NOT NULL,   -- 32-byte sha256

  destination_kind    TEXT NOT NULL CHECK(destination_kind IN ('topic','dm','queue')),
  destination_ref     TEXT NOT NULL,
  first_seen_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  expires_at          TIMESTAMPTZ,      -- NULL = `permanent` mode
  history_available   BOOLEAN NOT NULL DEFAULT TRUE,  -- flipped FALSE when message row GC'd

  PRIMARY KEY (mesh_id, client_message_id)
);

CREATE INDEX client_message_dedupe_expires_idx
  ON mesh.client_message_dedupe(expires_at)
  WHERE expires_at IS NOT NULL;

ALTER TABLE mesh.topic_message ADD COLUMN client_message_id TEXT;
ALTER TABLE mesh.message_queue ADD COLUMN client_message_id TEXT;
```
|
||||
|
||||
**`status` column dropped (codex r5)**. Rejected requests do **not**
|
||||
consume idempotency keys. Rationale below in §4.6.
|
||||
|
||||
### 4.4 Request fingerprint — canonical form (NEW v6)
|
||||
|
||||
The fingerprint covers everything that makes a send semantically distinct.
|
||||
A retry must reproduce the same fingerprint bit-for-bit; anything else is
|
||||
a different send and must not be collapsed.
|
||||
|
||||
```
request_fingerprint = sha256(
  envelope_version     || 0x00 ||
  destination_kind     || 0x00 ||
  destination_ref      || 0x00 ||
  reply_to_id_or_empty || 0x00 ||
  priority             || 0x00 ||
  meta_canonical_json  || 0x00 ||
  body_hash
)
```
|
||||
|
||||
Where:
|
||||
- `envelope_version`: integer string (e.g. `"1"`). Bumps when the envelope
|
||||
shape changes.
|
||||
- `destination_kind`: `topic`, `dm`, or `queue`.
|
||||
- `destination_ref`: topic name, recipient ed25519 pubkey hex, or queue id.
|
||||
- `reply_to_id_or_empty`: original `broker_message_id` or empty string.
|
||||
- `priority`: `now`, `next`, or `low`.
|
||||
- `meta_canonical_json`: the `meta` field, serialized with sorted keys,
|
||||
no whitespace, escape-canonical (RFC 8785 JCS). Empty meta = empty string.
|
||||
- `body_hash`: sha256(body bytes), hex.
|
||||
|
||||
The fingerprint is computed:
|
||||
1. **Daemon-side** before durable outbox persistence — stored as
|
||||
`outbox.request_fingerprint` (NEW column) so retries always produce
|
||||
the same fingerprint regardless of caller behavior.
|
||||
2. **Broker-side** on first receipt — stored in
|
||||
`client_message_dedupe.request_fingerprint`.
|
||||
3. **Broker-side** on every duplicate retry — recomputed and compared
|
||||
byte-equal to the stored value.
|
||||
|
||||
If the daemon and broker disagree on the canonical form (e.g. JCS
|
||||
implementation drift), the broker emits
|
||||
`cm_broker_dedupe_fingerprint_mismatch_total{client_id, mesh_id}` and
|
||||
returns `409 idempotency_key_reused` with a body that includes the
|
||||
broker's fingerprint hex for debugging. Daemons that see this should
|
||||
log it loudly and stop retrying that outbox row (it goes to `dead`).
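
A minimal sketch of the fingerprint computation as described in this section. Two things here are assumptions rather than spec: text fields are encoded as UTF-8 before concatenation, and `json.dumps` with sorted keys and compact separators is used as a stand-in for RFC 8785 JCS. That stand-in matches JCS only for simple metadata (string values, ASCII keys); number formatting and key ordering for non-ASCII keys can differ, so a vetted JCS library would be needed for real interop.

```python
import hashlib
import json
from typing import Optional


def canonical_meta(meta: Optional[dict]) -> str:
    """Approximate JCS form of `meta`; empty meta serializes to the empty string."""
    if not meta:
        return ""
    # ASSUMPTION: good enough for simple string-valued meta, not full RFC 8785.
    return json.dumps(meta, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def request_fingerprint(*, envelope_version: int, destination_kind: str,
                        destination_ref: str, reply_to_id: Optional[str],
                        priority: str, meta: Optional[dict], body: bytes) -> bytes:
    """32-byte sha256 over the 0x00-separated canonical fields."""
    fields = [
        str(envelope_version),
        destination_kind,
        destination_ref,
        reply_to_id or "",
        priority,
        canonical_meta(meta),
        hashlib.sha256(body).hexdigest(),   # body_hash: sha256 of the body bytes, hex
    ]
    joined = b"\x00".join(f.encode("utf-8") for f in fields)
    return hashlib.sha256(joined).digest()
```

Because `client_message_id` is not one of the inputs, two sends with different ids and identical content share a fingerprint; the id and the fingerprint are compared separately.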
|
||||
|
||||
### 4.5 Daemon-local idempotency at the IPC layer (from v8)
|
||||
|
||||
The daemon enforces fingerprint idempotency **before** the request hits
|
||||
`outbox.db` so a caller bug never creates duplicate-key/mismatch-payload
|
||||
state at all.
|
||||
|
||||
#### 4.5.1 IPC accept algorithm
|
||||
|
||||
On `POST /v1/send`:
|
||||
|
||||
1. Validate request envelope (auth, schema, size limits, destination
|
||||
resolvable). Failures here return `4xx` immediately. **No outbox
|
||||
row is written; the `client_message_id` is not consumed.**
|
||||
2. Compute `request_fingerprint` (§4.4).
|
||||
3. Open a SQLite transaction with `BEGIN IMMEDIATE` so a concurrent IPC
|
||||
accept on the same id serializes against this one. `BEGIN IMMEDIATE`
|
||||
acquires the RESERVED lock at transaction start; SQLite has no
|
||||
row-level lock and `SELECT FOR UPDATE` is not supported.
|
||||
4. `SELECT id, request_fingerprint, status, broker_message_id,
|
||||
last_error FROM outbox WHERE client_message_id = ?`.
|
||||
5. Apply the lookup table below. For the "(no row)" case, INSERT inside
|
||||
the same transaction.
|
||||
6. COMMIT.
|
||||
|
||||
| Existing row state | Fingerprint | Daemon response |
|---|---|---|
| (no row) | — | INSERT new outbox row `pending`; return `202 accepted, queued` |
| `pending` | match | Return `202 accepted, queued`. No mutation |
| `pending` | mismatch | Return `409`, `conflict: "outbox_pending_fingerprint_mismatch"` |
| `inflight` | match | Return `202 accepted, inflight`. No mutation |
| `inflight` | mismatch | Return `409`, `conflict: "outbox_inflight_fingerprint_mismatch"` |
| `done` | match | Return `200 ok, duplicate: true, broker_message_id, history_id`. No broker call |
| `done` | mismatch | Return `409`, `conflict: "outbox_done_fingerprint_mismatch", broker_message_id` |
| `dead` | match | Return `409`, `conflict: "outbox_dead_fingerprint_match", reason: "<last_error>"` |
| `dead` | mismatch | Return `409`, `conflict: "outbox_dead_fingerprint_mismatch"` |
| `aborted` | match | Return `409`, `conflict: "outbox_aborted_fingerprint_match"`. Operator-retired id, never reusable |
| `aborted` | mismatch | Return `409`, `conflict: "outbox_aborted_fingerprint_mismatch"` |
|
||||
|
||||
Every `409` carries the daemon's `request_fingerprint` (8-byte hex
|
||||
prefix) for client/server canonical-form-drift debugging. A
|
||||
`client_message_id` written to `outbox.db` is permanently bound to that
|
||||
row's lifecycle — the only "free" state is "no row exists".
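
The lookup table is regular enough to express as a pure function. The sketch below reproduces the table's status codes and conflict strings; `IpcResponse` and `ipc_duplicate_response` are hypothetical names, and the extra response fields (`broker_message_id`, `reason`, fingerprint prefix) are left out for brevity.

```python
from typing import NamedTuple, Optional

ROW_STATES = ("pending", "inflight", "done", "dead", "aborted")


class IpcResponse(NamedTuple):
    http_status: int
    detail: str                          # queue state on success, conflict name on 409


def ipc_duplicate_response(row_status: Optional[str],
                           fingerprint_matches: bool) -> IpcResponse:
    """Map (existing outbox row state, fingerprint comparison) to a response."""
    if row_status is None:
        return IpcResponse(202, "queued")            # caller INSERTs the pending row
    if row_status not in ROW_STATES:
        raise ValueError(f"unknown outbox status: {row_status!r}")
    if not fingerprint_matches:
        return IpcResponse(409, f"outbox_{row_status}_fingerprint_mismatch")
    if row_status == "pending":
        return IpcResponse(202, "queued")
    if row_status == "inflight":
        return IpcResponse(202, "inflight")
    if row_status == "done":
        return IpcResponse(200, "duplicate")
    # dead and aborted ids are never reusable, even with a matching fingerprint
    return IpcResponse(409, f"outbox_{row_status}_fingerprint_match")
```

Only three of the eleven table rows succeed without inserting anything: a matching fingerprint on a `pending`, `inflight`, or `done` row.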
|
||||
|
||||
#### 4.5.2 Outbox table
|
||||
|
||||
```sql
CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,            -- 32 bytes
  payload             BLOB NOT NULL,
  enqueued_at         INTEGER NOT NULL,
  attempts            INTEGER DEFAULT 0,
  next_attempt_at     INTEGER NOT NULL,
  status              TEXT CHECK(status IN
                        ('pending','inflight','done','dead','aborted')),
  last_error          TEXT,
  delivered_at        INTEGER,
  broker_message_id   TEXT,
  aborted_at          INTEGER,                  -- v7
  aborted_by          TEXT,                     -- v7: operator/auto
  superseded_by       TEXT                      -- v7: id of requeue successor
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
CREATE INDEX outbox_aborted ON outbox(status, aborted_at) WHERE status = 'aborted';
```
|
||||
|
||||
`aborted_at` / `aborted_by` / `superseded_by` give operators a clear
|
||||
audit trail. `superseded_by` lets `outbox inspect` show the chain when
|
||||
a row is requeued multiple times. `request_fingerprint` is computed
|
||||
once at IPC accept time and frozen for the row's lifecycle.
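
Steps 3 to 6 of the accept algorithm in §4.5.1 can be exercised against this table with the standard library. The sketch assumes a trimmed-down copy of the schema, millisecond integer timestamps, and a random hex row id; none of those are specified here. `open_outbox` and `accept_send` are hypothetical names.

```python
import sqlite3
import time
import uuid
from typing import Tuple


def open_outbox(path: str = ":memory:") -> sqlite3.Connection:
    # isolation_level=None: no implicit transactions, BEGIN/COMMIT are issued by hand
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("""CREATE TABLE IF NOT EXISTS outbox (
        id TEXT PRIMARY KEY,
        client_message_id TEXT NOT NULL UNIQUE,
        request_fingerprint BLOB NOT NULL,
        payload BLOB NOT NULL,
        enqueued_at INTEGER NOT NULL,
        next_attempt_at INTEGER NOT NULL,
        status TEXT CHECK(status IN ('pending','inflight','done','dead','aborted')))""")
    return conn


def accept_send(conn: sqlite3.Connection, client_message_id: str,
                fingerprint: bytes, payload: bytes) -> Tuple[str, str]:
    """Return (outcome, row status); outcome is 'inserted', 'match' or 'mismatch'."""
    conn.execute("BEGIN IMMEDIATE")      # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT request_fingerprint, status FROM outbox WHERE client_message_id = ?",
            (client_message_id,),
        ).fetchone()
        if row is None:
            now = int(time.time() * 1000)    # ASSUMPTION: epoch milliseconds
            conn.execute(
                "INSERT INTO outbox (id, client_message_id, request_fingerprint, payload,"
                " enqueued_at, next_attempt_at, status) VALUES (?, ?, ?, ?, ?, ?, 'pending')",
                (uuid.uuid4().hex, client_message_id, fingerprint, payload, now, now),
            )
            result = ("inserted", "pending")
        else:
            stored, status = row
            result = ("match" if bytes(stored) == fingerprint else "mismatch", status)
        conn.execute("COMMIT")
        return result
    except BaseException:
        conn.execute("ROLLBACK")
        raise
```

The returned pair is what the lookup table in §4.5.1 is keyed on; a second connection calling `accept_send` for the same id would wait on the write lock (up to its busy timeout) rather than race the insert.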
|
||||
|
||||
#### 4.5.3 Operator recovery via `requeue`
|
||||
|
||||
```
claudemesh daemon outbox requeue --id <outbox_row_id>
    [--new-client-id <id> | --auto]
    [--patch-payload <path>]
```
|
||||
|
||||
Atomically (single SQLite transaction):
|
||||
1. Marks the existing row `aborted`, sets `aborted_at = now`,
|
||||
`aborted_by = "operator"`. Row is **never deleted** — audit trail
|
||||
permanent.
|
||||
2. Mints a fresh `client_message_id` (caller-supplied or auto-ulid).
|
||||
3. Inserts a new outbox row `pending` with the fresh id and the same
|
||||
payload (or patched if `--patch-payload`).
|
||||
4. Sets `superseded_by = <new_row_id>` on the old row.
|
||||
|
||||
The old `client_message_id` is permanently dead. There is no path for
|
||||
an id to become free again.
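
The four requeue steps fit in one SQLite transaction. The sketch below is an illustration under assumptions: it uses a reduced schema, substitutes `uuid4` for the auto-ulid the spec mentions, restricts requeue to `dead` and `pending` rows (the states named in §4.6.1), and omits `--patch-payload`. `requeue` and `SCHEMA` are hypothetical names.

```python
import sqlite3
import time
import uuid
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS outbox (
  id TEXT PRIMARY KEY,
  client_message_id TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  payload BLOB NOT NULL,
  enqueued_at INTEGER NOT NULL,
  next_attempt_at INTEGER NOT NULL,
  status TEXT CHECK(status IN ('pending','inflight','done','dead','aborted')),
  aborted_at INTEGER,
  aborted_by TEXT,
  superseded_by TEXT
)
"""


def requeue(conn: sqlite3.Connection, row_id: str,
            new_client_id: Optional[str] = None) -> str:
    """Abort `row_id` and insert a pending successor; returns the new row id."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT payload, request_fingerprint, status FROM outbox WHERE id = ?",
            (row_id,),
        ).fetchone()
        if row is None:
            raise KeyError(row_id)
        payload, fingerprint, status = row
        if status not in ("dead", "pending"):
            raise ValueError(f"cannot requeue a row in state {status!r}")
        now = int(time.time() * 1000)
        new_row_id = uuid.uuid4().hex
        fresh_client_id = new_client_id or uuid.uuid4().hex  # stand-in for auto-ulid
        conn.execute(
            "INSERT INTO outbox (id, client_message_id, request_fingerprint, payload,"
            " enqueued_at, next_attempt_at, status) VALUES (?, ?, ?, ?, ?, ?, 'pending')",
            (new_row_id, fresh_client_id, fingerprint, payload, now, now),
        )
        conn.execute(
            "UPDATE outbox SET status = 'aborted', aborted_at = ?, aborted_by = 'operator',"
            " superseded_by = ? WHERE id = ?",
            (now, new_row_id, row_id),   # old row is updated, never deleted
        )
        conn.execute("COMMIT")
        return new_row_id
    except BaseException:
        conn.execute("ROLLBACK")
        raise
```

The fingerprint is copied unchanged because it does not cover `client_message_id`; a patched payload would need it recomputed.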
|
||||
|
||||
### 4.5b Broker duplicate response — three cases
|
||||
|
||||
| Case | HTTP/WS code | Body |
|---|---|---|
| First insert | `201 created` | `{ broker_message_id, client_message_id, history_id, duplicate: false }` |
| Duplicate, fingerprint match | `200 ok` | `{ broker_message_id, client_message_id, history_id, duplicate: true, history_available, first_seen_at }` |
| Duplicate, fingerprint mismatch | `409 idempotency_key_reused` | `{ client_message_id, conflict: "request_fingerprint_mismatch", broker_fingerprint_prefix: "ab12cd34..." }` (first 8 bytes hex) |
|
||||
|
||||
Daemon outcomes:
|
||||
- `201` → mark outbox row `done`, store `broker_message_id`.
|
||||
- `200 duplicate` with `history_available: true` → mark `done`, log INFO.
|
||||
- `200 duplicate` with `history_available: false` → mark `done`, log WARN.
|
||||
- `409 idempotency_key_reused` → mark outbox row `dead`. Operator runs
|
||||
`outbox requeue` (§4.5.3); old id stays `aborted`, new id is fresh.
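
A minimal sketch of the daemon outcomes listed above, written as a pure function. The `BrokerResponse` and `Outcome` types and the function name are hypothetical; the mapping from response to outbox status and log level follows the list. The log level for `409` is not stated in this section, so the sketch leaves it unset.

```typescript
// Hypothetical response shapes, reduced to the fields the daemon branches on.
type BrokerResponse =
  | { code: 201; broker_message_id: string }
  | { code: 200; duplicate: true; history_available: boolean }
  | { code: 409; conflict: "request_fingerprint_mismatch" };

interface Outcome {
  status: "done" | "dead";
  logLevel: "INFO" | "WARN" | null;
  storedBrokerMessageId?: string;
}

function applyBrokerResponse(res: BrokerResponse): Outcome {
  switch (res.code) {
    case 201:
      // First insert: mark done and keep the broker's id.
      return { status: "done", logLevel: null, storedBrokerMessageId: res.broker_message_id };
    case 200:
      // Fingerprint-matching duplicate: done either way, WARN if history was pruned.
      return { status: "done", logLevel: res.history_available ? "INFO" : "WARN" };
    case 409:
      // Fingerprint mismatch: the row is dead until an operator requeues it.
      return { status: "dead", logLevel: null };
  }
}
```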
|
||||
|
||||
### 4.6 Rejected-request semantics — id consumed iff outbox row written
|
||||
|
||||
> **Rule**: a `client_message_id` is daemon-consumed iff the daemon
|
||||
> writes an outbox row. Anything that fails before outbox insertion
|
||||
> (auth, schema, size, destination not resolvable) leaves the id
|
||||
> untouched and freely reusable.
|
||||
|
||||
#### 4.6.1 Daemon-side rejection phasing
|
||||
|
||||
| Phase | When daemon rejects | Outbox row? | Caller may reuse id? |
|---|---|---|---|
| **A. IPC validation** (auth, schema, size, destination resolvable) | Before §4.5.1 step 3 | No | Yes — id never consumed |
| **B. Outbox stored, broker network/transient failure** | After IPC accept, broker `5xx` or timeout | `pending` → retried | N/A — daemon owns retries |
| **C. Outbox stored, broker permanent rejection** | Broker returns `4xx` after IPC accept | `dead` | No — rotate via `requeue` |
| **D. Operator retirement** | Operator runs `requeue` on `dead` or `pending` row | `aborted` (audit) + new row with fresh id | Old id NEVER reusable; new id is fresh |
|
||||
|
||||
#### 4.6.2 Broker-side rejection phasing (B1 / B2 / B3)
|
||||
|
||||
The broker validates in three phases relative to dedupe-row insertion:
|
||||
|
||||
| Phase | Validation | Side effects | Result for direct broker callers (none in v0.9.0) |
|---|---|---|---|
| **B1. Pre-dedupe-claim** | Auth (mesh membership), schema, size, mesh exists, member exists, destination kind valid, payload bytes ≤ `max_payload.inline_bytes`, rate limit not exceeded | None | `4xx`. No dedupe row. Direct broker caller may retry with same id |
| **B2. Post-dedupe-claim** (in-tx) | destination_ref existence (topic exists, member subscribed, etc.) | INSERT into dedupe rolled back | `4xx`, transaction rolled back, no dedupe row remains. Direct broker caller may retry with same id |
| **B3. Accepted** | All side effects commit atomically | Dedupe row, message row, history row, delivery_queue rows | `201` with `broker_message_id` |
|
||||
|
||||
**Daemon-mediated callers (the only path in v0.9.0)** see only the
|
||||
daemon-layer rules of §4.6.1: any broker `4xx` after IPC accept lands
|
||||
the outbox row in `dead`. Daemon-mediated callers MUST rotate via
|
||||
`requeue` (§4.5.3); the daemon-consumed id is never reusable
|
||||
regardless of whether the broker layer sees a dedupe row. The "may
|
||||
retry with same id" wording above describes broker-bypass callers
|
||||
only, which v0.9.0 does not have.
|
||||
|
||||
**Critical guarantee**: there is no broker code path where a permanent
|
||||
4xx leaves a dedupe row behind. Either the request committed and a
|
||||
dedupe row exists (B3), or it didn't and no dedupe row exists (B1, B2).
|
||||
"Dedupe row exists" is the unambiguous signal of "id consumed at the
|
||||
broker layer."
|
||||
|
||||
If the broker decides post-commit that an accepted message is invalid
|
||||
(async content-policy job), that's NOT a permanent rejection — it's a
|
||||
follow-up moderation event that operates on the `broker_message_id`,
|
||||
not on the dedupe key.
|
||||
|
||||
Net result: `client_message_dedupe` rows only exist when the broker
|
||||
**successfully** accepted a message and committed it. The single source
|
||||
of truth for "was this idempotency key consumed?" is the existence of
|
||||
the dedupe row. No status enum, no ambiguous states.
|
||||
|
||||
### 4.7 Broker atomicity contract
|
||||
|
||||
#### 4.7.1 Side-effect inventory
|
||||
|
||||
Every successful broker accept atomically commits these durable state
|
||||
changes in **one transaction**:
|
||||
|
||||
| Effect | Table | Why in-tx |
|---|---|---|
| Dedupe record | `mesh.client_message_dedupe` | Idempotency authority |
| Message body | `mesh.topic_message` / `mesh.message_queue` | Authoritative store |
| History row | `mesh.message_history` | Replay log; lost-on-rollback breaks ordered replay |
| Fan-out work | `mesh.delivery_queue` | Each recipient must see exactly committed messages |
|
||||
|
||||
**Outside the transaction** (non-authoritative or rebuildable):
|
||||
- WS push to live subscribers — best-effort live notifications.
|
||||
- Webhook fan-out — async via `delivery_queue` workers.
|
||||
- Rate-limit counters — telemetry only; authority is the external
|
||||
limiter checked in B1.
|
||||
- Audit log entries — append-only stream; rebuildable from history.
|
||||
- Search/FTS index updates — async via outbox-pattern worker.
|
||||
- Mention index updates — async (deferred in-tx promotion to followups
|
||||
doc).
|
||||
- Metrics — Prometheus, pull-based.
|
||||
|
||||
If any in-transaction insert fails, the transaction rolls back
|
||||
completely. The broker returns `5xx` to the daemon, which retries. No partial
|
||||
state.
|
||||
|
||||
#### 4.7.2 Pseudocode
|
||||
|
||||
```sql
-- Pre-generate broker_message_id (ulid) in code, pass in.
BEGIN;

-- Step 1: try to claim the idempotency key.
INSERT INTO mesh.client_message_dedupe
  (mesh_id, client_message_id, broker_message_id, request_fingerprint,
   destination_kind, destination_ref, expires_at)
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
        $dest_kind, $dest_ref, $expires_at)
ON CONFLICT (mesh_id, client_message_id) DO NOTHING;

-- Step 2: inspect what's actually there now (ours or someone else's).
SELECT broker_message_id, request_fingerprint, destination_kind,
       destination_ref, history_available, first_seen_at
FROM mesh.client_message_dedupe
WHERE mesh_id = $mesh_id AND client_message_id = $client_id
FOR SHARE;

-- Branch:
--   row.broker_message_id == $msg_id → first insert; continue.
--   row.broker_message_id != $msg_id → duplicate. Compare fingerprints:
--     match    → ROLLBACK; return 200 duplicate.
--     mismatch → ROLLBACK; return 409 idempotency_key_reused.

-- Step 3: validate Phase B2 (destination_ref existence — topic exists,
-- member subscribed, etc.). If B2 fails → ROLLBACK; return 4xx (no
-- dedupe row remains).

-- Step 4: insert in-tx side effects (§4.7.1).
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
VALUES ($msg_id, $mesh_id, $client_id, ...);

INSERT INTO mesh.message_history (broker_message_id, mesh_id, ...)
VALUES ($msg_id, $mesh_id, ...);

INSERT INTO mesh.delivery_queue (broker_message_id, recipient_pubkey, ...)
SELECT $msg_id, member_pubkey, ...
FROM mesh.topic_subscription
WHERE topic = $dest_ref AND mesh_id = $mesh_id;

COMMIT;
```
|
||||
|
||||
The branch logic determines the response shape (`201` / `200 duplicate`
|
||||
/ `409 idempotency_key_reused`) before COMMIT. The duplicate and 409
|
||||
branches always ROLLBACK because nothing else needs to commit.
|
||||
`SELECT … FOR SHARE` blocks concurrent writers from upgrading the same
|
||||
dedupe row mid-transaction.
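
The branch logic can also be modelled in memory to check the three response shapes against each other. This is a single-threaded sketch, not the broker implementation: `BrokerModel` and its methods are hypothetical, the `404` code is only a stand-in for the unspecified B2 `4xx`, and checking for an existing row before inserting is equivalent to the insert-then-select pattern only because nothing here runs concurrently.

```typescript
interface DedupeRow {
  brokerMessageId: string;
  fingerprint: string;
}

type AcceptResult =
  | { code: 201; brokerMessageId: string; duplicate: false }
  | { code: 200; brokerMessageId: string; duplicate: true }
  | { code: 409; conflict: "request_fingerprint_mismatch" }
  | { code: 404; reason: string }; // assumed stand-in for a B2 4xx

class BrokerModel {
  private dedupe = new Map<string, DedupeRow>();
  private messages = new Map<string, string>();
  private topics: Set<string>;

  constructor(topics: string[]) {
    this.topics = new Set(topics);
  }

  accept(meshId: string, clientId: string, msgId: string,
         fingerprint: string, topic: string, body: string): AcceptResult {
    const key = `${meshId}/${clientId}`;
    const existing = this.dedupe.get(key);
    if (existing) {
      // Duplicate branches: nothing to commit, so nothing is written.
      return existing.fingerprint === fingerprint
        ? { code: 200, brokerMessageId: existing.brokerMessageId, duplicate: true }
        : { code: 409, conflict: "request_fingerprint_mismatch" };
    }
    // Phase B2: a failure here must leave no dedupe row behind.
    if (!this.topics.has(topic)) {
      return { code: 404, reason: "topic_not_found" };
    }
    // Phase B3: dedupe row and message row are written together.
    this.dedupe.set(key, { brokerMessageId: msgId, fingerprint });
    this.messages.set(msgId, body);
    return { code: 201, brokerMessageId: msgId, duplicate: false };
  }

  hasDedupeRow(meshId: string, clientId: string): boolean {
    return this.dedupe.has(`${meshId}/${clientId}`);
  }

  messageCount(): number {
    return this.messages.size;
  }
}
```

A retry with the same fingerprint returns the original `brokerMessageId` rather than the freshly generated one, which is what lets the daemon converge on a single `done` transition.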
|
||||
|
||||
#### 4.7.3 Failure modes
|
||||
|
||||
- Crash before `COMMIT`: all rows roll back. Next daemon retry inserts
|
||||
cleanly.
|
||||
- Crash after `COMMIT` but before WS ACK: dedupe row exists. Daemon
|
||||
retries → fingerprint matches → `200 duplicate`. Net: exactly one
|
||||
broker-accepted row, one daemon `done` transition.
|
||||
- Constraint violation on message row insert: rolls back the whole tx.
|
||||
`5xx` to daemon. Same fingerprint reproduces; daemon eventually
|
||||
marks `dead`. No orphan dedupe row.
|
||||
|
||||
A nightly orphan-check job (counter `cm_broker_dedupe_orphan_check_total`)
|
||||
validates that every `client_message_dedupe` row has a matching
|
||||
`topic_message` / `message_queue` row OR the matching row has been
|
||||
retention-pruned (`history_available = FALSE`). Inconsistencies logged
|
||||
as `cm_broker_dedupe_orphan_found{mesh_id}` for human review.
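
The orphan predicate is small enough to state directly. The sketch below assumes the job can load dedupe rows and the set of surviving message ids into memory; the record shape and function name are hypothetical. A row counts as an orphan only when its message row is missing and `history_available` is still true.

```typescript
interface DedupeRecord {
  meshId: string;
  clientMessageId: string;
  brokerMessageId: string;
  historyAvailable: boolean;
}

// Returns orphan counts per mesh, matching the {mesh_id} label on the metric.
function orphanCountsByMesh(
  dedupeRows: DedupeRecord[],
  storedMessageIds: ReadonlySet<string>,
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of dedupeRows) {
    const messageMissing = !storedMessageIds.has(row.brokerMessageId);
    // A missing message is expected once retention has pruned it.
    if (messageMissing && row.historyAvailable) {
      counts.set(row.meshId, (counts.get(row.meshId) ?? 0) + 1);
    }
  }
  return counts;
}
```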
|
||||
|
||||
### 4.8 Outbox schema
|
||||
|
||||
The authoritative outbox schema for v0.9.0 is in §4.5.2 (includes
|
||||
`aborted` status and audit columns from the v7 pull). `request_fingerprint`
|
||||
is computed at IPC accept time and frozen for the row's lifecycle —
|
||||
the daemon never recomputes from `payload` post-enqueue (would produce
|
||||
drift if envelope_version changes between daemon runs).
|
||||
|
||||
### 4.9 Outbox max-age math — bounded (v6)
|
||||
|
||||
Codex r5: the v5 formula `(dedupe_retention_days * 24) - 24h_margin`
|
||||
breaks at `dedupe_retention_days = 1` (yields zero) and is undefined
for values below 1.
|
||||
|
||||
v6 formula and bounds:
|
||||
|
||||
- **Minimum supported broker dedupe retention**: 3 days. Daemon refuses
|
||||
to start if broker advertises `dedupe_retention_days < 3` (treats it
|
||||
as `feature_param_below_floor` and exits on close code 4010, per §15.5).
|
||||
- **Daemon `max_age_hours` derivation**:
|
||||
- `permanent` mode → daemon uses config default (168h = 7d), cap 720h
|
||||
(30d).
|
||||
- `retention_scoped` mode → daemon `max_age_hours = max(72,
|
||||
(dedupe_retention_days * 24) - safety_margin_hours)` where
|
||||
`safety_margin_hours = max(24, ceil(dedupe_retention_days * 0.1 *
|
||||
24))`. For `dedupe_retention_days=3` this gives
|
||||
`max(72, 72-24) = 72h`. For 30 days: `max(72, 720-72) = 648h`. For
|
||||
365 days: `max(72, 8760-876) = 7884h`.
|
||||
- The 72h floor prevents the daemon outbox from being uselessly short
|
||||
— three days is enough margin for normal operator response to a
|
||||
paged outage.
|
||||
|
||||
- Operator override allowed via `[outbox] max_age_hours_override = N`,
|
||||
but if `N` exceeds `dedupe_retention_days * 24 - 1` daemon refuses to
|
||||
start with `outbox_max_age_above_dedupe_window`. The override exists
|
||||
for the rare case of a much-shorter-than-default outbox; it does not
|
||||
exist to circumvent the broker's dedupe window.
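
The derivation and the override check above can be written out as follows. This is a sketch under the stated formula only: the function names and the plain `Error` messages are hypothetical, and the margin is computed as `ceil(days * 24 / 10)`, which equals `ceil(days * 0.1 * 24)` while avoiding floating-point drift.

```typescript
type DedupeMode = "permanent" | "retention_scoped";

const MIN_RETENTION_DAYS = 3;
const FLOOR_HOURS = 72;
const PERMANENT_DEFAULT_HOURS = 168; // 7d
const PERMANENT_CAP_HOURS = 720;     // 30d

function safetyMarginHours(retentionDays: number): number {
  return Math.max(24, Math.ceil((retentionDays * 24) / 10));
}

function deriveMaxAgeHours(
  mode: DedupeMode,
  retentionDays?: number,
  configuredHours: number = PERMANENT_DEFAULT_HOURS,
): number {
  if (mode === "permanent") {
    return Math.min(configuredHours, PERMANENT_CAP_HOURS);
  }
  if (retentionDays === undefined || retentionDays < MIN_RETENTION_DAYS) {
    throw new Error("dedupe_retention_days is below the supported minimum of 3");
  }
  return Math.max(FLOOR_HOURS, retentionDays * 24 - safetyMarginHours(retentionDays));
}

// Rejects an operator override that would outlive the broker's dedupe window.
function validateOverride(overrideHours: number, retentionDays: number): void {
  if (overrideHours > retentionDays * 24 - 1) {
    throw new Error("outbox_max_age_above_dedupe_window");
  }
}
```

For retention of 3, 30, and 365 days this yields 72h, 648h, and 7884h, matching the worked values above. At the 3-day minimum the 72h floor wins, so the derived value equals the full dedupe window.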
|
||||
|
||||
### 4.10 Inbox schema — unchanged from v3 §4.5
|
||||
|
||||
### 4.11 Crash recovery — unchanged from v3 §4.6
|
||||
|
||||
### 4.12 Failure modes — corrected for fingerprint model (v6)
|
||||
|
||||
- **Fingerprint mismatch on retry** (`409 idempotency_key_reused`): outbox
|
||||
row marked `dead`. Surfaced in `--failed` view. Operator command
|
||||
`outbox requeue --new-client-id <id>` rotates `client_message_id` and retries.
|
||||
- **Daemon retry after dedupe row hard-deleted by retention sweep**: in
|
||||
`retention_scoped` mode, daemon `max_age_hours` is bounded inside the
|
||||
retention window (§4.9), so this can only happen via operator override.
|
||||
In that case the retry creates a NEW dedupe row + new message — the
|
||||
caller chose this risk explicitly. Counter
|
||||
`cm_daemon_retry_after_dedupe_expired_total`.
|
||||
- **Daemon retry after dedupe row hard-deleted in `permanent` mode**:
|
||||
cannot happen by definition — `permanent` means no `expires_at`. Only
|
||||
mesh deletion removes dedupe rows.
|
||||
- **Duplicate row, history pruned**: as v5 §4.4. Mark `done`, increment
|
||||
`cm_daemon_dedupe_history_pruned_total`.
|
||||
|
||||
---
|
||||
|
||||
## 5. Inbound — unchanged from v3 §5
|
||||
|
||||
---
|
||||
|
||||
## 6. Hooks — unchanged from v4 §6
|
||||
|
||||
---
|
||||
|
||||
## 7-13. Multi-mesh, auto-routing, service install, observability, SDKs, security model, configuration — unchanged from v4
|
||||
|
||||
---
|
||||
|
||||
## 14. Lifecycle — unchanged from v5 §14
|
||||
|
||||
---
|
||||
|
||||
## 15. Version compat — feature param updated for new dedupe semantics
|
||||
|
||||
### 15.1 Feature bits with parameters (v6 update)
|
||||
|
||||
| Bit | `params.version` | Required parameters | Optional parameters |
|---|---|---|---|
| `client_message_id_dedupe` | `1` | `mode: "retention_scoped"\|"permanent"`, `dedupe_retention_days: int (>= 3)` (when mode=retention_scoped), `request_fingerprint: bool == true` | `tombstone_history_pruned_window_days: int` |
| `concurrent_connection_policy` | `1` | (no parameters) | `default_policy: "prefer_newest"\|"prefer_oldest"\|"allow_concurrent"` |
| `member_keypair_rotated_event` | `1` | (no parameters) | — |
| `key_epoch` | `1` | `max_concurrent_epochs: int (>= 1)` | — |
| `max_payload` | `1` | `inline_bytes: int (>= 1024)`, `blob_bytes: int (>= 1024)` | — |
|
||||
|
||||
`client_message_id_dedupe` ships at `params.version = 1` with
|
||||
`request_fingerprint: bool == true` as a required parameter. A broker
|
||||
that doesn't advertise the feature, or advertises it without
|
||||
`request_fingerprint: true`, is treated as "feature missing" and the
|
||||
daemon refuses to start. That's intentional — v0.9.0 daemons require
|
||||
fingerprint enforcement for safe idempotency.
|
||||
|
||||
The schema-version-2 evolution (parameters that need versioning) is
|
||||
deferred (see followups doc).
|
||||
|
||||
`dedupe_retention_days` minimum is 3 (matches the §4.9 floor).
|
||||
|
||||
### 15.2 Negotiation handshake — unchanged shape from v5 §15.2
|
||||
|
||||
### 15.3 IPC negotiation — unchanged from v3 §15.3
|
||||
|
||||
### 15.4 Compatibility matrix — unchanged from v3 §15.4
|
||||
|
||||
### 15.5 Diagnostic close code (v0.9.0)
|
||||
|
||||
v0.9.0 ships a single WebSocket close code with a structured
|
||||
`close_reason` JSON payload that distinguishes the underlying cause:
|
||||
|
||||
| Code | Reason | `close_reason.kind` values |
|---|---|---|
| `4010` | `feature_unavailable` | `feature_unavailable` (feature missing from broker's `supported`) · `feature_param_invalid` (params fail validation: missing required, out of bounds, unknown version) · `feature_param_below_floor` (param below daemon's hard floor, e.g. `dedupe_retention_days < 3`) |
|
||||
|
||||
`close_reason` payload shape:
|
||||
```json
{
  "kind": "feature_unavailable" | "feature_param_invalid" | "feature_param_below_floor",
  "feature": "client_message_id_dedupe",
  "detail": "..."
}
```
|
||||
|
||||
Daemon logs the full negotiation payload at WARN before exiting;
|
||||
supervisor + alerting catches the restart loop. The split into
|
||||
4011/4012 codes is deferred (see followups doc).
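
A sketch of how a daemon might turn the §15.1 parameter rules for `client_message_id_dedupe` into one of the three `kind` values. The shape of the broker's `supported` map is an assumption (the handshake is defined in v5 §15.2, not here), and the function name is hypothetical. Following §15.1, a missing `request_fingerprint: true` is reported as `feature_unavailable`.

```typescript
type CloseReasonKind =
  | "feature_unavailable"
  | "feature_param_invalid"
  | "feature_param_below_floor";

interface CloseReason {
  kind: CloseReasonKind;
  feature: string;
  detail: string;
}

// Assumed shape: feature name -> params object, with `version` inside params.
type Supported = Record<string, Record<string, unknown> | undefined>;

function checkDedupeFeature(supported: Supported): CloseReason | null {
  const feature = "client_message_id_dedupe";
  const fail = (kind: CloseReasonKind, detail: string): CloseReason => ({ kind, feature, detail });

  const params = supported[feature];
  if (params === undefined) return fail("feature_unavailable", "not advertised by broker");
  if (params["version"] !== 1) return fail("feature_param_invalid", "unknown params.version");
  if (params["request_fingerprint"] !== true) {
    return fail("feature_unavailable", "request_fingerprint is not true");
  }

  const mode = params["mode"];
  if (mode !== "retention_scoped" && mode !== "permanent") {
    return fail("feature_param_invalid", "mode missing or unknown");
  }
  if (mode === "retention_scoped") {
    const days = params["dedupe_retention_days"];
    if (typeof days !== "number" || !Number.isInteger(days)) {
      return fail("feature_param_invalid", "dedupe_retention_days missing or not an integer");
    }
    if (days < 3) return fail("feature_param_below_floor", "dedupe_retention_days < 3");
  }
  return null; // feature is usable; no close needed
}
```

A non-null result would be serialized as the `close_reason` payload sent with close code `4010`.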
|
||||
|
||||
---
|
||||
|
||||
## 16. Threat model — unchanged from v4 §16
|
||||
|
||||
---
|
||||
|
||||
## 17. Migration — broker dedupe table + atomicity (v6)
|
||||
|
||||
Broker side, deploy order:
|
||||
|
||||
1. `CREATE TABLE mesh.client_message_dedupe` with v6 schema (additive,
|
||||
online-safe).
|
||||
2. `ALTER TABLE mesh.topic_message ADD COLUMN client_message_id`.
|
||||
3. `ALTER TABLE mesh.message_queue ADD COLUMN client_message_id`.
|
||||
4. Broker code refactor: every accept path wraps dedupe insert + message
|
||||
insert in **one transaction** (§4.7). Pre-generated
|
||||
`broker_message_id` (ulid in code) passed in.
|
||||
5. Broker code: nightly job to delete dedupe rows where `expires_at <
|
||||
NOW()` (skip in `permanent` mode).
|
||||
6. Broker code: hook into the message-retention sweep — when a
|
||||
`topic_message` or `message_queue` row is hard-deleted, find the
|
||||
matching dedupe row by `client_message_id` and set `history_available
|
||||
= FALSE`. (Note: `client_message_id` is nullable on those tables for
|
||||
legacy traffic; nullable rows have no dedupe row to update.)
|
||||
7. Broker code: nightly orphan-check job (§4.7); alerts on non-zero.
|
||||
8. Broker advertises `client_message_id_dedupe` feature with
|
||||
`params.version = 1` and `request_fingerprint: true`.
|
||||
9. Daemon refuses to start unless that feature bit is advertised with
|
||||
valid v1 params.
|
||||
|
||||
Rollback plan: feature flag disables fingerprint enforcement broker-side
|
||||
(falls back to existing pre-v6 behavior — no dedupe). Daemons that
|
||||
require fingerprint refuse to start. Operator switches off the feature
|
||||
flag, reverts the daemon, restarts. No data loss; pending dedupe rows
|
||||
remain in place for the next forward roll.
|
||||
|
||||
---
|
||||
|
||||
## v0.9.0 lock — what's in vs deferred
|
||||
|
||||
**In** (this document): everything codex r1–r4 ratified plus the six
|
||||
sweet-spot pulls from v7–v9 enumerated at the top — `aborted` outbox
|
||||
status, `BEGIN IMMEDIATE`, IPC duplicate lookup table, B1/B2/B3 phasing
|
||||
concept, side-effect inventory, two-layer ID model.
|
||||
|
||||
**Deferred** (see `2026-05-03-daemon-spec-broker-hardening-followups.md`):
|
||||
- B0 dedupe fast-path before rate-limit (v10).
|
||||
- Lua-scripted idempotent rate limiter keyed by
|
||||
`(mesh, client_id, window)` (v10).
|
||||
- In-tx `mesh.mention_index` (v8).
|
||||
- 4011 / 4012 close-code split (v6 §15.5 — collapsed to 4010 with
|
||||
structured reason JSON for v0.9.0).
|
||||
- Per-OS fingerprint precedence elaborate table (v8 §2.2.1).
|
||||
- `request_fingerprint` schema-version-2 in feature negotiation (v6
|
||||
§15.1 ships at version 1 with `request_fingerprint: bool`).
|
||||
- Force-expiry / quarantine semantics for `keypair-archive.json`
|
||||
(v8 §14.1.1).
|
||||
|
||||
These deferrals are real improvements but not v0.9.0 blockers. They
|
||||
land as the broker matures and we have actual scale-load to optimize
|
||||
against.
|
||||
|
||||
---
|
||||
|
||||
## Cross-spec note: §15.5 close-code collapse
|
||||
|
||||
For v0.9.0 we ship a single `4010 feature_unavailable` close code with
|
||||
a structured `close_reason` JSON payload that distinguishes the
|
||||
underlying cause:
|
||||
|
||||
```json
{
  "close_reason": {
    "kind": "feature_unavailable" | "feature_param_invalid" | "feature_param_below_floor",
    "feature": "client_message_id_dedupe",
    "detail": "..."
  }
}
```
|
||||
|
||||
The 4011/4012 split is deferred to followups.
|
||||
|
||||
---
|
||||
|
||||
## NON-NORMATIVE: round-6 review trailer (preserved for audit only)
|
||||
|
||||
> **Not part of the v0.9.0 contract.** Preserved verbatim from the
|
||||
> v6 source spec as a record of the open questions at the time of the
|
||||
> codex round-6 review. Items below have either been resolved in this
|
||||
> merged document, deferred to the followups doc, or superseded.
|
||||
> Do NOT use this section as a checklist for implementation.
|
||||
|
||||
1. **Request fingerprint canonical form (§4.4)** — does JCS work
|
||||
cross-language for `meta_canonical_json` (Python json.dumps,
|
||||
Go encoding/json, JS JSON.stringify all behave differently)? Should
|
||||
we ship a vetted JCS lib in each SDK or fall back to a simpler
|
||||
"sorted keys + no spaces + escape-as-stored" rule with conformance
|
||||
tests?
|
||||
2. **Atomicity contract (§4.7)** — is the orphan-check sufficient, or
|
||||
does a violation mean we need a "broker rebuild dedupe from messages"
|
||||
recovery tool? The latter is destructive but useful for ops emergencies.
|
||||
3. **Max-age formula (§4.9)** — is the 72h floor correct? Is the
|
||||
percentage-based safety margin (`max(24, ceil(0.1 * dedupe_window))`)
|
||||
the right shape? Or simpler to say "always 24h"?
|
||||
4. **`409 idempotency_key_reused` recovery flow (§4.5)** — is sending the
|
||||
row to `dead` and surfacing it via `outbox --failed` enough? Should
|
||||
the daemon emit a high-priority event for the SSE stream so operators
|
||||
are paged immediately?
|
||||
5. **Diagnostic close codes (§15.5)** — is splitting 4010/4011/4012
|
||||
useful, or does it just push complexity onto operators? Should we
|
||||
collapse to 4010 with structured close-reason JSON instead?
|
||||
6. **Anything else still wrong?** Read it as if you were going to
|
||||
operate this for a year. What falls down?
|
||||
|
||||
Three options:
|
||||
- **(a) v6 is shippable**: lock the spec, start coding the frozen core.
|
||||
- **(b) v7 needed**: list the must-fix items.
|
||||
- **(c) the architecture itself is wrong**: what would you do differently?
|
||||
|
||||
Be ruthless.
|
||||