claudemesh/.artifacts/shipped/2026-05-03-daemon-final-spec-v7.md
Alejandro Gutiérrez a2568ad9f4
chore(release): cli 1.22.0 — daemon v0.9.0 + housekeeping
- Bump apps/cli/package.json to 1.22.0 (additive feature: claudemesh
  daemon long-lived runtime).
- CHANGELOG entry for 1.22.0 covering subcommands, idempotency wiring,
  crash recovery, and the deferred Sprint 7 broker hardening.
- Roadmap entry for v0.9.0 daemon foundation right above the v2.0.0
  daemon redesign section, so the bridge release is documented as the
  shipped step toward the larger architectural shift.
- Move shipped daemon specs (v1..v10 iteration trail + locked v0.9.0
  spec + broker-hardening followups) from .artifacts/specs/ to
  .artifacts/shipped/ per the project artifact-pipeline convention.

Not in this commit: npm publish and the cli-v1.22.0 GitHub release tag
— both are public-distribution actions and require explicit user
approval.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:24:32 +01:00

# `claudemesh daemon` — Final Spec v7
> **Round 7.** v6 was reviewed by codex (round 6) which found the broker
> layer largely correct but caught five daemon-side and broker-tx
> correctness gaps:
>
> 1. **Daemon-local duplicate POST semantics** undefined — local fingerprint
> comparison missing across `pending` / `inflight` / `done` / `dead`.
> 2. **§4.6 rejected-request contradiction** — talked about both "fix and
> retry" and "fingerprint mismatch → 409". Only one of those can be true.
> 3. **§4.7 pseudocode bug** — `ON CONFLICT DO NOTHING RETURNING` returns
> nothing on conflict; the fingerprint comparison was in the wrong branch.
> 4. **Max-age math floor consumes margin** — at min retention (3 days),
> daemon max-age 72h equals broker window 72h. Not inside the window.
> 5. **Broker transaction boundary incomplete** — fan-out/queue/history side
> effects not stated as in-transaction; "optional" wording was wrong.
>
> v7 fixes all five. **Intent §0 is unchanged from v2.** v7 revises only §4
> (delivery contract), §15 (feature param min), and §17 (migration).
---
## 0. Intent — unchanged, see v2 §0
---
## 1. Process model — unchanged
## 2. Identity — unchanged from v5 §2
## 3. IPC surface — unchanged from v4 §3
---
## 4. Delivery contract — at-least-once, fingerprinted at IPC and broker layers
### 4.1 The contract (precise — v7)
> **Local guarantee**: each successful `POST /v1/send` returns a stable
> `client_message_id`. The send is durably persisted to `outbox.db` before
> the response returns. The daemon enforces request-fingerprint
> idempotency at the IPC layer: a duplicate `POST` with the same
> `client_message_id` and matching `request_fingerprint` returns the
> stable prior result; with a mismatched fingerprint it returns local
> `409 idempotency_key_reused` and the new request is **not** persisted.
>
> **Broker guarantee**: the broker maintains a dedupe record per
> accepted `(mesh_id, client_message_id)` in `mesh.client_message_dedupe`
> with `request_fingerprint`. Retries with matching fingerprint collapse;
> retries with mismatched fingerprint return `409
> idempotency_key_reused` without creating a new message.
>
> **Atomicity guarantee**: every durable side effect of a successful
> accept (dedupe row, message row, fan-out work, history row, queue
> insertion) lands in the same broker DB transaction. Either all commit
> or none do.
>
> **End-to-end guarantee**: at-least-once delivery, with
> `client_message_id` propagated to receivers' inboxes.
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
### 4.3 Broker schema — unchanged from v6 §4.3
(`mesh.client_message_dedupe` table with `request_fingerprint BYTEA`, no
`status` column.)
### 4.4 Request fingerprint canonical form — unchanged from v6 §4.4
### 4.5 Daemon-local idempotency at the IPC layer (NEW v7 — codex r6)
The daemon enforces fingerprint idempotency **before** the request hits
`outbox.db` so a caller bug never creates duplicate-key/mismatch-payload
state at all.
#### 4.5.1 IPC accept algorithm
On `POST /v1/send`:
1. Validate request envelope (auth, schema, size limits). Failures
here return `4xx` immediately. **No outbox row is written.** The
`client_message_id` (whether caller-supplied or daemon-minted) is
**not consumed** — the same id may be reused by the caller for a
subsequent valid send.
2. Compute `request_fingerprint` (§4.4).
3. Look up existing outbox row by `client_message_id`:
| Existing row state | Fingerprint match? | Daemon response |
|---|---|---|
| (no row) | — | Insert new outbox row in `pending`; return `202 accepted, queued` with `client_message_id` |
| `pending` | match | Return `202 accepted, queued` with the existing `client_message_id`. No new row. Idempotent retry of an in-progress send |
| `pending` | mismatch | Return `409 idempotency_key_reused` with `conflict: "outbox_pending_fingerprint_mismatch"`. **No mutation of the existing row.** |
| `inflight` | match | Return `202 accepted, inflight`. No new row. Caller is retrying mid-broker-roundtrip |
| `inflight` | mismatch | Return `409 idempotency_key_reused` with `conflict: "outbox_inflight_fingerprint_mismatch"` |
| `done` | match | Return `200 ok, duplicate: true, broker_message_id, history_id`. No new row, no broker call |
| `done` | mismatch | Return `409 idempotency_key_reused` with `conflict: "outbox_done_fingerprint_mismatch", broker_message_id` |
| `dead` | match | Return `409 idempotency_key_reused` with `conflict: "outbox_dead_fingerprint_match", reason: "<last_error>"`. Caller must rotate the id (see §4.6.3) — daemon refuses to re-attempt a dead row's exact bytes. |
| `dead` | mismatch | Return `409 idempotency_key_reused` with `conflict: "outbox_dead_fingerprint_mismatch"` |
Rule: any IPC `409` carries the daemon's `request_fingerprint` (8-byte
hex prefix) so callers can debug client/server canonical-form drift.
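
The lookup table above can be read as a pure decision function. The following is a minimal Python sketch, not the daemon's implementation: `IpcResponse`, `decide`, and the `insert_new_row` flag are hypothetical names introduced only for illustration, and the `detail` strings simply echo the table's wording.

```python
from typing import NamedTuple, Optional


class IpcResponse(NamedTuple):
    http_status: int
    detail: str           # response wording or `conflict` value from the table
    insert_new_row: bool  # True only when no outbox row existed


def decide(existing_status: Optional[str], fingerprint_matches: bool) -> IpcResponse:
    """Map (existing outbox row state, fingerprint match) to a daemon response."""
    if existing_status is None:
        # No row: the fingerprint comparison does not apply.
        return IpcResponse(202, "accepted, queued", True)
    if existing_status not in ("pending", "inflight", "done", "dead"):
        raise ValueError(f"unexpected outbox status: {existing_status!r}")
    if not fingerprint_matches:
        # Every mismatch is a 409 and never mutates the existing row.
        return IpcResponse(409, f"outbox_{existing_status}_fingerprint_mismatch", False)
    if existing_status == "pending":
        return IpcResponse(202, "accepted, queued", False)
    if existing_status == "inflight":
        return IpcResponse(202, "accepted, inflight", False)
    if existing_status == "done":
        return IpcResponse(200, "ok, duplicate", False)
    # dead + match: the daemon refuses to re-attempt the same bytes.
    return IpcResponse(409, "outbox_dead_fingerprint_match", False)
```

Keeping the decision separate from storage makes each of the nine table rows testable without an outbox database.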
#### 4.5.2 Outbox table — fingerprint required, atomic UPSERT removed
```sql
CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,  -- 32 bytes
  payload             BLOB NOT NULL,
  enqueued_at         INTEGER NOT NULL,
  attempts            INTEGER DEFAULT 0,
  next_attempt_at     INTEGER NOT NULL,
  status              TEXT CHECK(status IN ('pending','inflight','done','dead')),
  last_error          TEXT,
  delivered_at        INTEGER,
  broker_message_id   TEXT
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
```
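Insertion is an explicit lock + check + insert
(`BEGIN IMMEDIATE; SELECT; if-no-row INSERT; COMMIT;`), not
`INSERT OR IGNORE`. `outbox.db` is SQLite, which has no
`SELECT ... FOR UPDATE`; `BEGIN IMMEDIATE` takes the write lock before
the check instead. The daemon never auto-mutates an existing row's
`request_fingerprint` or `payload`; mismatches are 409s, not silent
overwrites.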
`request_fingerprint` is computed once at IPC accept time and frozen.
Retries to the broker re-send the same bytes from `payload` and the
same `request_fingerprint`. Daemon does not recompute post-enqueue.
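
A minimal sketch of the lock + check + insert sequence, assuming `outbox.db` is SQLite (where the write lock is taken with `BEGIN IMMEDIATE`, since SQLite has no `SELECT ... FOR UPDATE`). The helper names `open_outbox` and `accept`, the return strings, and the millisecond timestamps are assumptions for illustration only.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE outbox (
  id TEXT PRIMARY KEY,
  client_message_id TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  payload BLOB NOT NULL,
  enqueued_at INTEGER NOT NULL,
  attempts INTEGER DEFAULT 0,
  next_attempt_at INTEGER NOT NULL,
  status TEXT CHECK(status IN ('pending','inflight','done','dead')),
  last_error TEXT,
  delivered_at INTEGER,
  broker_message_id TEXT
);
"""


def open_outbox(path: str = ":memory:") -> sqlite3.Connection:
    # isolation_level=None puts the connection in autocommit mode, so the
    # BEGIN / COMMIT statements below are the only transaction control.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.executescript(SCHEMA)
    return conn


def accept(conn: sqlite3.Connection, row_id: str, client_message_id: str,
           fingerprint: bytes, payload: bytes) -> str:
    """Return 'inserted', 'match', or 'mismatch'. Never mutates an existing row."""
    now_ms = int(time.time() * 1000)  # time unit is an assumption
    conn.execute("BEGIN IMMEDIATE")   # take the write lock before the check
    try:
        row = conn.execute(
            "SELECT request_fingerprint FROM outbox WHERE client_message_id = ?",
            (client_message_id,),
        ).fetchone()
        if row is None:
            conn.execute(
                "INSERT INTO outbox (id, client_message_id, request_fingerprint,"
                " payload, enqueued_at, next_attempt_at, status)"
                " VALUES (?, ?, ?, ?, ?, ?, 'pending')",
                (row_id, client_message_id, fingerprint, payload, now_ms, now_ms),
            )
            outcome = "inserted"
        else:
            outcome = "match" if bytes(row[0]) == fingerprint else "mismatch"
        conn.execute("COMMIT")
        return outcome
    except BaseException:
        conn.execute("ROLLBACK")
        raise
```

A second call with the same `client_message_id` returns `match` or `mismatch` and leaves the stored `payload` and `request_fingerprint` untouched, which is the frozen-at-accept behaviour described above.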
### 4.6 Rejected-request semantics — pick one rule (NEW v7 — codex r6)
> **Rule: the `client_message_id` is consumed iff the daemon writes an
> outbox row. Anything that fails before outbox insertion (validation,
> auth, size) leaves the id untouched and freely reusable.**
This makes §4.6 internally consistent with §4.5:
#### 4.6.1 IPC validation failure (no outbox row written)
- Schema/auth/size/destination-not-resolvable failures return `4xx`
immediately. The `client_message_id` is **not** stored anywhere on
the daemon. Caller may re-send with the same id and a fixed payload;
it will be treated as a fresh request because no outbox row exists.
#### 4.6.2 Outbox row exists, broker permanent rejection (4xx response)
- Daemon receives `4xx` from broker (e.g. payload size delta between
daemon and broker advertised limits, mesh-level reject). Outbox row
transitions to `dead` with `last_error` populated.
- Caller retrying with same `client_message_id` → daemon returns
`409 idempotency_key_reused, conflict: "outbox_dead_*"` per §4.5.1.
- The id is consumed (row is locked in `dead`) until operator action.
#### 4.6.3 Operator recovery: rotating an idempotency key
To unstick a `dead` row whose payload needs to change, operator runs:
```
claudemesh daemon outbox requeue --id <outbox_id> --new-client-id [auto|<id>]
```
This atomically:
1. Marks the existing `dead` row as `aborted` (terminal, never retried).
2. Creates a new outbox row with a fresh `client_message_id` (caller-
supplied or daemon-ulid'd) and the SAME or a CALLER-PATCHED payload.
3. The old `client_message_id` becomes free again at the daemon layer
but is still locked at the broker layer if the broker had ever
accepted it (its dedupe row stays). For a row that died before
broker acceptance, the id is fully reusable end-to-end.
Operators see a clear distinction between `dead` (needs operator
attention) and `aborted` (intentionally retired). Add `aborted` to the
status CHECK constraint:
```sql
status TEXT CHECK(status IN ('pending','inflight','done','dead','aborted'))
```
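The requeue steps can be sketched as a single SQLite transaction. This is an illustration under stated assumptions, not the CLI's implementation: `requeue` is a hypothetical helper, `uuid4().hex` stands in for a daemon-minted ULID, timestamps are assumed to be milliseconds, and the caller is assumed to supply a `request_fingerprint` already recomputed per §4.4 (the canonical form is not reproduced here).

```python
import sqlite3
import time
import uuid
from typing import Optional


def requeue(conn: sqlite3.Connection, outbox_id: str, new_fingerprint: bytes,
            new_client_id: str = "auto",
            patched_payload: Optional[bytes] = None) -> str:
    """Retire a dead row as 'aborted' and enqueue a replacement atomically.

    Assumes `conn` is in autocommit mode (isolation_level=None) and that the
    status CHECK constraint already allows 'aborted'.
    Returns the client_message_id of the new row.
    """
    if new_client_id == "auto":
        new_client_id = uuid.uuid4().hex  # stand-in for a ULID
    now_ms = int(time.time() * 1000)
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT status, payload FROM outbox WHERE id = ?", (outbox_id,)
        ).fetchone()
        if row is None or row[0] != "dead":
            raise ValueError("only a dead outbox row can be requeued")
        # Step 1: the old row becomes terminal and is never retried.
        conn.execute("UPDATE outbox SET status = 'aborted' WHERE id = ?", (outbox_id,))
        # Step 2: new row with a fresh id and the same or a patched payload.
        payload = row[1] if patched_payload is None else patched_payload
        conn.execute(
            "INSERT INTO outbox (id, client_message_id, request_fingerprint,"
            " payload, enqueued_at, next_attempt_at, status)"
            " VALUES (?, ?, ?, ?, ?, ?, 'pending')",
            (uuid.uuid4().hex, new_client_id, new_fingerprint, payload, now_ms, now_ms),
        )
        conn.execute("COMMIT")
        return new_client_id
    except BaseException:
        conn.execute("ROLLBACK")
        raise
```

Because both writes share one transaction, a crash between them cannot leave an `aborted` row without its replacement.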
### 4.7 Broker atomicity contract — corrected pseudocode + side-effect inventory (v7 — codex r6)
#### 4.7.1 Side effects inside the transaction
Every successful broker accept atomically commits the following durable
state in **one transaction**:
| Effect | Table | Notes |
|---|---|---|
| Dedupe record | `mesh.client_message_dedupe` | NEW row keyed by `(mesh_id, client_message_id)` |
| Message body | `mesh.topic_message` OR `mesh.message_queue` | NEW row keyed by `broker_message_id` (pre-generated ulid) |
| History row | `mesh.message_history` | NEW row pointing at `broker_message_id` for ordered replay |
| Fan-out work | `mesh.delivery_queue` | One row per intended recipient (member subscribed to topic, recipient of DM, etc.) |
Effects **outside** the transaction (committed after ACK to daemon):
- WebSocket pushes to currently-connected subscribers — these are best-
effort live notifications; on failure subscribers fetch from history
on next connect.
- Webhook fan-out (post-v0.9.0 feature) — runs asynchronously off the
`delivery_queue` rows committed inside the transaction.
If any in-transaction insert fails (constraint violation, DB error),
the transaction rolls back: no dedupe row, no message row, no history,
no delivery queue rows. Broker returns `5xx` to daemon; daemon retries.
#### 4.7.2 Corrected pseudocode (codex r6)
The fingerprint comparison must happen on the conflict-select branch,
not the `RETURNING` branch:
```sql
BEGIN;
-- Pre-generate broker_message_id (ulid) outside the transaction, pass in.

-- Step 1: try to claim the idempotency key.
INSERT INTO mesh.client_message_dedupe
  (mesh_id, client_message_id, broker_message_id, request_fingerprint,
   destination_kind, destination_ref, expires_at)
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
        $dest_kind, $dest_ref, $expires_at)
ON CONFLICT (mesh_id, client_message_id) DO NOTHING;

-- Step 2: was it our insert?
SELECT broker_message_id, request_fingerprint, destination_kind,
       destination_ref, history_available, first_seen_at
FROM mesh.client_message_dedupe
WHERE mesh_id = $mesh_id AND client_message_id = $client_id
FOR SHARE;
-- If returned.broker_message_id == $msg_id (our pre-generated id),
--   this was the first insert. Continue to step 3.
-- If returned.broker_message_id != $msg_id AND
--    returned.request_fingerprint == $fingerprint,
--   this is a duplicate retry. ROLLBACK; return 200 duplicate.
-- If returned.broker_message_id != $msg_id AND
--    returned.request_fingerprint != $fingerprint,
--   ROLLBACK; return 409 idempotency_key_reused.

-- Step 3: insert message row, history, fan-out queue.
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
VALUES ($msg_id, $mesh_id, $client_id, ...);
INSERT INTO mesh.message_history (broker_message_id, mesh_id, ...)
VALUES ($msg_id, $mesh_id, ...);
INSERT INTO mesh.delivery_queue (broker_message_id, recipient_pubkey, ...)
SELECT $msg_id, member_pubkey, ...
FROM mesh.topic_subscription
WHERE topic = $dest_ref AND mesh_id = $mesh_id;
COMMIT;
```
The branch logic determines the response shape (`201` vs `200
duplicate` vs `409 idempotency_key_reused`) before COMMIT. The
duplicate and 409 branches always ROLLBACK because nothing else
needs to commit on those paths.
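`SELECT … FOR SHARE` holds a shared lock on the dedupe row, so a
concurrent `UPDATE` or `DELETE` of that row (for example the retention
sweep) waits until our transaction ends. A concurrent insert with the
same key is held back separately by the unique constraint: it blocks
until our transaction commits or rolls back, then takes the conflict
path. This assumes Postgres at its default `READ COMMITTED` isolation.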
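
The three-way branch after the step 2 `SELECT` reduces to a small comparison. A minimal sketch in Python follows; `Outcome` and `classify` are hypothetical names, and the use of a constant-time comparison for the fingerprint is a choice made here, not something the spec requires.

```python
import hmac
from enum import Enum


class Outcome(Enum):
    FIRST_INSERT = "201"
    DUPLICATE = "200 duplicate"
    KEY_REUSED = "409 idempotency_key_reused"


def classify(our_msg_id: str, our_fingerprint: bytes,
             row_msg_id: str, row_fingerprint: bytes) -> Outcome:
    """Decide the response from the dedupe row returned by the step 2 SELECT."""
    if row_msg_id == our_msg_id:
        # The row carries our pre-generated id, so our INSERT won: go to step 3.
        return Outcome.FIRST_INSERT
    if hmac.compare_digest(row_fingerprint, our_fingerprint):
        # Someone else's row, same request bytes: ROLLBACK and report duplicate.
        return Outcome.DUPLICATE
    # Someone else's row, different request bytes: ROLLBACK and return 409.
    return Outcome.KEY_REUSED
```

Only `FIRST_INSERT` proceeds to the message, history, and delivery-queue inserts; the other two outcomes roll back.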
#### 4.7.3 Orphan check — covers full inventory now
The nightly `cm_broker_dedupe_orphan_check_total` job (v6 §4.7) is
extended to verify all four in-transaction effects. For each
`client_message_dedupe` row:
- Either the corresponding `topic_message` / `message_queue` row exists,
OR `history_available = FALSE` AND a deleted-tombstone is recorded.
- AND a corresponding `message_history` row exists (or has been pruned
per history retention).
- AND zero outstanding `delivery_queue` rows older than fan-out timeout
reference a `broker_message_id` whose dedupe row is missing.
Any inconsistency is logged as `cm_broker_atomicity_violation_found` for
human review. The count should be zero in steady state.
### 4.8 Outbox max-age math — strictly inside broker window (v7 — codex r6)
Codex r6: at v6's 3-day minimum, daemon max_age (72h) **equaled** broker
window (72h). That isn't "inside the window."
v7 raises the floor and tightens the formula:
- **Minimum supported broker `dedupe_retention_days`**: **7** (was 3 in
v6). Below this, daemon refuses to start with `4012
feature_param_below_floor`.
- **Daemon `max_age_hours` derivation** (`retention_scoped` mode):
```
safety_margin_hours = max(24, ceil(dedupe_retention_days * 0.1 * 24))
max_age_hours = (dedupe_retention_days * 24) - safety_margin_hours
```
At minimum (7 days): `safety_margin = max(24, 17) = 24h`; `max_age =
168 - 24 = 144h`. Daemon outbox ≤144h, broker window ≥168h, gap ≥24h.
- **Daemon `max_age_hours` derivation** (`permanent` mode):
```
max_age_hours = config.outbox.max_age_hours_default (168h)
capped at config.outbox.max_age_hours_cap (720h)
```
- **Operator override**: `[outbox] max_age_hours_override = N` is accepted
iff `N <= dedupe_retention_days * 24 - 24`. Above that, the daemon
refuses to start with a clear-text `outbox_max_age_above_dedupe_window` error.
- The 72h floor from v6 is **dropped** because the new 7-day broker
minimum already produces a 144h derived max-age — well above any
realistic floor concern.
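
The `retention_scoped` derivation and the override check can be worked through in a few lines. This is a sketch with hypothetical function names; it computes `ceil(dedupe_retention_days * 0.1 * 24)` in integer arithmetic as `ceil(days * 24 / 10)` to avoid floating-point rounding, which is an implementation choice made here.

```python
MIN_RETENTION_DAYS = 7


def safety_margin_hours(dedupe_retention_days: int) -> int:
    # -(-a // b) is integer ceiling division: ceil(days * 24 / 10).
    ten_percent = -(-(dedupe_retention_days * 24) // 10)
    return max(24, ten_percent)


def derive_max_age_hours(dedupe_retention_days: int) -> int:
    """Daemon outbox max-age in hours for retention_scoped mode."""
    if dedupe_retention_days < MIN_RETENTION_DAYS:
        raise ValueError("4012 feature_param_below_floor")
    window_hours = dedupe_retention_days * 24
    return window_hours - safety_margin_hours(dedupe_retention_days)


def override_allowed(n_hours: int, dedupe_retention_days: int) -> bool:
    """Operator override is accepted iff N <= retention window minus 24h."""
    return n_hours <= dedupe_retention_days * 24 - 24
```

With these definitions, `derive_max_age_hours(7)` gives 144, matching the worked minimum above, and the 10% term starts to exceed the 24h floor once the window is longer than 10 days.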
### 4.9 Inbox schema — unchanged from v3 §4.5
### 4.10 Crash recovery — unchanged from v3 §4.6
### 4.11 Failure modes — unchanged from v6 §4.12, with §4.5/§4.6 added
- **IPC accept fingerprint-mismatch on duplicate id**: returns 409 with
`conflict` field per §4.5.1. Caller must rotate id.
- **Outbox row stuck in `dead`**: operator runs `outbox requeue
--new-client-id` per §4.6.3.
- **Broker fingerprint mismatch on retry**: as v6 §4.5. Daemon marks
`dead`, surfaces in `outbox --failed`.
- **Daemon retry after dedupe row hard-deleted by broker retention
sweep**: cannot happen unless operator overrode `max_age_hours`
beyond the safety margin. In `permanent` mode cannot happen at all.
- **Atomicity violation found by orphan check**: alerts ops; broker
team investigates. Should be zero.
---
## 5. Inbound — unchanged from v3 §5
## 6. Hooks — unchanged from v4 §6
## 7-13. — unchanged from v4
## 14. Lifecycle — unchanged from v5 §14
---
## 15. Version compat — minimum dedupe_retention_days raised
### 15.1 Feature bits with parameters (v7 update)
Only one row changes from v6 §15.1:
| Bit | `params.version` | Required parameters | Optional parameters |
|---|---|---|---|
| `client_message_id_dedupe` | `2` | `mode: "retention_scoped"\|"permanent"`, `dedupe_retention_days: int (>= 7)` (when mode=retention_scoped), `request_fingerprint: bool == true` | `tombstone_history_pruned_window_days: int` |
`dedupe_retention_days` minimum raised from 3 to 7 to keep daemon
outbox max-age strictly inside the broker window with margin (§4.8).
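
A sketch of how a daemon might check the advertised parameters for this feature bit before starting. The flat dictionary shape (`version`, `mode`, `dedupe_retention_days`, `request_fingerprint` as top-level keys of `params`) and the function name are assumptions; the wire format is defined in v6 §15 and is not reproduced here.

```python
from typing import Any, Dict, List

MIN_DEDUPE_RETENTION_DAYS = 7


def validate_dedupe_params(params: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the params are acceptable."""
    problems: List[str] = []
    if params.get("version") != 2:
        problems.append("params.version must be 2")
    if params.get("request_fingerprint") is not True:
        problems.append("request_fingerprint must be true")
    mode = params.get("mode")
    if mode not in ("retention_scoped", "permanent"):
        problems.append("mode must be retention_scoped or permanent")
    elif mode == "retention_scoped":
        # dedupe_retention_days is only required in retention_scoped mode.
        days = params.get("dedupe_retention_days")
        if not isinstance(days, int) or isinstance(days, bool):
            problems.append("dedupe_retention_days must be an integer")
        elif days < MIN_DEDUPE_RETENTION_DAYS:
            problems.append("feature_param_below_floor: dedupe_retention_days < 7")
    return problems
```

For example, a broker still advertising `dedupe_retention_days: 3` would yield a single `feature_param_below_floor` problem, while `mode: "permanent"` passes without any retention value.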
### 15.2 — 15.5 unchanged from v6 §15
(`feature_negotiation_request/response`, IPC negotiation, compat
matrix, diagnostic close codes 4010 / 4011 / 4012.)
---
## 16. Threat model — unchanged from v4 §16
---
## 17. Migration — broker dedupe + atomicity + corrected pseudocode (v7)
Broker side, deploy order:
1. `CREATE TABLE mesh.client_message_dedupe` (v6 §4.3 schema, unchanged
in v7).
2. `ALTER TABLE mesh.topic_message ADD COLUMN client_message_id`.
3. `ALTER TABLE mesh.message_queue ADD COLUMN client_message_id`.
4. Broker code refactor: every accept path runs the v7 §4.7.2 corrected
pseudocode in **one transaction** with the side-effect inventory
from §4.7.1 — dedupe row, message row, history row, delivery_queue
rows all in-tx.
5. Broker code: existing fan-out workers consume `delivery_queue` rows
committed by the accept transaction.
6. Broker code: nightly retention sweep + `history_available` flip on
message-row pruning (unchanged from v6 §17 step 5+6).
7. Broker code: extended orphan-check job (v7 §4.7.3) — alerts on
atomicity violations across full inventory.
8. Broker advertises `client_message_id_dedupe` feature with
`params.version = 2`, `request_fingerprint: true`,
`dedupe_retention_days >= 7` (was 3).
9. Daemon refuses to start unless above is advertised.
Daemon side:
- Outbox table gains `aborted` status (§4.6.3). The startup migration
alters the CHECK constraint in place if the installed SQLite version's
DDL supports that without a recreate; otherwise it recreates the table
via `INSERT INTO new SELECT * FROM old`. v0.9.0 daemons are fresh
installs by definition, so no existing outboxes need migrating.
- IPC accept path implements §4.5.1 lookup table.
- IPC error envelope adds `conflict` and `daemon_fingerprint_prefix`
fields for 409 responses.
- New CLI verb `claudemesh daemon outbox requeue --id <id>
--new-client-id [auto|<id>]` (§4.6.3).
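
The table-recreate path for the outbox migration can be sketched as follows, assuming the §4.5.2 schema and a connection in autocommit mode (`isolation_level=None`). `OUTBOX_DDL`, `BASE_STATUSES`, `WITH_ABORTED`, and `migrate_add_aborted` are hypothetical names used only for this illustration.

```python
import sqlite3

OUTBOX_DDL = """
CREATE TABLE {name} (
  id TEXT PRIMARY KEY,
  client_message_id TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  payload BLOB NOT NULL,
  enqueued_at INTEGER NOT NULL,
  attempts INTEGER DEFAULT 0,
  next_attempt_at INTEGER NOT NULL,
  status TEXT CHECK(status IN ({statuses})),
  last_error TEXT,
  delivered_at INTEGER,
  broker_message_id TEXT
)
"""
BASE_STATUSES = "'pending','inflight','done','dead'"
WITH_ABORTED = BASE_STATUSES + ",'aborted'"


def migrate_add_aborted(conn: sqlite3.Connection) -> None:
    """Recreate `outbox` with 'aborted' allowed by the status CHECK constraint."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(OUTBOX_DDL.format(name="outbox_new", statuses=WITH_ABORTED))
        # Column order is identical, so SELECT * copies rows one-to-one.
        conn.execute("INSERT INTO outbox_new SELECT * FROM outbox")
        conn.execute("DROP TABLE outbox")  # also drops the outbox_pending index
        conn.execute("ALTER TABLE outbox_new RENAME TO outbox")
        conn.execute("CREATE INDEX outbox_pending ON outbox(status, next_attempt_at)")
        conn.execute("COMMIT")
    except BaseException:
        conn.execute("ROLLBACK")
        raise
```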
---
## What changed v6 → v7 (codex round-6 actionable items)
| Codex r6 item | v7 fix | Section |
|---|---|---|
| Daemon-local duplicate POST semantics undefined | Full lookup table for pending/inflight/done/dead × match/mismatch; `409 idempotency_key_reused` at IPC layer with `conflict` field | §4.5 |
| §4.6 rejected-request contradiction | Single rule: id consumed iff outbox row written; pre-outbox failures leave id untouched; broker-rejected outbox row goes to `dead`, requires `requeue --new-client-id` | §4.6 |
| §4.7 pseudocode wrong | Corrected: `INSERT ON CONFLICT DO NOTHING`, then `SELECT FOR SHARE`, then branch on returned `broker_message_id` and `fingerprint` | §4.7.2 |
| Max-age math equals window at min | Min `dedupe_retention_days` raised to 7; safety margin always >= 24h; derived max-age strictly < window | §4.8, §15.1 |
| Broker atomicity scope incomplete | Side-effect inventory: dedupe + message + history + delivery_queue all in-tx; WS push and webhook fan-out explicitly outside-tx; orphan check extended | §4.7.1, §4.7.3 |
| New `aborted` outbox status | Distinguishes operator-retired rows from dead rows | §4.6.3 |
---
## What needs review (round 7)
1. **IPC lookup table (§4.5.1)** — does it cover all the realistic
client races? The "inflight + match" return is `202 accepted,
inflight` — should it be `200 ok` with the broker response if the
broker has already responded? Or does the daemon prefer to respond
from local state always?
2. **Aborted vs dead vs done (§4.6.3)** — is the three-state terminal
distinction useful, or noisy? Would `dead` + an `aborted_at`
timestamp suffice?
3. **§4.7.2 transaction shape** — `SELECT FOR SHARE` after `INSERT ON
CONFLICT DO NOTHING` is two round-trips. Could it be one with
`INSERT ... ON CONFLICT DO UPDATE SET ... RETURNING xmax = 0` or
similar Postgres-specific trick? Worth optimizing here?
4. **Max-age formula at higher windows** — at 365 days,
`safety_margin = ceil(0.1 * 365 * 24) = 876h ≈ 36.5 days`. Daemon
max-age = `8760 - 876 = 7884h ≈ 328.5 days`. Is that the right shape,
or should the safety margin be capped (e.g. `min(72, ceil(0.1 * w))`)?
5. **Side-effect inventory (§4.7.1)** — anything missing? E.g. broker-
side rate-limit counters, audit-log entries, mention-fanout-search?
6. **Anything else still wrong?** Read it as if you were going to
operate this for a year. What falls down?
Three options:
- **(a) v7 is shippable**: lock the spec, start coding the frozen core.
- **(b) v8 needed**: list the must-fix items.
- **(c) the architecture itself is wrong**: what would you do differently?
Be ruthless.