- Bump apps/cli/package.json to 1.22.0 (additive feature: claudemesh daemon long-lived runtime).
- CHANGELOG entry for 1.22.0 covering subcommands, idempotency wiring, crash recovery, and the deferred Sprint 7 broker hardening.
- Roadmap entry for v0.9.0 daemon foundation right above the v2.0.0 daemon redesign section, so the bridge release is documented as the shipped step toward the larger architectural shift.
- Move shipped daemon specs (v1..v10 iteration trail + locked v0.9.0 spec + broker-hardening followups) from .artifacts/specs/ to .artifacts/shipped/ per the project artifact-pipeline convention.

Not in this commit: npm publish and the cli-v1.22.0 GitHub release tag; both are public-distribution actions and require explicit user approval.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# `claudemesh daemon`: broker-hardening followups

> **Purpose**: refinements found during the v6 → v10 codex review series
> that are real improvements but **not** v0.9.0 blockers. The
> implementation target is `2026-05-03-daemon-spec-v0.9.0.md`. This
> document lists what was deferred, why, and the trigger that promotes
> each item to "must-do."
>
> **Background**: codex reviewed the daemon spec across 9 rounds (v1
> through v10). Rounds 1–4 found load-bearing architectural issues
> (identity, IPC auth, exactly-once lie, hook tokens, rotation, etc.).
> Rounds 5–9 found progressively finer correctness issues inside one
> subsystem (broker idempotency mechanics). v6 closed the architectural
> review; v7–v10 are increasingly fine-grained idempotency-correctness
> shavings on the same layer. A pre-launch product (no users) doesn't
> need v7–v10-level rigor. We pulled the cheap wins into v0.9.0; the
> rest waits.

---

## 1. B0 dedupe fast-path before rate-limit (v10)

**What v10 said**: read `mesh.client_message_dedupe` BEFORE consulting
the rate limiter. An existing id (match or mismatch) returns immediately
without touching rate-limit budget.

**Why deferred**: v0.9.0 doesn't have meaningful rate-limit pressure on
the daemon path. The split-brain failure (broker accepted, daemon
believes failure due to rate-limit-rejection-on-retry) requires
sustained saturated rate-limit windows, which don't exist pre-launch.

**Promote when**: any single mesh sees rate-limit rejections AND has
daemon retries against committed ids. Telemetry to watch:
`cm_broker_rate_limit_rejection_total` per mesh > 0 sustained.

**Implementation cost**: small; one indexed PK lookup before the
existing limiter call. The work is mostly testing the race semantics.

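The ordering v10 asks for can be illustrated with a minimal in-memory sketch. Everything here is hypothetical scaffolding: `B0Broker`, `FixedBudgetLimiter`, and the status strings are illustrative names, and a `Map` stands in for `mesh.client_message_dedupe`. It only demonstrates that a retry of a committed id returns before the limiter is consulted; it does not model the accept transaction or the race semantics.

```typescript
// Hypothetical sketch: names and result shapes are illustrative, not the broker's API.
type AcceptStatus = "accepted" | "duplicate" | "fingerprint_mismatch" | "rate_limited";

// Stand-in for the existing limiter: a fixed budget per mesh, no window reset.
class FixedBudgetLimiter {
  private used = new Map<string, number>();
  private budgetPerMesh: number;

  constructor(budgetPerMesh: number) {
    this.budgetPerMesh = budgetPerMesh;
  }

  tryConsume(meshId: string): boolean {
    const n = this.used.get(meshId) ?? 0;
    if (n >= this.budgetPerMesh) return false;
    this.used.set(meshId, n + 1);
    return true;
  }

  consumed(meshId: string): number {
    return this.used.get(meshId) ?? 0;
  }
}

class B0Broker {
  // Stand-in for mesh.client_message_dedupe: (mesh, id) -> request fingerprint.
  private dedupe = new Map<string, string>();
  private limiter: FixedBudgetLimiter;

  constructor(limiter: FixedBudgetLimiter) {
    this.limiter = limiter;
  }

  accept(meshId: string, clientMessageId: string, fingerprint: string): AcceptStatus {
    const key = JSON.stringify([meshId, clientMessageId]);

    // B0: the dedupe lookup runs first, so a known id never touches the budget.
    const existing = this.dedupe.get(key);
    if (existing !== undefined) {
      return existing === fingerprint ? "duplicate" : "fingerprint_mismatch";
    }

    // Only ids the broker has never committed are charged against the limiter.
    if (!this.limiter.tryConsume(meshId)) return "rate_limited";
    this.dedupe.set(key, fingerprint);
    return "accepted";
  }
}
```

With a budget of one, a first send is accepted and exhausts the budget; a retry of the same id still returns `"duplicate"` instead of `"rate_limited"`, which is the split-brain case the fast-path is meant to close.
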
---

## 2. Lua-scripted idempotent rate limiter (v10)

**What v10 said**: limiter keyed by `(mesh_id, client_message_id,
window_bucket)` so retries-within-window consume budget at most once.

**Why deferred**: depends on (1) above. Without B0 fast-path this is
incremental complexity for marginal benefit. With B0 it becomes the
right belt-and-suspenders fix for the rare race where two same-id
requests both miss B0 simultaneously.

**Promote when**: B0 ships. Same trigger.

**Implementation cost**: medium; Lua script in Redis, careful TTL
tuning, integration with existing limiter call sites.

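The keying idea can be shown without Redis. The sketch below is an in-memory TypeScript stand-in for the Lua script, not the proposed implementation: `IdempotentWindowLimiter` and its methods are hypothetical names, time is passed in explicitly, and old buckets are never evicted (the TTL tuning mentioned above is out of scope here).

```typescript
// Hypothetical in-memory stand-in for the Lua-scripted limiter.
class IdempotentWindowLimiter {
  private windowMs: number;
  private budget: number;
  // (mesh_id, window_bucket) -> client_message_ids already charged in that window.
  private charged = new Map<string, Set<string>>();

  constructor(windowMs: number, budget: number) {
    this.windowMs = windowMs;
    this.budget = budget;
  }

  private bucketKey(meshId: string, nowMs: number): string {
    const windowBucket = Math.floor(nowMs / this.windowMs);
    return JSON.stringify([meshId, windowBucket]);
  }

  tryConsume(meshId: string, clientMessageId: string, nowMs: number): boolean {
    const key = this.bucketKey(meshId, nowMs);
    let ids = this.charged.get(key);
    if (ids === undefined) {
      ids = new Set<string>();
      this.charged.set(key, ids);
    }
    // A retry within the same window was already charged: allow it for free.
    if (ids.has(clientMessageId)) return true;
    if (ids.size >= this.budget) return false;
    ids.add(clientMessageId);
    return true;
  }

  used(meshId: string, nowMs: number): number {
    return this.charged.get(this.bucketKey(meshId, nowMs))?.size ?? 0;
  }
}
```

In Redis the same check-and-charge would need to run atomically, which is presumably why the proposal reaches for a Lua script; this single-threaded sketch gets atomicity for free.
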
---

## 3. In-tx `mesh.mention_index` (v8)

**What v8 said**: mention-fanout index updates should commit inside the
broker accept transaction so mention-search reads can never see a
mention pointing at an uncommitted message.

**Why deferred**: the lag between accept-commit and async
mention-indexer is small (single-digit milliseconds in expected
deployment). Stale-read window during mention search is acceptable for
v0.9.0; receivers learn of mentions via the `mention` event in their
inbox stream regardless.

**Promote when**: real users complain about "I was mentioned but the
mention search doesn't show it" with reproducible cases that don't
self-heal in seconds.

**Implementation cost**: small; add `INSERT INTO mesh.mention_index`
to the accept transaction. The async indexer becomes a backfill
fallback rather than the primary path.

---

## 4. 4011 / 4012 close-code split (v6 §15.5)

**What v6 said**: split `4010 feature_unavailable` into three codes:
`4010` (missing), `4011` (params invalid), `4012` (params below floor).

**Why deferred**: v0.9.0 ships single `4010` with structured
`close_reason` JSON containing `kind`, `feature`, `detail`. Same
diagnostic information, simpler protocol surface.

**Promote when**: ops tooling or external monitoring needs distinct
status codes (e.g. PagerDuty rules that fire on 4012-only). Probably
never; structured JSON is parseable.

**Implementation cost**: trivial; three constants and a switch on
`close_reason.kind`.

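As a rough illustration of how small the lift is, the sketch below parses the structured `close_reason` and maps `kind` onto the three proposed codes. The numeric codes and the `kind` / `feature` / `detail` fields come from the text above; the concrete `kind` strings (`"missing"`, `"params_invalid"`, `"params_below_floor"`) and the function names are assumptions made for the example.

```typescript
// The kind strings below are assumed for illustration; the spec defines the real values.
type CloseReasonKind = "missing" | "params_invalid" | "params_below_floor";

interface CloseReason {
  kind: CloseReasonKind;
  feature: string;
  detail: string;
}

const CLOSE_FEATURE_MISSING = 4010;
const CLOSE_PARAMS_INVALID = 4011;
const CLOSE_PARAMS_BELOW_FLOOR = 4012;

function isCloseReasonKind(x: unknown): x is CloseReasonKind {
  return x === "missing" || x === "params_invalid" || x === "params_below_floor";
}

// Parse the JSON carried in the close frame; returns null for anything malformed.
function parseCloseReason(raw: string): CloseReason | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const record = parsed as Record<string, unknown>;
  const kind = record["kind"];
  const feature = record["feature"];
  const detail = record["detail"];
  if (!isCloseReasonKind(kind)) return null;
  if (typeof feature !== "string" || typeof detail !== "string") return null;
  return { kind, feature, detail };
}

// The "three constants and a switch" from the cost estimate.
function closeCodeFor(reason: CloseReason): number {
  switch (reason.kind) {
    case "missing":
      return CLOSE_FEATURE_MISSING;
    case "params_invalid":
      return CLOSE_PARAMS_INVALID;
    case "params_below_floor":
      return CLOSE_PARAMS_BELOW_FLOOR;
  }
}
```

Until the split is promoted, monitoring can run the same `parseCloseReason` over the single `4010` close and branch on `kind` directly.
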
---

## 5. Elaborate per-OS fingerprint precedence table (v8 §2.2.1)

**What v8 said**: comprehensive per-OS table covering Linux machine-id
sources, macOS `IOPlatformUUID`, Windows `MachineGuid`, BSD
`kern.hostuuid`, plus interface exclusion rules.

**Why deferred**: v0.9.0 ships with the simpler "machine-id ||
first-stable-mac" rule from v6. Edge cases (cloud images,
machine-id-not-readable, etc.) are documented when first hit.

**Promote when**: operators report fingerprint false-positives we can't
explain from the v6 rule. Each report adds one row to the per-OS
table.

**Implementation cost**: incremental; each OS-specific source is a
small probe function with a fallback chain.

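The "probe function with a fallback chain" shape might look like the sketch below. Probes are injected as plain functions so the example runs anywhere; `Probe`, `firstUsable`, and `resolveFingerprintSource` are hypothetical names, and real probes (reading a machine-id file, querying a platform UUID, listing interfaces) are deliberately left out.

```typescript
// A probe returns a raw identifier, null when the source is absent,
// or throws when the source exists but cannot be read.
type Probe = () => string | null;

interface FingerprintSource {
  source: "machine-id" | "mac";
  value: string;
}

// Walk the chain in order and return the first non-empty, trimmed value.
function firstUsable(probes: Probe[]): string | null {
  for (const probe of probes) {
    try {
      const value = probe();
      if (value !== null && value.trim() !== "") return value.trim();
    } catch {
      // Unreadable source (e.g. permissions): fall through to the next probe.
    }
  }
  return null;
}

// v6 rule as described above: machine-id first, then first stable MAC.
function resolveFingerprintSource(
  machineIdProbes: Probe[],
  macProbes: Probe[],
): FingerprintSource | null {
  const machineId = firstUsable(machineIdProbes);
  if (machineId !== null) return { source: "machine-id", value: machineId };
  const mac = firstUsable(macProbes);
  if (mac !== null) return { source: "mac", value: mac };
  return null;
}
```

Under this shape, each new row in the per-OS table would be one more probe appended to the relevant chain, which matches the "incremental" cost estimate.
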
---

## 6. `request_fingerprint` schema-version-2 in feature negotiation (v6 §15.1)

**What v6 said**: `client_message_id_dedupe` feature parameters are
versioned independently. v0.9.0 ships at version 1 with a single
`request_fingerprint: bool` flag.

**Why deferred**: we don't yet need parameterized fingerprint variants
(different canonical forms, different hash algos). Version-bump path
is documented; we'll use it when we add the second fingerprint mode.

**Promote when**: we want a fingerprint algo other than sha256/JCS
(e.g. a faster hash, or a normalized canonical form).

**Implementation cost**: small; single feature-bit version bump
following the documented pattern.

---

## 7. Force-expiry / quarantine semantics for `keypair-archive.json` (v8 §14.1.1)

**What v8 said**: `max_archived_keys` cap with force-expiry; explicit
quarantine of malformed archive (`keypair-archive.json.malformed-<ts>`);
duplicate `key_id` rejection; mode-mismatch warning behavior.

**Why deferred**: v0.9.0 ships the simpler v6 rule: drop expired
entries on the cleanup pass; refuse to start on a malformed archive
(loud, operator-actionable). The v8 elaboration makes archive
corruption non-blocking, which is operationally nicer but trades off
audit clarity.

**Promote when**: a real operator hits an archive corruption that
shouldn't have brought the daemon down (e.g. mid-rotation crash leaves
a partially-written archive).

**Implementation cost**: small; quarantine logic + one extra startup
check.

---

## 8. Cross-language JCS conformance for `request_fingerprint` (v6 §4.4 round-6 question)

**What v6 asked**: does JCS work cross-language for
`meta_canonical_json`? Python `json.dumps`, Go `encoding/json`, and JS
`JSON.stringify` all behave differently. Should we ship a vetted JCS lib
in each SDK?

**Why deferred from v0.9.0**: the daemon ships in TypeScript only for
v0.9.0 (the `claudemesh-cli` package). Single-language JCS is trivial.
SDK ports come post-v0.9.0.

**Promote when**: we ship the Python or Go SDK. Each SDK port gets a
JCS conformance test against a corpus of envelopes.

**Implementation cost**: small per-language; a conformance fixture
file and a unit test.

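To make the conformance idea concrete, here is a simplified JCS-style canonicalizer of the kind a fixture corpus could be generated from. It is a sketch, not a vetted RFC 8785 implementation: it relies on `JSON.stringify` for string and number serialization and on UTF-16 code-unit comparison for key order, and it does not reject lone surrogates. The `Json` type and the `canonicalize` name are assumptions for the example, not the daemon's actual code.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Simplified JCS-style canonical form: sorted keys, no whitespace.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") {
    if (typeof value === "number" && !Number.isFinite(value)) {
      throw new Error("NaN and Infinity have no JSON representation");
    }
    // Primitives: JSON.stringify emits ECMAScript number and string forms.
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    // Arrays keep their element order.
    return "[" + value.map((item) => canonicalize(item)).join(",") + "]";
  }
  // Objects: order members by key, comparing UTF-16 code units.
  const members = Object.entries(value).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return (
    "{" +
    members.map(([key, item]) => JSON.stringify(key) + ":" + canonicalize(item)).join(",") +
    "}"
  );
}
```

A fixture entry pairs an input envelope with its expected canonical string. For example, `canonicalize({ b: 1, a: [true, null, "x"], c: { z: 1.0, y: 1e21 } })` yields `{"a":[true,null,"x"],"b":1,"c":{"y":1e+21,"z":1}}`, and a Python or Go port would be expected to produce the same bytes for the same input.
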
---

## Sprint 7 (this session): what landed vs deferred

**Landed in code** (not yet deployed):

- `packages/db/migrations/0028_message_queue_idempotency_fields.sql` adds
  nullable `client_message_id` and `request_fingerprint` columns to
  `mesh.message_queue` (additive, online-safe).
- `apps/broker/src/broker.ts`: `queueMessage` and `drainForMember`
  thread the new columns through.
- `apps/broker/src/index.ts`: `handleSend` picks them up from the
  daemon's wire envelope; outbound push echoes them back so receiving
  daemons can dedupe.
- `apps/broker/src/types.ts`: `WSPushMessage` declares the optional
  fields.

**Deployment plan (not auto-applied)**:

1. Apply migration against prod DB (the broker's filename-tracked
   migrator picks up `0028_*.sql` on next startup).
2. Deploy the broker with the code changes via Coolify.
3. Verify a daemon-originated send shows non-null `client_message_id`
   in `mesh.message_queue` afterwards.

**Still deferred** (full broker hardening):

- `mesh.client_message_dedupe` table with `request_fingerprint BYTEA`
  and atomic accept transaction (spec §4.7).
- Feature-bit advertisement on hello_ack of
  `client_message_id_dedupe` v1, with daemon-side enforcement (spec §15).
- Partial unique index `(mesh_id, client_message_id) WHERE NOT NULL`.

These sit behind the same trigger as the followups above: do them when
real users hit operational corners that the current handling doesn't
cover.

---

## How to use this document

When picking up post-v0.9.0 work on the daemon:

1. Check whether any of the "promote when" triggers above have fired.
2. If yes, consult the corresponding versioned spec (v6/v7/v8/v9/v10)
   for the full proposed change.
3. Implement the lift, update `daemon-spec-v0.9.0.md` to reflect the
   merge, and remove the item from this followups list.

The versioned specs live in `.artifacts/specs/` indefinitely as a
review-trail audit.