- Bump apps/cli/package.json to 1.22.0 (additive feature: claudemesh daemon long-lived runtime).
- CHANGELOG entry for 1.22.0 covering subcommands, idempotency wiring, crash recovery, and the deferred Sprint 7 broker hardening.
- Roadmap entry for v0.9.0 daemon foundation right above the v2.0.0 daemon redesign section, so the bridge release is documented as the shipped step toward the larger architectural shift.
- Move shipped daemon specs (v1..v10 iteration trail + locked v0.9.0 spec + broker-hardening followups) from .artifacts/specs/ to .artifacts/shipped/ per the project artifact-pipeline convention.

Not in this commit: npm publish and the cli-v1.22.0 GitHub release tag. Both are public-distribution actions and require explicit user approval.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
claudemesh daemon — broker-hardening followups
Purpose: refinements found during the v6 → v10 codex review series that are real improvements but not v0.9.0 blockers. The implementation target is
2026-05-03-daemon-spec-v0.9.0.md. This document lists what was deferred, why, and the trigger that promotes each item to "must-do."

Background: codex reviewed the daemon spec across 9 rounds (v1 through v10). Rounds 1–4 found load-bearing architectural issues (identity, IPC auth, exactly-once lie, hook tokens, rotation, etc.). Rounds 5–9 found progressively finer correctness issues inside one subsystem (broker idempotency mechanics). v6 closed the architectural review; v7–v10 are increasingly fine-grained idempotency-correctness shavings on the same layer. Pre-launch (no users) doesn't need v7–v10 level rigor. We pulled the cheap wins into v0.9.0; the rest waits.
1. B0 dedupe fast-path before rate-limit (v10)
What v10 said: read mesh.client_message_dedupe BEFORE consulting
the rate limiter. Existing id (match or mismatch) returns immediately
without touching rate-limit budget.
Why deferred: v0.9.0 doesn't have meaningful rate-limit pressure on the daemon path. The split-brain failure (broker accepted, daemon believes failure due to rate-limit-rejection-on-retry) requires sustained saturated rate-limit windows, which don't exist pre-launch.
Promote when: any single mesh sees rate-limit rejections AND has
daemon retries against committed ids. Telemetry to watch:
cm_broker_rate_limit_rejection_total per mesh > 0 sustained.
Implementation cost: small — one indexed PK lookup before the existing limiter call. The work is mostly testing the race semantics.
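The ordering described above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the class, the result kinds, and the counter-style budget are hypothetical stand-ins, and the Map plays the role of mesh.client_message_dedupe.

```typescript
// Hypothetical model of the B0 ordering; not the broker's actual code.
type B0Result =
  | { kind: "accepted" }
  | { kind: "duplicate" } // same id, same fingerprint
  | { kind: "fingerprint_mismatch" } // same id, different payload
  | { kind: "rate_limited" };

class B0SketchBroker {
  // Stand-in for mesh.client_message_dedupe, keyed by "meshId:clientMessageId".
  private dedupe = new Map<string, string>();
  private budget: number;

  constructor(budgetPerWindow: number) {
    this.budget = budgetPerWindow;
  }

  remainingBudget(): number {
    return this.budget;
  }

  handleSend(meshId: string, clientMessageId: string, fingerprint: string): B0Result {
    const key = `${meshId}:${clientMessageId}`;

    // B0: consult the dedupe store first. A known id (match or mismatch)
    // returns immediately and never touches the rate-limit budget.
    const existing = this.dedupe.get(key);
    if (existing !== undefined) {
      return existing === fingerprint
        ? { kind: "duplicate" }
        : { kind: "fingerprint_mismatch" };
    }

    // Only ids the broker has never committed reach the limiter.
    if (this.budget <= 0) return { kind: "rate_limited" };
    this.budget -= 1;

    this.dedupe.set(key, fingerprint);
    return { kind: "accepted" };
  }
}
```

With a budget of 1, a retry of an already-committed id still resolves as a duplicate after the budget is exhausted, which is the split-brain case the fast-path is meant to close.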
2. Lua-scripted idempotent rate limiter (v10)
What v10 said: limiter keyed by (mesh_id, client_message_id, window_bucket) so retries-within-window consume budget at most once.
Why deferred: depends on (1) above. Without B0 fast-path this is incremental complexity for marginal benefit. With B0 it becomes the right belt-and-suspenders fix for the rare race where two same-id requests both miss B0 simultaneously.
Promote when: B0 ships. Same trigger.
Implementation cost: medium — Lua script in Redis, careful TTL tuning, integration with existing limiter call sites.
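A minimal in-memory model of the keying idea, assuming a fixed-window counter. In the proposed design the check-and-charge step would run atomically as a Lua script in Redis; the class and method names here are hypothetical, and TTL-based eviction of old buckets is left out.

```typescript
// Hypothetical in-memory model of an idempotent fixed-window limiter.
// A real version would do the check-and-charge atomically in Redis.
class IdempotentLimiterSketch {
  private charged = new Set<string>(); // (mesh, bucket, id) already charged
  private used = new Map<string, number>(); // budget used per (mesh, bucket)
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request may proceed.
  allow(meshId: string, clientMessageId: string, nowMs: number): boolean {
    const bucket = Math.floor(nowMs / this.windowMs);
    const bucketKey = `${meshId}:${bucket}`;
    const idKey = `${bucketKey}:${clientMessageId}`;

    // A retry of an id already charged in this window consumes nothing.
    if (this.charged.has(idKey)) return true;

    const used = this.used.get(bucketKey) ?? 0;
    if (used >= this.limit) return false;

    this.used.set(bucketKey, used + 1);
    this.charged.add(idKey);
    return true;
  }
}
```

Because the charge is recorded per (mesh_id, client_message_id, window_bucket), two same-id requests that both miss B0 spend at most one unit of budget between them within a window.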
3. In-tx mesh.mention_index (v8)
What v8 said: mention-fanout index updates should commit inside the broker accept transaction so mention-search reads can never see a mention pointing at an uncommitted message.
Why deferred: the lag between accept-commit and async
mention-indexer is small (single-digit milliseconds in expected
deployment). Stale-read window during mention search is acceptable for
v0.9.0; receivers learn of mentions via the mention event in their
inbox stream regardless.
Promote when: real users complain about "I was mentioned but the mention search doesn't show it" with reproducible cases that don't self-heal in seconds.
Implementation cost: small — add INSERT INTO mesh.mention_index
to the accept transaction. The async indexer becomes a backfill
fallback rather than the primary path.
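The invariant can be illustrated with a toy store whose transaction publishes the message row and the mention rows together or not at all. Everything here is a hypothetical stand-in for the broker's database layer, not its actual schema or API.

```typescript
// Toy model: message and mention rows become visible in one step.
type SketchMessage = { id: string; body: string };
type SketchMention = { messageId: string; memberId: string };

class MentionStoreSketch {
  messages = new Map<string, SketchMessage>();
  mentions: SketchMention[] = [];

  // Runs fn against staged copies; publishes both only if fn does not throw.
  transaction(
    fn: (messages: Map<string, SketchMessage>, mentions: SketchMention[]) => void,
  ): void {
    const stagedMessages = new Map(this.messages);
    const stagedMentions = [...this.mentions];
    fn(stagedMessages, stagedMentions);
    this.messages = stagedMessages;
    this.mentions = stagedMentions;
  }
}

function acceptMessageSketch(
  store: MentionStoreSketch,
  msg: SketchMessage,
  mentioned: string[],
): void {
  store.transaction((messages, mentions) => {
    messages.set(msg.id, msg);
    for (const memberId of mentioned) {
      mentions.push({ messageId: msg.id, memberId });
    }
  });
}
```

A reader of this store can never observe a mention whose message is missing, which is the property the async indexer cannot guarantee during its lag window.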
4. 4011 / 4012 close-code split (v6 §15.5)
What v6 said: split 4010 feature_unavailable into three codes:
4010 (missing), 4011 (params invalid), 4012 (params below floor).
Why deferred: v0.9.0 ships single 4010 with structured
close_reason JSON containing kind, feature, detail. Same
diagnostic information, simpler protocol surface.
Promote when: ops tooling or external monitoring needs distinct status codes (e.g. PagerDuty rules that fire on 4012-only). Probably never; structured JSON is parseable.
Implementation cost: trivial — three constants and a switch on
close_reason.kind.
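A sketch of what "three constants and a switch" might look like. The numeric codes and the close_reason fields (kind, feature, detail) come from the text above; the kind strings and identifier names are assumptions made for illustration.

```typescript
// The kind values below are hypothetical; only the codes and field names
// appear in the spec discussion above.
type CloseReasonSketch = {
  kind: "missing" | "params_invalid" | "params_below_floor";
  feature: string;
  detail: string;
};

const CLOSE_FEATURE_MISSING = 4010;
const CLOSE_PARAMS_INVALID = 4011;
const CLOSE_PARAMS_BELOW_FLOOR = 4012;

function closeCodeFor(reason: CloseReasonSketch): number {
  switch (reason.kind) {
    case "missing":
      return CLOSE_FEATURE_MISSING;
    case "params_invalid":
      return CLOSE_PARAMS_INVALID;
    case "params_below_floor":
      return CLOSE_PARAMS_BELOW_FLOOR;
  }
}
```

Since the structured close_reason already carries kind, the split is a pure mapping and loses no diagnostic information in either direction.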
5. Elaborate per-OS fingerprint precedence table (v8 §2.2.1)
What v8 said: comprehensive per-OS table covering Linux machine-id
sources, macOS IOPlatformUUID, Windows MachineGuid, BSD
kern.hostuuid, plus interface exclusion rules.
Why deferred: v0.9.0 ships with the simpler "machine-id || first-stable-mac" rule from v6. Edge cases (cloud images, machine-id-not-readable, etc.) are documented when first hit.
Promote when: operators report fingerprint false-positives we can't explain from the v6 rule. Each report adds one row to the per-OS table.
Implementation cost: incremental — each OS-specific source is a small probe function with a fallback chain.
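The "probe function with a fallback chain" shape might look like the following. The helper and the probe type are hypothetical; real probes would read the OS-specific sources, which are stubbed out in the usage example.

```typescript
// Hypothetical fallback chain: each probe returns a candidate or null.
type FingerprintProbe = () => string | null;

function firstAvailable(probes: FingerprintProbe[]): string | null {
  for (const probe of probes) {
    try {
      const value = probe();
      if (value !== null && value.trim() !== "") return value.trim();
    } catch {
      // Source unreadable on this host; fall through to the next probe.
    }
  }
  return null;
}

// Usage with stub probes standing in for "machine-id || first-stable-mac".
const fingerprintSource = firstAvailable([
  () => null, // machine-id not readable in this example
  () => "02:42:ac:11:00:02", // first stable MAC (example value)
]);
```

Adding a row to the per-OS table then amounts to inserting one more probe at the right position in the array.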
6. request_fingerprint schema-version-2 in feature negotiation (v6 §15.1)
What v6 said: client_message_id_dedupe feature parameters
versioned independently. v0.9.0 ships at version 1 with a single
request_fingerprint: bool flag.
Why deferred: we don't yet need parameterized fingerprint variants (different canonical forms, different hash algos). Version-bump path is documented; we'll use it when we add the second fingerprint mode.
Promote when: we want a fingerprint algo other than sha256/JCS (e.g. a faster hash, or a normalized canonical form).
Implementation cost: small — single feature-bit version bump following the documented pattern.
7. Force-expiry / quarantine semantics for keypair-archive.json (v8 §14.1.1)
What v8 said: max_archived_keys cap with force-expiry; explicit
quarantine of malformed archive (keypair-archive.json.malformed-<ts>);
duplicate key_id rejection; mode-mismatch warning behavior.
Why deferred: v0.9.0 ships the simpler v6 rule — drop expired entries on cleanup pass; refuse to start on malformed archive (loud, operator-actionable). The v8 elaboration makes archive corruption non-blocking, which is operationally nicer but trades off audit clarity.
Promote when: a real operator hits an archive corruption that shouldn't have brought the daemon down (e.g. mid-rotation crash leaves a partially-written archive).
Implementation cost: small — quarantine logic + one extra startup check.
8. Cross-language JCS conformance for request_fingerprint (v6 §4.4 round-6 question)
What v6 asked: does JCS work cross-language for
meta_canonical_json? Python json.dumps, Go encoding/json, and JS
JSON.stringify all behave differently. Should we ship a vetted JCS lib
in each SDK?
Why deferred from v0.9.0: the daemon ships in TypeScript only for
v0.9.0 (the claudemesh-cli package). Single-language JCS is trivial.
SDK ports come post-v0.9.0.
Promote when: we ship the Python or Go SDK. Each SDK port gets a JCS conformance test against a corpus of envelopes.
Implementation cost: small per-language — a conformance fixture file and a unit test.
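To make the conformance idea concrete, here is a simplified canonicalizer in the spirit of JCS (RFC 8785): object keys are sorted and no insignificant whitespace is emitted. It is a sketch, not a vetted JCS implementation; it leans on JSON.stringify for primitives and does not reject values that JCS forbids, such as NaN. The function and type names are hypothetical.

```typescript
// Simplified JCS-style canonicalization; not a full RFC 8785 implementation.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return "[" + value.map((item) => canonicalize(item)).join(",") + "]";
  }
  // Default sort compares UTF-16 code units, which is the ordering JCS uses.
  const keys = Object.keys(value).sort();
  const members = keys.map(
    (key) => JSON.stringify(key) + ":" + canonicalize(value[key] as Json),
  );
  return "{" + members.join(",") + "}";
}

// A conformance fixture could pair inputs with expected canonical strings.
const fixture: { input: Json; expected: string }[] = [
  { input: { b: 1, a: [true, null, "x"] }, expected: '{"a":[true,null,"x"],"b":1}' },
];
```

Each SDK port would run the same fixture file through its own canonicalizer and compare byte-for-byte before hashing with sha256.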
Sprint 7 (this session) — what landed vs deferred
Landed in code (not yet deployed):
- packages/db/migrations/0028_message_queue_idempotency_fields.sql adds nullable client_message_id and request_fingerprint columns to mesh.message_queue (additive, online-safe).
- apps/broker/src/broker.ts: queueMessage and drainForMember thread the new columns through.
- apps/broker/src/index.ts: handleSend picks them up from the daemon's wire envelope; outbound push echoes them back so receiving daemons can dedupe.
- apps/broker/src/types.ts: WSPushMessage declares the optional fields.
Deployment plan (not auto-applied):
- Apply migration against prod DB (the broker's filename-tracked migrator picks up 0028_*.sql on next startup).
- Deploy the broker with the code changes via Coolify.
- Verify a daemon-originated send shows non-null client_message_id in mesh.message_queue afterwards.
Still deferred (full broker hardening):
- mesh.client_message_dedupe table with request_fingerprint BYTEA and atomic accept transaction (spec §4.7).
- Feature-bit advertisement on hello_ack of client_message_id_dedupe v1, with daemon-side enforcement (spec §15).
- Partial unique index (mesh_id, client_message_id) WHERE NOT NULL.
These sit behind the same kind of trigger as the followups above: do them when real users hit operational corners that the landed changes don't cover.
How to use this document
When picking up post-v0.9.0 work on the daemon:
- Check whether any of the "promote when" triggers above have fired.
- If yes, consult the corresponding versioned spec (v6/v7/v8/v9/v10) for the full proposed change.
- Implement the lift, update daemon-spec-v0.9.0.md to reflect the merge, and remove the item from this followups list.
The versioned specs live in .artifacts/specs/ indefinitely as a
review-trail audit.