Files
claudemesh/.artifacts/shipped/2026-05-03-daemon-spec-broker-hardening-followups.md
Alejandro Gutiérrez a2568ad9f4
Some checks failed
CI / Lint (push) Has been cancelled
CI / Typecheck (push) Has been cancelled
CI / Broker tests (Postgres) (push) Has been cancelled
CI / Docker build (linux/amd64) (push) Has been cancelled
chore(release): cli 1.22.0 — daemon v0.9.0 + housekeeping
- Bump apps/cli/package.json to 1.22.0 (additive feature: claudemesh
  daemon long-lived runtime).
- CHANGELOG entry for 1.22.0 covering subcommands, idempotency wiring,
  crash recovery, and the deferred Sprint 7 broker hardening.
- Roadmap entry for v0.9.0 daemon foundation right above the v2.0.0
  daemon redesign section, so the bridge release is documented as the
  shipped step toward the larger architectural shift.
- Move shipped daemon specs (v1..v10 iteration trail + locked v0.9.0
  spec + broker-hardening followups) from .artifacts/specs/ to
  .artifacts/shipped/ per the project artifact-pipeline convention.

Not in this commit: npm publish and the cli-v1.22.0 GitHub release tag
— both are public-distribution actions and require explicit user
approval.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:24:32 +01:00

8.5 KiB
Raw Blame History

claudemesh daemon — broker-hardening followups

Purpose: refinements found during the v6 → v10 codex review series that are real improvements but not v0.9.0 blockers. The implementation target is 2026-05-03-daemon-spec-v0.9.0.md. This document lists what was deferred, why, and the trigger that promotes each item to "must-do."

Background: codex reviewed the daemon spec across 9 rounds (v1 through v10). Rounds 14 found load-bearing architectural issues (identity, IPC auth, exactly-once lie, hook tokens, rotation, etc.). Rounds 59 found progressively finer correctness issues inside one subsystem (broker idempotency mechanics). v6 closed the architectural review; v7v10 are increasingly fine-grained idempotency-correctness shavings on the same layer. Pre-launch (no users) doesn't need v7v10 level rigor. We pulled the cheap wins into v0.9.0; the rest waits.


1. B0 dedupe fast-path before rate-limit (v10)

What v10 said: read mesh.client_message_dedupe BEFORE consulting the rate limiter. Existing id (match or mismatch) returns immediately without touching rate-limit budget.

Why deferred: v0.9.0 doesn't have meaningful rate-limit pressure on the daemon path. The split-brain failure (broker accepted, daemon believes failure due to rate-limit-rejection-on-retry) requires sustained saturated rate-limit windows, which don't exist pre-launch.

Promote when: any single mesh sees rate-limit rejections AND has daemon retries against committed ids. Telemetry to watch: cm_broker_rate_limit_rejection_total per mesh > 0 sustained.

Implementation cost: small — one indexed PK lookup before the existing limiter call. The work is mostly testing the race semantics.


2. Lua-scripted idempotent rate limiter (v10)

What v10 said: limiter keyed by (mesh_id, client_message_id, window_bucket) so retries-within-window consume budget at most once.

Why deferred: depends on (1) above. Without B0 fast-path this is incremental complexity for marginal benefit. With B0 it becomes the right belt-and-suspenders fix for the rare race where two same-id requests both miss B0 simultaneously.

Promote when: B0 ships. Same trigger.

Implementation cost: medium — Lua script in Redis, careful TTL tuning, integration with existing limiter call sites.


3. In-tx mesh.mention_index (v8)

What v8 said: mention-fanout index updates should commit inside the broker accept transaction so mention-search reads can never see a mention pointing at an uncommitted message.

Why deferred: the lag between accept-commit and async mention-indexer is small (single-digit milliseconds in expected deployment). Stale-read window during mention search is acceptable for v0.9.0; receivers learn of mentions via the mention event in their inbox stream regardless.

Promote when: real users complain about "I was mentioned but the mention search doesn't show it" with reproducible cases that don't self-heal in seconds.

Implementation cost: small — add INSERT INTO mesh.mention_index to the accept transaction. The async indexer becomes a backfill fallback rather than the primary path.


4. 4011 / 4012 close-code split (v6 §15.5)

What v6 said: split 4010 feature_unavailable into three codes: 4010 (missing), 4011 (params invalid), 4012 (params below floor).

Why deferred: v0.9.0 ships single 4010 with structured close_reason JSON containing kind, feature, detail. Same diagnostic information, simpler protocol surface.

Promote when: ops tooling or external monitoring needs distinct status codes (e.g. PagerDuty rules that fire on 4012-only). Probably never; structured JSON is parseable.

Implementation cost: trivial — three constants and a switch on close_reason.kind.


5. Per-OS fingerprint precedence elaborate table (v8 §2.2.1)

What v8 said: comprehensive per-OS table covering Linux machine-id sources, macOS IOPlatformUUID, Windows MachineGuid, BSD kern.hostuuid, plus interface exclusion rules.

Why deferred: v0.9.0 ships with the simpler "machine-id || first-stable-mac" rule from v6. Edge cases (cloud images, machine-id-not-readable, etc.) are documented when first hit.

Promote when: operators report fingerprint false-positives we can't explain from the v6 rule. Each report adds one row to the per-OS table.

Implementation cost: incremental — each OS-specific source is a small probe function with a fallback chain.


6. request_fingerprint schema-version-2 in feature negotiation (v6 §15.1)

What v6 said: client_message_id_dedupe feature parameters versioned independently. v0.9.0 ships at version 1 with a single request_fingerprint: bool flag.

Why deferred: we don't yet need parameterized fingerprint variants (different canonical forms, different hash algos). Version-bump path is documented; we'll use it when we add the second fingerprint mode.

Promote when: we want a fingerprint algo other than sha256/JCS (e.g. a faster hash, or a normalized canonical form).

Implementation cost: small — single feature-bit version bump following the documented pattern.


7. Force-expiry / quarantine semantics for keypair-archive.json (v8 §14.1.1)

What v8 said: max_archived_keys cap with force-expiry; explicit quarantine of malformed archive (keypair-archive.json.malformed-<ts>); duplicate key_id rejection; mode-mismatch warning behavior.

Why deferred: v0.9.0 ships the simpler v6 rule — drop expired entries on cleanup pass; refuse to start on malformed archive (loud, operator-actionable). The v8 elaboration makes archive corruption non-blocking, which is operationally nicer but trades off audit clarity.

Promote when: a real operator hits an archive corruption that shouldn't have brought the daemon down (e.g. mid-rotation crash leaves a partially-written archive).

Implementation cost: small — quarantine logic + one extra startup check.


8. Cross-language JCS conformance for request_fingerprint (v6 §4.4 round-6 question)

What v6 asked: does JCS work cross-language for meta_canonical_json? Python json.dumps, Go encoding/json, and JS JSON.stringify all behave differently. Should we ship a vetted JCS lib in each SDK?

Why deferred from v0.9.0: the daemon ships in TypeScript only for v0.9.0 (the claudemesh-cli package). Single-language JCS is trivial. SDK ports come post-v0.9.0.

Promote when: we ship the Python or Go SDK. Each SDK port gets a JCS conformance test against a corpus of envelopes.

Implementation cost: small per-language — a conformance fixture file and a unit test.


Sprint 7 (this session) — what landed vs deferred

Landed in code (not yet deployed):

  • packages/db/migrations/0028_message_queue_idempotency_fields.sql adds nullable client_message_id and request_fingerprint columns to mesh.message_queue (additive, online-safe).
  • apps/broker/src/broker.tsqueueMessage and drainForMember thread the new columns through.
  • apps/broker/src/index.tshandleSend picks them up from the daemon's wire envelope; outbound push echoes them back so receiving daemons can dedupe.
  • apps/broker/src/types.tsWSPushMessage declares the optional fields.

Deployment plan (not auto-applied):

  1. Apply migration against prod DB (the broker's filename-tracked migrator picks up 0028_*.sql on next startup).
  2. Deploy the broker with the code changes via Coolify.
  3. Verify a daemon-originated send shows non-null client_message_id in mesh.message_queue afterwards.

Still deferred (full broker hardening):

  • mesh.client_message_dedupe table with request_fingerprint BYTEA and atomic accept transaction (spec §4.7).
  • Feature-bit advertisement on hello_ack of client_message_id_dedupe v1, with daemon-side enforcement (spec §15).
  • Partial unique index (mesh_id, client_message_id) WHERE NOT NULL.

These sit behind the same trigger as the followups below: do them when real users hit operational corners that this addressing doesn't cover.


How to use this document

When picking up post-v0.9.0 work on the daemon:

  1. Check whether any of the "promote when" triggers above have fired.
  2. If yes, consult the corresponding versioned spec (v6/v7/v8/v9/v10) for the full proposed change.
  3. Implement the lift, update daemon-spec-v0.9.0.md to reflect the merge, and remove the item from this followups list.

The versioned specs live in .artifacts/specs/ indefinitely as a review-trail audit.