claudemesh/.artifacts/shipped/2026-05-03-daemon-final-spec-v4.md
Alejandro Gutiérrez a2568ad9f4
chore(release): cli 1.22.0 — daemon v0.9.0 + housekeeping
- Bump apps/cli/package.json to 1.22.0 (additive feature: claudemesh
  daemon long-lived runtime).
- CHANGELOG entry for 1.22.0 covering subcommands, idempotency wiring,
  crash recovery, and the deferred Sprint 7 broker hardening.
- Roadmap entry for v0.9.0 daemon foundation right above the v2.0.0
  daemon redesign section, so the bridge release is documented as the
  shipped step toward the larger architectural shift.
- Move shipped daemon specs (v1..v10 iteration trail + locked v0.9.0
  spec + broker-hardening followups) from .artifacts/specs/ to
  .artifacts/shipped/ per the project artifact-pipeline convention.

Not in this commit: npm publish and the cli-v1.22.0 GitHub release tag
— both are public-distribution actions and require explicit user
approval.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:24:32 +01:00


claudemesh daemon — Final Spec v4

Round 4. v3 was reviewed by codex (round 3) and got an overall pass on architecture but flagged three precision gaps: (1) broker dedupe window semantics — permanent or windowed? schema as drawn was permanent but the prose said 24h; (2) feature-bit negotiation should carry parameters, not just booleans (so daemon can derive its outbox TTL from broker policy instead of hardcoding 23h); (3) key-archive record format and retention behavior were unspecified. Plus minor polish: document machine-id/MAC source precedence per OS, explicitly defer arbitrary outbound hook sends, resolve RunPod identity-vs-hooks inconsistency.

The intent §0 is unchanged from v2 — read it there. v4 only revises what changed from v3.


0. Intent — unchanged, see v2 §0

Pre-launch peer-mesh runtime. Servers/laptops become first-class peers. Stable identity, persistent WS, local IPC, hooks. Not a webhook gateway, not a generic broker. We can break anything.

One claim retracted from v1/v2: "exactly-once" delivery. Replaced with a precise contract in §4 below.


1. Process model — unchanged from v3 §1 / v2 §1

Resource caps, file layout, single-binary unchanged.


2. Identity — accidental-clone detection only, plus broker dedupe

Codex round-2 fix retained: no boot-id (false-positives every reboot). Codex round-3 polish: spell out fingerprint sources per OS so we don't ship a brittle "machine-id || first-mac" with no precedence rules.

2.1 Modes

claudemesh daemon up                       # default: persistent member
claudemesh daemon up --ephemeral           # in-memory keypair, never written
claudemesh daemon up --ephemeral --ttl 2h  # auto-shutdown after duration

CI auto-detection: if any of these env vars are set (CI=true, GITHUB_ACTIONS, GITLAB_CI, BUILDKITE, CIRCLECI, JENKINS_URL, KUBERNETES_SERVICE_HOST), AND --persistent is not explicitly passed, daemon defaults to --ephemeral. Rationale in §16.

RUNPOD_POD_ID removed from auto-CI list (was inconsistent — see §16.3).
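A minimal sketch of the auto-detection rule above, not the daemon's actual code. The function name `default_identity_mode` and the two boolean flag parameters are hypothetical; the sketch also assumes that `CI` counts only when it equals `true` (as written in the list) while the other variables count when set to any non-empty value, and that an explicit `--ephemeral` wins over everything else.

```python
# Env vars from the CI auto-detect list in this section (RUNPOD_POD_ID is
# deliberately absent).
CI_PRESENCE_VARS = (
    "GITHUB_ACTIONS", "GITLAB_CI", "BUILDKITE", "CIRCLECI",
    "JENKINS_URL", "KUBERNETES_SERVICE_HOST",
)

def default_identity_mode(env, persistent_flag=False, ephemeral_flag=False):
    """Return "ephemeral" or "persistent" for a given environment mapping."""
    if ephemeral_flag:            # assumption: explicit --ephemeral always wins
        return "ephemeral"
    if persistent_flag:           # explicit --persistent disables auto-detection
        return "persistent"
    in_ci = env.get("CI") == "true" or any(env.get(name) for name in CI_PRESENCE_VARS)
    return "ephemeral" if in_ci else "persistent"
```

Passing `os.environ` as `env` would give the mode for the current process; a plain dict keeps the rule easy to test.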

2.2 Accidental-clone detection (NOT attacker-grade)

This catches image clones, restored backups, copy-pasted homedirs — accidents made by humans. It does not defend against an attacker who copies both keypair.json and host_fingerprint.json. The threat model (§16) says this explicitly.

2.2.1 Fingerprint source precedence (NEW — codex r3)

host_fingerprint.json stores sha256(host_id || stable_mac) where the inputs are computed from the OS-specific table below, in order:

| OS | host_id (try in order) | stable_mac |
| --- | --- | --- |
| Linux | /etc/machine-id → /var/lib/dbus/machine-id → first stable MAC | First non-loopback non-virtual interface, lex-sorted by name (en…/eth… before wl…); docker0/veth*/br-*/lo excluded |
| macOS | IOPlatformUUID (ioreg -rd1 -c IOPlatformExpertDevice) | First non-loopback non-virtual interface (en0 typical) |
| Windows | HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid | First physical adapter (Get-NetAdapter -Physical), MAC sorted lex by adapter name |
| BSD | kern.hostuuid (sysctl -n kern.hostuuid) | Same MAC rule as Linux |

Excluded interfaces (cross-platform): loopback, point-to-point tunnels (tailscale*, wg*, utun*, ppp*), docker (docker0, br-, veth), VPN (tap*/tun*), VM bridges (vboxnet*, vmnet*), Apple awdl/llw bridges.

Cloud-image false-positive note: bare AMIs/Azure images regenerate /etc/machine-id on first boot via cloud-init; for those, the first-boot fingerprint is what we keep. If an operator clones a running VM post-cloud-init, both host_id AND first-MAC will collide → the daemon correctly flags this as an accidental clone.

If host_id cannot be read on the host's OS, daemon logs fingerprint_host_id_unavailable and falls back to MAC-only. If MAC also unavailable (truly headless container with no NIC), daemon logs fingerprint_unavailable, persists a random UUID as host_id, and the clone-detection feature is effectively disabled for this host (broker concurrent-connection policy still works).
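The selection and fallback rules in this subsection can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped implementation: `pick_stable_mac` and `compute_fingerprint` are hypothetical names, the glob patterns only approximate the exclusion list, and `||` is read as plain string concatenation of the UTF-8 inputs, which the spec does not pin down.

```python
import fnmatch
import hashlib
import uuid

# Approximation of the cross-platform exclusion list (patterns are assumptions).
EXCLUDED_PATTERNS = (
    "lo", "lo0",
    "tailscale*", "wg*", "utun*", "ppp*",
    "docker0", "br-*", "veth*",
    "tap*", "tun*",
    "vboxnet*", "vmnet*",
    "awdl*", "llw*",
)

def pick_stable_mac(interfaces):
    """interfaces maps name -> MAC; return the MAC of the first eligible
    interface in lexical name order, or None if none qualifies."""
    eligible = sorted(
        name for name in interfaces
        if not any(fnmatch.fnmatchcase(name, pat) for pat in EXCLUDED_PATTERNS)
    )
    return interfaces[eligible[0]] if eligible else None

def compute_fingerprint(host_id, stable_mac):
    """Return (sha256 hex digest, log events, clone_detection_enabled)."""
    events = []
    clone_detection = True
    if host_id is None:
        events.append("fingerprint_host_id_unavailable")   # MAC-only fallback
        if stable_mac is None:
            events.append("fingerprint_unavailable")
            host_id = str(uuid.uuid4())                     # random, persisted by caller
            clone_detection = False
    material = (host_id or "") + (stable_mac or "")
    return hashlib.sha256(material.encode("utf-8")).hexdigest(), events, clone_detection
```

Lexical sorting is what puts `en…`/`eth…` ahead of `wl…` without any special-casing.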

Behavior on mismatch (unchanged from v3): refuse / accept-host / remint. [clone] policy = "refuse" | "warn" | "allow" overrides per host.

2.3 Concurrent-duplicate-identity broker policy — unchanged from v3 §2.3

prefer_newest (default), prefer_oldest, allow_concurrent. Configured per-mesh in mesh.cloneConcurrencyPolicy.

2.4 Rename, key rotation — see §14


3. IPC surface — unchanged from v3 §3

Same frozen core, same auth model (UDS 0600 / TCP+SSE bearer / no token in query / all endpoints auth by default / UDS-only in containers / Origin/Host checks / no User-Agent theatre).


4. Delivery contract — at-least-once, permanent broker dedupe

Codex round 3 caught: v3's prose said "24h dedupe window" but the schema (partial unique indexes with no created_at) gave permanent dedupe. We have to pick. v4 chooses permanent dedupe because:

  • It's the simplest correct choice. No GC job, no edge case where a long-asleep daemon's retry slips past the window and double-sends.
  • The unique index storage cost is bounded: at 1 KB per row × 100k messages/day × 365 = ~36 GB/year of broker storage, which is well within the broker's existing message-retention budget. Older message rows themselves can still be GC'd by the existing message retention policy (currently 365d) — only the client_message_id column on retained rows has to live as long as that row does.
  • It eliminates the daemon-side max_age_hours = 23h hack. Daemon outbox TTL becomes "however long you want to keep retrying"; default 7d.
  • It removes a class of "where exactly is the dedupe window edge?" bugs.
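The storage figure in the second bullet can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 KB = 1000 bytes, 1 GB = 10^9 bytes); the row size and message rate are the spec's own estimates, not measurements.

```python
ROW_BYTES = 1_000            # 1 KB per row (estimate from the bullet above)
MESSAGES_PER_DAY = 100_000   # 100k messages/day (estimate from the bullet above)
DAYS_RETAINED = 365          # current message retention policy

bytes_per_year = ROW_BYTES * MESSAGES_PER_DAY * DAYS_RETAINED
gb_per_year = bytes_per_year / 1_000_000_000

print(f"{gb_per_year:.1f} GB/year")  # prints "36.5 GB/year"
```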

If broker storage growth becomes a real concern post-v0.9.0, we can convert to a windowed scheme via a feature-bit upgrade (§15) — but we'd own the correct migration semantics then.

4.1 The contract (precise)

Local guarantee: each successful POST /v1/send returns a stable client_message_id. The send is durably persisted to outbox.db before the response returns.

Broker guarantee: the broker dedupes on client_message_id permanently within the lifetime of the row. Multiple in-flight retries from the daemon for the same client_message_id produce at most one broker-accepted row, regardless of time elapsed (subject to message-row retention policy on the broker). This is advertised via the client_message_id_dedupe feature-bit with { mode: "permanent" } parameter (§15).

End-to-end guarantee: at-least-once delivery to subscribers, with client_message_id propagated in the inbound envelope so receivers can dedupe locally. We do not guarantee at-most-once end-to-end — receiver-side dedupe is the receiver's job. The daemon's inbox.db provides it for daemon-hosted peers.

4.2 Daemon-supplied client_message_id — unchanged from v3 §4.2

Sources: Idempotency-Key header → body client_message_id → daemon-minted ulid. Stored in outbox UNIQUE NOT NULL, propagated to broker, propagated to receivers.
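The source precedence above can be sketched as a small resolver. `resolve_client_message_id` and its `mint` parameter are hypothetical; the default minter uses a UUID purely as a stand-in because the Python standard library has no ULID generator, whereas the spec calls for a daemon-minted ulid.

```python
import uuid

def resolve_client_message_id(headers, body, mint=lambda: uuid.uuid4().hex):
    """Pick the id by precedence: Idempotency-Key header, then the body
    field, then a freshly minted id (stand-in for a ULID)."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    header_key = lowered.get("idempotency-key")
    if header_key:
        return header_key
    body_id = body.get("client_message_id")
    if body_id:
        return body_id
    return mint()
```

Whatever this returns is the value the outbox would store under its UNIQUE NOT NULL constraint.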

4.3 Broker schema delta — clarified as permanent dedupe

ALTER TABLE mesh.topic_message
  ADD COLUMN client_message_id TEXT;
ALTER TABLE mesh.message_queue
  ADD COLUMN client_message_id TEXT;

CREATE UNIQUE INDEX topic_message_client_id_idx
  ON mesh.topic_message(mesh_id, client_message_id)
  WHERE client_message_id IS NOT NULL;
CREATE UNIQUE INDEX message_queue_client_id_idx
  ON mesh.message_queue(mesh_id, client_message_id)
  WHERE client_message_id IS NOT NULL;

-- No created_at column needed for dedupe; the existing message row's
-- created_at handles row-level retention. Dedupe is permanent for the row's
-- lifetime, then naturally GC'd when the row is purged.

Partial unique indexes — legacy traffic without client_message_id (from claudemesh launch, dashboard chat, web posts) is unaffected.
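To see how a partial unique index yields the dedupe behaviour described here, the following sketch reproduces the idea in an in-memory SQLite database. SQLite is only a stand-in for the broker's Postgres; the trimmed table, the `accept` helper and its return shape are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE topic_message ("
    " id INTEGER PRIMARY KEY, mesh_id TEXT NOT NULL,"
    " client_message_id TEXT, body TEXT)"
)
# Same shape as the broker index: unique only where an id was supplied.
conn.execute(
    "CREATE UNIQUE INDEX topic_message_client_id_idx"
    " ON topic_message(mesh_id, client_message_id)"
    " WHERE client_message_id IS NOT NULL"
)

def accept(mesh_id, client_message_id, body):
    """Insert a message; on a duplicate id return the existing row instead.
    Returns (row_id, newly_inserted)."""
    try:
        cur = conn.execute(
            "INSERT INTO topic_message (mesh_id, client_message_id, body)"
            " VALUES (?, ?, ?)",
            (mesh_id, client_message_id, body),
        )
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # Unique violation: the earlier send already produced a row.
        row = conn.execute(
            "SELECT id FROM topic_message"
            " WHERE mesh_id = ? AND client_message_id = ?",
            (mesh_id, client_message_id),
        ).fetchone()
        return row[0], False
```

A retry with the same `(mesh_id, client_message_id)` gets the original row id back, while legacy rows with a NULL id never collide with each other.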

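Migration: additive-only. On Postgres, ALTER TABLE ... ADD COLUMN for a nullable column with no default is a metadata-only change that holds an ACCESS EXCLUSIVE lock on the table only briefly. The index build does not block writes when it is run as CREATE UNIQUE INDEX CONCURRENTLY, which cannot run inside a transaction block; the DDL above omits CONCURRENTLY and would need it added for an online build. Deploy order: schema migration → broker code that reads/writes client_message_id → daemon code that sends it → daemon enforces feature bit.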

4.4 Outbox schema — unchanged from v3 §4.4

UNIQUE NOT NULL on client_message_id. Default max_age_hours raised back to 168h (7d) because broker dedupe is permanent — no need to stay inside a 24h window.

4.5 Inbox schema — unchanged from v3 §4.5

Content table + indexes; FTS5 deferred.

4.6 Crash recovery — unchanged from v3 §4.6

4.7 Failure modes — windowed-broker case removed

The "broker dedupe window expired" failure mode in v3 §4.7 is deleted because dedupe is permanent. Remaining cases:

  • dead rows: surface in claudemesh daemon outbox --failed. User manually requeues (outbox requeue <id>) or drops (outbox drop <id>).
  • Receiver-side dedupe: only daemon-hosted receivers dedupe. claudemesh launch and dashboard chat don't dedupe today; post-v0.9.0.
  • Broker row already GC'd, daemon retries: if the row survives as a soft-delete tombstone, the daemon retry hits the partial unique index → 23505 conflict; broker treats it as already-accepted and returns the original messageId from the tombstone. If the row was hard-deleted by retention, broker returns client_id_unknown. Daemon treats client_id_unknown as "delivered, history may have been pruned" and marks done. Tombstone strategy is a broker implementation choice (advertised via client_message_id_dedupe.tombstone_retention_days in §15.1).

5. Inbound — unchanged from v3 §5


6. Hooks — scopes tightened (codex r2), explicit deferment of arbitrary sends (codex r3)

6.1 Hooks contract — unchanged from v2 §6 / v3 §6.1

6.2 Capability scopes — narrowed for v0.9.0

| Scope | Capability | Notes |
| --- | --- | --- |
| reply:event | Reply to the specific event that triggered this hook | Bound to event_id; daemon validates target; expires on hook exit |
| dm:send:<sender_pubkey> | Send DM only to the specific sender | Bound to one pubkey from event; not a grant to write to anyone else |
| topic:<name>:post | Post to the specific topic that fired | Bound to topic from event; can't write elsewhere |

No read scopes in v0.9.0. Hooks read via the event payload (which the daemon redacts appropriately), not via daemon-mediated reads.

Explicitly deferred to post-v0.9.0 (codex r3 — say it out loud so use cases don't pile up against an undocumented limit):

  • Arbitrary outbound dm:send to anyone other than the event sender — no scope grant for this. "Escalate to oncall" hooks must shell out to claudemesh send <oncall> with the user's normal config; the daemon doesn't issue capability tokens for arbitrary recipients.
  • Cross-topic post — a hook firing on topic:alerts cannot post to topic:incidents. Same reason.
  • Mesh-cross post — hooks see one mesh at a time.
  • Reading state/inbox/peers — covered above.

If a real use case demands cross-topic or arbitrary-recipient hooks post-v0.9.0, we add scopes like dm:send:* (wildcard) or topic:*:post (wildcard) and gate them behind explicit operator opt-in in config ([hooks.<name>] dangerous_wildcards = true). Not in v0.9.0.
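A rough sketch of how the v0.9.0 scopes could be derived and checked, assuming exact-match authorization with no wildcards. The function names and the event field names (`sender_pubkey`, `topic`) are hypothetical; binding `reply:event` to the concrete event_id and expiring the grant on hook exit are left out.

```python
def grant_scopes(event):
    """Derive the scope set for one hook invocation from the triggering event."""
    scopes = {"reply:event"}
    if event.get("sender_pubkey"):
        scopes.add(f"dm:send:{event['sender_pubkey']}")
    if event.get("topic"):
        scopes.add(f"topic:{event['topic']}:post")
    return frozenset(scopes)

def authorize(granted, requested_scope):
    """Exact match only: no wildcard expansion exists in v0.9.0."""
    return requested_scope in granted
```

With a grant derived from an event on topic `alerts` sent by `pkA`, `authorize(granted, "topic:incidents:post")` and `authorize(granted, "dm:send:pkB")` both come back false, which is the deferred behaviour listed above.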

6.3 Sandboxing — unchanged from v3 §6.3

Best-effort network_policy = "deny"; cross-platform unenforceability acknowledged; counter cm_daemon_hook_unenforceable_total exposed.

6.4 Payload size & truncation — unchanged from v3 §6.4

6.5 Audit log + killpg — unchanged


7. Multi-mesh — unchanged

8. Auto-routing — unchanged

9. Service installation — unchanged

10. Observability — unchanged

11. SDKs — unchanged

12. Security model — unchanged


13. Configuration — unchanged shape, plus parameterized features

[features]
require = [
  "client_message_id_dedupe",       # broker provides §4.1 contract
  "concurrent_connection_policy",   # broker honours mesh.cloneConcurrencyPolicy
]
optional = ["mesh_skill_share", "mcp_host"]
# Daemon refuses to start if broker doesn't advertise all `require` bits.
# Broker advertises feature parameters in the negotiation response (§15.1)
# — daemon picks up `dedupe_mode` and `tombstone_retention_days` from there
# and writes them to its runtime view, not config.

14. Lifecycle — key rotation crypto fixed (codex r2), archive format spec'd (codex r3)

14.1 Key rotation — crypto correct (codex r2)

claudemesh daemon rotate-keypair:

  • Mints fresh ed25519 + x25519 keypairs.
  • Registers new pubkeys with the broker as member_keypair_rotated event.
  • Broker associates the new pubkey with the same member id, marks the old pubkey as rotated_out (not revoked); senders who haven't received the rotation event continue to encrypt to the old pubkey for a grace window.
  • Daemon retains the old x25519 private key (only x25519 — ed25519 is for signing, doesn't need a grace window) in keypair-archive.json.
  • During grace, decrypt path: try current private key first; on crypto_box_open_easy failure, walk archived keys in order. Successful archived-key decrypts increment cm_daemon_decrypt_archived_total.
  • After grace expiry, archived keys are zeroed and the file is rewritten without them. Messages still encrypted to a fully-expired pubkey fail to decrypt and increment cm_daemon_decrypt_stale_total.
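The grace-window decrypt path can be sketched as below. The `open_box` callable is a hypothetical stand-in for crypto_box_open_easy that returns None on failure, so no real cryptography is performed here. One simplification to note: the sketch counts every total failure as stale, although a real daemon would presumably want to tell an expired key apart from a corrupt ciphertext.

```python
from collections import Counter

def decrypt_with_grace(ciphertext, current_key, archived_keys, open_box, counters):
    """Try the current private key first, then archived keys in order."""
    plaintext = open_box(ciphertext, current_key)
    if plaintext is not None:
        return plaintext
    for key in archived_keys:
        plaintext = open_box(ciphertext, key)
        if plaintext is not None:
            counters["cm_daemon_decrypt_archived_total"] += 1
            return plaintext
    # Assumption: any remaining failure is attributed to a fully expired key.
    counters["cm_daemon_decrypt_stale_total"] += 1
    return None

# Toy "box" for demonstration: a ciphertext is (key, message) and opens
# only with the matching key.
def toy_open(ciphertext, key):
    return ciphertext[1] if ciphertext[0] == key else None

counters = Counter()
decrypt_with_grace(("old", "hello"), "new", ["old"], toy_open, counters)
```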

14.1.1 Archive record format (NEW — codex r3)

keypair-archive.json (mode 0600, atomic-rename writes):

{
  "schema_version": 1,
  "max_archived_keys": 8,
  "keys": [
    {
      "pubkey":            "ed25519-base64...",
      "x25519_pubkey":     "base64...",
      "x25519_privkey":    "base64...",     // sensitive; whole file is 0600
      "key_id":            "k_01HQX...",     // ulid; matches broker's record
      "created_at":        "2026-04-12T11:00:00Z",
      "rotated_out_at":    "2026-05-03T16:00:00Z",
      "expires_at":        "2026-05-10T16:00:00Z"   // rotated_out_at + grace
    }
  ]
}

Rules:

  • max_archived_keys (default 8): cap on archive size. If a rotation would push the archive past the cap, the oldest entry is force-expired (zeroed + removed) regardless of expires_at. Force-expiry increments cm_daemon_archive_force_expired_total{key_id}. Operator who rotates faster than 8 keys per grace-window-duration is intentionally accepting decryption gaps for very-late inbound messages encrypted to those keys.
  • Grace period default: 7 days. Configurable via [crypto] key_grace_period_days = 7. Hard cap 30 days (codex review: unbounded grace = unbounded archive on disk = bigger blast radius if daemon host is compromised mid-life).
  • Cleanup: scheduled daily at midnight local time + on-demand via claudemesh daemon archive-cleanup. Walks keys[], drops anything with expires_at < now. If file is empty after cleanup, file is deleted.
  • Archive write failure: rotation is aborted. Daemon refuses to commit the new keypair if the archive can't be written durably. Logged as key_rotation_aborted_archive_write_failed. New keypair is in memory only; restart returns to old keypair. This is intentional: the archive write is the durability point of rotation.
  • At-rest encryption: archive file is mode 0600 plaintext, same threat model as keypair.json (root-on-host can read both anyway). Operators who want disk-level encryption can put ~/.claudemesh/ on an encrypted volume; we don't reinvent that. Documented in the threat model (§16). Future option --archive-passphrase deferred — adds passphrase prompt to rotation/decrypt path, but breaks unattended daemon restart.
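The cap, grace and cleanup rules above might look like this in outline. It is a sketch under assumptions: timestamps are `datetime` objects rather than the ISO strings stored in the file, `keys` is ordered oldest first, a requested grace above the hard cap is clamped rather than rejected, and zeroing key material plus the atomic file write are omitted. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_GRACE_DAYS = 30  # hard cap from the rules above

def archive_old_key(keys, entry, rotated_out_at, grace_days=7, max_archived_keys=8):
    """Append a rotated-out key; force-expire the oldest entries past the cap.
    Returns (new_keys, force_expired_key_ids) without mutating the input list."""
    grace_days = min(grace_days, MAX_GRACE_DAYS)   # assumption: clamp, don't reject
    archived = dict(
        entry,
        rotated_out_at=rotated_out_at,
        expires_at=rotated_out_at + timedelta(days=grace_days),
    )
    new_keys = keys + [archived]
    force_expired = []
    while len(new_keys) > max_archived_keys:
        force_expired.append(new_keys.pop(0)["key_id"])   # oldest first
    return new_keys, force_expired

def cleanup(keys, now):
    """Drop every entry whose expires_at is strictly before now."""
    return [key for key in keys if key["expires_at"] >= now]
```

Each id in `force_expired` corresponds to one increment of cm_daemon_archive_force_expired_total; an empty list from `cleanup` is the case where the archive file would be deleted.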

14.2 Backup includes topic state — unchanged from v3 §14.2

keypair.json, keypair-archive.json (with all archived keys), host_fingerprint.json, config.toml, topic_subscriptions.json, topic_keys.json, key_epoch.json, schema_version.

local_token NOT included; regenerated on restore.

14.3 Local token rotation, compromised host revocation, image-clone, uninstall, recovery — unchanged from v2 §14.3


15. Version compat — feature-bit negotiation with parameters (codex r3)

v3's feature bits were boolean. Codex r3: dedupe-window, max-payload, key epochs all need parameters. v4 makes feature bits string-keyed entries that optionally carry a value.

15.1 Feature bits with parameters

| Bit | Type | Parameters | Notes |
| --- | --- | --- | --- |
| client_message_id_dedupe | object | { mode: "permanent" \| "windowed", window_hours?: int, tombstone_retention_days: int } | Daemon reads mode to decide whether to enforce its own outbox max-age cap. tombstone_retention_days (broker-controlled) tells daemon how long it can expect "already-accepted" replies after the source row is GC'd |
| concurrent_connection_policy | bool | | Broker honours mesh.cloneConcurrencyPolicy |
| member_keypair_rotated_event | bool | | Broker emits the event |
| key_epoch | object | { max_concurrent_epochs: int } | Per-topic key epochs supported |
| max_payload | object | { inline_bytes: int, blob_bytes: int } | Hard limits broker enforces |
| mesh_skill_share | bool | | Future |
| mcp_host | bool | | Future |

15.2 Negotiation handshake (parameterized)

On WS connect, after hello, before normal traffic:

→ daemon:  feature_negotiation_request
           {
             require:  ["client_message_id_dedupe",
                        "concurrent_connection_policy"],
             optional: ["mesh_skill_share","mcp_host","max_payload"]
           }

← broker:  feature_negotiation_response
           {
             supported: {
               "client_message_id_dedupe": {
                 "mode": "permanent",
                 "tombstone_retention_days": 30
               },
               "concurrent_connection_policy": true,
               "member_keypair_rotated_event": true,
               "max_payload": {
                 "inline_bytes": 65536,
                 "blob_bytes": 524288000
               }
             },
             missing_required: []
           }

If missing_required is non-empty, daemon closes the connection with code 4010 feature_unavailable, logs forensic event, exits non-zero. Supervisor sees a restart-loop → operator alert.

If client_message_id_dedupe.mode == "windowed", daemon reads window_hours and configures its outbox max_age_hours to window_hours - 1 (margin) instead of the 168h default. Permanent mode → daemon uses the config default, no override.
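The two rules above (abort on missing required bits, derive the outbox TTL from the dedupe mode) can be sketched as one function over the parsed negotiation response. `outbox_max_age_hours` and `FeatureUnavailable` are hypothetical names, and the response is assumed to be an already-decoded dict shaped like the example in this section.

```python
DEFAULT_MAX_AGE_HOURS = 168  # 7d default from the outbox schema section

class FeatureUnavailable(Exception):
    """Raised when the broker reports missing required feature bits."""
    close_code = 4010

def outbox_max_age_hours(response, default=DEFAULT_MAX_AGE_HOURS):
    """Return the outbox max_age_hours implied by the broker's response."""
    missing = response.get("missing_required", [])
    if missing:
        # Caller would close the WS with 4010 feature_unavailable and exit non-zero.
        raise FeatureUnavailable(", ".join(missing))
    dedupe = response.get("supported", {}).get("client_message_id_dedupe")
    if isinstance(dedupe, dict) and dedupe.get("mode") == "windowed":
        return dedupe["window_hours"] - 1   # one hour of margin inside the window
    return default                          # permanent mode: config default applies
```

For the example response above (mode "permanent") this returns 168; a windowed broker advertising `window_hours: 24` would yield 23.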

15.3 IPC negotiation — unchanged from v3 §15.3

GET /v1/version returns daemon version, IPC features, schema version, and the parsed broker feature parameters (so SDKs querying the daemon can display them).

15.4 Compatibility matrix — unchanged from v3 §15.4

Published at GET /v1/compat.


16. Threat model — unchanged from v3 §16, plus RunPod fix

16.1 Attacker classes — unchanged

16.2 Out of scope — unchanged

16.3 Container & CI defaults table (RunPod inconsistency fixed)

| Environment | Identity | IPC | Hooks | Rationale |
| --- | --- | --- | --- | --- |
| Bare metal / VM (default) | Persistent (clone-detected) | UDS + TCP loopback | Enabled | Trusted operator-owned host |
| Docker container (/.dockerenv) | Persistent | UDS-only by default | Enabled | Single-tenant container, host loopback shared |
| Kubernetes (KUBERNETES_SERVICE_HOST) | Persistent | UDS-only | Enabled | Single pod = single tenant |
| CI (CI=true, GITHUB_ACTIONS, etc.) | Ephemeral | UDS-only | Disabled by default ([hooks] enabled = false) | Multi-tenant runner; arbitrary code; ephemeral identity = no cross-job leak; hooks disabled because CI workloads are arbitrary user code |
| RunPod (RUNPOD_POD_ID) | Persistent | UDS-only | Enabled | Long-lived single-tenant sandbox; user owns the pod for its lifetime; identical trust model to a Docker container, NOT to a CI runner |

RunPod resolution (codex r3): v3 listed RunPod under both "ephemeral identity" and "hooks enabled" which was contradictory. v4 treats RunPod as a single-tenant container (Docker-like): persistent identity, UDS-only, hooks enabled. RunPod is removed from the CI auto-detect list (§2.1). Operators who run RunPod as multi-tenant sandbox-as-CI can opt in with --ephemeral + [hooks] enabled = false explicitly.

Operator overrides any default with explicit flags; warning logged for non-default-secure choices.


17. Migration — unchanged from v3 §17

Broker schema delta (additive partial unique indexes, safe online), deployed before daemon. Daemon refuses to start if client_message_id_dedupe feature bit is missing from broker's negotiation response.


What changed v3 → v4 (codex round-3 actionable items)

| Codex r3 item | v4 fix | Section |
| --- | --- | --- |
| Broker dedupe window: permanent vs windowed? | Picked permanent; schema clarified; outbox max_age_hours raised back to 168h | §4 |
| Feature bits should be parameterized | All feature bits are string-keyed with optional value object | §15.1, §15.2 |
| Key archive record format unspecified | Full schema with key_id, timestamps, max_archived_keys, force-expiry rule, write-failure semantics | §14.1.1 |
| Document fingerprint source precedence per OS | Per-OS table for host_id and stable MAC; cloud-image false-positive note | §2.2.1 |
| Explicit deferment of arbitrary outbound hook sends | Listed deferred capabilities + escape hatch path post-v0.9.0 | §6.2 |
| RunPod ephemeral-but-hooks-enabled inconsistency | RunPod treated as single-tenant container; removed from CI auto-detect | §2.1, §16.3 |

What needs review (round 4)

Round 1 → identity, IPC auth, exactly-once lie, hook tokens, surface bloat, missing rotation/recovery/migration/threat-model.

Round 2 → boot-id false-positive, broker must dedupe on client id, CI shared-runner reality, feature-bit negotiation, key rotation crypto, hook scopes, FTS schema, ~7 polish items.

Round 3 → dedupe window semantics, feature-bit parameters, key archive record format, fingerprint source precedence, deferred hook scopes, RunPod inconsistency.

This v4 attempts to address all of round 3. Specifically:

  1. Permanent dedupe choice (§4) — does the storage-cost calculus hold? Is the tombstone path (client_id_unknown after row GC) actually workable, or does it need to be a real tombstone table?
  2. Feature parameter shape (§15.1) — is the type system right (object with optional value)? Should it be a flat key-value list instead? Versioning of parameters within a feature?
  3. Archive record format (§14.1.1) — anything missing? Is max_archived_keys=8 a sensible default, or should it be unbounded with a force-expiry on storage size instead of count?
  4. Fingerprint per-OS table (§2.2.1) — accurate? Is BSD worth listing if we're not actively building for FreeBSD in v0.9.0?
  5. Hook deferment list (§6.2) — does it cover all the realistic v0.9.0 ask? Is the "shell out to claudemesh send" workaround for escalation ergonomically acceptable?
  6. RunPod resolution (§16.3) — agree with treating RunPod as single-tenant container? Or are there real multi-tenant RunPod deployments we should default-guard against?
  7. Anything else still wrong? Read it as if you were going to operate this for a year. What falls down?

Three options after this review:

  • (a) v4 is shippable: lock the spec, start coding the frozen core.
  • (b) v5 needed: list the must-fix items.
  • (c) the architecture itself is wrong: what would you do differently?

Be ruthless. We can break anything.