claudemesh daemon — Final Spec v2
Round 2 after a critical first-pass review. v1 of this spec was reviewed by another model and pushed back on identity model, no-auth IPC, "exactly-once" overclaim, hook credentials, surface bloat, and missing operational flows (rotation, image clones, schema migration, threat model). v2 incorporates all of those.
0. Intent — what this is, what it isn't
0.1 The product reality
claudemesh today is a peer mesh runtime for Claude Code sessions. Each
session runs claudemesh launch, opens a WebSocket to a managed broker, gets
ephemeral identity, sends/receives DMs and topic messages with other Claude Code
sessions, posts to shared state, deploys MCP servers / skills / files,
participates in tasks, schedules reminders. Everything is E2E encrypted with
crypto_box envelopes for DMs and per-topic symmetric keys for topics. The broker
is a routing/persistence layer; peers do the actual work.
The CLI is the canonical surface — every operation is a claudemesh <verb>.
The MCP server is a "tool-less push pipe" that surfaces inbound messages to
Claude Code as channel notifications. There is also a web dashboard, an /v1/*
REST API, and an existing apikey auth model for external integrations.
0.2 The gap
Anything that isn't a Claude Code session is a second-class citizen:
- A RunPod handler that wants to alert a peer when an OOM happens has only one option: curl an apikey-authed REST endpoint. One-way only. The handler is not a peer: it can't be DM'd back, can't be @-mentioned, can't be in peer list, can't claim a task assigned to it, can't host an MCP service or share a skill. It's a webhook spoke, not a participant.
- A Temporal worker that wants to track its own progress in shared mesh state, publish to a #alerts topic, and listen for "retry now" instructions has no good shape. Either it shells out to claudemesh send cold-path (a fresh WS handshake per message: ~1s latency, broker churn, no inbound path) or it speaks the WS protocol manually (significant code, no SDK).
- A long-running CI runner, an IoT box, a phone app, a future Python or Go service: none can be first-class peers without writing the same WS reconnect / queue / encryption / presence code that the existing CLI already has, plus an IPC surface so the host's apps can use it without re-implementing any of that.
0.3 What this daemon is
A long-running process — the same claudemesh-cli binary in daemon mode —
that turns any host into a first-class peer:
- Stable identity across restarts (the host is a member of the mesh, not a series of disconnected sessions).
- Persistent WS to the broker, with reconnect, queue, dedupe.
- Local IPC surface (UDS + loopback HTTP + SSE) that any local app can hit to send, subscribe, query — without learning the broker protocol or carrying long-lived secrets in app code.
- Hooks: shell scripts that fire on events. Server replies to DMs, auto-claims tasks, escalates errors — without the app being involved.
- Same security primitives as claudemesh launch (mesh keypair, crypto_box, per-topic keys). No new auth model toward the broker.
The daemon is the runtime. The CLI in cold-path mode is a fallback. The Claude Code MCP integration is one client of the daemon (eventually).
0.4 What this daemon is NOT
- Not a webhook gateway. /v1/notify and apikeys remain the path for systems that can't host the runtime (third-party SaaS, monitoring tools). The daemon is for systems that can run a process: code you control.
- Not a generic message broker. It speaks claudemesh protocol to one managed broker. It is not a substitute for NATS, Redis, Kafka, RabbitMQ.
- Not a Slack replacement. Topics, DMs, mentions exist because AI sessions use them. Humans interact via the dashboard or a Claude Code session, not by reading the daemon's inbox directly.
- Not a fleet manager. One daemon manages one mesh on one host. Multi-mesh on one host is supported (one daemon per mesh, supervised). Cross-host supervision is an external concern (systemd, k8s, etc.); the daemon doesn't reach across hosts.
0.5 Who deploys this
- A developer running claudemesh daemon up on their laptop so their open Claude Code sessions all share one persistent connection (instead of each opening its own ephemeral WS).
- The same developer running claudemesh daemon install-service on their VPS, RunPod pod, Temporal worker, CI runner, turning each into an addressable peer that scripts on that host can talk to via local IPC.
- Eventually: language SDKs (Python / Go / TypeScript) talking to the daemon on localhost, exposing claudemesh as a first-class API for any app the developer writes.
0.6 Pre-launch posture
No users yet. We can break protocol, schema, surface, anything. Optimize for the architecture we want to live with for years, not for the smallest shippable cut. Codex pushed back on v1 on this exact axis: do not ship graph/vector/MCP/skills/tasks on day one — freeze a small, hardened core, expand deliberately.
1. Process model
One daemon per (user, mesh). Persistent. Survives reboots via OS supervisor. Serves multiple local apps concurrently.
~/.claudemesh/daemon/<mesh-slug>/
pid 0600 pidfile, cleaned on shutdown
sock 0600 unix domain socket (primary IPC)
http.port 0644 auto-allocated loopback port
local_token 0600 per-daemon bearer for HTTP/TCP transports
keypair.json 0600 persistent ed25519 + x25519 — daemon identity
host_fingerprint.json 0600 machine-id + boot-id + interface mac digest
config.toml 0644 user-editable runtime tuning
outbox.db 0600 SQLite — durable outbound queue
inbox.db 0600 SQLite — N-day inbound history, FTS-indexed
schema_version 0644 integer; gates online migrations
daemon.log 0644 JSON-lines, rotating (100 MB / 14 d)
hooks/ 0700 user-managed event scripts
Resource caps (defaults, configurable):
| Resource | Default | Why |
|---|---|---|
| RSS | 256 MB | Most workloads stay under 50 MB; cap protects multi-mesh hosts |
| CPU | unlimited | Hook fan-out can spike briefly; rely on OS scheduler |
| Outbox DB | 5 GB | At 1KB avg msg, that's 5M queued. Disk-full handling at 90% |
| Inbox DB | 5 GB | Same |
| File descriptors | 1024 | UDS clients + SSE streams + DB handles + WS |
| SSE concurrent | 32 streams | DoS protection; configurable up |
| IPC concurrent | 64 in-flight | Backpressure beyond this returns 429 daemon_busy |
| Hook concurrency | 8 | Bounded pool; overflow queues |
Single binary. Same claudemesh-cli package; daemon is one of its modes.
2. Identity — persistent member by default, ephemeral on opt-in, clone-aware
2.1 Modes
claudemesh daemon up # default: persistent member
claudemesh daemon up --ephemeral # session-shaped, no keypair persisted
claudemesh daemon up --ephemeral --ttl=2h # auto-shutdown after TTL
- Persistent (default): ed25519 + x25519 keypair stored in keypair.json. Same identity across restarts, reconnects, supervisor cycles. Right for servers, workers, addressable peers.
- Ephemeral: keypair generated in memory, never written. Daemon exits = identity gone. Right for CI jobs, preview environments, disposable RunPod pods, test harnesses, build agents, anything that should not leave a peer ghost in the broker after teardown.
- --ttl <duration> on ephemeral mode: auto-shutdown after the duration, or after claudemesh daemon down, whichever comes first. Broker member record cleaned up on shutdown.
2.2 Image-clone detection
Two daemons booting with the same keypair.json (VM image clone, container
copy, restored backup) is a serious failure mode — broker sees connection
collisions, presence flickers, encrypted messages route to the wrong host.
Handled in three places:
- Daemon side: host_fingerprint.json is written on first startup: sha256(machine-id || boot-id || mac-of-default-iface || hostname). On every subsequent startup, the fingerprint is recomputed and compared. If it differs, the daemon refuses to start unless --accept-cloned-identity is passed (writes a fresh fingerprint and continues with the same keypair, for legitimate hardware migrations) or --remint is passed (mints fresh keypair, registers as a new member, broker reaps the old member after grace period).
- Broker side: tracks lastSeenHostFingerprint per member. On reconnection from a different fingerprint, broker emits a member_clone_suspected security event to the mesh owner's dashboard. Connection itself is allowed (legitimate hardware swaps happen) but visible for audit.
- Mesh owner: claudemesh member revoke <pubkey> revokes the keypair server-side; daemon receives keypair_revoked push event on next connection and self-disables.
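The daemon-side check can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the daemon's implementation: the HostFacts shape, the function names, the NUL field separator, and the precedence of --remint over --accept-cloned-identity are all hypothetical; the spec only gives the sha256 formula and the two flags.

```typescript
import { createHash } from "node:crypto";

// Hypothetical container for the four inputs named in the spec.
interface HostFacts {
  machineId: string;
  bootId: string;
  defaultIfaceMac: string;
  hostname: string;
}

// sha256(machine-id || boot-id || mac-of-default-iface || hostname).
// A NUL separator keeps ("ab","c") and ("a","bc") distinct; the separator
// is an assumption, the spec does not define one.
function computeFingerprint(f: HostFacts): string {
  return createHash("sha256")
    .update([f.machineId, f.bootId, f.defaultIfaceMac, f.hostname].join("\0"))
    .digest("hex");
}

type StartupDecision = "start" | "refuse" | "accept-clone" | "remint";

// stored === null models the first startup, before the file exists.
function decideStartup(
  stored: string | null,
  current: string,
  flags: { acceptClonedIdentity?: boolean; remint?: boolean } = {},
): StartupDecision {
  if (stored === null || stored === current) return "start";
  if (flags.remint) return "remint"; // fresh keypair, new member
  if (flags.acceptClonedIdentity) return "accept-clone"; // same keypair, new fingerprint
  return "refuse";
}
```

Keeping the decision separate from the hashing makes the refuse / accept / remint branches testable without touching real host identifiers.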
2.3 Rename
--name is taken at first daemon up; subsequent runs read the keypair file
and ignore --name unless --rename is passed (which produces a
member_renamed event the broker propagates to peers).
3. IPC surface — stable core only in v0.9.0
3.1 Frozen core surface (v0.9.0)
Codex's feedback: do not ship every CLI verb on day one. A small hardened core first, expand under explicit capability gates.
# Messaging — durable, tested
POST /v1/send {to, message, priority?, meta?, replyToId?}
POST /v1/topic/post {topic, message, priority?, mentions?}
POST /v1/topic/subscribe {topic} (idempotent)
POST /v1/topic/unsubscribe {topic}
GET /v1/topic/list
GET /v1/inbox ?since=<iso>&topic=<n>&from=<peer>&limit=<n>
GET /v1/inbox/search ?q=<fts-query>&limit=<n> (FTS5)
# Peers + presence — read-only on day one
GET /v1/peers ?mesh=<slug>
POST /v1/profile {summary?, status?, visible?} (limited fields)
# Files — already production in CLI
POST /v1/file/share {path, to?, message?, persistent?}
GET /v1/file/get ?id=<fileId>&out=<path>
GET /v1/file/list
# Events — push
GET /v1/events text/event-stream
core events: message, peer_join, peer_leave, file_shared,
daemon_disconnect, daemon_reconnect, hook_executed
# Control plane
GET /v1/health {connected, lag_ms, queue_depth, inflight,
mesh, member_pubkey, uptime_s, schema_version,
daemon_version, broker_version}
GET /v1/metrics Prometheus exposition
GET /v1/version {daemon, schema, ipc_api} (negotiation)
POST /v1/heartbeat {} (caller-side liveness signal)
That's it. ~20 endpoints. Battle-test these before adding more.
3.2 Capability-gated future surface (v0.9.x roadmap)
Behind explicit feature flags in config.toml, post-v0.9.0:
[capabilities]
state = false # /v1/state/{set,get,list}
memory = false # /v1/memory/{remember,recall}
vector = false # /v1/vector/{store,search,delete}
graph = false # /v1/graph/query
tasks = false # /v1/task/{create,claim,complete}
scheduling = false # /v1/scheduling/remind
mcp_host = false # /v1/mcp/{register,call} (LARGEST surface; treat as v1.0)
skill_share = false # /v1/skill/{deploy,share}
Each capability is its own ship: design review, security review, test coverage, capability-token model, then enable. None enabled in v0.9.0.
3.3 Local IPC authentication
Codex was right: loopback TCP without auth is an attack surface (browser SSRF, container side-channels, sandboxed apps with network but no FS access, WSL host-shared loopback).
| Transport | Auth | Rationale |
|---|---|---|
| UDS | None (relies on FS perms 0600) | Reaching the socket = same UID = can read keypair anyway |
| TCP loopback | Required: Authorization: Bearer <local_token> | Browser/container/sandbox can reach loopback without FS access |
| SSE | Required: Authorization: Bearer <local_token> | Same |
local_token is 32 bytes from crypto.randomBytes (256 bits), encoded base64url,
written to local_token with mode 0600 at daemon init. Rotated on claudemesh daemon rotate-token. SDKs auto-discover the token by reading the file (same
mechanism as discovering the socket path).
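A minimal sketch of minting and checking such a token with Node's crypto module. The function names are hypothetical, and the constant-time comparison is an assumption about how the daemon might verify the bearer; the spec only fixes the token size and encoding.

```typescript
import { randomBytes, timingSafeEqual } from "node:crypto";

// 32 random bytes encode to a 43-character base64url string (no padding).
function mintLocalToken(): string {
  return randomBytes(32).toString("base64url");
}

// Constant-time comparison of a presented token against the stored one.
function tokenMatches(presented: string, stored: string): boolean {
  const a = Buffer.from(presented, "utf8");
  const b = Buffer.from(stored, "utf8");
  // timingSafeEqual throws on unequal lengths, so compare lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}

// Extracts the token from an "Authorization: Bearer <token>" header value.
function bearerFrom(header: string | undefined): string | null {
  const m = /^Bearer ([A-Za-z0-9_-]+)$/.exec(header ?? "");
  return m?.[1] ?? null;
}
```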
Additional defenses:
- HTTP listener binds 127.0.0.1 only. Refuses to bind elsewhere unless [ipc] http_bind = "..." is set explicitly and [ipc] http_external_auth = "..." points to a separate token file (escape hatch for advanced users; never the default).
- Origin header check: rejects requests with Origin set unless it's explicitly allowlisted in config (default: empty allowlist). Defends against browser SSRF.
- Host header check: must be localhost or 127.0.0.1. Defends against DNS rebinding.
- CORS: Access-Control-Allow-Origin never echoed; preflight returns 403.
- User-Agent required (rejects empty UA; mild signal against simple SSRF).
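The header defenses above might be combined into a single gate along these lines. This is a sketch only: the LoopbackRequest shape, the status codes, the reason strings, and the choice to ignore a port suffix on Host are all assumptions, not part of the spec.

```typescript
// Hypothetical, framework-free view of the headers the gate needs.
interface LoopbackRequest {
  host?: string;
  origin?: string;
  userAgent?: string;
}

type GateResult = { ok: true } | { ok: false; status: number; reason: string };

function checkLoopbackHeaders(
  req: LoopbackRequest,
  originAllowlist: readonly string[] = [], // default: empty allowlist
): GateResult {
  // Host must name the loopback listener; a ":port" suffix is ignored (assumption).
  const hostname = (req.host ?? "").replace(/:\d+$/, "").toLowerCase();
  if (hostname !== "localhost" && hostname !== "127.0.0.1") {
    return { ok: false, status: 403, reason: "bad_host" };
  }
  // Any Origin header is rejected unless explicitly allowlisted.
  if (req.origin !== undefined && !originAllowlist.includes(req.origin)) {
    return { ok: false, status: 403, reason: "origin_not_allowed" };
  }
  // An absent or blank User-Agent is rejected.
  if (!req.userAgent || req.userAgent.trim() === "") {
    return { ok: false, status: 400, reason: "missing_user_agent" };
  }
  return { ok: true };
}
```

Bearer-token verification would run after this gate, so a browser-originated request is rejected before the token is even inspected.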
3.4 Request limits + backpressure
- Max request body: 1 MB (override per endpoint; file uploads use a separate streaming endpoint).
- Max response body: 10 MB; truncated with Link: rel=next cursor.
- Max in-flight IPC requests: 64. Beyond → 429 daemon_busy.
- Max SSE concurrent streams: 32. Beyond → 429 too_many_streams.
- Per-token rate limit: 100 req/sec sustained, 1000/sec burst (token bucket). Tunable.
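For the per-token rate limit, a token bucket with the stated defaults could look like the sketch below. The class name and the injected clock parameter are assumptions made for testability; the spec only names the algorithm and the two numbers.

```typescript
// Token bucket: refills at ratePerSec, holds at most `burst` tokens.
class TokenBucket {
  private readonly ratePerSec: number;
  private readonly burst: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(ratePerSec: number, burst: number, nowMs: number) {
    this.ratePerSec = ratePerSec;
    this.burst = burst;
    this.tokens = burst; // start full so an initial burst is allowed
    this.lastRefillMs = nowMs;
  }

  // Returns true if the request may proceed, false if it should be rejected.
  tryTake(nowMs: number): boolean {
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

With `new TokenBucket(100, 1000, Date.now())`, a caller can spend up to 1000 requests at once and is then held to 100 per second.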
4. Delivery contract — durable at-least-once with idempotent send
Codex was right: "exactly-once" is a lie. Replacing the claim with a precise contract.
4.1 The contract
The daemon guarantees: each successful send call enqueues exactly one outbox row, identified by a stable messageId, which is eventually delivered to the broker (or moved to dead once its TTL expires). The daemon does not guarantee that downstream peers process the message exactly once; that is the receiver's responsibility, aided by the propagated idempotency_key.
Concretely:
- Caller → daemon: caller may supply Idempotency-Key; daemon dedupes identical keys for 24h. Without one, daemon mints a ulid and returns it as messageId.
- Daemon → broker: each outbox row has at-most-one inflight transmit. Daemon retries with exponential backoff until broker ACKs OR row hits TTL (7d default → moves to dead).
- Broker → peer: existing claudemesh delivery semantics. Broker dedupes by messageId. Peer receives ≥1 copy.
- Peer hooks: hooks see idempotency_key in the event JSON. Idempotent hook implementations are the receiver's responsibility.
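The caller-to-daemon step might be sketched as below. This is illustrative only: the in-memory Map stands in for whatever durable store the daemon uses, the injected mintId stands in for a ULID generator, and minting a messageId even when the caller supplies a key is an assumption.

```typescript
const DEDUPE_WINDOW_MS = 24 * 60 * 60 * 1000; // 24h, per the contract

class IdempotencyCache {
  private readonly seen = new Map<string, { messageId: string; atMs: number }>();

  // Returns the messageId for this send and whether it was a duplicate.
  accept(
    key: string | undefined,
    mintId: () => string, // hypothetical ULID generator
    nowMs: number,
  ): { messageId: string; duplicate: boolean } {
    if (key !== undefined) {
      const hit = this.seen.get(key);
      if (hit && nowMs - hit.atMs < DEDUPE_WINDOW_MS) {
        return { messageId: hit.messageId, duplicate: true };
      }
    }
    const messageId = mintId();
    if (key !== undefined) this.seen.set(key, { messageId, atMs: nowMs });
    return { messageId, duplicate: false };
  }
}
```

A duplicate returns the original messageId, so a retrying caller sees the same answer as its first attempt and no second outbox row is created.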
4.2 Outbox row state machine
┌────────────┐
send call → │ pending │
└─────┬──────┘
│ daemon picks up batch
▼
┌────────────┐
│ inflight │ ← attempts++, last_error written
└─┬────┬─────┘
│ │ broker NACK / network err
broker ACK │ └──────────► back to pending (with exp. backoff)
▼
┌────────────┐
│ done │ ← delivered_at set, broker_message_id stored
└────────────┘
age > max_age_hours:
┌────────────┐
│ dead │ ← surfaces in `daemon outbox --failed`
└────────────┘
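The diagram above can be read as a pure transition function. The sketch below is one possible encoding: the row fields, event names, and backoff constants (1 s doubling up to 5 min) are assumptions, since the spec only says "exponential backoff".

```typescript
type OutboxStatus = "pending" | "inflight" | "done" | "dead";

interface OutboxRow {
  status: OutboxStatus;
  attempts: number;
  createdAtMs: number;
  nextAttemptAtMs: number;
  lastError?: string;
}

type OutboxEvent =
  | { kind: "picked_up" }
  | { kind: "broker_ack" }
  | { kind: "failed"; error: string }; // broker NACK or network error

// Assumed backoff parameters; not specified in the document.
const MIN_BACKOFF_MS = 1000;
const MAX_BACKOFF_MS = 300000;

function backoffMs(attempts: number): number {
  return Math.min(MAX_BACKOFF_MS, MIN_BACKOFF_MS * 2 ** Math.max(0, attempts - 1));
}

function step(row: OutboxRow, ev: OutboxEvent, nowMs: number, maxAgeMs: number): OutboxRow {
  if (row.status === "done" || row.status === "dead") return row; // terminal states
  if (nowMs - row.createdAtMs > maxAgeMs) return { ...row, status: "dead" };
  if (row.status === "pending" && ev.kind === "picked_up") {
    return { ...row, status: "inflight", attempts: row.attempts + 1 };
  }
  if (row.status === "inflight" && ev.kind === "broker_ack") {
    return { ...row, status: "done" };
  }
  if (row.status === "inflight" && ev.kind === "failed") {
    return {
      ...row,
      status: "pending",
      lastError: ev.error,
      nextAttemptAtMs: nowMs + backoffMs(row.attempts),
    };
  }
  return row; // event does not apply in this state
}
```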
4.3 Crash recovery
On daemon startup:
- Any rows in inflight are reset to pending with attempts++ and next_attempt_at = now + min_backoff. Note: this MAY cause double-delivery of a message that was actually ACK'd by the broker but whose ACK didn't persist locally before the crash. The idempotency_key propagates to the broker (via message meta) so the broker dedupes by key.
- outbox.db integrity check (PRAGMA integrity_check); if it fails, daemon refuses to start and points user at claudemesh daemon recover.
- inbox.db integrity check; on failure, renames the file to inbox.db.corrupt-<ts>, creates fresh empty inbox, logs inbox_corruption_recovered (does not block startup; inbox is a cache).
4.4 Disk-full
- At 80% of outbox.max_queue_size or 80% of [disk] reserved_bytes: daemon emits outbox_pressure_high event + Prometheus gauge. Sends are still accepted.
- At 95%: new sends return 507 insufficient_storage. Existing inflight drains.
- At 100%: daemon enters degraded mode: refuses sends, refuses new SSE streams, holds open WS for inbound only. daemon status shows degraded.
- Recovery: drain via broker reconnect (drains done rows older than retention window) or claudemesh daemon outbox prune --confirm.
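The three thresholds amount to a small classification step. A sketch, with hypothetical level names and integer arithmetic to avoid floating-point edge cases at the boundaries:

```typescript
type DiskPressure = "normal" | "high" | "reject_sends" | "degraded";

// usedBytes / capBytes compared against 80%, 95%, 100% without division.
function classifyPressure(usedBytes: number, capBytes: number): DiskPressure {
  if (capBytes <= 0 || usedBytes >= capBytes) return "degraded"; // 100%: refuse sends and new SSE
  if (usedBytes * 100 >= capBytes * 95) return "reject_sends"; // 95%: 507 insufficient_storage
  if (usedBytes * 100 >= capBytes * 80) return "high"; // 80%: outbox_pressure_high event
  return "normal";
}
```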
4.5 Schema migration
schema_version file holds an integer. On startup:
- If schema_version matches binary's expected version → continue.
- If version is older → run apps/cli/src/daemon/migrations/<from>-<to>.sql in a transaction, write new version on success.
- If version is newer (downgrade) → daemon refuses to start, error points at re-installing matching version.
Migrations are forward-only. Each migration is ≤ 1 transaction. Test coverage required: every migration has a snapshot test from prior schema.
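The startup decision could be expressed as a small planner. This sketch assumes migrations advance one version at a time and that file names follow the <from>-<to>.sql pattern with consecutive integers; neither detail is confirmed by the spec.

```typescript
type MigrationPlan =
  | { action: "continue" }
  | { action: "migrate"; files: string[] }
  | { action: "refuse"; reason: string };

function planMigration(onDisk: number, expected: number): MigrationPlan {
  if (onDisk === expected) return { action: "continue" };
  if (onDisk > expected) {
    // Downgrade: forward-only migrations cannot handle this.
    return { action: "refuse", reason: `schema ${onDisk} is newer than binary (${expected})` };
  }
  const files: string[] = [];
  for (let v = onDisk; v < expected; v++) files.push(`${v}-${v + 1}.sql`);
  return { action: "migrate", files };
}
```

For example, `planMigration(5, 7)` yields the files `5-6.sql` and `6-7.sql`, each of which would run in its own transaction.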
5. Inbound — durable history with FTS
Every inbound message is written to inbox.db before any hook fires:
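-- Row store is a regular table: FTS5 virtual tables accept neither CREATE INDEX
-- nor UNIQUE constraints, and INSERT OR IGNORE needs the latter to dedupe.
CREATE TABLE inbox (
  id INTEGER PRIMARY KEY,
  message_id TEXT NOT NULL, mesh TEXT NOT NULL, topic TEXT, sender_pubkey TEXT NOT NULL,
  sender_name TEXT, body TEXT, meta TEXT, idempotency_key TEXT UNIQUE,
  received_at TEXT NOT NULL, replied_to_id TEXT
);
CREATE INDEX inbox_received_at ON inbox(received_at);
-- Search index over the text columns (assumption: the writer keeps it in sync).
CREATE VIRTUAL TABLE inbox_fts USING fts5(
  topic, sender_name, body, meta, content='inbox', content_rowid='id'
);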
- Receiver-side dedupe: on insert, INSERT OR IGNORE on idempotency_key. Duplicate broker delivery becomes a no-op locally + cm_daemon_dedupe_total counter increments.
- 30-day rolling retention (configurable). VACUUM weekly during low-traffic window.
- claudemesh daemon search "OOM" queries the FTS index.
- Apps connecting mid-stream replay history via ?since=<iso>.
6. Hooks — first-class but tightly bounded
Codex was right: hooks were underspecified, and putting CLAUDEMESH_TOKEN in
every hook env was a serious exfil footgun.
6.1 Hook directory & contract
hooks/
on-message.sh every inbound message (DM + topic)
on-dm.sh DMs only
on-mention.sh when @<my-name> appears anywhere
on-topic-<name>.sh a specific topic
on-file-share.sh file shared with me
on-disconnect.sh WS dropped
on-reconnect.sh reconnected
on-startup.sh daemon up
pre-send.sh filter / mutate outbound (last gate)
hooks.toml per-hook policy (auth, redaction, env, timeout)
hooks.toml (mandatory; daemon refuses to invoke hooks without it):
[on-mention]
enabled = true
timeout_s = 30
output_size_limit = 65536
redact_payload = ["body.password", "meta.api_key"] # JSONPath
allow_reply = true # if false, stdout reply ignored
capability_token_scope = ["topic:alerts:post"] # scoped, NOT broker session token
network_policy = "deny" # 'deny' | 'allow' | 'allowlist'
network_allowlist = [] # only if policy = 'allowlist'
fs_policy = "readonly" # 'readonly' | 'rw' | 'sandbox'
killpg_on_timeout = true # SIGTERM process group, not just child
audit = true # log every invocation
6.2 Credentials passed to hooks
Default: nothing. No CLAUDEMESH_TOKEN, no broker session, nothing that
lets the hook impersonate the daemon's identity broadly.
Opt-in per hook: capability_token_scope = ["topic:alerts:post"] mints a
short-lived (5 min) capability token scoped to exactly that capability.
The hook can use it to call back into the daemon's IPC ("post a reply to
#alerts") but cannot use it to read state, read inbox, deploy MCP, etc. Token
expires when hook process exits OR after 5 min, whichever first.
Capability tokens are local-only — they authorize against the daemon's IPC surface, never the broker directly. Daemon translates capability calls into broker calls.
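A rough sketch of how the daemon might mint and check such tokens. The registry class, its method names, and the exact-match scope check are assumptions; the spec fixes only the 5-minute lifetime, the scoping, and expiry on hook exit.

```typescript
import { randomBytes } from "node:crypto";

const CAPABILITY_TTL_MS = 5 * 60 * 1000; // short-lived: 5 min

interface CapabilityGrant {
  hookName: string;
  scopes: readonly string[];
  expiresAtMs: number;
}

class CapabilityRegistry {
  private readonly grants = new Map<string, CapabilityGrant>();

  // Mints a token valid only for the listed scopes, e.g. ["topic:alerts:post"].
  mint(hookName: string, scopes: readonly string[], nowMs: number): string {
    const token = randomBytes(32).toString("base64url");
    this.grants.set(token, { hookName, scopes, expiresAtMs: nowMs + CAPABILITY_TTL_MS });
    return token;
  }

  // Called when the hook process exits (whichever comes first with the TTL).
  revoke(token: string): void {
    this.grants.delete(token);
  }

  // Exact scope match only; wildcard scopes are not modelled in this sketch.
  allows(token: string, scope: string, nowMs: number): boolean {
    const grant = this.grants.get(token);
    if (!grant || nowMs >= grant.expiresAtMs) return false;
    return grant.scopes.includes(scope);
  }
}
```

Because the registry lives inside the daemon, a leaked token is useless against the broker and stops working as soon as the hook exits.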
Env variables the hook DOES get:
- CLAUDEMESH_MESH=<slug>
- CLAUDEMESH_HOOK_NAME=on-mention
- CLAUDEMESH_EVENT_ID=<ulid>
- CLAUDEMESH_CAPABILITY_TOKEN=<token> (only if scope was configured; else absent)
- CLAUDEMESH_DAEMON_SOCK=<path> (so SDKs can connect for capability calls)
- PATH=/usr/bin:/bin (locked down)
6.3 Payload redaction
Hook stdin receives event JSON minus paths listed in redact_payload. Default
redaction: nothing. Mesh owner / daemon admin opts in.
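As an illustration, redaction for simple dotted paths such as body.password might look like this. It is a deliberately reduced sketch: hooks.toml labels the entries as JSONPath, but only plain dot-separated object keys are handled here, and the function name is hypothetical.

```typescript
type Json = null | boolean | number | string | Json[] | { [k: string]: Json };

// Returns a deep copy of `event` with each dotted path removed.
// Paths that do not resolve are ignored; arrays are not traversed.
function redact(event: Json, paths: readonly string[]): Json {
  const copy: Json = JSON.parse(JSON.stringify(event)); // input stays untouched
  for (const path of paths) {
    const keys = path.split(".");
    let node: Json = copy;
    for (let i = 0; i < keys.length; i++) {
      if (node === null || typeof node !== "object" || Array.isArray(node)) break;
      const key = keys[i];
      if (key === undefined || !(key in node)) break;
      if (i === keys.length - 1) delete node[key];
      else node = node[key] ?? null;
    }
  }
  return copy;
}
```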
6.4 Timeout & cleanup
- Per-hook timeout_s (default 30s). On timeout, daemon sends SIGTERM to the hook's process group (killpg_on_timeout=true), waits 5s, then SIGKILL. Catches forked grandchildren that were trying to keep things alive.
- Hook stdout/stderr captured, truncated at output_size_limit. Larger outputs log a warning and discard the overflow.
6.5 Audit log
Every hook invocation logs:
{"hook":"on-mention","event_id":"01H8…","exit":0,"duration_ms":47,
"stdout_bytes":120,"stderr_bytes":0,"replied":true,"capability_calls":1,
"ts":"2026-05-03T14:00:00Z"}
Stored in daemon.log; metrics exposed via cm_daemon_hook_*.
6.6 Sandboxing — supported, not required
The contract supports sandboxing without mandating it (mandating breaks too many real workflows):
- Linux: opt-in sandbox = "bubblewrap" in hooks.toml runs the hook under bwrap with no network (unless network_policy != "deny"), readonly FS except /tmp/<hook-id>, no DBus, no /proc.
- macOS: opt-in sandbox = "sandbox-exec" with similar profile.
- Default: no sandbox; rely on Unix permissions + network_policy=deny (which is enforced via unshare --net on Linux when available, otherwise best-effort firewall rule).
7. Multi-mesh — daemon-per-mesh, supervised by a thin shell
7.1 The decision
One daemon per mesh, coordinated by a supervisor script. Codex pushed back — "why not one daemon serving all meshes?". Going daemon-per-mesh because:
- Crash isolation: a panic in prod mesh's WS reader can't corrupt dev mesh's outbox.
- Resource accounting: per-mesh RSS, per-mesh metrics, per-mesh disk budget; easy to attribute, easy to cap.
- Independent identity: each mesh has its own keypair, host fingerprint, capability gates. Conflating into one process forces shared trust.
- Independent upgrades: rolling daemon restarts per mesh, no downtime across all meshes.
- Simpler code: zero cross-mesh routing logic in the daemon body.
The cost (process count, log fan-out) is real but bounded: typical user has
1–3 meshes. Heavy users (10–20) get a claudemesh daemon ps + --all UX that
treats them as a fleet.
7.2 Resource caps for fleet hosts
config.toml has [fleet] section read by daemon up --all:
[fleet]
max_daemons = 10
total_memory_budget = "2GB" # divided across daemons; each gets budget/N RSS cap
total_disk_budget = "20GB" # divided across outbox + inbox per daemon
If a user hits max_daemons, daemon up <next> errors with a clear message
pointing at the cap.
7.3 Commands
claudemesh daemon up --mesh <slug> # one mesh
claudemesh daemon up --all # all joined meshes (respects fleet caps)
claudemesh daemon down --mesh <slug>
claudemesh daemon down --all
claudemesh daemon status # all daemons, table view
claudemesh daemon status --json # machine-readable
claudemesh daemon ps # alias of status
claudemesh daemon logs --mesh <slug> [-f]
claudemesh daemon restart --mesh <slug>
8. Auto-routing — clarified, not transparent
Codex pushed back: "no behavior difference" was hand-waving. Persistent identity, queueing, hooks, profile state — these legitimately change behavior.
8.1 What changes when a daemon is up
| Behavior | Cold-path CLI | Daemon-routed CLI |
|---|---|---|
| Sender attribution | Ephemeral session pubkey for that invocation | Daemon's persistent member pubkey |
| Latency | ~1s (fresh WS handshake) | <10ms (local UDS round-trip) |
| Send durability | None — if broker is unreachable, command fails | Outbox queue retries until TTL |
| Inbound visibility | Not available (cold path closes WS) | claudemesh inbox reads daemon's inbox.db |
| Hooks | Not invoked | Invoked on every event |
| Presence | Brief flicker as session connects+disconnects | Continuous; daemon's status reflected |
| peer list shows me as | A new ephemeral session each invocation | The daemon's persistent member |
8.2 Detection logic — connect, don't trust pidfile
1. Check ~/.claudemesh/daemon/<slug>/sock exists.
2. Attempt UDS connect with 100ms timeout.
3. If connect succeeds: send GET /v1/version.
4. If response is well-formed AND mesh matches AND daemon_version is
compatible → use this daemon.
5. Otherwise → cold path.
PID liveness check is unreliable (PID reuse, process orphaned). Socket handshake is canonical.
8.3 Coexistence with claudemesh launch
Both can be running for the same mesh:
- Daemon connected as persistent member runpod-worker-3.
- A separate claudemesh launch connects as ephemeral session of the same member. Visible to peers as "another session of runpod-worker-3" (sibling-session relationship via memberPubkey).
- CLI verbs from inside claudemesh launch route through the launch session, NOT the daemon (preserves "this Claude Code session has its own ephemeral identity" semantics).
- CLI verbs from a separate shell route through the daemon (faster, durable).
This is consistent with the v0.5.1 self-DM guard and sibling-session semantics already shipped.
9. Service installation
claudemesh daemon install-service # writes systemd unit / launchd plist / Windows SC
claudemesh daemon uninstall-service
claudemesh daemon install-service --user # user-scope unit (default; no root)
claudemesh daemon install-service --system # system-scope unit (root; multi-user host)
Unit defaults:
- Restart=on-failure, RestartSec=5s, StartLimitBurst=5/5min
- MemoryMax=<resource cap>, TasksMax=128, LimitNOFILE=4096
- StandardOutput/Error=journal
- NoNewPrivileges=yes, PrivateTmp=yes, ProtectSystem=strict, ProtectHome=read-only with ReadWritePaths=~/.claudemesh
- For systemd --user, runs as the invoking user (no root needed).
claudemesh install (the existing setup verb) gains an opt-in prompt:
"Install as a background service that always runs?" Defaults differently
based on detected environment (TTY vs no-TTY, presence of systemd, etc.).
10. Observability
Standard CLI surface unchanged from v1, with the new gauges/counters:
cm_daemon_connected{mesh} 0/1
cm_daemon_reconnects_total{mesh,reason}
cm_daemon_lag_ms{mesh} last broker round-trip
cm_daemon_outbox_depth{mesh,status} pending|inflight|dead
cm_daemon_outbox_age_seconds{mesh} oldest pending row
cm_daemon_dedupe_total{mesh,direction} out|in
cm_daemon_disk_pct{mesh,kind} outbox|inbox
cm_daemon_send_total{mesh,kind,status}
cm_daemon_recv_total{mesh,kind,from_type}
cm_daemon_hook_invocations_total{hook,exit}
cm_daemon_hook_duration_seconds{hook} histogram
cm_daemon_hook_capability_calls_total{hook,scope}
cm_daemon_ipc_request_total{endpoint,status,transport}
cm_daemon_ipc_duration_seconds{endpoint} histogram
cm_daemon_local_token_rotations_total
cm_daemon_clone_suspected_total
Tracing: optional OpenTelemetry export.
11. SDKs — three, slim, core-API only
Same shape as v1 but only target the frozen core surface (§3.1). State / memory / vector / graph / tasks / MCP / skills are NOT in v0.9.0 SDKs — they ship per capability gate.
Each SDK auto-discovers the daemon: reads sock path, http.port,
local_token. SDKs versioned in lockstep with the daemon's /v1 surface.
12. Security model — explicit boundaries
| Boundary | Trust | Mechanism |
|---|---|---|
| App ↔ Daemon (UDS) | OS user, FS perms | UDS 0600 |
| App ↔ Daemon (TCP/SSE) | OS user + bearer token | 127.0.0.1 only + local_token + Origin/Host check |
| Hook ↔ Daemon | Capability scope | Short-lived capability token, never broker session |
| Daemon ↔ Broker | Mesh keypair | WSS + ed25519 hello + crypto_box DM + per-topic keys |
| Daemon ↔ Disk | OS user | All daemon files mode 0600/0644 under ~/.claudemesh/daemon/ |
| Cloned identity | Host fingerprint check | Daemon refuses to start; dashboard audit event |
13. Configuration
config.toml — same shape as v1 plus:
- [capabilities] (§3.2)
- [fleet] (§7.2)
- [disk] reserved_bytes (§4.4)
- [clone] policy = "refuse" | "warn" | "allow" (§2.2)
User-editable. claudemesh daemon reload re-reads it without dropping the WS.
14. Lifecycle — the operational flows v1 was missing
14.1 Key rotation
claudemesh daemon rotate-keypair
Mints fresh ed25519 + x25519. Registers new pubkey with broker as a member_keypair_rotated operation (broker associates new pubkey with same member id). Old pubkey is held server-side for 24h grace (decrypts in-flight messages encrypted to old pubkey), then revoked.
14.2 Local token rotation
claudemesh daemon rotate-token
Atomically writes a new local_token. The daemon accepts the old token alongside
the new one for a 60s grace period, so SDKs that already hold the old token can
finish in-flight requests; new requests use the new token. After 60s, the old token is rejected.
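The grace window can be modelled as a store that remembers one previous token with a deadline. A sketch, with a hypothetical class name and an injected clock; a real implementation would presumably also use a constant-time comparison rather than ===.

```typescript
const ROTATION_GRACE_MS = 60000; // 60s grace for the old token

class LocalTokenStore {
  private current: string;
  private previous: { token: string; validUntilMs: number } | null = null;

  constructor(initial: string) {
    this.current = initial;
  }

  // Installs `next` as the current token; the old one stays valid for 60s.
  rotate(next: string, nowMs: number): void {
    this.previous = { token: this.current, validUntilMs: nowMs + ROTATION_GRACE_MS };
    this.current = next;
  }

  // Plain === is used for brevity only (not constant-time).
  accepts(presented: string, nowMs: number): boolean {
    if (presented === this.current) return true;
    return (
      this.previous !== null &&
      presented === this.previous.token &&
      nowMs < this.previous.validUntilMs
    );
  }
}
```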
14.3 Compromised host revocation
From the dashboard or another mesh-owner session:
claudemesh member revoke <pubkey>
Broker marks member as revoked. Connected daemon receives member_revoked
push, self-disables (refuses new IPC, closes WS), exits with non-zero status,
logs forensic event.
14.4 Image-clone lifecycle
Covered in §2.2. Three policies (refuse, warn, allow — settable per-host
via config.toml).
14.5 Backup & restore
claudemesh daemon backup --out <path> # dumps keypair, config, schema_version
claudemesh daemon restore --in <path> # writes them; refuses if a daemon is running
Backup is encrypted with a passphrase (Argon2id KDF + crypto_secretbox). The intent: "I'm reformatting my laptop, I want my mesh memberships back without re-joining." NOT for "deploy this same identity on 10 servers" (that's the clone problem above).
14.6 Uninstall / reset
claudemesh daemon uninstall # full purge: stops, deregisters from broker, wipes ~/.claudemesh/daemon/<slug>
claudemesh daemon reset # wipes local state, keeps broker member registration (for restoring)
Uninstall calls broker's POST /v1/me/members/:pubkey/leave so member doesn't
linger as ghost. Reset is local-only, no broker contact.
14.7 Disk corruption recovery
claudemesh daemon recover # interactive: integrity check + offer rebuild paths
Detects corrupt outbox.db / inbox.db. Options:
- Restore from local journal-only inbox (read-only mode; sends disabled).
- Wipe + rebuild from broker (fetches last N days of message history if available; topics need re-subscribe; outbox is irrecoverable, queued sends are lost).
- Wipe + start fresh.
15. Version compatibility
15.1 Negotiation handshake
On daemon connect to broker AND on every IPC request:
GET /v1/version
{
"daemon_version": "0.9.0",
"ipc_api": "v1",
"ipc_minor": 3, # additive minor
"schema_version": 7,
"broker_protocol_min": "0.7",
"broker_protocol_max": "0.9"
}
15.2 Compat policy
| Across | Policy |
|---|---|
| Daemon ↔ Broker | Daemon refuses to connect if broker version < daemon's broker_protocol_min. Broker logs warning. Pre-1.0 we may break this with notice; post-1.0 we maintain backward compat for ≥6 months. |
| CLI ↔ Daemon | CLI checks daemon's ipc_api. Same major = OK. Different major = CLI falls back to cold-path with warning. |
| SDK ↔ Daemon | SDK negotiates ipc_minor; uses minimum of (SDK's, daemon's). |
| Daemon binary ↔ schema | Binary refuses to start on unknown schema; migrations run forward-only; no automatic downgrade. |
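The first three rows of the policy table reduce to a few comparisons. In this sketch the function names are hypothetical, ipc_api is compared as an opaque major string (for example "v1"), and broker protocol versions are assumed to be dotted numeric strings.

```typescript
// SDK ↔ Daemon: use the minimum of the two additive minors.
function negotiateIpcMinor(sdkMinor: number, daemonMinor: number): number {
  return Math.min(sdkMinor, daemonMinor);
}

// CLI ↔ Daemon: same major routes through the daemon, otherwise cold path.
function cliRoute(cliIpcApi: string, daemonIpcApi: string): "daemon" | "cold-path" {
  return cliIpcApi === daemonIpcApi ? "daemon" : "cold-path";
}

// Compares dotted versions numerically, so "0.10" sorts after "0.7".
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Daemon ↔ Broker: refuse to connect below broker_protocol_min.
function daemonMayConnect(brokerVersion: string, brokerProtocolMin: string): boolean {
  return compareVersions(brokerVersion, brokerProtocolMin) >= 0;
}
```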
15.3 Compatibility matrix (published in docs, machine-readable JSON at /v1/compat)
{
"daemon": "0.9.0",
"compatible_brokers": ["0.7.x", "0.8.x", "0.9.x"],
"compatible_clis": ["0.9.x"],
"compatible_sdks": {
"python": ">=0.9.0,<1.0.0",
"go": ">=0.9.0,<1.0.0",
"ts": ">=0.9.0,<1.0.0"
}
}
16. Threat model
16.1 Attacker classes
| Attacker | Has | Wants | Mitigations |
|---|---|---|---|
| Local same-user shell | OS user creds | Send / read mesh messages | None needed — they already have FS access to keypair; daemon is no worse |
| Local different-user shell | Different OS user | Read this user's daemon | UDS 0600 + TCP loopback + token. Requires OS exploit to escalate |
| Browser SSRF | Loopback HTTP | Send messages, read inbox | local_token + Origin/Host check + non-default port. SSRF without token cannot succeed |
| Container side-channel | Same loopback namespace | Read another container's daemon | Containers share host loopback only if explicitly net=host. local_token defends. Recommended: bind UDS only inside containers |
| Compromised hook | Capability token in env | Use that scope | Capability tokens are scoped + short-lived; cannot escalate |
| Compromised broker | Full mesh visibility on its side | Deliver malicious messages, identity-impersonate | E2E encryption (crypto_box DMs, per-topic keys) — broker can't read content. Out-of-scope for daemon |
| Cloned VM image | Same keypair on two hosts | Identity collision | Host fingerprint detection + dashboard audit + --remint flow |
| Stolen laptop | Disk access | Mesh impersonation forever | member revoke from dashboard. Without disk encryption, this is the user's laptop security; documented in security guide |
| Untrusted hook author | Hook script content | Exfil mesh data | Hook is on disk YOU control. If you ran git pull on a malicious hooks/ repo, that's a code-supply-chain attack out of scope for the daemon |
16.2 Out of scope
- Defending against an attacker with root on the daemon host. They can read keypair.json directly.
- Defending against malicious peers in the same mesh sending malformed payloads. Daemon validates structure but trusts mesh members.
- Defending against compromised broker. Out-of-scope for daemon; mesh-level E2E protects content but not metadata.
17. Migration — what changes for existing users
Same as v1. Additive. No DB migration on broker. Existing
~/.claudemesh/config.json consumed unchanged. claudemesh launch keeps
working; daemon is opt-in.
What needs review (round 2)
Round 1 produced: identity model needs --ephemeral + clone-detect, IPC needs
local token, "exactly-once" was a lie, hooks needed scoped credentials, surface
needed shrinking, missing rotation/recovery/migration/threat-model.
This v2 attempts to address all of them. Specifically critique:
- Has the identity model fully closed the clone problem? Refuses-on-fingerprint-mismatch plus broker audit plus mesh-owner revoke: does this catch a sophisticated attacker who copies host_fingerprint.json along with the keypair?
- Is the local-token model sufficient for browser-SSRF defense? Token + Origin + Host checks + 127.0.0.1-only. Anything else needed?
- The delivery contract (§4) — is it now defensible? Does the inflight-recovery semantics + idempotency-key propagation produce the guarantees claimed?
- Hook capability tokens (§6.2) — short-lived, scoped, expire on hook exit. Does this fully eliminate the exfil footgun? What capability scopes are actually needed for v0.9.0 hooks?
- Frozen v0.9.0 surface (§3.1): is the cut right? Should peer list be in core or capability-gated? Should inbox/search ship in v0.9.0?
- Threat model (§16): anything missing? Specifically thinking about CI environments where the daemon's host is a fleet shared across many users' builds.
- Lifecycle flows (§14) — image clones, key rotation, host moves, disk corruption, uninstall semantics. Anything still missing?
- Version compat (§15) — is the negotiation handshake sufficient, or do we need stronger guarantees (e.g. semver-strict, or a feature-bit negotiation rather than version numbers)?
Score 1–5 each. Top 3 changes you'd insist on for v3, if any. If you think v2 is shippable, say so explicitly — over-engineering is a real risk.