claudemesh/.artifacts/shipped/2026-05-03-daemon-final-spec-v2.md
Alejandro Gutiérrez a2568ad9f4
chore(release): cli 1.22.0 — daemon v0.9.0 + housekeeping
- Bump apps/cli/package.json to 1.22.0 (additive feature: claudemesh
  daemon long-lived runtime).
- CHANGELOG entry for 1.22.0 covering subcommands, idempotency wiring,
  crash recovery, and the deferred Sprint 7 broker hardening.
- Roadmap entry for v0.9.0 daemon foundation right above the v2.0.0
  daemon redesign section, so the bridge release is documented as the
  shipped step toward the larger architectural shift.
- Move shipped daemon specs (v1..v10 iteration trail + locked v0.9.0
  spec + broker-hardening followups) from .artifacts/specs/ to
  .artifacts/shipped/ per the project artifact-pipeline convention.

Not in this commit: npm publish and the cli-v1.22.0 GitHub release tag
— both are public-distribution actions and require explicit user
approval.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:24:32 +01:00

# `claudemesh daemon` — Final Spec v2
> **Round 2 after a critical first-pass review.** v1 of this spec was reviewed
> by another model and pushed back on identity model, no-auth IPC, "exactly-once"
> overclaim, hook credentials, surface bloat, and missing operational flows
> (rotation, image clones, schema migration, threat model). v2 incorporates all
> of those.
---
## 0. Intent — what this is, what it isn't
### 0.1 The product reality
claudemesh today is a **peer mesh runtime for Claude Code sessions**. Each
session runs `claudemesh launch`, opens a WebSocket to a managed broker, gets
an ephemeral identity, sends/receives DMs and topic messages with other Claude Code
sessions, posts to shared state, deploys MCP servers / skills / files,
participates in tasks, schedules reminders. Everything is E2E encrypted with
crypto_box envelopes for DMs and per-topic symmetric keys for topics. The broker
is a routing/persistence layer; peers do the actual work.
The CLI is the canonical surface — every operation is a `claudemesh <verb>`.
The MCP server is a "tool-less push pipe" that surfaces inbound messages to
Claude Code as channel notifications. There is also a web dashboard, a `/v1/*`
REST API, and an existing apikey auth model for external integrations.
### 0.2 The gap
Anything that **isn't a Claude Code session** is a second-class citizen:
- A RunPod handler that wants to alert a peer when an OOM happens has only
one option: curl an apikey-authed REST endpoint. One-way only. The handler
is not a peer — it can't be DM'd back, can't be `@-mentioned`, can't be in
`peer list`, can't claim a task assigned to it, can't host an MCP service or
share a skill. It's a webhook spoke, not a participant.
- A Temporal worker that wants to track its own progress in shared mesh state,
publish to a `#alerts` topic, and listen for "retry now" instructions has
no good shape. Either it shells out to `claudemesh send` cold-path
(a fresh WS handshake per message — ~1s latency, broker churn, no inbound
path) or it speaks the WS protocol manually (significant code, no SDK).
- A long-running CI runner, an IoT box, a phone app, a future Python or Go
service — none can be **first-class peers** without writing the same WS
reconnect / queue / encryption / presence code that the existing CLI already
has, plus an IPC surface so the host's apps can use it without re-implementing
any of that.
### 0.3 What this daemon is
A long-running process — the same `claudemesh-cli` binary in `daemon` mode —
that turns any host into a **first-class peer**:
- Stable identity across restarts (the host *is* a member of the mesh, not a
series of disconnected sessions).
- Persistent WS to the broker, with reconnect, queue, dedupe.
- Local IPC surface (UDS + loopback HTTP + SSE) that any local app can hit
to send, subscribe, query — without learning the broker protocol or carrying
long-lived secrets in app code.
- Hooks: shell scripts that fire on events. They can reply to DMs, auto-claim
tasks, and escalate errors without the app being involved.
- Same security primitives as `claudemesh launch` (mesh keypair, crypto_box,
per-topic keys). No new auth model toward the broker.
The daemon **is the runtime**. The CLI in cold-path mode is a fallback. The
Claude Code MCP integration is one client of the daemon (eventually).
### 0.4 What this daemon is NOT
- **Not a webhook gateway.** `/v1/notify` and apikeys remain the path for
systems that can't host the runtime (third-party SaaS, monitoring tools).
The daemon is for systems that *can* run a process — code you control.
- **Not a generic message broker.** It speaks claudemesh protocol to one
managed broker. It is not a substitute for NATS, Redis, Kafka, RabbitMQ.
- **Not a Slack replacement.** Topics, DMs, mentions exist because *AI
sessions* use them. Humans interact via the dashboard or a Claude Code
session, not by reading the daemon's inbox directly.
- **Not a fleet manager.** One daemon manages one mesh on one host. Multi-mesh
on one host is supported (one daemon per mesh, supervised). Cross-host
supervision is an external concern (systemd, k8s, etc.) — the daemon doesn't
reach across hosts.
### 0.5 Who deploys this
- A developer running `claudemesh daemon up` on their laptop so their open
Claude Code sessions all share one persistent connection (instead of each
opening its own ephemeral WS).
- The same developer running `claudemesh daemon install-service` on their VPS,
RunPod pod, Temporal worker, CI runner — turning each into an
addressable peer that scripts on that host can talk to via local IPC.
- Eventually: language SDKs (Python / Go / TypeScript) talking to the daemon
on `localhost`, exposing claudemesh as a first-class API for any app the
developer writes.
### 0.6 Pre-launch posture
No users yet. We can break protocol, schema, surface, anything. Optimize for
the architecture we want to live with for years, not for the smallest
shippable cut. Codex pushed back on v1 on this exact axis: do not ship
graph/vector/MCP/skills/tasks on day one — freeze a small, hardened core,
expand deliberately.
---
## 1. Process model
**One daemon per (user, mesh)**. Persistent. Survives reboots via OS
supervisor. Serves multiple local apps concurrently.
```
~/.claudemesh/daemon/<mesh-slug>/
  pid                    0600  pidfile, cleaned on shutdown
  sock                   0600  unix domain socket (primary IPC)
  http.port              0644  auto-allocated loopback port
  local_token            0600  per-daemon bearer for HTTP/TCP transports
  keypair.json           0600  persistent ed25519 + x25519; daemon identity
  host_fingerprint.json  0600  machine-id + boot-id + interface mac digest
  config.toml            0644  user-editable runtime tuning
  outbox.db              0600  SQLite; durable outbound queue
  inbox.db               0600  SQLite; N-day inbound history, FTS-indexed
  schema_version         0644  integer; gates online migrations
  daemon.log             0644  JSON-lines, rotating (100 MB / 14 d)
  hooks/                 0700  user-managed event scripts
```
**Resource caps (defaults, configurable):**
| Resource | Default | Why |
|---|---|---|
| RSS | 256 MB | Most workloads stay under 50 MB; cap protects multi-mesh hosts |
| CPU | unlimited | Hook fan-out can spike briefly; rely on OS scheduler |
| Outbox DB | 5 GB | At 1KB avg msg, that's 5M queued. Disk-full handling at 90% |
| Inbox DB | 5 GB | Same |
| File descriptors | 1024 | UDS clients + SSE streams + DB handles + WS |
| SSE concurrent | 32 streams | DoS protection; configurable up |
| IPC concurrent | 64 in-flight | Backpressure beyond this returns `429 daemon_busy` |
| Hook concurrency | 8 | Bounded pool; overflow queues |
Single binary. Same `claudemesh-cli` package; `daemon` is one of its modes.
## 2. Identity — persistent member by default, ephemeral on opt-in, clone-aware
### 2.1 Modes
```
claudemesh daemon up # default: persistent member
claudemesh daemon up --ephemeral # session-shaped, no keypair persisted
claudemesh daemon up --ephemeral --ttl=2h # auto-shutdown after TTL
```
- **Persistent (default)**: ed25519 + x25519 keypair stored in `keypair.json`.
Same identity across restarts, reconnects, supervisor cycles. Right for
servers, workers, addressable peers.
- **Ephemeral**: keypair generated in memory, never written. Daemon exits =
identity gone. Right for CI jobs, preview environments, disposable RunPod
pods, test harnesses, build agents, anything that should not leave a peer
ghost in the broker after teardown.
- **`--ttl <duration>`** on ephemeral mode: auto-shutdown after the duration,
or after `claudemesh daemon down`, whichever first. Broker member record
cleaned up on shutdown.
### 2.2 Image-clone detection
Two daemons booting with the same `keypair.json` (VM image clone, container
copy, restored backup) is a serious failure mode — broker sees connection
collisions, presence flickers, encrypted messages route to the wrong host.
Handled in three places:
1. **Daemon side**: `host_fingerprint.json` is written on first startup —
`sha256(machine-id || boot-id || mac-of-default-iface || hostname)`. On every
subsequent startup, the fingerprint is recomputed and compared. If it
differs, the daemon **refuses to start** unless `--accept-cloned-identity`
is passed (writes a fresh fingerprint and continues with the same keypair —
for legitimate hardware migrations) or `--remint` is passed (mints fresh
keypair, registers as a new member, broker reaps the old member after
grace period).
2. **Broker side**: tracks `lastSeenHostFingerprint` per member. On
reconnection from a different fingerprint, broker emits a
`member_clone_suspected` security event to the mesh owner's dashboard.
Connection itself is allowed (legitimate hardware swaps happen) but visible
for audit.
3. **Mesh owner**: `claudemesh member revoke <pubkey>` revokes the keypair
server-side; daemon receives `keypair_revoked` push event on next
connection and self-disables.
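
The daemon-side check in step 1 can be sketched roughly as below. This is a minimal illustration, not the shipped implementation: the function names, the `StartupAction` enum, the NUL separator between hashed fields, and the precedence of `--remint` over `--accept-cloned-identity` when both are passed are all assumptions. Only the hashed inputs and the two flags come from the text above.

```python
import hashlib
from enum import Enum


class StartupAction(Enum):
    CONTINUE = "continue"          # first startup, or fingerprint unchanged
    REFUSE = "refuse"              # mismatch and no override flag
    ACCEPT_CLONE = "accept_clone"  # rewrite fingerprint, keep the keypair
    REMINT = "remint"              # mint a fresh keypair, register as new member


def host_fingerprint(machine_id, boot_id, mac, hostname):
    """sha256 over the host attributes; the NUL separator is an assumption."""
    joined = "\0".join([machine_id, boot_id, mac, hostname])
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def startup_action(stored, current, accept_cloned_identity=False, remint=False):
    """Decide what to do given the stored and freshly computed fingerprints."""
    if stored is None or stored == current:
        return StartupAction.CONTINUE
    if remint:  # assumed to win if both flags are given
        return StartupAction.REMINT
    if accept_cloned_identity:
        return StartupAction.ACCEPT_CLONE
    return StartupAction.REFUSE
```

Keeping the decision in a pure function makes the refuse/accept/remint matrix easy to unit-test without touching real host identifiers.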
### 2.3 Rename
`--name` is taken at first `daemon up`; subsequent runs read the keypair file
and ignore `--name` unless `--rename` is passed (which produces a
`member_renamed` event the broker propagates to peers).
## 3. IPC surface — stable core only in v0.9.0
### 3.1 Frozen core surface (v0.9.0)
Codex's feedback: do not ship every CLI verb on day one. A small hardened core
first, expand under explicit capability gates.
```
# Messaging: durable, tested
POST /v1/send               {to, message, priority?, meta?, replyToId?}
POST /v1/topic/post         {topic, message, priority?, mentions?}
POST /v1/topic/subscribe    {topic}                        (idempotent)
POST /v1/topic/unsubscribe  {topic}
GET  /v1/topic/list
GET  /v1/inbox              ?since=<iso>&topic=<n>&from=<peer>&limit=<n>
GET  /v1/inbox/search       ?q=<fts-query>&limit=<n>       (FTS5)

# Peers + presence: read-only on day one
GET  /v1/peers              ?mesh=<slug>
POST /v1/profile            {summary?, status?, visible?}  (limited fields)

# Files: already production in CLI
POST /v1/file/share         {path, to?, message?, persistent?}
GET  /v1/file/get           ?id=<fileId>&out=<path>
GET  /v1/file/list

# Events: push
GET  /v1/events             text/event-stream
                            core events: message, peer_join, peer_leave, file_shared,
                            daemon_disconnect, daemon_reconnect, hook_executed

# Control plane
GET  /v1/health             {connected, lag_ms, queue_depth, inflight,
                             mesh, member_pubkey, uptime_s, schema_version,
                             daemon_version, broker_version}
GET  /v1/metrics            Prometheus exposition
GET  /v1/version            {daemon, schema, ipc_api}      (negotiation)
POST /v1/heartbeat          {}                             (caller-side liveness signal)
```
That's it: the 17 endpoints listed above. Battle-test these before adding more.
### 3.2 Capability-gated future surface (v0.9.x roadmap)
Behind explicit feature flags in `config.toml`, post-v0.9.0:
```toml
[capabilities]
state = false # /v1/state/{set,get,list}
memory = false # /v1/memory/{remember,recall}
vector = false # /v1/vector/{store,search,delete}
graph = false # /v1/graph/query
tasks = false # /v1/task/{create,claim,complete}
scheduling = false # /v1/scheduling/remind
mcp_host = false # /v1/mcp/{register,call} (LARGEST surface; treat as v1.0)
skill_share = false # /v1/skill/{deploy,share}
```
Each capability is its own ship: design review, security review, test
coverage, capability-token model, then enable. None enabled in v0.9.0.
### 3.3 Local IPC authentication
Codex was right: loopback TCP without auth is an attack surface (browser SSRF,
container side-channels, sandboxed apps with network but no FS access, WSL
host-shared loopback).
| Transport | Auth | Rationale |
|---|---|---|
| UDS | None (relies on FS perms 0600) | Reaching the socket = same UID = can read keypair anyway |
| TCP loopback | **Required**: `Authorization: Bearer <local_token>` | Browser/container/sandbox can reach loopback without FS access |
| SSE | Required: `Authorization: Bearer <local_token>` | Same |
`local_token` is 32 bytes of `crypto.randomBytes` (256 bits), encoded base64url,
written to `local_token` mode 0600 at daemon init. Rotated on `claudemesh
daemon rotate-token`. SDKs auto-discover the token by reading the file (same
mechanism as discovering the socket path).
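
For SDK authors working outside Node, an equivalent token shape and bearer check might look like the sketch below. It assumes unpadded base64url (the spec does not say whether padding is kept) and uses Python's `secrets` module in place of `crypto.randomBytes`; the function names are hypothetical.

```python
import hmac
import secrets


def mint_local_token():
    """32 random bytes as unpadded base64url text (43 characters)."""
    return secrets.token_urlsafe(32)


def bearer_matches(authorization_header, expected_token):
    """Validate an `Authorization: Bearer <token>` value in constant time."""
    scheme, _, presented = authorization_header.partition(" ")
    if scheme != "Bearer":
        return False
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

The constant-time comparison avoids leaking how many leading characters of a guessed token were correct.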
**Additional defenses:**
- HTTP listener binds **127.0.0.1 only**. Refuses to bind elsewhere unless
`[ipc] http_bind = "..."` is set explicitly **and** `[ipc] http_external_auth = "..."`
points to a separate token file (escape hatch for advanced users; never the default).
- `Origin` header check: rejects requests with `Origin` set unless it's
explicitly allowlisted in config (default: empty allowlist). Defends against
browser SSRF.
- `Host` header check: must be `localhost` or `127.0.0.1`. Defends against DNS
rebinding.
- CORS: `Access-Control-Allow-Origin` never echoed; preflight returns `403`.
- `User-Agent` required (rejects empty UA — mild signal against simple SSRF).
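
The header checks above could be combined into a single gate that runs before token validation. The following is a sketch under assumptions: the status codes, reason strings, and function name are illustrative, and the `Host` parsing simply drops a trailing `:port`.

```python
ALLOWED_HOSTS = {"localhost", "127.0.0.1"}


def check_loopback_headers(headers, origin_allowlist=frozenset()):
    """Return (status, reason); 200 means the request may proceed to token auth."""
    h = {name.lower(): value for name, value in headers.items()}
    # Host must name loopback, which defends against DNS rebinding.
    hostname = h.get("host", "").split(":")[0]
    if hostname not in ALLOWED_HOSTS:
        return 403, "bad_host"
    # Any Origin header is rejected unless explicitly allowlisted.
    origin = h.get("origin")
    if origin is not None and origin not in origin_allowlist:
        return 403, "origin_not_allowed"
    # Empty or missing User-Agent is rejected.
    if not h.get("user-agent", "").strip():
        return 400, "user_agent_required"
    return 200, "ok"
```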
### 3.4 Request limits + backpressure
- Max request body: **1 MB** (override per endpoint; file uploads use a separate
streaming endpoint).
- Max response body: **10 MB**; truncated with `Link: rel=next` cursor.
- Max in-flight IPC requests: **64**. Beyond → `429 daemon_busy`.
- Max SSE concurrent streams: **32**. Beyond → `429 too_many_streams`.
- Per-token rate limit: **100 req/sec** sustained, 1000/sec burst (token
bucket). Tunable.
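
One way to read the per-token limit is a bucket that refills at 100 tokens per second and holds at most 1000. The sketch below takes that reading as an assumption; the class name and the injected clock are illustrative and not part of the spec.

```python
class TokenBucket:
    """Token bucket: `rate` tokens/sec refill, `burst` maximum capacity."""

    def __init__(self, rate=100.0, burst=1000.0, now=0.0):
        self.rate = rate
        self.capacity = burst
        self.tokens = burst      # start full so an initial burst is allowed
        self.updated = now

    def allow(self, now):
        """Consume one token if available; `now` is a monotonic timestamp in seconds."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Passing the clock in explicitly keeps the limiter deterministic under test; a real daemon would presumably feed it `time.monotonic()`.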
## 4. Delivery contract — durable at-least-once with idempotent send
Codex was right: "exactly-once" is a lie. Replacing the claim with a precise
contract.
### 4.1 The contract
> **The daemon guarantees that each successful send call eventually results in
> exactly one message row at the broker, identified by a stable `messageId`.
> The daemon does not guarantee that downstream peers process the message
> exactly once; that is the receiver's responsibility, aided by the propagated
> `idempotency_key`.**
Concretely:
- **Caller → daemon**: caller may supply `Idempotency-Key`; daemon dedupes
identical keys for 24h. Without one, daemon mints `ulid` and returns it as
`messageId`.
- **Daemon → broker**: each outbox row has at-most-one inflight transmit.
Daemon retries with exponential backoff until broker ACKs OR row hits TTL
(7d default → moves to `dead`).
- **Broker → peer**: existing claudemesh delivery semantics. Broker dedupes by
`messageId`. Peer receives ≥1 copy.
- **Peer hooks**: hooks see `idempotency_key` in the event JSON. Idempotent
hook implementations are the receiver's responsibility.
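
The caller-to-daemon leg of the contract can be illustrated with a small in-memory deduper. This is only a sketch: a real daemon would presumably persist the mapping in `outbox.db`, and `uuid4` stands in for the ULID the spec mints because ULID is not in the Python standard library. The class and method names are hypothetical.

```python
import uuid

DEDUPE_WINDOW_S = 24 * 60 * 60  # 24h, per the contract above


class SendDeduper:
    """Maps Idempotency-Key -> (messageId, first_seen) within the dedupe window."""

    def __init__(self):
        self._seen = {}

    def accept(self, now, idempotency_key=None):
        """Return (message_id, is_new). A repeated key inside 24h reuses the first id."""
        if idempotency_key is not None:
            hit = self._seen.get(idempotency_key)
            if hit is not None and now - hit[1] < DEDUPE_WINDOW_S:
                return hit[0], False
        message_id = uuid.uuid4().hex  # stand-in for a ULID
        if idempotency_key is not None:
            self._seen[idempotency_key] = (message_id, now)
        return message_id, True
```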
### 4.2 Outbox row state machine
```
              ┌────────────┐
send call →   │  pending   │
              └─────┬──────┘
                    │ daemon picks up batch
                    ▼
              ┌────────────┐
              │  inflight  │ ← attempts++, last_error written
              └─┬────┬─────┘
                │    │ broker NACK / network err
     broker ACK │    └──────────► back to pending (with exp. backoff)
                ▼
              ┌────────────┐
              │    done    │ ← delivered_at set, broker_message_id stored
              └────────────┘

age > max_age_hours:
              ┌────────────┐
              │    dead    │ ← surfaces in `daemon outbox --failed`
              └────────────┘
```
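
The state machine can also be written as a transition table plus a backoff helper. This sketch assumes that only `pending` rows age out to `dead` and uses illustrative backoff constants (1s base, 300s cap); neither detail is fixed by the spec.

```python
# Allowed outbox transitions; `done` and `dead` are terminal.
ALLOWED = {
    "pending": {"inflight", "dead"},   # picked up by a batch, or aged out
    "inflight": {"done", "pending"},   # broker ACK, or NACK / network error
    "done": set(),
    "dead": set(),
}


def transition(current, target):
    """Return `target` if the move is legal, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal outbox transition {current} -> {target}")
    return target


def backoff_seconds(attempts, base=1.0, cap=300.0):
    """Exponential backoff: base * 2^(attempts - 1), capped at `cap` seconds."""
    return min(cap, base * 2 ** max(0, attempts - 1))
```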
### 4.3 Crash recovery
On daemon startup:
1. Any rows in `inflight` are reset to `pending` with `attempts++` and
`next_attempt_at = now + min_backoff`. Note: this MAY cause double-delivery
of a message that was actually ACK'd by the broker but the ACK didn't
persist locally before crash. The `idempotency_key` propagates to broker
(via message `meta`) so the broker dedupes by key.
2. `outbox.db` integrity check (`PRAGMA integrity_check`); if fails, daemon
refuses to start, points user at `claudemesh daemon recover`.
3. `inbox.db` integrity check; on failure, drops to `inbox.db.corrupt-<ts>`,
creates fresh empty inbox, logs `inbox_corruption_recovered` (does not
block startup — inbox is a cache).
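
A sketch of the outbox part of this recovery, using Python's `sqlite3`. The table name `outbox` and the `status` column are assumptions; `attempts` and `next_attempt_at` come from the text. The sketch runs the integrity check before touching any rows, which is an ordering choice of this example rather than something the list above mandates.

```python
import sqlite3


def recover_outbox(conn, now, min_backoff_s=1.0):
    """Verify the database, then requeue rows stranded in `inflight`.

    Returns the number of rows moved back to `pending`.
    """
    (result,) = conn.execute("PRAGMA integrity_check").fetchone()
    if result != "ok":
        raise RuntimeError(
            "outbox.db failed integrity check; run `claudemesh daemon recover`"
        )
    with conn:  # one transaction; commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE outbox SET status = 'pending', attempts = attempts + 1, "
            "next_attempt_at = ? WHERE status = 'inflight'",
            (now + min_backoff_s,),
        )
    return cur.rowcount
```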
### 4.4 Disk-full
- At 80% of `outbox.max_queue_size` or 80% of `[disk] reserved_bytes`: daemon
emits `outbox_pressure_high` event + Prometheus gauge. Sends still accept.
- At 95%: new sends return `507 insufficient_storage`. Existing inflight
drains.
- At 100%: daemon enters degraded mode — refuses sends, refuses new SSE
streams, holds open WS for inbound only. `daemon status` shows degraded.
- Recovery: drain via broker reconnect (drains `done` rows older than
retention window) or `claudemesh daemon outbox prune --confirm`.
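
The thresholds above map naturally onto a small classifier. The state names returned here are hypothetical labels for this sketch; only the 80 / 95 / 100 percent boundaries come from the text.

```python
def outbox_pressure(used_bytes, max_bytes):
    """Classify outbox disk usage against the 80% / 95% / 100% thresholds."""
    # Integer arithmetic avoids float rounding exactly at a boundary.
    if used_bytes * 100 >= max_bytes * 100:
        return "degraded"        # refuse sends and new SSE streams
    if used_bytes * 100 >= max_bytes * 95:
        return "reject_sends"    # new sends get 507 insufficient_storage
    if used_bytes * 100 >= max_bytes * 80:
        return "pressure_high"   # emit outbox_pressure_high, still accept
    return "ok"
```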
### 4.5 Schema migration
`schema_version` file holds an integer. On startup:
1. If `schema_version` matches binary's expected version → continue.
2. If version is older → run `apps/cli/src/daemon/migrations/<from>-<to>.sql`
in a transaction, write new version on success.
3. If version is newer (downgrade) → daemon refuses to start, error points at
re-installing matching version.
Migrations are forward-only. Each migration is ≤ 1 transaction. Test coverage
required: every migration has a snapshot test from prior schema.
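
The startup decision can be sketched as a planner that returns the migration files to run. It assumes one file per single-version step, named `<from>-<to>.sql` as in the path above; the function name is hypothetical.

```python
def plan_migrations(on_disk, expected):
    """Return migration filenames to apply in order, or raise on a downgrade."""
    if on_disk > expected:
        raise RuntimeError(
            f"schema_version {on_disk} is newer than this binary expects "
            f"({expected}); install a matching daemon version"
        )
    # Equal versions yield an empty plan, so the daemon simply continues.
    return [f"{v}-{v + 1}.sql" for v in range(on_disk, expected)]
```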
## 5. Inbound — durable history with FTS
Every inbound message is written to `inbox.db` before any hook fires:
```sql
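-- Base table: holds the UNIQUE constraint and B-tree index, which an FTS5
-- virtual table cannot carry (SQLite rejects CREATE INDEX on virtual tables).
CREATE TABLE inbox (
  id INTEGER PRIMARY KEY,
  message_id TEXT, mesh TEXT, topic TEXT, sender_pubkey TEXT,
  sender_name TEXT, body TEXT, meta TEXT, idempotency_key TEXT UNIQUE,
  received_at TEXT, replied_to_id TEXT
);
CREATE INDEX inbox_received_at ON inbox(received_at);
-- External-content FTS5 index over the searchable columns; the daemon (or
-- triggers) must keep it in sync with inbox on insert and delete.
CREATE VIRTUAL TABLE inbox_fts USING fts5(
  topic, sender_name, body, meta, content='inbox', content_rowid='id'
);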
```
- **Receiver-side dedupe**: on insert, `INSERT OR IGNORE` on `idempotency_key`.
Duplicate broker delivery becomes a no-op locally + `cm_daemon_dedupe_total`
counter increments.
- 30-day rolling retention (configurable). `VACUUM` weekly during low-traffic
window.
- `claudemesh daemon search "OOM"` queries the FTS index.
- Apps connecting mid-stream replay history via `?since=<iso>`.
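
Receiver-side dedupe relies on a uniqueness constraint over `idempotency_key`. The sketch below uses a simplified stand-in table (`inbox_demo`, not the real schema) to show how `INSERT OR IGNORE` plus the affected-row count can drive the dedupe counter; the function name and the counter dictionary are illustrative.

```python
import sqlite3

INBOX_DEMO_SCHEMA = """
CREATE TABLE inbox_demo (
    message_id TEXT,
    idempotency_key TEXT UNIQUE,
    body TEXT
)
"""


def store_inbound(conn, message_id, idempotency_key, body, counters):
    """Insert a message; a duplicate key is a no-op that bumps the dedupe counter."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO inbox_demo (message_id, idempotency_key, body) "
            "VALUES (?, ?, ?)",
            (message_id, idempotency_key, body),
        )
    if cur.rowcount == 0:  # row was ignored because the key already exists
        counters["cm_daemon_dedupe_total"] += 1
        return False
    return True
```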
## 6. Hooks — first-class but tightly bounded
Codex was right: hooks were underspecified, and putting `CLAUDEMESH_TOKEN` in
every hook env was a serious exfil footgun.
### 6.1 Hook directory & contract
```
hooks/
on-message.sh every inbound message (DM + topic)
on-dm.sh DMs only
on-mention.sh when @<my-name> appears anywhere
on-topic-<name>.sh a specific topic
on-file-share.sh file shared with me
on-disconnect.sh WS dropped
on-reconnect.sh reconnected
on-startup.sh daemon up
pre-send.sh filter / mutate outbound (last gate)
hooks.toml per-hook policy (auth, redaction, env, timeout)
```
`hooks.toml` (mandatory; daemon refuses to invoke hooks without it):
```toml
[on-mention]
enabled = true
timeout_s = 30
output_size_limit = 65536
redact_payload = ["body.password", "meta.api_key"] # JSONPath
allow_reply = true # if false, stdout reply ignored
capability_token_scope = ["topic:alerts:post"] # scoped, NOT broker session token
network_policy = "deny" # 'deny' | 'allow' | 'allowlist'
network_allowlist = [] # only if policy = 'allowlist'
fs_policy = "readonly" # 'readonly' | 'rw' | 'sandbox'
killpg_on_timeout = true # SIGTERM process group, not just child
audit = true # log every invocation
```
### 6.2 Credentials passed to hooks
**Default: nothing.** No `CLAUDEMESH_TOKEN`, no broker session, nothing that
lets the hook impersonate the daemon's identity broadly.
**Opt-in per hook**: `capability_token_scope = ["topic:alerts:post"]` mints a
**short-lived (5 min) capability token** scoped to exactly that capability.
The hook can use it to call back into the daemon's IPC ("post a reply to
#alerts") but cannot use it to read state, read inbox, deploy MCP, etc. Token
expires when hook process exits OR after 5 min, whichever first.
Capability tokens are local-only — they authorize against the daemon's IPC
surface, never the broker directly. Daemon translates capability calls into
broker calls.
Env variables the hook DOES get:
- `CLAUDEMESH_MESH=<slug>`
- `CLAUDEMESH_HOOK_NAME=on-mention`
- `CLAUDEMESH_EVENT_ID=<ulid>`
- `CLAUDEMESH_CAPABILITY_TOKEN=<token>` (only if scope was configured; else absent)
- `CLAUDEMESH_DAEMON_SOCK=<path>` (so SDKs can connect for capability calls)
- `PATH=/usr/bin:/bin` (locked down)
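
A capability token of this kind might be modelled as below. This is a sketch under assumptions: scopes are matched by exact string equality (no wildcards), the token value is an opaque random string, and revocation on hook exit is not modelled. The names `CapabilityToken`, `mint_capability`, and `authorize` are hypothetical.

```python
import secrets
from dataclasses import dataclass

CAPABILITY_TTL_S = 5 * 60  # 5 minutes, per the text above


@dataclass(frozen=True)
class CapabilityToken:
    value: str
    scopes: frozenset
    expires_at: float


def mint_capability(scopes, now):
    """Mint a short-lived token limited to the configured scopes."""
    return CapabilityToken(
        value=secrets.token_urlsafe(32),
        scopes=frozenset(scopes),
        expires_at=now + CAPABILITY_TTL_S,
    )


def authorize(token, requested_scope, now):
    """Allow a call only if the token is unexpired and holds the exact scope."""
    return now < token.expires_at and requested_scope in token.scopes
```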
### 6.3 Payload redaction
Hook stdin receives event JSON minus paths listed in `redact_payload`. Default
redaction: nothing. Mesh owner / daemon admin opts in.
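
A minimal redaction pass, assuming the dotted paths shown in the `hooks.toml` example (`body.password`, `meta.api_key`) address nested object keys. Full JSONPath features such as wildcards or array indexing are not handled in this sketch, and the function name is hypothetical.

```python
import copy


def redact(event, paths):
    """Return a copy of `event` with each dotted path removed; missing paths are ignored."""
    out = copy.deepcopy(event)  # never mutate the stored event
    for path in paths:
        *parents, leaf = path.split(".")
        node = out
        for key in parents:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict):
            node.pop(leaf, None)
    return out
```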
### 6.4 Timeout & cleanup
- Per-hook `timeout_s` (default 30s). On timeout, daemon sends SIGTERM to the
hook's process group (`killpg_on_timeout=true`), waits 5s, then SIGKILL.
Catches forked grandchildren that were trying to keep things alive.
- Hook stdout/stderr captured, truncated at `output_size_limit`. Larger
outputs log a warning and discard the overflow.
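
On POSIX hosts the timeout behaviour can be approximated with `subprocess` and process groups. This is a sketch, not the daemon's implementation: the function name and return shape are assumptions, and a grandchild that calls `setsid` itself would escape the group and is not handled here.

```python
import os
import signal
import subprocess


def run_hook(argv, stdin_bytes, timeout_s=30.0, grace_s=5.0, output_size_limit=65536):
    """Run a hook in its own process group; SIGTERM the group on timeout, then SIGKILL."""
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        start_new_session=True,  # child leads a new group, so pgid == proc.pid
    )
    timed_out = False
    try:
        out, err = proc.communicate(stdin_bytes, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        timed_out = True
        try:
            os.killpg(proc.pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # group already gone
        try:
            out, err = proc.communicate(timeout=grace_s)
        except subprocess.TimeoutExpired:
            os.killpg(proc.pid, signal.SIGKILL)
            out, err = proc.communicate()
    return {
        "exit": proc.returncode,
        "timed_out": timed_out,
        "stdout": out[:output_size_limit],  # overflow is discarded
        "stderr": err[:output_size_limit],
    }
```

Signalling the group rather than the child is what catches forked grandchildren that would otherwise outlive the hook.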
### 6.5 Audit log
Every hook invocation logs:
```json
{"hook":"on-mention","event_id":"01H8…","exit":0,"duration_ms":47,
"stdout_bytes":120,"stderr_bytes":0,"replied":true,"capability_calls":1,
"ts":"2026-05-03T14:00:00Z"}
```
Stored in `daemon.log`; metrics exposed via `cm_daemon_hook_*`.
### 6.6 Sandboxing — supported, not required
The contract supports sandboxing without mandating it (mandating breaks too
many real workflows):
- Linux: opt-in `sandbox = "bubblewrap"` in `hooks.toml` runs the hook under
`bwrap` with no network (unless `network_policy != "deny"`), readonly FS
except `/tmp/<hook-id>`, no DBus, no /proc.
- macOS: opt-in `sandbox = "sandbox-exec"` with similar profile.
- Default: no sandbox; rely on Unix permissions + `network_policy=deny` (which
is enforced via `unshare --net` on Linux when available, otherwise
best-effort firewall rule).
## 7. Multi-mesh — daemon-per-mesh, supervised by a thin shell
### 7.1 The decision
One daemon per mesh, coordinated by a supervisor script. Codex pushed back —
"why not one daemon serving all meshes?". Going daemon-per-mesh because:
- **Crash isolation**: a panic in `prod` mesh's WS reader can't corrupt
`dev` mesh's outbox.
- **Resource accounting**: per-mesh RSS, per-mesh metrics, per-mesh disk
budget — easy to attribute, easy to cap.
- **Independent identity**: each mesh has its own keypair, host fingerprint,
capability gates. Conflating into one process forces shared trust.
- **Independent upgrades**: rolling daemon restarts per mesh, no downtime
across all meshes.
- **Simpler code**: zero cross-mesh routing logic in the daemon body.
The cost (process count, log fan-out) is real but bounded: a typical user has
1-3 meshes. Heavy users (10-20) get a `claudemesh daemon ps` + `--all` UX that
treats them as a fleet.
### 7.2 Resource caps for fleet hosts
`config.toml` has `[fleet]` section read by `daemon up --all`:
```toml
[fleet]
max_daemons = 10
total_memory_budget = "2GB" # divided across daemons; each gets budget/N RSS cap
total_disk_budget = "20GB" # divided across outbox + inbox per daemon
```
If a user hits `max_daemons`, `daemon up <next>` errors with a clear message
pointing at the cap.
### 7.3 Commands
```
claudemesh daemon up --mesh <slug> # one mesh
claudemesh daemon up --all # all joined meshes (respects fleet caps)
claudemesh daemon down --mesh <slug>
claudemesh daemon down --all
claudemesh daemon status # all daemons, table view
claudemesh daemon status --json # machine-readable
claudemesh daemon ps # alias of status
claudemesh daemon logs --mesh <slug> [-f]
claudemesh daemon restart --mesh <slug>
```
## 8. Auto-routing — clarified, not transparent
Codex pushed back: "no behavior difference" was hand-waving. Persistent
identity, queueing, hooks, profile state — these legitimately change behavior.
### 8.1 What changes when a daemon is up
| Behavior | Cold-path CLI | Daemon-routed CLI |
|---|---|---|
| Sender attribution | Ephemeral session pubkey for that invocation | Daemon's persistent member pubkey |
| Latency | ~1s (fresh WS handshake) | <10ms (local UDS round-trip) |
| Send durability | None — if broker is unreachable, command fails | Outbox queue retries until TTL |
| Inbound visibility | Not available (cold path closes WS) | `claudemesh inbox` reads daemon's inbox.db |
| Hooks | Not invoked | Invoked on every event |
| Presence | Brief flicker as session connects+disconnects | Continuous; daemon's status reflected |
| `peer list` shows me as | A new ephemeral session each invocation | The daemon's persistent member |
### 8.2 Detection logic — connect, don't trust pidfile
```
1. Check ~/.claudemesh/daemon/<slug>/sock exists.
2. Attempt UDS connect with 100ms timeout.
3. If connect succeeds: send GET /v1/version.
4. If response is well-formed AND mesh matches AND daemon_version is
compatible → use this daemon.
5. Otherwise → cold path.
```
PID liveness check is unreliable (PID reuse, process orphaned). Socket
handshake is canonical.
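
The detection steps can be sketched with a raw Unix-socket probe. Several details are assumptions: the response is treated as a plain (non-chunked) HTTP/1.1 body, compatibility is reduced to an `ipc_api` string match, and the payload is assumed to carry a `mesh` field, which the `/v1/version` shapes shown elsewhere in this spec do not list. The function names are hypothetical.

```python
import json
import socket

REQUEST = (
    b"GET /v1/version HTTP/1.1\r\n"
    b"Host: localhost\r\n"
    b"User-Agent: claudemesh-probe\r\n"
    b"Connection: close\r\n\r\n"
)


def is_usable(info, expected_mesh, supported_ipc_api="v1"):
    """Well-formed payload, matching mesh, compatible IPC API."""
    return (
        isinstance(info, dict)
        and info.get("mesh") == expected_mesh
        and info.get("ipc_api") == supported_ipc_api
    )


def probe_daemon(sock_path, expected_mesh, timeout_s=0.1):
    """Return the version payload if a usable daemon answers, else None (cold path)."""
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.settimeout(timeout_s)
            s.connect(sock_path)  # a missing or stale socket raises OSError
            s.sendall(REQUEST)
            raw = b""
            while chunk := s.recv(4096):
                raw += chunk
    except OSError:
        return None
    head, _, body = raw.partition(b"\r\n\r\n")
    if not head.startswith(b"HTTP/1.1 200"):
        return None
    try:
        info = json.loads(body)
    except ValueError:
        return None
    return info if is_usable(info, expected_mesh) else None
```

Any failure, from a missing socket file to a malformed body, collapses to `None`, which matches the "otherwise cold path" rule in step 5.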
### 8.3 Coexistence with `claudemesh launch`
Both can be running for the same mesh:
- Daemon connected as persistent member `runpod-worker-3`.
- A separate `claudemesh launch` connects as ephemeral session of the same
member. Visible to peers as "another session of runpod-worker-3"
(sibling-session relationship via `memberPubkey`).
- CLI verbs from inside `claudemesh launch` route through the launch session,
NOT the daemon (preserves "this Claude Code session has its own ephemeral
identity" semantics).
- CLI verbs from a separate shell route through the daemon (faster, durable).
This is consistent with the v0.5.1 self-DM guard and sibling-session
semantics already shipped.
## 9. Service installation
```bash
claudemesh daemon install-service # writes systemd unit / launchd plist / Windows SC
claudemesh daemon uninstall-service
claudemesh daemon install-service --user # user-scope unit (default; no root)
claudemesh daemon install-service --system # system-scope unit (root; multi-user host)
```
Unit defaults:
- `Restart=on-failure`, `RestartSec=5s`, `StartLimitBurst=5`, `StartLimitIntervalSec=5min`
- `MemoryMax=<resource cap>`, `TasksMax=128`, `LimitNOFILE=4096`
- `StandardOutput/Error=journal`
- `NoNewPrivileges=yes`, `PrivateTmp=yes`, `ProtectSystem=strict`,
`ProtectHome=read-only` with `ReadWritePaths=~/.claudemesh`
- For systemd `--user`, runs as the invoking user (no root needed).
`claudemesh install` (the existing setup verb) gains an opt-in prompt:
*"Install as a background service that always runs?"* Defaults differently
based on detected environment (TTY vs no-TTY, presence of systemd, etc.).
## 10. Observability
Standard CLI surface unchanged from v1, with the new gauges/counters:
```
cm_daemon_connected{mesh} 0/1
cm_daemon_reconnects_total{mesh,reason}
cm_daemon_lag_ms{mesh} last broker round-trip
cm_daemon_outbox_depth{mesh,status} pending|inflight|dead
cm_daemon_outbox_age_seconds{mesh} oldest pending row
cm_daemon_dedupe_total{mesh,direction} out|in
cm_daemon_disk_pct{mesh,kind} outbox|inbox
cm_daemon_send_total{mesh,kind,status}
cm_daemon_recv_total{mesh,kind,from_type}
cm_daemon_hook_invocations_total{hook,exit}
cm_daemon_hook_duration_seconds{hook} histogram
cm_daemon_hook_capability_calls_total{hook,scope}
cm_daemon_ipc_request_total{endpoint,status,transport}
cm_daemon_ipc_duration_seconds{endpoint} histogram
cm_daemon_local_token_rotations_total
cm_daemon_clone_suspected_total
```
Tracing: optional OpenTelemetry export.
## 11. SDKs — three, slim, core-API only
Same shape as v1 but only target the **frozen core surface** (§3.1). State /
memory / vector / graph / tasks / MCP / skills are NOT in v0.9.0 SDKs — they
ship per capability gate.
Each SDK auto-discovers the daemon: reads `sock` path, `http.port`,
`local_token`. SDKs versioned in lockstep with the daemon's `/v1` surface.
## 12. Security model — explicit boundaries
| Boundary | Trust | Mechanism |
|---|---|---|
| App ↔ Daemon (UDS) | OS user, FS perms | UDS 0600 |
| App ↔ Daemon (TCP/SSE) | OS user + bearer token | 127.0.0.1 only + `local_token` + Origin/Host check |
| Hook ↔ Daemon | Capability scope | Short-lived capability token, never broker session |
| Daemon ↔ Broker | Mesh keypair | WSS + ed25519 hello + crypto_box DM + per-topic keys |
| Daemon ↔ Disk | OS user | All daemon files mode 0600/0644 under `~/.claudemesh/daemon/` |
| Cloned identity | Host fingerprint check | Daemon refuses to start; dashboard audit event |
## 13. Configuration
`config.toml` — same shape as v1 plus:
- `[capabilities]` (§3.2)
- `[fleet]` (§7.2)
- `[disk] reserved_bytes` (§4.4)
- `[clone] policy = "refuse" | "warn" | "allow"` (§2.2)
User-editable. `claudemesh daemon reload` re-reads it without dropping the WS.
## 14. Lifecycle — the operational flows v1 was missing
### 14.1 Key rotation
```
claudemesh daemon rotate-keypair
```
Mints fresh ed25519 + x25519. Registers new pubkey with broker as a `member_keypair_rotated` operation (broker associates new pubkey with same member id). Old pubkey is held server-side for 24h grace (decrypts in-flight messages encrypted to old pubkey), then revoked.
### 14.2 Local token rotation
```
claudemesh daemon rotate-token
```
Atomically writes a new `local_token` and keeps accepting the old one alongside
the new one for a 60s grace period. SDKs that already hold the old token can
finish in-flight requests; new requests use the new token. After 60s, the old
token is rejected.
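
The grace window can be modelled as a store that remembers the previous token until its deadline. This is a sketch with hypothetical names; the atomic file write is omitted and only the validation logic is shown.

```python
import hmac
import secrets

GRACE_S = 60.0  # old token stays valid for 60s after rotation


class LocalTokenStore:
    def __init__(self, token):
        self.current = token
        self.previous = None
        self.previous_expires_at = 0.0

    def rotate(self, now):
        """Mint a new token; the old one remains valid until now + GRACE_S."""
        self.previous = self.current
        self.previous_expires_at = now + GRACE_S
        self.current = secrets.token_urlsafe(32)
        return self.current

    def is_valid(self, presented, now):
        """Accept the current token, or the previous one inside its grace window."""
        if hmac.compare_digest(presented, self.current):
            return True
        return (
            self.previous is not None
            and now < self.previous_expires_at
            and hmac.compare_digest(presented, self.previous)
        )
```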
### 14.3 Compromised host revocation
From the dashboard or another mesh-owner session:
```
claudemesh member revoke <pubkey>
```
Broker marks member as revoked. Connected daemon receives `member_revoked`
push, self-disables (refuses new IPC, closes WS), exits with non-zero status,
logs forensic event.
### 14.4 Image-clone lifecycle
Covered in §2.2. Three policies (`refuse`, `warn`, `allow` — settable per-host
via `config.toml`).
### 14.5 Backup & restore
```
claudemesh daemon backup --out <path> # dumps keypair, config, schema_version
claudemesh daemon restore --in <path> # writes them; refuses if a daemon is running
```
Backup is encrypted with a passphrase (Argon2id KDF + crypto_secretbox). The
intent: "I'm reformatting my laptop, I want my mesh memberships back without
re-joining." NOT for "deploy this same identity on 10 servers" (that's the
clone problem above).
### 14.6 Uninstall / reset
```
claudemesh daemon uninstall # full purge: stops, deregisters from broker, wipes ~/.claudemesh/daemon/<slug>
claudemesh daemon reset # wipes local state, keeps broker member registration (for restoring)
```
Uninstall calls broker's `POST /v1/me/members/:pubkey/leave` so member doesn't
linger as ghost. Reset is local-only, no broker contact.
### 14.7 Disk corruption recovery
```
claudemesh daemon recover # interactive: integrity check + offer rebuild paths
```
Detects corrupt `outbox.db` / `inbox.db`. Options:
- Restore from local journal-only inbox (read-only mode; sends disabled).
- Wipe + rebuild from broker (fetches last N days of message history if
available; topics need re-subscribe; outbox is irrecoverable, queued sends are
lost).
- Wipe + start fresh.
## 15. Version compatibility
### 15.1 Negotiation handshake
On daemon connect to broker AND on every IPC request:
```
GET /v1/version
{
"daemon_version": "0.9.0",
"ipc_api": "v1",
"ipc_minor": 3, # additive minor
"schema_version": 7,
"broker_protocol_min": "0.7",
"broker_protocol_max": "0.9"
}
```
### 15.2 Compat policy
| Across | Policy |
|---|---|
| Daemon ↔ Broker | Daemon refuses to connect if broker version < daemon's `broker_protocol_min`. Broker logs warning. Pre-1.0 we may break this with notice; post-1.0 we maintain backward compat for ≥6 months. |
| CLI ↔ Daemon | CLI checks daemon's `ipc_api`. Same major = OK. Different major = CLI falls back to cold-path with warning. |
| SDK ↔ Daemon | SDK negotiates `ipc_minor`; uses minimum of (SDK's, daemon's). |
| Daemon binary ↔ schema | Binary refuses to start on unknown schema; migrations run forward-only; no automatic downgrade. |
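The first three rows of the policy table reduce to small comparisons. The sketch below is illustrative only: `DaemonInfo` and the three function names are hypothetical, and it assumes protocol bounds are compared on major.minor alone.

```python
from dataclasses import dataclass


def parse_version(text: str) -> tuple[int, ...]:
    """Turn "0.9" or "0.9.0" into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


@dataclass(frozen=True)
class DaemonInfo:
    ipc_api: str              # e.g. "v1"
    ipc_minor: int            # e.g. 3
    broker_protocol_min: str  # e.g. "0.7"


def broker_allowed(daemon: DaemonInfo, broker_version: str) -> bool:
    """Daemon <-> Broker: refuse if the broker is older than the minimum."""
    # Assumption: only major.minor matters, since bounds are written as "0.7".
    floor = parse_version(daemon.broker_protocol_min)[:2]
    return parse_version(broker_version)[:2] >= floor


def cli_uses_daemon(daemon: DaemonInfo, cli_ipc_api: str) -> bool:
    """CLI <-> Daemon: same major is OK; otherwise fall back to cold-path."""
    return cli_ipc_api == daemon.ipc_api


def negotiated_minor(daemon: DaemonInfo, sdk_minor: int) -> int:
    """SDK <-> Daemon: both sides use the lower of the two minors."""
    return min(sdk_minor, daemon.ipc_minor)
```

With the handshake values from §15.1, an SDK that speaks minor 5 would settle on minor 3, and a broker at 0.6.x would be refused.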
### 15.3 Compatibility matrix (published in docs, machine-readable JSON at /v1/compat)
```json
{
"daemon": "0.9.0",
"compatible_brokers": ["0.7.x", "0.8.x", "0.9.x"],
"compatible_clis": ["0.9.x"],
"compatible_sdks": {
"python": ">=0.9.0,<1.0.0",
"go": ">=0.9.0,<1.0.0",
"ts": ">=0.9.0,<1.0.0"
}
}
```
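A client consuming this matrix needs to evaluate two pattern styles: `0.8.x` wildcards and `>=0.9.0,<1.0.0` ranges. A minimal sketch, assuming plain numeric versions with no pre-release suffixes; `matches_wildcard` and `satisfies_range` are hypothetical names, not a published API.

```python
import operator

# Two-character operators come first so ">=" is not mistaken for ">".
_OPS = [(">=", operator.ge), ("<=", operator.le), ("==", operator.eq),
        (">", operator.gt), ("<", operator.lt)]


def _parse(version: str) -> tuple[int, ...]:
    return tuple(int(part) for part in version.strip().split("."))


def matches_wildcard(version: str, pattern: str) -> bool:
    """True if e.g. "0.8.3" fits "0.8.x"; component counts must agree."""
    want, have = pattern.split("."), version.split(".")
    if len(want) != len(have):
        return False
    return all(w == "x" or w == h for w, h in zip(want, have))


def satisfies_range(version: str, spec: str) -> bool:
    """Evaluate a comma-separated spec such as ">=0.9.0,<1.0.0"."""
    current = _parse(version)
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in _OPS:
            if clause.startswith(symbol):
                if not compare(current, _parse(clause[len(symbol):])):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True
```

For example, `any(matches_wildcard("0.8.3", p) for p in matrix["compatible_brokers"])` would answer whether a 0.8.3 broker is listed, where `matrix` is the parsed JSON above.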
## 16. Threat model
### 16.1 Attacker classes
| Attacker | Has | Wants | Mitigations |
|---|---|---|---|
| Local same-user shell | OS user creds | Send / read mesh messages | None needed — they already have FS access to keypair; daemon is no worse |
| Local different-user shell | Different OS user | Read this user's daemon | UDS 0600 + TCP loopback + token. Requires OS exploit to escalate |
| Browser SSRF | Loopback HTTP | Send messages, read inbox | `local_token` + Origin/Host check + non-default port. SSRF without token cannot succeed |
| Container side-channel | Same loopback namespace | Read another container's daemon | Containers share host loopback only if explicitly net=host. `local_token` defends. Recommended: bind UDS only inside containers |
| Compromised hook | Capability token in env | Use that scope | Capability tokens are scoped + short-lived; cannot escalate |
| Compromised broker | Full mesh visibility on its side | Deliver malicious messages, identity-impersonate | E2E encryption (crypto_box DMs, per-topic keys) — broker can't read content. Out-of-scope for daemon |
| Cloned VM image | Same keypair on two hosts | Identity collision | Host fingerprint detection + dashboard audit + `--remint` flow |
| Stolen laptop | Disk access | Mesh impersonation forever | `member revoke` from dashboard. Without disk encryption, protection depends on the user's own laptop security; documented in the security guide |
| Untrusted hook author | Hook script content | Exfil mesh data | Hook is on disk YOU control. If you ran `git pull` on a malicious hooks/ repo, that's a code-supply-chain attack out of scope for the daemon |
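The Browser SSRF row combines three checks. A sketch of how they might compose on the loopback TCP listener follows. Several details are assumptions rather than spec: that `local_token` travels as an `Authorization: Bearer` header, that header names arrive lower-cased, and that any request carrying an `Origin` header is rejected because no first-party client is a browser. The function name is hypothetical.

```python
import hmac


def authorize_loopback_request(headers: dict[str, str],
                               expected_token: str,
                               port: int) -> bool:
    """Apply Host, Origin, and token checks to one loopback HTTP request."""
    # Host check: a DNS-rebinding page sends its own hostname, not loopback.
    allowed_hosts = {f"127.0.0.1:{port}", f"localhost:{port}"}
    if headers.get("host", "").lower() not in allowed_hosts:
        return False

    # Origin check (assumed policy): browsers attach Origin to cross-origin
    # requests; CLI and SDK clients do not send it at all.
    if "origin" in headers:
        return False

    # Token check (assumed transport): Bearer scheme, constant-time compare.
    scheme, _, presented = headers.get("authorization", "").partition(" ")
    if scheme.lower() != "bearer":
        return False
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

The checks are ordered cheapest-first; the token comparison uses `hmac.compare_digest` so response timing does not leak how much of a guessed token matched.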
### 16.2 Out of scope
- Defending against an attacker with root on the daemon host. They can read
`keypair.json` directly.
- Defending against malicious peers in the same mesh sending malformed
payloads. Daemon validates structure but trusts mesh members.
- Defending against a compromised broker. Mesh-level E2E protects content but
not metadata.
## 17. Migration — what changes for existing users
Same as v1. Additive. No DB migration on broker. Existing
`~/.claudemesh/config.json` consumed unchanged. `claudemesh launch` keeps
working; daemon is opt-in.
---
## What needs review (round 2)
Round 1 produced: identity model needs `--ephemeral` + clone-detect, IPC needs
local token, "exactly-once" was a lie, hooks needed scoped credentials, surface
needed shrinking, missing rotation/recovery/migration/threat-model.
This v2 attempts to address all of them. Specifically critique:
1. **Has the identity model fully closed the clone problem?** Refuses-on-fingerprint-mismatch
plus broker audit plus mesh-owner revoke — does this catch a sophisticated
attacker who copies `host_fingerprint.json` along with the keypair?
2. **Is the local-token model sufficient for browser-SSRF defense?**
Token + Origin + Host checks + 127.0.0.1-only. Anything else needed?
3. **The delivery contract** (§4) — is it now defensible? Does the inflight-recovery
semantics + idempotency-key propagation produce the guarantees claimed?
4. **Hook capability tokens** (§6.2) — short-lived, scoped, expire on hook exit.
Does this fully eliminate the exfil footgun? What capability scopes are
actually needed for v0.9.0 hooks?
5. **Frozen v0.9.0 surface** (§3.1) — is the cut right? Should `peer list` be
in core or capability-gated? Should `inbox/search` ship in v0.9.0?
6. **Threat model** (§16) — anything missing? Specifically thinking about CI
environments where the daemon's host is a fleet shared across many users'
builds.
7. **Lifecycle flows** (§14) — image clones, key rotation, host moves, disk
corruption, uninstall semantics. Anything still missing?
8. **Version compat** (§15) — is the negotiation handshake sufficient, or do
we need stronger guarantees (e.g. semver-strict, or a feature-bit
negotiation rather than version numbers)?
Score 1-5 each. Top 3 changes you'd insist on for v3, if any. If you think v2
is shippable, say so explicitly; over-engineering is a real risk.