- Bump apps/cli/package.json to 1.22.0 (additive feature: claudemesh daemon long-lived runtime). - CHANGELOG entry for 1.22.0 covering subcommands, idempotency wiring, crash recovery, and the deferred Sprint 7 broker hardening. - Roadmap entry for v0.9.0 daemon foundation right above the v2.0.0 daemon redesign section, so the bridge release is documented as the shipped step toward the larger architectural shift. - Move shipped daemon specs (v1..v10 iteration trail + locked v0.9.0 spec + broker-hardening followups) from .artifacts/specs/ to .artifacts/shipped/ per the project artifact-pipeline convention. Not in this commit: npm publish and the cli-v1.22.0 GitHub release tag — both are public-distribution actions and require explicit user approval. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
854 lines
37 KiB
Markdown
854 lines
37 KiB
Markdown
# `claudemesh daemon` — Final Spec v2
|
||
|
||
> **Round 2 after a critical first-pass review.** v1 of this spec was reviewed
|
||
> by another model and pushed back on identity model, no-auth IPC, "exactly-once"
|
||
> overclaim, hook credentials, surface bloat, and missing operational flows
|
||
> (rotation, image clones, schema migration, threat model). v2 incorporates all
|
||
> of those.
|
||
|
||
---
|
||
|
||
## 0. Intent — what this is, what it isn't
|
||
|
||
### 0.1 The product reality
|
||
|
||
claudemesh today is a **peer mesh runtime for Claude Code sessions**. Each
|
||
session runs `claudemesh launch`, opens a WebSocket to a managed broker, gets
|
||
ephemeral identity, sends/receives DMs and topic messages with other Claude Code
|
||
sessions, posts to shared state, deploys MCP servers / skills / files,
|
||
participates in tasks, schedules reminders. Everything is E2E encrypted with
|
||
crypto_box envelopes for DMs and per-topic symmetric keys for topics. The broker
|
||
is a routing/persistence layer; peers do the actual work.
|
||
|
||
The CLI is the canonical surface — every operation is a `claudemesh <verb>`.
|
||
The MCP server is a "tool-less push pipe" that surfaces inbound messages to
|
||
Claude Code as channel notifications. There is also a web dashboard, an `/v1/*`
|
||
REST API, and an existing apikey auth model for external integrations.
|
||
|
||
### 0.2 The gap
|
||
|
||
Anything that **isn't a Claude Code session** is a second-class citizen:
|
||
|
||
- A RunPod handler that wants to alert a peer when an OOM happens has only
|
||
one option: curl an apikey-authed REST endpoint. One-way only. The handler
|
||
is not a peer — it can't be DM'd back, can't be `@-mentioned`, can't be in
|
||
`peer list`, can't claim a task assigned to it, can't host an MCP service or
|
||
share a skill. It's a webhook spoke, not a participant.
|
||
|
||
- A Temporal worker that wants to track its own progress in shared mesh state,
|
||
publish to a `#alerts` topic, and listen for "retry now" instructions has
|
||
no good shape. Either it shells out to `claudemesh send` cold-path
|
||
(a fresh WS handshake per message — ~1s latency, broker churn, no inbound
|
||
path) or it speaks the WS protocol manually (significant code, no SDK).
|
||
|
||
- A long-running CI runner, an IoT box, a phone app, a future Python or Go
|
||
service — none can be **first-class peers** without writing the same WS
|
||
reconnect / queue / encryption / presence code that the existing CLI already
|
||
has, plus an IPC surface so the host's apps can use it without re-implementing
|
||
any of that.
|
||
|
||
### 0.3 What this daemon is
|
||
|
||
A long-running process — the same `claudemesh-cli` binary in `daemon` mode —
|
||
that turns any host into a **first-class peer**:
|
||
|
||
- Stable identity across restarts (the host *is* a member of the mesh, not a
|
||
series of disconnected sessions).
|
||
- Persistent WS to the broker, with reconnect, queue, dedupe.
|
||
- Local IPC surface (UDS + loopback HTTP + SSE) that any local app can hit
|
||
to send, subscribe, query — without learning the broker protocol or carrying
|
||
long-lived secrets in app code.
|
||
- Hooks: shell scripts that fire on events. Server replies to DMs, auto-claims
|
||
tasks, escalates errors — without the app being involved.
|
||
- Same security primitives as `claudemesh launch` (mesh keypair, crypto_box,
|
||
per-topic keys). No new auth model toward the broker.
|
||
|
||
The daemon **is the runtime**. The CLI in cold-path mode is a fallback. The
|
||
Claude Code MCP integration is one client of the daemon (eventually).
|
||
|
||
### 0.4 What this daemon is NOT
|
||
|
||
- **Not a webhook gateway.** `/v1/notify` and apikeys remain the path for
|
||
systems that can't host the runtime (third-party SaaS, monitoring tools).
|
||
The daemon is for systems that *can* run a process — code you control.
|
||
|
||
- **Not a generic message broker.** It speaks claudemesh protocol to one
|
||
managed broker. It is not a substitute for NATS, Redis, Kafka, RabbitMQ.
|
||
|
||
- **Not a Slack replacement.** Topics, DMs, mentions exist because *AI
|
||
sessions* use them. Humans interact via the dashboard or a Claude Code
|
||
session, not by reading the daemon's inbox directly.
|
||
|
||
- **Not a fleet manager.** One daemon manages one mesh on one host. Multi-mesh
|
||
on one host is supported (one daemon per mesh, supervised). Cross-host
|
||
supervision is an external concern (systemd, k8s, etc.) — the daemon doesn't
|
||
reach across hosts.
|
||
|
||
### 0.5 Who deploys this
|
||
|
||
- A developer running `claudemesh daemon up` on their laptop so their open
|
||
Claude Code sessions all share one persistent connection (instead of each
|
||
opening its own ephemeral WS).
|
||
- The same developer running `claudemesh daemon install-service` on their VPS,
|
||
RunPod pod, Temporal worker, CI runner — turning each into an
|
||
addressable peer that scripts on that host can talk to via local IPC.
|
||
- Eventually: language SDKs (Python / Go / TypeScript) talking to the daemon
|
||
on `localhost`, exposing claudemesh as a first-class API for any app the
|
||
developer writes.
|
||
|
||
### 0.6 Pre-launch posture
|
||
|
||
No users yet. We can break protocol, schema, surface, anything. Optimize for
|
||
the architecture we want to live with for years, not for the smallest
|
||
shippable cut. Codex pushed back on v1 on this exact axis: do not ship
|
||
graph/vector/MCP/skills/tasks on day one — freeze a small, hardened core,
|
||
expand deliberately.
|
||
|
||
---
|
||
|
||
## 1. Process model
|
||
|
||
**One daemon per (user, mesh)**. Persistent. Survives reboots via OS
|
||
supervisor. Serves multiple local apps concurrently.
|
||
|
||
```
|
||
~/.claudemesh/daemon/<mesh-slug>/
|
||
pid 0600 pidfile, cleaned on shutdown
|
||
sock 0600 unix domain socket (primary IPC)
|
||
http.port 0644 auto-allocated loopback port
|
||
local_token 0600 per-daemon bearer for HTTP/TCP transports
|
||
keypair.json 0600 persistent ed25519 + x25519 — daemon identity
|
||
host_fingerprint.json 0600 machine-id + boot-id + interface mac digest
|
||
config.toml 0644 user-editable runtime tuning
|
||
outbox.db 0600 SQLite — durable outbound queue
|
||
inbox.db 0600 SQLite — N-day inbound history, FTS-indexed
|
||
schema_version 0644 integer; gates online migrations
|
||
daemon.log 0644 JSON-lines, rotating (100 MB / 14 d)
|
||
hooks/ 0700 user-managed event scripts
|
||
```
|
||
|
||
**Resource caps (defaults, configurable):**
|
||
|
||
| Resource | Default | Why |
|
||
|---|---|---|
|
||
| RSS | 256 MB | Most workloads stay under 50 MB; cap protects multi-mesh hosts |
|
||
| CPU | unlimited | Hook fan-out can spike briefly; rely on OS scheduler |
|
||
| Outbox DB | 5 GB | At 1KB avg msg, that's 5M queued. Disk-full handling at 90% |
|
||
| Inbox DB | 5 GB | Same |
|
||
| File descriptors | 1024 | UDS clients + SSE streams + DB handles + WS |
|
||
| SSE concurrent | 32 streams | DoS protection; configurable up |
|
||
| IPC concurrent | 64 in-flight | Backpressure beyond this returns `429 daemon_busy` |
|
||
| Hook concurrency | 8 | Bounded pool; overflow queues |
|
||
|
||
Single binary. Same `claudemesh-cli` package; `daemon` is one of its modes.
|
||
|
||
## 2. Identity — persistent member by default, ephemeral on opt-in, clone-aware
|
||
|
||
### 2.1 Modes
|
||
|
||
```
|
||
claudemesh daemon up # default: persistent member
|
||
claudemesh daemon up --ephemeral # session-shaped, no keypair persisted
|
||
claudemesh daemon up --ephemeral --ttl=2h # auto-shutdown after TTL
|
||
```
|
||
|
||
- **Persistent (default)**: ed25519 + x25519 keypair stored in `keypair.json`.
|
||
Same identity across restarts, reconnects, supervisor cycles. Right for
|
||
servers, workers, addressable peers.
|
||
- **Ephemeral**: keypair generated in memory, never written. Daemon exits =
|
||
identity gone. Right for CI jobs, preview environments, disposable RunPod
|
||
pods, test harnesses, build agents, anything that should not leave a peer
|
||
ghost in the broker after teardown.
|
||
- **`--ttl <duration>`** on ephemeral mode: auto-shutdown after the duration,
|
||
or after `claudemesh daemon down`, whichever first. Broker member record
|
||
cleaned up on shutdown.
|
||
|
||
### 2.2 Image-clone detection
|
||
|
||
Two daemons booting with the same `keypair.json` (VM image clone, container
|
||
copy, restored backup) is a serious failure mode — broker sees connection
|
||
collisions, presence flickers, encrypted messages route to the wrong host.
|
||
|
||
Handled in three places:
|
||
|
||
1. **Daemon side**: `host_fingerprint.json` is written on first startup —
|
||
`sha256(machine-id || boot-id || mac-of-default-iface || hostname)`. On every
|
||
subsequent startup, the fingerprint is recomputed and compared. If it
|
||
differs, the daemon **refuses to start** unless `--accept-cloned-identity`
|
||
is passed (writes a fresh fingerprint and continues with the same keypair —
|
||
for legitimate hardware migrations) or `--remint` is passed (mints fresh
|
||
keypair, registers as a new member, broker reaps the old member after
|
||
grace period).
|
||
2. **Broker side**: tracks `lastSeenHostFingerprint` per member. On
|
||
reconnection from a different fingerprint, broker emits a
|
||
`member_clone_suspected` security event to the mesh owner's dashboard.
|
||
Connection itself is allowed (legitimate hardware swaps happen) but visible
|
||
for audit.
|
||
3. **Mesh owner**: `claudemesh member revoke <pubkey>` revokes the keypair
|
||
server-side; daemon receives `keypair_revoked` push event on next
|
||
connection and self-disables.
|
||
|
||
### 2.3 Rename
|
||
|
||
`--name` is taken at first `daemon up`; subsequent runs read the keypair file
|
||
and ignore `--name` unless `--rename` is passed (which produces a
|
||
`member_renamed` event the broker propagates to peers).
|
||
|
||
## 3. IPC surface — stable core only in v0.9.0
|
||
|
||
### 3.1 Frozen core surface (v0.9.0)
|
||
|
||
Codex's feedback: do not ship every CLI verb on day one. A small hardened core
|
||
first, expand under explicit capability gates.
|
||
|
||
```
|
||
# Messaging — durable, tested
|
||
POST /v1/send {to, message, priority?, meta?, replyToId?}
|
||
POST /v1/topic/post {topic, message, priority?, mentions?}
|
||
POST /v1/topic/subscribe {topic} (idempotent)
|
||
POST /v1/topic/unsubscribe {topic}
|
||
GET /v1/topic/list
|
||
GET /v1/inbox ?since=<iso>&topic=<n>&from=<peer>&limit=<n>
|
||
GET /v1/inbox/search ?q=<fts-query>&limit=<n> (FTS5)
|
||
|
||
# Peers + presence — read-only on day one
|
||
GET /v1/peers ?mesh=<slug>
|
||
POST /v1/profile {summary?, status?, visible?} (limited fields)
|
||
|
||
# Files — already production in CLI
|
||
POST /v1/file/share {path, to?, message?, persistent?}
|
||
GET /v1/file/get ?id=<fileId>&out=<path>
|
||
GET /v1/file/list
|
||
|
||
# Events — push
|
||
GET /v1/events text/event-stream
|
||
core events: message, peer_join, peer_leave, file_shared,
|
||
daemon_disconnect, daemon_reconnect, hook_executed
|
||
|
||
# Control plane
|
||
GET /v1/health {connected, lag_ms, queue_depth, inflight,
|
||
mesh, member_pubkey, uptime_s, schema_version,
|
||
daemon_version, broker_version}
|
||
GET /v1/metrics Prometheus exposition
|
||
GET /v1/version {daemon, schema, ipc_api} (negotiation)
|
||
POST /v1/heartbeat {} (caller-side liveness signal)
|
||
```
|
||
|
||
That's it. ~20 endpoints. Battle-test these before adding more.
|
||
|
||
### 3.2 Capability-gated future surface (v0.9.x roadmap)
|
||
|
||
Behind explicit feature flags in `config.toml`, post-v0.9.0:
|
||
|
||
```toml
|
||
[capabilities]
|
||
state = false # /v1/state/{set,get,list}
|
||
memory = false # /v1/memory/{remember,recall}
|
||
vector = false # /v1/vector/{store,search,delete}
|
||
graph = false # /v1/graph/query
|
||
tasks = false # /v1/task/{create,claim,complete}
|
||
scheduling = false # /v1/scheduling/remind
|
||
mcp_host = false # /v1/mcp/{register,call} (LARGEST surface; treat as v1.0)
|
||
skill_share = false # /v1/skill/{deploy,share}
|
||
```
|
||
|
||
Each capability is its own ship: design review, security review, test
|
||
coverage, capability-token model, then enable. None enabled in v0.9.0.
|
||
|
||
### 3.3 Local IPC authentication
|
||
|
||
Codex was right: loopback TCP without auth is an attack surface (browser SSRF,
|
||
container side-channels, sandboxed apps with network but no FS access, WSL
|
||
host-shared loopback).
|
||
|
||
| Transport | Auth | Rationale |
|
||
|---|---|---|
|
||
| UDS | None (relies on FS perms 0600) | Reaching the socket = same UID = can read keypair anyway |
|
||
| TCP loopback | **Required**: `Authorization: Bearer <local_token>` | Browser/container/sandbox can reach loopback without FS access |
|
||
| SSE | Required: `Authorization: Bearer <local_token>` | Same |
|
||
|
||
`local_token` is 32 bytes of `crypto.randomBytes` (~256 bits), encoded base64url,
|
||
written to `local_token` mode 0600 at daemon init. Rotated on `claudemesh
|
||
daemon rotate-token`. SDKs auto-discover the token by reading the file (same
|
||
mechanism as discovering the socket path).
|
||
|
||
**Additional defenses:**
|
||
- HTTP listener binds **127.0.0.1 only**. Refuses to bind elsewhere unless
|
||
`[ipc] http_bind = "..."` is set explicitly **and** `[ipc] http_external_auth = "..."`
|
||
points to a separate token file (escape hatch for advanced users; never the default).
|
||
- `Origin` header check: rejects requests with `Origin` set unless it's
|
||
explicitly allowlisted in config (default: empty allowlist). Defends against
|
||
browser SSRF.
|
||
- `Host` header check: must be `localhost` or `127.0.0.1`. Defends against DNS
|
||
rebinding.
|
||
- CORS: `Access-Control-Allow-Origin` never echoed; preflight returns `403`.
|
||
- `User-Agent` required (rejects empty UA — mild signal against simple SSRF).
|
||
|
||
### 3.4 Request limits + backpressure
|
||
|
||
- Max request body: **1 MB** (override per endpoint; file uploads use a separate
|
||
streaming endpoint).
|
||
- Max response body: **10 MB**; truncated with `Link: rel=next` cursor.
|
||
- Max in-flight IPC requests: **64**. Beyond → `429 daemon_busy`.
|
||
- Max SSE concurrent streams: **32**. Beyond → `429 too_many_streams`.
|
||
- Per-token rate limit: **100 req/sec** sustained, 1000/sec burst (token
|
||
bucket). Tunable.
|
||
|
||
## 4. Delivery contract — durable at-least-once with idempotent send
|
||
|
||
Codex was right: "exactly-once" is a lie. Replacing the claim with a precise
|
||
contract.
|
||
|
||
### 4.1 The contract
|
||
|
||
> **The daemon guarantees: each successful send call enqueues exactly one row
|
||
> to the broker eventually, identified by a stable `messageId`. The daemon
|
||
> does not guarantee that downstream peers process the message exactly once —
|
||
> that is the receiver's responsibility, aided by the propagated
|
||
> `idempotency_key`.**
|
||
|
||
Concretely:
|
||
|
||
- **Caller → daemon**: caller may supply `Idempotency-Key`; daemon dedupes
|
||
identical keys for 24h. Without one, daemon mints `ulid` and returns it as
|
||
`messageId`.
|
||
- **Daemon → broker**: each outbox row has at-most-one inflight transmit.
|
||
Daemon retries with exponential backoff until broker ACKs OR row hits TTL
|
||
(7d default → moves to `dead`).
|
||
- **Broker → peer**: existing claudemesh delivery semantics. Broker dedupes by
|
||
`messageId`. Peer receives ≥1 copy.
|
||
- **Peer hooks**: hooks see `idempotency_key` in the event JSON. Idempotent
|
||
hook implementations are the receiver's responsibility.
|
||
|
||
### 4.2 Outbox row state machine
|
||
|
||
```
|
||
┌────────────┐
|
||
send call → │ pending │
|
||
└─────┬──────┘
|
||
│ daemon picks up batch
|
||
▼
|
||
┌────────────┐
|
||
│ inflight │ ← attempts++, last_error written
|
||
└─┬────┬─────┘
|
||
│ │ broker NACK / network err
|
||
broker ACK │ └──────────► back to pending (with exp. backoff)
|
||
▼
|
||
┌────────────┐
|
||
│ done │ ← delivered_at set, broker_message_id stored
|
||
└────────────┘
|
||
|
||
age > max_age_hours:
|
||
┌────────────┐
|
||
│ dead │ ← surfaces in `daemon outbox --failed`
|
||
└────────────┘
|
||
```
|
||
|
||
### 4.3 Crash recovery
|
||
|
||
On daemon startup:
|
||
|
||
1. Any rows in `inflight` are reset to `pending` with `attempts++` and
|
||
`next_attempt_at = now + min_backoff`. Note: this MAY cause double-delivery
|
||
of a message that was actually ACK'd by the broker but the ACK didn't
|
||
persist locally before crash. The `idempotency_key` propagates to broker
|
||
(via message `meta`) so the broker dedupes by key.
|
||
2. `outbox.db` integrity check (`PRAGMA integrity_check`); if fails, daemon
|
||
refuses to start, points user at `claudemesh daemon recover`.
|
||
3. `inbox.db` integrity check; on failure, drops to `inbox.db.corrupt-<ts>`,
|
||
creates fresh empty inbox, logs `inbox_corruption_recovered` (does not
|
||
block startup — inbox is a cache).
|
||
|
||
### 4.4 Disk-full
|
||
|
||
- At 80% of `outbox.max_queue_size` or 80% of `[disk] reserved_bytes`: daemon
|
||
emits `outbox_pressure_high` event + Prometheus gauge. Sends still accept.
|
||
- At 95%: new sends return `507 insufficient_storage`. Existing inflight
|
||
drains.
|
||
- At 100%: daemon enters degraded mode — refuses sends, refuses new SSE
|
||
streams, holds open WS for inbound only. `daemon status` shows degraded.
|
||
- Recovery: drain via broker reconnect (drains `done` rows older than
|
||
retention window) or `claudemesh daemon outbox prune --confirm`.
|
||
|
||
### 4.5 Schema migration
|
||
|
||
`schema_version` file holds an integer. On startup:
|
||
1. If `schema_version` matches binary's expected version → continue.
|
||
2. If version is older → run `apps/cli/src/daemon/migrations/<from>-<to>.sql`
|
||
in a transaction, write new version on success.
|
||
3. If version is newer (downgrade) → daemon refuses to start, error points at
|
||
re-installing matching version.
|
||
|
||
Migrations are forward-only. Each migration is ≤ 1 transaction. Test coverage
|
||
required: every migration has a snapshot test from prior schema.
|
||
|
||
## 5. Inbound — durable history with FTS
|
||
|
||
Every inbound message is written to `inbox.db` before any hook fires:
|
||
|
||
```sql
|
||
CREATE VIRTUAL TABLE inbox USING fts5(
|
||
message_id UNINDEXED, mesh UNINDEXED, topic, sender_pubkey UNINDEXED,
|
||
sender_name, body, meta, idempotency_key UNINDEXED,
|
||
received_at UNINDEXED, replied_to_id UNINDEXED
|
||
);
|
||
CREATE INDEX inbox_received_at ON inbox(received_at);
|
||
CREATE INDEX inbox_idem ON inbox(idempotency_key);
|
||
```
|
||
|
||
- **Receiver-side dedupe**: on insert, `INSERT OR IGNORE` on `idempotency_key`.
|
||
Duplicate broker delivery becomes a no-op locally + `cm_daemon_dedupe_total`
|
||
counter increments.
|
||
- 30-day rolling retention (configurable). `VACUUM` weekly during low-traffic
|
||
window.
|
||
- `claudemesh daemon search "OOM"` queries the FTS index.
|
||
- Apps connecting mid-stream replay history via `?since=<iso>`.
|
||
|
||
## 6. Hooks — first-class but tightly bounded
|
||
|
||
Codex was right: hooks were underspecified, and putting `CLAUDEMESH_TOKEN` in
|
||
every hook env was a serious exfil footgun.
|
||
|
||
### 6.1 Hook directory & contract
|
||
|
||
```
|
||
hooks/
|
||
on-message.sh every inbound message (DM + topic)
|
||
on-dm.sh DMs only
|
||
on-mention.sh when @<my-name> appears anywhere
|
||
on-topic-<name>.sh a specific topic
|
||
on-file-share.sh file shared with me
|
||
on-disconnect.sh WS dropped
|
||
on-reconnect.sh reconnected
|
||
on-startup.sh daemon up
|
||
pre-send.sh filter / mutate outbound (last gate)
|
||
hooks.toml per-hook policy (auth, redaction, env, timeout)
|
||
```
|
||
|
||
`hooks.toml` (mandatory; daemon refuses to invoke hooks without it):
|
||
|
||
```toml
|
||
[on-mention]
|
||
enabled = true
|
||
timeout_s = 30
|
||
output_size_limit = 65536
|
||
redact_payload = ["body.password", "meta.api_key"] # JSONPath
|
||
allow_reply = true # if false, stdout reply ignored
|
||
capability_token_scope = ["topic:alerts:post"] # scoped, NOT broker session token
|
||
network_policy = "deny" # 'deny' | 'allow' | 'allowlist'
|
||
network_allowlist = [] # only if policy = 'allowlist'
|
||
fs_policy = "readonly" # 'readonly' | 'rw' | 'sandbox'
|
||
killpg_on_timeout = true # SIGTERM process group, not just child
|
||
audit = true # log every invocation
|
||
```
|
||
|
||
### 6.2 Credentials passed to hooks
|
||
|
||
**Default: nothing.** No `CLAUDEMESH_TOKEN`, no broker session, nothing that
|
||
lets the hook impersonate the daemon's identity broadly.
|
||
|
||
**Opt-in per hook**: `capability_token_scope = ["topic:alerts:post"]` mints a
|
||
**short-lived (5 min) capability token** scoped to exactly that capability.
|
||
The hook can use it to call back into the daemon's IPC ("post a reply to
|
||
#alerts") but cannot use it to read state, read inbox, deploy MCP, etc. Token
|
||
expires when hook process exits OR after 5 min, whichever first.
|
||
|
||
Capability tokens are local-only — they authorize against the daemon's IPC
|
||
surface, never the broker directly. Daemon translates capability calls into
|
||
broker calls.
|
||
|
||
Env variables the hook DOES get:
|
||
- `CLAUDEMESH_MESH=<slug>`
|
||
- `CLAUDEMESH_HOOK_NAME=on-mention`
|
||
- `CLAUDEMESH_EVENT_ID=<ulid>`
|
||
- `CLAUDEMESH_CAPABILITY_TOKEN=<token>` (only if scope was configured; else absent)
|
||
- `CLAUDEMESH_DAEMON_SOCK=<path>` (so SDKs can connect for capability calls)
|
||
- `PATH=/usr/bin:/bin` (locked down)
|
||
|
||
### 6.3 Payload redaction
|
||
|
||
Hook stdin receives event JSON minus paths listed in `redact_payload`. Default
|
||
redaction: nothing. Mesh owner / daemon admin opts in.
|
||
|
||
### 6.4 Timeout & cleanup
|
||
|
||
- Per-hook `timeout_s` (default 30s). On timeout, daemon sends SIGTERM to the
|
||
hook's process group (`killpg_on_timeout=true`), waits 5s, then SIGKILL.
|
||
Catches forked grandchildren that were trying to keep things alive.
|
||
- Hook stdout/stderr captured, truncated at `output_size_limit`. Larger
|
||
outputs log a warning and discard the overflow.
|
||
|
||
### 6.5 Audit log
|
||
|
||
Every hook invocation logs:
|
||
```json
|
||
{"hook":"on-mention","event_id":"01H8…","exit":0,"duration_ms":47,
|
||
"stdout_bytes":120,"stderr_bytes":0,"replied":true,"capability_calls":1,
|
||
"ts":"2026-05-03T14:00:00Z"}
|
||
```
|
||
|
||
Stored in `daemon.log`; metrics exposed via `cm_daemon_hook_*`.
|
||
|
||
### 6.6 Sandboxing — supported, not required
|
||
|
||
The contract supports sandboxing without mandating it (mandating breaks too
|
||
many real workflows):
|
||
|
||
- Linux: opt-in `sandbox = "bubblewrap"` in `hooks.toml` runs the hook under
|
||
`bwrap` with no network (unless `network_policy != "deny"`), readonly FS
|
||
except `/tmp/<hook-id>`, no DBus, no /proc.
|
||
- macOS: opt-in `sandbox = "sandbox-exec"` with similar profile.
|
||
- Default: no sandbox; rely on Unix permissions + `network_policy=deny` (which
|
||
is enforced via `unshare --net` on Linux when available, otherwise
|
||
best-effort firewall rule).
|
||
|
||
## 7. Multi-mesh — daemon-per-mesh, supervised by a thin shell
|
||
|
||
### 7.1 The decision
|
||
|
||
One daemon per mesh, coordinated by a supervisor script. Codex pushed back —
|
||
"why not one daemon serving all meshes?". Going daemon-per-mesh because:
|
||
|
||
- **Crash isolation**: a panic in `prod` mesh's WS reader can't corrupt
|
||
`dev` mesh's outbox.
|
||
- **Resource accounting**: per-mesh RSS, per-mesh metrics, per-mesh disk
|
||
budget — easy to attribute, easy to cap.
|
||
- **Independent identity**: each mesh has its own keypair, host fingerprint,
|
||
capability gates. Conflating into one process forces shared trust.
|
||
- **Independent upgrades**: rolling daemon restarts per mesh, no downtime
|
||
across all meshes.
|
||
- **Simpler code**: zero cross-mesh routing logic in the daemon body.
|
||
|
||
The cost (process count, log fan-out) is real but bounded: typical user has
|
||
1–3 meshes. Heavy users (10–20) get a `claudemesh daemon ps` + `--all` UX that
|
||
treats them as a fleet.
|
||
|
||
### 7.2 Resource caps for fleet hosts
|
||
|
||
`config.toml` has `[fleet]` section read by `daemon up --all`:
|
||
|
||
```toml
|
||
[fleet]
|
||
max_daemons = 10
|
||
total_memory_budget = "2GB" # divided across daemons; each gets budget/N RSS cap
|
||
total_disk_budget = "20GB" # divided across outbox + inbox per daemon
|
||
```
|
||
|
||
If a user hits `max_daemons`, `daemon up <next>` errors with a clear message
|
||
pointing at the cap.
|
||
|
||
### 7.3 Commands
|
||
|
||
```
|
||
claudemesh daemon up --mesh <slug> # one mesh
|
||
claudemesh daemon up --all # all joined meshes (respects fleet caps)
|
||
claudemesh daemon down --mesh <slug>
|
||
claudemesh daemon down --all
|
||
claudemesh daemon status # all daemons, table view
|
||
claudemesh daemon status --json # machine-readable
|
||
claudemesh daemon ps # alias of status
|
||
claudemesh daemon logs --mesh <slug> [-f]
|
||
claudemesh daemon restart --mesh <slug>
|
||
```
|
||
|
||
## 8. Auto-routing — clarified, not transparent
|
||
|
||
Codex pushed back: "no behavior difference" was hand-waving. Persistent
|
||
identity, queueing, hooks, profile state — these legitimately change behavior.
|
||
|
||
### 8.1 What changes when a daemon is up
|
||
|
||
| Behavior | Cold-path CLI | Daemon-routed CLI |
|
||
|---|---|---|
|
||
| Sender attribution | Ephemeral session pubkey for that invocation | Daemon's persistent member pubkey |
|
||
| Latency | ~1s (fresh WS handshake) | <10ms (local UDS round-trip) |
|
||
| Send durability | None — if broker is unreachable, command fails | Outbox queue retries until TTL |
|
||
| Inbound visibility | Not available (cold path closes WS) | `claudemesh inbox` reads daemon's inbox.db |
|
||
| Hooks | Not invoked | Invoked on every event |
|
||
| Presence | Brief flicker as session connects+disconnects | Continuous; daemon's status reflected |
|
||
| `peer list` shows me as | A new ephemeral session each invocation | The daemon's persistent member |
|
||
|
||
### 8.2 Detection logic — connect, don't trust pidfile
|
||
|
||
```
|
||
1. Check ~/.claudemesh/daemon/<slug>/sock exists.
|
||
2. attempt UDS connect with 100ms timeout.
|
||
3. If connect succeeds: send GET /v1/version.
|
||
4. If response is well-formed AND mesh matches AND daemon_version is
|
||
compatible → use this daemon.
|
||
5. Otherwise → cold path.
|
||
```
|
||
|
||
PID liveness check is unreliable (PID reuse, process orphaned). Socket
|
||
handshake is canonical.
|
||
|
||
### 8.3 Coexistence with `claudemesh launch`
|
||
|
||
Both can be running for the same mesh:
|
||
- Daemon connected as persistent member `runpod-worker-3`.
|
||
- A separate `claudemesh launch` connects as ephemeral session of the same
|
||
member. Visible to peers as "another session of runpod-worker-3"
|
||
(sibling-session relationship via `memberPubkey`).
|
||
- CLI verbs from inside `claudemesh launch` route through the launch session,
|
||
NOT the daemon (preserves "this Claude Code session has its own ephemeral
|
||
identity" semantics).
|
||
- CLI verbs from a separate shell route through the daemon (faster, durable).
|
||
|
||
This is consistent with the v0.5.1 self-DM guard and sibling-session
|
||
semantics already shipped.
|
||
|
||
## 9. Service installation
|
||
|
||
```bash
|
||
claudemesh daemon install-service # writes systemd unit / launchd plist / Windows SC
|
||
claudemesh daemon uninstall-service
|
||
claudemesh daemon install-service --user # user-scope unit (default; no root)
|
||
claudemesh daemon install-service --system # system-scope unit (root; multi-user host)
|
||
```
|
||
|
||
Unit defaults:
|
||
- `Restart=on-failure`, `RestartSec=5s`, `StartLimitBurst=5/5min`
|
||
- `MemoryMax=<resource cap>`, `TasksMax=128`, `LimitNOFILE=4096`
|
||
- `StandardOutput/Error=journal`
|
||
- `NoNewPrivileges=yes`, `PrivateTmp=yes`, `ProtectSystem=strict`,
|
||
`ProtectHome=read-only` with `ReadWritePaths=~/.claudemesh`
|
||
- For systemd `--user`, runs as the invoking user (no root needed).
|
||
|
||
`claudemesh install` (the existing setup verb) gains an opt-in prompt:
|
||
*"Install as a background service that always runs?"* Defaults differently
|
||
based on detected environment (TTY vs no-TTY, presence of systemd, etc.).
|
||
|
||
## 10. Observability
|
||
|
||
Standard CLI surface unchanged from v1, with the new gauges/counters:
|
||
|
||
```
|
||
cm_daemon_connected{mesh} 0/1
|
||
cm_daemon_reconnects_total{mesh,reason}
|
||
cm_daemon_lag_ms{mesh} last broker round-trip
|
||
cm_daemon_outbox_depth{mesh,status} pending|inflight|dead
|
||
cm_daemon_outbox_age_seconds{mesh} oldest pending row
|
||
cm_daemon_dedupe_total{mesh,direction} out|in
|
||
cm_daemon_disk_pct{mesh,kind} outbox|inbox
|
||
cm_daemon_send_total{mesh,kind,status}
|
||
cm_daemon_recv_total{mesh,kind,from_type}
|
||
cm_daemon_hook_invocations_total{hook,exit}
|
||
cm_daemon_hook_duration_seconds{hook} histogram
|
||
cm_daemon_hook_capability_calls_total{hook,scope}
|
||
cm_daemon_ipc_request_total{endpoint,status,transport}
|
||
cm_daemon_ipc_duration_seconds{endpoint} histogram
|
||
cm_daemon_local_token_rotations_total
|
||
cm_daemon_clone_suspected_total
|
||
```
|
||
|
||
Tracing: optional OpenTelemetry export.
|
||
|
||
## 11. SDKs — three, slim, core-API only
|
||
|
||
Same shape as v1 but only target the **frozen core surface** (§3.1). State /
|
||
memory / vector / graph / tasks / MCP / skills are NOT in v0.9.0 SDKs — they
|
||
ship per capability gate.
|
||
|
||
Each SDK auto-discovers the daemon: reads `sock` path, `http.port`,
|
||
`local_token`. SDKs versioned in lockstep with the daemon's `/v1` surface.
|
||
|
||
## 12. Security model — explicit boundaries
|
||
|
||
| Boundary | Trust | Mechanism |
|
||
|---|---|---|
|
||
| App ↔ Daemon (UDS) | OS user, FS perms | UDS 0600 |
|
||
| App ↔ Daemon (TCP/SSE) | OS user + bearer token | 127.0.0.1 only + `local_token` + Origin/Host check |
|
||
| Hook ↔ Daemon | Capability scope | Short-lived capability token, never broker session |
|
||
| Daemon ↔ Broker | Mesh keypair | WSS + ed25519 hello + crypto_box DM + per-topic keys |
|
||
| Daemon ↔ Disk | OS user | All daemon files mode 0600/0644 under `~/.claudemesh/daemon/` |
|
||
| Cloned identity | Host fingerprint check | Daemon refuses to start; dashboard audit event |
|
||
|
||
## 13. Configuration
|
||
|
||
`config.toml` — same shape as v1 plus:
|
||
- `[capabilities]` (§3.2)
|
||
- `[fleet]` (§7.2)
|
||
- `[disk] reserved_bytes` (§4.4)
|
||
- `[clone] policy = "refuse" | "warn" | "allow"` (§2.2)
|
||
|
||
User-editable. `claudemesh daemon reload` re-reads it without dropping the WS.
|
||
|
||
## 14. Lifecycle — the operational flows v1 was missing
|
||
|
||
### 14.1 Key rotation
|
||
|
||
```
|
||
claudemesh daemon rotate-keypair
|
||
```
|
||
|
||
Mints fresh ed25519 + x25519. Registers new pubkey with broker as a `member_keypair_rotated` operation (broker associates new pubkey with same member id). Old pubkey is held server-side for 24h grace (decrypts in-flight messages encrypted to old pubkey), then revoked.
|
||
|
||
### 14.2 Local token rotation
|
||
|
||
```
|
||
claudemesh daemon rotate-token
|
||
```
|
||
|
||
Atomically writes a new `local_token`, returns the old one alongside the new
|
||
one for 60s grace. SDKs that already have the old token finish in-flight
|
||
requests; new requests use the new token. After 60s, old token is rejected.
|
||
|
||
### 14.3 Compromised host revocation
|
||
|
||
From the dashboard or another mesh-owner session:
|
||
|
||
```
|
||
claudemesh member revoke <pubkey>
|
||
```
|
||
|
||
Broker marks member as revoked. Connected daemon receives `member_revoked`
|
||
push, self-disables (refuses new IPC, closes WS), exits with non-zero status,
|
||
logs forensic event.
|
||
|
||
### 14.4 Image-clone lifecycle
|
||
|
||
Covered in §2.2. Three policies (`refuse`, `warn`, `allow` — settable per-host
|
||
via `config.toml`).
|
||
|
||
### 14.5 Backup & restore
|
||
|
||
```
|
||
claudemesh daemon backup --out <path> # dumps keypair, config, schema_version
|
||
claudemesh daemon restore --in <path> # writes them; refuses if a daemon is running
|
||
```
|
||
|
||
Backup is encrypted with a passphrase (Argon2id KDF + crypto_secretbox). The
|
||
intent: "I'm reformatting my laptop, I want my mesh memberships back without
|
||
re-joining." NOT for "deploy this same identity on 10 servers" (that's the
|
||
clone problem above).
|
||
|
||
### 14.6 Uninstall / reset
|
||
|
||
```
|
||
claudemesh daemon uninstall # full purge: stops, deregisters from broker, wipes ~/.claudemesh/daemon/<slug>
|
||
claudemesh daemon reset # wipes local state, keeps broker member registration (for restoring)
|
||
```
|
||
|
||
Uninstall calls broker's `POST /v1/me/members/:pubkey/leave` so member doesn't
|
||
linger as ghost. Reset is local-only, no broker contact.
|
||
|
||
### 14.7 Disk corruption recovery
|
||
|
||
```
|
||
claudemesh daemon recover # interactive: integrity check + offer rebuild paths
|
||
```
|
||
|
||
Detects corrupt `outbox.db` / `inbox.db`. Options:
|
||
- Restore from local journal-only inbox (read-only mode; sends disabled).
|
||
- Wipe + rebuild from broker (fetches last N days of message history if
|
||
available; topics need re-subscribe; outbox is irrecoverable, queued sends are
|
||
lost).
|
||
- Wipe + start fresh.
|
||
|
||
## 15. Version compatibility
|
||
|
||
### 15.1 Negotiation handshake
|
||
|
||
On daemon connect to broker AND on every IPC request:
|
||
|
||
```
|
||
GET /v1/version
|
||
{
|
||
"daemon_version": "0.9.0",
|
||
"ipc_api": "v1",
|
||
"ipc_minor": 3, # additive minor
|
||
"schema_version": 7,
|
||
"broker_protocol_min": "0.7",
|
||
"broker_protocol_max": "0.9"
|
||
}
|
||
```
|
||
|
||
### 15.2 Compat policy
|
||
|
||
| Across | Policy |
|
||
|---|---|
|
||
| Daemon ↔ Broker | Daemon refuses to connect if broker version < daemon's `broker_protocol_min`. Broker logs warning. Pre-1.0 we may break this with notice; post-1.0 we maintain backward compat for ≥6 months. |
|
||
| CLI ↔ Daemon | CLI checks daemon's `ipc_api`. Same major = OK. Different major = CLI falls back to cold-path with warning. |
|
||
| SDK ↔ Daemon | SDK negotiates `ipc_minor`; uses minimum of (SDK's, daemon's). |
|
||
| Daemon binary ↔ schema | Binary refuses to start on unknown schema; migrations run forward-only; no automatic downgrade. |
|
||
|
||
### 15.3 Compatibility matrix (published in docs, machine-readable JSON at /v1/compat)
|
||
|
||
```json
|
||
{
|
||
"daemon": "0.9.0",
|
||
"compatible_brokers": ["0.7.x", "0.8.x", "0.9.x"],
|
||
"compatible_clis": ["0.9.x"],
|
||
"compatible_sdks": {
|
||
"python": ">=0.9.0,<1.0.0",
|
||
"go": ">=0.9.0,<1.0.0",
|
||
"ts": ">=0.9.0,<1.0.0"
|
||
}
|
||
}
|
||
```
|
||
|
||
## 16. Threat model
|
||
|
||
### 16.1 Attacker classes
|
||
|
||
| Attacker | Has | Wants | Mitigations |
|
||
|---|---|---|---|
|
||
| Local same-user shell | OS user creds | Send / read mesh messages | None needed — they already have FS access to keypair; daemon is no worse |
|
||
| Local different-user shell | Different OS user | Read this user's daemon | UDS 0600 + TCP loopback + token. Requires OS exploit to escalate |
|
||
| Browser SSRF | Loopback HTTP | Send messages, read inbox | `local_token` + Origin/Host check + non-default port. SSRF without token cannot succeed |
|
||
| Container side-channel | Same loopback namespace | Read another container's daemon | Containers share host loopback only if explicitly net=host. `local_token` defends. Recommended: bind UDS only inside containers |
|
||
| Compromised hook | Capability token in env | Use that scope | Capability tokens are scoped + short-lived; cannot escalate |
|
||
| Compromised broker | Full mesh visibility on its side | Deliver malicious messages, identity-impersonate | E2E encryption (crypto_box DMs, per-topic keys) — broker can't read content. Out-of-scope for daemon |
|
||
| Cloned VM image | Same keypair on two hosts | Identity collision | Host fingerprint detection + dashboard audit + `--remint` flow |
|
||
| Stolen laptop | Disk access | Mesh impersonation forever | `member revoke` from dashboard. Without disk encryption, this is the user's laptop security; documented in security guide |
|
||
| Untrusted hook author | Hook script content | Exfil mesh data | Hook is on disk YOU control. If you ran `git pull` on a malicious hooks/ repo, that's a code-supply-chain attack out of scope for the daemon |
|
||
|
||
### 16.2 Out of scope
|
||
|
||
- Defending against an attacker with root on the daemon host. They can read
|
||
`keypair.json` directly.
|
||
- Defending against malicious peers in the same mesh sending malformed
|
||
payloads. Daemon validates structure but trusts mesh members.
|
||
- Defending against compromised broker. Out-of-scope for daemon; mesh-level
|
||
E2E protects content but not metadata.
|
||
|
||
## 17. Migration — what changes for existing users
|
||
|
||
Same as v1. Additive. No DB migration on broker. Existing
|
||
`~/.claudemesh/config.json` consumed unchanged. `claudemesh launch` keeps
|
||
working; daemon is opt-in.
|
||
|
||
---
|
||
|
||
## What needs review (round 2)
|
||
|
||
Round 1 produced: identity model needs `--ephemeral` + clone-detect, IPC needs
|
||
local token, "exactly-once" was a lie, hooks needed scoped credentials, surface
|
||
needed shrinking, missing rotation/recovery/migration/threat-model.
|
||
|
||
This v2 attempts to address all of them. Specifically critique:
|
||
|
||
1. **Has the identity model fully closed the clone problem?** Refuses-on-fingerprint-mismatch
|
||
plus broker audit plus mesh-owner revoke — does this catch a sophisticated
|
||
attacker who copies `host_fingerprint.json` along with the keypair?
|
||
2. **Is the local-token model sufficient for browser-SSRF defense?**
|
||
Token + Origin + Host checks + 127.0.0.1-only. Anything else needed?
|
||
3. **The delivery contract** (§4) — is it now defensible? Does the inflight-recovery
|
||
semantics + idempotency-key propagation produce the guarantees claimed?
|
||
4. **Hook capability tokens** (§6.2) — short-lived, scoped, expire on hook exit.
|
||
Does this fully eliminate the exfil footgun? What capability scopes are
|
||
actually needed for v0.9.0 hooks?
|
||
5. **Frozen v0.9.0 surface** (§3.1) — is the cut right? Should `peer list` be
|
||
in core or capability-gated? Should `inbox/search` ship in v0.9.0?
|
||
6. **Threat model** (§16) — anything missing? Specifically thinking about CI
|
||
environments where the daemon's host is a fleet shared across many users'
|
||
builds.
|
||
7. **Lifecycle flows** (§14) — image clones, key rotation, host moves, disk
|
||
corruption, uninstall semantics. Anything still missing?
|
||
8. **Version compat** (§15) — is the negotiation handshake sufficient, or do
|
||
we need stronger guarantees (e.g. semver-strict, or a feature-bit
|
||
negotiation rather than version numbers)?
|
||
|
||
Score 1–5 each. Top 3 changes you'd insist on for v3, if any. If you think v2
|
||
is shippable, say so explicitly — over-engineering is a real risk.
|