claudemesh/.artifacts/shipped/2026-05-03-daemon-final-spec-v2.md

# `claudemesh daemon` — Final Spec v2

> **Round 2 after a critical first-pass review.** v1 of this spec was reviewed
> by another model and pushed back on identity model, no-auth IPC, "exactly-once"
> overclaim, hook credentials, surface bloat, and missing operational flows
> (rotation, image clones, schema migration, threat model). v2 incorporates all
> of those.

---

## 0. Intent — what this is, what it isn't

### 0.1 The product reality

claudemesh today is a **peer mesh runtime for Claude Code sessions**. Each
session runs `claudemesh launch`, opens a WebSocket to a managed broker, gets
ephemeral identity, sends/receives DMs and topic messages with other Claude Code
sessions, posts to shared state, deploys MCP servers / skills / files,
participates in tasks, schedules reminders. Everything is E2E encrypted with
crypto_box envelopes for DMs and per-topic symmetric keys for topics. The broker
is a routing/persistence layer; peers do the actual work.

The CLI is the canonical surface — every operation is a `claudemesh <verb>`.
The MCP server is a "tool-less push pipe" that surfaces inbound messages to
Claude Code as channel notifications. There is also a web dashboard, an `/v1/*`
REST API, and an existing apikey auth model for external integrations.

### 0.2 The gap

Anything that **isn't a Claude Code session** is a second-class citizen:

- A RunPod handler that wants to alert a peer when an OOM happens has only
  one option: curl an apikey-authed REST endpoint. One-way only. The handler
  is not a peer — it can't be DM'd back, can't be `@-mentioned`, can't be in
  `peer list`, can't claim a task assigned to it, can't host an MCP service or
  share a skill. It's a webhook spoke, not a participant.

- A Temporal worker that wants to track its own progress in shared mesh state,
  publish to a `#alerts` topic, and listen for "retry now" instructions has
  no good shape. Either it shells out to `claudemesh send` cold-path
  (a fresh WS handshake per message — ~1s latency, broker churn, no inbound
  path) or it speaks the WS protocol manually (significant code, no SDK).

- A long-running CI runner, an IoT box, a phone app, a future Python or Go
  service — none can be **first-class peers** without writing the same WS
  reconnect / queue / encryption / presence code that the existing CLI already
  has, plus an IPC surface so the host's apps can use it without re-implementing
  any of that.

### 0.3 What this daemon is

A long-running process — the same `claudemesh-cli` binary in `daemon` mode —
that turns any host into a **first-class peer**:

- Stable identity across restarts (the host *is* a member of the mesh, not a
  series of disconnected sessions).
- Persistent WS to the broker, with reconnect, queue, dedupe.
- Local IPC surface (UDS + loopback HTTP + SSE) that any local app can hit
  to send, subscribe, query — without learning the broker protocol or carrying
  long-lived secrets in app code.
- Hooks: shell scripts that fire on events, so a server can reply to DMs,
  auto-claim tasks, and escalate errors without the app being involved.
- Same security primitives as `claudemesh launch` (mesh keypair, crypto_box,
  per-topic keys). No new auth model toward the broker.

The daemon **is the runtime**. The CLI in cold-path mode is a fallback. The
Claude Code MCP integration is one client of the daemon (eventually).

### 0.4 What this daemon is NOT

- **Not a webhook gateway.** `/v1/notify` and apikeys remain the path for
  systems that can't host the runtime (third-party SaaS, monitoring tools).
  The daemon is for systems that *can* run a process — code you control.

- **Not a generic message broker.** It speaks claudemesh protocol to one
  managed broker. It is not a substitute for NATS, Redis, Kafka, RabbitMQ.

- **Not a Slack replacement.** Topics, DMs, mentions exist because *AI
  sessions* use them. Humans interact via the dashboard or a Claude Code
  session, not by reading the daemon's inbox directly.

- **Not a fleet manager.** One daemon manages one mesh on one host. Multi-mesh
  on one host is supported (one daemon per mesh, supervised). Cross-host
  supervision is an external concern (systemd, k8s, etc.) — the daemon doesn't
  reach across hosts.

### 0.5 Who deploys this

- A developer running `claudemesh daemon up` on their laptop so their open
  Claude Code sessions all share one persistent connection (instead of each
  opening its own ephemeral WS).
- The same developer running `claudemesh daemon install-service` on their VPS,
  RunPod pod, Temporal worker, CI runner — turning each into an
  addressable peer that scripts on that host can talk to via local IPC.
- Eventually: language SDKs (Python / Go / TypeScript) talking to the daemon
  on `localhost`, exposing claudemesh as a first-class API for any app the
  developer writes.

### 0.6 Pre-launch posture

No users yet. We can break protocol, schema, surface, anything. Optimize for
the architecture we want to live with for years, not for the smallest
shippable cut. Codex pushed back on v1 on this exact axis: do not ship
graph/vector/MCP/skills/tasks on day one — freeze a small, hardened core,
expand deliberately.

---

## 1. Process model

**One daemon per (user, mesh)**. Persistent. Survives reboots via OS
supervisor. Serves multiple local apps concurrently.

```
~/.claudemesh/daemon/<mesh-slug>/
  pid                       0600    pidfile, cleaned on shutdown
  sock                      0600    unix domain socket (primary IPC)
  http.port                 0644    auto-allocated loopback port
  local_token               0600    per-daemon bearer for HTTP/TCP transports
  keypair.json              0600    persistent ed25519 + x25519 — daemon identity
  host_fingerprint.json     0600    machine-id + boot-id + interface mac digest
  config.toml               0644    user-editable runtime tuning
  outbox.db                 0600    SQLite — durable outbound queue
  inbox.db                  0600    SQLite — N-day inbound history, FTS-indexed
  schema_version            0644    integer; gates online migrations
  daemon.log                0644    JSON-lines, rotating (100 MB / 14 d)
  hooks/                    0700    user-managed event scripts
```

**Resource caps (defaults, configurable):**

| Resource | Default | Why |
|---|---|---|
| RSS | 256 MB | Most workloads stay under 50 MB; cap protects multi-mesh hosts |
| CPU | unlimited | Hook fan-out can spike briefly; rely on OS scheduler |
| Outbox DB | 5 GB | At 1 KB avg msg, that's ~5M queued. Disk-full thresholds in §4.4 |
| Inbox DB | 5 GB | Same |
| File descriptors | 1024 | UDS clients + SSE streams + DB handles + WS |
| SSE concurrent | 32 streams | DoS protection; configurable up |
| IPC concurrent | 64 in-flight | Backpressure beyond this returns `429 daemon_busy` |
| Hook concurrency | 8 | Bounded pool; overflow queues |

Single binary. Same `claudemesh-cli` package; `daemon` is one of its modes.

## 2. Identity — persistent member by default, ephemeral on opt-in, clone-aware

### 2.1 Modes

```
claudemesh daemon up                          # default: persistent member
claudemesh daemon up --ephemeral              # session-shaped, no keypair persisted
claudemesh daemon up --ephemeral --ttl=2h     # auto-shutdown after TTL
```

- **Persistent (default)**: ed25519 + x25519 keypair stored in `keypair.json`.
  Same identity across restarts, reconnects, supervisor cycles. Right for
  servers, workers, addressable peers.
- **Ephemeral**: keypair generated in memory, never written. Daemon exits =
  identity gone. Right for CI jobs, preview environments, disposable RunPod
  pods, test harnesses, build agents, anything that should not leave a peer
  ghost in the broker after teardown.
- **`--ttl <duration>`** on ephemeral mode: auto-shutdown after the duration,
  or after `claudemesh daemon down`, whichever first. Broker member record
  cleaned up on shutdown.

### 2.2 Image-clone detection

Two daemons booting with the same `keypair.json` (VM image clone, container
copy, restored backup) is a serious failure mode — broker sees connection
collisions, presence flickers, encrypted messages route to the wrong host.

Handled in three places:

1. **Daemon side**: `host_fingerprint.json` is written on first startup —
   `sha256(machine-id || boot-id || mac-of-default-iface || hostname)`. On every
   subsequent startup, the fingerprint is recomputed and compared. If it
   differs, the daemon **refuses to start** unless `--accept-cloned-identity`
   is passed (writes a fresh fingerprint and continues with the same keypair —
   for legitimate hardware migrations) or `--remint` is passed (mints fresh
   keypair, registers as a new member, broker reaps the old member after
   grace period).
2. **Broker side**: tracks `lastSeenHostFingerprint` per member. On
   reconnection from a different fingerprint, broker emits a
   `member_clone_suspected` security event to the mesh owner's dashboard.
   Connection itself is allowed (legitimate hardware swaps happen) but visible
   for audit.
3. **Mesh owner**: `claudemesh member revoke <pubkey>` revokes the keypair
   server-side; daemon receives `keypair_revoked` push event on next
   connection and self-disables.
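
The daemon-side check in step 1 can be sketched as below. This is a minimal illustration, not the daemon's implementation: `host_fingerprint`, `startup_action`, the `StartupAction` values, and the NUL separator used when joining components are all assumptions. Only the idea of hashing host components with SHA-256 and the two override flags come from the spec.

```python
import hashlib
from enum import Enum
from typing import Optional


class StartupAction(Enum):
    CONTINUE = "continue"            # fingerprint matches (or first startup)
    REFUSE = "refuse"                # mismatch and no override flag given
    REWRITE_FINGERPRINT = "rewrite"  # --accept-cloned-identity
    REMINT_KEYPAIR = "remint"        # --remint


def host_fingerprint(*components: str) -> str:
    """SHA-256 over the host components; the NUL separator is an assumption."""
    joined = "\0".join(components)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def startup_action(stored: Optional[str], current: str,
                   accept_cloned: bool = False,
                   remint: bool = False) -> StartupAction:
    """Decide what to do at startup given the stored and recomputed digests."""
    if stored is None or stored == current:
        return StartupAction.CONTINUE
    if remint:
        return StartupAction.REMINT_KEYPAIR
    if accept_cloned:
        return StartupAction.REWRITE_FINGERPRINT
    return StartupAction.REFUSE
```

Giving `--remint` precedence over `--accept-cloned-identity` when both are passed is a choice made for this sketch; the spec does not say which wins.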

### 2.3 Rename

`--name` is taken at first `daemon up`; subsequent runs read the keypair file
and ignore `--name` unless `--rename` is passed (which produces a
`member_renamed` event the broker propagates to peers).

## 3. IPC surface — stable core only in v0.9.0

### 3.1 Frozen core surface (v0.9.0)

Codex's feedback: do not ship every CLI verb on day one. A small hardened core
first, expand under explicit capability gates.

```
# Messaging — durable, tested
POST   /v1/send              {to, message, priority?, meta?, replyToId?}
POST   /v1/topic/post        {topic, message, priority?, mentions?}
POST   /v1/topic/subscribe   {topic}                            (idempotent)
POST   /v1/topic/unsubscribe {topic}
GET    /v1/topic/list
GET    /v1/inbox             ?since=<iso>&topic=<n>&from=<peer>&limit=<n>
GET    /v1/inbox/search      ?q=<fts-query>&limit=<n>           (FTS5)

# Peers + presence — read-only on day one
GET    /v1/peers             ?mesh=<slug>
POST   /v1/profile           {summary?, status?, visible?}      (limited fields)

# Files — already production in CLI
POST   /v1/file/share        {path, to?, message?, persistent?}
GET    /v1/file/get          ?id=<fileId>&out=<path>
GET    /v1/file/list

# Events — push
GET    /v1/events            text/event-stream
       core events: message, peer_join, peer_leave, file_shared,
                    daemon_disconnect, daemon_reconnect, hook_executed

# Control plane
GET    /v1/health            {connected, lag_ms, queue_depth, inflight,
                              mesh, member_pubkey, uptime_s, schema_version,
                              daemon_version, broker_version}
GET    /v1/metrics           Prometheus exposition
GET    /v1/version           {daemon, schema, ipc_api}            (negotiation)
POST   /v1/heartbeat         {} (caller-side liveness signal)
```

That's it: 17 endpoints. Battle-test these before adding more.

### 3.2 Capability-gated future surface (v0.9.x roadmap)

Behind explicit feature flags in `config.toml`, post-v0.9.0:

```toml
[capabilities]
state = false        # /v1/state/{set,get,list}
memory = false       # /v1/memory/{remember,recall}
vector = false       # /v1/vector/{store,search,delete}
graph = false        # /v1/graph/query
tasks = false        # /v1/task/{create,claim,complete}
scheduling = false   # /v1/scheduling/remind
mcp_host = false     # /v1/mcp/{register,call} (LARGEST surface; treat as v1.0)
skill_share = false  # /v1/skill/{deploy,share}
```

Each capability is its own ship: design review, security review, test
coverage, capability-token model, then enable. None enabled in v0.9.0.

### 3.3 Local IPC authentication

Codex was right: loopback TCP without auth is an attack surface (browser SSRF,
container side-channels, sandboxed apps with network but no FS access, WSL
host-shared loopback).

| Transport | Auth | Rationale |
|---|---|---|
| UDS | None (relies on FS perms 0600) | Reaching the socket = same UID = can read keypair anyway |
| TCP loopback | **Required**: `Authorization: Bearer <local_token>` | Browser/container/sandbox can reach loopback without FS access |
| SSE | Required: `Authorization: Bearer <local_token>` | Same |

`local_token` is 32 bytes of `crypto.randomBytes` (256 bits), encoded base64url,
written to `local_token` mode 0600 at daemon init. Rotated on `claudemesh
daemon rotate-token`. SDKs auto-discover the token by reading the file (same
mechanism as discovering the socket path).
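
A sketch of token creation, written in Python for illustration (the daemon itself uses `crypto.randomBytes`). The function name `write_local_token`, the stripped base64 padding, and the write-then-rename step are assumptions; the spec only fixes the 32-byte size, base64url encoding, and mode 0600.

```python
import base64
import os
import secrets
import tempfile


def write_local_token(path: str) -> str:
    """Generate a 32-byte token, base64url-encode it, and store it mode 0600."""
    raw = secrets.token_bytes(32)
    # Padding is stripped here (assumption) so the token is header-friendly.
    token = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the file readable/writable by the owner only (0600).
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(token)
        # Rename into place so readers never see a half-written token.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return token
```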

**Additional defenses:**
- HTTP listener binds **127.0.0.1 only**. Refuses to bind elsewhere unless
  `[ipc] http_bind = "..."` is set explicitly **and** `[ipc] http_external_auth = "..."`
  points to a separate token file (escape hatch for advanced users; never the default).
- `Origin` header check: rejects requests with `Origin` set unless it's
  explicitly allowlisted in config (default: empty allowlist). Defends against
  browser SSRF.
- `Host` header check: must be `localhost` or `127.0.0.1` (optionally followed
  by the daemon's port). Defends against DNS rebinding.
- CORS: `Access-Control-Allow-Origin` never echoed; preflight returns `403`.
- `User-Agent` required (rejects empty UA — mild signal against simple SSRF).
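
The TCP-transport checks above could be combined into one gate, roughly as follows. This is a hedged sketch: `check_request`, the error strings, and the specific status codes are assumptions (the spec only says such requests are rejected), and header names are assumed to arrive lower-cased.

```python
import hmac
from typing import FrozenSet, Mapping, Optional, Tuple


def check_request(headers: Mapping[str, str], local_token: str, port: int,
                  origin_allowlist: FrozenSet[str] = frozenset()
                  ) -> Tuple[int, Optional[str]]:
    """Return (status, error) for a loopback HTTP request."""
    allowed_hosts = {"localhost", "127.0.0.1",
                     f"localhost:{port}", f"127.0.0.1:{port}"}
    if headers.get("host", "") not in allowed_hosts:
        return 403, "bad_host"            # DNS-rebinding defense
    origin = headers.get("origin")
    if origin is not None and origin not in origin_allowlist:
        return 403, "origin_not_allowed"  # browser-originated request
    if not headers.get("user-agent"):
        return 400, "user_agent_required"
    supplied = headers.get("authorization", "")
    expected = f"Bearer {local_token}"
    # Constant-time comparison avoids leaking token prefixes via timing.
    if not hmac.compare_digest(supplied.encode(), expected.encode()):
        return 401, "invalid_token"
    return 200, None
```

Running the cheap header checks before the token comparison means a browser-originated request is rejected even if it somehow carries a valid token.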

### 3.4 Request limits + backpressure

- Max request body: **1 MB** (override per endpoint; file uploads use a separate
  streaming endpoint).
- Max response body: **10 MB**; truncated with `Link: rel=next` cursor.
- Max in-flight IPC requests: **64**. Beyond → `429 daemon_busy`.
- Max SSE concurrent streams: **32**. Beyond → `429 too_many_streams`.
- Per-token rate limit: **100 req/sec** sustained, 1000/sec burst (token
  bucket). Tunable.
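
For reference, a token bucket with the defaults above might look like this. The class name, the injectable `clock`, and the reading of "burst" as a bucket capacity of 1000 are assumptions made for the sketch.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills at `rate` tokens/sec up to `burst`; each request costs one token."""

    def __init__(self, rate: float = 100.0, burst: float = 1000.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = burst          # start full so an initial burst is allowed
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Credit tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

One bucket per `local_token` (or per capability token) gives the per-token limit described above.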

## 4. Delivery contract — durable at-least-once with idempotent send

Codex was right: "exactly-once" is a lie. Replacing the claim with a precise
contract.

### 4.1 The contract

> **The daemon guarantees: each successful send call creates exactly one outbox
> row, identified by a stable `messageId`, which the daemon transmits to the
> broker until the broker ACKs it or the row expires. The daemon does not
> guarantee that downstream peers process the message exactly once;
> that is the receiver's responsibility, aided by the propagated
> `idempotency_key`.**

Concretely:

- **Caller → daemon**: caller may supply `Idempotency-Key`; daemon dedupes
  identical keys for 24h. Without one, daemon mints `ulid` and returns it as
  `messageId`.
- **Daemon → broker**: each outbox row has at-most-one inflight transmit.
  Daemon retries with exponential backoff until broker ACKs OR row hits TTL
  (7d default → moves to `dead`).
- **Broker → peer**: existing claudemesh delivery semantics. Broker dedupes by
  `messageId`. Peer receives ≥1 copy.
- **Peer hooks**: hooks see `idempotency_key` in the event JSON. Idempotent
  hook implementations are the receiver's responsibility.
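
The caller-to-daemon half of the contract can be illustrated with SQLite. This is a sketch under assumptions: the `outbox` columns shown, the `enqueue` signature, and the use of a UUID as a stand-in for a ULID (not in the Python standard library) are all hypothetical.

```python
import sqlite3
import time
import uuid
from typing import Optional

DEDUPE_WINDOW_S = 24 * 3600  # 24h dedupe window from the contract above


def open_outbox(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS outbox (
        message_id TEXT PRIMARY KEY,
        idempotency_key TEXT UNIQUE,
        payload TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending',
        created_at REAL NOT NULL)""")
    return db


def enqueue(db: sqlite3.Connection, payload: str,
            idempotency_key: Optional[str] = None,
            now: Optional[float] = None) -> str:
    """Insert one outbox row, or return the existing messageId for a repeat key."""
    now = time.time() if now is None else now
    with db:  # lookup and insert commit together
        if idempotency_key is not None:
            row = db.execute(
                "SELECT message_id, created_at FROM outbox "
                "WHERE idempotency_key = ?", (idempotency_key,)).fetchone()
            if row is not None:
                if now - row[1] < DEDUPE_WINDOW_S:
                    return row[0]  # duplicate inside the window: no new row
                # Key is older than the window: release it for reuse.
                db.execute("UPDATE outbox SET idempotency_key = NULL "
                           "WHERE message_id = ?", (row[0],))
        message_id = uuid.uuid4().hex  # stand-in for a ULID
        db.execute(
            "INSERT INTO outbox (message_id, idempotency_key, payload, created_at) "
            "VALUES (?, ?, ?, ?)", (message_id, idempotency_key, payload, now))
    return message_id
```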

### 4.2 Outbox row state machine

```
                ┌────────────┐
   send call →  │  pending   │
                └─────┬──────┘
                      │ daemon picks up batch
                      ▼
                ┌────────────┐
                │  inflight  │  ← attempts++, last_error written
                └─┬────┬─────┘
                  │    │ broker NACK / network err
       broker ACK │    └──────────► back to pending (with exp. backoff)
                  ▼
                ┌────────────┐
                │    done    │  ← delivered_at set, broker_message_id stored
                └────────────┘

   age > max_age_hours:
                ┌────────────┐
                │    dead    │  ← surfaces in `daemon outbox --failed`
                └────────────┘
```
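
The transitions in the diagram can be written down as two small functions. This is an illustrative sketch: the `OutboxRow` fields mirror names used in this section, while `MIN_BACKOFF_S`, `MAX_BACKOFF_S`, and the doubling schedule are assumptions (the spec says only "exponential backoff").

```python
from dataclasses import dataclass

MIN_BACKOFF_S = 1.0          # assumption: first retry delay
MAX_BACKOFF_S = 300.0        # assumption: backoff ceiling
MAX_AGE_S = 7 * 24 * 3600    # 7d default TTL from §4.1


@dataclass
class OutboxRow:
    created_at: float
    status: str = "pending"
    attempts: int = 0
    next_attempt_at: float = 0.0
    last_error: str = ""


def backoff(attempts: int) -> float:
    """Delay doubles with each attempt, capped at MAX_BACKOFF_S."""
    return min(MAX_BACKOFF_S, MIN_BACKOFF_S * 2 ** max(0, attempts - 1))


def pick_up(row: OutboxRow, now: float) -> None:
    """pending -> inflight, or pending -> dead once the row exceeds its TTL."""
    if row.status != "pending" or now < row.next_attempt_at:
        return
    if now - row.created_at > MAX_AGE_S:
        row.status = "dead"
        return
    row.status = "inflight"
    row.attempts += 1


def on_result(row: OutboxRow, now: float, acked: bool, error: str = "") -> None:
    """inflight -> done on broker ACK, otherwise back to pending with backoff."""
    if row.status != "inflight":
        return
    if acked:
        row.status = "done"
        return
    row.status = "pending"
    row.last_error = error
    row.next_attempt_at = now + backoff(row.attempts)
```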

### 4.3 Crash recovery

On daemon startup:

1. Any rows in `inflight` are reset to `pending` with `attempts++` and
   `next_attempt_at = now + min_backoff`. Note: this MAY cause double-delivery
   of a message that the broker had already ACK'd if the ACK was not persisted
   locally before the crash. The `idempotency_key` propagates to the broker
   (via message `meta`) so the broker dedupes by key.
2. `outbox.db` integrity check (`PRAGMA integrity_check`); if fails, daemon
   refuses to start, points user at `claudemesh daemon recover`.
3. `inbox.db` integrity check; on failure, drops to `inbox.db.corrupt-<ts>`,
   creates fresh empty inbox, logs `inbox_corruption_recovered` (does not
   block startup — inbox is a cache).

### 4.4 Disk-full

- At 80% of `outbox.max_queue_size` or 80% of `[disk] reserved_bytes`: daemon
  emits `outbox_pressure_high` event + Prometheus gauge. Sends still accept.
- At 95%: new sends return `507 insufficient_storage`. Existing inflight
  drains.
- At 100%: daemon enters degraded mode — refuses sends, refuses new SSE
  streams, holds open WS for inbound only. `daemon status` shows degraded.
- Recovery: drain via broker reconnect (drains `done` rows older than
  retention window) or `claudemesh daemon outbox prune --confirm`.
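
The thresholds above reduce to a small classification function. The names `DiskMode`, `disk_mode`, and `accepts_sends` are hypothetical; the percentages are the ones listed in this section.

```python
from enum import Enum


class DiskMode(Enum):
    NORMAL = "normal"
    PRESSURE = "pressure"    # >= 80%: emit outbox_pressure_high, still accept
    REJECTING = "rejecting"  # >= 95%: new sends get 507 insufficient_storage
    DEGRADED = "degraded"    # >= 100%: no sends, no new SSE, inbound-only WS


def disk_mode(used_bytes: int, limit_bytes: int) -> DiskMode:
    """Map current usage against a limit onto the modes described above."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    # Integer arithmetic avoids float rounding right at a threshold.
    if used_bytes >= limit_bytes:
        return DiskMode.DEGRADED
    if used_bytes * 100 >= limit_bytes * 95:
        return DiskMode.REJECTING
    if used_bytes * 100 >= limit_bytes * 80:
        return DiskMode.PRESSURE
    return DiskMode.NORMAL


def accepts_sends(mode: DiskMode) -> bool:
    return mode in (DiskMode.NORMAL, DiskMode.PRESSURE)
```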

### 4.5 Schema migration

`schema_version` file holds an integer. On startup:
1. If `schema_version` matches binary's expected version → continue.
2. If version is older → run `apps/cli/src/daemon/migrations/<from>-<to>.sql`
   in a transaction, write new version on success.
3. If version is newer (downgrade) → daemon refuses to start, error points at
   re-installing matching version.

Migrations are forward-only. Each migration is ≤ 1 transaction. Test coverage
required: every migration has a snapshot test from prior schema.
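
A forward-only runner matching the three startup cases might look like this. It is a sketch, not the shipped migrator: the `migrate` signature, the in-memory `migrations` mapping (standing in for the `<from>-<to>.sql` files), and the assumption that each step advances the version by exactly one are all hypothetical.

```python
import sqlite3
from typing import Dict, Tuple


def migrate(db: sqlite3.Connection, current: int, target: int,
            migrations: Dict[Tuple[int, int], str]) -> int:
    """Apply forward migrations one step at a time; return the new version."""
    if current > target:
        # On-disk schema is newer than this binary: refuse to start.
        raise RuntimeError(
            f"schema {current} is newer than supported {target}; "
            "install a matching daemon version")
    while current < target:
        sql = migrations[(current, current + 1)]
        try:
            # Each step runs inside exactly one transaction.
            db.executescript("BEGIN;\n" + sql + "\nCOMMIT;")
        except sqlite3.Error:
            if db.in_transaction:
                db.execute("ROLLBACK")
            raise
        current += 1  # caller persists this to the schema_version file
    return current
```

Because DDL is transactional in SQLite, a step that fails halfway leaves the schema exactly as it was before that step.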

## 5. Inbound — durable history with FTS

Every inbound message is written to `inbox.db` before any hook fires:

```sql
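-- Row store: an ordinary table, so it can carry the UNIQUE constraint that
-- receiver-side dedupe needs and a normal index on received_at.
CREATE TABLE inbox (
  id INTEGER PRIMARY KEY, message_id TEXT, mesh TEXT, topic TEXT,
  sender_pubkey TEXT, sender_name TEXT, body TEXT, meta TEXT,
  idempotency_key TEXT UNIQUE, received_at TEXT, replied_to_id TEXT
);
CREATE INDEX inbox_received_at ON inbox(received_at);
-- Search index: FTS5 tables accept neither CREATE INDEX nor UNIQUE, so only
-- the searchable columns live here, keyed to inbox.id (sync triggers omitted).
CREATE VIRTUAL TABLE inbox_fts USING fts5(
  topic, sender_name, body, meta, content='inbox', content_rowid='id'
);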
```

- **Receiver-side dedupe**: on insert, `INSERT OR IGNORE` on `idempotency_key`.
  Duplicate broker delivery becomes a no-op locally + `cm_daemon_dedupe_total`
  counter increments.
- 30-day rolling retention (configurable). `VACUUM` weekly during low-traffic
  window.
- `claudemesh daemon search "OOM"` queries the FTS index.
- Apps connecting mid-stream replay history via `?since=<iso>`.

## 6. Hooks — first-class but tightly bounded

Codex was right: hooks were underspecified, and putting `CLAUDEMESH_TOKEN` in
every hook env was a serious exfil footgun.

### 6.1 Hook directory & contract

```
hooks/
  on-message.sh         every inbound message (DM + topic)
  on-dm.sh              DMs only
  on-mention.sh         when @<my-name> appears anywhere
  on-topic-<name>.sh    a specific topic
  on-file-share.sh      file shared with me
  on-disconnect.sh      WS dropped
  on-reconnect.sh       reconnected
  on-startup.sh         daemon up
  pre-send.sh           filter / mutate outbound (last gate)
  hooks.toml            per-hook policy (auth, redaction, env, timeout)
```

`hooks.toml` (mandatory; daemon refuses to invoke hooks without it):

```toml
[on-mention]
enabled = true
timeout_s = 30
output_size_limit = 65536
redact_payload = ["body.password", "meta.api_key"]   # JSONPath
allow_reply = true                                    # if false, stdout reply ignored
capability_token_scope = ["topic:alerts:post"]        # scoped, NOT broker session token
network_policy = "deny"                               # 'deny' | 'allow' | 'allowlist'
network_allowlist = []                                # only if policy = 'allowlist'
fs_policy = "readonly"                                # 'readonly' | 'rw' | 'sandbox'
killpg_on_timeout = true                              # SIGTERM process group, not just child
audit = true                                          # log every invocation
```

### 6.2 Credentials passed to hooks

**Default: nothing.** No `CLAUDEMESH_TOKEN`, no broker session, nothing that
lets the hook impersonate the daemon's identity broadly.

**Opt-in per hook**: `capability_token_scope = ["topic:alerts:post"]` mints a
**short-lived (5 min) capability token** scoped to exactly that capability.
The hook can use it to call back into the daemon's IPC ("post a reply to
#alerts") but cannot use it to read state, read inbox, deploy MCP, etc. Token
expires when hook process exits OR after 5 min, whichever first.

Capability tokens are local-only — they authorize against the daemon's IPC
surface, never the broker directly. Daemon translates capability calls into
broker calls.
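
A minimal in-memory model of the capability-token lifecycle, for illustration only. The class and method names are hypothetical, and `"inbox:read"` in the usage note is a made-up scope string; the 5-minute lifetime, the exact-scope match, and revocation on hook exit are from this section.

```python
import secrets
import time
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

TOKEN_TTL_S = 300.0  # 5 minutes


class CapabilityTokens:
    """Local-only tokens: checked by the daemon's IPC layer, never sent to the broker."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._tokens: Dict[str, Tuple[FrozenSet[str], float]] = {}

    def mint(self, scopes: Iterable[str]) -> str:
        token = secrets.token_urlsafe(32)
        self._tokens[token] = (frozenset(scopes), self._clock() + TOKEN_TTL_S)
        return token

    def revoke(self, token: str) -> None:
        """Called when the hook process exits."""
        self._tokens.pop(token, None)

    def allows(self, token: str, scope: str) -> bool:
        entry = self._tokens.get(token)
        if entry is None:
            return False
        scopes, expires_at = entry
        if self._clock() >= expires_at:
            del self._tokens[token]  # expired: forget it
            return False
        return scope in scopes
```

A hook minted with `["topic:alerts:post"]` passes `allows(token, "topic:alerts:post")` but fails for any other scope, such as a hypothetical `"inbox:read"`.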

Env variables the hook DOES get:
- `CLAUDEMESH_MESH=<slug>`
- `CLAUDEMESH_HOOK_NAME=on-mention`
- `CLAUDEMESH_EVENT_ID=<ulid>`
- `CLAUDEMESH_CAPABILITY_TOKEN=<token>` (only if scope was configured; else absent)
- `CLAUDEMESH_DAEMON_SOCK=<path>` (so SDKs can connect for capability calls)
- `PATH=/usr/bin:/bin` (locked down)

### 6.3 Payload redaction

Hook stdin receives event JSON minus paths listed in `redact_payload`. Default
redaction: nothing. Mesh owner / daemon admin opts in.
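
Redaction of dotted paths such as `body.password` can be sketched as follows. The function name and the choice to silently skip paths that do not exist in the event are assumptions.

```python
import copy
from typing import Any, Dict, Iterable


def redact(event: Dict[str, Any], paths: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of `event` with each dotted path removed, if present."""
    out = copy.deepcopy(event)  # never mutate the stored inbox record
    for path in paths:
        *parents, leaf = path.split(".")
        node: Any = out
        for key in parents:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict):
            node.pop(leaf, None)
    return out
```

With `redact_payload = ["body.password", "meta.api_key"]`, an event `{"body": {"text": "hi", "password": "p"}, "meta": {"api_key": "k"}}` reaches the hook as `{"body": {"text": "hi"}, "meta": {}}`.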

### 6.4 Timeout & cleanup

- Per-hook `timeout_s` (default 30s). On timeout, daemon sends SIGTERM to the
  hook's process group (`killpg_on_timeout=true`), waits 5s, then SIGKILL.
  Signalling the group also terminates forked grandchildren that would
  otherwise outlive the hook.
- Hook stdout/stderr captured, truncated at `output_size_limit`. Larger
  outputs log a warning and discard the overflow.
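
The timeout sequence can be sketched for POSIX hosts like this. `run_hook` and its return shape are hypothetical; the SIGTERM-to-group, 5s grace, SIGKILL order and the output truncation follow the bullets above.

```python
import os
import signal
import subprocess
import sys
from typing import List, Tuple


def _signal_group(pgid: int, sig: int) -> None:
    try:
        os.killpg(pgid, sig)
    except ProcessLookupError:
        pass  # group already gone


def run_hook(argv: List[str], event_json: bytes, timeout_s: float = 30.0,
             grace_s: float = 5.0, output_limit: int = 65536
             ) -> Tuple[int, bytes, bool]:
    """Run a hook; return (exit_code, truncated_stdout, timed_out)."""
    proc = subprocess.Popen(
        argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        start_new_session=True)  # new session => hook leads its own process group
    timed_out = False
    try:
        out, _err = proc.communicate(event_json, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        timed_out = True
        _signal_group(proc.pid, signal.SIGTERM)   # whole group, not just child
        try:
            out, _err = proc.communicate(timeout=grace_s)
        except subprocess.TimeoutExpired:
            _signal_group(proc.pid, signal.SIGKILL)
            out, _err = proc.communicate()
    return proc.returncode, out[:output_limit], timed_out
```

Because the hook starts in its own session, its PID doubles as the process-group ID, which is what makes `os.killpg(proc.pid, ...)` reach any children it forked.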

### 6.5 Audit log

Every hook invocation logs:
```json
{"hook":"on-mention","event_id":"01H8…","exit":0,"duration_ms":47,
 "stdout_bytes":120,"stderr_bytes":0,"replied":true,"capability_calls":1,
 "ts":"2026-05-03T14:00:00Z"}
```

Stored in `daemon.log`; metrics exposed via `cm_daemon_hook_*`.

### 6.6 Sandboxing — supported, not required

The contract supports sandboxing without mandating it (mandating breaks too
many real workflows):

- Linux: opt-in `sandbox = "bubblewrap"` in `hooks.toml` runs the hook under
  `bwrap` with no network (unless `network_policy != "deny"`), readonly FS
  except `/tmp/<hook-id>`, no DBus, no /proc.
- macOS: opt-in `sandbox = "sandbox-exec"` with similar profile.
- Default: no sandbox; rely on Unix permissions + `network_policy=deny` (which
  is enforced via `unshare --net` on Linux when available, otherwise
  best-effort firewall rule).

## 7. Multi-mesh — daemon-per-mesh, supervised by a thin shell

### 7.1 The decision

One daemon per mesh, coordinated by a supervisor script. Codex pushed back —
"why not one daemon serving all meshes?". Going daemon-per-mesh because:

- **Crash isolation**: a panic in `prod` mesh's WS reader can't corrupt
  `dev` mesh's outbox.
- **Resource accounting**: per-mesh RSS, per-mesh metrics, per-mesh disk
  budget — easy to attribute, easy to cap.
- **Independent identity**: each mesh has its own keypair, host fingerprint,
  capability gates. Conflating into one process forces shared trust.
- **Independent upgrades**: rolling daemon restarts per mesh, no downtime
  across all meshes.
- **Simpler code**: zero cross-mesh routing logic in the daemon body.

The cost (process count, log fan-out) is real but bounded: typical user has
1–3 meshes. Heavy users (10–20) get a `claudemesh daemon ps` + `--all` UX that
treats them as a fleet.

### 7.2 Resource caps for fleet hosts

`config.toml` has `[fleet]` section read by `daemon up --all`:

```toml
[fleet]
max_daemons = 10
total_memory_budget = "2GB"     # divided across daemons; each gets budget/N RSS cap
total_disk_budget = "20GB"      # divided across outbox + inbox per daemon
```

If a user hits `max_daemons`, `daemon up <next>` errors with a clear message
pointing at the cap.

### 7.3 Commands

```
claudemesh daemon up        --mesh <slug>     # one mesh
claudemesh daemon up --all                    # all joined meshes (respects fleet caps)
claudemesh daemon down      --mesh <slug>
claudemesh daemon down --all
claudemesh daemon status                      # all daemons, table view
claudemesh daemon status --json               # machine-readable
claudemesh daemon ps                          # alias of status
claudemesh daemon logs --mesh <slug> [-f]
claudemesh daemon restart --mesh <slug>
```

## 8. Auto-routing — clarified, not transparent

Codex pushed back: "no behavior difference" was hand-waving. Persistent
identity, queueing, hooks, profile state — these legitimately change behavior.

### 8.1 What changes when a daemon is up

| Behavior | Cold-path CLI | Daemon-routed CLI |
|---|---|---|
| Sender attribution | Ephemeral session pubkey for that invocation | Daemon's persistent member pubkey |
| Latency | ~1s (fresh WS handshake) | <10ms (local UDS round-trip) |
| Send durability | None — if broker is unreachable, command fails | Outbox queue retries until TTL |
| Inbound visibility | Not available (cold path closes WS) | `claudemesh inbox` reads daemon's inbox.db |
| Hooks | Not invoked | Invoked on every event |
| Presence | Brief flicker as session connects+disconnects | Continuous; daemon's status reflected |
| `peer list` shows me as | A new ephemeral session each invocation | The daemon's persistent member |

### 8.2 Detection logic — connect, don't trust pidfile

```
1. Check ~/.claudemesh/daemon/<slug>/sock exists.
2. Attempt UDS connect with 100ms timeout.
3. If connect succeeds: send GET /v1/version.
4. If response is well-formed AND mesh matches AND daemon_version is
   compatible → use this daemon.
5. Otherwise → cold path.
```

PID liveness check is unreliable (PID reuse, process orphaned). Socket
handshake is canonical.
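
The detection steps can be sketched with the standard library. Several details are assumptions: the helper names, that the `/v1/version` response carries a `mesh` field, and that "compatible" means `ipc_api == "v1"`. Any failure (missing socket, timeout, malformed reply) falls through to the cold path by returning `None`.

```python
import http.client
import json
import socket
from typing import Optional


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that dials a unix domain socket instead of TCP."""

    def __init__(self, sock_path: str, timeout: float) -> None:
        super().__init__("localhost", timeout=timeout)
        self._sock_path = sock_path

    def connect(self) -> None:
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.settimeout(self.timeout)
        self.sock.connect(self._sock_path)


def probe_daemon(sock_path: str, mesh: str,
                 timeout: float = 0.1) -> Optional[dict]:
    """Return the daemon's version info if it is usable, else None (cold path)."""
    conn = UnixHTTPConnection(sock_path, timeout)
    try:
        conn.request("GET", "/v1/version")
        resp = conn.getresponse()
        if resp.status != 200:
            return None
        info = json.loads(resp.read())
    except (OSError, http.client.HTTPException, ValueError):
        return None  # no socket, refused, timed out, or malformed response
    finally:
        conn.close()
    if not isinstance(info, dict) or info.get("mesh") != mesh:
        return None
    if info.get("ipc_api") != "v1":
        return None
    return info
```

Note that the pidfile is never consulted: a stale socket left by a dead daemon simply fails to connect.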

### 8.3 Coexistence with `claudemesh launch`

Both can be running for the same mesh:
- Daemon connected as persistent member `runpod-worker-3`.
- A separate `claudemesh launch` connects as ephemeral session of the same
  member. Visible to peers as "another session of runpod-worker-3"
  (sibling-session relationship via `memberPubkey`).
- CLI verbs from inside `claudemesh launch` route through the launch session,
  NOT the daemon (preserves "this Claude Code session has its own ephemeral
  identity" semantics).
- CLI verbs from a separate shell route through the daemon (faster, durable).

This is consistent with the v0.5.1 self-DM guard and sibling-session
semantics already shipped.

## 9. Service installation

```bash
claudemesh daemon install-service                 # writes systemd unit / launchd plist / Windows SC
claudemesh daemon uninstall-service
claudemesh daemon install-service --user          # user-scope unit (default; no root)
claudemesh daemon install-service --system        # system-scope unit (root; multi-user host)
```

Unit defaults:
- `Restart=on-failure`, `RestartSec=5s`, `StartLimitBurst=5`, `StartLimitIntervalSec=5min`
- `MemoryMax=<resource cap>`, `TasksMax=128`, `LimitNOFILE=4096`
- `StandardOutput/Error=journal`
- `NoNewPrivileges=yes`, `PrivateTmp=yes`, `ProtectSystem=strict`,
  `ProtectHome=read-only` with `ReadWritePaths=%h/.claudemesh`
- For systemd `--user`, runs as the invoking user (no root needed).

`claudemesh install` (the existing setup verb) gains an opt-in prompt:
*"Install as a background service that always runs?"* Defaults differently
based on detected environment (TTY vs no-TTY, presence of systemd, etc.).

## 10. Observability

Standard CLI surface unchanged from v1, with the new gauges/counters:

```
cm_daemon_connected{mesh}                  0/1
cm_daemon_reconnects_total{mesh,reason}
cm_daemon_lag_ms{mesh}                     last broker round-trip
cm_daemon_outbox_depth{mesh,status}        pending|inflight|dead
cm_daemon_outbox_age_seconds{mesh}         oldest pending row
cm_daemon_dedupe_total{mesh,direction}     out|in
cm_daemon_disk_pct{mesh,kind}              outbox|inbox
cm_daemon_send_total{mesh,kind,status}
cm_daemon_recv_total{mesh,kind,from_type}
cm_daemon_hook_invocations_total{hook,exit}
cm_daemon_hook_duration_seconds{hook}      histogram
cm_daemon_hook_capability_calls_total{hook,scope}
cm_daemon_ipc_request_total{endpoint,status,transport}
cm_daemon_ipc_duration_seconds{endpoint}   histogram
cm_daemon_local_token_rotations_total
cm_daemon_clone_suspected_total
```

Tracing: optional OpenTelemetry export.

## 11. SDKs — three, slim, core-API only

Same shape as v1 but only target the **frozen core surface** (§3.1). State /
memory / vector / graph / tasks / MCP / skills are NOT in v0.9.0 SDKs — they
ship per capability gate.

Each SDK auto-discovers the daemon: reads `sock` path, `http.port`,
`local_token`. SDKs versioned in lockstep with the daemon's `/v1` surface.

## 12. Security model — explicit boundaries

| Boundary | Trust | Mechanism |
|---|---|---|
| App ↔ Daemon (UDS) | OS user, FS perms | UDS 0600 |
| App ↔ Daemon (TCP/SSE) | OS user + bearer token | 127.0.0.1 only + `local_token` + Origin/Host check |
| Hook ↔ Daemon | Capability scope | Short-lived capability token, never broker session |
| Daemon ↔ Broker | Mesh keypair | WSS + ed25519 hello + crypto_box DM + per-topic keys |
| Daemon ↔ Disk | OS user | All daemon files mode 0600/0644 under `~/.claudemesh/daemon/` |
| Cloned identity | Host fingerprint check | Daemon refuses to start; dashboard audit event |

## 13. Configuration

`config.toml` — same shape as v1 plus:
- `[capabilities]` (§3.2)
- `[fleet]` (§7.2)
- `[disk] reserved_bytes` (§4.4)
- `[clone] policy = "refuse" | "warn" | "allow"` (§2.2)

User-editable. `claudemesh daemon reload` re-reads it without dropping the WS.

## 14. Lifecycle — the operational flows v1 was missing

### 14.1 Key rotation

```
claudemesh daemon rotate-keypair
```

Mints fresh ed25519 + x25519. Registers the new pubkey with the broker as a
`member_keypair_rotated` operation (broker associates the new pubkey with the
same member id). The old pubkey stays valid server-side for a 24h grace period
so that in-flight messages encrypted to it can still be delivered and
decrypted, and is then revoked.

### 14.2 Local token rotation

```
claudemesh daemon rotate-token
```

Atomically writes a new `local_token` and keeps accepting the old one alongside
the new one for a 60s grace period. SDKs that already hold the old token finish
in-flight requests; new requests use the new token. After 60s, the old token is
rejected.

### 14.3 Compromised host revocation

From the dashboard or another mesh-owner session:

```
claudemesh member revoke <pubkey>
```

Broker marks member as revoked. Connected daemon receives `member_revoked`
push, self-disables (refuses new IPC, closes WS), exits with non-zero status,
logs forensic event.

### 14.4 Image-clone lifecycle

Covered in §2.2. Three policies (`refuse`, `warn`, `allow` — settable per-host
via `config.toml`).

### 14.5 Backup & restore

```
claudemesh daemon backup --out <path>          # dumps keypair, config, schema_version
claudemesh daemon restore --in <path>          # writes them; refuses if a daemon is running
```

Backup is encrypted with a passphrase (Argon2id KDF + crypto_secretbox). The
intent: "I'm reformatting my laptop, I want my mesh memberships back without
re-joining." NOT for "deploy this same identity on 10 servers" (that's the
clone problem above).

### 14.6 Uninstall / reset

```
claudemesh daemon uninstall                  # full purge: stops, deregisters from broker, wipes ~/.claudemesh/daemon/<slug>
claudemesh daemon reset                      # wipes local state, keeps broker member registration (for restoring)
```

Uninstall calls broker's `POST /v1/me/members/:pubkey/leave` so member doesn't
linger as ghost. Reset is local-only, no broker contact.

### 14.7 Disk corruption recovery

```
claudemesh daemon recover                    # interactive: integrity check + offer rebuild paths
```

Detects corrupt `outbox.db` / `inbox.db`. Options:
- Restore from local journal-only inbox (read-only mode; sends disabled).
- Wipe + rebuild from broker (fetches last N days of message history if
  available; topics need re-subscribe; outbox is irrecoverable, queued sends are
  lost).
- Wipe + start fresh.

## 15. Version compatibility

### 15.1 Negotiation handshake

On daemon connect to broker AND on every IPC request:

```
GET /v1/version
{
  "daemon_version": "0.9.0",
  "ipc_api": "v1",
  "ipc_minor": 3,                  # additive minor
  "schema_version": 7,
  "broker_protocol_min": "0.7",
  "broker_protocol_max": "0.9"
}
```

### 15.2 Compat policy

| Across | Policy |
|---|---|
| Daemon ↔ Broker | Daemon refuses to connect if broker version < daemon's `broker_protocol_min`. Broker logs warning. Pre-1.0 we may break this with notice; post-1.0 we maintain backward compat for ≥6 months. |
| CLI ↔ Daemon | CLI checks daemon's `ipc_api`. Same major = OK. Different major = CLI falls back to cold-path with warning. |
| SDK ↔ Daemon | SDK negotiates `ipc_minor`; uses minimum of (SDK's, daemon's). |
| Daemon binary ↔ schema | Binary refuses to start on unknown schema; migrations run forward-only; no automatic downgrade. |
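
The first three rows of the table can be read as three small checks. The sketch below is one interpretation, not the shipped logic: function names are hypothetical, versions are assumed to be plain dotted integers (no pre-release suffixes), and `ipc_api` is treated as the major identifier, as in the handshake payload.

```python
def parse_version(text):
    """Turn '0.9.0' or '0.7' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def daemon_may_connect(broker_version, broker_protocol_min):
    """Daemon <-> Broker: refuse if the broker is older than the minimum."""
    return parse_version(broker_version) >= parse_version(broker_protocol_min)


def cli_mode(cli_ipc_api, daemon_ipc_api):
    """CLI <-> Daemon: same major talks to the daemon, otherwise cold-path."""
    return "daemon" if cli_ipc_api == daemon_ipc_api else "cold-path"


def negotiated_minor(sdk_minor, daemon_minor):
    """SDK <-> Daemon: both sides settle on the lower ipc_minor."""
    return min(sdk_minor, daemon_minor)
```

For example, an SDK built against `ipc_minor` 5 talking to the daemon from §15.1 (`ipc_minor` 3) would restrict itself to minor-3 features.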

### 15.3 Compatibility matrix (published in docs, machine-readable JSON at /v1/compat)

```json
{
  "daemon": "0.9.0",
  "compatible_brokers": ["0.7.x", "0.8.x", "0.9.x"],
  "compatible_clis": ["0.9.x"],
  "compatible_sdks": {
    "python": ">=0.9.0,<1.0.0",
    "go":     ">=0.9.0,<1.0.0",
    "ts":     ">=0.9.0,<1.0.0"
  }
}
```
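
A consumer of this JSON needs to interpret two notations: `x` wildcards for brokers and CLIs, and comma-separated ranges for SDKs. The sketch below shows one way to do that with the standard library only; the helper names are hypothetical, and versions are assumed to be dotted integers padded to three components.

```python
import operator

MATRIX = {
    "daemon": "0.9.0",
    "compatible_brokers": ["0.7.x", "0.8.x", "0.9.x"],
    "compatible_clis": ["0.9.x"],
    "compatible_sdks": {
        "python": ">=0.9.0,<1.0.0",
        "go": ">=0.9.0,<1.0.0",
        "ts": ">=0.9.0,<1.0.0",
    },
}

# Two-character operators come first so ">=" is not mistaken for ">".
_OPS = [(">=", operator.ge), ("<=", operator.le), ("==", operator.eq),
        (">", operator.gt), ("<", operator.lt)]


def _ver(text):
    parts = [int(p) for p in text.strip().split(".")]
    return tuple(parts + [0] * (3 - len(parts)))  # pad '1.0' to (1, 0, 0)


def matches_pattern(version, pattern):
    """Match '0.8.3' against an x-wildcard pattern such as '0.8.x'."""
    v_parts, p_parts = version.split("."), pattern.split(".")
    if len(v_parts) != len(p_parts):
        return False
    return all(p == "x" or p == v for v, p in zip(v_parts, p_parts))


def satisfies_range(version, spec):
    """Check a comma-separated range such as '>=0.9.0,<1.0.0'."""
    for clause in spec.split(","):
        clause = clause.strip()
        for symbol, compare in _OPS:
            if clause.startswith(symbol):
                if not compare(_ver(version), _ver(clause[len(symbol):])):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


def broker_ok(version, matrix):
    return any(matches_pattern(version, p) for p in matrix["compatible_brokers"])


def sdk_ok(language, version, matrix):
    return satisfies_range(version, matrix["compatible_sdks"][language])
```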

## 16. Threat model

### 16.1 Attacker classes

| Attacker | Has | Wants | Mitigations |
|---|---|---|---|
| Local same-user shell | OS user creds | Send / read mesh messages | None needed — they already have FS access to keypair; daemon is no worse |
| Local different-user shell | Different OS user | Read this user's daemon | UDS 0600 + TCP loopback + token. Requires OS exploit to escalate |
| Browser SSRF | Loopback HTTP | Send messages, read inbox | `local_token` + Origin/Host check + non-default port. SSRF without token cannot succeed |
| Container side-channel | Same loopback namespace | Read another container's daemon | Containers share host loopback only if explicitly net=host. `local_token` defends. Recommended: bind UDS only inside containers |
| Compromised hook | Capability token in env | Use that scope | Capability tokens are scoped + short-lived; cannot escalate |
| Compromised broker | Full mesh visibility on its side | Deliver malicious messages, identity-impersonate | E2E encryption (crypto_box DMs, per-topic keys) — broker can't read content. Out-of-scope for daemon |
| Cloned VM image | Same keypair on two hosts | Identity collision | Host fingerprint detection + dashboard audit + `--remint` flow |
| Stolen laptop | Disk access | Mesh impersonation forever | `member revoke` from dashboard. Without disk encryption, keypair protection falls to the user's own laptop security; documented in security guide |
| Untrusted hook author | Hook script content | Exfil mesh data | Hook is on disk YOU control. If you ran `git pull` on a malicious hooks/ repo, that's a code-supply-chain attack out of scope for the daemon |
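
To illustrate how the Browser SSRF mitigations might compose on the TCP loopback listener, here is a rough sketch. Several details are assumptions not fixed by this section: the token travels as `Authorization: Bearer <local_token>`, any request carrying an `Origin` header is rejected outright (local SDKs and the CLI are assumed not to send one), and `is_request_allowed` is a hypothetical name.

```python
import hmac


def is_request_allowed(headers, expected_token, port):
    """Apply Host, Origin, and token checks to one loopback HTTP request."""
    h = {k.lower(): v for k, v in headers.items()}

    # Host must name the loopback listener itself; this blocks DNS rebinding,
    # where a hostile domain resolves to 127.0.0.1.
    if h.get("host", "") not in {f"127.0.0.1:{port}", f"localhost:{port}"}:
        return False

    # Assumption: browsers attach Origin to cross-site requests, local clients do not.
    if "origin" in h:
        return False

    # Assumption: token is presented as a Bearer credential.
    scheme, _, presented = h.get("authorization", "").partition(" ")
    if scheme != "Bearer":
        return False
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

The checks are ordered cheapest-first, but the token comparison remains the actual gate: the Host and Origin checks narrow the attack surface, they do not replace the token.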

### 16.2 Out of scope

- Defending against an attacker with root on the daemon host. They can read
  `keypair.json` directly.
- Defending against malicious peers in the same mesh sending malformed
  payloads. Daemon validates structure but trusts mesh members.
- Defending against compromised broker. Mesh-level E2E protects content but
  not metadata.

## 17. Migration — what changes for existing users

Same as v1. Additive. No DB migration on broker. Existing
`~/.claudemesh/config.json` consumed unchanged. `claudemesh launch` keeps
working; daemon is opt-in.

---

## What needs review (round 2)

Round 1 produced: identity model needs `--ephemeral` + clone-detect, IPC needs
local token, "exactly-once" was a lie, hooks needed scoped credentials, surface
needed shrinking, missing rotation/recovery/migration/threat-model.

This v2 attempts to address all of them. Specifically critique:

1. **Has the identity model fully closed the clone problem?** Refuses-on-fingerprint-mismatch
   plus broker audit plus mesh-owner revoke — does this catch a sophisticated
   attacker who copies `host_fingerprint.json` along with the keypair?
2. **Is the local-token model sufficient for browser-SSRF defense?**
   Token + Origin + Host checks + 127.0.0.1-only. Anything else needed?
3. **The delivery contract** (§4) — is it now defensible? Do the inflight-recovery
   semantics + idempotency-key propagation produce the guarantees claimed?
4. **Hook capability tokens** (§6.2) — short-lived, scoped, expire on hook exit.
   Does this fully eliminate the exfil footgun? What capability scopes are
   actually needed for v0.9.0 hooks?
5. **Frozen v0.9.0 surface** (§3.1) — is the cut right? Should `peer list` be
   in core or capability-gated? Should `inbox/search` ship in v0.9.0?
6. **Threat model** (§16) — anything missing? Specifically thinking about CI
   environments where the daemon's host is a fleet shared across many users'
   builds.
7. **Lifecycle flows** (§14) — image clones, key rotation, host moves, disk
   corruption, uninstall semantics. Anything still missing?
8. **Version compat** (§15) — is the negotiation handshake sufficient, or do
   we need stronger guarantees (e.g. semver-strict, or a feature-bit
   negotiation rather than version numbers)?

Score 1–5 each. Top 3 changes you'd insist on for v3, if any. If you think v2
is shippable, say so explicitly — over-engineering is a real risk.