claudemesh/docs/mesh-services-spec.md

# Mesh Services: MCP Servers & Skills Platform

> Consolidated spec for deploying, managing, and executing MCP servers
> and multi-file skills within a claudemesh mesh. Covers source modes,
> execution engine, credential vaults, access control, native Claude Code
> integration, and dynamic tool discovery.

---

## Problem

Today:
- **Skills** are a single `instructions` text field in Postgres. No multi-file support.
- **MCP servers** are live-proxied through the registering peer. When that peer disconnects, the server dies. The `persistent` flag is cosmetic.
- Neither supports bundled artifacts (templates, configs, schemas, example code).
- Claude Code has no way to discover mesh tools natively — peers must use the generic `mesh_tool_call` proxy.

## Design goals

1. Three source modes — inline, zip bundle, git repo — for both skills and MCP servers
2. MCP servers run on the VPS, not on peers — true 24/7 persistence
3. Sandboxed execution with resource limits
4. Native Claude Code tool integration — deployed MCPs appear as regular MCP server entries
5. Per-peer credential vault for secrets (OAuth tokens, API keys)
6. Visibility scopes on services (peer, group, role, or mesh-wide): the deployer controls who can call a service, and callers never see its secrets
7. Dynamic mid-session discovery via `notifications/tools/list_changed`
8. All existing behavior preserved — inline skills and live-proxy MCPs unchanged

---

## Architecture overview

```
┌──────────────────────────────────────────────────────────────────┐
│ claudemesh launch --name Mou --mesh dev                          │
│                                                                  │
│  1. Connect to broker, authenticate                              │
│  2. Fetch service catalog (scope-filtered for this peer)          │
│  3. Write native MCP entries to ~/.claude.json:                  │
│       mesh:gmail, mesh:context7, mesh:whatsapp                   │
│  4. Spawn claude                                                 │
│  5. On exit: remove mesh:* entries                               │
└──────────┬───────────────────────────────────────────────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────────────┐
│ Claude Code session                                              │
│                                                                  │
│  MCP: claudemesh (stdio)                                         │
│  ├── send_message, list_peers, set_summary, ...  (peer comms)   │
│  ├── mesh_mcp_deploy, mesh_mcp_scope, ...        (service mgmt) │
│  ├── vault_set, vault_list, ...                  (credentials)  │
│  └── mesh_mcp_schema                             (introspection)│
│                                                                  │
│  MCP: mesh:gmail (stdio proxy)        → mcp__mesh_gmail__*      │
│  MCP: mesh:context7 (stdio proxy)     → mcp__mesh_context7__*   │
│  MCP: mesh:whatsapp (stdio proxy)     → mcp__mesh_whatsapp__*   │
│                                                                  │
│  MCP: playwriter (stdio, local)       → local MCPs as usual     │
│  MCP: figma (stdio, local)                                       │
└──────────┬───────────────────────────────────────────────────────┘
           │ Each mesh:* proxy connects via WebSocket
           ▼
┌──────────────────────────────────────────────────────────────────┐
│ Broker (VPS — wss://ic.claudemesh.com/ws)                        │
│                                                                  │
│  Existing: message routing, presence, state, memory, files, ...  │
│                                                                  │
│  New: Service Catalog                                            │
│  ├── Scope enforcement (peer/group/role/mesh visibility)         │
│  ├── Tool schema registry (from runner)                          │
│  ├── Deploy/undeploy/update commands                             │
│  └── System events: mcp_deployed, mcp_undeployed                │
│                                                                  │
│  New: Vault                                                      │
│  └── Per-peer encrypted credential storage                       │
│                                                                  │
│  Tool call routing:                                              │
│  ├── Managed service? → forward to runner                        │
│  └── Live proxy?      → forward to hosting peer (existing)       │
└──────────┬───────────────────────────────────────────────────────┘
           │ stdio (child process)
           ▼
┌──────────────────────────────────────────────────────────────────┐
│ Runner (one Docker container per mesh)                            │
│                                                                  │
│  Supervisor (Node main thread)                                   │
│  ├── stdin/stdout ↔ broker (JSON-RPC multiplexed)               │
│  ├── Routes tool calls by service name                           │
│  ├── Lifecycle: load / unload / restart                          │
│  ├── Health: MCP ping per child, restart on 3 failures          │
│  ├── Logs: 1000-line ring buffer per service                     │
│  └── Vault: decrypts credentials at spawn time                   │
│                                                                  │
│  Child processes (one per MCP server):                           │
│  ├── child_process.spawn("node", [...]) ← Node MCP servers     │
│  ├── child_process.spawn("uvx", [...])  ← Python MCP servers   │
│  ├── child_process.spawn("npx", [...])  ← npm MCP packages     │
│  │                                                               │
│  │   Each child:                                                 │
│  │   ├── Own stdio pipe (MCP protocol)                          │
│  │   ├── Own env vars (including vault-resolved secrets)        │
│  │   ├── Own /secrets/<name>/ dir (vault files)                 │
│  │   └── Killed individually on undeploy                        │
│  │                                                               │
│  Base image: node:22 + python3.12 + uv + npx                    │
│  Limits: --memory=512m --cpus=1 --network=mesh-restricted       │
└──────────────────────────────────────────────────────────────────┘
```

---

## Source modes

### 1. Inline (existing, unchanged)

```
share_skill(name, description, instructions, tags)       ← text-only skill
mesh_mcp_register(server_name, description, tools)       ← live peer proxy
```

### 2. Zip bundle

Upload a zip, then deploy:

```
1. share_file(path="./my-server.zip", tags=["mcp-bundle"])  → fileId
2. mesh_mcp_deploy(file_id=fileId, server_name="my-server", env={...})
```

**MCP server zip structure:**
```
my-mcp-server/
├── package.json          # or pyproject.toml / requirements.txt
├── src/index.ts          # MCP server entry (stdio transport)
├── .env.example          # declares required env vars
└── README.md
```

**Skill bundle zip structure:**
```
my-skill/
├── SKILL.md              # instructions (replaces inline text)
├── skill.json            # { name, description, tags }
├── templates/            # prompt templates, examples
└── schemas/              # JSON schemas, configs
```

### 3. Git repository

```
mesh_mcp_deploy(
  git_url="https://github.com/user/my-mcp-server.git",
  git_branch="main",
  server_name="my-server",
  env={ API_KEY: "$vault:my-api-key" }
)
```

- Shallow clone (`--depth 1`)
- Commit SHA pinned in DB for auditability
- `mesh_mcp_update(server_name)` → git pull + rebuild + restart
- Auth via `source.auth` in the `mcp_deploy` message (stored encrypted, never logged)

---

## Execution engine

### Why child processes, not worker threads

MCP servers use **stdio transport** — each server owns its stdin/stdout via
`StdioServerTransport`. Two servers can't share one process. Worker threads
don't help because:
- MCP SDK `StdioServerTransport` takes over process stdin/stdout
- `npx @package/mcp-server` spawns its own process anyway
- Python MCPs need a Python runtime, not a Node thread

The runner spawns each MCP server as a **child process** with its own stdio
pipe, exactly how every MCP server is designed to work.

### Container design: one per mesh

```
┌─ Docker container (mesh: "dev") ─────────────────┐
│                                                   │
│  Supervisor (Node main thread)                    │
│  ├─ stdio ↔ broker                               │
│  ├─ routes calls by service name                  │
│  │                                                │
│  ├─ spawn("npx", ["@upstash/context7-mcp"])      │
│  │   └─ stdio pipe ↔ MCP protocol                │
│  ├─ spawn("node", ["dist/index.js"])              │
│  │   └─ stdio pipe ↔ MCP protocol                │
│  ├─ spawn("uvx", ["mcp-outline"])                 │
│  │   └─ stdio pipe ↔ MCP protocol                │
│  └─ spawn("python", ["-m", "server"])             │
│      └─ stdio pipe ↔ MCP protocol                │
│                                                   │
│  Base: node:22 + python3.12 + uv + npx            │
│  Limits: --memory=512m --cpus=1                    │
│  Network: mesh-restricted bridge (allowlist)       │
└───────────────────────────────────────────────────┘
```

**Why one container, not N:**
- One Docker process to manage, one cgroup for the whole mesh
- One network namespace — single firewall config
- Shared node_modules / pip cache across services
- VPS resources: 8 vCores / 24GB, so N containers would exhaust memory quickly

**Why not zero containers (bare child processes on the broker):**
- Broker stays routing-only — runner crashes don't take it down
- Security boundary — runner can't access broker's DB or filesystem
- Runner can be on a different machine later (NUC, second VPS)

### Supervisor protocol

The broker and the runner communicate over the container's stdin/stdout as JSON lines (one message per line):

```typescript
// Broker → runner
{ action: "load", name: "gmail", path: "/services/gmail", env: {...} }
{ action: "call", name: "gmail", tool: "search_emails", args: {...}, callId: "abc" }
{ action: "unload", name: "gmail" }
{ action: "health", name: "gmail" }
{ action: "list_tools", name: "gmail" }

// Runner → broker
{ callId: "abc", result: {...} }
{ callId: "abc", error: "connection refused" }
{ type: "loaded", name: "gmail", tools: [{name, description, inputSchema}] }
{ type: "unloaded", name: "gmail" }
{ type: "crashed", name: "gmail", restarts: 3, error: "OOM" }
{ type: "health", name: "gmail", ok: true, rssKb: 45000 }
```
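As an illustration of the broker → runner half of this protocol, the sketch below types the messages as a discriminated union and parses one line at a time. The names `BrokerToRunner`, `parseBrokerLine`, and `splitLines` are hypothetical, and the validation covers only the routing fields shown in the examples above.

```typescript
// Hypothetical type; field names mirror the message examples above.
type BrokerToRunner =
  | { action: "load"; name: string; path: string; env: Record<string, string> }
  | { action: "call"; name: string; tool: string; args: unknown; callId: string }
  | { action: "unload" | "health" | "list_tools"; name: string };

const KNOWN_ACTIONS: string[] = ["load", "call", "unload", "health", "list_tools"];

// Parse one JSON line from the broker. Returns null for malformed input
// instead of throwing, so a single bad line cannot take the supervisor down.
function parseBrokerLine(line: string): BrokerToRunner | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const msg = parsed as Record<string, unknown>;
  if (typeof msg.action !== "string" || !KNOWN_ACTIONS.includes(msg.action)) return null;
  if (typeof msg.name !== "string") return null;
  if (msg.action === "call" && (typeof msg.tool !== "string" || typeof msg.callId !== "string")) {
    return null;
  }
  if (msg.action === "load" && typeof msg.path !== "string") return null;
  return msg as unknown as BrokerToRunner;
}

// Split buffered stdin data into complete lines, keeping the partial tail
// so it can be prepended to the next chunk.
function splitLines(buffer: string): { lines: string[]; rest: string } {
  const parts = buffer.split("\n");
  const rest = parts.pop() ?? "";
  return { lines: parts.filter((l) => l.trim() !== ""), rest };
}
```

A supervisor loop would append each stdin chunk to a buffer, call `splitLines`, and dispatch every non-null `parseBrokerLine` result by `name`.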

### Runtime auto-detection

| File found | Runtime | Spawn command |
|---|---|---|
| `package.json` | node | `npm install && node <main>` |
| `package.json` with npx hint | node | `npx <package>` |
| `pyproject.toml` | python | `pip install . && python -m <module>` |
| `requirements.txt` | python | `pip install -r requirements.txt && python <entry>` |
| `bun.lockb` or `bun.lock` | bun | `bun install && bun <entry>` |
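The detection step can be sketched as a pure function over the file names at the root of the service's source directory. This is a minimal sketch: `detectRuntime` is a hypothetical name, only `bun.lockb` is checked as the Bun marker for brevity, and the precedence (Bun before Node) is an assumption, since the table does not say which row wins when several files are present.

```typescript
type Runtime = "node" | "python" | "bun";

// Returns the runtime implied by the root-level files, or null when no
// marker file is present.
// Assumed precedence: Bun is checked first because a Bun project normally
// ships a package.json as well.
function detectRuntime(rootFiles: string[]): Runtime | null {
  const has = (file: string): boolean => rootFiles.includes(file);
  if (has("bun.lockb")) return "bun";
  if (has("package.json")) return "node";
  if (has("pyproject.toml") || has("requirements.txt")) return "python";
  return null;
}
```

An explicit `runtime` value in the deploy config would presumably override the detected value.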

### Health & restart

- Supervisor sends MCP `ping` to each child every 30s
- No response within 5s → mark unhealthy
- 3 consecutive failures → restart (kill + re-spawn)
- Max 5 restarts → status=`crashed`, notify deployer via mesh system event
- On crash: `{ type: "push", event: "mcp_crashed", eventData: { name, error, restarts } }`
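The rules above amount to a small per-service state machine. The sketch below is one possible reading: `HealthTracker` is a hypothetical name, and it assumes the restart counter is never reset and that the service is marked `crashed` on the first failure streak after five restarts have already been used.

```typescript
type HealthAction = "none" | "restart" | "crashed";

// Thresholds taken from the list above.
const FAILURES_BEFORE_RESTART = 3;
const MAX_RESTARTS = 5;

class HealthTracker {
  private consecutiveFailures = 0;
  private restarts = 0;

  // Record one ping outcome (ok = the child answered within the 5s window)
  // and return what the supervisor should do next.
  record(ok: boolean): HealthAction {
    if (ok) {
      this.consecutiveFailures = 0;
      return "none";
    }
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures < FAILURES_BEFORE_RESTART) return "none";
    // Three misses in a row: start a fresh streak and decide what to do.
    this.consecutiveFailures = 0;
    if (this.restarts >= MAX_RESTARTS) return "crashed";
    this.restarts += 1;
    return "restart";
  }

  get restartCount(): number {
    return this.restarts;
  }
}
```

The supervisor would call `record` once per 30s ping cycle, re-spawn the child on `"restart"`, and emit the `mcp_crashed` event on `"crashed"`.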

### Logs

Per-service ring buffer (1000 lines). Captures child's stderr + stdout
(excluding MCP protocol JSON). Accessible via `mesh_mcp_logs(name, lines?)`.
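A fixed-capacity ring buffer of this kind might look like the following. `LogRingBuffer` and `isMcpProtocolLine` are hypothetical names; the protocol filter assumes that any line parsing as a JSON object with `"jsonrpc": "2.0"` is MCP traffic rather than log output.

```typescript
class LogRingBuffer {
  private readonly capacity: number;
  private readonly slots: string[] = [];
  private next = 0; // once full, index of the oldest line (next to overwrite)

  constructor(capacity: number = 1000) {
    this.capacity = capacity;
  }

  push(line: string): void {
    if (this.slots.length < this.capacity) {
      this.slots.push(line);
    } else {
      this.slots[this.next] = line; // overwrite the oldest line
      this.next = (this.next + 1) % this.capacity;
    }
  }

  // Return up to `count` of the most recent lines, oldest first.
  tail(count: number): string[] {
    if (count <= 0) return [];
    const ordered = [...this.slots.slice(this.next), ...this.slots.slice(0, this.next)];
    return ordered.slice(-count);
  }
}

// True when a stdout line looks like an MCP (JSON-RPC 2.0) message and
// should therefore be kept out of the log buffer.
function isMcpProtocolLine(line: string): boolean {
  try {
    const value: unknown = JSON.parse(line);
    return (
      typeof value === "object" &&
      value !== null &&
      (value as Record<string, unknown>).jsonrpc === "2.0"
    );
  } catch {
    return false;
  }
}
```

`mesh_mcp_logs(name, lines?)` would then map onto `tail(lines ?? 50)` on the buffer for that service.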

### Storage layout

```
/var/claudemesh/services/
├── <meshId>/
│   ├── <serviceName>/
│   │   ├── source/          # extracted zip or git clone
│   │   ├── secrets/         # vault-resolved credential files
│   │   ├── node_modules/    # or .venv/ for Python
│   │   └── .meta.json       # { pid, startedAt, sha, runtime }
```

### Network policy

Default: `--network=mesh-restricted` (Docker bridge with outbound deny-all).

Per-service allowlist in deploy config:
```json
{
  "network_allow": [
    "gmail.googleapis.com:443",
    "oauth2.googleapis.com:443",
    "100.113.153.45:*"
  ]
}
```

The allowlist is enforced either with iptables rules on the bridge or with
per-container `--add-host` entries combined with a proxy. For services reachable
over Tailscale (NUC, etc.), allow the Tailscale IP.
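Whichever mechanism enforces the policy, the `network_allow` entries first have to be parsed into host/port rules. A minimal sketch, assuming entries always have the form `host:port` with `*` as the only wildcard (as in the example above) and leaving IPv6 literals out of scope; `AllowRule`, `parseAllowRule`, and `isAllowed` are hypothetical names:

```typescript
interface AllowRule {
  host: string;
  port: number | "*";
}

// Parse one "host:port" entry; the port may be "*" to allow every port.
function parseAllowRule(entry: string): AllowRule {
  const idx = entry.lastIndexOf(":");
  if (idx <= 0) throw new Error(`invalid network_allow entry: ${entry}`);
  const host = entry.slice(0, idx).toLowerCase();
  const portText = entry.slice(idx + 1);
  if (portText === "*") return { host, port: "*" };
  const port = Number(portText);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`invalid port in network_allow entry: ${entry}`);
  }
  return { host, port };
}

// Deny by default: a destination is allowed only if some rule matches it.
function isAllowed(rules: AllowRule[], host: string, port: number): boolean {
  const wanted = host.toLowerCase();
  return rules.some((r) => r.host === wanted && (r.port === "*" || r.port === port));
}
```

The parsed rules could then be translated into iptables rules or proxy ACL entries at deploy time.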

---

## Credential vault

### Design

Per-peer encrypted storage on the broker. Credentials never leave the vault
in plaintext — decrypted only inside the runner container at spawn time.

Peers don't share credentials. They share **access to the running MCP
server** via scopes. The MCP server runs with the deployer's credentials;
other peers call it without ever seeing the secrets.

### Encryption model

Same crypto as E2E file sharing (`crypto/file-crypto.ts`):

1. Peer generates random symmetric key
2. Encrypts the credential with `crypto_secretbox` (symmetric)
3. Seals the symmetric key with their own pubkey (`crypto_box`)
4. Stores sealed key + ciphertext on broker — broker sees only ciphertext
5. At spawn time: runner requests decryption from the deployer's sealed key
   (the runner holds a mesh-scoped keypair granted by the deployer at deploy time)

### Vault reference syntax

In `mesh_mcp_deploy` env config, `$vault:` prefix triggers vault resolution:

```
$vault:api-key                              → inject as env var
$vault:gmail-creds:file:/secrets/creds.json → decrypt, write to file, set env var to path
```

Examples:
```typescript
mesh_mcp_deploy({
  server_name: "gmail",
  git_url: "https://github.com/gongrzhe/server-gmail-autoauth-mcp",
  env: {
    GMAIL_CREDENTIALS_PATH: "$vault:gmail-creds:file:/secrets/credentials.json",
    GMAIL_OAUTH_PATH: "$vault:gmail-oauth:file:/secrets/gcp-oauth.keys.json",
  },
  network_allow: ["gmail.googleapis.com:443", "oauth2.googleapis.com:443"],
})
```
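The two reference forms can be told apart with a small parser. This is a sketch under stated assumptions: vault keys contain no `:` and file paths are absolute. `VaultRef` and `parseVaultRef` are hypothetical names.

```typescript
type VaultRef =
  | { kind: "env"; key: string }
  | { kind: "file"; key: string; path: string };

// Returns null for plain values that are not vault references, and throws
// for values that start with "$vault:" but are malformed.
function parseVaultRef(value: string): VaultRef | null {
  const prefix = "$vault:";
  if (!value.startsWith(prefix)) return null;
  const body = value.slice(prefix.length);
  const marker = ":file:";
  const at = body.indexOf(marker);
  if (at === -1) {
    // Env form: "$vault:<key>"
    if (body === "" || body.includes(":")) {
      throw new Error(`malformed vault reference: ${value}`);
    }
    return { kind: "env", key: body };
  }
  // File form: "$vault:<key>:file:<absolute path>"
  const key = body.slice(0, at);
  const path = body.slice(at + marker.length);
  if (key === "" || !path.startsWith("/")) {
    throw new Error(`malformed vault reference: ${value}`);
  }
  return { kind: "file", key, path };
}
```

For a `file` reference the runner would write the decrypted bytes to `path` and set the env var to that path; for an `env` reference it would inject the decrypted value directly.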

### MCP tools

```
vault_set(key, value, type?, mount_path?)  — encrypt + store
  value: string (env var) or local file path (reads + encrypts the file)
  type: "env" (default) or "file"
  mount_path: for files, where to write inside the service dir

vault_list()                               — list keys (no values, metadata only)
vault_delete(key)                          — remove entry
```

### DB schema

```sql
CREATE TABLE mesh.vault_entry (
  id          TEXT PRIMARY KEY,
  mesh_id     TEXT NOT NULL REFERENCES mesh.mesh(id) ON DELETE CASCADE,
  member_id   TEXT NOT NULL REFERENCES mesh.member(id),
  key         TEXT NOT NULL,

  -- E2E encrypted content
  ciphertext  BYTEA NOT NULL,
  nonce       BYTEA NOT NULL,
  sealed_key  BYTEA NOT NULL,    -- symmetric key sealed with peer's pubkey

  -- Metadata (plaintext)
  entry_type  TEXT DEFAULT 'env' CHECK (entry_type IN ('env', 'file')),
  mount_path  TEXT,
  description TEXT,

  created_at  TIMESTAMP DEFAULT now(),
  updated_at  TIMESTAMP DEFAULT now(),

  UNIQUE (mesh_id, member_id, key)
);
```

---

## Visibility scopes

### Model

Scopes control who can see and call a service. Credentials are invisible to
callers — they interact with the running service, not the secrets behind it.
The deployer controls visibility; the vault handles secrets separately.

### Scope levels

| Scope | Who sees it | Use case |
|---|---|---|
| `peer` | Only the deployer (default) | Personal tools, staging before publish |
| `{ peers: [...] }` | Named peers | Shared between specific people |
| `{ group: "eng" }` | All @eng members | Team-specific tools |
| `{ groups: ["eng", "ops"] }` | Multiple groups | Cross-team tools |
| `{ role: "lead" }` | Any peer with that role | Role-gated admin tools |
| `mesh` | Everyone in the mesh | Shared utilities |

### Examples

```
┌─────────────────────────────────────────────────┐
│ Mesh: "dev-team"                                │
│                                                 │
│  mesh scope ─── everyone                        │
│  ├── context7         (utility)                 │
│  ├── youtube-transcript                         │
│  └── mesh-db          (shared database)         │
│                                                 │
│  group scope ─── @group members only            │
│  ├── @eng                                       │
│  │   ├── github-mcp   (eng team's GitHub)       │
│  │   └── ssh-manager  (eng infra access)        │
│  ├── @sales                                     │
│  │   ├── apollo-io    (sales CRM)               │
│  │   └── gmail        (sales@ inbox)            │
│  └── @ops                                       │
│      ├── stalwart-mail (mail server admin)       │
│      └── namecheap    (DNS management)           │
│                                                 │
│  role scope ─── by role tag                     │
│  ├── lead → mesh-admin-tools (deploy, vault)    │
│  └── observer → (read-only MCPs only)           │
│                                                 │
│  peer scope ─── only specific peers             │
│  ├── Alejandro                                  │
│  │   ├── gmail-personal  (my inbox)             │
│  │   └── gworkspace      (my workspace)         │
│  └── Mou                                        │
│      └── cursor-composer (Mou's Cursor)         │
│                                                 │
└─────────────────────────────────────────────────┘
```

### Deploy with scope

```typescript
// Mesh scope — everyone
mesh_mcp_deploy({
  server_name: "context7",
  git_url: "...",
  scope: "mesh",
})

// Group scope — only @eng
mesh_mcp_deploy({
  server_name: "github-mcp",
  git_url: "...",
  scope: { group: "eng" },
})

// Multi-group
mesh_mcp_deploy({
  server_name: "ssh-manager",
  scope: { groups: ["eng", "ops"] },
})

// Role scope — only leads
mesh_mcp_deploy({
  server_name: "mesh-admin",
  scope: { role: "lead" },
})

// Peer scope — just me (default)
mesh_mcp_deploy({
  server_name: "gmail-personal",
  scope: "peer",
})

// Specific peers
mesh_mcp_deploy({
  server_name: "shared-workspace",
  scope: { peers: ["Mou", "Alejandro"] },
})
```

### Enforcement

- **At catalog time:** broker filters the service catalog by scope before
  sending to peers in `hello_ack`. The peer's groups and role (from `hello`)
  are matched against each service's scope. A tool you can't access never
  appears in Claude's tool list.
- **At call time:** broker re-checks scope before routing. Double-check
  in case catalog is stale or the peer's groups changed.

### Scope resolution logic

```typescript
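// Wire-format scope, as accepted by mesh_mcp_deploy and mesh_mcp_scope.
type Scope =
  | "peer"
  | "mesh"
  | { peers: string[] }
  | { group: string }
  | { groups: string[] }
  | { role: string };

function peerCanAccess(service: Service, peer: PeerConn): boolean {
  const scope: Scope = service.scope;
  if (typeof scope === "string") {
    if (scope === "peer") return service.deployed_by === peer.memberId;
    // "mesh" is open to everyone; any unrecognized string is denied.
    // Returning here also keeps the `in` checks below from running on a
    // string, which would throw a TypeError at runtime.
    return scope === "mesh";
  }
  if ("peers" in scope) {
    // Entries may be member IDs or display names.
    return scope.peers.some(p =>
      p === peer.memberId || p === peer.displayName);
  }
  if ("group" in scope) {
    return peer.groups.some(g => g.name === scope.group);
  }
  if ("groups" in scope) {
    return peer.groups.some(g => scope.groups.includes(g.name));
  }
  if ("role" in scope) {
    return peer.groups.some(g => g.role === scope.role);
  }
  return false; // unknown scope shape: deny by default
}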
```

### MCP tools

```
mesh_mcp_scope(server_name, scope?)
  scope set:  mesh_mcp_scope("gmail", { group: "sales" })
  scope read: mesh_mcp_scope("gmail") → { scope, deployed_by }
```

### Scope change events

When a scope changes, the broker:
1. Computes which peers gained/lost access
2. Sends `mcp_scope_changed` system event to affected peers
3. Peers who gained access get `svc__*` dynamic tools via `list_changed`
4. Peers who lost access get tools removed via `list_changed`
5. Full native access requires session restart
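Step 1 reduces to a set difference between the peers who could access the service before the change and those who can access it afterwards. A minimal sketch, where `diffAccess` is a hypothetical helper and both inputs are assumed to be member-ID lists obtained by running the scope check over the connected peers:

```typescript
// Given the member IDs with access before and after a scope change,
// return who gained access and who lost it.
function diffAccess(
  before: string[],
  after: string[],
): { gained: string[]; lost: string[] } {
  const beforeSet = new Set(before);
  const afterSet = new Set(after);
  return {
    gained: after.filter((id) => !beforeSet.has(id)),
    lost: before.filter((id) => !afterSet.has(id)),
  };
}
```

Peers in neither list are unaffected and would not need an `mcp_scope_changed` event.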

### DB

Single column on `mesh.service`:

```sql
scope JSONB DEFAULT '{"type": "peer"}'
-- {"type": "peer"}
-- {"type": "mesh"}
-- {"type": "peers", "allow": ["member_id_1", "member_id_2"]}
-- {"type": "group", "group": "eng"}
-- {"type": "groups", "groups": ["eng", "ops"]}
-- {"type": "role", "role": "lead"}
```
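The stored JSONB shape differs from the shorthand accepted by `mesh_mcp_deploy`, so the broker needs a normalization step. The sketch below shows one way to write it; `WireScope`, `StoredScope`, and `toStoredScope` are hypothetical names, and resolving display names in `peers` to member IDs is assumed to happen elsewhere.

```typescript
type WireScope =
  | "peer"
  | "mesh"
  | { peers: string[] }
  | { group: string }
  | { groups: string[] }
  | { role: string };

type StoredScope =
  | { type: "peer" }
  | { type: "mesh" }
  | { type: "peers"; allow: string[] }
  | { type: "group"; group: string }
  | { type: "groups"; groups: string[] }
  | { type: "role"; role: string };

// Convert the shorthand scope from a deploy request into the tagged form
// stored in the scope column. A missing scope falls back to "peer".
function toStoredScope(scope: WireScope | undefined): StoredScope {
  if (scope === undefined || scope === "peer") return { type: "peer" };
  if (scope === "mesh") return { type: "mesh" };
  if ("peers" in scope) return { type: "peers", allow: scope.peers };
  if ("group" in scope) return { type: "group", group: scope.group };
  if ("groups" in scope) return { type: "groups", groups: scope.groups };
  return { type: "role", role: scope.role };
}
```

Keeping a single tagged form in the database means the catalog and `mcp_scope_result` responses can return the column value unchanged.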

### Future: cross-mesh scope

Not for v1. Each mesh is isolated. The schema supports it later:

```json
{"type": "cross_mesh", "meshes": ["dev", "staging"]}
```

A service deployed in `dev` would then be visible in `staging`. This requires
the runner to be accessible from both meshes, which is possible because it
runs on the VPS.

---

## Native Claude Code integration

### Goal

Deployed mesh MCPs feel indistinguishable from locally installed MCP servers.
Claude sees `mcp__mesh_gmail__search_emails` — not `mesh_tool_call("gmail", ...)`.

### At session start: native MCP entries

`claudemesh launch` queries the broker for the scope-filtered service catalog
and installs each service as a native MCP entry before spawning Claude:

```typescript
// commands/launch.ts — extended flow

// Step 3 (new): fetch service catalog from broker
const catalog = await fetchServiceCatalog(mesh);

// Step 4 (new): write mesh MCP entries to ~/.claude.json
for (const service of catalog) {
  addMcpEntry(`mesh:${service.name}`, {
    command: "claudemesh",
    args: ["mcp", "--service", service.name],
  });
}

// Step 5: spawn claude with mesh-aware env
const child = spawn("claude", claudeArgs, {
  env: {
    ...process.env,
    CLAUDEMESH_CONFIG_DIR: tmpDir,
    CLAUDEMESH_DISPLAY_NAME: displayName,
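    // MCP_TIMEOUT bounds MCP server startup. Each mesh proxy must reach the
    // broker over WebSocket before it can list tools, so allow extra time.
    // Tool calls, which traverse proxy → WS → broker → runner → child, are
    // bounded separately by MCP_TOOL_TIMEOUT.
    MCP_TIMEOUT: process.env.MCP_TIMEOUT ?? "30000",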
    // Mesh MCPs may return large results (DB queries, file contents).
    MAX_MCP_OUTPUT_TOKENS: process.env.MAX_MCP_OUTPUT_TOKENS ?? "50000",
  },
});

// Step 6 (extended): cleanup mesh:* entries on exit
child.on("exit", () => {
  removeMcpEntries("mesh:*");
  cleanup();  // existing tmpdir cleanup
});
```

Each `claudemesh mcp --service <name>` is a thin stdio proxy:

```typescript
// Thin proxy: connects to broker, serves ONE service's tools
const client = new BrokerClient(mesh);
await client.connect();
const tools = await client.getServiceTools(serviceName);

server.setRequestHandler(ListToolsRequestSchema, () => ({ tools }));
server.setRequestHandler(CallToolRequestSchema, async (req) => {
  // Wait for broker reconnection if WS is down (up to 10s)
  if (client.status !== "open") {
    const connected = await client.waitForConnection(10_000);
    if (!connected) {
      return text("Service temporarily unavailable — broker reconnecting. Retry in a few seconds.", true);
    }
  }
  return await client.mcpCall(serviceName, req.params.name, req.params.arguments);
});
```

**Resilience notes:**
- The `BrokerClient` handles WS reconnection with exponential backoff (1s→30s)
- Claude Code does NOT auto-restart crashed MCP servers — if the proxy
  process itself dies, those tools vanish until session restart
- The proxy should catch all exceptions and return MCP errors, never crash
- `claudemesh doctor` diagnoses dead proxy processes mid-session

**Result:** Claude Code starts and sees:
```
mcp__mesh_gmail__search_emails         ← proper namespace, full schema
mcp__mesh_gmail__send_email            ← deferred by ToolSearch automatically
mcp__mesh_context7__query_docs         ← native MCP, no indirection
```
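The reconnection backoff mentioned in the resilience notes (1s→30s) could follow a schedule like the one below. The doubling factor and the absence of jitter are assumptions, since the spec only gives the endpoints, and `reconnectDelayMs` is a hypothetical helper.

```typescript
// Delay before reconnect attempt N (0-based): 1s, 2s, 4s, 8s, 16s, then
// capped at 30s for every later attempt.
function reconnectDelayMs(
  attempt: number,
  baseMs: number = 1_000,
  capMs: number = 30_000,
): number {
  // Clamp the exponent so negative or very large attempts stay well defined.
  const exponent = Math.min(Math.max(attempt, 0), 30);
  return Math.min(capMs, baseMs * 2 ** exponent);
}
```

With these defaults the cap is reached on the sixth attempt, which is well past the 10s window the proxy waits in `waitForConnection` before it returns a retryable error.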

### Session management

**Safe `~/.claude.json` modification:**
- `~/.claude.json` stores MCP entries AND other Claude Code config (permissions,
  env vars, etc.). Never overwrite the whole file.
- Read-modify-write: load full JSON → add/remove only `mesh:*` keys in
  `mcpServers` → write back. Preserve all other keys.
- Use `flock` on writes to prevent concurrent session corruption.
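The read-modify-write rule can be expressed as pure functions over the parsed JSON, leaving file I/O and locking to the caller. This is a sketch only: `addMeshEntries` and `removeMeshEntries` are hypothetical stand-ins for the `addMcpEntry` / `removeMcpEntries` helpers used in `launch.ts`.

```typescript
type JsonObject = { [key: string]: unknown };

const MESH_PREFIX = "mesh:";

// Read the mcpServers map, tolerating a missing or malformed value.
function getServers(config: JsonObject): JsonObject {
  const servers = config.mcpServers;
  return typeof servers === "object" && servers !== null ? (servers as JsonObject) : {};
}

// Return a copy of the config with the given mesh:* entries added.
// Every other top-level key and every non-mesh server is copied through.
function addMeshEntries(config: JsonObject, entries: Record<string, JsonObject>): JsonObject {
  for (const key of Object.keys(entries)) {
    if (!key.startsWith(MESH_PREFIX)) {
      throw new Error(`refusing to write non-mesh key: ${key}`);
    }
  }
  return { ...config, mcpServers: { ...getServers(config), ...entries } };
}

// Return a copy of the config without the listed entries. Keys that do not
// start with "mesh:" are ignored, so local MCP servers can never be removed.
function removeMeshEntries(config: JsonObject, keys: string[]): JsonObject {
  const doomed = new Set(keys.filter((k) => k.startsWith(MESH_PREFIX)));
  const kept: JsonObject = {};
  for (const [key, value] of Object.entries(getServers(config))) {
    if (!doomed.has(key)) kept[key] = value;
  }
  return { ...config, mcpServers: kept };
}
```

Because neither function mutates its input, the caller can hold the lock only for the short load, transform, write sequence.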

**Stale entry cleanup:**
- Each `mesh:*` entry includes `_meshSession` metadata with PID and timestamp
- `claudemesh launch` sweeps stale entries on startup (dead PID check)
- `claudemesh doctor` reports orphaned entries

**Concurrent sessions:**
- Entries are session-scoped: `mesh:gmail:w1t0p0` (includes session ID)
- Each session manages only its own entries
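The stale-entry sweep described above can be sketched as a function that takes the liveness check as a parameter. The `pid` field name inside `_meshSession` is an assumption (the spec only says the metadata holds a PID and timestamp), as is the choice to treat `mesh:*` entries without metadata as stale; `sessionPid` and `findStaleEntries` are hypothetical names.

```typescript
// Extract the owning PID from an entry's _meshSession metadata, if present.
function sessionPid(entry: unknown): number | null {
  if (typeof entry !== "object" || entry === null) return null;
  const meta = (entry as { [key: string]: unknown })._meshSession;
  if (typeof meta !== "object" || meta === null) return null;
  const pid = (meta as { [key: string]: unknown }).pid;
  return typeof pid === "number" ? pid : null;
}

// Return the mesh:* keys whose owning session is gone. Entries without
// metadata are treated as stale (an assumption); non-mesh entries are
// never reported.
function findStaleEntries(
  mcpServers: { [key: string]: unknown },
  isAlive: (pid: number) => boolean,
): string[] {
  const stale: string[] = [];
  for (const [key, value] of Object.entries(mcpServers)) {
    if (!key.startsWith("mesh:")) continue;
    const pid = sessionPid(value);
    if (pid === null || !isAlive(pid)) stale.push(key);
  }
  return stale;
}
```

In Node, `isAlive` could be implemented with `process.kill(pid, 0)` wrapped in a try/catch; it is injected here so the sweep stays testable.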

### Mid-session deploys: dynamic tools

When a service is deployed after the Claude session started, native MCP entries
can't be added (Claude Code doesn't support adding new MCP servers mid-session).

**Three-tier fallback:**

1. **Claudemesh MCP fires `notifications/tools/list_changed`** (stdio, proven to work)
   - Adds `svc__<name>__<tool>` tools to its own `tools/list`
   - Claude sees them as `mcp__claudemesh__svc__gmail__search_emails`
   - Works, but namespacing is less clean than native

2. **System notification tells the peer:**
   ```
   [mesh] Service deployed: "namecheap" by Alejandro (3 tools).
   Available now via mesh_tool_call("namecheap", "domains_list", {...}).
   Restart session for native mcp__mesh_namecheap__* access.
   ```

3. **`mesh_tool_call` remains the universal fallback** — works for any
   service at any time, native or not.
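The `svc__<name>__<tool>` convention from the first tier needs a matching parser so the claudemesh MCP can route an incoming call back to the right service. A minimal sketch, assuming service names never contain a double underscore; `dynamicToolName` and `parseDynamicToolName` are hypothetical helpers.

```typescript
// Name the claudemesh MCP server advertises for a dynamically added tool.
function dynamicToolName(service: string, tool: string): string {
  return `svc__${service}__${tool}`;
}

// Inverse of dynamicToolName. Returns null for names that are not dynamic
// service tools (for example the built-in send_message).
function parseDynamicToolName(
  toolName: string,
): { service: string; tool: string } | null {
  const prefix = "svc__";
  if (!toolName.startsWith(prefix)) return null;
  const rest = toolName.slice(prefix.length);
  const sep = rest.indexOf("__");
  // Require a non-empty service before the separator and a non-empty tool after it.
  if (sep <= 0 || sep + 2 >= rest.length) return null;
  return { service: rest.slice(0, sep), tool: rest.slice(sep + 2) };
}
```

Claude Code adds its own `mcp__claudemesh__` prefix on top, which yields the `mcp__claudemesh__svc__gmail__search_emails` form shown above.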

### Mid-session undeploys

When a service is undeployed, the native proxy process detects the broker
event and exits gracefully. Claude Code sees the MCP server disconnect and
stops offering those tools. No `list_changed` needed — MCP server death
is already handled.

### Schema introspection

For programmatic access to tool schemas (building workflows, debugging):

```
mesh_mcp_schema(server_name)                → all tools with full inputSchema
mesh_mcp_schema(server_name, tool_name)     → one specific tool's schema
mesh_mcp_catalog()                          → all services with tool counts, scope, status
```

---

## Database changes

### New table: `mesh.service`

```sql
CREATE TABLE mesh.service (
  id              TEXT PRIMARY KEY,
  mesh_id         TEXT NOT NULL REFERENCES mesh.mesh(id) ON DELETE CASCADE,
  name            TEXT NOT NULL,
  type            TEXT NOT NULL CHECK (type IN ('mcp', 'skill')),

  -- Source
  source_type     TEXT NOT NULL CHECK (source_type IN ('inline', 'zip', 'git')),
  source_file_id  TEXT REFERENCES mesh.file(id),
  source_git_url  TEXT,
  source_git_branch TEXT DEFAULT 'main',
  source_git_sha  TEXT,
  prev_git_sha    TEXT,                    -- for rollback

  -- Content
  description     TEXT NOT NULL,
  instructions    TEXT,                    -- skills only
  tools_schema    JSONB,                   -- MCPs: [{ name, description, inputSchema }]

  -- Bundle
  manifest        JSONB,                   -- { files: [...], entry: "src/index.ts" }

  -- Execution (MCPs only)
  runtime         TEXT CHECK (runtime IS NULL OR runtime IN ('node', 'python', 'bun')),
  status          TEXT DEFAULT 'stopped'
                  CHECK (status IN ('building', 'installing', 'running',
                                    'stopped', 'failed', 'crashed', 'restarting')),
  config          JSONB DEFAULT '{}',      -- resource limits, network policy
  last_health     TIMESTAMP,
  restart_count   INT DEFAULT 0,
  version         INT DEFAULT 1,

  -- Visibility scope
  scope           JSONB DEFAULT '{"type": "peer"}',

  -- Metadata
  deployed_by     TEXT REFERENCES mesh.member(id),
  deployed_by_name TEXT,
  created_at      TIMESTAMP DEFAULT now() NOT NULL,
  updated_at      TIMESTAMP DEFAULT now() NOT NULL,

  UNIQUE (mesh_id, name)
);
```

### New table: `mesh.vault_entry`

```sql
CREATE TABLE mesh.vault_entry (
  id          TEXT PRIMARY KEY,
  mesh_id     TEXT NOT NULL REFERENCES mesh.mesh(id) ON DELETE CASCADE,
  member_id   TEXT NOT NULL REFERENCES mesh.member(id),
  key         TEXT NOT NULL,
  ciphertext  BYTEA NOT NULL,
  nonce       BYTEA NOT NULL,
  sealed_key  BYTEA NOT NULL,
  entry_type  TEXT DEFAULT 'env' CHECK (entry_type IN ('env', 'file')),
  mount_path  TEXT,
  description TEXT,
  created_at  TIMESTAMP DEFAULT now(),
  updated_at  TIMESTAMP DEFAULT now(),
  UNIQUE (mesh_id, member_id, key)
);
```

### Extend `mesh.skill` (backward compat)

```sql
ALTER TABLE mesh.skill
  ADD COLUMN source_type TEXT DEFAULT 'inline'
    CHECK (source_type IN ('inline', 'zip', 'git')),
  ADD COLUMN bundle_file_id TEXT REFERENCES mesh.file(id),
  ADD COLUMN git_url TEXT,
  ADD COLUMN git_branch TEXT DEFAULT 'main',
  ADD COLUMN git_sha TEXT,
  ADD COLUMN manifest JSONB;
```

---

## Wire protocol additions

### Client → broker

```typescript
// --- Service deployment ---

interface WSMcpDeployMessage {
  type: "mcp_deploy";
  server_name: string;
  source:
    | { type: "zip"; file_id: string }
    | { type: "git"; url: string; branch?: string; auth?: string };
  config?: {
    env?: Record<string, string>;    // supports $vault: refs
    memory_mb?: number;              // default 256
    cpus?: number;                   // default 0.5
    network_allow?: string[];        // default: none
    runtime?: "node" | "python" | "bun";
  };
  scope?:
    | "peer"                                    // private (default)
    | "mesh"                                    // everyone
    | { peers: string[] }                       // named peers
    | { group: string }                         // single group
    | { groups: string[] }                      // multiple groups
    | { role: string };                         // by role tag
  _reqId?: string;
}

interface WSMcpUndeployMessage {
  type: "mcp_undeploy";
  server_name: string;
  _reqId?: string;
}

interface WSMcpUpdateMessage {
  type: "mcp_update";
  server_name: string;
  _reqId?: string;
}

interface WSMcpLogsMessage {
  type: "mcp_logs";
  server_name: string;
  lines?: number;     // default 50, max 1000
  _reqId?: string;
}

interface WSMcpScopeMessage {
  type: "mcp_scope";
  server_name: string;
  scope?:                                       // set — omit to read current
    | "peer"
    | "mesh"
    | { peers: string[] }
    | { group: string }
    | { groups: string[] }
    | { role: string };
  _reqId?: string;
}

interface WSMcpSchemaMessage {
  type: "mcp_schema";
  server_name: string;
  tool_name?: string;  // omit for all tools
  _reqId?: string;
}

interface WSMcpCatalogMessage {
  type: "mcp_catalog";
  _reqId?: string;
}

// --- Skill deployment ---

interface WSSkillDeployMessage {
  type: "skill_deploy";
  source:
    | { type: "zip"; file_id: string }
    | { type: "git"; url: string; branch?: string; auth?: string };
  _reqId?: string;
}

// --- Vault ---

interface WSVaultSetMessage {
  type: "vault_set";
  key: string;
  ciphertext: string;   // base64
  nonce: string;         // base64
  sealed_key: string;    // base64
  entry_type: "env" | "file";
  mount_path?: string;
  description?: string;
  _reqId?: string;
}

interface WSVaultListMessage {
  type: "vault_list";
  _reqId?: string;
}

interface WSVaultDeleteMessage {
  type: "vault_delete";
  key: string;
  _reqId?: string;
}
```

### Broker → client

```typescript
// --- Service responses ---

interface WSMcpDeployStatusMessage {
  type: "mcp_deploy_status";
  server_name: string;
  status: "building" | "installing" | "running" | "failed";
  tools?: Array<{ name: string; description: string; inputSchema: object }>;
  error?: string;
  _reqId?: string;
}

interface WSMcpLogsResultMessage {
  type: "mcp_logs_result";
  server_name: string;
  lines: string[];
  _reqId?: string;
}

interface WSMcpSchemaResultMessage {
  type: "mcp_schema_result";
  server_name: string;
  tools: Array<{ name: string; description: string; inputSchema: object }>;
  _reqId?: string;
}

interface WSMcpCatalogResultMessage {
  type: "mcp_catalog_result";
  services: Array<{
    name: string;
    type: "mcp" | "skill";
    description: string;
    status: string;
    tool_count: number;
    deployed_by: string;
    scope: { type: string; [key: string]: unknown };
    source_type: string;
    runtime?: string;
    created_at: string;
  }>;
  _reqId?: string;
}

interface WSMcpScopeResultMessage {
  type: "mcp_scope_result";
  server_name: string;
  scope: { type: string; [key: string]: unknown };
  deployed_by: string;
  _reqId?: string;
}

// --- Skill responses ---

interface WSSkillDeployAckMessage {
  type: "skill_deploy_ack";
  name: string;
  files: string[];
  _reqId?: string;
}

// --- Vault responses ---

interface WSVaultAckMessage {
  type: "vault_ack";
  key: string;
  action: "stored" | "deleted" | "not_found";
  _reqId?: string;
}

interface WSVaultListResultMessage {
  type: "vault_list_result";
  entries: Array<{
    key: string;
    entry_type: "env" | "file";
    mount_path?: string;
    description?: string;
    updated_at: string;
  }>;
  _reqId?: string;
}

// --- System events (broadcast to mesh) ---

// Sent as WSPushMessage with subtype: "system"
// event: "mcp_deployed"
// eventData: { name, description, tool_count, deployed_by, scope, tools: [...] }

// event: "mcp_undeployed"
// eventData: { name, by }

// event: "mcp_crashed"
// eventData: { name, error, restarts }

// event: "mcp_updated"
// eventData: { name, prev_sha, new_sha, tools: [...] }
```
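
Every request and reply above carries an optional `_reqId`. A minimal sketch of how a client might pair replies with requests, assuming the broker echoes back the `_reqId` it received. The `PendingRequests` class and `summarize` helper are hypothetical names, and only two reply types are modeled:

```typescript
// Trimmed copies of two reply shapes from the spec.
interface WSVaultAckMessage {
  type: "vault_ack";
  key: string;
  action: "stored" | "deleted" | "not_found";
  _reqId?: string;
}

interface WSVaultListResultMessage {
  type: "vault_list_result";
  entries: Array<{ key: string; entry_type: "env" | "file"; updated_at: string }>;
  _reqId?: string;
}

type BrokerReply = WSVaultAckMessage | WSVaultListResultMessage;

// Hypothetical helper: tracks in-flight requests by _reqId.
class PendingRequests {
  private counter = 0;
  private waiting = new Map<string, (reply: BrokerReply) => void>();

  /** Registers a callback and returns the _reqId to attach to the outgoing message. */
  track(onReply: (reply: BrokerReply) => void): string {
    this.counter += 1;
    const id = `req-${this.counter}`;
    this.waiting.set(id, onReply);
    return id;
  }

  /** Hands a reply to its waiter. Returns false for unsolicited or repeated replies. */
  settle(reply: BrokerReply): boolean {
    if (reply._reqId === undefined) return false;
    const waiter = this.waiting.get(reply._reqId);
    if (!waiter) return false;
    this.waiting.delete(reply._reqId);
    waiter(reply);
    return true;
  }
}

// The `type` discriminant narrows the union in each branch.
function summarize(reply: BrokerReply): string {
  switch (reply.type) {
    case "vault_ack":
      return `${reply.key}: ${reply.action}`;
    case "vault_list_result":
      return `${reply.entries.length} entries`;
  }
}

const pending = new PendingRequests();
const reqId = pending.track((reply) => console.log(summarize(reply)));
// Outgoing message would be: { type: "vault_delete", key: "api-key", _reqId: reqId }
pending.settle({ type: "vault_ack", key: "api-key", action: "deleted", _reqId: reqId });
```

Messages without a `_reqId`, such as system events, fall through `settle` and can be dispatched separately.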

### Extended `hello_ack`

```typescript
interface WSHelloAckMessage {
  // ... existing fields ...

  /** Scope-filtered service catalog for this peer. */
  services?: Array<{
    name: string;
    description: string;
    status: string;
    tools: Array<{ name: string; description: string; inputSchema: object }>;
    deployed_by: string;
  }>;
}
```
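
At launch, the `services` array can be mapped to native MCP entries. The sketch below assumes each entry is keyed `mesh:<name>` and runs the stdio proxy as `claudemesh mcp --service <name>`, as described elsewhere in this spec. The `McpServerEntry` shape and the `toMcpEntries` helper are illustrative assumptions, not the launch implementation:

```typescript
interface CatalogService {
  name: string;
  description: string;
  status: string;
  tools: Array<{ name: string; description: string; inputSchema: object }>;
  deployed_by: string;
}

// Assumed shape of one stdio server entry under "mcpServers".
interface McpServerEntry {
  command: string;
  args: string[];
}

/** Builds one `mesh:<name>` entry per service; tolerates a missing `services` field. */
function toMcpEntries(services: CatalogService[] | undefined): Record<string, McpServerEntry> {
  const entries: Record<string, McpServerEntry> = {};
  for (const svc of services ?? []) {
    entries[`mesh:${svc.name}`] = {
      command: "claudemesh",
      args: ["mcp", "--service", svc.name],
    };
  }
  return entries;
}

const example = toMcpEntries([
  { name: "gmail", description: "Gmail tools", status: "running", tools: [], deployed_by: "Mou" },
]);
console.log(JSON.stringify(example, null, 2));
```

Because `services` is optional on `hello_ack`, an older broker that omits it simply yields no entries.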

---

## MCP tool additions (CLI)

### Service management tools

```typescript
mesh_mcp_deploy(server_name, file_id?, git_url?, git_branch?, env?, runtime?,
                memory_mb?, network_allow?, scope?)
mesh_mcp_undeploy(server_name)
mesh_mcp_update(server_name)           // git-only: pull + rebuild + restart
mesh_mcp_logs(server_name, lines?)
mesh_mcp_scope(server_name, scope?)    // set or read visibility scope
mesh_mcp_schema(server_name, tool?)    // introspect tool schemas
mesh_mcp_catalog()                     // list all services with status
mesh_skill_deploy(file_id?, git_url?, git_branch?)
```
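
`mesh_mcp_deploy` accepts both `file_id` and `git_url`, while the deploy source is a tagged union. A rough sketch of how the tool arguments might be mapped onto that union, assuming the two are mutually exclusive and that `git_branch` only makes sense for git sources (neither rule is stated explicitly here, and `resolveSource` is a hypothetical helper):

```typescript
type DeploySource =
  | { type: "zip"; fileId: string }
  | { type: "git"; url: string; branch?: string };

interface DeployArgs {
  server_name: string;
  file_id?: string;
  git_url?: string;
  git_branch?: string;
}

/** Maps flat tool arguments onto the source union, rejecting ambiguous input. */
function resolveSource(args: DeployArgs): DeploySource {
  const { file_id, git_url, git_branch } = args;
  if (file_id !== undefined && git_url !== undefined) {
    throw new Error("file_id and git_url are mutually exclusive");
  }
  if (file_id !== undefined) {
    if (git_branch !== undefined) throw new Error("git_branch requires git_url");
    return { type: "zip", fileId: file_id };
  }
  if (git_url !== undefined) {
    // Leave `branch` unset so the broker can apply its own default.
    return git_branch === undefined
      ? { type: "git", url: git_url }
      : { type: "git", url: git_url, branch: git_branch };
  }
  throw new Error("one of file_id or git_url is required");
}

console.log(resolveSource({ server_name: "gmail", git_url: "https://github.com/user/gmail-mcp.git" }));
```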

### Vault tools

```typescript
vault_set(key, value, type?, mount_path?, description?)
vault_list()
vault_delete(key)
```

### Existing tools (unchanged)

```typescript
share_skill(name, description, instructions, tags)    // inline skills
mesh_mcp_register(server_name, description, tools)     // live peer proxy
mesh_tool_call(server_name, tool_name, args)           // universal fallback
mesh_mcp_list()                                        // shows both proxy + managed
```
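
Since `mesh_mcp_list` now shows both proxy and managed servers, `mesh_tool_call` has to pick a route per call: managed services go to the runner, live-proxy servers are forwarded to the registering peer. A minimal sketch of that decision, where the `ServerRecord` and `Route` shapes are assumptions made for illustration:

```typescript
// Hypothetical registry record: managed services run on the VPS, proxies live on a peer.
type ServerRecord =
  | { kind: "managed"; name: string }
  | { kind: "proxy"; name: string; peerId: string; peerOnline: boolean };

type Route =
  | { via: "runner"; serverName: string }
  | { via: "peer"; peerId: string }
  | { via: "error"; error: string };

function routeToolCall(registry: Map<string, ServerRecord>, serverName: string): Route {
  const record = registry.get(serverName);
  if (!record) return { via: "error", error: `unknown server: ${serverName}` };
  if (record.kind === "managed") return { via: "runner", serverName };
  // Live proxies still die with their peer, so check presence before forwarding.
  if (!record.peerOnline) return { via: "error", error: `peer offline for ${serverName}` };
  return { via: "peer", peerId: record.peerId };
}

const registry = new Map<string, ServerRecord>();
registry.set("gmail", { kind: "managed", name: "gmail" });
registry.set("notes", { kind: "proxy", name: "notes", peerId: "peer-1", peerOnline: true });
console.log(routeToolCall(registry, "gmail"), routeToolCall(registry, "notes"));
```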

---

## Broker-side service manager

New file: `apps/broker/src/service-manager.ts`

### Interface

```typescript
interface ServiceManager {
  deploy(opts: {
    meshId: string;
    name: string;
    source: { type: "zip"; fileId: string }
           | { type: "git"; url: string; branch: string; auth?: string };
    config: ServiceConfig;
    vaultEntries: Array<{ key: string; ciphertext: Buffer; nonce: Buffer; sealedKey: Buffer;
                          entryType: "env" | "file"; mountPath?: string }>;
  }): Promise<{ tools: ToolDef[]; status: string }>;

  undeploy(meshId: string, name: string): Promise<void>;

  update(meshId: string, name: string): Promise<{ tools: ToolDef[]; newSha?: string }>;

  callTool(meshId: string, serverName: string, toolName: string,
           args: Record<string, unknown>): Promise<{ result?: unknown; error?: string }>;

  logs(meshId: string, name: string, lines?: number): string[];

  status(meshId: string, name: string): ServiceStatus;

  restoreAll(): Promise<void>;  // on broker boot
}
```

### Boot restore

On broker startup:
1. Query `mesh.service WHERE status IN ('running', 'crashed', 'restarting')`
2. Set all to `status='restarting'`
3. Re-spawn one runner container per mesh
4. Load each service's source and spawn child process
5. Set `status='running'` only after successful MCP `initialize` response
6. Services that fail to start → `status='failed'`, system event broadcast
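
The restore steps above can be sketched with injected dependencies, so the flow can be exercised without Docker or Postgres. `RestoreDeps` and its method names are hypothetical, and using `mcp_crashed` as the failure event is an assumption, since the steps only say that a system event is broadcast:

```typescript
type ServiceStatus = "running" | "crashed" | "restarting" | "failed";

interface ServiceRow {
  meshId: string;
  name: string;
  status: ServiceStatus;
}

// Hypothetical dependencies standing in for Docker, the DB, and the push channel.
interface RestoreDeps {
  startRunner(meshId: string): Promise<void>;
  spawnAndInitialize(row: ServiceRow): Promise<void>; // rejects if MCP `initialize` fails
  broadcast(event: string, eventData: Record<string, unknown>): void;
}

async function restoreAll(rows: ServiceRow[], deps: RestoreDeps): Promise<void> {
  // Steps 1-2: select restorable services and mark them all as restarting.
  const restorable = rows.filter(
    (r) => r.status === "running" || r.status === "crashed" || r.status === "restarting",
  );
  for (const row of restorable) row.status = "restarting";

  // Step 3: one runner container per mesh, not per service.
  const meshIds = Array.from(new Set(restorable.map((r) => r.meshId)));
  for (const meshId of meshIds) await deps.startRunner(meshId);

  // Steps 4-6: a service is "running" only after `initialize` succeeds.
  for (const row of restorable) {
    try {
      await deps.spawnAndInitialize(row);
      row.status = "running";
    } catch (err) {
      row.status = "failed";
      const error = err instanceof Error ? err.message : String(err);
      deps.broadcast("mcp_crashed", { name: row.name, error }); // event name is an assumption
    }
  }
}
```

One failing service does not stop the loop, so the remaining services in the mesh still come back up.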

---

## Security model

| Concern | Mitigation |
|---|---|
| Arbitrary code execution | Docker container, one per mesh |
| Resource exhaustion | `--memory=512m --cpus=1` per container |
| Filesystem escape | No host volume mounts |
| Secret leakage | Vault E2E encrypted, decrypted only inside container |
| Network exfiltration | `--network=mesh-restricted`, per-service allowlist |
| Malicious zip (path traversal) | Validate all paths within target dir, reject `..` |
| Git auth tokens | Stored encrypted in vault, passed via `GIT_ASKPASS` |
| Denial of service | Max 20 services per mesh, max 50MB zip, max 500MB image |
| Scope bypass | Enforced twice: the catalog is scope-filtered, and scope is re-checked on every tool call |
| OAuth token expiry | Store refresh tokens, notify deployer on persistent failure |
| Tool name collision | `svc__` prefix for mid-session dynamic tools |
| Stale MCP entries | PID check + age sweep on launch |
| Tool call timeout | `MCP_TIMEOUT=30000` set by launch (default too short for mesh chain) |
| Large tool output | `MAX_MCP_OUTPUT_TOKENS=50000` set by launch; proxy truncates if needed |
| Proxy crash | Claude Code won't auto-restart; `claudemesh doctor` diagnoses dead proxies |
| Broker restart | Proxies reconnect via BrokerClient backoff; calls return "reconnecting" during window |
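
For the malicious-zip row, a minimal sketch of entry-path validation. It works on plain strings so it stays self-contained; `resolveZipEntry` is a hypothetical helper, and a real extractor would likely also need to handle symlink entries, which this sketch does not cover:

```typescript
/**
 * Returns the extraction path for a zip entry, or throws if the entry
 * is absolute or contains a `..` segment.
 */
function resolveZipEntry(targetDir: string, entryName: string): string {
  // Treat backslashes as separators so Windows-style names cannot sneak through.
  const normalized = entryName.replace(/\\/g, "/");
  if (/^(\/|[A-Za-z]:)/.test(normalized)) {
    throw new Error(`absolute path rejected: ${entryName}`);
  }
  const segments = normalized.split("/").filter((s) => s !== "" && s !== ".");
  if (segments.some((s) => s === "..")) {
    throw new Error(`path traversal rejected: ${entryName}`);
  }
  if (segments.length === 0) {
    throw new Error(`empty entry name rejected: ${entryName}`);
  }
  return `${targetDir.replace(/\/+$/, "")}/${segments.join("/")}`;
}

console.log(resolveZipEntry("/srv/build/my-server", "src/index.ts"));
```

Rejecting any `..` segment outright is stricter than resolving and comparing prefixes, but it is simpler to reason about and matches the "reject `..`" wording in the table.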

---

## CLI commands

```bash
# Deploy from zip
claudemesh deploy ./my-server.zip --name my-server

# Deploy from git
claudemesh deploy --git https://github.com/user/repo.git --name my-server

# Deploy with vault refs
claudemesh vault set gmail-creds ~/.gmail-mcp/credentials.json --type file
claudemesh deploy --git https://github.com/user/gmail-mcp.git --name gmail \
  --env 'GMAIL_CREDENTIALS_PATH=$vault:gmail-creds:file:/secrets/creds.json' \
  --network-allow 'gmail.googleapis.com:443'

# Set access
claudemesh scope gmail --mesh                     # everyone
claudemesh scope gmail --group eng                # @eng only
claudemesh scope gmail --groups 'eng,ops'         # @eng + @ops
claudemesh scope gmail --role lead                # leads only
claudemesh scope gmail --peers 'Mou,Alejandro'   # specific peers
claudemesh scope gmail --peer                     # private (deployer only)

# Manage
claudemesh logs gmail
claudemesh update gmail              # git-only: pull + rebuild
claudemesh undeploy gmail
claudemesh catalog                   # list all services

# Skills
claudemesh skill deploy ./my-skill.zip
claudemesh skill deploy --git https://github.com/user/skill.git

# Vault
claudemesh vault set api-key "sk-abc123"
claudemesh vault set oauth-creds ~/path/to/creds.json --type file
claudemesh vault list
claudemesh vault delete api-key
```
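
The `--env` example above embeds a vault reference in the value. A minimal parsing sketch, assuming the grammar is `$vault:<key>[:<env|file>[:<mount_path>]]` (inferred from that single example) and that an omitted type means `env`. `parseEnvFlag` and the result types are hypothetical:

```typescript
interface LiteralValue { kind: "literal"; value: string }
interface VaultRef {
  kind: "vault";
  key: string;
  entryType: "env" | "file";
  mountPath?: string;
}
interface ParsedEnv { name: string; value: LiteralValue | VaultRef }

function parseEnvFlag(flag: string): ParsedEnv {
  const eq = flag.indexOf("=");
  if (eq <= 0) throw new Error(`expected NAME=value, got "${flag}"`);
  const name = flag.slice(0, eq);
  const raw = flag.slice(eq + 1);

  const prefix = "$vault:";
  if (raw.slice(0, prefix.length) !== prefix) {
    return { name, value: { kind: "literal", value: raw } };
  }

  // Everything after the second ":" is the mount path, which may itself contain ":".
  const [key, entryType, ...pathParts] = raw.slice(prefix.length).split(":");
  if (!key) throw new Error(`missing vault key in "${flag}"`);
  if (entryType === undefined) {
    return { name, value: { kind: "vault", key, entryType: "env" } }; // assumed default
  }
  if (entryType !== "env" && entryType !== "file") {
    throw new Error(`unknown vault entry type "${entryType}"`);
  }
  const ref: VaultRef = { kind: "vault", key, entryType };
  if (pathParts.length > 0) ref.mountPath = pathParts.join(":");
  return { name, value: ref };
}

console.log(parseEnvFlag("GMAIL_CREDENTIALS_PATH=$vault:gmail-creds:file:/secrets/creds.json"));
```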

---

## Migration path

| What | Before | After |
|---|---|---|
| `share_skill()` inline | works | unchanged |
| `mesh_mcp_register()` live proxy | works | unchanged, labeled "proxy" in catalog |
| Zip MCP server | not possible | `share_file` + `mesh_mcp_deploy(file_id=...)` |
| Git MCP server | not possible | `mesh_mcp_deploy(git_url=...)` |
| Zip skill bundle | not possible | `mesh_skill_deploy(file_id=...)` |
| Git skill | not possible | `mesh_skill_deploy(git_url=...)` |
| `mesh_tool_call` | forwards to peer | routes to runner OR forwards to peer |
| `mesh_mcp_list` | proxy only | shows proxy + managed, with status |
| Tool discovery | manual `mesh_mcp_list` | native MCP entries at launch + mid-session events |
| Credentials | plaintext env vars | E2E encrypted vault with `$vault:` refs |
| Access control | none (anyone can call) | per-service scopes: peer, group, role, or mesh |

All existing behavior preserved. New capabilities are additive.
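
The access-control row relies on the same scope predicate being applied twice: once when the catalog is filtered and again when a call arrives. A sketch under stated assumptions: the concrete `Scope` field names (`peers`, `groups`, `roles`) are hypothetical, since the wire type is only `{ type: string; [key: string]: unknown }`, and the rule that the deployer always keeps access is inferred from the "private (deployer only)" CLI option:

```typescript
// Hypothetical concrete scope shapes.
type Scope =
  | { type: "mesh" }
  | { type: "peer"; peers: string[] }
  | { type: "group"; groups: string[] }
  | { type: "role"; roles: string[] };

interface Peer { name: string; groups: string[]; role?: string }
interface Service { name: string; deployed_by: string; scope: Scope }

function canCall(peer: Peer, svc: Service): boolean {
  if (peer.name === svc.deployed_by) return true; // assumption: deployer always has access
  const scope = svc.scope;
  switch (scope.type) {
    case "mesh":
      return true;
    case "peer":
      return scope.peers.indexOf(peer.name) !== -1;
    case "group":
      return scope.groups.some((g) => peer.groups.indexOf(g) !== -1);
    case "role":
      return peer.role !== undefined && scope.roles.indexOf(peer.role) !== -1;
  }
}

/** First check: what the peer sees in the catalog. */
function filterCatalog(peer: Peer, services: Service[]): Service[] {
  return services.filter((svc) => canCall(peer, svc));
}

/** Second check: enforced on every call, even if the peer guessed the name. */
function authorizeCall(peer: Peer, services: Service[], serverName: string): Service {
  const svc = services.find((s) => s.name === serverName);
  // Same error for "missing" and "denied" so hidden services are not revealed.
  if (!svc || !canCall(peer, svc)) throw new Error(`unknown service: ${serverName}`);
  return svc;
}
```

Returning one error for both missing and denied services is a design choice of this sketch, not something the spec mandates.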

---

## Implementation order

### Phase 1: Foundation
1. DB migration — `mesh.service` table, `mesh.vault_entry` table, extend `mesh.skill`
2. Wire protocol — add all new message types to `types.ts`
3. Vault — broker-side storage + CLI tools (`vault_set`, `vault_list`, `vault_delete`)
4. Service catalog — `mcp_catalog`, `mcp_schema`, scope filtering in `hello_ack`

### Phase 2: Execution engine
5. Runner supervisor — `service-manager.ts`, child process spawn/kill/restart/health
6. Docker container — base image, build + run lifecycle
7. Deploy flow — zip extraction, git clone, runtime detection, `npm install` / `pip install`
8. Tool call routing — broker routes managed service calls to runner

### Phase 3: Native integration
9. Launch integration — `claudemesh launch` writes `mesh:*` MCP entries to `~/.claude.json`
10. Stdio proxy — `claudemesh mcp --service <name>` thin proxy command
11. Mid-session fallback — `svc__*` dynamic tools + `list_changed` on claudemesh MCP
12. Session cleanup — stale entry sweep, PID checks, `flock` on config writes

### Phase 4: Skill bundles
13. Skill deploy — zip/git extraction, `SKILL.md` + `skill.json` parsing, manifest storage
14. `get_skill` extension — returns structured file contents from bundle

### Phase 5: Polish
15. `mesh_mcp_update` — git pull + rebuild + restart flow
16. Boot restore — re-spawn services on broker restart
17. CLI commands — `claudemesh deploy`, `claudemesh vault`, `claudemesh scope`, `claudemesh catalog`
18. Docs + example bundles — sample MCP server zip, sample skill bundle

---

## Appendix: Claude Code MCP behavior (verified)

Key findings from Claude Code MCP architecture research that informed this
spec. These are behaviors of Claude Code itself, not the MCP protocol.

### Lifecycle
- MCP servers start when a session begins, stop when it ends
- **No auto-restart on crash** — next tool invocation fails. Our proxy must
  handle reconnection to the broker independently
- No health checks from Claude Code — failures discovered on tool use
- `MCP_TIMEOUT` env var controls tool call timeout
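
Because Claude Code will not restart a dead server, the proxy has to keep its broker connection alive on its own. A common approach is capped exponential backoff with jitter. The sketch below is a generic illustration; the function name, the 500 ms base, and the 30 s cap are assumptions and not the actual `BrokerClient` parameters:

```typescript
interface BackoffOptions {
  baseMs: number; // delay ceiling for the first retry
  capMs: number;  // upper bound regardless of attempt count
}

/**
 * Delay before reconnect attempt `attempt` (0-based).
 * Doubles per attempt up to `capMs`, then picks a point in the upper half
 * of that window so many proxies do not reconnect in lockstep.
 */
function backoffDelayMs(
  attempt: number,
  opts: BackoffOptions,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(opts.capMs, opts.baseMs * Math.pow(2, attempt));
  const half = ceiling / 2;
  return Math.floor(half + random() * half);
}

const opts: BackoffOptions = { baseMs: 500, capMs: 30_000 };
for (let attempt = 0; attempt < 8; attempt++) {
  console.log(`attempt ${attempt}: wait ${backoffDelayMs(attempt, opts)} ms`);
}
```

Passing `random` as a parameter keeps the schedule deterministic in tests.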

### Dynamic tools
- `notifications/tools/list_changed` is supported and triggers immediate
  re-fetch of `tools/list` — works mid-conversation over stdio
- **SSE/HTTP transport support for `list_changed` may be unreliable** — known
  bug in some versions. This is why we use stdio proxies, not HTTP transport.
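
For reference, `notifications/tools/list_changed` is a JSON-RPC notification, so it carries no `id`. Over stdio, MCP messages are newline-delimited JSON. A minimal sketch of building and framing the notification; the helper names are illustrative, and the server is expected to have advertised the `listChanged` capability during `initialize`:

```typescript
interface JsonRpcNotification {
  jsonrpc: "2.0";
  method: string;
  params?: Record<string, unknown>;
}

// Capability a server advertises in its `initialize` result to signal it may send the notification.
const serverCapabilities = { tools: { listChanged: true } };

function toolsListChanged(): JsonRpcNotification {
  return { jsonrpc: "2.0", method: "notifications/tools/list_changed" };
}

/** Frames a message for the stdio transport: one JSON object per line. */
function frame(message: JsonRpcNotification): string {
  // JSON.stringify without indentation never emits raw newlines.
  return JSON.stringify(message) + "\n";
}

console.log(JSON.stringify(serverCapabilities));
console.log(frame(toolsListChanged()));
```

After receiving this line, the client is expected to re-issue `tools/list`, which is where newly deployed tools would show up.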

### ToolSearch / deferred tools
- Enabled by default (`ENABLE_TOOL_SEARCH=true`)
- Only tool **names** are loaded at startup — full schemas fetched on demand
- Requires Sonnet 4+ or Opus 4+ (Haiku does not support tool references)
- Adding 100+ MCP tools has near-zero context cost at startup
- Configurable: `ENABLE_TOOL_SEARCH=auto:5` loads tool schemas upfront when they take up less than 5% of the context window

### Tool output limits
- Warning at 10,000 tokens, hard limit at 25,000 tokens (default)
- Configurable via `MAX_MCP_OUTPUT_TOKENS` env var
- Per-tool override: `_meta["anthropic/maxResultSizeChars"]` (up to 500K chars)

### Namespacing
- Tools namespaced as `mcp__servername__toolname`
- Two servers with same tool name → no conflict (different namespace)
- Server names normalized: spaces → underscores
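
A small sketch of how these names compose. Only the space-to-underscore rule listed above is implemented. The `svc__<service>__<tool>` layout for mid-session dynamic tools and the server name `claudemesh` are assumptions for illustration, since this spec fixes only the `svc__` prefix:

```typescript
/** Applies the one normalization rule documented above: spaces become underscores. */
function normalizeServerName(name: string): string {
  return name.replace(/ /g, "_");
}

/** Fully qualified tool name as Claude Code presents it. */
function claudeToolName(server: string, tool: string): string {
  return `mcp__${normalizeServerName(server)}__${tool}`;
}

/** Assumed layout for dynamic tools exposed on the claudemesh MCP server. */
function dynamicToolName(service: string, tool: string): string {
  return `svc__${service}__${tool}`;
}

// Two servers exposing the same tool name end up in different namespaces.
console.log(claudeToolName("my server", "search"));
console.log(claudeToolName("other server", "search"));

// A mid-session deploy surfaces through the existing claudemesh server instead.
console.log(claudeToolName("claudemesh", dynamicToolName("gmail", "send_email")));
```

The `svc__` prefix keeps dynamic tools from shadowing the claudemesh server's own tools such as `vault_list`.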

### Registration
- **File-based only** — no runtime API to add MCP servers
- Scopes: `local` (`~/.claude.json`), `project` (`.mcp.json`), `user` (`~/.claude.json`, global)
- Precedence: local > project > user
- `claude mcp add --scope user` for global, `--scope project` for team-shared
- **Cannot add new MCP server entries mid-session** — this is why `claudemesh
  launch` pre-writes entries before spawning, and mid-session deploys fall
  back to dynamic `svc__*` tools on the claudemesh MCP server

### Environment variables
- Passed via `--env KEY=VALUE` on `claude mcp add`
- `.mcp.json` supports `${VAR}` and `${VAR:-default}` expansion
- Special: `${CLAUDE_PLUGIN_ROOT}`, `${CLAUDE_PLUGIN_DATA}`
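
To make the two expansion forms concrete, here is an illustrative re-implementation. It is not Claude Code's code. It assumes the default applies only when the variable is unset, and that a missing variable with no default is an error:

```typescript
/** Expands `${VAR}` and `${VAR:-default}` against the given environment. */
function expandEnv(template: string, env: Record<string, string | undefined>): string {
  const pattern = /\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}/g;
  return template.replace(pattern, (_match: string, name: string, fallback?: string) => {
    const value = env[name];
    if (value !== undefined) return value;
    if (fallback !== undefined) return fallback;
    throw new Error(`environment variable ${name} is not set and has no default`);
  });
}

const env = { MESH: "dev" };
console.log(expandEnv("--mesh ${MESH} --port ${PORT:-8080}", env));
// prints "--mesh dev --port 8080"
```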

### Implications for this spec
- Native MCP entries MUST be written before `claude` spawns → `claudemesh launch` flow
- Stdio transport is the only reliable path for `list_changed` → thin proxy model
- ToolSearch means 100+ mesh tools have negligible context cost
- No server dependencies → each mesh proxy is independent
- No auto-restart → proxies must reconnect to broker on their own