alezmad/claudemesh

Alejandro Gutiérrez e1cafa54b3

feat: mesh services platform — deploy MCP servers, vaults, scopes

Add the foundation for deploying and managing MCP servers on the VPS
broker, with per-peer credential vaults and visibility scopes.

Architecture:
- One Docker container per mesh with a Node supervisor
- Each MCP server runs as a child process with its own stdio pipe
- claudemesh launch installs native MCP entries in ~/.claude.json
- Mid-session deploys fall back to svc__* dynamic tools + list_changed

New components:
- DB: mesh.service + mesh.vault_entry tables, mesh.skill extensions
- Broker: 19 wire protocol types, 11 message handlers, service catalog
  in hello_ack with scope filtering, service-manager.ts (775 lines)
- CLI: 13 tool definitions, 12 WS client methods, tool call handlers,
  startServiceProxy() for native MCP proxy mode
- Launch: catalog fetch, native MCP entry install, stale sweep, cleanup,
  MCP_TIMEOUT=30s, MAX_MCP_OUTPUT_TOKENS=50k

Security: path sanitization on service names, column whitelist on
upsertService, returning()-based delete checks, vault E2E encryption.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-08 10:53:03 +01:00

Mesh Services: MCP Servers & Skills Platform

Consolidated spec for deploying, managing, and executing MCP servers and multi-file skills within a claudemesh mesh. Covers source modes, execution engine, credential vaults, access control, native Claude Code integration, and dynamic tool discovery.

Problem

Today:

Skills are a single instructions text field in Postgres. No multi-file support.
MCP servers are live-proxied through the registering peer. When that peer disconnects, the server dies. The persistent flag is cosmetic.
Neither supports bundled artifacts (templates, configs, schemas, example code).
Claude Code has no way to discover mesh tools natively — peers must use the generic mesh_tool_call proxy.

Design goals

Three source modes — inline, zip bundle, git repo — for both skills and MCP servers
MCP servers run on the VPS, not on peers — true 24/7 persistence
Sandboxed execution with resource limits
Native Claude Code tool integration — deployed MCPs appear as regular MCP server entries
Per-peer credential vault for secrets (OAuth tokens, API keys)
Visibility scopes on services: peer, group, role, or mesh-wide. The deployer controls who can call a service; callers never see its secrets
Dynamic mid-session discovery via notifications/tools/list_changed
All existing behavior preserved — inline skills and live-proxy MCPs unchanged

Architecture overview

┌──────────────────────────────────────────────────────────────────┐
│ claudemesh launch --name Mou --mesh dev                          │
│                                                                  │
│  1. Connect to broker, authenticate                              │
│  2. Fetch service catalog (scope-filtered for this peer)          │
│  3. Write native MCP entries to ~/.claude.json:                  │
│       mesh:gmail, mesh:context7, mesh:whatsapp                   │
│  4. Spawn claude                                                 │
│  5. On exit: remove mesh:* entries                               │
└──────────┬───────────────────────────────────────────────────────┘
           │
           ▼
┌──────────────────────────────────────────────────────────────────┐
│ Claude Code session                                              │
│                                                                  │
│  MCP: claudemesh (stdio)                                         │
│  ├── send_message, list_peers, set_summary, ...  (peer comms)   │
│  ├── mesh_mcp_deploy, mesh_mcp_scope, ...        (service mgmt) │
│  ├── vault_set, vault_list, ...                  (credentials)  │
│  └── mesh_mcp_schema                             (introspection)│
│                                                                  │
│  MCP: mesh:gmail (stdio proxy)        → mcp__mesh_gmail__*      │
│  MCP: mesh:context7 (stdio proxy)     → mcp__mesh_context7__*   │
│  MCP: mesh:whatsapp (stdio proxy)     → mcp__mesh_whatsapp__*   │
│                                                                  │
│  MCP: playwriter (stdio, local)       → local MCPs as usual     │
│  MCP: figma (stdio, local)                                       │
└──────────┬───────────────────────────────────────────────────────┘
           │ Each mesh:* proxy connects via WebSocket
           ▼
┌──────────────────────────────────────────────────────────────────┐
│ Broker (VPS — wss://ic.claudemesh.com/ws)                        │
│                                                                  │
│  Existing: message routing, presence, state, memory, files, ...  │
│                                                                  │
│  New: Service Catalog                                            │
│  ├── Scope enforcement (peer/group/role/mesh visibility)         │
│  ├── Tool schema registry (from runner)                          │
│  ├── Deploy/undeploy/update commands                             │
│  └── System events: mcp_deployed, mcp_undeployed                │
│                                                                  │
│  New: Vault                                                      │
│  └── Per-peer encrypted credential storage                       │
│                                                                  │
│  Tool call routing:                                              │
│  ├── Managed service? → forward to runner                        │
│  └── Live proxy?      → forward to hosting peer (existing)       │
└──────────┬───────────────────────────────────────────────────────┘
           │ stdio (child process)
           ▼
┌──────────────────────────────────────────────────────────────────┐
│ Runner (one Docker container per mesh)                            │
│                                                                  │
│  Supervisor (Node main thread)                                   │
│  ├── stdin/stdout ↔ broker (JSON-RPC multiplexed)               │
│  ├── Routes tool calls by service name                           │
│  ├── Lifecycle: load / unload / restart                          │
│  ├── Health: MCP ping per child, restart on 3 failures          │
│  ├── Logs: 1000-line ring buffer per service                     │
│  └── Vault: decrypts credentials at spawn time                   │
│                                                                  │
│  Child processes (one per MCP server):                           │
│  ├── child_process.spawn("node", [...]) ← Node MCP servers     │
│  ├── child_process.spawn("uvx", [...])  ← Python MCP servers   │
│  ├── child_process.spawn("npx", [...])  ← npm MCP packages     │
│  │                                                               │
│  │   Each child:                                                 │
│  │   ├── Own stdio pipe (MCP protocol)                          │
│  │   ├── Own env vars (including vault-resolved secrets)        │
│  │   ├── Own /secrets/<name>/ dir (vault files)                 │
│  │   └── Killed individually on undeploy                        │
│  │                                                               │
│  Base image: node:22 + python3.12 + uv + npx                    │
│  Limits: --memory=512m --cpus=1 --network=mesh-restricted       │
└──────────────────────────────────────────────────────────────────┘

Source modes

1. Inline (existing, unchanged)

share_skill(name, description, instructions, tags)       ← text-only skill
mesh_mcp_register(server_name, description, tools)       ← live peer proxy

2. Zip bundle

Upload a zip, then deploy:

1. share_file(path="./my-server.zip", tags=["mcp-bundle"])  → fileId
2. mesh_mcp_deploy(file_id=fileId, server_name="my-server", config={...})

MCP server zip structure:

my-mcp-server/
├── package.json          # or pyproject.toml / requirements.txt
├── src/index.ts          # MCP server entry (stdio transport)
├── .env.example          # declares required env vars
└── README.md

Skill bundle zip structure:

my-skill/
├── SKILL.md              # instructions (replaces inline text)
├── skill.json            # { name, description, tags }
├── templates/            # prompt templates, examples
└── schemas/              # JSON schemas, configs

3. Git repository

mesh_mcp_deploy(
  git_url="https://github.com/user/my-mcp-server.git",
  branch="main",
  server_name="my-server",
  config={ env: { API_KEY: "$vault:my-api-key" } }
)

Shallow clone (--depth 1)
Commit SHA pinned in DB for auditability
mesh_mcp_update(server_name) → git pull + rebuild + restart
Auth via config.git_auth (stored encrypted, never logged)

Execution engine

Why child processes, not worker threads

MCP servers use stdio transport — each server owns its stdin/stdout via StdioServerTransport. Two servers can't share one process. Worker threads don't help because:

MCP SDK StdioServerTransport takes over process stdin/stdout
npx @package/mcp-server spawns its own process anyway
Python MCPs need a Python runtime, not a Node thread

The runner spawns each MCP server as a child process with its own stdio pipe, exactly how every MCP server is designed to work.

Container design: one per mesh

┌─ Docker container (mesh: "dev") ─────────────────┐
│                                                   │
│  Supervisor (Node main thread)                    │
│  ├─ stdio ↔ broker                               │
│  ├─ routes calls by service name                  │
│  │                                                │
│  ├─ spawn("npx", ["@upstash/context7-mcp"])      │
│  │   └─ stdio pipe ↔ MCP protocol                │
│  ├─ spawn("node", ["dist/index.js"])              │
│  │   └─ stdio pipe ↔ MCP protocol                │
│  ├─ spawn("uvx", ["mcp-outline"])                 │
│  │   └─ stdio pipe ↔ MCP protocol                │
│  └─ spawn("python", ["-m", "server"])             │
│      └─ stdio pipe ↔ MCP protocol                │
│                                                   │
│  Base: node:22 + python3.12 + uv + npx            │
│  Limits: --memory=512m --cpus=1                    │
│  Network: mesh-restricted bridge (allowlist)       │
└───────────────────────────────────────────────────┘

Why one container, not N:

One Docker process to manage, one cgroup for the whole mesh
One network namespace — single firewall config
Shared node_modules / pip cache across services
VPS resources: 8 vCores / 24 GB, so N containers would exhaust memory fast

Why not zero containers (bare child processes on the broker):

Broker stays routing-only — runner crashes don't take it down
Security boundary — runner can't access broker's DB or filesystem
Runner can be on a different machine later (NUC, second VPS)

Supervisor protocol

The broker and the runner communicate over the container's stdin/stdout as JSON lines (one message per line):

// Broker → runner
{ action: "load", name: "gmail", path: "/services/gmail", env: {...} }
{ action: "call", name: "gmail", tool: "search_emails", args: {...}, callId: "abc" }
{ action: "unload", name: "gmail" }
{ action: "health", name: "gmail" }
{ action: "list_tools", name: "gmail" }

// Runner → broker
{ callId: "abc", result: {...} }
{ callId: "abc", error: "connection refused" }
{ type: "loaded", name: "gmail", tools: [{name, description, inputSchema}] }
{ type: "unloaded", name: "gmail" }
{ type: "crashed", name: "gmail", restarts: 3, error: "OOM" }
{ type: "health", name: "gmail", ok: true, rssKb: 45000 }
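
The framing side of this protocol can be sketched as a small line decoder: stdout arrives in arbitrary chunks, so the reader has to buffer until a newline before parsing. This is a minimal sketch, not the actual supervisor code. `JsonLineDecoder` and `RunnerEvent` are hypothetical names, and silently skipping malformed lines is an assumption.

```typescript
// Hypothetical message shapes, mirroring the runner → broker examples above.
type RunnerEvent =
  | { callId: string; result: unknown }
  | { callId: string; error: string }
  | { type: "loaded" | "unloaded" | "crashed" | "health"; name: string; [k: string]: unknown };

// Accumulates raw stdout chunks and returns one parsed object per complete line.
class JsonLineDecoder {
  private buffer = "";

  push(chunk: string): RunnerEvent[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line ("" if the chunk ended on "\n").
    this.buffer = lines.pop() ?? "";

    const events: RunnerEvent[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue;
      try {
        events.push(JSON.parse(trimmed) as RunnerEvent);
      } catch {
        // Assumption: a malformed line is dropped rather than killing the stream.
      }
    }
    return events;
  }
}
```

A message split across two chunks is only emitted once its terminating newline arrives, which is why the decoder keeps the trailing partial line between calls.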

Runtime auto-detection

| File found | Runtime | Spawn command |
| --- | --- | --- |
| `package.json` | node | `npm install && node <main>` |
| `package.json` with npx hint | node | `npx <package>` |
| `pyproject.toml` | python | `pip install . && python -m <module>` |
| `requirements.txt` | python | `pip install -r requirements.txt && python <entry>` |
| `Bunfile` or `bun.lockb` | bun | `bun install && bun <entry>` |
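
The detection table could be expressed as a pure function over the file names found at the bundle root. This is a sketch under stated assumptions: `detectRuntime`, `SpawnPlan`, and `DetectOptions` are hypothetical names, the placeholders (`<main>`, `<module>`, `<entry>`) are kept when the caller supplies nothing, and checking the Bun lockfile before `package.json` is an assumed priority the table does not state.

```typescript
type Runtime = "node" | "python" | "bun";

interface SpawnPlan {
  runtime: Runtime;
  command: string; // shell command line, as in the table above
}

// Hypothetical options: the table leaves <main>, <module>, <entry> and
// <package> as placeholders, so the caller supplies them here.
interface DetectOptions {
  entry?: string;
  module?: string;
  npxPackage?: string;
}

function detectRuntime(files: string[], opts: DetectOptions = {}): SpawnPlan | null {
  const has = (name: string): boolean => files.includes(name);
  const entry = opts.entry ?? "<entry>";

  // Assumption: a Bun marker wins over package.json, since Bun projects have both.
  if (has("Bunfile") || has("bun.lockb")) {
    return { runtime: "bun", command: `bun install && bun ${entry}` };
  }
  if (has("package.json")) {
    return opts.npxPackage
      ? { runtime: "node", command: `npx ${opts.npxPackage}` }
      : { runtime: "node", command: `npm install && node ${opts.entry ?? "<main>"}` };
  }
  if (has("pyproject.toml")) {
    return { runtime: "python", command: `pip install . && python -m ${opts.module ?? "<module>"}` };
  }
  if (has("requirements.txt")) {
    return { runtime: "python", command: `pip install -r requirements.txt && python ${entry}` };
  }
  return null; // no known manifest: the deployer must pass an explicit runtime
}
```

Returning `null` rather than guessing leaves room for the explicit `runtime` field in the deploy config to take over.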

Health & restart

Supervisor sends MCP ping to each child every 30s
No response within 5s → mark unhealthy
3 consecutive failures → restart (kill + re-spawn)
Max 5 restarts → status=crashed, notify deployer via mesh system event
On crash: { type: "push", event: "mcp_crashed", eventData: { name, error, restarts } }
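
The failure and restart thresholds above amount to a small state machine. The sketch below tracks only the counters; the 30s ping timer and 5s timeout are left to the caller. `HealthTracker` and `HealthAction` are hypothetical names, and two details are assumptions not stated in the spec: the failure counter resets after a restart, and the restart counter is never reset on recovery.

```typescript
type HealthAction = "ok" | "unhealthy" | "restart" | "crashed";

class HealthTracker {
  private failures = 0; // consecutive failed pings
  private restarts = 0; // restarts performed so far
  private readonly maxFailures: number;
  private readonly maxRestarts: number;

  constructor(maxFailures = 3, maxRestarts = 5) {
    this.maxFailures = maxFailures;
    this.maxRestarts = maxRestarts;
  }

  // Call once per ping cycle; `pingSucceeded` is false on a 5s timeout.
  record(pingSucceeded: boolean): HealthAction {
    if (pingSucceeded) {
      this.failures = 0;
      return "ok";
    }
    this.failures += 1;
    if (this.failures < this.maxFailures) return "unhealthy";

    // Third consecutive failure: restart, unless the restart budget is spent.
    this.failures = 0;
    if (this.restarts >= this.maxRestarts) return "crashed";
    this.restarts += 1;
    return "restart";
  }
}
```

With the defaults, the sixth run of three consecutive failures yields `"crashed"`, at which point the supervisor would emit the `mcp_crashed` event.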

Logs

Per-service ring buffer (1000 lines). Captures child's stderr + stdout (excluding MCP protocol JSON). Accessible via mesh_mcp_logs(name, lines?).
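
A fixed-capacity ring buffer for this purpose might look like the sketch below. `LogRingBuffer` is a hypothetical name. The capacity of 1000 comes from the text above, and the `tail` default of 50 mirrors the `lines` default in the wire protocol.

```typescript
class LogRingBuffer {
  private readonly slots: string[];
  private next = 0; // index where the next line will be written
  private size = 0; // number of valid lines currently stored

  constructor(capacity = 1000) {
    this.slots = new Array<string>(capacity);
  }

  // Overwrites the oldest line once the buffer is full.
  append(line: string): void {
    this.slots[this.next] = line;
    this.next = (this.next + 1) % this.slots.length;
    this.size = Math.min(this.size + 1, this.slots.length);
  }

  // Returns up to `n` most recent lines, oldest first.
  tail(n = 50): string[] {
    const count = Math.max(0, Math.min(n, this.size));
    const cap = this.slots.length;
    const out: string[] = [];
    for (let i = count; i > 0; i--) {
      out.push(this.slots[(this.next - i + cap) % cap] as string);
    }
    return out;
  }
}
```

Memory stays bounded per service regardless of how chatty a child process is, which matters when every service shares one container.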

Storage layout

/var/claudemesh/services/
├── <meshId>/
│   ├── <serviceName>/
│   │   ├── source/          # extracted zip or git clone
│   │   ├── secrets/         # vault-resolved credential files
│   │   ├── node_modules/    # or .venv/ for Python
│   │   └── .meta.json       # { pid, startedAt, sha, runtime }

Network policy

Default: --network=mesh-restricted (Docker bridge with outbound deny-all).

Per-service allowlist in deploy config:

{
  "network_allow": [
    "gmail.googleapis.com:443",
    "oauth2.googleapis.com:443",
    "100.113.153.45:*"
  ]
}

Implemented via iptables rules on the bridge, or per-container --add-host entries combined with a proxy. For Tailscale-accessible services (NUC, etc.), allow the Tailscale IP.
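
Before any firewall rule is generated, the allowlist entries have to be parsed and validated. The sketch below shows one way to do that and to test a destination against the rules. `parseAllowRule` and `isAllowed` are hypothetical helpers, and exact host matching with no wildcards or DNS resolution is an assumption.

```typescript
interface AllowRule {
  host: string;
  port: number | "*"; // "*" means any port on that host
}

// Parses "host:port" or "host:*" as used in network_allow.
function parseAllowRule(entry: string): AllowRule {
  const idx = entry.lastIndexOf(":");
  if (idx <= 0) throw new Error(`invalid network_allow entry: ${entry}`);
  const host = entry.slice(0, idx);
  const portText = entry.slice(idx + 1);
  if (portText === "*") return { host, port: "*" };

  const port = Number(portText);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`invalid port in network_allow entry: ${entry}`);
  }
  return { host, port };
}

// Default deny: a destination is allowed only if some rule matches it.
function isAllowed(rules: AllowRule[], host: string, port: number): boolean {
  return rules.some((r) => r.host === host && (r.port === "*" || r.port === port));
}
```

Rejecting malformed entries at deploy time keeps a typo from silently turning into an overly broad or ineffective rule.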

Credential vault

Design

Per-peer encrypted storage on the broker. Credentials never leave the vault in plaintext — decrypted only inside the runner container at spawn time.

Peers don't share credentials. They share access to the running MCP server via scopes. The MCP server runs with the deployer's credentials; other peers call it without ever seeing the secrets.

Encryption model

Same crypto as E2E file sharing (crypto/file-crypto.ts):

Peer generates random symmetric key
Encrypts the credential with crypto_secretbox (symmetric)
Seals the symmetric key with their own pubkey (crypto_box)
Stores sealed key + ciphertext on broker — broker sees only ciphertext
At spawn time: the runner requests decryption via the deployer's sealed key; it can do so because it holds a mesh-scoped keypair that the deployer granted at deploy time

Vault reference syntax

In mesh_mcp_deploy env config, $vault: prefix triggers vault resolution:

$vault:api-key                              → inject as env var
$vault:gmail-creds:file:/secrets/creds.json → decrypt, write to file, set env var to path

Examples:

mesh_mcp_deploy({
  server_name: "gmail",
  git_url: "https://github.com/gongrzhe/server-gmail-autoauth-mcp",
  env: {
    GMAIL_CREDENTIALS_PATH: "$vault:gmail-creds:file:/secrets/credentials.json",
    GMAIL_OAUTH_PATH: "$vault:gmail-oauth:file:/secrets/gcp-oauth.keys.json",
  },
  network_allow: ["gmail.googleapis.com:443", "oauth2.googleapis.com:443"],
})
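
The two reference forms can be distinguished with a small parser, sketched below. `parseVaultRef` and `VaultRef` are hypothetical names, and requiring an absolute mount path is an assumption based on the `/secrets/...` examples.

```typescript
type VaultRef =
  | { kind: "env"; key: string }
  | { kind: "file"; key: string; path: string };

// Returns null for plain values that are not vault references.
function parseVaultRef(value: string): VaultRef | null {
  const prefix = "$vault:";
  if (!value.startsWith(prefix)) return null;

  const rest = value.slice(prefix.length);
  const marker = ":file:";
  const at = rest.indexOf(marker);

  // "$vault:<key>" → inject the decrypted value as the env var itself.
  if (at === -1) return rest ? { kind: "env", key: rest } : null;

  // "$vault:<key>:file:<path>" → write to <path>, set the env var to the path.
  const key = rest.slice(0, at);
  const path = rest.slice(at + marker.length);
  if (!key || !path.startsWith("/")) return null;
  return { kind: "file", key, path };
}
```

For example, `parseVaultRef("$vault:gmail-creds:file:/secrets/credentials.json")` yields a `file` reference with key `gmail-creds`, while a literal such as `"production"` passes through as `null` and is used unchanged.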

MCP tools

vault_set(key, value, type?, mount_path?)  — encrypt + store
  value: string (env var) or local file path (reads + encrypts the file)
  type: "env" (default) or "file"
  mount_path: for files, where to write inside the service dir

vault_list()                               — list keys (no values, metadata only)
vault_delete(key)                          — remove entry

DB schema

CREATE TABLE mesh.vault_entry (
  id          TEXT PRIMARY KEY,
  mesh_id     TEXT NOT NULL REFERENCES mesh.mesh(id) ON DELETE CASCADE,
  member_id   TEXT NOT NULL REFERENCES mesh.member(id),
  key         TEXT NOT NULL,

  -- E2E encrypted content
  ciphertext  BYTEA NOT NULL,
  nonce       BYTEA NOT NULL,
  sealed_key  BYTEA NOT NULL,    -- symmetric key sealed with peer's pubkey

  -- Metadata (plaintext)
  entry_type  TEXT DEFAULT 'env' CHECK (entry_type IN ('env', 'file')),
  mount_path  TEXT,
  description TEXT,

  created_at  TIMESTAMP DEFAULT now(),
  updated_at  TIMESTAMP DEFAULT now(),

  UNIQUE (mesh_id, member_id, key)
);

Visibility scopes

Model

Scopes control who can see and call a service. Credentials are invisible to callers — they interact with the running service, not the secrets behind it. The deployer controls visibility; the vault handles secrets separately.

Scope levels

| Scope | Who sees it | Use case |
| --- | --- | --- |
| `peer` | Only the deployer (default) | Personal tools, staging before publish |
| `{ peers: [...] }` | Named peers | Shared between specific people |
| `{ group: "eng" }` | All @eng members | Team-specific tools |
| `{ groups: ["eng", "ops"] }` | Multiple groups | Cross-team tools |
| `{ role: "lead" }` | Any peer with that role | Role-gated admin tools |
| `mesh` | Everyone in the mesh | Shared utilities |

Examples

┌─────────────────────────────────────────────────┐
│ Mesh: "dev-team"                                │
│                                                 │
│  mesh scope ─── everyone                        │
│  ├── context7         (utility)                 │
│  ├── youtube-transcript                         │
│  └── mesh-db          (shared database)         │
│                                                 │
│  group scope ─── @group members only            │
│  ├── @eng                                       │
│  │   ├── github-mcp   (eng team's GitHub)       │
│  │   └── ssh-manager  (eng infra access)        │
│  ├── @sales                                     │
│  │   ├── apollo-io    (sales CRM)               │
│  │   └── gmail        (sales@ inbox)            │
│  └── @ops                                       │
│      ├── stalwart-mail (mail server admin)       │
│      └── namecheap    (DNS management)           │
│                                                 │
│  role scope ─── by role tag                     │
│  ├── lead → mesh-admin-tools (deploy, vault)    │
│  └── observer → (read-only MCPs only)           │
│                                                 │
│  peer scope ─── only specific peers             │
│  ├── Alejandro                                  │
│  │   ├── gmail-personal  (my inbox)             │
│  │   └── gworkspace      (my workspace)         │
│  └── Mou                                        │
│      └── cursor-composer (Mou's Cursor)         │
│                                                 │
└─────────────────────────────────────────────────┘

Deploy with scope

// Mesh scope — everyone
mesh_mcp_deploy({
  server_name: "context7",
  source: { type: "git", url: "..." },
  scope: "mesh",
})

// Group scope — only @eng
mesh_mcp_deploy({
  server_name: "github-mcp",
  source: { type: "git", url: "..." },
  scope: { group: "eng" },
})

// Multi-group
mesh_mcp_deploy({
  server_name: "ssh-manager",
  scope: { groups: ["eng", "ops"] },
})

// Role scope — only leads
mesh_mcp_deploy({
  server_name: "mesh-admin",
  scope: { role: "lead" },
})

// Peer scope — just me (default)
mesh_mcp_deploy({
  server_name: "gmail-personal",
  scope: "peer",
})

// Specific peers
mesh_mcp_deploy({
  server_name: "shared-workspace",
  scope: { peers: ["Mou", "Alejandro"] },
})

Enforcement

At catalog time: broker filters the service catalog by scope before sending to peers in hello_ack. The peer's groups and role (from hello) are matched against each service's scope. A tool you can't access never appears in Claude's tool list.
At call time: broker re-checks scope before routing. Double-check in case catalog is stale or the peer's groups changed.

Scope resolution logic

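function peerCanAccess(service: Service, peer: PeerConn): boolean {
  const scope = service.scope;
  // String scopes: "peer" (deployer only) or "mesh" (everyone).
  if (typeof scope === "string") {
    if (scope === "peer") return service.deployed_by === peer.memberId;
    if (scope === "mesh") return true;
    // Unknown string: deny. Falling through would throw, because the
    // `in` operator below cannot be applied to a string.
    return false;
  }
  // Named peers, matched by member ID or display name.
  if ("peers" in scope) {
    return scope.peers.some(p =>
      p === peer.memberId || p === peer.displayName);
  }
  // Single group.
  if ("group" in scope) {
    return peer.groups.some(g => g.name === scope.group);
  }
  // Any of several groups.
  if ("groups" in scope) {
    return peer.groups.some(g => scope.groups.includes(g.name));
  }
  // Role tag held in any of the peer's groups.
  if ("role" in scope) {
    return peer.groups.some(g => g.role === scope.role);
  }
  // Unrecognized scope shape: deny by default.
  return false;
}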

MCP tools

mesh_mcp_scope(server_name, scope?)
  scope set:  mesh_mcp_scope("gmail", { group: "sales" })
  scope read: mesh_mcp_scope("gmail") → { scope, deployed_by }

Scope change events

When a scope changes, the broker:

Computes which peers gained/lost access
Sends mcp_scope_changed system event to affected peers
Peers who gained access get svc__* dynamic tools via list_changed
Peers who lost access get tools removed via list_changed
Full native access requires session restart
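
The first step, working out who gained and who lost access, is a set difference over the peers that pass the scope check before and after the change. This is a minimal sketch: `diffAccess` is a hypothetical helper, and it assumes the broker has already evaluated the old and new scope for each connected peer.

```typescript
interface AccessDiff {
  gained: string[]; // member IDs that should receive svc__* tools
  lost: string[];   // member IDs whose tools should be removed
}

// `before` and `after` hold the member IDs that pass the scope check
// under the old and new scope respectively.
function diffAccess(before: Set<string>, after: Set<string>): AccessDiff {
  const gained = Array.from(after).filter((id) => !before.has(id));
  const lost = Array.from(before).filter((id) => !after.has(id));
  return { gained, lost };
}
```

Peers present in both sets appear in neither list, so they receive no event and their tool list is left alone.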

DB

Single column on mesh.service:

scope JSONB DEFAULT '{"type": "peer"}'
-- {"type": "peer"}
-- {"type": "mesh"}
-- {"type": "peers", "allow": ["member_id_1", "member_id_2"]}
-- {"type": "group", "group": "eng"}
-- {"type": "groups", "groups": ["eng", "ops"]}
-- {"type": "role", "role": "lead"}
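
The deploy API accepts a shorthand scope (`"peer"`, `{ group: "eng" }`, ...) while the column stores a tagged object, so some normalization step has to sit between the two. The sketch below shows one possible mapping. `toStoredScope`, `WireScope`, and `StoredScope` are hypothetical names, and resolving display names in `peers` to member IDs is deliberately left out.

```typescript
type WireScope =
  | "peer"
  | "mesh"
  | { peers: string[] }
  | { group: string }
  | { groups: string[] }
  | { role: string };

type StoredScope =
  | { type: "peer" }
  | { type: "mesh" }
  | { type: "peers"; allow: string[] }
  | { type: "group"; group: string }
  | { type: "groups"; groups: string[] }
  | { type: "role"; role: string };

// Converts the shorthand form into the tagged JSONB form shown above.
// An omitted scope falls back to "peer", matching the column default.
function toStoredScope(scope: WireScope = "peer"): StoredScope {
  if (scope === "peer") return { type: "peer" };
  if (scope === "mesh") return { type: "mesh" };
  // Assumption: `peers` already contains member IDs at this point.
  if ("peers" in scope) return { type: "peers", allow: scope.peers };
  if ("group" in scope) return { type: "group", group: scope.group };
  if ("groups" in scope) return { type: "groups", groups: scope.groups };
  return { type: "role", role: scope.role };
}
```

Storing a single tagged shape means the catalog and scope result messages can expose `scope` as `{ type: string; ... }` without special-casing strings.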

Future: cross-mesh scope

Not for v1. Each mesh is isolated. The schema supports it later:

{"type": "cross_mesh", "meshes": ["dev", "staging"]}

With this, a service deployed in dev would also be visible in staging. That requires the runner to be reachable from both meshes, which is possible because it runs on the VPS.

Native Claude Code integration

Goal

Deployed mesh MCPs feel indistinguishable from locally installed MCP servers. Claude sees mcp__mesh_gmail__search_emails — not mesh_tool_call("gmail", ...).

At session start: native MCP entries

claudemesh launch queries the broker for the scope-filtered service catalog and installs each service as a native MCP entry before spawning Claude:

// commands/launch.ts — extended flow

// Step 3 (new): fetch service catalog from broker
const catalog = await fetchServiceCatalog(mesh);

// Step 4 (new): write mesh MCP entries to ~/.claude.json
for (const service of catalog) {
  addMcpEntry(`mesh:${service.name}`, {
    command: "claudemesh",
    args: ["mcp", "--service", service.name],
  });
}

// Step 5: spawn claude with mesh-aware env
const child = spawn("claude", claudeArgs, {
  env: {
    ...process.env,
    CLAUDEMESH_CONFIG_DIR: tmpDir,
    CLAUDEMESH_DISPLAY_NAME: displayName,
    // Mesh calls traverse: proxy → WS → broker → runner → child.
    // Default MCP timeout is too short for this chain.
    MCP_TIMEOUT: process.env.MCP_TIMEOUT ?? "30000",
    // Mesh MCPs may return large results (DB queries, file contents).
    MAX_MCP_OUTPUT_TOKENS: process.env.MAX_MCP_OUTPUT_TOKENS ?? "50000",
  },
});

// Step 6 (extended): cleanup mesh:* entries on exit
child.on("exit", () => {
  removeMcpEntries("mesh:*");
  cleanup();  // existing tmpdir cleanup
});

Each claudemesh mcp --service <name> is a thin stdio proxy:

// Thin proxy: connects to broker, serves ONE service's tools
const client = new BrokerClient(mesh);
await client.connect();
const tools = await client.getServiceTools(serviceName);

server.setRequestHandler(ListToolsRequestSchema, () => ({ tools }));
server.setRequestHandler(CallToolRequestSchema, async (req) => {
  // Wait for broker reconnection if WS is down (up to 10s)
  if (client.status !== "open") {
    const connected = await client.waitForConnection(10_000);
    if (!connected) {
      return text("Service temporarily unavailable — broker reconnecting. Retry in a few seconds.", true);
    }
  }
  return await client.mcpCall(serviceName, req.params.name, req.params.arguments);
});

Resilience notes:

The BrokerClient handles WS reconnection with exponential backoff (1s→30s)
Claude Code does NOT auto-restart crashed MCP servers — if the proxy process itself dies, those tools vanish until session restart
The proxy should catch all exceptions and return MCP errors, never crash
claudemesh doctor diagnoses dead proxy processes mid-session
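
The reconnect schedule mentioned in the first note could be computed as below. This is a sketch: `backoffDelayMs` is a hypothetical helper, and the doubling factor is an assumption, since the notes only give the 1s floor and the 30s cap.

```typescript
// Delay before reconnect attempt `attempt` (0-based), in milliseconds.
// Assumption: the delay doubles each attempt, clamped to [baseMs, capMs].
function backoffDelayMs(attempt: number, baseMs = 1_000, capMs = 30_000): number {
  const exponent = Math.max(0, Math.floor(attempt));
  return Math.min(capMs, baseMs * Math.pow(2, exponent));
}
```

Under these assumptions the first six attempts wait 1s, 2s, 4s, 8s, 16s, and 30s. Since the proxy waits at most 10s for a connection during a tool call, calls made late in the schedule would get the "temporarily unavailable" reply instead of blocking.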

Result: Claude Code starts and sees:

mcp__mesh_gmail__search_emails         ← proper namespace, full schema
mcp__mesh_gmail__send_email            ← deferred by ToolSearch automatically
mcp__mesh_context7__query_docs         ← native MCP, no indirection

Session management

Safe ~/.claude.json modification:

~/.claude.json stores MCP entries AND other Claude Code config (permissions, env vars, etc.). Never overwrite the whole file.
Read-modify-write: load full JSON → add/remove only mesh:* keys in mcpServers → write back. Preserve all other keys.
Use flock on writes to prevent concurrent session corruption.
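
The read-modify-write rule can be isolated in a pure function that takes the parsed config and returns a new one, leaving file I/O and locking to the caller. This is a sketch under stated assumptions: `applyMeshEntries`, `ClaudeConfig`, and `McpEntry` are hypothetical names, and only the `mcpServers` key named above is assumed about the file's layout.

```typescript
interface McpEntry {
  command: string;
  args: string[];
  [extra: string]: unknown;
}

interface ClaudeConfig {
  mcpServers?: Record<string, McpEntry>;
  [other: string]: unknown; // permissions, env vars, etc. are passed through
}

// Returns a copy of `config` with mesh:* entries added or removed.
// Keys outside the mesh: namespace are never touched.
function applyMeshEntries(
  config: ClaudeConfig,
  add: Record<string, McpEntry>,
  remove: string[] = [],
): ClaudeConfig {
  const servers: Record<string, McpEntry> = { ...(config.mcpServers ?? {}) };
  for (const name of remove) {
    if (name.startsWith("mesh:")) delete servers[name];
  }
  for (const [name, entry] of Object.entries(add)) {
    if (!name.startsWith("mesh:")) throw new Error(`refusing to write non-mesh entry: ${name}`);
    servers[name] = entry;
  }
  return { ...config, mcpServers: servers };
}
```

Because the function never mutates its input, the caller can hold the lock only around the read, this call, and the write.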

Stale entry cleanup:

Each mesh:* entry includes _meshSession metadata with PID and timestamp
claudemesh launch sweeps stale entries on startup (dead PID check)
claudemesh doctor reports orphaned entries
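
The stale sweep can be sketched as a filter over the `mcpServers` map with an injectable liveness check. `findStaleEntries` and the `_meshSession` field shape (`pid`, `startedAt`) are assumptions based on the description above. Entries without metadata are left alone here, which is also an assumption.

```typescript
interface MeshSessionMeta {
  pid: number;
  startedAt: number; // epoch milliseconds
}

interface MeshEntry {
  command: string;
  args: string[];
  _meshSession?: MeshSessionMeta;
}

// Returns the names of mesh:* entries whose owning process is gone.
function findStaleEntries(
  servers: Record<string, MeshEntry>,
  isAlive: (pid: number) => boolean,
): string[] {
  return Object.entries(servers)
    .filter(([name, entry]) =>
      name.startsWith("mesh:") &&
      entry._meshSession !== undefined &&
      !isAlive(entry._meshSession.pid))
    .map(([name]) => name);
}
```

In Node, one plausible `isAlive` is a wrapper around `process.kill(pid, 0)`, which sends no signal and throws if the process does not exist. Injecting the check keeps the sweep logic testable without real processes.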

Concurrent sessions:

Entries are session-scoped: mesh:gmail:w1t0p0 (includes session ID)
Each session manages only its own entries

Mid-session deploys: dynamic tools

When a service is deployed after the Claude session started, native MCP entries can't be added (Claude Code doesn't support adding new MCP servers mid-session).

Two-tier fallback:

1. Claudemesh MCP fires notifications/tools/list_changed (stdio, proven to work)
   - Adds svc__<name>__<tool> tools to its own tools/list
   - Claude sees them as mcp__claudemesh__svc__gmail__search_emails
   - Works, but namespacing is less clean than native

2. System notification tells the peer:

[mesh] Service deployed: "namecheap" by Alejandro (3 tools).
Available now via mesh_tool_call("namecheap", "domains_list", {...}).
Restart session for native mcp__mesh_namecheap__* access.

mesh_tool_call remains the universal fallback — works for any service at any time, native or not.
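
The `svc__<name>__<tool>` convention needs a builder and a matching parser so the claudemesh MCP can route a dynamic tool call back to the right service. This is a minimal sketch: `toDynamicToolName` and `parseDynamicToolName` are hypothetical helpers, and it assumes service names never contain a double underscore.

```typescript
const SVC_PREFIX = "svc__";

// Builds the name advertised in tools/list for a mid-session service tool.
function toDynamicToolName(service: string, tool: string): string {
  return `${SVC_PREFIX}${service}__${tool}`;
}

// Splits a dynamic tool name back into service and tool.
// Returns null for regular claudemesh tools such as send_message.
function parseDynamicToolName(name: string): { service: string; tool: string } | null {
  if (!name.startsWith(SVC_PREFIX)) return null;
  const rest = name.slice(SVC_PREFIX.length);
  // Assumption: the first "__" separates service from tool.
  const sep = rest.indexOf("__");
  if (sep <= 0 || sep + 2 >= rest.length) return null;
  return { service: rest.slice(0, sep), tool: rest.slice(sep + 2) };
}
```

A call to `svc__gmail__search_emails` would parse to service `gmail` and tool `search_emails`, which is the same pair `mesh_tool_call` takes.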

Mid-session undeploys

When a service is undeployed, the native proxy process detects the broker event and exits gracefully. Claude Code sees the MCP server disconnect and stops offering those tools. No list_changed needed — MCP server death is already handled.

Schema introspection

For programmatic access to tool schemas (building workflows, debugging):

mesh_mcp_schema(server_name)                → all tools with full inputSchema
mesh_mcp_schema(server_name, tool_name)     → one specific tool's schema
mesh_mcp_catalog()                          → all services with tool counts, scope, status

Database changes

New table: `mesh.service`

CREATE TABLE mesh.service (
  id              TEXT PRIMARY KEY,
  mesh_id         TEXT NOT NULL REFERENCES mesh.mesh(id) ON DELETE CASCADE,
  name            TEXT NOT NULL,
  type            TEXT NOT NULL CHECK (type IN ('mcp', 'skill')),

  -- Source
  source_type     TEXT NOT NULL CHECK (source_type IN ('inline', 'zip', 'git')),
  source_file_id  TEXT REFERENCES mesh.file(id),
  source_git_url  TEXT,
  source_git_branch TEXT DEFAULT 'main',
  source_git_sha  TEXT,
  prev_git_sha    TEXT,                    -- for rollback

  -- Content
  description     TEXT NOT NULL,
  instructions    TEXT,                    -- skills only
  tools_schema    JSONB,                   -- MCPs: [{ name, description, inputSchema }]

  -- Bundle
  manifest        JSONB,                   -- { files: [...], entry: "src/index.ts" }

  -- Execution (MCPs only)
  runtime         TEXT CHECK (runtime IN ('node', 'python', 'bun')),  -- NULL still allowed; a NULL in the IN list would let any value pass
  status          TEXT DEFAULT 'stopped'
                  CHECK (status IN ('building', 'installing', 'running',
                                    'stopped', 'failed', 'crashed', 'restarting')),
  config          JSONB DEFAULT '{}',      -- resource limits, network policy
  last_health     TIMESTAMP,
  restart_count   INT DEFAULT 0,
  version         INT DEFAULT 1,

  -- Visibility scope
  scope           JSONB DEFAULT '{"type": "peer"}',

  -- Metadata
  deployed_by     TEXT REFERENCES mesh.member(id),
  deployed_by_name TEXT,
  created_at      TIMESTAMP DEFAULT now() NOT NULL,
  updated_at      TIMESTAMP DEFAULT now() NOT NULL,

  UNIQUE (mesh_id, name)
);

New table: `mesh.vault_entry`

CREATE TABLE mesh.vault_entry (
  id          TEXT PRIMARY KEY,
  mesh_id     TEXT NOT NULL REFERENCES mesh.mesh(id) ON DELETE CASCADE,
  member_id   TEXT NOT NULL REFERENCES mesh.member(id),
  key         TEXT NOT NULL,
  ciphertext  BYTEA NOT NULL,
  nonce       BYTEA NOT NULL,
  sealed_key  BYTEA NOT NULL,
  entry_type  TEXT DEFAULT 'env' CHECK (entry_type IN ('env', 'file')),
  mount_path  TEXT,
  description TEXT,
  created_at  TIMESTAMP DEFAULT now(),
  updated_at  TIMESTAMP DEFAULT now(),
  UNIQUE (mesh_id, member_id, key)
);

Extend `mesh.skill` (backward compat)

ALTER TABLE mesh.skill
  ADD COLUMN source_type TEXT DEFAULT 'inline'
    CHECK (source_type IN ('inline', 'zip', 'git')),
  ADD COLUMN bundle_file_id TEXT REFERENCES mesh.file(id),
  ADD COLUMN git_url TEXT,
  ADD COLUMN git_branch TEXT DEFAULT 'main',
  ADD COLUMN git_sha TEXT,
  ADD COLUMN manifest JSONB;

Wire protocol additions

Client → broker

// --- Service deployment ---

interface WSMcpDeployMessage {
  type: "mcp_deploy";
  server_name: string;
  source:
    | { type: "zip"; file_id: string }
    | { type: "git"; url: string; branch?: string; auth?: string };
  config?: {
    env?: Record<string, string>;    // supports $vault: refs
    memory_mb?: number;              // default 256
    cpus?: number;                   // default 0.5
    network_allow?: string[];        // default: none
    runtime?: "node" | "python" | "bun";
  };
  scope?:
    | "peer"                                    // private (default)
    | "mesh"                                    // everyone
    | { peers: string[] }                       // named peers
    | { group: string }                         // single group
    | { groups: string[] }                      // multiple groups
    | { role: string };                         // by role tag
  _reqId?: string;
}

interface WSMcpUndeployMessage {
  type: "mcp_undeploy";
  server_name: string;
  _reqId?: string;
}

interface WSMcpUpdateMessage {
  type: "mcp_update";
  server_name: string;
  _reqId?: string;
}

interface WSMcpLogsMessage {
  type: "mcp_logs";
  server_name: string;
  lines?: number;     // default 50, max 1000
  _reqId?: string;
}

interface WSMcpScopeMessage {
  type: "mcp_scope";
  server_name: string;
  scope?:                                       // set — omit to read current
    | "peer"
    | "mesh"
    | { peers: string[] }
    | { group: string }
    | { groups: string[] }
    | { role: string };
  _reqId?: string;
}

interface WSMcpSchemaMessage {
  type: "mcp_schema";
  server_name: string;
  tool_name?: string;  // omit for all tools
  _reqId?: string;
}

interface WSMcpCatalogMessage {
  type: "mcp_catalog";
  _reqId?: string;
}

// --- Skill deployment ---

interface WSSkillDeployMessage {
  type: "skill_deploy";
  source:
    | { type: "zip"; file_id: string }
    | { type: "git"; url: string; branch?: string; auth?: string };
  _reqId?: string;
}

// --- Vault ---

interface WSVaultSetMessage {
  type: "vault_set";
  key: string;
  ciphertext: string;   // base64
  nonce: string;         // base64
  sealed_key: string;    // base64
  entry_type: "env" | "file";
  mount_path?: string;
  description?: string;
  _reqId?: string;
}

interface WSVaultListMessage {
  type: "vault_list";
  _reqId?: string;
}

interface WSVaultDeleteMessage {
  type: "vault_delete";
  key: string;
  _reqId?: string;
}

Broker → client

// --- Service responses ---

interface WSMcpDeployStatusMessage {
  type: "mcp_deploy_status";
  server_name: string;
  status: "building" | "installing" | "running" | "failed";
  tools?: Array<{ name: string; description: string; inputSchema: object }>;
  error?: string;
  _reqId?: string;
}

interface WSMcpLogsResultMessage {
  type: "mcp_logs_result";
  server_name: string;
  lines: string[];
  _reqId?: string;
}

interface WSMcpSchemaResultMessage {
  type: "mcp_schema_result";
  server_name: string;
  tools: Array<{ name: string; description: string; inputSchema: object }>;
  _reqId?: string;
}

interface WSMcpCatalogResultMessage {
  type: "mcp_catalog_result";
  services: Array<{
    name: string;
    type: "mcp" | "skill";
    description: string;
    status: string;
    tool_count: number;
    deployed_by: string;
    scope: { type: string; [key: string]: unknown };
    source_type: string;
    runtime?: string;
    created_at: string;
  }>;
  _reqId?: string;
}

interface WSMcpScopeResultMessage {
  type: "mcp_scope_result";
  server_name: string;
  scope: { type: string; [key: string]: unknown };
  deployed_by: string;
  _reqId?: string;
}

// --- Skill responses ---

interface WSSkillDeployAckMessage {
  type: "skill_deploy_ack";
  name: string;
  files: string[];
  _reqId?: string;
}

// --- Vault responses ---

interface WSVaultAckMessage {
  type: "vault_ack";
  key: string;
  action: "stored" | "deleted" | "not_found";
  _reqId?: string;
}

interface WSVaultListResultMessage {
  type: "vault_list_result";
  entries: Array<{
    key: string;
    entry_type: "env" | "file";
    mount_path?: string;
    description?: string;
    updated_at: string;
  }>;
  _reqId?: string;
}

// --- System events (broadcast to mesh) ---

// Sent as WSPushMessage with subtype: "system"
// event: "mcp_deployed"
// eventData: { name, description, tool_count, deployed_by, scope, tools: [...] }

// event: "mcp_undeployed"
// eventData: { name, by }

// event: "mcp_crashed"
// eventData: { name, error, restarts }

// event: "mcp_updated"
// eventData: { name, prev_sha, new_sha, tools: [...] }
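
Every request and response above carries an optional `_reqId`. The following is a minimal sketch of how a client might pair responses with in-flight requests, assuming the broker echoes `_reqId` back unchanged. The `PendingRequests` class and its method names are hypothetical and are not part of the CLI described in this spec.

```typescript
// Hypothetical helper: correlates broker responses with in-flight requests via _reqId.
type WireMessage = { type: string; _reqId?: string; [key: string]: unknown };

class PendingRequests {
  private nextId = 0;
  private waiting = new Map<string, (msg: WireMessage) => void>();

  /** Tags an outgoing message with a fresh _reqId and returns a promise for the reply. */
  track(msg: WireMessage): { tagged: WireMessage; reply: Promise<WireMessage> } {
    const id = `req-${++this.nextId}`;
    const reply = new Promise<WireMessage>((resolve) => {
      this.waiting.set(id, resolve);
    });
    return { tagged: { ...msg, _reqId: id }, reply };
  }

  /** Routes an incoming message; returns false for untagged or unknown ones (e.g. pushes). */
  resolve(msg: WireMessage): boolean {
    if (!msg._reqId) return false;
    const waiter = this.waiting.get(msg._reqId);
    if (!waiter) return false;
    this.waiting.delete(msg._reqId);
    waiter(msg);
    return true;
  }
}
```

System events arrive without a matching `_reqId`, so `resolve` returns `false` for them and the caller can hand them to a separate push handler.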

Extended `hello_ack`

interface WSHelloAckMessage {
  // ... existing fields ...

  /** Scope-filtered service catalog for this peer. */
  services?: Array<{
    name: string;
    description: string;
    status: string;
    tools: Array<{ name: string; description: string; inputSchema: object }>;
    deployed_by: string;
  }>;
}

MCP tool additions (CLI)

Service management tools

mesh_mcp_deploy(server_name, file_id?, git_url?, git_branch?, env?, runtime?,
                memory_mb?, network_allow?, scope?)
mesh_mcp_undeploy(server_name)
mesh_mcp_update(server_name)           // git-only: pull + rebuild + restart
mesh_mcp_logs(server_name, lines?)
mesh_mcp_scope(server_name, scope?)    // set or read visibility scope
mesh_mcp_schema(server_name, tool?)    // introspect tool schemas
mesh_mcp_catalog()                     // list all services with status
mesh_skill_deploy(file_id?, git_url?, git_branch?)

Vault tools

vault_set(key, value, type?, mount_path?, description?)
vault_list()
vault_delete(key)

Existing tools (unchanged)

share_skill(name, description, instructions, tags)    // inline skills
mesh_mcp_register(server_name, description, tools)     // live peer proxy
mesh_tool_call(server_name, tool_name, args)           // universal fallback
mesh_mcp_list()                                        // shows both proxy + managed

Broker-side service manager

New file: apps/broker/src/service-manager.ts

Interface

interface ServiceManager {
  deploy(opts: {
    meshId: string;
    name: string;
    source: { type: "zip"; fileId: string }
           | { type: "git"; url: string; branch: string; auth?: string };
    config: ServiceConfig;
    vaultEntries: Array<{ key: string; ciphertext: Buffer; nonce: Buffer; sealedKey: Buffer;
                          entryType: "env" | "file"; mountPath?: string }>;
  }): Promise<{ tools: ToolDef[]; status: string }>;

  undeploy(meshId: string, name: string): Promise<void>;

  update(meshId: string, name: string): Promise<{ tools: ToolDef[]; newSha?: string }>;

  callTool(meshId: string, serverName: string, toolName: string,
           args: Record<string, unknown>): Promise<{ result?: unknown; error?: string }>;

  logs(meshId: string, name: string, lines?: number): string[];

  status(meshId: string, name: string): ServiceStatus;

  restoreAll(): Promise<void>;  // on broker boot
}

Boot restore

On broker startup:

Query mesh.service WHERE status IN ('running', 'crashed', 'restarting')
Set all to status='restarting'
Re-spawn runner container per mesh
Load each service's source and spawn child process
Set status='running' only after successful MCP initialize response
Services that fail to start → status='failed', system event broadcast
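
The restore sequence above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `service-manager.ts`: the `RestoreDeps` callbacks are hypothetical stand-ins for Docker and child-process handling, rows are mutated in memory instead of in Postgres, and the `mcp_crashed` event name on failure is an assumption because the spec only says a system event is broadcast.

```typescript
// Sketch of boot restore; RestoreDeps and ServiceRow are hypothetical illustration types.
type ServiceStatus =
  | "building" | "installing" | "running" | "crashed" | "restarting" | "failed";

interface ServiceRow { meshId: string; name: string; status: ServiceStatus }

interface RestoreDeps {
  /** Starts (or reuses) the runner container for a mesh. */
  ensureRunner(meshId: string): Promise<void>;
  /** Spawns the child process; resolves once MCP `initialize` has succeeded. */
  spawnAndInitialize(row: ServiceRow): Promise<void>;
  /** Broadcasts a system event to the mesh. */
  broadcast(meshId: string, event: string, data: Record<string, unknown>): void;
}

async function restoreAll(rows: ServiceRow[], deps: RestoreDeps): Promise<void> {
  const candidates = rows.filter((r) =>
    ["running", "crashed", "restarting"].includes(r.status));

  // Mark everything as restarting before any process is spawned.
  for (const row of candidates) row.status = "restarting";

  // One runner container per mesh.
  const meshIds = Array.from(new Set(candidates.map((r) => r.meshId)));
  for (const meshId of meshIds) await deps.ensureRunner(meshId);

  for (const row of candidates) {
    try {
      await deps.spawnAndInitialize(row);
      row.status = "running"; // only after a successful initialize response
    } catch (err) {
      row.status = "failed";
      // Event name is an assumption; the spec does not name the failure event.
      deps.broadcast(row.meshId, "mcp_crashed", { name: row.name, error: String(err) });
    }
  }
}
```

Services already in `failed` state are not selected, so a broken deploy is not retried on every broker boot.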

Security model

| Concern | Mitigation |
| --- | --- |
| Arbitrary code execution | Docker container, one per mesh |
| Resource exhaustion | `--memory=512m --cpus=1` per container |
| Filesystem escape | No host volume mounts |
| Secret leakage | Vault E2E encrypted, decrypted only inside container |
| Network exfiltration | `--network=mesh-restricted`, per-service allowlist |
| Malicious zip (path traversal) | Validate all paths within target dir, reject `..` |
| Git auth tokens | Stored encrypted in vault, passed via `GIT_ASKPASS` |
| Denial of service | Max 20 services per mesh, max 50MB zip, max 500MB image |
| Scope bypass | Double-check: filter catalog + check on call |
| OAuth token expiry | Store refresh tokens, notify deployer on persistent failure |
| Tool name collision | `svc__` prefix for mid-session dynamic tools |
| Stale MCP entries | PID check + age sweep on launch |
| Tool call timeout | `MCP_TIMEOUT=30000` set by launch (default too short for mesh chain) |
| Large tool output | `MAX_MCP_OUTPUT_TOKENS=50000` set by launch; proxy truncates if needed |
| Proxy crash | Claude Code won't auto-restart; `claudemesh doctor` diagnoses dead proxies |
| Broker restart | Proxies reconnect via BrokerClient backoff; calls return "reconnecting" during window |
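
As an illustration of the path-traversal mitigation above, a zip entry name can be checked before anything is written to disk. This is a sketch, not the broker's extraction code: `resolveZipEntry` is a hypothetical name, and it relies only on Node's built-in `path` module.

```typescript
import * as path from "node:path";

/**
 * Returns the absolute path an entry would be written to,
 * or throws if the entry contains `..` or would land outside targetDir.
 */
function resolveZipEntry(targetDir: string, entryName: string): string {
  // Reject any `..` segment outright, whichever separator the archive used.
  if (entryName.split(/[\\/]/).includes("..")) {
    throw new Error(`zip entry contains "..": ${entryName}`);
  }
  const root = path.resolve(targetDir);
  const resolved = path.resolve(root, entryName);
  // Absolute entry names resolve outside root and are caught here.
  if (resolved !== root && !resolved.startsWith(root + path.sep)) {
    throw new Error(`zip entry escapes target dir: ${entryName}`);
  }
  return resolved;
}
```

Comparing against `root + path.sep` rather than `root` alone avoids accepting a sibling directory such as `/srv/svc-evil` when the target is `/srv/svc`.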

CLI commands

# Deploy from zip
claudemesh deploy ./my-server.zip --name my-server

# Deploy from git
claudemesh deploy --git https://github.com/user/repo.git --name my-server

# Deploy with vault refs
claudemesh vault set gmail-creds ~/.gmail-mcp/credentials.json --type file
claudemesh deploy --git https://github.com/user/gmail-mcp.git --name gmail \
  --env 'GMAIL_CREDENTIALS_PATH=$vault:gmail-creds:file:/secrets/creds.json' \
  --network-allow 'gmail.googleapis.com:443'

# Set access
claudemesh scope gmail --mesh                     # everyone
claudemesh scope gmail --group eng                # @eng only
claudemesh scope gmail --groups 'eng,ops'         # @eng + @ops
claudemesh scope gmail --role lead                # leads only
claudemesh scope gmail --peers 'Mou,Alejandro'   # specific peers
claudemesh scope gmail --peer                     # private (deployer only)

# Manage
claudemesh logs gmail
claudemesh update gmail              # git-only: pull + rebuild
claudemesh undeploy gmail
claudemesh catalog                   # list all services

# Skills
claudemesh skill deploy ./my-skill.zip
claudemesh skill deploy --git https://github.com/user/skill.git

# Vault
claudemesh vault set api-key "sk-abc123"
claudemesh vault set oauth-creds ~/path/to/creds.json --type file
claudemesh vault list
claudemesh vault delete api-key
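
The `--env` value in the deploy example uses a `$vault:` reference. A minimal parser sketch for that syntax, based only on the one form shown above (`$vault:<key>:file:<mount path>`): the bare `$vault:<key>` form for env-type entries is an assumption, and `parseVaultRef` and `VaultRef` are hypothetical names.

```typescript
type VaultRef =
  | { key: string; entryType: "env" }
  | { key: string; entryType: "file"; mountPath: string };

/** Returns null for plain literals; throws on a malformed `$vault:` reference. */
function parseVaultRef(value: string): VaultRef | null {
  const prefix = "$vault:";
  if (!value.startsWith(prefix)) return null; // ordinary env value

  const body = value.slice(prefix.length);
  const sep = body.indexOf(":");
  if (sep === -1) {
    if (body === "") throw new Error(`malformed vault reference: ${value}`);
    return { key: body, entryType: "env" }; // assumed short form
  }

  const key = body.slice(0, sep);
  const rest = body.slice(sep + 1);
  const fileTag = "file:";
  if (key === "" || !rest.startsWith(fileTag) || rest.length === fileTag.length) {
    throw new Error(`malformed vault reference: ${value}`);
  }
  // Everything after "file:" is the mount path, so paths may contain colons.
  return { key, entryType: "file", mountPath: rest.slice(fileTag.length) };
}
```

For the example above, `parseVaultRef("$vault:gmail-creds:file:/secrets/creds.json")` yields the key `gmail-creds`, entry type `file`, and mount path `/secrets/creds.json`.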

Migration path

| What | Before | After |
| --- | --- | --- |
| `share_skill()` inline | works | unchanged |
| `mesh_mcp_register()` live proxy | works | unchanged, labeled "proxy" in catalog |
| Zip MCP server | not possible | `share_file` + `mesh_mcp_deploy` |
| Git MCP server | not possible | `mesh_mcp_deploy(git_url=...)` |
| Zip skill bundle | not possible | `mesh_skill_deploy(file_id=...)` |
| Git skill | not possible | `mesh_skill_deploy(git_url=...)` |
| `mesh_tool_call` | forwards to peer | routes to runner OR forwards to peer |
| `mesh_mcp_list` | proxy only | shows proxy + managed, with status |
| Tool discovery | manual `mesh_mcp_list` | native MCP entries at launch + mid-session events |
| Credentials | plaintext env vars | E2E encrypted vault with `$vault:` refs |
| Access control | none (anyone can call) | Scopes: peer/group/role/mesh per service |

All existing behavior preserved. New capabilities are additive.

Implementation order

Phase 1: Foundation

DB migration — mesh.service table, mesh.vault_entry table, extend mesh.skill
Wire protocol — add all new message types to types.ts
Vault — broker-side storage + CLI tools (vault_set, vault_list, vault_delete)
Service catalog — mcp_catalog, mcp_schema, scope filtering in hello_ack

Phase 2: Execution engine

Runner supervisor — service-manager.ts, child process spawn/kill/restart/health
Docker container — base image, build + run lifecycle
Deploy flow — zip extraction, git clone, runtime detection, npm install / pip install
Tool call routing — broker routes managed service calls to runner

Phase 3: Native integration

Launch integration — claudemesh launch writes mesh:* MCP entries to ~/.claude.json
Stdio proxy — claudemesh mcp --service <name> thin proxy command
Mid-session fallback — svc__* dynamic tools + list_changed on claudemesh MCP
Session cleanup — stale entry sweep, PID checks, flock on config writes

Phase 4: Skill bundles

Skill deploy — zip/git extraction, SKILL.md + skill.json parsing, manifest storage
get_skill extension — returns structured file contents from bundle

Phase 5: Polish

mesh_mcp_update — git pull + rebuild + restart flow
Boot restore — re-spawn services on broker restart
CLI commands — claudemesh deploy, claudemesh vault, claudemesh scope, claudemesh catalog
Docs + example bundles — sample MCP server zip, sample skill bundle

Appendix: Claude Code MCP behavior (verified)

Key findings from Claude Code MCP architecture research that informed this spec. These are behaviors of Claude Code itself, not the MCP protocol.

Lifecycle

MCP servers start when a session begins, stop when it ends
No auto-restart on crash — next tool invocation fails. Our proxy must handle reconnection to the broker independently
No health checks from Claude Code — failures discovered on tool use
MCP_TIMEOUT env var controls tool call timeout

Dynamic tools

notifications/tools/list_changed is supported and triggers immediate re-fetch of tools/list — works mid-conversation over stdio
SSE/HTTP transport support for list_changed may be unreliable — known bug in some versions. This is why we use stdio proxies, not HTTP transport.

ToolSearch / deferred tools

Enabled by default (ENABLE_TOOL_SEARCH=true)
Only tool names are loaded at startup — full schemas fetched on demand
Requires Sonnet 4+ or Opus 4+ (Haiku does not support tool references)
Adding 100+ MCP tools has near-zero context cost at startup
Configurable: ENABLE_TOOL_SEARCH=auto:5 loads upfront if <5% of context

Tool output limits

Warning at 10,000 tokens, hard limit at 25,000 tokens (default)
Configurable via MAX_MCP_OUTPUT_TOKENS env var
Per-tool override: _meta["anthropic/maxResultSizeChars"] (up to 500K chars)

Namespacing

Tools namespaced as mcp__servername__toolname
Two servers with same tool name → no conflict (different namespace)
Server names normalized: spaces → underscores
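
The namespacing rule can be written as a small function. This sketch models only the behavior listed above (the `mcp__servername__toolname` shape and spaces becoming underscores); Claude Code may normalize other characters as well, and `namespacedToolName` is a hypothetical name.

```typescript
/** Builds the namespaced tool name as described in the list above. */
function namespacedToolName(serverName: string, toolName: string): string {
  const normalized = serverName.replace(/ /g, "_"); // spaces → underscores
  return `mcp__${normalized}__${toolName}`;
}

// Two servers exposing the same tool name do not collide.
const fromGmail = namespacedToolName("gmail", "search");
const fromDocs = namespacedToolName("team docs", "search");
```

Here `fromGmail` is `mcp__gmail__search` and `fromDocs` is `mcp__team_docs__search`.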

Registration

File-based only — no runtime API to add MCP servers
Scopes: local (~/.claude.json), project (.mcp.json), user (~/.claude.json global)
Precedence: local > project > user
claude mcp add --scope user for global, --scope project for team-shared
Cannot add new MCP server entries mid-session — this is why claudemesh launch pre-writes entries before spawning, and mid-session deploys fall back to dynamic svc__* tools on the claudemesh MCP server
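
Because registration is file-based, the launch step amounts to rewriting the `mcpServers` map before `claude` starts. The following sketch shows that merge as a pure function. Several things here are assumptions: the `mesh:` key prefix as the exact entry name, the `claudemesh mcp --service <name>` command line taken from the implementation plan, and the simplification of dropping every existing `mesh:` entry instead of doing the PID and age checks described under session management. `withMeshEntries` is a hypothetical name.

```typescript
interface McpServerEntry {
  command: string;
  args: string[];
  env?: Record<string, string>;
}

interface ClaudeConfig {
  mcpServers?: Record<string, McpServerEntry>;
  [key: string]: unknown;
}

const MESH_PREFIX = "mesh:"; // assumed key prefix for launch-managed entries

/** Returns a new config whose mesh entries match the given service names. */
function withMeshEntries(config: ClaudeConfig, serviceNames: string[]): ClaudeConfig {
  // Keep entries the user added by other means.
  const kept = Object.entries(config.mcpServers ?? {})
    .filter(([name]) => !name.startsWith(MESH_PREFIX));

  // One stdio proxy entry per service in the scope-filtered catalog.
  const mesh = serviceNames.map((name): [string, McpServerEntry] => [
    `${MESH_PREFIX}${name}`,
    { command: "claudemesh", args: ["mcp", "--service", name] },
  ]);

  return { ...config, mcpServers: Object.fromEntries([...kept, ...mesh]) };
}
```

Keeping the function pure leaves file locking and atomic writes to the caller, which is where the `flock` step in the implementation plan would apply.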

Environment variables

Passed via --env KEY=VALUE on claude mcp add
.mcp.json supports ${VAR} and ${VAR:-default} expansion
Special: ${CLAUDE_PLUGIN_ROOT}, ${CLAUDE_PLUGIN_DATA}

Implications for this spec

Native MCP entries MUST be written before claude spawns → claudemesh launch flow
Stdio transport is the only reliable path for list_changed → thin proxy model
ToolSearch means 100+ mesh tools have negligible context cost
No server dependencies → each mesh proxy is independent
No auto-restart → proxies must reconnect to broker on their own
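
The last point, together with the "reconnecting" behavior in the security model, could look roughly like the sketch below. It is not the `BrokerClient` implementation: the class name, the exponential schedule, and the 500 ms and 30 s bounds are all assumptions chosen for illustration.

```typescript
/** Exponential backoff with a cap; attempt 0 yields baseMs. */
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Hypothetical proxy-side connection state.
class ProxyConnection {
  private connected = false;
  private attempt = 0;

  /** Delay to wait before the next reconnect attempt. */
  nextDelayMs(): number {
    return backoffDelayMs(this.attempt++);
  }

  markConnected(): void {
    this.connected = true;
    this.attempt = 0; // a successful connect resets the schedule
  }

  markDisconnected(): void {
    this.connected = false;
  }

  /** Forwards a tool call, or reports "reconnecting" while the broker is unreachable. */
  call(forward: () => unknown): { result?: unknown; error?: string } {
    if (!this.connected) return { error: "reconnecting" };
    return { result: forward() };
  }
}
```

Returning an error immediately keeps a tool call from hanging until Claude Code's own timeout fires while the broker is down.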
