docs(specs): m1 — agentic-comms architecture spec (v1 + v2 frozen)
v1: initial 3-layer architecture proposal, reviewed by Codex GPT-5.2 (high)
v2: full end-state with hybrid P2P data plane, broker as coordination
plane only, 6 layers, 8 architectural milestones, Codex-2 corrections
(at-least-once requires client_ack, service_pubkey explicit, meta
required in v2 envelope, streamId required for stream channel,
explicit revocation flow). v2 is frozen for implementation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
.artifacts/specs/2026-05-04-agentic-comms-architecture-v2.md
@@ -0,0 +1,506 @@
---
title: claudemesh — full end-state architecture for agentic peer communication
status: draft (v2 — supersedes v1: removes time-boxed phasing, adds P2P data plane, applies Codex-2 correctness/scope-gap edits)
target: end-state (architectural milestones, not version timelines)
author: Alejandro + Claude (Codex GPT-5.2 cross-checked twice)
date: 2026-05-04
supersedes: 2026-05-04-agentic-comms-architecture.md (v1)
references:
  - 2026-05-02-architecture-north-star.md (CLI-first commitment, push-pipe)
  - 2026-05-04-per-session-presence.md (per-launch session pubkey + attestation)
  - apps/cli/CHANGELOG.md (1.30.0–1.32.1 history)
---

# claudemesh — agentic peer communication, full end-state

## What this document is

The end-state architecture for claudemesh as a transport-agnostic agentic peer-comms platform. Not a release plan, not a sprint roadmap — the **shape** the system needs to converge on. Implementation order at the end is a *suggestion*, not a contract; time estimates are deliberately omitted because the surface is too cross-cutting to phase by weeks.

v1 of this spec (same date, no `-v2` suffix) treated the broker as the sole data plane. v2 corrects that: **the broker is a coordination plane (signaling, discovery, offline queue, fan-out, registry, revocation); the data plane is hybrid P2P** with broker fallback for the cases P2P can't cover. This is closer to how Tailscale, libp2p, LiveKit, and modern WebRTC stacks work in production.

## TL;DR

- **Identity** — three keypair types (member, session, service) all rooted in a member's secret key. Member is durable, session is per-launch, service is a member-scoped delegate for non-Claude integrations. Every service has its own pubkey and explicit revocation.
- **Coordination plane** — broker handles signaling, peer discovery, offline message queue, group/topic fan-out, mesh state authority, revocation gossip. Always reachable.
- **Data plane** — hybrid:
  - **P2P first** (WebRTC data channels, future: QUIC) when both peers are online and NAT-traversable.
  - **Broker-relayed** when peers are NAT-blocked, when one peer is offline, or for group/topic/broadcast where fan-out at the broker is structurally cheaper than N-way sender-side fan-out.
  - **Pure broker** for service identities that can't run a P2P stack (HTTP webhook senders, OpenAI Assistants, browser SDKs without WebRTC).
- **Channels** — typed envelope (dm, group, topic, rpc, system, stream). Channel type drives crypto, routing, and transport selection. `meta` is required in the v2 envelope.
- **Transports** — pluggable adapters under one interface: WS-to-broker (today), WebRTC P2P, HTTP webhook, future LiveKit/QUIC/etc. Broker negotiates which adapter a peer pair uses.
- **Crypto** — every direct message is E2E encrypted to the recipient's pubkey regardless of transport. Broker never sees plaintext. P2P doesn't get any extra trust just because it's direct.
- **Delivery** — at-least-once **requires receiver ack** before broker marks `delivered_at`. The retry path before that is best-effort with idempotent dedupe at the receiver.

The CLI-first commitment from the North Star spec stays intact. Every channel type and every transport is invocable from `claudemesh <verb>`. MCP serves only `claude/channel` mid-turn push.

---

## The forcing functions (why this shape, not a smaller one)

1. **Multi-session interconnect already broke** (1.30.0 → 1.32.1) because the per-session WS subsystem shipped without a push handler. Symptom of "broker is the data plane and we keep bolting on" thinking. Need to formalize roles and transport adapters before the next bolt-on.

2. **Codex review surfaced a correctness bug** in `drainForMember` — it claims `delivered_at = NOW()` *before* the WS push succeeds; if `ws.readyState !== OPEN` the row is marked delivered and the message is lost. At-most-once with no retry. Inherited by every channel/transport added unless fixed at the foundation.

3. **The agentic-comms domain has standardized on hybrid P2P + central coordinator.** Tailscale (control plane + WireGuard P2P), LiveKit (signaling + SFU + P2P data channels), libp2p (DHT discovery + multi-transport), Iroh (gossip + QUIC P2P). Pure-broker is a 2010s pattern; pure-P2P is academic. Hybrid is the norm.

4. **claudemesh's pricing/economics demand P2P.** Every byte through the broker is your cost. Voice transcripts, file transfers, real-time tool I/O — bandwidth-heavy. P2P data plane lets the broker scale linearly with peer count, not message volume.

5. **Privacy/sovereignty matters as the agent ecosystem grows.** "Your agents talk to my agents" should default to peer-to-peer paths when possible. Broker as relay is fine; broker as forced middleman is not.

---

## Audience for this architecture

| Peer type | Identity | Online presence | Data plane preference | Notes |
|---|---|---|---|---|
| **Claude Code session** | Per-launch session pubkey, member-attested | WS to broker (control + signaling) | P2P first, broker fallback | Mid-turn push via MCP `claude/channel` |
| **Daemon, no launch** (idle Mac with daemon running) | Member pubkey | WS to broker | Broker only (no P2P partner unless launched) | Receives broadcasts + member-targeted DMs |
| **Voice agent** (LiveKit, Pipecat) | Service identity, member-signed | LiveKit room + bridge | LiveKit room data channels intra-room; bridge over broker for cross-mesh | Side-car bridges room ↔ broker |
| **OpenAI Assistant / Anthropic Skill** | Service identity, scoped token | HTTP outbound, webhook inbound | Broker only (can't run P2P) | Daemon does delegated re-encryption |
| **Browser-based peer** (web dashboard, SDK) | Member or service identity | WS to broker, WebRTC for P2P | P2P-where-possible (browsers ARE WebRTC-native) | Full feature parity once on-mesh |
| **Webhook consumer** (Stripe-style passive) | Service identity | HTTP webhook inbound only | Broker only | Topic subscriptions; no inbound channel |
| **Bridge** (Slack, WhatsApp, IRC, Matrix) | Service identity per bridge + per-end-user delegated | WS to broker | Broker only for bridge ↔ broker; native protocol for bridge ↔ external | Trust delegated to bridge operator |
| **Cron / scheduled actor** | Member pubkey or service identity | Ephemeral; HTTP send only | Broker only | No long-lived connection |
| **CLI-only user** (no Claude Code) | Member pubkey | Ephemeral on each `claudemesh send` | Broker only | Command-line agent, queues via outbox |

Every row in this table works without changing the broker's coordination plane.

---

## Layer 1: Identity

Three keypair types, one auth model.

### Member identity (durable)
- Ed25519 keypair, generated at `claudemesh join <invite>`. Held in `~/.claudemesh/config.json` per mesh.
- The auth boundary — grants, kicks, bans operate on members.
- Used for hello signature on the daemon's control-plane WS.
- Used as cryptographic root of trust for sibling sessions and service identities.

### Session identity (ephemeral, per-launch)
- Ed25519 keypair generated by each `claudemesh launch`. Held in process memory only.
- Parent-signed attestation vouches for it (TTL 12h, broker cap 24h). Rotation = new launch.
- Used for hello signature on the per-session WS, and as routing key for DMs targeted at *this specific launched session*.
- Session secret never touches disk; lives only in the daemon's `sessionBrokers` map keyed by IPC token.

### Service identity (third type, additive)

For non-Claude integrations that can't or shouldn't use a per-launch session.

```
ServiceIdentity {
  service_id              // Stable string id ("openai-assistant-foo", "livekit-room-bar")
  service_pubkey          // Ed25519 pubkey — the cryptographic identity. crypto_box targets this.
  member_id               // The mesh member that owns this service (auth boundary)
  service_type            // "openai-assistant" | "livekit-room" | "webhook" | "voice-agent" | ...
  scopes                  // ["dm:read", "topic:write", "rpc:invoke", ...]
  attestation             // member-signed: { service_id, service_pubkey, scopes, expires_at, signature }
  transport_hint          // "ws" | "http-webhook" | "sse" | "livekit" — informs how the broker reaches it
  delegate_daemon_pubkey? // Optional. Set when the daemon holds the service's secret on its behalf.
}
```

Two flavors:
- **Holds-secret service** — has its own keypair (`service_pubkey` + service-secret kept by the service itself). Runs E2E crypto end-to-end. Voice agent side-cars, browser SDK, MQTT bridges.
- **Delegated service** — daemon holds the service-secret on the service's behalf. Senders still encrypt to `service_pubkey`; daemon decrypts on receipt and forwards plaintext (or re-signs) to the service via its `transport_hint`. Used by HTTP webhook consumers, OpenAI Assistants. Trust is in the daemon owner. `delegate_daemon_pubkey` records who's holding.

All three identity types resolve to a `member_id` for authorization. They differ in liveness (member = always; session = per-launch; service = scoped) and transport hint (member/session = WS-resident; service = polymorphic).

### Identity revocation (explicit)

v1 left this implicit. v2 makes it concrete:

- **CLI verb:** `claudemesh service revoke <service_id>` (also `claudemesh peer revoke <pubkey>` for member revocation).
- **Broker effect:** add row to `revocation` table with `(mesh_id, revoked_pubkey, revoked_at, revoked_by, reason?)`. Drop any active WS for that pubkey (close 4002 "revoked"). Reject future hellos.
- **Drain effect:** `drainForMember` checks revocation list at drain time; ciphertext-in-flight from the revoked sender is dropped (sender already broker-acked, but recipient never sees it).
- **Gossip:** revocation events publish on the `system` channel (highest priority). Online peers cache; offline peers see on reconnect. Required so P2P sessions also honor revoke (otherwise a revoked peer's stored attestations could keep working over direct paths).
- **Latency target:** <30s for online peers to receive and apply.
- **Expiry vs revoke distinction:** `expires_at` is graceful (predictable, scheduled rotation); revoke is emergency (leaked secret, fired employee, compromised host). Both use the same revocation table; `expires_at` enforces silently when reached, revoke is logged as an audit event.
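
The revocation rules above could be enforced on the peer side with a small local cache. The following is a minimal sketch, not the spec's implementation: `RevocationEvent`, `RevocationCache`, and `isIdentityUsable` are hypothetical names, and the event fields simply mirror the revocation table columns listed above. Signature verification of the gossip event is assumed to happen before `apply` is called.

```typescript
// Hypothetical event shape; fields mirror the revocation table columns named above.
interface RevocationEvent {
  meshId: string;
  revokedPubkey: string;
  revokedAt: number; // epoch milliseconds
  revokedBy: string;
  reason?: string;
}

// Local cache a peer could consult before accepting a hello or a P2P message.
class RevocationCache {
  private readonly revoked = new Map<string, RevocationEvent>();

  private key(meshId: string, pubkey: string): string {
    return JSON.stringify([meshId, pubkey]);
  }

  // Idempotent: replaying the same gossip event leaves the cache unchanged.
  apply(event: RevocationEvent): void {
    const k = this.key(event.meshId, event.revokedPubkey);
    if (!this.revoked.has(k)) this.revoked.set(k, event);
  }

  isRevoked(meshId: string, pubkey: string): boolean {
    return this.revoked.has(this.key(meshId, pubkey));
  }
}

// An identity is usable only if it is neither expired nor revoked.
function isIdentityUsable(
  cache: RevocationCache,
  meshId: string,
  pubkey: string,
  expiresAt: number,
  now: number,
): boolean {
  return now < expiresAt && !cache.isRevoked(meshId, pubkey);
}
```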

---

## Layer 2: Coordination plane (the broker, properly scoped)

The broker is **not** the data plane. Its real responsibilities:

1. **Mesh state authority** — member roster, group memberships, topic registry, service registrations, revocation list. Source of truth for who's in a mesh and what they can do.
2. **Peer discovery** — `list_peers` returns currently-online presences. Broker is the only system that knows which peers are reachable now and over which transports.
3. **Signaling for P2P upgrades** — when peer A wants to open a P2P connection to peer B, A sends an SDP offer through the broker; B responds with an SDP answer through the broker. Once the data channel is up, broker is out of the path. Same as WebRTC signaling.
4. **Offline message queue** — when recipient is offline, broker stores the (encrypted) message until they reconnect. P2P can't do this without an "always-on peer" model, which is awkward to bootstrap.
5. **Group / topic / broadcast fan-out** — broker is the cheap fan-out point. Sender publishes once; broker delivers to N recipients. P2P fan-out (gossipsub) is possible but adds significant complexity for a feature most meshes won't need at scale.
6. **TURN-style relay for NAT-blocked pairs** — when P2P negotiation fails (symmetric NAT, restrictive corporate firewall), broker carries the data. Functionally equivalent to TURN.
7. **Revocation gossip publisher** — broker pushes revocation events to all online peers via the `system` channel; peers cache them.
8. **Audit log + persistence layer** — encrypted message metadata for compliance. Bodies are E2E-encrypted, so audit is over (sender, recipient, channel, timestamp, size), not content.

The broker is **NOT**:
- The default path for online-online direct messages (P2P should win).
- The decryptor for any direct message (E2E means broker sees ciphertext only).
- A bottleneck on bulk data (file transfer, voice, screen share — these go P2P or fail).
- The sole identity authority for active sessions (P2P sessions verify attestations locally via cached mesh state).

### Two roles per mesh on the WS layer (Codex-1 correction, kept)

Within the broker's WS surface, the daemon holds two roles per mesh, not one connection per launch:

- **Control-plane connection** — one per mesh, member-keyed. Carries: signaling + outbox drain + RPCs + broadcast/member-targeted inbound + revocation gossip subscription.
- **Session connections** — N per mesh, session-keyed. Carries: presence row keyed on session pubkey + signaling for P2P upgrades involving this session + inbound for session-targeted DMs that arrive via broker fallback.

A peer who's purely on the broker (no P2P) functions exactly as today. A peer who upgrades to P2P with another peer keeps its broker WS for the other roles.

---

## Layer 3: Data plane (hybrid P2P + broker fallback)

The data plane is what carries actual message bodies. Three modes, selected per (sender, recipient, channel) tuple:

### Mode 1: Direct P2P (preferred when possible)

Two peers run a WebRTC data channel (or QUIC stream — pluggable, see Layer 4) between their daemons. Established via signaling through the broker; once up, broker is out of the path.

**When P2P is selected:**
- Both peers are online (have an active broker WS).
- Both peers' transports advertise P2P capability (WebRTC available; not a webhook-only service identity; not a browser without `RTCPeerConnection`).
- ICE negotiation succeeds (at least one candidate pair works — direct, server-reflexive, or peer-reflexive).
- Channel type is `dm`, `rpc`, or `stream` (the 1:1 cases).

**P2P session lifecycle:**
- Established lazily on first message (warm-up cost ~200ms; dominated by ICE + DTLS handshake). Subsequent messages reuse the channel.
- Idle timeout: 5min of no traffic → tear down. Re-established on next message.
- Hard timeout: 1h max regardless of activity, then re-handshake. Limits damage of compromised session keys.
- Either side can demote to broker-relay at any time; broker is the fallback always.
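
The idle and hard timeouts in the lifecycle above reduce to a pure decision function. This is an illustrative sketch only: `P2pSessionState`, `SessionDecision`, and `decideSession` are hypothetical names, and the check order (hard timeout before idle timeout) is an assumption the spec does not state.

```typescript
const IDLE_TIMEOUT_MS = 5 * 60 * 1000; // 5 min of no traffic
const HARD_TIMEOUT_MS = 60 * 60 * 1000; // 1 h regardless of activity

// Hypothetical bookkeeping a session manager would keep per peer pair.
interface P2pSessionState {
  establishedAt: number; // epoch ms of the last completed handshake
  lastTrafficAt: number; // epoch ms of the last message in either direction
}

type SessionDecision = "reuse" | "teardown-idle" | "rehandshake";

// Decides what to do with a cached session before sending the next message.
function decideSession(session: P2pSessionState, now: number): SessionDecision {
  // Assumption: the hard timeout wins, so a long-lived session is always re-keyed.
  if (now - session.establishedAt >= HARD_TIMEOUT_MS) return "rehandshake";
  if (now - session.lastTrafficAt >= IDLE_TIMEOUT_MS) return "teardown-idle";
  return "reuse";
}
```

Keeping the policy as a pure function of timestamps makes it testable without a real WebRTC stack; the session manager would call it on every send and on a periodic sweep.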

**Crypto on P2P:**
- DTLS handshake provides transport encryption (forward secrecy; recipient pubkey verified via cached attestation chain).
- Application-layer crypto_box ALSO runs on top — same as broker-relayed messages — so the wire format and decryption path are identical on the receiver side. Defense in depth, no special-case code.

### Mode 2: Broker-relayed (fallback)

The current path. Sender encrypts to recipient pubkey (member or session or service), pushes to broker via WS, broker queues, recipient pulls (or broker pushes to recipient's WS).

**When broker-relay is selected:**
- One peer offline → broker queues, delivers on reconnect.
- ICE negotiation fails → broker becomes the relay.
- Channel type is `group`, `topic`, or `broadcast` → broker fan-out is structurally cheaper than P2P fan-out for any group >2.
- Service identity at either end can't run P2P → broker is the only path.

**Crypto:** unchanged from today — E2E crypto_box, broker sees ciphertext only.

### Mode 3: Direct webhook (broker as broker, not as relay)

For service identities advertising `transport_hint: "http-webhook"`. Sender encrypts to service's `service_pubkey` (or to delegate-daemon's pubkey for delegated services), broker POSTs the ciphertext to the service's registered URL with HMAC signature + retry. No long-lived connection on the service side.

This is functionally a "broker queue, custom delivery transport" — broker still mediates, but delivery is HTTP not WS.
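
The spec names an HMAC signature on the webhook POST but does not fix its format. The sketch below shows one common shape, using Node's built-in `node:crypto`. Everything specific here is an assumption made for illustration: the function names, signing `"<timestamp>.<body>"`, hex encoding, HMAC-SHA256, and the five-minute replay window.

```typescript
import { Buffer } from "node:buffer";
import { createHmac, timingSafeEqual } from "node:crypto";

// Hex HMAC-SHA256 over "<timestamp>.<body>". Binding the timestamp into the
// MAC means a captured request cannot be replayed under a newer timestamp.
function signWebhook(secret: string, timestamp: number, body: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${body}`).digest("hex");
}

// Receiver-side check: reject stale timestamps, then compare in constant time.
function verifyWebhook(
  secret: string,
  timestamp: number,
  body: string,
  signature: string,
  now: number,
  toleranceMs: number = 5 * 60 * 1000, // assumed replay window
): boolean {
  if (Math.abs(now - timestamp) > toleranceMs) return false;
  const expected = Buffer.from(signWebhook(secret, timestamp, body), "hex");
  const given = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so guard the length first.
  return expected.length === given.length && timingSafeEqual(expected, given);
}
```

The body being signed is still E2E ciphertext; the HMAC only authenticates that the POST came from the broker, it adds nothing to confidentiality.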

### Selection logic (deterministic, sender-side)

```
function pickTransport(sender, recipient, channel) -> Transport:
  if channel in [group, topic, broadcast]:
    return broker.relay      # fan-out semantics

  if recipient.transport_hint == "http-webhook":
    return broker.relay      # broker calls webhook

  if recipient is offline:
    return broker.queue      # store-and-forward

  if !recipient.capabilities.p2p:
    return broker.relay      # one end can't P2P

  if !sender.capabilities.p2p:
    return broker.relay      # we can't P2P

  if has_active_p2p_session(sender, recipient):
    return p2p.session       # warm path

  if attempt_p2p_handshake(sender, recipient, timeout=2s):
    return p2p.session
  return broker.relay        # fall through, log degraded
```

Policy lives in the daemon's send path. Broker doesn't know or care — it sees only the messages that actually go through it.
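
For reference, the same selection order can be written as a synchronous TypeScript function. This is a sketch under stated assumptions: `PeerView`, `TransportChoice`, and the extra `"p2p.handshake"` result are hypothetical. Instead of performing the handshake inline, the function returns `"p2p.handshake"` to tell the caller to attempt one (with the 2s timeout) and fall back to broker relay on failure. `"broadcast"` is included only because the pseudocode above lists it.

```typescript
type Channel = "dm" | "group" | "topic" | "rpc" | "system" | "stream" | "broadcast";
type TransportChoice = "broker.relay" | "broker.queue" | "p2p.session" | "p2p.handshake";

// Hypothetical view of a peer, as the sender's daemon might build it from list_peers.
interface PeerView {
  online: boolean;
  transportHint?: string;
  capabilities: { p2p: boolean };
}

function pickTransport(
  sender: PeerView,
  recipient: PeerView,
  channel: Channel,
  hasActiveP2pSession: boolean,
): TransportChoice {
  // Fan-out channels always go through the broker.
  if (channel === "group" || channel === "topic" || channel === "broadcast") return "broker.relay";
  // Webhook recipients are reached by the broker's outbound POST.
  if (recipient.transportHint === "http-webhook") return "broker.relay";
  // Offline recipients need store-and-forward.
  if (!recipient.online) return "broker.queue";
  // Either end lacking P2P capability forces relay.
  if (!recipient.capabilities.p2p || !sender.capabilities.p2p) return "broker.relay";
  // Warm path: reuse the existing session.
  if (hasActiveP2pSession) return "p2p.session";
  // Cold path: caller attempts a handshake, falling back to broker.relay on failure.
  return "p2p.handshake";
}
```

Separating the decision from the handshake side effect keeps the policy deterministic for a given input, which matches the "deterministic, sender-side" framing of this section.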

---

## Layer 4: Transport adapters (pluggable)

A transport adapter is an implementation of how *one peer pair* moves bytes. Defined by an interface; new adapters added without touching upper layers.

```typescript
interface PeerTransport {
  readonly kind: string; // "ws-broker" | "webrtc-p2p" | "http-webhook" | ...

  readonly capabilities: {
    p2p: boolean;
    bidirectional: boolean;
    midTurnPush: boolean;
    maxMessageBytes: number;
    streamingChunks: boolean;
  };

  open(opts: TransportOpenOpts): Promise<TransportSession>;
  send(envelope: Envelope): Promise<TransportSendResult>;
  inbound(): AsyncIterable<Envelope>;
  heartbeat(): Promise<boolean>;
  close(reason?: string): Promise<void>;
}
```

### Concrete adapters at end-state

1. **`WsBrokerTransport`** — current code. WebSocket to `wss://ic.claudemesh.com/ws`. Underpins both broker-relay (Mode 2) and signaling for P2P upgrades.
2. **`WebRtcP2pTransport`** — RTCPeerConnection + RTCDataChannel. Browser, Node (`node-datachannel` or similar), CLI all supported. Chunking handled at envelope layer for `stream` channel.
3. **`HttpWebhookTransport`** — outbound HTTP POST to broker `/v1/send`; inbound HTTP POST to a registered webhook URL. Unidirectional from peer's perspective. Mid-turn push: no.
4. **`LiveKitRoomTransport`** — for voice agents. Side-car bridges a LiveKit room to claudemesh. Maps a LiveKit participant → claudemesh service identity.

Future adapters TBD as concrete needs surface — no commitments here. (v1 listed MQTT/gRPC/SSE as future named adapters; v2 drops the named list per Codex-2 should-cut feedback.)

The peer's daemon advertises transport capabilities at hello time; broker stores them in the presence row; senders consult them via `list_peers` (capability fields added to the response).

---

## Layer 5: Channels (typed envelope)

Channels define **semantics**: what the message means, what crypto to apply, what delivery guarantees, what fan-out, what backpressure.

```typescript
type ChannelType =
  | "dm"      // 1:1 direct, encrypted to recipient pubkey, at-least-once with ack
  | "group"   // post to named group, per-recipient encrypt or symmetric, at-least-once with ack
  | "topic"   // pub/sub topic, persisted history, per-topic symmetric key, at-least-once with ack
  | "rpc"     // request/response with correlation id + timeout, exactly-once via dedupe
  | "system"  // peer_joined / peer_left / topology / lifecycle / revocation (broker-originated)
  | "stream"; // long-lived ordered chunks, idempotent per (stream_id, chunk_id)

interface Envelope {
  v: 2;
  channel: ChannelType;
  /** Routing target — meaning depends on channel:
   *   dm:     recipient pubkey (member, session, or service)
   *   group:  group name (e.g. "@admins")
   *   topic:  topic id (e.g. "#abc123")
   *   rpc:    recipient pubkey
   *   system: ignored (sender-determined fan-out; broker fills in)
   *   stream: recipient pubkey (the stream_id is in meta.streamId — see below) */
  target: string;
  /** Sender identity pubkey (member, session, or service). */
  from: string;
  /** Encrypted payload. Channel + recipient determines crypto recipe:
   *   dm/rpc/stream: crypto_box to recipient pubkey
   *   group:         per-recipient seal (or symmetric in v3)
   *   topic:         per-topic symmetric key (v0.2.0 spec)
   *   system:        broker-signed, plaintext metadata (event has no body) */
  body: { nonce: string; ciphertext: string; bodyVersion: number };
  /** Required in v2 (was optional in v1). Even minimal envelopes must carry
   *  clientMessageId for idempotent dedupe. */
  meta: {
    clientMessageId: string;           // REQUIRED — idempotency id (spec §4.2)
    requestFingerprint?: string;
    priority?: "now" | "next" | "low"; // dm: gates mid-turn push; group/topic: fan-out priority
    timeoutMs?: number;                // rpc only
    streamId?: string;                 // REQUIRED for channel:"stream"; identifies the stream
    streamChunkId?: number;            // stream only; monotonic; receiver dedupes
    streamTerminator?: boolean;        // stream only; signals end
    rpcCorrelationId?: string;         // rpc only; back-edge for response
    rpcResponse?: boolean;             // rpc only; this is a response, not request
    replyToId?: string;                // dm/topic threading
    mentions?: string[];               // dm/topic; @-callouts
    expiresAt?: number;                // any; broker drops past this; default 7d for queued
  };
  /** Sender Ed25519 signature over canonical bytes. Verified by recipient
   *  (and by broker for system-message origin). */
  signature: string;
}
```

### Stream concurrency

For `channel: "stream"`, **`meta.streamId` is required**. Two concurrent streams to the same recipient pubkey use distinct streamIds; receiver demuxes by `(from, streamId)`. Without this, multi-stream voice transcripts or file transfers from the same peer would collide.
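
Receiver-side demultiplexing by `(from, streamId)` with per-chunk dedupe might look like the sketch below. `StreamChunk` and `StreamDemux` are hypothetical names; the chunk fields reuse the `meta` names from the envelope, and `data` stands in for the already-decrypted chunk payload.

```typescript
// Hypothetical flattened view of a decrypted stream envelope.
interface StreamChunk {
  from: string;
  streamId: string;
  streamChunkId: number;
  data: string;
  streamTerminator?: boolean;
}

class StreamDemux {
  // Outer key: (from, streamId). Inner key: streamChunkId.
  private readonly streams = new Map<string, Map<number, string>>();

  private key(from: string, streamId: string): string {
    // JSON encoding avoids collisions that a plain "from:streamId" join could allow.
    return JSON.stringify([from, streamId]);
  }

  // Returns true when the chunk is new, false when it is a duplicate delivery.
  accept(chunk: StreamChunk): boolean {
    const k = this.key(chunk.from, chunk.streamId);
    let chunks = this.streams.get(k);
    if (!chunks) {
      chunks = new Map<number, string>();
      this.streams.set(k, chunks);
    }
    if (chunks.has(chunk.streamChunkId)) return false;
    chunks.set(chunk.streamChunkId, chunk.data);
    return true;
  }

  // Concatenates the chunks received so far in streamChunkId order.
  assemble(from: string, streamId: string): string {
    const chunks = this.streams.get(this.key(from, streamId));
    if (!chunks) return "";
    return Array.from(chunks.entries())
      .sort((a, b) => a[0] - b[0])
      .map((entry) => entry[1])
      .join("");
  }
}
```

Because chunks are keyed by `streamChunkId`, out-of-order arrival and at-least-once redelivery both resolve to the same assembled result.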

### Crypto by channel

- `dm`, `rpc`, `stream` → crypto_box(plaintext, recipient_pubkey, sender_secretkey). Receiver verifies attestation chain to ensure recipient_pubkey is a valid identity rooted in a current member.
- `group` → for now: per-recipient crypto_box (sender encrypts N times, broker fans out). Future: hybrid Curve25519 → AES-GCM with sender key wrap, like Signal Sender Keys.
- `topic` → per-topic symmetric key (already in v0.2.0 spec). Key rotation = new topic + members re-subscribe. Keys distributed via DM at join time, encrypted to each member's pubkey.
- `system` → broker is the signer; receivers verify against the broker's published Ed25519 pubkey. Plaintext bodies allowed since these are operational events.

### Delivery semantics (Codex-2 correction applied)

**At-least-once requires receiver ack.** Today's broker sets `delivered_at = NOW()` inside the claim CTE before WS push succeeds — that's at-most-once with no retry. The end-state behavior:

1. Sender's daemon writes to outbox (durable).
2. Drain worker sends to broker; broker acks with `client_message_id` echo (this is sender → broker delivery ack, NOT end-to-end).
3. Broker queues with `claimed_at` NULL, `delivered_at` NULL.
4. On recipient hello / push opportunity: broker claims by setting `claimed_at = NOW(), claim_id = <presenceId>` (lease 30s).
5. Broker `sendToPeer` writes to WS / P2P / webhook.
6. Receiver processes envelope and emits `client_ack { clientMessageId }` back to broker.
7. Broker sets `delivered_at = NOW()` ON ACK RECEIPT.
8. If lease expires without ack → broker re-eligible to claim and re-deliver.
9. Receiver dedupes by `clientMessageId` (idempotent insert into inbox).

Until the ack path is wired, the accurate label for this transitional state is **best-effort retry with idempotent dedupe**, not at-least-once. The outbox + claim/lease + dedupe combination upgrades to at-least-once when the ack path is in place.
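
Steps 3 through 8 above can be modeled as an in-memory queue with a lease. This is a sketch to make the state transitions concrete, not the broker's schema or SQL: `LeaseQueue` and `QueueRow` are hypothetical, and timestamps are passed in explicitly so the behavior is deterministic.

```typescript
const LEASE_MS = 30 * 1000; // lease from step 4

// In-memory stand-in for a queued-message row (columns from steps 3, 4, and 7).
interface QueueRow {
  clientMessageId: string;
  claimedAt: number | null;
  claimId: string | null;
  deliveredAt: number | null;
}

class LeaseQueue {
  private readonly rows = new Map<string, QueueRow>();

  // Step 3: queue with claimed_at NULL and delivered_at NULL. Idempotent on sender retry.
  enqueue(clientMessageId: string): void {
    if (this.rows.has(clientMessageId)) return;
    this.rows.set(clientMessageId, {
      clientMessageId,
      claimedAt: null,
      claimId: null,
      deliveredAt: null,
    });
  }

  // Steps 4 and 8: claim rows that are undelivered and either unclaimed or past their lease.
  claim(presenceId: string, now: number): string[] {
    const claimed: string[] = [];
    this.rows.forEach((row) => {
      if (row.deliveredAt !== null) return;
      const leaseActive = row.claimedAt !== null && now - row.claimedAt < LEASE_MS;
      if (leaseActive) return;
      row.claimedAt = now;
      row.claimId = presenceId;
      claimed.push(row.clientMessageId);
    });
    return claimed;
  }

  // Step 7: only a receiver ack marks the row delivered.
  ack(clientMessageId: string, now: number): boolean {
    const row = this.rows.get(clientMessageId);
    if (!row || row.deliveredAt === null) {
      if (!row) return false;
      row.deliveredAt = now;
      return true;
    }
    return false; // already delivered; duplicate ack is a no-op
  }
}
```

Note the key property: a failed push never sets `deliveredAt`, so the row becomes claimable again once the lease lapses, and the receiver's dedupe (step 9) absorbs the resulting duplicates.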

`rpc` exactly-once is the same path with the addition that the response carries the `rpcCorrelationId`; sender retries the request until response received OR `timeoutMs` elapses; receiver-side dedupe ensures the handler runs at most once.
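
The "handler runs at most once" guarantee depends on the receiver remembering which requests it has already handled. A minimal sketch, assuming an in-memory cache keyed by `clientMessageId` (`RpcDedupe` is a hypothetical name; a real receiver would presumably persist and expire entries):

```typescript
// Caches the first result per clientMessageId so retried requests replay
// the stored response instead of re-running the handler.
class RpcDedupe<Res> {
  private readonly results = new Map<string, Res>();

  handle(clientMessageId: string, run: () => Res): Res {
    if (this.results.has(clientMessageId)) {
      return this.results.get(clientMessageId) as Res;
    }
    const result = run();
    this.results.set(clientMessageId, result);
    return result;
  }
}
```

Usage: the sender retries with the same `clientMessageId`, and each retry gets the cached response while the side effect happens only once.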

### Mid-turn push

`channel: "dm"` with `meta.priority: "now"` and recipient is a launched Claude Code session → recipient's daemon emits `claude/channel` MCP push; the session's Claude Code reads it mid-turn. Other priorities deliver via `claudemesh inbox` poll or at next tool boundary.

### Reply threading + mentions

Uniform across `dm` and `topic`: `meta.replyToId` references the original message's `clientMessageId`. `meta.mentions` is an array of pubkeys (or `@<group>`) — UI/CLI surfaces them; broker doesn't enforce.

---

## Layer 6: Mesh state — broker authority + signed gossip

The mesh state (members, groups, topics, services, revocations, policies) needs both:

- **Authority** — single source of truth. The broker DB. Mutations (add member, revoke, change policy) go through broker, signed by mesh owner / admin.
- **Replication** — every peer needs a current-enough copy to authorize incoming P2P messages locally (otherwise revoke can't be enforced when peers chat directly).

End-state: broker publishes signed mesh-state-update events on the `system` channel; peers cache and apply. Conflict resolution is trivial because broker is authority — peers merge updates by version vector. Eventually consistent in seconds, not the open-ended convergence of CRDT-only systems.
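
A peer-side cache applying these updates could be sketched as follows. This deliberately simplifies the text above: because the broker is the single writer, the sketch assumes one monotonically increasing `version` counter rather than a full version vector. `MeshStateUpdate`, `MeshStateCache`, and the two update kinds are hypothetical, and verification of the broker's signature is assumed to have happened before `apply`.

```typescript
// Hypothetical update shape; only two mutation kinds are modeled here.
interface MeshStateUpdate {
  version: number; // assumed: single monotonic counter issued by the broker
  kind: "member_added" | "member_revoked";
  pubkey: string;
}

type ApplyResult = "applied" | "stale" | "gap";

class MeshStateCache {
  private version = 0;
  private readonly members = new Set<string>();

  apply(update: MeshStateUpdate): ApplyResult {
    // Already seen (replay or reconnect overlap): ignore.
    if (update.version <= this.version) return "stale";
    // Missed at least one update: caller should resync from the broker.
    if (update.version > this.version + 1) return "gap";
    if (update.kind === "member_added") {
      this.members.add(update.pubkey);
    } else {
      this.members.delete(update.pubkey);
    }
    this.version = update.version;
    return "applied";
  }

  // Cheap local check a P2P session can run on every inbound message.
  isMember(pubkey: string): boolean {
    return this.members.has(pubkey);
  }
}
```

Returning `"gap"` instead of applying out-of-order updates keeps the cache from silently skipping a revocation it never saw.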

For peer revocation specifically: revocation gossip is highest priority and must propagate within 30s to all online peers. Offline peers see it on reconnect.

---

## Crypto — what doesn't change vs what does

### Doesn't change
- Per-peer Ed25519 keypairs (member + session + service).
- crypto_box (Curve25519 + XSalsa20 + Poly1305) for DMs/RPC/stream.
- Parent-attestation flow for sessions and services.

### Does change (additive)
- DTLS layer underneath WebRTC P2P (transport-level encryption for fingerprint binding).
- Per-topic symmetric keys (v0.2.0 baseline; v2 makes it a hard requirement for topics).
- Broker signing key for `system` channel events (single Ed25519 keypair the broker holds; pubkey published in mesh state).
- Service identity attestations carry `service_pubkey` + `scopes`.
- Forward-secrecy for long-lived P2P sessions: post-handshake, derive a fresh symmetric key per session epoch (1h max); rotate.

---
|
||||
|
||||
## Migration order (architectural milestones, NO time estimates)
|
||||
|
||||
The end-state above doesn't ship in one PR. The following ordering minimizes regression risk and lets each milestone be useful on its own. **No weeks/sprints attached** — work proceeds when the prior milestone is stable.
|
||||
|
||||
### Milestone 1 — Foundational correctness
|
||||
*Required before anything else. Without this, every later milestone inherits the bugs.*
|
||||
|
||||
- Extract `connectWsWithBackoff` helper. Refactor `DaemonBrokerClient` and `SessionBrokerClient` to use it. Eliminates the drift bug class.
|
||||
- Drop daemon's stray `sessionPubkey` field (or rename + document).
|
||||
- Tighten daemon-WS inbound filter — `*` broadcasts and member-targeted DMs only; session-targeted DMs land on session WS exclusively.
|
||||
- Add `presence.role` column at broker (`control-plane | session | service`); list_peers + fan-out + reconnect honor it.
|
||||
- **Fix broker drain race** — schema migration adds `claimed_at`, `claim_id`, `claim_expires_at` columns. Rewrite `drainForMember` for two-phase claim/deliver. Re-claim if `claimed_at` older than lease (30s).
|
||||
- Receiver-side `client_ack` for at-least-once with ack (Codex-2 correction). Without ack wiring this stays at "best-effort retry with idempotent dedupe."
|
||||
- Receiver-side dedupe: idempotent insert on `clientMessageId`; finished + made required for v2 envelopes.
|
||||
|
||||
### Milestone 2 — Capability advertisement + transport abstraction
|
||||
*Sets up the interface. No new transport yet.*
|
||||
|
||||
- Define `PeerTransport` interface; refactor existing WS code to be the first implementation. No behavioral change.
|
||||
- Add capabilities field to hello payload + presence row + `list_peers` response.
|
||||
- Define `Envelope v2` schema with `meta` required + `streamId` requirement on `stream` channel. Broker accepts both v1 and v2 (v1 auto-upgraded server-side by inferring `channel` from `targetSpec` shape). Senders start emitting v2.
|
||||
|
||||
### Milestone 3 — Service identity + HTTP webhook transport
|
||||
*First non-WS transport. Validates abstraction. Includes revocation.*
|
||||
|
||||
- Service identity registration: `claudemesh service register --type webhook --pubkey <hex> --scopes ...` mints attestation, stores broker-side. Service pubkey explicit in attestation.
|
||||
- Service revocation: `claudemesh service revoke <service_id>` writes broker denylist + closes any active connections + publishes `system` revocation event.
|
||||
- Add `HttpWebhookTransport` (broker-side outbound: POST with HMAC + retry; daemon-side inbound: HTTP server receives webhook callbacks → handleBrokerPush).
|
||||
- Add `/v1/send` HTTP POST endpoint on broker (today broker is WS-only for sends).
|
||||
- Demo: cron job using only `curl` posts to mesh; webhook subscriber receives.
|
||||
- (`SseTransport` deferred — Codex-2 should-cut feedback. Pull in when concrete browser need arises.)
|
||||
|
||||
### Milestone 4 — Typed channels: rpc, stream, system
|
||||
*Channel layer becomes real.*
|
||||
|
||||
- `channel: "rpc"` end-to-end: correlation id routing through any transport, response timeout, `claudemesh rpc <peer> <method> <args>` CLI verb.
|
||||
- `channel: "stream"` end-to-end: chunked + ordered + idempotent, multi-stream demux via `meta.streamId`, `claudemesh stream <peer> <stream-id>` CLI verb.
|
||||
- `channel: "system"` formalized (broker-signed events for peer_joined, peer_left, topology, revocation, mesh-state-updates).
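
To make "chunked + ordered + idempotent, multi-stream demux" concrete, here is a small receiver-side sketch. It assumes each chunk carries a `streamId` plus a zero-based `chunkId` counter; both the counter semantics and the `StreamDemux` class are assumptions for illustration, not the wire format.

```typescript
interface StreamChunk {
  streamId: string;
  chunkId: number; // assumed to start at 0 and increase by 1 per chunk
  data: string;
}

class StreamDemux {
  private nextExpected = new Map<string, number>();
  private pending = new Map<string, Map<number, string>>(); // out-of-order buffer per stream

  /** Returns the chunks that became deliverable, in order. Duplicates yield []. */
  accept(chunk: StreamChunk): string[] {
    const expected = this.nextExpected.get(chunk.streamId) ?? 0;
    if (chunk.chunkId < expected) return []; // already delivered: idempotent drop

    let buffer = this.pending.get(chunk.streamId);
    if (!buffer) {
      buffer = new Map<number, string>();
      this.pending.set(chunk.streamId, buffer);
    }
    buffer.set(chunk.chunkId, chunk.data); // re-buffering the same id is harmless

    // Release the contiguous run starting at the expected id.
    const out: string[] = [];
    let cursor = expected;
    while (buffer.has(cursor)) {
      out.push(buffer.get(cursor) as string);
      buffer.delete(cursor);
      cursor++;
    }
    this.nextExpected.set(chunk.streamId, cursor);
    return out;
  }
}
```

Keying all state on `streamId` is what lets two concurrent streams to the same peer proceed without colliding.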
|
||||
|
||||
### Milestone 5 — P2P data plane (WebRTC adapter)
|
||||
*The big architectural shift. Broker becomes coordinator, not data path.*
|
||||
|
||||
- Add `WebRtcP2pTransport` adapter. Uses `node-datachannel` (or libdatachannel binding) on Node; native WebRTC in browser.
|
||||
- Add signaling protocol over the existing broker WS:
|
||||
- `p2p_offer` (sender → broker → recipient): SDP offer + ICE candidates.
|
||||
- `p2p_answer` (recipient → broker → sender): SDP answer + ICE candidates.
|
||||
- `p2p_candidate` (either way): trickle ICE candidates.
|
||||
- All signaling messages are broker-attested (only valid sender/recipient pairs).
|
||||
- Add `pickTransport()` policy in daemon send path.
|
||||
- Add P2P session manager: warm-cache, idle timeout, hard timeout, demote-to-broker on failure.
|
||||
- Tag broker-relayed messages that *could have* gone P2P with a metric, so degradation rate is observable.
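
One possible shape for the `pickTransport()` policy, including the "could have gone P2P" tag. The capability string `"webrtc"`, the `PeerState` fields, and the three session states are assumptions made for this sketch; the spec only fixes that the broker keeps offline queueing and fan-out, and that P2P failures demote to the broker.

```typescript
type TransportKind = "p2p" | "broker";

interface PeerState {
  online: boolean;
  capabilities: string[];               // as advertised in hello / presence
  p2pSession: "warm" | "cold" | "failed";
}

interface TransportDecision {
  transport: TransportKind;
  couldHaveBeenP2p: boolean; // feeds the degradation-rate metric
}

function pickTransport(channel: string, peer: PeerState): TransportDecision {
  // Offline peers and fan-out channels always need the broker.
  if (!peer.online || channel === "group" || channel === "topic") {
    return { transport: "broker", couldHaveBeenP2p: false };
  }
  const p2pCapable = peer.capabilities.indexOf("webrtc") !== -1;
  if (p2pCapable && peer.p2pSession !== "failed") {
    return { transport: "p2p", couldHaveBeenP2p: false };
  }
  // Demoted to the broker: tag it only if the peer advertised P2P support.
  return { transport: "broker", couldHaveBeenP2p: p2pCapable };
}
```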
|
||||
|
||||
### Milestone 6 — Mesh state replication + revocation gossip
|
||||
*Required before P2P is safe at scale.*
|
||||
|
||||
- Broker publishes signed `system` events for all mesh state mutations.
|
||||
- Peers subscribe; cache and apply.
|
||||
- Revocation propagation latency target: <30s for online peers.
|
||||
- P2P sessions verify peer identity against cached state on every message (cheap, just a map lookup).
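
The peer-side half of this milestone can be sketched as a small cache that applies `system` events and answers the per-message identity check with a set lookup. Signature verification of the broker-signed events is deliberately left out here, and `MeshStateCache` and the event shapes are hypothetical.

```typescript
type MeshEvent =
  | { kind: "peer_joined"; pubkey: string }
  | { kind: "peer_left"; pubkey: string }
  | { kind: "revocation"; pubkey: string };

class MeshStateCache {
  private active = new Set<string>();
  private revoked = new Set<string>();

  /** Apply an event whose broker signature was already verified (not shown). */
  apply(event: MeshEvent): void {
    switch (event.kind) {
      case "peer_joined":
        // A revoked key never becomes active again in this sketch.
        if (!this.revoked.has(event.pubkey)) this.active.add(event.pubkey);
        break;
      case "peer_left":
        this.active.delete(event.pubkey);
        break;
      case "revocation":
        this.active.delete(event.pubkey);
        this.revoked.add(event.pubkey);
        break;
    }
  }

  /** Per-message check on a P2P session: a single set lookup. */
  isTrusted(pubkey: string): boolean {
    return this.active.has(pubkey);
  }
}
```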
|
||||
|
||||
### Milestone 7 — External integrations (proof points, parallel)
|
||||
*One PoC per category to validate the architecture, opportunistically.*
|
||||
|
||||
- LiveKit side-car (validates LiveKit room transport).
|
||||
- OpenAI Assistant (validates delegated-key crypto + webhook transport).
|
||||
- WhatsApp / Slack bridge (validates human-bridge service identity).
|
||||
- Browser SDK (validates browser as a peer; uses WebRTC adapter natively).
|
||||
|
||||
### Milestone 8 — Group/topic crypto upgrade
|
||||
*Group fan-out crypto efficiency.*
|
||||
|
||||
- Sender Keys protocol for group: sender derives group key, encrypts content once, encrypts group key per-recipient. Avoids N-way encryption per message.
|
||||
- Per-topic key rotation policy (member join → optional re-key; member leave → forced re-key).
|
||||
|
||||
### Beyond Milestone 8
|
||||
- Future transport adapters as concrete needs surface (no commitments).
|
||||
- Multi-broker federation (mesh spans multiple brokers; gossip across).
|
||||
- Onion routing option for adversarial environments.
|
||||
|
||||
---
|
||||
|
||||
## Non-goals (explicit)
|
||||
|
||||
- **Replacing Slack / Discord / Matrix as a human chat product.** claudemesh is for agent coordination; humans participate via bridges or direct DMs but UX is CLI-first.
|
||||
- **Pure-P2P with no central coordinator.** The broker stays — for offline queue, group fan-out, mesh authority, revocation. "P2P-first hybrid" is the commitment, not "P2P-only."
|
||||
- **Replacing the MCP `claude/channel` push-pipe.** Mid-turn interrupt stays MCP. The data-plane changes don't touch the daemon-to-Claude-Code path.
|
||||
- **Real-time media (audio/video) directly in claudemesh data channels.** Bandwidth-heavy media goes through dedicated stacks (LiveKit, WebRTC SFU). claudemesh metadata + signaling glue them together.
|
||||
|
||||
---
|
||||
|
||||
## Open questions
|
||||
|
||||
1. **Mid-turn push when sender is on P2P session.** P2P delivery to recipient's daemon → daemon emits MCP push. Same shape as broker-delivered. Confirm the MCP push respects per-session targeting (sibling sessions of the same member, each with its own session pubkey).
|
||||
|
||||
2. **Browser peers and NAT traversal.** Browser ↔ browser via WebRTC works. Browser ↔ daemon (Node WebRTC binding) — needs testing under symmetric NAT. May require running a STUN server (Google's for now; eventually self-hosted). TURN fallback uses the broker WS.
|
||||
|
||||
3. **Backpressure on stream channel.** WebRTC data channels have built-in flow control. Broker-relayed streams need per-stream backpressure signaling to avoid OOM at the broker. Proposal: receiver advertises `stream_window_bytes` periodically; sender pauses when the advertised window is used up.
|
||||
|
||||
4. **Multi-region brokers.** Today single broker. If we add a second broker (or federation), how do peers in mesh A on broker 1 talk to peers in mesh A on broker 2? Out of scope here; separate spec when forced.
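
The backpressure proposal in open question 3 can be sketched as a sender-side byte window. This assumes each `stream_window_bytes` advertisement replaces the remaining budget rather than adding to it, which the proposal does not specify; `StreamWindow` is a hypothetical name.

```typescript
class StreamWindow {
  private remainingBytes: number;

  constructor(initialWindowBytes: number) {
    this.remainingBytes = initialWindowBytes;
  }

  /** Receiver advertised a fresh stream_window_bytes value. */
  onWindowUpdate(streamWindowBytes: number): void {
    this.remainingBytes = streamWindowBytes;
  }

  /** True if the chunk may be sent now; false means the sender should pause. */
  trySend(chunkBytes: number): boolean {
    if (chunkBytes > this.remainingBytes) return false;
    this.remainingBytes -= chunkBytes;
    return true;
  }
}
```

With this shape the broker never buffers more than one advertised window per relayed stream, which is the OOM bound the question asks for.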
|
||||
|
||||
---
|
||||
|
||||
## Acknowledgements
|
||||
|
||||
**Codex-1 (initial architecture review of existing code) caught:**
|
||||
- "Remove daemon-WS inbound entirely" idea silently loses broadcasts + member-targeted DMs whenever zero launches exist. Corrected → retained.
|
||||
- Inheritance for the dup'd lifecycle would become a god class. Composition via helper kept.
|
||||
- Drain race needs `claimed_at` + delivered-on-success; "check OPEN before claim" still drops on crash. Kept.
|
||||
- Token-keyed registry is correct (token = auth boundary), not a smell. Kept.
|
||||
|
||||
**Codex-2 (single-pass review of v1 of this spec) caught:**
|
||||
- At-least-once requires receiver ack, not just "set delivered_at on success." → Layer 5 delivery semantics rewritten to require client_ack.
|
||||
- Service identity needs explicit `service_pubkey` field, included in attestation. → Added to ServiceIdentity definition.
|
||||
- v2 envelope `meta` should be non-optional with `clientMessageId` always present. → meta is now required.
|
||||
- Service identity needed explicit revocation/disable story. → New CLI verb `claudemesh service revoke`, broker denylist, system-channel gossip propagation.
|
||||
- `streamId` location ambiguous; concurrent streams to same peer would collide. → `meta.streamId` made REQUIRED for `channel: "stream"`.
|
||||
- Defer `SseTransport` from Milestone 3. → Done.
|
||||
- Drop named future-adapter list (MQTT/gRPC) to avoid false commitments. → Done.
|
||||
|
||||
The hybrid P2P data plane, transport adapter abstraction, typed channel envelope, mesh state replication, and milestone reordering are mine. Codex's reviews were targeted at correctness/scope-gap/should-cut, not redesign.
|
||||
|
||||
**This spec is now frozen for implementation.** No further architectural drift; deviations during implementation surface as new spec-deltas with explicit rationale, not silent edits to this document.
|
||||
360
.artifacts/specs/2026-05-04-agentic-comms-architecture.md
Normal file
@@ -0,0 +1,360 @@
|
||||
---
|
||||
title: claudemesh as agentic communication platform — architecture spec
|
||||
status: draft
|
||||
target: 2.0.0 (foundational cleanup) → 2.1.0 (transport adapters) → 2.2.0 (channel typing)
|
||||
author: Alejandro + Claude (cross-checked with Codex GPT-5.2)
|
||||
date: 2026-05-04
|
||||
supersedes: none
|
||||
references:
|
||||
- 2026-05-02-architecture-north-star.md (CLI-first commitment, push-pipe)
|
||||
- 2026-05-04-per-session-presence.md (per-launch session pubkey + attestation)
|
||||
- apps/cli/CHANGELOG.md (1.30.0–1.32.1 history)
|
||||
---
|
||||
|
||||
# claudemesh as agentic communication platform
|
||||
|
||||
## TL;DR
|
||||
|
||||
Today claudemesh is a **peer mesh for Claude Code sessions** — broker + CLI + per-session WS, encrypted DMs, peer list, mid-turn push via MCP. Tomorrow it has to be a **transport-agnostic agentic communication platform** that:
|
||||
|
||||
- treats Claude Code as **one channel type** among many (with first-class support for mid-turn interrupts via `claude/channel`)
|
||||
- accepts **non-Claude agents** as peers — voice agents (LiveKit/Pipecat), OpenAI Assistants, raw HTTP webhook consumers, scheduled cron actors, human IM bridges
|
||||
- exposes **typed channels** (DM, group, topic, RPC, system event, stream) so message semantics aren't shoved through one `targetSpec` string
|
||||
- has a **pluggable transport layer** so a peer can join the mesh over WS, HTTP webhook, SSE, MQTT, or gRPC without changing the broker's data plane
|
||||
- preserves **end-to-end encryption** as a non-negotiable for direct messages
|
||||
|
||||
This document specifies the architecture in three layers (identity, transport, channel), the foundational cleanup needed before adding any of it (Codex caught a few sharp issues), and the migration path that gets us there without a "v2 rewrite" event.
|
||||
|
||||
The CLI-first commitment from the North Star spec stays intact — every channel type and transport adapter must be invocable from `claudemesh <verb>` first, with MCP serving only `claude/channel` push.
|
||||
|
||||
---
|
||||
|
||||
## Why now
|
||||
|
||||
Three forcing functions:
|
||||
|
||||
1. **Multi-session interconnect already broke** (1.30.0 → 1.32.1). The per-session WS subsystem shipped without a push handler because the architecture assumed "one daemon WS per mesh handles everything" and then we bolted session WSes on top without finishing the inbound side. The shape is right; the wiring was incomplete. We need to formalize the role split before adding more transports.
|
||||
|
||||
2. **Codex review surfaced a correctness bug in the broker's drain.** `drainForMember` claims rows by setting `delivered_at = NOW()` *before* the WS push succeeds. If `ws.readyState !== OPEN` at push time, the row is marked delivered and the message is gone. This is at-most-once with no retry. Any future channel type or transport adapter inherits this bug if we don't fix it at the foundation.
|
||||
|
||||
3. **The agentic-comms market is becoming a thing.** Voice agents (LiveKit, Pipecat, ElevenLabs Conversational), OpenAI Assistants threads, MCP servers acting as autonomous workers, scheduled cron actors — they all need a "mesh" to coordinate. claudemesh has the right primitives (E2E crypto, peer presence, typed routing); it just needs the architecture to admit non-Claude peers without forking the codebase.
|
||||
|
||||
---
|
||||
|
||||
## Audience for this architecture
|
||||
|
||||
| Peer type | Identity | Transport | Channels they speak |
|
||||
|---|---|---|---|
|
||||
| **Claude Code session** (today) | Per-launch session pubkey, parent-attested by member key | WS to broker | DM, group, topic, system events; receives mid-turn push via MCP `claude/channel` |
|
||||
| **Headless agent** (e.g. cron job, Hermes/OpenClaw worker) | Member pubkey (no per-launch session) | WS to broker, OR HTTP webhook outbound | DM, group, topic; no mid-turn push (polls inbox) |
|
||||
| **Voice agent** (LiveKit/Pipecat call) | Service identity (signed by mesh owner) | WS to broker, possibly via TURN relay | DM (transcript stream), group (call participants), system events (call lifecycle) |
|
||||
| **OpenAI Assistant / Anthropic Agent** (Skill SDK) | Service identity, OAuth-style scoped token | HTTP webhook (server-side push) OR WS | DM, RPC (tool-style request/response) |
|
||||
| **Human via Slack/WhatsApp bridge** | Service identity for the bridge, end-user mapped via membership | WS (bridge to broker) | DM, topic |
|
||||
| **Webhook consumer** (Stripe-style passive listener) | Service identity, scoped to one channel | HTTP webhook outbound only | Topic (subscribe to events) |
|
||||
|
||||
Every row in this table needs to work without changing the broker's data plane.
|
||||
|
||||
---
|
||||
|
||||
## Layer 1: Identity
|
||||
|
||||
### Today
|
||||
|
||||
Two identity types coexist:
|
||||
|
||||
- **Member identity** — stable Ed25519 keypair held in `~/.claudemesh/config.json`. One per joined mesh. Used for hello signature on the daemon's main WS; used as the cryptographic root of trust for sibling sessions.
|
||||
- **Session identity** — ephemeral Ed25519 keypair generated per `claudemesh launch`. Parent-signed attestation vouches for it (TTL 12h, broker cap 24h). Used for hello signature on the per-session WS; used as the routing key for DMs targeted at *this specific launched session*.
|
||||
|
||||
This is enough for Claude Code peers. It's not enough for the audience table above.
|
||||
|
||||
### Proposed: third identity type — **service identity**
|
||||
|
||||
A service identity is what a non-Claude integration uses to authenticate:
|
||||
|
||||
```
|
||||
ServiceIdentity {
|
||||
member_id // The mesh member that owns this service (auth boundary)
|
||||
service_id // Stable id for the service ("openai-assistant-foo", "livekit-room-bar")
|
||||
service_type // "openai-assistant" | "livekit-room" | "webhook" | "voice-agent" | ...
|
||||
scopes // ["dm:read", "topic:write", "rpc:invoke", ...]
|
||||
attestation // member-signed: { service_id, scopes, expires_at, signature }
|
||||
transport_hint // "ws" | "http-webhook" | "sse" — informs how the broker reaches it
|
||||
}
|
||||
```
|
||||
|
||||
**Three identity types, one auth model:**
|
||||
- All identities resolve to a `member_id` (the auth boundary — grants, kicks, bans operate on members).
|
||||
- Identities differ in *liveness* (member = always; session = per-launch; service = scoped/scheduled) and in *transport hint* (member/session = WS-resident; service = polymorphic).
|
||||
|
||||
**Backward compatibility:** existing member + session identities are unchanged. Service identity is additive.
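
As a rough illustration of how the broker might gate a service request on its attestation, the sketch below checks expiry and scope. It assumes `expires_at` is a millisecond timestamp and that the member signature was verified beforehand; `isAuthorized` is a hypothetical helper, not an existing broker function.

```typescript
// Mirrors the attestation fields from the ServiceIdentity definition above.
interface Attestation {
  service_id: string;
  scopes: string[];
  expires_at: number; // assumption: ms since epoch
  signature: string;  // member signature; verification is out of scope here
}

/** Allow a request only if the attestation is unexpired and carries the scope. */
function isAuthorized(att: Attestation, requiredScope: string, nowMs: number): boolean {
  if (nowMs >= att.expires_at) return false;
  return att.scopes.indexOf(requiredScope) !== -1;
}
```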
|
||||
|
||||
### Cryptographic implications
|
||||
|
||||
- E2E encryption (`crypto_box`) targets a public key. Member pubkey, session pubkey, service pubkey all work the same way.
|
||||
- A service that can't hold a long-lived secret (e.g. OpenAI Assistant calling out via HTTPS) gets a **delegated identity** the daemon holds — sender encrypts to the daemon's per-member key, daemon re-encrypts and forwards over the service's webhook. This adds trust in the daemon, but it's the only way to bridge to non-crypto-native peers without giving them raw secrets.
|
||||
|
||||
---
|
||||
|
||||
## Layer 2: Transport
|
||||
|
||||
### Today
|
||||
|
||||
One transport: **WebSocket to broker** (`wss://ic.claudemesh.com/ws`). Everything goes through it — hello, send, push, RPC. The CLI's daemon holds two WS instances per mesh (member-keyed `DaemonBrokerClient` + per-launch `SessionBrokerClient`).
|
||||
|
||||
### Proposed: transport adapter interface
|
||||
|
||||
```typescript
|
||||
interface BrokerTransport {
|
||||
/** One-time hello + auth handshake. Identity is opaque to the transport. */
|
||||
connect(opts: TransportConnectOpts): Promise<TransportSession>;
|
||||
|
||||
/** Send a typed envelope. Returns a delivery promise (ack or terminal failure). */
|
||||
send(envelope: Envelope): Promise<SendResult>;
|
||||
|
||||
/** Stream of inbound envelopes. Pull-model so a transport can be a webhook,
|
||||
* not just a long-lived socket. */
|
||||
inbound(): AsyncIterable<Envelope>;
|
||||
|
||||
/** Close cleanly. */
|
||||
close(reason?: string): Promise<void>;
|
||||
|
||||
/** Capabilities surfaced to the daemon — broker uses this to decide
|
||||
* whether mid-turn push is possible, whether RPC blocks are
|
||||
* supported, etc. */
|
||||
capabilities: TransportCapabilities;
|
||||
}
|
||||
```
|
||||
|
||||
**Concrete adapters at v2.1.0:**
|
||||
|
||||
1. **`WsBrokerTransport`** — current WS implementation. The `DaemonBrokerClient` and `SessionBrokerClient` are recast as two roles using this transport with different hello payloads.
|
||||
2. **`HttpWebhookTransport`** — for service identities that can't hold a WS open. Outbound: HTTP POST to the broker's `/v1/send`. Inbound: broker calls back to a registered webhook URL with retry + signature. Mid-turn push is not possible (degrades gracefully).
|
||||
3. **`SseTransport`** — for browsers / restricted environments. Outbound: HTTP POST. Inbound: SSE stream from broker to client.
|
||||
|
||||
**Future adapters (v2.3+):**
|
||||
|
||||
4. **`LiveKitTransport`** — for voice agents. The "broker" is a LiveKit room; messages are LiveKit data-channel packets. Bridges to the central broker via a daemon side-car.
|
||||
5. **`MqttTransport`** — for IoT / fleet scenarios.
|
||||
6. **`GrpcTransport`** — for low-latency intra-cluster.
|
||||
|
||||
Any new adapter implements the same interface; broker logic is transport-agnostic at the API boundary.
|
||||
|
||||
### The two-role model (Codex's correction)
|
||||
|
||||
Even within one transport, the daemon holds **two roles per mesh**, not one connection per launch:
|
||||
|
||||
- **Control-plane connection** — one per mesh, member-keyed. Carries: outbox drain (one queue, can't race), `list_peers`/state/memory/skill RPCs, inbound for `*` broadcasts and member-targeted DMs (legacy traffic + zero-launch state).
|
||||
- **Session connections** — N per mesh, session-keyed. Carries: presence row keyed on session pubkey, inbound for session-targeted DMs.
|
||||
|
||||
This is what we have today; the spec just makes the role split explicit. The mistake in 1.30.0–1.32.0 was treating session connections as "presence-only" instead of "second-class peers." 1.32.1 corrects that.
|
||||
|
||||
### Foundational cleanup (ship first, before any new transport)
|
||||
|
||||
1. **Extract `connectWsWithBackoff` helper** — current `DaemonBrokerClient` and `SessionBrokerClient` duplicate the WS lifecycle (open, hello, ack-timeout, close, backoff, reconnect). Codex's recommendation: composition, not inheritance. A single helper takes `{ url, buildHello, onMessage, onStatusChange }` and both clients call it. Eliminates the drift bug class that produced session_replaced thrashing.
|
||||
|
||||
2. **Drop the daemon's stray `sessionPubkey`** (`apps/cli/src/daemon/broker.ts:113`). It's a leftover from the era when the daemon WS was the only WS. The session role now owns session pubkeys. If we want the daemon itself to be addressable by a stable pubkey, rename it `daemonPubkey` and document it; today it's dead ballast.
|
||||
|
||||
3. **Tighten daemon-WS inbound filter, don't remove it** (Codex's correction to my prior take). Daemon WS should still receive `*` broadcasts and member-targeted DMs (legacy senders, zero-launch state). It should NOT decrypt session-targeted DMs (that's the session WS's job, and decryption requires the session secret which the daemon WS doesn't have anyway).
|
||||
|
||||
4. **Fix the broker drain race** (`apps/broker/src/broker.ts:2399-2402`). Add `claimed_at` + `claim_id` columns; claim sets `claimed_at = NOW()` (NOT `delivered_at`); push runs; `delivered_at = NOW()` is set ONLY after `ws.send` succeeds. Re-eligible if `claimed_at` is older than the lease timeout (e.g. 30s). Combined with `client_message_id` dedupe on the receiver side, this gives at-least-once semantics, which is what an agentic comms platform needs.
|
||||
|
||||
5. **Decouple presence-WS-role from session-WS-role at the broker.** Today `connectPresence` is called from both `handleHello` and `handleSessionHello`. The two paths diverge in identity (member vs session pubkey) and dedup key (sessionId in both cases). Make the role explicit on the presence row (`role: "control-plane" | "session" | "service"`) so list_peers, fan-out, and reconnect can reason about it. Hidden `claudemesh-daemon` rows in 1.32.0's `peer list` are a hack covering for missing typing.
|
||||
|
||||
---
|
||||
|
||||
## Layer 3: Channels
|
||||
|
||||
### Today
|
||||
|
||||
One channel type: **direct messages with target-spec routing**. `targetSpec` is a string that the broker pattern-matches:
|
||||
- `<64-hex-pubkey>` → DM to that member or session
|
||||
- `*` → broadcast to mesh
|
||||
- `@<groupname>` → group post
|
||||
- `#<topicId>` → topic post
|
||||
|
||||
This works but it's overloaded — the same `send` verb covers DMs, broadcasts, groups, topics, and (since v0.9) tagged messages. As we add agentic peers, the semantics matter and the routing key string can't carry them.
|
||||
|
||||
### Proposed: typed channel envelope
|
||||
|
||||
```typescript
|
||||
type ChannelType =
|
||||
| "dm" // 1:1 message, encrypted to recipient pubkey
|
||||
| "group" // post to named group, encrypted per-recipient (today: base64 plaintext)
|
||||
| "topic" // pub/sub topic, persisted, history available, per-topic symmetric key
|
||||
| "rpc" // request/response, correlation id, timeout, structured result
|
||||
| "system" // peer_joined / peer_left / topology / lifecycle events
|
||||
| "stream"; // long-lived data stream (voice transcript, log tail, file transfer chunks)
|
||||
|
||||
interface Envelope {
|
||||
/** Schema version. v1 = current opaque shape. v2 = this typed shape. */
|
||||
v: 2;
|
||||
/** What semantics the receiver should apply. */
|
||||
channel: ChannelType;
|
||||
/** Target — pubkey for dm, group name for group, topic id for topic, etc.
|
||||
* Same wire format as today's targetSpec, but typed. */
|
||||
target: string;
|
||||
/** Sender identity (member, session, or service pubkey). */
|
||||
from: string;
|
||||
/** Encrypted payload + crypto envelope. Channel type drives crypto:
|
||||
* - dm: crypto_box to recipient pubkey
|
||||
* - group: per-recipient seal (today: plaintext)
|
||||
* - topic: symmetric key (today: plaintext, v0.2.0+ adds per-topic key)
|
||||
* - rpc / system / stream: same as DM (crypto_box) */
|
||||
body: { nonce: string; ciphertext: string; bodyVersion: number };
|
||||
/** Optional metadata, varies by channel type. */
|
||||
meta?: {
|
||||
/** Stable client-supplied id for dedupe (existing field, made required for v2). */
|
||||
clientMessageId: string;
|
||||
/** Sender's canonical fingerprint per spec §4.4 (existing field). */
|
||||
requestFingerprint?: string;
|
||||
/** dm/group: priority gate (now/next/low). rpc: timeoutMs. stream: streamChunkId. */
|
||||
priority?: "now" | "next" | "low";
|
||||
timeoutMs?: number;
|
||||
streamChunkId?: number;
|
||||
/** dm/topic: replyTo for threading. */
|
||||
replyToId?: string;
|
||||
/** topic: mentions list (existing field). */
|
||||
mentions?: string[];
|
||||
/** rpc: correlation back-edge so the broker can route the response. */
|
||||
rpcCorrelationId?: string;
|
||||
};
|
||||
/** Sender signature over (channel, target, from, nonce, ciphertext, meta). */
|
||||
signature?: string;
|
||||
}
|
||||
```
|
||||
|
||||
**Why this matters for agentic peers:**
|
||||
|
||||
- A voice agent sending a partial transcript wants `channel: "stream"` semantics — high-frequency, small chunks, idempotent, no per-message ack required.
|
||||
- An OpenAI Assistant calling a tool wants `channel: "rpc"` — request-response with timeout, correlation back-edge so the response routes.
|
||||
- A scheduled cron actor reporting completion wants `channel: "topic"` — fire-and-forget, persisted history.
|
||||
- Today all of these get bolted onto `dm` with conventions; v2 envelope makes them first-class.
|
||||
|
||||
### Claude Code channels — first-class support
|
||||
|
||||
Three specific channel features for Claude Code:
|
||||
|
||||
1. **Mid-turn interrupt** (`claude/channel` push). Already implemented via the MCP push-pipe. The new envelope makes it explicit: `channel: "dm"` with `meta.priority: "now"` triggers MCP push to a launched session. Other priorities deliver at next inbox poll.
|
||||
|
||||
2. **Reply threading** (`meta.replyToId`). Already partially supported on topics; v2 makes it work uniformly across `dm` and `topic`. The receiver Claude Code session sees a structured reply thread instead of flat history.
|
||||
|
||||
3. **Mentions** (`meta.mentions`). Already supported on topics; v2 surfaces them on `dm` too — useful for `@<peer>` callouts in groups even when the message body is encrypted.
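
The mid-turn interrupt rule in item 1 reduces to a small routing predicate. The sketch below is illustrative only: `routeForClaudeCode` and the `Delivery` labels are hypothetical, and the envelope is trimmed to the two fields the rule reads.

```typescript
type Priority = "now" | "next" | "low";

interface RoutedEnvelope {
  channel: string;
  meta?: { priority?: Priority };
}

type Delivery = "mcp-push" | "inbox";

/** Only dm + priority "now" interrupts a launched session; the rest waits for the inbox poll. */
function routeForClaudeCode(env: RoutedEnvelope, hasLaunchedSession: boolean): Delivery {
  const urgent = env.channel === "dm" && env.meta?.priority === "now";
  return urgent && hasLaunchedSession ? "mcp-push" : "inbox";
}
```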
|
||||
|
||||
### Backward compatibility
|
||||
|
||||
Envelope v1 (today's shape) stays accepted by the broker until v3.x. v1 envelopes are auto-upgraded server-side: `channel` inferred from `targetSpec` shape (`*` → group/broadcast, `@` → group, `#` → topic, hex → dm). Existing CLIs keep working.
|
||||
|
||||
---
|
||||
|
||||
## Future integrations (concrete)
|
||||
|
||||
These are not part of v2.0 — they're the test cases the architecture must support:
|
||||
|
||||
### LiveKit voice agent
|
||||
- Service identity: `livekit-room-<id>`, signed by mesh owner.
|
||||
- Transport: dedicated daemon side-car hosts a LiveKit participant; data-channel packets bridge to the central broker via WS.
|
||||
- Channels: `stream` for transcript chunks, `system` for call lifecycle (joined/left/muted), `dm` for sidebar text.
|
||||
- E2E: per-call ephemeral keypair held by the side-car; participants' member keys are discovered via mesh peer list.
|
||||
|
||||
### OpenAI Assistant integration
|
||||
- Service identity: `openai-assistant-<id>`, scoped to one or more topics + RPC.
|
||||
- Transport: HTTP webhook out (broker → assistant API), HTTP POST in (assistant → broker `/v1/send`).
|
||||
- Channels: `rpc` for tool-style invocations from claudemesh peers, `topic` for assistant-published events.
|
||||
- Crypto: delegated to daemon (assistant can't hold a libsodium secret; daemon re-encrypts on its behalf).
|
||||
|
||||
### Generic webhook consumer (Stripe-style)
|
||||
- Service identity: `webhook-<consumer-id>`, scoped to subscribed topics.
|
||||
- Transport: HTTP webhook out only. No inbound — it's a passive sink.
|
||||
- Channels: `topic` only.
|
||||
- Crypto: not E2E; webhook bodies are signed (HMAC-SHA256, sender = mesh) but plaintext.
|
||||
|
||||
### Human-via-WhatsApp bridge
|
||||
- Service identity: `whatsapp-bridge`, with member-mapping for each end-user.
|
||||
- Transport: WS (bridge holds long connection to broker), bridges to WhatsApp Business API.
|
||||
- Channels: `dm` (1:1 chat → WhatsApp DM), `topic` (claudemesh topic → WhatsApp group).
|
||||
- E2E: bridge holds a per-end-user delegated key; not "true" E2E to the WhatsApp side, but signaled clearly in UX.
|
||||
|
||||
---
|
||||
|
||||
## Migration plan
|
||||
|
||||
### v2.0.0 — Foundational cleanup (no new external surface)
|
||||
**Target: 1–2 weeks**
|
||||
|
||||
- [ ] Extract `connectWsWithBackoff` helper, refactor `DaemonBrokerClient` + `SessionBrokerClient` to use it.
|
||||
- [ ] Drop daemon's stray `sessionPubkey` (or rename + document).
|
||||
- [ ] Tighten daemon-WS inbound filter (broadcast + member-targeted only).
|
||||
- [ ] Add `presence.role` column (`control-plane | session | service`); broker fan-out + list_peers honor it.
|
||||
- [ ] **Fix drain race**: schema migration adds `claimed_at`, `claim_id`, `claim_expires_at` columns; rewrite `drainForMember` for two-phase claim/deliver; add re-claim path for stale leases.
|
||||
- [ ] Receiver-side: harden `client_message_id` dedupe (already partial in 1.32.x; finish for at-least-once). Add idempotent insert that returns existing row on conflict.
|
||||
|
||||
**Success criteria:**
|
||||
- Two-session smoke test still passes (1.32.1 baseline).
|
||||
- Crash-mid-push test: kill broker between claim and send; verify message redelivers on broker restart + recipient reconnect.
|
||||
- Reconnect storm test: 100 reconnect cycles per session over 60s; zero message loss.
|
||||
|
||||
### v2.1.0 — Transport adapter interface
|
||||
**Target: 2–3 weeks after v2.0.0**
|
||||
|
||||
- [ ] Define `BrokerTransport` interface; refactor existing WS code to be the first implementation.
|
||||
- [ ] Add `HttpWebhookTransport` adapter (broker side: outbound HTTP POST with retry + HMAC signature; daemon side: HTTP server that receives webhook callbacks and inserts into inbox).
|
||||
- [ ] Add `/v1/send` HTTP endpoint on the broker (today the broker is WS-only for sends).
|
||||
- [ ] Service identity registration flow: `claudemesh service register --type webhook --scopes dm:read,topic:write` mints attestation, stores it locally + on broker.
|
||||
- [ ] Basic `SseTransport` for browser/CI use cases.
|
||||
|
||||
**Success criteria:**
|
||||
- A scheduled cron job using only `curl` can send to the mesh (no daemon required).
|
||||
- A webhook consumer subscribed to a topic receives messages within 5s of post.
|
||||
|
||||
### v2.2.0 — Typed channels (envelope v2)
|
||||
**Target: 2–3 weeks after v2.1.0**
|
||||
|
||||
- [ ] Define `Envelope v2` schema; broker accepts both v1 and v2; sender-side code emits v2.
|
||||
- [ ] `channel: "rpc"` end-to-end: correlation id routing, response timeout, `claudemesh rpc <peer> <method> <args>` CLI verb.
|
||||
- [ ] `channel: "stream"` end-to-end: chunked delivery, ordered, idempotent, `claudemesh stream <peer> <stream-id>` CLI verb.
|
||||
- [ ] Mid-turn push (`claude/channel`) honors `channel: "dm"` with `meta.priority: "now"` only.
|
||||
- [ ] Mentions + replyToId surface uniformly across dm and topic.
|
||||
|
||||
**Success criteria:**
|
||||
- Demo: a Claude Code session sends an `rpc` to another Claude Code session, gets a structured response.
|
||||
- Demo: a voice-agent prototype sends `stream` chunks; another peer receives them in order with no gaps.
|
||||
|
||||
### v2.3+ — Concrete external integrations
|
||||
**Target: opportunistic**
|
||||
|
||||
- LiveKit side-car (one PoC integration to validate the architecture).
|
||||
- OpenAI Assistant integration (validate delegated-key crypto path).
|
||||
- WhatsApp bridge (validate human-bridge service identity).
|
||||
|
||||
These are not on the critical path for the architecture; they prove it.
|
||||
|
||||
---
|
||||
|
||||
## Non-goals (explicit)
|
||||
|
||||
- **Replacing Slack / Discord.** claudemesh is for agent coordination. Human chat is a side-effect, not the headline.
|
||||
- **Federation across multiple brokers.** v2.0 stays single-broker per mesh. Multi-broker (gossip / federation) is a separate spec, post-v3.
|
||||
- **Sync-only / no-broker P2P.** Direct peer-to-peer (without the central broker) is a different architecture (libp2p, Iroh). Not in scope.
|
||||
- **Replacing the MCP push-pipe.** Mid-turn interrupt stays MCP-based. The transport-adapter layer is broker-side; MCP is daemon-to-Claude-Code, untouched.
|
||||
|
||||
---
|
||||
|
||||
## Open questions
|
||||
|
||||
1. **How does a service identity prove liveness?** WS gives us implicit liveness via the connection. HTTP webhook services need an explicit heartbeat / health-check. Proposal: broker periodically POSTs to `<webhook>/health`; service is marked offline after 3 consecutive failures.
|
||||
|
||||
2. **RPC routing through offline peers — what's the failure mode?** If `claudemesh rpc <peer> ...` and the peer is offline, do we (a) queue and wait (DM semantics) or (b) fail fast (REST semantics)? Proposal: RPC fails fast with `peer_offline` after a 5s probe; explicit `--wait` flag opts into DM-style queue.
|
||||
|
||||
3. **Per-topic symmetric key rotation.** Existing v0.2.0 spec mentions per-topic keys. Rotation policy (when, who triggers, how members re-sync) is unsolved. Defer to a separate spec; v2.2.0 ships with one-shot keys (rotate by re-creating topic).
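
The liveness proposal in open question 1 (offline after 3 consecutive failed health checks) is simple enough to sketch directly. `WebhookLiveness` is a hypothetical name, and the sketch assumes a single successful check brings the service back online, which the proposal leaves open.

```typescript
const MAX_CONSECUTIVE_FAILURES = 3; // threshold from the proposal

class WebhookLiveness {
  private consecutiveFailures = 0;

  /** Record the outcome of one broker POST to <webhook>/health. */
  record(ok: boolean): void {
    this.consecutiveFailures = ok ? 0 : this.consecutiveFailures + 1;
  }

  /** The service counts as online until the failure streak hits the threshold. */
  isOnline(): boolean {
    return this.consecutiveFailures < MAX_CONSECUTIVE_FAILURES;
  }
}
```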
|
||||
|
||||
---
|
||||
|
||||
## Acknowledgements
|
||||
|
||||
Cross-checked with Codex (GPT-5.2, high reasoning) on the foundational cleanup section. Codex caught:
|
||||
- The "remove daemon-WS inbound entirely" idea would silently lose broadcasts + member-targeted DMs whenever zero launches exist. Corrected.
|
||||
- Inheritance for the dup'd lifecycle would become a god class. Composition via helper is the right call.
|
||||
- The drain race needs a `claimed_at` + delivered-on-success fix; "check OPEN before claim" still drops on crash.
|
||||
- Token-keyed registry is correct (token = auth boundary), not a smell.
|
||||
|
||||
The agentic-comms / typed-channels / transport-adapter layers are mine — Codex didn't touch those because the question I asked was about the existing architecture's smells, not the future roadmap.
|
||||