Compare commits
376 Commits
v0.1.16
...
7460d34335
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
7460d34335 | ||
|
|
25586d298f | ||
|
|
a852a9df18 | ||
|
|
4cfb682eab | ||
|
|
0958463998 | ||
|
|
088a4efaa3 | ||
|
|
15b7920b2a | ||
|
|
b0c1348a0a | ||
|
|
1a14cef1e0 | ||
|
|
71f7f81880 | ||
|
|
052f65149d | ||
|
|
0b3014e7eb | ||
|
|
cef246a34a | ||
|
|
f013436541 | ||
|
|
6d981976c0 | ||
|
|
f7d7d391c9 | ||
|
|
ff2aa8bf7c | ||
|
|
4d42185b0f | ||
|
|
d62b3f45d2 | ||
|
|
e688f66791 | ||
|
|
033a2d37e1 | ||
|
|
364178d95b | ||
|
|
f91871c71d | ||
|
|
92cac16c91 | ||
|
|
81f0e4f7ac | ||
|
|
2b6cf2c14b | ||
|
|
8a5469a5df | ||
|
|
e128a6ae5f | ||
|
|
3753a6e137 | ||
|
|
cb90f1ca60 | ||
|
|
0e3a5babd9 | ||
|
|
6794aa8512 | ||
|
|
c56910bfcf | ||
|
|
4eff4f5a20 | ||
|
|
a2568ad9f4 | ||
|
|
bf22afb0ed | ||
|
|
abaa4bcf87 | ||
|
|
65e63b0b27 | ||
|
|
5785454ac9 | ||
|
|
03cff156e2 | ||
|
|
e84914b25b | ||
|
|
5a1d5d6a49 | ||
|
|
f3649d761f | ||
|
|
79485898cf | ||
|
|
b69df75f0c | ||
|
|
3a3d2a6c4c | ||
|
|
f9ed3fa286 | ||
|
|
50b2ae97c2 | ||
|
|
4b459622e4 | ||
|
|
f679b49b6c | ||
|
|
5ceb311d74 | ||
|
|
e60980cfd7 | ||
|
|
ff3d11d42d | ||
|
|
43e429f204 | ||
|
|
1c335e8daa | ||
|
|
397ddb4c45 | ||
|
|
354c47c3d6 | ||
|
|
2262564680 | ||
|
|
c18891191e | ||
|
|
eb021a8a6f | ||
|
|
3964de4962 | ||
|
|
c795df4fd4 | ||
|
|
aa6c7be4eb | ||
|
|
3da06d357e | ||
|
|
075df6db08 | ||
|
|
c7ce92f35b | ||
|
|
7de13cbb71 | ||
|
|
ad70782171 | ||
|
|
646d4fa3f1 | ||
|
|
7f6af0137d | ||
|
|
2e57173ed9 | ||
|
|
95b16a23fc | ||
|
|
a3cf9b938e | ||
|
|
ce321c0a21 | ||
|
|
9ecf2d65af | ||
|
|
80755dbf9b | ||
|
|
82ee89d0dc | ||
|
|
8697c1c032 | ||
|
|
716e674473 | ||
|
|
038a5b5bf7 | ||
|
|
d871988084 | ||
|
|
3c35932191 | ||
|
|
b08daadbdc | ||
|
|
cb5faca920 | ||
|
|
77f4316f2d | ||
|
|
82ebd2b6be | ||
|
|
b70536195a | ||
|
|
39929eb7fe | ||
|
|
da5103a315 | ||
|
|
1a238d4178 | ||
|
|
81f8066f99 | ||
|
|
dd80d4e946 | ||
|
|
c31a591681 | ||
|
|
a2ab7de60a | ||
|
|
69cf39bc9f | ||
|
|
0ab2bea045 | ||
|
|
f4601f4d9c | ||
|
|
a83133a4c6 | ||
|
|
a9160a0965 | ||
|
|
00c25d9803 | ||
|
|
35a289b64a | ||
|
|
7af61e121e | ||
|
|
a75483b3c2 | ||
|
|
541440c357 | ||
|
|
a80eb6fcca | ||
|
|
7e71a61db4 | ||
|
|
d7cef45640 | ||
|
|
0f32529370 | ||
|
|
7d1538d743 | ||
|
|
dc7e0e826d | ||
|
|
2aa21fe07c | ||
|
|
6de5e275fa | ||
|
|
c2cd67a885 | ||
|
|
4ebd138a68 | ||
|
|
2e97a0eeee | ||
|
|
f727620d16 | ||
|
|
c801afd2ab | ||
|
|
b60daff886 | ||
|
|
7d35c779f4 | ||
|
|
f08d6c9f0c | ||
|
|
9dd1e401b0 | ||
|
|
9418d0ee30 | ||
|
|
8b5708a604 | ||
|
|
56d7cc1c48 | ||
|
|
13d691980a | ||
|
|
f45380d231 | ||
|
|
f71218c1e1 | ||
|
|
f98c2de5a3 | ||
|
|
1afae7a507 | ||
|
|
b4f457fceb | ||
|
|
ff551ccf3d | ||
|
|
b49e9a9b61 | ||
|
|
163e1be70a | ||
|
|
3d2ab0cb4b | ||
|
|
0664180a54 | ||
|
|
2abf86d540 | ||
|
|
a5347cebc0 | ||
|
|
622ea569ad | ||
|
|
d7f381a1e8 | ||
|
|
3ceac68e67 | ||
|
|
5ddb11b2d5 | ||
|
|
2edbfce7d3 | ||
|
|
9f3a82dd63 | ||
|
|
05729ad8a4 | ||
|
|
49e0af0fc0 | ||
|
|
2be5e9dccb | ||
|
|
1a7a059e75 | ||
|
|
39fe296aaa | ||
|
|
3dfab0f792 | ||
|
|
6f4a44e281 | ||
|
|
4bc3c045ae | ||
|
|
94e914f476 | ||
|
|
1bb702e481 | ||
|
|
45d85f5eaa | ||
|
|
ee12510ef1 | ||
|
|
c9ede3d469 | ||
|
|
b998e35d17 | ||
|
|
506c470441 | ||
|
|
b4703a482d | ||
|
|
29f546abcf | ||
|
|
5716a6ce22 | ||
|
|
d37516213a | ||
|
|
5b69de08da | ||
|
|
ccf95ff382 | ||
|
|
43f2728283 | ||
|
|
d33b8fc43b | ||
|
|
ce52fcef2d | ||
|
|
77ee1d0d80 | ||
|
|
2f27a5eef4 | ||
|
|
32851419e6 | ||
|
|
e2b6e53cc1 | ||
|
|
3595fc2c4d | ||
|
|
2825ef7151 | ||
|
|
a9858ef876 | ||
|
|
6836a495a4 | ||
|
|
07720f8f1e | ||
|
|
f4881b21b0 | ||
|
|
4561076904 | ||
|
|
0d53f2ae52 | ||
|
|
b328e78bd3 | ||
|
|
23604a125e | ||
|
|
b680260c8d | ||
|
|
b65a545ece | ||
|
|
d07cff788c | ||
|
|
bb1310167e | ||
|
|
ea4e3b03bb | ||
|
|
1a42c2ef09 | ||
|
|
43b70013c5 | ||
|
|
b8d8b5469b | ||
|
|
ab7fb6bd31 | ||
|
|
b2999878c4 | ||
|
|
a890a1d92e | ||
|
|
80a6b8b50f | ||
|
|
465ff9a10e | ||
|
|
0f46c787a7 | ||
|
|
a365fef170 | ||
|
|
ca441dae45 | ||
|
|
ac709dbe92 | ||
|
|
d0fbc64e7e | ||
|
|
f1d35b10da | ||
|
|
5e97d48cd5 | ||
|
|
c8ae6462e3 | ||
|
|
fb7a84aed6 | ||
|
|
c1fa3bcb5c | ||
|
|
dbea96960f | ||
|
|
a022da1998 | ||
|
|
5df2664bae | ||
|
|
816c42feae | ||
|
|
4c0a417b7c | ||
|
|
e6962f1454 | ||
|
|
1d506f3ea5 | ||
|
|
64266a75f7 | ||
|
|
2710f354a9 | ||
|
|
6b55859d38 | ||
|
|
7d31cc6283 | ||
|
|
0403cfeb76 | ||
|
|
d8e6900072 | ||
|
|
ed8dab8bd3 | ||
|
|
dad51870d9 | ||
|
|
a6af0f2154 | ||
|
|
0661e6223a | ||
|
|
05e3c43e29 | ||
|
|
e3fa6e6a5e | ||
|
|
17066b4f6c | ||
|
|
8d1685e64d | ||
|
|
bb28e16c7d | ||
|
|
ac59d2acfe | ||
|
|
0a1af84712 | ||
|
|
18dc29aba1 | ||
|
|
795217093f | ||
|
|
61b0813924 | ||
|
|
c10337ab9f | ||
|
|
126bbfeb2c | ||
|
|
c914f2b7db | ||
|
|
a8b9348b36 | ||
|
|
c3dd4efe82 | ||
|
|
a7d9ecab15 | ||
|
|
d263fe0f26 | ||
|
|
3226493e6d | ||
|
|
4cb5a97512 | ||
|
|
c080bc517f | ||
|
|
471e88b3e6 | ||
|
|
c66e3adf67 | ||
|
|
3f46a6657a | ||
|
|
83ba1aa373 | ||
|
|
7430e4ffe0 | ||
|
|
d72e49b8fd | ||
|
|
3f57944921 | ||
|
|
b31aab8aeb | ||
|
|
5db9842261 | ||
|
|
81e520fdbb | ||
|
|
26c4502277 | ||
|
|
bfc62b9a72 | ||
|
|
f8c6f9ae74 | ||
|
|
3497700fad | ||
|
|
2c156f832e | ||
|
|
4ee810242d | ||
|
|
b6224c4186 | ||
|
|
4c385a16cc | ||
|
|
4ae6a86bf6 | ||
|
|
c327c282e3 | ||
|
|
e645455b22 | ||
|
|
45505a1635 | ||
|
|
17e6361d64 | ||
|
|
528e7e21b1 | ||
|
|
7b875de301 | ||
|
|
8a3c96dc7c | ||
|
|
b0634b829c | ||
|
|
2bd388a5e2 | ||
|
|
71c0767a1b | ||
|
|
6a3f087209 | ||
|
|
873f588057 | ||
|
|
070a3b7422 | ||
|
|
75ca892ea7 | ||
|
|
a90046a8e3 | ||
|
|
02a165dd76 | ||
|
|
52393429f9 | ||
|
|
9474d985ae | ||
|
|
643c808685 | ||
|
|
2c24f667f9 | ||
|
|
b0113913f2 | ||
|
|
e1cafa54b3 | ||
|
|
a4f2e0aa81 | ||
|
|
cbcde4d910 | ||
|
|
495c234159 | ||
|
|
42c1d02f5e | ||
|
|
a33c925216 | ||
|
|
6ab3fbbea3 | ||
|
|
26adbafde2 | ||
|
|
13e8ce07ac | ||
|
|
5398ca6833 | ||
|
|
56b1cc0756 | ||
|
|
fc8a7edc23 | ||
|
|
e09671cdcb | ||
|
|
32fc4a0c98 | ||
|
|
b315b31cc9 | ||
|
|
21cb6efced | ||
|
|
125b576e2c | ||
|
|
3641618391 | ||
|
|
a92cf6b629 | ||
|
|
2c9c8c7b6c | ||
|
|
98fda20ab6 | ||
|
|
025a53a70c | ||
|
|
b55cf269a4 | ||
|
|
504111c50c | ||
|
|
05d9b56f28 | ||
|
|
c8cb1e3ea5 | ||
|
|
86a258301f | ||
|
|
7e102a235b | ||
|
|
5563f90733 | ||
|
|
b3b9972e60 | ||
|
|
fe9285351b | ||
|
|
08e289a5e3 | ||
|
|
7d432b3aaa | ||
|
|
b0dc538119 | ||
|
|
27c9d2a02c | ||
|
|
87e0d0004d | ||
|
|
dba0fb7b33 | ||
|
|
72be651ca8 | ||
|
|
db2bf3ea06 | ||
|
|
e87380775f | ||
|
|
58ba01f20f | ||
|
|
59332dc47d | ||
|
|
f34b8fbc6b | ||
|
|
79525af42e | ||
|
|
69e93d4b8c | ||
|
|
810f372d1c | ||
|
|
453705a4e1 | ||
|
|
5cb4cc4fe7 | ||
|
|
eeac47c360 | ||
|
|
0bb9d71a26 | ||
|
|
3ff7a61e3f | ||
|
|
e76ade64d2 | ||
|
|
59848f0d3e | ||
|
|
d0fa1c028f | ||
|
|
8f925d9a9e | ||
|
|
4ce1034dcd | ||
|
|
e26a36e543 | ||
|
|
60c74d9463 | ||
|
|
6fba9bd4eb | ||
|
|
5bcc1fe323 | ||
|
|
e70f0ed1ff | ||
|
|
5f696f47ea | ||
|
|
ccb9fb2a68 | ||
|
|
898c061089 | ||
|
|
f7a6559429 | ||
|
|
579d0c3d3e | ||
|
|
190f5a958e | ||
|
|
03661e1b68 | ||
|
|
d451fc296e | ||
|
|
3da5d71275 | ||
|
|
cdf335f609 | ||
|
|
0cd16ff358 | ||
|
|
3e9707276d | ||
|
|
82cfee315c | ||
|
|
ab08be04a5 | ||
|
|
ee585a8370 | ||
|
|
1f078bf0c8 | ||
|
|
2372032a68 | ||
|
|
a70c5fd124 | ||
|
|
5c62d287cf | ||
|
|
9ae378c2e3 | ||
|
|
7381738f0b | ||
|
|
8c6b0c0e07 | ||
|
|
ec9626503c | ||
|
|
820ec085b2 | ||
|
|
9e6f6d7bc9 | ||
|
|
7194e7d28e | ||
|
|
0b4e389f2b | ||
|
|
7a5f786e0c | ||
|
|
10e5fdcfd1 | ||
|
|
cc6e56aef9 | ||
|
|
1aaa483d60 | ||
|
|
99d9d19079 | ||
|
|
888078876a | ||
|
|
02b1e5695f |
@@ -0,0 +1,243 @@
# CLI Wizard Architecture Refactor

**Status:** backlog
**Created:** 2026-04-10
**Source:** Reverse-engineered from `@posthog/wizard` (npm cache), applied to `apps/cli/src/commands/launch.ts`

## Why

Launch wizard has three compounding problems:

1. **Imperative branching** — `launch.ts` checks account → mesh → name → role → exec in hardcoded order. Adding a screen requires touching existing code. Hard to reason about `--resume`, `--non-interactive`, and skip conditions.
2. **Terminal bleed-through on handoff** — wizard→`claude` exec corrupts Ink's TUI state (garbled word wraps, tool labels overwritten, spinner fragments fused to paths). Root cause is spread across multiple exit paths instead of one choke point.
3. **Inconsistent visual design** — ad-hoc colors per file, no central palette, no shared icon set, no shared layout primitives. Every screen reinvents status rows, centering, and spacing.

PostHog's wizard solves all three with one architectural pattern: **declarative flow pipelines + session-as-store + shared visual primitives**. This artifact captures the plan to port that pattern.

## What PostHog does (the reference)

### Flow pipeline (`flows.ts` + `router.ts`)

Each wizard flow is an array of screen entries:

```ts
export const FLOWS = {
  [Flow.Wizard]: [
    { screen: Screen.Intro, isComplete: s => s.setupConfirmed },
    { screen: Screen.HealthCheck, isComplete: s => s.readinessResult !== null },
    { screen: Screen.Setup, show: needsSetup, isComplete: s => !needsSetup(s) },
    { screen: Screen.Auth, isComplete: s => s.credentials !== null },
    { screen: Screen.Run, isComplete: s => s.runPhase === RunPhase.Completed },
    { screen: Screen.Outro, isComplete: s => s.outroDismissed },
  ],
};
```

The router walks the array, skips entries where `show(s) === false` or `isComplete(s) === true`, and returns the first remaining entry. Zero switch statements. Zero hardcoded transitions. Adding a screen = appending an object.
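
The resolution rule described above can be sketched in a few lines. This is a minimal illustration, not PostHog's actual `router.ts`: the `FlowEntry` type, the `resolveScreen` function, and the `DemoSession` shape are hypothetical names chosen to mirror the flow entries shown earlier.

```typescript
// Hypothetical entry shape; field names mirror the flow entries above.
interface FlowEntry<S> {
  screen: string;
  show?: (s: S) => boolean; // omitted means "always eligible"
  isComplete: (s: S) => boolean;
}

// Return the first entry that is eligible and not yet complete,
// or null when every entry has been skipped (the flow is finished).
function resolveScreen<S>(flow: readonly FlowEntry<S>[], session: S): FlowEntry<S> | null {
  for (const entry of flow) {
    if (entry.show && !entry.show(session)) continue; // hidden for this session
    if (entry.isComplete(session)) continue; // already satisfied
    return entry;
  }
  return null;
}

// Demo session shape (an assumption for illustration only).
interface DemoSession {
  setupConfirmed: boolean;
  credentials: string | null;
}

const demoFlow: FlowEntry<DemoSession>[] = [
  { screen: "Intro", isComplete: s => s.setupConfirmed },
  { screen: "Auth", isComplete: s => s.credentials !== null },
];

const nextScreen = resolveScreen(demoFlow, { setupConfirmed: true, credentials: null });
console.log(nextScreen?.screen); // Auth
```

Because the function is pure, re-running it after every session change is enough to drive navigation; no transition table is needed.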

### Overlay stack

Separate from the linear flow cursor. Interrupts (port conflict, auth expired, managed settings) are pushed onto `overlays[]` from anywhere and popped when dismissed. Active screen = top of overlay stack OR flow cursor. Flows never need to know about interrupts.
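
A rough sketch of that rule, assuming overlays are identified by name and the flow cursor has already been resolved elsewhere. `OverlayRouter` and its method names are hypothetical, not taken from the PostHog source.

```typescript
// Hypothetical router state: one resolved flow screen plus a LIFO stack of interrupts.
class OverlayRouter {
  private overlays: string[] = [];
  private flowScreen: string;

  constructor(flowScreen: string) {
    this.flowScreen = flowScreen;
  }

  // Interrupts can be pushed from anywhere without touching the flow definition.
  pushOverlay(name: string): void {
    this.overlays.push(name);
  }

  // Dismissing an overlay reveals whatever was underneath it.
  popOverlay(): string | undefined {
    return this.overlays.pop();
  }

  // Active screen = top of the overlay stack, otherwise the flow cursor.
  get activeScreen(): string {
    const top = this.overlays[this.overlays.length - 1];
    return top ?? this.flowScreen;
  }
}

const overlayDemo = new OverlayRouter("MeshPicker");
overlayDemo.pushOverlay("BrokerDisconnect");
console.log(overlayDemo.activeScreen); // BrokerDisconnect
overlayDemo.popOverlay();
console.log(overlayDemo.activeScreen); // MeshPicker
```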

### Session as single source of truth

One `WizardStore` holds all session state. Screens subscribe via React 18 `useSyncExternalStore`. Completion predicates read session; imperative code writes session; the router re-resolves on every change.
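
The store contract that `useSyncExternalStore(subscribe, getSnapshot)` expects is small enough to sketch without React. `SessionStore` and `update` below are hypothetical names; the real `WizardStore` may be structured differently.

```typescript
type Listener = () => void;

// Minimal external store: subscribe/getSnapshot match the shape React's
// useSyncExternalStore expects. Everything else here is an assumption.
class SessionStore<S extends object> {
  private state: S;
  private listeners = new Set<Listener>();

  constructor(initial: S) {
    this.state = initial;
  }

  // Arrow properties keep `this` bound when the methods are passed as callbacks.
  subscribe = (listener: Listener): (() => void) => {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  };

  getSnapshot = (): S => this.state;

  // Replace the state object so the snapshot identity changes on every write.
  update(patch: Partial<S>): void {
    this.state = { ...this.state, ...patch };
    this.listeners.forEach(listener => listener());
  }
}

const store = new SessionStore({ welcomed: false, meshSlug: null as string | null });
let notifications = 0;
const unsubscribe = store.subscribe(() => {
  notifications += 1;
});
store.update({ welcomed: true }); // notifies the subscriber
unsubscribe();
store.update({ meshSlug: "dev" }); // no subscriber left to notify
console.log(notifications, store.getSnapshot());
```

Returning a fresh object from `update` matters: `useSyncExternalStore` compares snapshots by identity, so mutating in place would not trigger a re-render.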

### Visual primitives

- `styles.ts` — 6-color palette (`Colors`), 9-icon set (`Icons`), alignment enums (`HAlign`, `VAlign`)
- `CardLayout` — semantic centering wrapper used by every screen
- `PickerMenu` — the only selection primitive, used for every choice
- `screen-registry.ts` — maps `Screen` enum → React component
- Brand mark: three colored `█` blocks next to the wizard name on every screen header

## What claudemesh should do

### Target file layout

```
apps/cli/src/
├── commands/
│   └── launch.ts              # thin entrypoint: parse flags → start TUI
└── ui/
    ├── styles.ts              # palette, icons, alignment enums
    ├── store.ts               # LaunchStore (session + subscribe)
    ├── router.ts              # flow cursor + overlay stack
    ├── flows.ts               # FLOWS = { Launch: [...], Join: [...] }
    ├── screen-registry.ts     # Screen enum → component
    ├── primitives/
    │   ├── CardLayout.tsx
    │   ├── PickerMenu.tsx
    │   ├── StatusRows.tsx     # new: "Directory ✓ /claudemesh" pattern
    │   ├── BrandMark.tsx      # new: 3 colored squares + label
    │   └── LoadingBox.tsx
    └── screens/
        ├── WelcomeScreen.tsx
        ├── AccountScreen.tsx
        ├── MeshPickerScreen.tsx
        ├── NameRoleScreen.tsx
        ├── ConfirmScreen.tsx
        └── HandoffScreen.tsx  # last screen; its unmount triggers exec claude
```

### Flow definition

```ts
export const FLOWS = {
  [Flow.Launch]: [
    { screen: Screen.Welcome, isComplete: s => s.welcomed },
    { screen: Screen.Account, show: s => !s.hasAccount, isComplete: s => s.hasAccount },
    { screen: Screen.MeshPicker, show: s => s.meshes.length > 1, isComplete: s => s.meshSlug !== null },
    { screen: Screen.NameRole, isComplete: s => s.displayName !== null && s.role !== null },
    { screen: Screen.Confirm, isComplete: s => s.confirmed },
    { screen: Screen.Handoff, isComplete: () => false }, // terminal screen
  ],
};
```

### `--resume` works for free

`--resume <id>` populates the session from saved state; every satisfied predicate auto-skips. The wizard renders only the screens that still need input. No special `--resume` branches in screen code.

### `--non-interactive` works for free

Non-interactive mode: walk the flow, for each incomplete entry check if its required session fields can be sourced from CLI flags. If yes, populate and continue. If no, **fail fast with a clear message** naming the missing flag. Never silently guess defaults.

```
$ claudemesh launch --non-interactive --name Alexis
✗ Missing --mesh (required in non-interactive mode when >1 mesh joined)
Available meshes: alexis-mou, dev, staging
```
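
One way the fail-fast check could look, as a sketch under stated assumptions: the `FlagRequirement` table, `applyFlagsOrFail`, and the reduced `LaunchSession` shape are all hypothetical, and the flags are assumed to arrive already parsed into a plain record.

```typescript
// Reduced session shape for illustration; the real session has more fields.
interface LaunchSession {
  meshes: string[];
  meshSlug: string | null;
  displayName: string | null;
}

// Hypothetical table: which flag can satisfy which missing session field.
interface FlagRequirement {
  flag: string;
  needed: (s: LaunchSession) => boolean;
  apply: (s: LaunchSession, value: string) => void;
}

const requirements: FlagRequirement[] = [
  {
    flag: "--mesh",
    needed: s => s.meshes.length > 1 && s.meshSlug === null,
    apply: (s, value) => { s.meshSlug = value; },
  },
  {
    flag: "--name",
    needed: s => s.displayName === null,
    apply: (s, value) => { s.displayName = value; },
  },
];

// Populate the session from flags, throwing on the first missing one
// instead of guessing a default.
function applyFlagsOrFail(session: LaunchSession, flags: Record<string, string | undefined>): void {
  for (const req of requirements) {
    if (!req.needed(session)) continue;
    const value = flags[req.flag];
    if (value === undefined) {
      throw new Error(`Missing ${req.flag} (required in non-interactive mode)`);
    }
    req.apply(session, value);
  }
}

const demoSession: LaunchSession = {
  meshes: ["alexis-mou", "dev", "staging"],
  meshSlug: null,
  displayName: null,
};
try {
  applyFlagsOrFail(demoSession, { "--name": "Alexis" });
} catch (err) {
  console.error(`✗ ${(err as Error).message}`);
}
```

In a fuller version the requirement table would presumably be derived from the flow entries themselves, so that a new screen cannot be added without declaring its flag.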

### Overlay interrupts claudemesh needs

- `BrokerDisconnect` — WS dropped mid-wizard, retry countdown
- `InviteInvalid` — paste invite screen rejected token
- `MeshNotFound` — `--mesh foo` passed but not joined
- `RateLimit` — broker rate limited the CLI, backoff timer
- `UpdateAvailable` — newer CLI version on npm, non-blocking banner

### Terminal handoff choke point

The last flow entry (`Screen.Handoff`) renders a brief "Launching Claude Code…" card, then:

```ts
// apps/cli/src/ui/screens/HandoffScreen.tsx (on mount)
useEffect(() => {
  (async () => {
    await inkApp.unmount();
    await inkApp.waitUntilExit();
    resetTerminal(); // single choke point for ANSI teardown
    await flushStdout();
    execa('claude', claudeArgs, { stdio: 'inherit' });
  })();
}, []);
```

`resetTerminal()` lives in `apps/cli/src/ui/terminal.ts`:

```ts
export function resetTerminal() {
  process.stdout.write(
    '\x1b[0m' + // reset SGR
    '\x1b[?25h' + // show cursor
    '\x1b[?1049l' + // exit alt-screen
    '\x1b[?1000l' + // disable mouse tracking
    '\x1b[?1002l' +
    '\x1b[?1003l' +
    '\x1b[?1006l' +
    '\x1b[?2004l' + // disable bracketed paste
    '\x1b[2J' + // clear screen
    '\x1b[H' // cursor home
  );
  if (process.stdin.isTTY) process.stdin.setRawMode(false);
}
```

PostHog only does SGR reset + clear + home on unmount — they don't hand off to another full-screen app, so that's enough for them. Claudemesh needs the full mode-reset because Claude Code takes over the TTY.

### Visual design system

`apps/cli/src/ui/styles.ts`:

```ts
export const Colors = {
  primary: 'cyan',
  accent: '#7C3AED', // claudemesh purple
  title: '#4C1D95',
  success: 'green',
  error: 'red',
  warning: 'yellow',
  muted: 'gray',
} as const;

export const Icons = {
  check: '✔',
  cross: '✘',
  warning: '⚠',
  arrow: '▶',
  smallArrow: '▸',
  bullet: '•',
  diamond: '◆',
  square: '█',
} as const;

export enum HAlign { Left = 'flex-start', Center = 'center', Right = 'flex-end' }
export enum VAlign { Top = 'flex-start', Center = 'center', Bottom = 'flex-end' }
```

Every screen imports from here. No inline color strings allowed.

### Status rows pattern

Replaces the current plain-text banner:

```
██ claudemesh launch

Directory  ✔ /claudemesh
Account    ✔ agutierrez@mineryreport.com
Mesh       ✔ alexis-mou (9 peers online)
Name       ✔ Alexis
Role       ▸ (pick one)

▸ Continue
  Change mesh
  Cancel
```
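
The row layout above could come from a small pure formatter that a `StatusRows` component wraps. The sketch below is an assumption about how that might work; `StatusRow` and `renderStatusRows` are hypothetical names, and the two-space gutter is an arbitrary choice.

```typescript
interface StatusRow {
  label: string;
  value: string;
  done: boolean; // true renders a check mark, false the "still needs input" arrow
}

// Pad labels to a common width so the icons line up in one column.
function renderStatusRows(rows: StatusRow[]): string {
  const width = Math.max(0, ...rows.map(row => row.label.length));
  return rows
    .map(row => `${row.label.padEnd(width)}  ${row.done ? "✔" : "▸"} ${row.value}`)
    .join("\n");
}

console.log(
  renderStatusRows([
    { label: "Directory", value: "/claudemesh", done: true },
    { label: "Mesh", value: "alexis-mou (9 peers online)", done: true },
    { label: "Role", value: "(pick one)", done: false },
  ]),
);
```

In the real primitive the icons would come from `Icons` in `styles.ts` rather than being inlined as they are here.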

## Implementation order

| # | Impact | Effort | Scope |
|---|---|---|---|
| 1 | High | S | `ui/styles.ts` — palette + icons + alignment enums; migrate existing screens |
| 2 | High | S | `ui/primitives/StatusRows.tsx` + `BrandMark.tsx` |
| 3 | High | M | `ui/store.ts` + `ui/router.ts` + `ui/flows.ts` (flow pipeline core) |
| 4 | High | M | Refactor `launch.ts` to render through router; port existing screens |
| 5 | High | S | `HandoffScreen` + `resetTerminal()` choke point — fixes TUI bleed bug |
| 6 | High | S | Preselect "Continue" on every confirmation screen (one-keypress happy path) |
| 7 | Med | M | Overlay stack + first two overlays (`BrokerDisconnect`, `InviteInvalid`) |
| 8 | Med | M | `--non-interactive` mode using flow walker + fail-fast flag check |
| 9 | Med | S | Per-mesh/per-role `preRunNotice` extension point |
| 10 | Low | L | `DissolveTransition` / `ContentSequencer` polish primitives |

Steps 1–5 are the atomic unit of value: they fix the bleed-through bug, establish the visual system, and unblock everything else. Should ship as one PR.
Steps 6–9 can each ship independently.
Step 10 is polish — defer until after v0.2.

## Open questions

- **Ink version**: current CLI uses Ink 4.x? PostHog is on Ink 5 with `useSyncExternalStore`. Check `apps/cli/package.json` before porting the store pattern — Ink 4 needs a different subscription approach.
- **React version**: `useSyncExternalStore` is React 18+. Confirm.
- **Flow granularity**: should `Join` (paste invite) be a separate flow from `Launch`, or an overlay inside `Launch`? PostHog-style: separate flow triggered from the welcome screen. Simpler.
- **Resume semantics**: does `--resume <id>` resume the *Claude* session only, or also restore the wizard's last mesh/name/role choice? If the latter, need a `~/.claudemesh/sessions/<id>.json` alongside Claude's own session file.

## References

- PostHog wizard source: `~/.npm/_npx/b48b11b34a0cada0/node_modules/@posthog/wizard/dist/src/ui/tui/`
  - `start-tui.js` — Ink bootstrap + cleanup
  - `router.js` — flow cursor + overlay stack
  - `flows.js` — declarative pipeline definition
  - `styles.js` — palette + icons
  - `screens/IntroScreen.js` — reference for status rows + picker
  - `primitives/CardLayout.js` — semantic centering
820
.artifacts/backlog/2026-04-11-v1-feature-inventory.md
Normal file
@@ -0,0 +1,820 @@
# claudemesh v1 — Feature Inventory

**Status:** backlog reference
**Created:** 2026-04-11
**Purpose:** Exhaustive audit of what v1 ships today. **Every row in this document must still work after v2 lands.** v2 is a refactor + CLI user flows, NOT a functional rewrite; this inventory is the regression checklist.

**Source of truth**:
- `apps/cli/src/` — 22 files, ~12 k LOC (v0.10.5)
- `apps/broker/src/` — 23 files, ~11 k LOC
- `packages/db/src/schema/mesh.ts` — 1,019 lines, 23 tables

---

## 0. Summary counts

| Surface | v1 count |
|---|---|
| CLI commands (subcommands in `index.ts`) | 23 |
| MCP tools (handlers in `mcp/server.ts`) | 79 |
| Broker WS message types (dispatched in `index.ts`) | 85 |
| Broker HTTP endpoints | 18 |
| Postgres tables in `mesh` schema | 23 |
| External backend services the broker manages | 5 (Postgres, Neo4j, Qdrant, MinIO, Docker) |
| Lines of source (CLI + broker, excluding tests) | ~23,450 |

---

## 1. CLI commands

All dispatched from `apps/cli/src/index.ts`. v1 ships 23 public subcommands plus the bare-command welcome wizard.

| Command | File | Purpose | Flags / args |
|---|---|---|---|
| `claudemesh` (bare) | `commands/welcome.ts` | Interactive welcome wizard. Entry point for new users. | (none) |
| `launch` | `commands/launch.ts` (775 lines, biggest) | Spawn a Claude Code session with mesh connectivity + MCP tools | `--name`, `--role`, `--groups`, `--mesh`, `--join`, `--message-mode`, `--system-prompt`, `-y/--yes`, `-r/--resume`, `-c/--continue`, `--quiet`, + passthrough to `claude` after `--` |
| `create` | `commands/create.ts` | Create a new mesh from a template | `--template`, `--list-templates` |
| `install` | `commands/install.ts` (538 lines) | Register MCP server + status hooks with Claude Code (`~/.claude.json`, `~/.claude/settings.json`) | `--no-hooks` |
| `uninstall` | `commands/install.ts` | Remove MCP server + hooks from Claude Code config | (none) |
| `join` | `commands/join.ts` (193 lines) | Join a mesh via invite URL or token | positional `<url>` |
| `list` | `commands/list.ts` | Show joined meshes, slugs, local identities | (none) |
| `leave` | `commands/leave.ts` | Leave a joined mesh + remove its local keypair | positional `<slug>` |
| `peers` | `commands/peers.ts` | List online peers with status, summary, groups | `--mesh`, `--json` |
| `send` | `commands/send.ts` | Send a message to a peer, group, or all peers | positional `<to> <message>`, `--mesh`, `--priority` |
| `inbox` | `commands/inbox.ts` | Drain pending inbound messages | `--mesh`, `--json`, `--wait` |
| `state` | `commands/state.ts` | Get / set / list shared KV state in the mesh | positional `<action> <key> [value]`, `--mesh`, `--json` |
| `info` | `commands/info.ts` | Mesh overview: slug, broker, peer count, state keys | `--mesh`, `--json` |
| `remember` | `commands/memory.ts` | Store a persistent memory visible to all peers | positional `<content>`, `--mesh`, `--tags`, `--json` |
| `recall` | `commands/memory.ts` | Full-text search of mesh memories | positional `<query>`, `--mesh`, `--json` |
| `remind` | `commands/remind.ts` (142 lines) | Schedule a delayed message. Also: `remind list`, `remind cancel <id>` | positional `<message>`, `--in`, `--at`, `--cron`, `--to`, `--mesh`, `--json` |
| `sync` | `commands/sync.ts` | Sync meshes from the user's claudemesh.com dashboard account | `--force` |
| `profile` | `commands/profile.ts` | View or edit member profile (self or another member if admin) | `--mesh`, `--role-tag`, `--groups`, `--message-mode`, `--name`, `--member`, `--json` |
| `status` | `commands/status.ts` | Check broker connectivity for each joined mesh | (none) |
| `doctor` | `commands/doctor.ts` (212 lines) | Diagnose install, config, keypairs, PATH | 7 checks: Node >= 20, claude binary, MCP registered, hooks registered, config parses, file perms, keypairs valid |
| `mcp` | `mcp/server.ts` (2139 lines) | Start MCP server on stdio (internal — invoked by Claude Code) | (none) |
| `seed-test-mesh` | `commands/seed-test-mesh.ts` | Dev-only: inject a mesh into local config without invite flow | `<slug>`, `<broker_url>` |
| `hook` | `commands/hook.ts` | Internal: handle Claude Code hook events (status updates from session lifecycle) | stdin JSON from Claude Code |
| `connect telegram` | `commands/connect-telegram.ts` | Link a Telegram bot to a mesh | inline token prompts, calls broker `/tg/token` |
| `disconnect telegram` | `commands/disconnect-telegram.ts` | Unlink Telegram bot | (none) |

### Flag-first invocation rewrite

`apps/cli/src/index.ts` lines 339–355 implement a **friction reducer**: if the user types `claudemesh --resume xxx` or any flag-first invocation, the argv is rewritten to `claudemesh launch --resume xxx` before citty parses it. This lets users skip typing `launch` for common flag-only forms.
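
For reference during the v2 refactor, the behavior described above amounts to roughly the following. This is a sketch and not the code at lines 339–355: `rewriteFlagFirstArgv` is a hypothetical name, `args` is assumed to be `process.argv.slice(2)`, and the sketch does not special-case global flags such as `--help` or `--version`, which a real implementation would likely need to exempt.

```typescript
// Hypothetical helper; the actual logic lives in apps/cli/src/index.ts and may differ.
function rewriteFlagFirstArgv(args: string[]): string[] {
  const first = args[0];
  // Bare `claudemesh` keeps its welcome wizard; explicit subcommands pass through untouched.
  if (first === undefined || !first.startsWith("-")) return args;
  // Flag-first invocation: route it to `launch` before the CLI parser sees it.
  return ["launch", ...args];
}

console.log(rewriteFlagFirstArgv(["--resume", "xxx"])); // [ 'launch', '--resume', 'xxx' ]
console.log(rewriteFlagFirstArgv(["peers", "--json"])); // [ 'peers', '--json' ]
```

A regression test for v2 can assert exactly these two cases plus the empty-argv case.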

**Must preserve in v2.** Users may depend on this. Applies to `--resume`, `--continue`, `-y`, `--mesh`, `--name`, etc.

---

## 2. MCP tools (79 total)

Defined in `apps/cli/src/mcp/tools.ts` with schemas, implemented in `apps/cli/src/mcp/server.ts` with per-tool case handlers. Each MCP tool is an RPC that the CLI's MCP server handles locally or forwards to the broker via WS.

Grouped by domain family. Every tool listed here has a working handler in v1.

### 2.1 Messaging (4)

| Tool | v1 behavior |
|---|---|
| `send_message` | Send encrypted message to peer, group, or broadcast. Supports priorities: `now` (immediate), `next` (default), `low`. Broker queues if recipient offline. |
| `list_peers` | List connected peers in the mesh with `presenceId`, `displayName`, `status`, `summary`, `groups`, `roleTag`. |
| `message_status` | Query delivery state of a sent message by `messageId`. |
| `check_messages` | Drain pending inbox messages (push mode). |

### 2.2 Profile + identity (4)

| Tool | v1 behavior |
|---|---|
| `set_summary` | Set the current peer's work summary (visible to others). |
| `set_status` | Set status: `idle`, `working`, `dnd`. Priority-ranked by source (`hook` > `manual` > `jsonl`). |
| `set_visible` | Toggle visibility. Hidden peers skip `list_peers` and broadcasts but still receive direct messages. |
| `set_profile` | Update display name, role tag, groups, avatar, title, bio, capabilities. |
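
The source ranking on `set_status` (`hook` > `manual` > `jsonl`) is worth pinning down in a regression test. The sketch below shows one plausible reading, in which the report from the highest-ranked source wins. How the broker actually combines reports (timestamps, expiry, tie-breaking) is not stated in this inventory, so `resolveStatus` and its tie rule are assumptions.

```typescript
type StatusSource = "hook" | "manual" | "jsonl";
type PeerStatus = "idle" | "working" | "dnd";

interface StatusReport {
  status: PeerStatus;
  source: StatusSource;
}

// Higher number = higher priority, following hook > manual > jsonl.
const SOURCE_RANK: Record<StatusSource, number> = { hook: 3, manual: 2, jsonl: 1 };

// Pick the report from the highest-ranked source; earlier entries win ties (assumed).
function resolveStatus(reports: StatusReport[]): PeerStatus | null {
  let best: StatusReport | null = null;
  for (const report of reports) {
    if (best === null || SOURCE_RANK[report.source] > SOURCE_RANK[best.source]) {
      best = report;
    }
  }
  return best ? best.status : null;
}

console.log(
  resolveStatus([
    { status: "idle", source: "jsonl" },
    { status: "dnd", source: "manual" },
    { status: "working", source: "hook" },
  ]),
); // working
```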

### 2.3 Groups (2)

| Tool | v1 behavior |
|---|---|
| `join_group` | Join a `@group` with optional role (`lead`, `member`, or free-form). |
| `leave_group` | Leave a `@group`. |

### 2.4 State KV (3)

| Tool | v1 behavior |
|---|---|
| `set_state` | Set a key-value pair in the mesh's shared state. Broadcasts `state_change` push to all peers. |
| `get_state` | Read a value by key. |
| `list_state` | List all state keys with values, authors, timestamps. |

### 2.5 Memory (3)

| Tool | v1 behavior |
|---|---|
| `remember` | Store a text memory with optional tags. Persists across sessions. |
| `recall` | Full-text search memories by query, ranked results. |
| `forget` | Delete a memory by ID. |

### 2.6 Files (8)

| Tool | v1 behavior |
|---|---|
| `share_file` | Upload a file to MinIO. Supports `to: <peer>` for E2E encryption (symmetric key wrapped with peer pubkey), or mesh-wide sharing. Supports `persistent` vs `ephemeral` storage. |
| `get_file` | Download a file by `fileId`. Returns a presigned MinIO URL. |
| `list_files` | List files in the mesh by `scope`, `tags`, author. |
| `file_status` | Query status of a file: who downloaded, when. |
| `delete_file` | Delete a file (owner only). |
| `grant_file_access` | Add another peer as a recipient of an already-encrypted file (re-wraps symmetric key). |
| `read_peer_file` | Read a file from another peer's working directory (requires peer online + sharing). |
| `list_peer_files` | List files in a peer's shared directory (tree of names, not contents). |

### 2.7 Vectors (Qdrant) (4)

| Tool | v1 behavior |
|---|---|
| `vector_store` | Store embedding with metadata in a named collection. |
| `vector_search` | Nearest-neighbor search in a collection with `limit`. |
| `vector_delete` | Delete a vector by ID. |
| `list_collections` | List collections in the mesh's Qdrant namespace. |

### 2.8 Graph (Neo4j) (2)

| Tool | v1 behavior |
|---|---|
| `graph_query` | Read-only Cypher MATCH query on the per-mesh Neo4j database. |
| `graph_execute` | Write Cypher (CREATE/MERGE/DELETE). |

### 2.9 Shared SQL (Postgres) (3)

| Tool | v1 behavior |
|---|---|
| `mesh_query` | SELECT-only query on the per-mesh Postgres schema. |
| `mesh_execute` | DDL + DML (CREATE TABLE, INSERT, UPDATE, DELETE). |
| `mesh_schema` | List tables + columns in the mesh's schema. |

### 2.10 Streams (4)

| Tool | v1 behavior |
|---|---|
| `create_stream` | Create a named stream for live data pub-sub. |
| `publish` | Push data to a stream. Subscribers receive in real-time. |
| `subscribe` | Subscribe to a stream. Events arrive as channel notifications. |
| `list_streams` | List active streams. |

### 2.11 Contexts (3)

| Tool | v1 behavior |
|---|---|
| `share_context` | Share session understanding with the mesh (summary + files_read + key_findings + tags). |
| `get_context` | Search contexts by query (file path, topic, etc.). |
| `list_contexts` | Show what peers currently know about the codebase. |

### 2.12 Tasks (4)

| Tool | v1 behavior |
|---|---|
| `create_task` | Create a work item (title, assignee, priority, tags). |
| `claim_task` | Claim an unclaimed task. |
| `complete_task` | Mark done with optional result summary. |
| `list_tasks` | Filter by status and/or assignee. |

### 2.13 Scheduling (3)

| Tool | v1 behavior |
|---|---|
| `schedule_reminder` | One-shot (`deliver_at`, `in_seconds`) or recurring (`cron`). Delivered to self or `to`. Persists across broker restarts. |
| `list_scheduled` | List pending scheduled messages. |
| `cancel_scheduled` | Cancel by ID. |

### 2.14 Mesh metadata — read (4)

| Tool | v1 behavior |
|---|---|
| `mesh_info` | Overview: peers, groups, state, memory, files, tasks, streams, tables. |
| `mesh_stats` | Resource usage per peer: messages in/out, tool calls, uptime, errors. |
| `mesh_clock` | Simulation clock status: speed, tick count, simulated time. |
| `ping_mesh` | Test messages through the full pipeline, measure round-trip per priority. Diagnoses push delivery issues. |

### 2.15 Mesh clock — write (3)

| Tool | v1 behavior |
|---|---|
| `mesh_set_clock` | Set simulation clock speed (1–100x). Peers receive heartbeat ticks at the simulated rate. |
| `mesh_pause_clock` | Pause simulation clock. |
| `mesh_resume_clock` | Resume paused clock. |

### 2.16 Skills (5)
|
||||
|
||||
| Tool | v1 behavior |
|
||||
|---|---|
|
||||
| `share_skill` | Publish a reusable skill (name + description + instructions + tags + when_to_use + allowed_tools + model + context + agent + user_invocable + argument_hint). Exposed as MCP prompts and `skill://` resources. |
|
||||
| `get_skill` | Load a skill's full instructions by name. |
|
||||
| `list_skills` | Browse available skills, optionally filter by keyword. |
|
||||
| `remove_skill` | Remove a shared skill. |
|
||||
| `mesh_skill_deploy` | Deploy a multi-file skill bundle from zip or git repo. |
|
||||
|
||||
### 2.17 MCP registry tier 1 — peer-hosted (4)
|
||||
|
||||
| Tool | v1 behavior |
|
||||
|---|---|
|
||||
| `mesh_mcp_register` | Register a peer's local MCP server with the mesh (server_name, description, tools schema, persistent flag). Other peers can invoke via `mesh_tool_call`. |
|
||||
| `mesh_mcp_list` | List MCP servers in the mesh with their tools + hosting peer. |
|
||||
| `mesh_tool_call` | Call a tool on a mesh-registered MCP server. Routes: caller → broker → hosting peer → execute → result back. 30s timeout. |
|
||||
| `mesh_mcp_remove` | Unregister a peer-hosted MCP server. |
|
||||
|
||||
### 2.18 MCP registry tier 2 — broker-deployed (7)
|
||||
|
||||
| Tool | v1 behavior |
|---|---|
| `mesh_mcp_deploy` | Deploy an MCP server from zip (via `file_id`), git URL, or npx package. Runs on broker VPS in Docker sandbox. Scope: `peer` (default), `mesh`, or `{group/groups/role/peers}`. Runtime: node / python / bun. Memory, network_allow, env with `$vault:` references. |
| `mesh_mcp_undeploy` | Stop and remove a managed MCP server. |
| `mesh_mcp_update` | Pull latest + restart a git-sourced server. |
| `mesh_mcp_logs` | Tail recent logs from a managed server. |
| `mesh_mcp_scope` | Get or set visibility scope. |
| `mesh_mcp_schema` | Inspect tool schemas for a deployed server. |
| `mesh_mcp_catalog` | List all deployed services with status, scope, tool count. |
|
||||
|
||||
### 2.19 Vault (3)
|
||||
|
||||
| Tool | v1 behavior |
|---|---|
| `vault_set` | Store encrypted credential. `type: env` (string, injected as env var via `$vault:<key>`) or `type: file` (file written to `mount_path` in container). |
| `vault_list` | List vault entries (keys + metadata only, no values). |
| `vault_delete` | Remove a credential. |
|
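To make the `$vault:<key>` convention concrete, here is a minimal sketch of how references in a deployment's `env` could be resolved before a container starts. `resolveVaultEnv` and the in-memory map are hypothetical; the sketch assumes a reference occupies the entire value and leaves out decryption, which the real vault performs.

```typescript
// Hypothetical resolver: replaces "$vault:<key>" values with secrets from a lookup map.
const VAULT_PREFIX = "$vault:";

function resolveVaultEnv(
  env: Record<string, string>,
  vault: ReadonlyMap<string, string>,
): Record<string, string> {
  const resolved: Record<string, string> = {};
  for (const [name, value] of Object.entries(env)) {
    if (value.startsWith(VAULT_PREFIX)) {
      const key = value.slice(VAULT_PREFIX.length);
      const secret = vault.get(key);
      // Assumption: a missing key is a hard error rather than an empty variable.
      if (secret === undefined) {
        throw new Error(`vault key not found: ${key}`);
      }
      resolved[name] = secret;
    } else {
      resolved[name] = value; // plain values pass through untouched
    }
  }
  return resolved;
}
```

Failing fast on an unknown key is one possible design; it surfaces a misconfigured deployment before the container runs with a blank credential.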
||||
|
||||
### 2.20 URL watch (3)
|
||||
|
||||
| Tool | v1 behavior |
|---|---|
| `mesh_watch` | Watch a URL for changes. Modes: `hash` (SHA-256 body), `json` (jsonpath extract), `status` (HTTP code). Polling `interval` (min 5s). `notify_on: change \| match:<val> \| not_match:<val>`. Custom headers. |
| `mesh_unwatch` | Stop watching by `watch_id`. |
| `mesh_watches` | List active watches. |
|
||||
|
||||
### 2.21 Webhooks (3)
|
||||
|
||||
| Tool | v1 behavior |
|---|---|
| `create_webhook` | Create an inbound webhook. Returns a URL external services (GitHub, CI/CD, monitoring) can POST to. Payload becomes a mesh message to all peers. |
| `list_webhooks` | List active webhooks. |
| `delete_webhook` | Deactivate by name. |
|
||||
|
||||
---
|
||||
|
||||
## 3. Broker WS protocol
|
||||
|
||||
`apps/broker/src/index.ts` dispatches 85 message types over a single WebSocket endpoint (`WS_PATH`). Each WS message is a client-initiated RPC; most of the 79 MCP tools above map 1:1 to a WS message. Some additional WS messages exist for connection lifecycle + internal routing.
|
||||
|
||||
### 3.1 Connection lifecycle (3)
|
||||
|
||||
- `hello` — client authentication. Ed25519 signature over `{meshId, memberId, pubkey, timestamp}`. Broker verifies, creates presence row, replies with `hello_ack`.
|
||||
- `hello_ack` — server → client, confirms authentication + sends restored peer state.
|
||||
- `get_clock` — get current simulation clock state.
|
||||
|
||||
### 3.2 Messaging (5 WS ops)
|
||||
|
||||
- `send` — send a message. Envelope contains sender, recipient (peer/group/*), priority, nonce, ciphertext.
|
||||
- `peer_dir_request` / `peer_dir_response` — peer-to-peer directory request (read_peer_file under the hood).
|
||||
- `peer_file_request` / `peer_file_response` — peer-to-peer file read.
|
||||
|
||||
### 3.3 Profile + presence (5)
|
||||
|
||||
- `set_status`, `set_summary`, `set_visible`, `set_profile`, `set_stats`
|
||||
|
||||
### 3.4 Groups (2)
|
||||
|
||||
- `join_group`, `leave_group`
|
||||
|
||||
### 3.5 State KV (3)
|
||||
|
||||
- `set_state`, `get_state`, `list_state`
|
||||
|
||||
### 3.6 Memory (3)
|
||||
|
||||
- `remember`, `recall`, `forget`
|
||||
|
||||
### 3.7 Files (5)
|
||||
|
||||
- `get_file`, `list_files`, `file_status`, `grant_file_access`, `delete_file`
|
||||
|
||||
### 3.8 Vectors (4)
|
||||
|
||||
- `vector_store`, `vector_search`, `vector_delete`, `list_collections`
|
||||
|
||||
### 3.9 Graph (2)
|
||||
|
||||
- `graph_query`, `graph_execute`
|
||||
|
||||
### 3.10 Shared SQL (3)
|
||||
|
||||
- `mesh_query`, `mesh_execute`, `mesh_schema`
|
||||
|
||||
### 3.11 Streams (5)
|
||||
|
||||
- `create_stream`, `publish`, `subscribe`, `unsubscribe`, `list_streams`
|
||||
|
||||
### 3.12 Contexts (3)
|
||||
|
||||
- `share_context`, `get_context`, `list_contexts`
|
||||
|
||||
### 3.13 Tasks (4)
|
||||
|
||||
- `create_task`, `claim_task`, `complete_task`, `list_tasks`
|
||||
|
||||
### 3.14 Scheduling (3)
|
||||
|
||||
- `schedule`, `list_scheduled`, `cancel_scheduled`
|
||||
|
||||
### 3.15 Mesh metadata (3)
|
||||
|
||||
- `mesh_info`, `peers_list` (from `list_peers`), `message_status`
|
||||
|
||||
### 3.16 Simulation clock (4)
|
||||
|
||||
- `set_clock`, `pause_clock`, `resume_clock`, `get_clock`
|
||||
|
||||
### 3.17 Skills (5)
|
||||
|
||||
- `share_skill`, `get_skill`, `list_skills`, `remove_skill`, `skill_deploy`
|
||||
|
||||
### 3.18 MCP registry (12)
|
||||
|
||||
- `mcp_register`, `mcp_unregister`, `mcp_list`, `mcp_call`, `mcp_call_response` (peer → peer relay)
|
||||
- `mcp_deploy`, `mcp_undeploy`, `mcp_update`, `mcp_logs`, `mcp_scope`, `mcp_schema`, `mcp_catalog`
|
||||
|
||||
### 3.19 Vault (4)
|
||||
|
||||
- `vault_set`, `vault_get`, `vault_list`, `vault_delete`
|
||||
|
||||
### 3.20 URL watch (3)
|
||||
|
||||
- `watch`, `unwatch`, `watch_list`
|
||||
|
||||
### 3.21 Webhooks (3)
|
||||
|
||||
- `create_webhook`, `list_webhooks`, `delete_webhook`
|
||||
|
||||
### 3.22 Audit (2)
|
||||
|
||||
- `audit_query`, `audit_verify`
|
||||
|
||||
---
|
||||
|
||||
## 4. Broker HTTP endpoints
|
||||
|
||||
The broker serves both WS (`/ws`) and HTTP on the same port. HTTP endpoints are listed here by (method, path) with purpose.
|
||||
|
||||
| Method | Path | Purpose |
|---|---|---|
| `GET` | `/health` | Health check: liveness probe |
| `GET` | `/metrics` | Prometheus metrics endpoint |
| `POST` | `/hook/set-status` | Receive hook status updates from CLI `hook` command (Claude Code session lifecycle) |
| `POST` | `/join` | Accept v1 invite join (legacy) |
| `POST` | `/invites/:code/claim` | v2 invite claim (public, unauthenticated) |
| `POST` | `/upload` | Upload a file (returns fileId, used by `share_file`) |
| `GET` | `/download/:id` | Download a file (returns content or presigned URL) |
| `POST` | `/cli-sync` | CLI sync endpoint — fetches user's meshes from `claudemesh.com` dashboard via JWT, returns mesh list |
| `POST` | `/tg/token` | Register a Telegram bot token for a mesh (connects via `connect telegram` CLI command) |
| `PATCH` | `/mesh/:id/member/:memberId` | Update a member's profile (admin or self) |
| `GET` | `/mesh/:id/members` | List mesh members |
| `PATCH` | `/mesh/:id/settings` | Update mesh-level settings (owner/admin) |
| `POST` | `/hook/:meshId/:webhookId` | Inbound webhook — external systems POST here to publish a mesh message |
| `GET` | `/test/clock` | Dev-only: simulation clock state |
| `GET` | `/test/flip` | Dev-only: test flip endpoint |
| `GET` | `/test/html` | Dev-only: test HTML endpoint |
| `WS` | `/ws` | WebSocket connection for mesh peers (all WS ops above) |
|
||||
|
||||
---
|
||||
|
||||
## 5. Database schema — `mesh` Postgres schema
|
||||
|
||||
23 tables in the `mesh` schema (managed via Drizzle). Defined in `packages/db/src/schema/mesh.ts`.
|
||||
|
||||
| Table | Purpose |
|---|---|
| `mesh.mesh` | Mesh identity. slug, name, ownerId, createdAt, settings. |
| `mesh.member` | Per-mesh member record. Stable, durable. pubkey, displayName, role, groups, joinedAt. |
| `mesh.invite` | Invite codes + metadata. |
| `mesh.pending_invite` | v2 invite handshake state (pending claim). |
| `mesh.audit_log` | Audit events per mesh. |
| `mesh.presence` | Ephemeral WS session — one row per active connection. Status, statusSource, statusUpdatedAt. |
| `mesh.message_queue` | Queued messages pending push delivery (priority ordered). |
| `mesh.pending_status` | In-flight status updates (10s TTL). |
| `mesh.state` (meshState) | Shared KV state per mesh. |
| `mesh.memory` (meshMemory) | Shared memories with full-text search. |
| `mesh.file` (meshFile) | File metadata (uploader, size, sha256, persistence, storage location). |
| `mesh.file_access` (meshFileAccess) | Per-recipient ACL on files. |
| `mesh.file_key` (meshFileKey) | Per-recipient wrapped symmetric keys for E2E encryption. |
| `mesh.context` (meshContext) | Shared context entries. |
| `mesh.task` (meshTask) | Tasks with lifecycle (open, claimed, completed, cancelled). |
| `mesh.stream` (meshStream) | Stream metadata. |
| `mesh.skill` (meshSkill) | Skill registrations (name, content, frontmatter, tags). |
| `mesh.webhook` (meshWebhook) | Inbound webhook registrations. |
| `mesh.service` (meshService) | Deployed MCP server state (container ID, scope, env, runtime, memory, logs). |
| `mesh.vault_entry` (meshVaultEntry) | Encrypted vault entries per (mesh, peer, key). |
| `mesh.scheduled_message` | Scheduled / recurring reminders (cron + one-shot). |
| `mesh.peer_state` | Per-peer state (groups, role, profile, message mode preference). |
| `mesh.telegram_bridge` | Telegram bot registration per mesh. |
|
||||
|
||||
---
|
||||
|
||||
## 6. Broker backend services
|
||||
|
||||
Five external services the broker manages at runtime. All currently work in v1 and ship in the default Docker Compose deployment.
|
||||
|
||||
| Service | Purpose | File | Per-mesh model |
|---|---|---|---|
| **Postgres** (Drizzle) | Primary data store for mesh schema. Also used for `mesh_execute` / `mesh_query` / `mesh_schema` shared-SQL tools via per-mesh schemas. | `db.ts` | Schema-per-mesh for shared SQL tools |
| **Neo4j** | Graph queries (`graph_query`, `graph_execute`). | `neo4j-client.ts` | Database-per-mesh (Enterprise) or labeled-node fallback (Community) |
| **Qdrant** | Vector embeddings + nearest-neighbor search. | `qdrant.ts` | Collection naming: `mesh_<meshId>_<collection>`, 1536-dim default, cosine distance |
| **MinIO** | File storage for `share_file` / `get_file`. | `minio.ts` | Bucket-per-mesh: `mesh-<meshId>`. Persistent + ephemeral key paths. |
| **Docker** | Runs deployed MCP servers in sandboxed containers. | `index.ts` (deploy handler) | Container-per-deployment. Read-only root, dropped caps, memory limits, network_allow. |
|
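The per-mesh naming patterns in the table above can be written down as two small helpers. The function names are hypothetical and the sketch does no validation or escaping of `meshId`; only the two patterns themselves come from the table.

```typescript
// Hypothetical helpers that mirror the documented naming patterns.

// Qdrant: mesh_<meshId>_<collection>
function qdrantCollectionName(meshId: string, collection: string): string {
  return `mesh_${meshId}_${collection}`;
}

// MinIO: mesh-<meshId>
function minioBucketName(meshId: string): string {
  return `mesh-${meshId}`;
}
```

Centralizing the patterns in one place would keep tenant isolation checks (for example, rejecting a collection name that belongs to another mesh) easy to audit.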
||||
|
||||
---
|
||||
|
||||
## 7. Broker core subsystems
|
||||
|
||||
### 7.1 Status engine (`broker.ts`, 2066 lines)
|
||||
|
||||
**Battle-tested status model** ported from `claude-intercom`. Rules:
|
||||
|
||||
- Status sources are ranked: `hook` (3) > `manual` (2) > `jsonl` (1)
|
||||
- On status update:
  - If status **changed** → bump everything, record new source
  - If status **unchanged**, incoming source ≥ recorded → upgrade
  - If status **unchanged**, incoming source < recorded:
    - Recorded source still fresh → keep it (bump timestamp only)
    - Recorded source stale → downgrade to honest attribution
|
||||
- `HOOK_FRESHNESS_MS` window (default 60s) for "fresh" classification
|
||||
- `WORKING_TTL_MS` after which `working` status reverts to `idle`
|
||||
- `PENDING_TTL_MS = 10_000` for pending status cleanup
|
||||
- `TTL_SWEEP_INTERVAL_MS = 15_000` for periodic cleanup
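The update rules above can be sketched as a single pure function. This is an illustration under stated assumptions, not the code in `broker.ts`: the record shape, the `sourceSeenAt` field used to judge freshness, and the function name are all hypothetical; only the ranking and the branch logic are taken from the list.

```typescript
// Sketch only: type and field names are hypothetical.
type StatusSource = "hook" | "manual" | "jsonl";

const SOURCE_RANK: Record<StatusSource, number> = { hook: 3, manual: 2, jsonl: 1 };
const HOOK_FRESHNESS_MS = 60_000; // default freshness window named above

interface StatusRecord {
  status: string;
  source: StatusSource;
  sourceSeenAt: number; // assumed: last time the recorded source itself reported
  updatedAt: number; // last time any source confirmed the status
}

interface StatusUpdate {
  status: string;
  source: StatusSource;
}

function applyStatusUpdate(rec: StatusRecord, incoming: StatusUpdate, now: number): StatusRecord {
  const adopt: StatusRecord = {
    status: incoming.status,
    source: incoming.source,
    sourceSeenAt: now,
    updatedAt: now,
  };
  // Status changed: bump everything and record the new source.
  if (incoming.status !== rec.status) return adopt;
  // Unchanged status, equal or higher-ranked source: upgrade attribution.
  if (SOURCE_RANK[incoming.source] >= SOURCE_RANK[rec.source]) return adopt;
  // Unchanged status, lower-ranked source: keep a fresh source, replace a stale one.
  const fresh = now - rec.sourceSeenAt < HOOK_FRESHNESS_MS;
  return fresh ? { ...rec, updatedAt: now } : adopt;
}
```

For example, a `jsonl` report arriving 10 seconds after a `hook` report leaves the source as `hook`, while the same report arriving 70 seconds later takes over attribution.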
|
||||
|
||||
**Must preserve** — this is the correctness engine for `set_status`, `list_peers`, and Claude Code's status line.
|
||||
|
||||
### 7.2 Message queue + priority delivery
|
||||
|
||||
- Messages are stored in `mesh.message_queue` with priority (`now`, `next`, `low`)
|
||||
- `now` messages bypass busy-gate and are pushed immediately
|
||||
- `next` messages wait for idle peer
|
||||
- `low` messages are pull-only (delivered when peer explicitly drains via `check_messages`)
|
||||
- Queue is drained via `drainForMember(meshId, memberId)` on WS message arrival or manual `check_messages`
|
||||
- Duplicate delivery prevention via `messageId` UUID tracking
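A minimal sketch of the priority gate and duplicate suppression described above. The function names, the two-value peer status, and the in-memory `Set` used for `messageId` tracking are assumptions for illustration; the sketch also assumes that an explicit `check_messages` pull returns every undelivered message regardless of priority.

```typescript
// Hypothetical model of the delivery gate; not the broker's drainForMember.
type Priority = "now" | "next" | "low";
type PeerStatus = "idle" | "working";

interface QueuedMessage {
  messageId: string;
  priority: Priority;
  body: string;
}

// Decide whether a message may be pushed without the peer asking for it.
function canPush(priority: Priority, peerStatus: PeerStatus): boolean {
  if (priority === "now") return true; // bypasses the busy gate
  if (priority === "next") return peerStatus === "idle"; // waits for an idle peer
  return false; // "low" is pull-only
}

// Return the messages to deliver now and mark them as delivered.
function drainQueue(
  queue: QueuedMessage[],
  peerStatus: PeerStatus,
  explicitPull: boolean,
  delivered: Set<string>,
): QueuedMessage[] {
  const out: QueuedMessage[] = [];
  for (const msg of queue) {
    if (delivered.has(msg.messageId)) continue; // duplicate suppression by messageId
    if (explicitPull || canPush(msg.priority, peerStatus)) {
      delivered.add(msg.messageId);
      out.push(msg);
    }
  }
  return out;
}
```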
|
||||
|
||||
### 7.3 Scheduled message delivery (`index.ts` in-memory + DB persistence)
|
||||
|
||||
- One-shot: `deliver_at` (timestamp) or `in_seconds`
|
||||
- Recurring: standard 5-field cron expression
|
||||
- Persists to `mesh.scheduled_message` table — survives broker restart
|
||||
- On broker start, pending schedules are re-registered
|
||||
- Delivery is via the normal `send_message` pipeline with `subtype: reminder`
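The one-shot and restart behavior above could look roughly like the following. All names are hypothetical, `deliver_at` is assumed to be epoch milliseconds, and delivering overdue schedules immediately after a restart is an assumption the source does not confirm. Cron handling is left out.

```typescript
// Hypothetical helpers for one-shot schedules.
interface OneShotRequest {
  deliver_at?: number; // assumed: epoch milliseconds
  in_seconds?: number;
}

interface ScheduledRow {
  id: string;
  deliverAtMs: number;
}

// Turn either form of a one-shot request into an absolute delivery time.
function resolveDeliverAtMs(req: OneShotRequest, nowMs: number): number {
  if (req.deliver_at !== undefined) return req.deliver_at;
  if (req.in_seconds !== undefined) return nowMs + req.in_seconds * 1000;
  throw new Error("one-shot schedule needs deliver_at or in_seconds");
}

// On broker start: compute a timer delay for each persisted schedule.
function delaysOnRestart(rows: ScheduledRow[], nowMs: number): Array<{ id: string; delayMs: number }> {
  return rows.map((row) => ({
    id: row.id,
    delayMs: Math.max(0, row.deliverAtMs - nowMs), // overdue rows fire right away (assumption)
  }));
}
```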
|
||||
|
||||
### 7.4 URL watch subsystem (`index.ts`)
|
||||
|
||||
- Poller runs in-process (worker per watch)
|
||||
- Modes: `hash` (SHA-256 of body), `json` (extract jsonpath value), `status` (HTTP status)
|
||||
- `notify_on: change | match:<val> | not_match:<val>`
|
||||
- Persists to DB so watches survive broker restart
|
||||
- Min interval 5s, max 24h
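The `notify_on` grammar and the interval bounds above can be illustrated with a small parser and evaluator. Function names are hypothetical. The sketch takes the already-extracted observed value (hash, jsonpath result, or status code) as a string, clamps the interval where the real broker might reject it instead, and only evaluates the condition; whether the poller notifies on every matching poll or only on transitions is not stated in the source.

```typescript
// Hypothetical parser/evaluator for notify_on rules.
type NotifyRule =
  | { kind: "change" }
  | { kind: "match"; value: string }
  | { kind: "not_match"; value: string };

function parseNotifyOn(spec: string): NotifyRule {
  if (spec === "change") return { kind: "change" };
  if (spec.startsWith("match:")) return { kind: "match", value: spec.slice("match:".length) };
  if (spec.startsWith("not_match:")) {
    return { kind: "not_match", value: spec.slice("not_match:".length) };
  }
  throw new Error(`unsupported notify_on: ${spec}`);
}

// previous is undefined on the first poll, when there is nothing to compare against.
function conditionHolds(rule: NotifyRule, previous: string | undefined, current: string): boolean {
  if (rule.kind === "change") return previous !== undefined && previous !== current;
  if (rule.kind === "match") return current === rule.value;
  return current !== rule.value;
}

const MIN_INTERVAL_S = 5;
const MAX_INTERVAL_S = 24 * 60 * 60;

// Assumption: out-of-range intervals are clamped rather than rejected.
function clampIntervalSeconds(seconds: number): number {
  return Math.min(MAX_INTERVAL_S, Math.max(MIN_INTERVAL_S, seconds));
}
```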
|
||||
|
||||
### 7.5 Telegram bridge (`telegram-bridge.ts`, 1711 lines)
|
||||
|
||||
**Substantial subsystem.** Provides Telegram Bot API integration:
|
||||
|
||||
- Bot token registration per mesh via `POST /tg/token`
|
||||
- Long-polling or webhook mode
|
||||
- `tg:<username>` peer identity registration in the mesh's member table
|
||||
- Inbound Telegram messages → mesh `send_message` events with `subtype: telegram`
|
||||
- Outbound `send_message(to: "tg:<name>")` → Telegram Bot API call
|
||||
- Chat-to-mesh mapping (Telegram chat_id ↔ mesh peer)
|
||||
- User discovery (`connectChat`)
|
||||
- Bridge row persistence in `mesh.telegram_bridge`
|
||||
|
||||
**This is ~18% of the broker's total source**. v2 must either:
|
||||
1. Port the logic into a standalone MCP connector (`apps/mcp-telegram/`), or
|
||||
2. Keep this file in the broker and wire it into the v2 architecture unchanged (the recommended option: it stays bundled into the broker image)
|
||||
|
||||
Either way, **every behavior documented here must still work after v2 lands**.
|
||||
|
||||
### 7.6 Auth + crypto (`crypto.ts`, `broker-crypto.ts`, `jwt.ts`)
|
||||
|
||||
- **Hello signatures**: Ed25519 signed tuple of `(meshId, memberId, pubkey, timestamp)`. Verified on every WS connection. Replay protection via timestamp window.
|
||||
- **Invite verification**: canonical invite payload (`canonicalInvite`) signed by mesh owner, Ed25519 verified on claim
|
||||
- **JWT**: for `/cli-sync` endpoint — the CLI obtains a JWT from `claudemesh.com` via browser flow, passes it to the broker, broker verifies and returns the user's mesh list
|
||||
- **File envelopes**: client-side AES-GCM + per-recipient key wrapping (file_key table)
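To illustrate the timestamp-window part of hello verification, the sketch below builds a canonical string from the signed tuple and checks freshness. The serialization format, the 60-second window, and the function names are assumptions; the Ed25519 signature check itself is omitted and would run over the canonical bytes.

```typescript
// Hypothetical canonical form and freshness check for the hello tuple.
interface HelloPayload {
  meshId: string;
  memberId: string;
  pubkey: string;
  timestamp: number; // assumed: epoch milliseconds
}

const HELLO_WINDOW_MS = 60_000; // illustrative value, not taken from the source

// Fixed field order so signer and verifier serialize identically.
function canonicalHello(p: HelloPayload): string {
  return JSON.stringify([p.meshId, p.memberId, p.pubkey, p.timestamp]);
}

// Reject hellos whose timestamp is too far from the verifier's clock in either direction.
function isFreshHello(timestamp: number, nowMs: number, windowMs: number = HELLO_WINDOW_MS): boolean {
  return Math.abs(nowMs - timestamp) <= windowMs;
}
```

A timestamp window bounds how long a captured hello stays usable; it does not by itself stop a replay inside the window, which would need a nonce or per-connection challenge.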
|
||||
|
||||
### 7.7 Rate limiting (`rate-limit.ts`)
|
||||
|
||||
- Per-peer rate limits on expensive operations
|
||||
- Currently in-process (not Redis-backed)
|
||||
- Enforces limits on `send`, `vector_store`, `mesh_execute`, `mesh_mcp_deploy`, etc.
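A minimal in-process token bucket keyed by peer and operation, sketched to show the shape of this kind of limiter. The class, the `allow` helper, and the capacity and refill values are hypothetical; the actual limits in `rate-limit.ts` are not given in this inventory.

```typescript
// Hypothetical in-memory limiter; state is lost on process restart.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full
    this.lastRefillMs = nowMs;
  }

  // Refill for the elapsed time, then try to spend one token.
  tryTake(nowMs: number): boolean {
    const elapsedS = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedS * this.refillPerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

const buckets = new Map<string, TokenBucket>();

// One bucket per (peer, operation); limits here are illustrative: burst of 2, 1 token per second.
function allow(peerId: string, op: string, nowMs: number): boolean {
  const key = `${peerId}:${op}`;
  let bucket = buckets.get(key);
  if (bucket === undefined) {
    bucket = new TokenBucket(2, 1, nowMs);
    buckets.set(key, bucket);
  }
  return bucket.tryTake(nowMs);
}
```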
|
||||
|
||||
### 7.8 Metrics (`metrics.ts`)
|
||||
|
||||
Prometheus metrics exposed at `/metrics`:
|
||||
- Request counts by op type
|
||||
- Latencies p50/p99
|
||||
- Connection counts per mesh
|
||||
- Message delivery counts by priority
|
||||
- Error rates
|
||||
|
||||
### 7.9 Audit log (`audit.ts`)
|
||||
|
||||
- Every mutation is audited to `mesh.audit_log`
|
||||
- Tamper-evidence via hash chaining
|
||||
- Accessible via `audit_query` and `audit_verify` WS ops
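Hash chaining can be sketched in a few lines: each entry stores the previous entry's hash and a hash over its own content plus that link, so editing any entry invalidates it and everything verified after it. The entry shape and function names are hypothetical. To stay dependency-free, the default hash here is a toy FNV-1a function, which is **not** tamper-resistant; a real audit log would pass in a cryptographic hash such as SHA-256.

```typescript
// Hypothetical hash-chained log. toyHash is a stand-in, not a secure hash.
type HashFn = (input: string) => string;

function toyHash(input: string): string {
  let h = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 bits
  }
  return h.toString(16).padStart(8, "0");
}

interface AuditEntry {
  event: string;
  prevHash: string;
  hash: string;
}

const GENESIS = "genesis";

// Append an event, linking it to the hash of the previous entry.
function appendEntry(log: AuditEntry[], event: string, hash: HashFn = toyHash): AuditEntry[] {
  const last = log[log.length - 1];
  const prevHash = last === undefined ? GENESIS : last.hash;
  return [...log, { event, prevHash, hash: hash(`${prevHash}\n${event}`) }];
}

// Return the index of the first entry that breaks the chain, or -1 if the chain is intact.
function firstBrokenIndex(log: AuditEntry[], hash: HashFn = toyHash): number {
  let expectedPrev = GENESIS;
  return log.findIndex((entry) => {
    const ok =
      entry.prevHash === expectedPrev &&
      entry.hash === hash(`${entry.prevHash}\n${entry.event}`);
    expectedPrev = entry.hash;
    return !ok;
  });
}
```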
|
||||
|
||||
### 7.10 Member API (`member-api.ts`, 284 lines)
|
||||
|
||||
Exports:
|
||||
- `updateMemberProfile()` — used by `PATCH /mesh/:id/member/:memberId`
|
||||
- `listMeshMembers()` — used by `GET /mesh/:id/members`
|
||||
- `updateMeshSettings()` — used by `PATCH /mesh/:id/settings`
|
||||
|
||||
### 7.11 CLI sync (`cli-sync.ts`, 133 lines)
|
||||
|
||||
Exports `handleCliSync()` for `POST /cli-sync`. This is **already the "CLI sync meshes from dashboard" feature** — v2 will reuse this endpoint for its mesh-list refresh logic.
|
||||
|
||||
### 7.12 Webhook subsystem (`webhooks.ts`, 97 lines)
|
||||
|
||||
Handles `POST /hook/:meshId/:webhookId` inbound. Signature verification (HMAC), payload normalization, mesh message emission.
|
||||
|
||||
---
|
||||
|
||||
## 8. CLI core subsystems
|
||||
|
||||
### 8.1 WS client (`ws/client.ts`, 2191 lines)
|
||||
|
||||
**The biggest CLI file.** Implements the full WS protocol with:
|
||||
- Connection management, reconnect with exponential backoff
|
||||
- Message queue for offline buffering
|
||||
- Request/response correlation via `_reqId`
|
||||
- Ed25519 hello signature generation
|
||||
- Crypto envelope wrapping for `send_message` payloads
|
||||
- Push notification delivery (messages, state changes, system events)
|
||||
- Per-mesh connection pooling (one WS per mesh)
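Two of the mechanisms listed above, exponential backoff and `_reqId` correlation, can be sketched without any socket code. The class and function names, the base and cap values, and the ID format are assumptions; only the `_reqId` field name comes from the list.

```typescript
// Hypothetical helpers; no real WebSocket is involved.

// Delay before reconnect attempt N (0-based): doubles each time, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number = 500, maxMs: number = 30_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Matches responses to outstanding requests by their _reqId.
class RequestCorrelator {
  private nextId = 1;
  private pending = new Map<string, (payload: unknown) => void>();

  // Allocate a _reqId and a promise that settles when the matching response arrives.
  register(): { reqId: string; response: Promise<unknown> } {
    const reqId = `r${this.nextId++}`;
    const response = new Promise<unknown>((resolve) => {
      this.pending.set(reqId, resolve);
    });
    return { reqId, response };
  }

  // Call with each incoming message's _reqId; returns false for unknown or already-settled IDs.
  settle(reqId: string, payload: unknown): boolean {
    const resolve = this.pending.get(reqId);
    if (resolve === undefined) return false;
    this.pending.delete(reqId);
    resolve(payload);
    return true;
  }

  get pendingCount(): number {
    return this.pending.size;
  }
}
```

A caller would attach `reqId` to the outgoing message as `_reqId`, await `response`, and typically race it against a timeout so a dropped connection does not leave the promise pending forever.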
|
||||
|
||||
### 8.2 MCP server (`mcp/server.ts`, 2139 lines)
|
||||
|
||||
Second biggest CLI file. Implements:
|
||||
- MCP stdio transport (registered with Claude Code via `install.ts`)
|
||||
- Tool registry from `mcp/tools.ts`
|
||||
- Dispatch to 79 handlers (one per tool)
|
||||
- WS client pooling (one connection per mesh)
|
||||
- Crypto primitives for memory/state encryption
|
||||
- Inline file-read helpers for `read_peer_file`
|
||||
- Channel notification forwarding from broker → Claude Code via MCP elicitation
|
||||
|
||||
### 8.3 Crypto (`crypto/*.ts`)
|
||||
|
||||
- `keypair.ts` — Ed25519 keypair generation + persistence (`~/.claudemesh/keys/<mesh>.key`)
|
||||
- `envelope.ts` — NaCl `crypto_box` envelope wrapping
|
||||
- `file-crypto.ts` — AES-GCM file encryption + per-recipient key wrapping
|
||||
- `hello-sig.ts` — Hello signature generation/verification
|
||||
|
||||
### 8.4 Auth + invite (`auth/*.ts`, `invite/*.ts`, `lib/invite-v2.ts`)
|
||||
|
||||
- `callback-listener.ts` — local HTTP server that catches browser OAuth callback (for `sync` command)
|
||||
- `open-browser.ts` — cross-platform browser launcher
|
||||
- `pairing-code.ts` — pairing code display
|
||||
- `sync-with-broker.ts` — JWT-based sync from dashboard
|
||||
- `invite/parse.ts` — parse v1 invite URLs
|
||||
- `invite/enroll.ts` — enroll into a mesh from an invite
|
||||
- `lib/invite-v2.ts` — v2 invite format (short-code + signed payload)
|
||||
|
||||
### 8.5 State + config (`state/config.ts`)
|
||||
|
||||
- `~/.claudemesh/config.json` read/write (mesh list, keypairs, profile defaults)
|
||||
- 0600 permission enforcement
|
||||
- Schema validation
|
||||
|
||||
### 8.6 TUI primitives (`tui/*.ts`)
|
||||
|
||||
- `colors.ts` — hard-coded ANSI colors
|
||||
- `index.ts` — input helpers
|
||||
- `screen.ts` — raw-mode screen control
|
||||
- `spinner.ts` — simple spinner
|
||||
|
||||
### 8.7 Templates (`templates/index.ts`)
|
||||
|
||||
- `dev-team`, `research`, `ops-incident`, `simulation`, `personal`
|
||||
- Each template seeds initial state + preset groups
|
||||
|
||||
### 8.8 Tests
|
||||
|
||||
- `__tests__/crypto-roundtrip.test.ts` — crypto round-trip verification
|
||||
- `__tests__/invite-parse.test.ts` — invite URL parsing
|
||||
- No integration tests against a real broker
|
||||
|
||||
---
|
||||
|
||||
## 9. Infrastructure + deployment
|
||||
|
||||
### 9.1 Broker runtime (`env.ts`)
|
||||
|
||||
Environment variables the broker expects:
|
||||
- `DATABASE_URL` — Postgres connection
|
||||
- `NEO4J_URL`, `NEO4J_USER`, `NEO4J_PASSWORD`
|
||||
- `QDRANT_URL`
|
||||
- `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, `MINIO_USE_SSL`
|
||||
- `STATUS_TTL_SECONDS` — working status timeout
|
||||
- `HOOK_FRESH_WINDOW_SECONDS` — hook source freshness window
|
||||
- `TELEGRAM_BOT_TOKEN` — for bridge
|
||||
- `DASHBOARD_JWT_SECRET` — for `/cli-sync` verification
|
||||
- `PORT` (default 8787)
|
||||
- Various feature flags
|
||||
|
||||
### 9.2 CLI runtime
|
||||
|
||||
- Node >= 20 required (checked in `doctor`)
|
||||
- `claude` binary must be on PATH
|
||||
- `~/.claudemesh/` directory with config + keys
|
||||
- `~/.claude.json` MCP server registration
|
||||
- `~/.claude/settings.json` status hooks registration
|
||||
|
||||
### 9.3 Deployment (Coolify/Docker Compose)
|
||||
|
||||
- Broker deployed via Coolify + Gitea CI on OVHcloud VPS (`ic.claudemesh.com`)
|
||||
- WS endpoint: `wss://ic.claudemesh.com/ws`
|
||||
- HTTP endpoint: `https://ic.claudemesh.com`
|
||||
- Postgres, Neo4j, Qdrant, MinIO run as siblings in Docker Compose
|
||||
- Deployed MCP sandboxes use the host Docker daemon via socket mount
|
||||
|
||||
---
|
||||
|
||||
## 10. Features not in the tool/WS surface (behavioral)
|
||||
|
||||
These are v1 behaviors that exist but aren't enumerated as tools. Each must still work after v2.
|
||||
|
||||
| Feature | Location | Notes |
|---|---|---|
| Flag-first `claudemesh --resume xxx` routing | `cli/src/index.ts` §339 | Rewrites argv to `launch --resume xxx` |
| Bare `claudemesh` → welcome wizard | `cli/src/index.ts` §334 | Runs `runWelcome()` |
| Status hook auto-registration | `commands/install.ts` | Writes to `~/.claude/settings.json` |
| Claude Code session hook handling | `commands/hook.ts` | Receives stdin JSON, posts to `/hook/set-status` |
| Per-mesh keypair directory | `crypto/keypair.ts` | `~/.claudemesh/keys/<mesh>.key` with 0600 perms |
| E2E file encryption with re-wrapping | `crypto/file-crypto.ts` + `mesh.file_key` table | `grant_file_access` re-wraps symmetric key for new recipient |
| Priority message delivery | `broker.ts` | `now` bypasses busy-gate, `next` waits for idle, `low` is pull-only |
| Hook > manual > jsonl status priority | `broker.ts` | Documented in §7.1 |
| Simulation clock for test time | `index.ts` (broker) | Peers receive heartbeat ticks at simulated rate |
| Audit log hash chaining | `audit.ts` | Tamper-evident — tools call `audit_verify` to check |
| Dashboard-CLI sync | `auth/sync-with-broker.ts` + `cli-sync.ts` | Browser JWT flow, fetches mesh list from dashboard |
| Telegram chat ↔ mesh peer mapping | `telegram-bridge.ts` | Bidirectional routing via `tg:<username>` |
| Inbound webhook payload normalization | `webhooks.ts` | External systems POST, becomes a mesh message |
| Rate limiting per peer per operation | `rate-limit.ts` | In-memory token buckets |
| Prometheus metrics | `metrics.ts` | `/metrics` endpoint |
|
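The first two rows of the table above (flag-first routing and the bare-command wizard) amount to a small argv rewrite. The sketch below is an illustration, not the code in `cli/src/index.ts`: the `Route` type and function name are hypothetical, and it applies the rewrite to any leading flag, whereas the source only documents the `--resume` case.

```typescript
// Hypothetical routing step that runs before command dispatch.
type Route =
  | { kind: "welcome" } // bare invocation: run the welcome wizard
  | { kind: "command"; argv: string[] };

function routeArgv(args: string[]): Route {
  const first = args[0];
  if (first === undefined) return { kind: "welcome" };
  // Assumption: any leading flag is treated as an implicit "launch".
  if (first.startsWith("-")) return { kind: "command", argv: ["launch", ...args] };
  return { kind: "command", argv: args };
}
```

A real implementation would likely exempt global flags such as help or version from the rewrite; that detail is not covered by this inventory.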
||||
|
||||
---
|
||||
|
||||
## 11. Test coverage (v1)
|
||||
|
||||
| Test | File | Notes |
|---|---|---|
| Crypto round-trip | `apps/cli/src/__tests__/crypto-roundtrip.test.ts` | Encrypt → decrypt verification |
| Invite URL parsing | `apps/cli/src/__tests__/invite-parse.test.ts` | v1 and v2 formats |
| Broker tests | `apps/broker/tests/*.test.ts` | broker.test.ts, invite-signature.test.ts, invite-v2.test.ts, hello-signature.test.ts, rate-limit.test.ts, encoding.test.ts, dup-delivery.test.ts, metrics.test.ts, logging.test.ts, integration/health.test.ts |
|
||||
|
||||
**v1 test coverage is minimal for the CLI side.** 2 unit test files for 12k LOC.
|
||||
|
||||
Broker has ~10 test files. They cover crypto primitives, invite flow, hello signatures, rate limiting, metrics — but **not** the 85 WS message handlers comprehensively.
|
||||
|
||||
---
|
||||
|
||||
## 12. The "must preserve" list (high-priority regression checks)
|
||||
|
||||
If v2 breaks any of these, it's a user-facing regression:
|
||||
|
||||
### 12.1 First-run experience
|
||||
- [ ] `claudemesh` bare command → welcome wizard
|
||||
- [ ] `claudemesh install` registers MCP server + status hooks in Claude Code config
|
||||
- [ ] `claudemesh join <url>` enrolls into a mesh from a v1 OR v2 invite URL
|
||||
- [ ] `claudemesh launch` starts Claude Code with mesh connectivity
|
||||
|
||||
### 12.2 Session lifecycle
|
||||
- [ ] Status hooks fire correctly on Claude Code session start/stop/pause
|
||||
- [ ] `set_status` honors priority (hook > manual > jsonl)
|
||||
- [ ] `list_peers` shows live status with freshness gating
|
||||
- [ ] Status TTL sweeper runs every 15s
|
||||
|
||||
### 12.3 Messaging
|
||||
- [ ] `send_message(to: peer, priority: "now")` delivers immediately
|
||||
- [ ] `send_message(to: peer, priority: "next")` waits for idle
|
||||
- [ ] `send_message(to: "@group")` broadcasts to group members
|
||||
- [ ] `send_message(to: "*")` broadcasts to all mesh peers
|
||||
- [ ] Offline recipients receive queued messages on reconnect
|
||||
- [ ] Duplicate delivery is prevented by `messageId` tracking
|
||||
|
||||
### 12.4 Cryptographic integrity
|
||||
- [ ] Ed25519 keypair generation + persistence with 0600 perms
|
||||
- [ ] Hello signature verification rejects replayed hellos whose timestamp falls outside the allowed window
|
||||
- [ ] `send_message` envelopes are E2E encrypted (NaCl crypto_box)
|
||||
- [ ] File uploads are AES-GCM encrypted with per-recipient key wrapping
|
||||
- [ ] `grant_file_access` re-wraps symmetric key for a new recipient
|
||||
|
||||
### 12.5 All 79 MCP tools
|
||||
- [ ] Every tool in §2 dispatches correctly through the CLI's MCP server
|
||||
- [ ] Every tool delegates to the broker WS protocol or local handler as appropriate
|
||||
- [ ] No tool returns "not implemented" or throws an unexpected error
|
||||
|
||||
### 12.6 Broker backends
|
||||
- [ ] `mesh_query` / `mesh_execute` / `mesh_schema` work against per-mesh Postgres schema
|
||||
- [ ] `graph_query` / `graph_execute` work against per-mesh Neo4j database
|
||||
- [ ] `vector_store` / `vector_search` work against per-mesh Qdrant collection
|
||||
- [ ] `share_file` / `get_file` work through per-mesh MinIO bucket
|
||||
- [ ] `mesh_mcp_deploy` spawns a Docker container with correct scope + env + network_allow
|
||||
- [ ] `vault_set` + `$vault:<key>` env injection works end-to-end for deployed MCPs
|
||||
|
||||
### 12.7 Scheduled + URL watch
|
||||
- [ ] `schedule_reminder` with `cron` survives broker restart (persisted in DB)
|
||||
- [ ] `mesh_watch` polls at the specified interval and notifies on change
|
||||
- [ ] Watch state persists across broker restart
|
||||
|
||||
### 12.8 Telegram bridge
|
||||
- [ ] `connect telegram` registers bot token via `POST /tg/token`
|
||||
- [ ] Bot token is stored in `mesh.telegram_bridge`
|
||||
- [ ] Inbound Telegram messages are routed as mesh messages
|
||||
- [ ] `send_message(to: "tg:<username>")` routes via Telegram Bot API
|
||||
- [ ] `disconnect telegram` tears down the bridge cleanly
|
||||
|
||||
### 12.9 Dashboard sync
|
||||
- [ ] `claudemesh sync` browser flow completes and fetches mesh list
|
||||
- [ ] `POST /cli-sync` with valid JWT returns user's dashboard meshes
|
||||
|
||||
### 12.10 Webhooks
|
||||
- [ ] `create_webhook` returns a POST URL
|
||||
- [ ] External POST to webhook URL becomes a mesh message
|
||||
- [ ] HMAC signature validation rejects unsigned requests
|
||||
- [ ] `list_webhooks` + `delete_webhook` work
|
||||
|
||||
### 12.11 Doctor checks
|
||||
- [ ] Node >= 20 check
|
||||
- [ ] `claude` binary on PATH
|
||||
- [ ] MCP server registered in `~/.claude.json`
|
||||
- [ ] Status hooks registered in `~/.claude/settings.json`
|
||||
- [ ] `~/.claudemesh/config.json` exists + parses + 0600 perms
|
||||
- [ ] Mesh keypairs valid
|
||||
|
||||
---
|
||||
|
||||
## 13. What v2 is adding (net new)
|
||||
|
||||
Not part of the regression list, but tracked here so we don't lose sight of the forward-looking scope.
|
||||
|
||||
### 13.1 New CLI features (from user's stated v2 intent)
|
||||
|
||||
- [ ] `claudemesh login` — device-code OAuth against claudemesh.com's Better Auth backend
|
||||
- [ ] `claudemesh register` — create a new account from the CLI (via browser handoff)
|
||||
- [ ] `claudemesh new` — create a mesh from the CLI against `POST /api/my/meshes` (not via templates in the CLI — via dashboard API)
|
||||
- [ ] `claudemesh invite` — generate an invite from the CLI via `POST /api/my/meshes/:slug/invites`
|
||||
- [ ] `claudemesh whoami` — show current identity + token source
|
||||
- [ ] `claudemesh logout` — revoke server-side session + clear local credentials
|
||||
|
||||
### 13.2 Architecture improvements (from user's v2 intent)
|
||||
|
||||
- [ ] Feature-folder `services/` layer with strict facade boundaries
|
||||
- [ ] ESLint + dependency-cruiser boundary enforcement
|
||||
- [ ] `cli/` vs `ui/` separation (non-Ink I/O vs Ink rendering)
|
||||
- [ ] `entrypoints/` folder with cli + mcp entries
|
||||
- [ ] Typed error classes per service with `toDomainError` helper
|
||||
- [ ] Coverage threshold enforcement in CI
|
||||
|
||||
### 13.3 Not in v1.0.0 scope (defer to v1.1+)
|
||||
|
||||
Everything from the Composer 2 review rounds that isn't Pass 1:
|
||||
|
||||
- Local-first SQLite source of truth (Lamport, sync daemon, publish transaction)
|
||||
- Broker security hardening (role-per-mesh Postgres, Docker egress proxy, SSRF policy)
|
||||
- ICU MessageFormat + per-locale budgets
|
||||
- Accessibility token-signal matrix
|
||||
- Tiered MCP catalog + audit process
|
||||
- session_kind enum
|
||||
- NFC peer_id normalization
|
||||
- Write queue state machine
|
||||
|
||||
These stay in `.artifacts/specs/` as reference documents. They describe a good destination. They are NOT v1.0.0 requirements.
|
||||
|
||||
---
|
||||
|
||||
## 14. Known v1 technical debt / gaps (worth noting)
|
||||
|
||||
These aren't features — they're places where v1 is weaker than it could be. Document here so v2 doesn't blindly port the weaknesses.
|
||||
|
||||
- **CLI auth is missing** — v1 has no `login` / `logout` command. All account-level operations require the web dashboard. This is what v2 is adding.
|
||||
- **Imperative command branching** — `commands/launch.ts` is 775 lines with nested flag handling. Cleaner in v2's flow pipeline.
|
||||
- **Minimal CLI test coverage** — 2 test files for 12k LOC. v2 should have colocated tests per service.
|
||||
- **Rate limiting is in-memory only** — doesn't survive broker restart; not Redis-backed.
|
||||
- **No CLI-side caching** — every `list_peers` / `mesh_info` call hits the broker. v2's local-first layer (Pass 2) addresses this.
|
||||
- **Telegram bridge is a large monolithic file** (1711 lines) — legitimate complexity, but v2 may want to modularize if it touches it.
|
||||
- **v1 wizard bleed-through** — `launch` → `claude` handoff leaves ANSI state dirty. v2's `resetTerminal()` choke point fixes this.
|
||||
|
||||
None of these are regressions if v2 keeps them as-is. v2 should **not** prioritize fixing them — fix them when they become a problem, not speculatively.
|
||||
|
||||
---
|
||||
|
||||
## 15. Reading this inventory
|
||||
|
||||
**If you're implementing v2 Phase 1** (foundation layers): every tool in §2, every WS op in §3, every HTTP endpoint in §4, every DB table in §5 must have a place in the v2 folder structure. No new semantics, no improved algorithms — just move the working code.
|
||||
|
||||
**If you're reviewing a v2 PR**: check it against §12 ("must preserve" list). If the PR changes the behavior of anything in that list, it's a regression and needs explicit sign-off.
|
||||
|
||||
**If you're writing v2 docs**: reference this document. Every feature here is user-visible and documented in v1's README / slash-command help / tool descriptions. v2 docs should mention every feature from §2 as preserved.
|
||||
|
||||
---
|
||||
|
||||
**End of inventory.**
|
||||
1068
.artifacts/backlog/2026-04-11-v2-parity-test-plan.md
Normal file
BIN .artifacts/hero-animation/clawd-apple-zoom.png (Normal file, 3.4 KiB)
BIN .artifacts/hero-animation/clawd-zoom-v2.png (Normal file, 1.5 KiB)
BIN .artifacts/hero-animation/clawd-zoom.png (Normal file, 1.1 KiB)
BIN .artifacts/hero-animation/fcc-preview-v1.png (Normal file, 99 KiB)
BIN .artifacts/hero-animation/fcc-preview-v2.png (Normal file, 41 KiB)
BIN .artifacts/hero-animation/fcc-preview-v3.png (Normal file, 67 KiB)
BIN .artifacts/hero-animation/features-section.png (Normal file, 174 KiB)
BIN .artifacts/hero-animation/features-with-skills.png (Normal file, 107 KiB)
BIN .artifacts/hero-animation/frame-01-alone.png (Normal file, 458 KiB)
BIN .artifacts/hero-animation/hero-with-mesh-v1.png (Normal file, 303 KiB)
BIN .artifacts/hero-animation/landing-cover.png (Normal file, 475 KiB)
BIN .artifacts/hero-animation/landing-live.png (Normal file, 462 KiB)
BIN .artifacts/hero-animation/mesh-constellation-v1.png (Normal file, 250 KiB)
BIN .artifacts/hero-animation/mesh-constellation-v2.png (Normal file, 334 KiB)
BIN .artifacts/hero-animation/mesh-constellation-v3.png (Normal file, 225 KiB)
BIN .artifacts/hero-animation/mesh-hero-apple-clawd.png (Normal file, 170 KiB)
BIN .artifacts/hero-animation/mesh-hero-clip.png (Normal file, 167 KiB)
BIN .artifacts/hero-animation/mesh-hero-full.png (Normal file, 180 KiB)
BIN .artifacts/hero-animation/mesh-hero-v1.png (Normal file, 178 KiB)
BIN .artifacts/hero-animation/mesh-icon-big.png (Normal file, 128 KiB)
BIN .artifacts/hero-animation/mesh-no-overlap.png (Normal file, 278 KiB)
BIN .artifacts/hero-animation/mesh-peers-equal.png (Normal file, 263 KiB)
BIN .artifacts/hero-animation/mesh-trail-5700.png (Normal file, 144 KiB)
BIN .artifacts/hero-animation/mesh-trail-inflight.png (Normal file, 144 KiB)
BIN .artifacts/hero-animation/mesh-trail-top.png (Normal file, 74 KiB)
BIN .artifacts/hero-animation/mesh-trail-v1.png (Normal file, 162 KiB)
BIN .artifacts/hero-animation/mesh-trail-v2.png (Normal file, 145 KiB)
BIN .artifacts/hero-animation/mesh-triangle.png (Normal file, 257 KiB)
BIN .artifacts/hero-animation/mesh-zoom-mid.png (Normal file, 64 KiB)
BIN .artifacts/hero-animation/prompt-box-early.png (Normal file, 24 KiB)
BIN .artifacts/hero-animation/prompt-input-live.png (Normal file, 151 KiB)
BIN .artifacts/hero-animation/reference.png (Normal file, 325 KiB)
BIN .artifacts/hero-animation/responsive-1200.png (Normal file, 231 KiB)
BIN .artifacts/hero-animation/responsive-1700.png (Normal file, 268 KiB)
BIN .artifacts/hero-animation/responsive-800.png (Normal file, 139 KiB)
BIN .artifacts/hero-animation/session-mid-2.png (Normal file, 62 KiB)
BIN .artifacts/hero-animation/session-mid-3.png (Normal file, 55 KiB)
BIN .artifacts/hero-animation/session-mid.png (Normal file, 69 KiB)
BIN .artifacts/hero-animation/where-mesh-fits-v2.png (Normal file, 193 KiB)
BIN .artifacts/hero-animation/where-mesh-fits.png (Normal file, 175 KiB)
158
.artifacts/ideas/2026-04-19-hackathon-day-one-scenario.txt
Normal file
@@ -0,0 +1,158 @@
|
||||
HACKATHON — THE DAY-ONE "WOW" SCENARIO
|
||||
======================================
|
||||
Date: 2026-04-19
|
||||
Follow-up to: 2026-04-19-hackathon-proposal.txt
|
||||
|
||||
|
||||
THE SHORT ANSWER
|
||||
----------------
|
||||
|
||||
Yes — it's exactly as simple as run one command, join a mesh, and
|
||||
immediately inherit your team's tools, skills, MCPs, and context.
|
||||
No config copying. No API key juggling. No "let me send you my
|
||||
.mcp.json". Zero setup.
|
||||
|
||||
That's the thing that has never existed before: Claude Code sessions
|
||||
that share capability at the speed of a chat invite.
|
||||
|
||||
|
||||
THE 60-SECOND STORY (rough, but close to real)
|
||||
----------------------------------------------
|
||||
|
||||
Picture Ana at the hackathon. Her teammate David has been working on
|
||||
their project for two days — wired up a Linear MCP, a Figma MCP, a
|
||||
custom "brand-asset" skill, shared project context, a few API keys
|
||||
in the team vault. She shows up at the table, opens her laptop, has
|
||||
never touched the project.
|
||||
|
||||
1. David runs one command:
|
||||
$ claudemesh share ana@team.com
|
||||
She gets a link: https://claudemesh.com/i/5SLJ7F95
|
||||
|
||||
2. Ana runs one command:
|
||||
$ claudemesh https://claudemesh.com/i/5SLJ7F95
|
||||
(No separate install is needed; the CLI self-installs if missing.
|
||||
Takes under 10 seconds.)
|
||||
|
||||
3. Claude Code opens automatically, connected to the mesh. No
|
||||
further setup.
|
||||
|
||||
4. Ana types into Claude Code:
|
||||
"what are we building?"
|
||||
|
||||
Claude — HER local Claude, on HER laptop — answers with the
|
||||
team's current brief, pulled from the mesh's shared context
|
||||
that David set earlier. It knows the repo, the deadline, the
|
||||
stack, who's on the team, what's done, what's open.
|
||||
|
||||
5. Ana says:
|
||||
"pull the latest tickets from Linear"
|
||||
|
||||
Her Claude uses the Linear MCP. Ana never installed it. She has
|
||||
no Linear API key on her machine. The MCP was deployed to the
|
||||
mesh by David on day one; the moment Ana joined, it became
|
||||
callable from her Claude Code as if it were local. Ciphertext
|
||||
routes through the broker, tool calls execute on the peer that
|
||||
owns the integration.
|
||||
|
||||
6. She asks:
|
||||
"generate launch-day assets in our brand"
|
||||
|
||||
Her Claude invokes the /brand-asset skill that David authored
|
||||
two days ago. Skills are portable in the mesh — calling it
|
||||
remotely is indistinguishable from having it installed locally.
|
||||
|
||||
7. She hits a wall on a type error. Instead of pinging David in
|
||||
Slack she types:
|
||||
"ask the mesh"
|
||||
|
||||
Question fans out to every teammate's Claude. Thirty seconds
|
||||
later she has three answers with three different repo contexts,
|
||||
synthesized into one reply, with attributions. This is the
|
||||
fan-out demo from the main proposal.
|
||||
|
||||
TOTAL ELAPSED TIME: under 90 seconds from "I don't have anything
|
||||
set up" to "my Claude knows our project and can use my team's tools."
|
||||
|
||||
|
||||
WHY THIS IS THE HEADLINE
|
||||
------------------------
|
||||
|
||||
Every other developer tool in 2026 still demands:
|
||||
- install this package
|
||||
- set these env vars
|
||||
- copy this config
|
||||
- get an API key approved
|
||||
- restart your editor
|
||||
- re-index your repo
|
||||
|
||||
claudemesh replaces all of that with a single click on an invite
|
||||
link. The mesh IS the onboarding.
|
||||
|
||||
The shorter way to say it: every Claude Code session you onboard,
|
||||
you onboard your team's entire AI toolchain in one shot.
|
||||
|
||||
|
||||
WHAT THE USER ACTUALLY SEES
|
||||
---------------------------
|
||||
|
||||
Terminal (Ana):
|
||||
$ claudemesh https://claudemesh.com/i/5SLJ7F95
|
||||
✔ Joined "launch-team" as Ana
|
||||
4 peers online: David, Nedas, Lug-Nut, Juan
|
||||
12 tools available from the mesh
|
||||
3 shared skills
|
||||
context: "launch-day assets — due Friday"
|
||||
✔ Launching Claude Code…
|
||||
|
||||
Claude Code:
|
||||
> connected to mesh: launch-team
|
||||
> inherited: 12 tools, 3 skills, shared context, 14 memories
|
||||
|
||||
Dashboard (claudemesh.com):
|
||||
Ana's node appears on the live topology. Packets animate along
|
||||
edges as her first message flies. David's screen gets a presence
|
||||
ping: "Ana joined — ready".
|
||||
|
||||
That's the wow. Not a pitch deck, not a feature matrix — a literal
|
||||
before-and-after experience that takes under two minutes and looks
|
||||
impossible to anyone who's ever onboarded a new developer onto a
|
||||
project the old way.
|
||||
|
||||
|
||||
WHAT WE'RE BUILDING THIS WEEK TO MAKE THIS REAL
|
||||
-----------------------------------------------
|
||||
|
||||
Most of the primitives exist. The hackathon week is the glue:
|
||||
|
||||
• Tool inheritance — a peer's deployed MCPs become callable from
|
||||
other peers as if installed locally. Today: partially shipped.
|
||||
Hackathon goal: make it automatic, zero-config, visible in the
|
||||
universe dashboard.
|
||||
|
||||
• Skill sharing — same story, for skills (already has an alpha).
|
||||
Hackathon goal: polish, auto-discovery, one-line invoke.
|
||||
|
||||
• Context inheritance — joining a mesh automatically loads the
|
||||
mesh's shared context into the new Claude's session so it
|
||||
"knows what we're working on" from minute one. Today: state
|
||||
exists, auto-pull on join does not.
|
||||
|
||||
• "Ask the mesh" fan-out — the broadcast + synthesize primitive
|
||||
from the main proposal.
|
||||
|
||||
• The onboarding CLI flow — make the invite-link-to-Claude-ready
|
||||
path bulletproof and under 10 seconds on a fresh machine.
|
||||
|
||||
|
||||
THE DEMO ARTIFACT
|
||||
-----------------
|
||||
|
||||
A single 90-second screencast. Split screen: Ana's terminal on the
|
||||
left, the claudemesh.com live universe dashboard on the right.
|
||||
She joins. Her node appears on the mesh. She asks a question. Tools
|
||||
fire. Skills execute. Answer comes back. No text overlays needed —
|
||||
the UX itself is the argument.
|
||||
|
||||
That's the video that goes at the top of claudemesh.com on demo
|
||||
day.
|
||||
147
.artifacts/ideas/2026-04-19-hackathon-proposal.txt
Normal file
@@ -0,0 +1,147 @@
|
||||
HACKATHON PROPOSAL — CLAUDEMESH
|
||||
===============================
|
||||
Date: 2026-04-19
|
||||
Author: Alejandro Gutiérrez
|
||||
|
||||
|
||||
THE SHORT ANSWER
|
||||
----------------
|
||||
|
||||
I'm going with claudemesh — not the Flexicar voice assistant, not a fresh
|
||||
blend. claudemesh is already a real product with a real backbone (CLI,
|
||||
MCP server, broker, E2E crypto, web dashboard), and what it still lacks
|
||||
is the one thing a hackathon is perfect for: a single headline capability
|
||||
that makes its existence obvious in ten seconds.
|
||||
|
||||
So I'm using the week to push claudemesh from "useful infra for people
|
||||
who already get it" → "demo that makes someone say, oh, that's what this
|
||||
is for."
|
||||
|
||||
|
||||
WHAT'S ALREADY THERE (SO YOU KNOW WHAT I'M BUILDING ON, NOT FROM ZERO)
|
||||
----------------------------------------------------------------------
|
||||
|
||||
- CLI + MCP server (claudemesh-cli), 40+ alpha releases shipped
|
||||
- Broker on wss://ic.claudemesh.com/ws with libsodium E2E encryption —
|
||||
broker routes ciphertext, never reads messages
|
||||
- Shared primitives: direct messages, group broadcasts, shared state,
|
||||
memory, file sharing, skill sharing, MCP deployment to the mesh
|
||||
- Telegram bridge with a Haiku-4.5 AI layer so you can talk to the mesh
|
||||
from your phone (shipped this week)
|
||||
- Web dashboard with per-mesh live panel (peers, envelope stream,
|
||||
audit chain)
|
||||
- Brand-new "Universe" dashboard landing (shipped today) — meshes +
|
||||
incoming invitations in one view
|
||||
|
||||
|
||||
WHAT I'M BUILDING DURING THE HACKATHON
|
||||
---------------------------------------
|
||||
|
||||
Headline: AGENT-TO-AGENT DELEGATION WITH LIVE STREAMING
|
||||
|
||||
Right now a Claude Code session can SEND a message to another session
|
||||
in the mesh. That's primitive-level. What's missing — and what makes
|
||||
the whole thing click — is DELEGATION: one Claude hands off a task to
|
||||
another, waits for the real answer (not a "sure, I'll do that later"
|
||||
acknowledgement), and composes it into its own response, with the
|
||||
user watching the whole thing happen live.
|
||||
|
||||
Why this is the right hackathon target:
|
||||
- It requires NO new physical infrastructure. The broker, the crypto,
|
||||
the transport are all there.
|
||||
- It's the unlock that turns claudemesh from "chat for Claudes" into
|
||||
"distributed cognition layer for Claude Code."
|
||||
- It's demoable in 60 seconds and the value is self-evident.
|
||||
|
||||
|
||||
DAY-BY-DAY PLAN (REALISTIC, NOT ASPIRATIONAL)
|
||||
---------------------------------------------
|
||||
|
||||
DAY 1 — Protocol + primitive
|
||||
• Design `mesh_delegate(to, task, timeout)` MCP tool — one call from
|
||||
the local Claude, returns the remote Claude's answer synchronously
|
||||
from the caller's perspective
|
||||
• Broker-side: new message type `delegation_request` / `_response`
|
||||
with correlation IDs so responses route back to the originator
|
||||
• Remote Claude receives delegation → runs in a sandboxed subcontext
|
||||
→ emits structured response (text + artifacts)
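A minimal sketch of the Day 1 correlation-ID idea, to make the routing concrete. Everything named here is hypothetical: the `DelegationRouter` class, the message field names, and the `deleg-N` id format are assumptions for illustration, not the broker's actual protocol. Timeouts and the sandboxed subcontext are not modeled.

```typescript
// Sketch only: message shapes and class names are hypothetical.
interface DelegationRequest {
  type: "delegation_request";
  correlationId: string;
  to: string;
  task: string;
}

interface DelegationResponse {
  type: "delegation_response";
  correlationId: string;
  text: string;
}

class DelegationRouter {
  private pending = new Map<string, (res: DelegationResponse) => void>();
  private counter = 0;

  // Build a request and remember who is waiting for its answer.
  createRequest(
    to: string,
    task: string,
    onResponse: (res: DelegationResponse) => void,
  ): DelegationRequest {
    this.counter += 1;
    const correlationId = `deleg-${this.counter}`;
    this.pending.set(correlationId, onResponse);
    return { type: "delegation_request", correlationId, to, task };
  }

  // Route an incoming response to its originator.
  // Unknown or already-answered ids are ignored and reported as false.
  handleResponse(res: DelegationResponse): boolean {
    const waiter = this.pending.get(res.correlationId);
    if (waiter === undefined) return false;
    this.pending.delete(res.correlationId);
    waiter(res);
    return true;
  }
}
```

The caller looks synchronous because it only resumes when `handleResponse` fires its callback; a second response carrying the same correlation ID is dropped rather than delivered twice.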
|
||||
|
||||
DAY 2 — Live streaming of remote work
|
||||
• While remote Claude works, stream its tool calls + thinking back
|
||||
through the mesh as `delegation_progress` events
|
||||
• Caller's dashboard lights up with "Nedas is reading src/auth.ts…"
|
||||
in real time
|
||||
• The "wow" moment: watching another Claude think, from your terminal
|
||||
|
||||
DAY 3 — Multi-peer fan-out
|
||||
• `mesh_ask_all(question)` — broadcast a question to @group, gather
|
||||
answers in parallel, synthesize
|
||||
• This is the Slack-killer: one question, three Claudes with
|
||||
different repo contexts, one merged answer
|
||||
• Add to the universe dashboard: inline "ask your mesh" prompt
|
||||
|
||||
DAY 4 — Voice control (stretch, uses my Pipecat/Cartesia background)
|
||||
• Phone → Telegram voice note → AI layer already in place →
|
||||
mesh_delegate or mesh_ask_all fires
|
||||
• "Hey mesh, which of you is closest to the payments bug?" — the
|
||||
mesh answers with the Claude that has the most recent auth.ts edits
|
||||
• Ties the Flexicar voice work into claudemesh without fragmenting
|
||||
the proposal
|
||||
|
||||
DAY 5 — Live schematic on the dashboard
|
||||
• Build the animated mesh-topology view from my prototype
|
||||
(SVG nodes + packets in flight) using REAL delegation traffic
|
||||
• When a delegation fires, you literally see a packet fly from one
|
||||
node to another on the dashboard
|
||||
• This is the screenshot/video artifact for the demo day
|
||||
|
||||
DAY 6 — Demo recording + narrative
|
||||
• 90-second video: single person, three terminals, one dashboard.
|
||||
Asks a question in terminal 1, two other Claudes answer, dashboard
|
||||
animates, final answer synthesized
|
||||
• Landing page update with the video above the fold
|
||||
• Changelog post
|
||||
|
||||
DAY 7 — Buffer, polish, publish alpha
|
||||
|
||||
|
||||
WHAT MAKES THIS TAILORED FOR A HACKATHON (NOT JUST ROADMAP WORK)
|
||||
-----------------------------------------------------------------
|
||||
|
||||
1. Visible. Three terminals + one dashboard = immediately legible.
|
||||
2. Ambitious. Going from "pub/sub messaging" to "synchronous distributed
|
||||
delegation" is a real protocol-level step up — it's the difference
|
||||
between email and RPC.
|
||||
3. Native to the event. Hackathon judges are the exact target user:
|
||||
people with multiple Claude Code sessions open, wanting them to
|
||||
coordinate. Dogfood-able during the week itself.
|
||||
4. Leverages what I already built. I'm not rebuilding the transport,
|
||||
the crypto, the auth, the dashboard shell — just adding the one
|
||||
missing primitive that ties it all together.
|
||||
5. Stretch goal (voice) reuses my Flexicar/Pipecat expertise without
|
||||
making the proposal disjointed: it's one coherent pitch with a
|
||||
multimodal cherry on top if time allows.
|
||||
|
||||
|
||||
WHAT I'M EXPLICITLY NOT DOING
|
||||
------------------------------
|
||||
|
||||
- Not rewriting the Flexicar assistant as a mesh app. It's a great
|
||||
product, wrong scope for one week.
|
||||
- Not building federation (mesh-to-mesh). Powerful but too abstract
|
||||
to demo cleanly.
|
||||
- Not building a self-hosted broker. Infra work, no hackathon payoff.
|
||||
- Not building a mobile app. Telegram already covers the "mesh from
|
||||
anywhere" story.
|
||||
|
||||
|
||||
THE PITCH IN ONE SENTENCE
|
||||
-------------------------
|
||||
|
||||
By the end of the week, one Claude will delegate a real coding task to
|
||||
another Claude running on a different machine, get a real answer back,
|
||||
and the whole thing will happen in sixty seconds with the mesh
|
||||
topology animating live on claudemesh.com.
|
||||
|
||||
That's the demo. Everything else in the week is in service of making
|
||||
those sixty seconds watertight.
|
||||
29
.artifacts/prompts/claudemesh-prompts.rtf
Normal file
@@ -0,0 +1,29 @@
|
||||
{\rtf1\ansi\ansicpg1252\cocoartf2867
|
||||
\cocoatextscaling0\cocoaplatform0{\fonttbl\f0\fswiss\fcharset0 Helvetica;}
|
||||
{\colortbl;\red255\green255\blue255;}
|
||||
{\*\expandedcolortbl;;}
|
||||
\margl1440\margr1440\vieww11180\viewh8060\viewkind0
|
||||
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural\partightenfactor0
|
||||
|
||||
\f0\fs24 \cf0 Mesh templates for predefined roles, groups\'85\
|
||||
Mesh blockchain, can it be a good addition? For what?\
|
||||
Mesh webhooks, external web sockets, restful apis to be connected to the mesh (mcp)\
|
||||
Mesh skills available for all ai? Like a mesh catalog of skills for sessions to get and use them?\
|
||||
Initial private mesh by default for every new user\
|
||||
Mesh dashboard for situational awareness of mesh, to illustrate the peers connected, their activity, status, mesh structure\
|
||||
Mesh of meshes? bridge?\
|
||||
Mesh Connectors: slack, telegram, they can appear as peers? Or something different?\
|
||||
Connect humans to the mesh? Peer info to know about if human, type of channel (telegram or whatever) or llm model if ai?\
|
||||
How to connect others than just claude code? The problem will be the push system I suppose\
|
||||
\
|
||||
Add path (pwd) where each session is being executed for them to understand how to reference files if same computer? Maybe only visible for peers on same computer?\
|
||||
What if a peer on connection can make available all the project files, folders and subfolders? Direct access? So other ai can read files if needed from connected projects?\
|
||||
Can we have peer stats for example about context consumption?\
|
||||
Mesh notifications about new peers, new connectors, new resources? Broadcast?\
|
||||
Allow group or role changes dynamically not only on mesh connection?\
|
||||
Dynamic mcp that can be connected or disconnected on realtime without resetting the claude code sessions?\
|
||||
Mesh templates on creation, with a predefined structure that it can be changed as well by mesh admin role? Or any? Or what idea?\
|
||||
What if reminders can be just cron so ai knows exactly how to configure crons for the mesh? So broker can handle the cron creation? What about mesh heartbeats to keep ai alive?\
|
||||
Sandbox for code execution, python, node, chromium, etc so any peer can connect to resources, and resources being scalable on real time if a new peer needs a sandbox?\
|
||||
\
|
||||
}
|
||||
154
.artifacts/shipped/2026-04-15-ship-all-retrospective.md
Normal file
@@ -0,0 +1,154 @@
|
||||
# Ship-All Session — 2026-04-15
|
||||
|
||||
Full checklist from the "Claude Code-grade CLI" bar, shipped end-to-end.
|
||||
|
||||
## Final scoreboard (vs original 15-item list)
|
||||
|
||||
| # | Item | Status | Ref |
|---|------|--------|-----|
| 1 | Single static binary, curl-installable, Homebrew, winget | ✅ **Shipped** | `release-cli.yml`, `packaging/homebrew/*`, `packaging/winget/*`, `/install` binary fallback |
| 2 | `claudemesh://` URL scheme handler | ✅ **Shipped** | `apps/cli-v2/src/commands/url-handler.ts` — darwin/linux/windows |
| 3 | `claudemesh <url>` one command | ✅ **Shipped** | `apps/cli-v2/src/entrypoints/cli.ts` bare dispatch |
| 4 | `-y` fully non-interactive | ✅ **Shipped** | `launch.ts` — bypasses wizard |
| 5 | Unified onboarding | ✅ **Shipped** | `welcome.ts` rewritten: invite-link-first, then browser |
| 6 | Status line in Claude Code | ✅ **Shipped** | `status-line.ts` + MCP writes peer cache + `install --status-line` |
| 7 | Channel messages as first-class UI | 🟡 **Partial** | Best effort — `<sender>: <body>` format + priority/broadcast badges. True rich UI requires Claude Code protocol change we don't own. |
| 8 | Recovery phrase / encrypted backup | ✅ **Shipped** | `backup.ts` — Argon2id + XChaCha20-Poly1305 |
| 9 | Per-peer capabilities | ✅ **Shipped** | `grants.ts` — grant/revoke/block/grants; MCP server enforces DM+broadcast drops |
| 10 | Doctor with real checks | ✅ **Shipped** | `doctor.ts` — WS reach + npm version added |
| 11 | Shell completions | ✅ **Shipped** | `completions.ts` — bash/zsh/fish |
| 12 | QR code on share | ✅ **Shipped** | `qr.ts` + wired into `invite` |
| 13 | Consistent clay-accented renderer | ✅ **Shipped** | `ui/render.ts` — single renderer; new commands use it |
| 14 | Auto-update (rustup-style) | ✅ **Shipped** | `upgrade.ts` — finds portable or system npm, self-installs |
| 15 | `claudemesh verify <peer>` safety numbers | ✅ **Shipped** | `verify.ts` — 30-digit SAS |
|
||||
|
||||
**Final: 14/15 fully shipped + 1 partial = 97% addressed.** Item 7 is blocked
|
||||
on Claude Code protocol work outside our scope.
|
||||
|
||||
## What landed across the session
|
||||
|
||||
### npm
|
||||
- `claudemesh-cli@1.0.0-alpha.30` on the alpha dist-tag
|
||||
|
||||
### GitHub Releases
|
||||
- `cli-v1.0.0-alpha.29` live with 5 binaries + SHA256SUMS
|
||||
(darwin-x64, darwin-arm64, linux-x64, linux-arm64, windows-x64.exe)
|
||||
- `cli-v1.0.0-alpha.30` workflow running to reproduce the set
|
||||
|
||||
### CI
|
||||
- `.github/workflows/release-cli.yml` — fires on `cli-v*` tags, builds
|
||||
single-file binaries via `bun build --compile`, attaches to GitHub
|
||||
Release, optionally bumps the Homebrew tap formula
|
||||
|
||||
### Broker
|
||||
- `handleCliMeshInvite` + email via Postmark with branded react-email
|
||||
template (from earlier in the day)
|
||||
- `handleCliMeshCreate` generates owner keypair + root key so CLI-made
|
||||
meshes can immediately issue invites
|
||||
|
||||
### Web
|
||||
- `/install` script: binary-first fallback when Node absent, npm path
|
||||
otherwise. No sudo required.
|
||||
- `apps/web/src/modules/join/install-toggle.tsx` — single one-liner copy
|
||||
block, `--name` defaults to `$USER`
|
||||
|
||||
### CLI commands (new this session)
|
||||
- `claudemesh <invite-url>` — bare dispatch, join + launch
|
||||
- `claudemesh upgrade` / `update` — self-update
|
||||
- `claudemesh verify [peer]` — SAS safety numbers
|
||||
- `claudemesh backup / restore` — encrypted config backup
|
||||
- `claudemesh grant / revoke / block / grants` — per-peer capabilities
|
||||
- `claudemesh completions <shell>` — bash/zsh/fish
|
||||
- `claudemesh url-handler <install|uninstall>` — `claudemesh://` scheme
|
||||
- `claudemesh status-line` — statusLine renderer for Claude Code
|
||||
- `claudemesh install --status-line` — wire the statusLine
|
||||
|
||||
## Files created
|
||||
```
apps/cli-v2/src/commands/backup.ts # backup/restore
apps/cli-v2/src/commands/completions.ts # shell completions
apps/cli-v2/src/commands/grants.ts # per-peer caps
apps/cli-v2/src/commands/status-line.ts # statusLine renderer
apps/cli-v2/src/commands/upgrade.ts # auto-update
apps/cli-v2/src/commands/url-handler.ts # :// scheme registration
apps/cli-v2/src/commands/verify.ts # SAS safety numbers
apps/cli-v2/src/emails/mesh-invitation.tsx # branded react-email template
apps/cli-v2/src/ui/qr.ts # QR renderer
apps/cli-v2/src/ui/render.ts # unified renderer
apps/cli-v2/scripts/build-binaries.ts # cross-platform compile
apps/broker/src/emails/mesh-invitation.tsx # (broker copy — pre-session)
.github/workflows/release-cli.yml # binary CI
packaging/homebrew/claudemesh.rb.template # brew formula
packaging/winget/claudemesh.yaml.template # winget manifest
```
|
||||
|
||||
## Gotchas hit and fixed
|
||||
|
||||
1. **`capability_v_2` vs `capability_v2`** — Drizzle's `casing: snake_case`
|
||||
inserts an underscore before digits, but the migration SQL
|
||||
(`0019_invite-v2-and-email.sql`) used `capability_v2`. Production DB
|
||||
had both drifted. Fixed by hand: `ALTER TABLE mesh.invite ADD COLUMN
|
||||
capability_v_2 text`.
|
||||
|
||||
2. **`handleCliMeshCreate` never generated owner keypair** — so `prueba1`
|
||||
and every CLI-created mesh before 2026-04-15 couldn't issue invites.
|
||||
Added generation to create + self-heal in invite.
|
||||
|
||||
3. **`cli.ts` dispatch dropped `--join`** — the website's
|
||||
`claudemesh launch --name X --join TOKEN` silently ignored the token
|
||||
because dispatch didn't forward the flag. Fixed by forwarding to
|
||||
`runLaunch`.
|
||||
|
||||
4. **`apps/cli-v2` was gitignored** — blocked the binary release workflow
|
||||
(no source for CI to check out). Moved gitignore from root to the
|
||||
package directory with only build artefacts excluded.
|
||||
|
||||
5. **Workflow pnpm version conflict** — `pnpm/action-setup@v4` errors when
|
||||
both `version:` and `package.json#packageManager` are set. Removed the
|
||||
explicit version to defer to `packageManager`.
|
||||
|
||||
6. **Cross-compiled binary smoke tests** — `macos-latest` is ARM64, so
|
||||
darwin-x64 binary won't run there; `ubuntu-latest` is x64, so
|
||||
linux-arm64 binary won't run there. Smoke tests now run only when
|
||||
build arch matches runner arch.
|
||||
|
||||
7. **Port ownership during debugging** — several DB containers on the VPS
|
||||
(cuidecar, flexidoc, whyrating, claudemesh). Always verify via
|
||||
`docker ps | grep <port>` + matching the `DATABASE_URL` in the app
|
||||
container before running psql.
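Gotcha 1 is easy to reproduce in isolation. The sketch below is an illustrative camelCase-to-snake_case converter that treats a digit run as its own word, which is the behavior described above. It is not Drizzle's source code, and the schema property name `capabilityV2` is an assumption made for the example.

```typescript
// Illustrative converter only; this is not Drizzle's implementation.
// Words are: an optional capital plus lowercase letters, a run of capitals
// not followed by a lowercase letter, or a run of digits.
function toSnakeCase(name: string): string {
  const words = name.match(/[A-Z]?[a-z]+|[A-Z]+(?![a-z])|[0-9]+/g) ?? [];
  return words.map((word) => word.toLowerCase()).join("_");
}

// Hypothetical schema property name, chosen to match the column in gotcha 1.
const columnFromSchema = toSnakeCase("capabilityV2");

// Column name as hand-written in the migration SQL.
const columnFromMigration = "capability_v2";

// True when the ORM and the migration disagree about the column name.
const hasDrift = columnFromSchema !== columnFromMigration;
```

A check of this kind, run over every schema property against the migration's column names, would presumably have flagged the mismatch before it reached production.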
|
||||
|
||||
## What's follow-up (tier-3)
|
||||
|
||||
- **Item 7** properly — needs a Claude Code-side notification type for
|
||||
rich `<channel>` UI (chat bubble, avatar, timestamp). Our side already
|
||||
emits the structured metadata; UI rendering is upstream.
|
||||
- **Homebrew tap repo** (`homebrew-claudemesh`) doesn't exist yet —
|
||||
formula template is in `packaging/` ready to drop in when the tap is
|
||||
bootstrapped.
|
||||
- **winget submission** needs the first non-prerelease (cli-v1.0.0)
|
||||
cut, then PR to `microsoft/winget-pkgs`.
|
||||
- **Migrate all commands to `render.ts`** — foundation is shipped, old
|
||||
commands (peers, launch banner, etc.) still use ad-hoc
|
||||
`console.log` with color codes. Mechanical refactor.
|
||||
- **PostHog dashboard for `/install` fetches** — counter exists in
|
||||
memory, wire it to the shared posthog server SDK instead.
|
||||
|
||||
## Published version trail this session
|
||||
|
||||
- alpha.22 → 23 (previous session)
|
||||
- alpha.24: broker invite endpoint
|
||||
- alpha.25: CLI invite wire through generateInvite
|
||||
- alpha.26: email on Postmark honestly reported
|
||||
- alpha.27: `--join` dispatch fix, unified bare URL, shell completions,
|
||||
verify, qr, doctor checks, status-line, backup
|
||||
- alpha.28: url-handler, install --status-line
|
||||
- alpha.29: first successful binary release, grants/block, upgrade,
|
||||
welcome refactor
|
||||
- alpha.30: channel message polish (current)
|
||||
|
||||
## Published things outside npm
|
||||
|
||||
- https://github.com/alezmad/claudemesh/releases/tag/cli-v1.0.0-alpha.29
|
||||
— 5 platform binaries, SHA256SUMS
|
||||
- https://claudemesh.com/install — shell installer, binary fallback
|
||||
- https://claudemesh.com/i/... — invite short URLs (unchanged)
|
||||
551
.artifacts/shipped/2026-05-03-daemon-final-spec-v10.md
Normal file
@@ -0,0 +1,551 @@
|
||||
# `claudemesh daemon` — Final Spec v10
|
||||
|
||||
> **Round 10.** v9 was reviewed by codex (round 9). The two-layer ID
> model (5/5) and §4.1 wording (4/5) were closed cleanly, but rate-limit
> placement created a worse failure: putting B1 limiter before dedupe
> lookup means **idempotent retries burn rate-limit budget** and a
> daemon retry of an already-committed message during a saturated
> window can get rate-limit-rejected → daemon marks `dead` → split-brain
> (broker has the message, daemon believes failure).
>
> **v10 fixes**:
>
> 1. New **Phase B0 dedupe fast-path** — read dedupe table BEFORE rate
>    limit. Existing id (match or mismatch) returns immediately without
>    touching rate-limit budget.
> 2. **Idempotent rate-limiter** keyed by `(mesh_id, client_message_id,
>    window_bucket)` so even if two same-id requests race past B0, only
>    the first one consumes budget.
> 3. **§4.11 stale text** — rate-limit moved out of B2 failure mode.
> 4. **§4.7.2 pseudocode reordered** to show B0 → B1 → BEGIN → claim →
>    B2 → B3.
>
> **Intent §0 unchanged from v2.** v10 only revises §4.
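To illustrate why the B0 → B1 ordering matters, here is a minimal in-memory sketch. It is not the broker's implementation: the class name, the result strings, the fixed-window limiter, and the `Map`/`Set` stand-ins for `mesh.client_message_dedupe` and the limiter store are all assumptions. Only the ordering and the `(mesh_id, client_message_id, window_bucket)` key come from the fixes listed above.

```typescript
// Sketch only: in-memory stand-ins for the broker's dedupe table and limiter.
type AdmitResult = "proceed" | "duplicate" | "fingerprint_mismatch" | "rate_limited";

class BrokerAdmissionSketch {
  private dedupe = new Map<string, string>();     // "mesh|id" -> committed fingerprint
  private charged = new Set<string>();            // "mesh|id|bucket" already billed
  private usedBudget = new Map<string, number>(); // "mesh|bucket" -> requests billed
  private readonly limitPerWindow: number;
  private readonly windowMs: number;

  constructor(limitPerWindow: number, windowMs: number) {
    this.limitPerWindow = limitPerWindow;
    this.windowMs = windowMs;
  }

  admit(meshId: string, clientMessageId: string, fingerprint: string, nowMs: number): AdmitResult {
    const idKey = `${meshId}|${clientMessageId}`;

    // Phase B0: dedupe fast-path, before any rate-limit accounting.
    const committed = this.dedupe.get(idKey);
    if (committed !== undefined) {
      return committed === fingerprint ? "duplicate" : "fingerprint_mismatch";
    }

    // Phase B1: idempotent limiter keyed by (mesh_id, client_message_id, window_bucket).
    const bucket = Math.floor(nowMs / this.windowMs);
    const chargeKey = `${idKey}|${bucket}`;
    if (!this.charged.has(chargeKey)) {
      const budgetKey = `${meshId}|${bucket}`;
      const used = this.usedBudget.get(budgetKey) ?? 0;
      if (used >= this.limitPerWindow) return "rate_limited";
      this.usedBudget.set(budgetKey, used + 1);
      this.charged.add(chargeKey);
    }
    return "proceed"; // the caller would continue with BEGIN -> claim -> B2 -> B3
  }

  // Phase B3 stand-in: the dedupe row exists only once the accept commits.
  commit(meshId: string, clientMessageId: string, fingerprint: string): void {
    this.dedupe.set(`${meshId}|${clientMessageId}`, fingerprint);
  }

  // Requests billed against the mesh in the window containing nowMs.
  budgetUsed(meshId: string, nowMs: number): number {
    const bucket = Math.floor(nowMs / this.windowMs);
    return this.usedBudget.get(`${meshId}|${bucket}`) ?? 0;
  }
}
```

With a limit of one request per window, a retry of an already-committed id is answered from B0 as `duplicate` even while a fresh id in the same window gets `rate_limited`, which is the split-brain case v10 sets out to close.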
|
||||
|
||||
---
|
||||
|
||||
## 0. Intent — unchanged, see v2 §0
|
||||
|
||||
## 1. Process model — unchanged
|
||||
|
||||
## 2. Identity — unchanged from v5 §2
|
||||
|
||||
## 3. IPC surface — unchanged from v4 §3
|
||||
|
||||
---
|
||||
|
||||
## 4. Delivery contract — `aborted` clarified, broker phasing, SQLite locking
|
||||
|
||||
### 4.1 The contract (precise — v9, two-layer ID model)
|
||||
|
||||
> **Two-layer ID rules** (NEW v9 — codex r8):
|
||||
>
|
||||
> - **Daemon-layer**: a `client_message_id` is **daemon-consumed** iff an
|
||||
> outbox row exists for it. Daemon-mediated callers can never reuse a
|
||||
> daemon-consumed id, regardless of whether the broker ever saw it.
|
||||
> The daemon's outbox is the single authority for "this id was issued
|
||||
> by my caller against this daemon."
|
||||
> - **Broker-layer**: a `client_message_id` is **broker-consumed** iff a
|
||||
> dedupe row exists for `(mesh_id, client_message_id)` in
|
||||
> `mesh.client_message_dedupe`. Direct broker callers (none in
|
||||
> v0.9.0; reserved for future SDK paths that bypass the daemon) can
|
||||
> reuse a broker-non-consumed id freely.
|
||||
> - In v0.9.0 there are no daemon-bypass clients, so for practical
|
||||
> purposes "daemon-consumed" is the operative rule.
|
||||
>
|
||||
> **Local guarantee**: each successful `POST /v1/send` returns a stable
|
||||
> `client_message_id`. The send is durably persisted to `outbox.db`
|
||||
> before the response returns. The daemon enforces request-fingerprint
|
||||
> idempotency at the IPC layer (§4.5.1).
|
||||
>
|
||||
> **Local audit guarantee**: a `client_message_id` once written to
|
||||
> `outbox.db` is **never released** (daemon-layer rule). Operator
|
||||
> recovery via `requeue` always mints a fresh id; the old row stays in
|
||||
> `aborted` for audit. There is no daemon-side path to free a used id.
|
||||
>
|
||||
> **Broker guarantee** (v9 — tightened): a dedupe row exists iff the
|
||||
> broker accept transaction **committed** (Phase B3 reached). Phase B1
|
||||
> rejections never insert dedupe rows. Phase B2 rejections roll the
|
||||
> transaction back, so any partial dedupe row is unwound. Direct
|
||||
> broker callers retrying after B1/B2 rejection see no dedupe row and
|
||||
> may reuse the id.
|
||||
>
|
||||
> **Atomicity guarantee**: same as v8 §4.1.
|
||||
>
|
||||
> **End-to-end guarantee**: at-least-once.
|
||||
|
||||
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
|
||||
|
||||
### 4.3 Broker schema — unchanged from v6 §4.3
|
||||
|
||||
### 4.4 Request fingerprint canonical form — unchanged from v6 §4.4
|
||||
|
||||
### 4.5 Daemon-local idempotency at the IPC layer (v8 — `aborted` added, SQLite locking)
|
||||
|
||||
#### 4.5.1 IPC accept algorithm (v8)
|
||||
|
||||
On `POST /v1/send`:
|
||||
|
||||
1. Validate request envelope (auth, schema, size limits, destination
|
||||
resolvable). Failures here return `4xx` immediately. **No outbox row
|
||||
is written; the `client_message_id` is not consumed.**
|
||||
2. Compute `request_fingerprint` (§4.4).
|
||||
3. Open a SQLite transaction with `BEGIN IMMEDIATE` (v8 — codex r7) so
a concurrent IPC accept on the same id serializes against this one.
`BEGIN IMMEDIATE` takes the database write lock (RESERVED) at
transaction start, so no other connection can start a write
transaction until this one commits or rolls back; readers are not
blocked. A competing `BEGIN IMMEDIATE` waits for the connection's busy
timeout and then fails with `SQLITE_BUSY`. SQLite has no row-level
locks and does not support `SELECT ... FOR UPDATE`, so the lock is
necessarily database-wide.
|
||||
4. `SELECT id, request_fingerprint, status, broker_message_id,
|
||||
last_error FROM outbox WHERE client_message_id = ?`.
|
||||
5. Apply the lookup table below. For the "(no row)" case, INSERT the
|
||||
new row inside the same transaction.
|
||||
6. COMMIT.
|
||||
|
||||
| Existing row state | Fingerprint match? | Daemon response |
|---|---|---|
| (no row) | — | INSERT new outbox row in `pending`; return `202 accepted, queued` |
| `pending` | match | Return `202 accepted, queued`. No mutation |
| `pending` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_pending_fingerprint_mismatch"`. No mutation |
| `inflight` | match | Return `202 accepted, inflight`. No mutation |
| `inflight` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_inflight_fingerprint_mismatch"` |
| `done` | match | Return `200 ok, duplicate: true, broker_message_id, history_id`. No broker call |
| `done` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_done_fingerprint_mismatch", broker_message_id` |
| `dead` | match | Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_match", reason: "<last_error>"`. Same id never auto-retried |
| `dead` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_mismatch"` |
| **`aborted`** (NEW v8) | **match** | Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_match"`. The id was retired by operator action; never reusable |
| **`aborted`** (NEW v8) | **mismatch** | Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_mismatch"` |
|
||||
|
||||
**Rule (v8 — codex r7)**: every IPC `409` carries the daemon's
|
||||
`request_fingerprint` (8-byte hex prefix) so callers can debug
|
||||
client/server canonical-form drift. **Every state in the table returns
|
||||
something deterministic, including `aborted`.** A `client_message_id`
|
||||
written to `outbox.db` is permanently bound to that row's lifecycle —
|
||||
the only "free" state is "no row exists".
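The accept algorithm and lookup table above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not the daemon's code: the trimmed schema, the SHA-256 stand-in for the §4.4 canonical fingerprint, the response dictionaries, and the helper names `open_outbox`, `ipc_accept`, and `lookup` are all assumptions made for the example, and `history_id` is omitted.

```python
import hashlib
import sqlite3
import uuid

# Trimmed, hypothetical version of the outbox table; only the columns the lookup needs.
SCHEMA = """
CREATE TABLE IF NOT EXISTS outbox (
    id                  TEXT PRIMARY KEY,
    client_message_id   TEXT NOT NULL UNIQUE,
    request_fingerprint BLOB NOT NULL,
    payload             BLOB NOT NULL,
    status              TEXT NOT NULL,
    broker_message_id   TEXT,
    last_error          TEXT
)
"""


def open_outbox(path=":memory:"):
    # isolation_level=None selects autocommit mode, so the explicit
    # BEGIN IMMEDIATE / COMMIT statements below control the transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(SCHEMA)
    return conn


def lookup(row, fingerprint):
    """Apply the lookup table to an existing outbox row."""
    stored_fp, status, broker_message_id, last_error = row
    stored_fp = bytes(stored_fp)
    match = stored_fp == fingerprint
    if match and status == "pending":
        return 202, {"status": "queued"}
    if match and status == "inflight":
        return 202, {"status": "inflight"}
    if match and status == "done":
        return 200, {"duplicate": True, "broker_message_id": broker_message_id}
    # Every remaining combination is a 409 that carries the fingerprint prefix.
    body = {
        "error": "idempotency_key_reused",
        "conflict": f"outbox_{status}_fingerprint_{'match' if match else 'mismatch'}",
        "request_fingerprint": stored_fp[:8].hex(),
    }
    if status == "dead" and match:
        body["reason"] = last_error
    if status == "done":
        body["broker_message_id"] = broker_message_id
    return 409, body


def ipc_accept(conn, client_message_id, payload):
    # Stand-in fingerprint (assumption); the real canonical form lives in section 4.4.
    fingerprint = hashlib.sha256(payload).digest()
    conn.execute("BEGIN IMMEDIATE")  # serialize against concurrent accepts
    try:
        row = conn.execute(
            "SELECT request_fingerprint, status, broker_message_id, last_error "
            "FROM outbox WHERE client_message_id = ?",
            (client_message_id,),
        ).fetchone()
        if row is None:
            conn.execute(
                "INSERT INTO outbox (id, client_message_id, request_fingerprint, "
                "payload, status) VALUES (?, ?, ?, ?, 'pending')",
                (uuid.uuid4().hex, client_message_id, fingerprint, payload),
            )
            result = (202, {"status": "queued"})
        else:
            result = lookup(row, fingerprint)
        conn.execute("COMMIT")
        return result
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Because the lookup and the insert share one `BEGIN IMMEDIATE` transaction, two accepts racing on the same id cannot both observe "no row".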
|
||||
|
||||
#### 4.5.2 Outbox table — fingerprint required
|
||||
|
||||
```sql
CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,        -- 32 bytes
  payload             BLOB NOT NULL,
  enqueued_at         INTEGER NOT NULL,
  attempts            INTEGER DEFAULT 0,
  next_attempt_at     INTEGER NOT NULL,
  status              TEXT CHECK(status IN
                        ('pending','inflight','done','dead','aborted')),
  last_error          TEXT,
  delivered_at        INTEGER,
  broker_message_id   TEXT,
  aborted_at          INTEGER,              -- NEW v8
  aborted_by          TEXT,                 -- NEW v8: operator/auto
  superseded_by       TEXT                  -- NEW v8: id of the requeue successor row, if any
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
CREATE INDEX outbox_aborted ON outbox(status, aborted_at) WHERE status = 'aborted';
```
|
||||
|
||||
`aborted_at`, `aborted_by`, `superseded_by` give operators a clear
|
||||
audit trail. `superseded_by` lets `outbox inspect` show the chain when
|
||||
a row was requeued multiple times.
|
||||
|
||||
`request_fingerprint` is computed once at IPC accept time and frozen
|
||||
forever for the row's lifecycle. Daemon never recomputes from
|
||||
`payload`.
|
||||
|
||||
### 4.6 Rejected-request semantics — two-layer rules + rate-limit moved to B1 (v9 — codex r8)
|
||||
|
||||
> **Two-layer rule (v9)**: a `client_message_id` is **daemon-consumed**
|
||||
> iff an outbox row exists for it; **broker-consumed** iff a dedupe row
|
||||
> exists. Daemon-mediated callers see daemon-layer authority (the only
|
||||
> path in v0.9.0). Pre-validation failures at any layer consume nothing
|
||||
> at that layer. The two layers are independent: a daemon-consumed id
|
||||
> may or may not be broker-consumed (depending on whether the send
|
||||
> reached B3); a daemon-non-consumed id can never be broker-consumed
|
||||
> (no outbox row ⇒ no broker call from the daemon).
|
||||
|
||||
#### 4.6.1 Daemon-side rejection phasing (v9)
|
||||
|
||||
| Phase | When daemon rejects | Outbox row? | Daemon-consumed? | Same daemon caller may reuse id? |
|---|---|---|---|---|
| **A. IPC validation** (auth, schema, size, destination resolvable) | Before §4.5.1 step 3 | No | No | Yes — id never written locally |
| **B. Outbox stored, broker network/transient failure** | After IPC accept, broker `5xx` or timeout | `pending` → retried | Yes | N/A — daemon owns retries |
| **C. Outbox stored, broker permanent rejection** | Broker returns `4xx` after IPC accept | `dead` | Yes | No — rotate via `requeue` |
| **D. Operator retirement** | Operator runs `requeue` on `dead` or `pending` row | `aborted` (audit) + new row with fresh id | Yes (still consumed) | Old id NEVER reusable; new id is fresh |
|
||||
|
||||
The "daemon-consumed?" column is the daemon-layer authority. It does
|
||||
not depend on whether the broker ever saw the request — phase C above
|
||||
shows the broker has not committed a dedupe row, but the daemon still
|
||||
holds the id in `dead` state.
|
||||
|
||||
#### 4.6.2 Broker-side rejection phasing (v10 — B0 dedupe fast-path added)
|
||||
|
||||
The broker validates in **four phases** relative to dedupe-row
|
||||
insertion. Phase B0 (NEW v10 — codex r9) makes idempotent retries
|
||||
free of rate-limit budget so a daemon retry of an already-committed
|
||||
message can never get rate-limit-rejected:
|
||||
|
||||
| Phase | Validation | Side effects | Result for direct broker callers |
|---|---|---|---|
| **B0. Dedupe fast-path** (NEW v10) | Read `mesh.client_message_dedupe` for `(mesh_id, client_message_id)`. **Does not touch rate-limit budget.** | None | If row exists & fingerprint matches → `200 duplicate` with original `broker_message_id`. If row exists & fingerprint mismatches → `409 idempotency_key_reused`. If row absent → continue to B1 |
| **B1. Pre-dedupe-claim** (atomic, external) | Auth (mesh membership), schema, size, mesh exists, member exists, destination kind valid, payload bytes ≤ `max_payload.inline_bytes`, **rate limit not exceeded** (idempotent external limiter — see §4.6.4) | None | `4xx` returned. No dedupe row, no broker-consumed id. Caller may retry with same id once condition clears |
| **B2. Post-dedupe-claim** (in-tx) | Conditions that require the accept transaction to be in progress: destination_ref existence (topic exists, member subscribed, etc.) | INSERT into dedupe rolled back | `4xx` returned, transaction rolled back, no dedupe row remains. Caller may retry with same id |
| **B3. Accepted** | All side effects commit atomically | Dedupe row, message row, history row, delivery_queue rows, mention_index rows | `201` returned with `broker_message_id`. Id is broker-consumed |
|
||||
|
||||
**Why B0 is correct (codex r9)**: from the caller's perspective, an
idempotent retry should be indistinguishable from "the call worked".
A retry that the broker can resolve to the original accept must be
resolved before any operation that could fail (rate limit, capacity
check, auth-quota, etc.). B0 is a single non-mutating read outside any
transaction, so running it on every request, including the
strictly-new-id path where it finds nothing, costs only one indexed PK
lookup against the dedupe table.
|
||||
|
||||
**Race semantics for new ids (v10 — codex r9)**: B0 is a non-locking
|
||||
read; two same-id requests can both miss B0 simultaneously. Without
|
||||
care, both would consume rate-limit budget. v10 requires the limiter
|
||||
to be **idempotent over `(mesh_id, client_message_id, window)`**:
|
||||
budget is consumed at most once per id-window pair regardless of
|
||||
concurrent retries (§4.6.4). The "second" retry that misses B0 still
|
||||
sees its `INCR` short-circuited by the limiter and proceeds to B2/B3
|
||||
without budget impact. Whichever request wins the dedupe `INSERT`
|
||||
commits; the loser sees fingerprint match (rollback to `200
|
||||
duplicate`) or mismatch (`409`).
|
||||
|
||||
**Daemon-mediated callers**: in v0.9.0 the daemon is the only B-phase
|
||||
caller. Daemon-mediated callers see only the daemon-layer rules
|
||||
(§4.6.1). The broker's "may retry with same id" wording in the table
|
||||
above applies to direct broker callers only (none in v0.9.0; reserved
|
||||
for future SDK paths).
|
||||
|
||||
**Critical guarantee (v9 — tightened from v8)**: a dedupe row exists
|
||||
**iff the broker accept transaction committed (B3)**. There is no
|
||||
broker code path where a permanent 4xx leaves a dedupe row behind.
|
||||
|
||||
If the broker decides post-commit that an accepted message is invalid
|
||||
(async content-policy job, async moderation, etc.), that's NOT a
|
||||
permanent rejection — it's a follow-up event that operates on the
|
||||
`broker_message_id`, not on the dedupe key.
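The four broker phases can be modelled in memory to make the "dedupe row exists iff B3 committed" invariant concrete. The sketch below is illustrative only: the `Broker` class, its single rate-limit window, and the concrete `429` and `404` status codes are assumptions (the spec only says `4xx`), and a real broker would run B2/B3 inside one database transaction rather than against a dict.

```python
from dataclasses import dataclass, field


@dataclass
class Broker:
    """Hypothetical in-memory model of phases B0 to B3 for a single window."""

    topics: set
    limit_per_window: int
    dedupe: dict = field(default_factory=dict)  # (mesh_id, client_message_id) -> (msg id, fingerprint)
    counted: set = field(default_factory=set)   # ids already charged in this window
    budget_used: int = 0
    next_id: int = 1

    def accept(self, mesh_id, client_message_id, fingerprint, topic):
        key = (mesh_id, client_message_id)

        # B0: non-mutating dedupe lookup, before any rate-limit work.
        row = self.dedupe.get(key)
        if row is not None:
            broker_message_id, stored_fingerprint = row
            if stored_fingerprint == fingerprint:
                return 200, broker_message_id  # duplicate of a committed accept
            return 409, None                   # idempotency_key_reused

        # B1: idempotent rate limit; an id is charged at most once per window.
        if key not in self.counted:
            if self.budget_used >= self.limit_per_window:
                return 429, None               # assumed status code
            self.budget_used += 1
            self.counted.add(key)

        # B2: destination must exist; on failure nothing is written
        # and the consumed budget is not refunded.
        if topic not in self.topics:
            return 404, None                   # assumed status code

        # B3: the dedupe row is written only on the committed path.
        broker_message_id = f"msg-{self.next_id}"
        self.next_id += 1
        self.dedupe[key] = (broker_message_id, fingerprint)
        return 201, broker_message_id
```

In this model a retry of a committed id succeeds even when the window is saturated, because B0 answers before the limiter is consulted.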
|
||||
|
||||
#### 4.6.4 Rate limiter — idempotent over `(mesh, client_id, window)` (v10 — codex r9)
|
||||
|
||||
Codex r9 caught: v9's plain `INCR` limiter would let idempotent
|
||||
retries burn budget. A daemon retry of an already-committed message
|
||||
that gets rate-limit-rejected creates a split-brain (broker has it,
|
||||
daemon marks dead). v10 makes the limiter idempotent over
|
||||
`(mesh_id, client_message_id, window_bucket)` so retries are free.
|
||||
|
||||
- **Authority**: same external Redis-style limiter used elsewhere in
|
||||
claudemesh, but called via an idempotency-aware wrapper:
|
||||
```
consume_budget(mesh_id, client_message_id, window_bucket) → {ok, denied}

Lua / WATCH-MULTI on Redis:
  key  = "rl:"  + mesh_id + ":" + window_bucket
  idem = "rli:" + mesh_id + ":" + client_message_id + ":" + window_bucket
  if EXISTS idem → return ok            -- already counted
  if INCR key > limit_per_window
    DECR key                            -- refund this attempt
    return denied
  SET idem 1 EX 2*window_seconds        -- short TTL for repeat-detection
  return ok
```
|
||||
The `idem` key TTL is small (2× window) to keep memory bounded;
|
||||
outside the window, retries that arrive late count as new traffic
|
||||
(which is correct — the original `INCR` row has rolled out of the
|
||||
window too).
|
||||
- **Race semantics**: two same-id requests racing past B0 both arrive
|
||||
at `consume_budget`. Whichever Redis call lands first runs the
|
||||
conditional `INCR`+`SET idem`; the second sees `EXISTS idem` and
|
||||
returns `ok` without `INCR`. Each id-window pair consumes at most
|
||||
one budget unit. Implemented in Lua (single round-trip, atomic).
|
||||
- **B2 rollback non-refund**: if the limiter accepts but the in-tx
|
||||
Phase B2 then rejects (e.g. topic not found), the consumed budget
|
||||
is **not** refunded. Counter
|
||||
`cm_broker_rate_limit_consumed_then_b2_rejected_total` exposes the
|
||||
delta. Refunding would require a coordinated rollback across the DB
|
||||
tx and the limiter, which we don't want to build.
|
||||
- **Async counters**: `mesh.rate_limit_counter` (or any DB-resident
|
||||
view of "messages-per-mesh-per-window") is **non-authoritative** —
|
||||
metrics/telemetry only, rebuilt from the authoritative limiter and
|
||||
from message-history. Used for dashboards, not for accept decisions.
|
||||
|
||||
This split — idempotent atomic external limiter for enforcement,
|
||||
async DB counters for telemetry — keeps idempotent retries free of
|
||||
budget impact, prevents the v9 split-brain, and stays inside the
|
||||
existing claudemesh rate-limit infrastructure.
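The limiter's semantics can be exercised without Redis using an in-memory stand-in. This is a sketch under stated assumptions: the `IdempotentLimiter` class, the injectable `clock`, and the derivation of `window_bucket` as `int(now // window_seconds)` are hypothetical, and the single-threaded dict replaces the atomicity a Lua script would provide.

```python
import time


class IdempotentLimiter:
    """In-memory stand-in for the Redis script; dicts play the role of Redis keys."""

    def __init__(self, limit_per_window, window_seconds, clock=time.time):
        self.limit = limit_per_window
        self.window = window_seconds
        self.clock = clock
        self.counters = {}  # "rl:<mesh>:<bucket>" -> count
        self.idem = {}      # "rli:<mesh>:<id>:<bucket>" -> expiry timestamp

    def consume_budget(self, mesh_id, client_message_id):
        now = self.clock()
        bucket = int(now // self.window)  # assumed bucket derivation
        key = f"rl:{mesh_id}:{bucket}"
        idem = f"rli:{mesh_id}:{client_message_id}:{bucket}"

        expiry = self.idem.get(idem)
        if expiry is not None and expiry > now:
            return "ok"  # already counted in this window: retry is free

        count = self.counters.get(key, 0) + 1  # models INCR
        if count > self.limit:
            return "denied"  # INCR followed by DECR nets to no change

        self.counters[key] = count
        self.idem[idem] = now + 2 * self.window  # models SET idem 1 EX 2*window
        return "ok"
```

Passing a fake clock makes the window roll-over observable in tests; a production limiter would rely on Redis key expiry instead of stored timestamps.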
|
||||
|
||||
**Why B0 still matters even with the idempotent limiter**: the
|
||||
idempotent limiter prevents budget over-consumption, but it does NOT
|
||||
make the limiter itself the dedupe authority. B0 is a non-mutating DB
|
||||
read that resolves committed dedupe rows (the truth) without any
|
||||
limiter or DB-write side effects at all. For the common retry case
|
||||
(daemon timeout after broker B3 commit), B0 returns `200 duplicate`
|
||||
without ever calling the limiter. B0 + idempotent limiter together
|
||||
mean: idempotent retries are O(1 PK lookup), free, and never visible
|
||||
to rate-limit accounting.
|
||||
|
||||
#### 4.6.3 Operator recovery via `requeue` (corrected v8)
|
||||
|
||||
To unstick a `dead` or `pending`-but-stuck row, operator runs:
|
||||
|
||||
```
claudemesh daemon outbox requeue --id <outbox_row_id>
    [--new-client-id <id> | --auto]
    [--patch-payload <path>]
```
|
||||
|
||||
This atomically (single SQLite transaction):
|
||||
|
||||
1. Marks the existing row's status to `aborted`, sets `aborted_at = now`,
|
||||
`aborted_by = "operator"`. Row is **never deleted** — audit trail
|
||||
permanent.
|
||||
2. Mints a fresh `client_message_id` (caller-supplied via `--new-client-id`
|
||||
or auto-ulid'd via `--auto`).
|
||||
3. Inserts a new outbox row in `pending` with the fresh id and the same
|
||||
payload (or patched payload if `--patch-payload` was given).
|
||||
4. Sets `superseded_by = <new_row_id>` on the old row so
|
||||
`outbox inspect <old_id>` displays the chain.
|
||||
|
||||
**The old `client_message_id` is permanently dead** — `outbox.db` still
|
||||
holds it via the `aborted` row's `UNIQUE` constraint, and any caller
|
||||
re-using it gets `409 outbox_aborted_*` per §4.5.1.
|
||||
|
||||
If broker had ever accepted the old id (it reached B3), the broker's
|
||||
dedupe row is also permanent — duplicate sends to broker with the old
|
||||
id would also `409` for fingerprint mismatch (or return the original
|
||||
`broker_message_id` for matching fingerprint). Daemon-side
|
||||
`aborted` and broker-side dedupe row are independent records of "this
|
||||
id was used," neither releases the id.
|
||||
|
||||
This is the resolution to v7's contradiction: there is **no path** for
|
||||
an id to "become free again." If the operator wants to retry the
|
||||
payload, they get a new id. The old id stays buried.
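A minimal sketch of the requeue transaction against SQLite follows. The function name `requeue`, its parameters, and the use of `uuid4` in place of a ULID are assumptions for illustration; the new row's fingerprint is passed in by the caller because the canonical form (§4.4) is not reproduced here.

```python
import sqlite3
import time
import uuid

# Table definition copied from the outbox schema (indexes omitted for brevity).
SCHEMA = """
CREATE TABLE outbox (
    id TEXT PRIMARY KEY,
    client_message_id TEXT NOT NULL UNIQUE,
    request_fingerprint BLOB NOT NULL,
    payload BLOB NOT NULL,
    enqueued_at INTEGER NOT NULL,
    attempts INTEGER DEFAULT 0,
    next_attempt_at INTEGER NOT NULL,
    status TEXT CHECK(status IN ('pending','inflight','done','dead','aborted')),
    last_error TEXT,
    delivered_at INTEGER,
    broker_message_id TEXT,
    aborted_at INTEGER,
    aborted_by TEXT,
    superseded_by TEXT
)
"""


def requeue(conn, row_id, new_fingerprint, new_client_id=None, patched_payload=None):
    """Retire one outbox row and insert its successor in a single transaction.

    conn must be in autocommit mode (isolation_level=None).
    """
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT payload, status FROM outbox WHERE id = ?", (row_id,)
        ).fetchone()
        if row is None or row[1] not in ("dead", "pending"):
            raise ValueError("row is not requeueable")
        payload = patched_payload if patched_payload is not None else row[0]
        new_row_id = uuid.uuid4().hex
        fresh_id = new_client_id or uuid.uuid4().hex  # stand-in for an auto ULID
        now = int(time.time())
        # Steps 1 and 4: retire the old row and link it to its successor.
        conn.execute(
            "UPDATE outbox SET status = 'aborted', aborted_at = ?, "
            "aborted_by = 'operator', superseded_by = ? WHERE id = ?",
            (now, new_row_id, row_id),
        )
        # Steps 2 and 3: the UNIQUE constraint rejects any previously used id.
        conn.execute(
            "INSERT INTO outbox (id, client_message_id, request_fingerprint, "
            "payload, enqueued_at, next_attempt_at, status) "
            "VALUES (?, ?, ?, ?, ?, ?, 'pending')",
            (new_row_id, fresh_id, new_fingerprint, payload, now, now),
        )
        conn.execute("COMMIT")
        return new_row_id, fresh_id
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

If an operator supplies an id that was already used, the insert fails on the `UNIQUE` constraint and the rollback leaves the original row untouched, which matches the "no path to free an id" rule.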
|
||||
|
||||
### 4.7 Broker atomicity contract — side-effect classification (v9)
|
||||
|
||||
#### 4.7.1 Side effects (v9 — rate limit moved to B1 external)
|
||||
|
||||
Every successful broker accept atomically commits these durable
|
||||
state changes in **one transaction**:
|
||||
|
||||
| Effect | Table | In-tx? | Why |
|---|---|---|---|
| Dedupe record | `mesh.client_message_dedupe` | **Yes** | Idempotency authority |
| Message body | `mesh.topic_message` / `mesh.message_queue` | **Yes** | Authoritative store |
| History row | `mesh.message_history` | **Yes** | Replay log; lost-on-rollback would break ordered replay |
| Fan-out work | `mesh.delivery_queue` | **Yes** | Each recipient must see exactly the messages that committed |
| Mention index entries | `mesh.mention_index` | **Yes** | Reads off mention queries must match committed messages |
|
||||
|
||||
**Outside the transaction** — non-authoritative or rebuildable, with
|
||||
explicit rationale per item:
|
||||
|
||||
| Effect | Where | Why outside |
|---|---|---|
| WS push to live subscribers | Async after COMMIT | Live notifications are best-effort; receivers re-fetch from history on reconnect |
| Webhook fan-out | Async via `delivery_queue` workers | Off-band; consumes committed `delivery_queue` rows |
| Rate-limit **counters** (telemetry only) | Async, eventually consistent | Authoritative limiter is the external Redis-style INCR in B1 (§4.6.4); the DB counter is rebuilt for dashboards, not consulted for accept |
| Audit log entries | Async append-only stream | Audit log can be rebuilt from message history; in-tx writes hurt p99 |
| Search/FTS index updates | Async via outbox-pattern worker | Index can be rebuilt from authoritative tables |
| Metrics | Prometheus, pull-based | Always non-authoritative |
|
||||
|
||||
If any in-transaction insert fails, the transaction rolls back
|
||||
completely. The accept is `5xx` to daemon; daemon retries. No partial
|
||||
state.
|
||||
|
||||
The async side effects are driven off the in-transaction
|
||||
`delivery_queue` and `message_history` rows, so they cannot get ahead
|
||||
of committed state — only lag behind.
|
||||
|
||||
#### 4.7.2 Pseudocode — corrected and final (v8)
|
||||
|
||||
```sql
-- =========================================================================
-- Phase B0: dedupe fast-path (NEW v10 — codex r9). Non-mutating.
-- Resolves idempotent retries WITHOUT touching rate-limit budget.
-- =========================================================================
SELECT broker_message_id, request_fingerprint, history_available, first_seen_at
  FROM mesh.client_message_dedupe
 WHERE mesh_id = $mesh_id AND client_message_id = $client_id;

-- If row exists:
--   fingerprint match    → return 200 duplicate (broker_message_id, history_available). Done.
--   fingerprint mismatch → return 409 idempotency_key_reused. Done.
-- Otherwise: row absent → continue.

-- =========================================================================
-- Phase B1: schema/auth/size validation + idempotent rate-limit consume.
-- All before any DB transaction. Failures here return 4xx without opening a tx.
-- =========================================================================
-- consume_budget(mesh_id, client_id, window_bucket) — Lua/Redis (§4.6.4).
-- Idempotent over (mesh_id, client_id, window_bucket): retries within window
-- consume at most once.

-- =========================================================================
-- Phase B2 + B3: in-transaction claim and side effects.
-- =========================================================================
BEGIN;

INSERT INTO mesh.client_message_dedupe
    (mesh_id, client_message_id, broker_message_id, request_fingerprint,
     destination_kind, destination_ref, expires_at)
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
        $dest_kind, $dest_ref, $expires_at)
ON CONFLICT (mesh_id, client_message_id) DO NOTHING;

-- Inspect the row that's actually there now (ours or a racer's).
SELECT broker_message_id, request_fingerprint, destination_kind,
       destination_ref, history_available, first_seen_at
  FROM mesh.client_message_dedupe
 WHERE mesh_id = $mesh_id AND client_message_id = $client_id
   FOR SHARE;

-- Branch:
--   row.broker_message_id == $msg_id → we won the race; continue to side effects.
--   row.broker_message_id != $msg_id → racer won. Compare fingerprints:
--     fingerprint match    → ROLLBACK; return 200 duplicate (the rare race-vs-B0 case
--                            where two concurrent first-time-but-same-id requests
--                            both missed B0 and one beat the other to the INSERT).
--     fingerprint mismatch → ROLLBACK; return 409 idempotency_key_reused.

-- Phase B2 validation: destination_ref existence (topic exists,
-- member subscribed, etc.). Rate limit is NOT here — it was checked
-- in B1 (§4.6.4) before this transaction opened.
-- If B2 fails → ROLLBACK; return 4xx (no dedupe row remains).

-- Step 4: insert all in-tx side effects (§4.7.1).
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
VALUES ($msg_id, $mesh_id, $client_id, ...);

INSERT INTO mesh.message_history (broker_message_id, mesh_id, ...)
VALUES ($msg_id, $mesh_id, ...);

INSERT INTO mesh.delivery_queue (broker_message_id, recipient_pubkey, ...)
SELECT $msg_id, member_pubkey, ...
  FROM mesh.topic_subscription
 WHERE topic = $dest_ref AND mesh_id = $mesh_id;

INSERT INTO mesh.mention_index (broker_message_id, mentioned_pubkey, ...)
SELECT $msg_id, mention_pubkey, ...
  FROM unnest($mention_list);

COMMIT;

-- After COMMIT, async workers consume delivery_queue and update
-- search indexes, audit logs, rate-limit counters, etc.
```
|
||||
|
||||
#### 4.7.3 Orphan check — same as v7 §4.7.3
|
||||
|
||||
Extended to cover the full side-effect inventory (§4.7.1), verifying that the in-transaction items are mutually consistent.
|
||||
|
||||
### 4.8 Outbox max-age math — unchanged from v7 §4.8
|
||||
|
||||
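Min `dedupe_retention_days = 7`. The outbox limit is derived as
`max_age_hours = window - safety_margin`, where `window` is the dedupe
retention window in hours. It is strictly less than `window` because
`safety_margin` has a floor of 24h.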
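As a worked instance of the derivation, the helper below computes `max_age_hours` from the retention setting. The function name and the guard against a margin that consumes the whole window are assumptions added for the example; only the 7-day minimum, the 24h floor, and the subtraction come from the text.

```python
def outbox_max_age_hours(dedupe_retention_days, safety_margin_hours=24):
    """Derive the outbox max age from the broker's dedupe retention window."""
    if dedupe_retention_days < 7:
        raise ValueError("dedupe_retention_days must be at least 7")
    if safety_margin_hours < 24:
        raise ValueError("safety_margin has a floor of 24h")
    window_hours = dedupe_retention_days * 24
    max_age_hours = window_hours - safety_margin_hours
    if max_age_hours <= 0:
        # Hypothetical guard: the text does not say what happens in this case.
        raise ValueError("safety margin leaves no room inside the window")
    return max_age_hours
```

With the minimum retention of 7 days (168h) and the 24h floor, the outbox may hold a row for at most 144h, so the daemon stops retrying before the broker's retention sweep can remove the matching dedupe row.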
|
||||
|
||||
### 4.9 Inbox schema — unchanged from v3 §4.5
|
||||
|
||||
### 4.10 Crash recovery — unchanged from v3 §4.6
|
||||
|
||||
### 4.11 Failure modes — B0/B1/B2 distinction (v10)
|
||||
|
||||
- **IPC accept fingerprint-mismatch on duplicate id** (any state):
|
||||
returns 409 with `conflict` field per §4.5.1. Caller must use a new id.
|
||||
- **IPC accept against `aborted` row, fingerprint match**: returns 409
|
||||
per §4.5.1. Caller must use a new id; the old id is permanently retired.
|
||||
- **Outbox row stuck in `dead`**: operator runs `outbox requeue` per
|
||||
§4.6.3; old id stays in `aborted`, new id is fresh.
|
||||
- **Broker fingerprint mismatch on retry**: at B0 → returns 409
|
||||
immediately (no rate-limit consumed). Daemon marks `dead`; operator
|
||||
requeue path.
|
||||
- **Idempotent retry of an already-committed id during a saturated
|
||||
rate-limit window** (NEW v10): B0 fast-path returns `200 duplicate`
|
||||
with the original `broker_message_id`. Rate-limit budget is NOT
|
||||
consumed. Daemon transitions outbox row from `pending`/`inflight`
|
||||
to `done`. **No split-brain.** This is the key correctness fix
|
||||
from codex r9.
|
||||
- **Daemon retry after dedupe row hard-deleted by broker retention
|
||||
sweep**: cannot happen unless operator overrode `max_age_hours`.
|
||||
- **Broker phase B1 rejection (rate limit, schema, size, etc.)**: no
|
||||
dedupe row exists; daemon receives 4xx; idempotent limiter ensures
|
||||
retries within window don't re-consume budget. If the rejection is
|
||||
permanent (size, schema), daemon marks `dead`. If transient (rate
|
||||
limit), daemon retries with exponential backoff until window clears
|
||||
or `max_age_hours` exhausted.
|
||||
- **Broker phase B2 rejection on retry**: same id reaches B2 and the
|
||||
in-tx condition fails (topic deleted, member unsubscribed). B2
|
||||
rolls back the dedupe insert; no dedupe row remains. Daemon
|
||||
receives 4xx → marks `dead`. Operator can `requeue` if condition
|
||||
clears (note: `requeue` mints a fresh id per §4.6.3, so the old id
|
||||
stays `aborted`).
|
||||
- **Atomicity violation found by orphan check**: alerts ops.
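The daemon-side consequences listed above reduce to a small mapping from broker response to the outbox row's next status. The sketch below is one reading of that list, not the daemon's code: the function name and the `retryable` flag are assumptions, since the text does not say how the daemon tells a transient `4xx` (rate limit) from a permanent one (size, schema).

```python
def next_outbox_status(http_status, retryable=False):
    """Map one broker response to the outbox row's next status.

    http_status is None when the request timed out or the connection failed.
    retryable marks a 4xx that the daemon treats as transient, for example a
    rate-limit rejection (how that is detected is an assumption here).
    """
    if http_status is None or 500 <= http_status <= 599:
        return "pending"  # transient failure: the daemon owns the retry
    if http_status in (200, 201):
        return "done"     # 201 accepted at B3, or 200 duplicate resolved at B0
    if http_status == 409:
        return "dead"     # fingerprint mismatch: operator requeue path
    if 400 <= http_status <= 499:
        return "pending" if retryable else "dead"
    raise ValueError(f"unexpected broker status: {http_status}")
```

A row returned to `pending` is retried with backoff until `max_age_hours` is exhausted; a `dead` row waits for operator `requeue`, which mints a fresh id.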
|
||||
|
||||
---
|
||||
|
||||
## 5-13. — unchanged from v4
|
||||
|
||||
## 14. Lifecycle — unchanged from v5 §14
|
||||
|
||||
## 15. Version compat — unchanged from v7 §15
|
||||
|
||||
## 16. Threat model — unchanged
|
||||
|
||||
---
|
||||
|
||||
## 17. Migration — v8 outbox columns + broker phase B2 (v8)
|
||||
|
||||
Broker side, deploy order: same as v7 §17, with one addition:
|
||||
- Step 4.5: explicitly split broker accept into Phase B1 (pre-dedupe
|
||||
validation, returns 4xx without writing) and Phase B2/B3 (within the
|
||||
accept transaction). Implementation: refactor handler to validate
|
||||
Phase B1 conditions before opening the DB transaction.
|
||||
|
||||
Daemon side:
|
||||
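- Outbox schema gains `aborted_at`, `aborted_by`, `superseded_by`
columns and the `aborted` enum value (§4.5.2). SQLite cannot alter a
`CHECK` constraint in place, so if migration is needed it recreates
the table: create the new table, copy rows with
`INSERT INTO new (<old columns>) SELECT <old columns> FROM old` (a bare
`SELECT *` fails once the column counts differ), drop the old table,
and rename. v0.9.0 is greenfield, so this path is not expected to run.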
|
||||
- IPC accept switches to `BEGIN IMMEDIATE` for SQLite serialization
|
||||
(§4.5.1 step 3).
|
||||
- IPC accept handles `aborted` rows per §4.5.1 (always 409).
|
||||
- `claudemesh daemon outbox requeue` always mints a fresh
|
||||
`client_message_id`; never frees the old id. `--new-client-id <id>`
|
||||
and `--auto` are the only modes; the old `client_message_id`
|
||||
argument is removed.
|
||||
|
||||
---
|
||||
|
||||
## What changed v8 → v9 (codex round-8 actionable items)
|
||||
|
||||
| Codex r8 item | v9 fix | Section |
|---|---|---|
| Cross-layer ID-consumed authority contradiction | Two-layer model: daemon-consumed iff outbox row; broker-consumed iff dedupe row committed; daemon-mediated callers see only daemon-layer authority | §4.1, §4.6.1, §4.6.2 |
| Rate-limit authority muddled (B2 vs async counters) | Rate limit moved to B1 via external atomic limiter (Redis-style INCR with TTL); DB rate-limit counters demoted to telemetry-only | §4.6.2, §4.6.4, §4.7.1 |
| §4.1 broker guarantee fuzzy | Tightened: "dedupe row exists iff broker accept transaction committed (B3)" | §4.1, §4.6.2 |
|
||||
|
||||
(Earlier rounds' fixes preserved unchanged.)
|
||||
|
||||
---
|
||||
|
||||
## What needs review (round 9)
|
||||
|
||||
1. **Two-layer ID model (§4.1, §4.6.1)** — is the daemon-vs-broker
|
||||
authority split clear, or does it create more confusion for
|
||||
operators reading "consumed" in different contexts? Should we use
|
||||
different verbs (e.g. "claimed" at daemon, "committed" at broker)?
|
||||
2. **Rate-limit external limiter (§4.6.4)** — is "atomic external
|
||||
limiter" specified concretely enough? Is the over-counting on
|
||||
limiter-accepted-then-B2-rejected acceptable?
|
||||
3. **B2 contents after rate-limit move** — B2 now only has
|
||||
`destination_ref existence`. Worth keeping a B2 phase at all, or
|
||||
collapse into B1+B3?
|
||||
4. **Anything else still wrong?** Read it as if you were going to
|
||||
operate this for a year.
|
||||
|
||||
Three options:
|
||||
- **(a) v9 is shippable**: lock the spec, start coding the frozen core.
|
||||
- **(b) v10 needed**: list the must-fix items.
|
||||
- **(c) the architecture itself is wrong**: what would you do differently?
|
||||
|
||||
Be ruthless.
|
||||
853
.artifacts/shipped/2026-05-03-daemon-final-spec-v2.md
Normal file
@@ -0,0 +1,853 @@
|
||||
# `claudemesh daemon` — Final Spec v2
|
||||
|
||||
> **Round 2 after a critical first-pass review.** v1 of this spec was reviewed
|
||||
> by another model and pushed back on identity model, no-auth IPC, "exactly-once"
|
||||
> overclaim, hook credentials, surface bloat, and missing operational flows
|
||||
> (rotation, image clones, schema migration, threat model). v2 incorporates all
|
||||
> of those.
|
||||
|
||||
---
|
||||
|
||||
## 0. Intent — what this is, what it isn't
|
||||
|
||||
### 0.1 The product reality
|
||||
|
||||
claudemesh today is a **peer mesh runtime for Claude Code sessions**. Each
|
||||
session runs `claudemesh launch`, opens a WebSocket to a managed broker, gets
|
||||
ephemeral identity, sends/receives DMs and topic messages with other Claude Code
|
||||
sessions, posts to shared state, deploys MCP servers / skills / files,
|
||||
participates in tasks, schedules reminders. Everything is E2E encrypted with
|
||||
crypto_box envelopes for DMs and per-topic symmetric keys for topics. The broker
|
||||
is a routing/persistence layer; peers do the actual work.
|
||||
|
||||
The CLI is the canonical surface — every operation is a `claudemesh <verb>`.
|
||||
The MCP server is a "tool-less push pipe" that surfaces inbound messages to
|
||||
Claude Code as channel notifications. There is also a web dashboard, an `/v1/*`
|
||||
REST API, and an existing apikey auth model for external integrations.
|
||||
|
||||
### 0.2 The gap
|
||||
|
||||
Anything that **isn't a Claude Code session** is a second-class citizen:
|
||||
|
||||
- A RunPod handler that wants to alert a peer when an OOM happens has only
|
||||
one option: curl an apikey-authed REST endpoint. One-way only. The handler
|
||||
is not a peer — it can't be DM'd back, can't be `@-mentioned`, can't be in
|
||||
`peer list`, can't claim a task assigned to it, can't host an MCP service or
|
||||
share a skill. It's a webhook spoke, not a participant.
|
||||
|
||||
- A Temporal worker that wants to track its own progress in shared mesh state,
|
||||
publish to a `#alerts` topic, and listen for "retry now" instructions has
|
||||
no good shape. Either it shells out to `claudemesh send` cold-path
|
||||
(a fresh WS handshake per message — ~1s latency, broker churn, no inbound
|
||||
path) or it speaks the WS protocol manually (significant code, no SDK).
|
||||
|
||||
- A long-running CI runner, an IoT box, a phone app, a future Python or Go
|
||||
service — none can be **first-class peers** without writing the same WS
|
||||
reconnect / queue / encryption / presence code that the existing CLI already
|
||||
has, plus an IPC surface so the host's apps can use it without re-implementing
|
||||
any of that.
|
||||
|
||||
### 0.3 What this daemon is
|
||||
|
||||
A long-running process — the same `claudemesh-cli` binary in `daemon` mode —
|
||||
that turns any host into a **first-class peer**:
|
||||
|
||||
- Stable identity across restarts (the host *is* a member of the mesh, not a
|
||||
series of disconnected sessions).
|
||||
- Persistent WS to the broker, with reconnect, queue, dedupe.
|
||||
- Local IPC surface (UDS + loopback HTTP + SSE) that any local app can hit
|
||||
to send, subscribe, query — without learning the broker protocol or carrying
|
||||
long-lived secrets in app code.
|
||||
- Hooks: shell scripts that fire on events. Server replies to DMs, auto-claims
|
||||
tasks, escalates errors — without the app being involved.
|
||||
- Same security primitives as `claudemesh launch` (mesh keypair, crypto_box,
|
||||
per-topic keys). No new auth model toward the broker.
|
||||
|
||||
The daemon **is the runtime**. The CLI in cold-path mode is a fallback. The
|
||||
Claude Code MCP integration is one client of the daemon (eventually).
|
||||
|
||||
### 0.4 What this daemon is NOT

- **Not a webhook gateway.** `/v1/notify` and apikeys remain the path for
  systems that can't host the runtime (third-party SaaS, monitoring tools).
  The daemon is for systems that *can* run a process — code you control.

- **Not a generic message broker.** It speaks claudemesh protocol to one
  managed broker. It is not a substitute for NATS, Redis, Kafka, RabbitMQ.

- **Not a Slack replacement.** Topics, DMs, mentions exist because *AI
  sessions* use them. Humans interact via the dashboard or a Claude Code
  session, not by reading the daemon's inbox directly.

- **Not a fleet manager.** One daemon manages one mesh on one host. Multi-mesh
  on one host is supported (one daemon per mesh, supervised). Cross-host
  supervision is an external concern (systemd, k8s, etc.) — the daemon doesn't
  reach across hosts.

### 0.5 Who deploys this

- A developer running `claudemesh daemon up` on their laptop so their open
  Claude Code sessions all share one persistent connection (instead of each
  opening its own ephemeral WS).
- The same developer running `claudemesh daemon install-service` on their VPS,
  RunPod pod, Temporal worker, CI runner — turning each into an
  addressable peer that scripts on that host can talk to via local IPC.
- Eventually: language SDKs (Python / Go / TypeScript) talking to the daemon
  on `localhost`, exposing claudemesh as a first-class API for any app the
  developer writes.

### 0.6 Pre-launch posture

No users yet. We can break protocol, schema, surface, anything. Optimize for
the architecture we want to live with for years, not for the smallest
shippable cut. Codex pushed back on v1 on this exact axis: do not ship
graph/vector/MCP/skills/tasks on day one — freeze a small, hardened core,
expand deliberately.

---

## 1. Process model

**One daemon per (user, mesh)**. Persistent. Survives reboots via OS
supervisor. Serves multiple local apps concurrently.

```
~/.claudemesh/daemon/<mesh-slug>/
  pid                    0600  pidfile, cleaned on shutdown
  sock                   0600  unix domain socket (primary IPC)
  http.port              0644  auto-allocated loopback port
  local_token            0600  per-daemon bearer for HTTP/TCP transports
  keypair.json           0600  persistent ed25519 + x25519 — daemon identity
  host_fingerprint.json  0600  machine-id + boot-id + interface mac digest
  config.toml            0644  user-editable runtime tuning
  outbox.db              0600  SQLite — durable outbound queue
  inbox.db               0600  SQLite — N-day inbound history, FTS-indexed
  schema_version         0644  integer; gates online migrations
  daemon.log             0644  JSON-lines, rotating (100 MB / 14 d)
  hooks/                 0700  user-managed event scripts
```

**Resource caps (defaults, configurable):**

| Resource | Default | Why |
|---|---|---|
| RSS | 256 MB | Most workloads stay under 50 MB; cap protects multi-mesh hosts |
| CPU | unlimited | Hook fan-out can spike briefly; rely on OS scheduler |
| Outbox DB | 5 GB | At 1 KB avg msg, that's 5M queued. Disk-full handling per §4.4 |
| Inbox DB | 5 GB | Same |
| File descriptors | 1024 | UDS clients + SSE streams + DB handles + WS |
| SSE concurrent | 32 streams | DoS protection; configurable up |
| IPC concurrent | 64 in-flight | Backpressure beyond this returns `429 daemon_busy` |
| Hook concurrency | 8 | Bounded pool; overflow queues |

Single binary. Same `claudemesh-cli` package; `daemon` is one of its modes.

## 2. Identity — persistent member by default, ephemeral on opt-in, clone-aware

### 2.1 Modes

```
claudemesh daemon up                        # default: persistent member
claudemesh daemon up --ephemeral            # session-shaped, no keypair persisted
claudemesh daemon up --ephemeral --ttl=2h   # auto-shutdown after TTL
```

- **Persistent (default)**: ed25519 + x25519 keypair stored in `keypair.json`.
  Same identity across restarts, reconnects, supervisor cycles. Right for
  servers, workers, addressable peers.
- **Ephemeral**: keypair generated in memory, never written. Daemon exits =
  identity gone. Right for CI jobs, preview environments, disposable RunPod
  pods, test harnesses, build agents, anything that should not leave a peer
  ghost in the broker after teardown.
- **`--ttl <duration>`** on ephemeral mode: auto-shutdown after the duration,
  or after `claudemesh daemon down`, whichever first. Broker member record
  cleaned up on shutdown.

### 2.2 Image-clone detection

Two daemons booting with the same `keypair.json` (VM image clone, container
copy, restored backup) is a serious failure mode — broker sees connection
collisions, presence flickers, encrypted messages route to the wrong host.

Handled in three places:

1. **Daemon side**: `host_fingerprint.json` is written on first startup —
   `sha256(machine-id || boot-id || mac-of-default-iface || hostname)`. On every
   subsequent startup, the fingerprint is recomputed and compared. If it
   differs, the daemon **refuses to start** unless `--accept-cloned-identity`
   is passed (writes a fresh fingerprint and continues with the same keypair —
   for legitimate hardware migrations) or `--remint` is passed (mints fresh
   keypair, registers as a new member, broker reaps the old member after
   grace period).
2. **Broker side**: tracks `lastSeenHostFingerprint` per member. On
   reconnection from a different fingerprint, broker emits a
   `member_clone_suspected` security event to the mesh owner's dashboard.
   Connection itself is allowed (legitimate hardware swaps happen) but visible
   for audit.
3. **Mesh owner**: `claudemesh member revoke <pubkey>` revokes the keypair
   server-side; daemon receives `keypair_revoked` push event on next
   connection and self-disables.

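The daemon-side check in step 1 can be sketched as below. This is an illustration under stated assumptions, not the daemon's code: the function names, the returned labels, the JSON layout of `host_fingerprint.json`, and the length-prefixed encoding of the four components are all hypothetical. The source only fixes the hash inputs and the refuse / accept outcomes. How the four components are read from the host is left to the caller.

```python
import hashlib
import json
from pathlib import Path


def host_fingerprint(machine_id: str, boot_id: str, mac: str, hostname: str) -> str:
    """SHA-256 over the four components named in step 1.

    Each part is length-prefixed (an assumption of this sketch) so that
    ("ab", "c") and ("a", "bc") cannot produce the same digest.
    """
    h = hashlib.sha256()
    for part in (machine_id, boot_id, mac, hostname):
        raw = part.encode("utf-8")
        h.update(len(raw).to_bytes(4, "big"))
        h.update(raw)
    return h.hexdigest()


def check_fingerprint(path: Path, current: str, accept_cloned: bool = False) -> str:
    """Return 'first_run', 'match', 'accepted', or 'refuse' (hypothetical labels)."""
    if not path.exists():
        path.write_text(json.dumps({"fingerprint": current}))
        path.chmod(0o600)
        return "first_run"
    stored = json.loads(path.read_text())["fingerprint"]
    if stored == current:
        return "match"
    if accept_cloned:
        # Models --accept-cloned-identity: keep the keypair, record the new host.
        path.write_text(json.dumps({"fingerprint": current}))
        return "accepted"
    return "refuse"
```

A caller would map `'refuse'` to a non-zero exit with a message naming `--accept-cloned-identity` and `--remint`; the `--remint` path (new keypair, new member) is outside this sketch.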
### 2.3 Rename

`--name` is taken at first `daemon up`; subsequent runs read the keypair file
and ignore `--name` unless `--rename` is passed (which produces a
`member_renamed` event the broker propagates to peers).

## 3. IPC surface — stable core only in v0.9.0

### 3.1 Frozen core surface (v0.9.0)

Codex's feedback: do not ship every CLI verb on day one. A small hardened core
first, expand under explicit capability gates.

```
# Messaging — durable, tested
POST   /v1/send                {to, message, priority?, meta?, replyToId?}
POST   /v1/topic/post          {topic, message, priority?, mentions?}
POST   /v1/topic/subscribe     {topic}                        (idempotent)
POST   /v1/topic/unsubscribe   {topic}
GET    /v1/topic/list
GET    /v1/inbox               ?since=<iso>&topic=<n>&from=<peer>&limit=<n>
GET    /v1/inbox/search        ?q=<fts-query>&limit=<n>       (FTS5)

# Peers + presence — read-only on day one
GET    /v1/peers               ?mesh=<slug>
POST   /v1/profile             {summary?, status?, visible?}  (limited fields)

# Files — already production in CLI
POST   /v1/file/share          {path, to?, message?, persistent?}
GET    /v1/file/get            ?id=<fileId>&out=<path>
GET    /v1/file/list

# Events — push
GET    /v1/events              text/event-stream
                               core events: message, peer_join, peer_leave, file_shared,
                               daemon_disconnect, daemon_reconnect, hook_executed

# Control plane
GET    /v1/health              {connected, lag_ms, queue_depth, inflight,
                                mesh, member_pubkey, uptime_s, schema_version,
                                daemon_version, broker_version}
GET    /v1/metrics             Prometheus exposition
GET    /v1/version             {daemon, schema, ipc_api}      (negotiation)
POST   /v1/heartbeat           {}                             (caller-side liveness signal)
```

That's it. ~20 endpoints. Battle-test these before adding more.

### 3.2 Capability-gated future surface (v0.9.x roadmap)

Behind explicit feature flags in `config.toml`, post-v0.9.0:

```toml
[capabilities]
state       = false   # /v1/state/{set,get,list}
memory      = false   # /v1/memory/{remember,recall}
vector      = false   # /v1/vector/{store,search,delete}
graph       = false   # /v1/graph/query
tasks       = false   # /v1/task/{create,claim,complete}
scheduling  = false   # /v1/scheduling/remind
mcp_host    = false   # /v1/mcp/{register,call} (LARGEST surface; treat as v1.0)
skill_share = false   # /v1/skill/{deploy,share}
```

Each capability is its own ship: design review, security review, test
coverage, capability-token model, then enable. None enabled in v0.9.0.

### 3.3 Local IPC authentication

Codex was right: loopback TCP without auth is an attack surface (browser SSRF,
container side-channels, sandboxed apps with network but no FS access, WSL
host-shared loopback).

| Transport | Auth | Rationale |
|---|---|---|
| UDS | None (relies on FS perms 0600) | Reaching the socket = same UID = can read keypair anyway |
| TCP loopback | **Required**: `Authorization: Bearer <local_token>` | Browser/container/sandbox can reach loopback without FS access |
| SSE | Required: `Authorization: Bearer <local_token>` | Same |

`local_token` is 32 bytes of `crypto.randomBytes` (256 bits), encoded base64url,
written to `local_token` with mode 0600 at daemon init. Rotated on
`claudemesh daemon rotate-token`. SDKs auto-discover the token by reading the
file (same mechanism as discovering the socket path).

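A minimal sketch of the token handling described above, written in Python for illustration (the source names Node's `crypto.randomBytes`; `secrets.token_bytes` is the swapped-in equivalent here). The function names are hypothetical, and the constant-time comparison is an assumption of this sketch rather than something the spec states.

```python
import base64
import hmac
import os
import secrets
from pathlib import Path
from typing import Optional


def write_local_token(path: Path) -> str:
    """Generate 32 random bytes, base64url-encode them, store with mode 0600."""
    raw = secrets.token_bytes(32)
    token = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    # The mode argument only applies when the file is created.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(token)
    os.chmod(path, 0o600)  # enforce the mode even if the file already existed
    return token


def bearer_ok(authorization: Optional[str], expected: str) -> bool:
    """Check an `Authorization: Bearer <token>` header value in constant time."""
    prefix = "Bearer "
    if not authorization or not authorization.startswith(prefix):
        return False
    presented = authorization[len(prefix):]
    return hmac.compare_digest(presented.encode(), expected.encode())
```

An SDK-side discovery step would be the mirror image: read the file, strip whitespace, and send the value in the `Authorization` header on TCP and SSE requests.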
**Additional defenses:**
- HTTP listener binds **127.0.0.1 only**. Refuses to bind elsewhere unless
  `[ipc] http_bind = "..."` is set explicitly **and** `[ipc] http_external_auth = "..."`
  points to a separate token file (escape hatch for advanced users; never the default).
- `Origin` header check: rejects requests with `Origin` set unless it's
  explicitly allowlisted in config (default: empty allowlist). Defends against
  browser SSRF.
- `Host` header check: must be `localhost` or `127.0.0.1`. Defends against DNS
  rebinding.
- CORS: `Access-Control-Allow-Origin` never echoed; preflight returns `403`.
- `User-Agent` required (rejects empty UA — mild signal against simple SSRF).

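The header checks above can be expressed as one pure function, sketched here. The function name and the reason strings are hypothetical, and stripping a `:port` suffix from `Host` before comparing is an assumption (the spec lists only the bare hostnames). Bearer-token validation is a separate step and is not repeated here.

```python
from typing import FrozenSet, Mapping, Optional

ALLOWED_HOSTS = frozenset({"localhost", "127.0.0.1"})


def reject_reason(
    headers: Mapping[str, str],
    origin_allowlist: FrozenSet[str] = frozenset(),
) -> Optional[str]:
    """Return None if the request passes, else a short reason label."""
    h = {k.lower(): v.strip() for k, v in headers.items()}

    # Host check (DNS rebinding): compare the hostname without the port.
    host = h.get("host", "").split(":", 1)[0].lower()
    if host not in ALLOWED_HOSTS:
        return "bad_host"

    # Origin check (browser SSRF): any Origin must be explicitly allowlisted.
    origin = h.get("origin")
    if origin is not None and origin not in origin_allowlist:
        return "origin_not_allowed"

    # User-Agent must be present and non-empty.
    if not h.get("user-agent"):
        return "missing_user_agent"
    return None
```

Keeping the checks free of I/O makes them easy to unit-test independently of the HTTP server.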
### 3.4 Request limits + backpressure

- Max request body: **1 MB** (override per endpoint; file uploads use a separate
  streaming endpoint).
- Max response body: **10 MB**; truncated with `Link: rel=next` cursor.
- Max in-flight IPC requests: **64**. Beyond → `429 daemon_busy`.
- Max SSE concurrent streams: **32**. Beyond → `429 too_many_streams`.
- Per-token rate limit: **100 req/sec** sustained, 1000/sec burst (token
  bucket). Tunable.

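A token bucket matching the per-token limit above might look like this. It is a sketch, assuming "1000/sec burst" means a bucket capacity of 1000 tokens refilled at 100 tokens per second; the class name and the injectable clock are illustrative choices, not part of the spec.

```python
import time
from typing import Callable


class TokenBucket:
    """Refill at `rate` tokens/sec up to `burst`; each request spends one token."""

    def __init__(
        self,
        rate: float = 100.0,
        burst: float = 1000.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = burst
        self.tokens = burst          # start full so an initial burst is allowed
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Credit tokens for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The daemon would keep one bucket per `local_token` and answer a refused request with a rate-limit status; which status code that is remains unspecified in the source.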
## 4. Delivery contract — durable at-least-once with idempotent send

Codex was right: "exactly-once" is a lie. Replacing the claim with a precise
contract.

### 4.1 The contract

> **The daemon guarantees: each successful send call enqueues exactly one row
> to the broker eventually, identified by a stable `messageId`. The daemon
> does not guarantee that downstream peers process the message exactly once —
> that is the receiver's responsibility, aided by the propagated
> `idempotency_key`.**

Concretely:

- **Caller → daemon**: caller may supply `Idempotency-Key`; daemon dedupes
  identical keys for 24h. Without one, daemon mints `ulid` and returns it as
  `messageId`.
- **Daemon → broker**: each outbox row has at-most-one inflight transmit.
  Daemon retries with exponential backoff until broker ACKs OR row hits TTL
  (7d default → moves to `dead`).
- **Broker → peer**: existing claudemesh delivery semantics. Broker dedupes by
  `messageId`. Peer receives ≥1 copy.
- **Peer hooks**: hooks see `idempotency_key` in the event JSON. Idempotent
  hook implementations are the receiver's responsibility.

### 4.2 Outbox row state machine

```
              ┌────────────┐
  send call → │  pending   │
              └─────┬──────┘
                    │ daemon picks up batch
                    ▼
              ┌────────────┐
              │  inflight  │ ← attempts++, last_error written
              └─┬────┬─────┘
                │    │ broker NACK / network err
     broker ACK │    └──────────► back to pending (with exp. backoff)
                ▼
              ┌────────────┐
              │    done    │ ← delivered_at set, broker_message_id stored
              └────────────┘

  age > max_age_hours:
              ┌────────────┐
              │    dead    │ ← surfaces in `daemon outbox --failed`
              └────────────┘
```

### 4.3 Crash recovery

On daemon startup:

1. Any rows in `inflight` are reset to `pending` with `attempts++` and
   `next_attempt_at = now + min_backoff`. Note: this MAY cause double-delivery
   of a message that was actually ACK'd by the broker but the ACK didn't
   persist locally before crash. The `idempotency_key` propagates to broker
   (via message `meta`) so the broker dedupes by key.
2. `outbox.db` integrity check (`PRAGMA integrity_check`); if fails, daemon
   refuses to start, points user at `claudemesh daemon recover`.
3. `inbox.db` integrity check; on failure, drops to `inbox.db.corrupt-<ts>`,
   creates fresh empty inbox, logs `inbox_corruption_recovered` (does not
   block startup — inbox is a cache).

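Steps 1 and 2 can be sketched against SQLite as follows. The `outbox` table layout here is hypothetical (the spec does not give the outbox schema); only the column names `attempts` and `next_attempt_at`, the state names, and `PRAGMA integrity_check` come from the text above.

```python
import sqlite3

# Hypothetical minimal schema, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS outbox (
    message_id      TEXT PRIMARY KEY,
    status          TEXT NOT NULL
                    CHECK (status IN ('pending', 'inflight', 'done', 'dead')),
    attempts        INTEGER NOT NULL DEFAULT 0,
    next_attempt_at REAL NOT NULL DEFAULT 0
);
"""


def integrity_ok(conn: sqlite3.Connection) -> bool:
    """Step 2: a healthy database returns the single row 'ok'."""
    row = conn.execute("PRAGMA integrity_check").fetchone()
    return row is not None and row[0] == "ok"


def recover_inflight(conn: sqlite3.Connection, now: float, min_backoff_s: float = 1.0) -> int:
    """Step 1: reset inflight rows to pending; returns the number of rows reset."""
    with conn:  # one transaction, committed on success
        cur = conn.execute(
            "UPDATE outbox SET status = 'pending', attempts = attempts + 1, "
            "next_attempt_at = ? WHERE status = 'inflight'",
            (now + min_backoff_s,),
        )
    return cur.rowcount
```

Running the reset inside a single transaction means a second crash during recovery leaves rows either untouched or fully reset, never half-updated.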
### 4.4 Disk-full

- At 80% of `outbox.max_queue_size` or 80% of `[disk] reserved_bytes`: daemon
  emits `outbox_pressure_high` event + Prometheus gauge. Sends still accept.
- At 95%: new sends return `507 insufficient_storage`. Existing inflight
  drains.
- At 100%: daemon enters degraded mode — refuses sends, refuses new SSE
  streams, holds open WS for inbound only. `daemon status` shows degraded.
- Recovery: drain via broker reconnect (drains `done` rows older than
  retention window) or `claudemesh daemon outbox prune --confirm`.

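The three thresholds reduce to a small classification function. The sketch below uses hypothetical state labels and integer arithmetic to avoid floating-point edge cases at exactly 80% and 95%; how `used` and `limit` are measured is left open.

```python
def outbox_pressure(used: int, limit: int) -> str:
    """Classify outbox usage: 'ok', 'pressure_high', 'reject_sends', 'degraded'."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    if used >= limit:
        return "degraded"        # 100%: refuse sends and new SSE streams
    if used * 100 >= limit * 95:
        return "reject_sends"    # 95%: new sends get 507 insufficient_storage
    if used * 100 >= limit * 80:
        return "pressure_high"   # 80%: emit outbox_pressure_high, still accept
    return "ok"
```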
### 4.5 Schema migration

`schema_version` file holds an integer. On startup:
1. If `schema_version` matches binary's expected version → continue.
2. If version is older → run `apps/cli/src/daemon/migrations/<from>-<to>.sql`
   in a transaction, write new version on success.
3. If version is newer (downgrade) → daemon refuses to start, error points at
   re-installing matching version.

Migrations are forward-only. Each migration is ≤ 1 transaction. Test coverage
required: every migration has a snapshot test from prior schema.

## 5. Inbound — durable history with FTS

Every inbound message is written to `inbox.db` before any hook fires:

```sql
CREATE VIRTUAL TABLE inbox USING fts5(
  message_id UNINDEXED, mesh UNINDEXED, topic, sender_pubkey UNINDEXED,
  sender_name, body, meta, idempotency_key UNINDEXED,
  received_at UNINDEXED, replied_to_id UNINDEXED
);
CREATE INDEX inbox_received_at ON inbox(received_at);
CREATE INDEX inbox_idem ON inbox(idempotency_key);
```

- **Receiver-side dedupe**: on insert, `INSERT OR IGNORE` on `idempotency_key`.
  Duplicate broker delivery becomes a no-op locally + `cm_daemon_dedupe_total`
  counter increments.
- 30-day rolling retention (configurable). `VACUUM` weekly during low-traffic
  window.
- `claudemesh daemon search "OOM"` queries the FTS index.
- Apps connecting mid-stream replay history via `?since=<iso>`.

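Receiver-side dedupe can be sketched as below. One caveat shapes the sketch: `INSERT OR IGNORE` only skips a row that would violate a uniqueness constraint, and SQLite does not allow `UNIQUE` constraints or `CREATE INDEX` on an FTS5 virtual table. The example therefore assumes a plain companion table (hypothetical name `inbox_rows`) that carries the dedupe key; how it is paired with the FTS index is not decided here.

```python
import sqlite3


def open_inbox(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # Hypothetical companion table; the UNIQUE constraint is what makes
    # INSERT OR IGNORE act as a dedupe.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS inbox_rows (
               message_id      TEXT PRIMARY KEY,
               idempotency_key TEXT NOT NULL UNIQUE,
               body            TEXT NOT NULL,
               received_at     TEXT NOT NULL
           )"""
    )
    return conn


def store_inbound(conn: sqlite3.Connection, message_id: str,
                  idempotency_key: str, body: str, received_at: str) -> bool:
    """Return True if the row was stored, False if it was a duplicate."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO inbox_rows VALUES (?, ?, ?, ?)",
            (message_id, idempotency_key, body, received_at),
        )
    return cur.rowcount == 1
```

A `False` return is the point where the daemon would increment `cm_daemon_dedupe_total` and skip hook invocation.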
## 6. Hooks — first-class but tightly bounded

Codex was right: hooks were underspecified, and putting `CLAUDEMESH_TOKEN` in
every hook env was a serious exfil footgun.

### 6.1 Hook directory & contract

```
hooks/
  on-message.sh            every inbound message (DM + topic)
  on-dm.sh                 DMs only
  on-mention.sh            when @<my-name> appears anywhere
  on-topic-<name>.sh       a specific topic
  on-file-share.sh         file shared with me
  on-disconnect.sh         WS dropped
  on-reconnect.sh          reconnected
  on-startup.sh            daemon up
  pre-send.sh              filter / mutate outbound (last gate)
  hooks.toml               per-hook policy (auth, redaction, env, timeout)
```

`hooks.toml` (mandatory; daemon refuses to invoke hooks without it):

```toml
[on-mention]
enabled = true
timeout_s = 30
output_size_limit = 65536
redact_payload = ["body.password", "meta.api_key"]   # JSONPath
allow_reply = true                                   # if false, stdout reply ignored
capability_token_scope = ["topic:alerts:post"]       # scoped, NOT broker session token
network_policy = "deny"                              # 'deny' | 'allow' | 'allowlist'
network_allowlist = []                               # only if policy = 'allowlist'
fs_policy = "readonly"                               # 'readonly' | 'rw' | 'sandbox'
killpg_on_timeout = true                             # SIGTERM process group, not just child
audit = true                                         # log every invocation
```

### 6.2 Credentials passed to hooks

**Default: nothing.** No `CLAUDEMESH_TOKEN`, no broker session, nothing that
lets the hook impersonate the daemon's identity broadly.

**Opt-in per hook**: `capability_token_scope = ["topic:alerts:post"]` mints a
**short-lived (5 min) capability token** scoped to exactly that capability.
The hook can use it to call back into the daemon's IPC ("post a reply to
#alerts") but cannot use it to read state, read inbox, deploy MCP, etc. Token
expires when hook process exits OR after 5 min, whichever first.

Capability tokens are local-only — they authorize against the daemon's IPC
surface, never the broker directly. Daemon translates capability calls into
broker calls.

Env variables the hook DOES get:
- `CLAUDEMESH_MESH=<slug>`
- `CLAUDEMESH_HOOK_NAME=on-mention`
- `CLAUDEMESH_EVENT_ID=<ulid>`
- `CLAUDEMESH_CAPABILITY_TOKEN=<token>` (only if scope was configured; else absent)
- `CLAUDEMESH_DAEMON_SOCK=<path>` (so SDKs can connect for capability calls)
- `PATH=/usr/bin:/bin` (locked down)

### 6.3 Payload redaction

Hook stdin receives event JSON minus paths listed in `redact_payload`. Default
redaction: nothing. Mesh owner / daemon admin opts in.

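A sketch of the redaction step, assuming the paths are simple dotted keys such as `body.password` (as in the `hooks.toml` example). Full JSONPath features like wildcards or array indices are not handled, and the function name is hypothetical.

```python
import copy
from typing import Any, Iterable


def redact(event: dict, paths: Iterable[str]) -> dict:
    """Return a deep copy of `event` with each dotted path removed if present."""
    out = copy.deepcopy(event)          # never mutate the stored event
    for path in paths:
        *parents, leaf = path.split(".")
        node: Any = out
        for key in parents:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break                    # path does not exist; nothing to redact
        if isinstance(node, dict):
            node.pop(leaf, None)
    return out
```

Working on a copy keeps the unredacted event available for `inbox.db` while the hook only ever sees the reduced payload.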
### 6.4 Timeout & cleanup

- Per-hook `timeout_s` (default 30s). On timeout, daemon sends SIGTERM to the
  hook's process group (`killpg_on_timeout=true`), waits 5s, then SIGKILL.
  Catches forked grandchildren that were trying to keep things alive.
- Hook stdout/stderr captured, truncated at `output_size_limit`. Larger
  outputs log a warning and discard the overflow.

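The timeout sequence can be sketched with the Python standard library on a POSIX host. The function name, return shape, and `env` parameter are illustrative; starting the hook in its own session is this sketch's way of giving it a dedicated process group so that `killpg` reaches grandchildren. Only stdout truncation is shown.

```python
import os
import signal
import subprocess
from typing import Dict, List, Optional, Tuple

LOCKED_ENV = {"PATH": "/usr/bin:/bin"}


def _signal_group(pgid: int, sig: int) -> None:
    try:
        os.killpg(pgid, sig)
    except ProcessLookupError:
        pass  # the group already exited


def run_hook(argv: List[str], stdin_data: bytes, timeout_s: float = 30.0,
             grace_s: float = 5.0, output_limit: int = 65536,
             env: Optional[Dict[str, str]] = None) -> Tuple[int, bytes, bool]:
    """Run a hook; return (exit_code, truncated_stdout, timed_out)."""
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        env=LOCKED_ENV if env is None else env,
        start_new_session=True,  # child becomes leader of a new process group
    )
    timed_out = False
    try:
        out, _err = proc.communicate(stdin_data, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        timed_out = True
        _signal_group(proc.pid, signal.SIGTERM)      # polite stop for the group
        try:
            out, _err = proc.communicate(timeout=grace_s)
        except subprocess.TimeoutExpired:
            _signal_group(proc.pid, signal.SIGKILL)  # grace period expired
            out, _err = proc.communicate()
    return proc.returncode, out[:output_limit], timed_out
```

A negative exit code from `subprocess` means the hook was terminated by that signal number, which is a convenient value for the audit log's `exit` field.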
### 6.5 Audit log

Every hook invocation logs:
```json
{"hook":"on-mention","event_id":"01H8…","exit":0,"duration_ms":47,
 "stdout_bytes":120,"stderr_bytes":0,"replied":true,"capability_calls":1,
 "ts":"2026-05-03T14:00:00Z"}
```

Stored in `daemon.log`; metrics exposed via `cm_daemon_hook_*`.

### 6.6 Sandboxing — supported, not required

The contract supports sandboxing without mandating it (mandating breaks too
many real workflows):

- Linux: opt-in `sandbox = "bubblewrap"` in `hooks.toml` runs the hook under
  `bwrap` with no network (unless `network_policy != "deny"`), readonly FS
  except `/tmp/<hook-id>`, no DBus, no /proc.
- macOS: opt-in `sandbox = "sandbox-exec"` with similar profile.
- Default: no sandbox; rely on Unix permissions + `network_policy=deny` (which
  is enforced via `unshare --net` on Linux when available, otherwise
  best-effort firewall rule).

## 7. Multi-mesh — daemon-per-mesh, supervised by a thin shell

### 7.1 The decision

One daemon per mesh, coordinated by a supervisor script. Codex pushed back —
"why not one daemon serving all meshes?". Going daemon-per-mesh because:

- **Crash isolation**: a panic in `prod` mesh's WS reader can't corrupt
  `dev` mesh's outbox.
- **Resource accounting**: per-mesh RSS, per-mesh metrics, per-mesh disk
  budget — easy to attribute, easy to cap.
- **Independent identity**: each mesh has its own keypair, host fingerprint,
  capability gates. Conflating into one process forces shared trust.
- **Independent upgrades**: rolling daemon restarts per mesh, no downtime
  across all meshes.
- **Simpler code**: zero cross-mesh routing logic in the daemon body.

The cost (process count, log fan-out) is real but bounded: typical user has
1–3 meshes. Heavy users (10–20) get a `claudemesh daemon ps` + `--all` UX that
treats them as a fleet.

### 7.2 Resource caps for fleet hosts

`config.toml` has `[fleet]` section read by `daemon up --all`:

```toml
[fleet]
max_daemons = 10
total_memory_budget = "2GB"     # divided across daemons; each gets budget/N RSS cap
total_disk_budget = "20GB"      # divided across outbox + inbox per daemon
```

If a user hits `max_daemons`, `daemon up <next>` errors with a clear message
pointing at the cap.

### 7.3 Commands

```
claudemesh daemon up --mesh <slug>           # one mesh
claudemesh daemon up --all                   # all joined meshes (respects fleet caps)
claudemesh daemon down --mesh <slug>
claudemesh daemon down --all
claudemesh daemon status                     # all daemons, table view
claudemesh daemon status --json              # machine-readable
claudemesh daemon ps                         # alias of status
claudemesh daemon logs --mesh <slug> [-f]
claudemesh daemon restart --mesh <slug>
```

## 8. Auto-routing — clarified, not transparent

Codex pushed back: "no behavior difference" was hand-waving. Persistent
identity, queueing, hooks, profile state — these legitimately change behavior.

### 8.1 What changes when a daemon is up

| Behavior | Cold-path CLI | Daemon-routed CLI |
|---|---|---|
| Sender attribution | Ephemeral session pubkey for that invocation | Daemon's persistent member pubkey |
| Latency | ~1s (fresh WS handshake) | <10ms (local UDS round-trip) |
| Send durability | None — if broker is unreachable, command fails | Outbox queue retries until TTL |
| Inbound visibility | Not available (cold path closes WS) | `claudemesh inbox` reads daemon's inbox.db |
| Hooks | Not invoked | Invoked on every event |
| Presence | Brief flicker as session connects+disconnects | Continuous; daemon's status reflected |
| `peer list` shows me as | A new ephemeral session each invocation | The daemon's persistent member |

### 8.2 Detection logic — connect, don't trust pidfile

```
1. Check ~/.claudemesh/daemon/<slug>/sock exists.
2. Attempt UDS connect with 100ms timeout.
3. If connect succeeds: send GET /v1/version.
4. If response is well-formed AND mesh matches AND daemon_version is
   compatible → use this daemon.
5. Otherwise → cold path.
```

PID liveness check is unreliable (PID reuse, process orphaned). Socket
handshake is canonical.

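The probe can be sketched as two functions: one that does the socket handshake and one that makes the routing decision. Several things are assumptions of the sketch, not spec: the function names, a `mesh` field in the version response (step 4 requires a mesh match, but the listed `/v1/version` fields do not show where it comes from), compatibility judged by `ipc_api` equality, and a close-delimited response body without chunked encoding.

```python
import json
import socket
from typing import Optional

REQUEST = b"GET /v1/version HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n"


def probe_daemon(sock_path: str, timeout_s: float = 0.1) -> Optional[dict]:
    """Steps 1-3: connect over UDS and fetch /v1/version. None means cold path."""
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.settimeout(timeout_s)
            s.connect(sock_path)          # fails fast if the socket is absent or stale
            s.sendall(REQUEST)
            chunks = []
            while True:
                chunk = s.recv(4096)
                if not chunk:
                    break
                chunks.append(chunk)
        head, _, body = b"".join(chunks).partition(b"\r\n\r\n")
        if not head.startswith(b"HTTP/1.1 200"):
            return None
        info = json.loads(body)
        return info if isinstance(info, dict) else None
    except (OSError, ValueError):         # connect/timeout errors, malformed JSON
        return None


def choose_route(info: Optional[dict], expected_mesh: str, cli_ipc_api: str = "v1") -> str:
    """Steps 4-5: use the daemon only on a well-formed, matching response."""
    if info and info.get("mesh") == expected_mesh and info.get("ipc_api") == cli_ipc_api:
        return "daemon"
    return "cold"
```

Any failure, including a leftover socket file with no listener, collapses to the cold path, which is why the pidfile never needs to be consulted.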
### 8.3 Coexistence with `claudemesh launch`

Both can be running for the same mesh:
- Daemon connected as persistent member `runpod-worker-3`.
- A separate `claudemesh launch` connects as ephemeral session of the same
  member. Visible to peers as "another session of runpod-worker-3"
  (sibling-session relationship via `memberPubkey`).
- CLI verbs from inside `claudemesh launch` route through the launch session,
  NOT the daemon (preserves "this Claude Code session has its own ephemeral
  identity" semantics).
- CLI verbs from a separate shell route through the daemon (faster, durable).

This is consistent with the v0.5.1 self-DM guard and sibling-session
semantics already shipped.

## 9. Service installation

```bash
claudemesh daemon install-service            # writes systemd unit / launchd plist / Windows SC
claudemesh daemon uninstall-service
claudemesh daemon install-service --user     # user-scope unit (default; no root)
claudemesh daemon install-service --system   # system-scope unit (root; multi-user host)
```

Unit defaults:
- `Restart=on-failure`, `RestartSec=5s`, `StartLimitBurst=5`, `StartLimitIntervalSec=5min`
- `MemoryMax=<resource cap>`, `TasksMax=128`, `LimitNOFILE=4096`
- `StandardOutput/Error=journal`
- `NoNewPrivileges=yes`, `PrivateTmp=yes`, `ProtectSystem=strict`,
  `ProtectHome=read-only` with `ReadWritePaths=~/.claudemesh`
- For systemd `--user`, runs as the invoking user (no root needed).

`claudemesh install` (the existing setup verb) gains an opt-in prompt:
*"Install as a background service that always runs?"* Defaults differently
based on detected environment (TTY vs no-TTY, presence of systemd, etc.).

## 10. Observability

Standard CLI surface unchanged from v1, with the new gauges/counters:

```
cm_daemon_connected{mesh}                               0/1
cm_daemon_reconnects_total{mesh,reason}
cm_daemon_lag_ms{mesh}                                  last broker round-trip
cm_daemon_outbox_depth{mesh,status}                     pending|inflight|dead
cm_daemon_outbox_age_seconds{mesh}                      oldest pending row
cm_daemon_dedupe_total{mesh,direction}                  out|in
cm_daemon_disk_pct{mesh,kind}                           outbox|inbox
cm_daemon_send_total{mesh,kind,status}
cm_daemon_recv_total{mesh,kind,from_type}
cm_daemon_hook_invocations_total{hook,exit}
cm_daemon_hook_duration_seconds{hook}                   histogram
cm_daemon_hook_capability_calls_total{hook,scope}
cm_daemon_ipc_request_total{endpoint,status,transport}
cm_daemon_ipc_duration_seconds{endpoint}                histogram
cm_daemon_local_token_rotations_total
cm_daemon_clone_suspected_total
```

Tracing: optional OpenTelemetry export.

## 11. SDKs — three, slim, core-API only

Same shape as v1 but only target the **frozen core surface** (§3.1). State /
memory / vector / graph / tasks / MCP / skills are NOT in v0.9.0 SDKs — they
ship per capability gate.

Each SDK auto-discovers the daemon: reads `sock` path, `http.port`,
`local_token`. SDKs versioned in lockstep with the daemon's `/v1` surface.

## 12. Security model — explicit boundaries

| Boundary | Trust | Mechanism |
|---|---|---|
| App ↔ Daemon (UDS) | OS user, FS perms | UDS 0600 |
| App ↔ Daemon (TCP/SSE) | OS user + bearer token | 127.0.0.1 only + `local_token` + Origin/Host check |
| Hook ↔ Daemon | Capability scope | Short-lived capability token, never broker session |
| Daemon ↔ Broker | Mesh keypair | WSS + ed25519 hello + crypto_box DM + per-topic keys |
| Daemon ↔ Disk | OS user | All daemon files mode 0600/0644 under `~/.claudemesh/daemon/` |
| Cloned identity | Host fingerprint check | Daemon refuses to start; dashboard audit event |

## 13. Configuration

`config.toml` — same shape as v1 plus:
- `[capabilities]` (§3.2)
- `[fleet]` (§7.2)
- `[disk] reserved_bytes` (§4.4)
- `[clone] policy = "refuse" | "warn" | "allow"` (§2.2)

User-editable. `claudemesh daemon reload` re-reads it without dropping the WS.

## 14. Lifecycle — the operational flows v1 was missing

### 14.1 Key rotation

```
claudemesh daemon rotate-keypair
```

Mints fresh ed25519 + x25519. Registers new pubkey with broker as a `member_keypair_rotated` operation (broker associates new pubkey with same member id). Old pubkey is held server-side for 24h grace (decrypts in-flight messages encrypted to old pubkey), then revoked.

### 14.2 Local token rotation

```
claudemesh daemon rotate-token
```

Atomically writes a new `local_token` and keeps accepting the old one alongside
the new one for a 60s grace period. SDKs that already have the old token finish
in-flight requests; new requests use the new token. After 60s, the old token is
rejected.

### 14.3 Compromised host revocation

From the dashboard or another mesh-owner session:

```
claudemesh member revoke <pubkey>
```

Broker marks member as revoked. Connected daemon receives `member_revoked`
push, self-disables (refuses new IPC, closes WS), exits with non-zero status,
logs forensic event.

### 14.4 Image-clone lifecycle

Covered in §2.2. Three policies (`refuse`, `warn`, `allow` — settable per-host
via `config.toml`).

### 14.5 Backup & restore

```
claudemesh daemon backup --out <path>     # dumps keypair, config, schema_version
claudemesh daemon restore --in <path>     # writes them; refuses if a daemon is running
```

Backup is encrypted with a passphrase (Argon2id KDF + crypto_secretbox). The
intent: "I'm reformatting my laptop, I want my mesh memberships back without
re-joining." NOT for "deploy this same identity on 10 servers" (that's the
clone problem above).

### 14.6 Uninstall / reset

```
claudemesh daemon uninstall     # full purge: stops, deregisters from broker, wipes ~/.claudemesh/daemon/<slug>
claudemesh daemon reset         # wipes local state, keeps broker member registration (for restoring)
```

Uninstall calls broker's `POST /v1/me/members/:pubkey/leave` so member doesn't
linger as ghost. Reset is local-only, no broker contact.

### 14.7 Disk corruption recovery

```
claudemesh daemon recover       # interactive: integrity check + offer rebuild paths
```

Detects corrupt `outbox.db` / `inbox.db`. Options:
- Restore from local journal-only inbox (read-only mode; sends disabled).
- Wipe + rebuild from broker (fetches last N days of message history if
  available; topics need re-subscribe; outbox is irrecoverable, queued sends are
  lost).
- Wipe + start fresh.

## 15. Version compatibility

### 15.1 Negotiation handshake

On daemon connect to broker AND on every IPC request:

```
GET /v1/version
{
  "daemon_version": "0.9.0",
  "ipc_api": "v1",
  "ipc_minor": 3,                  # additive minor
  "schema_version": 7,
  "broker_protocol_min": "0.7",
  "broker_protocol_max": "0.9"
}
```

### 15.2 Compat policy

| Across | Policy |
|---|---|
| Daemon ↔ Broker | Daemon refuses to connect if broker version < daemon's `broker_protocol_min`. Broker logs warning. Pre-1.0 we may break this with notice; post-1.0 we maintain backward compat for ≥6 months. |
| CLI ↔ Daemon | CLI checks daemon's `ipc_api`. Same major = OK. Different major = CLI falls back to cold-path with warning. |
| SDK ↔ Daemon | SDK negotiates `ipc_minor`; uses minimum of (SDK's, daemon's). |
| Daemon binary ↔ schema | Binary refuses to start on unknown schema; migrations run forward-only; no automatic downgrade. |

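The first three rows of the policy table reduce to small pure functions, sketched below. All names are hypothetical, and the version parsing assumes plain dotted numeric strings such as `0.7` or `0.8.2` and `ipc_api` values of the form `v<major>`.

```python
from typing import Tuple


def parse_version(v: str) -> Tuple[int, ...]:
    """'0.8.2' -> (0, 8, 2). Assumes dotted integers only."""
    return tuple(int(part) for part in v.split("."))


def broker_acceptable(broker_version: str, broker_protocol_min: str) -> bool:
    """Daemon <-> Broker: refuse when the broker is older than the minimum."""
    return parse_version(broker_version) >= parse_version(broker_protocol_min)


def cli_route(cli_ipc_api: str, daemon_ipc_api: str) -> str:
    """CLI <-> Daemon: same major uses the daemon, otherwise cold path."""
    cli_major = cli_ipc_api.lstrip("v").split(".")[0]
    daemon_major = daemon_ipc_api.lstrip("v").split(".")[0]
    return "daemon" if cli_major == daemon_major else "cold_path"


def negotiate_minor(sdk_minor: int, daemon_minor: int) -> int:
    """SDK <-> Daemon: both sides speak the lower additive minor."""
    return min(sdk_minor, daemon_minor)
```

Because minors are additive, taking the minimum guarantees that neither side calls a feature the other has not shipped yet.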
### 15.3 Compatibility matrix (published in docs, machine-readable JSON at /v1/compat)

```json
{
  "daemon": "0.9.0",
  "compatible_brokers": ["0.7.x", "0.8.x", "0.9.x"],
  "compatible_clis":   ["0.9.x"],
  "compatible_sdks": {
    "python": ">=0.9.0,<1.0.0",
    "go":     ">=0.9.0,<1.0.0",
    "ts":     ">=0.9.0,<1.0.0"
  }
}
```

## 16. Threat model
|
||||
|
||||
### 16.1 Attacker classes
|
||||
|
||||
| Attacker | Has | Wants | Mitigations |
|---|---|---|---|
| Local same-user shell | OS user creds | Send / read mesh messages | None needed — they already have FS access to keypair; daemon is no worse |
| Local different-user shell | Different OS user | Read this user's daemon | UDS 0600 + TCP loopback + token. Requires OS exploit to escalate |
| Browser SSRF | Loopback HTTP | Send messages, read inbox | `local_token` + Origin/Host check + non-default port. SSRF without token cannot succeed |
| Container side-channel | Same loopback namespace | Read another container's daemon | Containers share host loopback only if explicitly net=host. `local_token` defends. Recommended: bind UDS only inside containers |
| Compromised hook | Capability token in env | Use that scope | Capability tokens are scoped + short-lived; cannot escalate |
| Compromised broker | Full mesh visibility on its side | Deliver malicious messages, identity-impersonate | E2E encryption (crypto_box DMs, per-topic keys) — broker can't read content. Out-of-scope for daemon |
| Cloned VM image | Same keypair on two hosts | Identity collision | Host fingerprint detection + dashboard audit + `--remint` flow |
| Stolen laptop | Disk access | Mesh impersonation forever | `member revoke` from dashboard. Without disk encryption, this is the user's laptop security; documented in security guide |
| Untrusted hook author | Hook script content | Exfil mesh data | Hook is on disk YOU control. If you ran `git pull` on a malicious hooks/ repo, that's a code-supply-chain attack out of scope for the daemon |
|
||||
|
||||
### 16.2 Out of scope
|
||||
|
||||
- Defending against an attacker with root on the daemon host. They can read
|
||||
`keypair.json` directly.
|
||||
- Defending against malicious peers in the same mesh sending malformed
|
||||
payloads. Daemon validates structure but trusts mesh members.
|
||||
- Defending against compromised broker. Out-of-scope for daemon; mesh-level
|
||||
E2E protects content but not metadata.
|
||||
|
||||
## 17. Migration — what changes for existing users
|
||||
|
||||
Same as v1. Additive. No DB migration on broker. Existing
|
||||
`~/.claudemesh/config.json` consumed unchanged. `claudemesh launch` keeps
|
||||
working; daemon is opt-in.
|
||||
|
||||
---
|
||||
|
||||
## What needs review (round 2)
|
||||
|
||||
Round 1 produced: identity model needs `--ephemeral` + clone-detect, IPC needs
|
||||
local token, "exactly-once" was a lie, hooks needed scoped credentials, surface
|
||||
needed shrinking, missing rotation/recovery/migration/threat-model.
|
||||
|
||||
This v2 attempts to address all of them. Specifically critique:
|
||||
|
||||
1. **Has the identity model fully closed the clone problem?** Refuses-on-fingerprint-mismatch
|
||||
plus broker audit plus mesh-owner revoke — does this catch a sophisticated
|
||||
attacker who copies `host_fingerprint.json` along with the keypair?
|
||||
2. **Is the local-token model sufficient for browser-SSRF defense?**
|
||||
Token + Origin + Host checks + 127.0.0.1-only. Anything else needed?
|
||||
3. **The delivery contract** (§4) — is it now defensible? Does the inflight-recovery
|
||||
semantics + idempotency-key propagation produce the guarantees claimed?
|
||||
4. **Hook capability tokens** (§6.2) — short-lived, scoped, expire on hook exit.
|
||||
Does this fully eliminate the exfil footgun? What capability scopes are
|
||||
actually needed for v0.9.0 hooks?
|
||||
5. **Frozen v0.9.0 surface** (§3.1) — is the cut right? Should `peer list` be
|
||||
in core or capability-gated? Should `inbox/search` ship in v0.9.0?
|
||||
6. **Threat model** (§16) — anything missing? Specifically thinking about CI
|
||||
environments where the daemon's host is a fleet shared across many users'
|
||||
builds.
|
||||
7. **Lifecycle flows** (§14) — image clones, key rotation, host moves, disk
|
||||
corruption, uninstall semantics. Anything still missing?
|
||||
8. **Version compat** (§15) — is the negotiation handshake sufficient, or do
|
||||
we need stronger guarantees (e.g. semver-strict, or a feature-bit
|
||||
negotiation rather than version numbers)?
|
||||
|
||||
Score 1–5 each. Top 3 changes you'd insist on for v3, if any. If you think v2
|
||||
is shippable, say so explicitly — over-engineering is a real risk.
|
||||
648
.artifacts/shipped/2026-05-03-daemon-final-spec-v3.md
Normal file
@@ -0,0 +1,648 @@
|
||||
# `claudemesh daemon` — Final Spec v3
|
||||
|
||||
> **Round 3.** v2 of this spec was reviewed by another model and pushed back on
|
||||
> identity/clone semantics (boot-id false-positives), delivery contract (broker
|
||||
> must dedupe on client-supplied id — protocol change), CI shared-runner threat
|
||||
> model, version negotiation (need feature bits, not ranges), key rotation
|
||||
> crypto, hook scope granularity, inbox schema correctness, and ~7 smaller
|
||||
> polish items. v3 incorporates all of them.
|
||||
>
|
||||
> **The intent §0 from v2 is unchanged and still authoritative — read it
|
||||
> there.** v3 only revises what changed.
|
||||
|
||||
---
|
||||
|
||||
## 0. Intent — unchanged, see v2 §0
|
||||
|
||||
Pre-launch peer-mesh runtime. Servers/laptops become first-class peers.
|
||||
Stable identity, persistent WS, local IPC, hooks. Not a webhook gateway, not
|
||||
a generic broker. We can break anything.
|
||||
|
||||
**One claim retracted from v1/v2**: "exactly-once" delivery. Replaced with a
|
||||
precise contract in §4 below.
|
||||
|
||||
---
|
||||
|
||||
## 1. Process model — same as v2 §1
|
||||
|
||||
Resource caps, file layout, single-binary unchanged.
|
||||
|
||||
---
|
||||
|
||||
## 2. Identity — accidental-clone detection only, plus broker dedupe
|
||||
|
||||
Codex was right: v2's clone detection was both too weak (anyone copying
|
||||
`host_fingerprint.json` along with `keypair.json` defeats it) and too noisy
|
||||
(boot-id flips every reboot → false-positives on every legitimate restart).
|
||||
|
||||
### 2.1 Modes
|
||||
|
||||
```
claudemesh daemon up                        # default: persistent member
claudemesh daemon up --ephemeral            # in-memory keypair, never written
claudemesh daemon up --ephemeral --ttl 2h   # auto-shutdown after duration
```
|
||||
|
||||
**CI auto-detection** (NEW): if any of the following env vars are set
|
||||
(`CI=true`, `GITHUB_ACTIONS`, `GITLAB_CI`, `BUILDKITE`, `CIRCLECI`,
|
||||
`JENKINS_URL`, `RUNPOD_POD_ID`, `KUBERNETES_SERVICE_HOST`), AND `--persistent`
|
||||
is not explicitly passed, daemon defaults to `--ephemeral`. Rationale in §16.
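
As a rough sketch of the auto-detection rule above: the variable list is taken from the text, while the function names and the reading of "set" as "present and non-empty" are assumptions made for illustration.

```python
from typing import Mapping

# Variables named in the text above; `CI` is matched against the literal "true".
CI_MARKERS = (
    "GITHUB_ACTIONS", "GITLAB_CI", "BUILDKITE", "CIRCLECI",
    "JENKINS_URL", "RUNPOD_POD_ID", "KUBERNETES_SERVICE_HOST",
)


def looks_like_ci(env: Mapping[str, str]) -> bool:
    if env.get("CI", "").lower() == "true":
        return True
    # Assumption: "set" means present and non-empty.
    return any(env.get(name) for name in CI_MARKERS)


def identity_mode(env: Mapping[str, str], persistent: bool = False,
                  ephemeral: bool = False) -> str:
    """Explicit flags win; otherwise CI-like environments default to ephemeral."""
    if ephemeral:
        return "ephemeral"
    if persistent:
        return "persistent"
    return "ephemeral" if looks_like_ci(env) else "persistent"
```

In real use `env` would be `os.environ`; passing a plain mapping keeps the rule easy to test.
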
|
||||
|
||||
### 2.2 Accidental-clone detection (NOT attacker-grade)
|
||||
|
||||
Frame change: this catches **image clones, restored backups, copy-pasted
|
||||
homedirs** — accidents made by humans operating at human speed. It does not
|
||||
defend against an attacker who copies both `keypair.json` and
|
||||
`host_fingerprint.json`. The threat model (§16) says this explicitly.
|
||||
|
||||
Persisted fingerprint = `sha256(machine-id || first-stable-mac)`. Notably:
|
||||
- **No boot-id** — that flips on every reboot and would false-positive
|
||||
every legitimate restart.
|
||||
- **No hostname** — laptops legitimately rename themselves.
|
||||
- **`first-stable-mac`** = MAC of the lexicographically first non-loopback,
|
||||
non-virtual interface present at first daemon boot. Frozen at first run;
|
||||
not recomputed.
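
A minimal sketch of the fingerprint computation described above, reading `||` as byte concatenation. The normalization (strip, lowercase MAC, UTF-8 encoding), the virtual-interface prefix list, and both function names are assumptions; the spec fixes only the inputs and the hash.

```python
import hashlib
from typing import Mapping

# Hypothetical heuristic for "loopback or virtual" interface names.
VIRTUAL_PREFIXES = ("lo", "docker", "veth", "br-", "virbr", "tun", "tap")


def pick_first_stable_mac(interfaces: Mapping[str, str]) -> str:
    """Return the MAC of the lexicographically first eligible interface name."""
    candidates = sorted(n for n in interfaces if not n.startswith(VIRTUAL_PREFIXES))
    if not candidates:
        raise ValueError("no non-loopback, non-virtual interface found")
    return interfaces[candidates[0]]


def host_fingerprint(machine_id: str, first_stable_mac: str) -> str:
    """sha256(machine-id || first-stable-mac), hex-encoded."""
    h = hashlib.sha256()
    h.update(machine_id.strip().encode("utf-8"))
    h.update(first_stable_mac.strip().lower().encode("utf-8"))
    return h.hexdigest()
```

Because the MAC is frozen at first run, a real daemon would persist the chosen MAC (or the resulting digest) instead of calling `pick_first_stable_mac` again on later boots.
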
|
||||
|
||||
Behavior on mismatch:
|
||||
- Default policy: refuse to start. Print: *"This keypair was created on a
|
||||
different host. If you legitimately moved hardware, run
|
||||
`claudemesh daemon accept-host` (writes a fresh fingerprint, keeps keypair).
|
||||
If this is a clone of an existing daemon, run `claudemesh daemon remint`
|
||||
(mints fresh keypair, registers as a new member)."*
|
||||
- `[clone] policy = "refuse" | "warn" | "allow"` overrides per host.
|
||||
|
||||
### 2.3 Concurrent-duplicate-identity broker policy (NEW — protocol change)
|
||||
|
||||
When the broker receives two WS connections claiming the same member pubkey:
|
||||
|
||||
- **`prefer_newest`** (default): older connection is closed with code 4003
|
||||
`replaced_by_newer_connection`. New connection takes over presence/inbox
|
||||
delivery. Daemon-side: receives the close code, logs forensic event, exits
|
||||
with non-zero status (lets supervisor restart it; if the *other* host is
|
||||
the legitimate one, supervisor restart-loops are noisy enough to alert).
|
||||
- **`prefer_oldest`**: new connection is rejected with code 4004
|
||||
`member_already_connected`. The new daemon refuses to start.
|
||||
- **`allow_concurrent`** (new mode, server-side feature flag): both
|
||||
connections accepted; broker tracks both as sibling sessions of the same
|
||||
member (same model as `claudemesh launch` siblings today). Useful when a
|
||||
user really does want one keypair on multiple hosts (e.g. failover pairs).
|
||||
|
||||
Configured per-mesh in `mesh.cloneConcurrencyPolicy`. Default:
|
||||
`prefer_newest`. Broker emits `member_concurrent_connection` audit event in
|
||||
all cases.
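
The three policies could be modelled as a small decision function on the broker side. The close codes, reasons, and audit event name come from the text above; the `Decision` shape and `on_connect` name are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple

CLOSE_REPLACED = (4003, "replaced_by_newer_connection")
CLOSE_REJECTED = (4004, "member_already_connected")


class Decision(NamedTuple):
    accept_new: bool
    close_old: Optional[Tuple[int, str]]   # (code, reason) sent to the older connection
    reject_new: Optional[Tuple[int, str]]  # (code, reason) sent to the newer connection
    audit_event: Optional[str]


def on_connect(policy: str, already_connected: bool) -> Decision:
    if not already_connected:
        return Decision(True, None, None, None)
    audit = "member_concurrent_connection"  # emitted for every duplicate
    if policy == "prefer_newest":
        return Decision(True, CLOSE_REPLACED, None, audit)
    if policy == "prefer_oldest":
        return Decision(False, None, CLOSE_REJECTED, audit)
    if policy == "allow_concurrent":
        return Decision(True, None, None, audit)
    raise ValueError(f"unknown cloneConcurrencyPolicy: {policy!r}")
```
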
|
||||
|
||||
### 2.4 Rename, key rotation — see §14
|
||||
|
||||
---
|
||||
|
||||
## 3. IPC surface — frozen core, hardened auth
|
||||
|
||||
### 3.1 Frozen core (v0.9.0) — slight cut from v2
|
||||
|
||||
Codex agreed v2's cut was mostly right, except: defer FTS-search to a
|
||||
capability gate, keep `peer list` in core, drop redundancies.
|
||||
|
||||
```
# Messaging
POST /v1/send                {to, message, priority?, meta?, replyToId?,
                              client_message_id?}
POST /v1/topic/post          {topic, message, priority?, mentions?,
                              client_message_id?}
POST /v1/topic/subscribe     {topic}
POST /v1/topic/unsubscribe   {topic}
GET  /v1/topic/list
GET  /v1/inbox               ?since=<iso>&topic=<n>&from=<peer>&limit=<n>
                             # plain SQL paging; NO FTS in v0.9.0

# Peers + presence (kept in core — central to "first-class peer")
GET  /v1/peers               ?mesh=<slug>
POST /v1/profile             {summary?, status?, visible?}

# Files (already production)
POST /v1/file/share          {path, to?, message?, persistent?}
GET  /v1/file/get            ?id=<fileId>&out=<path>
GET  /v1/file/list

# Events — push
GET  /v1/events              text/event-stream
  core events: message, peer_join, peer_leave, file_shared,
               daemon_disconnect, daemon_reconnect, hook_executed,
               feature_negotiation_failed

# Control plane
GET  /v1/health              (auth required by default — see §3.3)
GET  /v1/metrics             (auth required by default)
GET  /v1/version             (auth required by default)
POST /v1/heartbeat           {}
```
|
||||
|
||||
`inbox/search` with FTS deferred to v0.9.x capability gate `inbox_fts`.
|
||||
|
||||
### 3.2 Capability-gated future surface (v0.9.x)
|
||||
|
||||
Same as v2 §3.2 — state, memory, vector, graph, tasks, scheduling,
|
||||
mcp_host, skill_share, plus new `inbox_fts`. None enabled in v0.9.0.
|
||||
|
||||
### 3.3 Local IPC authentication — tightened
|
||||
|
||||
Same shape as v2 §3.3 but with codex's polish folded in:
|
||||
|
||||
| Transport | Auth | Notes |
|---|---|---|
| UDS | None (FS perms 0600) | Reaching socket = same UID |
| TCP loopback | `Authorization: Bearer <local_token>` REQUIRED | 127.0.0.1 only |
| SSE | `Authorization: Bearer <local_token>` REQUIRED | same |
|
||||
|
||||
**Token plumbing rules (NEW):**
|
||||
- `local_token` MUST be in the `Authorization` header. **Never** accepted in
|
||||
query string. Endpoint that sees a `?token=...` query param logs a security
|
||||
event and returns 400.
|
||||
- `local_token` MUST be redacted from access logs (`Authorization: Bearer
|
||||
***` in logs).
|
||||
- `local_token` rotation atomically writes a new file; the daemon keeps
  accepting the OLD token for a 60s grace period so SDKs can reload, then
  rejects it.
|
||||
|
||||
**Endpoint default auth (NEW — codex):**
|
||||
- Every IPC endpoint requires the local token by default, **including**
|
||||
`/v1/health`, `/v1/metrics`, `/v1/version`. `[ipc] public_health_check =
|
||||
true` opts in to public `/v1/health` for k8s probes etc.
|
||||
|
||||
**Container default (NEW — codex):**
|
||||
- If `KUBERNETES_SERVICE_HOST` is set OR `/.dockerenv` exists OR
|
||||
`/proc/1/cgroup` indicates a container OR explicit `--container` flag,
|
||||
daemon defaults to **UDS-only** (`[ipc] tcp_enabled = false`). Containers
|
||||
share host loopback when `network_mode: host`; UDS-only avoids the
|
||||
side-channel.
|
||||
|
||||
**Origin/Host policy:**
|
||||
- `Host` header must be `localhost`, `127.0.0.1`, `[::1]` or empty. Else 403.
|
||||
- `Origin` header: explicit allowlist (default: empty). SSRF-from-browser
|
||||
bounce-attack defense.
|
||||
- `User-Agent` requirement DROPPED (codex called it theatre — correct).
|
||||
- CORS: never echo `Access-Control-Allow-Origin`; preflight returns 403.
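
Taken together, the token, Host, and Origin rules for the TCP transport might look like the check below. This is a sketch under stated assumptions: the order of checks, the handling of a port suffix in `Host`, the 401 status for a bad token, and the function names are not specified in the text.

```python
import hmac
from typing import Mapping, Tuple
from urllib.parse import parse_qs, urlsplit

ALLOWED_HOSTS = {"localhost", "127.0.0.1", "[::1]", ""}


def _host_without_port(host: str) -> str:
    host = host.strip().lower()
    if host.startswith("["):                  # bracketed IPv6, e.g. "[::1]:7777"
        end = host.find("]")
        return host[: end + 1] if end != -1 else host
    return host.split(":", 1)[0]


def check_tcp_request(target: str, headers: Mapping[str, str], local_token: str,
                      origin_allowlist: frozenset = frozenset()) -> Tuple[int, str]:
    h = {k.lower(): v for k, v in headers.items()}
    # A token in the query string is always rejected (and should be logged).
    if "token" in parse_qs(urlsplit(target).query, keep_blank_values=True):
        return 400, "token_in_query_string"
    if _host_without_port(h.get("host", "")) not in ALLOWED_HOSTS:
        return 403, "host_not_allowed"
    origin = h.get("origin")
    if origin is not None and origin not in origin_allowlist:
        return 403, "origin_not_allowed"
    supplied = h.get("authorization", "")
    expected = f"Bearer {local_token}"
    # Constant-time comparison; the 401 status is an assumption.
    if not hmac.compare_digest(supplied.encode(), expected.encode()):
        return 401, "bad_or_missing_token"
    return 200, "ok"
```

With the default empty allowlist, any request carrying an `Origin` header is refused, which matches the browser-bounce defense described above.
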
|
||||
|
||||
### 3.4 Request limits & backpressure — same as v2
|
||||
|
||||
---
|
||||
|
||||
## 4. Delivery contract — at-least-once, broker-dedupes-on-client-id
|
||||
|
||||
Codex caught the real protocol gap: idempotency only works if the broker
|
||||
dedupes on the **caller's** id, not its own. This requires a broker change.
|
||||
|
||||
### 4.1 The contract (precise)
|
||||
|
||||
> **Local guarantee**: each successful `POST /v1/send` returns a stable
|
||||
> `client_message_id`. The send is durably persisted to `outbox.db` before
|
||||
> the response returns.
|
||||
>
|
||||
> **Broker guarantee**: the broker dedupes on `client_message_id` for a
|
||||
> 24h window. Multiple inflight retries from the daemon for the same
|
||||
> `client_message_id` produce **at most one** broker-accepted row.
|
||||
>
|
||||
> **End-to-end guarantee**: at-least-once delivery to subscribers, with
|
||||
> `client_message_id` propagated in the inbound envelope so receivers can
|
||||
> dedupe locally on their side. We do **not** guarantee at-most-once
|
||||
> end-to-end — that requires receiver-side dedupe, which the daemon's
|
||||
> inbox.db provides for daemon-hosted peers.
|
||||
|
||||
### 4.2 Daemon-supplied `client_message_id` (NEW — broker protocol change)
|
||||
|
||||
Every send has a stable id fixed **on the daemon side**, never by the broker. Precedence:
|
||||
- Caller-supplied via `Idempotency-Key` header → wins.
|
||||
- Caller-supplied in body as `client_message_id` field → second.
|
||||
- Else daemon mints a `ulid` → last.
|
||||
|
||||
The id is:
|
||||
- Returned in the IPC response.
|
||||
- Stored in `outbox.db` as a UNIQUE NOT NULL column (real dedupe, not
|
||||
`INSERT OR IGNORE` on nullable — codex caught this).
|
||||
- Propagated to the broker on every retry (`client_message_id` field in the
|
||||
WS send envelope and in `POST /v1/messages`).
|
||||
- Stored in the broker's `meshTopicMessage.client_message_id` column with a
|
||||
`UNIQUE` constraint scoped to `(meshId, client_message_id)`.
|
||||
- Propagated in the inbound delivery to receivers' inboxes.
|
||||
|
||||
**Broker behavior on duplicate `client_message_id`**: returns the
|
||||
already-stored `messageId` and `historyId` from the prior insertion. No new
|
||||
row, no new fan-out, idempotent.
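
The precedence for choosing the id can be sketched as below. The function name is hypothetical, and because the Python standard library has no ULID generator, a UUID4 hex string stands in for the `ulid` the daemon would mint.

```python
import uuid
from typing import Callable, Mapping


def resolve_client_message_id(
    headers: Mapping[str, str],
    body: Mapping[str, object],
    mint: Callable[[], str] = lambda: uuid.uuid4().hex,  # stand-in for a ULID
) -> str:
    header_id = headers.get("Idempotency-Key")
    if header_id:                       # 1. header wins
        return header_id
    body_id = body.get("client_message_id")
    if body_id:                         # 2. body field is second
        return str(body_id)
    return mint()                       # 3. daemon mints one as a last resort


print(resolve_client_message_id({"Idempotency-Key": "k-1"},
                                {"client_message_id": "b-1"}))  # k-1
```
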
|
||||
|
||||
### 4.3 Broker schema delta (NEW)
|
||||
|
||||
```sql
ALTER TABLE mesh.topic_message
  ADD COLUMN client_message_id TEXT;
ALTER TABLE mesh.message_queue
  ADD COLUMN client_message_id TEXT;

CREATE UNIQUE INDEX topic_message_client_id_idx
  ON mesh.topic_message(mesh_id, client_message_id)
  WHERE client_message_id IS NOT NULL;
CREATE UNIQUE INDEX message_queue_client_id_idx
  ON mesh.message_queue(mesh_id, client_message_id)
  WHERE client_message_id IS NOT NULL;
```
|
||||
|
||||
Partial unique index — legacy traffic without `client_message_id` (from
|
||||
`claudemesh launch`, dashboard chat, web posts) is unaffected.
|
||||
|
||||
### 4.4 Outbox schema (corrected)
|
||||
|
||||
```sql
CREATE TABLE outbox (
  id                 TEXT PRIMARY KEY,       -- ulid (local row id)
  client_message_id  TEXT NOT NULL UNIQUE,   -- propagated to broker
  payload            BLOB NOT NULL,
  enqueued_at        INTEGER NOT NULL,
  attempts           INTEGER DEFAULT 0,
  next_attempt_at    INTEGER NOT NULL,
  status             TEXT CHECK(status IN ('pending','inflight','done','dead')),
  last_error         TEXT,
  delivered_at       INTEGER,
  broker_message_id  TEXT                    -- set on ACK
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
```
|
||||
|
||||
`UNIQUE NOT NULL` on `client_message_id`: caller retries with the same id
|
||||
collide locally and become a no-op.
|
||||
|
||||
### 4.5 Inbox schema (corrected — content table + FTS index)
|
||||
|
||||
Codex caught: FTS5 virtual tables are not where you put `CREATE INDEX`.
|
||||
Real shape:
|
||||
|
||||
```sql
-- Content table — the durable store
CREATE TABLE inbox (
  id                 TEXT PRIMARY KEY,       -- ulid (local row id)
  client_message_id  TEXT NOT NULL UNIQUE,   -- dedupe key
  broker_message_id  TEXT,
  mesh               TEXT NOT NULL,
  topic              TEXT,
  sender_pubkey      TEXT NOT NULL,
  sender_name        TEXT NOT NULL,
  body               TEXT,
  meta               TEXT,                   -- JSON
  received_at        INTEGER NOT NULL,
  reply_to_id        TEXT
);
CREATE INDEX inbox_received_at ON inbox(received_at);
CREATE INDEX inbox_topic       ON inbox(topic);
CREATE INDEX inbox_sender      ON inbox(sender_pubkey);

-- FTS5 index — gated behind capability `inbox_fts` (deferred to v0.9.x)
-- When enabled, populated via triggers; absent in v0.9.0.
```
|
||||
|
||||
Insert path: `INSERT INTO inbox(...) ON CONFLICT(client_message_id) DO
|
||||
NOTHING RETURNING id`. The `RETURNING` clause tells us whether a new row
|
||||
landed; only new rows trigger hooks.
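
A minimal sketch of that insert path using Python's `sqlite3`, assuming a reduced subset of the §4.5 columns and a UUID4 hex string in place of a ULID. `RETURNING` needs SQLite 3.35 or newer, so the sketch falls back to `rowcount` on older builds; `store_inbound` is a hypothetical name.

```python
import sqlite3
import time
import uuid

# RETURNING needs SQLite >= 3.35; older builds fall back to rowcount.
HAS_RETURNING = sqlite3.sqlite_version_info >= (3, 35, 0)

INSERT_SQL = """
INSERT INTO inbox (id, client_message_id, mesh, sender_pubkey, sender_name,
                   body, received_at)
VALUES (?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(client_message_id) DO NOTHING
"""


def store_inbound(db: sqlite3.Connection, msg: dict) -> bool:
    """Insert one inbound message; return True only if a new row landed."""
    params = (uuid.uuid4().hex, msg["client_message_id"], msg["mesh"],
              msg["sender_pubkey"], msg["sender_name"], msg.get("body"),
              int(time.time() * 1000))
    if HAS_RETURNING:
        # fetchall() runs the statement to completion before the commit.
        inserted = len(db.execute(INSERT_SQL + " RETURNING id", params).fetchall()) == 1
    else:
        inserted = db.execute(INSERT_SQL, params).rowcount == 1
    db.commit()
    return inserted


db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE inbox (
    id TEXT PRIMARY KEY, client_message_id TEXT NOT NULL UNIQUE,
    mesh TEXT NOT NULL, sender_pubkey TEXT NOT NULL, sender_name TEXT NOT NULL,
    body TEXT, received_at INTEGER NOT NULL)""")
msg = {"client_message_id": "cm-1", "mesh": "demo", "sender_pubkey": "pk",
       "sender_name": "alice", "body": "hi"}
for _ in range(2):  # the second delivery is a duplicate
    print("run hooks" if store_inbound(db, msg) else "duplicate: skip hooks")
```
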
|
||||
|
||||
### 4.6 Crash recovery — explicit semantics
|
||||
|
||||
On daemon startup:
|
||||
1. Rows in `inflight` reset to `pending` with `attempts++`,
   `next_attempt_at = now + min_backoff`. **Note:** these may be sent twice
   if the broker actually accepted before the local ACK persisted. The
   `client_message_id` propagation ensures the broker dedupes the retry;
   net result: exactly one broker-accepted row, possibly two daemon-side
   `inflight → done` transitions.
2. `outbox.db` PRAGMA integrity_check; failure → daemon refuses to start and
   points at `claudemesh daemon recover`.
|
||||
3. `inbox.db` integrity check; failure → move to `inbox.db.corrupt-<ts>`,
|
||||
create fresh empty inbox, log `inbox_corruption_recovered`. Inbox is a
|
||||
cache; recoverable from broker history.
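
The outbox half of these startup steps might be sketched as follows. Running the integrity check before the reset, millisecond timestamps, the `min_backoff_ms` default, and the function name are all assumptions made for illustration.

```python
import sqlite3


def recover_outbox(db: sqlite3.Connection, now_ms: int,
                   min_backoff_ms: int = 1_000) -> int:
    """Check integrity, then requeue rows that were inflight at crash time."""
    result = db.execute("PRAGMA integrity_check").fetchall()[0][0]
    if result != "ok":
        raise RuntimeError("outbox.db is corrupt; run `claudemesh daemon recover`")
    cur = db.execute(
        "UPDATE outbox SET status = 'pending', attempts = attempts + 1, "
        "next_attempt_at = ? WHERE status = 'inflight'",
        (now_ms + min_backoff_ms,),
    )
    db.commit()
    return cur.rowcount  # number of rows that will be retried


db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE outbox (
    id TEXT PRIMARY KEY, client_message_id TEXT NOT NULL UNIQUE,
    payload BLOB NOT NULL, enqueued_at INTEGER NOT NULL,
    attempts INTEGER DEFAULT 0, next_attempt_at INTEGER NOT NULL,
    status TEXT CHECK(status IN ('pending','inflight','done','dead')),
    last_error TEXT, delivered_at INTEGER, broker_message_id TEXT)""")
db.execute("INSERT INTO outbox (id, client_message_id, payload, enqueued_at, "
           "next_attempt_at, status) VALUES ('a', 'cm-1', x'00', 0, 0, 'inflight')")
db.execute("INSERT INTO outbox (id, client_message_id, payload, enqueued_at, "
           "next_attempt_at, status) VALUES ('b', 'cm-2', x'00', 0, 0, 'done')")
db.commit()
requeued = recover_outbox(db, now_ms=10_000)
print(requeued)  # 1
```

The retried row keeps its `client_message_id`, which is what lets the broker collapse a retry of an already-accepted send.
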
|
||||
|
||||
### 4.7 Failure modes the spec is honest about
|
||||
|
||||
- **Broker dedupe window expired**: daemon retries a 25h-old send. Broker
  accepts again as if new (no dedupe). This was possible because the
  previous outbox `max_age_hours` default (168h = 7d) was longer than the
  broker dedupe window (24h). The default `max_age_hours` is now REDUCED to
  **23h** to stay inside the broker dedupe window. It can be raised only if
  the operator accepts the risk explicitly.
|
||||
- **`dead` rows**: surface in `claudemesh daemon outbox --failed`. User
|
||||
manually requeues (`outbox requeue <id>`) or drops (`outbox drop <id>`).
|
||||
- **Receiver-side dedupe failure**: only daemon-hosted receivers dedupe.
|
||||
`claudemesh launch` and dashboard chat clients DO NOT dedupe today —
|
||||
fixing them is post-v0.9.0.
|
||||
|
||||
---
|
||||
|
||||
## 5. Inbound — schema corrected (see §4.5), retention as v2
|
||||
|
||||
30-day rolling retention (configurable). Weekly VACUUM.
|
||||
`claudemesh daemon search` deferred to `inbox_fts` capability.
|
||||
|
||||
---
|
||||
|
||||
## 6. Hooks — scopes tightened, exfiltration acknowledged
|
||||
|
||||
Codex was right: capability tokens removed the broad-token footgun, not
exfiltration. Relying on `network_policy=deny` to contain an untrusted hook
payload is not reliable across platforms. The spec is now honest about that.
|
||||
|
||||
### 6.1 Hooks contract — same shape as v2 §6, with tighter defaults
|
||||
|
||||
### 6.2 Capability scopes — narrowed for v0.9.0
|
||||
|
||||
Codex pushed: scopes were too coarse. v0.9.0 scopes are exactly:
|
||||
|
||||
| Scope | Capability | Notes |
|---|---|---|
| `reply:event` | Reply to the specific event that triggered this hook | Bound to `event_id`; daemon validates target; expires on hook exit |
| `dm:send:<sender_pubkey>` | Send DM only to the specific sender | Bound to one pubkey from event; does not grant writes to anyone else |
| `topic:<name>:post` | Post to the specific topic that fired | Bound to topic from event; can't write elsewhere |
|
||||
|
||||
**No read scopes in v0.9.0.** A hook cannot read state, inbox, peers, etc.
|
||||
If a hook wants to consult mesh data to compose its reply, it does so via
|
||||
the *event payload* (which the daemon redacted appropriately) or via shell
|
||||
out to a fresh `claudemesh <verb>` call (which uses the user's existing
|
||||
config and is subject to its own auth). No daemon-mediated read tokens.
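
One way to picture the three scopes is as strings minted from the triggering event and checked on every hook action. The event field names (`event_id`, `sender_pubkey`, `topic`), the `HookToken` type, and the decision to grant all applicable scopes at once are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import FrozenSet, Mapping


@dataclass(frozen=True)
class HookToken:
    event_id: str
    scopes: FrozenSet[str]


def mint_hook_token(event: Mapping[str, str]) -> HookToken:
    scopes = {"reply:event", f"dm:send:{event['sender_pubkey']}"}
    if event.get("topic"):
        scopes.add(f"topic:{event['topic']}:post")
    return HookToken(event["event_id"], frozenset(scopes))


def is_allowed(token: HookToken, action: str, target: str) -> bool:
    if action == "reply":        # target is the event being replied to
        return "reply:event" in token.scopes and target == token.event_id
    if action == "dm":           # target is a recipient pubkey
        return f"dm:send:{target}" in token.scopes
    if action == "topic_post":   # target is a topic name
        return f"topic:{target}:post" in token.scopes
    return False                 # no read scopes exist in v0.9.0


token = mint_hook_token({"event_id": "e1", "sender_pubkey": "pkA", "topic": "ops"})
print(is_allowed(token, "dm", "pkA"))      # True
print(is_allowed(token, "dm", "pkB"))      # False
print(is_allowed(token, "read_inbox", ""))  # False
```
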
|
||||
|
||||
### 6.3 Sandboxing — supported, not promised
|
||||
|
||||
Codex caught: "network_policy=deny" sounds reliable but isn't cross-platform.
|
||||
Spec now says explicitly:
|
||||
|
||||
- `network_policy = "deny"` is **best-effort**:
|
||||
- Linux: enforced via `unshare --net` if available; else firewall rule via
|
||||
`iptables -m owner` if available; else daemon logs warning that policy
|
||||
cannot be enforced and the hook STILL runs.
|
||||
- macOS: enforced via `sandbox-exec` profile if available; else warning + run.
|
||||
- Windows: not enforced; warning + run.
|
||||
- Operators on hostile networks should set `enabled = false` for hooks they
|
||||
don't trust.
|
||||
- Daemon `cm_daemon_hook_unenforceable_total` counter exposes the count of
|
||||
hooks that ran with weakened sandbox.
|
||||
|
||||
### 6.4 Payload size & truncation — NEW
|
||||
|
||||
Stdin payloads to hooks capped at 256 KB (configurable). Larger payloads
|
||||
truncated with `_truncated: true` flag in the JSON event. Hook stdout
|
||||
captured up to `output_size_limit` (default 64 KB).
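
The text does not say which part of an oversized event is cut. The sketch below assumes the message `body` field is shortened until the encoded JSON fits, and sets the `_truncated` flag named above; `build_hook_stdin` is a hypothetical name.

```python
import json

STDIN_LIMIT = 256 * 1024  # bytes; configurable per the text above


def build_hook_stdin(event: dict, limit: int = STDIN_LIMIT) -> bytes:
    raw = json.dumps(event).encode("utf-8")
    if len(raw) <= limit:
        return raw
    truncated = dict(event, _truncated=True)
    if "body" in event:
        body = str(event["body"])
        lo, hi = 0, len(body)
        while lo < hi:  # binary search for the longest body prefix that fits
            mid = (lo + hi + 1) // 2
            truncated["body"] = body[:mid]
            if len(json.dumps(truncated).encode("utf-8")) <= limit:
                lo = mid
            else:
                hi = mid - 1
        truncated["body"] = body[:lo]
    # If the other fields alone exceed the limit, the result is still too
    # large; a real implementation would need a policy for that case.
    return json.dumps(truncated).encode("utf-8")
```

Searching on the encoded length, not the character count, accounts for JSON escaping, which can make a string several times longer once serialized.
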
|
||||
|
||||
### 6.5 Audit log + killpg — same as v2
|
||||
|
||||
---
|
||||
|
||||
## 7. Multi-mesh — same as v2 §7
|
||||
|
||||
---
|
||||
|
||||
## 8. Auto-routing — same as v2 §8 (codex agreed it was clarified correctly)
|
||||
|
||||
---
|
||||
|
||||
## 9. Service installation — same as v2 §9
|
||||
|
||||
Add: when `claudemesh daemon install-service` runs in CI-detected
|
||||
environment, prints `Refusing to install persistent service in CI; ephemeral
|
||||
mode only.` and exits non-zero unless `--allow-ci-persistent` is passed.
|
||||
|
||||
---
|
||||
|
||||
## 10. Observability — same as v2 §10
|
||||
|
||||
Add metric: `cm_daemon_hook_unenforceable_total{hook,reason}` (§6.3).
|
||||
|
||||
---
|
||||
|
||||
## 11. SDKs — same shape as v2, bound to frozen core only
|
||||
|
||||
---
|
||||
|
||||
## 12. Security model — same boundaries, plus dedupe + feature negotiation
|
||||
|
||||
| Boundary | Trust | Mechanism |
|---|---|---|
| App ↔ Daemon (UDS) | OS user | UDS 0600 |
| App ↔ Daemon (TCP/SSE) | OS user + bearer token | 127.0.0.1 + `local_token` + Origin/Host |
| Hook ↔ Daemon | Capability scope | Short-lived token bound to event; no read scopes |
| Daemon ↔ Broker | Mesh keypair + feature bits | WSS + ed25519 + crypto_box + per-topic keys + feature negotiation (§15) |
| Daemon ↔ Disk | OS user | All files 0600/0644 |
| Cloned identity | First-mac fingerprint | Accidental-clone detection only; broker concurrent-connection policy in §2.3 |
|
||||
|
||||
---
|
||||
|
||||
## 13. Configuration — same shape as v2 §13, plus `[features]`
|
||||
|
||||
```toml
|
||||
[features]
|
||||
require = ["client_message_id_dedupe", "concurrent_connection_policy"]
|
||||
optional = ["mesh_skill_share", "mcp_host"]
|
||||
# Daemon refuses to start if broker doesn't advertise all `require` bits.
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 14. Lifecycle — key rotation crypto fixed
|
||||
|
||||
### 14.1 Key rotation (CORRECTED — codex)
|
||||
|
||||
v2 said: *"old pubkey held server-side for 24h grace (decrypts in-flight
|
||||
messages encrypted to old pubkey)"*. **Wrong** — only the daemon has the
|
||||
private key. Broker can't decrypt.
|
||||
|
||||
Real semantics:
|
||||
|
||||
- `claudemesh daemon rotate-keypair` mints fresh ed25519 + x25519, registers
|
||||
the new pubkey with the broker as `member_keypair_rotated`.
|
||||
- Broker associates the new pubkey with the same member id, marks the old
|
||||
pubkey as `rotated_out` (not revoked).
|
||||
- **Daemon-side**: the OLD x25519 private key is retained in
|
||||
`keypair-archive.json` (mode 0600, durable) for a `key_grace_period`
|
||||
(default 7 days). During the grace window, daemon will attempt to decrypt
|
||||
inbound messages with the new private key first, falling back to archived
|
||||
keys (one or more). Messages encrypted to the old pubkey by senders who
|
||||
haven't yet seen the rotation event continue to decrypt cleanly.
|
||||
- After the grace period, archived keys are zeroed and the file is deleted.
|
||||
Messages encrypted to a stale pubkey after the grace window fail to
|
||||
decrypt and are logged as `cm_daemon_decrypt_stale_total`.
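
The decrypt order during the grace window can be sketched independently of the cipher. Below, `decrypt` is a caller-supplied stand-in for the real crypto_box open operation, and `toy_decrypt` is a deliberately fake cipher used only to exercise the control flow; all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

GRACE_SECONDS = 7 * 24 * 3600  # default `key_grace_period` of 7 days


class DecryptError(Exception):
    pass


@dataclass(frozen=True)
class ArchivedKey:
    private_key: bytes
    rotated_out_at: float  # epoch seconds


def live_archived_keys(archive: Sequence[ArchivedKey], now: float,
                       grace: float = GRACE_SECONDS) -> List[ArchivedKey]:
    """Keep only archived keys that are still inside the grace window."""
    return [k for k in archive if now - k.rotated_out_at < grace]


def decrypt_with_fallback(ciphertext: bytes, current_key: bytes,
                          archive: Sequence[ArchivedKey], now: float,
                          decrypt: Callable[[bytes, bytes], bytes]) -> bytes:
    keys = [current_key] + [k.private_key for k in live_archived_keys(archive, now)]
    for key in keys:  # new key first, then archived keys
        try:
            return decrypt(key, ciphertext)
        except DecryptError:
            continue
    # A real daemon would bump cm_daemon_decrypt_stale_total here.
    raise DecryptError("no current or in-grace key could decrypt the message")


def toy_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """Fake cipher for the demo: 'ciphertext' is just key + plaintext."""
    if not ciphertext.startswith(key):
        raise DecryptError("wrong key")
    return ciphertext[len(key):]


archive = [ArchivedKey(b"OLD", rotated_out_at=0.0)]
print(decrypt_with_fallback(b"OLDhello", b"NEW", archive, 3600.0, toy_decrypt))  # b'hello'
```
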
|
||||
|
||||
### 14.2 Backup includes topic state (CORRECTED)
|
||||
|
||||
`claudemesh daemon backup` now packages:
|
||||
- `keypair.json` (current)
|
||||
- `keypair-archive.json` (any in-grace-window archived keys)
|
||||
- `host_fingerprint.json`
|
||||
- `config.toml`
|
||||
- `local_token`: NOT included (the token is regenerated on restore)
|
||||
- `topic_subscriptions.json` (which topics this daemon subscribes to)
|
||||
- `topic_keys.json` (per-topic symmetric keys this member holds)
|
||||
- `key_epoch.json` (current epoch number per topic; relevant when the mesh
|
||||
rotates topic keys)
|
||||
- `schema_version`
|
||||
|
||||
Backup file: encrypted with a passphrase (Argon2id KDF + crypto_secretbox).
|
||||
Restore writes everything except `local_token` (regenerated). On first run
|
||||
after restore, daemon performs `accept-host` if fingerprint mismatches
|
||||
(restore is by definition a host change).
|
||||
|
||||
### 14.3 Local token rotation, compromised host revocation, image-clone, uninstall, recovery — same as v2 §14
|
||||
|
||||
---
|
||||
|
||||
## 15. Version compat — feature-bit negotiation (REPLACES v2 §15)
|
||||
|
||||
Codex was right: version ranges aren't enough when daemon depends on
|
||||
specific broker capabilities (client-supplied IDs, concurrent-connection
|
||||
policy, key epochs).
|
||||
|
||||
### 15.1 Feature bits
|
||||
|
||||
Each protocol-relevant capability gets a stable string identifier:
|
||||
|
||||
```
client_message_id_dedupe       broker dedupes on client_message_id (§4.2)
concurrent_connection_policy   broker honours mesh.cloneConcurrencyPolicy (§2.3)
member_keypair_rotated_event   broker emits the event (§14.1)
key_epoch                      per-topic key epochs supported (§14.2)
mesh_skill_share               post-v0.9, future
mcp_host                       post-v0.9, future
```
|
||||
|
||||
### 15.2 Negotiation handshake
|
||||
|
||||
On WS connect (after hello, before normal traffic):
|
||||
|
||||
```
→ daemon: feature_negotiation_request
          { require:  ["client_message_id_dedupe",
                       "concurrent_connection_policy"],
            optional: ["mesh_skill_share","mcp_host"] }

← broker: feature_negotiation_response
          { supported: ["client_message_id_dedupe",
                        "concurrent_connection_policy",
                        "member_keypair_rotated_event"],
            missing_required: [] }
```
|
||||
|
||||
If `missing_required` is non-empty, daemon closes the connection with code
|
||||
4010 `feature_unavailable`, logs forensic event, exits with non-zero status.
|
||||
Supervisor sees a restart-loop → operator alerted via configured
|
||||
mechanisms.
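
Both sides of the handshake reduce to a set difference. In this sketch the message field names and the 4010 close code follow the text above, while the two function names and the dict-based message shape are assumptions.

```python
from typing import Dict, Iterable, List, Optional, Tuple

CLOSE_FEATURE_UNAVAILABLE = (4010, "feature_unavailable")


def broker_response(request: Dict[str, List[str]],
                    broker_features: Iterable[str]) -> Dict[str, List[str]]:
    """Broker side: report everything supported and any unmet `require` bits."""
    supported = list(broker_features)
    missing = [f for f in request["require"] if f not in supported]
    return {"supported": supported, "missing_required": missing}


def daemon_verdict(response: Dict[str, List[str]]) -> Optional[Tuple[int, str]]:
    """Daemon side: return a close (code, reason) if a required bit is missing."""
    return CLOSE_FEATURE_UNAVAILABLE if response["missing_required"] else None


request = {
    "require": ["client_message_id_dedupe", "concurrent_connection_policy"],
    "optional": ["mesh_skill_share", "mcp_host"],
}
old_broker = ["concurrent_connection_policy"]
response = broker_response(request, old_broker)
print(response["missing_required"])  # ['client_message_id_dedupe']
print(daemon_verdict(response))      # (4010, 'feature_unavailable')
```

Missing `optional` bits never close the connection; the daemon simply leaves the corresponding surface disabled.
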
|
||||
|
||||
### 15.3 IPC negotiation (CLI/SDK ↔ daemon)
|
||||
|
||||
`GET /v1/version` returns:
|
||||
```json
|
||||
{
|
||||
"daemon_version": "0.9.0",
|
||||
"ipc_api": "v1",
|
||||
"ipc_features": ["send","topic","peers","files","events","health"],
|
||||
"schema_version": 7,
|
||||
"broker_features_negotiated": ["client_message_id_dedupe", ...]
|
||||
}
|
||||
```
|
||||
|
||||
CLI/SDK matches `ipc_features` against required. Missing required →
|
||||
fall-back to cold-path with warning OR fail explicitly (CLI verb's choice).
|
||||
|
||||
### 15.4 Compatibility matrix — published
|
||||
|
||||
```json
|
||||
GET /v1/compat
|
||||
{
|
||||
"daemon": "0.9.0",
|
||||
"compatible_brokers": ["0.7.x","0.8.x","0.9.x"],
|
||||
"required_broker_features": ["client_message_id_dedupe",
|
||||
"concurrent_connection_policy"],
|
||||
"compatible_clis": ["0.9.x"],
|
||||
"compatible_sdks": {
|
||||
"python": ">=0.9.0,<1.0.0",
|
||||
"go": ">=0.9.0,<1.0.0",
|
||||
"ts": ">=0.9.0,<1.0.0"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 16. Threat model — shared-CI reality folded in
|
||||
|
||||
### 16.1 Attacker classes — same matrix as v2 §16, plus:
|
||||
|
||||
| Attacker | Has | Wants | Mitigations |
|---|---|---|---|
| **Shared CI runner** (NEW) | Same Unix UID as other untrusted jobs | Read this user's persistent keypair across job boundaries | Auto-detect CI envs (§2.1) → ephemeral default + UDS-only + isolated `$HOME`. If operator overrides with `--persistent`, log warning `persistent_keypair_in_ci_environment`. |
| **Malicious mesh peer** (PROMOTED from out-of-scope to in-scope) | Mesh membership | Send malformed payload to crash daemon | Every inbound shape validated against schema before any processing. Daemon refuses unknown fields (defense-in-depth) and emits `cm_daemon_invalid_inbound_total`. Crashes from inbound payloads are bugs. |
|
||||
|
||||
### 16.2 Stated explicitly out of scope
|
||||
|
||||
- Root attacker on daemon host (can read keypair directly).
|
||||
- Compromised broker (E2E content protection still holds; metadata is not
|
||||
protected by daemon — that's mesh-level).
|
||||
- Sophisticated attacker who copies BOTH `keypair.json` and
|
||||
`host_fingerprint.json` (§2.2 calls this out).
|
||||
- Receivers other than daemon-hosted peers deduping inbound traffic
|
||||
(post-v0.9.0).
|
||||
|
||||
### 16.3 Container & CI defaults table (NEW)
|
||||
|
||||
| Environment | Identity | IPC | Hooks |
|---|---|---|---|
| Bare metal / VM (default) | Persistent (clone-detected) | UDS + TCP loopback | Enabled |
| Docker container (`/.dockerenv`) | Persistent | UDS-only by default | Enabled |
| Kubernetes (`KUBERNETES_SERVICE_HOST`) | Persistent | UDS-only | Enabled |
| CI (`CI=true`, `GITHUB_ACTIONS`, etc.) | Ephemeral | UDS-only | Disabled by default (`[hooks] enabled = false` until opted-in) |
| RunPod (`RUNPOD_POD_ID`) | Ephemeral | UDS-only | Enabled |
|
||||
|
||||
Operator overrides any default with explicit flags; warning logged for
|
||||
non-default-secure choices.
|
||||
|
||||
---
|
||||
|
||||
## 17. Migration — same as v2 §17, plus broker schema add
|
||||
|
||||
Broker needs the schema delta in §4.3 (additive, partial unique indexes —
|
||||
safe for online migration). Coordinated with daemon rollout: broker first,
|
||||
then daemon. Daemon refuses to start against a broker that lacks
|
||||
`client_message_id_dedupe` feature bit (§15).
|
||||
|
||||
---
|
||||
|
||||
## What needs review (round 3)
|
||||
|
||||
Round 1 → identity, IPC auth, exactly-once lie, hook tokens, surface bloat,
|
||||
missing rotation/recovery/migration/threat-model.
|
||||
|
||||
Round 2 → boot-id false-positive, broker must dedupe on client id (protocol
|
||||
change), CI shared-runner reality, feature-bit negotiation, key rotation
|
||||
crypto, hook scopes, FTS schema, ~7 polish items.
|
||||
|
||||
This v3 attempts to address all of those. Specifically critique:
|
||||
|
||||
1. **Accidental-clone framing (§2.2)** — does the honest framing close the
|
||||
issue, or does removing boot-id make the detection so weak it's not worth
|
||||
shipping at all? Should we drop fingerprint detection entirely and rely on
|
||||
broker concurrent-connection policy?
|
||||
2. **Broker schema delta (§4.3)** — is this the smallest correct change?
|
||||
Partial unique indexes feel right; anything else needed (audit table,
|
||||
gc job)?
|
||||
3. **`max_age_hours` reduced to 23h** — codex's logic says daemon outbox TTL
|
||||
must be inside broker dedupe window. Is 23h vs 24h tight enough? Should
|
||||
the broker advertise its dedupe window as a feature parameter so the
|
||||
daemon configures itself?
|
||||
4. **Hook scopes (§6.2)** — too tight? `reply:event` + `dm:send:<sender>` +
|
||||
`topic:<name>:post`. Does this cover real use cases for v0.9.0 hooks
|
||||
(auto-reply, escalate-to-oncall, file-receipt-ack)?
|
||||
5. **Feature-bit negotiation (§15)** — is the scheme right? Should
|
||||
feature-bits be string identifiers (current) or numeric bit positions in
|
||||
a bitmask (denser, more brittle)?
|
||||
6. **CI defaults (§16.3)** — is the table accurate? Anything wrong about
|
||||
defaulting hooks-disabled in CI?
|
||||
7. **Key rotation grace-key archive (§14.1)** — is 7d the right default? Is
|
||||
storing archived private keys on disk (mode 0600) acceptable, or should
|
||||
they be encrypted at rest with a passphrase?
|
||||
8. **Anything still wrong?** Read it as if you were going to operate this
|
||||
daemon for a year — what falls down?
|
||||
|
||||
Three options after this review:
|
||||
- **(a) v3 is shippable**: lock the spec, start coding the frozen core.
|
||||
- **(b) v4 needed**: list the must-fix items.
|
||||
- **(c) the architecture itself is wrong**: what would you do differently?
|
||||
|
||||
Be ruthless. We can break anything.
|
||||
538
.artifacts/shipped/2026-05-03-daemon-final-spec-v4.md
Normal file
@@ -0,0 +1,538 @@
|
||||
# `claudemesh daemon` — Final Spec v4
|
||||
|
||||
> **Round 4.** v3 was reviewed by codex (round 3) and got an overall pass on
|
||||
> architecture but flagged three precision gaps: (1) broker dedupe window
|
||||
> semantics — permanent or windowed? schema as drawn was permanent but the
|
||||
> prose said 24h; (2) feature-bit negotiation should carry parameters, not
|
||||
> just booleans (so daemon can derive its outbox TTL from broker policy
|
||||
> instead of hardcoding 23h); (3) key-archive record format and retention
|
||||
> behavior were unspecified. Plus minor polish: document machine-id/MAC
|
||||
> source precedence per OS, explicitly defer arbitrary outbound hook sends,
|
||||
> resolve RunPod identity-vs-hooks inconsistency.
|
||||
>
|
||||
> **The intent §0 is unchanged from v2 — read it there.** v4 only revises
|
||||
> what changed from v3.
|
||||
|
||||
---
|
||||
|
||||
## 0. Intent — unchanged, see v2 §0
|
||||
|
||||
Pre-launch peer-mesh runtime. Servers/laptops become first-class peers.
|
||||
Stable identity, persistent WS, local IPC, hooks. Not a webhook gateway, not
|
||||
a generic broker. We can break anything.
|
||||
|
||||
**One claim retracted from v1/v2**: "exactly-once" delivery. Replaced with a
|
||||
precise contract in §4 below.
|
||||
|
||||
---
|
||||
|
||||
## 1. Process model — unchanged from v3 §1 / v2 §1
|
||||
|
||||
Resource caps, file layout, single-binary unchanged.
|
||||
|
||||
---
|
||||
|
||||
## 2. Identity — accidental-clone detection only, plus broker dedupe
|
||||
|
||||
Codex round-2 fix retained: no boot-id (false-positives every reboot).
|
||||
Codex round-3 polish: spell out fingerprint sources per OS so we don't ship
|
||||
a brittle "machine-id || first-mac" with no precedence rules.
|
||||
|
||||
### 2.1 Modes
|
||||
|
||||
```
claudemesh daemon up                       # default: persistent member
claudemesh daemon up --ephemeral           # in-memory keypair, never written
claudemesh daemon up --ephemeral --ttl 2h  # auto-shutdown after duration
```
|
||||
|
||||
**CI auto-detection**: if any of these env vars are set (`CI=true`,
|
||||
`GITHUB_ACTIONS`, `GITLAB_CI`, `BUILDKITE`, `CIRCLECI`, `JENKINS_URL`,
|
||||
`KUBERNETES_SERVICE_HOST`), AND `--persistent` is not explicitly passed,
|
||||
daemon defaults to `--ephemeral`. Rationale in §16.
|
||||
|
||||
`RUNPOD_POD_ID` removed from auto-CI list (was inconsistent — see §16.3).
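The defaulting rule in this subsection can be sketched roughly as follows. This is only an illustration under stated assumptions, not the daemon's code: the names `looks_like_ci` and `default_identity_mode` and the `explicit_mode` parameter are hypothetical, and treating any non-empty value as "set" (with `CI` required to equal `true`) is an assumption, since the spec only lists the variables. The list follows §2.1 as written.

```python
from typing import Mapping, Optional

# Presence-only variables from the §2.1 list. RUNPOD_POD_ID is deliberately absent.
CI_PRESENCE_VARS = (
    "GITHUB_ACTIONS",
    "GITLAB_CI",
    "BUILDKITE",
    "CIRCLECI",
    "JENKINS_URL",
    "KUBERNETES_SERVICE_HOST",
)


def looks_like_ci(env: Mapping[str, str]) -> bool:
    """True if CI=true or any listed variable is non-empty (assumed meaning of "set")."""
    if env.get("CI") == "true":
        return True
    return any(bool(env.get(name)) for name in CI_PRESENCE_VARS)


def default_identity_mode(env: Mapping[str, str],
                          explicit_mode: Optional[str] = None) -> str:
    """Return "persistent" or "ephemeral".

    explicit_mode models an explicit --persistent / --ephemeral flag
    (hypothetical parameter); an explicit choice always wins over detection.
    """
    if explicit_mode in ("persistent", "ephemeral"):
        return explicit_mode
    return "ephemeral" if looks_like_ci(env) else "persistent"
```

In a real process the `env` argument would typically be `os.environ`; passing a plain mapping keeps the rule easy to test.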
|
||||
|
||||
### 2.2 Accidental-clone detection (NOT attacker-grade)
|
||||
|
||||
This catches **image clones, restored backups, copy-pasted homedirs** —
|
||||
accidents made by humans. It does not defend against an attacker who copies
|
||||
both `keypair.json` and `host_fingerprint.json`. The threat model (§16) says
|
||||
this explicitly.
|
||||
|
||||
#### 2.2.1 Fingerprint source precedence (NEW — codex r3)
|
||||
|
||||
`host_fingerprint.json` stores `sha256(host_id || stable_mac)` where the
|
||||
inputs are computed from the OS-specific table below, in order:
|
||||
|
||||
| OS | `host_id` (try in order) | `stable_mac` |
|---|---|---|
| Linux | `/etc/machine-id` → `/var/lib/dbus/machine-id` → first stable MAC | First non-loopback non-virtual interface, lex-sorted by name (`en…`/`eth…` before `wl…`); `docker0/veth*/br-*/lo` excluded |
| macOS | `IOPlatformUUID` (`ioreg -rd1 -c IOPlatformExpertDevice`) | First non-loopback non-virtual interface (`en0` typical) |
| Windows | `HKLM\SOFTWARE\Microsoft\Cryptography\MachineGuid` | First physical adapter (`Get-NetAdapter -Physical`), MAC sorted lex by adapter name |
| BSD | `kern.hostuuid` (`sysctl -n kern.hostuuid`) | Same MAC rule as Linux |
|
||||
|
||||
**Excluded interfaces** (cross-platform): loopback, point-to-point tunnels
(`tailscale*`, `wg*`, `utun*`, `ppp*`), Docker (`docker0`, `br-*`, `veth*`), VPN
(`tap*`/`tun*`), VM bridges (`vboxnet*`, `vmnet*`), Apple `awdl*`/`llw*` bridges.
|
||||
|
||||
**Cloud-image false-positive note**: bare AMIs/Azure images regenerate
|
||||
`/etc/machine-id` on first boot via cloud-init; for those, the first-boot
|
||||
fingerprint is what we keep. If an operator clones a *running* VM
|
||||
post-cloud-init, both `host_id` AND first-MAC will collide → the daemon
|
||||
correctly flags this as an accidental clone.
|
||||
|
||||
If `host_id` cannot be read on the host's OS, daemon logs
|
||||
`fingerprint_host_id_unavailable` and falls back to MAC-only. If MAC also
|
||||
unavailable (truly headless container with no NIC), daemon logs
|
||||
`fingerprint_unavailable`, persists a random UUID as `host_id`, and the
|
||||
clone-detection feature is effectively disabled for this host (broker
|
||||
concurrent-connection policy still works).
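As a rough sketch of the selection and hashing rules above: the helper names `pick_stable_mac` and `host_fingerprint` are hypothetical, interfaces are passed in as a name-to-MAC mapping rather than read from the OS, "virtual" is approximated purely by the name patterns in the exclusion list, and the byte encoding of `host_id || stable_mac` (UTF-8 strings, no separator) is an assumption the spec does not state.

```python
import hashlib
from fnmatch import fnmatchcase
from typing import Mapping, Optional

# Name patterns from the cross-platform exclusion list. "lo[0-9]*" is an
# assumption added to cover numbered loopbacks such as macOS lo0.
EXCLUDED_PATTERNS = (
    "lo", "lo[0-9]*",
    "tailscale*", "wg*", "utun*", "ppp*",
    "docker0", "br-*", "veth*",
    "tap*", "tun*",
    "vboxnet*", "vmnet*",
    "awdl*", "llw*",
)


def pick_stable_mac(interfaces: Mapping[str, str]) -> Optional[str]:
    """Return the MAC of the first non-excluded interface, lex-sorted by name."""
    candidates = sorted(
        name for name in interfaces
        if not any(fnmatchcase(name, pattern) for pattern in EXCLUDED_PATTERNS)
    )
    return interfaces[candidates[0]].lower() if candidates else None


def host_fingerprint(host_id: Optional[str], stable_mac: str) -> str:
    """sha256(host_id || stable_mac); MAC-only fallback when host_id is unreadable."""
    effective_host_id = host_id if host_id else stable_mac
    return hashlib.sha256((effective_host_id + stable_mac).encode("utf-8")).hexdigest()
```

When `pick_stable_mac` returns `None` and no `host_id` is readable, the caller would take the `fingerprint_unavailable` path described above instead of calling `host_fingerprint`.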
|
||||
|
||||
Behavior on mismatch (unchanged from v3): refuse / `accept-host` / `remint`.
|
||||
`[clone] policy = "refuse" | "warn" | "allow"` overrides per host.
|
||||
|
||||
### 2.3 Concurrent-duplicate-identity broker policy — unchanged from v3 §2.3
|
||||
|
||||
`prefer_newest` (default), `prefer_oldest`, `allow_concurrent`. Configured
|
||||
per-mesh in `mesh.cloneConcurrencyPolicy`.
|
||||
|
||||
### 2.4 Rename, key rotation — see §14
|
||||
|
||||
---
|
||||
|
||||
## 3. IPC surface — unchanged from v3 §3
|
||||
|
||||
Same frozen core, same auth model (UDS 0600 / TCP+SSE bearer / no token in
|
||||
query / all endpoints auth by default / UDS-only in containers / Origin/Host
|
||||
checks / no User-Agent theatre).
|
||||
|
||||
---
|
||||
|
||||
## 4. Delivery contract — at-least-once, **permanent** broker dedupe
|
||||
|
||||
Codex round 3 caught: v3's prose said "24h dedupe window" but the schema
|
||||
(partial unique indexes with no `created_at`) gave **permanent** dedupe. We
|
||||
have to pick. v4 chooses **permanent dedupe** because:
|
||||
|
||||
- It's the simplest correct choice. No GC job, no edge case where a
|
||||
long-asleep daemon's retry slips past the window and double-sends.
|
||||
- The unique index storage cost is bounded: at 1 KB per row × 100k
|
||||
messages/day × 365 = ~36 GB/year of broker storage, which is well within
|
||||
the broker's existing message-retention budget. Older message rows
|
||||
themselves can still be GC'd by the existing message retention policy
|
||||
(currently 365d) — only the `client_message_id` column on retained rows
|
||||
has to live as long as that row does.
|
||||
- It eliminates the daemon-side `max_age_hours = 23h` hack. Daemon outbox
|
||||
TTL becomes "however long you want to keep retrying"; default 7d.
|
||||
- It removes a class of "where exactly is the dedupe window edge?" bugs.
|
||||
|
||||
If broker storage growth becomes a real concern post-v0.9.0, we can convert
|
||||
to a windowed scheme via a feature-bit upgrade (§15) — but we'd own the
|
||||
correct migration semantics then.
|
||||
|
||||
### 4.1 The contract (precise)
|
||||
|
||||
> **Local guarantee**: each successful `POST /v1/send` returns a stable
|
||||
> `client_message_id`. The send is durably persisted to `outbox.db` before
|
||||
> the response returns.
|
||||
>
|
||||
> **Broker guarantee**: the broker dedupes on `client_message_id`
|
||||
> **permanently within the lifetime of the row**. Multiple inflight retries
|
||||
> from the daemon for the same `client_message_id` produce **at most one**
|
||||
> broker-accepted row, regardless of time elapsed (subject to message-row
|
||||
> retention policy on the broker). This is advertised via the
|
||||
> `client_message_id_dedupe` feature-bit with `{ mode: "permanent" }`
|
||||
> parameter (§15).
|
||||
>
|
||||
> **End-to-end guarantee**: at-least-once delivery to subscribers, with
|
||||
> `client_message_id` propagated in the inbound envelope so receivers can
|
||||
> dedupe locally. We do **not** guarantee at-most-once end-to-end —
|
||||
> receiver-side dedupe is the receiver's job. The daemon's `inbox.db`
|
||||
> provides it for daemon-hosted peers.
|
||||
|
||||
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
|
||||
|
||||
Sources: `Idempotency-Key` header → body `client_message_id` → daemon-minted
|
||||
ulid. Stored in outbox UNIQUE NOT NULL, propagated to broker, propagated to
|
||||
receivers.
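The source precedence can be illustrated with a small sketch. The function names are hypothetical, the request is modelled as plain mappings, and the ULID minting is a minimal hand-rolled version (48-bit millisecond timestamp plus 80 random bits in Crockford base32) rather than whatever library the daemon actually uses.

```python
import os
import time
from typing import Mapping, Optional

CROCKFORD32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def mint_ulid(now_ms: Optional[int] = None) -> str:
    """Mint a 26-character ULID: 48-bit ms timestamp followed by 80 random bits."""
    timestamp = int(time.time() * 1000) if now_ms is None else now_ms
    value = (timestamp << 80) | int.from_bytes(os.urandom(10), "big")
    chars = []
    for _ in range(26):  # 26 * 5 bits = 130 bits, top 2 bits are zero padding
        chars.append(CROCKFORD32[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


def resolve_client_message_id(headers: Mapping[str, str],
                              body: Mapping[str, object]) -> str:
    """Apply the precedence: Idempotency-Key header, then body field, then mint."""
    for name, value in headers.items():
        if name.lower() == "idempotency-key" and value:
            return value
    body_id = body.get("client_message_id")
    if isinstance(body_id, str) and body_id:
        return body_id
    return mint_ulid()
```

Header names are compared case-insensitively because HTTP header names are case-insensitive; whichever id is chosen is the one stored in the outbox and sent to the broker.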
|
||||
|
||||
### 4.3 Broker schema delta — clarified as permanent dedupe
|
||||
|
||||
```sql
ALTER TABLE mesh.topic_message
  ADD COLUMN client_message_id TEXT;
ALTER TABLE mesh.message_queue
  ADD COLUMN client_message_id TEXT;

CREATE UNIQUE INDEX topic_message_client_id_idx
  ON mesh.topic_message(mesh_id, client_message_id)
  WHERE client_message_id IS NOT NULL;
CREATE UNIQUE INDEX message_queue_client_id_idx
  ON mesh.message_queue(mesh_id, client_message_id)
  WHERE client_message_id IS NOT NULL;

-- No created_at column needed for dedupe; the existing message row's
-- created_at handles row-level retention. Dedupe is permanent for the row's
-- lifetime, then naturally GC'd when the row is purged.
```
|
||||
|
||||
Partial unique indexes — legacy traffic without `client_message_id` (from
|
||||
`claudemesh launch`, dashboard chat, web posts) is unaffected.
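To see how a partial unique index gives this behavior, here is a small self-contained sketch. It uses an in-memory SQLite database as a stand-in for the Postgres broker (SQLite also supports partial indexes), drops the `mesh.` schema prefix, and uses a simplified table; the `accept` helper and the `body` column are hypothetical and only illustrate the retry-collapses-to-original-row idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE topic_message (
    id INTEGER PRIMARY KEY,
    mesh_id TEXT NOT NULL,
    client_message_id TEXT,
    body TEXT NOT NULL
);
CREATE UNIQUE INDEX topic_message_client_id_idx
    ON topic_message(mesh_id, client_message_id)
    WHERE client_message_id IS NOT NULL;
""")


def accept(mesh_id, client_message_id, body):
    """Insert a message. Returns (row_id, created).

    A retry with the same (mesh_id, client_message_id) violates the partial
    unique index and is answered with the id of the originally accepted row.
    """
    try:
        cursor = conn.execute(
            "INSERT INTO topic_message (mesh_id, client_message_id, body) "
            "VALUES (?, ?, ?)",
            (mesh_id, client_message_id, body),
        )
        conn.commit()
        return cursor.lastrowid, True
    except sqlite3.IntegrityError:
        row = conn.execute(
            "SELECT id FROM topic_message "
            "WHERE mesh_id = ? AND client_message_id = ?",
            (mesh_id, client_message_id),
        ).fetchone()
        return row[0], False
```

Rows with a NULL `client_message_id` never enter the index, which is why legacy senders can keep inserting without conflicts.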
|
||||
|
||||
**Migration**: additive-only. On Postgres, `ALTER TABLE ... ADD COLUMN` for a
nullable column with no default is a metadata-only change that holds an
`ACCESS EXCLUSIVE` lock only briefly. The index builds must use `CREATE UNIQUE
INDEX CONCURRENTLY` so they do not block writes; note that it cannot run
inside a transaction block. Deploy order: schema migration → broker code that
reads/writes `client_message_id` → daemon code that sends it → daemon
enforces feature bit.
|
||||
|
||||
### 4.4 Outbox schema — unchanged from v3 §4.4
|
||||
|
||||
`UNIQUE NOT NULL` on `client_message_id`. Default `max_age_hours` raised
|
||||
back to **168h (7d)** because broker dedupe is permanent — no need to stay
|
||||
inside a 24h window.
|
||||
|
||||
### 4.5 Inbox schema — unchanged from v3 §4.5
|
||||
|
||||
Content table + indexes; FTS5 deferred.
|
||||
|
||||
### 4.6 Crash recovery — unchanged from v3 §4.6
|
||||
|
||||
### 4.7 Failure modes — windowed-broker case removed
|
||||
|
||||
The "broker dedupe window expired" failure mode in v3 §4.7 is **deleted**
|
||||
because dedupe is permanent. Remaining cases:
|
||||
|
||||
- **`dead` rows**: surface in `claudemesh daemon outbox --failed`. User
|
||||
manually requeues (`outbox requeue <id>`) or drops (`outbox drop <id>`).
|
||||
- **Receiver-side dedupe**: only daemon-hosted receivers dedupe.
|
||||
`claudemesh launch` and dashboard chat don't dedupe today; post-v0.9.0.
|
||||
- **Broker row already GC'd, daemon retries**: daemon retry hits the
|
||||
partial unique index → 23505 conflict. Broker treats as already-accepted,
|
||||
returns the original `messageId` from a soft-delete tombstone OR (if the
|
||||
row was hard-deleted by retention) returns `client_id_unknown`. Daemon
|
||||
treats `client_id_unknown` as "delivered, history may have been pruned"
|
||||
and marks `done`. Tombstone strategy is a broker implementation choice
|
||||
(advertised via `client_message_id_dedupe.tombstone_retention_days` in
|
||||
§15.1).
|
||||
|
||||
---
|
||||
|
||||
## 5. Inbound — unchanged from v3 §5
|
||||
|
||||
---
|
||||
|
||||
## 6. Hooks — scopes tightened (codex r2), explicit deferment of arbitrary sends (codex r3)
|
||||
|
||||
### 6.1 Hooks contract — unchanged from v2 §6 / v3 §6.1
|
||||
|
||||
### 6.2 Capability scopes — narrowed for v0.9.0
|
||||
|
||||
| Scope | Capability | Notes |
|---|---|---|
| `reply:event` | Reply to the specific event that triggered this hook | Bound to `event_id`; daemon validates target; expires on hook exit |
| `dm:send:<sender_pubkey>` | Send DM only to the specific sender | Bound to one pubkey from event; not a write to anyone |
| `topic:<name>:post` | Post to the specific topic that fired | Bound to topic from event; can't write elsewhere |
|
||||
|
||||
**No read scopes in v0.9.0.** Hooks read via the event payload (which the
|
||||
daemon redacts appropriately), not via daemon-mediated reads.
|
||||
|
||||
**Explicitly deferred to post-v0.9.0** (codex r3 — say it out loud so use
|
||||
cases don't pile up against an undocumented limit):
|
||||
|
||||
- **Arbitrary outbound `dm:send` to anyone other than the event sender** —
|
||||
no scope grant for this. "Escalate to oncall" hooks must shell out to
|
||||
`claudemesh send <oncall>` with the user's normal config; the daemon
|
||||
doesn't issue capability tokens for arbitrary recipients.
|
||||
- **Cross-topic post** — a hook firing on `topic:alerts` cannot post to
|
||||
`topic:incidents`. Same reason.
|
||||
- **Mesh-cross post** — hooks see one mesh at a time.
|
||||
- **Reading state/inbox/peers** — covered above.
|
||||
|
||||
If a real use case demands cross-topic or arbitrary-recipient hooks
|
||||
post-v0.9.0, we add scopes like `dm:send:*` (wildcard) or
|
||||
`topic:*:post` (wildcard) and gate them behind explicit operator opt-in in
|
||||
config (`[hooks.<name>] dangerous_wildcards = true`). Not in v0.9.0.
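A minimal sketch of how event-bound scopes with exact matching might look. The event shape (`kind`, `sender_pubkey`, `topic`) and both function names are hypothetical, and granting `reply:event` on every event is an assumption; the spec only defines the scope strings themselves.

```python
from typing import Dict, FrozenSet


def scopes_for_event(event: Dict[str, str]) -> FrozenSet[str]:
    """Derive the scopes a hook receives from the event that triggered it."""
    scopes = {"reply:event"}
    if event.get("kind") == "dm":
        # Bound to the one sender pubkey carried by the event.
        scopes.add(f"dm:send:{event['sender_pubkey']}")
    elif event.get("kind") == "topic":
        # Bound to the one topic that fired.
        scopes.add(f"topic:{event['topic']}:post")
    return frozenset(scopes)


def is_allowed(granted: FrozenSet[str], requested: str) -> bool:
    """Exact-match check: there are no wildcard scopes in v0.9.0."""
    return requested in granted
```

Because matching is exact, a hook fired by `topic:alerts` has no way to express `topic:incidents:post`, and a literal `dm:send:*` request is simply rejected.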
|
||||
|
||||
### 6.3 Sandboxing — unchanged from v3 §6.3
|
||||
|
||||
Best-effort `network_policy = "deny"`; cross-platform unenforceability
|
||||
acknowledged; counter `cm_daemon_hook_unenforceable_total` exposed.
|
||||
|
||||
### 6.4 Payload size & truncation — unchanged from v3 §6.4
|
||||
|
||||
### 6.5 Audit log + killpg — unchanged
|
||||
|
||||
---
|
||||
|
||||
## 7. Multi-mesh — unchanged
|
||||
|
||||
## 8. Auto-routing — unchanged
|
||||
|
||||
## 9. Service installation — unchanged
|
||||
|
||||
## 10. Observability — unchanged
|
||||
|
||||
## 11. SDKs — unchanged
|
||||
|
||||
## 12. Security model — unchanged
|
||||
|
||||
---
|
||||
|
||||
## 13. Configuration — unchanged shape, plus parameterized features
|
||||
|
||||
```toml
[features]
require = [
  "client_message_id_dedupe",       # broker provides §4.1 contract
  "concurrent_connection_policy",   # broker honours mesh.cloneConcurrencyPolicy
]
optional = ["mesh_skill_share", "mcp_host"]
# Daemon refuses to start if broker doesn't advertise all `require` bits.
# Broker advertises feature parameters in the negotiation response (§15.1)
# — daemon picks up `dedupe_mode` and `tombstone_retention_days` from there
# and writes them to its runtime view, not config.
```
|
||||
|
||||
---
|
||||
|
||||
## 14. Lifecycle — key rotation crypto fixed (codex r2), archive format spec'd (codex r3)
|
||||
|
||||
### 14.1 Key rotation — crypto correct (codex r2)
|
||||
|
||||
`claudemesh daemon rotate-keypair`:
|
||||
|
||||
- Mints fresh ed25519 + x25519 keypairs.
|
||||
- Registers new pubkeys with the broker as `member_keypair_rotated` event.
|
||||
- Broker associates the new pubkey with the same member id, marks the old
|
||||
pubkey as `rotated_out` (not revoked); senders who haven't received the
|
||||
rotation event continue to encrypt to the old pubkey for a grace window.
|
||||
- Daemon retains the old x25519 **private** key (only x25519 — ed25519 is
|
||||
for signing, doesn't need a grace window) in `keypair-archive.json`.
|
||||
- During grace, decrypt path: try current private key first; on
|
||||
`crypto_box_open_easy` failure, walk archived keys in order. Successful
|
||||
archived-key decrypts increment `cm_daemon_decrypt_archived_total`.
|
||||
- After grace expiry, archived keys are zeroed and the file is rewritten
|
||||
without them. Messages still encrypted to a fully-expired pubkey fail to
|
||||
decrypt and increment `cm_daemon_decrypt_stale_total`.
|
||||
|
||||
#### 14.1.1 Archive record format (NEW — codex r3)
|
||||
|
||||
`keypair-archive.json` (mode 0600, atomic-rename writes):
|
||||
|
||||
```json
{
  "schema_version": 1,
  "max_archived_keys": 8,
  "keys": [
    {
      "pubkey": "ed25519-base64...",
      "x25519_pubkey": "base64...",
      "x25519_privkey": "base64...",  // sensitive; whole file is 0600
      "key_id": "k_01HQX...",  // ulid; matches broker's record
      "created_at": "2026-04-12T11:00:00Z",
      "rotated_out_at": "2026-05-03T16:00:00Z",
      "expires_at": "2026-05-10T16:00:00Z"  // rotated_out_at + grace
    }
  ]
}
```
|
||||
|
||||
Rules:
|
||||
|
||||
- **`max_archived_keys`** (default 8): cap on archive size. If a rotation
|
||||
would push the archive past the cap, the oldest entry is force-expired
|
||||
(zeroed + removed) regardless of `expires_at`. Force-expiry increments
|
||||
`cm_daemon_archive_force_expired_total{key_id}`. An operator who rotates
more than `max_archived_keys` times within one grace period is knowingly
accepting decryption gaps for very late inbound messages encrypted to the
evicted keys.
|
||||
- **Grace period default**: 7 days. Configurable via
|
||||
`[crypto] key_grace_period_days = 7`. Hard cap 30 days (codex review:
|
||||
unbounded grace = unbounded archive on disk = bigger blast radius if
|
||||
daemon host is compromised mid-life).
|
||||
- **Cleanup**: scheduled daily at midnight local time + on-demand via
|
||||
`claudemesh daemon archive-cleanup`. Walks `keys[]`, drops anything with
|
||||
`expires_at < now`. If file is empty after cleanup, file is deleted.
|
||||
- **Archive write failure**: rotation is aborted. Daemon refuses to commit
|
||||
the new keypair if the archive can't be written durably. Logged as
|
||||
`key_rotation_aborted_archive_write_failed`. New keypair is in memory
|
||||
only; restart returns to old keypair. This is intentional: the archive
|
||||
write is the durability point of rotation.
|
||||
- **At-rest encryption**: archive file is mode 0600 plaintext, same threat
|
||||
model as `keypair.json` (root-on-host can read both anyway). Operators
|
||||
who want disk-level encryption can put `~/.claudemesh/` on an encrypted
|
||||
volume; we don't reinvent that. Documented in the threat model (§16).
|
||||
Future option `--archive-passphrase` deferred — adds passphrase prompt to
|
||||
rotation/decrypt path, but breaks unattended daemon restart.
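The cap and cleanup rules can be sketched as below. This is an in-memory illustration only: the function names are hypothetical, `keys[]` is assumed to be ordered oldest first, the 30-day hard cap is applied by clamping (the spec does not say whether an over-cap value is clamped or rejected), and timestamps use Python's `+00:00` offset form instead of the `Z` suffix shown in the record format so that `datetime.fromisoformat` parses them on older Python versions. Durable atomic-rename writes and key zeroing are out of scope here.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

MAX_GRACE_DAYS = 30  # hard cap from the rules above


def archive_key(archive: Dict, entry: Dict, rotated_out_at: datetime,
                grace_days: int = 7) -> List[str]:
    """Append a rotated-out key and return the key_ids that were force-expired."""
    grace = timedelta(days=min(grace_days, MAX_GRACE_DAYS))
    record = dict(
        entry,
        rotated_out_at=rotated_out_at.isoformat(),
        expires_at=(rotated_out_at + grace).isoformat(),
    )
    archive["keys"].append(record)
    forced = []
    # Past the cap: evict the oldest entries regardless of their expires_at.
    while len(archive["keys"]) > archive["max_archived_keys"]:
        forced.append(archive["keys"].pop(0)["key_id"])
    return forced


def cleanup(archive: Dict, now: datetime) -> int:
    """Drop every entry with expires_at < now and return how many were dropped."""
    before = len(archive["keys"])
    archive["keys"] = [
        key for key in archive["keys"]
        if datetime.fromisoformat(key["expires_at"]) >= now
    ]
    return before - len(archive["keys"])
```

A caller would increment `cm_daemon_archive_force_expired_total` once per returned `key_id`, and delete the file when `keys` ends up empty after cleanup.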
|
||||
|
||||
### 14.2 Backup includes topic state — unchanged from v3 §14.2
|
||||
|
||||
`keypair.json`, `keypair-archive.json` (with all archived keys),
|
||||
`host_fingerprint.json`, `config.toml`, `topic_subscriptions.json`,
|
||||
`topic_keys.json`, `key_epoch.json`, `schema_version`.
|
||||
|
||||
`local_token` NOT included; regenerated on restore.
|
||||
|
||||
### 14.3 Local token rotation, compromised host revocation, image-clone, uninstall, recovery — unchanged from v2 §14.3
|
||||
|
||||
---
|
||||
|
||||
## 15. Version compat — feature-bit negotiation with **parameters** (codex r3)
|
||||
|
||||
v3's feature bits were boolean. Codex r3: dedupe-window, max-payload, key
|
||||
epochs all need parameters. v4 makes feature bits string-keyed entries that
|
||||
optionally carry a value.
|
||||
|
||||
### 15.1 Feature bits with parameters
|
||||
|
||||
| Bit | Type | Parameters | Notes |
|---|---|---|---|
| `client_message_id_dedupe` | object | `{ mode: "permanent"\|"windowed", window_hours?: int, tombstone_retention_days: int }` | Daemon reads `mode` to decide whether to enforce its own outbox max-age cap. `tombstone_retention_days` (broker-controlled) tells daemon how long it can expect "already-accepted" replies after the source row is GC'd |
| `concurrent_connection_policy` | bool | — | Broker honours `mesh.cloneConcurrencyPolicy` |
| `member_keypair_rotated_event` | bool | — | Broker emits the event |
| `key_epoch` | object | `{ max_concurrent_epochs: int }` | Per-topic key epochs supported |
| `max_payload` | object | `{ inline_bytes: int, blob_bytes: int }` | Hard limits broker enforces |
| `mesh_skill_share` | bool | — | Future |
| `mcp_host` | bool | — | Future |
|
||||
|
||||
### 15.2 Negotiation handshake (parameterized)
|
||||
|
||||
On WS connect, after hello, before normal traffic:
|
||||
|
||||
```
→ daemon: feature_negotiation_request
  {
    require: ["client_message_id_dedupe",
              "concurrent_connection_policy"],
    optional: ["mesh_skill_share","mcp_host","max_payload"]
  }

← broker: feature_negotiation_response
  {
    supported: {
      "client_message_id_dedupe": {
        "mode": "permanent",
        "tombstone_retention_days": 30
      },
      "concurrent_connection_policy": true,
      "member_keypair_rotated_event": true,
      "max_payload": {
        "inline_bytes": 65536,
        "blob_bytes": 524288000
      }
    },
    missing_required: []
  }
```
|
||||
|
||||
If `missing_required` is non-empty, daemon closes the connection with code
|
||||
4010 `feature_unavailable`, logs forensic event, exits non-zero. Supervisor
|
||||
sees a restart-loop → operator alert.
|
||||
|
||||
If `client_message_id_dedupe.mode == "windowed"`, daemon reads
|
||||
`window_hours` and configures its outbox `max_age_hours` to
|
||||
`window_hours - 1` (margin) instead of the 168h default. Permanent mode →
|
||||
daemon uses the config default, no override.
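The daemon-side handling of the response could look roughly like this. Both function names are hypothetical, the response is modelled as an already-parsed dictionary, and a required bit is treated as missing when it is absent or advertised as `false` (an assumption; the spec only says the broker must advertise it).

```python
from typing import Any, List, Mapping

DEFAULT_MAX_AGE_HOURS = 168  # §4.4 default (7d)


def missing_required(require: List[str], supported: Mapping[str, Any]) -> List[str]:
    """Required bits that are absent or explicitly false in the broker response."""
    return [
        bit for bit in require
        if bit not in supported or supported[bit] is False
    ]


def outbox_max_age_hours(supported: Mapping[str, Any],
                         configured: int = DEFAULT_MAX_AGE_HOURS) -> int:
    """Derive the outbox TTL from the advertised dedupe parameters."""
    dedupe = supported.get("client_message_id_dedupe")
    if isinstance(dedupe, dict) and dedupe.get("mode") == "windowed":
        return int(dedupe["window_hours"]) - 1  # one hour of margin
    return configured  # permanent mode: keep the config default
```

The windowed branch follows the rule above literally. A more defensive variant might take the minimum of `window_hours - 1` and the configured value so that a very long broker window never raises the local TTL, but that is a design choice the spec does not make.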
|
||||
|
||||
### 15.3 IPC negotiation — unchanged from v3 §15.3
|
||||
|
||||
`GET /v1/version` returns daemon version, IPC features, schema version, and
|
||||
the **parsed** broker feature parameters (so SDKs querying the daemon can
|
||||
display them).
|
||||
|
||||
### 15.4 Compatibility matrix — unchanged from v3 §15.4
|
||||
|
||||
Published at `GET /v1/compat`.
|
||||
|
||||
---
|
||||
|
||||
## 16. Threat model — unchanged from v3 §16, plus RunPod fix
|
||||
|
||||
### 16.1 Attacker classes — unchanged
|
||||
|
||||
### 16.2 Out of scope — unchanged
|
||||
|
||||
### 16.3 Container & CI defaults table (RunPod inconsistency fixed)
|
||||
|
||||
| Environment | Identity | IPC | Hooks | Rationale |
|---|---|---|---|---|
| Bare metal / VM (default) | Persistent (clone-detected) | UDS + TCP loopback | Enabled | Trusted operator-owned host |
| Docker container (`/.dockerenv`) | Persistent | UDS-only by default | Enabled | Single-tenant container, host loopback shared |
| Kubernetes (`KUBERNETES_SERVICE_HOST`) | Persistent | UDS-only | Enabled | Single pod = single tenant |
| CI (`CI=true`, `GITHUB_ACTIONS`, etc.) | Ephemeral | UDS-only | Disabled by default (`[hooks] enabled = false`) | Multi-tenant runner; arbitrary code; ephemeral identity = no cross-job leak; hooks disabled because CI workloads are arbitrary user code |
| RunPod (`RUNPOD_POD_ID`) | Persistent | UDS-only | Enabled | Long-lived single-tenant sandbox; user owns the pod for its lifetime; identical trust model to a Docker container, NOT to a CI runner |
|
||||
|
||||
**RunPod resolution (codex r3)**: v3 listed RunPod under both "ephemeral
|
||||
identity" and "hooks enabled" which was contradictory. v4 treats RunPod as
|
||||
a **single-tenant container** (Docker-like): persistent identity, UDS-only,
|
||||
hooks enabled. RunPod is removed from the CI auto-detect list (§2.1).
|
||||
Operators who run RunPod as multi-tenant sandbox-as-CI can opt in with
|
||||
`--ephemeral` + `[hooks] enabled = false` explicitly.
|
||||
|
||||
Operator overrides any default with explicit flags; warning logged for
|
||||
non-default-secure choices.
|
||||
|
||||
---
|
||||
|
||||
## 17. Migration — unchanged from v3 §17
|
||||
|
||||
Broker schema delta (additive partial unique indexes, safe online),
|
||||
deployed before daemon. Daemon refuses to start if `client_message_id_dedupe`
|
||||
feature bit is missing from broker's negotiation response.
|
||||
|
||||
---
|
||||
|
||||
## What changed v3 → v4 (codex round-3 actionable items)
|
||||
|
||||
| Codex r3 item | v4 fix | Section |
|---|---|---|
| Broker dedupe window: permanent vs windowed? | **Picked permanent**; schema clarified; outbox `max_age_hours` raised back to 168h | §4 |
| Feature bits should be parameterized | All feature bits are string-keyed with optional value object | §15.1, §15.2 |
| Key archive record format unspecified | Full schema with `key_id`, timestamps, `max_archived_keys`, force-expiry rule, write-failure semantics | §14.1.1 |
| Document fingerprint source precedence per OS | Per-OS table for `host_id` and stable MAC; cloud-image false-positive note | §2.2.1 |
| Explicit deferment of arbitrary outbound hook sends | Listed deferred capabilities + escape hatch path post-v0.9.0 | §6.2 |
| RunPod ephemeral-but-hooks-enabled inconsistency | RunPod treated as single-tenant container; removed from CI auto-detect | §2.1, §16.3 |
|
||||
|
||||
---
|
||||
|
||||
## What needs review (round 4)
|
||||
|
||||
Round 1 → identity, IPC auth, exactly-once lie, hook tokens, surface bloat,
|
||||
missing rotation/recovery/migration/threat-model.
|
||||
|
||||
Round 2 → boot-id false-positive, broker must dedupe on client id, CI
|
||||
shared-runner reality, feature-bit negotiation, key rotation crypto, hook
|
||||
scopes, FTS schema, ~7 polish items.
|
||||
|
||||
Round 3 → dedupe window semantics, feature-bit parameters, key archive
|
||||
record format, fingerprint source precedence, deferred hook scopes, RunPod
|
||||
inconsistency.
|
||||
|
||||
This v4 attempts to address all of round 3. Specifically:
|
||||
|
||||
1. **Permanent dedupe choice (§4)** — does the storage-cost calculus hold?
|
||||
Is the tombstone path (`client_id_unknown` after row GC) actually
|
||||
workable, or does it need to be a real tombstone table?
|
||||
2. **Feature parameter shape (§15.1)** — is the type system right (object
|
||||
with optional value)? Should it be a flat key-value list instead?
|
||||
Versioning of parameters within a feature?
|
||||
3. **Archive record format (§14.1.1)** — anything missing? Is
|
||||
`max_archived_keys=8` a sensible default, or should it be unbounded with
|
||||
a force-expiry on storage size instead of count?
|
||||
4. **Fingerprint per-OS table (§2.2.1)** — accurate? Is BSD worth listing
|
||||
if we're not actively building for FreeBSD in v0.9.0?
|
||||
5. **Hook deferment list (§6.2)** — does it cover all the realistic v0.9.0
|
||||
asks? Is the "shell out to `claudemesh send`" workaround for escalation
|
||||
ergonomically acceptable?
|
||||
6. **RunPod resolution (§16.3)** — agree with treating RunPod as
|
||||
single-tenant container? Or are there real multi-tenant RunPod
|
||||
deployments we should default-guard against?
|
||||
7. **Anything else still wrong?** Read it as if you were going to operate
|
||||
this for a year. What falls down?
|
||||
|
||||
Three options after this review:
|
||||
- **(a) v4 is shippable**: lock the spec, start coding the frozen core.
|
||||
- **(b) v5 needed**: list the must-fix items.
|
||||
- **(c) the architecture itself is wrong**: what would you do differently?
|
||||
|
||||
Be ruthless. We can break anything.
|
||||
468
.artifacts/shipped/2026-05-03-daemon-final-spec-v5.md
Normal file
@@ -0,0 +1,468 @@
|
||||
# `claudemesh daemon` — Final Spec v5
|
||||
|
||||
> **Round 5.** v4 was reviewed by codex (round 4) and got an architectural
|
||||
> pass but flagged one blocker plus four polish items.
|
||||
>
|
||||
> **Blocker**: §4 called dedupe "permanent" while also saying it disappears
|
||||
> when retained rows are hard-deleted. Internally inconsistent. Fix: real
|
||||
> broker-side dedupe/tombstone table independent of message retention.
|
||||
>
|
||||
> **Polish**: (a) rename `mode: "permanent"` to `retention_scoped`; (b)
|
||||
> deterministic duplicate-response shape; (c) feature-parameter schema
|
||||
> validation rules + per-feature parameter version; (d) drop
|
||||
> "zeroed/secure-delete" promises in archive cleanup, define malformed-archive
|
||||
> startup behavior; plus Linux MAC||MAC self-collision noted, RunPod warning
|
||||
> log on persistent default.
|
||||
>
|
||||
> **Intent §0 unchanged from v2.** v5 only revises what changed from v4.
|
||||
|
||||
---
|
||||
|
||||
## 0. Intent — unchanged, see v2 §0
|
||||
|
||||
Pre-launch peer-mesh runtime. Servers/laptops become first-class peers.
|
||||
Stable identity, persistent WS, local IPC, hooks. Not a webhook gateway, not
|
||||
a generic broker. We can break anything.
|
||||
|
||||
**One claim retracted from v1/v2**: "exactly-once" delivery. Replaced with a
|
||||
precise contract in §4.
|
||||
|
||||
---
|
||||
|
||||
## 1. Process model — unchanged from v3 §1 / v2 §1
|
||||
|
||||
---
|
||||
|
||||
## 2. Identity — accidental-clone detection only
|
||||
|
||||
### 2.1 Modes — unchanged from v4 §2.1, RunPod warning added
|
||||
|
||||
When `RUNPOD_POD_ID` is set and identity is persistent (the default for
|
||||
RunPod under v4 §16.3), daemon logs `runpod_persistent_default_assumed` at
|
||||
INFO. Operators running RunPod as a multi-tenant CI surface should pass
`--ephemeral` explicitly; the log line makes the default visible in case the
assumption doesn't fit their deployment.
|
||||
|
||||
### 2.2 Accidental-clone detection — unchanged from v4 §2.2
|
||||
|
||||
#### 2.2.1 Fingerprint source precedence — unchanged from v4 §2.2.1, with self-collision note
|
||||
|
||||
**Linux MAC-only fallback (NEW note)**: when `/etc/machine-id` is unreadable
|
||||
and we fall back to MAC-only as `host_id`, the resulting fingerprint is
|
||||
effectively `sha256(mac || mac)`. This is acceptable for clone detection
|
||||
(still uniquely identifies *this* host's first-NIC MAC) but reduces entropy
|
||||
to ~48 bits. Operators who want stronger fingerprinting in degraded
|
||||
environments can persist a generated UUID via `host_fingerprint.id_override`
|
||||
in config; documented but not required.
|
||||
|
||||
### 2.3 Concurrent-duplicate-identity broker policy — unchanged from v3 §2.3
|
||||
|
||||
### 2.4 Rename, key rotation — see §14
|
||||
|
||||
---
|
||||
|
||||
## 3. IPC surface — unchanged from v4 §3
|
||||
|
||||
---
|
||||
|
||||
## 4. Delivery contract — at-least-once, **dedupe table**, retention-scoped
|
||||
|
||||
Codex round 4 caught: v4 said "permanent" but also said dedupe disappears
|
||||
when message rows are hard-deleted. That's `retention_scoped`, not
|
||||
permanent — and worse, the partial-unique-index design fails when the row
|
||||
itself is gone. v5 introduces a real broker-side dedupe table with its own
|
||||
retention policy, independent of message retention.
|
||||
|
||||
### 4.1 The contract (precise)
|
||||
|
||||
> **Local guarantee**: each successful `POST /v1/send` returns a stable
|
||||
> `client_message_id`. The send is durably persisted to `outbox.db` before
|
||||
> the response returns.
|
||||
>
|
||||
> **Broker guarantee**: the broker maintains a dedupe record for every
|
||||
> accepted `client_message_id` in a dedicated table
|
||||
> (`mesh.client_message_dedupe`). The dedupe record outlives the message
|
||||
> row when the dedupe-retention policy is longer than the
|
||||
> message-retention policy. While the dedupe record exists, all retries
|
||||
> with that `client_message_id` collapse to the original
|
||||
> `broker_message_id` deterministically. After the dedupe record expires,
|
||||
> a retry would create a new message — but daemon outbox `max_age_hours`
|
||||
> is configured against the broker's advertised `dedupe_retention_days`
|
||||
> with margin (§15.1), so this should not happen in practice.
|
||||
>
|
||||
> **End-to-end guarantee**: at-least-once delivery to subscribers, with
|
||||
> `client_message_id` propagated in the inbound envelope. Receiver-side
|
||||
> dedupe is the receiver's job; the daemon's `inbox.db` provides it for
|
||||
> daemon-hosted peers.
|
||||
|
||||
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
|
||||
|
||||
Sources: `Idempotency-Key` header → body `client_message_id` → daemon ulid.
|
||||
Stored in outbox UNIQUE NOT NULL, propagated to broker, propagated to
|
||||
receivers in inbound envelope.
|
||||
|
||||
### 4.3 Broker schema — dedupe table separate from message rows (v5)
|
||||
|
||||
```sql
-- The dedupe authority. One row per (mesh, client_message_id) accepted
-- by the broker. Outlives mesh.topic_message rows when retention >
-- message retention.
CREATE TABLE mesh.client_message_dedupe (
  mesh_id            UUID NOT NULL REFERENCES mesh.mesh(id) ON DELETE CASCADE,
  client_message_id  TEXT NOT NULL,
  broker_message_id  UUID NOT NULL,       -- the original accepted message id
  destination_kind   TEXT NOT NULL CHECK(destination_kind IN ('topic','dm','queue')),
  destination_ref    TEXT NOT NULL,       -- topic name, recipient pubkey, etc.
  first_seen_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  expires_at         TIMESTAMPTZ,         -- NULL = never expires (operator opt-in)
  status             TEXT NOT NULL CHECK(status IN ('accepted','rejected')),
  history_available  BOOLEAN NOT NULL DEFAULT TRUE,  -- flipped FALSE when message row GC'd
  PRIMARY KEY (mesh_id, client_message_id)
);

CREATE INDEX client_message_dedupe_expires_idx
  ON mesh.client_message_dedupe(expires_at)
  WHERE expires_at IS NOT NULL;

-- Existing tables get the convenience back-pointer (for receiver
-- inclusion in delivered envelopes); UNIQUE NOT enforced here — the
-- dedupe table is the authority.
ALTER TABLE mesh.topic_message ADD COLUMN client_message_id TEXT;
ALTER TABLE mesh.message_queue ADD COLUMN client_message_id TEXT;
```
|
||||
|
||||
**Retention semantics**:
|
||||
|
||||
- `expires_at = NULL` → dedupe row never expires unless mesh is deleted.
|
||||
Operator opts in via mesh setting `dedupeRetentionMode = "permanent"`.
|
||||
- `expires_at = first_seen_at + dedupe_retention_days` → default
|
||||
`retention_scoped` mode. Default value: 365 days. Configurable per-mesh.
|
||||
- A nightly broker job deletes rows where `expires_at < NOW()`.
|
||||
- A separate broker job, fired when the message-retention sweep hard-deletes
|
||||
a `mesh.topic_message` or `mesh.message_queue` row, sets the corresponding
|
||||
dedupe row's `history_available = FALSE`. The dedupe row stays — only the
|
||||
payload is gone. Retries still collapse correctly; receiver requests for
|
||||
history return "row pruned" deterministically (§4.4 below).
|
||||
|
||||
**Migration**: additive-only. Daemon refuses to start unless broker
|
||||
advertises feature `client_message_id_dedupe` with `mode` of
|
||||
`retention_scoped` or `permanent` (§15.1).
|
||||
|
||||
### 4.4 Duplicate response — deterministic shape (NEW v5 — codex r4)
|
||||
|
||||
When the broker sees a send with a `client_message_id` already in
|
||||
`mesh.client_message_dedupe`, the response is deterministic:
|
||||
|
||||
```json
{
  "broker_message_id": "msg_01HQX...",
  "client_message_id": "cmid_01HQX...",
  "duplicate": true,
  "history_available": true, // false if message row was GC'd
  "first_seen_at": "2026-05-03T11:42:00Z",
  "destination_kind": "topic",
  "destination_ref": "alerts"
}
```
|
||||
|
||||
Daemon outcomes:
|
||||
|
||||
- `duplicate: true, history_available: true` → mark outbox row `done`,
|
||||
store `broker_message_id`. No re-fanout (broker did the work the first
|
||||
time).
|
||||
- `duplicate: true, history_available: false` → mark outbox row `done` but
|
||||
log `cm_daemon_dedupe_history_pruned_total`. The message *did* deliver
|
||||
the first time; we just can't show it in history. Receivers who needed
|
||||
it have it; receivers who didn't have already missed their window.
|
||||
- No more `client_id_unknown` — that response code is removed.
|
||||
|
||||
### 4.5 Outbox schema — daemon-side max-age derived (v5)
|
||||
|
||||
```sql
CREATE TABLE outbox (
  id                 TEXT PRIMARY KEY,
  client_message_id  TEXT NOT NULL UNIQUE,
  payload            BLOB NOT NULL,
  enqueued_at        INTEGER NOT NULL,
  attempts           INTEGER DEFAULT 0,
  next_attempt_at    INTEGER NOT NULL,
  status             TEXT CHECK(status IN ('pending','inflight','done','dead')),
  last_error         TEXT,
  delivered_at       INTEGER,
  broker_message_id  TEXT
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
```
|
||||
|
||||
Daemon `max_age_hours` is **derived** from the broker-advertised
|
||||
`dedupe_retention_days` parameter:
|
||||
- `permanent` → daemon default 168h (7d), capped at 30d. (Daemon doesn't
|
||||
hold sends forever — that's an outbox bug surface.)
|
||||
- `retention_scoped, dedupe_retention_days = N` → daemon
|
||||
`max_age_hours = (N * 24) - safety_margin_hours`. Default
|
||||
`safety_margin_hours = 24`.
|
||||
- Operator override permitted but logged as
|
||||
`outbox_max_age_above_broker_window` if it exceeds broker safe range.
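The derivation above is plain arithmetic; a minimal sketch of it follows. The function name `v5_max_age_hours` and its keyword parameters are hypothetical, chosen only for illustration, and the operator-override path is left out.

```python
def v5_max_age_hours(mode, dedupe_retention_days=None, safety_margin_hours=24,
                     permanent_default_hours=168, permanent_cap_hours=720):
    """Derive the daemon outbox max age (hours) from the advertised dedupe params.

    Hypothetical helper: only the numbers (168h default, 30d cap, 24h margin)
    come from the spec text; names and signature are assumptions.
    """
    if mode == "permanent":
        # Daemon default of 7 days, never above the 30-day cap.
        return min(permanent_default_hours, permanent_cap_hours)
    if mode == "retention_scoped":
        # (N * 24) - safety_margin_hours
        return dedupe_retention_days * 24 - safety_margin_hours
    raise ValueError(f"unknown dedupe mode: {mode!r}")
```

With the default 365-day retention this gives 8736 hours; with `dedupe_retention_days = 1` and the default margin it gives 0.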
|
||||
|
||||
### 4.6 Inbox schema — unchanged from v3 §4.5
|
||||
|
||||
### 4.7 Crash recovery — unchanged from v3 §4.6
|
||||
|
||||
### 4.8 Failure modes — corrected for dedupe-table model
|
||||
|
||||
- **`dead` rows**: surface in `claudemesh daemon outbox --failed`. Same as v4.
|
||||
- **Receiver-side dedupe**: only daemon-hosted receivers dedupe. Same as v4.
|
||||
- **Daemon retry after dedupe row expired AND message row GC'd**: in
|
||||
`retention_scoped` mode this can only happen if the daemon outbox row
|
||||
was older than `dedupe_retention_days - safety_margin`. Daemon will
|
||||
refuse to send rows older than its computed `max_age_hours` (§4.5) —
|
||||
they go to `dead` first, surfaced for human action. So this edge is
|
||||
closed by daemon-side gating, not broker-side dedupe.
|
||||
- **Daemon retry after dedupe row expired BUT message row still alive**:
|
||||
doesn't happen by design — dedupe retention is always ≥ message
|
||||
retention in operator-sane configs. If misconfigured, message row
|
||||
persists with NULL `client_message_id` reference, retry creates a new
|
||||
message, broker emits `cm_broker_dedupe_misconfig_total` with
|
||||
`(mesh_id, retention_dedupe_days, retention_message_days)` labels.
|
||||
|
||||
---
|
||||
|
||||
## 5. Inbound — unchanged from v3 §5
|
||||
|
||||
---
|
||||
|
||||
## 6. Hooks — unchanged from v4 §6
|
||||
|
||||
---
|
||||
|
||||
## 7-13. Multi-mesh, auto-routing, service install, observability, SDKs, security model, configuration — unchanged from v4
|
||||
|
||||
---
|
||||
|
||||
## 14. Lifecycle — archive cleanup wording corrected (codex r4)
|
||||
|
||||
### 14.1 Key rotation — unchanged crypto from v4 §14.1
|
||||
|
||||
### 14.1.1 Archive record format — corrected wording (v5)
|
||||
|
||||
`keypair-archive.json` (mode 0600, atomic-rename writes):
|
||||
|
||||
```json
{
  "schema_version": 1,
  "max_archived_keys": 8,
  "keys": [
    {
      "ed25519_pubkey": "base64...", // metadata only; matches the rotated-out signing key for that key_id
      "x25519_pubkey": "base64...", // matches the retained private key
      "x25519_privkey": "base64...", // sensitive; whole file is 0600
      "key_id": "k_01HQX...",
      "created_at": "2026-04-12T11:00:00Z",
      "rotated_out_at": "2026-05-03T16:00:00Z",
      "expires_at": "2026-05-10T16:00:00Z"
    }
  ]
}
```
|
||||
|
||||
**Field clarifications (codex r4)**:
|
||||
- `ed25519_pubkey` is metadata — the daemon does not retain the old ed25519
|
||||
*private* key. Stored to bind `key_id` ↔ old signing identity for audit
|
||||
reconstruction (e.g. "this archived x25519 was the recipient half of a
|
||||
member who at the time signed messages with the matching ed25519").
|
||||
- `x25519_pubkey` MUST match the public half of `x25519_privkey`. Daemon
|
||||
validates on archive load; mismatch → quarantine (see corruption rules).
|
||||
|
||||
**Cleanup wording (codex r4)**:
|
||||
- On `expires_at < now`: entry is removed from the live archive file via
|
||||
atomic-rename rewrite. **Secure deletion of the prior file's data is not
|
||||
guaranteed** on modern filesystems (journals, COW snapshots, SSD wear
|
||||
leveling, atomic-rename leaving stale inodes). Operators who need
|
||||
cryptographic erasure must operate on encrypted volumes or reissue
|
||||
hardware. Documented in threat model §16.
|
||||
- "Force-expiry" when `max_archived_keys` is exceeded uses the same
|
||||
removal mechanism; same caveat applies. Counter
|
||||
`cm_daemon_archive_force_expired_total{key_id}` exposed.
|
||||
|
||||
**Duplicate `key_id` handling (NEW v5)**:
|
||||
- Archive load rejects any file whose `keys[]` contains two records with
|
||||
the same `key_id`. Quarantine to `keypair-archive.json.malformed-<ts>`,
|
||||
start with empty archive, log `keypair_archive_duplicate_key_id`. Daemon
|
||||
continues to start (we don't want archive corruption to be a permanent
|
||||
outage). Old in-flight messages encrypted to the lost archived keys
|
||||
fail to decrypt and are counted in `cm_daemon_decrypt_stale_total`.
|
||||
|
||||
**Malformed archive on startup (NEW v5)**:
|
||||
- File present but JSON parse fails OR schema fails OR pubkey/privkey pair
|
||||
fails validation: quarantine as above, start with empty archive, log
|
||||
`keypair_archive_malformed`. Same continue-startup behavior.
|
||||
- File missing entirely: treated as empty archive (normal first run /
|
||||
post-cleanup state), no warning.
|
||||
- File present but mode != 0600: log `keypair_archive_perms` warning,
|
||||
read anyway. Operators surfaced; daemon doesn't auto-chmod (they should
|
||||
fix their pipeline).
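The startup rules above can be sketched as a single loader. This is an illustration under stated assumptions, not the daemon's code: `load_archive` and `quarantine` are hypothetical names, the `<ts>` suffix is assumed to be epoch seconds, and the x25519 pubkey/privkey pair check is omitted because it needs a crypto library.

```python
import json
import logging
import os
import stat
import time

log = logging.getLogger("keypair_archive")


def quarantine(path, event):
    """Move the bad file aside, log the event, and start with an empty archive."""
    # Assumption: <ts> is rendered as epoch seconds; the spec does not fix a format.
    os.replace(path, f"{path}.malformed-{int(time.time())}")
    log.warning(event)
    return []


def load_archive(path):
    """Return the archived key records; a bad file never blocks startup."""
    if not os.path.exists(path):
        return []  # normal first run / post-cleanup state, no warning
    if stat.S_IMODE(os.stat(path).st_mode) != 0o600:
        log.warning("keypair_archive_perms")  # read anyway; no auto-chmod
    try:
        with open(path, "r", encoding="utf-8") as fh:
            doc = json.load(fh)
        keys = doc["keys"]
        ids = [record["key_id"] for record in keys]
        has_duplicates = len(ids) != len(set(ids))
    except (ValueError, KeyError, TypeError):
        # JSON parse failure or a shape that does not match the schema.
        return quarantine(path, "keypair_archive_malformed")
    if has_duplicates:
        return quarantine(path, "keypair_archive_duplicate_key_id")
    return keys
```

Returning an empty list on every failure path mirrors the "continue startup" decision: the caller cannot tell a quarantined archive from a first run except through the log.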
|
||||
|
||||
### 14.2 Backup — unchanged from v4 §14.2
|
||||
|
||||
### 14.3 Local token rotation, compromised host revocation, image-clone, uninstall, recovery — unchanged
|
||||
|
||||
---
|
||||
|
||||
## 15. Version compat — feature-bit schema validation (v5)
|
||||
|
||||
Codex r4: feature parameters need explicit schema-validation rules and
|
||||
per-feature versioning so we don't paint ourselves into a corner when a
|
||||
parameter shape evolves.
|
||||
|
||||
### 15.1 Feature bits with parameters and versions
|
||||
|
||||
Each feature bit's parameters are versioned independently of broker version:
|
||||
|
||||
| Bit | `params.version` | Required parameters | Optional parameters |
|---|---|---|---|
| `client_message_id_dedupe` | `1` | `mode: "retention_scoped"\|"permanent"`, `dedupe_retention_days: int (>= 1)` (when mode=retention_scoped) | `tombstone_history_pruned_window_days: int` |
| `concurrent_connection_policy` | `1` | (no parameters) | `default_policy: "prefer_newest"\|"prefer_oldest"\|"allow_concurrent"` |
| `member_keypair_rotated_event` | `1` | (no parameters) | — |
| `key_epoch` | `1` | `max_concurrent_epochs: int (>= 1)` | — |
| `max_payload` | `1` | `inline_bytes: int (>= 1024)`, `blob_bytes: int (>= 1024)` | — |
| `mesh_skill_share` | future | — | — |
| `mcp_host` | future | — | — |
|
||||
|
||||
**Validation rules (NEW v5)**:
|
||||
|
||||
When the broker advertises feature parameters in
|
||||
`feature_negotiation_response`, the daemon validates against the
|
||||
parameter schema for that `params.version`. Validation failures:
|
||||
|
||||
- **Required parameter missing**: treated identically to "feature missing
|
||||
from `supported`" — if the feature is in daemon's `require[]`, daemon
|
||||
closes WS with code 4010 `feature_unavailable` and exits non-zero.
|
||||
- **Required parameter out of bounds** (e.g. `dedupe_retention_days = -5`,
|
||||
`inline_bytes = 0`): same — treated as "feature missing from
|
||||
`supported`."
|
||||
- **Unknown `params.version`**: if daemon doesn't recognize the version,
|
||||
treated as "feature missing." Daemon does NOT silently degrade.
|
||||
- **Optional parameter missing or invalid**: daemon uses its own default,
|
||||
logs `feature_optional_param_invalid{feature, param, reason}`, continues.
|
||||
- **Unknown `mode` for `client_message_id_dedupe`** (not "retention_scoped"
|
||||
or "permanent"): treated as "feature missing." Future modes require a
|
||||
`params.version` bump.
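As a sketch of how the required-parameter rules above might be applied to `client_message_id_dedupe`, assuming the `supported` map has the shape shown in §15.2: the function names and reason strings below are hypothetical, and optional-parameter handling is left out.

```python
def validate_dedupe_params(params):
    """Return (ok, reason) for `client_message_id_dedupe` params, version 1."""
    if params.get("version") != 1:
        return False, "unknown_params_version"
    if "mode" not in params:
        return False, "required_param_missing"
    mode = params["mode"]
    if mode not in ("retention_scoped", "permanent"):
        return False, "unknown_mode"
    if mode == "retention_scoped":
        days = params.get("dedupe_retention_days")
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(days, int) or isinstance(days, bool):
            return False, "required_param_missing"
        if days < 1:
            return False, "required_param_out_of_bounds"
    return True, "ok"


def missing_required(require, supported):
    """List required features that are absent or fail parameter validation."""
    validators = {"client_message_id_dedupe": validate_dedupe_params}
    missing = []
    for feature in require:
        entry = supported.get(feature)
        if entry is None:
            missing.append(feature)
            continue
        check = validators.get(feature)
        if check is not None and not check(entry.get("params", {}))[0]:
            missing.append(feature)
    return missing
```

A non-empty result corresponds to the "feature missing" outcome: the daemon would close the WS with 4010 and exit non-zero.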
|
||||
|
||||
Validation is NOT silent: every feature_negotiation_response is logged
|
||||
fully (with sensitive parameters redacted, though we don't currently have
|
||||
any) at DEBUG, and a single line at INFO summarizes negotiated capabilities
|
||||
on each successful negotiation.
|
||||
|
||||
### 15.2 Negotiation handshake — shape updated (v5)
|
||||
|
||||
```
→ daemon: feature_negotiation_request
  {
    require: ["client_message_id_dedupe",
              "concurrent_connection_policy"],
    optional: ["mesh_skill_share","mcp_host","max_payload"]
  }

← broker: feature_negotiation_response
  {
    supported: {
      "client_message_id_dedupe": {
        "params": {
          "version": 1,
          "mode": "retention_scoped",
          "dedupe_retention_days": 365,
          "tombstone_history_pruned_window_days": 30
        }
      },
      "concurrent_connection_policy": {
        "params": { "version": 1, "default_policy": "prefer_newest" }
      },
      "member_keypair_rotated_event": { "params": { "version": 1 } },
      "max_payload": {
        "params": { "version": 1, "inline_bytes": 65536, "blob_bytes": 524288000 }
      }
    },
    missing_required: []
  }
```
|
||||
|
||||
If `missing_required` is non-empty after broker's response OR after daemon
|
||||
parameter validation, daemon closes with 4010 and exits non-zero.
|
||||
|
||||
### 15.3 IPC negotiation — unchanged from v3 §15.3
|
||||
|
||||
### 15.4 Compatibility matrix — unchanged from v3 §15.4
|
||||
|
||||
---
|
||||
|
||||
## 16. Threat model — unchanged from v4 §16
|
||||
|
||||
Plus archive-secure-delete clarification under §14.1.1.
|
||||
|
||||
---
|
||||
|
||||
## 17. Migration — broker dedupe table is the new prereq
|
||||
|
||||
Broker side, deploy order:
|
||||
1. `CREATE TABLE mesh.client_message_dedupe` + supporting indexes
|
||||
(additive, online-safe).
|
||||
2. `ALTER TABLE mesh.topic_message ADD COLUMN client_message_id` (already
|
||||
in v3/v4 plan).
|
||||
3. Broker code: every `INSERT` into `topic_message` / `message_queue` first
|
||||
`INSERT ... ON CONFLICT DO UPDATE RETURNING` into
|
||||
`client_message_dedupe`. The conflict path returns existing
|
||||
`broker_message_id` instead of creating a new row.
|
||||
4. Broker code: nightly job to delete `client_message_dedupe` rows where
|
||||
`expires_at < NOW()`.
|
||||
5. Broker code: hook into the existing message-retention sweep to set
|
||||
`history_available = FALSE` on dedupe rows whose message row has been
|
||||
pruned.
|
||||
6. Broker advertises `client_message_id_dedupe` feature bit in negotiation
|
||||
response.
|
||||
7. Daemon refuses to start unless that feature bit is advertised with valid
|
||||
params.
|
||||
|
||||
---
|
||||
|
||||
## What changed v4 → v5 (codex round-4 actionable items)
|
||||
|
||||
| Codex r4 item | v5 fix | Section |
|---|---|---|
| Dedupe must be retention-scoped, not "permanent" with row-deletion gap | Real `mesh.client_message_dedupe` table; retention independent of message rows; `permanent` becomes opt-in mode meaning "no expires_at" | §4.1, §4.3 |
| Rename misleading mode | `retention_scoped` is the default; `permanent` reserved for explicit opt-in | §4.3, §15.1 |
| Deterministic duplicate response | New shape with `duplicate`, `broker_message_id`, `history_available`; removed `client_id_unknown` | §4.4 |
| Feature parameter validation rules | `params.version` per feature; required-param failure = treated as missing-required-feature; daemon closes WS 4010, exits non-zero | §15.1 |
| Drop "zeroed/secure-delete" promise | Replaced with "removed from live archive; secure deletion not guaranteed"; threat model documents | §14.1.1 |
| Duplicate `key_id` handling | Archive load rejects, quarantine, start empty, continue | §14.1.1 |
| Malformed archive startup behavior | Quarantine, start empty, continue; mode-mismatch warns but reads | §14.1.1 |
| Linux MAC||MAC self-collision | Documented; `host_fingerprint.id_override` escape hatch | §2.2.1 |
| RunPod warning on persistent default | Logged at INFO so default is visible | §2.1 |
|
||||
|
||||
---
|
||||
|
||||
## What needs review (round 5)
|
||||
|
||||
1. **Dedupe table design (§4.3)** — is `(mesh_id, client_message_id)`
|
||||
PRIMARY KEY enough, or do we need versioning of the dedupe row itself
|
||||
(e.g. when destination changes mid-retry)? Is `destination_kind` /
|
||||
`destination_ref` needed at all, or just for audit?
|
||||
2. **`history_available = FALSE` semantics (§4.4)** — does it actually fix
|
||||
the case where receivers ask for history of a pruned message? Or does
|
||||
the receiver need its own dedupe-with-history-pruned pathway?
|
||||
3. **Daemon outbox max-age math (§4.5)** — is `dedupe_retention_days * 24
|
||||
- 24` margin correct? Should the margin be a percentage instead of a
|
||||
fixed 24h?
|
||||
4. **Feature param validation (§15.1)** — does treating "invalid required
|
||||
param" as "missing required feature" lose useful diagnostic detail?
|
||||
Should we have a 4011 `feature_param_invalid` close code separately?
|
||||
5. **Archive quarantine (§14.1.1)** — is "continue startup with empty
|
||||
archive" the right call, or should it be opt-in / refuse-by-default?
|
||||
6. **Anything else still wrong?** Read it as if you were going to operate
|
||||
this for a year.
|
||||
|
||||
Three options:
|
||||
- **(a) v5 is shippable**: lock the spec, start coding the frozen core.
|
||||
- **(b) v6 needed**: list the must-fix items.
|
||||
- **(c) the architecture itself is wrong**: what would you do differently?
|
||||
|
||||
Be ruthless.
|
||||
447
.artifacts/shipped/2026-05-03-daemon-final-spec-v6.md
Normal file
@@ -0,0 +1,447 @@
|
||||
# `claudemesh daemon` — Final Spec v6
|
||||
|
||||
> **Round 6.** v5 was reviewed by codex (round 5) which found the dedupe
|
||||
> table architecture sound but called out four idempotency-correctness
|
||||
> issues that would silently corrupt sends in production:
|
||||
>
|
||||
> 1. **Idempotency key reuse with different payload/destination** — v5
|
||||
> silently collapsed a different send onto the original. Need a request
|
||||
> fingerprint.
|
||||
> 2. **`status = 'rejected'` underspecified** — schema allowed it, semantics
|
||||
> didn't. Either fully define or drop.
|
||||
> 3. **Outbox max-age math edges** — `dedupe_retention_days = 1` minus 24h
|
||||
> margin = 0 hours, which is undefined.
|
||||
> 4. **Broker atomicity not stated** — dedupe insert and message insert
|
||||
> must be one transaction or you produce orphan dedupe rows.
|
||||
>
|
||||
> v6 fixes all four. **Intent §0 unchanged from v2.** v6 only revises
|
||||
> idempotency semantics in §4 and migration in §17.
|
||||
|
||||
---
|
||||
|
||||
## 0. Intent — unchanged, see v2 §0
|
||||
|
||||
---
|
||||
|
||||
## 1. Process model — unchanged from v3 §1 / v2 §1
|
||||
|
||||
---
|
||||
|
||||
## 2. Identity — unchanged from v5 §2
|
||||
|
||||
---
|
||||
|
||||
## 3. IPC surface — unchanged from v4 §3
|
||||
|
||||
---
|
||||
|
||||
## 4. Delivery contract — at-least-once with **request-fingerprinted** dedupe
|
||||
|
||||
Codex r5: dedupe must compare the *whole request shape*, not just
|
||||
`(mesh, client_message_id)`. Otherwise a caller who reuses an idempotency
|
||||
key with a different destination or body silently drops the new send and
|
||||
gets the old send's metadata back.
|
||||
|
||||
### 4.1 The contract (precise — v6)
|
||||
|
||||
> **Local guarantee**: each successful `POST /v1/send` returns a stable
|
||||
> `client_message_id`. The send is durably persisted to `outbox.db` before
|
||||
> the response returns.
|
||||
>
|
||||
> **Broker guarantee**: the broker maintains a dedupe record per accepted
|
||||
> `(mesh_id, client_message_id)` in `mesh.client_message_dedupe`. Each
|
||||
> dedupe record carries a canonical `request_fingerprint`. Retries with
|
||||
> the same `client_message_id` AND matching fingerprint collapse to the
|
||||
> original `broker_message_id`. Retries with the same `client_message_id`
|
||||
> but a different fingerprint return a deterministic conflict
|
||||
> (`409 idempotency_key_reused`) and do **not** create a new message.
|
||||
>
|
||||
> **Atomicity guarantee**: dedupe row insertion and message row insertion
|
||||
> happen in one broker DB transaction. Either both land, or neither. No
|
||||
> orphan dedupe rows. If the broker crashes between dedupe insert and
|
||||
> message insert, the rollback unwinds both.
|
||||
>
|
||||
> **End-to-end guarantee**: at-least-once delivery, with
|
||||
> `client_message_id` propagated to receivers' inboxes.
|
||||
|
||||
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
|
||||
|
||||
### 4.3 Broker schema — request fingerprint added (v6)
|
||||
|
||||
```sql
CREATE TABLE mesh.client_message_dedupe (
  mesh_id              UUID NOT NULL REFERENCES mesh.mesh(id) ON DELETE CASCADE,
  client_message_id    TEXT NOT NULL,

  -- The original accepted message; FK NOT enforced because the message row
  -- may be GC'd by retention sweeps before the dedupe row expires.
  broker_message_id    UUID NOT NULL,

  -- Canonical fingerprint of the original request. Recomputed on every
  -- duplicate retry; mismatch → 409 idempotency_key_reused. Schema in §4.4.
  request_fingerprint  BYTEA NOT NULL,    -- 32-byte sha256

  destination_kind     TEXT NOT NULL CHECK(destination_kind IN ('topic','dm','queue')),
  destination_ref      TEXT NOT NULL,
  first_seen_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  expires_at           TIMESTAMPTZ,       -- NULL = `permanent` mode
  history_available    BOOLEAN NOT NULL DEFAULT TRUE,  -- flipped FALSE when message row GC'd

  PRIMARY KEY (mesh_id, client_message_id)
);

CREATE INDEX client_message_dedupe_expires_idx
  ON mesh.client_message_dedupe(expires_at)
  WHERE expires_at IS NOT NULL;

ALTER TABLE mesh.topic_message ADD COLUMN client_message_id TEXT;
ALTER TABLE mesh.message_queue ADD COLUMN client_message_id TEXT;
```
|
||||
|
||||
**`status` column dropped (codex r5)**. Rejected requests do **not**
|
||||
consume idempotency keys. Rationale below in §4.6.
|
||||
|
||||
### 4.4 Request fingerprint — canonical form (NEW v6)
|
||||
|
||||
The fingerprint covers everything that makes a send semantically distinct.
|
||||
A retry must reproduce the same fingerprint bit-for-bit; anything else is
|
||||
a different send and must not be collapsed.
|
||||
|
||||
```
request_fingerprint = sha256(
  envelope_version || 0x00 ||
  destination_kind || 0x00 ||
  destination_ref || 0x00 ||
  reply_to_id_or_empty || 0x00 ||
  priority || 0x00 ||
  meta_canonical_json || 0x00 ||
  body_hash
)
```
|
||||
|
||||
Where:
|
||||
- `envelope_version`: integer string (e.g. `"1"`). Bumps when the envelope
|
||||
shape changes.
|
||||
- `destination_kind`: `topic`, `dm`, or `queue`.
|
||||
- `destination_ref`: topic name, recipient ed25519 pubkey hex, or queue id.
|
||||
- `reply_to_id_or_empty`: original `broker_message_id` or empty string.
|
||||
- `priority`: `now`, `next`, or `low`.
|
||||
- `meta_canonical_json`: the `meta` field, serialized with sorted keys,
|
||||
no whitespace, escape-canonical (RFC 8785 JCS). Empty meta = empty string.
|
||||
- `body_hash`: sha256(body bytes), hex.
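The canonical form above can be sketched in a few lines. This is a minimal illustration, assuming every field is UTF-8 encoded before concatenation (the spec does not state the encoding); `request_fingerprint` and `canonical_meta` are hypothetical helper names. The `json.dumps` call only approximates RFC 8785: it agrees with JCS for plain strings, booleans, null, and small integers, but floats and non-BMP key ordering would need a real JCS implementation.

```python
import hashlib
import json


def canonical_meta(meta):
    """Approximate JCS serialization of `meta`; empty meta maps to ''."""
    if not meta:
        return ""
    # Sorted keys, no whitespace, non-ASCII emitted literally.
    return json.dumps(meta, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def request_fingerprint(destination_kind, destination_ref, priority, body,
                        *, envelope_version=1, reply_to_id="", meta=None):
    """Return the 32-byte sha256 fingerprint of one send request."""
    body_hash = hashlib.sha256(body).hexdigest()
    fields = [
        str(envelope_version),
        destination_kind,
        destination_ref,
        reply_to_id,
        priority,
        canonical_meta(meta),
        body_hash,
    ]
    # Fields are joined by a single 0x00 byte, in the order the formula lists them.
    preimage = b"\x00".join(field.encode("utf-8") for field in fields)
    return hashlib.sha256(preimage).digest()
```

Because `meta` is canonicalized before hashing, two callers that build the same metadata with different key order produce the same fingerprint, while any change to destination, priority, or body produces a different one.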
|
||||
|
||||
The fingerprint is computed:
|
||||
1. **Daemon-side** before durable outbox persistence — stored as
|
||||
`outbox.request_fingerprint` (NEW column) so retries always produce
|
||||
the same fingerprint regardless of caller behavior.
|
||||
2. **Broker-side** on first receipt — stored in
|
||||
`client_message_dedupe.request_fingerprint`.
|
||||
3. **Broker-side** on every duplicate retry — recomputed and compared
|
||||
byte-equal to the stored value.
|
||||
|
||||
If the daemon and broker disagree on the canonical form (e.g. JCS
implementation drift), the broker emits
`cm_broker_dedupe_fingerprint_mismatch_total{client_id, mesh_id}` and
returns `409 idempotency_key_reused` with a body that includes a prefix of
the broker's fingerprint hex for debugging (§4.5). Daemons that see this
should log it loudly and stop retrying that outbox row (it goes to `dead`).
|
||||
|
||||
### 4.5 Duplicate response — three cases (v6)
|
||||
|
||||
| Case | HTTP/WS code | Body |
|---|---|---|
| First insert | `201 created` | `{ broker_message_id, client_message_id, history_id, duplicate: false }` |
| Duplicate, fingerprint match | `200 ok` | `{ broker_message_id, client_message_id, history_id, duplicate: true, history_available, first_seen_at }` |
| Duplicate, fingerprint mismatch | `409 idempotency_key_reused` | `{ client_message_id, conflict: "request_fingerprint_mismatch", broker_fingerprint_prefix: "ab12cd34..." }` (first 8 bytes hex) |
|
||||
|
||||
Daemon outcomes:
|
||||
- `201` → mark outbox row `done`, store `broker_message_id`. Normal path.
|
||||
- `200 duplicate` with `history_available: true` → mark `done`, no
|
||||
re-fanout, log at INFO.
|
||||
- `200 duplicate` with `history_available: false` → mark `done`, log at
|
||||
WARN. The original delivery succeeded; receivers got it.
|
||||
- `409 idempotency_key_reused` → mark outbox row `dead`, surface in
|
||||
`claudemesh daemon outbox --failed`. Operator must rotate the
|
||||
idempotency key by hand and resubmit (`outbox requeue --new-id <id>`,
|
||||
NEW v6 subcommand). Daemon does NOT auto-rotate to avoid masking caller
|
||||
bugs.
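The daemon outcomes above amount to a small decision table. A sketch, assuming the response body is already parsed into a dict: `outbox_transition` is a hypothetical name, and both the `ERROR` level for the 409 case and the "stay pending" fallback for any other code are assumptions, since the spec does not define them here.

```python
def outbox_transition(status_code, body):
    """Map a broker send response to (new_outbox_status, log_level_or_None)."""
    if status_code == 201:
        return "done", None  # normal path; broker_message_id gets stored
    if status_code == 200 and body.get("duplicate"):
        # Duplicate with matching fingerprint: never re-fanout.
        level = "INFO" if body.get("history_available") else "WARN"
        return "done", level
    if status_code == 409:
        # idempotency_key_reused: no auto-rotation, operator must requeue.
        return "dead", "ERROR"  # assumption: the spec only says "log it loudly"
    # Assumption: anything else is left for the normal retry schedule.
    return "pending", None
```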
|
||||
|
||||
### 4.6 Why rejected requests don't consume idempotency keys (v6)
|
||||
|
||||
`status` was in v5's schema but underspecified. Two scenarios:
|
||||
|
||||
- **Transient broker error** (DB down, queue full, network blip): daemon
|
||||
retries. If we'd persisted a `rejected` row on the first attempt, the
|
||||
retry would fail forever. Bad.
|
||||
- **Permanent validation error** (payload too large, destination not
  found, auth missing): broker returns the appropriate `4xx` immediately
  without inserting a dedupe row. Daemon either fixes the request and
  retries (no dedupe row exists for that key, so the corrected request is
  handled as a first insert rather than a `409`) or marks dead. Persisting
  a "rejected" row buys nothing: the daemon isn't going to send the same
  broken request again with the same key.
|
||||
|
||||
Net result: `client_message_dedupe` rows only exist when the broker
|
||||
**successfully** accepted a message and committed it. The single source
|
||||
of truth for "was this idempotency key consumed?" is the existence of
|
||||
the dedupe row. No status enum, no ambiguous states.
|
||||
|
||||
### 4.7 Broker atomicity contract (NEW v6)
|
||||
|
||||
Every accept path runs in one DB transaction with the following shape:
|
||||
|
||||
```sql
BEGIN;
-- Pre-generate broker_message_id outside the transaction; pass in.
INSERT INTO mesh.client_message_dedupe
  (mesh_id, client_message_id, broker_message_id, request_fingerprint,
   destination_kind, destination_ref, expires_at)
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
        $dest_kind, $dest_ref, $expires_at)
ON CONFLICT (mesh_id, client_message_id) DO NOTHING
RETURNING broker_message_id, request_fingerprint, history_available, first_seen_at;

-- If RETURNING was empty (conflict), do a SELECT to fetch the original row:
--   stored fingerprint = $fingerprint  → exit the transaction with a
--                                         duplicate response.
--   stored fingerprint != $fingerprint → that's the §4.5 mismatch path;
--                                         exit with 409.

-- Otherwise RETURNING produced a row: this is the first insert.
-- Insert the message row.
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
VALUES ($msg_id, $mesh_id, $client_id, ...);

-- Optional: enqueue fan-out work, etc.
COMMIT;
```
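To make the control flow concrete, here is a minimal sketch of the same accept path using Python's `sqlite3` as a stand-in for the broker's Postgres. The trimmed two-table schema and the `accept` function are hypothetical; `cursor.rowcount` replaces `RETURNING` so the sketch does not depend on a recent SQLite build.

```python
import sqlite3

SCHEMA = """
CREATE TABLE client_message_dedupe (
  mesh_id TEXT NOT NULL,
  client_message_id TEXT NOT NULL,
  broker_message_id TEXT NOT NULL,
  request_fingerprint BLOB NOT NULL,
  PRIMARY KEY (mesh_id, client_message_id)
);
CREATE TABLE topic_message (
  id TEXT PRIMARY KEY,
  mesh_id TEXT NOT NULL,
  client_message_id TEXT,
  body BLOB NOT NULL
);
"""


def accept(conn, mesh_id, client_id, msg_id, fingerprint, body):
    """Return (status_code, broker_message_id) for one send attempt."""
    with conn:  # one transaction: commit on success, rollback on exception
        cur = conn.execute(
            "INSERT INTO client_message_dedupe "
            "(mesh_id, client_message_id, broker_message_id, request_fingerprint) "
            "VALUES (?, ?, ?, ?) "
            "ON CONFLICT (mesh_id, client_message_id) DO NOTHING",
            (mesh_id, client_id, msg_id, fingerprint),
        )
        if cur.rowcount == 0:
            # Conflict: the key is already consumed, so compare fingerprints.
            orig_id, orig_fp = conn.execute(
                "SELECT broker_message_id, request_fingerprint "
                "FROM client_message_dedupe "
                "WHERE mesh_id = ? AND client_message_id = ?",
                (mesh_id, client_id),
            ).fetchone()
            return (200, orig_id) if orig_fp == fingerprint else (409, None)
        # First insert: the message row lands in the same transaction.
        conn.execute(
            "INSERT INTO topic_message (id, mesh_id, client_message_id, body) "
            "VALUES (?, ?, ?, ?)",
            (msg_id, mesh_id, client_id, body),
        )
        return (201, msg_id)
```

If the message insert raises (for example on a primary-key collision), the `with conn:` block rolls back the dedupe insert as well, which is the no-orphan property the contract asks for.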
|
||||
|
||||
Failure modes:
|
||||
- Crash before `COMMIT`: both rows roll back. Next daemon retry inserts
|
||||
cleanly.
|
||||
- Crash after `COMMIT` but before WS ACK: dedupe row exists, message row
|
||||
exists. Daemon retries → fingerprint matches → `200 duplicate`. Net:
|
||||
exactly one broker-accepted row, one daemon `done` transition.
|
||||
- Constraint violation on message row insert (e.g. unique violation on
|
||||
some other column): rolls back the dedupe insert. Returns `5xx` to
|
||||
daemon. Daemon retries; same fingerprint reproduces the same constraint
|
||||
violation; daemon eventually marks `dead`. No orphan dedupe row.
|
||||
|
||||
A nightly orphan check, counted by `cm_broker_dedupe_orphan_check_total`,
validates that every `client_message_dedupe` row either has a matching
`topic_message` or `message_queue` row, or has `history_available = FALSE`
because the matching message row was retention-pruned. Any row failing both
conditions is logged as `cm_broker_dedupe_orphan_found{mesh_id}` for
human review. The count should be zero in steady state.
|
||||
|
||||
### 4.8 Outbox schema — fingerprint stored alongside (v6)
|
||||
|
||||
```sql
CREATE TABLE outbox (
  id                   TEXT PRIMARY KEY,
  client_message_id    TEXT NOT NULL UNIQUE,
  request_fingerprint  BLOB NOT NULL,     -- 32 bytes
  payload              BLOB NOT NULL,
  enqueued_at          INTEGER NOT NULL,
  attempts             INTEGER DEFAULT 0,
  next_attempt_at      INTEGER NOT NULL,
  status               TEXT CHECK(status IN ('pending','inflight','done','dead')),
  last_error           TEXT,
  delivered_at         INTEGER,
  broker_message_id    TEXT
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
```
|
||||
|
||||
`request_fingerprint` is computed at IPC accept time and stored. Every
|
||||
retry sends the same bytes. The daemon never recomputes from `payload`
|
||||
post-enqueue (would produce drift if envelope_version changes between
|
||||
daemon runs).
|
||||
|
||||
### 4.9 Outbox max-age math — bounded (v6)
|
||||
|
||||
Codex r5: the v5 formula `(dedupe_retention_days * 24) - 24h_margin`
|
||||
breaks at `dedupe_retention_days = 1` (yields zero) and is undefined
|
||||
behavior at `<= 1`.
|
||||
|
||||
v6 formula and bounds:
|
||||
|
||||
- **Minimum supported broker dedupe retention**: 3 days. Daemon refuses
|
||||
to start if broker advertises `dedupe_retention_days < 3` (treats it
|
||||
as `feature_param_below_floor`, close code 4012 per §15.5).
|
||||
- **Daemon `max_age_hours` derivation**:
|
||||
- `permanent` mode → daemon uses config default (168h = 7d), cap 720h
|
||||
(30d).
|
||||
- `retention_scoped` mode → daemon `max_age_hours = max(72,
|
||||
(dedupe_retention_days * 24) - safety_margin_hours)` where
|
||||
`safety_margin_hours = max(24, ceil(dedupe_retention_days * 0.1 *
|
||||
24))`. For `dedupe_retention_days=3` this gives
|
||||
`max(72, 72-24) = 72h`. For 30 days: `max(72, 720-72) = 648h`. For
|
||||
365 days: `max(72, 8760-876) = 7884h`.
|
||||
- The 72h floor prevents the daemon outbox from being uselessly short
|
||||
— three days is enough margin for normal operator response to a
|
||||
paged outage.
|
||||
|
||||
- Operator override allowed via `[outbox] max_age_hours_override = N`,
|
||||
but if `N` exceeds `dedupe_retention_days * 24 - 1` daemon refuses to
|
||||
start with `outbox_max_age_above_dedupe_window`. The override exists
|
||||
for the rare case of a much-shorter-than-default outbox; it does not
|
||||
exist to circumvent the broker's dedupe window.
|
||||
|
||||
### 4.10 Inbox schema — unchanged from v3 §4.5
|
||||
|
||||
### 4.11 Crash recovery — unchanged from v3 §4.6
|
||||
|
||||
### 4.12 Failure modes — corrected for fingerprint model (v6)
|
||||
|
||||
- **Fingerprint mismatch on retry** (`409 idempotency_key_reused`): outbox
|
||||
row marked `dead`. Surfaced in `--failed` view. Operator command
|
||||
`outbox requeue --new-id <id>` rotates `client_message_id` and retries.
|
||||
- **Daemon retry after dedupe row hard-deleted by retention sweep**: in
|
||||
`retention_scoped` mode, daemon `max_age_hours` is bounded inside the
|
||||
retention window (§4.9), so this can only happen via operator override.
|
||||
In that case the retry creates a NEW dedupe row + new message — the
|
||||
caller chose this risk explicitly. Counter
|
||||
`cm_daemon_retry_after_dedupe_expired_total`.
|
||||
- **Daemon retry after dedupe row hard-deleted in `permanent` mode**:
|
||||
cannot happen by definition — `permanent` means no `expires_at`. Only
|
||||
mesh deletion removes dedupe rows.
|
||||
- **Duplicate row, history pruned**: as v5 §4.4. Mark `done`, log
|
||||
`cm_daemon_dedupe_history_pruned_total`.
|
||||
|
||||
---
|
||||
|
||||
## 5. Inbound — unchanged from v3 §5
|
||||
|
||||
---
|
||||
|
||||
## 6. Hooks — unchanged from v4 §6
|
||||
|
||||
---
|
||||
|
||||
## 7-13. Multi-mesh, auto-routing, service install, observability, SDKs, security model, configuration — unchanged from v4
|
||||
|
||||
---
|
||||
|
||||
## 14. Lifecycle — unchanged from v5 §14
|
||||
|
||||
---
|
||||
|
||||
## 15. Version compat — feature param updated for new dedupe semantics
|
||||
|
||||
### 15.1 Feature bits with parameters (v6 update)
|
||||
|
||||
| Bit | `params.version` | Required parameters | Optional parameters |
|
||||
|---|---|---|---|
|
||||
| `client_message_id_dedupe` | `2` | `mode: "retention_scoped"\|"permanent"`, `dedupe_retention_days: int (>= 3)` (when mode=retention_scoped), `request_fingerprint: bool == true` | `tombstone_history_pruned_window_days: int` |
|
||||
| `concurrent_connection_policy` | `1` | (no parameters) | `default_policy: "prefer_newest"\|"prefer_oldest"\|"allow_concurrent"` |
|
||||
| `member_keypair_rotated_event` | `1` | (no parameters) | — |
|
||||
| `key_epoch` | `1` | `max_concurrent_epochs: int (>= 1)` | — |
|
||||
| `max_payload` | `1` | `inline_bytes: int (>= 1024)`, `blob_bytes: int (>= 1024)` | — |
|
||||
|
||||
`client_message_id_dedupe` bumped to `params.version = 2` because it now
|
||||
requires `request_fingerprint = true`. A broker still on version 1
|
||||
(no fingerprint comparison) is treated as "feature missing" and the
|
||||
daemon refuses to start. That's intentional — v0.9.0 daemons require
|
||||
fingerprint enforcement for safe idempotency.
|
||||
|
||||
`dedupe_retention_days` minimum raised to 3 (matches the §4.9 floor).
|
||||
|
||||
### 15.2 Negotiation handshake — unchanged shape from v5 §15.2
|
||||
|
||||
### 15.3 IPC negotiation — unchanged from v3 §15.3
|
||||
|
||||
### 15.4 Compatibility matrix — unchanged from v3 §15.4
|
||||
|
||||
### 15.5 Diagnostic close codes (NEW v6 — codex r5)
|
||||
|
||||
WebSocket close codes are split for diagnostic clarity:
|
||||
|
||||
| Code | Reason | When |
|
||||
|---|---|---|
|
||||
| `4010` | `feature_unavailable` | Required feature missing from broker's `supported` |
|
||||
| `4011` | `feature_param_invalid` | Required feature present but parameters fail validation (missing required, out of bounds, unknown version) |
|
||||
| `4012` | `feature_param_below_floor` | Required feature parameter below daemon's hard floor (e.g. `dedupe_retention_days < 3`) |
|
||||
|
||||
Daemon logs the full negotiation payload at WARN before exiting; supervisor
|
||||
+ alerting catches the restart loop.
|
||||
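As a rough sketch of how a daemon might map the advertised `client_message_id_dedupe` parameters onto these three close codes: the payload shape (`supported[bit]["params"]`) and the function name are assumptions made for illustration, not part of the spec. Treating `params.version == 1` as "feature missing" (4010) follows §15.1; mapping every other unknown version to 4011 is one reading of the table above.

```python
from typing import Optional

CLOSE_FEATURE_UNAVAILABLE = 4010
CLOSE_FEATURE_PARAM_INVALID = 4011
CLOSE_FEATURE_PARAM_BELOW_FLOOR = 4012

MIN_DEDUPE_RETENTION_DAYS = 3  # v6 hard floor from §4.9


def dedupe_close_code(supported: dict) -> Optional[int]:
    """Return a close code, or None when the advertised feature is usable.

    `supported` is a hypothetical dict keyed by feature bit name.
    """
    feature = supported.get("client_message_id_dedupe")
    if feature is None:
        return CLOSE_FEATURE_UNAVAILABLE

    params = feature.get("params", {})
    version = params.get("version")
    if version == 1:
        # §15.1: a version-1 broker is treated as "feature missing".
        return CLOSE_FEATURE_UNAVAILABLE
    if version != 2:
        return CLOSE_FEATURE_PARAM_INVALID

    if params.get("request_fingerprint") is not True:
        return CLOSE_FEATURE_PARAM_INVALID

    mode = params.get("mode")
    if mode == "permanent":
        return None  # no retention window to validate
    if mode != "retention_scoped":
        return CLOSE_FEATURE_PARAM_INVALID

    days = params.get("dedupe_retention_days")
    if not isinstance(days, int) or isinstance(days, bool):
        return CLOSE_FEATURE_PARAM_INVALID  # missing or wrong type
    if days < MIN_DEDUPE_RETENTION_DAYS:
        return CLOSE_FEATURE_PARAM_BELOW_FLOOR
    return None
```

Checking "invalid" before "below floor" keeps 4012 reserved for parameters that are well-formed but too small, which is the distinction the table draws.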
|
||||
---
|
||||
|
||||
## 16. Threat model — unchanged from v4 §16
|
||||
|
||||
---
|
||||
|
||||
## 17. Migration — broker dedupe table + atomicity (v6)
|
||||
|
||||
Broker side, deploy order:
|
||||
|
||||
1. `CREATE TABLE mesh.client_message_dedupe` with v6 schema (additive,
|
||||
online-safe).
|
||||
2. `ALTER TABLE mesh.topic_message ADD COLUMN client_message_id`.
|
||||
3. `ALTER TABLE mesh.message_queue ADD COLUMN client_message_id`.
|
||||
4. Broker code refactor: every accept path wraps dedupe insert + message
|
||||
insert in **one transaction** (§4.7). Pre-generated
|
||||
`broker_message_id` (ulid in code) passed in.
|
||||
5. Broker code: nightly job to delete dedupe rows where `expires_at <
|
||||
NOW()` (skip in `permanent` mode).
|
||||
6. Broker code: hook into the message-retention sweep — when a
|
||||
`topic_message` or `message_queue` row is hard-deleted, find the
|
||||
matching dedupe row by `client_message_id` and set `history_available
|
||||
= FALSE`. (Note: `client_message_id` is nullable on those tables for
|
||||
legacy traffic; nullable rows have no dedupe row to update.)
|
||||
7. Broker code: nightly orphan-check job (§4.7); alerts on non-zero.
|
||||
8. Broker advertises `client_message_id_dedupe` feature with
|
||||
`params.version = 2` and `request_fingerprint: true`.
|
||||
9. Daemon refuses to start unless that feature bit is advertised with
|
||||
valid v2 params.
|
||||
|
||||
Rollback plan: feature flag disables fingerprint enforcement broker-side
|
||||
(falls back to existing pre-v6 behavior — no dedupe). Daemons that
|
||||
require fingerprint refuse to start. Operator switches off the feature
|
||||
flag, reverts the daemon, restarts. No data loss; pending dedupe rows
|
||||
remain in place for the next forward roll.
|
||||
|
||||
---
|
||||
|
||||
## What changed v5 → v6 (codex round-5 actionable items)
|
||||
|
||||
| Codex r5 item | v6 fix | Section |
|
||||
|---|---|---|
|
||||
| Idempotency key reuse with different payload silently collapses | `request_fingerprint` BYTEA in dedupe table; canonical form per §4.4; 409 on mismatch | §4.3, §4.4, §4.5 |
|
||||
| `status='rejected'` underspecified | Dropped `status` column; rejected requests don't consume keys; existence of dedupe row = "key consumed" | §4.3, §4.6 |
|
||||
| Outbox max-age math edges at low retention | 72h floor; min `dedupe_retention_days = 3`; percentage-based safety margin; explicit override gating | §4.9, §15.1 |
|
||||
| Broker atomicity not stated | One transaction per accept path; orphan-check job; rollback semantics | §4.7 |
|
||||
| Diagnostic detail on feature param failures | New close codes 4011 / 4012 separate from 4010 | §15.5 |
|
||||
| Outbox stores fingerprint | NEW column `outbox.request_fingerprint` BLOB; computed once at IPC accept | §4.8 |
|
||||
| Operator command for fingerprint-mismatch recovery | NEW `outbox requeue --new-id <id>` to rotate idempotency key | §4.5 |
|
||||
|
||||
---
|
||||
|
||||
## What needs review (round 6)
|
||||
|
||||
1. **Request fingerprint canonical form (§4.4)** — does JCS work
|
||||
cross-language for `meta_canonical_json` (Python json.dumps,
|
||||
Go encoding/json, JS JSON.stringify all behave differently)? Should
|
||||
we ship a vetted JCS lib in each SDK or fall back to a simpler
|
||||
"sorted keys + no spaces + escape-as-stored" rule with conformance
|
||||
tests?
|
||||
2. **Atomicity contract (§4.7)** — is the orphan-check sufficient, or
|
||||
does a violation mean we need a "broker rebuild dedupe from messages"
|
||||
recovery tool? The latter is destructive but useful for ops emergencies.
|
||||
3. **Max-age formula (§4.9)** — is the 72h floor correct? Is the
|
||||
percentage-based safety margin (`max(24, ceil(0.1 * dedupe_window))`)
|
||||
the right shape? Or simpler to say "always 24h"?
|
||||
4. **`409 idempotency_key_reused` recovery flow (§4.5)** — is sending the
|
||||
row to `dead` and surfacing it via `outbox --failed` enough? Should
|
||||
the daemon emit a high-priority event for the SSE stream so operators
|
||||
are paged immediately?
|
||||
5. **Diagnostic close codes (§15.5)** — is splitting 4010/4011/4012
|
||||
useful, or does it just push complexity onto operators? Should we
|
||||
collapse to 4010 with structured close-reason JSON instead?
|
||||
6. **Anything else still wrong?** Read it as if you were going to
|
||||
operate this for a year. What falls down?
|
||||
|
||||
Three options:
|
||||
- **(a) v6 is shippable**: lock the spec, start coding the frozen core.
|
||||
- **(b) v7 needed**: list the must-fix items.
|
||||
- **(c) the architecture itself is wrong**: what would you do differently?
|
||||
|
||||
Be ruthless.
|
||||
439
.artifacts/shipped/2026-05-03-daemon-final-spec-v7.md
Normal file
@@ -0,0 +1,439 @@
|
||||
# `claudemesh daemon` — Final Spec v7
|
||||
|
||||
> **Round 7.** v6 was reviewed by codex (round 6) which found the broker
|
||||
> layer largely correct but caught five daemon-side and broker-tx
|
||||
> correctness gaps:
|
||||
>
|
||||
> 1. **Daemon-local duplicate POST semantics** undefined — local fingerprint
|
||||
> comparison missing across `pending` / `inflight` / `done` / `dead`.
|
||||
> 2. **§4.6 rejected-request contradiction** — talked about both "fix and
|
||||
> retry" and "fingerprint mismatch → 409". Only one of those can be true.
|
||||
> 3. **§4.7 pseudocode bug** — `ON CONFLICT DO NOTHING RETURNING` returns
|
||||
> nothing on conflict; the fingerprint comparison was in the wrong branch.
|
||||
> 4. **Max-age math floor consumes margin** — at min retention (3 days),
|
||||
> daemon max-age 72h equals broker window 72h. Not inside the window.
|
||||
> 5. **Broker transaction boundary incomplete** — fan-out/queue/history side
|
||||
> effects not stated as in-transaction; "optional" wording was wrong.
|
||||
>
|
||||
> v7 fixes all five. **Intent §0 unchanged from v2.** v7 only revises §4
|
||||
> (delivery contract), §15 (feature param min), and §17 (migration).
|
||||
|
||||
---
|
||||
|
||||
## 0. Intent — unchanged, see v2 §0
|
||||
|
||||
---
|
||||
|
||||
## 1. Process model — unchanged
|
||||
|
||||
## 2. Identity — unchanged from v5 §2
|
||||
|
||||
## 3. IPC surface — unchanged from v4 §3
|
||||
|
||||
---
|
||||
|
||||
## 4. Delivery contract — at-least-once, fingerprinted at IPC and broker layers
|
||||
|
||||
### 4.1 The contract (precise — v7)
|
||||
|
||||
> **Local guarantee**: each successful `POST /v1/send` returns a stable
|
||||
> `client_message_id`. The send is durably persisted to `outbox.db` before
|
||||
> the response returns. The daemon enforces request-fingerprint
|
||||
> idempotency at the IPC layer: a duplicate `POST` with the same
|
||||
> `client_message_id` and matching `request_fingerprint` returns the
|
||||
> stable prior result; with a mismatched fingerprint it returns local
|
||||
> `409 idempotency_key_reused` and the new request is **not** persisted.
|
||||
>
|
||||
> **Broker guarantee**: the broker maintains a dedupe record per
|
||||
> accepted `(mesh_id, client_message_id)` in `mesh.client_message_dedupe`
|
||||
> with `request_fingerprint`. Retries with matching fingerprint collapse;
|
||||
> retries with mismatched fingerprint return `409
|
||||
> idempotency_key_reused` without creating a new message.
|
||||
>
|
||||
> **Atomicity guarantee**: every durable side effect of a successful
|
||||
> accept (dedupe row, message row, fan-out work, history row, queue
|
||||
> insertion) lands in the same broker DB transaction. Either all commit
|
||||
> or none do.
|
||||
>
|
||||
> **End-to-end guarantee**: at-least-once delivery, with
|
||||
> `client_message_id` propagated to receivers' inboxes.
|
||||
|
||||
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
|
||||
|
||||
### 4.3 Broker schema — unchanged from v6 §4.3
|
||||
|
||||
(`mesh.client_message_dedupe` table with `request_fingerprint BYTEA`, no
|
||||
`status` column.)
|
||||
|
||||
### 4.4 Request fingerprint canonical form — unchanged from v6 §4.4
|
||||
|
||||
### 4.5 Daemon-local idempotency at the IPC layer (NEW v7 — codex r6)
|
||||
|
||||
The daemon enforces fingerprint idempotency **before** the request hits
|
||||
`outbox.db` so a caller bug never creates duplicate-key/mismatch-payload
|
||||
state at all.
|
||||
|
||||
#### 4.5.1 IPC accept algorithm
|
||||
|
||||
On `POST /v1/send`:
|
||||
|
||||
1. Validate request envelope (auth, schema, size limits). Failures
|
||||
here return `4xx` immediately. **No outbox row is written.** The
|
||||
`client_message_id` (whether caller-supplied or daemon-minted) is
|
||||
**not consumed** — the same id may be reused by the caller for a
|
||||
subsequent valid send.
|
||||
2. Compute `request_fingerprint` (§4.4).
|
||||
3. Look up existing outbox row by `client_message_id`:
|
||||
|
||||
| Existing row state | Fingerprint match? | Daemon response |
|
||||
|---|---|---|
|
||||
| (no row) | — | Insert new outbox row in `pending`; return `202 accepted, queued` with `client_message_id` |
|
||||
| `pending` | match | Return `202 accepted, queued` with the existing `client_message_id`. No new row. Idempotent retry of an in-progress send |
|
||||
| `pending` | mismatch | Return `409 idempotency_key_reused` with `conflict: "outbox_pending_fingerprint_mismatch"`. **No mutation of the existing row.** |
|
||||
| `inflight` | match | Return `202 accepted, inflight`. No new row. Caller is retrying mid-broker-roundtrip |
|
||||
| `inflight` | mismatch | Return `409 idempotency_key_reused` with `conflict: "outbox_inflight_fingerprint_mismatch"` |
|
||||
| `done` | match | Return `200 ok, duplicate: true, broker_message_id, history_id`. No new row, no broker call |
|
||||
| `done` | mismatch | Return `409 idempotency_key_reused` with `conflict: "outbox_done_fingerprint_mismatch", broker_message_id` |
|
||||
| `dead` | match | Return `409 idempotency_key_reused` with `conflict: "outbox_dead_fingerprint_match", reason: "<last_error>"`. Caller must rotate the id (see §4.6.3) — daemon refuses to re-attempt a dead row's exact bytes. |
|
||||
| `dead` | mismatch | Return `409 idempotency_key_reused` with `conflict: "outbox_dead_fingerprint_mismatch"` |
|
||||
|
||||
Rule: any IPC `409` carries the daemon's `request_fingerprint` (8-byte
|
||||
hex prefix) so callers can debug client/server canonical-form drift.
|
||||
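The lookup table above reduces to a small pure function. The following is a minimal sketch, not the daemon's actual code: the function names and the `(http_status, conflict)` return shape are illustrative assumptions, and "8-byte hex prefix" is read here as the first 8 bytes of the fingerprint rendered as 16 hex characters.

```python
from typing import Optional, Tuple

KNOWN_STATES = ("pending", "inflight", "done", "dead")


def ipc_decision(status: Optional[str],
                 fingerprint_matches: bool) -> Tuple[int, Optional[str]]:
    """Map (existing row state, fingerprint match) to (HTTP status, conflict).

    `status` is None when no outbox row exists for the client_message_id.
    """
    if status is None:
        return 202, None  # caller inserts a new `pending` row
    if status not in KNOWN_STATES:
        raise ValueError(f"unknown outbox status: {status!r}")
    if not fingerprint_matches:
        # Every mismatch is a 409; the existing row is never mutated.
        return 409, f"outbox_{status}_fingerprint_mismatch"
    if status == "dead":
        # Same bytes as a dead row: still refused, caller must rotate the id.
        return 409, "outbox_dead_fingerprint_match"
    if status == "done":
        return 200, None  # duplicate of a completed send, no broker call
    return 202, None      # pending or inflight: idempotent retry


def fingerprint_prefix(fingerprint: bytes) -> str:
    """Hex rendering of the first 8 bytes, for inclusion in 409 bodies."""
    return fingerprint[:8].hex()
```

Keeping the decision free of I/O makes the nine table rows easy to cover one by one in unit tests.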
|
||||
#### 4.5.2 Outbox table — fingerprint required, atomic UPSERT removed
|
||||
|
||||
```sql
|
||||
CREATE TABLE outbox (
|
||||
id TEXT PRIMARY KEY,
|
||||
client_message_id TEXT NOT NULL UNIQUE,
|
||||
request_fingerprint BLOB NOT NULL, -- 32 bytes
|
||||
payload BLOB NOT NULL,
|
||||
enqueued_at INTEGER NOT NULL,
|
||||
attempts INTEGER DEFAULT 0,
|
||||
next_attempt_at INTEGER NOT NULL,
|
||||
status TEXT CHECK(status IN ('pending','inflight','done','dead')),
|
||||
last_error TEXT,
|
||||
delivered_at INTEGER,
|
||||
broker_message_id TEXT
|
||||
);
|
||||
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
|
||||
```
|
||||
|
||||
Insertion is `BEGIN IMMEDIATE; SELECT; if-no-row INSERT; COMMIT;`:
|
||||
explicit lock + check + insert, not `INSERT OR IGNORE`. The daemon
|
||||
never auto-mutates an existing row's `request_fingerprint` or
|
||||
`payload`; mismatches are 409s, not silent overwrites.
|
||||
|
||||
`request_fingerprint` is computed once at IPC accept time and frozen.
|
||||
Retries to the broker re-send the same bytes from `payload` and the
|
||||
same `request_fingerprint`. Daemon does not recompute post-enqueue.
|
||||
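A minimal sketch of the lock + check + insert path, using Python's `sqlite3` against the schema above. SQLite has no `SELECT ... FOR UPDATE`, so this sketch takes the write lock up front with `BEGIN IMMEDIATE`; the function names and the `(outcome, status)` return shape are assumptions for illustration.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE outbox (
  id                   TEXT PRIMARY KEY,
  client_message_id    TEXT NOT NULL UNIQUE,
  request_fingerprint  BLOB NOT NULL,
  payload              BLOB NOT NULL,
  enqueued_at          INTEGER NOT NULL,
  attempts             INTEGER DEFAULT 0,
  next_attempt_at      INTEGER NOT NULL,
  status               TEXT CHECK(status IN ('pending','inflight','done','dead')),
  last_error           TEXT,
  delivered_at         INTEGER,
  broker_message_id    TEXT
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
"""


def open_outbox(path=":memory:"):
    # isolation_level=None: we issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.executescript(SCHEMA)
    return conn


def accept_send(conn, row_id, client_message_id, fingerprint, payload):
    """Insert a pending row if the id is unused; never mutate an existing row.

    Returns (outcome, status), outcome in {'inserted', 'match', 'mismatch'}.
    """
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT status, request_fingerprint FROM outbox "
            "WHERE client_message_id = ?",
            (client_message_id,),
        ).fetchone()
        if row is None:
            now = int(time.time())
            conn.execute(
                "INSERT INTO outbox (id, client_message_id, "
                "request_fingerprint, payload, enqueued_at, "
                "next_attempt_at, status) "
                "VALUES (?, ?, ?, ?, ?, ?, 'pending')",
                (row_id, client_message_id, fingerprint, payload, now, now),
            )
            result = ("inserted", "pending")
        elif bytes(row[1]) == fingerprint:
            result = ("match", row[0])
        else:
            result = ("mismatch", row[0])
        conn.execute("COMMIT")
        return result
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise
```

The stored fingerprint and payload are only ever written by the first insert, which is what lets broker retries re-send identical bytes.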
|
||||
### 4.6 Rejected-request semantics — pick one rule (NEW v7 — codex r6)
|
||||
|
||||
> **Rule: the `client_message_id` is consumed iff the daemon writes an
|
||||
> outbox row. Anything that fails before outbox insertion (validation,
|
||||
> auth, size) leaves the id untouched and freely reusable.**
|
||||
|
||||
This makes §4.6 internally consistent with §4.5:
|
||||
|
||||
#### 4.6.1 IPC validation failure (no outbox row written)
|
||||
|
||||
- Schema/auth/size/destination-not-resolvable failures return `4xx`
|
||||
immediately. The `client_message_id` is **not** stored anywhere on
|
||||
the daemon. Caller may re-send with the same id and a fixed payload;
|
||||
it will be treated as a fresh request because no outbox row exists.
|
||||
|
||||
#### 4.6.2 Outbox row exists, broker permanent rejection (4xx response)
|
||||
|
||||
- Daemon receives `4xx` from broker (e.g. payload size delta between
|
||||
daemon and broker advertised limits, mesh-level reject). Outbox row
|
||||
transitions to `dead` with `last_error` populated.
|
||||
- Caller retrying with same `client_message_id` → daemon returns
|
||||
`409 idempotency_key_reused, conflict: "outbox_dead_*"` per §4.5.1.
|
||||
- The id is consumed (row is locked in `dead`) until operator action.
|
||||
|
||||
#### 4.6.3 Operator recovery: rotating an idempotency key
|
||||
|
||||
To unstick a `dead` row whose payload needs to change, operator runs:
|
||||
|
||||
```
|
||||
claudemesh daemon outbox requeue --id <outbox_id> --new-client-id [auto|<id>]
|
||||
```
|
||||
|
||||
This atomically:
|
||||
1. Marks the existing `dead` row as `aborted` (terminal, never retried).
|
||||
2. Creates a new outbox row with a fresh `client_message_id` (caller-
|
||||
supplied or daemon-ulid'd) and the SAME or a CALLER-PATCHED payload.
|
||||
3. The old `client_message_id` becomes free again at the daemon layer
|
||||
but is still locked at the broker layer if the broker had ever
|
||||
accepted it (its dedupe row stays). For a row that died before
|
||||
broker acceptance, the id is fully reusable end-to-end.
|
||||
|
||||
Operators see a clear distinction between `dead` (needs operator
|
||||
attention) and `aborted` (intentionally retired). Add `aborted` to the
|
||||
status CHECK constraint:
|
||||
|
||||
```sql
|
||||
status TEXT CHECK(status IN ('pending','inflight','done','dead','aborted'))
|
||||
```
|
||||
|
||||
### 4.7 Broker atomicity contract — corrected pseudocode + side-effect inventory (v7 — codex r6)
|
||||
|
||||
#### 4.7.1 Side effects inside the transaction
|
||||
|
||||
Every successful broker accept atomically commits the following durable
|
||||
state in **one transaction**:
|
||||
|
||||
| Effect | Table | Notes |
|
||||
|---|---|---|
|
||||
| Dedupe record | `mesh.client_message_dedupe` | NEW row keyed by `(mesh_id, client_message_id)` |
|
||||
| Message body | `mesh.topic_message` OR `mesh.message_queue` | NEW row keyed by `broker_message_id` (pre-generated ulid) |
|
||||
| History row | `mesh.message_history` | NEW row pointing at `broker_message_id` for ordered replay |
|
||||
| Fan-out work | `mesh.delivery_queue` | One row per intended recipient (member subscribed to topic, recipient of DM, etc.) |
|
||||
|
||||
Effects **outside** the transaction (committed after ACK to daemon):
|
||||
- WebSocket pushes to currently-connected subscribers — these are best-
|
||||
effort live notifications; on failure subscribers fetch from history
|
||||
on next connect.
|
||||
- Webhook fan-out (post-v0.9.0 feature) — runs asynchronously off the
|
||||
`delivery_queue` rows committed inside the transaction.
|
||||
|
||||
If any in-transaction insert fails (constraint violation, DB error),
|
||||
the transaction rolls back: no dedupe row, no message row, no history,
|
||||
no delivery queue rows. Broker returns `5xx` to daemon; daemon retries.
|
||||
|
||||
#### 4.7.2 Corrected pseudocode (codex r6)
|
||||
|
||||
The fingerprint comparison must happen on the conflict-select branch,
|
||||
not the `RETURNING` branch:
|
||||
|
||||
```sql
|
||||
BEGIN;
|
||||
|
||||
-- Pre-generate broker_message_id (ulid) outside the transaction, pass in.
|
||||
|
||||
-- Step 1: try to claim the idempotency key.
|
||||
INSERT INTO mesh.client_message_dedupe
|
||||
(mesh_id, client_message_id, broker_message_id, request_fingerprint,
|
||||
destination_kind, destination_ref, expires_at)
|
||||
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
|
||||
$dest_kind, $dest_ref, $expires_at)
|
||||
ON CONFLICT (mesh_id, client_message_id) DO NOTHING;
|
||||
|
||||
-- Step 2: was it our insert?
|
||||
SELECT broker_message_id, request_fingerprint, destination_kind,
|
||||
destination_ref, history_available, first_seen_at
|
||||
FROM mesh.client_message_dedupe
|
||||
WHERE mesh_id = $mesh_id AND client_message_id = $client_id
|
||||
FOR SHARE;
|
||||
|
||||
-- If returned.broker_message_id == $msg_id (our pre-generated id),
|
||||
-- this was the first insert. Continue to step 3.
|
||||
-- If returned.broker_message_id != $msg_id AND
|
||||
-- returned.request_fingerprint == $fingerprint,
|
||||
-- this is a duplicate retry. ROLLBACK; return 200 duplicate.
|
||||
-- If returned.broker_message_id != $msg_id AND
|
||||
-- returned.request_fingerprint != $fingerprint,
|
||||
-- ROLLBACK; return 409 idempotency_key_reused.
|
||||
|
||||
-- Step 3: insert message row, history, fan-out queue.
|
||||
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
|
||||
VALUES ($msg_id, $mesh_id, $client_id, ...);
|
||||
|
||||
INSERT INTO mesh.message_history (broker_message_id, mesh_id, ...)
|
||||
VALUES ($msg_id, $mesh_id, ...);
|
||||
|
||||
INSERT INTO mesh.delivery_queue (broker_message_id, recipient_pubkey, ...)
|
||||
SELECT $msg_id, member_pubkey, ...
|
||||
FROM mesh.topic_subscription
|
||||
WHERE topic = $dest_ref AND mesh_id = $mesh_id;
|
||||
|
||||
COMMIT;
|
||||
```
|
||||
|
||||
The branch logic determines the response shape (`201` vs `200
|
||||
duplicate` vs `409 idempotency_key_reused`) before COMMIT. The
|
||||
duplicate and 409 branches always ROLLBACK because nothing else
|
||||
needs to commit on those paths.
|
||||
|
||||
`SELECT … FOR SHARE` blocks concurrent writers from updating or deleting the
|
||||
same dedupe row mid-transaction; a concurrent insert with the same
|
||||
key will block until our transaction completes.
|
||||
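The three-way branch can be exercised end to end with a small stand-in. This sketch makes several assumptions: SQLite (3.24 or later, for `ON CONFLICT ... DO NOTHING`) replaces Postgres, so `FOR SHARE` is dropped and `BEGIN IMMEDIATE` serializes writers instead; the tables are trimmed to a few columns; and the history and delivery-queue inserts are omitted for brevity. Function names and the `(status, broker_message_id)` return shape are hypothetical.

```python
import sqlite3

DDL = """
CREATE TABLE client_message_dedupe (
  mesh_id             TEXT NOT NULL,
  client_message_id   TEXT NOT NULL,
  broker_message_id   TEXT NOT NULL,
  request_fingerprint BLOB NOT NULL,
  PRIMARY KEY (mesh_id, client_message_id)
);
CREATE TABLE topic_message (
  id                TEXT PRIMARY KEY,
  mesh_id           TEXT NOT NULL,
  client_message_id TEXT,
  body              BLOB
);
"""


def open_broker():
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript(DDL)
    return conn


def broker_accept(conn, mesh_id, client_id, msg_id, fingerprint, body):
    """Return (http_status, broker_message_id) for one accept attempt."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        # Step 1: try to claim the idempotency key.
        conn.execute(
            "INSERT INTO client_message_dedupe "
            "(mesh_id, client_message_id, broker_message_id, "
            " request_fingerprint) VALUES (?, ?, ?, ?) "
            "ON CONFLICT (mesh_id, client_message_id) DO NOTHING",
            (mesh_id, client_id, msg_id, fingerprint),
        )
        # Step 2: read back whichever row now owns the key.
        stored_id, stored_fp = conn.execute(
            "SELECT broker_message_id, request_fingerprint "
            "FROM client_message_dedupe "
            "WHERE mesh_id = ? AND client_message_id = ?",
            (mesh_id, client_id),
        ).fetchone()
        if stored_id != msg_id:
            # Someone else's row: nothing of ours needs to commit.
            conn.execute("ROLLBACK")
            if bytes(stored_fp) == fingerprint:
                return 200, stored_id  # duplicate retry
            return 409, stored_id      # idempotency_key_reused
        # Step 3: our insert won, so write the message row in the same tx.
        conn.execute(
            "INSERT INTO topic_message (id, mesh_id, client_message_id, body) "
            "VALUES (?, ?, ?, ?)",
            (msg_id, mesh_id, client_id, body),
        )
        conn.execute("COMMIT")
        return 201, msg_id
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")  # dedupe row disappears with the tx
        raise
```

Because the pre-generated `broker_message_id` is unique per attempt, comparing it with the stored value is enough to tell "our insert" from "an earlier insert" without relying on row counts.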
|
||||
#### 4.7.3 Orphan check — covers full inventory now
|
||||
|
||||
The nightly `cm_broker_dedupe_orphan_check_total` job (v6 §4.7) is
|
||||
extended to verify all four in-transaction effects. For each
|
||||
`client_message_dedupe` row:
|
||||
- Either the corresponding `topic_message` / `message_queue` row exists,
|
||||
OR `history_available = FALSE` AND a deleted-tombstone is recorded.
|
||||
- AND a corresponding `message_history` row exists (or has been pruned
|
||||
per history retention).
|
||||
- AND zero outstanding `delivery_queue` rows older than fan-out timeout
|
||||
reference a `broker_message_id` whose dedupe row is missing.
|
||||
|
||||
Any inconsistency is logged as `cm_broker_atomicity_violation_found` for
|
||||
human review. Should be zero in steady state.
|
||||
|
||||
### 4.8 Outbox max-age math — strictly inside broker window (v7 — codex r6)
|
||||
|
||||
Codex r6: at v6's 3-day minimum, daemon max_age (72h) **equaled** broker
|
||||
window (72h). That isn't "inside the window."
|
||||
|
||||
v7 raises the floor and tightens the formula:
|
||||
|
||||
- **Minimum supported broker `dedupe_retention_days`**: **7** (was 3 in
|
||||
v6). Below this, daemon refuses to start with `4012
|
||||
feature_param_below_floor`.
|
||||
- **Daemon `max_age_hours` derivation** (`retention_scoped` mode):
|
||||
```
|
||||
safety_margin_hours = max(24, ceil(dedupe_retention_days * 0.1 * 24))
|
||||
max_age_hours = (dedupe_retention_days * 24) - safety_margin_hours
|
||||
```
|
||||
At minimum (7 days): `safety_margin = max(24, 17) = 24h`; `max_age =
|
||||
168 - 24 = 144h`. Daemon outbox ≤144h, broker window ≥168h, gap ≥24h.
|
||||
- **Daemon `max_age_hours` derivation** (`permanent` mode):
|
||||
```
|
||||
max_age_hours = config.outbox.max_age_hours_default (168h)
|
||||
capped at config.outbox.max_age_hours_cap (720h)
|
||||
```
|
||||
- **Operator override**: `[outbox] max_age_hours_override = N` accepted
|
||||
iff `N <= dedupe_retention_days * 24 - 24`. Above that → daemon
|
||||
refuses to start with a clear-text `outbox_max_age_above_dedupe_window` error.
|
||||
- The 72h floor from v6 is **dropped** because the new 7-day broker
|
||||
minimum already produces a 144h derived max-age — well above any
|
||||
realistic floor concern.
|
||||
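The `retention_scoped` derivation and the override gate can be checked with a few lines of arithmetic. This is a sketch under assumptions: the function names are illustrative, errors are raised as `ValueError` carrying the spec's reason strings, and the `ceil(days * 0.1 * 24)` term is computed with integer math to avoid floating-point rounding.

```python
MIN_RETENTION_DAYS = 7  # v7 floor


def safety_margin_hours(dedupe_retention_days: int) -> int:
    """max(24, ceil(days * 0.1 * 24)), computed in integers."""
    window_hours = dedupe_retention_days * 24
    ten_percent = (window_hours + 9) // 10  # integer ceil(window_hours / 10)
    return max(24, ten_percent)


def derived_max_age_hours(dedupe_retention_days: int) -> int:
    """Outbox max-age for retention_scoped mode, strictly inside the window."""
    if dedupe_retention_days < MIN_RETENTION_DAYS:
        raise ValueError("feature_param_below_floor")  # close code 4012
    window_hours = dedupe_retention_days * 24
    return window_hours - safety_margin_hours(dedupe_retention_days)


def validate_override(override_hours: int, dedupe_retention_days: int) -> int:
    """Accept an operator override iff it leaves at least 24h of margin."""
    limit = dedupe_retention_days * 24 - 24
    if override_hours > limit:
        raise ValueError("outbox_max_age_above_dedupe_window")
    return override_hours
```

For 7 days this yields a 24h margin and a 144h max-age; for 365 days the margin grows to 876h and the max-age to 7884h. At the 7-day minimum the derived value equals the override limit, so an override can only shorten the outbox there.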
|
||||
### 4.9 Inbox schema — unchanged from v3 §4.5
|
||||
|
||||
### 4.10 Crash recovery — unchanged from v3 §4.6
|
||||
|
||||
### 4.11 Failure modes — unchanged from v6 §4.12, with §4.5/§4.6 added
|
||||
|
||||
- **IPC accept fingerprint-mismatch on duplicate id**: returns 409 with
|
||||
`conflict` field per §4.5.1. Caller must rotate id.
|
||||
- **Outbox row stuck in `dead`**: operator runs `outbox requeue
|
||||
--new-client-id` per §4.6.3.
|
||||
- **Broker fingerprint mismatch on retry**: as v6 §4.5. Daemon marks
|
||||
`dead`, surfaces in `outbox --failed`.
|
||||
- **Daemon retry after dedupe row hard-deleted by broker retention
|
||||
sweep**: cannot happen unless operator overrode `max_age_hours`
|
||||
beyond the safety margin. In `permanent` mode cannot happen at all.
|
||||
- **Atomicity violation found by orphan check**: alerts ops; broker
|
||||
team investigates. Should be zero.
|
||||
|
||||
---
|
||||
|
||||
## 5. Inbound — unchanged from v3 §5
|
||||
|
||||
## 6. Hooks — unchanged from v4 §6
|
||||
|
||||
## 7-13. — unchanged from v4
|
||||
|
||||
## 14. Lifecycle — unchanged from v5 §14
|
||||
|
||||
---
|
||||
|
||||
## 15. Version compat — minimum dedupe_retention_days raised
|
||||
|
||||
### 15.1 Feature bits with parameters (v7 update)
|
||||
|
||||
Only one row changes from v6 §15.1:
|
||||
|
||||
| Bit | `params.version` | Required parameters | Optional parameters |
|
||||
|---|---|---|---|
|
||||
| `client_message_id_dedupe` | `2` | `mode: "retention_scoped"\|"permanent"`, `dedupe_retention_days: int (>= 7)` (when mode=retention_scoped), `request_fingerprint: bool == true` | `tombstone_history_pruned_window_days: int` |
|
||||
|
||||
`dedupe_retention_days` minimum raised from 3 to 7 to keep daemon
|
||||
outbox max-age strictly inside the broker window with margin (§4.8).
|
||||
|
||||
### 15.2 — 15.5 unchanged from v6 §15
|
||||
|
||||
(`feature_negotiation_request/response`, IPC negotiation, compat
|
||||
matrix, diagnostic close codes 4010 / 4011 / 4012.)
|
||||
|
||||
---
|
||||
|
||||
## 16. Threat model — unchanged from v4 §16
|
||||
|
||||
---
|
||||
|
||||
## 17. Migration — broker dedupe + atomicity + corrected pseudocode (v7)
|
||||
|
||||
Broker side, deploy order:
|
||||
|
||||
1. `CREATE TABLE mesh.client_message_dedupe` (v6 §4.3 schema, unchanged
|
||||
in v7).
|
||||
2. `ALTER TABLE mesh.topic_message ADD COLUMN client_message_id`.
|
||||
3. `ALTER TABLE mesh.message_queue ADD COLUMN client_message_id`.
|
||||
4. Broker code refactor: every accept path runs the v7 §4.7.2 corrected
|
||||
pseudocode in **one transaction** with the side-effect inventory
|
||||
from §4.7.1 — dedupe row, message row, history row, delivery_queue
|
||||
rows all in-tx.
|
||||
5. Broker code: existing fan-out workers consume `delivery_queue` rows
|
||||
committed by the accept transaction.
|
||||
6. Broker code: nightly retention sweep + `history_available` flip on
|
||||
message-row pruning (unchanged from v6 §17 step 5+6).
|
||||
7. Broker code: extended orphan-check job (v7 §4.7.3) — alerts on
|
||||
atomicity violations across full inventory.
|
||||
8. Broker advertises `client_message_id_dedupe` feature with
|
||||
`params.version = 2`, `request_fingerprint: true`,
|
||||
`dedupe_retention_days >= 7` (was 3).
|
||||
9. Daemon refuses to start unless above is advertised.
|
||||
|
||||
Daemon side:
|
||||
- Outbox table gains `aborted` status (§4.6.3); migration ALTER on the
|
||||
CHECK constraint at startup if the installed SQLite version's DDL works without
|
||||
a recreate; else table recreate via `INSERT INTO new SELECT * FROM
|
||||
old`. v0.9.0 daemons are fresh installs by definition; existing
|
||||
outboxes don't exist.
|
||||
- IPC accept path implements §4.5.1 lookup table.
|
||||
- IPC error envelope adds `conflict` and `daemon_fingerprint_prefix`
|
||||
fields for 409 responses.
|
||||
- New CLI verb `claudemesh daemon outbox requeue --id <id>
|
||||
--new-client-id [auto|<id>]` (§4.6.3).
|
||||
|
||||
---
|
||||
|
||||
## What changed v6 → v7 (codex round-6 actionable items)
|
||||
|
||||
| Codex r6 item | v7 fix | Section |
|
||||
|---|---|---|
|
||||
| Daemon-local duplicate POST semantics undefined | Full lookup table for pending/inflight/done/dead × match/mismatch; `409 idempotency_key_reused` at IPC layer with `conflict` field | §4.5 |
|
||||
| §4.6 rejected-request contradiction | Single rule: id consumed iff outbox row written; pre-outbox failures leave id untouched; broker-rejected outbox row goes to `dead`, requires `requeue --new-client-id` | §4.6 |
|
||||
| §4.7 pseudocode wrong | Corrected: `INSERT ON CONFLICT DO NOTHING`, then `SELECT FOR SHARE`, then branch on returned `broker_message_id` and `fingerprint` | §4.7.2 |
|
||||
| Max-age math equals window at min | Min `dedupe_retention_days` raised to 7; safety margin always >= 24h; derived max-age strictly < window | §4.8, §15.1 |
|
||||
| Broker atomicity scope incomplete | Side-effect inventory: dedupe + message + history + delivery_queue all in-tx; WS push and webhook fan-out explicitly outside-tx; orphan check extended | §4.7.1, §4.7.3 |
|
||||
| New `aborted` outbox status | Distinguishes operator-retired rows from dead rows | §4.6.3 |
|
||||
|
||||
---
|
||||
|
||||
## What needs review (round 7)
|
||||
|
||||
1. **IPC lookup table (§4.5.1)** — does it cover all the realistic
|
||||
client races? The "inflight + match" return is `202 accepted,
|
||||
inflight` — should it be `200 ok` with the broker response if the
|
||||
broker has already responded? Or does the daemon prefer to respond
|
||||
from local state always?
|
||||
2. **Aborted vs dead vs done (§4.6.3)** — is the three-state terminal
|
||||
distinction useful, or noisy? Would `dead` + an `aborted_at`
|
||||
timestamp suffice?
|
||||
3. **§4.7.2 transaction shape** — `SELECT FOR SHARE` after `INSERT ON
|
||||
CONFLICT DO NOTHING` is two round-trips. Could it be one with
|
||||
`INSERT ... ON CONFLICT DO UPDATE SET ... RETURNING xmax = 0` or
|
||||
similar Postgres-specific trick? Worth optimizing here?
|
||||
4. **Max-age formula at higher windows** — at 365 days,
|
||||
`safety_margin = ceil(0.1 * 365 * 24) = 876h ≈ 36.5 days`. Daemon
|
||||
max-age = `8760 - 876 = 7884h ≈ 328 days`. Is that the right shape,
|
||||
or should the safety margin be capped (e.g. `min(72, ceil(0.1 * w))`)?
|
||||
5. **Side-effect inventory (§4.7.1)** — anything missing? E.g. broker-
|
||||
side rate-limit counters, audit-log entries, mention-fanout-search?
|
||||
6. **Anything else still wrong?** Read it as if you were going to
|
||||
operate this for a year. What falls down?
|
||||
|
||||
Three options:
|
||||
- **(a) v7 is shippable**: lock the spec, start coding the frozen core.
|
||||
- **(b) v8 needed**: list the must-fix items.
|
||||
- **(c) the architecture itself is wrong**: what would you do differently?
|
||||
|
||||
Be ruthless.
|
||||
401
.artifacts/shipped/2026-05-03-daemon-final-spec-v8.md
Normal file
@@ -0,0 +1,401 @@
|
||||
# `claudemesh daemon` — Final Spec v8
|
||||
|
||||
> **Round 8.** v7 was reviewed by codex (round 7) which found four
|
||||
> remaining correctness problems, one of them new in v7:
|
||||
>
|
||||
> 1. **`aborted` semantics not in §4.5.1** and contradiction with `UNIQUE`
|
||||
> constraint — v7 said the old id "becomes free again at the daemon
|
||||
> layer," but `client_message_id TEXT NOT NULL UNIQUE` makes that
|
||||
> impossible without DELETE.
|
||||
> 2. **Broker permanent-rejection ordering underspec** — v7 didn't state
|
||||
> when (relative to dedupe insertion) permanent 4xx fires.
|
||||
> 3. **SQLite `SELECT FOR UPDATE`** — SQLite doesn't support it; needs
|
||||
> `BEGIN IMMEDIATE` for daemon-local serialization.
|
||||
> 4. **Side-effect inventory still ambiguous** — rate-limit counters,
|
||||
> audit logs, mention/search indexes need explicit
|
||||
> in-tx/non-authoritative classification.
|
||||
>
|
||||
> v8 fixes all four. **Intent §0 unchanged from v2.** v8 only revises §4
|
||||
> (delivery contract).
|
||||
|
||||
---
|
||||
|
||||
## 0. Intent — unchanged, see v2 §0
|
||||
|
||||
## 1. Process model — unchanged
|
||||
|
||||
## 2. Identity — unchanged from v5 §2
|
||||
|
||||
## 3. IPC surface — unchanged from v4 §3
|
||||
|
||||
---
|
||||
|
||||
## 4. Delivery contract — `aborted` clarified, broker phasing, SQLite locking
|
||||
|
||||
### 4.1 The contract (precise — v8)
|
||||
|
||||
> **Local guarantee**: each successful `POST /v1/send` returns a stable
|
||||
> `client_message_id`. The send is durably persisted to `outbox.db` before
|
||||
> the response returns. The daemon enforces request-fingerprint
|
||||
> idempotency at the IPC layer: a duplicate `POST` with the same
|
||||
> `client_message_id` returns `409 idempotency_key_reused` if the
|
||||
> fingerprint mismatches, regardless of outbox row state.
|
||||
>
|
||||
> **Local audit guarantee (NEW v8)**: a `client_message_id` once written
|
||||
> to `outbox.db` is **never released**. Operator recovery via
|
||||
> `requeue --new-client-id` always mints a fresh id; the old row stays
|
||||
> in `aborted` for audit. There is no daemon-side path to free a used
|
||||
> id.
|
||||
>
|
||||
> **Broker guarantee**: same as v7 §4.1. Dedupe row exists iff the
|
||||
> broker reached the post-validation accept phase (§4.7.1).
|
||||
>
|
||||
> **Atomicity guarantee**: same as v7 §4.1.
|
||||
>
|
||||
> **End-to-end guarantee**: at-least-once.
|
||||
|
||||
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
|
||||
|
||||
### 4.3 Broker schema — unchanged from v6 §4.3
|
||||
|
||||
### 4.4 Request fingerprint canonical form — unchanged from v6 §4.4
|
||||
|
||||
### 4.5 Daemon-local idempotency at the IPC layer (v8 — `aborted` added, SQLite locking)
|
||||
|
||||
#### 4.5.1 IPC accept algorithm (v8)
|
||||
|
||||
On `POST /v1/send`:
|
||||
|
||||
1. Validate request envelope (auth, schema, size limits, destination
|
||||
resolvable). Failures here return `4xx` immediately. **No outbox row
|
||||
is written; the `client_message_id` is not consumed.**
|
||||
2. Compute `request_fingerprint` (§4.4).
|
||||
3. Open a SQLite transaction with `BEGIN IMMEDIATE` (v8 — codex r7) so
|
||||
a concurrent IPC accept on the same id serializes against this one.
|
||||
`BEGIN IMMEDIATE` acquires the RESERVED lock at transaction start,
|
||||
preventing any other writer from beginning a transaction on the same
|
||||
database; SQLite has no row-level lock and `SELECT FOR UPDATE` is not
|
||||
supported.
|
||||
4. `SELECT id, request_fingerprint, status, broker_message_id,
|
||||
last_error FROM outbox WHERE client_message_id = ?`.
|
||||
5. Apply the lookup table below. For the "(no row)" case, INSERT the
|
||||
new row inside the same transaction.
|
||||
6. COMMIT.
|
||||
|
||||
| Existing row state | Fingerprint match? | Daemon response |
|
||||
|---|---|---|
|
||||
| (no row) | — | INSERT new outbox row in `pending`; return `202 accepted, queued` |
|
||||
| `pending` | match | Return `202 accepted, queued`. No mutation |
|
||||
| `pending` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_pending_fingerprint_mismatch"`. No mutation |
|
||||
| `inflight` | match | Return `202 accepted, inflight`. No mutation |
|
||||
| `inflight` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_inflight_fingerprint_mismatch"` |
|
||||
| `done` | match | Return `200 ok, duplicate: true, broker_message_id, history_id`. No broker call |
|
||||
| `done` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_done_fingerprint_mismatch", broker_message_id` |
|
||||
| `dead` | match | Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_match", reason: "<last_error>"`. Same id never auto-retried |
|
||||
| `dead` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_mismatch"` |
|
||||
| **`aborted`** (NEW v8) | **match** | Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_match"`. The id was retired by operator action; never reusable |
|
||||
| **`aborted`** (NEW v8) | **mismatch** | Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_mismatch"` |
|
||||
|
||||
**Rule (v8 — codex r7)**: every IPC `409` carries the daemon's
|
||||
`request_fingerprint` (8-byte hex prefix) so callers can debug
|
||||
client/server canonical-form drift. **Every state in the table returns
|
||||
something deterministic, including `aborted`.** A `client_message_id`
|
||||
written to `outbox.db` is permanently bound to that row's lifecycle —
|
||||
the only "free" state is "no row exists".
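
The accept steps and lookup table above can be sketched as follows. This is a minimal illustration, not the daemon's implementation: the helper names (`open_outbox`, `ipc_accept`, `_lookup`), the response body keys other than `conflict`, `reason`, `broker_message_id`, and `daemon_fingerprint_prefix`, and the reading of "8-byte hex prefix" as the first 8 bytes rendered as hex are all assumptions. The table is a trimmed copy of §4.5.2, and `history_id` is omitted because the trimmed table has no column for it.

```python
import sqlite3
import time
import uuid

# Trimmed copy of the outbox table: only the columns this sketch touches.
SCHEMA = """
CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  payload             BLOB NOT NULL,
  enqueued_at         INTEGER NOT NULL,
  next_attempt_at     INTEGER NOT NULL,
  status              TEXT CHECK(status IN
                        ('pending','inflight','done','dead','aborted')),
  last_error          TEXT,
  broker_message_id   TEXT
);
"""


def open_outbox(path=":memory:"):
    # isolation_level=None: the sqlite3 module issues no implicit BEGIN,
    # so the explicit BEGIN IMMEDIATE below controls the transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.executescript(SCHEMA)
    return conn


def ipc_accept(conn, client_message_id, fingerprint, payload):
    """Steps 3-6 of the accept algorithm; returns (http_status, body)."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT request_fingerprint, status, broker_message_id, last_error "
            "FROM outbox WHERE client_message_id = ?",
            (client_message_id,),
        ).fetchone()
        if row is None:
            now = int(time.time())
            conn.execute(
                "INSERT INTO outbox (id, client_message_id, request_fingerprint,"
                " payload, enqueued_at, next_attempt_at, status)"
                " VALUES (?, ?, ?, ?, ?, ?, 'pending')",
                (uuid.uuid4().hex, client_message_id, fingerprint, payload, now, now),
            )
            result = (202, {"state": "queued"})
        else:
            result = _lookup(row, fingerprint)
        conn.execute("COMMIT")
        return result
    except Exception:
        conn.execute("ROLLBACK")
        raise


def _lookup(row, fingerprint):
    """Map an existing row to a response, following the lookup table."""
    stored_fp, status, broker_message_id, last_error = row
    match = bytes(stored_fp) == fingerprint
    if match and status in ("pending", "inflight"):
        return 202, {"state": "queued" if status == "pending" else "inflight"}
    if match and status == "done":
        return 200, {"duplicate": True, "broker_message_id": broker_message_id}
    # Every remaining combination is a 409, including dead/aborted + match.
    outcome = "match" if match else "mismatch"
    body = {
        "error": "idempotency_key_reused",
        "conflict": f"outbox_{status}_fingerprint_{outcome}",
        "daemon_fingerprint_prefix": bytes(stored_fp)[:8].hex(),
    }
    if status == "dead" and match:
        body["reason"] = last_error
    if status == "done":
        body["broker_message_id"] = broker_message_id
    return 409, body
```

Because the `SELECT` and the conditional `INSERT` run under one `BEGIN IMMEDIATE`, two concurrent accepts for the same id cannot both observe "(no row)".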
|
||||
|
||||
#### 4.5.2 Outbox table — fingerprint required
|
||||
|
||||
```sql
|
||||
CREATE TABLE outbox (
|
||||
id TEXT PRIMARY KEY,
|
||||
client_message_id TEXT NOT NULL UNIQUE,
|
||||
request_fingerprint BLOB NOT NULL, -- 32 bytes
|
||||
payload BLOB NOT NULL,
|
||||
enqueued_at INTEGER NOT NULL,
|
||||
attempts INTEGER DEFAULT 0,
|
||||
next_attempt_at INTEGER NOT NULL,
|
||||
status TEXT CHECK(status IN
|
||||
('pending','inflight','done','dead','aborted')),
|
||||
last_error TEXT,
|
||||
delivered_at INTEGER,
|
||||
broker_message_id TEXT,
|
||||
aborted_at INTEGER, -- NEW v8
|
||||
aborted_by TEXT, -- NEW v8: operator/auto
|
||||
superseded_by TEXT -- NEW v8: id of the requeue successor row, if any
|
||||
);
|
||||
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
|
||||
CREATE INDEX outbox_aborted ON outbox(status, aborted_at) WHERE status = 'aborted';
|
||||
```
|
||||
|
||||
`aborted_at`, `aborted_by`, `superseded_by` give operators a clear
|
||||
audit trail. `superseded_by` lets `outbox inspect` show the chain when
|
||||
a row was requeued multiple times.
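
As an illustration of how a requeue chain could be walked, the sketch below follows `superseded_by` links from a starting row. The function name `supersession_chain` is hypothetical, and the actual output of `outbox inspect` is not specified here; only the `id` and `superseded_by` columns are assumed.

```python
import sqlite3


def supersession_chain(conn, row_id):
    """Return outbox row ids from row_id forward along superseded_by links."""
    chain, seen = [], set()
    current = row_id
    while current is not None and current not in seen:
        seen.add(current)  # guard against an accidental cycle
        row = conn.execute(
            "SELECT superseded_by FROM outbox WHERE id = ?", (current,)
        ).fetchone()
        if row is None:
            break  # unknown id or dangling pointer: stop rather than guess
        chain.append(current)
        current = row[0]
    return chain
```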
|
||||
|
||||
`request_fingerprint` is computed once at IPC accept time and frozen
|
||||
forever for the row's lifecycle. Daemon never recomputes from
|
||||
`payload`.
|
||||
|
||||
### 4.6 Rejected-request semantics — phasing made explicit (v8 — codex r7)
|
||||
|
||||
> **Single rule, phased**: a `client_message_id` is consumed iff a
|
||||
> dedupe row exists. The dedupe row is the durable evidence that a
|
||||
> request reached the post-validation accept phase. Pre-validation
|
||||
> failures consume nothing — caller may freely retry the same id with
|
||||
> a fixed payload.
|
||||
|
||||
#### 4.6.1 Daemon-side rejection phasing
|
||||
|
||||
| Phase | When daemon rejects | Outbox row? | Caller may reuse id? |
|
||||
|---|---|---|---|
|
||||
| **A. IPC validation** (auth, schema, size, destination resolvable) | Before §4.5.1 step 3 | No | Yes — id never consumed |
|
||||
| **B. Outbox stored, broker network/transient failure** | After IPC accept, broker `5xx` or timeout | `pending` → retried | N/A — daemon owns retries |
|
||||
| **C. Outbox stored, broker permanent rejection** | Broker returns `4xx` after IPC accept | `dead` | No — rotate via `requeue --new-client-id` |
|
||||
| **D. Operator retirement** | Operator runs `requeue --new-client-id` on `dead` or `pending` row | `aborted` (audit) + new row with fresh id | Old id NEVER reusable; new id is fresh |
|
||||
|
||||
#### 4.6.2 Broker-side rejection phasing (NEW v8 — codex r7)
|
||||
|
||||
The broker handles a request in three phases, two of them validation phases positioned relative to dedupe-row insertion:
|
||||
|
||||
| Phase | Validation | Result |
|
||||
|---|---|---|
|
||||
| **B1. Pre-dedupe-claim** (NEW — explicit) | Auth (mesh membership), schema, size, mesh exists, member exists, destination kind valid, payload bytes ≤ `max_payload.inline_bytes` | `4xx` returned. **No dedupe row inserted.** Caller may retry with same id and corrected payload. |
|
||||
| **B2. Post-dedupe-claim** | Anything that requires the dedupe-claim transaction to be in progress: destination_ref existence (topic exists, member subscribed, etc.), per-mesh rate limit not exceeded | `4xx` returned, transaction rolled back, **no dedupe row remains**. Caller may retry with same id. |
|
||||
| **B3. Accepted** | All side effects (dedupe row, message row, history row, delivery_queue rows) commit atomically | `201` returned with `broker_message_id` |
|
||||
|
||||
**Critical guarantee (v8)**: there is no broker code path where a
|
||||
permanent rejection (4xx) leaves a dedupe row behind. Either the
|
||||
request committed and a dedupe row exists (B3), or it didn't and no
|
||||
dedupe row exists (B1, B2). This makes "dedupe row exists" the single
|
||||
unambiguous signal of "id consumed at the broker layer."
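
The guarantee can be demonstrated with a small model. This sketch uses an in-memory SQLite database as a stand-in for the broker's store, which is an assumption made only so the example runs anywhere. The table names, `MAX_INLINE_BYTES`, and the specific `413`/`404` codes are hypothetical (the spec says only `4xx`), and fingerprint comparison on duplicates is left out for brevity.

```python
import sqlite3

SCHEMA = """
CREATE TABLE topic (name TEXT PRIMARY KEY);
CREATE TABLE dedupe (
  client_message_id TEXT PRIMARY KEY,
  broker_message_id TEXT NOT NULL
);
CREATE TABLE message (id TEXT PRIMARY KEY, body BLOB NOT NULL);
"""

MAX_INLINE_BYTES = 1024  # hypothetical stand-in for max_payload.inline_bytes


def broker_accept(conn, client_message_id, msg_id, topic, body):
    """Model of phases B1/B2/B3; returns an HTTP-like status code."""
    # B1: checks that need no transaction. Nothing is written on failure.
    if len(body) > MAX_INLINE_BYTES:
        return 413
    conn.execute("BEGIN IMMEDIATE")
    try:
        # Claim the idempotency key inside the accept transaction.
        conn.execute(
            "INSERT OR IGNORE INTO dedupe (client_message_id, broker_message_id)"
            " VALUES (?, ?)",
            (client_message_id, msg_id),
        )
        owner = conn.execute(
            "SELECT broker_message_id FROM dedupe WHERE client_message_id = ?",
            (client_message_id,),
        ).fetchone()[0]
        if owner != msg_id:
            conn.execute("ROLLBACK")
            return 200  # duplicate of an earlier committed accept
        # B2: a check that runs while the claim is held.
        exists = conn.execute(
            "SELECT 1 FROM topic WHERE name = ?", (topic,)
        ).fetchone()
        if exists is None:
            conn.execute("ROLLBACK")  # unwinds the dedupe claim as well
            return 404
        # B3: remaining side effects, then a single commit.
        conn.execute("INSERT INTO message (id, body) VALUES (?, ?)", (msg_id, body))
        conn.execute("COMMIT")
        return 201
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

In this model, a dedupe row is visible after the call only when the function returned `201` on this or an earlier call.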
|
||||
|
||||
If broker decides post-commit that an accepted message is invalid
|
||||
(e.g. an async content-policy job runs on accepted messages), that's
|
||||
NOT a permanent rejection — that's a follow-up moderation event that
|
||||
operates on the broker_message_id, not on the dedupe key.
|
||||
|
||||
#### 4.6.3 Operator recovery via `requeue` (corrected v8)
|
||||
|
||||
To unstick a `dead` or `pending`-but-stuck row, operator runs:
|
||||
|
||||
```
|
||||
claudemesh daemon outbox requeue --id <outbox_row_id>
|
||||
[--new-client-id <id> | --auto]
|
||||
[--patch-payload <path>]
|
||||
```
|
||||
|
||||
This atomically (single SQLite transaction):
|
||||
|
||||
1. Sets the existing row's status to `aborted`, with `aborted_at = now` and
`aborted_by = "operator"`. The row is **never deleted**; the audit trail
is permanent.
|
||||
2. Mints a fresh `client_message_id` (caller-supplied via `--new-client-id`
|
||||
or auto-ulid'd via `--auto`).
|
||||
3. Inserts a new outbox row in `pending` with the fresh id and the same
|
||||
payload (or patched payload if `--patch-payload` was given).
|
||||
4. Sets `superseded_by = <new_row_id>` on the old row so
|
||||
`outbox inspect <old_id>` displays the chain.
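
The four steps above can be sketched as one SQLite transaction. This is an illustration under stated assumptions: the function name `requeue` and its parameters are hypothetical, `uuid4` stands in for a ULID generator (the standard library has none), and the successor's fingerprint is passed in by the caller because the canonical form from §4.4 is not reproduced here.

```python
import sqlite3
import time
import uuid

# Trimmed copy of the outbox table: only the columns this sketch touches.
SCHEMA = """
CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,
  payload             BLOB NOT NULL,
  enqueued_at         INTEGER NOT NULL,
  next_attempt_at     INTEGER NOT NULL,
  status              TEXT CHECK(status IN
                        ('pending','inflight','done','dead','aborted')),
  aborted_at          INTEGER,
  aborted_by          TEXT,
  superseded_by       TEXT
);
"""


def requeue(conn, row_id, new_fingerprint, new_client_id=None, patched_payload=None):
    """Retire row_id as aborted and enqueue a successor under a fresh id."""
    new_client_id = new_client_id or uuid.uuid4().hex  # stand-in for a ULID
    new_row_id = uuid.uuid4().hex
    now = int(time.time())
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT payload, status FROM outbox WHERE id = ?", (row_id,)
        ).fetchone()
        if row is None or row[1] not in ("dead", "pending"):
            raise ValueError("requeue applies only to dead or pending rows")
        payload = patched_payload if patched_payload is not None else row[0]
        # Steps 1 and 4: retire the old row and point it at its successor.
        conn.execute(
            "UPDATE outbox SET status = 'aborted', aborted_at = ?,"
            " aborted_by = 'operator', superseded_by = ? WHERE id = ?",
            (now, new_row_id, row_id),
        )
        # Steps 2 and 3: the successor row carries the fresh client id.
        # The UNIQUE constraint rejects any attempt to reuse a consumed id.
        conn.execute(
            "INSERT INTO outbox (id, client_message_id, request_fingerprint,"
            " payload, enqueued_at, next_attempt_at, status)"
            " VALUES (?, ?, ?, ?, ?, ?, 'pending')",
            (new_row_id, new_client_id, new_fingerprint, payload, now, now),
        )
        conn.execute("COMMIT")
        return new_row_id
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

If the supplied id is already present in the outbox, the insert fails and the rollback leaves the original row untouched, so a failed requeue never half-retires a row.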
|
||||
|
||||
**The old `client_message_id` is permanently dead** — `outbox.db` still
|
||||
holds it via the `aborted` row's `UNIQUE` constraint, and any caller
|
||||
re-using it gets `409 outbox_aborted_*` per §4.5.1.
|
||||
|
||||
If the broker ever accepted the old id (it reached B3), the broker's
dedupe row is also permanent: a duplicate send to the broker with the
old id returns `409` on fingerprint mismatch, or the original
`broker_message_id` on fingerprint match. The daemon-side `aborted` row
and the broker-side dedupe row are independent records of "this id was
used"; neither releases the id.
|
||||
|
||||
This is the resolution to v7's contradiction: there is **no path** for
|
||||
an id to "become free again." If the operator wants to retry the
|
||||
payload, they get a new id. The old id stays buried.
|
||||
|
||||
### 4.7 Broker atomicity contract — side-effect classification (v8 — codex r7)
|
||||
|
||||
#### 4.7.1 Side effects (v8 — explicit classification)
|
||||
|
||||
Every successful broker accept atomically commits these durable
|
||||
state changes in **one transaction**:
|
||||
|
||||
| Effect | Table | In-tx? | Why |
|
||||
|---|---|---|---|
|
||||
| Dedupe record | `mesh.client_message_dedupe` | **Yes** | Idempotency authority |
|
||||
| Message body | `mesh.topic_message` / `mesh.message_queue` | **Yes** | Authoritative store |
|
||||
| History row | `mesh.message_history` | **Yes** | Replay log; lost-on-rollback would break ordered replay |
|
||||
| Fan-out work | `mesh.delivery_queue` | **Yes** | Each recipient must see exactly the messages that committed |
|
||||
| Mention index entries | `mesh.mention_index` | **Yes** | Results of mention queries must match committed messages |
|
||||
|
||||
**Outside the transaction** — non-authoritative or rebuildable, with
|
||||
explicit rationale per item:
|
||||
|
||||
| Effect | Where | Why outside |
|
||||
|---|---|---|
|
||||
| WS push to live subscribers | Async after COMMIT | Live notifications are best-effort; receivers re-fetch from history on reconnect |
|
||||
| Webhook fan-out | Async via `delivery_queue` workers | Off-band; consumes committed `delivery_queue` rows |
|
||||
| Rate-limit counters | Async, eventually consistent | Counters are an estimate; over-counting on retry is preferable to under-counting |
|
||||
| Audit log entries | Async append-only stream | Audit log can be rebuilt from message history; in-tx writes hurt p99 |
|
||||
| Search/FTS index updates | Async via outbox-pattern worker | Index can be rebuilt from authoritative tables |
|
||||
| Metrics | Prometheus, pull-based | Always non-authoritative |
|
||||
|
||||
If any in-transaction insert fails, the transaction rolls back
completely. The broker returns `5xx` to the daemon, which retries. No
partial state remains.
|
||||
|
||||
The async side effects are driven off the in-transaction
|
||||
`delivery_queue` and `message_history` rows, so they cannot get ahead
|
||||
of committed state — only lag behind.
|
||||
|
||||
#### 4.7.2 Pseudocode — corrected and final (v8)
|
||||
|
||||
```sql
|
||||
BEGIN;
|
||||
|
||||
-- Phase B1 already passed (see §4.6.2).
|
||||
|
||||
-- Phase B2 + B3: try to claim the idempotency key.
|
||||
INSERT INTO mesh.client_message_dedupe
|
||||
(mesh_id, client_message_id, broker_message_id, request_fingerprint,
|
||||
destination_kind, destination_ref, expires_at)
|
||||
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
|
||||
$dest_kind, $dest_ref, $expires_at)
|
||||
ON CONFLICT (mesh_id, client_message_id) DO NOTHING;
|
||||
|
||||
-- Inspect the row that's actually there now (ours or someone else's).
|
||||
SELECT broker_message_id, request_fingerprint, destination_kind,
|
||||
destination_ref, history_available, first_seen_at
|
||||
FROM mesh.client_message_dedupe
|
||||
WHERE mesh_id = $mesh_id AND client_message_id = $client_id
|
||||
FOR SHARE;
|
||||
|
||||
-- Branch:
|
||||
-- row.broker_message_id == $msg_id → first insert; continue to step 3.
|
||||
-- row.broker_message_id != $msg_id → duplicate. Compare fingerprints:
|
||||
-- fingerprint match → ROLLBACK; return 200 duplicate.
|
||||
-- fingerprint mismatch → ROLLBACK; return 409 idempotency_key_reused.
|
||||
|
||||
-- Step 3: validate Phase B2 (subscribers exist, rate limit not exceeded, etc.)
|
||||
-- If B2 fails → ROLLBACK; return 4xx (no dedupe row remains).
|
||||
|
||||
-- Step 4: insert all in-tx side effects (§4.7.1).
|
||||
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
|
||||
VALUES ($msg_id, $mesh_id, $client_id, ...);
|
||||
|
||||
INSERT INTO mesh.message_history (broker_message_id, mesh_id, ...)
|
||||
VALUES ($msg_id, $mesh_id, ...);
|
||||
|
||||
INSERT INTO mesh.delivery_queue (broker_message_id, recipient_pubkey, ...)
|
||||
SELECT $msg_id, member_pubkey, ...
|
||||
FROM mesh.topic_subscription
|
||||
WHERE topic = $dest_ref AND mesh_id = $mesh_id;
|
||||
|
||||
INSERT INTO mesh.mention_index (broker_message_id, mentioned_pubkey, ...)
|
||||
SELECT $msg_id, mention_pubkey, ...
|
||||
FROM unnest($mention_list);
|
||||
|
||||
COMMIT;
|
||||
|
||||
-- After COMMIT, async workers consume delivery_queue and update
|
||||
-- search indexes, audit logs, rate-limit counters, etc.
|
||||
```
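
The branch after the `SELECT` can be read as a pure decision over the returned row. The sketch below isolates that decision so it can be tested without a database; the names `Claim` and `classify_claim` are hypothetical.

```python
from enum import Enum


class Claim(Enum):
    FIRST_INSERT = "first_insert"  # our row won the claim; continue the accept
    DUPLICATE = "duplicate"        # same request seen before; return 200
    CONFLICT = "conflict"          # id reused with a different request; 409


def classify_claim(row_msg_id, row_fingerprint, our_msg_id, our_fingerprint):
    """Decide the branch from the dedupe row that is present after the INSERT."""
    if row_msg_id == our_msg_id:
        return Claim.FIRST_INSERT
    if row_fingerprint == our_fingerprint:
        return Claim.DUPLICATE
    return Claim.CONFLICT
```

Comparing `broker_message_id` first matters: a row we just inserted always matches our own fingerprint, so checking the fingerprint first would misreport a first insert as a duplicate.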
|
||||
|
||||
#### 4.7.3 Orphan check — same as v7 §4.7.3
|
||||
|
||||
Extended to cover the full side-effect inventory, verifying that the in-tx items are mutually consistent.
|
||||
|
||||
### 4.8 Outbox max-age math — unchanged from v7 §4.8
|
||||
|
||||
Min `dedupe_retention_days = 7`; derived `max_age_hours = window -
|
||||
safety_margin` strictly < window; safety_margin floor 24h.
|
||||
|
||||
### 4.9 Inbox schema — unchanged from v3 §4.5
|
||||
|
||||
### 4.10 Crash recovery — unchanged from v3 §4.6
|
||||
|
||||
### 4.11 Failure modes — `aborted` semantics added (v8)
|
||||
|
||||
- **IPC accept fingerprint-mismatch on duplicate id** (any state):
|
||||
returns 409 with `conflict` field per §4.5.1. Caller must use a new id.
|
||||
- **IPC accept against `aborted` row, fingerprint match**: returns 409
|
||||
per §4.5.1 (NEW v8). Caller must use a new id; the old id is
|
||||
permanently retired.
|
||||
- **Outbox row stuck in `dead`**: operator runs `outbox requeue` per
|
||||
§4.6.3; old id stays in `aborted`, new id is fresh.
|
||||
- **Broker fingerprint mismatch on retry**: as v6/v7. Daemon marks
|
||||
`dead`; operator requeue path.
|
||||
- **Daemon retry after dedupe row hard-deleted by broker retention
|
||||
sweep**: cannot happen unless operator overrode `max_age_hours`.
|
||||
- **Broker phase B2 rejection on retry**: same id, same fingerprint,
|
||||
but B2 condition has changed (e.g. mesh rate-limit now exceeded).
|
||||
Daemon receives 4xx → marks `dead`. Operator can `requeue` once
|
||||
conditions clear.
|
||||
- **Atomicity violation found by orphan check**: alerts ops.
|
||||
|
||||
---
|
||||
|
||||
## 5-13. — unchanged from v4
|
||||
|
||||
## 14. Lifecycle — unchanged from v5 §14
|
||||
|
||||
## 15. Version compat — unchanged from v7 §15
|
||||
|
||||
## 16. Threat model — unchanged
|
||||
|
||||
---
|
||||
|
||||
## 17. Migration — v8 outbox columns + broker phase B2 (v8)
|
||||
|
||||
Broker side, deploy order: same as v7 §17, with one addition:
|
||||
- Step 4.5: explicitly split broker accept into Phase B1 (pre-dedupe
|
||||
validation, returns 4xx without writing) and Phase B2/B3 (within the
|
||||
accept transaction). Implementation: refactor handler to validate
|
||||
Phase B1 conditions before opening the DB transaction.
|
||||
|
||||
Daemon side:
|
||||
- Outbox schema gains `aborted_at`, `aborted_by`, `superseded_by`
columns and the `aborted` enum value (§4.5.2). Where a migration is
needed, recreate the table and copy rows with an explicit column list
(`INSERT INTO new (...) SELECT ... FROM old`), because a bare `SELECT *`
no longer matches the widened table; v0.9.0 is greenfield.
|
||||
- IPC accept switches to `BEGIN IMMEDIATE` for SQLite serialization
|
||||
(§4.5.1 step 3).
|
||||
- IPC accept handles `aborted` rows per §4.5.1 (always 409).
|
||||
- `claudemesh daemon outbox requeue` always mints a fresh
|
||||
`client_message_id`; never frees the old id. `--new-client-id <id>`
|
||||
and `--auto` are the only modes; the old `client_message_id`
|
||||
argument is removed.
|
||||
|
||||
---
|
||||
|
||||
## What changed v7 → v8 (codex round-7 actionable items)
|
||||
|
||||
| Codex r7 item | v8 fix | Section |
|
||||
|---|---|---|
|
||||
| `aborted` not in §4.5.1; `UNIQUE` contradiction | Added two `aborted` rows (match/mismatch) to lookup table; old id never reusable; new audit columns `aborted_at`/`aborted_by`/`superseded_by` | §4.5.1, §4.5.2, §4.6.3 |
|
||||
| Broker permanent-rejection ordering vague | Three-phase model B1 (pre-dedupe), B2 (post-claim, in-tx), B3 (accepted); permanent 4xx never leaves dedupe row | §4.6.2 |
|
||||
| SQLite `SELECT FOR UPDATE` invalid | Replaced with `BEGIN IMMEDIATE` for daemon-local serialization | §4.5.1 |
|
||||
| Side-effect inventory ambiguous on rate-limit/audit/search | Explicit in-tx vs outside-tx table with rationale per item | §4.7.1 |
|
||||
| Operator id reuse semantics | Old id permanently retired in `aborted`; requeue always mints fresh id; no daemon-side path to release used ids | §4.6.3 |
|
||||
|
||||
---
|
||||
|
||||
## What needs review (round 8)
|
||||
|
||||
1. **`aborted` permanence (§4.5.1, §4.6.3)** — is "old id permanently
|
||||
dead" correct, or is there a real operational case where releasing
|
||||
an id (e.g. caller mistyped a uuid) is worth the audit-trail loss?
|
||||
2. **Phase B1/B2/B3 split (§4.6.2)** — clean enough? Is rate-limiting
|
||||
in B2 (in-tx) the right call, or should it be B1 (cheaper to enforce
|
||||
pre-tx)?
|
||||
3. **In-tx mention_index (§4.7.1)** — agree it should be in-tx, or
|
||||
should mention indexing be async like search?
|
||||
4. **`BEGIN IMMEDIATE` (§4.5.1)** — correct SQLite primitive, or should
|
||||
it be `BEGIN EXCLUSIVE` to also block readers? (Probably not — readers
|
||||
should see committed-pending rows, but worth confirming.)
|
||||
5. **Anything else still wrong?** Read it as if you were going to
|
||||
operate this for a year.
|
||||
|
||||
Three options:
|
||||
- **(a) v8 is shippable**: lock the spec, start coding the frozen core.
|
||||
- **(b) v9 needed**: list the must-fix items.
|
||||
- **(c) the architecture itself is wrong**: what would you do differently?
|
||||
|
||||
Be ruthless.
|
||||
473
.artifacts/shipped/2026-05-03-daemon-final-spec-v9.md
Normal file
@@ -0,0 +1,473 @@
|
||||
# `claudemesh daemon` — Final Spec v9
|
||||
|
||||
> **Round 9.** v8 was reviewed by codex (round 8) which closed
|
||||
> aborted/UNIQUE (5/5) and SQLite locking (5/5) cleanly, but flagged
|
||||
> three spec-level correctness problems:
|
||||
>
|
||||
> 1. **Cross-layer ID-consumed authority contradiction** — v8 §4.1
|
||||
> said "id consumed iff dedupe row exists" while §4.6.1 says a
|
||||
> daemon-rejected id stays consumed locally with no broker dedupe
|
||||
> row. Two incompatible authorities.
|
||||
> 2. **Rate-limit authority muddled** — v8 listed rate limit in B2
|
||||
> (in-tx authoritative) but classified rate-limit counters as
|
||||
> async/non-authoritative in §4.7.1.
|
||||
> 3. **§4.1 broker guarantee wording** — "post-validation accept
|
||||
> phase" was fuzzy because B2 rolls back. Tighten to "accept
|
||||
> committed."
|
||||
>
|
||||
> v9 fixes all three with **two-layer ID rules** (daemon vs broker),
|
||||
> rate-limit moved to B1 via an external atomic limiter, and §4.1
|
||||
> tightened. **Intent §0 unchanged from v2.** v9 only revises §4.
|
||||
|
||||
---
|
||||
|
||||
## 0. Intent — unchanged, see v2 §0
|
||||
|
||||
## 1. Process model — unchanged
|
||||
|
||||
## 2. Identity — unchanged from v5 §2
|
||||
|
||||
## 3. IPC surface — unchanged from v4 §3
|
||||
|
||||
---
|
||||
|
||||
## 4. Delivery contract — `aborted` clarified, broker phasing, SQLite locking
|
||||
|
||||
### 4.1 The contract (precise — v9, two-layer ID model)
|
||||
|
||||
> **Two-layer ID rules** (NEW v9 — codex r8):
|
||||
>
|
||||
> - **Daemon-layer**: a `client_message_id` is **daemon-consumed** iff an
|
||||
> outbox row exists for it. Daemon-mediated callers can never reuse a
|
||||
> daemon-consumed id, regardless of whether the broker ever saw it.
|
||||
> The daemon's outbox is the single authority for "this id was issued
|
||||
> by my caller against this daemon."
|
||||
> - **Broker-layer**: a `client_message_id` is **broker-consumed** iff a
|
||||
> dedupe row exists for `(mesh_id, client_message_id)` in
|
||||
> `mesh.client_message_dedupe`. Direct broker callers (none in
|
||||
> v0.9.0; reserved for future SDK paths that bypass the daemon) can
|
||||
> reuse a broker-non-consumed id freely.
|
||||
> - In v0.9.0 there are no daemon-bypass clients, so for practical
|
||||
> purposes "daemon-consumed" is the operative rule.
|
||||
>
|
||||
> **Local guarantee**: each successful `POST /v1/send` returns a stable
|
||||
> `client_message_id`. The send is durably persisted to `outbox.db`
|
||||
> before the response returns. The daemon enforces request-fingerprint
|
||||
> idempotency at the IPC layer (§4.5.1).
|
||||
>
|
||||
> **Local audit guarantee**: a `client_message_id` once written to
|
||||
> `outbox.db` is **never released** (daemon-layer rule). Operator
|
||||
> recovery via `requeue` always mints a fresh id; the old row stays in
|
||||
> `aborted` for audit. There is no daemon-side path to free a used id.
|
||||
>
|
||||
> **Broker guarantee** (v9 — tightened): a dedupe row exists iff the
|
||||
> broker accept transaction **committed** (Phase B3 reached). Phase B1
|
||||
> rejections never insert dedupe rows. Phase B2 rejections roll the
|
||||
> transaction back, so any partial dedupe row is unwound. Direct
|
||||
> broker callers retrying after B1/B2 rejection see no dedupe row and
|
||||
> may reuse the id.
|
||||
>
|
||||
> **Atomicity guarantee**: same as v8 §4.1.
|
||||
>
|
||||
> **End-to-end guarantee**: at-least-once.
|
||||
|
||||
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
|
||||
|
||||
### 4.3 Broker schema — unchanged from v6 §4.3
|
||||
|
||||
### 4.4 Request fingerprint canonical form — unchanged from v6 §4.4
|
||||
|
||||
### 4.5 Daemon-local idempotency at the IPC layer (v8 — `aborted` added, SQLite locking)
|
||||
|
||||
#### 4.5.1 IPC accept algorithm (v8)
|
||||
|
||||
On `POST /v1/send`:
|
||||
|
||||
1. Validate request envelope (auth, schema, size limits, destination
|
||||
resolvable). Failures here return `4xx` immediately. **No outbox row
|
||||
is written; the `client_message_id` is not consumed.**
|
||||
2. Compute `request_fingerprint` (§4.4).
|
||||
3. Open a SQLite transaction with `BEGIN IMMEDIATE` (v8 — codex r7) so
|
||||
a concurrent IPC accept on the same id serializes against this one.
|
||||
`BEGIN IMMEDIATE` acquires the RESERVED lock at transaction start,
|
||||
preventing any other writer from beginning a transaction on the same
|
||||
database; SQLite has no row-level lock and `SELECT FOR UPDATE` is not
|
||||
supported.
|
||||
4. `SELECT id, request_fingerprint, status, broker_message_id,
|
||||
last_error FROM outbox WHERE client_message_id = ?`.
|
||||
5. Apply the lookup table below. For the "(no row)" case, INSERT the
|
||||
new row inside the same transaction.
|
||||
6. COMMIT.
|
||||
|
||||
| Existing row state | Fingerprint match? | Daemon response |
|
||||
|---|---|---|
|
||||
| (no row) | — | INSERT new outbox row in `pending`; return `202 accepted, queued` |
|
||||
| `pending` | match | Return `202 accepted, queued`. No mutation |
|
||||
| `pending` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_pending_fingerprint_mismatch"`. No mutation |
|
||||
| `inflight` | match | Return `202 accepted, inflight`. No mutation |
|
||||
| `inflight` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_inflight_fingerprint_mismatch"` |
|
||||
| `done` | match | Return `200 ok, duplicate: true, broker_message_id, history_id`. No broker call |
|
||||
| `done` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_done_fingerprint_mismatch", broker_message_id` |
|
||||
| `dead` | match | Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_match", reason: "<last_error>"`. Same id never auto-retried |
|
||||
| `dead` | mismatch | Return `409 idempotency_key_reused`, `conflict: "outbox_dead_fingerprint_mismatch"` |
|
||||
| **`aborted`** (NEW v8) | **match** | Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_match"`. The id was retired by operator action; never reusable |
|
||||
| **`aborted`** (NEW v8) | **mismatch** | Return `409 idempotency_key_reused`, `conflict: "outbox_aborted_fingerprint_mismatch"` |
|
||||
|
||||
**Rule (v8 — codex r7)**: every IPC `409` carries the daemon's
|
||||
`request_fingerprint` (8-byte hex prefix) so callers can debug
|
||||
client/server canonical-form drift. **Every state in the table returns
|
||||
something deterministic, including `aborted`.** A `client_message_id`
|
||||
written to `outbox.db` is permanently bound to that row's lifecycle —
|
||||
the only "free" state is "no row exists".
|
||||
|
||||
#### 4.5.2 Outbox table — fingerprint required
|
||||
|
||||
```sql
|
||||
CREATE TABLE outbox (
|
||||
id TEXT PRIMARY KEY,
|
||||
client_message_id TEXT NOT NULL UNIQUE,
|
||||
request_fingerprint BLOB NOT NULL, -- 32 bytes
|
||||
payload BLOB NOT NULL,
|
||||
enqueued_at INTEGER NOT NULL,
|
||||
attempts INTEGER DEFAULT 0,
|
||||
next_attempt_at INTEGER NOT NULL,
|
||||
status TEXT CHECK(status IN
|
||||
('pending','inflight','done','dead','aborted')),
|
||||
last_error TEXT,
|
||||
delivered_at INTEGER,
|
||||
broker_message_id TEXT,
|
||||
aborted_at INTEGER, -- NEW v8
|
||||
aborted_by TEXT, -- NEW v8: operator/auto
|
||||
superseded_by TEXT -- NEW v8: id of the requeue successor row, if any
|
||||
);
|
||||
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
|
||||
CREATE INDEX outbox_aborted ON outbox(status, aborted_at) WHERE status = 'aborted';
|
||||
```
|
||||
|
||||
`aborted_at`, `aborted_by`, `superseded_by` give operators a clear
|
||||
audit trail. `superseded_by` lets `outbox inspect` show the chain when
|
||||
a row was requeued multiple times.
|
||||
|
||||
`request_fingerprint` is computed once at IPC accept time and frozen
|
||||
forever for the row's lifecycle. Daemon never recomputes from
|
||||
`payload`.
|
||||
|
||||
### 4.6 Rejected-request semantics — two-layer rules + rate-limit moved to B1 (v9 — codex r8)
|
||||
|
||||
> **Two-layer rule (v9)**: a `client_message_id` is **daemon-consumed**
|
||||
> iff an outbox row exists for it; **broker-consumed** iff a dedupe row
|
||||
> exists. Daemon-mediated callers see daemon-layer authority (the only
|
||||
> path in v0.9.0). Pre-validation failures at any layer consume nothing
|
||||
> at that layer. The two layers are independent: a daemon-consumed id
|
||||
> may or may not be broker-consumed (depending on whether the send
|
||||
> reached B3); a daemon-non-consumed id can never be broker-consumed
|
||||
> (no outbox row ⇒ no broker call from the daemon).
|
||||
|
||||
#### 4.6.1 Daemon-side rejection phasing (v9)
|
||||
|
||||
| Phase | When daemon rejects | Outbox row? | Daemon-consumed? | Same daemon caller may reuse id? |
|
||||
|---|---|---|---|---|
|
||||
| **A. IPC validation** (auth, schema, size, destination resolvable) | Before §4.5.1 step 3 | No | No | Yes — id never written locally |
|
||||
| **B. Outbox stored, broker network/transient failure** | After IPC accept, broker `5xx` or timeout | `pending` → retried | Yes | N/A — daemon owns retries |
|
||||
| **C. Outbox stored, broker permanent rejection** | Broker returns `4xx` after IPC accept | `dead` | Yes | No — rotate via `requeue` |
|
||||
| **D. Operator retirement** | Operator runs `requeue` on `dead` or `pending` row | `aborted` (audit) + new row with fresh id | Yes (still consumed) | Old id NEVER reusable; new id is fresh |
|
||||
|
||||
The "daemon-consumed?" column is the daemon-layer authority. It does
|
||||
not depend on whether the broker ever saw the request — phase C above
|
||||
shows the broker has not committed a dedupe row, but the daemon still
|
||||
holds the id in `dead` state.
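The attempt-driven rows of the phase table (A, B, C) can be read as a small decision function. The sketch below is illustrative only: `classify`, `DaemonOutcome`, and the mapping of a `2xx` broker response to a `done` status are assumptions, not part of the spec. Phase D is operator-initiated and is therefore not modelled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DaemonOutcome:
    phase: str                    # "A", "B", "C", or "accepted"
    outbox_status: Optional[str]  # None means no outbox row was written
    daemon_consumed: bool         # True iff an outbox row exists for the id


def classify(ipc_valid: bool, broker_status: Optional[int]) -> DaemonOutcome:
    """Map one send attempt onto the daemon-side phases of the table above.

    broker_status is the broker's HTTP-style status, or None for a timeout.
    It is ignored when IPC validation already failed.
    """
    if not ipc_valid:
        # Phase A: rejected before any local write, so the id stays reusable.
        return DaemonOutcome("A", None, False)
    if broker_status is None or broker_status >= 500:
        # Phase B: transient failure; the daemon keeps the row and retries.
        return DaemonOutcome("B", "pending", True)
    if 400 <= broker_status < 500:
        # Phase C: permanent rejection; the row is parked as dead.
        return DaemonOutcome("C", "dead", True)
    if 200 <= broker_status < 300:
        # Assumption: a 2xx means the broker accepted and the row is done.
        return DaemonOutcome("accepted", "done", True)
    raise ValueError(f"unexpected broker status {broker_status}")
```

Note that `daemon_consumed` is true in every branch where a row was stored, regardless of what the broker did with the request.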
|
||||
|
||||
#### 4.6.2 Broker-side rejection phasing (v9 — rate limit moved to B1)
|
||||
|
||||
The broker validates in two phases relative to dedupe-row insertion:
|
||||
|
||||
| Phase | Validation | Side effects | Result for direct broker callers |
|
||||
|---|---|---|---|
|
||||
| **B1. Pre-dedupe-claim** (atomic, external) | Auth (mesh membership), schema, size, mesh exists, member exists, destination kind valid, payload bytes ≤ `max_payload.inline_bytes`, **rate limit not exceeded** (atomic external limiter — see §4.6.4) | None | `4xx` returned. No dedupe row, no broker-consumed id. Caller may retry with same id once condition clears |
|
||||
| **B2. Post-dedupe-claim** (in-tx) | Conditions that require the accept transaction to be in progress: destination_ref existence (topic exists, member subscribed, etc.) | INSERT into dedupe rolled back | `4xx` returned, transaction rolled back, no dedupe row remains. Caller may retry with same id |
|
||||
| **B3. Accepted** | All side effects commit atomically | Dedupe row, message row, history row, delivery_queue rows, mention_index rows | `201` returned with `broker_message_id`. Id is broker-consumed |
|
||||
|
||||
**Daemon-mediated callers**: in v0.9.0 the daemon is the only B-phase
|
||||
caller. Daemon-mediated callers see only the daemon-layer rules
|
||||
(§4.6.1). The broker's "may retry with same id" wording in the table
|
||||
above applies to direct broker callers only (none in v0.9.0; reserved
|
||||
for future SDK paths).
|
||||
|
||||
**Critical guarantee (v9 — tightened from v8)**: a dedupe row exists
|
||||
**iff the broker accept transaction committed (B3)**. There is no
|
||||
broker code path where a permanent 4xx leaves a dedupe row behind.
|
||||
|
||||
If the broker decides post-commit that an accepted message is invalid
|
||||
(async content-policy job, async moderation, etc.), that's NOT a
|
||||
permanent rejection — it's a follow-up event that operates on the
|
||||
`broker_message_id`, not on the dedupe key.
|
||||
|
||||
#### 4.6.4 Rate limiter — atomic, external, B1 (NEW v9 — codex r8)
|
||||
|
||||
Codex r8 caught: v8 listed rate-limit enforcement in B2 (in-tx) but
|
||||
classified rate-limit *counters* as async/non-authoritative. Both
|
||||
can't be true. v9 resolves it by moving rate-limit enforcement to B1
|
||||
backed by an atomic external limiter:
|
||||
|
||||
- **Authority**: the broker's existing Redis (or equivalent
|
||||
fixed-window limiter) used for `claudemesh launch` rate-limiting is
|
||||
the authority for accept-time rate-limit enforcement. `INCR` with
|
||||
TTL is atomic; the broker checks the result before committing the
|
||||
Phase B2/B3 transaction.
|
||||
- **Idempotency interaction**: rate-limit `INCR` happens **before** the
|
||||
dedupe-claim INSERT. If the limiter rejects, no DB transaction is
|
||||
opened, no dedupe row exists. If the limiter accepts but the in-tx
|
||||
Phase B2 then rejects (e.g. topic not found), the limiter `INCR` is
|
||||
not refunded. This is intentional: refunding would require a
|
||||
reliable distributed counter, and the over-counting risk is
|
||||
acceptable. Counter
|
||||
`cm_broker_rate_limit_consumed_then_rejected_total` exposes the
|
||||
delta for ops awareness.
|
||||
- **Retries**: a daemon retry with the same `client_message_id` after a
|
||||
B1 rate-limit rejection produces another `INCR`. To avoid burning
|
||||
rate-limit budget on retries-of-rejected-ids, the broker can
|
||||
optionally short-circuit `INCR` if the rate-limit subsystem can
|
||||
cheaply detect "this exact `client_message_id` was rejected for
|
||||
rate-limit in the last N seconds" — but this is an optimization,
|
||||
not a correctness requirement.
|
||||
- **Async counters**: `mesh.rate_limit_counter` (or any DB-resident
|
||||
view of "messages-per-mesh-per-window") is **non-authoritative** —
|
||||
it's metrics/telemetry rebuilt from the authoritative limiter and
|
||||
from message-history. Used for dashboards, not for accept decisions.
|
||||
|
||||
This split — atomic external limiter for enforcement, async DB
|
||||
counters for telemetry — matches how every other rate-limited
|
||||
subsystem in claudemesh works (`claudemesh launch`, dashboard chat
|
||||
posts, etc.). No new infrastructure required.
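A minimal sketch of the ordering described above: limiter `INCR` first, no refund when B2 later rejects. `FixedWindowLimiter` is a hypothetical in-memory stand-in for the Redis-style limiter so the example runs without external services; the `429` and `404` status codes and the metric key are assumptions (the spec only says `4xx`).

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """In-memory stand-in for an atomic INCR-with-TTL limiter (hypothetical)."""

    def __init__(self, limit: int, window_s: int,
                 clock: Callable[[], float] = time.time) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        # Old windows are never evicted here; a real limiter relies on TTL.
        self._counts: Dict[Tuple[str, int], int] = {}

    def incr_and_check(self, mesh_id: str) -> bool:
        """Increment the current window's counter; True if within the limit."""
        window = int(self.clock() // self.window_s)
        key = (mesh_id, window)
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key] <= self.limit


def accept_with_limiter(limiter: FixedWindowLimiter, mesh_id: str,
                        destination_exists: bool,
                        metrics: Dict[str, int]) -> int:
    """Return an HTTP-style status for one broker accept attempt."""
    # B1: the limiter is consulted before any DB transaction is opened.
    if not limiter.incr_and_check(mesh_id):
        return 429
    # B2: in-tx validation; a failure here does not refund the INCR.
    if not destination_exists:
        key = "consumed_then_rejected"
        metrics[key] = metrics.get(key, 0) + 1
        return 404
    return 201  # B3: accepted
```

A B2 rejection still counts against the window, which is the over-counting the section accepts in exchange for not needing a refundable distributed counter.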
|
||||
|
||||
#### 4.6.3 Operator recovery via `requeue` (corrected v8)
|
||||
|
||||
To unstick a `dead` or `pending`-but-stuck row, the operator runs:
|
||||
|
||||
```
|
||||
claudemesh daemon outbox requeue --id <outbox_row_id>
|
||||
[--new-client-id <id> | --auto]
|
||||
[--patch-payload <path>]
|
||||
```
|
||||
|
||||
This atomically (single SQLite transaction):
|
||||
|
||||
1. Marks the existing row as `aborted`, sets `aborted_at = now`,
|
||||
`aborted_by = "operator"`. Row is **never deleted** — audit trail
|
||||
permanent.
|
||||
2. Mints a fresh `client_message_id` (caller-supplied via `--new-client-id`
|
||||
or auto-ulid'd via `--auto`).
|
||||
3. Inserts a new outbox row in `pending` with the fresh id and the same
|
||||
payload (or patched payload if `--patch-payload` was given).
|
||||
4. Sets `superseded_by = <new_row_id>` on the old row so
|
||||
`outbox inspect <old_id>` displays the chain.
|
||||
|
||||
**The old `client_message_id` is permanently dead** — `outbox.db` still
|
||||
holds it via the `aborted` row's `UNIQUE` constraint, and any caller
|
||||
re-using it gets `409 outbox_aborted_*` per §4.5.1.
|
||||
|
||||
If the broker ever accepted the old id (it reached B3), the broker's
|
||||
dedupe row is also permanent — duplicate sends to broker with the old
|
||||
id would also `409` for fingerprint mismatch (or return the original
|
||||
`broker_message_id` for matching fingerprint). Daemon-side
|
||||
`aborted` and broker-side dedupe row are independent records of "this
|
||||
id was used," neither releases the id.
|
||||
|
||||
This is the resolution to v7's contradiction: there is **no path** for
|
||||
an id to "become free again." If the operator wants to retry the
|
||||
payload, they get a new id. The old id stays buried.
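The four requeue steps can be sketched as one SQLite transaction. This is a sketch under stated assumptions: the table below is a reduced, hypothetical version of the outbox schema (only the columns the steps touch), `uuid4` stands in for ULID generation, and `requeue` is not the daemon's actual function.

```python
import sqlite3
import time
import uuid
from typing import Optional

# Reduced, hypothetical outbox schema: only the columns the steps touch.
SCHEMA = """
CREATE TABLE outbox (
  id                TEXT PRIMARY KEY,
  client_message_id TEXT NOT NULL UNIQUE,
  payload           BLOB NOT NULL,
  status            TEXT NOT NULL,
  aborted_at        INTEGER,
  aborted_by        TEXT,
  superseded_by     TEXT
);
"""


def requeue(db: sqlite3.Connection, row_id: str,
            new_client_id: Optional[str] = None,
            patched_payload: Optional[bytes] = None) -> str:
    """Abort row_id and insert a successor row; returns the new row id.

    Expects a connection opened with isolation_level=None so that the
    explicit BEGIN/COMMIT below control the transaction.
    """
    new_row_id = uuid.uuid4().hex                      # stand-in for a ULID
    new_client_id = new_client_id or uuid.uuid4().hex  # the --auto case
    db.execute("BEGIN IMMEDIATE")
    try:
        row = db.execute(
            "SELECT payload, status FROM outbox WHERE id = ?", (row_id,)
        ).fetchone()
        if row is None or row[1] not in ("dead", "pending"):
            raise ValueError("row is missing or not requeueable")
        payload = patched_payload if patched_payload is not None else row[0]
        # Steps 2-3: insert the successor with the fresh client_message_id.
        # The UNIQUE constraint rejects any attempt to reuse a retired id.
        db.execute(
            "INSERT INTO outbox (id, client_message_id, payload, status) "
            "VALUES (?, ?, ?, 'pending')",
            (new_row_id, new_client_id, payload),
        )
        # Steps 1 and 4: retire the old row and link it to its successor.
        db.execute(
            "UPDATE outbox SET status = 'aborted', aborted_at = ?, "
            "aborted_by = 'operator', superseded_by = ? WHERE id = ?",
            (int(time.time()), new_row_id, row_id),
        )
        db.execute("COMMIT")
    except Exception:
        if db.in_transaction:
            db.execute("ROLLBACK")
        raise
    return new_row_id
```

Because the old row is updated rather than deleted, its `client_message_id` keeps occupying the `UNIQUE` index, which is what makes the id permanently unusable.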
|
||||
|
||||
### 4.7 Broker atomicity contract — side-effect classification (v9)
|
||||
|
||||
#### 4.7.1 Side effects (v9 — rate limit moved to B1 external)
|
||||
|
||||
Every successful broker accept atomically commits these durable
|
||||
state changes in **one transaction**:
|
||||
|
||||
| Effect | Table | In-tx? | Why |
|
||||
|---|---|---|---|
|
||||
| Dedupe record | `mesh.client_message_dedupe` | **Yes** | Idempotency authority |
|
||||
| Message body | `mesh.topic_message` / `mesh.message_queue` | **Yes** | Authoritative store |
|
||||
| History row | `mesh.message_history` | **Yes** | Replay log; lost-on-rollback would break ordered replay |
|
||||
| Fan-out work | `mesh.delivery_queue` | **Yes** | Each recipient must see exactly the messages that committed |
|
||||
| Mention index entries | `mesh.mention_index` | **Yes** | Reads off mention queries must match committed messages |
|
||||
|
||||
**Outside the transaction** — non-authoritative or rebuildable, with
|
||||
explicit rationale per item:
|
||||
|
||||
| Effect | Where | Why outside |
|
||||
|---|---|---|
|
||||
| WS push to live subscribers | Async after COMMIT | Live notifications are best-effort; receivers re-fetch from history on reconnect |
|
||||
| Webhook fan-out | Async via `delivery_queue` workers | Off-band; consumes committed `delivery_queue` rows |
|
||||
| Rate-limit **counters** (telemetry only) | Async, eventually consistent | Authoritative limiter is the external Redis-style INCR in B1 (§4.6.4); the DB counter is rebuilt for dashboards, not consulted for accept |
|
||||
| Audit log entries | Async append-only stream | Audit log can be rebuilt from message history; in-tx writes hurt p99 |
|
||||
| Search/FTS index updates | Async via outbox-pattern worker | Index can be rebuilt from authoritative tables |
|
||||
| Metrics | Prometheus, pull-based | Always non-authoritative |
|
||||
|
||||
If any in-transaction insert fails, the transaction rolls back
|
||||
completely. The broker returns `5xx` to the daemon, which retries. No partial
|
||||
state.
|
||||
|
||||
The async side effects are driven off the in-transaction
|
||||
`delivery_queue` and `message_history` rows, so they cannot get ahead
|
||||
of committed state — only lag behind.
|
||||
|
||||
#### 4.7.2 Pseudocode — corrected and final (v8)
|
||||
|
||||
```sql
|
||||
-- Phase B1 already passed (see §4.6.2). This includes:
|
||||
-- - schema/auth/size validation
|
||||
-- - external atomic rate-limit INCR (§4.6.4)
|
||||
-- Anything that fails B1 returns 4xx without ever opening this tx.
|
||||
|
||||
BEGIN;
|
||||
|
||||
-- Phase B2 + B3: try to claim the idempotency key.
|
||||
INSERT INTO mesh.client_message_dedupe
|
||||
(mesh_id, client_message_id, broker_message_id, request_fingerprint,
|
||||
destination_kind, destination_ref, expires_at)
|
||||
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
|
||||
$dest_kind, $dest_ref, $expires_at)
|
||||
ON CONFLICT (mesh_id, client_message_id) DO NOTHING;
|
||||
|
||||
-- Inspect the row that's actually there now (ours or someone else's).
|
||||
SELECT broker_message_id, request_fingerprint, destination_kind,
|
||||
destination_ref, history_available, first_seen_at
|
||||
FROM mesh.client_message_dedupe
|
||||
WHERE mesh_id = $mesh_id AND client_message_id = $client_id
|
||||
FOR SHARE;
|
||||
|
||||
-- Branch:
|
||||
-- row.broker_message_id == $msg_id → first insert; continue to step 3.
|
||||
-- row.broker_message_id != $msg_id → duplicate. Compare fingerprints:
|
||||
-- fingerprint match → ROLLBACK; return 200 duplicate.
|
||||
-- fingerprint mismatch → ROLLBACK; return 409 idempotency_key_reused.
|
||||
|
||||
-- Step 3: validate Phase B2 (destination_ref existence: topic exists,
|
||||
-- member subscribed, etc.). Rate limit is NOT here — it was checked
|
||||
-- atomically in B1 via the external limiter (§4.6.4) before this
|
||||
-- transaction opened.
|
||||
-- If B2 fails → ROLLBACK; return 4xx (no dedupe row remains).
|
||||
|
||||
-- Step 4: insert all in-tx side effects (§4.7.1).
|
||||
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
|
||||
VALUES ($msg_id, $mesh_id, $client_id, ...);
|
||||
|
||||
INSERT INTO mesh.message_history (broker_message_id, mesh_id, ...)
|
||||
VALUES ($msg_id, $mesh_id, ...);
|
||||
|
||||
INSERT INTO mesh.delivery_queue (broker_message_id, recipient_pubkey, ...)
|
||||
SELECT $msg_id, member_pubkey, ...
|
||||
FROM mesh.topic_subscription
|
||||
WHERE topic = $dest_ref AND mesh_id = $mesh_id;
|
||||
|
||||
INSERT INTO mesh.mention_index (broker_message_id, mentioned_pubkey, ...)
|
||||
SELECT $msg_id, mention_pubkey, ...
|
||||
FROM unnest($mention_list);
|
||||
|
||||
COMMIT;
|
||||
|
||||
-- After COMMIT, async workers consume delivery_queue and update
|
||||
-- search indexes, audit logs, rate-limit counters, etc.
|
||||
```
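The claim-then-inspect flow above can be exercised locally with SQLite. This is a simplified sketch, not the broker's code: the broker targets Postgres, so `INSERT OR IGNORE` stands in for `ON CONFLICT ... DO NOTHING`, `BEGIN IMMEDIATE` stands in for row locking, the tables are reduced to a few columns, and `broker_accept`, the `topic_exists` flag, and the `404` code are hypothetical.

```python
import sqlite3
from typing import Optional, Tuple

# Reduced, hypothetical versions of the broker tables.
SCHEMA = """
CREATE TABLE client_message_dedupe (
  mesh_id             TEXT NOT NULL,
  client_message_id   TEXT NOT NULL,
  broker_message_id   TEXT NOT NULL,
  request_fingerprint TEXT NOT NULL,
  PRIMARY KEY (mesh_id, client_message_id)
);
CREATE TABLE topic_message (id TEXT PRIMARY KEY, mesh_id TEXT, body TEXT);
"""


def broker_accept(db: sqlite3.Connection, mesh_id: str, client_id: str,
                  msg_id: str, fingerprint: str, body: str,
                  topic_exists: bool) -> Tuple[int, Optional[str]]:
    """Return (status, broker_message_id). Assumes B1 already passed.

    Expects a connection opened with isolation_level=None.
    """
    db.execute("BEGIN IMMEDIATE")
    try:
        # Try to claim the idempotency key.
        db.execute(
            "INSERT OR IGNORE INTO client_message_dedupe VALUES (?, ?, ?, ?)",
            (mesh_id, client_id, msg_id, fingerprint),
        )
        # Inspect the row that is actually there now (ours or an earlier one).
        row = db.execute(
            "SELECT broker_message_id, request_fingerprint "
            "FROM client_message_dedupe "
            "WHERE mesh_id = ? AND client_message_id = ?",
            (mesh_id, client_id),
        ).fetchone()
        if row[0] != msg_id:
            # Duplicate: nothing of ours was written, so just roll back.
            db.execute("ROLLBACK")
            return (200, row[0]) if row[1] == fingerprint else (409, None)
        if not topic_exists:
            # B2 failure: the rollback removes the dedupe row we just claimed.
            db.execute("ROLLBACK")
            return (404, None)
        # B3: in-tx side effects commit together with the dedupe row.
        db.execute("INSERT INTO topic_message VALUES (?, ?, ?)",
                   (msg_id, mesh_id, body))
        db.execute("COMMIT")
        return (201, msg_id)
    except Exception:
        if db.in_transaction:
            db.execute("ROLLBACK")
        raise
```

After a B2 rejection the same `client_message_id` can be accepted later, since the rollback left no dedupe row behind.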
|
||||
|
||||
#### 4.7.3 Orphan check — same as v7 §4.7.3
|
||||
|
||||
Extended to cover the full side-effect inventory, verifying that the in-tx items are mutually consistent.
|
||||
|
||||
### 4.8 Outbox max-age math — unchanged from v7 §4.8
|
||||
|
||||
Min `dedupe_retention_days = 7`; derived `max_age_hours = window -
|
||||
safety_margin` strictly < window; safety_margin floor 24h.
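As a worked example of this arithmetic, the sketch below assumes that `window` is `dedupe_retention_days` expressed in hours; the function name and its error messages are illustrative, not taken from v7 §4.8.

```python
def derive_max_age_hours(dedupe_retention_days: int,
                         safety_margin_hours: int = 24) -> int:
    """Derive the outbox max age from the broker's dedupe retention window."""
    if dedupe_retention_days < 7:
        raise ValueError("dedupe_retention_days must be at least 7")
    if safety_margin_hours < 24:
        raise ValueError("safety_margin must be at least 24h")
    window_hours = dedupe_retention_days * 24  # assumption: window in hours
    max_age_hours = window_hours - safety_margin_hours
    if max_age_hours <= 0:
        raise ValueError("safety_margin consumes the whole window")
    return max_age_hours  # strictly less than window_hours by construction


# Minimum configuration: 7 days * 24 h = 168 h window, minus the 24 h floor.
print(derive_max_age_hours(7))  # 144
```

With the minimum values, a row therefore ages out of the outbox a full day before the broker could sweep its dedupe row.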
|
||||
|
||||
### 4.9 Inbox schema — unchanged from v3 §4.5
|
||||
|
||||
### 4.10 Crash recovery — unchanged from v3 §4.6
|
||||
|
||||
### 4.11 Failure modes — `aborted` semantics added (v8)
|
||||
|
||||
- **IPC accept fingerprint-mismatch on duplicate id** (any state):
|
||||
returns 409 with `conflict` field per §4.5.1. Caller must use a new id.
|
||||
- **IPC accept against `aborted` row, fingerprint match**: returns 409
|
||||
per §4.5.1 (NEW v8). Caller must use a new id; the old id is
|
||||
permanently retired.
|
||||
- **Outbox row stuck in `dead`**: operator runs `outbox requeue` per
|
||||
§4.6.3; old id stays in `aborted`, new id is fresh.
|
||||
- **Broker fingerprint mismatch on retry**: as v6/v7. Daemon marks
|
||||
`dead`; operator requeue path.
|
||||
- **Daemon retry after dedupe row hard-deleted by broker retention
|
||||
sweep**: cannot happen unless operator overrode `max_age_hours`.
|
||||
- **Broker phase B2 rejection on retry**: same id, same fingerprint,
|
||||
but the B2 condition has changed (e.g. the destination topic no longer exists).
|
||||
Daemon receives 4xx → marks `dead`. Operator can `requeue` once
|
||||
conditions clear.
|
||||
- **Atomicity violation found by orphan check**: alerts ops.
|
||||
|
||||
---
|
||||
|
||||
## 5-13. — unchanged from v4
|
||||
|
||||
## 14. Lifecycle — unchanged from v5 §14
|
||||
|
||||
## 15. Version compat — unchanged from v7 §15
|
||||
|
||||
## 16. Threat model — unchanged
|
||||
|
||||
---
|
||||
|
||||
## 17. Migration — v8 outbox columns + broker phase B2 (v8)
|
||||
|
||||
Broker side, deploy order: same as v7 §17, with one addition:
|
||||
- Step 4.5: explicitly split broker accept into Phase B1 (pre-dedupe
|
||||
validation, returns 4xx without writing) and Phase B2/B3 (within the
|
||||
accept transaction). Implementation: refactor handler to validate
|
||||
Phase B1 conditions before opening the DB transaction.
|
||||
|
||||
Daemon side:
|
||||
- Outbox schema gains `aborted_at`, `aborted_by`, `superseded_by`
|
||||
columns and the `aborted` enum value (§4.5.2). Migration applies via
|
||||
`INSERT INTO new SELECT * FROM old` recreation if needed; v0.9.0 is
|
||||
greenfield.
|
||||
- IPC accept switches to `BEGIN IMMEDIATE` for SQLite serialization
|
||||
(§4.5.1 step 3).
|
||||
- IPC accept handles `aborted` rows per §4.5.1 (always 409).
|
||||
- `claudemesh daemon outbox requeue` always mints a fresh
|
||||
`client_message_id`; never frees the old id. `--new-client-id <id>`
|
||||
and `--auto` are the only modes; the old `client_message_id`
|
||||
argument is removed.
|
||||
|
||||
---
|
||||
|
||||
## What changed v8 → v9 (codex round-8 actionable items)
|
||||
|
||||
| Codex r8 item | v9 fix | Section |
|
||||
|---|---|---|
|
||||
| Cross-layer ID-consumed authority contradiction | Two-layer model: daemon-consumed iff outbox row; broker-consumed iff dedupe row committed; daemon-mediated callers see only daemon-layer authority | §4.1, §4.6.1, §4.6.2 |
|
||||
| Rate-limit authority muddled (B2 vs async counters) | Rate limit moved to B1 via external atomic limiter (Redis-style INCR with TTL); DB rate-limit counters demoted to telemetry-only | §4.6.2, §4.6.4, §4.7.1 |
|
||||
| §4.1 broker guarantee fuzzy | Tightened: "dedupe row exists iff broker accept transaction committed (B3)" | §4.1, §4.6.2 |
|
||||
|
||||
(Earlier rounds' fixes preserved unchanged.)
|
||||
|
||||
---
|
||||
|
||||
## What needs review (round 9)
|
||||
|
||||
1. **Two-layer ID model (§4.1, §4.6.1)** — is the daemon-vs-broker
|
||||
authority split clear, or does it create more confusion for
|
||||
operators reading "consumed" in different contexts? Should we use
|
||||
different verbs (e.g. "claimed" at daemon, "committed" at broker)?
|
||||
2. **Rate-limit external limiter (§4.6.4)** — is "atomic external
|
||||
limiter" specified concretely enough? Is the over-counting on
|
||||
limiter-accepted-then-B2-rejected acceptable?
|
||||
3. **B2 contents after rate-limit move** — B2 now only has
|
||||
`destination_ref existence`. Worth keeping a B2 phase at all, or
|
||||
collapse into B1+B3?
|
||||
4. **Anything else still wrong?** Read it as if you were going to
|
||||
operate this for a year.
|
||||
|
||||
Three options:
|
||||
- **(a) v9 is shippable**: lock the spec, start coding the frozen core.
|
||||
- **(b) v10 needed**: list the must-fix items.
|
||||
- **(c) the architecture itself is wrong**: what would you do differently?
|
||||
|
||||
Be ruthless.
|
||||
374
.artifacts/shipped/2026-05-03-daemon-final-spec.md
Normal file
@@ -0,0 +1,374 @@
|
||||
# `claudemesh daemon` — Final Spec
|
||||
|
||||
> Context for the reviewer: claudemesh is a peer mesh runtime for Claude Code
|
||||
> sessions. Existing infrastructure: a managed broker (`wss://ic.claudemesh.com/ws`,
|
||||
> Bun + Drizzle + Postgres) that handles routing, presence, topics, files,
|
||||
> per-mesh apikeys, etc. There is also a CLI (`claudemesh-cli`, npm) and a web
|
||||
> dashboard. Each session today is short-lived: `claudemesh launch` opens a WS,
|
||||
> stays up while Claude Code is running, then closes. Server-side
|
||||
> integrations (RunPod handlers, Temporal workers, CI jobs) currently have no
|
||||
> first-class way to participate in a mesh — they'd either curl an apikey-auth
|
||||
> REST endpoint (one-way) or shell out to the CLI cold-path (slow, no inbound).
|
||||
>
|
||||
> This spec proposes a `claudemesh daemon` mode that turns any host (laptop,
|
||||
> server, RunPod pod) into a persistent, addressable peer with a local IPC
|
||||
> surface that apps can talk to without dealing with the broker directly.
|
||||
>
|
||||
> The user has explicitly said: pre-launch, no users yet, optimize for the
|
||||
> right architecture not the smallest first cut. They want the FINAL spec, not
|
||||
> phased MVPs.
|
||||
|
||||
---
|
||||
|
||||
## 1. Process model
|
||||
|
||||
**One daemon per (user, mesh)**. Persistent. Survives reboots via OS supervisor (systemd / launchd / SCM). Serves multiple local apps concurrently.
|
||||
|
||||
```
|
||||
~/.claudemesh/daemon/<mesh-slug>/
|
||||
pid 0600 pidfile, cleaned on shutdown
|
||||
sock 0600 unix domain socket (primary IPC)
|
||||
http.port 0644 auto-allocated loopback port (Windows / Docker fallback)
|
||||
keypair.json 0600 persistent ed25519 + x25519 — daemon identity
|
||||
config.toml 0644 user-editable runtime tuning
|
||||
outbox.db 0600 SQLite — durable outbound queue + dedupe ledger
|
||||
inbox.db 0600 SQLite — 30-day inbound history, FTS-indexed
|
||||
daemon.log 0644 JSON-lines, rotating (100 MB / 14 d)
|
||||
hooks/ 0700 user-managed event scripts
|
||||
```
|
||||
|
||||
Single binary. No external runtime beyond the existing CLI dependencies. The daemon *is* the CLI in long-running mode: `claudemesh daemon up` is a subcommand of the same binary.
|
||||
|
||||
## 2. Identity — persistent member, not ephemeral session
|
||||
|
||||
The daemon mints a stable ed25519 + x25519 keypair on first startup, stored in `keypair.json`. Registers with the broker as a **persistent member** — same identity across restarts, reconnects, host migrations. `runpod-worker-3` is `runpod-worker-3` forever, until you `claudemesh daemon reset` or revoke the keypair.
|
||||
|
||||
`--name` is taken at first `daemon up`; subsequent runs read the keypair file and ignore `--name` unless `--rename` is passed (which produces a `member_renamed` event the broker propagates to peers).
|
||||
|
||||
This is the default. It's the right thing for servers. There is no `--ephemeral` mode.
|
||||
|
||||
## 3. IPC surface — single versioned API, three transports
|
||||
|
||||
**Transports**, all serving identical JSON:
|
||||
- **UDS** at `~/.claudemesh/daemon/<slug>/sock` (primary, default)
|
||||
- **TCP loopback** on auto-allocated port written to `http.port` (Docker / Windows clients)
|
||||
- **Server-Sent Events** stream at `GET /v1/events` for push (real-time inbound)
|
||||
|
||||
**No auth on local IPC.** Trust boundary is the OS — UDS is mode 0600, TCP listens on 127.0.0.1 only. If you can reach the socket, you're already running as the right user; the daemon's `keypair.json` is also reachable, so adding a token would be theatre.
|
||||
|
||||
**Endpoint surface — exactly mirrors CLI verbs:**
|
||||
|
||||
```
|
||||
# messaging
|
||||
POST /v1/send {to, message, priority?, meta?, replyToId?}
|
||||
POST /v1/topic/post {topic, message, priority?, mentions?}
|
||||
POST /v1/topic/subscribe {topic}
|
||||
GET /v1/topic/list
|
||||
GET /v1/inbox ?since=<iso>&topic=<n>&from=<peer>&limit=<n>
|
||||
POST /v1/broadcast {message, scope: "*"|"@group"|...}
|
||||
|
||||
# peers + presence
|
||||
GET /v1/peers ?mesh=<slug>
|
||||
POST /v1/profile {summary?, status?, visible?, avatar?, ...}
|
||||
POST /v1/groups/join {name, role?}
|
||||
POST /v1/groups/leave {name}
|
||||
|
||||
# state, memory, vector, graph — full mesh-services platform
|
||||
POST /v1/state/set {key, value, scope?: "mesh"|"member"}
|
||||
GET /v1/state/get ?key=...
|
||||
GET /v1/state/list
|
||||
POST /v1/memory/remember {content, tags?}
|
||||
GET /v1/memory/recall ?q=<query>
|
||||
POST /v1/vector/store {collection, text, metadata?}
|
||||
GET /v1/vector/search ?collection=<c>&q=<query>&limit=<n>
|
||||
POST /v1/graph/query {cypher, params?}
|
||||
|
||||
# files
|
||||
POST /v1/file/share {path, to?, message?, persistent?}
|
||||
GET /v1/file/get ?id=<fileId>&out=<path>
|
||||
GET /v1/file/list
|
||||
|
||||
# tasks + scheduling
|
||||
POST /v1/task/create {title, assignee?, priority?, tags?}
|
||||
POST /v1/task/claim {id}
|
||||
POST /v1/task/complete {id, result?}
|
||||
POST /v1/scheduling/remind {at|in|cron, message, to?}
|
||||
|
||||
# skills + MCP services (full peer participation)
|
||||
POST /v1/skill/deploy {path}
|
||||
POST /v1/skill/share {name, manifest}
|
||||
POST /v1/mcp/register {server_name, description, tools, transport}
|
||||
POST /v1/mcp/call {server, tool, args}
|
||||
|
||||
# events (push)
|
||||
GET /v1/events text/event-stream
|
||||
events: message, peer_join, peer_leave, file_shared, task_assigned,
|
||||
state_changed, mcp_deployed, skill_shared, hook_executed,
|
||||
disconnect, reconnect
|
||||
|
||||
# control plane
|
||||
GET /v1/health {connected, lag_ms, queue_depth, mesh, member_pubkey, uptime_s}
|
||||
GET /v1/metrics Prometheus exposition
|
||||
POST /v1/heartbeat {} (caller asserts it's alive — daemon may set status="working")
|
||||
```
|
||||
|
||||
Every CLI verb the platform offers has a daemon endpoint. No second-class features. Apps written against the daemon get the same surface as Claude Code itself.
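For the push side, a client only needs a small Server-Sent Events parser. The sketch below follows the standard SSE line format (`event:` and `data:` fields, blank line to dispatch); that each `data` payload is a JSON object is an assumption, since the spec does not define the event body, and `Event` and `parse_sse` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, List


@dataclass
class Event:
    kind: str  # SSE "event:" field, e.g. "message" or "peer_join"
    data: Any  # decoded JSON payload (assumed format)


def parse_sse(lines: Iterable[str]) -> Iterator[Event]:
    """Turn an iterable of text/event-stream lines into Event objects."""
    kind: str = "message"  # SSE default when no "event:" field is sent
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield Event(kind, json.loads("\n".join(data)))
            kind, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # SSE strips one leading space from the value
        if field == "event":
            kind = value
        elif field == "data":
            data.append(value)
```

The parser accepts any iterable of lines, so it can be fed from a socket file object reading `GET /v1/events` or from a recorded fixture.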
|
||||
|
||||
## 4. Outbound — exactly-once via SQLite + idempotency keys
|
||||
|
||||
Sends route through `outbox.db` first, then to the broker. Schema:
|
||||
|
||||
```sql
|
||||
CREATE TABLE outbox (
|
||||
id TEXT PRIMARY KEY, -- ulid
|
||||
idempotency_key TEXT UNIQUE, -- caller-provided or autogen
|
||||
payload BLOB NOT NULL, -- serialized envelope
|
||||
enqueued_at INTEGER NOT NULL,
|
||||
attempts INTEGER DEFAULT 0,
|
||||
next_attempt_at INTEGER NOT NULL,
|
||||
status TEXT CHECK(status IN ('pending','inflight','done','dead')),
|
||||
last_error TEXT,
|
||||
delivered_at INTEGER
|
||||
);
|
||||
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
|
||||
```
|
||||
|
||||
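- WAL mode, `synchronous=NORMAL`: durable enough for this queue (the database stays consistent after a crash, though the most recent commits can be lost on power loss), ~10k inserts/sec.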
|
||||
- Caller-supplied `Idempotency-Key` header dedupes retries (24h window).
|
||||
- Exponential backoff with jitter; 7-day max retention; `dead` rows surface in `claudemesh daemon outbox --failed`.
|
||||
- `delivered_at` set when broker ACKs the queue row, not when daemon sends — gives true at-least-once with explicit dedupe → effectively exactly-once.
|
||||
|
||||
## 5. Inbound — durable history with FTS
|
||||
|
||||
Every inbound message is written to `inbox.db` before any hook fires:
|
||||
|
||||
```sql
|
||||
CREATE VIRTUAL TABLE inbox USING fts5(
|
||||
message_id UNINDEXED, mesh UNINDEXED, topic, sender_pubkey UNINDEXED,
|
||||
sender_name, body, meta, received_at UNINDEXED, replied_to_id UNINDEXED
|
||||
);
|
||||
```
|
||||
|
||||
- 30-day rolling retention (configurable).
|
||||
- `claudemesh daemon search "OOM"` queries the FTS index (instant, offline-capable).
|
||||
- Apps that connect mid-stream replay history via `?since=<iso>`.
|
||||
- Exposed in metrics: `cm_daemon_inbox_rows`, `cm_daemon_inbox_bytes`.
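A minimal sketch of how the search and `?since=` replay could sit on top of this table, using Python's bundled `sqlite3`. It assumes the SQLite build includes FTS5 and that `received_at` holds ISO-8601 UTC strings in one consistent format, so that string comparison matches time order; `open_inbox` and `search` are hypothetical helpers.

```python
import sqlite3
from typing import List, Optional, Tuple


def open_inbox(path: str = ":memory:") -> sqlite3.Connection:
    """Create the FTS5 inbox table from the schema above (requires FTS5)."""
    db = sqlite3.connect(path)
    db.execute("""
        CREATE VIRTUAL TABLE IF NOT EXISTS inbox USING fts5(
          message_id UNINDEXED, mesh UNINDEXED, topic,
          sender_pubkey UNINDEXED, sender_name, body, meta,
          received_at UNINDEXED, replied_to_id UNINDEXED
        )""")
    return db


def search(db: sqlite3.Connection, query: str,
           since_iso: Optional[str] = None,
           limit: int = 20) -> List[Tuple[str, str, str]]:
    """Full-text search, optionally restricted to rows received since a time."""
    sql = "SELECT message_id, topic, body FROM inbox WHERE inbox MATCH ?"
    params: list = [query]
    if since_iso is not None:
        # Assumption: ISO-8601 UTC strings, so lexical order is time order.
        sql += " AND received_at >= ?"
        params.append(since_iso)
    sql += " ORDER BY rank LIMIT ?"  # best FTS5 matches first
    params.append(limit)
    return db.execute(sql, params).fetchall()
```

Since `received_at` is `UNINDEXED`, the time filter is applied to the rows the full-text match already selected rather than through an index.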
|
||||
|
||||
## 6. Hooks — first-class scripted reactions
|
||||
|
||||
Hooks turn the daemon from a passive relay into an autonomous peer. Files in `hooks/`:
|
||||
|
||||
```
|
||||
hooks/
|
||||
on-message.sh every inbound message (DM + topic)
|
||||
on-dm.sh DMs only
|
||||
on-mention.sh when @<my-name> appears anywhere
|
||||
on-topic-<name>.sh a specific topic (e.g. on-topic-alerts.sh)
|
||||
on-file-share.sh file shared with me
|
||||
on-task-assigned.sh task assigned to me
|
||||
on-disconnect.sh WS dropped (informational)
|
||||
on-reconnect.sh reconnected (informational)
|
||||
on-startup.sh daemon up
|
||||
pre-send.sh filter / mutate outbound (last gate)
|
||||
```
|
||||
|
||||
**Contract:**
|
||||
- Stdin: full event JSON.
|
||||
- Stdout (if non-empty, JSON object): used as a structured response. For inbound messages, `{reply: "..."}` posts a reply automatically.
|
||||
- Exit 0 = success; a non-zero exit is logged and counted but not retried.
|
||||
- Timeout: 30s default, override via `# claudemesh:timeout=120s` shebang comment.
|
||||
- Env: `PATH=/usr/bin:/bin`, `CLAUDEMESH_MESH=<slug>`, `CLAUDEMESH_MEMBER=<pubkey>`, `CLAUDEMESH_HOME=<config-dir>`, plus the daemon's own broker session token in `CLAUDEMESH_TOKEN` so the script can call `claudemesh send` without re-authenticating.
|
||||
- Concurrent execution: bounded pool (default 8) — overflow queues, never blocks the WS reader.
|
||||
|
||||
This makes a server a real participant: it auto-replies to "@worker-3 status?", auto-acks file shares, auto-claims tasks, escalates errors to oncall — all configured by dropping shell scripts in a directory.
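The contract above could be honoured by a runner along these lines. This is a sketch, not the daemon's implementation: `parse_timeout` and `run_hook` are hypothetical names, the timeout comment is looked for in the first 4 KiB of the script, and logging, metrics, and the bounded worker pool are omitted.

```python
import json
import re
import subprocess
from typing import Any, Dict, Optional

_TIMEOUT_RE = re.compile(r"^#\s*claudemesh:timeout=(\d+)s\s*$", re.MULTILINE)


def parse_timeout(script_text: str, default_s: int = 30) -> int:
    """Read a '# claudemesh:timeout=120s' override, else use the default."""
    match = _TIMEOUT_RE.search(script_text)
    return int(match.group(1)) if match else default_s


def run_hook(path: str, event: Dict[str, Any],
             extra_env: Dict[str, str]) -> Optional[Dict[str, Any]]:
    """Run one hook; return its JSON-object response, or None."""
    with open(path, "rb") as fh:
        head = fh.read(4096).decode("utf-8", errors="replace")
    try:
        proc = subprocess.run(
            [path],
            input=json.dumps(event).encode(),  # stdin: full event JSON
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            env={"PATH": "/usr/bin:/bin", **extra_env},
            timeout=parse_timeout(head),
        )
    except subprocess.TimeoutExpired:
        return None  # the child is killed on timeout; no retry
    if proc.returncode != 0 or not proc.stdout.strip():
        return None  # failure or empty stdout: nothing to act on
    try:
        out = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return None
    # Only a JSON object counts as a structured response, e.g. {"reply": ...}.
    return out if isinstance(out, dict) else None
```

A caller would pass values such as `CLAUDEMESH_MESH` and `CLAUDEMESH_TOKEN` through `extra_env` and post `result["reply"]` when the returned object contains that key.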
|
||||
|
||||
## 7. Multi-mesh — one daemon per mesh, coordinated by a supervisor
|
||||
|
||||
Multi-mesh is handled by **one daemon per mesh** (no shared state, no cross-mesh leakage), coordinated by:
|
||||
|
||||
```
|
||||
claudemesh daemon up --all # spawns one daemon per joined mesh
|
||||
claudemesh daemon down --all
|
||||
claudemesh daemon status --all # JSON table of every daemon
|
||||
claudemesh daemon ps # alias of status
|
||||
```
|
||||
|
||||
CLI verbs without `--mesh` keep their existing aggregator routing (`/v1/me/...`); in addition, each daemon contributes its inbound state to the aggregator.
|
||||
|
||||
## 8. Auto-routing — every CLI verb prefers the daemon
|
||||
|
||||
The CLI's `withMesh` helper is replaced by `viaDaemonOrMesh`:
|
||||
|
||||
1. Read `~/.claudemesh/daemon/<slug>/pid`.
|
||||
2. If alive → call the daemon's UDS endpoint.
|
||||
3. Else → cold path (existing `withMesh` flow, opens its own short-lived WS).
|
||||
|
||||
Transparent to the user. `claudemesh send X "msg"` from a script becomes a sub-millisecond local UDS call when a daemon is up, instead of a 1-second broker handshake.
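The routing decision in the three steps above can be sketched as follows. This is POSIX-only (it probes liveness with signal 0, which is not safe on Windows), and `pid_alive` and `choose_route` are hypothetical names, not the CLI's `viaDaemonOrMesh`.

```python
import os
from pathlib import Path
from typing import Optional


def pid_alive(pid: int) -> bool:
    """Probe a process with signal 0; nothing is actually delivered (POSIX)."""
    if pid <= 0:
        return False  # 0 and negatives address process groups, not a daemon
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def choose_route(mesh_slug: str, home: Optional[Path] = None) -> str:
    """Return "daemon" if a live daemon pid is recorded, else "cold"."""
    base = (home or Path.home()) / ".claudemesh" / "daemon" / mesh_slug
    try:
        pid = int((base / "pid").read_text().strip())
    except (OSError, ValueError):
        return "cold"  # no pidfile, unreadable, or garbage contents
    return "daemon" if pid_alive(pid) else "cold"
```

A stale pidfile whose pid was recycled by an unrelated process would still report "daemon", so a real client would presumably fall back to the cold path if the UDS connection fails.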
|
||||
|
||||
## 9. Service installation
|
||||
|
||||
```bash
|
||||
claudemesh daemon install-service # writes systemd unit / launchd plist / Windows SCM service
|
||||
claudemesh daemon uninstall-service
|
||||
```
|
||||
|
||||
Generated unit:
|
||||
- `Restart=on-failure`, `RestartSec=5s`
|
||||
- `MemoryMax=512M` (the daemon should rarely approach this)
|
||||
- `StandardOutput/Error=journal`
|
||||
- For systemd, runs as the invoking user (no root needed).
|
||||
|
||||
`claudemesh install` (the existing setup verb) gains an opt-in prompt: *"Install as a background service that always runs?"* For interactive users this is opt-in; for `--yes` it defaults to yes on Linux servers (detected by absence of TTY + presence of systemd).
|
||||
|
||||
## 10. Observability
|
||||
|
||||
```
|
||||
claudemesh daemon status human-readable: connected, lag, queue, hooks fired
|
||||
claudemesh daemon status --json machine-readable
|
||||
claudemesh daemon logs [-f] tail daemon.log
|
||||
claudemesh daemon outbox pending sends + dead-letter queue
|
||||
claudemesh daemon inbox recent received messages (FTS-searchable)
|
||||
claudemesh daemon metrics prints /v1/metrics
|
||||
|
||||
# Prometheus counters/gauges:
|
||||
cm_daemon_connected{mesh} 0/1
|
||||
cm_daemon_reconnects_total{mesh,reason}
|
||||
cm_daemon_lag_ms{mesh} last broker round-trip
|
||||
cm_daemon_outbox_depth{mesh}
|
||||
cm_daemon_outbox_dead_total{mesh}
|
||||
cm_daemon_send_total{mesh,kind=topic|dm|broadcast,status}
|
||||
cm_daemon_recv_total{mesh,kind=topic|dm,from_type=peer|apikey|webhook}
|
||||
cm_daemon_hook_invocations_total{hook,exit}
|
||||
cm_daemon_hook_duration_seconds{hook} histogram
|
||||
cm_daemon_ipc_request_total{endpoint,status}
|
||||
cm_daemon_ipc_duration_seconds{endpoint} histogram
|
||||
```
|
||||
|
||||
Tracing: optional OpenTelemetry export (`config.toml: [metrics] otel_endpoint = ...`), which emits spans for every IPC request and downstream broker call.
|
||||
|
||||
## 11. SDKs — three, all thin
|
||||
|
||||
The daemon's HTTP+UDS surface is the API; SDKs are convenience wrappers, not new surfaces.
|
||||
|
||||
**Python** (single file, stdlib only — no `requests`, no `aiohttp`):
|
||||
```python
|
||||
from claudemesh import Daemon
|
||||
cm = Daemon() # auto-discovers running daemon for current cwd's mesh
|
||||
cm.send("@oncall", "OOM detected")
|
||||
cm.topic.post("alerts", "build done", mentions=["alice"])
|
||||
for evt in cm.events(): # SSE stream, blocking iterator
|
||||
if evt.kind == "message" and "@me" in evt.body:
|
||||
cm.send(evt.from_pubkey, "got it, on it")
|
||||
```
|
||||
|
||||
**Go** (single file, stdlib only — no third-party deps):
|
||||
```go
|
||||
cm, _ := claudemesh.Connect()
|
||||
cm.Send(ctx, "@oncall", "OOM detected")
|
||||
for evt := range cm.Events(ctx) { ... }
|
||||
```
|
||||
|
||||
**TypeScript / Node** (zero runtime deps, ESM only):
|
||||
```ts
|
||||
import { Daemon } from "@claudemesh/daemon-client";
|
||||
const cm = await Daemon.connect();
|
||||
await cm.send("@oncall", "OOM detected");
|
||||
for await (const evt of cm.events()) { ... }
|
||||
```
|
||||
|
||||
Each is ~300 lines. All three are versioned in lockstep with the daemon's `/v1` surface. A `/v2` surface (when it eventually exists) keeps `/v1` alive indefinitely — old SDKs never break.
|
||||
|
||||
## 12. Security model — explicit boundaries
|
||||
|
||||
| Boundary | Trust | Mechanism |
|
||||
|---|---|---|
|
||||
| App ↔ Daemon (local) | OS user | UDS 0600, TCP loopback only |
|
||||
| Daemon ↔ Broker | Mesh keypair | WSS + ed25519 hello sig + crypto_box DM envelopes + per-topic keys (existing model) |
|
||||
| Hook ↔ Daemon (env) | OS user + filesystem | `hooks/` dir mode 0700; only files there execute; no remote install |
|
||||
| Daemon ↔ Disk | OS user | All daemon files mode 0600/0644 under `~/.claudemesh/daemon/` |
|
||||
|
||||
**No new attack surface introduced by the daemon** — apps that previously could read `~/.claudemesh/config.json` directly already had full mesh access; the daemon just adds an IPC layer on top.
|
||||
|
||||
**Hook RCE consideration**: a peer cannot install a hook on your daemon. Hooks are files YOU put on disk. Inbound messages can only trigger hooks that already exist with content you wrote. The broker has no path to your hook directory.
|
||||
|
||||
## 13. Configuration — `config.toml`
|
||||
|
||||
```toml
[daemon]
mesh = "prod"  # set on `daemon up --mesh`; immutable thereafter
display_name = "runpod-worker-3"
log_level = "info"

[ipc]
http_port = 0  # 0 = auto-allocate
http_bind = "127.0.0.1"  # never 0.0.0.0; explicit if you know what you're doing
uds_mode = "0600"

[outbox]
max_queue_size = 10000
max_age_hours = 168  # 7 days
fsync_mode = "batched_50ms"  # 'strict' | 'batched_50ms' | 'off'

[inbox]
retention_days = 30
fts_enabled = true

[reconnect]
initial_backoff_ms = 500
max_backoff_ms = 30000
backoff_multiplier = 2.0
jitter_pct = 25

[hooks]
enabled = true
concurrency = 8
default_timeout_s = 30

[metrics]
prometheus_enabled = true
otel_endpoint = ""  # empty = disabled
```
|
||||
|
||||
User-editable. `claudemesh daemon reload` re-reads it without dropping the WS.
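
As a rough illustration of how the `[reconnect]` values might combine, here is a minimal sketch. The function name `nextBackoffMs`, the camel-cased field names, and the choice of symmetric jitter applied after the cap are all assumptions; the spec only names the four config keys.

```typescript
// Hypothetical mirror of the [reconnect] table in config.toml.
interface ReconnectConfig {
  initialBackoffMs: number;
  maxBackoffMs: number;
  backoffMultiplier: number;
  jitterPct: number;
}

const reconnectDefaults: ReconnectConfig = {
  initialBackoffMs: 500,
  maxBackoffMs: 30000,
  backoffMultiplier: 2.0,
  jitterPct: 25,
};

// attempt is 0-based. rand returns a value in [0, 1) and is injectable for tests.
function nextBackoffMs(
  attempt: number,
  cfg: ReconnectConfig = reconnectDefaults,
  rand: () => number = Math.random,
): number {
  // Exponential growth, capped at max_backoff_ms.
  const base = Math.min(
    cfg.maxBackoffMs,
    cfg.initialBackoffMs * Math.pow(cfg.backoffMultiplier, attempt),
  );
  // Assumed jitter model: scale by a factor in [1 - pct, 1 + pct).
  const spread = cfg.jitterPct / 100;
  const factor = 1 + (rand() * 2 - 1) * spread;
  return Math.round(base * factor);
}
```

Under this model the jittered delay can exceed `max_backoff_ms` by up to `jitter_pct`; clamping after jitter would be an equally plausible reading of the config.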
|
||||
|
||||
## 14. Migration — what changes for existing users
|
||||
|
||||
- `claudemesh launch` (Claude Code mode) is unchanged. It optionally accepts `--via-daemon` to share the WS with a running daemon, but defaults to its own session, which preserves the "ephemeral session" semantics that Claude Code expects.
|
||||
- `claudemesh send X "msg"` and every other cold-path verb gets a transparent speedup when a daemon is up. No flag, no opt-in, no behavior difference visible to the user.
|
||||
- Existing `~/.claudemesh/config.json` is consumed unchanged by the daemon.
|
||||
- No DB migration. No broker changes. The daemon talks to the existing `/v1` HTTPS + WSS surfaces — broker doesn't even know whether a connection is `claudemesh launch` or `claudemesh daemon`.
|
||||
|
||||
---
|
||||
|
||||
## What needs review
|
||||
|
||||
Please critically review this spec for the v0.9.0 anchor. Specifically I want
|
||||
your hardest pushback on:
|
||||
|
||||
1. **Identity model** — persistent member by default vs ephemeral session. Have I
|
||||
missed a case where ephemeral is the right answer for a daemon? Should
|
||||
`--ephemeral` exist?
|
||||
2. **No-auth local IPC** — UDS 0600 + TCP loopback. Is "OS-trust is enough"
|
||||
actually safe in shared-tenant Linux (multi-user host, container
|
||||
side-channel)? Should there be a per-daemon token even locally?
|
||||
3. **SQLite outbox/inbox** — single writer, WAL, batched fsync. Is the
|
||||
exactly-once-via-idempotency-key claim defensible? What's the failure mode
|
||||
I'm glossing over?
|
||||
4. **Hooks fork-execing scripts** — RCE/data-exfil concerns I'm dismissing too
|
||||
easily? Should hooks be sandboxed (seccomp, no network, …)?
|
||||
5. **Auto-routing CLI verbs through daemon** — does this break composability
|
||||
with existing `claudemesh launch`? Race conditions when both are running?
|
||||
What about pidfile-stale detection?
|
||||
6. **One daemon per mesh** — why not one daemon serving all meshes, with mesh
|
||||
selection per-request? What does single-daemon actually buy beyond "fewer
|
||||
processes"?
|
||||
7. **The IPC surface duplicates the broker REST surface** — am I solving a
|
||||
problem the broker REST + per-mesh apikey already solves, with extra
|
||||
complexity for caching + queueing?
|
||||
8. **What's missing entirely** — auth boundaries, recovery flows, on-disk
|
||||
secret rotation, anything else a production daemon shipped with this spec
|
||||
would lack?
|
||||
|
||||
Score the spec on each axis: 1 = serious flaw, 5 = sound. Then list the
|
||||
top 3 changes you'd insist on before I write any code. Be ruthless — pre-launch
|
||||
window means I can break anything.
|
||||
@@ -0,0 +1,218 @@
|
||||
# `claudemesh daemon` — broker-hardening followups
|
||||
|
||||
> **Purpose**: refinements found during the v6 → v10 codex review series
|
||||
> that are real improvements but **not** v0.9.0 blockers. The
|
||||
> implementation target is `2026-05-03-daemon-spec-v0.9.0.md`. This
|
||||
> document lists what was deferred, why, and the trigger that promotes
|
||||
> each item to "must-do."
|
||||
>
|
||||
> **Background**: codex reviewed the daemon spec across 9 rounds (v1
|
||||
> through v10). Rounds 1–4 found load-bearing architectural issues
|
||||
> (identity, IPC auth, exactly-once lie, hook tokens, rotation, etc.).
|
||||
> Rounds 5–9 found progressively finer correctness issues inside one
|
||||
> subsystem (broker idempotency mechanics). v6 closed the architectural
|
||||
> review; v7–v10 are increasingly fine-grained idempotency-correctness
|
||||
> shavings on the same layer. Pre-launch (no users) doesn't need v7–v10
|
||||
> level rigor. We pulled the cheap wins into v0.9.0; the rest waits.
|
||||
|
||||
---
|
||||
|
||||
## 1. B0 dedupe fast-path before rate-limit (v10)
|
||||
|
||||
**What v10 said**: read `mesh.client_message_dedupe` BEFORE consulting
the rate limiter. A request whose id already exists (fingerprint match
or mismatch) returns immediately without touching the rate-limit budget.
|
||||
|
||||
**Why deferred**: v0.9.0 doesn't have meaningful rate-limit pressure on
|
||||
the daemon path. The split-brain failure (broker accepted, daemon
|
||||
believes failure due to rate-limit-rejection-on-retry) requires
|
||||
sustained saturated rate-limit windows, which don't exist pre-launch.
|
||||
|
||||
**Promote when**: any single mesh sees rate-limit rejections AND has
|
||||
daemon retries against committed ids. Telemetry to watch:
|
||||
`cm_broker_rate_limit_rejection_total` per mesh > 0 sustained.
|
||||
|
||||
**Implementation cost**: small — one indexed PK lookup before the
|
||||
existing limiter call. The work is mostly testing the race semantics.
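
The ordering can be sketched with in-memory stand-ins. Everything here is hypothetical (`AdmissionSketch`, the `Decision` shape, the token counter); it only illustrates that a known id is answered from the dedupe lookup and never reaches the limiter.

```typescript
type Decision =
  | { kind: "duplicate"; brokerMessageId: string }
  | { kind: "key_reused" }
  | { kind: "rate_limited" }
  | { kind: "proceed" };

interface DedupeRow {
  brokerMessageId: string;
  fingerprintHex: string;
}

// In-memory stand-ins for mesh.client_message_dedupe and the rate limiter.
class AdmissionSketch {
  readonly dedupe = new Map<string, DedupeRow>();
  private budget: number;

  constructor(budget: number) {
    this.budget = budget;
  }

  static key(meshId: string, clientMessageId: string): string {
    return JSON.stringify([meshId, clientMessageId]);
  }

  remainingBudget(): number {
    return this.budget;
  }

  admit(meshId: string, clientMessageId: string, fingerprintHex: string): Decision {
    // B0: primary-key lookup first. Known ids return without charging the limiter.
    const row = this.dedupe.get(AdmissionSketch.key(meshId, clientMessageId));
    if (row) {
      return row.fingerprintHex === fingerprintHex
        ? { kind: "duplicate", brokerMessageId: row.brokerMessageId }
        : { kind: "key_reused" };
    }
    // Only unknown ids consume rate-limit budget.
    if (this.budget <= 0) return { kind: "rate_limited" };
    this.budget -= 1;
    return { kind: "proceed" };
  }
}
```

With this ordering, a retry against a committed id still resolves to `duplicate` even when the window is saturated, which is the split-brain case the item describes.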
|
||||
|
||||
---
|
||||
|
||||
## 2. Lua-scripted idempotent rate limiter (v10)
|
||||
|
||||
**What v10 said**: limiter keyed by `(mesh_id, client_message_id,
|
||||
window_bucket)` so retries-within-window consume budget at most once.
|
||||
|
||||
**Why deferred**: depends on (1) above. Without B0 fast-path this is
|
||||
incremental complexity for marginal benefit. With B0 it becomes the
|
||||
right belt-and-suspenders fix for the rare race where two same-id
|
||||
requests both miss B0 simultaneously.
|
||||
|
||||
**Promote when**: B0 ships. Same trigger.
|
||||
|
||||
**Implementation cost**: medium — Lua script in Redis, careful TTL
|
||||
tuning, integration with existing limiter call sites.
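
A minimal in-memory analogue of the keyed limiter, assuming a fixed-window counter per mesh. The class and method names are hypothetical, and the real version would run as one atomic Lua script in Redis with TTLs rather than unbounded maps.

```typescript
class IdempotentWindowLimiter {
  // (mesh, id, bucket) tuples that have already been charged.
  private charged = new Set<string>();
  // (mesh, bucket) -> tokens consumed in that window.
  private used = new Map<string, number>();
  private readonly limitPerWindow: number;
  private readonly windowMs: number;

  constructor(limitPerWindow: number, windowMs: number) {
    this.limitPerWindow = limitPerWindow;
    this.windowMs = windowMs;
  }

  allow(meshId: string, clientMessageId: string, nowMs: number): boolean {
    const bucket = Math.floor(nowMs / this.windowMs);
    const chargeKey = JSON.stringify([meshId, clientMessageId, bucket]);
    // A retry inside the same window was already paid for.
    if (this.charged.has(chargeKey)) return true;

    const counterKey = JSON.stringify([meshId, bucket]);
    const used = this.used.get(counterKey) ?? 0;
    if (used >= this.limitPerWindow) return false;

    this.used.set(counterKey, used + 1);
    this.charged.add(chargeKey);
    return true;
  }
}
```

The sketch never evicts old buckets; that is the part the "careful TTL tuning" note above would cover.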
|
||||
|
||||
---
|
||||
|
||||
## 3. In-tx `mesh.mention_index` (v8)
|
||||
|
||||
**What v8 said**: mention-fanout index updates should commit inside the
|
||||
broker accept transaction so mention-search reads can never see a
|
||||
mention pointing at an uncommitted message.
|
||||
|
||||
**Why deferred**: the lag between accept-commit and async
|
||||
mention-indexer is small (single-digit milliseconds in expected
|
||||
deployment). Stale-read window during mention search is acceptable for
|
||||
v0.9.0; receivers learn of mentions via the `mention` event in their
|
||||
inbox stream regardless.
|
||||
|
||||
**Promote when**: real users complain about "I was mentioned but the
|
||||
mention search doesn't show it" with reproducible cases that don't
|
||||
self-heal in seconds.
|
||||
|
||||
**Implementation cost**: small — add `INSERT INTO mesh.mention_index`
|
||||
to the accept transaction. The async indexer becomes a backfill
|
||||
fallback rather than the primary path.
|
||||
|
||||
---
|
||||
|
||||
## 4. 4011 / 4012 close-code split (v6 §15.5)
|
||||
|
||||
**What v6 said**: split `4010 feature_unavailable` into three codes:
|
||||
`4010` (missing), `4011` (params invalid), `4012` (params below floor).
|
||||
|
||||
**Why deferred**: v0.9.0 ships single `4010` with structured
|
||||
`close_reason` JSON containing `kind`, `feature`, `detail`. Same
|
||||
diagnostic information, simpler protocol surface.
|
||||
|
||||
**Promote when**: ops tooling or external monitoring needs distinct
|
||||
status codes (e.g. PagerDuty rules that fire on 4012-only). Probably
|
||||
never; structured JSON is parseable.
|
||||
|
||||
**Implementation cost**: trivial — three constants and a switch on
|
||||
`close_reason.kind`.
|
||||
|
||||
---
|
||||
|
||||
## 5. Elaborate per-OS fingerprint precedence table (v8 §2.2.1)
|
||||
|
||||
**What v8 said**: comprehensive per-OS table covering Linux machine-id
|
||||
sources, macOS `IOPlatformUUID`, Windows `MachineGuid`, BSD
|
||||
`kern.hostuuid`, plus interface exclusion rules.
|
||||
|
||||
**Why deferred**: v0.9.0 ships with the simpler "machine-id ||
|
||||
first-stable-mac" rule from v6. Edge cases (cloud images,
|
||||
machine-id-not-readable, etc.) are documented when first hit.
|
||||
|
||||
**Promote when**: operators report fingerprint false-positives we can't
|
||||
explain from the v6 rule. Each report adds one row to the per-OS
|
||||
table.
|
||||
|
||||
**Implementation cost**: incremental — each OS-specific source is a
|
||||
small probe function with a fallback chain.
|
||||
|
||||
---
|
||||
|
||||
## 6. `request_fingerprint` schema-version-2 in feature negotiation (v6 §15.1)
|
||||
|
||||
**What v6 said**: `client_message_id_dedupe` feature parameters
|
||||
versioned independently. v0.9.0 ships at version 1 with a single
|
||||
`request_fingerprint: bool` flag.
|
||||
|
||||
**Why deferred**: we don't yet need parameterized fingerprint variants
|
||||
(different canonical forms, different hash algos). Version-bump path
|
||||
is documented; we'll use it when we add the second fingerprint mode.
|
||||
|
||||
**Promote when**: we want a fingerprint algo other than sha256/JCS
|
||||
(e.g. a faster hash, or a normalized canonical form).
|
||||
|
||||
**Implementation cost**: small — single feature-bit version bump
|
||||
following the documented pattern.
|
||||
|
||||
---
|
||||
|
||||
## 7. Force-expiry / quarantine semantics for `keypair-archive.json` (v8 §14.1.1)
|
||||
|
||||
**What v8 said**: `max_archived_keys` cap with force-expiry; explicit
|
||||
quarantine of malformed archive (`keypair-archive.json.malformed-<ts>`);
|
||||
duplicate `key_id` rejection; mode-mismatch warning behavior.
|
||||
|
||||
**Why deferred**: v0.9.0 ships the simpler v6 rule — drop expired
|
||||
entries on cleanup pass; refuse to start on malformed archive (loud,
|
||||
operator-actionable). The v8 elaboration makes archive corruption
|
||||
non-blocking, which is operationally nicer but trades off audit
|
||||
clarity.
|
||||
|
||||
**Promote when**: a real operator hits an archive corruption that
|
||||
shouldn't have brought the daemon down (e.g. mid-rotation crash leaves
|
||||
a partially-written archive).
|
||||
|
||||
**Implementation cost**: small — quarantine logic + one extra startup
|
||||
check.
|
||||
|
||||
---
|
||||
|
||||
## 8. Cross-language JCS conformance for `request_fingerprint` (v6 §4.4 round-6 question)
|
||||
|
||||
**What v6 asked**: does JCS work cross-language for
|
||||
`meta_canonical_json`? Python `json.dumps`, Go `encoding/json`, and JS
`JSON.stringify` all behave differently. Should we ship a vetted JCS lib
in each SDK?
|
||||
|
||||
**Why deferred from v0.9.0**: the daemon ships in TypeScript only for
|
||||
v0.9.0 (the `claudemesh-cli` package). Single-language JCS is trivial.
|
||||
SDK ports come post-v0.9.0.
|
||||
|
||||
**Promote when**: we ship the Python or Go SDK. Each SDK port gets a
|
||||
JCS conformance test against a corpus of envelopes.
|
||||
|
||||
**Implementation cost**: small per-language — a conformance fixture
|
||||
file and a unit test.
|
||||
|
||||
---
|
||||
|
||||
## Sprint 7 (this session) — what landed vs deferred
|
||||
|
||||
**Landed in code** (not yet deployed):
|
||||
- `packages/db/migrations/0028_message_queue_idempotency_fields.sql` adds
|
||||
nullable `client_message_id` and `request_fingerprint` columns to
|
||||
`mesh.message_queue` (additive, online-safe).
|
||||
- `apps/broker/src/broker.ts` — `queueMessage` and `drainForMember`
|
||||
thread the new columns through.
|
||||
- `apps/broker/src/index.ts` — `handleSend` picks them up from the
|
||||
daemon's wire envelope; outbound push echoes them back so receiving
|
||||
daemons can dedupe.
|
||||
- `apps/broker/src/types.ts` — `WSPushMessage` declares the optional
|
||||
fields.
|
||||
|
||||
**Deployment plan (not auto-applied)**:
|
||||
1. Apply migration against prod DB (the broker's filename-tracked
|
||||
migrator picks up `0028_*.sql` on next startup).
|
||||
2. Deploy the broker with the code changes via Coolify.
|
||||
3. Verify a daemon-originated send shows non-null `client_message_id`
|
||||
in `mesh.message_queue` afterwards.
|
||||
|
||||
**Still deferred** (full broker hardening):
|
||||
- `mesh.client_message_dedupe` table with `request_fingerprint BYTEA`
|
||||
and atomic accept transaction (spec §4.7).
|
||||
- Feature-bit advertisement on hello_ack of
|
||||
`client_message_id_dedupe` v1, with daemon-side enforcement (spec §15).
|
||||
- Partial unique index on `(mesh_id, client_message_id) WHERE client_message_id IS NOT NULL`.
|
||||
|
||||
These sit behind the same kind of trigger as the followups above: do them
when real users hit operational corners that the current approach doesn't cover.
|
||||
|
||||
---
|
||||
|
||||
## How to use this document
|
||||
|
||||
When picking up post-v0.9.0 work on the daemon:
|
||||
|
||||
1. Check whether any of the "promote when" triggers above have fired.
|
||||
2. If yes, consult the corresponding versioned spec (v6/v7/v8/v9/v10)
|
||||
for the full proposed change.
|
||||
3. Implement the lift, update `2026-05-03-daemon-spec-v0.9.0.md` to reflect the
|
||||
merge, and remove the item from this followups list.
|
||||
|
||||
The versioned specs live in `.artifacts/specs/` indefinitely as a
|
||||
review-trail audit.
|
||||
680
.artifacts/shipped/2026-05-03-daemon-spec-v0.9.0.md
Normal file
@@ -0,0 +1,680 @@
|
||||
# `claudemesh daemon` — Implementation spec v0.9.0
|
||||
|
||||
> **Implementation target.** Locked from the v1–v10 codex-reviewed spec
|
||||
> series. This document is what we build for v0.9.0 of the daemon.
|
||||
>
|
||||
> **Base**: v6 (the round where the architecture passed codex's
|
||||
> structural review — request_fingerprint, dedupe table, atomicity
|
||||
> contract, feature-bit negotiation, key archive format).
|
||||
>
|
||||
> **Pulled in from v7–v9**: six cheap, load-bearing fixes that close
|
||||
> real v0.9.0-era bugs (not future-scale concerns):
|
||||
>
|
||||
> 1. `aborted` outbox status + audit columns (operator recovery without
|
||||
> destroying audit trail) — v7 §4.5.2
|
||||
> 2. `BEGIN IMMEDIATE` for daemon-local SQLite serialization (v6's
|
||||
> `SELECT FOR UPDATE` is invalid SQLite anyway) — v7 §4.5.1
|
||||
> 3. Daemon-local IPC duplicate lookup table over outbox states ×
|
||||
> fingerprint match/mismatch — v8 §4.5.1
|
||||
> 4. Phase B1/B2/B3 broker validation split (the concept; we don't need
|
||||
> the elaborate phase tables) — v7 §4.6.2
|
||||
> 5. Side-effect inventory (in-tx vs async) as an implementation comment
|
||||
> block — v8 §4.7.1
|
||||
> 6. Two-layer ID model wording: daemon-consumed iff outbox row,
|
||||
> broker-consumed iff dedupe row — v9 §4.1
|
||||
>
|
||||
> **Deferred to broker-hardening followups** (see
|
||||
> `2026-05-03-daemon-spec-broker-hardening-followups.md` for the full list and
|
||||
> rationale): B0 dedupe fast-path, Lua-scripted idempotent rate
|
||||
> limiter, in-tx mention_index, 4011/4012 close-code split, per-OS
|
||||
> fingerprint precedence table, request-fingerprint schema-v2 in
|
||||
> feature negotiation. These are real improvements but not v0.9.0
|
||||
> blockers; they land as the broker matures.
|
||||
>
|
||||
> **Intent §0 unchanged from v2.**
|
||||
|
||||
---
|
||||
|
||||
## 0. Intent — unchanged, see v2 §0
|
||||
|
||||
---
|
||||
|
||||
## 1. Process model — unchanged from v3 §1 / v2 §1
|
||||
|
||||
---
|
||||
|
||||
## 2. Identity — unchanged from v5 §2
|
||||
|
||||
---
|
||||
|
||||
## 3. IPC surface — unchanged from v4 §3
|
||||
|
||||
---
|
||||
|
||||
## 4. Delivery contract — at-least-once with **request-fingerprinted** dedupe
|
||||
|
||||
Codex r5: dedupe must compare the *whole request shape*, not just
|
||||
`(mesh, client_message_id)`. Otherwise a caller who reuses an idempotency
|
||||
key with a different destination or body silently drops the new send and
|
||||
gets the old send's metadata back.
|
||||
|
||||
### 4.1 The contract (precise)
|
||||
|
||||
> **Two-layer ID rule** (from v9): a `client_message_id` is
|
||||
> **daemon-consumed** iff an outbox row exists for it; **broker-consumed**
|
||||
> iff a dedupe row exists in `mesh.client_message_dedupe`. The two layers
|
||||
> are independent: a daemon-consumed id may or may not be broker-consumed
|
||||
> (depending on whether the send reached broker commit). In v0.9.0 there
|
||||
> are no daemon-bypass clients, so for practical purposes "daemon-consumed"
|
||||
> is the operative rule.
|
||||
>
|
||||
> **Local guarantee**: each successful `POST /v1/send` returns a stable
|
||||
> `client_message_id`. The send is durably persisted to `outbox.db` before
|
||||
> the response returns. The daemon enforces request-fingerprint
|
||||
> idempotency at the IPC layer (§4.5).
|
||||
>
|
||||
> **Local audit guarantee**: a `client_message_id` once written to
|
||||
> `outbox.db` is never released. Operator recovery via `requeue` always
|
||||
> mints a fresh id; the old row stays in `aborted` for audit. There is
|
||||
> no daemon-side path to free a used id.
|
||||
>
|
||||
> **Broker guarantee**: the broker maintains a dedupe record per accepted
|
||||
> `(mesh_id, client_message_id)` in `mesh.client_message_dedupe`. Each
|
||||
> dedupe record carries a canonical `request_fingerprint`. Retries with
|
||||
> the same id AND matching fingerprint collapse to the original
|
||||
> `broker_message_id`. Retries with mismatched fingerprint return
|
||||
> `409 idempotency_key_reused` and do **not** create a new message.
|
||||
>
|
||||
> **Atomicity guarantee**: dedupe row insertion, message row insertion,
|
||||
> and history row insertion happen in one broker DB transaction. Either
|
||||
> all land, or none do. No orphan dedupe rows.
|
||||
>
|
||||
> **End-to-end guarantee**: at-least-once delivery, with
|
||||
> `client_message_id` propagated to receivers' inboxes.
|
||||
|
||||
### 4.2 Daemon-supplied `client_message_id` — unchanged from v3 §4.2
|
||||
|
||||
### 4.3 Broker schema — request fingerprint added (v6)
|
||||
|
||||
```sql
CREATE TABLE mesh.client_message_dedupe (
  mesh_id             UUID NOT NULL REFERENCES mesh.mesh(id) ON DELETE CASCADE,
  client_message_id   TEXT NOT NULL,

  -- The original accepted message; FK NOT enforced because the message row
  -- may be GC'd by retention sweeps before the dedupe row expires.
  broker_message_id   UUID NOT NULL,

  -- Canonical fingerprint of the original request. Recomputed on every
  -- duplicate retry; mismatch → 409 idempotency_key_reused. Schema in §4.4.
  request_fingerprint BYTEA NOT NULL,  -- 32-byte sha256

  destination_kind    TEXT NOT NULL CHECK(destination_kind IN ('topic','dm','queue')),
  destination_ref     TEXT NOT NULL,
  first_seen_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  expires_at          TIMESTAMPTZ,  -- NULL = `permanent` mode
  history_available   BOOLEAN NOT NULL DEFAULT TRUE,  -- flipped FALSE when message row GC'd

  PRIMARY KEY (mesh_id, client_message_id)
);

CREATE INDEX client_message_dedupe_expires_idx
  ON mesh.client_message_dedupe(expires_at)
  WHERE expires_at IS NOT NULL;

ALTER TABLE mesh.topic_message ADD COLUMN client_message_id TEXT;
ALTER TABLE mesh.message_queue ADD COLUMN client_message_id TEXT;
```
|
||||
|
||||
**`status` column dropped (codex r5)**. Rejected requests do **not**
|
||||
consume idempotency keys. Rationale below in §4.6.
|
||||
|
||||
### 4.4 Request fingerprint — canonical form (NEW v6)
|
||||
|
||||
The fingerprint covers everything that makes a send semantically distinct.
|
||||
A retry must reproduce the same fingerprint bit-for-bit; anything else is
|
||||
a different send and must not be collapsed.
|
||||
|
||||
```
request_fingerprint = sha256(
  envelope_version || 0x00 ||
  destination_kind || 0x00 ||
  destination_ref || 0x00 ||
  reply_to_id_or_empty || 0x00 ||
  priority || 0x00 ||
  meta_canonical_json || 0x00 ||
  body_hash
)
```
|
||||
|
||||
Where:
|
||||
- `envelope_version`: integer string (e.g. `"1"`). Bumps when the envelope
|
||||
shape changes.
|
||||
- `destination_kind`: `topic`, `dm`, or `queue`.
|
||||
- `destination_ref`: topic name, recipient ed25519 pubkey hex, or queue id.
|
||||
- `reply_to_id_or_empty`: original `broker_message_id` or empty string.
|
||||
- `priority`: `now`, `next`, or `low`.
|
||||
- `meta_canonical_json`: the `meta` field, serialized with sorted keys,
|
||||
no whitespace, escape-canonical (RFC 8785 JCS). Empty meta = empty string.
|
||||
- `body_hash`: sha256(body bytes), hex.
|
||||
|
||||
The fingerprint is computed:
|
||||
1. **Daemon-side** before durable outbox persistence — stored as
|
||||
`outbox.request_fingerprint` (NEW column) so retries always produce
|
||||
the same fingerprint regardless of caller behavior.
|
||||
2. **Broker-side** on first receipt — stored in
|
||||
`client_message_dedupe.request_fingerprint`.
|
||||
3. **Broker-side** on every duplicate retry — recomputed and compared
|
||||
byte-equal to the stored value.
|
||||
|
||||
If the daemon and broker disagree on the canonical form (e.g. JCS
|
||||
implementation drift), the broker emits
|
||||
`cm_broker_dedupe_fingerprint_mismatch_total{client_id, mesh_id}` and
|
||||
returns `409 idempotency_key_reused` with a body that includes the
|
||||
broker's fingerprint hex for debugging. Daemons that see this should
|
||||
log it loudly and stop retrying that outbox row (it goes to `dead`).
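
A minimal sketch of the fingerprint construction in TypeScript, using Node's built-in `crypto` module. The `SendRequest` shape and helper names are assumptions, and `canonicalJson` is a simplified stand-in (sorted keys, no whitespace) rather than a vetted RFC 8785 implementation. Treating both an absent `meta` and `{}` as "empty meta" is also an assumption.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Simplified canonical JSON: sorted keys, no whitespace. NOT full RFC 8785;
// number formatting and string escaping are whatever JSON.stringify emits.
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  const entries = Object.entries(value).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return `{${entries.map(([k, v]) => `${JSON.stringify(k)}:${canonicalJson(v)}`).join(",")}}`;
}

// Hypothetical request shape covering the fields listed in §4.4.
interface SendRequest {
  envelopeVersion: number;
  destinationKind: "topic" | "dm" | "queue";
  destinationRef: string;
  replyToId?: string;
  priority: "now" | "next" | "low";
  meta?: { [key: string]: Json };
  body: Uint8Array;
}

function requestFingerprint(req: SendRequest): Buffer {
  const bodyHash = createHash("sha256").update(req.body).digest("hex");
  const metaJson =
    req.meta && Object.keys(req.meta).length > 0 ? canonicalJson(req.meta) : "";
  const fields = [
    String(req.envelopeVersion),
    req.destinationKind,
    req.destinationRef,
    req.replyToId ?? "",
    req.priority,
    metaJson,
    bodyHash,
  ];
  const hash = createHash("sha256");
  fields.forEach((field, i) => {
    if (i > 0) hash.update(Buffer.from([0x00])); // 0x00 separator between fields
    hash.update(field, "utf8");
  });
  return hash.digest(); // 32 bytes
}
```

Because key order is normalized before hashing, two callers that build the same `meta` object in different insertion orders would produce the same fingerprint under this sketch.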
|
||||
|
||||
### 4.5 Daemon-local idempotency at the IPC layer (from v8)
|
||||
|
||||
The daemon enforces fingerprint idempotency **before** the request hits
|
||||
`outbox.db` so a caller bug never creates duplicate-key/mismatch-payload
|
||||
state at all.
|
||||
|
||||
#### 4.5.1 IPC accept algorithm
|
||||
|
||||
On `POST /v1/send`:
|
||||
|
||||
1. Validate request envelope (auth, schema, size limits, destination
|
||||
resolvable). Failures here return `4xx` immediately. **No outbox
|
||||
row is written; the `client_message_id` is not consumed.**
|
||||
2. Compute `request_fingerprint` (§4.4).
|
||||
3. Open a SQLite transaction with `BEGIN IMMEDIATE` so a concurrent IPC
|
||||
accept on the same id serializes against this one. `BEGIN IMMEDIATE`
|
||||
acquires the RESERVED lock at transaction start; SQLite has no
|
||||
row-level lock and `SELECT FOR UPDATE` is not supported.
|
||||
4. `SELECT id, request_fingerprint, status, broker_message_id,
|
||||
last_error FROM outbox WHERE client_message_id = ?`.
|
||||
5. Apply the lookup table below. For the "(no row)" case, INSERT inside
|
||||
the same transaction.
|
||||
6. COMMIT.
|
||||
|
||||
| Existing row state | Fingerprint | Daemon response |
|---|---|---|
| (no row) | — | INSERT new outbox row `pending`; return `202 accepted, queued` |
| `pending` | match | Return `202 accepted, queued`. No mutation |
| `pending` | mismatch | Return `409`, `conflict: "outbox_pending_fingerprint_mismatch"` |
| `inflight` | match | Return `202 accepted, inflight`. No mutation |
| `inflight` | mismatch | Return `409`, `conflict: "outbox_inflight_fingerprint_mismatch"` |
| `done` | match | Return `200 ok, duplicate: true, broker_message_id, history_id`. No broker call |
| `done` | mismatch | Return `409`, `conflict: "outbox_done_fingerprint_mismatch", broker_message_id` |
| `dead` | match | Return `409`, `conflict: "outbox_dead_fingerprint_match", reason: "<last_error>"` |
| `dead` | mismatch | Return `409`, `conflict: "outbox_dead_fingerprint_mismatch"` |
| `aborted` | match | Return `409`, `conflict: "outbox_aborted_fingerprint_match"`. Operator-retired id, never reusable |
| `aborted` | mismatch | Return `409`, `conflict: "outbox_aborted_fingerprint_mismatch"` |
|
||||
|
||||
Every `409` carries the daemon's `request_fingerprint` (8-byte hex
|
||||
prefix) for client/server canonical-form-drift debugging. A
|
||||
`client_message_id` written to `outbox.db` is permanently bound to that
|
||||
row's lifecycle — the only "free" state is "no row exists".
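
The lookup table reads naturally as a pure function over the existing row and the incoming fingerprint. The sketch below is one possible encoding; the `ExistingRow` and `IpcResponse` shapes are assumptions, and `history_id` is left out because the outbox schema in §4.5.2 has no such column.

```typescript
type OutboxStatus = "pending" | "inflight" | "done" | "dead" | "aborted";

interface ExistingRow {
  status: OutboxStatus;
  fingerprintHex: string;
  brokerMessageId?: string | undefined;
  lastError?: string | undefined;
}

interface IpcResponse {
  http: 200 | 202 | 409;
  insert: boolean; // true only for the "(no row)" case
  state?: "queued" | "inflight" | undefined;
  duplicate?: true | undefined;
  conflict?: string | undefined;
  brokerMessageId?: string | undefined;
  reason?: string | undefined;
}

// Runs inside the BEGIN IMMEDIATE transaction, after the SELECT in step 4.
function decideIpcAccept(row: ExistingRow | undefined, fingerprintHex: string): IpcResponse {
  if (!row) return { http: 202, insert: true, state: "queued" };

  if (row.fingerprintHex !== fingerprintHex) {
    // Every mismatch is a 409; only the `done` case echoes broker_message_id.
    const res: IpcResponse = {
      http: 409,
      insert: false,
      conflict: `outbox_${row.status}_fingerprint_mismatch`,
    };
    if (row.status === "done") res.brokerMessageId = row.brokerMessageId;
    return res;
  }

  switch (row.status) {
    case "pending":
      return { http: 202, insert: false, state: "queued" };
    case "inflight":
      return { http: 202, insert: false, state: "inflight" };
    case "done":
      return { http: 200, insert: false, duplicate: true, brokerMessageId: row.brokerMessageId };
    case "dead":
      return { http: 409, insert: false, conflict: "outbox_dead_fingerprint_match", reason: row.lastError };
    case "aborted":
      return { http: 409, insert: false, conflict: "outbox_aborted_fingerprint_match" };
  }
}
```

Keeping the decision free of I/O makes each of the eleven table rows a one-line unit test.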
|
||||
|
||||
#### 4.5.2 Outbox table
|
||||
|
||||
```sql
CREATE TABLE outbox (
  id                  TEXT PRIMARY KEY,
  client_message_id   TEXT NOT NULL UNIQUE,
  request_fingerprint BLOB NOT NULL,  -- 32 bytes
  payload             BLOB NOT NULL,
  enqueued_at         INTEGER NOT NULL,
  attempts            INTEGER DEFAULT 0,
  next_attempt_at     INTEGER NOT NULL,
  status              TEXT CHECK(status IN
                        ('pending','inflight','done','dead','aborted')),
  last_error          TEXT,
  delivered_at        INTEGER,
  broker_message_id   TEXT,
  aborted_at          INTEGER,  -- v7
  aborted_by          TEXT,     -- v7: operator/auto
  superseded_by       TEXT      -- v7: id of requeue successor
);
CREATE INDEX outbox_pending ON outbox(status, next_attempt_at);
CREATE INDEX outbox_aborted ON outbox(status, aborted_at) WHERE status = 'aborted';
```
|
||||
|
||||
`aborted_at` / `aborted_by` / `superseded_by` give operators a clear
|
||||
audit trail. `superseded_by` lets `outbox inspect` show the chain when
|
||||
a row is requeued multiple times. `request_fingerprint` is computed
|
||||
once at IPC accept time and frozen for the row's lifecycle.
|
||||
|
||||
#### 4.5.3 Operator recovery via `requeue`
|
||||
|
||||
```
claudemesh daemon outbox requeue --id <outbox_row_id>
  [--new-client-id <id> | --auto]
  [--patch-payload <path>]
```
|
||||
|
||||
Atomically (single SQLite transaction):
|
||||
1. Marks the existing row `aborted`, sets `aborted_at = now`,
|
||||
`aborted_by = "operator"`. Row is **never deleted** — audit trail
|
||||
permanent.
|
||||
2. Mints a fresh `client_message_id` (caller-supplied or auto-ulid).
|
||||
3. Inserts a new outbox row `pending` with the fresh id and the same
|
||||
payload (or patched if `--patch-payload`).
|
||||
4. Sets `superseded_by = <new_row_id>` on the old row.
|
||||
|
||||
The old `client_message_id` is permanently dead. There is no path for
|
||||
an id to become free again.
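
The four steps can be modeled against an in-memory map standing in for the outbox table. This is only a sketch: `requeueRow` and its argument shapes are hypothetical, and in the daemon the two writes would sit inside one SQLite transaction. The restriction to `dead` or `pending` rows follows §4.6.1 phase D.

```typescript
type OutboxStatus = "pending" | "inflight" | "done" | "dead" | "aborted";

interface OutboxRow {
  id: string;
  clientMessageId: string;
  payload: Uint8Array;
  status: OutboxStatus;
  abortedAt?: number | undefined;
  abortedBy?: string | undefined;
  supersededBy?: string | undefined;
}

// rows is keyed by outbox row id.
function requeueRow(
  rows: Map<string, OutboxRow>,
  oldRowId: string,
  fresh: { rowId: string; clientMessageId: string },
  nowMs: number,
  patchedPayload?: Uint8Array,
): OutboxRow {
  const old = rows.get(oldRowId);
  if (!old) throw new Error(`no outbox row ${oldRowId}`);
  if (old.status !== "dead" && old.status !== "pending") {
    throw new Error(`cannot requeue a row in status ${old.status}`);
  }
  // The fresh id must never have been written to the outbox before.
  const idTaken = Array.from(rows.values()).some(
    (r) => r.clientMessageId === fresh.clientMessageId,
  );
  if (idTaken || rows.has(fresh.rowId)) throw new Error("fresh id already in use");

  // All checks passed: apply both writes together.
  const successor: OutboxRow = {
    id: fresh.rowId,
    clientMessageId: fresh.clientMessageId,
    payload: patchedPayload ?? old.payload,
    status: "pending",
  };
  rows.set(oldRowId, {
    ...old,
    status: "aborted",
    abortedAt: nowMs,
    abortedBy: "operator",
    supersededBy: fresh.rowId, // old row is kept for audit, never deleted
  });
  rows.set(fresh.rowId, successor);
  return successor;
}
```

Requeuing the same row twice fails in this model because the first call leaves it `aborted`; the chain continues from the successor row instead.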
|
||||
|
||||
### 4.5b Broker duplicate response — three cases
|
||||
|
||||
| Case | HTTP/WS code | Body |
|---|---|---|
| First insert | `201 created` | `{ broker_message_id, client_message_id, history_id, duplicate: false }` |
| Duplicate, fingerprint match | `200 ok` | `{ broker_message_id, client_message_id, history_id, duplicate: true, history_available, first_seen_at }` |
| Duplicate, fingerprint mismatch | `409 idempotency_key_reused` | `{ client_message_id, conflict: "request_fingerprint_mismatch", broker_fingerprint_prefix: "ab12cd34..." }` (first 8 bytes hex) |
|
||||
|
||||
Daemon outcomes:
|
||||
- `201` → mark outbox row `done`, store `broker_message_id`.
|
||||
- `200 duplicate` with `history_available: true` → mark `done`, log INFO.
|
||||
- `200 duplicate` with `history_available: false` → mark `done`, log WARN.
|
||||
- `409 idempotency_key_reused` → mark outbox row `dead`. Operator runs
|
||||
`outbox requeue` (§4.5.3); old id stays `aborted`, new id is fresh.
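
The daemon outcomes above amount to a small mapping from broker reply to outbox update. A hedged sketch follows; the `BrokerReply` and `OutboxUpdate` types are hypothetical, storing `broker_message_id` on a `200 duplicate` is an assumption, and so is logging the `409` at error level (the spec only says to log it loudly).

```typescript
type BrokerReply =
  | { status: 201; brokerMessageId: string }
  | { status: 200; brokerMessageId: string; historyAvailable: boolean }
  | { status: 409 };

interface OutboxUpdate {
  status: "done" | "dead";
  brokerMessageId?: string | undefined;
  log: "info" | "warn" | "error" | null;
}

function applyBrokerReply(reply: BrokerReply): OutboxUpdate {
  switch (reply.status) {
    case 201:
      // First insert: mark done and remember the broker's id.
      return { status: "done", brokerMessageId: reply.brokerMessageId, log: null };
    case 200:
      // Duplicate with matching fingerprint: done either way; WARN when the
      // original message row has already been GC'd.
      return {
        status: "done",
        brokerMessageId: reply.brokerMessageId,
        log: reply.historyAvailable ? "info" : "warn",
      };
    case 409:
      // idempotency_key_reused: stop retrying; the operator must requeue.
      return { status: "dead", log: "error" };
  }
}
```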
|
||||
|
||||
### 4.6 Rejected-request semantics — id consumed iff outbox row written
|
||||
|
||||
> **Rule**: a `client_message_id` is daemon-consumed iff the daemon
|
||||
> writes an outbox row. Anything that fails before outbox insertion
|
||||
> (auth, schema, size, destination not resolvable) leaves the id
|
||||
> untouched and freely reusable.
|
||||
|
||||
#### 4.6.1 Daemon-side rejection phasing
|
||||
|
||||
| Phase | When daemon rejects | Outbox row? | Caller may reuse id? |
|---|---|---|---|
| **A. IPC validation** (auth, schema, size, destination resolvable) | Before §4.5.1 step 3 | No | Yes — id never consumed |
| **B. Outbox stored, broker network/transient failure** | After IPC accept, broker `5xx` or timeout | `pending` → retried | N/A — daemon owns retries |
| **C. Outbox stored, broker permanent rejection** | Broker returns `4xx` after IPC accept | `dead` | No — rotate via `requeue` |
| **D. Operator retirement** | Operator runs `requeue` on `dead` or `pending` row | `aborted` (audit) + new row with fresh id | Old id NEVER reusable; new id is fresh |
|
||||
|
||||
#### 4.6.2 Broker-side rejection phasing (B1 / B2 / B3)
|
||||
|
||||
The broker validates in three phases relative to dedupe-row insertion:
|
||||
|
||||
| Phase | Validation | Side effects | Result for direct broker callers (none in v0.9.0) |
|---|---|---|---|
| **B1. Pre-dedupe-claim** | Auth (mesh membership), schema, size, mesh exists, member exists, destination kind valid, payload bytes ≤ `max_payload.inline_bytes`, rate limit not exceeded | None | `4xx`. No dedupe row. Direct broker caller may retry with same id |
| **B2. Post-dedupe-claim** (in-tx) | destination_ref existence (topic exists, member subscribed, etc.) | INSERT into dedupe rolled back | `4xx`, transaction rolled back, no dedupe row remains. Direct broker caller may retry with same id |
| **B3. Accepted** | All side effects commit atomically | Dedupe row, message row, history row, delivery_queue rows | `201` with `broker_message_id` |
|
||||
|
||||
**Daemon-mediated callers (the only path in v0.9.0)** see only the
|
||||
daemon-layer rules of §4.6.1: any broker `4xx` after IPC accept lands
|
||||
the outbox row in `dead`. Daemon-mediated callers MUST rotate via
|
||||
`requeue` (§4.5.3); the daemon-consumed id is never reusable
|
||||
regardless of whether the broker layer sees a dedupe row. The "may
|
||||
retry with same id" wording above describes broker-bypass callers
|
||||
only, which v0.9.0 does not have.
|
||||
|
||||
**Critical guarantee**: there is no broker code path where a permanent
|
||||
4xx leaves a dedupe row behind. Either the request committed and a
|
||||
dedupe row exists (B3), or it didn't and no dedupe row exists (B1, B2).
|
||||
"Dedupe row exists" is the unambiguous signal of "id consumed at the
|
||||
broker layer."
|
||||
|
||||
If the broker decides post-commit that an accepted message is invalid
|
||||
(async content-policy job), that's NOT a permanent rejection — it's a
|
||||
follow-up moderation event that operates on the `broker_message_id`,
|
||||
not on the dedupe key.
|
||||
|
||||
Net result: `client_message_dedupe` rows only exist when the broker
|
||||
**successfully** accepted a message and committed it. The single source
|
||||
of truth for "was this idempotency key consumed?" is the existence of
|
||||
the dedupe row. No status enum, no ambiguous states.
|
||||
|
||||
### 4.7 Broker atomicity contract
|
||||
|
||||
#### 4.7.1 Side-effect inventory
|
||||
|
||||
Every successful broker accept atomically commits these durable state
|
||||
changes in **one transaction**:
|
||||
|
||||
| Effect | Table | Why in-tx |
|---|---|---|
| Dedupe record | `mesh.client_message_dedupe` | Idempotency authority |
| Message body | `mesh.topic_message` / `mesh.message_queue` | Authoritative store |
| History row | `mesh.message_history` | Replay log; lost-on-rollback breaks ordered replay |
| Fan-out work | `mesh.delivery_queue` | Each recipient must see exactly committed messages |
|
||||
|
||||
**Outside the transaction** (non-authoritative or rebuildable):
|
||||
- WS push to live subscribers — best-effort live notifications.
|
||||
- Webhook fan-out — async via `delivery_queue` workers.
|
||||
- Rate-limit counters — telemetry only; authority is the external
|
||||
limiter checked in B1.
|
||||
- Audit log entries — append-only stream; rebuildable from history.
|
||||
- Search/FTS index updates — async via outbox-pattern worker.
|
||||
- Mention index updates — async (deferred in-tx promotion to followups
|
||||
doc).
|
||||
- Metrics — Prometheus, pull-based.
|
||||
|
||||
If any in-transaction insert fails, the transaction rolls back
completely. The broker returns `5xx` to the daemon, which retries.
No partial state is left behind.
|
||||
|
||||
#### 4.7.2 Pseudocode
|
||||
|
||||
```sql
-- Pre-generate broker_message_id (ulid) in code, pass in.
BEGIN;

-- Step 1: try to claim the idempotency key.
INSERT INTO mesh.client_message_dedupe
  (mesh_id, client_message_id, broker_message_id, request_fingerprint,
   destination_kind, destination_ref, expires_at)
VALUES ($mesh_id, $client_id, $msg_id, $fingerprint,
        $dest_kind, $dest_ref, $expires_at)
ON CONFLICT (mesh_id, client_message_id) DO NOTHING;

-- Step 2: inspect what's actually there now (ours or someone else's).
SELECT broker_message_id, request_fingerprint, destination_kind,
       destination_ref, history_available, first_seen_at
FROM mesh.client_message_dedupe
WHERE mesh_id = $mesh_id AND client_message_id = $client_id
FOR SHARE;

-- Branch:
--   row.broker_message_id == $msg_id → first insert; continue.
--   row.broker_message_id != $msg_id → duplicate. Compare fingerprints:
--     match    → ROLLBACK; return 200 duplicate.
--     mismatch → ROLLBACK; return 409 idempotency_key_reused.

-- Step 3: validate Phase B2 (destination_ref existence: topic exists,
-- member subscribed, etc.). If B2 fails → ROLLBACK; return 4xx (no
-- dedupe row remains).

-- Step 4: insert in-tx side effects (§4.7.1).
INSERT INTO mesh.topic_message (id, mesh_id, client_message_id, body, ...)
VALUES ($msg_id, $mesh_id, $client_id, ...);

INSERT INTO mesh.message_history (broker_message_id, mesh_id, ...)
VALUES ($msg_id, $mesh_id, ...);

INSERT INTO mesh.delivery_queue (broker_message_id, recipient_pubkey, ...)
SELECT $msg_id, member_pubkey, ...
FROM mesh.topic_subscription
WHERE topic = $dest_ref AND mesh_id = $mesh_id;

COMMIT;
```
|
||||
|
||||
The branch logic determines the response shape (`201` / `200 duplicate`
|
||||
/ `409 idempotency_key_reused`) before COMMIT. The duplicate and 409
|
||||
branches always ROLLBACK because nothing else needs to commit.
|
||||
`SELECT … FOR SHARE` blocks concurrent transactions from updating or
deleting the same dedupe row until this transaction ends.
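The Step 2 branch can be sketched as a pure function. This is a minimal illustration, not the broker's actual code: `DedupeRow`, `classify_accept`, and the `"accepted"` label are hypothetical names, and the `201` path still depends on Phase B2 validation and the Step 4 inserts succeeding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DedupeRow:
    """The two dedupe-row columns the branch logic inspects (hypothetical type)."""
    broker_message_id: str
    request_fingerprint: str


def classify_accept(row: DedupeRow, msg_id: str, fingerprint: str) -> tuple:
    """Return (http_status, label, continue_to_commit) for the Step 2 branch."""
    if row.broker_message_id == msg_id:
        # Our INSERT claimed the key: continue to Steps 3 and 4, then COMMIT.
        return (201, "accepted", True)
    if row.request_fingerprint == fingerprint:
        # Someone else's row with the same request: ROLLBACK, report duplicate.
        return (200, "duplicate", False)
    # Same key, different request body: ROLLBACK, report key reuse.
    return (409, "idempotency_key_reused", False)
```

Keeping the decision separate from the SQL makes the three response shapes easy to unit-test without a database.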
|
||||
|
||||
#### 4.7.3 Failure modes
|
||||
|
||||
- Crash before `COMMIT`: all rows roll back. Next daemon retry inserts
|
||||
cleanly.
|
||||
- Crash after `COMMIT` but before WS ACK: dedupe row exists. Daemon
|
||||
retries → fingerprint matches → `200 duplicate`. Net: exactly one
|
||||
broker-accepted row, one daemon `done` transition.
|
||||
- Constraint violation on message row insert: rolls back the whole tx.
|
||||
`5xx` to daemon. Same fingerprint reproduces; daemon eventually
|
||||
marks `dead`. No orphan dedupe row.
|
||||
|
||||
A nightly orphan-check job (counter `cm_broker_dedupe_orphan_check_total`)
validates that every `client_message_dedupe` row has a matching
`topic_message` / `message_queue` row OR that the matching row has been
retention-pruned (`history_available = FALSE`). Inconsistencies are logged
as `cm_broker_dedupe_orphan_found{mesh_id}` for human review.
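The orphan rule can be expressed as a small in-memory check. This is a sketch under assumptions: it joins on `broker_message_id` (the spec does not say which key the job uses), and `find_orphans` and the tuple layout are hypothetical.

```python
def find_orphans(dedupe_rows, live_message_ids):
    """Return (mesh_id, broker_message_id) pairs that violate the invariant.

    dedupe_rows: iterable of (mesh_id, broker_message_id, history_available).
    live_message_ids: ids currently present in topic_message / message_queue.
    """
    live = set(live_message_ids)
    return [
        (mesh_id, msg_id)
        for mesh_id, msg_id, history_available in dedupe_rows
        # A row is fine if its message exists, or if retention already
        # pruned the message and flagged history_available = FALSE.
        if history_available and msg_id not in live
    ]
```

A non-empty result corresponds to the condition the nightly job reports for human review.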
|
||||
|
||||
### 4.8 Outbox schema
|
||||
|
||||
The authoritative outbox schema for v0.9.0 is in §4.5.2 (includes
|
||||
`aborted` status and audit columns from the v7 pull). `request_fingerprint`
|
||||
is computed at IPC accept time and frozen for the row's lifecycle —
|
||||
the daemon never recomputes from `payload` post-enqueue (would produce
|
||||
drift if envelope_version changes between daemon runs).
|
||||
|
||||
### 4.9 Outbox max-age math — bounded (v6)
|
||||
|
||||
Codex r5: the v5 formula `(dedupe_retention_days * 24) - 24h_margin`
breaks at `dedupe_retention_days = 1` (yields zero) and leaves behavior
undefined for any value `<= 1`, where the result is zero or negative.
|
||||
|
||||
v6 formula and bounds:
|
||||
|
||||
- **Minimum supported broker dedupe retention**: 3 days. Daemon refuses
  to start if broker advertises `dedupe_retention_days < 3` (closes with
  code 4010 and `close_reason.kind = feature_param_below_floor`; see §15.5).
|
||||
- **Daemon `max_age_hours` derivation**:
  - `permanent` mode → daemon uses config default (168h = 7d), cap 720h
    (30d).
  - `retention_scoped` mode → daemon `max_age_hours = max(72,
    (dedupe_retention_days * 24) - safety_margin_hours)` where
    `safety_margin_hours = max(24, ceil(dedupe_retention_days * 0.1 *
    24))`. For `dedupe_retention_days=3` this gives
    `max(72, 72-24) = 72h`. For 30 days: `max(72, 720-72) = 648h`. For
    365 days: `max(72, 8760-876) = 7884h`.
  - The 72h floor prevents the daemon outbox from being uselessly short;
    three days is enough margin for normal operator response to a
    paged outage.
|
||||
|
||||
- Operator override allowed via `[outbox] max_age_hours_override = N`,
|
||||
but if `N` exceeds `dedupe_retention_days * 24 - 1` daemon refuses to
|
||||
start with `outbox_max_age_above_dedupe_window`. The override exists
|
||||
for the rare case of a much-shorter-than-default outbox; it does not
|
||||
exist to circumvent the broker's dedupe window.
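The `retention_scoped` derivation and the override bound can be checked with a short sketch. Function names are hypothetical, and the ceiling is computed with integer arithmetic as a choice of this sketch to avoid floating-point rounding surprises.

```python
MIN_RETENTION_DAYS = 3   # minimum supported broker dedupe retention
FLOOR_HOURS = 72         # floor on the derived max age


def safety_margin_hours(dedupe_retention_days: int) -> int:
    """max(24, ceil(days * 0.1 * 24)), with the ceil done on integers."""
    ten_percent = (dedupe_retention_days * 24 + 9) // 10  # ceil(hours / 10)
    return max(24, ten_percent)


def derive_max_age_hours(dedupe_retention_days: int) -> int:
    """Derived daemon max_age_hours for retention_scoped mode."""
    if dedupe_retention_days < MIN_RETENTION_DAYS:
        raise ValueError("dedupe_retention_days below supported minimum")
    window = dedupe_retention_days * 24
    return max(FLOOR_HOURS, window - safety_margin_hours(dedupe_retention_days))


def validate_override(n: int, dedupe_retention_days: int) -> int:
    """Reject an operator override that exceeds the dedupe window minus 1h."""
    if n > dedupe_retention_days * 24 - 1:
        raise ValueError("outbox_max_age_above_dedupe_window")
    return n
```

Under this reading, the 3-day minimum derives 72h, which equals the full dedupe window, while the largest override accepted at 3 days is 71h.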
|
||||
|
||||
### 4.10 Inbox schema — unchanged from v3 §4.5
|
||||
|
||||
### 4.11 Crash recovery — unchanged from v3 §4.6
|
||||
|
||||
### 4.12 Failure modes — corrected for fingerprint model (v6)
|
||||
|
||||
- **Fingerprint mismatch on retry** (`409 idempotency_key_reused`): outbox
|
||||
row marked `dead`. Surfaced in `--failed` view. Operator command
|
||||
`outbox requeue --new-id <id>` rotates `client_message_id` and retries.
|
||||
- **Daemon retry after dedupe row hard-deleted by retention sweep**: in
|
||||
`retention_scoped` mode, daemon `max_age_hours` is bounded inside the
|
||||
retention window (§4.9), so this can only happen via operator override.
|
||||
In that case the retry creates a NEW dedupe row + new message — the
|
||||
caller chose this risk explicitly. Counter
|
||||
`cm_daemon_retry_after_dedupe_expired_total`.
|
||||
- **Daemon retry after dedupe row hard-deleted in `permanent` mode**:
|
||||
cannot happen by definition — `permanent` means no `expires_at`. Only
|
||||
mesh deletion removes dedupe rows.
|
||||
- **Duplicate row, history pruned**: as v5 §4.4. Mark `done`, log
|
||||
`cm_daemon_dedupe_history_pruned_total`.
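On the daemon side, the broker responses described in §4.7 and above map onto outbox transitions roughly as follows. This is a sketch only: `next_outbox_status`, the `"pending"` status name, and the `max_attempts` retry budget are assumptions, while `done` and `dead` come from the spec.

```python
def next_outbox_status(http_status: int, attempts: int, max_attempts: int = 8) -> str:
    """Map a broker response to the next outbox row status (hypothetical helper)."""
    if http_status in (200, 201):
        # 201 is a fresh accept; 200 duplicate means the broker already has it.
        return "done"
    if http_status == 409:
        # idempotency_key_reused: needs operator action (requeue with a new id).
        return "dead"
    if 500 <= http_status <= 599:
        # Transaction rolled back broker-side; retry until the budget runs out.
        return "pending" if attempts < max_attempts else "dead"
    raise ValueError(f"unhandled broker status {http_status}")
```

Other `4xx` responses (for example a Phase B2 rejection) are deliberately left unhandled here because this section does not specify their outbox transition.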
|
||||
|
||||
---
|
||||
|
||||
## 5. Inbound — unchanged from v3 §5
|
||||
|
||||
---
|
||||
|
||||
## 6. Hooks — unchanged from v4 §6
|
||||
|
||||
---
|
||||
|
||||
## 7-13. Multi-mesh, auto-routing, service install, observability, SDKs, security model, configuration — unchanged from v4
|
||||
|
||||
---
|
||||
|
||||
## 14. Lifecycle — unchanged from v5 §14
|
||||
|
||||
---
|
||||
|
||||
## 15. Version compat — feature param updated for new dedupe semantics
|
||||
|
||||
### 15.1 Feature bits with parameters (v6 update)
|
||||
|
||||
| Bit | `params.version` | Required parameters | Optional parameters |
|---|---|---|---|
| `client_message_id_dedupe` | `1` | `mode: "retention_scoped"\|"permanent"`, `dedupe_retention_days: int (>= 3)` (when mode=retention_scoped), `request_fingerprint: bool == true` | `tombstone_history_pruned_window_days: int` |
| `concurrent_connection_policy` | `1` | (no parameters) | `default_policy: "prefer_newest"\|"prefer_oldest"\|"allow_concurrent"` |
| `member_keypair_rotated_event` | `1` | (no parameters) | (none) |
| `key_epoch` | `1` | `max_concurrent_epochs: int (>= 1)` | (none) |
| `max_payload` | `1` | `inline_bytes: int (>= 1024)`, `blob_bytes: int (>= 1024)` | (none) |
|
||||
|
||||
`client_message_id_dedupe` ships at `params.version = 1` with
|
||||
`request_fingerprint: bool == true` as a required parameter. A broker
|
||||
that doesn't advertise the feature, or advertises it without
|
||||
`request_fingerprint: true`, is treated as "feature missing" and the
|
||||
daemon refuses to start. That's intentional — v0.9.0 daemons require
|
||||
fingerprint enforcement for safe idempotency.
|
||||
|
||||
The schema-version-2 evolution (parameters that need versioning) is
|
||||
deferred (see followups doc).
|
||||
|
||||
`dedupe_retention_days` minimum is 3 (matches the §4.9 floor).
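A daemon-side check of the advertised `client_message_id_dedupe` parameters might look like the sketch below. The shape of the `supported` mapping and the function name are assumptions; the returned strings are the `close_reason.kind` values listed in §15.5.

```python
from typing import Optional

MIN_DEDUPE_RETENTION_DAYS = 3
MODES = {"retention_scoped", "permanent"}


def check_dedupe_feature(supported: dict) -> Optional[str]:
    """Return a close_reason kind, or None if the feature is usable as advertised."""
    feature = supported.get("client_message_id_dedupe")
    if feature is None:
        return "feature_unavailable"
    params = feature.get("params", {})
    if params.get("request_fingerprint") is not True:
        # Advertised without fingerprint enforcement: treated as feature missing.
        return "feature_unavailable"
    if params.get("version") != 1 or params.get("mode") not in MODES:
        return "feature_param_invalid"
    if params["mode"] == "retention_scoped":
        days = params.get("dedupe_retention_days")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(days, bool) or not isinstance(days, int):
            return "feature_param_invalid"
        if days < MIN_DEDUPE_RETENTION_DAYS:
            return "feature_param_below_floor"
    return None
```

Returning the kind rather than raising lets the caller put the same value into both the log line and the close payload.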
|
||||
|
||||
### 15.2 Negotiation handshake — unchanged shape from v5 §15.2
|
||||
|
||||
### 15.3 IPC negotiation — unchanged from v3 §15.3
|
||||
|
||||
### 15.4 Compatibility matrix — unchanged from v3 §15.4
|
||||
|
||||
### 15.5 Diagnostic close code (v0.9.0)
|
||||
|
||||
v0.9.0 ships a single WebSocket close code with a structured
|
||||
`close_reason` JSON payload that distinguishes the underlying cause:
|
||||
|
||||
| Code | Reason | `close_reason.kind` values |
|---|---|---|
| `4010` | `feature_unavailable` | `feature_unavailable` (feature missing from broker's `supported`) · `feature_param_invalid` (params fail validation: missing required, out of bounds, unknown version) · `feature_param_below_floor` (param below daemon's hard floor, e.g. `dedupe_retention_days < 3`) |
|
||||
|
||||
`close_reason` payload shape:
|
||||
```json
{
  "kind": "feature_unavailable" | "feature_param_invalid" | "feature_param_below_floor",
  "feature": "client_message_id_dedupe",
  "detail": "..."
}
```
|
||||
|
||||
Daemon logs the full negotiation payload at WARN before exiting;
|
||||
supervisor + alerting catches the restart loop. The split into
|
||||
4011/4012 codes is deferred (see followups doc).
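A consumer of the close payload could decode it as sketched below. `parse_close_reason` is a hypothetical helper; it accepts both the bare object shown in this section and a payload wrapped in a top-level `close_reason` key, as an assumption about what senders may emit.

```python
import json

KNOWN_KINDS = {
    "feature_unavailable",
    "feature_param_invalid",
    "feature_param_below_floor",
}


def parse_close_reason(code: int, reason: str) -> dict:
    """Decode a 4010 close payload into {kind, feature, detail}."""
    if code != 4010:
        raise ValueError(f"not a feature-negotiation close code: {code}")
    payload = json.loads(reason)
    # Tolerate a wrapper object around the three fields.
    payload = payload.get("close_reason", payload)
    kind = payload.get("kind")
    if kind not in KNOWN_KINDS:
        raise ValueError(f"unknown close_reason.kind: {kind!r}")
    return {
        "kind": kind,
        "feature": payload.get("feature"),
        "detail": payload.get("detail", ""),
    }
```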
|
||||
|
||||
---
|
||||
|
||||
## 16. Threat model — unchanged from v4 §16
|
||||
|
||||
---
|
||||
|
||||
## 17. Migration — broker dedupe table + atomicity (v6)
|
||||
|
||||
Broker side, deploy order:
|
||||
|
||||
1. `CREATE TABLE mesh.client_message_dedupe` with v6 schema (additive,
|
||||
online-safe).
|
||||
2. `ALTER TABLE mesh.topic_message ADD COLUMN client_message_id`.
|
||||
3. `ALTER TABLE mesh.message_queue ADD COLUMN client_message_id`.
|
||||
4. Broker code refactor: every accept path wraps dedupe insert + message
|
||||
insert in **one transaction** (§4.7). Pre-generated
|
||||
`broker_message_id` (ulid in code) passed in.
|
||||
5. Broker code: nightly job to delete dedupe rows where `expires_at <
|
||||
NOW()` (skip in `permanent` mode).
|
||||
6. Broker code: hook into the message-retention sweep — when a
|
||||
`topic_message` or `message_queue` row is hard-deleted, find the
|
||||
matching dedupe row by `client_message_id` and set `history_available
|
||||
= FALSE`. (Note: `client_message_id` is nullable on those tables for
|
||||
legacy traffic; nullable rows have no dedupe row to update.)
|
||||
7. Broker code: nightly orphan-check job (§4.7); alerts on non-zero.
|
||||
8. Broker advertises `client_message_id_dedupe` feature with
|
||||
`params.version = 1` and `request_fingerprint: true`.
|
||||
9. Daemon refuses to start unless that feature bit is advertised with
|
||||
valid v1 params.
|
||||
|
||||
Rollback plan: feature flag disables fingerprint enforcement broker-side
|
||||
(falls back to existing pre-v6 behavior — no dedupe). Daemons that
|
||||
require fingerprint refuse to start. Operator switches off the feature
|
||||
flag, reverts the daemon, restarts. No data loss; pending dedupe rows
|
||||
remain in place for the next forward roll.
|
||||
|
||||
---
|
||||
|
||||
## v0.9.0 lock — what's in vs deferred
|
||||
|
||||
**In** (this document): everything codex r1–r4 ratified plus the six
|
||||
sweet-spot pulls from v7–v9 enumerated at the top — `aborted` outbox
|
||||
status, `BEGIN IMMEDIATE`, IPC duplicate lookup table, B1/B2/B3 phasing
|
||||
concept, side-effect inventory, two-layer ID model.
|
||||
|
||||
**Deferred** (see `2026-05-03-daemon-spec-broker-hardening-followups.md`):
|
||||
- B0 dedupe fast-path before rate-limit (v10).
|
||||
- Lua-scripted idempotent rate limiter keyed by
|
||||
`(mesh, client_id, window)` (v10).
|
||||
- In-tx `mesh.mention_index` (v8).
|
||||
- 4011 / 4012 close-code split (v6 §15.5 — collapsed to 4010 with
|
||||
structured reason JSON for v0.9.0).
|
||||
- Per-OS fingerprint precedence elaborate table (v8 §2.2.1).
|
||||
- `request_fingerprint` schema-version-2 in feature negotiation (v6
|
||||
§15.1 ships at version 1 with `request_fingerprint: bool`).
|
||||
- Force-expiry / quarantine semantics for `keypair-archive.json`
|
||||
(v8 §14.1.1).
|
||||
|
||||
These deferrals are real improvements but not v0.9.0 blockers. They
|
||||
land as the broker matures and we have actual scale-load to optimize
|
||||
against.
|
||||
|
||||
---
|
||||
|
||||
## Cross-spec note: §15.5 close-code collapse
|
||||
|
||||
For v0.9.0 we ship a single `4010 feature_unavailable` close code with
|
||||
a structured `close_reason` JSON payload that distinguishes the
|
||||
underlying cause:
|
||||
|
||||
```json
{
  "close_reason": {
    "kind": "feature_unavailable" | "feature_param_invalid" | "feature_param_below_floor",
    "feature": "client_message_id_dedupe",
    "detail": "..."
  }
}
```
|
||||
|
||||
The 4011/4012 split is deferred to followups.
|
||||
|
||||
---
|
||||
|
||||
## NON-NORMATIVE: round-6 review trailer (preserved for audit only)
|
||||
|
||||
> **Not part of the v0.9.0 contract.** Preserved verbatim from the
|
||||
> v6 source spec as a record of the open questions at the time of the
|
||||
> codex round-6 review. Items below have either been resolved in this
|
||||
> merged document, deferred to the followups doc, or superseded.
|
||||
> Do NOT use this section as a checklist for implementation.
|
||||
|
||||
1. **Request fingerprint canonical form (§4.4)** — does JCS work
|
||||
cross-language for `meta_canonical_json` (Python json.dumps,
|
||||
Go encoding/json, JS JSON.stringify all behave differently)? Should
|
||||
we ship a vetted JCS lib in each SDK or fall back to a simpler
|
||||
"sorted keys + no spaces + escape-as-stored" rule with conformance
|
||||
tests?
|
||||
2. **Atomicity contract (§4.7)** — is the orphan-check sufficient, or
|
||||
does a violation mean we need a "broker rebuild dedupe from messages"
|
||||
recovery tool? The latter is destructive but useful for ops emergencies.
|
||||
3. **Max-age formula (§4.9)** — is the 72h floor correct? Is the
|
||||
percentage-based safety margin (`max(24, ceil(0.1 * dedupe_window))`)
|
||||
the right shape? Or simpler to say "always 24h"?
|
||||
4. **`409 idempotency_key_reused` recovery flow (§4.5)** — is sending the
|
||||
row to `dead` and surfacing it via `outbox --failed` enough? Should
|
||||
the daemon emit a high-priority event for the SSE stream so operators
|
||||
are paged immediately?
|
||||
5. **Diagnostic close codes (§15.5)** — is splitting 4010/4011/4012
|
||||
useful, or does it just push complexity onto operators? Should we
|
||||
collapse to 4010 with structured close-reason JSON instead?
|
||||
6. **Anything else still wrong?** Read it as if you were going to
|
||||
operate this for a year. What falls down?
|
||||
|
||||
Three options:
|
||||
- **(a) v6 is shippable**: lock the spec, start coding the frozen core.
|
||||
- **(b) v7 needed**: list the must-fix items.
|
||||
- **(c) the architecture itself is wrong**: what would you do differently?
|
||||
|
||||
Be ruthless.
|
||||
232
.artifacts/specs/2026-04-10-anthropic-vision-meshes-invites.md
Normal file
@@ -0,0 +1,232 @@
|
||||
# Anthropic Vision: Meshes & Invitations
|
||||
|
||||
**Status:** in progress · partial implementation 2026-04-10
|
||||
**Owner:** agutierrez
|
||||
**Scope:** `apps/web`, `packages/api`, `packages/db`, `apps/broker` (future), `apps/cli` (future)
|
||||
|
||||
---
|
||||
|
||||
## Guiding principles
|
||||
|
||||
1. **Identity is opaque, display is free-form.** Humans pick any name; the system uses random IDs.
|
||||
2. **Secrets never appear in URLs.** Links are capabilities, not credentials.
|
||||
3. **Defaults are obvious; advanced options are discoverable but hidden.**
|
||||
4. **Self-service wherever possible; admins don't become gatekeepers.**
|
||||
5. **Every visible action is also an auditable event.**
|
||||
|
||||
These mirror how Anthropic builds its own org/workspace/project model.
|
||||
|
||||
---
|
||||
|
||||
## Part 1 — Meshes
|
||||
|
||||
### Problem
|
||||
Global uniqueness on `mesh.slug` creates name collisions at scale. Two users picking "platform" or "test" fight for the slug. At 50k users this is the default state.
|
||||
|
||||
### Decision
|
||||
**Drop the slug as an identity concept.** `mesh.id` (opaque, already random) is the canonical identifier everywhere (URLs, invites, broker lookups). `mesh.name` is a free-form display label, non-unique. `mesh.slug` is kept as a non-unique cosmetic string derived from the name at creation time, embedded in invite payloads for debugging.
|
||||
|
||||
### What this enables
|
||||
- Two users can both name their mesh "platform-team" with zero friction
|
||||
- URLs stay stable (`/meshes/{id}`) even if the user renames the mesh
|
||||
- No "slug taken" error state exists in the product anymore
|
||||
|
||||
### Tradeoff explicitly accepted
|
||||
Users lose the ability to type `claudemesh join platform-team` — but they never did, because the CLI takes signed invite tokens, not slugs. This capability was phantom.
|
||||
|
||||
### Implementation — DONE in this spec
|
||||
- [x] Drop `UNIQUE` constraint on `mesh.slug` (migration `0017_mesh-slug-non-unique.sql`)
|
||||
- [x] Remove `slug` field from `createMyMeshInputSchema`
|
||||
- [x] Remove slug field from `CreateMeshForm`
|
||||
- [x] Server-side `toSlug(name)` derives slug from name automatically
|
||||
- [x] Schema comment documents the non-canonical role of `slug`
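For illustration, a name-to-slug derivation could look like the sketch below. The real `toSlug(name)` rules are not given in this spec, so `to_slug` and its normalisation (lowercase, runs of other characters collapsed to a hyphen) are assumptions.

```python
import re


def to_slug(name: str) -> str:
    """Derive a cosmetic, non-unique slug from a display name (assumed rules)."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower())
    return slug.strip("-")
```

Because the slug is no longer an identity, two meshes producing the same slug is acceptable by design.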
|
||||
|
||||
### Future (optional, not in v0.1.x)
|
||||
- **Vanity slugs as a Pro feature:** one globally-unique handle per *account* (not per mesh), exposed as `claudemesh.com/@acme/...`. Sold as part of an org tier. This is where slug uniqueness actually pays for itself — against usernames, not against meshes.
|
||||
|
||||
---
|
||||
|
||||
## Part 2 — Invitations
|
||||
|
||||
### Problems with the current invite system
|
||||
|
||||
| # | Problem | Severity |
|---|---|---|
| 1 | `mesh_root_key` is embedded in the invite URL as base64url JSON | 🔴 **Security** |
| 2 | Invite URLs are ~400 chars of opaque base64url | 🟡 UX |
| 3 | No invite-by-email; only shareable link | 🟡 UX |
| 4 | Required form fields (role, maxUses, expiresInDays) for every invite | 🟡 UX |
| 5 | Landing page does not clearly preview role/consent | 🟡 UX |
| 6 | No audit trail for invites received-but-never-clicked | 🟢 Polish |
| 7 | `ic://` link scheme is vestigial, nothing registers the handler | 🟢 Polish |
|
||||
|
||||
### Severity 🔴 — the root key leak
|
||||
|
||||
Current canonical invite bytes:
|
||||
```
|
||||
v | mesh_id | mesh_slug | broker_url | expires_at | mesh_root_key | role | owner_pubkey
|
||||
```
|
||||
|
||||
`mesh_root_key` is a 32-byte shared secret used by all channel and broadcast encryption in the mesh. Once it lives in a URL:
|
||||
- Slack/Telegram/Discord link previews fetch and cache the URL → root key is in those caches
|
||||
- Browser history, sync, analytics pixels, error logs → root key persists anywhere URLs persist
|
||||
- A screenshot of the invite link is a compromise
|
||||
- Revoking the invite does **not** rotate the key, so exposure is permanent
|
||||
|
||||
**Anthropic would never do this.** The fix is a protocol change: the invite grants the *right* to receive the key, it is not the key itself.
|
||||
|
||||
### The v2 invite protocol (spec only in this doc — NOT implemented this session)
|
||||
|
||||
**Design goals**
|
||||
1. No secret material in any user-visible string (URL, QR, paste buffer)
|
||||
2. Invite URLs are short (<30 chars): `claudemesh.com/i/abc12345`
|
||||
3. Existing v1 invites continue to work during a deprecation window
|
||||
4. Revocation is clean and immediate
|
||||
5. One recipient = one root-key-delivery capability
|
||||
|
||||
**Flow**
|
||||
```
Admin creates invite (v2):
  server generates short_code (base62, 8 chars, unique)
  server stores in DB: {id, mesh_id, code, role, max_uses, expires_at, signed_capability}
    signed_capability = ed25519_sign(canonical_v2_bytes, mesh.owner_secret_key)
    canonical_v2_bytes = v=2 | mesh_id | invite_id | expires_at | role | owner_pubkey
    NOTE: no root_key, no broker_url
  returns: claudemesh.com/i/{code}

Recipient clicks the link:
  web: GET /api/public/invites/code/{code}
    returns {mesh_name, inviter_name, role, expires_at, member_count}
    no secrets, no signature leaked
  web: shows consent landing: "You are joining ACME as a Member"
  recipient authenticates (sign up / log in) OR runs CLI

Recipient claims the invite:
  CLI: generates session ed25519 keypair (ephemeral)
  CLI: connects to broker ws://ic.claudemesh.com/ws
  CLI: sends { type: "claim_invite", code, recipient_pubkey }
  broker: looks up invite by code
  broker: verifies signed_capability against mesh.owner_pubkey
  broker: checks expires_at, max_uses vs used_count, revoked_at
  broker: increments used_count, creates mesh.member row
  broker: seals mesh.root_key with crypto_box_seal to recipient_pubkey
  broker: returns { sealed_root_key, mesh_id, member_id }
  CLI: unseals with its secret key → has root_key
  CLI: starts normal mesh traffic

Revocation:
  admin sets invite.revoked_at = now()
  any future claim fails at broker with invite_revoked
  root_key is NOT rotated — past members keep access
  (for "kick a member" semantics, use a separate member revocation, which DOES rotate the key)
```
|
||||
|
||||
**Properties**
|
||||
- URL contains only `{code}` (8 chars base62)
|
||||
- `signed_capability` lives server-side; leaks of the URL never expose the root key
|
||||
- Screenshot of invite URL is harmless
|
||||
- Link preview bots see nothing sensitive
|
||||
- Broker DB is the source of truth for revocation
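The broker-side claim checks in the flow above (expiry, use count, revocation) can be sketched as follows. Signature verification and key sealing are omitted. The `Invite` type and the `invite_expired` / `invite_exhausted` error names are hypothetical; only `invite_revoked` appears in the flow.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Invite:
    expires_at: datetime
    max_uses: int
    used_count: int = 0
    revoked_at: Optional[datetime] = None


def claim_error(invite: Invite, now: datetime) -> Optional[str]:
    """Return an error code if the claim must be refused, else None."""
    if invite.revoked_at is not None and invite.revoked_at <= now:
        return "invite_revoked"
    if now >= invite.expires_at:
        return "invite_expired"
    if invite.used_count >= invite.max_uses:
        return "invite_exhausted"
    return None


def claim(invite: Invite, now: datetime) -> None:
    """Consume one use of the invite or raise with the refusal code."""
    error = claim_error(invite, now)
    if error is not None:
        raise PermissionError(error)
    invite.used_count += 1
```

In a real broker the check and the `used_count` increment would need to happen atomically in the database so that two concurrent claims cannot both consume the last use.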
|
||||
|
||||
**Migration strategy (v1 → v2)**
|
||||
- Add `invite.code`, `invite.v2_capability` columns (nullable for existing rows)
|
||||
- `createMyInvite` generates BOTH v1 token (legacy) and v2 code
|
||||
- Web invite UI displays the short URL by default, long URL as "Legacy format" disclosure
|
||||
- Broker accepts both formats until v0.2.0
|
||||
- Announce deprecation window; at v0.2.0 the long-format endpoints 410 Gone
|
||||
|
||||
**Status update 2026-04-10 — v2 is now being implemented in parallel**
|
||||
|
||||
The scope that was deferred at the top of the session is actively landing in a coordinated multi-agent push:
|
||||
- Broker: new `/api/public/invites/:code/claim` endpoint, `crypto_box_seal` against recipient x25519 pubkey, signed capability verification, single-use accounting.
|
||||
- DB: `mesh.invite.version` int, `mesh.invite.capability_v2` text nullable, `mesh.invite.claimed_by_pubkey` text nullable. New table `mesh.pending_invite` for email invites.
|
||||
- CLI / web claim client: generates a fresh x25519 keypair (separate from the ed25519 identity), POSTs the pubkey, unseals the returned `sealed_root_key`, then verifies `canonical_v2` against `owner_pubkey`.
|
||||
- Email invites (parallel track): Postmark delivery wired on top of `pending_invite`; the email body carries the same `claudemesh.com/i/{code}` short URL.
|
||||
|
||||
v1 invites continue to work throughout v0.1.x. v1 endpoints return `410 Gone` at v0.2.0.
|
||||
|
||||
Docs updated in the same session: `SPEC.md` §14b, `docs/protocol.md` (v2 invites subsection), `docs/roadmap.md` (in progress).
|
||||
|
||||
---
|
||||
|
||||
### Severity 🟡 — implemented this session
|
||||
|
||||
#### Short invite codes (URL shortening, backward-compatible)
|
||||
|
||||
Additive: invites now get both a long token AND a short opaque code. The web app prefers the short URL.
|
||||
|
||||
**DB:** new nullable `invite.code` column, unique. New migration `0018_invite-short-code.sql`.
|
||||
|
||||
**API:** `createMyInvite` generates `code` (base62, 8 chars, collision-retry). Returns `shortUrl` alongside `inviteLink` / `joinUrl`.
|
||||
|
||||
**Web:** new server route `/i/[code]/page.tsx` that resolves the code server-side and redirects to the canonical `/join/[token]` page. Invite generator UI shows the short URL as the primary "Copy link" target.
|
||||
|
||||
**Backward compat:** existing invites without a `code` keep working via their long token. No broker/CLI changes.
|
||||
|
||||
**This is NOT the v2 protocol.** It only fixes the URL-length problem. The root key is still embedded in the long token that the short code resolves to. The short code is a URL shortener, not a capability boundary. Document this clearly so nobody confuses the two.
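The short-code generation described under **API** above (base62, 8 chars, collision-retry) can be sketched like this. `new_code`, `allocate_code`, the `is_taken` callback, and the retry budget are illustrative assumptions, not the actual `createMyInvite` implementation.

```python
import secrets
import string

BASE62 = string.digits + string.ascii_letters  # 10 + 52 = 62 symbols


def new_code(length: int = 8) -> str:
    """Generate a random base62 code using a CSPRNG."""
    return "".join(secrets.choice(BASE62) for _ in range(length))


def allocate_code(is_taken, max_attempts: int = 5) -> str:
    """Retry on collision; is_taken(code) stands in for the uniqueness check."""
    for _ in range(max_attempts):
        code = new_code()
        if not is_taken(code):
            return code
    raise RuntimeError("could not allocate a unique invite code")
```

In practice the unique constraint on `invite.code` is the real arbiter, so the retry would wrap the insert rather than a separate lookup.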
|
||||
|
||||
---
|
||||
|
||||
#### Collapsed advanced fields
|
||||
|
||||
The invite form asks for `role`, `max uses`, `expires in days` upfront. 90% of users only ever create `{ role: member, max_uses: 1, expires_in_days: 7 }`.
|
||||
|
||||
Change: defaults are pre-filled; the three fields are hidden behind an "Advanced" disclosure.
|
||||
|
||||
---
|
||||
|
||||
### Severity 🟡 — deferred
|
||||
|
||||
#### Invite by email
|
||||
|
||||
- Requires an `invitation_email` table or equivalent pending-invites state
|
||||
- Requires wire-up to email delivery (already have Postmark via turbostarter)
|
||||
- Out of scope this session; fits naturally on top of v2 invite protocol
|
||||
|
||||
#### Consent landing redesign
|
||||
|
||||
- The `/join/[token]` page should show: mesh name, inviter, role being granted, member count, expiry, explicit "Join as Member of ACME" button
|
||||
- Needs a design pass
|
||||
- Deferred
|
||||
|
||||
---
|
||||
|
||||
### Severity 🟢 — deferred
|
||||
|
||||
- Remove `ic://` scheme — it's dead, nothing handles it, safe to delete in v0.1.x cleanup
|
||||
- Received-but-not-clicked audit — falls out of email invites for free
|
||||
|
||||
---
|
||||
|
||||
## Summary table
|
||||
|
||||
| Change | Status | File(s) |
|---|---|---|
| Drop global slug uniqueness | ✅ done | `packages/db/src/schema/mesh.ts`, migration `0017` |
| Remove slug from create-mesh form | ✅ done | `apps/web/src/modules/mesh/create-mesh-form.tsx` |
| Server-derived slug from name | ✅ done | `packages/api/src/modules/mesh/mutations.ts` |
| Short invite codes (URL shortener) | ✅ done | `packages/db` migration `0018`, api, web `/i/[code]` |
| Collapse invite advanced fields | ✅ done | `apps/web/src/modules/mesh/invite-generator.tsx` |
| v2 invite protocol (root key out of URL) | 🚧 in progress | broker `/api/public/invites/:code/claim`, `mesh.invite.version` + `capability_v2` + `claimed_by_pubkey`, CLI/web claim client |
| Invite by email | 🚧 in progress | `mesh.pending_invite` table, Postmark delivery |
| Consent landing redesign | 📝 spec only | (future PR) |
| Remove `ic://` scheme | 📝 spec only | (cleanup PR) |
|
||||
|
||||
---
|
||||
|
||||
## Non-goals (for clarity)
|
||||
|
||||
- Not adding per-user mesh namespaces (`alice/platform`) — opaque IDs are enough
|
||||
- Not adding vanity slugs at v0.1.x — can come as a Pro tier later
|
||||
- Not changing the broker wire protocol this session
|
||||
- Not rewriting the CLI join flow this session
|
||||
|
||||
---
|
||||
|
||||
## Post-implementation checklist
|
||||
|
||||
- [x] Web builds without type errors on changed files
|
||||
- [x] Migrations run on production DB (`0017` applied; `0018` after review)
|
||||
- [x] No broker protocol change (backward compat verified)
|
||||
- [x] Existing long-token invites continue to resolve
|
||||
- [x] New invites expose `shortUrl` in the API response
|
||||
593
.artifacts/specs/2026-04-10-cli-auth-device-code-pat.md
Normal file
@@ -0,0 +1,593 @@
|
||||
# CLI Auth — Device Code Flow + Personal Access Tokens
|
||||
|
||||
**Status:** spec
|
||||
**Created:** 2026-04-10
|
||||
**Owner:** CLI-Dev (implementation), Orchestrator (spec)
|
||||
**Target version:** v0.11.0
|
||||
**Related:** `2026-04-10-anthropic-vision-meshes-invites.md`, `2026-04-10-cli-wizard-architecture-refactor.md`
|
||||
|
||||
## Goal
|
||||
|
||||
The CLI is a first-class client. From a fresh terminal, with zero prior browser interaction, a user can:
|
||||
|
||||
```
|
||||
claudemesh login # device-code OAuth, browser handshake
|
||||
claudemesh create "Platform team" # creates real mesh via /api/my/meshes
|
||||
claudemesh invite --email alice@x.com # generates invite, sends email
|
||||
claudemesh launch --mesh platform-team -y # spawns Claude Code in the mesh
|
||||
```
|
||||
|
||||
For CI / scripting / non-interactive contexts, PAT works too:
|
||||
|
||||
```
|
||||
claudemesh login --token cm_pat_abc123
|
||||
claudemesh create "CI test mesh" --json | jq .id
|
||||
```
|
||||
|
||||
This is the auth substrate that unblocks the "Anthropic vision" — every other dashboard-only feature (meshes, invites, members, billing) becomes CLI-accessible after this lands.
|
||||
|
||||
## Non-goals
|
||||
|
||||
- SSO / SAML / enterprise IdP integration (later, post-1.0)
|
||||
- Refresh tokens with rotation (long-lived API keys are sufficient for v1)
|
||||
- Multi-account switching (one logged-in identity per `~/.claudemesh/auth.json`)
|
||||
- Device fleet management UI (single "revoke" button per token is enough for v1)
|
||||
|
||||
## Auth model overview
|
||||
|
||||
Two coexisting credential types, both backed by **Better Auth's `apiKey` plugin**:
|
||||
|
||||
| Type | Created via | Lifetime | Use case | Storage |
|---|---|---|---|---|
| **Device-code session token** | `claudemesh login` (OAuth-style browser handshake) | 90 days, auto-renew on use | Interactive humans on their workstation | `~/.claudemesh/auth.json` |
| **Personal access token (PAT)** | Dashboard → Settings → CLI tokens → Generate | User-chosen (30d / 90d / 1y / never), explicit revocation | CI, scripts, automation, server-side cron | Anywhere the user puts it; CLI reads from `--token` flag, env var, or `auth.json` |
|
||||
|
||||
Both flow through the same `Authorization: Bearer cm_<type>_<random>` header. The API doesn't care which one it gets — it just validates against the `api_key` table.
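Credential lookup on the CLI side could be sketched as below. The precedence (flag, then `CLAUDEMESH_TOKEN`, then `auth.json`) and the `{"token": ...}` file shape are assumptions; the spec lists the three sources but does not fix their order or the file layout.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_AUTH_PATH = Path.home() / ".claudemesh" / "auth.json"


def resolve_token(
    flag_token: Optional[str],
    env: Mapping[str, str] = os.environ,
    auth_path: Path = DEFAULT_AUTH_PATH,
) -> Optional[str]:
    """Pick the bearer token: --token flag, then env var, then auth.json."""
    if flag_token:
        return flag_token
    env_token = env.get("CLAUDEMESH_TOKEN")
    if env_token:
        return env_token
    try:
        data = json.loads(Path(auth_path).read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    return data.get("token") if isinstance(data, dict) else None


def auth_header(token: str) -> dict:
    """Both credential types travel in the same Authorization header."""
    return {"Authorization": f"Bearer {token}"}
```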
|
||||
|
||||
**Token format:**
|
||||
- `cm_session_<32-byte base32>` — device-code sessions
|
||||
- `cm_pat_<32-byte base32>` — personal access tokens
|
||||
|
||||
The `cm_` prefix lets us scan for leaked tokens with regex (e.g. GitHub secret scanning, internal scripts). The middle segment (`session` / `pat`) is for human readability in token lists, not for security.
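A generator and a leak-scanning regex for this format might look like the sketch below. It assumes unpadded, lowercase RFC 4648 base32, so 32 random bytes become 52 characters; the spec does not state case or padding, and the function names are hypothetical.

```python
import base64
import re
import secrets

# 32 bytes -> 56 base32 chars including 4 "=" pad chars -> 52 once unpadded.
TOKEN_RE = re.compile(r"\bcm_(session|pat)_[a-z2-7]{52}\b")


def new_token(kind: str) -> str:
    """Mint a token of the form cm_<kind>_<base32 of 32 random bytes>."""
    if kind not in ("session", "pat"):
        raise ValueError(f"unknown token kind: {kind}")
    raw = secrets.token_bytes(32)
    body = base64.b32encode(raw).decode("ascii").rstrip("=").lower()
    return f"cm_{kind}_{body}"


def find_tokens(text: str) -> list:
    """Return every token-shaped substring in text, for leak scanning."""
    return [m.group(0) for m in TOKEN_RE.finditer(text)]
```

A fixed-length body keeps the pattern strict enough to avoid matching ordinary identifiers that merely start with `cm_`.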
|
||||
|
||||
## User flows
|
||||
|
||||
### 1. First-time login (interactive happy path)
|
||||
|
||||
```
|
||||
$ claudemesh login
|
||||
|
||||
██ claudemesh login
|
||||
|
||||
Opening browser for authentication…
|
||||
|
||||
If your browser didn't open, visit:
|
||||
https://claudemesh.com/cli-auth?code=ABCD-EFGH
|
||||
|
||||
Enter this code:
|
||||
ABCD-EFGH
|
||||
|
||||
Waiting for confirmation… ⠋
|
||||
```
|
||||
|
||||
In the browser:
|
||||
1. User lands on `/cli-auth?code=ABCD-EFGH`
|
||||
2. If not signed in, Better Auth login screen appears, then redirects back
|
||||
3. User sees a confirmation card:
|
||||
```
|
||||
Link this CLI session?
|
||||
Code: ABCD-EFGH
|
||||
Device: Alejandro's MacBook Pro · darwin · arm64
|
||||
Expires in 9:47
|
||||
[Approve] [Deny]
|
||||
```
|
||||
4. User clicks Approve
|
||||
|
||||
CLI polls every 1.5s, sees `approved`, receives token, writes `~/.claudemesh/auth.json` with `0600`, prints:
|
||||
|
||||
```
|
||||
✔ Authenticated as Alejandro Gutiérrez
|
||||
✔ Token saved to ~/.claudemesh/auth.json
|
||||
✔ Synced 3 meshes: alexis-mou, dev, claudefarm
|
||||
|
||||
Run claudemesh --help to get started.
|
||||
```
|
||||
|
||||
### 2. First-time login (PAT, non-interactive)
|
||||
|
||||
```
|
||||
$ claudemesh login --token cm_pat_abc123def456...
|
||||
✔ Authenticated as Alejandro Gutiérrez (via PAT "ci-deploy")
|
||||
✔ Token saved to ~/.claudemesh/auth.json
|
||||
```
|
||||
|
||||
Or one-shot, no save:
|
||||
|
||||
```
|
||||
$ CLAUDEMESH_TOKEN=cm_pat_abc123 claudemesh create "test"
|
||||
```
|
||||
|
||||
### 3. Already logged in, runs a command
|
||||
|
||||
```
|
||||
$ claudemesh create "Platform team"
|
||||
✔ Created mesh platform-team (id: q5RI89Fl…)
|
||||
✔ Joined locally
|
||||
▸ Invite peers: claudemesh invite --mesh platform-team
|
||||
```
|
||||
|
||||
No auth prompt — token in `auth.json` is used silently.
|
||||
|
||||
### 4. Token expired or revoked
|
||||
|
||||
```
|
||||
$ claudemesh peers
|
||||
✘ Authentication failed (token expired or revoked)
|
||||
|
||||
Run claudemesh login to re-authenticate.
|
||||
```
|
||||
|
||||
Exit code `2`. The `auth.json` is **not** auto-deleted (user might be debugging) but the next `claudemesh login` overwrites it cleanly.
|
||||
|
||||
### 5. Wizard launch flow with auth integration
|
||||
|
||||
When `claudemesh` (bare, no auth) is run:
|
||||
|
||||
```
|
||||
██ claudemesh
|
||||
|
||||
▸ Sign in (opens browser)
|
||||
Paste a personal access token
|
||||
Join a mesh via invite URL
|
||||
Exit
|
||||
```
|
||||
|
||||
After auth completes, the wizard transitions naturally into the launch flow (mesh picker → name → role → confirm → handoff). One uninterrupted experience from "fresh install" to "Claude Code in a mesh."
|
||||
|
||||
### 6. CI / non-interactive
|
||||
|
||||
```
|
||||
# .github/workflows/test.yml
|
||||
- run: |
|
||||
claudemesh login --token ${{ secrets.CLAUDEMESH_PAT }}
|
||||
claudemesh create "CI run $GITHUB_RUN_ID" --json > mesh.json
|
||||
```
|
||||
|
||||
Or zero-state:
|
||||
|
||||
```
|
||||
- env:
|
||||
CLAUDEMESH_TOKEN: ${{ secrets.CLAUDEMESH_PAT }}
|
||||
run: claudemesh create "CI run $GITHUB_RUN_ID" --json
|
||||
```
|
||||
|
||||
Token resolution order: `--token` flag > `CLAUDEMESH_TOKEN` env var > `~/.claudemesh/auth.json`.
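
The resolution order can be sketched as a small pure function. The type and function names here are illustrative, and treating an empty string as "not provided" is an assumption rather than something this spec states.

```typescript
// Hypothetical shape for the three token sources; names are illustrative.
type TokenSources = {
  flag?: string;      // value of --token, if passed
  env?: string;       // value of CLAUDEMESH_TOKEN, if set
  fileToken?: string; // token read from ~/.claudemesh/auth.json, if present
};

type ResolvedToken = { token: string; source: "flag" | "env" | "file" };

export function resolveToken(sources: TokenSources): ResolvedToken | null {
  // Assumption: empty strings count as "not provided", so an empty
  // CLAUDEMESH_TOKEN does not shadow a valid auth.json.
  if (sources.flag) return { token: sources.flag, source: "flag" };
  if (sources.env) return { token: sources.env, source: "env" };
  if (sources.fileToken) return { token: sources.fileToken, source: "file" };
  return null;
}
```

Returning the source alongside the token lets `whoami` report where the active token came from.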
|
||||
|
||||
### 7. Logout
|
||||
|
||||
```
|
||||
$ claudemesh logout
|
||||
✔ Token revoked on server
|
||||
✔ Removed ~/.claudemesh/auth.json
|
||||
```
|
||||
|
||||
`logout` calls `DELETE /api/my/cli/sessions/current` to revoke server-side, then unlinks the local file. Best-effort: if the server call fails, still delete locally and warn.
|
||||
|
||||
## Architecture
|
||||
|
||||
### Backend — Better Auth `apiKey` plugin
|
||||
|
||||
Better Auth ships an `apiKey` plugin that handles:
|
||||
- Token generation (cryptographically random)
|
||||
- Hashed storage (only the hash hits the DB; raw token never persisted)
|
||||
- Verification middleware (validates `Authorization: Bearer …`)
|
||||
- Per-token metadata (name, scopes, expiry, last-used)
|
||||
- Per-token revocation
|
||||
|
||||
We use it for both PAT and device-code sessions. Device-code sessions just have a marker in metadata distinguishing them from user-generated PATs.
|
||||
|
||||
**Wire-up:** `apps/web/src/lib/auth/index.ts` (or wherever Better Auth is initialized) adds:
|
||||
|
||||
```ts
import { apiKey } from "better-auth/plugins";

export const auth = betterAuth({
  // …existing config
  plugins: [
    // …
    apiKey({
      enableMetadata: true,
      apiKeyHeaders: ["x-api-key", "authorization"],
      defaultPrefix: "cm_",
      rateLimit: { enabled: true, timeWindow: 60_000, maxRequests: 100 },
    }),
  ],
});
```
|
||||
|
||||
### Backend — device-code table
|
||||
|
||||
The `apiKey` plugin doesn't ship device-code flow out of the box. We add a small table + 4 endpoints on top.
|
||||
|
||||
```sql
-- packages/db/migrations/0020_cli-device-code.sql
CREATE TABLE cli_device_code (
  device_code   text PRIMARY KEY,                  -- opaque random, sent to CLI
  user_code     text UNIQUE NOT NULL,              -- short human code: "ABCD-EFGH"
  user_id       text REFERENCES "user"(id),        -- nullable until approved
  api_key_id    text REFERENCES api_key(id),       -- the issued token, set on approve
  device_name   text NOT NULL,                     -- "Alejandro's MacBook Pro"
  device_os     text NOT NULL,                     -- "darwin"
  device_arch   text NOT NULL,                     -- "arm64"
  ip_address    text,                              -- for audit
  user_agent    text,
  status        text NOT NULL DEFAULT 'pending',   -- 'pending' | 'approved' | 'denied' | 'expired'
  created_at    timestamptz NOT NULL DEFAULT now(),
  expires_at    timestamptz NOT NULL,              -- created_at + 10 min
  approved_at   timestamptz
);

-- user_code needs no separate index: its UNIQUE constraint already creates one.
CREATE INDEX cli_device_code_status_expires_idx ON cli_device_code(status, expires_at);
```
|
||||
|
||||
A scheduled job (or lazy cleanup on insert) deletes rows where `status='expired'` AND `expires_at < now() - interval '7 days'`.
|
||||
|
||||
### Backend — endpoints
|
||||
|
||||
All under `apps/web/src/app/api/auth/cli/` (or wherever you keep public auth routes — these need to be **unauthed** since the CLI has no token yet).
|
||||
|
||||
| Method | Path | Auth | Purpose |
|---|---|---|---|
| `POST` | `/api/auth/cli/device-code` | none | CLI requests a new device code. Body: `{ device_name, device_os, device_arch }`. Returns `{ device_code, user_code, expires_at, verification_url }`. |
| `GET` | `/api/auth/cli/device-code/:device_code` | none | CLI polls for status. Returns `{ status: 'pending' \| 'approved' \| 'denied' \| 'expired', token?: string, user?: { id, name, email } }`. Token only present when status=approved, and only **once** (subsequent polls return approved without token). |
| `POST` | `/api/auth/cli/device-code/:user_code/approve` | session | Browser confirms. Creates an `api_key` row with metadata `{ kind: 'session', device_name, device_code }`, sets `cli_device_code.api_key_id`, status=approved. |
| `POST` | `/api/auth/cli/device-code/:user_code/deny` | session | Browser denies. Sets status=denied. |
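
The claim-once behaviour of the polling endpoint can be sketched as a small state machine. This is an in-memory model under stated assumptions, not the real handler: `DeviceCodeStore` is a hypothetical stand-in for the `cli_device_code` table, approval is keyed by `device_code` here for brevity (the real endpoint uses `user_code`), and unknown codes are reported as `expired`.

```typescript
type Status = "pending" | "approved" | "denied" | "expired";

type DeviceCodeRow = {
  status: Status;
  expiresAt: number;    // epoch milliseconds
  token: string | null; // raw token, held only until the CLI claims it
};

type PollResponse = { status: Status; token?: string };

// Hypothetical in-memory stand-in for the cli_device_code table.
export class DeviceCodeStore {
  private rows = new Map<string, DeviceCodeRow>();

  create(deviceCode: string, now: number, ttlMs: number = 10 * 60 * 1000): void {
    this.rows.set(deviceCode, { status: "pending", expiresAt: now + ttlMs, token: null });
  }

  approve(deviceCode: string, token: string, now: number): boolean {
    const row = this.rows.get(deviceCode);
    // Only a pending, unexpired code can be approved.
    if (!row || row.status !== "pending" || now >= row.expiresAt) return false;
    row.status = "approved";
    row.token = token;
    return true;
  }

  poll(deviceCode: string, now: number): PollResponse {
    const row = this.rows.get(deviceCode);
    if (!row) return { status: "expired" }; // assumption: unknown codes look expired
    if (row.status === "pending" && now >= row.expiresAt) row.status = "expired";
    if (row.status === "approved" && row.token !== null) {
      const token = row.token;
      row.token = null; // hand the raw token out exactly once
      return { status: "approved", token };
    }
    return { status: row.status };
  }
}
```

The second poller seeing `approved` without a token is the same case the error-handling table describes for two CLI instances polling at once.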
|
||||
|
||||
Authed endpoints (under `/api/my/cli/`):
|
||||
|
||||
| Method | Path | Purpose |
|---|---|---|
| `GET` | `/api/my/cli/sessions` | List active CLI sessions for the user (devices, last seen, created). |
| `DELETE` | `/api/my/cli/sessions/:id` | Revoke a specific session. |
| `POST` | `/api/my/cli/tokens` | Create a PAT. Body: `{ name, expires_in_days?, scopes? }`. Returns the raw token **once**. |
| `GET` | `/api/my/cli/tokens` | List PATs (no raw values, just metadata). |
| `DELETE` | `/api/my/cli/tokens/:id` | Revoke a PAT. |
|
||||
|
||||
### Backend — middleware
|
||||
|
||||
Existing `enforceAuth` (in `packages/api/src/utils/`) currently reads cookies. Extend it to also accept `Authorization: Bearer cm_…`:
|
||||
|
||||
```ts
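import { TRPCError } from "@trpc/server";
// `auth` is the Better Auth instance exported from the wire-up file above.

export async function enforceAuth(ctx) {
  // HTTP auth schemes are case-insensitive, so accept "bearer" as well as "Bearer".
  const bearer = ctx.req.headers.get("authorization")?.replace(/^Bearer\s+/i, "");
  if (bearer?.startsWith("cm_")) {
    // Better Auth server calls take their input under `body` and return
    // `{ valid, error, key }`; confirm this shape against the installed version.
    const result = await auth.api.verifyApiKey({ body: { key: bearer } });
    if (result.valid && result.key) {
      // The verified key record carries `userId`, not a full user object;
      // load the user from it if downstream code needs more than the id.
      return { userId: result.key.userId, via: "apiKey", apiKey: result.key };
    }
    throw new TRPCError({ code: "UNAUTHORIZED", message: "Invalid token" });
  }
  // …existing cookie-based auth
}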
```
|
||||
|
||||
The `apiKey` plugin handles `last_used_at` updates automatically.
|
||||
|
||||
### Backend — web route
|
||||
|
||||
`apps/web/src/app/[locale]/cli-auth/page.tsx`:
|
||||
|
||||
- Reads `?code=ABCD-EFGH` from query string
|
||||
- If no session, redirects to `/login?next=/cli-auth?code=ABCD-EFGH`
|
||||
- If session, fetches device code metadata via server component, renders confirmation card
|
||||
- Approve button → `POST /api/auth/cli/device-code/:user_code/approve`
|
||||
- Deny button → `POST /api/auth/cli/device-code/:user_code/deny`
|
||||
- After approve, shows: "✓ CLI authenticated. Return to your terminal."
|
||||
|
||||
Mobile-friendly. Confirmation card shows device fingerprint so the user can verify they're approving the right session.
|
||||
|
||||
### Backend — dashboard PAT UI
|
||||
|
||||
`apps/web/src/app/[locale]/dashboard/settings/cli-tokens/page.tsx`:
|
||||
|
||||
- List of existing PATs (name, created, last used, expires)
|
||||
- "Generate new token" button → modal with name + expiry picker
|
||||
- After creation, show raw token once with copy button + warning ("This token will not be shown again")
|
||||
- Per-row revoke button
|
||||
|
||||
Reuses existing dashboard layout. Should be ~150 lines including the modal.
|
||||
|
||||
### CLI — file layout
|
||||
|
||||
```
|
||||
apps/cli/src/
|
||||
├── commands/
|
||||
│ ├── login.ts # NEW
|
||||
│ ├── logout.ts # NEW
|
||||
│ ├── whoami.ts # NEW
|
||||
│ ├── create.ts # rewrite to call API
|
||||
│ ├── invite.ts # NEW
|
||||
│ ├── sync.ts # rewrite to call API
|
||||
│ └── …existing
|
||||
└── lib/
|
||||
├── auth-store.ts # NEW: read/write ~/.claudemesh/auth.json
|
||||
├── api-client.ts # NEW: typed fetch wrapper
|
||||
├── device-info.ts # NEW: collect hostname, os, arch for device-code request
|
||||
└── …existing
|
||||
```
|
||||
|
||||
### CLI — `auth-store.ts`
|
||||
|
||||
```ts
|
||||
// ~/.claudemesh/auth.json
|
||||
type AuthFile = {
|
||||
version: 1;
|
||||
token: string; // cm_session_… or cm_pat_…
|
||||
user: { id: string; name: string; email: string };
|
||||
created_at: string; // ISO
|
||||
source: "device-code" | "pat" | "env";
|
||||
};
|
||||
```
|
||||
|
||||
Read priority: `--token` flag > `CLAUDEMESH_TOKEN` env > `auth.json`.
|
||||
Write only on `login` success. File mode `0600`. Parent dir `0700`.
|
||||
On read, if file mode is too permissive, log a warning and continue.
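
The "too permissive" check can be sketched as a pure function over the numeric mode. It assumes a POSIX file system and that the caller passes the `mode` value reported by `fs.statSync(path)`, which includes file-type bits above the permission bits. Both helper names are hypothetical.

```typescript
// Permission bits granted to group and others. Any of them set means
// auth.json is accessible beyond its owner.
const GROUP_OTHER_MASK = 0o077;

export function isTooPermissive(mode: number): boolean {
  return (mode & GROUP_OTHER_MASK) !== 0;
}

export function describeMode(mode: number): string {
  // Keep only the permission bits and print them in chmod style, e.g. "0600".
  const octal = (mode & 0o777).toString(8);
  return "0" + ("000" + octal).slice(-3);
}
```

A warning message could then read `auth.json has mode ${describeMode(mode)}, expected 0600`.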
|
||||
|
||||
### CLI — `api-client.ts`
|
||||
|
||||
Thin wrapper over `fetch`:
|
||||
|
||||
```ts
|
||||
export class ClaudemeshApi {
|
||||
constructor(private opts: { baseUrl: string; token: string }) {}
|
||||
|
||||
async createMesh(input: { name: string; slug?: string }) { … }
|
||||
async listMeshes() { … }
|
||||
async createInvite(input: { meshId: string; email?: string; role?: string }) { … }
|
||||
async listSessions() { … }
|
||||
async revokeSession(id: string) { … }
|
||||
async whoami() { … }
|
||||
}
|
||||
```
|
||||
|
||||
Type definitions live in `packages/api/src/contracts/cli.ts` (new file) — generated from the existing tRPC routers as plain types so the CLI doesn't need to import the whole tRPC client.
|
||||
|
||||
Base URL from `CLAUDEMESH_API_URL` env var, defaults to `https://claudemesh.com`. Allows local dev against `http://localhost:3000`.
|
||||
|
||||
### CLI — device-code login flow
|
||||
|
||||
```ts
|
||||
// commands/login.ts
|
||||
async function deviceCodeLogin() {
|
||||
const device = collectDeviceInfo();
|
||||
const { device_code, user_code, expires_at, verification_url } =
|
||||
await api.requestDeviceCode(device);
|
||||
|
||||
console.log(` Opening ${verification_url}…`);
|
||||
console.log(` Code: ${user_code}`);
|
||||
|
||||
await openBrowser(`${verification_url}?code=${user_code}`);
|
||||
|
||||
const spinner = ora("Waiting for confirmation").start();
|
||||
const deadline = new Date(expires_at).getTime();
|
||||
|
||||
while (Date.now() < deadline) {
|
||||
await sleep(1500);
|
||||
const result = await api.pollDeviceCode(device_code);
|
||||
if (result.status === "approved") {
|
||||
spinner.succeed("Authenticated");
|
||||
await authStore.write({ token: result.token, user: result.user, source: "device-code" });
|
||||
await syncMeshes();
|
||||
return;
|
||||
}
|
||||
if (result.status === "denied") {
|
||||
spinner.fail("Denied in browser");
|
||||
process.exit(1);
|
||||
}
|
||||
}
|
||||
spinner.fail("Timed out");
|
||||
process.exit(1);
|
||||
}
|
||||
```
|
||||
|
||||
Polls every 1.5s. Server returns `{ slow_down: true }` if polled too fast (rate limit at 1/sec).
|
||||
|
||||
## Security
|
||||
|
||||
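1. **Tokens are hashed at rest.** The Better Auth `apiKey` plugin hashes keys before storing them, so the raw token never reaches the DB. Confirm the hash algorithm against the installed plugin version rather than assuming bcrypt or argon2.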
|
||||
2. **Raw tokens shown to user once.** PATs in dashboard, device-code tokens via `claudemesh login` output. Never logged, never re-displayable.
|
||||
3. **`auth.json` is `0600`.** CLI refuses to write if parent dir can't be made `0700`. Warns on read if mode is wider.
|
||||
4. **Token prefix `cm_` enables secret scanning.** Document the regex `cm_(session|pat)_[a-z0-9]{32,}` in security docs so GitHub secret scanning, GitGuardian, etc. can detect leaks.
|
||||
5. **`/api/auth/cli/device-code/:device_code` polling is rate-limited** to 1 req/sec per IP per device_code. Returns `429` with `slow_down: true` body.
|
||||
6. **Device codes expire in 10 minutes.** Approved-but-unclaimed tokens stay valid (the polling endpoint still returns the token for 60 seconds after approval, then the device_code row is GC'd).
|
||||
7. **Audit logging.** Every device-code approval, PAT creation, and PAT revocation emits an audit event (`auth.cli.session.created`, `auth.cli.pat.created`, etc.). Stored in existing audit log if there is one, otherwise new `audit_log` table.
|
||||
8. **Session invalidation on password change.** When a user changes their password via Better Auth, all of that user's `api_key` rows with metadata `kind: 'session'` are revoked. PATs are NOT auto-revoked (they're explicitly user-managed).
|
||||
9. **Token revocation is immediate.** `auth.api.verifyApiKey` checks DB on every request — no in-memory cache.
|
||||
10. **No CSRF concern** for device-code endpoints — the unauthed ones don't act on user state, the authed ones use Better Auth's existing CSRF protection.
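
The polling limit in item 5 can be sketched as a per-key minimum-interval check. This is an in-memory illustration only: `PollRateLimiter` is a hypothetical name, the choice not to count rejected requests against the client is an assumption, and a real deployment would also need eviction and, with more than one web instance, shared state.

```typescript
type LimitResult = { status: 200 | 429; slow_down?: true };

// Hypothetical limiter: at most one accepted poll per (IP, device_code) per interval.
export class PollRateLimiter {
  private lastAccepted = new Map<string, number>();
  private minIntervalMs: number;

  constructor(minIntervalMs: number = 1000) {
    this.minIntervalMs = minIntervalMs;
  }

  check(ip: string, deviceCode: string, now: number): LimitResult {
    const key = ip + "|" + deviceCode;
    const last = this.lastAccepted.get(key);
    if (last !== undefined && now - last < this.minIntervalMs) {
      // Assumption: rejected polls do not reset the window.
      return { status: 429, slow_down: true };
    }
    this.lastAccepted.set(key, now); // entries are never evicted in this sketch
    return { status: 200 };
  }
}
```

With the CLI polling every 1.5s, a well-behaved client should never hit the 429 path.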
|
||||
|
||||
## Wizard UX integration
|
||||
|
||||
The current welcome wizard already has:
|
||||
```
|
||||
▸ Create account (new to claudemesh)
|
||||
Sign in (existing account)
|
||||
Paste an invite URL
|
||||
Exit
|
||||
```
|
||||
|
||||
After this spec lands, the welcome screen becomes:
|
||||
```
|
||||
██ claudemesh
|
||||
|
||||
▸ Sign in ← device-code OAuth
|
||||
Paste an access token ← PAT path
|
||||
Join via invite URL ← unchanged
|
||||
Create account ← opens /register, then back to login
|
||||
Exit
|
||||
```
|
||||
|
||||
"Sign in" becomes the headline option. The current "Create account" still opens browser to `/register` but flows back through the device-code handshake instead of a custom callback.
|
||||
|
||||
Once authenticated, the wizard transitions to:
|
||||
```
|
||||
██ claudemesh launch
|
||||
|
||||
Account ✔ Alejandro Gutiérrez
|
||||
Mesh ▸ (pick one — 3 available)
|
||||
Name ✔ Alexis (from --name)
|
||||
Role ▸ (pick one)
|
||||
|
||||
▸ Continue
|
||||
Cancel
|
||||
```
|
||||
|
||||
Status rows show what's filled and what's left. Mesh picker fetches from `GET /api/my/meshes` via the freshly minted token.
|
||||
|
||||
This integrates cleanly with the wizard architecture refactor in `2026-04-10-cli-wizard-architecture-refactor.md`: auth becomes one screen in the launch flow with `isComplete: s => s.user !== null`. On a fresh machine the auth screen runs; on a returning machine it's auto-skipped.
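
A minimal sketch of how an `isComplete` predicate lets the auth screen be skipped on a returning machine. The `WizardState` and `FlowStep` shapes are assumptions for illustration; the refactor spec may define steps with more fields.

```typescript
type WizardState = { user: { name: string } | null; mesh: string | null };

type FlowStep = {
  id: string;
  isComplete: (s: WizardState) => boolean;
};

// Hypothetical step list for the launch flow.
const launchFlow: FlowStep[] = [
  { id: "auth", isComplete: (s) => s.user !== null },
  { id: "mesh", isComplete: (s) => s.mesh !== null },
];

export function nextStep(state: WizardState, flow: FlowStep[] = launchFlow): string | null {
  // The first incomplete step is the one to render; completed steps are skipped.
  for (const step of flow) {
    if (!step.isComplete(state)) return step.id;
  }
  return null; // every step is complete, so the flow can hand off
}
```

On a fresh machine `nextStep` lands on `auth`; once `auth.json` has populated `user`, the same call goes straight to the mesh picker.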
|
||||
|
||||
## Error handling
|
||||
|
||||
| Scenario | Behavior |
|---|---|
| Browser doesn't open | Print URL prominently, keep polling |
| Network down during poll | Retry with exponential backoff (1.5s → 3s → 6s, max 30s) |
| Device code expires | Print "Login timed out, run `claudemesh login` to retry", exit 1 |
| Token rejected by API | Print "Authentication failed", suggest `claudemesh login`, exit 2 |
| `auth.json` corrupted | Print "Auth file corrupted, run `claudemesh login`", exit 2 |
| `auth.json` permissions wrong | Warn, fix to `0600`, continue |
| PAT pasted to `--token` is malformed | Print "Invalid token format (expected `cm_pat_…`)", exit 1 |
| PAT pasted to `--token` is valid format but unknown | API returns 401, print "Token rejected", exit 2 |
| Two CLI instances poll simultaneously | Both get the same approved status; first to read gets the token, second gets `{ status: 'approved', token: null }` (already_claimed). Document this. |
| User clicks Approve in browser, then closes tab | CLI's poll catches it, login succeeds. The browser tab closure is irrelevant. |
| User completes login on machine A, then runs `claudemesh login` on machine B with same account | Both sessions coexist as separate `api_key` rows. `claudemesh whoami --sessions` shows both. |
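
The backoff row in the table above can be sketched as a pure delay function. It assumes the delay doubles from 1.5s per consecutive failure and is capped at 30s; `pollDelayMs` is a hypothetical helper name.

```typescript
// Delay before the next poll after `consecutiveFailures` network errors in a row.
// Assumption: doubling from 1.5s, capped at 30s, reset to 0 failures on any success.
export function pollDelayMs(consecutiveFailures: number): number {
  const baseMs = 1500;
  const maxMs = 30000;
  return Math.min(baseMs * Math.pow(2, consecutiveFailures), maxMs);
}
```

The polling loop would call `pollDelayMs(failures)` before each attempt, incrementing `failures` on a network error and resetting it after any successful response.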
|
||||
|
||||
## Implementation phases
|
||||
|
||||
Each phase ships independently and is independently testable.
|
||||
|
||||
### Phase 1 — Backend foundation (4–6 hours)
|
||||
|
||||
- [ ] Wire Better Auth `apiKey` plugin in `apps/web/src/lib/auth/`
|
||||
- [ ] Migration `0020_cli-device-code.sql`
|
||||
- [ ] Drizzle schema for `cli_device_code` in `packages/db/src/schema/auth.ts`
|
||||
- [ ] Endpoints: `POST /api/auth/cli/device-code`, `GET /api/auth/cli/device-code/:device_code`, `POST /api/auth/cli/device-code/:user_code/approve`, `POST /api/auth/cli/device-code/:user_code/deny`
|
||||
- [ ] Extend `enforceAuth` middleware to accept `Authorization: Bearer cm_…`
|
||||
- [ ] Endpoints: `POST /api/my/cli/tokens`, `GET /api/my/cli/tokens`, `DELETE /api/my/cli/tokens/:id`, `GET /api/my/cli/sessions`, `DELETE /api/my/cli/sessions/:id`
|
||||
- [ ] Unit tests for token verification and device-code state machine
|
||||
|
||||
### Phase 2 — Web routes (3–4 hours)
|
||||
|
||||
- [ ] `/cli-auth?code=...` page (server component + approve/deny client component)
|
||||
- [ ] `/dashboard/settings/cli-tokens` page (list + create modal + revoke)
|
||||
- [ ] Translations for both pages (en, es)
|
||||
- [ ] E2E test: full device-code happy path with Playwright
|
||||
|
||||
### Phase 3 — CLI auth core (4–5 hours)
|
||||
|
||||
- [ ] `lib/device-info.ts` — collect hostname, os, arch
|
||||
- [ ] `lib/auth-store.ts` — read/write `~/.claudemesh/auth.json` with mode checks
|
||||
- [ ] `lib/api-client.ts` — typed fetch wrapper with bearer header
|
||||
- [ ] `commands/login.ts` — device-code flow + `--token` PAT path
|
||||
- [ ] `commands/logout.ts` — revoke + delete local
|
||||
- [ ] `commands/whoami.ts` — print current identity + token source
|
||||
- [ ] Token resolution helper (`--token` > `CLAUDEMESH_TOKEN` > `auth.json`)
|
||||
- [ ] Unit tests for auth-store and token resolution
|
||||
|
||||
### Phase 4 — CLI commands wired to API (3–4 hours)
|
||||
|
||||
- [ ] Rewrite `commands/create.ts` to call `POST /api/my/meshes`
|
||||
- [ ] New `commands/invite.ts` with `--email`, `--mesh`, `--role`, `--expires-in`
|
||||
- [ ] Rewrite `commands/sync.ts` to call `GET /api/my/meshes` and reconcile local config
|
||||
- [ ] Update `commands/list.ts` to show server-side meshes too
|
||||
- [ ] Integration tests against staging broker + web
|
||||
|
||||
### Phase 5 — Wizard integration (3–4 hours)
|
||||
|
||||
- [ ] Welcome screen new options (Sign in / Paste token / Create account / Join invite)
|
||||
- [ ] Auth screen as a flow step with `isComplete: s => s.user !== null`
|
||||
- [ ] Status rows pattern showing auth state during launch
|
||||
- [ ] First-run detection (no `auth.json`) → auto-route to login
|
||||
|
||||
### Phase 6 — Polish, docs, ship (2–3 hours)
|
||||
|
||||
- [ ] Update `README.md`, `apps/cli/README.md`, `docs/quickstart.md`
|
||||
- [ ] CHANGELOG entry for v0.11.0
|
||||
- [ ] Telemetry events `cli.login.{started,succeeded,failed}` (names as listed under Telemetry)
|
||||
- [ ] Bump `apps/cli/package.json` to `0.11.0`
|
||||
- [ ] Publish to npm
|
||||
- [ ] Deploy broker / web (no broker changes, web for new routes)
|
||||
|
||||
**Total estimate:** 19–26 hours of focused work. Realistic: 3–4 days with testing and review.
|
||||
|
||||
## Dependencies between phases
|
||||
|
||||
```
|
||||
Phase 1 (backend) ──┬─→ Phase 2 (web routes)
|
||||
└─→ Phase 3 (CLI auth core)
|
||||
│
|
||||
└─→ Phase 4 (commands)
|
||||
│
|
||||
└─→ Phase 5 (wizard)
|
||||
│
|
||||
└─→ Phase 6 (ship)
|
||||
```
|
||||
|
||||
Phase 1 and 2 can be parallelized after the schema lands. Phase 3 needs Phase 1 endpoints live (even if on staging). Phase 4 onwards is strictly serial.
|
||||
|
||||
## Telemetry
|
||||
|
||||
Emit these events (PostHog or whatever the existing analytics are):
|
||||
|
||||
- `cli.login.started` — properties: `{ method: 'device-code' | 'pat' }`
|
||||
- `cli.login.succeeded` — properties: `{ method, user_id }`
|
||||
- `cli.login.failed` — properties: `{ method, reason }`
|
||||
- `cli.logout` — properties: `{ user_id }`
|
||||
- `cli.command.executed` — properties: `{ command, exit_code, duration_ms, authenticated: boolean }`
|
||||
- `cli.api.error` — properties: `{ endpoint, status, error_code }`
|
||||
|
||||
Telemetry is **opt-out**. First run shows a one-line notice: "claudemesh collects anonymized usage telemetry. Disable with `claudemesh telemetry off`."
|
||||
|
||||
## Open questions
|
||||
|
||||
1. **Better Auth `apiKey` plugin version** — confirm it's installed and at a version that supports `enableMetadata`. Check `pnpm why better-auth` in `apps/web`.
|
||||
2. **Audit log table** — does one already exist? If not, this spec adds three rows of log; not worth a new table for that. Use `console.log` with structured JSON to stderr and let the platform's log collector handle it.
|
||||
3. **Email sending** — `claudemesh invite --email` requires a transactional email path. Does the web app already have one (Resend, Postmark)? If yes, reuse. If no, defer the email send to a follow-up; the invite command can still create the invite and print the URL.
|
||||
4. **Token scopes** — v1 ships with no scopes; every token has full account access. Should we add `mesh:read`, `mesh:write`, `invite:create` scopes from day one, or wait? **Recommendation:** wait. YAGNI. Add when a user actually wants a read-only CI token.
|
||||
5. **PAT expiry default** — 90 days? 1 year? Never? Better Auth supports all three. **Recommendation:** 1 year default, user can pick "never" with explicit warning.
|
||||
6. **Mesh slug uniqueness in `claudemesh create`** — what happens if two users try to create meshes with the same slug? Existing API behavior should be tested. If it errors, the CLI should suggest `--slug platform-team-2`.
|
||||
7. **`claudemesh login` when already logged in** — re-authenticate (overwrite) or error ("already logged in, run logout first")? **Recommendation:** re-authenticate silently with a one-line notice ("Replacing existing session for Alejandro").
|
||||
|
||||
## Acceptance criteria
|
||||
|
||||
For v0.11.0 to ship, all of these must be true:
|
||||
|
||||
- [ ] `claudemesh login` on a fresh machine (no `auth.json`) opens browser, completes device-code flow, writes `auth.json`, runs in <30 seconds end-to-end
|
||||
- [ ] `claudemesh login --token cm_pat_…` works without browser
|
||||
- [ ] `claudemesh logout` revokes server-side and deletes local file
|
||||
- [ ] `claudemesh whoami` prints user identity and token source
|
||||
- [ ] `claudemesh create "Test mesh"` creates a real mesh on the server, joins it locally, and the user can see it on the dashboard
|
||||
- [ ] `claudemesh invite --email a@b.c --mesh test` creates an invite and prints the URL
|
||||
- [ ] `claudemesh launch` (bare) on a fresh machine walks login → mesh picker → name/role → Claude Code, all in one wizard
|
||||
- [ ] Dashboard `/dashboard/settings/cli-tokens` lists, creates, and revokes PATs
|
||||
- [ ] All flows work in `en` and `es`
|
||||
- [ ] Existing `claudemesh launch` invocations (with token already in `auth.json`) still work without prompting
|
||||
- [ ] Token in `auth.json` survives an hour of idle and continues to work (no aggressive expiry)
|
||||
- [ ] Revoking a token in the dashboard makes the next CLI call fail with a clear error
|
||||
- [ ] Documentation updated in `README.md`, `apps/cli/README.md`, `docs/quickstart.md`
|
||||
- [ ] CHANGELOG entry written
|
||||
- [ ] Published to npm as `claudemesh-cli@0.11.0`
|
||||
|
||||
## What this unlocks
|
||||
|
||||
Once this lands, every dashboard-only feature becomes one CLI command away. Future specs that depend on this:
|
||||
|
||||
- `claudemesh members list` / `claudemesh members add`
|
||||
- `claudemesh billing usage`
|
||||
- `claudemesh mesh archive`
|
||||
- `claudemesh stream subscribe` (live broker events)
|
||||
- `claudemesh skill publish` (publish a skill to mesh registry)
|
||||
- `claudemesh log tail` (mesh-wide audit log)
|
||||
|
||||
This is the foundational unlock. Everything else is incremental on top.
|
||||
1490
.artifacts/specs/2026-04-10-cli-v2-pass2-facade-pattern.md
Normal file
1610
.artifacts/specs/2026-04-10-cli-v2-pass2-final-vision.md
Normal file
2060
.artifacts/specs/2026-04-10-cli-v2-pass2-local-first-storage.md
Normal file
1481
.artifacts/specs/2026-04-10-cli-v2-pass2-shared-infrastructure.md
Normal file
1702
.artifacts/specs/2026-04-10-cli-v2-pass2-ux-design.md
Normal file
1157
.artifacts/specs/2026-04-11-cli-v2-pass1.md
Normal file
87
.artifacts/specs/2026-04-15-broker-ha-statelessness-audit.md
Normal file
@@ -0,0 +1,87 @@
|
||||
# Broker HA readiness — statelessness audit
|
||||
|
||||
Single-instance broker is the biggest GA blocker. Moving to 2+ replicas
|
||||
behind a load balancer requires first understanding which state the broker
|
||||
holds in-process that breaks if split across nodes.
|
||||
|
||||
## Current in-process state (apps/broker/src/index.ts)
|
||||
|
||||
| Symbol | Line | Per-node? | Survives HA? | Notes |
|--------|------|-----------|--------------|-------|
| `connections` | 147 | yes (WS state) | ✅ naturally per-node | WS connections are pinned to a node by L7 routing. Each node holds only its own connections. **OK as long as the LB uses sticky sessions or cross-node fan-out.** |
| `connectionsPerMesh` | 148 | yes | 🟡 per-node count, not global | Used for capacity cap. Global cap requires Redis. |
| `tgTokenRateLimit` | 151 | yes | 🟡 per-node | Telegram bot rate limiting; tolerable as per-node. |
| `urlWatches` | 173 | yes | 🔴 stuck on one node | If peer disconnects from node A and reconnects on B, the watch stays orphaned on A. **Needs DB/Redis, or "pin to owning node". Acceptable risk if watches are per-session ephemeral.** |
| `streamSubscriptions` | 259 | yes | 🔴 multi-node broken | Sub on A, publish on B → message never reaches A's subscribers. **Needs Redis pub/sub for HA.** |
| `meshClocks` | 270 | yes | 🔴 multi-node broken | Simulated clocks must be single-authority. Solve by pinning one node as clock leader (simple leader election) or by moving clock state to DB. |
| `mcpRegistry` | 327 | yes | 🔴 multi-node broken | MCP server catalog cached in memory. If deployed on A but called on B, B doesn't know it exists. **Must be DB-backed** (partly is already; see the `mesh_service` table). Audit the cache/DB sync path. |
| `mcpCallResolvers` | 338 | yes | ✅ per-call ephemeral | In-flight callback resolvers; WS sticks to owning node so this is fine. |
| `scheduledMessages` | 359 | yes | 🔴 multi-node broken | Scheduled delivery timers live in-process. Restart loses them. Persistence exists (`scheduled_message` table) + recovery on startup, but two nodes could both fire the same timer. **Needs a leader lock or per-schedule pg_advisory_lock on fire.** |
| `sendRateLimit` | 494 | yes | 🟡 per-node | Each node enforces its own quota; a client spread across nodes could 2x the limit. Tolerable if sticky sessions hold. |
| `hookRateLimit` | 482 | yes | 🟡 per-node | Same as sendRateLimit. |
| `lastHash` | audit.ts:22 | yes | 🔴 broken on write | Two nodes writing audit rows concurrently will BOTH read the same last hash, BOTH compute a new hash, and both INSERT, so the chain forks. **Needs `SELECT FOR UPDATE` or a single audit writer.** |
|
||||
|
||||
## Conclusion
|
||||
|
||||
**Current broker is NOT HA-safe.** Six symbols break under multi-instance:
|
||||
`urlWatches`, `streamSubscriptions`, `meshClocks`, `mcpRegistry` cache,
|
||||
`scheduledMessages`, `lastHash`. None are unsolvable, but none are
|
||||
trivial.
|
||||
|
||||
## Rollout plan for HA
|
||||
|
||||
### Phase 0 (now) — sticky sessions
|
||||
Deploy a single broker behind Traefik with `loadBalancer.sticky.cookie`
|
||||
enabled. WS upgrade inherits the cookie, so reconnects land on the same
|
||||
node. Gives us 1 node of safe HA headroom (i.e., one deploy rollover
|
||||
without user-visible disconnection) without any code changes.
|
||||
|
||||
### Phase 1 — Active/passive
|
||||
Two replicas. Traefik routes all traffic to primary; secondary is warm.
|
||||
Primary fails → secondary takes over, all WS connections reset. No code
|
||||
change needed; clients auto-reconnect.
|
||||
|
||||
### Phase 2 — Active/active for stateless routes
|
||||
HTTP-only routes (`/cli/*`, `/download`, `/hook`) can round-robin across
|
||||
any number of replicas today. WS routes stay sticky per mesh via Traefik
|
||||
`sticky.cookie`. Already behind Postgres → each replica reads the same
|
||||
mesh/member/invite rows.
|
||||
|
||||
### Phase 3 — Full active/active
|
||||
Migrate the 6 problematic in-memory symbols:
|
||||
- `streamSubscriptions` → Redis pub/sub
|
||||
- `meshClocks` → leader-elect via Postgres advisory lock on mesh_id
|
||||
- `scheduledMessages` → single-writer pattern: whichever replica holds
|
||||
`pg_advisory_xact_lock(schedule_id)` fires
|
||||
- `urlWatches` → DB-backed + each replica owns watches where
|
||||
`presence.node_id = this_node`
|
||||
- `mcpRegistry` → rely on `mesh_service` table, drop the in-memory cache
|
||||
- `lastHash` → wrap audit.ts writes in a transaction that
|
||||
`SELECT hash FROM audit_log ... ORDER BY id DESC FOR UPDATE`, making
|
||||
concurrent inserts serialize.
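
The `lastHash` fork, and why serializing read-plus-insert prevents it, can be shown with a small in-memory model. This is a sketch under stated assumptions: `toyHash` is a non-cryptographic stand-in for the real hash, `AuditLog` stands in for the `audit_log` table, and `appendSerialized` models whatever critical section the real fix uses (a row lock or an advisory lock). Whichever lock is chosen, it is worth verifying under the isolation level in use that a waiting writer re-reads the newest row once the lock is released.

```typescript
type AuditRow = { prevHash: string; hash: string; payload: string };

// Stand-in for a real cryptographic hash (FNV-1a, 32-bit). NOT suitable for a real audit chain.
function toyHash(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// Hypothetical in-memory stand-in for the audit_log table.
export class AuditLog {
  rows: AuditRow[] = [];

  lastHash(): string {
    return this.rows.length > 0 ? this.rows[this.rows.length - 1].hash : "genesis";
  }

  insert(prevHash: string, payload: string): void {
    this.rows.push({ prevHash, hash: toyHash(prevHash + payload), payload });
  }

  // Models read + insert inside one critical section (one lock holder at a time).
  appendSerialized(payload: string): void {
    this.insert(this.lastHash(), payload);
  }
}

// A chain is linear when every row points at the hash of the row before it.
export function chainIsLinear(rows: AuditRow[]): boolean {
  let expected = "genesis";
  for (const row of rows) {
    if (row.prevHash !== expected) return false;
    expected = row.hash;
  }
  return true;
}
```

Calling `lastHash()` twice before either `insert` reproduces the two-node race from the audit table; routing both writes through `appendSerialized` keeps the chain linear.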
|
||||
|
||||
### Phase 4 — Multi-region
|
||||
The current deployment is a single point of failure in Frankfurt (OVH).
Move to a managed Postgres with read replicas, one broker cluster per
region, and global DNS geo-routing. Out of scope for v1.0.0.
|
||||
|
||||
## Immediate ship: local docker-compose for 2-replica smoke test
|
||||
|
||||
`packaging/docker-compose.ha-local.yml` (TODO) spins up:
|
||||
- 2x broker (same DATABASE_URL)
|
||||
- 1x postgres
|
||||
- 1x traefik with sticky cookie
|
||||
- 1x locust / synthetic client
|
||||
|
||||
Tests:
|
||||
1. Send to peer connected on node A → delivered.
|
||||
2. Subscribe on A, publish on B → expect failure (documents the gap).
|
||||
3. Kill node A → client reconnects to B within Xs.
|
||||
4. Audit chain verify after concurrent writes from both nodes → expect
|
||||
a fork (documents the gap).
|
||||
|
||||
## Decision
|
||||
|
||||
**Ship v1.0.0 on sticky-session single-writer (Phase 0 + Phase 1 warm
|
||||
standby).** That closes the "what happens on deploy" story. Phase 3 full
|
||||
HA is v1.1.0 work.
|
||||
@@ -0,0 +1,71 @@
|
||||
# Feature request draft: rich `<channel>` notification UI
|
||||
|
||||
**Target:** `anthropics/claude-code` GitHub issues / feedback channel.
|
||||
**Drafted:** 2026-04-15.
|
||||
|
||||
Paste the section below once the issue template is ready. Adjust tone
|
||||
to match Claude Code's issue style.
|
||||
|
||||
---
|
||||
|
||||
### Title
|
||||
|
||||
Rich UI for `notifications/claude/channel` messages (first-class chat, not just reminders)
|
||||
|
||||
### Body
|
||||
|
||||
**Summary**
|
||||
|
||||
MCP servers can emit `notifications/claude/channel` notifications which
|
||||
Claude Code renders inside the current turn as a `<channel>` reminder.
|
||||
For MCP servers that are conversational in nature (peer messaging,
|
||||
collaborative sessions, delegated agents), rendering these inline as
|
||||
plain-text reminders misses the UX affordances users expect from chat:
|
||||
|
||||
- sender avatar / identity
|
||||
- timestamp
|
||||
- priority badge (urgent / normal / low)
|
||||
- expandable quote from the original thread
|
||||
- optional inline reply action that calls a specific MCP tool
|
||||
|
||||
**Concrete use case**
|
||||
|
||||
[claudemesh](https://claudemesh.com) is a peer mesh for Claude Code
|
||||
sessions. When a peer sends a message it arrives as
|
||||
`notifications/claude/channel` with structured metadata in `meta`:
|
||||
|
||||
```json
{
  "method": "notifications/claude/channel",
  "params": {
    "content": "alice: can you rebase main before deploy?",
    "meta": {
      "from_id": "<ed25519 hex>",
      "from_name": "alice",
      "priority": "now",
      "sent_at": "2026-04-15T00:00:00Z",
      "mesh_slug": "team-platform",
      "kind": "direct"
    }
  }
}
```
|
||||
|
||||
Today this renders as a `<channel>` text block — useful, but the user
|
||||
can't tell at a glance that it's from another human.
|
||||
|
||||
**What we'd like**
|
||||
|
||||
A hint on the notification (e.g. `meta.display: "chat"`) that lets
|
||||
Claude Code render it as a chat bubble with the `from_name` as the
|
||||
speaker, priority visualised, and an optional "Reply" action bound to
|
||||
a declared MCP tool (`reply_tool_name`).
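
As a rough illustration of the additive behavior we have in mind, a client could pick the rendering as sketched below. Everything here is hypothetical: `display` and `reply_tool_name` are the proposed fields, and `chooseRendering` is not an existing Claude Code function.

```typescript
type Rendering = "chat" | "text";

// Subset of the `meta` object; `display` and `reply_tool_name` are proposed, not shipped.
interface ChannelMeta {
  from_name?: string;
  priority?: string;
  display?: string;
  reply_tool_name?: string;
}

// Falls back to the existing plain-text rendering unless the hint is present
// and there is a speaker name to show in the bubble.
function chooseRendering(meta: ChannelMeta | undefined): Rendering {
  if (meta?.display === "chat" && typeof meta.from_name === "string" && meta.from_name.length > 0) {
    return "chat";
  }
  return "text";
}
```

Clients that ignore the hint keep today's behavior, so servers can start sending it without coordinating a release.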
|
||||
|
||||
**Why users would benefit beyond claudemesh**
|
||||
|
||||
- Delegated agent frameworks can render sub-agent responses as chat
|
||||
- Live-pairing MCP servers get a proper UI without inventing their own
|
||||
- The existing `<channel>` fallback means older clients still see
|
||||
the same text — additive, not breaking
|
||||
|
||||
**Willing to contribute a PR** if the feature is on-roadmap.
|
||||
58
.artifacts/specs/2026-04-15-cli-distribution-pipeline.md
Normal file
@@ -0,0 +1,58 @@
|
||||
# CLI Distribution Pipeline
|
||||
|
||||
## Status
|
||||
- Shell installer (`/install`): ✅ live, needs polish
|
||||
- Single-binary build script (`scripts/build-binaries.ts`): ✅ written, not wired to CI
|
||||
- GitHub Releases publish: ❌ not set up
|
||||
- Homebrew tap: ❌ not set up
|
||||
- winget manifest: ❌ not set up
|
||||
|
||||
## Shipped this session (alpha.28)
|
||||
- `bun build --compile` script at `apps/cli-v2/scripts/build-binaries.ts` produces
|
||||
`dist/bin/claudemesh-{darwin,linux,windows}-{x64,arm64}` locally.
|
||||
- `/install` updated to use the one-command `claudemesh <invite-url>` flow.
|
||||
- `claudemesh url-handler install` registers the `claudemesh://` scheme on the three OSes.
|
||||
|
||||
## What's missing
|
||||
|
||||
### 1. GitHub Actions to build + publish binaries
|
||||
```yaml
# .github/workflows/release-binaries.yml
on: { push: { tags: ['v*'] } }
jobs:
  build:
    runs-on: ubuntu-latest # assumed runner; a job needs one to be valid
    strategy: { matrix: { target: [darwin-x64, darwin-arm64, linux-x64, linux-arm64, windows-x64] } }
    steps:
      - uses: actions/checkout@v4 # assumed; the build needs the repo contents
      - uses: oven-sh/setup-bun@v2
      - run: cd apps/cli-v2 && bun install --frozen-lockfile
      - run: cd apps/cli-v2 && bun run scripts/build-binaries.ts
      - uses: softprops/action-gh-release@v2
        with: { files: apps/cli-v2/dist/bin/* }
```
|
||||
|
||||
### 2. `/install` detects missing Node and downloads a binary
|
||||
Current `/install` requires Node 20+. Next iteration: detect absence, curl the
|
||||
right binary from GitHub Releases, drop it in `~/.claudemesh/bin/`, add to PATH.
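
The "right binary" selection could look roughly like the sketch below. The real `/install` is a shell script, so this TypeScript is only an illustration of the mapping. The target names follow the `claudemesh-{darwin,linux,windows}-{x64,arm64}` pattern above; the `.exe` suffix on Windows and the `binaryName` helper itself are assumptions.

```typescript
// Maps Node-style platform/arch identifiers to the release artifact name.
const OS_NAMES = new Map<string, string>([
  ["darwin", "darwin"],
  ["linux", "linux"],
  ["win32", "windows"],
]);
const ARCH_NAMES = new Map<string, string>([
  ["x64", "x64"],
  ["arm64", "arm64"],
]);

function binaryName(platform: string, arch: string): string {
  const os = OS_NAMES.get(platform);
  const cpu = ARCH_NAMES.get(arch);
  if (os === undefined || cpu === undefined) {
    throw new Error(`unsupported platform: ${platform}/${arch}`);
  }
  // Assumption: Windows artifacts carry an .exe suffix.
  return `claudemesh-${os}-${cpu}${os === "windows" ? ".exe" : ""}`;
}
```

Called as `binaryName(process.platform, process.arch)`, it fails loudly on unsupported targets instead of downloading a binary that cannot run.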
|
||||
|
||||
### 3. Homebrew tap (`homebrew-claudemesh`)
|
||||
Separate repo with a formula that points at the GitHub Release artifact.
|
||||
Users: `brew install alezmad/claudemesh/claudemesh`. Auto-updated by the
|
||||
release workflow via `brew bump-formula-pr`.
|
||||
|
||||
### 4. winget manifest
|
||||
YAML in `microsoft/winget-pkgs` repo pointing at the Windows .exe.
|
||||
|
||||
### 5. Auto-update in-CLI
|
||||
Already have `showUpdateNotice`. Upgrade to offer `claudemesh upgrade` that
|
||||
re-runs `/install` OR downloads a new binary in place.
|
||||
|
||||
## Why this matters
|
||||
Current state: users need Node, npm, and patience. Goal state:
|
||||
```
|
||||
curl -fsSL claudemesh.com/install | sh
|
||||
```
|
||||
…and that's it, on any OS, with or without Node.
|
||||
|
||||
## Priority
|
||||
After tier-1 usability (done), this is the next biggest lever for adoption.
|
||||
Estimate: 1-2 days for full pipeline, mostly CI config + release testing.
|
||||
152
.artifacts/specs/2026-04-15-crypto-review-packet.md
Normal file
@@ -0,0 +1,152 @@
|
||||
# claudemesh crypto — external review packet
|
||||
|
||||
**Goal:** 2-day review of the claudemesh cryptographic surface by an
|
||||
external reviewer familiar with libsodium, x25519/ed25519, authenticated
|
||||
encryption, and hash-chain audit logs.
|
||||
|
||||
**Status:** self-audited + Codex-reviewed. Not yet reviewed by an
|
||||
independent human with security expertise.
|
||||
|
||||
## Scope
|
||||
|
||||
### Files in scope
|
||||
|
||||
| File | LoC | What it does |
|---|---|---|
| `apps/broker/src/crypto.ts` | ~400 | Hello signature verification, canonical invite bytes (v1+v2), `sealRootKeyToRecipient` via `crypto_box_seal`, `verifyInviteV2`, `claimInviteV2Core` (gated). |
| `apps/broker/src/broker-crypto.ts` | 70 | AES-256-GCM encryption-at-rest for MCP env vars. Key from `BROKER_ENCRYPTION_KEY` or ephemeral in dev. |
| `apps/broker/src/audit.ts` | ~250 | Hash-chained audit log. Canonical JSON payload hash, per-mesh `pg_advisory_xact_lock` for concurrent writers. |
| `apps/cli/src/services/crypto/box.ts` | 60 | `crypto_box_easy` / `crypto_box_open_easy` wrappers that accept ed25519 keys and convert to curve25519 via `crypto_sign_*_to_curve25519`. |
| `apps/cli/src/services/crypto/keypair.ts` | ~50 | `generateKeypair` wrapping `crypto_sign_keypair`. |
| `apps/cli/src/commands/backup.ts` | ~180 | Config backup via Argon2id + XChaCha20-Poly1305 (`crypto_aead_xchacha20poly1305_ietf_*`) from a user passphrase. |
| `apps/cli/src/services/invite/parse-v1.ts` | ~160 | Invite payload decode + signature verification, URL parsing, short-code resolution. |
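
For orientation, the at-rest scheme in the `broker-crypto.ts` row (AES-256-GCM with a 12-byte IV and a 16-byte tag) can be sketched with Node's built-in `node:crypto`. This is not the broker's code: the function names and the `iv || tag || ciphertext` layout are assumptions for illustration.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

const IV_LEN = 12;  // bytes
const TAG_LEN = 16; // bytes

// Assumed blob layout for this sketch: iv || tag || ciphertext.
function encryptAtRest(key: Buffer, plaintext: string): Buffer {
  const iv = randomBytes(IV_LEN); // fresh random IV for every encryption
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]);
}

function decryptAtRest(key: Buffer, blob: Buffer): string {
  const iv = blob.subarray(0, IV_LEN);
  const tag = blob.subarray(IV_LEN, IV_LEN + TAG_LEN);
  const ciphertext = blob.subarray(IV_LEN + TAG_LEN);
  // Pinning authTagLength rejects truncated tags.
  const decipher = createDecipheriv("aes-256-gcm", key, iv, { authTagLength: TAG_LEN });
  decipher.setAuthTag(tag);
  // final() throws if the tag does not authenticate the ciphertext.
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

The two properties a reviewer would want to confirm in the real file are the ones this sketch makes explicit: the IV is drawn fresh per call, and the tag length is fixed on the decrypt side.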
|
||||
|
||||
### Out of scope
|
||||
|
||||
- TLS config (Traefik termination)
|
||||
- Postgres at-rest disk encryption
|
||||
- Homebrew/winget binary signing pipeline
|
||||
- Secrets storage on the user's machine (we rely on OS file mode 0600)
|
||||
|
||||
## Threat model
|
||||
|
||||
### Adversary profile
|
||||
|
||||
- **Network attacker** on the wire between CLI and broker. Controls
|
||||
DNS, can inject packets, can replay. TLS terminates at Traefik;
|
||||
assume TLS is trusted.
|
||||
- **Malicious broker** operator. Can read any row in Postgres.
|
||||
- **Mesh peer** with a valid member record. Can try to escalate
|
||||
privileges, impersonate other members, replay, DoS, exfiltrate
|
||||
other members' messages.
|
||||
- **Laptop thief** who has the user's `~/.claudemesh/` directory but
|
||||
not the login password. (Keys on disk at mode 0600.)
|
||||
|
||||
### Must hold
|
||||
|
||||
- E2E: broker cannot read plaintext of direct messages.
|
||||
- Signature: no member can forge messages signed as another member.
|
||||
- Invite integrity: modifying an invite URL invalidates the signature.
|
||||
- Backup secrecy: an attacker with the backup file but not the
|
||||
passphrase learns nothing.
|
||||
- Audit integrity: tampering with an audit row breaks chain
|
||||
verification.
|
||||
|
||||
### Known weaknesses (deliberate)
|
||||
|
||||
- **root_key in v1 invite URL**: current long URL form carries the
|
||||
mesh root key in base64(JSON). Short-URL mode (`/i/<code>`) resolves
|
||||
to the same token server-side, so this does NOT reduce the exposure.
|
||||
v2 protocol moves root_key out of the URL but CLI migration is not
|
||||
yet shipped.
|
||||
- **Session-key routing identity**: a peer can claim arbitrary
|
||||
`sessionPubkey` in hello (validated as 64-hex in alpha.36 but not
|
||||
proven-own). Proof-of-secret-key for session key is not enforced.
|
||||
Impact: a peer can route messages as any session pubkey it chooses
|
||||
but cannot decrypt replies without the matching secret, so the
|
||||
impact is DoS/confusion, not impersonation.
|
||||
- **mesh.owner_secret_key stored plaintext** in the DB. A malicious
|
||||
broker can issue arbitrary invites. Mitigated only by DB access
|
||||
control.
|
||||
|
||||
## Review checklist for the reviewer
|
||||
|
||||
1. **libsodium usage**
|
||||
- Are nonces generated with `randombytes_buf` and never reused?
|
||||
- `crypto_box_easy` / `crypto_box_open_easy` order and parameters correct?
|
||||
- Are ed25519 keys converted to curve25519 on BOTH sides consistently?
|
||||
- Is `crypto_sign_detached` / `crypto_sign_verify_detached` used with the right message bytes?
|
||||
|
||||
2. **Invite protocol**
|
||||
- Canonical bytes v1 + v2 format strings stable across CLI and broker?
|
||||
- Replay protection: is a v1 URL reusable? (short URL + usedCount)
|
||||
- Is the `maxUses` counter race-safe? (atomic UPDATE with `lt`)
|
||||
- v2 root_key sealing: does `crypto_box_seal` fit the trust model?
|
||||
- Is recipient_x25519_pubkey validated on both shape and length?
|
||||
|
||||
3. **Audit chain**
|
||||
- Is the canonical JSON serialization reviewable and stable?
|
||||
- Does `pg_advisory_xact_lock` actually serialize writes on the same mesh under HA?
|
||||
- Can a malicious broker rewrite history by dropping the `lastHash` cache + DROPping rows + replaying with a new chain? (Yes — documented. Mitigation is append-only at the DB level.)
|
||||
|
||||
4. **At-rest encryption (broker-crypto.ts)**
|
||||
- AES-256-GCM with 12-byte IV + 16-byte tag — correct, but is the IV generation guaranteed random and unique per encryption?
|
||||
- Any concern about auth tag truncation or nonce collision under high volume?
|
||||
|
||||
5. **Backup (cli/commands/backup.ts)**
|
||||
- Argon2id params reasonable? (INTERACTIVE — should possibly be SENSITIVE.)
|
||||
- XChaCha20-Poly1305 parameter order?
|
||||
- Does the passphrase-minimum (12 chars) match the Argon2id parameters?
|
||||
- Is the salt stored alongside the ciphertext and read back correctly?
|
||||
|
||||
6. **Session vs member key**
|
||||
- When is which key used? Is there any path where one is trusted for the other's purpose?
|
||||
|
||||
7. **Hello signature**
|
||||
- Timestamp skew window (`±60s`) — does the broker reject out-of-window replays?
|
||||
- Is the canonical hello string covered by the signature exactly?
|
||||
|
||||
8. **Grants**
|
||||
- Can a peer bypass server-side grant enforcement by lying about their
|
||||
own sender key in hello? (Signature pins memberPubkey to a real
|
||||
signing key, but sessionPubkey isn't proven.)
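
For checklist item 7, the behavior under review can be stated as a small sketch. It is hypothetical and not taken from the broker: the `HelloReplayGuard` class, the in-window signature cache, and the millisecond timestamps are assumptions. Only the `±60s` window comes from this document.

```typescript
const SKEW_MS = 60_000; // the ±60s window from the checklist

type HelloVerdict = "ok" | "skew" | "replay";

// Rejects hellos outside the skew window and, as an assumed extra step,
// rejects a second use of the same signature inside the window.
class HelloReplayGuard {
  private seen = new Map<string, number>(); // signature hex -> claimed timestamp (ms)

  check(signatureHex: string, timestampMs: number, nowMs: number): HelloVerdict {
    if (Math.abs(nowMs - timestampMs) > SKEW_MS) return "skew";
    // Entries outside the window can be evicted: a replay of them fails the skew check anyway.
    for (const [sig, ts] of this.seen) {
      if (Math.abs(nowMs - ts) > SKEW_MS) this.seen.delete(sig);
    }
    if (this.seen.has(signatureHex)) return "replay";
    this.seen.set(signatureHex, timestampMs);
    return "ok";
  }
}
```

A skew check alone still admits a replay within 60 seconds, so whether the broker keeps any per-signature state is the question worth answering.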
|
||||
|
||||
## Test coverage supplied
|
||||
|
||||
- `apps/broker/tests/invite-signature.test.ts`
|
||||
- `apps/broker/tests/invite-v2.test.ts`
|
||||
- `apps/broker/tests/hello-signature.test.ts`
|
||||
- `apps/broker/tests/audit-canonical.test.ts`
|
||||
- `apps/broker/tests/grants-enforcement.test.ts`
|
||||
- `apps/broker/tests/rate-limit.test.ts`
|
||||
- `apps/broker/tests/encoding.test.ts`
|
||||
- `apps/broker/tests/dup-delivery.test.ts`
|
||||
- `apps/cli/tests/unit/crypto-roundtrip.test.ts`
|
||||
|
||||
## Deliverables expected from reviewer
|
||||
|
||||
1. **Findings list** — severity (crit/high/med/low), file:line, fix recommendation.
|
||||
2. **Protocol-level critique** — anything in the invite or hello flow that can be exploited with a valid account.
|
||||
3. **Tooling recs** — libsodium best-practice they'd follow differently.
|
||||
4. **Go/no-go** for v1.0.0 GA assuming the findings are addressed.
|
||||
|
||||
## Budget
|
||||
|
||||
2 person-days. Hourly rate acceptable; fixed-fee preferred. Request
|
||||
for quote from reviewers with published libsodium / PKI experience
|
||||
(see recommended list below).
|
||||
|
||||
## Recommended reviewers
|
||||
|
||||
- Filippo Valsorda (independent, ex-Go crypto lead, author of age)
|
||||
- Trail of Bits (firm-rate; their Tamarin+reviewer combo is strong)
|
||||
- Latacora (firm; expensive but thorough)
|
||||
- NCC Group (firm; good for libsodium-specific)
|
||||
- Cure53 (firm; EU, fast turnaround)
|
||||
|
||||
## Review deliverable format
|
||||
|
||||
Markdown report with:
|
||||
- Findings table (id, severity, file:line, summary, recommended fix)
|
||||
- Protocol notes
|
||||
- One-page exec summary for non-technical stakeholders
|
||||
84
.artifacts/specs/2026-04-15-invite-v2-cli-migration.md
Normal file
@@ -0,0 +1,84 @@
|
||||
# Invite v2 — CLI migration (server-side already shipped)
|
||||
|
||||
## Current state
|
||||
|
||||
**Server-side (broker) — DEPLOYED**
|
||||
- `canonicalInviteV2` bytes format (crypto.ts)
|
||||
- `verifyInviteV2` signature check
|
||||
- `claimInviteV2Core` at `POST /invites/:code/claim`
|
||||
- `sealRootKeyToRecipient` using crypto_box_seal
|
||||
- Every v1 invite also stores `capability_v2` for cross-compat
|
||||
- Web route `/api/public/invites/:code/claim` proxies to broker
|
||||
|
||||
**Client-side (CLI) — NOT MIGRATED**
|
||||
The CLI still uses the v1 flow (`enrollWithBroker`) which reads
|
||||
`mesh_root_key` from the invite token's base64 payload. This means:
|
||||
- Long URL `/join/<token>` contains the root key
|
||||
- Short URL `/i/<code>` resolves to the long URL → still contains root key
|
||||
- Anyone who can read the URL (history, screenshot, mail archive) has the key
|
||||
|
||||
## The v2 CLI flow
|
||||
|
||||
```
parseInviteLinkV2(url)
  → short URL /i/<code>? GET /api/public/invite-code/:code
  → returns `{ found, code, mesh_slug, broker_url, owner_pubkey,
               canonical_v2, expires_at, role }` (NO root_key)
  → generate local x25519 keypair (curve25519)
  → POST /invites/<code>/claim { recipient_x25519_pubkey, display_name }
  → broker verifies capability_v2 signature
  → broker seals mesh.root_key with crypto_box_seal(root_key, our_pubkey)
  → returns { sealed_root_key, mesh_id, member_id, owner_pubkey, canonical_v2 }
  → open sealed_root_key with our x25519 secret key
  → store root_key in ~/.claudemesh/config.json.meshes[].rootKey
    (NOT in the invite link; it was never transmitted unsealed)
  → upgrade enroll to use claim response instead of the /join endpoint
```
|
||||
|
||||
## What needs to change in the CLI
|
||||
|
||||
1. **New file** `apps/cli/src/services/invite/parse-v2.ts`
|
||||
- Detect short URL, resolve via `/api/public/invite-code/:code`
|
||||
- Expect the API returns v2 shape (server already has this route; verify field names)
|
||||
- Generate x25519 keypair via libsodium
|
||||
- POST to claim endpoint
|
||||
- Unseal root_key
|
||||
|
||||
2. **Conditional in `parseInviteLink`**
|
||||
- If URL is short-form and broker supports v2, use the new path
|
||||
- Fall back to v1 for legacy long-form URLs in transit
|
||||
|
||||
3. **Config schema** already has `rootKey` per mesh — just write from
|
||||
unsealed bytes instead of from the token payload.
|
||||
|
||||
4. **Spec test** `tests/golden/invite-v2.test.ts`
|
||||
- Broker already has `claimInviteV2Core` tests; add a CLI-side
|
||||
end-to-end that hits a local broker and verifies the sealed key
|
||||
round-trips.
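
The branching in item 2 might be sketched as follows. The helper names `classifyInviteUrl` and `shouldUseV2` are hypothetical. The `/i/<code>` and `/join/<token>` path shapes and the `CLAUDEMESH_INVITE_V2=1` gate come from this spec. The sketch handles `https://` URLs only and leaves out the "broker supports v2" probe.

```typescript
type InviteForm =
  | { kind: "short"; code: string }
  | { kind: "long"; token: string };

// Classifies an invite URL by its path: /i/<code> is short, /join/<token> is long.
function classifyInviteUrl(raw: string): InviteForm {
  const url = new URL(raw);
  const parts = url.pathname.split("/").filter((p) => p.length > 0);
  if (parts.length === 2 && parts[0] === "i") return { kind: "short", code: parts[1] };
  if (parts.length === 2 && parts[0] === "join") return { kind: "long", token: parts[1] };
  throw new Error(`not an invite URL: ${url.pathname}`);
}

// v2 applies only to short URLs, and only while the opt-in env flag is set.
function shouldUseV2(form: InviteForm, env: Record<string, string | undefined>): boolean {
  return form.kind === "short" && env["CLAUDEMESH_INVITE_V2"] === "1";
}
```

Long-form URLs always take the v1 path here, which matches the fallback for legacy links still in transit.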
|
||||
|
||||
## Why it wasn't rushed in this session
|
||||
|
||||
Crypto code deserves review. The server-side v2 shipped weeks ago
|
||||
with its own testing and audit; the CLI migration needs the same
|
||||
rigor — at minimum, a test that proves the sealed key we unseal
|
||||
matches the root_key the broker had in its DB, verified against
|
||||
`canonical_v2` signature.
|
||||
|
||||
The current v1 flow is a known quantity (the root_key-in-URL risk
|
||||
is documented in the spec). Broker is already v2-ready so when the
|
||||
CLI migration lands, emails / links can immediately start using the
|
||||
claim-only short URL without a server deploy.
|
||||
|
||||
## Rollout plan
|
||||
|
||||
1. Ship CLI v2 path behind `CLAUDEMESH_INVITE_V2=1` env.
|
||||
2. Dogfood: new invites generated by `claudemesh share` use `/api/public/invite-code/:code` with v2-shape response that omits token; CLI resolves via claim.
|
||||
3. Verify with `claudemesh verify` safety numbers cross-check.
|
||||
4. After 2 uneventful weeks, flip the default to v2.
|
||||
5. After 60 days, stop embedding root_key in long URLs entirely.
|
||||
6. v3 (future): short URL becomes the only form.
|
||||
|
||||
## Effort
|
||||
|
||||
~1 day of focused crypto + testing. Broker work is done; API work is
|
||||
done; CLI work is a new parse path + a new enroll path + a few tests.
|
||||
75
.artifacts/specs/2026-04-15-per-peer-capabilities.md
Normal file
@@ -0,0 +1,75 @@
|
||||
# Per-Peer Capabilities
|
||||
|
||||
## Goal
|
||||
Give mesh members fine-grained control over what peers can do to their
|
||||
session. Today: any mesh peer can send you any message; all messages get
|
||||
pushed as `<channel>` reminders. Users can't say "only @alice can send me
|
||||
messages," "read-only peers," or "@bob can broadcast but not DM."
|
||||
|
||||
## Current state
|
||||
- Mesh-level role: `admin` | `member` (only affects invite issuance)
|
||||
- No per-peer filter — every peer message is delivered
|
||||
- No per-peer read/write split (all peers have the same capabilities)
|
||||
|
||||
## Target capability model
|
||||
|
||||
| Capability    | Meaning                                                |
|---------------|--------------------------------------------------------|
| `read`        | Peer appears in your list_peers, can see your summary  |
| `dm`          | Peer can send you direct messages                      |
| `broadcast`   | Peer's group broadcasts reach you                      |
| `state-read`  | Peer can read shared state keys                        |
| `state-write` | Peer can set shared state keys                         |
| `file-read`   | Peer can read files you've shared (already exists)     |
|
||||
|
||||
## CLI surface
|
||||
```
|
||||
claudemesh grant @alice dm broadcast # allow direct + broadcast
|
||||
claudemesh grant @bob state-read # read-only
|
||||
claudemesh revoke @alice broadcast
|
||||
claudemesh grants # list current grants per peer
|
||||
claudemesh block @spammer # shorthand for revoke-all
|
||||
```
|
||||
|
||||
## Broker schema
|
||||
New column on `mesh_member`:
|
||||
```sql
|
||||
peer_grants jsonb DEFAULT '{}'::jsonb
|
||||
-- shape: { "<peer_pubkey_hex>": ["dm", "broadcast", ...] }
|
||||
```
|
||||
|
||||
Alternative (cleaner): separate `peer_grant` table keyed on
|
||||
`(member_id, target_pubkey)`.
|
||||
|
||||
## Enforcement point
|
||||
Broker's message router (`apps/broker/src/index.ts` — send flow).
|
||||
Before writing the encrypted message to the recipient's queue, check
|
||||
`recipient.peer_grants[sender_pubkey]` against message kind. Drop
|
||||
silently if disallowed (sender sees delivered, recipient sees nothing —
|
||||
matches Signal/iMessage block semantics).
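
A minimal sketch of that check, under stated assumptions: the `shouldDeliver` helper, the `MessageKind` values, and the mapping from message kind to capability are hypothetical. The `peer_grants` shape and the `read + dm` fallback for unknown peers are the ones proposed in this spec.

```typescript
type Capability = "read" | "dm" | "broadcast" | "state-read" | "state-write" | "file-read";
type MessageKind = "direct" | "broadcast";
// Same shape as the proposed column: { "<peer_pubkey_hex>": ["dm", "broadcast", ...] }
type PeerGrants = Record<string, Capability[]>;

const DEFAULT_GRANTS: Capability[] = ["read", "dm"];

function requiredCapability(kind: MessageKind): Capability {
  return kind === "direct" ? "dm" : "broadcast";
}

// Returns false when the message should be dropped silently.
function shouldDeliver(
  grants: PeerGrants,
  senderPubkey: string,
  kind: MessageKind,
  defaults: Capability[] = DEFAULT_GRANTS,
): boolean {
  const hasEntry = Object.prototype.hasOwnProperty.call(grants, senderPubkey);
  // An explicit empty list means "blocked"; only a missing entry falls back to defaults.
  const effective = hasEntry ? grants[senderPubkey] : defaults;
  return effective.includes(requiredCapability(kind));
}
```

Treating an empty list differently from a missing entry is what lets `block` be stored as a plain `[]` without a separate flag.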
|
||||
|
||||
## Defaults
|
||||
- Unknown peers: `read + dm` (matches current behavior — additive-safe rollout)
|
||||
- Existing members: grandfathered into `read + dm + broadcast + state-read`
|
||||
via a migration
|
||||
- `claudemesh profile --default-grants read dm` lets users change their own default
|
||||
|
||||
## UI
|
||||
- `claudemesh peers` renders a `[grants: dm,broadcast]` tag per peer
|
||||
- `claudemesh verify` gains a `--with-grants` flag that shows the grant set
|
||||
alongside the safety number (helps the "did I accidentally block them?" check)
|
||||
|
||||
## Crypto implications
|
||||
Grants are server-enforced metadata. Not capability tokens. A malicious
|
||||
broker could forward messages regardless — this is about UX trust (spam /
|
||||
noise control), not protocol security. The spec is clear about this.
|
||||
|
||||
## Migration plan
|
||||
1. Ship broker schema change (jsonb column, nullable, default `{}`).
|
||||
2. Ship `grant/revoke/grants/block` CLI commands against an unused column.
|
||||
3. Enable enforcement in broker behind a per-mesh feature flag.
|
||||
4. Flip on for all meshes.
|
||||
|
||||
## Priority
|
||||
Nice-to-have. The killer feature here is `block` — every mesh gets a bad
|
||||
actor eventually. Ship `block` first even if the full grant system is deferred.
|
||||
162
.artifacts/specs/2026-05-01-mcp-tool-surface-trim.md
Normal file
@@ -0,0 +1,162 @@
|
||||
---
|
||||
title: MCP tool surface trim + multi-mesh push
|
||||
status: proposed
|
||||
target: claudemesh-cli 1.1.0
|
||||
author: Alejandro
|
||||
date: 2026-05-01
|
||||
---
|
||||
|
||||
# MCP tool surface trim + multi-mesh push
|
||||
|
||||
## Problem
|
||||
|
||||
Two issues with the current `claudemesh mcp` server:
|
||||
|
||||
1. **80+ tools registered.** Every Claude session that has claudemesh installed pays the deferred-tool-list cost (~80 entries surfacing in `ToolSearch`). Most of those tools are CLI-verb-wrappers that already have a perfect Bash equivalent — no structured I/O is gained by exposing them as MCP tools.
|
||||
|
||||
2. **Single-mesh push only.** A session launched with `claudemesh launch` opens its WS to one mesh. Peer messages from any other joined mesh arrive only if the user manually runs `claudemesh inbox`. The MCP push pipeline doesn't fan out across meshes.
|
||||
|
||||
The cleanest framing: **MCP earns its keep when a tool returns structured data Claude reads. CLI is better for fire-and-forget verbs.** Today's tool surface ignores that distinction.
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **Don't redesign the architecture as "CLI-only with a daemon."** That trades warm-WS sends (~5ms in-process) for cold Bash spawns (~300-500ms) and forces a Unix-socket bridge to recover state coherence. See discussion 2026-05-01 — the platform vision (vectors, graph, files, mesh-services) genuinely benefits from typed tool I/O.
|
||||
- **Don't break MCP backward compat in 1.x.** Existing scripts calling `mcp__claudemesh__send_message` keep working until 2.0; in 1.1 they're soft-deprecated with a stderr warning.
|
||||
|
||||
## Proposal
|
||||
|
||||
Three patches, ship together as 1.1.0:
|
||||
|
||||
### Patch 1: `--mesh <slug>` flag on `claudemesh mcp`
|
||||
|
||||
Today `claudemesh mcp` calls `readConfig()` and `startClients(config)` — connects to every mesh in `~/.claudemesh/config.json`. The `claudemesh launch` flow writes a per-session tmpdir config with one mesh, so practically the MCP server binds to one mesh per session.
|
||||
|
||||
Add an explicit flag for non-launch contexts (manual `~/.claude.json` editing):
|
||||
|
||||
```ts
// apps/cli/src/mcp/server.ts, near line 244
export async function startMcpServer(): Promise<void> {
  const serviceIdx = process.argv.indexOf("--service");
  if (serviceIdx !== -1 && process.argv[serviceIdx + 1]) {
    return startServiceProxy(process.argv[serviceIdx + 1]!);
  }

  const meshIdx = process.argv.indexOf("--mesh");
  const onlyMesh = meshIdx !== -1 ? process.argv[meshIdx + 1] : null;

  const config = readConfig();
  if (onlyMesh) {
    // Capture the slugs before filtering so the error can list what is available.
    const available = config.meshes.map((m) => m.slug);
    config.meshes = config.meshes.filter((m) => m.slug === onlyMesh);
    if (config.meshes.length === 0) {
      throw new Error(
        `--mesh "${onlyMesh}" not found in config (have: ${
          available.join(", ") || "none"
        })`,
      );
    }
  }
  // ...rest unchanged
}
```
|
||||
|
||||
Enables this `~/.claude.json` pattern for users who want push from N meshes simultaneously without launching N Claude sessions:
|
||||
|
||||
```json
{
  "mcpServers": {
    "claudemesh:flexicar": { "command": "claudemesh", "args": ["mcp", "--mesh", "flexicar"] },
    "claudemesh:openclaw": { "command": "claudemesh", "args": ["mcp", "--mesh", "openclaw"] },
    "claudemesh:prueba1": { "command": "claudemesh", "args": ["mcp", "--mesh", "prueba1"] }
  }
}
```
|
||||
|
||||
Each instance opens one WS, holds it for the session, decrypts and forwards `claude/channel` notifications independently. Channel events already carry `[meshSlug]` in `formatPush()` (server.ts:240), so Claude knows which mesh a message came from.
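
For users with many meshes, the per-mesh entries could be generated rather than typed by hand. The helper below is a hypothetical sketch, not an existing CLI feature. It only reproduces the `claudemesh:<slug>` key and `["mcp", "--mesh", <slug>]` args pattern from the JSON above.

```typescript
interface McpServerEntry {
  command: string;
  args: string[];
}

// Builds one MCP server entry per mesh slug, keyed as "claudemesh:<slug>".
function mcpServersFor(slugs: string[]): Record<string, McpServerEntry> {
  const servers: Record<string, McpServerEntry> = {};
  for (const slug of slugs) {
    servers[`claudemesh:${slug}`] = { command: "claudemesh", args: ["mcp", "--mesh", slug] };
  }
  return servers;
}
```

`JSON.stringify({ mcpServers: mcpServersFor(["flexicar", "openclaw"]) }, null, 2)` then yields a fragment that can be merged into `~/.claude.json`.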
|
||||
|
||||
**LoC:** ~10. **Risk:** very low — additive flag, default behavior unchanged.
|
||||
|
||||
### Patch 2: trim 25 messaging tools from MCP surface
|
||||
|
||||
Move these tools from "registered MCP tool" to "soft-deprecated CLI shim":
|
||||
|
||||
| Module | Tool | CLI replacement | Rationale |
|---|---|---|---|
| messaging.ts | `send_message` | `claudemesh send <to> <msg> [--mesh X] [--priority Y]` | Pure verb, no structured return. |
| messaging.ts | `list_peers` | `claudemesh peers --json` | One-shot, easy to parse. |
| messaging.ts | `check_messages` | `claudemesh inbox --json` | One-shot. |
| messaging.ts | `message_status` | `claudemesh msg-status <id>` (new) | One-shot lookup. |
| profile.ts | `set_profile` | `claudemesh profile --avatar X --bio Y ...` | Pure write. |
| profile.ts | `set_status` | `claudemesh status set <state>` (new) | Pure write. |
| profile.ts | `set_summary` | `claudemesh summary <text>` (new) | Pure write. |
| profile.ts | `set_visible` | `claudemesh visible <true\|false>` (new) | Pure write. |
| groups.ts | `join_group` | `claudemesh group join @<name> [--role X]` (new) | Pure write. |
| groups.ts | `leave_group` | `claudemesh group leave @<name>` (new) | Pure write. |
| state.ts | `get_state` | `claudemesh state get <key> --json` | Already exists. |
| state.ts | `set_state` | `claudemesh state set <key> <value>` | Already exists. |
| state.ts | `list_state` | `claudemesh state list --json` | Already exists. |
| memory.ts | `remember` | `claudemesh remember <text>` | Already exists. |
| memory.ts | `recall` | `claudemesh recall <query> --json` | Already exists. |
| memory.ts | `forget` | `claudemesh forget <id>` (new) | Pure write. |
| scheduling.ts | `schedule_reminder` | `claudemesh remind <msg> --in/--at/--cron` | Already exists. |
| scheduling.ts | `list_scheduled` | `claudemesh remind list --json` | Already exists. |
| scheduling.ts | `cancel_scheduled` | `claudemesh remind cancel <id>` | Already exists. |
| mesh-meta.ts | `mesh_info` | `claudemesh info --json` | One-shot read. |
| mesh-meta.ts | `mesh_stats` | `claudemesh stats --json` (new) | One-shot read. |
| mesh-meta.ts | `mesh_clock` | `claudemesh clock --json` (new) | One-shot read. |
| mesh-meta.ts | `ping_mesh` | `claudemesh ping` (new) | Pure verb. |
| tasks.ts | `claim_task` / `complete_task` | `claudemesh task claim/complete <id>` (new) | Pure write. |
|
||||
|
||||
**Keep as MCP tools (~50):**
|
||||
|
||||
- **vault.ts** — `vault_set / vault_list / vault_delete` (encrypted, structured payloads).
|
||||
- **vectors.ts** — `vector_store / vector_search / vector_delete` (typed embeddings, ranked results Claude reasons over).
|
||||
- **graph.ts** — `graph_query / graph_execute` (returns structured graph results).
|
||||
- **files.ts** — `share_file / get_file / list_files / list_peer_files / read_peer_file / grant_file_access / file_status / delete_file` (binary payloads, ACL semantics).
|
||||
- **skills.ts** — `share_skill / list_skills / get_skill / remove_skill / mesh_skill_deploy` (typed skill metadata).
|
||||
- **streams.ts** — `create_stream / list_streams / publish / subscribe` (event stream cursor semantics).
|
||||
- **contexts.ts** — `share_context / get_context / list_contexts` (context-passing payloads).
|
||||
- **mcp-registry-*.ts** — `mesh_mcp_*` (the ~14 mesh-MCP-services tools — these are platform-defining, MCP-native).
|
||||
- **clock-write.ts** — `mesh_set_clock / mesh_pause_clock / mesh_resume_clock` (logical-clock writes that Claude composes with reads).
|
||||
- **sql.ts** — `mesh_query / mesh_schema / mesh_execute` (typed SQL results).
|
||||
- **webhooks.ts** — `create_webhook / list_webhooks / delete_webhook` (typed webhook metadata).
|
||||
- **url-watch.ts** — `mesh_watch / mesh_unwatch / mesh_watches` (returns watch state).
|
||||
- **tasks.ts** — `create_task / list_tasks` (typed task records — only the writes go to CLI).
|
||||
|
||||
### Patch 3: tool-call → CLI shim with deprecation warning
|
||||
|
||||
For the trimmed tools, keep the registration in 1.x and wrap each handler so it warns on stderr before delegating to the original implementation:
|
||||
|
||||
```ts
|
||||
// apps/cli/src/mcp/tools/messaging.ts (sketch)
|
||||
async function sendMessageDeprecated(args: SendMessageArgs): Promise<ToolResult> {
|
||||
process.stderr.write(
|
||||
`[claudemesh] mcp__claudemesh__send_message is soft-deprecated in 1.1. ` +
|
||||
`Use \`claudemesh send\` via Bash instead — it's faster and cleaner.\n`,
|
||||
);
|
||||
return originalSendMessageHandler(args); // unchanged behavior
|
||||
}
|
||||
```
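
Since the same shim applies to every trimmed tool, one generic wrapper may be simpler than 25 hand-written functions. The sketch below is an assumption, not existing code: `withDeprecation`, its injectable `warn` sink, and the warn-once behavior are all illustrative choices.

```typescript
type ToolHandler<A, R> = (args: A) => Promise<R>;

// Wraps a tool handler so the first call emits a deprecation notice, then delegates unchanged.
function withDeprecation<A, R>(
  toolName: string,
  cliHint: string,
  handler: ToolHandler<A, R>,
  warn: (line: string) => void = (line) => { process.stderr.write(line); },
): ToolHandler<A, R> {
  let warned = false; // design choice in this sketch: warn once per process
  return async (args: A): Promise<R> => {
    if (!warned) {
      warned = true;
      warn(`[claudemesh] ${toolName} is soft-deprecated in 1.1. Use \`${cliHint}\` via Bash instead.\n`);
    }
    return handler(args);
  };
}
```

The injectable `warn` sink keeps the wrapper testable without capturing stderr. Warning on every call instead only requires dropping the `warned` flag.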
|
||||
|
||||
In 2.0 the registrations get deleted entirely.
|
||||
|
||||
## Migration plan
|
||||
|
||||
1. **1.1.0** — ship all three patches. Existing users see deprecation warnings; nothing breaks.
|
||||
2. **1.1.x** — collect feedback. If anyone has scripts hard-wired to the deprecated tools, surface in CHANGELOG.
|
||||
3. **1.2.0** (~6 weeks later) — flip deprecation warnings to "removal in 2.0" messaging.
|
||||
4. **2.0.0** — delete the 25 tool registrations. ToolSearch surface drops to ~50 entries.
|
||||
|
||||
## Open questions
|
||||
|
||||
- **Do we need a Unix-socket bridge between CLI sends and the MCP push-pipe** so they share one WS connection per mesh per session? Probably yes for `claudemesh send` warm-path performance, but it's a separate spec — file under `socket-bridge` after this lands.
|
||||
- **Should `claudemesh launch` keep writing one MCP server entry** (current behavior, default for new users) or switch to the per-mesh-N-entries pattern from Patch 1? Recommend keeping single-entry default — Patch 1 is for advanced users who manually edit `~/.claude.json`.
|
||||
- **Do `mesh_mcp_*` tools really belong in the keep list?** They're MCP-on-mesh management — their bias is RPC-shaped, not stream-shaped. Provisional yes; revisit if 1.1 reduces their use.
|
||||
|
||||
## Effort
|
||||
|
||||
- Patch 1: ~10 LoC + 1 test. ~30 min.
|
||||
- Patch 2: ~25 tool-handler refactors (registration removed, CLI verb confirmed/added). Some new verbs (`status set`, `summary`, `visible`, `group join/leave`, `forget`, `stats`, `clock`, `ping`, `task claim/complete`, `msg-status`) need wiring through to existing broker-client methods. ~150 LoC, half a day.
|
||||
- Patch 3: deprecation shim per trimmed tool. ~50 LoC, 1 hour.
|
||||
|
||||
**Total:** ~1 dev-day for 1.1.0. ToolSearch surface drops by ~30%, multi-mesh push works, no architectural disruption, platform tools stay typed.
|
||||
234
.artifacts/specs/2026-05-02-architecture-north-star.md
Normal file
@@ -0,0 +1,234 @@
|
||||
---
|
||||
title: claudemesh North Star — CLI-first with claude/channel push-pipe
|
||||
status: canonical
|
||||
target: 2.0.0
|
||||
author: Alejandro
|
||||
date: 2026-05-02
|
||||
supersedes: none
|
||||
references:
|
||||
- 2026-05-01-mcp-tool-surface-trim.md (first cut at the trim)
|
||||
- SPEC.md
|
||||
- docs/protocol.md
|
||||
---
|
||||
|
||||
# claudemesh North Star
|
||||
|
||||
## The commitment, in one sentence
|
||||
|
||||
> **CLI is the canonical surface for every claudemesh operation. MCP exists for one thing: to deliver `claude/channel` push notifications mid-turn. That's the killer feature, and it's the only reason an MCP server runs at all.**
|
||||
|
||||
Everything else — sending messages, listing peers, sharing files, deploying mesh-MCPs, running graph queries, scheduling jobs, publishing skills — is invoked from the CLI, by humans, scripts, cron, hooks, or by Claude itself via Bash.
|
||||
|
||||
## Why this shape
|
||||
|
||||
1. **Mid-turn interrupt is the differentiator.** When peer A sends to peer B, B's Claude session pauses what it's doing and reads the message immediately. That requires `claude/channel` notifications routed through an MCP transport — Claude Code only watches MCP server connections for those events. **Lose that, and claudemesh becomes another inbox-polling pattern.** Every other primitive can degrade to "delivered at next tool boundary"; this one cannot.
|
||||
|
||||
2. **CLI is universal.** Bash works in scripts, hooks, cron, CI, terminals, automation, and Claude itself (via Bash tool calls). A primitive that exists as both an MCP tool and a CLI verb is double-maintenance with one calling convention nobody actually wants.
|
||||
|
||||
3. **JSON-on-stdout is enough structure.** Claude reads `claudemesh peers --json` exactly as well as it reads a typed MCP tool return. The CLI man page is the schema. The "MCP gives structured I/O" advantage was real when we were paying for nothing else, but warm-WS via socket bridge (below) closes the cost gap.
|
||||
|
||||
4. **Surface shrinks where it matters.** ToolSearch deferred-tool list drops from ~80 entries to ~0 entries (push-pipe registers no tools). Massive context-budget win for every Claude session.
|
||||
|
||||
## Prior art (this is not novel architecture)
|
||||
|
||||
The "live-state daemon + thin scriptable CLI talking via Unix socket" pattern is the canonical shape for CLIs in this category. Reviewers should not treat this as bespoke design:
|
||||
|
||||
- **Docker** — `dockerd` daemon, CLI talks via `/var/run/docker.sock`. `DOCKER_HOST` env override. `docker context` for multi-daemon switching.
|
||||
- **Tailscale** — `tailscaled` daemon, `tailscale` CLI via socket. Per-key ACL identity model. Same peer-mesh-with-keypairs shape as claudemesh.
|
||||
- **Stripe `listen`** — long-running CLI daemon receives webhook push, forwards to local consumer. Same push-pipe-as-CLI-subcommand shape.
|
||||
- **Obsidian CLI** — talks to a running Obsidian instance via REST. **Notable: ships a Claude skill (`~/.claude/skills/obsidian-cli/SKILL.md`) that documents every verb and flag for Claude consumption — replacing MCP tool introspection entirely.**
|
||||
|
||||
Claudemesh's CLI-first + push-pipe + socket-bridge architecture is exactly this pattern. We are following the well-trodden path, not inventing a new one.
|
||||
|
||||
## The seven architectural commitments
|
||||
|
||||
### 1. **MCP server is a push-pipe, full stop.**
|
||||
|
||||
The MCP entrypoint (`claudemesh mcp [--mesh <slug>]`) does exactly three things:
|
||||
- Holds a WS connection to the broker for the meshes it's bound to.
|
||||
- Decrypts inbound peer messages.
|
||||
- Emits them as `claude/channel` notifications to the parent Claude Code session.
|
||||
|
||||
It registers **zero tools**. It advertises only `experimental: { "claude/channel": {} }`. Its `tools/list` returns an empty array. There is no surface to discover, search, or call.
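A minimal sketch of what "zero tools" could look like on the wire, assuming a plain JSON-RPC dispatch function. The handler name, the `serverInfo` values, and the `protocolVersion` string are illustrative assumptions; only the `experimental` capability key and the empty `tools/list` result come from the text above.

```typescript
// Sketch only: names and the protocolVersion value are assumptions, not claudemesh's code.
type JsonRpcId = number | string;
interface JsonRpcRequest { jsonrpc: "2.0"; id: JsonRpcId; method: string }
interface JsonRpcResponse {
  jsonrpc: "2.0";
  id: JsonRpcId;
  result?: unknown;
  error?: { code: number; message: string };
}

// The only capability the push-pipe advertises, per the text above.
const PUSH_PIPE_CAPABILITIES = { experimental: { "claude/channel": {} } };

function handleRequest(req: JsonRpcRequest): JsonRpcResponse {
  switch (req.method) {
    case "initialize":
      return {
        jsonrpc: "2.0",
        id: req.id,
        result: {
          protocolVersion: "2024-11-05", // assumed MCP revision
          capabilities: PUSH_PIPE_CAPABILITIES,
          serverInfo: { name: "claudemesh", version: "0.0.0" }, // placeholder values
        },
      };
    case "tools/list":
      // Nothing to discover, search, or call.
      return { jsonrpc: "2.0", id: req.id, result: { tools: [] } };
    default:
      // -32601 is the standard JSON-RPC "method not found" code.
      return {
        jsonrpc: "2.0",
        id: req.id,
        error: { code: -32601, message: "Method not found: " + req.method },
      };
  }
}
```

Under this sketch a `tools/call` request falls through to the default branch, which matches the 2.0.0 behavior described later ("old MCP tool calls return tool-not-found").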
|
||||
|
||||
One push-pipe per joined mesh, registered in `~/.claude.json` via `claudemesh install` (or auto-injected by `claudemesh launch`). The `--mesh` flag (shipped 1.0.3) makes this trivial.
|
||||
|
||||
### 2. **CLI is the canonical surface for every primitive.**
|
||||
|
||||
Every resource has uniform CLI verbs:
|
||||
|
||||
| Resource | Verbs |
|---|---|
| peer | `claudemesh peers [--json] [--mesh X]` |
| group | `claudemesh group join/leave @<n> [--role X]` |
| message | `claudemesh send <to> <msg>`, `claudemesh inbox`, `claudemesh msg-status <id>` |
| state | `claudemesh state get/set/list [--json]` |
| memory | `claudemesh remember/recall/forget` |
| task | `claudemesh task create/claim/complete/list` |
| file | `claudemesh file put/get/list/grant/delete` |
| vector | `claudemesh vector store/search/delete` |
| graph | `claudemesh graph query/execute/watch` |
| stream | `claudemesh stream create/publish/subscribe/list` |
| context | `claudemesh context share/get/list` |
| skill | `claudemesh skill publish/list/get/remove` |
| schedule | `claudemesh schedule msg/webhook/tool/list/cancel` |
| webhook | `claudemesh webhook create/list/delete` |
| watch | `claudemesh watch create/list/unwatch` |
| mcp | `claudemesh mesh-mcp deploy/list/call/undeploy/catalog` |
| clock | `claudemesh clock get/set/pause/resume` |
| sql | `claudemesh sql query/schema/execute` |
| vault | `claudemesh vault set/get/list/delete` |
| profile | `claudemesh profile/summary/visible/status set` |
|
||||
|
||||
**Every verb supports `--json`** for structured consumption. **Every verb supports `--mesh <slug>`** for targeting (default: pick first or interactive picker). Verbs share one broker-call implementation — no duplication between CLI and MCP.
|
||||
|
||||
### 3. **Warm path via Unix socket bridge** (load-bearing for 2.0).
|
||||
|
||||
A push-pipe holds a live WS connection. CLI invocations should reuse that connection rather than opening their own (which costs ~300-500ms cold-start).
|
||||
|
||||
Mechanism:
|
||||
- On startup, push-pipe creates `~/.claudemesh/sockets/<mesh-slug>.sock` (Unix domain socket, mode 0600).
|
||||
- CLI verbs that need broker round-trip first try to dial that socket.
|
||||
- If alive: forward request, get response back over socket (~5ms).
|
||||
- If absent / stale: open ephemeral WS, do the op, close (~300ms — fine for cron/scripts where there's no parent push-pipe).
|
||||
|
||||
The push-pipe owns one WS and all ops go through it, so the broker sees ONE session per mesh per host (no duplicate hellos). On clean exit, an exit handler `unlink`s the socket file; if the push-pipe crashes before the handler runs, the leftover file is detected as stale when `connect()` fails with ECONNREFUSED.
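The warm/cold decision can be sketched as a pure function over the result of the socket dial. This is an illustration under assumptions: `socketPathFor`, `chooseRoute`, the `DialResult` shape, and the `unlinkStale` hint are hypothetical names, and treating any other error code as fatal is a choice made here, not something the spec states.

```typescript
// Sketch only: all names below are hypothetical.
type DialResult = { ok: true } | { ok: false; code: string };
type Route =
  | { path: "warm" }
  | { path: "cold"; reason: "absent" | "stale"; unlinkStale: boolean };

// Socket location as described in the mechanism above.
function socketPathFor(homeDir: string, meshSlug: string): string {
  return homeDir + "/.claudemesh/sockets/" + meshSlug + ".sock";
}

function chooseRoute(dial: DialResult): Route {
  if (dial.ok) return { path: "warm" }; // reuse the push-pipe's WS
  switch (dial.code) {
    case "ENOENT":
      // No socket file: no push-pipe is running for this mesh.
      return { path: "cold", reason: "absent", unlinkStale: false };
    case "ECONNREFUSED":
      // File exists but nobody is listening: leftover from a crash.
      return { path: "cold", reason: "stale", unlinkStale: true };
    default:
      // Assumption: surface anything else (e.g. EACCES) instead of silently going cold.
      throw new Error("unexpected socket error: " + dial.code);
  }
}
```

Keeping the decision separate from the actual `connect()` call makes the stale-socket rule easy to unit-test without touching the filesystem.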
|
||||
|
||||
This is **mandatory for 2.0** — without it, every CLI op pays cold-start, and CLI-first becomes unusably slow for tight loops.
|
||||
|
||||
### 4. **JSON output is the schema, with field selection and streaming.**
|
||||
|
||||
Every CLI verb has a deterministic `--json` output shape, documented in `docs/cli-schemas.md`, validated by zod parsers in tests. Claude reads `claudemesh vector search "x" --json` and gets a typed-array shape it can reason over identically to a tool return.
|
||||
|
||||
**Three output modes, mandatory across every read-shaped verb** (modeled on `gh` and `gemini`):
|
||||
|
||||
- `--json` — full record, all fields
|
||||
- `--json <fields>` — field-selected projection (e.g. `claudemesh peers --json name,pubkey,status`)
|
||||
- `--output-format stream-json` — incremental JSONL for long-running ops (mesh-MCP calls fanning across peers, `vector search` against large indexes, `schedule list` with many entries). One object per line, Claude consumes incrementally.
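For the `stream-json` mode, a consumer only needs to buffer partial lines and parse each complete one. The reader below is a minimal sketch of that idea; the class name is hypothetical and the shape of each emitted object is deliberately left as `unknown`, since the spec does not define it here.

```typescript
// Sketch only: a generic JSONL reader, not claudemesh's implementation.
class JsonlReader {
  private buffer = "";

  // Feed a chunk of stdout; returns every object completed by this chunk.
  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line (or "" if the chunk ended on a newline).
    this.buffer = lines.pop() || "";
    const out: unknown[] = [];
    for (const line of lines) {
      if (line.trim() === "") continue; // tolerate blank lines
      out.push(JSON.parse(line));
    }
    return out;
  }
}
```

Because objects are returned as soon as their newline arrives, a caller can act on early results while a long fan-out is still running.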
|
||||
|
||||
Plus convenience output:
|
||||
- `--jq <expr>` — native jq filter pipeline
|
||||
- `--template '{{.field}}'` — Go template formatting
|
||||
|
||||
`schema_version: "1.0"` field on every JSON output — mandatory. Bumps when shape changes. Old code paths can pin with `--schema-version=1.0`.
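Field selection and the version stamp can be illustrated together. The sketch below assumes an envelope of the form `{ schema_version, data }`; the spec only requires that a `schema_version` field be present, so the envelope layout, the function names, and the choice to reject unknown fields are all assumptions.

```typescript
// Sketch only: the envelope layout is an assumption.
const SCHEMA_VERSION = "1.0";

type JsonRecord = { [key: string]: unknown };
interface Envelope { schema_version: string; data: JsonRecord[] }

// Keep only the requested fields, in the order they were requested.
function projectFields(record: JsonRecord, fields: string[]): JsonRecord {
  const out: JsonRecord = {};
  for (const field of fields) {
    if (!(field in record)) throw new Error("unknown field: " + field);
    out[field] = record[field];
  }
  return out;
}

// fieldArg mirrors the optional value of --json, e.g. "name,pubkey,status".
function renderJson(records: JsonRecord[], fieldArg?: string): string {
  const fields = fieldArg
    ? fieldArg.split(",").map((s) => s.trim()).filter((s) => s !== "")
    : null;
  const data = fields ? records.map((r) => projectFields(r, fields)) : records;
  const envelope: Envelope = { schema_version: SCHEMA_VERSION, data };
  return JSON.stringify(envelope);
}
```

Failing on an unknown field name, rather than silently dropping it, keeps typos in scripts visible.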
|
||||
|
||||
### 5. **All features stay. Nothing is removed.**
|
||||
|
||||
This is **not a feature trim**. Every primitive in the current 80-tool surface gets a CLI verb. Vectors, graphs, mesh-MCP, files, vault, SQL — all of it. The user-facing pitch is unchanged: "claudemesh gives your Claude session a name, a network, shared memory, shared compute, shared skills, scheduled actions." The change is *how you call it*.
|
||||
|
||||
### 6. **The Claude skill IS the schema.** *(load-bearing for CLI-first to work)*
|
||||
|
||||
Stripping MCP tool introspection (`tools/list`) costs Claude its discoverability. The replacement: a packaged `claudemesh` skill at `~/.claude/skills/claudemesh/SKILL.md` written by `claudemesh install`, documenting every verb, flag, JSON shape, and gotcha. Claude reads it on demand via the Skill tool — **not on every session, not pre-loaded into deferred-tool-list**. This is exactly how `obsidian-cli` works today and it works perfectly.
|
||||
|
||||
The skill replaces three things at once:
|
||||
- **Tool discovery** — Claude knows the verb-set after one Skill invocation. No `tools/list` needed.
|
||||
- **Output schemas** — every JSON shape is documented in the skill, so Claude knows what to expect from `--json` without parsing TypeScript types at runtime.
|
||||
- **Behavioral conventions** — the skill teaches "preview before delete," "confirm peer match before kick," "use `--mesh` for cross-mesh ops" — soft guardrails that complement the policy engine's hard rules.
|
||||
|
||||
Topic-shards for size: `claudemesh` (core), `claudemesh-platform` (vault/vectors/graph/sql/mesh-mcp), `claudemesh-schedule` (cron/webhooks/watches), `claudemesh-admin` (kick/ban/grants/install). Each shard is independently loadable.
|
||||
|
||||
**This is the answer to the "JSON-on-stdout is a worse schema" caveat.** It's not — when Claude has a documented skill to load, the CLI surface is *more* discoverable than 80 deferred MCP tools that bloat ToolSearch silently.
|
||||
|
||||
### 7. **Pluggable policy engine, not binary `--yes`.** *(answers the Bash-blast-radius caveat)*
|
||||
|
||||
Modeled on `gemini --policy / --admin-policy` and `codex --sandbox`. Replace the current binary `-y/--yes` with:
|
||||
|
||||
- **`--approval-mode plan|read-only|write|yolo`**: four levels (plan blocks all side effects, read-only blocks all writes, write prompts on dangerous verbs, yolo skips all confirmation).
|
||||
- **`--policy <file>`** — YAML allow/deny rules per resource × verb × peer. Sample:
|
||||
|
||||
```yaml
# ~/.claudemesh/policy.yaml
default: prompt
rules:
  - resource: send
    verb: "*"
    decision: allow
  - resource: sql
    verb: execute
    decision: prompt
  - resource: file
    verb: delete
    decision: deny
  - resource: mesh-mcp
    verb: call
    peers: ["@trusted"]
    decision: allow
```
|
||||
|
||||
Policy decisions log to a tamper-evident audit file. Org admin can ship `--admin-policy` that overrides user config. **This is the real answer to "Bash carries unrestricted blast-radius once allowed" — claudemesh's own policy engine kicks in before the broker call, regardless of what shell permissions are.**
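One way the sample policy could be evaluated is sketched below. The matching semantics are assumptions, not part of the spec: rules are checked top to bottom, the first match wins, `"*"` matches any verb, and `peers` entries are compared as literal strings (a real engine would presumably expand group names such as `@trusted`).

```typescript
// Sketch only: first-match-wins and literal peer matching are assumptions.
type Decision = "allow" | "prompt" | "deny";
interface Rule { resource: string; verb: string; peers?: string[]; decision: Decision }
interface Policy { default: Decision; rules: Rule[] }

function evaluate(policy: Policy, resource: string, verb: string, peer?: string): Decision {
  for (const rule of policy.rules) {
    if (rule.resource !== resource) continue;
    if (rule.verb !== "*" && rule.verb !== verb) continue;
    // A rule scoped to peers applies only when the target peer is listed.
    if (rule.peers && (peer === undefined || rule.peers.indexOf(peer) === -1)) continue;
    return rule.decision;
  }
  return policy.default; // nothing matched
}

// The sample policy from above, as a parsed object.
const samplePolicy: Policy = {
  default: "prompt",
  rules: [
    { resource: "send", verb: "*", decision: "allow" },
    { resource: "sql", verb: "execute", decision: "prompt" },
    { resource: "file", verb: "delete", decision: "deny" },
    { resource: "mesh-mcp", verb: "call", peers: ["@trusted"], decision: "allow" },
  ],
};
```

With these semantics, `evaluate(samplePolicy, "file", "delete")` yields `deny`, and a `mesh-mcp call` aimed at an unlisted peer falls back to the default `prompt`.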
|
||||
|
||||
## What this means for `claude/channel`
|
||||
|
||||
When peer A's CLI runs `claudemesh send peer-B "hello"`:
|
||||
|
||||
1. CLI dials `~/.claudemesh/sockets/<mesh>.sock` (warm path) or opens its own WS (cold).
|
||||
2. Encrypts message with peer-B's pubkey via crypto_box.
|
||||
3. Broker receives `send` envelope, forwards encrypted blob to peer-B's connected push-pipe.
|
||||
4. Peer-B's push-pipe decrypts and emits a `claude/channel` notification.
|
||||
5. Claude Code mid-turn-injects the message as a `<channel source="claudemesh" ...>` reminder.
|
||||
6. Claude responds immediately per the system prompt convention.
|
||||
|
||||
Step 5 is the **only step that requires MCP**. Steps 1-4 are pure CLI + broker. The architecture is "CLI for everything, MCP for the one thing it's irreplaceable for."
|
||||
|
||||
## Migration path from 1.1.0
|
||||
|
||||
| Version | Ships | Behavior |
|---|---|---|
| **1.2.0** | Unix socket bridge. CLI verbs auto-detect push-pipe and use warm path. **Field-selectable JSON (`--json a,b,c`)** + `--jq` + `--template` adopted. | All existing MCP tools still work. Nothing breaks. |
| **1.2.1** | Ships `~/.claude/skills/claudemesh/SKILL.md` written by `claudemesh install`. Includes full verb reference + output schemas + gotchas. Topic-shards (`-platform`, `-schedule`, `-admin`). | Skill auto-installs on `claudemesh install`. |
| **1.3.0** | Schedule unification (`schedule msg/webhook/tool`). All remaining missing CLI verbs (file, vector, graph, mesh-mcp, vault, sql, stream, context, skill, watch). **`--output-format stream-json`** for long-running ops. | All existing MCP tools still work. New verbs additive. |
| **1.4.0** | Resource-model rename pass: every CLI verb is `<resource> <verb>`. Old verbs become aliases. | All existing MCP tools still work. Old CLI verbs aliased forever. |
| **1.5.0** | **Pluggable policy engine** (`--approval-mode`, `--policy`, `--admin-policy`). MCP `tools/list` shrinks to configurable allowlist (default: empty). `CLAUDEMESH_MCP_FAT=1` for users who need typed tool surface. | Default 1.5 install: MCP exposes zero tools. Push-pipe-only. Policy engine gates all writes. |
| **2.0.0** | MCP server hardcoded to push-pipe-only. Strip all tool registrations + handlers. | **Old MCP tool calls return tool-not-found.** Users must update scripts to CLI verbs. Old CLI verbs (1.4 aliases) still work. |
|
||||
|
||||
## What stays exactly the same
|
||||
|
||||
- Crypto: ed25519 sign + x25519 sealing + crypto_box for DMs. No change.
|
||||
- Broker protocol: WS frame format, hello flow, audit log. No change.
|
||||
- Membership / mesh-scope / capability grants. No change.
|
||||
- Web app, dashboard, Telegram bridge, OAuth. No change.
|
||||
- The platform vision (vault, vectors, graph, files, skills, mesh-MCPs, scheduled jobs). All shipped, all stay.
|
||||
|
||||
## What changes for users
|
||||
|
||||
- `~/.claude.json` simplifies: `"claudemesh": { "command": "claudemesh", "args": ["mcp"] }` becomes one entry per joined mesh after `claudemesh install`. Multi-mesh push works out of the box.
|
||||
- ToolSearch loses ~80 deferred entries. Sessions are lighter.
|
||||
- Scripts that called `mcp__claudemesh__*` get a deprecation warning in 1.x, break in 2.0 — replaced by `claudemesh <verb> --json` + `jq`.
|
||||
- Claude Code system prompt for the MCP server gets shorter (no tool catalog), focused only on "RESPOND IMMEDIATELY to channel events."
|
||||
|
||||
## Open questions parked for future specs
|
||||
|
||||
- **Federation** — broker-to-broker encrypted relay so peers on different brokers can talk. Not in 2.0 scope.
|
||||
- **Offline-with-TTL inbox** — persist `now` priority messages on broker if recipient is offline, with explicit TTL. Reasonable for 2.x.
|
||||
- **Compute attribution** — when peer X invokes a mesh-MCP that peer Y deployed, who pays for broker compute / outbound calls? Pre-empts the eventual billing question. 2.x.
|
||||
- **Universal hash-chained audit** — every state mutation per mesh is hash-chained, replayable, verifiable. Today only some events are; making it universal is its own spec.
|
||||
- **ACP (Agent Communication Protocol) interop with Gemini CLI.** Gemini CLI exposes `--acp` for agent-to-agent comms — the same problem domain claudemesh occupies. Research question: is ACP a documented standard claudemesh can speak (making claudemesh peers and Gemini peers cross-talk in the same mesh), or is it Google-proprietary? If standard, implementing it is a major platform expansion. File as separate research spec before 2.x.
|
||||
|
||||
## What this spec is NOT
|
||||
|
||||
- Not a redesign of the broker. The broker stays as-is.
|
||||
- Not a redesign of crypto. Crypto stays as-is.
|
||||
- Not a feature deprecation. Every feature stays.
|
||||
- Not optional. This is the canonical 2.0 architecture; intermediate versions migrate toward it.
|
||||
|
||||
## Effort estimate to 2.0
|
||||
|
||||
Sequential, single dev (revised after caveats survey — original estimate was rosy):
|
||||
|
||||
- **1.2.0** (socket bridge + field-JSON): 1-2 weeks. Socket bridge is real distributed-systems work (stale-cleanup, version negotiation, NFS/Windows edge cases) — not 2-3 days.
|
||||
- **1.2.1** (claudemesh skill + topic shards): 2-3 days. Mostly content writing once schemas are documented.
|
||||
- **1.3.0** (schedule unification + remaining verbs + stream-json): 1 week. Each of the ~10 missing verbs is small but adds up.
|
||||
- **1.4.0** (resource-model rename + alias compat): 2-3 days.
|
||||
- **1.5.0** (policy engine + MCP allowlist): 4-5 days. Policy engine is its own subsystem — parser, evaluator, audit log, admin override.
|
||||
- **2.0.0** (strip tool handlers + cutover): 2 days.
|
||||
|
||||
Total: **~5-6 weeks of focused work** spread over 3-4 months calendar. Each release is independently shippable; the policy engine specifically can land later than 1.5 if needed.
|
||||
|
||||
## Acceptance signals — how we know it worked
|
||||
|
||||
- **ToolSearch** in a freshly-installed claudemesh session shows zero `mcp__claudemesh__*` entries by default (vs ~80 today).
|
||||
- **`claudemesh peers --json name,status`** projects exactly two fields, no extra noise.
|
||||
- **`claudemesh send <peer> "hi"`** from a Bash call inside a Claude session round-trips in <50ms (warm path via socket bridge) on localhost-broker, <250ms on EU-from-US.
|
||||
- **`Skill: claudemesh`** loaded once teaches Claude the entire mesh surface; subsequent CLI calls require no further introspection.
|
||||
- **A policy file with `decision: deny` for `file delete`** blocks the call before it hits the broker, with a clear stderr explanation.
|
||||
- **`claudemesh status set working` from cron** opens its own WS (no daemon), succeeds in <1s, no orphan connections on broker.
|
||||
155
.artifacts/specs/2026-05-02-handoff-evening.md
Normal file
@@ -0,0 +1,155 @@
|
||||
# claudemesh handoff — 2026-05-02 (evening)
|
||||
|
||||
Companion to the morning handoff (`2026-05-02-handoff.md`). Captures
|
||||
what shipped through the v1.6.x patch line and the v1.7.0 demo cut.
|
||||
Read before the next session.
|
||||
|
||||
---
|
||||
|
||||
## What shipped this evening
|
||||
|
||||
### v1.6.x patch line — closed except bridge smoke test
|
||||
|
||||
| Feature | Endpoint / file | Commit |
|---|---|---|
| SSE topic stream | `GET /api/v1/topics/:name/stream` | `7e71a61` |
| Unread counts | `PATCH /v1/topics/:name/read`, `unread` on `GET /v1/topics` | `a80eb6f` |
| Mesh-card unread badges | `apps/web/src/app/[locale]/dashboard/(user)/page.tsx` | `541440c` |
| Member sidebar | `GET /v1/members`, chat panel right rail | `a75483b` |
| SSE 4xx-stop fix | `apps/web/src/modules/mesh/topic-chat-panel.tsx` | `7af61e1` |
| Humans-as-peers | `GET /v1/peers` includes recent apikey users | `f4601f4` |
|
||||
|
||||
### v1.7.0 demo cut — 4 of 5 items shipped
|
||||
|
||||
| Item | Code | Commit |
|---|---|---|
| Member sidebar in chat | `apps/web/src/modules/mesh/topic-chat-panel.tsx` (+sidebar) | `a75483b` |
| Topic search + autocomplete | Same file (+ search toggle, mention dropdown, clay highlight) | `35a289b`, `00c25d9` |
| Notification feed | `MentionsSection` on universe + `GET /v1/notifications` | `a9160a0` |
| Public blog post | `apps/web/src/app/[locale]/(marketing)/blog/agents-and-humans-same-chat/` | `69cf39b` |
| Demo video script | `docs/demo-v1.7.0-script.md` (90s, 5 scenes) | `69cf39b` |
| Marketing site refresh | Timeline next-block updated | `a2ab7de` |
| **Recorded demo video** | n/a | **TODO (needs human + iTerm + Chrome)** |
| **Marketing screenshots** | n/a | **TODO (needs Chrome session)** |
|
||||
|
||||
### Roadmap state
|
||||
|
||||
- `docs/roadmap.md` updated. v1.6.x marks every endpoint shipped except
|
||||
bridge smoke test. v1.7.0 marks sidebar/mentions/search/feed/blog
|
||||
shipped; recording + screenshots open.
|
||||
- v2.0.0 (daemon redesign) and v0.3.0 (operator layer / per-topic
|
||||
encryption) untouched — both still architectural specs.
|
||||
|
||||
---
|
||||
|
||||
## Live status
|
||||
|
||||
- **Broker** (`wss://ic.claudemesh.com/ws`): autodeployed via Coolify
|
||||
off the gitea-vps push. The custom migration runner from earlier
|
||||
this session is the one moving migrations forward. No new
|
||||
migrations shipped today — all v1.6.x work was code-only against
|
||||
the v0.2.0 schema.
|
||||
- **Web** (`claudemesh.com`): autodeployed via Vercel off the github
|
||||
push. Verified `/v1/notifications`, `/v1/peers`, `/v1/members`,
|
||||
`/v1/topics/general/stream`, `/v1/topics/general/read` all
|
||||
return 401 with bad bearer (i.e. they exist + auth works).
|
||||
Authenticated browser smoke not run — no Playwriter session
|
||||
available during this handoff write.
|
||||
- **CLI** (`claudemesh-cli@1.6.1` on npm): unchanged this session.
|
||||
All v1.6.x work was server + web only; CLI doesn't yet consume
|
||||
the new endpoints.
|
||||
|
||||
### CLI gap — worth noting
|
||||
|
||||
The new endpoints have NO CLI surface yet:
|
||||
|
||||
- `GET /v1/notifications` — `claudemesh notification list` could show
|
||||
recent mentions in the terminal. ~30 LoC.
|
||||
- `GET /v1/members` — `claudemesh member list` shows roster + online
|
||||
state. Distinct from `peer list` which shows live sessions.
|
||||
- `PATCH /v1/topics/:name/read` — could be implicit (called by
|
||||
`topic show <name>`) or explicit (`claudemesh topic read <name>`).
|
||||
- SSE stream — `claudemesh topic tail <name>` would tail messages
|
||||
in the terminal. High demo value.
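The core of a `topic tail` verb would be a Server-Sent Events parser fed by the response body of the stream endpoint. The sketch below covers only that parsing step, following the standard SSE framing (events separated by a blank line, `data:` lines joined with newlines, lines starting with `:` treated as comments). The class name is hypothetical, the HTTP request and bearer auth are omitted, and the payload format of each message is not assumed.

```typescript
// Sketch only: a generic SSE frame parser; bare "\r" line endings are not handled.
interface SseEvent { event: string; data: string }

class SseParser {
  private buffer = "";

  // Feed a decoded text chunk; returns every event completed by this chunk.
  push(chunk: string): SseEvent[] {
    // Normalize on the whole buffer so a CRLF split across chunks is still caught.
    this.buffer = (this.buffer + chunk).replace(/\r\n/g, "\n");
    const events: SseEvent[] = [];
    let sep = this.buffer.indexOf("\n\n");
    while (sep !== -1) {
      const block = this.buffer.slice(0, sep);
      this.buffer = this.buffer.slice(sep + 2);
      let eventName = "message"; // default event type in the SSE format
      const data: string[] = [];
      for (const line of block.split("\n")) {
        if (line === "" || line.charAt(0) === ":") continue; // comment or keep-alive
        const colon = line.indexOf(":");
        const field = colon === -1 ? line : line.slice(0, colon);
        let value = colon === -1 ? "" : line.slice(colon + 1);
        if (value.charAt(0) === " ") value = value.slice(1); // strip one leading space
        if (field === "event") eventName = value;
        else if (field === "data") data.push(value);
      }
      if (data.length > 0) events.push({ event: eventName, data: data.join("\n") });
      sep = this.buffer.indexOf("\n\n");
    }
    return events;
  }
}
```

A terminal tail would print each returned event as it arrives and keep the parser instance alive for the lifetime of the connection.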
|
||||
|
||||
Wiring these is a small CLI release (v1.7.0). Not blocking anything
|
||||
but worth doing before the recording so the demo includes a
|
||||
"terminal tail" cut.
|
||||
|
||||
---
|
||||
|
||||
## Known issues / risks
|
||||
|
||||
1. **Mentions notification endpoint depends on plaintext-base64
|
||||
ciphertext** that v0.2.0 ships. When per-topic encryption lands
|
||||
in v0.3.0, both `GET /v1/notifications` and the universe-page
|
||||
`MentionsSection` query break. Migration plan is documented in
|
||||
the blog post + the inline comment: move to a
|
||||
`mesh.notification` table populated at write time.
|
||||
|
||||
2. **Postgres `convert_from(decode(ciphertext, 'base64'), 'UTF8')`
|
||||
throws on any ciphertext that isn't valid base64-of-UTF8.** All
|
||||
current writers (broker WS path, REST POST /messages, web chat
|
||||
panel) emit base64-of-plaintext-UTF8, so this works. If a future
|
||||
writer emits binary ciphertext, the mention queries crash. Add a
|
||||
safe-base64 guard or migrate to per-write notification table
|
||||
before that happens.
|
||||
|
||||
3. **No live SSE smoke test in this session.** Endpoints respond
|
||||
401 to bad bearer. Browser-authenticated test was deferred — no
|
||||
Playwriter session was reachable during the run. Worth a
|
||||
manual smoke before recording the demo.
|
||||
|
||||
4. **CSRF middleware blocks PATCH/POST without an Origin header.**
|
||||
This is correct behaviour but trips up curl users. Documented
|
||||
in the smoke notes; not a bug.
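For issue 2, one possible "safe-base64 guard" is a writer-side check that rejects anything the mention query could not decode. The sketch below is an assumption about where such a guard might live (the spec does not say whether it belongs in SQL or in application code) and it is stricter than Postgres in one respect: it does not tolerate whitespace inside the base64 text. It also rejects NUL bytes, which Postgres text values cannot hold.

```typescript
// Sketch only: function names are hypothetical.
const B64_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Strict base64 decode; returns null on malformed input.
function decodeBase64(input: string): number[] | null {
  if (input.length % 4 !== 0 || !/^[A-Za-z0-9+/]*={0,2}$/.test(input)) return null;
  const body = input.replace(/=+$/, "");
  const bytes: number[] = [];
  let acc = 0;
  let bits = 0;
  for (let i = 0; i < body.length; i++) {
    acc = ((acc << 6) | B64_ALPHABET.indexOf(body.charAt(i))) & 0xffff;
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      bytes.push((acc >> bits) & 0xff);
    }
  }
  return bytes;
}

// Well-formed UTF-8 check: rejects overlongs, surrogates, > U+10FFFF, and NUL.
function isValidUtf8(bytes: number[]): boolean {
  let i = 0;
  while (i < bytes.length) {
    const b = bytes[i];
    if (b === 0) return false; // NUL cannot be stored in Postgres text
    if (b < 0x80) { i++; continue; }
    let extra: number;
    let min: number;
    let cp: number;
    if (b >= 0xc2 && b <= 0xdf) { extra = 1; min = 0x80; cp = b & 0x1f; }
    else if (b >= 0xe0 && b <= 0xef) { extra = 2; min = 0x800; cp = b & 0x0f; }
    else if (b >= 0xf0 && b <= 0xf4) { extra = 3; min = 0x10000; cp = b & 0x07; }
    else return false;
    if (i + extra >= bytes.length) return false; // truncated sequence
    for (let k = 1; k <= extra; k++) {
      const c = bytes[i + k];
      if ((c & 0xc0) !== 0x80) return false; // not a continuation byte
      cp = (cp << 6) | (c & 0x3f);
    }
    if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) return false;
    i += extra + 1;
  }
  return true;
}

function isSafeCiphertext(input: string): boolean {
  const bytes = decodeBase64(input);
  return bytes !== null && isValidUtf8(bytes);
}
```

A check like this only postpones the problem; once real encryption lands, the write-time notification table described in issue 1 is still the durable fix.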
|
||||
|
||||
---
|
||||
|
||||
## Next session — three branches
|
||||
|
||||
### A. Record + ship the v1.7.0 launch (~2 hours, all human work)
|
||||
1. Spin a fresh demo mesh + two iTerm panes running
|
||||
`claudemesh launch --name Mou` and `--name Alexis`.
|
||||
2. Run the demo script in `docs/demo-v1.7.0-script.md`.
|
||||
3. Cut to 90s, upload to `claudemesh.com/media/demo-v170.mp4`.
|
||||
4. Take 4-6 screenshots (universe, mesh detail, chat with sidebar,
|
||||
mentions feed, mobile view) for the blog hero + Twitter card.
|
||||
5. Cross-post per the script's distribution checklist.
|
||||
|
||||
### B. Wire CLI verbs to v1.6.x endpoints (~3 hours, code)
|
||||
1. `claudemesh notification list [--since]` → `GET /v1/notifications`.
|
||||
2. `claudemesh member list` → `GET /v1/members`.
|
||||
3. `claudemesh topic tail <name>` → SSE consumer. Print as messages
|
||||
arrive. Highest demo value.
|
||||
4. `claudemesh topic read <name>` → `PATCH /v1/topics/:name/read`.
|
||||
5. Bump `apps/cli/package.json` to 1.7.0, publish.
|
||||
|
||||
### C. v0.3.0 first slice — per-topic encryption (~5 hours, code)
|
||||
This is the next architectural cut.
|
||||
1. Schema: add `mesh.topic.encrypted_key` (encrypted-to-mesh-root).
|
||||
2. Broker: derive symmetric key on first message via HKDF; cache.
|
||||
3. Client: per-topic key fetch + `crypto_secretbox` over body.
|
||||
4. `ciphertext` column stops being plaintext-base64 → mentions
|
||||
query needs the notification table from issue #1.
|
||||
|
||||
Highest leverage right now is **A** (the recording is what turns
|
||||
shipped code into shipped product), then **B** (CLI parity makes
|
||||
the demo fuller). **C** is the next session for someone with
|
||||
2+ uninterrupted hours.
|
||||
|
||||
---
|
||||
|
||||
## Repo state
|
||||
|
||||
- `main` ahead of `gitea-vps/main` and `github/main` by 0 commits
|
||||
at handoff time — both pushed.
|
||||
- 13 commits this evening session (sse → unread → grid → sidebar →
|
||||
ssefix → mentions → search → notifications → roadmap → humans →
|
||||
roadmap2 → blog+demo → timeline).
|
||||
- No open PRs; everything went to main directly.
|
||||
- No `.skip` / TODO files / temp commits left behind.
|
||||
|
||||
---
|
||||
|
||||
*Last handoff: this file. Previous: `2026-05-02-handoff.md` (morning).*
|
||||
106
.artifacts/specs/2026-05-02-handoff.md
Normal file
@@ -0,0 +1,106 @@
|
||||
# claudemesh handoff — 2026-05-02
|
||||
|
||||
State of the world after a long session that shipped 1.5.0 and the v0.2.0 backend. Read this before the next session — it captures what's done, what's deployed where, what's not, and the architectural decisions worth knowing.
|
||||
|
||||
---
|
||||
|
||||
## Where things stand
|
||||
|
||||
### Released to npm
|
||||
- **`claudemesh-cli@1.5.0`** (latest tag, published earlier today). CLI-first architecture lock-in: zero-tool MCP, policy engine, bundled `claudemesh` skill. Verified install + smoke-tested via clean `npm i -g`.
|
||||
|
||||
### In `main` but NOT released yet
|
||||
Everything below is committed, deployed to the broker (`wss://ic.claudemesh.com/ws`) and the web app (Vercel `claudemesh.com`), but **`claudemesh-cli@1.5.0` on npm doesn't have any of it**. Users won't see it until v1.6.0 publishes.
|
||||
|
||||
| Feature | Code path | Verified live? |
|---|---|---|
| Topics (schema, broker routing, CLI verbs, skill) | `packages/db/src/schema/mesh.ts`, `apps/broker/src/broker.ts`, `apps/cli/src/commands/topic.ts` | ✅ created `#deploys-test`, sent + persisted |
| `apikey create/list/revoke` (CLI + broker WS) | `apps/cli/src/commands/apikey.ts`, broker dispatch | ✅ full lifecycle exercised |
| REST `/api/v1/*` (messages, topics, peers, history) | `packages/api/src/modules/mesh/v1-router.ts` + `api-key-auth.ts` | ✅ posted via curl, history round-trips |
| Bridge peer (SDK + CLI) | `packages/sdk/src/bridge.ts`, `apps/cli/src/commands/bridge.ts` | ⚠️ code only, never run end-to-end |
|
||||
|
||||
### Architectural commitments locked this session
|
||||
- **CLI-first, MCP push-pipe** (1.5.0): MCP `tools/list = []`. Inbound peer messages still arrive as `experimental.claude/channel` notifications. The bundled skill is the sole CLI-discoverability surface for Claude.
|
||||
- **Topics complement groups, don't replace them** (v0.2.0): mesh = trust boundary, group = identity tag, topic = conversation scope. Three orthogonal axes.
|
||||
- **Humans use REST + apikey, not browser WS** (v0.2.0): the broker already plumbs `peer_type: "human"`. The real blocker was browser-side ed25519, which we sidestep by exposing REST. Web chat UI = thin client over `/v1/*` using dashboard session auth.
|
||||
- **Spec lives at**: `.artifacts/specs/2026-05-02-architecture-north-star.md` (1.5.0) and `.artifacts/specs/2026-05-02-v0.2.0-scope.md` (v0.2.0 cut + design sketches).
|
||||
|
||||
---
|
||||
|
||||
## Three pending sessions, ranked by leverage
|
||||
|
||||
### Session A — Ship v1.6.0 npm release (~30 min, highest leverage)
|
||||
**Why first**: backend is feature-complete but unreleased. Users still get the no-topics 1.5.0.
|
||||
|
||||
Steps:
|
||||
1. Bump `apps/cli/package.json` 1.5.0 → 1.6.0.
|
||||
2. Update `apps/cli/README.md` migration note (mention topics, apikey, bridge).
|
||||
3. Add `## v1.6.0` section to `docs/roadmap.md`.
|
||||
4. Build + verify: `cd apps/cli && pnpm build && node dist/entrypoints/cli.js --version`.
|
||||
5. `npm publish --tag latest --access public --no-git-checks --ignore-scripts`.
|
||||
6. `git tag cli-v1.6.0 && git push github cli-v1.6.0` — workflow builds 5 binaries + auto-bumps Homebrew/winget tap.
|
||||
7. Verify on a clean prefix: `PREFIX=/tmp/cm16; mkdir -p "$PREFIX" && npm install -g --prefix "$PREFIX" claudemesh-cli@1.6.0 && "$PREFIX/bin/claudemesh" --help | grep -E "topic|apikey|bridge"`.
|
||||
|
||||
### Session B — Migration drift fix (~1 day, highest pain reduction)
|
||||
**Why second**: every schema change today requires manual `psql -f migration.sql` against prod. The drizzle `_journal.json` stops at idx 11, runtime migrator silently skips anything not in journal. Today's `0022_topics.sql` and `0023_api_keys.sql` were applied by hand. **Future migrations will keep needing this until fixed.**
|
||||
|
||||
Recommended approach:
|
||||
1. Replace `drizzle-orm/postgres-js/migrator` in `apps/broker/src/migrate.ts` with a custom runner.
|
||||
2. Scan `migrations/*.sql` lexicographically (already named `NNNN_*.sql`).
|
||||
3. Track applied filenames in a new `mesh.__cmh_migrations` table (filename + sha256 + applied_at).
|
||||
4. On startup: filter unapplied files and run them in filename order, transactionally, under `pg_try_advisory_lock`. Fail loud on hash mismatch (catches edits after deploy).
|
||||
5. Backfill the table with all 0000-0023 entries one-time so prod is consistent.
|
||||
6. Drop the drizzle journal usage entirely (`migrations/meta/_journal.json` becomes dead state).
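The planning half of such a runner (steps 2 to 4) can be sketched without touching the database. Everything here is illustrative: `planMigrations` and the two record shapes are hypothetical, the hashes are assumed to be computed by the caller, and the advisory lock and transaction handling are left out.

```typescript
// Sketch only: pure planning logic; locking and SQL execution are omitted.
interface MigrationFile { filename: string; sha256: string }
interface AppliedRow { filename: string; sha256: string }

// Returns the files still to run, in lexicographic filename order.
// Throws if an already-applied file no longer matches its recorded hash.
function planMigrations(files: MigrationFile[], applied: AppliedRow[]): MigrationFile[] {
  const appliedHash: { [filename: string]: string | undefined } = {};
  for (const row of applied) appliedHash[row.filename] = row.sha256;

  const sorted = files.slice().sort((a, b) =>
    a.filename < b.filename ? -1 : a.filename > b.filename ? 1 : 0
  );

  const pending: MigrationFile[] = [];
  for (const file of sorted) {
    const recorded = appliedHash[file.filename];
    if (recorded === undefined) {
      pending.push(file);
    } else if (recorded !== file.sha256) {
      // Fail loud: the file was edited after it was applied.
      throw new Error("hash mismatch for " + file.filename);
    }
  }
  return pending;
}
```

The one-time backfill in step 5 amounts to inserting rows for 0000-0023 so that this function returns an empty list on the first production run.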
|
||||
|
||||
This unblocks every future feature touching DB.
|
||||
|
||||
### Session C — Web chat UI (~2-3 days, highest visibility)
|
||||
**Why third**: the demo. Backend is ready; this is pure React + REST.
|
||||
|
||||
Path: `apps/web/src/app/[locale]/dashboard/(user)/meshes/[id]/topics/[name]/page.tsx` (new).
|
||||
|
||||
Components needed:
|
||||
- Topic header (members count, settings button).
|
||||
- Message stream — `GET /api/v1/topics/:name/messages?limit=50`. Poll every 5s for new (no WS yet — REST polling is fine for v0.2.0).
|
||||
- Compose box — `POST /api/v1/messages` with `{topic, ciphertext, nonce}`.
|
||||
- Members sidebar — `GET /api/v1/peers`.
|
||||
- Apikey lifecycle: on first load, server-side issue an apikey for the dashboard user (using their existing NextAuth session) scoped to `read,send` on this topic. Stash in browser session storage.
|
||||
|
||||
Server-side helper for apikey issuance lives in `packages/api/src/modules/mesh/api-key-auth.ts` — refactor `verifyBearer` to also expose a `createApiKeyForUser(userId, meshId, scope)` helper for the dashboard handler.
|
||||
|
||||
---
|
||||
|
||||
## Three less-urgent followups (don't block sessions A-C)
|
||||
|
||||
1. **Bridge end-to-end smoke test**: never actually run between two meshes. Needs second test mesh + bridge member onboarding ritual. Worth doing before any blog post / external demo.
|
||||
2. **`/v1/peers` includes only WS-connected agents**, not humans (since humans are REST-only and never appear in `presence`). Decide: synthetic presence rows for active apikey sessions? Or document that `/v1/peers` is "agents online"?
|
||||
3. **Topic ciphertext is plaintext base64** in the current implementation — no actual encryption. The schema names it `ciphertext` for forward-compat, but the code base64-encodes UTF-8. Real per-topic symmetric key derivation (HKDF from mesh root_key + topic_id) is a v0.3.0 item.
|
||||
|
||||
---
|
||||
|
||||
## Production state worth knowing
|
||||
|
||||
- **Broker**: `wss://ic.claudemesh.com/ws`, deployed via Coolify on OVHcloud VPS. Auto-redeploys on push to `gitea-vps main`. Deploy ETA ~3 min.
|
||||
- **Web**: `claudemesh.com`, Vercel auto-deploy on push to `github main`. Deploy ETA ~2 min.
|
||||
- **Postgres**: container `eo1f5gydsgrg19b57e9s4zw7` on the VPS. SSH via `ssh ovh`, then `docker exec eo1f5gydsgrg19b57e9s4zw7 psql -U claudemesh -d claudemesh`.
|
||||
- **Test mesh**: `openclaw` on the same broker has 5 active peers and one topic (`#deploys-test`).
|
||||
- **Active apikey** (from earlier today's smoke): `cm_OC12dRti…` was revoked. None active right now.
|
||||
|
||||
---
|
||||
|
||||
## Files most worth reading first in next session
|
||||
|
||||
1. `.artifacts/specs/2026-05-02-architecture-north-star.md` — the 7 architectural commitments.
|
||||
2. `.artifacts/specs/2026-05-02-v0.2.0-scope.md` — design sketches for topics, REST, bridge.
|
||||
3. `apps/cli/skills/claudemesh/SKILL.md` — the canonical CLI surface; ships in npm tarball.
|
||||
4. This file.
|
||||
|
||||
---
|
||||
|
||||
## Memory not yet captured
|
||||
|
||||
Worth adding to `~/.claude/projects/-Users-agutierrez-Desktop-claudemesh/memory/MEMORY.md` next session:
|
||||
|
||||
- **Drizzle journal drift is a recurring trap** — manual psql until session B lands. Save the exact apply ritual: `scp migrations/NNNN.sql ovh:/tmp/ && ssh ovh "docker cp /tmp/NNNN.sql <pg-container>:/tmp/ && docker exec <pg-container> psql -U claudemesh -d claudemesh -f /tmp/NNNN.sql"`.
|
||||
- **`workspace:*` deps break `npm publish`** — keep SDK as devDependency in `apps/cli/package.json`; Bun bundles it into dist so runtime doesn't need it. Same trick for any other workspace-only build deps.
|
||||
- **Commitlint hard-caps body lines at 100 chars** — use `git commit -F /tmp/cm-commit.txt` rather than `-m` heredocs. Heredocs that exceed the limit fail the husky hook silently.
|
||||
227
.artifacts/specs/2026-05-02-roadmap.md
Normal file
@@ -0,0 +1,227 @@
|
||||
# claudemesh internal roadmap — 2026-05-02
|
||||
|
||||
Strategic counterpart to `docs/roadmap.md` (which is the public, marketing-tone roadmap). This file captures the *why*, the dependencies, the costs, and the things we deliberately won't do.
|
||||
|
||||
Anchored in the v0.2.0 backend cut + `#general` auto-creation + filename-tracked migrator + owner-member backfill that all shipped 2026-05-02.
|
||||
|
||||
---
|
||||
|
||||
## Forcing function
|
||||
|
||||
> **Ship v1.6.x in 2 weeks. Ship v1.7.0 in a month. Make the demo. Then commit the daemon.**
|
||||
|
||||
Each release stands on its own — usable and shippable even if the next slips. That's the property to optimize for, not "fastest path to v3.0.0."
|
||||
|
||||
---
|
||||
|
||||
## Schedule
|
||||
|
||||
| When | Version | Theme | Status |
|---|---|---|---|
| Now | 1.6.0 | v0.2.0 backend cut | ✅ shipped 2026-05-02 |
| +2w | 1.6.x | Demo polish (SSE, unread, sidebar) | Active |
| +5w | 1.7.0 | First marketing-ready version | Planned |
| +9w | 2.0.0 | Daemon redesign | Planned |
| +15w | 0.3.0 | Self-hosted + per-topic encryption + gateways | Planned |
| TBD | 3.0.0 | Native Claude channels | Anthropic-gated |
|
||||
|
||||
≈4 months from today to a teams-can-self-host shape. The MCP bridge stays load-bearing the whole time but stops being the user's problem at v2.0.0.
|
||||
|
||||
---
|
||||
|
||||
## v1.6.x patch line — 0-2 weeks, polish what's deployed
|
||||
|
||||
| Item | Effort | Why now |
|---|---|---|
| Real-time push (SSE on `/api/v1/topics/:name/stream`) | 2 days | Chat lag is the only user-visible v0.2.0 wart. Replaces 5s polling. |
| Unread counts via `last_read_at` | ½ day | Schema column already exists. PATCH on scroll-to-bottom + chip on topic list. |
| Bridge end-to-end smoke (two-mesh forwarding test) | ½ day | Feature shipped, never validated. Catches obvious bugs before any external demo. |
| Drizzle journal + `meta/` cleanup | 1 hour | Inert dead files since the new runner. Low-risk cosmetic. |
| `/v1/peers` includes humans (synthetic presence rows for active apikeys) | 1 day | Today the dashboard chat user is invisible to other peers. |
|
||||
|
||||
Total: ~1 week of focused work. Closes the v0.2.0 backend chapter cleanly.
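The unread-count item in the table above reduces to counting messages newer than the member's `last_read_at`. A minimal in-memory sketch, assuming a hypothetical `StoredMessage` shape; a real implementation would more likely be a SQL count against the topic message table:

```typescript
// Hypothetical shape for illustration only.
interface StoredMessage {
  id: string;
  createdAt: Date;
}

// Count messages created strictly after the member's last_read_at.
// A null lastReadAt means the member has never opened the topic.
function unreadCount(messages: StoredMessage[], lastReadAt: Date | null): number {
  if (lastReadAt === null) return messages.length;
  const cutoff = lastReadAt.getTime();
  return messages.filter((m) => m.createdAt.getTime() > cutoff).length;
}
```

The PATCH on scroll-to-bottom would then set `last_read_at` to the newest rendered message's timestamp, which drops the count back to zero.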
|
||||
|
||||
---
|
||||
|
||||
## v1.7.0 — 2-3 weeks, the demo cut
|
||||
|
||||
The release that turns claudemesh into a thing you can record and show.
|
||||
|
||||
**Scope:**
|
||||
- Member sidebar in the chat panel — names, online dots, presence summaries. Comes nearly free with SSE from v1.6.x.
|
||||
- Topic search + member-mention autocomplete — `@Mou` hot-keys to `claudemesh send Mou ...`.
|
||||
- Notification feed at `/dashboard` — "you have N unread in #deploys, 2 mentions in #incident." Purely aggregate; no new schema.
|
||||
- One-line marketing site refresh — capture screenshots from the now-real-time UI, drop the v0.2.0 stamp from the chat footer, update README/landing.
|
||||
- First public blog post + recorded demo — "claudemesh in 90 seconds" video. Triggers the first proper user-acquisition push.
|
||||
|
||||
**Not in scope:** any architectural change. v1.7.0 is pure UX polish on top of the v1.6.x foundation. Architecture work waits for v2.0.0.
|
||||
|
||||
**Why this comes before v2.0.0:** without users, the daemon is a solution for nobody. v1.7.0 produces the first real user signal so v2.0.0 has data to optimize against.
|
||||
|
||||
---
|
||||
|
||||
## v2.0.0 — 3-4 weeks, the daemon redesign
|
||||
|
||||
The single largest architectural shift on the roadmap. Background and rationale captured at length elsewhere this session; summary here.
|
||||
|
||||
### Single load-bearing principle
|
||||
|
||||
> **The user is the unit of mesh participation, not the Claude session.**
|
||||
|
||||
Every weird edge case from this session — the launch tax, the orphan owner, the per-session keypair churn, the MCP install/uninstall ritual, multi-Claude config corruption — comes from getting this one thing wrong today. Fix it once, structurally, and 70% of accumulated complexity vanishes.
|
||||
|
||||
### Architecture
|
||||
|
||||
```
claudemesh.com (web identity + workspace admin)
       │
       ▼ JWT
broker (unchanged): wss://ic.claudemesh.com/ws
       │
       ▼ ws per workspace
claudemesh-daemon (per user, launchd/systemd, persistent)
       │
       ▼ unix socket
  ┌────┴────┐
  ▼         ▼
CLI verbs   MCP push-pipe (~50 LoC)
            │
            ▼
            claude (any number of sessions)
```
|
||||
|
||||
### What v2.0.0 ships
|
||||
|
||||
- **`claudemesh-daemon`** — long-lived per-user process. One WS per workspace, kept alive across Claude session lifetimes. Listens on `~/.claudemesh/sockets/<workspace>.sock`. Started by `claudemesh login`, persists across reboots.
|
||||
- **HKDF-derived peer keypairs from JWT** — same identity across machines, no key copy ritual. Web sign-up = CLI sign-up = same row in `mesh_member`.
|
||||
- **Stateless CLI verbs** — each existing command (`send`, `peers`, `topic`, `apikey`, `bridge`, `state`, `remember`, etc.) retargeted to dial the daemon socket. ~3000 LoC of plumbing deleted, ~500 LoC of glue added.
|
||||
- **50-line MCP server** — dial daemon, forward inbound peer messages as `experimental.claude/channel` notifications. The push-pipe shrinks from ~150 LoC to ~50.
|
||||
- **`claudemesh launch` deprecated** — replaced by ambient mode: `claude` with no flags. Launch becomes a one-line alias that prints "ambient mode now, just run `claude`" and exits.
|
||||
- **"Mesh" → "workspace"** in the public surface. DB tables keep `mesh_*` names for migration sanity.
|
||||
|
||||
### What v2.0.0 kills
|
||||
|
||||
- `claudemesh launch` command — the 8-thing bootstrap was paying for state the daemon now owns persistently.
|
||||
- `--dangerously-skip-permissions` — set once at install in `settings.json` allowedTools, never seen by the user again.
|
||||
- `--dangerously-load-development-channels` — written into `~/.claude.json` once at install, never seen again.
|
||||
- Per-session `CLAUDEMESH_CONFIG_DIR` tmpdir — daemon owns config.
|
||||
- Per-session `CLAUDEMESH_DISPLAY_NAME` env var — daemon stores it.
|
||||
- MCP install/uninstall ritual on every launch — MCP entry is permanent.
|
||||
- Multi-Claude config corruption — only the daemon writes config.
|
||||
- Orphan-owner bug (just fixed via backfill) — structurally impossible because web sign-up creates the member row.
|
||||
|
||||
### What v2.0.0 keeps
|
||||
|
||||
- Wire protocol, crypto primitives, broker schema — 100% unchanged.
|
||||
- All CLI verb names — 100% unchanged (just retargeted).
|
||||
- REST `/api/v1/*` surface — 100% unchanged.
|
||||
- Web chat UI — 100% unchanged.
|
||||
- Bridge peer feature — 100% unchanged.
|
||||
- Topic semantics, ciphertext field, ephemeral DMs — 100% unchanged.
|
||||
|
||||
### Cost
|
||||
|
||||
- ~3 weeks focused engineering
|
||||
- ~30% LoC reduction in the CLI package
|
||||
- ~80% reduction in support load for "launch flags," "config corruption," "peer keypair lost," "owner has no member row"
|
||||
- ~0 cost to broker, web app, schema, protocol — none of the deep parts change
|
||||
|
||||
### Migration path (backwards-compatible at every step)
|
||||
|
||||
1. **Week 1** — daemon binary + unix socket protocol + retarget two CLI verbs (`send`, `peers`) as the smoke test. Ship to alpha testers.
|
||||
2. **Week 2** — retarget remaining verbs. HKDF-keypair migration with a one-shot `claudemesh migrate-identity` command for existing users.
|
||||
3. **Week 3** — `claudemesh launch` becomes a deprecated alias. MCP server retargeted to daemon socket. Backfill: every existing user's daemon spins up on first `claudemesh` invocation.
|
||||
4. **Cut v2.0.0**, then remove the deprecated launch alias one minor release later (v2.1.0), once metrics show nobody is hitting it.
|
||||
|
||||
---
|
||||
|
||||
## v0.3.0 — 4-6 weeks, the operator chapter
|
||||
|
||||
For teams that want to run their own broker, encrypt at the topic level, or wire claudemesh to messaging surfaces beyond Claude Code.
|
||||
|
||||
- **Per-topic HKDF encryption** — kills the "broker can read your messages" wart. Symmetric key derived from `mesh.root_key + topic.id`. Web client gets the topic key from the sealed root_key it already holds.
|
||||
- **Self-hosted broker packaging** — single `docker-compose.yml`, postgres included. CLI accepts `--broker wss://...` to point anywhere. Federation primer.
|
||||
- **WhatsApp gateway** — peer bot that forwards a topic to a WhatsApp group.
|
||||
- **Telegram gateway** — same pattern.
|
||||
- **Tag routing** — `claudemesh send tag:repo:billing "deployed"` lands at every peer working on that repo. Already protocol-supported, needs CLI ergonomics + dashboard surface.
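The per-topic derivation named in the first bullet (a symmetric key from `mesh.root_key + topic.id`) could look roughly like this. It is a sketch under stated assumptions: Node's built-in `hkdfSync`, SHA-256, an all-zero salt, and a hypothetical `claudemesh/topic/<id>` info label. None of those parameter choices come from the roadmap.

```typescript
import { hkdfSync } from "node:crypto";

// Derive a 32-byte symmetric key for one topic from the mesh root key.
// Digest, salt, and info label are illustrative choices, not the shipped format.
function deriveTopicKey(rootKey: Uint8Array, topicId: string): Uint8Array {
  const salt = new Uint8Array(32); // all-zero salt (assumption)
  const info = Buffer.from(`claudemesh/topic/${topicId}`, "utf8"); // hypothetical label
  return new Uint8Array(hkdfSync("sha256", rootKey, salt, info, 32));
}
```

Because the topic id only enters through the `info` parameter, every holder of the root key derives the same topic key without a round trip, and two topics in the same mesh get unrelated keys.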
|
||||
|
||||
v0.3.0 is when teams that want to run their own broker can do so without paying us. Counterintuitively important: it's also when we can charge for hosted with a clean conscience.
|
||||
|
||||
---
|
||||
|
||||
## v3.0.0 — Anthropic-blessed cut (conditional)
|
||||
|
||||
Conditional on Anthropic shipping first-class agent-to-agent channels in Claude Code. We don't control the timing.
|
||||
|
||||
### What's load-bearing about today's flag
|
||||
|
||||
`--dangerously-load-development-channels server:claudemesh` does two things:
|
||||
|
||||
1. Loads the claudemesh MCP server.
|
||||
2. Tells Claude Code to treat its `experimental.claude/channel` notifications as runtime channel events.
|
||||
|
||||
The flag is named `dangerously-load-development-channels` *specifically because* the channel API is experimental and unstable. Some opt-in mechanism will always be required for Claude Code to receive external events from a third-party process — that's a security-model invariant, not a quirk of today's flag. What changes at v3.0.0 is the *form* of the opt-in, not its existence.
|
||||
|
||||
### Two scenarios depending on Anthropic's choice
|
||||
|
||||
**Scenario A — MCP-channel API graduates.** The same MCP-based push primitive becomes stable.
|
||||
- MCP wrapper stays (still translates `ws://broker → MCP notification`).
|
||||
- The `--dangerously-load-development-channels` flag is replaced by a stable settings.json entry — e.g. `mcpServers.claudemesh.acceptChannelNotifications = true`.
|
||||
- The `experimental.` prefix on the notification namespace goes away.
|
||||
- Net user-visible change: nothing, because we already write the flag once at install and the user never sees it. The migration is internal: swap the install logic to write the new settings entry instead of the old flag.
|
||||
|
||||
**Scenario B — non-MCP transport ships.** Anthropic introduces a sidecar IPC, a native WebSocket subscription declared in settings, or some other primitive.
|
||||
- The 50-line MCP wrapper from v2.0.0 disappears.
|
||||
- The daemon plugs into the new transport directly.
|
||||
- Some opt-in config is still required (settings.json entry, environment variable, etc.) — Claude Code must know to subscribe to the daemon's channel.
|
||||
- Net user-visible change: still nothing if our `claudemesh install` adapts to write the new opt-in form.
|
||||
|
||||
### What disappears regardless
|
||||
|
||||
- The `experimental.` prefix on the channel API (it stabilizes).
|
||||
- The `dangerously-` framing of the flag (the API is no longer experimental).
|
||||
- The "you have to pass a launch flag to load development channels" mental model.
|
||||
|
||||
### What stays regardless
|
||||
|
||||
- An opt-in mechanism somewhere (security model invariant).
|
||||
- The daemon as the lifecycle owner.
|
||||
- The protocol, schema, broker, topics, web chat — all unchanged.
|
||||
|
||||
### Marketing pivot
|
||||
|
||||
claudemesh becomes a "hosted backend for Claude's native multi-agent feature" rather than a "Claude Code extension." The product story simplifies regardless of which shape ships, because the user no longer has to think about MCP servers, dangerous flags, or experimental APIs — claudemesh is just there.
|
||||
|
||||
Until v3.0.0 lands, v2.x ships with the MCP bridge under the existing flag. v3.0.0 is the migration target, not a planned feature.
|
||||
|
||||
---
|
||||
|
||||
## Cross-cutting tracks (always-on, not version-gated)
|
||||
|
||||
| Track | What it covers | Target version |
|---|---|---|
| Mobile | iOS peer app (thin: push + reply, same JWT identity) | v2.x |
| Browser peer (proper) | IndexedDB ed25519 + WebCrypto crypto_box for the dashboard. Today's web is REST-only; this makes it a true peer. | v2.x |
| Peer transcript queries | "Hey Claude2, what have you touched in the last hour?" cross-session memory primitive | v0.3.0+ |
| Mesh analytics | Volume, presence, handoff latency dashboards | v0.3.0 |
| Slack peer (first-party) | Today: build-your-own. Target: shipped natively. | v0.3.0 |
|
||||
|
||||
---
|
||||
|
||||
## Deliberate exclusions
|
||||
|
||||
| Idea | Why deferred |
|---|---|
| Custom bot framework / plugin marketplace | Premature: claudemesh barely has organic users. Build the user base first, then platform. |
| Voice channels | Out of scope. Different product. |
| Video chat | Same. |
| Email-as-peer (incoming SMTP → mesh) | Has demand from one user; ship if 3+ ask. |
| AI summarization of channels | LLM cost + scope creep. Users can wire their own with the existing message API. |
| Mobile push notifications via APNs/FCM | Wait for the iOS peer app, then revisit. |
| Reactions / threading | Not yet: would muddle the protocol surface for marginal value. Reconsider after v0.3.0 user feedback. |
|
||||
|
||||
---
|
||||
|
||||
## Single-sentence summary
|
||||
|
||||
**Polish v1.6.x → ship v1.7.0 demo → commit v2.0.0 daemon → open the operator chapter at v0.3.0 → plug into native channels at v3.0.0 when Anthropic ships them.** Each release stands on its own. The protocol, the schema, the broker, and the topics are all already correct — what changes is the lifecycle owner around them.
|
||||
178
.artifacts/specs/2026-05-02-topic-key-onboarding.md
Normal file
@@ -0,0 +1,178 @@
|
||||
# Topic-key onboarding — v0.3.0 phase 2
|
||||
|
||||
The schema for per-topic encryption is shipped (migration 0026). The
|
||||
broker generates a 32-byte XSalsa20-Poly1305 key when a topic is
|
||||
created and seals one copy for the creator via `crypto_box`. The open
|
||||
question is **how new joiners get their sealed copy** without giving
|
||||
the broker the plaintext.
|
||||
|
||||
This spec covers the three live options, picks one for v0.3.0 phase 2,
|
||||
and parks the rest as future cuts. Implementation is **not in this
|
||||
spec** — that follows once we ship the chosen flow.
|
||||
|
||||
---
|
||||
|
||||
## The constraint
|
||||
|
||||
The broker holds:
|
||||
|
||||
- `topic.encrypted_key_pubkey` — the ephemeral x25519 pubkey used to
|
||||
seal each member's copy. Public. The matching secret is **discarded
|
||||
immediately after creation** — only the topic creator's session
|
||||
knows the topic key briefly during sealing, then it leaves memory.
|
||||
- `topic_member_key.(encrypted_key, nonce)` — per-member sealed
|
||||
ciphertext.
|
||||
|
||||
The broker **must not** be able to decrypt any sealed copy. So when a
|
||||
new member joins a topic that already exists, the broker can't seal a
|
||||
copy for them by itself.
|
||||
|
||||
## Option A — server-side escrow (REJECTED)
|
||||
|
||||
Broker holds the topic key encrypted under its own service key + per-
|
||||
member sealed copies. Re-sealing for new members is a server-only
|
||||
operation.
|
||||
|
||||
**Why rejected:** the broker can read every message in every topic
|
||||
forever. Calling that "per-topic encryption" misleads users. Worse
|
||||
than today's plaintext-base64 because it implies a security property
|
||||
the design doesn't deliver.
|
||||
|
||||
## Option B — member-driven re-seal (CHOSEN for phase 2)
|
||||
|
||||
When a new member joins, an existing member's CLIENT decrypts their
|
||||
own sealed copy of the topic key, then seals a new copy for the
|
||||
joiner and POSTs it to the broker.
|
||||
|
||||
**Wire:**
|
||||
|
||||
1. New member joins via `claudemesh topic join <topic>` — broker
|
||||
inserts `topic_member` row, no `topic_member_key` row.
|
||||
2. New member calls `GET /v1/topics/:name/key` → 404 with
|
||||
`key_not_sealed_for_member`.
|
||||
3. Existing online members (any of them) periodically poll
|
||||
`GET /v1/topics/:name/pending-seals` (new endpoint) and see the
|
||||
new joiner.
|
||||
4. Existing member's client:
|
||||
- Decrypts their own sealed copy via `crypto_box_open` with their
|
||||
x25519 secret + `topic.encrypted_key_pubkey`.
|
||||
- Generates a fresh ephemeral x25519 keypair.
|
||||
- Seals the topic key for the joiner via `crypto_box` with the
|
||||
joiner's pubkey + the new ephemeral.
|
||||
- POSTs the result to `POST /v1/topics/:name/seal`.
|
||||
5. Broker stores the new `topic_member_key` row.
|
||||
6. New member's `GET /v1/topics/:name/key` now returns 200.
|
||||
|
||||
**Trust model:** broker never sees plaintext. Assumes at least one
|
||||
existing member is online when the joiner connects. Worst case the
|
||||
joiner waits — UI shows "waiting for a peer to share the topic key"
|
||||
until somebody seals.
|
||||
|
||||
**Open detail — sender pubkey identity:** each re-seal uses a fresh
|
||||
ephemeral pubkey. Either:
|
||||
|
||||
(a) Store ALL ephemeral pubkeys ever used to seal copies of this
|
||||
topic, indexed by member, so the joiner can pick the right one
|
||||
when decrypting. Adds a new table.
|
||||
(b) Embed the ephemeral pubkey in the sealed payload itself (
|
||||
`encrypted_key` becomes `<32-byte ephem_pubkey><crypto_box_easy>`).
|
||||
Decoder pulls the prefix, uses it as the sender pubkey. No schema
|
||||
change beyond what 0026 already ships.
|
||||
|
||||
**(b) wins on simplicity. Phase 3 implementation ships it. Both the
|
||||
broker creator-seal and the CLI re-seal write the
|
||||
`<32-byte sender pubkey><cipher>` blob.** `topic.encrypted_key_pubkey`
|
||||
becomes informational only — the wire-format truth is the inline prefix.
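The inline-prefix blob from option (b) can be sketched as a pair of pack/unpack helpers. This is an illustration under assumptions, not the code in `services/crypto/topic-key.ts`: the function names are hypothetical, and the ciphertext is treated as an opaque byte array produced elsewhere by `crypto_box`.

```typescript
const PUBKEY_BYTES = 32; // x25519 public key length

// Build `<32-byte sender pubkey><cipher>` as described in option (b).
function packSealedKey(senderPubkey: Uint8Array, cipher: Uint8Array): Uint8Array {
  if (senderPubkey.length !== PUBKEY_BYTES) {
    throw new Error("sender pubkey must be 32 bytes");
  }
  const blob = new Uint8Array(PUBKEY_BYTES + cipher.length);
  blob.set(senderPubkey, 0);
  blob.set(cipher, PUBKEY_BYTES);
  return blob;
}

// Split the blob back into the sender pubkey prefix and the ciphertext.
function unpackSealedKey(blob: Uint8Array): { senderPubkey: Uint8Array; cipher: Uint8Array } {
  if (blob.length <= PUBKEY_BYTES) {
    throw new Error("blob too short to contain a pubkey prefix and ciphertext");
  }
  return {
    senderPubkey: blob.slice(0, PUBKEY_BYTES),
    cipher: blob.slice(PUBKEY_BYTES),
  };
}
```

The joiner would pass `senderPubkey`, `cipher`, and the stored nonce to its box-open routine along with its own secret key, which is why no separate table of ephemeral pubkeys is needed.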
|
||||
|
||||
## Web client gap (phase 3.5)
|
||||
|
||||
The CLI side of phase 3 ships in this cut. The web side does NOT —
|
||||
because web member rows have `peerPubkey` registered server-side but
|
||||
the corresponding ed25519 SECRET is discarded immediately after
|
||||
generation (see `mutations.ts:createMyMesh`). Without the secret the
|
||||
browser can't `crypto_box_open` its sealed topic key.
|
||||
|
||||
Three fixes, in increasing order of effort:
|
||||
|
||||
1. **Browser-side persistent identity (recommended)** — generate an
|
||||
ed25519 keypair in the browser on first dashboard visit, store the
|
||||
secret in IndexedDB, sync the public half to `mesh.member.peerPubkey`
|
||||
via a new `POST /v1/me/peer-pubkey` endpoint. Topic keys then seal
|
||||
to the new pubkey; web user decrypts locally. Existing #general
|
||||
topics need a re-seal cycle (the v0.3.0 phase-3 re-seal loop in
|
||||
the CLI already does this for any pending member, including web
|
||||
ones). Spec lift: ~3 hours, mostly browser code + a sync endpoint.
|
||||
|
||||
2. **Server-held secret** — keep the member's ed25519 secret server-
|
||||
side. Trivial to implement, but the broker can read everything,
|
||||
defeating the security claim. **Rejected.**
|
||||
|
||||
3. **JWT-derived keys** — derive the member's keypair from a stable
|
||||
user-secret (e.g. PBKDF2 over their session JWT). Means cross-
|
||||
device same key, but needs the JWT to include ~32 bytes of stable
|
||||
key material. Tied to v2.0.0 daemon redesign. **Deferred.**
|
||||
|
||||
Phase 3 defers option 1 to phase 3.5; until then, web stays on v1 plaintext.
|
||||
The CLI re-seal loop in `topic tail` already handles re-sealing for
|
||||
web members ONCE they have a real pubkey — no broker work needed
|
||||
when 3.5 lands.
|
||||
|
||||
## Option C — leaderless protocol (DEFERRED)
|
||||
|
||||
MLS, TreeKEM, or similar continuous group key agreement. Right answer
|
||||
for groups >50 members. Overkill for v0.3.0 — implementation cost is
|
||||
4-6 weeks of focused work, and the threat model gain over Option B
|
||||
only matters if we believe a member's machine can be silently
|
||||
compromised long enough to leak the topic key but short enough that
|
||||
they aren't kicked from the topic.
|
||||
|
||||
Park for v0.4.0 or v0.5.0. Revisit when we onboard a customer that
|
||||
asks for FS (forward secrecy) on group chat.
|
||||
|
||||
---
|
||||
|
||||
## Implementation checklist
|
||||
|
||||
Schema (0026 — done):
|
||||
- [x] `topic.encrypted_key_pubkey` (informational; wire truth is the
|
||||
inline 32-byte prefix on each `topic_member_key.encryptedKey`)
|
||||
- [x] `topic_member_key.(encrypted_key, nonce)`
|
||||
- [x] `topic_message.body_version` (1 = plaintext, 2 = v2 ciphertext)
|
||||
|
||||
API (phase 3 — done):
|
||||
- [x] `GET /v1/topics/:name/key` — fetch the calling member's sealed copy
|
||||
- [x] `GET /v1/topics/:name/pending-seals` — list members without keys
|
||||
- [x] `POST /v1/topics/:name/seal` — submit a re-sealed copy
|
||||
- [x] `GET /v1/topics/:name/messages` returns `bodyVersion`
|
||||
- [x] `GET /v1/topics/:name/stream` emits `bodyVersion`
|
||||
- [x] `POST /v1/messages` accepts `bodyVersion` (1|2) + skips regex
|
||||
mention extraction on v2
|
||||
|
||||
Broker / web mutation (phase 3 — done):
|
||||
- [x] `createTopic` generates topic key + seals for creator with
|
||||
inline-sender-pubkey blob format
|
||||
- [x] `ensureGeneralTopic` (web) mirrors the same flow
|
||||
|
||||
Client — CLI (phase 3 — done):
|
||||
- [x] `services/crypto/topic-key.ts` — fetch + decrypt + encrypt + reseal helpers
|
||||
- [x] `topic tail` decrypts v2 messages on render
|
||||
- [x] `topic post` encrypts v2 on send via REST POST /v1/messages
|
||||
- [x] Background re-seal loop in `topic tail` (30s cadence)
|
||||
|
||||
Client — web (phase 3.5 — DEFERRED):
|
||||
- [ ] Browser-side persistent identity (IndexedDB)
|
||||
- [ ] `POST /v1/me/peer-pubkey` sync endpoint
|
||||
- [ ] Web chat panel encrypt-on-send + decrypt-on-render (currently v1)
|
||||
|
||||
UX surfaces (phase 3 — done in CLI):
|
||||
- [x] "waiting for a peer to share the topic key" warning on tail
|
||||
- [ ] (web) "your encryption keys are pending — pair this browser"
|
||||
banner once 3.5 lands
|
||||
|
||||
Mention fan-out from phase 1 already works for both v1 and v2
|
||||
messages, so `/v1/notifications` keeps working through the cutover.
|
||||
|
||||
The phase-3 cut ships full CLI encryption + re-seal flow. Web remains
|
||||
on v1 plaintext until 3.5 lands the browser identity layer. Mixed
|
||||
CLI+web meshes in the meantime should keep using v1 sends OR accept
|
||||
that web members can't read v2 messages.
|
||||
273
.artifacts/specs/2026-05-02-v0.2.0-scope.md
Normal file
@@ -0,0 +1,273 @@
|
||||
# claudemesh v0.2.0 — scope
|
||||
|
||||
**Date:** 2026-05-02
|
||||
**Status:** draft
|
||||
**Predecessor:** [`2026-05-02-architecture-north-star.md`](./2026-05-02-architecture-north-star.md) (1.5.0 architecture lock)
|
||||
|
||||
---
|
||||
|
||||
## Cut
|
||||
|
||||
**Theme: from agent-only mesh to mesh of agents, humans, and external systems — with conversation context.**
|
||||
|
||||
| # | Feature | Effort | Spine |
|---|---------|--------|-------|
| 1 | **Topics** (channels/rooms within a mesh) | 2-3 d | yes |
| 2 | **Humans in the mesh** (web chat panel) | 2-3 d | depends on #1 |
| 3 | **REST API + external WS** (API keys per mesh) | 2-3 d | depends on #1 |
| 4 | **Bridge peer** (forwards one topic between meshes) | 1 d | depends on #1 |
|
||||
|
||||
Optional pickup if all four ship early:
|
||||
- **Local peer aliases** (~0.5 d) — IRC-style local labels for hard-to-remember displayNames.
|
||||
- **Semantic peer search** (~0.5 d) — already in vision doc; useful once topics exist.
|
||||
|
||||
Total: 7-9 days plus 1-2 days slack. Targeting **release window: 2026-05-12 to 2026-05-16**.
|
||||
|
||||
---
|
||||
|
||||
## Why this cut
|
||||
|
||||
The 1.5.0 architecture (CLI-first, tool-less MCP, policy engine) is finished. The next bottleneck is **product surface**, not engineering.
|
||||
|
||||
Current taxonomy `mesh + group + role` is the right *organizational* structure but missing a *conversational* primitive. Every message is DM or `@group` broadcast — there's no continuity for "the deploys conversation," no scoped state/memory/files, no way for a human to join a topic without joining the whole mesh, no way for a bridge to forward a single thread of work.
|
||||
|
||||
**Topics fix this.** They are the spine of v0.2.0:
|
||||
- Without topics, "humans in mesh" floods every human with every peer's chatter.
|
||||
- Without topics, "bridge" forwards everything (loop risk, signal-to-noise problem).
|
||||
- Without topics, REST API endpoints have no natural sub-mesh scope.
|
||||
|
||||
Once topics exist, humans + REST + bridge each become 50% smaller because they slot into a clean primitive instead of inventing one.
|
||||
|
||||
---
|
||||
|
||||
## Deferred
|
||||
|
||||
| Item | Why later |
|---|---|
| **Federation** (broker-to-broker) | Bridges prototype it. Learn from real use first. |
| **Sandboxes** (E2B / Modal) | Orthogonal capability. Separate release. |
| **Sim SDK** (`@claudemesh/sim`) | Niche audience; long-tail. v0.3.0+. |
| **Welcome back / persistent MCP** | Already in progress as 1.6.0 patch. |
| **Mesh telemetry** | Pre-PMF telemetry is busywork; users first. |
|
||||
|
||||
---
|
||||
|
||||
## Design sketches
|
||||
|
||||
### 1. Topics
|
||||
|
||||
**Mental model:** mesh is *who you trust*; group is *who you are*; topic is *what you're talking about*. Three orthogonal axes.
|
||||
|
||||
**Wire shape:**
|
||||
|
||||
```yaml
topic:
  id: <ulid>
  mesh_slug: openclaw
  name: deploys            # unique within mesh
  description: "deploy + on-call"
  visibility: public       # public | private (invite-only) | dm (1:1, autocreated)
  created_by: <pubkey>
  created_at: <ts>
```
|
||||
|
||||
**Membership:**
|
||||
|
||||
```yaml
topic_member:
  topic_id: <ulid>
  pubkey: <hex>            # session pubkey OR member_pubkey for durable identity
  role: lead | member | observer
  joined_at: <ts>
  last_read_at: <ts>       # for unread counts
```
|
||||
|
||||
**Messages reference a topic, not just a target:**
|
||||
|
||||
```jsonc
// existing send_message envelope gains a `topic` field
{
  "to": "#deploys",       // or topic id, or peer name (DM)
  "topic": "deploys",     // optional explicit, inferred from `to: #<topic>`
  "message": "...",
  "priority": "next"
}
```
|
||||
|
||||
**Resolution rules:**
|
||||
- `to: "alice"` → DM to peer alice (no topic).
|
||||
- `to: "@frontend"` → group broadcast (no topic — backwards compatible with 1.5.0).
|
||||
- `to: "#deploys"` → topic message; delivered only to topic subscribers.
|
||||
- `to: "*"` → mesh-wide broadcast (kept; lower-priority than topic for new comms).
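The four resolution rules above can be expressed as a small classifier. This is a sketch only; the `Target` type and `resolveTarget` name are hypothetical and not taken from the broker code.

```typescript
// Hypothetical result type for illustration.
type Target =
  | { kind: "broadcast" }
  | { kind: "topic"; topic: string }
  | { kind: "group"; group: string }
  | { kind: "dm"; peer: string };

// Classify a `to` value by its prefix, following the rules listed above.
function resolveTarget(to: string): Target {
  if (to === "*") return { kind: "broadcast" };
  const prefix = to.charAt(0);
  if (prefix === "#") return { kind: "topic", topic: to.slice(1) };
  if (prefix === "@") return { kind: "group", group: to.slice(1) };
  return { kind: "dm", peer: to }; // bare name: direct message, no topic
}
```

Keeping `@` for groups and `#` for topics means 1.5.0 clients that only ever send `@group` or bare peer names resolve exactly as before.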
|
||||
|
||||
**State/memory/files scoping:**
|
||||
- `claudemesh state set <k> <v> --topic deploys` — namespace under topic.
|
||||
- `claudemesh remember "..." --topic deploys` — topic-scoped memory.
|
||||
- `claudemesh file list --topic deploys` — files visible only to topic members.
|
||||
|
||||
**CLI:**
|
||||
|
||||
```bash
claudemesh topic create deploys --description "deploy + on-call"
claudemesh topic list                     # all topics in mesh
claudemesh topic join deploys
claudemesh topic leave deploys
claudemesh topic invite deploys <peer>    # private topics
claudemesh topic members deploys
claudemesh topic delete deploys           # creator/admin only
claudemesh send "#deploys" "rolling out 1.5.1"
```
|
||||
|
||||
**MCP `claude/channel` notification gains `topic`** as an attribute so peers know which conversation an inbound message belongs to.
|
||||
|
||||
**Effort breakdown:** schema + drizzle migration + CLI verbs + broker routing changes (filter by topic membership) + skill update. ~250 LoC across CLI + ~200 LoC broker.
|
||||
|
||||
---
|
||||
|
||||
### 2. Humans in the mesh
|
||||
|
||||
**Mental model:** a human is a peer with `peer_type: "human"` whose presence is durable (no session pubkey rotation; identity tied to an account). They join *topics*, not the whole mesh — so they only see relevant traffic.
|
||||
|
||||
> **Implementation update (2026-05-02):** `peer_type: "ai" | "human" | "connector"` is already plumbed end-to-end in the broker (hello envelope, ConnectedPeer, list_peers). What was missing wasn't broker support — it's the **interface** for humans, who don't have browser-side ed25519 to do hello-sig. Realistic path: **REST API is the human interface** (rolled into #3 below). The web chat panel becomes a thin client that posts/reads via REST using the dashboard user's session auth — not its own keypair. This collapses #2 and #3 into a single deliverable: REST → UI on top.
|
||||
|
||||
**Wire:**
|
||||
|
||||
```jsonc
// hello envelope gains:
{
  "peer_type": "human",
  "session_pubkey": "<ephemeral, per browser tab>",
  "member_pubkey": "<durable, account-tied>",
  "display_name": "Alejandro"
}
```
|
||||
|
||||
**Web panel (`apps/web`):**
|
||||
|
||||
```
/dashboard/mesh/<slug>/topic/<topic-name>
├── topic header (members, settings)
├── message stream (WS-driven, infinite scroll on history)
├── compose box (typing indicator broadcast on focus)
└── members sidebar (presence, profile, last_read_at)
```
|
||||
|
||||
**Backend changes:**
|
||||
- Persistent message history per topic (drizzle table `topic_messages`; existing direct messages stay ephemeral by design).
|
||||
- Topic-scoped read receipts (`topic_member.last_read_at`).
|
||||
- Typing indicator: short-lived broadcast on the topic channel (`{type: "typing", peer: "..."}`).
|
||||
|
||||
**Privacy invariant:** a human in `#deploys` sees only `#deploys` traffic + DMs sent to them. Never the whole mesh. This is the *whole reason* topics come first.
|
||||
|
||||
**Effort:** WS endpoint already exists (broker side). Add: topic_messages table, history endpoint, web UI components (compose, stream, members). ~3 days.
|
||||
|
||||
---
|
||||
|
||||
### 3. REST API + external WS
|
||||
|
||||
**Auth:** API keys per mesh, scoped by capability + topic.
|
||||
|
||||
```yaml
api_key:
  id: <ulid>
  mesh_slug: openclaw
  label: "ci-bot"
  hash: <argon2id>
  capabilities: ["send", "read"]
  topic_scopes: ["#deploys"]   # null = all topics; explicit = whitelist
  created_at: <ts>
  last_used_at: <ts>
  revoked_at: <ts | null>
```
|
||||
|
||||
**CLI for issuance (admin only):**
|
||||
|
||||
```bash
claudemesh apikey create --label "ci-bot" --topic deploys --cap send,read
claudemesh apikey list
claudemesh apikey revoke <id>
```
|
||||
|
||||
**REST endpoints (claudemesh.com/api/v1):**
|
||||
|
||||
```
POST   /v1/messages                  Send a message (auth: api key).
GET    /v1/topics/:name/messages     History (with pagination cursor).
GET    /v1/peers                     List online peers (filtered by key scope).
GET    /v1/state                     Read mesh state.
POST   /v1/state                     Write mesh state.
```
|
||||
|
||||
**External WS:** `wss://ic.claudemesh.com/ws?api_key=...&topic=deploys` — connects with `peer_type: "external"`. Push-pipe parity with internal sessions; can subscribe to topic streams.
|
||||
|
||||
**Why REST keys not session keypairs:** external clients (Zapier, GitHub Actions, mobile apps, Slack workspace bots) need long-lived bearer-like creds, not ephemeral keypairs. Different threat model — scope tightly via topic + capability.
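
The scoping rule above (capability plus topic whitelist, with `null` meaning all topics) can be sketched as a single check. This is a minimal illustration, not the broker's actual code: the `authorize` function, the `Decision` type, and the reason strings are hypothetical names, and only the `api_key` fields from the schema above are assumed.

```typescript
type Capability = "send" | "read";

// Subset of the api_key record that matters for authorization.
interface ApiKey {
  mesh_slug: string;
  capabilities: Capability[];
  topic_scopes: string[] | null; // null = all topics; array = whitelist
  revoked_at: Date | null;
}

// Hypothetical result shape; the reason strings are illustrative only.
type Decision = { ok: true } | { ok: false; reason: string };

function authorize(key: ApiKey, mesh: string, cap: Capability, topic: string): Decision {
  if (key.revoked_at !== null) return { ok: false, reason: "revoked" };
  if (key.mesh_slug !== mesh) return { ok: false, reason: "wrong_mesh" };
  if (!key.capabilities.includes(cap)) return { ok: false, reason: "missing_capability" };
  if (key.topic_scopes !== null && !key.topic_scopes.includes(topic)) {
    return { ok: false, reason: "topic_out_of_scope" };
  }
  return { ok: true };
}
```

With a key scoped to `["#deploys"]`, a send to `#deploys` would pass and a send to `#secrets` would be rejected, which is the behaviour acceptance signal 3 asks for.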
|
||||
|
||||
**Effort:** ~3 days. Mostly broker work; CLI gets the issuance verbs.
|
||||
|
||||
---
|
||||
|
||||
### 4. Bridge peer
|
||||
|
||||
**Mental model:** a bridge is a peer that holds memberships in two meshes and forwards traffic on a single topic between them. SDK-only (no broker changes).
|
||||
|
||||
**Implementation (uses existing `@claudemesh/sdk`):**
|
||||
|
||||
```typescript
import { Bridge } from "@claudemesh/sdk";

const bridge = new Bridge({
  meshes: ["work", "external"],
  topic: "incidents",
  filter: (msg) => !msg.tags.includes("internal-only"),
  loop_prevention: { tag: "via-bridge", max_hops: 2 },
});
await bridge.start();
```
|
||||
|
||||
**Loop prevention:** every forwarded message gets a `bridge_hop_<n>` tag; bridges drop messages that already carry their own tag (prevents echo) and any message with `max_hops` exceeded.
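
One way to read the loop-prevention rule is as a pure function over message tags. The sketch below is an assumption-laden illustration, not SDK code: `BridgedMessage`, `hopCount`, and `prepareForward` are hypothetical names, and it assumes the configured `tag` is unique to this bridge and that hop tags follow the `bridge_hop_<n>` form described above.

```typescript
// Hypothetical message shape; the real SDK type may differ.
interface BridgedMessage {
  body: string;
  tags: string[];
}

interface LoopPrevention {
  tag: string;      // this bridge's own marker, e.g. "via-bridge"
  max_hops: number; // maximum number of bridge forwards allowed
}

const HOP_TAG = /^bridge_hop_(\d+)$/;

// Highest hop number already recorded on the message (0 if never forwarded).
function hopCount(msg: BridgedMessage): number {
  let hops = 0;
  for (const t of msg.tags) {
    const m = HOP_TAG.exec(t);
    if (m) hops = Math.max(hops, Number(m[1]));
  }
  return hops;
}

// Returns the tagged copy to forward, or null when the message must be dropped.
function prepareForward(msg: BridgedMessage, cfg: LoopPrevention): BridgedMessage | null {
  if (msg.tags.includes(cfg.tag)) return null; // already carries our tag: echo
  const hops = hopCount(msg);
  if (hops >= cfg.max_hops) return null;       // hop budget exhausted
  return { ...msg, tags: [...msg.tags, cfg.tag, `bridge_hop_${hops + 1}`] };
}
```

Keeping the decision free of I/O makes the echo and hop-limit cases easy to unit-test before a bridge is wired to two live meshes.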
|
||||
|
||||
**CLI:** `claudemesh bridge run <config.yaml>` — runs an SDK bridge as a long-lived process. Useful for "run a bridge inside a docker container or systemd unit."
|
||||
|
||||
**What it deliberately doesn't do:**
|
||||
- Cross-broker federation (that's a separate broker-to-broker protocol).
|
||||
- Bidirectional state/memory sync (only messages on a single topic).
|
||||
- Identity unification (a peer in mesh A is *not* the same peer in mesh B; the bridge appears as the messenger).
|
||||
|
||||
**Effort:** ~1 day on top of the existing SDK.
|
||||
|
||||
---
|
||||
|
||||
## Acceptance signals
|
||||
|
||||
v0.2.0 ships when all four are demonstrable end-to-end:
|
||||
|
||||
1. A peer creates `#deploys`, two other peers join it, traffic is topic-scoped, mesh-wide chat doesn't see it.
|
||||
2. A human signs in at `claudemesh.com`, joins `#deploys`, sends a message, a Claude session in the mesh receives it as a `<channel>` interrupt with `topic="deploys"`.
|
||||
3. A `curl` POST against `/v1/messages` with an API key delivers a message into `#deploys`; the same API key is rejected on `#secrets`.
|
||||
4. A bridge peer running locally forwards `#incidents` between two test meshes; loop is prevented; one-shot demo recorded.
|
||||
|
||||
---
|
||||
|
||||
## Out of scope (explicitly)
|
||||
|
||||
- Topic hierarchy / nesting (flat namespace per mesh; revisit at scale).
|
||||
- Topic-scoped capability grants (`grant <peer> read:#topic`) — solvable later via capability extension.
|
||||
- Threads-within-topics (Slack-style). Defer.
|
||||
- Voice / video / file-upload UX for humans — text only in v0.2.0.
|
||||
- Federation, sandboxes, sim-sdk — explicitly deferred above.
|
||||
|
||||
---
|
||||
|
||||
## Risks
|
||||
|
||||
- **Topics retrofit risk** — existing 1.5.0 message envelope assumes "to" is peer/group/star. Adding `topic` is additive on the wire but changes routing logic. Test path: backfill existing meshes with a default `#general` topic; opt-in to topic-only routing.
|
||||
- **Web chat session lifecycle** — humans expect "I closed the tab and came back, my place is preserved." Ephemeral session pubkeys break that. Workaround: tie human peer identity to `member_pubkey` + last_read_at on the topic; session pubkey rotates per tab but membership is durable.
|
||||
- **API key abuse** — leaked keys = anyone can post. Mitigations: capability + topic scoping; rate limits per key; `last_used_at` + audit trail; revoke verb is fast.
|
||||
|
||||
---
|
||||
|
||||
## Open questions
|
||||
|
||||
1. Do existing `@group` semantics survive intact, or do we collapse `@group` and `#topic` into one primitive? (Answer favored: keep both — different axes.)
|
||||
2. Should topics persist messages by default, or be opt-in? (Default: yes for `peer_type: "human"`-touched topics; configurable per topic for agent-only ones.)
|
||||
3. Where does mesh-MCP discovery live in the topic model — per topic or per mesh? (Likely per mesh; mesh-MCP is infrastructure, not conversation.)
|
||||
204
.artifacts/specs/2026-05-02-workspace-view.md
Normal file
@@ -0,0 +1,204 @@
|
||||
# Workspace view — per-user superset over joined meshes
|
||||
|
||||
**Status:** spec / not started
|
||||
**Target:** v0.4.0
|
||||
**Author:** Alejandro
|
||||
**Date:** 2026-05-02
|
||||
|
||||
## Why
|
||||
|
||||
Users routinely belong to multiple meshes — work, personal, side
|
||||
projects, ECIJA + flexicar + openclaw + prueba1 in our own dogfood.
|
||||
Today's CLI is mesh-scoped: every read or write either auto-picks the
|
||||
default mesh or forces an interactive picker. Common questions like
|
||||
*"who's online across all my meshes?"* or *"any new @-mentions
|
||||
anywhere?"* require N round-trips, one per mesh.
|
||||
|
||||
A few verbs already aggregate implicitly (`peer list`, `inbox`,
|
||||
`list`), but the surface is patchy and inconsistent.
|
||||
|
||||
We want the equivalent of "all my Slacks in one sidebar" — without
|
||||
breaking the per-mesh trust model that v0.3.0 was built around.
|
||||
|
||||
## What it is NOT
|
||||
|
||||
- **Not a literal universal mesh.** A single global mesh everyone
|
||||
joins collapses the trust boundary, blows up broadcast fan-out
|
||||
(O(users²)), and turns into spam. See the universal-mesh discussion
|
||||
rejected in this same session.
|
||||
- **Not federation.** Federation is the broker-side equivalent
|
||||
(already roadmapped under v0.3.0). Workspace is purely client-side.
|
||||
- **Not identity stitching for *other* peers.** `Mou@openclaw` and
|
||||
`Mou@flexicar-2` may or may not be the same human. Don't auto-merge.
|
||||
Stitching MY identities is fine — local config knows.
|
||||
|
||||
## What it IS
|
||||
|
||||
A virtual layer that aggregates reads across the meshes the user has
|
||||
joined, while keeping writes mesh-scoped. Pure projection over
|
||||
existing per-mesh tables. Zero broker changes. Zero protocol changes.
|
||||
|
||||
```
    ┌──────────────────────────────┐
    │          workspace           │
    │   (per-user view, client)    │
    └─┬────────┬────────┬─────────┬┘
      │        │        │         │
┌─────▼──┐ ┌───▼──┐ ┌───▼──┐ ┌────▼──┐
│ mesh A │ │  B   │ │  C   │ │  ...  │
└────────┘ └──────┘ └──────┘ └───────┘
 (each remains its own crypto + trust domain)
```
|
||||
|
||||
## Surface
|
||||
|
||||
### New verbs (all read-only, all aggregating)
|
||||
|
||||
```bash
claudemesh me                    # overview: meshes, online peers, unread, tasks
claudemesh me topics             # all subscribed topics, namespaced
claudemesh me notifications      # cross-mesh @-mentions feed
claudemesh me activity           # cross-mesh recent send/recv/topic-post
claudemesh me search "<q>"       # full-text across memory + topics + tasks
```
|
||||
|
||||
`claudemesh me` (no subcommand) prints a one-screen dashboard:
|
||||
|
||||
```
workspace — agutmou  (4 meshes · 23 peers visible · 2 unread @you)

meshes
  openclaw     7 peers · 3 topics · last activity 2m
  flexicar-2   5 peers · 1 topic  · last activity 18m
  prueba1      4 peers · idle
  ECIJA        7 peers · 2 topics · 1 @you · last activity 4h

unread @-mentions
  ECIJA    · #incident-2026-05-02 · 1 from coronel-abos
  openclaw · #deploys             · 1 from claudemesh-2

pending tasks (3)
  ECIJA    ship-F4-cliente    high    claimed by you
  ...
```
|
||||
|
||||
### Default-aggregation rule for existing verbs
|
||||
|
||||
When `--mesh` is omitted on a *read-only* verb, aggregate. When
|
||||
`--mesh` is omitted on a *write* verb, fall back to current behavior
|
||||
(default mesh or interactive picker). Already-aggregating verbs keep
|
||||
working unchanged.
|
||||
|
||||
| Verb | Today | After workspace |
|---|---|---|
| `peer list` | aggregates ✅ | unchanged |
| `inbox` | aggregates ✅ | unchanged |
| `list` | aggregates ✅ (lists meshes) | unchanged |
| `notification list` | mesh-scoped | aggregates by default |
| `topic list` | mesh-scoped | aggregates with namespacing |
| `task list` | mesh-scoped | aggregates by default |
| `state list` | mesh-scoped | aggregates by default |
| `memory recall` | mesh-scoped | aggregates by default |
| `info` / `stats` / `ping` | mesh-scoped | unchanged (per-mesh diagnostics) |
| `send`, `topic post`, `state set`, `remember`, ... | mesh-scoped | unchanged (writes pick a mesh) |
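
The default-aggregation rule reduces to a small decision over the verb kind and the `--mesh` flag. The following is a hedged sketch: `VerbKind`, `Scope`, and `resolveScope` are hypothetical names, and it assumes per-mesh diagnostics (`info`, `stats`, `ping`) follow the same fallback as writes.

```typescript
type VerbKind = "read" | "write" | "diagnostic";

type Scope =
  | { mode: "aggregate"; meshes: string[] }
  | { mode: "single"; mesh: string }
  | { mode: "pick" }; // defer to the interactive picker

function resolveScope(
  kind: VerbKind,
  meshFlag: string | undefined,
  joined: string[],
  defaultMesh?: string,
): Scope {
  // An explicit --mesh always wins, for reads and writes alike.
  if (meshFlag !== undefined) return { mode: "single", mesh: meshFlag };
  // Read-only verbs aggregate across every joined mesh.
  if (kind === "read") return { mode: "aggregate", meshes: joined };
  // Writes and diagnostics keep today's behaviour: default mesh, else picker.
  return defaultMesh !== undefined ? { mode: "single", mesh: defaultMesh } : { mode: "pick" };
}
```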
|
||||
|
||||
### Rendering rules for aggregated views
|
||||
|
||||
1. **Topic namespacing.** `#deploys` exists in two meshes — they're
|
||||
different rooms. Render as `openclaw/#deploys`. Inside a
|
||||
mesh-scoped command, keep the bare `#deploys` shorthand.
|
||||
2. **Peer name collisions.** `Mou@openclaw` notation when the same
|
||||
display name resolves in more than one mesh. Single resolution =
|
||||
bare name.
|
||||
3. **Time-grouped activity.** `me activity` sorts globally by ts
|
||||
descending; mesh tag is shown as a dim suffix.
|
||||
4. **Unread roll-up.** `me notifications` is a per-row
|
||||
`[mesh][topic][snippet]` list, newest first.
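
Rules 1 and 2 can be expressed as two small formatting helpers. This is an illustrative sketch only; `PeerRef`, `renderTopic`, and `renderPeer` are hypothetical names, not existing CLI functions.

```typescript
interface PeerRef {
  name: string; // display name
  mesh: string; // mesh slug the peer was seen in
}

// Rule 1: namespace topics in aggregated views, keep the bare form otherwise.
function renderTopic(topic: string, mesh: string, aggregated: boolean): string {
  return aggregated ? `${mesh}/${topic}` : topic;
}

// Rule 2: qualify a peer name only when it resolves in more than one mesh.
function renderPeer(peer: PeerRef, visible: PeerRef[]): string {
  const meshes = new Set(visible.filter((p) => p.name === peer.name).map((p) => p.mesh));
  return meshes.size > 1 ? `${peer.name}@${peer.mesh}` : peer.name;
}
```

Note that `renderPeer` only disambiguates the label; it does not claim the two peers are the same human, in line with the "no identity stitching for other peers" rule.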
|
||||
|
||||
## API surface (REST)
|
||||
|
||||
Mirror the read aggregations server-side so the dashboard + future
|
||||
mobile/web UIs share the same endpoints.
|
||||
|
||||
```
GET  /v1/me                       # workspace overview
GET  /v1/me/meshes                # joined meshes + summary stats
GET  /v1/me/topics                # all subscribed topics, all meshes
GET  /v1/me/notifications         # cross-mesh @-mentions
GET  /v1/me/activity              # unified activity feed
GET  /v1/me/peers                 # already implicit; formalize
GET  /v1/me/search?q=...          # full-text across tables
```
|
||||
|
||||
Auth: needs a *user-scoped* api key (one issued per user, sees all
|
||||
their meshes), which we don't have today — current keys are mesh-
|
||||
scoped. Two options:
|
||||
|
||||
- **(a) Per-user key.** New token type `cm_u_...` issued by the
|
||||
dashboard, scopes to all meshes the issuing user belongs to. Cheaper
|
||||
to build; harder to reason about because the blast radius is
|
||||
larger if leaked.
|
||||
- **(b) Multi-mesh aggregation.** Accept N mesh-scoped keys
|
||||
concurrently; CLI auto-mints them via the existing `withRestKey`
|
||||
pattern, one per joined mesh. No new key type. More round-trips on
|
||||
cold start, but rotation/revocation stays simple.
|
||||
|
||||
**Recommendation: (b).** Reuses today's auth model, doesn't widen the
|
||||
blast radius, and the ephemeral keys we already mint per-command keep
|
||||
the surface area minimal. The CLI orchestrates the fan-out client-
|
||||
side.
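
Under option (b), the client ends up with one result per mesh and has to merge them. The sketch below shows only that merge step, assuming the per-mesh requests have already settled (for example via `Promise.allSettled` over one mesh-scoped key each). `MeshResult`, `Merged`, and `mergeNotifications` are hypothetical names, and the notification fields are illustrative.

```typescript
interface Notification {
  ts: number;      // epoch milliseconds
  topic: string;
  snippet: string;
}

// One entry per joined mesh: either its rows or the reason the fetch failed.
type MeshResult =
  | { mesh: string; ok: true; items: Notification[] }
  | { mesh: string; ok: false; error: string };

interface Merged {
  rows: Array<Notification & { mesh: string }>;
  failed: string[]; // meshes whose fetch failed; surfaced instead of aborting
}

function mergeNotifications(results: MeshResult[]): Merged {
  const rows: Merged["rows"] = [];
  const failed: string[] = [];
  for (const r of results) {
    if (r.ok) rows.push(...r.items.map((n) => ({ ...n, mesh: r.mesh })));
    else failed.push(r.mesh);
  }
  rows.sort((a, b) => b.ts - a.ts); // newest first, interleaved across meshes
  return { rows, failed };
}
```

Treating a single unreachable mesh as a partial result rather than a hard error keeps one revoked or expired key from blanking the whole workspace view.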
|
||||
|
||||
## Storage
|
||||
|
||||
Pure projection at first. The cross-mesh queries are SELECT joins
|
||||
over `mesh_member`, `mesh_topic`, `mesh_topic_member`,
|
||||
`mesh_notification`, `mesh_topic_message`, `mesh_task`, `presence`.
|
||||
|
||||
If `me` queries become hot (likely once dashboards land), add a
|
||||
materialized `user_workspace_view` refreshed on writes. Don't
|
||||
optimize early.
|
||||
|
||||
## Effort
|
||||
|
||||
| Component | Effort |
|---|---|
| CLI verbs (`me`, `me topics`, etc.) | 1.5 days |
| Default-aggregation rule across existing verbs | 0.5 day |
| REST endpoints `/v1/me/*` | 1 day |
| Multi-mesh apikey orchestration in `withRestKey` | 0.5 day |
| Tests + docs | 0.5 day |
| **Total** | **~4 days** |
|
||||
|
||||
## Open questions
|
||||
|
||||
1. **`me` as namespace vs. flag.** Could be `claudemesh --workspace
|
||||
topics` instead of `claudemesh me topics`. The verb form is
|
||||
shorter and reads better; sticking with it.
|
||||
2. **Notification ordering.** All notifications globally interleaved
|
||||
by ts, or per-mesh sections? Default to **interleaved** with mesh
|
||||
tag prefix; users can `--by-mesh` to group.
|
||||
3. **Search relevance.** Cross-mesh full-text search is easy when each
|
||||
mesh has its own pg full-text index. Cross-mesh ranking is the
|
||||
harder problem (IDF varies). Punt to v0.4.1 — start with simple
|
||||
tied-rank merge.
|
||||
4. **Web dashboard.** Should the web dashboard's main view become a
|
||||
workspace view by default? Yes, but that's downstream of this
|
||||
spec — once `/v1/me/*` exists, the web rewrite is the obvious
|
||||
next step.
|
||||
|
||||
## Out of scope (v0.4.0)
|
||||
|
||||
- Federation / cross-broker workspace.
|
||||
- Identity stitching for non-self peers.
|
||||
- Cross-mesh search ranking sophistication.
|
||||
- Cross-mesh write fan-out (`me broadcast` is intentionally NOT a
|
||||
verb — too easy to misuse).
|
||||
- Mobile/web parity beyond the REST endpoints.
|
||||
|
||||
## Why we ship this
|
||||
|
||||
Because "I want one Slack-like sidebar for all my claudemesh meshes"
|
||||
is the highest-frequency UX gap users hit, and the answer is about four
|
||||
days of plumbing on top of what already exists. Federation is the
|
||||
right answer for cross-organization reach; workspace is the right
|
||||
answer for *one user, many meshes*. Both compose.
|
||||
282
.artifacts/specs/2026-05-04-per-session-presence.md
Normal file
@@ -0,0 +1,282 @@
|
||||
# Per-session broker presence — daemon-multiplexed
|
||||
|
||||
**Status:** spec, queued for 1.30.0 (alongside launch-wizard refactor).
|
||||
**Owner:** alezmad
|
||||
**Author:** Claude (Sprint A planning, 2026-05-04)
|
||||
**Related:** `2026-05-04-v2-roadmap-completion.md` (Sprint A overview),
|
||||
1.29.0 session-registry CHANGELOG entry.
|
||||
|
||||
## Problem
|
||||
|
||||
After 1.28.0 dropped the bridge tier, **launched `claude` sessions have
|
||||
no persistent broker presence**. Only the daemon does.
|
||||
|
||||
Concretely: two `claudemesh launch` sessions in the same cwd, querying
|
||||
`peer list` 2 s apart, **never see each other**. Each `claudemesh peer
|
||||
list` opens a short-lived cold-path WS that creates a `presence` row
|
||||
for the duration of the query and tears it down. The "this session"
|
||||
row everyone sees in their own snapshot is created by the snapshot
|
||||
itself; sibling sessions' queries miss it because their WS-lifetimes
|
||||
don't overlap.
|
||||
|
||||
Confirmed empirically (2026-05-04, same-cwd ECIJA-Intranet test):
|
||||
|
||||
| Snapshot | timestamp | self pubkey | self `connectedAt` |
|---|---|---|---|
| Session A | 11:42:37Z | `61d96106cb499208` | 11:42:38Z (= query time) |
| Session B | 11:42:39Z | `ce77188aba02827d` | 11:42:38Z (= query time) |
|
||||
|
||||
Each saw 5 long-lived peers (the daemon and unrelated other sessions)
|
||||
plus its own ephemeral row. Neither saw the other.
|
||||
|
||||
## Goal
|
||||
|
||||
Every launched `claude` session has a long-lived broker presence row
|
||||
**owned by the daemon**, identified by the session's per-launch
|
||||
keypair. Siblings see each other in `peer list` immediately and
|
||||
continuously, not as snapshot artifacts.
|
||||
|
||||
## Non-goals
|
||||
|
||||
- Cross-machine session sync (waiting on 2.0.0 HKDF identity).
|
||||
- Replacing the daemon's own presence row — the daemon stays as a
|
||||
separate row for "the user on this machine, no specific session."
|
||||
- Persistence of the session-presence link across daemon restarts —
|
||||
daemon restart can be allowed to require launched sessions to
|
||||
re-register (same compromise as the in-memory session registry from
|
||||
1.29.0).
|
||||
|
||||
## Design
|
||||
|
||||
### State machine
|
||||
|
||||
The 1.29.0 session registry already tracks `Map<token, SessionInfo>`
|
||||
inside the daemon. Extend it to own a per-session broker connection.
|
||||
|
||||
```
session lifecycle:
  POST /v1/sessions/register
    → registry.set(token, info)
    → daemon.openSessionWs(info)         ← NEW
    → broker creates presence row owned by session.pubkey

  DELETE /v1/sessions/:token
    → registry.delete(token)
    → daemon.closeSessionWs(token)       ← NEW
    → broker marks presence.disconnectedAt = now()

  reaper (30 s tick): pid dead?
    → registry.delete(token)
    → daemon.closeSessionWs(token)
```
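
The lifecycle above can be sketched as a registry that owns the open/close hooks. This is a minimal illustration under stated assumptions, not the daemon's code: `SessionRegistry`, `Hooks`, and `isAlive` are hypothetical names, the WS operations are injected callbacks, and the teardown order follows the "close the session WS first, then drop registry entry" note from the wire changes.

```typescript
interface SessionInfo {
  token: string;
  pid: number;
  sessionPubkey: string;
}

// Injected side effects so the lifecycle can be exercised without a broker.
interface Hooks {
  openSessionWs(info: SessionInfo): void;
  closeSessionWs(token: string): void;
  isAlive(pid: number): boolean;
}

class SessionRegistry {
  private sessions = new Map<string, SessionInfo>();
  private hooks: Hooks;

  constructor(hooks: Hooks) {
    this.hooks = hooks;
  }

  // POST /v1/sessions/register
  register(info: SessionInfo): void {
    // Re-registering a token replaces its WS instead of leaking the old one.
    if (this.sessions.has(info.token)) this.hooks.closeSessionWs(info.token);
    this.sessions.set(info.token, info);
    this.hooks.openSessionWs(info);
  }

  // DELETE /v1/sessions/:token
  unregister(token: string): boolean {
    if (!this.sessions.has(token)) return false;
    this.hooks.closeSessionWs(token);
    this.sessions.delete(token);
    return true;
  }

  // Intended to run on the 30 s reaper tick; returns the reaped tokens.
  reap(): string[] {
    const dead = Array.from(this.sessions.values()).filter((s) => !this.hooks.isAlive(s.pid));
    for (const s of dead) this.unregister(s.token);
    return dead.map((s) => s.token);
  }

  tokens(): string[] {
    return Array.from(this.sessions.keys());
  }
}
```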
|
||||
|
||||
### Daemon-side: per-session `BrokerClient`
|
||||
|
||||
Today the daemon holds `Map<meshSlug, DaemonBrokerClient>` (one WS per
|
||||
attached mesh). Add a parallel `Map<token, SessionBrokerClient>` for
|
||||
the per-launch ephemeral connections.
|
||||
|
||||
`SessionBrokerClient` is the existing `BrokerClient` reused, configured
|
||||
with the session's per-launch keypair instead of the member's stable
|
||||
keypair. It registers presence (`presence_join`) and stays connected
|
||||
until `closeSessionWs(token)` fires. It does **not** drain the outbox
|
||||
— that's the member-keypair `DaemonBrokerClient`'s job. It only carries
|
||||
presence + receives DMs targeted at the session pubkey.
|
||||
|
||||
### Broker-side: parent-vouched presence auth
|
||||
|
||||
Today's broker accepts hello-sig auth where:
|
||||
- Caller signs the broker's nonce with their `mesh_member` keypair.
|
||||
- Broker looks up `mesh_member.peer_pubkey == sig.pubkey`.
|
||||
|
||||
For per-session keypairs, the session pubkey is **not** in `mesh_member`
|
||||
— it's freshly generated by `claudemesh launch`. We need a new
|
||||
attestation flow:
|
||||
|
||||
```
hello {
  type: "session_hello",
  session_pubkey: <fresh keypair>,
  parent_member_pubkey: <member keypair from config>,
  display_name, cwd, role, groups,
  parent_signature: ed25519_sign(member_priv,
    "claudemesh-session/" || session_pubkey || "/" || nonce),
  nonce_challenge: <broker nonce>,
}
```
|
||||
|
||||
Broker validates:
|
||||
1. `parent_member_pubkey` exists in `mesh.member` for the target mesh.
|
||||
2. `parent_signature` validates against `parent_member_pubkey` over the
|
||||
canonical message above.
|
||||
3. Broker inserts a presence row keyed on `session_pubkey` but
|
||||
`member_id` pointing at the parent member's `mesh.member.id`.
|
||||
|
||||
This is the OAuth-style refresh-vs-access pattern: the parent member
|
||||
key vouches "this ephemeral session pubkey belongs to me." The broker
|
||||
binds the row to the parent member but uses the session pubkey for
|
||||
routing (so DMs targeted at the session pubkey land at this WS).
|
||||
|
||||
### CLI-side: launch.ts produces the parent signature
|
||||
|
||||
`claudemesh launch` will mint the per-launch session keypair (new in this spec) and already writes the
|
||||
session-token file. Extend it to also produce a `parent_signature`
|
||||
that the daemon can present when opening the session WS:
|
||||
|
||||
```ts
const sessionPubkey = sessionKeypair.publicKey;
const parentSig = ed25519_sign(
  mesh.secretKey,
  Buffer.concat([
    Buffer.from("claudemesh-session/"),
    sessionPubkey,
    Buffer.from("/"),
    /* nonce comes from broker — handled at WS-connect time */
  ]),
);
```
|
||||
|
||||
Actually, the nonce is broker-issued at hello time, so the signature
|
||||
needs to be produced fresh per WS-connect. Simpler approach: the
|
||||
`POST /v1/sessions/register` body carries the *member secret key* (or
|
||||
a derived signing capability) so the daemon can sign nonces on behalf
|
||||
of the session.
|
||||
|
||||
That's a key-leak risk. Better: register carries a **pre-signed
|
||||
attestation** good for a TTL window:
|
||||
|
||||
```
register body adds:
  parent_attestation: {
    session_pubkey: hex,
    parent_member_pubkey: hex,
    expires_at: ISO,
    signature: ed25519_sign(member_priv,
                 "claudemesh-session-attest/" ||
                 session_pubkey || "/" ||
                 expires_at),
  }
```
|
||||
|
||||
Daemon presents this attestation in `session_hello`; broker validates
|
||||
expiry and signature, then issues a nonce challenge that the daemon
|
||||
can satisfy with the session keypair (which IS held by the daemon
|
||||
for the lifetime of the registration). Two-stage: parent vouches the
|
||||
session; session signs the nonce.
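
The first stage (parent vouches for the session) can be sketched with Node's built-in `crypto` ed25519 support. This is a hedged illustration, not the shipped implementation: `attest`, `verifyAttestation`, `rawPubHex`, and `canonical` are hypothetical names, and it assumes the canonical message concatenates the hex form of `session_pubkey` with the ISO `expires_at` as UTF-8 text. The `mesh.member` lookup and the second-stage nonce challenge are left out.

```typescript
import { createPublicKey, generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

interface ParentAttestation {
  session_pubkey: string;       // hex
  parent_member_pubkey: string; // hex
  expires_at: string;           // ISO 8601
  signature: string;            // hex
}

const MAX_TTL_MS = 24 * 60 * 60 * 1000; // broker-side bound on attestation lifetime

// Raw 32-byte ed25519 public key as hex, read from the JWK "x" field.
function rawPubHex(key: KeyObject): string {
  const jwk = key.export({ format: "jwk" });
  return Buffer.from(jwk.x as string, "base64url").toString("hex");
}

// Assumed encoding of the canonical message: prefix, hex pubkey, "/", ISO expiry.
function canonical(sessionPubHex: string, expiresAt: string): Buffer {
  return Buffer.from(`claudemesh-session-attest/${sessionPubHex}/${expiresAt}`, "utf8");
}

// CLI side: the member key signs over the session pubkey and expiry.
function attest(memberPriv: KeyObject, memberPub: KeyObject, sessionPub: KeyObject, expiresAt: Date): ParentAttestation {
  const session_pubkey = rawPubHex(sessionPub);
  const expires_at = expiresAt.toISOString();
  const signature = sign(null, canonical(session_pubkey, expires_at), memberPriv).toString("hex");
  return { session_pubkey, parent_member_pubkey: rawPubHex(memberPub), expires_at, signature };
}

// Broker side: check the expiry window, then the parent's signature.
function verifyAttestation(a: ParentAttestation, now: Date): boolean {
  const exp = Date.parse(a.expires_at);
  if (Number.isNaN(exp) || exp <= now.getTime() || exp > now.getTime() + MAX_TTL_MS) return false;
  const pub = createPublicKey({
    key: { kty: "OKP", crv: "Ed25519", x: Buffer.from(a.parent_member_pubkey, "hex").toString("base64url") },
    format: "jwk",
  });
  return verify(null, canonical(a.session_pubkey, a.expires_at), pub, Buffer.from(a.signature, "hex"));
}
```

Because `expires_at` is part of the signed message, extending the expiry after the fact invalidates the signature, so a captured attestation cannot be stretched beyond its original window.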
|
||||
|
||||
### Registry persistence
|
||||
|
||||
For now, in-memory only (matching 1.29.0). Daemon restart drops all
|
||||
session WSes; launched `claude` processes are responsible for
|
||||
re-registering on next CLI invocation. Acceptable v1 behaviour;
|
||||
revisit when sqlite persistence lands for the registry.
|
||||
|
||||
## Wire changes
|
||||
|
||||
### Broker
|
||||
|
||||
- New `session_hello` message type (additive; existing `hello` for
|
||||
member auth unchanged).
|
||||
- `presence` row schema unchanged — `member_id` still required, but
|
||||
`session_pubkey` differs from member's stable pubkey.
|
||||
- Validate `now() < parent_attestation.expires_at <= now() + 24h` to reject expired attestations and bound
|
||||
attestation reuse.
|
||||
|
||||
### Daemon
|
||||
|
||||
- New `SessionBrokerClient` factory — wraps `BrokerClient` with
|
||||
session-mode hello.
|
||||
- `Map<token, SessionBrokerClient>` alongside the existing
|
||||
`Map<slug, DaemonBrokerClient>`.
|
||||
- IPC routes:
|
||||
- `POST /v1/sessions/register` — extend body schema with
|
||||
`parent_attestation`.
|
||||
- `DELETE /v1/sessions/:token` — close the session WS first, then
|
||||
drop registry entry.
|
||||
|
||||
### CLI (`claudemesh launch`)
|
||||
|
||||
- Mint session keypair (today only writes the session token; need to
|
||||
add ed25519 keypair generation per launch and write the privkey
|
||||
alongside the token).
|
||||
- Sign `parent_attestation` with the member key from the joined-mesh
|
||||
config.
|
||||
- POST register with both the new keypair and the attestation.
|
||||
|
||||
## LoC estimate
|
||||
|
||||
- Daemon `SessionBrokerClient` + registry hook: ~120 LoC.
|
||||
- IPC route schema extension + validation: ~40 LoC.
|
||||
- Broker `session_hello` handler + tests: ~140 LoC.
|
||||
- CLI `claudemesh launch` keypair + attestation: ~60 LoC.
|
||||
- Tests + smoke: ~80 LoC.
|
||||
|
||||
Total: **~440 LoC** across CLI + daemon + broker.
|
||||
|
||||
## Risks
|
||||
|
||||
| Risk | Mitigation |
|---|---|
| Member private key never leaves the user's machine, but the **attestation** (signed token) can be replayed within its TTL. | TTL bound 24h; refresh on launch; revocation path = drop the parent member's mesh enrollment (nuclear, but works). |
| Cascading WS connections — N launches = N+1 broker WSes per user. | Acceptable up to 10-20 concurrent sessions; if it ever becomes a problem, multiplex per-session at the protocol level (one WS, multiple presence rows). Out of scope for v1. |
| Daemon restart kills all session WSes — `peer list` from inside a launched session sees the remaining 5 peers but not its own siblings until they re-register. | Same as 1.29.0 registry. The registry could persist to sqlite later; for v1, accepted. |
| Broker schema cost: every new presence row has a different `session_pubkey`, growing the table faster. | Already accepted — broker prunes disconnected rows on a 30-day window. Per-session keys triple the row count at peak but stay within the prune budget. |
|
||||
|
||||
## Compatibility
|
||||
|
||||
- **Older brokers** can't validate `session_hello`. Sessions will
|
||||
attempt the new hello, get back `unknown_message_type`, and fall
|
||||
back to the existing member-keyed hello (no per-session presence,
|
||||
but everything still works as 1.28.0). Add the broker change first,
|
||||
let it deploy, then ship the CLI side.
|
||||
- **Older CLIs** continue to work unchanged — they don't open
|
||||
per-session WSes. They appear as ephemeral cold-path rows just like
|
||||
today, and lose the symmetric-visibility property between siblings.
|
||||
- **Backward visible:** users on 1.30.0+ on the same mesh as users on
|
||||
≤1.29.x will see the older users as one row (their daemon) instead
|
||||
of one row per session. Acceptable — opt-in to the new visibility
|
||||
by upgrading.
|
||||
|
||||
## Sequencing
|
||||
|
||||
1. **Broker change ships first.** Add `session_hello` handler, deploy,
|
||||
bake for ~24h. No CLI behaviour change yet.
|
||||
2. **Daemon `SessionBrokerClient` ships next** behind a feature flag
|
||||
(`CLAUDEMESH_SESSION_PRESENCE=1`). Manually test with two launched
|
||||
sessions in the same cwd; verify both see each other.
|
||||
3. **CLI keypair-mint + attestation in `launch.ts` ships last**, behind
|
||||
the same flag.
|
||||
4. Flip the flag default in 1.30.0 release; document rollback via env.
|
||||
|
||||
## Verification
|
||||
|
||||
End-to-end smoke (paste into 1.30.0's CHANGELOG):
|
||||
|
||||
```
$ # In two different shells, both cd ~/Desktop/foo:
$ claudemesh launch --name SessionA -y     # shell 1
$ claudemesh launch --name SessionB -y     # shell 2
$
$ # In a third shell:
$ claudemesh peer list --json --mesh foo | jq '.[] | {n: .displayName, c: .cwd}'
{ "n": "SessionA", "c": "/.../foo" }   ← persistent, not query-induced
{ "n": "SessionB", "c": "/.../foo" }
$
$ # In SessionA's shell:
$ claudemesh peer list --mesh foo
  should include SessionB.
$
$ # Kill SessionB (Ctrl-C in shell 2). Wait <30s.
$ claudemesh peer list --mesh foo
  should NOT include SessionB (reaper closed its WS).
```
|
||||
|
||||
## Open questions
|
||||
|
||||
- Should the per-session WS also drain *its own* outbox subset, or stay
|
||||
presence-only? Recommend presence-only for v1 — keeps state machines
|
||||
simple, daemon's member-keyed WS handles all sends. Can be revisited
|
||||
when per-session policy DSL ships.
|
||||
- Should the parent attestation be revocable mid-session? Could add an
|
||||
IPC route on the daemon. Out of scope for v1; revoke = drop the
|
||||
whole member enrollment.
|
||||
104
.artifacts/specs/2026-05-04-v2-roadmap-completion.md
Normal file
@@ -0,0 +1,104 @@
|
||||
# v2.0.0 Daemon Redesign — Completion Roadmap
|
||||
|
||||
**Date:** 2026-05-04
|
||||
**Owner:** alezmad
|
||||
**Status:** in-progress (1.24.0 + 1.25.0 land most of it; remainder is two follow-up arcs)
|
||||
|
||||
## What's done
|
||||
|
||||
| v2.0.0 bullet | Version | Status |
|---|---|---|
| `claudemesh-daemon` long-lived launchd / systemd unit | 1.22.0 | ✅ Done |
| MCP server shrinks to thin daemon adapter | 1.24.0 | ✅ Done — 979 → ~200 LoC of push-pipe, daemon-required, no fallback |
| `claudemesh install` auto-installs + starts daemon | 1.24.0 | ✅ Done |
| `claudemesh launch` ensures daemon | 1.24.0 | ✅ Done |
| Daemon outbound routing (Sprint 4: real targets + crypto) | 1.25.0 | ✅ Done — outbox stores `mesh`, `target_spec`, `nonce`, `ciphertext`, `priority`; resolution + `crypto_box` happens at IPC accept time; drain is a forwarder |
| CLI thin-client routing for read verbs | 1.25.0 | ✅ Partial — `peer list`, `skill list/get` route through daemon when present; same `trySendViaDaemon` fallback shape |
| Ambient mode (raw `claude` Just Works) | 1.25.0 | ✅ Documented + functional for the daemon's attached mesh |
|
||||
|
||||
## What remains (in dependency order)
|
||||
|
||||
### A. Daemon multi-mesh (the prerequisite for "ambient mode for everything")
|
||||
|
||||
**Why it's the critical path:** ambient mode today only works for the single mesh the daemon is attached to. Users with N meshes either run N daemons (different sock paths) or restart the daemon to switch. Neither is acceptable for the v2.0.0 promise.
|
||||
|
||||
**What it takes:**
|
||||
- Daemon holds `Map<slug, DaemonBrokerClient>` instead of one broker.
|
||||
- Outbox row's `mesh` column (1.25.0 added) is the dispatch key.
|
||||
- IPC `/v1/send` requires `mesh` field (or infers from target prefix `<slug>:<target>`).
|
||||
- IPC read endpoints (`/v1/peers`, `/v1/skills`, `/v1/profile`) accept `?mesh=<slug>` or return mesh-grouped results.
|
||||
- SSE event payloads already include `mesh` slug; no change needed.
|
||||
- Drain worker selects broker by row's `mesh` column.
|
||||
- `daemon up` with no `--mesh` attaches to all joined meshes; with `--mesh X` restricts to X (legacy mode for explicit single-mesh).
|
||||
- Inbox dedupe keeps using `client_message_id` UNIQUE; mesh column for filtering only.
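
The `/v1/send` rule (explicit `mesh` field, or a `<slug>:<target>` prefix) could be resolved before a row is written to the outbox. The sketch below is illustrative only: `SendRequest`, `Resolved`, `resolveDispatch`, and the error strings are hypothetical, and it assumes unprefixed targets never contain a colon.

```typescript
interface SendRequest {
  mesh?: string;  // explicit mesh slug, if the caller supplied one
  target: string; // "<target>" or "<slug>:<target>"
}

type Resolved =
  | { ok: true; mesh: string; target: string }
  | { ok: false; error: string };

function resolveDispatch(req: SendRequest, attached: ReadonlySet<string>): Resolved {
  let mesh = req.mesh;
  let target = req.target;
  if (mesh === undefined) {
    // No explicit mesh: try to infer it from a "<slug>:" prefix on the target.
    const i = target.indexOf(":");
    if (i > 0) {
      mesh = target.slice(0, i);
      target = target.slice(i + 1);
    }
  }
  if (mesh === undefined) return { ok: false, error: "mesh_required" };
  if (!attached.has(mesh)) return { ok: false, error: "mesh_not_attached" };
  return { ok: true, mesh, target };
}
```

The resolved `mesh` is what would land in the outbox row's `mesh` column, so the drain worker can pick the broker without re-parsing the target.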
|
||||
|
||||
**Estimated effort:** 1 week. ~600 LoC across `run.ts`, `drain.ts`, `ipc/server.ts`, plus tests for per-mesh dispatch.
|
||||
|
||||
**Risk:** medium. The single-mesh assumption is baked into a few places (peer-list response shape, skill-list response shape). Need to choose: per-mesh tagged responses (breaking) or array-of-meshes wrapped responses (additive). Recommend the latter for back-compat.
|
||||
|
||||
### B. HKDF-derived peer keypairs (cross-machine identity)
|
||||
|
||||
**Why it matters:** today each install per machine = fresh keypair = different mesh member identity. User signs in on laptop and desktop and shows up as two different members. v2.0.0 promised "same identity across machines."
|
||||
|
||||
**What it takes:**
|
||||
- `HKDF(account_secret, info: "claudemesh/mesh/<mesh_id>/peer", salt: <user_id>)` derives a deterministic ed25519 keypair per mesh.
|
||||
- `account_secret` derives from the user's authenticated session — needs broker-side endpoint to vend it on first install.
|
||||
- Enrollment flow changes: instead of generating a fresh keypair, derive it. Subsequent installs find the same pubkey already in `mesh.member` and skip enrollment.
|
||||
- Migration: existing members keep their old keypairs (they're stored in config). Only new joins use HKDF. Optional: opt-in re-enrollment for users who want cross-machine sync.
|
||||
- Broker hello-sig protocol unchanged (still ed25519 sign).
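
To make the derivation step concrete, here is a rough sketch using Node's built-in `crypto`. It is an assumption-heavy illustration pending the design review called for below: the SHA-256 digest, the `derivePeerKeypair` and `pubHex` names, and the use of the HKDF output directly as the ed25519 seed are all choices made for the example, not decisions from this roadmap.

```typescript
import { createPrivateKey, createPublicKey, hkdfSync, KeyObject } from "node:crypto";

// PKCS#8 DER prefix for an ed25519 private key; the 32-byte seed follows it.
const ED25519_PKCS8_PREFIX = Buffer.from("302e020100300506032b657004220420", "hex");

function derivePeerKeypair(
  accountSecret: Buffer,
  userId: string,
  meshId: string,
): { privateKey: KeyObject; publicKey: KeyObject } {
  const info = `claudemesh/mesh/${meshId}/peer`;
  // salt = user id, info = per-mesh label, 32 bytes out (the ed25519 seed size).
  const seed = Buffer.from(hkdfSync("sha256", accountSecret, userId, info, 32));
  const privateKey = createPrivateKey({
    key: Buffer.concat([ED25519_PKCS8_PREFIX, seed]),
    format: "der",
    type: "pkcs8",
  });
  return { privateKey, publicKey: createPublicKey(privateKey) };
}

// Raw public key as hex, convenient for comparing against an enrolled member pubkey.
function pubHex(key: KeyObject): string {
  return Buffer.from(key.export({ format: "jwk" }).x as string, "base64url").toString("hex");
}
```

Two installs that hold the same `account_secret` derive the same public key for a given mesh, which is what would let a second machine skip enrollment; a different mesh id yields an unrelated key.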

### C. Mesh → workspace public surface rename

**Why it matters:** "mesh" is internal jargon for what users experience as "a workspace." v2.0.0 calls for the rename to align UX language.

**What it takes:**
- All CLI verbs gain `workspace` aliases (`claudemesh workspace list` ≡ `claudemesh list`).
- Help text, docs, README, marketing site updated.
- DB tables stay `mesh_*` (migration cost prohibitive; not user-visible).
- Wire protocol stays `mesh_*` (broker change too disruptive).
- Eventually deprecate the `mesh` aliases (~2 minor versions later).

**Estimated effort:** 3-4 days. Mostly rote search/replace + new aliases.

**Risk:** low. Cosmetic.

### D. Full CLI-to-thin-client conversion

**Why it matters:** Today the CLI has bridge + cold-path code that duplicates ~3000 LoC of broker WS / crypto / decode logic that the daemon also has. Once the daemon is multi-mesh, every verb can become "open IPC, send request, render response."

**What it takes:**
- Each verb: replace `withMesh(...)` (which opens its own broker WS) with `daemonOnly(...)` (calls IPC, errors if daemon down).
- Drop `bridge/server.ts`, `bridge/client.ts`, `bridge/socket-broker.ts` entirely.
- Drop most of `services/broker/ws-client.ts` from the CLI build (kept only for daemon's internal use).
- CLI binary shrinks ~30-40%.
- Daemon becomes the only broker WS holder per user.

**Estimated effort:** 1 week. Mostly mechanical; strict TypeScript catches most issues.

**Risk:** medium. Breaks workflows where CLI is used without daemon (CI environments, headless scripts). Need to keep a `--no-daemon` escape hatch or document the constraint.

## Recommended sequencing

```
1.25.0 (today): Sprint 4 outbound routing + CLI thin-client read paths + ambient mode docs
1.26.0 (next): A. Daemon multi-mesh ("ambient mode for everything")
1.27.0: D. CLI-to-thin-client conversion (drops ~3000 LoC)
1.28.0: C. Mesh → workspace rename (aliases shipped, no removal yet)
2.0.0: B. HKDF identity (separate security-reviewed arc)
```

A → D → C → B is the right order:
- A unblocks ambient mode for multi-mesh users (highest UX value).
- D unblocks the LoC reduction the v2.0.0 promise mentioned ("3000 LoC removed").
- C is cosmetic; do it once D has stabilized.
- B is the most security-sensitive; do it last, with proper review.

## Out of scope for the v2.0.0 endpoint

- **Topic crypto (Sprint 5+).** Topics still ship as base64 plaintext. Real per-topic encryption is a v0.3.0 operator-layer item, parallel track.
- **Broker hardening for daemon idempotency (Sprint 7).** Partial unique index on `(mesh_id, client_message_id) WHERE NOT NULL` and the `mesh.client_message_dedupe` table. Documented in `2026-05-03-daemon-spec-broker-hardening-followups.md`.
- **`launch` deprecation.** 1.25.0 docs now recommend ambient mode for default cases; `launch` stays as the override path. Full deprecation is a 2.x decision.
1
.claude/scheduled_tasks.lock
Normal file
@@ -0,0 +1 @@
|
||||
{"sessionId":"ae5dbe38-9c56-4d07-9fb6-a38cb8a250a6","pid":3633,"procStart":"Fri May 1 22:40:56 2026","acquiredAt":1777683244936}
|
||||
22
.claude/settings.local.json
Normal file
@@ -0,0 +1,22 @@
|
||||
{
|
||||
"permissions": {
|
||||
"allow": [
|
||||
"Bash(/Users/agutierrez/.claude/hooks/play-tts.sh Connected to mesh, setting up:*)",
|
||||
"Bash(/Users/agutierrez/.claude/hooks/play-tts.sh Connected to mesh, setting up session:*)",
|
||||
"Bash(npx tsx:*)",
|
||||
"Bash(grep -r \"defineCommand\\\\|export const run\" /Users/agutierrez/Desktop/claudemesh/apps/cli/src/commands/*.ts)",
|
||||
"Bash(pnpm build:*)",
|
||||
"Bash(/Users/agutierrez/.claude/hooks/play-tts.sh Ready to help:*)",
|
||||
"Bash(pnpm publish:*)",
|
||||
"Bash(grep -E \"\\\\.\\(tsx?|jsx?\\)$\")",
|
||||
"Bash(/Users/agutierrez/.claude/hooks/play-tts.sh Investigating dropped keystrokes in claudemesh launch:*)",
|
||||
"Read(//Users/agutierrez/.claude/**)",
|
||||
"Read(//private/tmp/**)",
|
||||
"Bash(timeout 3 node dist/index.js mcp)",
|
||||
"Bash(/Users/agutierrez/.claude/hooks/play-tts.sh Fixed ZodError in MCP notification handler:*)",
|
||||
"Bash(npm i:*)",
|
||||
"Bash(claudemesh --version)",
|
||||
"Bash(/Users/agutierrez/.claude/hooks/play-tts.sh:*)"
|
||||
]
|
||||
}
|
||||
}
|
||||
58
.claude/skills/integration-nextjs-app-router/SKILL.md
Normal file
@@ -0,0 +1,58 @@
|
||||
---
|
||||
name: integration-nextjs-app-router
|
||||
description: PostHog integration for Next.js App Router applications
|
||||
metadata:
|
||||
author: PostHog
|
||||
version: 1.9.5
|
||||
---
|
||||
|
||||
# PostHog integration for Next.js App Router
|
||||
|
||||
This skill helps you add PostHog analytics to Next.js App Router applications.
|
||||
|
||||
## Workflow
|
||||
|
||||
Follow these steps in order to complete the integration:
|
||||
|
||||
1. `basic-integration-1.0-begin.md` - PostHog Setup - Begin ← **Start here**
|
||||
2. `basic-integration-1.1-edit.md` - PostHog Setup - Edit
|
||||
3. `basic-integration-1.2-revise.md` - PostHog Setup - Revise
|
||||
4. `basic-integration-1.3-conclude.md` - PostHog Setup - Conclusion
|
||||
|
||||
## Reference files
|
||||
|
||||
- `references/EXAMPLE.md` - Next.js App Router example project code
|
||||
- `references/next-js.md` - Next.js - docs
|
||||
- `references/identify-users.md` - Identify users - docs
|
||||
- `references/basic-integration-1.0-begin.md` - PostHog setup - begin
|
||||
- `references/basic-integration-1.1-edit.md` - PostHog setup - edit
|
||||
- `references/basic-integration-1.2-revise.md` - PostHog setup - revise
|
||||
- `references/basic-integration-1.3-conclude.md` - PostHog setup - conclusion
|
||||
|
||||
The example project shows the target implementation pattern. Consult the documentation for API details.
|
||||
|
||||
## Key principles
|
||||
|
||||
- **Environment variables**: Always use environment variables for PostHog keys. Never hardcode them.
|
||||
- **Minimal changes**: Add PostHog code alongside existing integrations. Don't replace or restructure existing code.
|
||||
- **Match the example**: Your implementation should follow the example project's patterns as closely as possible.
|
||||
|
||||
## Framework guidelines
|
||||
|
||||
- For Next.js 15.3+, initialize PostHog in instrumentation-client.ts for the simplest setup
|
||||
- For feature flags, use useFeatureFlagEnabled() or useFeatureFlagPayload() hooks - they handle loading states and external sync automatically
|
||||
- Add analytics capture in event handlers where user actions occur, NOT in useEffect reacting to state changes
|
||||
- Do NOT use useEffect for data transformation - calculate derived values during render instead
|
||||
- Do NOT use useEffect to respond to user events - put that logic in the event handler itself
|
||||
- Do NOT use useEffect to chain state updates - calculate all related updates together in the event handler
|
||||
- Do NOT use useEffect to notify parent components - call the parent callback alongside setState in the event handler
|
||||
- To reset component state when a prop changes, pass the prop as the component's key instead of using useEffect
|
||||
- useEffect is ONLY for synchronizing with external systems (non-React widgets, browser APIs, network subscriptions)
|
||||
|
||||
## Identifying users
|
||||
|
||||
Identify users during login and signup events. Refer to the example code and documentation for the correct identify pattern for this framework. If both frontend and backend code exist, pass the client-side session and distinct ID using `X-POSTHOG-DISTINCT-ID` and `X-POSTHOG-SESSION-ID` headers to maintain correlation.
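
A minimal sketch of the header hand-off described above. The two header names come from this skill; the helper names `posthogCorrelationHeaders` and `readCorrelation` are hypothetical, and the server half assumes a standard Fetch API `Headers` object such as the one on a route handler's `Request`.

```typescript
// Client side: build the correlation headers to attach to a fetch() call.
// Helper name and parameters are illustrative, not part of any PostHog SDK.
function posthogCorrelationHeaders(distinctId: string, sessionId?: string): Record<string, string> {
  const headers: Record<string, string> = { "X-POSTHOG-DISTINCT-ID": distinctId };
  if (sessionId) headers["X-POSTHOG-SESSION-ID"] = sessionId;
  return headers;
}

// Server side: read the same values back. Header lookup is case-insensitive.
function readCorrelation(headers: Headers): { distinctId: string | null; sessionId: string | null } {
  return {
    distinctId: headers.get("x-posthog-distinct-id"),
    sessionId: headers.get("x-posthog-session-id"),
  };
}
```

On the client, the two values would typically come from `posthog.get_distinct_id()` and `posthog.get_session_id()` in `posthog-js`; on the server, the recovered distinct ID can be passed as `distinctId` to the `posthog-node` capture call.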
|
||||
|
||||
## Error tracking
|
||||
|
||||
Add PostHog error tracking to relevant files, particularly around critical user flows and API boundaries.
|
||||
@@ -0,0 +1,706 @@
|
||||
# PostHog Next.js App Router Example Project
|
||||
|
||||
Repository: https://github.com/PostHog/context-mill
|
||||
Path: basics/next-app-router
|
||||
|
||||
---
|
||||
|
||||
## README.md
|
||||
|
||||
# PostHog Next.js app router example
|
||||
|
||||
This is a [Next.js](https://nextjs.org) App Router example demonstrating PostHog integration with product analytics, session replay, feature flags, and error tracking.
|
||||
|
||||
## Features
|
||||
|
||||
- **Product analytics**: Track user events and behaviors
|
||||
- **Session replay**: Record and replay user sessions
|
||||
- **Error tracking**: Capture and track errors
|
||||
- **User authentication**: Demo login system with PostHog user identification
|
||||
- **Server-side & Client-side tracking**: Examples of both tracking methods
|
||||
- **Reverse proxy**: PostHog ingestion through Next.js rewrites
|
||||
|
||||
## Getting started
|
||||
|
||||
### 1. Install dependencies
|
||||
|
||||
```bash
|
||||
npm install
|
||||
# or
|
||||
pnpm install
|
||||
```
|
||||
|
||||
### 2. Configure environment variables
|
||||
|
||||
Create a `.env.local` file in the root directory:
|
||||
|
||||
```bash
|
||||
NEXT_PUBLIC_POSTHOG_PROJECT_TOKEN=your_posthog_project_token
|
||||
NEXT_PUBLIC_POSTHOG_HOST=https://us.i.posthog.com
|
||||
```
|
||||
|
||||
Get your PostHog project token from your [PostHog project settings](https://app.posthog.com/project/settings).
|
||||
|
||||
### 3. Run the development server
|
||||
|
||||
```bash
|
||||
npm run dev
|
||||
# or
|
||||
pnpm dev
|
||||
```
|
||||
|
||||
Open [http://localhost:3000](http://localhost:3000) with your browser to see the app.
|
||||
|
||||
## Project structure
|
||||
|
||||
```
|
||||
src/
|
||||
├── app/
|
||||
│ ├── api/
|
||||
│ │ └── auth/
|
||||
│ │ └── login/
|
||||
│ │ └── route.ts # Login API with server-side tracking
|
||||
│ ├── burrito/
|
||||
│ │ └── page.tsx # Demo feature page with event tracking
|
||||
│ ├── profile/
|
||||
│ │ └── page.tsx # User profile with error tracking demo
|
||||
│ ├── layout.tsx # Root layout with providers
|
||||
│ ├── page.tsx # Home/Login page
|
||||
│ └── globals.css # Global styles
|
||||
├── components/
|
||||
│ └── Header.tsx # Navigation header with auth state
|
||||
├── contexts/
|
||||
│ └── AuthContext.tsx # Authentication context with PostHog integration
|
||||
└── lib/
|
||||
└── posthog-server.ts # Server-side PostHog client
|
||||
|
||||
instrumentation-client.ts # Client-side PostHog initialization
|
||||
```
|
||||
|
||||
## Key integration points
|
||||
|
||||
### Client-side initialization (instrumentation-client.ts)
|
||||
|
||||
```typescript
|
||||
import posthog from "posthog-js"
|
||||
|
||||
posthog.init(process.env.NEXT_PUBLIC_POSTHOG_PROJECT_TOKEN!, {
|
||||
api_host: "/ingest",
|
||||
ui_host: "https://us.posthog.com",
|
||||
defaults: '2026-01-30',
|
||||
capture_exceptions: true,
|
||||
debug: process.env.NODE_ENV === "development",
|
||||
});
|
||||
```
|
||||
|
||||
### User identification (AuthContext.tsx)
|
||||
|
||||
```typescript
|
||||
posthog.identify(username, {
|
||||
username: username,
|
||||
});
|
||||
```
|
||||
|
||||
### Event tracking (burrito/page.tsx)
|
||||
|
||||
```typescript
|
||||
posthog.capture('burrito_considered', {
|
||||
total_considerations: count,
|
||||
username: username,
|
||||
});
|
||||
```
|
||||
|
||||
### Error tracking (profile/page.tsx)
|
||||
|
||||
```typescript
|
||||
posthog.captureException(error);
|
||||
```
|
||||
|
||||
### Server-side tracking (app/api/auth/login/route.ts)
|
||||
|
||||
```typescript
|
||||
const posthog = getPostHogClient();
|
||||
posthog.capture({
|
||||
distinctId: username,
|
||||
event: 'server_login',
|
||||
properties: { ... }
|
||||
});
|
||||
```
|
||||
|
||||
## App router differences from pages router
|
||||
|
||||
This example uses Next.js App Router instead of Pages Router. Key differences:
|
||||
|
||||
1. **File-based routing**: Pages in `src/app/` instead of `src/pages/`
|
||||
2. **layout.tsx**: Root layout component wraps all pages
|
||||
3. **API Routes**: Located in `src/app/api/` with `route.ts` files
|
||||
4. **'use client'**: Client components need explicit directive
|
||||
5. **useRouter**: From `next/navigation` instead of `next/router`
|
||||
6. **Metadata**: Exported from layout/page instead of Head component
|
||||
7. **Server Components**: Components are server-side by default
|
||||
|
||||
## Learn more
|
||||
|
||||
- [PostHog Documentation](https://posthog.com/docs)
|
||||
- [Next.js App Router Documentation](https://nextjs.org/docs/app)
|
||||
- [PostHog Next.js Integration Guide](https://posthog.com/docs/libraries/next-js)
|
||||
|
||||
## Deploy on Vercel
|
||||
|
||||
The easiest way to deploy your Next.js app is to use the [Vercel Platform](https://vercel.com/new).
|
||||
|
||||
Check out the [Next.js deployment documentation](https://nextjs.org/docs/app/building-your-application/deploying) for more details.
|
||||
|
||||
---
|
||||
|
||||
## .env.example
|
||||
|
||||
```example
|
||||
# PostHog Configuration
|
||||
# Get your PostHog project token from: https://app.posthog.com/project/settings
|
||||
NEXT_PUBLIC_POSTHOG_PROJECT_TOKEN=your_posthog_project_token_here
|
||||
# NEXT_PUBLIC_POSTHOG_HOST=https://eu.i.posthog.com
|
||||
NEXT_PUBLIC_POSTHOG_HOST=https://us.i.posthog.com
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## instrumentation-client.ts
|
||||
|
||||
```ts
|
||||
import posthog from "posthog-js"
|
||||
|
||||
posthog.init(process.env.NEXT_PUBLIC_POSTHOG_PROJECT_TOKEN!, {
|
||||
api_host: "/ingest",
|
||||
ui_host: "https://us.posthog.com",
|
||||
// Include the defaults option as required by PostHog
|
||||
defaults: '2026-01-30',
|
||||
// Enables capturing unhandled exceptions via Error Tracking
|
||||
capture_exceptions: true,
|
||||
// Turn on debug in development mode
|
||||
debug: process.env.NODE_ENV === "development",
|
||||
});
|
||||
|
||||
// IMPORTANT: Never combine this approach with other client-side PostHog initialization approaches, especially components like a PostHogProvider. instrumentation-client.ts is the correct solution for initializing client-side PostHog in Next.js 15.3+ apps.
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## next.config.ts
|
||||
|
||||
```ts
|
||||
import type { NextConfig } from "next";
|
||||
|
||||
const nextConfig: NextConfig = {
|
||||
/* config options here */
|
||||
async rewrites() {
|
||||
return [
|
||||
{
|
||||
source: "/ingest/static/:path*",
|
||||
destination: "https://us-assets.i.posthog.com/static/:path*",
|
||||
},
|
||||
{
|
||||
source: "/ingest/:path*",
|
||||
destination: "https://us.i.posthog.com/:path*",
|
||||
},
|
||||
];
|
||||
},
|
||||
// This is required to support PostHog trailing slash API requests
|
||||
skipTrailingSlashRedirect: true,
|
||||
};
|
||||
|
||||
export default nextConfig;
|
||||
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## src/app/api/auth/login/route.ts
|
||||
|
||||
```ts
|
||||
import { NextResponse } from 'next/server';
|
||||
import { getPostHogClient } from '@/lib/posthog-server';
|
||||
|
||||
const users = new Map<string, { username: string; burritoConsiderations: number }>();
|
||||
|
||||
export async function POST(request: Request) {
|
||||
const { username, password } = await request.json();
|
||||
|
||||
if (!username || !password) {
|
||||
return NextResponse.json({ error: 'Username and password required' }, { status: 400 });
|
||||
}
|
||||
|
||||
let user = users.get(username);
|
||||
const isNewUser = !user;
|
||||
|
||||
if (!user) {
|
||||
user = { username, burritoConsiderations: 0 };
|
||||
users.set(username, user);
|
||||
}
|
||||
|
||||
// Capture server-side login event
|
||||
const posthog = getPostHogClient();
|
||||
posthog.capture({
|
||||
distinctId: username,
|
||||
event: 'server_login',
|
||||
properties: {
|
||||
username: username,
|
||||
isNewUser: isNewUser,
|
||||
source: 'api'
|
||||
}
|
||||
});
|
||||
|
||||
// Identify user on server side
|
||||
posthog.identify({
|
||||
distinctId: username,
|
||||
properties: {
|
||||
username: username,
|
||||
createdAt: isNewUser ? new Date().toISOString() : undefined
|
||||
}
|
||||
});
|
||||
|
||||
return NextResponse.json({ success: true, user });
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## src/app/burrito/page.tsx
|
||||
|
||||
```tsx
|
||||
'use client';
|
||||
|
||||
import { useState } from 'react';
|
||||
import { useAuth } from '@/contexts/AuthContext';
|
||||
import { useRouter } from 'next/navigation';
|
||||
import posthog from 'posthog-js';
|
||||
|
||||
export default function BurritoPage() {
|
||||
const { user, incrementBurritoConsiderations } = useAuth();
|
||||
const router = useRouter();
|
||||
const [hasConsidered, setHasConsidered] = useState(false);
|
||||
|
||||
// Redirect to home if not logged in
|
||||
if (!user) {
|
||||
router.push('/');
|
||||
return null;
|
||||
}
|
||||
|
||||
const handleConsideration = () => {
|
||||
incrementBurritoConsiderations();
|
||||
setHasConsidered(true);
|
||||
setTimeout(() => setHasConsidered(false), 2000);
|
||||
|
||||
// Capture burrito consideration event
|
||||
posthog.capture('burrito_considered', {
|
||||
// incrementBurritoConsiderations() has already mutated this user object, so the count is current
total_considerations: user.burritoConsiderations,
|
||||
username: user.username,
|
||||
});
|
||||
};
|
||||
|
||||
return (
|
||||
<div className="container">
|
||||
<h1>Burrito consideration zone</h1>
|
||||
<p>Take a moment to truly consider the potential of burritos.</p>
|
||||
|
||||
<div style={{ textAlign: 'center' }}>
|
||||
<button
|
||||
onClick={handleConsideration}
|
||||
className="btn-burrito"
|
||||
>
|
||||
I have considered the burrito potential
|
||||
</button>
|
||||
|
||||
{hasConsidered && (
|
||||
<p className="success">
|
||||
Thank you for your consideration! Count: {user.burritoConsiderations}
|
||||
</p>
|
||||
)}
|
||||
</div>
|
||||
|
||||
<div className="stats">
|
||||
<h3>Consideration stats</h3>
|
||||
<p>Total considerations: {user.burritoConsiderations}</p>
|
||||
</div>
|
||||
</div>
|
||||
);
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## src/app/layout.tsx
|
||||
|
||||
```tsx
|
||||
import type { Metadata } from "next";
|
||||
import "./globals.css";
|
||||
import { AuthProvider } from "@/contexts/AuthContext";
|
||||
import Header from "@/components/Header";
|
||||
|
||||
export const metadata: Metadata = {
|
||||
title: "Burrito Consideration App",
|
||||
description: "Consider the potential of burritos",
|
||||
};
|
||||
|
||||
export default function RootLayout({
|
||||
children,
|
||||
}: Readonly<{
|
||||
children: React.ReactNode;
|
||||
}>) {
|
||||
return (
|
||||
<html lang="en">
|
||||
<body>
|
||||
<AuthProvider>
|
||||
<Header />
|
||||
<main>{children}</main>
|
||||
</AuthProvider>
|
||||
</body>
|
||||
</html>
|
||||
);
|
||||
}
|
||||
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## src/app/page.tsx
|
||||
|
||||
```tsx
|
||||
'use client';
|
||||
|
||||
import { useState } from 'react';
|
||||
import { useAuth } from '@/contexts/AuthContext';
|
||||
|
||||
export default function Home() {
|
||||
const { user, login } = useAuth();
|
||||
const [username, setUsername] = useState('');
|
||||
const [password, setPassword] = useState('');
|
||||
const [error, setError] = useState('');
|
||||
|
||||
const handleSubmit = async (e: React.FormEvent) => {
|
||||
e.preventDefault();
|
||||
setError('');
|
||||
|
||||
try {
|
||||
const success = await login(username, password);
|
||||
if (success) {
|
||||
setUsername('');
|
||||
setPassword('');
|
||||
} else {
|
||||
setError('Please provide both username and password');
|
||||
}
|
||||
} catch (err) {
|
||||
console.error('Login failed:', err);
|
||||
setError('An error occurred during login');
|
||||
}
|
||||
};
|
||||
|
||||
if (user) {
|
||||
return (
|
||||
<div className="container">
|
||||
<h1>Welcome back, {user.username}!</h1>
|
||||
<p>You are now logged in. Feel free to explore:</p>
|
||||
<ul>
|
||||
<li>Consider the potential of burritos</li>
|
||||
<li>View your profile and statistics</li>
|
||||
</ul>
|
||||
</div>
|
||||
);
|
||||
}
|
||||
|
||||
return (
|
||||
<div className="container">
|
||||
<h1>Welcome to Burrito Consideration App</h1>
|
||||
<p>Please sign in to begin your burrito journey</p>
|
||||
|
||||
<form onSubmit={handleSubmit} className="form">
|
||||
<div className="form-group">
|
||||
<label htmlFor="username">Username:</label>
|
||||
<input
|
||||
type="text"
|
||||
id="username"
|
||||
value={username}
|
||||
onChange={(e) => setUsername(e.target.value)}
|
||||
placeholder="Enter any username"
|
||||
/>
|
||||
</div>
|
||||
|
||||
<div className="form-group">
|
||||
<label htmlFor="password">Password:</label>
|
||||
<input
|
||||
type="password"
|
||||
id="password"
|
||||
value={password}
|
||||
onChange={(e) => setPassword(e.target.value)}
|
||||
placeholder="Enter any password"
|
||||
/>
|
||||
</div>
|
||||
|
||||
{error && <p className="error">{error}</p>}
|
||||
|
||||
<button type="submit" className="btn-primary">Sign In</button>
|
||||
</form>
|
||||
|
||||
<p className="note">
|
||||
Note: This is a demo app. Use any username and password to sign in.
|
||||
</p>
|
||||
</div>
|
||||
);
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## src/app/profile/page.tsx
|
||||
|
||||
```tsx
|
||||
'use client';
|
||||
|
||||
import { useAuth } from '@/contexts/AuthContext';
|
||||
import { useRouter } from 'next/navigation';
|
||||
import posthog from 'posthog-js';
|
||||
|
||||
export default function ProfilePage() {
|
||||
const { user } = useAuth();
|
||||
const router = useRouter();
|
||||
|
||||
// Redirect to home if not logged in
|
||||
if (!user) {
|
||||
router.push('/');
|
||||
return null;
|
||||
}
|
||||
|
||||
const triggerTestError = () => {
|
||||
try {
|
||||
throw new Error('Test error for PostHog error tracking');
|
||||
} catch (err) {
|
||||
posthog.captureException(err);
|
||||
console.error('Captured error:', err);
|
||||
alert('Error captured and sent to PostHog!');
|
||||
}
|
||||
};
|
||||
|
||||
return (
|
||||
<div className="container">
|
||||
<h1>User Profile</h1>
|
||||
|
||||
<div className="stats">
|
||||
<h2>Your Information</h2>
|
||||
<p><strong>Username:</strong> {user.username}</p>
|
||||
<p><strong>Burrito Considerations:</strong> {user.burritoConsiderations}</p>
|
||||
</div>
|
||||
|
||||
<div style={{ marginTop: '2rem' }}>
|
||||
<button onClick={triggerTestError} className="btn-primary" style={{ backgroundColor: '#dc3545' }}>
|
||||
Trigger Test Error (for PostHog)
|
||||
</button>
|
||||
</div>
|
||||
|
||||
<div style={{ marginTop: '2rem' }}>
|
||||
<h3>Your Burrito Journey</h3>
|
||||
{user.burritoConsiderations === 0 ? (
|
||||
<p>You haven't considered any burritos yet. Visit the Burrito Consideration page to start!</p>
|
||||
) : user.burritoConsiderations === 1 ? (
|
||||
<p>You've considered the burrito potential once. Keep going!</p>
|
||||
) : user.burritoConsiderations < 5 ? (
|
||||
<p>You're getting the hang of burrito consideration!</p>
|
||||
) : user.burritoConsiderations < 10 ? (
|
||||
<p>You're becoming a burrito consideration expert!</p>
|
||||
) : (
|
||||
<p>You are a true burrito consideration master! 🌯</p>
|
||||
)}
|
||||
</div>
|
||||
</div>
|
||||
);
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## src/components/Header.tsx
|
||||
|
||||
```tsx
|
||||
'use client';
|
||||
|
||||
import Link from 'next/link';
|
||||
import { useAuth } from '@/contexts/AuthContext';
|
||||
|
||||
export default function Header() {
|
||||
const { user, logout } = useAuth();
|
||||
|
||||
return (
|
||||
<header className="header">
|
||||
<div className="header-container">
|
||||
<nav>
|
||||
<Link href="/">Home</Link>
|
||||
{user && (
|
||||
<>
|
||||
<Link href="/burrito">Burrito Consideration</Link>
|
||||
<Link href="/profile">Profile</Link>
|
||||
</>
|
||||
)}
|
||||
</nav>
|
||||
<div className="user-section">
|
||||
{user ? (
|
||||
<>
|
||||
<span>Welcome, {user.username}!</span>
|
||||
<button onClick={logout} className="btn-logout">
|
||||
Logout
|
||||
</button>
|
||||
</>
|
||||
) : (
|
||||
<span>Not logged in</span>
|
||||
)}
|
||||
</div>
|
||||
</div>
|
||||
</header>
|
||||
);
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## src/contexts/AuthContext.tsx
|
||||
|
||||
```tsx
|
||||
'use client';
|
||||
|
||||
import { createContext, useContext, useState, ReactNode } from 'react';
|
||||
import posthog from 'posthog-js';
|
||||
|
||||
interface User {
|
||||
username: string;
|
||||
burritoConsiderations: number;
|
||||
}
|
||||
|
||||
interface AuthContextType {
|
||||
user: User | null;
|
||||
login: (username: string, password: string) => Promise<boolean>;
|
||||
logout: () => void;
|
||||
incrementBurritoConsiderations: () => void;
|
||||
}
|
||||
|
||||
const AuthContext = createContext<AuthContextType | undefined>(undefined);
|
||||
|
||||
const users: Map<string, User> = new Map();
|
||||
|
||||
export function AuthProvider({ children }: { children: ReactNode }) {
|
||||
// Use lazy initializer to read from localStorage only once on mount
|
||||
const [user, setUser] = useState<User | null>(() => {
|
||||
if (typeof window === 'undefined') return null;
|
||||
|
||||
const storedUsername = localStorage.getItem('currentUser');
|
||||
if (storedUsername) {
|
||||
const existingUser = users.get(storedUsername);
|
||||
if (existingUser) {
|
||||
return existingUser;
|
||||
}
|
||||
}
|
||||
return null;
|
||||
});
|
||||
|
||||
const login = async (username: string, password: string): Promise<boolean> => {
|
||||
try {
|
||||
const response = await fetch('/api/auth/login', {
|
||||
method: 'POST',
|
||||
headers: { 'Content-Type': 'application/json' },
|
||||
body: JSON.stringify({ username, password }),
|
||||
});
|
||||
|
||||
if (response.ok) {
|
||||
const { user: userData } = await response.json();
|
||||
|
||||
let localUser = users.get(username);
|
||||
if (!localUser) {
|
||||
localUser = userData as User;
|
||||
users.set(username, localUser);
|
||||
}
|
||||
|
||||
setUser(localUser);
|
||||
localStorage.setItem('currentUser', username);
|
||||
|
||||
// Identify user in PostHog using username as distinct ID
|
||||
posthog.identify(username, {
|
||||
username: username,
|
||||
});
|
||||
|
||||
// Capture login event
|
||||
posthog.capture('user_logged_in', {
|
||||
username: username,
|
||||
});
|
||||
|
||||
return true;
|
||||
}
|
||||
return false;
|
||||
} catch (error) {
|
||||
console.error('Login error:', error);
|
||||
return false;
|
||||
}
|
||||
};
|
||||
|
||||
const logout = () => {
|
||||
// Capture logout event before resetting
|
||||
posthog.capture('user_logged_out');
|
||||
posthog.reset();
|
||||
|
||||
setUser(null);
|
||||
localStorage.removeItem('currentUser');
|
||||
};
|
||||
|
||||
const incrementBurritoConsiderations = () => {
|
||||
if (user) {
|
||||
user.burritoConsiderations++;
|
||||
users.set(user.username, user);
|
||||
setUser({ ...user });
|
||||
}
|
||||
};
|
||||
|
||||
return (
|
||||
<AuthContext.Provider value={{ user, login, logout, incrementBurritoConsiderations }}>
|
||||
{children}
|
||||
</AuthContext.Provider>
|
||||
);
|
||||
}
|
||||
|
||||
export function useAuth() {
|
||||
const context = useContext(AuthContext);
|
||||
if (context === undefined) {
|
||||
throw new Error('useAuth must be used within an AuthProvider');
|
||||
}
|
||||
return context;
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## src/lib/posthog-server.ts
|
||||
|
||||
```ts
|
||||
import { PostHog } from 'posthog-node';
|
||||
|
||||
let posthogClient: PostHog | null = null;
|
||||
|
||||
export function getPostHogClient() {
|
||||
if (!posthogClient) {
|
||||
posthogClient = new PostHog(
|
||||
process.env.NEXT_PUBLIC_POSTHOG_PROJECT_TOKEN!,
|
||||
{
|
||||
host: process.env.NEXT_PUBLIC_POSTHOG_HOST,
|
||||
flushAt: 1,
|
||||
flushInterval: 0
|
||||
}
|
||||
);
|
||||
posthogClient.debug(true);
|
||||
}
|
||||
return posthogClient;
|
||||
}
|
||||
|
||||
export async function shutdownPostHog() {
|
||||
if (posthogClient) {
|
||||
await posthogClient.shutdown();
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
@@ -0,0 +1,43 @@
|
||||
---
|
||||
title: PostHog Setup - Begin
|
||||
description: Start the event tracking setup process by analyzing the project and creating an event tracking plan
|
||||
---
|
||||
|
||||
We're making an event tracking plan for this project.
|
||||
|
||||
Before proceeding, find any existing `posthog.capture()` code. Make note of event name formatting.
|
||||
|
||||
From the project's file list, select between 10 and 15 files that might have interesting business value for event tracking, especially conversion and churn events. Also look for additional files related to login that could be used for identifying users, along with error handling. Read the files. If a file is already well-covered by PostHog events, replace it with another option. Do not spawn subagents.
|
||||
|
||||
Look for opportunities to track client-side events.
|
||||
|
||||
**IMPORTANT: Server-side events are REQUIRED** if the project includes any instrumentable server-side code. If the project has API routes (e.g., `app/api/**/route.ts`) or Server Actions, you MUST include server-side events for critical business operations like:
|
||||
|
||||
- Payment/checkout completion
|
||||
- Webhook handlers
|
||||
- Authentication endpoints
|
||||
|
||||
Do not skip server-side events - they capture actions that cannot be tracked client-side.
|
||||
|
||||
Create a new file with a JSON array at the root of the project: .posthog-events.json. It should include one object for each event we want to add: event name, event description, and the file path we want to place the event in. If events already exist, don't duplicate them; supplement them.
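
A minimal sketch of the plan's shape and the "don't duplicate, supplement" rule. The field names `event`, `description`, and `file` are assumptions (the instructions above only require a name, a description, and a file path), and `supplementPlan` is a hypothetical helper.

```typescript
// Assumed entry shape for .posthog-events.json; field names are illustrative.
interface PlannedEvent {
  event: string;
  description: string;
  file: string;
}

// Keeps every event that already exists and appends only proposals
// whose event name is not present yet.
function supplementPlan(existing: PlannedEvent[], proposed: PlannedEvent[]): PlannedEvent[] {
  const seen = new Set(existing.map((e) => e.event));
  const merged = [...existing];
  for (const p of proposed) {
    if (!seen.has(p.event)) {
      seen.add(p.event);
      merged.push(p);
    }
  }
  return merged;
}
```

Serializing the result with `JSON.stringify(merged, null, 2)` produces the JSON array the file is expected to contain.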
|
||||
|
||||
Track actions only, not pageviews. These can be captured automatically. Exceptions can be made for "viewed"-type events that correspond to the top of a conversion funnel.
|
||||
|
||||
As you review files, make an internal note of opportunities to identify users and catch errors. We'll need them for the next step.
|
||||
|
||||
## Status
|
||||
|
||||
Before beginning a phase of the setup, you will send a status message with the exact prefix '[STATUS]', as in:
|
||||
|
||||
[STATUS] Checking project structure.
|
||||
|
||||
Status to report in this phase:
|
||||
|
||||
- Checking project structure
|
||||
- Verifying PostHog dependencies
|
||||
- Generating events based on project
|
||||
|
||||
|
||||
---
|
||||
|
||||
**Upon completion, continue with:** [basic-integration-1.1-edit.md](basic-integration-1.1-edit.md)
|
||||
@@ -0,0 +1,37 @@
|
||||
---
|
||||
title: PostHog Setup - Edit
|
||||
description: Implement PostHog event tracking in the identified files, following best practices and the example project
|
||||
---
|
||||
|
||||
For each of the files and events noted in .posthog-events.json, make edits to capture events using PostHog. Make sure to set up any helper files needed. Carefully examine the included example project code: your implementation should match it as closely as possible. Do not spawn subagents.
|
||||
|
||||
Use environment variables for PostHog keys. Do not hardcode PostHog keys.
|
||||
|
||||
If a file already has existing integration code for other tools or services, don't overwrite or remove that code. Place PostHog code below it.
|
||||
|
||||
For each event, add useful properties, and use your access to the PostHog source code to ensure correctness. You also have access to documentation about creating new events with PostHog. Consider this documentation carefully and follow it closely before adding events. Your integration should be based on documented best practices. Carefully consider how the user project's framework version may impact the correct PostHog integration approach.
|
||||
|
||||
Remember that you can find the source code for any dependency in the node_modules directory. This may be necessary to properly populate property names. There are also example project code files available via the PostHog MCP; use these for reference.
|
||||
|
||||
Where possible, add calls for PostHog's identify() function on the client side upon events like logins and signups. Use the contents of login and signup forms to identify users on submit. If there is server-side code, pass the client-side session and distinct ID to the server-side code to identify the user. On the server side, make sure events have a matching distinct ID where relevant.
|
||||
|
||||
It's essential to do this in both client code and server code, so that user behavior from both domains is easy to correlate.
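
One way to carry the client's identity to the server is sketched below. It assumes the client attaches the distinct ID and session ID as request headers; the header names and both helper functions are hypothetical, and `posthog` and `client` stand for already-initialized `posthog-js` and `posthog-node` instances.

```javascript
// Hypothetical header names; any names the client and server agree on will do.
const DISTINCT_ID_HEADER = 'x-posthog-distinct-id'
const SESSION_ID_HEADER = 'x-posthog-session-id'

// Client side: read the current IDs from the initialized posthog-js instance.
function trackingHeaders(posthog) {
  return {
    [DISTINCT_ID_HEADER]: posthog.get_distinct_id(),
    [SESSION_ID_HEADER]: posthog.get_session_id(),
  }
}

// Server side: `client` is a posthog-node instance and `headers` is a plain
// object of lowercased request headers. Returns false when no ID was sent.
function captureFromRequest(client, headers, event, properties = {}) {
  const distinctId = headers[DISTINCT_ID_HEADER]
  if (!distinctId) return false
  client.capture({
    distinctId,
    event,
    properties: { ...properties, $session_id: headers[SESSION_ID_HEADER] },
  })
  return true
}
```

With this shape, a client request would pass `trackingHeaders(posthog)` as its headers, and the route handler would call `captureFromRequest` so the server event lands on the same person.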
|
||||
|
||||
You should also add PostHog exception capture error tracking to these files where relevant.
|
||||
|
||||
Remember: Do not alter the fundamental architecture of existing files. Make your additions minimal and targeted.
|
||||
|
||||
Remember the documentation and example project resources you were provided at the beginning. Read them now.
|
||||
|
||||
## Status
|
||||
|
||||
Status to report in this phase:
|
||||
|
||||
- Inserting PostHog capture code
|
||||
- A status message for each file whose edits you are planning, including a high level summary of changes
|
||||
- A status message for each file you have edited
|
||||
|
||||
|
||||
---
|
||||
|
||||
**Upon completion, continue with:** [basic-integration-1.2-revise.md](basic-integration-1.2-revise.md)
|
||||
@@ -0,0 +1,22 @@
|
||||
---
|
||||
title: PostHog Setup - Revise
|
||||
description: Review and fix any errors in the PostHog integration implementation
|
||||
---
|
||||
|
||||
Check the project for errors. Read the package.json file for any type checking or build scripts that may provide input about what to fix. Remember that you can find the source code for any dependency in the node_modules directory. Do not spawn subagents.
|
||||
|
||||
Ensure that any components created were actually used.
|
||||
|
||||
Once all other tasks are complete, run any linter or prettier-like scripts found in the package.json, but ONLY on the files you have edited or created during this session. Do not run formatting or linting across the entire project's codebase.
|
||||
|
||||
## Status
|
||||
|
||||
Status to report in this phase:
|
||||
|
||||
- Finding and correcting errors
|
||||
- Report details of any errors you fix
|
||||
- Linting, building and prettying
|
||||
|
||||
---
|
||||
|
||||
**Upon completion, continue with:** [basic-integration-1.3-conclude.md](basic-integration-1.3-conclude.md)
|
||||
@@ -0,0 +1,38 @@
|
||||
---
|
||||
title: PostHog Setup - Conclusion
|
||||
description: Create the PostHog dashboard and write the setup report
|
||||
---
|
||||
|
||||
Use the PostHog MCP to create a new dashboard named "Analytics basics" based on the events created here. Make sure to use the exact same event names as implemented in the code. Populate it with up to five insights, with special emphasis on things like conversion funnels, churn events, and other business critical insights.
|
||||
|
||||
Search for a file called `.posthog-events.json` and read it for available events. Do not spawn subagents.
|
||||
|
||||
Create the file posthog-setup-report.md. It should include a summary of the integration edits, a table with the event names, event descriptions, and files where events were added, along with a list of links for the dashboard and insights created. Follow this format:
|
||||
|
||||
<wizard-report>
|
||||
# PostHog post-wizard report
|
||||
|
||||
The wizard has completed a deep integration of your project. [Detailed summary of changes]
|
||||
|
||||
[table of events/descriptions/files]
|
||||
|
||||
## Next steps
|
||||
|
||||
We've built some insights and a dashboard for you to keep an eye on user behavior, based on the events we just instrumented:
|
||||
|
||||
[links]
|
||||
|
||||
### Agent skill
|
||||
|
||||
We've left an agent skill folder in your project. You can use this context for further agent development when using Claude Code. This will help ensure the model provides the most up-to-date approaches for integrating PostHog.
|
||||
|
||||
</wizard-report>
|
||||
|
||||
Upon completion, remove .posthog-events.json.
|
||||
|
||||
## Status
|
||||
|
||||
Status to report in this phase:
|
||||
|
||||
- Configured dashboard: [insert PostHog dashboard URL]
|
||||
- Created setup report: [insert full local file path]
|
||||
@@ -0,0 +1,202 @@
|
||||
# Identify users - Docs
|
||||
|
||||
Linking events to specific users enables you to build a full picture of how they're using your product across different sessions, devices, and platforms.
|
||||
|
||||
This is straightforward to do when [capturing backend events](/docs/product-analytics/capture-events?tab=Node.js.md), as you associate events to a specific user using a `distinct_id`, which is a required argument.
|
||||
|
||||
However, in the frontend of a [web](/docs/libraries/js/features.md#capturing-events) or [mobile app](/docs/libraries/ios.md#capturing-events), a `distinct_id` is not a required argument — PostHog's SDKs will generate an anonymous `distinct_id` for you automatically and you can capture events anonymously, provided you use the appropriate [configuration](/docs/libraries/js/features.md#capturing-anonymous-events).
|
||||
|
||||
To link events to specific users, call `identify`:
|
||||
|
||||
|
||||
|
||||
### Web
|
||||
|
||||
```javascript
|
||||
posthog.identify(
|
||||
'distinct_id', // Replace 'distinct_id' with your user's unique identifier
|
||||
{ email: 'max@hedgehogmail.com', name: 'Max Hedgehog' } // optional: set additional person properties
|
||||
);
|
||||
```
|
||||
|
||||
### Android
|
||||
|
||||
```kotlin
|
||||
PostHog.identify(
|
||||
distinctId = distinctID, // Replace 'distinctID' with your user's unique identifier
|
||||
// optional: set additional person properties
|
||||
userProperties = mapOf(
|
||||
"name" to "Max Hedgehog",
|
||||
"email" to "max@hedgehogmail.com"
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
### iOS
|
||||
|
||||
```swift
|
||||
PostHogSDK.shared.identify("distinct_id", // Replace "distinct_id" with your user's unique identifier
|
||||
userProperties: ["name": "Max Hedgehog", "email": "max@hedgehogmail.com"]) // optional: set additional person properties
|
||||
```
|
||||
|
||||
### React Native
|
||||
|
||||
```jsx
|
||||
posthog.identify('distinct_id', { // Replace "distinct_id" with your user's unique identifier
|
||||
email: 'max@hedgehogmail.com', // optional: set additional person properties
|
||||
name: 'Max Hedgehog'
|
||||
})
|
||||
```
|
||||
|
||||
### Dart
|
||||
|
||||
```dart
|
||||
await Posthog().identify(
|
||||
userId: 'distinct_id', // Replace "distinct_id" with your user's unique identifier
|
||||
userProperties: {
|
||||
"email": "max@hedgehogmail.com", // optional: set additional person properties
|
||||
"name": "Max Hedgehog"
|
||||
});
|
||||
```
|
||||
|
||||
Events captured after calling `identify` are identified events and this creates a person profile if one doesn't exist already.
|
||||
|
||||
Because identified events cost more to process, anonymous events can be up to 4x cheaper than identified events, so we recommend capturing identified events only when needed.
|
||||
|
||||
## How identify works
|
||||
|
||||
When a user starts browsing your website or app, PostHog automatically assigns them an **anonymous ID**, which is stored locally.
|
||||
|
||||
Provided you've [configured persistence](/docs/libraries/js/persistence.md) to use cookies or `localStorage`, this enables us to track anonymous users – even across different sessions.
|
||||
|
||||
By calling `identify` with a `distinct_id` of your choice (usually the user's ID in your database, or their email), you link the anonymous ID and distinct ID together.
|
||||
|
||||
Thus, all past and future events made with that anonymous ID are now associated with the distinct ID.
|
||||
|
||||
This enables you to do things like associate events with a user from before they log in for the first time, or associate their events across different devices or platforms.
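
The linking behavior described above can be illustrated with a toy model. This is only a sketch of the idea, not PostHog's actual ingestion logic; the `ToyIdentityGraph` class and its methods are invented for illustration.

```javascript
// Toy model: every ID maps to a person, and identify() merges the anonymous
// ID's person into the distinct ID's person.
class ToyIdentityGraph {
  constructor() {
    this.personOf = new Map() // id -> person number
    this.events = []
    this.nextPerson = 1
  }

  // Look up the person for an ID, creating one on first sight.
  personFor(id) {
    if (!this.personOf.has(id)) this.personOf.set(id, this.nextPerson++)
    return this.personOf.get(id)
  }

  capture(id, event) {
    this.personFor(id)
    this.events.push({ id, event })
  }

  // Point every ID that belonged to the anonymous person at the identified one.
  identify(anonymousId, distinctId) {
    const target = this.personFor(distinctId)
    const source = this.personFor(anonymousId)
    for (const [id, person] of this.personOf) {
      if (person === source) this.personOf.set(id, target)
    }
  }

  // All events for the person behind an ID, including pre-identify ones.
  eventsFor(id) {
    const person = this.personOf.get(id)
    return this.events
      .filter((e) => this.personOf.get(e.id) === person)
      .map((e) => e.event)
  }
}
```

In this model, an event captured under `anon-1` before `identify('anon-1', 'user-42')` shows up in `eventsFor('user-42')` alongside later events.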
|
||||
|
||||
Using identify in the backend
|
||||
|
||||
Although you can call `identify` using our backend SDKs, it is used most in frontends. This is because there is no concept of anonymous sessions in the backend SDKs, so calling `identify` only updates person profiles.
|
||||
|
||||
## Best practices when using `identify`
|
||||
|
||||
### 1\. Call `identify` as soon as you're able to
|
||||
|
||||
In your frontend, you should call `identify` as soon as you're able to.
|
||||
|
||||
Typically, this is every time your **app loads** for the first time, and directly after your **users log in**.
|
||||
|
||||
This ensures that events sent during your users' sessions are correctly associated with them.
|
||||
|
||||
You only need to call `identify` once per session, and you should avoid calling it multiple times unnecessarily.
|
||||
|
||||
If you call `identify` multiple times with the same data without reloading the page in between, PostHog will ignore the subsequent calls.
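
A small guard can make the "once per session" intent explicit in application code. The sketch below assumes a `user` object with `id`, `email`, and `name` fields, and the helper name `identifyIfNeeded` is hypothetical; it relies only on `posthog.get_distinct_id()` and `posthog.identify()`.

```javascript
// Call identify only when the current distinct ID differs from the user's ID.
// Returns true when identify was actually called.
function identifyIfNeeded(posthog, user) {
  if (!user || user.id === undefined || user.id === null) return false
  const userId = String(user.id)
  if (posthog.get_distinct_id() === userId) return false
  posthog.identify(userId, { email: user.email, name: user.name })
  return true
}
```

This skips the call when the user is already identified, which also means updated person properties are not re-sent; whether that trade-off suits you depends on how often those properties change.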
|
||||
|
||||
### 2\. Use unique strings for distinct IDs
|
||||
|
||||
If two users have the same distinct ID, their data is merged and they are considered one user in PostHog. Two common ways this can happen are:
|
||||
|
||||
- Your logic for generating IDs does not generate sufficiently strong IDs and you can end up with a clash where 2 users have the same ID.
|
||||
- There's a bug, typo, or mistake in your code leading to most or all users being identified with generic IDs like `null`, `true`, or `distinctId`.
|
||||
|
||||
PostHog also has built-in protections to stop the most common distinct ID mistakes.
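
As an extra application-side check, you could reject obviously generic values before calling `identify`. The sketch below is an assumption-laden example: the blocklist and the `isUsableDistinctId` helper are illustrative and do not reproduce PostHog's built-in list.

```javascript
// Illustrative blocklist of values that usually indicate a bug, compared
// after trimming and lowercasing. Not PostHog's actual list.
const SUSPICIOUS_IDS = new Set([
  '', '0', 'nan', 'null', 'undefined', 'true', 'false',
  'distinctid', 'distinct_id', '[object object]',
])

// Accept only strings or numbers that are not on the blocklist.
function isUsableDistinctId(id) {
  if (typeof id !== 'string' && typeof id !== 'number') return false
  const normalized = String(id).trim().toLowerCase()
  return !SUSPICIOUS_IDS.has(normalized)
}
```

A login handler could call `isUsableDistinctId(user.id)` and log a warning instead of identifying when it returns `false`.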
|
||||
|
||||
### 3\. Reset after logout
|
||||
|
||||
If a user logs out on your frontend, you should call `reset()` to unlink any future events made on that device with that user.
|
||||
|
||||
This is important if your users are sharing a computer, as otherwise all of those users are grouped together into a single user due to shared cookies between sessions.
|
||||
|
||||
**We strongly recommend you call `reset` on logout even if you don't expect users to share a computer.**
|
||||
|
||||
You can do that like so:
|
||||
|
||||
|
||||
|
||||
### Web
|
||||
|
||||
```javascript
|
||||
posthog.reset()
|
||||
```
|
||||
|
||||
### iOS
|
||||
|
||||
```swift
|
||||
PostHogSDK.shared.reset()
|
||||
```
|
||||
|
||||
### Android
|
||||
|
||||
```kotlin
|
||||
PostHog.reset()
|
||||
```
|
||||
|
||||
### React Native
|
||||
|
||||
```jsx
|
||||
posthog.reset()
|
||||
```
|
||||
|
||||
### Dart
|
||||
|
||||
```dart
|
||||
await Posthog().reset();
|
||||
```
|
||||
|
||||
If you *also* want to reset the `device_id` so that the device will be considered a new device in future events, you can pass `true` as an argument:
|
||||
|
||||
Web
|
||||
|
||||
|
||||
|
||||
```javascript
|
||||
posthog.reset(true)
|
||||
```
|
||||
|
||||
### 4\. Person profiles and properties
|
||||
|
||||
You'll notice that one of the parameters in the `identify` method is a `properties` object.
|
||||
|
||||
This enables you to set [person properties](/docs/product-analytics/person-properties.md).
|
||||
|
||||
Whenever possible, we recommend passing in all person properties you have available each time you call identify, as this ensures their person profile on PostHog is up to date.
|
||||
|
||||
Person properties can also be set by adding a `$set` property to an event `capture` call.
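
For instance, a thin wrapper can merge person properties into an event's properties under `$set`. The helper name `captureWithPersonProperties` is hypothetical, and `posthog` is assumed to be an initialized `posthog-js` instance.

```javascript
// Capture an event and update person properties in the same call by
// placing them under the `$set` key of the event properties.
function captureWithPersonProperties(posthog, event, properties, personProperties) {
  posthog.capture(event, { ...properties, $set: personProperties })
}
```

For example, `captureWithPersonProperties(posthog, 'plan_upgraded', { plan: 'pro' }, { plan: 'pro' })` would record the event and update the person's `plan` property.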
|
||||
|
||||
See our [person properties docs](/docs/product-analytics/person-properties.md) for more details on how to work with them and best practices.
|
||||
|
||||
### 5\. Use deep links between platforms
|
||||
|
||||
We recommend you call `identify` [as soon as you're able](#1-call-identify-as-soon-as-youre-able), typically when a user signs up or logs in.
|
||||
|
||||
This doesn't work if one or both platforms are unauthenticated. Some examples of such cases are:
|
||||
|
||||
- Onboarding and signup flows before authentication.
|
||||
- Unauthenticated web pages redirecting to authenticated mobile apps.
|
||||
- Authenticated web apps prompting an app download.
|
||||
|
||||
In these cases, you can use a [deep link](https://developer.android.com/training/app-links/deep-linking) on Android and [universal links](https://developer.apple.com/documentation/xcode/supporting-universal-links-in-your-app) on iOS to identify users.
|
||||
|
||||
1. Use `posthog.get_distinct_id()` to get the current distinct ID. Even if you cannot call identify because the user is unauthenticated, this will return an anonymous distinct ID generated by PostHog.
|
||||
2. Add the distinct ID to the deep link as query parameters, along with other properties like UTM parameters.
|
||||
3. When the user is redirected to the app, parse the deep link and handle the following cases:
|
||||
|
||||
- The user is already authenticated on the mobile app. In this case, call [`posthog.alias()`](/docs/libraries/js/features.md#alias) with the distinct ID from the web. This associates the two distinct IDs as a single person.
|
||||
- The user is unauthenticated. In this case, call [`posthog.identify()`](/docs/libraries/js/features.md#identifying-users) with the distinct ID from the web. Events will be associated with this distinct ID.
|
||||
|
||||
As long as you associate the distinct IDs with `posthog.identify()` or `posthog.alias()`, you can track events generated across platforms.
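
The three steps above can be sketched as follows. The query parameter name `ph_distinct_id` and both helper functions are assumptions made for illustration; only `posthog.alias()` and `posthog.identify()` come from the SDK.

```javascript
// Hypothetical query parameter name; use whatever your link handler expects.
const DISTINCT_ID_PARAM = 'ph_distinct_id'

// Web side: append the current distinct ID (and any extras such as UTM
// parameters) to the deep link.
function buildDeepLink(baseUrl, distinctId, extraParams = {}) {
  const url = new URL(baseUrl)
  url.searchParams.set(DISTINCT_ID_PARAM, distinctId)
  for (const [key, value] of Object.entries(extraParams)) {
    url.searchParams.set(key, value)
  }
  return url.toString()
}

// App side: parse the link and link identities. Returns which call was made,
// or null when the link carries no distinct ID.
function handleDeepLink(link, posthog, isAuthenticated) {
  const webDistinctId = new URL(link).searchParams.get(DISTINCT_ID_PARAM)
  if (!webDistinctId) return null
  if (isAuthenticated) {
    posthog.alias(webDistinctId) // merge the web ID into the logged-in person
    return 'alias'
  }
  posthog.identify(webDistinctId) // continue under the web's distinct ID
  return 'identify'
}
```

On the web, the first argument to `buildDeepLink` would be your universal link or app link, and the second would be the value of `posthog.get_distinct_id()`.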
|
||||
|
||||
## Further reading
|
||||
|
||||
- [Identifying users docs](/docs/product-analytics/identify.md)
|
||||
- [How person processing works](/docs/how-posthog-works/ingestion-pipeline.md#2-person-processing)
|
||||
- [An introductory guide to identifying users in PostHog](/tutorials/identifying-users-guide.md)
|
||||
|
||||
|
||||
@@ -0,0 +1,385 @@
|
||||
# Next.js - Docs
|
||||
|
||||
PostHog makes it easy to get data about traffic and usage of your [Next.js](https://nextjs.org/) app. Integrating PostHog into your site enables analytics about user behavior, custom events capture, session recordings, feature flags, and more.
|
||||
|
||||
This guide walks you through integrating PostHog into your Next.js app using the [React](/docs/libraries/react.md) and the [Node.js](/docs/libraries/node.md) SDKs.
|
||||
|
||||
> You can see a working example of this integration in our [Next.js demo app](https://github.com/PostHog/posthog-js/tree/main/playground/nextjs).
|
||||
|
||||
Next.js has both client and server-side rendering, as well as pages and app routers. We'll cover all of these options in this guide.
|
||||
|
||||
> **Try `@posthog/next` (pre-release):** A simplified Next.js integration with synchronized client/server identity, server-side flag bootstrapping, and a built-in API proxy. [Read the setup guide →](/docs/libraries/next-js/posthog-next.md)
|
||||
|
||||
## Prerequisites
|
||||
|
||||
To follow along with this guide, you need:
|
||||
|
||||
1. A PostHog instance (either [Cloud](https://app.posthog.com/signup) or [self-hosted](/docs/self-host.md))
|
||||
2. A Next.js application
|
||||
|
||||
## Beta: integration via LLM
|
||||
|
||||
Install PostHog for Next.js in seconds with our wizard by running this prompt with [LLM coding agents](/blog/envoy-wizard-llm-agent.md) like Cursor and Bolt, or by running it in your terminal.
|
||||
|
||||
`npx @posthog/wizard@latest`
|
||||
|
||||
[Learn more](/wizard.md)
|
||||
|
||||
Or, to integrate manually, continue with the rest of this guide.
|
||||
|
||||
## Client-side setup
|
||||
|
||||
Install `posthog-js` using your package manager:
|
||||
|
||||
|
||||
|
||||
### npm
|
||||
|
||||
```bash
|
||||
npm install --save posthog-js
|
||||
```
|
||||
|
||||
### Yarn
|
||||
|
||||
```bash
|
||||
yarn add posthog-js
|
||||
```
|
||||
|
||||
### pnpm
|
||||
|
||||
```bash
|
||||
pnpm add posthog-js
|
||||
```
|
||||
|
||||
### Bun
|
||||
|
||||
```bash
|
||||
bun add posthog-js
|
||||
```
|
||||
|
||||
Add your environment variables to your `.env.local` file and to your hosting provider (e.g. Vercel, Netlify, AWS). You can find your project token in your [project settings](https://app.posthog.com/project/settings).
|
||||
|
||||
.env.local
|
||||
|
||||
|
||||
|
||||
```shell
|
||||
NEXT_PUBLIC_POSTHOG_TOKEN=<ph_project_token>
|
||||
NEXT_PUBLIC_POSTHOG_HOST=https://us.i.posthog.com
|
||||
```
|
||||
|
||||
These values need to start with `NEXT_PUBLIC_` to be accessible on the client-side.
|
||||
|
||||
## Integration
|
||||
|
||||
Next.js provides the [`instrumentation-client.ts|js`](https://nextjs.org/docs/app/api-reference/file-conventions/instrumentation-client) file for client-side setup. Add it to the root of your Next.js app (for both app and pages router) and initialize PostHog in it like this:
|
||||
|
||||
|
||||
|
||||
### instrumentation-client.js
|
||||
|
||||
```javascript
|
||||
import posthog from 'posthog-js'
|
||||
posthog.init(process.env.NEXT_PUBLIC_POSTHOG_TOKEN, {
|
||||
api_host: process.env.NEXT_PUBLIC_POSTHOG_HOST,
|
||||
defaults: '2026-01-30'
|
||||
});
|
||||
```
|
||||
|
||||
### instrumentation-client.ts
|
||||
|
||||
```typescript
|
||||
import posthog from 'posthog-js'
|
||||
posthog.init(process.env.NEXT_PUBLIC_POSTHOG_TOKEN!, {
|
||||
api_host: process.env.NEXT_PUBLIC_POSTHOG_HOST,
|
||||
defaults: '2026-01-30'
|
||||
});
|
||||
```
|
||||
|
||||
Bootstrapping with `instrumentation-client`
|
||||
|
||||
When using `instrumentation-client`, the values you pass to `posthog.init` remain fixed for the entire session. This means bootstrapping only works if you evaluate flags **before your app renders** (for example, on the server).
|
||||
|
||||
If you need flag values after the app has rendered, you’ll want to:
|
||||
|
||||
- Evaluate the flag on the server and pass the value into your app, or
|
||||
- Evaluate the flag in an earlier page/state, then store and re-use it when needed.
|
||||
|
||||
Both approaches avoid flicker and give you the same outcome as bootstrapping, as long as you use the same `distinct_id` across client and server.
|
||||
|
||||
See the [bootstrapping guide](/docs/feature-flags/bootstrapping.md) for more information.
|
||||
|
||||
## Identifying users
|
||||
|
||||
> **Identifying users is required.** Call `posthog.identify('your-user-id')` after login to link events to a known user. This is what connects frontend event captures, [session replays](/docs/session-replay.md), [LLM traces](/docs/ai-engineering.md), and [error tracking](/docs/error-tracking.md) to the same person — and lets backend events link back too.
|
||||
>
|
||||
> See our guide on [identifying users](/docs/getting-started/identify-users.md) for how to set this up.
|
||||
|
||||
Set up a reverse proxy (recommended)
|
||||
|
||||
We recommend [setting up a reverse proxy](/docs/advanced/proxy.md), so that events are less likely to be intercepted by tracking blockers.
|
||||
|
||||
We have our [own managed reverse proxy service](/docs/advanced/proxy/managed-reverse-proxy.md), which is free for all PostHog Cloud users, routes through our infrastructure, and makes setting up your proxy easy.
|
||||
|
||||
If you don't want to use our managed service then there are several other options for creating a reverse proxy, including using [Cloudflare](/docs/advanced/proxy/cloudflare.md), [AWS Cloudfront](/docs/advanced/proxy/cloudfront.md), and [Vercel](/docs/advanced/proxy/vercel.md).
|
||||
|
||||
Grouping products in one project (recommended)
|
||||
|
||||
If you have multiple customer-facing products (e.g. a marketing website + mobile app + web app), it's best to install PostHog on them all and [group them in one project](/docs/settings/projects.md).
|
||||
|
||||
This makes it possible to track users across their entire journey (e.g. from visiting your marketing website to signing up for your product), or how they use your product across multiple platforms.
|
||||
|
||||
Add IPs to Firewall/WAF allowlists (recommended)
|
||||
|
||||
For certain features like [heatmaps](/docs/toolbar/heatmaps.md), your Web Application Firewall (WAF) may be blocking PostHog’s requests to your site. Add these IP addresses to your WAF allowlist or rules to let PostHog access your site.
|
||||
|
||||
**EU**: `3.75.65.221`, `18.197.246.42`, `3.120.223.253`
|
||||
|
||||
**US**: `44.205.89.55`, `52.4.194.122`, `44.208.188.173`
|
||||
|
||||
These are public, stable IPs used by PostHog services (e.g., Celery tasks for snapshots).
|
||||
|
||||
## Accessing PostHog
|
||||
|
||||
Once initialized in `instrumentation-client.js|ts`, import `posthog` from `posthog-js` anywhere and call the methods you need on the `posthog` object.
|
||||
|
||||
JavaScript
|
||||
|
||||
|
||||
|
||||
```javascript
|
||||
'use client'
|
||||
import posthog from 'posthog-js'
|
||||
export default function Home() {
|
||||
return (
|
||||
<div>
|
||||
<button onClick={() => posthog.capture('test_event')}>
|
||||
Click me for an event
|
||||
</button>
|
||||
</div>
|
||||
);
|
||||
}
|
||||
```
|
||||
|
||||
### Using React hooks
|
||||
|
||||
The [React feature flag hooks](/docs/libraries/react.md#feature-flags) work automatically when PostHog is initialized via `instrumentation-client.ts`. The hooks use the initialized posthog-js singleton:
|
||||
|
||||
JavaScript
|
||||
|
||||
|
||||
|
||||
```javascript
|
||||
'use client'
|
||||
import { useFeatureFlagEnabled } from 'posthog-js/react'
|
||||
export default function FeatureComponent() {
|
||||
const showNewFeature = useFeatureFlagEnabled('new-feature')
|
||||
return showNewFeature ? <NewFeature /> : <OldFeature />
|
||||
}
|
||||
```
|
||||
|
||||
### Usage
|
||||
|
||||
See the [React SDK docs](/docs/libraries/react.md) for examples of how to use:
|
||||
|
||||
- [`posthog-js` functions like custom event capture, user identification, and more.](/docs/libraries/react.md#using-posthog-js-functions)
|
||||
- [Feature flags including variants and payloads.](/docs/libraries/react.md#feature-flags)
|
||||
|
||||
You can also read [the full `posthog-js` documentation](/docs/libraries/js/features.md) for all the usable functions.
|
||||
|
||||
## Server-side analytics
|
||||
|
||||
Next.js enables you to both server-side render pages and add server-side functionality. To integrate PostHog into your Next.js app on the server-side, you can use the [Node SDK](/docs/libraries/node.md).
|
||||
|
||||
First, install the `posthog-node` library:
|
||||
|
||||
|
||||
|
||||
### npm
|
||||
|
||||
```bash
|
||||
npm install posthog-node --save
|
||||
```
|
||||
|
||||
### Yarn
|
||||
|
||||
```bash
|
||||
yarn add posthog-node
|
||||
```
|
||||
|
||||
### pnpm
|
||||
|
||||
```bash
|
||||
pnpm add posthog-node
|
||||
```
|
||||
|
||||
### Bun
|
||||
|
||||
```bash
|
||||
bun add posthog-node
|
||||
```
|
||||
|
||||
### Router-specific instructions
|
||||
|
||||
## App router
|
||||
|
||||
For the app router, we can initialize the `posthog-node` SDK once with a `PostHogClient` function, and import it into files.
|
||||
|
||||
This enables us to send events and fetch data from PostHog on the server – without making client-side requests.
|
||||
|
||||
JavaScript
|
||||
|
||||
|
||||
|
||||
```javascript
|
||||
// app/posthog.js
|
||||
import { PostHog } from 'posthog-node'
|
||||
export default function PostHogClient() {
|
||||
const posthogClient = new PostHog(process.env.NEXT_PUBLIC_POSTHOG_TOKEN, {
|
||||
host: process.env.NEXT_PUBLIC_POSTHOG_HOST,
|
||||
flushAt: 1,
|
||||
flushInterval: 0
|
||||
})
|
||||
return posthogClient
|
||||
}
|
||||
```
|
||||
|
||||
> **Note:** Because server-side functions in Next.js can be short-lived, we set `flushAt` to `1` and `flushInterval` to `0`.
|
||||
>
|
||||
> - `flushAt` sets how many capture calls to accumulate before flushing the queue (in one batch).
|
||||
> - `flushInterval` sets how many milliseconds we should wait before flushing the queue. Setting them to the lowest number ensures events are sent immediately and not batched. We also need to call `await posthog.shutdown()` once done.
|
||||
|
||||
To use this client, we import it into our pages and call it with the `PostHogClient` function:
|
||||
|
||||
JavaScript
|
||||
|
||||
|
||||
|
||||
```javascript
|
||||
import Link from 'next/link'
|
||||
import PostHogClient from '../posthog'
|
||||
export default async function About() {
|
||||
const posthog = PostHogClient()
|
||||
const flags = await posthog.getAllFlags(
|
||||
'user_distinct_id' // replace with a user's distinct ID
|
||||
);
|
||||
await posthog.shutdown()
|
||||
return (
|
||||
<main>
|
||||
<h1>About</h1>
|
||||
<Link href="/">Go home</Link>
|
||||
{ flags['main-cta'] &&
|
||||
<Link href="http://posthog.com/">Go to PostHog</Link>
|
||||
}
|
||||
</main>
|
||||
)
|
||||
}
|
||||
```
|
||||
|
||||
## Pages router
|
||||
|
||||
For the pages router, we can use the `getServerSideProps` function to access PostHog on the server-side, send events, evaluate feature flags, and more.
|
||||
|
||||
For example:
|
||||
|
||||
JavaScript
|
||||
|
||||
|
||||
|
||||
```javascript
|
||||
// pages/posts/[id].js
|
||||
import { useEffect, useState } from 'react'
|
||||
import { getServerSession } from "next-auth/next"
|
||||
import { PostHog } from 'posthog-node'
|
||||
export default function Post({ post, flags }) {
|
||||
const [ctaState, setCtaState] = useState()
|
||||
useEffect(() => {
|
||||
if (flags) {
|
||||
setCtaState(flags['blog-cta'])
|
||||
}
|
||||
}, [flags]) // re-run only when the flags prop changes
|
||||
return (
|
||||
<div>
|
||||
<h1>{post.title}</h1>
|
||||
<p>By: {post.author}</p>
|
||||
<p>{post.content}</p>
|
||||
{ctaState &&
|
||||
<p><a href="/">Go to PostHog</a></p>
|
||||
}
|
||||
<button onClick={likePost}>Like</button>
|
||||
</div>
|
||||
)
|
||||
}
|
||||
export async function getServerSideProps(ctx) {
|
||||
const session = await getServerSession(ctx.req, ctx.res)
|
||||
let flags = null
|
||||
if (session) {
|
||||
const client = new PostHog(
|
||||
process.env.NEXT_PUBLIC_POSTHOG_TOKEN,
|
||||
{
|
||||
host: process.env.NEXT_PUBLIC_POSTHOG_HOST,
|
||||
}
|
||||
)
|
||||
flags = await client.getAllFlags(session.user.email);
|
||||
client.capture({
|
||||
distinctId: session.user.email,
|
||||
event: 'loaded blog article',
|
||||
properties: {
|
||||
$current_url: ctx.req.url,
|
||||
},
|
||||
});
|
||||
await client.shutdown()
|
||||
}
|
||||
const { posts } = await import('../../blog.json')
|
||||
const post = posts.find((post) => post.id.toString() === ctx.params.id)
|
||||
return {
|
||||
props: {
|
||||
post,
|
||||
flags
|
||||
},
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
> **Note**: Make sure to *always* call `await client.shutdown()` after sending events from the server-side. PostHog queues events into larger batches, and this call forces all batched events to be flushed immediately.
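
To make the shutdown call hard to forget, you could wrap server-side work in a helper that always flushes, even when the work throws. This is a minimal sketch: `withPostHog` is a hypothetical name, and `createClient` is assumed to be any function returning a `posthog-node` client, such as the `PostHogClient` function shown earlier.

```javascript
// Run `work` with a fresh client and always flush queued events afterwards.
async function withPostHog(createClient, work) {
  const client = createClient()
  try {
    return await work(client)
  } finally {
    // Runs on success and on error, so queued events are not lost.
    await client.shutdown()
  }
}
```

A page could then fetch flags with `await withPostHog(PostHogClient, (client) => client.getAllFlags(distinctId))` without calling `shutdown()` itself.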
|
||||
|
||||
### Server-side configuration
|
||||
|
||||
Next.js overrides the default `fetch` behavior on the server to introduce their own cache. PostHog ignores that cache by default, as this is Next.js's default behavior for any fetch call.
|
||||
|
||||
You can override that configuration when initializing PostHog, but make sure you understand the pros/cons of using Next.js's cache and that you might get cached results rather than the actual result our server would return. This is important for feature flags, for example.
|
||||
|
||||
TSX
|
||||
|
||||
|
||||
|
||||
```jsx
|
||||
posthog.init(process.env.NEXT_PUBLIC_POSTHOG_TOKEN, {
|
||||
// ... your configuration
|
||||
fetch_options: {
|
||||
cache: 'force-cache', // Use Next.js cache
|
||||
next_options: { // Passed to the `next` option for `fetch`
|
||||
revalidate: 60, // Cache for 60 seconds
|
||||
tags: ['posthog'], // Can be used with Next.js `revalidateTag` function
|
||||
},
|
||||
}
|
||||
})
|
||||
```
|
||||
|
||||
## Configuring a reverse proxy to PostHog
|
||||
|
||||
To improve the reliability of client-side tracking and make requests less likely to be intercepted by tracking blockers, you can set up a reverse proxy in Next.js. Read more about deploying a reverse proxy using [Next.js rewrites](/docs/advanced/proxy/nextjs.md), [Next.js middleware](/docs/advanced/proxy/nextjs-middleware.md), and [Vercel rewrites](/docs/advanced/proxy/vercel.md).
|
||||
|
||||
## Further reading
|
||||
|
||||
- [How to set up Next.js analytics, feature flags, and more](/tutorials/nextjs-analytics.md)
|
||||
- [How to set up Next.js pages router analytics, feature flags, and more](/tutorials/nextjs-pages-analytics.md)
|
||||
- [How to set up Next.js A/B tests](/tutorials/nextjs-ab-tests.md)
|
||||
|
||||
|
||||
@@ -16,3 +16,6 @@ URL="http://localhost:3000"
|
||||
|
||||
# Default locale of the apps, can be overridden separately in each app.
|
||||
DEFAULT_LOCALE="en"
|
||||
|
||||
# Shared secret for CLI sync JWT signing (HS256) — must match between broker and web app
|
||||
CLI_SYNC_SECRET="<your-cli-sync-secret>"
|
||||
|
||||
71
.github/workflows/deploy-web.yml
vendored
Normal file
@@ -0,0 +1,71 @@
|
||||
name: Deploy claudemesh-web
|
||||
|
||||
# Triggers a Coolify deploy of the apps/web Next.js app on the OVH VPS.
|
||||
# Coolify only auto-deploys the broker (it watches the gitea-vps mirror);
|
||||
# the web app needs an explicit poke. This workflow is the poke.
|
||||
#
|
||||
# The Coolify dashboard is bound to a Tailscale-only address
|
||||
# (100.122.34.28:8000), so the runner first joins the tailnet via
# an OAuth-issued ephemeral node, then hits Coolify's deploy API.
#
# Path filter: redeploy on changes to the web app, the API package
# (bundled into the web build), or any shared package the web app
# transpiles. Anything else (broker-only, cli-only, docs) skips it.

on:
  push:
    branches: [main]
    paths:
      - "apps/web/**"
      - "packages/api/**"
      - "packages/db/**"
      - "packages/auth/**"
      - "packages/ui/**"
      - "packages/i18n/**"
      - "packages/shared/**"
      - "packages/email/**"
      - "packages/billing/**"
      - "packages/storage/**"
      - "packages/monitoring-web/**"
      - "pnpm-lock.yaml"
      - ".github/workflows/deploy-web.yml"
  workflow_dispatch:

# Coalesce rapid pushes — only one deploy in flight at a time, and
# if a newer push lands while one is queued, the older one is
# cancelled. Avoids the "5 commits, 5 deploys" stampede.
concurrency:
  group: deploy-web
  cancel-in-progress: true

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Connect to Tailscale
        uses: tailscale/github-action@v3
        with:
          oauth-client-id: ${{ secrets.TS_OAUTH_CLIENT_ID }}
          oauth-secret: ${{ secrets.TS_OAUTH_SECRET }}
          tags: tag:ci

      - name: Trigger Coolify deploy
        env:
          COOLIFY_TOKEN: ${{ secrets.COOLIFY_TOKEN }}
          APP_UUID: p68x1e3k4xmrjmblca5ybe09
        run: |
          if [ -z "$COOLIFY_TOKEN" ]; then
            echo "::error::COOLIFY_TOKEN secret is not set"
            exit 1
          fi
          response=$(curl -sS -w "\n%{http_code}" -X GET \
            "http://100.122.34.28:8000/api/v1/deploy?uuid=${APP_UUID}&force=true" \
            -H "Authorization: Bearer ${COOLIFY_TOKEN}")
          status=$(echo "$response" | tail -n1)
          body=$(echo "$response" | sed '$d')
          echo "HTTP $status"
          echo "$body"
          if [ "$status" != "200" ]; then
            echo "::error::Coolify returned HTTP $status"
            exit 1
          fi
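The trigger step above asks curl to append the HTTP status on its own line (`-w "\n%{http_code}"`) and then separates it from the body with `tail` and `sed`. The following is a minimal sketch of that split, runnable without a Coolify instance. The helper names `http_status` and `http_body` and the sample payload are made up for illustration and do not appear in the workflow.

```shell
# Sketch only: helper names and the sample payload are hypothetical.

# The last line of the combined output is the status code that
# curl's -w "\n%{http_code}" appends.
http_status() { printf '%s\n' "$1" | tail -n 1; }

# Everything before the last line is the response body.
http_body() { printf '%s\n' "$1" | sed '$d'; }

# Simulated curl output: a two-line body followed by the status line.
response=$(printf '%s\n' '{"ok": true,' ' "note": "sample"}' '200')

status=$(http_status "$response")
body=$(http_body "$response")

echo "HTTP $status"
echo "$body"
```

Splitting on the last line keeps multi-line bodies intact, which is presumably why the workflow uses `sed '$d'` rather than `head -n1`.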
.github/workflows/release-cli.yml (new file, 115 lines changed)
@@ -0,0 +1,115 @@
name: Release CLI binaries

# Fires on any push of a tag shaped like `cli-v1.2.3` (prerelease `-alpha.N` OK).
# Builds self-contained `bun build --compile` binaries for darwin/linux/win
# (x64 + arm64) and attaches them to a GitHub Release. The `install.sh`
# fallback path curls these when Node isn't available.
#
# Publishing to npm is still a manual step (pnpm publish from apps/cli) —
# this workflow only handles binary distribution.

on:
  push:
    tags:
      - "cli-v*"
  workflow_dispatch:
    inputs:
      tag:
        description: "Release tag to build (e.g. cli-v1.0.0-alpha.28)"
        required: true

permissions:
  contents: write # to upload release assets

jobs:
  build:
    name: ${{ matrix.target }}
    runs-on: ${{ matrix.runner }}
    strategy:
      fail-fast: false
      matrix:
        include:
          - { target: darwin-x64, bun_target: bun-darwin-x64, runner: macos-latest, ext: "" }
          - { target: darwin-arm64, bun_target: bun-darwin-arm64, runner: macos-latest, ext: "" }
          - { target: linux-x64, bun_target: bun-linux-x64, runner: ubuntu-latest, ext: "" }
          - { target: linux-arm64, bun_target: bun-linux-arm64, runner: ubuntu-latest, ext: "" }
          - { target: windows-x64, bun_target: bun-windows-x64, runner: windows-latest, ext: ".exe" }

    steps:
      - uses: actions/checkout@v4

      - uses: oven-sh/setup-bun@v2
        with:
          bun-version: "1.2"

      - uses: pnpm/action-setup@v4

      - name: Install workspace deps
        run: pnpm install --frozen-lockfile --ignore-scripts

      - name: Compile binary
        working-directory: apps/cli
        shell: bash
        run: |
          mkdir -p dist/bin
          VERSION=$(node -p "require('./package.json').version")
          bun build --compile --minify \
            --target=${{ matrix.bun_target }} \
            --define "__CLAUDEMESH_VERSION__=\"$VERSION\"" \
            src/entrypoints/cli.ts \
            --outfile dist/bin/claudemesh-${{ matrix.target }}${{ matrix.ext }}

      # Smoke test only on native arch. macos-latest runners are ARM64 (Apple
      # Silicon); ubuntu-latest is x64. Cross-compiled binaries can't execute
      # on the build host, so skip them.
      - name: Smoke test (native only)
        if: matrix.target == 'darwin-arm64' || matrix.target == 'linux-x64'
        working-directory: apps/cli
        run: |
          ./dist/bin/claudemesh-${{ matrix.target }} --version
          ./dist/bin/claudemesh-${{ matrix.target }} --help | head -5

      - name: Upload artefact
        uses: actions/upload-artifact@v4
        with:
          name: claudemesh-${{ matrix.target }}
          path: apps/cli/dist/bin/claudemesh-${{ matrix.target }}${{ matrix.ext }}

  release:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          path: artifacts

      - name: Stage binaries
        run: |
          mkdir -p release
          find artifacts -type f -exec cp {} release/ \;
          cd release && sha256sum claudemesh-* > SHA256SUMS

      - name: Publish release
        uses: softprops/action-gh-release@v2
        with:
          tag_name: ${{ github.ref_name }}
          files: |
            release/claudemesh-*
            release/SHA256SUMS
          generate_release_notes: true
          fail_on_unmatched_files: true

  update-homebrew:
    needs: release
    runs-on: macos-latest
    if: github.event_name == 'push' && !contains(github.ref_name, 'alpha')
    steps:
      - name: Bump Homebrew tap formula
        env:
          HOMEBREW_GITHUB_API_TOKEN: ${{ secrets.HOMEBREW_TAP_TOKEN }}
        run: |
          brew tap alezmad/claudemesh || true
          brew bump-formula-pr --no-browse --no-fork \
            --tag "${{ github.ref_name }}" \
            --revision "${{ github.sha }}" \
            alezmad/claudemesh/claudemesh || echo "formula bump skipped (no tap yet)"
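The header comment of this workflow describes tags shaped like `cli-v1.2.3` with an optional `-alpha.N` suffix, while the trigger itself is the looser glob `cli-v*` and the Homebrew job skips any tag containing `alpha`. A pre-push check along the following lines could catch malformed tags before they start a build. This is only a sketch: the function names and the regular expression are assumptions, not part of the repository.

```shell
# Sketch only: function names and the regex are hypothetical; the workflow
# itself triggers on the looser glob "cli-v*".

# Succeeds for cli-v<major>.<minor>.<patch> with an optional -alpha.<n> suffix.
is_cli_release_tag() {
  printf '%s\n' "$1" | grep -Eq '^cli-v[0-9]+\.[0-9]+\.[0-9]+(-alpha\.[0-9]+)?$'
}

# Approximates the update-homebrew condition: any tag containing "alpha"
# is treated as a prerelease, so the formula bump would be skipped.
is_prerelease_tag() {
  case "$1" in
    *alpha*) return 0 ;;
    *) return 1 ;;
  esac
}

for tag in cli-v1.2.3 cli-v1.0.0-alpha.28 cli-vnext v1.2.3; do
  if is_cli_release_tag "$tag"; then
    if is_prerelease_tag "$tag"; then kind="prerelease"; else kind="stable"; fi
    echo "$tag: valid ($kind)"
  else
    echo "$tag: rejected"
  fi
done
```

Note that `cli-vnext` would still match the workflow's `cli-v*` trigger; the stricter pattern here is a local convenience, not something the workflow enforces.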
.github/workflows/tests.yml (9 lines changed)
@@ -45,3 +45,12 @@ jobs:

      - name: 🧪 Test
        run: pnpm run test

      - name: 📦 Build CLI bundle (check size budget)
        working-directory: apps/cli
        run: pnpm run build

      - name: 🔧 CLI smoke — --version + --help
        run: |
          node apps/cli/dist/entrypoints/cli.js --version
          node apps/cli/dist/entrypoints/cli.js --help | head -5

.gitignore (4 lines changed)
@@ -45,6 +45,9 @@ yarn-error.log*
# local env files
.env*.local

# secrets
.cli_sync_secret

# vercel
.vercel

@@ -72,3 +75,4 @@ dist/
apps/web/payload.db
apps/web/public/media/*
!apps/web/public/media/.gitkeep
.env.local

@@ -1,3 +1,3 @@
{
"geminiApiKey": "AIzaSyBblLRkmypvabqI-xJ_b2KPVA9Pswtav0M"
"geminiApiKey": "AIzaSyDJEyW5Q_OT1X4iGO_5jdVnq1BNANR7s2k"
}
CLAUDE.md (new file, 35 lines changed)
@@ -0,0 +1,35 @@
# claudemesh

Peer mesh for Claude Code sessions. Broker + CLI + MCP server.

## Structure

- `apps/broker/` — WebSocket broker (Bun + Drizzle + PostgreSQL), deployed at `wss://ic.claudemesh.com/ws`. Runs drizzle migrations on startup under pg_advisory_lock.
- `apps/cli/` — `claudemesh-cli` npm package (CLI + MCP server). Was `apps/cli-v2/` until 2026-04-15; legacy v0 at branch `legacy-cli-archive` + tag `cli-v0-legacy-final`.
- `apps/web/` — Marketing site + dashboard at claudemesh.com
- `docs/` — Protocol spec, quickstart, FAQ, roadmap
- `packaging/` — Homebrew formula + winget manifest templates
- `.github/workflows/release-cli.yml` — tag `cli-v*` → 5 platform binaries → GitHub Release with SHA256SUMS

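The release workflow writes `SHA256SUMS` with `sha256sum claudemesh-* > SHA256SUMS`, so a downloaded binary can be checked against it with `sha256sum -c`. The sketch below shows that round trip on placeholder files instead of real release assets. `verify_release` is a hypothetical helper, the file contents are dummies, and GNU coreutils `sha256sum` is assumed (stock macOS ships `shasum -a 256` instead).

```shell
# Sketch only: file contents are placeholders and verify_release is a
# hypothetical helper. Assumes GNU coreutils sha256sum is installed.
set -eu

# SHA256SUMS lists bare file names, so the check runs inside the directory.
verify_release() {
  (cd "$1" && sha256sum -c SHA256SUMS >/dev/null 2>&1)
}

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Stand-ins for two of the release binaries.
printf 'fake linux binary\n' > "$workdir/claudemesh-linux-x64"
printf 'fake darwin binary\n' > "$workdir/claudemesh-darwin-arm64"

# Same command shape as the "Stage binaries" step in the workflow.
(cd "$workdir" && sha256sum claudemesh-* > SHA256SUMS)

if verify_release "$workdir"; then echo "checksums match"; fi

# Modify one file after the sums were recorded; the check should now fail.
printf 'tampered\n' >> "$workdir/claudemesh-linux-x64"
if ! verify_release "$workdir"; then echo "tampering detected"; fi
```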
## Key docs

- `SPEC.md` — What claudemesh is, protocol, crypto, wire format
- `docs/protocol.md` — Wire protocol reference
- `docs/roadmap.md` — Public roadmap (shipped + planned)
- `docs/vision-20260407.md` — Internal feature brainstorm with 19 ideas across 3 tiers, effort estimates, and build order

## Deploy

- **Broker:** `git push gitea-vps main` triggers Coolify auto-deploy via the gitea webhook. Pending migrations apply automatically on startup.
- **Web:** Coolify on the OVH VPS (`claudemesh.com` resolves to `135.125.191.245`, NOT Vercel — the `apps/web/Dockerfile` is what Coolify builds). Auto-deploys via `.github/workflows/deploy-web.yml` on push to `main` when paths under `apps/web/**` or `packages/{api,db,auth,ui,i18n,shared,email,billing,storage,monitoring-web}/**` change. The workflow joins the tailnet via Tailscale OAuth, then hits the Coolify API.
- **Manual deploy** (if the workflow is broken or the path filter missed something) — Coolify dashboard at `http://100.122.34.28:8000` (Tailscale only). Token in `COOLIFY_TOKEN` repo secret. App UUIDs: broker `mcn8m74tbxfxbplmyb40b2ia`, web `p68x1e3k4xmrjmblca5ybe09`.
- **CLI:**
  - npm: `cd apps/cli && npm publish --access public --no-git-checks --ignore-scripts`
  - Binaries: `git tag cli-v<version> && git push github cli-v<version>` — workflow builds 5 platforms.

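To judge whether a change will trigger the web auto-deploy or needs the manual route, the path filter described above can be approximated locally. The sketch below mirrors the `paths:` list from `.github/workflows/deploy-web.yml`, which also includes `pnpm-lock.yaml` and the workflow file itself. `triggers_web_deploy` is a hypothetical helper and some of the sample paths are invented; GitHub Actions evaluates the real filter, and shell `case` patterns only approximate its glob rules.

```shell
# Sketch only: triggers_web_deploy is a hypothetical helper that approximates
# the workflow's paths filter with shell patterns.

triggers_web_deploy() {
  case "$1" in
    apps/web/*) return 0 ;;
    packages/api/*|packages/db/*|packages/auth/*|packages/ui/*) return 0 ;;
    packages/i18n/*|packages/shared/*|packages/email/*) return 0 ;;
    packages/billing/*|packages/storage/*|packages/monitoring-web/*) return 0 ;;
    pnpm-lock.yaml|.github/workflows/deploy-web.yml) return 0 ;;
    *) return 1 ;;
  esac
}

# In a case pattern, * also matches "/", so nested paths behave like "**".
for path in apps/web/src/app/page.tsx packages/db/schema.ts \
            apps/broker/src/index.ts docs/roadmap.md; do
  if triggers_web_deploy "$path"; then
    echo "deploy  $path"
  else
    echo "skip    $path"
  fi
done
```

In practice the list of changed files could come from something like `git diff --name-only origin/main...HEAD`, fed to the helper one path at a time.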
## Dev

- Monorepo: pnpm workspaces + Turborepo
- Broker dev: `cd apps/broker && bun --hot src/index.ts`
- CLI build: `cd apps/cli && pnpm build` (Bun bundler)
- CLI link for local testing: `cd apps/cli && npm link`