claudemesh/.artifacts/backlog/2026-04-11-v2-parity-test-plan.md
Alejandro Gutiérrez ee12510ef1


claudemesh v2 — Pass 1 Parity Test Plan

Status: backlog reference
Created: 2026-04-11
Purpose: Concrete test-by-test plan that verifies v2 behaves identically to v1 for every feature in the 2026-04-11-v1-feature-inventory.md regression list. Green parity suite = v2 Pass 1 ready to ship. Red parity suite = keep working.
Companion to: 2026-04-11-v1-feature-inventory.md (§12 of that inventory is the input to every test below)


1. Why this exists

v2 Pass 1 is a refactor: new folder structure, facade pattern, CLI user flows. The broker is unchanged, the backend services are unchanged, and every v1 tool (the 79 MCP tools + 85 broker WS message types) must keep working. The only safe way to prove that is to run v1 and v2 side by side and assert that they produce identical behavior.

v2 deliberately drops some v1 CLI commands (not tools — CLI subcommands exposed to end users). Because v2 has no users yet and no migration constraints, the v2 CLI picks the best command surface without backwards compatibility. Commands dropped from v2:

| v1 command | Dropped in v2 because | Replacement |
| --- | --- | --- |
| claudemesh launch [args] | Redundant with bare claudemesh. The subcommand word adds nothing. | claudemesh (bare, with flags) |
| claudemesh disconnect telegram | Bridge teardown is done inside Telegram (/revoke) or by leaving the mesh; CLI wrapper is cosmetic. | In-Telegram revoke; or leave the mesh |

Preserved with UX upgrade: claudemesh connect telegram stays in v2 but is rewritten as an interactive wizard — mesh picker, QR code, t.me deep link, waits for bridge confirmation. See Pass 1 spec §5.7. v1's silent auto-pick of config.meshes[0] is a multi-mesh footgun and is replaced.

The 79 MCP tools and 85 WS message types are all preserved. Only the thin wrapper CLI subcommands that translated them are dropped. This is a conscious v2 decision to simplify the CLI surface, not a regression.

v1 has thin test coverage (2 CLI test files for ~12 k LOC, ~10 broker tests mostly covering crypto primitives). That's not enough to catch regressions during a refactor. We need a parity suite — a test layer that doesn't trust v2 to describe its own correctness, but compares it to v1 directly.

The parity suite is the acceptance criteria. v2 Pass 1 ships when it's green.


2. The seven test layers

| # | Layer | Purpose | Compared against |
| --- | --- | --- | --- |
| 1 | Parity tests | Behavioral equivalence on the inventory §12 regression list | v1 CLI subprocess |
| 2 | WS contract tests | Wire-format compatibility: v2 must speak the broker's v1 protocol | captured v1 envelopes |
| 3 | MCP tool handler tests | Every one of the 79 tools dispatches identically | v1 handler output |
| 4 | End-to-end smoke tests | Full user journeys against a real broker | self-consistent e2e flow |
| 5 | JSON output golden tests | --json output shape is stable for script consumers | v1 --json captures |
| 6 | Facade unit tests | Boundary invariants: no token leaks, Zod validation works | facade contract spec |
| 7 | Port-forwarded v1 tests | Existing crypto + broker tests still pass | unchanged |

Tests run in parallel where possible. Layers 1 and 2 are the most load-bearing — they're the proof that v2 doesn't break existing users.


3. Layer 1 — Parity tests (inventory §12 driven)

One test file per regression check in the inventory's "must preserve" list. Every file spawns both v1 and v2 CLIs against the same mock broker, captures their behavior, and asserts match on the dimensions that matter (exit code, stdout JSON fields, broker-side DB state, WS messages sent).
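
As a rough illustration of "assert match on the dimensions that matter", the comparison step could look like the sketch below. The `RunCapture` shape and the `diffCaptures` helper are hypothetical names invented for this example, not part of the v1 or v2 codebase, and broker-side DB state is left out for brevity.

```typescript
// Hypothetical capture of one CLI run; field names are illustrative only.
interface RunCapture {
  exitCode: number;
  stdoutJson: Record<string, unknown>;
  wsMessageTypes: string[]; // types of the WS messages the CLI sent, in order
}

// Returns human-readable mismatches; an empty array means the runs agree.
function diffCaptures(v1: RunCapture, v2: RunCapture, jsonFields: string[]): string[] {
  const mismatches: string[] = [];
  if (v1.exitCode !== v2.exitCode) {
    mismatches.push(`exitCode: v1=${v1.exitCode} v2=${v2.exitCode}`);
  }
  // Compare only the listed stdout fields, so cosmetic extras do not fail parity.
  for (const field of jsonFields) {
    const a = JSON.stringify(v1.stdoutJson[field]);
    const b = JSON.stringify(v2.stdoutJson[field]);
    if (a !== b) mismatches.push(`stdout.${field}: v1=${a} v2=${b}`);
  }
  const w1 = v1.wsMessageTypes.join(',');
  const w2 = v2.wsMessageTypes.join(',');
  if (w1 !== w2) mismatches.push(`ws: v1=[${w1}] v2=[${w2}]`);
  return mismatches;
}

const v1Run: RunCapture = {
  exitCode: 0,
  stdoutJson: { ok: true, mesh: 'demo' },
  wsMessageTypes: ['hello', 'send'],
};
const v2Run: RunCapture = {
  exitCode: 0,
  stdoutJson: { ok: true, mesh: 'demo', extra: 1 },
  wsMessageTypes: ['hello', 'send'],
};
console.log(diffCaptures(v1Run, v2Run, ['ok', 'mesh']));
```

Returning a list of mismatches rather than a boolean keeps a failing parity test readable: the assertion message can name the exact dimension that diverged.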

File layout:

apps/cli-v2/tests/parity/
├── first-run/
├── session-lifecycle/
├── messaging/
├── crypto/
├── tools/        (this one is covered by layer 3)
├── backends/
├── scheduled/
├── telegram/
├── dashboard-sync/
├── webhooks/
└── doctor/

3.1 First-run parity (inventory §12.1)

Test file Asserts
first-run/install.test.ts claudemesh install writes identical entries to ~/.claude.json + ~/.claude/settings.json
first-run/install-no-hooks.test.ts claudemesh install --no-hooks registers only the MCP server, not the status hooks
first-run/uninstall.test.ts claudemesh uninstall removes everything install added, leaving other config entries untouched
first-run/join-v1-invite.test.ts claudemesh join <v1-invite-url> enrolls using legacy invite format
first-run/join-v2-invite.test.ts claudemesh join <v2-invite-url> enrolls using short-code + signed payload
first-run/bare-first-run-welcome.test.ts claudemesh on a fresh machine (no config) shows the welcome wizard
first-run/bare-returning-user-launches.test.ts claudemesh on a machine with config launches a session directly (no wizard)
first-run/bare-with-flags-launches.test.ts claudemesh --resume abc, claudemesh --mesh foo -y, claudemesh --name Alexis all dispatch to the launch handler
first-run/launch-word-unknown-command.test.ts claudemesh launch returns exit code 3 (invalid args) with a clear "Unknown command" error. The word launch is deliberately not a subcommand in v2.

3.2 Session lifecycle parity (inventory §12.2)

Test file Asserts
session-lifecycle/status-hook-start.test.ts Running claudemesh hook with a Claude Code session_start payload posts to /hook/set-status with status: working, source: hook
session-lifecycle/status-hook-stop.test.ts session_stop payload → /hook/set-status with status: idle, source: hook
session-lifecycle/status-priority.test.ts When a hook source status is fresh, a subsequent manual status is rejected (priority gating)
session-lifecycle/status-ttl-sweep.test.ts After WORKING_TTL_MS, a stale working status decays to idle via the sweeper
session-lifecycle/list-peers-freshness.test.ts claudemesh peers marks peers with stale hook source as degraded
session-lifecycle/multi-mesh-status.test.ts Status updates in mesh A don't affect peer status in mesh B
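
The priority-gating and TTL-sweep rows describe broker-side rules that the tests have to model. The sketch below is one possible reading of them, not the broker's actual code: the `PeerStatus` shape, `HOOK_FRESH_MS`, and the numeric value given to `WORKING_TTL_MS` are all assumptions made for illustration.

```typescript
type Status = 'working' | 'idle';
type Source = 'hook' | 'manual';
interface PeerStatus { status: Status; source: Source; updatedAt: number }

const WORKING_TTL_MS = 60_000; // placeholder value; the real constant lives in the broker
const HOOK_FRESH_MS = 30_000;  // hypothetical freshness window for hook-sourced status

// Priority gating: a manual update loses to a hook status that is still fresh.
function applyUpdate(current: PeerStatus, incoming: PeerStatus, now: number): PeerStatus {
  const hookIsFresh = current.source === 'hook' && now - current.updatedAt < HOOK_FRESH_MS;
  if (incoming.source === 'manual' && hookIsFresh) return current;
  return incoming;
}

// TTL sweep: a working status that has not been refreshed decays to idle.
function sweep(current: PeerStatus, now: number): PeerStatus {
  const stale = current.status === 'working' && now - current.updatedAt > WORKING_TTL_MS;
  return stale ? { ...current, status: 'idle', updatedAt: now } : current;
}

const hookWorking: PeerStatus = { status: 'working', source: 'hook', updatedAt: 0 };
const manualIdle: PeerStatus = { status: 'idle', source: 'manual', updatedAt: 1_000 };
console.log(applyUpdate(hookWorking, manualIdle, 1_000).status); // manual update is gated
console.log(sweep(hookWorking, 61_000).status);                  // stale working decays
```

Passing `now` explicitly instead of reading the clock inside the functions lets a parity test advance time deterministically.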

3.3 Messaging parity (inventory §12.3)

Test file Asserts
messaging/send-priority-now.test.ts send --priority now delivers immediately, bypassing busy-gate
messaging/send-priority-next.test.ts send --priority next waits for the recipient to be idle before delivery
messaging/send-priority-low.test.ts send --priority low is pull-only (recipient must check_messages or inbox)
messaging/send-to-group.test.ts send @frontend <msg> fans out to all group members, not individual peers
messaging/send-broadcast.test.ts send "*" <msg> broadcasts to all connected peers in the mesh
messaging/offline-queue-drain.test.ts Messages sent to an offline peer persist in mesh.message_queue and drain when the peer reconnects
messaging/duplicate-delivery-prevention.test.ts Sending the same messageId twice does not double-deliver
messaging/message-status-lookup.test.ts message_status returns correct delivery state: queued / delivered / acked
messaging/inbox-drain.test.ts claudemesh inbox drains and prints pending messages; second run shows empty
messaging/inbox-wait.test.ts claudemesh inbox --wait 5 blocks for broker delivery up to 5s, returns early on arrival
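
The three priority levels and the duplicate-delivery row can be summarized as a small decision function plus a messageId guard. This is a simplified sketch of the semantics in the table above; the type names, the `Decision` labels, and the `DeliveryLog` class are hypothetical and do not come from the broker source.

```typescript
type Priority = 'now' | 'next' | 'low';
type RecipientState = 'idle' | 'working' | 'offline';
type Decision = 'push' | 'queue-until-idle' | 'queue-until-pulled' | 'queue-until-reconnect';

function decideDelivery(priority: Priority, recipient: RecipientState): Decision {
  // Offline peers always go through the persistent queue (assumed to apply to every priority).
  if (recipient === 'offline') return 'queue-until-reconnect';
  if (priority === 'now') return 'push'; // bypasses the busy-gate
  if (priority === 'next') return recipient === 'idle' ? 'push' : 'queue-until-idle';
  return 'queue-until-pulled'; // low: recipient must check_messages or inbox
}

// Duplicate-delivery guard keyed on messageId.
class DeliveryLog {
  private readonly seen = new Set<string>();

  // Returns true the first time a messageId is seen, false on repeats.
  markDelivered(messageId: string): boolean {
    if (this.seen.has(messageId)) return false;
    this.seen.add(messageId);
    return true;
  }
}

const log = new DeliveryLog();
console.log(decideDelivery('now', 'working'));  // delivered despite busy recipient
console.log(decideDelivery('next', 'working')); // held until the recipient is idle
console.log(log.markDelivered('m-1'), log.markDelivered('m-1'));
```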

3.4 Cryptographic integrity parity (inventory §12.4)

Test file Asserts
crypto/keypair-perms.test.ts Generated keypairs at ~/.claudemesh/keys/<mesh>.key are mode 0600, parent dir 0700
crypto/keypair-roundtrip.test.ts Keypair generation + persistence + reload produces the same public key
crypto/hello-sig-verification.test.ts Valid Ed25519 hello signatures pass; altered timestamps are rejected as replay
crypto/envelope-roundtrip.test.ts send_message ciphertext decrypts back to original on the recipient side
crypto/file-encrypt.test.ts share_file with to: <peer> produces AES-GCM ciphertext + wrapped symmetric key in mesh.file_key
crypto/file-decrypt.test.ts Recipient downloads + decrypts, content matches original
crypto/grant-access-rewrap.test.ts grant_file_access adds a new file_key row for the additional recipient, re-wrapping the same symmetric key
crypto/invite-v2-signature.test.ts v2 invite payloads pass Ed25519 signature verification; tampered payloads fail
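
For the hello-signature row, a minimal sketch of Ed25519 signing with a freshness check, using Node's built-in crypto module. The signed payload layout (`memberId:timestamp`) and the 60-second `MAX_SKEW_MS` window are assumptions for illustration; the real hello format is defined by the broker.

```typescript
import { Buffer } from 'node:buffer';
import { generateKeyPairSync, sign, verify, type KeyObject } from 'node:crypto';

const MAX_SKEW_MS = 60_000; // assumed freshness window, not taken from the broker

// Hypothetical canonical form of the signed hello payload.
function helloBytes(memberId: string, timestamp: number): Buffer {
  return Buffer.from(`${memberId}:${timestamp}`, 'utf8');
}

function signHello(privateKey: KeyObject, memberId: string, timestamp: number): Buffer {
  // Ed25519 takes no separate digest algorithm, hence the null first argument.
  return sign(null, helloBytes(memberId, timestamp), privateKey);
}

function verifyHello(
  publicKey: KeyObject,
  memberId: string,
  timestamp: number,
  signature: Buffer,
  now: number,
): boolean {
  // A stale timestamp is rejected before the signature is even checked.
  if (Math.abs(now - timestamp) > MAX_SKEW_MS) return false;
  return verify(null, helloBytes(memberId, timestamp), publicKey, signature);
}

const { publicKey, privateKey } = generateKeyPairSync('ed25519');
const t0 = 1_700_000_000_000;
const sig = signHello(privateKey, 'alice', t0);
console.log(verifyHello(publicKey, 'alice', t0, sig, t0 + 1_000));     // valid and fresh
console.log(verifyHello(publicKey, 'alice', t0 + 5, sig, t0 + 1_000)); // altered timestamp
console.log(verifyHello(publicKey, 'alice', t0, sig, t0 + 120_000));   // stale timestamp
```

In this sketch an altered timestamp fails because it no longer matches the signed bytes, while an unaltered but old timestamp fails the freshness window; a test suite would want a case for each.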

3.5 Broker backends parity (inventory §12.6)

Test file Asserts
backends/postgres-mesh-execute.test.ts mesh_execute "CREATE TABLE bugs..." creates the table in the per-mesh schema
backends/postgres-mesh-query.test.ts mesh_query "SELECT * FROM bugs" returns rows
backends/postgres-mesh-schema.test.ts mesh_schema lists the newly-created table + columns
backends/postgres-cross-mesh-isolation.test.ts Query in mesh A cannot see tables created in mesh B (schema-level isolation)
backends/neo4j-graph-execute.test.ts graph_execute "CREATE (n:Bug {id: 1})" persists a node in the per-mesh Neo4j database
backends/neo4j-graph-query.test.ts graph_query "MATCH (n:Bug) RETURN n" returns the created node
backends/qdrant-vector-store.test.ts vector_store collection=docs content=... upserts into mesh_<id>_docs collection
backends/qdrant-vector-search.test.ts vector_search collection=docs query=... returns nearest neighbors with metadata
backends/qdrant-list-collections.test.ts list_collections enumerates the mesh's collections
backends/minio-share-small-file.test.ts share_file with < 64 KB uploads and returns a fileId
backends/minio-share-large-file.test.ts share_file with 10 MB uploads in chunks and returns a fileId
backends/minio-get-file.test.ts get_file returns the content or presigned URL for download
backends/minio-delete-file.test.ts delete_file removes the file from the bucket
backends/docker-mcp-deploy.test.ts mesh_mcp_deploy with a catalog alias spawns a Docker container with the expected env + memory + network_allow
backends/docker-mcp-logs.test.ts mesh_mcp_logs returns recent stdout/stderr from a running deployment
backends/docker-mcp-undeploy.test.ts mesh_mcp_undeploy SIGTERMs the container cleanly

3.6 Scheduled messages + URL watch parity (inventory §12.7)

Test file Asserts
scheduled/one-shot-deliver-at.test.ts schedule_reminder deliver_at=<ts+5s> fires at the target timestamp
scheduled/one-shot-in-seconds.test.ts schedule_reminder in_seconds=5 fires 5 seconds after submission
scheduled/cron-recurring.test.ts schedule_reminder cron="*/1 * * * *" fires every minute
scheduled/persist-across-restart.test.ts Pending reminders survive a broker restart (re-registered from mesh.scheduled_message table)
scheduled/list-cancel.test.ts list_scheduled shows pending; cancel_scheduled <id> prevents delivery
scheduled/url-watch-hash-mode.test.ts mesh_watch mode=hash detects body change via SHA-256 comparison
scheduled/url-watch-json-mode.test.ts mesh_watch mode=json extract=data.status detects value change at the jsonpath
scheduled/url-watch-status-mode.test.ts mesh_watch mode=status detects HTTP status code change
scheduled/url-watch-notify-on-match.test.ts notify_on="match:up" fires only when value equals "up"
scheduled/url-watch-persist.test.ts Active watches persist across broker restart
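
The three mesh_watch modes each reduce a response to one comparable value. The sketch below shows one way that reduction and the notify_on="match:…" rule could work; the `Snapshot` shape, the function names, the `'change'` default, and the choice to fire only on a transition into the matching value are assumptions, not the broker's implementation.

```typescript
import { createHash } from 'node:crypto';

type WatchMode = 'hash' | 'json' | 'status';
interface Snapshot { status: number; body: string }

// Reduce a response to the single value that the given mode compares between polls.
function observe(mode: WatchMode, snap: Snapshot, extract?: string): string {
  if (mode === 'status') return String(snap.status);
  if (mode === 'hash') return createHash('sha256').update(snap.body).digest('hex');
  // json mode: walk a dotted path such as "data.status".
  let value: unknown = JSON.parse(snap.body);
  for (const key of (extract ?? '').split('.').filter(Boolean)) {
    value = (value as Record<string, unknown> | null | undefined)?.[key];
  }
  return typeof value === 'string' ? value : JSON.stringify(value) ?? 'undefined';
}

// notifyOn is either 'change' (assumed default) or 'match:<value>'.
function shouldNotify(previous: string, current: string, notifyOn = 'change'): boolean {
  if (previous === current) return false;
  if (notifyOn.startsWith('match:')) return current === notifyOn.slice('match:'.length);
  return true;
}

const before: Snapshot = { status: 200, body: '{"data":{"status":"down"}}' };
const after: Snapshot = { status: 200, body: '{"data":{"status":"up"}}' };
const prev = observe('json', before, 'data.status');
const curr = observe('json', after, 'data.status');
console.log(shouldNotify(prev, curr, 'match:up'));
console.log(shouldNotify(observe('status', before), observe('status', after)));
console.log(shouldNotify(observe('hash', before), observe('hash', after)));
```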

3.7 Telegram bridge parity (inventory §12.8)

The Telegram bridge is a broker-side feature that continues to work in v2 Pass 1 because Pass 1 doesn't touch the broker. On the CLI side, v2 keeps claudemesh connect telegram as the interactive wizard described in §1 and drops only claudemesh disconnect telegram, because (a) there are no users to migrate and (b) teardown is done inside Telegram (/revoke) or by leaving the mesh.

The wizard is covered by the connect-wizard parity rows below. Broker-side bridge behavior is verified by e2e tests (§6) that call the broker's POST /tg/token endpoint directly and simulate inbound Telegram webhook payloads. Outbound routing via send_message(to: "tg:<username>") is covered by the remaining parity rows.

| Test file | Layer | Asserts |
| --- | --- | --- |
| telegram/connect-wizard-mesh-picker.test.ts | parity | claudemesh connect telegram with >1 joined mesh shows the Ink mesh picker (v1 silently picked mesh[0]; v2 is explicit) |
| telegram/connect-wizard-single-mesh.test.ts | parity | With exactly one joined mesh, the wizard skips the picker and proceeds directly to token request |
| telegram/connect-wizard-zero-mesh.test.ts | parity | With zero joined meshes, exits with error code 5 and "run claudemesh join first" |
| telegram/connect-wizard-happy-path.test.ts | parity | Mock broker returns {token, deepLink}; wizard renders QR + link; simulated telegram_bridge_connected push triggers success message |
| telegram/connect-wizard-poll-fallback.test.ts | parity | When the broker does not emit a push event, the wizard falls back to polling GET /mesh/:id/members every 2s until a tg:* entry appears |
| telegram/connect-wizard-rate-limited.test.ts | parity | Broker 429 response is caught and rendered as "too many Telegram tokens in the last hour" instead of raw HTTP |
| telegram/connect-wizard-link-flag.test.ts | parity | --link flag prints only the deep link, no QR, no wait (scriptable) |
| telegram/connect-wizard-status-flag.test.ts | parity | --status flag checks existing bridge without generating a new token |
| telegram/connect-wizard-ctrl-c.test.ts | parity | Ctrl-C during the wait phase prints the "link stays valid" hint and exits 0 |
| (e2e) tests/e2e/telegram/broker-token-register.test.ts | e2e | POST /tg/token registers a bot token, writes to mesh.telegram_bridge |
| (e2e) tests/e2e/telegram/broker-inbound-routing.test.ts | e2e | Simulated inbound Telegram update is routed as a mesh send_message with subtype: telegram |
| (parity) telegram/send-message-to-tg-peer.test.ts | parity | send_message(to: "tg:<username>", ...) via the v2 CLI calls the broker with the same WS envelope as v1 would |
| (parity) telegram/list-peers-shows-tg-bridge.test.ts | parity | When a Telegram bridge is registered on the broker, claudemesh peers includes tg:<username> entries with type: bridge |

3.8 Dashboard sync parity (inventory §12.9)

Test file Asserts
dashboard-sync/browser-flow.test.ts claudemesh sync opens browser, receives JWT via callback-listener, fetches mesh list
dashboard-sync/cli-sync-endpoint.test.ts POST /cli-sync with valid JWT returns the user's dashboard meshes; invalid JWT is rejected
dashboard-sync/force-resync.test.ts claudemesh sync --force re-links even if already linked

3.9 Webhooks parity (inventory §12.10)

Test file Asserts
webhooks/create-returns-url.test.ts create_webhook name=github returns a POST URL
webhooks/external-post-becomes-mesh-message.test.ts External POST /hook/:meshId/:webhookId with a JSON payload emits a mesh message to all peers
webhooks/hmac-signature-validation.test.ts HMAC-signed requests pass, unsigned requests are rejected
webhooks/list-delete.test.ts list_webhooks + delete_webhook round-trip works
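
The HMAC row can be exercised with a helper along these lines. This is a sketch using Node's built-in crypto module; the choice of SHA-256, the hex encoding, and the idea that the signature arrives as a single string (for example from a request header) are assumptions, since the plan does not specify the broker's exact scheme.

```typescript
import { Buffer } from 'node:buffer';
import { createHmac, timingSafeEqual } from 'node:crypto';

// Compute the hex HMAC-SHA256 of a request body (algorithm and encoding are assumed).
function signBody(secret: string, body: string): string {
  return createHmac('sha256', secret).update(body).digest('hex');
}

function isValidSignature(secret: string, body: string, signature: string | undefined): boolean {
  if (!signature) return false; // unsigned requests are rejected
  const expected = Buffer.from(signBody(secret, body), 'hex');
  const received = Buffer.from(signature, 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}

const secret = 'whsec_example'; // illustrative secret only
const payload = JSON.stringify({ event: 'push', repo: 'demo' });
const goodSig = signBody(secret, payload);
console.log(isValidSignature(secret, payload, goodSig));
console.log(isValidSignature(secret, payload, undefined));
console.log(isValidSignature(secret, payload + ' ', goodSig));
```

A constant-time comparison is used so that the check does not leak how many leading bytes of a forged signature were correct.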

3.10 Doctor checks parity (inventory §12.11)

Test file Asserts
doctor/check-node-version.test.ts doctor reports Node ≥ 20 (or warns if < 20 in a mocked env)
doctor/check-claude-on-path.test.ts doctor detects claude binary on PATH
doctor/check-mcp-registered.test.ts doctor detects MCP server entry in ~/.claude.json
doctor/check-hooks-registered.test.ts doctor detects status hooks in ~/.claude/settings.json
doctor/check-config-perms.test.ts doctor validates ~/.claudemesh/config.json is mode 0600
doctor/check-keypairs-valid.test.ts doctor validates each mesh keypair can sign + verify

Parity layer total: ~83 test files (the tables in §3.1 through §3.10 list 83 parity files, plus the 2 e2e files in §3.7). Each file runs both v1 and v2 in the same environment and diffs the outputs.


4. Layer 2 — WS contract tests

One contract test per broker WS message type (85 total from inventory §3). Each test captures what v1's WS client would send for a given input and asserts v2 sends the byte-identical envelope (modulo legitimate non-determinism like nonces and timestamps, which are normalized before comparison).

File layout:

apps/cli-v2/tests/contract/ws/
├── lifecycle/                 (3 tests: hello, hello_ack, get_clock)
├── messaging/                 (5 tests)
├── profile/                   (5 tests)
├── groups/                    (2 tests)
├── state/                     (3 tests)
├── memory/                    (3 tests)
├── files/                     (5 tests)
├── vectors/                   (4 tests)
├── graph/                     (2 tests)
├── sql/                       (3 tests)
├── streams/                   (5 tests)
├── contexts/                  (3 tests)
├── tasks/                     (4 tests)
├── scheduling/                (3 tests)
├── metadata/                  (3 tests)
├── clock/                     (4 tests)
├── skills/                    (5 tests)
├── mcp-registry/              (12 tests)
├── vault/                     (4 tests)
├── url-watch/                 (3 tests)
├── webhooks/                  (3 tests)
└── audit/                     (2 tests)

4.1 Contract test pattern

// tests/contract/ws/state/set-state.test.ts
import { describe, it, expect } from 'bun:test';
import { normalize, captureV1Envelope, captureV2Envelope } from '@/tests/helpers/wire-capture';

describe('WS contract: set_state', () => {
  it('v2 envelope matches v1 for string value', async () => {
    const input = { meshId: 'test-mesh', key: 'sprint', value: '2026-W15' };
    const v1 = await captureV1Envelope('set_state', input);
    const v2 = await captureV2Envelope('set_state', input);
    expect(normalize(v2)).toEqual(normalize(v1));
  });

  it('v2 envelope matches v1 for JSON value', async () => {
    const input = { meshId: 'test-mesh', key: 'deploy_freeze', value: { until: '2026-04-15' } };
    const v1 = await captureV1Envelope('set_state', input);
    const v2 = await captureV2Envelope('set_state', input);
    expect(normalize(v2)).toEqual(normalize(v1));
  });

  it('v2 envelope matches v1 for null value (deletion)', async () => {
    const input = { meshId: 'test-mesh', key: 'tmp', value: null };
    const v1 = await captureV1Envelope('set_state', input);
    const v2 = await captureV2Envelope('set_state', input);
    expect(normalize(v2)).toEqual(normalize(v1));
  });
});

4.2 The normalize() helper

Strips fields that are legitimately non-deterministic between v1 and v2:

  • nonce — random per envelope
  • timestamp — wall clock
  • messageId — random UUID
  • _reqId — random correlation ID
  • ciphertext — depends on nonce + random keypair; instead of comparing ciphertext directly, both envelopes are decrypted and the plaintext is compared

Everything else (message type, meshId, priority, sender pubkey, recipient, flags) must match byte-for-byte.
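
A minimal sketch of such a helper follows. The `Envelope` type is a loose stand-in for the real wire format, and `makeNormalizer` with its injected `decrypt` callback is a hypothetical construction for this example; the actual wire-capture helper may obtain keys differently.

```typescript
// Loose stand-in for the broker envelope; the real type has named fields.
type Envelope = Record<string, unknown>;

// Fields listed above as legitimately non-deterministic.
const VOLATILE_FIELDS = ['nonce', 'timestamp', 'messageId', '_reqId'] as const;

// `decrypt` is an assumed callback that recovers the plaintext of an envelope.
function makeNormalizer(decrypt: (envelope: Envelope) => string) {
  return function normalize(envelope: Envelope): Envelope {
    const out: Envelope = { ...envelope };
    // Decrypt from the untouched input: decryption still needs the nonce.
    if ('ciphertext' in out) {
      out['plaintext'] = decrypt(envelope);
      delete out['ciphertext'];
    }
    for (const field of VOLATILE_FIELDS) delete out[field];
    return out;
  };
}

// Toy "cipher" for the demo: ciphertext is "<nonce>:<plaintext>".
const fakeDecrypt = (e: Envelope): string => String(e['ciphertext']).split(':')[1] ?? '';
const normalize = makeNormalizer(fakeDecrypt);

const v1Env = { type: 'send', meshId: 'test-mesh', nonce: 'aaa', timestamp: 1, ciphertext: 'aaa:hello' };
const v2Env = { type: 'send', meshId: 'test-mesh', nonce: 'bbb', timestamp: 2, ciphertext: 'bbb:hello' };
console.log(JSON.stringify(normalize(v1Env)) === JSON.stringify(normalize(v2Env)));
```

Replacing the ciphertext with its plaintext, instead of simply dropping it, keeps the payload inside the comparison so that a v2 bug that encrypts the wrong content still fails the contract test.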

4.3 Full contract test manifest (85 tests)

Every WS message type from inventory §3 gets a file:

| Family | WS messages | Test files |
| --- | --- | --- |
| Lifecycle | hello, hello_ack, get_clock | 3 |
| Messaging | send, peer_dir_request, peer_dir_response, peer_file_request, peer_file_response | 5 |
| Profile | set_status, set_summary, set_visible, set_profile, set_stats | 5 |
| Groups | join_group, leave_group | 2 |
| State | set_state, get_state, list_state | 3 |
| Memory | remember, recall, forget | 3 |
| Files | get_file, list_files, file_status, grant_file_access, delete_file | 5 |
| Vectors | vector_store, vector_search, vector_delete, list_collections | 4 |
| Graph | graph_query, graph_execute | 2 |
| SQL | mesh_query, mesh_execute, mesh_schema | 3 |
| Streams | create_stream, publish, subscribe, unsubscribe, list_streams | 5 |
| Contexts | share_context, get_context, list_contexts | 3 |
| Tasks | create_task, claim_task, complete_task, list_tasks | 4 |
| Scheduling | schedule, list_scheduled, cancel_scheduled | 3 |
| Metadata | mesh_info, list_peers, message_status | 3 |
| Clock | set_clock, pause_clock, resume_clock, get_clock | 4 |
| Skills | share_skill, get_skill, list_skills, remove_skill, skill_deploy | 5 |
| MCP registry | mcp_register, mcp_unregister, mcp_list, mcp_call, mcp_call_response, mcp_deploy, mcp_undeploy, mcp_update, mcp_logs, mcp_scope, mcp_schema, mcp_catalog | 12 |
| Vault | vault_set, vault_get, vault_list, vault_delete | 4 |
| URL watch | watch, unwatch, watch_list | 3 |
| Webhooks | create_webhook, list_webhooks, delete_webhook | 3 |
| Audit | audit_query, audit_verify | 2 |

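Contract layer total: ~85 test files. (The per-family counts above sum to 86 because get_clock is listed under both Lifecycle and Clock; that leaves 85 distinct message types.)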

This layer is the load-bearing proof that v2's WS client speaks the broker's v1 protocol unchanged. If any of these tests fail, v1 users running v2 against production brokers will experience silent misbehavior.


5. Layer 3 — MCP tool handler tests

One test file per MCP tool from inventory §2 (79 tools). Each file:

  1. Invokes the tool through v2's MCP server with a fixture input
  2. Captures the WS message v2 sends to the broker
  3. Captures the same request through v1's MCP server
  4. Asserts both produce identical WS envelopes and identical return values

File layout mirrors v2's src/mcp/tools/:

apps/cli-v2/tests/mcp-tools/
├── memory/
│   ├── remember.test.ts
│   ├── recall.test.ts
│   └── forget.test.ts
├── state/
│   ├── set-state.test.ts
│   ├── get-state.test.ts
│   └── list-state.test.ts
├── messaging/
│   ├── send-message.test.ts
│   ├── list-peers.test.ts
│   ├── check-messages.test.ts
│   └── message-status.test.ts
├── profile/
│   ├── set-profile.test.ts
│   ├── set-status.test.ts
│   ├── set-summary.test.ts
│   └── set-visible.test.ts
├── groups/
│   ├── join-group.test.ts
│   └── leave-group.test.ts
├── files/
│   ├── share-file.test.ts
│   ├── get-file.test.ts
│   ├── list-files.test.ts
│   ├── file-status.test.ts
│   ├── delete-file.test.ts
│   ├── grant-file-access.test.ts
│   ├── read-peer-file.test.ts
│   └── list-peer-files.test.ts
├── vectors/
│   ├── vector-store.test.ts
│   ├── vector-search.test.ts
│   ├── vector-delete.test.ts
│   └── list-collections.test.ts
├── graph/
│   ├── graph-query.test.ts
│   └── graph-execute.test.ts
├── sql/
│   ├── mesh-query.test.ts
│   ├── mesh-execute.test.ts
│   └── mesh-schema.test.ts
├── streams/
│   ├── create-stream.test.ts
│   ├── publish.test.ts
│   ├── subscribe.test.ts
│   └── list-streams.test.ts
├── contexts/
│   ├── share-context.test.ts
│   ├── get-context.test.ts
│   └── list-contexts.test.ts
├── tasks/
│   ├── create-task.test.ts
│   ├── claim-task.test.ts
│   ├── complete-task.test.ts
│   └── list-tasks.test.ts
├── scheduling/
│   ├── schedule-reminder.test.ts
│   ├── list-scheduled.test.ts
│   └── cancel-scheduled.test.ts
├── metadata/
│   ├── mesh-info.test.ts
│   ├── mesh-stats.test.ts
│   ├── mesh-clock.test.ts
│   └── ping-mesh.test.ts
├── clock-write/
│   ├── mesh-set-clock.test.ts
│   ├── mesh-pause-clock.test.ts
│   └── mesh-resume-clock.test.ts
├── skills/
│   ├── share-skill.test.ts
│   ├── get-skill.test.ts
│   ├── list-skills.test.ts
│   ├── remove-skill.test.ts
│   └── mesh-skill-deploy.test.ts
├── mcp-registry-tier1/
│   ├── mesh-mcp-register.test.ts
│   ├── mesh-mcp-list.test.ts
│   ├── mesh-tool-call.test.ts
│   └── mesh-mcp-remove.test.ts
├── mcp-registry-tier2/
│   ├── mesh-mcp-deploy.test.ts
│   ├── mesh-mcp-undeploy.test.ts
│   ├── mesh-mcp-update.test.ts
│   ├── mesh-mcp-logs.test.ts
│   ├── mesh-mcp-scope.test.ts
│   ├── mesh-mcp-schema.test.ts
│   └── mesh-mcp-catalog.test.ts
├── vault/
│   ├── vault-set.test.ts
│   ├── vault-list.test.ts
│   └── vault-delete.test.ts
├── url-watch/
│   ├── mesh-watch.test.ts
│   ├── mesh-unwatch.test.ts
│   └── mesh-watches.test.ts
└── webhooks/
    ├── create-webhook.test.ts
    ├── list-webhooks.test.ts
    └── delete-webhook.test.ts

MCP layer total: 79 test files.

5.1 MCP handler test pattern

// tests/mcp-tools/memory/remember.test.ts
import { describe, it, expect } from 'bun:test';
import { v1McpServer, v2McpServer, mockBroker } from '@/tests/helpers';

describe('MCP tool: remember (parity)', () => {
  it('v1 and v2 produce identical WS envelopes', async () => {
    const input = {
      content: 'Payments API rate-limits at 100 req/s after March incident',
      tags: ['payments', 'rate-limit'],
    };

    const v1Sent = await v1McpServer.invokeAndCapture('remember', input);
    const v2Sent = await v2McpServer.invokeAndCapture('remember', input);

    expect(v2Sent).toEqual(v1Sent);
  });

  it('v1 and v2 return identical tool results', async () => {
    const input = { content: 'test memory', tags: [] };
    const v1Result = await v1McpServer.invoke('remember', input);
    const v2Result = await v2McpServer.invoke('remember', input);
    expect(v2Result).toEqual(v1Result);
  });

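  it('v2 rejects invalid input via Zod with same error shape as v1', async () => {
    const invalidInput = { content: '', tags: 'not-an-array' };
    // Capture both rejections so the error shapes are actually compared.
    const v1Error = await v1McpServer.invoke('remember', invalidInput).then(() => null, (e) => e);
    const v2Error = await v2McpServer.invoke('remember', invalidInput).then(() => null, (e) => e);
    expect(v2Error).not.toBeNull();
    expect({ name: v2Error?.name, message: v2Error?.message }).toEqual({ name: v1Error?.name, message: v1Error?.message });
  });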
});

6. Layer 4 — End-to-end smoke tests

Full journey tests against a real broker in a Docker sandbox. Each test spins up:

  1. A fresh Postgres + Neo4j + Qdrant + MinIO + broker stack (via testcontainers)
  2. One or more v2 CLI instances as subprocesses
  3. Optionally a Claude Code mock to simulate MCP client interactions

These are slower than layers 1–3 but catch integration bugs that unit-level parity tests miss.

File layout:

apps/cli-v2/tests/e2e/
├── first-run/
│   ├── install-and-join.test.ts
│   ├── fresh-mesh-create.test.ts
│   └── launch-with-mesh.test.ts
├── messaging/
│   ├── two-peer-send-receive.test.ts
│   ├── broadcast-to-group.test.ts
│   └── offline-queue-drain.test.ts
├── files/
│   ├── upload-download-small.test.ts
│   ├── upload-download-large.test.ts
│   └── e2e-encrypted-share.test.ts
├── state-and-memory/
│   ├── state-across-peers.test.ts
│   ├── memory-full-text-search.test.ts
│   └── context-share-and-query.test.ts
├── tasks/
│   ├── create-claim-complete.test.ts
│   └── list-by-status.test.ts
├── backends/
│   ├── shared-sql-roundtrip.test.ts
│   ├── neo4j-graph-roundtrip.test.ts
│   ├── qdrant-vector-roundtrip.test.ts
│   └── minio-file-roundtrip.test.ts
├── mcp-registry/
│   ├── deploy-catalog-entry.test.ts
│   ├── call-deployed-tool.test.ts
│   └── undeploy-cleanup.test.ts
├── telegram/
│   ├── connect-and-route.test.ts
│   └── disconnect-cleanup.test.ts
├── dashboard-sync/
│   └── browser-flow.test.ts
├── scheduled/
│   ├── cron-reminder-fires.test.ts
│   ├── url-watch-detects-change.test.ts
│   └── persist-across-restart.test.ts
├── webhooks/
│   └── inbound-post-becomes-message.test.ts
└── journey/
    ├── full-user-journey.test.ts         (install → join → send → file → logout)
    ├── connector-journey.test.ts         (deploy mcp → call tool → undeploy)
    └── skill-sharing-journey.test.ts     (share_skill → teammate loads → invoke)

E2E layer total: ~31 test files.

6.1 E2E test harness

Each e2e file uses a shared harness:

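import { describe, it, expect, beforeAll, afterAll } from 'bun:test';
import { startFreshBroker, stopBroker, spawnCli } from '@/tests/helpers/e2e';
// Assumes the handle types are exported by the same helper module.
import type { BrokerHandle, CliHandle } from '@/tests/helpers/e2e';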

describe('e2e: two-peer send-receive', () => {
  let broker: BrokerHandle;
  let alice: CliHandle;
  let bob: CliHandle;

  beforeAll(async () => {
    broker = await startFreshBroker();
    alice = await spawnCli({ broker, displayName: 'Alice' });
    bob = await spawnCli({ broker, displayName: 'Bob' });
    await alice.join(broker.seedInvite);
    await bob.join(broker.seedInvite);
  });

  afterAll(async () => {
    await alice.shutdown();
    await bob.shutdown();
    await stopBroker(broker);
  });

  it('alice sends to bob; bob receives via inbox', async () => {
    await alice.send({ to: 'Bob', message: 'hello' });
    const inbox = await bob.inbox();
    expect(inbox).toHaveLength(1);
    expect(inbox[0].plaintext).toBe('hello');
  });
});

6.2 Testcontainers vs local broker

Two modes, switchable via env var:

  • E2E_BROKER=docker — spins up a fresh broker + all backends via testcontainers. Slow (~30s per test) but hermetic.
  • E2E_BROKER=local — connects to a running local broker (ic.claudemesh.com or localhost:8787). Fast but requires manual setup.

CI uses docker mode. Dev iteration uses local mode.
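
A small sketch of how the harness might read the switch. The function name `resolveBrokerMode` and the decision to default to docker when the variable is unset are assumptions for illustration; only the variable name and its two values come from the text above.

```typescript
type BrokerMode = 'docker' | 'local';

// Parse E2E_BROKER strictly so that a typo fails fast instead of silently picking a mode.
function resolveBrokerMode(env: Record<string, string | undefined>): BrokerMode {
  const raw = env['E2E_BROKER'] ?? 'docker'; // defaulting to docker is an assumption
  if (raw === 'docker' || raw === 'local') return raw;
  throw new Error(`E2E_BROKER must be "docker" or "local", got "${raw}"`);
}

console.log(resolveBrokerMode({ E2E_BROKER: 'local' }));
console.log(resolveBrokerMode({}));
```

Taking the environment as a parameter (rather than reading process.env inside the function) keeps the helper itself unit-testable.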


7. Layer 5 — JSON output golden tests

--json output is the stable contract for script consumers. These tests lock the shape and fields.

File layout:

apps/cli-v2/tests/golden/
├── list-json.test.ts
├── peers-json.test.ts
├── info-json.test.ts
├── inbox-json.test.ts
├── state-get-json.test.ts
├── state-list-json.test.ts
├── remember-json.test.ts
├── recall-json.test.ts
├── remind-json.test.ts
├── profile-json.test.ts
├── mcp-info-json.test.ts
└── mcp-stats-json.test.ts

7.1 Golden test pattern

// tests/golden/list-json.test.ts
import { describe, it, expect } from 'bun:test';
import { runV2Cli } from '@/tests/helpers';
import { listJsonShape } from '@/tests/fixtures/golden/list.json';

describe('golden: claudemesh list --json', () => {
  it('output shape matches locked schema', async () => {
    const output = JSON.parse(await runV2Cli(['list', '--json']));
    expect(output).toMatchObject(listJsonShape);
  });

  it('includes schema_version field', async () => {
    const output = JSON.parse(await runV2Cli(['list', '--json']));
    expect(output.schema_version).toMatch(/^\d+\.\d+$/);
  });

  it('mesh entries have all v1 fields', async () => {
    const output = JSON.parse(await runV2Cli(['list', '--json']));
    for (const mesh of output.meshes) {
      expect(mesh).toHaveProperty('slug');
      expect(mesh).toHaveProperty('name');
      expect(mesh).toHaveProperty('kind');
      expect(mesh).toHaveProperty('brokerUrl');
      expect(mesh).toHaveProperty('memberCount');
    }
  });
});

Golden fixtures live in tests/fixtures/golden/*.json and are captured from v1 CLI runs the first time the test is written, then locked.

Golden layer total: ~12 test files.


8. Layer 6 — Facade unit tests

Per-service colocated tests that verify the facade contract:

  • Every facade function validates input with Zod
  • Every output type passes the boundary scanner (no token, api_key, password, path-like patterns)
  • Error mapping via toDomainError preserves cause + logs unmapped errors
  • Never exposes class instances, DB connections, or raw HTTP responses
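
To make the boundary-scanner rule concrete, here is a runtime sketch that walks a facade result and reports forbidden keys and path-like strings. It is a simplified stand-in: the scanner this plan refers to is AST-based and inspects output types, whereas this example inspects values, and both regular expressions are illustrative heuristics rather than the spec's actual patterns.

```typescript
// Illustrative patterns; the facade-pattern spec defines the real forbidden list.
const FORBIDDEN_KEY = /token|api_key|password/i;
const PATH_LIKE = /^(\/|~\/|[A-Za-z]:\\)/;

// Recursively collect the locations of anything that must not cross the facade boundary.
function findLeaks(value: unknown, trail = '$'): string[] {
  if (typeof value === 'string') {
    return PATH_LIKE.test(value) ? [`${trail} (path-like value)`] : [];
  }
  if (value === null || typeof value !== 'object') return [];
  const leaks: string[] = [];
  for (const [key, child] of Object.entries(value)) {
    const here = `${trail}.${key}`;
    if (FORBIDDEN_KEY.test(key)) leaks.push(`${here} (forbidden key)`);
    leaks.push(...findLeaks(child, here));
  }
  return leaks;
}

const leaky = {
  user: { id: 'u1', email: 'a@b.c' },
  token: 'cm_session_SECRET',
  keyFile: '/home/u/.claudemesh/keys/dev.key',
};
console.log(findLeaks(leaky));
console.log(findLeaks({ user: { id: 'u1', display_name: 'Alejandro' } }));
```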

File layout (colocated with services):

apps/cli-v2/src/services/
├── auth/
│   └── facade.test.ts
├── mesh/
│   └── facade.test.ts
├── invite/
│   └── facade.test.ts
├── broker/
│   └── facade.test.ts
├── api/
│   └── facade.test.ts
├── crypto/
│   └── facade.test.ts
├── store/
│   └── facade.test.ts
├── config/
│   └── facade.test.ts
├── state/
│   └── facade.test.ts
├── device/
│   └── facade.test.ts
├── clipboard/
│   └── facade.test.ts
├── spawn/
│   └── facade.test.ts
├── telemetry/
│   └── facade.test.ts
├── health/
│   └── facade.test.ts
├── update/
│   └── facade.test.ts
├── i18n/
│   └── facade.test.ts
└── lifecycle/
    └── facade.test.ts

Plus one global test at tests/unit/facade-boundary-scan.test.ts that walks every facade and asserts no output type contains forbidden keys (AST-based via ts-morph, per facade-pattern spec §10.2).

Facade layer total: ~17 facade-specific test files + 1 global scanner.

8.1 Facade test pattern

// services/auth/facade.test.ts
import { describe, it, expect, vi } from 'vitest';
import * as facade from './facade';
import { getAuthService } from './index';

vi.mock('./index');

describe('auth facade contract', () => {
  it('loginWithDeviceCode rejects leaked token in output', async () => {
    vi.mocked(getAuthService).mockReturnValue({
      startDeviceCodeFlow: vi.fn().mockResolvedValue({
        user: { id: 'u1', display_name: 'Alejandro', email: 'a@b.c' },
        token: 'cm_session_SECRET',
        raw_response: { headers: {} },
      }),
    } as any);

    const result = await facade.loginWithDeviceCode();
    const serialized = JSON.stringify(result);

    expect(serialized).not.toContain('cm_session_');
    expect(serialized).not.toContain('SECRET');
    expect(serialized).not.toContain('raw_response');
  });

  it('loginWithToken validates input with Zod', async () => {
    await expect(facade.loginWithToken({ token: 'malformed' })).rejects.toMatchObject({
      code: 'AUTH_INVALID_TOKEN',
    });
  });

  it('whoAmI never throws', async () => {
    vi.mocked(getAuthService).mockReturnValue({
      getCurrentState: vi.fn().mockRejectedValue(new Error('boom')),
    } as any);
    await expect(facade.whoAmI()).resolves.toBeDefined();
  });

  it('toDomainError logs unmapped errors', async () => {
    const logSpy = vi.fn();
    vi.mocked(getAuthService).mockReturnValue({
      logout: vi.fn().mockRejectedValue(new TypeError('null pointer')),
      logger: { error: logSpy },
    } as any);

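    // logout may rethrow the mapped domain error; only the log call matters here.
    await facade.logout().catch(() => undefined);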
    expect(logSpy).toHaveBeenCalledWith(
      expect.stringContaining('unmapped error'),
      expect.any(Object),
    );
  });
});
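
For orientation, the error-mapping behavior these tests exercise could look roughly like the sketch below. Everything here is an assumption for illustration: the `DomainError` shape, the `causedBy` field name, the `KNOWN` mapping table, and the `'UNKNOWN'` fallback code are hypothetical, and the real definitions belong to the facade-pattern spec.

```typescript
// Hypothetical shapes; the real DomainError and logger are defined by the facade spec.
export class DomainError extends Error {
  readonly code: string;
  readonly causedBy: unknown; // the original error, preserved for debugging
  constructor(code: string, message: string, causedBy: unknown) {
    super(message);
    Object.setPrototypeOf(this, DomainError.prototype); // keep instanceof working on old targets
    this.name = 'DomainError';
    this.code = code;
    this.causedBy = causedBy;
  }
}

type Logger = { error: (msg: string, meta: object) => void };

// Assumed mapping table from recognizable errors to domain codes.
const KNOWN: Array<{ test: (e: Error) => boolean; code: string }> = [
  { test: (e) => /invalid token/i.test(e.message), code: 'AUTH_INVALID_TOKEN' },
];

export function toDomainError(err: unknown, logger: Logger): DomainError {
  if (err instanceof DomainError) return err;
  const e = err instanceof Error ? err : new Error(String(err));
  const match = KNOWN.find((k) => k.test(e));
  if (match) return new DomainError(match.code, e.message, err);
  // Unmapped: log it so the gap in the mapping table is visible, then wrap generically.
  logger.error('unmapped error', { name: e.name, message: e.message });
  return new DomainError('UNKNOWN', e.message, err);
}
```

The design point is that callers only ever see a `DomainError` with a stable `code`, while the original error stays reachable for logs.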

9. Layer 7 — Port-forwarded v1 tests

v1's existing tests cover crypto primitives and broker correctness. They all apply to v2 as-is because:

  • v2 uses the same crypto primitives (Ed25519, NaCl crypto_box, AES-GCM)
  • v2 talks to the same broker, so broker tests run unchanged
  • v2 parses the same invite formats

9.1 Tests to port forward

| v1 test | Forward-port path | Notes |
| --- | --- | --- |
| apps/cli/src/__tests__/crypto-roundtrip.test.ts | apps/cli-v2/tests/unit/crypto-roundtrip.test.ts | Direct copy; uses shared crypto primitives |
| apps/cli/src/__tests__/invite-parse.test.ts | apps/cli-v2/tests/unit/invite-parse.test.ts | Direct copy; v2 parses same v1 + v2 invite formats |
| apps/broker/tests/broker.test.ts | N/A (stays in broker) | Broker unchanged |
| apps/broker/tests/invite-signature.test.ts | N/A (stays in broker) | Broker unchanged |
| apps/broker/tests/invite-v2.test.ts | N/A (stays in broker) | Broker unchanged |
| apps/broker/tests/hello-signature.test.ts | N/A (stays in broker) | Broker unchanged |
| apps/broker/tests/rate-limit.test.ts | N/A (stays in broker) | Broker unchanged |
| apps/broker/tests/encoding.test.ts | N/A (stays in broker) | Broker unchanged |
| apps/broker/tests/dup-delivery.test.ts | N/A (stays in broker) | Broker unchanged |
| apps/broker/tests/metrics.test.ts | N/A (stays in broker) | Broker unchanged |
| apps/broker/tests/logging.test.ts | N/A (stays in broker) | Broker unchanged |
| apps/broker/tests/integration/health.test.ts | N/A (stays in broker) | Broker unchanged |

Port-forward layer: 2 CLI tests copied, ~10 broker tests remain in place.


10. Test helper infrastructure

Shared helpers under apps/cli-v2/tests/helpers/:

10.1 v1-runner.ts

Spawns the v1 CLI as a subprocess and captures its output:

export async function runV1Cli(args: string[], opts?: RunOpts): Promise<CliResult>;
export async function v1Send(args: SendArgs): Promise<string>;  // returns messageId
export async function v1Join(inviteUrl: string): Promise<void>;
export async function v1Install(): Promise<void>;
// ... one helper per v1 command

Uses the installed claudemesh binary from apps/cli/ (v1). Tests assume v1 is available at ../../cli/dist/index.js or via npx claudemesh@0.10.5.
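
A minimal sketch of the subprocess plumbing behind such a runner, assuming Node's built-in `child_process` module. The `RunOpts` and `CliResult` shapes are assumptions (the plan names them without defining them), and `runCli` is a hypothetical shared core that `runV1Cli` could wrap by passing the v1 entrypoint as the first argument.

```typescript
import { spawn } from 'node:child_process';

// Assumed shapes; the plan references RunOpts and CliResult but does not define them.
export interface RunOpts { env?: Record<string, string>; cwd?: string; }
export interface CliResult { exitCode: number | null; stdout: string; stderr: string; }

// Hypothetical shared core: spawns a command and collects everything it prints.
export function runCli(command: string, args: string[], opts: RunOpts = {}): Promise<CliResult> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args, {
      cwd: opts.cwd,
      env: { ...process.env, ...opts.env },
    });
    let stdout = '';
    let stderr = '';
    child.stdout.on('data', (chunk) => { stdout += chunk.toString(); });
    child.stderr.on('data', (chunk) => { stderr += chunk.toString(); });
    child.on('error', reject); // e.g. the binary does not exist
    child.on('close', (exitCode) => resolve({ exitCode, stdout, stderr }));
  });
}
```

Under these assumptions, `runV1Cli(args, opts)` would reduce to `runCli(process.execPath, ['../../cli/dist/index.js', ...args], opts)`.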

10.2 v2-runner.ts

Same interface, but spawns v2:

export async function runV2Cli(args: string[], opts?: RunOpts): Promise<CliResult>;
// ... matching v1 helper surface

Uses apps/cli-v2/dist/entrypoints/cli.js.

10.3 wire-capture.ts

Intercepts WS messages by routing both v1 and v2 CLIs through a test proxy that records envelopes:

export async function captureV1Envelope(op: string, input: any): Promise<WsEnvelope>;
export async function captureV2Envelope(op: string, input: any): Promise<WsEnvelope>;
export function normalize(env: WsEnvelope): WsEnvelope;

The proxy runs on a local port, the CLI's broker URL is set to ws://localhost:<port>/ws, and the proxy logs every message before forwarding to a real test broker.
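
Captured envelopes from two separate runs will differ in fields such as IDs and timestamps, which is what `normalize` is for. The sketch below shows one possible approach. The `WsEnvelope` shape and the list of volatile field names (`id`, `ts`, `nonce`) are assumptions made for illustration; the real list has to come from the actual wire format.

```typescript
// Assumed envelope shape; the real WsEnvelope type is not shown in this plan.
export type WsEnvelope = { type: string; [field: string]: unknown };

// Assumed list of top-level fields that legitimately differ between two runs.
const VOLATILE_FIELDS = ['id', 'ts', 'nonce'];

export function normalize(env: WsEnvelope): WsEnvelope {
  const out: WsEnvelope = { type: env.type };
  // Sort keys so serialization order never causes a false mismatch.
  for (const key of Object.keys(env).sort()) {
    if (key === 'type') continue;
    // Replace volatile values with a placeholder but keep the key,
    // so a missing field is still detected.
    out[key] = VOLATILE_FIELDS.indexOf(key) !== -1 ? `<${key}>` : env[key];
  }
  return out;
}
```

A contract test can then compare `JSON.stringify(normalize(v1))` with `JSON.stringify(normalize(v2))`. This sketch only normalizes top-level fields; nested volatile values would need a recursive variant.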

10.4 mock-broker.ts

In-memory broker for unit tests. Implements enough of the WS protocol to test CLI-side behavior without a real database stack.

Handles:

  • Hello + authentication (signature check skipped in test mode)
  • Echo back ack for every client message
  • In-memory state for state_kv, memory, tasks
  • Configurable response fixtures for WS ops
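
The responsibilities above can be sketched as a transport-free class that maps one client message to the messages a broker would send back. This is an illustration only: the message shapes, the `get_state` op, the `state` reply type, and the choice to answer `hello` with a single `hello_ack` are assumptions, and a real mock would sit behind a WebSocket server.

```typescript
// Assumed message shapes, for illustration only.
type ClientMsg = { type: string; id: string; key?: string; value?: unknown };
type ServerMsg = { type: string; id?: string; value?: unknown };

export class MockBroker {
  private stateKv = new Map<string, unknown>();
  private fixtures = new Map<string, ServerMsg>();

  // Register a canned response for a given op.
  setFixture(op: string, response: ServerMsg): void {
    this.fixtures.set(op, response);
  }

  // Returns the messages the broker would send back for one client message.
  handle(msg: ClientMsg): ServerMsg[] {
    // Test mode: accept every hello without checking its signature.
    if (msg.type === 'hello') return [{ type: 'hello_ack' }];
    const ack: ServerMsg = { type: 'ack', id: msg.id };
    const fixture = this.fixtures.get(msg.type);
    if (fixture) return [ack, fixture];
    if (msg.type === 'set_state' && msg.key !== undefined) {
      this.stateKv.set(msg.key, msg.value);
      return [ack];
    }
    if (msg.type === 'get_state' && msg.key !== undefined) {
      return [ack, { type: 'state', id: msg.id, value: this.stateKv.get(msg.key) }];
    }
    return [ack]; // default: acknowledge and do nothing else
  }
}
```

Keeping the handler free of any socket code means unit tests can drive it synchronously, and the same class can later be wrapped by a small WS server for CLI-level tests.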

10.5 real-broker.ts (for e2e)

Spins up a real broker + Postgres + Neo4j + Qdrant + MinIO stack via testcontainers:

export async function startFreshBroker(opts?: BrokerOpts): Promise<BrokerHandle>;
export async function stopBroker(handle: BrokerHandle): Promise<void>;

The stack is pre-configured with a seed mesh + fixture users + a known invite URL. Tests use these as starting state.

10.6 temp-home.ts

Creates an isolated ~/.claudemesh/ for each test:

export async function tempHome(fn: (homeDir: string) => Promise<void>): Promise<void>;

Cleans up on completion. Prevents tests from interfering with the developer's real claudemesh config.
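
One possible implementation, sketched with Node's `fs/promises`, `os`, and `path` modules. It assumes the CLI resolves `~/.claudemesh/` from the `HOME` environment variable, which this plan does not confirm; if the CLI uses a different lookup, the redirect has to change accordingly.

```typescript
import { mkdtemp, mkdir, rm } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

export async function tempHome(fn: (homeDir: string) => Promise<void>): Promise<void> {
  const homeDir = await mkdtemp(join(tmpdir(), 'claudemesh-test-'));
  const previousHome = process.env.HOME;
  await mkdir(join(homeDir, '.claudemesh'), { recursive: true });
  // Assumption: the CLI derives ~/.claudemesh/ from HOME.
  process.env.HOME = homeDir;
  try {
    await fn(homeDir);
  } finally {
    // Restore the developer's environment even if the test body throws.
    if (previousHome === undefined) delete process.env.HOME;
    else process.env.HOME = previousHome;
    await rm(homeDir, { recursive: true, force: true });
  }
}
```

Mutating `process.env.HOME` is not safe when several tests share one process and run concurrently. In that setup it is safer to leave the parent environment alone and pass `{ HOME: homeDir }` to the spawned CLI instead.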

10.7 ink-render.ts

Snapshots Ink screens for UI tests (per cli-v2-ux-design.md §12.1):

export async function renderScreen(Component: any, props?: any): Promise<string>;
export async function waitForText(frame: () => string, text: string, timeoutMs?: number): Promise<void>;
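
`waitForText` is essentially a polling loop with a deadline. A minimal sketch follows; the default timeout, the extra `intervalMs` parameter, and the error message are assumptions added for illustration.

```typescript
export async function waitForText(
  frame: () => string,
  text: string,
  timeoutMs = 1000, // assumed default
  intervalMs = 10, // hypothetical extra parameter: how often to re-read the frame
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (frame().indexOf(text) === -1) {
    if (Date.now() >= deadline) {
      throw new Error(`Timed out after ${timeoutMs}ms waiting for "${text}"`);
    }
    // Yield to the event loop so the component under test can re-render.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In a UI test, `frame` would typically be a function that returns the most recently rendered frame of the Ink screen under test.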

10.8 sqlite-fixture.ts

(Pass 1 only needs this if we add any local caching — most of Pass 1 won't touch SQLite since the local-first work is Pass 2.)


11. Shared fixtures

Under apps/cli-v2/tests/fixtures/:

fixtures/
├── auth/
│   ├── valid-session-token.json       # sample cm_session_... token
│   ├── valid-pat.json                 # sample cm_pat_... token
│   ├── expired-token.json
│   └── malformed-token.json
├── meshes/
│   ├── sample-personal-mesh.json
│   ├── sample-shared-mesh.json
│   └── sample-guest-mesh.json
├── invites/
│   ├── v1-invite-url.txt
│   ├── v2-invite-url.txt
│   ├── expired-invite-url.txt
│   └── malformed-invite-url.txt
├── wire/
│   ├── v1-envelopes/                  # captured v1 WS envelopes, one JSON file per op
│   │   ├── send.json
│   │   ├── set_state.json
│   │   ├── remember.json
│   │   └── ... (85 files)
│   └── broker-responses/              # captured v1 broker responses
│       ├── hello_ack.json
│       ├── peers_list.json
│       └── ...
├── golden/
│   ├── list-json.json                 # expected JSON output shape
│   ├── peers-json.json
│   ├── info-json.json
│   └── ... (12 files)
├── telegram/
│   ├── sample-bot-token.json
│   ├── sample-inbound-update.json     # Telegram webhook payload
│   └── expected-routed-message.json
└── mcp-tool-inputs/
    ├── memory-remember.json           # one fixture input per tool
    ├── memory-recall.json
    └── ... (79 files, one per tool)

Total fixture count: ~190 files or more (the groups listed above add up to 190 before counting the broker responses). Most are small JSON snippets captured once from v1 runs and locked.


12. Execution order + dependencies

Tests run in parallel where possible, but some layers depend on others:

┌─────────────────────────────────────┐
│ Layer 7: port-forwarded v1 tests   │  ← no dependencies, runs first
│ Layer 6: facade unit tests          │  ← depends on v2 services existing
└─────────────────────────────────────┘
                ↓
┌─────────────────────────────────────┐
│ Layer 2: WS contract tests          │  ← depends on wire fixtures captured
│ Layer 3: MCP tool handler tests     │  ← depends on v1 + v2 MCP servers
│ Layer 5: JSON golden tests          │  ← depends on v1 + v2 CLI built
└─────────────────────────────────────┘
                ↓
┌─────────────────────────────────────┐
│ Layer 1: parity tests               │  ← depends on mock-broker + helpers
└─────────────────────────────────────┘
                ↓
┌─────────────────────────────────────┐
│ Layer 4: e2e smoke tests            │  ← depends on real broker + testcontainers
└─────────────────────────────────────┘

Layers 1, 2, 3, 5, 6, 7 run on every PR. Layer 4 runs on main merges + release candidates (slower).


13. CI integration

13.1 PR-level pipeline

jobs:
  lint-and-typecheck:
    - biome check
    - eslint (boundaries + 3 custom rules)
    - tsc --noEmit
    - dependency-cruiser

  unit-tests:
    - bun test tests/unit/                    # facade layer
    - bun test src/services/**/*.test.ts      # colocated facade tests
    - bun test tests/golden/                   # JSON shape

  parity-tests:
    - bun test tests/parity/
    needs: [v1-cli-available, v2-cli-built]

  contract-tests:
    - bun test tests/contract/
    needs: [wire-fixtures-available]

  mcp-tool-tests:
    - bun test tests/mcp-tools/

  port-forward-tests:
    - bun test tests/unit/crypto-roundtrip.test.ts
    - bun test tests/unit/invite-parse.test.ts

13.2 Release-candidate pipeline

Adds layer 4:

jobs:
  e2e-docker:
    - E2E_BROKER=docker bun test tests/e2e/
    timeout-minutes: 60

13.3 Coverage gates

  • Unit tests: ≥ 80% branch coverage on src/services/**/*.ts (excluding services/broker/*)
  • Parity tests: 100% of inventory §12 checks mapped to at least one passing test
  • Contract tests: 100% of 85 WS message types have at least one passing test
  • MCP tool tests: 100% of 79 tools have at least one passing test
  • E2E tests: all 28 journey tests passing on main

If any gate fails, the PR cannot merge.


14. Success criteria (the ship checklist)

v2 Pass 1 ships when ALL of these are green:

  • Layer 1 parity: 70 test files, every inventory §12 regression check has at least one passing parity test
  • Layer 2 contract: 85 test files, every broker WS message type has a contract test passing against captured v1 envelopes
  • Layer 3 MCP tools: 79 test files, every MCP tool handler produces identical WS output between v1 and v2
  • Layer 4 e2e: 28 journey tests pass against a real broker in Docker
  • Layer 5 golden: 12 JSON output tests pass, schema_version field present and stable
  • Layer 6 facade: 17 service facade test files pass + 1 global boundary scanner (AST-based, no false positives)
  • Layer 7 port-forward: 2 forwarded v1 CLI tests pass + broker test suite unchanged and green
  • Coverage gates: all thresholds met
  • Zero new regressions: any previously-passing test that starts failing must be fixed before merge (no skipping, no .todo)

Total test files in v2 Pass 1: ~294 (70 + 85 + 79 + 28 + 12 + 18 + 2).

That's a lot. Most are template-driven — one helper + one fixture + one pattern = many tests. The scaffolding pass creates the files with NotImplementedError stubs, and the implementation pass fills them in.

No time estimate. It's done when the checklist is green.


15. Scaffolding implications

When v2 is re-scaffolded, the test infrastructure ships alongside the source:

  • tests/helpers/ with v1-runner, v2-runner, wire-capture, mock-broker, real-broker, temp-home, ink-render as stubs with NotImplementedError
  • tests/fixtures/ with directory structure and placeholder JSON files
  • Every test file in layers 1–6 scaffolded with its describe blocks, imports, and fixture references; the body is throw new NotImplementedError('<test name>')
  • A CI job that counts test files vs expected counts and fails if any are missing
  • A script tests/helpers/capture-v1-fixtures.ts that runs v1 once against a test broker to generate the wire fixtures
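
The file-count CI job mentioned in the list above could be as small as the following sketch. The expected counts come from the ship checklist in §14; the layer keys and the `missingTestFiles` name are hypothetical, and the actual counts would be gathered by globbing the test directories.

```typescript
// Expected counts per layer, taken from the §14 checklist. Keys are hypothetical labels.
const EXPECTED: Record<string, number> = {
  parity: 70,
  contract: 85,
  'mcp-tools': 79,
  e2e: 28,
  golden: 12,
  facade: 18, // 17 service facades + 1 global boundary scanner
  'port-forward': 2,
};

// Returns one human-readable problem per layer that has too few test files.
export function missingTestFiles(actual: Record<string, number>): string[] {
  const problems: string[] = [];
  for (const layer of Object.keys(EXPECTED)) {
    const expected = EXPECTED[layer];
    const found = actual[layer] !== undefined ? actual[layer] : 0;
    if (found < expected) {
      problems.push(`${layer}: expected ${expected}, found ${found}`);
    }
  }
  return problems;
}

export const expectedTotal = Object.keys(EXPECTED).reduce((sum, k) => sum + EXPECTED[k], 0);
```

The CI step would exit non-zero whenever `missingTestFiles` returns a non-empty list, which keeps a silently deleted stub from shrinking the suite.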

The scaffold pass adds ~300 test files. The implementation pass replaces the NotImplementedError bodies with real assertions, one file at a time, driven by the layer order below.

The implementation pass is organized by test layer, not by feature:

  1. Scaffold everything (layers 1–6) with stubs
  2. Implement layer 7 (port-forwarded tests) — fastest wins, establishes test harness
  3. Implement layer 6 (facade units) — validates each service as it's written
  4. Implement layer 5 (golden JSON) — locks output shapes early
  5. Implement layer 2 (WS contract) — proves wire compatibility
  6. Implement layer 3 (MCP tool handlers) — proves tool dispatch
  7. Implement layer 1 (parity) — full behavioral equivalence
  8. Implement layer 4 (e2e) — end-to-end sanity

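When layer 1 is fully green, behavioral parity is proven. v2 Pass 1 is shippable once every item in the §14 checklist, including the layer 4 e2e journeys, is green as well.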


16. What this plan does NOT cover

Explicitly out of scope for Pass 1 testing:

  • Broker-side tests — broker is unchanged in Pass 1; broker's own test suite runs unchanged
  • Performance regression tests — v2 shouldn't be slower than v1, but quantifying that is Pass 2 work with bench tests
  • Security audit — the spec-level security improvements (role-per-mesh Postgres, egress proxies, SSRF policies) are Pass 2
  • Accessibility audit — the testable a11y matrix is Pass 2 (requires the VoiceOver shim which is Pass 2)
  • Load tests — 10k concurrent peers, sustained message throughput — deferred
  • Chaos tests — broker restart mid-operation, network partition recovery — deferred to local-first Pass 2
  • Cross-platform tests on Windows — v2 Pass 1 targets macOS + Linux; Windows support is best-effort, tested on release candidates only

Pass 2 adds those layers when the corresponding features ship.


End of plan.