feat(broker): production hardening — caps, limits, metrics, logging

Adds the minimum ops surface area for a production broker without
over-engineering. All new config knobs are env-var driven with sane
defaults.

New modules:
- logger.ts: structured JSON logs (one line, stderr, ready for
  Loki/Datadog ingestion without preprocessing)
- metrics.ts: in-process Prometheus counters + gauges, exposed at
  GET /metrics. Tracks connections, messages, queue depth, TTL
  sweeps, hook requests, DB health.
- rate-limit.ts: token-bucket rate limiter keyed by (pid, cwd).
  Applied to POST /hook/set-status at 30/min default.
- db-health.ts: Postgres ping loop with exponential-backoff retry.
  GET /health returns 503 while DB is down.
- build-info.ts: version + gitSha (from GIT_SHA env or `git rev-parse`
  fallback) + uptime, surfaced on /health.
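
For reference, the token-bucket idea behind rate-limit.ts can be sketched
roughly as below. This is a minimal illustration, not the module's actual
code: the class name, method name, and injectable clock are all assumed.
Only the (pid, cwd) key and the 30/min default come from this commit.

```typescript
// Illustrative only: TokenBucketLimiter and tryConsume are hypothetical
// names, not the actual rate-limit.ts exports.
interface Bucket {
  tokens: number; // tokens currently available
  last: number; // timestamp (ms) of the last refill
}

class TokenBucketLimiter {
  private readonly buckets = new Map<string, Bucket>();
  private readonly capacity: number;
  private readonly refillPerMs: number;

  constructor(capacity: number, windowMs: number) {
    this.capacity = capacity;
    // Refill continuously so a full window restores a full bucket.
    this.refillPerMs = capacity / windowMs;
  }

  /** Returns true if the call is allowed, false if it should be rejected. */
  tryConsume(pid: number, cwd: string, now: number = Date.now()): boolean {
    // JSON-encode the pair so distinct (pid, cwd) values never collide.
    const key = JSON.stringify([pid, cwd]);
    const bucket = this.buckets.get(key) ?? { tokens: this.capacity, last: now };

    // Add the tokens earned since the last call, capped at capacity.
    const elapsed = Math.max(0, now - bucket.last);
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsed * this.refillPerMs);
    bucket.last = now;

    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;
    this.buckets.set(key, bucket);
    return allowed;
  }
}

// 30 requests per 60 seconds, matching the default named above.
const limiter = new TokenBucketLimiter(30, 60_000);
```

A caller would typically answer with HTTP 429 when tryConsume returns
false. That status code is an assumption; this message does not state it.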

Behavior changes:
- Connection caps: MAX_CONNECTIONS_PER_MESH (default 100). Connections
  over the cap get close(1008, "capacity") plus a metric increment.
- Message size: MAX_MESSAGE_BYTES (default 65536). WS enforces it via
  `ws.maxPayload`. Oversized hook POST bodies are rejected with 413.
- Structured logs everywhere replacing the old `log()` helper.
- Env validation is stricter: DATABASE_URL is now required and
  regex-checked for a postgres:// prefix.
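
A rough sketch of what env-driven knobs with defaults and the stricter
DATABASE_URL check could look like. The function names, the config shape,
and the exact regex are assumptions for illustration. Only the variable
names, the defaults, and the postgres:// prefix check come from this commit.

```typescript
// Hypothetical shape; the real env.ts may differ.
interface BrokerConfig {
  databaseUrl: string;
  maxConnectionsPerMesh: number;
  maxMessageBytes: number;
}

/** Parse a positive integer knob, falling back to a default when unset. */
function parsePositiveInt(raw: string | undefined, fallback: number, name: string): number {
  if (raw === undefined || raw === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return n;
}

/** Build the config from an env-like record (e.g. process.env). */
function loadConfig(source: Record<string, string | undefined>): BrokerConfig {
  const databaseUrl = source["DATABASE_URL"];
  if (!databaseUrl) throw new Error("DATABASE_URL is required");
  // Assumed regex: only the postgres:// prefix is checked.
  if (!/^postgres:\/\//.test(databaseUrl)) {
    throw new Error("DATABASE_URL must start with postgres://");
  }
  return {
    databaseUrl,
    maxConnectionsPerMesh: parsePositiveInt(
      source["MAX_CONNECTIONS_PER_MESH"], 100, "MAX_CONNECTIONS_PER_MESH"),
    maxMessageBytes: parsePositiveInt(
      source["MAX_MESSAGE_BYTES"], 65536, "MAX_MESSAGE_BYTES"),
  };
}
```

Failing fast at startup keeps a misconfigured broker from accepting
connections it cannot serve.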

New endpoints:
- GET /health → {status, db, version, gitSha, uptime}. 503 if DB down.
- GET /metrics → Prometheus text format.
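
As a rough illustration of the /health contract, the sketch below builds
the status code and body from the DB state. The field names and the 503
rule come from this commit. The function name, the "ok"/"degraded" and
"up"/"down" values, and uptime in whole seconds are assumptions.

```typescript
// Hypothetical types; the real handler may shape these differently.
interface BuildInfo {
  version: string;
  gitSha: string;
}

interface HealthBody {
  status: "ok" | "degraded"; // assumed value set
  db: "up" | "down"; // assumed value set
  version: string;
  gitSha: string;
  uptime: number; // assumed unit: whole seconds
}

/** Compute the HTTP status and JSON body for GET /health. */
function buildHealthResponse(
  dbUp: boolean,
  build: BuildInfo,
  startedAtMs: number,
  nowMs: number = Date.now(),
): { statusCode: 200 | 503; body: HealthBody } {
  return {
    // 503 while the DB is down, so load balancers stop routing traffic.
    statusCode: dbUp ? 200 : 503,
    body: {
      status: dbUp ? "ok" : "degraded",
      db: dbUp ? "up" : "down",
      version: build.version,
      gitSha: build.gitSha,
      uptime: Math.floor((nowMs - startedAtMs) / 1000),
    },
  };
}
```

Keeping this a pure function makes the 200/503 switch easy to unit test
without a live database.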

Verified: 21/21 tests still pass. Hit /health + /metrics live —
gitSha resolves correctly via `git rev-parse --short HEAD` in dev.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Alejandro Gutiérrez
2026-04-04 22:14:31 +01:00
parent 84e14ff410
commit 5bf815b304
8 changed files with 630 additions and 139 deletions

@@ -17,6 +17,7 @@
import {
  and,
  asc,
  count,
  desc,
  eq,
  gte,
@@ -34,6 +35,7 @@ import {
  presence,
} from "@turbostarter/db/schema/mesh";
import { env } from "./env";
import { metrics } from "./metrics";
import { inferStatusFromJsonl } from "./paths";
import type {
  HookSetStatusRequest,
@@ -244,6 +246,16 @@ export async function sweepStuckWorking(): Promise<void> {
  for (const row of stuck) {
    await writeStatus(row.id, "idle", "jsonl", now);
  }
  metrics.ttlSweepsTotal.inc({ flipped: String(stuck.length) });
}

/** Update the queue_depth gauge from a single COUNT query. */
export async function refreshQueueDepth(): Promise<void> {
  const [row] = await db
    .select({ n: count() })
    .from(messageQueue)
    .where(isNull(messageQueue.deliveredAt));
  metrics.queueDepth.set(Number(row?.n ?? 0));
}

/** Sweep expired pending_status entries. */