Health rollup

Operator walkthrough for /dashboard/health — 30s polling rollup of /v1/health, /v1/meta-tools/stats, and /v1/connections with degraded/recovered alerting.

The /dashboard/health page is the operator's at-a-glance status surface. It polls GET /v1/health every 30 seconds and composes three sections: the six health checks, the rollup stats panel, and the per-tenant connections panel. Each section fetches independently — Promise.allSettled on the SSR boundary means a slow connections call does not block the checks list from rendering.

When a section's underlying data is older than 5 minutes, a small "stale" pill appears in its header. The page never blocks on a single failing dependency.
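
The per-section independence and the 5-minute stale pill can be sketched as a small mapper over Promise.allSettled results. This is an illustrative sketch, not the dashboard's actual source; the type and field names (SectionView, observed_at, etc.) are assumptions.

```typescript
// One entry from Promise.allSettled: either a fetched payload or a failure.
type SettledSection<T> =
  | { status: "fulfilled"; value: T }
  | { status: "rejected"; reason?: unknown };

interface SectionView<T> {
  data: T | null;   // null when this section's fetch failed
  failed: boolean;  // this section renders an error state; others are unaffected
  stale: boolean;   // "stale" pill when the data is older than 5 minutes
}

const STALE_AFTER_MS = 5 * 60 * 1000;

function toSectionView<T extends { observed_at: string }>(
  result: SettledSection<T>,
  nowMs: number,
): SectionView<T> {
  if (result.status === "rejected") {
    // A failing dependency never blocks the page; the section just degrades.
    return { data: null, failed: true, stale: false };
  }
  const ageMs = nowMs - Date.parse(result.value.observed_at);
  return { data: result.value, failed: false, stale: ageMs > STALE_AFTER_MS };
}
```

In the SSR boundary the three fetches would be passed together, e.g. `(await Promise.allSettled([checks, stats, connections])).map(r => toSectionView(r, Date.now()))`, so each section resolves on its own.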

The 6 health checks

GET /v1/health returns a checks object with six named entries. Each carries a status string plus a small payload of evidence.

Each entry names what the check tests, what healthy looks like, and what degraded / down looks like.

  • db: a basic SELECT 1 against the primary database. Healthy is ok: true with latency under 50ms; ok: false flips overall status to down, the page renders a banner, and the API still returns 200 so monitors keep parsing.
  • vault: reachability of the vault_secrets table (credential reads work). Healthy is ok: true; ok: false flips overall status to down, and tool calls cannot resolve provider credentials.
  • embeddings: pgvector populated row count over mcp_tools total. populated_pct ≥ 50% → ok; < 50% → low; 0 → empty. Discovery still works (lexical fallback) but result quality drops.
  • fx_rates: fetched_at on the active USD/BRL row. ≤ 72h old → fresh; > 72h → stale; missing row → missing. Cross-currency pricing falls back to the last known rate.
  • telemetry: count of provider_success_telemetry rows in the last hour. > 0 → active; = 0 → idle, meaning either nothing is calling the router or telemetry writes are blocked.
  • connections: per-tenant connected_accounts count. ≥ 3 → wired; 1–2 → partial; 0 → none, meaning the operator likely needs to finish onboarding.
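
A body shaped like the one below would be consistent with the checks above. This is an illustrative example, not a captured response; field names beyond the documented ones (latency_ms, rows_last_hour, connected_accounts counts) are assumptions.

```typescript
// Hypothetical GET /v1/health body: embeddings is low, so the derived
// overall status is "degraded" while db and vault are still ok.
const exampleHealth = {
  status: "degraded",
  checks: {
    db:          { ok: true, latency_ms: 12 },                       // latency_ms assumed
    vault:       { ok: true },
    embeddings:  { status: "low", populated_pct: 41 },
    fx_rates:    { status: "fresh", fetched_at: "2024-01-01T00:00:00Z" },
    telemetry:   { status: "active", rows_last_hour: 18 },           // rows_last_hour assumed
    connections: { status: "partial", connected_accounts: 2 },
  },
} as const;
```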

Overall status mapping

The top-level status field is derived, not stored:

  • down — iff db.ok = false OR vault.ok = false. The router cannot operate.
  • degraded — iff any non-DB/vault check is in a non-nominal state (embeddings.low, fx_rates.stale, telemetry.idle, connections.partial, etc.). The router operates with reduced quality.
  • healthy — every check is in its nominal state.
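
The three rules above fit in one small function. A minimal sketch, assuming the check statuses documented earlier; the function and type names are illustrative, not the router's actual code.

```typescript
interface Checks {
  db: { ok: boolean };
  vault: { ok: boolean };
  embeddings: "ok" | "low" | "empty";
  fx_rates: "fresh" | "stale" | "missing";
  telemetry: "active" | "idle";
  connections: "wired" | "partial" | "none";
}

function deriveStatus(c: Checks): "healthy" | "degraded" | "down" {
  // db or vault failing wins: the router cannot operate at all.
  if (!c.db.ok || !c.vault.ok) return "down";
  // Every remaining check must be in its nominal state to be healthy.
  const nominal =
    c.embeddings === "ok" &&
    c.fx_rates === "fresh" &&
    c.telemetry === "active" &&
    c.connections === "wired";
  return nominal ? "healthy" : "degraded"; // operating with reduced quality
}
```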

The endpoint always returns HTTP 200. The status field is the source of truth — monitors that branch on HTTP status instead of parsing the body will miss every degraded transition.

Alerting

State transitions emit events. When the derived status flips:

  • healthy → degraded (or down) emits system.health.degraded.
  • degraded (or down) → healthy emits system.health.recovered.

Both events fan out through the standard trigger pipeline. The rollup page renders a "Get notified" CTA that deep-links to /dashboard/triggers?event=system.health.degraded with the event filter pre-filled.
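
The CTA's deep-link is just the event name carried as a query parameter. A hypothetical helper (the function name is illustrative):

```typescript
// Build the pre-filled triggers link for a given event name.
function triggerLink(event: string): string {
  const qs = new URLSearchParams({ event });
  return `/dashboard/triggers?${qs.toString()}`;
}
```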

The state-change detector reads from health_snapshots (the row written on every /v1/health call). It only emits on transitions — back-to-back degraded snapshots produce one event, not many.
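
The emit-on-transition rule amounts to comparing the latest derived status against the previous snapshot's. A sketch under the rules stated above; the page does not specify what a degraded ↔ down edge emits, so this sketch treats it as no event.

```typescript
type Status = "healthy" | "degraded" | "down";

// Returns the event to emit for this snapshot pair, or null for no event.
function transitionEvent(prev: Status | null, curr: Status): string | null {
  if (prev === null || prev === curr) return null; // no edge: back-to-back degraded emits nothing
  if (prev === "healthy") return "system.health.degraded";   // healthy -> degraded/down
  if (curr === "healthy") return "system.health.recovered";  // degraded/down -> healthy
  return null; // degraded <-> down: unspecified in the doc, assumed silent here
}
```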

Snapshots and retention

Every call to /v1/health writes a row in health_snapshots (migration 0063). The schema captures the full checks object, the derived status, and an observed_at timestamp. This gives operators a rolling audit trail of health transitions independent of the live page.

The gc-health-snapshots.ts script runs daily and deletes rows older than 7 days. Override HEALTH_SNAPSHOT_RETENTION_DAYS if you want a longer trail.
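
The retention cutoff the GC script needs can be sketched as follows; only the HEALTH_SNAPSHOT_RETENTION_DAYS variable and 7-day default come from the page, the rest is illustrative.

```typescript
// Compute the oldest observed_at to keep; rows before this get deleted,
// e.g. via: DELETE FROM health_snapshots WHERE observed_at < $cutoff
function retentionCutoff(
  nowMs: number,
  env: Record<string, string | undefined>,
): Date {
  const days = Number(env.HEALTH_SNAPSHOT_RETENTION_DAYS ?? "7"); // default 7-day trail
  return new Date(nowMs - days * 24 * 60 * 60 * 1000);
}
```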

Liveness vs health

There are two endpoints; do not confuse them.

  • GET /health — unauthenticated, simple { status, uptime_ms } shape, intended for load balancer liveness probes. Returns 200 as long as the process is up.
  • GET /v1/health — dual-auth (bearer or service-key), the consolidated 6-check rollup. This is what /dashboard/health consumes. Always returns HTTP 200; parse the status field.
