Health rollup
Operator walkthrough for /dashboard/health — 30s polling rollup of /v1/health, /v1/meta-tools/stats, and /v1/connections with degraded/recovered alerting.
The /dashboard/health page is the operator's at-a-glance status surface. It polls GET /v1/health every 30 seconds and composes three sections: the six health checks, the rollup stats panel, and the per-tenant connections panel. Each section fetches independently: Promise.allSettled on the SSR boundary means a failed connections call does not prevent the checks list from rendering.
When a section's underlying data is older than 5 minutes, a small "stale" pill appears in its header. The page never blocks on a single failing dependency.
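The independent-fetch and staleness behaviour can be sketched roughly as follows. This is a minimal illustration, not the page's actual loader: `loadSections`, `isStale`, the `Section` type, and the `fetchedAt` field are hypothetical names, and the real payloads may expose their timestamp differently. Note that Promise.allSettled waits for every loader to settle; what it guarantees is that one rejection does not discard the other results.

```typescript
// Hypothetical shape for one rendered section of the page.
type Section<T> =
  | { state: "ok"; data: T; stale: boolean }
  | { state: "error"; reason: string };

// 5 minutes, the staleness threshold described above.
const STALE_AFTER_MS = 5 * 60 * 1000;

// Assumed field: a millisecond timestamp for when the data was produced.
interface Timestamped {
  fetchedAt: number;
}

function isStale(fetchedAt: number, now: number): boolean {
  return now - fetchedAt > STALE_AFTER_MS;
}

// Runs every loader, then maps each outcome to a renderable section.
// A rejected loader becomes an "error" section instead of failing the page.
async function loadSections<T extends Timestamped>(
  loaders: Record<string, () => Promise<T>>,
  now: number = Date.now(),
): Promise<Record<string, Section<T>>> {
  const entries = Object.entries(loaders);
  const settled = await Promise.allSettled(entries.map(([, load]) => load()));
  const out: Record<string, Section<T>> = {};
  entries.forEach(([name], i) => {
    const result = settled[i];
    if (result === undefined) return;
    out[name] =
      result.status === "fulfilled"
        ? { state: "ok", data: result.value, stale: isStale(result.value.fetchedAt, now) }
        : { state: "error", reason: String(result.reason) };
  });
  return out;
}
```

In this sketch a failed section still yields an entry, so the UI can render an error state for it while the other sections display normally.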
The 6 health checks
GET /v1/health returns a checks object with six named entries. Each carries a status string plus a small payload of evidence.
| Check | What it tests | healthy looks like | degraded / down looks like |
|---|---|---|---|
| db | Basic SELECT 1 against the primary database. | ok: true, latency under 50ms. | ok: false flips overall down. The page renders a banner; the API still returns 200 so monitors keep parsing. |
| vault | Reachability of the vault_secrets table — credential reads work. | ok: true. | ok: false flips overall down. Tool calls cannot resolve provider credentials. |
| embeddings | pgvector populated row count over mcp_tools total. | populated_pct ≥ 50% → ok. | populated_pct < 50% → low; 0 → empty. Discovery still works (lexical fallback) but result quality drops. |
| fx_rates | fetched_at on the active USD/BRL row. | ≤ 72h old → fresh. | > 72h → stale; missing row → missing. Cross-currency pricing falls back to last known. |
| telemetry | Count of provider_success_telemetry rows in the last hour. | > 0 → active. | = 0 → idle. Either nothing is calling the router, or telemetry writes are blocked. |
| connections | Per-tenant connected_accounts count. | ≥ 3 → wired. | 1–2 → partial; 0 → none. Operator likely needs to finish onboarding. |
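The thresholds in the table can be expressed as small classification functions. The sketch below only restates the table; the function names and input shapes are assumptions for illustration, and the server may compute these states differently.

```typescript
type EmbeddingsStatus = "ok" | "low" | "empty";
type FxStatus = "fresh" | "stale" | "missing";
type TelemetryStatus = "active" | "idle";
type ConnectionsStatus = "wired" | "partial" | "none";

const HOUR_MS = 60 * 60 * 1000;

// populatedPct: populated embedding rows as a percentage of mcp_tools rows.
function classifyEmbeddings(populatedPct: number): EmbeddingsStatus {
  if (populatedPct === 0) return "empty";
  return populatedPct >= 50 ? "ok" : "low";
}

// fetchedAt: fetched_at of the active USD/BRL row, or null if the row is missing.
function classifyFxRates(fetchedAt: Date | null, now: Date): FxStatus {
  if (fetchedAt === null) return "missing";
  const ageHours = (now.getTime() - fetchedAt.getTime()) / HOUR_MS;
  return ageHours <= 72 ? "fresh" : "stale";
}

// rowsLastHour: provider_success_telemetry rows written in the last hour.
function classifyTelemetry(rowsLastHour: number): TelemetryStatus {
  return rowsLastHour > 0 ? "active" : "idle";
}

// count: connected_accounts rows for the tenant.
function classifyConnections(count: number): ConnectionsStatus {
  if (count >= 3) return "wired";
  return count >= 1 ? "partial" : "none";
}
```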
Overall status mapping
The top-level status field is derived, not stored:
- down: iff db.ok = false OR vault.ok = false. The router cannot operate.
- degraded: iff any non-DB/vault check is in a non-nominal state (embeddings.low, fx_rates.stale, telemetry.idle, connections.partial, etc.). The router operates with reduced quality.
- healthy: every check is in its nominal state.
The endpoint always returns HTTP 200. The status field is the source of truth — monitors that branch on HTTP status instead of parsing the body will miss every degraded transition.
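The mapping above can be sketched as a pure function. The `Checks` interface is an assumed shape for illustration (the real payload carries additional evidence fields and may name its status field differently), and `deriveStatus` is a hypothetical name. The nominal values are the "healthy looks like" states from the checks table.

```typescript
type Overall = "healthy" | "degraded" | "down";

// Assumed minimal shape of the checks object; real entries carry more evidence.
interface Checks {
  db: { ok: boolean };
  vault: { ok: boolean };
  embeddings: { status: string };
  fx_rates: { status: string };
  telemetry: { status: string };
  connections: { status: string };
}

// Nominal state for each check that can only degrade (never take the router down).
const NOMINAL = {
  embeddings: "ok",
  fx_rates: "fresh",
  telemetry: "active",
  connections: "wired",
} as const;

function deriveStatus(checks: Checks): Overall {
  // db or vault failing takes precedence over everything else.
  if (!checks.db.ok || !checks.vault.ok) return "down";
  const keys = Object.keys(NOMINAL) as Array<keyof typeof NOMINAL>;
  const allNominal = keys.every((k) => checks[k].status === NOMINAL[k]);
  return allNominal ? "healthy" : "degraded";
}
```

Checking db and vault first means a response with both a failed vault and stale FX rates reports down, not degraded.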
Alerting
State transitions emit events. When the derived status flips:
- healthy → degraded (or down) emits system.health.degraded.
- degraded (or down) → healthy emits system.health.recovered.
Both events fan out through the standard trigger pipeline. The rollup page renders a "Get notified" CTA that deep-links to /dashboard/triggers?event=system.health.degraded with the event filter pre-filled.
The state-change detector reads from health_snapshots (the row written on every /v1/health call). It only emits on transitions — back-to-back degraded snapshots produce one event, not many.
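The transition-only behaviour can be illustrated with a short sketch. `detectTransition` and `eventsFor` are hypothetical names, not the detector's real API. Two points here are assumptions rather than documented behaviour: the first snapshot (no previous row) emits nothing, and a move between degraded and down emits nothing because neither side is healthy.

```typescript
type Overall = "healthy" | "degraded" | "down";
type HealthEvent = "system.health.degraded" | "system.health.recovered";

// Compares the previous snapshot's status with the new one.
function detectTransition(prev: Overall | null, next: Overall): HealthEvent | null {
  if (prev === null) return null; // assumption: no event without a prior snapshot
  const wasHealthy = prev === "healthy";
  const isHealthy = next === "healthy";
  if (wasHealthy && !isHealthy) return "system.health.degraded";
  if (!wasHealthy && isHealthy) return "system.health.recovered";
  return null; // no change in healthy/unhealthy state
}

// Replays a sequence of snapshot statuses, oldest first, and collects events.
function eventsFor(history: Overall[]): HealthEvent[] {
  const events: HealthEvent[] = [];
  let prev: Overall | null = null;
  for (const next of history) {
    const event = detectTransition(prev, next);
    if (event !== null) events.push(event);
    prev = next;
  }
  return events;
}
```

Replaying healthy, degraded, degraded, healthy through `eventsFor` yields one degraded event followed by one recovered event.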
Snapshots and retention
Every call to /v1/health writes a row in health_snapshots (migration 0063). The schema captures the full checks object, the derived status, and an observed_at timestamp. This gives operators a rolling audit trail of health transitions independent of the live page.
The gc-health-snapshots.ts script runs daily and deletes rows older than 7 days. Override HEALTH_SNAPSHOT_RETENTION_DAYS if you want a longer trail.
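As a rough illustration of the retention window, the sketch below computes the cutoff timestamp before which rows would be deleted. It is not the contents of gc-health-snapshots.ts: `retentionDays` and `retentionCutoff` are hypothetical helpers, and how the real script parses or validates HEALTH_SNAPSHOT_RETENTION_DAYS is an assumption.

```typescript
const DEFAULT_RETENTION_DAYS = 7;
const DAY_MS = 24 * 60 * 60 * 1000;

// Reads the override from an environment-like map; falls back to 7 days
// when the variable is unset or not a positive integer (assumed behaviour).
function retentionDays(env: Record<string, string | undefined>): number {
  const raw = env["HEALTH_SNAPSHOT_RETENTION_DAYS"];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_RETENTION_DAYS;
}

// Rows with observed_at earlier than this instant fall outside the window.
function retentionCutoff(now: Date, days: number): Date {
  return new Date(now.getTime() - days * DAY_MS);
}
```

With the default of 7 days, a run at 2024-01-08T00:00:00Z would use a cutoff of 2024-01-01T00:00:00Z.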
Liveness vs health
There are two endpoints; do not confuse them.
- GET /health: unauthenticated, simple { status, uptime_ms } shape, intended for load balancer liveness probes. Returns 200 as long as the process is up.
- GET /v1/health: dual-auth (bearer or service-key), the consolidated 6-check rollup. This is what /dashboard/health consumes. Always returns HTTP 200; parse the status field.
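Because GET /v1/health reports problems in the body and not in the HTTP status, a monitor has to parse the response. A minimal sketch of that interpretation step follows; `interpretHealthResponse` and the "unreachable" and "malformed" verdicts are hypothetical additions for the monitor's own use, not values the API returns.

```typescript
type Overall = "healthy" | "degraded" | "down";
type Verdict = Overall | "unreachable" | "malformed";

const KNOWN_STATUSES: readonly Overall[] = ["healthy", "degraded", "down"];

// Takes the raw HTTP status and body text from a GET /v1/health call.
function interpretHealthResponse(httpStatus: number, body: string): Verdict {
  // A non-200 means the request itself failed (auth, proxy, crash),
  // since the endpoint is documented to return 200 in every health state.
  if (httpStatus !== 200) return "unreachable";
  let parsed: unknown;
  try {
    parsed = JSON.parse(body);
  } catch {
    return "malformed";
  }
  if (typeof parsed !== "object" || parsed === null) return "malformed";
  const status = (parsed as { status?: unknown }).status;
  // The body's status field is the source of truth.
  const match = KNOWN_STATUSES.find((s) => s === status);
  return match ?? "malformed";
}
```

A monitor built this way alerts on a degraded or down verdict even though the HTTP status was 200.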