Health rollup

Operator walkthrough for /dashboard/health — 30s polling rollup of /v1/health, /v1/meta-tools/stats, and /v1/connections with degraded/recovered alerting.

The /dashboard/health page is the operator's at-a-glance status surface. It polls GET /v1/health every 30 seconds and composes three sections: the six health checks, the rollup stats panel, and the per-tenant connections panel. Each section fetches independently — Promise.allSettled on the SSR boundary means a slow connections call does not block the checks list from rendering.

When a section's underlying data is older than 5 minutes, a small "stale" pill appears in its header. The page never blocks on a single failing dependency.
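
The per-section independence and the 5-minute stale pill can be sketched as a small mapper over Promise.allSettled results. This is an illustrative sketch, not the dashboard's actual source; the type and field names (SectionView, observed_at, etc.) are assumptions.

```typescript
// One entry from Promise.allSettled: either a fetched payload or a failure.
type SettledSection<T> =
  | { status: "fulfilled"; value: T }
  | { status: "rejected"; reason?: unknown };

interface SectionView<T> {
  data: T | null;   // null when this section's fetch failed
  failed: boolean;  // this section renders an error state; others are unaffected
  stale: boolean;   // "stale" pill when the data is older than 5 minutes
}

const STALE_AFTER_MS = 5 * 60 * 1000;

function toSectionView<T extends { observed_at: string }>(
  result: SettledSection<T>,
  nowMs: number,
): SectionView<T> {
  if (result.status === "rejected") {
    // A failing dependency never blocks the page; the section just degrades.
    return { data: null, failed: true, stale: false };
  }
  const ageMs = nowMs - Date.parse(result.value.observed_at);
  return { data: result.value, failed: false, stale: ageMs > STALE_AFTER_MS };
}
```

In the SSR boundary the three fetches would be passed together, e.g. `(await Promise.allSettled([checks, stats, connections])).map(r => toSectionView(r, Date.now()))`, so each section resolves on its own.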

The 6 health checks

GET /v1/health returns a checks object with six named entries. Each carries a status string plus a small payload of evidence.

Each entry names what the check tests, what healthy looks like, and what degraded / down looks like.

  • db: a basic SELECT 1 against the primary database. Healthy is ok: true with latency under 50ms; ok: false flips overall status to down, the page renders a banner, and the API still returns 200 so monitors keep parsing.
  • vault: reachability of the vault_secrets table (credential reads work). Healthy is ok: true; ok: false flips overall status to down, and tool calls cannot resolve provider credentials.
  • embeddings: pgvector populated row count over mcp_tools total. populated_pct ≥ 50% → ok; < 50% → low; 0 → empty. Discovery still works (lexical fallback) but result quality drops.
  • fx_rates: fetched_at on the active USD/BRL row. ≤ 72h old → fresh; > 72h → stale; missing row → missing. Cross-currency pricing falls back to the last known rate.
  • telemetry: count of provider_success_telemetry rows in the last hour. > 0 → active; = 0 → idle, meaning either nothing is calling the router or telemetry writes are blocked.
  • connections: per-tenant connected_accounts count. ≥ 3 → wired; 1–2 → partial; 0 → none, meaning the operator likely needs to finish onboarding.
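
A body shaped like the one below would be consistent with the checks above. This is an illustrative example, not a captured response; field names beyond the documented ones (latency_ms, rows_last_hour, connected_accounts counts) are assumptions.

```typescript
// Hypothetical GET /v1/health body: embeddings is low, so the derived
// overall status is "degraded" while db and vault are still ok.
const exampleHealth = {
  status: "degraded",
  checks: {
    db:          { ok: true, latency_ms: 12 },                       // latency_ms assumed
    vault:       { ok: true },
    embeddings:  { status: "low", populated_pct: 41 },
    fx_rates:    { status: "fresh", fetched_at: "2024-01-01T00:00:00Z" },
    telemetry:   { status: "active", rows_last_hour: 18 },           // rows_last_hour assumed
    connections: { status: "partial", connected_accounts: 2 },
  },
} as const;
```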

Overall status mapping

The top-level status field is derived, not stored:

  • down — iff db.ok = false OR vault.ok = false. The router cannot operate.
  • degraded — iff any non-DB/vault check is in a non-nominal state (embeddings.low, fx_rates.stale, telemetry.idle, connections.partial, etc.). The router operates with reduced quality.
  • healthy — every check is in its nominal state.
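
The three rules above fit in one small function. A minimal sketch, assuming the check statuses documented earlier; the function and type names are illustrative, not the router's actual code.

```typescript
interface Checks {
  db: { ok: boolean };
  vault: { ok: boolean };
  embeddings: "ok" | "low" | "empty";
  fx_rates: "fresh" | "stale" | "missing";
  telemetry: "active" | "idle";
  connections: "wired" | "partial" | "none";
}

function deriveStatus(c: Checks): "healthy" | "degraded" | "down" {
  // db or vault failing wins: the router cannot operate at all.
  if (!c.db.ok || !c.vault.ok) return "down";
  // Every remaining check must be in its nominal state to be healthy.
  const nominal =
    c.embeddings === "ok" &&
    c.fx_rates === "fresh" &&
    c.telemetry === "active" &&
    c.connections === "wired";
  return nominal ? "healthy" : "degraded"; // operating with reduced quality
}
```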

The endpoint always returns HTTP 200. The status field is the source of truth — monitors that branch on HTTP status instead of parsing the body will miss every degraded transition.

Alerting

State transitions emit events. When the derived status flips:

  • healthy → degraded (or down) emits system.health.degraded.
  • degraded (or down) → healthy emits system.health.recovered.

Both events fan out through the standard trigger pipeline. The rollup page renders a "Get notified" CTA that deep-links to /dashboard/triggers?event=system.health.degraded with the event filter pre-filled.
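
The CTA's deep-link is just the event name carried as a query parameter. A hypothetical helper (the function name is illustrative):

```typescript
// Build the pre-filled triggers link for a given event name.
function triggerLink(event: string): string {
  const qs = new URLSearchParams({ event });
  return `/dashboard/triggers?${qs.toString()}`;
}
```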

The state-change detector reads from health_snapshots (the row written on every /v1/health call). It only emits on transitions — back-to-back degraded snapshots produce one event, not many.
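
The emit-on-transition rule amounts to comparing the latest derived status against the previous snapshot's. A sketch under the rules stated above; the page does not specify what a degraded ↔ down edge emits, so this sketch treats it as no event.

```typescript
type Status = "healthy" | "degraded" | "down";

// Returns the event to emit for this snapshot pair, or null for no event.
function transitionEvent(prev: Status | null, curr: Status): string | null {
  if (prev === null || prev === curr) return null; // no edge: back-to-back degraded emits nothing
  if (prev === "healthy") return "system.health.degraded";   // healthy -> degraded/down
  if (curr === "healthy") return "system.health.recovered";  // degraded/down -> healthy
  return null; // degraded <-> down: unspecified in the doc, assumed silent here
}
```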

Snapshots and retention

Every call to /v1/health writes a row in health_snapshots (migration 0063). The schema captures the full checks object, the derived status, and an observed_at timestamp. This gives operators a rolling audit trail of health transitions independent of the live page.

The gc-health-snapshots.ts script runs daily and deletes rows older than 7 days. Override HEALTH_SNAPSHOT_RETENTION_DAYS if you want a longer trail.
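
The retention cutoff the GC script needs can be sketched as follows; only the HEALTH_SNAPSHOT_RETENTION_DAYS variable and 7-day default come from the page, the rest is illustrative.

```typescript
// Compute the oldest observed_at to keep; rows before this get deleted,
// e.g. via: DELETE FROM health_snapshots WHERE observed_at < $cutoff
function retentionCutoff(
  nowMs: number,
  env: Record<string, string | undefined>,
): Date {
  const days = Number(env.HEALTH_SNAPSHOT_RETENTION_DAYS ?? "7"); // default 7-day trail
  return new Date(nowMs - days * 24 * 60 * 60 * 1000);
}
```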

Liveness vs health

There are two endpoints; do not confuse them.

  • GET /health — unauthenticated, simple { status, uptime_ms } shape, intended for load balancer liveness probes. Returns 200 as long as the process is up.
  • GET /v1/health — dual-auth (bearer or service-key), the consolidated 6-check rollup. This is what /dashboard/health consumes. Always returns HTTP 200; parse the status field.
