Router observability

Operator walkthrough for /dashboard/router — per (provider × canonical_tool) attempts, success rate, p50 latency, last error, and a 24h sparkline.

The /dashboard/router page is the operator surface for the meta-tool router. It renders one row per (provider × canonical_tool) pair with the live attempt and success counters, the p50 upstream latency, the last error code (if any), and a 24-hour sparkline.

Operators open this page when:

  • A rail is failing in production and you need to confirm it is the provider, not the dispatch logic.
  • You are about to flip META_TOOL_MOCK=0 for an org and want to confirm the live rail is healthy first.
  • You added a new provider to the catalog and want to verify traffic is flowing through it.

This page is service-key only. The dashboard reads your service key from a cookie set during login and forwards it on every request. There is no end-user equivalent.
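As an illustration of the service-key requirement, here is a minimal Python sketch of building an authenticated request to the stats endpoint. The header name X-Service-Key and the key value are assumptions for the example; in the real dashboard the key comes from the login cookie and is forwarded automatically.

```python
import urllib.request

# Illustrative only: the real dashboard reads the service key from a cookie
# set at login. The header name below is an assumption, not documented.
SERVICE_KEY = "sk_live_example"

def stats_request(base_url: str) -> urllib.request.Request:
    """Build (but do not send) a service-key-authenticated stats request."""
    req = urllib.request.Request(f"{base_url}/v1/meta-tools/stats")
    req.add_header("X-Service-Key", SERVICE_KEY)  # assumed header name
    return req

req = stats_request("https://api.example.com")
```

A bearer token in place of the service key would get a 403 from these endpoints, per the API reference below.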

Columns explained

  • provider (from provider_success_telemetry.provider_id): which catalog provider serviced the calls (e.g. asaas, mercadopago, stripe).
  • canonical_tool (from provider_success_telemetry.canonical_tool): which meta-tool routed here, one of codespar_pay, codespar_charge, codespar_invoice, codespar_notify, codespar_ship, codespar_crypto_pay, codespar_kyc.
  • attempts (rolled up): total upstream calls in the rollup window.
  • successes (rolled up): calls that returned a 2xx upstream.
  • success_rate (derived): successes / attempts. Renders red below 95%, amber 95–99%, green at 99% and above.
  • latency_p50_ms (rolled up): 50th-percentile upstream call latency. This is the provider's own latency, not your dispatch overhead.
  • last_error_code (rolled up): most recent non-success error code seen on this rail. Empty when the rail is clean.
  • sparkline (hourly buckets): 24 hourly points of latency_p50_ms. Gaps render as null breaks in the line, not zero.

The rolled-up table comes from a single call to GET /v1/meta-tools/stats. The sparkline next to each row fans out one GET /v1/meta-tools/stats/hourly?provider_id=X&canonical_tool=Y per row.
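The derived success_rate column and its color bands can be sketched as follows. This is a minimal illustration, assuming the stats payload exposes attempts and successes fields mirroring the column names above.

```python
def success_rate(attempts: int, successes: int):
    """successes / attempts, or None when the rail saw no traffic."""
    if attempts == 0:
        return None
    return successes / attempts

def rate_band(rate: float) -> str:
    """Color band used by the table: red < 95%, amber 95-99%, green >= 99%."""
    if rate < 0.95:
        return "red"
    if rate < 0.99:
        return "amber"
    return "green"

rate = success_rate(attempts=1200, successes=1191)
print(rate_band(rate))  # prints: green
```

Note the zero-attempts guard: a rail with no traffic has no meaningful rate, which matters for the "zero attempts" scenario below.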

What good looks like

At steady state on a healthy rail:

  • Success rate ≥ 99%. Pix charge rails (Asaas, Mercado Pago, iugu, Stone) typically sit at 99.5%+, and card rails (Stripe, Mercado Pago) sit in the same range. Boleto is structurally lower because of buyer-side cancellations.
  • latency_p50_ms under the provider's published SLA. Asaas Pix charge p50 ~400ms; Stripe charge p50 ~600ms; Melhor Envio quote p50 ~250ms. Sustained values 2× the published SLA mean something is wrong.
  • last_error_code empty or rotating through one-off codes. A persistent error code (the same string row after row) is a strong signal of credentials drift or a provider-side breaking change.
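The three steady-state checks above can be folded into one triage sketch. The thresholds come from the bullets; the function shape and the three-identical-codes bar for "persistent" are assumptions for illustration.

```python
def rail_health(rate, p50_ms, sla_p50_ms, recent_error_codes):
    """Triage one rail against the steady-state expectations above.

    rate: success_rate over the rollup window (0..1)
    p50_ms / sla_p50_ms: observed vs published p50 latency
    recent_error_codes: last_error_code samples, oldest first, "" for clean
    """
    problems = []
    if rate < 0.99:
        problems.append("success_rate below 99%")
    if p50_ms > 2 * sla_p50_ms:
        problems.append("p50 latency sustained above 2x published SLA")
    codes = [c for c in recent_error_codes if c]
    # Three identical non-empty codes in a row is our (assumed) bar for
    # "persistent": a credentials-drift or breaking-change signal.
    if len(codes) >= 3 and len(set(codes)) == 1:
        problems.append(f"persistent error code: {codes[0]}")
    return problems or ["healthy"]

print(rail_health(0.995, 420, 400, ["", "", ""]))  # prints: ['healthy']
```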

What to do when it goes wrong

High error rate on one provider

The rail's success_rate is below 95% and last_error_code keeps reporting the same string every time you refresh.

  1. Open /dashboard/health, scroll to the connections panel. Confirm the org's connection for that provider is connected. A drifted credential surfaces here first.
  2. Check the provider's status page. Asaas, Stripe, and Mercado Pago all publish status. A real outage will show.
  3. If the provider is confirmed broken, set eligibility = false for that (org_id × provider_id × canonical_tool) row in meta_tool_eligibility. The router drains traffic to the next failover candidate within ~30s. Reverse the flag once the provider recovers.
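The drain behaviour in step 3 can be sketched with illustrative data. The real eligibility state lives server-side in meta_tool_eligibility and the router's failover ordering is its own; the shapes below are assumptions for the example.

```python
# Illustrative data shapes; the real eligibility table and failover order
# live server-side in meta_tool_eligibility and the router.
eligibility = {
    # (org_id, provider_id, canonical_tool) -> eligible?
    ("org_1", "asaas", "codespar_charge"): True,
    ("org_1", "mercadopago", "codespar_charge"): True,
}
failover_order = ["asaas", "mercadopago"]

def pick_provider(org_id, canonical_tool):
    """First eligible provider in failover order, as a draining router might."""
    for provider_id in failover_order:
        if eligibility.get((org_id, provider_id, canonical_tool)):
            return provider_id
    return None

assert pick_provider("org_1", "codespar_charge") == "asaas"
eligibility[("org_1", "asaas", "codespar_charge")] = False  # drain the rail
assert pick_provider("org_1", "codespar_charge") == "mercadopago"
```

Reversing the flag restores the original selection, which is why the runbook says to flip it back once the provider recovers.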

p50 latency degraded

latency_p50_ms is 2× or more the steady-state value.

  1. Look at the sparkline. A clean step at one hour boundary points at a deploy or a provider-side incident; a rising slope points at a capacity issue.
  2. Cross-reference with /dashboard/health — if the telemetry check is idle or counts are unexpectedly low, the issue may be on your side, not the provider's.
  3. If the latency is real and persistent, consider draining via meta_tool_eligibility while the provider investigates.
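The step-versus-slope reading in step 1 can be sketched as a heuristic over the 24 hourly points, with None standing in for gap buckets. The thresholds here are illustrative, not what the dashboard itself uses.

```python
def classify_latency_shape(points):
    """Rough step-vs-slope heuristic over 24 hourly p50 points.

    points: latency_p50_ms per hour, None for gap buckets (never zero).
    Returns "step" when a single hour-boundary jump explains the rise,
    "slope" for a gradual climb, "flat" otherwise.
    """
    xs = [p for p in points if p is not None]
    if len(xs) < 3:
        return "flat"
    total = xs[-1] - xs[0]
    if total <= 0.2 * xs[0]:          # no meaningful rise
        return "flat"
    biggest_jump = max(b - a for a, b in zip(xs, xs[1:]))
    if biggest_jump >= 0.8 * total:   # one boundary: deploy or incident
        return "step"
    return "slope"                    # steady climb: capacity pressure

print(classify_latency_shape([400] * 12 + [900] * 12))  # prints: step
```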

Zero attempts on a rail you expect traffic on

The row exists but attempts = 0 over the rollup window.

  1. Check whether the tenant has a connection for the provider — GET /v1/connections?provider_id=X. If there is no connection, the dispatcher cannot pick this rail.
  2. Check the classifier output for this tenant's recent calls — open /dashboard/router-candidates to see whether the LLM is putting traffic somewhere unexpected.
  3. Confirm the meta-tool is actually being invoked. GET /v1/tool-calls?canonical_tool=codespar_charge&limit=10 and inspect the chosen provider_id per call.
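Step 3 can be sketched locally: given recent tool-call records, tally which provider_id the dispatcher actually chose. The record shape below is assumed from the endpoint above; only the two fields used are taken from this page.

```python
from collections import Counter

# Record shape assumed from GET /v1/tool-calls; illustrative data.
recent_calls = [
    {"canonical_tool": "codespar_charge", "provider_id": "mercadopago"},
    {"canonical_tool": "codespar_charge", "provider_id": "mercadopago"},
    {"canonical_tool": "codespar_pay", "provider_id": "stripe"},
]

def chosen_providers(calls, canonical_tool):
    """Tally which provider_id the dispatcher chose for one meta-tool."""
    return Counter(
        c["provider_id"] for c in calls if c["canonical_tool"] == canonical_tool
    )

# If the rail you expected (e.g. asaas) never appears in the tally, traffic
# is landing elsewhere: check connections and the classifier output.
print(chosen_providers(recent_calls, "codespar_charge"))
```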

API reference

The underlying endpoints — GET /v1/meta-tools/stats for the rollup and GET /v1/meta-tools/stats/hourly?provider_id=X&canonical_tool=Y for the 24-bucket sparkline — are documented under Sessions API. Both are service-key only and return 403 for bearer-token requests.
