
Test Mode

Hosted test mode lets you run an agent against the CodeSpar runtime with inline mock declarations — deterministic responses, no provider OAuth, full AgentGate governance. Declare mocks at session create; assert on the round-trip in your tests.


@codespar/sdk v0.9.0

Test mode requires @codespar/sdk@0.10.0 (TS) / codespar==0.10.0 (Python) or later.

Hosted test mode is the customer-controlled fixture path on the CodeSpar runtime. When the runtime is in test mode, every external tool dispatch must match a mock you declared on cs.create({ mocks: {...} }); the runtime substitutes those fixtures for upstream provider calls; policy, audit, and commerce-memory still fire on every tool invocation. The intent is one cold-signup-to-green-test loop in under ten minutes with no provider OAuth, and the same agent code running unchanged against a live-mode runtime where the real providers handle dispatch.

This page is the reference for everything mocks-related — what the syntax looks like, how test mode behaves, which error envelopes you'll see, and how to point the SDK at either the hosted backend or a local OSS runtime via CODESPAR_BASE_URL. Quickstart-style end-to-end walkthroughs live in Quickstart (TS) and Quickstart (Python); this page is what those pages link to when they say "see the canonical mocks doc."

What test mode is

Test mode is a property of the runtime, not of any individual session. Where the property lives differs by deployment:

  • Managed backend (api.codespar.dev): test mode is per-project. A project's environment field, set at creation and immutable, is the master switch. environment === 'test' means every session that resolves to that project runs in test mode; environment === 'live' means none of them do. The API key prefix (csk_test_* vs csk_live_*) is a developer-visual-signal convention derived one-to-one from the project's environment at mint time — it's not a separate authorization gate.
  • OSS self-hosted (via CODESPAR_BASE_URL): test mode is per-deployment. The server process's CODESPAR_TEST_MODE_ENABLED env var is the master switch — truthy (true or 1, case-insensitive) means the whole deployment is in test mode, anything else means it's in live mode. Setting the env var on the client process has no effect; it must be truthy on the server.
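The truthiness rule for CODESPAR_TEST_MODE_ENABLED can be sketched in a few lines. This is an illustration of the rule as stated above, not the runtime's actual parser; the function name is hypothetical, and the sketch assumes no whitespace trimming because the page does not mention any.

```python
from typing import Optional


def is_test_mode_enabled(raw: Optional[str]) -> bool:
    """Return True only for 'true' or '1', compared case-insensitively.

    Hypothetical helper mirroring the documented rule: any other value,
    including an unset variable (None), means live mode.
    """
    if raw is None:
        return False
    return raw.lower() in {"true", "1"}
```

On the server this would be fed the value of the environment variable, for example `is_test_mode_enabled(os.environ.get("CODESPAR_TEST_MODE_ENABLED"))`.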

When the runtime is in test mode, every external tool dispatch in every session — whether your code drove it via session.execute() or the LLM picked it inside session.send() — must match a declared mock entry. A dispatch with no matching mock returns tool_not_mocked and the upstream provider is never called. A session that doesn't declare mocks at all can't dispatch any tools in test mode — declare the mocks the test will exercise, or run the same code against a live-mode runtime.

Built-in metadata tools bypass this gate. The current allow-list is codespar_list_tools on OSS and codespar_discover plus codespar_manage_connections on the managed backend — runtime-introspection operations with no external side effects. Any future built-in that reaches external state declares its fixtures in session.mocks like a normal tool; the allow-list does not grow to cover side-effecting calls.

Minting a test key

Test keys are minted from the dashboard at Dashboard → API Keys (/dashboard/api-keys). The mint modal defaults the new key's prefix to match the active project's environment, so a key minted against a test project is always csk_test_* and a key minted against a live project is always csk_live_*. New orgs get a test project plus a csk_test_* key auto-created at signup; the key value renders once on the post-signup landing page.

If you need to mint a csk_live_* key against a live project from a session whose default project is test (or vice versa), the explicit environment toggle on /dashboard/api-keys is still available; the default simply covers the common case.

Declaring mocks on cs.create

The mocks field on cs.create is a map keyed by canonical tool name (the server/tool form, e.g. asaas/create_payment), valued by either a static object or an array of objects.

TypeScript:

import { CodeSpar } from "@codespar/sdk";

const cs = new CodeSpar({ apiKey: process.env.CODESPAR_API_KEY });

// Declare the fixture at session create.
const session = await cs.create("user_test", {
  servers: ["asaas"],
  mocks: {
    "asaas/create_payment": { id: "pay_test_42", status: "PENDING" },
  },
});

// The runtime returns the declared fixture instead of calling Asaas.
const result = await session.execute("asaas/create_payment", { value: 100 });
console.log(result.id); // "pay_test_42"

Python:

import asyncio
import os

from codespar import CodeSpar

cs = CodeSpar(api_key=os.environ["CODESPAR_API_KEY"])


async def main():
    # Declare the fixture at session create.
    session = await cs.create("user_test", {
        "servers": ["asaas"],
        "mocks": {
            "asaas/create_payment": {"id": "pay_test_42", "status": "PENDING"},
        },
    })

    # The runtime returns the declared fixture instead of calling Asaas.
    result = await session.execute("asaas/create_payment", {"value": 100})
    print(result["id"])  # "pay_test_42"


asyncio.run(main())

The TS and Python SDKs serialize the same mocks payload to byte-identical JSON. Pick whichever language your test suite is in; the wire shape and the runtime semantics are the same.

Canonical tool names

Mock keys must match the canonical server/tool form:

^[a-z0-9][a-z0-9-]*\/[a-z0-9][a-z0-9_-]*$

So asaas/create_payment, nuvem-fiscal/create_nfe, and z-api/send_message all match. The OSS runtime uses a double-underscore form (asaas__create_payment) for the same tool — that form is not rewritten by the SDK and will surface as mocks_invalid from the hosted backend at session-create time. Migrating mocks from an OSS test suite to hosted is mechanical: replace __ with /.

Unknown tool names that match the regex (canonical-shape but not in any installed catalog) are accepted at session create and surface their failure at tool-invocation time. The validation is structural at the boundary; semantic checks happen when the LLM or your code dispatches the call.
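A pre-flight check in a test suite can catch non-canonical keys before they reach cs.create. The sketch below uses the pattern quoted above; the helper names are hypothetical and not part of the SDK, and the migration assumes the first double underscore in an OSS-style name is the server/tool separator.

```python
import re

# Pattern quoted from this page.
CANONICAL_TOOL_NAME = re.compile(r"^[a-z0-9][a-z0-9-]*\/[a-z0-9][a-z0-9_-]*$")


def is_canonical(name: str) -> bool:
    """True when the key has the canonical server/tool shape."""
    return CANONICAL_TOOL_NAME.fullmatch(name) is not None


def oss_to_canonical(name: str) -> str:
    """Rewrite an OSS-style name (server__tool) to server/tool.

    Assumption: only the first double underscore is the separator.
    """
    return name.replace("__", "/", 1)


def migrate_mocks(mocks: dict) -> dict:
    """Rewrite every key and fail loudly if any result is still non-canonical."""
    migrated = {oss_to_canonical(key): value for key, value in mocks.items()}
    bad = [key for key in migrated if not is_canonical(key)]
    if bad:
        raise ValueError(f"non-canonical mock keys: {bad}")
    return migrated
```

Running `migrate_mocks` once over an OSS fixture map gives a map whose keys pass the structural check; whether each tool exists in a catalog is still only discovered at invocation time.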

Static vs stateful mocks

A mock value is either a single object (static — returned on every call) or an array of objects (stateful — consumed in order across calls in the same session). The runtime keeps a per-session, per-canonical-tool-name counter; each successful return from the mock store advances it by one.

TypeScript:

const session = await cs.create("user_test", {
  // Both servers are listed because both are mocked below.
  servers: ["asaas", "nuvem-fiscal"],
  mocks: {
    // Static: every call returns the same fixture.
    "asaas/create_payment": { id: "pay_test_42", status: "PENDING" },

    // Stateful: consumed in order, one entry per call.
    "nuvem-fiscal/create_nfe": [
      { id: "nfe_1", status: "AUTHORIZED" },
      { id: "nfe_2", status: "AUTHORIZED" },
    ],
  },
});

Python:

# Runs inside an async function.
session = await cs.create("user_test", {
    # Both servers are listed because both are mocked below.
    "servers": ["asaas", "nuvem-fiscal"],
    "mocks": {
        # Static: every call returns the same fixture.
        "asaas/create_payment": {"id": "pay_test_42", "status": "PENDING"},

        # Stateful: consumed in order, one entry per call.
        "nuvem-fiscal/create_nfe": [
            {"id": "nfe_1", "status": "AUTHORIZED"},
            {"id": "nfe_2", "status": "AUTHORIZED"},
        ],
    },
})

A call that's denied by policy or queued for approval does not advance the counter; an approval-replay that subsequently executes advances it by one when the replay succeeds. When a stateful mock is exhausted (call N+1 against an N-entry array), the runtime returns a tool_result block with is_error: true and code: "mocks_exhausted"; see Response envelopes below.
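The counter semantics can be modelled with a small in-memory class. This is a local sketch for reasoning about test expectations, not the runtime's implementation; the class name, method name, and message strings are hypothetical, and policy and approval handling are left out.

```python
class MockStoreModel:
    """Toy model of the per-session mock store described above."""

    def __init__(self, mocks: dict):
        self._mocks = mocks
        self._counters = {}  # canonical tool name -> successful returns so far

    def next_result(self, tool_name: str) -> dict:
        if tool_name not in self._mocks:
            return {"is_error": True, "code": "tool_not_mocked",
                    "tool_name": tool_name, "message": "no mock declared"}
        fixture = self._mocks[tool_name]
        used = self._counters.get(tool_name, 0)
        if isinstance(fixture, list):
            if used >= len(fixture):
                # Exhausted: the counter does not advance on a failed return.
                return {"is_error": True, "code": "mocks_exhausted",
                        "message": "stateful mock fully consumed"}
            result = fixture[used]
        else:
            result = fixture  # static: same fixture on every call
        self._counters[tool_name] = used + 1
        return result
```

With a two-entry array, the first two calls return the entries in order and the third returns the mocks_exhausted envelope; a static mock never runs out.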

Dispatch behavior in test mode

When the runtime is in test mode, every external tool dispatch needs a matching mock entry — there is no fallthrough to a real upstream. A dispatch with no match returns a tool_result block with is_error: true and code: "tool_not_mocked" (HTTP 422 on the catalog-routed /v1/sessions/:id/execute path; the chat-loop surfaces it as a tool_result block on the next LLM turn). The envelope fires for three failure modes:

  • The session's mocks map has no entry for the dispatched canonical tool name.
  • The session was created with no mocks field at all (or with {}) — every dispatch fails because there's nothing declared.
  • The dispatched canonical name has an unknown server prefix (unknown-server/tool against a runtime that has no unknown-server installed).

There's no per-tool passThrough flag, no escape hatch. If you want partial real-upstream coverage in the same test, you can't — that's what live-mode runtimes are for. The intended pattern is: one session per test, declare every tool the test will exercise, assert on the round-trip.

The footgun this closes is the misspelled-tool-name silent-charge case. A customer who declares mocks: { "asaas/create_paymet": ... } (typo) and watches the LLM dispatch asaas/create_payment sees tool_not_mocked on the wire instead of a silent real-upstream call against Asaas sandbox or production. Loud at the first call site, not silent at the audit-page review.

The approval-replay path is a special case. A tool call that returns approval_required and is later approved replays through the same mock-lookup gate. If the mock has been removed or the original entry is gone by the time the replay fires, the runtime returns mocks_engine_error rather than tool_not_mocked — the call was authorized once and the runtime distinguishes "no mock declared" from "mock-store inconsistency during replay."
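The distinction between the two codes on a mock-store miss reduces to one branch. The function below is a hypothetical illustration of that rule only; it does not reflect how the runtime tracks replays internally.

```python
from typing import Optional


def miss_code(mocks: dict, tool_name: str, is_approval_replay: bool) -> Optional[str]:
    """Return the error code for a mock-store miss, or None on a hit.

    A first-time dispatch with no mock is tool_not_mocked; a miss while
    replaying an already-approved call is mocks_engine_error.
    """
    if tool_name in mocks:
        return None
    return "mocks_engine_error" if is_approval_replay else "tool_not_mocked"
```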

When the runtime is in live mode (managed: project.environment === 'live'; OSS: CODESPAR_TEST_MODE_ENABLED unset or not truthy), dispatch behavior is unchanged from pre-test-mode CodeSpar: every call routes to the upstream provider per the catalog's configuration, no mock lookup happens, and the mocks field on cs.create is rejected at session-create time with mocks_not_permitted (see the Response envelopes tables).

What still fires when a call is mocked

A mocked tool call is not a no-op governance-wise. The runtime evaluates in this fixed order on every invocation:

  1. The non-overridable deny-list (fund-transfer caps, NF-e for contested carts, wallet-policy overrides, bulk outbound thresholds, cross-tenant commitments — see Projects → safety rails). A test cannot mock past these.
  2. The policy engine (allowed / approval-required / deny). Your approval-required rule on *create_payment* fires whether or not a mock substitutes the output — policy evaluates the agent's intent (tool name + input), which is identical mocked or not.
  3. The session mock store. In test mode, a canonical-name miss returns tool_not_mocked; an exhausted stateful array returns mocks_exhausted. In live mode, this step is skipped.
  4. Upstream provider call — reached only when the runtime is in live mode. In test mode, every external dispatch terminates at step 3 (built-in metadata tools from the allow-list are the only exception; they bypass steps 2 and 3 and run their introspection work directly).
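The four-step order above can be written as a single function that reports where evaluation stops. This is a simplified sketch: the function and its return labels are hypothetical, the policy decision is passed in rather than computed, and stateful exhaustion and the built-in allow-list are not modelled.

```python
from typing import Mapping


def terminating_step(
    tool_name: str,
    *,
    test_mode: bool,
    on_deny_list: bool,
    policy_decision: str,  # "allowed", "approval-required", or "deny"
    mocks: Mapping[str, object],
) -> str:
    """Return a label for the step at which an invocation terminates."""
    # Step 1: the non-overridable deny-list always runs first.
    if on_deny_list:
        return "deny_list"
    # Step 2: policy evaluates tool name + input, mocked or not.
    if policy_decision != "allowed":
        return f"policy:{policy_decision}"
    # Step 3: the mock store is consulted only in test mode.
    if test_mode:
        return "mock_store:hit" if tool_name in mocks else "mock_store:tool_not_mocked"
    # Step 4: the upstream provider is reachable only in live mode.
    return "upstream"
```

The point of the ordering is visible in the second branch: an approval-required rule stops the call before the mock store is ever consulted, so a mock cannot mask a policy decision.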

Audit chain and commerce-memory capture record what happened on every completed call, mocked or real. Project boundaries scope test events from live events (a project's environment field is immutable, so the audit chain is either all-test or all-live by construction — no per-event mocked flag is needed to filter). Commerce-memory capture fires only when the existing per-tool predicate allows it; mocking does not change the predicate's input.

The practical consequence: a customer authoring an approval-required rule on *create_payment* and testing it against a mocked asaas/create_payment sees a queued approval row in /dashboard/approvals and gets the approval_required tool_result back on the next LLM turn. On human approval of the queued row, the runtime executes the originally deferred call, consults the mock store, and returns the mocked fixture. The governance round-trip matches what production does; the only difference is that the upstream response is a deterministic fixture.

Response envelopes

Five tool_result envelopes can surface from a test-mode session at tool-invocation time. All five share the same discriminated-union wire shape — a code discriminant plus envelope-specific sibling fields — and the SDK ships type-narrowed guards for each. On the catalog-routed /v1/sessions/:id/execute path, the mock-store envelopes (mocks_exhausted, mocks_engine_error, tool_not_mocked) surface as HTTP 422 with the envelope in the response body; on the chat-loop they surface as tool_result blocks on the next LLM turn.

| code | When it surfaces | Sibling fields |
| --- | --- | --- |
| policy_denied | Policy engine returned deny for the call | rule_id, message |
| approval_required | Policy engine returned approval-required; row written to pending_approvals | approval_id, expires_at, message |
| mocks_exhausted | Stateful mock's array fully consumed; next call attempted | message |
| mocks_engine_error | Internal error evaluating the mock store (typically a malformed fixture caught at runtime, or a missing mock on the approval-replay path where the original call was authorized) | message |
| tool_not_mocked | Runtime is in test mode and the dispatched call has no matching mock. Three failure modes share the envelope: (a) mocks map has no entry for the canonical name, (b) session was created without a mocks field, (c) canonical name has an unknown server prefix | tool_name, message |

Plus three create-time envelopes that surface from cs.create itself (not from a tool call); mocks_not_permitted carries a different HTTP status on each runtime:

| code | HTTP | When |
| --- | --- | --- |
| mocks_not_permitted (managed backend) | 403 | Project's environment is live; mocks field rejected |
| mocks_not_permitted (OSS self-hosted) | 501 | Server's CODESPAR_TEST_MODE_ENABLED env var is unset or not truthy; mocks field rejected |
| mocks_invalid | 400 | mocks payload shape is malformed (non-canonical tool name, non-object value, etc.) |
| mocks_payload_too_large | 413 | Serialized mocks payload exceeds 64 KiB |

The code discriminant is identical across both runtimes by design — a catch (err) { if (err.code === "mocks_not_permitted") ... } block ports unchanged between managed and self-hosted deployments. Only the HTTP status differs, reflecting the gate semantics: 403 (forbidden by tenancy) on the managed backend, 501 (feature not enabled in this deployment) on OSS.

The full wire shape — every field, every enum value, every audit-event name — is published as a versioned JSON Schema at codespar-enterprise/packages/api/contracts/hosted-test-mode-wire-v1.json. That schema is the source of truth; a CI sweep on the hosted-backend deploy fails any divergence between the schema and the runtime's actual responses.

Asserting in your tests

The SDK ships type-narrowed guards for the five tool_result codes — isPolicyDenied, isApprovalRequired, isMocksExhausted, isMocksEngineError, isToolNotMocked — at packages/core/src/tool-result-codes.ts (TS) and the parallel is_* predicates at packages/python/src/codespar/tool_result_codes.py (Python). The guards validate both the code discriminant and the envelope's required sibling fields, so a malformed response (well-formed code but missing rule_id on a policy_denied) returns false from the guard rather than narrowing to a broken type.

TypeScript:

import {
  isPolicyDenied,
  isApprovalRequired,
  isMocksExhausted,
  isToolNotMocked,
} from "@codespar/sdk";

const result = await session.send("Charge R$5 via Pix");

for (const tc of result.tool_calls) {
  if (isPolicyDenied(tc)) {
    // tc is narrowed: rule_id, message available
    expect(tc.rule_id).toBe("payments_require_approval");
  }
  if (isApprovalRequired(tc)) {
    // tc.approval_id is the row id in /dashboard/approvals
    expect(tc.approval_id).toMatch(/^apr_/);
  }
  if (isMocksExhausted(tc) || isToolNotMocked(tc)) {
    throw new Error(`mock-store failure: ${tc.message}`);
  }
}

Python:

from codespar import (
    is_policy_denied,
    is_approval_required,
    is_mocks_exhausted,
    is_tool_not_mocked,
)

result = await session.send("Charge R$5 via Pix")

for tc in result["tool_calls"]:
    if is_policy_denied(tc):
        assert tc["rule_id"] == "payments_require_approval"
    if is_approval_required(tc):
        assert tc["approval_id"].startswith("apr_")
    if is_mocks_exhausted(tc) or is_tool_not_mocked(tc):
        raise AssertionError(f"mock-store failure: {tc['message']}")
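To make the "code plus required sibling fields" behaviour of the guards concrete, here is a local sketch of that check. It is not the SDK's implementation: the function name is hypothetical, the field lists are taken from the tool_result table on this page, and only field presence is checked, not field types.

```python
# Required sibling fields per code, as listed in the tool_result table.
REQUIRED_FIELDS = {
    "policy_denied": ("rule_id", "message"),
    "approval_required": ("approval_id", "expires_at", "message"),
    "mocks_exhausted": ("message",),
    "mocks_engine_error": ("message",),
    "tool_not_mocked": ("tool_name", "message"),
}


def matches_envelope(block: object, code: str) -> bool:
    """True only if block carries this code AND every required sibling field."""
    if not isinstance(block, dict) or block.get("code") != code:
        return False
    return all(field in block for field in REQUIRED_FIELDS[code])
```

A block with code "policy_denied" but no rule_id fails the check, which is the case the guards are described as rejecting rather than narrowing to a broken type.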

Create-time envelopes (the 403 / 501 / 400 / 413 cases above) throw as CodesparApiError with the corresponding code field. Discriminate on e.code === "mocks_not_permitted" rather than parsing e.message.
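Branching on the code rather than the status or message keeps error handling portable across runtimes. The sketch below uses a local stand-in exception so it runs without the SDK; the stand-in class, its status attribute, and the hint strings are assumptions for illustration, and real code would catch the SDK's CodesparApiError instead.

```python
class CreateTimeErrorStandIn(Exception):
    """Local stand-in for the SDK's CodesparApiError (illustration only)."""

    def __init__(self, code: str, status: int):
        super().__init__(code)
        self.code = code
        self.status = status


# Hypothetical remediation hints keyed by the documented create-time codes.
HINTS = {
    "mocks_not_permitted": "runtime is in live mode; use a test project or enable test mode on the server",
    "mocks_invalid": "fix the mocks shape: canonical server/tool keys, object or array values",
    "mocks_payload_too_large": "shrink the serialized mocks payload to 64 KiB or less",
}


def hint_for(err: CreateTimeErrorStandIn) -> str:
    """Pick a hint from err.code only, never from the status or message."""
    return HINTS.get(err.code, "unrecognized code; inspect the response body")
```

Because the lookup ignores the status, the 403 from the managed backend and the 501 from a self-hosted runtime resolve to the same hint.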

Pointing the SDK at OSS or hosted

The hosted backend at api.codespar.dev is a strict superset of the OSS runtime — every capability the OSS runtime exposes, the hosted backend exposes too, plus the AgentGate governance layer (programmable wallet, audit, commercial memory, fiscal compliance) on top. A test authored against the OSS feature set runs unchanged against the hosted backend by flipping CODESPAR_BASE_URL; tests that depend on AgentGate capabilities target the hosted backend only.

# Run against the hosted backend (default).
CODESPAR_BASE_URL=https://api.codespar.dev npm test

# Run against a local OSS runtime.
CODESPAR_BASE_URL=http://localhost:8000 npm test

The CodeSpar constructor reads CODESPAR_BASE_URL from process.env (TS) / os.environ (Python) as the default when no explicit baseUrl option is passed. Same API key shape, same headers, same Bearer scheme, same paths — one env var is the only transport-config delta.

TypeScript:

import { CodeSpar } from "@codespar/sdk";

const cs = new CodeSpar({
  apiKey: process.env.CODESPAR_API_KEY,
  // baseUrl falls back to CODESPAR_BASE_URL, then api.codespar.dev.
});

it("mocks asaas/create_payment", async () => {
  const session = await cs.create("user_test", {
    servers: ["asaas"],
    mocks: { "asaas/create_payment": { id: "pay_test_42" } },
  });
  const r = await session.execute("asaas/create_payment", { value: 100 });
  expect(r.id).toBe("pay_test_42");
});

Python:

import os
from codespar import CodeSpar

cs = CodeSpar(api_key=os.environ["CODESPAR_API_KEY"])
# base_url falls back to CODESPAR_BASE_URL, then api.codespar.dev.

async def test_mocks_create_payment():
    session = await cs.create("user_test", {
        "servers": ["asaas"],
        "mocks": {"asaas/create_payment": {"id": "pay_test_42"}},
    })
    r = await session.execute("asaas/create_payment", {"value": 100})
    assert r["id"] == "pay_test_42"
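The fallback order described above (explicit option, then CODESPAR_BASE_URL, then the hosted default) can be sketched as a small resolver. This is an illustration of the precedence only, not the SDK's constructor code; the function name is hypothetical, and treating an empty string as unset is an assumption.

```python
import os
from typing import Mapping, Optional

DEFAULT_BASE_URL = "https://api.codespar.dev"


def resolve_base_url(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Explicit option wins, then CODESPAR_BASE_URL, then the hosted default."""
    env = os.environ if env is None else env
    if explicit:
        return explicit
    # Assumption: an empty env value is treated the same as an unset one.
    return env.get("CODESPAR_BASE_URL") or DEFAULT_BASE_URL
```

Passing `env` explicitly keeps the resolver testable without mutating the process environment.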

Tests that exercise AgentGate-only capabilities (programmable wallet caps, policy hooks, commercial memory) target the hosted backend exclusively — they belong in a separate suite that doesn't run against an OSS CODESPAR_BASE_URL. The OSS runtime exposes the session primitives and the MCP catalog; the AgentGate governance layer is what the managed tier adds on top.

How mocks are stored

The wire contract is identical across both runtimes, but the storage shape is not:

  • Managed backend. Mocks and the per-tool consume counters persist in the runtime's database. A session that declared mocks survives restarts, multi-replica deployments, dashboard inspection, and the audit-page review.
  • OSS self-hosted. Mocks and counters live in the server process's memory. They are scoped to the process that handles the session and are lost on restart. Channel-bridge sessions (WhatsApp, Slack, Telegram, Discord) cannot carry mocks on OSS — the in-memory store is only reachable from the HTTP session that declared them.

The asymmetry is intentional. The managed tier's persistence is what AgentGate uses to drive auditability, multi-tenant isolation, and dashboard visibility — capabilities the OSS layer doesn't ship. If a test workflow needs mocks to survive across processes, a CI replica swap, or dashboard inspection, target the managed runtime. If a self-hosted single-process run is enough, the OSS in-memory shape covers it without a database dependency.

The code discriminant, the sibling fields, and the gate ordering are identical between runtimes regardless, and so is the HTTP status for every envelope except mocks_not_permitted (403 on managed, 501 on OSS). Test assertions on the envelope's code and fields port unchanged between OSS and managed.

Limits and ordering

A handful of constraints are worth knowing up front:

  • 64 KiB mocks-payload cap. The serialized mocks payload per session is capped at 64 KiB on both runtimes. Payloads exceeding the cap return HTTP 413 with mocks_payload_too_large at create time — the session is never created. The cap is wire-contract enforcement at the cs.create boundary; it applies the same whether the runtime persists mocks (managed) or holds them in process memory (OSS self-hosted).
  • Three create-time gates run in fixed order. The first failure short-circuits and subsequent gates do not run. Order matters because the error you see tells you what to fix first.
    • Gate 1 is the test-mode check. On the managed backend it asks "is the resolved project environment='test'?"; live-environment projects reject with HTTP 403 mocks_not_permitted. On a self-hosted OSS runtime it asks "is CODESPAR_TEST_MODE_ENABLED truthy on the server?"; deployments without the flag reject with HTTP 501 mocks_not_permitted. Same envelope code, different status, because the gate semantics differ.
    • Gate 2 is the 64 KiB payload cap. Over-cap payloads reject with HTTP 413 mocks_payload_too_large on both runtimes.
    • Gate 3 is shape validation (top-level object, canonical tool names, valid JSON values). Malformed payloads reject with HTTP 400 mocks_invalid on both runtimes.
  • Tenant isolation. Mocks declared in one session are not readable by any other session, including sessions in the same project. The (org_id, project_id, session_id) scoping invariant is the boundary; a cross-org probe returns 404 (not 403) per the existing tri-key scoping.
  • No mid-session mutation. Mocks are immutable for the session's lifetime. A new test scenario means a new session — cs.create again with the new mocks map.
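The gate order above can be mirrored in a local pre-flight check that reports which rejection a mocks payload would hit first. This is a sketch under assumptions: the function name is hypothetical, the byte count uses Python's default json.dumps output (the SDK's serialization may differ by a few bytes), and shape validation covers only the rules named on this page.

```python
import json
import re
from typing import Optional, Tuple

MAX_MOCKS_BYTES = 64 * 1024
CANONICAL = re.compile(r"^[a-z0-9][a-z0-9-]*\/[a-z0-9][a-z0-9_-]*$")


def first_rejection(mocks, *, test_mode: bool, managed: bool) -> Optional[Tuple[int, str]]:
    """Return (status, code) for the first failing gate, or None if all pass."""
    # Gate 1: test-mode check (403 on managed, 501 on self-hosted OSS).
    if not test_mode:
        return (403 if managed else 501, "mocks_not_permitted")
    # Gate 2: serialized payload cap.
    if len(json.dumps(mocks).encode("utf-8")) > MAX_MOCKS_BYTES:
        return (413, "mocks_payload_too_large")
    # Gate 3: shape validation.
    if not isinstance(mocks, dict):
        return (400, "mocks_invalid")
    for name, value in mocks.items():
        if not isinstance(name, str) or not CANONICAL.fullmatch(name):
            return (400, "mocks_invalid")
        is_static = isinstance(value, dict)
        is_stateful = isinstance(value, list) and all(isinstance(v, dict) for v in value)
        if not (is_static or is_stateful):
            return (400, "mocks_invalid")
    return None
```

A payload that is both oversized and keyed by a non-canonical name reports the 413 first, which matches the short-circuit behaviour described in the list.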
