When you call session.tools() on a CodeSpar session with the Brazilian preset, you don't get 99 tools. You get 6.
This is not a simplification for the sake of developer experience, although the DX improvement is significant. It is a hard engineering constraint imposed by the economics and mechanics of large language models. Context windows are finite and expensive, and the number of tool definitions loaded into them directly affects tool selection accuracy. More tools means more tokens consumed by definitions, fewer tokens available for reasoning, and worse decisions by the model about which tool to use.
We spent three months discovering this the hard way. In January 2026, our first SDK prototype passed all available MCP tools directly to the model. Tool selection accuracy hovered around 62%. Agents frequently picked the wrong payment provider, confused fiscal tools with payment tools, and burned through context windows generating explanations for why they chose the wrong tool. By March, after implementing the meta tool pattern, selection accuracy jumped to 97.3%. The math was clear. The architecture had to change.
Each meta tool is a category proxy. When an agent calls codespar_pay, the MetaToolExecutor resolves the intent, identifies the correct underlying MCP server, discovers the right tool, and executes it. The agent never sees the 99 raw tools. It sees 6 categories that cover 100% of LatAm commerce operations.
The context window problem is worse than you think
Let's do the math. A typical MCP tool definition includes a name, a description (50-150 tokens), and a JSON Schema for the input parameters (100-400 tokens depending on complexity). For a payment tool like Zoop's create_pix_payment, the schema includes fields for amount, currency, payer CPF/CNPJ, payer name, payer email, Pix key type, Pix key value, description, metadata, and callback URLs. That single tool definition consumes approximately 280 tokens.
Multiply across 99 tools and you get 15,000-20,000 tokens consumed before the agent has processed a single user message. On Claude Sonnet with a 200K context window, that's 10% of the window gone to tool definitions alone. On GPT-4o with 128K, it's 15%. On smaller models with 32K windows, it's over 60% - leaving almost no room for conversation history, system prompts, or multi-step reasoning.
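The arithmetic above can be sketched in a few lines. The per-tool average is an assumption (the text gives a 15,000-20,000 token range for 99 tools; 200 tokens per tool lands inside it):

```typescript
// Token budget arithmetic from the figures above (approximate).
const TOOL_COUNT = 99;
const AVG_TOKENS_PER_TOOL = 200; // name + description + JSON Schema, rough average

const definitionTokens = TOOL_COUNT * AVG_TOKENS_PER_TOOL; // 19,800

// Percentage of the context window consumed by tool definitions
// before the agent has processed a single user message.
function windowSharePct(contextWindow: number): number {
  return (definitionTokens / contextWindow) * 100;
}

console.log(windowSharePct(200_000).toFixed(1)); // Claude Sonnet, 200K: ~10%
console.log(windowSharePct(128_000).toFixed(1)); // GPT-4o, 128K: ~15%
console.log(windowSharePct(32_000).toFixed(1));  // 32K models: ~62%
```

With 6 meta tools the same arithmetic yields roughly 1,200 tokens, under 1% of a 200K window.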
But the cost savings, while meaningful at scale, are the least important benefit. The real problem is tool selection degradation.
Tool selection accuracy degrades non-linearly
We ran controlled experiments in February 2026 with Claude 3.5 Sonnet and GPT-4o. We gave each model a commerce task - "charge this customer R$150 via Pix" - and varied the number of available tools from 5 to 100. The results were unambiguous.
The degradation is not linear. There's a sharp inflection point around 20-30 tools, after which accuracy drops dramatically. This aligns with research from Anthropic and Microsoft on tool-use benchmarks: models struggle with large tool sets not because they can't understand the tools, but because the decision surface becomes too large for reliable discrimination.
At 99 tools, the model is essentially guessing between semantically similar options. Is the right tool zoop_create_pix_payment or asaas_create_pix_charge or stripe_acp_create_payment_intent? They all do roughly the same thing with different APIs, different parameter names, and different behaviors. The model cannot reliably distinguish them without deep domain knowledge that doesn't fit in a tool description.
The right number of tools for an agent is the smallest number that covers all the use cases. For LatAm commerce, that number is 6.
Why 6 is the right number
Not 3. Not 10. Not 20. Six.
We arrived at 6 through a combination of domain analysis and empirical testing. The commerce domain, across all of Latin America, decomposes into exactly 6 operational categories:
- Discover - find products, services, prices, and catalogs. This is the read-before-write operation. Before you can charge someone, you need to know what you're charging for. Universal Commerce Protocol (UCP), Stripe ACP product catalogs, Asaas product listings.
- Checkout - create a payment session, generate a checkout link, initiate a purchase flow. This is distinct from "pay" because checkout involves session management, cart state, and consumer-facing UX. Stripe ACP Checkout, Asaas payment links, x402 protocol for machine-to-machine payments.
- Pay - execute a payment. Charge a card, generate a Pix QR code, create a boleto, process a SPEI transfer. This is the core financial operation. It's also the most complex because it involves routing logic (which provider, which method) and compliance (mandate generation, policy enforcement).
- Invoice - generate fiscal documents. NF-e, NFS-e, NFC-e in Brazil. CFDI in Mexico. Factura Electrónica in Argentina and Colombia. This is mandatory by law for most transactions in LatAm. It's not optional. It's not nice-to-have. It's a legal requirement.
- Ship - calculate freight, generate shipping labels, track packages. Physical goods require logistics. Melhor Envio aggregates carriers in Brazil. Skydropx does the same in Mexico. Coordinadora in Colombia.
- Notify - send transactional messages. Order confirmation via WhatsApp. Shipping update via SMS. Invoice delivered via email. Z-API for WhatsApp in Brazil. Zenvia for multi-channel.
We tested with 3 categories (pay/invoice/ship) and agents couldn't distinguish between discovery and payment flows. We tested with 10 categories (splitting pay into card/pix/boleto/spei) and accuracy dropped to 89% - the model confused the sub-categories. We tested with 20 categories (one per provider) and accuracy dropped to 74%. Six categories hit the sweet spot: high enough coverage, low enough cognitive load for the model.
Miller's Law suggests humans can hold 7 plus or minus 2 items in working memory. LLMs are not humans, but the principle transfers: tool selection works best when the option set is small enough to fully evaluate. Our empirical data shows the optimal range for commerce tool selection is 5-8 categories. Six is in the middle of that range.
How the MetaToolExecutor works
When an agent calls codespar_pay, it doesn't execute a payment. It triggers a resolution chain that determines which underlying MCP tool to call, on which server, with which parameters. The full flow:

1. Resolve the meta tool name to its commerce category.
2. Look up the candidate MCP servers for that category in the static routing table.
3. Discover the tools available on each candidate server.
4. Score the candidate tools with the category's keyword set (plus argument-derived keywords) and pick the highest scorer.
5. Execute the winning tool over MCP and return the result to the agent.
The entire resolution chain takes 12-18ms on a warm cache. Tool discovery (Step 3) is cached after the first call per session, so subsequent calls skip that step entirely. The agent experiences near-zero latency overhead compared to calling the raw MCP tool directly.
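The resolution chain can be sketched as a small class. All names here are hypothetical; the shipped MetaToolExecutor talks to live MCP servers, which are stubbed behind an interface in this sketch:

```typescript
// Sketch of the resolution chain (hypothetical names throughout).
type ToolDef = { name: string; description: string };
type McpServer = {
  name: string;
  listTools(): Promise<ToolDef[]>;
  callTool(name: string, args: Record<string, unknown>): Promise<unknown>;
};

class MetaToolExecutor {
  // Per-session discovery cache: server name -> tool definitions.
  private toolCache = new Map<string, ToolDef[]>();

  constructor(
    private servers: Map<string, McpServer>,
    private routes: Record<string, string[]>,   // meta tool -> server names
    private keywords: Record<string, string[]>, // meta tool -> keyword set
  ) {}

  async execute(metaTool: string, args: Record<string, unknown>): Promise<unknown> {
    // Steps 1-2: resolve the category and look up candidate servers (O(1)).
    const candidates = this.routes[metaTool] ?? [];
    let best: { server: McpServer; tool: ToolDef; score: number } | undefined;

    for (const serverName of candidates) {
      const server = this.servers.get(serverName);
      if (!server) continue;
      // Step 3: tool discovery, cached after the first call per session.
      let tools = this.toolCache.get(serverName);
      if (!tools) {
        tools = await server.listTools();
        this.toolCache.set(serverName, tools);
      }
      // Step 4: keyword-score each candidate tool.
      for (const tool of tools) {
        const haystack = `${tool.name} ${tool.description}`.toLowerCase();
        const score = (this.keywords[metaTool] ?? [])
          .filter((k) => haystack.includes(k)).length;
        if (!best || score > best.score) best = { server, tool, score };
      }
    }
    if (!best) throw new Error(`No tool found for ${metaTool}`);
    // Step 5: execute the winning tool over MCP.
    return best.server.callTool(best.tool.name, args);
  }
}
```

Because discovery is cached and scoring is pure string matching, the only network calls on the warm path are the MCP executions themselves.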
The routing table
Each meta tool maps to a set of MCP server names. The mapping is defined in a static routing table, so resolution is O(1).
When a preset includes only Stripe ACP (the checkout preset), only 4 meta tools are returned: discover, checkout, pay, and invoice. Ship and notify aren't available because no matching server exists. This keeps the tool set honest. You only see what you can actually use.
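A sketch of what that routing table and the preset filtering might look like. The server names are illustrative only (the invoice entry `nfe-issuer` in particular is a placeholder, not a real provider name from the SDK):

```typescript
// Hypothetical static routing table: meta tool -> candidate MCP servers.
// Lookup is a plain object access, hence O(1) resolution.
const ROUTES: Record<string, string[]> = {
  codespar_discover: ["ucp", "stripe-acp", "asaas"],
  codespar_checkout: ["stripe-acp", "asaas", "x402"],
  codespar_pay:      ["zoop", "asaas", "stripe-acp", "x402"],
  codespar_invoice:  ["stripe-acp", "nfe-issuer"], // "nfe-issuer" is a placeholder
  codespar_ship:     ["melhor-envio", "skydropx"],
  codespar_notify:   ["z-api", "zenvia"],
};

// A meta tool is exposed only if the session's preset connects at least
// one server that can serve it: "you only see what you can actually use".
function availableMetaTools(connectedServers: Set<string>): string[] {
  return Object.entries(ROUTES)
    .filter(([, servers]) => servers.some((s) => connectedServers.has(s)))
    .map(([metaTool]) => metaTool);
}
```

Run against a Stripe-ACP-only preset, this returns exactly the four meta tools described above: discover, checkout, pay, and invoice.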
BACR: Best Agent Commerce Route
Here's where the architecture gets opinionated.
In equities trading, there's a concept called "best execution." When a broker receives an order to buy 1,000 shares of AAPL, they don't just route it to NYSE. They evaluate multiple venues - NYSE, NASDAQ, IEX, dark pools - and route to the venue that offers the best combination of price, speed, and likelihood of fill. This is legally mandated: FINRA's best-execution rule (Rule 5310) requires it in the US, with routing disclosures under SEC Rule 606, and MiFID II imposes the same obligation in Europe.
We apply the same principle to commerce. When an agent calls codespar_pay, the system doesn't just pick the first available provider. It evaluates all available providers and routes to the one that offers the best combination of cost, reliability, and speed. We call this BACR: Best Agent Commerce Route.
BACR considers four factors:
- Cost - total fee for the specific transaction amount and method. Percentage-based fees favor large transactions; flat fees favor small ones. BACR calculates the actual cost, not just the rate.
- Speed - settlement time for the specific payment method. Pix settles instantly. Boleto takes 1-3 days. Credit card takes 30 days (or D+2 with advance). BACR weights speed based on the merchant's cash flow preferences.
- Reliability - 30-day uptime and error rate for the specific tool on the specific server. If a provider's Pix endpoint has been returning 503s, BACR deprioritizes it even if it's cheaper.
- Nativeness - whether the provider supports the payment method natively or through a third-party intermediary. Native support means fewer failure points, better error messages, and faster processing.
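One way the four factors could combine is a weighted sum per candidate. The weights, normalization, and field names below are assumptions for illustration, not the shipped BACR formula:

```typescript
// Hypothetical BACR scoring sketch: weighted sum of the four factors.
type RouteCandidate = {
  provider: string;
  feePct: number;         // percentage fee, e.g. 0.99 means 0.99%
  flatFee: number;        // flat fee in BRL
  settlementDays: number; // 0 for instant Pix, up to 30 for cards
  errorRate30d: number;   // 0..1 over the last 30 days
  native: boolean;        // native support vs third-party intermediary
};

function bacrScore(c: RouteCandidate, amount: number): number {
  // Cost: the actual fee for this amount, not just the rate (lower is better).
  const cost = amount * (c.feePct / 100) + c.flatFee;
  const costScore = 1 / (1 + cost);
  // Speed: instant settlement scores highest.
  const speedScore = 1 / (1 + c.settlementDays);
  // Reliability: recent error rate directly penalizes the provider.
  const reliabilityScore = 1 - c.errorRate30d;
  // Nativeness: intermediated routes are discounted.
  const nativenessScore = c.native ? 1 : 0.5;
  return 0.4 * costScore + 0.2 * speedScore
       + 0.3 * reliabilityScore + 0.1 * nativenessScore;
}

function bestRoute(candidates: RouteCandidate[], amount: number): RouteCandidate {
  return candidates.reduce((a, b) =>
    bacrScore(b, amount) > bacrScore(a, amount) ? b : a);
}
```

Note how the reliability term can outweigh cost: a cheaper provider whose Pix endpoint is failing loses the route, which is exactly the behavior described above.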
BACR is not a feature we ship today in v0.2.0. It's the architectural direction we're building toward. The current keyword scoring is Step 1. BACR scoring will replace it in v0.3.0 as we collect enough routing data to make intelligent decisions. But the meta tool abstraction is designed from day one to support this evolution - because it controls the routing decision, not the agent.
The agent should never decide which payment provider to use. That's a business decision, not a model decision. The meta tool layer makes that decision deterministically.
Composio comparison: Actions vs Meta Tools
Composio has a concept called "Actions" that superficially resembles meta tools. They aggregate multiple API operations into higher-level abstractions. But the architecture is fundamentally different.
Composio Actions are essentially pre-built API call chains. An Action like "Send Slack Message" wraps the Slack API's chat.postMessage endpoint with authentication handling, error formatting, and a simplified parameter schema. It's an API wrapper with a nice interface. Composio has over 10,000 Actions across 250+ integrations.
CodeSpar meta tools are routing proxies, not API wrappers. A meta tool doesn't wrap a specific API call. It routes to the best available tool across multiple MCP servers based on the arguments, the preset, and (eventually) the BACR score. The meta tool doesn't know or care which specific API gets called. It delegates that decision to the resolution chain.
This is a critical difference. With Composio, the developer (or the agent) must know which provider to use for which operation. With CodeSpar, the meta tool layer abstracts that decision entirely. The agent says "pay" and the system figures out how.
The practical impact: when we added x402 protocol support in March 2026, existing agents that used codespar_pay automatically gained the ability to process machine-to-machine payments via x402 - without any code changes, without any prompt updates, without even knowing x402 exists. The meta tool routing table was updated, and every agent in production got smarter overnight.
Keyword scoring: why not semantic search
Once the MetaToolExecutor knows which servers to try, it needs to find the best matching tool on that server. We use keyword scoring rather than semantic search.
Context-specific keywords from the arguments are also considered. If the agent passes paymentMethod: "pix", "pix" is added to the scoring keywords, biasing toward Pix-specific tools.
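A minimal sketch of that scorer, including the argument-derived bias (keyword sets and names are hypothetical):

```typescript
// Hypothetical keyword scorer. Base keywords come from the meta tool's
// category; context keywords are derived from the call arguments.
const BASE_KEYWORDS: Record<string, string[]> = {
  codespar_pay: ["pay", "payment", "charge", "pix", "boleto", "card"],
};

function scoreTool(
  tool: { name: string; description: string },
  metaTool: string,
  args: Record<string, unknown>,
): number {
  const keywords = [...(BASE_KEYWORDS[metaTool] ?? [])];
  // Argument-derived bias: paymentMethod: "pix" adds "pix" to the set,
  // pushing Pix-specific tools above generic payment tools.
  if (typeof args.paymentMethod === "string") {
    keywords.push(args.paymentMethod.toLowerCase());
  }
  const haystack = `${tool.name} ${tool.description}`.toLowerCase();
  // Substring match against name + description; the count is the score.
  return keywords.filter((k) => haystack.includes(k)).length;
}
```

Pure substring counting means the same input always yields the same score, which is what makes the routing deterministic.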
We evaluated three approaches for tool resolution:
- Semantic search with embeddings. Embed all tool names and descriptions using text-embedding-3-small, run cosine similarity on each call. Accuracy: 96.8%. Latency: 45-80ms per call. Cost: $0.02/1M tokens for embeddings. Dependency: requires an embedding model endpoint. We rejected this for the latency and the external dependency. A commerce payment tool should not fail because an embedding service is down.
- LLM-based routing. Ask a fast model (Haiku) to select the best tool from the candidate list. Accuracy: 98.1%. Latency: 200-400ms per call. Cost: $0.25/1M tokens. We rejected this because adding an LLM call inside a tool call creates recursive latency and cost. Every agent action would trigger two model calls instead of one.
- Keyword scoring. Substring match against tool names and descriptions using predefined keyword sets. Accuracy: 97.3%. Latency: <1ms. Cost: zero. No external dependency. We chose this. The accuracy difference versus semantic search is 0.5%, and it's offset by determinism, speed, and zero failure modes.
When a merchant's agent processes a R$10,000 payment, the routing decision must be deterministic. The same input must always produce the same routing. Semantic search introduces floating-point non-determinism in similarity scores. LLM routing introduces stochastic non-determinism in model outputs. Keyword scoring is perfectly deterministic. Same keywords, same score, same route, every time. For financial operations, this is non-negotiable.
Architecture deep dive: meta tools to MCP servers
The meta tool layer sits between the AI framework (Vercel AI SDK, LangChain, direct API calls) and the MCP server fleet.
The MetaToolExecutor is a pure function. It takes a meta tool name and arguments, returns a tool call result. It has no state beyond the tool discovery cache (which is per-session and invalidated on server reconnect). It makes no network calls except to the MCP servers themselves. It's testable, deterministic, and fast.
How the SDK exposes meta tools
The developer writes zero routing logic. Zero provider selection. Zero tool filtering. The SDK handles everything between "charge via Pix" and "here's your QR code."
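The shape of the developer-facing API might look like the following. The names (`createSession`, the `preset` option, `session.tools()`) follow the surrounding text but are stubbed locally here so the sketch is self-contained; the real SDK connects to live MCP servers:

```typescript
// Hypothetical developer-facing shape, with a local stub standing in
// for the real SDK's session factory.
type Tool = { name: string; description: string };
interface Session {
  tools(): Tool[];
}

function createSession(opts: { preset: string }): Session {
  // Stub: the Brazilian preset exposes exactly the 6 meta tools.
  const metaTools = ["discover", "checkout", "pay", "invoice", "ship", "notify"]
    .map((c) => ({ name: `codespar_${c}`, description: `Meta tool for ${c}` }));
  return { tools: () => metaTools };
}

const session = createSession({ preset: "brazil" });
// The agent framework receives 6 tools, not 99 - no routing logic,
// no provider selection, no tool filtering in developer code.
console.log(session.tools().map((t) => t.name));
```

The tool list is then handed to the framework's tool-calling API (for example, the `tools` parameter of the Vercel AI SDK's `generateText`) unchanged.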
The codespar_pay deep dive
The payment meta tool deserves special attention because it's the most complex routing decision in the system. When an agent calls codespar_pay, it triggers a three-stage pipeline that goes beyond simple tool routing: policy enforcement, provider routing, and mandate generation.
PolicyEngine and MandateGenerator are enterprise packages - they're the part of CodeSpar that's commercially licensed. The meta tool abstraction means that when a developer upgrades from the open-source tier to enterprise, their agent code doesn't change. The same codespar_pay call gains compliance features transparently. No code refactor, no prompt updates, no agent retraining.
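A sketch of how the pipeline can stay transparent across tiers: the enterprise stages are optional parameters, so open-source sessions simply skip them. The stage ordering and all interface names here are assumptions:

```typescript
// Hypothetical codespar_pay pipeline. PolicyEngine and MandateGenerator
// stand in for the enterprise packages; their absence is a no-op.
type PayArgs = { amount: number; method: string };
type PayResult = { providerTool: string; mandateId?: string };

interface PolicyEngine {
  check(args: PayArgs): void; // throws on a policy violation
}
interface MandateGenerator {
  issue(args: PayArgs): string; // returns a mandate id
}

async function codesparPay(
  args: PayArgs,
  route: (args: PayArgs) => Promise<string>, // meta tool routing
  policy?: PolicyEngine,                     // enterprise-only
  mandates?: MandateGenerator,               // enterprise-only
): Promise<PayResult> {
  // Stage 1: policy enforcement before any money moves.
  policy?.check(args);
  // Stage 2: provider routing (keyword scoring today, BACR later).
  const providerTool = await route(args);
  // Stage 3: mandate generation for the executed payment.
  const mandateId = mandates?.issue(args);
  return { providerTool, mandateId };
}
```

The agent-facing call signature never changes: upgrading to enterprise means passing the extra stages into the session, not rewriting agent code.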
The escape hatch
Meta tools are the default and the recommendation. But we are not in the business of hiding complexity from developers who need it. If a developer needs access to the raw MCP tools - because they're building a custom routing layer, because they need a specific tool that doesn't rank highly in keyword scoring, or because they want full control - they can bypass the abstraction entirely.
The MCP endpoint at session.mcp gives full access to all servers and tools through the standard MCP protocol. Any MCP-compatible client can connect - Claude Desktop, Cursor, Windsurf, or any custom implementation. No meta tool layer, no routing, no keyword scoring. Just raw tool access for developers who want full control.
We track the ratio of meta tool usage to raw tool usage across our SDK telemetry. As of April 2026, 94% of tool calls go through meta tools. The 6% that use raw tools are almost entirely developers running integration tests against specific providers. In production agent workloads, meta tools are used exclusively.
What's next: from keyword scoring to BACR
The meta tool pattern in v0.2.0 is the foundation. The roadmap through v0.4.0 adds increasingly sophisticated routing:
- v0.2.0 (current) - Keyword scoring. Deterministic, fast, 97.3% accuracy. Good enough for most workloads. No external dependencies.
- v0.3.0 (Q3 2026) - BACR scoring. Cost, speed, reliability, and nativeness factors. Requires 90 days of routing data to calibrate. Will route based on actual provider performance, not just keyword similarity.
- v0.4.0 (Q4 2026) - Adaptive routing. BACR scores adjust in real-time based on provider health. If Asaas starts returning errors, traffic automatically shifts to Stripe ACP. If a provider reduces fees, they get more traffic. The system optimizes continuously.
The beauty of the meta tool abstraction is that none of these upgrades require changes to agent code. The codespar_pay call stays the same. The routing gets smarter underneath.
Build your agents against abstractions, not implementations. The abstraction is the contract. The implementation is our problem.
The meta tool pattern ships with @codespar/sdk@0.2.0. Install it, create a session with any preset, and call session.tools(). You'll get 6 tools instead of 99. Your agent will make better decisions, use less context window, and cost less per session. The code is MIT licensed. Read every line if you want to understand the routing. We built this in the open because we think it's the right way to build commerce infrastructure for AI agents.