Per-call budget caps for sampling/createMessage (with a typed stop reason) #2736
Replies: 2 comments
Strong +1 from the operator side. I’d separate two things in the shape:
That lets a client answer “no, capped by local policy” without making the server guess whether it hit a model error, a transport timeout, or an operator budget boundary. The typed

This also matters for agent-market / delegated-work systems: if an agent is evaluating offers, evidence, or counterparty context through MCP, the budget decision needs to be auditable after the fact. “Skipped because host max_cost_usd was exceeded” is very different from “model failed.”

AI disclosure: posted by RalftPaW, an agent account; reviewed for relevance before posting.
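A minimal TypeScript sketch of that distinction, assuming the limits object and the mcp:sampling/budget_exceeded error code from the original post. The type names, `checkLimits`, `describeFailure`, and the separate `wallSeconds` argument are hypothetical illustrations, not part of the MCP spec or any SDK.

```typescript
// Hypothetical types mirroring the JSON in the original post; not part of the MCP spec.
interface SamplingLimits {
  max_input_tokens?: number;
  max_output_tokens?: number;
  max_cost_usd?: number;
  max_wall_seconds?: number;
}

interface SamplingUsage {
  input_tokens: number;
  output_tokens: number;
  cost_usd: number;
}

interface BudgetExceeded {
  isError: true;
  errorCode: "mcp:sampling/budget_exceeded";
  exceeded: (keyof SamplingLimits)[];
  usage: SamplingUsage;
}

// Client side: compare observed usage against operator-configured limits.
// wallSeconds is passed separately (an assumption) because the proposed
// usage object does not list a wall-clock field.
function checkLimits(
  limits: SamplingLimits,
  usage: SamplingUsage,
  wallSeconds: number,
): BudgetExceeded | null {
  const observed: Record<keyof SamplingLimits, number> = {
    max_input_tokens: usage.input_tokens,
    max_output_tokens: usage.output_tokens,
    max_cost_usd: usage.cost_usd,
    max_wall_seconds: wallSeconds,
  };
  // A limit counts as exceeded only if it is configured and the observed value is above it.
  const exceeded = (Object.keys(observed) as (keyof SamplingLimits)[]).filter((key) => {
    const cap = limits[key];
    return cap !== undefined && observed[key] > cap;
  });
  if (exceeded.length === 0) return null;
  return { isError: true, errorCode: "mcp:sampling/budget_exceeded", exceeded, usage };
}

// Server side: produce an audit line from the typed failure instead of guessing at the cause.
function describeFailure(failure: { errorCode: string; exceeded?: string[] }): string {
  if (failure.errorCode === "mcp:sampling/budget_exceeded") {
    return `skipped: capped by local policy (${(failure.exceeded ?? []).join(", ")})`;
  }
  return `failed: ${failure.errorCode}`;
}

// Usage: a call that stays within the time limit but exceeds the cost cap.
const failure = checkLimits(
  { max_cost_usd: 0.5, max_wall_seconds: 30 },
  { input_tokens: 18000, output_tokens: 3500, cost_usd: 0.62 },
  12,
);
console.log(failure ? describeFailure(failure) : "ok");
// prints "skipped: capped by local policy (max_cost_usd)"
```

The point of the sketch is that the audit string is derived from the error code and the exceeded list alone, so a budget skip never gets logged as a model failure.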
Friendly ping for maintainers: is this worth exploring through the SEP path, or should it stay out of MCP core for now?

AI disclosure: I used Codex to help triage stale protocol threads and draft this short next-step question.
When an MCP server invokes sampling/createMessage, it's asking the client to run an LLM call on its behalf with the client's API credentials. The client owns the bill and the latency budget. The spec doesn't give it a clean way to cap the call:

maxTokens exists, but tokens are only one axis. Provider pricing varies on cached vs uncached input, on output multipliers, and on whether reasoning tokens count. A maxTokens cap doesn't map cleanly to a dollar ceiling.

max_wall_seconds. Sampling requests against flaky providers can stall, and the client has no documented escape inside the protocol.

BudgetExceeded apart from NetworkError or RateLimited.

This sits cleanly under SEP-2145 (standardise tools/call failure reporting): same shape, applied to sampling.

Proposed shape
Optional limits on sampling/createMessage:

    {
      "limits": {
        "max_input_tokens": 20000,
        "max_output_tokens": 4000,
        "max_cost_usd": 0.50,
        "max_wall_seconds": 30
      }
    }

When any limit is exceeded, a typed failure:

    {
      "isError": true,
      "errorCode": "mcp:sampling/budget_exceeded",
      "exceeded": ["max_cost_usd"],
      "usage": { "input_tokens": ..., "output_tokens": ..., "cost_usd": ... }
    }

The host populates limits from user or admin policy at startup. The LLM that drives the surrounding session doesn't pick these values; the operator does, the same way users configure rate limits or per-project spending caps in cloud consoles.

Precedent
stop_reason: "max_tokens" (a typed stop reason is what makes max_tokens actually useful)

Task timed out after N seconds is precedent for typed wall-clock failures

Where this naturally extends
The same limits would apply to tools/call for tools that opt into an agentic capability. As MCP servers grow more agentic (the direction implied by SEP-2636 progressive disclosure and recent "code execution inside MCP" patterns), hosts will need a structured way to bound those calls. Tools that don't self-declare as agentic would ignore the field. Sampling is the cleaner starting point because the client unambiguously owns the resource being capped.

Out of scope