Thinking models get silently mangled by the hardcoded 2048 max_tokens cap #17

@asvarnon

I've been running BenchLoop against a bunch of local llama.cpp models (RTX 5070 Ti, 16GB) and a few of the thinking/reasoning models were scoring way worse than non-thinking versions or even much smaller models, especially on dataextract and coding.

What I found:

  • max_tokens: 2048 is hardcoded into every task fixture, across all five suites — coding, dataextract, toolcall, instructfollow, reasonmath. Traces
    back to one shared fallback in openai_compat.py:
  "max_tokens": int(kwargs.get("max_tokens") or 2048),
  • For an always-on-reasoning model (Qwen3.6, anything with a deepseek-style reasoning format, etc.) 2048 tokens isn't much: it gets cut off mid-thought before it ever writes an actual answer. Then this fallback, a bit further down in the same file, kicks in:
 if not content and reasoning:
     content = reasoning

So the raw, unclosed chain-of-thought gets dumped straight into content. dataextract's strict json.loads and coding's code-fence regex both fail
outright, not because the model got the task wrong but because it never got the chance to answer. The other three suites grade more
leniently, so they don't always get caught, but they do sometimes, depending on how chatty the model's CoT is on a given task.
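
For anyone who wants to see the failure mode in isolation, here's a minimal sketch. It is not BenchLoop's code: finalize_content just mirrors the two-line fallback quoted above, and the truncated response text, strict_json_ok, and the code-fence regex are all made up for illustration (the real graders may differ in detail).

```python
import json
import re


def finalize_content(content, reasoning):
    # Mirrors the fallback quoted above: empty content is replaced by raw reasoning.
    if not content and reasoning:
        content = reasoning
    return content


def strict_json_ok(text):
    # Hypothetical stand-in for a strict json.loads-based grader.
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical code-fence matcher; the fence is built from single backticks
# so this snippet does not contain a literal fence itself.
FENCE = "`" * 3
CODE_FENCE = re.compile(FENCE + r"(?:\w+)?\n(.*?)" + FENCE, re.DOTALL)

# Made-up truncated response: the model hit the cap mid-thought,
# so there is reasoning text but no final answer.
truncated_reasoning = "Okay, the user wants the invoice fields as JSON. First, the total is"
graded_text = finalize_content("", truncated_reasoning)

print(strict_json_ok(graded_text))             # False
print(CODE_FENCE.search(graded_text) is None)  # True
```

Both checks fail on the leftover chain-of-thought even though the model never produced a wrong answer, which matches what the two strict suites report.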

I confirmed this is real by re-running the same models with the cap raised, same hardware/quant/harness, nothing else changed:

  • Qwen3.6-35B-A3B (APEX-MTP quant): 65.2 → 84.9 overall
  • Qwen3.6-27B, thinking on: 61.1 → 81.2 (this one actually flips which mode looks better — thinking now beats no-think, which is the opposite of
    what the broken numbers said)
  • Gemma 4 12B Q4_K_M: 58.8 → 69.3
  • Gemma 4 12B Q8_0: 61.1 → 71.2
  • Gemma-4-12B-coder, which reasons concisely and wasn't really hitting the cap: 81.1 → 83.2, barely moved — good control case, shows this isn't
    just inflating every score across the board

I get why a cap exists: SPEC.md lists an explicit max_tokens as a reproducibility requirement, and nobody wants every run turning
into an all-day thing. But right now there's no way to raise it without editing the installed package.

I've got a fix working locally: a --max-tokens flag that overrides the fixture default across every suite (still explicit and fixed-per-run, just
configurable), plus some JSON-extraction hardening so dataextract can recover an answer even when there's leftover text around it instead of
failing the whole task. Can open a PR for it if that's wanted.
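
Roughly what the two pieces could look like, as a sketch rather than the actual patch. The helper names resolve_max_tokens and extract_json are hypothetical, the precedence order (flag, then fixture value, then 2048) is an assumption, and the sample strings are invented. The JSON part uses the standard library's json.JSONDecoder.raw_decode to pull the first decodable object or array out of surrounding text.

```python
import argparse
import json


def resolve_max_tokens(cli_value, fixture_value, default=2048):
    # Assumed precedence: an explicit CLI value wins, then the fixture, then the default.
    if cli_value is not None:
        return cli_value
    return int(fixture_value or default)


def extract_json(text):
    # Return the first decodable JSON object or array found in text, else None.
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch not in "{[":
            continue
        try:
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from this position, keep scanning
        return value
    return None


parser = argparse.ArgumentParser()
parser.add_argument("--max-tokens", type=int, default=None)
args = parser.parse_args(["--max-tokens", "8192"])

print(resolve_max_tokens(args.max_tokens, 2048))  # 8192
print(extract_json('Sure, here it is: {"total": 42.5} Hope that helps.'))  # {'total': 42.5}
```

One caveat with this kind of scan: it returns the first thing that parses, so a stray bracketed fragment in leftover reasoning (say "[1]") would be picked up before the real answer. A stricter version might prefer the last match or require an object.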
