Thinking models get silently mangled by the hardcoded 2048 max_tokens cap

I've been running BenchLoop against a bunch of local llama.cpp models (RTX 5070 Ti, 16GB), and a few of the thinking/reasoning models were scoring way worse than their non-thinking versions, or even than much smaller models, especially on dataextract and coding.

What I found:
- max_tokens: 2048 is hardcoded into every task fixture, across all five suites: coding, dataextract, toolcall, instructfollow, reasonmath. It traces back to one shared fallback in openai_compat.py:

  ```python
  "max_tokens": int(kwargs.get("max_tokens") or 2048),
  ```

- For an always-on-reasoning model (Qwen3.6, anything with a deepseek-style reasoning format, etc.) 2048 tokens of <think> isn't much: it gets cut off mid-thought before it ever writes an actual answer. Then this fallback a bit further down in the same file kicks in:

  ```python
  if not content and reasoning:
      content = reasoning
  ```

  So the raw, unclosed chain-of-thought gets dumped straight into content. dataextract's strict json.loads and coding's code-fence regex both fail outright, not because the model got the task wrong, but because it never got the chance to answer. The other three suites grade more leniently, so they don't always get caught, but they do sometimes, depending on how chatty the model's CoT is on a given task.
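
To make the failure mode concrete, here's a rough sketch of what the two strict graders end up seeing once the fallback fires. Only the `if not content and reasoning` branch mirrors the snippet quoted above; the `resolve_content` and `grade` helpers, the sample message, and the fence regex are my own stand-ins and may not match what BenchLoop actually does.

```python
import json
import re

# Hypothetical stand-in for a response cut off at the token cap: the model is
# still inside <think>, so "content" is empty and only reasoning text exists.
truncated = {
    "content": "",
    "reasoning": "<think>The invoice total is 41.50, so the JSON should have",
}


def resolve_content(message):
    """Same fallback as quoted above: use reasoning when content is empty."""
    content = message.get("content")
    reasoning = message.get("reasoning")
    if not content and reasoning:
        content = reasoning
    return content


# Assumed shape of a code-fence grader; the real regex may differ.
FENCE = "`" * 3
CODE_FENCE = re.compile(FENCE + r"(?:\w+)?\n(.*?)" + FENCE, re.DOTALL)


def grade(text):
    """Return (json_ok, code_ok) for a strict JSON parse and a fence search."""
    try:
        json.loads(text)
        json_ok = True
    except json.JSONDecodeError:
        json_ok = False
    code_ok = CODE_FENCE.search(text) is not None
    return json_ok, code_ok


# Raw chain-of-thought is neither valid JSON nor a fenced code block.
print(grade(resolve_content(truncated)))
```

Both checks fail here even though the model never produced an answer to grade, which is the sense in which the scores are mangled rather than earned.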

  I confirmed this is real by re-running the same models with the cap raised, same hardware/quant/harness, nothing else changed:

  - Qwen3.6-35B-A3B (APEX-MTP quant): 65.2 → 84.9 overall
  - Qwen3.6-27B, thinking on: 61.1 → 81.2 (this one actually flips which mode looks better: thinking now beats no-think, which is the opposite of what the broken numbers said)
  - Gemma 4 12B Q4_K_M: 58.8 → 69.3
  - Gemma 4 12B Q8_0: 61.1 → 71.2
  - Gemma-4-12B-coder, which reasons concisely and wasn't really hitting the cap: 81.1 → 83.2, barely moved — good control case, shows this isn't
  just inflating every score across the board

  I get why a fixed cap exists: SPEC.md lists explicit max_tokens as a reproducibility requirement, and nobody wants every run turning into an all-day thing. But right now there's no way to raise it without editing the installed package.

  I've got a fix working locally: a --max-tokens flag that overrides the fixture default across every suite (still explicit and fixed-per-run, just
  configurable), plus some JSON-extraction hardening so dataextract can recover an answer even when there's leftover text around it instead of
  failing the whole task. Can open a PR for it if that's wanted.
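
For a sense of the shape, here's a minimal sketch of both pieces. It is not the actual patch: `resolve_max_tokens` and `extract_json` are illustrative names, and the precedence order (CLI value, then fixture value, then 2048) is an assumption about how the override would slot in. The only library calls are stdlib `argparse` and `json.JSONDecoder.raw_decode`.

```python
import argparse
import json


def resolve_max_tokens(cli_value, fixture_value, default=2048):
    """Pick the cap for a run: CLI override, else fixture value, else default."""
    if cli_value is not None:
        return int(cli_value)
    return int(fixture_value or default)


def extract_json(text):
    """Return the first JSON object or array that decodes from text, or None.

    Scans for '{' or '[' and tries to decode from each position, so leading
    or trailing prose around the payload does not fail the whole task.
    """
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch not in "{[":
            continue
        try:
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        return value
    return None


# Hypothetical CLI wiring; the flag name matches the one proposed above.
parser = argparse.ArgumentParser()
parser.add_argument("--max-tokens", type=int, default=None)
args = parser.parse_args(["--max-tokens", "8192"])

effective = resolve_max_tokens(args.max_tokens, fixture_value=2048)
print(effective)
print(extract_json('Sure, here it is: {"total": 41.5} Hope that helps.'))
```

The value stays fixed for the whole run, so it can still be recorded alongside the results. One caveat with first-match extraction: if a model drafts JSON inside its reasoning, the draft would be picked up before the final answer, so this probably wants to run only on text after the closing think tag.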
