none approach drops tool_calls on streaming requests #312

@joby-brentsmith

Repo: https://github.com/algorithmicsuperintelligence/optillm
Version tested: 0.3.15
File: optillm/server.py

Summary

When OptiLLM runs with approach none (direct pass-through, selected when the model name has no optimization prefix or an explicit none- prefix), streaming requests that include tools do not return tool_calls to the client.

The proxy buffers the upstream response, extracts only the assistant text, and synthesizes a single SSE chunk with finish_reason: "stop". OpenAI-compatible agent clients (Zed, Cursor, custom tool loops) see the assistant's lead-in text (for example, "Sure! Let me run that...") but never receive the tool_calls data, so tool execution never starts.

none_approach() itself is implemented correctly as a transparent proxy — the bug is in how /v1/chat/completions handles the none branch when stream: true.

Steps to reproduce

  1. Start OptiLLM pointing at any OpenAI-compatible upstream that supports tool calling:

    optillm --base-url http://127.0.0.1:4000/v1 --port 8000
  2. Send a streaming chat completion with tools (model has no optimization prefix → routes to none):

    curl -s -N http://127.0.0.1:8000/v1/chat/completions \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": "Run echo hello with the shell tool"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "shell",
            "description": "Run a shell command",
            "parameters": {
              "type": "object",
              "properties": {"command": {"type": "string"}},
              "required": ["command"]
            }
          }
        }],
        "tool_choice": "auto",
        "stream": true
      }'

Actual behavior

OptiLLM returns a single synthesized chunk:

{"choices":[{"delta":{"role":"assistant","content":"Sure! Let me run that..."},"finish_reason":"stop"}]}

This is followed by [DONE]. The stream contains no tool_calls deltas.

Expected behavior

OptiLLM should forward upstream SSE chunks verbatim, including tool_calls deltas and finish_reason: "tool_calls", e.g.:

{"choices":[{"delta":{"tool_calls":[{"index":0,"id":"...","function":{"name":"shell","arguments":""}}]}}]}

Hitting the same upstream directly (bypassing OptiLLM) with stream: true produces the expected tool-call chunks.
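For context, here is a minimal sketch of how a typical client reassembles tool calls from streamed deltas, which is why dropping them stops tool execution. The helper name collect_tool_calls and the sample stream are illustrative assumptions, not code from OptiLLM or from any specific client:

```python
import json


def collect_tool_calls(sse_lines):
    """Accumulate streamed tool_calls deltas into complete calls, keyed by index."""
    calls = {}
    finish_reason = None
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # ignore blank lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            for tc in delta.get("tool_calls") or []:
                slot = calls.setdefault(
                    tc["index"], {"id": None, "name": "", "arguments": ""}
                )
                if tc.get("id"):
                    slot["id"] = tc["id"]
                fn = tc.get("function") or {}
                # Name and arguments arrive as string fragments; concatenate them.
                slot["name"] += fn.get("name") or ""
                slot["arguments"] += fn.get("arguments") or ""
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return [calls[i] for i in sorted(calls)], finish_reason


# Hand-written sample stream (illustrative data, not captured from a real upstream).
sample = [
    'data: {"choices":[{"delta":{"tool_calls":[{"index":0,"id":"call_1","function":{"name":"shell","arguments":""}}]}}]}',
    'data: {"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\\"command\\": "}}]}}]}',
    'data: {"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\\"echo hello\\"}"}}]}}]}',
    'data: {"choices":[{"delta":{},"finish_reason":"tool_calls"}]}',
    "data: [DONE]",
]
tool_calls, finish_reason = collect_tool_calls(sample)
```

Fed the single synthesized chunk shown under "Actual behavior", the same helper returns an empty list and finish_reason "stop", so a tool loop built this way has nothing to execute.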

Root cause

In proxy(), the none branch currently does:

execute_single_approach()            # reconstructs messages from parse_conversation() text
  → none_approach(stream=False)      # stream stripped from kwargs
  → extract_contents(result)         # text from choices[0].message.content only
  → generate_streaming_response()    # fake SSE with finish_reason: "stop"

Problems:

  1. generate_streaming_response() only emits text — it has no concept of tool_calls.
  2. Upstream is always called non-streaming for the none path (kwargs.pop('stream', None) in execute_single_approach).
  3. Messages are reconstructed from parse_conversation(), which flattens user/assistant text and drops tool role messages and prior tool_calls — breaking multi-turn agent loops.
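To illustrate problem 3, the sketch below shows what any text-only flattening loses in a multi-turn tool loop. flatten_to_text is a hypothetical stand-in written for this issue; it is not OptiLLM's parse_conversation(), and the message history is made-up sample data:

```python
def flatten_to_text(messages):
    """Hypothetical text-only flattening; NOT OptiLLM's parse_conversation()."""
    lines = []
    for m in messages:
        # Only user/assistant messages with text content survive.
        if m["role"] in ("user", "assistant") and m.get("content"):
            lines.append(f"{m['role'].capitalize()}: {m['content']}")
    return "\n".join(lines)


# Sample second-turn request: the assistant already called the tool and the
# client is sending the tool result back.
history = [
    {"role": "user", "content": "Run echo hello with the shell tool"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {"name": "shell", "arguments": '{"command": "echo hello"}'},
            }
        ],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "hello"},
]
flattened = flatten_to_text(history)
```

After flattening, only the user line remains: the assistant's tool_calls entry and the tool result are both gone, so the upstream model never sees that the command already ran.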

Secondary issue (non-streaming)

Some providers return assistant text in choices[0] and tool_calls in choices[1]. OptiLLM returns the raw response unchanged, so clients that only read choices[0] miss the tool calls. (Similar to goose#6369.)
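One possible normalization for this case is sketched below. The function name merge_split_tool_choices and the decision to drop the merged extra choices are assumptions made for illustration; a request that legitimately asks for several choices (n > 1) would need different handling:

```python
import copy


def merge_split_tool_choices(response):
    """Return a copy where tool_calls from later choices are folded into choices[0]."""
    merged = copy.deepcopy(response)  # never mutate the upstream response
    choices = merged.get("choices") or []
    if len(choices) < 2:
        return merged

    first_msg = choices[0].setdefault("message", {})
    kept = [choices[0]]
    moved = False
    for extra in choices[1:]:
        extra_calls = (extra.get("message") or {}).get("tool_calls") or []
        if extra_calls:
            existing = first_msg.get("tool_calls") or []
            first_msg["tool_calls"] = existing + extra_calls
            moved = True
        else:
            kept.append(extra)  # leave unrelated choices in place

    if moved:
        choices[0]["finish_reason"] = "tool_calls"
        merged["choices"] = kept
    return merged


# Made-up response in the split shape described above.
split_response = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Sure!"}, "finish_reason": "stop"},
        {
            "index": 1,
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {"id": "call_1", "type": "function",
                     "function": {"name": "shell", "arguments": '{"command": "echo hello"}'}}
                ],
            },
            "finish_reason": "tool_calls",
        },
    ]
}
normalized = merge_split_tool_choices(split_response)
```

Responses that already carry everything in a single choice pass through unchanged.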

Impact

Any OpenAI-compatible client using OptiLLM as a proxy with:

  • stream: true
  • tools / tool_choice
  • approach none (unprefixed model name, or explicit none- prefix)

…will fail to execute tools. This affects IDE agents, MCP integrations, and custom agent frameworks.

Optimization approaches (rto-, cot_reflection-, etc.) are unaffected — they don't claim to be transparent proxies.

Proposed fix

For operation == 'SINGLE' and approaches[0] == 'none' only:

  1. stream: true → call none_approach upstream with stream=True and yield SSE chunks verbatim (generate_stream_passthrough).
  2. stream: false → call none_approach with the original request messages (not reconstructed), then optionally merge split tool choices into choices[0].
  3. Leave generate_streaming_response() unchanged for optimization approaches that produce text output.

A patch is attached in none-passthrough-tool-calls.patch (~70 lines, optillm/server.py only).
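Independently of the attached patch, the passthrough idea in item 1 can be sketched as a generator that re-emits upstream events without parsing them. This is an illustration under assumptions, not the patch itself: it assumes the upstream body is read line by line as bytes with terminators stripped (for example via requests' Response.iter_lines()), and that each SSE event is a single data: line, as in OpenAI-style chat completion streams:

```python
from typing import Iterable, Iterator


def generate_stream_passthrough(upstream_lines: Iterable[bytes]) -> Iterator[bytes]:
    """Yield upstream SSE events unchanged; the proxy adds no chunks of its own."""
    for line in upstream_lines:
        if not line:
            continue  # blank separator or keep-alive line
        # Re-terminate each single-line event with the blank line SSE requires.
        yield line.rstrip(b"\r\n") + b"\n\n"


# Stand-in for the upstream line iterator (sample data, not a real capture).
fake_upstream = [
    b'data: {"choices":[{"delta":{"tool_calls":[{"index":0,"id":"call_1","function":{"name":"shell","arguments":""}}]}}]}',
    b"",
    b'data: {"choices":[{"delta":{},"finish_reason":"tool_calls"}]}',
    b"",
    b"data: [DONE]",
    b"",
]
forwarded = b"".join(generate_stream_passthrough(fake_upstream))
```

Because the payload bytes are never decoded or rebuilt, tool_calls deltas and finish_reason: "tool_calls" reach the client exactly as the upstream sent them.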

Workaround (today)

Bypass OptiLLM for tool-heavy agent sessions and point clients directly at the upstream OpenAI-compatible endpoint.
