Add OpenRouter as a compression backend (Claude Haiku, GPT-4o-mini, etc.) #50

@fazleelahhee

Motivation

CCE's compression layer (chunk summarization for retrieval) currently runs through a local Ollama install with phi3:mini (3.8B params). That setup works, but it has two real pain points:

  1. Setup friction — users without Ollama installed silently fall back to truncation-only compression, missing one of the bigger savings layers.
  2. Quality ceiling — phi3:mini sometimes paraphrases incorrectly, drops error-handling branches, or hallucinates type signatures, which degrades downstream retrieval relevance.

Adding OpenRouter as an alternative backend lets users with an API key skip the Ollama install entirely and pick a stronger model (Claude Haiku 4.5, GPT-4o-mini, Llama-3.1-70B, etc.) without CCE having to maintain a per-provider client for each.

Scope

What's in

  • New src/context_engine/compression/openrouter_client.py mirroring the OllamaClient interface (same summarize(prompt, model) -> str shape).
  • Extract a minimal LLMClient protocol so Compressor can hold either backend without conditional logic at every call site.
  • Config additions in .context-engine.yaml / ~/.cce/config.yaml:
    compression:
      provider: openrouter            # ollama (default) | openrouter
      model: anthropic/claude-haiku-4-5
      api_key: ${OPENROUTER_API_KEY}  # env var preferred
      base_url: https://openrouter.ai/api/v1  # override for proxies
  • Env var OPENROUTER_API_KEY overrides the config field (matches the CCE_OLLAMA_URL pattern).
  • cce status reports the active provider, model, and (for OpenRouter) whether the API key is set.
  • Tests with a stubbed HTTP client mirroring the Ollama test pattern.
  • Docs: new "Compression backends" section in docs/wiki/Configuration.md covering setup, model picks, and cost-per-reindex.
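
A minimal sketch of how the LLMClient protocol could let Compressor hold either backend. The summarize(prompt, model) -> str shape comes from this issue; the two fake clients, the make_compressor helper, and the compress method name are hypothetical stand-ins, not the existing CCE code, and none of them touch the network.

```python
from typing import Protocol


class LLMClient(Protocol):
    """Anything that can turn a prompt into a summary string."""

    def summarize(self, prompt: str, model: str) -> str: ...


class FakeOllamaClient:
    # Stand-in for the real OllamaClient; makes no network calls.
    def summarize(self, prompt: str, model: str) -> str:
        return f"[ollama:{model}] {prompt[:20]}"


class FakeOpenRouterClient:
    # Stand-in for the proposed OpenRouterClient; makes no network calls.
    def summarize(self, prompt: str, model: str) -> str:
        return f"[openrouter:{model}] {prompt[:20]}"


class Compressor:
    # Holds any LLMClient, so call sites never branch on the provider.
    def __init__(self, client: LLMClient, model: str) -> None:
        self.client = client
        self.model = model

    def compress(self, chunk: str) -> str:
        return self.client.summarize(chunk, self.model)


def make_compressor(provider: str, model: str) -> Compressor:
    # Hypothetical factory: the only place that knows which backends exist.
    clients = {"ollama": FakeOllamaClient, "openrouter": FakeOpenRouterClient}
    if provider not in clients:
        raise ValueError(f"unknown compression provider: {provider!r}")
    return Compressor(clients[provider](), model)
```

With this shape, switching compression.provider only changes which class the factory instantiates; Compressor itself stays untouched.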

What's out

  • Embeddings via OpenRouter — OpenRouter routes chat completions, not embeddings. Different feature, different providers (Voyage AI, OpenAI text-embedding-3, Cohere). Track separately.
  • Output compression — that's a different layer (Claude's responses, not CCE's chunks); unaffected.
  • Retrieval reranking via LLM — possible follow-up, but out of scope here.

Honest tradeoffs

Compression ratio: unchanged (~90%) — that's mostly truncation + structured summarization, not the model's reasoning power.

Quality (relevance fidelity): estimated 5–15% better recall on harder queries with Haiku/GPT-4o-mini vs. phi3:mini. Estimate is a guess until we run the recall benchmark below; should not be cited as a number until measured.

Latency: slower than local Ollama. phi3 local: ~50–100ms/chunk on M-series. OpenRouter Haiku: ~150–400ms/chunk including network. Processed one chunk at a time, a 10k-chunk first index goes from roughly 8–17 min to roughly 25–65 min.
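
The indexing-time estimate follows directly from the per-chunk latencies. A quick sketch of the arithmetic, assuming chunks are summarized strictly one after another (concurrent requests would shorten the wall-clock time; whether CCE issues them concurrently is not stated here):

```python
def index_minutes(chunks: int, ms_per_chunk: float) -> float:
    # Sequential requests: total time is chunks * per-chunk latency.
    return chunks * ms_per_chunk / 1000 / 60


CHUNKS = 10_000
latency_ranges_ms = [
    ("phi3:mini local", 50, 100),
    ("OpenRouter Haiku", 150, 400),
]

for label, low_ms, high_ms in latency_ranges_ms:
    low = index_minutes(CHUNKS, low_ms)
    high = index_minutes(CHUNKS, high_ms)
    print(f"{label}: {low:.1f}-{high:.1f} min")
```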

Cost (one-time index of ~10k chunks, ~5M input tokens):

  • Haiku 4.5 via OpenRouter: ~$5
  • GPT-4o-mini via OpenRouter: ~$0.75
  • Incremental reindexes (per-commit): cents
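
A small sketch of the arithmetic behind these figures. The per-million-token rates are derived from the totals quoted above rather than taken from a price list, and the linear scaling used for incremental reindexes is an assumption; real costs also depend on output tokens and current provider pricing.

```python
INPUT_TOKENS = 5_000_000
CHUNKS = 10_000

# Totals quoted in this issue for a one-time index.
quoted_totals_usd = {"Haiku 4.5": 5.00, "GPT-4o-mini": 0.75}


def implied_rate_per_million(total_usd: float, tokens: int = INPUT_TOKENS) -> float:
    # Back out the input price per 1M tokens implied by a quoted total.
    return total_usd / (tokens / 1_000_000)


def reindex_cost(total_usd: float, changed_chunks: int, total_chunks: int = CHUNKS) -> float:
    # Assumes cost scales linearly with the number of chunks re-summarized.
    return total_usd * changed_chunks / total_chunks


tokens_per_chunk = INPUT_TOKENS / CHUNKS  # average input tokens per chunk

for name, total in quoted_totals_usd.items():
    rate = implied_rate_per_million(total)
    commit_cost = reindex_cost(total, changed_chunks=50)
    print(f"{name}: ${rate:.2f}/M input tokens, ${commit_cost:.4f} per 50-chunk reindex")
```

Under these assumptions, a commit touching 50 chunks costs a few cents at most, which is where the "cents" figure for incremental reindexes comes from.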

The real win is adoption — users with an existing API key can skip the Ollama install. The quality bump is genuine but modest.

Pre-implementation: recall benchmark

Before merging, run a small A/B on a fixed corpus (suggest fastapi, already a benchmark target):

  • Bucket A: phi3:mini compression
  • Bucket B: Haiku 4.5 via OpenRouter
  • Bucket C: GPT-4o-mini via OpenRouter
  • Same query set as scripts/bench_recall.py
  • Report MRR, top-5 recall, and average compression latency per chunk
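
For reference, a minimal sketch of the two retrieval metrics named above. This is not taken from scripts/bench_recall.py; the function names and the input shape (a ranked list of chunk IDs plus a set of relevant IDs per query) are assumptions, and "top-5 recall" is interpreted here as the fraction of relevant chunks found in the first five results.

```python
def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    # 1 / position of the first relevant result, or 0 if none was retrieved.
    for position, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / position
    return 0.0


def recall_at_k(ranked: list[str], relevant: set[str], k: int = 5) -> float:
    # Fraction of the relevant chunks that appear in the top k results.
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)


def evaluate(results: list[tuple[list[str], set[str]]]) -> dict[str, float]:
    # results holds one (ranked_ids, relevant_ids) pair per query.
    if not results:
        return {"mrr": 0.0, "recall@5": 0.0}
    n = len(results)
    return {
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in results) / n,
        "recall@5": sum(recall_at_k(r, rel) for r, rel in results) / n,
    }
```

Running evaluate once per bucket on the same query set gives directly comparable numbers for phi3:mini, Haiku 4.5, and GPT-4o-mini.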

This puts real numbers in the wiki page so users can make an informed choice.

Test plan

  • OpenRouterClient.summarize() returns the model's text response and handles 4xx responses (bad key, model not found) with a clear error
  • OpenRouterClient retries transient 5xx / network errors with backoff (mirror Ollama client behavior)
  • Compressor round-trips a chunk through either backend based on compression.provider
  • Missing OPENROUTER_API_KEY falls back to truncation-only with a one-time log warning, not a hard error (matches Ollama-not-running behavior)
  • cce status shows the active provider correctly for both backends
  • Recall benchmark numbers committed alongside the code change, not after
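
The retry and fallback behaviors in this plan can be sketched without any network access by injecting the request function, which is also how a stubbed test could drive them. Everything here is hypothetical (the exception classes, summarize_with_retry, compress, and the 200-character truncation limit); the existing Ollama client may structure this differently.

```python
import logging
import time
from typing import Callable, Optional

log = logging.getLogger("compression")


class ClientError(Exception):
    """Non-retryable failure, e.g. a 4xx for a bad key or unknown model."""


class TransientError(Exception):
    """Retryable failure, e.g. a 5xx or a dropped connection."""


def summarize_with_retry(
    send: Callable[[str], str],
    prompt: str,
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return send(prompt)
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last transient error
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise AssertionError("unreachable")


_warned_missing_key = False


def compress(chunk: str, api_key: Optional[str], send: Callable[[str], str],
             max_chars: int = 200) -> str:
    global _warned_missing_key
    if not api_key:
        if not _warned_missing_key:
            log.warning("OPENROUTER_API_KEY not set; using truncation-only compression")
            _warned_missing_key = True  # warn once, not once per chunk
        return chunk[:max_chars]
    return summarize_with_retry(send, chunk)
```

ClientError is deliberately not caught, so a bad key or unknown model fails on the first call instead of being retried.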

Related

  • Existing Ollama abstraction: src/context_engine/compression/ollama_client.py, src/context_engine/compression/compressor.py
  • Configurable Ollama URL precedent (issue #22, "Possibility to configure external Ollama server"; commit 860cc1e): same pattern of "env var > config > default"
  • 7-layer benchmark (commit 48bd407): the framework that should produce the recall numbers
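
The "env var > config > default" precedence can be sketched in a few lines. The resolve helper is hypothetical, and http://localhost:11434 is Ollama's usual default address, assumed here to be CCE's default as well; OPENROUTER_API_KEY and CCE_OLLAMA_URL are the variable names used in this issue.

```python
import os
from typing import Mapping, Optional


def resolve(
    env_name: str,
    config_value: Optional[str],
    default: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    # Precedence: non-empty env var, then non-empty config value, then default.
    env_value = env.get(env_name)
    if env_value:
        return env_value
    if config_value:
        return config_value
    return default


# An API key has no sensible default, so None means "not configured".
api_key = resolve("OPENROUTER_API_KEY", config_value=None)

# Assumed default; the real CCE default may differ.
ollama_url = resolve("CCE_OLLAMA_URL", config_value=None, default="http://localhost:11434")
```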

Labels: enhancement