Add OpenRouter as a compression backend (Claude Haiku, GPT-4o-mini, etc.)

## Motivation

CCE's compression layer (chunk summarization for retrieval) currently runs through a local Ollama install with `phi3:mini` (3.8B params). That setup works, but two real pain points:

1. **Setup friction** — users without Ollama installed silently fall back to truncation-only compression, missing one of the bigger savings layers.
2. **Quality ceiling** — phi3:mini sometimes paraphrases incorrectly, drops error-handling branches, or hallucinates type signatures, which degrades downstream retrieval relevance.

Adding OpenRouter as an alternative backend lets users with an API key skip the Ollama install entirely and pick a stronger model (Claude Haiku 4.5, GPT-4o-mini, Llama-3.1-70B, etc.) without CCE having to maintain a per-provider client for each.

## Scope

### What's in

- New `src/context_engine/compression/openrouter_client.py` mirroring the `OllamaClient` interface (same `summarize(prompt, model) -> str` shape).
- Extract a minimal `LLMClient` protocol so `Compressor` can hold either backend without conditional logic at every call site.
- Config additions in `.context-engine.yaml` / `~/.cce/config.yaml`:
  ```yaml
  compression:
    provider: openrouter            # ollama (default) | openrouter
    model: anthropic/claude-haiku-4-5
    api_key: ${OPENROUTER_API_KEY}  # env var preferred
    base_url: https://openrouter.ai/api/v1  # override for proxies
  ```
- Env var `OPENROUTER_API_KEY` overrides the config field (matches the `CCE_OLLAMA_URL` pattern).
- `cce status` reports the active provider, model, and (for OpenRouter) whether the API key is set.
- Tests with a stubbed HTTP client mirroring the Ollama test pattern.
- Docs: new \"Compression backends\" section in `docs/wiki/Configuration.md` covering setup, model picks, and cost-per-reindex.

### What's out

- **Embeddings via OpenRouter** — OpenRouter routes chat completions, not embeddings. Different feature, different providers (Voyage AI, OpenAI text-embedding-3, Cohere). Track separately.
- **Output compression** — that's a different layer (Claude's responses, not CCE's chunks); unaffected.
- **Retrieval reranking via LLM** — possible follow-up, but out of scope here.

## Honest tradeoffs

**Compression ratio:** unchanged (~90%) — that's mostly truncation + structured summarization, not the model's reasoning power.

**Quality (relevance fidelity):** estimated **5–15% better recall on harder queries** with Haiku/GPT-4o-mini vs. phi3:mini. Estimate is a guess until we run the recall benchmark below; should not be cited as a number until measured.

**Latency:** *slower* than local Ollama. phi3 local: ~50–100ms/chunk on M-series. OpenRouter Haiku: ~150–400ms/chunk including network. A 10k-chunk first-index goes from ~10 min to ~20–60 min.

**Cost** (one-time index of ~10k chunks, ~5M input tokens):
- Haiku 4.5 via OpenRouter: ~$5
- GPT-4o-mini via OpenRouter: ~$0.75
- Incremental reindexes (per-commit): cents

**The real win is adoption** — users with an existing API key can skip the Ollama install. The quality bump is genuine but modest.

## Pre-implementation: recall benchmark

Before merging, run a small A/B on a fixed corpus (suggest fastapi, already a benchmark target):

- Bucket A: phi3:mini compression
- Bucket B: Haiku 4.5 via OpenRouter
- Bucket C: GPT-4o-mini via OpenRouter
- Same query set as `scripts/bench_recall.py`
- Report MRR, top-5 recall, and average compression latency per chunk

This puts real numbers in the wiki page so users can pick informed.

## Test plan

- [ ] `OpenRouterClient.summarize()` returns the model's text response, gracefully handles 4xx (bad key, model not found) with a clear error
- [ ] `OpenRouterClient` retries transient 5xx / network errors with backoff (mirror Ollama client behavior)
- [ ] `Compressor` round-trips a chunk through either backend based on `compression.provider`
- [ ] Missing `OPENROUTER_API_KEY` falls back to truncation-only with a one-time log warning, not a hard error (matches Ollama-not-running behavior)
- [ ] `cce status` shows the active provider correctly for both backends
- [ ] Recall benchmark numbers committed alongside the code change, not after

## Related

- Existing Ollama abstraction: `src/context_engine/compression/ollama_client.py`, `src/context_engine/compression/compressor.py`
- Configurable Ollama URL precedent (issue #22, commit `860cc1e`): same pattern of \"env var > config > default\"
- 7-layer benchmark (commit `48bd407`): the framework that should produce the recall numbers

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add OpenRouter as a compression backend (Claude Haiku, GPT-4o-mini, etc.) #50

Motivation

Scope

What's in

What's out

Honest tradeoffs

Pre-implementation: recall benchmark

Test plan

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Add OpenRouter as a compression backend (Claude Haiku, GPT-4o-mini, etc.) #50

Description

Motivation

Scope

What's in

What's out

Honest tradeoffs

Pre-implementation: recall benchmark

Test plan

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions