Spike: 5-page Gemini Flash calibration #43

## Goal

Measure Gemini 3 Flash quality on the existing 5-golden calibration set vs Gemini 3 Pro (current baseline: 68/76 matched rows, or 59/66 on the 4 pages where we have a recent same-prompt result). Output: a matched-row count per page, the total, and the data the user needs to make the binary decision of whether Flash becomes the default extraction model.
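The two baseline figures are easier to compare as rates. A quick check using only the counts quoted above. The fifth-page figure is derived by subtraction and assumes the 4-page result is a strict subset of the 5-page one, which this issue does not confirm.

```python
from fractions import Fraction

# Counts quoted in the Goal section (matched rows / expected rows).
baseline_all = Fraction(68, 76)           # all 5 goldens
baseline_same_prompt = Fraction(59, 66)   # 4 pages with a recent same-prompt result


def pct(rate):
    """Format a rate as a percentage with one decimal place."""
    return f"{float(rate) * 100:.1f}%"


# ASSUMPTION: the 4-page figure is a subset of the 5-page figure,
# so the remaining page is the difference of the two.
remaining_page = (68 - 59, 76 - 66)

print("all 5 pages:       ", pct(baseline_all))
print("4 same-prompt pages:", pct(baseline_same_prompt))
print("remaining page:     ", remaining_page)
```

Both baselines land at roughly the same rate, so either can serve as the Pro reference as long as Flash is scored on the same pages.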

## Decision approach

The threshold is set **after** the data lands, not before. Pre-committing without seeing Flash's failure modes would force a yes/no on a number disconnected from what actually breaks — a Flash that drops whole rows is qualitatively worse than a Flash that misspells artist names, and the score alone doesn't tell those apart. Inspect a couple of the failure cases by hand against the dump before deciding.

The downstream `library-metadata-lookup` (LML) service does fuzzy-match reconciliation against the WXYC library. That means misspellings (`JESSIKA PRATT` → `Jessica Pratt`) and abbreviations in Flash's output are recoverable downstream; only DROPPED ROWS and WRONG CONTENT actually erode corpus utility. So when reading the spike results, weight row-count discrepancies (`compare_row_counts`) more heavily than substring misses: the substring scorer is a proxy for OCR fidelity, not for corpus quality after LML.
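To make the recoverable / unrecoverable split concrete, here is a minimal sketch. It uses the standard-library `difflib` as a stand-in for LML's matcher, whose actual algorithm is not described in this issue. The `classify_row` and `summarize_page` helpers, the `0.8` cutoff, and the sample library are all hypothetical.

```python
import difflib
from collections import Counter


def classify_row(extracted, library, cutoff=0.8):
    """Label one extracted artist string against a list of known names.

    Returns "exact", "recoverable" (a close fuzzy match exists), or "unmatched".
    ASSUMPTION: difflib similarity approximates what LML would recover.
    """
    known = [name.casefold() for name in library]
    key = extracted.casefold().strip()
    if key in known:
        return "exact"
    if difflib.get_close_matches(key, known, n=1, cutoff=cutoff):
        return "recoverable"
    return "unmatched"


def summarize_page(expected_rows, extracted, library):
    """Tally row labels for a page and count rows missing entirely."""
    counts = Counter(classify_row(name, library) for name in extracted)
    # Dropped rows never reach the fuzzy matcher, so they only show up
    # as a gap between expected and extracted row counts.
    counts["dropped"] = max(expected_rows - len(extracted), 0)
    return dict(counts)


# Hypothetical sample data, not taken from the goldens.
library = ["Jessica Pratt", "Stereolab", "Yo La Tengo"]
summary = summarize_page(3, ["JESSIKA PRATT", "Stereolab"], library)
print(summary)
```

In this sketch a misspelling is labeled "recoverable", while a missing row only appears in the "dropped" count, which is why row counts deserve more weight than substring misses.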

## Method

- For the spike: a one-off script using `core.gemini.GeminiClient(model="gemini-3-flash-preview")` against each of the 5 goldens. `gemini-3-flash-preview` is the closest Flash-tier analog to the production `gemini-3.1-pro-preview`: it is the full Flash tier in the Gemini 3 generation, whereas the 3.1 generation only has Flash Lite, which is a smaller separate tier.
- If Flash clears whatever bar the user picks, PR C (#46) lands the model swap properly, and at that point the `gemini-flash` adapter should be added to `scripts/calibrate_models.py`. Until then, the spike's one-off script is enough.
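A rough shape for the one-off script, as a sketch only. The extraction call is passed in as a plain callable because the methods on `GeminiClient` are not shown in this issue. The `Golden` type, the case-insensitive substring scoring, and the sample data are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass
class Golden:
    page: str                # hypothetical page identifier or image path
    expected: Sequence[str]  # strings expected to appear in the extraction


def score_page(text: str, expected: Sequence[str]) -> int:
    """Count expected strings found in text (case-insensitive substring match)."""
    haystack = text.casefold()
    return sum(1 for item in expected if item.casefold() in haystack)


def run_spike(
    goldens: Sequence[Golden], extract: Callable[[str], str]
) -> Tuple[Dict[str, Tuple[int, int]], Tuple[int, int]]:
    """Return per-page (matched, total) pairs plus the overall (matched, total)."""
    per_page = {}
    for golden in goldens:
        text = extract(golden.page)
        per_page[golden.page] = (score_page(text, golden.expected), len(golden.expected))
    matched = sum(m for m, _ in per_page.values())
    total = sum(t for _, t in per_page.values())
    return per_page, (matched, total)


# Stub extractor standing in for a real model call (ASSUMPTION: in the spike
# this would wrap whatever extraction method GeminiClient exposes).
def fake_extract(page: str) -> str:
    return {"page-1": "JESSICA PRATT  side A"}.get(page, "")


goldens = [
    Golden("page-1", ["Jessica Pratt", "side B"]),
    Golden("page-2", ["Stereolab"]),
]
per_page, overall = run_spike(goldens, fake_extract)
print(per_page, overall)
```

Keeping the model call behind a callable means the same harness can score Pro and Flash by swapping the extractor, without committing to an adapter before PR C.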

## Acceptance criteria

- [ ] Per-page and total matched-row counts for Flash recorded as a comment on this issue.
- [ ] Same-page Pro baseline (from current stored results or a fresh run) reported alongside for direct comparison.
- [ ] Failure-case inspection: at least one Flash FAIL page's `comparing` output and quadrant content sampled by hand and described, to disambiguate misspellings (LML-recoverable) from dropped rows (LML-unrecoverable).
- [ ] User picks a threshold given the data.
- [ ] If accept: PR C lands the model swap.
- [ ] If reject: PR C closes without merging; the spike report stays as the historical record.
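For the first two criteria, the results comment could be rendered as a Markdown table with Pro and Flash side by side. A minimal sketch follows. `render_report` is a hypothetical helper, and the rows are placeholder values, not real results.

```python
def render_report(rows):
    """Render (page, pro_matched, flash_matched, total) tuples as a Markdown table."""
    lines = ["| Page | Pro | Flash |", "| --- | --- | --- |"]
    for page, pro, flash, total in rows:
        lines.append(f"| {page} | {pro}/{total} | {flash}/{total} |")
    pro_sum = sum(row[1] for row in rows)
    flash_sum = sum(row[2] for row in rows)
    total_sum = sum(row[3] for row in rows)
    lines.append(f"| **Total** | {pro_sum}/{total_sum} | {flash_sum}/{total_sum} |")
    return "\n".join(lines)


# PLACEHOLDER values for illustration only; not measured results.
report = render_report([("page-1", 9, 8, 10), ("page-2", 5, 5, 6)])
print(report)
```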

## Cost

Cents. Five Flash pages ≈ $0.025 total (about $0.005 per page) at current pricing.

## Related

- Sprint 2 parent: #38
- Conditional PR C: #46
- Library reconciliation service (downstream filter): https://github.com/WXYC/library-metadata-lookup

