Goal
Measure Gemini 3 Flash quality on the existing 5-golden calibration set vs Gemini 3 Pro (current baseline: 68/76, or 59/66 on the 4 pages where we have a recent same-prompt result). Output: a number per page, the total, and the data the user needs to make the binary decision of whether Flash becomes the default extraction model.
Decision approach
The threshold is set after the data lands, not before. Pre-committing without seeing Flash's failure modes would force a yes/no on a number disconnected from what actually breaks — a Flash that drops whole rows is qualitatively worse than a Flash that misspells artist names, and the score alone doesn't tell those apart. Inspect a couple of the failure cases by hand against the dump before deciding.
The downstream library-metadata-lookup (LML) service does fuzzy-match reconciliation against the WXYC library. That means misspellings (JESSIKA PRATT → Jessica Pratt) and abbreviations in Flash's output are recoverable downstream; only DROPPED ROWS and WRONG CONTENT actually erode corpus utility. So when reading the spike results, weight row-count discrepancies (compare_row_counts) more heavily than substring misses: the substring scorer is a proxy for OCR fidelity, not for corpus quality after LML.
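The distinction above can be sketched as a small triage pass over a page's misses. This is a minimal illustration, not the LML matcher: `triage_misses`, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as a stand-in for LML's fuzzy matching are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so case and spacing do not count as misses."""
    return " ".join(text.lower().split())


def triage_misses(golden_rows, extracted_rows, threshold=0.8):
    """Bucket each golden row as exact, recoverable (near miss), or missing.

    SequenceMatcher is only a stand-in for LML's fuzzy matching, and the
    threshold is an arbitrary illustrative value, not a tuned one.
    """
    extracted = [normalize(row) for row in extracted_rows]
    result = {"exact": [], "recoverable": [], "missing": []}
    for row in golden_rows:
        target = normalize(row)
        # Same test as a substring scorer: the golden text appears verbatim.
        if any(target in candidate for candidate in extracted):
            result["exact"].append(row)
            continue
        # Otherwise take the best similarity against any extracted row.
        best = max(
            (SequenceMatcher(None, target, candidate).ratio() for candidate in extracted),
            default=0.0,
        )
        bucket = "recoverable" if best >= threshold else "missing"
        result[bucket].append(row)
    return result
```

With this split, a row like JESSIKA PRATT lands in `recoverable` against the golden Jessica Pratt, while a golden row with no close counterpart lands in `missing`, which is the bucket that should drive the decision.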
Method
For the spike, use a one-off script that runs core.gemini.GeminiClient(model="gemini-3-flash-preview") against each of the 5 goldens. gemini-3-flash-preview is the closest Flash analog to the production gemini-3.1-pro-preview: it is the full Flash tier in the Gemini 3 generation, whereas the 3.1 generation only has Flash Lite, which is a smaller, separate tier.
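A rough shape for that one-off script is sketched below. The methods of core.gemini.GeminiClient are not shown in this issue, so the sketch takes the extraction step as a plain callable; `run_spike`, `score_page`, `PageScore`, and the golden format (page id mapped to a list of expected row strings) are hypothetical names and shapes chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PageScore:
    page: str
    matched: int
    expected: int


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before substring comparison."""
    return " ".join(text.lower().split())


def score_page(page: str, golden_rows: List[str], extracted_text: str) -> PageScore:
    """Count golden rows that appear as substrings of the extracted text."""
    haystack = normalize(extracted_text)
    matched = sum(1 for row in golden_rows if normalize(row) in haystack)
    return PageScore(page, matched, len(golden_rows))


def run_spike(
    goldens: Dict[str, List[str]],
    extract: Callable[[str], str],
) -> List[PageScore]:
    """Run `extract` on every golden page and score the result.

    In the real spike, `extract` would wrap
    core.gemini.GeminiClient(model="gemini-3-flash-preview"); how that client
    is invoked is an assumption left to the caller.
    """
    return [score_page(page, rows, extract(page)) for page, rows in goldens.items()]
```

Keeping the model call behind a callable means the same loop can be pointed at Pro for a fresh same-prompt baseline without changing the scoring code.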
If Flash clears whatever bar the user picks, PR C (#46, "Switch default model to gemini-flash (conditional on spike #43)") lands the model swap properly, and at that point the gemini-flash adapter should be added to scripts/calibrate_models.py. Until then, the spike's one-off script is enough.
Acceptance criteria
Per-page and total matched-row counts for Flash recorded as a comment on this issue.
Same-page Pro baseline (from current stored results or a fresh run) reported alongside for direct comparison.
Failure-case inspection: at least one Flash FAIL page's comparing output and quadrant content sampled by hand and described, to disambiguate misspellings (LML-recoverable) from dropped rows (LML-unrecoverable).
User picks a threshold given the data.
If accept: PR C lands the model swap.
If reject: PR C closes without merging; the spike report stays as the historical record.
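The reporting criteria above (per-page counts, totals, and a same-page Pro comparison) could be rendered for the issue comment along these lines. This is a sketch under assumptions: `format_report` and its input shape (page id mapped to a `(matched, expected)` pair) are hypothetical, and pages with no recent Pro result are shown as n/a instead of being guessed.

```python
from typing import Dict, Iterable, Tuple

Score = Tuple[int, int]  # (matched rows, expected rows)


def fmt(score: Score) -> str:
    return f"{score[0]}/{score[1]}"


def total(scores: Dict[str, Score], pages: Iterable[str]) -> Score:
    """Sum matched and expected counts over the given pages."""
    pages = list(pages)
    return (sum(scores[p][0] for p in pages), sum(scores[p][1] for p in pages))


def format_report(flash: Dict[str, Score], pro: Dict[str, Score]) -> str:
    """Build a plain-text table: one row per page, then overall and same-page totals."""
    rows = [("page", "flash", "pro")]
    for page in sorted(flash):
        pro_cell = fmt(pro[page]) if page in pro else "n/a"
        rows.append((page, fmt(flash[page]), pro_cell))
    # Overall totals use whatever pages each model has results for.
    rows.append(("total", fmt(total(flash, flash)), fmt(total(pro, pro))))
    # Same-page totals restrict both models to pages scored by both.
    shared = [page for page in sorted(flash) if page in pro]
    rows.append(("same-page", fmt(total(flash, shared)), fmt(total(pro, shared))))
    return "\n".join(f"{a:<12}{b:>8}{c:>8}" for a, b, c in rows)
```

The same-page row is the one to compare directly, since the overall Pro total may cover a different set of pages than the Flash run.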
Cost
Cents. Five Flash pages ≈ $0.025 total at current pricing.
Related