Spike: 5-page Gemini Flash calibration #43

@jakebromberg

Goal

Measure Gemini 3 Flash quality on the existing 5-golden calibration set vs Gemini 3 Pro (current baseline: 68/76, or 59/66 on the 4 pages where we have a recent same-prompt result). Output: a number per page, the total, and the data the user needs to make the binary decision of whether Flash becomes the default extraction model.

Decision approach

The threshold is set after the data lands, not before. Pre-committing without seeing Flash's failure modes would force a yes/no on a number disconnected from what actually breaks — a Flash that drops whole rows is qualitatively worse than a Flash that misspells artist names, and the score alone doesn't tell those apart. Inspect a couple of the failure cases by hand against the dump before deciding.

The downstream library-metadata-lookup (LML) service does fuzzy-match reconciliation against the WXYC library. That means misspellings (JESSIKA PRATT → Jessica Pratt) and abbreviations in Flash's output are recoverable downstream; only DROPPED ROWS and WRONG CONTENT actually erode corpus utility. So when reading the spike results, weight row-count discrepancies (compare_row_counts) more heavily than substring misses: the substring scorer is a proxy for OCR fidelity, not for corpus quality after LML.
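
To make the recoverable-versus-unrecoverable distinction concrete, here is a minimal sketch that buckets golden rows by how well the extracted rows match them. It uses Python's standard difflib as a stand-in for fuzzy matching; it is not LML's actual algorithm, and the function name classify_rows and the 0.85 threshold are assumptions chosen for illustration only.

```python
from difflib import SequenceMatcher


def classify_rows(golden_rows, extracted_rows, threshold=0.85):
    """Bucket each golden row as exact, recoverable, or missing.

    Hypothetical helper: difflib stands in for LML's fuzzy matching,
    and the threshold is an illustrative assumption, not a tuned value.
    """
    remaining = list(extracted_rows)
    result = {"exact": [], "recoverable": [], "missing": []}
    for golden in golden_rows:
        best, best_score = None, 0.0
        for candidate in remaining:
            score = SequenceMatcher(None, golden.lower(), candidate.lower()).ratio()
            if score > best_score:
                best, best_score = candidate, score
        if best is not None and best_score >= threshold:
            # Each extracted row may satisfy at most one golden row.
            remaining.remove(best)
            # "exact" here means identical after lowercasing.
            key = "exact" if best_score == 1.0 else "recoverable"
            result[key].append(golden)
        else:
            # No close candidate: the row was dropped or its content is wrong.
            result["missing"].append(golden)
    return result


if __name__ == "__main__":
    golden = ["Jessica Pratt", "Yo La Tengo", "Stereolab"]
    extracted = ["JESSIKA PRATT", "Yo La Tengo"]
    print(classify_rows(golden, extracted))
```

Rows that land in the "missing" bucket are the ones that matter most for the decision, since nothing downstream can restore them.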

Method

  • For the spike: a one-off script using core.gemini.GeminiClient(model="gemini-3-flash-preview") against each of the 5 goldens. gemini-3-flash-preview is the closest Flash-tier analog to the production gemini-3.1-pro-preview: it is the full Flash tier in the Gemini 3 generation, whereas the 3.1 generation only has Flash Lite, which is a smaller, separate tier.
  • If Flash clears whatever bar the user picks, PR C (#46, "Switch default model to gemini-flash (conditional on spike #43)") lands the model swap properly; at that point the gemini-flash adapter should be added to scripts/calibrate_models.py. Until then, the spike's one-off script is enough.
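
The one-off script could be shaped roughly like the sketch below. The extraction call is injected as a plain callable because the issue does not show GeminiClient's methods, so none are guessed here; in the real spike that callable would wrap core.gemini.GeminiClient(model="gemini-3-flash-preview"). The names run_spike and substring_score, and the case-insensitive substring rule, are assumptions about how the scorer behaves, not the repository's actual code.

```python
def substring_score(golden_rows, extracted_text):
    """Count golden rows that appear verbatim (case-insensitively) in the text.

    Assumed scoring rule for illustration; the project's scorer may differ.
    """
    haystack = extracted_text.lower()
    matched = sum(1 for row in golden_rows if row.lower() in haystack)
    return matched, len(golden_rows)


def run_spike(goldens, extract_page):
    """Score every golden page.

    goldens: dict mapping a page id to its list of expected row strings.
    extract_page: callable taking a page id and returning extracted text
        (hypothetical seam; the real one would call the Gemini client).
    Returns a dict mapping page id to (matched, total).
    """
    return {
        page_id: substring_score(rows, extract_page(page_id))
        for page_id, rows in goldens.items()
    }


if __name__ == "__main__":
    # Placeholder data, not real golden pages or real model output.
    goldens = {"p1": ["Jessica Pratt", "Stereolab"], "p2": ["Yo La Tengo"]}
    fake_output = {"p1": "JESSICA PRATT\nBroadcast", "p2": "yo la tengo - fakebook"}
    print(run_spike(goldens, fake_output.__getitem__))
```

Keeping the model call behind a callable means the same loop can be pointed at Pro for a fresh baseline run without touching the scoring code.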

Acceptance criteria

  • Per-page and total matched-row counts for Flash recorded as a comment on this issue.
  • Same-page Pro baseline (from current stored results or a fresh run) reported alongside for direct comparison.
  • Failure-case inspection: at least one Flash FAIL page's comparison output and quadrant content sampled by hand and described, to disambiguate misspellings (LML-recoverable) from dropped rows (LML-unrecoverable).
  • User picks a threshold given the data.
  • If accept: PR C lands the model swap.
  • If reject: PR C closes without merging; the spike report stays as the historical record.
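
For the report comment, the per-page and total counts could be laid out with something like the following sketch. It assumes each result is a (matched, total) pair keyed by page id and renders a Markdown table; the function name format_report and the page ids in the example are hypothetical, and the numbers are placeholders rather than spike results. Pages without a Pro baseline are shown as n/a and left out of the totals, so the totals compare like with like.

```python
def format_report(flash, pro):
    """Render per-page and total matched-row counts as a Markdown table.

    flash, pro: dicts mapping page id to (matched, total). Pages missing
    from pro are shown as "n/a" and excluded from the totals row.
    """
    def cell(matched, total):
        rate = f" ({matched / total:.1%})" if total else ""
        return f"{matched}/{total}{rate}"

    lines = ["| Page | Flash | Pro |", "| --- | --- | --- |"]
    for page in sorted(flash):
        pro_cell = cell(*pro[page]) if page in pro else "n/a"
        lines.append(f"| {page} | {cell(*flash[page])} | {pro_cell} |")

    # Totals only over pages where both models have a result.
    shared = [page for page in flash if page in pro]
    if shared:
        flash_total = tuple(sum(flash[p][i] for p in shared) for i in (0, 1))
        pro_total = tuple(sum(pro[p][i] for p in shared) for i in (0, 1))
        lines.append(
            f"| Total (shared pages) | {cell(*flash_total)} | {cell(*pro_total)} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    flash = {"page_1": (9, 10), "page_2": (5, 10)}
    pro = {"page_1": (10, 10)}
    print(format_report(flash, pro))
```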

Cost

Cents. Five Flash pages ≈ $0.025 total at current pricing.
