Calibrate Claude relevance/specificity scores across queries

Justification scores (1-10) aren't calibrated across queries -- an 8 for one query isn't comparable to an 8 for another, even though scores are surfaced to users.

Explore rubric-based scoring, fixed reference anchors in the prompt, or a calibration pass.

See docs/LIMITATIONS.md, Models section, Claude justifications.