Justification scores (1-10) aren't calibrated across queries -- an 8 for one query isn't comparable to an 8 for another, even though scores are surfaced to users.
Explore rubric-based scoring, fixed reference anchors in the prompt, or a calibration pass.
See docs/LIMITATIONS.md, Models section, Claude justifications.
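
One way the anchor and calibration ideas could combine is sketched below. It assumes (hypothetically, not something this project does today) that each query's prompt also asks the model to score a few fixed anchor items whose intended scores are agreed in advance. A least-squares line fitted from the raw anchor scores to the intended ones is then applied to the query's real scores. The function names `fit_anchor_calibration` and `calibrate` are illustrative only.

```python
from statistics import mean


def fit_anchor_calibration(raw_anchor_scores, target_anchor_scores):
    """Fit a least-squares line mapping raw anchor scores to target scores.

    Returns (slope, intercept). Both inputs are hypothetical: the raw scores
    are what the model gave the fixed anchors for this query, the targets
    are the scores those anchors are supposed to receive.
    """
    mx = mean(raw_anchor_scores)
    my = mean(target_anchor_scores)
    var = sum((x - mx) ** 2 for x in raw_anchor_scores)
    if var == 0:
        # All anchors got the same raw score, so no slope can be estimated;
        # fall back to a pure shift.
        return 1.0, my - mx
    cov = sum(
        (x - mx) * (y - my)
        for x, y in zip(raw_anchor_scores, target_anchor_scores)
    )
    slope = cov / var
    return slope, my - slope * mx


def calibrate(score, slope, intercept, lo=1.0, hi=10.0):
    """Apply the fitted line and clamp the result to the 1-10 scale."""
    return min(hi, max(lo, slope * score + intercept))


# Example: for this query the model scored the anchors 4, 6, 8,
# but they are meant to score 2, 5, 8.
slope, intercept = fit_anchor_calibration([4, 6, 8], [2, 5, 8])
calibrated = calibrate(6, slope, intercept)
```

A linear map is the simplest choice and only corrects per-query shift and scale; whether that is enough would need checking against real score distributions.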