Improving consistency across LLM calls for code review #186473
Hi all,
Replies: 1 comment
You won’t get perfect determinism from most hosted LLMs, but you can significantly reduce variance in a structured code review pipeline.

High-leverage strategies (beyond temperature):
- Keep top_p fixed (often 1) and hold frequency/presence penalties constant.
- If the API supports it, set a fixed random seed.
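As a concrete sketch, the pinned decoding settings can live in one constant that every review call reuses. The parameter names below follow OpenAI-style chat completion APIs (`temperature`, `top_p`, `seed`, the penalty fields); check your provider's docs, since `seed` in particular is only honored where seeding is supported.

```python
# Pinned sampling parameters, held constant across every review call.
# Names follow OpenAI-style chat completion APIs; verify against your provider.
PINNED_PARAMS = {
    "temperature": 0,        # minimize sampling randomness
    "top_p": 1,              # fixed nucleus cutoff
    "frequency_penalty": 0,  # hold penalties constant across calls
    "presence_penalty": 0,
    "seed": 42,              # only honored by APIs that support seeding
}

def review_request(model: str, messages: list) -> dict:
    """Build a request payload that never varies except for the messages."""
    return {"model": model, "messages": messages, **PINNED_PARAMS}
```

The point is less the specific values than that they are defined once and never drift between calls.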
- Use structured outputs (JSON schema / function calling).
- Keep enums small and explicit (e.g. severity: ["blocker", "major", "minor"]).
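A minimal sketch of what that schema might look like, with a deliberately small severity enum and a stdlib-only structural check (in practice you would pass the schema to the API's structured-output mode and/or validate with a library like `jsonschema`; the field names here are illustrative):

```python
SEVERITIES = ("blocker", "major", "minor")  # small, explicit enum

FINDING_SCHEMA = {  # JSON-Schema-style shape, usable with function calling
    "type": "object",
    "required": ["file", "line_range", "severity", "message"],
    "properties": {
        "file": {"type": "string"},
        "line_range": {"type": "array", "items": {"type": "integer"}},
        "severity": {"type": "string", "enum": list(SEVERITIES)},
        "message": {"type": "string"},
    },
}

def is_valid_finding(obj: dict) -> bool:
    """Minimal structural check mirroring the schema above."""
    return (
        all(k in obj for k in FINDING_SCHEMA["required"])
        and obj["severity"] in SEVERITIES
    )
```

Rejecting anything outside the enum keeps downstream merging and voting deterministic.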
- Provide a clear review rubric with categories and scoring rules.
- Limit output size (e.g. “max 8 findings”, “one finding per item”).
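For illustration, the rubric can be a plain mapping you interpolate into the prompt, and the output cap can be enforced in code rather than trusted to the model (the categories and scores here are made up, not a standard):

```python
RUBRIC = {  # illustrative categories and scoring rules
    "correctness": "Does the change behave as intended? Score 0-3.",
    "security": "Does it introduce injection or authz risks? Score 0-3.",
    "style": "Does it follow project conventions? Score 0-1.",
}
MAX_FINDINGS = 8

def cap_findings(findings: list) -> list:
    """Enforce the output-size limit deterministically, preserving order."""
    return findings[:MAX_FINDINGS]
```

Enforcing the cap post hoc means a chatty run and a terse run converge on the same upper bound.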
- Normalize diffs and context before sending:
  - Stable file ordering
  - Consistent chunking
  - Remove unrelated or nondeterministic metadata
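A sketch of that normalization, assuming you already have per-file diff text keyed by path (the `index`/`diff --git` prefixes dropped here are git-specific metadata that varies between runs):

```python
def normalize_diff(file_diffs: dict) -> str:
    """Order files stably and drop lines that vary between runs."""
    chunks = []
    for path in sorted(file_diffs):           # stable file ordering
        body = "\n".join(
            line for line in file_diffs[path].splitlines()
            # drop nondeterministic metadata such as blob hashes
            if not line.startswith(("index ", "diff --git"))
        )
        chunks.append(f"### {path}\n{body}")  # consistent chunk framing
    return "\n\n".join(chunks)
```

Identical inputs in, identical prompt text out — which is a precondition for everything else on this list.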
- Split the review into two passes:
  - Pass 1: extract potential issues only (no severity).
  - Pass 2: assign severity and recommendations using the rubric.

  This greatly stabilizes the set of findings.
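The two-pass flow can be sketched as below. `call_llm` is a hypothetical callable (not a real library function) standing in for your API client; it is assumed to return parsed JSON, and the prompt wording is illustrative only:

```python
def two_pass_review(diff: str, rubric: str, call_llm) -> list:
    """Two-pass review: extract first, then score against the rubric.

    call_llm: hypothetical callable taking a prompt string and
    returning parsed JSON (a list of dicts, or a dict).
    """
    # Pass 1: extract candidate issues only -- no severity judgments yet.
    issues = call_llm(f"List potential issues in this diff as JSON:\n{diff}")
    # Pass 2: score each issue against the rubric in a separate call.
    return [
        {**issue, **call_llm(f"Using this rubric:\n{rubric}\n"
                             f"Assign severity to: {issue}")}
        for issue in issues
    ]
```

Separating "what did you see" from "how bad is it" keeps the issue set from shifting when the severity reasoning wobbles.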
- Use self-consistency voting: run N=3–5 inexpensive calls and majority-vote on findings or rubric scores. Works best with structured outputs that can be merged deterministically.
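A sketch of that vote, assuming the structured findings from above so each one can be keyed deterministically by `(file, line_range)`:

```python
from collections import Counter

def majority_vote(runs: list, threshold: float = 0.5) -> list:
    """Keep findings reported in more than `threshold` of the runs.

    `runs` is a list of runs, each a list of finding dicts. Findings are
    keyed by (file, line_range) so the merge is deterministic.
    """
    counts = Counter(
        (f["file"], tuple(f["line_range"])) for run in runs for f in run
    )
    keep = {k for k, n in counts.items() if n / len(runs) > threshold}
    merged, seen = [], set()
    for run in runs:                      # first occurrence wins, stable order
        for f in run:
            key = (f["file"], tuple(f["line_range"]))
            if key in keep and key not in seen:
                seen.add(key)
                merged.append(f)
    return merged
```

Findings that only appear in one run out of three are usually the noise you are trying to suppress.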
- Generate a draft review, then run a second pass that:
  - Removes duplicates
  - Enforces the rubric
  - Rejects uncited or ungrounded claims
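The dedupe/reject half of that critique pass does not even need a model call; assuming each finding carries a quoted `snippet` field (as in the evidence rule below), it can be a plain filter:

```python
def critique_pass(findings: list, diff: str) -> list:
    """Deterministic post-filter: dedupe and drop ungrounded claims."""
    seen, out = set(), []
    for f in findings:
        key = (f["file"], f["message"])
        if key in seen:
            continue                      # remove duplicates
        snippet = f.get("snippet")
        if not snippet or snippet not in diff:
            continue                      # reject uncited/ungrounded claims
        seen.add(key)
        out.append(f)
    return out
```

Only the rubric-enforcement part genuinely needs a second model call; everything mechanical should stay in code.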
- Require evidence. Each finding must include:
  - File
  - Line range
  - Quoted snippet from the diff

  If evidence is missing, require the model to output “not enough context.”
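That rule can also be enforced on your side as a backstop, forcing the fallback output whenever the quoted evidence cannot be located in the diff (field names match the earlier schema; the `status` fallback shape is an assumption):

```python
def ground_finding(finding: dict, diff: str) -> dict:
    """Pass a finding through only if its quoted evidence is in the diff."""
    snippet = finding.get("snippet")
    has_evidence = (
        "file" in finding
        and "line_range" in finding
        and bool(snippet)
        and snippet in diff          # the quote must literally appear
    )
    if has_evidence:
        return finding
    return {"status": "not enough context"}  # forced fallback output
```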
- Run separate passes for security, performance, style, tests, etc. Merge results deterministically with a fixed priority order.

A structure that’s consistently stable in practice:
- Input: diff + rubric
- Output: JSON array of findings
- Hard rule: no finding without an evidence snippet from the diff
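The deterministic merge over the specialized passes can be sketched like this: a fixed category priority list, with the first category to claim a `(file, line_range)` location winning (the priority order itself is an example, not a recommendation):

```python
CATEGORY_PRIORITY = ["security", "correctness", "performance", "tests", "style"]

def merge_passes(by_category: dict) -> list:
    """Merge per-category findings in a fixed priority order.

    The first category to report a (file, line_range) location wins,
    so the merged output is fully deterministic.
    """
    merged, claimed = [], set()
    for cat in CATEGORY_PRIORITY:
        for f in by_category.get(cat, []):
            loc = (f["file"], tuple(f["line_range"]))
            if loc not in claimed:
                claimed.add(loc)
                merged.append({**f, "category": cat})
    return merged
```

Because both the category order and the within-category order are fixed, two identical sets of pass outputs always merge to the same review.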