Improving consistency across LLM calls for code review #186473
Hi all,
Replies: 1 comment
You won’t get perfect determinism from most hosted LLMs, but you can significantly reduce variance in a structured code review pipeline.

High-leverage strategies (beyond temperature):
- Keep top_p fixed (often 1) and hold frequency/presence penalties constant.
- If the API supports it, set a fixed random seed.
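As a concrete sketch, the pinned decoding settings can live in one constant that every review call reuses. The parameter names below follow OpenAI-style chat completion APIs (`temperature`, `top_p`, `seed`, the penalty fields); check your provider's docs, since `seed` in particular is only honored where seeding is supported.

```python
# Pinned sampling parameters, held constant across every review call.
# Names follow OpenAI-style chat completion APIs; verify against your provider.
PINNED_PARAMS = {
    "temperature": 0,        # minimize sampling randomness
    "top_p": 1,              # fixed nucleus cutoff
    "frequency_penalty": 0,  # hold penalties constant across calls
    "presence_penalty": 0,
    "seed": 42,              # only honored by APIs that support seeding
}

def review_request(model: str, messages: list) -> dict:
    """Build a request payload that never varies except for the messages."""
    return {"model": model, "messages": messages, **PINNED_PARAMS}
```

The point is less the specific values than that they are defined once and never drift between calls.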
- Use structured outputs (JSON schema / function calling).
- Keep enums small and explicit (e.g. severity: ["blocker", "major", "minor"]).
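A minimal sketch of what that schema might look like, with a deliberately small severity enum and a stdlib-only structural check (in practice you would pass the schema to the API's structured-output mode and/or validate with a library like `jsonschema`; the field names here are illustrative):

```python
SEVERITIES = ("blocker", "major", "minor")  # small, explicit enum

FINDING_SCHEMA = {  # JSON-Schema-style shape, usable with function calling
    "type": "object",
    "required": ["file", "line_range", "severity", "message"],
    "properties": {
        "file": {"type": "string"},
        "line_range": {"type": "array", "items": {"type": "integer"}},
        "severity": {"type": "string", "enum": list(SEVERITIES)},
        "message": {"type": "string"},
    },
}

def is_valid_finding(obj: dict) -> bool:
    """Minimal structural check mirroring the schema above."""
    return (
        all(k in obj for k in FINDING_SCHEMA["required"])
        and obj["severity"] in SEVERITIES
    )
```

Rejecting anything outside the enum keeps downstream merging and voting deterministic.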
- Provide a clear review rubric with categories and scoring rules.
- Limit output size (e.g. “max 8 findings”, “one finding per item”).
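For illustration, the rubric can be a plain mapping you interpolate into the prompt, and the output cap can be enforced in code rather than trusted to the model (the categories and scores here are made up, not a standard):

```python
RUBRIC = {  # illustrative categories and scoring rules
    "correctness": "Does the change behave as intended? Score 0-3.",
    "security": "Does it introduce injection or authz risks? Score 0-3.",
    "style": "Does it follow project conventions? Score 0-1.",
}
MAX_FINDINGS = 8

def cap_findings(findings: list) -> list:
    """Enforce the output-size limit deterministically, preserving order."""
    return findings[:MAX_FINDINGS]
```

Enforcing the cap post hoc means a chatty run and a terse run converge on the same upper bound.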
- Normalize diffs and context before sending:
  - Stable file ordering
  - Consistent chunking
  - Remove unrelated or nondeterministic metadata
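A sketch of that normalization, assuming you already have per-file diff text keyed by path (the `index`/`diff --git` prefixes dropped here are git-specific metadata that varies between runs):

```python
def normalize_diff(file_diffs: dict) -> str:
    """Order files stably and drop lines that vary between runs."""
    chunks = []
    for path in sorted(file_diffs):           # stable file ordering
        body = "\n".join(
            line for line in file_diffs[path].splitlines()
            # drop nondeterministic metadata such as blob hashes
            if not line.startswith(("index ", "diff --git"))
        )
        chunks.append(f"### {path}\n{body}")  # consistent chunk framing
    return "\n\n".join(chunks)
```

Identical inputs in, identical prompt text out — which is a precondition for everything else on this list.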
- Split the review into two passes:
  - Pass 1: extract potential issues only (no severity).
  - Pass 2: assign severity and recommendations using the rubric.

  This greatly stabilizes the set of findings.
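The two-pass flow can be sketched as below. `call_llm` is a hypothetical callable (not a real library function) standing in for your API client; it is assumed to return parsed JSON, and the prompt wording is illustrative only:

```python
def two_pass_review(diff: str, rubric: str, call_llm) -> list:
    """Two-pass review: extract first, then score against the rubric.

    call_llm: hypothetical callable taking a prompt string and
    returning parsed JSON (a list of dicts, or a dict).
    """
    # Pass 1: extract candidate issues only -- no severity judgments yet.
    issues = call_llm(f"List potential issues in this diff as JSON:\n{diff}")
    # Pass 2: score each issue against the rubric in a separate call.
    return [
        {**issue, **call_llm(f"Using this rubric:\n{rubric}\n"
                             f"Assign severity to: {issue}")}
        for issue in issues
    ]
```

Separating "what did you see" from "how bad is it" keeps the issue set from shifting when the severity reasoning wobbles.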
- Use self-consistency voting: run N=3–5 inexpensive calls and majority-vote on findings or rubric scores. Works best with structured outputs that can be merged deterministically.
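A sketch of that vote, assuming the structured findings from above so each one can be keyed deterministically by `(file, line_range)`:

```python
from collections import Counter

def majority_vote(runs: list, threshold: float = 0.5) -> list:
    """Keep findings reported in more than `threshold` of the runs.

    `runs` is a list of runs, each a list of finding dicts. Findings are
    keyed by (file, line_range) so the merge is deterministic.
    """
    counts = Counter(
        (f["file"], tuple(f["line_range"])) for run in runs for f in run
    )
    keep = {k for k, n in counts.items() if n / len(runs) > threshold}
    merged, seen = [], set()
    for run in runs:                      # first occurrence wins, stable order
        for f in run:
            key = (f["file"], tuple(f["line_range"]))
            if key in keep and key not in seen:
                seen.add(key)
                merged.append(f)
    return merged
```

Findings that only appear in one run out of three are usually the noise you are trying to suppress.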
- Generate a draft review, then run a second pass that:
  - Removes duplicates
  - Enforces the rubric
  - Rejects uncited or ungrounded claims
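The dedupe/reject half of that critique pass does not even need a model call; assuming each finding carries a quoted `snippet` field (as in the evidence rule below), it can be a plain filter:

```python
def critique_pass(findings: list, diff: str) -> list:
    """Deterministic post-filter: dedupe and drop ungrounded claims."""
    seen, out = set(), []
    for f in findings:
        key = (f["file"], f["message"])
        if key in seen:
            continue                      # remove duplicates
        snippet = f.get("snippet")
        if not snippet or snippet not in diff:
            continue                      # reject uncited/ungrounded claims
        seen.add(key)
        out.append(f)
    return out
```

Only the rubric-enforcement part genuinely needs a second model call; everything mechanical should stay in code.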
- Require evidence. Each finding must include:
  - File
  - Line range
  - Quoted snippet from the diff

  If evidence is missing, require the model to output “not enough context.”
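That rule can also be enforced on your side as a backstop, forcing the fallback output whenever the quoted evidence cannot be located in the diff (field names match the earlier schema; the `status` fallback shape is an assumption):

```python
def ground_finding(finding: dict, diff: str) -> dict:
    """Pass a finding through only if its quoted evidence is in the diff."""
    snippet = finding.get("snippet")
    has_evidence = (
        "file" in finding
        and "line_range" in finding
        and bool(snippet)
        and snippet in diff          # the quote must literally appear
    )
    if has_evidence:
        return finding
    return {"status": "not enough context"}  # forced fallback output
```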
- Run separate passes for security, performance, style, tests, etc. Merge results deterministically with a fixed priority order.

A structure that’s consistently stable in practice:
- Input: diff + rubric
- Output: JSON array of findings
- Hard rule: no finding without an evidence snippet from the diff
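The deterministic merge over the specialized passes can be sketched like this: a fixed category priority list, with the first category to claim a `(file, line_range)` location winning (the priority order itself is an example, not a recommendation):

```python
CATEGORY_PRIORITY = ["security", "correctness", "performance", "tests", "style"]

def merge_passes(by_category: dict) -> list:
    """Merge per-category findings in a fixed priority order.

    The first category to report a (file, line_range) location wins,
    so the merged output is fully deterministic.
    """
    merged, claimed = [], set()
    for cat in CATEGORY_PRIORITY:
        for f in by_category.get(cat, []):
            loc = (f["file"], tuple(f["line_range"]))
            if loc not in claimed:
                claimed.add(loc)
                merged.append({**f, "category": cat})
    return merged
```

Because both the category order and the within-category order are fixed, two identical sets of pass outputs always merge to the same review.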