Context
The code-review body asks the model to hand-execute a deterministic
procedure in Step 10/10b: remove stale state labels, apply exactly the
verdict label, then read back labels to verify and create-if-missing. That
is pure logic expressed as prose: it costs tokens on every run and is a
recurring source of "model did the label dance slightly wrong" bugs.
The repo already proved the better pattern: wf (pick, review-next,
update-next, post-merge, config) lifts deterministic workflow logic out of
the prompt into tested Python. This extends it to the review finish step.
See docs/context-optimization-plan.md → Lever F2. Gated: do this only
after Lever F1 (and ideally F3) have shown wf reliably carrying more of
the workflow, so we act on observed reliability, not faith.
Requirements (acceptance criteria)
- wf review-finish --verdict <approved|changes-requested|needs-discussion>
  subcommand performs the Step 10 label reconciliation + Step 10b readback
  verify (resolving label names by purpose key, guarded create-if-missing,
  no --force).
- wf.py change is covered by the offline decision-logic tests
  (label-set-in → label-set-out for each verdict, including the
  create-if-missing path).
- The code-review body calls wf review-finish on the happy path and keeps a
  thin inline fallback ("apply the verdict label, remove the others", just
  gh calls) for when wf errors or Python is absent. The verbose
  Step 10/10b prose is removed from the body.
- graceful degradation.
- Both the wf review-finish path and the inline fallback produce the
  correct label state.
- _shared-skills/ synced if applicable; versions bumped.
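The "label-set-in → label-set-out" decision logic named in the criteria above could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual wf.py code: the label names in STATE_LABELS, the LabelPlan type, and the function names plan_review_finish and apply_plan are all hypothetical, and the real purpose-key mapping would come from the repo's own config. The point is that the decision step stays a pure function with no network calls, so it can be tested offline.

```python
from dataclasses import dataclass

# Hypothetical purpose-key -> label-name mapping. The real names are an
# assumption here; they would be resolved from the repo's wf config.
STATE_LABELS = {
    "approved": "review:approved",
    "changes-requested": "review:changes-requested",
    "needs-discussion": "review:needs-discussion",
}


@dataclass(frozen=True)
class LabelPlan:
    create: tuple  # labels missing from the repo, to be created first
    remove: tuple  # stale state labels to take off the PR
    add: tuple     # verdict label to apply (empty if already present)


def plan_review_finish(verdict, pr_labels, repo_labels, state_labels=STATE_LABELS):
    """Pure decision step: label-set-in -> label-set-out, no network calls."""
    if verdict not in state_labels:
        raise ValueError(f"unknown verdict: {verdict!r}")
    target = state_labels[verdict]
    all_state = set(state_labels.values())
    on_pr = set(pr_labels)
    # Every state label other than the verdict label is stale.
    remove = tuple(sorted((on_pr & all_state) - {target}))
    # Apply the verdict label only if it is not already on the PR.
    add = () if target in on_pr else (target,)
    # Guarded create-if-missing: create only what we are about to add
    # and the repo does not already define.
    known = set(repo_labels)
    create = tuple(label for label in add if label not in known)
    return LabelPlan(create=create, remove=remove, add=add)


def apply_plan(pr_labels, plan):
    """Expected PR label set after the plan runs (for the readback check)."""
    return (set(pr_labels) - set(plan.remove)) | set(plan.add)
```

A caller would turn the plan into gh invocations and then compare the labels read back from the PR against apply_plan(...). Non-state labels such as "bug" pass through untouched, and running the plan a second time yields an empty plan.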
Savings goal
Eliminate (not defer) the ~30–40 lines of Step 10/10b label-reconcile
prose from the code-review body and remove the associated bug class —
label state becomes a tested code path. Goal: the model never hand-executes
the label dance on the happy path.
Notes
Blocked-by/sequenced-after the Lever F1 story (#TBD) per the plan's
"earn wf-reliability evidence before pushing more logic down".