| 1 | +# Runtime Regression Workflow |
| 2 | + |
| 3 | +## Goal |
| 4 | + |
| 5 | +Turn a real `liku chat` runtime finding into a checked-in, repeatable regression with as little friction as possible. |
| 6 | + |
| 7 | +This first N5 slice intentionally reuses the existing inline-proof transcript evaluator instead of introducing a second transcript engine. The workflow is: |
| 8 | + |
| 9 | +1. capture a runtime transcript or reuse an inline-proof `.log` |
| 10 | +2. sanitize it down to the smallest useful snippet |
| 11 | +3. generate a transcript fixture skeleton |
| 12 | +4. tighten the generated expectations |
| 13 | +5. run transcript regressions and the nearest focused behavior test |
| 14 | +6. commit the fixture and the behavioral fix together |
| 15 | + |
| 16 | +## Inputs supported in this slice |
| 17 | + |
| 18 | +- plaintext `liku chat` transcripts |
| 19 | +- inline-proof logs from `~/.liku/traces/chat-inline-proof/*.log` |
| 20 | +- pasted transcript text over stdin |
| 21 | + |
| 22 | +Out of scope for this first slice: |
| 23 | + |
| 24 | +- automatic replay of JSONL telemetry or agent-trace files |
| 25 | +- full transcript-to-test generation without manual expectation review |
| 26 | +- broad redaction/policy redesign for runtime capture |
| 27 | + |
| 28 | +## Fixture format |
| 29 | + |
| 30 | +Checked-in transcript fixtures live under: |
| 31 | + |
| 32 | +- `scripts/fixtures/transcripts/` |
| 33 | + |
| 34 | +The fixture bundle format is JSON with multiple named cases at the top level. Each case can include: |
| 35 | + |
| 36 | +- `description` |
| 37 | +- `source` |
| 38 | + - `capturedAt` |
| 39 | + - `tracePath` when relevant |
| 40 | + - observed provider/model metadata when available |
| 41 | +- `transcriptLines` |
| 42 | +- optional derived fields such as `prompts`, `assistantTurns`, and `observedHeaders` |
| 43 | +- `notes` |
| 44 | +- `expectations` |
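As a rough illustration of the field list above, a single-case bundle might be built and serialized like this. The case name is borrowed from the commands section; the description, timestamp, trace file name, transcript lines, the `provider` key, and the exact shape of `notes` and `expectations` are all placeholders assumed for the sketch, not taken from a real fixture.

```javascript
// Hypothetical fixture bundle: top-level keys are case names.
// Only the field names listed above come from the workflow doc; values are placeholders.
const bundle = {
  "repo-boundary-clarification-runtime": {
    description: "Placeholder description of the observed runtime behavior",
    source: {
      capturedAt: "2025-01-01T00:00:00Z", // placeholder timestamp
      tracePath: "~/.liku/traces/chat-inline-proof/example.log", // placeholder file name
      provider: "copilot", // key name is an assumption
    },
    transcriptLines: [
      "Provider: copilot",
      "> placeholder user prompt",
      "placeholder assistant reply",
    ],
    notes: "Placeholder note explaining why this snippet was kept.",
    expectations: [
      // Assumed shape: one object per check, with pattern lists under include/exclude.
      {
        scope: "transcript",
        include: [{ regex: "Provider:\\s+copilot", flags: "i" }],
      },
    ],
  },
};

// JSON.stringify escapes the backslash, so the stored pattern reads "Provider:\\s+copilot".
const serialized = JSON.stringify(bundle, null, 2);
```

Writing the bundle through `JSON.stringify` rather than by hand avoids escaping mistakes in the stored regex strings.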
| 45 | + |
| 46 | +Expectation semantics intentionally mirror the inline-proof harness: |
| 47 | + |
| 48 | +- `scope: transcript` for whole-transcript checks |
| 49 | +- `turn` for assistant-turn-specific checks |
| 50 | +- `include` |
| 51 | +- `exclude` |
| 52 | +- `count` |
| 53 | + |
| 54 | +Pattern entries are stored as JSON regex specs: |
| 55 | + |
| 56 | +- `{ "regex": "Provider:\\s+copilot", "flags": "i" }` |
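A minimal sketch of how such specs could be compiled and checked follows. It is not the inline-proof evaluator itself. It assumes `include` patterns must match, `exclude` patterns must not match, and `count` entries pair a pattern with an exact match count. The `{ pattern, equals }` shape and all function names are hypothetical.

```javascript
// Compile a JSON regex spec such as { "regex": "Provider:\\s+copilot", "flags": "i" }.
function compileSpec(spec) {
  return new RegExp(spec.regex, spec.flags || "");
}

// Count non-overlapping matches. matchAll requires the "g" flag, so add it if missing.
function countMatches(spec, text) {
  const flags = [...new Set((spec.flags || "") + "g")].join("");
  return [...text.matchAll(new RegExp(spec.regex, flags))].length;
}

// Return a list of failure messages; an empty list means the expectation passed.
function evaluateExpectation(expectation, text) {
  const failures = [];
  for (const spec of expectation.include || []) {
    if (!compileSpec(spec).test(text)) failures.push(`missing match: ${spec.regex}`);
  }
  for (const spec of expectation.exclude || []) {
    if (compileSpec(spec).test(text)) failures.push(`unexpected match: ${spec.regex}`);
  }
  // Hypothetical count shape: { pattern: <regex spec>, equals: <number> }.
  for (const { pattern, equals } of expectation.count || []) {
    const actual = countMatches(pattern, text);
    if (actual !== equals) failures.push(`count ${actual} != ${equals}: ${pattern.regex}`);
  }
  return failures;
}
```

Collecting failures in a list instead of throwing on the first one lets a single run report every broken invariant in a case.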
| 57 | + |
| 58 | +## Commands |
| 59 | + |
| 60 | +List transcript fixtures: |
| 61 | + |
| 62 | +- `npm run regression:transcripts -- --list` |
| 63 | + |
| 64 | +Run all transcript fixtures: |
| 65 | + |
| 66 | +- `npm run regression:transcripts` |
| 67 | + |
| 68 | +Run a single transcript fixture: |
| 69 | + |
| 70 | +- `npm run regression:transcripts -- --fixture repo-boundary-clarification-runtime` |
| 71 | + |
| 72 | +Generate a fixture skeleton from a transcript file: |
| 73 | + |
| 74 | +- `npm run regression:extract -- --transcript-file C:\path\to\runtime.log --fixture-name repo-boundary-clarification` |
| 75 | + |
| 76 | +Print a fixture skeleton without writing a file: |
| 77 | + |
| 78 | +- `npm run regression:extract -- --transcript-file C:\path\to\runtime.log --stdout-only` |
| 79 | + |
| 80 | +## Recommended loop |
| 81 | + |
| 82 | +### 1. Capture the failure |
| 83 | + |
| 84 | +Prefer one of these sources: |
| 85 | + |
| 86 | +- a fresh `liku chat` transcript |
| 87 | +- an inline-proof log already saved under `~/.liku/traces/chat-inline-proof/` |
| 88 | +- a small hand-curated transcript excerpt from a runtime session |
| 89 | + |
| 90 | +Keep only the lines that prove the invariant you care about. Smaller fixtures are easier to review and less brittle. |
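One way to trim a raw capture down to the lines that matter is a small filter like the sketch below. `sanitizeTranscript` is a hypothetical helper, not part of the repository, and the `> ` prompt prefix in the usage example is an assumption about the transcript layout.

```javascript
// Hypothetical helper: keep only the lines that match at least one "keep" pattern.
// Pass patterns without the "g" flag, since RegExp.prototype.test is stateful with it.
function sanitizeTranscript(rawText, keepPatterns) {
  return rawText
    .split(/\r?\n/)                            // handle both LF and CRLF captures
    .map((line) => line.replace(/\s+$/, ""))   // drop trailing whitespace
    .filter((line) => line.length > 0)         // drop blank lines
    .filter((line) => keepPatterns.some((re) => re.test(line)));
}

// Example: keep the provider header and the user prompt, drop debug noise.
const kept = sanitizeTranscript(
  "Provider: copilot\r\n\r\ndebug: noise\r\n> which repo do you mean?  \r\n",
  [/^Provider:/i, /^>/]
);
```

Review the result by hand before generating a fixture; a filter like this does not redact secrets or personal paths.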
| 91 | + |
| 92 | +### 2. Generate a fixture skeleton |
| 93 | + |
| 94 | +Run `regression:extract` against the sanitized transcript. |
| 95 | + |
| 96 | +The helper derives: |
| 97 | + |
| 98 | +- a fixture name |
| 99 | +- prompts |
| 100 | +- assistant turns |
| 101 | +- observed provider/model headers |
| 102 | +- placeholder expectations |
| 103 | + |
| 104 | +Treat those expectations as a draft, not finished truth. |
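To make the "observed provider/model headers" item concrete, here is a rough sketch of how header lines could be picked out of a transcript. It does not reproduce the real extraction logic. The `Provider:` prefix is suggested by the regex example in the fixture format section, while the `Model:` prefix and the `observeHeaders` name are assumptions.

```javascript
// Hypothetical sketch: collect the first "Provider:" and "Model:" values seen.
function observeHeaders(transcriptLines) {
  const headers = {};
  for (const line of transcriptLines) {
    const match = /^\s*(Provider|Model):\s+(\S.*)$/i.exec(line);
    if (!match) continue;
    const key = match[1].toLowerCase();
    // Keep the first occurrence so later repeats do not overwrite it.
    if (!(key in headers)) headers[key] = match[2].trim();
  }
  return headers;
}

const observed = observeHeaders([
  "Provider: copilot",
  "Model: example-model", // placeholder model name
  "some assistant text",
  "Provider: other",
]);
```

Whatever the helper actually derives, compare it against the transcript yourself before relying on it in an expectation.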
| 105 | + |
| 106 | +### 3. Tighten expectations manually |
| 107 | + |
| 108 | +Before checking in the fixture: |
| 109 | + |
| 110 | +- remove incidental wording matches |
| 111 | +- keep only invariants that prove the bug fix or safety behavior |
| 112 | +- add `exclude` or `count` checks when they make the regression sharper |
| 113 | + |
| 114 | +Good transcript fixtures assert the behavior that matters, not every line in the transcript. |
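The difference between an incidental wording match and an invariant can be seen in a small sketch. Both transcript lines and both patterns are invented for illustration; only the `{ regex, flags }` spec shape comes from the fixture format.

```javascript
// Invented assistant replies: the second is a harmless rewording of the first.
const original = "Assistant: I can't modify files outside this repo. Which repository did you mean?";
const reworded = "Assistant: I cannot change files outside this repo. Which repo should I use?";

// Draft pattern: pins the exact sentence, so any rewording breaks it.
const draft = {
  regex: "I can't modify files outside this repo\\. Which repository did you mean\\?",
  flags: "",
};

// Tightened pattern: asserts only the invariant (repo boundary mentioned, then a question).
const tightened = { regex: "outside this repo\\b.*\\?", flags: "i" };

const matches = (spec, text) => new RegExp(spec.regex, spec.flags).test(text);

console.log(matches(draft, original));     // true
console.log(matches(draft, reworded));     // false: brittle under wording drift
console.log(matches(tightened, original)); // true
console.log(matches(tightened, reworded)); // true
```

The tightened spec still fails if the assistant stops asking a clarifying question, which is the behavior the regression is meant to protect.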
| 115 | + |
| 116 | +### 4. Run the transcript regression and the nearest focused behavior test |
| 117 | + |
| 118 | +Minimum validation: |
| 119 | + |
| 120 | +- `npm run regression:transcripts` |
| 121 | +- `node scripts/test-transcript-regression-pipeline.js` |
| 122 | + |
| 123 | +Then run the nearest behavioral regression for the feature you touched, for example: |
| 124 | + |
| 125 | +- `node scripts/test-windows-observation-flow.js` |
| 126 | +- `node scripts/test-chat-actionability.js` |
| 127 | +- `node scripts/test-bug-fixes.js` |
| 128 | + |
| 129 | +### 5. Commit the fixture with the fix |
| 130 | + |
| 131 | +The preferred N5 habit is: |
| 132 | + |
| 133 | +- runtime finding |
| 134 | +- transcript fixture |
| 135 | +- focused code/test fix |
| 136 | +- commit |
| 137 | + |
| 138 | +That keeps new hardening work grounded in observed runtime behavior instead of reconstructed memory. |
| 139 | + |
| 140 | +## Practical guidelines |
| 141 | + |
| 142 | +1. Prefer sanitized transcript snippets over full raw dumps. |
| 143 | +2. Use one fixture bundle with several named cases when the cases cover closely related behavior. |
| 144 | +3. Keep transcript fixtures deterministic and stable enough to survive harmless wording drift. |
| 145 | +4. If a transcript fixture starts to grow too broad, add or retain a narrower behavior test alongside it. |