Commit 71308da

Add capability policy matrix and transcript regressions
1 parent fff1e3f commit 71308da

15 files changed, +1791 -128 lines changed

TESTING.md

Lines changed: 36 additions & 0 deletions
@@ -227,6 +227,42 @@ What this covers:
- cohort filtering to separate pre-fix history from post-fix Phase 3 runs
- evaluator characterization for transcript expectations without needing a live model run

### Runtime Transcript Regression Pipeline

Use the transcript regression pipeline when you already have a sanitized `liku chat` transcript or an inline-proof `.log` artifact and want to promote it into a checked-in regression fixture quickly:

```bash
# List checked-in transcript fixtures
npm run regression:transcripts -- --list

# Run all checked-in transcript fixtures
npm run regression:transcripts

# Run one fixture only
npm run regression:transcripts -- --fixture repo-boundary-clarification-runtime

# Generate a fixture skeleton from an existing transcript log
npm run regression:extract -- --transcript-file C:\path\to\runtime.log --fixture-name repo-boundary-clarification

# Or print a fixture skeleton without writing a file
npm run regression:extract -- --transcript-file C:\path\to\runtime.log --stdout-only
```

What this covers:

- checked-in sanitized transcript fixtures under `scripts/fixtures/transcripts/`
- deterministic evaluation of transcript expectations without a live model call
- rapid conversion of a real runtime failure into a reusable fixture skeleton
- reuse of the same transcript parsing/evaluation semantics already used by the inline-proof harness

Recommended workflow:

1. capture or identify the runtime transcript/log you want to preserve
2. sanitize it down to the smallest transcript snippet that still proves the failure or behavior
3. run `regression:extract` to generate a fixture skeleton
4. tighten the generated expectations manually so they assert the real invariant, not incidental phrasing
5. run `regression:transcripts` and the nearest behavior test before committing

### Manual Checks for Model Selection

When changing model-selection UX or Copilot routing, add these checks:

docs/CHAT_CONTINUITY_IMPLEMENTATION_PLAN.md

Lines changed: 48 additions & 0 deletions
@@ -2327,6 +2327,38 @@ The most credible next roadmap is:

### Roadmap N4: Capability-policy matrix by app and surface class

**Status (2026-03-30)**
- first runtime matrix slice implemented
- landed via:
  - `src/main/capability-policy.js`
  - `src/main/ai-service/message-builder.js`
  - `src/main/ai-service/policy-enforcement.js`
  - `src/main/ai-service.js`
  - `scripts/test-capability-policy.js`
  - `scripts/test-ai-service-policy.js`
- current scope:
  - added a built-in runtime capability-policy matrix for the canonical surface classes:
    - `browser`
    - `uia-rich`
    - `visual-first-low-uia`
    - `keyboard-window-first`
  - the runtime policy snapshot now exposes normalized support dimensions for each surface/app combination:
    - semantic control
    - keyboard control
    - trustworthy background capture
    - precise placement
    - bounded text extraction
    - approval-time recovery
  - prompt assembly now emits capability-policy snapshot context instead of relying only on inline surface heuristics
  - action-plan enforcement now applies narrow built-in matrix checks in addition to existing per-app `actionPolicies` / `negativePolicies`
  - TradingView now rides the generic `visual-first-low-uia` matrix as a first overlay for chart-evidence honesty and precise-placement bounds
  - TradingView overlay metadata now pulls from existing verification/shortcut helpers so the runtime policy snapshot can surface:
    - trading mode hints (`paper` / `live` / `unknown`)
    - stable default shortcuts
    - customizable shortcuts
    - paper-test-only shortcut groups
  - existing visual trust and background-capture signals are reused as policy inputs rather than duplicated into a second evidence model

**Why this should be next**
- Several current safety and honesty wins are still encoded as targeted TradingView or low-UIA heuristics.
- The next architectural step is to formalize those rules into a reusable capability-policy layer.
@@ -2359,6 +2391,22 @@ The most credible next roadmap is:

### Roadmap N5: Runtime transcript to regression pipeline

**Status (2026-03-30)**
- first transcript-ingestion slice implemented
- landed via:
  - `scripts/transcript-regression-fixtures.js`
  - `scripts/extract-transcript-regression.js`
  - `scripts/run-transcript-regressions.js`
  - `scripts/test-transcript-regression-pipeline.js`
  - `scripts/fixtures/transcripts/inline-proof-chat-regressions.json`
  - `docs/RUNTIME_REGRESSION_WORKFLOW.md`
- current scope:
  - added a checked-in transcript fixture format for sanitized `liku chat` regressions
  - added an extraction helper that turns a runtime transcript or inline-proof log into a fixture skeleton
  - added a fixture-driven runner that reuses the existing inline-proof transcript evaluator instead of introducing a second regression engine
  - seeded the pipeline with checked-in transcript fixtures for repo-boundary and forgone-feature regressions
  - documented the `runtime finding -> fixture -> focused rerun -> commit` workflow in repo docs and testing commands

**Why this should be next**
- The strongest recent improvements all came from real runtime transcripts, then hand-converted into tests.
- That workflow works, but it is still too manual and easy to delay.

docs/RUNTIME_REGRESSION_WORKFLOW.md

Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
# Runtime Regression Workflow

## Goal

Turn a real `liku chat` runtime finding into a checked-in, repeatable regression with as little friction as possible.

This first N5 slice intentionally reuses the existing inline-proof transcript evaluator instead of introducing a second transcript engine. The workflow is:

1. capture a runtime transcript or reuse an inline-proof `.log`
2. sanitize it down to the smallest useful snippet
3. generate a transcript fixture skeleton
4. tighten the generated expectations
5. run transcript regressions and the nearest focused behavior test
6. commit the fixture and the behavioral fix together

## Inputs supported in this slice

- plaintext `liku chat` transcripts
- inline-proof logs from `~/.liku/traces/chat-inline-proof/*.log`
- pasted transcript text over stdin

Out of scope for this first slice:

- automatic replay of JSONL telemetry or agent-trace files
- full transcript-to-test generation without manual expectation review
- broad redaction/policy redesign for runtime capture

## Fixture format

Checked-in transcript fixtures live under:

- `scripts/fixtures/transcripts/`

The fixture bundle format is JSON with multiple named cases at the top level. Each case can include:

- `description`
- `source`
- `capturedAt`
- `tracePath` when relevant
- observed provider/model metadata when available
- `transcriptLines`
- optional derived fields such as `prompts`, `assistantTurns`, and `observedHeaders`
- `notes`
- `expectations`

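As a rough illustration of the fields listed above, the sketch below builds a bundle with a single named case and serializes it. The field names come from the list; the case name, every value, the transcript line layout, the array shape of `notes` and `expectations`, and the 1-based `turn` number are assumptions made for the example, not the checked-in format.

```javascript
'use strict';

// Hypothetical fixture bundle: one named case at the top level.
// Every value below is invented for illustration.
const bundle = {
  'example-repo-boundary-case': {
    description: 'Assistant asks which repo is meant before acting',
    source: 'liku-chat-transcript',
    capturedAt: '2026-03-30T00:00:00.000Z',
    transcriptLines: [
      'Provider: copilot',
      '> fix the failing test in the other repo',
      'Which repository do you mean? I can only see the current workspace.'
    ],
    observedHeaders: { providers: ['copilot'] },
    notes: ['Trimmed to the three lines that prove the behavior.'],
    expectations: [
      {
        // Whole-transcript check.
        scope: 'transcript',
        include: [{ regex: 'Provider:\\s+copilot', flags: 'i' }]
      },
      {
        // Check against a single assistant turn (numbering assumed 1-based).
        turn: 1,
        include: [{ regex: 'which repo', flags: 'i' }],
        exclude: [{ regex: 'I have updated', flags: 'i' }]
      }
    ]
  }
};

// Bundles are stored as JSON, so the object should survive a round trip unchanged.
const serialized = JSON.stringify(bundle, null, 2);
console.log(serialized);
```
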
Expectation semantics intentionally mirror the inline-proof harness:

- `scope: transcript` for whole-transcript checks
- `turn` for assistant-turn-specific checks
- `include`
- `exclude`
- `count`

Pattern entries are stored as JSON regex specs:

- `{ "regex": "Provider:\\s+copilot", "flags": "i" }`

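The regex spec above can be turned into a live pattern with the standard `RegExp` constructor. The sketch below shows one plausible reading of `include` and `exclude`: every include pattern must match and no exclude pattern may match. The real evaluator lives in the inline-proof harness and is not part of this page, so `compilePattern` and `checkText` are hypothetical names, and `count` is left out because its shape is not shown here.

```javascript
'use strict';

// Turn a stored { regex, flags } spec into a RegExp object.
function compilePattern(spec) {
  return new RegExp(spec.regex, spec.flags || '');
}

// Assumed semantics: all include patterns match, and no exclude pattern matches.
function checkText(text, { include = [], exclude = [] }) {
  const missing = include.filter((spec) => !compilePattern(spec).test(text));
  const forbidden = exclude.filter((spec) => compilePattern(spec).test(text));
  return {
    passed: missing.length === 0 && forbidden.length === 0,
    missing,
    forbidden
  };
}

// The spec exactly as it would appear in the JSON file (String.raw keeps the backslashes).
const spec = JSON.parse(String.raw`{ "regex": "Provider:\\s+copilot", "flags": "i" }`);

const result = checkText('provider:   Copilot', {
  include: [spec],
  exclude: [{ regex: 'openai', flags: 'i' }]
});
console.log(result.passed, result.missing.length, result.forbidden.length);
```

Note the double backslash: JSON needs `\\s` on disk so that the parsed string contains the single `\s` the regex engine expects.
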
## Commands

List transcript fixtures:

- `npm run regression:transcripts -- --list`

Run all transcript fixtures:

- `npm run regression:transcripts`

Run a single transcript fixture:

- `npm run regression:transcripts -- --fixture repo-boundary-clarification-runtime`

Generate a fixture skeleton from a transcript file:

- `npm run regression:extract -- --transcript-file C:\path\to\runtime.log --fixture-name repo-boundary-clarification`

Print a fixture skeleton without writing a file:

- `npm run regression:extract -- --transcript-file C:\path\to\runtime.log --stdout-only`

## Recommended loop

### 1. Capture the failure

Prefer one of these sources:

- a fresh `liku chat` transcript
- an inline-proof log already saved under `~/.liku/traces/chat-inline-proof/`
- a small hand-curated transcript excerpt from a runtime session

Keep only the lines that prove the invariant you care about. Smaller fixtures are easier to review and less brittle.

### 2. Generate a fixture skeleton

Run `regression:extract` against the sanitized transcript.

The helper derives:

- a fixture name
- prompts
- assistant turns
- observed provider/model headers
- placeholder expectations

Treat those expectations as a draft, not finished truth.

### 3. Tighten expectations manually

Before checking in the fixture:

- remove incidental wording matches
- keep only invariants that prove the bug fix or safety behavior
- add `exclude` or `count` checks when they make the regression sharper

Good transcript fixtures assert the behavior that matters, not every line in the transcript.

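The difference between an incidental wording match and an invariant is easiest to see side by side. In the sketch below, both assistant replies and both patterns are made up for illustration; only the `{ regex, flags }` spec shape comes from this document.

```javascript
'use strict';

// Two hypothetical assistant replies that express the same behavior in different words.
const original = 'I cannot confirm the chart shows a breakout without a fresh capture.';
const reworded = 'Without a fresh capture I can not confirm that the chart shows a breakout.';

// Draft expectation: pinned to one exact sentence, so harmless rewording breaks it.
const draft = { regex: 'I cannot confirm the chart shows a breakout', flags: '' };

// Tightened expectation: asserts only the invariant, a refusal to confirm.
const tightened = { regex: 'can\\s?not confirm', flags: 'i' };

// Compile a spec and test it against one piece of text.
const matches = (spec, text) => new RegExp(spec.regex, spec.flags).test(text);

for (const [label, spec] of [['draft', draft], ['tightened', tightened]]) {
  console.log(label, matches(spec, original), matches(spec, reworded));
}
```

The draft pattern passes only on the original wording, while the tightened one passes on both, which is the kind of stability a checked-in fixture needs.
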
### 4. Run the transcript regression and the nearest focused seam test

Minimum validation:

- `npm run regression:transcripts`
- `node scripts/test-transcript-regression-pipeline.js`

Then run the nearest behavioral regression for the feature you touched, for example:

- `node scripts/test-windows-observation-flow.js`
- `node scripts/test-chat-actionability.js`
- `node scripts/test-bug-fixes.js`

### 5. Commit the fixture with the fix

The preferred N5 habit is:

- runtime finding
- transcript fixture
- focused code/test fix
- commit

That keeps new hardening work grounded in observed runtime behavior instead of reconstructed memory.

## Practical guidelines

1. Prefer sanitized transcript snippets over full raw dumps.
2. Use one fixture bundle with several named cases when the domain is closely related.
3. Keep transcript fixtures deterministic and stable enough to survive harmless wording drift.
4. If a transcript fixture starts growing broad, add or retain a narrower behavior test alongside it.

package.json

Lines changed: 2 additions & 0 deletions
@@ -16,6 +16,8 @@
    "test:skills:inline": "node scripts/test-skill-inline-smoothness.js",
    "proof:inline": "node scripts/run-chat-inline-proof.js",
    "proof:inline:summary": "node scripts/summarize-chat-inline-proof.js",
    "regression:extract": "node scripts/extract-transcript-regression.js",
    "regression:transcripts": "node scripts/run-transcript-regressions.js",
    "smoke:shortcuts": "node scripts/smoke-shortcuts.js",
    "smoke:chat-direct": "node scripts/smoke-chat-direct.js",
    "smoke": "node scripts/smoke-command-system.js",

scripts/extract-transcript-regression.js

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
#!/usr/bin/env node

const fs = require('fs');
const path = require('path');
const {
  DEFAULT_FIXTURE_DIR,
  buildFixtureSkeleton,
  sanitizeFixtureName,
  upsertFixtureBundleEntry
} = require(path.join(__dirname, 'transcript-regression-fixtures.js'));

function getArgValue(flagName) {
  const index = process.argv.indexOf(flagName);
  if (index >= 0 && index + 1 < process.argv.length) {
    return process.argv[index + 1];
  }
  return null;
}

function hasFlag(flagName) {
  return process.argv.includes(flagName);
}

function readTranscriptInput() {
  const transcriptFile = getArgValue('--transcript-file');
  if (transcriptFile) {
    return {
      transcript: fs.readFileSync(transcriptFile, 'utf8'),
      sourceTracePath: transcriptFile
    };
  }

  if (!process.stdin.isTTY) {
    return {
      transcript: fs.readFileSync(0, 'utf8'),
      sourceTracePath: null
    };
  }

  throw new Error('Provide --transcript-file <path> or pipe transcript text via stdin.');
}

function resolveOutputFile(fixtureName) {
  const explicit = getArgValue('--output-file');
  if (explicit) return explicit;
  return path.join(DEFAULT_FIXTURE_DIR, `${sanitizeFixtureName(fixtureName || 'runtime-transcript')}.json`);
}

function main() {
  const { transcript, sourceTracePath } = readTranscriptInput();
  const description = getArgValue('--description') || null;
  const capturedAt = getArgValue('--captured-at') || null;
  const requestedName = getArgValue('--fixture-name') || null;
  const skeleton = buildFixtureSkeleton({
    fixtureName: requestedName,
    description,
    transcript,
    sourceTracePath: getArgValue('--source-trace-path') || sourceTracePath,
    capturedAt
  });

  const outputFile = resolveOutputFile(skeleton.fixtureName);
  const shouldWrite = !hasFlag('--stdout-only');

  if (shouldWrite) {
    const stored = upsertFixtureBundleEntry(outputFile, skeleton.fixtureName, skeleton.entry, {
      overwrite: hasFlag('--overwrite')
    });
    console.log(`Saved transcript regression fixture: ${stored.filePath}`);
  }

  console.log(`Fixture: ${skeleton.fixtureName}`);
  console.log(`Prompts: ${skeleton.entry.prompts.length}`);
  console.log(`Assistant turns: ${skeleton.entry.assistantTurns.length}`);
  console.log(`Observed providers: ${(skeleton.entry.observedHeaders.providers || []).join(', ') || 'none'}`);
  console.log('');
  console.log(JSON.stringify({ [skeleton.fixtureName]: skeleton.entry }, null, 2));
}

if (require.main === module) {
  try {
    main();
  } catch (error) {
    console.error(error.stack || error.message);
    process.exit(1);
  }
}

module.exports = {
  readTranscriptInput,
  resolveOutputFile
};
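
The flag handling in the script above is a plain positional lookup on `process.argv`. The sketch below restates that lookup with the argument list passed in explicitly so its behavior can be tried in isolation. `getArgValueFrom` and `hasFlagIn` are hypothetical names introduced for this example and are not part of the commit.

```javascript
'use strict';

// Same lookup rule as getArgValue in the script, but argv is a parameter.
function getArgValueFrom(argv, flagName) {
  const index = argv.indexOf(flagName);
  return index >= 0 && index + 1 < argv.length ? argv[index + 1] : null;
}

// Same rule as hasFlag: the flag counts as set if it appears anywhere in argv.
function hasFlagIn(argv, flagName) {
  return argv.includes(flagName);
}

const argv = [
  'node',
  'extract-transcript-regression.js',
  '--transcript-file',
  'runtime.log',
  '--stdout-only'
];
console.log(getArgValueFrom(argv, '--transcript-file'));
console.log(hasFlagIn(argv, '--stdout-only'));

// Edge case: a value flag directly followed by another flag
// returns that second flag as its value.
const odd = ['node', 'x.js', '--fixture-name', '--stdout-only'];
console.log(getArgValueFrom(odd, '--fixture-name'));
```

If that edge case matters, a caller could reject values that begin with `--`; the script as committed performs no such check.
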
