Context
The per-case .md artifacts are excellent for human debugging (I used them to diagnose an LLM agent overwriting a property mid-run). But when I wanted the pass/fail matrix programmatically (which cases failed, with which scores), I had to grep .protest/results/<run>/<case>.md and .protest/last_run_stdout. The latter was truncated and full of ANSI escape codes, so greps for the verdict lines didn't match cleanly.
Ask
A structured, stable run summary alongside the .md files, e.g. .protest/results/<run>/summary.json:
{
  "suite": "atelier",
  "passed": 9,
  "total": 10,
  "cost": 0.0052,
  "cases": [
    {
      "name": "check_compatible_no_alert",
      "passed": false,
      "scores": {"alert_ok": false},
      "reason": "..."
    }
  ]
}
This makes downstream tooling (custom dashboards, gating scripts, agentic workflows that read results) trivial, with no parsing of Rich/ANSI output.
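To illustrate the kind of gating script this would enable, here is a minimal sketch. It assumes the summary.json shape proposed above; the helper names (load_summary, failed_cases, meets_threshold) and the min_pass_rate parameter are hypothetical and not part of any existing API.

```python
import json
from pathlib import Path


def load_summary(path):
    """Read a summary.json file (shape as proposed in this issue) into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def failed_cases(summary):
    """Return (name, reason) pairs for every case whose "passed" field is false."""
    return [
        (case["name"], case.get("reason", ""))
        for case in summary["cases"]
        if not case["passed"]
    ]


def meets_threshold(summary, min_pass_rate=1.0):
    """True if passed/total reaches min_pass_rate (hypothetical gating rule)."""
    total = summary["total"]
    if total == 0:
        return False  # treat an empty run as a failure rather than dividing by zero
    return summary["passed"] / total >= min_pass_rate
```

A CI step could then call load_summary on .protest/results/<run>/summary.json, print the output of failed_cases, and exit non-zero when meets_threshold returns False.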
Relation to #32
#32 (JUnit XML) targets CI consumption of tests; this is a richer, eval-aware JSON (per-case scores + cost + reason) for interactive/programmatic inspection. They could share a serializer.
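One way the shared serializer could look, as a rough sketch only: a single in-memory result model rendered to both formats. The class and function names here (CaseResult, RunSummary, to_json, to_junit_xml) are hypothetical, and the XML uses the conventional JUnit testsuite/testcase/failure elements, which may differ from whatever #32 ends up emitting.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass, field


@dataclass
class CaseResult:
    # Field names mirror the per-case entries of the proposed summary.json.
    name: str
    passed: bool
    scores: dict = field(default_factory=dict)
    reason: str = ""


@dataclass
class RunSummary:
    suite: str
    cost: float
    cases: list = field(default_factory=list)

    @property
    def passed(self):
        return sum(1 for case in self.cases if case.passed)

    @property
    def total(self):
        return len(self.cases)


def to_json(run):
    """Render the eval-aware JSON summary (scores, cost, reason included)."""
    payload = {
        "suite": run.suite,
        "passed": run.passed,
        "total": run.total,
        "cost": run.cost,
        "cases": [asdict(case) for case in run.cases],
    }
    return json.dumps(payload, indent=2)


def to_junit_xml(run):
    """Render a conventional JUnit-style XML view of the same run (assumed layout)."""
    suite = ET.Element(
        "testsuite",
        name=run.suite,
        tests=str(run.total),
        failures=str(run.total - run.passed),
    )
    for case in run.cases:
        testcase = ET.SubElement(suite, "testcase", name=case.name)
        if not case.passed:
            # Scores and cost have no standard JUnit slot, so only the reason is carried over.
            ET.SubElement(testcase, "failure", message=case.reason)
    return ET.tostring(suite, encoding="unicode")
```

Keeping one model with two renderers means the JSON and XML outputs cannot disagree about which cases passed.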