Reintroduce the protest history browse CLI, incrementally and usage-driven #107

Description

@renaudcepre

Epic / tracking issue.

Context

The history browse CLI (`list`/`runs`/`show`/`compare`/`clean`) was cut from the evals release (merged in 54d6037): functional and green, but its output proved hard to read and the UX was not yet stabilised. We shipped the settled parts and deferred the browse UI.

What stays (do NOT re-litigate):

  • The writer is always-on: every run appends one entry to `.protest/history.jsonl` (`HistoryPlugin` + `protest/history/`, schema_version=1). Data accumulates from day one.
  • The cut reader is preserved verbatim on branch `archive/history-cli` (commit d80588d) — resurrect subcommands from there.
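
To make the "append-only JSONL" contract concrete, here is a minimal sketch of a writer and a tolerant reader. It is not the actual `HistoryPlugin` code: the function names `append_entry` and `read_entries` and the entry fields used in the usage note are hypothetical. Only the file path and `schema_version=1` come from this issue.

```python
import json
from pathlib import Path

SCHEMA_VERSION = 1  # the version named in this issue


def append_entry(history_path: Path, entry: dict) -> None:
    """Append one run as a single JSON line; earlier lines are never rewritten."""
    history_path.parent.mkdir(parents=True, exist_ok=True)
    record = {**entry, "schema_version": SCHEMA_VERSION}
    with history_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def read_entries(history_path: Path) -> list:
    """Read every entry, skipping blank lines and unknown schema versions."""
    if not history_path.exists():
        return []
    entries = []
    with history_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Assumption: a reader should ignore versions it does not understand.
            if record.get("schema_version") == SCHEMA_VERSION:
                entries.append(record)
    return entries
```

For example, `append_entry(Path(".protest/history.jsonl"), {"suite": "demo"})` followed by `read_entries(...)` returns the stored dict with `schema_version` set. Any reintroduced subcommand could sit on top of a reader like this without touching the on-disk format.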

Strategy

Reintroduce one subcommand at a time, driven by real usage, each one polished and legible before it ships. The stable contract to protect is the JSONL format, not the views.

Design backlog (fix as part of reintroduction)

Smaller items (no separate issue yet):

  • `compare [A] [B]` indexable — currently hardcoded to the latest two runs; no way to compare arbitrary runs (`show N` is indexable, `compare` is not — asymmetric).
  • Surface `labels` / `assertions` — persisted per case (`_serialize_eval_case`) but never displayed in any view.
  • `clean` silently ignores `--model`/`--suite`/`--evals`/`--tail` (accepted via the shared parser, dropped in `_run_clean`) — either honour them or reject them.
  • `show N` × `--tail` trap — `show 15` errors "Only 10 entries available" because `--tail` (default 10) truncates first; misleading message.
  • `--exit-on-regression` — `compare` always exits 0, so CI cannot gate on a detected regression.
  • `-n` is overloaded (`--concurrency` in run/eval, `--tail` in history) — reconsider on reintroduction.
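
Two of the items above (the `show N` × `--tail` trap and `--exit-on-regression`) could be addressed roughly as follows. This is a sketch under assumptions, not the archived implementation: the helper names are hypothetical, and it assumes `show N` counts back from the newest run (1 = latest), which may not match the real indexing.

```python
def select_run(entries, index):
    """Resolve a 1-based `show N` index against the FULL history.

    Assumption: N counts back from the newest run (1 = latest).
    The index is resolved before any --tail truncation, so the error
    message reports the real history size rather than the tail size.
    """
    if index < 1 or index > len(entries):
        raise IndexError(
            f"Run {index} not found: history holds {len(entries)} entries"
        )
    return entries[-index]


def tail_view(entries, tail=10):
    """Apply --tail only when rendering a listing, never when indexing."""
    return entries[-tail:] if tail > 0 else []


def compare_exit_code(regressed, exit_on_regression):
    """Return a non-zero exit code only when the user opted in to gating."""
    return 1 if (regressed and exit_on_regression) else 0
```

With 15 stored runs, `select_run(entries, 15)` returns the oldest run even though `tail_view(entries)` shows only the last 10. Keeping the default exit code at 0 preserves current behaviour for anyone not passing the flag.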

Legibility (the reason it was cut)

The standout offender was the `list` "Scores" column rendering unlabeled arrows (`↗↗→`) — three arrows for three scores with no way to tell which is which. Whatever comes back must be readable at a glance: label the arrows, add an inline legend for the `+ - ⟳ * ✗` markers, and surface what `(scoring modified)` means without a docs trip.
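
One possible way to label the arrows is to pair each with its score name. The sketch below is illustrative only: the score names, the `↘` glyph for a decrease, and the `(new)` marker are assumptions, and it does not attempt the legend for the `+ - ⟳ * ✗` markers, whose meanings are not spelled out here.

```python
# Assumption: "↘" marks a decrease; only "↗" and "→" appear in this issue.
ARROWS = {"up": "↗", "flat": "→", "down": "↘"}


def trend(previous, current, tolerance=1e-9):
    """Classify the change between two scores as up, down, or flat."""
    delta = current - previous
    if delta > tolerance:
        return "up"
    if delta < -tolerance:
        return "down"
    return "flat"


def render_scores(previous, current):
    """Render one labeled arrow per score, e.g. 'accuracy ↗  recall →'."""
    parts = []
    for name in sorted(current):
        if name in previous:
            arrow = ARROWS[trend(previous[name], current[name])]
            parts.append(f"{name} {arrow}")
        else:
            # A score with no earlier value has no trend to show.
            parts.append(f"{name} (new)")
    return "  ".join(parts)
```

Sorting by score name keeps the column order stable between runs, so the same score stays in the same position as the history grows.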

Labels: cli (Command line interface), enhancement (New feature or request), future (Future vision)
