load_history reads entire JSONL into memory — stream + heap for --tail N #99

Description

@renaudcepre

Context

protest/history/storage.py:105-125 (load_history) reads the entire history JSONL file into memory before any filtering, slicing, or --tail N truncation happens. For a project running an eval matrix in CI with long retention (months of runs), the file grows linearly and load time degrades alongside it.

This is a follow-up to the eval framework review — non-blocking, deferred until after the eval feature ships.

Why it matters

History is append-only and grows unbounded by default. Common access patterns:

  • protest history --tail 20 only needs the last 20 entries
  • Plotting or stats queries often only need a date window
  • Loading everything to discard most is the dominant cost in practice

The current implementation costs O(file_size) in both time and memory, regardless of how few entries the caller wants.

Proposed approach

  • Streaming reader: parse the JSONL line by line (for line in f), yielding entries lazily
  • --tail N: keep a rolling buffer such as collections.deque(maxlen=N) instead of loading everything and slicing; memory drops to O(N), though a forward scan still reads the whole file
  • Date-window queries: stop reading once an entry's timestamp passes the end of the window (entries are append-only and timestamp-ordered, so nothing later can match)
  • Heap option: only relevant if we need top-K by something other than recency; defer until a use case appears
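The first three bullets can be sketched roughly as follows. This is a minimal illustration, not the code in protest/history/storage.py: the function names iter_history, tail_history and window_history are hypothetical, and the "timestamp" field holding an ISO 8601 string is an assumption about the entry schema, which the issue does not describe.

```python
import json
from collections import deque
from datetime import datetime
from pathlib import Path
from typing import Any, Iterator


def iter_history(path: Path) -> Iterator[dict[str, Any]]:
    """Yield one parsed entry per non-blank JSONL line, lazily."""
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def tail_history(path: Path, n: int) -> list[dict[str, Any]]:
    """Return the last n entries while holding at most n in memory."""
    # deque(maxlen=n) silently drops the oldest item once it is full.
    return list(deque(iter_history(path), maxlen=n))


def window_history(
    path: Path, start: datetime, end: datetime
) -> Iterator[dict[str, Any]]:
    """Yield entries with start <= timestamp <= end, stopping early."""
    for entry in iter_history(path):
        # Assumption: each entry has an ISO 8601 "timestamp" string.
        ts = datetime.fromisoformat(entry["timestamp"])
        if ts > end:
            break  # entries are timestamp-ordered, so nothing later matches
        if ts >= start:
            yield entry
```

Note that tail_history still scans the file from the start, so it bounds memory rather than read time. The early exit in window_history only helps for windows that end before the most recent entries.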

Decisions to make at implementation time

  1. API surface: keep load_history() returning a list (compatible) and add iter_history() as the streaming primitive? Or change the existing function to return an iterator and update callers?
  2. --tail N plumbing: pass N down to the storage layer, or keep filtering in the CLI and let it consume from the iterator?
  3. Validation thresholds: at what file size / entry count does the current implementation start to pinch in real usage? Worth measuring before optimizing — pick a representative project and benchmark.
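For decisions 1 and 2, one possible shape is to keep load_history() list-returning and build it on a streaming primitive, with an optional tail argument so the CLI can pass N down. The sketch below is hypothetical: the real signature of load_history is not shown in this issue, and the tail parameter is an illustrative name only.

```python
import json
from collections import deque
from pathlib import Path
from typing import Any, Iterator, Optional


def iter_history(path: Path) -> Iterator[dict[str, Any]]:
    """Streaming primitive: yield entries one at a time."""
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def load_history(path: Path, tail: Optional[int] = None) -> list[dict[str, Any]]:
    """Compatibility wrapper: still returns a list for existing callers.

    `tail` is a hypothetical parameter; when given, only the last
    `tail` entries are kept in memory while the file is scanned.
    """
    entries = iter_history(path)
    if tail is None:
        return list(entries)
    return list(deque(entries, maxlen=tail))
```

With this shape, callers that pass no tail see the same list as before, and new code can choose iter_history() when it does not need everything at once.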

Acceptance criteria

  • --tail N peak memory becomes independent of file size; run time does too only if the reader seeks from the end of the file instead of scanning forward (verifiable via a benchmark with a synthetic 100k-entry file)
  • Existing callers of load_history keep working (or are migrated cleanly)
  • No behavior change in what gets returned — only how it's read
  • just lint clean, all tests pass
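The benchmark in the first criterion could look something like this. It is a rough sketch under assumptions: the entry fields are invented for the synthetic file, tail_eager only approximates the current load-then-slice behavior, and all function names are hypothetical.

```python
import json
import tempfile
import time
import tracemalloc
from collections import deque
from pathlib import Path


def write_synthetic_history(path: Path, entries: int) -> None:
    """Write `entries` made-up JSONL records (fields are invented)."""
    with path.open("w", encoding="utf-8") as f:
        for i in range(entries):
            f.write(json.dumps({"run": i, "status": "passed"}) + "\n")


def tail_eager(path: Path, n: int) -> list:
    """Approximation of load-everything-then-slice (assumes n > 0)."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()][-n:]


def tail_streaming(path: Path, n: int) -> list:
    """Forward scan that keeps only the last n entries."""
    with path.open(encoding="utf-8") as f:
        parsed = (json.loads(line) for line in f if line.strip())
        return list(deque(parsed, maxlen=n))


def measure(fn, path: Path, n: int):
    """Return (result, elapsed seconds, peak traced bytes) for one call."""
    tracemalloc.start()
    started = time.perf_counter()
    result = fn(path, n)
    elapsed = time.perf_counter() - started
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


def run_benchmark(entries: int = 100_000, n: int = 20) -> None:
    """Print time and peak memory for both strategies."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "history.jsonl"
        write_synthetic_history(path, entries)
        for fn in (tail_eager, tail_streaming):
            result, elapsed, peak = measure(fn, path, n)
            print(f"{fn.__name__}: {elapsed:.3f}s, peak {peak / 1024:.0f} KiB")
```

Calling run_benchmark() prints one line per strategy. Running it at a few different entry counts shows which of the two metrics, time or peak memory, actually stops scaling with file size.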
