Skip to content

Design review: karpathy-self-improve scaffold (pre-Exec2) #134

Description

@Interstellar-code

Context

Pre-Exec2 design review of the karpathy-self-improve plugin scaffold. Findings below must be addressed before or during Exec2 implementation to avoid painful churn. Review covers: DB schema, metrics approach, API design, git ratchet safety, anti-gaming rails, pattern compliance, and top blockers.


1. DB Schema

[CRITICAL] Lifecycle schema too thin for the ratchet state machine
_db.py:53 defines experiments with only profile/file/state/diff/rationale/scores/verdict/cost/timestamps. Exec2 cannot persist approval, live-window progress, verify/revert timestamps, or git provenance.
Fix: add columns target_profile_root, target_relpath, base_commit_sha, apply_commit_sha, revert_commit_sha, approved_by, approved_at, rejected_by, rejected_at, rejection_reason, live_sessions_target, live_sessions_observed, applied_at, verified_at, reverted_at, baseline_id, plus a CHECK/enum for canonical states.

[CRITICAL] No experiment-to-scenario results linkage
_db.py:53 stores only aggregate offline_score/live_score; _db.py:68 stores scenarios independently — per-scenario binary outcomes, held-out evaluation, and judge/proposer provenance are unauditable.
Fix: add eval_runs and experiment_scenario_results tables keyed by experiment_id + scenario_id, with split, judge_model, proposer_model, pass_fail, judge_rationale, scenario_snapshot, timestamps.

[HIGH] State vocabulary drifting from required contract
_db.py:57 defaults to pending and tests/test_db.py:112 hard-codes pending/running/done, conflicting with the required proposed/approved/live/verified/reverted/rejected.
Fix: rename to canonical states now, enforce with a CHECK constraint, add experiment_state_transitions audit table (from_state, to_state, actor, reason, timestamp).

[HIGH] DB location fragments cross-profile engine
_db.py:23 roots the DB at get_hermes_home() which is profile-scoped in profile mode (hermes_constants.py:53).
Fix: store under get_default_hermes_root() for a central controller DB; separate per-profile runtime state if needed.


2. Metrics Approach

[CRITICAL] Error/warning counts double-counted
_metrics.py:61 concatenates agent.log and errors.log, but errors.log is a subset of agent.log (hermes_logging.py:17).
Fix: count from one canonical stream only, or dedupe by (timestamp, logger, message) before incrementing.

[HIGH] profile="(unknown)" discards runtime-available information
_metrics.py:120 hard-codes the label even though logs are rooted under profile-aware get_hermes_home().
Fix: derive profile id from active Hermes home or pass it explicitly into collect_profile_metrics().

[HIGH] Whole-file log snapshots cannot support experiment windows
_metrics.py:54 reparses full logs; _db.py:40 stores only absolute counts — deltas cannot be attributed to a specific live observation window.
Fix: persist collection cursors/window boundaries (from_offset, to_offset, window_started_at, window_ended_at), or store deltas per experiment/baseline.

[MED] Session counting is a fragile regex heuristic
_metrics.py:32 matches on_session_start|session start|agent_init against free text, but the core lifecycle is a method call (run_agent.py:566), not a guaranteed log line.
Fix: anchor on session tags from hermes_logging.py:24.


3. API Design

[HIGH] Router does not cover the required state machine surface
plugin_api.py:46–110 expose only metrics listing/collection and experiment reads. No approve/reject/apply/live/verify/revert, no scenario CRUD, no history, no baseline curve.
Fix: add POST /experiments/{id}/approve, /reject, /apply, /verify, /revert, POST /experiments, GET /experiments/{id}/history, scenario CRUD, pause/resume, and time-series/baseline endpoints.

[HIGH] One-experiment-in-flight-per-profile invariant enforced nowhere
_db.py:228 lists experiments by free-text state; _db.py:248 updates state blindly — no unique partial index or transactional guard.
Fix: add a DB-level unique partial index on active states; enforce transitions inside BEGIN IMMEDIATE transactions.


4. Git Ratchet Safety

[HIGH] Revert/apply will be unrecoverable without commit SHAs
_db.py:53 has no commit SHAs, blob hashes, or working-tree fingerprint; daemon.py:45 is still a stub.
Fix: persist profile_git_root, target_relpath, base_blob_sha, base_commit_sha, apply_commit_sha, revert_commit_sha, dirty_before_apply, manual_conflict_detected; implement git -C ~/.hermes/profiles/<id> add -- <target-file> committing exactly one file per experiment.

[MED] No manual-edit conflict detection during live window
Nothing records whether the target file changed during the live window, so auto-revert could clobber human edits.
Fix: store pre-apply and live-window file hashes; refuse auto-revert unless the file still matches the applied experiment version.


5. Anti-Gaming / Safety Rails

[HIGH] Held-out scenarios representable but not protected
_db.py:74 has holdout, but _db.py:274 returns all scenarios for a profile with no visibility filter.
Fix: enforce held-out exclusion in code by default; add explicit privileged path for reviewers only; snapshot scenario content into eval-result rows so future edits cannot rewrite history.

[HIGH] Different-model judging not enforceable from stored data
Neither experiments nor scenarios at _db.py:53 records proposer/judge identity.
Fix: add proposer_model, judge_model, judge_run_id; add a code-level guard rejecting any eval run where proposer and judge resolve to the same model identity.

[MED] "One atomic sentence-change" constraint not representable
diff and rationale are opaque text fields at _db.py:58.
Fix: add structured proposal metadata (changed_hunks, added_lines, removed_lines, sentence_delta_count, target_span); enforce in proposer validation before approval.


6. Pattern Violations

[MED] plugin.yaml declares hooks that register() never installs
plugin.yaml:8 lists post_llm_call and pre_llm_call; __init__.py:73 only registers a CLI command and a slash command.
Fix: remove hook declarations until Exec2 actually registers callbacks, or wire them now so manifest matches runtime.


7. Top 3 Blockers for Exec2

  1. [CRITICAL] Wrong state model baked into both schema and tests — _db.py:57 and tests/test_db.py:112 use pending/running/done. Every Exec2 transition, endpoint, and UI action will require churn. Rename to canonical states and add transition/audit tables before implementation fans out.

  2. [CRITICAL] No per-scenario eval provenance — _db.py:53 stores only aggregate scores, making held-out failures, judge drift, and approval decisions impossible to defend. Introduce eval_runs + experiment_scenario_results before building the runner.

  3. [HIGH] Metrics will produce misleading baselines from day one — _metrics.py:61 double-counts warnings/errors; _metrics.py:120 collapses all profiles to (unknown). Repair attribution and deduplication before any baseline snapshots are taken.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions