Context
Pre-Exec2 design review of the karpathy-self-improve plugin scaffold. Findings below must be addressed before or during Exec2 implementation to avoid painful churn. Review covers: DB schema, metrics approach, API design, git ratchet safety, anti-gaming rails, pattern compliance, and top blockers.
1. DB Schema
[CRITICAL] Lifecycle schema too thin for the ratchet state machine
_db.py:53 defines experiments with only profile/file/state/diff/rationale/scores/verdict/cost/timestamps. Exec2 cannot persist approval, live-window progress, verify/revert timestamps, or git provenance.
Fix: add columns target_profile_root, target_relpath, base_commit_sha, apply_commit_sha, revert_commit_sha, approved_by, approved_at, rejected_by, rejected_at, rejection_reason, live_sessions_target, live_sessions_observed, applied_at, verified_at, reverted_at, baseline_id, plus a CHECK/enum for canonical states.
[CRITICAL] No experiment-to-scenario results linkage
_db.py:53 stores only aggregate offline_score/live_score; _db.py:68 stores scenarios independently — per-scenario binary outcomes, held-out evaluation, and judge/proposer provenance are unauditable.
Fix: add eval_runs and experiment_scenario_results tables keyed by experiment_id + scenario_id, with split, judge_model, proposer_model, pass_fail, judge_rationale, scenario_snapshot, timestamps.
[HIGH] State vocabulary drifting from required contract
_db.py:57 defaults to pending and tests/test_db.py:112 hard-codes pending/running/done, conflicting with the required proposed/approved/live/verified/reverted/rejected.
Fix: rename to canonical states now, enforce with a CHECK constraint, add experiment_state_transitions audit table (from_state, to_state, actor, reason, timestamp).
[HIGH] DB location fragments cross-profile engine
_db.py:23 roots the DB at get_hermes_home() which is profile-scoped in profile mode (hermes_constants.py:53).
Fix: store under get_default_hermes_root() for a central controller DB; separate per-profile runtime state if needed.
2. Metrics Approach
[CRITICAL] Error/warning counts double-counted
_metrics.py:61 concatenates agent.log and errors.log, but errors.log is a subset of agent.log (hermes_logging.py:17).
Fix: count from one canonical stream only, or dedupe by (timestamp, logger, message) before incrementing.
[HIGH] profile="(unknown)" discards runtime-available information
_metrics.py:120 hard-codes the label even though logs are rooted under profile-aware get_hermes_home().
Fix: derive profile id from active Hermes home or pass it explicitly into collect_profile_metrics().
[HIGH] Whole-file log snapshots cannot support experiment windows
_metrics.py:54 reparses full logs; _db.py:40 stores only absolute counts — deltas cannot be attributed to a specific live observation window.
Fix: persist collection cursors/window boundaries (from_offset, to_offset, window_started_at, window_ended_at), or store deltas per experiment/baseline.
[MED] Session counting is a fragile regex heuristic
_metrics.py:32 matches on_session_start|session start|agent_init against free text, but the core lifecycle is a method call (run_agent.py:566), not a guaranteed log line.
Fix: anchor on session tags from hermes_logging.py:24.
3. API Design
[HIGH] Router does not cover the required state machine surface
plugin_api.py:46–110 expose only metrics listing/collection and experiment reads. No approve/reject/apply/live/verify/revert, no scenario CRUD, no history, no baseline curve.
Fix: add POST /experiments/{id}/approve, /reject, /apply, /verify, /revert, POST /experiments, GET /experiments/{id}/history, scenario CRUD, pause/resume, and time-series/baseline endpoints.
[HIGH] One-experiment-in-flight-per-profile invariant enforced nowhere
_db.py:228 lists experiments by free-text state; _db.py:248 updates state blindly — no unique partial index or transactional guard.
Fix: add a DB-level unique partial index on active states; enforce transitions inside BEGIN IMMEDIATE transactions.
4. Git Ratchet Safety
[HIGH] Revert/apply will be unrecoverable without commit SHAs
_db.py:53 has no commit SHAs, blob hashes, or working-tree fingerprint; daemon.py:45 is still a stub.
Fix: persist profile_git_root, target_relpath, base_blob_sha, base_commit_sha, apply_commit_sha, revert_commit_sha, dirty_before_apply, manual_conflict_detected; implement git -C ~/.hermes/profiles/<id> add -- <target-file> committing exactly one file per experiment.
[MED] No manual-edit conflict detection during live window
Nothing records whether the target file changed during the live window, so auto-revert could clobber human edits.
Fix: store pre-apply and live-window file hashes; refuse auto-revert unless the file still matches the applied experiment version.
5. Anti-Gaming / Safety Rails
[HIGH] Held-out scenarios representable but not protected
_db.py:74 has holdout, but _db.py:274 returns all scenarios for a profile with no visibility filter.
Fix: enforce held-out exclusion in code by default; add explicit privileged path for reviewers only; snapshot scenario content into eval-result rows so future edits cannot rewrite history.
[HIGH] Different-model judging not enforceable from stored data
Neither experiments nor scenarios at _db.py:53 records proposer/judge identity.
Fix: add proposer_model, judge_model, judge_run_id; add a code-level guard rejecting any eval run where proposer and judge resolve to the same model identity.
[MED] "One atomic sentence-change" constraint not representable
diff and rationale are opaque text fields at _db.py:58.
Fix: add structured proposal metadata (changed_hunks, added_lines, removed_lines, sentence_delta_count, target_span); enforce in proposer validation before approval.
6. Pattern Violations
[MED] plugin.yaml declares hooks that register() never installs
plugin.yaml:8 lists post_llm_call and pre_llm_call; __init__.py:73 only registers a CLI command and a slash command.
Fix: remove hook declarations until Exec2 actually registers callbacks, or wire them now so manifest matches runtime.
7. Top 3 Blockers for Exec2
-
[CRITICAL] Wrong state model baked into both schema and tests — _db.py:57 and tests/test_db.py:112 use pending/running/done. Every Exec2 transition, endpoint, and UI action will require churn. Rename to canonical states and add transition/audit tables before implementation fans out.
-
[CRITICAL] No per-scenario eval provenance — _db.py:53 stores only aggregate scores, making held-out failures, judge drift, and approval decisions impossible to defend. Introduce eval_runs + experiment_scenario_results before building the runner.
-
[HIGH] Metrics will produce misleading baselines from day one — _metrics.py:61 double-counts warnings/errors; _metrics.py:120 collapses all profiles to (unknown). Repair attribution and deduplication before any baseline snapshots are taken.
Context
Pre-Exec2 design review of the
karpathy-self-improveplugin scaffold. Findings below must be addressed before or during Exec2 implementation to avoid painful churn. Review covers: DB schema, metrics approach, API design, git ratchet safety, anti-gaming rails, pattern compliance, and top blockers.1. DB Schema
[CRITICAL] Lifecycle schema too thin for the ratchet state machine
_db.py:53definesexperimentswith onlyprofile/file/state/diff/rationale/scores/verdict/cost/timestamps. Exec2 cannot persist approval, live-window progress, verify/revert timestamps, or git provenance.Fix: add columns
target_profile_root,target_relpath,base_commit_sha,apply_commit_sha,revert_commit_sha,approved_by,approved_at,rejected_by,rejected_at,rejection_reason,live_sessions_target,live_sessions_observed,applied_at,verified_at,reverted_at,baseline_id, plus aCHECK/enum for canonical states.[CRITICAL] No experiment-to-scenario results linkage
_db.py:53stores only aggregateoffline_score/live_score;_db.py:68stores scenarios independently — per-scenario binary outcomes, held-out evaluation, and judge/proposer provenance are unauditable.Fix: add
eval_runsandexperiment_scenario_resultstables keyed byexperiment_id+scenario_id, withsplit,judge_model,proposer_model,pass_fail,judge_rationale,scenario_snapshot, timestamps.[HIGH] State vocabulary drifting from required contract
_db.py:57defaults topendingandtests/test_db.py:112hard-codespending/running/done, conflicting with the requiredproposed/approved/live/verified/reverted/rejected.Fix: rename to canonical states now, enforce with a
CHECKconstraint, addexperiment_state_transitionsaudit table (from_state,to_state, actor, reason, timestamp).[HIGH] DB location fragments cross-profile engine
_db.py:23roots the DB atget_hermes_home()which is profile-scoped in profile mode (hermes_constants.py:53).Fix: store under
get_default_hermes_root()for a central controller DB; separate per-profile runtime state if needed.2. Metrics Approach
[CRITICAL] Error/warning counts double-counted
_metrics.py:61concatenatesagent.loganderrors.log, buterrors.logis a subset ofagent.log(hermes_logging.py:17).Fix: count from one canonical stream only, or dedupe by
(timestamp, logger, message)before incrementing.[HIGH]
profile="(unknown)"discards runtime-available information_metrics.py:120hard-codes the label even though logs are rooted under profile-awareget_hermes_home().Fix: derive profile id from active Hermes home or pass it explicitly into
collect_profile_metrics().[HIGH] Whole-file log snapshots cannot support experiment windows
_metrics.py:54reparses full logs;_db.py:40stores only absolute counts — deltas cannot be attributed to a specific live observation window.Fix: persist collection cursors/window boundaries (
from_offset,to_offset,window_started_at,window_ended_at), or store deltas per experiment/baseline.[MED] Session counting is a fragile regex heuristic
_metrics.py:32matcheson_session_start|session start|agent_initagainst free text, but the core lifecycle is a method call (run_agent.py:566), not a guaranteed log line.Fix: anchor on session tags from
hermes_logging.py:24.3. API Design
[HIGH] Router does not cover the required state machine surface
plugin_api.py:46–110expose only metrics listing/collection and experiment reads. No approve/reject/apply/live/verify/revert, no scenario CRUD, no history, no baseline curve.Fix: add
POST /experiments/{id}/approve,/reject,/apply,/verify,/revert,POST /experiments,GET /experiments/{id}/history, scenario CRUD, pause/resume, and time-series/baseline endpoints.[HIGH] One-experiment-in-flight-per-profile invariant enforced nowhere
_db.py:228lists experiments by free-text state;_db.py:248updates state blindly — no unique partial index or transactional guard.Fix: add a DB-level unique partial index on active states; enforce transitions inside
BEGIN IMMEDIATEtransactions.4. Git Ratchet Safety
[HIGH] Revert/apply will be unrecoverable without commit SHAs
_db.py:53has no commit SHAs, blob hashes, or working-tree fingerprint;daemon.py:45is still a stub.Fix: persist
profile_git_root,target_relpath,base_blob_sha,base_commit_sha,apply_commit_sha,revert_commit_sha,dirty_before_apply,manual_conflict_detected; implementgit -C ~/.hermes/profiles/<id> add -- <target-file>committing exactly one file per experiment.[MED] No manual-edit conflict detection during live window
Nothing records whether the target file changed during the live window, so auto-revert could clobber human edits.
Fix: store pre-apply and live-window file hashes; refuse auto-revert unless the file still matches the applied experiment version.
5. Anti-Gaming / Safety Rails
[HIGH] Held-out scenarios representable but not protected
_db.py:74hasholdout, but_db.py:274returns all scenarios for a profile with no visibility filter.Fix: enforce held-out exclusion in code by default; add explicit privileged path for reviewers only; snapshot scenario content into eval-result rows so future edits cannot rewrite history.
[HIGH] Different-model judging not enforceable from stored data
Neither
experimentsnorscenariosat_db.py:53records proposer/judge identity.Fix: add
proposer_model,judge_model,judge_run_id; add a code-level guard rejecting any eval run where proposer and judge resolve to the same model identity.[MED] "One atomic sentence-change" constraint not representable
diffandrationaleare opaque text fields at_db.py:58.Fix: add structured proposal metadata (
changed_hunks,added_lines,removed_lines,sentence_delta_count,target_span); enforce in proposer validation before approval.6. Pattern Violations
[MED]
plugin.yamldeclares hooks thatregister()never installsplugin.yaml:8listspost_llm_callandpre_llm_call;__init__.py:73only registers a CLI command and a slash command.Fix: remove hook declarations until Exec2 actually registers callbacks, or wire them now so manifest matches runtime.
7. Top 3 Blockers for Exec2
[CRITICAL] Wrong state model baked into both schema and tests —
_db.py:57andtests/test_db.py:112usepending/running/done. Every Exec2 transition, endpoint, and UI action will require churn. Rename to canonical states and add transition/audit tables before implementation fans out.[CRITICAL] No per-scenario eval provenance —
_db.py:53stores only aggregate scores, making held-out failures, judge drift, and approval decisions impossible to defend. Introduceeval_runs+experiment_scenario_resultsbefore building the runner.[HIGH] Metrics will produce misleading baselines from day one —
_metrics.py:61double-counts warnings/errors;_metrics.py:120collapses all profiles to(unknown). Repair attribution and deduplication before any baseline snapshots are taken.