Design review: karpathy-self-improve scaffold (pre-Exec2)

## Context

Pre-Exec2 design review of the `karpathy-self-improve` plugin scaffold. Findings below must be addressed before or during Exec2 implementation to avoid painful churn. Review covers: DB schema, metrics approach, API design, git ratchet safety, anti-gaming rails, pattern compliance, and top blockers.

---

## 1. DB Schema

**[CRITICAL]** Lifecycle schema too thin for the ratchet state machine
`_db.py:53` defines `experiments` with only `profile/file/state/diff/rationale/scores/verdict/cost/timestamps`. Exec2 cannot persist approval, live-window progress, verify/revert timestamps, or git provenance.
Fix: add columns `target_profile_root`, `target_relpath`, `base_commit_sha`, `apply_commit_sha`, `revert_commit_sha`, `approved_by`, `approved_at`, `rejected_by`, `rejected_at`, `rejection_reason`, `live_sessions_target`, `live_sessions_observed`, `applied_at`, `verified_at`, `reverted_at`, `baseline_id`, plus a `CHECK`/enum for canonical states.

**[CRITICAL]** No experiment-to-scenario results linkage
`_db.py:53` stores only aggregate `offline_score/live_score`; `_db.py:68` stores scenarios independently — per-scenario binary outcomes, held-out evaluation, and judge/proposer provenance are unauditable.
Fix: add `eval_runs` and `experiment_scenario_results` tables keyed by `experiment_id` + `scenario_id`, with `split`, `judge_model`, `proposer_model`, `pass_fail`, `judge_rationale`, `scenario_snapshot`, timestamps.

**[HIGH]** State vocabulary drifting from required contract
`_db.py:57` defaults to `pending` and `tests/test_db.py:112` hard-codes `pending/running/done`, conflicting with the required `proposed/approved/live/verified/reverted/rejected`.
Fix: rename to canonical states now, enforce with a `CHECK` constraint, add `experiment_state_transitions` audit table (`from_state`, `to_state`, actor, reason, timestamp).

**[HIGH]** DB location fragments cross-profile engine
`_db.py:23` roots the DB at `get_hermes_home()` which is profile-scoped in profile mode (`hermes_constants.py:53`).
Fix: store under `get_default_hermes_root()` for a central controller DB; separate per-profile runtime state if needed.

---

## 2. Metrics Approach

**[CRITICAL]** Error/warning counts double-counted
`_metrics.py:61` concatenates `agent.log` and `errors.log`, but `errors.log` is a subset of `agent.log` (`hermes_logging.py:17`).
Fix: count from one canonical stream only, or dedupe by `(timestamp, logger, message)` before incrementing.

**[HIGH]** `profile="(unknown)"` discards runtime-available information
`_metrics.py:120` hard-codes the label even though logs are rooted under profile-aware `get_hermes_home()`.
Fix: derive profile id from active Hermes home or pass it explicitly into `collect_profile_metrics()`.

**[HIGH]** Whole-file log snapshots cannot support experiment windows
`_metrics.py:54` reparses full logs; `_db.py:40` stores only absolute counts — deltas cannot be attributed to a specific live observation window.
Fix: persist collection cursors/window boundaries (`from_offset`, `to_offset`, `window_started_at`, `window_ended_at`), or store deltas per experiment/baseline.

**[MED]** Session counting is a fragile regex heuristic
`_metrics.py:32` matches `on_session_start|session start|agent_init` against free text, but the core lifecycle is a method call (`run_agent.py:566`), not a guaranteed log line.
Fix: anchor on session tags from `hermes_logging.py:24`.

---

## 3. API Design

**[HIGH]** Router does not cover the required state machine surface
`plugin_api.py:46–110` expose only metrics listing/collection and experiment reads. No approve/reject/apply/live/verify/revert, no scenario CRUD, no history, no baseline curve.
Fix: add `POST /experiments/{id}/approve`, `/reject`, `/apply`, `/verify`, `/revert`, `POST /experiments`, `GET /experiments/{id}/history`, scenario CRUD, pause/resume, and time-series/baseline endpoints.

**[HIGH]** One-experiment-in-flight-per-profile invariant enforced nowhere
`_db.py:228` lists experiments by free-text state; `_db.py:248` updates state blindly — no unique partial index or transactional guard.
Fix: add a DB-level unique partial index on active states; enforce transitions inside `BEGIN IMMEDIATE` transactions.

---

## 4. Git Ratchet Safety

**[HIGH]** Revert/apply will be unrecoverable without commit SHAs
`_db.py:53` has no commit SHAs, blob hashes, or working-tree fingerprint; `daemon.py:45` is still a stub.
Fix: persist `profile_git_root`, `target_relpath`, `base_blob_sha`, `base_commit_sha`, `apply_commit_sha`, `revert_commit_sha`, `dirty_before_apply`, `manual_conflict_detected`; implement `git -C ~/.hermes/profiles/<id> add -- <target-file>` committing exactly one file per experiment.

**[MED]** No manual-edit conflict detection during live window
Nothing records whether the target file changed during the live window, so auto-revert could clobber human edits.
Fix: store pre-apply and live-window file hashes; refuse auto-revert unless the file still matches the applied experiment version.

---

## 5. Anti-Gaming / Safety Rails

**[HIGH]** Held-out scenarios representable but not protected
`_db.py:74` has `holdout`, but `_db.py:274` returns all scenarios for a profile with no visibility filter.
Fix: enforce held-out exclusion in code by default; add explicit privileged path for reviewers only; snapshot scenario content into eval-result rows so future edits cannot rewrite history.

**[HIGH]** Different-model judging not enforceable from stored data
Neither `experiments` nor `scenarios` at `_db.py:53` records proposer/judge identity.
Fix: add `proposer_model`, `judge_model`, `judge_run_id`; add a code-level guard rejecting any eval run where proposer and judge resolve to the same model identity.

**[MED]** "One atomic sentence-change" constraint not representable
`diff` and `rationale` are opaque text fields at `_db.py:58`.
Fix: add structured proposal metadata (`changed_hunks`, `added_lines`, `removed_lines`, `sentence_delta_count`, `target_span`); enforce in proposer validation before approval.

---

## 6. Pattern Violations

**[MED]** `plugin.yaml` declares hooks that `register()` never installs
`plugin.yaml:8` lists `post_llm_call` and `pre_llm_call`; `__init__.py:73` only registers a CLI command and a slash command.
Fix: remove hook declarations until Exec2 actually registers callbacks, or wire them now so manifest matches runtime.

---

## 7. Top 3 Blockers for Exec2

1. **[CRITICAL]** Wrong state model baked into both schema and tests — `_db.py:57` and `tests/test_db.py:112` use `pending/running/done`. Every Exec2 transition, endpoint, and UI action will require churn. Rename to canonical states and add transition/audit tables before implementation fans out.

2. **[CRITICAL]** No per-scenario eval provenance — `_db.py:53` stores only aggregate scores, making held-out failures, judge drift, and approval decisions impossible to defend. Introduce `eval_runs` + `experiment_scenario_results` before building the runner.

3. **[HIGH]** Metrics will produce misleading baselines from day one — `_metrics.py:61` double-counts warnings/errors; `_metrics.py:120` collapses all profiles to `(unknown)`. Repair attribution and deduplication before any baseline snapshots are taken.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Design review: karpathy-self-improve scaffold (pre-Exec2) #134

Context

1. DB Schema

2. Metrics Approach

3. API Design

4. Git Ratchet Safety

5. Anti-Gaming / Safety Rails

6. Pattern Violations

7. Top 3 Blockers for Exec2

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Design review: karpathy-self-improve scaffold (pre-Exec2) #134

Description

Context

1. DB Schema

2. Metrics Approach

3. API Design

4. Git Ratchet Safety

5. Anti-Gaming / Safety Rails

6. Pattern Violations

7. Top 3 Blockers for Exec2

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions