[/improve] Self-improving agent UI — per-profile health + proposal/approval queue #206

@Interstellar-code

Summary

Add an /improve page to Switch UI that monitors per-profile / per-skill agent health and drives the self-improving harness loop (Karpathy autoresearch ratchet) as a human-in-the-loop approval queue. This page is a thin consumer of the agent-side engine API (/api/improve/*) — all heavy machinery (metrics store, eval runner, meta-agent proposer, git ratchet, experiment state machine) lives in the hermes-agent plugin.

Companion / dependency: agent engine issue → Interstellar-code/hermes-agent#133. This UI issue is blocked on that plugin's /api/improve/* surface.

This is the UI half of docs/self-improving-agent-proposal.md (§4, §6c). Engine/agent work is out of scope here.

Architecture

  • Consume /api/improve/* via the existing dashboard-proxy pattern (same as jobs/kanban).
  • Capability-probe the endpoint — hide the /improve nav + page when the agent plugin isn't installed/enabled (mirror jobs/kanban probing).
  • No business logic in the UI: it renders proposals, scores, diffs, history, and posts approve/reject/pause actions back to the API.
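
A minimal sketch of the capability probe described above, assuming the plugin exposes some cheap endpoint under /api/improve/*. The path `/api/improve/status`, the `ProbeFetch` type, and both function names are hypothetical; the real probe should mirror whatever jobs/kanban already do.

```typescript
// Hypothetical probe path: the issue only specifies the /api/improve/* prefix.
const IMPROVE_PROBE_PATH = "/api/improve/status";

// Minimal fetch-like shape so the probe can be exercised without a network.
type ProbeFetch = (url: string) => Promise<{ status: number }>;

// Show the /improve nav only on a 2xx answer. A null status stands for
// "request failed", which is treated the same as "plugin not installed".
function shouldShowImproveNav(status: number | null): boolean {
  return status !== null && status >= 200 && status < 300;
}

// Run the probe through the dashboard proxy; any rejection hides the page.
function probeImprove(fetchFn: ProbeFetch): Promise<boolean> {
  return fetchFn(IMPROVE_PROBE_PATH).then(
    (res) => shouldShowImproveNav(res.status),
    () => false
  );
}
```

Keeping the visibility decision in a pure function means the nav logic can be unit-tested separately from the proxy call.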

Page structure (/improve route)

1. Per-profile observability (P0 — zero-risk, immediate value)

Scorecard built from data the dashboard already fetches, scoped per profile:

  • sessions/day, error+warn rate, cost trend, token efficiency, retries.
  • Per-skill row (P4): invocations, error rate (logs↔sessions correlated), retries, avg tokens/cost, last-used, trend.
  • Trends sourced from the agent metrics store via API (survives restarts).
  • This section alone answers "is my harness degrading?" and can ship before any part of the improvement loop exists.
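
As an illustration of how the per-profile scorecard could be derived from session data, here is a rough sketch. The `SessionRecord` and `ProfileScorecard` shapes and the exact rate definitions are assumptions for the example, not the agent API schema.

```typescript
// Hypothetical per-session record; field names are illustrative only.
interface SessionRecord {
  day: string;      // calendar day, e.g. "2025-01-01"
  events: number;   // total log events in the session
  errors: number;
  warnings: number;
  retries: number;
  tokens: number;
}

interface ProfileScorecard {
  sessionsPerDay: number;
  errorWarnRate: number;      // (errors + warnings) / events
  retriesPerSession: number;
  tokensPerSession: number;
}

function buildScorecard(sessions: SessionRecord[]): ProfileScorecard {
  if (sessions.length === 0) {
    return { sessionsPerDay: 0, errorWarnRate: 0, retriesPerSession: 0, tokensPerSession: 0 };
  }
  // Count distinct days with a plain object used as a set.
  const seenDays: Record<string, boolean> = {};
  sessions.forEach((s) => { seenDays[s.day] = true; });
  const dayCount = Object.keys(seenDays).length;

  const sum = (pick: (s: SessionRecord) => number): number =>
    sessions.reduce((acc, s) => acc + pick(s), 0);
  const events = sum((s) => s.events);

  return {
    sessionsPerDay: sessions.length / dayCount,
    errorWarnRate: events === 0 ? 0 : sum((s) => s.errors + s.warnings) / events,
    retriesPerSession: sum((s) => s.retries) / sessions.length,
    tokensPerSession: sum((s) => s.tokens) / sessions.length,
  };
}
```

The same function could be applied to a per-skill slice of sessions for the P4 rows, assuming sessions can be attributed to skills.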

2. Proposal queue (state: proposed)

Card per pending experiment:

  • Side-by-side old vs new diff (SOUL.md / profile system_prompt) — exactly one atomic change, highlighted.
  • Meta-agent rationale.
  • Offline eval table: per-scenario pass/fail before vs after, aggregate delta, eval token cost.
  • Actions: Approve / Reject / Edit-then-approve (reject is logged so the idea isn't re-proposed).
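
The offline eval table on each card could be summarized along these lines. This is only a sketch: `ScenarioResult`, `EvalSummary`, and `summarizeEval` are hypothetical names, and in practice the aggregate delta may arrive precomputed from the API, since the UI holds no business logic.

```typescript
// Hypothetical row of the offline eval table: one scenario, before vs after.
interface ScenarioResult {
  scenario: string;
  passedBefore: boolean;
  passedAfter: boolean;
}

interface EvalSummary {
  passRateBefore: number;
  passRateAfter: number;
  delta: number;            // passRateAfter - passRateBefore
  regressions: string[];    // scenarios that passed before and fail after
}

function summarizeEval(rows: ScenarioResult[]): EvalSummary {
  const total = rows.length;
  const before = rows.filter((r) => r.passedBefore).length;
  const after = rows.filter((r) => r.passedAfter).length;
  const passRateBefore = total === 0 ? 0 : before / total;
  const passRateAfter = total === 0 ? 0 : after / total;
  return {
    passRateBefore,
    passRateAfter,
    delta: passRateAfter - passRateBefore,
    regressions: rows
      .filter((r) => r.passedBefore && !r.passedAfter)
      .map((r) => r.scenario),
  };
}
```

Listing regressions separately is a display choice for this sketch: a positive aggregate delta can still hide a scenario that newly fails, which a reviewer would likely want to see before approving.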

3. Observation window (state: live)

  • Progress: "12 / 30 sessions observed" (window configurable per profile).
  • Live metrics vs baseline: error/warn rate, completion, retries, token efficiency, periodic LLM-judge spot-scores.
  • This is the second verification stage (production proof, beyond offline eval).
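
For the progress indicator, a small helper along these lines might be enough. The function and field names are illustrative; the window size is assumed to come from the per-profile configuration.

```typescript
// Hypothetical view model for the observation-window progress bar.
interface WindowProgress {
  label: string;      // e.g. "12 / 30 sessions observed"
  fraction: number;   // 0..1, clamped
  complete: boolean;
}

function observationProgress(observed: number, windowSize: number): WindowProgress {
  // Guard against a zero or negative window so the bar never divides by zero.
  const size = Math.max(1, windowSize);
  const fraction = Math.min(1, Math.max(0, observed / size));
  return {
    label: `${observed} / ${windowSize} sessions observed`,
    fraction,
    complete: observed >= size,
  };
}
```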

4. Verdict + History

Controls

  • Pause/resume per profile (pause stops new proposals; in-flight experiment finishes its window).
  • One-experiment-in-flight-per-profile reflected in UI state.
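
One way the UI could reflect these two rules is sketched below. The states "proposed" and "live" come from the sections above; "closed" is a hypothetical stand-in for whatever the verdict stage produces, and treating both "proposed" and "live" as in-flight is an assumption. The engine remains the source of truth; this only drives what the page enables or disables.

```typescript
// "proposed" and "live" are named in this issue; "closed" is a placeholder.
type ExperimentState = "proposed" | "live" | "closed";

// Hypothetical per-profile view of the loop as reported by the API.
interface ProfileLoopState {
  paused: boolean;
  experiments: ExperimentState[];
}

// Assumption: an experiment is in flight while proposed or live.
function hasInFlightExperiment(profile: ProfileLoopState): boolean {
  return profile.experiments.some((s) => s === "proposed" || s === "live");
}

// New proposals are blocked by a pause or by an existing in-flight experiment.
function canAcceptNewProposal(profile: ProfileLoopState): boolean {
  return !profile.paused && !hasInFlightExperiment(profile);
}

// Pausing only flips the flag; a live experiment keeps its state
// and finishes its observation window.
function pauseProfile(profile: ProfileLoopState): ProfileLoopState {
  return { paused: true, experiments: profile.experiments.slice() };
}
```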

Phasing (UI-side, tracks agent phases)

  • P0 — observability page (consume metrics endpoints). Ship first, standalone value.
  • P1 — surface scenario-suite runs + results (manual "run eval" trigger).
  • P2 — proposal queue + approve/reject/edit + observation-window cards.
  • P3 — memory-hygiene / USER.md staleness views (separate metrics).
  • P4 — extend scorecards + queue to skills/plugins (reuses everything).

Out of scope (separate issue)

  • The engine, metrics store, eval runner, meta-agent, git ratchet, and /api/improve/* implementation → Interstellar-code/hermes-agent#133.

Reference

docs/self-improving-agent-proposal.md — §4 (what Switch UI already has: profile file read/write, analytics/logs/sessions APIs), §6c (/improve experiment lifecycle spec). Never-upstream fork differentiator ("Skill Health / Agent Improvement" page).
