
Eval harness for skill-discovery trigger reliability #167

@LadyBluenotes


What

Intent installs guidance into AGENTS.md and relies on the agent to discover and load skills before doing work. In practice, agents often skip that step unless the session is heavily prefaced with context about the library. There is currently no way to measure whether a given trigger mechanism actually fires when it should, so we are choosing mechanisms without evidence.

This issue covers a library-agnostic eval harness that scores competing discovery-trigger approaches against each other.

Scope

Eval harness (offline, deterministic, CI-able):

  • A blended score per mechanism that combines recall, precision, and a tunable cost weight.
  • Recall and precision are measured by matching each mechanism's trigger surface against a labeled corpus; cost comes from a static per-mechanism model.
  • Corpus schema: { prompt, needs_skill, expected_pkg }, spanning libraries with skills, libraries without skills (must stay silent), and a synthetic/unknown library (separates "the trigger fired" from "the model already knew the library").
  • No library names baked into the mechanism under test.
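As a rough illustration of the scoring described above, here is a minimal Python sketch. Only the corpus fields (`prompt`, `needs_skill`, `expected_pkg`) come from this issue. The `Mechanism` signature, the choice of F1 minus weighted cost as the blended score, the default `cost_weight`, the handling of wrong-package hits, and the library names `acme-data`, `zorblib`, and `plainlib` are all assumptions made up for the example, not a decided design.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class CorpusItem:
    prompt: str
    needs_skill: bool
    expected_pkg: Optional[str]  # None when no skill should load


# Assumed interface: a mechanism maps a prompt to the package whose skill it
# would load, or None if it stays silent.
Mechanism = Callable[[str], Optional[str]]


def score(mechanism: Mechanism, corpus: Sequence[CorpusItem],
          cost: float, cost_weight: float = 0.2) -> dict:
    """Score one mechanism; `cost` is its declared static cost in [0, 1]."""
    tp = fp = fn = 0
    for item in corpus:
        fired = mechanism(item.prompt)
        if fired is not None and item.needs_skill and fired == item.expected_pkg:
            tp += 1
            continue
        if fired is not None:
            fp += 1  # fired when it should be silent, or loaded the wrong package
        if item.needs_skill:
            fn += 1  # the needed skill was not loaded
    # Assumption: a mechanism that never fires has vacuous precision 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    # Assumed blend: F1 penalized by the weighted static cost.
    return {"precision": precision, "recall": recall,
            "blended": f1 - cost_weight * cost}


def index_mechanism(installed: Sequence[str]) -> Mechanism:
    """Toy trigger: fire if an installed skill's package name appears in the
    prompt. The package list is data passed in, not baked into the mechanism."""
    def trigger(prompt: str) -> Optional[str]:
        lowered = prompt.lower()
        return next((pkg for pkg in installed if pkg in lowered), None)
    return trigger


# Hypothetical corpus: a library with a skill, a prompt needing no skill,
# a library without a skill (must stay silent), and a synthetic library.
CORPUS = [
    CorpusItem("Add a cached query using acme-data", True, "acme-data"),
    CorpusItem("Rename this local variable", False, None),
    CorpusItem("Format dates with plainlib", False, None),
    CorpusItem("Set up routing with zorblib", True, "zorblib"),
]

indexed = score(index_mechanism(["acme-data", "zorblib"]), CORPUS, cost=0.1)
control = score(lambda prompt: None, CORPUS, cost=0.0)  # no-trigger control
```

Because everything here is a pure function over a fixed corpus, the run is offline and deterministic. Counting a wrong-package hit as both a false positive and a miss is one possible convention; the harness may settle on another.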

Candidates scored: current AGENTS.md guidance (baseline), a no-trigger control, a single universal discovery skill, per-package skills, a hybrid, an injected context block, and tool-based discovery. Approaches that only manifest at runtime carry a declared static cost and are flagged for a later live-measurement pass.
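One way the candidate list could be declared is sketched below in Python. The seven candidate names follow the list above. The `Candidate` type, its field names, the cost values (placeholders, not measurements), and the guess about which candidates count as runtime-only are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    static_cost: float          # declared cost in [0, 1]; placeholder values below
    runtime_only: bool = False  # True if the trigger only manifests in a live session


# Placeholder costs, and a guess at which approaches are runtime-only.
CANDIDATES = [
    Candidate("agents-md-baseline", 0.1),
    Candidate("no-trigger-control", 0.0),
    Candidate("universal-discovery-skill", 0.2),
    Candidate("per-package-skills", 0.3),
    Candidate("hybrid", 0.3),
    Candidate("injected-context-block", 0.4, runtime_only=True),
    Candidate("tool-based-discovery", 0.4, runtime_only=True),
]


def flagged_for_live_pass(candidates: Sequence[Candidate]) -> List[str]:
    """Names of candidates whose static cost should be re-checked in a live run."""
    return [c.name for c in candidates if c.runtime_only]


live_pass = flagged_for_live_pass(CANDIDATES)
```

Keeping the flag on the candidate record lets the offline run still score every approach while recording which numbers are declared rather than measured.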

Out of scope

  • Live agent-in-the-loop cost measurement (follow-up).
  • Building any specific installer.
  • Picking a winning mechanism. That is the harness's output, not something pre-decided here.

Where this comes from

This comes from reports that Intent skills don't reliably load into context unless the agent is told about the library explicitly. We want the trigger-mechanism choice to be driven by measured reliability rather than assumption.
