Eval harness for skill-discovery trigger reliability #167

## What

Intent installs guidance into `AGENTS.md` and relies on the agent to discover and load skills before doing work. In practice agents often skip that step unless the session is heavily prefaced with context about the library. There's currently no way to measure whether a given trigger mechanism actually fires when it should, so we're choosing mechanisms without evidence.

This issue covers a library-agnostic eval harness that scores competing discovery-trigger approaches against each other.

## Scope

**Eval harness (offline, deterministic, CI-able):**
- One blended score per mechanism, combining recall, precision, and a tunable cost weight.
- `recall` and `precision` are measured by matching each mechanism's trigger surface against a labeled corpus; `cost` comes from a static per-mechanism model.
- Corpus schema: `{ prompt, needs_skill, expected_pkg }`, spanning libraries with skills, libraries **without** skills (must stay silent), and a synthetic/unknown library (separates "the trigger fired" from "the model already knew the library").
- No library names baked into the mechanism under test.
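
A minimal sketch of how the corpus schema and scoring above could fit together, assuming Python for the harness. The field names come from the schema in this issue; everything else is hypothetical: the package names (`acme-router`, `plainlib`, `zyxq`), the keyword mechanism, and the blend formula (F1 minus a weighted cost), which this issue does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Sequence


@dataclass(frozen=True)
class CorpusEntry:
    prompt: str
    needs_skill: bool
    expected_pkg: Optional[str]  # None when the mechanism must stay silent


# A mechanism maps a prompt to the package whose skill it would load, or None.
Mechanism = Callable[[str], Optional[str]]


@dataclass(frozen=True)
class Score:
    recall: float
    precision: float
    blended: float


def score_mechanism(mechanism: Mechanism, corpus: Sequence[CorpusEntry],
                    cost: float, cost_weight: float = 0.2) -> Score:
    """Score one mechanism offline against the labeled corpus."""
    needed = fired = true_pos = 0
    for entry in corpus:
        result = mechanism(entry.prompt)
        if entry.needs_skill:
            needed += 1
        if result is not None:
            fired += 1
            if entry.needs_skill and result == entry.expected_pkg:
                true_pos += 1
    recall = true_pos / needed if needed else 1.0
    precision = true_pos / fired if fired else 1.0
    # Assumed blend: harmonic mean of recall and precision, minus weighted cost.
    total = recall + precision
    f1 = 2 * recall * precision / total if total else 0.0
    return Score(recall, precision, f1 - cost_weight * cost)


def make_keyword_mechanism(skill_index: Mapping[str, Sequence[str]]) -> Mechanism:
    """Hypothetical mechanism: the index is passed in, so no names are baked in."""
    def fire(prompt: str) -> Optional[str]:
        text = prompt.lower()
        for pkg, keywords in skill_index.items():
            if any(keyword in text for keyword in keywords):
                return pkg
        return None
    return fire


# Illustrative corpus: a library with skills, one without, and a synthetic one.
CORPUS = [
    CorpusEntry("Add a nested route with acme-router", True, "acme-router"),
    CorpusEntry("Set up guards for the dashboard navigation", True, "acme-router"),
    CorpusEntry("Format a date with plainlib", False, None),
    CorpusEntry("Configure zyxq caching for the list view", True, "zyxq"),
]

mechanism = make_keyword_mechanism({"acme-router": ["acme-router"], "zyxq": ["zyxq"]})
score = score_mechanism(mechanism, CORPUS, cost=0.5)
print(score)
```

The second prompt never names the library, so this keyword mechanism misses it and recall drops while precision stays high. That is the kind of gap the corpus is meant to expose.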

**Candidates scored:** a range of approaches — current `AGENTS.md` guidance (baseline), a no-trigger control, a single universal discovery skill, per-package skills, a hybrid, an injected context block, and tool-based discovery. Approaches that only manifest at runtime carry a declared static cost and are flagged for a later live-measurement pass.
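
One way the candidate list might be declared, as a sketch only. The candidate names mirror the list above, but the identifiers, the `static_cost` numbers, and which candidates are marked `runtime_only` are placeholder assumptions for illustration, not measured or decided values.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    static_cost: float   # declared up front, not measured (placeholder values)
    runtime_only: bool   # True if the trigger only manifests in a live session


# Placeholder costs and flags; real values would come from the static cost model.
CANDIDATES = [
    Candidate("agents-md-baseline", 0.2, False),
    Candidate("no-trigger-control", 0.0, False),
    Candidate("universal-discovery-skill", 0.3, False),
    Candidate("per-package-skills", 0.5, False),
    Candidate("hybrid", 0.4, False),
    Candidate("injected-context-block", 0.6, True),   # assumed runtime-only
    Candidate("tool-based-discovery", 0.3, True),     # assumed runtime-only
]


def live_pass_queue(candidates: Sequence[Candidate]) -> List[str]:
    """Names of candidates flagged for the later live-measurement pass."""
    return [c.name for c in candidates if c.runtime_only]


print(live_pass_queue(CANDIDATES))
```

Keeping the flag on the candidate record lets the offline run still score every approach while making it explicit which numbers rest on a declared cost.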

## Out of scope

- Live agent-in-the-loop cost measurement (follow-up).
- Building any specific installer.
- Picking a winning mechanism — that's the harness's output, not pre-decided here.

## Where this comes from

This comes from reports that Intent skills don't reliably load into context unless the agent is told about the library explicitly. We want the trigger-mechanism choice to be driven by measured reliability rather than assumption.
