dbroeglin · Jun 28, 2026
diff --git a/.apm/instructions/development.instructions.md b/.apm/instructions/development.instructions.md
@@ -16,18 +16,17 @@ experiment repo — experiment-authoring context is a template under
   formatting, and CI/pre-commit enforce it.
 - Maintain good test coverage for every behavior change with focused offline tests, not just broad
   smoke coverage.
-- Keep tests offline: exercise the runner with `MockInvoker` (and a `solver` for the success
-  path) plus a temp `--root`. Never invoke the real `copilot` binary or the network in tests.
-- Preserve invariants: filesystem is source of truth (`reindex` rebuilds `index.db`); secrets are
-  redacted on disk (`Variant.stored()` / `ProviderConfig.redacted()`); `--dry-run` is ephemeral —
-  it runs in a temp dir, validates each stage, and persists nothing (`dry_run_experiment`).
+- Keep tests offline: use Pier config/job-output fixtures and mocks plus a temp `--root`. Never
+  invoke the real `copilot` binary, Docker, or the network in tests.
+- Preserve invariants: `jobs/<job>/<run-id>/` is the filesystem source of truth; summaries are
+  derived; secrets are injected at run time and redacted from persisted configs.
 
 ## When changing public behavior
 - Update `docs/` (architecture, authoring, results-format, BYOK) and `README.md`.
 - Mirror experiment-authoring changes in the `templates/experiment_repo/` assets.
 - Bump `__version__` in `src/copilot_experiments/__init__.py` and `version` in `pyproject.toml`.
 
 ## Module responsibilities
-`models` (schemas) · `invoker` (build/run copilot) · `workspace` (provision + diff) ·
-`sessionlog` (parse events → metrics) · `runner` (orchestrate) · `storage` (layout) ·
-`index` (sqlite) · `report` (summaries) · `scaffold` (init) · `cli` (Typer).
+`models` (analysis/economics schemas) · `pier_backend` (Pier config/run integration) ·
+`pier_results` (job/run/agent/task summaries) · `sessionlog` (parse events → metrics) ·
+`storage` (Pier jobs layout) · `report` (summaries) · `scaffold` (init) · `cli` (Typer).
diff --git a/.apm/prompts/library-change.prompt.md b/.apm/prompts/library-change.prompt.md
@@ -9,9 +9,9 @@ Make a change to the `copilot_experiments` package (the harness, not an experime
 Steps:
 1. Identify the right module (see `AGENTS.md` repository map and the
    `developing-the-library` skill).
-2. Implement the change, keeping the architecture invariants intact (filesystem is source of
-   truth; secrets redacted on disk; tests/dry-runs stay offline).
-3. Add or update tests in `tests/` using `MockInvoker` and a temp `--root`.
+2. Implement the change, keeping the architecture invariants intact (`jobs/<job>/<run-id>/` is the
+   filesystem source of truth; secrets are redacted on disk; tests stay offline).
+3. Add or update tests in `tests/` using fixtures/mocks and a temp `--root`.
 4. Run `uv run ruff check --fix .`, `uv run ruff format .`, `uv run ruff check .`, and
    `uv run pytest -q`; fix until all are green.
 5. Update `docs/`, `README.md`, and the `templates/experiment_repo/` template if public

diff --git a/.apm/skills/developing-the-library/SKILL.md b/.apm/skills/developing-the-library/SKILL.md
@@ -2,39 +2,38 @@
 name: developing-the-library
 description: >-
   Use when modifying the copilot-experiments library or CLI itself — adding or
-  changing modules (models, invoker, runner, sessionlog, storage, index, report,
-  scaffold, cli), writing tests, or updating the scaffolded experiment-repo
-  template. Not for authoring experiments.
+  changing modules (models, pier_backend, pier_results, sessionlog, storage,
+  report, scaffold, cli), writing tests, or updating the scaffolded
+  experiment-repo template. Not for authoring experiments.
 ---
 
 # Developing the copilot-experiments library
 
 ## Mental model
-A **run** executes an `Experiment` (a `Task` + a list of `Variant`s). For each variant, for each
-trial, the runner: provisions a workspace → invokes Copilot → copies & parses the session log →
-captures a workspace diff → runs `verify` → writes artifacts → updates the SQLite index.
+A **run** executes a Pier `JobConfig`. For each agent/task/attempt trial, Pier provisions the
+environment, invokes the installed agent, runs the verifier, and downloads logs/artifacts.
+`copilot-experiments` contributes the `copilot-cli` Pier agent and derives summaries/analysis from
+the resulting `jobs/<job>/<run-id>/` tree.
 
 ```
-Experiment ─┬─ Task (prompt, fixture/repo, setup, verify)
-            └─ Variant[] (model, effort, agent, mode, provider/BYOK, env, trials)
-run_experiment() → results/<exp>/<run-id>/ + results/index.db
+Pier JobConfig ─┬─ tasks/datasets
+                └─ agents[] (copilot-cli model, effort, mode, kwargs)
+copilot-experiments run → jobs/<job-name>/<run-id>/
 ```
 
 ## Where to make a change
-- New experiment-definition field → `models.py` (+ thread through `invoker.build_args`/`build_env`
-  if it affects the command, + `index.py` columns if you want it queryable).
+- New Pier config/run behavior → `pier_backend.py`.
 - New CLI command/flag → `cli.py` (Typer). `B008` is ignored project-wide for Typer defaults.
-- New metric → `sessionlog.parse_metrics` (+ `Metrics` in `models.py`, + `index.py`, + `report.py`).
-- New result artifact → write it in `runner._run_trial`, document it in `storage.py`'s docstring
-  and `docs/results-format.md`.
+- New metric → `sessionlog.parse_metrics` (+ `Metrics` in `models.py`, + `pier_results.py` /
+  `report.py` if summaries should expose it).
+- New result artifact → emit or collect it through the Pier agent/backend, then document it in
+  `docs/results-format.md`.
 - Experiment-authoring change → edit `templates/experiment_repo/` (it is package data).
 
 ## Testing recipe
 - Unit-test pure functions directly (models, sessionlog, storage, scaffold).
-- For the runner, call `run_experiment(exp, root=tmp, invoker=MockInvoker())` for a persisted
-  mock path, `run_experiment(exp, root=tmp, invoker=MockInvoker(solver=...))` for a success
-  path, and `dry_run_experiment(exp, root=tmp)` to exercise the ephemeral validating dry-run
-  (returns a `DryRunReport`, persists nothing).
+- Use Pier config and job-output fixtures for CLI/storage/result tests; mock backend/auth preflights
+  instead of invoking Docker or Copilot.
 - Build synthetic `events.jsonl` dicts to test `parse_metrics` without any Copilot run.
 - Add or update focused offline tests for each behavior change. Good coverage is expected,
   especially around Pier config loading, result adaptation, CLI behavior, and session parsing.
@@ -47,5 +46,5 @@ uv run ruff check .
 uv run pytest -q
 # optional end-to-end smoke test:
 uv run copilot-experiments init sandbox/demo --force
-uv run copilot-experiments run --root sandbox/demo --dry-run
+uv run copilot-experiments validate --root sandbox/demo
 ```
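
Reviewer sketch (not part of the patch): the updated testing recipe asks for job-output fixtures under a temp `--root`, so a minimal offline fixture might look like the following. Everything here is an assumption made for illustration: `make_job_fixture`, `list_runs`, and the `result.json` file name are hypothetical and are not taken from `copilot_experiments` or Pier. Only the `jobs/<job>/<run-id>/` layout comes from the diff above.

```python
import json
import tempfile
from pathlib import Path


def make_job_fixture(root: Path, job: str, run_id: str, payload: dict) -> Path:
    """Write a minimal jobs/<job>/<run-id>/ tree under root (hypothetical helper)."""
    run_dir = root / "jobs" / job / run_id
    run_dir.mkdir(parents=True)
    # "result.json" is a placeholder name, not the real Pier output file.
    (run_dir / "result.json").write_text(json.dumps(payload), encoding="utf-8")
    return run_dir


def list_runs(root: Path) -> list[tuple[str, str]]:
    """Return sorted (job, run_id) pairs found on disk (hypothetical helper).

    The directory tree is the only input, mirroring the invariant that
    jobs/<job>/<run-id>/ is the filesystem source of truth.
    """
    jobs_dir = root / "jobs"
    return sorted(
        (run.parent.name, run.name)
        for run in jobs_dir.glob("*/*")
        if run.is_dir()
    )


def demo() -> list[tuple[str, str]]:
    """Build two fixture runs in a temp root and discover them, fully offline."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        make_job_fixture(root, "demo-job", "run-001", {"passed": True})
        make_job_fixture(root, "demo-job", "run-002", {"passed": False})
        return list_runs(root)
```

In a real test the temp directory would more likely come from pytest's `tmp_path` fixture, and the assertions would target the package's own summary code rather than `list_runs`.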