Merged
15 changes: 7 additions & 8 deletions .apm/instructions/development.instructions.md
@@ -16,18 +16,17 @@ experiment repo — experiment-authoring context is a template under
   formatting, and CI/pre-commit enforce it.
 - Maintain good test coverage for every behavior change with focused offline tests, not just broad
   smoke coverage.
-- Keep tests offline: exercise the runner with `MockInvoker` (and a `solver` for the success
-  path) plus a temp `--root`. Never invoke the real `copilot` binary or the network in tests.
-- Preserve invariants: filesystem is source of truth (`reindex` rebuilds `index.db`); secrets are
-  redacted on disk (`Variant.stored()` / `ProviderConfig.redacted()`); `--dry-run` is ephemeral —
-  it runs in a temp dir, validates each stage, and persists nothing (`dry_run_experiment`).
+- Keep tests offline: use Pier config/job-output fixtures and mocks plus a temp `--root`. Never
+  invoke the real `copilot` binary, Docker, or the network in tests.
+- Preserve invariants: `jobs/<job>/<run-id>/` is the filesystem source of truth; summaries are
+  derived; secrets are injected at run time and redacted from persisted configs.
 
 ## When changing public behavior
 - Update `docs/` (architecture, authoring, results-format, BYOK) and `README.md`.
 - Mirror experiment-authoring changes in the `templates/experiment_repo/` assets.
 - Bump `__version__` in `src/copilot_experiments/__init__.py` and `version` in `pyproject.toml`.
 
 ## Module responsibilities
-`models` (schemas) · `invoker` (build/run copilot) · `workspace` (provision + diff) ·
-`sessionlog` (parse events → metrics) · `runner` (orchestrate) · `storage` (layout) ·
-`index` (sqlite) · `report` (summaries) · `scaffold` (init) · `cli` (Typer).
+`models` (analysis/economics schemas) · `pier_backend` (Pier config/run integration) ·
+`pier_results` (job/run/agent/task summaries) · `sessionlog` (parse events → metrics) ·
+`storage` (Pier jobs layout) · `report` (summaries) · `scaffold` (init) · `cli` (Typer).
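
Reviewer note: the secrets invariant in this file (injected at run time, redacted from persisted configs) could be illustrated roughly as follows. This is only a sketch under stated assumptions: `redact`, `persist_config`, `SECRET_MARKERS`, and the `config.json` file name are hypothetical and are not taken from the `copilot_experiments` package, whose actual redaction helpers are not shown in this diff.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Hypothetical key fragments that mark a value as secret (an assumption, not the library's list).
SECRET_MARKERS = ("token", "secret", "api_key", "password")
REDACTED = "***REDACTED***"


def redact(value):
    """Return a copy of a config value with secret-looking entries masked."""
    if isinstance(value, dict):
        return {
            key: REDACTED
            if any(marker in key.lower() for marker in SECRET_MARKERS)
            else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


def persist_config(config: dict, run_dir: Path) -> Path:
    """Write only the redacted form to disk; the live config stays in memory."""
    run_dir.mkdir(parents=True, exist_ok=True)
    path = run_dir / "config.json"  # hypothetical file name
    path.write_text(json.dumps(redact(config), indent=2))
    return path


def demo() -> dict:
    """Persist a config under a temp root and read back what landed on disk."""
    live = {"agents": [{"name": "copilot-cli", "env": {"GITHUB_TOKEN": "ghp_example"}}]}
    with TemporaryDirectory() as root:
        path = persist_config(live, Path(root) / "jobs" / "demo" / "run-001")
        return json.loads(path.read_text())
```

Matching on key names is a deliberately blunt choice for the sketch; a real implementation might instead track which values were injected at run time and mask exactly those.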
6 changes: 3 additions & 3 deletions .apm/prompts/library-change.prompt.md
@@ -9,9 +9,9 @@ Make a change to the `copilot_experiments` package (the harness, not an experime
 Steps:
 1. Identify the right module (see `AGENTS.md` repository map and the
    `developing-the-library` skill).
-2. Implement the change, keeping the architecture invariants intact (filesystem is source of
-   truth; secrets redacted on disk; tests/dry-runs stay offline).
-3. Add or update tests in `tests/` using `MockInvoker` and a temp `--root`.
+2. Implement the change, keeping the architecture invariants intact (`jobs/<job>/<run-id>/` is the
+   filesystem source of truth; secrets are redacted on disk; tests stay offline).
+3. Add or update tests in `tests/` using fixtures/mocks and a temp `--root`.
 4. Run `uv run ruff check --fix .`, `uv run ruff format .`, `uv run ruff check .`, and
    `uv run pytest -q`; fix until all are green.
 5. Update `docs/`, `README.md`, and the `templates/experiment_repo/` template if public
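
Reviewer note: step 3 (fixtures/mocks plus a temp `--root`) and the "filesystem source of truth" invariant might look something like the sketch below. Everything here is an assumption made for illustration: the `trial-N.json` file names, the `passed` field, and the `write_run_fixture` / `summarize_run` helpers are hypothetical and do not reflect the real Pier job-output layout or the `pier_results` API.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory


def write_run_fixture(root: Path, job: str, run_id: str, trials: list[dict]) -> Path:
    """Create a fake jobs/<job>/<run-id>/ tree under a temp root (hypothetical layout)."""
    run_dir = root / "jobs" / job / run_id
    run_dir.mkdir(parents=True)
    for i, trial in enumerate(trials):
        (run_dir / f"trial-{i}.json").write_text(json.dumps(trial))
    return run_dir


def summarize_run(run_dir: Path) -> dict:
    """Derive a summary purely from the files on disk, with no cached state."""
    trials = [json.loads(p.read_text()) for p in sorted(run_dir.glob("trial-*.json"))]
    passed = sum(1 for t in trials if t.get("passed"))
    return {"trials": len(trials), "passed": passed}


def demo_summary() -> dict:
    """Build the fixture offline and summarize it; no Docker, network, or Copilot involved."""
    with TemporaryDirectory() as tmp:
        run_dir = write_run_fixture(
            Path(tmp), "demo", "run-001", [{"passed": True}, {"passed": False}]
        )
        return summarize_run(run_dir)
```

In a pytest suite the `TemporaryDirectory` block would normally be replaced by the built-in `tmp_path` fixture; it is spelled out here only so the sketch runs on its own.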
37 changes: 18 additions & 19 deletions .apm/skills/developing-the-library/SKILL.md
@@ -2,39 +2,38 @@
 name: developing-the-library
 description: >-
   Use when modifying the copilot-experiments library or CLI itself — adding or
-  changing modules (models, invoker, runner, sessionlog, storage, index, report,
-  scaffold, cli), writing tests, or updating the scaffolded experiment-repo
-  template. Not for authoring experiments.
+  changing modules (models, pier_backend, pier_results, sessionlog, storage,
+  report, scaffold, cli), writing tests, or updating the scaffolded
+  experiment-repo template. Not for authoring experiments.
 ---
 
 # Developing the copilot-experiments library
 
 ## Mental model
-A **run** executes an `Experiment` (a `Task` + a list of `Variant`s). For each variant, for each
-trial, the runner: provisions a workspace → invokes Copilot → copies & parses the session log →
-captures a workspace diff → runs `verify` → writes artifacts → updates the SQLite index.
+A **run** executes a Pier `JobConfig`. For each agent/task/attempt trial, Pier provisions the
+environment, invokes the installed agent, runs the verifier, and downloads logs/artifacts.
+`copilot-experiments` contributes the `copilot-cli` Pier agent and derives summaries/analysis from
+the resulting `jobs/<job>/<run-id>/` tree.
 
 ```
-Experiment ─┬─ Task (prompt, fixture/repo, setup, verify)
-            └─ Variant[] (model, effort, agent, mode, provider/BYOK, env, trials)
-run_experiment() → results/<exp>/<run-id>/ + results/index.db
+Pier JobConfig ─┬─ tasks/datasets
+                └─ agents[] (copilot-cli model, effort, mode, kwargs)
+copilot-experiments run → jobs/<job-name>/<run-id>/
 ```
 
 ## Where to make a change
-- New experiment-definition field → `models.py` (+ thread through `invoker.build_args`/`build_env`
-  if it affects the command, + `index.py` columns if you want it queryable).
+- New Pier config/run behavior → `pier_backend.py`.
 - New CLI command/flag → `cli.py` (Typer). `B008` is ignored project-wide for Typer defaults.
-- New metric → `sessionlog.parse_metrics` (+ `Metrics` in `models.py`, + `index.py`, + `report.py`).
-- New result artifact → write it in `runner._run_trial`, document it in `storage.py`'s docstring
-  and `docs/results-format.md`.
+- New metric → `sessionlog.parse_metrics` (+ `Metrics` in `models.py`, + `pier_results.py` /
+  `report.py` if summaries should expose it).
+- New result artifact → emit or collect it through the Pier agent/backend, then document it in
+  `docs/results-format.md`.
 - Experiment-authoring change → edit `templates/experiment_repo/` (it is package data).
 
 ## Testing recipe
 - Unit-test pure functions directly (models, sessionlog, storage, scaffold).
-- For the runner, call `run_experiment(exp, root=tmp, invoker=MockInvoker())` for a persisted
-  mock path, `run_experiment(exp, root=tmp, invoker=MockInvoker(solver=...))` for a success
-  path, and `dry_run_experiment(exp, root=tmp)` to exercise the ephemeral validating dry-run
-  (returns a `DryRunReport`, persists nothing).
+- Use Pier config and job-output fixtures for CLI/storage/result tests; mock backend/auth preflights
+  instead of invoking Docker or Copilot.
 - Build synthetic `events.jsonl` dicts to test `parse_metrics` without any Copilot run.
 - Add or update focused offline tests for each behavior change. Good coverage is expected,
   especially around Pier config loading, result adaptation, CLI behavior, and session parsing.
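
Reviewer note: the "synthetic `events.jsonl` dicts" item in the testing recipe could be exercised along these lines. The real `sessionlog.parse_metrics` signature and the Copilot event schema are not shown in this diff, so `parse_metrics_sketch`, the event `type` values, and the token field names below are hypothetical stand-ins chosen only to show the shape of such a test.

```python
import json


def load_events(jsonl_text: str) -> list[dict]:
    """Parse JSON Lines text into a list of event dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def parse_metrics_sketch(events: list[dict]) -> dict:
    """Hypothetical stand-in for sessionlog.parse_metrics (event schema is assumed)."""
    metrics = {"tool_calls": 0, "input_tokens": 0, "output_tokens": 0}
    for event in events:
        if event.get("type") == "tool_call":
            metrics["tool_calls"] += 1
        elif event.get("type") == "usage":
            metrics["input_tokens"] += event.get("input_tokens", 0)
            metrics["output_tokens"] += event.get("output_tokens", 0)
    return metrics


# Synthetic events built in memory: no Copilot run, no network, no files.
SYNTHETIC_EVENTS = [
    {"type": "tool_call", "name": "shell"},
    {"type": "usage", "input_tokens": 120, "output_tokens": 30},
    {"type": "tool_call", "name": "edit"},
    {"type": "usage", "input_tokens": 80, "output_tokens": 10},
]

# The same events serialized as they might appear in an events.jsonl file.
SYNTHETIC_JSONL = "\n".join(json.dumps(e) for e in SYNTHETIC_EVENTS) + "\n"
```

Building the events as plain dicts keeps each test focused on one parsing rule, and round-tripping them through JSON Lines text covers the file-reading path without touching a real session log.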
@@ -47,5 +46,5 @@ uv run ruff check .
 uv run pytest -q
 # optional end-to-end smoke test:
 uv run copilot-experiments init sandbox/demo --force
-uv run copilot-experiments run --root sandbox/demo --dry-run
+uv run copilot-experiments validate --root sandbox/demo
 ```