AGENTS.md — copilot-experiments (the library + CLI)

Hand-authored agent context for developing this tool. The .apm/ directory holds APM primitives (instructions, the developing-the-library skill, prompts) that Copilot surfaces directly; this file is intentionally not generated by apm compile (that would overwrite this richer content and merge in unrelated experiment-repo context).

This repository is the tool: the copilot_experiments Python package (a library API) plus a Typer CLI. It is developed with uv.

The tool's job is to drive GitHub Copilot CLI across a parameter matrix and collect results. It also scaffolds separate standalone experiment repositories via copilot-experiments init. Do not confuse the two: this repo develops the harness; a scaffolded repo authors experiments. The experiment-repo agent context is a template under src/copilot_experiments/templates/experiment_repo/ — edit it there, never in a generated repo.

Repository map

src/copilot_experiments/ — the package.
- models.py — pydantic models: Experiment, Task, Variant, ProviderConfig, and the result objects (ExperimentRun → VariantResult → TaskResult → TrialResult, Metrics). An experiment is Tasks × Variants × Trials: task= is single-task sugar, tasks=[...] a suite (Experiment.iter_tasks() normalises both to (task_slug, Task) pairs).
- invoker.py — builds and runs the copilot command. CopilotInvoker shells out to the real CLI; MockInvoker simulates a run (used by dry-runs and the test suite).
- workspace.py — provisions an isolated per-trial workspace (copy a fixture or git clone), commits a git baseline, and captures a diff of Copilot's changes.
- sessionlog.py — locates and parses Copilot's events.jsonl into Metrics; extract_economics() pulls token-type counts + AIU cost from session.shutdown.
- pricing.py — AIU↔token cost math: documented per-token-type costPerBatch defaults, live-rate reading from session.compaction_complete, and the per-type AIU decomposition.
- analysis.py — derives a rendering-agnostic SessionAnalysis (tools, turns, tokens, economics) from session events.
- render.py — renders a SessionAnalysis to the terminal with Rich (backs the analyze command).
- runner.py — orchestration: variants × tasks × trials → result artifacts + index. Also dry_run_experiment() (ephemeral, validating plumbing check that persists nothing).
- storage.py — the results/ filesystem Layout and run discovery.
- index.py — the SQLite index (results/index.db) derived from the filesystem.
- report.py — aggregation, summary.json, and summary.md.
- scaffold.py — init logic: render templates/experiment_repo/ into a new repo.
- cli.py — the Typer app (init, run, list, show, analyze, inspect, reindex).
- templates/experiment_repo/ — package-data template for scaffolded experiment repos.
examples/tracer_bullet/ — a committed, runnable multi-turn example experiment (textstats).
examples/task_suite/ — a committed multi-task example (strtools + csvtools) exercising the task axis and its mean-success / resolved@k suite-coverage metrics.
sandbox/ — local scratch space for exercising the lib/CLI (its results/ are gitignored).
tests/ — pytest suite (uses MockInvoker; never requires a real copilot or network).
docs/ — architecture, authoring guide, analysis (analyze), results-format reference, BYOK/local-models guide, and docs/adr/ (architecture decision records).

Architecture invariants (keep these true)

The filesystem is the source of truth. results/index.db is a derived, rebuildable cache — reindex must always be able to recreate it by scanning results/.
Secrets never hit disk. Store variants via Variant.stored() / ProviderConfig.redacted(), which mask api_key / bearer_token.
Tests stay offline. Anything exercising the runner uses MockInvoker (optionally with a solver callback) and a temp --root. Do not call the real Copilot CLI in tests.
Dry-runs are ephemeral. dry_run_experiment() runs the whole pipeline with MockInvoker in a tempfile.mkdtemp() dir (results + session state both redirected there), validates each stage produced its artifact, then deletes the temp dir. Nothing is persisted under the repo; run_experiment() has no dry_run parameter.
Templates are data. Files under templates/experiment_repo/ are shipped as package data; .tmpl files are rendered with {{placeholder}} substitution and lose the suffix.
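To make the "templates are data" invariant concrete, here is a minimal sketch of {{placeholder}} substitution over a template tree, with .tmpl files rendered and renamed and all other files copied verbatim. The names render_text() and render_tree() are assumptions made for illustration; scaffold.py may be organised quite differently.

```python
import re
from pathlib import Path

# Matches {{name}} where name is a plain identifier (an assumption about the syntax).
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_text(text: str, values: dict[str, str]) -> str:
    """Replace each {{name}} with values[name]; an unknown name raises KeyError."""
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], text)


def render_tree(src: Path, dest: Path, values: dict[str, str]) -> list[Path]:
    """Copy src into dest, rendering *.tmpl files and dropping their suffix."""
    written: list[Path] = []
    for path in sorted(src.rglob("*")):
        if path.is_dir():
            continue
        target = dest / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        if path.suffix == ".tmpl":
            # README.md.tmpl becomes README.md: only the trailing .tmpl is removed.
            target = target.with_suffix("")
            rendered = render_text(path.read_text(encoding="utf-8"), values)
            target.write_text(rendered, encoding="utf-8")
        else:
            # Non-template files are shipped byte for byte.
            target.write_bytes(path.read_bytes())
        written.append(target)
    return written
```

Failing loudly on an unknown placeholder is a design choice of this sketch, not a documented behaviour of the tool.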

How Copilot CLI is driven (reference)

Non-interactive: copilot -p "<prompt>" --allow-all-tools.
--model / --effort / --agent / --mode, --output-format json, --session-id <uuid>, --log-dir, -C <dir>.
Session logs land at ~/.copilot/session-state/<session-id>/events.jsonl — the metrics source.
BYOK is env-driven (COPILOT_PROVIDER_*); a variant is just flags + env.

Workflow

uv sync                              # create/refresh the venv
uv run ruff check --fix .            # autofix lint issues
uv run ruff format .                 # format Python code
uv run ruff check .                  # verify lint
uv run pytest -q                     # test
# End-to-end smoke test against the sandbox:
uv run copilot-experiments init sandbox/demo --force
uv run copilot-experiments run --root sandbox/demo --dry-run
uv run copilot-experiments show --last --root sandbox/demo

For local CLI testing, point --root at a scaffolded dir in sandbox/ rather than running uv sync inside that dir (its pyproject depends on this package via a git URL that won't resolve offline).

Conventions

Python 3.12+, line length 100, ruff-clean and ruff-formatted (rule sets E, F, I, UP, B, W; B008 is ignored for Typer).
Treat perfectly linted/formatted code as non-negotiable: run ruff check --fix, ruff format, then ruff check before handoff. The repo has pre-commit hooks and CI for this, but agents should still run the commands locally.
Maintain good test coverage for every behavior change. Add focused offline tests for new Pier config/result adapters, CLI behavior, session parsing, and migration paths; avoid relying only on broad smoke tests.
Prefer small, well-typed functions; keep modules single-purpose (see the map above).
Update docs/ and the experiment-repo template when you change public behavior.
Record significant/architectural decisions as an ADR under docs/adr/ (copy 0000-template.md).
Commits include the Co-authored-by: Copilot trailer.
