Hand-authored agent context for developing this tool. The `.apm/` directory holds APM primitives (instructions, the `developing-the-library` skill, prompts) that Copilot surfaces directly; this file is intentionally not generated by `apm compile` (that would overwrite this richer content and merge in unrelated experiment-repo context).
This repository is the tool: the `copilot_experiments` Python package (a library API)
plus a Typer CLI. It is developed with `uv`.
The tool's job is to drive GitHub Copilot CLI across a parameter matrix and collect results. It also scaffolds separate standalone experiment repositories via `copilot-experiments init`. Do not confuse the two: this repo develops the harness; a scaffolded repo authors experiments. The experiment-repo agent context is a template under `src/copilot_experiments/templates/experiment_repo/`; edit it there, never in a generated repo.
- `src/copilot_experiments/`: the package.
  - `models.py`: pydantic models `Experiment`, `Task`, `Variant`, `ProviderConfig`, and the result objects (`ExperimentRun` → `VariantResult` → `TaskResult` → `TrialResult`, `Metrics`). An experiment is Tasks × Variants × Trials: `task=` is single-task sugar, `tasks=[...]` a suite (`Experiment.iter_tasks()` normalises both to `(task_slug, Task)` pairs).
  - `invoker.py`: builds and runs the `copilot` command. `CopilotInvoker` shells out to the real CLI; `MockInvoker` simulates a run (used by dry-runs and the test suite).
  - `workspace.py`: provisions an isolated per-trial workspace (copy a fixture or `git clone`), commits a git baseline, and captures a diff of Copilot's changes.
  - `sessionlog.py`: locates and parses Copilot's `events.jsonl` into `Metrics`; `extract_economics()` pulls token-type counts + AIU cost from `session.shutdown`.
  - `pricing.py`: AIU↔token cost math, covering documented per-token-type `costPerBatch` defaults, live-rate reading from `session.compaction_complete`, and the per-type AIU decomposition.
  - `analysis.py`: derives a rendering-agnostic `SessionAnalysis` (tools, turns, tokens, economics) from session events.
  - `render.py`: renders a `SessionAnalysis` to the terminal with Rich (backs the `analyze` command).
  - `runner.py`: orchestration (variants × tasks × trials → result artifacts + index). Also `dry_run_experiment()` (ephemeral, validating plumbing check that persists nothing).
  - `storage.py`: the `results/` filesystem `Layout` and run discovery.
  - `index.py`: the SQLite index (`results/index.db`) derived from the filesystem.
  - `report.py`: aggregation, `summary.json`, and `summary.md`.
  - `scaffold.py`: `init` logic, which renders `templates/experiment_repo/` into a new repo.
  - `cli.py`: the Typer app (`init`, `run`, `list`, `show`, `analyze`, `inspect`, `reindex`).
  - `templates/experiment_repo/`: package-data template for scaffolded experiment repos.
- `examples/tracer_bullet/`: a committed, runnable multi-turn example experiment (textstats).
- `examples/task_suite/`: a committed multi-task example (strtools + csvtools) exercising the task axis and its mean-success / resolved@k suite-coverage metrics.
- `sandbox/`: local scratch space for exercising the lib/CLI (its `results/` are gitignored).
- `tests/`: pytest suite (uses `MockInvoker`; never requires a real `copilot` or network).
- `docs/`: architecture, authoring guide, analysis (`analyze`), results-format reference, BYOK/local-models guide, and `docs/adr/` (architecture decision records).
- The filesystem is the source of truth. `results/index.db` is a derived, rebuildable cache: `reindex` must always be able to recreate it by scanning `results/`.
- Secrets never hit disk. Store variants via `Variant.stored()` / `ProviderConfig.redacted()`, which mask `api_key` / `bearer_token`.
- Tests stay offline. Anything exercising the runner uses `MockInvoker` (optionally with a `solver` callback) and a temp `--root`. Do not call the real Copilot CLI in tests.
- Dry-runs are ephemeral. `dry_run_experiment()` runs the whole pipeline with `MockInvoker` in a `tempfile.mkdtemp()` dir (results + session state both redirected there), validates each stage produced its artifact, then deletes the temp dir. Nothing is persisted under the repo; `run_experiment()` has no `dry_run` parameter.
- Templates are data. Files under `templates/experiment_repo/` are shipped as package data; `.tmpl` files are rendered with `{{placeholder}}` substitution and lose the suffix.
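
The "templates are data" rule can be sketched as follows. This is an illustrative approximation, not the code in `scaffold.py`: the names `render_text` and `render_tree`, the strict `{{name}}` regex, copying non-template files verbatim, and raising `KeyError` on an unknown placeholder are all assumptions.

```python
import re
import tempfile
from pathlib import Path

# Matches {{name}} exactly; tolerating inner whitespace would be a further assumption.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_text(text: str, values: dict[str, str]) -> str:
    """Substitute every {{name}} with values[name]; an unknown name raises KeyError."""
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], text)


def render_tree(template_dir: Path, dest: Path, values: dict[str, str]) -> list[str]:
    """Copy template_dir into dest, rendering *.tmpl files and dropping that suffix."""
    written: list[str] = []
    for src in sorted(p for p in template_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(template_dir)
        is_template = src.suffix == ".tmpl"
        target = dest / (rel.with_suffix("") if is_template else rel)
        target.parent.mkdir(parents=True, exist_ok=True)
        if is_template:
            rendered = render_text(src.read_text(encoding="utf-8"), values)
            target.write_text(rendered, encoding="utf-8")
        else:
            target.write_bytes(src.read_bytes())  # non-template files are copied verbatim
        written.append(target.relative_to(dest).as_posix())
    return written


# Usage: render a tiny throwaway template tree.
with tempfile.TemporaryDirectory() as tmp:
    template_dir, dest = Path(tmp, "template"), Path(tmp, "out")
    (template_dir / "docs").mkdir(parents=True)
    (template_dir / "README.md.tmpl").write_text("# {{project_name}}\n", encoding="utf-8")
    (template_dir / "docs" / "notes.txt").write_text("left {{as_is}}\n", encoding="utf-8")
    files = render_tree(template_dir, dest, {"project_name": "demo"})
    readme = (dest / "README.md").read_text(encoding="utf-8")
    notes = (dest / "docs" / "notes.txt").read_text(encoding="utf-8")
```

Only files ending in `.tmpl` are touched by substitution in this sketch, so a non-template file that happens to contain `{{...}}` passes through unchanged.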
- Non-interactive: `copilot -p "<prompt>" --allow-all-tools`.
- `--model` / `--effort` / `--agent` / `--mode`, `--output-format json`, `--session-id <uuid>`, `--log-dir`, `-C <dir>`.
- Session logs land at `~/.copilot/session-state/<session-id>/events.jsonl`, which is the metrics source.
- BYOK is env-driven (`COPILOT_PROVIDER_*`); a variant is just flags + env.
uv sync # create/refresh the venv
uv run ruff check --fix . # autofix lint issues
uv run ruff format . # format Python code
uv run ruff check . # verify lint
uv run pytest -q # test
# End-to-end smoke test against the sandbox:
uv run copilot-experiments init sandbox/demo --force
uv run copilot-experiments run --root sandbox/demo --dry-run
uv run copilot-experiments show --last --root sandbox/demo

For local CLI testing, point `--root` at a scaffolded dir in `sandbox/` rather than `uv sync`-ing
that dir (its pyproject depends on this package via a git URL that won't resolve offline).
- Python 3.12+, line length 100, ruff-clean and ruff-formatted (`E`, `F`, `I`, `UP`, `B`, `W`; `B008` is ignored for Typer).
- Treat perfectly linted/formatted code as non-negotiable: run `ruff check --fix`, `ruff format`, then `ruff check` before handoff. The repo has pre-commit hooks and CI for this, but agents should still run the commands locally.
- Maintain good test coverage for every behavior change. Add focused offline tests for new Pier config/result adapters, CLI behavior, session parsing, and migration paths; avoid relying only on broad smoke tests.
- Prefer small, well-typed functions; keep modules single-purpose (see the map above).
- Update `docs/` and the experiment-repo template when you change public behavior.
- Record significant/architectural decisions as an ADR under `docs/adr/` (copy `0000-template.md`).
- Commits include the `Co-authored-by: Copilot` trailer.