AGENTS.md — copilot-experiments (the library + CLI)

Hand-authored agent context for developing this tool. The .apm/ directory holds APM primitives (instructions, the developing-the-library skill, prompts) that Copilot surfaces directly; this file is intentionally not generated by apm compile (that would overwrite this richer content and merge in unrelated experiment-repo context).

This repository is the tool: the copilot_experiments Python package (a library API) plus a Typer CLI. It is developed with uv.

The tool's job is to drive GitHub Copilot CLI across a parameter matrix and collect results. It also scaffolds separate standalone experiment repositories via copilot-experiments init. Do not confuse the two: this repo develops the harness; a scaffolded repo authors experiments. The experiment-repo agent context is a template under src/copilot_experiments/templates/experiment_repo/ — edit it there, never in a generated repo.

Repository map

  • src/copilot_experiments/ — the package.
    • models.py — pydantic models: Experiment, Task, Variant, ProviderConfig, and the result objects (ExperimentRun, VariantResult, TaskResult, TrialResult, Metrics). An experiment is Tasks × Variants × Trials: task= is single-task sugar, tasks=[...] a suite (Experiment.iter_tasks() normalises both to (task_slug, Task) pairs).
    • invoker.py — builds and runs the copilot command. CopilotInvoker shells out to the real CLI; MockInvoker simulates a run (used by dry-runs and the test suite).
    • workspace.py — provisions an isolated per-trial workspace (copy a fixture or git clone), commits a git baseline, and captures a diff of Copilot's changes.
    • sessionlog.py — locates and parses Copilot's events.jsonl into Metrics; extract_economics() pulls token-type counts + AIU cost from session.shutdown.
    • pricing.py — AIU↔token cost math: documented per-token-type costPerBatch defaults, live-rate reading from session.compaction_complete, and the per-type AIU decomposition.
    • analysis.py — derives a rendering-agnostic SessionAnalysis (tools, turns, tokens, economics) from session events.
    • render.py — renders a SessionAnalysis to the terminal with Rich (backs the analyze command).
    • runner.py — orchestration: variants × tasks × trials → result artifacts + index. Also dry_run_experiment() (ephemeral, validating plumbing check that persists nothing).
    • storage.py — the results/ filesystem Layout and run discovery.
    • index.py — the SQLite index (results/index.db) derived from the filesystem.
    • report.py — aggregation, summary.json, and summary.md.
    • scaffold.py — init logic: render templates/experiment_repo/ into a new repo.
    • cli.py — the Typer app (init, run, list, show, analyze, inspect, reindex).
    • templates/experiment_repo/ — package-data template for scaffolded experiment repos.
  • examples/tracer_bullet/ — a committed, runnable multi-turn example experiment (textstats).
  • examples/task_suite/ — a committed multi-task example (strtools + csvtools) exercising the task axis and its mean-success / resolved@k suite-coverage metrics.
  • sandbox/ — local scratch space for exercising the lib/CLI (its results/ are gitignored).
  • tests/ — pytest suite (uses MockInvoker; never requires a real copilot or network).
  • docs/ — architecture, authoring guide, analysis (analyze), results-format reference, BYOK/local-models guide, and docs/adr/ (architecture decision records).
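
As a rough illustration of the Tasks × Variants × Trials shape described for models.py and runner.py, the sketch below enumerates every cell of a small matrix. It is not the package's code: the Trial dataclass, expand_matrix(), and the variant names are hypothetical, and only the task slugs (strtools, csvtools) come from the examples above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Trial:
    """One cell of the matrix (hypothetical type, for illustration only)."""

    variant: str
    task_slug: str
    trial_index: int


def expand_matrix(variants: list[str], task_slugs: list[str], trials: int) -> list[Trial]:
    """Enumerate every (variant, task, trial) combination in a stable order."""
    return [
        Trial(variant, task_slug, index)
        for variant, task_slug, index in product(variants, task_slugs, range(trials))
    ]


# Variant names are made up; the real ones live in an experiment definition.
cells = expand_matrix(["baseline", "high-effort"], ["strtools", "csvtools"], trials=3)
print(len(cells))  # 2 variants x 2 tasks x 3 trials = 12
```

Each cell corresponds to one isolated workspace and one Copilot session, which is why the run count grows multiplicatively with every axis.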

Architecture invariants (keep these true)

  • The filesystem is the source of truth. results/index.db is a derived, rebuildable cache — reindex must always be able to recreate it by scanning results/.
  • Secrets never hit disk. Store variants via Variant.stored() / ProviderConfig.redacted(), which mask api_key / bearer_token.
  • Tests stay offline. Anything exercising the runner uses MockInvoker (optionally with a solver callback) and a temp --root. Do not call the real Copilot CLI in tests.
  • Dry-runs are ephemeral. dry_run_experiment() runs the whole pipeline with MockInvoker in a tempfile.mkdtemp() dir (results + session state both redirected there), validates each stage produced its artifact, then deletes the temp dir. Nothing is persisted under the repo; run_experiment() has no dry_run parameter.
  • Templates are data. Files under templates/experiment_repo/ are shipped as package data; .tmpl files are rendered with {{placeholder}} substitution and lose the suffix.
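
The "templates are data" invariant can be pictured with a minimal sketch: walk a template tree, substitute {{placeholder}} tokens in .tmpl files, drop the suffix, and copy everything else unchanged. This is an assumption-laden illustration, not the code in scaffold.py; render_text(), render_tree(), and the exact placeholder grammar (word characters only, no surrounding whitespace) are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Assumed placeholder grammar: {{name}} with word characters only.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_text(text: str, values: dict[str, str]) -> str:
    """Replace each {{name}} with values[name]; an unknown name raises KeyError."""
    return PLACEHOLDER.sub(lambda match: values[match.group(1)], text)


def render_tree(template_dir: Path, dest: Path, values: dict[str, str]) -> list[Path]:
    """Render .tmpl files (dropping the suffix) and copy all other files verbatim."""
    written: list[Path] = []
    for src in sorted(template_dir.rglob("*")):
        if src.is_dir():
            continue
        target = dest / src.relative_to(template_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        if src.suffix == ".tmpl":
            target = target.with_suffix("")  # README.md.tmpl -> README.md
            target.write_text(render_text(src.read_text(), values))
        else:
            target.write_bytes(src.read_bytes())
        written.append(target.relative_to(dest))
    return written


with tempfile.TemporaryDirectory() as tmp:
    template_dir = Path(tmp) / "template"
    template_dir.mkdir()
    (template_dir / "README.md.tmpl").write_text("# {{name}}\n")
    (template_dir / "notes.txt").write_text("copied as-is: {{name}}\n")
    out = Path(tmp) / "out"
    written = render_tree(template_dir, out, {"name": "demo"})
    rendered = (out / "README.md").read_text()
    copied = (out / "notes.txt").read_text()
```

Failing loudly on an unknown placeholder is a deliberate choice in this sketch: a scaffolded repo with a literal {{name}} left in it is harder to notice than an error at init time.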

How Copilot CLI is driven (reference)

  • Non-interactive: copilot -p "<prompt>" --allow-all-tools.
  • --model / --effort / --agent / --mode, --output-format json, --session-id <uuid>, --log-dir, -C <dir>.
  • Session logs land at ~/.copilot/session-state/<session-id>/events.jsonl — the metrics source.
  • BYOK is env-driven (COPILOT_PROVIDER_*); a variant is just flags + env.

Workflow

uv sync                              # create/refresh the venv
uv run ruff check --fix .            # autofix lint issues
uv run ruff format .                 # format Python code
uv run ruff check .                  # verify lint
uv run pytest -q                     # test
# End-to-end smoke test against the sandbox:
uv run copilot-experiments init sandbox/demo --force
uv run copilot-experiments run --root sandbox/demo --dry-run
uv run copilot-experiments show --last --root sandbox/demo

For local CLI testing, point --root at a scaffolded dir in sandbox/ rather than running uv sync in that dir (its pyproject.toml depends on this package via a git URL that won't resolve offline).

Conventions

  • Python 3.12+, line length 100, ruff-clean and ruff-formatted (E,F,I,UP,B,W; B008 is ignored for Typer).
  • Treat perfectly linted/formatted code as non-negotiable: run ruff check --fix, ruff format, then ruff check before handoff. The repo has pre-commit hooks and CI for this, but agents should still run the commands locally.
  • Maintain good test coverage for every behavior change. Add focused offline tests for new Pier config/result adapters, CLI behavior, session parsing, and migration paths; avoid relying only on broad smoke tests.
  • Prefer small, well-typed functions; keep modules single-purpose (see the map above).
  • Update docs/ and the experiment-repo template when you change public behavior.
  • Record significant/architectural decisions as an ADR under docs/adr/ (copy 0000-template.md).
  • Commits include the Co-authored-by: Copilot trailer.