copilot-experiments

A Pier-first experiment harness for running coding agents in sandboxed tasks and analyzing their results, with special support for capturing GitHub Copilot CLI sessions exactly as the real CLI produces them.

This repository is the tool. Use copilot-experiments init to scaffold a separate experiment repository containing Pier job YAML and Harbor/Pier task directories.

How it works

flowchart LR
    T["tasks/<id>/\ntask.toml + instruction + environment + tests"] --> J["experiments/*.yaml\nPier JobConfig"]
    J --> P["Pier backend\nsandbox + verifier + artifacts"]
    P --> A["copilot-cli installed agent\nreal copilot binary"]
    A --> S["native Copilot\ncopilot-session/**/events.jsonl"]
    P --> O["jobs/<job>/\nPier result.json + trials"]
    S --> C["Copilot-native analysis\nAIU, tokens, tools, turns"]
    O --> R["summary.json / summary.md\nshow / inspect / analyze"]
    O --> I["results/index.db\nderived SQLite index"]

Tasks are Harbor/Pier task directories: task.toml, instruction.md, environment/, tests/test.sh, and optional solutions/artifacts.
Experiments are Pier JobConfig YAML files under experiments/.
Pier owns sandbox execution, agent installation, trial orchestration, verifier execution, and artifact transfer.
Copilot CLI remains the system under test. The local Pier agent shells out to the real copilot binary; it does not use a Copilot SDK or custom tool loop.
Native Copilot logs remain primary for Copilot-specific metrics. ATIF trajectory output is also captured for cross-agent compatibility and non-Copilot agents.
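The task layout described above can be checked before a run with a few lines of Python. This is a minimal sketch, not part of copilot-experiments: the helper name `missing_task_entries` is hypothetical, and it only tests for the entries the description lists as required (task.toml, instruction.md, environment/, tests/test.sh).

```python
from pathlib import Path

# Required entries are taken from the task description above.
# The helper name and return shape are illustrative assumptions.
REQUIRED_FILES = ("task.toml", "instruction.md", "tests/test.sh")
REQUIRED_DIRS = ("environment",)


def missing_task_entries(task_dir: Path) -> list:
    """Return the required files and directories absent from a task directory."""
    missing = [name for name in REQUIRED_FILES if not (task_dir / name).is_file()]
    missing += [f"{name}/" for name in REQUIRED_DIRS if not (task_dir / name).is_dir()]
    return missing
```

An empty list means the directory has every required entry; optional solutions and artifacts are deliberately not checked.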

Quickstart

uv sync

# scaffold a standalone experiment repo
uv run copilot-experiments init my-experiments
cd my-experiments
uv sync

# validate Pier job configs without starting a sandbox
uv run copilot-experiments run --dry-run

# run for real through Pier
uv run copilot-experiments run
uv run copilot-experiments show --last
uv run copilot-experiments analyze --last

Real runs require Copilot auth (COPILOT_GITHUB_TOKEN, GH_TOKEN, GITHUB_TOKEN, or gh auth login) and a Pier-supported execution backend such as Docker.
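A quick preflight check for these requirements might look like the sketch below. The environment variable names come from the sentence above; the lookup order, the helper names, and the use of `docker` as the backend are assumptions for illustration. It does not detect a `gh auth login` session.

```python
import os
import shutil

# Variable names are from the README; checking them in this order is an assumption.
TOKEN_VARS = ("COPILOT_GITHUB_TOKEN", "GH_TOKEN", "GITHUB_TOKEN")


def find_copilot_token_var(env=None):
    """Return the name of the first non-empty token variable, or None."""
    env = os.environ if env is None else env
    for name in TOKEN_VARS:
        if env.get(name):
            return name
    return None


def docker_on_path() -> bool:
    """Report whether a docker executable is on PATH (assumes Docker is the backend)."""
    return shutil.which("docker") is not None
```

If `find_copilot_token_var()` returns `None`, a run may still succeed through `gh auth login`, so treat the result as a hint and not as a hard failure.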

Each run is a new measurement. If the configured Pier job_name already exists under jobs/, copilot-experiments writes the rerun to a timestamped job name instead of silently reusing the completed job. Pass --resume only when you intentionally want Pier's native resume behavior for an interrupted job.
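The rerun rule can be sketched as a small function. This is an illustration of the behavior described above, not the tool's implementation: the function name, the `-YYYYMMDD-HHMMSS` suffix format, and the use of UTC are all assumptions.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def resolve_job_name(
    jobs_dir: Path,
    job_name: str,
    resume: bool = False,
    now: Optional[datetime] = None,
) -> str:
    """Pick the job name for a run under the rerun rule described above."""
    # Resuming, or a name that has never been used, keeps the configured name.
    if resume or not (jobs_dir / job_name).exists():
        return job_name
    # Otherwise the rerun gets a timestamped name (suffix format is assumed).
    now = now or datetime.now(timezone.utc)
    return f"{job_name}-{now:%Y%m%d-%H%M%S}"
```

The `now` parameter exists only so the timestamp can be fixed in tests.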

Bundled examples

uv run copilot-experiments run --root examples/tracer_bullet --dry-run
uv run copilot-experiments run --root examples/tracer_bullet
uv run copilot-experiments analyze --root examples/tracer_bullet --last

examples/tracer_bullet - one small task, cheap Copilot model.
examples/task_suite - two tasks of different difficulty.

CLI

| Command | Description |
| --- | --- |
| `init <dir>` | Scaffold a standalone Pier experiment repository. |
| `deepswe-import <path>` | Generate a Pier job config for a cloned DeepSWE checkout, `tasks/` corpus, or single task. |
| `run [name]` | Discover Pier job configs in `experiments/` and run them. Reruns create a fresh timestamped Pier job when the configured name already exists. Falls back to legacy Python experiments when no Pier configs exist. |
| `run --dry-run` | Validate Pier job configs, or run the legacy ephemeral mock dry-run for legacy experiments. |
| `run --resume` | Resume an existing Pier job directory and skip already-completed matching trials. |
| `list` | List Pier job configs, legacy experiments, and stored jobs/runs. |
| `show <job>` / `show --last` | Print a summary for a Pier job or legacy run. |
| `analyze <job>` / `analyze --last` / `analyze --file <events.jsonl>` | Render a rich overview of a native Copilot session log. |
| `inspect <job>` | Drill into stored trials and status. |
| `reindex` | Rebuild the derived SQLite index from `jobs/` and legacy `results/`. |
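For ad hoc inspection outside the CLI, a native events.jsonl file can be read as JSON Lines. The sketch below only tallies events by kind and is not how `analyze` works internally. The `type` field name is an assumption about the log schema, which this README does not describe, so records without it are counted as `unknown`.

```python
import json
from collections import Counter
from pathlib import Path


def count_event_types(events_path: Path) -> Counter:
    """Tally events in a JSON Lines file by their assumed "type" field."""
    counts = Counter()
    with events_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            event = json.loads(line)
            # "type" is an assumed field name; anything else falls back to "unknown".
            kind = event.get("type", "unknown") if isinstance(event, dict) else "unknown"
            counts[kind] += 1
    return counts
```

For the metrics the tool reports (AIU, tokens, tools, turns), prefer `copilot-experiments analyze --file <events.jsonl>`.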

Documentation

docs/architecture.md - Pier-first architecture.
docs/authoring-experiments.md - task and job authoring.
docs/deepswe.md - importing and running DeepSWE tasks through Pier.
docs/collecting-run-data.md - everything to collect around a Copilot CLI run, including native events.jsonl, Pier artifacts, ATIF, and OTel.
docs/results-format.md - jobs/ layout and derived index.
docs/analysis.md - native Copilot session analysis.
docs/byok-and-local-models.md - provider env for Copilot CLI.
docs/adr/ - architecture decision records.

Development

uv sync
uv run ruff check --fix .
uv run ruff format .
uv run ruff check .
uv run pytest -q

Install the local hooks once so that Ruff formatting and lint fixes run automatically on commit and tests run on push:

uv run pre-commit install --install-hooks
uv run pre-commit install --hook-type pre-push

See AGENTS.md for contributor guidance.
