A Pier-first experiment harness for running coding agents in sandboxed tasks and analyzing their results, with special support for capturing GitHub Copilot CLI sessions exactly as the real CLI produces them.
This repository is the tool. Use `copilot-experiments init` to scaffold a separate experiment repository containing Pier job YAML and Harbor/Pier task directories.
flowchart LR
T["tasks/<id>/\ntask.toml + instruction + environment + tests"] --> J["experiments/*.yaml\nPier JobConfig"]
J --> P["Pier backend\nsandbox + verifier + artifacts"]
P --> A["copilot-cli installed agent\nreal copilot binary"]
A --> S["native Copilot\ncopilot-session/**/events.jsonl"]
P --> O["jobs/<job>/\nPier result.json + trials"]
S --> C["Copilot-native analysis\nAIU, tokens, tools, turns"]
O --> R["summary.json / summary.md\nshow / inspect / analyze"]
O --> I["results/index.db\nderived SQLite index"]
- Tasks are Harbor/Pier task directories: `task.toml`, `instruction.md`, `environment/`, `tests/test.sh`, and optional solutions/artifacts.
- Experiments are Pier `JobConfig` YAML files under `experiments/`.
- Pier owns sandbox execution, agent installation, trial orchestration, verifier execution, and artifact transfer.
- Copilot CLI remains the system under test. The local Pier agent shells out to the real `copilot` binary; it does not use a Copilot SDK or custom tool loop.
- Native Copilot logs remain primary for Copilot-specific metrics. ATIF trajectory output is also captured for cross-agent compatibility and non-Copilot agents.
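The task layout in the first bullet can be checked mechanically before a run. The following is a minimal sketch, not part of the tool: `missing_task_entries` and `REQUIRED_TASK_ENTRIES` are hypothetical names, and it assumes the four entries named above are all mandatory.

```python
from __future__ import annotations

from pathlib import Path

# Entries the task layout names; solutions/artifacts are optional and not checked.
# Treating all four as mandatory is an assumption of this sketch.
REQUIRED_TASK_ENTRIES = ("task.toml", "instruction.md", "environment", "tests/test.sh")


def missing_task_entries(task_dir: Path) -> list[str]:
    """Return the required entries that do not exist under task_dir (hypothetical helper)."""
    return [name for name in REQUIRED_TASK_ENTRIES if not (task_dir / name).exists()]
```

An empty list means the directory has the expected shape; it says nothing about whether the files themselves are valid.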
uv sync
# scaffold a standalone experiment repo
uv run copilot-experiments init my-experiments
cd my-experiments
uv sync
# validate Pier job configs without starting a sandbox
uv run copilot-experiments run --dry-run
# run for real through Pier
uv run copilot-experiments run
uv run copilot-experiments show --last
uv run copilot-experiments analyze --last

To use a local checkout of this tool from any other directory, run the console script through `uvx --from` instead of syncing this repository first:
# Use the path to your local github-copilot-lab checkout.
export COPILOT_EXPERIMENTS_REPO=/path/to/github-copilot-lab
uvx --from "$COPILOT_EXPERIMENTS_REPO" copilot-experiments init my-experiments
cd my-experiments
uvx --from "$COPILOT_EXPERIMENTS_REPO" copilot-experiments run --dry-run
uvx --from "$COPILOT_EXPERIMENTS_REPO" copilot-experiments show --last

In PowerShell, set `$env:COPILOT_EXPERIMENTS_REPO = "C:\path\to\github-copilot-lab"` and use `uvx --from $env:COPILOT_EXPERIMENTS_REPO copilot-experiments ...`. This installs from the current local checkout into uv's cache. If you are iterating on the tool and need to force uv to rebuild from the working tree, add `--no-cache` before `--from`.
Real runs require Copilot auth (`COPILOT_GITHUB_TOKEN`, `GH_TOKEN`, `GITHUB_TOKEN`, or `gh auth login`) and a Pier-supported execution backend such as Docker. `run` preflights the selected backend before creating a job; for Docker this checks the CLI, Compose plugin, and daemon connection so missing WSL integration fails before an empty Pier job is recorded.
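To see what a preflight of this kind might look like, here is a rough sketch that covers only two of the checks: whether one of the token variables is set, and whether a `docker` executable is on `PATH`. The helper names are hypothetical, the lookup order of the variables is an assumption (the text lists them but does not state precedence), and the `gh auth login`, Compose plugin, and daemon checks are left out.

```python
from __future__ import annotations

import os
import shutil
from collections.abc import Mapping

# Variables the text lists as accepted credentials. Checking them in this
# order is an assumption; the tool's real precedence is not documented here.
TOKEN_VARS = ("COPILOT_GITHUB_TOKEN", "GH_TOKEN", "GITHUB_TOKEN")


def find_token_var(env: Mapping[str, str] = os.environ) -> str | None:
    """Return the name of the first non-empty token variable, or None."""
    for name in TOKEN_VARS:
        if env.get(name):
            return name
    return None


def docker_cli_available() -> bool:
    """True when a `docker` executable is on PATH (CLI presence only)."""
    return shutil.which("docker") is not None
```

Passing a plain dictionary as `env` keeps the token lookup easy to exercise without touching the real environment.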
Each run is a new measurement. If the configured Pier `job_name` already exists under `jobs/`, `copilot-experiments` writes the rerun to a timestamped job name instead of silently reusing the completed job. Pass `--resume` only when you intentionally want Pier's native resume behavior for an interrupted job.
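The rerun rule can be pictured with a small sketch. This is an illustration under assumptions rather than the tool's implementation: `resolve_job_name` is a hypothetical helper, and the `-YYYYMMDDTHHMMSSZ` suffix format is invented for the example.

```python
from __future__ import annotations

from datetime import datetime, timezone
from pathlib import Path


def resolve_job_name(jobs_dir: Path, job_name: str, now: datetime | None = None) -> str:
    """Keep job_name if unused, otherwise append a UTC timestamp.

    The suffix format is an assumption made for this sketch.
    """
    if not (jobs_dir / job_name).exists():
        return job_name
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%SZ")
    return f"{job_name}-{stamp}"
```

Accepting `now` as a parameter makes the naming deterministic when it needs to be checked.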
uv run copilot-experiments run --root examples/tracer_bullet --dry-run
uv run copilot-experiments run --root examples/tracer_bullet
uv run copilot-experiments analyze --root examples/tracer_bullet --last

- `examples/tracer_bullet` - one small task, cheap Copilot model.
- `examples/task_suite` - two tasks of different difficulty.
| Command | Description |
|---|---|
| `init <dir>` | Scaffold a standalone Pier experiment repository. |
| `deepswe-import <path>` | Generate a Pier job config for a cloned DeepSWE checkout, `tasks/` corpus, or single task. |
| `run [name]` | Discover Pier job configs in `experiments/` and run them. Reruns create a fresh timestamped Pier job when the configured name already exists. Falls back to legacy Python experiments when no Pier configs exist. |
| `run --dry-run` | Validate Pier job configs, or run the legacy ephemeral mock dry-run for legacy experiments. |
| `run --resume` | Resume an existing Pier job directory and skip already-completed matching trials. |
| `list` | List Pier job configs, legacy experiments, and stored jobs/runs. |
| `show <job>` / `show --last` | Print a summary for a Pier job or legacy run. |
| `analyze <job>` / `analyze --last` / `analyze --file <events.jsonl>` | Render a rich overview of a native Copilot session log. |
| `inspect <job>` | Drill into stored trials and status. |
| `reindex` | Rebuild the derived SQLite index from `jobs/` and legacy `results/`. |
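For a quick look at a native session log outside of `analyze`, a JSON Lines file can be tallied by hand. The sketch below is not how `analyze` works internally: `count_event_types` is a hypothetical helper, and it assumes each line of `events.jsonl` is a JSON object carrying a `type` key, which this document does not confirm.

```python
from __future__ import annotations

import json
from collections import Counter
from pathlib import Path


def count_event_types(events_path: Path) -> Counter[str]:
    """Tally JSON Lines records by their "type" field (the field name is an assumption)."""
    counts: Counter[str] = Counter()
    with events_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Records that are not objects, or lack "type", are grouped as "unknown".
            kind = record.get("type", "unknown") if isinstance(record, dict) else "unknown"
            counts[str(kind)] += 1
    return counts
```

Calling `count_event_types(path).most_common()` lists the event kinds by frequency, which can be enough to confirm a log was captured before running the full analysis.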
- `docs/architecture.md` - Pier-first architecture.
- `docs/authoring-experiments.md` - task and job authoring.
- `docs/deepswe.md` - importing and running DeepSWE tasks through Pier.
- `docs/collecting-run-data.md` - everything to collect around a Copilot CLI run, including native `events.jsonl`, Pier artifacts, ATIF, and OTel.
- `docs/results-format.md` - `jobs/` layout and derived index.
- `docs/analysis.md` - native Copilot session analysis.
- `docs/byok-and-local-models.md` - provider env for Copilot CLI.
- `docs/adr/` - architecture decision records.
uv sync
uv run ruff check --fix .
uv run ruff format .
uv run ruff check .
uv run pytest -q
uv run pytest --cov=copilot_experiments --cov-report=term-missing:skip-covered

Install the local hooks once to make Ruff formatting/lint fixes automatic on commit and tests run on push:
uv run pre-commit install --install-hooks
uv run pre-commit install --hook-type pre-push

The pre-push hook runs plain `pytest` for speed. Run the coverage command explicitly when you need branch coverage and missing-line details; CI also runs it for every push and pull request.
See AGENTS.md for contributor guidance.