A Pier-first experiment harness for running coding agents in sandboxed tasks and analyzing their results, with special support for capturing GitHub Copilot CLI sessions exactly as the real CLI produces them.
This repository is the tool. Use `copilot-experiments init` to scaffold a separate experiment repository containing Pier job YAML and Harbor/Pier task directories.
```mermaid
flowchart LR
    T["tasks/<id>/\ntask.toml + instruction + environment + tests"] --> J["experiments/*.yaml\nPier JobConfig"]
    J --> P["Pier backend\nsandbox + verifier + artifacts"]
    P --> A["copilot-cli installed agent\nreal copilot binary"]
    A --> S["native Copilot\ncopilot-session/**/events.jsonl"]
    P --> O["jobs/<job>/\nPier result.json + trials"]
    S --> C["Copilot-native analysis\nAIU, tokens, tools, turns"]
    O --> R["summary.json / summary.md\nshow / inspect / analyze"]
    O --> I["results/index.db\nderived SQLite index"]
```
- Tasks are Harbor/Pier task directories: `task.toml`, `instruction.md`, `environment/`, `tests/test.sh`, and optional solutions/artifacts.
- Experiments are Pier `JobConfig` YAML files under `experiments/`.
- Pier owns sandbox execution, agent installation, trial orchestration, verifier execution, and artifact transfer.
- Copilot CLI remains the system under test. The local Pier agent shells out to the real `copilot` binary; it does not use a Copilot SDK or custom tool loop.
- Native Copilot logs remain primary for Copilot-specific metrics. ATIF trajectory output is also captured for cross-agent compatibility and non-Copilot agents.
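The task layout listed above can be checked before a job is started. The following is a minimal sketch, not part of `copilot-experiments`: the helper name `missing_task_entries` and the default `tasks` root are assumptions made for illustration, and only the four entries named in the list are checked.

```python
from pathlib import Path

# Entries named in the task layout above; optional solutions/artifacts are not checked.
REQUIRED_ENTRIES = ("task.toml", "instruction.md", "environment", "tests/test.sh")


def missing_task_entries(task_dir: Path) -> list[str]:
    """Return the required entries that do not exist under task_dir (hypothetical helper)."""
    return [name for name in REQUIRED_ENTRIES if not (task_dir / name).exists()]


if __name__ == "__main__":
    import sys

    # Assumption: tasks live under ./tasks unless another root is given as an argument.
    root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("tasks")
    task_dirs = sorted(p for p in root.iterdir() if p.is_dir()) if root.is_dir() else []
    for task_dir in task_dirs:
        missing = missing_task_entries(task_dir)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{task_dir.name}: {status}")
```

Run from an experiment repository root, this prints one line per task directory. It checks only that the entries exist, not that their contents are valid for Pier.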
```bash
uv sync

# scaffold a standalone experiment repo
uv run copilot-experiments init my-experiments
cd my-experiments
uv sync

# validate Pier job configs without starting a sandbox
uv run copilot-experiments run --dry-run

# run for real through Pier
uv run copilot-experiments run
uv run copilot-experiments show --last
uv run copilot-experiments analyze --last
```

Real runs require Copilot auth (`COPILOT_GITHUB_TOKEN`, `GH_TOKEN`, `GITHUB_TOKEN`, or `gh auth login`) and a Pier-supported execution backend such as Docker.
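A quick preflight can report which of the auth sources named above are present before a real run. This is a sketch under assumptions: `token_vars_set` is a hypothetical helper, it makes no claim about which variable takes precedence, and it only checks whether `gh` is on `PATH`, not whether `gh auth login` has been completed.

```python
import os
import shutil

# Token variables named in this README; listing order here implies no precedence.
TOKEN_VARS = ("COPILOT_GITHUB_TOKEN", "GH_TOKEN", "GITHUB_TOKEN")


def token_vars_set(env=None) -> list[str]:
    """Return the names of token variables that are set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name in TOKEN_VARS if env.get(name)]


if __name__ == "__main__":
    found = token_vars_set()
    print("token variables set:", ", ".join(found) if found else "none")
    # Only detects the gh executable; login state is not inspected.
    print("gh on PATH:", "yes" if shutil.which("gh") else "no")
```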
Each run is a new measurement. If the configured Pier `job_name` already exists under `jobs/`,
`copilot-experiments` writes the rerun to a timestamped job name instead of silently reusing the
completed job. Pass `--resume` only when you intentionally want Pier's native resume behavior for an
interrupted job.
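The rerun rule can be pictured with a small sketch. It illustrates the idea only: the function name `fresh_job_name` and the `-YYYYMMDDTHHMMSSZ` suffix are assumptions, and the actual name format that `copilot-experiments` writes may differ.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def fresh_job_name(jobs_dir: Path, job_name: str, now: Optional[datetime] = None) -> str:
    """Return job_name if unused, otherwise job_name plus a UTC timestamp suffix.

    The suffix format is hypothetical and chosen only for illustration.
    """
    if not (jobs_dir / job_name).exists():
        return job_name
    now = now or datetime.now(timezone.utc)
    return f"{job_name}-{now.strftime('%Y%m%dT%H%M%SZ')}"


if __name__ == "__main__":
    # Prints "tracer-bullet" on a first run, a suffixed name once jobs/tracer-bullet exists.
    print(fresh_job_name(Path("jobs"), "tracer-bullet"))
```

Keeping the completed job directory untouched means every run stays a separate measurement, which is why resuming is opt-in.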
```bash
uv run copilot-experiments run --root examples/tracer_bullet --dry-run
uv run copilot-experiments run --root examples/tracer_bullet
uv run copilot-experiments analyze --root examples/tracer_bullet --last
```

- `examples/tracer_bullet` - one small task, cheap Copilot model.
- `examples/task_suite` - two tasks of different difficulty.
| Command | Description |
|---|---|
| `init <dir>` | Scaffold a standalone Pier experiment repository. |
| `deepswe-import <path>` | Generate a Pier job config for a cloned DeepSWE checkout, `tasks/` corpus, or single task. |
| `run [name]` | Discover Pier job configs in `experiments/` and run them. Reruns create a fresh timestamped Pier job when the configured name already exists. Falls back to legacy Python experiments when no Pier configs exist. |
| `run --dry-run` | Validate Pier job configs, or run the legacy ephemeral mock dry-run for legacy experiments. |
| `run --resume` | Resume an existing Pier job directory and skip already-completed matching trials. |
| `list` | List Pier job configs, legacy experiments, and stored jobs/runs. |
| `show <job>` / `show --last` | Print a summary for a Pier job or legacy run. |
| `analyze <job>` / `analyze --last` / `analyze --file <events.jsonl>` | Render a rich overview of a native Copilot session log. |
| `inspect <job>` | Drill into stored trials and status. |
| `reindex` | Rebuild the derived SQLite index from `jobs/` and legacy `results/`. |
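For ad hoc checks outside `analyze --file`, a native `events.jsonl` log can be read as ordinary JSON Lines. The sketch below tallies records by one field. The event schema is not described here, so the default key `type` is an assumption, and `count_events` is a hypothetical helper rather than part of the tool.

```python
import json
import sys
from collections import Counter
from pathlib import Path


def count_events(path: Path, key: str = "type") -> Counter:
    """Count JSONL records by the value of `key` (the field name is an assumption)."""
    counts: Counter = Counter()
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            value = record.get(key) if isinstance(record, dict) else None
            counts[str(value) if value is not None else "<missing>"] += 1
    return counts


if __name__ == "__main__":
    # Usage: python count_events.py path/to/events.jsonl
    for name, total in count_events(Path(sys.argv[1])).most_common():
        print(f"{total:6d}  {name}")
```

Records without the chosen key are grouped under `<missing>`, which makes a wrong guess about the field name easy to spot.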
- `docs/architecture.md` - Pier-first architecture.
- `docs/authoring-experiments.md` - task and job authoring.
- `docs/deepswe.md` - importing and running DeepSWE tasks through Pier.
- `docs/collecting-run-data.md` - everything to collect around a Copilot CLI run, including native `events.jsonl`, Pier artifacts, ATIF, and OTel.
- `docs/results-format.md` - `jobs/` layout and derived index.
- `docs/analysis.md` - native Copilot session analysis.
- `docs/byok-and-local-models.md` - provider env for Copilot CLI.
- `docs/adr/` - architecture decision records.
```bash
uv sync
uv run ruff check --fix .
uv run ruff format .
uv run ruff check .
uv run pytest -q
```

Install the local hooks once so that Ruff formatting and lint fixes run automatically on commit and tests run on push:
```bash
uv run pre-commit install --install-hooks
uv run pre-commit install --hook-type pre-push
```

See `AGENTS.md` for contributor guidance.