copilot-experiments

A Pier-first experiment harness for running coding agents in sandboxed tasks and analyzing their results, with special support for capturing GitHub Copilot CLI sessions exactly as the real CLI produces them.

This repository is the tool. Use copilot-experiments init to scaffold a separate experiment repository containing Pier job YAML and Harbor/Pier task directories.

How it works

flowchart LR
    T["tasks/<id>/\ntask.toml + instruction + environment + tests"] --> J["experiments/*.yaml\nPier JobConfig"]
    J --> P["Pier backend\nsandbox + verifier + artifacts"]
    P --> A["copilot-cli installed agent\nreal copilot binary"]
    A --> S["native Copilot\ncopilot-session/**/events.jsonl"]
    P --> O["jobs/<job>/\nPier result.json + trials"]
    S --> C["Copilot-native analysis\nAIU, tokens, tools, turns"]
    O --> R["summary.json / summary.md\nshow / inspect / analyze"]
    O --> I["results/index.db\nderived SQLite index"]
  • Tasks are Harbor/Pier task directories: task.toml, instruction.md, environment/, tests/test.sh, and optional solutions/artifacts.
  • Experiments are Pier JobConfig YAML files under experiments/.
  • Pier owns sandbox execution, agent installation, trial orchestration, verifier execution, and artifact transfer.
  • Copilot CLI remains the system under test. The local Pier agent shells out to the real copilot binary; it does not use a Copilot SDK or custom tool loop.
  • Native Copilot logs remain primary for Copilot-specific metrics. ATIF trajectory output is also captured for cross-agent compatibility and non-Copilot agents.
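
The task layout listed above can be checked mechanically before a run. The following is a minimal sketch, not part of copilot-experiments: the helper name missing_task_entries is hypothetical, and it only tests that the named files and directories exist, not that their contents are valid.

```python
from __future__ import annotations

from pathlib import Path

# Entries named in the task description above.
# Optional solutions/artifacts are deliberately not checked.
REQUIRED_ENTRIES = ("task.toml", "instruction.md", "environment", "tests/test.sh")


def missing_task_entries(task_dir: Path) -> list[str]:
    """Return the required entries absent from task_dir (hypothetical helper)."""
    return [name for name in REQUIRED_ENTRIES if not (task_dir / name).exists()]
```

Calling missing_task_entries(Path("tasks/my-task")) returns an empty list when every required entry is present, and the names of the missing entries otherwise.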

Quickstart

uv sync

# scaffold a standalone experiment repo
uv run copilot-experiments init my-experiments
cd my-experiments
uv sync

# validate Pier job configs without starting a sandbox
uv run copilot-experiments run --dry-run

# run for real through Pier
uv run copilot-experiments run
uv run copilot-experiments show --last
uv run copilot-experiments analyze --last

Real runs require Copilot auth (COPILOT_GITHUB_TOKEN, GH_TOKEN, GITHUB_TOKEN, or gh auth login) and a Pier-supported execution backend such as Docker.
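
A quick preflight for the auth requirement might look like the sketch below. The three variable names come from the sentence above; the order in which they are consulted, and the helper names find_token_var and describe_auth, are assumptions made for illustration. The sketch does not verify that a token is valid, and it cannot tell whether gh auth login has actually been completed.

```python
from __future__ import annotations

import os
import shutil
from collections.abc import Mapping

# Names from the README; the lookup order here is an assumption.
TOKEN_VARS = ("COPILOT_GITHUB_TOKEN", "GH_TOKEN", "GITHUB_TOKEN")


def find_token_var(env: Mapping[str, str] = os.environ) -> str | None:
    """Return the name of the first non-empty token variable, or None."""
    for name in TOKEN_VARS:
        if env.get(name, "").strip():
            return name
    return None


def describe_auth(env: Mapping[str, str] = os.environ) -> str:
    """Summarize which auth source looks available (hypothetical helper)."""
    name = find_token_var(env)
    if name is not None:
        return f"token variable set: {name}"
    if shutil.which("gh") is not None:
        # gh being installed does not prove that a login exists.
        return "no token variable set; gh is on PATH, so gh auth login may apply"
    return "no token variable set and gh not found on PATH"
```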

Each run is a new measurement. If the configured Pier job_name already exists under jobs/, copilot-experiments writes the rerun to a timestamped job name instead of silently reusing the completed job. Pass --resume only when you intentionally want Pier's native resume behavior for an interrupted job.
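
The rerun rule can be sketched as a small function. This is an illustration of the behavior described above, not the tool's actual code: the function name pick_job_name, the hyphen separator, and the UTC timestamp format are all assumptions.

```python
from __future__ import annotations

from datetime import datetime, timezone
from pathlib import Path


def pick_job_name(
    jobs_dir: Path,
    job_name: str,
    resume: bool = False,
    now: datetime | None = None,
) -> str:
    """Choose the job name for a run (hypothetical helper).

    A fresh name, or an explicit resume, keeps the configured name.
    Otherwise an existing jobs/<job_name> triggers a timestamped name.
    """
    if resume or not (jobs_dir / job_name).exists():
        return job_name
    now = now or datetime.now(timezone.utc)
    # The suffix format is an assumption; any unique timestamp would serve.
    return f"{job_name}-{now.strftime('%Y%m%dT%H%M%SZ')}"
```

Keeping the configured name only for a fresh job or an explicit resume means a completed job directory is never reused by accident.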

Bundled examples

uv run copilot-experiments run --root examples/tracer_bullet --dry-run
uv run copilot-experiments run --root examples/tracer_bullet
uv run copilot-experiments analyze --root examples/tracer_bullet --last

CLI

| Command | Description |
| --- | --- |
| `init <dir>` | Scaffold a standalone Pier experiment repository. |
| `deepswe-import <path>` | Generate a Pier job config for a cloned DeepSWE checkout, tasks/ corpus, or single task. |
| `run [name]` | Discover Pier job configs in experiments/ and run them. Reruns create a fresh timestamped Pier job when the configured name already exists. Falls back to legacy Python experiments when no Pier configs exist. |
| `run --dry-run` | Validate Pier job configs, or run the legacy ephemeral mock dry-run for legacy experiments. |
| `run --resume` | Resume an existing Pier job directory and skip already-completed matching trials. |
| `list` | List Pier job configs, legacy experiments, and stored jobs/runs. |
| `show <job>` / `show --last` | Print a summary for a Pier job or legacy run. |
| `analyze <job>` / `analyze --last` / `analyze --file <events.jsonl>` | Render a rich overview of a native Copilot session log. |
| `inspect <job>` | Drill into stored trials and status. |
| `reindex` | Rebuild the derived SQLite index from jobs/ and legacy results/. |
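
Because the SQLite index is derived, it can always be dropped and rebuilt from the files on disk. The sketch below illustrates that idea only: the single jobs table, its columns, and the function name rebuild_index are assumptions, not the schema reindex actually writes, and legacy results/ are ignored.

```python
import sqlite3
from pathlib import Path


def rebuild_index(jobs_dir: Path, db_path: Path) -> int:
    """Recreate a one-table index of jobs/<job>/result.json files.

    Returns the number of jobs indexed. The table layout is hypothetical.
    """
    db_path.parent.mkdir(parents=True, exist_ok=True)
    # Read every result file up front so the database work is one short step.
    rows = [
        (path.parent.name, path.read_text(encoding="utf-8"))
        for path in sorted(jobs_dir.glob("*/result.json"))
    ]
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("DROP TABLE IF EXISTS jobs")
            conn.execute(
                "CREATE TABLE jobs (name TEXT PRIMARY KEY, result_json TEXT NOT NULL)"
            )
            conn.executemany("INSERT INTO jobs VALUES (?, ?)", rows)
    finally:
        conn.close()
    return len(rows)
```

Running it twice against the same jobs/ directory yields the same table contents, which is the property that makes the index safe to delete.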

Documentation

Development

uv sync
uv run ruff check --fix .
uv run ruff format .
uv run ruff check .
uv run pytest -q

Install the local hooks once so that Ruff formatting and lint fixes run automatically on commit and the tests run on push:

uv run pre-commit install --install-hooks
uv run pre-commit install --hook-type pre-push

See AGENTS.md for contributor guidance.