dbroeglin/github-copilot-lab

copilot-experiments

A Pier-first experiment harness for running coding agents in sandboxed tasks and analyzing their results, with special support for capturing GitHub Copilot CLI sessions exactly as the real CLI produces them.

This repository is the tool. Use copilot-experiments init to scaffold a separate experiment repository containing Pier job YAML and Harbor/Pier task directories.

How it works

flowchart LR
    T["tasks/<id>/\ntask.toml + instruction + environment + tests"] --> J["experiments/*.yaml\nPier JobConfig"]
    J --> P["Pier backend\nsandbox + verifier + artifacts"]
    P --> A["copilot-cli installed agent\nreal copilot binary"]
    A --> S["native Copilot\ncopilot-session/**/events.jsonl"]
    P --> O["jobs/<job>/\nPier result.json + trials"]
    S --> C["Copilot-native analysis\nAIU, tokens, tools, turns"]
    O --> R["summary.json / summary.md\nshow / inspect / analyze"]
    O --> I["results/index.db\nderived SQLite index"]
  • Tasks are Harbor/Pier task directories: task.toml, instruction.md, environment/, tests/test.sh, and optional solutions/artifacts.
  • Experiments are Pier JobConfig YAML files under experiments/.
  • Pier owns sandbox execution, agent installation, trial orchestration, verifier execution, and artifact transfer.
  • Copilot CLI remains the system under test. The local Pier agent shells out to the real copilot binary; it does not use a Copilot SDK or custom tool loop.
  • Native Copilot logs remain primary for Copilot-specific metrics. ATIF trajectory output is also captured for cross-agent compatibility and non-Copilot agents.
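
The task layout described above can be checked mechanically before a run. The following is a minimal sketch, not the tool's own validation: `missing_task_entries` and `check_tasks` are hypothetical helpers, and the only facts taken from this README are the entry names (task.toml, instruction.md, environment/, tests/test.sh) and the tasks/<id>/ location.

```python
from pathlib import Path

# Entries this README lists for a task directory.
# The optional solutions/artifacts are deliberately not checked.
REQUIRED_ENTRIES = ("task.toml", "instruction.md", "environment", "tests/test.sh")


def missing_task_entries(task_dir: Path) -> list[str]:
    """Return the required entries absent from task_dir (hypothetical helper)."""
    return [entry for entry in REQUIRED_ENTRIES if not (task_dir / entry).exists()]


def check_tasks(root: Path) -> dict[str, list[str]]:
    """Map each task id under root/tasks to its missing entries.

    An empty list means the task directory has every required entry.
    """
    tasks_dir = root / "tasks"
    if not tasks_dir.is_dir():
        return {}
    return {
        task.name: missing_task_entries(task)
        for task in sorted(tasks_dir.iterdir())
        if task.is_dir()
    }
```

Calling `check_tasks(Path("."))` from an experiment repository would report, per task id, which of the listed entries are missing.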

Quickstart

uv sync

# scaffold a standalone experiment repo
uv run copilot-experiments init my-experiments
cd my-experiments
uv sync

# validate Pier job configs without starting a sandbox
uv run copilot-experiments run --dry-run

# run for real through Pier
uv run copilot-experiments run
uv run copilot-experiments show --last
uv run copilot-experiments analyze --last

To use a local checkout of this tool from any other directory, run the console script through uvx --from instead of syncing this repository first:

# Use the path to your local github-copilot-lab checkout.
export COPILOT_EXPERIMENTS_REPO=/path/to/github-copilot-lab

uvx --from "$COPILOT_EXPERIMENTS_REPO" copilot-experiments init my-experiments
cd my-experiments
uvx --from "$COPILOT_EXPERIMENTS_REPO" copilot-experiments run --dry-run
uvx --from "$COPILOT_EXPERIMENTS_REPO" copilot-experiments show --last

In PowerShell, set $env:COPILOT_EXPERIMENTS_REPO = "C:\path\to\github-copilot-lab" and use uvx --from $env:COPILOT_EXPERIMENTS_REPO copilot-experiments ... in place of the commands above. uvx installs the tool from the local checkout into uv's cache. If you are iterating on the tool and need to force uv to rebuild from the working tree, add --no-cache before --from.

Real runs require Copilot auth (COPILOT_GITHUB_TOKEN, GH_TOKEN, GITHUB_TOKEN, or gh auth login) and a Pier-supported execution backend such as Docker. run preflights the selected backend before creating a job; for Docker this checks the CLI, Compose plugin, and daemon connection so missing WSL integration fails before an empty Pier job is recorded.

Each run is a new measurement. If the configured Pier job_name already exists under jobs/, copilot-experiments writes the rerun to a timestamped job name instead of silently reusing the completed job. Pass --resume only when you intentionally want Pier's native resume behavior for an interrupted job.

Bundled examples

uv run copilot-experiments run --root examples/tracer_bullet --dry-run
uv run copilot-experiments run --root examples/tracer_bullet
uv run copilot-experiments analyze --root examples/tracer_bullet --last

CLI

| Command | Description |
| --- | --- |
| init <dir> | Scaffold a standalone Pier experiment repository. |
| deepswe-import <path> | Generate a Pier job config for a cloned DeepSWE checkout, tasks/ corpus, or single task. |
| run [name] | Discover Pier job configs in experiments/ and run them. Reruns create a fresh timestamped Pier job when the configured name already exists. Falls back to legacy Python experiments when no Pier configs exist. |
| run --dry-run | Validate Pier job configs, or run the legacy ephemeral mock dry-run for legacy experiments. |
| run --resume | Resume an existing Pier job directory and skip already-completed matching trials. |
| list | List Pier job configs, legacy experiments, and stored jobs/runs. |
| show <job> / show --last | Print a summary for a Pier job or legacy run. |
| analyze <job> / analyze --last / analyze --file <events.jsonl> | Render a rich overview of a native Copilot session log. |
| inspect <job> | Drill into stored trials and status. |
| reindex | Rebuild the derived SQLite index from jobs/ and legacy results/. |
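
To illustrate what "derived" means for the index, here is a minimal sketch of rebuilding a SQLite table purely from files on disk. It is not the tool's reindex logic: `rebuild_index` is a hypothetical helper, the `jobs` table and its two columns are invented for illustration, no schema is assumed for result.json beyond it being JSON, and legacy results/ are ignored.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(root: Path, db_path: Path) -> int:
    """Rebuild a throwaway index of jobs/*/result.json; return the row count."""
    rows = []
    for result_file in sorted((root / "jobs").glob("*/result.json")):
        text = result_file.read_text(encoding="utf-8")
        try:
            json.loads(text)  # only confirm the file parses; no fields are assumed
        except json.JSONDecodeError:
            continue  # skip unreadable results rather than failing the rebuild
        rows.append((result_file.parent.name, text))

    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success
            # Hypothetical table: the index is disposable, so drop and recreate it.
            conn.execute("DROP TABLE IF EXISTS jobs")
            conn.execute(
                "CREATE TABLE jobs (job TEXT PRIMARY KEY, result_json TEXT NOT NULL)"
            )
            conn.executemany("INSERT INTO jobs VALUES (?, ?)", rows)
    finally:
        conn.close()
    return len(rows)
```

Because the table is dropped and recreated on every call, the database holds nothing that cannot be regenerated from jobs/, which is the property that makes deleting and rebuilding an index safe.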

Documentation

Development

uv sync
uv run ruff check --fix .
uv run ruff format .
uv run ruff check .
uv run pytest -q
uv run pytest --cov=copilot_experiments --cov-report=term-missing:skip-covered

Install the local hooks once so that Ruff formatting and lint fixes run automatically on commit and tests run on push:

uv run pre-commit install --install-hooks
uv run pre-commit install --hook-type pre-push

The pre-push hook runs plain pytest for speed. Run the coverage command explicitly when you need branch coverage and missing-line details; CI also runs it for every push and pull request.

See AGENTS.md for contributor guidance.
