After a Pier trial runs with the local copilot-cli agent, the job output keeps the native
Copilot CLI session log (agent/copilot-session/**/events.jsonl). copilot-experiments
derives two views from that raw log:
- Flat metrics — counters used for
summary.json,show, and the SQLite index. SessionAnalysis— a richer, structured overview of what happened in the session, rendered byanalyze.
Legacy Python runs also persist these views as per-trial metrics.json and analysis.json; Pier
runs derive them from the canonical job artifacts on demand.
This page covers the second one and the analyze command that renders it.
Why a session log at all, and why split data from rendering? See ADR-0004 and ADR-0006. For the broader collection playbook, including raw
events.jsonl, stdout,--share, debug logs, workspace diffs, verification output, and OpenTelemetry, see Collecting data from a Copilot CLI run.
# Most recent Pier job (first trial by default)
uv run copilot-experiments analyze --last
# A specific Pier job / trial
uv run copilot-experiments analyze tracer-bullet-textstats --trial 1
# Any events.jsonl on disk — a stored trial log, or a live session under
# ~/.copilot/session-state/<id>/events.jsonl
uv run copilot-experiments analyze --file path/to/events.jsonl
# Direct file analysis enriched with OTel per-LLM-call spans
uv run copilot-experiments analyze --file path/to/events.jsonl --otel-file path/to/copilot-otel.jsonl
# Long sessions: cap the timeline table
uv run copilot-experiments analyze --last --max-turns 15The rendering (built with Rich) has these parts:
- Header — session id, model(s), reasoning effort, repo/branch, Copilot version, start time, and wall-clock duration.
- Totals — turns, user/assistant messages, tool calls, tool failures, warnings, hooks, tokens, and total event count.
- Tool usage — a histogram of how often each tool was invoked and how often it failed,
plus the total wall-clock time it spent (
dur) and the size of the results it fed back to the model (ctx, fromtoolTelemetry.metrics— a proxy for the input-token cost each tool injects). - Token economics — only when the log carries a
session.shutdown. A Cost (AIU) table decomposing spend across the four token types (input / cache-read / cache-write / output) with each type's token count and share of cost, plus a Session economics table (requests, API time, ms/request, current & peak context, system/tool-definition tokens, compactions, truncations, files modified, lines ±, AIU per line). Multi-model sessions also get a per-model table. See ADR-0011. - LLM calls (OTel) — only when
copilot-otel.jsonlis available. One row perchat <model>span with turn id, model, input tokens, cache-write tokens, output tokens, AIU, API time, and current context size. - Phases (temporal) — only when the session has at least five turns. The turns are split
into five contiguous, near-equal groups (
early→later) and each phase shows its turn span, tool calls, output tokens, share of total output, and duration. This mirrors the phase-level analysis in Bai et al. (their Finding #6: context construction dominates early phases, generation later ones) — tool-heavy, low-output early phases give way to higher output-share later ones. See the limitation note below. - Timeline — one row per assistant turn: time, duration, output tokens, the tools invoked in that turn, and a preview of what the assistant said.
Warnings, if any, are shown in a panel at the bottom.
Phases use native per-turn data only — no per-phase input/cache/cost yet. Native
events.jsonllogs input, cache, reasoning, and AIU only as session totals (session.shutdown), never per turn; the per-turnassistant.messagecarriesoutputTokensbut notinputTokens. Copilot's OTelchat <model>spans expose per-LLM-callinput_tokens,cache_creation_input_tokens,output_tokens,nano_aiu, and server duration when OTel export is enabled;analyzeauto-loads the harness-capturedcopilot-otel.jsonlwhen present, or accepts--otel-filefor direct log analysis. The current phase table still distributes only the native per-turn signals — output tokens, tool activity, and duration — and deliberately does not fabricate per-phase input or cost. The separate LLM calls (OTel) table carries the per-call economics.
A blank "assistant said" is normal — it is not a missing-data bug. Many turns invoke a tool without any accompanying prose, so the assistant text is genuinely empty and the row shows only the tool it called. Reasoning models (e.g. OpenAI's
gpt-5*family) do this on almost every turn and additionally keep their chain-of-thought in encrypted fields (reasoningOpaque/encryptedContent) that are not human-readable by design; chattier models (e.g. Claude) narrate alongside most tool calls. This is a property of the model, not of the harness, and it does not affect any measured metric — token counts and AIU come fromsession.shutdown, and thetoolscolumn still shows what each turn did. We deliberately do not force readable output on (e.g. via--enable-reasoning-summaries), because emitting extra text would perturb the very token economics we measure.
Cost is measured in AIU (GitHub's billing unit;
totalNanoAiu / 1e9). Premium requests are ignored — GitHub stopped using them on 2026-06-01.
analyze_events(events) -> SessionAnalysis (analysis.py)
produces plain pydantic data (no formatting), so the same object backs the CLI renderer, the
legacy analysis.json, and any future consumer.
| Field | Meaning |
|---|---|
session_id, copilot_version, producer |
Session identity. |
models, reasoning_effort |
Model(s) observed and the effort level. |
repository, branch, cwd |
Workspace context from session.start. |
started_at, finished_at, duration_s |
Span of the session. |
n_turns, n_user_messages, n_assistant_messages |
Conversation counts. |
n_tool_calls, n_tool_failures |
Tool execution counts (failures correlated by toolCallId). |
n_warnings, n_hooks, n_events |
Other counts. |
input_tokens, output_tokens, total_tokens |
Token usage (authoritative from session.shutdown when present). |
economics |
TokenEconomics: token-type split, AIU cost, context composition, productivity (see below). |
tools |
ToolStat(name, calls, failures, total_duration_ms, total_result_chars), sorted by calls desc. |
llm_calls |
LlmCallSummary rows parsed from OTel chat <model> spans when copilot-otel.jsonl is available. |
turns |
TurnSummary(turn_no, duration_s, tools, output_tokens, text_preview, …) plus OTel input_tokens, cache_creation_input_tokens, aiu, and api_duration_ms when available. |
phases |
PhaseStat(name, turn_from, turn_to, n_turns, n_tool_calls, output_tokens, duration_s, output_share) — five temporal phases (empty for sessions under five turns). |
warnings |
Warning messages. |
event_type_counts |
Histogram of raw event types. |
Parsed by extract_economics(events) from session.shutdown (authoritative; summed across
multiple shutdowns when a session was resumed) plus session.compaction_* / session.truncation.
Every field is best-effort — a session with no shutdown leaves the totals null.
| Field | Meaning |
|---|---|
input_tokens_noncached, cache_read_tokens, cache_write_tokens, output_tokens |
The four metered token types. |
reasoning_tokens |
Reasoning tokens (a subset of output). |
input_tokens_total, total_tokens |
Billed input (non-cached + cache read + cache write) and the grand total. |
aiu, aiu_by_type |
Total AIU cost and its decomposition across token types (sums to aiu). |
api_duration_ms, n_requests |
Model API wall-clock and request count. |
system_tokens, tool_definitions_tokens, conversation_tokens, context_tokens |
End-of-session context-window composition. |
peak_context_tokens |
Largest context observed (from compaction/truncation pre-sizes). |
n_compactions, n_truncations, compaction_aiu, tokens_removed_truncation |
Context-management dynamics and their cost. |
files_modified, lines_added, lines_removed |
Productivity, from codeChanges. |
model_metrics |
Per-model ModelMetric(requests, input/output/cache/reasoning tokens, aiu). |
The AIU math lives in pricing.py: it reads live
per-token-type rates from session.compaction_complete (costPerBatch) when available, falls back
to documented defaults otherwise, and normalises the split so it always sums to the authoritative
totalNanoAiu. See ADR-0011.
Pier agent/trajectory.json carries the same OTel facts when the local file export is present.
Assistant steps matched by turnId get ATIF metrics.prompt_tokens, metrics.completion_tokens,
llm_call_count, and metrics.extra.copilot_otel; final_metrics.extra.copilot_otel stores the
aggregate per-call totals. The native events.jsonl path remains authoritative for session-level
totals when a session.shutdown exists.
The Copilot CLI writes a per-session append-only stream of typed events. The fields the analysis relies on:
| Event | Used for |
|---|---|
session.start |
selectedModel, reasoningEffort, copilotVersion, context.{repository,branch,cwd}, startTime. |
user.message |
User-message count. |
assistant.turn_start / assistant.turn_end |
Turn boundaries (and per-turn duration via turnId). |
assistant.message |
model, content (preview), toolRequests, outputTokens. |
tool.execution_start |
toolName ↔ toolCallId mapping; per-turn tool order. |
tool.execution_complete |
success (correlated back to the tool via toolCallId); toolTelemetry.metrics.{durationMs,resultForLlmLength} for per-tool latency and context size. |
session.warning |
Warning messages. |
hook.start |
Hook count. |
session.shutdown |
Authoritative token economics: tokenDetails per type, totalNanoAiu, totalApiDurationMs, modelMetrics, context composition, codeChanges. |
session.compaction_complete |
Live per-token AIU rates (costPerBatch), compaction cost, and context peak. |
session.truncation |
Truncation count, tokens removed, context peak. |
Parsing is defensive: unknown event types are counted but otherwise ignored, and missing
fields degrade to "unknown" rather than raising. Because the raw events.jsonl is always
retained, an improved parser can re-analyze old runs without re-running them.
A browser-based explorer for experiments, runs, session logs, and aggregated cross-run data is planned but not yet built. The decision to ship the CLI analysis first and defer the web app is recorded in ADR-0007. The groundwork is deliberately in place:
SessionAnalysisis rendering-agnostic data (ADR-0006), so a web/HTTP layer can serve the same model the CLI renders.- Pier jobs keep raw native session logs plus ATIF trajectories, while the SQLite index
(
docs/results-format.md) already supports cross-run queries — the two data sources a web explorer would build on.
When it lands, it will be documented here.