Session-log analysis

After a Pier trial runs with the local copilot-cli agent, the job output keeps the native Copilot CLI session log (agent/copilot-session/**/events.jsonl). copilot-experiments derives two views from that raw log:

Flat metrics — counters used for summary.json, show, and the SQLite index.
SessionAnalysis — a richer, structured overview of what happened in the session, rendered by analyze.

Legacy Python runs also persist these views as per-trial metrics.json and analysis.json; Pier runs derive them from the canonical job artifacts on demand.

This page covers the second one and the analyze command that renders it.

Why a session log at all, and why split data from rendering? See ADR-0004 and ADR-0006. For the broader collection playbook, including raw events.jsonl, stdout, --share, debug logs, workspace diffs, verification output, and OpenTelemetry, see Collecting data from a Copilot CLI run.

The `analyze` command

# Most recent Pier job (first trial by default)
uv run copilot-experiments analyze --last

# A specific Pier job / trial
uv run copilot-experiments analyze tracer-bullet-textstats --trial 1

# Any events.jsonl on disk — a stored trial log, or a live session under
# ~/.copilot/session-state/<id>/events.jsonl
uv run copilot-experiments analyze --file path/to/events.jsonl

# Direct file analysis enriched with OTel per-LLM-call spans
uv run copilot-experiments analyze --file path/to/events.jsonl --otel-file path/to/copilot-otel.jsonl

# Long sessions: cap the timeline table
uv run copilot-experiments analyze --last --max-turns 15

The rendering (built with Rich) has these parts:

Header — session id, model(s), reasoning effort, repo/branch, Copilot version, start time, and wall-clock duration.
Totals — turns, user/assistant messages, tool calls, tool failures, warnings, hooks, tokens, and total event count.
Tool usage — a histogram of how often each tool was invoked and how often it failed, plus the total wall-clock time it spent (dur) and the size of the results it fed back to the model (ctx, from toolTelemetry.metrics — a proxy for the input-token cost each tool injects).
Token economics — only when the log carries a session.shutdown. A Cost (AIU) table decomposing spend across the four token types (input / cache-read / cache-write / output) with each type's token count and share of cost, plus a Session economics table (requests, API time, ms/request, current & peak context, system/tool-definition tokens, compactions, truncations, files modified, lines ±, AIU per line). Multi-model sessions also get a per-model table. See ADR-0011.
LLM calls (OTel) — only when copilot-otel.jsonl is available. One row per chat <model> span with turn id, model, input tokens, cache-write tokens, output tokens, AIU, API time, and current context size.
Phases (temporal) — only when the session has at least five turns. The turns are split into five contiguous, near-equal groups (early → later) and each phase shows its turn span, tool calls, output tokens, share of total output, and duration. This mirrors the phase-level analysis in Bai et al. (their Finding #6: context construction dominates early phases, generation later ones) — tool-heavy, low-output early phases give way to higher output-share later ones. See the limitation note below.
Timeline — one row per assistant turn: time, duration, output tokens, the tools invoked in that turn, and a preview of what the assistant said.

Warnings, if any, are shown in a panel at the bottom.

Phases use native per-turn data only — no per-phase input/cache/cost yet. Native events.jsonl logs input, cache, reasoning, and AIU only as session totals (session.shutdown), never per turn; the per-turn assistant.message carries outputTokens but not inputTokens. Copilot's OTel chat <model> spans expose per-LLM-call input_tokens, cache_creation_input_tokens, output_tokens, nano_aiu, and server duration when OTel export is enabled; analyze auto-loads the harness-captured copilot-otel.jsonl when present, or accepts --otel-file for direct log analysis. The current phase table still distributes only the native per-turn signals — output tokens, tool activity, and duration — and deliberately does not fabricate per-phase input or cost. The separate LLM calls (OTel) table carries the per-call economics.

A blank "assistant said" is normal — it is not a missing-data bug. Many turns invoke a tool without any accompanying prose, so the assistant text is genuinely empty and the row shows only the tool it called. Reasoning models (e.g. OpenAI's gpt-5* family) do this on almost every turn and additionally keep their chain-of-thought in encrypted fields (reasoningOpaque / encryptedContent) that are not human-readable by design; chattier models (e.g. Claude) narrate alongside most tool calls. This is a property of the model, not of the harness, and it does not affect any measured metric — token counts and AIU come from session.shutdown, and the tools column still shows what each turn did. We deliberately do not force readable output on (e.g. via --enable-reasoning-summaries), because emitting extra text would perturb the very token economics we measure.

Cost is measured in AIU (GitHub's billing unit; totalNanoAiu / 1e9). Premium requests are ignored — GitHub stopped using them on 2026-06-01.

The `SessionAnalysis` model

analyze_events(events) -> SessionAnalysis (analysis.py) produces plain pydantic data (no formatting), so the same object backs the CLI renderer, the legacy analysis.json, and any future consumer.

Field	Meaning
`session_id`, `copilot_version`, `producer`	Session identity.
`models`, `reasoning_effort`	Model(s) observed and the effort level.
`repository`, `branch`, `cwd`	Workspace context from `session.start`.
`started_at`, `finished_at`, `duration_s`	Span of the session.
`n_turns`, `n_user_messages`, `n_assistant_messages`	Conversation counts.
`n_tool_calls`, `n_tool_failures`	Tool execution counts (failures correlated by `toolCallId`).
`n_warnings`, `n_hooks`, `n_events`	Other counts.
`input_tokens`, `output_tokens`, `total_tokens`	Token usage (authoritative from `session.shutdown` when present).
`economics`	`TokenEconomics`: token-type split, AIU cost, context composition, productivity (see below).
`tools`	`ToolStat(name, calls, failures, total_duration_ms, total_result_chars)`, sorted by calls desc.
`llm_calls`	`LlmCallSummary` rows parsed from OTel `chat <model>` spans when `copilot-otel.jsonl` is available.
`turns`	`TurnSummary(turn_no, duration_s, tools, output_tokens, text_preview, …)` plus OTel `input_tokens`, `cache_creation_input_tokens`, `aiu`, and `api_duration_ms` when available.
`phases`	`PhaseStat(name, turn_from, turn_to, n_turns, n_tool_calls, output_tokens, duration_s, output_share)` — five temporal phases (empty for sessions under five turns).
`warnings`	Warning messages.
`event_type_counts`	Histogram of raw event `type`s.

`TokenEconomics`

Parsed by extract_economics(events) from session.shutdown (authoritative; summed across multiple shutdowns when a session was resumed) plus session.compaction_* / session.truncation. Every field is best-effort — a session with no shutdown leaves the totals null.

Field	Meaning
`input_tokens_noncached`, `cache_read_tokens`, `cache_write_tokens`, `output_tokens`	The four metered token types.
`reasoning_tokens`	Reasoning tokens (a subset of output).
`input_tokens_total`, `total_tokens`	Billed input (non-cached + cache read + cache write) and the grand total.
`aiu`, `aiu_by_type`	Total AIU cost and its decomposition across token types (sums to `aiu`).
`api_duration_ms`, `n_requests`	Model API wall-clock and request count.
`system_tokens`, `tool_definitions_tokens`, `conversation_tokens`, `context_tokens`	End-of-session context-window composition.
`peak_context_tokens`	Largest context observed (from compaction/truncation pre-sizes).
`n_compactions`, `n_truncations`, `compaction_aiu`, `tokens_removed_truncation`	Context-management dynamics and their cost.
`files_modified`, `lines_added`, `lines_removed`	Productivity, from `codeChanges`.
`model_metrics`	Per-model `ModelMetric(requests, input/output/cache/reasoning tokens, aiu)`.

The AIU math lives in pricing.py: it reads live per-token-type rates from session.compaction_complete (costPerBatch) when available, falls back to documented defaults otherwise, and normalises the split so it always sums to the authoritative totalNanoAiu. See ADR-0011.

Pier agent/trajectory.json carries the same OTel facts when the local file export is present. Assistant steps matched by turnId get ATIF metrics.prompt_tokens, metrics.completion_tokens, llm_call_count, and metrics.extra.copilot_otel; final_metrics.extra.copilot_otel stores the aggregate per-call totals. The native events.jsonl path remains authoritative for session-level totals when a session.shutdown exists.

How the session log is read

The Copilot CLI writes a per-session append-only stream of typed events. The fields the analysis relies on:

Event	Used for
`session.start`	`selectedModel`, `reasoningEffort`, `copilotVersion`, `context.{repository,branch,cwd}`, `startTime`.
`user.message`	User-message count.
`assistant.turn_start` / `assistant.turn_end`	Turn boundaries (and per-turn duration via `turnId`).
`assistant.message`	`model`, `content` (preview), `toolRequests`, `outputTokens`.
`tool.execution_start`	`toolName` ↔ `toolCallId` mapping; per-turn tool order.
`tool.execution_complete`	`success` (correlated back to the tool via `toolCallId`); `toolTelemetry.metrics.{durationMs,resultForLlmLength}` for per-tool latency and context size.
`session.warning`	Warning messages.
`hook.start`	Hook count.
`session.shutdown`	Authoritative token economics: `tokenDetails` per type, `totalNanoAiu`, `totalApiDurationMs`, `modelMetrics`, context composition, `codeChanges`.
`session.compaction_complete`	Live per-token AIU rates (`costPerBatch`), compaction cost, and context peak.
`session.truncation`	Truncation count, tokens removed, context peak.

Parsing is defensive: unknown event types are counted but otherwise ignored, and missing fields degrade to "unknown" rather than raising. Because the raw events.jsonl is always retained, an improved parser can re-analyze old runs without re-running them.

Web explorer (TBD)

A browser-based explorer for experiments, runs, session logs, and aggregated cross-run data is planned but not yet built. The decision to ship the CLI analysis first and defer the web app is recorded in ADR-0007. The groundwork is deliberately in place:

SessionAnalysis is rendering-agnostic data (ADR-0006), so a web/HTTP layer can serve the same model the CLI renders.
Pier jobs keep raw native session logs plus ATIF trajectories, while the SQLite index (docs/results-format.md) already supports cross-run queries — the two data sources a web explorer would build on.

When it lands, it will be documented here.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Session-log analysis

The `analyze` command

The `SessionAnalysis` model

`TokenEconomics`

How the session log is read

Web explorer (TBD)

FilesExpand file tree

analysis.md

Latest commit

History

analysis.md

File metadata and controls

Session-log analysis

The analyze command

The SessionAnalysis model

TokenEconomics

How the session log is read

Web explorer (TBD)

The `analyze` command

The `SessionAnalysis` model

`TokenEconomics`