Skip to content

Latest commit

 

History

History
187 lines (154 loc) · 11.8 KB

File metadata and controls

187 lines (154 loc) · 11.8 KB

Session-log analysis

After a Pier trial runs with the local copilot-cli agent, the job output keeps the native Copilot CLI session log (agent/copilot-session/**/events.jsonl). copilot-experiments derives two views from that raw log:

  • Flat metrics — counters used for summary.json, show, and the SQLite index.
  • SessionAnalysis — a richer, structured overview of what happened in the session, rendered by analyze.

Legacy Python runs also persist these views as per-trial metrics.json and analysis.json; Pier runs derive them from the canonical job artifacts on demand.

This page covers the second one and the analyze command that renders it.

Why a session log at all, and why split data from rendering? See ADR-0004 and ADR-0006. For the broader collection playbook, including raw events.jsonl, stdout, --share, debug logs, workspace diffs, verification output, and OpenTelemetry, see Collecting data from a Copilot CLI run.

The analyze command

# Most recent Pier job (first trial by default)
uv run copilot-experiments analyze --last

# A specific Pier job / trial
uv run copilot-experiments analyze tracer-bullet-textstats --trial 1

# Any events.jsonl on disk — a stored trial log, or a live session under
# ~/.copilot/session-state/<id>/events.jsonl
uv run copilot-experiments analyze --file path/to/events.jsonl

# Direct file analysis enriched with OTel per-LLM-call spans
uv run copilot-experiments analyze --file path/to/events.jsonl --otel-file path/to/copilot-otel.jsonl

# Long sessions: cap the timeline table
uv run copilot-experiments analyze --last --max-turns 15

The rendering (built with Rich) has these parts:

  1. Header — session id, model(s), reasoning effort, repo/branch, Copilot version, start time, and wall-clock duration.
  2. Totals — turns, user/assistant messages, tool calls, tool failures, warnings, hooks, tokens, and total event count.
  3. Tool usage — a histogram of how often each tool was invoked and how often it failed, plus the total wall-clock time it spent (dur) and the size of the results it fed back to the model (ctx, from toolTelemetry.metrics — a proxy for the input-token cost each tool injects).
  4. Token economicsonly when the log carries a session.shutdown. A Cost (AIU) table decomposing spend across the four token types (input / cache-read / cache-write / output) with each type's token count and share of cost, plus a Session economics table (requests, API time, ms/request, current & peak context, system/tool-definition tokens, compactions, truncations, files modified, lines ±, AIU per line). Multi-model sessions also get a per-model table. See ADR-0011.
  5. LLM calls (OTel)only when copilot-otel.jsonl is available. One row per chat <model> span with turn id, model, input tokens, cache-write tokens, output tokens, AIU, API time, and current context size.
  6. Phases (temporal)only when the session has at least five turns. The turns are split into five contiguous, near-equal groups (earlylater) and each phase shows its turn span, tool calls, output tokens, share of total output, and duration. This mirrors the phase-level analysis in Bai et al. (their Finding #6: context construction dominates early phases, generation later ones) — tool-heavy, low-output early phases give way to higher output-share later ones. See the limitation note below.
  7. Timeline — one row per assistant turn: time, duration, output tokens, the tools invoked in that turn, and a preview of what the assistant said.

Warnings, if any, are shown in a panel at the bottom.

Phases use native per-turn data only — no per-phase input/cache/cost yet. Native events.jsonl logs input, cache, reasoning, and AIU only as session totals (session.shutdown), never per turn; the per-turn assistant.message carries outputTokens but not inputTokens. Copilot's OTel chat <model> spans expose per-LLM-call input_tokens, cache_creation_input_tokens, output_tokens, nano_aiu, and server duration when OTel export is enabled; analyze auto-loads the harness-captured copilot-otel.jsonl when present, or accepts --otel-file for direct log analysis. The current phase table still distributes only the native per-turn signals — output tokens, tool activity, and duration — and deliberately does not fabricate per-phase input or cost. The separate LLM calls (OTel) table carries the per-call economics.

A blank "assistant said" is normal — it is not a missing-data bug. Many turns invoke a tool without any accompanying prose, so the assistant text is genuinely empty and the row shows only the tool it called. Reasoning models (e.g. OpenAI's gpt-5* family) do this on almost every turn and additionally keep their chain-of-thought in encrypted fields (reasoningOpaque / encryptedContent) that are not human-readable by design; chattier models (e.g. Claude) narrate alongside most tool calls. This is a property of the model, not of the harness, and it does not affect any measured metric — token counts and AIU come from session.shutdown, and the tools column still shows what each turn did. We deliberately do not force readable output on (e.g. via --enable-reasoning-summaries), because emitting extra text would perturb the very token economics we measure.

Cost is measured in AIU (GitHub's billing unit; totalNanoAiu / 1e9). Premium requests are ignored — GitHub stopped using them on 2026-06-01.

The SessionAnalysis model

analyze_events(events) -> SessionAnalysis (analysis.py) produces plain pydantic data (no formatting), so the same object backs the CLI renderer, the legacy analysis.json, and any future consumer.

Field Meaning
session_id, copilot_version, producer Session identity.
models, reasoning_effort Model(s) observed and the effort level.
repository, branch, cwd Workspace context from session.start.
started_at, finished_at, duration_s Span of the session.
n_turns, n_user_messages, n_assistant_messages Conversation counts.
n_tool_calls, n_tool_failures Tool execution counts (failures correlated by toolCallId).
n_warnings, n_hooks, n_events Other counts.
input_tokens, output_tokens, total_tokens Token usage (authoritative from session.shutdown when present).
economics TokenEconomics: token-type split, AIU cost, context composition, productivity (see below).
tools ToolStat(name, calls, failures, total_duration_ms, total_result_chars), sorted by calls desc.
llm_calls LlmCallSummary rows parsed from OTel chat <model> spans when copilot-otel.jsonl is available.
turns TurnSummary(turn_no, duration_s, tools, output_tokens, text_preview, …) plus OTel input_tokens, cache_creation_input_tokens, aiu, and api_duration_ms when available.
phases PhaseStat(name, turn_from, turn_to, n_turns, n_tool_calls, output_tokens, duration_s, output_share) — five temporal phases (empty for sessions under five turns).
warnings Warning messages.
event_type_counts Histogram of raw event types.

TokenEconomics

Parsed by extract_economics(events) from session.shutdown (authoritative; summed across multiple shutdowns when a session was resumed) plus session.compaction_* / session.truncation. Every field is best-effort — a session with no shutdown leaves the totals null.

Field Meaning
input_tokens_noncached, cache_read_tokens, cache_write_tokens, output_tokens The four metered token types.
reasoning_tokens Reasoning tokens (a subset of output).
input_tokens_total, total_tokens Billed input (non-cached + cache read + cache write) and the grand total.
aiu, aiu_by_type Total AIU cost and its decomposition across token types (sums to aiu).
api_duration_ms, n_requests Model API wall-clock and request count.
system_tokens, tool_definitions_tokens, conversation_tokens, context_tokens End-of-session context-window composition.
peak_context_tokens Largest context observed (from compaction/truncation pre-sizes).
n_compactions, n_truncations, compaction_aiu, tokens_removed_truncation Context-management dynamics and their cost.
files_modified, lines_added, lines_removed Productivity, from codeChanges.
model_metrics Per-model ModelMetric(requests, input/output/cache/reasoning tokens, aiu).

The AIU math lives in pricing.py: it reads live per-token-type rates from session.compaction_complete (costPerBatch) when available, falls back to documented defaults otherwise, and normalises the split so it always sums to the authoritative totalNanoAiu. See ADR-0011.

Pier agent/trajectory.json carries the same OTel facts when the local file export is present. Assistant steps matched by turnId get ATIF metrics.prompt_tokens, metrics.completion_tokens, llm_call_count, and metrics.extra.copilot_otel; final_metrics.extra.copilot_otel stores the aggregate per-call totals. The native events.jsonl path remains authoritative for session-level totals when a session.shutdown exists.

How the session log is read

The Copilot CLI writes a per-session append-only stream of typed events. The fields the analysis relies on:

Event Used for
session.start selectedModel, reasoningEffort, copilotVersion, context.{repository,branch,cwd}, startTime.
user.message User-message count.
assistant.turn_start / assistant.turn_end Turn boundaries (and per-turn duration via turnId).
assistant.message model, content (preview), toolRequests, outputTokens.
tool.execution_start toolNametoolCallId mapping; per-turn tool order.
tool.execution_complete success (correlated back to the tool via toolCallId); toolTelemetry.metrics.{durationMs,resultForLlmLength} for per-tool latency and context size.
session.warning Warning messages.
hook.start Hook count.
session.shutdown Authoritative token economics: tokenDetails per type, totalNanoAiu, totalApiDurationMs, modelMetrics, context composition, codeChanges.
session.compaction_complete Live per-token AIU rates (costPerBatch), compaction cost, and context peak.
session.truncation Truncation count, tokens removed, context peak.

Parsing is defensive: unknown event types are counted but otherwise ignored, and missing fields degrade to "unknown" rather than raising. Because the raw events.jsonl is always retained, an improved parser can re-analyze old runs without re-running them.

Web explorer (TBD)

A browser-based explorer for experiments, runs, session logs, and aggregated cross-run data is planned but not yet built. The decision to ship the CLI analysis first and defer the web app is recorded in ADR-0007. The groundwork is deliberately in place:

  • SessionAnalysis is rendering-agnostic data (ADR-0006), so a web/HTTP layer can serve the same model the CLI renders.
  • Pier jobs keep raw native session logs plus ATIF trajectories, while the SQLite index (docs/results-format.md) already supports cross-run queries — the two data sources a web explorer would build on.

When it lands, it will be documented here.