diff --git a/docs/adr/0011-token-economics-from-session-shutdown.md b/docs/adr/0011-token-economics-from-session-shutdown.md index 107da41..eb7fa59 100644 --- a/docs/adr/0011-token-economics-from-session-shutdown.md +++ b/docs/adr/0011-token-economics-from-session-shutdown.md @@ -72,20 +72,21 @@ On 2026-06-21, `copilot help monitoring` and local file-exporter probes with Cop `1.0.64-0` showed an additional optional source: OpenTelemetry `chat ` spans carry per-LLM-call `gen_ai.usage.input_tokens`, `gen_ai.usage.cache_creation_input_tokens`, `gen_ai.usage.output_tokens`, `github.copilot.nano_aiu`, `github.copilot.server_duration`, and -`github.copilot.turn_id`. In the probes, summing chat spans matched the native +`github.copilot.turn_id`. Newer/local exporter variants may also include per-call cache-read fields +such as `gen_ai.usage.cache_read_input_tokens`. In the probes, summing chat spans matched the native `session.shutdown` totals for input, cache-write, output, nano-AIU, and API duration. We integrated this as a complement, not a replacement. The harness now enables local OTel file export to `copilot-otel.jsonl` for Copilot agent runs when no explicit OTLP destination is configured, and `SessionAnalysis` ingests that file when present to populate a per-call LLM table -and annotate turns with OTel input/cache-write/output/AIU/API-duration fields. Native +and annotate turns with OTel input/cache-read/cache-write/output/AIU/API-duration fields. Native `events.jsonl` remains the authoritative source for session-level totals and richer forensic -details; OTel still does not split each call's input into non-cached input versus cache-read tokens. +details; OTel cache-read fields are captured when present, otherwise that split remains session-level. The Pier ATIF trajectory also carries this complement now. Assistant steps matched by `github.copilot.turn_id` get ATIF prompt/completion token metrics plus `metrics.extra.copilot_otel` for the OTel-only fields (`cache_creation_input_tokens`, AIU, server -duration, context size, token limit, and normalized per-call records). `final_metrics.extra` stores -the aggregate OTel totals. We deliberately do not map OTel cache creation into ATIF -`cached_tokens`/`total_cached_tokens`, because those fields represent cache reads in our native -Copilot economics. +duration, context size, token limit, optional cache-read tokens, and normalized per-call records). +`final_metrics.extra` stores the aggregate OTel totals. We deliberately do not map OTel cache +creation into ATIF `cached_tokens`/`total_cached_tokens`, because those fields represent cache reads +in our native Copilot economics. diff --git a/docs/analysis.md b/docs/analysis.md index 264b300..5fe3c34 100644 --- a/docs/analysis.md +++ b/docs/analysis.md @@ -56,8 +56,8 @@ The rendering (built with [Rich](https://rich.readthedocs.io/)) has these parts: truncations, files modified, lines ±, AIU per line). Multi-model sessions also get a per-model table. See [ADR-0011](adr/0011-token-economics-from-session-shutdown.md). 5. **LLM calls (OTel)** — *only when `copilot-otel.jsonl` is available.* One row per `chat ` - span with turn id, model, input tokens, cache-write tokens, output tokens, AIU, API time, and - current context size. + span with turn id, model, input tokens, cache-read/cache-write tokens when exported, output and + total tokens, AIU, wall/API time, and current/limit context size. 6. **Phases (temporal)** — *only when the session has at least five turns.* The turns are split into five contiguous, near-equal groups (`early` → `later`) and each phase shows its turn span, tool calls, output tokens, share of total output, and duration. This mirrors the phase-level @@ -73,8 +73,9 @@ Warnings, if any, are shown in a panel at the bottom. > `events.jsonl` logs input, cache, reasoning, and AIU **only as session totals** > (`session.shutdown`), never per turn; the per-turn `assistant.message` carries `outputTokens` but > not `inputTokens`. Copilot's OTel `chat ` spans expose per-LLM-call `input_tokens`, -> `cache_creation_input_tokens`, `output_tokens`, `nano_aiu`, and server duration when OTel export is -> enabled; `analyze` auto-loads the harness-captured `copilot-otel.jsonl` when present, or accepts +> optional cache-read details, `cache_creation_input_tokens`, `output_tokens`, `nano_aiu`, and +> server duration when OTel export is enabled; `analyze` auto-loads the harness-captured +> `copilot-otel.jsonl` when present, or accepts > `--otel-file` for direct log analysis. The current phase table still distributes only the native > per-turn signals — output tokens, tool activity, and duration — and deliberately does **not** > fabricate per-phase input or cost. The separate **LLM calls (OTel)** table carries the per-call @@ -113,7 +114,7 @@ legacy `analysis.json`, and any future consumer. | `economics` | `TokenEconomics`: token-type split, AIU cost, context composition, productivity (see below). | | `tools` | `ToolStat(name, calls, failures, total_duration_ms, total_result_chars)`, sorted by calls desc. | | `llm_calls` | `LlmCallSummary` rows parsed from OTel `chat ` spans when `copilot-otel.jsonl` is available. | -| `turns` | `TurnSummary(turn_no, duration_s, tools, output_tokens, text_preview, …)` plus OTel `input_tokens`, `cache_creation_input_tokens`, `aiu`, and `api_duration_ms` when available. | +| `turns` | `TurnSummary(turn_no, duration_s, tools, output_tokens, text_preview, …)` plus OTel `input_tokens`, optional `cache_read_input_tokens`, `cache_creation_input_tokens`, `aiu`, and `api_duration_ms` when available. | | `phases` | `PhaseStat(name, turn_from, turn_to, n_turns, n_tool_calls, output_tokens, duration_s, output_share)` — five temporal phases (empty for sessions under five turns). | | `warnings` | Warning messages. | | `event_type_counts` | Histogram of raw event `type`s. | diff --git a/docs/collecting-run-data.md b/docs/collecting-run-data.md index 562ea66..43a42a8 100644 --- a/docs/collecting-run-data.md +++ b/docs/collecting-run-data.md @@ -319,9 +319,10 @@ When `agent/copilot-otel.jsonl` is present, each assistant step that can be matc `metrics.extra.copilot_otel`. That extra payload preserves the per-call OTel facts the standard ATIF token fields cannot represent directly: `cache_creation_input_tokens`, `aiu`, `server_duration_ms`, current context size, token limit, and the raw normalized `llm_calls` list. -`final_metrics.extra.copilot_otel` stores the run-level OTel aggregate. ATIF -`total_cached_tokens` remains the native cache-read total; OTel cache creation is not written there -because Copilot OTel does not expose a per-call cache-read split. +When the exporter includes per-call cache reads, the payload also includes +`cache_read_input_tokens`. `final_metrics.extra.copilot_otel` stores the run-level OTel aggregate. +ATIF `total_cached_tokens` uses the native cache-read total when available and falls back to OTel +cache reads when OTel exposes them. Use ATIF for cross-agent comparisons and fallback analysis. Use native `events.jsonl` for Copilot-specific economics, compaction/truncation detail, and the most faithful timeline. @@ -491,7 +492,7 @@ Observed span records from local file-exporter probes: | Span | Useful fields | What it adds beyond `events.jsonl` | | --- | --- | --- | | `invoke_agent` | `gen_ai.conversation.id`, `gen_ai.agent.id`, `gen_ai.agent.version`, `gen_ai.request.model`, `github.copilot.agent.type`, `github.copilot.turn_count`, aggregate `gen_ai.usage.*`, aggregate `github.copilot.nano_aiu`, `github.copilot.context.skills`, `github.copilot.context.mcp_server_names`, `github.copilot.context.custom_agent_names` | Agent-level trace root and aggregate usage/cost for the invocation. MCP server names are hashed. | -| `chat ` | `gen_ai.conversation.id`, `github.copilot.turn_id`, `gen_ai.request.model`, `gen_ai.response.model`, `gen_ai.response.id`, `gen_ai.response.finish_reasons`, `gen_ai.usage.input_tokens`, `gen_ai.usage.cache_creation_input_tokens`, `gen_ai.usage.output_tokens`, `github.copilot.nano_aiu`, `github.copilot.server_duration`, `github.copilot.interaction_id`, `github.copilot.service_request_id` | Per-LLM-call input/output/cache-write tokens, nano-AIU, and server duration. This is the main data source for turn-level economics. | +| `chat ` | `gen_ai.conversation.id`, `github.copilot.turn_id`, `gen_ai.request.model`, `gen_ai.response.model`, `gen_ai.response.id`, `gen_ai.response.finish_reasons`, `gen_ai.usage.input_tokens`, optional cache-read usage fields, `gen_ai.usage.cache_creation_input_tokens`, `gen_ai.usage.output_tokens`, `github.copilot.nano_aiu`, `github.copilot.server_duration`, `github.copilot.interaction_id`, `github.copilot.service_request_id` | Per-LLM-call input/output/cache-read/cache-write tokens when exported, nano-AIU, and server duration. This is the main data source for turn-level economics. | | `execute_tool ` | `gen_ai.operation.name`, `gen_ai.tool.name`, `gen_ai.tool.call.id`, `gen_ai.tool.type`, `gen_ai.provider.name`, span status | Standard tool spans that can be joined to chat spans by trace and to native events by tool-call id. Native `tool.execution_complete.toolTelemetry` remains richer for result sizes and file/code metadata. | `chat ` spans also emitted a `github.copilot.session.usage_info` event with @@ -505,15 +506,16 @@ Per-call token/cost interpretation: | Field | Interpretation | | --- | --- | | `gen_ai.usage.input_tokens` | Total prompt/input tokens for that LLM request. In observed runs, summing chat spans matched `session.shutdown.modelMetrics..usage.inputTokens`. | +| `gen_ai.usage.cache_read_input_tokens` | Cache-read tokens for that request when exported. The parser also accepts common aliases such as `gen_ai.usage.cached_tokens`. | | `gen_ai.usage.cache_creation_input_tokens` | Cache-write tokens for that request. In observed runs, summing chat spans matched `session.shutdown.tokenDetails.cache_write.tokenCount`. | | `gen_ai.usage.output_tokens` | Output tokens for that request. | | `github.copilot.nano_aiu` | Per-request cost in nano-AIU. Prefer this over `github.copilot.cost` for economics; summing observed chat spans matched `session.shutdown.totalNanoAiu`. | | `github.copilot.server_duration` | Per-request server/API duration in milliseconds; summing observed chat spans matched `session.shutdown.totalApiDurationMs`. | | `github.copilot.turn_id` | Join key to native `assistant.turn_*` / `assistant.message` events. | -The observed OTel spans did **not** split per-call `input_tokens` into non-cached input versus -cache-read tokens. That split remains session-level in `session.shutdown.tokenDetails`; per call, we -can capture total input, cache creation, output, server duration, and AIU. +Some observed OTel spans did **not** split per-call `input_tokens` into non-cached input versus +cache-read tokens. When the exporter provides cache-read fields, the parser captures them; otherwise +that split remains session-level in `session.shutdown.tokenDetails`. Metrics currently documented by the CLI help: diff --git a/src/copilot_experiments/analysis.py b/src/copilot_experiments/analysis.py index 0673eb1..62b5fd0 100644 --- a/src/copilot_experiments/analysis.py +++ b/src/copilot_experiments/analysis.py @@ -271,6 +271,7 @@ def _apply_otel_records(analysis: SessionAnalysis, records: list[dict[str, Any]] if turn is None: continue _add_turn_int(turn, "input_tokens", call.input_tokens) + _add_turn_int(turn, "cache_read_input_tokens", call.cache_read_input_tokens) _add_turn_int(turn, "cache_creation_input_tokens", call.cache_creation_input_tokens) if turn.output_tokens is None: _add_turn_int(turn, "output_tokens", call.output_tokens) @@ -283,6 +284,7 @@ def _apply_otel_records(analysis: SessionAnalysis, records: list[dict[str, Any]] return input_tokens = _sum_int(call.input_tokens for call in calls) + cache_read_tokens = _sum_int(call.cache_read_input_tokens for call in calls) cache_creation_tokens = _sum_int(call.cache_creation_input_tokens for call in calls) output_tokens = _sum_int(call.output_tokens for call in calls) api_duration_ms = _sum_int(call.server_duration_ms for call in calls) @@ -291,6 +293,8 @@ def _apply_otel_records(analysis: SessionAnalysis, records: list[dict[str, Any]] if input_tokens is not None: analysis.input_tokens = input_tokens analysis.economics.input_tokens_total = input_tokens + if cache_read_tokens is not None: + analysis.economics.cache_read_tokens = cache_read_tokens if cache_creation_tokens is not None: analysis.economics.cache_write_tokens = cache_creation_tokens if output_tokens is not None: @@ -339,6 +343,17 @@ def llm_calls_from_otel(records: list[dict[str, Any]]) -> list[LlmCallSummary]: response_id=_otel_string(attrs.get("gen_ai.response.id")), finish_reasons=_strings(attrs.get("gen_ai.response.finish_reasons")), input_tokens=input_tokens, + cache_read_input_tokens=_first_otel_int( + attrs, + "gen_ai.usage.cache_read_input_tokens", + "gen_ai.usage.cache_read_tokens", + "gen_ai.usage.cached_input_tokens", + "gen_ai.usage.cached_tokens", + "gen_ai.usage.prompt_tokens_details.cached_tokens", + "gen_ai.usage.input_token_details.cached_tokens", + "gen_ai.usage.input_token_details.cache_read_tokens", + "gen_ai.usage.input_tokens_details.cached_tokens", + ), cache_creation_input_tokens=_otel_int( attrs.get("gen_ai.usage.cache_creation_input_tokens") ), @@ -372,6 +387,32 @@ def _chat_usage_event(events: Any) -> dict[str, Any]: return {} +def _first_otel_int(attrs: dict[str, Any], *keys: str) -> int | None: + for key in keys: + value = _otel_attr_value(attrs, key) + if value is None: + continue + parsed = _otel_int(value) + if parsed is not None: + return parsed + return None + + +def _otel_attr_value(attrs: dict[str, Any], key: str) -> Any: + if key in attrs: + return attrs[key] + parts = key.split(".") + for index in range(len(parts) - 1, 0, -1): + prefix = ".".join(parts[:index]) + value = attrs.get(prefix) + if not isinstance(value, dict): + continue + for part in parts[index:]: + value = value.get(part) if isinstance(value, dict) else None + return value + return None + + def _otel_attrs(raw: Any) -> dict[str, Any]: if isinstance(raw, dict): return {str(k): _otel_value(v) for k, v in raw.items()} diff --git a/src/copilot_experiments/invoker.py b/src/copilot_experiments/invoker.py index 2ee2bbb..8e6bcc0 100644 --- a/src/copilot_experiments/invoker.py +++ b/src/copilot_experiments/invoker.py @@ -363,6 +363,7 @@ def _write_synthetic_otel(path: Path, events: list[dict], model: str) -> None: data = ev.get("data") or {} output_tokens = int(data.get("outputTokens") or 0) input_tokens = 1000 + i * 100 + cache_read_tokens = 900 + i * 90 cache_creation_tokens = 100 + i * 10 nano_aiu = 100_000_000 + i * 10_000_000 record = { @@ -373,6 +374,7 @@ def _write_synthetic_otel(path: Path, events: list[dict], model: str) -> None: "gen_ai.request.model": model, "gen_ai.response.model": model, "gen_ai.usage.input_tokens": input_tokens, + "gen_ai.usage.cache_read_input_tokens": cache_read_tokens, "gen_ai.usage.cache_creation_input_tokens": cache_creation_tokens, "gen_ai.usage.output_tokens": output_tokens, "github.copilot.nano_aiu": nano_aiu, diff --git a/src/copilot_experiments/models.py b/src/copilot_experiments/models.py index 0b897ff..5a70a52 100644 --- a/src/copilot_experiments/models.py +++ b/src/copilot_experiments/models.py @@ -344,6 +344,7 @@ class LlmCallSummary(BaseModel): response_id: str | None = None finish_reasons: list[str] = Field(default_factory=list) input_tokens: int | None = None + cache_read_input_tokens: int | None = None cache_creation_input_tokens: int | None = None output_tokens: int | None = None total_tokens: int | None = None @@ -368,6 +369,7 @@ class TurnSummary(BaseModel): tools: list[str] = Field(default_factory=list) output_tokens: int | None = None input_tokens: int | None = None + cache_read_input_tokens: int | None = None cache_creation_input_tokens: int | None = None aiu: float | None = None api_duration_ms: int | None = None diff --git a/src/copilot_experiments/pier_agents/copilot_cli.py b/src/copilot_experiments/pier_agents/copilot_cli.py index f7d1f15..02ced35 100644 --- a/src/copilot_experiments/pier_agents/copilot_cli.py +++ b/src/copilot_experiments/pier_agents/copilot_cli.py @@ -490,7 +490,7 @@ def append_step(step: Step) -> None: total_cached_tokens=( parsed_metrics.cache_read_tokens if parsed_metrics is not None and parsed_metrics.cache_read_tokens is not None - else None + else otel_totals.get("cache_read_input_tokens") ), total_steps=len(steps), extra=extra or None, @@ -572,6 +572,7 @@ def _apply_otel_metrics(step: Step, calls: list[LlmCallSummary]) -> None: metrics = step.metrics or Metrics() input_tokens = _sum_optional_int(call.input_tokens for call in calls) + cache_read_input_tokens = _sum_optional_int(call.cache_read_input_tokens for call in calls) output_tokens = _sum_optional_int(call.output_tokens for call in calls) if input_tokens is not None: metrics.prompt_tokens = input_tokens @@ -584,6 +585,7 @@ def _apply_otel_metrics(step: Step, calls: list[LlmCallSummary]) -> None: { "llm_call_count": len(calls), "input_tokens": input_tokens, + "cache_read_input_tokens": cache_read_input_tokens, "cache_creation_input_tokens": _sum_optional_int( call.cache_creation_input_tokens for call in calls ), @@ -608,6 +610,9 @@ def _otel_final_totals(calls: list[LlmCallSummary]) -> dict[str, Any]: { "llm_call_count": len(calls), "input_tokens": _sum_optional_int(call.input_tokens for call in calls), + "cache_read_input_tokens": _sum_optional_int( + call.cache_read_input_tokens for call in calls + ), "cache_creation_input_tokens": _sum_optional_int( call.cache_creation_input_tokens for call in calls ), diff --git a/src/copilot_experiments/render.py b/src/copilot_experiments/render.py index 376b9cf..5d72ea9 100644 --- a/src/copilot_experiments/render.py +++ b/src/copilot_experiments/render.py @@ -192,25 +192,36 @@ def _llm_calls_table(a: SessionAnalysis) -> Table | None: table.add_column("turn", justify="right") table.add_column("model") table.add_column("in", justify="right") + table.add_column("cache read", justify="right") table.add_column("cache write", justify="right") table.add_column("out", justify="right") + table.add_column("total", justify="right") table.add_column("AIU", justify="right") + table.add_column("wall", justify="right") table.add_column("API", justify="right") - table.add_column("ctx", justify="right") + table.add_column("ctx/limit", justify="right") for call in a.llm_calls: model = call.response_model or call.request_model or "-" api_duration = ( _dur(call.server_duration_ms / 1000) if call.server_duration_ms is not None else "-" ) + ctx = ( + f"{_int(call.current_tokens)}/{_int(call.token_limit)}" + if call.token_limit is not None + else _int(call.current_tokens) + ) table.add_row( call.turn_id or "-", model, _int(call.input_tokens), + _int(call.cache_read_input_tokens), _int(call.cache_creation_input_tokens), _int(call.output_tokens), + _int(call.total_tokens), _aiu(call.aiu), + _dur(call.duration_s), api_duration, - _int(call.current_tokens), + ctx, ) return table diff --git a/tests/test_analysis.py b/tests/test_analysis.py index aece02d..33c81ae 100644 --- a/tests/test_analysis.py +++ b/tests/test_analysis.py @@ -125,6 +125,7 @@ def _otel_records() -> list[dict]: "gen_ai.request.model": "claude-opus-4.8", "gen_ai.response.model": "claude-opus-4.8", "gen_ai.usage.input_tokens": 1000, + "gen_ai.usage.cache_read_input_tokens": 700, "gen_ai.usage.cache_creation_input_tokens": 200, "gen_ai.usage.output_tokens": 100, "github.copilot.nano_aiu": 500_000_000, @@ -151,6 +152,7 @@ def _otel_records() -> list[dict]: "gen_ai.request.model": "claude-opus-4.8", "gen_ai.response.model": "claude-opus-4.8", "gen_ai.usage.input_tokens": 1200, + "gen_ai.usage.cache_read_input_tokens": 900, "gen_ai.usage.cache_creation_input_tokens": 50, "gen_ai.usage.output_tokens": 50, "github.copilot.nano_aiu": 250_000_000, @@ -213,18 +215,21 @@ def test_analyze_enriches_llm_calls_from_otel(): a = analyze_events(_events(), _otel_records()) assert len(a.llm_calls) == 2 assert a.llm_calls[0].input_tokens == 1000 + assert a.llm_calls[0].cache_read_input_tokens == 700 assert a.llm_calls[0].cache_creation_input_tokens == 200 assert a.llm_calls[0].aiu == 0.5 assert a.llm_calls[0].server_duration_ms == 750 assert a.llm_calls[0].current_tokens == 900 assert a.llm_calls[1].turn_id == "1" assert a.turns[0].input_tokens == 1000 + assert a.turns[0].cache_read_input_tokens == 700 assert a.turns[0].cache_creation_input_tokens == 200 assert a.turns[0].aiu == 0.5 assert a.turns[1].input_tokens == 1200 assert a.input_tokens == 2200 assert a.output_tokens == 150 assert a.total_tokens == 2350 + assert a.economics.cache_read_tokens == 1600 assert a.economics.cache_write_tokens == 250 assert a.economics.aiu == 0.75 assert a.economics.api_duration_ms == 1200 @@ -256,5 +261,6 @@ def test_render_includes_otel_llm_calls(): render_session_analysis(a, console, title="with-otel", max_turns=10) out = buf.getvalue() assert "LLM calls (OTel)" in out + assert "cache read" in out assert "cache write" in out assert "0.500" in out diff --git a/tests/test_pier_agent.py b/tests/test_pier_agent.py index 9be2bd4..39b92d6 100644 --- a/tests/test_pier_agent.py +++ b/tests/test_pier_agent.py @@ -119,6 +119,7 @@ def test_copilot_cli_trajectory_includes_otel_llm_metrics(tmp_path: Path): assert first_agent_step["metrics"]["extra"]["copilot_otel"] == { "llm_call_count": 1, "input_tokens": 1000, + "cache_read_input_tokens": 700, "cache_creation_input_tokens": 200, "output_tokens": 100, "total_tokens": 1100, @@ -136,6 +137,7 @@ def test_copilot_cli_trajectory_includes_otel_llm_metrics(tmp_path: Path): "response_model": "claude-opus-4.8", "finish_reasons": [], "input_tokens": 1000, + "cache_read_input_tokens": 700, "cache_creation_input_tokens": 200, "output_tokens": 100, "total_tokens": 1100, @@ -153,6 +155,7 @@ def test_copilot_cli_trajectory_includes_otel_llm_metrics(tmp_path: Path): assert final_otel == { "llm_call_count": 2, "input_tokens": 2200, + "cache_read_input_tokens": 1600, "cache_creation_input_tokens": 250, "output_tokens": 150, "total_tokens": 2350, @@ -180,7 +183,7 @@ def test_copilot_cli_trajectory_uses_otel_totals_without_shutdown_metrics(tmp_pa data = trajectory.to_json_dict() assert data["final_metrics"]["total_prompt_tokens"] == 2200 assert data["final_metrics"]["total_completion_tokens"] == 150 - assert data["final_metrics"].get("total_cached_tokens") is None + assert data["final_metrics"]["total_cached_tokens"] == 1600 assert data["final_metrics"]["extra"]["copilot_otel"]["llm_call_count"] == 2 @@ -316,6 +319,7 @@ def _otel_records() -> list[dict]: "gen_ai.request.model": "claude-opus-4.8", "gen_ai.response.model": "claude-opus-4.8", "gen_ai.usage.input_tokens": 1000, + "gen_ai.usage.cache_read_input_tokens": 700, "gen_ai.usage.cache_creation_input_tokens": 200, "gen_ai.usage.output_tokens": 100, "github.copilot.nano_aiu": 500_000_000, @@ -342,6 +346,7 @@ def _otel_records() -> list[dict]: "gen_ai.request.model": "claude-opus-4.8", "gen_ai.response.model": "claude-opus-4.8", "gen_ai.usage.input_tokens": 1200, + "gen_ai.usage.cache_read_input_tokens": 900, "gen_ai.usage.cache_creation_input_tokens": 50, "gen_ai.usage.output_tokens": 50, "github.copilot.nano_aiu": 250_000_000,