dbroeglin · dbroeglin · Jun 27, 2026 · Jun 27, 2026
diff --git a/docs/adr/0011-token-economics-from-session-shutdown.md b/docs/adr/0011-token-economics-from-session-shutdown.md
@@ -72,20 +72,21 @@ On 2026-06-21, `copilot help monitoring` and local file-exporter probes with Cop
 `1.0.64-0` showed an additional optional source: OpenTelemetry `chat <model>` spans carry
 per-LLM-call `gen_ai.usage.input_tokens`, `gen_ai.usage.cache_creation_input_tokens`,
 `gen_ai.usage.output_tokens`, `github.copilot.nano_aiu`, `github.copilot.server_duration`, and
-`github.copilot.turn_id`. In the probes, summing chat spans matched the native
+`github.copilot.turn_id`. Newer/local exporter variants may also include per-call cache-read fields
+such as `gen_ai.usage.cache_read_input_tokens`. In the probes, summing chat spans matched the native
 `session.shutdown` totals for input, cache-write, output, nano-AIU, and API duration.
 
 We integrated this as a complement, not a replacement. The harness now enables local OTel file
 export to `copilot-otel.jsonl` for Copilot agent runs when no explicit OTLP destination is
 configured, and `SessionAnalysis` ingests that file when present to populate a per-call LLM table
-and annotate turns with OTel input/cache-write/output/AIU/API-duration fields. Native
+and annotate turns with OTel input/cache-read/cache-write/output/AIU/API-duration fields. Native
 `events.jsonl` remains the authoritative source for session-level totals and richer forensic
-details; OTel still does not split each call's input into non-cached input versus cache-read tokens.
+details; OTel cache-read fields are captured when present, otherwise that split remains session-level.
 
 The Pier ATIF trajectory also carries this complement now. Assistant steps matched by
 `github.copilot.turn_id` get ATIF prompt/completion token metrics plus
 `metrics.extra.copilot_otel` for the OTel-only fields (`cache_creation_input_tokens`, AIU, server
-duration, context size, token limit, and normalized per-call records). `final_metrics.extra` stores
-the aggregate OTel totals. We deliberately do not map OTel cache creation into ATIF
-`cached_tokens`/`total_cached_tokens`, because those fields represent cache reads in our native
-Copilot economics.
+duration, context size, token limit, optional cache-read tokens, and normalized per-call records).
+`final_metrics.extra` stores the aggregate OTel totals. We deliberately do not map OTel cache
+creation into ATIF `cached_tokens`/`total_cached_tokens`, because those fields represent cache reads
+in our native Copilot economics.
diff --git a/docs/analysis.md b/docs/analysis.md
@@ -56,8 +56,8 @@ The rendering (built with [Rich](https://rich.readthedocs.io/)) has these parts:
    truncations, files modified, lines ±, AIU per line). Multi-model sessions also get a per-model
    table. See [ADR-0011](adr/0011-token-economics-from-session-shutdown.md).
 5. **LLM calls (OTel)** — *only when `copilot-otel.jsonl` is available.* One row per `chat <model>`
-   span with turn id, model, input tokens, cache-write tokens, output tokens, AIU, API time, and
-   current context size.
+   span with turn id, model, input tokens, cache-read/cache-write tokens when exported, output and
+   total tokens, AIU, wall/API time, and current/limit context size.
 6. **Phases (temporal)** — *only when the session has at least five turns.* The turns are split
    into five contiguous, near-equal groups (`early` → `later`) and each phase shows its turn span,
    tool calls, output tokens, share of total output, and duration. This mirrors the phase-level
@@ -73,8 +73,9 @@ Warnings, if any, are shown in a panel at the bottom.
 > `events.jsonl` logs input, cache, reasoning, and AIU **only as session totals**
 > (`session.shutdown`), never per turn; the per-turn `assistant.message` carries `outputTokens` but
 > not `inputTokens`. Copilot's OTel `chat <model>` spans expose per-LLM-call `input_tokens`,
-> `cache_creation_input_tokens`, `output_tokens`, `nano_aiu`, and server duration when OTel export is
-> enabled; `analyze` auto-loads the harness-captured `copilot-otel.jsonl` when present, or accepts
+> optional cache-read details, `cache_creation_input_tokens`, `output_tokens`, `nano_aiu`, and
+> server duration when OTel export is enabled; `analyze` auto-loads the harness-captured
+> `copilot-otel.jsonl` when present, or accepts
 > `--otel-file` for direct log analysis. The current phase table still distributes only the native
 > per-turn signals — output tokens, tool activity, and duration — and deliberately does **not**
 > fabricate per-phase input or cost. The separate **LLM calls (OTel)** table carries the per-call
@@ -113,7 +114,7 @@ legacy `analysis.json`, and any future consumer.
 | `economics` | `TokenEconomics`: token-type split, AIU cost, context composition, productivity (see below). |
 | `tools` | `ToolStat(name, calls, failures, total_duration_ms, total_result_chars)`, sorted by calls desc. |
 | `llm_calls` | `LlmCallSummary` rows parsed from OTel `chat <model>` spans when `copilot-otel.jsonl` is available. |
-| `turns` | `TurnSummary(turn_no, duration_s, tools, output_tokens, text_preview, …)` plus OTel `input_tokens`, `cache_creation_input_tokens`, `aiu`, and `api_duration_ms` when available. |
+| `turns` | `TurnSummary(turn_no, duration_s, tools, output_tokens, text_preview, …)` plus OTel `input_tokens`, optional `cache_read_input_tokens`, `cache_creation_input_tokens`, `aiu`, and `api_duration_ms` when available. |
 | `phases` | `PhaseStat(name, turn_from, turn_to, n_turns, n_tool_calls, output_tokens, duration_s, output_share)` — five temporal phases (empty for sessions under five turns). |
 | `warnings` | Warning messages. |
 | `event_type_counts` | Histogram of raw event `type`s. |

diff --git a/docs/collecting-run-data.md b/docs/collecting-run-data.md
@@ -319,9 +319,10 @@ When `agent/copilot-otel.jsonl` is present, each assistant step that can be matc
 `metrics.extra.copilot_otel`. That extra payload preserves the per-call OTel facts the standard
 ATIF token fields cannot represent directly: `cache_creation_input_tokens`, `aiu`,
 `server_duration_ms`, current context size, token limit, and the raw normalized `llm_calls` list.
-`final_metrics.extra.copilot_otel` stores the run-level OTel aggregate. ATIF
-`total_cached_tokens` remains the native cache-read total; OTel cache creation is not written there
-because Copilot OTel does not expose a per-call cache-read split.
+When the exporter includes per-call cache reads, the payload also includes
+`cache_read_input_tokens`. `final_metrics.extra.copilot_otel` stores the run-level OTel aggregate.
+ATIF `total_cached_tokens` uses the native cache-read total when available and falls back to OTel
+cache reads when OTel exposes them.
 
 Use ATIF for cross-agent comparisons and fallback analysis. Use native `events.jsonl` for
 Copilot-specific economics, compaction/truncation detail, and the most faithful timeline.
@@ -491,7 +492,7 @@ Observed span records from local file-exporter probes:
 | Span | Useful fields | What it adds beyond `events.jsonl` |
 | --- | --- | --- |
 | `invoke_agent` | `gen_ai.conversation.id`, `gen_ai.agent.id`, `gen_ai.agent.version`, `gen_ai.request.model`, `github.copilot.agent.type`, `github.copilot.turn_count`, aggregate `gen_ai.usage.*`, aggregate `github.copilot.nano_aiu`, `github.copilot.context.skills`, `github.copilot.context.mcp_server_names`, `github.copilot.context.custom_agent_names` | Agent-level trace root and aggregate usage/cost for the invocation. MCP server names are hashed. |
-| `chat <model>` | `gen_ai.conversation.id`, `github.copilot.turn_id`, `gen_ai.request.model`, `gen_ai.response.model`, `gen_ai.response.id`, `gen_ai.response.finish_reasons`, `gen_ai.usage.input_tokens`, `gen_ai.usage.cache_creation_input_tokens`, `gen_ai.usage.output_tokens`, `github.copilot.nano_aiu`, `github.copilot.server_duration`, `github.copilot.interaction_id`, `github.copilot.service_request_id` | Per-LLM-call input/output/cache-write tokens, nano-AIU, and server duration. This is the main data source for turn-level economics. |
+| `chat <model>` | `gen_ai.conversation.id`, `github.copilot.turn_id`, `gen_ai.request.model`, `gen_ai.response.model`, `gen_ai.response.id`, `gen_ai.response.finish_reasons`, `gen_ai.usage.input_tokens`, optional cache-read usage fields, `gen_ai.usage.cache_creation_input_tokens`, `gen_ai.usage.output_tokens`, `github.copilot.nano_aiu`, `github.copilot.server_duration`, `github.copilot.interaction_id`, `github.copilot.service_request_id` | Per-LLM-call input/output/cache-read/cache-write tokens when exported, nano-AIU, and server duration. This is the main data source for turn-level economics. |
 | `execute_tool <tool>` | `gen_ai.operation.name`, `gen_ai.tool.name`, `gen_ai.tool.call.id`, `gen_ai.tool.type`, `gen_ai.provider.name`, span status | Standard tool spans that can be joined to chat spans by trace and to native events by tool-call id. Native `tool.execution_complete.toolTelemetry` remains richer for result sizes and file/code metadata. |
 
 `chat <model>` spans also emitted a `github.copilot.session.usage_info` event with
@@ -505,15 +506,16 @@ Per-call token/cost interpretation:
 | Field | Interpretation |
 | --- | --- |
 | `gen_ai.usage.input_tokens` | Total prompt/input tokens for that LLM request. In observed runs, summing chat spans matched `session.shutdown.modelMetrics.<model>.usage.inputTokens`. |
+| `gen_ai.usage.cache_read_input_tokens` | Cache-read tokens for that request when exported. The parser also accepts common aliases such as `gen_ai.usage.cached_tokens`. |
 | `gen_ai.usage.cache_creation_input_tokens` | Cache-write tokens for that request. In observed runs, summing chat spans matched `session.shutdown.tokenDetails.cache_write.tokenCount`. |
 | `gen_ai.usage.output_tokens` | Output tokens for that request. |
 | `github.copilot.nano_aiu` | Per-request cost in nano-AIU. Prefer this over `github.copilot.cost` for economics; summing observed chat spans matched `session.shutdown.totalNanoAiu`. |
 | `github.copilot.server_duration` | Per-request server/API duration in milliseconds; summing observed chat spans matched `session.shutdown.totalApiDurationMs`. |
 | `github.copilot.turn_id` | Join key to native `assistant.turn_*` / `assistant.message` events. |
 
-The observed OTel spans did **not** split per-call `input_tokens` into non-cached input versus
-cache-read tokens. That split remains session-level in `session.shutdown.tokenDetails`; per call, we
-can capture total input, cache creation, output, server duration, and AIU.
+Some observed OTel spans did **not** split per-call `input_tokens` into non-cached input versus
+cache-read tokens. When the exporter provides cache-read fields, the parser captures them; otherwise
+that split remains session-level in `session.shutdown.tokenDetails`.
 
 Metrics currently documented by the CLI help:
 

diff --git a/src/copilot_experiments/analysis.py b/src/copilot_experiments/analysis.py
@@ -271,6 +271,7 @@ def _apply_otel_records(analysis: SessionAnalysis, records: list[dict[str, Any]]
         if turn is None:
             continue
         _add_turn_int(turn, "input_tokens", call.input_tokens)
+        _add_turn_int(turn, "cache_read_input_tokens", call.cache_read_input_tokens)
         _add_turn_int(turn, "cache_creation_input_tokens", call.cache_creation_input_tokens)
         if turn.output_tokens is None:
             _add_turn_int(turn, "output_tokens", call.output_tokens)
@@ -283,6 +284,7 @@ def _apply_otel_records(analysis: SessionAnalysis, records: list[dict[str, Any]]
         return
 
     input_tokens = _sum_int(call.input_tokens for call in calls)
+    cache_read_tokens = _sum_int(call.cache_read_input_tokens for call in calls)
     cache_creation_tokens = _sum_int(call.cache_creation_input_tokens for call in calls)
     output_tokens = _sum_int(call.output_tokens for call in calls)
     api_duration_ms = _sum_int(call.server_duration_ms for call in calls)
@@ -291,6 +293,8 @@ def _apply_otel_records(analysis: SessionAnalysis, records: list[dict[str, Any]]
     if input_tokens is not None:
         analysis.input_tokens = input_tokens
         analysis.economics.input_tokens_total = input_tokens
+    if cache_read_tokens is not None:
+        analysis.economics.cache_read_tokens = cache_read_tokens
     if cache_creation_tokens is not None:
         analysis.economics.cache_write_tokens = cache_creation_tokens
     if output_tokens is not None:
@@ -339,6 +343,17 @@ def llm_calls_from_otel(records: list[dict[str, Any]]) -> list[LlmCallSummary]:
                 response_id=_otel_string(attrs.get("gen_ai.response.id")),
                 finish_reasons=_strings(attrs.get("gen_ai.response.finish_reasons")),
                 input_tokens=input_tokens,
+                cache_read_input_tokens=_first_otel_int(
+                    attrs,
+                    "gen_ai.usage.cache_read_input_tokens",
+                    "gen_ai.usage.cache_read_tokens",
+                    "gen_ai.usage.cached_input_tokens",
+                    "gen_ai.usage.cached_tokens",
+                    "gen_ai.usage.prompt_tokens_details.cached_tokens",
+                    "gen_ai.usage.input_token_details.cached_tokens",
+                    "gen_ai.usage.input_token_details.cache_read_tokens",
+                    "gen_ai.usage.input_tokens_details.cached_tokens",
+                ),
                 cache_creation_input_tokens=_otel_int(
                     attrs.get("gen_ai.usage.cache_creation_input_tokens")
                 ),
@@ -372,6 +387,32 @@ def _chat_usage_event(events: Any) -> dict[str, Any]:
     return {}
 
 
+def _first_otel_int(attrs: dict[str, Any], *keys: str) -> int | None:
+    for key in keys:
+        value = _otel_attr_value(attrs, key)
+        if value is None:
+            continue
+        parsed = _otel_int(value)
+        if parsed is not None:
+            return parsed
+    return None
+
+
+def _otel_attr_value(attrs: dict[str, Any], key: str) -> Any:
+    if key in attrs:
+        return attrs[key]
+    parts = key.split(".")
+    for index in range(len(parts) - 1, 0, -1):
+        prefix = ".".join(parts[:index])
+        value = attrs.get(prefix)
+        if not isinstance(value, dict):
+            continue
+        for part in parts[index:]:
+            value = value.get(part) if isinstance(value, dict) else None
+        return value
+    return None
+
+
 def _otel_attrs(raw: Any) -> dict[str, Any]:
     if isinstance(raw, dict):
         return {str(k): _otel_value(v) for k, v in raw.items()}

diff --git a/src/copilot_experiments/invoker.py b/src/copilot_experiments/invoker.py
@@ -363,6 +363,7 @@ def _write_synthetic_otel(path: Path, events: list[dict], model: str) -> None:
                 data = ev.get("data") or {}
                 output_tokens = int(data.get("outputTokens") or 0)
                 input_tokens = 1000 + i * 100
+                cache_read_tokens = 900 + i * 90
                 cache_creation_tokens = 100 + i * 10
                 nano_aiu = 100_000_000 + i * 10_000_000
                 record = {
@@ -373,6 +374,7 @@ def _write_synthetic_otel(path: Path, events: list[dict], model: str) -> None:
                         "gen_ai.request.model": model,
                         "gen_ai.response.model": model,
                         "gen_ai.usage.input_tokens": input_tokens,
+                        "gen_ai.usage.cache_read_input_tokens": cache_read_tokens,
                         "gen_ai.usage.cache_creation_input_tokens": cache_creation_tokens,
                         "gen_ai.usage.output_tokens": output_tokens,
                         "github.copilot.nano_aiu": nano_aiu,

diff --git a/src/copilot_experiments/models.py b/src/copilot_experiments/models.py
@@ -344,6 +344,7 @@ class LlmCallSummary(BaseModel):
     response_id: str | None = None
     finish_reasons: list[str] = Field(default_factory=list)
     input_tokens: int | None = None
+    cache_read_input_tokens: int | None = None
     cache_creation_input_tokens: int | None = None
     output_tokens: int | None = None
     total_tokens: int | None = None
@@ -368,6 +369,7 @@ class TurnSummary(BaseModel):
     tools: list[str] = Field(default_factory=list)
     output_tokens: int | None = None
     input_tokens: int | None = None
+    cache_read_input_tokens: int | None = None
     cache_creation_input_tokens: int | None = None
     aiu: float | None = None
     api_duration_ms: int | None = None

diff --git a/src/copilot_experiments/pier_agents/copilot_cli.py b/src/copilot_experiments/pier_agents/copilot_cli.py
@@ -490,7 +490,7 @@ def append_step(step: Step) -> None:
                 total_cached_tokens=(
                     parsed_metrics.cache_read_tokens
                     if parsed_metrics is not None and parsed_metrics.cache_read_tokens is not None
-                    else None
+                    else otel_totals.get("cache_read_input_tokens")
                 ),
                 total_steps=len(steps),
                 extra=extra or None,
@@ -572,6 +572,7 @@ def _apply_otel_metrics(step: Step, calls: list[LlmCallSummary]) -> None:
 
     metrics = step.metrics or Metrics()
     input_tokens = _sum_optional_int(call.input_tokens for call in calls)
+    cache_read_input_tokens = _sum_optional_int(call.cache_read_input_tokens for call in calls)
     output_tokens = _sum_optional_int(call.output_tokens for call in calls)
     if input_tokens is not None:
         metrics.prompt_tokens = input_tokens
@@ -584,6 +585,7 @@ def _apply_otel_metrics(step: Step, calls: list[LlmCallSummary]) -> None:
             {
                 "llm_call_count": len(calls),
                 "input_tokens": input_tokens,
+                "cache_read_input_tokens": cache_read_input_tokens,
                 "cache_creation_input_tokens": _sum_optional_int(
                     call.cache_creation_input_tokens for call in calls
                 ),
@@ -608,6 +610,9 @@ def _otel_final_totals(calls: list[LlmCallSummary]) -> dict[str, Any]:
         {
             "llm_call_count": len(calls),
             "input_tokens": _sum_optional_int(call.input_tokens for call in calls),
+            "cache_read_input_tokens": _sum_optional_int(
+                call.cache_read_input_tokens for call in calls
+            ),
             "cache_creation_input_tokens": _sum_optional_int(
                 call.cache_creation_input_tokens for call in calls
             ),

diff --git a/src/copilot_experiments/render.py b/src/copilot_experiments/render.py
@@ -192,25 +192,36 @@ def _llm_calls_table(a: SessionAnalysis) -> Table | None:
     table.add_column("turn", justify="right")
     table.add_column("model")
     table.add_column("in", justify="right")
+    table.add_column("cache read", justify="right")
     table.add_column("cache write", justify="right")
     table.add_column("out", justify="right")
+    table.add_column("total", justify="right")
     table.add_column("AIU", justify="right")
+    table.add_column("wall", justify="right")
     table.add_column("API", justify="right")
-    table.add_column("ctx", justify="right")
+    table.add_column("ctx/limit", justify="right")
     for call in a.llm_calls:
         model = call.response_model or call.request_model or "-"
         api_duration = (
             _dur(call.server_duration_ms / 1000) if call.server_duration_ms is not None else "-"
         )
+        ctx = (
+            f"{_int(call.current_tokens)}/{_int(call.token_limit)}"
+            if call.token_limit is not None
+            else _int(call.current_tokens)
+        )
         table.add_row(
             call.turn_id or "-",
             model,
             _int(call.input_tokens),
+            _int(call.cache_read_input_tokens),
             _int(call.cache_creation_input_tokens),
             _int(call.output_tokens),
+            _int(call.total_tokens),
             _aiu(call.aiu),
+            _dur(call.duration_s),
             api_duration,
-            _int(call.current_tokens),
+            ctx,
         )
     return table