Fix OTel cache-token parsing and add real-session regression tests #23

Merged: dbroeglin merged 1 commit into main from dbroeglin/otel-cache-token-keys on Jun 27, 2026
@dbroeglin (Owner)

Summary

Double-checking the session-parsing implementation against real Copilot CLI runs (captured from a sibling experiment harness) showed that the per-call "LLM calls (OTel)" table had empty cache read/write columns. This was a real parsing bug: the data was present in the OTel spans but was never read.

The bug

llm_calls_from_otel() looked up per-call cache tokens only via underscore-style OTel attribute keys (gen_ai.usage.cache_read_input_tokens, gen_ai.usage.cache_creation_input_tokens), but the Copilot CLI emits the dotted form:

  • gen_ai.usage.cache_read.input_tokens
  • gen_ai.usage.cache_creation.input_tokens

As a result, per-call cache read/write values were silently dropped (empty columns in analyze). The session-level session.shutdown totals were already correct, which is why the aggregate AIU/token numbers reconciled and the synthetic fixtures never exposed the bug.

The fix

Add the dotted keys to the lookup lists in analysis.py. Verified safe: the OTel per-call cache sums equal the session.shutdown cache totals exactly.
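As a rough illustration of the fix, the lookup can be sketched as a list of candidate keys tried in order. The attribute key strings are taken from this description; the constant names, the helper `first_int_attr`, and the sample values are hypothetical and are not the actual code in analysis.py.

```python
from typing import Any, Mapping, Optional, Sequence

# Hypothetical lookup lists. The key strings come from the PR description;
# the constant and helper names are illustrative only.
CACHE_READ_KEYS: Sequence[str] = (
    "gen_ai.usage.cache_read.input_tokens",      # dotted form emitted by the Copilot CLI
    "gen_ai.usage.cache_read_input_tokens",      # underscore form, previously the only one checked
)
CACHE_WRITE_KEYS: Sequence[str] = (
    "gen_ai.usage.cache_creation.input_tokens",  # dotted form
    "gen_ai.usage.cache_creation_input_tokens",  # underscore form
)


def first_int_attr(attrs: Mapping[str, Any], keys: Sequence[str]) -> Optional[int]:
    """Return the first attribute present under any of `keys` as an int, else None."""
    for key in keys:
        value = attrs.get(key)
        if value is not None:
            return int(value)
    return None


# Made-up span attributes using the dotted form.
span_attrs = {
    "gen_ai.usage.cache_read.input_tokens": 1200,
    "gen_ai.usage.cache_creation.input_tokens": 300,
}
cache_read = first_int_attr(span_attrs, CACHE_READ_KEYS)
cache_write = first_int_attr(span_attrs, CACHE_WRITE_KEYS)
```

Keeping both spellings in the list means spans from emitters that use the underscore form would still parse, while a span with neither key yields None instead of a misleading zero.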

Regression tests on real data

Adds tests/test_real_sessions.py + 4 captured real sessions under tests/fixtures/real_sessions/:

| Model | AIU | cache_read / cache_write |
| --- | --- | --- |
| gpt-5.5 | 8.67 | 80,896 / 0 |
| claude-opus-4.7 | 25.69 | 117,653 / 29,793 |
| mai-code-1-flash-picker | 2.00 | 64,000 / 0 |
| gemini-3.1-pro-preview | 6.38 | 84,254 / 0 |

These are fully offline. Golden values are cross-checked against the raw session.shutdown payload, and the strongest invariant re-derives the AIU total and cache tokens from the independent OTel chat spans — exactly the check that surfaced this bug. Note providers differ: only Anthropic reports cache_creation (explicit cache writes); MAI/GPT/Gemini use implicit caching (cache_read only, no separate write charge).
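The cross-check described above could look roughly like the sketch below: sum the per-call cache tokens over the OTel chat spans and compare them with the session-level totals. The data shapes (`chat_spans` as a list of attribute dicts, `shutdown_totals` with `cache_read` / `cache_write` fields), the helper name, and the numbers are assumptions for illustration, not the real fixture format or the code in tests/test_real_sessions.py.

```python
from typing import Any, Iterable, Mapping


def sum_span_tokens(spans: Iterable[Mapping[str, Any]], key: str) -> int:
    """Sum one integer attribute over all spans, treating a missing key as 0."""
    return sum(int(span.get(key, 0)) for span in spans)


# Made-up per-call span attributes (hypothetical shape).
chat_spans = [
    {
        "gen_ai.usage.cache_read.input_tokens": 500,
        "gen_ai.usage.cache_creation.input_tokens": 120,
    },
    {"gen_ai.usage.cache_read.input_tokens": 700},  # no cache write on this call
]

# Made-up session-level totals (hypothetical field names).
shutdown_totals = {"cache_read": 1200, "cache_write": 120}

derived_read = sum_span_tokens(chat_spans, "gen_ai.usage.cache_read.input_tokens")
derived_write = sum_span_tokens(chat_spans, "gen_ai.usage.cache_creation.input_tokens")

# The invariant: independently derived per-call sums equal the session totals.
assert derived_read == shutdown_totals["cache_read"]
assert derived_write == shutdown_totals["cache_write"]
```

Because the two sides come from independent sources, a parser that drops per-call values (as the underscore-only lookup did) makes the derived sums fall to zero and the assertion fail.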

The real mai data also surfaced a 1-ULP rounding edge case, so the aiu_by_type reconciliation assertion is relaxed to a ±1e-6 tolerance.
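A minimal sketch of why a tolerance is needed when reconciling a float total against a sum of float parts. The helper name `reconciles`, the dictionary keys, and the values are hypothetical; only the ±1e-6 tolerance comes from this PR.

```python
import math
from typing import Mapping


def reconciles(total: float, by_type: Mapping[str, float], tol: float = 1e-6) -> bool:
    """Check that the per-type values sum to `total` within an absolute tolerance."""
    return math.isclose(sum(by_type.values()), total, rel_tol=0.0, abs_tol=tol)


# Made-up per-type values whose float sum differs from the total by one ULP.
by_type = {"input": 0.1, "output": 0.2}

exact_match = sum(by_type.values()) == 0.3  # False: the sum is 0.30000000000000004
tolerant_match = reconciles(0.3, by_type)   # True: within 1e-6
```

An absolute tolerance of 1e-6 is far larger than one ULP at these magnitudes, so it absorbs rounding noise while still catching any real discrepancy in reported AIU values.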

Testing

  • uv run ruff check . — clean
  • uv run pytest -q — all pass (incl. 32 new real-session assertions)

Sessions contain no secrets (BYOK keys are never written to the event log).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The per-call LLM analysis read cache tokens only from underscore-style OTel
attribute keys (gen_ai.usage.cache_read_input_tokens /
gen_ai.usage.cache_creation_input_tokens), but the Copilot CLI emits the
dotted form (gen_ai.usage.cache_read.input_tokens /
gen_ai.usage.cache_creation.input_tokens). As a result per-call cache
read/write were silently dropped and rendered empty in `analyze`, even
though the session-level shutdown totals were correct.

Add the dotted keys to the lookup lists so per-call cache tokens are parsed.

Also add offline regression tests driven by four captured real Copilot CLI
sessions (gpt-5.5, claude-opus-4.7, mai-code-1-flash-picker,
gemini-3.1-pro-preview). Golden values are cross-checked against the raw
session.shutdown payload, and the strongest invariant re-derives the AIU
total and cache tokens from the independent OTel chat spans — which is what
surfaced this bug. Sessions contain no secrets.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@dbroeglin dbroeglin merged commit d65683c into main Jun 27, 2026
7 checks passed