Fix OTel cache-token parsing and add real-session regression tests by dbroeglin · Pull Request #23 · dbroeglin/github-copilot-lab

dbroeglin · 2026-06-27T16:34:44Z

Summary

While double-checking the session-parsing implementation against real Copilot CLI runs (captured from a sibling experiment harness), I found that the per-call "LLM calls (OTel)" table showed empty cache read/write columns. This is a real data-structure bug, not a rendering glitch.

The bug

llm_calls_from_otel() looked up per-call cache tokens only via underscore-style OTel attribute keys (gen_ai.usage.cache_read_input_tokens, gen_ai.usage.cache_creation_input_tokens), but the Copilot CLI emits the dotted form:

gen_ai.usage.cache_read.input_tokens
gen_ai.usage.cache_creation.input_tokens

As a result, per-call cache read/write counts were silently dropped (empty columns in analyze). The session-level session.shutdown totals were already correct, which is why the aggregate AIU/token numbers reconciled and why the synthetic fixtures never exposed the bug.

The fix

Add the dotted keys to the lookup lists in analysis.py. Verified safe: the OTel per-call cache sums equal the session.shutdown cache totals exactly.
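A minimal sketch of what such a lookup might look like, assuming the per-call parser tries a list of candidate attribute keys in order. The names `CACHE_READ_KEYS`, `CACHE_WRITE_KEYS` and `first_int` are hypothetical and are not taken from analysis.py; the key order (dotted first) and the attribute values are likewise assumptions made for illustration.

```python
from typing import Any, Mapping, Optional, Sequence

# Candidate OTel attribute keys. The dotted form is what the Copilot CLI emits;
# the underscore form is kept as a fallback. (Order is an assumption.)
CACHE_READ_KEYS = (
    "gen_ai.usage.cache_read.input_tokens",
    "gen_ai.usage.cache_read_input_tokens",
)
CACHE_WRITE_KEYS = (
    "gen_ai.usage.cache_creation.input_tokens",
    "gen_ai.usage.cache_creation_input_tokens",
)


def first_int(attrs: Mapping[str, Any], keys: Sequence[str]) -> Optional[int]:
    """Return the value of the first key present in attrs, as an int, else None."""
    for key in keys:
        value = attrs.get(key)
        if value is not None:
            return int(value)
    return None


# Illustrative span attributes (made-up value), using the dotted form.
span_attrs = {"gen_ai.usage.cache_read.input_tokens": 1200}

cache_read = first_int(span_attrs, CACHE_READ_KEYS)
cache_write = first_int(span_attrs, CACHE_WRITE_KEYS)
print(cache_read, cache_write)
```

With only the underscore keys in the lists, the same span would yield None for both values, which matches the empty columns described above.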

Regression tests on real data

Adds tests/test_real_sessions.py + 4 captured real sessions under tests/fixtures/real_sessions/:

| Model | AIU | cache_read / cache_write (tokens) |
| --- | --- | --- |
| `gpt-5.5` | 8.67 | 80,896 / 0 |
| `claude-opus-4.7` | 25.69 | 117,653 / 29,793 |
| `mai-code-1-flash-picker` | 2.00 | 64,000 / 0 |
| `gemini-3.1-pro-preview` | 6.38 | 84,254 / 0 |

These tests run fully offline. Golden values are cross-checked against the raw session.shutdown payload, and the strongest invariant re-derives the AIU total and cache tokens from the independent OTel chat spans, which is exactly the check that surfaced this bug. Note that providers differ: only Anthropic reports cache_creation (explicit cache writes); MAI/GPT/Gemini use implicit caching (cache_read only, no separate write charge).
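The cache-token part of that invariant could be sketched roughly as follows. This is not the code in tests/test_real_sessions.py: the helper name `sum_cache_tokens`, the row layout, and all numbers are hypothetical toy values chosen only to show the shape of the check.

```python
from typing import Iterable, Mapping, Optional, Tuple


def sum_cache_tokens(calls: Iterable[Mapping[str, Optional[int]]]) -> Tuple[int, int]:
    """Sum per-call cache read/write tokens, treating missing values as 0."""
    read = 0
    write = 0
    for call in calls:
        read += call.get("cache_read") or 0
        write += call.get("cache_write") or 0
    return read, write


# Toy per-call rows, as if parsed from OTel chat spans (made-up values).
calls = [
    {"cache_read": 100, "cache_write": 50},
    {"cache_read": 200, "cache_write": None},
]

# Toy session-level totals, as if read from the session.shutdown payload.
shutdown_totals = {"cache_read": 300, "cache_write": 50}

derived = sum_cache_tokens(calls)
expected = (shutdown_totals["cache_read"], shutdown_totals["cache_write"])
assert derived == expected, f"per-call sums {derived} != shutdown totals {expected}"
```

Because the two sides come from independent sources, a parser that drops per-call values would make the derived sums fall short of the shutdown totals and fail the assertion.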

The real mai data also surfaced a 1-ULP rounding edge case, so the aiu_by_type reconciliation assertion is relaxed to a ±1e-6 tolerance.
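To illustrate why an exact equality check can fail by one unit in the last place, here is a small sketch with toy numbers. The variable names and values are hypothetical and unrelated to the real fixtures; the use of `math.isclose` is one way to express a ±1e-6 absolute tolerance, not necessarily what the test uses.

```python
import math

# Toy per-type AIU contributions and a separately reported total (made-up values).
aiu_input = 0.1
aiu_output = 0.2
aiu_total = 0.3

summed = aiu_input + aiu_output

# Exact comparison fails: the binary floating-point sum differs from 0.3
# in the last bit.
exact_match = summed == aiu_total

# Absolute tolerance of 1e-6, with the relative tolerance disabled.
within_tolerance = math.isclose(summed, aiu_total, rel_tol=0.0, abs_tol=1e-6)

print(exact_match, within_tolerance)
```

A tolerance of 1e-6 is many orders of magnitude larger than a 1-ULP difference at these magnitudes, while still being far smaller than any AIU value the table reports to two decimals.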

Testing

uv run ruff check . — clean
uv run pytest -q — all pass (incl. 32 new real-session assertions)

Sessions contain no secrets (BYOK keys are never written to the event log).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The per-call LLM analysis read cache tokens only from underscore-style OTel attribute keys (gen_ai.usage.cache_read_input_tokens / gen_ai.usage.cache_creation_input_tokens), but the Copilot CLI emits the dotted form (gen_ai.usage.cache_read.input_tokens / gen_ai.usage.cache_creation.input_tokens). As a result per-call cache read/write were silently dropped and rendered empty in `analyze`, even though the session-level shutdown totals were correct.

Add the dotted keys to the lookup lists so per-call cache tokens are parsed.

Also add offline regression tests driven by four captured real Copilot CLI sessions (gpt-5.5, claude-opus-4.7, mai-code-1-flash-picker, gemini-3.1-pro-preview). Golden values are cross-checked against the raw session.shutdown payload, and the strongest invariant re-derives the AIU total and cache tokens from the independent OTel chat spans, which is what surfaced this bug.

Sessions contain no secrets.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

dbroeglin merged commit d65683c into main Jun 27, 2026
7 checks passed

dbroeglin mentioned this pull request Jun 27, 2026

test: broaden coverage of analysis, OTel merge, and rendering #24

Merged

dbroeglin merged 1 commit into main from dbroeglin/otel-cache-token-keys
