Skip to content

Explorer FTS Track 1a follow-ups: deferred review concerns #174

@rdhyee

Description

@rdhyee

Sub-issue of #167 / #165. Captures concerns surfaced during self-review of #173 that were knowingly deferred rather than fixed in that PR.

Deferred items

1. Add a unit test for the isamples.search instrumentation itself

The structured log emitted by `doSearch()` claims invariants like `terms_count == searchTerms(term).length` and `elapsed_ms > 0`. Nothing currently verifies this. A future tokenizer or instrumentation refactor could silently corrupt the log without a test catching it.

Scope: a small Playwright unit test (or pytest fixture) that runs a single canonical search and asserts:

  • exactly one `isamples.search` log emitted
  • `terms_count` matches the JS-side `searchTerms(term).length`
  • `elapsed_ms` is a positive integer
  • `results_count` matches the rendered `#searchResults` count text
  • `seen_urls` is a list of `{name, transfer_size, body_size}` objects

Land alongside the offline builder work in #170 since both touch tokenization correctness.

2. "Warm" semantics in the contract doc

The current perf-smoke measures "warm" as a re-run of the same query on the same page. If DuckDB-WASM caches result sets, this measures cache lookup, not query execution. The v1 contract doc (#169) should specify what "warm" actually means for the budget targets:

  • `re-run-same-query warm` (current measurement) — measures end-to-end cache + render path
  • `new-query-after-warm-up warm` — measures query execution after the parquet metadata is cached but the result set is fresh
  • both, with separate budgets

Action: include this distinction in the SEARCH_INDEX_V1.md budget section. The benchmark in #171 should produce both numbers.

3. `searchTerms(term)` is now outside the try block

In #173 we moved `searchTerms(term)` to before the try-block so `terms.length` is available in the structured log even on early-error paths. `searchTerms` cannot throw with the current implementation, but a future tokenizer change that throws would now skip the structured log entirely. Action: wrap the tokenizer call defensively when the v1 substrate tokenizer lands in #170 (which is more likely to throw on edge inputs).

4. Bytes attribution beyond the search window

Even with the `seen_urls` list now captured per search, attribution is window-based: any fetch from `data.isamples.org` that starts during the search window is counted. A long-running fetch started by an earlier interaction could finish during the search and pollute `bytes_transfer`. Filtering to fetches whose `responseEnd` falls within the window would be tighter. Action: revisit when the v1 substrate is in place, since the substrate-fetch URLs are predictable and can be matched by name prefix instead.

Refs

#165, #167, #173

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestexplorerInteractive Explorer features

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions