Explorer FTS Track 4: Browser query prototype + benchmark #171

@rdhyee

Updated 2026-05-08 per Codex review (rounds 1 + 2) on #165. §5 added in round 1; round 2 reframed it as a BM25 oracle + escalation trigger, not a substitute for a hosted-search comparison. Hosted search remains a permanent contingency, not a question DuckDB FTS closes.

Sub-issue of #165. Depends on #170 (offline builder).

Goal

Implement the browser-side query path against the v1 substrate, behind a feature flag, and run the canonical benchmark to compare against the v1 contract budgets and the curated benchmark from #169.

Scope

1. Browser query path

  • New cell in explorer.qmd (or extracted module): searchSubstrate(term).
  • Steps per multi-term query:
    1. Tokenize term using the JS tokenizer from #170 (Explorer FTS Track 3: Offline index builder + tokenizer regression set).
    2. Apply query-time stopword policy (per #169 §3, Explorer FTS Track 2: search_index_v1 contract doc): drop curated stopwords from the bag-of-words AND.
    3. For each surviving token: resolve shard URL via the partition function; fetch the shard via DuckDB-WASM HTTP range read.
    4. Compute per-token postings with BM25 contribution (using DF + doc_len from the substrate).
    5. AND-combine across tokens (intersect pid sets, sum scores).
    6. Apply field weights from query-side config (per #169 §5).
    7. Sort by score, take top 50.
    8. Keyed pid IN (...) join back to samples_map_lite.parquet for display fields (label, source, lat, lng, place_name).
    9. Compose with sourceFilterSQL() and facetFilterSQL() (existing helpers in explorer.qmd:377 / :528).
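
Steps 1, 2, 4, 5, and 7 above can be sketched as a small reference model. This is an illustrative Python sketch, not the browser implementation: the tokenizer, the stopword list, the `postings` layout, and the BM25 constants (`k1 = 1.2`, `b = 0.75`, Lucene-style IDF) are all assumptions, and shard fetching, field weights, and the display join are left out.

```python
import math
from collections import defaultdict

# Hypothetical stopword list; the curated list is defined by the #169 contract.
STOPWORDS = {"a", "the", "of", "from"}


def tokenize(term):
    """Stand-in tokenizer (lowercase, whitespace split); the real one comes from #170."""
    return term.lower().split()


def query_tokens(term):
    """Drop stopwords and duplicate tokens, keeping first-seen order."""
    seen = {}
    for tok in tokenize(term):
        if tok not in STOPWORDS:
            seen.setdefault(tok, None)
    return list(seen)


def bm25_term(tf, df, doc_len, n_docs, avg_len, k1=1.2, b=0.75):
    """BM25 contribution of one token to one document (assumed constants)."""
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    norm = tf + k1 * (1.0 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1.0) / norm


def search(term, postings, n_docs, avg_len, top_k=50):
    """AND-combine per-token postings and return the top_k (pid, score) pairs.

    `postings` maps token -> {pid: (tf, doc_len)}. In the browser path each
    lookup would be a shard read; here it is an in-memory dict.
    """
    tokens = query_tokens(term)
    if not tokens:
        return []  # empty or all-stopword query: empty result, no lookups
    scores = defaultdict(float)
    matched = None
    for tok in tokens:
        plist = postings.get(tok, {})
        df = len(plist)  # assumes the posting list is complete, so its length is DF
        for pid, (tf, doc_len) in plist.items():
            scores[pid] += bm25_term(tf, df, doc_len, n_docs, avg_len)
        pids = set(plist)
        matched = pids if matched is None else matched & pids
    # Sort by descending score; break ties by pid so runs are reproducible.
    ranked = sorted(matched, key=lambda pid: (-scores[pid], pid))
    return [(pid, scores[pid]) for pid in ranked[:top_k]]
```

Under these assumptions, `search("pottery from Cyprus", ...)` and `search("pottery Cyprus", ...)` return the same list, and a duplicated term such as `pottery pottery cyprus` scores the same as `pottery cyprus` because tokens are de-duplicated before scoring.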

2. Feature flag

The substrate path is gated behind the ?fts=v1 URL flag (see the §6 acceptance criteria).

3. Benchmark run (browser-path)

Per-query metrics required in the benchmark JSON. This list is the contract between #171 (produce the data) and #172 (consume it as a hard-fail). Adding a hard-fail in #172 without adding the metric here means the gate evaluates against absent data.

| metric | source | feeds #172 hard-fail |
| --- | --- | --- |
| cold latency (ms) | performance.measure('search-…') | performance gate |
| warm-repeat-same-query latency (ms) | second invocation, same page | performance gate |
| warm-new-query-after-warm-up latency (ms) | different query, same page | performance gate |
| filter-composed cold latency (ms) | per-query × per-filter combo | performance gate |
| bytes transferred (cold + warm) | per #167 instrumentation | performance gate |
| results count | length of result rows | non-empty checks |
| top-K result PIDs (K = 50) | the ranked PID list | top-K overlap quality gates |
| top-K overlap vs hand-labeled (top-3, top-10) | computed in test harness | quality gate |
| top-K overlap vs DuckDB FTS oracle (top-3, top-10) | computed in test harness | quality gate |
| top-3 PIDs for concept-only queries (ceramic, bone, mammal, +1-2) | substrate run | concept-only top-3 relevance hard-fail |
| top-10 Jaccard between "pottery from Cyprus" and "pottery Cyprus" | two runs, computed | stopword near-equivalence hard-fail |
| top-K identity for "pottery pottery cyprus" vs "pottery cyprus" | two runs, computed | duplicate-term hard-fail |
| empty/short/long-token query outcome (state, fetch count, elapsed) | special test cases | edge-length-token hard-fail |
| all-stopword query ("a the of") outcome | special test case | controlled-empty-state hard-fail |
| display-join-miss count (substrate hit, no samples_map_lite row) | per-query | missing-display-join hard-fail |
| filter-composition top-K identity preservation per (query, filter) pair | hand-labeled expected filtered top-K | filter-composition hard-fail |
| tokenizer-parity check: every benchmark term tokenized identically by Python and JS | side-channel, run once per benchmark | tokenizer-parity hard-fail |
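
The derived metrics in this table (top-K overlap, top-10 Jaccard, top-K identity, display-join-miss count) are simple set and list computations over ranked PID lists. A minimal sketch of how a test harness might compute them follows; the function names are hypothetical, and the overlap definition (share of the reference top-k recovered by the run) is an assumption, since the issue does not pin one down.

```python
def top_k_overlap(run, reference, k):
    """Fraction of the reference top-k that also appears in the run's top-k."""
    ref = set(reference[:k])
    if not ref:
        return 0.0
    return len(set(run[:k]) & ref) / len(ref)


def top_k_jaccard(run_a, run_b, k):
    """Jaccard similarity of two top-k PID sets (order-insensitive)."""
    a, b = set(run_a[:k]), set(run_b[:k])
    if not a and not b:
        return 1.0  # two empty result lists are treated as identical
    return len(a & b) / len(a | b)


def top_k_identical(run_a, run_b, k):
    """True when both runs return the same PIDs in the same order within top-k."""
    return run_a[:k] == run_b[:k]


def display_join_misses(hit_pids, display_pids):
    """Count substrate hits that have no matching display row."""
    display = set(display_pids)
    return sum(1 for pid in hit_pids if pid not in display)
```

For example, `top_k_jaccard(["a", "b", "c"], ["a", "c", "d"], 3)` shares two PIDs out of four distinct ones, giving 0.5.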

4. Worst-case + concept-label coverage

  • Worst-case composition: rare token + 2 facets + source filter, cold.
  • Concept-only queries from the curated benchmark (ceramic, bone, mammal) must return non-empty results — verifies the v1 substrate dereferences vocabulary labels correctly. Failing any concept-only query is a hard fail, regardless of latency.
  • Stopword-heavy queries (pottery from Cyprus) must return non-empty results — verifies query-time stopword removal works.
  • Wildcard literals (%, _) handled by tokenizer (no ILIKE escape — substrate path doesn't use ILIKE).

5. DuckDB FTS local-only relevance oracle

Run a parallel relevance evaluation against a known-good BM25 system as a v1 quality oracle and escalation trigger. This anchors "does our static substrate approximate BM25 over the same document projection?" — it does not answer "is static-browser the right product boundary for good search?"

What this oracle does NOT cover (and therefore does NOT close the hosted-search question):

  • Richer analyzers (language-specific tokenizers, n-gram, stemming variations)
  • Phrase search with positional indexes
  • Typo tolerance / fuzzy matching
  • Tuning + explainability of relevance (boost knobs, explain traces)
  • API latency under composition with other filters at scale
  • v2+ field-growth ergonomics (adding 6+ entity-derived fields without rebuild pain)

These are reasons hosted search remains a permanent contingency — see §7 below + #172 NO-GO framing.

6. Acceptance criteria for the prototype

This issue ships the prototype + benchmark data. It does not decide GO/NO-GO — that's #172.

  • Substrate path implemented behind ?fts=v1
  • Composes correctly with source + facet filters
  • Worst-case composition test passes
  • All concept-only benchmark queries return non-empty results AND have their top-3 PIDs recorded
  • Stopword-heavy queries return non-empty results AND record top-10 Jaccard vs the stopword-stripped form
  • Wildcard literals handled
  • Benchmark JSON contains every metric listed in the §3 metrics-contract table (the data that #172, the GO/NO-GO decision gate, consumes)
  • Top-3 / top-10 overlap numbers vs hand-labeled set AND vs DuckDB FTS reference posted to this issue
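
The "benchmark JSON contains every metric" criterion can be checked mechanically before handing data to #172. The sketch below assumes a per-query record shaped as a flat dict; the key names are hypothetical placeholders, not the actual benchmark schema, and only metrics that apply to every query are listed.

```python
# Hypothetical key names for the per-query metrics in the §3 table.
REQUIRED_PER_QUERY = (
    "cold_ms",
    "warm_repeat_ms",
    "warm_new_query_ms",
    "bytes_transferred",
    "results_count",
    "top_k_pids",
    "display_join_miss_count",
)


def missing_metrics(record, required=REQUIRED_PER_QUERY):
    """Return the required keys that are absent or null in one query record."""
    return [key for key in required if record.get(key) is None]


def validate_benchmark(records):
    """Map each query string to its missing metrics; an empty dict means complete."""
    problems = {}
    for record in records:
        missing = missing_metrics(record)
        if missing:
            problems[record.get("query", "<unknown>")] = missing
    return problems
```

Running a check like this in the harness would surface an absent metric as an explicit failure rather than letting a gate evaluate against missing data.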

7. Hosted-search backend as a permanent contingency (not just NO-GO downstream)

The DuckDB FTS oracle in §5 anchors v1 quality. It does not close the hosted-search question. Hosted search (Solr / Meilisearch / Typesense / equivalent) remains a contingency for either of the following triggers:

  • (a) v1 GO/NO-GO failure in #172 (Explorer FTS Track 5: GO/NO-GO decision gate): the substrate misses budgets or quality thresholds.
  • (b) v2+ quality requirements that exceed what a static substrate can deliver — e.g., phrase search, typo tolerance, richer analyzer pipelines, or v2 field growth that pushes the static substrate over its byte budget.

Either trigger fires the same downstream issue: Explorer FTS Track 6: Hosted-search backend. A v1 GO does not close (b); it just means we ship the static substrate and revisit hosted search when v2 requirements demand it.

Out of scope

Refs

#165, #169, #170, #172, PR #95

Labels

enhancement (New feature or request), explorer (Interactive Explorer features)