Improve Interactive Explorer full-text search substrate #165

@rdhyee

Context

Follow-up to #84 and PR #95. The DuckDB FTS spike answered the first question: native DuckDB FTS works, including BM25 ranking, but the generated browser-delivered index is too large to ship by default.

As of 2026-05-08, the live Interactive Explorer at https://isamples.org/explorer.html does not have production full-text search in the search-engine sense. It has browser-side DuckDB-WASM substring search over Parquet.

Current behavior

  • The live page is backed by isamplesorg.github.io/explorer.qmd.
  • Search runs against https://data.isamples.org/isamples_202601_samples_map_lite.parquet (~60 MB).
  • The live search predicate covers only:
    • label
    • place_name
  • Search uses ILIKE '%term%', with multi-word queries split into terms and combined with AND.
  • Result ordering is a simple score:
    • label match = 3
    • place_name match = 2
  • Search composes with active source and facet filters.
  • Search does not currently cover description in the live explorer because samples_map_lite.parquet does not carry it.
  • Search does not cover the canonical Solr searchText equivalent described in query-spec.qmd.
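The behavior listed above can be sketched as a small query builder. This is a minimal illustration in Python, not the code in explorer.qmd: the function names only echo `searchTerms`, `textSearchWhere`, and `textSearchScore`, and the way per-term scores are combined (first matching field's weight, summed across terms) is an assumption, since the issue only gives the per-field weights.

```python
# Hypothetical Python analogue of the live search predicate; the real
# implementation lives in explorer.qmd and may differ in detail.

# Field -> score weight, taken from the "Current behavior" list.
FIELDS = {"label": 3, "place_name": 2}


def search_terms(query: str) -> list[str]:
    """Split a raw query into whitespace-separated terms."""
    return query.split()


def text_search_where(terms: list[str]) -> tuple[str, list[str]]:
    """Build a parameterized WHERE clause.

    Each term must match at least one field (OR); terms are combined with AND.
    """
    clauses: list[str] = []
    params: list[str] = []
    for term in terms:
        per_field = " OR ".join(f"{field} ILIKE ?" for field in FIELDS)
        clauses.append(f"({per_field})")
        params.extend([f"%{term}%"] * len(FIELDS))
    return " AND ".join(clauses) or "TRUE", params


def text_search_score(terms: list[str]) -> tuple[str, list[str]]:
    """Build a score expression.

    Assumption: each term contributes the weight of the first field it
    matches, and contributions are summed across terms.
    """
    whens = " ".join(f"WHEN {f} ILIKE ? THEN {w}" for f, w in FIELDS.items())
    case = f"(CASE {whens} ELSE 0 END)"
    params: list[str] = []
    for term in terms:
        params.extend([f"%{term}%"] * len(FIELDS))
    return " + ".join([case] * len(terms)) or "0", params
```

Source and facet filters would be ANDed onto the returned WHERE clause. Note that this sketch passes `%` and `_` in user input straight through as wildcards.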

Relevant code/docs:

  • explorer.qmd: searchTerms, textSearchWhere, textSearchScore
  • explorer.qmd: search handler around doSearch()
  • query-spec.qmd: intended text MATCHES semantics
  • tools/build_fts_index.py: preserved DuckDB FTS spike artifact

Mismatch to resolve

query-spec.qmd currently says the web Explorer subset is label + description + place_name, but the live explorer searches only label + place_name.

Either:

  1. Update the docs/UI to state the actual live subset, or
  2. Change the live search implementation to include description, probably via sample_facets_v2.parquet or another search-optimized projection.

Measured behavior

The following measurements were taken with native DuckDB against the live hosted Parquet files on 2026-05-08. They are not browser timings, but they show the rough scan cost.

Live-style samples_map_lite search over label + place_name:

  • pottery: 50 results, ~1.8s cold
  • basalt: 50 results, ~0.5s
  • unlikely no-hit term: 0 results, ~0.2s

sample_facets_v2 search over label + description + place_name:

  • pottery: 50 results, ~3.0s
  • cyprus: 50 results, ~0.7s
  • basalt: 50 results, ~0.5s

DuckDB FTS spike findings from PR #95:

  • Full index (label + description + place_name): ~358 MB
  • Lite index (label + place_name): ~211 MB
  • ATTACH over HTTP in DuckDB-WASM works, but downloading 200-358 MB is too large for an interactive default page.

Recommended path

Build a small, browser-friendly, pre-tokenized inverted-index Parquet substrate rather than shipping a DuckDB .duckdb FTS database by default.

Sketch:

  • Offline pipeline creates token rows such as:
    • token
    • pid
    • field
    • weight
    • optional term_frequency
    • optional source
    • optional partition/hash/prefix columns
  • Host it under https://data.isamples.org/ as versioned Parquet.
  • Partition by token prefix/hash so browser queries touch only relevant byte ranges.
  • For multi-term queries, fetch/intersect PID sets, score by field/term weights, then join back to samples_map_lite or sample_facets_v2 for display fields.
  • Keep DuckDB FTS as an optional "enhanced search" experiment only if users explicitly accept a large download.
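To make the sketch concrete, here is a minimal in-memory model of the token rows and the query path. It is an illustration under stated assumptions, not a design decision: the tokenizer, the `description` weight of 1, the partition count, and the CRC32 partitioning are all hypothetical, and a real substrate would write the rows to partitioned Parquet and query them through DuckDB-WASM rather than scanning a Python list.

```python
import re
import zlib
from collections import Counter, defaultdict

# Hypothetical values: label/place_name weights mirror the live score,
# the description weight and partition count are placeholders.
FIELD_WEIGHTS = {"label": 3, "place_name": 2, "description": 1}
N_PARTITIONS = 16


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def partition_of(token: str) -> int:
    """Stable hash partition, so a query term maps to exactly one partition."""
    return zlib.crc32(token.encode("utf-8")) % N_PARTITIONS


def build_token_rows(samples: list[dict]) -> list[dict]:
    """Offline step: emit one row per (token, pid, field)."""
    rows = []
    for sample in samples:
        for field, weight in FIELD_WEIGHTS.items():
            counts = Counter(tokenize(sample.get(field) or ""))
            for token, tf in counts.items():
                rows.append({
                    "token": token,
                    "pid": sample["pid"],
                    "field": field,
                    "weight": weight,
                    "term_frequency": tf,
                    "partition": partition_of(token),
                })
    return rows


def search(rows: list[dict], query: str) -> list[str]:
    """Query step: intersect PID sets per term, rank by summed field weight."""
    terms = tokenize(query)
    if not terms:
        return []
    per_term = []
    for term in terms:
        part = partition_of(term)  # only this partition would be fetched
        scores: dict[str, int] = defaultdict(int)
        for row in rows:
            if row["partition"] == part and row["token"] == term:
                scores[row["pid"]] += row["weight"]
        per_term.append(scores)
    pids = set.intersection(*(set(s) for s in per_term))
    # Highest total weight first; ties broken by pid for stable output.
    return sorted(pids, key=lambda p: (-sum(s[p] for s in per_term), p))
```

One consequence worth noting: this matches whole tokens, whereas the current `ILIKE '%term%'` search matches substrings, so result sets would differ unless prefix or n-gram tokens are also indexed.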

Acceptance criteria

  • Decide and document current search semantics: side-panel lookup vs global filter. Coordinate with #164 (Explorer state contract: URL/DOM/widget-state inventory + search-as-global-filter decision).
  • Resolve the query-spec.qmd mismatch for the current live subset.
  • Add a search substrate design note covering data shape, partitioning, update cadence, and expected size.
  • Prototype an inverted-index Parquet file and benchmark:
    • first search latency
    • repeated search latency
    • bytes transferred
    • result quality for pottery, pottery Cyprus, basalt, and source/facet-filtered searches
  • Update the Explorer UI copy so users know which fields are searched.
  • Add Playwright or equivalent smoke coverage for:
    • multi-term AND behavior
    • literal wildcard characters such as % and _
    • source/facet-filter composition
    • no-result behavior
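For the literal-wildcard case, one possible approach is to escape user input before wrapping it in `%...%`. The helper below is a hypothetical sketch; this issue does not say whether explorer.qmd already does something similar. It assumes the query uses an explicit escape character, for example `label ILIKE ? ESCAPE '\'`, which should be verified in DuckDB-WASM.

```python
def escape_like(term: str, escape: str = "\\") -> str:
    """Escape LIKE metacharacters so % and _ match literally.

    The escape character itself is handled first, so backslashes added
    for % and _ are not escaped a second time.
    """
    for ch in (escape, "%", "_"):
        term = term.replace(ch, escape + ch)
    return term


def like_pattern(term: str) -> str:
    """Wrap an escaped term for substring matching."""
    return f"%{escape_like(term)}%"
```

A smoke test could then search for a term such as `50%` or `a_b` and assert that only rows containing those literal characters come back.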

Related

Labels: enhancement (New feature or request), explorer (Interactive Explorer features)