Explorer FTS Track 3: Offline index builder + tokenizer regression set #170

@rdhyee

Description

Updated 2026-05-08 per Codex review round 2 on #165. Concept-label join (vocab_labels.parquet) is now an explicit input + tokenization step, not an implicit consequence of #169's contract. Build-stats artifact (build_stats.json) added so coverage is empirical, not narrative.

Sub-issue of #165. Depends on #169 (search_index_v1 contract doc).

Goal

Implement the offline pipeline that builds the v1 search index from the sample-centric document projection specified in #169 / SEARCH_INDEX_V1.md. The projection joins across the property graph and emits per-virtual-field token rows.

Scope

1. Builder

  • File: tools/build_search_index.py (new; sibling to tools/build_fts_index.py from PR #95, "Improve search: multi-term AND + relevance ranking (FTS spike)", which stays as the FTS spike artifact).

  • Inputs (v1 minimum):

    • samples_map_lite.parquet → sample.label, sample.place_name[] per sample
    • sample_facets_v2.parquet → sample.description per sample, plus the URI lists for material, context, object_type
    • vocab_labels.parquet → SKOS prefLabel (en) for each URI; load-bearing: this is what makes concept.label work and what dereferences <…>/Pottery → pottery for the index
  • Document-projection step:

    1. For each sample (pid), gather the v1-minimum text fragments tagged by virtual field:
      • sample.label (1 fragment per sample)
      • sample.description (1 fragment per sample, sparse coverage)
      • sample.place_name (1+ fragments from the array)
      • concept.label (N fragments: one per material URI, one per context URI, and one per object_type URI; resolve each URI through vocab_labels.parquet and emit the prefLabel; a URI without a prefLabel falls back to the URI tail and increments the concept_label_missing_pref_label build-stat counter)
    2. Tokenize each fragment with the Python tokenizer (§2). Track per-(pid, field) token frequencies and doc length.
    3. Emit token rows per the #169 (search_index_v1 contract doc) §4 schema:
      { token, pid, field, tf, doc_len }
      
    4. Compute global per-token DF across all (pid, field) pairs.
  • Outputs:

  • Honors per-shard byte cap from the contract (default 5 MB); sub-shards high-frequency tokens automatically.
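
The document-projection step above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the builder itself: samples are plain dicts rather than rows joined from the parquet inputs, `placeholder_tokenize` stands in for the §2 tokenizer (whose rules are not given here), and the helper names (`uri_tail`, `resolve_concept_label`, `project_sample`, `build_token_rows`) are hypothetical. DF is interpreted as the number of (pid, field) pairs that contain a token, which is one reading of step 4.

```python
import re
from collections import Counter, defaultdict


def placeholder_tokenize(text):
    # Stand-in only: lowercase and split on non-word characters.
    # The real tokenizer is defined in section 2 and may differ.
    return re.findall(r"\w+", text.lower())


def uri_tail(uri):
    # Last path or fragment segment, e.g. "test://Pottery" -> "Pottery".
    return re.split(r"[/#]", uri.rstrip("/#"))[-1]


def resolve_concept_label(uri, vocab_labels, stats):
    # Look up the prefLabel; fall back to the URI tail and count the miss.
    label = vocab_labels.get(uri)
    if label is None:
        stats["concept_label_missing_pref_label"] += 1
        return uri_tail(uri)
    return label


def project_sample(sample, vocab_labels, stats):
    # Gather (virtual field, text fragment) pairs for one sample.
    fragments = []
    if sample.get("label"):
        fragments.append(("sample.label", sample["label"]))
    if sample.get("description"):
        fragments.append(("sample.description", sample["description"]))
    for name in sample.get("place_name") or []:
        fragments.append(("sample.place_name", name))
    for facet in ("material", "context", "object_type"):
        for uri in sample.get(facet) or []:
            label = resolve_concept_label(uri, vocab_labels, stats)
            fragments.append(("concept.label", label))
    return fragments


def build_token_rows(samples, vocab_labels, tokenize=placeholder_tokenize):
    stats = Counter()
    rows = []
    df = Counter()
    for sample in samples:
        per_field = defaultdict(Counter)
        for field, text in project_sample(sample, vocab_labels, stats):
            per_field[field].update(tokenize(text))
        for field, tf in per_field.items():
            doc_len = sum(tf.values())  # token count for this (pid, field)
            for token, count in tf.items():
                rows.append({"token": token, "pid": sample["pid"],
                             "field": field, "tf": count, "doc_len": doc_len})
                df[token] += 1  # one more (pid, field) pair containing token
    return rows, df, stats


# Tiny made-up corpus; the test:// URIs mirror the fixture style in section 4.
samples = [
    {"pid": "s1", "label": "Pottery sherd", "material": ["test://Pottery"]},
    {"pid": "s2", "label": "Femur", "material": ["test://Unknown"],
     "place_name": ["Iron Gate"]},
]
vocab = {"test://Pottery": "Pottery"}
rows, df, stats = build_token_rows(samples, vocab)
```

In this sketch `rows` carries the `{ token, pid, field, tf, doc_len }` shape from step 3, and `stats` holds the missing-prefLabel counter that would feed build_stats.json. Sharding and the per-shard byte cap are deliberately left out.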

2. Python tokenizer

3. Tokenizer regression set

  • File: tests/search_tokenizer_regression.json.
  • ≥ 30 strings covering:
    • diacritics: Çatalhöyük, Köln, São Paulo
    • mixed case: MaterialSampleRecord, iSamples
    • hyphenated compounds: Iron-Age, co-located
    • archaeological place names with whitespace + punctuation
    • IGSN-style ids: IGSN:HRV000ABC
    • numeric: 1965, 2.5kg
    • empty / whitespace-only
    • very long strings (length-filter edge)
    • the strings pottery, ceramic, bone, mammal, marine, basalt as plain inputs (the regression set proves the tokenizer handles these correctly; URI dereferencing is proved separately in §4 builder E2E, where the test fixture maps known URIs to these labels and asserts the token rows appear)
  • Each entry: { "input": "...", "expected_tokens": ["..."] }.
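
A regression set in this shape can be checked with a small runner. The sketch below assumes only the entry format above; `run_regression` and `stand_in_tokenize` are hypothetical names, and the two inline entries are illustrative values chosen to match the stand-in tokenizer, not entries from the real file.

```python
import json


def run_regression(entries, tokenize):
    # Return (input, expected, actual) for every entry that mismatches.
    failures = []
    for entry in entries:
        actual = tokenize(entry["input"])
        if actual != entry["expected_tokens"]:
            failures.append((entry["input"], entry["expected_tokens"], actual))
    return failures


def stand_in_tokenize(text):
    # Placeholder for the section 2 tokenizer: lowercase, split on whitespace.
    return text.lower().split()


# Illustrative entries only; the real set lives in
# tests/search_tokenizer_regression.json.
entries = json.loads("""[
  {"input": "pottery", "expected_tokens": ["pottery"]},
  {"input": "   ", "expected_tokens": []}
]""")

failures = run_regression(entries, stand_in_tokenize)
```

Returning the full list of mismatches, rather than stopping at the first, makes a failing run easier to diagnose when several entries diverge at once.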

4. Tests

  • tests/test_search_tokenizer.py — Python tokenizer against regression set. Scope: tokenizer only. Does NOT prove URI dereferencing.
  • A JS port of the tokenizer (small, ~15 lines) lives in assets/js/search_tokenizer.js; a Node-based parity check (tests/test_search_tokenizer_js.spec.js or similar) runs the same regression set against the JS implementation. CI runs both.
  • tests/test_search_index_builder.py — end-to-end builder fixture, scope: URI dereferencing + projection + shard structure. Build a tiny corpus (10 docs) where:
    • At least 3 docs have material URIs that map to known prefLabels in a fixture vocab_labels table (e.g., <test://Pottery> → "Pottery", <test://Ceramic> → "Ceramic", <test://Bone> → "Bone").
    • At least 1 doc has a material URI with no prefLabel (verify URI-tail fallback + concept_label_missing_pref_label counter increments in build_stats.json).
    • Asserts that searching the resulting substrate for pottery returns exactly the pid(s) whose material is <test://Pottery> — the actual proof that URI dereferencing works end-to-end.
    • Also asserts shard structure, DF/doc_len values, and lookup of known tokens returns expected pids per virtual field.

5. CI

  • Add a workflow step that runs both regression tests on every PR. Tokenizer divergence between Python and JS is a hard fail.

6. Build-stats artifact

The build emits <output_dir>/build_stats.json, which turns coverage from a doc claim into an empirical artifact that future runs can be regressed against:

{
  "data_version": "isamples_202601",
  "built_at_utc": "2026-MM-DDTHH:MM:SSZ",
  "total_samples": 6700000,
  "fields": {
    "sample.label":       { "samples_with_field": 6680000, "total_tokens": 12345678, "avg_doc_len": 3.2 },
    "sample.description": { "samples_with_field": 1610000, "total_tokens": 89012345, "avg_doc_len": 41.5 },
    "sample.place_name":  { "samples_with_field": 2210000, "total_tokens": 4567890,  "avg_doc_len": 4.1 },
    "concept.label":      { "samples_with_field": 6650000, "total_tokens": 8901234,  "avg_doc_len": 3.0 }
  },
  "concept_label_uri_resolution": {
    "material_resolved":     0.97,
    "material_missing_pref": 0.03,
    "context_resolved":      0.95,
    "context_missing_pref":  0.05,
    "object_type_resolved":  0.99,
    "object_type_missing_pref": 0.01
  },
  "shard_count": 64,
  "shard_max_size_mb": 4.7,
  "total_bytes_uncompressed": 234567890,
  "build_seconds": 312.5,
  "top_df_tokens": [["the", 5800000], ["of", 4200000], ...]
}
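
The `concept_label_uri_resolution` block could be derived from per-facet counters kept during the build. A minimal sketch, assuming rates are computed per URI occurrence (not per distinct URI) and rounded to two decimals; the `resolution_rates` helper and the shape of its input are hypothetical.

```python
def resolution_rates(counts):
    # counts: {facet: {"resolved": int, "missing_pref": int}}
    # Returns the flat "<facet>_resolved" / "<facet>_missing_pref" rate dict.
    rates = {}
    for facet, c in counts.items():
        total = c["resolved"] + c["missing_pref"]
        if total == 0:
            continue  # no URIs seen for this facet; skip to avoid dividing by zero
        rates[f"{facet}_resolved"] = round(c["resolved"] / total, 2)
        rates[f"{facet}_missing_pref"] = round(c["missing_pref"] / total, 2)
    return rates


# Made-up counts for illustration.
rates = resolution_rates({
    "material": {"resolved": 97, "missing_pref": 3},
    "context": {"resolved": 0, "missing_pref": 0},
})
```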

Acceptance

  • Builder produces a v1 index from a small test corpus that round-trips against the #169 (search_index_v1 contract doc) §4 schema
  • Concept-label coverage: samples_with_field for concept.label ≥ 90% of total_samples (verifies the URI dereferencing actually works against real data, not just unit-test corpus)
  • Concept-label resolution rate: ≥ 90% of facet URIs resolve to a SKOS prefLabel (en); URIs missing prefLabel fall back to URI tail with the build-stat counter incrementing
  • Python and JS tokenizers produce identical output for every string in the regression set
  • CI fails if the two diverge
  • build_stats.json artifact emitted and committed alongside the PR
  • Build-time stats summary in PR description (lift the headlines from build_stats.json)
  • PR merged to main
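
The two numeric thresholds above can be checked mechanically against build_stats.json. The sketch below is one possible gate, not part of the spec: `check_acceptance` is a hypothetical helper, and it assumes the ≥ 90% resolution criterion is applied to each facet's `*_resolved` rate separately, since the artifact reports rates per facet. The sample values are the illustrative ones from the build-stats example.

```python
def check_acceptance(stats, min_coverage=0.90, min_resolution=0.90):
    # Return a list of human-readable problems; an empty list means pass.
    problems = []
    concept = stats["fields"]["concept.label"]
    coverage = concept["samples_with_field"] / stats["total_samples"]
    if coverage < min_coverage:
        problems.append(f"concept.label coverage {coverage:.3f} < {min_coverage}")
    for key, rate in stats["concept_label_uri_resolution"].items():
        if key.endswith("_resolved") and rate < min_resolution:
            problems.append(f"{key} {rate:.3f} < {min_resolution}")
    return problems


# Trimmed copy of the illustrative build_stats.json values.
example_stats = {
    "total_samples": 6700000,
    "fields": {"concept.label": {"samples_with_field": 6650000}},
    "concept_label_uri_resolution": {
        "material_resolved": 0.97, "material_missing_pref": 0.03,
        "context_resolved": 0.95, "context_missing_pref": 0.05,
        "object_type_resolved": 0.99, "object_type_missing_pref": 0.01,
    },
}
problems = check_acceptance(example_stats)
```

A gate like this could run in CI after the build, so that a drop in coverage or resolution rate fails the PR instead of being noticed later in the stats summary.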

Out of scope

Refs

#165, #169, #171
