[PROPOSAL] Adopt tree-sitter query files (.scm) for node discovery — phased rollout, defer full migration until triggers met #43

@Wolfvin

Summary

Proposal to adopt tree-sitter query files (.scm) as the long-term direction for CodeLens's node/edge discovery layer, with a phased rollout that defers full migration until concrete triggers are met. This issue documents the design analysis so the discussion is not lost and future contributors can pick up the thread.

Status: Proposal / Discussion — not approved for implementation yet.
Recommendation: Do NOT migrate existing parsers now. DO set up the YAML node-type registry as a low-risk stepping stone. DO adopt .scm for any NEW language added from now on, after a successful pilot.

Background

CodeLens currently extracts code structure (functions, classes, calls, imports) using two parallel mechanisms:

  1. Tree-sitter parsers (7 languages: HTML, CSS, JS, TS/TSX, Rust, Python, Vue, Svelte, Blade) — each is a hand-written Python module that walks the AST imperatively using BaseParser.walk_tree() / find_nodes_by_type().
  2. Regex fallback parsers (28 languages: C, C++, Go, Java, Kotlin, Swift, Ruby, PHP, etc.) — used when tree-sitter grammar is unavailable or the environment lacks tree-sitter.

Combined: ~35 languages covered. Test coverage >= 80% on scripts/ (with parsers/ and commands/ omitted from coverage).

The current parser-per-language model works, but each new language requires a bespoke Python module (~250 LOC for tree-sitter parsers, ~150 LOC for regex fallbacks). This raises a recurring question: should CodeLens adopt a more declarative approach to node discovery?

Three approaches considered

Approach 1 — Tree-sitter query files (.scm) [ast-grep style]

Tree-sitter ships its own query language (files use .scm extension, syntax is Scheme-like). A query file declaratively describes which AST nodes to capture, with @capture annotations for field extraction.

Conceptual example (python.scm):

(function_definition
  name: (identifier) @function.name) @function.def

(call
  function: (identifier) @call.callee) @call.site

(import_statement
  (dotted_name) @import.path) @import

A single generic engine reads the .scm file, executes the query against the AST, and returns captured nodes. One engine serves many languages.
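
A minimal sketch of what the engine could do after the query has run, turning captures into flat node records. It assumes each match arrives as a list of (capture name, text, line) records. That shape, and the names `Capture` and `matches_to_nodes`, are hypothetical stand-ins: real tree-sitter bindings return node objects, and their API differs between versions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capture:
    """Hypothetical stand-in for one query capture."""
    name: str  # capture name from the .scm file, e.g. "function.name"
    text: str  # source text of the captured node
    line: int  # 1-based line number


def matches_to_nodes(matches):
    """Convert query matches into flat node records.

    Captures named "<kind>.def", "<kind>.site" or plain "<kind>" mark the
    whole node and supply the location; any other suffix ("name", "callee",
    "path") becomes a field on the record.
    """
    nodes = []
    for match in matches:
        record = {}
        for cap in match:
            kind, _, role = cap.name.partition(".")
            record["kind"] = kind
            if role in ("def", "site", ""):
                record["line"] = cap.line
            else:
                record[role] = cap.text
        nodes.append(record)
    return nodes


# Matches as the python.scm example above might produce them (made-up data).
matches = [
    [Capture("function.def", "def load(): ...", 1), Capture("function.name", "load", 1)],
    [Capture("call.site", "parse(x)", 2), Capture("call.callee", "parse", 2)],
    [Capture("import", "import os.path", 4), Capture("import.path", "os.path", 4)],
]
nodes = matches_to_nodes(matches)
for node in nodes:
    print(node)
```

Because the mapping is driven only by capture names, the same function would serve any language whose .scm file follows the same naming scheme.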

Approach 2 — Node type registry (YAML) [Semgrep internal style]

Externalize the hardcoded node-type lookups (find_nodes_by_type(root, "function_definition")) to a YAML config:

python:
  function_def: [function_definition, lambda]
  call: [call]
  import: [import_statement, import_from_statement]
  class: [class_definition]

java:
  function_def: [method_declaration, constructor_declaration]
  call: [method_invocation]
  import: [import_declaration]
  class: [class_declaration]

go:
  function_def: [function_declaration, method_declaration]
  call: [call_expression]
  import: [import_declaration]
  class: [type_declaration]

The Python walker stays, but reads node types from config instead of hardcoding them. Adding a language = adding 5 lines of YAML.
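
A rough sketch of a registry-driven walker. It assumes the YAML has already been parsed into a plain dict (`NODE_TYPES` stands in for that) and uses a tiny `Node` class in place of a real tree-sitter node. `find_nodes_by_kind` is a hypothetical name, not the existing `find_nodes_by_type` API.

```python
from dataclasses import dataclass, field

# Stand-in for the parsed YAML registry (assumption: loaded elsewhere).
NODE_TYPES = {
    "python": {
        "function_def": ["function_definition", "lambda"],
        "call": ["call"],
    },
    "java": {
        "function_def": ["method_declaration", "constructor_declaration"],
        "call": ["method_invocation"],
    },
}


@dataclass
class Node:
    """Minimal stand-in for an AST node: a type plus child nodes."""
    type: str
    children: list = field(default_factory=list)


def find_nodes_by_kind(root, language, kind, registry=NODE_TYPES):
    """Return nodes whose type is registered for (language, kind), in pre-order."""
    wanted = set(registry[language][kind])
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.type in wanted:
            found.append(node)
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return found


tree = Node("module", [
    Node("function_definition", [Node("call")]),
    Node("expression_statement", [Node("lambda")]),
])
function_defs = find_nodes_by_kind(tree, "python", "function_def")
print([n.type for n in function_defs])  # ['function_definition', 'lambda']
```

The walker itself contains no language-specific strings. Supporting another language means adding a registry entry, which is the point of this approach.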

Approach 3 — Token-based matching [Spacegrep style]

Tokenize source code and match patterns at the token level. Less accurate than AST, but zero grammar dependency.

Already implemented. CodeLens's 28 fallback_*.py parsers already do this (using regex, not formal tokenization, but conceptually equivalent). This is not a new direction — it is the existing fallback layer. Listed here only for completeness; no action proposed.
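
For readers unfamiliar with the fallback layer, a regex-based extractor looks roughly like this. It is an illustrative sketch for Go function declarations, not the code of any actual fallback_*.py module. The pattern and the name `find_go_functions` are assumptions.

```python
import re

# Illustrative pattern: "func", an optional receiver in parentheses, then the name.
GO_FUNC = re.compile(
    r"^func\s+(?:\([^)]*\)\s*)?([A-Za-z_]\w*)\s*\(",
    re.MULTILINE,
)


def find_go_functions(source):
    """Return (name, line) pairs for top-level func declarations."""
    results = []
    for m in GO_FUNC.finditer(source):
        line = source.count("\n", 0, m.start()) + 1  # 1-based line number
        results.append((m.group(1), line))
    return results


sample = """package main

func main() {
    run()
}

func (s *Server) Start(port int) error {
    return nil
}
"""
go_functions = find_go_functions(sample)
print(go_functions)  # [('main', 3), ('Start', 7)]
```

A pattern like this also fires on a matching line inside a comment or raw string, which is the accuracy cost compared with an AST.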

Key analytical insight

CodeLens does not sell "list all functions." CodeLens sells a cross-file call graph with framework semantics and status computation.

The parser layer is more than node discovery. Decomposing what each current tree-sitter parser does (~250 LOC each):

| Sub-task | Approx. LOC | Solvable by .scm? |
|---|---|---|
| Node discovery (functions, classes, calls, imports) | ~80 | Yes (sweet spot) |
| Raw edge extraction (caller->callee pairs) | ~50 | Yes |
| Framework semantic extraction (Tauri #[command], Pinia useXxxStore, Next.js conventions) | ~70 | No (requires post-processing logic) |
| Helper utilities (BaseParser inheritance, line numbers, text extraction) | ~50 | No (generic Python) |

So the realistic LOC reduction from .scm migration is ~50% (250 -> ~125), not the ~80% one might naively expect from "50 LOC .scm replaces 250 LOC Python."
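
The arithmetic behind that estimate, using the approximate per-sub-task figures from the breakdown above. The dictionary keys are shorthand labels chosen for this sketch, and the LOC of the .scm file itself is not counted.

```python
# Approximate LOC per sub-task of one tree-sitter parser (figures from the breakdown).
loc = {
    "node_discovery": 80,
    "raw_edge_extraction": 50,
    "framework_semantics": 70,
    "helper_utilities": 50,
}
# Sub-tasks that a .scm query file can take over.
replaceable = {"node_discovery", "raw_edge_extraction"}

total = sum(loc.values())
replaced = sum(v for k, v in loc.items() if k in replaceable)
remaining = total - replaced
reduction = replaced / total

print(total, replaced, remaining, f"{reduction:.0%}")  # 250 130 120 52%
```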

Worse: the parts that .scm does NOT replace — framework semantics and edge resolution — are precisely the parts that constitute CodeLens's differentiation from ast-grep and Semgrep. Those tools stop at Layer 1. CodeLens adds Layer 2+3:

┌─────────────────────────────────────────────┐
│ Layer 3: Edge resolver + status computation │  <- Python, codebase-level
│   (snake<->camel, Tauri IPC, Pinia, Next.js)│     NOT replaced by .scm
├─────────────────────────────────────────────┤
│ Layer 2: Framework semantic extractor       │  <- Python, per-framework
│   (is_tauri_command, is_pinia_store, etc)   │     NOT replaced by .scm
├─────────────────────────────────────────────┤
│ Layer 1: Node + edge discovery              │  <- .scm query files
│   (functions, classes, calls, imports)      │     THIS is what .scm replaces
├─────────────────────────────────────────────┤
│ Layer 0: Tree-sitter grammar + AST          │  <- existing GrammarLoader
└─────────────────────────────────────────────┘
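
To make the split concrete, here is a hedged sketch of what stays in Python once Layer 1 comes from .scm. The record shapes, the file paths, and the names `tag_framework`, `resolve_edges` and `to_snake` are all hypothetical. The matching rule is a simplified stand-in for the real resolver.

```python
import re


def to_snake(name):
    """Normalize camelCase to snake_case so both spellings compare equal."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def tag_framework(nodes):
    """Layer 2 (sketch): flag functions carrying a `command` attribute."""
    for node in nodes:
        attrs = node.get("attributes", [])
        node["is_tauri_command"] = any("command" in a for a in attrs)
    return nodes


def resolve_edges(calls, definitions):
    """Layer 3 (sketch): link call sites to definitions by normalized name."""
    by_name = {to_snake(d["name"]): d for d in definitions}
    edges = []
    for call in calls:
        target = by_name.get(to_snake(call["callee"]))
        if target is not None:
            edges.append((call["file"], target["file"], target["name"]))
    return edges


# Layer 1 output as a query engine might return it (made-up data).
definitions = tag_framework([
    {"name": "get_user", "file": "src-tauri/src/main.rs", "attributes": ["tauri::command"]},
    {"name": "helper", "file": "src-tauri/src/main.rs", "attributes": []},
])
calls = [
    {"callee": "getUser", "file": "src/api.ts"},
    {"callee": "missing", "file": "src/api.ts"},
]
edges = resolve_edges(calls, definitions)
print(edges)
```

None of this logic can be expressed in a query file: it needs state across files and naming conventions that vary by framework.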

Honest ROI analysis

Factor 1 — How often does CodeLens add new languages?

Changelog review (v5.x -> v8.2): tree-sitter parsers went from 0 to 7, averaging <1 new language per major version. No indication of a planned 10+ language expansion in the next 6 months (issue #18 is open but not actively resourced).

At this rate, a 2-3 week migration investment saving 1-2 days per new language has a 3-5 year break-even horizon.
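
The break-even estimate can be reproduced with a small calculation. Two inputs are assumptions that this issue does not state: 5 working days per week, and a rate of 2 new languages per year, used here purely for illustration.

```python
def break_even_languages(migration_days, days_saved_per_language):
    """Number of new languages needed before the migration pays for itself."""
    return migration_days / days_saved_per_language


def break_even_years(migration_days, days_saved_per_language, languages_per_year):
    """Years until break-even at a given rate of adding languages."""
    return break_even_languages(migration_days, days_saved_per_language) / languages_per_year


# 2-3 weeks of migration at 5 working days per week (assumption): 10-15 days.
best = break_even_languages(10, 2)    # cheapest migration, largest saving
worst = break_even_languages(15, 1)   # costliest migration, smallest saving

# Midpoints (12.5 days, 1.5 days saved) at an assumed 2 languages per year.
mid_years = break_even_years(12.5, 1.5, 2.0)

print(best, worst, round(mid_years, 1))  # 5.0 15.0 4.2
```

So the migration pays off only after roughly 5 to 15 new languages. With the assumed rate, the midpoint lands inside the 3-5 year range quoted above; a higher or lower rate moves it proportionally.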

Factor 2 — What is CodeLens's actual pain point?

From the issue tracker, the urgent problems are:

  1. Edge resolution accuracy: the cross-file call graph still misses calls. The newly merged hybrid_type_resolver.py (PR #29, "feat(types): hybrid type resolution", which closes #13) is a step toward fixing this. This is the core differentiator vs. ast-grep.
  2. Performance: issues #10 (RAM-first indexing) and #22 (Rust core via PyO3) are open. Scanning 5000+ files is still slow.
  3. Unfixed P0 bugs: #31 (SQLite analysis_cache never populated) and #32 (--deep runs HybridEngine twice). These threaten credibility.
  4. AI agent adoption — chicken-and-egg. CodeLens needs agents that use it; agents need proof it is useful.

None of these are solved by .scm. .scm solves "adding languages is slow" — a non-urgent problem.

Factor 3 — Roadmap Phase 2 (Rust core via PyO3)

Issues #20 and #22 (open) propose rewriting the indexing/query core in Rust via PyO3. If that happens in the next 6-12 months, all Python parsers would be rewritten anyway, so a .scm migration that invests heavily in Python-side glue could become wasted effort.

Mitigation: tree-sitter's Rust bindings also support .scm query files. So .scm files written now are portable to a future Rust core. The Python walker glue would be discarded, but the .scm content itself is durable. This makes .scm a safer bet than rewriting parsers in pure Python idioms.

Recommendation

Do not migrate existing parsers now. Prepare the ground for future migration.

Priority order (highest ROI first):

| Priority | Action | Effort | ROI |
|---|---|---|---|
| 1 | Fix P0 bugs (#31, #32) | 2-3 days | Very high: restores credibility |
| 2 | Continue edge resolution accuracy work (build on hybrid_type_resolver) | ongoing | High: core differentiator |
| 3 | Performance work (#10, Phase 2 #22) | 2-4 weeks | High: unlocks scale |
| 4 | YAML node type registry (Approach 2, stepping stone) | 1-2 days | Medium: externalize config, zero risk |
| 5 | .scm for NEW languages only (no migration of existing) | per-language | Medium: incremental investment |
| 6 | Full migration of 7 existing tree-sitter parsers to .scm | 2-3 weeks | Low: thin ROI given current add-language rate |

Conditional triggers for "invest now"

The recommendation flips to "yes, migrate now" if all four of the following are true:

  1. P0 bugs (#31, #32) are fixed
  2. Phase 2 (Rust core) status is clarified: either cancelled, or confirmed with a clear plan that .scm will be reused in Rust
  3. Concrete commitment to add 5+ new languages within 6 months (issue #18, 158-language support via universal tree-sitter grammar loader, is active and resourced)
  4. A pilot migration of 1 language passes 100% test parity with the legacy parser

Until all four are met, this proposal stays in "discussion" status.

Proposed pilot (if/when triggers are met)

Pick one language as the pilot. Rust is the best candidate because:

  • Existing rust_parser.py (~250 LOC) gives a clear baseline for parity comparison
  • Rust grammar is mature and stable
  • Tauri IPC handling exercises Layer 2 (framework semantics) — good stress test of the "scm for Layer 1, Python for Layer 2" split

Steps:

  1. Write scripts/parsers/scm/rust.scm covering: function definitions, method definitions, struct/enum/impl blocks, call sites, use imports, #[attribute] captures.
  2. Build scripts/parsers/scm_engine.py — a generic engine that loads .scm files, executes queries, returns nodes/edges in the same shape as existing parsers ({path, nodes, edges}).
  3. Run both parsers (legacy + scm) on the existing Rust test fixtures and benchmark fixtures. Assert 100% parity on captured nodes/edges.
  4. If parity passes: wire scm_engine into commands/scan.py for Rust only, behind a flag --use-scm rust. Let it bake for 1 release cycle.
  5. If bake is clean: migrate Python, JS, TS, HTML, CSS one language per PR.
  6. Never migrate the Vue/Svelte/Blade parsers — they have heavy Layer 2 framework logic that .scm cannot express. Leave them as Python.
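
The parity assertion in step 3 could be sketched as below. Only the top-level {path, nodes, edges} shape comes from this proposal. The per-node and per-edge field names (`kind`, `name`, `line`, `source`, `target`) and the function names are assumptions for illustration.

```python
def normalize(result):
    """Reduce a parser result to order-independent sets for comparison."""
    nodes = {(n["kind"], n["name"], n["line"]) for n in result["nodes"]}
    edges = {(e["source"], e["target"]) for e in result["edges"]}
    return nodes, edges


def parity_report(legacy, scm):
    """List what the scm result is missing or adds relative to legacy."""
    legacy_nodes, legacy_edges = normalize(legacy)
    scm_nodes, scm_edges = normalize(scm)
    return {
        "missing_nodes": sorted(legacy_nodes - scm_nodes),
        "extra_nodes": sorted(scm_nodes - legacy_nodes),
        "missing_edges": sorted(legacy_edges - scm_edges),
        "extra_edges": sorted(scm_edges - legacy_edges),
    }


def has_parity(report):
    """True only when every difference list is empty (100% parity)."""
    return not any(report.values())


# Made-up results for one fixture file; the scm run drops one edge.
legacy = {
    "path": "src/lib.rs",
    "nodes": [
        {"kind": "function", "name": "run", "line": 3},
        {"kind": "function", "name": "init", "line": 10},
    ],
    "edges": [{"source": "run", "target": "init"}],
}
scm = {"path": "src/lib.rs", "nodes": list(legacy["nodes"]), "edges": []}

report = parity_report(legacy, scm)
print(report["missing_edges"], has_parity(report))
```

Reporting the differences, rather than a bare pass/fail, makes it easier to see which query pattern in rust.scm needs work.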

Plugin system angle (future direction, not part of this proposal)

CodeLens already has an engine plugin type (scripts/plugin_system.py). .scm query files could eventually be packaged as plugins rather than hardcoded in the repo:

my-org/rust-advanced/
  plugin.yaml      (type: engine, entrypoint: scm_engine.py)
  rust.scm         (query file)
  framework.yaml   (optional: Layer 2 framework rules)

This means external contributors could ship new language support without modifying CodeLens core. This is more ambitious than the original proposal and aligns with CodeLens's existing plugin philosophy. Out of scope for this issue — would warrant a separate proposal if there is interest.

Open questions for discussion

  1. Is there a concrete plan to add 5+ new languages in the next 6 months? (Maintainer input needed — this is the single biggest ROI lever.)
  2. Is Phase 2 (Rust core) still active, or effectively deferred? If deferred indefinitely, the "wait for Phase 2 clarity" trigger becomes moot and the calculus shifts.
  3. Are there languages where the legacy Python parser is actively painful to maintain (frequent bug fixes, regressions)? Those would be higher-priority migration candidates than the average.
  4. Is there appetite for the YAML node-type registry (Approach 2) as an immediate low-risk stepping stone, independent of the .scm decision?

References

Files (would be touched, if/when approved)

  • scripts/parsers/scm/ (new directory for .scm query files)
  • scripts/parsers/scm_engine.py (new generic engine)
  • scripts/commands/scan.py (dispatch wiring)
  • scripts/languages/node_types.yaml (if Approach 2 is also adopted as stepping stone)
  • tests/test_scm_engine.py (parity tests against legacy parsers)
  • references/parser-rules.md (documentation update)
  • CONTRIBUTING.md (add note: "new languages should use .scm, not bespoke Python parsers")
