Proposal to adopt tree-sitter query files (.scm) as the long-term direction for CodeLens's node/edge discovery layer, with a phased rollout that defers full migration until concrete triggers are met. This issue documents the design analysis so the discussion is not lost and future contributors can pick up the thread.
Status: Proposal / Discussion (not approved for implementation yet).
Recommendation: Do NOT migrate existing parsers now. DO set up the YAML node-type registry as a low-risk stepping stone. DO adopt .scm for any NEW language added from now on, after a successful pilot.
Background
CodeLens currently extracts code structure (functions, classes, calls, imports) using two parallel mechanisms:
Tree-sitter parsers (7 languages: HTML, CSS, JS, TS/TSX, Rust, Python, Vue, Svelte, Blade) — each is a hand-written Python module that walks the AST imperatively using BaseParser.walk_tree() / find_nodes_by_type().
Regex fallback parsers (28 languages: C, C++, Go, Java, Kotlin, Swift, Ruby, PHP, etc.) — used when tree-sitter grammar is unavailable or the environment lacks tree-sitter.
Combined: ~35 languages covered. Test coverage >= 80% on scripts/ (with parsers/ and commands/ omitted from coverage).
The current parser-per-language model works, but each new language requires a bespoke Python module (~250 LOC for tree-sitter parsers, ~150 LOC for regex fallbacks). This raises a recurring question: should CodeLens adopt a more declarative approach to node discovery?
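Three approaches considered
Approach 1: Tree-sitter query files (.scm) [ast-grep style]
Tree-sitter ships its own query language (files use the .scm extension; the syntax is Scheme-like). A query file declaratively describes which AST nodes to capture, with @capture annotations for field extraction.
A single generic engine reads the .scm file, executes the query against the AST, and returns captured nodes. One engine serves many languages.
Approach 2: Node type registry (YAML) [Semgrep internal style]
Externalize the hardcoded node-type lookups (find_nodes_by_type(root, "function_definition")) to a YAML config. The Python walker stays, but reads node types from config instead of hardcoding them. Adding a language = adding 5 lines of YAML.
Approach 3: Token-based matching [Spacegrep style]
Tokenize source code and match patterns at the token level. Less accurate than AST, but zero grammar dependency.
Already implemented. CodeLens's 28 fallback_*.py parsers already do this (using regex, not formal tokenization, but conceptually equivalent). This is not a new direction; it is the existing fallback layer. Listed here only for completeness; no action proposed.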
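To make the YAML node-type registry idea (Approach 2) more concrete, here is a minimal sketch. It is not CodeLens code: the NODE_TYPES dict stands in for an already-parsed scripts/languages/node_types.yaml, the Node class is a toy substitute for a tree-sitter node, and discover() is a hypothetical helper. Only the node type names (function_definition, function_item, and so on) come from the real tree-sitter grammars.

```python
# Hypothetical stand-in for a parsed node_types.yaml; a real setup would
# load this mapping from the YAML file instead of defining it inline.
NODE_TYPES = {
    "python": {"function": ["function_definition"], "class": ["class_definition"]},
    "rust": {"function": ["function_item"], "class": ["struct_item", "enum_item"]},
}


class Node:
    """Toy substitute for a tree-sitter node: a type name plus children."""

    def __init__(self, type_, children=()):
        self.type = type_
        self.children = list(children)


def find_nodes_by_type(root, wanted):
    """Depth-first walk that collects every node whose type is in `wanted`."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.type in wanted:
            found.append(node)
        # Reverse so that children are visited in source order.
        stack.extend(reversed(node.children))
    return found


def discover(root, language, kind):
    """Look the node types up in the registry instead of hardcoding them."""
    return find_nodes_by_type(root, set(NODE_TYPES[language][kind]))


# A module with one top-level function and one class containing a method.
tree = Node("module", [
    Node("function_definition"),
    Node("class_definition", [Node("function_definition")]),
])
functions = discover(tree, "python", "function")
classes = discover(tree, "python", "class")
```

The walker itself never names a node type, so under this sketch supporting another language amounts to adding one more entry to the registry.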
Key analytical insight
CodeLens does not sell "list all functions." CodeLens sells a cross-file call graph with framework semantics and status computation.
The parser layer is more than node discovery. Decomposing what each current tree-sitter parser does (~250 LOC each):
Helper utilities (BaseParser inheritance, line numbers, text extraction): ~50 LOC. Replaced by .scm? No (generic Python).
So the realistic LOC reduction from .scm migration is ~50% (250 -> ~125), not the ~80% one might naively expect from "50 LOC .scm replaces 250 LOC Python."
Worse: the parts that .scm does NOT replace — framework semantics and edge resolution — are precisely the parts that constitute CodeLens's differentiation from ast-grep and Semgrep. Those tools stop at Layer 1. CodeLens adds Layer 2+3:
┌─────────────────────────────────────────────┐
│ Layer 3: Edge resolver + status computation │ <- Python, codebase-level
│ (snake<->camel, Tauri IPC, Pinia, Next.js)│ NOT replaced by .scm
├─────────────────────────────────────────────┤
│ Layer 2: Framework semantic extractor │ <- Python, per-framework
│ (is_tauri_command, is_pinia_store, etc) │ NOT replaced by .scm
├─────────────────────────────────────────────┤
│ Layer 1: Node + edge discovery │ <- .scm query files
│ (functions, classes, calls, imports) │ THIS is what .scm replaces
├─────────────────────────────────────────────┤
│ Layer 0: Tree-sitter grammar + AST │ <- existing GrammarLoader
└─────────────────────────────────────────────┘
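The split between the layers can be illustrated with a small sketch. Everything here is hypothetical and only mirrors the diagram: the node and call dictionaries stand in for Layer 1 output, tag_framework() for a Layer 2 check such as is_tauri_command, and resolve_edges() for the Layer 3 snake<->camel matching. The real CodeLens data shapes may differ.

```python
import re


def to_snake(name):
    """Normalize camelCase to snake_case so both spellings compare equal."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


# Layer 1 (assumed output of node discovery): plain nodes and call sites.
nodes = [
    {"name": "get_user", "file": "src-tauri/main.rs", "attributes": ["command"]},
    {"name": "helper", "file": "src-tauri/main.rs", "attributes": []},
]
calls = [{"caller_file": "src/App.ts", "callee": "getUser"}]


def tag_framework(node):
    """Layer 2: attach framework semantics that a query file cannot express."""
    return {**node, "is_tauri_command": "command" in node["attributes"]}


def resolve_edges(tagged_nodes, call_sites):
    """Layer 3: match call sites to definitions across files and naming styles."""
    by_name = {to_snake(n["name"]): n for n in tagged_nodes}
    edges = []
    for call in call_sites:
        target = by_name.get(to_snake(call["callee"]))
        if target is not None:
            edges.append((call["caller_file"], target["file"], target["name"]))
    return edges


tagged = [tag_framework(n) for n in nodes]
edges = resolve_edges(tagged, calls)
```

In this toy setup only the two input lists could come from a .scm query; the tagging and the cross-file matching stay in Python, which is the point the diagram makes.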
Honest ROI analysis
Factor 1 — How often does CodeLens add new languages?
Changelog review (v5.x -> v8.2): tree-sitter parsers went from 0 to 7, averaging <1 new language per major version. No indication of a planned 10+ language expansion in the next 6 months (issue #18 is open but not actively resourced).
At this rate, a 2-3 week migration investment saving 1-2 days per new language has a 3-5 year break-even horizon.
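The break-even estimate can be reproduced with simple arithmetic. The sketch below assumes 5 working days per week and a rate of 2 new languages per year; the rate is an assumption made for illustration, since the changelog figure is stated per major version rather than per year.

```python
def break_even_years(investment_days, saved_days_per_language, languages_per_year):
    """Years until the per-language savings repay the up-front investment."""
    languages_needed = investment_days / saved_days_per_language
    return languages_needed / languages_per_year


ASSUMED_LANGUAGES_PER_YEAR = 2  # hypothetical rate, not taken from the changelog

# 2-3 weeks of work is 10-15 working days; savings are 1-2 days per language.
best_case = break_even_years(10, 2, ASSUMED_LANGUAGES_PER_YEAR)
midpoint = break_even_years(12.5, 1.5, ASSUMED_LANGUAGES_PER_YEAR)
worst_case = break_even_years(15, 1, ASSUMED_LANGUAGES_PER_YEAR)
```

Under these assumptions the midpoint case falls inside the 3-5 year range quoted above, while the best and worst cases bracket it. A slower rate of language additions pushes every figure out proportionally.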
Factor 2 — What is CodeLens's actual pain point?
From the issue tracker, the urgent problems are:
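Edge resolution accuracy: the cross-file call graph still misses calls. The newly merged hybrid_type_resolver.py (PR #29, "feat(types): hybrid type resolution", which closes #13) is a step toward fixing this. This is the core differentiator vs. ast-grep.
… (analysis_cache never populated), #32 (--deep runs HybridEngine twice, so LSP findings are double-counted). These threaten credibility.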
AI agent adoption — chicken-and-egg. CodeLens needs agents that use it; agents need proof it is useful.
None of these are solved by .scm. .scm solves "adding languages is slow" — a non-urgent problem.
Factor 3 — Roadmap Phase 2 (Rust core via PyO3)
Issues #20 and #22 (open) propose rewriting the indexing/query core in Rust via PyO3. If executed in the next 6-12 months, all Python parsers would be rewritten anyway. Naive .scm migration to Python could become wasted effort.
Mitigation: tree-sitter's Rust bindings also support .scm query files. So .scm files written now are portable to a future Rust core. The Python walker glue would be discarded, but the .scm content itself is durable. This makes .scm a safer bet than rewriting parsers in pure Python idioms.
Recommendation
Do not migrate existing parsers now. Prepare the ground for future migration.
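Priority order (highest ROI first):
… hybrid_type_resolver)
… .scm for NEW languages only (no migration of existing)
Conditional triggers for "invest now"
The recommendation flips to "yes, migrate now" if all four of the following are true:
… .scm will be reused in Rust
A pilot migration of 1 language passes 100% test parity with the legacy parser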
Until all four are met, this proposal stays in "discussion" status.
Proposed pilot (if/when triggers are met)
Pick one language as the pilot. Rust is the best candidate because:
Existing rust_parser.py (~250 LOC) gives a clear baseline for parity comparison
Rust grammar is mature and stable
Tauri IPC handling exercises Layer 2 (framework semantics) — good stress test of the "scm for Layer 1, Python for Layer 2" split
Steps:
Write scripts/parsers/scm/rust.scm covering: function definitions, method definitions, struct/enum/impl blocks, call sites, use imports, #[attribute] captures.
Build scripts/parsers/scm_engine.py — a generic engine that loads .scm files, executes queries, returns nodes/edges in the same shape as existing parsers ({path, nodes, edges}).
Run both parsers (legacy + scm) on the existing Rust test fixtures and benchmark fixtures. Assert 100% parity on captured nodes/edges.
If parity passes: wire scm_engine into commands/scan.py for Rust only, behind a flag --use-scm rust. Let it bake for 1 release cycle.
If bake is clean: migrate Python, JS, TS, HTML, CSS one language per PR.
Never migrate the Vue/Svelte/Blade parsers — they have heavy Layer 2 framework logic that .scm cannot express. Leave them as Python.
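The parity check in the steps above could look roughly like the following sketch. It assumes both parsers return the {path, nodes, edges} shape and that nodes and edges are flat dictionaries of strings and integers; the field names and the helpers as_key(), parity_report(), and assert_parity() are hypothetical.

```python
def as_key(item):
    """Turn a flat node/edge dict into a hashable, order-independent key."""
    return tuple(sorted(item.items()))


def parity_report(legacy, scm):
    """List what the scm result is missing or adds, per field."""
    report = {}
    for field in ("nodes", "edges"):
        old = {as_key(x) for x in legacy[field]}
        new = {as_key(x) for x in scm[field]}
        report[field] = {"missing": sorted(old - new), "extra": sorted(new - old)}
    return report


def assert_parity(legacy, scm):
    """Fail loudly if the two parsers disagree on any node or edge."""
    report = parity_report(legacy, scm)
    problems = {f: d for f, d in report.items() if d["missing"] or d["extra"]}
    assert not problems, f"parity mismatch for {legacy['path']}: {problems}"


legacy_result = {
    "path": "src/lib.rs",
    "nodes": [
        {"kind": "function", "name": "main", "line": 1},
        {"kind": "function", "name": "helper", "line": 9},
    ],
    "edges": [{"source": "main", "target": "helper"}],
}
# Same content in a different order: ordering is not treated as a difference.
scm_result = {
    "path": "src/lib.rs",
    "nodes": list(reversed(legacy_result["nodes"])),
    "edges": list(legacy_result["edges"]),
}
assert_parity(legacy_result, scm_result)
```

Comparing sets rather than lists is a deliberate choice in this sketch: a query engine may emit captures in a different order than an imperative walker, and that alone should not fail the pilot.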
Plugin system angle (future direction, not part of this proposal)
CodeLens already has an engine plugin type (scripts/plugin_system.py). .scm query files could eventually be packaged as plugins rather than hardcoded in the repo.
This means external contributors could ship new language support without modifying CodeLens core. This is more ambitious than the original proposal and aligns with CodeLens's existing plugin philosophy. Out of scope for this issue — would warrant a separate proposal if there is interest.
Open questions for discussion
Is there a concrete plan to add 5+ new languages in the next 6 months? (Maintainer input needed — this is the single biggest ROI lever.)
Is Phase 2 (Rust core) still active, or effectively deferred? If deferred indefinitely, the "wait for Phase 2 clarity" trigger becomes moot and the calculus shifts.
Are there languages where the legacy Python parser is actively painful to maintain (frequent bug fixes, regressions)? Those would be higher-priority migration candidates than the average.
Is there appetite for the YAML node-type registry (Approach 2) as an immediate low-risk stepping stone, independent of the .scm decision?
References
scripts/grammar_loader.py, scripts/base_parser.py, scripts/parsers/*.py, scripts/parsers/fallback_*.py
… .scm-style queries across 15+ languages
Files (would be touched, if/when approved)
scripts/parsers/scm/ (new directory for .scm query files)
scripts/parsers/scm_engine.py (new generic engine)
scripts/commands/scan.py (dispatch wiring)
scripts/languages/node_types.yaml (if Approach 2 is also adopted as stepping stone)
tests/test_scm_engine.py (parity tests against legacy parsers)
references/parser-rules.md (documentation update)
CONTRIBUTING.md (add note: "new languages should use .scm, not bespoke Python parsers")