[PROPOSAL] Adopt tree-sitter query files (.scm) for node discovery — phased rollout, defer full migration until triggers met

## Summary

Proposal to adopt tree-sitter query files (`.scm`) as the long-term direction for CodeLens's node/edge discovery layer, with a phased rollout that **defers full migration until concrete triggers are met**. This issue documents the design analysis so the discussion is not lost and future contributors can pick up the thread.

**Status:** Proposal / Discussion — not approved for implementation yet.
**Recommendation:** Do NOT migrate existing parsers now. DO set up the YAML node-type registry as a low-risk stepping stone. DO adopt `.scm` for any NEW language added from now on, after a successful pilot.

## Background

CodeLens currently extracts code structure (functions, classes, calls, imports) using two parallel mechanisms:

1. **Tree-sitter parsers** (7 languages: HTML, CSS, JS, TS/TSX, Rust, Python, Vue, Svelte, Blade) — each is a hand-written Python module that walks the AST imperatively using `BaseParser.walk_tree()` / `find_nodes_by_type()`.
2. **Regex fallback parsers** (28 languages: C, C++, Go, Java, Kotlin, Swift, Ruby, PHP, etc.) — used when tree-sitter grammar is unavailable or the environment lacks tree-sitter.

Combined: ~35 languages covered. Test coverage >= 80% on `scripts/` (with `parsers/` and `commands/` omitted from coverage).

The current parser-per-language model works, but each new language requires a bespoke Python module (~250 LOC for tree-sitter parsers, ~150 LOC for regex fallbacks). This raises a recurring question: **should CodeLens adopt a more declarative approach to node discovery?**

## Three approaches considered

### Approach 1 — Tree-sitter query files (.scm) [ast-grep style]

Tree-sitter ships its own query language (files use `.scm` extension, syntax is Scheme-like). A query file declaratively describes which AST nodes to capture, with `@capture` annotations for field extraction.

Conceptual example (`python.scm`):
```scheme
(function_definition
  name: (identifier) @function.name) @function.def

(call
  function: (identifier) @call.callee) @call.site

(import_statement
  (dotted_name) @import.path) @import
```

A single generic engine reads the `.scm` file, executes the query against the AST, and returns captured nodes. One engine serves many languages.
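
A minimal sketch of what the engine might do after the query has run, assuming captures have already been flattened into `(name, text, line)` tuples in source order. The `Capture` tuple, `build_records`, and the record fields are hypothetical and are not part of tree-sitter or the current CodeLens code; only the capture names come from the `python.scm` example above.

```python
from collections import namedtuple

# Hypothetical flattened capture: name from the .scm file, matched text, 1-based line.
Capture = namedtuple("Capture", ["name", "text", "line"])


def build_records(path, captures):
    """Group captures into the {path, nodes, edges} shape.

    Assumption: captures arrive in source order, so a call site is attributed
    to the most recent function definition seen before it. Nested functions
    and module-level calls would need real range checks.
    """
    nodes, edges = [], []
    current_function = None
    for cap in captures:
        if cap.name == "function.name":
            current_function = cap.text
            nodes.append({"kind": "function", "name": cap.text, "line": cap.line})
        elif cap.name == "call.callee":
            edges.append({"from": current_function, "to": cap.text, "line": cap.line})
        elif cap.name == "import.path":
            nodes.append({"kind": "import", "name": cap.text, "line": cap.line})
    return {"path": path, "nodes": nodes, "edges": edges}
```

Because the function only looks at capture names, the same code could serve any language whose `.scm` file uses the same naming scheme, which is the point of the single-engine design.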

### Approach 2 — Node type registry (YAML) [Semgrep internal style]

Externalize the hardcoded node-type lookups (`find_nodes_by_type(root, "function_definition")`) to a YAML config:

```yaml
python:
  function_def: [function_definition, lambda]
  call: [call]
  import: [import_statement]
  class: [class_definition]

java:
  function_def: [method_declaration, constructor_declaration]
  call: [method_invocation]
  import: [import_declaration]
  class: [class_declaration]

go:
  function_def: [function_declaration, method_declaration]
  call: [call_expression]
  import: [import_declaration]
  class: [type_declaration]  # Go has no classes; structs and interfaces are type declarations
```

The Python walker stays, but reads node types from config instead of hardcoding them. Adding a language = adding 5 lines of YAML.
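
A rough sketch of a registry-driven walker, assuming the YAML has already been loaded into a dict (a subset is inlined here to keep the example dependency-free). `Node` is a stand-in for a tree-sitter node that only models `type` and `children`, and `find_by_category` is a hypothetical name, not an existing `BaseParser` method.

```python
from dataclasses import dataclass, field

# Same shape as the YAML registry, inlined as a dict (subset of the python entry).
NODE_TYPES = {
    "python": {
        "function_def": ["function_definition", "lambda"],
        "call": ["call"],
        "class": ["class_definition"],
    },
}


@dataclass
class Node:
    """Stand-in for a tree-sitter node: only `type` and `children` are used."""
    type: str
    children: list = field(default_factory=list)


def find_by_category(root, language, category, registry=NODE_TYPES):
    """Return all nodes under `root` whose type is registered for `category`."""
    wanted = set(registry[language][category])
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.type in wanted:
            found.append(node)
        # Push children reversed so nodes are visited in document order.
        stack.extend(reversed(node.children))
    return found
```

The walker itself contains no language-specific strings, so supporting another language under this scheme would only mean adding a registry entry.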

### Approach 3 — Token-based matching [Spacegrep style]

Tokenize source code and match patterns at the token level. Less accurate than AST, but zero grammar dependency.

**Already implemented.** CodeLens's 28 `fallback_*.py` parsers already do this (using regex, not formal tokenization, but conceptually equivalent). This is not a new direction — it is the existing fallback layer. Listed here only for completeness; no action proposed.

## Key analytical insight

**CodeLens does not sell "list all functions." CodeLens sells a cross-file call graph with framework semantics and status computation.**

The parser layer is more than node discovery. Decomposing what each current tree-sitter parser does (~250 LOC each):

| Sub-task | Approx LOC | Solvable by .scm? |
|---|---|---|
| Node discovery (functions, classes, calls, imports) | ~80 | Yes — sweet spot |
| Raw edge extraction (caller->callee pairs) | ~50 | Yes |
| Framework semantic extraction (Tauri `#[command]`, Pinia `useXxxStore`, Next.js conventions) | ~70 | No — requires post-processing logic |
| Helper utilities (BaseParser inheritance, line numbers, text extraction) | ~50 | No — generic Python |

So the realistic LOC reduction from `.scm` migration is ~50% (250 -> ~125), not the ~80% one might naively expect from "50 LOC .scm replaces 250 LOC Python."
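
The arithmetic behind that estimate, using only the approximate figures from the table above (the dictionary keys are shorthand labels, not identifiers from the codebase):

```python
# Approximate LOC per sub-task and whether .scm can replace it, from the table.
LOC = {
    "node_discovery": (80, True),
    "raw_edge_extraction": (50, True),
    "framework_semantics": (70, False),
    "helper_utilities": (50, False),
}

total = sum(loc for loc, _ in LOC.values())
replaceable = sum(loc for loc, by_scm in LOC.values() if by_scm)
remaining = total - replaceable
reduction = replaceable / total

print(f"total={total} replaceable={replaceable} remaining={remaining}")
print(f"reduction={reduction:.0%}")
```

The remaining Python figure does not count the size of the `.scm` file itself, so the net saving per language is somewhat smaller than the replaceable share suggests.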

Worse: the parts that `.scm` does NOT replace — framework semantics and edge resolution — are precisely the parts that constitute CodeLens's differentiation from ast-grep and Semgrep. Those tools stop at Layer 1. CodeLens adds Layer 2+3:

```
┌─────────────────────────────────────────────┐
│ Layer 3: Edge resolver + status computation │  <- Python, codebase-level
│   (snake<->camel, Tauri IPC, Pinia, Next.js)│     NOT replaced by .scm
├─────────────────────────────────────────────┤
│ Layer 2: Framework semantic extractor       │  <- Python, per-framework
│   (is_tauri_command, is_pinia_store, etc)   │     NOT replaced by .scm
├─────────────────────────────────────────────┤
│ Layer 1: Node + edge discovery              │  <- .scm query files
│   (functions, classes, calls, imports)      │     THIS is what .scm replaces
├─────────────────────────────────────────────┤
│ Layer 0: Tree-sitter grammar + AST          │  <- existing GrammarLoader
└─────────────────────────────────────────────┘
```
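
To make the split concrete, here is an illustrative sketch of the kind of logic that sits above Layer 1. The function names and the input shape are assumptions for illustration; they do not describe the actual implementation in the existing parsers or resolver.

```python
import re


def is_tauri_command(attributes):
    """Layer 2: flag a Rust function as a Tauri command from its attributes.

    Assumption: Layer 1 hands over raw attribute text such as
    "#[tauri::command]" for each function it captured.
    """
    return any(
        re.fullmatch(r"#\[(tauri::)?command\]", attr.strip()) for attr in attributes
    )


def canonical_name(name):
    """Layer 3: fold snake_case and camelCase spellings onto one lookup key."""
    # Insert "_" at lower-to-upper boundaries, then lowercase: getUser -> get_user.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


def resolve(callee, definitions):
    """Return the definition whose canonical name matches `callee`, or None."""
    index = {canonical_name(d): d for d in definitions}
    return index.get(canonical_name(callee))
```

A `.scm` query can capture the attribute text and the callee name, but deciding what they mean and matching them across files is ordinary Python, which is why Layers 2 and 3 stay outside the query files.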

## Honest ROI analysis

### Factor 1 — How often does CodeLens add new languages?

Changelog review (v5.x -> v8.2): tree-sitter parsers went from 0 to 7, averaging **<1 new language per major version**. No indication of a planned 10+ language expansion in the next 6 months (issue #18 is open but not actively resourced).

At this rate, a 2-3 week migration investment saving 1-2 days per new language has a 3-5 year break-even horizon.
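
The break-even point can be worked out in units of new languages from the effort figures above. The sketch assumes 5 working days per week; the rate of language additions per year is not stated in this issue, so it is left as a parameter.

```python
def break_even_languages(migration_days, saved_days_per_language):
    """Number of new languages needed before the migration pays for itself."""
    return migration_days / saved_days_per_language


# Assumption: 5 working days per week, so 2-3 weeks is 10-15 days.
best_case = break_even_languages(migration_days=10, saved_days_per_language=2)
worst_case = break_even_languages(migration_days=15, saved_days_per_language=1)
print(f"break-even after {best_case:.0f} to {worst_case:.0f} new languages")


def horizon_years(languages_needed, languages_per_year):
    """Convert a language count into years for an assumed addition rate."""
    return languages_needed / languages_per_year
```

Under these assumptions the migration pays off after somewhere between 5 and 15 new languages; `horizon_years` turns that into a time horizon once a realistic addition rate is agreed on.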

### Factor 2 — What is CodeLens's actual pain point?

From the issue tracker, the urgent problems are:

1. **Edge resolution accuracy** — cross-file call graph still misses calls. The newly-merged `hybrid_type_resolver.py` (PR #29) is a step toward fixing this. This is the core differentiator vs. ast-grep.
2. **Performance** — issues #10 (RAM-first indexing) and #22 (Rust core via PyO3) are open. Scanning 5000+ files is still slow.
3. **P0 bugs unfixed** — #31 (SQLite `analysis_cache` never populated), #32 (`--deep` runs HybridEngine twice). These threaten credibility.
4. **AI agent adoption** — chicken-and-egg. CodeLens needs agents that use it; agents need proof it is useful.

None of these are solved by `.scm`. `.scm` solves "adding languages is slow" — a non-urgent problem.

### Factor 3 — Roadmap Phase 2 (Rust core via PyO3)

Issues #20 and #22 (open) propose rewriting the indexing/query core in Rust via PyO3. If that happens in the next 6-12 months, **all Python parsers would be rewritten anyway**, and a `.scm` migration whose engine is written in Python could become partly wasted effort.

**Mitigation:** tree-sitter's Rust bindings also support `.scm` query files. So `.scm` files written now are portable to a future Rust core. The Python walker glue would be discarded, but the `.scm` content itself is durable. This makes `.scm` a safer bet than rewriting parsers in pure Python idioms.

## Recommendation

**Do not migrate existing parsers now. Prepare the ground for future migration.**

Priority order (highest ROI first):

| Priority | Action | Effort | ROI |
|---|---|---|---|
| 1 | Fix P0 bugs (#31, #32) | 2-3 days | Very high — restores credibility |
| 2 | Continue edge resolution accuracy work (build on `hybrid_type_resolver`) | ongoing | High — core differentiator |
| 3 | Performance work (#10, Phase 2 #22) | 2-4 weeks | High — unlocks scale |
| 4 | **YAML node type registry** (Approach 2, stepping stone) | 1-2 days | Medium — externalize config, zero risk |
| 5 | **`.scm` for NEW languages only** (no migration of existing) | per-language | Medium — incremental investment |
| 6 | Full migration of 7 existing tree-sitter parsers to `.scm` | 2-3 weeks | Low — thin ROI given current add-language rate |

## Conditional triggers for "invest now"

The recommendation flips to "yes, migrate now" if **all four** of the following are true:

1. P0 bugs (#31, #32) are fixed
2. Phase 2 (Rust core) status is clarified — either cancelled, or confirmed with a clear plan that `.scm` will be reused in Rust
3. Concrete commitment to add 5+ new languages within 6 months (issue #18 active and resourced)
4. A pilot migration of 1 language passes 100% test parity with the legacy parser

Until all four are met, this proposal stays in "discussion" status.

## Proposed pilot (if/when triggers are met)

Pick one language as the pilot. **Rust** is the best candidate because:
- Existing `rust_parser.py` (~250 LOC) gives a clear baseline for parity comparison
- Rust grammar is mature and stable
- Tauri IPC handling exercises Layer 2 (framework semantics) — good stress test of the "scm for Layer 1, Python for Layer 2" split

Steps:
1. Write `scripts/parsers/scm/rust.scm` covering: function definitions, method definitions, struct/enum/impl blocks, call sites, `use` imports, `#[attribute]` captures.
2. Build `scripts/parsers/scm_engine.py` — a generic engine that loads `.scm` files, executes queries, returns nodes/edges in the same shape as existing parsers (`{path, nodes, edges}`).
3. Run both parsers (legacy + scm) on the existing Rust test fixtures and benchmark fixtures. Assert 100% parity on captured nodes/edges.
4. If parity passes: wire `scm_engine` into `commands/scan.py` for Rust only, behind a flag `--use-scm rust`. Let it bake for 1 release cycle.
5. If the bake period is clean: migrate Python, JS, TS, HTML, and CSS, one language per PR.
6. **Never migrate** the Vue/Svelte/Blade parsers — they have heavy Layer 2 framework logic that `.scm` cannot express. Leave them as Python.
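
The parity assertion in step 3 could be sketched roughly as below, assuming both parsers return the `{path, nodes, edges}` shape and that each node or edge is a flat dict with hashable values. `parity_report` and `has_parity` are hypothetical helpers, not existing test utilities.

```python
def _freeze(items):
    """Turn a list of flat dicts into a set of hashable, order-independent keys."""
    return {tuple(sorted(item.items())) for item in items}


def parity_report(legacy, candidate):
    """Compare two parser outputs in the {path, nodes, edges} shape.

    Returns, per key, the items missing from or extra in `candidate`.
    Empty lists everywhere mean 100% parity.
    """
    report = {}
    for key in ("nodes", "edges"):
        old, new = _freeze(legacy[key]), _freeze(candidate[key])
        report[key] = {
            # Sort by repr so mixed value types never get compared directly.
            "missing": sorted(old - new, key=repr),
            "extra": sorted(new - old, key=repr),
        }
    return report


def has_parity(legacy, candidate):
    """True when the candidate output matches the legacy output exactly."""
    report = parity_report(legacy, candidate)
    return all(not d["missing"] and not d["extra"] for d in report.values())
```

Comparing as sets keeps the check independent of emission order, which may legitimately differ between an imperative walker and a query engine. If order matters to downstream consumers, a stricter list comparison would be needed.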

## Plugin system angle (future direction, not part of this proposal)

CodeLens already has an `engine` plugin type (`scripts/plugin_system.py`). `.scm` query files could eventually be packaged as **plugins** rather than hardcoded in the repo:

```
my-org/rust-advanced/
  plugin.yaml      (type: engine, entrypoint: scm_engine.py)
  rust.scm         (query file)
  framework.yaml   (optional: Layer 2 framework rules)
```

This means external contributors could ship new language support without modifying CodeLens core. This is more ambitious than the original proposal and aligns with CodeLens's existing plugin philosophy. **Out of scope for this issue** — would warrant a separate proposal if there is interest.

## Open questions for discussion

1. Is there a concrete plan to add 5+ new languages in the next 6 months? (Maintainer input needed — this is the single biggest ROI lever.)
2. Is Phase 2 (Rust core) still active, or effectively deferred? If deferred indefinitely, the "wait for Phase 2 clarity" trigger becomes moot and the calculus shifts.
3. Are there languages where the legacy Python parser is actively painful to maintain (frequent bug fixes, regressions)? Those would be higher-priority migration candidates than the average.
4. Is there appetite for the YAML node-type registry (Approach 2) as an immediate low-risk stepping stone, independent of the `.scm` decision?

## References

- **Related issues:** #18 (158-language support via universal grammar loader), #20 (Rust core via PyO3), #22 (Phase 2 — Speed), #10 (RAM-first indexing)
- **Existing parser infrastructure:** `scripts/grammar_loader.py`, `scripts/base_parser.py`, `scripts/parsers/*.py`, `scripts/parsers/fallback_*.py`
- **External references:**
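  - [ast-grep](https://ast-grep.github.io/): production tool built on tree-sitter grammars across 15+ languages; its rules are code-shaped patterns plus YAML rather than `.scm` files, so it is a precedent for the one-engine-many-languages idea, not for the file format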
  - [tree-sitter query syntax documentation](https://tree-sitter.github.io/tree-sitter/using-parsers#query-syntax)
  - [Semgrep rule syntax](https://semgrep.dev/docs/writing-rules/overview/) — for comparison with Approach 2 (YAML registry)

## Files (would be touched, if/when approved)

- `scripts/parsers/scm/` (new directory for `.scm` query files)
- `scripts/parsers/scm_engine.py` (new generic engine)
- `scripts/commands/scan.py` (dispatch wiring)
- `scripts/languages/node_types.yaml` (if Approach 2 is also adopted as stepping stone)
- `tests/test_scm_engine.py` (parity tests against legacy parsers)
- `references/parser-rules.md` (documentation update)
- `CONTRIBUTING.md` (add note: "new languages should use `.scm`, not bespoke Python parsers")

