
[FEATURE] Performance & parallelism — worker pool, prefilter, AST disk cache, daemon #56

Description

@Wolfvin

Summary

CodeLens scan is single-threaded. Add --jobs N parallelism, a regex prefilter (skip files that cannot match), an AST disk cache (avoid re-parsing across commands), and an optional shared daemon (1 process per project, N MCP clients). Targets: 2-4x scan speedup on an 8-core machine, and rule matching less than 2x slower than a plain regex search.

Worker consensus (6 reports)

| Worker | Source | Contribution |
| --- | --- | --- |
| CodeGraph | update!/CodeLens_CodeGraph_Upgrade_Analysis.md #5 | Worker-thread pool: query pool (CPU-heavy codelens_explore in worker threads) + parse pool (ProcessPoolExecutor for tree-sitter, not thread-safe). 2-4x speedup on 8-core. CODELENS_QUERY_POOL_SIZE / CODELENS_PARSE_WORKERS env vars. |
| CodeGraph | same file #4 | Shared daemon architecture: 1 detached codelens serve --mcp --daemon per project root, N concurrent MCP clients over Unix socket. 1 watcher, 1 SQLite WAL writer, 1 tree-sitter warm-up. Idle timeout 300s. |
| CodeGraph | same file #6 | Watchdog stack: PPID watchdog (orphan detection), liveness watchdog (heartbeat), stale stdin teardown. |
| Opengrep | update!/CodeLens_Opengrep_Upgrade_Analysis.md #59 | Parallelism / --jobs N for scan. Python multiprocessing.Pool (tree-sitter not thread-safe). |
| Semgrep | update!/CodeLens_Upgrade_Issues_from_Semgrep.md CL-009 | Pre-filtering optimization: derive a "fast regex" from each rule pattern (e.g. eval(...) -> eval), run ripgrep to filter candidate files. Skips 90%+ of files in <1s. 3x speedup. --no-prefilter to disable. |
| Semgrep | same file CL-020 | Disk cache for AST parse results: ~/.codelens/cache/ keyed by SHA-256 of file content + parser version. Auto-evict >30 days. codelens cache clear / codelens cache stats. --no-cache for benchmarks. |
| UBS | update!/CodeLens_UBS_Upgrade_Analysis.md #21 | --jobs=N (0=auto, 1=deterministic for CI, 16=explicit) + --only=LANG filter (only scan Python+Rust). 2-4x speedup on multi-core. |
| RepoAudit | update!/CodeLens_Upgrade_Issues_from_RepoAudit.md CL-042 | LLM response cache + token cost tracking (related: disk cache pattern reused for LLM). |

Proposed phased scope

Phase 1 — Regex prefilter (P1, 1-2 weeks, quick win)

  • New scripts/prefilter.py
  • Analyze each rule pattern at load time, extract literal tokens (identifiers, strings)
  • Build prefilter regex from tokens
  • Run ripgrep subprocess to filter candidate files before AST parse
  • Stats in output: {prefilter: {total_files, passed, skipped, time_ms}}
  • --no-prefilter flag to disable
  • Target: 3x speedup on 5000-file repo with 100+ rules
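
A minimal sketch of the token extraction and candidate filtering described above. The function names, the "3+ character identifier" token rule, and the `$NAME` metavariable convention are assumptions for illustration, not the actual rule format. It also scans files with Python's `re` instead of a ripgrep subprocess so the example stays self-contained. One token (the longest) is kept per rule, and a file passes if it contains the token of at least one rule; a rule with no literal token disables skipping entirely.

```python
import re
import time
from pathlib import Path

# Assumption: literals worth prefiltering on are identifier-like runs of 3+ chars.
_TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]{2,}")


def literal_tokens(pattern):
    """Return identifier-like literals in a rule pattern, skipping $METAVARS."""
    tokens = []
    for m in _TOKEN.finditer(pattern):
        if m.start() > 0 and pattern[m.start() - 1] == "$":
            continue  # hypothetical metavariable syntax, not a literal
        tokens.append(m.group())
    return tokens


def build_prefilter(patterns):
    """Build one alternation regex, or None if skipping would be unsafe."""
    chosen = set()
    for pattern in patterns:
        tokens = literal_tokens(pattern)
        if not tokens:
            return None  # this rule could match any file
        chosen.add(max(tokens, key=len))
    if not chosen:
        return None
    return re.compile("|".join(re.escape(t) for t in sorted(chosen)))


def prefilter_files(paths, patterns):
    """Return (candidate paths, stats dict in the shape proposed above)."""
    start = time.perf_counter()
    rx = build_prefilter(patterns)
    passed = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        if rx is None or rx.search(text):
            passed.append(path)
    elapsed_ms = round((time.perf_counter() - start) * 1000, 2)
    stats = {"prefilter": {"total_files": len(paths), "passed": len(passed),
                           "skipped": len(paths) - len(passed),
                           "time_ms": elapsed_ms}}
    return passed, stats
```

With the single rule pattern `eval(...)`, only files containing the text `eval` reach the AST parse stage; everything else is counted under `skipped`.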

Phase 2 — --jobs N parallelism (P1, 1-2 weeks)

  • concurrent.futures.ProcessPoolExecutor for CPU-bound parse (tree-sitter not thread-safe)
  • --jobs N flag (0=auto-detect cpu_count, 1=single-threaded for CI determinism, N=explicit)
  • JOBS env var
  • Worker entry point takes (task_id, file_path, language), returns (task_id, ExtractionResult)
  • --only=LANG[,LANG,...] filter (skip irrelevant parsers)
  • Per-worker recycle (WASM memory grows but never shrinks)
  • 2-stage retry (fresh worker, then comment-stripped)
  • Target: 2-4x speedup on 8-core
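
The flag semantics and worker contract above could look roughly like this. It is a sketch under assumptions: the flag taking precedence over the JOBS env var is a guess, `parse_task` only counts lines as a stand-in for the real tree-sitter extraction, and the result dict is a placeholder for ExtractionResult.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def resolve_jobs(flag=None, env=None):
    """0 = auto-detect, 1 = single process, N = explicit worker count."""
    env = os.environ if env is None else env
    # Assumption: an explicit --jobs flag wins over the JOBS env var.
    value = flag if flag is not None else int(env.get("JOBS", "0"))
    if value < 0:
        raise ValueError("--jobs must be >= 0")
    return value if value > 0 else (os.cpu_count() or 1)


def parse_task(task):
    """Worker entry point: (task_id, file_path, language) -> (task_id, result)."""
    task_id, file_path, language = task
    with open(file_path, "rb") as fh:  # placeholder for the real parse step
        lines = sum(1 for _ in fh)
    return task_id, {"path": file_path, "language": language, "lines": lines}


def scan(tasks, jobs, only=None):
    """Run parse_task over tasks; results are ordered by task_id."""
    if only:
        tasks = [t for t in tasks if t[2] in only]  # --only=LANG[,LANG,...]
    if jobs == 1:
        # In-process path: no pool, fully deterministic for CI.
        results = [parse_task(t) for t in tasks]
    else:
        with ProcessPoolExecutor(max_workers=jobs) as pool:
            results = list(pool.map(parse_task, tasks))
    return sorted(results, key=lambda r: r[0])
```

On platforms that spawn rather than fork, calls to `scan` with `jobs > 1` need to sit under an `if __name__ == "__main__":` guard. For per-worker recycling, `ProcessPoolExecutor` accepts `max_tasks_per_child` on Python 3.11+, which may be enough to bound memory growth; the 2-stage retry is not shown here.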

Phase 3 — AST disk cache (P1, 1 week)

  • New scripts/disk_cache.py
  • Cache at ~/.codelens/cache/ keyed by SHA-256 of file content + parser version
  • Store pickled AST
  • Auto-evict entries >30 days
  • codelens cache clear / codelens cache stats commands
  • --no-cache flag for benchmarks
  • Hit ratio exposed in output
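
A rough sketch of the cache keying, eviction, and hit-ratio bookkeeping listed above. The class name, the `.pickle` suffix, and using file mtime as the entry age are assumptions; the cached value also has to be picklable, which the real AST objects may or may not be.

```python
import hashlib
import os
import pickle
import time
from pathlib import Path

MAX_AGE_SECONDS = 30 * 24 * 3600  # auto-evict entries older than 30 days


class AstDiskCache:
    def __init__(self, root, parser_version):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.parser_version = parser_version
        self.hits = 0
        self.misses = 0

    def _path(self, content):
        # Key = SHA-256 over parser version + file content.
        h = hashlib.sha256()
        h.update(self.parser_version.encode("utf-8"))
        h.update(b"\0")
        h.update(content)
        return self.root / (h.hexdigest() + ".pickle")

    def get_or_parse(self, content, parse):
        """Return the cached value for content, calling parse(content) on a miss."""
        path = self._path(content)
        try:
            with path.open("rb") as fh:
                value = pickle.load(fh)
        except (OSError, pickle.UnpicklingError, EOFError):
            self.misses += 1
            value = parse(content)
            tmp = path.with_suffix(".tmp")
            tmp.write_bytes(pickle.dumps(value))
            os.replace(tmp, path)  # atomic rename so readers never see partial files
            return value
        self.hits += 1
        return value

    def evict_old(self, now=None):
        """Delete entries older than MAX_AGE_SECONDS; return how many were removed."""
        now = time.time() if now is None else now
        removed = 0
        for entry in self.root.glob("*.pickle"):
            if now - entry.stat().st_mtime > MAX_AGE_SECONDS:
                entry.unlink()
                removed += 1
        return removed

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Unpickling executes code from the file, so the cache directory has to be trusted (user-owned, not shared). Keying on content rather than path means renamed or copied files still hit the cache.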

Phase 4 — Shared daemon (P2, 3-4 weeks, depends on Phase 2)

  • codelens serve --mcp --daemon — detached process per project root
  • Unix-domain socket (Linux/macOS) or named pipe (Windows)
  • N concurrent MCP clients share 1 engine (1 watcher, 1 SQLite WAL writer, 1 tree-sitter warm-up)
  • Daemon registry in ~/.codelens/daemons/ keyed by SHA-256 of project root path
  • codelens daemons command (list/stop)
  • Idle timeout 300s
  • CODELENS_NO_DAEMON=1 opt-out
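
For the registry keying and idle timeout, something along these lines might work. The helper names, the registry file layout, and resolving the root with `os.path.realpath` before hashing are all assumptions; the socket name is truncated because Unix-domain socket paths are limited to roughly 100 bytes on common platforms.

```python
import hashlib
import os
from pathlib import Path

IDLE_TIMEOUT_SECONDS = 300


def daemon_key(project_root):
    """SHA-256 of the resolved project root path, as a hex string."""
    resolved = os.path.realpath(project_root)
    return hashlib.sha256(resolved.encode("utf-8")).hexdigest()


def registry_entry(project_root, registry_dir):
    """Hypothetical layout: one JSON record and one socket per project root."""
    key = daemon_key(project_root)
    base = Path(registry_dir)
    return {
        "key": key,
        "record": base / (key + ".json"),
        # Shortened name keeps the full socket path under the OS length limit.
        "socket": base / (key[:16] + ".sock"),
    }


def idle_expired(last_activity, now, timeout=IDLE_TIMEOUT_SECONDS):
    """True once no client activity has been seen for longer than timeout."""
    return now - last_activity > timeout


def daemon_enabled(env=None):
    """CODELENS_NO_DAEMON=1 opts out of the shared daemon."""
    env = os.environ if env is None else env
    return env.get("CODELENS_NO_DAEMON") != "1"
```

Resolving the path first means two clients that reach the same project through different relative paths or symlinks end up with the same key and therefore the same daemon.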

Phase 5 — Watchdog stack (P2, 2 weeks, depends on Phase 4)

  • PPID watchdog — orphan detection via os.getppid() polling (POSIX) or parent liveness (Windows)
  • Liveness watchdog — separate process, parent writes heartbeat byte to child's stdin every 1s, child SIGKILLs parent if no byte within 30s
  • Stale stdin teardown — listen for stdin error event, destroy stream on terminal event
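
The two watchdog checks can be sketched as small pollable objects. Class names are hypothetical, and the injectable `getppid` / `clock` callables exist only to make the logic testable; the real stack would wire these to a polling loop and to the stdin heartbeat reader, and the SIGKILL step is left out.

```python
import os
import time


class PpidWatchdog:
    """Orphan detection: the parent PID changes when the parent exits (POSIX)."""

    def __init__(self, getppid=os.getppid):
        self._getppid = getppid
        self._initial = getppid()  # remember who started us

    def orphaned(self):
        # After the parent dies, the process is re-parented (e.g. to init),
        # so getppid() no longer returns the original value.
        return self._getppid() != self._initial


class HeartbeatMonitor:
    """Liveness check: expect a heartbeat byte at least every `timeout` seconds."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        self._last = clock()

    def beat(self):
        """Call whenever a heartbeat byte arrives on stdin."""
        self._last = self._clock()

    def expired(self):
        return self._clock() - self._last > self._timeout
```

A polling loop would call `orphaned()` and `expired()` on a short interval and shut down (or kill the wedged parent) when either returns True. A monotonic clock is used so wall-clock adjustments cannot trigger a false timeout.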

Acceptance criteria

  • Phase 1: prefilter skips 90%+ of files in <1s on 5000-file repo
  • Phase 2: 2-4x speedup on 8-core for scan
  • Phase 3: AST cache hit ratio >80% on second run of same command
  • Phase 4: daemon serves N concurrent MCP clients without crash
  • Phase 5: orphan/wedge scenarios handled gracefully
