
[PERF] RAM-first indexing pipeline: index in-memory, flush once to SQLite #10

Description

@Wolfvin

Problem

The current scan writes to SQLite repeatedly during indexing, once per file/symbol. On large codebases this causes excessive I/O and slow full scans.

Proposed Change

Adopt a RAM-first pipeline:

  1. During the scan, accumulate all nodes/edges in memory (dicts/lists)
  2. After all files are parsed, flush everything in a single bulk INSERT transaction into SQLite
  3. Use WAL mode + PRAGMA synchronous=NORMAL for incremental updates
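
A minimal sketch of the three steps above, using Python's standard `sqlite3` module. The `IndexBuffer` class, the `open_db` and `flush` helpers, and the `nodes`/`edges` table layout are all hypothetical names chosen for illustration; the project's real schema and scanner interface may differ.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class IndexBuffer:
    """In-memory accumulator for one scan (hypothetical structure)."""
    nodes: dict = field(default_factory=dict)  # node id -> (kind, path)
    edges: list = field(default_factory=list)  # (src id, dst id, kind)

    def add_node(self, node_id, kind, path):
        # Keyed by id, so re-parsing a symbol overwrites the earlier entry.
        self.nodes[node_id] = (kind, path)

    def add_edge(self, src, dst, kind):
        self.edges.append((src, dst, kind))


def open_db(path):
    """Open the database with the pragmas from step 3 (assumed schema)."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=NORMAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nodes "
        "(id TEXT PRIMARY KEY, kind TEXT, path TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS edges (src TEXT, dst TEXT, kind TEXT)"
    )
    return conn


def flush(conn, buf):
    """Write the whole buffer in one transaction, then release the memory."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO nodes (id, kind, path) VALUES (?, ?, ?)",
            [(nid, kind, path) for nid, (kind, path) in buf.nodes.items()],
        )
        conn.executemany(
            "INSERT INTO edges (src, dst, kind) VALUES (?, ?, ?)",
            buf.edges,
        )
    buf.nodes.clear()
    buf.edges.clear()
```

In this sketch the scanner would call `add_node`/`add_edge` while parsing and call `flush` exactly once after the last file. Incremental updates could reuse the same `flush` with a small buffer per changed file.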

Expected Gains

  • 5-10x faster full scan on medium codebases (1k-10k files)
  • Agents waiting for initial index see much faster first-ready time
  • Memory usage is bounded by the size of the node/edge set and is released after the flush
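
One way to check the expected speedup locally is a small timing harness like the one below. It is only a sketch: the synthetic rows, the `nodes` table, and the assumption that the current scan commits after every row are illustrative, not taken from the project's code. Actual ratios will depend on disk, filesystem, and SQLite settings.

```python
import os
import sqlite3
import tempfile
import time


def make_rows(n):
    # Synthetic symbols standing in for parsed nodes.
    return [(f"sym{i}", "function", f"file{i % 100}.py") for i in range(n)]


def _connect(path):
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE nodes (id TEXT PRIMARY KEY, kind TEXT, path TEXT)")
    return conn


def write_per_row(path, rows):
    """Assumed current behaviour: one commit per inserted row."""
    conn = _connect(path)
    start = time.perf_counter()
    for row in rows:
        conn.execute("INSERT INTO nodes VALUES (?, ?, ?)", row)
        conn.commit()
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
    conn.close()
    return elapsed, count


def write_bulk(path, rows):
    """Proposed behaviour: all rows in a single transaction."""
    conn = _connect(path)
    start = time.perf_counter()
    with conn:
        conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", rows)
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
    conn.close()
    return elapsed, count


def compare(n=2000):
    rows = make_rows(n)
    with tempfile.TemporaryDirectory() as d:
        per_row = write_per_row(os.path.join(d, "per_row.db"), rows)
        bulk = write_bulk(os.path.join(d, "bulk.db"), rows)
    return per_row, bulk


if __name__ == "__main__":
    (t_row, _), (t_bulk, _) = compare()
    print(f"per-row commits: {t_row:.3f}s, single transaction: {t_bulk:.3f}s")
```

Both paths write the same rows, so the only variable being measured is the number of commits.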

Reference

codebase-memory-mcp indexes the Linux kernel (28M LOC, 75K files) in 3 minutes using this approach with LZ4 compression on the in-memory buffer.


Labels: enhancement, performance
