Problem
The current scan issues a separate SQLite write for each file and symbol during indexing. On large codebases this causes excessive I/O and slow full scans.
Proposed Change
Adopt a RAM-first pipeline:
- During scan, accumulate all nodes/edges in memory (dicts/lists)
- After all files are parsed, do a single bulk INSERT transaction into SQLite
- Use WAL mode + PRAGMA synchronous=NORMAL for incremental updates
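
The steps above could be sketched roughly as follows in Python, using only the standard-library `sqlite3` module. This is a minimal illustration, not the project's actual code: the `ScanBuffer` class, the `open_index` and `flush` helpers, and the `nodes`/`edges` table layout are all hypothetical names chosen for the example.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class ScanBuffer:
    """In-memory accumulator for one full scan (hypothetical structure)."""
    nodes: list = field(default_factory=list)  # rows of (id, kind, file)
    edges: list = field(default_factory=list)  # rows of (src, dst, kind)

    def add_node(self, node_id: str, kind: str, file: str) -> None:
        self.nodes.append((node_id, kind, file))

    def add_edge(self, src: str, dst: str, kind: str) -> None:
        self.edges.append((src, dst, kind))


def open_index(path: str) -> sqlite3.Connection:
    """Open the index database with WAL journaling and relaxed fsync."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=NORMAL")
    # Assumed schema, for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nodes (id TEXT PRIMARY KEY, kind TEXT, file TEXT)"
    )
    conn.execute("CREATE TABLE IF NOT EXISTS edges (src TEXT, dst TEXT, kind TEXT)")
    return conn


def flush(conn: sqlite3.Connection, buf: ScanBuffer) -> None:
    """Write the whole buffer in one transaction, then release the memory."""
    with conn:  # commits on success, rolls back if an insert raises
        conn.executemany("INSERT OR REPLACE INTO nodes VALUES (?, ?, ?)", buf.nodes)
        conn.executemany("INSERT INTO edges VALUES (?, ?, ?)", buf.edges)
    buf.nodes.clear()
    buf.edges.clear()
```

A scan would call `add_node` and `add_edge` while parsing and call `flush` once at the end, so SQLite sees a single commit instead of one per file or symbol. WAL mode only takes effect for file-backed databases, not for `:memory:` connections.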
Expected Gains
- 5-10x faster full scan on medium codebases (1k-10k files)
- Agents waiting for the initial index see a much faster first-ready time
- Memory usage is bounded by the size of the accumulated index and is released after the flush
Reference
codebase-memory-mcp indexes the Linux kernel (28M LOC, 75K files) in 3 minutes using this approach with LZ4 compression on the in-memory buffer.