Purpose
Brain must automatically detect binary files during ingest and exclude them — even when they have text-like extensions. The current 256-byte content scan is insufficient for files where binary data starts later.
Current State
Binary detection exists at 4 levels:
- Skip-dirs: node_modules, .git, dist, etc. (in loader.ts)
- Extension filter: BINARY_EXTENSIONS Set (~45 known binary extensions, in loader.ts)
- Content scan: isBinaryContent() scans the first 256 bytes; a null byte means immediate binary, and >30% non-printable bytes means binary (in loader.ts, called from index.ts)
- File size cap: 2MB max (in index.ts)
Gap / Problem
- 256-byte limit is too small: Binary data can start after the first 256 bytes (e.g., text metadata headers followed by a binary body). Files with text extensions (.json, .xml, .txt) that contain embedded binary data slip through.
- No per-chunk validation: If isBinaryContent() misses a binary file, the chunker produces garbage text chunks that get stored as embeddings, polluting the search index.
- No reporting: Binary-skipped files are silently discarded without appearing in IngestResult metrics, so there is no way to see how many files were rejected.
Proposed Solution
1. Enhanced isBinaryContent() in src/ingest/loader.ts
- For files ≤64KB: scan the entire file content
- For files >64KB: sample first 4KB + middle region (offset ≈ size/2, take 4KB) + last 4KB
- Keep null-byte immediate-return optimization
- Keep the >30% non-printable threshold
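The scan strategy above could be sketched as follows. This is a minimal illustration, not the existing loader.ts code: the constant names, the sampleRanges() helper, and the definition of "non-printable" (control bytes other than tab, LF, and CR, plus DEL) are assumptions, as is pooling the ratio across all sampled windows and treating an empty file as text.

```typescript
const FULL_SCAN_LIMIT = 64 * 1024; // files up to 64KB are scanned in full
const SAMPLE_SIZE = 4 * 1024; // size of each sampled window for larger files
const NON_PRINTABLE_THRESHOLD = 0.3;

// Assumption: control bytes other than tab, LF, and CR (plus DEL) count as
// non-printable; bytes >= 0x80 are left alone because they may be UTF-8.
function isNonPrintable(byte: number): boolean {
  if (byte === 0x09 || byte === 0x0a || byte === 0x0d) return false;
  return byte < 0x20 || byte === 0x7f;
}

// Hypothetical helper: returns the [start, end) byte ranges to inspect.
function sampleRanges(size: number): Array<[number, number]> {
  if (size <= FULL_SCAN_LIMIT) return [[0, size]];
  const mid = Math.floor(size / 2);
  return [
    [0, SAMPLE_SIZE], // first 4KB
    [mid, mid + SAMPLE_SIZE], // 4KB starting at size/2
    [size - SAMPLE_SIZE, size], // last 4KB
  ];
}

function isBinaryContent(bytes: Uint8Array): boolean {
  let scanned = 0;
  let nonPrintable = 0;
  for (const [start, end] of sampleRanges(bytes.length)) {
    for (let i = start; i < end; i++) {
      const b = bytes[i];
      if (b === 0) return true; // null byte: immediate binary verdict
      if (isNonPrintable(b)) nonPrintable++;
    }
    scanned += end - start;
  }
  // Assumption: an empty file has nothing to reject and is treated as text.
  return scanned > 0 && nonPrintable / scanned > NON_PRINTABLE_THRESHOLD;
}
```

With a 2MB cap, the sampled path touches at most 12KB per file, so the cost stays bounded regardless of file size. Binary data that falls entirely between the sampled windows can still be missed, which is what the per-chunk check is meant to catch.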
2. Add isBinaryChunk() to src/ingest/chunker.ts
- Before returning any chunk, verify its content is valid text
- Use the same non-printable ratio check on the full chunk bytes
- If binary → skip the chunk, log a warning, continue processing remaining chunks
- This acts as a safety net for any binary content that passed the file-level check
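A per-chunk check along these lines might look like the sketch below. It assumes chunks are plain strings and reuses the same assumed definition of "non-printable" as the file-level check; filterTextChunks() and its parameters are hypothetical names used only for illustration, not part of the existing chunker.ts.

```typescript
const NON_PRINTABLE_THRESHOLD = 0.3;
const encoder = new TextEncoder();

// Assumption: control bytes other than tab, LF, and CR (plus DEL) count as
// non-printable; bytes >= 0x80 are left alone because they may be UTF-8.
function isNonPrintable(byte: number): boolean {
  if (byte === 0x09 || byte === 0x0a || byte === 0x0d) return false;
  return byte < 0x20 || byte === 0x7f;
}

// Applies the ratio check to the full UTF-8 encoding of the chunk.
function isBinaryChunk(chunk: string): boolean {
  const bytes = encoder.encode(chunk);
  if (bytes.length === 0) return false;
  let nonPrintable = 0;
  for (let i = 0; i < bytes.length; i++) {
    if (isNonPrintable(bytes[i])) nonPrintable++;
  }
  return nonPrintable / bytes.length > NON_PRINTABLE_THRESHOLD;
}

// Hypothetical helper: drops binary chunks, logs a warning for each,
// and keeps processing the remaining chunks.
function filterTextChunks(chunks: string[], filePath: string): string[] {
  const kept: string[] = [];
  chunks.forEach((chunk, index) => {
    if (isBinaryChunk(chunk)) {
      console.warn(`Skipping binary chunk ${index} of ${filePath}`);
      return;
    }
    kept.push(chunk);
  });
  return kept;
}
```

Skipping individual chunks rather than aborting the file keeps any valid text that surrounds an embedded binary region.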
3. Improve reporting in src/ingest/index.ts
- Add binarySkipped: number to the IngestResult interface
- Track explicitly which files were skipped due to binary detection
- Include the count in the progress callback (or at minimum in the final result counters)
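One possible shape for the reporting change is sketched below. Only binarySkipped comes from the proposal; filesProcessed, binarySkippedPaths, the IngestProgress event, and ingestFiles() itself are hypothetical stand-ins for whatever index.ts already defines.

```typescript
// Assumed shape: only binarySkipped is taken from the proposal.
interface IngestResult {
  filesProcessed: number; // hypothetical existing counter
  binarySkipped: number; // new: files rejected by binary detection
  binarySkippedPaths: string[]; // hypothetical: which files were rejected
}

// Hypothetical progress event passed to the callback.
interface IngestProgress {
  path: string;
  status: "ingested" | "binary-skipped";
}

interface InputFile {
  path: string;
  bytes: Uint8Array;
}

function ingestFiles(
  files: InputFile[],
  isBinary: (bytes: Uint8Array) => boolean,
  onProgress?: (event: IngestProgress) => void,
): IngestResult {
  const result: IngestResult = {
    filesProcessed: 0,
    binarySkipped: 0,
    binarySkippedPaths: [],
  };
  for (const file of files) {
    if (isBinary(file.bytes)) {
      // Count and record the skip instead of discarding the file silently.
      result.binarySkipped++;
      result.binarySkippedPaths.push(file.path);
      onProgress?.({ path: file.path, status: "binary-skipped" });
      continue;
    }
    // Chunking and embedding would happen here in the real pipeline.
    result.filesProcessed++;
    onProgress?.({ path: file.path, status: "ingested" });
  }
  return result;
}
```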
4. Tests
test/ingest.test.ts: Add test cases for isBinaryContent() with:
- Binary file with text header (e.g., binary data after byte 512)
- Large binary file (>64KB) where binary data is only in the middle/later regions
- Edge cases: zero-byte file, UTF-8 BOM + binary trailer
test/ingest.test.ts: Add test cases for per-chunk validation
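The file-level cases listed above could be built as in-memory fixtures like the following sketch, which a test in test/ingest.test.ts could then pass to isBinaryContent(). The BinaryFixture type and helper names are hypothetical, the sizes are arbitrary choices that satisfy each case, and the expected verdict for the zero-byte file is an assumption because the proposal does not state it.

```typescript
// Hypothetical fixture shape for table-driven tests.
interface BinaryFixture {
  label: string;
  bytes: Uint8Array;
  expectBinary: boolean;
}

// n bytes of ASCII "a" (text) and n zero bytes (binary).
const textOf = (n: number): Uint8Array => new Uint8Array(n).fill(0x61);
const binaryOf = (n: number): Uint8Array => new Uint8Array(n);

// Concatenates byte arrays into one buffer.
function concatBytes(parts: Uint8Array[]): Uint8Array {
  let total = 0;
  for (const part of parts) total += part.length;
  const out = new Uint8Array(total);
  let offset = 0;
  for (const part of parts) {
    out.set(part, offset);
    offset += part.length;
  }
  return out;
}

const fixtures: BinaryFixture[] = [
  {
    label: "text header, binary data after byte 512",
    bytes: concatBytes([textOf(512), binaryOf(512)]),
    expectBinary: true,
  },
  {
    // 128KB total; the only binary region starts at size/2.
    label: "large file, binary only in the middle",
    bytes: concatBytes([textOf(64 * 1024), binaryOf(4096), textOf(60 * 1024)]),
    expectBinary: true,
  },
  {
    // Assumption: an empty file is not reported as binary.
    label: "zero-byte file",
    bytes: new Uint8Array(0),
    expectBinary: false,
  },
  {
    label: "UTF-8 BOM + binary trailer",
    bytes: concatBytes([new Uint8Array([0xef, 0xbb, 0xbf]), textOf(300), binaryOf(64)]),
    expectBinary: true,
  },
];
```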
Files to Modify
src/ingest/loader.ts — enhance isBinaryContent()
src/ingest/chunker.ts — add per-chunk binary validation
src/ingest/index.ts — add binarySkipped to IngestResult, wire up enhanced detection
test/ingest.test.ts — add binary detection tests
Constraints
- Must not break existing ingest behavior for non-binary files
- Must be fast — binary detection is on the hot path of ingest
- Must work with the current 2MB file size cap
- Bun-native (use TextEncoder/TextDecoder, no external deps)