Root Cause
The isBinaryContent() enhancement from #178 is insufficient for large binaries (>64KB) where all 3 sampled regions (first/middle/last 4KB) happen to have <30% non-printable chars. This is common in Go/compiled binaries with embedded string tables. Additionally, the per-chunk isBinaryChunk() validator is ineffective because it checks decoded text — binary bytes have already been laundered through TextDecoder → replacement chars (U+FFFD) → re-encoded by TextEncoder as valid UTF-8.
Two Problems
1. isBinaryContent() — insufficient sampling for large files
- Current: samples only 3 regions × 4KB = 12KB total
- A 2MB file with printable content in these 3 regions passes as "text"
- Need statistical stride sampling across the entire file
2. isBinaryChunk() — laundered through TextDecoder
- Current: encodes chunk text via TextEncoder, checks for non-printable bytes
- But: binary data was already decoded by TextDecoder, so invalid bytes became U+FFFD
- U+FFFD encodes to valid printable UTF-8: EF BF BD
- Result: all chunks pass validation, no matter how binary the source was
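The laundering effect described above can be reproduced in a few lines. This is a minimal sketch, not the code from chunker.ts: `nonPrintableRatio` is a hypothetical stand-in for whatever byte-level check isBinaryChunk() performs, and the sample bytes are chosen only because they are invalid UTF-8.

```typescript
// Hypothetical byte-level check: counts ASCII control bytes (except tab, LF, CR).
// Bytes >= 0x80 are treated as printable, since they can form valid UTF-8.
function nonPrintableRatio(bytes: Uint8Array): number {
  if (bytes.length === 0) return 0;
  let bad = 0;
  for (let i = 0; i < bytes.length; i++) {
    const b = bytes[i];
    const isWhitespace = b === 0x09 || b === 0x0a || b === 0x0d;
    if ((b < 0x20 && !isWhitespace) || b === 0x7f) bad++;
  }
  return bad / bytes.length;
}

// Eight bytes that are all invalid as UTF-8 (stray continuation bytes, illegal lead bytes).
const raw = new Uint8Array([0xff, 0xfe, 0x80, 0x81, 0xc0, 0xf5, 0x90, 0xa0]);

// Decoding replaces every invalid byte with U+FFFD instead of failing.
const text = new TextDecoder("utf-8").decode(raw);

// Re-encoding yields EF BF BD for each replacement character: valid UTF-8.
const reencoded = new TextEncoder().encode(text);

console.log(text.length, reencoded.length, nonPrintableRatio(reencoded));
```

Because every re-encoded byte is part of an EF BF BD sequence, a check that only looks at the re-encoded bytes has nothing left to flag.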
Fix
Part A: Statistical stride sampling in isBinaryContent()
Replace the 3-region approach for large files with:
- Thorough scan of first 8KB (catches ELF/PE/Mach-O headers)
- Statistical sampling: scan every 512th byte across the entire file (not just 3 regions)
- Thorough scan of last 8KB
- If cumulative non-printable ratio exceeds threshold → binary
- Null byte in any sampled byte → immediate binary
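The steps above might look roughly like the following sketch. The function name `looksBinary`, the helper `isNonPrintable`, and the constants are illustrative assumptions. In particular, the threshold value is not stated in this description, so the sketch reuses the 30% figure from the 3-region check, and it treats bytes >= 0x80 as printable so that UTF-8 text is not penalised.

```typescript
const EDGE_BYTES = 8 * 1024;         // thorough scan window at each end of the file
const STRIDE = 512;                  // sample every 512th byte in between
const NON_PRINTABLE_THRESHOLD = 0.3; // assumption: same 30% figure as the 3-region check

// Assumption: only ASCII control bytes (other than tab, LF, CR) count as non-printable.
function isNonPrintable(b: number): boolean {
  if (b === 0x09 || b === 0x0a || b === 0x0d) return false;
  return b < 0x20 || b === 0x7f;
}

function looksBinary(buf: Uint8Array): boolean {
  const n = buf.length;
  if (n === 0) return false;

  let sampled = 0;
  let nonPrintable = 0;

  // Records one byte; returns true when it is a null byte (immediate binary verdict).
  const sample = (i: number): boolean => {
    const b = buf[i];
    if (b === 0) return true;
    sampled++;
    if (isNonPrintable(b)) nonPrintable++;
    return false;
  };

  const headEnd = Math.min(EDGE_BYTES, n);
  const tailStart = Math.max(headEnd, n - EDGE_BYTES);

  // 1. Every byte of the first 8KB (executable headers live here).
  for (let i = 0; i < headEnd; i++) if (sample(i)) return true;
  // 2. Every 512th byte across the middle of the file.
  for (let i = headEnd; i < tailStart; i += STRIDE) if (sample(i)) return true;
  // 3. Every byte of the last 8KB.
  for (let i = tailStart; i < n; i++) if (sample(i)) return true;

  // 4. Cumulative ratio over everything that was sampled.
  return nonPrintable / sampled > NON_PRINTABLE_THRESHOLD;
}
```

For a 2MB file this inspects about 16KB at the edges plus roughly 4,000 strided bytes, so the cost stays close to the old 12KB sample while covering the whole file. A null byte that falls between stride positions can still be missed; that is inherent in sampling.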
Part B: Fix isBinaryChunk() to be useful
Instead of checking byte-level printability (which cannot work, because the re-encoded bytes are always valid UTF-8), check for replacement character density:
- Count U+FFFD occurrences in the chunk text
- If replacement chars exceed ~10% of chunk length → binary source
- This catches chunks that TextDecoder couldn't decode properly
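A density check along these lines could be as small as the sketch below. The function name `hasHighReplacementDensity` and the exact 0.1 cutoff are assumptions for illustration (the description says only "~10%"), and chunk length is measured in UTF-16 code units, which is what `String.prototype.length` reports.

```typescript
// Assumed cutoff; the description gives "~10%" rather than an exact value.
const REPLACEMENT_DENSITY_THRESHOLD = 0.1;

function hasHighReplacementDensity(chunkText: string): boolean {
  if (chunkText.length === 0) return false;

  // Count U+FFFD, the character TextDecoder emits for undecodable bytes.
  let replacements = 0;
  for (let i = 0; i < chunkText.length; i++) {
    if (chunkText.charCodeAt(i) === 0xfffd) replacements++;
  }

  return replacements / chunkText.length > REPLACEMENT_DENSITY_THRESHOLD;
}
```

Unlike the byte-level check, this works on the decoded text directly, so it sees exactly the evidence that the decode step leaves behind.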
Part C: Also fix extensionless file handling in walkDirectory()
Extensionless files are currently silently dropped by the directory walker. They should be included (with language: null) so that isBinaryContent() can decide. This ensures directory-level ingest also catches extensionless binaries. Extensionless text files such as Dockerfile, Makefile, and .gitignore would also benefit, since they are currently lost.
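One way to express the intended walker behaviour is a per-file classification step, sketched below. Everything here is hypothetical: `LANGUAGE_BY_EXT`, `WalkEntry`, `extensionOf`, and `classifyEntry` are illustrative names, the real extension table in loader.ts is not shown in this description, and the sketch assumes files with unrecognised extensions continue to be skipped.

```typescript
// Hypothetical extension table; the real mapping lives in loader.ts.
const LANGUAGE_BY_EXT: Record<string, string> = {
  ".ts": "typescript",
  ".go": "go",
  ".py": "python",
};

interface WalkEntry {
  path: string;
  language: string | null; // null means "no extension; let isBinaryContent() decide"
}

// Returns the extension with its dot, or "" for names like "Makefile" or ".gitignore".
function extensionOf(fileName: string): string {
  const dot = fileName.lastIndexOf(".");
  return dot > 0 ? fileName.slice(dot) : "";
}

// Returns an entry to ingest, or null when the file should be skipped.
function classifyEntry(filePath: string): WalkEntry | null {
  const name = filePath.slice(filePath.lastIndexOf("/") + 1);
  const ext = extensionOf(name);

  // Extensionless files are kept instead of being silently dropped.
  if (ext === "") return { path: filePath, language: null };

  // Assumption: unknown extensions are still skipped, as before.
  const language = LANGUAGE_BY_EXT[ext.toLowerCase()];
  return language ? { path: filePath, language } : null;
}
```

With this shape, a compiled binary named `bin/server` and a `Makefile` both reach the content check with language: null, and the binary is rejected there rather than by its name.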
Files
src/ingest/loader.ts — enhance isBinaryContent() + fix walkDirectory() to include extensionless files
src/ingest/chunker.ts — fix isBinaryChunk() to check U+FFFD density instead of byte printable ratio
test/ingest.test.ts — update/add tests for stride sampling + replacement char detection