[FIX] Binary detection still misses large binaries with sparse printable regions #181

Description

@four-bytes-robby

Root Cause

The isBinaryContent() enhancement from #178 is insufficient for large binaries (>64KB) where all 3 sampled regions (first/middle/last 4KB) happen to have <30% non-printable chars. This is common in Go/compiled binaries with embedded string tables. Additionally, the per-chunk isBinaryChunk() validator is ineffective because it checks decoded text: the binary bytes have already been laundered through TextDecoder, which turns invalid bytes into replacement chars (U+FFFD), and TextEncoder then re-encodes those as valid UTF-8.

Two Problems

1. isBinaryContent() — insufficient sampling for large files

  • Current: samples only 3 regions × 4KB = 12KB total
  • A 2MB file with printable content in these 3 regions passes as "text"
  • Need statistical stride sampling across the entire file

2. isBinaryChunk() — laundered through TextDecoder

  • Current: encodes chunk text via TextEncoder, checks for non-printable bytes
  • But: binary data was already decoded by TextDecoder — invalid bytes became U+FFFD
  • U+FFFD encodes to valid printable UTF-8: EF BF BD
  • Result: all chunks pass validation, no matter how binary the source was
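
The laundering effect can be reproduced in isolation. The snippet below is a minimal sketch, not the project's code: `launder` is a hypothetical helper that mimics the decode-then-encode round trip, and the input bytes are made up for illustration.

```typescript
// Hypothetical helper: decode raw bytes as UTF-8 text, then re-encode,
// which is effectively what happens before isBinaryChunk() sees a chunk.
function launder(raw: Uint8Array): Uint8Array {
  // Default (non-fatal) decoding replaces each invalid byte with U+FFFD.
  const text = new TextDecoder("utf-8").decode(raw);
  return new TextEncoder().encode(text);
}

// Three bytes that are invalid in UTF-8, followed by ASCII "A".
const raw = new Uint8Array([0xff, 0xfe, 0x80, 0x41]);
const laundered = launder(raw);

// Each invalid byte has become EF BF BD; no control bytes remain.
console.log(Array.from(laundered, (b) => b.toString(16)).join(" "));
```

After the round trip, every byte belongs to a well-formed UTF-8 sequence, so a check that looks only at the re-encoded bytes has nothing left to flag.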

Fix

Part A: Statistical stride sampling in isBinaryContent()

Replace the 3-region approach for large files with:

  1. Thorough scan of first 8KB (catches ELF/PE/Mach-O headers)
  2. Statistical sampling: scan every 512th byte across the entire file (not just 3 regions)
  3. Thorough scan of last 8KB
  4. If cumulative non-printable ratio exceeds threshold → binary
  5. Null byte in any sampled byte → immediate binary
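
The five steps above can be sketched as follows. This is an illustration under stated assumptions, not the actual loader.ts code: the function name `looksBinary`, the 30% threshold, and the definition of "non-printable" (C0 control bytes other than tab/LF/CR, plus DEL) are all assumptions, since the issue does not fix them.

```typescript
// Assumed constants; the issue names 8KB edges and a 512-byte stride,
// but the threshold value here is a guess.
const EDGE_BYTES = 8 * 1024;
const STRIDE = 512;
const NON_PRINTABLE_THRESHOLD = 0.3;

// Assumption: tab, LF and CR count as text; other C0 controls and DEL do not.
// Bytes >= 0x80 are treated as printable because they may be UTF-8.
function isNonPrintable(b: number): boolean {
  if (b === 0x09 || b === 0x0a || b === 0x0d) return false;
  return b < 0x20 || b === 0x7f;
}

function looksBinary(buf: Uint8Array): boolean {
  if (buf.length === 0) return false;
  let sampled = 0;
  let nonPrintable = 0;

  // Returns true when a null byte is seen so the caller can stop at once.
  const visit = (i: number): boolean => {
    const b = buf[i];
    if (b === 0) return true;
    sampled++;
    if (isNonPrintable(b)) nonPrintable++;
    return false;
  };

  const headEnd = Math.min(EDGE_BYTES, buf.length);
  const tailStart = Math.max(headEnd, buf.length - EDGE_BYTES);

  // 1. Thorough scan of the head.
  for (let i = 0; i < headEnd; i++) if (visit(i)) return true;
  // 2. Stride sampling across the middle of the file.
  for (let i = headEnd; i < tailStart; i += STRIDE) if (visit(i)) return true;
  // 3. Thorough scan of the tail.
  for (let i = tailStart; i < buf.length; i++) if (visit(i)) return true;

  // 4. Cumulative ratio over everything that was sampled.
  return nonPrintable / sampled > NON_PRINTABLE_THRESHOLD;
}
```

Because the middle of the file is sampled rather than fully scanned, a null byte that does not fall on a stride position is still missed; the approach improves coverage statistically and does not guarantee detection.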

Part B: Fix isBinaryChunk() to be useful

Instead of checking byte-level printability (which cannot work, because the re-encoded bytes are always valid UTF-8), check for replacement character density:

  • Count U+FFFD occurrences in the chunk text
  • If replacement chars exceed ~10% of chunk length → binary source
  • This catches chunks that TextDecoder couldn't decode properly
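
A minimal sketch of the density check described above. The names `replacementDensity` and `chunkLooksBinary` are hypothetical, and the 0.1 cutoff simply mirrors the "~10%" in this issue; the real threshold is a tuning choice.

```typescript
const REPLACEMENT_CODE_UNIT = 0xfffd;
// Assumed cutoff taken from the "~10%" figure; not a confirmed value.
const MAX_REPLACEMENT_RATIO = 0.1;

// Fraction of UTF-16 code units in the chunk that are U+FFFD.
function replacementDensity(text: string): number {
  if (text.length === 0) return 0;
  let count = 0;
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) === REPLACEMENT_CODE_UNIT) count++;
  }
  return count / text.length;
}

// True when the chunk most likely came from a source TextDecoder could not decode.
function chunkLooksBinary(text: string): boolean {
  return replacementDensity(text) > MAX_REPLACEMENT_RATIO;
}
```

This check works on the decoded string directly, so it does not depend on the bytes that were lost during decoding.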

Part C: Also fix extensionless file handling in walkDirectory()

Extensionless files are currently silently dropped by the directory walker. They should be included (with language: null) so that isBinaryContent() can decide. This lets directory-level ingest catch extensionless binaries, and it also recovers extensionless text files such as Dockerfile, Makefile and .gitignore, which are currently lost.
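
One way the walker's per-file decision could look is sketched below. Everything here is hypothetical: `classifyForWalk`, `WalkEntry` and the `LANGUAGE_BY_EXT` map are invented for illustration, and skipping files with an unknown extension is an assumption about the current behaviour, not something this issue states.

```typescript
import * as path from "node:path";

// Hypothetical extension map; the real one in loader.ts is not shown here.
const LANGUAGE_BY_EXT: Record<string, string> = {
  ".ts": "typescript",
  ".go": "go",
};

interface WalkEntry {
  filePath: string;
  language: string | null;
}

// Returns the entry to keep, or undefined when the file should be skipped.
function classifyForWalk(filePath: string): WalkEntry | undefined {
  const ext = path.extname(filePath).toLowerCase();
  // Extensionless: keep with language null so isBinaryContent() can decide.
  if (ext === "") return { filePath, language: null };
  const language = LANGUAGE_BY_EXT[ext];
  // Assumption: files with an unrecognised extension are still skipped.
  return language ? { filePath, language } : undefined;
}
```

Node's path.extname() returns an empty string for names such as Dockerfile and for dotfiles such as .gitignore, so both fall into the extensionless branch.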

Files

  • src/ingest/loader.ts — enhance isBinaryContent() + fix walkDirectory() to include extensionless files
  • src/ingest/chunker.ts — fix isBinaryChunk() to check U+FFFD density instead of byte printable ratio
  • test/ingest.test.ts — update/add tests for stride sampling + replacement char detection
