[FEAT] Enhanced binary detection: full-content scanning + per-chunk validation #178

Description

@four-bytes-robby

Purpose

Brain must automatically detect binary files during ingest and exclude them, even when they have text-like extensions. The current 256-byte content scan is insufficient for files whose binary data starts after the first 256 bytes.

Current State

Binary detection exists at 4 levels:

  1. Skip-dirs: node_modules, .git, dist, etc. (in loader.ts)
  2. Extension filter: BINARY_EXTENSIONS Set (~45 known binary extensions, in loader.ts)
  3. Content scan: isBinaryContent() scans the first 256 bytes; a null byte means binary immediately, and more than 30% non-printable bytes also means binary (in loader.ts, called from index.ts)
  4. File size cap: 2MB max (in index.ts)

Gap / Problem

  • 256-byte limit is too small: Binary data can start after the first 256 bytes (e.g., text metadata headers followed by binary body). Files with text extensions (.json, .xml, .txt) that contain embedded binary data slip through.
  • No per-chunk validation: If isBinaryContent() misses a binary file, the chunker produces garbage text chunks that get stored as embeddings — polluting the search index.
  • No reporting: Files skipped as binary are silently discarded and never appear in IngestResult metrics, so there is no way to see how many files were rejected.

Proposed Solution

1. Enhanced isBinaryContent() in src/ingest/loader.ts

  • For files ≤64KB: scan the entire file content
  • For files >64KB: sample three 4KB regions: the first 4KB, a middle region starting at offset ≈ size/2, and the last 4KB
  • Keep null-byte immediate-return optimization
  • Keep the >30% non-printable threshold
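
A minimal sketch of the sampling strategy above, not the actual loader.ts code. The constant names, the sampleRegions() helper, and the definition of "printable" (tab, LF, CR, and every byte ≥ 0x20 except DEL, so UTF-8 multi-byte sequences count as text) are assumptions; the existing implementation may classify bytes differently.

```typescript
// Hypothetical names; the issue gives the thresholds but not the identifiers.
const FULL_SCAN_LIMIT = 64 * 1024; // files up to this size are scanned in full
const SAMPLE_SIZE = 4 * 1024; // size of each sampled region for larger files
const NON_PRINTABLE_THRESHOLD = 0.3;

// Assumption: tab, LF, CR and all bytes >= 0x20 except DEL count as text.
function isPrintableByte(b: number): boolean {
  return b === 0x09 || b === 0x0a || b === 0x0d || (b >= 0x20 && b !== 0x7f);
}

// Returns [start, end) byte ranges to scan for a file of the given size.
function sampleRegions(size: number): Array<[number, number]> {
  if (size <= FULL_SCAN_LIMIT) return [[0, size]];
  const mid = Math.floor(size / 2);
  return [
    [0, SAMPLE_SIZE],
    [mid, mid + SAMPLE_SIZE],
    [size - SAMPLE_SIZE, size],
  ];
}

function isBinaryContent(bytes: Uint8Array): boolean {
  // Assumption: an empty file is treated as text.
  if (bytes.length === 0) return false;

  let scanned = 0;
  let nonPrintable = 0;
  for (const [start, end] of sampleRegions(bytes.length)) {
    for (let i = start; i < end; i++) {
      const b = bytes[i];
      if (b === 0x00) return true; // null byte: immediate binary
      if (!isPrintableByte(b)) nonPrintable++;
    }
    scanned += end - start;
  }
  return nonPrintable / scanned > NON_PRINTABLE_THRESHOLD;
}
```

For files above 64KB this only inspects 12KB, so binary data that falls entirely between the sampled regions would still pass; the per-chunk check in step 2 is what covers that case.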

2. Add isBinaryChunk() to src/ingest/chunker.ts

  • Before returning any chunk, verify its content is valid text
  • Use the same non-printable ratio check on the full chunk bytes
  • If binary → skip the chunk, log a warning, continue processing remaining chunks
  • This acts as a safety net for any binary content that passed the file-level check
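
The per-chunk safety net could look roughly like the sketch below. It assumes chunks are plain strings, and filterTextChunks() is a hypothetical wrapper added only to show the skip, warn, and continue flow; the real chunker's types and logging may differ. The null-byte short-circuit is carried over from the file-level check as an assumption.

```typescript
const chunkEncoder = new TextEncoder();
const CHUNK_NON_PRINTABLE_THRESHOLD = 0.3; // same ratio as the file-level check

function isBinaryChunk(chunk: string): boolean {
  const bytes = chunkEncoder.encode(chunk);
  if (bytes.length === 0) return false; // assumption: empty chunk is not binary

  let nonPrintable = 0;
  for (let i = 0; i < bytes.length; i++) {
    const b = bytes[i];
    if (b === 0x00) return true; // assumption: keep the null-byte short-circuit
    // Assumption: control bytes other than tab, LF, CR are non-printable.
    const printable =
      b === 0x09 || b === 0x0a || b === 0x0d || (b >= 0x20 && b !== 0x7f);
    if (!printable) nonPrintable++;
  }
  return nonPrintable / bytes.length > CHUNK_NON_PRINTABLE_THRESHOLD;
}

// Hypothetical helper: drops binary chunks, warns, and keeps the rest.
function filterTextChunks(chunks: string[], filePath: string): string[] {
  const kept: string[] = [];
  chunks.forEach((chunk, index) => {
    if (isBinaryChunk(chunk)) {
      console.warn(`Skipping binary chunk ${index} of ${filePath}`);
      return;
    }
    kept.push(chunk);
  });
  return kept;
}
```

Checking the whole chunk rather than a sample keeps the rule simple, and the cost is bounded by the chunk size.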

3. Improve reporting in src/ingest/index.ts

  • Add binarySkipped: number to the IngestResult interface
  • Track explicitly which files were skipped due to binary detection
  • Include the count in the progress callback (or at minimum in the final result counters)
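
One possible shape for the reporting change, as a sketch only. Apart from binarySkipped, every name here is hypothetical: the other IngestResult fields, the binarySkippedPaths list, the ProgressCallback type, and ingestWithReporting() are placeholders, since the issue does not show the existing interface.

```typescript
interface IngestResult {
  filesIngested: number; // hypothetical existing counter
  binarySkipped: number; // new counter proposed in this issue
  binarySkippedPaths: string[]; // hypothetical: which files were skipped
}

// Hypothetical callback shape; receives a snapshot after each file.
type ProgressCallback = (snapshot: Readonly<IngestResult>) => void;

interface CandidateFile {
  path: string;
  bytes: Uint8Array;
}

function ingestWithReporting(
  files: CandidateFile[],
  isBinary: (bytes: Uint8Array) => boolean,
  onProgress?: ProgressCallback,
): IngestResult {
  const result: IngestResult = {
    filesIngested: 0,
    binarySkipped: 0,
    binarySkippedPaths: [],
  };

  for (const file of files) {
    if (isBinary(file.bytes)) {
      // Count the skip instead of silently dropping the file.
      result.binarySkipped++;
      result.binarySkippedPaths.push(file.path);
    } else {
      // Chunking and embedding would happen here in the real pipeline.
      result.filesIngested++;
    }
    // Pass a copy so callers cannot mutate the running totals.
    onProgress?.({
      ...result,
      binarySkippedPaths: [...result.binarySkippedPaths],
    });
  }
  return result;
}
```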

4. Tests

  • test/ingest.test.ts: Add test cases for isBinaryContent() with:
    • Binary file with text header (e.g., binary data after byte 512)
    • Large binary file (>64KB) where binary data is only in the middle/later regions
    • Edge cases: zero-byte file, UTF-8 BOM + binary trailer
  • test/ingest.test.ts: Add test cases for per-chunk validation
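
The fixtures for these cases can be built in memory without touching disk. The builders below are a sketch with hypothetical names and sizes; the tests would feed their output to isBinaryContent() and assert on the result.

```typescript
const fixtureEncoder = new TextEncoder();

// Text header of 'a' bytes followed by a body of control bytes (0x00..0x1f).
function textHeaderThenBinary(headerBytes = 512, bodyBytes = 256): Uint8Array {
  const out = new Uint8Array(headerBytes + bodyBytes);
  out.fill(0x61, 0, headerBytes);
  for (let i = 0; i < bodyBytes; i++) out[headerBytes + i] = i % 0x20;
  return out;
}

// File above 64KB that is text except for 1KB of null bytes at size/2.
function largeWithBinaryMiddle(size = 128 * 1024): Uint8Array {
  const out = new Uint8Array(size).fill(0x61);
  const mid = Math.floor(size / 2);
  out.fill(0x00, mid, mid + 1024);
  return out;
}

// Zero-byte file.
function emptyFile(): Uint8Array {
  return new Uint8Array(0);
}

// UTF-8 BOM, a line of text, then a short binary trailer.
function bomThenBinaryTrailer(): Uint8Array {
  const bom = [0xef, 0xbb, 0xbf];
  const text = fixtureEncoder.encode("plain text body\n");
  const trailer = [0x00, 0x01, 0x02, 0xff];
  const out = new Uint8Array(bom.length + text.length + trailer.length);
  out.set(bom, 0);
  out.set(text, bom.length);
  out.set(trailer, bom.length + text.length);
  return out;
}
```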

Files to Modify

  • src/ingest/loader.ts — enhance isBinaryContent()
  • src/ingest/chunker.ts — add per-chunk binary validation
  • src/ingest/index.ts — add binarySkipped to IngestResult, wire up enhanced detection
  • test/ingest.test.ts — add binary detection tests

Constraints

  • Must not break existing ingest behavior for non-binary files
  • Must be fast — binary detection is on the hot path of ingest
  • Must work with the current 2MB file size cap
  • Bun-native (use TextEncoder/TextDecoder, no external deps)
