[FIX] Binary detection still misses large binaries with sparse printable regions

## Root Cause
The `isBinaryContent()` enhancement from #178 is insufficient for large binaries (>64KB) when all three sampled regions (the first, middle, and last 4KB) happen to contain fewer than 30% non-printable characters. This is common in Go and other compiled binaries with embedded string tables. In addition, the per-chunk `isBinaryChunk()` validator is ineffective because it checks decoded text: the binary bytes have already been laundered through `TextDecoder`, which turns invalid sequences into replacement characters (U+FFFD), and `TextEncoder` then re-encodes those as valid UTF-8.

## Two Problems

### 1. `isBinaryContent()` — insufficient sampling for large files
- Current: samples only 3 regions × 4KB = 12KB total
- A 2MB file with printable content in these 3 regions passes as "text"
- Need statistical stride sampling across the entire file

### 2. `isBinaryChunk()` — laundered through TextDecoder
- Current: encodes chunk text via `TextEncoder`, checks for non-printable bytes
- But: binary data was already decoded by `TextDecoder` — invalid bytes became U+FFFD
- U+FFFD encodes to valid printable UTF-8: `EF BF BD`
- Result: all chunks pass validation, no matter how binary the source was
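
The laundering effect can be reproduced in isolation. The following is a minimal sketch using only the standard `TextDecoder` and `TextEncoder` globals; the sample bytes and variable names are illustrative and are not taken from the codebase.

```typescript
// Three bytes that can never form valid UTF-8 (0xFF, 0xFE, and a lone
// continuation byte 0x80), followed by an ASCII 'A'.
const rawBinary = new Uint8Array([0xff, 0xfe, 0x80, 0x41]);

// A non-fatal TextDecoder (the default) replaces each invalid byte with U+FFFD.
const decoded = new TextDecoder("utf-8").decode(rawBinary);

// Re-encoding produces well-formed UTF-8: each U+FFFD becomes EF BF BD.
const reencoded = new TextEncoder().encode(decoded);

// Hex dump of the re-encoded bytes: "ef bf bd ef bf bd ef bf bd 41"
const hex = Array.from(reencoded, (b) => b.toString(16).padStart(2, "0")).join(" ");

// No byte in the re-encoded output falls in the ASCII control range,
// so a byte-level printability check sees nothing suspicious.
const hasControlByte = reencoded.some((b) => b < 0x20);
```

The original invalid bytes are unrecoverable after decoding, which is why a check on the re-encoded chunk text cannot see them.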

## Fix

### Part A: Statistical stride sampling in `isBinaryContent()`
Replace the 3-region approach for large files with:
1. Thorough scan of first 8KB (catches ELF/PE/Mach-O headers)
2. Statistical sampling: scan every 512th byte across the **entire file** (not just 3 regions)
3. Thorough scan of last 8KB
4. If cumulative non-printable ratio exceeds threshold → binary
5. Any null byte among the sampled bytes → immediately classified as binary
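
The steps above could be sketched roughly as follows. This is not the actual `loader.ts` implementation: the function name `isBinaryContentSketch`, the constant names, the 0.3 threshold (borrowed from the 30% figure in Root Cause), and the choice to treat bytes >= 0x80 as printable are all assumptions made for illustration.

```typescript
const EDGE_BYTES = 8 * 1024; // full-scan window at each end of the file
const STRIDE = 512; // sample every 512th byte in between
const NON_PRINTABLE_THRESHOLD = 0.3; // assumed value, not specified in the issue

// Assumption: tab, LF, CR, printable ASCII, and bytes >= 0x80 (possible
// UTF-8 multi-byte sequences) count as printable; other control bytes do not.
function isPrintable(byte: number): boolean {
  return byte === 0x09 || byte === 0x0a || byte === 0x0d || (byte >= 0x20 && byte !== 0x7f);
}

function isBinaryContentSketch(bytes: Uint8Array): boolean {
  const n = bytes.length;
  if (n === 0) return false;

  let sampled = 0;
  let nonPrintable = 0;

  // Scans [start, end) with the given step; returns true if a null byte is seen.
  const scan = (start: number, end: number, step: number): boolean => {
    for (let i = start; i < end; i += step) {
      const b = bytes[i];
      if (b === 0x00) return true;
      sampled++;
      if (!isPrintable(b)) nonPrintable++;
    }
    return false;
  };

  const headEnd = Math.min(EDGE_BYTES, n);
  const tailStart = Math.max(headEnd, n - EDGE_BYTES);

  if (scan(0, headEnd, 1)) return true; // 1. every byte of the first 8KB
  if (scan(headEnd, tailStart, STRIDE)) return true; // 2. strided samples across the middle
  if (scan(tailStart, n, 1)) return true; // 3. every byte of the last 8KB

  // 4. cumulative ratio over everything sampled
  return nonPrintable / sampled > NON_PRINTABLE_THRESHOLD;
}
```

For a 2MB file this inspects about 16KB at the edges plus roughly 4,000 strided bytes, so the cost stays close to the current 12KB sample while covering the whole file. Because the middle is sampled rather than fully scanned, an isolated null byte between stride positions can still be missed.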

### Part B: Fix `isBinaryChunk()` to be useful
Instead of checking byte-level printability (which cannot work, for the reason described in Problem 2), check for **replacement character density**:
- Count U+FFFD occurrences in the chunk text
- If replacement chars exceed ~10% of chunk length → binary source
- This catches chunks that TextDecoder couldn't decode properly
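
A minimal sketch of the density check, assuming chunk length is measured in UTF-16 code units and using the ~10% figure above as the cutoff. The name `isBinaryChunkSketch` and the constant are hypothetical, not the existing `chunker.ts` code.

```typescript
const MAX_REPLACEMENT_RATIO = 0.1; // the "~10%" threshold, treated here as a strict upper bound

function isBinaryChunkSketch(text: string): boolean {
  if (text.length === 0) return false;

  // Count U+FFFD code units; each one marks an invalid byte sequence
  // that TextDecoder could not decode.
  let replacements = 0;
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) === 0xfffd) replacements++;
  }

  return replacements / text.length > MAX_REPLACEMENT_RATIO;
}
```

A chunk of ordinary text containing a stray U+FFFD or two stays below the cutoff, while a chunk decoded from compiled code, where invalid sequences are frequent, should exceed it.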

### Part C: Also fix extensionless file handling in `walkDirectory()`
Extensionless files are currently silently dropped by the directory walker. They should be included (with `language: null`) so that `isBinaryContent()` can decide whether to ingest them. This way directory-level ingest also screens out extensionless binaries, while extensionless text files such as Dockerfile, Makefile, and .gitignore, which are currently lost, get ingested.
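
The intended classification could look something like the sketch below. It works on a plain list of paths rather than the real filesystem walk, and the extension map, `WalkEntry` type, and helper names are hypothetical. How `walkDirectory()` treats unknown extensions is not stated in this issue; skipping them here is an assumption.

```typescript
// Hypothetical extension table; the real mapping in loader.ts is not shown in this issue.
const LANGUAGE_BY_EXTENSION: Record<string, string> = {
  ".ts": "typescript",
  ".go": "go",
  ".py": "python",
};

interface WalkEntry {
  path: string;
  language: string | null;
}

// Extension of the last path segment. A leading dot (".gitignore") is not
// treated as an extension, so dotfiles count as extensionless.
function extensionOf(filePath: string): string {
  const base = filePath.slice(filePath.lastIndexOf("/") + 1);
  const dot = base.lastIndexOf(".");
  return dot > 0 ? base.slice(dot).toLowerCase() : "";
}

function classifyEntries(paths: string[]): WalkEntry[] {
  const entries: WalkEntry[] = [];
  for (const path of paths) {
    const ext = extensionOf(path);
    const language = LANGUAGE_BY_EXTENSION[ext];
    if (ext === "") {
      // Keep extensionless files with language: null so the binary check can decide later.
      entries.push({ path, language: null });
    } else if (language !== undefined) {
      entries.push({ path, language });
    }
    // Assumption: files with an unrecognized extension are still skipped.
  }
  return entries;
}
```

With this shape, `bin/server`, `Makefile`, and `.gitignore` all reach the binary check with `language: null` instead of being dropped during the walk.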

## Files
- `src/ingest/loader.ts` — enhance `isBinaryContent()` + fix `walkDirectory()` to include extensionless files
- `src/ingest/chunker.ts` — fix `isBinaryChunk()` to check U+FFFD density instead of byte printable ratio
- `test/ingest.test.ts` — update/add tests for stride sampling + replacement char detection
