What's missing
Our project guidelines (AGENTS.md) say computer vision work in this project should be evaluated with standard accuracy metrics — precision, recall, F1, and IoU (intersection-over-union) for segmentation — and that we should validate model output against human inspection where possible.
Right now we have unit tests that check the code behaves correctly on small synthetic examples (e.g. "does this function return the right shape"), but nothing that checks: "when we run the detector on a real, human-labeled inspection photo, how close does it get to what a person would have marked?"
There's already a data/annotations/ folder reserved for ground-truth labels, but no script that compares model output against it.
Why it matters
Without this, we have no objective way to know if a change to the detector (like the NMS change in issue #1) actually makes results better or worse on real images — we're relying on unit tests and manual eyeballing. For a tool used to flag well-integrity and safety-relevant failures, that's a gap.
What needs to happen
Add a small evaluation script (e.g. scripts/evaluate_detection.py) that:
- Loads any labeled images from data/annotations/
- Runs the current detection pipeline on them
- Reports precision, recall, F1, and average IoU against the ground-truth boxes
- Prints a simple summary table
This doesn't need to be fancy — a clear, repeatable number we can watch over time is the goal.
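As a rough starting point, the metric step could look like the sketch below. It rests on several assumptions the issue does not settle: boxes are (x_min, y_min, x_max, y_max) tuples, a prediction counts as correct when its IoU with an unclaimed ground-truth box is at least 0.5, and matching is greedy in the order the predictions arrive. The names iou, score_image, and summarize are hypothetical; loading files from data/annotations/ and calling the real detection pipeline are left out because their formats and entry points are not described here.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def score_image(
    predicted: Sequence[Box],
    truth: Sequence[Box],
    iou_threshold: float = 0.5,  # assumed threshold, not from the guidelines
) -> Tuple[int, int, int, List[float]]:
    """Greedily match predictions to ground truth, one-to-one.

    Returns (true positives, false positives, false negatives,
    IoU of each matched pair).
    """
    unmatched = list(range(len(truth)))
    matched_ious: List[float] = []
    for box in predicted:
        best_index, best_iou = None, 0.0
        for index in unmatched:
            value = iou(box, truth[index])
            if value > best_iou:
                best_index, best_iou = index, value
        if best_index is not None and best_iou >= iou_threshold:
            unmatched.remove(best_index)
            matched_ious.append(best_iou)
    tp = len(matched_ious)
    return tp, len(predicted) - tp, len(truth) - tp, matched_ious


def summarize(tp: int, fp: int, fn: int, ious: Sequence[float]) -> Dict[str, float]:
    """Turn accumulated counts into precision, recall, F1, and mean IoU."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    mean_iou = sum(ious) / len(ious) if ious else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mean_iou": mean_iou}


if __name__ == "__main__":
    # Made-up boxes purely to exercise the functions; not real results.
    predicted = [(0, 0, 10, 10), (20, 20, 30, 30)]
    truth = [(0, 0, 10, 10), (50, 50, 60, 60)]
    stats = summarize(*score_image(predicted, truth))
    for name, value in stats.items():
        print(f"{name:<12}{value:.3f}")
```

In a full script, the counts from score_image would be summed over every labeled image before calling summarize once, so the reported numbers reflect the whole set rather than an average of per-image scores. Note that mean IoU here covers matched pairs only; whether misses should pull it down is a design choice worth agreeing on up front.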