No way to measure detection accuracy against real ground-truth labels #4

@djimrastephane

What's missing

Our project guidelines (AGENTS.md) say computer vision work in this project should be evaluated with standard accuracy metrics — precision, recall, F1, and IoU (intersection-over-union) for segmentation — and that we should validate model output against human inspection where possible.

Right now we have unit tests that check the code behaves correctly on small synthetic examples (e.g. "does this function return the right shape"), but nothing that checks: "when we run the detector on a real, human-labeled inspection photo, how close does it get to what a person would have marked?"

There's already a data/annotations/ folder reserved for ground-truth labels, but no script that compares model output against it.

Why it matters

Without this, we have no objective way to know if a change to the detector (like the NMS change in issue #1) actually makes results better or worse on real images — we're relying on unit tests and manual eyeballing. For a tool used to flag well-integrity and safety-relevant failures, that's a gap.

What needs to happen

Add a small evaluation script (e.g. scripts/evaluate_detection.py) that:

  • Loads any labeled images from data/annotations/
  • Runs the current detection pipeline on them
  • Reports precision, recall, F1, and average IoU against the ground-truth boxes
  • Prints a simple summary table

This doesn't need to be fancy — a clear, repeatable number we can watch over time is the goal.
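
As a starting point, the metric side of such a script could be sketched as below. This is only a sketch under stated assumptions, not the project's implementation: the `(x_min, y_min, x_max, y_max)` box format, the 0.5 IoU threshold, the greedy one-to-one matching, and all function names are hypothetical. Loading labels from data/annotations/ and calling the detection pipeline are left out because the issue does not specify the annotation format or the detector's interface.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_boxes(preds: Sequence[Box], truths: Sequence[Box],
                iou_threshold: float = 0.5) -> List[float]:
    """Greedily pair predictions with ground-truth boxes, highest IoU first.

    Each box is used at most once. Returns the IoU of every accepted pair.
    """
    pairs = sorted(
        ((box_iou(p, t), i, j)
         for i, p in enumerate(preds)
         for j, t in enumerate(truths)),
        reverse=True,
    )
    used_preds, used_truths, matched = set(), set(), []
    for iou, i, j in pairs:
        if iou < iou_threshold:
            break  # remaining pairs overlap even less
        if i in used_preds or j in used_truths:
            continue
        used_preds.add(i)
        used_truths.add(j)
        matched.append(iou)
    return matched


def evaluate(images: Iterable[Tuple[Sequence[Box], Sequence[Box]]],
             iou_threshold: float = 0.5) -> Dict[str, float]:
    """Aggregate metrics over (predicted_boxes, ground_truth_boxes) pairs."""
    tp = fp = fn = 0
    ious: List[float] = []
    for preds, truths in images:
        matched = match_boxes(preds, truths, iou_threshold)
        tp += len(matched)                 # matched predictions
        fp += len(preds) - len(matched)    # predictions with no label
        fn += len(truths) - len(matched)   # labels the detector missed
        ious.extend(matched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    mean_iou = sum(ious) / len(ious) if ious else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "mean_iou": mean_iou}


def format_summary(metrics: Dict[str, float]) -> str:
    """Render the metrics as a small fixed-width text table."""
    lines = [f"{'metric':<12}{'value':>8}"]
    lines += [f"{name:<12}{value:>8.3f}" for name, value in metrics.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hand-made boxes for illustration only, not real inspection data.
    demo = [([(0, 0, 10, 10), (50, 50, 60, 60)],
             [(0, 0, 10, 10), (100, 100, 110, 110)])]
    print(format_summary(evaluate(demo)))
```

Here the average IoU is taken over matched pairs only, so it describes how tight the correct detections are, while precision and recall account for the missed and spurious boxes. Whether that is the intended definition of "average IoU" is a choice this issue leaves open.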

Metadata

Labels: enhancement (New feature or request)