Context
Follow-up split out of #13 (closed) and PR #30. PR #30 promoted ~22 national/international outlets from unknown (domain_score 0.2) to Tier 3 trusted_media (0.6) so legitimate outbreak reporting was no longer floored below the filter's credibility threshold. It also added a topical relevance term to the search-stage score (#4).
Problem
The relevance term added in #30 lives in search-stage ranking only; it does not feed the filter's keep decision. The filter's compute_priority_score still gives credibility a blend weight of 0.25 (0.25·credibility). Now that reputable outlets sit at Tier 3 (0.6), that credibility term can push off-topic pieces from those same outlets over the keep threshold on authority alone.
Observed during the #3/#4/#13 review: an H5N1 run kept a CBS transcript that was off-topic — admitted on credibility, not relevance. This is the direct, PR-acknowledged trade-off of the Tier 3 promotions ("Known interaction to flag for reviewers" in #30).
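The size of the effect follows from the figures quoted in this issue. A minimal sketch, assuming the credibility term enters the score linearly as 0.25·credibility; the other terms of compute_priority_score are not reproduced here, and the helper name credibility_contribution is hypothetical:

```python
# Credibility contribution to the filter score, using only figures from this issue.
CREDIBILITY_WEIGHT = 0.25  # blend weight quoted in this issue
UNKNOWN_SCORE = 0.2        # domain_score for "unknown" outlets before PR #30
TIER3_SCORE = 0.6          # domain_score after promotion to Tier 3 trusted_media


def credibility_contribution(domain_score: float,
                             weight: float = CREDIBILITY_WEIGHT) -> float:
    """Hypothetical helper: the share of the score owed to credibility alone."""
    return weight * domain_score


before = credibility_contribution(UNKNOWN_SCORE)
after = credibility_contribution(TIER3_SCORE)
print(f"before: {before:.2f}, after: {after:.2f}, gain: {after - before:.2f}")
# prints "before: 0.05, after: 0.15, gain: 0.10"
```

Under that assumption, the promotion adds 0.10 to the score of every article from a promoted outlet, whether or not the article is on-topic.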
Proposed change
In bioscancast/stages/filtering/heuristics.py (compute_priority_score / heuristic weights in stages/filtering/config.py):
- Raise the keyword-overlap / relevance weight in the filter's keep decision, and/or
- Lower the 0.25·credibility blend weight,

so a Tier 3 domain no longer clears the borderline on authority alone when topical overlap is low.
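The two options above can be illustrated with a toy score. This is a sketch under stated assumptions, not the actual compute_priority_score: only the 0.25 credibility weight and the 0.6 Tier 3 score come from this issue. The linear form, the relevance weights, the fixed contribution from other terms, the keep threshold, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    relevance: float
    credibility: float


# Hypothetical values chosen only to illustrate the interaction.
OTHER_TERMS = 0.30     # assumed fixed contribution from all other score terms
KEEP_THRESHOLD = 0.45  # assumed keep threshold

CURRENT = Weights(relevance=0.25, credibility=0.25)   # 0.25 is from the issue
PROPOSED = Weights(relevance=0.40, credibility=0.10)  # one possible rebalance


def toy_priority_score(relevance: float, credibility: float, w: Weights) -> float:
    """Toy linear blend; the real function may combine its terms differently."""
    return w.relevance * relevance + w.credibility * credibility + OTHER_TERMS


def keep(relevance: float, credibility: float, w: Weights) -> bool:
    return toy_priority_score(relevance, credibility, w) >= KEEP_THRESHOLD


# Off-topic piece from a Tier 3 outlet (credibility 0.6, low topical overlap).
off_topic_tier3 = (0.1, 0.6)
# On-topic piece from the same tier.
on_topic_tier3 = (0.8, 0.6)

for name, w in (("current", CURRENT), ("proposed", PROPOSED)):
    print(name,
          "off-topic kept:", keep(*off_topic_tier3, w),
          "| on-topic kept:", keep(*on_topic_tier3, w))
```

With these illustrative numbers the off-topic Tier 3 piece is kept under the current weights and dropped under the rebalanced ones, while the on-topic piece is kept in both cases. Real weights should come from the parameter sweep rather than from this sketch.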
Notes / guardrails
Full method + numbers: data/investigations/findings-issues-3-4-13.md.
- Run scripts/sweep_filter_params.py against the hand-labeled pools (data/investigations/live_pools/labels.json) to confirm precision improves without dropping official recall.
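The acceptance check for a weight change can be sketched as follows. This is an illustration only: it does not reproduce scripts/sweep_filter_params.py or the schema of labels.json. The LabeledItem fields, the in-memory pools, and the reading of "official recall" as recall over relevant items from official sources are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledItem:
    relevant: bool  # hand label: is the item on-topic? (assumed field)
    official: bool  # is the item from an official source? (assumed field)
    kept: bool      # filter decision under one parameter setting


def precision(items: list[LabeledItem]) -> float:
    """Fraction of kept items that are relevant."""
    kept = [i for i in items if i.kept]
    return sum(i.relevant for i in kept) / len(kept) if kept else 0.0


def official_recall(items: list[LabeledItem]) -> float:
    """Fraction of relevant official items that the filter kept."""
    targets = [i for i in items if i.official and i.relevant]
    return sum(i.kept for i in targets) / len(targets) if targets else 1.0


# Hypothetical pool: two official on-topic items, one on-topic media item,
# and one off-topic media item.
baseline = [
    LabeledItem(relevant=True, official=True, kept=True),
    LabeledItem(relevant=True, official=True, kept=True),
    LabeledItem(relevant=True, official=False, kept=True),
    LabeledItem(relevant=False, official=False, kept=True),   # kept on authority
]
candidate = baseline[:3] + [
    LabeledItem(relevant=False, official=False, kept=False),  # now dropped
]

accept = (precision(candidate) > precision(baseline)
          and official_recall(candidate) >= official_recall(baseline))
print("accept candidate weights:", accept)
```

A candidate setting passes only if precision rises and official recall does not fall, which mirrors the guardrail stated above.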