diff --git a/CURRENT.md b/CURRENT.md
index 1854562..60de7c0 100644
--- a/CURRENT.md
+++ b/CURRENT.md
@@ -4,7 +4,7 @@
 - Project: `security-scanner`
 - Merge mode: `guarded-auto-merge`
-- Active goal: `personal-prod-deploy`
+- Active goal: `ghas-quality-vuln-parity`
 - Last auto merge: `ledger:20260617T003405Z-autopilot-3236f4`
 - Ledger entries: `4`
 - Ledger index hash: `sha256:e1893a649a1101b74a087b5eaaa275813a85708c5bb46c4ae70c24e10a111050`
diff --git a/docs/workbench/agentic-workflows/2026-06-21-ghas-quality-vuln-parity-goal.md b/docs/workbench/agentic-workflows/2026-06-21-ghas-quality-vuln-parity-goal.md
new file mode 100644
index 0000000..3dba8ef
--- /dev/null
+++ b/docs/workbench/agentic-workflows/2026-06-21-ghas-quality-vuln-parity-goal.md
@@ -0,0 +1,157 @@
+# Agentic Workflow: GHAS-grade vuln/SAST detection quality (CodeQL parity SLO)
+
+**Status:** Ready for long single-goal execution
+**Date:** 2026-06-21
+**Goal ID:** `ghas-quality-vuln-parity`
+**Spec:** `docs/workbench/specs/ghas-quality-vuln-subtrack/{requirements,design,review}.md`
+**Merge flow:** pull request
+
+Execution packet for a long-running single goal. Build the measurement harness and the
+FP-suppression quality machine that bring vuln/SAST detection up to the **GHAS code-scanning
+(CodeQL) parity SLO**. The proven two-layer structure of the secret subtrack (PR #58) carries over
+1:1, except that, **specifically for vuln, durable disposition moves out of the autonomous layer
+into the H-track** (`VulnerabilityFinding` is not loaded into the durable store, and
+`set_finding_disposition` raises ValueError when FINDING_STATE is absent, which triggers the
+storage-projection stop-condition). Live fetch of real code-scanning data is a stop-condition;
+commits are synthetic-or-redacted-only.
+
+## Goal
+
+Using synthetic fixtures and TDD, complete the per-repo 1:1 CodeQL parity measurement harness for
+vuln/SAST, the inline FP-suppression tier, enforcement of the synthetic regression gate, and the
+wiring of the report-only parity gate, then close it out through PR, CI, and merge.
+
+**Completion criteria (autonomous goal done = M3):**
+
+- **M1**: code-scanning domain model `CodeScanAlertRecord` (redacted) + matcher
+  `compare_codescan_alerts_with_findings` (three CWE-intersection tiers: matched-by-cwe /
+  by-rule-token / unmatched) + adversarial fixtures.
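+  **Precision/recall reuse `core/vulnerability/evaluation.py` (zero lines of new calculation
+  code)**, reached only through a `CodeScanAlertRecord→VulnerabilityEvaluationKey` adapter. The
+  line-window is a true `|alert−finding|≤N`; the recall denominator is open+fixed alerts only, and
+  the precision penalty for dismissed alerts is tracked separately. Zero network.
+- **M2**: inline cheap tier (scan-vuln post-processing: code_flow_count, severity floor,
+  low-confidence rule suppression). Only the parts that are deterministic, metadata-only, and
+  guarded by suppression-rate regression tests are default-on; new suppression that changes
+  behavior is gated. Expand the synthetic corpus to five classes (SQLi, XSS, path-traversal,
+  command-injection, SSRF) and apply rule-class normalization. Existing scan-vuln default output
+  stays unchanged (canary TPs preserved).
+- **M3 (autonomous done)**: enforce the synthetic regression gate (evaluate precision≥0.90 /
+  recall≥0.99) and wire the report-only parity gate `governance.vuln_parity_slo --check`
+  (threshold yml absent → report-only; measured against a frozen synthetic snapshot; age >
+  threshold → stale-degraded). Prove deterministic reproduction without a real snapshot.
+- The existing Gitleaks-first secret path and the existing vuln scan/import/report/gate default
+  paths stay unchanged.
+- No GHAS trigger, upload, alert mutation, or **live fetch**. Zero blocking findings from
+  architecture review (pre / post-M2 / post-M3 / final). PR CI and the local governance gates pass.
+
+**H-track (outside the autonomous loop, stop-condition PRs):** H1 acquire a real code-scanning
+snapshot → H2 baseline + fixture-vs-real divergence → H3 fix the target + parity enforce → **H4
+wire durable disposition for vuln verdicts (storage projection)**.
+
+## Execution Contract
+
+- Run M1 through M3 to completion as one long goal. No intermediate approvals. Humans step in only
+  on a stop-condition.
+- Use subagents actively (implementation worker gpt-5.5/high; helpers follow repo policy). Open a
+  PR and carry it until it is mergeable after CI passes.
+- Never commit real endpoints, hosts, credentials, private paths, real SARIF, real code-scanning
+  exports, or real findings.
+
+## Fixed Decisions
+
+- Scope: vuln autonomous M1~M3 (synthetic-only). Real fetch, baseline, enforce, and durable
+  disposition belong to the H-track.
+- Measurement: CodeQL code-scanning alerts as the oracle, per-repo 1:1, snapshot = ground truth
+  (frozen synthetic). Calculation reuses `core/vulnerability/evaluation.py` (no fourth engine).
+  Synthetic evaluate and the parity matcher share the same calculation core.
+- Matching: rule-class normalization + line-window apply with identical semantics to both the
+  synthetic gate and parity (VFR8 consistency).
+- Inline tier: the deterministic, metadata-only part is default-on; behavior-changing parts are
+  gated. There is no validity-check analog.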
+- **No durable disposition (autonomous)**: in the v1 autonomous layer, vuln verdicts stay in the
+  existing throwaway JSONL. Durable persistence requires a storage projection →
+  `storage-projection-or-schema-migration-required` stop → H4.
+- snapshot: commit only synthetic redacted fixtures (the `source: synthetic` marker is required;
+  fail closed without it). Real snapshots are blocked twice: by `.gitignore` and by exclusion from
+  allowed_writes.
+- **No autonomous edits to core governance**: under `governance/`, allowed_writes covers only
+  `governance/vuln_parity_slo.py` (a separate file from the secret track's
+  `governance/parity_slo.py`). If `autopilot_goal.yml`, `autopilot_gate.py`, or `public_safety.py`
+  needs a change, stop (scope-expansion) → human PR.
+- Slot: autonomous code merges without an active_goal slot (on merge, take main (theirs) for the
+  three governance files). Switching the real slot is the user's decision.
+
+## Required Architecture Review Gate
+
+Mandatory and blocking: pre-implementation / post-M2 / post-M3 / final. Stop only when a finding
+demands an SoT change, scope expansion, unsafe data, or a change to existing defaults; fix
+everything else in-goal.
+
+## Multi-agent Execution Model
+
+Give subagents disjoint responsibilities (matcher/model: Worker A; inline tier: Worker B;
+synthetic gate + parity_slo: Worker C; architecture/security reviewer, read-only;
+code_simplifier). The main agent integrates and makes the final call.
+
+## Allowed Write Surface
+
+`allowed_writes` in `governance/autopilot_goal.yml` is authoritative. Summary: the promoted spec,
+this workflow document, src/tests/eval/examples, `governance/vuln_parity_slo.py` (the new gate
+only), ledger, CURRENT.md. **This is not a blanket `governance/**` grant**: any other governance
+change is a scope-expansion stop.
+
+## Suggested Work Plan
+
+### Readiness (M0 = goal-setup, already performed by the orchestrator)
+The orchestrator has completed goal-setup (spec promotion + autopilot_goal.yml goal_id +
+current.yml active_goal + CURRENT.md in one atomic commit). Start from the pre-implementation
+architecture review.
+
+### M1 Measurement substrate
+1. Red-first: matcher scoring for CWE / rule-token / line-window / dismissed; on the adversarial
+   fixtures (missing CWE, line drift, different rule.id between CodeQL and Semgrep, dismissed), a
+   missing normalization, window, or filter goes red; precision/recall come from
+   `core/vulnerability/evaluation.py`; denominators are state-aware.
+2. Implement: CodeScanAlertRecord, the adapter, the matcher (zero lines of new precision/recall
+   calculation). Fix the line-window N from fixtures.
+
+### M2 Inline tier + synthetic hardening
+1. Red-first: FP suppression on safe code while vulnerable-code recall holds (evaluate gate);
+   default-on does not break recall≥0.99; existing default output unchanged; regression on
+   independent adversarial pairs.
+2. Implement: inline gating (default-on / gated boundary), the five-class synthetic corpus +
+   rule-class normalization. Post-M2 review.
+
+### M3 Synthetic gate + parity_slo (autonomous done)
+1. Red-first: synthetic regression gate enforce; `governance/vuln_parity_slo.py` report-only
+   (threshold absent), measured against the frozen synthetic snapshot, stale-degraded.
+2. Implement: vuln_parity_slo.py. Final review → PR. Record "parity SLO enforce pending H-track"
+   in CURRENT.md.
+
+## Required Local Checks
+
+```bash
+uv run pytest
+uv run python -m governance.render --validate
+uv run python -m governance.render --check
+uv run python -m governance.rebuild_ledger_index --check
+uv run python -m governance.render_github_ruleset --output governance/main_ruleset.json --check
+uv run python -m governance.public_safety --diff origin/main...HEAD
+uv run python -m governance.public_safety --path docs/workbench/specs/ghas-quality-vuln-subtrack
+uv run python -m governance.vuln_parity_slo --check
+uv run python -m governance.autopilot_gate --base origin/main
+```
+
+## Stop Conditions
+
+The `stop_conditions` in `governance/autopilot_goal.yml` (the canonical 16). Key ones:
+ghas-live-fetch-or-mutation-required (H1 real fetch),
+**storage-projection-or-schema-migration-required** (durable disposition, durable snapshot
+storage → H4), existing-secret-default-behavior-change, architecture-review-blocking-finding,
+public-safety-hit, scope-expansion (edits to core governance files), same-blocker-three-times,
+break-glass.
+
+## Resume Prompt
+
+```text
+Goal: complete `ghas-quality-vuln-parity` in the security-scanner repo through a PR.
+
+Read first:
+- AGENTS.md
+- governance/autopilot_goal.yml
+- docs/workbench/agentic-workflows/2026-06-21-ghas-quality-vuln-parity-goal.md
+- docs/workbench/specs/ghas-quality-vuln-subtrack/{requirements,design,review}.md
+- src/security_scanner/core/vulnerability/{evaluation,model}.py
+- src/security_scanner/baseline/ghas_api/__init__.py
+- src/security_scanner/runtime/vulnerability_verify_artifact.py
+- src/security_scanner/cli/commands (import-sarif/scan-vuln/report/gate/evaluate)
+
+Implement M1~M3 (autonomous, synthetic fixtures only, no real GHAS/code-scanning):
+M1 CodeScanAlertRecord + compare_codescan_alerts_with_findings matcher (CWE 3-tier) + adversarial
+   fixtures. Reuse core/vulnerability/evaluation.py (zero new precision/recall code). True
+   |line|<=N window, state-aware denominators.
+M2 inline cheap tier (metadata-only default-on / gated for behavior change), synthetic corpus 5 CWE
+   classes + rule-class normalization. Existing scan-vuln default output unchanged.
+M3 synthetic regression gate enforce + report-only parity gate governance.vuln_parity_slo --check.
+
+Do NOT: durable-persist vuln verdict (storage projection -> H4 human-gated), call/fetch GHAS
+code-scanning, commit real SARIF/findings, modify governance/autopilot_goal.yml |
+autopilot_gate.py | public_safety.py (allowed_writes = governance/vuln_parity_slo.py only), change
+existing secret/vuln scan defaults. Real snapshot fetch, baseline, enforce, durable disposition
+are human-gated H1~H4. Use multi-agent. Mandatory architecture gates: pre-implementation, post-M2,
+post-M3, final. Finish by opening a PR, waiting for CI, merge when green. Autonomous done = M3;
+record "parity SLO enforce pending H-track" in CURRENT.md.
+
+Required checks: pytest; render --validate/--check; rebuild_ledger_index --check;
+render_github_ruleset --check; public_safety --diff and
+--path docs/workbench/specs/ghas-quality-vuln-subtrack; vuln_parity_slo --check;
+autopilot_gate --base origin/main.
+```
diff --git a/docs/workbench/specs/ghas-quality-vuln-subtrack/design.md b/docs/workbench/specs/ghas-quality-vuln-subtrack/design.md
new file mode 100644
index 0000000..e840ba3
--- /dev/null
+++ b/docs/workbench/specs/ghas-quality-vuln-subtrack/design.md
@@ -0,0 +1,426 @@
+# GHAS-grade detection quality track: VULN/SAST subtrack Design (v2, review incorporated)
+
+> SoT: `requirements.md` (same dir as this file, locked). This document is designed so that **the
+> autopilot sequences the work autonomously as a single goal** and delivers the autonomous-layer
+> spec. Written 2026-06-21.
+> v2: incorporates the multi-agent review (31 findings: 6 blocker, 13 major, 8 minor, 4 nit). See
+> `review.md`.
+> Parent/sibling: `.claude/specs/20260620-ghas-quality-track/` (secret subtrack design v2 +
+> review). Nearly all of this track's blockers and majors are the **vuln parallels** of items the
+> secret track (PR #58) caught and resolved in review, so the v2 skeleton is a 1:1 carry-over of
+> the secret design v2 structure.
+
+## 0. Single Goal (autopilot north star)
+
+> **Bring vuln/SAST detection quality to the GHAS code-scanning (CodeQL) parity SLO.** Execution
+> is split into **two layers** (carrying over the structure validated in the secret track; review
+> AP-01):
+>
+> - **Autonomous goal done (autopilot single goal, M1~M3)**: build and prove the parity matcher,
+>   the inline cheap tier, and the synthetic regression gate with TDD **using synthetic data and
+>   fixtures only**, up to and including **wiring the report-only parity gate against synthetic
+>   fixtures**. No contact with real GHAS. Done on PR merge.
+> - **Human-gated operations layer (H1~H4, stop-condition PRs)**: acquire a real code-scanning
+>   snapshot → measure the baseline → fix the measure-first target → switch parity to enforce +
+>   **wire durable disposition for vuln verdicts**. Outside the autonomous loop.
+
+**Clarified definition of done (review `report-only-enforce-unreachable` / AP-01 / AP-08)**:
+autonomous goal done = **M3** (matcher + inline tier + synthetic regression gate enforce +
+report-only parity gate wiring, proven on synthetic data, PR merged). The v1 done of requirements
+V-Q9 (baseline measured + target reached + real parity enforce) holds **only after H1~H4
+complete**. On PR merge, state "parity SLO enforce pending H-track" in CURRENT.md.
+
+**Vuln-specific aggravating factor (review AP-02 / ARCH-VULN-01)**: in the secret track,
+disposition was already wired durably, so M3 (LLM-tier disposition) sat in the autonomous layer.
+Vuln findings, however, are **never loaded** into the durable store (zero references to
+`VulnerabilityFinding` in store.py), so durable disposition would need a new storage projection,
+which is the `storage-projection-or-schema-migration-required` stop-condition.
+Therefore **durable disposition leaves the autonomous layer and is deferred to the H-track**,
+which makes the vuln autonomous scope narrower than the secret one.
+
+## 1. Architecture overview (current assets → target)
+
+```
+                        ┌───────────── autonomous layer (autopilot single goal, M1~M3) ──────────────┐
+ scan-vuln              │ [parity matcher] CodeScanAlertRecord ↔ VulnerabilityFinding                │
+ (Semgrep-compat)       │ rule-class (CWE) + line-window matching →                                  │
+ VulnerabilityFinding[] │ **reuse core/vulnerability/evaluation.py** (zero new precision/recall      │
+                        │ code), converging only through an adapter                                  │
+                        │ [inline cheap tier] gate.py extension (severity/precision/code_flow floor) │
+                        │ default-on deterministic part kept separate from the new gated part        │
+ synthetic              │ [synthetic regression gate] evaluate --category code-vuln (exists, harden) │
+ code-vuln corpus       │ rule-class + line-window EvaluationKey semantics (shared with parity)      │
+ + fixture ────────────►│ [CI gate] governance.vuln_parity_slo --check: threshold absent →           │
+                        │ report-only (measure and report on synthetic fixtures only); present →     │
+                        │ enforce (after H-track); snapshot age > threshold → stale-degraded         │
+                        └────────────────────────────────────────────────────────────────────────────┘
+                        ┌───────── human-gated operations layer (H1~H4, stop-condition PR) ──────────┐
+ real GHAS              │ baseline/ghas_api (GET-only) → code-scanning fetch (NEW) → real redacted   │
+ code-scanning          │ frozen snapshot (local, never committed) → baseline measurement +          │
+ (human-PR)             │ fixture-vs-real divergence report → fix target → switch parity to enforce  │
+                        │ + vuln verdict **durable disposition wiring** (new storage projection)     │
+                        └────────────────────────────────────────────────────────────────────────────┘
+```
+
+**Invariants (shared with the secret track):** snapshot = ground truth, per-repo 1:1,
+measure-first, human-PR fetch gate, public-safety redaction ([[vuln-redaction-design]]),
+GhApiRunner GET-only contract.
+
+**New vuln-specific components:** code-scanning fetch (VFR2, H-track), rule-class matcher
+(VFR1/V-Q2, autonomous), vuln durable disposition (VFR5/V-Q6, **deferred to the H-track**; review
+C), dismissed_reason scoring (VFR6/V-Q4).
+
+## A. Autopilot Execution Shape: what goal-setup writes into `governance/autopilot_goal.yml` (incorporates review blockers AP-03 / vuln-sot-path / vuln-governance-wildcard)
+
+> **Instruction: do not copy the list below verbatim; take the current `phase-2a` goal.yml as the
+> base template and layer only the diff on top** (carried over from the secret major
+> `acceptance-checks-drift`).
+> This prevents missing gates.
+
+- `goal_id`: `ghas-quality-vuln-parity`
+- `execution_mode`: `long-single-goal` / human_gate: `stop-conditions-only` / merge_flow: `pull-request`
+- **SoT location decision (blocker `vuln-sot-path-gitignored-gate-blind` / VD-01)**: **promote
+  (migrate) the reviewed spec to `docs/workbench/specs/ghas-quality-vuln-subtrack/`** and track it
+  in git. The current `.claude/specs/` falls under `.gitignore:72` (`.claude/*`, with skills as the
+  only exception), so the gate blocks it as `outside allowed_writes` and public_safety cannot scan
+  the SoT (confirmed with `git check-ignore`). The grill originals stay in `.claude/specs/`; only
+  the committed copy is promoted. **List this promotion as an M0/goal-setup deliverable.**
+- `allowed_writes` (allowlist): `docs/workbench/specs/ghas-quality-vuln-subtrack/**`,
+  `docs/workbench/agentic-workflows/2026-06-21-ghas-quality-vuln-parity-goal.md`,
+  `src/security_scanner/**`, `tests/**`, `eval/**`, `ledger/**`, `CURRENT.md`,
+  **`governance/vuln_parity_slo.py` (the new gate only)**.
+  **No blanket `governance/**` (blocker `vuln-governance-wildcard-self-modify`)**: autonomous edits
+  to `autopilot_goal.yml`, `autopilot_gate.py`, and `public_safety.py` are forbidden (Fixed
+  decision); use a human PR if needed. The secret track's `governance/parity_slo.py` and the vuln
+  track's `governance/vuln_parity_slo.py` are **separate files** (not shared; pinned in §5).
+- `acceptance_checks` (aligned 1:1 with phase-2a, diff only): architecture-review **pre / post-M2 /
+  post-M3 / final** + `pytest` + `render --validate/--check` + `render_github_ruleset --check` +
+  `rebuild_ledger_index --check` + `public_safety --diff origin/main...HEAD` +
+  **`public_safety --path docs/workbench/specs/ghas-quality-vuln-subtrack`** +
+  `autopilot_gate --base origin/main` + **new `governance.vuln_parity_slo --check`**
+  (report-only→enforce).
+- `stop_conditions`: **start from the current canonical set of 16** and call out the ones in play
+  for this track: `ghas-live-fetch-or-mutation-required` (M4/H1 real fetch),
+  `storage-projection-or-schema-migration-required` (**the vuln durable disposition and durable
+  snapshot storage paths; review AP-02 / VD-06**), `existing-secret-default-behavior-change`,
+  `architecture-review-blocking-finding`, `public-safety-hit`, `scope-expansion`,
+  `same-blocker-three-times`, `break-glass`, and so on. No arbitrary subsets.
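+- **goal-setup atomicity (review AP-03 / AP-04, lesson from the deadlock noted in memory)**: the
+  goal-setup commit must update the `autopilot_goal.yml` goal_id, the `current.yml` active_goal,
+  and `CURRENT.md` **together in one commit** (`render.py` fails validation when
+  `active_goal != goal_id`, and `current.yml` is outside allowed_writes, so the autopilot cannot
+  resolve it on its own). Fixed decision.
+- **Slot strategy (review major AP-04 / `vuln-active-goal-slot-eviction`)**: the `current.yml`
+  active_goal is currently occupied by `personal-prod-deploy` (confirmed). Following the secret
+  pattern, the vuln **autonomous code (M1 matcher, M2 inline tier, synthetic corpus, M3 report-only
+  gate) can merge without an active_goal slot** through an ordinary PR (claude/* branches are
+  exempt from the autopilot-gate), and **the three governance files (autopilot_goal.yml,
+  current.yml, CURRENT.md) take main (theirs) so they stay byte-identical** (avoiding
+  self-modification). Whether to actually hand the slot over to the vuln goal is a **user decision
+  (stop/escalate)**: only after personal-prod-deploy completes, or with the user's approval. Only
+  H1~H4 (real fetch, durable projection) split off as slices that need the slot and a human gate
+  (consistent with the two-layer split in B).
+
+## B. Two-layer split: autonomous layer / H-track (review blocker AP-01)
+
+The two-layer split of the secret design v2, autonomous layer (M0~M5) / human-gated (H1~H3),
+carries over 1:1, except that, specifically for vuln, **the M3 durable disposition moves out of
+the autonomous layer into the H-track** (review AP-02), so the autonomous scope is narrower than
+the secret one.
+
+- **Autonomous layer (M1~M3)**: zero network, synthetic data and fixtures only,
+  default-off/synthetic-first. Parity matcher + inline cheap tier + synthetic regression gate +
+  **report-only parity gate wiring** (deterministic reproduction proven on synthetic fixtures). No
+  contact with real GHAS. Done on PR merge.
+- **H-track (H1~H4)**: acquire a real code-scanning snapshot (H1) → measure the baseline + report
+  divergence (H2) → fix the target + switch parity to enforce (H3) → **wire durable disposition
+  for vuln verdicts (H4, storage projection)**. Outside the autonomous loop, stop-condition PRs.
+
+**Autonomous goal done, redefined**: the synthetic regression gate is **enforced** and the
+report-only parity gate reproduces deterministically on synthetic fixtures (the wiring is proven
+without a real snapshot). Reaching the real parity SLO comes after the H-track completes. On PR
+merge, state "parity SLO enforce pending H-track" in CURRENT.md.
+
+## 2. Verifiable milestones (autonomous layer M1~M3 + H-track H1~H4)
+
+Each milestone has an **independently verifiable definition of done**. **M1~M3 proceed
+autonomously against frozen fixtures and synthetic data**. **H1~H4 involve real GHAS, the
+baseline, and durable projection, so they are isolated behind human-PR / stop-condition gates**.
+
+### M1. Measurement substrate: code-scanning domain model + matcher + adversarial fixtures (autonomous)
+
+Purpose: a pure-logic layer that represents GHAS code-scanning alerts in redacted form and matches
+them against our findings. No network, so it is fully covered by synthetic data and fixtures.
+
+Tasks:
+- `CodeScanAlertRecord` (new, in `storage/base.py` or `core/vulnerability/`): redacted fields only:
+  `repository`, `alert_number`, `rule_id`, `security_severity_level`, `cwe_ids` (rule.tags→CWE),
+  `state`, `dismissed_reason`, `location_path`, `location_start_line`, `location_end_line`,
+  `fetched_at`, `source_tool="ghas-code-scanning"`. (Parallel to the secret `GhasAlertRecord`.)
+- `CodeScanComparisonKey`: `(repository, file_path, line_window, normalized_rule_class)`. See §4.2.
+- Matcher `compare_codescan_alerts_with_findings(...)`: **three-tier tally** (matched-by-cwe /
+  matched-by-rule-token / unmatched). **Precision/recall reuse
+  `core/vulnerability/evaluation.py`** (review D; §D).
+- **Adversarial fixtures (review F)**: cases where a missing normalization, line-window, or filter
+  **goes red**: (a) no CWE, rule-token only; (b) source/sink line drift (outside the line-window →
+  validates the window definition); (c) same vulnerability but different rule.id between CodeQL
+  and Semgrep (unmatched if normalization is missing); (d) dismissed_reason cases.
+
+**done:**
+- Unit tests for `CodeScanAlertRecord` + matcher are green. CWE matching, rule-token fallback, and
+  line-window are verified by fixtures, and the dismissed_reason scoring path (VFR6) is pinned by
+  tests. Zero network.
+- **Invariant (review D, `parity-harness-third-engine` / `precision-recall-primitive`)**: **zero
+  lines** of new precision/recall or gate calculation code; reuse
+  `core/vulnerability/evaluation.py` (VulnerabilityEvaluationResult.precision/recall + threshold).
+  `CodeScanAlertRecord`→`VulnerabilityEvaluationKey` converges **only through an adapter**.
+- **Invariant (review F)**: on the adversarial fixtures above, a missing normalization, window, or
+  filter is red; the normal cases are green.
+- **Invariant (review H, denominator formula)**: recall denominator = open+fixed CodeQL alerts
+  only; precision penalty = a separately accumulated count of dismissed (fp / used-in-tests)
+  locations that we flagged. This formula is pinned in the matcher and tests (§4.2).
+- **Invariant (review G, VD-07)**: the line-window N is fixed in M1 from fixture values (closing
+  the open question). Expose the CWE-missing rate and the by-rule-token rescue rate as matcher
+  result metadata.
+
+### M2. Inline FP-suppression tier + synthetic corpus hardening (autonomous)
+
+Purpose: apply free inline gating to scan-vuln, and grow the synthetic corpus to a size at which
+the recall SLO is meaningful.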
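As a rough illustration of the default-on / gated boundary this plan draws for the M2 inline tier, the sketch below keeps the severity floor always on and puts the two behavior-changing suppressions behind opt-in parameters. `FindingMeta`, `is_blocking`, and both parameter names are hypothetical, and the MEDIUM/CRITICAL rungs of the severity ladder are assumed; the real `gate.py` interface is not shown in this document.

```python
from dataclasses import dataclass

# Assumed severity ladder; the document itself only names INFO, LOW, and HIGH.
SEVERITY_RANK = {"INFO": 0, "LOW": 1, "MEDIUM": 2, "HIGH": 3, "CRITICAL": 4}


@dataclass(frozen=True)
class FindingMeta:
    """Hypothetical metadata-only view of a finding (no code, no paths)."""
    rule_id: str
    severity: str
    code_flow_count: int


def is_blocking(
    finding: FindingMeta,
    *,
    low_confidence_rules: frozenset = frozenset(),  # gated: empty means off
    require_code_flow: bool = False,                # gated: off by default
) -> bool:
    """Return True when the finding should still block after inline gating."""
    # Default-on: INFO/LOW stay non-blocking, as the gate is said to do already.
    if SEVERITY_RANK[finding.severity] < SEVERITY_RANK["MEDIUM"]:
        return False
    # Gated: suppress rules an operator has explicitly marked low-confidence.
    if finding.rule_id in low_confidence_rules:
        return False
    # Gated: suppress findings that carry no trace (code flow) at all.
    if require_code_flow and finding.code_flow_count == 0:
        return False
    return True
```

With both gated parameters left at their defaults, a HIGH finding without a trace still blocks, which is the "existing default output unchanged" property the plan asks the synthetic regression to pin.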
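+
+Tasks:
+- Extend inline gating (`gate.py` or scan-vuln post-processing): factor in the `code_flow_count`
+  signal (a trace is evidence of reachability), and suppress low-confidence rule.ids and findings
+  under the INFO/LOW severity floor. **There is no validity-check analog** (V-Q3); this is purely
+  metadata-based.
+- **default-on / gated boundary (review K, `vuln-existing-scan-default-invariance`)**: only the
+  part that is deterministic, metadata-only, and guarded by suppression-rate regression tests (the
+  gate already treats INFO/LOW as non-blocking) is **default-on**; **new rule suppression that
+  changes behavior** (making some rule newly non-blocking, suppressing HIGH findings that have no
+  code_flow) is **gated/opt-in**.
+- Expand the synthetic corpus: in `eval/synthetic-code-vuln/`, extend the vulnerable/safe code
+  pairs and expected-findings to the core CWE classes. **Apply rule-class normalization (V-Q2) to
+  the synthetic expected data as well** (§E).
+
+**done:**
+- `evaluate` (precision≥0.90 / recall≥0.99 gate) demonstrates that inline gating suppresses
+  safe-code FPs on the synthetic corpus (precision up) while vulnerable-code recall holds.
+  Regression tests are green.
+- **Invariant (review VD-07)**: the synthetic corpus covers **five classes: SQLi, XSS,
+  path-traversal, command-injection, SSRF** (pinned as a done criterion; the '≥N' placeholder is
+  removed).
+- **Invariant (review K, default-on safety)**: default-on changes do not break the synthetic
+  regression gate (recall≥0.99): **canary TPs are preserved** (core TPs that are not out-of-rule
+  are never suppressed). Existing scan-vuln **default output is unchanged** (synthetic regression
+  pins that previously surfaced findings are not suppressed without authorization). Decide
+  explicitly whether a default-on change hits a stop-condition (scope-expansion,
+  existing-default-change).
+- **Invariant (review F)**: on adversarial vulnerable/safe pairs written **independently** of our
+  rules (out-of-rule CWEs that intentionally give recall<1), a missed regression goes red.
+- **Invariant (review nit `vuln-llm-input-leak`)**: even if more inline signals are fed to the
+  verifier, the redacted-metadata contract holds (traces as count/shape only; no plaintext
+  related_location paths).
+
+### M3. Synthetic regression gate + report-only parity gate wiring (autonomous, autonomous goal done)
+
+Purpose: prevent regressions. Enforce the synthetic regression gate, and wire the parity gate as
+**report-only against synthetic fixtures**, without a real snapshot (the end point of the
+autonomous goal).
+
+Tasks:
+- Synthetic regression gate (`evaluate --category code-vuln`, recall≥0.99 / precision≥0.90): it
+  already exists; verify and harden its CI wiring. **Matching semantics are the same rule-class +
+  line-window EvaluationKey as parity** (§E).
+- **New gate `governance/vuln_parity_slo.py`**: measures reproduction against a frozen
+  code-scanning snapshot. **If the threshold yml is absent or empty, the gate is report-only**
+  (measure and report on synthetic fixtures only); if it exists, the gate enforces (after the
+  H-track baseline). **If the snapshot age exceeds the threshold, the result is `stale-degraded`**
+  (no silent pass; scan-health precedent).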
+  In the autonomous layer, a **synthetic redacted fixture** proves that the gate path reproduces
+  deterministically.
+
+**done:**
+- CI **enforces** the synthetic regression gate, and `vuln_parity_slo --check` deterministically
+  reproduces the report-only measurement and report on synthetic fixtures (zero network). Snapshot
+  age is exposed. No silent staleness.
+- **This point is autonomous goal done**: PR merge. State "parity SLO enforce pending H-track" in
+  CURRENT.md.
+- Final architecture review passes.
+
+> ⚠️ **Autonomous goal done ends here.** Reaching the real parity SLO (the measure-first target)
+> comes after the H-track below completes.
+
+### H1. Acquire a real code-scanning snapshot (HUMAN-PR gate)
+
+> ⚠️ Requires a GHAS live fetch → `ghas-live-fetch-or-mutation-required` stop-condition. After
+> autonomous goal done (M3), the autopilot **stops here and requests a human PR**.
+
+Tasks:
+- Add a code-scanning fetch to `baseline/ghas_api`: `fetch_codescan_alert_records(target, api,
+  tool_name="CodeQL")` → `GET /repos/{repo}/code-scanning/alerts?tool_name=CodeQL&state=...`.
+  - **Reuse (review nit ARCH-VULN-03)**: the `GhApiRunner.get_json` GET-only guard, the pagination
+    helper, and the `_sanitize_error` redaction pattern.
+  - **New**: the `CodeScanAlertRecord` model, a code-scanning normalization function, and a **new
+    `compare-codescan` CLI** (review VD-05: `compare-ghas` is hard-wired to
+    `secret-scanning/alerts` and has no `--category` registered, so it cannot be reused; the new
+    path is the default).
+  - Extract CWE from rule.tags, preserve dismissed_reason, do not fetch dismissed_comment or raw
+    messages (public safety). Align the comparison universe with the `ref` parameter
+    (default-branch HEAD).
+- Freeze the fetch result as a **redacted frozen snapshot**. **Storage medium (review
+  ARCH-VULN-03, snapshot-redaction)**: real snapshots are **kept only in a gitignored private path
+  and never committed** (the same double block as the secret track: `.gitignore` plus exclusion
+  from allowed_writes); only synthetic fixtures are committed. Loading the snapshot into the
+  durable store would hit the storage-projection stop-condition, so it stays a frozen file.
+
+**done:** through a human PR, the CodeQL alerts of ≥1 GHAS code-scanning-enabled repo are acquired
+as a redacted frozen snapshot. The snapshot passes the public-safety check (**note that passing
+public_safety is a secondary check; relative-path leaks are blocked by the gitignore first line of
+defense; review snapshot-redaction**). From here on, H2 measures against this snapshot.
+
+### H2. Parity baseline measurement + SLO fixing + divergence report (HUMAN-PR gate)
+
+Purpose: measure-first (V-Q9). Measure the precision/recall gap of the current scan-vuln against
+the frozen snapshot and fix the real parity SLO target.
+
+Tasks:
+- Use the M1 matcher to compute per-repo precision/recall between the frozen snapshot and our
+  scan-vuln results (per tier: by-cwe / by-rule-token / unmatched). Aggregate micro→macro (aligned
+  with the secret track).
+- Report the baseline numbers → fix a realistic SLO: recall ≥ Y% of CodeQL; precision: the rate at
+  which dismissed-FP locations go undetected.
+- **One-time fixture-vs-real divergence report (review F)**: report when the synthetic recall SLO
+  and the real parity baseline diverge (a self-fulfilling signal). Expose **the CWE-missing rate,
+  the by-rule-token rescue rate, and the language mix of the measured population** (review
+  `codeql-only-oracle-language-bias`) as first-class baseline metadata.
+
+**done:** the baseline precision/recall report is generated (against the frozen snapshot,
+deterministically reproducible). The real parity SLO target is recorded in the document as fixed.
+Divergence and language-mix metadata are exposed. Zero network (reuses the H1 snapshot).
+
+### H3. Switch parity to enforce (HUMAN-PR gate)
+
+Tasks:
+- Commit the SLO thresholds fixed in H2 to the threshold yml of `governance/vuln_parity_slo.py` →
+  switch report-only → enforce. Block PRs that regress below the fixed SLO. State the snapshot
+  re-acquisition SLA (every N days / on ruleset change) in governance.
+
+**done:** the dual CI gate (synthetic regression + frozen parity enforce) is green and blocks
+regressions. Snapshot age is exposed. **This point is measure-first v1 done**: vuln/SAST has
+reached a measurable GHAS parity SLO.
+
+### H4. Wire durable disposition for vuln verdicts (HUMAN-PR gate, storage projection)
+
+> ⚠️ **Reclassified from the autonomous layer to the H-track by review blocker AP-02 /
+> ARCH-VULN-01.** Vuln findings are not loaded into the durable nosql store (zero references to
+> `VulnerabilityFinding` in store.py), so durable disposition requires a **new projection** of
+> vuln finding state into the durable store →
+> `storage-projection-or-schema-migration-required` stop-condition. **Cannot proceed
+> autonomously; human PR.**
+
+Tasks:
+- A new path that projects vuln finding state into the durable store (a new entity type or a
+  FINDING_STATE projection); see §4.3. Neither disposition option works until this exists.
+- The vuln verify path (`run_verify_vulnerability_artifact`) records the verdict in durable
+  disposition (actor='ollama' / source='verifier', STATE_EVENT audit). **Only terminal verdicts
+  (TP/FP) are recorded; NEEDS_REVIEW is not written durably** (review nit ARCH-VULN-05; same
+  behavior as the secret `disposition_status_for_verdict`).
+- **finding_id stability (review I, `finding-id-not-stable`)**: on the fallback path (when no
+  fingerprint is present, `compute_vulnerability_finding_id` hashes
+  `{rule_id,file_path,line_start,message}`), use tests to expose disposition loss under rule_id
+  normalization, line drift, and message changes → decide explicitly between defining a separate
+  stable key and requiring stable partialFingerprints in the Semgrep-compat output (§4.3).
+
+**done:** after vuln verify, the terminal verdict can be read from the durable store and a
+STATE_EVENT audit entry is recorded. Rescanning the same finding preserves its disposition
+(including stable-key verification). (The vuln parallel of resolving the secret
+`ollama-verify-periodic-todo`.)
+
+## 3. Milestone dependencies / sequencing
+
+```
+Autonomous layer (autopilot single goal):
+  M1 (model, matcher, adversarial fixtures) ──┐
+  M2 (inline + synthetic) ────────────────────┼──> M3 (synthetic gate enforce + report-only parity wiring) = autonomous goal done
+                                                            │
+H-track (stop-condition PRs, outside the autonomous loop):  │ PR merge
+  H1 (real snapshot, human) ──> H2 (baseline + divergence) ──> H3 (parity enforce) = measure-first v1 done
+  H4 (durable disposition, storage projection, human) ── (independent, H1 not required)
+```
+
+- **Can proceed autonomously (no human needed):** M1, M2, M3, all fully covered by frozen fixtures
+  and synthetic data. The autopilot works through them in parallel or in sequence, then merges the
+  PR at M3.
+- **Human-PR gates:** H1 (real fetch), H2/H3 (baseline and enforce, dependent on the H1 snapshot),
+  **H4 (durable disposition, storage projection; independent, H1 not required)**.
+- Recommended autopilot order: M1 → M2 → M3 (autonomous, PR merge) → [stop; note in CURRENT.md
+  that the H-track is pending]. H1~H4 are user-driven follow-ups.
+
+## 4. Interfaces / data contracts (settled design + open items)
+
+### 4.1 code-scanning alert fetch (VFR2/H1)
+
+```
+GET /repos/{owner}/{repo}/code-scanning/alerts?tool_name=CodeQL&state={open|dismissed|fixed}&ref={default}
+→ CodeScanAlertRecord{
+    rule_id, security_severity_level∈{low,medium,high,critical},
+    cwe_ids(from rule.tags external/cwe/cwe-NNN),
+    state∈{open,dismissed,fixed}, dismissed_reason∈{false positive,won't fix,used in tests,null},
+    location{path,start_line,end_line}   # most_recent_instance.location
+}
+```
+- Reuses the GhApiRunner GET-only contract. dismissed_comment and raw messages are not fetched
+  (public safety, V-Q4 NFR).
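+- **redaction (review minor `vuln-snapshot-path-redaction`)**: location.path is a private relative
+  path, which public_safety cannot catch (`identifier.private-path` matches absolute paths only) →
+  for real snapshots, keeping them in a gitignored private path and never committing them is the
+  **first line of defense**. If a commit is unavoidable, fingerprint the path (line numbers stay
+  in plaintext for the line-window). 'public_safety passes' in the M4/H1 done is a necessary
+  condition, not a sufficient one.
+- **fixed alert staleness (review H)**: the `most_recent_instance.location` of a fixed alert can
+  carry stale line numbers → treat this as a verification item during the H1 fetch.
+
+### 4.2 Matching key + denominator formula (VFR1/V-Q2): three tiers (incorporates review G, H, VD-03, E)
+
+```
+CodeScanComparisonKey = (repository, file_path, line_window, normalized_rule_class)
+  normalized_rule_class priority: CWE intersection (by-cwe) > rule-token normalization (by-rule-token) > unmatched
+  line_window: |alert_line − finding_line| ≤ N (a true window; start_line//N quantization is forbidden; review G)
+               N is fixed in M1 from fixtures (consistent with the requirements' ±N)
+```
+- **Many-to-many CWE handling (review G)**: when CWEs within the same window are many-to-many, use
+  **1:1 greedy maximum matching** (or CWE hierarchy agreement). Even when different
+  vulnerabilities (cwe-89 SQLi, cwe-79 XSS) coexist in the same (file, window), each is assigned
+  1:1 correctly.
+- **rule-token fallback predicate (review `rule-token-fallback`)**: after removing stop-tokens
+  (audit/lang/py/python, etc.), match **only on exact set equality of the core
+  vulnerability-class tokens** (no partial overlap, which prevents path-traversal↔open-redirect
+  mismatches). Extending the CWE bridge mapping table is an M1 task, to minimize reliance on
+  by-rule-token.
+- **Semantics shared with the synthetic gate (review VD-03 / E)**: the synthetic regression gate
+  (`VulnerabilityEvaluationKey`) and the parity matcher adopt **the same rule-class normalization
+  + line-window EvaluationKey semantics** (the recommended option, for consistency). The current
+  `VulnerabilityEvaluationKey` is an exact match on `(file_path, line_start, rule_id)` (naive), so
+  M2 applies rule-class normalization to the expected-findings schema (`ruleId`) **at load time**
+  and widens line_start into a line-window. That both gates share the same EvaluationKey semantics
+  is stated as the **VFR8 consistency condition**.
+- **Denominator formula (review H, `dismissed-reason-snapshot`)**: `recall denominator = open+fixed
+  CodeQL alerts` (the matching population) only. `precision penalty = the count of dismissed
+  (false positive / used in tests) locations that we flagged`, **accumulated separately**.
+  `won't fix` is tallied separately as TP, non-blocking (included in recall, excluded from the
+  precision penalty). The per-repo density bias of the dismissed population is a §7 risk.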
+- **Metadata exposure (review G, codeql-only-oracle-language-bias)**: expose the CWE-missing rate
+  of our findings, the by-rule-token rescue ratio, and the language mix of the parity-measured
+  population as first-class baseline (H2) metadata. Weak matches rescued by by-rule-token are
+  tallied separately as a confidence tier; a high ratio raises a baseline-confidence warning.
+- **coverage ≠ precision/recall (review D, precision-recall-primitive)**:
+  `GhasComparisonResult.ghas_coverage` (coverage in the secret sense) and
+  `VulnerabilityEvaluationResult.precision` (TP/(TP+FP)) are **different metrics**. State
+  explicitly that the vuln parity matcher produces the latter (truth-labeled precision/recall).
+
+### 4.3 Disposition channel (VFR5/H4): requires storage projection (review blocker AP-02 / ARCH-VULN-01)
+
+> **Code premise, confirmed**: `set_finding_disposition` (store.py:998-1000) raises
+> `ValueError("finding state does not exist")` when `read_finding_state(finding_id)` is None.
+> FINDING_STATE rows are created only when a secret `Finding` is appended to the durable store.
+> `VulnerabilityFinding` exists only in the throwaway `VulnerabilityJsonlStore`, with zero rows in
+> the nosql store. **Both options work only after vuln finding state is newly projected into the
+> durable store, which is the `storage-projection-or-schema-migration-required` stop-condition →
+> H4, human-gated.**
+
+- **Option 1:** reuse the secret `set_finding_disposition`. **Precondition**: vuln finding state
+  must be projected into the durable store (a new entity type or a FINDING_STATE projection) →
+  storage projection.
+- **Option 2:** a vuln-only `set_vuln_finding_disposition` + a vuln-only store partition. **This is
+  also a new STATE_EVENT projection** → storage projection. (Not a footnote: both options hit the
+  stop-condition.)
+- **§4.3 decision rule**: either option is a new durable projection, so it belongs to H4
+  (human-gated). The vuln verifier in the v1 autonomous layer **keeps the existing throwaway JSONL
+  behavior** (not durable). `If '§4.3 first attempt = option 1' were run autonomously as written,
+  the autopilot would repeat the same ValueError until same-blocker-three-times`, so the design
+  pins option 1 = storage projection and blocks the attempt itself in the autonomous layer.
+- **finding_id stability (review I)**: the durable disposition key finding_id is sensitive to
+  normalization, drift, and message on the fallback path (confirmed in model.py). H4 decides
+  explicitly between defining a stable key and requiring stable partialFingerprints. (Because
+  durable disposition is deferred to H4, this item is also an H4 premise.)
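The matching rules of §4.2 (true line-window, CWE intersection before rule-token fallback, greedy 1:1 assignment, state-aware populations) can be sketched roughly as follows. This is an illustration under stated assumptions, not the planned implementation: `Alert`, `Finding`, `pair_tier`, and `compare` are hypothetical stand-ins for `CodeScanAlertRecord`, `VulnerabilityFinding`, and `compare_codescan_alerts_with_findings`, and rule tokens are assumed to arrive already stripped of stop-tokens. The sketch only produces tiers and counts; the precision/recall ratios themselves are left to `core/vulnerability/evaluation.py`, as the design requires, and the `won't fix` tally is omitted.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical, trimmed stand-in for CodeScanAlertRecord."""
    path: str
    line: int
    cwe_ids: frozenset
    rule_tokens: frozenset  # assumed already stripped of stop-tokens
    state: str              # "open" | "fixed" | "dismissed"
    dismissed_reason: str | None = None


@dataclass(frozen=True)
class Finding:
    """Hypothetical, trimmed stand-in for VulnerabilityFinding."""
    path: str
    line: int
    cwe_ids: frozenset
    rule_tokens: frozenset


def pair_tier(alert: Alert, finding: Finding, window: int) -> str | None:
    """Classify one alert/finding pair, or return None if they cannot match."""
    # True window: absolute distance, not start_line // N bucketing.
    if alert.path != finding.path or abs(alert.line - finding.line) > window:
        return None
    if alert.cwe_ids & finding.cwe_ids:
        return "by-cwe"
    # Fallback needs exact token-set equality; partial overlap is not enough.
    if alert.rule_tokens and alert.rule_tokens == finding.rule_tokens:
        return "by-rule-token"
    return None


def compare(alerts: list[Alert], findings: list[Finding], window: int) -> dict:
    """Greedy 1:1 matching, stronger tier first, plus state-aware tallies."""
    tiers = ["unmatched"] * len(alerts)
    unused = list(findings)
    for tier in ("by-cwe", "by-rule-token"):
        for i, alert in enumerate(alerts):
            if tiers[i] != "unmatched":
                continue
            for finding in unused:
                if pair_tier(alert, finding, window) == tier:
                    tiers[i] = tier
                    unused.remove(finding)  # each finding is used at most once
                    break
    pairs = list(zip(alerts, tiers))
    # Recall population: open + fixed alerts only.
    recall_pop = [t for a, t in pairs if a.state in ("open", "fixed")]
    penalty_reasons = ("false positive", "used in tests")
    return {
        "tiers": tiers,
        "recall_population": len(recall_pop),
        "recall_hits": sum(t != "unmatched" for t in recall_pop),
        # Dismissed-as-FP locations that we still flagged, counted separately.
        "dismissed_fp_hits": sum(
            1 for a, t in pairs
            if a.state == "dismissed"
            and a.dismissed_reason in penalty_reasons
            and t != "unmatched"
        ),
    }
```

Running the by-cwe pass to completion before the by-rule-token pass keeps a weak token match from consuming a finding that a stronger CWE match could have used; a true maximum matching would need more than this greedy loop.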
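+
+### 4.4 CI parity gate (VFR8/M3): automatic report-only/enforce branching (review VD-04/AP-08)
+
+- `governance/vuln_parity_slo.py --check`: threshold yml **absent/empty → report-only** (measure and report only),
+  **present → enforce** (once the threshold is committed after H3). snapshot age > threshold → `stale-degraded`
+  (not `pass`; silent staleness is forbidden). Autonomous-tier M3 proves on synthetic fixtures that the
+  report-only path reproduces deterministically.
+
+## 5. Consistency with the secret track / deliberate divergence (reconfirmed)
+
+| Aspect | Secret | vuln (this track) | Rationale for divergence |
+| --- | --- | --- | --- |
+| oracle endpoint | secret-scanning/alerts | code-scanning/alerts (tool_name=CodeQL) | separate API, multi-tool (V-Q5) |
+| match key | secret_type 4-tuple, 1:1 | rule-class (CWE) + line-window, 3 grades | rule.id differs per tool (V-Q2) |
+| precision/recall engine | reuses core/evaluation/metrics.py | **reuses core/vulnerability/evaluation.py** (0 lines of new computation) | a vuln-specific engine already exists (review D) |
+| synthetic evaluate vs parity matcher | (n/a) | **share the same computation core** (blocks a 4th engine) | consistent semantics across synthetic and parity (review D/E) |
+| FP-oracle | resolution (coarse categories) | dismissed_reason (first-class SAST-FP signal) | dismissed_reason is richer (V-Q4) |
+| validity check | deferred (evidence-gated) | **no analog** (reachability = detector's responsibility) | structural difference of SAST (V-Q3) |
+| disposition wiring | **already durable** + periodic (autonomous M3) | **new storage projection → H-track (H4)** | vuln finding state is not stored (review AP-02) |
+| autonomous scope | M0~M5 (including disposition) | **M1~M3 (excluding disposition, narrower)** | durable requires a storage projection |
+| CI gate file | governance/parity_slo.py | **governance/vuln_parity_slo.py (separate)** | avoids file conflicts between the two tracks (review governance-wildcard) |
+| primary dataset | real GHAS snapshot | synthetic (primary) + real snapshot (calibration) | SAST synthetic ground truth is complete (V-Q7) |
+| periodic scan wiring | done | deferred behind a cost gate | Semgrep is heavy; #2 500+ repos |
+
+## 6. YAGNI / not adopted (uncertain future features excluded)
+
+- A post-hoc reachability/taint recomputation engine (detector's responsibility, V-Q3).
+- A live SAST validity analog (no counterpart exists, V-Q3).
+- Multi-tool oracle integration (fixed to CodeQL, V-Q5).
+- Automatic wiring of vuln into periodic scan-all (isolated behind a cost gate, V-Q6).
+- A quality SLO for import-sarif/codeql.yml themselves (conduit / self-scan, V-Q1).
+- push protection / inline PR blocking (out of scope for the parent track).
+- A new 3rd/4th precision/recall engine (reuse the existing evaluation.py, review D).
+
+## 7. Risks / mitigations
+
+- **Weak rule-class matching → parity mismeasurement:** the 3-grade tally (by-cwe/by-rule-token/unmatched)
+  prevents silent misclassification. A high unmatched rate or by-rule-token rescue rate triggers a
+  baseline-confidence warning (review G).
+- **CWE-missing asymmetry:** our Semgrep-compat findings often have empty cwe_ids → demoted to by-rule-token.
+  The CWE-missing rate is exposed as first-class baseline metadata (review G).
+- **Per-repo density bias of the dismissed FP-oracle (review H):** dismissal history differs per repo, so
+  FP-oracle pool sizes are uneven → the per-repo precision-penalty denominator is unstable. Noted in the baseline.
+- **CodeQL language coverage ≠ Semgrep-compat:** repos/languages without an oracle are excluded from the
+  per-repo SLO (C-monitor only). Baseline metadata and goal done state, as a scope qualifier, that the
+  measured population = CodeQL-supported languages (review language-bias).
+- **Synthetic self-reference (our rules only catch what we planted):** external validation through
+  adversarial fixtures (M1/M2 done) + real GHAS snapshot calibration (H2 divergence report) (review F).
+- **Separate VulnerabilityFinding model → disposition wiring friction:** durable disposition is split out to
+  H4 and the storage projection stop-condition is stated explicitly (review AP-02).
+- **finding_id sensitivity to normalization (review I):** the durable disposition key is sensitive to
+  rule_id/line/message in its fallback → H4 explicitly decides on a stable key or stable partialFingerprints.
+- **snapshot staleness:** passive exposure (scan-health precedent) + `stale-degraded` (no silent staleness, NFR).
+
+## 8. Definition of done
+
+### Autonomous goal done (M3, PR merge)
+1. Parity matcher (rule-class + line-window, 3 grades, CWE many-to-many handling) + adversarial fixtures
+   green. 0 lines of new precision/recall computation code (reuse evaluation.py).
+2. Cheap inline tier (default-on deterministic part + gated boundary for new suppressions), with the
+   existing scan-vuln default output unchanged.
+3. Synthetic regression gate (recall≥0.99/precision≥0.90) **enforced** + the `vuln_parity_slo` report-only
+   gate wired so that it reproduces deterministically on synthetic fixtures.
+4. Zero network on every measurement and suppression path; consistent with public-safety redaction.
+5. On PR merge, CURRENT.md states "parity SLO enforce not yet achieved, awaiting H-track".
+
+### measure-first v1 done (after H1~H4 complete)
+6. Acquire a real code-scanning (CodeQL) frozen snapshot (H1, human-PR).
+7. Measure the parity baseline + fix the real SLO + fixture-vs-real divergence and language-share metadata
+   (H2, measure-first).
+8. Switch parity to enforce: dual CI gate (synthetic regression + frozen parity enforce) green, regressions
+   blocked (H3).
+9. Wire durable disposition for vuln verdicts (H4, storage projection): dispositions are preserved on re-detection.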
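The two gates named in the autonomous done criteria can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of `governance/vuln_parity_slo.py`: the function names, the threshold keys `precision_min`/`recall_min`, the `fail` status string, and the choice to check staleness before anything else are all hypothetical. The precision and recall values are taken as inputs, since the design requires them to come from the reused `core/vulnerability/evaluation.py` layer.

```python
from typing import Optional

# Synthetic regression gate thresholds quoted in the design.
SYNTHETIC_PRECISION_MIN = 0.90
SYNTHETIC_RECALL_MIN = 0.99


def synthetic_gate_passes(precision: float, recall: float) -> bool:
    """Enforced gate: both synthetic thresholds must hold."""
    return precision >= SYNTHETIC_PRECISION_MIN and recall >= SYNTHETIC_RECALL_MIN


def parity_gate_status(
    thresholds: Optional[dict],
    precision: float,
    recall: float,
    snapshot_age_days: float,
    max_age_days: float,
) -> str:
    """Return 'stale-degraded', 'report-only', 'pass', or 'fail'."""
    # Assumption: staleness wins, so an old snapshot can never report 'pass'.
    if snapshot_age_days > max_age_days:
        return "stale-degraded"
    # Absent or empty thresholds mean measure and report only.
    if not thresholds:
        return "report-only"
    meets = (
        precision >= thresholds["precision_min"]
        and recall >= thresholds["recall_min"]
    )
    return "pass" if meets else "fail"
```

With no committed threshold file the parity gate can only ever report, which matches the autonomous M3 scope; enforcement starts once H3 commits thresholds.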
diff --git a/docs/workbench/specs/ghas-quality-vuln-subtrack/requirements.md b/docs/workbench/specs/ghas-quality-vuln-subtrack/requirements.md
new file mode 100644
index 0000000..6a7fbab
--- /dev/null
+++ b/docs/workbench/specs/ghas-quality-vuln-subtrack/requirements.md
@@ -0,0 +1,287 @@
+# GHAS-grade detection quality track: VULN/SAST subtrack Requirements
+
+> Phase 1 (grill-to-spec) **complete, awaiting approval**. SoT: this file (`requirements.md`).
+> Parent track: `.claude/specs/20260620-ghas-quality-track/` (secret subtrack, locked).
+> Written 2026-06-21. Authoring mode: self-driven grill (self Q&A preceded by research).
+
+## Approval target
+
+- Source of truth: `requirements.md`
+- Preview companion: `requirements.html` (generated, for review only; does not replace the source)
+
+## One-line goal
+
+Bring security-scanner's **vuln/SAST detection quality** (precision/recall) in line with a measurable GHAS
+code-scanning parity SLO. Reuse the **shared substrate** with the secret subtrack (parity measurement
+frame, snapshot = ground truth, disposition hook, measure-first, governance gate), while deliberately
+diverging on the **alert semantics specific to CodeQL/SAST** (rule-based, security_severity, multi-tool
+rule.id, dismissed_reason FP signal).
+
+## Decisions reused as-is from the secret subtrack (carry-over principle)
+
+| Carried-over decision | Source | Application to vuln |
+| --- | --- | --- |
+| GHAS parity SLO (alerts as oracle) | Secret Q3 | code-scanning alerts are the oracle (instead of secret-scanning alerts) |
+| snapshot = ground truth (one gated fetch → frozen → repeated in CI) | Secret Q4 | Same, except the code-scanning snapshot is acquired separately |
+| per-repo 1:1 (no pooling) | Secret Q5 | Same |
+| non-GHAS B-floor + C-monitor | Secret Q6 | Same (Semgrep-compat also applies to GHAS-off repos) |
+| measure-first SLO done | Secret Q10 | Same |
+| Real GHAS fetch goes through the human-PR gate | Secret FR2 | Reuses the same stop-condition |
+| Tiered automatic quality machine | Secret Q9 | Partially reused (see V-Q9; the inline-tier contents diverge) |
+
+## Deliberately diverged decisions (specific to vuln/SAST): summary
+
+| # | Decision | Difference from secret + rationale |
+| --- | --- | --- |
+| V-Q1 | **Scope = quality of scan-vuln (Semgrep-compat)** | import-sarif is a conduit and codeql.yml is a self-scan (out of scope). The oracle is GHAS **code-scanning** alerts |
+| V-Q2 | **Match key = rule.id needs normalization** | secret_type is an issuer-global standard, so 1:1 works. SAST rule.id **differs per tool** → bridge CodeQL↔Semgrep rules via CWE |
+| V-Q3 | **FP suppression = rule/severity gating + LLM verifier**, with **no** validity-check analog | Nothing corresponds to secret validity (live verification against the issuer API); in SAST, data-flow reachability takes that place. v1 is no-network |
+| V-Q4 | **dismissed_reason as a first-class FP truth signal** | code-scanning dismissed_reason ("false positive"/"used in tests"/"won't fix") maps to SAST FPs more directly than secret resolution |
+| V-Q5 | **Oracle tool fixed = CodeQL** (among multiple tools) | code-scanning is multi-tool. CodeQL parity is the standard meaning of "GHAS-grade". Semgrep-on-GHAS and the like would pollute the comparison universe |
+| V-Q6 | **Durable disposition wiring is new work** | Secret verify is already wired to set_finding_disposition. **vuln verify writes throwaway JSONL**; the missing durable store/ledger wiring is the key gap |
+| V-Q7 | **Dataset = two layers, synthetic + real GHAS** (more synthetic weight than secret) | SAST has mature synthetic vulnerable-code corpora (`eval/synthetic-code-vuln` already exists). Synthetic data is decisive for measuring recall |
+
+## Self Q&A flow (provenance)
+
+### V-Q1. Primary scope: whose quality, among scan-vuln, import-sarif, and codeql.yml?
+
+**Question:** The vuln subsystem has three surfaces: (a) `scan-vuln` (runs local Semgrep-compat SAST),
+(b) `import-sarif` (conduit that normalizes external SARIF), (c) `.github/workflows/codeql.yml` (scans
+this repo itself with CodeQL). Whose precision/recall are we raising?
+
+**Answer: the detection quality of scan-vuln (Semgrep-compat) is the primary target.** Rationale:
+- `import-sarif` is only a **conversion conduit**, not a detector (it accepts any SARIF and normalizes it).
+  It is not the subject of quality but the **input path** for quality measurement → kept as an FR, but not
+  a target for "raising quality".
+- `codeql.yml` is a **self-scan of our repo** (supply-chain hygiene). It is unrelated to the detection
+  quality of our *product* → out of scope. (Ironically, the SARIF produced by codeql.yml can supply **real
+  SAST samples** beyond synthetic data, so it is mentioned only as a dataset supplement, V-Q7.)
+- The detector that runs on user repos is `scan-vuln`. Our output that can be compared 1:1 with GHAS
+  code-scanning is also the scan-vuln result (VULN_FINDING) → **the subject of the quality SLO = scan-vuln**.
+
+Implication: the oracle is GHAS **code-scanning** alerts (not secret-scanning). A new fetch path is required (V-Q5, FR2).
+
+### V-Q2. The vuln version of GHAS parity: are alerts the oracle? How is a match defined?
+
+**Question:** The secret track matches GHAS alert↔finding 1:1 on the 4-tuple
+`(repository, file_path, line_start, secret_type)` (`GhasAlertComparisonKey`). secret_type is a GitHub
+standard classification (e.g. `github_pat`), so issuer-global 1:1 holds. That does not work for SAST: the
+CodeQL rule.id (`py/sql-injection`) and the Semgrep rule.id (`python.lang.security.audit.sql-injection`)
+denote **the same vulnerability with different strings**. Naive file+line+rule matching would misclassify
+the same SQLi as local-only on one side and ghas-only on the other → the parity measurement becomes meaningless.
+
+**Answer: use code-scanning alerts as the oracle, with a rule.id normalization layer in the match key.** Decisions:
+- **Comparison key = `(repository, file_path, line_window, normalized_rule_class)`.**
+  - `normalized_rule_class`: bridge rule.id → **CWE** (where possible). CodeQL alerts carry the CWE in
+    `rule.tags` in the form `external/cwe/cwe-89`, and our SARIF importer already extracts `cwe_ids`
+    (`sarif.py:_extract_cwe_ids`). So **CWE-intersection matching is the most robust common axis**.
+  - Fallback when the CWE is absent: normalize rule.id string tokens (lowercase, unify separators, strip
+    tool prefixes), then partial match. This is a weak signal, so "matched-by-cwe / matched-by-rule-token /
+    unmatched" are **tallied separately** (quality grades, design phase).
+  - `line_window`: not an exact 1:1 line but a **±N line window** (N fixed in the design phase). Rationale:
+    for the same vulnerability CodeQL may report the sink line and Semgrep the source line, so exact line
+    equality is over-strict.
+- **Implication (secret Q3 carry-over kept):** a finding that only we detect and GHAS (CodeQL) misses is an
+  FP by definition (the goal is "GHAS-grade"; "higher recall than GHAS" is a non-goal). Weak rule-class
+  matching can push cases into "unmeasurable", so the 3-grade tally of V-Q2 prevents silent FP misclassification.
+
+### V-Q3. FP suppression mechanisms: is there a SAST analog of the secret validity check?
+
+**Question:** The secret track's core FP suppression candidates are (a) LLM verifier→disposition,
+(b) path/placeholder/context-class heuristics, (c) partner-pattern boost, (d) live validity check
+(deferred). What corresponds in SAST? A validity check (asking the issuer whether a token is valid) has no
+direct counterpart in SAST: "is this SQLi actually reachable" is not a network query but a **data-flow
+reachability** problem.
+
+**Answer: SAST FP suppression = (1) rule/severity/precision gating + (2) LLM vuln verifier→disposition +
+(3) use of reachability/trace signals. The validity-check analog is explicitly left as "none".** Decisions:
+- **Cheap inline tier (free, every scan):**
+  - `precision`/`security_severity`/`severity` gating: `gate.py` already has severity_min/precision_min
+    ranking. Apply the SARIF `precision` metadata (VERY_HIGH..LOW) and `code_flow_count` (presence of a
+    data-flow trace) inline as FP suppression signals. **A finding with a trace has reachability evidence
+    and is more trustworthy.**
+  - rule suppression (allowlist/severity floor): low-confidence rule.ids and INFO/LOW are non-blocking by
+    default (gate already does this).
+- **Asynchronous LLM tier:** the vuln verifier (`llm/vulnerability/verifier.py`) gives a verdict on
+  ambiguous findings → **reflected as a durable disposition** (V-Q6; this is the new work). Unlike secret,
+  the vuln verifier currently writes only to throwaway JSONL.
+- **validity-check analog:** explicitly "none". SAST reachability (taint/data-flow) is the responsibility
+  of the **detector internals** (Semgrep dataflow / CodeQL taint), not of a post-hoc verifier. v1 is
+  no-network (secret Q7 carry-over). Recomputing reachability post hoc ourselves is YAGNI/scope creep → not adopted.
+
+### V-Q4. How do we use GHAS alert state/dismissal as truth?
+
+**Question:** A code-scanning alert has state ∈ {open, dismissed, fixed} and dismissed_reason ∈
+{"false positive", "won't fix", "used in tests"} (confirmed by research). This is the vuln version of the
+secret track's undecided item "GHAS alert state handling (open/resolved/dismissed)". What counts as TP
+truth and what as FP truth?
+
+**Answer: open/fixed = TP truth; dismissed ("false positive"/"used in tests") = explicit FP truth.**
+Decisions (using a stronger signal than secret):
+- **TP oracle:** state ∈ {open, fixed}. What GHAS acknowledged and tracks as an actual vulnerability.
+- **FP oracle (deliberate divergence from secret):** dismissed_reason ∈ {"false positive", "used in tests"}
+  is a **ground-truth FP labeled on GitHub**. If we raise the same location, we raised an FP too →
+  **scored directly** as a precision penalty. dismissed_reason "won't fix" is a TP but non-blocking (risk
+  accepted), so it is included in recall scoring and excluded from the precision penalty (tallied
+  separately as an ambiguous class).
+  - Rationale: secret-scanning resolution categories are coarse (revoked/false_positive/...), whereas
+    code-scanning dismissed_reason has **first-class SAST-FP semantics**, a richer oracle signal. Not
+    using it would discard the most valuable label GHAS has.
+- **Implication:** the snapshot must preserve alert state + dismissed_reason **in redacted form** (FR2
+  extension). dismissed_comment (free text, risk of leaking paths/code) is **not acquired** (public safety).
+
+### V-Q5. Do we fix the oracle tool in multi-tool code-scanning?
+
+**Question:** code-scanning accepts as alerts not only CodeQL but any uploaded SARIF tool (Semgrep,
+external SAST). There is a `tool_name` filter. Which tools' alerts go into the parity oracle universe? Our
+scan-vuln is also Semgrep-compat, so if GHAS already has Semgrep alerts, "us vs GHAS-Semgrep" is almost a
+tautology, whereas "us vs CodeQL" is a meaningful parity.
+
+**Answer: oracle = fixed to CodeQL alerts (`tool_name=CodeQL` filter).** Rationale:
+- The market-standard meaning of "GHAS-grade" is **CodeQL** (GitHub's first-party taint engine). The parity
+  target has to be set against CodeQL for "do we catch as much as GHAS" to mean anything.
+- Putting every tool into the oracle makes the comparison universe depend on each repo's incidental GHAS
+  configuration, so per-repo truth becomes unstable (violates the spirit of secret Q5 per-repo 1:1).
+- **Implication (divergence from secret):** secret-scanning is a single engine, so there was no
+  tool-fixing issue. vuln must fix the oracle tool explicitly for the measurement to be reproducible. This
+  is pinned in the FR/NFR.
+- Supplementary: CodeQL has language constraints (Python, JS, etc.). When our Semgrep-compat scans a
+  language CodeQL does not support, there is no oracle → that repo/language is **outside the per-repo SLO**
+  (C-monitor only, secret Q6 carry-over).
+
+### V-Q6. When the FP suppression quality machine runs + durable disposition wiring
+
+**Question:** Secret Q9 is "tiered automatic", and secret verify is already wired to
+`set_finding_disposition` (durable store + STATE_EVENT ledger), so periodic scans benefit too
+([[ollama-verify-periodic-todo]] resolved). And vuln? Code inspection shows: **vuln verify
+(`run_verify_vulnerability_artifact`) writes verdicts only to throwaway JSONL and not to the durable
+store/ledger.** `triage_state` is just a field inside the JSONL. In addition, vuln itself is not wired
+into `scan_all` (periodic scan; secret only).
+
+**Answer: tiered automatic (secret Q9 carry-over) + durable vuln disposition wiring as the new core work.** Decisions:
+- The cheap inline tier (severity/precision/trace gating, rule suppression) applies immediately to every
+  scan-vuln (free).
+- The LLM vuln verifier is automatic but batched and limited to ambiguous cases. Results are **reflected
+  as durable dispositions**, which needs new wiring:
+  - Record the verdict in the durable disposition store keyed by the vuln finding's `finding_id` (already
+    a deterministic id that prefers partialFingerprints), through the same channel as secret
+    `set_finding_disposition` or a parallel vuln-only channel (decided in the design phase, while keeping
+    STATE_EVENT auditing and actor='ollama'/source='verifier' consistent).
+  - **Caution (separate model):** a vuln finding is not `core.finding.Finding` but `VulnerabilityFinding`
+    (a separate SARIF-native model). The secret disposition hook (`Verdict`/`Disposition` mapping) may not
+    be usable as-is → the design picks one of (a) unifying the vuln triage_state↔Verdict vocabulary or
+    (b) a parallel vuln-only disposition track. The vocabulary already matches on both sides
+    (TRUE_POSITIVE/FALSE_POSITIVE/NEEDS_REVIEW) → unification is the likely choice.
+- Periodic path benefit: wiring vuln scan+verify into `scan_all` may conflict with the **#2 cost
+  constraint** (500+ repos × Semgrep is heavy) → v1 goes up to **on-demand scan-vuln + verify with automatic
+  disposition**; periodic scan-all wiring is a separate gate after cost measurement (isolated in a design milestone).
+
+### V-Q7. Measurement dataset: synthetic vs real GHAS code-scanning repos?
+
+**Question:** For secret, real GHAS-enabled repo snapshots were the primary oracle and synthetic data was
+auxiliary. What about vuln? Which is primary: the synthetic vulnerable-code corpus
+(`eval/synthetic-code-vuln` already exists, with an expected-findings schema) or real code-scanning repo snapshots?
+
+**Answer: two layers: (1) the synthetic corpus is the mainstay of the recall/regression gate, (2) real GHAS
+code-scanning snapshots provide parity calibration.** Rationale (more synthetic weight than secret):
+- **Why synthetic is stronger (SAST-specific):** vulnerable/safe code pairs can be planted deliberately,
+  so the **ground truth is perfect** (we know which line is the real SQLi). Secret cannot put real
+  credentials into synthetic data (push protection), but SAST vulnerable patterns can be synthesized
+  safely → the decisive tool for measuring recall. `evaluate` (precision_min=0.90/recall_min=0.99 gate)
+  already runs against synthetic data.
+- **Role of the real GHAS snapshot:** synthetic data answers "do our rules catch what we planted"
+  (self-reference risk). **Parity with real CodeQL alerts** is the external validation of "do we catch as
+  much as GHAS on real code" → calibration/validation (same role as GHAS repos in secret Q5). Real fetch
+  still goes through the human-PR gate.
+- **Dataset consistency:** the synthetic corpus expected-findings schema is `(filePath, lineStart, ruleId)`;
+  the V-Q2 rule-class normalization must also be applied to synthetic data so that real and synthetic
+  scoring stay consistent (design phase).
+
+### V-Q8. Relation to existing assets: build on main
+
+**Question (secret Q8 carry-over):** How far along are the vuln assets?
+
+**Answer: build on top of main.** Existing assets (confirmed in code):
+- `core/vulnerability/`: `model.py` (VulnerabilityFinding, with triage_state/verifier_verdict hooks),
+  `sarif.py` (SARIF importer; extracts CWE/OWASP/precision/security_severity/code_flow), `evaluation.py`
+  (precision/recall + gate), `gate.py` (severity/precision gating), `redaction.py` (public safety, merged
+  in PR #48 [[vuln-redaction-design]]).
+- `scanners/semgrep_compatible/` (runner), `runtime/vulnerability_scan.py` (scan-vuln/import-sarif),
+  `runtime/vulnerability_verify_artifact.py` (verify, **but throwaway**), `llm/vulnerability/`
+  (verifier+prompt, redacted-metadata-only).
+- CLI: `verify --category code-vuln` (wired), `report/gate/evaluate --category code-vuln` (wired),
+  `import-sarif`/`scan-vuln` (wired). compare-ghas is **secret-only** (no vuln support).
+- Corpus: `eval/synthetic-code-vuln/` (schema present, 1 sample).
+- **Not yet available (new work):** code-scanning alert fetch, vuln parity comparison (rule-class
+  matching), durable vuln disposition wiring, vuln snapshot harness.
+
+### V-Q9. SLO done-definition
+
+**Question (secret Q10 carry-over):** Does measure-first apply the same way?
+
+**Answer: measure-first.** Measure the baseline (the precision/recall gap of current scan-vuln against
+CodeQL) → fix a realistic target → close the gap. vuln has a **dual SLO**, though:
+- **Synthetic SLO (already exists, strengthened):** recall ≥ 0.99 (catch nearly every planted
+  vulnerability), precision ≥ 0.90. Regression gate.
+- **Real GHAS parity SLO (new, measure-first):** a per-repo precision/recall agreement target against
+  CodeQL alerts (fixed after the baseline). Recall is "Y% of CodeQL"; precision is "not raising dismissed-FP locations".
+- v1 done = (a) code-scanning snapshot acquisition path + (b) parity baseline measurement + (c) target
+  setting + (d) close the gap with the inline+LLM tiers and keep the synthetic regression gate green.
+
+## Functional requirements (vuln/SAST subtrack)
+
+- **VFR1 code-scanning parity measurement harness.** For each GHAS-enabled repo, compare the
+  **code-scanning** alert snapshot (oracle=CodeQL) with our scan-vuln results (VULN_FINDING) 1:1 using
+  V-Q2 rule-class matching, compute per-repo precision/recall, then aggregate. Match grades (by-cwe /
+  by-rule-token / unmatched) are tallied separately.
+- **VFR2 code-scanning snapshot acquisition.** Add code-scanning alert fetch to `baseline/ghas_api`
+  (`/repos/.../code-scanning/alerts`, GET-only, `tool_name=CodeQL`). Preserve only redacted alert fields:
+  number, rule.id, rule.security_severity_level, rule.tags (→ CWE only), state, dismissed_reason,
+  most_recent_instance.location (path/start_line/end_line). dismissed_comment and raw message are not
+  acquired. Real fetch complies with the `ghas-live-fetch-or-mutation-required` human-PR gate.
+- **VFR3 baseline measurement (measure-first).** Measure the precision/recall gap of current scan-vuln
+  against the frozen CodeQL snapshot → fix the real parity SLO target.
+- **VFR4 tiered quality machine.**
+  - Cheap inline tier: severity/precision/`code_flow_count` (trace = reachability evidence) gating +
+    low-confidence rule suppression → immediate FP suppression on every scan-vuln. (No validity-check analog, V-Q3.)
+  - Asynchronous LLM tier: the vuln verifier gives verdicts on ambiguous findings → reflected as durable
+    dispositions (VFR5).
+- **VFR5 durable vuln disposition wiring (new core gap).** Route vuln verifier verdicts to the durable
+  disposition store + audit ledger (STATE_EVENT, recording actor/source) instead of throwaway JSONL. Keyed
+  by the deterministic `finding_id`, dispositions persist on re-detection. Either unify the vuln
+  triage_state↔Verdict vocabulary or use a parallel vuln-only channel (design decision).
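+- **VFR6 dismissed_reason FP scoring.** Use the snapshot's dismissed_reason ("false positive"/"used in
+  tests") directly as the FP-oracle in scoring (raising that location costs a precision penalty).
+  "won't fix" is tallied separately as a TP-non-blocking class.
+- **VFR7 non-GHAS carry-over + drift monitor.** Apply the distilled quality machine (gating + verifier
+  disposition) to scan-vuln on all repos. Non-GHAS / CodeQL-unsupported-language repos get a vuln verifier
+  sample drift monitor (not an SLO).
+- **VFR8 parity SLO CI gate.** Turn reproducible measurement against the frozen code-scanning snapshot into
+  a CI gate (no human-PR fetch needed at measurement time). Block when the target fixed after the baseline
+  regresses. **The synthetic regression gate (`evaluate`, recall≥0.99/precision≥0.90) is kept
+  separately**; both must be green to pass.
+
+## Non-functional requirements
+
+| Item | Required value |
+| --- | --- |
+| Offline-box compatibility | No network/secret egress on measurement and suppression paths (snapshot fetch is a gated one-time exception). With no validity check, the egress surface is smaller than in the secret track |
+| Reproducibility | Deterministic CI measurement with the frozen code-scanning snapshot + synthetic corpus |
+| Cost | LLM tier limited to batched, ambiguous cases. Semgrep-compat scan-vuln itself is heavy, so periodic scan-all wiring is a separate gate after cost measurement (#2 500+ repo constraint) |
+| Staleness visibility | Expose snapshot age/timestamp in the output (scan-health precedent); no silent staleness |
+| Public safety | snapshot and findings are redacted (consistent with [[vuln-redaction-design]]). code-scanning raw message/dismissed_comment risk leaking paths and code → not acquired, or passed through `sanitize_vulnerability_text` |
+| governance | Real GHAS code-scanning fetch stays behind the human-PR gate. Fetch is GET-only (reuses the GhApiRunner contract) |
+| Oracle reproducibility | Fix the parity oracle tool to CodeQL (`tool_name`) to prevent multi-tool universe pollution (V-Q5) |
+
+## User scenarios
+
+- **VS1 baseline.** An operator measures the baseline on a repo with GHAS code-scanning enabled → sees
+  where current scan-vuln precision/recall stands against CodeQL → sets the parity target measure-first.
+- **VS2 regression gate.** After a Semgrep rule/normalization change, CI re-measures both (a) synthetic
+  corpus recall/precision and (b) frozen code-scanning snapshot parity → a regression in either blocks the PR.
+- **VS3 FP suppression carry-over.** When periodic/on-demand scan-vuln runs on non-GHAS repos, inline
+  gating + the LLM verifier suppress FPs automatically, verdicts flow into durable dispositions so
+  re-detections do not re-ask, and the sample drift monitor reports carry-over health.
+- **VS4 dismissed consistency.** If we raise a finding at the same location as an alert that GHAS
+  dismissed as "used in tests", parity scoring catches it as a precision penalty and surfaces it as a
+  rule-suppression candidate.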
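The scoring rules in VFR6 and VS4 can be sketched as a small tally, assuming simplified record types. Everything named here (`Alert`, `Finding`, `tally_parity`, the `STOP_TOKENS` list, the first-fit pairing) is hypothetical and is not the project's matcher. It follows the stricter denominator used in the later design and README, where only `open`/`fixed` alerts count toward recall. It returns counts only, because precision and recall are meant to come from the reused `core/vulnerability/evaluation.py` layer.

```python
import re
from dataclasses import dataclass
from typing import Optional

POSITIVE_STATES = {"open", "fixed"}              # recall denominator
FP_REASONS = {"false positive", "used in tests"}  # FP-oracle
# Hypothetical tool-prefix and filler tokens dropped before comparing rule ids.
STOP_TOKENS = {"py", "python", "js", "javascript", "lang", "security", "audit"}


@dataclass(frozen=True)
class Alert:
    rule_id: str
    cwe_ids: frozenset
    state: str
    file_path: str
    line: int
    dismissed_reason: Optional[str] = None


@dataclass(frozen=True)
class Finding:
    rule_id: str
    cwe_ids: frozenset
    file_path: str
    line: int


def rule_tokens(rule_id):
    return {t for t in re.split(r"[^a-z0-9]+", rule_id.lower()) if t} - STOP_TOKENS


def match_grade(alert, finding, window):
    # True line window: |alert_line - finding_line| <= N, same file.
    if alert.file_path != finding.file_path or abs(alert.line - finding.line) > window:
        return "unmatched"
    if alert.cwe_ids & finding.cwe_ids:
        return "by-cwe"
    tokens = rule_tokens(alert.rule_id)
    if tokens and tokens == rule_tokens(finding.rule_id):
        return "by-rule-token"  # exact token-set equality, no partial overlap
    return "unmatched"


def tally_parity(alerts, findings, window=2):
    counts = {"tp": 0, "fp": 0, "fn": 0, "dismissed_fp_hit": 0,
              "wont_fix_hit": 0, "by-cwe": 0, "by-rule-token": 0}
    used = set()
    for finding in findings:
        hit = None
        for i, alert in enumerate(alerts):  # simplification: first fit, 1:1
            if i not in used and match_grade(alert, finding, window) != "unmatched":
                hit = i
                break
        if hit is None:
            counts["fp"] += 1  # nothing in the oracle at this location
            continue
        used.add(hit)
        alert = alerts[hit]
        counts[match_grade(alert, finding, window)] += 1
        if alert.state in POSITIVE_STATES:
            counts["tp"] += 1
        elif alert.dismissed_reason in FP_REASONS:
            counts["dismissed_fp_hit"] += 1  # precision penalty
            counts["fp"] += 1
        else:
            counts["wont_fix_hit"] += 1      # TP-non-blocking, tallied apart
    counts["fn"] = sum(1 for i, a in enumerate(alerts)
                       if i not in used and a.state in POSITIVE_STATES)
    return counts


ALERTS = [
    Alert("py/sql-injection", frozenset({"CWE-89"}), "open", "synthetic_app/handlers.py", 10),
    Alert("py/xss", frozenset({"CWE-79"}), "open", "synthetic_app/render.py", 24),
    Alert("js/sql-injection", frozenset(), "fixed", "synthetic_app/db.js", 5),
    Alert("py/xss", frozenset({"CWE-79"}), "dismissed", "synthetic_app/legacy.py", 55, "false positive"),
    Alert("py/command-injection", frozenset({"CWE-78"}), "dismissed", "synthetic_app/ops.py", 70, "won't fix"),
]
FINDINGS = [
    Finding("python.lang.security.audit.sql-injection", frozenset({"CWE-89"}), "synthetic_app/handlers.py", 10),
    Finding("py/xss", frozenset({"CWE-79"}), "synthetic_app/render.py", 26),
    Finding("javascript.lang.security.audit.sql-injection", frozenset(), "synthetic_app/db.js", 5),
    Finding("py/xss", frozenset({"CWE-79"}), "synthetic_app/legacy.py", 55),
]
counts = tally_parity(ALERTS, FINDINGS)
```

The sample data is fully synthetic. Narrowing the window from 2 to 1 turns the line-drift pair into one false positive plus one false negative, which is the kind of red signal the adversarial fixtures are meant to produce.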
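+
+## Out of scope / deferred
+
+- **Quality of import-sarif itself**: a conversion conduit, not the subject of detection quality (V-Q1).
+  Kept as an FR but not a quality SLO target.
+- **Quality of the codeql.yml self-scan**: supply-chain hygiene of our own repo, unrelated to product
+  detection quality (V-Q1). Its SARIF is mentioned only as a dataset supplement (V-Q7).
+- **Post-hoc reachability/taint recomputation**: the detector's (Semgrep/CodeQL) responsibility.
+  Recomputing it post hoc ourselves is YAGNI (V-Q3).
+- **validity-check analog**: no counterpart in SAST; explicitly not adopted (V-Q3).
+- **Wiring vuln into periodic scan-all**: isolated behind a cost gate. v1 goes up to on-demand
+  scan-vuln+verify (V-Q6).
+- **Multi-tool oracle**: fixed to CodeQL; other tools uploaded to GHAS are excluded from the comparison universe (V-Q5).
+- **push protection / PR blocking**: consistent with the parent track's non-goals.
+
+## Undecided items (Phase 2 design open questions)
+
+- rule-class normalization precision: scope of the CWE bridge mapping table (starting from which CWEs),
+  token normalization rules for the CWE-absent fallback.
+- Line-matching window N (exact line vs ±N) + handling of source/sink line mismatches.
+- Comparison universe: HEAD-only vs full-history (code-scanning centers on the default-branch HEAD ref by
+  default, which may differ from secret full-history; the `ref` parameter needs alignment).
+- vuln disposition channel: reuse secret `set_finding_disposition` (unified vocabulary) vs a parallel
+  vuln-only store. The fact that VulnerabilityFinding is a separate model is the variable.
+- Aggregation: per-repo micro vs macro average (aligned with secret).
+- CodeQL language coverage vs our Semgrep-compat languages: criteria for excluding oracle-less
+  repos/languages from the SLO.
+- snapshot refresh trigger/cadence (passive staleness exposure is settled; the refresh policy is a design matter).
+- Size of the synthetic corpus expansion (currently 1 sample): the minimum number of vulnerability classes
+  that makes the recall SLO meaningful.
diff --git a/docs/workbench/specs/ghas-quality-vuln-subtrack/review.md b/docs/workbench/specs/ghas-quality-vuln-subtrack/review.md
new file mode 100644
index 0000000..2189cce
--- /dev/null
+++ b/docs/workbench/specs/ghas-quality-vuln-subtrack/review.md
@@ -0,0 +1,116 @@
+# GHAS-grade VULN/SAST quality design.md: multi-agent review + record of applied changes
+
+> Target: `design.md` (v1) → `design.md` (v2) after changes. Review: 5-dimension parallel (opus) →
+> adversarial verification (sonnet) → synthesis.
+> Workflow `wy7vx73el`, agent 43, subagent. **The synthesize session stayed alive, so the synthesis was produced normally.**
+> **31** confirmed findings (per-dimension review → only those that passed adversarial verification).
+> overall (synthesis): **needs-rework → applied in v2.**
+
+## Quoted overall verdict (synthesis.overall = needs-rework)
+
+> "The vuln design.md must be revised to v2 before it is handed to an autopilot single-goal run
+> (needs-rework). There are two core issues. (1) The 'Autopilot Execution Shape' section found in the
+> secret design is missing entirely, so attempting goal-setup as-is gets blocked by autopilot_gate at the
+> first commit (the SoT sits in the gitignored .claude/specs path, goal_id/active_goal mismatch, risk of
+> broad governance/** self-modification; all demonstrated from code/governance files). (2) The M3 durable
+> disposition wiring hits the storage-projection stop-condition head-on, despite its 'autonomous' label,
+> because vuln findings are never loaded into the durable store (0 references to VulnerabilityFinding in
+> store.py; set_finding_disposition raises ValueError when FINDING_STATE is absent). … The common solution
+> to every blocker is to carry over 1:1 to vuln the proven structure of the secret track (PR #58): the
+> two-layer split of autonomous tier M0~M5 (synthetic-only, merged without a slot) / human-gated H1~H3
+> (real GHAS), the metrics-engine reuse invariant, adversarial fixtures, and the parity_slo
+> report-only→enforce automatic branching."
+
+The key point of the synthesis: **nearly all vuln blockers/majors are vuln parallels of items the secret
+track (PR #58) already caught and resolved in review.** So the skeleton of v2 is a 1:1 carry-over of the
+secret design v2 structure. There is exactly one vuln-specific aggravating factor: **for M3 durable
+disposition, unlike secret, vuln finding state does not exist in the durable store at all** (secret is
+already wired), so the autonomous scope becomes **narrower** than secret (durable disposition is removed
+from the autonomous tier and deferred to the H-track).
+
+## Severity tally
+
+| Severity | Count | Notes |
+| --- | --- | --- |
+| blocker | 6 | autopilot-fit 3 · codebase-arch 1 · security-publicsafety 2 |
+| major | 13 | requirements-fidelity 2 · autopilot-fit 3 · measurement 5 · security 3 |
+| minor | 8 | spec reinforcement |
+| nit | 4 | notation/readability |
+
+Severities adjusted by adversarial verification are also recorded:
+`precision-recall-primitive-does-not-exist` was claimed as a blocker → **downgraded to major** (vuln
+already has `core/vulnerability/evaluation.py` implementing precision/recall, so this is not a
+from-scratch third engine). `ARCH-VULN-03`/`ARCH-VULN-05`/`vuln-snapshot-path-redaction`/
+`vuln-llm-input-leak` were downgraded to minor/nit because the code is already safe or the design
+partially acknowledges them. The review is well grounded in code evidence.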
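+
+## blocker (6): applied in v2
+
+| id | Dimension | Problem | v2 resolution |
+| --- | --- | --- | --- |
+| `AP-03`/`VD-01` | autopilot | The whole 'Autopilot Execution Shape' section is missing (0 occurrences of goal_id/execution_mode/allowed_writes/acceptance_checks/stop_conditions/SoT promotion) → goal-setup is blocked immediately | New §A (1:1 carry-over of the secret §Autopilot Execution Shape): goal_id=`ghas-quality-vuln-parity`, long-single-goal/stop-conditions-only/PR, allowed_writes whitelist, acceptance_checks (phase-2a base+diff), canonical stop_conditions + the ones valid for vuln, goal-setup updates 3 files together |
+| `vuln-sot-path-gitignored-gate-blind`/`VD-01` | security | The SoT lives in `.claude/specs` (`.gitignore:72`), so autopilot_gate blocks it as outside-allowed_writes and public_safety cannot scan it | §A: promote the SoT into git at `docs/workbench/specs/ghas-quality-vuln-subtrack/` (committed copy only); only the grill originals remain in .claude. Add that docs path to allowed_writes and `public_safety --path` to acceptance_checks. Stated as an M0 deliverable |
+| `vuln-governance-wildcard-self-modify` | security | Following the broad allowed_writes `governance/**` (current autopilot_goal.yml:27) would let autopilot autonomously modify stop_conditions, autopilot_gate.py, and public_safety.py | §A: broad `governance/**` prohibited; the vuln-only gate `governance/vuln_parity_slo.py` is the **single** whitelisted file. Fixed decision: no autonomous modification of the 3 files. Separation from the secret `parity_slo.py` (separate file) stated in the §5 consistency table |
+| `AP-01` | autopilot | M4 live-fetch sits in the middle of the autonomous M-sequence, making M5/M6/M7 done depend on a human snapshot → abandons the secret two-layer split | §B: two-layer split into autonomous tier (M1~M3) / H-track (H1~H4). M4 live-fetch moves to the H-track. §0/§8 rewritten so that autonomous goal done = synthetic regression gate enforce + report-only parity wiring (proven on synthetic fixtures). Convention: on PR merge, CURRENT.md says "parity SLO enforce not yet achieved, awaiting H-track" |
+| `AP-02`/`ARCH-VULN-01` | autopilot/arch | M3 disposition 'Option 1 = reuse set_finding_disposition' violates the code precondition (store.py:998-1000 raises ValueError when FINDING_STATE is absent; 0 VulnerabilityFinding records loaded into the nosql store) → hits the storage-projection stop-condition directly | §C: M3 durable disposition **removed from the autonomous tier** and reclassified to the H-track. The v1 autonomous vuln verifier keeps the existing throwaway JSONL behavior (not durable). §4.3 pins down that both Option 1 and Option 2 hit storage-projection (prevents same-blocker repetition) |
+
+## major (13): applied in v2
+
+| id | Dimension | Problem | v2 resolution |
+| --- | --- | --- | --- |
+| `VD-02`/`precision-recall-primitive` | requirements/measure | No lock-in against the parity matcher adding a new precision/recall path (`core/vulnerability/evaluation.py` already exists as a third engine) | §D: M1 done invariant "0 lines of new precision/recall or gate computation code; reuse `core/vulnerability/evaluation.py`; converge only through the CodeScanAlertRecord→VulnerabilityEvaluationKey adapter". The §5 divergence table states that synthetic evaluate and the parity matcher share the same computation core (blocks a 4th engine). Semantic separation of coverage ≠ precision/recall |
+| `VD-03` | requirements | The synthetic regression gate (VulnerabilityEvaluationKey: exact match on file+line_start+rule_id, confirmed in code) contradicts V-Q2 rule-class normalization | §E: both gates adopt the rule-class normalization + line-window EvaluationKey semantics (recommended option). The expected-findings schema and the point where normalization applies are fixed in §4.2. VFR8 consistency condition stated |
+| `AP-04`/`vuln-active-goal-slot-eviction` | autopilot/security | The active_goal slot is occupied by personal-prod-deploy (current.yml:40), but slot contention / the default-off merge path is undecided | §J: vuln autonomous code merges following the secret pattern, without the active_goal slot, adopting main (theirs) for the 3 governance files. Actually switching slot occupancy is a user decision (stop/escalate). Procedure for updating the 3 goal-setup files together in §A |
+| `AP-05`/`vuln-existing-scan-default-invariance` | autopilot/security | Undecided: whether M2 inline gating is default-on, invariance of existing scan-vuln output, relation to stop-conditions | §K: only the deterministic, metadata-only part guaranteed by suppression-rate regression is default-on; new rule suppressions that change behavior (code_flow gating, new low-confidence rule suppression) are gated. M2 done invariant: default-on does not break synthetic recall≥0.99 (canary TPs preserved). Invariant: existing scan-vuln default output unchanged |
+| `AP-06` | autopilot | The vuln design had not gone through multi-agent review, yet §0 and §3 describe autonomous sequencing as settled | Resolved by producing this review.md. v2 applies every blocker/major. The assertion in §0 is revised to presuppose "v2 with review applied" |
+| `synthetic-self-fulfilling`/`vuln-synthetic-fixture-self-fulfilling` | measure/security | Synthetic weight is higher, yet the self-fulfilling defense (adversarial fixtures) is weaker than in secret (one line in §7) | §F: M1 done lists adversarial fixtures that turn red when normalization/line-window/filtering is missing (CWE-absent rule-token-only, source/sink line drift, CodeQL↔Semgrep same vulnerability with different rule.id, dismissed_reason cases). M2/M7 done adds: a missing regression for independently authored adversarial pairs turns red. Synthetic↔real snapshot divergence report added to H-track baseline done |
+| `cwe-intersection-asymmetry-recall-inflation` | measure | CWE-intersection matching: (a) many-to-many collisions, (b) CWE-missing asymmetry, (c) `start_line//N` quantization contradicts the intent of a ±N window | §G: §4.2 defines 1:1 greedy maximum matching for many-to-many CWEs within the same window (or CWE hierarchy agreement), defines a true window `\|alert_line−finding_line\|≤N` and fixes N, and exposes the CWE-missing rate and by-rule-token rescue rate as first-class baseline metadata + confidence warning |
+| `dismissed-reason-snapshot-survivorship-bias` | measure | precision/recall denominators are not formalized per state (vuln parallel of the secret `alert-state-not-filtered`) | §H: formulas fixed in §4.2/M1 done: recall denominator = open+fixed CodeQL alerts only, precision penalty = dismissed (fp/used-in-tests) accumulated separately. Per-repo density bias of dismissals listed as a §7 risk. Location staleness of fixed alerts verified in §4.1 |
+| `finding-id-not-stable-across-rule-normalization` | measure | The disposition persistence key finding_id is, in its fallback, a hash of `{rule_id,file_path,line_start,message}` (confirmed in code) → lost on normalization, drift, or message change | §I: §4.3 explicitly decides between defining a stable key and enforcing stable partialFingerprints in Semgrep-compat output. **Since durable disposition is deferred to the H-track by §C, this item is stated as an H-track prerequisite** |
+
+## minor (8): summary of v2 changes
+
+- `VD-04`/`AP-08` (`report-only-enforce-unreachable`): the vuln parity gate branches automatically,
+  'threshold absent → report-only, present → enforce (after the H-track baseline)'. Autonomous goal done is
+  reduced to 'synthetic regression enforce + parity report-only wiring' (§B). Carry over stale-degraded
+  when snapshot age > threshold (no silent pass) (§4.4).
+- `VD-05`: the M4 alternative 'compare-ghas --category code-vuln' is factually wrong (`cmd_compare_ghas`
+  hard-wires `secret-scanning/alerts` and does not register `--category`; confirmed) → that option is
+  deleted and a new `compare-codescan` is fixed as the default (§H1 work). Consistent with V-Q8.
+- `VD-06`: the M3/§4.3 durable disposition wiring did not state the storage projection stop-condition →
+  §A stop_conditions includes `storage-projection-or-schema-migration-required`, and §C rules that
+  Option 2 (vuln-only partition) counts as storage projection.
+- `codeql-only-oracle-language-coverage-bias`: excluding CodeQL-unsupported languages (PHP/Bash/IaC) from
+  the SLO is an undisclosed bias in baseline representativeness → the baseline report (H-track) exposes
+  'share of languages under parity measurement / share that is C-monitor-only' as first-class metadata
+  (added to VFR3). goal done states the scope qualifier 'measured population = CodeQL-supported languages' (§G, §8).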
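+- `rule-token-fallback-spec-underdefined`: the by-rule-token partial-match predicate was undefined → §4.2:
+  after stop-token removal, match only on exact set equality of the core vulnerability-class tokens (no
+  partial overlap). Expanding the CWE bridge mapping table is fixed as M1 work to minimize reliance on by-rule-token.
+- `vuln-snapshot-path-redaction-unspecified`: the redaction policy for snapshot location.path was
+  unspecified, and public_safety catches only absolute paths (`identifier.private-path`, confirmed) →
+  §4.1/H1 done states 'real snapshots are kept under a gitignored private path and must not be committed
+  (double block, as in secret); only synthetic fixtures are committed; passing public_safety is an
+  auxiliary check, and gitignore is the first line of defense against relative-path leaks'.
+- `vuln-existing-scan-default-invariance` (invariant): invariance of the existing scan default was not
+  fixed in design done → M2 done fixes 'previously exposed findings are not suppressed without
+  authorization' through synthetic regression (§K).
+- `VD-07`: M2 '≥N vulnerability classes' and line-window N were placeholders → M2 done lists the minimum
+  class set explicitly (fixed to 5 classes: SQLi/XSS/path-traversal/command-injection/SSRF) and states that
+  line-window N is 'fixed in M1' (§G).
+
+## nit: summary of v2 changes
+
+- `ARCH-VULN-03`: state the split between reused and new surfaces for M4 fetch (reused = GhApiRunner
+  GET-only guard, pagination, redaction helpers; new = CodeScanAlertRecord, normalization,
+  compare-codescan). Reflected in §H1 and §5.
+- `ARCH-VULN-05`: triage_state↔Verdict vocabulary agreement ≠ durable reflection; NEEDS_REVIEW is not
+  recorded durably. §C (H-track durable) states that only terminal verdicts are reflected and
+  NEEDS_REVIEW is not recorded.
+- `vuln-llm-input-leak-surface-verified-ok`: current LLM input redaction is robust (confirmed in code) →
+  M2/M6 done gets the invariant 'new verifier inputs comply with the redacted-metadata contract (traces as
+  count/shape only, no plaintext related_location path)' (§K).
+- `precision-recall-primitive` remainder: the semantic separation between GhasComparisonResult (coverage)
+  and VulnerabilityEvaluationResult (precision/recall) is stated in §4.2/§5 (merged with §D).
+
+## Verdict
+
+design.md v2 **applies all 6 blockers and 13 majors**. What remains are Open Questions to resolve during
+implementation (CWE bridge mapping coverage, fixing the line-window k value, refining the rule-token
+predicate, expanding synthetic corpus classes). **goal-setup can proceed**, provided goal-setup
+(1) promotes the SoT to `docs/workbench/specs/ghas-quality-vuln-subtrack/`, (2) writes
+allowed_writes/acceptance_checks/stop_conditions from the phase-2a template + the §A diff, and
+(3) updates autopilot_goal.yml goal_id, current.yml active_goal, and CURRENT.md together in one commit.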
+
+**Summary of the key change**: reclassifying M3 durable disposition made **the autonomous scope narrower
+than secret.** In secret, disposition was already durably wired, so M3 (LLM-tier disposition) went into
+the autonomous tier; in vuln, finding state does not exist in the durable store at all (0 references in
+store.py), so building durable disposition requires a new storage projection = stop-condition. Therefore
+vuln autonomous goal done = **cheap inline tier + synthetic regression gate + report-only parity wiring**,
+while durable disposition (persisting vuln verdicts), real GHAS fetch, baseline, and real parity enforce
+are **all split into the H-track**.
diff --git a/eval/codescan-parity-corpus/README.md b/eval/codescan-parity-corpus/README.md
new file mode 100644
index 0000000..a19aff5
--- /dev/null
+++ b/eval/codescan-parity-corpus/README.md
@@ -0,0 +1,56 @@
+# codescan-parity-corpus
+
+Synthetic, fully fake adversarial corpus for the GHAS **code-scanning** parity
+matcher (`core/vulnerability/codescan_parity.py`). This is the vuln-domain analog
+of `eval/ghas-parity-corpus/`.
+
+## Provenance fail-closed
+
+`synthetic-snapshot.json` carries the top-level provenance marker
+`"source": "synthetic"`. `load_codescan_snapshot` refuses to load any snapshot
+whose `source` is not exactly `synthetic`, so a real (or unmarked) snapshot can
+never feed the autonomous harness. ZERO network: the matcher and loader are pure
+logic over redacted records.
+
+## All values are fake
+
+No real repository names, file paths, code snippets, secret material, or rule
+taxonomies. Paths are synthetic relatives under `synthetic_app/`. The CodeQL-style
+and Semgrep-style `rule.id` tokens are representative shapes, not copied content.
+
+## Adversarial cases (design §2 M1)
+
+The snapshot is engineered so that switching OFF one matcher responsibility turns
+a specific metric red:
+
+- **(a) CWE-absent rule-token-only**: alert `js/path-injection` (no CWE) vs
+  finding `javascript.lang.security.audit.path-traversal` (no CWE). Matches only
+  via the by-rule-token fallback (`path-injection` folds to `path` + `traversal`).
+- **(b) source/sink line drift**: alert at line 24, finding at line 26 (`+2 == N`),
+  just inside the fixed line window `N = 2`. Matches only with the window.
+- **(c) CodeQL ↔ Semgrep same vuln, different rule.id**: alert `py/sql-injection`
+  (CWE-89) vs finding `python.lang.security.audit.sql-injection` (CWE-89). Matches
+  by-cwe. Without the CWE bridge AND without rule-token it splits into FP + FN.
+- **(d) dismissed_reason cases**: a `dismissed` / `false positive` alert our
+  finding hits (precision penalty: `dismissed_fp_hit`, and a false positive), plus
+  a `won't fix` alert (TP-non-blocking, excluded from the recall denominator).
+
+## Denominator semantics (design §4.2)
+
+- **Recall denominator** = alerts in `state ∈ {open, fixed}` only.
+- **Precision penalty** = count of `dismissed` / (`false positive` | `used in tests`)
+  locations our finding surfaced (`dismissed_fp_hit`); each is also a
+  false positive.
+- **`won't fix`** = TP-non-blocking: excluded from the recall denominator, not a
+  precision penalty.
+
+The line window `N` is fixed at **2** in M1 (closes open question VD-07) and is
+pinned by the tests in `tests/test_codescan_parity.py`.
+
+## Precision/recall reuse
+
+The matcher synthesizes canonical keys and routes them through
+`core/vulnerability/evaluation.py::evaluate_vulnerability_findings`; there is
+**zero** new precision/recall formula. `result.detection` is a
+`VulnerabilityEvaluationResult` whose `.precision` / `.recall` come straight from
+the reused metrics layer.
diff --git a/eval/codescan-parity-corpus/synthetic-snapshot.json b/eval/codescan-parity-corpus/synthetic-snapshot.json
new file mode 100644
index 0000000..27f08ab
--- /dev/null
+++ b/eval/codescan-parity-corpus/synthetic-snapshot.json
@@ -0,0 +1,96 @@
+{
+  "schemaVersion": 1,
+  "source": "synthetic",
+  "description": "Adversarial synthetic GHAS code-scanning parity snapshot. ALL VALUES ARE FAKE: no real repo names, file paths, code snippets, or rule taxonomies. Synthetic relative paths only (synthetic_app/*). It pairs CodeQL-style alert rule.ids against Semgrep-style finding rule.ids so that a missing CWE bridge, a missing rule-token fallback, a too-narrow line window, or a missing state filter each turn a specific metric red. Adversarial cases: (a) CWE-absent rule-token-only match, (b) source/sink line drift inside the window, (c) CodeQL<->Semgrep same vuln different rule.id matched by-cwe, (d) dismissed_reason cases (false positive = precision penalty, won't fix = TP-non-blocking excluded from recall).",
+  "repoFullName": "synthetic-org/synthetic-codescan-repo",
+  "fetchedAt": "2026-06-21T12:00:00+00:00",
+  "alerts": [
+    {
+      "alertNumber": 1,
+      "ruleId": "py/sql-injection",
+      "securitySeverityLevel": "high",
+      "cweIds": ["CWE-89"],
+      "state": "open",
+      "filePath": "synthetic_app/handlers.py",
+      "lineStart": 10,
+      "lineEnd": 10,
+      "note": "(c) CodeQL rule.id py/sql-injection (CWE-89). Our finding uses a different Semgrep-style rule.id but the same CWE -> matches by-cwe. Without the CWE bridge AND without rule-token, it splits into FP+FN."
+    },
+    {
+      "alertNumber": 2,
+      "ruleId": "py/xss",
+      "securitySeverityLevel": "medium",
+      "cweIds": ["CWE-79"],
+      "state": "open",
+      "filePath": "synthetic_app/render.py",
+      "lineStart": 24,
+      "lineEnd": 24,
+      "note": "(b) source/sink line drift: our finding sits at line 26 (+2 == N), just inside the line window. Matches only with the window."
+    },
+    {
+      "alertNumber": 3,
+      "ruleId": "js/path-injection",
+      "securitySeverityLevel": "high",
+      "cweIds": [],
+      "state": "fixed",
+      "filePath": "synthetic_app/files.js",
+      "lineStart": 40,
+      "lineEnd": 40,
+      "note": "(a) CWE-absent rule-token-only: neither side carries a CWE. Matches only via the rule-token fallback (path-injection folds to path+traversal). state=fixed -> positive truth (recall denominator)."
+    },
+    {
+      "alertNumber": 4,
+      "ruleId": "py/xss",
+      "securitySeverityLevel": "medium",
+      "cweIds": ["CWE-79"],
+      "state": "dismissed",
+      "dismissedReason": "false positive",
+      "filePath": "synthetic_app/legacy.py",
+      "lineStart": 55,
+      "lineEnd": 55,
+      "note": "(d) dismissed false positive (FP-oracle). Our finding lands here, so it is a precision penalty (dismissed_fp_hit) and a false positive. Excluded from the recall denominator -> disabling the state filter makes this an undetected FN and drops recall."
+    },
+    {
+      "alertNumber": 5,
+      "ruleId": "py/command-injection",
+      "securitySeverityLevel": "high",
+      "cweIds": ["CWE-78"],
+      "state": "dismissed",
+      "dismissedReason": "won't fix",
+      "filePath": "synthetic_app/ops.py",
+      "lineStart": 70,
+      "lineEnd": 70,
+      "note": "(d) won't fix (TP-non-blocking). We do NOT detect it. Excluded from the recall denominator and not a precision penalty."
+    }
+  ],
+  "findings": [
+    {
+      "ruleId": "python.lang.security.audit.sql-injection",
+      "sourceTool": "semgrep",
+      "cweIds": ["CWE-89"],
+      "filePath": "synthetic_app/handlers.py",
+      "lineStart": 10
+    },
+    {
+      "ruleId": "py/xss",
+      "sourceTool": "codeql",
+      "cweIds": ["CWE-79"],
+      "filePath": "synthetic_app/render.py",
+      "lineStart": 26
+    },
+    {
+      "ruleId": "javascript.lang.security.audit.path-traversal",
+      "sourceTool": "semgrep",
+      "cweIds": [],
+      "filePath": "synthetic_app/files.js",
+      "lineStart": 40
+    },
+    {
+      "ruleId": "py/xss",
+      "sourceTool": "codeql",
+      "cweIds": ["CWE-79"],
+      "filePath": "synthetic_app/legacy.py",
+      "lineStart": 55
+    }
+  ]
+}
diff --git a/eval/synthetic-code-vuln/corpus-snapshot.json b/eval/synthetic-code-vuln/corpus-snapshot.json
new file mode 100644
index 0000000..5886cd2
--- /dev/null
+++ b/eval/synthetic-code-vuln/corpus-snapshot.json
@@ -0,0 +1,123 @@
+{
+  "schemaVersion": 1,
+  "source": "synthetic",
+  "name": "synthetic-code-vuln-5class-corpus",
+  "description": "Synthetic, fully fake 5-class code-vuln regression corpus (design VD-07). 
ALL VALUES ARE FAKE: no real repo names, file paths, code snippets, or rule taxonomies. Synthetic relative paths only (synthetic_app/*). Covers SQLi (CWE-89), XSS (CWE-79), path-traversal (CWE-22), command-injection (CWE-78), SSRF (CWE-918). Each class has a vulnerable case (expectedFindings) and a safe case (safeCases, must NOT be flagged -> exercises precision). actualFindings deliberately mix CodeQL-style and Semgrep-style rule.ids so only the normalization-aware path (RuleClassNormalizer + line-window, shared with the M1 parity matcher) matches them. Safe-case findings are intentionally absent from actualFindings, so a precision-correct scanner that does not flag them keeps precision high.", + "lineWindow": 2, + "expectedFindings": [ + { + "vulnClass": "sql-injection", + "filePath": "synthetic_app/handlers.py", + "lineStart": 42, + "ruleId": "python.lang.security.audit.sql-injection", + "cweIds": ["CWE-89"], + "note": "Expected uses a Semgrep-style rule.id; the matching actual finding uses a CodeQL-style py/sql-injection. Matches by-cwe via the shared normalizer (cross-dialect)." + }, + { + "vulnClass": "xss", + "filePath": "synthetic_app/render.py", + "lineStart": 18, + "ruleId": "python.lang.security.audit.xss", + "cweIds": ["CWE-79"], + "note": "Cross-dialect: actual is CodeQL py/reflected-xss carrying CWE-79; matches by-cwe." + }, + { + "vulnClass": "path-traversal", + "filePath": "synthetic_app/files.py", + "lineStart": 30, + "ruleId": "python.lang.security.audit.path-traversal", + "cweIds": [], + "note": "No CWE on either side: matches only via the rule-token fallback (both rule.ids reduce to the core tokens path+traversal once language/tool stop-tokens are stripped). Actual finding sits at line 32 (+2 == N), inside the line window." + }, + { + "vulnClass": "command-injection", + "filePath": "synthetic_app/ops.py", + "lineStart": 55, + "ruleId": "python.lang.security.audit.command-injection", + "cweIds": ["CWE-78"], + "note": "Cross-dialect: actual is CodeQL py/command-line-injection with CWE-78; matches by-cwe." 
+ }, + { + "vulnClass": "ssrf", + "filePath": "synthetic_app/fetch.py", + "lineStart": 12, + "ruleId": "python.lang.security.audit.ssrf", + "cweIds": ["CWE-918"], + "note": "Cross-dialect: actual is CodeQL py/request-forgery with CWE-918; matches by-cwe." + } + ], + "actualFindings": [ + { + "vulnClass": "sql-injection", + "filePath": "synthetic_app/handlers.py", + "lineStart": 42, + "ruleId": "py/sql-injection", + "sourceTool": "codeql", + "cweIds": ["CWE-89"] + }, + { + "vulnClass": "xss", + "filePath": "synthetic_app/render.py", + "lineStart": 18, + "ruleId": "py/reflected-xss", + "sourceTool": "codeql", + "cweIds": ["CWE-79"] + }, + { + "vulnClass": "path-traversal", + "filePath": "synthetic_app/files.py", + "lineStart": 32, + "ruleId": "javascript.lang.security.audit.path-traversal", + "sourceTool": "semgrep", + "cweIds": [] + }, + { + "vulnClass": "command-injection", + "filePath": "synthetic_app/ops.py", + "lineStart": 55, + "ruleId": "py/command-line-injection", + "sourceTool": "codeql", + "cweIds": ["CWE-78"] + }, + { + "vulnClass": "ssrf", + "filePath": "synthetic_app/fetch.py", + "lineStart": 12, + "ruleId": "py/request-forgery", + "sourceTool": "codeql", + "cweIds": ["CWE-918"] + } + ], + "safeCases": [ + { + "vulnClass": "sql-injection", + "filePath": "synthetic_app/safe_handlers.py", + "lineStart": 40, + "note": "Parameterized query: not vulnerable. A precision-correct scanner does NOT flag this -> intentionally absent from actualFindings." + }, + { + "vulnClass": "xss", + "filePath": "synthetic_app/safe_render.py", + "lineStart": 20, + "note": "Auto-escaped template output: not vulnerable." + }, + { + "vulnClass": "path-traversal", + "filePath": "synthetic_app/safe_files.py", + "lineStart": 28, + "note": "Path is validated against an allowlist: not vulnerable." + }, + { + "vulnClass": "command-injection", + "filePath": "synthetic_app/safe_ops.py", + "lineStart": 50, + "note": "Fixed argv list, no shell: not vulnerable." 
+ }, + { + "vulnClass": "ssrf", + "filePath": "synthetic_app/safe_fetch.py", + "lineStart": 10, + "note": "URL host validated against an allowlist: not vulnerable." + } + ] +} diff --git a/governance/autopilot_goal.yml b/governance/autopilot_goal.yml index 81f5ce2..4951843 100644 --- a/governance/autopilot_goal.yml +++ b/governance/autopilot_goal.yml @@ -1,5 +1,5 @@ schema_version: 1 -goal_id: personal-prod-deploy +goal_id: ghas-quality-vuln-parity execution_mode: style: long-single-goal human_gate: stop-conditions-only @@ -15,16 +15,14 @@ policy_decisions: fork_prs: blocked-or-skipped-before-secrets public_artifacts: synthetic-or-redacted-only allowed_writes: - - docs/workbench/specs/phase-2a-sarif-native-sast/** - - docs/workbench/agentic-workflows/2026-06-20-phase-2a-sarif-import-first-goal.md + - docs/workbench/specs/ghas-quality-vuln-subtrack/** + - docs/workbench/agentic-workflows/2026-06-21-ghas-quality-vuln-parity-goal.md - docs/views/research-and-technical-decisions.md - src/security_scanner/** - tests/** - - deploy/systemd/user/** - examples/** - eval/** - - docs/workbench/** - - governance/** + - governance/vuln_parity_slo.py - ledger/** - CURRENT.md acceptance_checks: @@ -38,7 +36,8 @@ acceptance_checks: - uv run python -m governance.rebuild_ledger_index --check - uv run python -m governance.render_github_ruleset --output governance/main_ruleset.json --check - uv run python -m governance.public_safety --diff origin/main...HEAD - - uv run python -m governance.public_safety --path docs/workbench/specs/phase-2a-sarif-native-sast --path docs/views/research-and-technical-decisions.md + - uv run python -m governance.public_safety --path docs/workbench/specs/ghas-quality-vuln-subtrack + - uv run python -m governance.vuln_parity_slo --check - uv run python -m governance.autopilot_gate --base origin/main stop_conditions: - public-safety-hit diff --git a/governance/current.yml b/governance/current.yml index 1ea16e8..d8177d5 100644 --- a/governance/current.yml +++ 
b/governance/current.yml @@ -37,7 +37,7 @@ gates: proof_ref: '' proof_hash: '' autopilot: - active_goal: personal-prod-deploy + active_goal: ghas-quality-vuln-parity merge_mode: guarded-auto-merge last_auto_merge: ledger:20260617T003405Z-autopilot-3236f4 open_decisions: [] diff --git a/governance/vuln_parity_slo.py b/governance/vuln_parity_slo.py new file mode 100644 index 0000000..1808bff --- /dev/null +++ b/governance/vuln_parity_slo.py @@ -0,0 +1,377 @@ +"""GHAS code-scanning parity SLO gate (M3) — report-only until a threshold exists. + +This gate measures our code-vulnerability detector's per-repo GHAS *parity* +against frozen **synthetic** code-scanning snapshot fixtures and reports macro +precision/recall. It is the autonomous-layer CI vehicle for the +``ghas-quality-vuln-parity`` goal — the 1:1 vuln-domain transfer of the proven +secret-track gate ``governance/parity_slo.py``. + +Two-mode by design (design.md §4.4 / §2 M3, requirements measure-first): + +* **report-only** — the default and the ONLY mode reachable autonomously: when no + threshold file exists (or it is empty), the gate prints the measured numbers and + ALWAYS exits 0. It never blocks. The real, calibrated thresholds are committed + only after the human-gated H1~H3 track measures a real CodeQL baseline, so until + then there is nothing legitimate to enforce. +* **enforce** — reachable only once a human commits a threshold file: macro + precision/recall below the committed minimums fail the gate (exit 1). This is + the measure-first auto-branch (threshold present => enforce). + +Staleness is surfaced, never silently passed (design ``staleness-passive-only``, +scan-health precedent): a snapshot older than the max age is reported as +``stale-degraded``. In report-only that is a visible warning (exit 0); in enforce +it fails (exit 1) so a stale snapshot cannot silently satisfy the gate. A snapshot +with no parseable ``fetched_at`` is treated as stale (unknown age must not pass). 
+ +Inputs are SYNTHETIC fixtures only. ``core.vulnerability.codescan.load_codescan_ +snapshot`` fails closed unless the snapshot carries ``source: synthetic`` +provenance, so a real GHAS code-scanning export can never drive this gate. + +Computation reuse: per-repo precision/recall come straight from the M1 parity +matcher (``core.vulnerability.codescan_parity.compare_codescan_alerts_with_ +findings``), whose ``.detection`` is the metrics-layer +``core.vulnerability.evaluation`` result. This module adds NO new precision/recall +formula — it only loads snapshots, macro-aggregates +(``aggregate_codescan_parity`` — averaging, not a TP/(TP+FP) re-derivation), reads +an optional threshold, and judges report-only vs enforce vs stale. +""" + +from __future__ import annotations + +import argparse +import datetime as dt +import json +import sys +from dataclasses import dataclass +from pathlib import Path +from typing import Any + +import yaml + +from security_scanner.core.vulnerability.codescan import ( + RuleClassNormalizer, + load_codescan_snapshot, +) +from security_scanner.core.vulnerability.codescan_parity import ( + MacroCodeScanParityResult, + aggregate_codescan_parity, + compare_codescan_alerts_with_findings, +) + +# The committed M1 synthetic code-scanning snapshot fixture dir (mirrors the +# secret gate's DEFAULT_SNAPSHOT_DIR = eval/ghas-parity-corpus). +DEFAULT_SNAPSHOT_DIR = Path("eval/codescan-parity-corpus") +# A SEPARATE threshold file from the secret gate's, so the two tracks never +# collide. It does NOT exist in the autonomous layer => report-only. +DEFAULT_THRESHOLD_PATH = Path("governance/vuln_parity_slo_thresholds.yml") + +# A snapshot older than this is reported as stale-degraded. Synthetic fixtures +# have no real freshness obligation, so the default is generous; the real cadence +# SLA is set by the human-gated H-track. +DEFAULT_MAX_SNAPSHOT_AGE_DAYS = 90 + + +@dataclass(frozen=True) +class VulnParitySloThresholds: + """Calibrated minimums. 
Absent until the human-gated H-track commits them.""" + + precision_min: float + recall_min: float + + +@dataclass(frozen=True) +class VulnParitySloResult: + """Outcome of one vuln parity-SLO evaluation pass.""" + + mode: str # "report-only" | "enforce" + macro: MacroCodeScanParityResult + snapshot_count: int + stale: bool + stale_snapshots: tuple[str, ...] + thresholds: VulnParitySloThresholds | None + failures: tuple[str, ...] + + @property + def total_dismissed_fp_hit(self) -> int: + return self.macro.total_dismissed_fp_hit + + @property + def passed(self) -> bool: + """Whether the gate should exit 0. + + report-only never blocks (exit 0 even when stale or below target — there + is no committed target to enforce yet). enforce blocks on any failure, + including a stale snapshot (staleness must not silently pass). + """ + if self.mode == "report-only": + return True + return not self.failures + + +def load_thresholds(path: Path) -> VulnParitySloThresholds | None: + """Load calibrated thresholds, or None when absent/empty (report-only).""" + if not path.exists(): + return None + raw = path.read_text(encoding="utf-8").strip() + if not raw: + return None + data = yaml.safe_load(raw) + if not isinstance(data, dict) or not data: + return None + try: + precision_min = float(data["precision_min"]) + recall_min = float(data["recall_min"]) + except (KeyError, TypeError, ValueError) as exc: + raise ValueError( + "vuln_parity_slo thresholds must define numeric precision_min and " + "recall_min" + ) from exc + return VulnParitySloThresholds( + precision_min=precision_min, recall_min=recall_min + ) + + +def discover_snapshots(snapshot_dir: Path) -> list[Path]: + """Return committed synthetic snapshot fixture files (sorted, deterministic).""" + if not snapshot_dir.exists(): + return [] + return sorted(snapshot_dir.glob("*snapshot*.json")) + + +def _snapshot_is_stale( + fetched_at: str | None, *, now: dt.datetime, max_age_days: int +) -> bool: + """True when the snapshot's 
fetched_at is older than the max age. + + A snapshot with no parseable fetched_at is treated as stale (unknown age must + not silently pass — design staleness-passive-only). + """ + if not fetched_at: + return True + parsed = _parse_timestamp(fetched_at) + if parsed is None: + return True + age = now - parsed + return age > dt.timedelta(days=max_age_days) + + +def _parse_timestamp(value: str) -> dt.datetime | None: + text = value.strip() + if text.endswith("Z"): + text = text[:-1] + "+00:00" + try: + parsed = dt.datetime.fromisoformat(text) + except ValueError: + return None + if parsed.tzinfo is None: + parsed = parsed.replace(tzinfo=dt.timezone.utc) + return parsed + + +def evaluate_vuln_parity_slo( + *, + snapshot_dir: Path = DEFAULT_SNAPSHOT_DIR, + threshold_path: Path = DEFAULT_THRESHOLD_PATH, + now: dt.datetime | None = None, + max_age_days: int = DEFAULT_MAX_SNAPSHOT_AGE_DAYS, +) -> VulnParitySloResult: + """Measure macro parity over synthetic snapshots and judge the SLO mode.""" + now = now or dt.datetime.now(dt.timezone.utc) + thresholds = load_thresholds(threshold_path) + mode = "enforce" if thresholds is not None else "report-only" + + normalizer = RuleClassNormalizer() + snapshot_paths = discover_snapshots(snapshot_dir) + + repo_results = [] + stale_snapshots: list[str] = [] + for path in snapshot_paths: + # load_codescan_snapshot fails closed on non-synthetic provenance. 
+ snapshot = load_codescan_snapshot(path) + if _snapshot_is_stale( + snapshot.fetched_at, now=now, max_age_days=max_age_days + ): + stale_snapshots.append(path.name) + repo_results.append( + compare_codescan_alerts_with_findings( + repository=snapshot.repo_full_name, + alerts=snapshot.alerts, + findings=snapshot.findings, + normalizer=normalizer, + ) + ) + + macro = aggregate_codescan_parity(repo_results) + stale = bool(stale_snapshots) + + failures: list[str] = [] + if thresholds is not None: + if macro.macro_precision < thresholds.precision_min: + failures.append( + f"macro precision {macro.macro_precision:.4f} < minimum " + f"{thresholds.precision_min:.4f}" + ) + if macro.macro_recall < thresholds.recall_min: + failures.append( + f"macro recall {macro.macro_recall:.4f} < minimum " + f"{thresholds.recall_min:.4f}" + ) + if stale: + # In enforce mode a stale snapshot is a hard failure: it must not + # silently satisfy the gate. + failures.append( + "stale-degraded: snapshot(s) older than " + f"{max_age_days}d: {', '.join(stale_snapshots)}" + ) + + return VulnParitySloResult( + mode=mode, + macro=macro, + snapshot_count=len(snapshot_paths), + stale=stale, + stale_snapshots=tuple(stale_snapshots), + thresholds=thresholds, + failures=tuple(failures), + ) + + +def render_report(result: VulnParitySloResult) -> str: + """Render a public-safe, aggregate-only vuln parity-SLO report.""" + macro = result.macro + lines = [ + "GHAS Vuln Code-Scanning Parity SLO", + "==================================", + f"Mode: {result.mode}", + f"Snapshots measured: {result.snapshot_count}", + f"Repos: {macro.repo_count}", + f"Macro precision: {macro.macro_precision:.4f}", + f"Macro recall: {macro.macro_recall:.4f}", + f"Matched by-cwe: {macro.total_matched_by_cwe}", + f"Matched by-rule-token: {macro.total_matched_by_rule_token}", + f"Unmatched: {macro.total_unmatched}", + f"Dismissed-FP hit: {macro.total_dismissed_fp_hit}", + f"CWE-deficit rate: {macro.macro_cwe_deficit_rate:.4f}", + 
f"Rule-token rescue rate: {macro.macro_rule_token_rescue_rate:.4f}", + ] + if result.thresholds is not None: + lines.append( + f"Thresholds: precision_min {result.thresholds.precision_min:.4f}, " + f"recall_min {result.thresholds.recall_min:.4f}" + ) + else: + lines.append( + "Thresholds: none committed (report-only; enforce pending H-track)" + ) + # Surface snapshot age / staleness (NFR, scan-health precedent): always state + # the staleness verdict, not only when stale. + if result.stale: + lines.append(f"Stale-degraded: {', '.join(result.stale_snapshots)}") + else: + lines.append("Snapshot freshness: OK (within max age)") + if result.mode == "report-only": + lines.append("Result: REPORT-ONLY (never blocks; measure-first)") + elif result.failures: + lines.append("Result: FAIL") + for failure in result.failures: + lines.append(f" - {failure}") + else: + lines.append("Result: PASS") + return "\n".join(lines) + "\n" + + +def main(argv: list[str] | None = None) -> int: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--root", type=Path, default=Path.cwd()) + parser.add_argument( + "--snapshot-dir", + type=Path, + default=DEFAULT_SNAPSHOT_DIR, + help="directory of committed synthetic code-scanning snapshot fixtures", + ) + parser.add_argument( + "--threshold-path", + type=Path, + default=DEFAULT_THRESHOLD_PATH, + help="optional calibrated threshold yml (absent => report-only)", + ) + parser.add_argument( + "--max-age-days", + type=int, + default=DEFAULT_MAX_SNAPSHOT_AGE_DAYS, + help="snapshot age beyond which it is stale-degraded", + ) + parser.add_argument( + "--check", + action="store_true", + help="evaluate and report; exit non-zero only in enforce mode failure", + ) + parser.add_argument( + "--json", action="store_true", help="emit a machine-readable JSON summary" + ) + args = parser.parse_args(argv) + + root = args.root.resolve() + snapshot_dir = ( + args.snapshot_dir + if args.snapshot_dir.is_absolute() + else root / 
args.snapshot_dir + ) + threshold_path = ( + args.threshold_path + if args.threshold_path.is_absolute() + else root / args.threshold_path + ) + + try: + result = evaluate_vuln_parity_slo( + snapshot_dir=snapshot_dir, + threshold_path=threshold_path, + max_age_days=args.max_age_days, + ) + except Exception as exc: # noqa: BLE001 - present any setup/provenance error. + print(f"vuln_parity_slo gate setup failed: {exc}", file=sys.stderr) + return 1 + + if args.json: + print(json.dumps(_result_to_dict(result), indent=2, sort_keys=True)) + else: + print(render_report(result)) + + if result.passed: + return 0 + for failure in result.failures: + print(f"vuln_parity_slo: {failure}", file=sys.stderr) + return 1 + + +def _result_to_dict(result: VulnParitySloResult) -> dict[str, Any]: + macro = result.macro + return { + "mode": result.mode, + "snapshotCount": result.snapshot_count, + "repoCount": macro.repo_count, + "macroPrecision": macro.macro_precision, + "macroRecall": macro.macro_recall, + "matchedByCwe": macro.total_matched_by_cwe, + "matchedByRuleToken": macro.total_matched_by_rule_token, + "unmatched": macro.total_unmatched, + "dismissedFpHit": macro.total_dismissed_fp_hit, + "cweDeficitRate": macro.macro_cwe_deficit_rate, + "ruleTokenRescueRate": macro.macro_rule_token_rescue_rate, + "stale": result.stale, + "staleSnapshots": list(result.stale_snapshots), + "thresholds": ( + None + if result.thresholds is None + else { + "precisionMin": result.thresholds.precision_min, + "recallMin": result.thresholds.recall_min, + } + ), + "failures": list(result.failures), + "passed": result.passed, + } + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/src/security_scanner/core/vulnerability/__init__.py b/src/security_scanner/core/vulnerability/__init__.py index d71e2aa..6220b1b 100644 --- a/src/security_scanner/core/vulnerability/__init__.py +++ b/src/security_scanner/core/vulnerability/__init__.py @@ -1,12 +1,15 @@ """Code vulnerability finding model and 
SARIF-first helpers.""" from security_scanner.core.vulnerability.evaluation import ( + NormalizedExpectedFinding, VulnerabilityEvaluationKey, VulnerabilityEvaluationResult, VulnerabilityEvaluationThresholds, VulnerabilityExpectedFinding, evaluate_vulnerability_findings, + evaluate_vulnerability_findings_normalized, evaluate_vulnerability_gate, + load_vulnerability_corpus_normalized, load_vulnerability_expected_findings, render_vulnerability_evaluation_report, ) @@ -34,6 +37,7 @@ "VULN_CATEGORY", "VULN_ENTITY_TYPE", "VULN_SCHEMA_VERSION", + "NormalizedExpectedFinding", "SarifImportError", "VulnerabilityEvaluationKey", "VulnerabilityEvaluationResult", @@ -45,10 +49,12 @@ "VulnerabilityLocation", "compute_vulnerability_finding_id", "evaluate_vulnerability_findings", + "evaluate_vulnerability_findings_normalized", "evaluate_vulnerability_gate", "evaluate_vulnerability_gate_policy", "import_sarif_file", "import_sarif_payload", + "load_vulnerability_corpus_normalized", "load_vulnerability_expected_findings", "render_vulnerability_evaluation_report", "render_vulnerability_report", diff --git a/src/security_scanner/core/vulnerability/codescan.py b/src/security_scanner/core/vulnerability/codescan.py new file mode 100644 index 0000000..bb997fc --- /dev/null +++ b/src/security_scanner/core/vulnerability/codescan.py @@ -0,0 +1,342 @@ +"""GHAS code-scanning alert domain model + rule-class normalizer (M1). + +This is the vuln-domain analog of the proven secret-track artifacts +(``baseline/ghas_api/normalize.py`` + the ``GhasAlertRecord`` value object). It +turns a GHAS *code-scanning* alert into a redacted value object +(:class:`CodeScanAlertRecord`) and collapses a tool's ``rule_id`` / CWE tags onto +a single canonical vuln class so a CodeQL/Semgrep token-mismatch no longer splits +one vulnerability across ``local_only`` (precision penalty) and ``ghas_only`` +(recall penalty). 
+ +The matcher in :mod:`security_scanner.core.vulnerability.codescan_parity` +performs the fuzzy (line-window) join on top of this; the precision/recall +*formula* and gate *threshold* judgement stay in +``core.vulnerability.evaluation`` (no new metric code here). + +:class:`CodeScanAlertRecord` is a PURE leaf value object kept in +``core/vulnerability/``: it has NO coupling to the durable nosql store. Wiring +the alert snapshot into a durable projection is the H4 storage-projection trap +and is deliberately out of scope here. + +Normalization priority (design §4.2): + CWE intersection (by-cwe) > rule-token normalization (by-rule-token) > unmatched. +""" + +from __future__ import annotations + +import json +import re +from collections.abc import Iterable, Mapping +from dataclasses import dataclass, field +from pathlib import Path + +from security_scanner.core.vulnerability.model import ( + VulnerabilityFinding, + VulnerabilityLocation, + compute_vulnerability_finding_id, +) + +# CWE token shape, mirrors sarif._CWE_RE so canonical ids line up (``CWE-NNN``). +_CWE_RE = re.compile(r"(?i)cwe[-_/ ]?(\d{1,5})") + + +# --------------------------------------------------------------------------- +# Redacted alert value object (secret GhasAlertRecord parallel) +# --------------------------------------------------------------------------- + +@dataclass(frozen=True) +class CodeScanAlertRecord: + """Redacted GHAS code-scanning alert (no raw message/snippet/private path). + + A pure value object: it carries only the redacted fields the parity matcher + needs and intentionally has NO storage-store coupling. + """ + + repository: str + alert_number: int + rule_id: str + security_severity_level: str | None + cwe_ids: tuple[str, ...] 
+ state: str # open | dismissed | fixed + dismissed_reason: str | None # false positive | won't fix | used in tests | None + location_path: str | None + location_start_line: int | None + location_end_line: int | None + fetched_at: str | None = None + source_tool: str = "ghas-code-scanning" + + +# --------------------------------------------------------------------------- +# CWE bridge: CWE id -> canonical vuln class +# --------------------------------------------------------------------------- + +# Initial coverage of the core vuln classes (design §2/§4.2). Extend by adding a +# row; an EMPTY bridge maps nothing so the adversarial fixtures go red when the +# CWE path is disabled. +DEFAULT_CWE_BRIDGE: dict[str, str] = { + "CWE-89": "sql-injection", + "CWE-79": "xss", + "CWE-22": "path-traversal", + "CWE-23": "path-traversal", + "CWE-36": "path-traversal", + "CWE-78": "command-injection", + "CWE-77": "command-injection", + "CWE-918": "ssrf", +} + + +# Rule-token classes: canonical class -> the exact core token SET that identifies +# it. Matching requires an EXACT set match of the surviving core tokens, so +# ``path-traversal`` never matches ``open-redirect`` (no partial overlap). +DEFAULT_RULE_TOKEN_CLASSES: dict[str, frozenset[str]] = { + "sql-injection": frozenset({"sql", "injection"}), + "xss": frozenset({"xss"}), + "path-traversal": frozenset({"path", "traversal"}), + "command-injection": frozenset({"command", "injection"}), + "ssrf": frozenset({"ssrf"}), + "open-redirect": frozenset({"open", "redirect"}), +} + + +# Stop-tokens stripped before exact-set comparison: language, tool, taxonomy and +# generic-audit noise that does not identify the vuln class. 
+DEFAULT_STOP_TOKENS: frozenset[str] = frozenset( + { + "audit", + "lang", + "language", + "security", + "py", + "python", + "js", + "javascript", + "ts", + "typescript", + "java", + "go", + "golang", + "rb", + "ruby", + "php", + "cs", + "csharp", + "ql", + "codeql", + "semgrep", + "external", + "cwe", + "rule", + "rules", + "best", + "practice", + "practices", + "problem", + "problems", + "warning", + "error", + "vuln", + "vulnerability", + "generic", + } +) + + +# Set-level synonym rewrites applied AFTER stop-token removal, so tool-specific +# vuln-class idioms fold onto one canonical core-token set before the exact-set +# comparison. Each rule rewrites a matched subset to a canonical set. Example: +# CodeQL's ``path-injection`` ({path, injection}) means the same class as +# ``path-traversal`` ({path, traversal}). +_SET_SYNONYMS: tuple[tuple[frozenset[str], frozenset[str]], ...] = ( + (frozenset({"path", "injection"}), frozenset({"path", "traversal"})), +) + + +def _split_tokens(rule_id: str) -> list[str]: + """Tokenize a rule id on common separators, lower-cased.""" + raw = re.split(r"[^a-zA-Z0-9]+", rule_id.strip().lower()) + return [token for token in raw if token] + + +def _core_tokens( + rule_id: str, + stop_tokens: frozenset[str], +) -> frozenset[str]: + """Return the surviving core vuln-class tokens after stop-token removal.""" + tokens = frozenset(t for t in _split_tokens(rule_id) if t not in stop_tokens) + # Fold known source/sink synonym sets (path-injection -> path-traversal). 
+ for matched, canonical in _SET_SYNONYMS: + if tokens == matched: + return canonical + return tokens + + +def extract_cwe_ids(values: Iterable[str]) -> tuple[str, ...]: + """Normalize arbitrary CWE-bearing tokens into ``CWE-NNN`` ids (sorted, unique).""" + found: set[str] = set() + for value in values: + match = _CWE_RE.search(str(value)) + if match: + found.add(f"CWE-{int(match.group(1))}") + return tuple(sorted(found)) + + +@dataclass(frozen=True) +class RuleClassNormalizer: + """Maps a ``rule_id`` and/or ``cwe_ids`` onto a canonical vuln class. + + An EMPTY ``cwe_bridge`` together with ``enable_rule_token=False`` normalizes + nothing — every lookup misses. That is what makes the same-vuln-different- + rule.id fixture go red when normalization is disabled. + """ + + cwe_bridge: Mapping[str, str] = field( + default_factory=lambda: dict(DEFAULT_CWE_BRIDGE) + ) + rule_token_classes: Mapping[str, frozenset[str]] = field( + default_factory=lambda: dict(DEFAULT_RULE_TOKEN_CLASSES) + ) + stop_tokens: frozenset[str] = DEFAULT_STOP_TOKENS + enable_rule_token: bool = True + + def cwe_class(self, cwe_ids: Iterable[str]) -> str | None: + """First bridgeable CWE -> its canonical class, else ``None``.""" + for cwe in extract_cwe_ids(cwe_ids): + mapped = self.cwe_bridge.get(cwe) + if mapped is not None: + return mapped + return None + + def has_bridgeable_cwe(self, cwe_ids: Iterable[str]) -> bool: + return self.cwe_class(cwe_ids) is not None + + def rule_token_class(self, rule_id: str) -> frozenset[str] | None: + """Exact core-token set for ``rule_id``, or ``None``. + + Returns the surviving core-token *set* (so callers can compare two sides + for EXACT equality — no partial overlap). Returns ``None`` when rule-token + normalization is disabled or no core tokens survive. 
+ """ + if not self.enable_rule_token: + return None + core = _core_tokens(rule_id, self.stop_tokens) + if not core: + return None + return core + + def rule_token_canonical(self, rule_id: str) -> str | None: + """Canonical class name for a rule-token set, if it matches a known class.""" + core = self.rule_token_class(rule_id) + if core is None: + return None + for canonical, token_set in self.rule_token_classes.items(): + if token_set == core: + return canonical + return None + + +# --------------------------------------------------------------------------- +# Snapshot fixture loading (provenance fail-closed) +# --------------------------------------------------------------------------- + +@dataclass(frozen=True) +class CodeScanSnapshot: + """Loaded synthetic code-scanning snapshot fixture (provenance-guarded).""" + + repo_full_name: str + source: str + alerts: list[CodeScanAlertRecord] + findings: list[VulnerabilityFinding] + fetched_at: str | None = None + + +def load_codescan_snapshot(path: str | Path) -> CodeScanSnapshot: + """Load a synthetic code-scanning snapshot fixture. + + Fails closed unless ``source``, after trimming whitespace and lower-casing, + equals ``synthetic``: a real (or unmarked) snapshot must never feed the + autonomous harness. 
+ """ + data = json.loads(Path(path).read_text(encoding="utf-8")) + source = str(data.get("source", "")).strip().lower() + if source != "synthetic": + raise ValueError( + "code-scanning snapshot must carry provenance marker source: synthetic " + f"(got {data.get('source')!r}); refusing to load" + ) + + repo_full_name = str(data["repoFullName"]) + fetched_at = data.get("fetchedAt") + + alerts = [ + _alert_from_dict(repo_full_name, item, fetched_at) + for item in data.get("alerts", []) + ] + findings = [ + _finding_from_dict(item) for item in data.get("findings", []) + ] + return CodeScanSnapshot( + repo_full_name=repo_full_name, + source=source, + alerts=alerts, + findings=findings, + fetched_at=fetched_at, + ) + + +def _alert_from_dict( + repo_full_name: str, item: dict, fetched_at: str | None +) -> CodeScanAlertRecord: + start = item.get("lineStart") + end = item.get("lineEnd") + return CodeScanAlertRecord( + repository=repo_full_name, + alert_number=int(item["alertNumber"]), + rule_id=str(item["ruleId"]), + security_severity_level=item.get("securitySeverityLevel"), + cwe_ids=extract_cwe_ids(item.get("cweIds", [])), + state=str(item.get("state", "open")), + dismissed_reason=item.get("dismissedReason"), + location_path=item.get("filePath"), + location_start_line=int(start) if start is not None else None, + location_end_line=int(end) if end is not None else None, + fetched_at=fetched_at, + ) + + +def _finding_from_dict(item: dict) -> VulnerabilityFinding: + file_path = str(item["filePath"]) + line_start = int(item["lineStart"]) + rule_id = str(item["ruleId"]) + source_tool = str(item.get("sourceTool", "semgrep")) + message = str(item.get("message", "synthetic finding")) + finding_id = compute_vulnerability_finding_id( + source_tool=source_tool, + rule_id=rule_id, + partial_fingerprints=None, + file_path=file_path, + line_start=line_start, + message=message, + ) + return VulnerabilityFinding( + finding_id=finding_id, + rule_id=rule_id, + message=message, + 
primary_location=VulnerabilityLocation( + file_path=file_path, + line_start=line_start, + line_end=item.get("lineEnd"), + ), + source_tool=source_tool, + cwe_ids=extract_cwe_ids(item.get("cweIds", [])), + ) + + +__all__ = [ + "CodeScanAlertRecord", + "CodeScanSnapshot", + "DEFAULT_CWE_BRIDGE", + "DEFAULT_RULE_TOKEN_CLASSES", + "DEFAULT_STOP_TOKENS", + "RuleClassNormalizer", + "extract_cwe_ids", + "load_codescan_snapshot", +] diff --git a/src/security_scanner/core/vulnerability/codescan_parity.py b/src/security_scanner/core/vulnerability/codescan_parity.py new file mode 100644 index 0000000..36e2068 --- /dev/null +++ b/src/security_scanner/core/vulnerability/codescan_parity.py @@ -0,0 +1,507 @@ +"""GHAS code-scanning alert -> VulnerabilityEvaluationKey parity adapter (M1). + +This is the 1:1 transfer of the proven secret-track parity matcher +(``baseline/ghas_api/parity.py``) to the code-vulnerability domain. It turns +GHAS code-scanning alerts (:class:`CodeScanAlertRecord`) and our own SARIF +findings (:class:`VulnerabilityFinding`) into the +``VulnerabilityExpectedFinding`` / ``VulnerabilityEvaluationKey`` shape that +``core.vulnerability.evaluation`` already understands, so the precision/recall +*formula* and gate *threshold* judgement are reused VERBATIM — no new metric code. + +The adapter owns exactly the responsibilities the metrics layer cannot: + +(a) **rule_id/CWE -> canonical vuln class** via + :class:`~security_scanner.core.vulnerability.codescan.RuleClassNormalizer`, + with priority CWE-intersection (by-cwe) > rule-token (by-rule-token) > + unmatched, so a CodeQL/Semgrep token-mismatch no longer splits one vuln in + two. 
+(b) **state-aware truth filter** (design §4.2 denominator formula) — recall + denominator = alerts in ``state ∈ {open, fixed}`` only; + ``dismissed_reason ∈ {false positive, used in tests}`` is an explicit + FP-oracle (precision penalty when our finding lands there, NOT in the recall + denominator); ``won't fix`` is TP-non-blocking (excluded from the recall + denominator, no precision penalty). +(c) **line-window matching** — a finding matches an alert when its start line + falls inside the alert's line interval or within ``±N`` lines of the nearest + interval endpoint (TRUE window, no ``start_line//N`` quantization). Because + this is a fuzzy join it cannot be expressed as exact-key equality, so the + adapter resolves the TP/FP/FN pairing itself (1:1 greedy via + ``_AlertSlot.consumed``) and then hands canonical keys to + ``evaluate_vulnerability_findings`` for the headline numbers. + +This module is a pure function over its inputs: it performs no network calls and +has no durable-store coupling. +""" + +from __future__ import annotations + +from collections.abc import Iterable, Sequence +from dataclasses import dataclass + +from security_scanner.core.vulnerability.codescan import ( + CodeScanAlertRecord, + RuleClassNormalizer, +) +from security_scanner.core.vulnerability.evaluation import ( + VulnerabilityEvaluationResult, + VulnerabilityExpectedFinding, + evaluate_vulnerability_findings, +) +from security_scanner.core.vulnerability.model import ( + VulnerabilityFinding, + VulnerabilityLocation, +) + +# Positive-truth states for the recall denominator (design §4.2 line 330-333). +CODESCAN_POSITIVE_TRUTH_STATES: tuple[str, ...] = ("open", "fixed") +# Dismissed reasons that are an explicit FP-oracle (precision penalty). +CODESCAN_FP_ORACLE_REASONS: tuple[str, ...] = ("false positive", "used in tests") +# Dismissed reason that is TP-non-blocking (excluded from recall, no penalty). +CODESCAN_NON_BLOCKING_REASON: str = "won't fix" + + +@dataclass(frozen=True) +class ParityConfig: + """Tunable parity-matching policy. 
+ + ``line_window`` is the ``±N`` window (interval overlap always matches + regardless of N). It is FIXED to a concrete value in M1 (closes open + question VD-07) and pinned by the fixtures. ``positive_truth_states`` + parameterizes the state-aware truth filter so a test can disable it (and + prove the resulting recall regression). + """ + + line_window: int = 2 + positive_truth_states: tuple[str, ...] = CODESCAN_POSITIVE_TRUTH_STATES + + +@dataclass(frozen=True) +class CodeScanParityResult: + """Per-repo code-scanning parity outcome. + + ``detection`` carries the reused metrics-layer result (so ``.precision`` / + ``.recall`` come straight from ``core.vulnerability.evaluation``). The tier + counts and FP-oracle counters are parity-specific buckets. + """ + + repository: str + detection: VulnerabilityEvaluationResult + matched_by_cwe: int + matched_by_rule_token: int + unmatched: int + dismissed_fp_hit: int + cwe_deficit_rate: float + rule_token_rescue_rate: float + + @property + def precision(self) -> float: + return self.detection.precision + + @property + def recall(self) -> float: + return self.detection.recall + + +@dataclass(frozen=True) +class MacroCodeScanParityResult: + """Macro (per-repo averaged) code-scanning parity summary. + + The vuln-domain analog of the secret-track ``MacroParityResult``. This is + aggregation ONLY: ``macro_precision`` / ``macro_recall`` are the unweighted + average of the per-repo numbers that already came from the metrics layer + (``CodeScanParityResult.detection.precision`` / ``.recall``). It introduces + NO new precision/recall formula — the tier counters and FP-oracle totals are + summed buckets carried forward from the per-repo matcher. 
+ """ + + repo_count: int + macro_precision: float + macro_recall: float + total_matched_by_cwe: int + total_matched_by_rule_token: int + total_unmatched: int + total_dismissed_fp_hit: int + macro_cwe_deficit_rate: float + macro_rule_token_rescue_rate: float + + +def aggregate_codescan_parity( + results: Iterable[CodeScanParityResult], +) -> MacroCodeScanParityResult: + """Macro-average per-repo code-scanning precision/recall (SLO consumes macro). + + Mirrors ``baseline.ghas_api.parity.aggregate_repo_parity``: the per-repo + precision/recall come straight from the metrics layer, so this is pure + averaging plus summed tier/FP-oracle buckets — not a TP/(TP+FP) re-derivation. + """ + results = list(results) + if not results: + return MacroCodeScanParityResult( + repo_count=0, + macro_precision=1.0, + macro_recall=1.0, + total_matched_by_cwe=0, + total_matched_by_rule_token=0, + total_unmatched=0, + total_dismissed_fp_hit=0, + macro_cwe_deficit_rate=0.0, + macro_rule_token_rescue_rate=0.0, + ) + n = len(results) + return MacroCodeScanParityResult( + repo_count=n, + macro_precision=sum(r.detection.precision for r in results) / n, + macro_recall=sum(r.detection.recall for r in results) / n, + total_matched_by_cwe=sum(r.matched_by_cwe for r in results), + total_matched_by_rule_token=sum(r.matched_by_rule_token for r in results), + total_unmatched=sum(r.unmatched for r in results), + total_dismissed_fp_hit=sum(r.dismissed_fp_hit for r in results), + macro_cwe_deficit_rate=sum(r.cwe_deficit_rate for r in results) / n, + macro_rule_token_rescue_rate=sum( + r.rule_token_rescue_rate for r in results + ) + / n, + ) + + +# --------------------------------------------------------------------------- +# Truth classification (state-aware denominator) +# --------------------------------------------------------------------------- + +def _norm(value: str | None) -> str: + return (value or "").strip().lower() + + +def _is_positive_truth(alert: CodeScanAlertRecord, config: ParityConfig) 
-> bool: + return _norm(alert.state) in config.positive_truth_states + + +def _is_fp_oracle(alert: CodeScanAlertRecord) -> bool: + return _norm(alert.dismissed_reason) in CODESCAN_FP_ORACLE_REASONS + + +# --------------------------------------------------------------------------- +# Core fuzzy join +# --------------------------------------------------------------------------- + +@dataclass +class _AlertSlot: + record: CodeScanAlertRecord + cwe_class: str | None + rule_token: frozenset[str] | None + consumed: bool = False + + +def _alert_lines(alert: CodeScanAlertRecord) -> tuple[int, int]: + start = alert.location_start_line + if start is None: + return (0, 0) + end = alert.location_end_line if alert.location_end_line is not None else start + lo, hi = (start, end) + if hi < lo: + lo, hi = hi, lo + return (lo, hi) + + +def _lines_match( + finding_line: int, + alert_interval: tuple[int, int], + window: int, +) -> bool: + lo, hi = alert_interval + # Interval overlap (finding line inside the alert span). + if lo <= finding_line <= hi: + return True + # ±N window around the nearest interval endpoint. + nearest = lo if finding_line < lo else hi + return abs(finding_line - nearest) <= window + + +def _slot_lines_match( + slot: _AlertSlot, finding: VulnerabilityFinding, window: int +) -> bool: + finding_line = finding.primary_location.line_start or 0 + return _lines_match(finding_line, _alert_lines(slot.record), window) + + +def _rule_class_match( + slot: _AlertSlot, + finding_cwe_class: str | None, + finding_token: frozenset[str] | None, +) -> str | None: + """Return the match tier (``by-cwe`` / ``by-rule-token``) or ``None``. + + Priority (design §4.2): by-cwe first (both sides share a canonical CWE + class); else by-rule-token, tried only when NEITHER side has a bridgeable + CWE class (EXACT core-token set equality — no partial overlap); else no + match. 
+ """ + if ( + slot.cwe_class is not None + and finding_cwe_class is not None + and slot.cwe_class == finding_cwe_class + ): + return "by-cwe" + if ( + finding_cwe_class is None + and slot.cwe_class is None + and slot.rule_token is not None + and finding_token is not None + and slot.rule_token == finding_token + ): + return "by-rule-token" + return None + + +def compare_codescan_alerts_with_findings( + *, + repository: str, + alerts: Sequence[CodeScanAlertRecord], + findings: Sequence[VulnerabilityFinding], + normalizer: RuleClassNormalizer, + config: ParityConfig | None = None, +) -> CodeScanParityResult: + """Compute per-repo code-scanning parity for one GHAS-enabled repo. + + Returns the metrics-layer ``VulnerabilityEvaluationResult`` (so + precision/recall come straight from ``core.vulnerability.evaluation``) plus + the parity-specific tier counts and FP-oracle counters. + """ + config = config or ParityConfig() + + # 1. State-aware truth filter: positive-truth alerts (recall denominator) are + # open + fixed only. Dismissed-FP-oracle alerts are tracked separately and + # won't-fix alerts are simply excluded. + truth_alerts = [ + a + for a in alerts + if a.location_path is not None + and a.location_start_line is not None + and _is_positive_truth(a, config) + ] + fp_oracle_alerts = [ + a + for a in alerts + if a.location_path is not None + and a.location_start_line is not None + and not _is_positive_truth(a, config) + and _is_fp_oracle(a) + ] + + truth_slots = [ + _AlertSlot( + record=a, + cwe_class=normalizer.cwe_class(a.cwe_ids), + rule_token=normalizer.rule_token_class(a.rule_id), + ) + for a in truth_alerts + ] + fp_oracle_slots = [ + _AlertSlot( + record=a, + cwe_class=normalizer.cwe_class(a.cwe_ids), + rule_token=normalizer.rule_token_class(a.rule_id), + ) + for a in fp_oracle_alerts + ] + + # CWE-deficit meta-metric over the positive-truth alert population. 
+ cwe_deficit_rate = _deficit_rate( + normalizer.has_bridgeable_cwe(a.cwe_ids) for a in truth_alerts + ) + + expected: list[VulnerabilityExpectedFinding] = [] + actual: list[VulnerabilityFinding] = [] + matched_by_cwe = 0 + matched_by_rule_token = 0 + unmatched = 0 + dismissed_fp_hit = 0 + match_index = 0 + + # 2. Fuzzy join: each finding tries to claim one unconsumed positive-truth + # alert in the same file with a matching rule-class and a tolerated line. + for finding in findings: + finding_cwe_class = normalizer.cwe_class(finding.cwe_ids) + finding_token = normalizer.rule_token_class(finding.rule_id) + + slot, tier = _find_matching_slot( + finding, finding_cwe_class, finding_token, truth_slots, config + ) + if slot is not None: + slot.consumed = True + match_index += 1 + if tier == "by-cwe": + matched_by_cwe += 1 + else: + matched_by_rule_token += 1 + shared_key = _matched_key(repository, match_index, tier) + expected.append(shared_key) + actual.append(_canonical_finding(finding, shared_key)) + continue + + # Not a positive-truth match. Is this finding sitting on a dismissed + # false-positive / used-in-tests location? That is an explicit FP-oracle + # hit: a precision penalty AND a false positive. + oracle = _find_fp_oracle_slot( + finding, finding_cwe_class, finding_token, fp_oracle_slots, config + ) + if oracle is not None: + oracle.consumed = True + dismissed_fp_hit += 1 + actual.append(_local_only_finding(repository, finding, match_index)) + match_index += 1 + unmatched += 1 + continue + + # Pure local-only finding -> false positive. + actual.append(_local_only_finding(repository, finding, match_index)) + match_index += 1 + unmatched += 1 + + # 3. Unconsumed positive-truth alerts -> false negatives (ghas-only truth). + for slot in truth_slots: + if not slot.consumed: + expected.append(_ghas_only_key(slot.record)) + + # by-rule-token rescue rate = fraction of matches that needed the rule-token + # fallback (i.e. could not be resolved by the CWE bridge). 
+ total_matched = matched_by_cwe + matched_by_rule_token + rule_token_rescue_rate = _rate(matched_by_rule_token, total_matched) + + detection = evaluate_vulnerability_findings(expected, actual) + + return CodeScanParityResult( + repository=repository, + detection=detection, + matched_by_cwe=matched_by_cwe, + matched_by_rule_token=matched_by_rule_token, + unmatched=unmatched, + dismissed_fp_hit=dismissed_fp_hit, + cwe_deficit_rate=cwe_deficit_rate, + rule_token_rescue_rate=rule_token_rescue_rate, + ) + + +def _find_matching_slot( + finding: VulnerabilityFinding, + finding_cwe_class: str | None, + finding_token: frozenset[str] | None, + slots: list[_AlertSlot], + config: ParityConfig, +) -> tuple[_AlertSlot | None, str | None]: + """First unconsumed positive-truth slot in the same file/window/rule-class.""" + for slot in slots: + if slot.consumed: + continue + if slot.record.location_path != finding.primary_location.file_path: + continue + if not _slot_lines_match(slot, finding, config.line_window): + continue + tier = _rule_class_match(slot, finding_cwe_class, finding_token) + if tier is not None: + return slot, tier + return None, None + + +def _find_fp_oracle_slot( + finding: VulnerabilityFinding, + finding_cwe_class: str | None, + finding_token: frozenset[str] | None, + slots: list[_AlertSlot], + config: ParityConfig, +) -> _AlertSlot | None: + """First unconsumed dismissed-FP-oracle slot the finding lands on.""" + for slot in slots: + if slot.consumed: + continue + if slot.record.location_path != finding.primary_location.file_path: + continue + if not _slot_lines_match(slot, finding, config.line_window): + continue + if _rule_class_match(slot, finding_cwe_class, finding_token) is not None: + return slot + return None + + +# --------------------------------------------------------------------------- +# Meta-metric helpers +# --------------------------------------------------------------------------- + +def _deficit_rate(flags: Iterable[bool]) -> float: + """Fraction 
lacking a bridgeable CWE over an iterable of has-CWE booleans.""" + items = list(flags) + if not items: + return 0.0 + deficient = sum(1 for has_cwe in items if not has_cwe) + return deficient / len(items) + + +def _rate(part: int, total: int) -> float: + return part / total if total else 0.0 + + +# --------------------------------------------------------------------------- +# Canonical-key synthesis (kept stable so evaluation.py keys line up 1:1) +# --------------------------------------------------------------------------- + +def _matched_key( + repository: str, index: int, tier: str | None +) -> VulnerabilityExpectedFinding: + return VulnerabilityExpectedFinding( + file_path=f"__matched__/{index}", + line_start=index, + rule_id=f"__matched__:{tier}", + ) + + +def _canonical_finding( + finding: VulnerabilityFinding, shared_key: VulnerabilityExpectedFinding +) -> VulnerabilityFinding: + """A VulnerabilityFinding whose EvaluationKey equals ``shared_key`` (TP).""" + return VulnerabilityFinding( + finding_id=finding.finding_id, + rule_id=shared_key.rule_id, + message=finding.message, + primary_location=VulnerabilityLocation( + file_path=shared_key.file_path, + line_start=shared_key.line_start, + ), + source_tool=finding.source_tool, + cwe_ids=finding.cwe_ids, + ) + + +def _ghas_only_key(alert: CodeScanAlertRecord) -> VulnerabilityExpectedFinding: + return VulnerabilityExpectedFinding( + file_path=f"__ghas_only__/{alert.location_path}", + line_start=alert.location_start_line or 0, + rule_id=f"ghas:{alert.rule_id}", + ) + + +def _local_only_finding( + repository: str, finding: VulnerabilityFinding, index: int +) -> VulnerabilityFinding: + """A finding with a guaranteed-unique key so it lands as a false positive.""" + return VulnerabilityFinding( + finding_id=finding.finding_id, + rule_id=f"__local_only__/{index}/{finding.rule_id}", + message=finding.message, + primary_location=VulnerabilityLocation( + 
file_path=f"__local_only__/{index}/{finding.primary_location.file_path}", + line_start=finding.primary_location.line_start or 0, + ), + source_tool=finding.source_tool, + cwe_ids=finding.cwe_ids, + ) + + +__all__ = [ + "CODESCAN_POSITIVE_TRUTH_STATES", + "CODESCAN_FP_ORACLE_REASONS", + "CODESCAN_NON_BLOCKING_REASON", + "ParityConfig", + "CodeScanParityResult", + "MacroCodeScanParityResult", + "aggregate_codescan_parity", + "compare_codescan_alerts_with_findings", +] diff --git a/src/security_scanner/core/vulnerability/evaluation.py b/src/security_scanner/core/vulnerability/evaluation.py index e171982..c5f6b2d 100644 --- a/src/security_scanner/core/vulnerability/evaluation.py +++ b/src/security_scanner/core/vulnerability/evaluation.py @@ -1,13 +1,35 @@ -"""Synthetic corpus evaluation for code vulnerability findings.""" +"""Synthetic corpus evaluation for code vulnerability findings. + +Two matching semantics coexist here, by DESIGN (design §E lines 326-329): + +- The ORIGINAL exact-key path — :class:`VulnerabilityEvaluationKey` + ``(file_path, line_start, rule_id)`` full equality — used by + :func:`evaluate_vulnerability_findings`. This is the legacy naive matcher and + its behavior is FROZEN: every existing caller / test keeps the same results. +- The M2 normalization-aware path — :func:`evaluate_vulnerability_findings_ + normalized` — reuses the M1 + :class:`~security_scanner.core.vulnerability.codescan.RuleClassNormalizer` + (CWE-bridge / rule-token canonicalization) and a line-window so a CodeQL-style + and a Semgrep-style ``ruleId`` for the SAME vuln class match. It is the SAME + rule-class + line-window semantics the M1 parity matcher + (``codescan_parity.py``) uses, satisfying the VFR8 consistency condition. + +The boundary is deliberate: the normalized path is a NEW function. It does NOT +mutate :func:`evaluate_vulnerability_findings` or the exact-key. 
Like the M1 +matcher it pre-normalizes both sides into synthetic canonical keys and then hands +them to :func:`evaluate_vulnerability_findings` for the headline precision/recall +— so there is ZERO new precision/recall formula. +""" from __future__ import annotations import json from collections import Counter -from collections.abc import Iterable +from collections.abc import Iterable, Sequence from dataclasses import dataclass from pathlib import Path +from security_scanner.core.vulnerability.codescan import RuleClassNormalizer from security_scanner.core.vulnerability.model import VulnerabilityFinding @@ -204,3 +226,269 @@ def _append_key_section( lines.append(title + ":") for key in keys: lines.append(f" - {key.display()}") + + +# --------------------------------------------------------------------------- +# M2 normalization-aware path (design §E) — NEW, additive. Does NOT change the +# exact-key behavior of evaluate_vulnerability_findings above. +# --------------------------------------------------------------------------- + +# Concrete line-window N, pinned to the M1 parity matcher's value (VFR8 +# consistency: the synthetic regression gate and the parity matcher share the +# same line-window). See codescan_parity.ParityConfig.line_window. +NORMALIZED_LINE_WINDOW: int = 2 + + +@dataclass(frozen=True) +class NormalizedExpectedFinding: + """Expected finding carrying enough metadata to derive its canonical class. + + Unlike :class:`VulnerabilityExpectedFinding` (which pins an exact ``rule_id``) + this carries the raw ``rule_id`` / ``cwe_ids`` so the SAME + :class:`RuleClassNormalizer` used by the M1 parity matcher derives the + canonical class on the expected side too. + """ + + file_path: str + line_start: int + rule_id: str + cwe_ids: tuple[str, ...] 
= () + + @classmethod + def from_dict(cls, data: dict) -> NormalizedExpectedFinding: + return cls( + file_path=str(data["filePath"]), + line_start=int(data["lineStart"]), + rule_id=str(data["ruleId"]), + cwe_ids=tuple(str(item) for item in data.get("cweIds", [])), + ) + + +def load_vulnerability_corpus_normalized( + path: str | Path, +) -> tuple[list[NormalizedExpectedFinding], list[VulnerabilityFinding]]: + """Load the M2 5-class synthetic corpus snapshot (provenance fail-closed). + + Returns ``(expected, actual)`` ready for + :func:`evaluate_vulnerability_findings_normalized`. Refuses to load a snapshot + whose ``source`` is not exactly ``synthetic`` so a real (or unmarked) corpus + can never feed the autonomous regression gate. + """ + data = json.loads(Path(path).read_text(encoding="utf-8")) + source = str(data.get("source", "")).strip().lower() + if source != "synthetic": + raise ValueError( + "vulnerability corpus snapshot must carry provenance marker " + f"source: synthetic (got {data.get('source')!r}); refusing to load" + ) + expected = [ + NormalizedExpectedFinding.from_dict(item) + for item in data.get("expectedFindings", []) + ] + actual = [ + _normalized_finding_from_dict(item) + for item in data.get("actualFindings", []) + ] + return expected, actual + + +def _normalized_finding_from_dict(item: dict) -> VulnerabilityFinding: + from security_scanner.core.vulnerability.model import ( + VulnerabilityLocation, + compute_vulnerability_finding_id, + ) + + file_path = str(item["filePath"]) + line_start = int(item["lineStart"]) + rule_id = str(item["ruleId"]) + source_tool = str(item.get("sourceTool", "semgrep")) + finding_id = compute_vulnerability_finding_id( + source_tool=source_tool, + rule_id=rule_id, + partial_fingerprints=None, + file_path=file_path, + line_start=line_start, + message="synthetic finding", + ) + return VulnerabilityFinding( + finding_id=finding_id, + rule_id=rule_id, + message="synthetic finding", + 
primary_location=VulnerabilityLocation( + file_path=file_path, + line_start=line_start, + line_end=item.get("lineEnd"), + ), + source_tool=source_tool, + cwe_ids=tuple(str(c) for c in item.get("cweIds", [])), + ) + + +def _canonical_class( + normalizer: RuleClassNormalizer, + *, + cwe_ids: Iterable[str], + rule_id: str, +) -> str | None: + """Canonical vuln class via the shared M1 normalizer (CWE bridge > rule-token).""" + cwe_class = normalizer.cwe_class(cwe_ids) + if cwe_class is not None: + return cwe_class + return normalizer.rule_token_canonical(rule_id) + + +@dataclass +class _ExpectedSlot: + expected: NormalizedExpectedFinding | VulnerabilityExpectedFinding + vuln_class: str | None + consumed: bool = False + + +def _line_window_match(a: int, b: int, window: int) -> bool: + return abs(a - b) <= window + + +def evaluate_vulnerability_findings_normalized( + expected_findings: Sequence[NormalizedExpectedFinding | VulnerabilityExpectedFinding], + actual_findings: Sequence[VulnerabilityFinding], + *, + normalizer: RuleClassNormalizer | None = None, + line_window: int = NORMALIZED_LINE_WINDOW, +) -> VulnerabilityEvaluationResult: + """Normalization-aware evaluation (design §E) reusing M1's normalizer. + + Accepts either :class:`NormalizedExpectedFinding` (carries ``cwe_ids`` so the + CWE bridge can fire) or a legacy :class:`VulnerabilityExpectedFinding` (no + ``cwe_ids`` -> class derived from ``rule_id`` tokens only). + + Like the M1 parity matcher this performs a fuzzy join on + ``(file_path, canonical vuln class, line-window)`` — NOT exact-key equality — + using the shared :class:`RuleClassNormalizer`. It then synthesizes canonical + keys for each TP / FP / FN and hands them to + :func:`evaluate_vulnerability_findings` so the precision/recall FORMULA is + reused verbatim (zero new metric code). The existing exact-key path is + untouched. 
+ + A finding matches an expected entry when they share a file, a non-``None`` + canonical class, and their lines are within ``±line_window``. Matching is 1:1 + greedy (each expected slot consumed once). + """ + normalizer = normalizer or RuleClassNormalizer() + slots = [ + _ExpectedSlot( + expected=item, + vuln_class=_canonical_class( + normalizer, + cwe_ids=getattr(item, "cwe_ids", ()), + rule_id=item.rule_id, + ), + ) + for item in expected_findings + ] + + matched: list[VulnerabilityExpectedFinding] = [] + actual_keys: list[VulnerabilityFinding] = [] + match_index = 0 + fp_index = 0 + + for finding in actual_findings: + finding_class = _canonical_class( + normalizer, + cwe_ids=finding.cwe_ids, + rule_id=finding.rule_id, + ) + slot = _find_expected_slot(finding, finding_class, slots, line_window) + if slot is not None: + slot.consumed = True + match_index += 1 + shared = _normalized_matched_key(match_index) + matched.append(shared) + actual_keys.append(_normalized_canonical_finding(finding, shared)) + else: + fp_index += 1 + actual_keys.append(_normalized_local_only_finding(finding, fp_index)) + + expected_keys = list(matched) + for slot in slots: + if not slot.consumed: + expected_keys.append(_normalized_ghas_only_key(slot.expected)) + + return evaluate_vulnerability_findings(expected_keys, actual_keys) + + +def _find_expected_slot( + finding: VulnerabilityFinding, + finding_class: str | None, + slots: list[_ExpectedSlot], + line_window: int, +) -> _ExpectedSlot | None: + if finding_class is None: + return None + finding_line = finding.primary_location.line_start or 0 + for slot in slots: + if slot.consumed: + continue + if slot.vuln_class is None or slot.vuln_class != finding_class: + continue + if slot.expected.file_path != finding.primary_location.file_path: + continue + if not _line_window_match( + finding_line, slot.expected.line_start, line_window + ): + continue + return slot + return None + + +def _normalized_matched_key(index: int) -> 
VulnerabilityExpectedFinding: + return VulnerabilityExpectedFinding( + file_path=f"__matched__/{index}", + line_start=index, + rule_id="__matched__", + ) + + +def _normalized_canonical_finding( + finding: VulnerabilityFinding, shared: VulnerabilityExpectedFinding +) -> VulnerabilityFinding: + from security_scanner.core.vulnerability.model import VulnerabilityLocation + + return VulnerabilityFinding( + finding_id=finding.finding_id, + rule_id=shared.rule_id, + message=finding.message, + primary_location=VulnerabilityLocation( + file_path=shared.file_path, + line_start=shared.line_start, + ), + source_tool=finding.source_tool, + cwe_ids=finding.cwe_ids, + ) + + +def _normalized_local_only_finding( + finding: VulnerabilityFinding, index: int +) -> VulnerabilityFinding: + from security_scanner.core.vulnerability.model import VulnerabilityLocation + + return VulnerabilityFinding( + finding_id=finding.finding_id, + rule_id=f"__local_only__/{index}", + message=finding.message, + primary_location=VulnerabilityLocation( + file_path=f"__local_only__/{index}", + line_start=index, + ), + source_tool=finding.source_tool, + cwe_ids=finding.cwe_ids, + ) + + +def _normalized_ghas_only_key( + expected: NormalizedExpectedFinding | VulnerabilityExpectedFinding, +) -> VulnerabilityExpectedFinding: + return VulnerabilityExpectedFinding( + file_path=f"__expected_only__/{expected.file_path}", + line_start=expected.line_start, + rule_id=f"expected:{expected.rule_id}", + ) diff --git a/src/security_scanner/core/vulnerability/gate.py b/src/security_scanner/core/vulnerability/gate.py index 58d7edd..fe93a77 100644 --- a/src/security_scanner/core/vulnerability/gate.py +++ b/src/security_scanner/core/vulnerability/gate.py @@ -1,9 +1,37 @@ -"""Gate policy for code vulnerability findings.""" +"""Gate policy for code vulnerability findings. + +M2 inline cheap FP-suppression tier (design §2 / §K) lives here as ADDITIVE, +default-OFF opt-in signals. 
The crux of design §K is the default-on vs gated +boundary: + +- **default-on** = ONLY deterministic, metadata-only changes that provably cannot + flip an already-blocking finding to non-blocking for the EXISTING default + thresholds. The existing gate already non-blocks INFO/LOW severity and + UNKNOWN/LOW precision; that is the entire default-on surface and M2 adds NOTHING + to it. Any new suppression that could change which findings block is gated. +- **gated/opt-in** = the two new ``VulnerabilityGateThresholds`` flags below, both + DEFAULT OFF. With both off, :func:`evaluate_vulnerability_gate_policy` behaves + EXACTLY as before (same blocking set, same reason string) — the + default-invariance canary in ``tests/test_vulnerability_gate_tier.py`` pins this. + +The two opt-in signals (V-Q3: metadata-only, no validity-check analogue, no LLM, +no network): + +- ``require_trace`` — a finding with ``code_flow_count == 0`` has NO data-flow + reachability evidence, so it is treated as non-blocking when the flag is ON. A + finding WITH a trace keeps blocking. +- ``suppress_rules`` — a frozenset of canonical vuln *classes* (e.g. + ``"sql-injection"``) treated as non-blocking. Rule-class normalization REUSES + the M1 :class:`~security_scanner.core.vulnerability.codescan.RuleClassNormalizer` + (no duplicated normalizer here), so a CodeQL-style and a Semgrep-style rule.id + for the same class are suppressed together. 
+""" from __future__ import annotations -from dataclasses import dataclass +from dataclasses import dataclass, field +from security_scanner.core.vulnerability.codescan import RuleClassNormalizer from security_scanner.core.vulnerability.model import VulnerabilityFinding _SEVERITY_RANK = { @@ -27,6 +55,10 @@ class VulnerabilityGateThresholds: max_blocking: int = 0 severity_min: str = "HIGH" precision_min: str = "HIGH" + # --- M2 inline cheap tier: OPT-IN signals, DEFAULT OFF ------------------- + # When both are at their defaults the gate behaves exactly as before. + require_trace: bool = False + suppress_rules: frozenset[str] = field(default_factory=frozenset) @dataclass(frozen=True) @@ -41,10 +73,17 @@ def evaluate_vulnerability_gate_policy( findings: list[VulnerabilityFinding], thresholds: VulnerabilityGateThresholds | None = None, ) -> VulnerabilityGateResult: - """Evaluate code-vuln findings using severity + precision thresholds.""" + """Evaluate code-vuln findings using severity + precision thresholds. + + The M2 inline tier (``require_trace`` / ``suppress_rules``) only ever REMOVES + findings from the blocking set, and only when its opt-in flag is set. With + both flags at their defaults the blocking set and reason string are identical + to the pre-M2 behavior. 
+    """
     policy = thresholds or VulnerabilityGateThresholds()
     severity_min = _normalize_severity(policy.severity_min)
     precision_min = _normalize_precision(policy.precision_min)
+    normalizer = RuleClassNormalizer() if policy.suppress_rules else None
     blocking = [
         finding
         for finding in findings
@@ -52,6 +91,8 @@ def evaluate_vulnerability_gate_policy(
             finding,
             severity_min=severity_min,
             precision_min=precision_min,
+            policy=policy,
+            normalizer=normalizer,
         )
     ]
     blocking_count = len(blocking)
@@ -76,14 +117,45 @@ def _is_blocking(
     *,
     severity_min: str,
     precision_min: str,
+    policy: VulnerabilityGateThresholds,
+    normalizer: RuleClassNormalizer | None,
 ) -> bool:
     if finding.triage_state == "FALSE_POSITIVE":
         return False
-    return (
+    base_blocking = (
         _SEVERITY_RANK.get(finding.severity, 0) >= _SEVERITY_RANK.get(severity_min, 3)
         and _PRECISION_RANK.get(finding.precision, 0)
         >= _PRECISION_RANK.get(precision_min, 3)
     )
+    if not base_blocking:
+        return False
+    # --- M2 inline cheap tier (opt-in suppression of an OTHERWISE-blocking
+    # finding). Each branch is gated by its flag, so with defaults nothing
+    # below changes the result.
+    if policy.require_trace and finding.code_flow_count == 0:
+        return False
+    if normalizer is not None and _rule_class_suppressed(
+        finding, policy.suppress_rules, normalizer
+    ):
+        return False
+    return True
+
+
+def _rule_class_suppressed(
+    finding: VulnerabilityFinding,
+    suppress_rules: frozenset[str],
+    normalizer: RuleClassNormalizer,
+) -> bool:
+    """True when the finding's canonical vuln class is in ``suppress_rules``.
+
+    Uses the shared M1 normalizer: CWE bridge first, then the rule-token
+    canonical class. No duplicated normalization logic.
+    """
+    cwe_class = normalizer.cwe_class(finding.cwe_ids)
+    if cwe_class is not None and cwe_class in suppress_rules:
+        return True
+    token_class = normalizer.rule_token_canonical(finding.rule_id)
+    return token_class is not None and token_class in suppress_rules
 
 
 def _normalize_severity(value: str) -> str:
diff --git a/tests/test_codescan_parity.py b/tests/test_codescan_parity.py
new file mode 100644
index 0000000..131230c
--- /dev/null
+++ b/tests/test_codescan_parity.py
@@ -0,0 +1,750 @@
+"""Adversarial parity tests for the GHAS code-scanning -> EvaluationKey adapter (M1).
+
+This is the 1:1 transfer of the proven secret-track parity matcher
+(``tests/test_ghas_parity.py``) to the code-vulnerability domain. The matcher
+synthesizes canonical keys and hands them to
+``core.vulnerability.evaluation.evaluate_vulnerability_findings`` for the headline
+precision/recall — ZERO new precision/recall formula lives here.
+
+Each test toggles OFF exactly one matcher responsibility and asserts a specific
+metric goes red:
+
+- normalization OFF (empty bridge + empty token map)
+  -> same-vuln-different-rule.id splits into FP + FN (recall/precision drop).
+- line-window OFF (tolerance 0)
+  -> the +/-N drift pair stops matching; the just-out-of-window negative
+     control MUST NOT match even with tolerance.
+- state filter
+  -> a dismissed / "false positive" alert our finding hits raises
+     ``dismissed_fp_hit`` AND becomes an FP; a "won't fix" alert is excluded
+     from the recall denominator (recall not penalized for missing it).
+
+The by-cwe vs by-rule-token tier counts and the CWE many-to-many 1:1 binding are
+asserted explicitly. The committed fixture loads via ``load_codescan_snapshot``;
+a non-``synthetic`` provenance marker fails closed.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pytest
+
+from security_scanner.core.vulnerability.codescan import (
+    DEFAULT_CWE_BRIDGE,
+    CodeScanAlertRecord,
+    RuleClassNormalizer,
+    load_codescan_snapshot,
+)
+from security_scanner.core.vulnerability.codescan_parity import (
+    ParityConfig,
+    compare_codescan_alerts_with_findings,
+)
+from security_scanner.core.vulnerability.evaluation import (
+    VulnerabilityEvaluationResult,
+)
+from security_scanner.core.vulnerability.model import (
+    VulnerabilityFinding,
+    VulnerabilityLocation,
+    compute_vulnerability_finding_id,
+)
+
+REPO = "synthetic-org/synthetic-codescan-repo"
+# Concrete line-window N fixed in M1 (closes the open question VD-07).
+LINE_WINDOW = 2
+FIXTURE = (
+    Path(__file__).resolve().parents[1]
+    / "eval"
+    / "codescan-parity-corpus"
+    / "synthetic-snapshot.json"
+)
+
+
+def _normalizer(
+    *,
+    cwe_bridge=DEFAULT_CWE_BRIDGE,
+    enable_rule_token: bool = True,
+) -> RuleClassNormalizer:
+    return RuleClassNormalizer(
+        cwe_bridge=cwe_bridge,
+        enable_rule_token=enable_rule_token,
+    )
+
+
+def _alert(
+    *,
+    number: int,
+    rule_id: str,
+    path: str,
+    start_line: int,
+    end_line: int | None = None,
+    cwe_ids: tuple[str, ...] = (),
+    state: str = "open",
+    dismissed_reason: str | None = None,
+    severity: str | None = "high",
+) -> CodeScanAlertRecord:
+    return CodeScanAlertRecord(
+        repository=REPO,
+        alert_number=number,
+        rule_id=rule_id,
+        security_severity_level=severity,
+        cwe_ids=cwe_ids,
+        state=state,
+        dismissed_reason=dismissed_reason,
+        location_path=path,
+        location_start_line=start_line,
+        location_end_line=end_line if end_line is not None else start_line,
+    )
+
+
+def _finding(
+    *,
+    rule_id: str,
+    path: str,
+    line_start: int,
+    cwe_ids: tuple[str, ...] = (),
+) -> VulnerabilityFinding:
+    finding_id = compute_vulnerability_finding_id(
+        source_tool="semgrep",
+        rule_id=rule_id,
+        partial_fingerprints=None,
+        file_path=path,
+        line_start=line_start,
+        message="synthetic",
+    )
+    return VulnerabilityFinding(
+        finding_id=finding_id,
+        rule_id=rule_id,
+        message="synthetic finding",
+        primary_location=VulnerabilityLocation(
+            file_path=path, line_start=line_start
+        ),
+        source_tool="semgrep",
+        cwe_ids=cwe_ids,
+    )
+
+
+# ---------------------------------------------------------------------------
+# (c) CodeQL <-> Semgrep same vuln, different rule.id: matches by-cwe
+# ---------------------------------------------------------------------------
+
+def test_same_vuln_different_rule_id_matches_by_cwe():
+    """CodeQL ``py/sql-injection`` (CWE-89) vs Semgrep different rule.id, same CWE."""
+    alerts = [
+        _alert(
+            number=1,
+            rule_id="py/sql-injection",
+            path="synthetic_app/handlers.py",
+            start_line=10,
+            cwe_ids=("CWE-89",),
+        )
+    ]
+    findings = [
+        _finding(
+            rule_id="python.lang.security.audit.sql-injection",
+            path="synthetic_app/handlers.py",
+            line_start=10,
+            cwe_ids=("CWE-89",),
+        )
+    ]
+
+    result = compare_codescan_alerts_with_findings(
+        repository=REPO,
+        alerts=alerts,
+        findings=findings,
+        normalizer=_normalizer(),
+        config=ParityConfig(line_window=LINE_WINDOW),
+    )
+
+    assert isinstance(result.detection, VulnerabilityEvaluationResult)
+    assert result.detection.true_positive_count == 1
+    assert result.detection.false_positive_count == 0
+    assert result.detection.false_negative_count == 0
+    assert result.detection.precision == 1.0
+    assert result.detection.recall == 1.0
+    assert result.matched_by_cwe == 1
+    assert result.matched_by_rule_token == 0
+    assert result.unmatched == 0
+
+
+def test_same_vuln_without_normalization_splits_red():
+    """RED-PROOF: empty bridge + token map -> the same vuln splits into FP + FN."""
+    alerts = [
+        _alert(
+            number=1,
+            rule_id="py/sql-injection",
+            path="synthetic_app/handlers.py",
+            start_line=10,
+            cwe_ids=("CWE-89",),
+        )
+    ]
+    findings = [
+        _finding(
+            rule_id="python.lang.security.audit.sql-injection",
+            path="synthetic_app/handlers.py",
+            line_start=10,
+            cwe_ids=("CWE-89",),
+        )
+    ]
+
+    result = compare_codescan_alerts_with_findings(
+        repository=REPO,
+        alerts=alerts,
+        findings=findings,
+        # normalization fully disabled: no CWE bridge, no rule-token rescue.
+        normalizer=_normalizer(cwe_bridge={}, enable_rule_token=False),
+        config=ParityConfig(line_window=LINE_WINDOW),
+    )
+
+    assert result.detection.true_positive_count == 0
+    assert result.detection.false_negative_count == 1  # ghas-only alert
+    assert result.detection.false_positive_count == 1  # local-only finding
+    assert result.detection.recall < 1.0
+    assert result.detection.precision < 1.0
+    assert result.matched_by_cwe == 0
+
+
+# ---------------------------------------------------------------------------
+# (a) CWE-absent rule-token-only: matches ONLY via by-rule-token
+# ---------------------------------------------------------------------------
+
+def test_cwe_absent_matches_by_rule_token():
+    """Neither side carries a CWE -> match only through exact core-token set."""
+    alerts = [
+        _alert(
+            number=1,
+            rule_id="js/path-injection",
+            path="synthetic_app/files.js",
+            start_line=22,
+            cwe_ids=(),
+        )
+    ]
+    findings = [
+        _finding(
+            rule_id="javascript.lang.security.audit.path-traversal",
+            path="synthetic_app/files.js",
+            line_start=22,
+            cwe_ids=(),
+        )
+    ]
+
+    result = compare_codescan_alerts_with_findings(
+        repository=REPO,
+        alerts=alerts,
+        findings=findings,
+        normalizer=_normalizer(),
+        config=ParityConfig(line_window=LINE_WINDOW),
+    )
+
+    assert result.detection.true_positive_count == 1
+    assert result.matched_by_cwe == 0
+    assert result.matched_by_rule_token == 1
+    assert result.unmatched == 0
+
+
+def test_rule_token_requires_exact_core_set_no_partial_overlap():
+    """path-traversal MUST NOT match open-redirect (no partial-overlap matching)."""
+    normalizer = _normalizer()
+    path_set = normalizer.rule_token_class(
+        "javascript.lang.security.audit.path-traversal"
+    )
+    redirect_set = normalizer.rule_token_class("js/open-redirect")
+    assert path_set is not None
+    assert redirect_set is not None
+    assert path_set != redirect_set
+
+    alerts = [
+        _alert(
+            number=1,
+            rule_id="js/open-redirect",
+            path="synthetic_app/files.js",
+            start_line=22,
+            cwe_ids=(),
+        )
+    ]
+    findings = [
+        _finding(
+            rule_id="javascript.lang.security.audit.path-traversal",
+            path="synthetic_app/files.js",
+            line_start=22,
+            cwe_ids=(),
+        )
+    ]
+
+    result = compare_codescan_alerts_with_findings(
+        repository=REPO,
+        alerts=alerts,
+        findings=findings,
+        normalizer=normalizer,
+        config=ParityConfig(line_window=LINE_WINDOW),
+    )
+
+    # Different vuln classes at the same location must NOT cross-match.
+    assert result.detection.true_positive_count == 0
+    assert result.detection.false_positive_count == 1
+    assert result.detection.false_negative_count == 1
+    assert result.unmatched >= 1
+
+
+# ---------------------------------------------------------------------------
+# (b) source/sink line drift + negative control
+# ---------------------------------------------------------------------------
+
+def test_line_drift_within_window_matches():
+    """Finding at alert_line + N (just inside the window) matches."""
+    alerts = [
+        _alert(
+            number=1,
+            rule_id="py/sql-injection",
+            path="synthetic_app/handlers.py",
+            start_line=30,
+            cwe_ids=("CWE-89",),
+        )
+    ]
+    findings = [
+        _finding(
+            rule_id="py/sql-injection",
+            path="synthetic_app/handlers.py",
+            line_start=32,  # +2 == N, just inside
+            cwe_ids=("CWE-89",),
+        )
+    ]
+
+    result = compare_codescan_alerts_with_findings(
+        repository=REPO,
+        alerts=alerts,
+        findings=findings,
+        normalizer=_normalizer(),
+        config=ParityConfig(line_window=LINE_WINDOW),
+    )
+
+    assert result.detection.true_positive_count == 1
+    assert result.detection.recall == 1.0
+
+
+def test_line_drift_without_window_goes_red():
+    """RED-PROOF: window 0 -> a +1 drift no longer matches."""
+    alerts = [
+        _alert(
+            number=1,
+            rule_id="py/sql-injection",
+            path="synthetic_app/handlers.py",
+            start_line=30,
+            cwe_ids=("CWE-89",),
+        )
+    ]
+    findings = [
+        _finding(
+            rule_id="py/sql-injection",
+            path="synthetic_app/handlers.py",
+            line_start=31,  # +1
+            cwe_ids=("CWE-89",),
+        )
+    ]
+
+    result = compare_codescan_alerts_with_findings(
+        repository=REPO,
+        alerts=alerts,
+        findings=findings,
+        normalizer=_normalizer(),
+        config=ParityConfig(line_window=0),  # exact-line only
+    )
+
+    assert result.detection.true_positive_count == 0
+    assert result.detection.false_negative_count == 1
+    assert result.detection.false_positive_count == 1
+
+
+def test_window_boundary_negative_control():
+    """One drift just inside N, one just outside; outside MUST NOT match.
+
+    With N=2: drift +2 (40 -> 42) MUST match; drift +3 (60 -> 63) MUST NOT.
+    A too-greedy window that matched both would fail the must-NOT assertion.
+    """
+    alerts = [
+        _alert(
+            number=1,
+            rule_id="py/sql-injection",
+            path="synthetic_app/inside.py",
+            start_line=40,
+            cwe_ids=("CWE-89",),
+        ),
+        _alert(
+            number=2,
+            rule_id="py/sql-injection",
+            path="synthetic_app/outside.py",
+            start_line=60,
+            cwe_ids=("CWE-89",),
+        ),
+    ]
+    findings = [
+        _finding(
+            rule_id="py/sql-injection",
+            path="synthetic_app/inside.py",
+            line_start=42,  # +2 in
+            cwe_ids=("CWE-89",),
+        ),
+        _finding(
+            rule_id="py/sql-injection",
+            path="synthetic_app/outside.py",
+            line_start=63,  # +3 out
+            cwe_ids=("CWE-89",),
+        ),
+    ]
+
+    result = compare_codescan_alerts_with_findings(
+        repository=REPO,
+        alerts=alerts,
+        findings=findings,
+        normalizer=_normalizer(),
+        config=ParityConfig(line_window=LINE_WINDOW),
+    )
+
+    assert result.detection.true_positive_count == 1
+    assert result.detection.false_negative_count == 1  # outside alert
+    assert result.detection.false_positive_count == 1  # outside finding
+
+
+def test_interval_overlap_matches_multiline_alert():
+    """alert start..end interval overlap counts as a match even with window 0."""
+    alerts = [
+        _alert(
+            number=1,
+            rule_id="py/sql-injection",
+            path="synthetic_app/multiline.py",
+            start_line=10,
+            end_line=14,
+            cwe_ids=("CWE-89",),
+        )
+    ]
+    findings = [
+        _finding(
+            rule_id="py/sql-injection",
+            path="synthetic_app/multiline.py",
+            line_start=13,  # inside interval, >0 from start
+            cwe_ids=("CWE-89",),
+        )
+    ]
+
+    result = compare_codescan_alerts_with_findings(
+        repository=REPO,
+        alerts=alerts,
+        findings=findings,
+        normalizer=_normalizer(),
+        config=ParityConfig(line_window=0),
+    )
+
+    assert result.detection.true_positive_count == 1
+
+
+# ---------------------------------------------------------------------------
+# (d) dismissed_reason: state-aware denominator + FP-oracle
+# ---------------------------------------------------------------------------
+
+def test_dismissed_false_positive_hit_is_precision_penalty():
+    """Finding on a dismissed/false-positive alert -> dismissed_fp_hit + FP."""
+    alerts = [
+        _alert(
+            number=1,
+            rule_id="py/sql-injection",
+            path="synthetic_app/open.py",
+            start_line=5,
+            cwe_ids=("CWE-89",),
+            state="open",
+        ),
+        _alert(
+            number=2,
+            rule_id="py/xss",
+            path="synthetic_app/dismissed.py",
+            start_line=8,
+            cwe_ids=("CWE-79",),
+            state="dismissed",
+            dismissed_reason="false positive",
+        ),
+    ]
+    findings = [
+        _finding(
+            rule_id="py/sql-injection",
+            path="synthetic_app/open.py",
+            line_start=5,
+            cwe_ids=("CWE-89",),
+        ),
+        # We (wrongly) surface a finding on the dismissed false-positive location.
+        _finding(
+            rule_id="py/xss",
+            path="synthetic_app/dismissed.py",
+            line_start=8,
+            cwe_ids=("CWE-79",),
+        ),
+    ]
+
+    result = compare_codescan_alerts_with_findings(
+        repository=REPO,
+        alerts=alerts,
+        findings=findings,
+        normalizer=_normalizer(),
+        config=ParityConfig(line_window=LINE_WINDOW),
+    )
+
+    # The open alert is detected (TP); the dismissed-FP hit is a precision penalty.
+    assert result.detection.true_positive_count == 1
+    assert result.dismissed_fp_hit == 1
+    assert result.detection.false_positive_count == 1
+    assert result.detection.precision == 0.5
+    # The dismissed alert is NOT in the recall denominator.
+    assert result.detection.false_negative_count == 0
+    assert result.detection.recall == 1.0
+
+
+def test_wont_fix_excluded_from_recall_denominator():
+    """A "won't fix" alert we do not detect must NOT punish recall (TP-non-blocking)."""
+    alerts = [
+        _alert(
+            number=1,
+            rule_id="py/sql-injection",
+            path="synthetic_app/open.py",
+            start_line=5,
+            cwe_ids=("CWE-89",),
+            state="open",
+        ),
+        _alert(
+            number=2,
+            rule_id="py/command-injection",
+            path="synthetic_app/wontfix.py",
+            start_line=12,
+            cwe_ids=("CWE-78",),
+            state="dismissed",
+            dismissed_reason="won't fix",
+        ),
+    ]
+    # We only detect the open one; the won't-fix alert is undetected.
+    findings = [
+        _finding(
+            rule_id="py/sql-injection",
+            path="synthetic_app/open.py",
+            line_start=5,
+            cwe_ids=("CWE-89",),
+        )
+    ]
+
+    result = compare_codescan_alerts_with_findings(
+        repository=REPO,
+        alerts=alerts,
+        findings=findings,
+        normalizer=_normalizer(),
+        config=ParityConfig(line_window=LINE_WINDOW),
+    )
+
+    # won't-fix is TP-non-blocking: not in recall denom, not a precision penalty.
+    assert result.detection.true_positive_count == 1
+    assert result.detection.false_negative_count == 0
+    assert result.detection.recall == 1.0
+    assert result.dismissed_fp_hit == 0
+
+
+def test_fixed_alert_in_recall_denominator():
+    """A ``fixed`` alert is positive truth (recall denominator)."""
+    alerts = [
+        _alert(
+            number=1,
+            rule_id="py/ssrf",
+            path="synthetic_app/fetch.py",
+            start_line=7,
+            cwe_ids=("CWE-918",),
+            state="fixed",
+        )
+    ]
+    # We do NOT detect it -> it must be a false negative.
+    result = compare_codescan_alerts_with_findings(
+        repository=REPO,
+        alerts=alerts,
+        findings=[],
+        normalizer=_normalizer(),
+        config=ParityConfig(line_window=LINE_WINDOW),
+    )
+
+    assert result.detection.false_negative_count == 1
+    assert result.detection.recall == 0.0
+
+
+# ---------------------------------------------------------------------------
+# CWE many-to-many: SQLi + XSS at same (file, window) each bind 1:1
+# ---------------------------------------------------------------------------
+
+def test_cwe_many_to_many_binds_one_to_one():
+    """Two different vulns at the same window each bind to the right alert."""
+    alerts = [
+        _alert(
+            number=1,
+            rule_id="py/sql-injection",
+            path="synthetic_app/mixed.py",
+            start_line=20,
+            cwe_ids=("CWE-89",),
+        ),
+        _alert(
+            number=2,
+            rule_id="py/xss",
+            path="synthetic_app/mixed.py",
+            start_line=21,
+            cwe_ids=("CWE-79",),
+        ),
+    ]
+    findings = [
+        _finding(
+            rule_id="py/xss",
+            path="synthetic_app/mixed.py",
+            line_start=21,
+            cwe_ids=("CWE-79",),
+        ),
+        _finding(
+            rule_id="py/sql-injection",
+            path="synthetic_app/mixed.py",
+            line_start=20,
+            cwe_ids=("CWE-89",),
+        ),
+    ]
+
+    result = compare_codescan_alerts_with_findings(
+        repository=REPO,
+        alerts=alerts,
+        findings=findings,
+        normalizer=_normalizer(),
+        config=ParityConfig(line_window=LINE_WINDOW),
+    )
+
+    # Both bind 1:1, no cross-match, no leftover FP/FN.
+    assert result.detection.true_positive_count == 2
+    assert result.detection.false_positive_count == 0
+    assert result.detection.false_negative_count == 0
+    assert result.matched_by_cwe == 2
+
+
+# ---------------------------------------------------------------------------
+# meta-metrics: CWE-deficit + by-rule-token rescue rate
+# ---------------------------------------------------------------------------
+
+def test_meta_metrics_cwe_deficit_and_rescue_rate():
+    """A CWE-absent rule-token rescue raises both meta rates above 0."""
+    alerts = [
+        _alert(
+            number=1,
+            rule_id="py/sql-injection",
+            path="synthetic_app/a.py",
+            start_line=10,
+            cwe_ids=("CWE-89",),
+        ),
+        _alert(
+            number=2,
+            rule_id="js/path-injection",
+            path="synthetic_app/b.js",
+            start_line=20,
+            cwe_ids=(),  # CWE-deficient
+        ),
+    ]
+    findings = [
+        _finding(
+            rule_id="py/sql-injection",
+            path="synthetic_app/a.py",
+            line_start=10,
+            cwe_ids=("CWE-89",),
+        ),
+        _finding(
+            rule_id="javascript.lang.security.audit.path-traversal",
+            path="synthetic_app/b.js",
+            line_start=20,
+            cwe_ids=(),  # CWE-deficient
+        ),
+    ]
+
+    result = compare_codescan_alerts_with_findings(
+        repository=REPO,
+        alerts=alerts,
+        findings=findings,
+        normalizer=_normalizer(),
+        config=ParityConfig(line_window=LINE_WINDOW),
+    )
+
+    assert result.matched_by_cwe == 1
+    assert result.matched_by_rule_token == 1
+    # one of two truth alerts lacks a bridgeable CWE.
+    assert result.cwe_deficit_rate == 0.5
+    # one of two matches was rescued by rule-token.
+    assert result.rule_token_rescue_rate == 0.5
+
+
+# ---------------------------------------------------------------------------
+# fixture: provenance fail-closed + end-to-end adversarial snapshot
+# ---------------------------------------------------------------------------
+
+def test_provenance_marker_required_fail_closed(tmp_path):
+    """A snapshot without source: synthetic must fail closed."""
+    bad = tmp_path / "no-provenance.json"
+    bad.write_text(
+        '{"repoFullName": "synthetic-org/x", "alerts": [], "findings": []}',
+        encoding="utf-8",
+    )
+
+    with pytest.raises(ValueError, match="synthetic"):
+        load_codescan_snapshot(bad)
+
+
+def test_committed_fixture_loads_and_matches():
+    """End-to-end over the committed adversarial snapshot.
+
+    The fixture is engineered so normalization (by-cwe + by-rule-token), the
+    line-window, and the state filter produce a clean, high-recall picture, with
+    one dismissed-FP hit (precision penalty) and one won't-fix alert excluded
+    from the recall denominator.
+    """
+    snapshot = load_codescan_snapshot(FIXTURE)
+    assert snapshot.source == "synthetic"
+    assert snapshot.repo_full_name == REPO
+
+    result = compare_codescan_alerts_with_findings(
+        repository=snapshot.repo_full_name,
+        alerts=snapshot.alerts,
+        findings=snapshot.findings,
+        normalizer=_normalizer(),
+        config=ParityConfig(line_window=LINE_WINDOW),
+    )
+
+    assert isinstance(result.detection, VulnerabilityEvaluationResult)
+    # Positive-truth alerts (open + fixed) are all detected -> perfect recall.
+    assert result.detection.recall == 1.0
+    # by-cwe and by-rule-token tiers both exercised.
+    assert result.matched_by_cwe >= 1
+    assert result.matched_by_rule_token >= 1
+    # one dismissed false-positive location we surfaced -> precision penalty.
+    assert result.dismissed_fp_hit == 1
+    assert result.detection.false_positive_count >= 1
+    assert result.detection.precision < 1.0
+    # meta-metrics exposed.
+    assert 0.0 <= result.cwe_deficit_rate <= 1.0
+    assert 0.0 <= result.rule_token_rescue_rate <= 1.0
+
+
+def test_fixture_states_drive_red_when_filter_disabled():
+    """RED-PROOF over the fixture: counting dismissed alerts as truth drops recall."""
+    snapshot = load_codescan_snapshot(FIXTURE)
+
+    with_filter = compare_codescan_alerts_with_findings(
+        repository=snapshot.repo_full_name,
+        alerts=snapshot.alerts,
+        findings=snapshot.findings,
+        normalizer=_normalizer(),
+        config=ParityConfig(line_window=LINE_WINDOW),
+    )
+    without_filter = compare_codescan_alerts_with_findings(
+        repository=snapshot.repo_full_name,
+        alerts=snapshot.alerts,
+        findings=snapshot.findings,
+        normalizer=_normalizer(),
+        # state filter OFF: every state counts as positive truth.
+        config=ParityConfig(
+            line_window=LINE_WINDOW,
+            positive_truth_states=("open", "fixed", "dismissed"),
+        ),
+    )
+
+    assert with_filter.detection.recall == 1.0
+    assert without_filter.detection.recall < 1.0
diff --git a/tests/test_governance_vuln_parity_slo.py b/tests/test_governance_vuln_parity_slo.py
new file mode 100644
index 0000000..86285f2
--- /dev/null
+++ b/tests/test_governance_vuln_parity_slo.py
@@ -0,0 +1,379 @@
+"""Tests for the M3 vuln code-scanning parity SLO gate (report-only until threshold).
+
+Mirrors ``tests/test_governance_parity_slo.py`` for the code-vulnerability domain.
+Exercises the three documented modes — report-only (no threshold), enforce
+(threshold committed), and stale-degraded (snapshot too old) — plus the
+provenance fail-closed guard on the snapshot input. All mutation fixtures are
+synthetic and written into a tmp dir so the committed corpus is never the subject;
+a separate test asserts the COMMITTED corpus runs clean in report-only.
+"""
+
+from __future__ import annotations
+
+import datetime as dt
+import json
+from pathlib import Path
+
+import pytest
+
+from governance.vuln_parity_slo import (
+    discover_snapshots,
+    evaluate_vuln_parity_slo,
+    load_thresholds,
+    main,
+    render_report,
+)
+
+NOW = dt.datetime(2026, 6, 21, 12, 0, tzinfo=dt.timezone.utc)
+
+
+def _snapshot_dict(
+    *,
+    repo: str = "synthetic-org/synthetic-codescan-repo",
+    fetched_at: str = "2026-06-20T12:00:00+00:00",
+    matched: bool = True,
+    extra_alerts: list[dict] | None = None,
+    extra_findings: list[dict] | None = None,
+) -> dict:
+    # One open SQLi alert (CWE-89) and (when matched) one local finding at the
+    # same location/CWE-class, so per-repo precision/recall = 1.0; when not
+    # matched the finding is dropped so recall drops (used for the enforce-fail
+    # case).
+    findings: list[dict] = []
+    if matched:
+        findings = [
+            {
+                "ruleId": "python.lang.security.audit.sql-injection",
+                "sourceTool": "semgrep",
+                "cweIds": ["CWE-89"],
+                "filePath": "synthetic_app/handlers.py",
+                "lineStart": 10,
+            }
+        ]
+    alerts = [
+        {
+            "alertNumber": 1,
+            "ruleId": "py/sql-injection",
+            "securitySeverityLevel": "high",
+            "cweIds": ["CWE-89"],
+            "state": "open",
+            "filePath": "synthetic_app/handlers.py",
+            "lineStart": 10,
+            "lineEnd": 10,
+        }
+    ]
+    if extra_alerts:
+        alerts.extend(extra_alerts)
+    if extra_findings:
+        findings.extend(extra_findings)
+    return {
+        "schemaVersion": 1,
+        "source": "synthetic",
+        "repoFullName": repo,
+        "fetchedAt": fetched_at,
+        "alerts": alerts,
+        "findings": findings,
+    }
+
+
+def _write_snapshot(
+    directory: Path, data: dict, name: str = "synthetic-snapshot.json"
+) -> Path:
+    directory.mkdir(parents=True, exist_ok=True)
+    path = directory / name
+    path.write_text(json.dumps(data), encoding="utf-8")
+    return path
+
+
+# --------------------------------------------------------------------------- #
+# report-only (no threshold)                                                  #
+# --------------------------------------------------------------------------- #
+
+
+def test_report_only_when_no_threshold_file(tmp_path):
+    snap_dir = tmp_path / "corpus"
+    _write_snapshot(snap_dir, _snapshot_dict())
+
+    result = evaluate_vuln_parity_slo(
+        snapshot_dir=snap_dir,
+        threshold_path=tmp_path / "absent.yml",
+        now=NOW,
+    )
+
+    assert result.mode == "report-only"
+    assert result.passed is True  # report-only NEVER blocks
+    assert result.macro.macro_precision == 1.0
+    assert result.macro.macro_recall == 1.0
+
+
+def test_report_only_passes_even_when_below_would_be_target(tmp_path):
+    # A recall miss (unmatched) in report-only still exits 0: there is no
+    # committed target to enforce yet (measure-first).
+    snap_dir = tmp_path / "corpus"
+    _write_snapshot(snap_dir, _snapshot_dict(matched=False))
+
+    result = evaluate_vuln_parity_slo(
+        snapshot_dir=snap_dir, threshold_path=tmp_path / "absent.yml", now=NOW
+    )
+
+    assert result.mode == "report-only"
+    assert result.macro.macro_recall < 1.0
+    assert result.passed is True
+
+
+def test_empty_threshold_file_is_report_only(tmp_path):
+    snap_dir = tmp_path / "corpus"
+    _write_snapshot(snap_dir, _snapshot_dict())
+    threshold = tmp_path / "thresholds.yml"
+    threshold.write_text("", encoding="utf-8")
+
+    assert load_thresholds(threshold) is None
+    result = evaluate_vuln_parity_slo(
+        snapshot_dir=snap_dir, threshold_path=threshold, now=NOW
+    )
+    assert result.mode == "report-only"
+
+
+# --------------------------------------------------------------------------- #
+# enforce (threshold committed)                                               #
+# --------------------------------------------------------------------------- #
+
+
+def test_enforce_passes_when_macro_meets_threshold(tmp_path):
+    snap_dir = tmp_path / "corpus"
+    _write_snapshot(snap_dir, _snapshot_dict())
+    threshold = tmp_path / "thresholds.yml"
+    threshold.write_text("precision_min: 0.9\nrecall_min: 0.9\n", encoding="utf-8")
+
+    result = evaluate_vuln_parity_slo(
+        snapshot_dir=snap_dir, threshold_path=threshold, now=NOW
+    )
+
+    assert result.mode == "enforce"
+    assert result.passed is True
+    assert result.failures == ()
+
+
+def test_enforce_fails_when_macro_below_threshold(tmp_path):
+    snap_dir = tmp_path / "corpus"
+    _write_snapshot(snap_dir, _snapshot_dict(matched=False))  # recall < 1.0
+    threshold = tmp_path / "thresholds.yml"
+    threshold.write_text("precision_min: 0.9\nrecall_min: 0.99\n", encoding="utf-8")
+
+    result = evaluate_vuln_parity_slo(
+        snapshot_dir=snap_dir, threshold_path=threshold, now=NOW
+    )
+
+    assert result.mode == "enforce"
+    assert result.passed is False
+    assert any("recall" in f for f in result.failures)
+
+
+# --------------------------------------------------------------------------- #
+# stale-degraded (snapshot too old)                                           #
+# --------------------------------------------------------------------------- #
+
+
+def test_stale_in_report_only_warns_but_passes(tmp_path):
+    snap_dir = tmp_path / "corpus"
+    _write_snapshot(
+        snap_dir, _snapshot_dict(fetched_at="2025-01-01T00:00:00+00:00")
+    )
+
+    result = evaluate_vuln_parity_slo(
+        snapshot_dir=snap_dir,
+        threshold_path=tmp_path / "absent.yml",
+        now=NOW,
+        max_age_days=90,
+    )
+
+    assert result.stale is True
+    assert result.mode == "report-only"
+    assert result.passed is True  # surfaced, not silently passed, not blocking
+
+
+def test_stale_in_enforce_fails_not_silent_pass(tmp_path):
+    # design staleness-passive-only: a stale snapshot must NOT silently satisfy
+    # an enforcing gate even when the numbers look fine.
+    snap_dir = tmp_path / "corpus"
+    _write_snapshot(
+        snap_dir, _snapshot_dict(fetched_at="2025-01-01T00:00:00+00:00")
+    )
+    threshold = tmp_path / "thresholds.yml"
+    threshold.write_text("precision_min: 0.9\nrecall_min: 0.9\n", encoding="utf-8")
+
+    result = evaluate_vuln_parity_slo(
+        snapshot_dir=snap_dir, threshold_path=threshold, now=NOW, max_age_days=90
+    )
+
+    assert result.stale is True
+    assert result.mode == "enforce"
+    assert result.passed is False
+    assert any("stale-degraded" in f for f in result.failures)
+
+
+def test_missing_fetched_at_is_treated_as_stale(tmp_path):
+    snap_dir = tmp_path / "corpus"
+    data = _snapshot_dict()
+    del data["fetchedAt"]
+    _write_snapshot(snap_dir, data)
+
+    result = evaluate_vuln_parity_slo(
+        snapshot_dir=snap_dir, threshold_path=tmp_path / "absent.yml", now=NOW
+    )
+    assert result.stale is True
+
+
+# --------------------------------------------------------------------------- #
+# provenance fail-closed                                                      #
+# --------------------------------------------------------------------------- #
+
+
+def test_non_synthetic_snapshot_fails_closed(tmp_path):
+    snap_dir = tmp_path / "corpus"
+    data = _snapshot_dict()
+    data["source"] = "real"  # not synthetic -> load must fail closed
+    _write_snapshot(snap_dir, data)
+
+    with pytest.raises(Exception):
+        evaluate_vuln_parity_slo(
+            snapshot_dir=snap_dir, threshold_path=tmp_path / "absent.yml", now=NOW
+        )
+
+
+# --------------------------------------------------------------------------- #
+# dismissed_fp_hit and won't-fix surfacing                                    #
+# --------------------------------------------------------------------------- #
+
+
+def test_dismissed_fp_hit_and_wont_fix_surface_in_report(tmp_path):
+    """A finding on a dismissed-FP alert is a precision penalty surfaced in the
+    report; a won't-fix alert is excluded from recall with no penalty.
+    """
+    snap_dir = tmp_path / "corpus"
+    data = _snapshot_dict(
+        extra_alerts=[
+            {
+                "alertNumber": 2,
+                "ruleId": "py/xss",
+                "cweIds": ["CWE-79"],
+                "state": "dismissed",
+                "dismissedReason": "false positive",
+                "filePath": "synthetic_app/legacy.py",
+                "lineStart": 55,
+                "lineEnd": 55,
+            },
+            {
+                "alertNumber": 3,
+                "ruleId": "py/command-injection",
+                "cweIds": ["CWE-78"],
+                "state": "dismissed",
+                "dismissedReason": "won't fix",
+                "filePath": "synthetic_app/ops.py",
+                "lineStart": 70,
+                "lineEnd": 70,
+            },
+        ],
+        extra_findings=[
+            {
+                "ruleId": "py/xss",
+                "sourceTool": "codeql",
+                "cweIds": ["CWE-79"],
+                "filePath": "synthetic_app/legacy.py",
+                "lineStart": 55,
+            }
+        ],
+    )
+    _write_snapshot(snap_dir, data)
+
+    result = evaluate_vuln_parity_slo(
+        snapshot_dir=snap_dir, threshold_path=tmp_path / "absent.yml", now=NOW
+    )
+    report = render_report(result)
+
+    # The dismissed-FP hit is exercised (the codeql/xss finding lands on the
+    # dismissed false-positive alert) -> precision penalty + dismissed_fp_hit.
+    assert result.total_dismissed_fp_hit >= 1
+    assert "Dismissed-FP hit" in report
+    # won't-fix alert is excluded from the recall denominator: recall stays 1.0
+    # (the open SQLi alert is still matched).
+ assert result.macro.macro_recall == 1.0 + assert result.macro.macro_precision < 1.0 + + +# --------------------------------------------------------------------------- # +# CLI exit codes + committed corpus # +# --------------------------------------------------------------------------- # + + +def test_cli_check_report_only_exits_zero(tmp_path, capsys): + snap_dir = tmp_path / "corpus" + _write_snapshot(snap_dir, _snapshot_dict()) + + code = main( + [ + "--check", + "--snapshot-dir", + str(snap_dir), + "--threshold-path", + str(tmp_path / "absent.yml"), + ] + ) + out = capsys.readouterr().out + assert code == 0 + assert "report-only" in out + + +def test_cli_json_report_only(tmp_path, capsys): + snap_dir = tmp_path / "corpus" + _write_snapshot(snap_dir, _snapshot_dict()) + + code = main( + [ + "--check", + "--json", + "--snapshot-dir", + str(snap_dir), + "--threshold-path", + str(tmp_path / "absent.yml"), + ] + ) + out = capsys.readouterr().out + assert code == 0 + payload = json.loads(out) + assert payload["mode"] == "report-only" + assert payload["passed"] is True + + +def test_committed_corpus_runs_report_only(tmp_path): + # The committed eval/codescan-parity-corpus snapshot must drive the gate in + # report-only with no committed thresholds (autonomous layer is always + # report-only). + result = evaluate_vuln_parity_slo(threshold_path=tmp_path / "absent.yml", now=NOW) + assert result.mode == "report-only" + assert result.snapshot_count >= 1 + assert result.passed is True + + +def test_committed_corpus_cli_check_exits_zero(): + # The autonomous acceptance check: ``--check`` on the committed corpus exits 0 + # in report-only with no thresholds present. + code = main(["--check"]) + assert code == 0 + + +def test_committed_corpus_exercises_dismissed_fp_hit(tmp_path): + # Post-M2 review #3: dismissed_fp_hit must not be dark-launched. The committed + # fixture has a dismissed false-positive alert that a finding lands on. 
+ result = evaluate_vuln_parity_slo(threshold_path=tmp_path / "absent.yml", now=NOW) + assert result.total_dismissed_fp_hit >= 1 + assert "Dismissed-FP hit" in render_report(result) + + +def test_discover_snapshots_is_deterministic(tmp_path): + snap_dir = tmp_path / "corpus" + _write_snapshot(snap_dir, _snapshot_dict(), name="b-snapshot.json") + _write_snapshot(snap_dir, _snapshot_dict(), name="a-snapshot.json") + + found = discover_snapshots(snap_dir) + assert [p.name for p in found] == ["a-snapshot.json", "b-snapshot.json"] diff --git a/tests/test_vulnerability_corpus_normalized.py b/tests/test_vulnerability_corpus_normalized.py new file mode 100644 index 0000000..3edc7aa --- /dev/null +++ b/tests/test_vulnerability_corpus_normalized.py @@ -0,0 +1,243 @@ +"""M2 synthetic 5-class corpus + normalization-aware evaluation path tests. + +Two boundaries are under test: + +1. **5-class corpus (design VD-07)** — ``eval/synthetic-code-vuln/`` covers + SQLi / XSS / path-traversal / command-injection / SSRF, each with a + vulnerable case (expected finding) AND a safe case (must NOT be flagged so it + exercises precision). Evaluating the corpus' own findings against its own + expected list yields recall>=0.99 and precision>=0.90 (the existing + ``VulnerabilityEvaluationThresholds``). + +2. **Normalization-aware path (design §E)** — ``evaluate_vulnerability_findings_ + normalized`` reuses the M1 ``RuleClassNormalizer`` + line-window so a + CodeQL-style ``ruleId`` and a Semgrep-style ``ruleId`` for the SAME class + still match. CRUCIALLY this is a NEW function: the existing + ``evaluate_vulnerability_findings`` exact-key behavior is untouched, proven by + a contrast test where the exact-key path splits the same pair into FP+FN. + +The adversarial out-of-rule pair (a CWE class with no bridge + no rule-token) +intentionally goes recall<1 on the normalized path — that red is CORRECT and is +asserted as an expected-fail so a future silent normalization regression is +caught. 
+""" + +from __future__ import annotations + +import json +from pathlib import Path + +from security_scanner.core.vulnerability.codescan import RuleClassNormalizer +from security_scanner.core.vulnerability.evaluation import ( + VulnerabilityEvaluationThresholds, + evaluate_vulnerability_findings, + evaluate_vulnerability_findings_normalized, + evaluate_vulnerability_gate, + load_vulnerability_corpus_normalized, +) +from security_scanner.core.vulnerability.model import ( + VulnerabilityFinding, + VulnerabilityLocation, + compute_vulnerability_finding_id, +) + +CORPUS = ( + Path(__file__).resolve().parents[1] + / "eval" + / "synthetic-code-vuln" + / "corpus-snapshot.json" +) + +FIVE_CLASSES = { + "sql-injection", + "xss", + "path-traversal", + "command-injection", + "ssrf", +} + + +def _load_corpus() -> dict: + return json.loads(CORPUS.read_text(encoding="utf-8")) + + +def _finding_from_dict(item: dict) -> VulnerabilityFinding: + file_path = str(item["filePath"]) + line_start = int(item["lineStart"]) + rule_id = str(item["ruleId"]) + source_tool = str(item.get("sourceTool", "semgrep")) + cwe_ids = tuple(str(c) for c in item.get("cweIds", [])) + finding_id = compute_vulnerability_finding_id( + source_tool=source_tool, + rule_id=rule_id, + partial_fingerprints=None, + file_path=file_path, + line_start=line_start, + message="synthetic finding", + ) + return VulnerabilityFinding( + finding_id=finding_id, + rule_id=rule_id, + message="synthetic finding", + primary_location=VulnerabilityLocation( + file_path=file_path, + line_start=line_start, + line_end=item.get("lineEnd"), + ), + source_tool=source_tool, + cwe_ids=cwe_ids, + ) + + +# --------------------------------------------------------------------------- +# Corpus shape: provenance + 5 classes + safe cases +# --------------------------------------------------------------------------- + + +def test_corpus_provenance_is_synthetic(): + data = _load_corpus() + assert str(data["source"]).strip().lower() == "synthetic" + + 
+def test_corpus_covers_five_cwe_classes(): + data = _load_corpus() + classes = {str(c["vulnClass"]) for c in data["expectedFindings"]} + assert classes == FIVE_CLASSES + + +def test_corpus_has_safe_cases_per_class(): + """Each class has at least one safe case that must NOT be flagged.""" + data = _load_corpus() + safe_classes = {str(c["vulnClass"]) for c in data["safeCases"]} + assert FIVE_CLASSES <= safe_classes + + +def test_corpus_paths_are_synthetic(): + data = _load_corpus() + paths = [c["filePath"] for c in data["expectedFindings"]] + paths += [c["filePath"] for c in data["safeCases"]] + paths += [c["filePath"] for c in data["actualFindings"]] + assert all(p.startswith("synthetic_app/") for p in paths) + + +# --------------------------------------------------------------------------- +# Normalized corpus evaluation: recall>=0.99 / precision>=0.90 +# --------------------------------------------------------------------------- + + +def test_normalized_corpus_meets_recall_and_precision_slo(): + """The corpus' own findings hit recall>=0.99 + precision>=0.90 (default gate). + + The actual findings deliberately use a DIFFERENT tool dialect than expected + for at least one class, so only the normalization-aware path can match them. + Safe-case findings are excluded (none flagged), so precision stays high. + """ + expected, actual = load_vulnerability_corpus_normalized(CORPUS) + result = evaluate_vulnerability_findings_normalized( + expected, actual, normalizer=RuleClassNormalizer() + ) + gate = evaluate_vulnerability_gate(result, VulnerabilityEvaluationThresholds()) + assert gate.passed, gate.reason + assert result.recall >= 0.99 + assert result.precision >= 0.90 + + +def test_normalized_path_matches_cross_dialect_pair(): + """CodeQL-style actual ruleId matches a Semgrep-style expected ruleId.""" + expected = load_vulnerability_corpus_normalized(CORPUS)[0] + # A CodeQL-style finding for SQLi at the canonical line. 
+ actual = [ + _finding_from_dict( + { + "filePath": "synthetic_app/handlers.py", + "lineStart": 42, + "ruleId": "py/sql-injection", + "sourceTool": "codeql", + "cweIds": ["CWE-89"], + } + ) + ] + result = evaluate_vulnerability_findings_normalized( + expected, actual, normalizer=RuleClassNormalizer() + ) + # At least the SQLi pair is a true positive via the normalizer. + assert result.true_positive_count >= 1 + + +def test_exact_key_path_splits_cross_dialect_pair(): + """CONTRAST: the EXISTING exact-key path does NOT match the cross-dialect pair. + + This pins the boundary: the normalization-aware path is additive; the legacy + ``evaluate_vulnerability_findings`` exact-key behavior is unchanged (a + different ruleId => FP + FN, not a TP). + """ + from security_scanner.core.vulnerability.evaluation import ( + VulnerabilityExpectedFinding, + ) + + expected = [ + VulnerabilityExpectedFinding( + file_path="synthetic_app/handlers.py", + line_start=42, + rule_id="python.lang.security.audit.sql-injection", + ) + ] + actual = [ + _finding_from_dict( + { + "filePath": "synthetic_app/handlers.py", + "lineStart": 42, + "ruleId": "py/sql-injection", + "sourceTool": "codeql", + "cweIds": ["CWE-89"], + } + ) + ] + result = evaluate_vulnerability_findings(expected, actual) + # Exact-key mismatch: ruleId differs => no TP, one FP and one FN. + assert result.true_positive_count == 0 + assert result.false_positive_count == 1 + assert result.false_negative_count == 1 + + +# --------------------------------------------------------------------------- +# Adversarial out-of-rule pair: recall<1 is INTENDED (kept as a red guard) +# --------------------------------------------------------------------------- + + +def test_out_of_rule_class_recall_below_one_is_intended(): + """An out-of-rule CWE (no bridge, no rule-token) is an INTENDED miss. 
+ + design §F: an independently-authored adversarial vuln whose class is NOT in + the normalizer must NOT be rescued — recall<1 here is the correct red and is + asserted so a future over-broad normalizer (silently matching it) is caught. + """ + from security_scanner.core.vulnerability.evaluation import ( + VulnerabilityExpectedFinding, + ) + + # An out-of-rule class: deserialization (CWE-502) — no bridge row, opaque token. + expected = [ + VulnerabilityExpectedFinding( + file_path="synthetic_app/out_of_rule.py", + line_start=7, + rule_id="py/unsafe-deserialization", + ) + ] + actual = [ + _finding_from_dict( + { + "filePath": "synthetic_app/out_of_rule.py", + "lineStart": 7, + "ruleId": "python.lang.security.audit.pickle-load", + "sourceTool": "semgrep", + "cweIds": ["CWE-502"], + } + ) + ] + result = evaluate_vulnerability_findings_normalized( + expected, actual, normalizer=RuleClassNormalizer() + ) + # No bridge + non-matching rule tokens => the pair does NOT match. + assert result.recall < 1.0 + assert result.false_negative_count == 1 diff --git a/tests/test_vulnerability_gate_tier.py b/tests/test_vulnerability_gate_tier.py new file mode 100644 index 0000000..46c5ef5 --- /dev/null +++ b/tests/test_vulnerability_gate_tier.py @@ -0,0 +1,220 @@ +"""M2 inline cheap FP-suppression tier tests (gate-layer ONLY). + +The #1 acceptance constraint (design §K, stop-condition +``existing-secret-default-behavior-change``) is that the EXISTING default gate +behavior must not change. So the first test is the default-invariance canary: +``evaluate_vulnerability_gate_policy`` with default thresholds (all new opt-in +flags OFF) produces EXACTLY today's verdict. + +The inline tier adds two opt-in signals to ``VulnerabilityGateThresholds``, both +DEFAULT OFF: + +- ``require_trace`` — a finding with ``code_flow_count == 0`` (no data-flow + reachability evidence) is treated as non-blocking. A finding WITH a trace + keeps blocking. 
+- ``suppress_rules`` — a frozenset of canonical vuln *classes* (reusing the M1 + ``RuleClassNormalizer``) whose findings are treated as non-blocking + (low-confidence rule suppression). + +When both flags are off the policy is byte-identical to today: nothing in this +tier is on by default, so it cannot flip a currently-blocking finding. The +suppression-rate regression test proves a canary TP is never suppressed by +anything default-on. +""" + +from __future__ import annotations + +from security_scanner.core.vulnerability.gate import ( + VulnerabilityGateThresholds, + evaluate_vulnerability_gate_policy, +) +from security_scanner.core.vulnerability.model import ( + VulnerabilityFinding, + VulnerabilityLocation, +) + + +def _finding(**overrides) -> VulnerabilityFinding: + defaults = dict( + finding_id="vuln_canary", + source_tool="semgrep", + rule_id="python.lang.security.audit.sql-injection", + message="Potential SQL injection.", + severity="HIGH", + precision="HIGH", + cwe_ids=("CWE-89",), + code_flow_count=1, + primary_location=VulnerabilityLocation( + file_path="synthetic_app/handlers.py", + line_start=42, + ), + ) + defaults.update(overrides) + return VulnerabilityFinding(**defaults) + + +# --------------------------------------------------------------------------- +# Default-invariance canary (MUST stay green — write FIRST) +# --------------------------------------------------------------------------- + + +def test_default_thresholds_block_high_high_finding_unchanged(): + """A HIGH/HIGH finding still blocks under the existing default policy.""" + result = evaluate_vulnerability_gate_policy([_finding()]) + assert result.passed is False + assert result.blocking_count == 1 + + +def test_default_thresholds_nonblock_info_low_unchanged(): + """INFO/LOW + UNKNOWN precision is still non-blocking (existing default).""" + findings = [ + _finding(finding_id="v_info", severity="INFO", precision="UNKNOWN"), + _finding(finding_id="v_low", severity="LOW", 
precision="LOW"), + ] + result = evaluate_vulnerability_gate_policy(findings) + assert result.passed is True + assert result.blocking_count == 0 + + +def test_default_ignores_code_flow_count_and_rule_class(): + """With flags OFF, a HIGH/HIGH finding blocks regardless of trace count. + + Proves the new signals are inert by default — a HIGH/HIGH finding with NO + trace (``code_flow_count == 0``) still blocks under the default policy, so + no default-on change can have silently suppressed it. + """ + no_trace = _finding(finding_id="v_no_trace", code_flow_count=0) + result = evaluate_vulnerability_gate_policy([no_trace]) + assert result.passed is False + assert result.blocking_count == 1 + + +def test_explicit_default_thresholds_equal_implicit(): + """Constructing default thresholds explicitly equals passing None.""" + findings = [_finding(), _finding(finding_id="v2", severity="LOW")] + implicit = evaluate_vulnerability_gate_policy(findings) + explicit = evaluate_vulnerability_gate_policy( + findings, VulnerabilityGateThresholds() + ) + assert implicit == explicit + + +def test_new_flags_default_off(): + """The new opt-in flags default OFF on the dataclass.""" + policy = VulnerabilityGateThresholds() + assert policy.require_trace is False + assert policy.suppress_rules == frozenset() + + +# --------------------------------------------------------------------------- +# Inline tier (gated): require_trace +# --------------------------------------------------------------------------- + + +def test_require_trace_suppresses_high_finding_with_no_trace(): + """With ``require_trace`` ON, a HIGH finding with no trace is non-blocking.""" + no_trace = _finding(finding_id="v_no_trace", code_flow_count=0) + gated = evaluate_vulnerability_gate_policy( + [no_trace], VulnerabilityGateThresholds(require_trace=True) + ) + assert gated.passed is True + assert gated.blocking_count == 0 + + +def test_require_trace_keeps_high_finding_with_trace_blocking(): + """``require_trace`` does NOT 
suppress a finding that HAS a data-flow trace.""" + with_trace = _finding(finding_id="v_trace", code_flow_count=2) + gated = evaluate_vulnerability_gate_policy( + [with_trace], VulnerabilityGateThresholds(require_trace=True) + ) + assert gated.passed is False + assert gated.blocking_count == 1 + + +def test_require_trace_off_keeps_no_trace_finding_blocking(): + """Flag OFF (default): the no-trace HIGH finding still blocks.""" + no_trace = _finding(finding_id="v_no_trace", code_flow_count=0) + default = evaluate_vulnerability_gate_policy([no_trace]) + assert default.passed is False + assert default.blocking_count == 1 + + +# --------------------------------------------------------------------------- +# Inline tier (gated): suppress_rules (rule-class via M1 normalizer) +# --------------------------------------------------------------------------- + + +def test_suppress_rules_suppresses_matching_class(): + """A finding whose canonical class is suppressed is non-blocking when ON.""" + finding = _finding(rule_id="py/sql-injection", cwe_ids=("CWE-89",)) + gated = evaluate_vulnerability_gate_policy( + [finding], + VulnerabilityGateThresholds(suppress_rules=frozenset({"sql-injection"})), + ) + assert gated.passed is True + assert gated.blocking_count == 0 + + +def test_suppress_rules_canonicalizes_across_tool_dialects(): + """Both CodeQL- and Semgrep-style rule.ids fold onto the same class. + + Suppressing ``sql-injection`` must catch BOTH ``py/sql-injection`` and + ``python.lang.security.audit.sql-injection`` because they normalize via the + shared M1 ``RuleClassNormalizer`` onto one class. 
+ """ + codeql = _finding(finding_id="v_ql", rule_id="py/sql-injection", cwe_ids=()) + semgrep = _finding( + finding_id="v_sg", + rule_id="python.lang.security.audit.sql-injection", + cwe_ids=(), + ) + policy = VulnerabilityGateThresholds(suppress_rules=frozenset({"sql-injection"})) + gated = evaluate_vulnerability_gate_policy([codeql, semgrep], policy) + assert gated.passed is True + assert gated.blocking_count == 0 + + +def test_suppress_rules_does_not_touch_other_classes(): + """Suppressing one class does not suppress a different class.""" + xss = _finding(finding_id="v_xss", rule_id="py/xss", cwe_ids=("CWE-79",)) + policy = VulnerabilityGateThresholds(suppress_rules=frozenset({"sql-injection"})) + gated = evaluate_vulnerability_gate_policy([xss], policy) + assert gated.passed is False + assert gated.blocking_count == 1 + + +def test_suppress_rules_off_keeps_finding_blocking(): + """Flag OFF (default empty set): nothing suppressed.""" + finding = _finding(rule_id="py/sql-injection", cwe_ids=("CWE-89",)) + default = evaluate_vulnerability_gate_policy([finding]) + assert default.passed is False + assert default.blocking_count == 1 + + +# --------------------------------------------------------------------------- +# Safe-code finding stays non-blocking; canary TP preserved +# --------------------------------------------------------------------------- + + +def test_safe_code_finding_stays_non_blocking_in_all_modes(): + """A LOW/UNKNOWN 'safe-code' finding is non-blocking with or without flags.""" + safe = _finding(finding_id="v_safe", severity="LOW", precision="UNKNOWN") + for policy in ( + VulnerabilityGateThresholds(), + VulnerabilityGateThresholds(require_trace=True), + VulnerabilityGateThresholds(suppress_rules=frozenset({"sql-injection"})), + ): + result = evaluate_vulnerability_gate_policy([safe], policy) + assert result.passed is True + assert result.blocking_count == 0 + + +def test_canary_true_positive_never_suppressed_by_default_on(): + """A core canary TP 
(HIGH/HIGH, has a trace) blocks under the default policy. + + This is the suppression-rate regression assertion: a default-on change must + not raise the suppression rate of canary TPs. Since the default policy is + unchanged (no default-on suppression), the canary keeps blocking. + """ + canary = _finding(finding_id="v_canary_tp", code_flow_count=3) + assert evaluate_vulnerability_gate_policy([canary]).blocking_count == 1 diff --git a/tests/test_vulnerability_synthetic_regression_gate.py b/tests/test_vulnerability_synthetic_regression_gate.py new file mode 100644 index 0000000..ed36f66 --- /dev/null +++ b/tests/test_vulnerability_synthetic_regression_gate.py @@ -0,0 +1,127 @@ +"""M3 synthetic regression gate — ENFORCE (recall>=0.99 / precision>=0.90). + +This is the CI-enforced regression guard for the autonomous vuln-parity goal. +``uv run pytest`` (CI job ``ci/pytest``) runs these, so the gate is enforced as a +real test, not a report-only artifact: + +* the GREEN guard loads the committed 5-class synthetic corpus, runs the M2 + normalization-aware evaluation, and asserts the default + :class:`VulnerabilityEvaluationThresholds` gate (recall>=0.99, precision>=0.90) + PASSES. A regression that drops a true positive or adds a false positive turns + this red. +* the RED canary proves the enforcement is NOT vacuous: dropping one actual true + positive from the corpus makes the SAME gate FAIL (recall falls below 0.99), so + we know the gate would actually catch a real recall regression. + +Computation reuse: this exercises ONLY M2's +:func:`evaluate_vulnerability_findings_normalized` + +:func:`evaluate_vulnerability_gate` over the committed corpus. There is no new +precision/recall code here. 
+""" + +from __future__ import annotations + +from pathlib import Path + +from security_scanner.core.vulnerability.codescan import RuleClassNormalizer +from security_scanner.core.vulnerability.evaluation import ( + VulnerabilityEvaluationThresholds, + evaluate_vulnerability_findings_normalized, + evaluate_vulnerability_gate, + load_vulnerability_corpus_normalized, +) + +CORPUS = ( + Path(__file__).resolve().parents[1] + / "eval" + / "synthetic-code-vuln" + / "corpus-snapshot.json" +) + + +def test_synthetic_regression_gate_enforces_recall_and_precision_slo(): + """GREEN: the committed corpus passes the default enforce gate. + + recall>=0.99 and precision>=0.90 over the normalization-aware path. This is + the regression guard CI enforces via ``uv run pytest``. + """ + expected, actual = load_vulnerability_corpus_normalized(CORPUS) + result = evaluate_vulnerability_findings_normalized( + expected, actual, normalizer=RuleClassNormalizer() + ) + + gate = evaluate_vulnerability_gate(result, VulnerabilityEvaluationThresholds()) + + assert gate.passed, gate.reason + assert result.recall >= 0.99 + assert result.precision >= 0.90 + assert result.false_negative_count == 0 + + +def test_synthetic_regression_gate_is_not_vacuous_red_canary(): + """RED canary: drop one actual TP -> the SAME gate FAILS (recall regression). + + Proves the enforce gate above is real. If a future change silently dropped a + detector finding (or weakened the matcher), recall would fall below 0.99 and + the gate would block — exactly what this canary demonstrates by construction. + """ + expected, actual = load_vulnerability_corpus_normalized(CORPUS) + assert len(actual) >= 1 + + # Simulate a regression: one true-positive finding is no longer emitted. 
+ regressed_actual = actual[:-1] + + result = evaluate_vulnerability_findings_normalized( + expected, regressed_actual, normalizer=RuleClassNormalizer() + ) + + gate = evaluate_vulnerability_gate(result, VulnerabilityEvaluationThresholds()) + + assert gate.passed is False + assert result.recall < 0.99 + assert result.false_negative_count >= 1 + + +def test_synthetic_regression_gate_catches_false_positive_precision_regression(): + """RED canary (precision): an extra unmatched finding drops precision < 0.90. + + Complements the recall canary: a single spurious finding that matches no + expected class is a false positive, and with five expected TPs one extra FP + takes precision to 5/6 ~= 0.833 < 0.90, so the gate blocks. + """ + from security_scanner.core.vulnerability.model import ( + VulnerabilityFinding, + VulnerabilityLocation, + compute_vulnerability_finding_id, + ) + + expected, actual = load_vulnerability_corpus_normalized(CORPUS) + + spurious = VulnerabilityFinding( + finding_id=compute_vulnerability_finding_id( + source_tool="semgrep", + rule_id="py/sql-injection", + partial_fingerprints=None, + file_path="synthetic_app/spurious.py", + line_start=999, + message="synthetic finding", + ), + rule_id="py/sql-injection", + message="synthetic finding", + primary_location=VulnerabilityLocation( + file_path="synthetic_app/spurious.py", + line_start=999, + ), + source_tool="semgrep", + cwe_ids=("CWE-89",), + ) + + result = evaluate_vulnerability_findings_normalized( + expected, [*actual, spurious], normalizer=RuleClassNormalizer() + ) + + gate = evaluate_vulnerability_gate(result, VulnerabilityEvaluationThresholds()) + + assert gate.passed is False + assert result.precision < 0.90 + assert result.false_positive_count >= 1
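Reviewer note: the arithmetic behind the green guard and the two red canaries in `tests/test_vulnerability_synthetic_regression_gate.py` can be checked by hand. The sketch below is only an illustration, not the code in `core/vulnerability/evaluation.py`: `Counts` and `gate_passes` are hypothetical names, the thresholds are the ones quoted in the docstrings (precision>=0.90, recall>=0.99), and the five-true-positive baseline is taken from the precision canary's docstring. Treating an empty denominator as 1.0 is an assumption made here for convenience.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Hypothetical TP/FP/FN tally; not the project's result type."""

    true_positive: int
    false_positive: int
    false_negative: int

    @property
    def precision(self) -> float:
        denom = self.true_positive + self.false_positive
        # Assumption: with no findings at all, precision is treated as 1.0.
        return self.true_positive / denom if denom else 1.0

    @property
    def recall(self) -> float:
        denom = self.true_positive + self.false_negative
        # Assumption: with nothing expected, recall is treated as 1.0.
        return self.true_positive / denom if denom else 1.0


def gate_passes(
    counts: Counts, precision_min: float = 0.90, recall_min: float = 0.99
) -> bool:
    """Pass only when both thresholds from the docstrings are met."""
    return counts.precision >= precision_min and counts.recall >= recall_min


# Green guard: all five expected findings matched, nothing spurious.
green = Counts(true_positive=5, false_positive=0, false_negative=0)

# Recall canary: one finding dropped, so one TP becomes an FN (4/5).
recall_canary = Counts(true_positive=4, false_positive=0, false_negative=1)

# Precision canary: one spurious finding added (5/6).
precision_canary = Counts(true_positive=5, false_positive=1, false_negative=0)

for name, counts in [
    ("green", green),
    ("recall canary", recall_canary),
    ("precision canary", precision_canary),
]:
    print(
        f"{name}: precision={counts.precision:.3f} "
        f"recall={counts.recall:.3f} passed={gate_passes(counts)}"
    )
```

With only five expected findings, a single dropped or spurious finding moves the metric by at least a sixth, which is why each canary needs just one perturbed finding to cross its threshold.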