feat(sast): add SARIF-native code vulnerability path by pureliture · Pull Request #47 · source-security-dev/security-scanner

pureliture · 2026-06-20T01:11:40Z

Purpose & Motivation

Add the opt-in Phase 2a SARIF-native SAST product path so the scanner can normalize code vulnerability findings without changing the existing Gitleaks-first secret detection default.

No linked issue. This PR implements the approved long-goal scope from the in-thread Phase 2a research and architecture review packet.

Context

This keeps secret findings and VULN_FINDING separate. SARIF is the canonical interchange contract, and analyzer execution is limited to a Semgrep-compatible subprocess adapter that emits SARIF and feeds the importer. scan-vuln defaults to redacted path policy for real local checkouts, while import-sarif keeps synthetic paths by default for committed fixtures.

LLM support is verifier/explainer only. The vulnerability verifier uses redacted metadata and fail-closed parsing; it does not detect vulnerabilities or generate concrete code patches.

Note

Architecture review gates were mandatory and completed at pre-implementation, post-M2, post-M3, and final. Final gates initially blocked on SARIF metadata leakage risk and scan-vuln path policy; both were fixed before this PR.

Dependency

No new Python dependency is added. A Semgrep-compatible binary is only required when a user runs scan-vuln; SARIF import/report/gate/evaluate tests do not require it.

Verification

uv run pytest (783 passed)
uv run ruff check ... on changed code/test paths
uv run python -m governance.render --validate
uv run python -m governance.render --check
uv run python -m governance.rebuild_ledger_index --check
uv run python -m governance.render_github_ruleset --output governance/main_ruleset.json --check
uv run python -m governance.public_safety --diff origin/main...HEAD
uv run python -m governance.public_safety --path ... on changed public artifacts
uv run python -m governance.autopilot_gate --base origin/main
CLI smoke: import-sarif, report --category code-vuln, evaluate --category code-vuln, gate --category code-vuln --max 1

Checklist

This PR does not include secret values.
Existing secret detection defaults are preserved.
GHAS live fetch/upload/mutation is not introduced.
SCA/dependency scanning remains out of scope.

SARIF 기반 VULN_FINDING 모델과 JSONL 저장소, import-sarif/scan-vuln 명령, code-vuln report/gate/evaluate/verify 흐름을 추가한다. Semgrep-compatible adapter는 SARIF만 canonical importer로 넘기고, vulnerability verifier는 redacted metadata만 사용한다. Architecture review gate에서 발견된 SARIF metadata 누출 위험과 scan-vuln path policy 기본값을 수정하고 회귀 테스트를 추가했다. Co-Authored-By: Codex GPT-5 <noreply@openai.com>

+            "Do not request or reveal raw source snippets, repository names, hosts, "
+            "or paths.",


+            "Return strict JSON only with keys: label, confidence, reason, "
+            "remediation.",


gemini-code-assist

Code Review

This pull request implements Phase 2a SARIF-native SAST support as an opt-in extension (--category code-vuln), introducing a new VulnerabilityFinding domain model, a SARIF 2.1.0 importer with path redaction, JSONL artifact persistence, a Semgrep-compatible analyzer adapter, and a redacted-metadata-based LLM verifier. The reviewer's feedback focuses on enhancing robustness and error handling across these new modules. Key recommendations include wrapping JSON parsing, file reading, and model instantiation in defensive try...except blocks to prevent raw tracebacks, verifying the existence of the scan root directory, handling malformed UTF-8 characters during LLM response decoding, raising explicit errors for missing input files, and adding comprehensive unit tests for unexpected types in parsed JSON fields.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-20T01:14:31Z

+    if int(record.get("schemaVersion", 0)) != VULN_SCHEMA_VERSION:
+        raise ValueError(
+            f"line {line_number}: unsupported schemaVersion "
+            f"{record.get('schemaVersion')}"
+        )
+    finding = record.get("finding")
+    if not isinstance(finding, dict):
+        raise ValueError(f"line {line_number}: finding must be an object")
+    if finding.get("category", VULN_CATEGORY) != VULN_CATEGORY:
+        raise ValueError(f"line {line_number}: expected category {VULN_CATEGORY}")
+    return VulnerabilityFinding.from_dict(finding)


Wrap the parsing of schemaVersion and the instantiation of VulnerabilityFinding in a try...except block to catch KeyError, TypeError, and ValueError. This prevents unhandled exceptions from bubbling up to the CLI and causing raw tracebacks when encountering malformed JSONL records.

Suggested change

if int(record.get("schemaVersion", 0)) != VULN_SCHEMA_VERSION:

raise ValueError(

f"line {line_number}: unsupported schemaVersion "

f"{record.get('schemaVersion')}"

)

finding = record.get("finding")

if not isinstance(finding, dict):

raise ValueError(f"line {line_number}: finding must be an object")

if finding.get("category", VULN_CATEGORY) != VULN_CATEGORY:

raise ValueError(f"line {line_number}: expected category {VULN_CATEGORY}")

return VulnerabilityFinding.from_dict(finding)

try:

schema_version = int(record.get("schemaVersion") or 0)

except (ValueError, TypeError):

raise ValueError(f"line {line_number}: invalid schemaVersion")

if schema_version != VULN_SCHEMA_VERSION:

raise ValueError(

f"line {line_number}: unsupported schemaVersion {schema_version}"

)

finding = record.get("finding")

if not isinstance(finding, dict):

raise ValueError(f"line {line_number}: finding must be an object")

if finding.get("category", VULN_CATEGORY) != VULN_CATEGORY:

raise ValueError(f"line {line_number}: expected category {VULN_CATEGORY}")

try:

return VulnerabilityFinding.from_dict(finding)

except (KeyError, TypeError, ValueError) as exc:

raise ValueError(f"line {line_number}: invalid finding data: {exc}") from exc

gemini-code-assist · 2026-06-20T01:14:31Z

+    data = json.loads(Path(path).read_text(encoding="utf-8"))
+    schema_version = int(data.get("schemaVersion", 1))
+    if schema_version != 1:
+        raise ValueError(
+            f"unsupported vulnerability corpus schemaVersion: {schema_version}"
+        )
+    raw = data.get("expectedFindings", [])
+    if not isinstance(raw, list):
+        raise ValueError("vulnerability corpus expectedFindings must be a list")
+    return [VulnerabilityExpectedFinding.from_dict(item) for item in raw]


Ensure robustness when loading expected findings by wrapping the file reading, JSON parsing, and item instantiation in defensive try...except blocks. This prevents raw TypeError, KeyError, or json.JSONDecodeError tracebacks from being shown to the user.

try: content = Path(path).read_text(encoding="utf-8") except OSError as exc: raise ValueError(f"failed to read expected findings file: {exc}") from exc try: data = json.loads(content) except json.JSONDecodeError as exc: raise ValueError(f"invalid expected findings JSON: {exc}") from exc if not isinstance(data, dict): raise ValueError("expected findings must be a JSON object") try: schema_version = int(data.get("schemaVersion") or 1) except (ValueError, TypeError): raise ValueError("invalid schemaVersion in expected findings") if schema_version != 1: raise ValueError( f"unsupported vulnerability corpus schemaVersion: {schema_version}" ) raw = data.get("expectedFindings", []) if not isinstance(raw, list): raise ValueError("vulnerability corpus expectedFindings must be a list") findings = [] for idx, item in enumerate(raw, start=1): if not isinstance(item, dict): raise ValueError(f"expectedFindings item {idx} must be an object") try: findings.append(VulnerabilityExpectedFinding.from_dict(item)) except (KeyError, TypeError, ValueError) as exc: raise ValueError(f"invalid expectedFindings item {idx}: {exc}") from exc return findings

gemini-code-assist · 2026-06-20T01:14:31Z

+        root = Path(root)
+        output = Path(sarif_path)


Verify that the root path exists and is a directory before executing the subprocess. If the directory does not exist, subprocess.run will raise a FileNotFoundError which would misleadingly be reported as the binary not being found.

root = Path(root) if not root.is_dir(): raise SemgrepExecutionError(f"scan root directory does not exist: {root}") output = Path(sarif_path)

gemini-code-assist · 2026-06-20T01:14:31Z

+            method="POST",
+        )
+        with urllib.request.urlopen(request, timeout=timeout_seconds) as response:  # noqa: S310
+            return response.read().decode("utf-8")


Use errors="replace" or errors="ignore" when decoding the HTTP response from the Ollama/LLM server. This prevents the verifier from crashing with a UnicodeDecodeError if the response contains malformed UTF-8 characters.

Suggested change

return response.read().decode("utf-8")

return response.read().decode("utf-8", errors="replace")

gemini-code-assist · 2026-06-20T01:14:31Z

+    if input_path.resolve() == output_path.resolve():
+        raise ValueError("--output must be different from --findings")
+    findings = VulnerabilityJsonlStore(input_path).read_all()


Raise a FileNotFoundError if the input findings file does not exist. Silently returning an empty list and creating an empty verified file can mask user errors such as typos in the input path.

if input_path.resolve() == output_path.resolve(): raise ValueError("--output must be different from --findings") if not input_path.is_file(): raise FileNotFoundError(f"findings file not found: {input_path}") findings = VulnerabilityJsonlStore(input_path).read_all()

gemini-code-assist · 2026-06-20T01:14:31Z

+    try:
+        payload = json.loads(Path(path).read_text(encoding="utf-8"))
+    except json.JSONDecodeError as exc:
+        raise SarifImportError(f"invalid SARIF JSON: {exc.msg}") from exc


Wrap the file reading in a try...except OSError block to raise a SarifImportError (which inherits from ValueError) instead of letting FileNotFoundError or PermissionError bubble up. This ensures that the CLI commands can cleanly catch the error and print a user-friendly message instead of a raw traceback.

try: content = Path(path).read_text(encoding="utf-8") except OSError as exc: raise SarifImportError(f"failed to read SARIF file: {exc}") from exc try: payload = json.loads(content) except json.JSONDecodeError as exc: raise SarifImportError(f"invalid SARIF JSON: {exc.msg}") from exc

gemini-code-assist · 2026-06-20T01:14:31Z

+def _safe_properties(properties: dict[str, object]) -> dict[str, object]:
+    return {
+        sanitize_vulnerability_identifier(key, fallback="property"): _safe_value(value)
+        for key, value in properties.items()
+    }


Defensively handle cases where properties is None or not a dictionary to prevent runtime AttributeError when calling .items().

Suggested change

def _safe_properties(properties: dict[str, object]) -> dict[str, object]:

return {

sanitize_vulnerability_identifier(key, fallback="property"): _safe_value(value)

for key, value in properties.items()

}

def _safe_properties(properties: dict[str, object] | None) -> dict[str, object]:

if not isinstance(properties, dict):

return {}

return {

sanitize_vulnerability_identifier(key, fallback="property"): _safe_value(value)

for key, value in properties.items()

}

gemini-code-assist · 2026-06-20T01:14:31Z

+def test_vulnerability_jsonl_store_rejects_wrong_nested_category(tmp_path):
+    path = tmp_path / "vuln-findings.jsonl"
+    record = {
+        "entityType": "VULN_FINDING",
+        "schemaVersion": 1,
+        "finding": {**_finding().to_dict(), "category": "secret"},
+    }
+    path.write_text(json.dumps(record), encoding="utf-8")
+
+    with pytest.raises(ValueError, match="expected category code-vuln"):
+        VulnerabilityJsonlStore(path).read_all()


Add a comprehensive unit test covering unexpected types (such as None, numbers, booleans, lists, and dictionaries) for parsed JSON fields to ensure robust defensive type validation.

def test_vulnerability_jsonl_store_rejects_wrong_nested_category(tmp_path): path = tmp_path / "vuln-findings.jsonl" record = { "entityType": "VULN_FINDING", "schemaVersion": 1, "finding": {**_finding().to_dict(), "category": "secret"}, } path.write_text(json.dumps(record), encoding="utf-8") with pytest.raises(ValueError, match="expected category code-vuln"): VulnerabilityJsonlStore(path).read_all() @pytest.mark.parametrize( "invalid_value", [None, 123, True, [], {}], ) def test_vulnerability_finding_defensive_type_validation(tmp_path, invalid_value): path = tmp_path / "vuln-findings.jsonl" record = { "entityType": "VULN_FINDING", "schemaVersion": 1, "finding": {**_finding().to_dict(), "findingId": invalid_value}, } path.write_text(json.dumps(record), encoding="utf-8") with pytest.raises(ValueError): VulnerabilityJsonlStore(path).read_all()

References

When implementing defensive type validation for parsed JSON fields (e.g., verifying a field is a string), ensure robustness by adding comprehensive unit tests that cover various unexpected types, including None, numbers, booleans, lists, and dictionaries.

…-hardening fix(sast): harden vulnerability redaction to deny-by-default (PR #47 review)

github-advanced-security AI found potential problems Jun 20, 2026

View reviewed changes

pureliture merged commit ae79edf into main Jun 20, 2026
9 checks passed

pureliture deleted the codex/phase-2a-sarif-sast-m0 branch June 20, 2026 01:14

gemini-code-assist Bot reviewed Jun 20, 2026

View reviewed changes

pureliture mentioned this pull request Jun 20, 2026

fix(sast): harden vulnerability redaction to deny-by-default (PR #47 review) #48

Merged

pureliture added a commit that referenced this pull request Jun 20, 2026

Merge pull request #48 from source-security-dev/claude/vuln-redaction…

704f344

…-hardening fix(sast): harden vulnerability redaction to deny-by-default (PR #47 review)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(sast): add SARIF-native code vulnerability path#47

feat(sast): add SARIF-native code vulnerability path#47
pureliture merged 1 commit into
mainfrom
codex/phase-2a-sarif-sast-m0

pureliture commented Jun 20, 2026

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		"Do not request or reveal raw source snippets, repository names, hosts, "
		"or paths.",

		"Return strict JSON only with keys: label, confidence, reason, "
		"remediation.",

	return response.read().decode("utf-8")
	return response.read().decode("utf-8", errors="replace")

Conversation

pureliture commented Jun 20, 2026

Purpose & Motivation

Context

Note

Dependency

Verification

Checklist

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants