fix(sast): harden vulnerability redaction to deny-by-default (PR #47 review)#48
Conversation
…review) PR #47 review + adversarial red-team found the central vulnerability redaction was allow-by-default, leaking raw source/secret/path/host into the VULN_FINDING JSONL and the M4 LLM prompt, while an operator backstop also over-redacted legitimate rule prose. Rework to robust sensitive-class redaction: - redaction.py: NFKC + unicode-homoglyph + %2F/%5C separator folding; broadened secret patterns (provider prefixes sk_/AIza/github_pat/glpat/npm, high-entropy bare token, PEM); broadened host TLDs; precise SQL-statement and narrow identifier-assignment redaction instead of a prose-destroying operator/keyword backstop; backslash + bare-source-filename paths. - sarif.py: owasp_tags normalized to A01..A10 tokens; raw 'tags' dropped from persisted properties. - model.py: verifier_verdict reason/remediation/error re-sanitized on read. - verifier.py: verdict reason/remediation/error routed through central redaction. - runner.py: semgrep stderr reuses central redaction. - tests: 47 redaction regressions locking every review + red-team attack closed, and asserting legitimate rule prose/ids survive (no over-redaction). Closes review findings A1/A2 (blocker), A3/A4/B1/B2 and 16 red-team bypasses. uv run pytest 830 passed; public_safety/render/ledger green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request hardens the vulnerability metadata redaction mechanism by introducing homoglyph normalization, percent-encoding handling, and comprehensive regex patterns to prevent leaks of secrets, paths, hosts, and other sensitive information. It also routes verifier verdicts and process details through this central redaction system and adds a robust test suite. The review feedback correctly identifies that several regex patterns (_HOST_RE, _SRC_FILE_RE, and _IDENTIFIER_PATH_RE) lack case-insensitivity, which could allow uppercase or mixed-case hostnames and file extensions to bypass redaction. It is recommended to apply the suggested (?i) flags and expand the test cases accordingly.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| _HOST_RE = re.compile( | ||
| r"(?<![\w.@-])(?:[A-Za-z0-9-]+\.)+" | ||
| r"(?:com|net|org|io|dev|ai|co|cloud|app|xyz|us|uk|ca|de|fr|jp|cn|ru|eu|me|" | ||
| r"tv|cc|sh|info|biz|site|online|tech|store|zip|mov|svc|k8s|" | ||
| r"internal|local|example|corp|gov|edu|test|lan|intra|prod|stg|qa)\b" | ||
| r"(?::\d{2,5})?" | ||
| ) |
There was a problem hiding this comment.
The regular expression _HOST_RE is compiled without case-insensitivity. Consequently, hostnames with uppercase TLDs or domains (e.g., evil.internal.EXAMPLE, DB.INTERNAL, EXAMPLE.COM) will bypass host redaction. Adding the (?i) inline flag at the beginning of the pattern ensures that all host variations are correctly redacted.
| _HOST_RE = re.compile( | |
| r"(?<![\w.@-])(?:[A-Za-z0-9-]+\.)+" | |
| r"(?:com|net|org|io|dev|ai|co|cloud|app|xyz|us|uk|ca|de|fr|jp|cn|ru|eu|me|" | |
| r"tv|cc|sh|info|biz|site|online|tech|store|zip|mov|svc|k8s|" | |
| r"internal|local|example|corp|gov|edu|test|lan|intra|prod|stg|qa)\b" | |
| r"(?::\d{2,5})?" | |
| ) | |
| _HOST_RE = re.compile( | |
| r"(?i)(?<![\w.@-])(?:[A-Za-z0-9-]+\.)+" | |
| r"(?:com|net|org|io|dev|ai|co|cloud|app|xyz|us|uk|ca|de|fr|jp|cn|ru|eu|me|" | |
| r"tv|cc|sh|info|biz|site|online|tech|store|zip|mov|svc|k8s|" | |
| r"internal|local|example|corp|gov|edu|test|lan|intra|prod|stg|qa)\b" | |
| r"(?::\d{2,5})?" | |
| ) |
There was a problem hiding this comment.
반영했습니다 (f83738f) — _HOST_RE에 (?i)를 적용해 대문자/혼합 host TLD(예: DB.PROD.ACME.CLOUD) 우회를 닫고, 대문자 host 회귀 테스트를 추가했습니다.
| _SRC_FILE_RE = re.compile(rf"(?<![/\\\w.-])[A-Za-z0-9_.-]+\.(?:{_SOURCE_EXT})\b") | ||
| _IDENTIFIER_PATH_RE = re.compile( | ||
| rf"(?<![/\\\w.-])(?:[A-Za-z0-9_.-]+[\\/]){{1,}}[A-Za-z0-9_.-]+\.(?:{_SOURCE_EXT})\b" | ||
| r"|(?<![/\\\w.-])(?:[A-Za-z0-9_.-]+[\\/]){2,}[A-Za-z0-9_.-]+" | ||
| ) |
There was a problem hiding this comment.
The regular expressions _SRC_FILE_RE and _IDENTIFIER_PATH_RE are compiled without case-insensitivity. As a result, they only match lowercase file extensions (e.g., .py, .java, .cpp). If a SARIF file or source code reference contains uppercase or mixed-case extensions (e.g., App.Java, main.CPP, config.PY), they will completely bypass the redaction. Adding the (?i) inline flag at the beginning of the patterns ensures that all file extension variations are correctly redacted.
| _SRC_FILE_RE = re.compile(rf"(?<![/\\\w.-])[A-Za-z0-9_.-]+\.(?:{_SOURCE_EXT})\b") | |
| _IDENTIFIER_PATH_RE = re.compile( | |
| rf"(?<![/\\\w.-])(?:[A-Za-z0-9_.-]+[\\/]){{1,}}[A-Za-z0-9_.-]+\.(?:{_SOURCE_EXT})\b" | |
| r"|(?<![/\\\w.-])(?:[A-Za-z0-9_.-]+[\\/]){2,}[A-Za-z0-9_.-]+" | |
| ) | |
| _SRC_FILE_RE = re.compile(rf"(?i)(?<![/\\\w.-])[A-Za-z0-9_.-]+\.(?:{_SOURCE_EXT})\b") | |
| _IDENTIFIER_PATH_RE = re.compile( | |
| rf"(?i)(?<![/\\\w.-])(?:[A-Za-z0-9_.-]+[\\/]){1,}[A-Za-z0-9_.-]+\.(?:{_SOURCE_EXT})\b" | |
| r"|(?<![/\\\w.-])(?:[A-Za-z0-9_.-]+[\\/]){2,}[A-Za-z0-9_.-]+" | |
| ) |
There was a problem hiding this comment.
반영했습니다 (f83738f) — _SRC_FILE_RE/_IDENTIFIER_PATH_RE에 (?i)를 적용해 대문자 확장자(예: creds.JSON) 우회를 닫았습니다.
| @pytest.mark.parametrize( | ||
| "text, leak", | ||
| [ | ||
| ("See src\\internal\\creds.py for the leak", "src\\internal\\creds.py"), | ||
| ("leak at back\\slash\\rel.py", "back\\slash\\rel.py"), | ||
| ("the offending file is config.py here", "config.py"), | ||
| ], | ||
| ) | ||
| def test_a2_backslash_and_bare_paths_are_redacted(text, leak): | ||
| assert leak not in sanitize_vulnerability_text(text) |
There was a problem hiding this comment.
Add test cases with uppercase and mixed-case file extensions to verify that case-insensitive file path redaction works correctly and to prevent future regressions.
| @pytest.mark.parametrize( | |
| "text, leak", | |
| [ | |
| ("See src\\internal\\creds.py for the leak", "src\\internal\\creds.py"), | |
| ("leak at back\\slash\\rel.py", "back\\slash\\rel.py"), | |
| ("the offending file is config.py here", "config.py"), | |
| ], | |
| ) | |
| def test_a2_backslash_and_bare_paths_are_redacted(text, leak): | |
| assert leak not in sanitize_vulnerability_text(text) | |
| # --- A2: backslash (Windows) relative paths + bare source filenames --- | |
| @pytest.mark.parametrize( | |
| "text, leak", | |
| [ | |
| ("See src\\internal\\creds.py for the leak", "src\\internal\\creds.py"), | |
| ("leak at back\\slash\\rel.py", "back\\slash\\rel.py"), | |
| ("the offending file is config.py here", "config.py"), | |
| ("See src\\internal\\creds.PY for the leak", "src\\internal\\creds.PY"), | |
| ("the offending file is config.Py here", "config.Py"), | |
| ], | |
| ) | |
| def test_a2_backslash_and_bare_paths_are_redacted(text, leak): | |
| assert leak.lower() not in sanitize_vulnerability_text(text).lower() |
There was a problem hiding this comment.
반영했습니다 (f83738f) — test_case_insensitive_paths_and_hosts_redacted에 대문자/혼합 확장자(Creds.JSON, App.PY, Creds.PY) 케이스를 추가했습니다.
| @pytest.mark.parametrize( | ||
| "text, leak", | ||
| [ | ||
| ( | ||
| "Reported by john.doe@internal-corp.example", | ||
| "john.doe@internal-corp.example", | ||
| ), | ||
| ("callback to 10.1.2.3 observed", "10.1.2.3"), | ||
| ("host evil.internal.example reached", "evil.internal.example"), | ||
| ("leak at src/secrets/master.key now", "secrets/master.key"), | ||
| ("connect fe80::1ff:fe23:4567:890a here", "fe80::1ff:fe23:4567:890a"), | ||
| ], | ||
| ) | ||
| def test_a4_network_pii_and_unicode_paths_are_redacted(text, leak): | ||
| assert leak not in sanitize_vulnerability_text(text) |
There was a problem hiding this comment.
Add test cases with uppercase hostnames to verify that case-insensitive host redaction works correctly and to prevent future regressions.
| @pytest.mark.parametrize( | |
| "text, leak", | |
| [ | |
| ( | |
| "Reported by john.doe@internal-corp.example", | |
| "john.doe@internal-corp.example", | |
| ), | |
| ("callback to 10.1.2.3 observed", "10.1.2.3"), | |
| ("host evil.internal.example reached", "evil.internal.example"), | |
| ("leak at src/secrets/master.key now", "secrets/master.key"), | |
| ("connect fe80::1ff:fe23:4567:890a here", "fe80::1ff:fe23:4567:890a"), | |
| ], | |
| ) | |
| def test_a4_network_pii_and_unicode_paths_are_redacted(text, leak): | |
| assert leak not in sanitize_vulnerability_text(text) | |
| # --- A4: email / IP / host / unicode-separator paths --- | |
| @pytest.mark.parametrize( | |
| "text, leak", | |
| [ | |
| ( | |
| "Reported by john.doe@internal-corp.example", | |
| "john.doe@internal-corp.example", | |
| ), | |
| ("callback to 10.1.2.3 observed", "10.1.2.3"), | |
| ("host evil.internal.example reached", "evil.internal.example"), | |
| ("host EVIL.INTERNAL.EXAMPLE reached", "EVIL.INTERNAL.EXAMPLE"), | |
| ("leak at src/secrets/master.key now", "secrets/master.key"), | |
| ("connect fe80::1ff:fe23:4567:890a here", "fe80::1ff:fe23:4567:890a"), | |
| ], | |
| ) | |
| def test_a4_network_pii_and_unicode_paths_are_redacted(text, leak): | |
| assert leak.lower() not in sanitize_vulnerability_text(text).lower() |
There was a problem hiding this comment.
반영했습니다 (f83738f) — 같은 파라메트라이즈에 대문자/혼합 host(DB.PROD.ACME.CLOUD, Db.Prod.Acme.Cloud) 케이스를 추가했습니다.
gemini review flagged _HOST_RE TLD allowlist and _SRC_FILE_RE / _IDENTIFIER_PATH_RE source-extension allowlist were case-sensitive, so uppercase/mixed-case hosts (DB.PROD.ACME.CLOUD) and file extensions (creds.JSON) bypassed redaction. Add (?i) and uppercase/mixed-case regression tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Outcome
#47 머지 후 리뷰에서 발견된 공개안전 누출을 후속으로 수정한다. 중앙 vulnerability redaction이 allow-by-default라 raw source/secret/path/host가
VULN_FINDINGJSONL과 M4 LLM 프롬프트로 새고(blocker A1·A2), 동시에 operator/SQL 백스톱이 정상 rule prose를 파괴했다. 멀티에이전트 리뷰(5차원) + adversarial red-team(4 angle)으로 양방향(누출 + 과검열)을 라이브 재현해 닫았다.무엇을
%2F/%5C분리자 정규화(우회 차단).sk_/AIza/github_pat/glpat/npm/xox/gh[opsu]_) + 고엔트로피 bare 토큰 + PEM..cloud/.xyz/.app…).SELECT…FROM만) + 좁은 identifier-assignment(RHS가 식별자일 때만, 숫자/비교 제외).A01..A10토큰으로 정규화, persistedproperties에서 rawtags제거(A3).verifier_verdict의 reason/remediation/error를 read 시 재검열(B1 방어).닫힌 것
리뷰 findings A1·A2(blocker), A3·A4·B1·B2, 그리고 red-team의 16개 novel bypass(bare provider secret, 고엔트로피 토큰, homoglyph/URL-encoded 분리자, 광역 host, verifier
error필드). 동시에 red-team이 지적한 11개 over-redaction(update/Delete/==/len <= 255/SELECT query/std::string/data-flow->/timeout = 30/실제 Semgrep rule id)이 모두 보존됨.검증
tests/test_vulnerability_redaction.py— 47개 회귀(모든 attack 닫힘 + prose 보존을 양방향 고정).uv run pytest830 passed(secret-scanner 무회귀);public_safety/render --validate/--check/validate_ledger/rebuild_ledger_index --check전부 green; ruff clean.🤖 Generated with Claude Code