fix(sast): harden vulnerability redaction to deny-by-default (PR #47 review) by pureliture · Pull Request #48 · source-security-dev/security-scanner

pureliture · 2026-06-20T02:20:32Z

Outcome

#47 머지 후 리뷰에서 발견된 공개안전 누출을 후속으로 수정한다. 중앙 vulnerability redaction이 allow-by-default라 raw source/secret/path/host가 VULN_FINDING JSONL과 M4 LLM 프롬프트로 새고(blocker A1·A2), 동시에 operator/SQL 백스톱이 정상 rule prose를 파괴했다. 멀티에이전트 리뷰(5차원) + adversarial red-team(4 angle)으로 양방향(누출 + 과검열)을 라이브 재현해 닫았다.

무엇을

redaction.py — deny-by-default 민감-클래스 redaction으로 재설계:
- NFKC + 유니코드 homoglyph(U+2044/2215/2550/00B7 등) + %2F/%5C 분리자 정규화(우회 차단).
- secret 패턴 확장: provider prefix(sk_/AIza/github_pat/glpat/npm/xox/gh[opsu]_) + 고엔트로피 bare 토큰 + PEM.
- host TLD 광역화(.cloud/.xyz/.app…).
- prose 파괴 백스톱 제거 → 정밀 SQL-statement(SELECT…FROM만) + 좁은 identifier-assignment(RHS가 식별자일 때만, 숫자/비교 제외).
- backslash + bare-source-filename 경로.
sarif.py — owasp_tags를 A01..A10 토큰으로 정규화, persisted properties에서 raw tags 제거(A3).
model.py — verifier_verdict의 reason/remediation/error를 read 시 재검열(B1 방어).
verifier.py — verdict reason/remediation/error를 중앙 redaction 경유(B1).
runner.py — semgrep stderr가 중앙 redaction 재사용(B2).

닫힌 것

리뷰 findings A1·A2(blocker), A3·A4·B1·B2, 그리고 red-team의 16개 novel bypass(bare provider secret, 고엔트로피 토큰, homoglyph/URL-encoded 분리자, 광역 host, verifier error 필드). 동시에 red-team이 지적한 11개 over-redaction(update/Delete/==/len <= 255/SELECT query/std::string/data-flow ->/timeout = 30/실제 Semgrep rule id)이 모두 보존됨.

검증

tests/test_vulnerability_redaction.py — 47개 회귀(모든 attack 닫힘 + prose 보존을 양방향 고정).
uv run pytest 830 passed(secret-scanner 무회귀); public_safety/render --validate/--check/validate_ledger/rebuild_ledger_index --check 전부 green; ruff clean.
테스트의 secret-shaped fixture는 fragment 조립이라 소스에 연속 secret 패턴 없음.

🤖 Generated with Claude Code

…review) PR #47 review + adversarial red-team found the central vulnerability redaction was allow-by-default, leaking raw source/secret/path/host into the VULN_FINDING JSONL and the M4 LLM prompt, while an operator backstop also over-redacted legitimate rule prose. Rework to robust sensitive-class redaction: - redaction.py: NFKC + unicode-homoglyph + %2F/%5C separator folding; broadened secret patterns (provider prefixes sk_/AIza/github_pat/glpat/npm, high-entropy bare token, PEM); broadened host TLDs; precise SQL-statement and narrow identifier-assignment redaction instead of a prose-destroying operator/keyword backstop; backslash + bare-source-filename paths. - sarif.py: owasp_tags normalized to A01..A10 tokens; raw 'tags' dropped from persisted properties. - model.py: verifier_verdict reason/remediation/error re-sanitized on read. - verifier.py: verdict reason/remediation/error routed through central redaction. - runner.py: semgrep stderr reuses central redaction. - tests: 47 redaction regressions locking every review + red-team attack closed, and asserting legitimate rule prose/ids survive (no over-redaction). Closes review findings A1/A2 (blocker), A3/A4/B1/B2 and 16 red-team bypasses. uv run pytest 830 passed; public_safety/render/ledger green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request hardens the vulnerability metadata redaction mechanism by introducing homoglyph normalization, percent-encoding handling, and comprehensive regex patterns to prevent leaks of secrets, paths, hosts, and other sensitive information. It also routes verifier verdicts and process details through this central redaction system and adds a robust test suite. The review feedback correctly identifies that several regex patterns (_HOST_RE, _SRC_FILE_RE, and _IDENTIFIER_PATH_RE) lack case-insensitivity, which could allow uppercase or mixed-case hostnames and file extensions to bypass redaction. It is recommended to apply the suggested (?i) flags and expand the test cases accordingly.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-20T02:23:07Z

+_HOST_RE = re.compile(
+    r"(?<![\w.@-])(?:[A-Za-z0-9-]+\.)+"
+    r"(?:com|net|org|io|dev|ai|co|cloud|app|xyz|us|uk|ca|de|fr|jp|cn|ru|eu|me|"
+    r"tv|cc|sh|info|biz|site|online|tech|store|zip|mov|svc|k8s|"
+    r"internal|local|example|corp|gov|edu|test|lan|intra|prod|stg|qa)\b"
+    r"(?::\d{2,5})?"
 )


The regular expression _HOST_RE is compiled without case-insensitivity. Consequently, hostnames with uppercase TLDs or domains (e.g., evil.internal.EXAMPLE, DB.INTERNAL, EXAMPLE.COM) will bypass host redaction. Adding the (?i) inline flag at the beginning of the pattern ensures that all host variations are correctly redacted.

Suggested change

_HOST_RE = re.compile(

r"(?<![\w.@-])(?:[A-Za-z0-9-]+\.)+"

r"(?:com|net|org|io|dev|ai|co|cloud|app|xyz|us|uk|ca|de|fr|jp|cn|ru|eu|me|"

r"tv|cc|sh|info|biz|site|online|tech|store|zip|mov|svc|k8s|"

r"internal|local|example|corp|gov|edu|test|lan|intra|prod|stg|qa)\b"

r"(?::\d{2,5})?"

)

_HOST_RE = re.compile(

r"(?i)(?<![\w.@-])(?:[A-Za-z0-9-]+\.)+"

r"(?:com|net|org|io|dev|ai|co|cloud|app|xyz|us|uk|ca|de|fr|jp|cn|ru|eu|me|"

r"tv|cc|sh|info|biz|site|online|tech|store|zip|mov|svc|k8s|"

r"internal|local|example|corp|gov|edu|test|lan|intra|prod|stg|qa)\b"

r"(?::\d{2,5})?"

)

반영했습니다 (f83738f) — _HOST_RE에 (?i)를 적용해 대문자/혼합 host TLD(예: DB.PROD.ACME.CLOUD) 우회를 닫고, 대문자 host 회귀 테스트를 추가했습니다.

gemini-code-assist · 2026-06-20T02:23:08Z

+_SRC_FILE_RE = re.compile(rf"(?<![/\\\w.-])[A-Za-z0-9_.-]+\.(?:{_SOURCE_EXT})\b")
+_IDENTIFIER_PATH_RE = re.compile(
+    rf"(?<![/\\\w.-])(?:[A-Za-z0-9_.-]+[\\/]){{1,}}[A-Za-z0-9_.-]+\.(?:{_SOURCE_EXT})\b"
+    r"|(?<![/\\\w.-])(?:[A-Za-z0-9_.-]+[\\/]){2,}[A-Za-z0-9_.-]+"
 )


The regular expressions _SRC_FILE_RE and _IDENTIFIER_PATH_RE are compiled without case-insensitivity. As a result, they only match lowercase file extensions (e.g., .py, .java, .cpp). If a SARIF file or source code reference contains uppercase or mixed-case extensions (e.g., App.Java, main.CPP, config.PY), they will completely bypass the redaction. Adding the (?i) inline flag at the beginning of the patterns ensures that all file extension variations are correctly redacted.

Suggested change

_SRC_FILE_RE = re.compile(rf"(?<![/\\\w.-])[A-Za-z0-9_.-]+\.(?:{_SOURCE_EXT})\b")

_IDENTIFIER_PATH_RE = re.compile(

rf"(?<![/\\\w.-])(?:[A-Za-z0-9_.-]+[\\/]){{1,}}[A-Za-z0-9_.-]+\.(?:{_SOURCE_EXT})\b"

r"|(?<![/\\\w.-])(?:[A-Za-z0-9_.-]+[\\/]){2,}[A-Za-z0-9_.-]+"

)

_SRC_FILE_RE = re.compile(rf"(?i)(?<![/\\\w.-])[A-Za-z0-9_.-]+\.(?:{_SOURCE_EXT})\b")

_IDENTIFIER_PATH_RE = re.compile(

rf"(?i)(?<![/\\\w.-])(?:[A-Za-z0-9_.-]+[\\/]){1,}[A-Za-z0-9_.-]+\.(?:{_SOURCE_EXT})\b"

r"|(?<![/\\\w.-])(?:[A-Za-z0-9_.-]+[\\/]){2,}[A-Za-z0-9_.-]+"

)

반영했습니다 (f83738f) — _SRC_FILE_RE/_IDENTIFIER_PATH_RE에 (?i)를 적용해 대문자 확장자(예: creds.JSON) 우회를 닫았습니다.

gemini-code-assist · 2026-06-20T02:23:08Z

+@pytest.mark.parametrize(
+    "text, leak",
+    [
+        ("See src\\internal\\creds.py for the leak", "src\\internal\\creds.py"),
+        ("leak at back\\slash\\rel.py", "back\\slash\\rel.py"),
+        ("the offending file is config.py here", "config.py"),
+    ],
+)
+def test_a2_backslash_and_bare_paths_are_redacted(text, leak):
+    assert leak not in sanitize_vulnerability_text(text)


Add test cases with uppercase and mixed-case file extensions to verify that case-insensitive file path redaction works correctly and to prevent future regressions.

Suggested change

@pytest.mark.parametrize(

"text, leak",

[

("See src\\internal\\creds.py for the leak", "src\\internal\\creds.py"),

("leak at back\\slash\\rel.py", "back\\slash\\rel.py"),

("the offending file is config.py here", "config.py"),

],

)

def test_a2_backslash_and_bare_paths_are_redacted(text, leak):

assert leak not in sanitize_vulnerability_text(text)

# --- A2: backslash (Windows) relative paths + bare source filenames ---

@pytest.mark.parametrize(

"text, leak",

[

("See src\\internal\\creds.py for the leak", "src\\internal\\creds.py"),

("leak at back\\slash\\rel.py", "back\\slash\\rel.py"),

("the offending file is config.py here", "config.py"),

("See src\\internal\\creds.PY for the leak", "src\\internal\\creds.PY"),

("the offending file is config.Py here", "config.Py"),

],

)

def test_a2_backslash_and_bare_paths_are_redacted(text, leak):

assert leak.lower() not in sanitize_vulnerability_text(text).lower()

반영했습니다 (f83738f) — test_case_insensitive_paths_and_hosts_redacted에 대문자/혼합 확장자(Creds.JSON, App.PY, Creds.PY) 케이스를 추가했습니다.

gemini-code-assist · 2026-06-20T02:23:08Z

+@pytest.mark.parametrize(
+    "text, leak",
+    [
+        (
+            "Reported by john.doe@internal-corp.example",
+            "john.doe@internal-corp.example",
+        ),
+        ("callback to 10.1.2.3 observed", "10.1.2.3"),
+        ("host evil.internal.example reached", "evil.internal.example"),
+        ("leak at src／secrets／master.key now", "secrets/master.key"),
+        ("connect fe80::1ff:fe23:4567:890a here", "fe80::1ff:fe23:4567:890a"),
+    ],
+)
+def test_a4_network_pii_and_unicode_paths_are_redacted(text, leak):
+    assert leak not in sanitize_vulnerability_text(text)


Add test cases with uppercase hostnames to verify that case-insensitive host redaction works correctly and to prevent future regressions.

Suggested change

@pytest.mark.parametrize(

"text, leak",

[

(

"Reported by john.doe@internal-corp.example",

"john.doe@internal-corp.example",

),

("callback to 10.1.2.3 observed", "10.1.2.3"),

("host evil.internal.example reached", "evil.internal.example"),

("leak at src／secrets／master.key now", "secrets/master.key"),

("connect fe80::1ff:fe23:4567:890a here", "fe80::1ff:fe23:4567:890a"),

],

)

def test_a4_network_pii_and_unicode_paths_are_redacted(text, leak):

assert leak not in sanitize_vulnerability_text(text)

# --- A4: email / IP / host / unicode-separator paths ---

@pytest.mark.parametrize(

"text, leak",

[

(

"Reported by john.doe@internal-corp.example",

"john.doe@internal-corp.example",

),

("callback to 10.1.2.3 observed", "10.1.2.3"),

("host evil.internal.example reached", "evil.internal.example"),

("host EVIL.INTERNAL.EXAMPLE reached", "EVIL.INTERNAL.EXAMPLE"),

("leak at src／secrets／master.key now", "secrets/master.key"),

("connect fe80::1ff:fe23:4567:890a here", "fe80::1ff:fe23:4567:890a"),

],

)

def test_a4_network_pii_and_unicode_paths_are_redacted(text, leak):

assert leak.lower() not in sanitize_vulnerability_text(text).lower()

반영했습니다 (f83738f) — 같은 파라메트라이즈에 대문자/혼합 host(DB.PROD.ACME.CLOUD, Db.Prod.Acme.Cloud) 케이스를 추가했습니다.

gemini review flagged _HOST_RE TLD allowlist and _SRC_FILE_RE / _IDENTIFIER_PATH_RE source-extension allowlist were case-sensitive, so uppercase/mixed-case hosts (DB.PROD.ACME.CLOUD) and file extensions (creds.JSON) bypassed redaction. Add (?i) and uppercase/mixed-case regression tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jun 20, 2026

View reviewed changes

pureliture merged commit 704f344 into main Jun 20, 2026
9 checks passed

pureliture deleted the claude/vuln-redaction-hardening branch June 20, 2026 02:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sast): harden vulnerability redaction to deny-by-default (PR #47 review)#48

fix(sast): harden vulnerability redaction to deny-by-default (PR #47 review)#48
pureliture merged 2 commits into
mainfrom
claude/vuln-redaction-hardening

pureliture commented Jun 20, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Uh oh!

pureliture Jun 20, 2026

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Uh oh!

pureliture Jun 20, 2026

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Uh oh!

pureliture Jun 20, 2026

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Uh oh!

pureliture Jun 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pureliture commented Jun 20, 2026

Outcome

무엇을

닫힌 것

검증

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

pureliture Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

pureliture Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

pureliture Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

pureliture Jun 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant