
[FEATURE] Output formatters expansion — junit_xml, emacs, vim, gitlab_sast, gitlab_secrets, jsonl, toon, dataflow-traces #52

Description

@Wolfvin

Summary

CodeLens currently ships only five formatters: json, markdown, sarif, ai, and compact. This issue proposes adding 8 new formatters covering GitLab CI, Jenkins, Emacs, Vim, JSONL streaming, token-optimized TOON, and SARIF dataflow traces, aimed at editor and CI consumers.

Worker consensus (5 reports)

| Worker | Source | Contribution |
|---|---|---|
| Opengrep | update!/CodeLens_Opengrep_Upgrade_Analysis.md #42 | 6 new formatters: text (table), gitlab-sast, gitlab-secrets, junit-xml, emacs (file:line:col: msg), vim (quickfix). --output <file> (multiple allowed). --incremental-output streams findings as found. |
| Semgrep | update!/CodeLens_Upgrade_Issues_from_Semgrep.md CL-007 | 5 new formatters: junit_xml, emacs, vim, gitlab_sast, gitlab_secrets. Unified Finding dataclass from scripts/formatters/base.py. |
| Opengrep | same file #50 | --dataflow-traces flag enabling SARIF codeFlows field per finding (array of threadFlow, 1 per taint path). GitHub code scanning UI shows full taint path clickable step-by-step. |
| UBS | update!/CodeLens_UBS_Upgrade_Analysis.md #19 | TOON (Token-Optimized Object Notation): ~50% smaller than JSON, ~34% token saving for LLM. Schema inference: findings[65]{severity,count,title}: header + CSV-like rows. |
| UBS | same file #20 | JSONL streaming output + Beads/issue-tracker integration. --format=jsonl (1 object per line). Stream mode emits findings as discovered. Issue-tracker import scripts (GitHub Issues, JIRA). |

Proposed scope (P1, 2-3 weeks)

Phase 1 — Unified Finding dataclass (P1, 3 days)

  • New scripts/formatters/base.py with Finding dataclass
  • All engines emit Finding objects, formatters consume them
  • Backward compat: existing JSON output unchanged
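
As a rough illustration of Phase 1, the unified record could look like the sketch below. The field names (rule_id, severity, file, line, col, message, taint_path) are assumptions pieced together from the formats listed in this issue, not the actual contents of scripts/formatters/base.py.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical unified finding record; every field name is an assumption."""

    rule_id: str
    severity: str  # e.g. "error", "warning", "info"
    file: str
    line: int
    message: str
    col: int = 1
    # Ordered taint steps; left empty by engines that do no dataflow analysis.
    taint_path: tuple = ()

    def to_dict(self) -> dict:
        # A plain dict, so the existing JSON formatter can keep consuming dicts.
        return asdict(self)


finding = Finding("py.sql-injection", "error", "app/db.py", 42, "Unsanitized query")
print(finding.to_dict())
```

Freezing the dataclass is one possible design choice: formatters then cannot mutate findings that other formatters will also read when several outputs are written in one run.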

Phase 2 — 5 standard formatters (P1, 1 week)

  • text — human-readable table: rule_id | severity | file:line | message
  • junit-xml — universal test result format for Jenkins/GitLab
  • emacs — file:line:col: severity: message for compile-mode
  • vim — file:line:col: message for quickfix
  • gitlab-sast — GitLab CI native security scan format
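
The line-oriented formats above are small enough to sketch directly. The example below is only an illustration: the Finding stand-in and its field names are assumptions, and the JUnit mapping (one testcase per finding, each with a failure child) is one plausible convention rather than a decided design.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class Finding(NamedTuple):
    # Stand-in for the Phase 1 dataclass; field names are assumptions.
    rule_id: str
    severity: str
    file: str
    line: int
    col: int
    message: str


def format_emacs(findings):
    # file:line:col: severity: message, the shape listed above for compile-mode.
    return "\n".join(
        f"{f.file}:{f.line}:{f.col}: {f.severity}: {f.message}" for f in findings
    )


def format_vim(findings):
    # file:line:col: message, the shape listed above for quickfix.
    return "\n".join(f"{f.file}:{f.line}:{f.col}: {f.message}" for f in findings)


def format_junit_xml(findings):
    # Assumed mapping: every finding becomes a failed <testcase>.
    total = str(len(findings))
    suite = ET.Element("testsuite", name="codelens", tests=total, failures=total)
    for f in findings:
        case = ET.SubElement(suite, "testcase", classname=f.file, name=f.rule_id)
        failure = ET.SubElement(case, "failure", type=f.severity, message=f.message)
        failure.text = f"{f.file}:{f.line}:{f.col}"
    return ET.tostring(suite, encoding="unicode")


sample = [Finding("py.sql-injection", "error", "app/db.py", 42, 7, "Unsanitized query")]
print(format_emacs(sample))
print(format_vim(sample))
print(format_junit_xml(sample))
```

Building the XML through xml.etree.ElementTree rather than string concatenation means messages containing <, > or quotes are escaped automatically.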

Phase 3 — SARIF dataflow traces (P1, 3 days)

  • --dataflow-traces flag
  • Reuse taint_path from JSON output and convert it to SARIF codeFlows (one codeFlow per taint path, each holding a threadFlows array)
  • Validate with sarif-validator
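
A minimal sketch of the conversion follows. In SARIF 2.1.0 a result's codeFlows is an array of codeFlow objects, each codeFlow has a threadFlows array, and each threadFlow has a locations array whose entries wrap a location. The shape of the input steps (keys file, line, col, note) is an assumption about CodeLens' taint_path data, and the function name is hypothetical.

```python
def taint_paths_to_code_flows(taint_paths):
    """Build a SARIF codeFlows array from a list of taint paths.

    Each taint path is assumed to be a list of dicts with "file" and "line",
    plus optional "col" and "note" keys.
    """
    code_flows = []
    for path in taint_paths:
        locations = []
        for step in path:
            location = {
                "physicalLocation": {
                    "artifactLocation": {"uri": step["file"]},
                    "region": {
                        "startLine": step["line"],
                        "startColumn": step.get("col", 1),
                    },
                }
            }
            if step.get("note"):
                location["message"] = {"text": step["note"]}
            locations.append({"location": location})
        # One codeFlow per taint path, carrying a single threadFlow.
        code_flows.append({"threadFlows": [{"locations": locations}]})
    return code_flows


example_path = [
    {"file": "app/views.py", "line": 10, "col": 5, "note": "user input read here"},
    {"file": "app/db.py", "line": 42, "note": "reaches SQL query"},
]
result = {
    "ruleId": "py.sql-injection",
    "message": {"text": "Unsanitized query"},
    "codeFlows": taint_paths_to_code_flows([example_path]),
}
print(len(result["codeFlows"][0]["threadFlows"][0]["locations"]))
```

The returned list would be attached to each SARIF result only when --dataflow-traces is set, so default output stays unchanged.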

Phase 4 — JSONL streaming (P2, 1 week)

  • --format=jsonl — line-delimited JSON
  • Stream mode: emit findings as discovered (refactor engines away from collect-all-then-print)
  • --jsonl-output=<file> for file output
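
The streaming refactor described above amounts to turning engines into generators and writing one JSON object per line as each finding arrives. The sketch below uses a hypothetical scan() generator in place of a real engine, and the finding keys are placeholders.

```python
import io
import json


def scan(paths):
    # Hypothetical engine written as a generator: it yields each finding as soon
    # as it is discovered instead of returning one large list at the end.
    for number, path in enumerate(paths, start=1):
        yield {"rule_id": "demo.rule", "file": path, "line": number, "message": "example"}


def write_jsonl(findings, stream):
    # Only one finding is held in memory at a time, however many the scan yields.
    count = 0
    for finding in findings:
        stream.write(json.dumps(finding, separators=(",", ":")) + "\n")
        stream.flush()  # let a downstream consumer see the line immediately
        count += 1
    return count


buffer = io.StringIO()
written = write_jsonl(scan(["a.py", "b.py"]), buffer)
print(written)
print(buffer.getvalue(), end="")
```

Passing sys.stdout or an open file as the stream would cover both --format=jsonl and a file target with the same function.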

Phase 5 — TOON (P3, 1 week, optional)

  • --format=toon — token-optimized for LLM
  • Python-native encoder (no external tru binary)
  • Fallback to --format ai (JSON) with stderr warning on encoder error
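
To make the encoder and its fallback concrete, here is a deliberately simplified sketch. It handles only the tabular case quoted in this issue (a name[count]{keys}: header followed by CSV-like rows) and refuses any value that would need quoting. It is not a full TOON implementation, and the function names and the two-space row indent are assumptions.

```python
import json
import sys


def encode_toon_table(name, rows):
    """Encode a uniform list of flat dicts as a TOON-style table (simplified)."""
    if not rows:
        raise ValueError("empty input is not handled in this sketch")
    keys = list(rows[0])
    lines = [name + "[" + str(len(rows)) + "]{" + ",".join(keys) + "}:"]
    for row in rows:
        if list(row) != keys:
            raise ValueError("rows do not share the same keys")
        cells = []
        for key in keys:
            value = row[key]
            if isinstance(value, bool) or not isinstance(value, (str, int, float)):
                raise ValueError("unsupported value type for key " + key)
            text = str(value)
            if any(ch in text for ch in ',:"\n'):
                raise ValueError("value needs quoting, which this sketch omits")
            cells.append(text)
        lines.append("  " + ",".join(cells))
    return "\n".join(lines)


def render(name, rows, stderr=sys.stderr):
    # On any encoder error, warn on stderr and fall back to JSON.
    try:
        return encode_toon_table(name, rows)
    except ValueError as exc:
        print(f"warning: TOON encoding failed ({exc}); falling back to JSON", file=stderr)
        return json.dumps({name: rows})


rows = [
    {"severity": "high", "count": 3, "title": "SQL injection"},
    {"severity": "low", "count": 1, "title": "Unused import"},
]
print(render("findings", rows))
```

A real encoder would also need the format's quoting and nesting rules; raising on unsupported input keeps this sketch from emitting output that merely looks valid.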

Phase 6 — Issue-tracker import scripts (P3, 1 week, optional)

  • scripts/import-to-github-issues.py (uses gh issue create)
  • scripts/import-to-jira.py
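
A possible core for the GitHub importer is sketched below. It only builds the gh issue create argument list, using the CLI's --title, --body and --label options. The finding keys, the title layout and the default label are assumptions.

```python
def build_gh_issue_command(finding, labels=("security",)):
    # Finding keys and the title/body layout are placeholders for illustration.
    title = f"[{finding['severity']}] {finding['rule_id']}: {finding['message']}"
    body = f"File: {finding['file']}:{finding['line']}\n\n{finding['message']}"
    argv = ["gh", "issue", "create", "--title", title, "--body", body]
    for label in labels:
        argv += ["--label", label]
    return argv


finding = {
    "rule_id": "py.sql-injection",
    "severity": "error",
    "file": "app/db.py",
    "line": 42,
    "message": "Unsanitized query",
}
command = build_gh_issue_command(finding)
print(command)
# To actually file the issue: subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems when a finding message contains spaces or shell metacharacters.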

Acceptance criteria

  • All 5 Phase 2 formatters produce valid output (validated against format specs)
  • SARIF codeFlows renders correctly in GitHub code scanning UI
  • JSONL streaming works for >10000 findings without OOM
  • TOON output is valid (parseable back to dict)
  • --output <file> supports multiple formats simultaneously

License note

Semgrep formatters are LGPL-2.1 — reference only, reimplement from spec.

Files

  • New scripts/formatters/{base,text,junit_xml,emacs,vim,gitlab_sast,gitlab_secrets,jsonl,toon}.py
  • Update scripts/formatters/sarif.py for codeFlows
  • Update scripts/formatters/__init__.py for new format dispatch
  • Update scripts/codelens.py for --output, --dataflow-traces, --incremental-output flags
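
One way the dispatch and the new flags could fit together is sketched below. The registry decorator and the FORMAT:PATH syntax for --output are assumptions; this issue only states that --output may be given multiple times.

```python
import argparse
import json

FORMATTERS = {}  # format name -> callable taking a list of findings


def register(name):
    def decorator(func):
        FORMATTERS[name] = func
        return func
    return decorator


@register("json")
def format_json(findings):
    return json.dumps(findings)


@register("jsonl")
def format_jsonl(findings):
    return "".join(json.dumps(f) + "\n" for f in findings)


def parse_output(spec):
    # Assumed syntax: FORMAT:PATH, for example "jsonl:findings.jsonl".
    fmt, sep, path = spec.partition(":")
    if not sep or not path or fmt not in FORMATTERS:
        raise argparse.ArgumentTypeError(f"expected FORMAT:PATH, got {spec!r}")
    return fmt, path


def build_parser():
    parser = argparse.ArgumentParser(prog="codelens")
    parser.add_argument("--output", action="append", type=parse_output,
                        default=[], metavar="FORMAT:PATH")
    parser.add_argument("--dataflow-traces", action="store_true")
    parser.add_argument("--incremental-output", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--output", "json:out.json", "--output", "jsonl:out.jsonl", "--dataflow-traces"]
)
for fmt, path in args.output:
    print(fmt, path, FORMATTERS[fmt].__name__)
```

With a registry like this, each new module under scripts/formatters/ would register itself, and scripts/formatters/__init__.py would only need to import the modules.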
