[codex] Add incremental scan queue worker MVP#15

Merged
pureliture merged 9 commits into main from codex/issue-12-g1-queue-ledger-model
Jun 15, 2026

Conversation

@pureliture

Contributor

Purpose & Motivation

Implements the incremental scan queue worker MVP from Issue #12.

The main goal is to add a commit-level durable queue/ledger model alongside the existing scan-all-centric batch execution, and to provide a Docker Compose + DynamoDB Local path so that a public-repo quickstart smoke test can run even on a fresh PC.

Context

This PR adds the following flows.

  • Add the REF_STATE, SCAN_JOB, SCAN_LEDGER, and REPO_LEASE logical entities to the DynamoDB-compatible single-table store
  • Implement deterministic repoId / jobId and enqueue idempotency based on conditional writes
  • Add an incremental discovery runtime that detects remote ref changes and enqueues one job per commit
  • Add a bounded scan worker runtime that handles pending/expired job leases, repo leases, retry/dead_letter, and ledger completion
  • Add queue status and quickstart runtimes
  • Add a turnkey path to the Dockerfile/Compose based on DynamoDB Local, gh, glab, and gitleaks
  • Move scan-all orchestration into runtime/scan_all.py to reduce the CLI's responsibilities
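The deterministic-ID and conditional-write idea from the list above can be sketched as follows. This is a minimal illustration, not the PR's code: `make_repo_id`, `make_job_id`, `InMemoryTable`, the hash lengths, and the `SCAN_JOB#<job_id>` key format are all hypothetical. Against DynamoDB, the put-if-absent step would typically be a PutItem with an `attribute_not_exists` condition on the key.

```python
import hashlib
from typing import Any, Dict


def make_repo_id(clone_url: str) -> str:
    """Derive a stable repo ID from the clone URL (hypothetical normalization)."""
    normalized = clone_url.strip()
    if normalized.endswith(".git"):
        normalized = normalized[:-4]
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]


def make_job_id(repo_id: str, commit_sha: str) -> str:
    """The same (repo, commit) pair always maps to the same job ID."""
    return hashlib.sha256(f"{repo_id}:{commit_sha}".encode("utf-8")).hexdigest()[:24]


class InMemoryTable:
    """Stand-in for the single-table store; only models put-if-absent."""

    def __init__(self) -> None:
        self.items: Dict[str, Dict[str, Any]] = {}

    def put_if_absent(self, pk: str, item: Dict[str, Any]) -> bool:
        # Mirrors a conditional write: the put succeeds only if the key is new.
        if pk in self.items:
            return False
        self.items[pk] = item
        return True


def enqueue_commit(table: InMemoryTable, clone_url: str, commit_sha: str) -> bool:
    """Return True if a new job was created, False if it already existed."""
    repo_id = make_repo_id(clone_url)
    job_id = make_job_id(repo_id, commit_sha)
    item = {"repoId": repo_id, "commit": commit_sha, "status": "pending"}
    # Key format is an assumption made for this sketch.
    return table.put_if_absent(f"SCAN_JOB#{job_id}", item)
```

Because the job ID is a pure function of the repo and commit, a discovery run that sees the same commit twice produces the same key, and the second write is rejected rather than duplicated.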

The store physically remains a single table, but this PR adds a logical schema and access patterns for the queue, ledger, and leases.

Note

Areas that deserve particular attention during review:

  • Ordering in complete_processed_job: findings write → SCAN_LEDGER put-if-absent → SCAN_JOB completed
  • Retryable failures return the job to pending; once attempts are exhausted, the job moves to dead_letter
  • REPO_LEASE key shape: gsi1pk=REPO_LEASE#ALL, gsi1sk=<lease_until>#<repo_id>
  • MVP status partitions such as SCAN_JOB_STATUS#pending will need shard/counter improvements for sustained operation at 500+ repos. This PR is scoped to an MVP/local-turnkey proof.
  • Automatically wiring the LLM verifier into the queue/worker path is left for follow-up work.
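The completion ordering and the retry/dead_letter transition noted above can be illustrated with a small in-memory sketch. Everything here is an assumption for illustration: the `Job` and `Store` shapes, keying findings by job ID, and counting attempts at failure time are not taken from the PR.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Job:
    job_id: str
    repo_id: str
    commit_sha: str
    status: str = "leased"
    attempts: int = 0


@dataclass
class Store:
    findings: Dict[str, List[str]] = field(default_factory=dict)
    ledger: Dict[Tuple[str, str], str] = field(default_factory=dict)


def complete_processed_job(store: Store, job: Job, findings: List[str]) -> None:
    # 1. Findings first, keyed by job ID so a replay overwrites instead of duplicating.
    store.findings[job.job_id] = list(findings)
    # 2. Ledger row with put-if-absent semantics: an existing row is left untouched.
    store.ledger.setdefault((job.repo_id, job.commit_sha), job.job_id)
    # 3. Mark the job completed last, so a crash before this point leaves it retryable.
    job.status = "completed"


def record_failure(job: Job, max_attempts: int) -> None:
    # Retryable failures go back to pending until the attempt budget is spent.
    job.attempts += 1
    job.status = "dead_letter" if job.attempts >= max_attempts else "pending"
```

With this ordering, a worker that dies between steps 1 and 3 leaves a job that is still not completed; replaying it repeats steps that are each safe to run twice.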

Dependency

Refs #12.

Follow-up work candidates:

  • production scaling: queue shard, target shard, status aggregate, capacity plan
  • verifier integration: automate LLM verification after scanner findings
  • cloud deployment path: managed DynamoDB/worker runtime hardening
  • operational metrics: queue depth, expired lease, retry/dead_letter alerts

Checklist

  • I confirmed that the commits in this PR contain no secret values.
  • uv run pytest -q tests/test_incremental_scan_storage.py
  • uv run pytest -q tests/test_scan_target_storage.py tests/test_cli_scan_all.py
  • uv run pytest -q
  • git diff --check origin/main..HEAD
  • Docker Compose quickstart smoke on ragflow-ubuntu with public synthetic target path
  • public-safety grep reviewed; only synthetic placeholders/examples observed

pureliture and others added 7 commits June 12, 2026 11:39
Add the issue #12 ADR, implementation spec, and Goal-based agentic workflow for the incremental scan queue worker MVP.

Co-Authored-By: Codex GPT-5 <noreply@openai.com>
Implement the G1 storage model for incremental scan queue state, commit ledger rows, job leases, repo leases, and retry-safe completion.

Co-Authored-By: Codex GPT-5 <noreply@openai.com>
Implement discover-updates initialize and enqueue modes with git ref discovery, ref state updates, ledger skips, and queue job creation.

Co-Authored-By: Codex GPT-5 <noreply@openai.com>
Add git log opts support for Gitleaks and implement scan-worker --once for queued incremental commit jobs.

Co-Authored-By: Codex GPT-5 <noreply@openai.com>
Co-Authored-By: Codex GPT-5 <noreply@openai.com>
Add container/runtime support for a one-command local quickstart path with DynamoDB Local persistence, SCM preflight checks, public git fallback, and quickstart queue seeding/worker execution.

Co-Authored-By: Codex GPT-5 <noreply@openai.com>
Extract scan-all locking, catalog lookup, fetch isolation, in-memory manifest assembly, local scan invocation, and notification logging into runtime/scan_all.py while keeping the CLI as a thin presentation wrapper.

Co-Authored-By: Codex GPT-5 <noreply@openai.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements an incremental scan queue worker MVP for branch-aware secret scanning, replacing Dynalite with DynamoDB Local and introducing new CLI commands (discover-updates, scan-worker, queue-status, doctor, and quickstart) along with a turnkey Docker Compose worker. The feedback recommends improving lock safety in run_scan_all using a with statement, optimizing get_queue_status by querying GSI1 instead of scanning the table, continuing the worker loop on repository lease failures instead of breaking, and expanding type validation tests for git_log_opts.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +105 to +141
    try:
        lock_cm = _scan_all_lock(lock_path)
        lock_cm.__enter__()
    except _LockContention:
        event_at = request.now_factory()
        request.notification_writer(
            log_path,
            lock_contention_record(
                event_at=event_at,
                lock_path=str(lock_path),
            ),
        )
        return ScanAllResult(
            exit_code=3,
            status="lock_contention",
            error="another scan-all is running",
        )
    except OSError as exc:
        request.notification_writer(
            log_path,
            fatal_error_record(
                event_at=request.now_factory(),
                scan_run_id=None,
                error=f"Lock acquisition failed: {exc}",
                stage="lock_acquisition",
            ),
        )
        return ScanAllResult(
            exit_code=1,
            status="lock_failure",
            error=f"failed to acquire lock: {exc}",
        )

    try:
        return _run_scan_all_locked(request, started_at, log_path)
    finally:
        lock_cm.__exit__(None, None, None)

medium

In run_scan_all, _scan_all_lock (implemented with @contextlib.contextmanager) is driven by calling __enter__() and __exit__() manually. This is unsafe: when an exception occurs, the context manager never receives it properly, so the exception can be dropped or a RuntimeError can be raised inside contextlib.

Use Python's standard with statement instead to manage the lock safely and intuitively.

    try:
        with _scan_all_lock(lock_path):
            return _run_scan_all_locked(request, started_at, log_path)
    except _LockContention:
        event_at = request.now_factory()
        request.notification_writer(
            log_path,
            lock_contention_record(
                event_at=event_at,
                lock_path=str(lock_path),
            ),
        )
        return ScanAllResult(
            exit_code=3,
            status="lock_contention",
            error="another scan-all is running",
        )
    except OSError as exc:
        request.notification_writer(
            log_path,
            fatal_error_record(
                event_at=request.now_factory(),
                scan_run_id=None,
                error=f"Lock acquisition failed: {exc}",
                stage="lock_acquisition",
            ),
        )
        return ScanAllResult(
            exit_code=1,
            status="lock_failure",
            error=f"failed to acquire lock: {exc}",
        )
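One caveat with wrapping the whole `with` block in `try`: an `OSError` (or `_LockContention`) raised inside the locked body would also land in these handlers and be reported as a lock failure. A possible alternative, sketched below with `contextlib.ExitStack`, keeps the `with` semantics while limiting the handlers to lock acquisition. The names `LockContention`, `scan_all_lock`, and `run_locked`, and the in-memory lock itself, are hypothetical stand-ins rather than the PR's code.

```python
import contextlib
from typing import Callable, Iterator, Set


class LockContention(Exception):
    """Hypothetical stand-in for the PR's _LockContention."""


@contextlib.contextmanager
def scan_all_lock(held: Set[str], name: str) -> Iterator[None]:
    # In-memory lock used only for this sketch.
    if name in held:
        raise LockContention(name)
    held.add(name)
    try:
        yield
    finally:
        held.discard(name)


def run_locked(held: Set[str], name: str, body: Callable[[], str]) -> str:
    with contextlib.ExitStack() as stack:
        try:
            # Only acquisition errors are handled here.
            stack.enter_context(scan_all_lock(held, name))
        except LockContention:
            return "lock_contention"
        except OSError as exc:
            return f"lock_failure: {exc}"
        # Errors raised by body() propagate to the caller unchanged;
        # ExitStack still releases the lock and passes the exception to it.
        return body()
```

The trade-off is a little more ceremony than a plain `with`, in exchange for not mislabeling scan errors as lock errors.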

Comment on lines +355 to +359
    lease_items = scan_all_pages(
        self._table,
        FilterExpression="entityType = :entity_type",
        ExpressionAttributeValues={":entity_type": "REPO_LEASE"},
    )

medium

get_queue_status uses scan_all_pages to scan the entire table. A DynamoDB Scan reads every item in the table before the filter expression is applied, so performance degrades and cost (consumed RCUs) rises as the data grows.

Every REPO_LEASE item carries the GSI1 key gsi1pk = REPO_LEASE#ALL, so a Query against GSI1 is far more efficient than a Scan.

Suggested change
    - lease_items = scan_all_pages(
    -     self._table,
    -     FilterExpression="entityType = :entity_type",
    -     ExpressionAttributeValues={":entity_type": "REPO_LEASE"},
    - )
    + lease_items = query_all_pages(
    +     self._table,
    +     IndexName=GSI1_NAME,
    +     KeyConditionExpression="gsi1pk = :pk",
    +     ExpressionAttributeValues={":pk": "REPO_LEASE#ALL"},
    + )
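The body of query_all_pages is not shown in this thread, so the following is only a guess at what such a helper could look like. It assumes a table object with a boto3-style `query(**kwargs)` method that returns `Items` and, when more pages remain, `LastEvaluatedKey`. `FakeTable` is a hypothetical stand-in so the pagination loop can be exercised without DynamoDB.

```python
from typing import Any, Dict, List


def query_all_pages(table: Any, **query_kwargs: Any) -> List[Dict[str, Any]]:
    """Follow LastEvaluatedKey until the query is exhausted (sketch)."""
    items: List[Dict[str, Any]] = []
    kwargs = dict(query_kwargs)
    while True:
        page = table.query(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items
        # Resume the next request where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key


class FakeTable:
    """Hypothetical test double that serves pre-built pages."""

    def __init__(self, pages: List[List[Dict[str, Any]]]) -> None:
        self._pages = pages
        self.calls: List[Dict[str, Any]] = []

    def query(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(kwargs)
        index = kwargs.get("ExclusiveStartKey", {}).get("page", 0)
        page: Dict[str, Any] = {"Items": self._pages[index]}
        if index + 1 < len(self._pages):
            page["LastEvaluatedKey"] = {"page": index + 1}
        return page
```

A call such as `query_all_pages(table, IndexName=..., KeyConditionExpression="gsi1pk = :pk", ExpressionAttributeValues={":pk": "REPO_LEASE#ALL"})` would then read only the lease partition of the index instead of every item in the table.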

Comment on lines +100 to +106
    if not request.store.acquire_repo_lease(
        job.repo_id,
        worker_id,
        request.lease_seconds,
    ):
        request.store.return_job_to_pending(job.job_id, "repo lease unavailable")
        break

medium

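When the worker fails to acquire the lease for one repository (repo lease), it calls break and leaves the loop entirely. As a result, a single locked repository makes the worker stop processing altogether, even though jobs for other, independent repositories are still waiting in the queue.

Replacing break with continue lets the worker keep trying jobs for other repositories. However, return_job_to_pending puts the job straight back into the pending state, so the same worker could pick up the same job on the next iteration and spin in an infinite loop. To prevent that, add a short backoff delay to next_attempt_at when calling return_job_to_pending.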

Suggested change
    - if not request.store.acquire_repo_lease(
    -     job.repo_id,
    -     worker_id,
    -     request.lease_seconds,
    - ):
    -     request.store.return_job_to_pending(job.job_id, "repo lease unavailable")
    -     break
    + if not request.store.acquire_repo_lease(
    +     job.repo_id,
    +     worker_id,
    +     request.lease_seconds,
    + ):
    +     request.store.return_job_to_pending(job.job_id, "repo lease unavailable")
    +     continue
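The continue-plus-backoff recommendation can be sketched with an in-memory store. The store API below (`claim_next_job`, a `return_job_to_pending` that takes `now` and `backoff_seconds`) is hypothetical and differs from the PR's signatures; it only shows why pushing next_attempt_at forward keeps the loop from re-claiming the same blocked job.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Job:
    job_id: str
    repo_id: str
    status: str = "pending"
    next_attempt_at: float = 0.0


@dataclass
class InMemoryStore:
    jobs: Dict[str, Job]
    leased_repos: Set[str] = field(default_factory=set)

    def claim_next_job(self, now: float) -> Optional[Job]:
        # Only jobs whose backoff has elapsed are eligible.
        for job in self.jobs.values():
            if job.status == "pending" and job.next_attempt_at <= now:
                job.status = "leased"
                return job
        return None

    def acquire_repo_lease(self, repo_id: str) -> bool:
        if repo_id in self.leased_repos:
            return False
        self.leased_repos.add(repo_id)
        return True

    def release_repo_lease(self, repo_id: str) -> None:
        self.leased_repos.discard(repo_id)

    def return_job_to_pending(self, job_id: str, now: float, backoff_seconds: float) -> None:
        job = self.jobs[job_id]
        job.status = "pending"
        job.next_attempt_at = now + backoff_seconds


def run_once(store: InMemoryStore, now: float, max_jobs: int,
             backoff_seconds: float = 30.0) -> List[str]:
    processed: List[str] = []
    for _ in range(max_jobs):
        job = store.claim_next_job(now)
        if job is None:
            break  # nothing eligible is left
        if not store.acquire_repo_lease(job.repo_id):
            store.return_job_to_pending(job.job_id, now, backoff_seconds)
            continue  # try jobs for other repositories
        try:
            job.status = "completed"  # placeholder for the actual scan
            processed.append(job.job_id)
        finally:
            store.release_repo_lease(job.repo_id)
    return processed
```

In this sketch a job for a locked repository is skipped for the rest of the pass, while jobs for other repositories still run within the same bounded loop.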

Comment thread tests/test_manifest.py
Comment on lines +160 to +172
    def test_scan_git_log_opts_non_string_raises(tmp_path: Path) -> None:
        yaml_content = """\
    version: 1
    targets:
      - name: demo-bad-log-opts
        path: /tmp/demo-bad-log-opts
    scan:
      git_log_opts: 123
    """
        p = write_yaml(tmp_path, yaml_content)

        with pytest.raises(ManifestError, match="scan.git_log_opts"):
            load_manifest(p)

medium

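The type validation test for the git_log_opts field covers only an integer (123). Per the repository's General Rules, defensive type validation of parsed JSON/YAML fields should come with comprehensive unit tests that verify robust behavior for a range of unexpected types, including None, booleans, lists, and dictionaries.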

    @pytest.mark.parametrize(
        "invalid_value",
        [
            "null",
            "123",
            "True",
            "['--some-opt']",
            "{'key': 'value'}",
        ],
    )
    def test_scan_git_log_opts_invalid_types_raise(tmp_path: Path, invalid_value: str) -> None:
        # A single backslash continues the line, so the YAML starts at "version".
        yaml_content = f"""\
    version: 1
    targets:
      - name: demo-bad-log-opts
        path: /tmp/demo-bad-log-opts
    scan:
      git_log_opts: {invalid_value}
    """
        p = write_yaml(tmp_path, yaml_content)

        with pytest.raises(ManifestError, match="scan.git_log_opts"):
            load_manifest(p)
References
  1. When implementing defensive type validation for parsed JSON fields (e.g., verifying a field is a string), ensure robustness by adding comprehensive unit tests that cover various unexpected types, including None, numbers, booleans, lists, and dictionaries.
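The validation these tests exercise could look roughly like the sketch below. It is an assumption, not the repository's loader: `require_string` is a hypothetical helper, and `ManifestError` here is a local stand-in for the project's exception. It also assumes that an explicit null should be rejected, as the parametrized cases imply.

```python
from typing import Any


class ManifestError(ValueError):
    """Local stand-in for the project's ManifestError."""


def require_string(value: Any, field_name: str) -> str:
    """Reject anything that is not a str after YAML/JSON parsing."""
    # isinstance(..., str) already excludes None, int, bool, list, and dict,
    # so no special case for bool (a subclass of int) is needed.
    if not isinstance(value, str):
        raise ManifestError(
            f"{field_name} must be a string, got {type(value).__name__}"
        )
    return value
```

Including the field name in the message is what lets a test match on "scan.git_log_opts" regardless of which unexpected type was supplied.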

@pureliture pureliture marked this pull request as ready for review June 15, 2026 23:10
Resolve conflicts between incremental queue worker support and the latest severity/noise-filter changes on main. Keep git_log_opts and enable_noise_filter manifest options together.

Co-Authored-By: Codex GPT-5 <noreply@openai.com>
Comment thread src/security_scanner/runtime/scan_all.py Fixed
Comment thread src/security_scanner/runtime/scan_all.py Fixed
Comment thread src/security_scanner/runtime/scan_all.py Fixed
Comment thread tests/test_cli_scan_worker.py Fixed
Comment thread tests/test_cli_scan_worker.py Fixed
Use a default factory for the scan-all notification writer so static analysis does not treat the writer as a bound dataclass method. Clean up scan-worker test notices reported by CodeQL.

Co-Authored-By: Codex GPT-5 <noreply@openai.com>
@pureliture pureliture merged commit 0d9e485 into main Jun 15, 2026
2 checks passed
@pureliture pureliture deleted the codex/issue-12-g1-queue-ledger-model branch June 16, 2026 11:43
2 participants