Branch-aware scan worker based on a ref update queue and commit ledger (#12)

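## Summary

Issue #2 covered consolidating findings across multiple branches, but the operating model that follows the current `scan-all`/`SCAN_TARGET` catalog needs a design at a larger scope.

The goal is to handle many repositories and branches on a schedule without rescanning the full history every time: **separate fetch/discovery from scan execution** and use a **commit-level scan ledger** to avoid rescans.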

## Current state

- `scan-all` reads the enabled `SCAN_TARGET` entries, clones or fetches each repo, and calls `run_local_scan()` within the same run.
- `fetch_or_clone()` runs `git fetch --all --prune` on an existing clone.
- `GitleaksRunner` currently runs `gitleaks git` or `gitleaks dir`, but has no scanner option for expressing a commit range or ref range.
- The `Finding.repo.branch` and `Finding.repo.commit` fields exist but are not yet populated from the scan context.
- There is no scan job queue, worker lease, ref update ledger, or commit scan ledger yet.

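## Problem

A single-batch `scan-all` is simple enough for the initial MVP, but as the number of repositories and branches grows, the following problems appear.

- Fetch and scan sit on the same critical path, so a long run delays the next cycle.
- Even when new commits land on only some branches, the run tends toward rescanning the entire target.
- It is hard to track separately whether a finding remains on each branch and whether each commit has been scanned.
- There is no durable model for expressing worker parallelism, backpressure, retry, leases, or a stale-branch policy.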

## Proposal

### 1. `fetch-cron` / `discover-updates` command

Periodically fetch the enabled targets and compute ref changes.

- Update a bare mirror or cache checkout.
- Record each ref's `old_sha` / `new_sha`.
- Compute the new commit range.
- Enqueue the commits/ranges to scan onto the `SCAN_JOB` queue.
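
The ref-diff step could look roughly like the sketch below. It compares two `{ref_name: sha}` snapshots taken before and after a fetch and derives a git revision range for each changed ref. `RefUpdate`, `diff_refs`, and `log_range` are hypothetical names used only for illustration; how the snapshots are collected (for example from `git for-each-ref`) is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RefUpdate:
    ref_name: str
    old_sha: Optional[str]  # None when the ref did not exist before the fetch
    new_sha: Optional[str]  # None when the ref was deleted (pruned)

    def log_range(self) -> Optional[str]:
        """Return a git revision range covering only the newly reachable commits."""
        if self.new_sha is None:
            return None          # deleted ref: nothing new to scan
        if self.old_sha is None:
            return self.new_sha  # new ref: everything reachable from the tip
        return f"{self.old_sha}..{self.new_sha}"


def diff_refs(before: dict[str, str], after: dict[str, str]) -> list[RefUpdate]:
    """Compare two {ref_name: sha} snapshots and list the refs that changed."""
    updates = []
    for ref in sorted(before.keys() | after.keys()):
        old, new = before.get(ref), after.get(ref)
        if old != new:
            updates.append(RefUpdate(ref, old, new))
    return updates
```

A brand-new ref yields a range that reaches the whole history from its tip, so this sketch relies on the commit-level ledger to skip commits already scanned through another branch.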

The queue does not store patch blobs; it stores only minimal metadata.

```text
repo_id
ref_name
old_sha
new_sha
commit_sha or commit_range
scanner_name
scanner_version
rule_pack_version
scanner_config_hash
priority
status
attempts
lease_until
created_at
updated_at
```
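
As one possible in-code shape for this record, the fields above map onto a dataclass as in the sketch below. The status values, the defaults, and the `dedupe_key` helper are assumptions added for illustration, not part of the proposal.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class ScanJob:
    repo_id: str
    ref_name: str
    old_sha: Optional[str]
    new_sha: str
    commit_range: str          # a single commit_sha or a range such as "old..new"
    scanner_name: str
    scanner_version: str
    rule_pack_version: str
    scanner_config_hash: str
    priority: int = 0
    status: str = "queued"     # assumed states: queued, leased, done, failed, dead
    attempts: int = 0
    lease_until: Optional[datetime] = None
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)

    def dedupe_key(self) -> tuple:
        """Jobs sharing this key describe the same scan work, whatever ref found it."""
        return (self.repo_id, self.commit_range, self.scanner_name,
                self.scanner_version, self.rule_pack_version, self.scanner_config_hash)
```

Checking `dedupe_key()` before enqueueing would keep two branches that point at the same range from producing duplicate jobs.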

### 2. `scan-worker` command

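The worker polls the queue and processes only jobs for which it has acquired a lease.

- A single worker process/container holds one repo workspace lease at a time.
- It scans only the job's commit/range, either by checking it out or by passing the range via `gitleaks git --log-opts`.
- On success, it records the scan completion in `SCAN_LEDGER`.
- On failure, the job moves to a retry/backoff/dead-letter state.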
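
Lease acquisition could be sketched as below, assuming a SQLite-backed `scan_job` table. The table layout, the `acquire_job` name, and the 300-second default lease are illustrative assumptions. The sketch covers job-level leasing only; repo workspace exclusivity and retry/backoff are not modeled.

```python
import sqlite3
from typing import Optional

# Assumed minimal table; a real SCAN_JOB table would carry the full metadata.
SCHEMA = """
CREATE TABLE scan_job (
    id INTEGER PRIMARY KEY,
    repo_id TEXT NOT NULL,
    commit_range TEXT NOT NULL,
    priority INTEGER NOT NULL DEFAULT 0,
    status TEXT NOT NULL DEFAULT 'queued',
    attempts INTEGER NOT NULL DEFAULT 0,
    lease_until REAL
)
"""


def acquire_job(conn: sqlite3.Connection, now: float,
                lease_seconds: float = 300.0) -> Optional[int]:
    """Lease the highest-priority runnable job, or return None if there is none."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id FROM scan_job "
            "WHERE status = 'queued' OR (status = 'leased' AND lease_until < ?) "
            "ORDER BY priority DESC, id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        # The eligibility condition is repeated so that a job taken by another
        # worker between the SELECT and the UPDATE is not leased twice.
        cur = conn.execute(
            "UPDATE scan_job "
            "SET status = 'leased', attempts = attempts + 1, lease_until = ? "
            "WHERE id = ? AND (status = 'queued' "
            "OR (status = 'leased' AND lease_until < ?))",
            (now + lease_seconds, row[0], now),
        )
        return row[0] if cur.rowcount == 1 else None
```

A job whose lease has expired becomes eligible again, which is how a crashed worker's job would be picked up by another worker under this sketch.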

### 3. Commit-level scan ledger

If a commit has already been scanned with the same scanner/rule/config combination, it is not scanned again.

Candidate ledger key:

```text
repo_id
commit_sha
scanner_name
scanner_version
rule_pack_version
scanner_config_hash
```

If `rule_pack_version` or `scanner_config_hash` changes, existing scan completions are not reused.
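
A minimal sketch of the ledger lookup follows, with an in-memory set standing in for a durable `SCAN_LEDGER` table. `LedgerKey`, `ScanLedger`, and `config_hash` are hypothetical names, and hashing canonical JSON with SHA-256 is just one way to derive `scanner_config_hash`.

```python
import hashlib
import json
from dataclasses import dataclass


def config_hash(config: dict) -> str:
    """Hash a scanner config deterministically; key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class LedgerKey:
    repo_id: str
    commit_sha: str
    scanner_name: str
    scanner_version: str
    rule_pack_version: str
    scanner_config_hash: str


class ScanLedger:
    """In-memory stand-in for a durable SCAN_LEDGER table."""

    def __init__(self) -> None:
        self._done: set[LedgerKey] = set()

    def record(self, key: LedgerKey) -> None:
        """Mark a commit as scanned under one scanner/rule/config combination."""
        self._done.add(key)

    def pending(self, keys: list[LedgerKey]) -> list[LedgerKey]:
        """Return only the keys that have no recorded scan completion."""
        return [k for k in keys if k not in self._done]
```

Because every field is part of the key, bumping `rule_pack_version` or changing the config produces a new key, and the commit shows up as pending again.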

### 4. Branch-aware finding/occurrence model

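The core problem from #2 still stands. Branch fan-out, however, is handled together with the scan worker/ledger design.

- Keep identity as stable as possible by centering it on the secret/rule.
- Put branch/ref/path/line/commit on the occurrence or observation side.
- report/gate/evaluate expose whether a finding remains on each branch.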
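
The identity/occurrence split above might be modeled as in the sketch below. All names are hypothetical, and the `at_ref_tip` flag (whether the secret is still present at the tip of the ref) is an assumption about how remaining findings would be recorded.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FindingIdentity:
    rule_id: str
    secret_fingerprint: str  # assumed: a hash of the secret, never the raw value


@dataclass(frozen=True)
class Occurrence:
    identity: FindingIdentity
    ref_name: str
    commit_sha: str
    path: str
    line: int
    at_ref_tip: bool  # assumed flag: still present at the tip of ref_name


def refs_still_affected(
    occurrences: list[Occurrence],
) -> dict[FindingIdentity, set[str]]:
    """Group occurrences by identity and collect the refs where each remains."""
    result: dict[FindingIdentity, set[str]] = defaultdict(set)
    for occ in occurrences:
        refs = result[occ.identity]  # also creates an empty entry if none remain
        if occ.at_ref_tip:
            refs.add(occ.ref_name)
    return dict(result)
```

A finding that maps to an empty set has been observed in history but no longer remains on any scanned ref, which is the distinction report/gate/evaluate would need to surface.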

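## Acceptance criteria

- [ ] Fetch/discovery is separated from scan execution.
- [ ] Ref updates are stored as `SCAN_JOB` entries, and workers process them via polling and leases.
- [ ] An already-scanned commit is not rescanned under the same scanner/rule/config combination.
- [ ] Only branches with new commits become incremental scan targets.
- [ ] Even when workers run redundantly, the same job is never processed concurrently.
- [ ] Whether a finding remains on a given branch/ref/commit can be checked from the finding or observation.
- [ ] The existing `scan-all` is kept as a full-baseline or fallback path.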

## Out of scope

- Webhook/on-demand fetch.
- Real-time event ingestion based on external SCM APIs.
- Adopting a managed queue service.
- Replacing Gitleaks as the primary scanner.

## Supersedes

Supersedes #2. The multi-branch concern from #2 still holds, but in the current architecture it is more accurate to address it together with the queue/ledger/worker design.

