Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
[Unit]
Description=security-scanner personal dead-letter auto-requeue
Documentation=https://github.com/source-security-dev/security-scanner

[Service]
Type=oneshot
Slice=securityscanner.slice
Nice=15
IOSchedulingClass=idle
TasksMax=128
WorkingDirectory=%h/security-scanner
EnvironmentFile=-%h/.config/security-scanner/personal-prod.env
Environment=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/bin
Environment=SECURITY_SCANNER_STORAGE_BACKEND=dynamodb
Environment=SECURITY_SCANNER_DYNAMO_ENDPOINT=http://localhost:4567
Environment=SECURITY_SCANNER_DYNAMO_TABLE=security_scanner_personal
Environment=SECURITY_SCANNER_DEAD_LETTER_AUTO_REQUEUE_LIMIT=10
Environment=SECURITY_SCANNER_DEAD_LETTER_AUTO_REQUEUE_COOLDOWN_MINUTES=30
ExecStart=/usr/bin/env PATH=%h/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/bin %h/.local/bin/uv run security-scanner dead-letter auto-requeue \
--storage-backend ${SECURITY_SCANNER_STORAGE_BACKEND} \
--job-type verify \
--cooldown-minutes ${SECURITY_SCANNER_DEAD_LETTER_AUTO_REQUEUE_COOLDOWN_MINUTES} \
--limit ${SECURITY_SCANNER_DEAD_LETTER_AUTO_REQUEUE_LIMIT} \
--apply

[Install]
WantedBy=default.target
@@ -0,0 +1,12 @@
[Unit]
Description=Scheduler for security-scanner personal dead-letter auto-requeue
Documentation=https://github.com/source-security-dev/security-scanner

[Timer]
OnCalendar=*:0/30:00
Persistent=true
RandomizedDelaySec=300
Unit=security-scanner-personal-dead-letter-auto-requeue.service

[Install]
WantedBy=timers.target
301 changes: 301 additions & 0 deletions docs/workbench/specs/dead-letter-auto-requeue/design.md
@@ -0,0 +1,301 @@
# Dead-Letter Auto-Requeue Design Spec

## Overview

Dead-letter auto-requeue extends the existing dead-letter recovery module with a
conservative, timer-ready policy for transient verifier failures. It reuses the
current DB-backed `SCAN_JOB` queue and manual recovery seam, without adding SQS,
LocalStack, a new table, or a new GSI.

## Requirements Reference

- Phase 1 source: `requirements.md`
- Preview companion: `requirements.html`
- Core scope: one PR, conservative timer-ready auto-requeue.
- Core policy: verify transient dead-letter jobs only, default cooldown 30
minutes, at most one automatic requeue per job.
- Explicit exclusions: SQS, LocalStack, Kafka-style offsets, outbox framework,
new storage tables, new GSIs, broad automatic recovery.

## Approach Proposal

### Selected: extend existing recovery seam

Add auto-requeue behavior to the existing `runtime/dead_letter_recovery.py`
runtime seam, storage protocol, scan CLI command group, and user systemd unit
set.

Why this is selected:

- Auto-requeue is a policy-controlled subset of dead-letter recovery, not a new
queue subsystem.
- The current module already owns public-safe classification, bounded reads,
guarded requeue, and rendering.
- Reusing the seam keeps manual and automatic recovery consistent while avoiding
duplicate DTOs and storage paths.

### Guardrail inside the selected approach

Do not bolt timer/cooldown conditions into the manual requeue path. The module
should expose separate use cases:

- manual recovery: operator-selected subset;
- automatic recovery: policy-selected transient subset;
- shared primitives: classification, public-safe projection, bounded guarded
storage apply.

### Rejected: separate auto-requeue module

A separate module would create a cleaner file split, but it would likely
duplicate classification, selection, rendering, and guarded storage logic already
owned by dead-letter recovery.

### Rejected: queue policy layer

A general retry/dead-letter policy layer could be useful later, but this PR does
not need multi-queue policy orchestration, provider-specific behavior, or a new
workflow abstraction.

## Architecture

```mermaid
flowchart TD
Timer["systemd user timer"] --> Service["auto-requeue service"]
Operator["operator CLI dry-run/apply"] --> CLI["scan dead-letter auto-requeue"]
Service --> CLI
CLI --> Runtime["runtime.dead_letter_recovery"]
Runtime --> Classifier["DeadLetterClassification"]
Runtime --> Policy["AutoRequeuePolicy"]
Runtime --> Store["IncrementalScanStore dead-letter methods"]
Store --> DB["SCAN_JOB status=dead_letter rows"]
Policy --> Selection["verify + transient + cooldown + one-shot"]
Selection --> GuardedApply["conditional pending transition"]
GuardedApply --> Pending["SCAN_JOB status=pending"]
Pending --> Drain["verify-drain"]
```

## Data Flow

### Dry Run

1. CLI builds an `AutoRequeueDeadLetterRequest`.
2. Runtime reads a bounded dead-letter page using the existing storage seam.
3. Runtime classifies each job into:
- terminal reason;
- root error class;
- public-safe auto eligibility.
4. Runtime filters to jobs that match all automatic policy gates.
5. Runtime returns a public-safe summary with `would_move`, selected root error
class counts, and skipped reasons.
6. No storage mutation occurs.
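
A minimal sketch of how steps 4 to 6 could aggregate a public-safe summary. The `decisions` input shape, the summary field names other than `would_move`, and the skip-reason strings are assumptions made for illustration; the spec only fixes the behavior, not this shape.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DeadLetterAutoRequeueSummary:
    # Field names besides would_move are hypothetical.
    applied: bool
    would_move: int
    selected_by_root_error_class: dict
    skipped_by_reason: dict


def summarize_dry_run(decisions):
    """Aggregate per-job decisions into counts only.

    decisions: iterable of (root_error_class, skip_reason) pairs, where
    skip_reason is None for a job that passed every policy gate.
    No job id or raw error text enters the summary.
    """
    decisions = list(decisions)  # iterated twice below
    selected = Counter(cls for cls, reason in decisions if reason is None)
    skipped = Counter(reason for _, reason in decisions if reason is not None)
    return DeadLetterAutoRequeueSummary(
        applied=False,  # dry run: nothing is written to storage
        would_move=sum(selected.values()),
        selected_by_root_error_class=dict(selected),
        skipped_by_reason=dict(skipped),
    )
```

Because the summary is built from counts alone, the same rendering path can serve apply mode by swapping `would_move` for a moved count.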

### Apply

1. The same request runs with explicit apply enabled.
2. Runtime computes eligible jobs with the same policy as dry run.
3. Runtime asks storage to move only matching rows from `dead_letter` to
`pending`.
4. Storage applies existing safety guards:
- row still exists;
- status is still `dead_letter`;
- updated time is not newer than the selection watermark;
- filter fields still match.
5. Storage clears execution ownership and sets `next_attempt_at` to apply time.
6. Storage preserves failure evidence and marks the job as already
auto-requeued once.
7. Runtime returns moved/skipped counts and public-safe classification counts.
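
The guards in steps 4 to 6 can be sketched against an in-memory row. This is an illustration only: the dictionary keys (`lease_owner`, `auto_requeued`, and so on) are hypothetical, and a real backend would express the same checks as a single conditional write rather than read-then-mutate.

```python
from datetime import datetime, timezone


def guarded_requeue(row, *, job_type, watermark, now):
    """Move one row from dead_letter to pending if every guard still holds.

    Returns True when the row was moved, False when it was skipped.
    """
    if row is None:                      # row still exists
        return False
    if row["status"] != "dead_letter":   # status is still dead_letter
        return False
    if row["updated_at"] > watermark:    # not changed after selection
        return False
    if row["job_type"] != job_type:      # filter fields still match
        return False
    if row.get("auto_requeued", False):  # one automatic requeue per job
        return False

    row["status"] = "pending"
    row["lease_owner"] = None            # clear execution ownership
    row["next_attempt_at"] = now         # runnable from apply time
    row["auto_requeued"] = True          # one-shot marker
    # attempts, last_error, and other failure evidence are left untouched.
    return True
```

A second call on the same row returns `False`, because the status guard no longer matches.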

### Timer

1. The user-level timer invokes the auto-requeue CLI with conservative defaults.
2. Default policy targets verify jobs only.
3. Default cooldown is 30 minutes.
4. Default limit is small and configurable through unit environment.
5. Timer files are shipped, but shipping them does not enable broad
auto-recovery; enabling the timer remains an operator action.

## Component Details

### `runtime.dead_letter_recovery`

Add a richer classification DTO while preserving public-safe output.

Conceptual shape:

```text
DeadLetterClassification
terminal_reason
root_error_class
auto_requeue_eligible
```

`terminal_reason` explains why the job reached `dead_letter`, for example retry
budget exhausted or lease expiry budget exhausted.

`root_error_class` explains what kind of failure caused that terminal state, for
example verifier timeout, verifier transport, malformed verify job, scanner
runtime, or unknown.

The current single `error_class` rendering can stay for compatibility, but
auto-requeue selection must key on `root_error_class` rather than on a terminal
reason such as `retry-budget-exhausted` alone.
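
A minimal Python sketch of the split, assuming hypothetical enum members, string values, and a hypothetical `classify` helper. Which root error classes count as transient is also an assumption here; the spec names only the three fields.

```python
from dataclasses import dataclass
from enum import Enum


class TerminalReason(Enum):
    RETRY_BUDGET_EXHAUSTED = "retry-budget-exhausted"
    LEASE_EXPIRY_BUDGET_EXHAUSTED = "lease-expiry-budget-exhausted"


class RootErrorClass(Enum):
    VERIFIER_TIMEOUT = "verifier-timeout"
    VERIFIER_TRANSPORT = "verifier-transport"
    MALFORMED_VERIFY_JOB = "malformed-verify-job"
    SCANNER_RUNTIME = "scanner-runtime"
    UNKNOWN = "unknown"


# Assumption: only these root error classes are treated as transient.
TRANSIENT_ROOT_ERRORS = frozenset(
    {RootErrorClass.VERIFIER_TIMEOUT, RootErrorClass.VERIFIER_TRANSPORT}
)


@dataclass(frozen=True)
class DeadLetterClassification:
    terminal_reason: TerminalReason
    root_error_class: RootErrorClass
    auto_requeue_eligible: bool


def classify(terminal_reason, root_error_class):
    """Eligibility follows the root error class, not the terminal reason."""
    return DeadLetterClassification(
        terminal_reason=terminal_reason,
        root_error_class=root_error_class,
        auto_requeue_eligible=root_error_class in TRANSIENT_ROOT_ERRORS,
    )
```

Two jobs with the same terminal reason can therefore differ in eligibility: a timeout that exhausted its retry budget is eligible, a malformed job that exhausted the same budget is not.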

Add a dedicated automatic use case:

```text
auto_requeue_dead_letter_jobs(request) -> DeadLetterAutoRequeueSummary
```

Responsibilities:

- validate positive limit and non-negative cooldown;
- select only jobs older than the cooldown;
- select only allowed job types, defaulting to verify when invoked by timer;
- select only transient root error classes;
- skip jobs that were already automatically requeued once;
- support dry-run and explicit apply;
- return public-safe moved/would-move/skipped counts.
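
The responsibilities above can be sketched as a pure selection function. The `Job` projection, the `select_auto_requeue` name, the skip-reason strings, and the transient set are all assumptions for illustration; the real `auto_requeue_dead_letter_jobs` may be structured differently.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Job:
    # Hypothetical public-safe projection of a dead-letter SCAN_JOB row.
    job_type: str
    root_error_class: str
    dead_lettered_at: datetime
    auto_requeued: bool


# Assumption: these root error classes are the transient ones.
TRANSIENT = frozenset({"verifier-timeout", "verifier-transport"})


def select_auto_requeue(jobs, *, now, limit=10, cooldown_minutes=30,
                        job_types=("verify",)):
    """Return (selected jobs, skipped counts by reason)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    if cooldown_minutes < 0:
        raise ValueError("cooldown must be non-negative")

    cutoff = now - timedelta(minutes=cooldown_minutes)
    selected, skipped = [], Counter()
    for job in jobs:
        if job.job_type not in job_types:
            skipped["job-type"] += 1
        elif job.root_error_class not in TRANSIENT:
            skipped["not-transient"] += 1
        elif job.dead_lettered_at > cutoff:
            skipped["cooldown"] += 1
        elif job.auto_requeued:
            skipped["already-auto-requeued"] += 1
        elif len(selected) >= limit:
            skipped["limit"] += 1
        else:
            selected.append(job)
    return selected, dict(skipped)
```

Keeping selection free of storage calls lets dry run and apply share exactly the same policy, with only the final write differing.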

### Storage

Reuse existing dead-letter inspect and guarded requeue operations, but add enough
stored evidence to enforce one automatic requeue per job.

The implementation can choose the smallest compatible representation, but it
must support:

- checking whether a job was already auto-requeued;
- marking a moved job as auto-requeued once during apply;
- preserving attempts, max attempts, lease expiry counters, fence, last error,
and finding snapshot.

No new table or GSI is introduced. The metadata must live on the existing
`SCAN_JOB` row or an already-supported extension field.

### CLI

Add a scan-queue-adjacent command under the existing dead-letter group.

Recommended shape:

```text
dead-letter auto-requeue
```

Required behavior:

- dry-run by default;
- `--apply` required for mutation;
- bounded `--limit`;
- `--cooldown-minutes` defaulting to 30;
- `--job-type` supported, timer defaulting to verify;
- `--root-error-class` or equivalent filter supported for diagnostics;
- public-safe output only.

The command should render:

- applied yes/no;
- would move;
- moved;
- selected counts by job type and root error class;
- skipped counts by public-safe reason.
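
A sketch of the flag surface using the standard library `argparse`. The project's actual CLI framework is not shown in this spec, so the parser construction, the `--limit` default of 10 (taken from the unit environment), and the helper names are assumptions.

```python
import argparse


def positive_int(value):
    number = int(value)
    if number <= 0:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number


def non_negative_int(value):
    number = int(value)
    if number < 0:
        raise argparse.ArgumentTypeError("must be zero or greater")
    return number


def build_parser():
    parser = argparse.ArgumentParser(prog="dead-letter auto-requeue")
    parser.add_argument("--storage-backend", required=True)
    parser.add_argument("--job-type", default="verify")
    parser.add_argument("--cooldown-minutes", type=non_negative_int, default=30)
    parser.add_argument("--limit", type=positive_int, default=10)
    # Optional diagnostic filter; may be repeated.
    parser.add_argument("--root-error-class", action="append", default=None)
    # Dry-run unless --apply is passed explicitly.
    parser.add_argument("--apply", action="store_true")
    return parser
```

Validation happens at parse time, so an invalid limit or cooldown exits before any storage code runs.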

### Systemd User Units

Add personal user units for conservative timer-ready operation.

Recommended files:

```text
deploy/systemd/user/security-scanner-personal-dead-letter-auto-requeue.service
deploy/systemd/user/security-scanner-personal-dead-letter-auto-requeue.timer
```

The service should invoke the CLI with:

- storage backend from the personal environment;
- job type verify;
- cooldown 30 minutes;
- small limit;
- explicit apply.

The timer cadence should be slower than the cooldown or otherwise avoid tight
repeat loops. The unit files should be installable, but enabling remains an
operator action.
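
The interaction between cadence and cooldown can be worked through with a little arithmetic. The sketch below assumes a schedule that fires on a fixed grid starting at midnight (for example every 30 minutes, at :00 and :30), ignores any randomized start delay, and uses a hypothetical helper name.

```python
from datetime import datetime, timedelta


def first_eligible_tick(dead_lettered_at, cooldown_minutes=30, period_minutes=30):
    """Return the first timer tick at or after the end of the cooldown."""
    eligible_at = dead_lettered_at + timedelta(minutes=cooldown_minutes)
    midnight = eligible_at.replace(hour=0, minute=0, second=0, microsecond=0)
    period = timedelta(minutes=period_minutes)
    # Ceiling division: number of whole periods needed to reach eligible_at.
    ticks = -(-(eligible_at - midnight) // period)
    return midnight + ticks * period
```

With both values at 30 minutes, a job dead-lettered at 10:05 leaves cooldown at 10:35 and is first considered by the 11:00 run, so the effective wait falls between 30 and 60 minutes plus any randomized delay.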

## Error Handling

- Unsupported storage backend exits before mutation.
- Invalid limit or cooldown exits before mutation.
- Unknown or permanent root error classes are skipped, not fatal.
- Jobs updated after selection are skipped by conditional write.
- Jobs already auto-requeued once are skipped.
- Empty eligible set is a successful no-op.
- Storage failures fail the command with a public-safe diagnostic.
- Raw job id, raw error, private repo/path/ref/commit material, endpoint, prompt,
response, and finding snapshot are never printed.

## Testing Strategy

### Runtime tests

- Retry-budget-exhausted timeout jobs expose root error class verifier timeout.
- Retry-budget-exhausted transport jobs expose root error class verifier
transport.
- Malformed verify jobs are never auto eligible.
- Unknown jobs are never auto eligible.
- Cooldown blocks recent dead-letter jobs.
- A job already auto-requeued once is skipped.
- Dry-run performs no mutation and reports would-move counts.
- Apply reports moved and skipped counts.

### Storage tests

- Auto-requeue apply moves eligible dead-letter rows to pending.
- Apply marks the row as auto-requeued once.
- Apply preserves failure evidence and retry/lease counters.
- Apply skips rows changed after selection.
- Existing manual requeue behavior remains compatible.

### CLI tests

- Auto-requeue dry-run renders public-safe summary.
- Auto-requeue apply requires explicit flag.
- Unsupported backend exits without mutation.
- Cooldown and limit validation reject invalid values.
- Output excludes raw private or secret-like material.

### Systemd tests

- Personal unit invokes the new CLI command.
- Personal unit uses verify job type and 30 minute cooldown.
- Timer unit exists and triggers only the verify-only auto-requeue service, so
it is not a broad recovery surface.
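
One way such a check could look, as a sketch: fold the backslash continuations of `ExecStart` and assert on the resulting tokens. The helper name and the abbreviated unit text are assumptions, and `shlex` only approximates systemd's own quoting rules.

```python
import shlex


def exec_start_args(unit_text: str) -> list[str]:
    """Return the ExecStart command of a unit file as a token list."""
    joined = unit_text.replace("\\\n", " ")  # fold backslash line continuations
    for line in joined.splitlines():
        if line.startswith("ExecStart="):
            return shlex.split(line[len("ExecStart="):])
    raise ValueError("unit has no ExecStart line")


# Abbreviated unit text for illustration only.
UNIT = """\
[Service]
Type=oneshot
Environment=SECURITY_SCANNER_DEAD_LETTER_AUTO_REQUEUE_COOLDOWN_MINUTES=30
ExecStart=%h/.local/bin/uv run security-scanner dead-letter auto-requeue \\
  --job-type verify \\
  --cooldown-minutes ${SECURITY_SCANNER_DEAD_LETTER_AUTO_REQUEUE_COOLDOWN_MINUTES} \\
  --apply
"""

args = exec_start_args(UNIT)
```

The token list makes it straightforward to assert that `--job-type` is followed by `verify` and that `--apply` is present.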

## TDD Strategy

Use red -> green -> refactor.

1. Add failing runtime tests for split classification and auto eligibility.
2. Add failing runtime tests for cooldown, one-shot skip, dry-run, and apply
summary.
3. Add failing storage tests for marking one-shot metadata and preserving
existing failure evidence.
4. Add failing CLI tests for dry-run/apply and validation.
5. Add failing systemd unit tests for conservative timer-ready wiring.
6. Implement the smallest runtime, storage, CLI, and unit changes to pass.
7. Refactor only after focused tests pass.

## Milestones

- M1: Classification split — done when tests prove terminal reason and root
error class are distinct and timeout/transport exhausted jobs are eligible.
- M2: Auto-requeue runtime policy — done when tests prove cooldown, one-shot
guard, dry-run, apply, and public-safe skipped counts.
- M3: Storage evidence — done when tests prove rows can be marked
auto-requeued once without resetting failure history.
- M4: CLI and timer-ready units — done when CLI and systemd tests prove
conservative verify-only auto-requeue wiring.

## Open Questions

- None. Implementation may adjust exact flag names if the approved behavior and
public-safe contract remain intact.
42 changes: 42 additions & 0 deletions docs/workbench/specs/dead-letter-auto-requeue/requirements.html
@@ -0,0 +1,42 @@
<!doctype html>
<html lang="ko">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Dead-Letter Auto-Requeue Requirements</title>
<style>
body { font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif; line-height: 1.55; margin: 40px auto; max-width: 960px; color: #1f2937; }
h1, h2, h3 { color: #111827; line-height: 1.2; }
code { background: #f3f4f6; padding: 0.1rem 0.25rem; border-radius: 4px; }
table { border-collapse: collapse; width: 100%; }
th, td { border: 1px solid #d1d5db; padding: 0.55rem; vertical-align: top; }
th { background: #f9fafb; text-align: left; }
.note { background: #f9fafb; border-left: 4px solid #6b7280; padding: 0.75rem 1rem; }
</style>
</head>
<body>
<h1>Dead-Letter Auto-Requeue Requirements</h1>
<p class="note">Generated companion preview. Source of truth is <code>requirements.md</code>.</p>
<h2>Items for Approval</h2>
<ul>
<li>Source of truth: <code>requirements.md</code></li>
<li>Preview companion: <code>requirements.html</code></li>
</ul>
<h2>Core Goal</h2>
<p>Add automatic requeue, dedicated to transient failures, on top of the existing manual dead-letter recovery in a single PR. The default operating posture is conservative and timer-ready. SQS, LocalStack, outbox, new tables, and new GSIs are out of scope.</p>
<h2>Key Requirements</h2>
<ul>
<li>Keep dry-run and explicit apply separate.</li>
<li>Restrict candidates by limit, cooldown, job type, and root error class.</li>
<li>Only the verifier timeout, verifier transport, and lease-expired budget families are targeted automatically.</li>
<li>Malformed payload, missing snapshot, schema/validation, and unknown errors remain operator-only.</li>
<li>Output only public-safe aggregates and skipped counts.</li>
<li>It must be runnable periodically from a systemd user timer with a small limit.</li>
<li>Timer-ready artifacts are provided, but broad auto-recovery is not assumed until an operator explicitly enables them.</li>
<li>Each job gets at most one automatic requeue.</li>
<li>The default cooldown is 30 minutes.</li>
</ul>
<h2>Open Items</h2>
<p>None. After <code>requirements.md</code> is approved, the design spec is written in Phase 2.</p>
</body>
</html>