Merged
12 commits
70 changes: 69 additions & 1 deletion deploy/systemd/README.md
@@ -17,6 +17,18 @@ them to fit their host layout.
| `user/security-scanner-scan-all.service` | **No-sudo** user-level variant (`%h`-based, runs as the invoking user). |
| `user/security-scanner-scan-all.timer` | Schedules the user `.service` (default: every 2 hours). |

**Scale worker pool + periodic jobs (M3, see §9):**

| Path | Purpose |
| --- | --- |
| `security-scanner-scan-worker@.service` | Instanced daemon template; `scan-worker@1..N` are N independent processes, each a distinct fence-token holder (FR-4). |
| `scan-worker.target` | Brings the whole worker pool up/down at once. |
| `security-scanner-lease-reaper.{service,timer}` | Reclaims expired job + repo leases (FR-6) on a timer. |
| `security-scanner-incr-poll.{service,timer}` | `discover-updates --enqueue --from-catalog --ls-remote-skip` (FR-2). |
| `security-scanner-baseline.{service,timer}` | Per-repo baseline `ScanJob` enqueue (FR-3). |
| `security-scanner-freshness-eval.{service,timer}` | Per-repo staleness detector + `BREACH_COUNTER` rollup (FR-8). |
| `security-scanner-catalog-reconcile.{service,timer}` | Org catalog reconcile (FR-1). **Governance-gated: keep DISABLED until GATE 2** (default provider refuses live fetch). |

Two flavors:
- **System-level** (`§4`) — requires root, runs as a dedicated `scanner` user, full
hardening. Use for shared/production hosts.
@@ -319,7 +331,63 @@ manually if you also want to clean state.

---

## 9. Related documents
## 9. Scale worker pool (M3) — N processes + periodic timers

The scale redesign (design.md v2, FR-4) replaces the single weekly `scan-all`
oneshot with a **queue + N-worker-pool** model: a per-repo job queue, N
independent worker processes draining it, and several periodic timers feeding
and maintaining the queue. The `scan-all` units above still work; the units in
this section are the scale path.

> **Box-gated.** The deployment box is OFFLINE. These artifacts are what a future
> box deploy instantiates; DEPLOYED behavior (N live processes, `Restart=on-failure`
> recovery on a real crash, and the real cadence values) is NOT proven here. The
> `OnCalendar=` values in every timer are **GATE-1 placeholders** — the box load
> gate sets the real cadences (poll interval, baseline window, N). Do not treat
> them as load-validated.

### 9a. Worker pool

`security-scanner-scan-worker@.service` is an instanced (templated) unit. The
systemd instance name `%i` is threaded into `--worker-id scan-worker@%i`, so
`scan-worker@1 .. scan-worker@N` run as N independent OS processes, each a
distinct RepoLease fence-token holder. The RepoLease CAS (M2) guarantees two
instances never scan the same repo concurrently (FR-4).

Bring up N instances (pick N from the box load gate):

```bash
sudo systemctl enable --now security-scanner-scan-worker@{1..8}.service   # example: 8 workers
sudo systemctl enable --now scan-worker.target # group start/stop
# stop the whole pool:
sudo systemctl stop scan-worker.target
```
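
Brace expansion such as `{1..8}` only works with literal bounds; bash does not expand `{1..$N}`. When N comes from a variable, a small helper can generate the instance names instead. This is a minimal sketch, not part of the shipped units: `worker_units` is a hypothetical helper, the unit name is taken from the template file name `security-scanner-scan-worker@.service`, and the last line only prints the command (dry run).

```bash
# Hypothetical helper: print one instance unit name per line for a pool of N.
# Runs in a subshell so its loop variables do not leak into the caller.
worker_units() (
  n="$1"
  i=1
  while [ "$i" -le "$n" ]; do
    printf 'security-scanner-scan-worker@%d.service\n' "$i"
    i=$((i + 1))
  done
)

# N is a placeholder here; the real value comes from the box load gate.
N=3

# Dry run: show the command that would be executed.
echo sudo systemctl enable --now $(worker_units "$N")
```

Dropping the `echo` turns the dry run into the real call once N has been set by the load gate.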

Each instance is `Type=simple` (long-running daemon, polls until `SIGTERM`) with
`Restart=on-failure`; a crashed instance is restarted by systemd and its stranded
leases are reclaimed by the lease-reaper timer below.
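
One way to see whether instances are being restarted is to read systemd's per-unit restart counter. The sketch below is an illustration under stated assumptions: `flapping_units` and its threshold are hypothetical, and the commented usage assumes a systemd recent enough to expose the `NRestarts` property through `systemctl show`.

```bash
# Hypothetical helper: read "unit restart-count" pairs on stdin and print
# the units whose counter is at or above the given threshold.
flapping_units() {
  threshold="$1"
  while read -r unit restarts; do
    case "$restarts" in
      '' | *[!0-9]*) continue ;;   # skip lines without a numeric counter
    esac
    if [ "$restarts" -ge "$threshold" ]; then
      echo "$unit"
    fi
  done
}

# On a live host the pairs could be collected like this (not run here):
#   for i in 1 2 3; do
#     u="security-scanner-scan-worker@$i.service"
#     echo "$u $(systemctl show -p NRestarts --value "$u")"
#   done | flapping_units 3

# Self-contained demonstration with made-up counters:
printf '%s\n' \
  'security-scanner-scan-worker@1.service 0' \
  'security-scanner-scan-worker@2.service 4' | flapping_units 3
```

A rising counter only shows that systemd restarted the process; whether its stranded leases were reclaimed is a question for the lease-reaper run.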

### 9b. Periodic timers

```bash
sudo systemctl enable --now security-scanner-lease-reaper.timer
sudo systemctl enable --now security-scanner-incr-poll.timer
sudo systemctl enable --now security-scanner-baseline.timer
sudo systemctl enable --now security-scanner-freshness-eval.timer
```
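
Before enabling the timers it can help to check when a placeholder `OnCalendar=` expression would fire. `systemd-analyze calendar` prints the normalized form of an expression and its next elapse time. A sketch, assuming `systemd-analyze` is on the PATH: `placeholder_calendars` and `check_calendars` are hypothetical helper names, and the expressions are the placeholders from the baseline, freshness-eval, and incr-poll timer units in this change (add the lease-reaper expression from its own unit file).

```bash
# Hypothetical helper: the placeholder cadences as "unit|expression" lines.
placeholder_calendars() {
  cat <<'EOF'
security-scanner-baseline.timer|Sun *-*-* 04:00:00
security-scanner-freshness-eval.timer|*-*-* *:0/10:00
security-scanner-incr-poll.timer|*-*-* *:0/5:00
EOF
}

# Hypothetical helper: validate each expression and show its next elapse time.
# Returns non-zero if systemd rejects any expression.
check_calendars() {
  if ! command -v systemd-analyze >/dev/null 2>&1; then
    echo "systemd-analyze not found; skipping" >&2
    return 0
  fi
  placeholder_calendars | while IFS='|' read -r unit expr; do
    echo "== $unit"
    systemd-analyze calendar "$expr" || exit 1
  done
}
```

Running `check_calendars` on the target host confirms only that the expressions parse; it says nothing about whether the cadence is load-validated, which remains a GATE 1 decision.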

### 9c. catalog-reconcile — DISABLED until GATE 2

Do **not** `systemctl enable` `security-scanner-catalog-reconcile.timer` yet. The
`reconcile` command's default org-list provider is a governance-gated stub that
REFUSES to fetch live GitHub (a live org GET is gated to a human PR + the
autopilot `ghas-live-fetch-or-mutation-required` stop-condition, GATE 2). As
shipped the unit fails closed; enabling the timer early only schedules failing
runs. Enable it only after GATE 2 clears and a live provider is wired.
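
A provisioning script can make this rule harder to break by accident. The sketch below is illustrative only: `enable_timer` and the `GATE2_CLEARED` variable are hypothetical names, nothing in the scanner reads them, and the function prints the `systemctl` command instead of running it.

```bash
# Hypothetical wrapper: refuse the catalog-reconcile timer unless the operator
# has explicitly set GATE2_CLEARED=1; pass every other timer through.
enable_timer() {
  timer="$1"
  case "$timer" in
    security-scanner-catalog-reconcile.timer)
      if [ "${GATE2_CLEARED:-0}" != "1" ]; then
        echo "refusing to enable $timer: GATE 2 has not cleared" >&2
        return 1
      fi
      ;;
  esac
  # Dry run: print the command rather than executing it.
  echo "sudo systemctl enable --now $timer"
}

enable_timer security-scanner-incr-poll.timer
enable_timer security-scanner-catalog-reconcile.timer || echo "catalog-reconcile left disabled"
```

The guard is a convenience, not a control: the unit still fails closed on its own while the default provider is in place.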

---

## 10. Related documents

- `docs/workbench/adrs/ADR-20260531-periodic-multi-repo-scan-catalog.md`
- `docs/workbench/specs/2026-05-31-scan-all-and-target-catalog.md`
17 changes: 17 additions & 0 deletions deploy/systemd/scan-worker.target
@@ -0,0 +1,17 @@
[Unit]
Description=security-scanner incremental worker pool (N scan-worker@ instances)
Documentation=https://github.com/source-security-dev/security-scanner
# Aggregates the N independent scan-worker@ daemon instances into one start/stop
# unit so an operator brings the whole pool up or down at once.
#
# Bring up N instances (choose N from the box load gate — N is box-gated, not
# invented here):
# systemctl enable --now security-scanner-scan-worker@{1..N}.service
# systemctl enable --now scan-worker.target
#
# Each scan-worker@i WantedBy=scan-worker.target, so enabling the instances wires
# them into this target. Stop the whole pool with:
# systemctl stop scan-worker.target

[Install]
WantedBy=multi-user.target
41 changes: 41 additions & 0 deletions deploy/systemd/security-scanner-baseline.service
@@ -0,0 +1,41 @@
[Unit]
Description=security-scanner baseline (enqueue per-repo baseline ScanJobs, FR-3)
Documentation=https://github.com/source-security-dev/security-scanner
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot

# --- Operator: replace placeholders below ---
User=scanner
Group=scanner
WorkingDirectory=/opt/security-scanner

Environment=SECURITY_SCANNER_STORAGE_BACKEND=dynamodb
Environment=SECURITY_SCANNER_DYNAMO_TABLE=security-scanner
Environment=SECURITY_SCANNER_DYNAMO_ENDPOINT=http://127.0.0.1:8000
Environment=SECURITY_SCANNER_AWS_REGION=ap-northeast-2

EnvironmentFile=-/etc/security-scanner/scm.env

# Enqueue one low-priority baseline ScanJob per INCLUDED catalog repo (NOT
# scan-all; per-repo queue baseline, SC-3). Backpressure skips this run when the
# pending backlog is over threshold; the rolling 1/N slice covers the catalog
# across runs. ExecStart below passes no --rolling-offset, so the slice rotation
# stays at the command's defaults; GATE 1 sets the real divisor/offset rotation.
ExecStart=/usr/bin/uv run security-scanner baseline

NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=read-only
# systemd creates the per-service log and cache dirs (named "security-scanner"
# under its managed state roots) with correct ownership and auto-grants this unit
# RW to them, even under ProtectSystem=strict — so no explicit ReadWritePaths is
# needed and no machine-local absolute path is hardcoded here.
LogsDirectory=security-scanner
CacheDirectory=security-scanner

[Install]
WantedBy=multi-user.target
23 changes: 23 additions & 0 deletions deploy/systemd/security-scanner-baseline.timer
@@ -0,0 +1,23 @@
[Unit]
Description=Scheduler for security-scanner baseline enqueue
Documentation=https://github.com/source-security-dev/security-scanner

[Timer]
# PLACEHOLDER CADENCE — set by GATE 1 (box online load validation). The design
# default is a weekly full-scan window (DEFAULT_BASELINE_CADENCE_HOURS = 24*7),
# mirroring the pre-scale weekly full scan, but the real 500-repo baseline window
# (rolling-slice rotation vs queue depth) is a load-gate decision; do NOT invent
# a load-validated value. Placeholder: weekly, Sunday 04:00.
OnCalendar=Sun *-*-* 04:00:00

# If the host was off when the timer fired, run as soon as possible afterwards.
Persistent=true

# Randomized delay so a fleet doesn't all enqueue baselines at the same instant.
RandomizedDelaySec=900

# Bind to the matching .service unit (same basename).
Unit=security-scanner-baseline.service

[Install]
WantedBy=timers.target
56 changes: 56 additions & 0 deletions deploy/systemd/security-scanner-catalog-reconcile.service
@@ -0,0 +1,56 @@
[Unit]
Description=security-scanner catalog-reconcile (org catalog reconcile + coverage gap, FR-1)
Documentation=https://github.com/source-security-dev/security-scanner
After=network-online.target
Wants=network-online.target

# =============================================================================
# GOVERNANCE GATE 2 — DO NOT ENABLE THIS UNIT (or its .timer) UNTIL GATE 2 CLEARS
# -----------------------------------------------------------------------------
# The reconcile command's DEFAULT org-list provider is a governance-gated stub
# (GovernanceGatedOrgRepoListProvider) that REFUSES to fetch live GitHub: a live
# org GET is gated to a human PR + the autopilot
# `ghas-live-fetch-or-mutation-required` stop-condition (GATE 2). As shipped,
# `security-scanner reconcile` (the ExecStart below) will FAIL CLOSED with an
# "inject a provider" error rather than reach live GitHub by accident.
#
# So this unit is intentionally INERT until GATE 2: enabling the timer before the
# gate clears only produces failing oneshot runs (no live fetch, no mutation).
# After GATE 2, the operator wires the GATE-2 live provider (out of scope here)
# and only THEN enables the timer.
# =============================================================================

[Service]
Type=oneshot

# --- Operator: replace placeholders below ---
User=scanner
Group=scanner
WorkingDirectory=/opt/security-scanner

Environment=SECURITY_SCANNER_STORAGE_BACKEND=dynamodb
Environment=SECURITY_SCANNER_DYNAMO_TABLE=security-scanner
Environment=SECURITY_SCANNER_DYNAMO_ENDPOINT=http://127.0.0.1:8000
Environment=SECURITY_SCANNER_AWS_REGION=ap-northeast-2

EnvironmentFile=-/etc/security-scanner/scm.env

# Reconcile the org catalog (FR-1) and thread the coverage gap into the freshness
# rollup (--evaluate-freshness materializes BREACH_COUNTER.coverage_gap). NOTE:
# the default provider refuses live fetch (GATE 2 above) — this run fails closed
# until a GATE-2 live provider is wired.
ExecStart=/usr/bin/uv run security-scanner reconcile --evaluate-freshness

NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=read-only
# systemd creates the per-service log and cache dirs (named "security-scanner"
# under its managed state roots) with correct ownership and auto-grants this unit
# RW to them, even under ProtectSystem=strict — so no explicit ReadWritePaths is
# needed and no machine-local absolute path is hardcoded here.
LogsDirectory=security-scanner
CacheDirectory=security-scanner

[Install]
WantedBy=multi-user.target
29 changes: 29 additions & 0 deletions deploy/systemd/security-scanner-catalog-reconcile.timer
@@ -0,0 +1,29 @@
[Unit]
Description=Scheduler for security-scanner catalog-reconcile
Documentation=https://github.com/source-security-dev/security-scanner

# =============================================================================
# GOVERNANCE GATE 2 — DO NOT `systemctl enable` THIS TIMER UNTIL GATE 2 CLEARS.
# The bound .service refuses live org fetch until a GATE-2 live provider is wired
# (see security-scanner-catalog-reconcile.service). Enabling this timer early
# only schedules failing fail-closed runs. Keep it DISABLED until GATE 2.
# =============================================================================

[Timer]
# PLACEHOLDER CADENCE — set by GATE 1 (box load validation) AND only meaningful
# once GATE 2 clears. The design soft target is hourly org reconcile (Data Flow
# step 1), but the real cadence (GitHub rate-limit budget vs catalog drift) is a
# load-gate decision; do NOT invent a load-validated value. Placeholder: hourly.
OnCalendar=*-*-* *:00:00

# If the host was off when the timer fired, run as soon as possible afterwards.
Persistent=true

# Randomized delay so a fleet doesn't all hit the GitHub org API at once.
RandomizedDelaySec=300

# Bind to the matching .service unit (same basename).
Unit=security-scanner-catalog-reconcile.service

[Install]
WantedBy=timers.target
42 changes: 42 additions & 0 deletions deploy/systemd/security-scanner-freshness-eval.service
@@ -0,0 +1,42 @@
[Unit]
Description=security-scanner freshness-eval (per-repo staleness detector + BREACH_COUNTER, FR-8)
Documentation=https://github.com/source-security-dev/security-scanner
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot

# --- Operator: replace placeholders below ---
User=scanner
Group=scanner
WorkingDirectory=/opt/security-scanner

Environment=SECURITY_SCANNER_STORAGE_BACKEND=dynamodb
Environment=SECURITY_SCANNER_DYNAMO_TABLE=security-scanner
Environment=SECURITY_SCANNER_DYNAMO_ENDPOINT=http://127.0.0.1:8000
Environment=SECURITY_SCANNER_AWS_REGION=ap-northeast-2

EnvironmentFile=-/etc/security-scanner/scm.env

# Scheduled staleness DETECTOR: staleness is the absence of a worker event, so it
# cannot hang off worker writes — it runs on a timer, enumerates REPO_HEALTH,
# evaluates per-repo breaches against both thresholds, and materializes the
# BREACH_COUNTER rollup the read API consumes O(1). The --poll-interval-hours /
# --baseline-cadence-hours / --margin-hours thresholds default to placeholders
# (the load gate sets the real cadence values).
ExecStart=/usr/bin/uv run security-scanner freshness-eval

NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=read-only
# systemd creates the per-service log and cache dirs (named "security-scanner"
# under its managed state roots) with correct ownership and auto-grants this unit
# RW to them, even under ProtectSystem=strict — so no explicit ReadWritePaths is
# needed and no machine-local absolute path is hardcoded here.
LogsDirectory=security-scanner
CacheDirectory=security-scanner

[Install]
WantedBy=multi-user.target
20 changes: 20 additions & 0 deletions deploy/systemd/security-scanner-freshness-eval.timer
@@ -0,0 +1,20 @@
[Unit]
Description=Scheduler for security-scanner freshness-eval
Documentation=https://github.com/source-security-dev/security-scanner

[Timer]
# PLACEHOLDER CADENCE — set by GATE 1 (box online load validation). The detector
# only needs to run often enough to surface a breach within the alert SLA; the
# real cadence (vs REPO_HEALTH scan cost at 500 repos) is a load-gate decision;
# do NOT invent a load-validated value. Placeholder: every 10 minutes.
OnCalendar=*-*-* *:0/10:00
OnBootSec=180

# If the host was off when the timer fired, run as soon as possible afterwards.
Persistent=true

# Bind to the matching .service unit (same basename).
Unit=security-scanner-freshness-eval.service

[Install]
WantedBy=timers.target
55 changes: 55 additions & 0 deletions deploy/systemd/security-scanner-incr-poll.service
@@ -0,0 +1,55 @@
[Unit]
Description=security-scanner incr-poll (discover changed refs + enqueue scan jobs, FR-2)
Documentation=https://github.com/source-security-dev/security-scanner
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot

# --- Operator: replace placeholders below ---
User=scanner
Group=scanner
WorkingDirectory=/opt/security-scanner

Environment=SECURITY_SCANNER_STORAGE_BACKEND=dynamodb
Environment=SECURITY_SCANNER_DYNAMO_TABLE=security-scanner
Environment=SECURITY_SCANNER_DYNAMO_ENDPOINT=http://127.0.0.1:8000
Environment=SECURITY_SCANNER_AWS_REGION=ap-northeast-2

EnvironmentFile=-/etc/security-scanner/scm.env

# Poll the INCLUDED catalog (M1) repos, probe ref SHAs with git ls-remote and
# skip repos whose refs are unchanged (SC-6a poll-storm mitigation), and enqueue
# one SCAN_JOB per newly observed unscanned commit. The scan-worker@ pool drains
# the queue.
#
# --cadence-seconds is the SC-6d cadence budget: a poll cycle whose wall-time
# exceeds it fires a cadence-overrun ALERT (to the notification-log seam) instead
# of silently falling behind. The value below is a GATE 1 PLACEHOLDER (set by box
# online load validation, like the timer OnCalendar cadences) — it is NOT a
# load-validated number. Tune it to the incr-poll.timer OnCalendar period at GATE 1.
ExecStart=/usr/bin/uv run security-scanner discover-updates \
    --enqueue \
    --from-catalog \
    --ls-remote-skip \
    --cadence-seconds 300

# discover-updates exit codes: 0 = ok, 1 = fatal, 2 = at least one repo's fetch
# failed but others completed. Treat 2 as non-failure so one bad repo does not
# fail the whole poll.
SuccessExitStatus=0 2

NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=read-only
# systemd creates the per-service log and cache dirs (named "security-scanner"
# under its managed state roots) with correct ownership and auto-grants this unit
# RW to them, even under ProtectSystem=strict — so no explicit ReadWritePaths is
# needed and no machine-local absolute path is hardcoded here.
LogsDirectory=security-scanner
CacheDirectory=security-scanner

[Install]
WantedBy=multi-user.target
23 changes: 23 additions & 0 deletions deploy/systemd/security-scanner-incr-poll.timer
@@ -0,0 +1,23 @@
[Unit]
Description=Scheduler for security-scanner incr-poll
Documentation=https://github.com/source-security-dev/security-scanner

[Timer]
# PLACEHOLDER CADENCE — set by GATE 1 (box online load validation). The design's
# soft target is ~5min incremental poll cadence (Data Flow step 2), but the real
# 500-repo poll interval (ls-remote storm vs freshness) is a load-gate decision;
# do NOT invent a load-validated value. Placeholder: every 5 minutes.
OnCalendar=*-*-* *:0/5:00
OnBootSec=120

# If the host was off when the timer fired, run as soon as possible afterwards.
Persistent=true

# Small randomized delay so a fleet doesn't all ls-remote at the same instant.
RandomizedDelaySec=30

# Bind to the matching .service unit (same basename).
Unit=security-scanner-incr-poll.service

[Install]
WantedBy=timers.target