Ergonomics & safety gaps when driving rcc from automation (downstream evidence) #2

## Context

I maintain a downstream project ([`serving-stack`](https://github.com/ResearchComputer/snappy-serving-stack)) that drives `rcc` heavily: we wrap `rcc push` / `rcc run` in a Python CLI ([`snapper`](https://github.com/ResearchComputer/snappy-serving-stack/tree/main/tools/snapper)) that submits Slurm jobs, polls them to readiness, fetches per-rank logs, and tunnels to the served endpoint. Observed against **`rcc` 0.1.0**.

This issue collects the concrete friction points hit when driving `rcc` from *automation* (not just interactively). Happy to split any of these into standalone issues — I kept them in one place because several share a root cause (rcc is currently optimized for interactive typing, not for being scripted). Ordered by impact × stakes.

---

## P1 — `push` / `pull --delete` is a footgun (highest stakes)

Our `.rcc/rccignore` carries this comment, verbatim:

> JIT compile caches + container HOME on shared FS (written by vLLM/OpenTela). **MUST be excluded: a `push --delete` without this wipes hundreds of thousands of cache files and the per-job generated scripts written into DEPLOY_DIR.**

The entire deletion-safety story is pushed onto the user via an ever-growing `rccignore`. Concrete asks:

- **Mirror vs. sync distinction.** Today `--delete` mirrors local→remote. Consider making the *default* non-destructive (add/update only) and requiring an explicit `--mirror` / `--delete-excluded` for the dangerous semantics. (rsync's `--delete-excluded` vs `--delete` split is useful prior art.)
- **Protect-list.** A `--keep-remote GLOB` (or profile-level `keep_remote = [...]`) that survives even `--mirror`, for paths the job owns (`logs/`, `cache/`, `*.safetensors`, `last_service.env`). This is rccignore-with-teeth — the kind of guard rail that belongs in the tool, not the user's memory.
- **Itemized dry-run.** `--dry-run` exists; please surface deletions in a clearly distinct section (`Would DELETE: ...`) so they can't hide in a wall of `f.f.....` rsync lines.
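
As a stopgap until something like this lands, a wrapper can do the separation itself. The sketch below assumes that `rcc`'s `--dry-run` passes rsync's itemized output through unchanged, so deletions appear as `*deleting   <path>` lines. The function names and the glob semantics are illustrative, not anything `rcc` provides.

```python
import fnmatch
import posixpath


def split_dry_run(lines):
    """Partition rsync --itemize-changes lines into (deletions, other changes)."""
    deletions, changes = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("*deleting"):
            # rsync reports a deletion as "*deleting   <path>".
            deletions.append(line[len("*deleting"):].strip())
        else:
            changes.append(line)
    return deletions, changes


def protected_hits(deletions, keep_globs):
    """Return the planned deletions that match a protect-list glob.

    A glob is tried against the full relative path and against the basename,
    so "*.safetensors" also guards files in subdirectories.
    """
    hits = []
    for path in deletions:
        name = posixpath.basename(path)
        if any(fnmatch.fnmatchcase(path, g) or fnmatch.fnmatchcase(name, g)
               for g in keep_globs):
            hits.append(path)
    return hits
```

A wrapper could run the dry-run first, print the deletions under their own heading, and refuse to proceed when `protected_hits` is non-empty.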

## P1 — `run` is awkward to script: streaming vs. exit-code, `bash -lc` tax, no `--env`/`--cwd`

This is the single most-used verb for us, and three things hurt:

1. **Live streaming and output capture are mutually exclusive** (the exit code survives either way). Our wrapper ([`snapper/rcc.py`](https://github.com/ResearchComputer/snappy-serving-stack/blob/main/tools/snapper/snapper/rcc.py)) has to choose:
   ```python
   def _run_argv(argv, *, stream=False):
       if stream:
           proc = subprocess.run(argv)           # user sees output live...
           return RunResult(proc.returncode, "", "")  # ...but stdout/stderr come back EMPTY
       proc = subprocess.run(argv, capture_output=True, text=True)  # or capture text...
       return RunResult(proc.returncode, proc.stdout, proc.stderr)  # ...but see nothing until exit
   ```
   So for a long `srun` we either (a) see live output but can't programmatically inspect it, or (b) capture it but the call looks hung for minutes. A `--stream` that still populates `stdout`/`stderr` (tee to the TTY while accumulating) would fix this cleanly.
2. **The `bash -lc` tax.** Literally every invocation in our docs is `rcc --profile p run bash -lc "..."`. The `run` verb should run through a login shell by default (with `--no-shell` to opt out), so users stop retyping `bash -lc`. Today `run`'s argv is `["rcc", ..., "run", "bash", "-lc", cmd]` — the shell is the user's responsibility ~30 times per README.
3. **No `--env` / `--cwd`.** Because there's no `--env KEY=VAL`, we build env prefixes by hand with `shlex.quote` and string concatenation (`snapper/slurm.py::_env_prefix`), then hand them to `bash -lc`. That's injection-fragile glue that a structured, repeatable `--env` flag would eliminate. Likewise, every command implicitly runs in `remote_dir`; an explicit `--cwd` (or an `rcc exec` that is *not* remote-dir-scoped) would remove a class of absolute-path hacks.
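
To make the tee behaviour from point 1 concrete, here is a minimal wrapper-side sketch. It is not `rcc`'s or `snapper`'s code: `run_tee` is a hypothetical name, and `RunResult` is a stand-in with the same three fields as the one used above. One reader thread per pipe forwards each line to the terminal while keeping a copy.

```python
import subprocess
import sys
import threading
from typing import NamedTuple


class RunResult(NamedTuple):
    # Stand-in for the wrapper's RunResult (returncode, stdout, stderr).
    returncode: int
    stdout: str
    stderr: str


def _pump(src, sink, chunks):
    # Forward each line to the live sink and keep a copy for the caller.
    for line in src:
        sink.write(line)
        sink.flush()
        chunks.append(line)


def run_tee(argv):
    """Run argv, echoing stdout/stderr live while also capturing both."""
    out, err = [], []
    with subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                          text=True, bufsize=1) as proc:
        readers = [
            threading.Thread(target=_pump, args=(proc.stdout, sys.stdout, out)),
            threading.Thread(target=_pump, args=(proc.stderr, sys.stderr, err)),
        ]
        for t in readers:
            t.start()
        for t in readers:
            t.join()  # both pipes reach EOF once the child exits
        returncode = proc.wait()
    return RunResult(returncode, "".join(out), "".join(err))
```

One caveat: the child sees pipes rather than a TTY, so a program that switches to block buffering in that case will still appear to stall. Doing the tee inside `rcc` would let it decide how to handle that once for every consumer.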

## P2 — No first-class Slurm/workload-manager lifecycle

`rcc` = "remote **cluster** controller," but on an HPC remote the cluster *is* Slurm. Today every consumer reinvents the same `squeue` poll loops. In our `snapper/slurm.py` alone we hand-rolled: `submit` (parse `Submitted batch job N`), `wait_in_queue` (poll `squeue -o %T`), `wait_for_ready` (poll per-rank logs for readiness markers), `_fetch_rank_logs` (loop `cat` over ranks). We even **invented our own wire format** — `===RANK N===` markers — because rcc gave us none.

Proposed verbs (obviously opt-in / plugin-y, not core SSH):
- `rcc submit -- CMD…` → prints jobid (machine-parseable)
- `rcc jobs [--profile p]` → list
- `rcc logs <jobid> [--follow] [--rank N]` → the single biggest win; replaces the rank-concat loop
- `rcc wait <jobid> [--timeout S] [--on STATE]` → blocks, exits non-zero on FAILED

Even just `rcc logs <jobid> --follow` would delete a remarkable amount of glue.
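
For reference, the glue each consumer currently rewrites looks roughly like the sketch below. It is a simplified reconstruction, not the actual `snapper/slurm.py`: the function names, defaults, and the set of terminal states are assumptions, and `squeue` is invoked directly here where the real wrapper would go through `rcc run`.

```python
import re
import subprocess
import time

_JOBID_RE = re.compile(r"Submitted batch job (\d+)")
# Slurm states after which a job will never reach RUNNING.
_TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL"}


def parse_jobid(sbatch_output):
    """Extract the job id from sbatch's 'Submitted batch job N' line."""
    m = _JOBID_RE.search(sbatch_output)
    if m is None:
        raise ValueError(f"no job id in sbatch output: {sbatch_output!r}")
    return int(m.group(1))


def squeue_state(jobid):
    """Poll squeue once; returns '' when the job is no longer listed."""
    proc = subprocess.run(
        ["squeue", "-h", "-j", str(jobid), "-o", "%T"],
        capture_output=True, text=True,
    )
    return proc.stdout.strip()


def wait_for_state(jobid, target="RUNNING", timeout=600.0, interval=5.0,
                   poll=squeue_state, sleep=time.sleep):
    """Block until the job reaches `target`; raise on failure or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        state = poll(jobid)
        if state == target:
            return state
        if state in _TERMINAL or state == "":
            raise RuntimeError(f"job {jobid} ended in state {state or 'UNKNOWN'}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {jobid} still {state} after {timeout}s")
        sleep(interval)
```

A built-in `rcc wait` would replace `wait_for_state`, and `rcc submit` printing a bare job id would replace `parse_jobid`.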

## P2 — No structured (`--json`) output

Every downstream parser is regex over free text: `parse_jobid`, `parse_squeue_state`, `parse_rank_dump`. `rcc config` "prints the resolved profile" as text; we'd love `--json`. Please add a stable `--json` mode to at least `config`, `status`, and the hypothetical `jobs`/`logs` — it lets wrappers drop ~100 lines of fragile parsing and survive output-format changes.
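
To illustrate the kind of parsing involved, here is a minimal sketch of a rank-dump parser built around the `===RANK N===` markers mentioned earlier. The marker format is taken from this issue; the body is a reconstruction for illustration and may differ from the real `parse_rank_dump` in `snapper`.

```python
import re

_RANK_MARKER = re.compile(r"^===RANK (\d+)===$")


def parse_rank_dump(text):
    """Split a concatenated log dump into {rank: log_text}."""
    logs = {}
    current = None
    for line in text.splitlines():
        m = _RANK_MARKER.match(line.strip())
        if m:
            # A marker line starts the section for that rank.
            current = int(m.group(1))
            logs[current] = []
        elif current is not None:
            logs[current].append(line)
        # Lines before the first marker are dropped.
    return {rank: "\n".join(lines) for rank, lines in logs.items()}
```

The weakness is visible in the code: any log line that happens to look like a marker silently starts a new section. A structured `--json` payload would not have that ambiguity.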

## P3 — Port-forwarding should be a first-class verb

Our deploy READMEs end every workflow with a manual `ssh -L 8080:<HEAD_NODE>:8080 bristen`, and `snapper/service.py` parses a job-written `last_service.env` to recover the endpoint. rcc already maintains an SSH ControlMaster (cf. `status`/`close`). An `rcc tunnel <profile> <remote-port> [--local-port] [--remote-host HEAD_NODE]` that reuses that master would collapse this whole class of glue, and it pairs naturally with a profile-level `tunnel = { remote_port = 8080 }`.
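
A rough sketch of what such a verb might assemble under the hood is below. `tunnel_argv` is a hypothetical helper, and `control_path` has to be supplied by the caller because where `rcc` keeps its control socket is not something this issue knows. The `ssh` options used (`-N`, `-L`, `-o ControlPath=...`) are standard OpenSSH.

```python
def tunnel_argv(host, remote_port, local_port=None,
                remote_host="localhost", control_path=None):
    """Build an ssh command line that forwards local_port to remote_host:remote_port."""
    if local_port is None:
        local_port = remote_port  # default to the same port on both ends
    argv = ["ssh", "-N", "-L", f"{local_port}:{remote_host}:{remote_port}"]
    if control_path:
        # Reuse an existing master connection instead of authenticating again.
        argv += ["-o", f"ControlPath={control_path}"]
    argv.append(host)
    return argv
```

For example, `tunnel_argv("bristen", 8080, remote_host="<HEAD_NODE>")` reproduces the manual command from our READMEs with `-N` added, so that no remote shell is opened.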

## P3 — Richer profiles

`.rcc/config.toml` profiles today are `{host, remote_dir}`. In practice we also want, per-profile:
- **`env`** defaults (we re-derive `TRITON_CACHE_DIR`, redirected `$HOME`, `EDF_PATH` in every serve script — these are profile constants),
- **`proxy_jump` / `identity_file`** (bastion hops are the norm on HPC; CSCS uses an SSH-config/enrollment dance rcc could encapsulate),
- the aforementioned **`tunnel`** block and **`keep_remote`** protect-list.
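
To show how these fields could be consumed, here is a small sketch that models a richer profile as a dataclass. Everything in it is hypothetical: the field names mirror the bullets above, the paths in the usage note are made up, and the rule that per-call values override profile defaults is an assumption rather than existing `rcc` behaviour.

```python
import shlex
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Profile:
    host: str
    remote_dir: str
    env: Dict[str, str] = field(default_factory=dict)  # profile-level defaults
    proxy_jump: Optional[str] = None
    identity_file: Optional[str] = None


def ssh_args(profile):
    """Translate profile fields into standard OpenSSH options."""
    args = []
    if profile.proxy_jump:
        args += ["-J", profile.proxy_jump]
    if profile.identity_file:
        args += ["-i", profile.identity_file]
    return args + [profile.host]


def env_prefix(profile, overrides=None):
    """Merge profile env with per-call overrides into a quoted KEY=VAL prefix."""
    merged = {**profile.env, **(overrides or {})}
    return " ".join(f"{key}={shlex.quote(value)}" for key, value in merged.items())
```

With `Profile("bristen", "/scratch/app", env={"TRITON_CACHE_DIR": "/scratch/cache"}, proxy_jump="bastion")`, `ssh_args` yields `["-J", "bastion", "bristen"]`, and the serve scripts no longer need to re-derive the cache directory themselves.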

---

### TL;DR priority

| # | Theme | Stakes |
|---|-------|--------|
| P1 | `--delete` safety (mirror vs sync, protect-list, itemized dry-run) | data loss |
| P1 | `run` streaming+exit-code, default login shell, `--env`/`--cwd` | daily ergonomics |
| P2 | Slurm lifecycle (`submit`/`jobs`/`logs`/`wait`) | removes duplicated glue across every consumer |
| P2 | `--json` on `config`/`status`/`jobs`/`logs` | composability |
| P3 | `rcc tunnel` | removes manual `ssh -L` |
| P3 | richer profiles (`env`, `proxy_jump`, `keep_remote`, `tunnel`) | one-time setup pain |

Very happy to break any of these into its own issue and/or contribute PRs (the `run`/`--env`/`--json` ones look approachable). Just let me know which you'd want to take first.

— filed from downstream usage in `serving-stack` / `snapper`.

