Context
I maintain a downstream project (serving-stack) that drives rcc heavily: we wrap rcc push / rcc run in a Python CLI (snapper) that submits Slurm jobs, polls them to readiness, fetches per-rank logs, and tunnels to the served endpoint. Observed against rcc 0.1.0.
This issue collects the concrete friction points hit when driving rcc from automation (not just interactively). Happy to split any of these into standalone issues — I kept them in one place because several share a root cause (rcc is currently optimized for interactive typing, not for being scripted). Ordered by impact × stakes.
P1 — push / pull --delete is a footgun (highest stakes)
Our .rcc/rccignore carries this comment, verbatim:
JIT compile caches + container HOME on shared FS (written by vLLM/OpenTela). MUST be excluded: a push --delete without this wipes hundreds of thousands of cache files and the per-job generated scripts written into DEPLOY_DIR.
The entire deletion-safety story is pushed onto the user via an ever-growing rccignore. Concrete asks:
- Mirror vs. sync distinction. Today --delete mirrors local→remote. Consider making the default non-destructive (add/update only) and requiring an explicit --mirror / --delete-excluded for the dangerous semantics. (rsync's --delete-excluded vs --delete split is useful prior art.)
- Protect-list. A --keep-remote GLOB (or profile-level keep_remote = [...]) that survives even --mirror, for paths the job owns (logs/, cache/, *.safetensors, last_service.env). This is rccignore-with-teeth: the kind of guard rail that belongs in the tool, not the user's memory.
- Itemized dry-run. --dry-run exists; please surface deletions in a clearly distinct section (Would DELETE: ...) so they can't hide in a wall of f.f..... rsync lines.
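As a rough illustration of the protect-list and itemized dry-run asks, the sketch below splits rsync-style dry-run output into deletions and everything else, then checks each deletion against a protect-list. It assumes the dry-run lines follow rsync's --itemize-changes format, where deletions start with `*deleting`. The function names, the glob semantics (a trailing `/` means "this directory and everything under it"), and the sample paths are hypothetical, not current rcc behavior.

```python
from fnmatch import fnmatch


def split_dry_run(lines):
    """Separate '*deleting <path>' lines from all other itemized lines."""
    deletions, others = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("*deleting"):
            # rsync pads the marker with spaces before the path.
            deletions.append(line[len("*deleting"):].strip())
        elif line:
            others.append(line)
    return deletions, others


def is_protected(path, keep_remote):
    """Return True if path matches any hypothetical keep_remote pattern."""
    for pattern in keep_remote:
        if pattern.endswith("/"):
            # Directory pattern: protect the directory and its contents.
            if path == pattern.rstrip("/") or path.startswith(pattern):
                return True
        elif fnmatch(path, pattern) or fnmatch(path.rsplit("/", 1)[-1], pattern):
            return True
    return False


def unsafe_deletions(lines, keep_remote):
    """List the deletions a mirror would perform on protected paths."""
    deletions, _ = split_dry_run(lines)
    return [path for path in deletions if is_protected(path, keep_remote)]
```

A wrapper could refuse to proceed whenever `unsafe_deletions(...)` is non-empty, which is the guard rail we would rather see inside the tool.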
P1 — run is awkward to script: streaming vs. exit-code, bash -lc tax, no --env/--cwd
This is the single most-used verb for us, and three things hurt:
- Streaming and exit-code visibility are mutually exclusive. Our wrapper (snapper/rcc.py) has to choose:

      import subprocess

      # RunResult is snapper's (returncode, stdout, stderr) container.
      def _run_argv(argv, *, stream=False):
          if stream:
              proc = subprocess.run(argv)  # user sees output live...
              return RunResult(proc.returncode, "", "")  # ...but stdout/stderr come back EMPTY
          proc = subprocess.run(argv, capture_output=True, text=True)  # or capture text...
          return RunResult(proc.returncode, proc.stdout, proc.stderr)  # ...but see nothing until exit

  So for a long srun we either (a) see live output but can't programmatically inspect it, or (b) capture it but the call looks hung for minutes. A --stream that still populates stdout/stderr (tee to the TTY while accumulating) would fix this cleanly.
- The bash -lc tax. Literally every invocation in our docs is rcc --profile p run bash -lc "...". The run verb should run through a login shell by default (with --no-shell to opt out), so users stop retyping bash -lc. Today run's argv is ["rcc", ..., "run", "bash", "-lc", cmd]; the shell is the user's responsibility ~30 times per README.
- No --env / --cwd. Because there's no --env KEY=VAL, we build env prefixes by hand with shlex.quote and string concatenation (snapper/slurm.py::_env_prefix), then hand them to bash -lc. That's injection-fragile glue that a structured, repeatable --env flag would eliminate. Likewise everything is implicitly scoped to remote_dir; an explicit --cwd (or an rcc exec that is not remote-dir-scoped) would remove a class of absolute-path hacks.
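To make the streaming and --env points concrete, here is a minimal sketch of the tee behavior we would like --stream to provide, plus the kind of env-prefix glue a structured --env flag would replace. `run_tee` and `env_prefix` are hypothetical names written for this issue (they are not our actual `_run_argv` / `_env_prefix`), and merging stderr into stdout is a simplification to avoid juggling two pipes.

```python
import shlex
import subprocess
import sys


def run_tee(argv):
    """Run argv, echoing output live while also accumulating it."""
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # simplification: one merged stream
        text=True,
        bufsize=1,  # line-buffered on our side of the pipe
    )
    captured = []
    for line in proc.stdout:
        sys.stdout.write(line)  # the user sees the line immediately
        sys.stdout.flush()
        captured.append(line)   # and the caller still gets it afterwards
    returncode = proc.wait()
    return returncode, "".join(captured)


def env_prefix(env):
    """Build a 'KEY=VAL KEY2=VAL2' prefix for a shell command line.

    Values are quoted, but keys are not validated, which is part of
    why this kind of hand-rolled glue is fragile.
    """
    return " ".join(f"{key}={shlex.quote(str(value))}" for key, value in env.items())
```

With something like this built into run, a caller would get live output, the full text, and the exit code from a single invocation.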
P2 — No first-class Slurm/workload-manager lifecycle
rcc = "remote cluster controller," but on an HPC remote the cluster is Slurm. Today every consumer reinvents the same squeue poll loops. In our snapper/slurm.py alone we hand-rolled: submit (parse Submitted batch job N), wait_in_queue (poll squeue -o %T), wait_for_ready (poll per-rank logs for readiness markers), _fetch_rank_logs (loop cat over ranks). We even invented our own wire format — ===RANK N=== markers — because rcc gave us none.
Proposed verbs (obviously opt-in / plugin-y, not core SSH):
rcc submit -- CMD… → prints jobid (machine-parseable)
rcc jobs [--profile p] → list
rcc logs <jobid> [--follow] [--rank N] → the single biggest win; replaces the rank-concat loop
rcc wait <jobid> [--timeout S] [--on STATE] → blocks, exits non-zero on FAILED
Even just rcc logs <jobid> --follow would delete a remarkable amount of glue.
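For a sense of the glue involved, below is a simplified stand-in for the helpers listed above (not the actual snapper code). It assumes sbatch's usual "Submitted batch job N" line, that each ===RANK N=== marker sits on its own line, and that the Slurm state is fetched by an injected `poll` callable (for example, one that shells out to squeue -o %T); the failure-state tuple and the default timings are illustrative choices.

```python
import re
import time

JOBID_RE = re.compile(r"Submitted batch job (\d+)")
RANK_RE = re.compile(r"^===RANK (\d+)===$")


def parse_jobid(sbatch_output):
    """Extract the numeric job id from sbatch's confirmation line."""
    match = JOBID_RE.search(sbatch_output)
    if match is None:
        raise ValueError(f"no job id in: {sbatch_output!r}")
    return int(match.group(1))


def parse_rank_dump(text):
    """Split a concatenated dump into {rank: log_text} using the markers."""
    logs = {}
    current = None
    for line in text.splitlines():
        match = RANK_RE.match(line)
        if match:
            current = int(match.group(1))
            logs[current] = []
        elif current is not None:
            logs[current].append(line)
    return {rank: "\n".join(lines) for rank, lines in logs.items()}


def wait_for_state(poll, wanted, failed=("FAILED", "CANCELLED", "TIMEOUT"),
                   timeout=600.0, interval=5.0, sleep=time.sleep):
    """Poll until the job reaches a wanted state, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        state = poll()
        if state in wanted:
            return state
        if state in failed:
            raise RuntimeError(f"job entered {state}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {state!r} after {timeout}s")
        sleep(interval)
```

Every consumer ends up with some variant of these three functions; rcc submit, rcc logs, and rcc wait would replace them one for one.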
P2 — No structured (--json) output
Every downstream parser is regex over free text: parse_jobid, parse_squeue_state, parse_rank_dump. rcc config "prints the resolved profile" as text; we'd love --json. Please add a stable --json mode to at least config, status, and the hypothetical jobs/logs — it lets wrappers drop ~100 lines of fragile parsing and survive output-format changes.
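As a sketch of what the consuming side could look like, assume rcc config --json printed a flat JSON object. The exact shape is hypothetical; only the host and remote_dir keys are taken from the current profile fields, and `Profile` / `parse_profile` are names made up for this example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    host: str
    remote_dir: str


def parse_profile(json_text):
    """Parse the hypothetical output of `rcc config --json`.

    Unknown keys are ignored, so the wrapper keeps working when
    new fields are added to the output.
    """
    data = json.loads(json_text)
    return Profile(host=data["host"], remote_dir=data["remote_dir"])
```

Compared with regex over free text, a missing field fails loudly with a KeyError instead of silently matching the wrong line.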
P3 — Port-forwarding should be a first-class verb
Our deploy READMEs end every workflow with a manual ssh -L 8080:<HEAD_NODE>:8080 bristen, and snapper/service.py parses a job-written last_service.env to recover the endpoint. rcc already maintains an SSH ControlMaster (cf. status/close), so an rcc tunnel <profile> <remote-port> [--local-port] [--remote-host HEAD_NODE] that reuses that master would collapse this whole class of glue, and it pairs naturally with a profile-level tunnel = { remote_port = 8080 }.
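One way such a verb could work is to ask the existing master connection to add a forwarding, using OpenSSH's -O forward control command. The sketch below only builds the argv; `tunnel_argv` and its parameters are hypothetical, and the control socket path rcc actually uses is unknown to us, so it is passed in as an optional argument.

```python
def tunnel_argv(host, remote_port, *, remote_host="localhost",
                local_port=None, control_path=None):
    """Build an ssh argv that adds a local forward to a running master."""
    local = remote_port if local_port is None else local_port
    argv = ["ssh"]
    if control_path is not None:
        # Assumption: the caller knows where the master's socket lives.
        argv += ["-S", control_path]
    argv += ["-O", "forward", "-L", f"{local}:{remote_host}:{remote_port}", host]
    return argv
```

For our case this would produce the equivalent of the manual command from the READMEs, without opening a second SSH session.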
P3 — Richer profiles
.rcc/config.toml profiles today are {host, remote_dir}. In practice we also want, per-profile:
- env defaults (we re-derive TRITON_CACHE_DIR, redirected $HOME, EDF_PATH in every serve script; these are profile constants),
- proxy_jump / identity_file (bastion hops are the norm on HPC; CSCS uses an SSH-config/enrollment dance rcc could encapsulate),
- the aforementioned tunnel block and keep_remote protect-list.
TL;DR priority
| # | Theme | Stakes |
|---|---|---|
| P1 | --delete safety (mirror vs sync, protect-list, itemized dry-run) | data loss |
| P1 | run streaming+exit-code, default login shell, --env/--cwd | daily ergonomics |
| P2 | Slurm lifecycle (submit/jobs/logs/wait) | removes duplicated glue across every consumer |
| P2 | --json on config/status/jobs/logs | composability |
| P3 | rcc tunnel | removes manual ssh -L |
| P3 | richer profiles (env, proxy_jump, keep_remote, tunnel) | one-time setup pain |
Very happy to break any of these into its own issue and/or contribute PRs (the run/--env/--json ones look approachable). Just let me know which you'd want to take first.
— filed from downstream usage in serving-stack / snapper.