Ergonomics & safety gaps when driving rcc from automation (downstream evidence) #2

Description

@xzyaoi

Context

I maintain a downstream project (serving-stack) that drives rcc heavily: we wrap rcc push / rcc run in a Python CLI (snapper) that submits Slurm jobs, polls them to readiness, fetches per-rank logs, and tunnels to the served endpoint. Observed against rcc 0.1.0.

This issue collects the concrete friction points hit when driving rcc from automation (not just interactively). Happy to split any of these into standalone issues — I kept them in one place because several share a root cause (rcc is currently optimized for interactive typing, not for being scripted). Ordered by impact × stakes.


P1 — push / pull --delete is a footgun (highest stakes)

Our .rcc/rccignore carries this comment, verbatim:

JIT compile caches + container HOME on shared FS (written by vLLM/OpenTela). MUST be excluded: a push --delete without this wipes hundreds of thousands of cache files and the per-job generated scripts written into DEPLOY_DIR.

The entire deletion-safety story is pushed onto the user via an ever-growing rccignore. Concrete asks:

  • Mirror vs. sync distinction. Today --delete mirrors local→remote. Consider making the default non-destructive (add/update only) and requiring an explicit --mirror / --delete-excluded for the dangerous semantics. (rsync's --delete-excluded vs --delete split is useful prior art.)
  • Protect-list. A --keep-remote GLOB (or profile-level keep_remote = [...]) that survives even --mirror, for paths the job owns (logs/, cache/, *.safetensors, last_service.env). This is rccignore-with-teeth — the kind of guard rail that belongs in the tool, not the user's memory.
  • Itemized dry-run. --dry-run exists; please surface deletions in a clearly distinct section (Would DELETE: ...) so they can't hide in a wall of f.f..... rsync lines.
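
To make the protect-list and itemized dry-run asks concrete, here is a minimal sketch of how a wrapper (or rcc itself) might post-process dry-run output. It assumes the dry-run lines are rsync `--itemize-changes` output, where deletions are reported as lines starting with `*deleting`. The names `split_dry_run`, `DryRunPlan`, and the `keep_remote` semantics (a trailing `/` protects a whole directory, anything else is a glob) are hypothetical, not existing rcc behavior.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, NamedTuple


class DryRunPlan(NamedTuple):
    would_delete: List[str]  # deletions that would really happen
    protected: List[str]     # deletions blocked by a keep_remote pattern
    other: List[str]         # every other itemized line, passed through


def _is_protected(path: str, patterns: List[str]) -> bool:
    for pat in patterns:
        if pat.endswith("/"):
            # Hypothetical rule: "cache/" protects the directory and its contents.
            if path == pat or path.startswith(pat):
                return True
        elif fnmatchcase(path, pat):
            return True
    return False


def split_dry_run(lines: Iterable[str], keep_remote: Iterable[str] = ()) -> DryRunPlan:
    """Separate rsync-style '*deleting' lines from the rest of a dry run."""
    patterns = list(keep_remote)
    plan = DryRunPlan([], [], [])
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if not line.startswith("*deleting"):
            plan.other.append(line)
            continue
        path = line[len("*deleting"):].strip()
        if _is_protected(path, patterns):
            plan.protected.append(path)
        else:
            plan.would_delete.append(path)
    return plan
```

With this split, the `would_delete` list can be printed under its own "Would DELETE:" heading, and a non-empty `protected` list can be reported as skipped rather than silently removed.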

P1 — run is awkward to script: streaming vs. exit-code, bash -lc tax, no --env/--cwd

This is the single most-used verb for us, and three things hurt:

  1. Live streaming and output capture are mutually exclusive (the exit code comes back either way). Our wrapper (snapper/rcc.py) has to choose:
    def _run_argv(argv, *, stream=False):
        if stream:
            proc = subprocess.run(argv)           # user sees output live...
            return RunResult(proc.returncode, "", "")  # ...but stdout/stderr come back EMPTY
        proc = subprocess.run(argv, capture_output=True, text=True)  # or capture text...
        return RunResult(proc.returncode, proc.stdout, proc.stderr)  # ...but see nothing until exit
    So for a long srun we either (a) see live output but can't programmatically inspect it, or (b) capture it but the call looks hung for minutes. A --stream that still populates stdout/stderr (tee to the TTY while accumulating) would fix this cleanly.
  2. The bash -lc tax. Literally every invocation in our docs is rcc --profile p run bash -lc "...". The run verb should run through a login shell by default (with --no-shell to opt out), so users stop retyping bash -lc. Today run's argv is ["rcc", ..., "run", "bash", "-lc", cmd] — the shell is the user's responsibility ~30 times per README.
  3. No --env / --cwd. Because there's no --env KEY=VAL, we build env prefixes by hand with shlex.quote and string concatenation (snapper/slurm.py::_env_prefix), then hand them to bash -lc. That's injection-fragile glue a structured --env (repeatable) flag would eliminate. Likewise everything is implicitly remote_dir; an explicit --cwd (or rcc exec that is not remote-dir-scoped) would remove a class of absolute-path hacks.
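
As an illustration of items 1 and 3, the sketch below shows one way to stream output while still capturing it, and one way to build the environment and working-directory prefix from structured inputs. `run_tee` and `build_shell_command` are hypothetical helpers, not part of rcc or snapper. `run_tee` merges stderr into stdout so that a single reader loop cannot block on a full pipe, which means the returned `stderr` field is always empty.

```python
import re
import shlex
import subprocess
import sys
from typing import Mapping, NamedTuple, Optional, Sequence

_ENV_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


class RunResult(NamedTuple):
    returncode: int
    stdout: str
    stderr: str


def run_tee(argv: Sequence[str]) -> RunResult:
    """Echo the child's output line by line while also accumulating it."""
    proc = subprocess.Popen(
        list(argv),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merged: one pipe, one reader loop
        text=True,
        bufsize=1,
    )
    assert proc.stdout is not None
    captured = []
    for line in proc.stdout:
        sys.stdout.write(line)  # the user sees it live...
        sys.stdout.flush()
        captured.append(line)   # ...and the caller still gets the text
    returncode = proc.wait()
    return RunResult(returncode, "".join(captured), "")


def build_shell_command(
    cmd: str,
    env: Optional[Mapping[str, str]] = None,
    cwd: Optional[str] = None,
) -> str:
    """Compose 'export K=V && cd DIR && cmd' with every value shell-quoted."""
    parts = []
    for key, value in (env or {}).items():
        if not _ENV_NAME.fullmatch(key):
            raise ValueError(f"invalid environment variable name: {key!r}")
        parts.append(f"export {key}={shlex.quote(value)}")
    if cwd is not None:
        parts.append(f"cd {shlex.quote(cwd)}")
    parts.append(cmd)  # cmd itself is passed through as shell text
    return " && ".join(parts)
```

A structured `--env KEY=VAL` flag could do the equivalent of `build_shell_command` inside rcc, so callers would no longer concatenate quoted strings themselves.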

P2 — No first-class Slurm/workload-manager lifecycle

rcc = "remote cluster controller," but on an HPC remote the cluster is Slurm. Today every consumer reinvents the same squeue poll loops. In our snapper/slurm.py alone we hand-rolled: submit (parse Submitted batch job N), wait_in_queue (poll squeue -o %T), wait_for_ready (poll per-rank logs for readiness markers), _fetch_rank_logs (loop cat over ranks). We even invented our own wire format (===RANK N=== markers) because rcc gave us none.

Proposed verbs (obviously opt-in / plugin-y, not core SSH):

  • rcc submit -- CMD… → prints jobid (machine-parseable)
  • rcc jobs [--profile p] → list
  • rcc logs <jobid> [--follow] [--rank N] → the single biggest win; replaces the rank-concat loop
  • rcc wait <jobid> [--timeout S] [--on STATE] → blocks, exits non-zero on FAILED

Even just rcc logs <jobid> --follow would delete a remarkable amount of glue.
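
For reference, the kind of glue a built-in rcc wait could replace looks roughly like the sketch below. It is not the actual snapper code: `parse_jobid` and `wait_for_state` are illustrative names, the state lookup is injected as a callable (in practice it would shell out to squeue), and the return codes 0/1/2 are an assumed convention.

```python
import re
import time
from typing import Callable

# Slurm job states after which the job will not change state again.
TERMINAL_STATES = {
    "COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY",
}


def parse_jobid(sbatch_output: str) -> int:
    """Extract N from sbatch's 'Submitted batch job N' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job id found in: {sbatch_output!r}")
    return int(match.group(1))


def wait_for_state(
    get_state: Callable[[], str],
    target: str = "RUNNING",
    timeout: float = 600.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until the job reaches `target`.

    Returns 0 when the target state is reached, 1 when the job ends in a
    different terminal state, and 2 when the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state == target:
            return 0
        if state in TERMINAL_STATES:
            return 1
        if time.monotonic() >= deadline:
            return 2
        sleep(interval)
```

Injecting `get_state` and `sleep` keeps the loop testable without a cluster. A first-class verb would let every consumer drop its own copy of this loop.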

P2 — No structured (--json) output

Every downstream parser is regex over free text: parse_jobid, parse_squeue_state, parse_rank_dump. rcc config "prints the resolved profile" as text; we'd love --json. Please add a stable --json mode to at least config, status, and the hypothetical jobs/logs — it lets wrappers drop ~100 lines of fragile parsing and survive output-format changes.
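
To show what this would buy a wrapper, here is a small sketch that consumes a hypothetical `rcc config --json` payload. The JSON shape is an assumption: only `host` and `remote_dir` correspond to profile fields mentioned in this issue, while the `profile` key and the example path are made up for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    host: str
    remote_dir: str


def parse_profile(payload: str) -> Profile:
    """Turn the (hypothetical) JSON output into a typed object."""
    data = json.loads(payload)
    # A missing key raises KeyError, which is a clearer failure than a
    # regex that silently stops matching after a format change.
    return Profile(
        name=data["profile"],
        host=data["host"],
        remote_dir=data["remote_dir"],
    )


# Assumed output of `rcc --profile p config --json`; the values are invented.
example = '{"profile": "p", "host": "bristen", "remote_dir": "/scratch/deploy"}'
profile = parse_profile(example)
```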

P3 — Port-forwarding should be a first-class verb

Our deploy READMEs end every workflow with a manual ssh -L 8080:<HEAD_NODE>:8080 bristen, and snapper/service.py parses a job-written last_service.env to recover the endpoint. rcc already maintains an SSH ControlMaster (cf. status/close), so an rcc tunnel <profile> <remote-port> [--local-port] [--remote-host HEAD_NODE] that reuses that master would collapse this whole class of glue, and it pairs naturally with a profile-level tunnel = { remote_port = 8080 }.
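
A rough sketch of the argv such a verb might build is shown below. `tunnel_argv` is a hypothetical helper, and where rcc keeps its control socket is not known here, so `control_path` is an assumed input. The ssh options themselves are standard OpenSSH: `-L` for local forwarding, `-N` for no remote command, `-S` to name a control socket, and `-O forward` to ask a running master to add a forwarding.

```python
from typing import List, Optional


def tunnel_argv(
    host: str,
    remote_port: int,
    local_port: Optional[int] = None,
    remote_host: str = "localhost",
    control_path: Optional[str] = None,
) -> List[str]:
    """Build an ssh command line that forwards local_port to remote_host:remote_port."""
    local = remote_port if local_port is None else local_port
    spec = f"{local}:{remote_host}:{remote_port}"
    if control_path is None:
        # No master to reuse: open a dedicated connection that only forwards.
        return ["ssh", "-N", "-L", spec, host]
    # Reuse the existing master connection instead of authenticating again.
    return ["ssh", "-S", control_path, "-O", "forward", "-L", spec, host]
```

For the README case above, `tunnel_argv("bristen", 8080, remote_host="<HEAD_NODE>")` reproduces the manual command with `-N` added.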

P3 — Richer profiles

.rcc/config.toml profiles today are {host, remote_dir}. In practice we also want, per-profile:

  • env defaults (we re-derive TRITON_CACHE_DIR, redirected $HOME, EDF_PATH in every serve script — these are profile constants),
  • proxy_jump / identity_file (bastion hops are the norm on HPC; CSCS uses an SSH-config/enrollment dance rcc could encapsulate),
  • the aforementioned tunnel block and keep_remote protect-list.

TL;DR priority

| # | Theme | Stakes |
|---|---|---|
| P1 | --delete safety (mirror vs sync, protect-list, itemized dry-run) | data loss |
| P1 | run streaming+exit-code, default login shell, --env/--cwd | daily ergonomics |
| P2 | Slurm lifecycle (submit/jobs/logs/wait) | removes duplicated glue across every consumer |
| P2 | --json on config/status/jobs/logs | composability |
| P3 | rcc tunnel | removes manual ssh -L |
| P3 | richer profiles (env, proxy_jump, keep_remote, tunnel) | one-time setup pain |

Very happy to break any of these into its own issue and/or contribute PRs (the run/--env/--json ones look approachable). Just let me know which you'd want to take first.

— filed from downstream usage in serving-stack / snapper.

Labels: enhancement
