Parent epic: tracebloc/client-runtime#116 (WS3 — diagnose without tracebloc access). Moved here from tracebloc/client#267 after deciding to implement in the CLI rather than the installer script.
Decision
Implement as tracebloc cluster doctor — a sibling of the existing tracebloc cluster info (internal/cli/cluster.go), whose own doc comment already anticipates it:
Future verbs (e.g. cluster doctor for diagnostics, cluster contexts for switching) hang off this parent in later phases.
It reuses cluster info's plumbing directly: cluster.Load / cluster.NewClientset / cluster.DiscoverParentRelease, the kubeconfig/context/namespace flags, ui.Printer (Successf/Warnf/Errorf/Hintf = ✔/⚠/✖ + remedy), and exitError for exit codes.
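As a rough illustration of the ✔/⚠/✖ + remedy mapping described above, here is a minimal sketch. The `reporter` interface is a hypothetical stand-in for the four `ui.Printer` methods named in this issue; the real signatures are assumed, not copied from the repository, and `glyphPrinter`, `level`, `report`, and `render` are illustrative names only.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// reporter is a hypothetical stand-in for the four ui.Printer methods named
// in the issue. The real signatures are assumed, not taken from the repository.
type reporter interface {
	Successf(format string, args ...any)
	Warnf(format string, args ...any)
	Errorf(format string, args ...any)
	Hintf(format string, args ...any)
}

// glyphPrinter writes one line per call, prefixed with the matching glyph.
type glyphPrinter struct{ w io.Writer }

func (p glyphPrinter) Successf(f string, a ...any) { fmt.Fprintf(p.w, "✔ "+f+"\n", a...) }
func (p glyphPrinter) Warnf(f string, a ...any)    { fmt.Fprintf(p.w, "⚠ "+f+"\n", a...) }
func (p glyphPrinter) Errorf(f string, a ...any)   { fmt.Fprintf(p.w, "✖ "+f+"\n", a...) }
func (p glyphPrinter) Hintf(f string, a ...any)    { fmt.Fprintf(p.w, "  → "+f+"\n", a...) }

// level is an illustrative severity for one finding.
type level int

const (
	levelOK level = iota
	levelWarn
	levelFail
)

// report prints one finding and, for anything other than OK, its remedy.
func report(r reporter, l level, detail, remedy string) {
	switch l {
	case levelOK:
		r.Successf("%s", detail)
		return
	case levelWarn:
		r.Warnf("%s", detail)
	default:
		r.Errorf("%s", detail)
	}
	if remedy != "" {
		r.Hintf("%s", remedy)
	}
}

// render captures the output of report as a string, which is handy in tests.
func render(l level, detail, remedy string) string {
	var sb strings.Builder
	report(glyphPrinter{w: &sb}, l, detail, remedy)
	return sb.String()
}

func main() {
	p := glyphPrinter{w: os.Stdout}
	report(p, levelOK, "example check passed", "")
	report(p, levelFail, "example check failed", "example remedy text")
}
```

Keeping the checks behind a small interface like this would let tests swap in a buffer-backed printer, which may make it easier to mirror the existing cluster info tests.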
Context — what already exists (don't rebuild)
install-k8s.sh --diagnose (tracebloc/client) already produces WS3's redacted support-bundle (logs + cluster/host state). ⇒ support-bundle ≈ done.
preflight.sh (tracebloc/client) already does pre-install host checks (proxy-aware egress probe, filesystem-type, disk/RAM/CPU).
doctor's gap is the post-install, live-cluster, on-demand health command with red/green + remedies — runnable any time the customer asks "why isn't my experiment running?", over their normal kubeconfig.
Lean MVP — first cut (6 checks)
Each check emits ✔/⚠/✖ plus a one-line remedy. The run ends with a consolidated verdict and, when any check is red, the hint "run install-k8s.sh --diagnose and send us the bundle".
Read-only and best-effort: every check runs independently and never aborts the run; the exit code reflects the worst result.
Out of scope (follow-up)
- node-resources-vs-spawned-job-request fit, image pullability (the broader cut).
- support-bundle (--diagnose, shipped) and pre-install host checks (preflight.sh, shipped).
Done when
tracebloc cluster doctor on a live cluster prints per-check ✔/⚠/✖ + remedies + a verdict, runs read-only over the customer's kubeconfig, never aborts mid-check, and has Go test coverage mirroring the cluster info tests. Target branch: develop.