WS3 · tracebloc cluster doctor — live-cluster health check with remedies #88

Description

@saadqbal

Parent epic: tracebloc/client-runtime#116 (WS3 — diagnose without tracebloc access). Moved here from tracebloc/client#267 after deciding to implement in the CLI rather than the installer script.

Decision

Implement as tracebloc cluster doctor — a sibling of the existing tracebloc cluster info (internal/cli/cluster.go), whose own doc comment already anticipates it:

Future verbs (e.g. cluster doctor for diagnostics, cluster contexts for switching) hang off this parent in later phases.

It reuses cluster info's plumbing directly: cluster.Load / cluster.NewClientset / cluster.DiscoverParentRelease, the kubeconfig/context/namespace flags, ui.Printer (Successf/Warnf/Errorf/Hintf = ✔/⚠/✖ + remedy), and exitError for exit codes.

Context — what already exists (don't rebuild)

  • install-k8s.sh --diagnose (tracebloc/client) already produces WS3's redacted support-bundle (logs + cluster/host state). ⇒ support-bundle ≈ done.
  • preflight.sh (tracebloc/client) already does pre-install host checks (proxy-aware egress probe, filesystem-type, disk/RAM/CPU).

doctor's gap is the post-install, live-cluster, on-demand health command with red/green + remedies — runnable any time the customer asks "why isn't my experiment running?", over their normal kubeconfig.

Lean MVP — first cut (6 checks)

Each emits ✔/⚠/✖ + a one-line remedy; ends with a consolidated verdict + "run install-k8s.sh --diagnose and send us the bundle" when red.

  • Cluster reachable — kubeconfig loads, API responds, parent release discovered (reuse DiscoverParentRelease).
  • Backend egress — the cluster can reach the tracebloc backend API (proxy-aware).
  • Service Bus reachability — the SB endpoint is reachable (the silent-Pending culprit when blocked).
  • dockerd / proxy config — proxy env is present/consistent where required.
  • Dataset / PVC mount — the dataset PVC is bound and mounted in the running jobs-manager (live form of client#262, "feat(installer,chart): place datasets on a network mount while MySQL stays local"; reuse internal/cluster/pvc.go).
  • Pod health — no crash-looping / long-Pending pods in the namespace (local complement to client-runtime#117's controller detection).
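As an illustration of the pod-health check, here is a minimal sketch of the classification step only. It assumes the pod list has already been fetched and flattened into a small struct; `PodSummary`, `ContainerState`, `findUnhealthy`, and the 5-minute `pendingGrace` threshold are hypothetical names and values, not existing code in the CLI. The `Pending` phase and the `CrashLoopBackOff` waiting reason are the values Kubernetes itself reports.

```go
package main

import (
	"fmt"
	"time"
)

// ContainerState is a hypothetical, flattened view of one container status.
type ContainerState struct {
	Name          string
	WaitingReason string // e.g. "CrashLoopBackOff"; empty when not waiting
	RestartCount  int
}

// PodSummary is a hypothetical, flattened view of the pod fields this check needs.
type PodSummary struct {
	Name       string
	Phase      string // "Pending", "Running", "Succeeded", ...
	CreatedAt  time.Time
	Containers []ContainerState
}

// pendingGrace is an assumed threshold for "long-Pending"; the issue does not fix a value.
const pendingGrace = 5 * time.Minute

// findUnhealthy returns one human-readable problem line per unhealthy pod.
// It only reads its inputs, in keeping with the read-only requirement.
func findUnhealthy(pods []PodSummary, now time.Time) []string {
	var problems []string
	for _, p := range pods {
		if c, ok := crashLooping(p); ok {
			problems = append(problems, fmt.Sprintf(
				"%s: container %s is crash-looping (%d restarts)", p.Name, c.Name, c.RestartCount))
			continue
		}
		if age := now.Sub(p.CreatedAt); p.Phase == "Pending" && age > pendingGrace {
			problems = append(problems, fmt.Sprintf(
				"%s: Pending for %s", p.Name, age.Round(time.Minute)))
		}
	}
	return problems
}

// crashLooping reports the first container waiting in CrashLoopBackOff, if any.
func crashLooping(p PodSummary) (ContainerState, bool) {
	for _, c := range p.Containers {
		if c.WaitingReason == "CrashLoopBackOff" {
			return c, true
		}
	}
	return ContainerState{}, false
}

// demoNow and demoPods provide fixed sample data for the example.
var demoNow = time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)

func demoPods() []PodSummary {
	return []PodSummary{
		{Name: "mysql-0", Phase: "Running", CreatedAt: demoNow.Add(-time.Hour)},
		{Name: "jobs-manager-abc", Phase: "Running", CreatedAt: demoNow.Add(-time.Hour),
			Containers: []ContainerState{{Name: "manager", WaitingReason: "CrashLoopBackOff", RestartCount: 7}}},
		{Name: "job-old", Phase: "Pending", CreatedAt: demoNow.Add(-12 * time.Minute)},
		{Name: "job-new", Phase: "Pending", CreatedAt: demoNow.Add(-1 * time.Minute)},
	}
}

func main() {
	for _, line := range findUnhealthy(demoPods(), demoNow) {
		fmt.Println("✖", line)
	}
}
```

Keeping the classification separate from the API call means it can be unit-tested with plain structs, without a live cluster or a fake clientset.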

Read-only and best-effort: every check runs independently and never aborts the run; the exit code reflects the worst result.
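The best-effort behaviour described above could be sketched roughly as follows. Everything here is an assumption for illustration: the `Status`, `Result`, and `Check` types, the `runAll` and `exitCode` helpers, and the 0/1/2 exit-code mapping are hypothetical and do not reflect the actual `ui.Printer` or `exitError` APIs.

```go
package main

import "fmt"

// Status is the outcome of one check, ordered so that a larger value is worse.
type Status int

const (
	OK Status = iota
	Warn
	Fail
)

// Glyph maps a status to the ✔/⚠/✖ markers used in the output.
func (s Status) Glyph() string {
	switch s {
	case OK:
		return "✔"
	case Warn:
		return "⚠"
	default:
		return "✖"
	}
}

// Result is what a single check reports (hypothetical shape).
type Result struct {
	Name   string
	Status Status
	Detail string
	Remedy string // one-line remedy, shown when Status is not OK
}

// Check pairs a display name with a function that performs the check.
type Check struct {
	Name string
	Run  func() Result
}

// runAll executes every check independently and returns all results
// together with the worst status seen. One check failing or panicking
// never prevents the remaining checks from running.
func runAll(checks []Check) ([]Result, Status) {
	results := make([]Result, 0, len(checks))
	worst := OK
	for _, c := range checks {
		r := runOne(c)
		if r.Status > worst {
			worst = r.Status
		}
		results = append(results, r)
	}
	return results, worst
}

// runOne runs a single check and converts a panic into a Fail result.
func runOne(c Check) (r Result) {
	defer func() {
		if p := recover(); p != nil {
			r = Result{Name: c.Name, Status: Fail,
				Detail: fmt.Sprintf("check panicked: %v", p),
				Remedy: "run install-k8s.sh --diagnose and send us the bundle"}
		}
	}()
	r = c.Run()
	r.Name = c.Name
	return r
}

// exitCode is an assumed mapping: 0 for all green, 1 for warnings, 2 for failures.
func exitCode(worst Status) int { return int(worst) }

func main() {
	// Stub checks with canned results; real checks would query the cluster.
	checks := []Check{
		{Name: "Cluster reachable", Run: func() Result { return Result{Status: OK, Detail: "API responds"} }},
		{Name: "Backend egress", Run: func() Result {
			return Result{Status: Fail, Detail: "connection timed out", Remedy: "check proxy settings"}
		}},
		{Name: "Pod health", Run: func() Result { return Result{Status: Warn, Detail: "1 pod Pending"} }},
	}
	results, worst := runAll(checks)
	for _, r := range results {
		fmt.Printf("%s %s: %s\n", r.Status.Glyph(), r.Name, r.Detail)
		if r.Status != OK && r.Remedy != "" {
			fmt.Printf("    remedy: %s\n", r.Remedy)
		}
	}
	fmt.Printf("verdict: %s (exit code %d)\n", worst.Glyph(), exitCode(worst))
}
```

Ordering the statuses numerically keeps the "worst result" rule to a single comparison, and recovering inside `runOne` is one way to ensure an unexpected panic in a check still yields a ✖ line rather than ending the run.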

Out of scope (follow-up)

  • node-resources-vs-spawned-job-request fit, image pullability (the broader cut).
  • support-bundle (--diagnose, shipped) and pre-install host checks (preflight.sh, shipped).

Done when

tracebloc cluster doctor on a live cluster prints per-check ✔/⚠/✖ + remedies + a verdict, runs read-only over the customer's kubeconfig, never aborts mid-check, and has Go test coverage mirroring the cluster info tests. Target branch: develop.
