[Observability] Finalize run manifests on early exits and unhandled pipeline failures #111

Description

@agorevski

Finding

Training run manifests are initialized as running, but several early exits and unhandled failures can leave them in that state instead of finalizing them as failed with an error reason.

Evidence

  • train.py:1350-1374 initializes a run manifest with status: "running".
  • train.py:2356-2363 writes that manifest before the main pipeline branches.
  • Eval-only validation can call sys.exit(1) for a missing/invalid model or missing test dataset at train.py:2377-2409 before _finalize_manifest runs.
  • Full training can call sys.exit(1) for a missing dataset or missing ETHERSCAN_API_KEY at train.py:2477-2504 before _finalize_manifest runs.
  • Only the preflight-failure path explicitly calls _finalize_manifest(..., "failed", error) (train.py:2431-2434, train.py:2556-2559); there is no top-level try/except/finally around collection, training, or evaluation.
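The failure mode described above can be reproduced in isolation. The sketch below is not code from train.py: `buggy_main`, `reproduce`, and the file names are hypothetical stand-ins that mimic the pattern of writing a "running" manifest and then calling sys.exit(1) before any finalization step.

```python
import json
import sys
import tempfile
from pathlib import Path


def buggy_main(manifest_path, dataset_path):
    """Hypothetical stand-in for a pipeline that exits before finalizing."""
    # The manifest is written first, before any input validation.
    manifest_path.write_text(json.dumps({"status": "running"}), encoding="utf-8")

    if not dataset_path.exists():
        # Early exit: the finalization step below is never reached.
        sys.exit(1)

    manifest_path.write_text(json.dumps({"status": "completed"}), encoding="utf-8")


def reproduce():
    """Run buggy_main against a missing dataset and return the manifest status."""
    with tempfile.TemporaryDirectory() as tmp:
        manifest_path = Path(tmp) / "manifest.json"
        try:
            buggy_main(manifest_path, Path(tmp) / "missing_dataset.jsonl")
        except SystemExit:
            # Swallowed here only so the stale manifest can be inspected.
            pass
        return json.loads(manifest_path.read_text(encoding="utf-8"))["status"]
```

Calling `reproduce()` returns `"running"`, even though the run terminated with exit code 1.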

Impact

A failed or misconfigured run can leave behind a manifest that says it is still running. This misleads experiment tracking, makes fleet/CI triage harder, and obscures the actual failure reason from the primary run artifact.

Recommended fix

Wrap the pipeline execution in main() in a manifest-aware failure handler. Either replace the early sys.exit branches with exceptions or finalize the manifest before exiting, and ensure that unhandled collection, training, and evaluation exceptions update status, finished_at, duration, and structured error information.
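One possible shape for such a handler is sketched below. It is only an illustration under assumptions: `run_with_manifest`, the `pipeline` callable, and the manifest keys `started_at`, `duration_seconds`, and `error` are hypothetical names, and the real `_finalize_manifest` in train.py may have a different signature. The point it shows is that sys.exit raises SystemExit, which is not a subclass of Exception, so the handler has to catch it explicitly.

```python
import json
import time
import traceback
from datetime import datetime, timezone
from pathlib import Path


def _now():
    return datetime.now(timezone.utc).isoformat()


def _write(path, manifest):
    Path(path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")


def run_with_manifest(manifest_path, pipeline):
    """Run pipeline() and finalize the manifest on every handled outcome.

    Returns the process exit code instead of exiting, so the caller decides
    when to call sys.exit().
    """
    started = time.monotonic()
    manifest = {"status": "running", "started_at": _now()}
    _write(manifest_path, manifest)

    status, error, exit_code = "completed", None, 0
    try:
        pipeline()
    except SystemExit as exc:
        # sys.exit() raises SystemExit, which `except Exception` does not catch.
        code = exc.code
        exit_code = 0 if code is None else code if isinstance(code, int) else 1
        if exit_code != 0:
            status = "failed"
            error = {"type": "SystemExit", "message": f"exited with code {code!r}"}
    except Exception as exc:
        status, exit_code = "failed", 1
        error = {
            "type": type(exc).__name__,
            "message": str(exc),
            "traceback": traceback.format_exc(),
        }

    # Finalization runs for success, early exit, and unexpected exceptions.
    manifest.update(
        status=status,
        finished_at=_now(),
        duration_seconds=round(time.monotonic() - started, 3),
    )
    if error is not None:
        manifest["error"] = error
    _write(manifest_path, manifest)
    return exit_code
```

With this shape, the entry point would reduce to something like `sys.exit(run_with_manifest(path, pipeline))`. KeyboardInterrupt is deliberately not handled in this sketch; whether an interrupted run should be recorded as failed or as a separate status is a design decision for the actual fix.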

Acceptance criteria

  • Missing model, missing test dataset, missing source dataset, missing API key, preflight failure, and unexpected exceptions all produce finalized manifests with status: "failed" and an error object.
  • Successful dataset-only, eval-only, train, and skip-eval runs still finalize as completed.
  • Tests or smoke commands verify manifest status for at least one early-exit and one exception path.
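A smoke check for these criteria could look roughly like the following. This is a sketch under assumptions: `check_manifest` is a hypothetical helper, and it assumes the manifest is a JSON file whose error information lives under an `error` key; the actual file format and key names in train.py may differ.

```python
import json
from pathlib import Path


def check_manifest(path, expected_status):
    """Return a list of problems; an empty list means the manifest passes."""
    manifest = json.loads(Path(path).read_text(encoding="utf-8"))
    problems = []

    status = manifest.get("status")
    if status != expected_status:
        problems.append(f"status is {status!r}, expected {expected_status!r}")
    if status == "running":
        problems.append("manifest was never finalized")
    if "finished_at" not in manifest:
        problems.append("finished_at is missing")
    # A failed run must carry a structured error object, not just a status.
    if expected_status == "failed" and not isinstance(manifest.get("error"), dict):
        problems.append("failed manifest has no structured error object")

    return problems
```

A test for an early-exit path would trigger the failure (for example, by pointing the run at a nonexistent dataset), then assert that `check_manifest(manifest_path, "failed")` returns an empty list. The same helper covers the success cases with `"completed"`.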
