Finding
Training run manifests are initialized as running, but several early exits and unhandled failures can leave them in that state instead of finalizing them as failed with an error reason.
Evidence
- train.py:1350-1374 initializes a run manifest with status: "running".
- train.py:2356-2363 writes that manifest before the main pipeline branches.
- Eval-only validation can call sys.exit(1) for a missing/invalid model or missing test dataset at train.py:2377-2409 before _finalize_manifest runs.
- Full training can call sys.exit(1) for a missing dataset or missing ETHERSCAN_API_KEY at train.py:2477-2504 before _finalize_manifest runs.
- Only the preflight-failure path explicitly calls _finalize_manifest(..., "failed", error) (train.py:2431-2434, train.py:2556-2559); there is no top-level try/except/finally around collection, training, or evaluation.
Impact
A failed or misconfigured run can leave behind a manifest that says it is still running. This misleads experiment tracking, makes fleet/CI triage harder, and obscures the actual failure reason from the primary run artifact.
Recommended fix
Wrap main() pipeline execution in a manifest-aware failure handler. Replace early sys.exit branches with exceptions or finalize before exiting, and ensure unhandled collection/training/evaluation exceptions update status, finished_at, duration, and structured error information.
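One possible shape for such a handler is sketched below. It is not the code in train.py: PipelineError, finalize_manifest, run_with_manifest, and the started_at and duration keys are hypothetical names chosen for illustration, and the real _finalize_manifest signature may differ. The sketch assumes the manifest is a dict that already carries a started_at timestamp.

```python
import json
import time
import traceback
from pathlib import Path


class PipelineError(Exception):
    """Hypothetical: an expected failure such as a missing model, dataset, or API key."""


def finalize_manifest(path, manifest, status, error=None):
    """Illustrative stand-in for _finalize_manifest; the real signature may differ."""
    finished_at = time.time()
    manifest["status"] = status
    manifest["finished_at"] = finished_at
    # Assumes the manifest was initialized with a started_at timestamp.
    manifest["duration"] = finished_at - manifest["started_at"]
    if error is not None:
        manifest["error"] = error
    Path(path).write_text(json.dumps(manifest, indent=2))


def run_with_manifest(path, manifest, pipeline):
    """Run pipeline() and leave the manifest in a terminal state; return an exit code."""
    try:
        pipeline()
    except PipelineError as exc:
        # Expected failure, raised in place of an early sys.exit(1).
        error = {"type": "PipelineError", "message": str(exc)}
    except SystemExit as exc:
        # Safety net for any sys.exit call that has not been converted yet.
        if exc.code in (None, 0):
            finalize_manifest(path, manifest, "completed")
            return 0
        error = {"type": "SystemExit", "message": f"exit code {exc.code!r}"}
    except Exception as exc:
        # Unexpected collection/training/evaluation failure.
        error = {
            "type": type(exc).__name__,
            "message": str(exc),
            "traceback": traceback.format_exc(),
        }
    else:
        finalize_manifest(path, manifest, "completed")
        return 0
    finalize_manifest(path, manifest, "failed", error)
    return 1
```

With a wrapper of this kind, main() could end with sys.exit(run_with_manifest(manifest_path, manifest, pipeline)), so the process exit code and the manifest status are decided in one place.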
Acceptance criteria
- Missing model, missing test dataset, missing source dataset, missing API key, preflight failure, and unexpected exceptions all produce finalized manifests with status: "failed" and an error object.
- Successful dataset-only, eval-only, train, and skip-eval runs still finalize as completed.
- Tests or smoke commands verify manifest status for at least one early-exit and one exception path.
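The criteria above could be checked by a small helper that a test or smoke command runs against the manifest file after each scenario. This is a sketch under assumptions: check_finalized is a hypothetical name, and the error key is assumed to hold the error object; adjust both to the actual manifest schema.

```python
import json
from pathlib import Path


def check_finalized(path, expected_status):
    """Return a list of problems found in the manifest at path (empty if none)."""
    manifest = json.loads(Path(path).read_text())
    problems = []
    status = manifest.get("status")
    if status != expected_status:
        problems.append(f"status is {status!r}, expected {expected_status!r}")
    if "finished_at" not in manifest:
        problems.append("finished_at is missing")
    # Assumed schema: a failed run stores its error object under the "error" key.
    if expected_status == "failed" and not isinstance(manifest.get("error"), dict):
        problems.append("failed manifest has no error object")
    return problems
```

A test for an early-exit path would trigger the failure (for example, an eval-only run with no test dataset) and then assert that check_finalized(manifest_path, "failed") returns an empty list.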