Evidence
train.py:470-481 evaluates by calling decompile_tac_to_solidity once per test row.
src/training_pipeline.py:880-890 has the same one-example-at-a-time evaluation loop.
src/model_setup.py:1316-1378 already provides SmartContractDecompiler.decompile_batch, but the training/evaluation pipeline does not use it.
Impact
Post-training evaluation can become a large part of run cost. Launching generation one prompt at a time underutilizes the GPU, adds per-example overhead, and makes multi-GPU evaluation less efficient even though ranks shard the test set.
Recommended fix
Add an --eval-batch-size option and evaluate test rows in chunks with decompile_batch. Include OOM fallback to smaller batches and preserve current deterministic greedy generation behavior.
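The chunking and OOM fallback described above could look roughly like the sketch below. It is an illustration under assumptions, not the repository's code: `evaluate_in_batches`, `batch_fn`, and `oom_exceptions` are hypothetical names, and `batch_fn` stands in for `SmartContractDecompiler.decompile_batch`, whose real signature is not shown here and is assumed to map a list of inputs to a list of outputs in the same order.

```python
from typing import Callable, Dict, List, Sequence, Tuple, Type


def evaluate_in_batches(
    prompts: Sequence[str],
    batch_fn: Callable[[List[str]], List[str]],  # assumed stand-in for decompile_batch
    batch_size: int,
    oom_exceptions: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> Tuple[List[str], List[Dict[str, int]]]:
    """Run batch_fn over prompts in chunks, halving the chunk size on OOM.

    Returns the outputs in input order plus a list of fallback events.
    In a real GPU run, oom_exceptions would likely be
    (torch.cuda.OutOfMemoryError,), and calling torch.cuda.empty_cache()
    before retrying is usually worthwhile.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")

    outputs: List[str] = []
    fallback_events: List[Dict[str, int]] = []
    start = 0
    current = batch_size

    while start < len(prompts):
        chunk = list(prompts[start:start + current])
        try:
            results = batch_fn(chunk)
        except oom_exceptions:
            if current == 1:
                # A single example does not fit; nothing smaller to try.
                raise
            new_size = max(1, current // 2)
            fallback_events.append(
                {"index": start, "from": current, "to": new_size}
            )
            current = new_size
            continue  # retry the same position with a smaller chunk

        if len(results) != len(chunk):
            # Guard the alignment between inputs and outputs.
            raise RuntimeError(
                f"batch_fn returned {len(results)} results for {len(chunk)} inputs"
            )
        outputs.extend(results)
        start += len(chunk)

    return outputs, fallback_events
```

The reduced batch size is kept for the rest of the run rather than reset after each chunk, which avoids repeatedly hitting the same OOM. With `batch_size=1` the loop degenerates to the current one-example-at-a-time behavior, as long as `batch_fn` uses the same greedy generation settings.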
Acceptance criteria
- Both train.py eval and SmartContractTrainingPipeline.evaluate_model support configurable batched generation.
- Results stay aligned with original examples and metrics remain unchanged for batch size 1.
- Evaluation logs examples/sec, tokens/sec if available, batch size, and any OOM fallback events.
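As a rough illustration of the option and the logging criterion, the sketch below wires up an `--eval-batch-size` flag and computes the throughput figures to log. The helper names (`build_parser`, `summarize_eval`), the default of 1, and the log field names are assumptions made for this example; the existing argument parser in train.py may be structured differently.

```python
import argparse
import logging
from typing import Any, Dict, Optional, Sequence

logger = logging.getLogger("eval")


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser showing only the new option."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--eval-batch-size",
        type=int,
        default=1,  # assumed default: keeps today's per-example behavior
        help="Number of test rows per generation call during evaluation.",
    )
    return parser


def summarize_eval(
    num_examples: int,
    elapsed_s: float,
    batch_size: int,
    num_tokens: Optional[int] = None,
    fallback_events: Sequence[Dict[str, int]] = (),
) -> Dict[str, Any]:
    """Compute throughput stats; tokens/sec is None when no token count exists."""
    examples_per_sec = num_examples / elapsed_s if elapsed_s > 0 else None
    tokens_per_sec = (
        num_tokens / elapsed_s
        if num_tokens is not None and elapsed_s > 0
        else None
    )
    summary = {
        "batch_size": batch_size,
        "examples_per_sec": examples_per_sec,
        "tokens_per_sec": tokens_per_sec,
        "oom_fallbacks": len(fallback_events),
    }
    logger.info("eval summary: %s", summary)
    for event in fallback_events:
        # One line per fallback so shrinking batch sizes are visible in logs.
        logger.warning("OOM fallback: %s", event)
    return summary
```

A caller would time the evaluation loop with `time.perf_counter()` and pass the elapsed seconds in; in multi-GPU runs each rank could log its own summary for its shard of the test set.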