[Training] Batch evaluation generation during training runs #58

Description

@agorevski

Evidence

  • train.py:470-481 evaluates by calling decompile_tac_to_solidity once per test row.
  • src/training_pipeline.py:880-890 has the same one-example-at-a-time evaluation loop.
  • src/model_setup.py:1316-1378 already provides SmartContractDecompiler.decompile_batch, but the training/evaluation pipeline does not use it.

Impact

Post-training evaluation can account for a large share of total run time and cost. Generating one prompt at a time underutilizes the GPU and adds per-example overhead. It also makes multi-GPU evaluation less efficient than it could be: ranks already shard the test set, but each rank still generates one example at a time.

Recommended fix

Add an --eval-batch-size option and evaluate test rows in chunks with decompile_batch. On an out-of-memory (OOM) error, fall back to smaller batches. Preserve the current deterministic greedy generation behavior.
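
A minimal sketch of the chunking and fallback logic, not the repository's implementation. It assumes decompile_batch takes a list of inputs and returns one output per input in the same order, which this issue does not confirm. The names `build_eval_parser` and `generate_in_batches` are hypothetical, as is the default of 1 for --eval-batch-size. The exception type is a parameter so the sketch runs without a GPU; with PyTorch one would likely pass `torch.cuda.OutOfMemoryError`.

```python
import argparse
from typing import Callable, List, Sequence, Tuple, Type


def build_eval_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing the proposed option (default 1 is an assumption)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--eval-batch-size", type=int, default=1)
    return parser


def generate_in_batches(
    prompts: Sequence[str],
    batch_fn: Callable[[List[str]], List[str]],
    batch_size: int,
    oom_errors: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> Tuple[List[str], List[int]]:
    """Call batch_fn on ordered chunks of prompts, halving the chunk size on OOM.

    Returns the outputs in input order plus the list of batch sizes
    that were fallen back to, so the caller can log fallback events.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")

    outputs: List[str] = []
    fallbacks: List[int] = []
    start = 0
    size = batch_size

    while start < len(prompts):
        chunk = list(prompts[start:start + size])
        try:
            result = batch_fn(chunk)
        except oom_errors:
            if size == 1:
                raise  # a single example does not fit; nothing smaller to try
            # With PyTorch, freeing cached memory here (torch.cuda.empty_cache())
            # before retrying is a common step; omitted to keep the sketch GPU-free.
            size = max(1, size // 2)
            fallbacks.append(size)
            continue  # retry the same position with the smaller size

        if len(result) != len(chunk):
            raise RuntimeError("batch_fn returned a different number of outputs")
        outputs.extend(result)
        start += len(chunk)

    return outputs, fallbacks
```

Because chunks are consumed strictly in order and the reduced size is kept for the rest of the run, outputs stay aligned with the original examples, and a batch size of 1 reduces to the current one-at-a-time loop.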

Acceptance criteria

  • Both train.py eval and SmartContractTrainingPipeline.evaluate_model support configurable batched generation.
  • Results stay aligned with original examples and metrics remain unchanged for batch size 1.
  • Evaluation logs examples/sec, tokens/sec if available, batch size, and any OOM fallback events.
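
The logging criterion could be met with a small record like the one below. This is an illustrative sketch: `EvalThroughput` and its field names are hypothetical, and the token count is optional because the issue only asks for tokens/sec "if available".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalThroughput:
    """Hypothetical container for the evaluation stats named in the criteria."""
    examples: int
    seconds: float
    batch_size: int
    tokens: Optional[int] = None   # None when token counts are not available
    oom_fallbacks: int = 0         # number of OOM fallback events observed

    @property
    def examples_per_sec(self) -> float:
        return self.examples / self.seconds if self.seconds > 0 else 0.0

    @property
    def tokens_per_sec(self) -> Optional[float]:
        if self.tokens is None or self.seconds <= 0:
            return None
        return self.tokens / self.seconds

    def summary(self) -> str:
        """Format a single log line with all four required fields."""
        tps = self.tokens_per_sec
        tps_text = f"{tps:.1f}" if tps is not None else "n/a"
        return (
            f"eval throughput: examples/sec={self.examples_per_sec:.2f} "
            f"tokens/sec={tps_text} batch_size={self.batch_size} "
            f"oom_fallbacks={self.oom_fallbacks}"
        )
```

The caller would time the evaluation loop (for example with `time.perf_counter()`), fill in the record, and pass `summary()` to the existing logger.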
