[Training] Batch evaluation generation during training runs

## Evidence
- `train.py:470-481` evaluates by calling `decompile_tac_to_solidity` once per test row.
- `src/training_pipeline.py:880-890` has the same one-example-at-a-time evaluation loop.
- `src/model_setup.py:1316-1378` already provides `SmartContractDecompiler.decompile_batch`, but the training/evaluation pipeline does not use it.

## Impact
Post-training evaluation can become a large part of run cost. Launching generation one prompt at a time underutilizes the GPU and adds per-example overhead. Multi-GPU runs are affected too: ranks shard the test set, but each rank still generates one prompt at a time over its shard.

## Recommended fix
Add an `--eval-batch-size` option and evaluate test rows in chunks with `decompile_batch`. Fall back to smaller batches on out-of-memory (OOM) errors, and preserve the current deterministic greedy generation behavior.
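A rough sketch of the chunking and fallback logic, to make the proposal concrete. It is not tied to the real code: `generate_in_batches`, `batch_fn`, and `oom_errors` are hypothetical names, and `batch_fn` stands in for a call such as `decompile_batch`, whose exact signature is not shown in this issue. The sketch assumes the batch call returns one output per input, in input order. It catches `MemoryError` by default so it runs without a GPU; in a PyTorch setup one would likely pass `torch.cuda.OutOfMemoryError` instead.

```python
from typing import Callable, List, Sequence, Tuple, Type


def generate_in_batches(
    prompts: Sequence[str],
    batch_fn: Callable[[List[str]], List[str]],
    batch_size: int,
    oom_errors: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> Tuple[List[str], List[int]]:
    """Run batch_fn over prompts in chunks, halving the chunk size on OOM.

    Returns the outputs in the same order as the prompts, plus the list of
    reduced batch sizes that were tried (one entry per fallback event).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")

    outputs: List[str] = []
    fallbacks: List[int] = []
    start = 0
    size = batch_size

    while start < len(prompts):
        chunk = list(prompts[start:start + size])
        try:
            results = batch_fn(chunk)
        except oom_errors:
            if size == 1:
                # A single example does not fit; nothing smaller to try.
                raise
            size = max(1, size // 2)
            fallbacks.append(size)
            continue  # retry the same position with a smaller chunk

        if len(results) != len(chunk):
            raise RuntimeError(
                f"expected {len(chunk)} outputs, got {len(results)}"
            )
        outputs.extend(results)
        start += len(chunk)

    return outputs, fallbacks


if __name__ == "__main__":
    # Toy stand-in for the model: fails on chunks larger than 2.
    def fake_batch_fn(chunk: List[str]) -> List[str]:
        if len(chunk) > 2:
            raise MemoryError("simulated OOM")
        return [p.upper() for p in chunk]

    outs, events = generate_in_batches(["a", "b", "c"], fake_batch_fn, 4)
    print(outs, events)
```

In this sketch the reduced batch size is kept for the rest of the run rather than reset after each chunk, which avoids hitting the same OOM repeatedly. With `batch_size=1` the loop degenerates to the current one-example-at-a-time behavior, which is what the "metrics remain unchanged for batch size 1" criterion relies on.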

## Acceptance criteria
- Both `train.py` eval and `SmartContractTrainingPipeline.evaluate_model` support configurable batched generation.
- Results stay aligned with original examples and metrics remain unchanged for batch size 1.
- Evaluation logs examples/sec, tokens/sec if available, batch size, and any OOM fallback events.
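The logging criterion could be met with a small summary helper along these lines. This is only a sketch: `summarize_throughput` and its field names are hypothetical, and it assumes the caller measures wall-clock time itself (for example with `time.perf_counter`) and counts generated tokens when the tokenizer makes that available.

```python
import logging
from typing import Dict, Optional, Union

logger = logging.getLogger("eval")

Number = Union[int, float]


def summarize_throughput(
    num_examples: int,
    elapsed_seconds: float,
    batch_size: int,
    num_tokens: Optional[int] = None,
    oom_fallbacks: int = 0,
) -> Dict[str, Optional[Number]]:
    """Build and log the evaluation throughput summary.

    tokens_per_sec is None when no token count is available.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")

    summary: Dict[str, Optional[Number]] = {
        "batch_size": batch_size,
        "examples": num_examples,
        "examples_per_sec": num_examples / elapsed_seconds,
        "tokens_per_sec": (
            num_tokens / elapsed_seconds if num_tokens is not None else None
        ),
        "oom_fallbacks": oom_fallbacks,
    }
    logger.info("eval throughput: %s", summary)
    return summary


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Made-up numbers, only to show the call shape.
    summarize_throughput(
        num_examples=100, elapsed_seconds=20.0, batch_size=8, num_tokens=5000
    )
```

Returning the summary as a dictionary, as well as logging it, would let both `train.py` and `SmartContractTrainingPipeline.evaluate_model` attach the same numbers to their existing result objects.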


[Training] Batch evaluation generation during training runs #58
