[Training] Cache tokenized training examples #57

## Evidence
- `train_common.sh:21-49` loads the tokenizer and tokenizes the full dataset to auto-detect `MAX_SEQ_LEN` on every wrapper launch.
- `src/model_setup.py:303-327` implements that full JSONL token-count scan.
- `src/model_setup.py:411-422` loads raw JSONL rows into a Python list.
- `src/model_setup.py:502-561` tokenizes target/header/TAC/footer inside `__getitem__`, so every example is re-tokenized on each DataLoader access, in every epoch.
- `src/model_setup.py:818-820` starts 4 DataLoader workers, multiplying tokenizer CPU work and memory pressure per rank.
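
To make the cost pattern above concrete, here is a toy sketch (not the repository's code) that counts tokenizer calls for a dataset that tokenizes inside `__getitem__` versus one that tokenizes once up front. `CountingTokenizer`, `LazyDataset`, `PreTokenizedDataset`, and `tokenizer_calls` are hypothetical names used only for illustration, and whitespace splitting stands in for the real tokenizer.

```python
class CountingTokenizer:
    """Hypothetical stand-in for the real tokenizer; counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, text):
        self.calls += 1
        return text.split()  # placeholder for real tokenization


class LazyDataset:
    """Tokenizes inside __getitem__, like the pattern described in the evidence."""

    def __init__(self, rows, tokenizer):
        self.rows = rows
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        return self.tokenizer(self.rows[idx])


class PreTokenizedDataset:
    """Tokenizes every row once at construction and serves stored results."""

    def __init__(self, rows, tokenizer):
        self.examples = [tokenizer(row) for row in rows]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]


def tokenizer_calls(dataset_cls, rows, epochs):
    """Return the number of tokenizer calls needed to read `rows` for `epochs` epochs."""
    tokenizer = CountingTokenizer()
    dataset = dataset_cls(rows, tokenizer)
    for _ in range(epochs):
        for idx in range(len(dataset)):
            dataset[idx]
    return tokenizer.calls
```

With `N` rows and `E` epochs, the lazy variant calls the tokenizer `N * E` times while the pre-tokenized variant calls it `N` times. This toy runs in a single process, so it does not model the separate `MAX_SEQ_LEN` scan or per-worker memory.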

## Impact
For a large TAC/Solidity corpus, CPU tokenization can dominate both startup time and step time. The same rows are tokenized once for sequence-length detection and then again on every access during training, which wastes compute, increases cost, and makes GPU utilization unpredictable.

## Recommended fix
Pre-tokenize in a cached preprocessing step (for example Hugging Face `datasets.map`/Arrow, or a repo-local cache) keyed by dataset fingerprint, tokenizer/model revision, prompt template, `include_compiler_metadata`, and max sequence length. Persist sequence-length statistics in the same cache so that `MAX_SEQ_LEN` detection does not need its own full scan. If augmentation is applied, keep it deterministic so its output can be cached, or bypass the cache only for the fields that actually vary between accesses.
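
A minimal sketch of the repo-local cache option, assuming a JSON file per cache key. The helper names (`file_fingerprint`, `cache_key`, `load_or_build`) and the on-disk layout are hypothetical and not taken from the repository; for a large corpus an Arrow-backed store would likely be a better fit than JSON.

```python
import hashlib
import json
import os
from pathlib import Path


def file_fingerprint(path, chunk_size=1 << 20):
    """SHA-256 of the dataset file contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_key(dataset_path, tokenizer_revision, prompt_template,
              include_compiler_metadata, max_seq_len):
    """Hash every input that changes the tokenized output (hypothetical key layout)."""
    payload = {
        "dataset": file_fingerprint(dataset_path),
        "tokenizer_revision": tokenizer_revision,
        "prompt_template": prompt_template,
        "include_compiler_metadata": include_compiler_metadata,
        "max_seq_len": max_seq_len,
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def load_or_build(cache_dir, key, build_fn):
    """Return (data, cache_hit). `build_fn` runs only on a cache miss.

    `build_fn` is expected to return JSON-serializable data, for example
    {"examples": [...], "max_len": ...}, so length statistics live in the
    same cache entry as the tokenized examples.
    """
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f), True
    data = build_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = path.with_suffix(".tmp")
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    os.replace(tmp_path, path)  # rename so readers never see a partial file
    return data, False
```

Because every input is part of the key, changing any of them produces a new cache entry rather than requiring explicit invalidation logic. Stale entries would still need separate cleanup, which this sketch does not handle.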

## Acceptance criteria
- Re-running training with the same dataset/config reuses cached tokenized examples instead of re-tokenizing all rows.
- The cache is invalidated when the dataset fingerprint, tokenizer/model revision, max sequence length, prompt template, or compiler-metadata flag changes.
- Training logs report preprocessing/cache-hit time and examples/sec or tokens/sec.
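
One way the logging criterion could be met is sketched below, assuming the preprocessing step returns a list of token-id lists plus a cache-hit flag. The logger name, the log format, and the helpers `throughput` and `timed_preprocess` are illustrative assumptions, not existing repository code.

```python
import logging
import time

logger = logging.getLogger("train.preprocess")  # hypothetical logger name


def throughput(num_examples, num_tokens, elapsed_s):
    """Compute examples/sec and tokens/sec, guarding against a zero duration."""
    elapsed = max(elapsed_s, 1e-9)
    return {
        "examples_per_sec": num_examples / elapsed,
        "tokens_per_sec": num_tokens / elapsed,
    }


def timed_preprocess(load_fn):
    """Run `load_fn`, which returns (examples, cache_hit), and log timing stats."""
    start = time.perf_counter()
    examples, cache_hit = load_fn()
    elapsed = time.perf_counter() - start
    num_tokens = sum(len(example) for example in examples)
    stats = throughput(len(examples), num_tokens, elapsed)
    logger.info(
        "preprocess cache_hit=%s elapsed_s=%.3f examples_per_sec=%.1f tokens_per_sec=%.1f",
        cache_hit, elapsed, stats["examples_per_sec"], stats["tokens_per_sec"],
    )
    return examples, {"cache_hit": cache_hit, "elapsed_s": elapsed, **stats}
```

Comparing `elapsed_s` between a cold run and a warm run with the same config gives a direct check on the first criterion: the warm run should report `cache_hit=True` and a much shorter preprocessing time.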

