Evidence
train_common.sh:21-49 loads the tokenizer and tokenizes the full dataset to auto-detect MAX_SEQ_LEN on every wrapper launch.
src/model_setup.py:303-327 implements that full JSONL token-count scan.
src/model_setup.py:411-422 loads raw JSONL rows into a Python list.
src/model_setup.py:502-561 tokenizes target/header/TAC/footer inside __getitem__, so examples are re-tokenized every epoch and every DataLoader access.
src/model_setup.py:818-820 starts 4 DataLoader workers, multiplying tokenizer CPU work and memory pressure per rank.
Impact
For a large TAC/Solidity corpus, CPU tokenization can dominate step time and startup time. The same rows are tokenized once for sequence-length detection and again repeatedly during training, which wastes compute, increases cost, and makes GPU utilization unreliable.
Recommended fix
Pre-tokenize with a cached preprocessing step (for example Hugging Face datasets.map/Arrow, or a repo-local cache) keyed by dataset fingerprint, tokenizer/model revision, prompt template, include_compiler_metadata, and max sequence length. Persist sequence-length statistics in the same cache. Keep augmentation deterministic or bypass the cache only for fields that truly vary.
Acceptance criteria
- Re-running training with the same dataset/config reuses cached tokenized examples instead of re-tokenizing all rows.
- Cache invalidates when tokenizer/model revision, max sequence length, prompt template, or compiler-metadata flag changes.
- Training logs report preprocessing/cache-hit time and examples/sec or tokens/sec.
Evidence
train_common.sh:21-49loads the tokenizer and tokenizes the full dataset to auto-detectMAX_SEQ_LENon every wrapper launch.src/model_setup.py:303-327implements that full JSONL token-count scan.src/model_setup.py:411-422loads raw JSONL rows into a Python list.src/model_setup.py:502-561tokenizes target/header/TAC/footer inside__getitem__, so examples are re-tokenized every epoch and every DataLoader access.src/model_setup.py:818-820starts 4 DataLoader workers, multiplying tokenizer CPU work and memory pressure per rank.Impact
For a large TAC/Solidity corpus, CPU tokenization can dominate step time and startup time. The same rows are tokenized once for sequence-length detection and again repeatedly during training, which wastes compute, increases cost, and makes GPU utilization unreliable.
Recommended fix
Pre-tokenize with a cached preprocessing step (for example Hugging Face
datasets.map/Arrow, or a repo-local cache) keyed by dataset fingerprint, tokenizer/model revision, prompt template,include_compiler_metadata, and max sequence length. Persist sequence-length statistics in the same cache. Keep augmentation deterministic or bypass the cache only for fields that truly vary.Acceptance criteria