[Training] Cache tokenized training examples #57

Description

@agorevski

Evidence

  • train_common.sh:21-49 loads the tokenizer and tokenizes the full dataset to auto-detect MAX_SEQ_LEN on every wrapper launch.
  • src/model_setup.py:303-327 implements that full JSONL token-count scan.
  • src/model_setup.py:411-422 loads raw JSONL rows into a Python list.
  • src/model_setup.py:502-561 tokenizes target/header/TAC/footer inside __getitem__, so every example is re-tokenized on each DataLoader access, and therefore again in every epoch.
  • src/model_setup.py:818-820 starts 4 DataLoader workers, multiplying tokenizer CPU work and memory pressure per rank.

Impact

For a large TAC/Solidity corpus, CPU tokenization can dominate both startup time and step time. The same rows are tokenized once for sequence-length detection and then again on every access during training. This wastes compute, increases cost, and makes GPU utilization unreliable because the GPUs can end up waiting on the input pipeline.
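To make the repeated work concrete, here is a toy sketch that counts tokenizer calls for a dataset that tokenizes inside `__getitem__` versus one that tokenizes once up front. The class names and the whitespace "tokenizer" are hypothetical stand-ins, not the code in src/model_setup.py.

```python
from typing import Callable, List


class CountingTokenizer:
    """Stand-in tokenizer (whitespace split) that counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, text: str) -> List[str]:
        self.calls += 1
        return text.split()


class LazyDataset:
    """Tokenizes on every access, like tokenizing inside __getitem__."""

    def __init__(self, rows: List[str], tokenize: Callable[[str], List[str]]) -> None:
        self.rows = rows
        self.tokenize = tokenize

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, idx: int) -> List[str]:
        return self.tokenize(self.rows[idx])


class PretokenizedDataset:
    """Tokenizes each row exactly once at construction time."""

    def __init__(self, rows: List[str], tokenize: Callable[[str], List[str]]) -> None:
        self.examples = [tokenize(row) for row in rows]

    def __len__(self) -> int:
        return len(self.examples)

    def __getitem__(self, idx: int) -> List[str]:
        return self.examples[idx]


def run_epochs(dataset, epochs: int) -> None:
    """Touch every example once per epoch, as a DataLoader would."""
    for _ in range(epochs):
        for i in range(len(dataset)):
            dataset[i]


def count_calls(dataset_cls, rows: List[str], epochs: int) -> int:
    """Return how many tokenizer calls the given dataset class needs."""
    tokenizer = CountingTokenizer()
    run_epochs(dataset_cls(rows, tokenizer), epochs)
    return tokenizer.calls
```

With N rows and E epochs, the lazy variant makes N × E tokenizer calls while the pre-tokenized variant makes N, before counting the separate sequence-length scan.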

Recommended fix

Pre-tokenize in a cached preprocessing step (for example Hugging Face datasets.map/Arrow, or a repo-local cache) keyed by dataset fingerprint, tokenizer/model revision, prompt template, include_compiler_metadata, and max sequence length. Persist sequence-length statistics in the same cache so that MAX_SEQ_LEN detection does not need its own tokenization pass. If augmentation is applied, keep it deterministic so its output can be cached, or bypass the cache only for the fields that actually vary.

Acceptance criteria

  • Re-running training with the same dataset/config reuses cached tokenized examples instead of re-tokenizing all rows.
  • The cache is invalidated when the dataset contents, tokenizer/model revision, max sequence length, prompt template, or compiler-metadata flag changes.
  • Training logs report preprocessing/cache-hit time and examples/sec or tokens/sec.
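The logging criterion could be met with something like the following sketch. The logger name, the log line format, and the `load_fn` contract (returning an example count, a token count, and a cache-hit flag) are assumptions for illustration, not existing repo code.

```python
import logging
import time
from typing import Callable, Dict, Tuple

logger = logging.getLogger("preprocess")


def throughput(num_examples: int, num_tokens: int, seconds: float) -> Dict[str, float]:
    """Convert counts and elapsed wall time into examples/sec and tokens/sec."""
    seconds = max(seconds, 1e-9)  # guard against a zero-length interval
    return {
        "examples_per_sec": num_examples / seconds,
        "tokens_per_sec": num_tokens / seconds,
    }


def timed_preprocess(load_fn: Callable[[], Tuple[int, int, bool]]) -> Dict[str, float]:
    """Run the preprocessing step and log cache status, time, and throughput.

    load_fn is a hypothetical callable returning
    (num_examples, num_tokens, cache_hit).
    """
    start = time.perf_counter()
    num_examples, num_tokens, cache_hit = load_fn()
    elapsed = time.perf_counter() - start
    rates = throughput(num_examples, num_tokens, elapsed)
    logger.info(
        "preprocess cache_hit=%s time=%.3fs examples/s=%.1f tokens/s=%.1f",
        cache_hit, elapsed, rates["examples_per_sec"], rates["tokens_per_sec"],
    )
    return {"seconds": elapsed, **rates}
```

The same `throughput` helper can be reused inside the training loop to report per-step rates, which makes a CPU-bound input pipeline visible in the logs.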
