Problem
Compiler selection parses pragma directives from raw source text, so pragmas inside comments are treated as real constraints. The pipeline then uses the first parsed pragma to choose solc versions.
Evidence
src/local_compiler.py:79-94 runs a raw regex over source_code without stripping // or /* ... */ comments.
src/dataset_pipeline.py:751-756, download_hf_contracts.py:634-638, and scripts/build_lookup_db.py:425-429 all select pragmas[0] as the constraint for compilation.
- Reproduction observed in this repository:
src = '''// pragma solidity ^0.5.0;
/* pragma solidity ^0.6.0; */
pragma solidity ^0.8.20;
contract C {}'''
parse_pragma(src)
# ['^0.5.0', '^0.6.0', '^0.8.20']
Using the first value selects 0.5.17 from the curated compiler list instead of a compatible 0.8.x compiler.
Why it matters
Wrong compiler selection reduces dataset coverage and can silently drop contracts when compilation fails. For model training and TAC lookup generation, that means fewer valid TAC/Solidity pairs and biased coverage toward whichever versions appear in comments.
Suggested fix
Strip Solidity comments before searching for pragma directives, or use a Solidity-aware lexer/parser for pragmas. When multiple real pragmas exist across multi-file sources, compute the compatible intersection or compile each file set with the original verified compiler rather than blindly using the first pragma.
Validation/tests to add
- Unit tests for commented-out line and block pragmas.
- Multi-file fixture where files have compatible but different real pragmas and the selected versions satisfy all of them.
- Regression test in the dataset/lookup preparation path to ensure commented pragmas do not affect
select_compilation_configs or versions_for_pragma.
Problem
Compiler selection parses pragma directives from raw source text, so pragmas inside comments are treated as real constraints. The pipeline then uses the first parsed pragma to choose solc versions.
Evidence
src/local_compiler.py:79-94runs a raw regex oversource_codewithout stripping//or/* ... */comments.src/dataset_pipeline.py:751-756,download_hf_contracts.py:634-638, andscripts/build_lookup_db.py:425-429all selectpragmas[0]as the constraint for compilation.Using the first value selects
0.5.17from the curated compiler list instead of a compatible0.8.xcompiler.Why it matters
Wrong compiler selection reduces dataset coverage and can silently drop contracts when compilation fails. For model training and TAC lookup generation, that means fewer valid TAC/Solidity pairs and biased coverage toward whichever versions appear in comments.
Suggested fix
Strip Solidity comments before searching for pragma directives, or use a Solidity-aware lexer/parser for pragmas. When multiple real pragmas exist across multi-file sources, compute the compatible intersection or compile each file set with the original verified compiler rather than blindly using the first pragma.
Validation/tests to add
select_compilation_configsorversions_for_pragma.