[Pipeline Review] Strip comments before parsing Solidity pragmas for compiler selection #53

Description

@agorevski

Problem

Compiler selection parses pragma directives from raw source text, so pragmas inside comments are treated as real constraints. The pipeline then uses the first parsed pragma to choose solc versions.

Evidence

  • src/local_compiler.py:79-94 runs a raw regex over source_code without stripping // or /* ... */ comments.
  • src/dataset_pipeline.py:751-756, download_hf_contracts.py:634-638, and scripts/build_lookup_db.py:425-429 all select pragmas[0] as the constraint for compilation.
  • Reproduction observed in this repository:
src = '''// pragma solidity ^0.5.0;
/* pragma solidity ^0.6.0; */
pragma solidity ^0.8.20;
contract C {}'''
parse_pragma(src)
# ['^0.5.0', '^0.6.0', '^0.8.20']

Using the first value selects 0.5.17 from the curated compiler list instead of a compatible 0.8.x compiler.

Why it matters

Wrong compiler selection reduces dataset coverage and can silently drop contracts when compilation fails. For model training and TAC lookup generation, that means fewer valid TAC/Solidity pairs and biased coverage toward whichever versions appear in comments.

Suggested fix

Strip Solidity comments before searching for pragma directives, or use a Solidity-aware lexer/parser for pragmas. When multiple real pragmas exist across multi-file sources, compute the compatible intersection or compile each file set with the original verified compiler rather than blindly using the first pragma.
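A minimal sketch of the comment-stripping option, assuming a regex pass is acceptable in place of a full Solidity lexer. The names `strip_comments` and `parse_pragmas_ignoring_comments` are hypothetical and are not the repository's existing helpers. String literals are matched first so that `//` or `/*` inside a string is not mistaken for a comment.

```python
import re

# Match string literals first so comment markers inside them are left alone.
_TOKEN = re.compile(
    r'"(?:\\.|[^"\\\n])*"'     # double-quoted string literal
    r"|'(?:\\.|[^'\\\n])*'"    # single-quoted string literal
    r"|//[^\n]*"               # line comment
    r"|/\*.*?\*/",             # block comment (non-greedy)
    re.DOTALL,
)

# Hypothetical pragma pattern; the repository's actual regex may differ.
_PRAGMA = re.compile(r"\bpragma\s+solidity\s+([^;]+);")


def strip_comments(source: str) -> str:
    """Replace // and /* ... */ comments with whitespace, keeping newlines."""
    def _replace(match: re.Match) -> str:
        text = match.group(0)
        if text.startswith("/"):
            # A comment: keep its newlines so line numbers stay stable.
            return " " + "\n" * text.count("\n")
        return text  # A string literal: leave untouched.
    return _TOKEN.sub(_replace, source)


def parse_pragmas_ignoring_comments(source: str) -> list[str]:
    """Return the version constraint of every pragma outside comments."""
    stripped = strip_comments(source)
    return [m.group(1).strip() for m in _PRAGMA.finditer(stripped)]


src = '''// pragma solidity ^0.5.0;
/* pragma solidity ^0.6.0; */
pragma solidity ^0.8.20;
contract C {}'''
print(parse_pragmas_ignoring_comments(src))
# ['^0.8.20']
```

This sketch leaves an unterminated block comment in place and would still match pragma-like text inside a string literal, so a Solidity-aware lexer remains the more robust choice.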

Validation/tests to add

  • Unit tests for commented-out line and block pragmas.
  • Multi-file fixture where files have compatible but different real pragmas and the selected versions satisfy all of them.
  • Regression test in the dataset/lookup preparation path to ensure commented pragmas do not affect select_compilation_configs or versions_for_pragma.
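For the multi-file fixture, the test needs some way to assert that a chosen version satisfies every real pragma. The sketch below is one possible helper, assuming full `major.minor.patch` versions and only the `^`, `>=`, `<=`, `>`, `<`, and `=` comparators (space-separated comparators are ANDed). The names `satisfies` and `versions_satisfying_all` are hypothetical; this does not call or replace `versions_for_pragma`, and it rejects `||`, `~`, and partial versions instead of guessing.

```python
import re

_COMPARATOR = re.compile(r"(\^|>=|<=|>|<|=)?\s*(\d+)\.(\d+)\.(\d+)")


def _caret_upper(bound: tuple) -> tuple:
    """Exclusive upper bound for a caret range (npm-style semver rules)."""
    major, minor, patch = bound
    if major > 0:
        return (major + 1, 0, 0)
    if minor > 0:
        return (0, minor + 1, 0)
    return (0, 0, patch + 1)


def satisfies(version: str, pragma: str) -> bool:
    """Check one 'x.y.z' version against one pragma constraint string."""
    if _COMPARATOR.sub("", pragma).strip():
        # Leftover text means syntax this sketch does not handle.
        raise ValueError(f"unsupported pragma syntax: {pragma!r}")
    v = tuple(int(part) for part in version.split("."))
    comparators = list(_COMPARATOR.finditer(pragma))
    if not comparators:
        raise ValueError(f"no version constraint found in: {pragma!r}")
    for m in comparators:
        op = m.group(1) or "="
        bound = tuple(int(g) for g in m.group(2, 3, 4))
        if op == "^":
            ok = bound <= v < _caret_upper(bound)
        elif op == ">=":
            ok = v >= bound
        elif op == "<=":
            ok = v <= bound
        elif op == ">":
            ok = v > bound
        elif op == "<":
            ok = v < bound
        else:  # "=" or a bare version
            ok = v == bound
        if not ok:
            return False
    return True


def versions_satisfying_all(candidates: list[str], pragmas: list[str]) -> list[str]:
    """Keep only candidate versions compatible with every pragma."""
    return [c for c in candidates if all(satisfies(c, p) for p in pragmas)]


# Two files with different but compatible pragmas.
print(versions_satisfying_all(["0.5.17", "0.8.19", "0.8.24"], ["^0.8.0", ">=0.8.20"]))
# ['0.8.24']
```

An empty result from `versions_satisfying_all` would indicate that the files have no common compiler in the candidate list, which is the case where falling back to the original verified compiler makes sense.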
