[Pipeline Review] Restrict function selector detection to real dispatcher dataflow #52

Description

@agorevski

Problem

Function identification treats any PUSH4 ... EQ ... PUSH<n> ... JUMPI sequence anywhere in bytecode as a public function dispatcher entry, even when the comparison is unrelated to calldata dispatch.

Evidence

  • src/bytecode_analyzer.py:673-692 scans every instruction for PUSH4 and creates a Function when _find_dispatch_target returns a target.
  • src/bytecode_analyzer.py:883-894 only checks for a nearby EQ, PUSH, and JUMPI; it does not verify that the compared value comes from CALLDATALOAD/selector extraction or that the scan is still in the dispatcher region.
  • Reproduction with no calldata access at all:
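# BytecodeAnalyzer is the analyzer class from src/bytecode_analyzer.py.
# Disassembly: PUSH4 0xdeadbeef; PUSH4 0xdeadbeef; EQ; PUSH1 0x0f; JUMPI; STOP; JUMPDEST; STOP
bc = '0x63deadbeef63deadbeef14600f57005b00'
an = BytecodeAnalyzer(bc)
an.analyze_control_flow()
print([(f.name, f.selector) for f in an.identify_functions().values()])
# Output: [('function_0xdeadbeef', '0xdeadbeef')]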

This bytecode only compares two constants and conditionally jumps; it does not implement a function dispatcher.
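
To make the claim easy to check without the project code, here is a minimal standalone disassembler sketch. It is not part of the repository; the `disassemble` helper and its opcode table are hypothetical and name only the opcodes that occur in the reproduction.

```python
# Minimal EVM disassembler sketch (hypothetical helper, not project code).
# Only the non-PUSH opcodes that appear in the reproduction are named.
OPCODES = {0x00: "STOP", 0x14: "EQ", 0x57: "JUMPI", 0x5B: "JUMPDEST"}


def disassemble(hex_code):
    """Return a list of (pc, mnemonic, immediate_hex_or_None) tuples."""
    raw = hex_code[2:] if hex_code.startswith("0x") else hex_code
    code = bytes.fromhex(raw)
    out, pc = [], 0
    while pc < len(code):
        op = code[pc]
        if 0x60 <= op <= 0x7F:  # PUSH1..PUSH32 carry 1..32 immediate bytes
            n = op - 0x5F
            out.append((pc, f"PUSH{n}", code[pc + 1 : pc + 1 + n].hex()))
            pc += 1 + n
        else:
            out.append((pc, OPCODES.get(op, f"0x{op:02x}"), None))
            pc += 1
    return out


for pc, name, arg in disassemble("0x63deadbeef63deadbeef14600f57005b00"):
    print(f"{pc:04x} {name} {'0x' + arg if arg else ''}".rstrip())
```

The listing contains two PUSH4 constants, EQ, a PUSH1 0x0f target, JUMPI, and a JUMPDEST at offset 0x0f. There is no CALLDATALOAD or CALLDATASIZE anywhere, so nothing in it depends on calldata.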

Why it matters

False function boundaries create bogus per-function TAC, trigger selector resolution for functions that do not exist, and can contaminate training/evaluation pairs when constants resemble selectors. At inference time, users may see decompiled functions that are actually ordinary internal branches.

Suggested fix

Recognize dispatchers using dataflow from the entry block: a CALLDATASIZE guard, CALLDATALOAD(0), selector extraction (SHR 224 or the equivalent DIV/AND), duplication of the selector value (DUP), and a bounded run of selector comparisons before the fallback path. Stop scanning at the end of the dispatcher region instead of scanning the whole instruction stream.
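
One possible shape for this check is sketched below. It works on raw bytecode and is independent of the existing BytecodeAnalyzer code; every name in it (`decode`, `find_selector_load`, `dispatcher_selectors`) is hypothetical. It is a linear pattern match that approximates the dataflow rather than a real stack simulation. It assumes comparisons have the form DUP1 PUSH4 EQ PUSH<n> JUMPI, and it does not check the CALLDATASIZE guard or handle binary-search dispatchers.

```python
from typing import Dict, List, Optional, Tuple

Instr = Tuple[int, int, bytes]  # (pc, opcode, push immediate or b"")

PUSH0, PUSH1, PUSH4, PUSH32 = 0x5F, 0x60, 0x63, 0x7F
CALLDATALOAD, SHR, DIV, EQ, JUMPI, DUP1 = 0x35, 0x1C, 0x04, 0x14, 0x57, 0x80
# Opcodes that end straight-line execution:
# STOP, JUMP, RETURN, REVERT, INVALID, SELFDESTRUCT
TERMINATORS = {0x00, 0x56, 0xF3, 0xFD, 0xFE, 0xFF}


def decode(code: bytes) -> List[Instr]:
    """Split bytecode into instructions, skipping over PUSH immediates."""
    out, pc = [], 0
    while pc < len(code):
        op = code[pc]
        n = op - PUSH0 if PUSH1 <= op <= PUSH32 else 0
        out.append((pc, op, code[pc + 1 : pc + 1 + n]))
        pc += 1 + n
    return out


def pushes_zero(ins: Instr) -> bool:
    _, op, imm = ins
    return op == PUSH0 or (PUSH1 <= op <= PUSH32 and int.from_bytes(imm, "big") == 0)


def find_selector_load(ins: List[Instr]) -> Optional[int]:
    """Index of the first CALLDATALOAD(0) followed within 4 instructions by SHR or DIV."""
    for i in range(1, len(ins)):
        if ins[i][1] == CALLDATALOAD and pushes_zero(ins[i - 1]):
            if any(op in (SHR, DIV) for _, op, _ in ins[i + 1 : i + 5]):
                return i
    return None


def dispatcher_selectors(hex_code: str) -> Dict[str, int]:
    """Return {selector: jump target} for comparisons inside the dispatcher region only."""
    raw = hex_code[2:] if hex_code.startswith("0x") else hex_code
    ins = decode(bytes.fromhex(raw))
    start = find_selector_load(ins)
    if start is None:
        return {}  # no selector extraction, so no dispatcher
    found: Dict[str, int] = {}
    for i in range(start + 1, len(ins)):
        if ins[i][1] in TERMINATORS:
            break  # fallback path reached: the dispatcher region ends here
        window = ins[i : i + 5]
        if len(window) < 5:
            continue
        dup, push4, eq, target, jumpi = window
        if (dup[1] == DUP1 and push4[1] == PUSH4 and len(push4[2]) == 4
                and eq[1] == EQ and PUSH1 <= target[1] <= PUSH32 and jumpi[1] == JUMPI):
            found["0x" + push4[2].hex()] = int.from_bytes(target[2], "big")
    return found


# The constant-compare reproduction has no CALLDATALOAD, so nothing is reported.
print(dispatcher_selectors("0x63deadbeef63deadbeef14600f57005b00"))
```

Anchoring on the selector load and stopping at the first terminator keeps constants inside function bodies out of the result. A production version would more likely follow the control-flow graph from the entry block than rely on instruction order.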

Validation/tests to add

  • Negative unit test for the constant-compare bytecode above: no public selector function should be identified.
  • Positive tests for common solc dispatcher variants (SHR, legacy DIV, sorted binary-search dispatchers if supported).
  • Integration test that custom errors or magic constants inside function bodies do not become functions.
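
For the positive tests, synthetic fixtures may be easier to maintain than compiled contracts. The sketch below builds simplified dispatcher bytecode in two prologue styles. The `build_dispatcher` helper is hypothetical, and its output only resembles the solc patterns (PUSH1 0xe0 SHR, and the legacy PUSH29 2^224 SWAP1 DIV PUSH4 0xffffffff AND); it is not actual compiler output and omits the CALLDATASIZE guard and free-memory setup.

```python
def build_dispatcher(selectors, variant="shr"):
    """Build simplified dispatcher bytecode (hypothetical test fixture).

    Layout: prologue, one DUP1 PUSH4 EQ PUSH2 JUMPI compare per selector,
    a PUSH1 0 DUP1 REVERT fallback, then one JUMPDEST STOP body per selector.
    """
    load = bytes.fromhex("600035")  # PUSH1 0x00, CALLDATALOAD
    if variant == "shr":
        prologue = load + bytes.fromhex("60e01c")  # PUSH1 0xe0, SHR
    elif variant == "div":
        two_pow_224 = b"\x01" + b"\x00" * 28  # 29-byte immediate for PUSH29
        # PUSH29 2**224, SWAP1, DIV, PUSH4 0xffffffff, AND
        prologue = load + b"\x7c" + two_pow_224 + bytes.fromhex("900463ffffffff16")
    else:
        raise ValueError(f"unknown variant: {variant}")

    compare_len = 1 + 5 + 1 + 3 + 1  # DUP1, PUSH4 sel, EQ, PUSH2 target, JUMPI
    fallback = bytes.fromhex("600080fd")  # PUSH1 0x00, DUP1, REVERT
    bodies_start = len(prologue) + compare_len * len(selectors) + len(fallback)

    code = bytearray(prologue)
    for k, sel in enumerate(selectors):
        target = bodies_start + 2 * k  # each body is JUMPDEST, STOP (2 bytes)
        code += b"\x80\x63" + bytes.fromhex(sel[2:]) + b"\x14"
        code += b"\x61" + target.to_bytes(2, "big") + b"\x57"
    code += fallback
    code += b"\x5b\x00" * len(selectors)
    return "0x" + code.hex()


print(build_dispatcher(["0xa9059cbb", "0x70a08231"], variant="shr"))
print(build_dispatcher(["0xa9059cbb"], variant="div"))
```

Each fixture can be passed to the analyzer with the assertion that exactly the listed selectors are identified, while the constant-compare bytecode from the Evidence section serves as the negative case.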
