Problem
Function identification treats any PUSH4 ... EQ ... PUSH<n> ... JUMPI sequence anywhere in the bytecode as a public function dispatcher entry, even when the comparison has nothing to do with calldata dispatch.
Evidence
- src/bytecode_analyzer.py:673-692 scans every instruction for PUSH4 and creates a Function when _find_dispatch_target returns a target.
- src/bytecode_analyzer.py:883-894 only checks for a nearby EQ, PUSH, and JUMPI; it does not verify that the compared value comes from CALLDATALOAD/selector extraction or that the scan is still in the dispatcher region.
- Reproduction with no calldata access at all:
bc = '0x63deadbeef63deadbeef14600f57005b00'
an = BytecodeAnalyzer(bc)
an.analyze_control_flow()
print([(f.name, f.selector) for f in an.identify_functions().values()])
# [('function_0xdeadbeef', '0xdeadbeef')]
This bytecode only compares two constants and conditionally jumps; it does not implement a function dispatcher.
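To make the claim easy to check by hand, here is a small standalone disassembly sketch of the reproduction bytecode. It is not part of the analyzer; `disassemble` and the tiny `OPCODES` table are illustrative helpers that cover only the opcodes appearing in this example.

```python
# Illustrative helper, not project code: decodes only what the repro needs.
OPCODES = {0x00: "STOP", 0x14: "EQ", 0x57: "JUMPI", 0x5B: "JUMPDEST"}


def disassemble(hex_code):
    """Return a list of (offset, mnemonic, immediate_hex_or_None)."""
    code = bytes.fromhex(hex_code[2:] if hex_code.startswith("0x") else hex_code)
    out, pc = [], 0
    while pc < len(code):
        op = code[pc]
        if 0x60 <= op <= 0x7F:  # PUSH1..PUSH32 carry n immediate bytes
            n = op - 0x5F
            out.append((pc, f"PUSH{n}", code[pc + 1:pc + 1 + n].hex()))
            pc += 1 + n
        else:
            out.append((pc, OPCODES.get(op, f"UNKNOWN_{op:02x}"), None))
            pc += 1
    return out


listing = disassemble("0x63deadbeef63deadbeef14600f57005b00")
for pc, name, arg in listing:
    print(f"{pc:#04x} {name}" + (f" 0x{arg}" if arg else ""))
# 0x00 PUSH4 0xdeadbeef
# 0x05 PUSH4 0xdeadbeef
# 0x0a EQ
# 0x0b PUSH1 0x0f
# 0x0d JUMPI
# 0x0e STOP
# 0x0f JUMPDEST
# 0x10 STOP
```

No CALLDATALOAD, CALLDATASIZE, or SHR appears anywhere in the listing, so the value compared against 0xdeadbeef cannot be a calldata selector.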
Why it matters
False function boundaries create bogus per-function TAC, trigger selector resolution for functions that do not exist, and can contaminate training/evaluation pairs when constants resemble selectors. At inference time, users may see decompiled "functions" that are actually ordinary internal branches.
Suggested fix
Recognize dispatchers using dataflow from the entry block: CALLDATASIZE guard, CALLDATALOAD(0), selector extraction (SHR 224 or equivalent DIV/AND), duplicated selector value, and bounded selector comparisons before the fallback path. Stop scanning after the dispatcher region instead of scanning the whole instruction stream.
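One possible shape for this check, as a rough sketch rather than a drop-in patch: track symbolic tags on a simulated stack and accept a selector only when EQ compares a constant against a value derived from CALLDATALOAD(0) followed by SHR 224. Everything here (`find_selectors`, the tag names, the second test bytecode) is hypothetical and independent of the existing BytecodeAnalyzer code. It models only the SHR form on a straight-line prefix and stops at the first opcode it does not model, so a real implementation would need to walk the CFG from the entry block, tolerate the usual solc prologue, and add the legacy DIV/AND form.

```python
from typing import Dict, List, Tuple

Value = Tuple[str, int]               # (tag, payload); payload is 0 when unused
UNKNOWN: Value = ("unknown", 0)
CALLDATA0: Value = ("calldata0", 0)   # result of CALLDATALOAD(0)
SELECTOR: Value = ("selector", 0)     # CALLDATALOAD(0) >> 224


def find_selectors(hex_code: str) -> Dict[str, int]:
    """Return {selector: jump target} for calldata-derived comparisons only."""
    code = bytes.fromhex(hex_code[2:] if hex_code.startswith("0x") else hex_code)
    stack: List[Value] = []
    found: Dict[str, int] = {}
    pc = 0

    def pop() -> Value:
        return stack.pop() if stack else UNKNOWN

    while pc < len(code):
        op = code[pc]
        pc += 1
        if 0x5F <= op <= 0x7F:                     # PUSH0..PUSH32
            n = op - 0x5F
            stack.append(("const", int.from_bytes(code[pc:pc + n], "big")))
            pc += n
        elif 0x80 <= op <= 0x8F:                   # DUP1..DUP16
            depth = op - 0x7F
            stack.append(stack[-depth] if len(stack) >= depth else UNKNOWN)
        elif op == 0x36:                           # CALLDATASIZE
            stack.append(UNKNOWN)
        elif op == 0x10:                           # LT (CALLDATASIZE guard)
            pop(), pop()
            stack.append(UNKNOWN)
        elif op == 0x35:                           # CALLDATALOAD(offset)
            stack.append(CALLDATA0 if pop() == ("const", 0) else UNKNOWN)
        elif op == 0x1C:                           # SHR: pops shift, then value
            shift, value = pop(), pop()
            ok = shift == ("const", 224) and value == CALLDATA0
            stack.append(SELECTOR if ok else UNKNOWN)
        elif op == 0x14:                           # EQ
            a, b = pop(), pop()
            pair = {a[0]: a, b[0]: b}
            if set(pair) == {"const", "selector"} and pair["const"][1] <= 0xFFFFFFFF:
                stack.append(("eq_selector", pair["const"][1]))
            else:
                stack.append(UNKNOWN)              # e.g. constant == constant
        elif op == 0x57:                           # JUMPI: pops dest, then cond
            dest, cond = pop(), pop()
            is_jumpdest = (dest[0] == "const" and dest[1] < len(code)
                           and code[dest[1]] == 0x5B)
            if cond[0] == "eq_selector" and is_jumpdest:
                found[f"0x{cond[1]:08x}"] = dest[1]
            # execution falls through to the next comparison
        else:
            break  # unmodeled opcode: treat as the end of the dispatcher region
    return found


REPRO = "0x63deadbeef63deadbeef14600f57005b00"        # constant compare from above
MINIMAL = "0x60003560e01c8063deadbeef14601157005b00"  # hand-assembled SHR dispatcher
print(find_selectors(REPRO))    # {}
print(find_selectors(MINIMAL))  # {'0xdeadbeef': 17}
```

Stopping at the first unmodeled opcode is deliberately conservative: it bounds the scan to the dispatcher prefix, at the cost of missing dispatchers until the remaining prologue opcodes are modeled.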
Validation/tests to add
- Negative unit test for the constant-compare bytecode above: no public selector function should be identified.
- Positive tests for common solc dispatcher variants (SHR, legacy DIV, sorted binary-search dispatchers if supported).
- Integration test that custom errors or magic constants inside function bodies do not become functions.