Skip to content

[Pipeline Review] Propagate stack state across basic blocks when emitting TAC #50

Description

@agorevski

Problem

TAC generation resets the simulated EVM stack at every basic block, so values left on the stack by a predecessor are lost at the target block.

Evidence

  • src/bytecode_analyzer.py:1516-1523 initializes stack: List[str] = [] inside the loop over self.basic_blocks.items(), then converts each block independently.
  • Reproduction with bytecode 0x60056005565b60020300 (PUSH1 5; PUSH1 5; JUMP; JUMPDEST; PUSH1 2; SUB; STOP) shows the value 5 left on the stack before the jump is not available in the target block:
block_0000:
  temp_1 = 05
  temp_2 = 05
  goto temp_2

block_0005:
  temp_3 = 02
  temp_4 = stack_underflow - stack_underflow
  stop()

Why it matters

EVM stack values commonly flow across basic-block boundaries. Resetting stack state introduces artificial stack_underflow expressions and removes data dependencies, which makes TAC less faithful and lowers the quality of both training pairs and inference prompts.

Suggested fix

Run a CFG-aware stack data-flow pass that computes incoming and outgoing stack states per block. For merges, introduce explicit phi/unknown stack parameters rather than silently underflowing. Then emit TAC using the computed entry stack for each block.

Validation/tests to add

  • A unit test for the reproduction bytecode above that expects the target block to consume the predecessor's carried stack value.
  • Branch/merge fixtures that verify deterministic phi/unknown handling when two predecessors provide different stack values.
  • A corpus metric/test that tracks and fails on unexpected increases in stack_underflow count for compiled Solidity fixtures.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions