Add reusable workspaces for hot decode kernels and graph capture #52

Description

@xzyaoi

Finding

Several hot-path wrappers allocate and zero temporary tensors on every call.

Why this should improve performance

In decode serving, many of these kernels run once per layer per token. Per-call allocation and zeroing overhead can dominate small-batch latency, and per-call allocation also complicates HIP/CUDA graph capture. The code already has some fixed-shape modes (truncate=False in MoE align), so the next step is to let callers reuse the fixed-size buffers those modes imply.

Suggested implementation

  • Add optional workspace= / out= arguments for the hot public APIs, or small workspace dataclasses per op family.
  • Size workspaces from existing deterministic bounds (max_pad, max_blocks, M, top_k, N, hidden size).
  • Keep allocating by default for ergonomic use, but let serving stacks preallocate once per max decode shape.
  • Avoid zeroing buffers when the kernel overwrites all live elements; keep explicit zeroing only for atomic-add combine outputs or EP partial outputs that need zero for skipped experts.
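
The points above could look roughly like the sketch below. Everything in it is an assumption made for illustration, not the repository's API: `AlignWorkspace`, `align_block_size`, and `max_padded_tokens` are hypothetical names, NumPy arrays stand in for device tensors, and the bound `M * top_k + num_experts * (block - 1)` is the usual worst case when each expert's rows are padded up to a block multiple, which may differ from how `max_pad` is computed here. Unlike a truncate=False mode, the sketch returns views of the live prefix rather than full fixed-shape tensors.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


def max_padded_tokens(m: int, top_k: int, num_experts: int, block: int) -> int:
    """Worst-case padded length: every expert wastes at most block - 1 slots."""
    return m * top_k + num_experts * (block - 1)


@dataclass
class AlignWorkspace:
    """Hypothetical reusable buffers for one MoE-align style op."""
    sorted_ids: np.ndarray  # shape (max_pad,)
    expert_ids: np.ndarray  # shape (max_blocks,)

    @classmethod
    def allocate(cls, max_m: int, top_k: int, num_experts: int, block: int):
        max_pad = max_padded_tokens(max_m, top_k, num_experts, block)
        max_blocks = -(-max_pad // block)  # ceiling division
        # np.empty: no zeroing, the op below writes every element it exposes.
        return cls(np.empty(max_pad, dtype=np.int32),
                   np.empty(max_blocks, dtype=np.int32))


def align_block_size(topk_ids: np.ndarray, num_experts: int, block: int,
                     workspace: Optional[AlignWorkspace] = None):
    """Group token slots by expert, padding each group to a multiple of block."""
    m, top_k = topk_ids.shape
    if workspace is None:
        # Default path keeps the ergonomic allocate-per-call behaviour.
        workspace = AlignWorkspace.allocate(m, top_k, num_experts, block)

    flat = topk_ids.reshape(-1)
    numel = flat.size
    counts = np.bincount(flat, minlength=num_experts)
    padded = -(-counts // block) * block  # per-expert count rounded up
    n_pad = int(padded.sum())
    if n_pad > workspace.sorted_ids.size or n_pad // block > workspace.expert_ids.size:
        raise ValueError("workspace too small for this shape")

    sorted_ids = workspace.sorted_ids[:n_pad]          # view, no copy
    expert_ids = workspace.expert_ids[:n_pad // block]  # view, no copy

    # Padding slots get the sentinel `numel`; this touches every live element,
    # so stale data from an earlier, larger call cannot leak through.
    sorted_ids.fill(numel)
    order = np.argsort(flat, kind="stable")
    experts_sorted = flat[order]
    dst_start = np.cumsum(padded) - padded
    src_start = np.cumsum(counts) - counts
    dest = dst_start[experts_sorted] + (np.arange(numel) - src_start[experts_sorted])
    sorted_ids[dest] = order
    expert_ids[:] = np.repeat(np.arange(num_experts), padded // block)
    return sorted_ids, expert_ids, n_pad
```

A serving stack would call `AlignWorkspace.allocate(...)` once for its largest decode shape and pass the same object on every step; callers that omit `workspace=` see no behaviour change.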

Validation

  • Add tests that pass preallocated buffers and compare results with allocation-owning calls.
  • Benchmark small decode buckets with allocation-owning eager, preallocated eager, and graph-captured execution.
  • Include a stress test that reuses a larger workspace for smaller M buckets without stale-data leakage.
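
As a rough illustration of the stale-data stress test, the sketch below uses a stand-in `combine` op (a hypothetical name, with NumPy's `np.add.at` playing the role of an atomic-add kernel) and a hypothetical `check_reuse` helper. It poisons one large workspace with NaN, runs several smaller M buckets through it, and compares against the allocation-owning path. The shapes and bucket sizes are arbitrary choices for the example.

```python
import numpy as np


def combine(expert_out, token_idx, weights, m, out=None):
    """Scatter-add weighted expert rows back to their tokens.

    Stand-in for an atomic-add combine kernel: the output is accumulated
    into, so its live rows must start at zero.
    """
    hidden = expert_out.shape[1]
    if out is None:
        out = np.empty((m, hidden), dtype=expert_out.dtype)
    elif out.shape[0] < m or out.shape[1] != hidden:
        raise ValueError("workspace too small for this shape")
    live = out[:m]   # view of the rows this call owns
    live.fill(0)     # zero only the live rows, not the whole workspace
    np.add.at(live, token_idx, expert_out * weights[:, None])
    return live


def check_reuse(max_m=16, top_k=2, hidden=8, seed=0):
    """Reuse one max-size workspace across shrinking and growing M buckets."""
    rng = np.random.default_rng(seed)
    workspace = np.empty((max_m, hidden), dtype=np.float32)
    for m in (16, 4, 1, 8):
        workspace.fill(np.nan)  # poison: any leaked stale element shows up as NaN
        token_idx = np.repeat(np.arange(m), top_k)
        expert_out = rng.standard_normal((m * top_k, hidden)).astype(np.float32)
        weights = rng.random(m * top_k).astype(np.float32)

        expected = combine(expert_out, token_idx, weights, m)             # owns its output
        got = combine(expert_out, token_idx, weights, m, out=workspace)  # reuses workspace

        assert np.shares_memory(got, workspace)
        assert not np.isnan(got).any()
        np.testing.assert_array_equal(got, expected)
    return True
```

The same loop structure could wrap the real kernels, with the poison value chosen per dtype (for integer buffers, a sentinel that no valid index can take).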
