Add reusable workspaces for hot decode kernels and graph capture

## Finding

Several hot-path wrappers allocate and zero temporary tensors on every call:

- MoE align allocates `sorted_ids`, `expert_ids`, `num_post`, `tokens_cnts`, and `cumsum` every call: [`src/xkernels/ops/moe/triton/align_kernel.py#L189-L193`](https://github.com/ResearchComputer/xkernels/blob/00aac7e4a249e7af0a40da9e7981065b823fb4f3/src/xkernels/ops/moe/triton/align_kernel.py#L189-L193)
- INT4 fused MoE allocates either an `[M, N]` fp32 combine output or an `[M * top_k, N]` scratch buffer: [`src/xkernels/ops/moe/triton/moe_int4_kernel.py#L441-L464`](https://github.com/ResearchComputer/xkernels/blob/00aac7e4a249e7af0a40da9e7981065b823fb4f3/src/xkernels/ops/moe/triton/moe_int4_kernel.py#L441-L464)
- MXFP4 MoE allocates the activation scratch and zeroed combine output: [`src/xkernels/ops/moe/triton/moe_mxfp4_kernel.py#L436-L452`](https://github.com/ResearchComputer/xkernels/blob/00aac7e4a249e7af0a40da9e7981065b823fb4f3/src/xkernels/ops/moe/triton/moe_mxfp4_kernel.py#L436-L452)
- Sparse MLA attention allocates `out`, `lse`, and `maxl` every call: [`src/xkernels/ops/attention/triton/sparse_mla_kernel.py#L134-L136`](https://github.com/ResearchComputer/xkernels/blob/00aac7e4a249e7af0a40da9e7981065b823fb4f3/src/xkernels/ops/attention/triton/sparse_mla_kernel.py#L134-L136)

## Why this should improve performance

In decode serving, many of these kernels run once per layer per token. Per-call allocation and zeroing can dominate small-batch latency and complicate HIP/CUDA graph capture. The code already has some fixed-shape modes (`truncate=False` in MoE align), so the next step is to let callers reuse the fixed-size buffers.

## Suggested implementation

- Add optional `workspace=` / `out=` arguments for the hot public APIs, or small workspace dataclasses per op family.
- Size workspaces from existing deterministic bounds (`max_pad`, `max_blocks`, `M`, `top_k`, `N`, hidden size).
- Keep allocating by default for ergonomic use, but let serving stacks preallocate once per max decode shape.
- Avoid zeroing buffers when the kernel overwrites all live elements; keep explicit zeroing only for atomic-add combine outputs or EP partial outputs that must read as zero for skipped experts.
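
As a rough illustration of sizing a workspace from deterministic bounds, here is a minimal sketch for the MoE align buffers. The class name `MoeAlignWorkspaceShape`, its methods, and the padding formula (each expert wastes at most `block_size - 1` slots) are assumptions for illustration, not the existing xkernels API; the actual `max_pad` / `max_blocks` computation in `align_kernel.py` should be the source of truth.

```python
from dataclasses import dataclass


def cdiv(a: int, b: int) -> int:
    """Ceiling division for non-negative integers."""
    return -(-a // b)


@dataclass(frozen=True)
class MoeAlignWorkspaceShape:
    """Hypothetical element counts for MoE align buffers at one max decode shape."""

    sorted_ids: int
    expert_ids: int
    num_post: int

    @classmethod
    def for_max_shape(cls, max_tokens: int, top_k: int, num_experts: int, block_size: int):
        # Assumed worst case: every expert pads up to block_size - 1 extra slots.
        max_pad = max_tokens * top_k + num_experts * (block_size - 1)
        max_blocks = cdiv(max_pad, block_size)
        return cls(sorted_ids=max_pad, expert_ids=max_blocks, num_post=1)

    def fits(self, tokens: int, top_k: int, num_experts: int, block_size: int) -> bool:
        # A preallocated workspace is reusable for any smaller bucket whose
        # worst-case sizes do not exceed the preallocated ones.
        need = type(self).for_max_shape(tokens, top_k, num_experts, block_size)
        return need.sorted_ids <= self.sorted_ids and need.expert_ids <= self.expert_ids
```

A serving stack could compute this once for its largest decode bucket, allocate the tensors with those element counts, and pass them through a `workspace=` argument; a wrapper would fall back to allocating when `fits(...)` is false or no workspace is given.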

## Validation

- Add tests that pass preallocated buffers and compare results with allocation-owning calls.
- Benchmark small decode buckets with allocation-owning eager, preallocated eager, and graph-captured execution.
- Include a stress test that reuses a larger workspace for smaller `M` buckets without stale-data leakage.
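
The stale-data stress test could follow the pattern sketched below. It uses a toy pure-Python op (`scale_rows`, hypothetical) in place of a real kernel so the structure is visible without a GPU: the workspace is poisoned with NaN, reused across large and small `M` buckets, and each result is compared against the allocation-owning call. A real test would presumably use tensors and the actual public wrappers.

```python
POISON = float("nan")


def scale_rows(x, n, out=None):
    """Toy stand-in for a kernel: out[i, j] = x[i] * (j + 1), row-major.

    Allocates when `out` is None; otherwise writes only the live m * n
    elements of the caller's buffer and returns that live region.
    """
    m = len(x)
    if out is None:
        out = [0.0] * (m * n)
    if len(out) < m * n:
        raise ValueError("workspace too small for this bucket")
    for i, xi in enumerate(x):
        for j in range(n):
            out[i * n + j] = xi * (j + 1)
    return out[: m * n]


def check_reuse(max_m, n, buckets):
    """Reuse one poisoned workspace across buckets; True if no stale data leaks."""
    workspace = [POISON] * (max_m * n)
    for m in buckets:
        x = [float(i + 1) for i in range(m)]
        expected = scale_rows(x, n)               # allocation-owning reference
        got = scale_rows(x, n, out=workspace)     # preallocated path
        # NaN never compares equal, so any leaked poison fails this check.
        if got != expected:
            return False
    return True
```

Running the buckets in a large-then-small order (for example `[8, 2, 5, 1]`) is what exercises the stale-data case, since the tail of the workspace still holds values from the earlier, larger call.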
