Fuse flash_mla_with_kvcache paged fp8 gather/dequant with sparse MLA attention (#53)

## Finding

`flash_mla_with_kvcache` currently gathers/dequantizes selected fp8_ds_mla cache rows with torch indexing, materializes a dense `[T, total_topk, D]` KV tensor, concatenates primary and optional extra cache selections, flattens it, builds an index tensor, and only then calls the Triton sparse MLA compute kernel:

- Gather/dequant via advanced indexing: [`src/xkernels/ops/attention/sparse_mla_decode.py#L26-L61`](https://github.com/ResearchComputer/xkernels/blob/00aac7e4a249e7af0a40da9e7981065b823fb4f3/src/xkernels/ops/attention/sparse_mla_decode.py#L26-L61)
- `torch.cat`, `reshape`, `to(q2.dtype)`, `torch.arange`, and `torch.where` staging before compute: [`src/xkernels/ops/attention/sparse_mla_decode.py#L110-L150`](https://github.com/ResearchComputer/xkernels/blob/00aac7e4a249e7af0a40da9e7981065b823fb4f3/src/xkernels/ops/attention/sparse_mla_decode.py#L110-L150)

The downstream Triton sparse MLA kernel already streams selected KV by `indices`: [`src/xkernels/ops/attention/triton/sparse_mla_kernel.py#L71-L93`](https://github.com/ResearchComputer/xkernels/blob/00aac7e4a249e7af0a40da9e7981065b823fb4f3/src/xkernels/ops/attention/triton/sparse_mla_kernel.py#L71-L93).
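
To make the two-stage shape of the current path concrete, here is a minimal pure-Python sketch in which nested lists stand in for tensors. It is not the code in `sparse_mla_decode.py`: the helper names `gather_dequant` and `attend` are hypothetical, the one-scale-per-row layout is a simplifying assumption rather than the real fp8_ds_mla format, and the extra-cache concatenation and invalid-index masking are left out.

```python
import math

def gather_dequant(value_cache, scale_cache, block_table, indices, page_size):
    """Stage 1: materialize the dense [total_topk, D] temporary."""
    dense = []
    for idx in indices:
        # Paged lookup: logical position -> physical row in the cache.
        page = block_table[idx // page_size]
        slot = page * page_size + idx % page_size
        # Assumed layout: one dequant scale per cache row.
        dense.append([x * scale_cache[slot] for x in value_cache[slot]])
    return dense

def attend(q, dense_kv, sm_scale):
    """Stage 2: softmax attention over rows that are already materialized."""
    logits = [sm_scale * sum(a * b for a, b in zip(q, row)) for row in dense_kv]
    mx = max(logits)
    weights = [math.exp(s - mx) for s in logits]
    z = sum(weights)
    # In this sketch the same latent row serves as both key and value.
    return [sum(w * row[d] for w, row in zip(weights, dense_kv)) / z
            for d in range(len(q))]

# Toy cache: 2 pages of 2 rows each, D=2.
value_cache = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
scale_cache = [0.5, 0.5, 1.0, 0.25]
block_table = [1, 0]  # logical page 0 lives in physical page 1, and vice versa
dense = gather_dequant(value_cache, scale_cache, block_table, [0, 3], page_size=2)
out = attend([0.0, 0.0], dense, sm_scale=1.0)
```

The `dense` list is the temporary that the proposal aims to eliminate: it exists only to be read once by the second stage.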

## Why this should improve performance

The decode wrapper pays for several extra GPU launches and materializes a top-k-sized dequantized KV buffer before attention starts. For DeepSeek-V4 sparse decode, `topk` is large (the validation shapes below use 512 and 1024) and `D=512`, so writing and then re-reading this temporary can account for a large fraction of the work. Fusing paged-cache address resolution and fp8 dequant into the attention kernel should reduce memory traffic and avoid the staging allocations.
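
As a rough size check, the staged `[T, total_topk, D]` tensor can be estimated with simple arithmetic. The sketch below assumes a 2-byte element type (for example bf16 or fp16) for `q2.dtype` and a single decode token; both are assumptions, and `staged_kv_bytes` is a hypothetical helper, not part of xkernels.

```python
def staged_kv_bytes(num_tokens: int, total_topk: int, head_dim: int,
                    bytes_per_elem: int = 2) -> int:
    """Bytes occupied by the dense [T, total_topk, D] staging tensor."""
    return num_tokens * total_topk * head_dim * bytes_per_elem

for topk in (512, 1024):
    size = staged_kv_bytes(num_tokens=1, total_topk=topk, head_dim=512)
    print(f"topk={topk}: {size / 2**20:.2f} MiB per decode token")
```

Under these assumptions the temporary is 0.5 MiB at `topk=512` and 1 MiB at `topk=1024` per token, and it is written once and then read back once, on top of the original fp8 reads.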

## Suggested implementation

- Add a Triton decode kernel variant that accepts `value_cache`, `scale_cache`, `block_table`, primary indices, optional extra cache/indices, and validity lengths directly.
- Inside the attention streaming loop, resolve logical or physical cache positions, dequant fp8_ds_mla rows on the fly, and feed them to the online softmax/value accumulator.
- Preserve the existing materialized `sparse_mla_attention(q, kv, indices)` API for tests and non-paged callers.
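
The streaming loop described above can be sketched as a pure-Python reference for a single query and head. This is an illustration of the control flow, not a Triton kernel: `resolve_slot` and `fused_sparse_decode` are hypothetical names, the one-scale-per-row cache layout is a simplifying assumption rather than the real fp8_ds_mla format, and treating a negative index as "invalid" is an assumed convention standing in for the validity lengths.

```python
import math

def resolve_slot(block_table, logical_idx, page_size):
    """Map a logical token position to a physical row in the paged cache."""
    page = block_table[logical_idx // page_size]
    return page * page_size + logical_idx % page_size

def fused_sparse_decode(q, value_cache, scale_cache, block_table,
                        indices, page_size, sm_scale):
    """Stream selected rows, dequantize each one, and fold it into an online softmax."""
    m = -math.inf         # running maximum of the logits seen so far
    l = 0.0               # running softmax denominator
    acc = [0.0] * len(q)  # running weighted sum of value rows
    for idx in indices:
        if idx < 0:       # assumed convention: negative index marks an invalid slot
            continue
        slot = resolve_slot(block_table, idx, page_size)
        # Dequantize on the fly; no dense [topk, D] buffer is ever built.
        row = [x * scale_cache[slot] for x in value_cache[slot]]
        s = sm_scale * sum(a * b for a, b in zip(q, row))
        m_new = max(m, s)
        alpha = math.exp(m - m_new)  # rescales old state; exp(-inf) == 0.0 on the first row
        p = math.exp(s - m_new)
        l = l * alpha + p
        acc = [a * alpha + p * v for a, v in zip(acc, row)]
        m = m_new
    return [a / l for a in acc] if l > 0 else acc

# Toy cache: 2 pages of 2 rows each, D=2; the middle index is masked out.
value_cache = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
scale_cache = [0.5, 0.5, 1.0, 0.25]
out = fused_sparse_decode([1.0, 0.0], value_cache, scale_cache,
                          block_table=[1, 0], indices=[0, -1, 3],
                          page_size=2, sm_scale=1.0)
```

Only one dequantized row is live at a time, which is the property a fused kernel would exploit at tile granularity.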

## Validation

- Compare fused decode against the current staged decode for primary-only and primary+extra-cache cases.
- Benchmark peak temporary memory, launch count, and latency for `B=1`, `H=128`, `D=512`, `topk=512/1024`.
- Include graph-capture validation for fixed decode shapes.
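
The case matrix above could be enumerated along these lines. `validation_cases` and `max_abs_diff` are hypothetical helpers, and the dictionary keys are illustrative rather than an existing test fixture.

```python
from itertools import product

def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two equal-length sequences."""
    return max(abs(x - y) for x, y in zip(a, b))

def validation_cases():
    """Yield the shapes listed above, with and without the extra cache."""
    for topk, use_extra in product((512, 1024), (False, True)):
        yield {"B": 1, "H": 128, "D": 512, "topk": topk, "extra_cache": use_extra}

cases = list(validation_cases())
```

Each case would then run both the staged and the fused decode and compare the outputs with `max_abs_diff` against a tolerance chosen for the output dtype.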
