Finding
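The sparse MLA config module documents a measured single-token decode config that is 1.13-1.24x faster at Tq=1, but it is opt-in through XKERNELS_SPARSE_MLA_CONFIG:
src/xkernels/ops/attention/triton/sparse_mla_config.py#L52-L64
DECODE_SPARSE_MLA_CONFIG:
src/xkernels/ops/attention/triton/sparse_mla_config.py#L74-L85
The caller has T before resolving config, but currently calls resolve_sparse_mla_config() without passing it:
src/xkernels/ops/attention/triton/sparse_mla_kernel.py#L129-L160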
Why this should improve performance
Single-token decode is a common serving hot path. The measured faster config is not hidden or speculative; it is already checked in, but callers have to know to export an environment override. Selecting it automatically when T == 1 would capture the win without regressing the documented multi-token case.
Suggested implementation
- Change config resolution to accept T or a decode_mode flag.
- If no env override is set and T == 1, return DECODE_SPARSE_MLA_CONFIG; otherwise return DEFAULT_SPARSE_MLA_CONFIG.
- Keep the env override as the highest-priority path for A/B testing and operator control.
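The resolution order above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the SparseMLAConfig dataclass, the T and env parameters, and the accepted env values "default" and "decode" are all assumptions made for the example. Only the env variable name and the two config constant names come from the finding.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

ENV_VAR = "XKERNELS_SPARSE_MLA_CONFIG"  # name taken from the finding


@dataclass(frozen=True)
class SparseMLAConfig:
    # Hypothetical stand-in: the real config fields are not shown in the finding.
    name: str


DEFAULT_SPARSE_MLA_CONFIG = SparseMLAConfig(name="default")
DECODE_SPARSE_MLA_CONFIG = SparseMLAConfig(name="decode")

# Assumed env values; the real module may accept different strings.
_BY_NAME = {
    "default": DEFAULT_SPARSE_MLA_CONFIG,
    "decode": DECODE_SPARSE_MLA_CONFIG,
}


def resolve_sparse_mla_config(
    T: Optional[int] = None,
    env: Optional[Mapping[str, str]] = None,
) -> SparseMLAConfig:
    """Pick a config: env override first, then T == 1, then the default."""
    env = os.environ if env is None else env
    override = env.get(ENV_VAR, "").strip().lower()
    if override:
        # The env override stays the highest-priority path.
        if override not in _BY_NAME:
            raise ValueError(f"unknown {ENV_VAR} value: {override!r}")
        return _BY_NAME[override]
    if T == 1:
        # Single-token decode picks the measured faster config automatically.
        return DECODE_SPARSE_MLA_CONFIG
    return DEFAULT_SPARSE_MLA_CONFIG
```

With a resolver of this shape, a caller that already has T would pass it through as resolve_sparse_mla_config(T=T). Leaving T optional keeps existing call sites on the default config until they are updated.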
Validation
- Unit-test config resolution for T=1, T>1, and env override cases.
- Re-run sparse MLA benchmarks for T=1, T=8, and T=64 to confirm the auto branch keeps the documented win and avoids multi-token regression.
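The unit-test bullet could look something like the sketch below. The resolver and the two string constants here are hypothetical stand-ins so the example runs on its own; in the repository the tests would import the real names from sparse_mla_config.py. The env values "default" and "decode" are likewise assumed.

```python
import os
import unittest
from unittest import mock

ENV_VAR = "XKERNELS_SPARSE_MLA_CONFIG"  # name taken from the finding

# Hypothetical stand-ins for the real config objects.
DEFAULT_SPARSE_MLA_CONFIG = "default"
DECODE_SPARSE_MLA_CONFIG = "decode"


def resolve_sparse_mla_config(T=None):
    """Stand-in resolver: env override, then T == 1, then default."""
    override = os.environ.get(ENV_VAR)
    if override == "decode":
        return DECODE_SPARSE_MLA_CONFIG
    if override == "default":
        return DEFAULT_SPARSE_MLA_CONFIG
    return DECODE_SPARSE_MLA_CONFIG if T == 1 else DEFAULT_SPARSE_MLA_CONFIG


class ResolveConfigTest(unittest.TestCase):
    def test_single_token_selects_decode(self):
        # clear=True removes any override for the duration of the test.
        with mock.patch.dict(os.environ, {}, clear=True):
            self.assertEqual(resolve_sparse_mla_config(T=1), DECODE_SPARSE_MLA_CONFIG)

    def test_multi_token_keeps_default(self):
        with mock.patch.dict(os.environ, {}, clear=True):
            for T in (2, 8, 64):
                with self.subTest(T=T):
                    self.assertEqual(resolve_sparse_mla_config(T=T), DEFAULT_SPARSE_MLA_CONFIG)

    def test_env_override_wins(self):
        # The override must beat the T-based auto selection in both directions.
        with mock.patch.dict(os.environ, {ENV_VAR: "default"}, clear=True):
            self.assertEqual(resolve_sparse_mla_config(T=1), DEFAULT_SPARSE_MLA_CONFIG)
        with mock.patch.dict(os.environ, {ENV_VAR: "decode"}, clear=True):
            self.assertEqual(resolve_sparse_mla_config(T=64), DECODE_SPARSE_MLA_CONFIG)
```

Saved as a test module, this runs with python -m unittest. Patching os.environ through mock.patch.dict restores the original environment after each block, so the tests do not leak an override into other test cases.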