Make AMD fp8 blockscale defaults hit the native fnuz MFMA fast path

## Finding

The native fp8 MFMA path is only selected by `path="auto"` when operands use an AMD-native fnuz fp8 dtype:

- Auto path predicate: [`src/xkernels/ops/gemm/triton/entry.py#L50-L57`](https://github.com/ResearchComputer/xkernels/blob/00aac7e4a249e7af0a40da9e7981065b823fb4f3/src/xkernels/ops/gemm/triton/entry.py#L50-L57)
- The quant helpers default to `torch.float8_e4m3fn`, not fnuz: [`src/xkernels/ops/gemm/reference.py#L137-L163`](https://github.com/ResearchComputer/xkernels/blob/00aac7e4a249e7af0a40da9e7981065b823fb4f3/src/xkernels/ops/gemm/reference.py#L137-L163) and [`src/xkernels/ops/gemm/reference.py#L166-L192`](https://github.com/ResearchComputer/xkernels/blob/00aac7e4a249e7af0a40da9e7981065b823fb4f3/src/xkernels/ops/gemm/reference.py#L166-L192)
- The README notes the native MFMA path is the fast one and requires `float8_e4m3fnuz` operands; fn operands stay on the portable fallback.

## Why this should improve performance

A caller using the public defaults can quantize to `float8_e4m3fn` and then call `mm_fp8_blockscale(..., path="auto")`, which silently skips the native fp8 MFMA path documented as 3-9x faster on MI300A. The optimized path exists, but the default API makes it easy to miss on AMD.

## Suggested implementation

- Add a device-aware helper such as `preferred_fp8_dtype()` that returns `float8_e4m3fnuz` on AMD/gfx942 when available and `float8_e4m3fn` elsewhere.
- Consider a new `fp8_dtype="auto"` default for quant helpers, while preserving explicit dtype control for portability and parity tests.
- Add a warning or debug note when `path="auto"` falls back to the portable path because the operands are `float8_e4m3fn` on AMD.
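
The helper and the fallback warning above could be sketched roughly as follows. This is a minimal illustration, not existing xkernels code: `select_fp8_dtype_name`, `warn_if_fn_on_amd`, and `FNUZ_ARCH_PREFIXES` are hypothetical names, the warning text is a placeholder, and the sketch assumes the device arch can be read from `gcnArchName` on ROCm builds of PyTorch. The decision logic is kept free of torch so it can be tested on any machine.

```python
import warnings
from typing import Optional

# Assumption: gfx942 (MI300-series) is the arch whose native fp8 format is fnuz.
FNUZ_ARCH_PREFIXES = ("gfx942",)


def select_fp8_dtype_name(is_rocm: bool, gcn_arch: Optional[str], has_fnuz: bool) -> str:
    """Pure decision logic (hypothetical), kept free of torch for easy testing."""
    if is_rocm and has_fnuz and gcn_arch:
        # ROCm arch strings can carry feature suffixes, e.g. "gfx942:sramecc+:xnack-".
        base_arch = gcn_arch.split(":", 1)[0]
        if base_arch.startswith(FNUZ_ARCH_PREFIXES):
            return "float8_e4m3fnuz"
    return "float8_e4m3fn"


def preferred_fp8_dtype(device_index: int = 0):
    """Hypothetical helper: probe the local torch build and device."""
    import torch  # imported lazily so the selection logic above stays torch-free

    is_rocm = getattr(torch.version, "hip", None) is not None
    has_fnuz = hasattr(torch, "float8_e4m3fnuz")
    gcn_arch = None
    if is_rocm and torch.cuda.is_available():
        props = torch.cuda.get_device_properties(device_index)
        gcn_arch = getattr(props, "gcnArchName", None)
    return getattr(torch, select_fp8_dtype_name(is_rocm, gcn_arch, has_fnuz))


def warn_if_fn_on_amd(operand_dtype_name: str, is_rocm: bool) -> bool:
    """Warn when fn operands would keep path="auto" on the portable path."""
    if is_rocm and operand_dtype_name == "float8_e4m3fn":
        warnings.warn(
            'path="auto" is using the portable fallback: operands are '
            "float8_e4m3fn; quantize to float8_e4m3fnuz to reach the native "
            "fp8 MFMA path.",
            RuntimeWarning,
            stacklevel=2,
        )
        return True
    return False
```

Splitting the torch probe from the selection rule means the rule can be exercised in CI without an AMD device, while `preferred_fp8_dtype()` stays a thin wrapper that only runs where torch is installed.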

## Validation

- Add tests for dtype selection on environments with and without `torch.float8_e4m3fnuz`.
- Benchmark the default quantize + `mm_fp8_blockscale(path="auto")` flow before and after on MI300A.
- Keep explicit `float8_e4m3fn` tests to preserve portable fallback correctness.
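
One way to cover dtype selection both with and without `torch.float8_e4m3fnuz` is to pass a stand-in for the torch module, so the tests do not depend on the installed build. The sketch below assumes a hypothetical `resolve_fp8_dtype` resolver for the proposed `fp8_dtype="auto"` default; the name, its signature, and the string placeholders used as dtypes are illustrative only.

```python
from types import SimpleNamespace


def resolve_fp8_dtype(fp8_dtype, torch_like, on_amd_fnuz_arch: bool):
    """Hypothetical resolver for a proposed fp8_dtype="auto" default."""
    if fp8_dtype != "auto":
        return fp8_dtype  # explicit dtypes pass through untouched
    if on_amd_fnuz_arch and hasattr(torch_like, "float8_e4m3fnuz"):
        return torch_like.float8_e4m3fnuz
    return torch_like.float8_e4m3fn


# Stand-ins for torch builds with and without the fnuz dtype.
TORCH_WITH_FNUZ = SimpleNamespace(float8_e4m3fn="fn", float8_e4m3fnuz="fnuz")
TORCH_WITHOUT_FNUZ = SimpleNamespace(float8_e4m3fn="fn")


def test_auto_picks_fnuz_on_amd_when_available():
    assert resolve_fp8_dtype("auto", TORCH_WITH_FNUZ, True) == "fnuz"


def test_auto_falls_back_to_fn_without_fnuz_dtype():
    assert resolve_fp8_dtype("auto", TORCH_WITHOUT_FNUZ, True) == "fn"


def test_auto_keeps_fn_off_amd():
    assert resolve_fp8_dtype("auto", TORCH_WITH_FNUZ, False) == "fn"


def test_explicit_dtype_is_preserved():
    # Explicit fn must survive on AMD so portable-fallback tests stay valid.
    assert resolve_fp8_dtype("fn", TORCH_WITH_FNUZ, True) == "fn"
```

These functions follow pytest naming conventions, so they would be collected as-is if dropped into a test module.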

