[AMD/MI300A] mxfp4 fused-MoE: support expert parallelism (ep_size>1) on gfx942 (DeepSeek-V4)

## Context
This issue comes out of bringing up **DeepSeek-V4** (`model_type: deepseek_v4`; FP4/mxfp4 routed experts, FP8 for the rest) on **AMD MI300A (gfx942, ROCm 7.2, torch 2.11)**. V4 has **256 experts** (Flash) or **384 experts** (Pro) with top-6 routing, so the experts **must be sharded across GPUs/nodes (expert parallelism)** to fit: V4-Pro is 806 GB, and even Flash's experts don't fit on one 128 GB GPU when replicated.
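For a rough sense of scale, the sizing argument can be written as a few lines of arithmetic. This is a weights-only lower bound that ignores KV cache, activations, and runtime overhead. The `ep_size` values in the usage lines are illustrative assumptions, not a proposed deployment.

```python
import math

GPU_MEMORY_GB = 128      # per-GPU capacity quoted above
V4_PRO_WEIGHTS_GB = 806  # V4-Pro size quoted above


def min_gpus_for_weights(weights_gb: float, gpu_gb: float) -> int:
    """Smallest GPU count whose combined memory covers the weights alone."""
    return math.ceil(weights_gb / gpu_gb)


def experts_per_rank(num_experts: int, ep_size: int) -> int:
    """Even split of routed experts across EP ranks (assumes divisibility)."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    return num_experts // ep_size


print(min_gpus_for_weights(V4_PRO_WEIGHTS_GB, GPU_MEMORY_GB))  # 7
print(experts_per_rank(384, 8))  # 48 (Pro, hypothetical ep_size=8)
print(experts_per_rank(256, 4))  # 64 (Flash, ep_size=4 as in the probe)
```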

## Finding (single-node probe, MI300A, 2026-06-11)
The gfx942 mxfp4 MoE kernel **works** — but only without expert parallelism:

- **`ep_size = 4` (`--enable-expert-parallel`)** → construction fails at backend selection:
  ```
  RuntimeError: No MoE backend available for gfx94/mxfp4.
    Tried: gluon_kernel:unsupported, triton_kernel:unsupported
    (tokenspeed/runtime/layers/moe/core/selector.py:192)
  ```
- **`ep_size = 1` (TP only)** → `Mxfp4TritonKernelBackend` **selects and allocates mxfp4 weights** (`mxfp4/triton_kernel.py:153 create_layer_weights → mxfp4/weights.py:66 create_mxfp4_weights`); it only then OOMs because 806 GB can't fit one 128 GB GPU. So the kernel itself is fine on gfx942.

## Root cause
In both `tokenspeed/runtime/layers/moe/backends/mxfp4/{gluon,triton}_kernel.py`, `supports()` ends with:
```python
return spec.ep_size <= 1 and spec.activation in {"silu", "swiglu"}
```
i.e. the AMD mxfp4 fused-MoE (w-mxfp4 / a-fp8) is gated to **ep_size ≤ 1**. There is no mxfp4 MoE path under expert parallelism, so V4's experts can't be sharded → the model can't be served on MI300A.
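A minimal sketch of how that gate behaves is below. `MoESpec` is a hypothetical stand-in for the real spec object, which has more fields, and `supports_relaxed` only illustrates what lifting the gate could look like. It is not the actual tokenspeed code and says nothing about whether the kernel is correct under EP.

```python
from dataclasses import dataclass

SUPPORTED_ACTIVATIONS = {"silu", "swiglu"}


@dataclass(frozen=True)
class MoESpec:
    """Hypothetical stand-in: only the two fields the quoted gate reads."""
    ep_size: int
    activation: str


def supports_current(spec: MoESpec) -> bool:
    # Same condition as the quoted return statement.
    return spec.ep_size <= 1 and spec.activation in SUPPORTED_ACTIVATIONS


def supports_relaxed(spec: MoESpec) -> bool:
    # Hypothetical relaxed gate: any positive ep_size is accepted.
    return spec.ep_size >= 1 and spec.activation in SUPPORTED_ACTIVATIONS


tp_only = MoESpec(ep_size=1, activation="swiglu")
with_ep = MoESpec(ep_size=4, activation="swiglu")

print(supports_current(tp_only))  # True  (backend is selected)
print(supports_current(with_ep))  # False (reported as "unsupported")
print(supports_relaxed(with_ep))  # True
```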

## Ask
Extend the AMD mxfp4 fused-MoE (gluon/triton) to support **`ep_size > 1`**: per-rank expert subset + masked-compute + all-reduce, mirroring the bf16 / INT4-W4A16 EP path that already serves Kimi-K2.6 on 2×MI300A. The `ep_size <= 1` gate is tokenspeed-side; the kernel may already accept a per-rank expert subset (please verify) — the work is lifting the gate + wiring the EP masked-compute (plus any kernel changes if the layout assumes all experts are local).
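To make the requested scheme concrete, here is a pure-Python toy of "per-rank expert subset + masked-compute + all-reduce". Every name in it is hypothetical: the experts are scalar multipliers rather than mxfp4 FFNs, the ranks are simulated in a loop, and `all_reduce_sum` stands in for a real collective. It only shows that summing masked per-rank partial outputs reproduces the non-EP result.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def make_expert(scale: float) -> Expert:
    """Toy expert that scales the token; stands in for a real expert FFN."""
    return lambda x: [scale * v for v in x]


def local_expert_range(rank: int, ep_size: int, num_experts: int) -> range:
    """Contiguous block of expert ids owned by `rank` (assumes even split)."""
    per_rank = num_experts // ep_size
    return range(rank * per_rank, (rank + 1) * per_rank)


def masked_moe_on_rank(tokens, topk_ids, topk_weights, experts, local_ids):
    """Partial output of one rank: routes to non-local experts add nothing."""
    out = [[0.0] * len(tok) for tok in tokens]
    for t, tok in enumerate(tokens):
        for eid, w in zip(topk_ids[t], topk_weights[t]):
            if eid not in local_ids:
                continue  # masked: another rank owns this expert
            y = experts[eid](tok)
            for d in range(len(tok)):
                out[t][d] += w * y[d]
    return out


def all_reduce_sum(partials: Sequence[List[Vector]]) -> List[Vector]:
    """Element-wise sum over ranks; stands in for a collective all-reduce."""
    result = [[0.0] * len(row) for row in partials[0]]
    for partial in partials:
        for t, row in enumerate(partial):
            for d, v in enumerate(row):
                result[t][d] += v
    return result


def moe_ep(tokens, topk_ids, topk_weights, experts, ep_size):
    """Simulate all EP ranks in one process and combine their partials."""
    partials = [
        masked_moe_on_rank(
            tokens, topk_ids, topk_weights, experts,
            local_expert_range(rank, ep_size, len(experts)),
        )
        for rank in range(ep_size)
    ]
    return all_reduce_sum(partials)


experts = [make_expert(float(i + 1)) for i in range(8)]
tokens = [[1.0, 2.0], [3.0, 4.0]]
topk_ids = [[0, 5], [3, 6]]            # top-2 here to keep the toy small
topk_weights = [[0.5, 0.5], [0.25, 0.75]]

print(moe_ep(tokens, topk_ids, topk_weights, experts, ep_size=1))
print(moe_ep(tokens, topk_ids, topk_weights, experts, ep_size=4))
```

Both calls print the same values, which is the property the real EP path would need to preserve. Whether the mxfp4 kernel's weight layout tolerates a per-rank subset is the open question noted above.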

## Repro
`jobs/serve-v4pro-1node-probe.sbatch` (ep=4 → the selector error) and `serve-v4pro-1node-probe2.sbatch` (ep=1 → selects + OOM) on beverin. V4-Pro weights staged at `infra01/hf_models/.../DeepSeek-V4-Pro`.

**HW:** MI300A gfx942 / ROCm 7.2 / torch 2.11.0+rocm7.2. Part of the DeepSeek-V4-on-MI300A bring-up (tracking issue to follow).

