[AMD/MI300A] mxfp4 fused-MoE: support expert parallelism (ep_size>1) on gfx942 (DeepSeek-V4) #26

Description

@xzyaoi

Context

Bringing up DeepSeek-V4 (model_type: deepseek_v4 — FP4/mxfp4 routed experts + FP8 for the rest) on AMD MI300A (gfx942, ROCm 7.2, torch 2.11). V4 has 256 experts (Flash) / 384 experts (Pro), top-6 — so the experts must be sharded across GPUs/nodes (expert parallelism) to fit (V4-Pro is 806 GB; even Flash's experts don't fit one 128 GB GPU replicated).
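As a rough sanity check on the sizes above, the arithmetic can be written out. This is a back-of-the-envelope sketch only: it counts weights alone (no KV cache, activations, or runtime overhead), treats the quoted GB figures as directly comparable, and the helper name `min_gpus_for_weights` is made up for illustration.

```python
import math

def min_gpus_for_weights(model_gb: float, gpu_gb: float) -> int:
    """Lower bound on GPUs needed to hold the weights, ignoring all other memory use."""
    return math.ceil(model_gb / gpu_gb)

V4_PRO_GB = 806   # V4-Pro checkpoint size quoted in this issue
GPU_GB = 128      # per-GPU memory quoted in this issue

# Weights alone need more than one GPU's worth of memory, so they must be sharded.
print(min_gpus_for_weights(V4_PRO_GB, GPU_GB))
```

Since real serving needs headroom beyond the weights, the practical GPU count would be higher than this lower bound.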

Finding (single-node probe, MI300A, 2026-06-11)

The gfx942 mxfp4 MoE kernel works — but only without expert parallelism:

  • ep_size = 4 (--enable-expert-parallel) → construction fails at backend selection:
    RuntimeError: No MoE backend available for gfx94/mxfp4.
      Tried: gluon_kernel:unsupported, triton_kernel:unsupported
      (tokenspeed/runtime/layers/moe/core/selector.py:192)
    
  • ep_size = 1 (TP only) → Mxfp4TritonKernelBackend is selected and allocates mxfp4 weights (mxfp4/triton_kernel.py:153 create_layer_weights → mxfp4/weights.py:66 create_mxfp4_weights); it OOMs only afterwards, because 806 GB can't fit on one 128 GB GPU. So the kernel itself is fine on gfx942.

Root cause

tokenspeed/runtime/layers/moe/backends/mxfp4/{gluon,triton}_kernel.py supports() ends with:

return spec.ep_size <= 1 and spec.activation in {"silu", "swiglu"}

i.e. the AMD mxfp4 fused-MoE (w-mxfp4 / a-fp8) is gated to ep_size ≤ 1. There is no mxfp4 MoE path under expert parallelism, so V4's experts can't be sharded → the model can't be served on MI300A.
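To make the effect of the gate concrete, here is a minimal standalone sketch. Only the return expression in `supports_current` is taken from this issue; `MoESpec` is a hypothetical stand-in for whatever spec object tokenspeed actually passes to supports(), and `supports_relaxed` is an assumption about what the gate might look like once an EP path exists, not the actual fix.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MoESpec:
    """Hypothetical stand-in for the spec object checked by supports()."""
    ep_size: int
    activation: str

def supports_current(spec: MoESpec) -> bool:
    # Gate as quoted in this issue: any ep_size > 1 is rejected.
    return spec.ep_size <= 1 and spec.activation in {"silu", "swiglu"}

def supports_relaxed(spec: MoESpec) -> bool:
    # Assumed shape of the gate after EP support lands: any positive ep_size.
    return spec.ep_size >= 1 and spec.activation in {"silu", "swiglu"}

tp_only = MoESpec(ep_size=1, activation="swiglu")
ep4 = MoESpec(ep_size=4, activation="swiglu")

print(supports_current(tp_only), supports_current(ep4))
print(supports_relaxed(tp_only), supports_relaxed(ep4))
```

With the current gate, both the gluon and triton backends report "unsupported" for ep_size = 4, which matches the selector error in the finding above. Relaxing the gate alone is not enough: the backend must also compute correctly with only a subset of experts on each rank.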

Ask

Extend the AMD mxfp4 fused-MoE (gluon/triton) to support ep_size > 1: per-rank expert subset + masked-compute + all-reduce, mirroring the bf16 / INT4-W4A16 EP path that already serves Kimi-K2.6 on 2×MI300A. The ep_size <= 1 gate is tokenspeed-side; the kernel may already accept a per-rank expert subset (please verify) — the work is lifting the gate + wiring the EP masked-compute (plus any kernel changes if the layout assumes all experts are local).
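The requested scheme (per-rank expert subset + masked compute + all-reduce) can be illustrated with a toy single-process simulation. This is a sketch under stated assumptions, not the tokenspeed implementation: experts are contiguous and evenly divided across ranks, each "expert" is a scalar multiply rather than an mxfp4 GEMM, and the all-reduce is a plain sum over per-rank lists. All function names are hypothetical.

```python
def local_expert_range(rank, ep_size, num_experts):
    # Assumption: experts are split into contiguous, equal-sized blocks per rank.
    assert num_experts % ep_size == 0, "toy sketch requires an even split"
    per_rank = num_experts // ep_size
    return range(rank * per_rank, (rank + 1) * per_rank)

def expert_fn(expert_id, x):
    # Toy expert: scales the token by (expert_id + 1). Stands in for the real expert MLP.
    return (expert_id + 1) * x

def rank_partial(rank, ep_size, num_experts, tokens, topk_ids, topk_weights):
    # Masked compute: a rank only accumulates contributions from experts it owns;
    # routed slots that point at remote experts contribute zero on this rank.
    local = local_expert_range(rank, ep_size, num_experts)
    out = [0.0] * len(tokens)
    for t, x in enumerate(tokens):
        for e, w in zip(topk_ids[t], topk_weights[t]):
            if e in local:
                out[t] += w * expert_fn(e, x)
    return out

def all_reduce_sum(partials):
    # Simulated all-reduce: elementwise sum of every rank's partial output.
    return [sum(vals) for vals in zip(*partials)]

def moe_ep(ep_size, num_experts, tokens, topk_ids, topk_weights):
    partials = [
        rank_partial(r, ep_size, num_experts, tokens, topk_ids, topk_weights)
        for r in range(ep_size)
    ]
    return all_reduce_sum(partials)

def moe_reference(tokens, topk_ids, topk_weights):
    # Dense reference with every expert local (the ep_size = 1 case).
    return [
        sum(w * expert_fn(e, x) for e, w in zip(ids, ws))
        for x, ids, ws in zip(tokens, topk_ids, topk_weights)
    ]

tokens = [1.0, 2.0, -4.0]
topk_ids = [[0, 7], [3, 4], [1, 6]]              # top-2 routing over 8 experts
topk_weights = [[0.5, 0.5], [0.25, 0.75], [0.5, 0.5]]

ep_out = moe_ep(4, 8, tokens, topk_ids, topk_weights)
ref_out = moe_reference(tokens, topk_ids, topk_weights)
print(ep_out == ref_out)
```

The point of the sketch is the invariant to check when wiring the real path: the sum of per-rank masked outputs must equal the all-experts-local result. Whether the gfx942 kernel already accepts a per-rank expert subset, and whether its weight layout assumes all experts are local, still needs to be verified as noted above.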

Repro

jobs/serve-v4pro-1node-probe.sbatch (ep=4 → the selector error) and serve-v4pro-1node-probe2.sbatch (ep=1 → selects + OOM) on beverin. V4-Pro weights staged at infra01/hf_models/.../DeepSeek-V4-Pro.

HW: MI300A gfx942 / ROCm 7.2 / torch 2.11.0+rocm7.2. Part of the DeepSeek-V4-on-MI300A bring-up (tracking issue to follow).
