Tracking issue for serving DeepSeek-V4 (model_type: deepseek_v4) on AMD MI300A (gfx942).
Architecture (why it's new vs V3/Kimi)
- FP4/mxfp4 routed experts + FP8 (block 128×128) for the rest. Flash: 256 experts / hidden 4096; Pro: 384 experts / hidden 7168; top-6, 1 shared.
- Hybrid attention: Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA) + a DSA indexer (top-512 Flash / top-1024 Pro), per-layer compress_ratios, sliding_window: 128, o_lora_rank.
- mHC (Manifold-Constrained Hyper-Connections: hc_mult, hc_sinkhorn_iters), sqrtsoftplus routing, MTP, YaRN→1M.
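For a sense of what the two quantization layouts cost in storage, here is a rough sketch that counts scales and bytes for one weight matrix. It assumes the OCP MX convention for mxfp4 (32 four-bit elements along the last dimension share one 8-bit scale) and one float32 scale per 128×128 FP8 block. The function names, the float32 scale width and the example shape are illustrative assumptions, not read from the V4 checkpoint.

```python
import math

MXFP4_BLOCK = 32   # elements per shared scale (OCP MX convention; assumed for V4)
FP8_BLOCK = 128    # 128x128 weight blocks, as stated in the issue


def mxfp4_bytes(rows: int, cols: int) -> int:
    """Bytes for a rows x cols mxfp4 matrix: packed 4-bit values plus 1-byte block scales."""
    packed = rows * ((cols + 1) // 2)               # two 4-bit elements per byte
    scales = rows * math.ceil(cols / MXFP4_BLOCK)   # one 8-bit scale per 32 elements
    return packed + scales


def fp8_block_scale_grid(rows: int, cols: int) -> tuple[int, int]:
    """Shape of the scale grid for 128x128 block-wise FP8."""
    return math.ceil(rows / FP8_BLOCK), math.ceil(cols / FP8_BLOCK)


def fp8_block_bytes(rows: int, cols: int, scale_bytes: int = 4) -> int:
    """Bytes for 1-byte FP8 values plus one scale per block (float32 scale is an assumption)."""
    grid_rows, grid_cols = fp8_block_scale_grid(rows, cols)
    return rows * cols + grid_rows * grid_cols * scale_bytes


if __name__ == "__main__":
    # Hypothetical projection shape; 7168 is the Pro hidden size from the list above.
    rows, cols = 2048, 7168
    print("FP8 scale grid:", fp8_block_scale_grid(rows, cols))
    print("mxfp4 bits/element:", mxfp4_bytes(rows, cols) * 8 / (rows * cols))
    print("FP8 bits/element:", fp8_block_bytes(rows, cols) * 8 / (rows * cols))
```

Under these assumptions mxfp4 costs 4.25 bits per element whenever the row length is a multiple of 32, roughly half the FP8 footprint, which is why the routed experts are the FP4 part.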
Findings (single-node probe on MI300A, 2026-06-11)
Much more AMD-ready than V3 was: the model code imports and constructs cleanly on gfx942 (no import wall; deepseek_v4.py feature-detects kernels and gates NVIDIA-only paths on cc=10 with fallbacks). Confirmed working at construction: model import, attention + DSA indexer construction, and the mxfp4 MoE kernel at ep_size=1 (backend selection and weight allocation both succeed).
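The gating described above can be pictured with a small sketch. It is not the actual logic in deepseek_v4.py: the `Device` type, the backend labels and the function name are all hypothetical, and the `ep_size <= 1` gate on the triton/gluon path is taken from the blocker reported in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    is_rocm: bool      # True on an AMD/HIP build
    cc_major: int      # CUDA compute-capability major; ignored on ROCm
    ep_size: int = 1   # expert-parallel world size


def select_moe_backend(dev: Device) -> str:
    """Return a hypothetical mxfp4 MoE backend label for the given device."""
    # NVIDIA-only fast path, gated on compute capability 10.
    if not dev.is_rocm and dev.cc_major == 10:
        return "nvidia_cc10"
    # Portable fallback; per this issue it is only available without expert parallelism.
    if dev.ep_size <= 1:
        return "triton_mxfp4"
    raise NotImplementedError(
        f"no mxfp4 MoE backend for ep_size={dev.ep_size} on this device"
    )


if __name__ == "__main__":
    print(select_moe_backend(Device(is_rocm=True, cc_major=0, ep_size=1)))
    try:
        select_moe_backend(Device(is_rocm=True, cc_major=0, ep_size=8))
    except NotImplementedError as exc:
        print("blocked:", exc)
```

The point of the sketch is that construction can succeed at ep_size=1 while the same selection fails once experts have to be sharded.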
Blockers
- mxfp4 MoE with expert parallelism (ep_size > 1) on gfx942. PRIMARY/confirmed. The gluon/triton mxfp4 backends are gated to ep_size <= 1; V4's 256–384 experts must shard across GPUs/nodes to fit. (At ep=1 the kernel works but the model OOMs.)
- deep_gemm.fp8_fp4_mqa_logits + a CUDA-only mxfp4 paged-gather; AMD fallback unverified.
Not blockers (confirmed OK at construction)
Model import, attention/indexer module construction, mxfp4 MoE backend selection + weight alloc (ep=1).
Notes
- FlashInfer-AMD (amd-flashinfer, gfx942) does not help here: it is at v0.2.5 / ROCm 7.1.1 / torch 2.8 (ABI-mismatched with our 7.2/2.11 image) and predates V4's FP4-MoE + sparse-indexer kernels.
- This mirrors the Kimi/V3 bring-up (which needed an AMD INT4-W4A16 MoE + AITER MLA): same shape, smaller remaining lift.
Weights / repro
V4-Pro staged at infra01/hf_models/.../DeepSeek-V4-Pro (806 GB); Flash not staged (MIT, ~150 GB). Probes: jobs/serve-v4pro-1node-probe{,2}.sbatch on beverin. HW: MI300A gfx942 / ROCm 7.2 / torch 2.11.
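As a back-of-envelope check on why the Pro weights cannot sit on one device, the sketch below estimates the minimum device count needed to hold the weights alone. It assumes 128 GB of HBM3 per MI300A (shared between CPU and GPU on this APU) and a hypothetical 80% usable fraction; KV cache and activations are ignored, so real requirements are higher.

```python
import math


def min_devices_for_weights(
    weights_gb: float,
    device_mem_gb: float = 128.0,   # MI300A HBM3 capacity per APU (assumed)
    usable_fraction: float = 0.8,   # hypothetical headroom for runtime, host and KV cache
) -> int:
    """Smallest device count whose usable memory covers the weights."""
    if not 0.0 < usable_fraction <= 1.0:
        raise ValueError("usable_fraction must be in (0, 1]")
    return math.ceil(weights_gb / (device_mem_gb * usable_fraction))


if __name__ == "__main__":
    print("V4-Pro (806 GB):", min_devices_for_weights(806))
    print("V4-Flash (~150 GB):", min_devices_for_weights(150))
```

With these numbers the Pro checkpoint needs at least 8 devices, so the weights have to be sharded across devices, and expert parallelism is the natural way to split the 384 experts.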