[AMD/MI300A] DeepSeek-V4 (deepseek_v4) serving support — tracking

Tracking issue for serving **DeepSeek-V4** (`model_type: deepseek_v4`) on **AMD MI300A (gfx942)**.

## Architecture (why it's new vs V3/Kimi)
- **FP4/mxfp4 routed experts** + **FP8** (block 128×128) for the rest. Flash: 256 experts / hidden 4096; Pro: 384 experts / hidden 7168; top-6, 1 shared.
- **Hybrid attention**: Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA) + a **DSA indexer** (top-512 Flash / top-1024 Pro), per-layer `compress_ratios`, `sliding_window: 128`, `o_lora_rank`.
- **mHC** (Manifold-Constrained Hyper-Connections: `hc_mult`, `hc_sinkhorn_iters`), `sqrtsoftplus` routing, MTP, YaRN→1M.
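
For a sense of scale on the FP8 side, the sketch below counts the scale entries that 128×128 block quantization implies. It assumes one scale per block (the usual convention for block-wise FP8) and uses a square hidden×hidden weight purely for illustration; the actual V4 weight shapes are not listed in this issue, and `fp8_block_scale_grid` is a hypothetical helper.

```python
import math


def fp8_block_scale_grid(rows: int, cols: int, block: int = 128) -> tuple[int, int]:
    """Return (row_blocks, col_blocks) for a block-quantized weight.

    Assumes one scale per `block` x `block` tile; partial tiles at the
    edges still need their own scale, hence the ceiling division.
    """
    return math.ceil(rows / block), math.ceil(cols / block)


# Illustrative square weights sized by the hidden dims quoted above.
for name, hidden in (("Flash", 4096), ("Pro", 7168)):
    rb, cb = fp8_block_scale_grid(hidden, hidden)
    print(f"{name}: {hidden}x{hidden} weight -> {rb}x{cb} = {rb * cb} block scales")
```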

## Findings (single-node probe on MI300A, 2026-06-11)
**Much more AMD-ready than V3 was:** the model code **imports and constructs cleanly on gfx942**. There is no import wall: `deepseek_v4.py` feature-detects kernels and gates NVIDIA-only paths on cc=10 (CUDA compute capability), with fallbacks otherwise. Confirmed working at construction only (no forward pass was reached): model import, attention + DSA indexer construction, and the **mxfp4 MoE kernel at `ep_size=1`** (backend is selected and weights allocate).
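
The gating pattern described above can be sketched roughly as follows. This is not the code in `deepseek_v4.py`: `DeviceInfo`, `select_indexer_backend`, and the backend names `"triton"` and `"torch_reference"` are hypothetical, chosen only to show how an NVIDIA-only path can sit behind a compute-capability check while other devices fall through to a fallback.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceInfo:
    is_rocm: bool
    cc_major: int  # CUDA compute-capability major; not meaningful on ROCm


def select_indexer_backend(dev: DeviceInfo, available: set[str]) -> str:
    """Pick the first usable backend name (hypothetical names throughout)."""
    # NVIDIA-only fast path: taken only on CUDA devices with cc major == 10.
    if not dev.is_rocm and dev.cc_major == 10 and "deep_gemm" in available:
        return "deep_gemm"
    # Portable kernel path, if that backend was detected at import time.
    if "triton" in available:
        return "triton"
    # Last resort: a plain reference implementation.
    return "torch_reference"


mi300a = DeviceInfo(is_rocm=True, cc_major=0)
print(select_indexer_backend(mi300a, {"triton"}))
```

Selecting a fallback at construction time says nothing about whether that fallback's forward pass is correct, which is why the indexer still appears under Blockers.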

## Blockers
- [ ] **#26 — mxfp4 fused-MoE: no expert-parallelism (`ep_size > 1`) on gfx942.** PRIMARY/confirmed. The gluon/triton mxfp4 backends are gated to `ep_size <= 1`; V4's 256–384 experts must shard across GPUs/nodes to fit. (At ep=1 the kernel works but the model OOMs.)
- [ ] **#27 — DSA indexer + V4 attention: forward unvalidated on gfx942.** Constructs fine, but forward couldn't be reached (OOM behind #26). Indexer uses NVIDIA `deep_gemm.fp8_fp4_mqa_logits` + a cuda-only mxfp4 paged-gather; AMD fallback unverified.
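
A back-of-envelope sketch of why #26 is the primary blocker. It assumes a contiguous, even split of experts across expert-parallel ranks (one common layout, not necessarily what the backend will use) and 128 GB of HBM3 per MI300A as published by AMD. `experts_for_rank` and `min_devices` are hypothetical helpers, and the device count is a lower bound that ignores KV cache, activations, and replicated non-expert weights.

```python
import math


def experts_for_rank(num_experts: int, ep_size: int, rank: int) -> range:
    """Contiguous slice of expert ids owned by `rank` under an even split."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    if not 0 <= rank < ep_size:
        raise ValueError("rank out of range")
    per_rank = num_experts // ep_size
    return range(rank * per_rank, (rank + 1) * per_rank)


def min_devices(checkpoint_gb: float, device_gb: float) -> int:
    """Lower bound on devices needed just to hold the checkpoint."""
    return math.ceil(checkpoint_gb / device_gb)


# V4-Pro: 384 routed experts, 806 GB staged checkpoint (figures from this issue).
owned = experts_for_rank(384, ep_size=8, rank=3)
print(f"rank 3 of 8 owns experts {owned.start}..{owned.stop - 1}")
print(f"devices needed for weights alone: {min_devices(806, 128)}")
```

Even under these optimistic assumptions the Pro checkpoint cannot fit on one device, so `ep_size=1` can only ever exercise construction.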

## Not blockers (confirmed OK at construction)
Model import, attention/indexer module construction, mxfp4 MoE backend selection + weight alloc (ep=1).

## Notes
- **FlashInfer-AMD** (`amd-flashinfer`, gfx942) does **not** help here: the available build is v0.2.5 for ROCm 7.1.1 / torch 2.8, which is ABI-mismatched with our ROCm 7.2 / torch 2.11 image, and it predates V4's FP4-MoE + sparse-indexer kernels.
- This mirrors the Kimi/V3 bring-up (which needed an AMD INT4-W4A16 MoE + AITER MLA): same shape, smaller remaining lift.

## Weights / repro
V4-Pro staged at `infra01/hf_models/.../DeepSeek-V4-Pro` (806 GB); Flash not staged (MIT, ~150 GB). Probes: `jobs/serve-v4pro-1node-probe{,2}.sbatch` on beverin. **HW:** MI300A gfx942 / ROCm 7.2 / torch 2.11.

