Context
DeepSeek-V4 (deepseek_v4) on AMD MI300A (gfx942). V4's attention is a hybrid Compressed Sparse Attention (CSA) + Heavily Compressed Attention (HCA) with a DSA indexer that selects top-512 (Flash) / 1024 (Pro) KV per query (index_n_heads: 64, index_head_dim: 128), plus per-layer compress_ratios and sliding_window: 128.
Status (single-node probe, MI300A, 2026-06-11)
Good: the V4 model imports and constructs cleanly on AMD — the attention + indexer modules build without error (the probe reached MoE weight allocation past them). No import wall.
Unknown: the forward path is unvalidated — the probe OOM'd at MoE weight alloc before running a forward (blocked on the mxfp4-EP gap; see companion issue). The indexer forward uses NVIDIA-only kernels:
- deep_gemm.fp8_fp4_mqa_logits(...) (thirdparty/deep_gemm, CUDA) for the indexer logits, and
- indexer_mxfp4_paged_gather(...) (tokenspeed_kernel/ops/attention/cuda/deepseek_v4.py, CUDA-only; the Triton variant ops/attention/triton/deepseek_v4.py does not have the mxfp4 paged gather).
There is a fallback gate _deepseek_v4_deepgemm_fp4_indexer_available() in models/deepseek_v4.py, but its AMD branch is unverified.
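To help verify which branch the gate should take on a given node, a small probe like the one below could be run first. It is a minimal sketch and not the implementation of _deepseek_v4_deepgemm_fp4_indexer_available(): the helper name describe_indexer_backend is hypothetical, and it only reports whether torch is a ROCm or CUDA build and whether a deep_gemm module is importable.

```python
import importlib.util


def describe_indexer_backend():
    """Hypothetical probe: report the torch build flavor and deep_gemm importability.

    This does not call the real gate in models/deepseek_v4.py; it only gathers
    the facts such a gate would plausibly depend on.
    """
    info = {"torch": False, "hip": None, "cuda": None, "deep_gemm": False}
    if importlib.util.find_spec("torch") is not None:
        import torch

        info["torch"] = True
        # torch.version.hip is a version string on ROCm builds and None otherwise.
        info["hip"] = torch.version.hip
        # torch.version.cuda is a version string on CUDA builds and None otherwise.
        info["cuda"] = torch.version.cuda
    # Only checks that the module can be located, not that its kernels run.
    info["deep_gemm"] = importlib.util.find_spec("deep_gemm") is not None
    return info


if __name__ == "__main__":
    for key, value in describe_indexer_backend().items():
        print(f"{key}: {value}")
```

On the MI300A node the expectation would be a non-None hip entry; whether deep_gemm is importable there is exactly what remains to be checked.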
Ask
Once the mxfp4-EP blocker is resolved and a V4 model fits on MI300A, run a forward and:
- Validate that the indexer and the CSA/HCA attention produce correct output on gfx942 (the fallback path may already work via Triton).
- If the deep_gemm fp8/fp4 MQA-logits and the CUDA mxfp4 paged-gather are required, provide gfx942 equivalents (Triton/Gluon or AITER) for the indexer logits (fp8/fp4 MQA) and the mxfp4 paged KV gather.
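For the validation step, one possible correctness check is top-k overlap rather than bitwise equality, since fp8/fp4 logits from a different backend will not match a reference exactly. The sketch below is an assumption about how such a check might look, not part of the codebase: topk_indices and topk_recall are hypothetical helpers that compare the KV positions selected from a reference score vector against those selected from a candidate kernel's scores for one query.

```python
def topk_indices(scores, k):
    """Return the set of positions of the k largest scores.

    Ties are broken in favor of the lower index so the result is deterministic.
    """
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(order[:k])


def topk_recall(ref_scores, test_scores, k):
    """Fraction of the reference top-k positions that the candidate also selects."""
    ref = topk_indices(ref_scores, k)
    got = topk_indices(test_scores, k)
    if not ref:
        return 1.0
    return len(ref & got) / len(ref)


if __name__ == "__main__":
    # Toy scores for one query over five KV positions (illustrative values only).
    reference = [0.9, 0.1, 0.8, 0.3, 0.7]
    candidate = [0.88, 0.12, 0.79, 0.75, 0.7]
    print(topk_recall(reference, candidate, k=3))
```

In a real run, k would be 512 (Flash) or 1024 (Pro), and the acceptable recall threshold would need to be chosen by whoever owns the kernel.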
Repro / HW
jobs/serve-v4pro-1node-probe2.sbatch on beverin; V4-Pro staged at infra01/hf_models/.../DeepSeek-V4-Pro. MI300A gfx942 / ROCm 7.2 / torch 2.11. Companion to the mxfp4-EP issue + the V4 tracking issue.