[AMD/MI300A] DeepSeek-V4 DSA indexer + attention: validate/provide gfx942 forward path #27

## Context
This issue covers DeepSeek-V4 (`deepseek_v4`) on **AMD MI300A (gfx942)**. V4's attention is a hybrid of **Compressed Sparse Attention (CSA)** and **Heavily Compressed Attention (HCA)**, with a **DSA indexer** that selects the top-**512** (Flash) / **1024** (Pro) KV entries per query (`index_n_heads: 64`, `index_head_dim: 128`), plus per-layer `compress_ratios` and `sliding_window: 128`.
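
To make the selection step concrete, here is a minimal pure-Python sketch of per-query top-k selection. It assumes the index score has the weighted-ReLU form published for the DeepSeek-V3.2 indexer (sum over index heads of `weight * ReLU(q · k)`); whether V4 uses exactly this form is an assumption. `indexer_topk`, `head_weights`, and the toy shapes are hypothetical names for illustration, not the repo's API.

```python
import heapq


def indexer_topk(q_heads, head_weights, keys, top_k):
    """Reference top-k KV selection for ONE query token (illustrative only).

    q_heads:      list of H index-head query vectors
    head_weights: list of H per-head scalar weights (assumed scoring form)
    keys:         list of S index key vectors, one per visible KV position
    Returns the selected KV positions in ascending order.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = []
    for k in keys:
        # Assumed score: sum_h w_h * ReLU(q_h . k)
        s = sum(w * max(0.0, dot(q, k)) for q, w in zip(q_heads, head_weights))
        scores.append(s)

    k_eff = min(top_k, len(keys))  # short contexts keep every position
    top = heapq.nlargest(k_eff, range(len(keys)), key=lambda i: scores[i])
    return sorted(top)


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0]]          # 2 toy index heads, dim 2
    w = [1.0, 0.5]
    kv = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [2.0, 2.0]]
    print(indexer_topk(q, w, kv, top_k=2))
```

In the real model the same selection would run with 64 heads of dimension 128 and `top_k` of 512 or 1024, in low precision on the GPU; a slow reference like this is only useful as a ground truth for small shapes.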

## Status (single-node probe, MI300A, 2026-06-11)
**Good:** the V4 model imports and **constructs cleanly on AMD** — the attention + indexer modules build without error (the probe reached MoE weight allocation *past* them). No import wall.

**Unknown:** the **forward path is unvalidated**. The probe ran out of memory at MoE weight allocation before any forward pass could run (blocked on the mxfp4-EP gap; see companion issue). The indexer forward calls two CUDA-only kernels:
- `deep_gemm.fp8_fp4_mqa_logits(...)` (`thirdparty/deep_gemm`, CUDA) for the indexer logits, and
- `indexer_mxfp4_paged_gather(...)` (`tokenspeed_kernel/ops/attention/cuda/deepseek_v4.py`, CUDA-only). The Triton variant in `ops/attention/triton/deepseek_v4.py` does **not** implement the mxfp4 paged gather.
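
For orientation, the sketch below shows what an mxfp4 paged gather computes in principle: look up each token's physical page through a block table, then dequantize its 4-bit codes with a shared power-of-two scale. The E2M1 value table and the E8M0 scale (`2 ** (e - 127)`) follow the OCP MX format; the cache layout, `paged_gather`, `dequant_mxfp4`, and one scale per row are assumptions made for illustration and do not describe the actual kernel's memory layout or signature.

```python
# Magnitudes representable by FP4 E2M1 (3 low bits index this table).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def dequant_mxfp4(codes, scale_exp):
    """Dequantize 4-bit codes (bit 3 = sign, bits 0-2 = magnitude index).

    scale_exp is the biased E8M0 exponent shared by the block.
    """
    scale = 2.0 ** (scale_exp - 127)
    out = []
    for c in codes:
        mag = E2M1_VALUES[c & 0x7] * scale
        out.append(-mag if c & 0x8 else mag)
    return out


def paged_gather(cache, block_table, token_ids, page_size):
    """Gather and dequantize rows for logical token positions.

    cache[physical_page][slot] -> (codes, scale_exp)   (assumed layout)
    block_table[logical_page]  -> physical_page
    """
    rows = []
    for t in token_ids:
        page = block_table[t // page_size]
        codes, scale_exp = cache[page][t % page_size]
        rows.append(dequant_mxfp4(codes, scale_exp))
    return rows


if __name__ == "__main__":
    cache = [
        [([0b0001], 127), ([0b0010], 127)],   # physical page 0
        [([0b0011], 127), ([0b0100], 127)],   # physical page 1
    ]
    block_table = [1, 0]                      # logical page -> physical page
    print(paged_gather(cache, block_table, [0, 3], page_size=2))
```

A gfx942 port would need to reproduce this behavior for the selected top-k positions; a slow reference of this kind can serve as the oracle when testing a Triton or AITER kernel on small inputs.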

There is a fallback gate `_deepseek_v4_deepgemm_fp4_indexer_available()` in `models/deepseek_v4.py`, but its AMD branch is unverified.
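
As a rough illustration of how such a gate can be made explicit about AMD, the sketch below separates environment probing from the decision itself so the decision can be unit-tested without a GPU. This is not the contents of `_deepseek_v4_deepgemm_fp4_indexer_available()`; `pick_indexer_backend`, `detect_rocm`, `module_present`, and the backend names are hypothetical. `torch.version.hip` being non-`None` is the usual way to detect a ROCm build of PyTorch.

```python
import importlib.util


def detect_rocm() -> bool:
    """True when the installed PyTorch is a ROCm (HIP) build."""
    try:
        import torch
    except ImportError:
        return False
    return getattr(torch.version, "hip", None) is not None


def module_present(name: str) -> bool:
    """True when a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_indexer_backend(is_rocm: bool, has_deep_gemm: bool, has_triton: bool) -> str:
    """Pure decision function (hypothetical policy, easy to test)."""
    if has_deep_gemm and not is_rocm:
        return "deep_gemm"   # CUDA-only path named in this issue
    if has_triton:
        return "triton"      # candidate fallback on gfx942
    return "reference"       # slow eager fallback


if __name__ == "__main__":
    backend = pick_indexer_backend(
        is_rocm=detect_rocm(),
        has_deep_gemm=module_present("deep_gemm"),
        has_triton=module_present("triton"),
    )
    print("indexer backend:", backend)
```

Logging the chosen backend once at model construction would make it obvious from the probe output which path the AMD branch actually takes.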

## Ask
Once the mxfp4-EP blocker is resolved and a V4 model fits on MI300A, run a forward and:
1. **Validate** the indexer + CSA/HCA attention produce correct output on gfx942 (the fallback path may already work via Triton).
2. If the `deep_gemm` fp8/fp4 MQA-logits kernel and the CUDA mxfp4 paged gather turn out to be required, **provide gfx942 equivalents** (Triton/Gluon or AITER) for both: the indexer logits (fp8/fp4 MQA) and the mxfp4 paged KV gather.
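
One possible shape for the validation in step 1 is sketched below: compare the gfx942 result against a trusted reference on two axes, overlap of the selected top-k indices and elementwise error of the attention output. Recall is used instead of exact index equality because low-precision logits can reorder near-tied entries. The function names and the `min_recall` / `atol` thresholds are placeholders chosen for illustration, not agreed acceptance criteria.

```python
def topk_recall(ref_idx, cand_idx):
    """Fraction of reference top-k indices also selected by the candidate."""
    ref = set(ref_idx)
    if not ref:
        return 1.0
    return len(ref & set(cand_idx)) / len(ref)


def max_abs_err(ref, cand):
    """Largest elementwise absolute difference between two flat outputs."""
    if len(ref) != len(cand):
        raise ValueError("output lengths differ")
    return max((abs(r - c) for r, c in zip(ref, cand)), default=0.0)


def check_forward(ref_idx, cand_idx, ref_out, cand_out,
                  min_recall=0.95, atol=1e-2):
    """Summarize one comparison; thresholds are placeholder values."""
    recall = topk_recall(ref_idx, cand_idx)
    err = max_abs_err(ref_out, cand_out)
    return {
        "recall": recall,
        "max_abs_err": err,
        "ok": recall >= min_recall and err <= atol,
    }


if __name__ == "__main__":
    report = check_forward(
        ref_idx=[0, 3, 5], cand_idx=[3, 5, 7],
        ref_out=[1.0, 2.0], cand_out=[1.0, 2.005],
    )
    print(report)
```

In practice the reference would come from a CUDA run or an fp32 eager implementation on the same inputs, flattened to lists or compared with the equivalent tensor operations.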

## Repro / HW
`jobs/serve-v4pro-1node-probe2.sbatch` on beverin; V4-Pro staged at `infra01/hf_models/.../DeepSeek-V4-Pro`. MI300A gfx942 / ROCm 7.2 / torch 2.11. Companion to the mxfp4-EP issue + the V4 tracking issue.

