[NVIDIA] Support Cutlass MLA for Blackwell GPUs #16032
Conversation
Overall looks pretty good, left a couple nits
Thanks for the updates. Left a few more nits (they can be punted to a future PR if you think that's more appropriate); overall though LGTM. Thanks for the contribution!
nit: is this wrapper required? can we just do:
template <typename T, bool PersistenceOption = true>
Done.
vllm/_custom_ops.py
nit: this probably shouldn't be hard-coded to 512; we should pass in the latent size. We should also pass in the softmax scale so we can avoid hardcoding:
// the scale is based on the non-absorbed sizes, change as appropriate
// we can't determine this parameter from the info we have, it's an input
int D_non_latent = 128;
float scale = 1.0 / sqrt(1.0 * (D_non_latent + D_rope));
Done. PTAL.
nit: we should pass in the scale (from Python) to avoid having to hard-code D_non_latent
Done.
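(For illustration, a minimal sketch of what computing the softmax scale on the Python side might look like, so the kernel no longer needs to hard-code D_non_latent; the helper name and the commented call signature are assumptions for this discussion, not code from this PR.)

```python
import math

def mla_softmax_scale(d_non_latent: int, d_rope: int) -> float:
    """Softmax scale based on the non-absorbed head sizes: 1/sqrt(d_qk)."""
    return 1.0 / math.sqrt(d_non_latent + d_rope)

# Example values for a DeepSeek-style MLA layer (illustrative only):
# 128-dim non-latent part, 64-dim rope part.
scale = mla_softmax_scale(d_non_latent=128, d_rope=64)

# The scale would then be passed into the op from Python rather than being
# derived inside the kernel, e.g. (hypothetical argument order):
# out = ops.cutlass_mla_decode(q_nope_and_q_pe, kv_c_and_k_pe_cache,
#                              seq_lens, page_table, scale)
```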
nit: maybe for a future PR: if Q_ptr (q_nope) and Q_ptr + D_latent (q_pe) were passed as separate tensors (assuming the kernel is OK with these being arbitrary pointers with separate strides, which this interface suggests it could be), then we could save the cat of q_nope and q_pe, e.g.:
vllm/v1/attention/backends/mla/flashmla.py, lines 133 to 134 in 7011645:
q = torch.cat([q_nope, q_pe], dim=-1)\
    .unsqueeze(1)  # Add seqlen dim of 1 (decode)
Right, currently we follow the CUTLASS example, which only supports a single query tensor. If this is needed, or if it is common practice, we can ask for an improvement.
I'm confused, it appears to support multiple: https://github.com/NVIDIA/cutlass/blob/e94e888df3551224738bfa505787b515eae8352f/examples/77_blackwell_fmha/kernel/sm100_fmha_mla_tma_warpspecialized.hpp#L246-L249
Am I missing something here?
Tried the separate tensors and it works. Updated the PR. PTAL.
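(A rough sketch of the difference, assuming the kernel accepts separate q_nope / q_pe pointers with their own strides as discussed above; the shapes and the commented call are illustrative, not the merged interface.)

```python
import torch

# Illustrative decode-time shapes: B tokens, H heads,
# 512-dim latent ("nope") part and 64-dim rope part per head.
B, H, D_LATENT, D_ROPE = 4, 128, 512, 64
q_nope = torch.randn(B, H, D_LATENT, dtype=torch.bfloat16)
q_pe = torch.randn(B, H, D_ROPE, dtype=torch.bfloat16)

# Before: concatenate into one query tensor, paying for an extra copy.
q = torch.cat([q_nope, q_pe], dim=-1).unsqueeze(1)  # add seqlen dim of 1 (decode)

# After: hand the two tensors to the kernel directly so the cat is no longer
# needed; the kernel can read each part via its own data pointer and strides
# (hypothetical call):
# out = ops.cutlass_mla_decode(q_nope, q_pe, kv_c_and_k_pe_cache,
#                              seq_lens, page_table, scale)
```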
Landing this to help Blackwell perf, but I would like to follow up on #16032 (comment) in a future PR.
The latest CUTLASS supports MLA for Blackwell GPUs. Examples can be found here. It should be available in the next release (v3.9).
This PR integrates this kernel as ops.cutlass_mla_decode.
cc @kushanam
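(For readers, a hedged end-to-end usage sketch of the new op; the tensor layout and argument list below are assumptions inferred from the review discussion above, not the committed interface.)

```python
import math
import torch

# Illustrative DeepSeek-style MLA decode setup (shapes are assumptions).
num_tokens, num_heads = 8, 128
kv_lora_rank, rope_dim = 512, 64        # latent ("nope") and rope head dims
num_blocks, block_size = 16, 128

q_nope = torch.randn(num_tokens, num_heads, kv_lora_rank, dtype=torch.bfloat16)
q_pe = torch.randn(num_tokens, num_heads, rope_dim, dtype=torch.bfloat16)
# Paged latent KV cache: each block stores the compressed KV plus rope part.
kv_c_and_k_pe_cache = torch.randn(num_blocks, block_size,
                                  kv_lora_rank + rope_dim,
                                  dtype=torch.bfloat16)
seq_lens = torch.full((num_tokens,), 512, dtype=torch.int32)
page_table = torch.zeros(num_tokens, num_blocks, dtype=torch.int32)

# Softmax scale based on the non-absorbed sizes, computed on the Python side.
scale = 1.0 / math.sqrt(128 + rope_dim)

# Hypothetical call (requires a Blackwell GPU and the compiled extension;
# move the tensors to CUDA first):
# from vllm import _custom_ops as ops
# out = ops.cutlass_mla_decode(q_nope.cuda(), q_pe.cuda(),
#                              kv_c_and_k_pe_cache.cuda(), seq_lens.cuda(),
#                              page_table.cuda(), scale)
```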