Conversation

@kaixih (Contributor) commented Apr 3, 2025

The latest CUTLASS supports MLA for Blackwell GPUs. Examples can be found here. It should be available in the next release (v3.9).

This PR integrates this kernel as ops.cutlass_mla_decode.

cc. @kushanam
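
For context, a minimal usage sketch (shapes are illustrative and the call at the end is a hypothetical signature, not necessarily the one added by this PR):

import torch
from vllm import _custom_ops as ops  # import path assumed

# Illustrative MLA decode shapes: 512 latent dims plus 64 rope dims per head.
bs, num_heads, d_latent, d_rope = 4, 128, 512, 64
q_nope = torch.randn(bs, num_heads, d_latent, dtype=torch.bfloat16)
q_pe = torch.randn(bs, num_heads, d_rope, dtype=torch.bfloat16)
q = torch.cat([q_nope, q_pe], dim=-1).unsqueeze(1)  # seqlen dim of 1 for decode

# Hypothetical call; see the PR diff for the actual paged-KV-cache arguments.
# out = ops.cutlass_mla_decode(q, kv_cache, block_table, seq_lens)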

github-actions bot commented Apr 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run the other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify bot added the ci/build label on Apr 3, 2025
@kaixih force-pushed the kaixih/cutlass_mla branch 2 times, most recently from fbcf237 to b20ac92 on April 3, 2025 20:31
@kaixih force-pushed the kaixih/cutlass_mla branch from eae2486 to 3c17c62 on April 15, 2025 17:23
@kaixih force-pushed the kaixih/cutlass_mla branch from 8d29c8c to 85049f8 on April 23, 2025 16:57
@LucasWilkinson (Collaborator) left a comment

Overall looks pretty good, left a couple nits

@LucasWilkinson (Collaborator) left a comment

Thanks for the updates; left a few more nits (they can be punted to a future PR if you think that's more appropriate), but overall LGTM. Thanks for the contribution!

Collaborator

nit: is this wrapper required? can we just do:

template <typename T, bool PersistenceOption = true>

Contributor Author

Done.

Collaborator

nit: this probably shouldn't be hardcoded to 512; we should pass in the latent size. We should also pass in the softmax scale so we can avoid hardcoding:


   // the scale is based on the non-absorbed sizes, change as appropriate
   // we can't determine this parameter from the info we have, it's an input
   int D_non_latent = 128;
   float scale = 1.0 / sqrt(1.0 * (D_non_latent + D_rope));
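
One way to do this, as a sketch (the explicit scale argument on the op is an assumption about the eventual interface), is to compute the softmax scale on the Python side from the non-absorbed head size and pass it down:

import math

# The kernel cannot recover the non-absorbed head size from its inputs,
# so the caller computes the softmax scale and passes it in explicitly.
d_non_latent, d_rope = 128, 64  # illustrative DeepSeek-style head dims
scale = 1.0 / math.sqrt(d_non_latent + d_rope)

# Hypothetical call with the scale passed in rather than hardcoded in C++.
# out = ops.cutlass_mla_decode(q, kv_cache, block_table, seq_lens, scale)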

Contributor Author

Done. PTAL.

Collaborator

nit: we should pass in the scale (from Python) to avoid having to hardcode D_non_latent

Contributor Author

Done.

Collaborator

nit: maybe for a future PR: if Q_ptr (q_nope) and Q_ptr + D_latent (q_pe) were passed as separate tensors (assuming the kernel is OK with these being arbitrary pointers with separate strides, which this interface suggests it is), then we could save the cat of q_nope and q_pe, e.g.:

q = torch.cat([q_nope, q_pe], dim=-1).unsqueeze(1)  # Add seqlen dim of 1 (decode)
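
A rough sketch of the suggested alternative (the separate-tensor arguments are hypothetical, to show the idea rather than a final interface):

import torch

num_tokens, num_heads, d_latent, d_rope = 8, 128, 512, 64
q_nope = torch.randn(num_tokens, num_heads, d_latent, dtype=torch.bfloat16)
q_pe = torch.randn(num_tokens, num_heads, d_rope, dtype=torch.bfloat16)

# Instead of materializing torch.cat([q_nope, q_pe], dim=-1), hand both tensors
# to the op and let the kernel read them through separate pointers and strides.
# Hypothetical call:
# out = ops.cutlass_mla_decode(q_nope.unsqueeze(1), q_pe.unsqueeze(1), kv_cache, ...)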

Contributor Author

Right, currently we follow the CUTLASS example, which only supports a single query tensor. If this is needed or is common practice, we can ask for an improvement.


Contributor Author

Tried the separate tensors and it works. Updated the PR. PTAL.

@kaixih force-pushed the kaixih/cutlass_mla branch from 4d28698 to 596c81a on April 26, 2025 08:17
@kaixih force-pushed the kaixih/cutlass_mla branch from 596c81a to 985034c on April 26, 2025 08:21
kaixih added 2 commits April 26, 2025 09:10
@LucasWilkinson enabled auto-merge (squash) on April 26, 2025 20:41
@github-actions bot added the ready label on Apr 26, 2025
@LucasWilkinson (Collaborator) commented Apr 26, 2025

Landing this to help Blackwell perf, but would like to follow up on #16032 (comment) in a future PR, potentially.

auto-merge was automatically disabled April 26, 2025 23:02

Head branch was pushed to by a user without write access

@vllm-bot merged commit ed7a29d into vllm-project:main on Apr 27, 2025
69 of 72 checks passed
jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Apr 29, 2025
Alexei-V-Ivanov-AMD added a commit to Alexei-V-Ivanov-AMD/vllm that referenced this pull request Apr 29, 2025: "…#16032)"

This reverts commit ed7a29d.
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
adobrzyn pushed a commit to HabanaAI/vllm-fork that referenced this pull request Apr 30, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025