[Kernel] Enable FP16 and BF16 CUTLASS MoE kernels #15932

ElizaWszola · 2025-04-02T06:15:49Z

Implement BF16 and FP16 weight support in CUTLASS MoE kernels. Tested with

from vllm import LLM

llm = LLM("mistralai/Mixtral-8x7B-Instruct-v0.1",
          tensor_parallel_size=2,
)

and

import torch
from vllm import LLM

llm = LLM("mistralai/Mixtral-8x7B-Instruct-v0.1",
          tensor_parallel_size=2,
          dtype=torch.float16,
)

Signed-off-by: ElizaWszola <ewszola@redhat.com>

github-actions · 2025-04-02T06:16:01Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Signed-off-by: ElizaWszola <ewszola@redhat.com>

csrc/ops.h

tlrmchlsmth

General comment: using the name fp16 for operations that handle both fp16 and bf16 is confusing. We should either pick a more general name (16bit?) or, better, append fp8 to the names of ops that handle fp8 and remove fp16 from the names altogether.

csrc/quantization/cutlass_w8a8/moe/get_group_starts.cuh

csrc/quantization/cutlass_w8a8/moe/grouped_mm_fp16_c3x.cuh

csrc/quantization/cutlass_w8a8/moe/moe_data.cu

tests/kernels/test_cutlass_moe.py

csrc/ops.h

Signed-off-by: ElizaWszola <ewszola@redhat.com>

tlrmchlsmth · 2025-04-04T16:19:27Z

vllm/model_executor/layers/fused_moe/layer.py

+    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
+        super().process_weights_after_loading(layer)
+
+    # TODO half()


What is the TODO? Resolve before landing?

mergify · 2025-04-04T16:20:00Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ElizaWszola.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

tlrmchlsmth · 2025-04-04T18:56:46Z

csrc/cutlass_moe/moe_mm_c3x.cu

It looks like the 16-bit configs are the same as the fp8 configs -- these need to be re-tuned for the fp16/bf16 case
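One reason fp8 tile configs may not carry over unchanged: a 16-bit element occupies two bytes instead of one, so the same tile shape stages twice as many operand bytes per pipeline stage. The sketch below is only that arithmetic. The function name and the tile shape are hypothetical and are not taken from this PR's configs.

```python
def operand_tile_bytes(tile_m: int, tile_n: int, tile_k: int,
                       elem_bytes: int, stages: int = 1) -> int:
    """Bytes needed to stage the A (M x K) and B (N x K) operand tiles.

    Illustrative arithmetic only: it ignores epilogue storage, padding,
    and any other buffers a real kernel allocates.
    """
    a_elems = tile_m * tile_k
    b_elems = tile_n * tile_k
    return (a_elems + b_elems) * elem_bytes * stages


# Hypothetical tile shape, chosen only to show the scaling.
tile = (64, 128, 128)
fp8_bytes = operand_tile_bytes(*tile, elem_bytes=1)    # 1-byte elements
half_bytes = operand_tile_bytes(*tile, elem_bytes=2)   # fp16 / bf16
print(fp8_bytes, half_bytes)
```

With the staging footprint doubled, the tile shape and stage count that performed best for fp8 are not guaranteed to be the best choice for fp16/bf16, which is consistent with the request to re-tune.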

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

bnellnm · 2025-04-04T21:00:16Z

tests/kernels/test_cutlass_moe.py

+def run_8_bit(a: torch.Tensor, a_scale: torch.Tensor, w1_q: torch.Tensor,
+              w2_q: torch.Tensor, w1_scale: torch.Tensor,
+              w2_scale: torch.Tensor, topk_weights: torch.Tensor,
+              topk_ids: torch.Tensor, ab_strides1: torch.Tensor,
+              c_strides1: torch.Tensor, ab_strides2: torch.Tensor,
+              c_strides2: torch.Tensor):
    with set_current_vllm_config(
            VllmConfig(parallel_config=ParallelConfig(
                pipeline_parallel_size=1))):
-        return cutlass_moe_fp8(a,
-                               w1_q,
-                               w2_q,
-                               w1_scale,
-                               w2_scale,
-                               topk_weights,
-                               topk_ids,
-                               ab_strides1,
-                               c_strides1,
-                               ab_strides2,
-                               c_strides2,
-                               a1_scale=a_scale)
-
-
-@pytest.mark.parametrize("m", [2, 64, 224])
-@pytest.mark.parametrize("n", [1024, 3072])
-@pytest.mark.parametrize("k", [1024, 1536])
+        return cutlass_moe(a,
+                           w1_q,
+                           w2_q,
+                           topk_weights,
+                           topk_ids,
+                           ab_strides1,
+                           c_strides1,
+                           ab_strides2,
+                           c_strides2,
+                           w1_scale=w1_scale,
+                           w2_scale=w2_scale,
+                           a1_scale=a_scale)
+
+
+def run_16_bit(a: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor,
+               topk_weights: torch.Tensor, topk_ids: torch.Tensor,
+               ab_strides1: torch.Tensor, c_strides1: torch.Tensor,
+               ab_strides2: torch.Tensor, c_strides2: torch.Tensor):
+    with set_current_vllm_config(
+            VllmConfig(parallel_config=ParallelConfig(
+                pipeline_parallel_size=1))):
+        return cutlass_moe(a, w1, w2, topk_weights, topk_ids, ab_strides1,
+                           c_strides1, ab_strides2, c_strides2)
+


nit: Could these be combined with the scales made optional and defaulted to None for the fp16 case? I don't have strong feelings about this though.

Yeah, probably
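A minimal sketch of what the combined helper suggested above might look like, with the scales optional and defaulting to None for the 16-bit case. The name `run_moe` and the `moe_fn` parameter are hypothetical: `moe_fn` stands in for `cutlass_moe` so the sketch runs without vLLM, and the `set_current_vllm_config(...)` context from the diff is omitted.

```python
from typing import Any, Callable, Optional


def run_moe(moe_fn: Callable[..., Any],
            a: Any, w1: Any, w2: Any,
            topk_weights: Any, topk_ids: Any,
            ab_strides1: Any, c_strides1: Any,
            ab_strides2: Any, c_strides2: Any,
            w1_scale: Optional[Any] = None,
            w2_scale: Optional[Any] = None,
            a_scale: Optional[Any] = None) -> Any:
    """Single entry point for both the 8-bit and the 16-bit test paths."""
    # Forward only the scales that were supplied, using the keyword names
    # from the diff (w1_scale, w2_scale, a1_scale). The 16-bit path passes
    # none of them.
    scales = {"w1_scale": w1_scale, "w2_scale": w2_scale, "a1_scale": a_scale}
    kwargs = {name: value for name, value in scales.items()
              if value is not None}
    return moe_fn(a, w1, w2, topk_weights, topk_ids,
                  ab_strides1, c_strides1, ab_strides2, c_strides2,
                  **kwargs)
```

In the test file, the 8-bit case would call this with all three scales and the 16-bit case with none, replacing the separate `run_8_bit` and `run_16_bit` helpers.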

bnellnm · 2025-04-04T21:02:41Z

tests/kernels/test_cutlass_moe.py

+        print(triton_output)
+        print(cutlass_output)
+        print("*")


nit: Do we need these prints?

The tolerances in the tests are a bit high, so I use these prints to manually examine how far off the values are when I'm close to the threshold.

I think the prints are fine for debugging but I don't think we should push with them enabled.
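One way to keep this diagnostic without unconditional prints is to log a compact error summary at debug level, so it only appears when requested (for example with pytest's `--log-level=DEBUG`). This is a sketch under assumptions: the helper names are hypothetical, and plain Python floats stand in for the flattened `triton_output` and `cutlass_output` tensors.

```python
import logging
from typing import Sequence, Tuple

logger = logging.getLogger("test_cutlass_moe")


def max_errors(ref: Sequence[float], out: Sequence[float],
               eps: float = 1e-12) -> Tuple[float, float]:
    """Return (max absolute error, max relative error) of out against ref."""
    if len(ref) != len(out):
        raise ValueError("ref and out must have the same length")
    max_abs = 0.0
    max_rel = 0.0
    for r, o in zip(ref, out):
        abs_err = abs(r - o)
        max_abs = max(max_abs, abs_err)
        # eps guards against division by zero for reference values of 0.
        max_rel = max(max_rel, abs_err / max(abs(r), eps))
    return max_abs, max_rel


def report_mismatch(ref: Sequence[float],
                    out: Sequence[float]) -> Tuple[float, float]:
    """Log the error summary at DEBUG level instead of printing tensors."""
    max_abs, max_rel = max_errors(ref, out)
    logger.debug("max abs err %.3e, max rel err %.3e", max_abs, max_rel)
    return max_abs, max_rel
```

A two-number summary shows how close a run is to the tolerance without dumping whole tensors, and it stays silent in normal CI runs.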

bnellnm · 2025-04-04T21:02:57Z

tests/kernels/test_cutlass_moe.py

+        print(triton_output)
+        print(cutlass_output)
+        print("*")


Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>

tlrmchlsmth

@varun-sundar-rabindranath do you have e2e benchmark results that we could share before landing this?

tlrmchlsmth

Oh I see the new FP16/BF16 configs still aren't in -- Please LMK when this is ready!

varun-sundar-rabindranath · 2025-04-12T21:07:49Z

@tlrmchlsmth - The changes are in neuralmagic#57, waiting to be merged into the neuralmagic:cutlass-moe-bf16-weights branch. I am still collecting the e2e and microbenchmark results.

Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>

Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>

varun-sundar-rabindranath · 2025-04-18T17:27:33Z

Factoring out expert_map support into a separate PR #16861

Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>

mergify · 2025-04-27T13:34:29Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ElizaWszola.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: ElizaWszola <ewszola@redhat.com>

mergify · 2025-05-02T22:28:30Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ElizaWszola.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

github-actions · 2025-09-22T02:11:39Z

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

github-actions · 2025-10-22T02:12:29Z

This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!

Enable BF16 weights in CUTLASS MoE

aa2e772

Signed-off-by: ElizaWszola <ewszola@redhat.com>

ElizaWszola added 3 commits April 2, 2025 12:18

cleanup tests

482e9ad

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Pick the right model based on arch, activation and expert_map

4e93c7f

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Float16 support

11d640f

Signed-off-by: ElizaWszola <ewszola@redhat.com>

tlrmchlsmth reviewed Apr 2, 2025

csrc/ops.h (outdated, resolved)

tlrmchlsmth changed the title ~~[WIP][Kernel] Enable BF16 weights in CUTLASS MoE~~ [WIP][Kernel] Enable FP16 and BF16 CUTLASS MoE kernels Apr 2, 2025

tlrmchlsmth reviewed Apr 2, 2025

csrc/quantization/cutlass_w8a8/moe/get_group_starts.cuh (outdated, resolved)

csrc/quantization/cutlass_w8a8/moe/get_group_starts.cuh (outdated, resolved)

tlrmchlsmth reviewed Apr 2, 2025

csrc/quantization/cutlass_w8a8/moe/grouped_mm_fp16_c3x.cuh (outdated, resolved)

csrc/quantization/cutlass_w8a8/moe/moe_data.cu (resolved)

tests/kernels/test_cutlass_moe.py (outdated, resolved)

csrc/ops.h (outdated, resolved)

Refactor, merge some common 16- and 8-bit functionalities

ab0143b

Signed-off-by: ElizaWszola <ewszola@redhat.com>

mergify bot added the ci/build label Apr 3, 2025

ElizaWszola added 3 commits April 3, 2025 13:16

mnk_factors in unit tests

1c74067

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Move cutlass moe source files outside quantized w8a8 directory

6fa2f6a

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Add separate entry file to cutlass moe

8160305

Signed-off-by: ElizaWszola <ewszola@redhat.com>

ElizaWszola marked this pull request as ready for review April 3, 2025 14:21

ElizaWszola requested review from WoosukKwon, mgoin and robertgshaw2-redhat as code owners April 3, 2025 14:21

ElizaWszola changed the title ~~[WIP][Kernel] Enable FP16 and BF16 CUTLASS MoE kernels~~ [Kernel] Enable FP16 and BF16 CUTLASS MoE kernels Apr 3, 2025

tlrmchlsmth reviewed Apr 4, 2025

mergify bot added the needs-rebase label Apr 4, 2025

tlrmchlsmth reviewed Apr 4, 2025

tlrmchlsmth added 2 commits April 4, 2025 19:47

some cleanup

c25904d

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

Merge branch 'main' into cutlass-moe-bf16-weights

1a31242

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

mergify bot removed the needs-rebase label Apr 4, 2025

fixes

31c4c80

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>

bnellnm reviewed Apr 4, 2025

Merge branch 'main' into cutlass-moe-bf16-weights

5ab56cb

Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>

varun-sundar-rabindranath force-pushed the cutlass-moe-bf16-weights branch from dda21cc to 5ab56cb Compare April 12, 2025 01:44

varun sundar rabindranath added 6 commits April 12, 2025 02:58

fix plumbing

847150a

Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>

fp16 configs and expert map support

1168828

Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>

fix lint

f930b1d

Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>

fix lint

eecfb15

Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>

fix lint

fda5f44

Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>

fix lint

2e8f3ac

Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>

tlrmchlsmth reviewed Apr 12, 2025

varun sundar rabindranath and others added 5 commits April 13, 2025 03:28

c_map zeros -> empty

e0c3a51

Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>

add expert parallel to torch hash

5b9ab4f

Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>

Comment out output prints in tests

99edc71

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Add more moe benchmark shapes

e471806

Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>

Merge branch 'main' into cutlass-moe-bf16-weights

5263d8d

update benchmark_cutlass_moe

2256ab4

Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>

mergify bot added the needs-rebase label Apr 27, 2025

Merge branch 'main' into cutlass-moe-bf16-weights

4bed12a

Signed-off-by: ElizaWszola <ewszola@redhat.com>

mergify bot removed the needs-rebase label Apr 29, 2025

Format, cleanup

d5995b2

Signed-off-by: ElizaWszola <ewszola@redhat.com>

mergify bot added the needs-rebase label May 2, 2025

mergify bot added the performance Performance-related issues label Jun 23, 2025

github-actions bot added the stale Over 90 days of inactivity label Sep 22, 2025

github-actions bot closed this Oct 22, 2025
