Add FP16 configs and Support expert map #57
Conversation
| 👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add the `ready` label to the PR. 🚀 |
| @ElizaWszola @dsikka can you please take a look at the expert_map support part of the PR? Thanks 🙌 |
        
          
tests/kernels/test_cutlass_moe.py (Outdated)
Mostly refactor existing tests and add EP tests.
@ElizaWszola I see this function being called below in this file in fused_moe::__init__() - I updated this code to not raise any errors, as the expected behavior seems to be to fall back to the Triton impl. PTAL! Thanks!
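As a side note, here is a minimal sketch of the fallback pattern being described; the function name and the alignment check below are illustrative placeholders, not the actual vLLM API:

```python
import torch


def cutlass_moe_supported(hidden_size: int, intermediate_size: int) -> bool:
    """Illustrative capability check: return False instead of raising, so the
    caller can quietly fall back to the Triton MoE implementation."""
    # Hypothetical alignment requirement for the CUTLASS grouped-GEMM path.
    if hidden_size % 16 != 0 or intermediate_size % 16 != 0:
        return False
    return torch.cuda.is_available()


# Caller sketch: select the CUTLASS path only when supported; otherwise
# fall back to the Triton kernels instead of raising an error.
use_cutlass = cutlass_moe_supported(hidden_size=4096, intermediate_size=14336)
```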
good update!
@dsikka Can you please take a look at the compressed_tensors changes? Thanks!
Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>
Force-pushed from f1b859b to 1168828.
      c1 = torch.empty((m * topk, n * 2), device=device, dtype=out_dtype)
    - c2 = torch.empty((m * topk, k), device=device, dtype=out_dtype)
    + c2 = torch.zeros((m * topk, k), device=device, dtype=out_dtype)
Is this needed for correctness? `empty` should be faster than `zeros`.
OK, I see this now:
      // c2 is initialized to zeros, therefore by setting the output_permutation
      // to num_tokens, we are guaranteed to fill the moe outputs to zero
      // for "invalid" topk_ids.
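For intuition, here is a tiny standalone sketch (shapes, row indices, and the permutation are made up, not the kernel's actual data) of why c2 must start as zeros: rows that the grouped GEMM never writes are read back through the output permutation for "invalid" topk_ids, so they only come out as zero if the buffer was zero-initialized.

```python
import torch

m, topk, k = 3, 2, 4
c2 = torch.zeros((m * topk, k))            # with torch.empty these rows would hold garbage
written_rows = torch.tensor([0, 1, 2, 3])  # rows actually produced by the grouped GEMM
c2[written_rows] = torch.randn(len(written_rows), k)

# Invalid topk_ids are pointed at a row the GEMM never wrote; gathering
# through the permutation then yields zeros for those output slots.
out_perm = torch.tensor([0, 1, 2, 3, 5, 5])
moe_out = c2[out_perm]
assert torch.all(moe_out[4:] == 0)
```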
      a_map = torch.zeros((local_topk_ids.numel()),
                          dtype=torch.int32,
                          device=device)
      c_map = torch.zeros((local_topk_ids.numel()),
                          dtype=torch.int32,
                          device=device)
What about these two? Why do they need to be zeros?
`a_map` has to be zeros because we don't fill the indices related to "invalid" topk ids. `c_map` can actually be empty, since we fill all of its indices in `get_cutlass_moe_mm_data`. I'll make the change.
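A small standalone sketch of that distinction (the values and mapping are illustrative only): `a_map` is only partially written, since slots for "invalid" topk ids are skipped, so untouched slots must already be zero, whereas every slot of `c_map` gets written, so `torch.empty` suffices for it.

```python
import torch

local_topk_ids = torch.tensor([2, -1, 0, 1])  # -1 marks an "invalid" topk id
valid = local_topk_ids >= 0

# Partially filled: invalid slots are never touched, so start from zeros.
a_map = torch.zeros(local_topk_ids.numel(), dtype=torch.int32)
a_map[valid] = torch.arange(int(valid.sum()), dtype=torch.int32)

# Fully filled: every position is overwritten, so torch.empty is enough.
c_map = torch.empty(local_topk_ids.numel(), dtype=torch.int32)
c_map[:] = torch.argsort(local_topk_ids).to(torch.int32)
```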
Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>
Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>
      uint32_t const n = out_tensors.size(1);
      uint32_t const k = a_tensors.size(1);
These vars look unused now; it's OK to remove them.
| LGTM! I added a minor comment. |
| Merged through command line, I think it's safe to close now | 
| Merged through command line. |
Add new FP16 configs and support expert_map for EP.
E2E benchmark numbers: link
Micro benchmarks: link (there are some bald spots I am looking into).