
Conversation

Collaborator

@wenscarl wenscarl commented Jun 4, 2025

📌 This PR adds the CUTLASS implementation of fused Mixture of Experts (MoE) from TensorRT-LLM.

🔍 Related Issues

Supported data types are fp32/bf16/fp16/float8_e4m3fn/float4_e2m1.
The kernels also support expert/tensor parallelism.

This PR also exposes quantization methods for nvfp4 (a hedged usage sketch follows the test list below).
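
For orientation, here is a minimal usage sketch of what a call into the fused MoE path could look like. The import path, entry-point name, and parameters below are assumptions inferred from the tests listed under 🧪 Tests, not the confirmed API; see tests/test_trtllm_cutlass_fused.py for the authoritative signature.

```python
# Hedged sketch only: `cutlass_fused_moe` and its parameters are assumptions,
# not the confirmed flashinfer API; consult tests/test_trtllm_cutlass_fused.py.
import torch
from flashinfer.fused_moe import cutlass_fused_moe  # assumed import path

num_tokens, hidden, inter, num_experts, top_k = 16, 1024, 4096, 8, 2

x = torch.randn(num_tokens, hidden, dtype=torch.bfloat16, device="cuda")

# Router output: top-k expert ids per token and their combination weights.
selected_experts = torch.randint(
    0, num_experts, (num_tokens, top_k), dtype=torch.int32, device="cuda")
routing_weights = torch.softmax(
    torch.randn(num_tokens, top_k, device="cuda"), dim=-1)

# Per-expert weights for the two GEMMs of the FFN; a gated activation
# packs the gate and up projections into fc1, hence the 2 * inter rows.
w1 = torch.randn(num_experts, 2 * inter, hidden,
                 dtype=torch.bfloat16, device="cuda")
w2 = torch.randn(num_experts, hidden, inter,
                 dtype=torch.bfloat16, device="cuda")

out = cutlass_fused_moe(x, selected_experts, routing_weights, w1, w2,
                        output_dtype=torch.bfloat16)  # [num_tokens, hidden]
```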

🧪 Tests

  • tests/test_trtllm_cutlass_fused.py
  • tests/test_fp4_quantize.py
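
The nvfp4 quantization path mentioned above can be sketched in the same hedged spirit; the function name `fp4_quantize` and the global-scale convention are assumptions drawn from tests/test_fp4_quantize.py rather than a confirmed interface.

```python
# Hedged sketch: names and conventions are assumptions, not the confirmed
# flashinfer API; consult tests/test_fp4_quantize.py for the real interface.
import torch
import flashinfer

x = torch.randn(128, 1024, dtype=torch.bfloat16, device="cuda")

# nvfp4 pairs 4-bit e2m1 values with per-16-element e4m3 block scales plus
# one global fp32 scale; 448 is the e4m3 max, 6 is the e2m1 max (assumed
# convention for choosing the global scale).
global_scale = (448.0 * 6.0) / x.abs().max().float()

x_fp4, scale_factors = flashinfer.fp4_quantize(x, global_scale)

# Two e2m1 values are packed per byte, so the last dimension is halved.
assert x_fp4.shape[-1] == x.shape[-1] // 2
```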

Reviewer Notes

@wenscarl wenscarl force-pushed the trtllm_cutlass_fused_moe branch from bddbbef to 9957425 on June 4, 2025 03:01
@wenscarl wenscarl marked this pull request as ready for review June 4, 2025 03:02
@wenscarl wenscarl force-pushed the trtllm_cutlass_fused_moe branch from 9957425 to 8d551aa on June 4, 2025 03:07
@wenscarl wenscarl requested a review from yzh119 June 4, 2025 03:13
Collaborator

@yzh119 yzh119 left a comment


Unit tests passed on my side. @wenscarl, thanks for the huge effort; let's merge this in first!

Currently all of the C++/CUDA source code lives in the csrc directory, where we store the PyTorch C++ interface and pybind code. In future PRs we should move the kernel definitions and the framework-agnostic interfaces into the include directory, as part of the header-only library, so they can share infrastructure with other components.
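
A sketch of the split described here; the subdirectory name under include/ is illustrative, not a committed layout:

```
csrc/                  # PyTorch C++ interface and pybind code only
include/flashinfer/    # header-only, framework-agnostic kernel definitions
  fused_moe/           # illustrative home for the CUTLASS MoE kernels
```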

@yzh119 yzh119 merged commit d7e070f into flashinfer-ai:main Jun 4, 2025
2 checks passed
Edenzzzz pushed a commit to Edenzzzz/flashinfer that referenced this pull request Jun 6, 2025
…-ai#1113)


zkyue commented Jun 27, 2025

Does this only support sm100?

Anerudhan pushed a commit to Anerudhan/flashinfer that referenced this pull request Jun 28, 2025
…-ai#1113)

Collaborator

fzyzcjy commented Jul 24, 2025

> Supported data types are fp32/bf16/fp16/float8_e4m3fn/float4_e2m1

Hi, it seems nvfp4 appears in the unit tests as well as in the source code, so I would appreciate knowing whether this supports nvfp4 as well.
