Apply torch.compile to fused_moe/grouped_topk #12637
Conversation
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
Signed-off-by: Felix Marty <felmarty@amd.com>
Inspired by sgl-project/sglang@1ebe1d6. This improves generation latency for DeepSeek models, which route over a large number of experts. For DeepSeek-Coder-V2-Lite-Instruct we see an increase from 154 to 163 tokens/s (~5% improvement) at bs=1 on an H100.

script:
main:
this pr:
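To illustrate the idea, here is a minimal, hypothetical sketch (simplified names, not the exact diff in this PR): the group-limited top-k routing used by DeepSeek-style MoE layers is a chain of small softmax/top-k/masking ops, and decorating it with `torch.compile` lets those ops be fused into fewer kernels.

```python
# Hypothetical sketch of compiling a grouped top-k routing function.
# Function name, shapes, and defaults here are illustrative assumptions,
# not the exact code touched by this PR.
import torch


@torch.compile(dynamic=True)
def grouped_topk_compiled(
    gating_output: torch.Tensor,  # [num_tokens, num_experts] router logits
    topk: int,
    num_expert_group: int,
    topk_group: int,
    renormalize: bool = True,
):
    scores = torch.softmax(gating_output, dim=-1)
    num_tokens, num_experts = scores.shape

    # Score each expert group by its best expert and keep only the top groups.
    group_scores = scores.view(num_tokens, num_expert_group, -1).max(dim=-1).values
    group_idx = torch.topk(group_scores, k=topk_group, dim=-1, sorted=False).indices
    group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1.0)

    # Zero out experts that belong to pruned groups.
    score_mask = (
        group_mask.unsqueeze(-1)
        .expand(num_tokens, num_expert_group, num_experts // num_expert_group)
        .reshape(num_tokens, -1)
    )
    masked_scores = scores.masked_fill(score_mask == 0, 0.0)

    # Final per-token top-k over the surviving experts.
    topk_weights, topk_ids = torch.topk(masked_scores, k=topk, dim=-1, sorted=False)
    if renormalize:
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids


if __name__ == "__main__":
    logits = torch.randn(4, 64)  # 4 tokens, 64 experts
    w, ids = grouped_topk_compiled(logits, topk=6, num_expert_group=8, topk_group=3)
    print(w.shape, ids.shape)  # torch.Size([4, 6]) torch.Size([4, 6])
```

At bs=1 the routing overhead is a larger fraction of the step time, which is why fusing these many tiny kernels shows up as a measurable tokens/s gain.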