FusedMoE support for the Transformers backend
#22650
Conversation
Code Review
This pull request introduces support for Mixture-of-Experts (MoE) models within the Transformers backend. The changes are well-structured, including refactoring ModelConfig to handle MoE-specific properties, registering new MoE model classes, and adding a TransformersMoEBase class to manage the replacement of standard MoE layers with vLLM's FusedMoE. A critical issue was found where top_k and intermediate_size for the FusedMoE layer were hardcoded, which would cause issues for many MoE models. A detailed comment with a suggested fix has been provided to address this.
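For context, a minimal sketch of the direction such a fix could take: read the routing and expert sizes from the HF config instead of hardcoding them. This is not the PR's actual code; the attribute names below (`num_experts_per_tok`, `moe_intermediate_size`, `intermediate_size`) are common in HF MoE configs but vary per model, so treat them as assumptions.

```python
# Hypothetical sketch: derive FusedMoE sizing from the HF config instead of
# hardcoding it. Attribute names vary between MoE architectures.
from transformers import PretrainedConfig


def moe_dims_from_config(hf_config: PretrainedConfig) -> tuple[int, int]:
    """Return (top_k, intermediate_size) for building a FusedMoE layer."""
    # Number of experts activated per token; many HF MoE configs call this
    # `num_experts_per_tok` (e.g. Mixtral, Qwen MoE).
    top_k = getattr(hf_config, "num_experts_per_tok")
    # Per-expert FFN width; some configs use `moe_intermediate_size`,
    # others reuse `intermediate_size`.
    intermediate_size = getattr(
        hf_config, "moe_intermediate_size", None
    ) or hf_config.intermediate_size
    return top_k, intermediate_size
```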
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
ArthurZucker left a comment
looks nice already!
This pull request has merge conflicts that must be resolved before it can be merged.
ProExpertProg left a comment
Not very familiar with the backend, but I was curious; just some thoughts.
Thanks for the interest! Slowly but surely I'm bolting on more and more functionality to it 🚀
MoE models do already work with the Transformers backend, but the performance is not ideal because `FusedMoE` is not used. This means that each expert gets a dedicated linear layer which is executed in a for loop on the Transformers side.

Depends on
- PRs ensuring `FusedMoE.forward` is called correctly in the Transformers modelling code

Changes
This PR adds:
- MoE model class selection in `ModelConfig._get_transformers_backend_cls` based on the number of experts reported by the HF config
- `FusedMoE.weight_loader` handling to `AutoWeightsLoader.load_weights`, which is only triggered if `expert_mapping` is passed to `AutoWeightsLoader.load_weights` (i.e. it's opt-in, so it won't break any existing custom weight loading)
- `TransformersMoEBase`, which can be subclassed and adds the necessary logic to substitute in `FusedMoE` (a sketch of the idea follows this list):
  - modules named `experts` which are instances of `torch.nn.ModuleList` are replaced with `FusedMoE` modules
  - `load_weights` passes the `expert_mapping` directly to `AutoWeightsLoader.load_weights`
- `TransformersMoEModel`, `TransformersMoEForCausalLM` and `TransformersMoEForMultimodalLM`, which leverage Python's MRO to add the MoE logic to their non-MoE counterparts
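The core substitution can be pictured with a small sketch. This is not the PR's implementation; it only illustrates walking the Transformers module tree and swapping an `nn.ModuleList` of experts for a single `FusedMoE` layer, with the HF config attribute names and the `FusedMoE` constructor arguments treated as assumptions.

```python
# Illustrative sketch only: replace per-expert nn.ModuleList blocks with a
# single fused MoE layer. The config attribute names and constructor
# arguments shown here are assumptions, not the PR's exact code.
import torch.nn as nn

from vllm.model_executor.layers.fused_moe import FusedMoE


def replace_experts(model: nn.Module, hf_config) -> None:
    # Collect targets first so we do not mutate the tree while iterating it.
    targets = [
        (module, child_name)
        for module in model.modules()
        for child_name, child in module.named_children()
        if child_name == "experts" and isinstance(child, nn.ModuleList)
    ]
    for module, child_name in targets:
        setattr(
            module,
            child_name,
            FusedMoE(
                num_experts=hf_config.num_local_experts,
                top_k=hf_config.num_experts_per_tok,
                hidden_size=hf_config.hidden_size,
                intermediate_size=hf_config.intermediate_size,
                reduce_results=True,
            ),
        )
```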
Performance

Before this PR, the Transformers backend for MoE models was 24x slower than the vLLM implementation. This is largely due to the Transformers modelling code not being CUDA graph compilable (`--enforce-eager` was needed) and a `for` loop iterating over the experts. This should be fixed by the PRs we depend on, though.

As you can see from the results below, the average performance of the Transformers backend is <3% worse than the dedicated vLLM implementation!
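As a usage illustration (not taken from the PR), this is roughly how one might force an MoE model through the Transformers backend with vLLM's offline API; the model name and the `model_impl="transformers"` argument are assumptions made for the sake of the example.

```python
# Hypothetical usage sketch: run an MoE model via the Transformers backend.
# The model name and the model_impl argument are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    model_impl="transformers",  # select the Transformers backend
)
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```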
Serve commands:
Benchmark command:
Results: