Fix MTP with deepep_low_latency #25904
Conversation
Code Review
This pull request addresses an AssertionError that occurs when using speculative decoding with the deepep_low_latency backend on models featuring shared experts. The root cause was the forward_impl_chunked method in the FusedMoE layer lacking the logic to handle shared experts, which was already present in the non-chunked forward_impl method. The changes introduce this missing logic, ensuring consistent behavior and resolving the crash. The fix appears correct and aligns the chunked and non-chunked execution paths.
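To make the described root cause concrete, below is a minimal, self-contained sketch of the idea only, not the actual vLLM code or this PR's diff. The class name ChunkedMoESketch, the plain callable experts/shared_experts modules, and the omission of router logits and quantization details are simplifications assumed for illustration; what it shows is the non-chunked path returning a (shared_output, fused_output) pair when shared experts exist, and the chunked path being brought in line with it.

```python
# Illustrative sketch only -- not the vLLM FusedMoE implementation.
import torch
import torch.nn as nn


class ChunkedMoESketch(nn.Module):
    def __init__(self, experts: nn.Module, shared_experts: nn.Module | None,
                 chunk_size: int):
        super().__init__()
        self.experts = experts                  # stands in for the fused/routed experts
        self.shared_experts = shared_experts    # None when the model has no shared experts
        self.chunk_size = chunk_size

    def forward_impl(self, hidden_states: torch.Tensor):
        # Non-chunked path: already handles shared experts.
        fused_out = self.experts(hidden_states)
        if self.shared_experts is not None:
            return self.shared_experts(hidden_states), fused_out
        return fused_out

    def forward_impl_chunked(self, hidden_states: torch.Tensor):
        # Chunked path: the fix mirrors the shared-experts branch from
        # forward_impl inside the per-chunk loop.
        fused_chunks, shared_chunks = [], []
        for chunk in hidden_states.split(self.chunk_size, dim=0):
            fused_chunks.append(self.experts(chunk))
            if self.shared_experts is not None:
                shared_chunks.append(self.shared_experts(chunk))
        fused_out = torch.cat(fused_chunks, dim=0)
        if self.shared_experts is not None:
            # Match the non-chunked path's (shared_output, fused_output) shape.
            return torch.cat(shared_chunks, dim=0), fused_out
        return fused_out
```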
LucasWilkinson left a comment
why does MTP mean we don't end up using the modular kernel?
@LucasWilkinson the modular kernel is still used for the standard transformer blocks. With MTP, though, when running the …
LucasWilkinson left a comment
Thanks!
Purpose
Using MTP with VLLM_ALL2ALL_BACKEND=deepep_low_latency causes an AssertionError due to incorrect handling of shared experts. This PR is a fix, carrying the logic from FusedMoE.forward_impl over to FusedMoE.forward_impl_chunked.
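For context, a hedged sketch of how such a setup might be launched offline. This is not the PR's actual test plan (which is not reproduced above); the model name, parallelism sizes, and speculative_config contents are assumptions for illustration and the exact MTP configuration should be taken from the vLLM documentation.

```python
# Hypothetical reproduction sketch; values below are assumptions, not from this PR.
import os

# Backend that previously triggered the AssertionError with MTP.
os.environ["VLLM_ALL2ALL_BACKEND"] = "deepep_low_latency"

from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",   # assumption: an MoE model with shared experts and MTP weights
    tensor_parallel_size=8,            # assumption: sized for the chosen model
    enable_expert_parallel=True,       # assumption: required for the DeepEP all2all backends
    speculative_config={               # assumption: MTP-style speculative decoding
        "num_speculative_tokens": 1,
    },
)
outputs = llm.generate(["Hello, world"])
print(outputs[0].outputs[0].text)
```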
Test Plan
Test Result
Previously hits AssertionError during startup, now starts up successfully.