[Bugfix] Fix Shared Expert/Zero expert code in FusedMoE.process_chunk #25698
Conversation
Code Review
This pull request fixes an issue with shared experts and zero experts in FusedMoE.process_chunk. The changes separate the handling of zero-expert outputs from shared-expert outputs, which was previously conflated and led to incorrect behavior; in particular, the tuple returned when shared experts are in use is now handled correctly. However, the newly introduced assertion assert len(final_hidden_states) == 1 is incorrect and will likely trigger runtime assertion errors. I have left a critical comment with a suggested fix for this issue.
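For context, here is a minimal sketch of the distinction the review describes. The names (process_chunk_sketch, experts_output, has_shared_experts) are hypothetical and this is not the actual vLLM FusedMoE code; it only illustrates why the shared-expert tuple must be unpacked separately from the routed/zero-expert path.

```python
# Illustrative sketch only; hypothetical names, not the vLLM
# FusedMoE.process_chunk implementation.
from typing import Union

import torch


def process_chunk_sketch(
    experts_output: Union[torch.Tensor, tuple[torch.Tensor, torch.Tensor]],
    has_shared_experts: bool,
) -> tuple[torch.Tensor, ...]:
    """Keep shared-expert handling separate from the routed/zero-expert path.

    When shared experts are in use, the expert computation is assumed to
    return a tuple (shared_expert_output, routed_expert_output); otherwise a
    single routed tensor comes back. Conflating the two cases (for example,
    asserting that exactly one output is produced) breaks the shared-expert
    path.
    """
    if has_shared_experts:
        # Tuple case: unpack both tensors instead of treating the tuple
        # as a single output.
        assert isinstance(experts_output, tuple) and len(experts_output) == 2
        shared_out, routed_out = experts_output
        return shared_out, routed_out

    # Routed-only (including zero-expert) case: a single tensor is expected.
    assert isinstance(experts_output, torch.Tensor)
    return (experts_output,)
```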
Signed-off-by: Sage Moore <sage@neuralmagic.com>
CC @OftenDream. I'm trying to get set up on a server large enough to run
…vllm-project#25698) Signed-off-by: Sage Moore <sage@neuralmagic.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
…#25698) Signed-off-by: Sage Moore <sage@neuralmagic.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
…vllm-project#25698) Signed-off-by: Sage Moore <sage@neuralmagic.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Purpose
#23991 appears to break Deepseek V2 Lite, and presumably other models that use shared experts. This PR should fix this issue.
Test Results
Deepseek V2 Lite
VLLM_ALL2ALL_BACKEND=deepep_low_latency vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --data-parallel-size 2 --enable-expert-parallel --gpu-memory-utilization 0.75 --compilation-config '{"cudagraph_mode": "full_decode_only"}'

Before
After
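For reference, a minimal sketch of how one might sanity-check the served model before and after the fix, assuming the default OpenAI-compatible server that vllm serve exposes on port 8000 (the prompt and parameters below are arbitrary, not from the original test):

```python
# Illustrative sanity check against a locally running `vllm serve` instance;
# assumes the default OpenAI-compatible endpoint at http://localhost:8000.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "deepseek-ai/DeepSeek-V2-Lite",
        "prompt": "The capital of France is",
        "max_tokens": 16,
        "temperature": 0.0,
    },
    timeout=120,
)
resp.raise_for_status()
# A coherent completion here suggests the shared-expert path is no longer
# producing corrupted output.
print(resp.json()["choices"][0]["text"])
```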
Longcat Flash