[Bugfix] Fix the Eagle3 inference failure issue #4559
base: main
Conversation
Signed-off-by: sunchendd <sunchendong@xfusion.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.
Code Review
This pull request fixes an inference failure with Eagle3 speculative decoding. The changes primarily involve updating the attention mask generation logic in `get_splitfuse_attn_mask` to handle different scenarios correctly, adjusting the attention state for Eagle3 in the model runner, and modifying how the attention mask is obtained in the Eagle proposer. The core logic of the fix seems sound.

I've identified a critical issue in `vllm_ascend/worker/model_runner_v1.py` where the logic for determining the attention state for Eagle3 speculative decoding could be incorrect, potentially leading to the use of a wrong attention mechanism. I've provided a detailed comment and a suggested fix below.
```python
if self.drafter and self.drafter.name in (SpecDcodeType.EAGLE,
                                          SpecDcodeType.EAGLE3):
    attn_state = AscendAttentionState.ChunkedPrefill
else:
    attn_state = AscendAttentionState.SpecDecoding
```
The logic to determine the attention state for Eagle3 speculative decoding appears to be incorrect. Currently, it sets `attn_state` to `AscendAttentionState.ChunkedPrefill` for Eagle and Eagle3, and `AscendAttentionState.SpecDecoding` for other drafters. However, `ChunkedPrefill` is typically used for prefill stages, not for speculative decoding, which happens after prefill. For speculative decoding, `AscendAttentionState.SpecDecoding` should be used to ensure the correct attention mechanism is applied. Using `ChunkedPrefill` here could lead to incorrect attention masks and potential failures during the decoding phase.
Suggested change:

```diff
 if self.drafter and self.drafter.name in (SpecDcodeType.EAGLE,
                                           SpecDcodeType.EAGLE3):
-    attn_state = AscendAttentionState.ChunkedPrefill
+    attn_state = AscendAttentionState.SpecDecoding
 else:
     attn_state = AscendAttentionState.SpecDecoding
```
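Note that with this suggestion both branches assign the same value, so the conditional no longer affects the result. A minimal runnable sketch of the before/after behavior, using stand-in enums in place of vllm-ascend's real `AscendAttentionState` and `SpecDcodeType` classes (names mirror the diff; the stubs are illustrative, not the actual definitions):

```python
from enum import Enum, auto

# Stand-ins for the vllm_ascend enums referenced in the diff above.
class AscendAttentionState(Enum):
    ChunkedPrefill = auto()
    SpecDecoding = auto()

class SpecDcodeType(Enum):
    EAGLE = auto()
    EAGLE3 = auto()
    MTP = auto()  # hypothetical non-Eagle drafter for the else branch

def attn_state_original(drafter_name: SpecDcodeType) -> AscendAttentionState:
    # Pre-review logic: Eagle/Eagle3 got ChunkedPrefill during decoding.
    if drafter_name in (SpecDcodeType.EAGLE, SpecDcodeType.EAGLE3):
        return AscendAttentionState.ChunkedPrefill
    return AscendAttentionState.SpecDecoding

def attn_state_suggested(drafter_name: SpecDcodeType) -> AscendAttentionState:
    # Reviewer's suggestion: both branches yield SpecDecoding, so the
    # Eagle/Eagle3 check collapses to a single assignment.
    return AscendAttentionState.SpecDecoding

print(attn_state_original(SpecDcodeType.EAGLE3))   # ChunkedPrefill
print(attn_state_suggested(SpecDcodeType.EAGLE3))  # SpecDecoding
```

If the suggestion is taken as-is, the `if`/`else` could simply be replaced by `attn_state = AscendAttentionState.SpecDecoding`.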
What this PR does / why we need it?
Fix the Eagle3 inference failure issue.
error message: "EngineCore encountered an issue. See stack trace (above) for the root cause."
Fixes #4323
How was this patch tested?
```shell
vllm serve /nfs/1_AscendPackage/05_weights_public/Qwen3-32B \
  --served-model-name Qwen3-32B \
  -tp 4 \
  --host "0.0.0.0" \
  --port "8000" \
  --trust-remote-code \
  --speculative-config '{"method":"eagle3","model":"/home/scd/qwen3_32b_eagle3/","num_speculative_tokens":4,"draft_tensor_parallel_size":1}' \
  --max-num-batched-tokens 4096 \
  --max-model-len 4096
```

vLLM version: v0.11.0
vLLM-ascend version: v0.11.0rc2