@sunchendd sunchendd commented Nov 29, 2025

What this PR does / why we need it?

Fix an Eagle3 inference failure.
Error message: "EngineCore encountered an issue. See stack trace (above) for the root cause."

Fixes #4323

How was this patch tested?

vllm serve /nfs/1_AscendPackage/05_weights_public/Qwen3-32B \
    --served-model-name Qwen3-32B \
    -tp 4 \
    --host "0.0.0.0" \
    --port "8000" \
    --trust-remote-code \
    --speculative-config '{"method":"eagle3","model":"/home/scd/qwen3_32b_eagle3/","num_speculative_tokens":4,"draft_tensor_parallel_size":1}' \
    --max-num-batched-tokens 4096 \
    --max-model-len 4096

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen3-32B",
        "prompt": "hi, where is the capital of France?",
        "max_tokens": 10,
        "temperature": 0
    }' | python3 -m json.tool

vLLM version: v0.11.0
vLLM-ascend version: v0.11.0rc2

Signed-off-by: sunchendd <sunchendong@xfusion.com>
@github-actions
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write a clear commit message and fill out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request fixes an inference failure with Eagle3 speculative decoding. The changes primarily involve updating the attention mask generation logic in get_splitfuse_attn_mask to handle different scenarios correctly, adjusting the attention state for Eagle3 in the model runner, and modifying how the attention mask is obtained in the Eagle proposer. The core logic of the fix seems sound. I've identified a critical issue in vllm_ascend/worker/model_runner_v1.py where the logic for determining the attention state for Eagle3 speculative decoding could be incorrect, potentially leading to the use of a wrong attention mechanism. I've provided a detailed comment and a suggested fix for this.

Comment on lines +1957 to +1961
if self.drafter and self.drafter.name in (SpecDcodeType.EAGLE,
                                          SpecDcodeType.EAGLE3):
    attn_state = AscendAttentionState.ChunkedPrefill
else:
    attn_state = AscendAttentionState.SpecDecoding

critical

The logic to determine the attention state for Eagle3 speculative decoding appears to be incorrect. Currently, it sets attn_state to AscendAttentionState.ChunkedPrefill for Eagle and Eagle3, and AscendAttentionState.SpecDecoding for other drafters. However, ChunkedPrefill is typically used for prefill stages, not for speculative decoding which happens after prefill. For speculative decoding, AscendAttentionState.SpecDecoding should be used to ensure the correct attention mechanism is applied. Using ChunkedPrefill here could lead to incorrect attention masks and potential failures during the decoding phase.

Suggested change

-if self.drafter and self.drafter.name in (SpecDcodeType.EAGLE,
-                                          SpecDcodeType.EAGLE3):
-    attn_state = AscendAttentionState.ChunkedPrefill
-else:
-    attn_state = AscendAttentionState.SpecDecoding
+if self.drafter and self.drafter.name in (SpecDcodeType.EAGLE,
+                                          SpecDcodeType.EAGLE3):
+    attn_state = AscendAttentionState.SpecDecoding
+else:
+    attn_state = AscendAttentionState.SpecDecoding
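To make the reviewer's point concrete, here is a minimal, self-contained sketch of the state-selection logic the suggestion implies. The enums and the `select_attn_state` helper are hypothetical stand-ins mirroring the snippet above, not the actual vllm-ascend source:

```python
from enum import Enum, auto

# Hypothetical stand-ins for the vllm-ascend enums quoted in the review;
# the names mirror the snippet above but this is not the project code.
class SpecDcodeType(Enum):
    EAGLE = auto()
    EAGLE3 = auto()
    MTP = auto()

class AscendAttentionState(Enum):
    ChunkedPrefill = auto()
    SpecDecoding = auto()

def select_attn_state(drafter_name: SpecDcodeType) -> AscendAttentionState:
    # Per the review suggestion, the speculative-decoding phase should use
    # the SpecDecoding attention state for Eagle/Eagle3 drafters as well.
    # Note the suggested diff leaves both branches identical, so the
    # conditional is redundant and could be collapsed to a single return.
    if drafter_name in (SpecDcodeType.EAGLE, SpecDcodeType.EAGLE3):
        return AscendAttentionState.SpecDecoding
    return AscendAttentionState.SpecDecoding

print(select_attn_state(SpecDcodeType.EAGLE3).name)  # SpecDecoding
```

As the comments note, once both branches return `SpecDecoding`, the `if`/`else` no longer serves a purpose and a follow-up cleanup could remove it.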



Development

Successfully merging this pull request may close these issues.

[Bug]: vllm-ascend 0.11.x: Qwen3-32B with Eagle3 produces garbled output and fails to start
