
Conversation

@mengwei805 (Collaborator) commented Apr 21, 2025

What this PR does / why we need it?

This PR fixes several bugs in spec decode / MTP.
It also adds an MTP e2e UT, `test_mtp_correctness.py`.

vllm_ascend/attention/attention.py

  1. Add support for the case where `self.attn_mask_cache` has only one element, covering the scenario in which both spec decode and chunked prefill are enabled (a sketch follows).
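
A minimal sketch of the idea, assuming a list-based mask cache; the class and method names below are illustrative, not the actual attention.py code:

```python
# Hedged sketch: when spec decode and chunked prefill are both enabled, the
# mask cache may hold a single entry, so lookups must fall back to slicing it.
import torch

class AttentionMaskBuilder:  # illustrative name, not the real class
    def __init__(self, max_seq_len: int, dtype: torch.dtype):
        # One cached causal mask covering the maximum sequence length.
        self.attn_mask_cache = [
            torch.triu(
                torch.ones(max_seq_len, max_seq_len, dtype=dtype), diagonal=1)
        ]

    def get_attn_mask(self, seq_len: int) -> torch.Tensor:
        if len(self.attn_mask_cache) == 1:
            # Single-element cache: slice the one mask instead of indexing
            # by sequence length, which could go out of range.
            return self.attn_mask_cache[0][:seq_len, :seq_len]
        return self.attn_mask_cache[seq_len]
```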

vllm_ascend/distributed/parallel_state.py

  1. Remove two asserts, because the spec decode worker calls `init_worker` twice (see the sketch below).
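
The PR simply deletes the asserts; as a sketch of why they fired, and of an idempotent alternative (all names illustrative, not the actual parallel_state.py code):

```python
# Hedged sketch: the spec decode worker runs init_worker twice, so a bare
# `assert _EP is None` fails on the second pass. Returning early instead
# keeps initialization idempotent.
from typing import Optional

_EP: Optional[object] = None  # expert parallel process group

def init_expert_parallel(group: object) -> None:
    global _EP
    if _EP is not None:
        # Second init_worker call: keep the existing group, don't assert.
        return
    _EP = group
```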

vllm_ascend/models/deepseek_mtp.py

  1. Remove unused params.
  2. Add W8A8 support in `CustomDeepSeekMTP`.

vllm_ascend/quantization/quant_config.py

  1. Use `AscendUnquantizedFusedMoEMethod` instead of `UnquantizedFusedMoEMethod`.

other

  1. Replace `from vllm.logger import init_logger` with `from vllm.logger import logger` throughout the vllm-ascend project (before/after shown below).
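
The swap, as a before/after sketch (the surrounding module body is illustrative):

```python
# Before: each module built its own logger via init_logger.
from vllm.logger import init_logger
logger = init_logger(__name__)

# After: every vllm-ascend module reuses vLLM's pre-configured logger.
from vllm.logger import logger
```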

Does this PR introduce any user-facing change?

How was this patch tested?

@mengwei805 force-pushed the main-sd-bugfix branch 3 times, most recently from 46b8b8f to f9c61ea on April 21, 2025 at 07:49
@mengwei805 changed the title from [WIP][bugfix] main-sd-bugfix to [BUGFIX] main-sd-bugfix && [UT] add mtp UT on Apr 21, 2025
run: |
  if [[ "${{ matrix.os }}" == "linux-arm64-npu-1" ]]; then
    pytest -sv tests/singlecard/spec_decode
    pytest -sv tests/singlecard/spec_decode/e2e/test_mtp_correctness.py  # it needs a clean process
Collaborator
Let's add more comments on why it needs a clean process.

Collaborator Author

I will add more comments.

  1. I see a lot of "it needs a clean process" comments in the vLLM test pipeline, so I added it here; I think the vLLM UT architecture may have some problems.
  2. In this case, it needs a clean process because it uses bf16 while the other UTs use fp16; if they run in a common process, it will fail.

Collaborator

Yeah, I think 2 is the main reason in our case.

expert_tensor_parallel_size)

global _EP
assert _EP is None, ("expert parallel group is already initialized")
Collaborator

Why do we delete this assertion here?

Collaborator Author

Because the spec decode worker calls `init_worker` twice, `_EP` is no longer None on the second init, so the assert would fire.

Collaborator

Makes sense, thanks!

group_ranks = []
global _ETP
assert _ETP is None, (
    "expert tensor parallel group is already initialized")
Collaborator

ditto

Collaborator Author

Same as above: the spec decode worker calls `init_worker` twice, so `_ETP` is no longer None on the second init.

from vllm.model_executor.parameter import PerTensorScaleParameter
from vllm.model_executor.utils import set_weight_attrs

from ..ops.fused_moe import AscendUnquantizedFusedMoEMethod
Collaborator

I prefer to use absolute path references here.

Suggested change:
- from ..ops.fused_moe import AscendUnquantizedFusedMoEMethod
+ from vllm_ascend.ops.fused_moe import AscendUnquantizedFusedMoEMethod

@mengwei805 (Collaborator Author) commented Apr 21, 2025

OK, I have modified it.

Signed-off-by: mengwei805 <mengwei25@huawei.com>
if self.is_layer_skipped_ascend(prefix,
                                self.packed_modules_mapping):
    return UnquantizedFusedMoEMethod()
return AscendUnquantizedFusedMoEMethod()
Collaborator

I wonder why it didn't raise an error before. @ganyi1996ppo

Collaborator Author

If the checkpoint contains only the main model, all FusedMoE weights are W8A8. But in the DeepSeek W8A8 checkpoint the MTP weights are float, so enabling MTP raised an error, while running without MTP succeeded.

@wangxiyuan merged commit 0ae9ee0 into vllm-project:main on Apr 21, 2025
16 checks passed
ttanzhiqiang pushed a commit to ttanzhiqiang/vllm-ascend that referenced this pull request Apr 27, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025