[BUGFIX] main-sd-bugfix && [UT] add mtp UT #593
Conversation
```yaml
      run: |
        if [[ "${{ matrix.os }}" == "linux-arm64-npu-1" ]]; then
          pytest -sv tests/singlecard/spec_decode
          pytest -sv tests/singlecard/spec_decode/e2e/test_mtp_correctness.py  # it needs a clean process
```
Let's add more comments on why it needs a clean process.
I will add more comments.
I see. There are two reasons why it needs a clean process:
1. It also needs a clean process in the vLLM test_pipeline, so I added it here the same way; I suspect the vLLM UT architecture may have some issues.
2. In this case, it needs a clean process because this test uses bf16 while the other UTs use fp16; if they run in a common process, the run fails.
Yeah, I think 2 is the main reason in our case.
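For reference, a minimal Python sketch of the isolation idea: launch the bf16 MTP test in its own interpreter so it cannot share process state with the fp16 UTs. This is illustrative only; the actual CI simply issues a separate `pytest -sv ...` command, as in the quoted workflow step above, and the test path is taken from that step.

```python
# Sketch only: run one test file in a fresh Python process so bf16 state from
# the MTP test cannot interact with the fp16 UTs running elsewhere.
import subprocess
import sys


def run_isolated(test_path: str) -> int:
    """Run a single test file in its own interpreter and return pytest's exit code."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-sv", test_path],
        check=False,
    )
    return result.returncode


if __name__ == "__main__":
    sys.exit(run_isolated(
        "tests/singlecard/spec_decode/e2e/test_mtp_correctness.py"))
```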
```diff
             expert_tensor_parallel_size)

     global _EP
-    assert _EP is None, ("expert parallel group is already initialized")
```
why do we delete this assertion here?
Because the spec decode worker calls init_worker twice, _EP is not None on the second init.
Makes sense, thanks!
```diff
     group_ranks = []
     global _ETP
-    assert _ETP is None, (
-        "expert tensor parallel group is already initialized")
```
ditto
Same reason: the spec decode worker calls init_worker twice, so the group is already initialized on the second init.
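To illustrate the point above, here is a minimal, self-contained sketch; it is not the actual vllm_ascend parallel_state code, and `GroupCoordinator` / `init_expert_parallel` are hypothetical stand-ins. The idea is that initialization tolerates a second init_worker call instead of asserting.

```python
# Sketch: make expert-parallel group init idempotent, since the spec decode
# worker calls init_worker twice and the second call must not trip an
# `assert _EP is None` style check.

_EP = None  # module-level expert parallel group handle


class GroupCoordinator:
    """Hypothetical stand-in for the real process-group wrapper."""

    def __init__(self, group_ranks, backend):
        self.group_ranks = group_ranks
        self.backend = backend


def init_expert_parallel(group_ranks, backend="hccl"):
    global _EP
    if _EP is not None:
        # Second init_worker call: reuse the group created the first time.
        return _EP
    _EP = GroupCoordinator(group_ranks, backend)
    return _EP


if __name__ == "__main__":
    first = init_expert_parallel([[0, 1, 2, 3]])
    second = init_expert_parallel([[0, 1, 2, 3]])  # no failure on re-init
    assert first is second
```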
```python
from vllm.model_executor.parameter import PerTensorScaleParameter
from vllm.model_executor.utils import set_weight_attrs

from ..ops.fused_moe import AscendUnquantizedFusedMoEMethod
```
I prefer to use absolute import paths here.
Suggested change:
```diff
- from ..ops.fused_moe import AscendUnquantizedFusedMoEMethod
+ from vllm_ascend.ops.fused_moe import AscendUnquantizedFusedMoEMethod
```
OK, I have modified it.
```diff
         if self.is_layer_skipped_ascend(prefix,
                                         self.packed_modules_mapping):
-            return UnquantizedFusedMoEMethod()
+            return AscendUnquantizedFusedMoEMethod()
```
I wonder why this didn't raise an error before. @ganyi1996ppo
If the weights only contain the main model, all FusedMoE weights are w8a8. But the current MTP weights are kept in float inside the DeepSeek w8a8 checkpoint, so with MTP enabled it raised an error, while it ran successfully with MTP disabled.
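To make the dispatch concrete, here is a simplified, self-contained sketch of the behaviour described above. Only `UnquantizedFusedMoEMethod` and `AscendUnquantizedFusedMoEMethod` come from the quoted diff; `AscendW8A8FusedMoEMethod`, `get_moe_method`, and the prefix values are hypothetical stand-ins, and the class bodies are stubs.

```python
# Sketch of the quant_config dispatch for FusedMoE layers in a DeepSeek w8a8
# checkpoint: MTP layers ship float weights, so they are "skipped" for
# quantization and must take the Ascend unquantized path (previously the
# generic vLLM one, which only started to be hit once MTP was enabled).

class UnquantizedFusedMoEMethod:  # generic vLLM float MoE path
    pass


class AscendUnquantizedFusedMoEMethod(UnquantizedFusedMoEMethod):  # NPU float MoE path
    pass


class AscendW8A8FusedMoEMethod:  # quantized (w8a8) MoE path, hypothetical name
    pass


def get_moe_method(layer_prefix, skipped_prefixes):
    """Pick the FusedMoE method for one layer of a w8a8 checkpoint."""
    if any(layer_prefix.startswith(p) for p in skipped_prefixes):
        return AscendUnquantizedFusedMoEMethod()
    return AscendW8A8FusedMoEMethod()


if __name__ == "__main__":
    method = get_moe_method("model.mtp.layers.0.mlp", ["model.mtp"])
    assert isinstance(method, AscendUnquantizedFusedMoEMethod)
```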
### What this PR does / why we need it?

This PR fixes some bugs in spec decode / MTP and adds an MTP e2e UT, `test_mtp_correctness.py`.

**vllm_ascend/attention/attention.py**
1. Support `self.attn_mask_cache` having only 1 element, to cover the scene in which both spec decode and chunked prefill are enabled.

**vllm_ascend/distributed/parallel_state.py**
1. Remove 2 asserts, because the spec decode worker calls init_worker twice.

**vllm_ascend/models/deepseek_mtp.py**
1. Remove unused params.
2. Add w8a8 support in `CustomDeepSeekMTP`.

**vllm_ascend/quantization/quant_config.py**
1. Use `AscendUnquantizedFusedMoEMethod` instead of `UnquantizedFusedMoEMethod`.

**other**
1. Replace `from vllm.logger import init_logger` with `from vllm.logger import logger` across the vllm-ascend project.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: mengwei805 <mengwei25@huawei.com>
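As a small illustration of the project-wide logging change listed above (the log message is just an example):

```python
# After this change: import the shared vLLM logger object directly,
# instead of building one via `logger = init_logger(__name__)`.
from vllm.logger import logger  # previously: from vllm.logger import init_logger

logger.info("vllm-ascend example log line")
```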