[Fix] Fix update_aclgraph_sizes when running MoE models #913
Conversation
Force-pushed from c9a9b21 to 45d1cc8.
```python
if (additional_config
        and "expert_tensor_parallel_size" in additional_config
        and not parallel_config.enable_expert_parallel):
    parallel_config.expert_tensor_parallel_size = int(
        additional_config["expert_tensor_parallel_size"])
```
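For readers following along, this attribute is driven by vLLM's `additional_config` passthrough. A hedged usage sketch (the model name and the exact argument spelling are illustrative and may differ across vLLM/vllm-ascend versions):

```python
# Hypothetical usage: requesting ETP=4 via vLLM's additional_config passthrough.
# Model name and argument spelling are assumptions, not taken from this PR.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",   # example MoE model
    tensor_parallel_size=8,
    additional_config={"expert_tensor_parallel_size": 4},
)
```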
It seems there is no `expert_tensor_parallel_size` in `ParallelConfig`? The same goes for `parallel_config.expert_parallel_size`.
vLLM does not support tensor parallelism for experts’ weights, so I added this attribute. This change primarily addresses scenarios where the number of devices significantly exceeds the number of experts.
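To make that concrete, here is a small worked example (a sketch only; the formula mirrors the diff above, and the device/expert counts are hypothetical). With far more devices than experts, pure expert parallelism leaves ranks without a whole expert to own, while a non-trivial ETP size shards each expert's weights and brings the EP size back down to the expert count:

```python
# Sketch: why expert_tensor_parallel_size (ETP) helps when devices >> experts.
# Values are hypothetical; the formula mirrors the diff above.
world_size = 32    # total devices
num_experts = 8    # experts in the MoE layer

# Pure expert parallelism: EP size = world size, i.e. 32 ranks for 8 experts,
# so experts would have to be replicated or some ranks would sit idle.
ep_size_pure = world_size

# With ETP, each expert's weights are sharded across etp ranks:
expert_tensor_parallel_size = 4
expert_parallel_size = world_size // expert_tensor_parallel_size  # 32 // 4 = 8
assert expert_parallel_size == num_experts  # one expert per EP group
```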
```python
# Calculate expert parallel size based on world size
parallel_config.expert_parallel_size = (
    parallel_config.world_size //
    parallel_config.expert_tensor_parallel_size)
```
QQ: Does this mean the ETP size is not equal to 1 only when EP is disabled and ETP is enabled? If so, why do we set the EP size here when EP is disabled?
As I mentioned above, this is more of an enhancement.
```python
parallel_config.expert_parallel_size = (
    parallel_config.world_size //
    parallel_config.expert_tensor_parallel_size)
```
Sorry, I'm still confused. Could I understand it like this: when EP is disabled and ETP is enabled, the expected ETP size is set, but an unexpected EP size is set as well? Or should the EP size be set only when EP is enabled?
```diff
- parallel_config.expert_parallel_size = (
-     parallel_config.world_size //
-     parallel_config.expert_tensor_parallel_size)
+ if parallel_config.enable_expert_parallel:
+     parallel_config.expert_parallel_size = (
+         parallel_config.world_size //
+         parallel_config.expert_tensor_parallel_size)
```
We can discuss this in detail later. You are right; perhaps we need to give the configuration logic here careful consideration to avoid any unwanted confusion.
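For reference, one possible way to disentangle the two cases (a sketch only, based on the review discussion above; the branch structure is an assumption, not the merged code):

```python
# Sketch of one way to keep EP and ETP configuration mutually exclusive.
# This is an assumption based on the review discussion, not the merged code.
if parallel_config.enable_expert_parallel:
    # EP enabled: experts are distributed across ranks, each kept whole (ETP = 1).
    parallel_config.expert_tensor_parallel_size = 1
    parallel_config.expert_parallel_size = parallel_config.world_size
elif additional_config and "expert_tensor_parallel_size" in additional_config:
    # EP disabled but ETP requested: shard each expert's weights across
    # etp ranks, and derive the EP size from what remains of the world.
    parallel_config.expert_tensor_parallel_size = int(
        additional_config["expert_tensor_parallel_size"])
    parallel_config.expert_parallel_size = (
        parallel_config.world_size //
        parallel_config.expert_tensor_parallel_size)
```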
Force-pushed from 64ea69c to 06be3d2.
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
…or_parallel_size Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
…#913)

### What this PR does / why we need it?
Fix update_aclgraph_sizes when running MoE models.

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: wangxiaoxin (A) <w00664509@china.huawei.com>
### What this PR does / why we need it?
1. [PR913](#913) introduced an error that caused V0's spec decode function to fail. [PR1109](#1109) attempted to fix this problem, but unfortunately the fix broke the ngram function. I fixed the ngram function in this PR.
   **PS**: Q: Why was the problem not found when PR1109 was merged? A: The newly introduced problem only appears when tp>1, and the test cases on CI all use tp=1.
2. In versions after 0.7.3, vllm-ascend deleted some spec decode UTs to keep CI times down, including the eagle speculative UTs, which left CI unable to cover the eagle function. I added it (`test_eagle_correctness.py`) back in this PR.
3. Because of the gap mentioned in 2, the current version of Eagle has a problem. I located and fixed it: vllm's `draft_model_runner.py` was changed and vllm-ascend was not synchronized in time.
4. Currently, the UTs of v0 and v1 are mixed in the spec_decode directory. I split them into two directories: spec_decode_v0 and spec_decode_v1.
5. I found that `vllm.spec_decode.multi_step_worker.MultiStepWorker.set_include_gpu_probs_tensor` and `vllm.spec_decode.multi_step_worker.MultiStepWorker.set_should_modify_greedy_probs_inplace` have changed in vllm, so I removed their patches in this PR.
6. The v1 mtp UT failed (https://github.com/vllm-project/vllm-ascend/actions/runs/15782006176/job/44489813330?pr=1323), so I commented it out. @XWFAlone @JC-ut0

### Does this PR introduce _any_ user-facing change?
This PR fixes the ngram and eagle spec decode functions in the v0 engine.

### How was this patch tested?
ngram and eagle were tested locally on an 800I A2 machine, using real weights instead of the random small weights used by the UTs, and in a scenario with tp>1. The rest was tested by CI.

Signed-off-by: mengwei805 <mengwei25@huawei.com>
### What this PR does / why we need it?
Fix update_aclgraph_sizes when running MoE models.
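For context, a minimal sketch of the kind of adjustment this fix concerns (the budget constant, the cost formula, and the function signature below are illustrative assumptions, not the actual vllm-ascend implementation): the number of ACL graphs that can be captured is bounded by per-device stream resources, and MoE models with expert-parallel communication consume more streams per captured graph, so the list of capture sizes has to be trimmed according to the model and parallel configuration.

```python
# Hypothetical sketch of trimming ACL graph capture sizes for MoE models.
# The stream budget and per-graph cost formula are illustrative assumptions.
def update_aclgraph_sizes_sketch(capture_sizes: list[int],
                                 num_hidden_layers: int,
                                 expert_parallel_size: int) -> list[int]:
    stream_budget = 1800                       # assumed per-device stream limit
    streams_per_graph = num_hidden_layers + 1  # assumed cost of one captured graph
    # Expert-parallel communication multiplies the per-graph stream cost.
    parallel_factor = max(1, expert_parallel_size)
    max_graphs = max(1, stream_budget // (streams_per_graph * parallel_factor))
    if len(capture_sizes) <= max_graphs:
        return capture_sizes
    # Keep an evenly spaced subset so small and large batches stay covered.
    step = len(capture_sizes) / max_graphs
    return [capture_sizes[int(i * step)] for i in range(max_graphs)]
```

Under these assumed numbers, a 64-layer MoE model with EP size 8 would have a budget of 1800 // (65 * 8) = 3 captured graphs, so most capture sizes get dropped; a dense model (parallel factor 1) would keep far more.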
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?