[Doc] Add release note for v0.9.1rc3
#2411
Conversation
1. Drop main and add a 0.9.1 check for the 0.9.1-dev branch. 2. Cherry-pick vllm-project@b75cb78 to fix an import error and make 0.9.1 work. 3. Fix quantization test failure. Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
…by#1160 (vllm-project#1214) ### What this PR does / why we need it? Fix the torchair execute issue on padding data and the MTP padding logic (follow-up to #1160). ### How was this patch tested? It has been tested and merged in main. Signed-off-by: 刘哲续 <liuzhexu1@huawei.com> Co-authored-by: 刘哲续 <liuzhexu1@huawei.com>
### What this PR does / why we need it? Add myst_substitutions print in docs/source/conf.py ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it? This PR reverts 20dedb to restore the token-wise padding logic so that ACL Graph can work as expected. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
…t#1240) ### What this PR does / why we need it? vllm-ascend supports chunked prefill for MLA. Related PR on main: vllm-project#1172 --------- Signed-off-by: fems14 <1804143737@qq.com>
…ct#1234) 1. Remove RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES 2. Add lazy init for vllm_ascend_C Signed-off-by: zhuo97 <1103045176@qq.com>
### What this PR does / why we need it? Rebase main. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? --------- Signed-off-by: 22dimensions <waitingwind@foxmail.com> Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com> Signed-off-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: hfadzxy <starmoon_zhang@163.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: sdmyzlp <lrwei2@petalmail.com> Signed-off-by: depeng1994 <depengzhang@foxmail.com> Signed-off-by: ttanzhiqiang <389825161@qq.com> Signed-off-by: yzim <43207690+yzim@users.noreply.github.com> Signed-off-by: chenwaner <861645847@qq.com> Signed-off-by: whx-sjtu <2952154980@qq.com> Signed-off-by: wangli <wangli858794774@gmail.com> Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com> Signed-off-by: wan_danfeng <wonderful199082@126.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Signed-off-by: zhuo97 <1103045176@qq.com> Co-authored-by: 22dimensions <waitingwind@foxmail.com> Co-authored-by: Yikun Jiang <yikunkero@gmail.com> Co-authored-by: zhangxinyuehfad <59153331+zhangxinyuehfad@users.noreply.github.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: sdmyzlp <117554856+sdmyzlp@users.noreply.github.com> Co-authored-by: depeng1994 <166494784+depeng1994@users.noreply.github.com> Co-authored-by: ttanzhiqiang <38750855+ttanzhiqiang@users.noreply.github.com> Co-authored-by: yz <43207690+yzim@users.noreply.github.com> Co-authored-by: chenwaner <48718746+chenwaner@users.noreply.github.com> Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com> Co-authored-by: Li Wang <wangli858794774@gmail.com> Co-authored-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com> Co-authored-by: Wan_Danfeng <wonderful199082@126.com> Co-authored-by: zhuo97 <49392868+zhuo97@users.noreply.github.com> Co-authored-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
… to 2.5.1.post1.dev20250528 (vllm-project#1247) ### What this PR does / why we need it? Cherry-pick from vllm-project#1235. 1. Fix rank set in the DP scenario. The new PoC version of torch-npu supports setting `ASCEND_RT_VISIBLE_DEVICES` dynamically, thus we can use the rank set in `DPEngineCoreProc` directly instead of calculating the local rank across DP by hand in the patched `_init_data_parallel`. Closes: vllm-project#1170 2. Bump torch-npu version to 2.5.1.post1.dev20250528. Closes: vllm-project#1242 Closes: vllm-project#1232 ### How was this patch tested? CI passed with newly added test. --------- Signed-off-by: Icey <1790571317@qq.com> Signed-off-by: MengqingCao <cmq0113@163.com> Co-authored-by: Icey <1790571317@qq.com>
cherry-pick to 0.9.1-dev Signed-off-by: whx-sjtu <2952154980@qq.com>
…lm-project#1264) ### What this PR does / why we need it? This PR is the cherry-pick of PR vllm-project#1229, which has already been merged into the main branch, and resolves [issue 1147](vllm-project#1147). 1. Move fused_moe code into one file, `fused_moe.py`. 2. Integrate branch conditions into the function `get_fused_moe_state`. ### Does this PR introduce _any_ user-facing change? 1. This PR removes the env `VLLM_ENABLE_MC2`, because the env is unnecessary: we can make the judgment based on the current scenario without it, and it only adds complexity. 2. This PR removes the env `USING_LCCL_COM`, because this env has already expired. 3. `additional_config.expert_tensor_parallel_size` has already expired; we now use the parameter `enable_expert_parallel`, consistent with vLLM. ### How was this patch tested? CI passed Signed-off-by: zzzzwwjj <1183291235@qq.com> Signed-off-by: ganyi <pleaplusone.gy@gmail.com> Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
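As a reference for the last point, a minimal launch sketch under the new behavior (model path and parallel sizes are placeholders); expert parallelism is now toggled through vLLM's `--enable-expert-parallel` flag rather than the removed environment variables:

```shell
# Expert parallelism is controlled by the vLLM flag, not by the removed
# VLLM_ENABLE_MC2 / USING_LCCL_COM environment variables.
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/deepseek-moe-model \
    --tensor-parallel-size 4 \
    --enable-expert-parallel
```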
Signed-off-by: zzzzwwjj <1183291235@qq.com>
) ### What this PR does / why we need it? Refactor the token-wise padding mechanism into a more elegant implementation, correcting the padding logic errors introduced by the previous multimodal commit vllm-project#736. This is a clean version of vllm-project#1259. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? --------- Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
…y when using V0 spec decode (vllm-project#1271) ### What this PR does / why we need it? Enable `ACL_OP_INIT_MODE=1` directly only when using V0 spec decode. Cherry pick from vllm-project#1258. --------- Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
…ev) (vllm-project#1296) This PR adopts LLMDataDist for KV cache registration and a pull_blocks-style disaggregated prefill implementation. The interface implementation mainly follows the design of the NIXL PR https://github.com/vllm-project/vllm/pull/17751/files#diff-7eaad0b7dee0626bf29d10081b0f0c5e3ea15a4af97e7b182a4e0d35f8346953. This PR can be tested with the following steps: 1. Generate the rank table for all machines. 2. Execute toy_proxy.py to launch the disaggregated prefill proxy server, specifying the prefill IP/port and the decode IP/port. 3. Run the prefill server and decode server. 4. Send requests to the disaggregated prefill proxy. Signed-off-by: ganyi <pleaplusone.gy@gmail.com> Signed-off-by: underfituu <hzhucong@163.com> Co-authored-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it? Fix the env variable in dbo to enable dbo in the DeepSeek-V3 model. Besides, we have fixed a known issue in deepseek-dbo. ### How was this patch tested? This patch can be tested with the newly added e2e tests: [tests/multicard/test_offline_inference_distributed.py](https://github.com/vllm-project/vllm-ascend/pull/1285/files#diff-7cd2e6b1bda6b8ad1bedb3276971fe7064aeae4dc0efd41c301c4ede2158c57e). We checked the registered module name and class name in these tests. --------- Signed-off-by: zhuohuan <zxdu1997@gmail.com>
Fix running error in dbo when dp_size>1. Add conditional logic in `_get_forward_metadata_across_dp` to enable dbo. Signed-off-by: shikang-hangzhou <459956190@qq.com>
### What this PR does / why we need it? Supports DeepSeek-R1 w4a8 quantization. Since R1 w4a8 uses mixed quantization, only the MoE layer uses w4a8_dynamic quantization, so we added the w4a8_dynamic.py file, which includes the AscendW4A8DynamicFusedMoEMethod class. ### Does this PR introduce _any_ user-facing change? No, using `--quantization=ascend` is enough. ### How was this patch tested? #### 1. How to get weights using ModelSlim ##### Installation steps Use the master branch at commit 298e175d69b3b855111a1e09bbe2fcd12fdb4e24: git clone https://gitee.com/ascend/msit.git ; cd msit/msmodelslim ; bash install.sh ##### Required transformers environment pip install transformers==4.48.2 ##### Generate w4a8 weights cd /example/DeepSeek. Command reference: msmodelslim/example/DeepSeek/README.md. Execute the [pre-check](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#运行前必检) and [DeepSeek-R1 w4a8 mix quantization](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-r1-w4a8-混合量化前三层-mlpw8a8-dynamic-量化mla共享专家w8a8量化路由专家w4a8-dynamic量化) chapters. Reference command: python3 quant_deepseek_w4a8.py --model_path {original weight path} --save_path {generated weight path} --mindie_format ##### Adapt to vllm-ascend Since mindie_format generates MindIE-format weights, some adaptations are needed before vllm-ascend can use them: rename `quant_model_description_w8a8_dynamic.json` to `quant_model_description.json`, and change `"group_size": 0` to `"group_size": 256`. In `config.json`: change `"model_type": deepseekv2` to `"model_type": deepseek_v3` and remove `quantization_config`. #### 2. How to run w4a8 TP + EP: python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 --enable_expert_parallel --quantization ascend --port $3 --max-model-len $4 --max-num-seqs $5 --enforce-eager e.g.: python -m vllm.entrypoints.openai.api_server --model=/weightpath/w4a8_4_layer --trust-remote-code -tp 4 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 2048 --max-num-seqs 128 --enforce-eager DP + TP + EP: python -m vllm.entrypoints.openai.api_server --model=$1 --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel --quantization ascend --port $4 --max-model-len $5 --max-num-seqs $6 --enforce-eager e.g.: python -m vllm.entrypoints.openai.api_server --model=/weightpath/w4a8_4_layer --trust-remote-code -tp 2 -dp 2 --enable_expert_parallel --quantization ascend --port 8002 --max-model-len 2048 --max-num-seqs 128 --enforce-eager #### 3. Usage constraints export VLLM_USE_V1=1 # v1 --------- Signed-off-by: pichangping <1337510399@qq.com>
### What this PR does / why we need it? 1. [PR913](vllm-project#913) introduced an error that caused V0's spec decode function to fail. [PR1109](vllm-project#1109) wanted to fix this problem; unfortunately, the fix broke the ngram function. I fixed the ngram function in this PR. **PS**: Q: Why wasn't the ngram problem found when PR1109 was merged? A: The newly introduced problem only appears when tp>1, and the use cases on CI are all tp=1. 2. In versions after 0.7.3, vllm-ascend deleted some spec decode UTs to avoid CI taking too long, including the eagle speculative UTs, which left CI unable to cover the eagle function. I added it (`test_eagle_correctness.py`) back in this PR. 3. Because of the reason mentioned in 2, the current version of eagle has a problem. I located and fixed it: vLLM's `draft_model_runner.py` was changed and vllm-ascend was not synchronized in time. 4. Currently, the UTs of v0 and v1 are mixed in the spec_decode directory. I split them into two directories: spec_decode_v0 and spec_decode_v1. 5. I found that `vllm.spec_decode.multi_step_worker.MultiStepWorker.set_include_gpu_probs_tensor` and `vllm.spec_decode.multi_step_worker.MultiStepWorker.set_should_modify_greedy_probs_inplace` have changed in vLLM, so I removed their patches in this PR. 6. The v1 MTP UT failed (https://github.com/vllm-project/vllm-ascend/actions/runs/15782006176/job/44489813330?pr=1323), so I commented it out. @XWFAlone @JC-ut0 ### Does this PR introduce _any_ user-facing change? This PR fixes the ngram and eagle spec decode functions in the v0 engine. ### How was this patch tested? ngram and eagle were tested locally on an 800I A2 machine, using real weights instead of the random small weights used by UT, and using a scenario test with tp>1; the rest were tested by CI. Signed-off-by: mengwei805 <mengwei25@huawei.com>
…orchair graph in long sequence predictions (vllm-project#1332) ### What this PR does / why we need it? Fix the issue of insufficient cached cosine and sine length in MLA's TorchAir graph mode, which causes accuracy deviation during long-sequence inference. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? We tested the accuracy of this patch with a DeepSeek R1 e2e benchmark serving and got a score of 83.33 on the AIME2024 dataset with the DP4TP4EP16 setting. Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it? Adding `W4A8_DYNAMIC` quantization support for linear layers. Dense models like Qwen3 can infer with `W4A8_DYNAMIC` quantization. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? Adding test case `tests/multicard/test_model_qwen3_w4a8.py` to test the qwen3 w4a8_dynamic quantized model. Note the w4a8_dynamic quantized model is quantized by `msit/msmodelslim` at commit `d0abb0a47e1f1a473b866ad41b737fbc28fb1409`. 1. Generate `W4A8_DYNAMIC` quantization weights using `msmodelslim` ```shell git clone https://gitee.com/ascend/msit.git cd msit/msmodelslim git checkout d0abb0a47e1f1a473b866ad41b737fbc28fb1409 bash install.sh ``` 2. Serve model using `vllm` ```shell VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \ --model vllm-ascend/Qwen3-8B-W4A8 \ --port 8000 \ --quantization ascend \ --tensor_parallel_size 2 \ --enforce-eager ``` --------- Signed-off-by: ZhouXiang <zhouxiang100@huawei.com> Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
### What this PR does / why we need it? This PR updates torch-npu to dev20250619 on the 0.9.1-dev branch. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
) ### What this PR does / why we need it? Fix accuracy problem after MOE refactor and make inference flow better. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? e2e test in `tests/e2e/multicard/test_offline_inference_distributed.py` Signed-off-by: shikang-hangzhou <459956190@qq.com>
…la decode (vllm-project#1311) ### What this PR does / why we need it? After disaggregated PD was merged, the KV cache on DeepSeek becomes two pieces of independent buffer for KV transfer or computation. However, the current kernel, namely `paged_attention_mla`, can only accept k_cache as a single parameter; this forces us to concat these two pieces of KV cache together before the attention, which incurs a memory peak inside the attention in eager mode. In this PR we introduce `torch_npu.atb.npu_multi_head_latent_attention` for the MLA decode path, which will become the default path for both eager mode and ACL graph once the related torch_npu is publicly available. Since it is still a restricted package, we add `VLLM_ASCEND_MLA_PA` to control its usage. This flag will be removed in the future. ### Does this PR introduce _any_ user-facing change? Yes, it adds a new flag named `VLLM_ASCEND_MLA_PA`, but it will be removed eventually after the newest torch_npu is released. --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>
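A rough usage sketch for the new flag (the flag name comes from this PR; treating it as a simple on/off environment switch, with a placeholder model path):

```shell
# Opt in to the npu_multi_head_latent_attention decode path for MLA models.
# Assumes a torch_npu build that ships the ATB kernel; the flag is temporary.
export VLLM_ASCEND_MLA_PA=1
python -m vllm.entrypoints.openai.api_server --model /path/to/deepseek-model
```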
### What this PR does / why we need it? Add guidance for users to clean the cache when pip reinstallation of vLLM fails. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Just changed the doc Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
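A minimal sketch of the kind of cleanup the doc recommends, using standard pip commands (not specific to this repo):

```shell
# Clear pip's local wheel/HTTP cache, then retry the install without the cache.
pip cache purge
pip install -v --no-cache-dir vllm-ascend
```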
…oject#1361) ### What this PR does / why we need it? Remove the scheduler patch for disaggregated PD, since we found the patch can not really work on the online serving path. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Ci test will guarantee this --------- Signed-off-by: ganyi <pleaplusone.gy@gmail.com> Signed-off-by: liziyu <liziyu16@huawei.com> Co-authored-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it? update Disaggregate prefill README --------- Signed-off-by: liziyu <liziyu16@huawei.com>
…ject#1393) ### What this PR does / why we need it? Remove the duplicated code introduced by my inadvertent rebase. I apologize for this oversight. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
…ect#1422) ### What this PR does / why we need it? A refactoring of `forward_context` and `model_runner_v1`: add some context that is necessary for model inference into `forward_context`, and refactor the `dummy_run` logic to make it more reasonable. Some details for this PR: 1. Fix an accuracy bug when online + multi-DP + eager mode + all_gather mode; 2. Fix a bug when online + multi-DP + eager mode + mc2 mode; 3. Fix a bug when A2 + eager mode + mc2 mode; 4. Enable different token_num on different chips in mc2 mode; 5. Update scripts in the `examples` dir. ### Does this PR introduce _any_ user-facing change? This PR removes `expert_tensor_parallel_size` from `additional_config`; we now use `enable_expert_parallel` to control whether expert parallelism is enabled, which is consistent with vLLM. ### How was this patch tested? --------- Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it? Fixes qwen3 w4a8 test case failed due to `sampling_params` not fixed and `torch_npu` updated. --------- Signed-off-by: ZhouXiang <zhouxiang100@huawei.com>
…ble NZ for GMM. (vllm-project#1409) ### What this PR does / why we need it? 1. Add a switch for enabling the NZ layout in weights. 2. Enable NZ for GMM. 3. Replace the magic number for the weights layout. ### Does this PR introduce _any_ user-facing change? Users should set `enable_weight_nz_layout` to `true` in `--additional-config` when they want to enable the NZ weights layout. ### How was this patch tested? 1) CI passed. 2) Accuracy and performance comparison (only gsm8k-lite). Signed-off-by: linfeng-yuan <1102311262@qq.com>
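A hedged launch sketch, taking the option name from this PR and assuming it is passed as JSON through `--additional-config` (model path is a placeholder):

```shell
# Enable the NZ weight layout introduced by this PR.
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/quantized-model \
    --quantization ascend \
    --additional-config '{"enable_weight_nz_layout": true}'
```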
### What this PR does / why we need it? Remove chunked_prefill_for_mla ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Processed prompts: 100%|██████████| 4/4 [00:02<00:00, 1.92it/s, est. speed input: 12.46 toks/s, output: 38.34 toks/s] DP rank 2, Generated text: ' [Your Name] and I am a professional carpenter with over 10 years of experience in the industry' DP rank 2, Generated text: ' the head of state and head of government of the United States, indirectly elected to a four-year term' DP rank 2, Generated text: ' Paris, a city that is renowned for its rich history, culture, and influence on art, fashion' DP rank 2, Generated text: ' a topic of much speculation and debate. Some experts believe that AI will eventually surpass human intelligence, while' Processed prompts: 100%|██████████| 4/4 [00:02<00:00, 1.95it/s, est. speed input: 12.65 toks/s, output: 38.93 toks/s] DP rank 0, Generated text: " Dr. David Hill and today we're going to be talking about how to treat a child with a" DP rank 0, Generated text: ' the head of state and head of government of the United States, indirectly elected to a four-year term' DP rank 0, Generated text: ' Paris, a city that is renowned for its rich history, culture, and influence on art, fashion' DP rank 0, Generated text: ' here, and it’s called ChatGPT. This revolutionary technology is changing the way we interact with machines' Processed prompts: 100%|██████████| 4/4 [00:02<00:00, 1.97it/s, est. speed input: 12.79 toks/s, output: 39.36 toks/s] DP rank 1, Generated text: " Dr. David Hill and today we're going to be talking about how to treat a child's fever" DP rank 3, Generated text: ' [Your Name] and I’m here to talk to you about the importance of a healthy diet' DP rank 1, Generated text: ' the head of state and head of government of the United States, indirectly elected to a four-year term' DP rank 1, Generated text: ' Paris, a city that is renowned for its rich history, culture, and influence on art, fashion' DP rank 1, Generated text: ' a topic of much speculation and debate. Some experts believe that AI will eventually surpass human intelligence, leading' DP rank 3, Generated text: ' the head of state and head of government of the United States, indirectly elected to a four-year term' DP rank 3, Generated text: " Paris. It is the largest city in France and serves as the country's political, cultural, and" DP rank 3, Generated text: ' here, and it’s called ChatGPT. This revolutionary technology is changing the way we interact with machines' --------- Signed-off-by: fems14 <1804143737@qq.com>
…llm-project#2326) ### What this PR does / why we need it? This PR fixes bugs and refactors the cached mask generation logic. The cached mask is now pre-constructed and used on the CPU instead of on the NPU device. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? CI passed with new added/existing test. Signed-off-by: rjg-lyh <1318825571@qq.com>
…mismatches and .kv_cache_bytes file missing (vllm-project#2312) ### What this PR does / why we need it? The original implementation of torchair caching forces users to have everything prepared, fix all the configuration and enable `use_cached_npu_graph`, which can cause problems that are confusing for users to understand and tackle. It is better to compile the graph twice instead of reusing the old kvcaches and cached torchair graph, and the extra duration is acceptable. ### Does this PR introduce _any_ user-facing change? If users want to enable torchair.cache_compile with high compilation speed, it is recommended to enable both `use_cached_kv_cache_bytes` and `use_cached_graph` in `torchair_graph_config`. Without `use_cached_kv_cache_bytes`, we compile the torchair computation graph twice to avoid runtime errors caused by configuration mismatches (the second compilation will be much faster). ### How was this patch tested? CI and e2e vLLM serving passed. Signed-off-by: linfeng-yuan <1102311262@qq.com>
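A sketch of the recommended fast-compilation setup, assuming both switches live under `torchair_graph_config` in `--additional-config` as described above (model path is a placeholder):

```shell
# Reuse both the cached torchair graph and the cached kv_cache_bytes.
# Drop use_cached_kv_cache_bytes to fall back to the safer double compilation.
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/deepseek-model \
    --additional-config '{"torchair_graph_config": {"enabled": true, "use_cached_graph": true, "use_cached_kv_cache_bytes": true}}'
```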
…llm-project#2327) ### What this PR does / why we need it? Add configuration check logic for the ascend scheduler: 1) if chunked_prefill is disabled, `max_num_batched_tokens` cannot be less than `max_model_len`, following vLLM; 2) if the ascend scheduler is disabled, MTP cannot be enabled. ### Does this PR introduce any user-facing change? 1) Users cannot enable MTP without the ascend scheduler. 2) Users cannot set `max_num_batched_tokens` smaller than `max_model_len` with the ascend scheduler. ### How was this patch tested? CI and vLLM serving passed Signed-off-by: linfeng-yuan <1102311262@qq.com>
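A configuration sketch that satisfies both checks; the `ascend_scheduler_config` key follows vllm-ascend's additional-config convention and may differ by version, and the sizes are illustrative:

```shell
# With chunked prefill disabled, keep max-num-batched-tokens >= max-model-len,
# and enable the ascend scheduler before turning on MTP-style speculation.
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/model \
    --max-model-len 8192 \
    --max-num-batched-tokens 8192 \
    --additional-config '{"ascend_scheduler_config": {"enabled": true}}'
```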
### What this PR does / why we need it? In pure DP scenarios (such as DP32), LMHead computation takes 1~2 ms. In this PR we customize the parallelism of LMHead, enabling a separate TP group for LMHead. The computation flow is listed as follows: ``` get_lmhead_group().all_gather # [num_tokens, hid_dim] --> [num_tokens * lmhead_tp, hid_dim] --> lmhead matmul # [num_tokens * lmhead_tp, hid_dim] --> [num_tokens * lmhead_tp, vocab_size // lmhead_tp] --> get_lmhead_group().all_to_all # [num_tokens * lmhead_tp, vocab_size // lmhead_tp] --> [num_tokens, vocab_size] ``` This can save 0.5~1 ms for DeepSeek with 28 BS on a single die with MTP. In addition, this PR also fixes a bug introduced by LMHead quantization: the op `npu_quant_matmul` only accepts dim < 65536, while `vocab_size` is > 65536 when using TP 1. We can set the LMHead TP size > 1 to avoid this bug. Main version of this PR: vllm-project#2309. ### Does this PR introduce _any_ user-facing change? Yes. We introduce another configurable option, `lmhead_tp_size`, in ascend_config. For example: ``` additional_config={ "lmhead_tp_size": 16, } ``` The default value is -1, and `lmhead_tp_size` is automatically set to `tensor_parallel_size` in this case. Besides, it is suggested to use it when running full DP to avoid the additional communication introduced by TP. Therefore, the parallel size of the `lmhead` group will also be changed to `tensor_parallel_size` if TP > 1, so as to fall back to the normal TP+DP case. ### How was this patch tested? --------- Signed-off-by: angazenn <zengyanjia@huawei.com> Signed-off-by: zengyanjia <z00883269@china.huawei.com> Co-authored-by: angazenn <zengyanjia@huawei.com> Co-authored-by: zengyanjia <z00883269@china.huawei.com>
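For online serving, the same option can presumably be passed as JSON via `--additional-config`; a sketch with illustrative pure-DP sizes and a placeholder model path:

```shell
# Pure-DP deployment with a separate LMHead TP group of size 16.
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/deepseek-model \
    --data-parallel-size 16 \
    --additional-config '{"lmhead_tp_size": 16}'
```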
…cheduler in disaggregated_prefill deployment (vllm-project#2368) ### What this PR does / why we need it? Currently deepseek_mtp can be enabled with the original vLLM scheduler only in disaggregated prefill scenarios (experimental). This PR changes the verification logic to allow users to enable deepseek_mtp without the ascend scheduler in disaggregated prefill deployments. ### Does this PR introduce _any_ user-facing change? Users can enable the deepseek_mtp model without the ascend scheduler in disaggregated_prefill deployments. ### How was this patch tested? CI and e2e vLLM serving passed. Signed-off-by: linfeng-yuan <1102311262@qq.com>
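A rough sketch of the deployment this unlocks, assuming vLLM's JSON `--speculative-config` flag picks up the DeepSeek MTP module from the checkpoint (model path and token count are illustrative):

```shell
# Decode instance in a disaggregated-prefill setup, MTP speculation enabled,
# running on the original vLLM scheduler (no ascend scheduler config needed).
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/deepseek-r1 \
    --speculative-config '{"num_speculative_tokens": 1}'
```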
…#2028) (vllm-project#2306) ### What this PR does / why we need it? Fix the protobuf version in the Dockerfile to resolve `AttributeError: 'str' object has no attribute 'DESCRIPTOR' when packaging message to dict` when using protobuf. We will remove the version specification after ray-project/ray#54910 is merged. Backport of vllm-project#2028. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? CI passed with new added test. --------- Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it? This PR presents a large-EP deployment solution based on vllm-ascend, using DeepSeek as an example. It outlines the end-to-end workflow for model deployment and serves as a reference for developers. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Signed-off-by: hust17yixuan <303660421@qq.com>
vllm-project#2394) … the check_watermark_for_prefill function ### What this PR does / why we need it? The ascend scheduler encountered an incorrect req block length in the check_watermark_for_prefill function; with the current implementation it will always be 1. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Before: http://image.huawei.com/tiny-lts/v1/images/mdstorm/c6cff7cf33d500a3833f5f80352df373_1183x377.png After: http://image.huawei.com/tiny-lts/v1/images/mdstorm/57207a490d8ac0a70fc87dd08d02dee6_1470x954.png Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it? 1. MTP supports V1 scheduler 2. Refactor attn metadata build ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - [x] v0.9.1-dev - [x] A3 [TP16] [DP4 TP4] - [x] A3 1P1D Signed-off-by: xuyexiong <xuyexiong@huawei.com>
Code Review
This pull request adds the release notes for v0.9.1rc3. The changes update the versioning policy with the new release candidate's compatibility information and release date, and add a new section for it in the release notes document. The new entries are placed correctly, following the existing ordering conventions in both files. The changes are consistent and look good.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
### What this PR does / why we need it? vLLM-Ascend's rope implementation includes several header files that are not supposed to be included by outside users. The current implementation may break when the CANN toolkit updates, so this PR removes those incompatible includes to guarantee the safety of upgrading CANN toolkits. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tested by the rope unittest Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it? MTP now supports the v1 scheduler; the corresponding validation should be removed. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? - [x] v0.9.1-dev - [x] A3 [TP16] [DP4 TP4] - [x] A3 4P1D Signed-off-by: xuyexiong <xuyexiong@huawei.com>
Yikun
left a comment
You can follow #2233 to complete the release note
…-project#2402) ### What this PR does / why we need it? 1. Modify the error message throwing method. 2. Modify the attn_mask setting per PR2177. 3. Modify the seq_len setting per PR2371. 4. Modify the e2e case. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? --------- Signed-off-by: shikang-hangzhou <459956190@qq.com>
v0.9.1rc3 release note
…t#1921) ### What this PR does / why we need it? Since the ATB extension registration has now been fixed in torch_npu, we have removed the invocation of this private method. ### Does this PR introduce _any_ user-facing change? None. ### How was this patch tested? No need for further testing. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Moved to #2431.
What this PR does / why we need it?
Add v0.9.1rc3 release note, find more details at #2396.
Does this PR introduce any user-facing change?
How was this patch tested?