
Conversation


@shen-shanshan shen-shanshan commented Aug 18, 2025

What this PR does / why we need it?

Add the v0.9.1rc3 release note; find more details at #2396.

Does this PR introduce any user-facing change?

How was this patch tested?

wangxiyuan and others added 30 commits June 11, 2025 14:06
1. Drop main and add a 0.9.1 check for the 0.9.1-dev branch.

2. Cherry-pick
vllm-project@b75cb78
to fix the import error and make 0.9.1 work.

3. Fix the quantization test failure.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
…by#1160 (vllm-project#1214)

### What this PR does / why we need it?
Fix the torchair execution issue on padding data, and the MTP padding logic.

### How was this patch tested?
It has been tested and merged in main.

Signed-off-by: 刘哲续 <liuzhexu1@huawei.com>
Co-authored-by: 刘哲续 <liuzhexu1@huawei.com>
### What this PR does / why we need it?
Add myst_substitutions print in docs/source/conf.py

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?

This PR reverts 20dedb to restore the token-wise padding logic so that
ACL Graph can work as expected.
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
…t#1240)

### What this PR does / why we need it?
vllm-ascend supports chunked prefill for MLA.
Related PR on main: vllm-project#1172

---------

### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: fems14 <1804143737@qq.com>
…ct#1234)

1. Remove RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES
2. Add lazy init for vllm_ascend_C

Signed-off-by: zhuo97 <1103045176@qq.com>
### What this PR does / why we need it?
rebase main

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

---------

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: depeng1994 <depengzhang@foxmail.com>
Signed-off-by: ttanzhiqiang <389825161@qq.com>
Signed-off-by: yzim <43207690+yzim@users.noreply.github.com>
Signed-off-by: chenwaner <861645847@qq.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
Signed-off-by: wan_danfeng <wonderful199082@126.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: zhuo97 <1103045176@qq.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: zhangxinyuehfad <59153331+zhangxinyuehfad@users.noreply.github.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: sdmyzlp <117554856+sdmyzlp@users.noreply.github.com>
Co-authored-by: depeng1994 <166494784+depeng1994@users.noreply.github.com>
Co-authored-by: ttanzhiqiang <38750855+ttanzhiqiang@users.noreply.github.com>
Co-authored-by: yz <43207690+yzim@users.noreply.github.com>
Co-authored-by: chenwaner <48718746+chenwaner@users.noreply.github.com>
Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com>
Co-authored-by: Li Wang <wangli858794774@gmail.com>
Co-authored-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
Co-authored-by: Wan_Danfeng <wonderful199082@126.com>
Co-authored-by: zhuo97 <49392868+zhuo97@users.noreply.github.com>
Co-authored-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
… to 2.5.1.post1.dev20250528 (vllm-project#1247)

### What this PR does / why we need it?
Cherry-pick from vllm-project#1235

1. Fix the rank set in the DP scenario. The new PoC version of torch-npu
supports setting `ASCEND_RT_VISIBLE_DEVICES` dynamically, thus we can use
the rank set in `DPEngineCoreProc` directly instead of calculating the
local rank across DP by hand in the patched `_init_data_parallel`.

Closes: vllm-project#1170

2. Bump torch-npu version to 2.5.1.post1.dev20250528

Closes: vllm-project#1242
Closes: vllm-project#1232

### How was this patch tested?
CI passed with newly added test.

---------

Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: Icey <1790571317@qq.com>
cherry-pick to 0.9.1-dev

Signed-off-by: whx-sjtu <2952154980@qq.com>
…lm-project#1264)

### What this PR does / why we need it?
This PR is the cherry-pick of the PR
vllm-project#1229 which have already
merged into the main branch.

This PR is used for resolved [issue
1147](vllm-project#1147)
1. Move fused_moe code into one file `fused_moe.py`.
2. Integrate branch conditions into function `get_fused_moe_state`.

### Does this PR introduce _any_ user-facing change?
1. This PR removes the env `VLLM_ENABLE_MC2`: it is unnecessary because
the scenario can be determined without it, and it only adds complexity.
2. This PR removes the env `USING_LCCL_COM`, because this env has
already expired.
3. `additional_config.expert_tensor_parallel_size` has also expired;
now we use the parameter `enable_expert_parallel` instead, consistent
with vLLM (see the sketch below).
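
For illustration, a minimal sketch of enabling expert parallelism through the standard vLLM flag instead of the removed envs; the model name, parallel size and port are placeholders, not values from this PR:

```shell
# Hypothetical invocation; adjust model, parallel sizes and port to your setup.
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-V3 \
  --quantization ascend \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --port 8000
```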

### How was this patch tested?
CI passed

Signed-off-by: zzzzwwjj <1183291235@qq.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: zzzzwwjj <1183291235@qq.com>

### What this PR does / why we need it?
Refactor the token-wise padding mechanism to a more elegant
implementation, correcting the padding logic errors introduced by the
previous multimodal commit vllm-project#736.

This is a clean version of vllm-project#1259.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
…y when using V0 spec decode (vllm-project#1271)

### What this PR does / why we need it?
Enable `ACL_OP_INIT_MODE=1` directly only when using V0 spec decode.

Cherry pick from vllm-project#1258.

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
…ev) (vllm-project#1296)

This PR adopts LLMDataDist for KV cache registration and a
pull_blocks-style disaggregated prefill implementation. The interface
implementation mainly follows the design of the NIXL PR
https://github.com/vllm-project/vllm/pull/17751/files#diff-7eaad0b7dee0626bf29d10081b0f0c5e3ea15a4af97e7b182a4e0d35f8346953
.

This PR can be tested with the following steps (a minimal sketch follows the list):

1. Generate the rank table for all machines.
2. Execute toy_proxy.py to launch the disaggregated prefill proxy server, specifying the prefill IP/port and the decode IP/port.
3. Run the prefill server and the decode server.
4. Send the request to the disaggregated prefill proxy.
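
A minimal sketch of the flow above, assuming the proxy script accepts host/port flags; the flag names, IPs, ports and model name are placeholders, not the actual interface of toy_proxy.py:

```shell
# 1. Generate the rank table for all machines (tooling not covered here).
# 2. Launch the disaggregated prefill proxy (hypothetical flags).
python toy_proxy.py --prefill-host 10.0.0.1 --prefill-port 8100 \
                    --decode-host 10.0.0.2 --decode-port 8200 --port 8000
# 3. Start the prefill server and the decode server on their respective hosts.
# 4. Send a request to the proxy.
curl http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-V3", "prompt": "Hello", "max_tokens": 16}'
```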

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: underfituu <hzhucong@163.com>
Co-authored-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?

Fix the env variable in DBO to enable DBO for the DeepSeek-V3 model.
Besides, we have fixed a known issue in deepseek-dbo.

### How was this patch tested?

This patch can be tested with newly added e2e tests:
[tests/multicard/test_offline_inference_distributed.py](https://github.com/vllm-project/vllm-ascend/pull/1285/files#diff-7cd2e6b1bda6b8ad1bedb3276971fe7064aeae4dc0efd41c301c4ede2158c57e).
We checked the registered module name and class name in these tests.

---------

Signed-off-by: zhuohuan <zxdu1997@gmail.com>
Fix a running error in DBO when dp_size > 1. Add conditional logic in
`_get_forward_metadata_across_dp` to enable DBO.

Signed-off-by: shikang-hangzhou <459956190@qq.com>
### What this PR does / why we need it?

Supports Deepseek-R1 w4a8 quantization.
Since R1 w4a8 uses mixed quantization, only the MOE layer uses
w4a8_dynamic quantization, so we added the w4a8_dynamic.py file, which
includes the AscendW4A8DynamicFusedMoEMethod class.

### Does this PR introduce _any_ user-facing change?
No, using `--quantization=ascend` is enough.

### How was this patch tested?

#### 1. How to get weights using Modelslim

##### Installation steps

Use the master branch at commit `298e175d69b3b855111a1e09bbe2fcd12fdb4e24`:

```shell
git clone https://gitee.com/ascend/msit.git
cd msit/msmodelslim
bash install.sh
```

##### Required transformers environment

```shell
pip install transformers==4.48.2
```

##### Generate w4a8 weights

Command reference: `msmodelslim/example/DeepSeek/README.md`. Execute the
[pre-check](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#运行前必检)
and [DeepSeek-R1 w4a8 mix
quantization](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-r1-w4a8-混合量化前三层-mlpw8a8-dynamic-量化mla共享专家w8a8量化路由专家w4a8-dynamic量化)
chapters. Reference command:

```shell
cd example/DeepSeek
python3 quant_deepseek_w4a8.py --model_path {original weight path} --save_path {generated weight path} --mindie_format
```

##### Adapt to vllm-ascend

Since `--mindie_format` produces MindIE-format output, some adaptation
modifications are needed before vllm-ascend can use it:

- Rename `quant_model_description_w8a8_dynamic.json` to
  `quant_model_description.json`, and change `"group_size": 0` to
  `"group_size": 256`.
- In `config.json`, change `"model_type": deepseekv2` to
  `"model_type": deepseek_v3` and remove `quantization_config`.

#### 2. How to run w4a8

TP + EP:

```shell
python -m vllm.entrypoints.openai.api_server --model=$1 \
  --trust-remote-code -tp $2 --enable_expert_parallel \
  --quantization ascend --port $3 --max-model-len $4 \
  --max-num-seqs $5 --enforce-eager
```

For example:

```shell
python -m vllm.entrypoints.openai.api_server \
  --model=/weightpath/w4a8_4_layer --trust-remote-code -tp 4 \
  --enable_expert_parallel --quantization ascend --port 8002 \
  --max-model-len 2048 --max-num-seqs 128 --enforce-eager
```

DP + TP + EP:

```shell
python -m vllm.entrypoints.openai.api_server --model=$1 \
  --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel \
  --quantization ascend --port $4 --max-model-len $5 \
  --max-num-seqs $6 --enforce-eager
```

For example:

```shell
python -m vllm.entrypoints.openai.api_server \
  --model=/weightpath/w4a8_4_layer --trust-remote-code -tp 2 -dp 2 \
  --enable_expert_parallel --quantization ascend --port 8002 \
  --max-model-len 2048 --max-num-seqs 128 --enforce-eager
```

#### 3. Usage constraints

```shell
export VLLM_USE_V1=1  # v1
```

---------

Signed-off-by: pichangping <1337510399@qq.com>
### What this PR does / why we need it?
1. [PR913](vllm-project#913)
introduced an error that caused V0's spec decode function to fail.
[PR1109](vllm-project#1109) intended
to fix this problem; unfortunately, the fix broke the ngram function. I
fixed the ngram function in this PR. **PS**: Q: Why wasn't the problem
found when PR1109 was merged? A: The newly introduced problem only
appears when tp>1, and the CI test cases all use tp=1.
2. In versions after 0.7.3, vllm-ascend deleted some spec decode UTs to
keep CI time down, including the eagle speculative UTs, which left CI
unable to cover the eagle function. I added it
(`test_eagle_correctness.py`) back in this PR.
3. Because of the reason mentioned in 2, the current version of Eagle
has a problem. I located and fixed it: vLLM's `draft_model_runner.py`
had changed and vllm-ascend was not synchronized in time.
4. Currently, the UTs of v0 and v1 are mixed in the spec_decode
directory. I split them into two directories: spec_decode_v0 and
spec_decode_v1.
5. I found that
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_include_gpu_probs_tensor`
and
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_should_modify_greedy_probs_inplace`
have changed in vLLM, so I removed their patches in this PR.
6. The v1 MTP UT
failed (https://github.com/vllm-project/vllm-ascend/actions/runs/15782006176/job/44489813330?pr=1323),
so I commented it out. @XWFAlone @JC-ut0

### Does this PR introduce _any_ user-facing change?
This PR fixes the functions of ngram and eagle spec decode in the v0
engine

### How was this patch tested?
ngram and eagle were tested locally on an 800I A2 machine, using real
weights instead of the random small weights used by the UTs, and using a
scenario test with tp>1.
The rest was tested by CI.

Signed-off-by: mengwei805 <mengwei25@huawei.com>
…orchair graph in long sequence predictions (vllm-project#1332)

### What this PR does / why we need it?
Fix the issue of insufficient cached cosine and sine length in MLA's
TorchAir graph mode, which causes accuracy deviation during
long-sequence inference.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
We tested the accuracy of this patch with DeepSeek R1 e2e benchmark
serving, and got a score of 83.33 on the AIME2024 dataset with a
DP4/TP4/EP16 setting.

![image](https://github.com/user-attachments/assets/517c63bf-164a-493f-a3cd-6ecae84f502e)

Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
Adding `W4A8_DYNAMIC` quantization support for linear.
Dense models like Qwen3 can infer with `W4A8_DYNAMIC` quantization.


### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
Added test case `tests/multicard/test_model_qwen3_w4a8.py` to test the
Qwen3 w4a8_dynamic quantized model.
Note: the w4a8_dynamic quantized model is quantized by `msit/msmodelslim`
at commit `d0abb0a47e1f1a473b866ad41b737fbc28fb1409`.

1. Generate `W4A8_DYNAMIC` quantization weights using `msmodelslim`
```shell
git clone https://gitee.com/ascend/msit.git
cd msit/msmodelslim
git checkout d0abb0a47e1f1a473b866ad41b737fbc28fb1409
bash install.sh
```

2. Serve model using `vllm`
```shell
VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
  --model vllm-ascend/Qwen3-8B-W4A8 \
  --port 8000 \
  --quantization ascend \
  --tensor_parallel_size 2 \
  --enforce-eager
```

---------

Signed-off-by: ZhouXiang <zhouxiang100@huawei.com>
Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
### What this PR does / why we need it?
This PR updates torch-npu to dev20250619 on the 0.9.1-dev branch.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>

### What this PR does / why we need it?
Fix the accuracy problem after the MoE refactor and improve the inference flow.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
e2e test in `tests/e2e/multicard/test_offline_inference_distributed.py`

Signed-off-by: shikang-hangzhou <459956190@qq.com>
…la decode (vllm-project#1311)

### What this PR does / why we need it?

After disaggregated PD was merged, the KV cache for DeepSeek becomes two
independent buffers for KV transfer and computation. However, the
current kernel, `paged_attention_mla`, only accepts `k_cache` as a
single parameter, which forces us to concatenate the two pieces of KV
cache before attention and thus incurs a memory peak inside attention in
eager mode. In this PR we introduce
`torch_npu.atb.npu_multi_head_latent_attention` for the MLA decode path,
which will be used as the default path for both eager mode and ACL Graph
once the related torch_npu is publicly available. Since it is still a
restricted package, we add `VLLM_ASCEND_MLA_PA` to control its usage.
This flag will be removed in the future.
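
For reference, a minimal sketch of opting in, assuming the flag is read as a boolean-like environment variable (the accepted values are not spelled out in this PR) and that a torch_npu build shipping `npu_multi_head_latent_attention` is installed; the model and other flags are placeholders:

```shell
# Hypothetical usage; requires the restricted torch_npu package.
export VLLM_ASCEND_MLA_PA=1
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1 \
  --quantization ascend
```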

### Does this PR introduce _any_ user-facing change?

Yes, add a new flag named `VLLM_ASCEND_MLA_PA`, but it will be removed
eventually after the newest torch_npu is released.
---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
Add guidance for users to clean the cache when pip reinstallation of
vllm fails.
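
For reference, a minimal sketch of the kind of cleanup the doc describes; the exact commands in the updated doc may differ:

```shell
# Clear pip's cache and retry the installation without reusing cached wheels.
pip cache purge
pip install -v --no-cache-dir vllm-ascend
```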

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
Just changed the doc

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
…oject#1361)

### What this PR does / why we need it?
Remove the scheduler patch for disaggregated PD, since we found the
patch can not really work on the online serving path.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI tests will guarantee this.

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
Update the disaggregated prefill README.



---------

Signed-off-by: liziyu <liziyu16@huawei.com>
…ject#1393)

### What this PR does / why we need it?
Remove the duplicated code introduced by my inadvertent rebase. I
apologize for this oversight.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
…ect#1422)

### What this PR does / why we need it?
A refactoring of `forward_context` and `model_runner_v1`: add some
context needed for model inference into `forward_context`, and refactor
the `dummy_run` logic to make it more reasonable.
Some details for this PR:
1. Fix an accuracy bug with online + multi-DP + eager mode + all_gather mode;
2. Fix a bug with online + multi-DP + eager mode + mc2 mode;
3. Fix a bug with A2 + eager mode + mc2 mode;
4. Enable different token_num on different chips in mc2 mode;
5. Update scripts in the `examples` dir.

### Does this PR introduce _any_ user-facing change?
This PR removes `expert_tensor_parallel_size` from `additional_config`;
we now use `enable_expert_parallel` to control whether expert
parallelism is enabled, which is consistent with vLLM.

### How was this patch tested?

---------

Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
Fixes the Qwen3 w4a8 test case failure caused by `sampling_params` not
being fixed and by the `torch_npu` update.

---------

Signed-off-by: ZhouXiang <zhouxiang100@huawei.com>
…ble NZ for GMM. (vllm-project#1409)

### What this PR does / why we need it?

1. Add a switch for enabling the NZ layout for weights.
2. Enable NZ for GMM.
3. Replace the magic number for the weight layout.

### Does this PR introduce _any_ user-facing change?

Users should set `enable_weight_nz_layout` to `true` in
`--additional-config` when they want to enable the NZ weight layout, for
example as sketched below.
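
A minimal sketch, assuming `--additional-config` accepts inline JSON as in other vllm-ascend examples; the model and other flags are placeholders:

```shell
python -m vllm.entrypoints.openai.api_server \
  --model vllm-ascend/Qwen3-8B-W4A8 \
  --quantization ascend \
  --additional-config '{"enable_weight_nz_layout": true}'
```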

### How was this patch tested?

1) CI passed. 
2) accuracy and performance comparison (only gsm8k-lite)

Signed-off-by: linfeng-yuan <1102311262@qq.com>
fems14 and others added 10 commits August 6, 2025 23:24
### What this PR does / why we need it?
 remove chunked_prefill_for_mla


### Does this PR introduce _any_ user-facing change?


### How was this patch tested?
Processed prompts: 100%|██████████| 4/4 [00:02<00:00, 1.92it/s, est.
speed input: 12.46 toks/s, output: 38.34 toks/s]
DP rank 2, Generated text: ' [Your Name] and I am a professional
carpenter with over 10 years of experience in the industry'
DP rank 2, Generated text: ' the head of state and head of government of
the United States, indirectly elected to a four-year term'
DP rank 2, Generated text: ' Paris, a city that is renowned for its rich
history, culture, and influence on art, fashion'
DP rank 2, Generated text: ' a topic of much speculation and debate.
Some experts believe that AI will eventually surpass human intelligence,
while'
Processed prompts: 100%|██████████| 4/4 [00:02<00:00, 1.95it/s, est.
speed input: 12.65 toks/s, output: 38.93 toks/s]
DP rank 0, Generated text: " Dr. David Hill and today we're going to be
talking about how to treat a child with a"
DP rank 0, Generated text: ' the head of state and head of government of
the United States, indirectly elected to a four-year term'
DP rank 0, Generated text: ' Paris, a city that is renowned for its rich
history, culture, and influence on art, fashion'
DP rank 0, Generated text: ' here, and it’s called ChatGPT. This
revolutionary technology is changing the way we interact with machines'
Processed prompts: 100%|██████████| 4/4 [00:02<00:00, 1.97it/s, est.
speed input: 12.79 toks/s, output: 39.36 toks/s]

DP rank 1, Generated text: " Dr. David Hill and today we're going to be
talking about how to treat a child's fever"
DP rank 3, Generated text: ' [Your Name] and I’m here to talk to you
about the importance of a healthy diet'
DP rank 1, Generated text: ' the head of state and head of government of
the United States, indirectly elected to a four-year term'
DP rank 1, Generated text: ' Paris, a city that is renowned for its rich
history, culture, and influence on art, fashion'
DP rank 1, Generated text: ' a topic of much speculation and debate.
Some experts believe that AI will eventually surpass human intelligence,
leading'
DP rank 3, Generated text: ' the head of state and head of government of
the United States, indirectly elected to a four-year term'
DP rank 3, Generated text: " Paris. It is the largest city in France and
serves as the country's political, cultural, and"
DP rank 3, Generated text: ' here, and it’s called ChatGPT. This
revolutionary technology is changing the way we interact with machines

---------

Signed-off-by: fems14 <1804143737@qq.com>
…llm-project#2326)

### What this PR does / why we need it?
This PR fixes bugs and refactors the cached-mask generation logic. Now
we just pre-construct and use the cached mask on CPU instead of on the
NPU device.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with newly added/existing tests.

Signed-off-by: rjg-lyh <1318825571@qq.com>
…mismatches and .kv_cache_bytes file missing (vllm-project#2312)

### What this PR does / why we need it?
The original implementation of torchair caching forces users to prepare
everything up front, fix all the configuration and enable
`use_cached_npu_graph`, which might cause problems that are confusing
for users to understand and tackle. It is better to compile the graph
twice instead of reusing the old KV caches and cached torchair graph,
and the extra duration is acceptable.

### Does this PR introduce _any_ user-facing change?
If users want to enable torchair.cache_compile with high compilation
speed, it is recommended to enable both `use_cached_kv_cache_bytes` and
`use_cached_graph` in `torchair_graph_config` (a sketch follows).
Without `use_cached_kv_cache_bytes`, we compile the torchair computation
graph twice to avoid runtime errors caused by configuration mismatches
(the second compilation is much faster).
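
A minimal sketch of the recommended configuration, assuming `--additional-config` accepts inline JSON and that `enabled` is the switch for the torchair graph itself; the model and other flags are placeholders:

```shell
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1 \
  --quantization ascend \
  --additional-config '{"torchair_graph_config": {"enabled": true, "use_cached_graph": true, "use_cached_kv_cache_bytes": true}}'
```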

### How was this patch tested?
CI and e2e vllm serving passed.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
…llm-project#2327)

### What this PR does / why we need it?
Add configuration check logic for the ascend scheduler: 1) if chunked
prefill is disabled, `max_num_batched_tokens` cannot be less than
`max_model_len`, following vLLM; 2) if the ascend scheduler is disabled,
MTP cannot be enabled. A sketch of a compliant configuration follows.
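
A minimal sketch of a configuration that satisfies check 1 while enabling the ascend scheduler (so check 2 is moot), assuming `--additional-config` accepts inline JSON and `ascend_scheduler_config` is the relevant key; the model and sizes are placeholders:

```shell
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --additional-config '{"ascend_scheduler_config": {"enabled": true}}'
```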

### Does this PR introduce any user-facing change?
1) users cannot enable mtp without ascend scheduler
2) users cannot set `max_num_batched_tokens` smaller than
`max_model_len` with ascend scheduler

### How was this patch tested?
CI and vllm serving passed

Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
In pure DP scenarios (such as DP32), LMHead computation takes 1~2 ms. In
this PR we customize the parallelism of LMHead, enabling a separate TP
group for LMHead. The computation flow is as follows:

```
get_lmhead_group().all_gather  # [num_tokens, hid_dim] -->  [num_tokens * lmhead_tp, hid_dim]
--> lmhead matmul  # [num_tokens * lmhead_tp, hid_dim] -->  [num_tokens * lmhead_tp, vocab_size //  lmhead_tp]
--> get_lmhead_group().all_to_all  # [num_tokens * lmhead_tp, vocab_size //  lmhead_tp] --> [num_tokens, vocab_size]
```

This can save 0.5~1 ms for DeepSeek with a batch size of 28 on a single die with MTP.

In addition, this PR also fixes a bug introduced by LMHead
quantization. The op `npu_quant_matmul` only accepts dim < 65536, while
`vocab_size` is > 65536 when using TP 1. We can set the LMHead TP size > 1
to avoid this bug.

Main version of this PR: vllm-project#2309 .

### Does this PR introduce _any_ user-facing change?
Yes. We introduce a new configurable option, `lmhead_tp_size`, in
`ascend_config`. For example:
```
additional_config={
        "lmhead_tp_size": 16,
}
```
The default value is -1, in which case `lmhead_tp_size` is automatically
set to `tensor_parallel_size`. Besides, it is suggested to use this
option when running full DP, to avoid the additional communication
introduced by TP. Therefore, the parallel size of the `lmhead` group
will also be changed to `tensor_parallel_size` if TP > 1, so as to fall
back to the normal TP+DP case.

### How was this patch tested?


---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: zengyanjia <z00883269@china.huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: zengyanjia <z00883269@china.huawei.com>
…cheduler in disaggregated_prefill deployment (vllm-project#2368)

### What this PR does / why we need it?
Currently deepseek_mtp can be enabled with the original vLLM scheduler
only in disaggregated prefill scenarios (experimental). This PR changes
the verification logic to allow users to enable deepseek_mtp without the
ascend scheduler in disaggregated prefill deployments.

### Does this PR introduce _any_ user-facing change?
Users can enable the deepseek_mtp model without the ascend scheduler in
disaggregated prefill deployments.

### How was this patch tested?
CI and e2e vllm serving passed.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
…#2028) (vllm-project#2306)

### What this PR does / why we need it?

Fix the protobuf version in the Dockerfile to resolve `AttributeError:
'str' object has no attribute 'DESCRIPTOR' when packaging message to
dict` when using protobuf. The version specification will be removed
after ray-project/ray#54910 is merged.

backport of vllm-project#2028

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with newly added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?

This PR presents a large-EP deployment solution based on vllm-ascend,
using DeepSeek as an example. It outlines the end-to-end workflow for
model deployment and serves as a reference for developers.


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Signed-off-by: hust17yixuan <303660421@qq.com>
vllm-project#2394)

… the check_watermark_for_prefill function


### What this PR does / why we need it?
The ascend scheduler encountered an incorrect req block length in the
`check_watermark_for_prefill` function; with the current implementation,
it will always be 1.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
before:

http://image.huawei.com/tiny-lts/v1/images/mdstorm/c6cff7cf33d500a3833f5f80352df373_1183x377.png
after:

http://image.huawei.com/tiny-lts/v1/images/mdstorm/57207a490d8ac0a70fc87dd08d02dee6_1470x954.png

Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?

1. MTP supports V1 scheduler 
2. Refactor attn metadata build
### Does this PR introduce _any_ user-facing change?


### How was this patch tested?


- [x] v0.9.1-dev 
- [x] A3 [TP16] [DP4 TP4] 
- [x] A3 1P1D

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
@shen-shanshan shen-shanshan marked this pull request as draft August 18, 2025 06:49

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds the release notes for v0.9.1rc3. The changes update the versioning policy with the new release candidate's compatibility information and release date, and add a new section for it in the release notes document. The new entries are placed correctly, following the existing ordering conventions in both files. The changes are consistent and look good.

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Aug 18, 2025
ganyi1996ppo and others added 2 commits August 18, 2025 16:48
### What this PR does / why we need it?
vLLM Ascend's RoPE implementation includes several header files that are
not supposed to be included by outside users. The current implementation
may break when the CANN toolkit updates; this PR removes those
incompatible includes to guarantee safe upgrades of the CANN toolkit.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested by rope unittest

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?

MTP now supports the v1 scheduler; the corresponding validation should
be removed.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- [x]  v0.9.1-dev
- [x]  A3 [TP16] [DP4 TP4]
- [x]  A3 4P1D

Signed-off-by: xuyexiong <xuyexiong@huawei.com>

@Yikun Yikun left a comment


You can follow #2233 to complete the release note.

…-project#2402)

### What this PR does / why we need it?
1. Modify the error message throwing method.
2. Modify the attn_mask setting per PR2177.
3. Modify the seq_len setting per PR2371.
4. Modify the e2e case.

### Does this PR introduce _any_ user-facing change?


### How was this patch tested?


---------

Signed-off-by: shikang-hangzhou <459956190@qq.com>
@shen-shanshan shen-shanshan changed the title [Doc] Add v0.9.1rc3 release note [Doc] Add release note for v0.9.1rc3 Aug 19, 2025
yiz-liu and others added 2 commits August 19, 2025 10:19
…t#1921)

### What this PR does / why we need it?
Since the ATB extension registration has now been fixed in torch_npu, we
have removed the invocation of this private method.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
No need for further testing.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
@shen-shanshan

Moved to #2431.
