
Conversation


@shen-shanshan shen-shanshan commented Aug 18, 2025

What this PR does / why we need it?

Add the v0.9.1rc3 release note; find more details at #2396.

Does this PR introduce any user-facing change?

How was this patch tested?

wangxiyuan and others added 30 commits June 11, 2025 14:06
1. Drop main and add a 0.9.1 check for the 0.9.1-dev branch.

2. Cherry-pick
vllm-project@b75cb78
to fix the import error and make 0.9.1 work.

3. Fix the quantization test failure.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
…by#1160 (vllm-project#1214)

### What this PR does / why we need it?
Fix the torchair execution issue on padding data, and the MTP padding logic.

### How was this patch tested?
It has been tested and merged in main.

Signed-off-by: 刘哲续 <liuzhexu1@huawei.com>
Co-authored-by: 刘哲续 <liuzhexu1@huawei.com>
### What this PR does / why we need it?
Add myst_substitutions print in docs/source/conf.py

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?

This PR reverts 20dedb to restore the token-wise padding logic so that
ACL Graph can work as expected.
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
…t#1240)

### What this PR does / why we need it?
vllm-ascend supports chunked prefill for MLA.
Related PR on main: vllm-project#1172

---------

### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: fems14 <1804143737@qq.com>
…ct#1234)

1. Remove RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES
2. Add lazy init for vllm_ascend_C

Signed-off-by: zhuo97 <1103045176@qq.com>
### What this PR does / why we need it?
rebase main

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

---------

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: depeng1994 <depengzhang@foxmail.com>
Signed-off-by: ttanzhiqiang <389825161@qq.com>
Signed-off-by: yzim <43207690+yzim@users.noreply.github.com>
Signed-off-by: chenwaner <861645847@qq.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
Signed-off-by: wan_danfeng <wonderful199082@126.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: zhuo97 <1103045176@qq.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: zhangxinyuehfad <59153331+zhangxinyuehfad@users.noreply.github.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: sdmyzlp <117554856+sdmyzlp@users.noreply.github.com>
Co-authored-by: depeng1994 <166494784+depeng1994@users.noreply.github.com>
Co-authored-by: ttanzhiqiang <38750855+ttanzhiqiang@users.noreply.github.com>
Co-authored-by: yz <43207690+yzim@users.noreply.github.com>
Co-authored-by: chenwaner <48718746+chenwaner@users.noreply.github.com>
Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com>
Co-authored-by: Li Wang <wangli858794774@gmail.com>
Co-authored-by: wangyanhui-cmss <wangyanhui_yewu@cmss.chinamobile.com>
Co-authored-by: Wan_Danfeng <wonderful199082@126.com>
Co-authored-by: zhuo97 <49392868+zhuo97@users.noreply.github.com>
Co-authored-by: wangxiaoxin (A) <wangxiaoxin7@huawei.com>
… to 2.5.1.post1.dev20250528 (vllm-project#1247)

### What this PR does / why we need it?
Cherry-pick from vllm-project#1235

1. Fix the rank set in the DP scenario. The new PoC version of torch-npu
supports setting `ASCEND_RT_VISIBLE_DEVICES` dynamically, thus we can use
the rank set in `DPEngineCoreProc` directly instead of calculating the
local rank across DP by hand in the patched `_init_data_parallel`.

Closes: vllm-project#1170

2. Bump torch-npu version to 2.5.1.post1.dev20250528

Closes: vllm-project#1242
Closes: vllm-project#1232

### How was this patch tested?
CI passed with newly added test.

---------

Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: Icey <1790571317@qq.com>
cherry-pick to 0.9.1-dev

Signed-off-by: whx-sjtu <2952154980@qq.com>
…lm-project#1264)

### What this PR does / why we need it?
This PR is the cherry-pick of the PR
vllm-project#1229 which have already
merged into the main branch.

This PR is used for resolved [issue
1147](vllm-project#1147)
1. Move fused_moe code into one file `fused_moe.py`.
2. Integrate branch conditions into function `get_fused_moe_state`.

### Does this PR introduce _any_ user-facing change?
1. This PR removes the env `VLLM_ENABLE_MC2`: it is unnecessary because
the scenario can be determined without it, and it only adds complexity.
2. This PR removes the env `USING_LCCL_COM`, because this env has
already expired.
3. `additional_config.expert_tensor_parallel_size` has also expired;
now we use the parameter `enable_expert_parallel` instead, consistent
with vLLM (see the sketch below).
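
For illustration, a minimal sketch of enabling expert parallelism through the standard vLLM flag instead of the removed envs; the model name, parallel size and port are placeholders, not values from this PR:

```shell
# Hypothetical invocation; adjust model, parallel sizes and port to your setup.
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-V3 \
  --quantization ascend \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --port 8000
```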

### How was this patch tested?
CI passed

Signed-off-by: zzzzwwjj <1183291235@qq.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: zzzzwwjj <1183291235@qq.com>

### What this PR does / why we need it?
Refactor the token-wise padding mechanism to a more elegant
implementation, correcting the padding logic errors introduced by the
previous multimodal commit vllm-project#736.

This is a clean version of vllm-project#1259.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
…y when using V0 spec decode (vllm-project#1271)

### What this PR does / why we need it?
Enable `ACL_OP_INIT_MODE=1` directly only when using V0 spec decode.

Cherry pick from vllm-project#1258.

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
…ev) (vllm-project#1296)

This PR adopts LLMDataDist for KV cache registration and a
pull_blocks-style disaggregated prefill implementation. The interface
implementation mainly follows the design of the NIXL PR
https://github.com/vllm-project/vllm/pull/17751/files#diff-7eaad0b7dee0626bf29d10081b0f0c5e3ea15a4af97e7b182a4e0d35f8346953
.

This PR can be tested with the following steps (a minimal sketch follows the list):

1. Generate the rank table for all machines.
2. Execute toy_proxy.py to launch the disaggregated prefill proxy server, specifying the prefill IP/port and the decode IP/port.
3. Run the prefill server and the decode server.
4. Send the request to the disaggregated prefill proxy.
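
A minimal sketch of the flow above, assuming the proxy script accepts host/port flags; the flag names, IPs, ports and model name are placeholders, not the actual interface of toy_proxy.py:

```shell
# 1. Generate the rank table for all machines (tooling not covered here).
# 2. Launch the disaggregated prefill proxy (hypothetical flags).
python toy_proxy.py --prefill-host 10.0.0.1 --prefill-port 8100 \
                    --decode-host 10.0.0.2 --decode-port 8200 --port 8000
# 3. Start the prefill server and the decode server on their respective hosts.
# 4. Send a request to the proxy.
curl http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-V3", "prompt": "Hello", "max_tokens": 16}'
```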

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: underfituu <hzhucong@163.com>
Co-authored-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?

Fix the env variable in DBO to enable DBO for the DeepSeek-V3 model.
Besides, we have fixed a known issue in deepseek-dbo.

### How was this patch tested?

This patch can be tested with newly added e2e tests:
[tests/multicard/test_offline_inference_distributed.py](https://github.com/vllm-project/vllm-ascend/pull/1285/files#diff-7cd2e6b1bda6b8ad1bedb3276971fe7064aeae4dc0efd41c301c4ede2158c57e).
We checked the registered module name and class name in these tests.

---------

Signed-off-by: zhuohuan <zxdu1997@gmail.com>
Fix a running error in DBO when dp_size > 1. Add conditional logic in
`_get_forward_metadata_across_dp` to enable DBO.

Signed-off-by: shikang-hangzhou <459956190@qq.com>
### What this PR does / why we need it?

Supports Deepseek-R1 w4a8 quantization.
Since R1 w4a8 uses mixed quantization, only the MOE layer uses
w4a8_dynamic quantization, so we added the w4a8_dynamic.py file, which
includes the AscendW4A8DynamicFusedMoEMethod class.

### Does this PR introduce _any_ user-facing change?
No, using `--quantization=ascend` is enough.

### How was this patch tested?

#### 1. How to get weights using Modelslim

##### Installation steps

Use the master branch at commit `298e175d69b3b855111a1e09bbe2fcd12fdb4e24`:

```shell
git clone https://gitee.com/ascend/msit.git
cd msit/msmodelslim
bash install.sh
```

##### Required transformers environment

```shell
pip install transformers==4.48.2
```

##### Generate w4a8 weights

Command reference: `msmodelslim/example/DeepSeek/README.md`. Execute the
[pre-check](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#运行前必检)
and [DeepSeek-R1 w4a8 mix
quantization](https://gitee.com/ascend/msit/blob/master/msmodelslim/example/DeepSeek/README.md#deepseek-r1-w4a8-混合量化前三层-mlpw8a8-dynamic-量化mla共享专家w8a8量化路由专家w4a8-dynamic量化)
chapters. Reference command:

```shell
cd example/DeepSeek
python3 quant_deepseek_w4a8.py --model_path {original weight path} --save_path {generated weight path} --mindie_format
```

##### Adapt to vllm-ascend

Since `--mindie_format` produces MindIE-format output, some adaptation
modifications are needed before vllm-ascend can use it:

- Rename `quant_model_description_w8a8_dynamic.json` to
  `quant_model_description.json`, and change `"group_size": 0` to
  `"group_size": 256`.
- In `config.json`, change `"model_type": deepseekv2` to
  `"model_type": deepseek_v3` and remove `quantization_config`.

#### 2. How to run w4a8

TP + EP:

```shell
python -m vllm.entrypoints.openai.api_server --model=$1 \
  --trust-remote-code -tp $2 --enable_expert_parallel \
  --quantization ascend --port $3 --max-model-len $4 \
  --max-num-seqs $5 --enforce-eager
```

For example:

```shell
python -m vllm.entrypoints.openai.api_server \
  --model=/weightpath/w4a8_4_layer --trust-remote-code -tp 4 \
  --enable_expert_parallel --quantization ascend --port 8002 \
  --max-model-len 2048 --max-num-seqs 128 --enforce-eager
```

DP + TP + EP:

```shell
python -m vllm.entrypoints.openai.api_server --model=$1 \
  --trust-remote-code -tp $2 -dp $3 --enable_expert_parallel \
  --quantization ascend --port $4 --max-model-len $5 \
  --max-num-seqs $6 --enforce-eager
```

For example:

```shell
python -m vllm.entrypoints.openai.api_server \
  --model=/weightpath/w4a8_4_layer --trust-remote-code -tp 2 -dp 2 \
  --enable_expert_parallel --quantization ascend --port 8002 \
  --max-model-len 2048 --max-num-seqs 128 --enforce-eager
```

#### 3. Usage constraints

```shell
export VLLM_USE_V1=1  # v1
```

---------

Signed-off-by: pichangping <1337510399@qq.com>
### What this PR does / why we need it?
1. [PR913](vllm-project#913)
introduced an error that caused V0's spec decode function to fail.
[PR1109](vllm-project#1109) intended
to fix this problem; unfortunately, the fix broke the ngram function. I
fixed the ngram function in this PR. **PS**: Q: Why wasn't the problem
found when PR1109 was merged? A: The newly introduced problem only
appears when tp>1, and the CI test cases all use tp=1.
2. In versions after 0.7.3, vllm-ascend deleted some spec decode UTs to
keep CI time down, including the eagle speculative UTs, which left CI
unable to cover the eagle function. I added it
(`test_eagle_correctness.py`) back in this PR.
3. Because of the reason mentioned in 2, the current version of Eagle
has a problem. I located and fixed it: vLLM's `draft_model_runner.py`
had changed and vllm-ascend was not synchronized in time.
4. Currently, the UTs of v0 and v1 are mixed in the spec_decode
directory. I split them into two directories: spec_decode_v0 and
spec_decode_v1.
5. I found that
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_include_gpu_probs_tensor`
and
`vllm.spec_decode.multi_step_worker.MultiStepWorker.set_should_modify_greedy_probs_inplace`
have changed in vLLM, so I removed their patches in this PR.
6. The v1 MTP UT
failed (https://github.com/vllm-project/vllm-ascend/actions/runs/15782006176/job/44489813330?pr=1323),
so I commented it out. @XWFAlone @JC-ut0

### Does this PR introduce _any_ user-facing change?
This PR fixes the functions of ngram and eagle spec decode in the v0
engine

### How was this patch tested?
ngram and eagle were tested locally on an 800I A2 machine, using real
weights instead of the random small weights used by the UTs, and using a
scenario test with tp>1.
The rest was tested by CI.

Signed-off-by: mengwei805 <mengwei25@huawei.com>
…orchair graph in long sequence predictions (vllm-project#1332)

### What this PR does / why we need it?
Fix the issue of insufficient cached cosine and sine length in MLA's
TorchAir graph mode, which causes accuracy deviation during
long-sequence inference.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
We tested the accuracy of this patch with DeepSeek R1 e2e benchmark
serving, and got a score of 83.33 on the AIME2024 dataset with a
DP4/TP4/EP16 setting.

![image](https://github.com/user-attachments/assets/517c63bf-164a-493f-a3cd-6ecae84f502e)

Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
Adding `W4A8_DYNAMIC` quantization support for linear.
Dense models like Qwen3 can infer with `W4A8_DYNAMIC` quantization.


### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
Added test case `tests/multicard/test_model_qwen3_w4a8.py` to test the
Qwen3 w4a8_dynamic quantized model.
Note: the w4a8_dynamic quantized model is quantized by `msit/msmodelslim`
at commit `d0abb0a47e1f1a473b866ad41b737fbc28fb1409`.

1. Generate `W4A8_DYNAMIC` quantization weights using `msmodelslim`
```shell
git clone https://gitee.com/ascend/msit.git
cd msit/msmodelslim
git checkout d0abb0a47e1f1a473b866ad41b737fbc28fb1409
bash install.sh
```

2. Serve model using `vllm`
```shell
VLLM_USE_V1=1 python -m vllm.entrypoints.openai.api_server \
  --model vllm-ascend/Qwen3-8B-W4A8 \
  --port 8000 \
  --quantization ascend \
  --tensor_parallel_size 2 \
  --enforce-eager
```

---------

Signed-off-by: ZhouXiang <zhouxiang100@huawei.com>
Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
### What this PR does / why we need it?
This PR updates torch-npu to dev20250619 on the 0.9.1-dev branch.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>

### What this PR does / why we need it?
Fix the accuracy problem after the MoE refactor and improve the inference flow.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
e2e test in `tests/e2e/multicard/test_offline_inference_distributed.py`

Signed-off-by: shikang-hangzhou <459956190@qq.com>
…la decode (vllm-project#1311)

### What this PR does / why we need it?

After disaggregated PD was merged, the KV cache for DeepSeek becomes two
independent buffers for KV transfer and computation. However, the
current kernel, `paged_attention_mla`, only accepts `k_cache` as a
single parameter, which forces us to concatenate the two pieces of KV
cache before attention and thus incurs a memory peak inside attention in
eager mode. In this PR we introduce
`torch_npu.atb.npu_multi_head_latent_attention` for the MLA decode path,
which will be used as the default path for both eager mode and ACL Graph
once the related torch_npu is publicly available. Since it is still a
restricted package, we add `VLLM_ASCEND_MLA_PA` to control its usage.
This flag will be removed in the future.
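
For reference, a minimal sketch of opting in, assuming the flag is read as a boolean-like environment variable (the accepted values are not spelled out in this PR) and that a torch_npu build shipping `npu_multi_head_latent_attention` is installed; the model and other flags are placeholders:

```shell
# Hypothetical usage; requires the restricted torch_npu package.
export VLLM_ASCEND_MLA_PA=1
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1 \
  --quantization ascend
```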

### Does this PR introduce _any_ user-facing change?

Yes, add a new flag named `VLLM_ASCEND_MLA_PA`, but it will be removed
eventually after the newest torch_npu is released.
---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
Add guidance for users to clean the cache when pip reinstallation of
vllm fails.
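
For reference, a minimal sketch of the kind of cleanup the doc describes; the exact commands in the updated doc may differ:

```shell
# Clear pip's cache and retry the installation without reusing cached wheels.
pip cache purge
pip install -v --no-cache-dir vllm-ascend
```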

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
Just changed the doc

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
…oject#1361)

### What this PR does / why we need it?
Remove the scheduler patch for disaggregated PD, since we found the
patch can not really work on the online serving path.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI tests will guarantee this.

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
Update the disaggregated prefill README.



---------

Signed-off-by: liziyu <liziyu16@huawei.com>
…ject#1393)

### What this PR does / why we need it?
Remove the duplicated code introduced by my inadvertent rebase. I
apologize for this oversight.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
…ect#1422)

### What this PR does / why we need it?
A refactoring of `forward_context` and `model_runner_v1`: add some
context needed for model inference into `forward_context`, and refactor
the `dummy_run` logic to make it more reasonable.
Some details for this PR:
1. Fix an accuracy bug with online + multi-DP + eager mode + all_gather mode;
2. Fix a bug with online + multi-DP + eager mode + mc2 mode;
3. Fix a bug with A2 + eager mode + mc2 mode;
4. Enable different token_num on different chips in mc2 mode;
5. Update scripts in the `examples` dir.

### Does this PR introduce _any_ user-facing change?
This PR removes `expert_tensor_parallel_size` from `additional_config`;
we now use `enable_expert_parallel` to control whether expert
parallelism is enabled, which is consistent with vLLM.

### How was this patch tested?

---------

Signed-off-by: zzzzwwjj <1183291235@qq.com>
### What this PR does / why we need it?
Fixes the Qwen3 w4a8 test case failure caused by `sampling_params` not
being fixed and by the `torch_npu` update.

---------

Signed-off-by: ZhouXiang <zhouxiang100@huawei.com>
…ble NZ for GMM. (vllm-project#1409)

### What this PR does / why we need it?

1. Add a switch for enabling the NZ layout for weights.
2. Enable NZ for GMM.
3. Replace the magic number for the weight layout.

### Does this PR introduce _any_ user-facing change?

Users should set `enable_weight_nz_layout` to `true` in
`--additional-config` when they want to enable the NZ weight layout, for
example as sketched below.
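
A minimal sketch, assuming `--additional-config` accepts inline JSON as in other vllm-ascend examples; the model and other flags are placeholders:

```shell
python -m vllm.entrypoints.openai.api_server \
  --model vllm-ascend/Qwen3-8B-W4A8 \
  --quantization ascend \
  --additional-config '{"enable_weight_nz_layout": true}'
```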

### How was this patch tested?

1) CI passed. 
2) accuracy and performance comparison (only gsm8k-lite)

Signed-off-by: linfeng-yuan <1102311262@qq.com>
fems14 and others added 10 commits August 6, 2025 23:24
### What this PR does / why we need it?
 remove chunked_prefill_for_mla


### Does this PR introduce _any_ user-facing change?


### How was this patch tested?
Processed prompts: 100%|██████████| 4/4 [00:02<00:00, 1.92it/s, est.
speed input: 12.46 toks/s, output: 38.34 toks/s]
DP rank 2, Generated text: ' [Your Name] and I am a professional
carpenter with over 10 years of experience in the industry'
DP rank 2, Generated text: ' the head of state and head of government of
the United States, indirectly elected to a four-year term'
DP rank 2, Generated text: ' Paris, a city that is renowned for its rich
history, culture, and influence on art, fashion'
DP rank 2, Generated text: ' a topic of much speculation and debate.
Some experts believe that AI will eventually surpass human intelligence,
while'
Processed prompts: 100%|██████████| 4/4 [00:02<00:00, 1.95it/s, est.
speed input: 12.65 toks/s, output: 38.93 toks/s]
DP rank 0, Generated text: " Dr. David Hill and today we're going to be
talking about how to treat a child with a"
DP rank 0, Generated text: ' the head of state and head of government of
the United States, indirectly elected to a four-year term'
DP rank 0, Generated text: ' Paris, a city that is renowned for its rich
history, culture, and influence on art, fashion'
DP rank 0, Generated text: ' here, and it’s called ChatGPT. This
revolutionary technology is changing the way we interact with machines'
Processed prompts: 100%|██████████| 4/4 [00:02<00:00, 1.97it/s, est.
speed input: 12.79 toks/s, output: 39.36 toks/s]

DP rank 1, Generated text: " Dr. David Hill and today we're going to be
talking about how to treat a child's fever"
DP rank 3, Generated text: ' [Your Name] and I’m here to talk to you
about the importance of a healthy diet'
DP rank 1, Generated text: ' the head of state and head of government of
the United States, indirectly elected to a four-year term'
DP rank 1, Generated text: ' Paris, a city that is renowned for its rich
history, culture, and influence on art, fashion'
DP rank 1, Generated text: ' a topic of much speculation and debate.
Some experts believe that AI will eventually surpass human intelligence,
leading'
DP rank 3, Generated text: ' the head of state and head of government of
the United States, indirectly elected to a four-year term'
DP rank 3, Generated text: " Paris. It is the largest city in France and
serves as the country's political, cultural, and"
DP rank 3, Generated text: ' here, and it’s called ChatGPT. This
revolutionary technology is changing the way we interact with machines

---------

Signed-off-by: fems14 <1804143737@qq.com>
…llm-project#2326)

### What this PR does / why we need it?
This PR fixes bugs and refactors the cached-mask generation logic. Now
we just pre-construct and use the cached mask on CPU instead of on the
NPU device.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with newly added/existing tests.

Signed-off-by: rjg-lyh <1318825571@qq.com>
…mismatches and .kv_cache_bytes file missing (vllm-project#2312)

### What this PR does / why we need it?
The original implementation of torchair caching forces users to prepare
everything up front, fix all the configuration and enable
`use_cached_npu_graph`, which might cause problems that are confusing
for users to understand and tackle. It is better to compile the graph
twice instead of reusing the old KV caches and cached torchair graph,
and the extra duration is acceptable.

### Does this PR introduce _any_ user-facing change?
If users want to enable torchair.cache_compile with high compilation
speed, it is recommended to enable both `use_cached_kv_cache_bytes` and
`use_cached_graph` in `torchair_graph_config` (a sketch follows).
Without `use_cached_kv_cache_bytes`, we compile the torchair computation
graph twice to avoid runtime errors caused by configuration mismatches
(the second compilation is much faster).
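
A minimal sketch of the recommended configuration, assuming `--additional-config` accepts inline JSON and that `enabled` is the switch for the torchair graph itself; the model and other flags are placeholders:

```shell
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1 \
  --quantization ascend \
  --additional-config '{"torchair_graph_config": {"enabled": true, "use_cached_graph": true, "use_cached_kv_cache_bytes": true}}'
```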

### How was this patch tested?
CI and e2e vllm serving passed.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
…llm-project#2327)

### What this PR does / why we need it?
Add configuration check logic for the ascend scheduler: 1) if chunked
prefill is disabled, `max_num_batched_tokens` cannot be less than
`max_model_len`, following vLLM; 2) if the ascend scheduler is disabled,
MTP cannot be enabled. A sketch of a compliant configuration follows.
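
A minimal sketch of a configuration that satisfies check 1 while enabling the ascend scheduler (so check 2 is moot), assuming `--additional-config` accepts inline JSON and `ascend_scheduler_config` is the relevant key; the model and sizes are placeholders:

```shell
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --additional-config '{"ascend_scheduler_config": {"enabled": true}}'
```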

### Does this PR introduce any user-facing change?
1) users cannot enable mtp without ascend scheduler
2) users cannot set `max_num_batched_tokens` smaller than
`max_model_len` with ascend scheduler

### How was this patch tested?
CI and vllm serving passed

Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
In pure DP scenarios (such as DP32), LMHead computation takes 1~2 ms. In
this PR we customize the parallelism of LMHead, enabling a separate TP
group for LMHead. The computation flow is as follows:

```
get_lmhead_group().all_gather  # [num_tokens, hid_dim] -->  [num_tokens * lmhead_tp, hid_dim]
--> lmhead matmul  # [num_tokens * lmhead_tp, hid_dim] -->  [num_tokens * lmhead_tp, vocab_size //  lmhead_tp]
--> get_lmhead_group().all_to_all  # [num_tokens * lmhead_tp, vocab_size //  lmhead_tp] --> [num_tokens, vocab_size]
```

This can save 0.5~1 ms for DeepSeek with a batch size of 28 on a single die with MTP.

In addition, this PR also fixes a bug introduced by LMHead
quantization. The op `npu_quant_matmul` only accepts dim < 65536, while
`vocab_size` is > 65536 when using TP 1. We can set the LMHead TP size > 1
to avoid this bug.

Main version of this PR: vllm-project#2309 .

### Does this PR introduce _any_ user-facing change?
Yes. We introduce a new configurable option, `lmhead_tp_size`, in
`ascend_config`. For example:
```
additional_config={
        "lmhead_tp_size": 16,
}
```
The default value is -1, in which case `lmhead_tp_size` is automatically
set to `tensor_parallel_size`. Besides, it is suggested to use this
option when running full DP, to avoid the additional communication
introduced by TP. Therefore, the parallel size of the `lmhead` group
will also be changed to `tensor_parallel_size` if TP > 1, so as to fall
back to the normal TP+DP case.

### How was this patch tested?


---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: zengyanjia <z00883269@china.huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: zengyanjia <z00883269@china.huawei.com>
…cheduler in disaggregated_prefill deployment (vllm-project#2368)

### What this PR does / why we need it?
Currently deepseek_mtp can be enabled with the original vLLM scheduler
only in disaggregated prefill scenarios (experimental). This PR changes
the verification logic to allow users to enable deepseek_mtp without the
ascend scheduler in disaggregated prefill deployments.

### Does this PR introduce _any_ user-facing change?
Users can enable the deepseek_mtp model without the ascend scheduler in
disaggregated prefill deployments.

### How was this patch tested?
CI and e2e vllm serving passed.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
…#2028) (vllm-project#2306)

### What this PR does / why we need it?

Fix the protobuf version in the Dockerfile to resolve `AttributeError:
'str' object has no attribute 'DESCRIPTOR' when packaging message to
dict` when using protobuf. The version specification will be removed
after ray-project/ray#54910 is merged.

backport of vllm-project#2028

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with newly added test.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?

This PR presents a large-EP deployment solution based on vllm-ascend,
using DeepSeek as an example. It outlines the end-to-end workflow for
model deployment and serves as a reference for developers.


### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Signed-off-by: hust17yixuan <303660421@qq.com>
vllm-project#2394)

… the check_watermark_for_prefill function


### What this PR does / why we need it?
The ascend scheduler encountered an incorrect req block length in the
`check_watermark_for_prefill` function; with the current implementation,
it will always be 1.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
before:

http://image.huawei.com/tiny-lts/v1/images/mdstorm/c6cff7cf33d500a3833f5f80352df373_1183x377.png
after:

http://image.huawei.com/tiny-lts/v1/images/mdstorm/57207a490d8ac0a70fc87dd08d02dee6_1470x954.png

Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?

1. MTP supports V1 scheduler 
2. Refactor attn metadata build
### Does this PR introduce _any_ user-facing change?


### How was this patch tested?


- [x] v0.9.1-dev 
- [x] A3 [TP16] [DP4 TP4] 
- [x] A3 1P1D

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
@shen-shanshan shen-shanshan marked this pull request as draft August 18, 2025 06:49

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds the release notes for v0.9.1rc3. The changes update the versioning policy with the new release candidate's compatibility information and release date, and add a new section for it in the release notes document. The new entries are placed correctly, following the existing ordering conventions in both files. The changes are consistent and look good.

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Aug 18, 2025
ganyi1996ppo and others added 2 commits August 18, 2025 16:48
### What this PR does / why we need it?
vLLM Ascend's RoPE implementation includes several header files that are
not supposed to be included by outside users. The current implementation
may break when the CANN toolkit updates; this PR removes those
incompatible includes to guarantee safe upgrades of the CANN toolkit.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested by rope unittest

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?

MTP now supports the v1 scheduler; the corresponding validation should
be removed.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- [x]  v0.9.1-dev
- [x]  A3 [TP16] [DP4 TP4]
- [x]  A3 4P1D

Signed-off-by: xuyexiong <xuyexiong@huawei.com>

@Yikun Yikun left a comment


You can follow #2233 to complete the release note.

…-project#2402)

### What this PR does / why we need it?
1. Modify the error message throwing method.
2. Modify the attn_mask setting per PR2177.
3. Modify the seq_len setting per PR2371.
4. Modify the e2e case.

### Does this PR introduce _any_ user-facing change?


### How was this patch tested?


---------

Signed-off-by: shikang-hangzhou <459956190@qq.com>
@shen-shanshan shen-shanshan changed the title [Doc] Add v0.9.1rc3 release note [Doc] Add release note for v0.9.1rc3 Aug 19, 2025
yiz-liu and others added 2 commits August 19, 2025 10:19
…t#1921)

### What this PR does / why we need it?
Since the ATB extension registration has now been fixed in torch_npu, we
have removed the invocation of this private method.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
No need for further testing.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
@shen-shanshan

Moved to #2431.
