Commit c4ff1ab

Merge branch 'main' into main
2 parents: c8cd500 + 5cd5d64

64 files changed, +4378 -519 lines

.github/workflows/vllm_ascend_test.yaml

Lines changed: 13 additions & 3 deletions
@@ -114,14 +114,20 @@ jobs:
           # pytest -sv tests/singlecard/test_guided_decoding.py.py
           # test_ascend_config.py should be ran separately because it will regenerate the global config many times.
           pytest -sv tests/singlecard/test_ascend_config.py
+          pytest -sv tests/singlecard/test_camem.py
           pytest -sv tests/singlecard/ \
             --ignore=tests/singlecard/test_offline_inference.py \
             --ignore=tests/singlecard/test_scheduler.py \
             --ignore=tests/singlecard/test_guided_decoding.py \
-            --ignore=tests/singlecard/test_ascend_config.py
+            --ignore=tests/singlecard/test_ascend_config.py \
+            --ignore=tests/singlecard/test_camem.py
         else
           pytest -sv tests/multicard/test_ilama_lora_tp2.py
-          VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/ --ignore=tests/multicard/test_ilama_lora_tp2.py
+          # To avoid oom, we need to run the test in a single process.
+          VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_QwQ
+          VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek
+          VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_topk
+          VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/ --ignore=tests/multicard/test_ilama_lora_tp2.py --ignore=tests/multicard/test_offline_inference_distributed.py
         fi

       - name: Run vllm-project/vllm-ascend test on V0 engine
@@ -136,16 +142,20 @@ jobs:
           pytest -sv tests/singlecard/test_camem.py
           # test_ascend_config.py should be ran separately because it will regenerate the global config many times.
           pytest -sv tests/singlecard/test_ascend_config.py
+          pytest -sv tests/singlecard/test_prompt_embedding.py
           pytest -sv tests/singlecard/ \
             --ignore=tests/singlecard/test_offline_inference.py \
             --ignore=tests/singlecard/test_scheduler.py \
             --ignore=tests/singlecard/test_guided_decoding.py \
             --ignore=tests/singlecard/test_camem.py \
-            --ignore=tests/singlecard/test_ascend_config.py
+            --ignore=tests/singlecard/test_ascend_config.py \
+            --ignore=tests/singlecard/test_prompt_embedding.py
         else
           pytest -sv tests/multicard/test_ilama_lora_tp2.py
           # Fixme: run VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py will raise error.
+          # To avoid oom, we need to run the test in a single process.
           VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_QwQ
           VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_DeepSeek
+          VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/test_offline_inference_distributed.py::test_models_distributed_topk
           VLLM_USE_MODELSCOPE=True pytest -sv tests/multicard/ --ignore=tests/multicard/test_ilama_lora_tp2.py --ignore=tests/multicard/test_offline_inference_distributed.py
         fi

docs/source/community/contributors.md

Lines changed: 8 additions & 0 deletions
@@ -1,3 +1,11 @@
+# Maintainers
+
+| Name | Github ID | Date |
+|:-----------:|:-----:|:-----:|
+| Xiyuan Wang| [@wangxiyuan](https://github.com/wangxiyuan) | 2025/01 |
+| Yikun Jiang| [@Yikun](https://github.com/Yikun) | 2025/02 |
+| Yi Gan| [@ganyi1996ppo](https://github.com/ganyi1996ppo) | 2025/02 |
+
 # Contributors

 vLLM Ascend every release would not have been possible without the following contributors:

docs/source/community/governance.md

Lines changed: 17 additions & 0 deletions
@@ -29,3 +29,20 @@ vLLM Ascend is an open-source project under the vLLM community, where the author
 Requires approval from existing Maintainers. The vLLM community has the final decision-making authority.

 Maintainer will be empowered [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) Github repo write permissions (`Can read, clone, and push to this repository. Can also manage issues and pull requests`).
+
+## Nominating and Removing Maintainers
+
+### The Principles
+
+- Membership in vLLM Ascend is given to individuals on a merit basis after they have demonstrated strong expertise in vLLM / vLLM Ascend through contributions, reviews and discussions.
+
+- For membership in the maintainer group, the individual has to demonstrate strong and continued alignment with the overall vLLM / vLLM Ascend principles.
+
+- There are light criteria for moving module maintainers to 'emeritus' status if they do not actively participate over long periods of time.
+
+- The membership is for an individual, not a company.
+
+### Nomination and Removal
+
+- Nomination: Anyone can nominate someone to become a maintainer (including self-nomination). All existing maintainers are responsible for evaluating the nomination. The nominator should provide information on the strength of the candidate as a maintainer, including but not limited to review quality, contribution quality and community involvement.
+- Removal: Anyone can nominate a person to be removed from the maintainer position (including self-nomination). All existing maintainers are responsible for evaluating the nomination. The nominator should provide information on the nominee, including but not limited to lack of activity, conflict with the overall direction and other information that makes them unfit to be a maintainer.

docs/source/conf.py

Lines changed: 4 additions & 4 deletions
@@ -64,15 +64,15 @@
     # the branch of vllm, used in vllm clone
     # - main branch: 'main'
     # - vX.Y.Z branch: 'vX.Y.Z'
-    'vllm_version': 'v0.8.5.post1',
+    'vllm_version': 'v0.9.0',
     # the branch of vllm-ascend, used in vllm-ascend clone and image tag
     # - main branch: 'main'
     # - vX.Y.Z branch: latest vllm-ascend release tag
-    'vllm_ascend_version': 'v0.8.5rc1',
+    'vllm_ascend_version': 'v0.9.0rc1',
     # the newest release version of vllm-ascend and matched vLLM, used in pip install.
     # This value should be updated when cut down release.
-    'pip_vllm_ascend_version': "0.8.5rc1",
-    'pip_vllm_version': "0.8.5.post1",
+    'pip_vllm_ascend_version': "0.9.0rc1",
+    'pip_vllm_version': "0.9.0",
     # CANN image tag
     'cann_image_tag': "8.1.rc1-910b-ubuntu22.04-py3.10",
 }

docs/source/developer_guide/versioning_policy.md

Lines changed: 2 additions & 0 deletions
@@ -22,6 +22,7 @@ Following is the Release Compatibility Matrix for vLLM Ascend Plugin:

 | vLLM Ascend | vLLM | Python | Stable CANN | PyTorch/torch_npu | MindIE Turbo |
 |-------------|--------------|------------------|-------------|--------------------|--------------|
+| v0.9.0rc1 | v0.9.0 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1 | |
 | v0.8.5rc1 | v0.8.5.post1 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1 | |
 | v0.8.4rc2 | v0.8.4 | >= 3.9, < 3.12 | 8.0.0 | 2.5.1 / 2.5.1 | |
 | v0.7.3.post1| v0.7.3 | >= 3.9, < 3.12 | 8.1.RC1 | 2.5.1 / 2.5.1 | 2.0rc1 |

@@ -33,6 +34,7 @@ Following is the Release Compatibility Matrix for vLLM Ascend Plugin:

 | Date | Event |
 |------------|-------------------------------------------|
+| 2025.06.09 | Release candidates, v0.9.0rc1 |
 | 2025.05.29 | v0.7.x post release, v0.7.3.post1 |
 | 2025.05.08 | v0.7.x Final release, v0.7.3 |
 | 2025.05.06 | Release candidates, v0.8.5rc1 |

docs/source/faqs.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 ## Version Specific FAQs

 - [[v0.7.3.post1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/1007)
-- [[v0.8.5rc1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/754)
+- [[v0.9.0rc1] FAQ & Feedback](https://github.com/vllm-project/vllm-ascend/issues/1115)

 ## General FAQs

docs/source/index.md

Lines changed: 1 addition & 0 deletions
@@ -47,6 +47,7 @@ user_guide/suppoted_features
 user_guide/supported_models
 user_guide/env_vars
 user_guide/additional_config
+user_guide/graph_mode.md
 user_guide/release_notes
 :::

docs/source/user_guide/additional_config.md

Lines changed: 10 additions & 6 deletions
@@ -24,11 +24,13 @@ LLM(model="Qwen/Qwen3-8B", additional_config={"config_key":"config_value"})

 The following table lists the additional configuration options available in vLLM Ascend:

-| Name | Type | Default | Description |
-| ---- | ---- | ------- | ----------- |
-| `torchair_graph_config` | dict | `{}` | The config options for torchair graph mode |
-| `ascend_scheduler_config` | dict | `{}` | The config options for ascend scheduler |
-| `expert_tensor_parallel_size` | str | `1` | Expert tensor parallel size the model to use. |
+| Name | Type | Default | Description |
+|-------------------------------| ---- |------|-----------------------------------------------------------------------------------------------|
+| `torchair_graph_config` | dict | `{}` | The config options for torchair graph mode |
+| `ascend_scheduler_config` | dict | `{}` | The config options for ascend scheduler |
+| `expert_tensor_parallel_size` | str | `0` | Expert tensor parallel size the model to use. |
+| `refresh` | bool | `false` | Whether to refresh global ascend config content. This value is usually used by rlhf case. |
+| `expert_map_path` | str | None | When using expert load balancing for the MOE model, an expert map path needs to be passed in. |

 The details of each config option are as follows:

@@ -37,6 +39,7 @@ The details of each config option are as follows:
 | Name | Type | Default | Description |
 | ---- | ---- | ------- | ----------- |
 | `enabled` | bool | `False` | Whether to enable torchair graph mode |
+| `enable_view_optimize` | bool | `True` | Whether to enable torchair view optimization |
 | `use_cached_graph` | bool | `False` | Whether to use cached graph |
 | `graph_batch_sizes` | list[int] | `[]` | The batch size for torchair graph cache |
 | `graph_batch_sizes_init` | bool | `False` | Init graph batch size dynamically if `graph_batch_sizes` is empty |

@@ -69,6 +72,7 @@ A full example of additional configuration is as follows:
         "enabled": true,
         "chunked_prefill_enabled": true,
     },
-    "expert_tensor_parallel_size": 1
+    "expert_tensor_parallel_size": 1,
+    "refresh": false,
 }
 ```
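For orientation, here is a minimal offline sketch (not part of the diff above) of how the options documented in this file could be passed through `additional_config`, reusing the `Qwen/Qwen3-8B` example from the same page; the specific values are illustrative only.

```python
# Minimal sketch only: passing the documented options to an offline LLM instance.
# Assumes Ascend hardware with vllm-ascend installed and the Qwen/Qwen3-8B weights reachable.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-8B",
    additional_config={
        "torchair_graph_config": {"enabled": False},   # options from the torchair table above
        "ascend_scheduler_config": {"enabled": True},  # enable the ascend scheduler
        "expert_tensor_parallel_size": 1,
        "refresh": False,                              # refresh global ascend config (RLHF use case)
    },
)
print(llm.generate("Hello, how are you?"))
```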
docs/source/user_guide/graph_mode.md

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+# Graph Mode Guide
+
+
+This feature is currently experimental. In future versions, there may be behavioral changes around configuration, coverage, and performance.
+
+This guide provides instructions for using Ascend graph mode with vLLM Ascend. Please note that graph mode is only available with the V1 Engine, and only the Qwen and DeepSeek series models are well tested in 0.9.0rc1. We'll make it stable and more general in the next release.
+
+## Getting Started
+
+From v0.9.0rc1 with the V1 Engine, vLLM Ascend runs models in graph mode by default to keep the same behavior as vLLM. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.
+
+There are two kinds of graph mode supported by vLLM Ascend:
+- **ACLGraph**: This is the default graph mode supported by vLLM Ascend. In v0.9.0rc1, only Qwen series models are well tested.
+- **TorchAirGraph**: This is the GE graph mode. In v0.9.0rc1, only DeepSeek series models are supported.
+
+## Using ACLGraph
+ACLGraph is enabled by default. Taking Qwen series models as an example, enabling the V1 Engine is enough.
+
+offline example:
+
+```python
+import os
+
+from vllm import LLM
+
+os.environ["VLLM_USE_V1"] = "1"
+
+model = LLM(model="Qwen/Qwen2-7B-Instruct")
+outputs = model.generate("Hello, how are you?")
+```
+
+online example:
+
+```shell
+vllm serve Qwen/Qwen2-7B-Instruct
+```
+
+## Using TorchAirGraph
+
+If you want to run DeepSeek series models with graph mode, you should use [TorchAirGraph](https://www.hiascend.com/document/detail/zh/Pytorch/700/modthirdparty/torchairuseguide/torchair_0002.html). In this case, additional config is required.
+
+offline example:
+
+```python
+import os
+from vllm import LLM
+
+os.environ["VLLM_USE_V1"] = "1"
+
+model = LLM(model="deepseek-ai/DeepSeek-R1-0528", additional_config={"torchair_graph_config": {"enabled": True}})
+outputs = model.generate("Hello, how are you?")
+```
+
+online example:
+
+```shell
+vllm serve deepseek-ai/DeepSeek-R1-0528 --additional-config='{"torchair_graph_config": {"enabled": true}}'
+```
+
+You can find more details about additional config [here](./additional_config.md).
+
+## Fallback to Eager Mode
+
+If both `ACLGraph` and `TorchAirGraph` fail to run, you should fall back to eager mode.
+
+offline example:
+
+```python
+import os
+from vllm import LLM
+
+os.environ["VLLM_USE_V1"] = "1"
+
+model = LLM(model="someother_model_weight", enforce_eager=True)
+outputs = model.generate("Hello, how are you?")
+```
+
+online example:
+
+```shell
+vllm serve Qwen/Qwen2-7B-Instruct --enforce-eager
+```

docs/source/user_guide/release_notes.md

Lines changed: 38 additions & 0 deletions
@@ -1,5 +1,43 @@
 # Release note

+## v0.9.0rc1 - 2025.06.09
+
+This is the first release candidate of v0.9.0 for vllm-ascend. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/) to start the journey. From this release, the V1 Engine is recommended. The code of the V0 Engine is frozen and will no longer be maintained. Please set the environment variable `VLLM_USE_V1=1` to enable the V1 Engine.
+
+### Highlights
+
+- DeepSeek works with graph mode now. Follow the [official doc](https://vllm-ascend.readthedocs.io/en/latest/user_guide/graph_mode.html) to give it a try. [#789](https://github.com/vllm-project/vllm-ascend/pull/789)
+- Qwen series models work with graph mode now. It is enabled by default with the V1 Engine. Please note that in this release, only Qwen series models are well tested with graph mode. We'll make it stable and more general in the next release. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.
+
+### Core
+
+- The performance of the multi-step scheduler has been improved. Thanks to the contribution from China Merchants Bank. [#814](https://github.com/vllm-project/vllm-ascend/pull/814)
+- LoRA, Multi-LoRA and dynamic serving are supported with the V1 Engine now. Thanks to the contribution from China Merchants Bank. [#893](https://github.com/vllm-project/vllm-ascend/pull/893)
+- The prefix cache and chunked prefill features work now. [#782](https://github.com/vllm-project/vllm-ascend/pull/782) [#844](https://github.com/vllm-project/vllm-ascend/pull/844)
+- Spec decode and MTP features work with the V1 Engine now. [#874](https://github.com/vllm-project/vllm-ascend/pull/874) [#890](https://github.com/vllm-project/vllm-ascend/pull/890)
+- The DP (data parallel) feature works with DeepSeek now. [#1012](https://github.com/vllm-project/vllm-ascend/pull/1012)
+- The input embedding feature works with the V0 Engine now. [#916](https://github.com/vllm-project/vllm-ascend/pull/916)
+- The sleep mode feature works with the V1 Engine now. [#1084](https://github.com/vllm-project/vllm-ascend/pull/1084)
+
+### Model
+
+- Qwen2.5 VL works with the V1 Engine now. [#736](https://github.com/vllm-project/vllm-ascend/pull/736)
+- Llama4 works now. [#740](https://github.com/vllm-project/vllm-ascend/pull/740)
+- Support for dual-batch overlap (DBO) with DeepSeek is added. Please set `VLLM_ASCEND_ENABLE_DBO=1` to use it. [#941](https://github.com/vllm-project/vllm-ascend/pull/941)
+
+### Other
+
+- Online serving with Ascend quantization works now. [#877](https://github.com/vllm-project/vllm-ascend/pull/877)
+- A batch of bugs for graph mode and MoE models have been fixed. [#773](https://github.com/vllm-project/vllm-ascend/pull/773) [#771](https://github.com/vllm-project/vllm-ascend/pull/771) [#774](https://github.com/vllm-project/vllm-ascend/pull/774) [#816](https://github.com/vllm-project/vllm-ascend/pull/816) [#817](https://github.com/vllm-project/vllm-ascend/pull/817) [#819](https://github.com/vllm-project/vllm-ascend/pull/819) [#912](https://github.com/vllm-project/vllm-ascend/pull/912) [#897](https://github.com/vllm-project/vllm-ascend/pull/897) [#961](https://github.com/vllm-project/vllm-ascend/pull/961) [#958](https://github.com/vllm-project/vllm-ascend/pull/958) [#913](https://github.com/vllm-project/vllm-ascend/pull/913) [#905](https://github.com/vllm-project/vllm-ascend/pull/905)
+- A batch of performance improvement PRs have been merged. [#784](https://github.com/vllm-project/vllm-ascend/pull/784) [#803](https://github.com/vllm-project/vllm-ascend/pull/803) [#966](https://github.com/vllm-project/vllm-ascend/pull/966) [#839](https://github.com/vllm-project/vllm-ascend/pull/839) [#970](https://github.com/vllm-project/vllm-ascend/pull/970) [#947](https://github.com/vllm-project/vllm-ascend/pull/947) [#987](https://github.com/vllm-project/vllm-ascend/pull/987) [#1085](https://github.com/vllm-project/vllm-ascend/pull/1085)
+- From this release, a binary wheel package will be released as well. [#775](https://github.com/vllm-project/vllm-ascend/pull/775)
+- The contributors doc page has been [added](https://vllm-ascend.readthedocs.io/en/latest/community/contributors.html).
+
+### Known Issues
+
+- In some cases, the vLLM process may crash with ACLGraph enabled. We're working on this issue and it will be fixed in the next release.
+- Multi-node data parallel doesn't work with this release. This is a known issue in vLLM and has been fixed on the main branch. [#18981](https://github.com/vllm-project/vllm/pull/18981)
+
 ## v0.7.3.post1 - 2025.05.29

 This is the first post release of 0.7.3. Please follow the [official doc](https://vllm-ascend.readthedocs.io/en/v0.7.3-dev) to start the journey. It includes the following changes:
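As a quick orientation for the v0.9.0rc1 notes above, here is a minimal offline sketch (not part of the commit) that opts into the V1 Engine and the DeepSeek dual-batch overlap switch mentioned in the release notes; the checkpoint name is borrowed from the graph mode guide and the exact setup is illustrative only.

```python
# Minimal sketch only: enable the V1 Engine and DBO as described in the release notes above.
# Assumes Ascend hardware, vllm-ascend v0.9.0rc1, and enough devices to host the DeepSeek checkpoint.
import os

os.environ["VLLM_USE_V1"] = "1"              # V1 Engine is recommended from this release
os.environ["VLLM_ASCEND_ENABLE_DBO"] = "1"   # opt into dual-batch overlap for DeepSeek

from vllm import LLM  # import after setting the flags so they are guaranteed to be seen

llm = LLM(model="deepseek-ai/DeepSeek-R1-0528")
outputs = llm.generate("Hello, how are you?")
print(outputs)
```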
