
Conversation

@ttanzhiqiang (Contributor)

Update the torch kernel implementations (concat_and_cache_mla_torch / merge_attn_states_torch / gather_cache_torch / flash_attn_varlen_func_torch / flash_mla_with_kvcache_torch) and update chunked prefill

What this PR does / why we need it?

Update vLLM chunked prefill and the torch implementations of the kernels (concat_and_cache_mla_torch, merge_attn_states_torch, gather_cache_torch, flash_attn_varlen_func_torch, flash_mla_with_kvcache_torch).
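
For reference, a pure-torch merge of partial attention states (the operation a kernel like merge_attn_states_torch performs when combining chunked-prefill partial results) typically looks like the sketch below; tensor shapes and argument names here are illustrative assumptions, not necessarily the exact signatures added in this PR.

```python
import torch

def merge_attn_states_torch_sketch(
    output_prefix: torch.Tensor,  # [num_tokens, num_heads, head_size]
    lse_prefix: torch.Tensor,     # [num_tokens, num_heads], log-sum-exp of prefix scores
    output_suffix: torch.Tensor,  # [num_tokens, num_heads, head_size]
    lse_suffix: torch.Tensor,     # [num_tokens, num_heads], log-sum-exp of suffix scores
) -> torch.Tensor:
    # Combine the two softmax normalizers in a numerically stable way.
    max_lse = torch.maximum(lse_prefix, lse_suffix)
    p_prefix = torch.exp(lse_prefix - max_lse)
    p_suffix = torch.exp(lse_suffix - max_lse)
    denom = p_prefix + p_suffix
    # Re-weight each partial output by its share of the total normalizer.
    w_prefix = (p_prefix / denom).unsqueeze(-1)
    w_suffix = (p_suffix / denom).unsqueeze(-1)
    return output_prefix * w_prefix + output_suffix * w_suffix
```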

Does this PR introduce any user-facing change?

How was this patch tested?

cd vllm-ascend/examples
python offline_inference_npu_v1.py
(Screenshot: offline inference output, 2025-04-22 14:35:30)

simon-mo and others added 30 commits January 29, 2025 02:44
### What this PR does / why we need it?
vLLM Ascend plugin (vllm-ascend) is a backend plugin for running vLLM on
the Ascend NPU.

This plugin is the recommended approach for supporting the Ascend
backend within the vLLM community. It adheres to the principles outlined
in the [RFC]: Hardware pluggable, providing a hardware-pluggable
interface that decouples the integration of the Ascend NPU with vLLM.

This patch also includes changes to make CI work and use caching to speed up the e2e test, including:
1. Change the push (post-merge CI) and pull_request (PR CI) trigger branch to main
2. Make mypy work by ignoring base_communicator and clearing unused deps
3. Several improvements for vllm_ascend_test:
   - Use caches (pip, ms, hf) to speed up the e2e test (25 min --> 5 min)
   - Switch the `git clone` command to `action/checkout` to speed up checkout
   - Enable `-sv` for pytest for a better info dump
   - Remove network host to resolve `docker: conflicting options: cannot attach both user-defined and non-user-defined network modes`, which is a problem on docker 1.45 but not on 1.39.
4. Adapt MLA decode optimizations:
vllm-project/vllm@cabaf4e

### Does this PR introduce _any_ user-facing change?
Yes, init the PR.

### How was this patch tested?
- This is the first PR to make the Ascend NPU work with vLLM. All code is
tested on Ascend with the vLLM V0 engine.
- CI passed

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: wangshuai09 <391746016@qq.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?

This PR refactors the model runner to decouple it from the classes
specifically designed for GPU.

The changes to the model runner are summarized below:

![iShot_2025-01-20_21 32
37](https://github.com/user-attachments/assets/e7e14e5f-5367-42cf-bc82-abff35cd73b9)

**Other changes:** I have removed the code for `cuda`, `lora` and `prompt
adapter`, because the NPU doesn't support them yet.

### Does this PR introduce _any_ user-facing change?

no.

### How was this patch tested?

I have used `AI-ModelScope/gpt2` for testing
`examples/offline_inference_npu.py`, and the results showed that it
worked well.

The test logs are shown below:

```bash
INFO 02-05 09:08:46 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 02-05 09:08:46 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 02-05 09:08:46 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 02-05 09:08:46 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 02-05 09:08:46 __init__.py:44] plugin ascend loaded.
INFO 02-05 09:08:46 __init__.py:177] Platform plugin ascend is activated
INFO 02-05 09:08:48 config.py:2383] Downcasting torch.float32 to torch.float16.
INFO 02-05 09:08:59 config.py:542] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 02-05 09:08:59 llm_engine.py:234] Initializing a V0 LLM engine (v0.1.dev1+gb3a0d01) with config: model='/home/sss/models/AI-ModelScope/gpt2', speculative_config=None, tokenizer='/home/sss/models/AI-ModelScope/gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/sss/models/AI-ModelScope/gpt2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, 
WARNING 02-05 09:09:01 _custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 02-05 09:09:01 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.18it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.18it/s]

INFO 02-05 09:09:11 executor_base.py:110] # CPU blocks: 98557, # CPU blocks: 7281
INFO 02-05 09:09:11 executor_base.py:115] Maximum concurrency for 1024 tokens per request: 1539.95x
INFO 02-05 09:09:12 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 2.13 seconds
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.53it/s, est. speed input: 8.41 toks/s, output: 152.97 toks/s]
Prompt: 'Hello, my name is', Generated text: " John. I'm a writer, and I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm"
Prompt: 'The president of the United States is', Generated text: ' States president. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United'
Prompt: 'The capital of France is', Generated text: ' the capital of the French Republic, and the capital of the French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.'
Prompt: 'The future of AI is', Generated text: '\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future'
```

---------

Signed-off-by: Shanshan Shen <467638484@qq.com>
### What this PR does / why we need it?
Add feature and model support matrix

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI test is enough

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR adds Chinese documents for vllm-ascend for Chinese-speaking
developers

### Does this PR introduce _any_ user-facing change?
Change as follows
- add README.zh.md
- add environment.zh.md
- add CONTRIBUTING.zh.md

### How was this patch tested?
By CI

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Use `pytest.ini` to manage vllm native tests.
This converts the original test-script whitelist to a blacklist, to
avoid missing newly added test scripts from upstream vLLM.

**note**: _we do **not** manage the test scripts of vLLM-Ascend in
`pytest.ini`, because if we do so, there will be conflicts between vLLM
and vLLM-Ascend's `conftest.py`._
### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
CI passed with existing tests.

Signed-off-by: MengqingCao <cmq0113@163.com>
A follow-up fix for the [MRotaryEmbedding
change](vllm-project/vllm@bf3b79e#diff-6bc44986c91bf0876240dec03d56c748403691c7fcd90f7a22e7affff7b033ecR839).

Signed-off-by: z00897138 <zhaorifa@huawei.com>
Co-authored-by: z00897138 <zhaorifa@huawei.com>
…ject#13)

Add `try_register_lib` and import mindie-turbo during initialization.
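
A minimal sketch of what such a helper might look like (assumed behavior: import an optional acceleration library if it is installed, otherwise continue silently; the real implementation may differ):

```python
import importlib

def try_register_lib(lib_name: str, lib_info: str = "") -> None:
    """Try to import an optional library (e.g. mindie-turbo); skip it if unavailable."""
    try:
        importlib.import_module(lib_name)
        if lib_info:
            print(lib_info)
    except Exception:
        # The optional library is not installed; vllm-ascend keeps working without it.
        pass
```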

---------

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>

### What this PR does / why we need it?
Replace logo official link and update contrib doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview:
-
https://github.com/Yikun/vllm-ascend/blob/336055be1a271b5c349d19b0c5dc29e77caadf4f/README.zh.md
-
https://github.com/Yikun/vllm-ascend/blob/336055be1a271b5c349d19b0c5dc29e77caadf4f/README.md

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Make package versioning controlled by setuptools_scm, to keep it
consistent with vLLM.
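
As a quick illustration (assuming the distribution name is `vllm_ascend`), the setuptools_scm-derived version can be inspected at runtime:

```python
from importlib.metadata import version

# Prints the installed version derived by setuptools_scm from git metadata,
# e.g. "0.1.dev11+g07f2a16" for a development build.
print(version("vllm_ascend"))
```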

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix device of tensors created in `AscendAttentionBackendImpl`.

When specifying a device other than card-0, a **device conflict** occurs
because tensors (such as `attn_mask`) are placed on card-0 by default.

This PR creates these tensors on the card corresponding to the input.
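
As a rough sketch of the idea (function and variable names here are hypothetical, not the actual `AscendAttentionBackendImpl` code), auxiliary tensors are allocated on the device of the incoming tensor instead of the default device:

```python
import torch

def build_causal_mask(seq_len: int, query: torch.Tensor) -> torch.Tensor:
    # Allocate the mask on query.device (e.g. "npu:1") so that no tensor is
    # implicitly created on card-0.
    idx = torch.arange(seq_len, device=query.device)
    # True above the diagonal, i.e. positions that must not be attended to.
    return idx.unsqueeze(0) > idx.unsqueeze(1)
```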

### Does this PR introduce _any_ user-facing change?
With this PR, users can specify the device by local rank. A modification
to vLLM is also needed and will be linked to this PR once created.

### How was this patch tested?
This was tested locally with the following code. A test case will be added
once the corresponding vLLM modification is complete.
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(model="~/.cache/modelscope/hub/Qwen/Qwen2___5-7B-Instruct", device="npu:1")

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Signed-off-by: MengqingCao <cmq0113@163.com>
Add official doc index. Move the release content to the right place.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
- Fix typos: vllm-ascned --> vllm-ascend
- For version info

### Does this PR introduce _any_ user-facing change?
No


### How was this patch tested?
preview

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Some PRs for plugin support are not merged into vLLM yet. This PR adds
monkey patches to vllm-ascend so that vllm-ascend works with vLLM directly.

The patch code should be removed once the related functionality is
supported by vLLM natively.
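
The patches follow the usual runtime monkey-patch pattern, roughly as sketched below (module and attribute names are placeholders, not the actual patches in this PR):

```python
import importlib

def apply_patch(module_name: str, attr_name: str, replacement) -> None:
    """Replace `module_name.attr_name` with an Ascend-specific implementation at runtime."""
    module = importlib.import_module(module_name)
    setattr(module, attr_name, replacement)

# Hypothetical usage: apply_patch("vllm.some_module", "some_function", ascend_impl)
```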

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix package discovery for submodules.
Before this PR, the wheel built by pip did not contain the submodule `ops`,
which caused an `ImportError` when importing vllm.


### How was this patch tested?
1. build vllm-ascend wheel by pip
```bash
cd ./vllm-ascend
pip wheel ./ --no-deps
pip install vllm_ascend-0.1.dev11+g07f2a16.d20250211-py3-none-any.whl  # change the file name according to your wheel
```
2. check vllm
```python
import vllm
```

Signed-off-by: MengqingCao <cmq0113@163.com>
…m-project#45)

### What this PR does / why we need it?
- Remove the mypy check on the communicator to address:
vllm-project#24 (comment)
- Add mypy.ini to the trigger list

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?

This PR updates vllm-ascend's torch-npu dependency version, so that
vllm-ascend can be installed in environments with a newer torch-npu
(such as torch-npu 2.6.0rc1).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI Test

Signed-off-by: ji-huazhong <hzji210@gmail.com>
### What this PR does / why we need it?
This PR adds the quickstart doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?

This patch enables the doc build for vllm-ascend

- Add sphinx build for vllm-ascend
- Enable readthedocs for vllm-ascend
- Fix CI:
  - Exclude vllm-empty/tests/mistral_tool_use to skip `You need to agree to share your contact information to access this model`, which was introduced in vllm-project/vllm@314cfad
  - Install the test requirements to fix https://github.com/vllm-project/vllm-ascend/actions/runs/13304112758/job/37151690770:
      ```
      vllm-empty/tests/mistral_tool_use/conftest.py:4: in <module>
          import pytest_asyncio
      E   ModuleNotFoundError: No module named 'pytest_asyncio'
      ```
  - Exclude docs PRs

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
1. test locally:
    ```bash
    # Install dependencies.
    pip install -r requirements-docs.txt
    
    # Build the docs and preview
    make clean; make html; python -m http.server -d build/html/
    ```
    
    Launch browser and open http://localhost:8000/.

2. CI passed with preview:
    https://vllm-ascend--55.org.readthedocs.build/en/55/

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Add official install guide.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix the communicator patch so parallelism can work.
See vllm-project#52.

Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
Switch to the latest CANN version

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Add container image build ci:
- Enable branch, tag docker image publish
    - branch image: `vllm-ascend:main`, `vllm-ascend:v0.7.1-dev`
    - tag image: `vllm-ascend:v0.7.1rc1`
- Enable PR docker image build check
- other changes:
    - Prepare `REPO_OWNER` because ghcr requires lowercase
- Add `Free up disk space` step to avoid `No space left on device` like
vllm-project#27
- Setup qemu with image to resolve
docker/setup-qemu-action#198

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
build: CI passed [push
false](https://github.com/vllm-project/vllm-ascend/actions/runs/13347017608/job/37278724158?pr=64)
Notes for test cases:
1. Merge commits to `main` and `v0.7.1-dev` branches:
✅ main: https://github.com/Yikun/vllm-ascend/actions/runs/13347238961
--> ghcr.io/yikun/vllm-ascend:main OK
✅ v0.7.1-dev:
https://github.com/Yikun/vllm-ascend/actions/runs/13347229912 -->
ghcr.io/yikun/vllm-ascend:v0.7.1-dev OK

2. Create PEP 440 tags from GitHub releases (v0.7.1rc1, v0.7.1,
v0.7.1rc1.dev1); full releases also get `latest`:
✅ v0.7.5 --> v0.7.5, latest
✅ v0.7.5rc1 --> v0.7.5rc1
✅ v0.7.5rc1.dev1 --> v0.7.5rc1.dev1
(no latest, add a todo here) v0.7.5rc1.post1 --> v0.7.5rc1.post1

3. Create an unknown tag from a GitHub release:
✅ Created 0.7.1 on v0.7.1-dev: not triggered (only tags prefixed with v trigger the build)

4. Create a tag from git:
✅ Also works: `git tag v0.7.99; git push origin v0.7.99` from
publish-image

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Revert communicator patch as
vllm-project/vllm#13208 has been merged.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
test locally by
vllm-project#30 (comment)

Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?

This patch adds the versioning policy doc for vllm-ascend

Reference:
- https://spark.apache.org/versioning-policy.html
- https://docs.openstack.org/project-team-guide/stable-branches.html
- https://github.com/pytorch/pytorch/blob/main/RELEASE.md

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
preview: https://vllm-ascend--62.org.readthedocs.build/en/62/

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Add files to pytest.ini. Ignore some quantization methods.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
pytest tests/xxx

Signed-off-by: ShiyaNiu <1025125896@qq.com>
Check and update the feature support table.

- Both multi-step and speculative decoding require adaptation of the corresponding workers
- Prompt adapter (a finetuning method) requires adaptation in worker.py and model_runner.py

Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
1. Add a vllm-ascend tutorial doc for serving the Qwen/Qwen2.5-7B-Instruct
model
2. Fix the formatting of files in the `docs` dir, e.g. format tables, add
underlines for links, add line feeds, ...

### Does this PR introduce _any_ user-facing change?

no.

### How was this patch tested?
doc CI passed

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
### What this PR does / why we need it?

Update tutorials.

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
no.

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
### What this PR does / why we need it?
Refactor the installation doc

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI, preview

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
shen-shanshan and others added 3 commits April 22, 2025 18:45
### What this PR does / why we need it?
Remove some parts of the metrics patch, since the `cuda` hard-coding has
been fixed by vllm-project/vllm#14411.

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: ttanzhiqiang <tanzhiqiang.tzq@antgroup.com>
### What this PR does / why we need it?
Use EvalScope to run an evaluation (including evaluation and stress testing):
-
https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage
-
https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Test locally

---------

Signed-off-by: RongRongStudio <82669040+RongRongStudio@users.noreply.github.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
@wangxiyuan (Collaborator) left a comment:

I left a few comments; this is not a full PR review. This PR copies a lot of vLLM code that is useless for vllm-ascend. Please fix all problems of this kind first: the PR should only add code that is meaningful.

@@ -1,5 +1,7 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
Collaborator:

There are too many unnecessary changes in this PR. Please change only the code you need and avoid unnecessary changes. For example, why move these two lines here?

from vllm.forward_context import set_forward_context
from vllm.inputs import INPUT_REGISTRY
from vllm.logger import logger
from vllm.logger import init_logger
Collaborator:

Why? `init_logger` doesn't actually work for vllm-ascend.

from vllm.model_executor.layers.fused_moe import FusedMoE
from vllm.model_executor.model_loader import get_model
from vllm.multimodal import MULTIMODAL_REGISTRY, MultiModalKwargs
from vllm.platforms import current_platform
Collaborator:

vllm-ascend doesn't use current_platform

model_config = self.model_config
cache_config = self.cache_config
scheduler_config = self.scheduler_config
parallel_config = self.parallel_config
Collaborator:

This is quite strange: you set `self.parallel_config` at L67 and it's only used here. Why?

Contributor (author):

@wangxiyuan I adapted this from vllm_ascend/worker/model_runner_v1.py and rewrote _process_reqs and execute_model (minimizing code changes); self.cache_config and self.parallel_config are needed to make mla_v1 work.

ttanzhiqiang and others added 6 commits April 23, 2025 11:33
Signed-off-by: ttanzhiqiang <tanzhiqiang.tzq@antgroup.com>
### What this PR does / why we need it?
Add a data-parallel (DP) stateless process group initialization path with
the HCCL backend as a vllm-ascend patch.
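
Roughly, the idea is to bring up an extra data-parallel group backed by HCCL. The snippet below only illustrates this with plain torch.distributed; the actual patch goes through vLLM's stateless process-group helpers, and all names and arguments here are assumptions:

```python
import torch.distributed as dist

def init_dp_group(host: str, port: int, rank: int, world_size: int) -> None:
    # A TCPStore for rendezvous; rank 0 acts as the store master.
    store = dist.TCPStore(host, port, world_size, rank == 0)
    # "hccl" is the collective backend provided by torch_npu for Ascend NPUs.
    dist.init_process_group(backend="hccl", store=store,
                            rank=rank, world_size=world_size)
```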
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?
1. Support DeepSeek with W8A8 quantization;
2. Support DeepSeek with mixed parallelism (multi-DP, EP+TP);
3. Support DeepSeek with graph mode.
---------

Signed-off-by: wen-jie666 <wenjie39@huawei.com>
Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com>
Signed-off-by: libaokui <libaokui@huawei.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: wen-jie666 <wenjie39@huawei.com>
### What this PR does / why we need it?
This PR gives vllm-ascend access to the piecewise_graph feature provided
by the v1 engine.

1. Register unifiled_ascend_attention_with_output for piecewise_graph to
split the graph.
2. Support NPUGraph to accelerate kernel launch.

### Does this PR introduce _any_ user-facing change?
NPUGraph is enabled by default; users can disable the NPUGraph feature by
configuring enforce_eager.

This has corresponding requirements on the torch_npu and CANN versions:
they need to support graph capture.
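
For example, graph capture can be turned off in the usual vLLM way (the model name below is just an example):

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # any supported model
    enforce_eager=True,                # skip NPUGraph / graph-mode capture
)
```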

### How was this patch tested?
It is turned on by default.

---------

Signed-off-by: Bug Hunter Yan <yanpq@zju.edu.cn>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
- Running the benchmark scripts will download the model from ModelScope

Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Fix the qwen2.5-vl position input bug; fixes vllm-project#625 `TypeError: 'NoneType' object
is not iterable`

Signed-off-by: wangli <wangli858794774@gmail.com>
@wangxiyuan (Collaborator) left a comment:

Thanks, it's clearer now. Please make the CI happy before a deep review. Thanks.

assert num_blocks >= kv_cache_config.num_blocks
# TODO: remove this after the OOM issue is located and fixed, otherwise, some model may
# encounter OOM issue
num_blocks = num_blocks // 4
Collaborator:

useless change

Potabk and others added 13 commits April 24, 2025 17:12
### What this PR does / why we need it?
Part of vllm-project#499.
Add a qwen2.5-vl test on a single NPU. The v1 engine is excluded because
qwen2.5-vl currently has some problems with v1. This test also makes
vllm-project#639 more credible.

Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Enforce eager mode in the V1 engine ahead of the upcoming CANN and
torch_npu releases.

### Does this PR introduce _any_ user-facing change?
After this change, users will no longer need to manually set
enforce_eager=True.

### How was this patch tested?
Test it with regular offline inference examples.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: ttanzhiqiang <tanzhiqiang.tzq@antgroup.com>
Signed-off-by: ttanzhiqiang <tanzhiqiang.tzq@antgroup.com>
Signed-off-by: ttanzhiqiang <tanzhiqiang.tzq@antgroup.com>
Signed-off-by: ttanzhiqiang <tanzhiqiang.tzq@antgroup.com>
Signed-off-by: ttanzhiqiang <tanzhiqiang.tzq@antgroup.com>
Signed-off-by: ttanzhiqiang <tanzhiqiang.tzq@antgroup.com>
vllm-project/vllm@b411418:
this vLLM commit changed the sample usage. This PR adapts the change for
main and makes sure it works for 0.8.4 as well.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: ttanzhiqiang <tanzhiqiang.tzq@antgroup.com>
Signed-off-by: ttanzhiqiang <tanzhiqiang.tzq@antgroup.com>
Signed-off-by: ttanzhiqiang <tanzhiqiang.tzq@antgroup.com>
Signed-off-by: ttanzhiqiang <tanzhiqiang.tzq@antgroup.com>
@ttanzhiqiang (Contributor, author):

@wangxiyuan CI passed

continue
tot = tot_blocks_tensor[b]
# If seq_starts is used, the offset needs to be taken into account
Collaborator:

Please use English comments. There are many places that need to be updated.

# limitations under the License.

# import torch_npu
# import math
Collaborator:

remove useless code


@torch.inference_mode()
def _dummy_run(self) -> torch.Tensor:
def _dummy_run(self, num_tokens: int) -> torch.Tensor:
Collaborator:

Looks like this PR hasn't been rebased onto the newest commit. `_dummy_run` has already been updated in the latest code.

Contributor (author):

update
