update torch (concat_and_cache_mla_torch/merge_attn_states_torch/gather_cache_torch/flash_attn_varlen_func_torch/flash_mla_with_kvcache_torch) and update chunk prefill #609
Conversation
### What this PR does / why we need it?
vLLM Ascend plugin (vllm-ascend) is a backend plugin for running vLLM on
the Ascend NPU.
This plugin is the recommended approach for supporting the Ascend
backend within the vLLM community. It adheres to the principles outlined
in the [RFC]: Hardware pluggable, providing a hardware-pluggable
interface that decouples the integration of the Ascend NPU with vLLM.
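For readers unfamiliar with that mechanism, below is a minimal sketch (illustrative only, not the exact vllm-ascend code) of how a hardware plugin hooks in: vLLM discovers entries in the `vllm.platform_plugins` entry-point group and calls the referenced function, which returns the dotted path of the Platform class to activate. The class path shown is an assumed example.
```python
# Minimal sketch of a vLLM platform plugin entry point (illustrative only).
# Packaging metadata would map "ascend = vllm_ascend:register" under the
# "vllm.platform_plugins" entry-point group; the class path below is an
# assumed example, not necessarily the real vllm-ascend symbol.
def register() -> str:
    """Return the fully qualified name of the Platform subclass to load."""
    return "vllm_ascend.platform.NPUPlatform"
```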
This patch also includes changes to make CI work and to use caches to speed
up the e2e test, including:
1. Change push (post merge ci) and pull_request (pr ci) trigger branch
to main
2. Make mypy work by ignoring base_communicator and clearing unused deps
3. Several improvements for vllm_ascend_test:
- use caches (pip, ms, hf) to speed up the e2e test (25 mins --> 5 mins)
- switch the `git clone` command to `actions/checkout` to speed up checkout
- Enable `-sv` for pytest for better info dump
- Remove network host to resolve `docker: conflicting options: cannot
attach both user-defined and non-user-defined network-modes`, which is a
problem on Docker 1.45 but not on 1.39.
4. Adapt MLA decode optimizations:
vllm-project/vllm@cabaf4e
### Does this PR introduce _any_ user-facing change?
Yes, this is the initial PR.
### How was this patch tested?
- This is the first PR to make the Ascend NPU work with vLLM. All code is
tested on Ascend with the vLLM V0 engine.
- CI passed
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: wangshuai09 <391746016@qq.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it? This PR is a refactoring of model runner, to decouple it from the classes specifically designed for GPU. The changes of model runner are generally showed below:  **Other changes:** I have removed the code of `cuda`, `lora` and `prompt adapter`, because NPU doesn`t support them now. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? I have used `AI-ModelScope/gpt2` for testing `examples/offline_inference_npu.py`, and the results showed that it worked well. The test logs are showed below: ```bash INFO 02-05 09:08:46 __init__.py:30] Available plugins for group vllm.platform_plugins: INFO 02-05 09:08:46 __init__.py:32] name=ascend, value=vllm_ascend:register INFO 02-05 09:08:46 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded. INFO 02-05 09:08:46 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load. INFO 02-05 09:08:46 __init__.py:44] plugin ascend loaded. INFO 02-05 09:08:46 __init__.py:177] Platform plugin ascend is activated INFO 02-05 09:08:48 config.py:2383] Downcasting torch.float32 to torch.float16. INFO 02-05 09:08:59 config.py:542] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'. INFO 02-05 09:08:59 llm_engine.py:234] Initializing a V0 LLM engine (v0.1.dev1+gb3a0d01) with config: model='/home/sss/models/AI-ModelScope/gpt2', speculative_config=None, tokenizer='/home/sss/models/AI-ModelScope/gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/sss/models/AI-ModelScope/gpt2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, WARNING 02-05 09:09:01 _custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'") INFO 02-05 09:09:01 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available. 
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.18it/s] Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.18it/s] INFO 02-05 09:09:11 executor_base.py:110] # CPU blocks: 98557, # CPU blocks: 7281 INFO 02-05 09:09:11 executor_base.py:115] Maximum concurrency for 1024 tokens per request: 1539.95x INFO 02-05 09:09:12 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 2.13 seconds Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00, 1.53it/s, est. speed input: 8.41 toks/s, output: 152.97 toks/s] Prompt: 'Hello, my name is', Generated text: " John. I'm a writer, and I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm" Prompt: 'The president of the United States is', Generated text: ' States president. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United' Prompt: 'The capital of France is', Generated text: ' the capital of the French Republic, and the capital of the French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.' Prompt: 'The future of AI is', Generated text: '\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future' ``` --------- Signed-off-by: Shanshan Shen <467638484@qq.com>
### What this PR does / why we need it?
Add the feature and model support matrix.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI test is enough.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

### What this PR does / why we need it?
This PR adds Chinese documents for vllm-ascend for Chinese-speaking developers.
### Does this PR introduce _any_ user-facing change?
Changes as follows:
- add README.zh.md
- add environment.zh.md
- add CONTRIBUTING.zh.md
### How was this patch tested?
By CI
---------
Signed-off-by: wangli <wangli858794774@gmail.com>

### What this PR does / why we need it?
Use `pytest.ini` to manage the vLLM native tests. This converts the original test-script whitelist to a blacklist, to prevent missing newly added test scripts of upstream vLLM.
**note**: _we do **not** manage the test scripts of vLLM-Ascend in `pytest.ini`, because if we did, there would be conflicts between vLLM's and vLLM-Ascend's `conftest.py`._
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing tests.
Signed-off-by: MengqingCao <cmq0113@163.com>
A follow-up fix for the [MRotaryEmbedding change](vllm-project/vllm@bf3b79e#diff-6bc44986c91bf0876240dec03d56c748403691c7fcd90f7a22e7affff7b033ecR839).
Signed-off-by: z00897138 <zhaorifa@huawei.com>
Co-authored-by: z00897138 <zhaorifa@huawei.com>

(vllm-project#13) Add `try_register_lib` and import mindie-turbo during init.
---------
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>

### What this PR does / why we need it?
Replace the logo with the official link and update the contributing doc.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Preview:
- https://github.com/Yikun/vllm-ascend/blob/336055be1a271b5c349d19b0c5dc29e77caadf4f/README.zh.md
- https://github.com/Yikun/vllm-ascend/blob/336055be1a271b5c349d19b0c5dc29e77caadf4f/README.md
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

Make the package version controlled by setuptools_scm to keep it the same as vLLM.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix device of tensors created in `AscendAttentionBackendImpl`.
When specifying a device other than card 0, a **device conflict** occurs
because the tensors (such as `attn_mask`) are put on card 0 by default.
This PR creates these tensors on the correct card corresponding to the
input.
### Does this PR introduce _any_ user-facing change?
With this PR, users can specify the device by local rank; a change on the
vLLM side is also needed and will be linked to this PR once created.
### How was this patch tested?
This is tested locally with the following code. A test case will be added
when the corresponding change in vLLM is completed.
```python
from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(model="~/.cache/modelscope/hub/Qwen/Qwen2___5-7B-Instruct", device="npu:1")
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
Signed-off-by: MengqingCao <cmq0113@163.com>
Add official doc index. Move the release content to the right place.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

### What this PR does / why we need it?
- Fix typos: vllm-ascned --> vllm-ascend
- For version info
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Preview
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Some PRs for plugin support have not been merged by vLLM yet. This PR adds monkey patches to vllm-ascend so it works with vLLM directly. The patch code should be removed once the related functionality is supported by vLLM natively.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
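As an illustration of the monkey-patch pattern described above, here is a hypothetical sketch: the patched function and the forced HCCL backend are assumptions for the example, not the actual patch contents.
```python
# Hypothetical sketch of the monkey-patch pattern (not the real patch code):
# swap a vLLM function for an Ascend-aware wrapper, keeping the original
# around so the patch can be dropped once vLLM supports the behavior natively.
import vllm.distributed.parallel_state as parallel_state

_original_init = parallel_state.init_distributed_environment

def _patched_init(*args, **kwargs):
    # Assumed example: default to the HCCL backend on Ascend devices.
    kwargs.setdefault("backend", "hccl")
    return _original_init(*args, **kwargs)

parallel_state.init_distributed_environment = _patched_init
```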
### What this PR does / why we need it?
Fix package discovery for submodules. Before this PR, the wheel built by pip did not contain the submodule `ops`, which raised an `ImportError` when importing vllm.
### How was this patch tested?
1. Build the vllm-ascend wheel with pip:
```bash
cd ./vllm-ascend
pip wheel ./ --no-deps
pip install vllm_ascend-0.1.dev11+g07f2a16.d20250211-py3-none-any.whl  # change the file name according to your wheel
```
2. Check vllm:
```python
import vllm
```
Signed-off-by: MengqingCao <cmq0113@163.com>
(vllm-project#45)
### What this PR does / why we need it?
- Remove mypy on the communicator to address vllm-project#24 (comment)
- Add mypy.ini to the trigger list
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

### What this PR does / why we need it?
This PR updates vllm-ascend's dependency version on torch-npu, so that vllm-ascend can be installed in a later-version environment (such as torch-npu 2.6.0rc1).
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI Test
Signed-off-by: ji-huazhong <hzji210@gmail.com>

### What this PR does / why we need it?
This PR adds the quickstart doc.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Preview
---------
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
This patch enables the doc build for vllm-ascend:
- Add sphinx build for vllm-ascend
- Enable readthedocs for vllm-ascend
- Fix CI:
  - exclude vllm-empty/tests/mistral_tool_use to skip `You need to agree to share your contact information to access this model`, which was introduced in vllm-project/vllm@314cfad
  - install test requirements to fix https://github.com/vllm-project/vllm-ascend/actions/runs/13304112758/job/37151690770:
    ```
    vllm-empty/tests/mistral_tool_use/conftest.py:4: in <module>
        import pytest_asyncio
    E   ModuleNotFoundError: No module named 'pytest_asyncio'
    ```
  - exclude docs PRs
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
1. Tested locally:
```bash
# Install dependencies.
pip install -r requirements-docs.txt

# Build the docs and preview
make clean; make html; python -m http.server -d build/html/
```
Launch the browser and open http://localhost:8000/.
2. CI passed with preview: https://vllm-ascend--55.org.readthedocs.build/en/55/
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Add the official install guide.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

### What this PR does / why we need it?
Fix the communicator patch so parallelism can work. See vllm-project#52.
Signed-off-by: MengqingCao <cmq0113@163.com>

### What this PR does / why we need it?
Switch to the latest CANN version.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Add container image build ci:
- Enable branch, tag docker image publish
- branch image: `vllm-ascend:main`, `vllm-ascend:v0.7.1-dev`
- tag image: `vllm-ascend:v0.7.1rc1`
- Enable PR docker image build check
- other changes:
- Prepare the `REPO_OWNER` because ghcr requires lowercase
- Add `Free up disk space` step to avoid `No space left on device` like
vllm-project#27
- Setup qemu with image to resolve
docker/setup-qemu-action#198
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
build: CI passed [push
false](https://github.com/vllm-project/vllm-ascend/actions/runs/13347017608/job/37278724158?pr=64)
Note for test cases:
1. Merge commits to the `main`, `v0.7.1-dev` branches
✅ main: https://github.com/Yikun/vllm-ascend/actions/runs/13347238961
--> ghcr.io/yikun/vllm-ascend:main OK
✅v0.7.1-dev:
https://github.com/Yikun/vllm-ascend/actions/runs/13347229912 -->
ghcr.io/yikun/vllm-ascend:v0.7.1-dev OK
2. Create PEP 440 tags from GitHub releases: v0.7.1rc1, v0.7.1,
v0.7.1rc1.dev1; all releases get `latest`
✅ v0.7.5 --> v0.7.5, latest
✅ v0.7.5rc1 --> v0.7.5rc1
✅ v0.7.5rc1.dev1 --> v0.7.5rc1.dev1
(no latest, add a todo here) v0.7.5rc1.post1 --> v0.7.5rc1.post1
3. Create an unknown tag from a GitHub release:
✅ create 0.7.1 on v0.7.1-dev: does not trigger (only the `v` prefix triggers)
4. create tag from git:
✅ also works, `git tag v0.7.99;git push origin v0.7.99` from
publish-image
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Revert the communicator patch as vllm-project/vllm#13208 has been merged.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
Tested locally, see vllm-project#30 (comment)
Signed-off-by: MengqingCao <cmq0113@163.com>

### What this PR does / why we need it?
This patch adds the versioning policy doc for vllm-ascend.
Reference:
- https://spark.apache.org/versioning-policy.html
- https://docs.openstack.org/project-team-guide/stable-branches.html
- https://github.com/pytorch/pytorch/blob/main/RELEASE.md
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Preview: https://vllm-ascend--62.org.readthedocs.build/en/62/
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

### What this PR does / why we need it?
Add a file to pytest.ini. Ignore some quantization methods.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
pytest tests/xxx
Signed-off-by: ShiyaNiu <1025125896@qq.com>

Check and update the feature support table:
- both multi-step and speculative decoding require adaptation of the corresponding workers
- prompt adapter (a finetuning method) requires adaptation in worker.py and model_runner.py
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
1. Add a vllm-ascend tutorial doc for serving the Qwen/Qwen2.5-7B-Instruct model
2. Fix the format of files in the `docs` dir, e.g. format tables, add underlines for links, add line feeds...
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Doc CI passed
---------
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>

### What this PR does / why we need it?
Update tutorials.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
---------
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>

### What this PR does / why we need it?
Refactor the installation doc.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI, preview
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

### What this PR does / why we need it?
Remove some parts of the metrics patch, since the `cuda` hardcoding has been fixed by vllm-project/vllm#14411.
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: ttanzhiqiang <tanzhiqiang.tzq@antgroup.com>
### What this PR does / why we need it?
Use EvalScope to run an evaluation (including eval and stress test):
- https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage
- https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested locally
---------
Signed-off-by: RongRongStudio <82669040+RongRongStudio@users.noreply.github.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
wangxiyuan left a comment:
I left a few comments; this is not a full PR review. This PR obviously copied a lot of vLLM code that is useless for vllm-ascend. Please fix all problems of this kind first. The PR should only add code that is meaningful.
@@ -1,5 +1,7 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
There are too many unnecessary changes in this PR. Please only change the code you need; do not make unnecessary changes. For example, why move these two lines here?
from vllm.forward_context import set_forward_context
from vllm.inputs import INPUT_REGISTRY
from vllm.logger import logger
from vllm.logger import init_logger
Why? init_logger doesn't actually work for vllm-ascend.
from vllm.model_executor.layers.fused_moe import FusedMoE
from vllm.model_executor.model_loader import get_model
from vllm.multimodal import MULTIMODAL_REGISTRY, MultiModalKwargs
from vllm.platforms import current_platform
vllm-ascend doesn't use current_platform
model_config = self.model_config
cache_config = self.cache_config
scheduler_config = self.scheduler_config
parallel_config = self.parallel_config
This is quite strange: you set self.parallel_config at L67 and it's only used here. Why?
@wangxiyuan I adapted this from vllm_ascend/worker/model_runner_v1.py and rewrote _process_reqs and execute_model (minimizing code changes); self.cache_config and self.parallel_config are needed for mla_v1 to work.
Signed-off-by: ttanzhiqiang <tanzhiqiang.tzq@antgroup.com>
### What this PR does / why we need it?
Add a DP stateless process group initialization path with the HCCL backend as a vllm-ascend patch.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
---------
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?
1. Support DeepSeek with W8A8 quantization;
2. Support DeepSeek with mixed parallelism (multi-DP, EP+TP);
3. Support DeepSeek with graph mode.
---------
Signed-off-by: wen-jie666 <wenjie39@huawei.com>
Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com>
Signed-off-by: libaokui <libaokui@huawei.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: wen-jie666 <wenjie39@huawei.com>
### What this PR does / why we need it?
This PR gives vllm-ascend access to the piecewise_graph feature provided by the V1 engine:
1. Register unified_ascend_attention_with_output for piecewise_graph to split the graph.
2. Support NPUGraph to accelerate kernel launch.
### Does this PR introduce _any_ user-facing change?
NPUGraph is enabled by default; users can disable the npugraph feature by configuring enforce_eager. This has corresponding requirements on the torch_npu and CANN versions: they need to support graph capture.
### How was this patch tested?
It is enabled by default.
---------
Signed-off-by: Bug Hunter Yan <yanpq@zju.edu.cn>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
- Running the benchmark scripts will download the model from ModelScope.
Signed-off-by: wangli <wangli858794774@gmail.com>

### What this PR does / why we need it?
Fix the qwen2.5-vl position input bug; fixes vllm-project#625 `TypeError: 'NoneType' object is not iterable`.
Signed-off-by: wangli <wangli858794774@gmail.com>
wangxiyuan left a comment:
Thanks, it's clearer now. Please make the CI happy before a deep review. Thanks.
assert num_blocks >= kv_cache_config.num_blocks
# TODO: remove this after the OOM issue is located and fixed, otherwise, some model may
# encounter OOM issue
num_blocks = num_blocks // 4
useless change
### What this PR does / why we need it?
Part of vllm-project#499. Add a qwen2.5-vl test on a single NPU. The V1 engine is excluded because qwen2.5-vl has some problems with V1 for now; at the same time, this test also makes vllm-project#639 more credible.
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Enforce eager mode in the V1 engine ahead of the upcoming CANN and torch_npu releases.
### Does this PR introduce _any_ user-facing change?
After this change, users will no longer need to manually set enforce_eager=True.
### How was this patch tested?
Tested with regular offline inference examples.
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
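For context, a minimal sketch of what setting this flag manually looks like on vLLM's offline `LLM` API (the model name is just an example):
```python
from vllm import LLM

# Before this change, V1 users had to request eager mode explicitly:
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enforce_eager=True)

# After this change, the Ascend plugin enforces eager mode in V1 by default,
# so passing enforce_eager=True is no longer required.
```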
vllm-project/vllm@b411418: this vLLM commit changed the sample usage. This PR adapts to the change for main and makes sure it works for 0.8.4 as well.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
@wangxiyuan CI passed
continue
tot = tot_blocks_tensor[b]
# If seq_starts is used, the offset needs to be taken into account
Please use English comments. There are many places that need updating.
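To make the seq_starts handling discussed above concrete, here is a minimal torch sketch of a gather_cache-style reference path; the signature, shapes, and variable names are assumptions for illustration, not the PR's exact implementation.
```python
import torch

def gather_cache_torch(src_cache: torch.Tensor,   # [num_blocks, block_size, entry_size]
                       dst: torch.Tensor,          # [total_tokens, entry_size]
                       block_table: torch.Tensor,  # [batch, max_blocks_per_seq]
                       seq_lens: torch.Tensor,     # [batch] tokens to gather per request
                       block_size: int,
                       seq_starts: torch.Tensor = None) -> torch.Tensor:
    """Copy each request's cached tokens into a contiguous buffer."""
    dst_pos = 0
    for b in range(block_table.shape[0]):
        remaining = int(seq_lens[b])
        if remaining == 0:
            continue
        # If seq_starts is used, the per-request token offset shifts which block
        # (and which position inside that block) the copy starts from.
        offset = int(seq_starts[b]) if seq_starts is not None else 0
        blk_idx, in_blk = divmod(offset, block_size)
        while remaining > 0:
            block_id = int(block_table[b, blk_idx])
            n = min(block_size - in_blk, remaining)
            dst[dst_pos:dst_pos + n] = src_cache[block_id, in_blk:in_blk + n]
            dst_pos += n
            remaining -= n
            blk_idx += 1
            in_blk = 0
    return dst
```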
# limitations under the License.

# import torch_npu
# import math
remove useless code
@torch.inference_mode()
-    def _dummy_run(self) -> torch.Tensor:
+    def _dummy_run(self, num_tokens: int) -> torch.Tensor:
Looks like this PR isn't rebased on the newest commit. _dummy_run has already been updated in the latest code.
update
update torch (concat_and_cache_mla_torch/merge_attn_states_torch/gather_cache_torch/flash_attn_varlen_func_torch/flash_mla_with_kvcache_torch) and update chunk prefill
### What this PR does / why we need it?
Update the vLLM chunked prefill path and the torch reference kernels (concat_and_cache_mla_torch, merge_attn_states_torch, gather_cache_torch, flash_attn_varlen_func_torch, flash_mla_with_kvcache_torch).
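As background on what one of these reference kernels computes, here is a minimal torch sketch (assumed shapes and signature, not the PR's exact code) of merging two partial attention outputs by their log-sum-exp values, which is the core of a merge_attn_states step in chunked prefill:
```python
import torch

def merge_attn_states_torch(o1: torch.Tensor, lse1: torch.Tensor,
                            o2: torch.Tensor, lse2: torch.Tensor):
    # o*:   [num_tokens, num_heads, head_dim] partial attention outputs
    # lse*: [num_tokens, num_heads]           per-head log-sum-exp values
    max_lse = torch.maximum(lse1, lse2)
    w1 = torch.exp(lse1 - max_lse)            # numerically stable weights
    w2 = torch.exp(lse2 - max_lse)
    denom = (w1 + w2).unsqueeze(-1)
    merged = (o1 * w1.unsqueeze(-1) + o2 * w2.unsqueeze(-1)) / denom
    merged_lse = max_lse + torch.log(w1 + w2)
    return merged, merged_lse
```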
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
cd vllm-ascend/examples

python offline_inference_npu_v1.py