update torch (concat_and_cache_mla_torch/merge_attn_states_torch/gather_cache_torch/flash_attn_varlen_func_torch/flash_mla_with_kvcache_torch) and update chunk prefill #609
Conversation
### What this PR does / why we need it?
vLLM Ascend plugin (vllm-ascend) is a backend plugin for running vLLM on
the Ascend NPU.
This plugin is the recommended approach for supporting the Ascend
backend within the vLLM community. It adheres to the principles outlined
in the [RFC]: Hardware pluggable, providing a hardware-pluggable
interface that decouples the integration of the Ascend NPU with vLLM.
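For readers unfamiliar with that mechanism, below is a minimal sketch (illustrative only, not the exact vllm-ascend code) of how a hardware plugin hooks in: vLLM discovers entries in the `vllm.platform_plugins` entry-point group and calls the referenced function, which returns the dotted path of the Platform class to activate. The class path shown is an assumed example.
```python
# Minimal sketch of a vLLM platform plugin entry point (illustrative only).
# Packaging metadata would map "ascend = vllm_ascend:register" under the
# "vllm.platform_plugins" entry-point group; the class path below is an
# assumed example, not necessarily the real vllm-ascend symbol.
def register() -> str:
    """Return the fully qualified name of the Platform subclass to load."""
    return "vllm_ascend.platform.NPUPlatform"
```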
This patch also includes changes to make CI work and to use caches to speed
up the e2e test, including:
1. Change push (post merge ci) and pull_request (pr ci) trigger branch
to main
2. Make mypy work by ignoring base_communicator and clearing unused deps
3. Several improvements for vllm_ascend_test:
- use caches (pip, ms, hf) to speed up the e2e test (25 mins --> 5 mins)
- switch the `git clone` command to `actions/checkout` to speed up checkout
- Enable `-sv` for pytest for better info dump
- Remove network host to resolve `docker: conflicting options: cannot
attach both user-defined and non-user-defined network-modes`, which is a
problem on Docker 1.45 but not on 1.39.
4. Adapt MLA decode optimizations:
vllm-project/vllm@cabaf4e
### Does this PR introduce _any_ user-facing change?
Yes, this is the initial PR.
### How was this patch tested?
- This is the first PR to make the Ascend NPU work with vLLM. All code is
tested on Ascend with the vLLM V0 engine.
- CI passed
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: wangshuai09 <391746016@qq.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it? This PR is a refactoring of model runner, to decouple it from the classes specifically designed for GPU. The changes of model runner are generally showed below:  **Other changes:** I have removed the code of `cuda`, `lora` and `prompt adapter`, because NPU doesn`t support them now. ### Does this PR introduce _any_ user-facing change? no. ### How was this patch tested? I have used `AI-ModelScope/gpt2` for testing `examples/offline_inference_npu.py`, and the results showed that it worked well. The test logs are showed below: ```bash INFO 02-05 09:08:46 __init__.py:30] Available plugins for group vllm.platform_plugins: INFO 02-05 09:08:46 __init__.py:32] name=ascend, value=vllm_ascend:register INFO 02-05 09:08:46 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded. INFO 02-05 09:08:46 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load. INFO 02-05 09:08:46 __init__.py:44] plugin ascend loaded. INFO 02-05 09:08:46 __init__.py:177] Platform plugin ascend is activated INFO 02-05 09:08:48 config.py:2383] Downcasting torch.float32 to torch.float16. INFO 02-05 09:08:59 config.py:542] This model supports multiple tasks: {'generate', 'score', 'embed', 'reward', 'classify'}. Defaulting to 'generate'. INFO 02-05 09:08:59 llm_engine.py:234] Initializing a V0 LLM engine (v0.1.dev1+gb3a0d01) with config: model='/home/sss/models/AI-ModelScope/gpt2', speculative_config=None, tokenizer='/home/sss/models/AI-ModelScope/gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/sss/models/AI-ModelScope/gpt2, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, WARNING 02-05 09:09:01 _custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'") INFO 02-05 09:09:01 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available. 
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.18it/s] Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 3.18it/s] INFO 02-05 09:09:11 executor_base.py:110] # CPU blocks: 98557, # CPU blocks: 7281 INFO 02-05 09:09:11 executor_base.py:115] Maximum concurrency for 1024 tokens per request: 1539.95x INFO 02-05 09:09:12 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 2.13 seconds Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00, 1.53it/s, est. speed input: 8.41 toks/s, output: 152.97 toks/s] Prompt: 'Hello, my name is', Generated text: " John. I'm a writer, and I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm a writer. I'm" Prompt: 'The president of the United States is', Generated text: ' States president. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United States. He is the president of the United' Prompt: 'The capital of France is', Generated text: ' the capital of the French Republic, and the capital of the French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.\n\nThe French Republic is the capital of the French Republic.' Prompt: 'The future of AI is', Generated text: '\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future of AI is a question of how to make it work.\n\nThe future' ``` --------- Signed-off-by: Shanshan Shen <467638484@qq.com>
### What this PR does / why we need it?
Add the feature and model support matrix.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI test is enough.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

### What this PR does / why we need it?
This PR adds Chinese documents for vllm-ascend for Chinese-speaking developers.
### Does this PR introduce _any_ user-facing change?
Changes as follows:
- add README.zh.md
- add environment.zh.md
- add CONTRIBUTING.zh.md
### How was this patch tested?
By CI
---------
Signed-off-by: wangli <wangli858794774@gmail.com>

### What this PR does / why we need it?
Use `pytest.ini` to manage the vLLM native tests. This converts the original test-script whitelist to a blacklist, to prevent missing newly added test scripts of upstream vLLM.
**note**: _we do **not** manage the test scripts of vLLM-Ascend in `pytest.ini`, because if we did, there would be conflicts between vLLM's and vLLM-Ascend's `conftest.py`._
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing tests.
Signed-off-by: MengqingCao <cmq0113@163.com>
A follow-up fix for the [MRotaryEmbedding change](vllm-project/vllm@bf3b79e#diff-6bc44986c91bf0876240dec03d56c748403691c7fcd90f7a22e7affff7b033ecR839).
Signed-off-by: z00897138 <zhaorifa@huawei.com>
Co-authored-by: z00897138 <zhaorifa@huawei.com>

(vllm-project#13) Add `try_register_lib` and import mindie-turbo during init.
---------
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>

### What this PR does / why we need it?
Replace the logo with the official link and update the contributing doc.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Preview:
- https://github.com/Yikun/vllm-ascend/blob/336055be1a271b5c349d19b0c5dc29e77caadf4f/README.zh.md
- https://github.com/Yikun/vllm-ascend/blob/336055be1a271b5c349d19b0c5dc29e77caadf4f/README.md
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

Make the package version controlled by setuptools_scm to keep it the same as vLLM.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix device of tensors created in `AscendAttentionBackendImpl`.
When specifying a device other than card 0, a **device conflict** occurs
because the tensors (such as `attn_mask`) are put on card 0 by default.
This PR creates these tensors on the correct card corresponding to the
input.
### Does this PR introduce _any_ user-facing change?
With this PR, users can specify the device by local rank; a change on the
vLLM side is also needed and will be linked to this PR once created.
### How was this patch tested?
This is tested locally with the following code. A test case will be added
when the corresponding change in vLLM is completed.
```python
from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(model="~/.cache/modelscope/hub/Qwen/Qwen2___5-7B-Instruct", device="npu:1")
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
Signed-off-by: MengqingCao <cmq0113@163.com>
Add official doc index. Move the release content to the right place.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

### What this PR does / why we need it?
- Fix typos: vllm-ascned --> vllm-ascend
- For version info
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Preview
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Some PRs for plugin support have not been merged by vLLM yet. This PR adds monkey patches to vllm-ascend so it works with vLLM directly. The patch code should be removed once the related functionality is supported by vLLM natively.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
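As an illustration of the monkey-patch pattern described above, here is a hypothetical sketch: the patched function and the forced HCCL backend are assumptions for the example, not the actual patch contents.
```python
# Hypothetical sketch of the monkey-patch pattern (not the real patch code):
# swap a vLLM function for an Ascend-aware wrapper, keeping the original
# around so the patch can be dropped once vLLM supports the behavior natively.
import vllm.distributed.parallel_state as parallel_state

_original_init = parallel_state.init_distributed_environment

def _patched_init(*args, **kwargs):
    # Assumed example: default to the HCCL backend on Ascend devices.
    kwargs.setdefault("backend", "hccl")
    return _original_init(*args, **kwargs)

parallel_state.init_distributed_environment = _patched_init
```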
### What this PR does / why we need it?
Fix package discovery for submodules. Before this PR, the wheel built by pip did not contain the submodule `ops`, which raised an `ImportError` when importing vllm.
### How was this patch tested?
1. Build the vllm-ascend wheel with pip:
```bash
cd ./vllm-ascend
pip wheel ./ --no-deps
pip install vllm_ascend-0.1.dev11+g07f2a16.d20250211-py3-none-any.whl  # change the file name according to your wheel
```
2. Check vllm:
```python
import vllm
```
Signed-off-by: MengqingCao <cmq0113@163.com>
(vllm-project#45)
### What this PR does / why we need it?
- Remove mypy on the communicator to address vllm-project#24 (comment)
- Add mypy.ini to the trigger list
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

### What this PR does / why we need it?
This PR updates vllm-ascend's dependency version on torch-npu, so that vllm-ascend can be installed in a later-version environment (such as torch-npu 2.6.0rc1).
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI Test
Signed-off-by: ji-huazhong <hzji210@gmail.com>

### What this PR does / why we need it?
This PR adds the quickstart doc.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Preview
---------
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
This patch enables the doc build for vllm-ascend:
- Add sphinx build for vllm-ascend
- Enable readthedocs for vllm-ascend
- Fix CI:
  - exclude vllm-empty/tests/mistral_tool_use to skip `You need to agree to share your contact information to access this model`, which was introduced in vllm-project/vllm@314cfad
  - install test requirements to fix https://github.com/vllm-project/vllm-ascend/actions/runs/13304112758/job/37151690770:
    ```
    vllm-empty/tests/mistral_tool_use/conftest.py:4: in <module>
        import pytest_asyncio
    E   ModuleNotFoundError: No module named 'pytest_asyncio'
    ```
  - exclude docs PRs
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
1. Tested locally:
```bash
# Install dependencies.
pip install -r requirements-docs.txt

# Build the docs and preview
make clean; make html; python -m http.server -d build/html/
```
Launch the browser and open http://localhost:8000/.
2. CI passed with preview: https://vllm-ascend--55.org.readthedocs.build/en/55/
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Add the official install guide.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

### What this PR does / why we need it?
Fix the communicator patch so parallelism can work. See vllm-project#52.
Signed-off-by: MengqingCao <cmq0113@163.com>

### What this PR does / why we need it?
Switch to the latest CANN version.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Add container image build ci:
- Enable branch, tag docker image publish
- branch image: `vllm-ascend:main`, `vllm-ascend:v0.7.1-dev`
- tag image: `vllm-ascend:v0.7.1rc1`
- Enable PR docker image build check
- other changes:
- Prepare the `REPO_OWNER` because ghcr requires lowercase
- Add `Free up disk space` step to avoid `No space left on device` like
vllm-project#27
- Setup qemu with image to resolve
docker/setup-qemu-action#198
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
build: CI passed [push
false](https://github.com/vllm-project/vllm-ascend/actions/runs/13347017608/job/37278724158?pr=64)
Note for test cases:
1. Merge commits to the `main`, `v0.7.1-dev` branches
✅ main: https://github.com/Yikun/vllm-ascend/actions/runs/13347238961
--> ghcr.io/yikun/vllm-ascend:main OK
✅v0.7.1-dev:
https://github.com/Yikun/vllm-ascend/actions/runs/13347229912 -->
ghcr.io/yikun/vllm-ascend:v0.7.1-dev OK
2. Create PEP 440 tags from GitHub releases: v0.7.1rc1, v0.7.1,
v0.7.1rc1.dev1; all releases get `latest`
✅ v0.7.5 --> v0.7.5, latest
✅ v0.7.5rc1 --> v0.7.5rc1
✅ v0.7.5rc1.dev1 --> v0.7.5rc1.dev1
(no latest, add a todo here) v0.7.5rc1.post1 --> v0.7.5rc1.post1
3. Create an unknown tag from a GitHub release:
✅ create 0.7.1 on v0.7.1-dev: does not trigger (only the `v` prefix triggers)
4. create tag from git:
✅ also works, `git tag v0.7.99;git push origin v0.7.99` from
publish-image
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Revert the communicator patch as vllm-project/vllm#13208 has been merged.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
Tested locally, see vllm-project#30 (comment)
Signed-off-by: MengqingCao <cmq0113@163.com>

### What this PR does / why we need it?
This patch adds the versioning policy doc for vllm-ascend.
Reference:
- https://spark.apache.org/versioning-policy.html
- https://docs.openstack.org/project-team-guide/stable-branches.html
- https://github.com/pytorch/pytorch/blob/main/RELEASE.md
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Preview: https://vllm-ascend--62.org.readthedocs.build/en/62/
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

### What this PR does / why we need it?
Add a file to pytest.ini. Ignore some quantization methods.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
pytest tests/xxx
Signed-off-by: ShiyaNiu <1025125896@qq.com>

Check and update the feature support table:
- both multi-step and speculative decoding require adaptation of the corresponding workers
- prompt adapter (a finetuning method) requires adaptation in worker.py and model_runner.py
Signed-off-by: MengqingCao <cmq0113@163.com>
### What this PR does / why we need it?
1. Add a vllm-ascend tutorial doc for serving the Qwen/Qwen2.5-7B-Instruct model
2. Fix the format of files in the `docs` dir, e.g. format tables, add underlines for links, add line feeds...
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Doc CI passed
---------
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>

### What this PR does / why we need it?
Update tutorials.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
No.
---------
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>

### What this PR does / why we need it?
Refactor the installation doc.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI, preview
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

### What this PR does / why we need it?
Remove some parts of the metrics patch, since the `cuda` hardcoding has been fixed by vllm-project/vllm#14411.
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: ttanzhiqiang <tanzhiqiang.tzq@antgroup.com>
### What this PR does / why we need it?
Use EvalScope to run an evaluation (including eval and stress test):
- https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#basic-usage
- https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#model-api-service-evaluation
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested locally
---------
Signed-off-by: RongRongStudio <82669040+RongRongStudio@users.noreply.github.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
wangxiyuan left a comment:
I left a few comments; this is not a full PR review. This PR obviously copied a lot of vLLM code that is useless for vllm-ascend. Please fix all problems of this kind first. The PR should only add code that is meaningful.
@@ -1,5 +1,7 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
There are too many unnecessary changes in this PR. Please only change the code you need; do not make unnecessary changes. For example, why move these two lines here?
from vllm.forward_context import set_forward_context
from vllm.inputs import INPUT_REGISTRY
from vllm.logger import logger
from vllm.logger import init_logger
Why? init_logger doesn't actually work for vllm-ascend.
from vllm.model_executor.layers.fused_moe import FusedMoE
from vllm.model_executor.model_loader import get_model
from vllm.multimodal import MULTIMODAL_REGISTRY, MultiModalKwargs
from vllm.platforms import current_platform
vllm-ascend doesn't use current_platform
model_config = self.model_config
cache_config = self.cache_config
scheduler_config = self.scheduler_config
parallel_config = self.parallel_config
This is quite strange: you set self.parallel_config at L67 and it's only used here. Why?
@wangxiyuan I adapted this from vllm_ascend/worker/model_runner_v1.py and rewrote _process_reqs and execute_model (minimizing code changes); self.cache_config and self.parallel_config are needed for mla_v1 to work.
Signed-off-by: ttanzhiqiang <tanzhiqiang.tzq@antgroup.com>
### What this PR does / why we need it?
Add a DP stateless process group initialization path with the HCCL backend as a vllm-ascend patch.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
---------
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
### What this PR does / why we need it?
1. Support DeepSeek with W8A8 quantization;
2. Support DeepSeek with mixed parallelism (multi-DP, EP+TP);
3. Support DeepSeek with graph mode.
---------
Signed-off-by: wen-jie666 <wenjie39@huawei.com>
Signed-off-by: Yizhou Liu <liuyizhou5@h-partners.com>
Signed-off-by: libaokui <libaokui@huawei.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: wen-jie666 <wenjie39@huawei.com>
### What this PR does / why we need it?
This PR gives vllm-ascend access to the piecewise_graph feature provided by the V1 engine:
1. Register unified_ascend_attention_with_output for piecewise_graph to split the graph.
2. Support NPUGraph to accelerate kernel launch.
### Does this PR introduce _any_ user-facing change?
NPUGraph is enabled by default; users can disable the npugraph feature by configuring enforce_eager. This has corresponding requirements on the torch_npu and CANN versions: they need to support graph capture.
### How was this patch tested?
It is enabled by default.
---------
Signed-off-by: Bug Hunter Yan <yanpq@zju.edu.cn>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
- Running the benchmark scripts will download the model from ModelScope.
Signed-off-by: wangli <wangli858794774@gmail.com>

### What this PR does / why we need it?
Fix the qwen2.5-vl position input bug; fixes vllm-project#625 `TypeError: 'NoneType' object is not iterable`.
Signed-off-by: wangli <wangli858794774@gmail.com>
wangxiyuan left a comment:
Thanks, it's clearer now. Please make the CI happy before a deep review. Thanks.
assert num_blocks >= kv_cache_config.num_blocks
# TODO: remove this after the OOM issue is located and fixed, otherwise, some model may
# encounter OOM issue
num_blocks = num_blocks // 4
useless change
### What this PR does / why we need it?
Part of vllm-project#499. Add a qwen2.5-vl test on a single NPU. The V1 engine is excluded because qwen2.5-vl has some problems with V1 for now; at the same time, this test also makes vllm-project#639 more credible.
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Enforce eager mode in the V1 engine ahead of the upcoming CANN and torch_npu releases.
### Does this PR introduce _any_ user-facing change?
After this change, users will no longer need to manually set enforce_eager=True.
### How was this patch tested?
Tested with regular offline inference examples.
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
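For context, a minimal sketch of what setting this flag manually looks like on vLLM's offline `LLM` API (the model name is just an example):
```python
from vllm import LLM

# Before this change, V1 users had to request eager mode explicitly:
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enforce_eager=True)

# After this change, the Ascend plugin enforces eager mode in V1 by default,
# so passing enforce_eager=True is no longer required.
```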
vllm-project/vllm@b411418: this vLLM commit changed the sample usage. This PR adapts to the change for main and makes sure it works for 0.8.4 as well.
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
@wangxiyuan CI passed
continue
tot = tot_blocks_tensor[b]
# If seq_starts is used, the offset needs to be taken into account
Please use English comments. There are many places that need updating.
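To make the seq_starts handling discussed above concrete, here is a minimal torch sketch of a gather_cache-style reference path; the signature, shapes, and variable names are assumptions for illustration, not the PR's exact implementation.
```python
import torch

def gather_cache_torch(src_cache: torch.Tensor,   # [num_blocks, block_size, entry_size]
                       dst: torch.Tensor,          # [total_tokens, entry_size]
                       block_table: torch.Tensor,  # [batch, max_blocks_per_seq]
                       seq_lens: torch.Tensor,     # [batch] tokens to gather per request
                       block_size: int,
                       seq_starts: torch.Tensor = None) -> torch.Tensor:
    """Copy each request's cached tokens into a contiguous buffer."""
    dst_pos = 0
    for b in range(block_table.shape[0]):
        remaining = int(seq_lens[b])
        if remaining == 0:
            continue
        # If seq_starts is used, the per-request token offset shifts which block
        # (and which position inside that block) the copy starts from.
        offset = int(seq_starts[b]) if seq_starts is not None else 0
        blk_idx, in_blk = divmod(offset, block_size)
        while remaining > 0:
            block_id = int(block_table[b, blk_idx])
            n = min(block_size - in_blk, remaining)
            dst[dst_pos:dst_pos + n] = src_cache[block_id, in_blk:in_blk + n]
            dst_pos += n
            remaining -= n
            blk_idx += 1
            in_blk = 0
    return dst
```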
# limitations under the License.

# import torch_npu
# import math
remove useless code
@torch.inference_mode()
-    def _dummy_run(self) -> torch.Tensor:
+    def _dummy_run(self, num_tokens: int) -> torch.Tensor:
Looks like this PR isn't rebased on the newest commit. _dummy_run has already been updated in the latest code.
update
update torch (concat_and_cache_mla_torch/merge_attn_states_torch/gather_cache_torch/flash_attn_varlen_func_torch/flash_mla_with_kvcache_torch) and update chunk prefill
### What this PR does / why we need it?
Update the vLLM chunked prefill path and the torch reference kernels (concat_and_cache_mla_torch, merge_attn_states_torch, gather_cache_torch, flash_attn_varlen_func_torch, flash_mla_with_kvcache_torch).
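As background on what one of these reference kernels computes, here is a minimal torch sketch (assumed shapes and signature, not the PR's exact code) of merging two partial attention outputs by their log-sum-exp values, which is the core of a merge_attn_states step in chunked prefill:
```python
import torch

def merge_attn_states_torch(o1: torch.Tensor, lse1: torch.Tensor,
                            o2: torch.Tensor, lse2: torch.Tensor):
    # o*:   [num_tokens, num_heads, head_dim] partial attention outputs
    # lse*: [num_tokens, num_heads]           per-head log-sum-exp values
    max_lse = torch.maximum(lse1, lse2)
    w1 = torch.exp(lse1 - max_lse)            # numerically stable weights
    w2 = torch.exp(lse2 - max_lse)
    denom = (w1 + w2).unsqueeze(-1)
    merged = (o1 * w1.unsqueeze(-1) + o2 * w2.unsqueeze(-1)) / denom
    merged_lse = max_lse + torch.log(w1 + w2)
    return merged, merged_lse
```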
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
cd vllm-ascend/examples

python offline_inference_npu_v1.py