support aclgraph #426
Please help add unit test cases and system test cases.
    self.input_positions_cpu = torch.arange(0,
                                            self.max_num_tokens,
                                            device="cpu")
    self.use_cuda_graph = (self.vllm_config.compilation_config.level
rename to self.use_acl_graph
self.use_npu_graph is better
    self.use_cuda_graph = (self.vllm_config.compilation_config.level
                           == CompilationLevel.PIECEWISE
                           and not self.model_config.enforce_eager)
    self.cudagraph_batch_sizes = list(
ditto
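For readers following the naming thread above, here is a minimal sketch of the flag being discussed. It assumes the condition shown in the diff (piecewise compilation level and eager mode not enforced) and uses the `use_npu_graph` name suggested in review. `CompilationLevel` and `RunnerFlags` below are stand-ins written for this sketch, not the vLLM classes.

```python
from dataclasses import dataclass
from enum import IntEnum


class CompilationLevel(IntEnum):
    # Stand-in enum for illustration; only the members used here are listed.
    NO_COMPILATION = 0
    PIECEWISE = 3


@dataclass
class RunnerFlags:
    # Hypothetical container for the two config values the diff reads.
    compilation_level: int
    enforce_eager: bool

    @property
    def use_npu_graph(self) -> bool:
        # Graph capture is used only for piecewise compilation
        # and only when the user has not forced eager execution.
        return (self.compilation_level == CompilationLevel.PIECEWISE
                and not self.enforce_eager)


default_flags = RunnerFlags(CompilationLevel.PIECEWISE, enforce_eager=False)
eager_flags = RunnerFlags(CompilationLevel.PIECEWISE, enforce_eager=True)
```

Under these assumptions, setting `enforce_eager=True` is enough to switch graph capture off regardless of the compilation level.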
        from vllm.v1.sample.rejection_sampler import INVALID_TOKEN_ID, RejectionSampler
    else:
        INVALID_TOKEN_ID = None
        RejectionSampler = None
Why this change? HAS_TRITON is always false in vllm-ascend, so I guess you want to reimplement INVALID_TOKEN_ID and RejectionSampler from vllm.v1.sample.rejection_sampler in vllm-ascend here?
        
          
vllm_ascend/utils.py (Outdated)
        self.name = name


    def register_dummy_fusion_op() -> None:
move to ops module
done
        
          
requirements.txt (Outdated)
    setuptools-scm>=8
    torch_npu
    torch >= 2.5.1
    torch_npu == 2.5.1rc1
do not limit torch-npu version here.
vllm_ascend/__init__.py (Outdated)
    # This file is a part of the vllm-ascend project.
    #

    from torch_npu.contrib import transfer_to_npu # noqa: F401
Why do we need this here? It will hide some issues and break some RL scenarios, where torch.cuda is expected to be called normally.
Removed.
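The concern raised in this thread can be illustrated with a toy sketch. It assumes that an import-time helper rewrites one device namespace to point at another, which is the behaviour the reviewer describes. Every name below (`framework`, `transfer_to_npu_toy`, `rl_trainer_device`) is hypothetical and stands in for the real modules; none of it is the torch or torch_npu API.

```python
import types

# Toy stand-ins for a framework's device namespaces (not the real torch API).
cuda = types.SimpleNamespace(device_name=lambda: "cuda:0")
npu = types.SimpleNamespace(device_name=lambda: "npu:0")
framework = types.SimpleNamespace(cuda=cuda, npu=npu)


def transfer_to_npu_toy(fw):
    """Redirect every attribute of fw.cuda to the matching fw.npu attribute."""
    for name in vars(fw.npu):
        setattr(fw.cuda, name, getattr(fw.npu, name))


def rl_trainer_device(fw):
    """A caller elsewhere in the process that expects genuine CUDA behaviour."""
    return fw.cuda.device_name()


before = rl_trainer_device(framework)   # namespace untouched
transfer_to_npu_toy(framework)          # global, in-place redirection
after = rl_trainer_device(framework)    # same call, different backend
```

Because the redirection is global and happens as a side effect, code that never asked for it sees the changed behaviour too, which is why keeping such a patch out of a package `__init__.py` is the safer choice.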
    Signed-off-by: Bug Hunter Yan <yanpq@zju.edu.cn>
    Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
This PR gives vllm-ascend access to the piecewise_graph feature provided by the v1 engine.
1. Register unified_ascend_attention_with_output so that piecewise_graph can split the graph.
2. Support NPUGraph to accelerate kernel launch.

### Does this PR introduce _any_ user-facing change?
NPUGraph is enabled by default. Users can disable it by setting enforce_eager. The feature requires torch_npu and CANN versions that support graph capture.

### How was this patch tested?
It is turned on by default.

---------

Signed-off-by: Bug Hunter Yan <yanpq@zju.edu.cn>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
### What this PR does / why we need it?
Thanks to PR #426, vllm-ascend supports aclgraph inference, which reduces host overhead. However, aclgraph relies heavily on `torch.compile`, a key feature of torch 2.x. A custom op can therefore be captured into aclgraph only if `torch.compile` can recognize and capture it. This PR registers meta implementations for the current custom ops to enable fx graph capture. With that in place, the Ascend runtime can insert those custom ops into aclgraph naturally.

### Does this PR introduce _any_ user-facing change?
No user-facing change.

### How was this patch tested?
Tested in unit tests: we integrate the `rotary_embedding` op into a small custom model and use `torch.compile` and aclgraph to capture and replay it to verify its functionality.

- vLLM version: v0.10.0
- vLLM main: vllm-project/vllm@1b99028

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
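To make the role of a meta implementation concrete, here is a toy sketch in plain Python. It assumes only the general idea stated above: a tracer needs a shape-and-dtype-only version of each op, and it cannot capture an op that has none. `TensorSpec`, `register_meta`, `trace`, and the `toy::` op names are all hypothetical; this is not the torch.library API and not the registration code from the PR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class TensorSpec:
    """Shape and dtype only; no data, as seen by a tracer."""
    shape: Tuple[int, ...]
    dtype: str


# Hypothetical registry: op name -> shape-inference ("meta") function.
META_IMPLS: Dict[str, Callable[[TensorSpec], TensorSpec]] = {}


def register_meta(op_name: str):
    """Decorator that records a meta implementation for op_name."""
    def decorator(fn):
        META_IMPLS[op_name] = fn
        return fn
    return decorator


@register_meta("toy::rotary_embedding")
def rotary_embedding_meta(query: TensorSpec) -> TensorSpec:
    # In this toy, the rotation changes values but not shape or dtype.
    return TensorSpec(query.shape, query.dtype)


def trace(op_names: List[str], spec: TensorSpec) -> TensorSpec:
    """Propagate a spec through a list of ops using only meta implementations."""
    for name in op_names:
        if name not in META_IMPLS:
            raise LookupError(f"no meta implementation registered for {name}")
        spec = META_IMPLS[name](spec)
    return spec


traced = trace(["toy::rotary_embedding"], TensorSpec((4, 128), "float16"))
```

Tracing an op without a registered meta function raises `LookupError` in this sketch, which mirrors the statement that a custom op can be captured only when the compiler can recognize it.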
* enable mm allreduce test (vllm-project#2192)
  ### What this PR does / why we need it?
  This PR adds an e2e test for the npu_mm_all_reduce_base fusion kernel.
  ### Does this PR introduce _any_ user-facing change?
  no
  ### How was this patch tested?
  not involved
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@5d5d419

  Signed-off-by: Ronald1995 <ronaldautomobile@163.com>

* [main] remove torch.cat and replace it by List[0] (vllm-project#2153)
  ### What this PR does / why we need it?
  torch_npu.npu_grouped_matmul: https://www.hiascend.com/document/detail/zh/Pytorch/710/apiref/torchnpuCustomsapi/context/torch_npu-npu_grouped_matmul.md
  According to the document, when `split_item` is 2 or 3, `torch_npu.npu_grouped_matmul` returns a list with one element. Therefore, the `torch.cat` after `torch_npu.npu_grouped_matmul` is unnecessary.
  ### Does this PR introduce _any_ user-facing change?
  not involved
  ### How was this patch tested?
  ut and e2e covered: `tests/ut/ops/test_fused_ops.py`, `tests/e2e/singlecard/ops/test_fused_moe.py`
  **performance**: (qwen3 30B, 2k->20k)
  base: Total Token throughput (tok/s): 667.76
  remove cat: Total Token throughput (tok/s): 680.82
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@fa00c5d

  Signed-off-by: huangxialu <huangxialu1@huawei.com>

* [CI][Quickfix] Fix AscendFusedMoE init error (vllm-project#2268)
  ### What this PR does / why we need it?
  Fix AscendFusedMoE init error. Use `super().__init__()` instead of `super(FusedMoE, self).__init__()` so that the member variables of the base class are available to the child class.
  ### Does this PR introduce _any_ user-facing change?
  N/A
  ### How was this patch tested?
  CI passed with new and existing tests.
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@766bc81

  Signed-off-by: MengqingCao <cmq0113@163.com>

* Fix accuracy test config and add DeepSeek-V2-Lite test (vllm-project#2261)
  ### What this PR does / why we need it?
  This PR fixes the accuracy test related to vllm-project#2073. Users can now run accuracy tests on multiple models simultaneously and generate separate report files by running:
  ```bash
  cd ~/vllm-ascend
  pytest -sv ./tests/e2e/models/test_lm_eval_correctness.py \
      --config-list-file ./tests/e2e/models/configs/accuracy.txt
  ```
  ### Does this PR introduce _any_ user-facing change?
  no
  ### How was this patch tested?
  [image]
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@766bc81

  Signed-off-by: Icey <1790571317@qq.com>

* Fix accuracy test create PR (vllm-project#2274)
  ### What this PR does / why we need it?
  Fix the create-PR step of the accuracy test.
  ### Does this PR introduce _any_ user-facing change?
  no
  ### How was this patch tested?
  Local testing: nv-action/vllm-benchmarks#87
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@099c046

  Signed-off-by: Icey <1790571317@qq.com>

* Add ut for test_communicator.py (vllm-project#2293)
  ### What this PR does / why we need it?
  Add ut for test_communicator.py
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@e5ebeeb

  Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>

* [CI] Fix broken CI (vllm-project#2302)
  1. Disable the test_eagle_ccorrectness test; we'll reopen it once the OOM error is fixed.
  2. Drop the transformers version limit for main, since vLLM relies on >=4.55.0, see: vllm-project/vllm@65552b4
  3. Fix the kv_connector_output bug, see: vllm-project/vllm@796bae0
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@d1af8b7

  Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [2/N][Refactor] torchair model runner refactor (vllm-project#2204)
  There is a lot of torchair code in the model runner, which makes the code hard to maintain. We'll create a new torchair_model_runner to split out the torchair-related logic.
  Following the workflow vllm-project#2203.
  What this PR does: move `torchair` related logic into `_get_forward_metadata_across_dp` and override it in the torchair model runner.
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@1b99028

  Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [core] Support capture custom ops into aclgraph (vllm-project#2113)
  ### What this PR does / why we need it?
  Thanks to PR vllm-project#426, vllm-ascend supports aclgraph inference, which reduces host overhead. However, aclgraph relies heavily on `torch.compile`, a key feature of torch 2.x. A custom op can therefore be captured into aclgraph only if `torch.compile` can recognize and capture it. This PR registers meta implementations for the current custom ops to enable fx graph capture. With that in place, the Ascend runtime can insert those custom ops into aclgraph naturally.
  ### Does this PR introduce _any_ user-facing change?
  No user-facing change.
  ### How was this patch tested?
  Tested in unit tests: we integrate the `rotary_embedding` op into a small custom model and use `torch.compile` and aclgraph to capture and replay it to verify its functionality.
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@1b99028

  Signed-off-by: ganyi <pleaplusone.gy@gmail.com>

* Bump actions/download-artifact from 4 to 5 (vllm-project#2311)
  Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 4 to 5.
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@ebf7605

  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [Perf][MTP] Optimize reject sampler in greedy situation. (vllm-project#2137)
  This PR ports the optimization in PR vllm-project#2002 to main and makes it cleaner.
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@afa5b7c

  Signed-off-by: whx-sjtu <2952154980@qq.com>

* [3/N][Refactor] torchair model runner refactor (vllm-project#2207)
  There is a lot of torchair code in the model runner, which makes the code hard to maintain. We'll create a new torchair_model_runner to split out the torchair-related logic.
  Following the workflow vllm-project#2203, this is the first PR.
  What this PR does: create common functions `_build_attention_metadata` and `_generate_dummy_run_hidden_states` for dummy_run.
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@ebf7605

  Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [Feat] chunkprefill mla support torchair graph (vllm-project#1772)
  Chunked-prefill MLA only supports eager mode now. We want to optimize it by supporting the torchair graph. The idea is simple: when all requests are in decode, use the torchair graph; otherwise, for chunked prefill or prefill only, use eager mode.
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@ebf7605

  Signed-off-by: haojiangzheng <justineric096@gmail.com>
  Co-authored-by: haojiangzheng <justineric096@gmail.com>

* [4/N][Refactor] torchair model runner refactor (vllm-project#2208)
  There is a lot of torchair code in the model runner, which makes the code hard to maintain. We'll create a new torchair_model_runner to split out the torchair-related logic.
  Following the workflow vllm-project#2203, this is the first PR.
  What this PR does: create common function `_convert_torch_foramt` for initialize_kv_cache.
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@14a5d90

  Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* Configure Gemini (vllm-project#2298)
  ### What this PR does / why we need it?
  This PR requests Gemini AI to review PRs.
  ### Does this PR introduce _any_ user-facing change?
  No
  ### How was this patch tested?
  NA
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@14a5d90

  Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>

* ut: add ci guard for ut coverage (vllm-project#2317)
  ### What this PR does / why we need it?
  Add a CI guard for ut coverage: if the ut coverage of a patch PR is below 80%, the CI will fail.
  ### Does this PR introduce _any_ user-facing change?
  not involved
  ### How was this patch tested?
  not involved
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@458e74e

  Signed-off-by: Ronald1995 <ronaldautomobile@163.com>

* [main][prefill optimization] Optimize parallel strategies to reduce communication overhead (vllm-project#2198)
  ### What this PR does / why we need it?
  1. Shared Expert Sharding Strategy Update: Switched from TP-aligned to pure DP for shared experts, enabling more efficient execution.
  2. O_Proj AllReduce → ReduceScatter: Reduced communication overhead by using ReduceScatter, made possible by pure DP sharding.
  3. AllGather Postponed: Delayed to after QKV down projection to reduce synchronization impact during prefill.
  ### How was this patch tested?
  Added a ut case in `tests/ut/attention/test_mla_v1.py`.
  #### How to run
  Use the parameter `--additional_config='{"enable_shared_expert_dp": true}'`.
  ##### a. How to run eager mode
  eg:
  python -m vllm.entrypoints.openai.api_server --model=/model_path --trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002 --max-model-len 5120 --max-num-batched-tokens 16384 --enforce-eager --disable-log-requests --additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp": true,"chunked_prefill_for_mla":true}'
  ##### b. How to run graph mode
  eg:
  python -m vllm.entrypoints.openai.api_server --model=/model_path --trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002 --max-model-len 5120 --max-num-batched-tokens 16384 --disable-log-requests --additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp": true,"chunked_prefill_for_mla":true,"torchair_graph_config":{"enabled":true}}'
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@9edd1db

  Signed-off-by: Wang Kunpeng <1289706727@qq.com>
  Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
  Co-authored-by: SlightwindSec <slightwindsec@gmail.com>

* [Doc] Update faq (vllm-project#2334)
  ### What this PR does / why we need it?
  - update deterministic calculation
  - update supported devices
  ### Does this PR introduce _any_ user-facing change?
  - Users should update ray and protobuf when using ray as the distributed backend
  - Users should switch to `export HCCL_DETERMINISTIC=true` when enabling deterministic calculation
  ### How was this patch tested?
  N/A
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@ea1292a

  Signed-off-by: MengqingCao <cmq0113@163.com>

* [5/N][Refactor] torchair model runner refactor (vllm-project#2216)
  There is a lot of torchair code in the model runner, which makes the code hard to maintain. We'll create a new torchair_model_runner to split out the torchair-related logic.
  Following the workflow vllm-project#2203.
  What this PR does: create common function `_capture_model` for capture_model.
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@1891a26

  Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [1/N][Feat] Support MoE models with ACL Graph and refactor MoE communication logic (vllm-project#2125)
  ### What this PR does / why we need it?
  This PR refactors the MoE (Mixture of Experts) communication logic by introducing a strategy pattern. It defines an abstract base class, `MoECommMethod`, which encapsulates different communication strategies for MoE layers. By decoupling the MoE implementation from any single communication method, this change makes it simpler to add, replace, or optimize communication strategies in the future.

  Plan / Roadmap
  1. Introduce `MoECommMethod`, implement `AllGatherImpl`, and adapt ACL Graph handling to cover all scenarios (this PR).
  2. Implement `MC2CommImpl` and `AllToAllCommImpl` to optimize performance in specific scenarios.
  3. Enable W8A8 / Int8 models to use `unified_fused_experts`.

  Other notes
  * Data-parallel (DP) communication currently does not work with vLLM's dispatch/combine mechanisms; an alternative approach is required to resolve this incompatibility.
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@f7ad6a1

  Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>

* [Doc] Add container image save/load FAQ for offline environments (vllm-project#2347)
  ### What this PR does / why we need it?
  Add a Docker export/import guide for air-gapped environments.
  ### Does this PR introduce _any_ user-facing change?
  No
  ### How was this patch tested?
  NA
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@d16aa3d

  Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>

* [Bugfix] fix the oom when chunkprefill with long context like 64k (vllm-project#2319)
  The attention mask was declared in mla.py. We don't need the splitfuse mask for MLA chunked prefill, and this mask causes memory problems with long contexts such as 64k or 128k.
  - vLLM version: v0.10.0
  - vLLM main: vllm-project/vllm@14a5d90

  Signed-off-by: haojiangzheng <justineric096@gmail.com>

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
Signed-off-by: huangxialu <huangxialu1@huawei.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: haojiangzheng <justineric096@gmail.com>
Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Ronald1995 <ronaldautomobile@163.com>
Co-authored-by: huangxialu <huangxialu1@huawei.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Icey <1790571317@qq.com>
Co-authored-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Pleaplusone <pleaplusone.gy@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com>
Co-authored-by: zhenghaojiang <zhjoneson@163.com>
Co-authored-by: haojiangzheng <justineric096@gmail.com>
Co-authored-by: jack <QwertyJack@users.noreply.github.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: yiz-liu <136800916+yiz-liu@users.noreply.github.com>
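One entry in the commit list above, the AscendFusedMoE init fix (vllm-project#2268), turns on a detail of Python's `super()` that is easy to miss. The sketch below uses simplified stand-in classes (`Module`, `FusedMoE`, and the two subclasses are written for this example and carry none of the real constructor arguments) to show why `super(FusedMoE, self).__init__()` skips the `FusedMoE` initializer.

```python
class Module:
    def __init__(self):
        self.registered = True


class FusedMoE(Module):
    def __init__(self, num_experts):
        super().__init__()
        self.num_experts = num_experts


class BrokenAscendFusedMoE(FusedMoE):
    def __init__(self, num_experts):
        # super(FusedMoE, self) starts the method lookup *after* FusedMoE
        # in the MRO, so this calls Module.__init__ and never runs
        # FusedMoE.__init__; num_experts is never set.
        super(FusedMoE, self).__init__()


class FixedAscendFusedMoE(FusedMoE):
    def __init__(self, num_experts):
        # Zero-argument super() starts after the current class,
        # so FusedMoE.__init__ runs and sets its attributes.
        super().__init__(num_experts)


broken = BrokenAscendFusedMoE(8)
fixed = FixedAscendFusedMoE(8)
```

In this sketch the broken subclass still gets `registered` from `Module` but has no `num_experts`, which matches the symptom described in the entry: base-class members missing on the child.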

What this PR does / why we need it?
This PR gives vllm-ascend access to the piecewise_graph feature provided by the v1 engine.
Does this PR introduce any user-facing change?
NPUGraph is enabled by default. Users can disable it by setting enforce_eager, for example:
    from vllm import LLM
    llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", enforce_eager=True)
The feature requires torch_npu and CANN versions that support graph capture.
How was this patch tested?
It is turned on by default.