fix bug: support Qwen3 32B dense model in torchair graph mode #2
Conversation
vllm_ascend/attention/attention_v1.py — reshape the decode attention inputs from BSH to TND (a combined sketch follows this list):

- Query: the original BSH shape `(-1, 1, num_heads * head_size)` becomes TND `(Token, NumHeads, Dim)`, where Token = -1 (the total token count), NumHeads = `self.num_heads`, and Dim = `self.head_size`:
  `query = query.view(-1, self.num_heads, self.head_size)`
- key_cache: reshaped from `[block_num, num_kv_heads, block_size, head_dims]` to TND. The old path merged `num_kv_heads` and `head_dims`; the new layout keeps the head count and per-head dimension separate. In TND, Token = block_num × block_size (the total token count), NumHeads = `self.num_kv_heads`, and Dim = `self.head_size`:
  `key_cache = self.key_cache.view(-1, self.num_kv_heads, self.head_size)`
- value_cache: same as key_cache, kept in TND:
  `value_cache = self.value_cache.view(-1, self.num_kv_heads, self.head_size)`
- Call the attention kernel with `input_layout` set to `'TND'`:
  `output = torch_npu.npu_incre_flash_attention(`
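Putting the four adjustments together, a minimal sketch of the decode path could look like the following. The function name `incre_attention_tnd` and its explicit arguments are hypothetical stand-ins for the attributes on the attention backend (`self.num_heads`, `self.num_kv_heads`, `self.head_size`); only the `view` calls and the `input_layout='TND'` argument come from the diff above.

```python
# Hedged sketch, assuming a paged KV cache of shape
# [block_num, num_kv_heads, block_size, head_size]; names are illustrative.
import torch
import torch_npu


def incre_attention_tnd(query: torch.Tensor,
                        key_cache: torch.Tensor,
                        value_cache: torch.Tensor,
                        block_table: torch.Tensor,
                        seq_lens: list[int],
                        num_heads: int,
                        num_kv_heads: int,
                        head_size: int,
                        scale: float) -> torch.Tensor:
    # Query: BSH (-1, 1, num_heads * head_size) -> TND (tokens, heads, dim).
    query = query.view(-1, num_heads, head_size)

    # KV caches: [block_num, num_kv_heads, block_size, head_size] flattened
    # to TND with tokens = block_num * block_size; the head count and
    # per-head dim stay separate instead of being merged as in the BSH path.
    key_cache = key_cache.view(-1, num_kv_heads, head_size)
    value_cache = value_cache.view(-1, num_kv_heads, head_size)

    # Incremental flash attention with the TND input layout.
    return torch_npu.npu_incre_flash_attention(
        query,
        key_cache,
        value_cache,
        num_heads=num_heads,
        num_key_value_heads=num_kv_heads,
        scale_value=scale,
        input_layout="TND",
        block_table=block_table,
        actual_seq_lengths=seq_lens,
    )
```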
```python
def get_output(self) -> EngineCoreOutputs:
```
Modify /data/z00811365/verl_cann_qwen3_gitcode_fullgraph_0110/rl_train/qwen3/vllm_ascend/models/__init__.py to add the following model-registration code (see the sketch below): /data/z00811365/verl_cann_qwen3_gitcode_fullgraph_0110/rl_train/qwen3/vllm_ascend/torchair/utils.py .....
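Since the actual registration snippet is elided above, here is a hedged sketch of what it could look like. Registration in vllm_ascend goes through vLLM's `ModelRegistry`; the class path `vllm_ascend.torchair.models.qwen3:CustomQwen3ForCausalLM` is an assumption for illustration, while `Qwen3ForCausalLM` is the standard HF architecture string for Qwen3 dense models.

```python
# Hedged sketch of the registration added to vllm_ascend/models/__init__.py.
# The class path below is a hypothetical example, not the exact module
# used in this PR.
from vllm import ModelRegistry


def register_model() -> None:
    # Map the HF architecture name to the Ascend/torchair implementation.
    ModelRegistry.register_model(
        "Qwen3ForCausalLM",
        "vllm_ascend.torchair.models.qwen3:CustomQwen3ForCausalLM",
    )
```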
model_runner_v1.py |
model_runner_v1.py, first 1709 lines, copied as follows:

```python
import numpy as np
...
# yapf conflicts with isort for this block
# yapf: disable
from vllm.v1.kv_cache_interface import (AttentionSpec, FullAttentionSpec,
                                        ...)
# yapf: enable
from vllm.v1.outputs import (EMPTY_MODEL_RUNNER_OUTPUT,
                             AsyncModelRunnerOutput, ...)
from vllm_ascend.ascend_config import get_ascend_config

if TYPE_CHECKING:
    ...

import torch_npu
import vllm_ascend.envs as envs_ascend

# if true, allow tensor initialization and casting with internal format
# (e.g., NZ)
torch.npu.config.allow_internal_format = True

if is_310p():
    ...

if not vllm_version_is("0.10.2"):
    ...

@dataclass
...

@contextmanager
...

# Wrapper for ModelRunnerOutput to support overlapped execution.
class AsyncNPUModelRunnerOutput(AsyncModelRunnerOutput):
    ...

class NPUModelRunner(LoRAModelRunnerMixin):
    ...
```
The subsequent functions of model_runner_v1.py (lines 1710~2716) are as follows: .............

The subsequent functions of model_runner_v1.py (lines 2716 onward) are as follows: .............
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?