[Bug]: Quantized qwen2.5-72B fails with an error when starting the service

### Your current environment

vllm-ascend:0.11.0rc0
910B4-1 64G

### 🐛 Describe the bug

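The service starts successfully in the test environment (910B4, 32G). After copying the weights to the on-site environment, startup fails with the following error: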

INFO 11-04 06:05:19 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:19 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:19 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-04 06:05:19 [__init__.py:207] Platform plugin ascend is activated
INFO 11-04 06:05:23 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 11-04 06:05:23 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 11-04 06:05:23 [registry.py:581] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 11-04 06:05:23 [registry.py:581] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
WARNING 11-04 06:05:23 [registry.py:581] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
WARNING 11-04 06:05:23 [registry.py:581] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 11-04 06:05:23 [registry.py:581] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 11-04 06:05:23 [registry.py:581] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 11-04 06:05:23 [registry.py:581] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 11-04 06:05:23 [registry.py:581] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 11-04 06:05:23 [registry.py:581] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
(APIServer pid=2826) INFO 11-04 06:05:23 [api_server.py:1839] vLLM API server version 0.11.0rc3
(APIServer pid=2826) INFO 11-04 06:05:23 [utils.py:233] non-default args: {'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'model': '/root/models/', 'trust_remote_code': True, 'max_model_len': 32768, 'quantization': 'ascend', 'enforce_eager': True, 'served_model_name': ['Qwen2.5-72B-Instruct-w8a8-public'], 'tensor_parallel_size': 4}
(APIServer pid=2826) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=2826) INFO 11-04 06:05:23 [model.py:547] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=2826) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=2826) INFO 11-04 06:05:23 [model.py:1510] Using max model len 32768
(APIServer pid=2826) INFO 11-04 06:05:24 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=2826) INFO 11-04 06:05:24 [__init__.py:381] Cudagraph is disabled under eager mode
(APIServer pid=2826) INFO 11-04 06:05:24 [platform.py:141] Non-MLA LLMs forcibly disable the chunked prefill feature,as the performance of operators supporting this feature functionality is currently suboptimal.
(APIServer pid=2826) INFO 11-04 06:05:24 [platform.py:179] Compilation disabled, using eager mode by default
INFO 11-04 06:05:31 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:31 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:31 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-04 06:05:31 [__init__.py:207] Platform plugin ascend is activated
WARNING 11-04 06:05:35 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 11-04 06:05:35 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore_DP0 pid=2964) INFO 11-04 06:05:35 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [registry.py:581] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [registry.py:581] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [registry.py:581] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [registry.py:581] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [registry.py:581] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [registry.py:581] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [registry.py:581] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [registry.py:581] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [registry.py:581] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
(EngineCore_DP0 pid=2964) INFO 11-04 06:05:35 [core.py:77] Initializing a V1 LLM engine (v0.11.0rc3) with config: model='/root/models/', speculative_config=None, tokenizer='/root/models/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=ascend, enforce_eager=True, kv_cache_dtype=auto, device_config=npu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen2.5-72B-Instruct-w8a8-public, enable_prefix_caching=True, chunked_prefill_enabled=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["all"],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [multiproc_executor.py:720] Reducing Torch parallelism from 192 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=2964) INFO 11-04 06:05:35 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_0e5542df'), local_subscribe_addr='ipc:///tmp/626aeab0-228f-441b-894f-cb28178dad91', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-04 06:05:42 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:42 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:42 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-04 06:05:42 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:42 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:42 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-04 06:05:43 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:43 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:43 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-04 06:05:43 [__init__.py:207] Platform plugin ascend is activated
INFO 11-04 06:05:43 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:43 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:43 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-04 06:05:43 [__init__.py:207] Platform plugin ascend is activated
INFO 11-04 06:05:43 [__init__.py:207] Platform plugin ascend is activated
INFO 11-04 06:05:43 [__init__.py:207] Platform plugin ascend is activated
WARNING 11-04 06:05:46 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 11-04 06:05:46 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 11-04 06:05:46 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 11-04 06:05:46 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 11-04 06:05:46 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
INFO 11-04 06:05:47 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
WARNING 11-04 06:05:47 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 11-04 06:05:47 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
INFO 11-04 06:05:58 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:58 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:58 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-04 06:05:58 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:58 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:58 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-04 06:05:58 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:58 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:58 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-04 06:05:58 [__init__.py:207] Platform plugin ascend is activated
INFO 11-04 06:05:58 [__init__.py:207] Platform plugin ascend is activated
INFO 11-04 06:05:58 [__init__.py:207] Platform plugin ascend is activated
INFO 11-04 06:05:59 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:59 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:59 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 11-04 06:05:59 [__init__.py:207] Platform plugin ascend is activated
INFO 11-04 06:06:04 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_0427a4e2'), local_subscribe_addr='ipc:///tmp/d0057df1-c72c-43d3-b5e6-d8f44fdab5da', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-04 06:06:04 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_67142581'), local_subscribe_addr='ipc:///tmp/ff9de23a-d19c-41bf-a2df-d7a54a2884a7', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-04 06:06:04 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_2e07d3d2'), local_subscribe_addr='ipc:///tmp/c0a287ed-3cbf-493c-8c1a-6e6a87d3486b', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-04 06:06:05 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_84fe641c'), local_subscribe_addr='ipc:///tmp/8def68c4-c885-4157-9741-c0229f6fe7ba', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-04 06:06:06 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_4b4ff9b5'), local_subscribe_addr='ipc:///tmp/b014a024-0846-489c-85b8-afa50f1c01ef', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-04 06:06:07 [parallel_state.py:1208] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 11-04 06:06:07 [parallel_state.py:1208] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 11-04 06:06:07 [parallel_state.py:1208] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
INFO 11-04 06:06:07 [parallel_state.py:1208] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
(Worker_TP2 pid=3102) INFO 11-04 06:06:14 [model_runner_v1.py:2627] Starting to load model /root/models/...
(Worker_TP3 pid=3103) INFO 11-04 06:06:15 [model_runner_v1.py:2627] Starting to load model /root/models/...
(Worker_TP0 pid=3100) INFO 11-04 06:06:15 [model_runner_v1.py:2627] Starting to load model /root/models/...
(Worker_TP2 pid=3102) INFO 11-04 06:06:15 [utils.py:60] Using the vLLM Ascend Quantization now!
(Worker_TP1 pid=3101) INFO 11-04 06:06:15 [model_runner_v1.py:2627] Starting to load model /root/models/...
(Worker_TP0 pid=3100) INFO 11-04 06:06:15 [utils.py:60] Using the vLLM Ascend Quantization now!
(Worker_TP3 pid=3103) INFO 11-04 06:06:15 [utils.py:60] Using the vLLM Ascend Quantization now!
(Worker_TP1 pid=3101) INFO 11-04 06:06:15 [utils.py:60] Using the vLLM Ascend Quantization now!
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(Worker_TP2 pid=3102) INFO 11-04 06:06:28 [default_loader.py:267] Loading weights took 12.43 seconds
(Worker_TP2 pid=3102) INFO 11-04 06:06:29 [model_runner_v1.py:2661] Loading model weights took 22.1509 GB
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:12<00:00, 12.82s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:12<00:00, 12.82s/it]
(Worker_TP0 pid=3100) 
(Worker_TP0 pid=3100) INFO 11-04 06:06:29 [default_loader.py:267] Loading weights took 12.92 seconds
(Worker_TP1 pid=3101) INFO 11-04 06:06:29 [default_loader.py:267] Loading weights took 13.08 seconds
(Worker_TP3 pid=3103) INFO 11-04 06:06:29 [default_loader.py:267] Loading weights took 13.60 seconds
(Worker_TP0 pid=3100) INFO 11-04 06:06:30 [model_runner_v1.py:2661] Loading model weights took 22.1509 GB
(Worker_TP1 pid=3101) INFO 11-04 06:06:30 [model_runner_v1.py:2661] Loading model weights took 22.1509 GB
(Worker_TP3 pid=3103) INFO 11-04 06:06:30 [model_runner_v1.py:2661] Loading model weights took 22.1509 GB
[rank0]:[E1104 06:06:34.951660549 compiler_depend.ts:429] operator():build/CMakeFiles/torch_npu.dir/compiler_depend.ts:3785 NPU function error: call aclnnAddRmsNormQuant failed, error code is 561000
[ERROR] 2025-11-04-06:06:34 (PID:3100, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
EZ9999: Inner Error!
EZ9999: [PID: 3100] 2025-11-04-06:06:34.756.600 Cannot find bin of op AddRmsNormQuant, integral key 0/1/|float16/ND/float16/ND/float16/ND/float16/ND/int8/ND/int8/ND/float16/ND/.
        TraceBack (most recent call last):
       Cannot find binary for op AddRmsNormQuant.
       Kernel Run failed. opType: 30, AddRmsNormQuant
       launch failed for AddRmsNormQuant, errno:561000.

Exception raised from operator() at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:3785 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xfffd75203ea4 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe4 (0xfffd751a3e44 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x1b737dc (0xfffd671537dc in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: <unknown function> + 0x22887d4 (0xfffd678687d4 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: <unknown function> + 0x8fb170 (0xfffd65edb170 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: <unknown function> + 0x8fd504 (0xfffd65edd504 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #6: <unknown function> + 0x8f9e2c (0xfffd65ed9e2c in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #7: <unknown function> + 0xd43cc (0xfffd750043cc in /lib64/libstdc++.so.6)
frame #8: <unknown function> + 0x7fbb4 (0xfffd813dfbb4 in /lib64/libc.so.6)
frame #9: <unknown function> + 0xe79dc (0xfffd814479dc in /lib64/libc.so.6)

(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] WorkerProc hit an exception.
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] Traceback (most recent call last):
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671]   File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 666, in worker_busy_loop
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671]     output = func(*args, **kwargs)
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 205, in determine_available_memory
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671]     self.model_runner.profile_run()
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2509, in profile_run
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671]     hidden_states = self._dummy_run(self.max_num_tokens,
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671]     return func(*args, **kwargs)
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2475, in _dummy_run
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671]     hidden_states = self._generate_dummy_run_hidden_states(
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671]   File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2320, in _generate_dummy_run_hidden_states
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671]     hidden_states = self.model(input_ids=input_ids,
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671]   File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
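
The relevant failure is buried in a long log, so a small helper can pull out the operator name and the dtype/format signature from the `Cannot find bin of op ...` line. This is only a sketch for triage: the function name `find_missing_op_binaries` is hypothetical, and reading the part of the integral key after `|` as alternating dtype/format pairs is an assumption based on how the line above looks, not on any documented format.

```python
import re

# Matches lines such as:
#   "... Cannot find bin of op AddRmsNormQuant, integral key 0/1/|float16/ND/.../ND/."
MISSING_BIN = re.compile(r"Cannot find bin of op (\w+), integral key (\S+)")


def find_missing_op_binaries(log_text):
    """Return a list of (op_name, [(dtype, format), ...]) for each matching line.

    Assumption: the text after '|' in the integral key alternates between
    a dtype token and a format token, separated by '/'.
    """
    results = []
    for line in log_text.splitlines():
        match = MISSING_BIN.search(line)
        if match is None:
            continue
        op_name, key = match.group(1), match.group(2)
        # Drop the prefix before '|' and the trailing '.' token.
        _, _, rest = key.partition("|")
        tokens = [t for t in rest.split("/") if t and t != "."]
        pairs = list(zip(tokens[0::2], tokens[1::2]))
        results.append((op_name, pairs))
    return results


if __name__ == "__main__":
    sample = (
        "EZ9999: [PID: 3100] 2025-11-04-06:06:34.756.600 Cannot find bin of op "
        "AddRmsNormQuant, integral key 0/1/|float16/ND/float16/ND/float16/ND/"
        "float16/ND/int8/ND/int8/ND/float16/ND/."
    )
    for op, signature in find_missing_op_binaries(sample):
        print(op, signature)
```

Running the same helper on the logs from the working test environment and the failing on-site environment makes it easier to check whether the same operator signature is requested on both machines.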
