[Bug]: Quantized qwen2.5-72B fails with an error when starting the service #4001

@ylh19917567489-hue

Your current environment

vllm-ascend:0.11.0rc0
910B4-1 64G
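
Since the same weights behave differently on two machines, it may help to record package versions on both hosts and compare them. The following is a minimal sketch, not part of the original report. The distribution names in `PACKAGES` are assumptions and may need adjusting to match `pip list` on the host.

```python
from importlib import metadata

# Distribution names are assumptions; adjust them to match `pip list` on the host.
PACKAGES = ["vllm", "vllm-ascend", "torch", "torch-npu"]


def installed_versions(names):
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version}")
```

Running this on both the working and the failing host and diffing the output is one way to spot a mismatched component.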

🐛 Describe the bug

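In the test environment (910B4, 32G) the service starts successfully. After the weights are copied to the on-site environment, startup fails with the error below.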
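
The report does not include the launch command, but the `non-default args` line in the log records the settings that were used. The sketch below rebuilds an approximate command from that line. `build_serve_command` is a hypothetical helper. It assumes the `vllm serve` entry point with the model path as a positional argument, and that each flag is the argument name with underscores replaced by hyphens.

```python
import shlex

# Values copied from the "non-default args" line of the log.
NON_DEFAULT_ARGS = {
    "enable_auto_tool_choice": True,
    "tool_call_parser": "hermes",
    "model": "/root/models/",
    "trust_remote_code": True,
    "max_model_len": 32768,
    "quantization": "ascend",
    "enforce_eager": True,
    "served_model_name": ["Qwen2.5-72B-Instruct-w8a8-public"],
    "tensor_parallel_size": 4,
}


def build_serve_command(args: dict) -> str:
    """Turn the logged args into a `vllm serve` command line (hypothetical helper)."""
    args = dict(args)  # work on a copy so the caller's dict is not modified
    parts = ["vllm", "serve", str(args.pop("model"))]  # model path is positional
    for name, value in args.items():
        flag = "--" + name.replace("_", "-")
        if isinstance(value, bool):
            if value:
                parts.append(flag)  # boolean switches take no value
            # False values are skipped (assumption: they equal the default)
        elif isinstance(value, (list, tuple)):
            parts.append(flag)
            parts.extend(str(v) for v in value)
        else:
            parts.extend([flag, str(value)])
    return " ".join(shlex.quote(p) for p in parts)


print(build_serve_command(NON_DEFAULT_ARGS))
```

The printed command is only an approximation of what was run. Environment variables and any arguments left at their defaults do not appear in the log line.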

INFO 11-04 06:05:19 [init.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:19 [init.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:19 [init.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 11-04 06:05:19 [init.py:207] Platform plugin ascend is activated
INFO 11-04 06:05:23 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 11-04 06:05:23 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 11-04 06:05:23 [registry.py:581] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 11-04 06:05:23 [registry.py:581] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
WARNING 11-04 06:05:23 [registry.py:581] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
WARNING 11-04 06:05:23 [registry.py:581] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 11-04 06:05:23 [registry.py:581] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 11-04 06:05:23 [registry.py:581] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 11-04 06:05:23 [registry.py:581] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 11-04 06:05:23 [registry.py:581] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 11-04 06:05:23 [registry.py:581] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
(APIServer pid=2826) INFO 11-04 06:05:23 [api_server.py:1839] vLLM API server version 0.11.0rc3
(APIServer pid=2826) INFO 11-04 06:05:23 [utils.py:233] non-default args: {'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'model': '/root/models/', 'trust_remote_code': True, 'max_model_len': 32768, 'quantization': 'ascend', 'enforce_eager': True, 'served_model_name': ['Qwen2.5-72B-Instruct-w8a8-public'], 'tensor_parallel_size': 4}
(APIServer pid=2826) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=2826) INFO 11-04 06:05:23 [model.py:547] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=2826) torch_dtype is deprecated! Use dtype instead!
(APIServer pid=2826) INFO 11-04 06:05:23 [model.py:1510] Using max model len 32768
(APIServer pid=2826) INFO 11-04 06:05:24 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=2826) INFO 11-04 06:05:24 [init.py:381] Cudagraph is disabled under eager mode
(APIServer pid=2826) INFO 11-04 06:05:24 [platform.py:141] Non-MLA LLMs forcibly disable the chunked prefill feature,as the performance of operators supporting this feature functionality is currently suboptimal.
(APIServer pid=2826) INFO 11-04 06:05:24 [platform.py:179] Compilation disabled, using eager mode by default
INFO 11-04 06:05:31 [init.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:31 [init.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:31 [init.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 11-04 06:05:31 [init.py:207] Platform plugin ascend is activated
WARNING 11-04 06:05:35 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 11-04 06:05:35 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore_DP0 pid=2964) INFO 11-04 06:05:35 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [registry.py:581] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [registry.py:581] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [registry.py:581] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [registry.py:581] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [registry.py:581] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [registry.py:581] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [registry.py:581] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [registry.py:581] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [registry.py:581] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
(EngineCore_DP0 pid=2964) INFO 11-04 06:05:35 [core.py:77] Initializing a V1 LLM engine (v0.11.0rc3) with config: model='/root/models/', speculative_config=None, tokenizer='/root/models/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=ascend, enforce_eager=True, kv_cache_dtype=auto, device_config=npu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen2.5-72B-Instruct-w8a8-public, enable_prefix_caching=True, chunked_prefill_enabled=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["all"],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
(EngineCore_DP0 pid=2964) WARNING 11-04 06:05:35 [multiproc_executor.py:720] Reducing Torch parallelism from 192 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=2964) INFO 11-04 06:05:35 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_0e5542df'), local_subscribe_addr='ipc:///tmp/626aeab0-228f-441b-894f-cb28178dad91', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-04 06:05:42 [init.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:42 [init.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:42 [init.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 11-04 06:05:42 [init.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:42 [init.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:42 [init.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 11-04 06:05:43 [init.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:43 [init.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:43 [init.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 11-04 06:05:43 [init.py:207] Platform plugin ascend is activated
INFO 11-04 06:05:43 [init.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:43 [init.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:43 [init.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 11-04 06:05:43 [init.py:207] Platform plugin ascend is activated
INFO 11-04 06:05:43 [init.py:207] Platform plugin ascend is activated
INFO 11-04 06:05:43 [init.py:207] Platform plugin ascend is activated
WARNING 11-04 06:05:46 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 11-04 06:05:46 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 11-04 06:05:46 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 11-04 06:05:46 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 11-04 06:05:46 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
INFO 11-04 06:05:47 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
WARNING 11-04 06:05:47 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 11-04 06:05:47 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 11-04 06:05:47 [registry.py:581] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
INFO 11-04 06:05:58 [init.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:58 [init.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:58 [init.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 11-04 06:05:58 [init.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:58 [init.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:58 [init.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 11-04 06:05:58 [init.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:58 [init.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:58 [init.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 11-04 06:05:58 [init.py:207] Platform plugin ascend is activated
INFO 11-04 06:05:58 [init.py:207] Platform plugin ascend is activated
INFO 11-04 06:05:58 [init.py:207] Platform plugin ascend is activated
INFO 11-04 06:05:59 [init.py:36] Available plugins for group vllm.platform_plugins:
INFO 11-04 06:05:59 [init.py:38] - ascend -> vllm_ascend:register
INFO 11-04 06:05:59 [init.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 11-04 06:05:59 [init.py:207] Platform plugin ascend is activated
INFO 11-04 06:06:04 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_0427a4e2'), local_subscribe_addr='ipc:///tmp/d0057df1-c72c-43d3-b5e6-d8f44fdab5da', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-04 06:06:04 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_67142581'), local_subscribe_addr='ipc:///tmp/ff9de23a-d19c-41bf-a2df-d7a54a2884a7', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-04 06:06:04 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_2e07d3d2'), local_subscribe_addr='ipc:///tmp/c0a287ed-3cbf-493c-8c1a-6e6a87d3486b', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-04 06:06:05 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_84fe641c'), local_subscribe_addr='ipc:///tmp/8def68c4-c885-4157-9741-c0229f6fe7ba', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-04 06:06:06 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_4b4ff9b5'), local_subscribe_addr='ipc:///tmp/b014a024-0846-489c-85b8-afa50f1c01ef', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 11-04 06:06:07 [parallel_state.py:1208] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 11-04 06:06:07 [parallel_state.py:1208] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 11-04 06:06:07 [parallel_state.py:1208] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
INFO 11-04 06:06:07 [parallel_state.py:1208] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
(Worker_TP2 pid=3102) INFO 11-04 06:06:14 [model_runner_v1.py:2627] Starting to load model /root/models/...
(Worker_TP3 pid=3103) INFO 11-04 06:06:15 [model_runner_v1.py:2627] Starting to load model /root/models/...
(Worker_TP0 pid=3100) INFO 11-04 06:06:15 [model_runner_v1.py:2627] Starting to load model /root/models/...
(Worker_TP2 pid=3102) INFO 11-04 06:06:15 [utils.py:60] Using the vLLM Ascend Quantization now!
(Worker_TP1 pid=3101) INFO 11-04 06:06:15 [model_runner_v1.py:2627] Starting to load model /root/models/...
(Worker_TP0 pid=3100) INFO 11-04 06:06:15 [utils.py:60] Using the vLLM Ascend Quantization now!
(Worker_TP3 pid=3103) INFO 11-04 06:06:15 [utils.py:60] Using the vLLM Ascend Quantization now!
(Worker_TP1 pid=3101) INFO 11-04 06:06:15 [utils.py:60] Using the vLLM Ascend Quantization now!
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
(Worker_TP2 pid=3102) INFO 11-04 06:06:28 [default_loader.py:267] Loading weights took 12.43 seconds
(Worker_TP2 pid=3102) INFO 11-04 06:06:29 [model_runner_v1.py:2661] Loading model weights took 22.1509 GB
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:12<00:00, 12.82s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:12<00:00, 12.82s/it]
(Worker_TP0 pid=3100)
(Worker_TP0 pid=3100) INFO 11-04 06:06:29 [default_loader.py:267] Loading weights took 12.92 seconds
(Worker_TP1 pid=3101) INFO 11-04 06:06:29 [default_loader.py:267] Loading weights took 13.08 seconds
(Worker_TP3 pid=3103) INFO 11-04 06:06:29 [default_loader.py:267] Loading weights took 13.60 seconds
(Worker_TP0 pid=3100) INFO 11-04 06:06:30 [model_runner_v1.py:2661] Loading model weights took 22.1509 GB
(Worker_TP1 pid=3101) INFO 11-04 06:06:30 [model_runner_v1.py:2661] Loading model weights took 22.1509 GB
(Worker_TP3 pid=3103) INFO 11-04 06:06:30 [model_runner_v1.py:2661] Loading model weights took 22.1509 GB
[rank0]:[E1104 06:06:34.951660549 compiler_depend.ts:429] operator():build/CMakeFiles/torch_npu.dir/compiler_depend.ts:3785 NPU function error: call aclnnAddRmsNormQuant failed, error code is 561000
[ERROR] 2025-11-04-06:06:34 (PID:3100, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
EZ9999: Inner Error!
EZ9999: [PID: 3100] 2025-11-04-06:06:34.756.600 Cannot find bin of op AddRmsNormQuant, integral key 0/1/|float16/ND/float16/ND/float16/ND/float16/ND/int8/ND/int8/ND/float16/ND/.
TraceBack (most recent call last):
Cannot find binary for op AddRmsNormQuant.
Kernel Run failed. opType: 30, AddRmsNormQuant
launch failed for AddRmsNormQuant, errno:561000.

Exception raised from operator() at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:3785 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xfffd75203ea4 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe4 (0xfffd751a3e44 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x1b737dc (0xfffd671537dc in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: <unknown function> + 0x22887d4 (0xfffd678687d4 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: <unknown function> + 0x8fb170 (0xfffd65edb170 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: <unknown function> + 0x8fd504 (0xfffd65edd504 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #6: <unknown function> + 0x8f9e2c (0xfffd65ed9e2c in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #7: <unknown function> + 0xd43cc (0xfffd750043cc in /lib64/libstdc++.so.6)
frame #8: <unknown function> + 0x7fbb4 (0xfffd813dfbb4 in /lib64/libc.so.6)
frame #9: <unknown function> + 0xe79dc (0xfffd814479dc in /lib64/libc.so.6)

(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] WorkerProc hit an exception.
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] Traceback (most recent call last):
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 666, in worker_busy_loop
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] output = func(*args, **kwargs)
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 205, in determine_available_memory
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] self.model_runner.profile_run()
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2509, in profile_run
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] hidden_states = self._dummy_run(self.max_num_tokens,
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] return func(*args, **kwargs)
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2475, in _dummy_run
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] hidden_states = self._generate_dummy_run_hidden_states(
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2320, in _generate_dummy_run_hidden_states
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] hidden_states = self.model(input_ids=input_ids,
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=3100) ERROR 11-04 06:06:34 [multiproc_executor.py:671] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
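
The log above is dominated by plugin and registry messages repeated once per process, which makes the actual failure (`Cannot find bin of op AddRmsNormQuant`) easy to miss. The following is a minimal sketch for pulling out the failure-related lines. `extract_failure_lines` is a hypothetical helper, and the patterns in `SIGNAL` are an assumption based on this particular log.

```python
import re

# Patterns for lines that carry the failure rather than startup chatter.
# The pattern list is an assumption derived from the log in this report.
SIGNAL = re.compile(
    r"\bERROR\b|EZ9999|NPU function error|Cannot find|launch failed|Kernel Run failed"
)


def extract_failure_lines(log_text: str) -> list[str]:
    """Return unique failure-related lines in first-seen order."""
    seen = set()
    kept = []
    for line in log_text.splitlines():
        line = line.strip()
        if line and SIGNAL.search(line) and line not in seen:
            seen.add(line)
            kept.append(line)
    return kept


sample = """\
INFO 11-04 06:05:19 [init.py:207] Platform plugin ascend is activated
INFO 11-04 06:05:23 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
EZ9999: Inner Error!
Cannot find binary for op AddRmsNormQuant.
launch failed for AddRmsNormQuant, errno:561000.
"""

for line in extract_failure_lines(sample):
    print(line)
```

On the sample, only the three lines that mention the `AddRmsNormQuant` failure are printed. To filter a full log, read it from a file and pass the text to `extract_failure_lines`.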
