[Bug]: NPU function error: call aclnnAddRmsNormQuant failed, error code is 561000 #2387

@Pandaxia990

Description

Your current environment

NPU: Ascend 910B4 * 4

Model: Qwen3-32B-w8a8

Image: quay.io/ascend/vllm-ascend:v0.10.0rc1

Command: vllm serve /home/Qwen3-8B-w8a8-MindIE/ --max-model-len 10240 --port 8000 -tp 4

🐛 Describe the bug

INFO 08-15 02:21:19 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:21:19 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:21:19 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:21:19 [init.py:226] Platform plugin ascend is activated
WARNING 08-15 02:21:21 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 08-15 02:21:22 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 08-15 02:21:23 [registry.py:430] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 08-15 02:21:23 [registry.py:430] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 08-15 02:21:23 [registry.py:430] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 08-15 02:21:23 [registry.py:430] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 08-15 02:21:23 [registry.py:430] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 08-15 02:21:23 [registry.py:430] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 08-15 02:21:23 [registry.py:430] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
INFO 08-15 02:21:24 [api_server.py:1755] vLLM API server version 0.10.0
INFO 08-15 02:21:24 [cli_args.py:261] non-default args: {'model_tag': '/home/Qwen3-8B-w8a8-MindIE/', 'model': '/home/Qwen3-8B-w8a8-MindIE/', 'max_model_len': 10240, 'tensor_parallel_size': 4}
INFO 08-15 02:21:36 [config.py:1604] Using max model len 10240
WARNING 08-15 02:21:36 [config.py:1084] ascend quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 08-15 02:21:37 [config.py:2434] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 08-15 02:21:37 [platform.py:162] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
INFO 08-15 02:21:37 [utils.py:333] Calculated maximum supported batch sizes for ACL graph: 25
INFO 08-15 02:21:37 [utils.py:348] Adjusted ACL graph batch sizes for Qwen3ForCausalLM model (layers: 36): 67 → 25 sizes
INFO 08-15 02:21:46 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:21:46 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:21:46 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:21:46 [init.py:226] Platform plugin ascend is activated
WARNING 08-15 02:21:47 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 08-15 02:21:48 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 08-15 02:21:49 [core.py:572] Waiting for init message from front-end.
WARNING 08-15 02:21:49 [registry.py:430] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 08-15 02:21:49 [registry.py:430] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 08-15 02:21:49 [registry.py:430] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 08-15 02:21:49 [registry.py:430] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 08-15 02:21:49 [registry.py:430] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 08-15 02:21:49 [registry.py:430] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 08-15 02:21:49 [registry.py:430] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
INFO 08-15 02:21:49 [core.py:71] Initializing a V1 LLM engine (v0.10.0) with config: model='/home/Qwen3-8B-w8a8-MindIE/', speculative_config=None, tokenizer='/home/Qwen3-8B-w8a8-MindIE/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=10240, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=ascend, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/Qwen3-8B-w8a8-MindIE/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["all"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.unified_ascend_attention_with_output"],"use_inductor":false,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,488,464,448,424,400,384,360,336,312,288,272,248,224,208,184,160,136,112,96,72,48,32,8,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 08-15 02:21:49 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-15 02:21:49 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_0bff10d1'), local_subscribe_addr='ipc:///tmp/dcdeb8d0-8d1b-4684-8ad7-970b33a69de9', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 08-15 02:21:57 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:21:57 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:21:57 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:21:57 [init.py:226] Platform plugin ascend is activated
INFO 08-15 02:21:57 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:21:57 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:21:57 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:21:58 [init.py:226] Platform plugin ascend is activated
INFO 08-15 02:21:58 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:21:58 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:21:58 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:21:58 [init.py:226] Platform plugin ascend is activated
INFO 08-15 02:21:58 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:21:58 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:21:58 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:21:58 [init.py:226] Platform plugin ascend is activated
WARNING 08-15 02:21:59 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-15 02:21:59 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-15 02:21:59 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-15 02:21:59 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 08-15 02:22:00 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 08-15 02:22:00 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 08-15 02:22:00 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 08-15 02:22:00 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
WARNING 08-15 02:22:01 [registry.py:430] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 08-15 02:22:01 [registry.py:430] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 08-15 02:22:01 [registry.py:430] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 08-15 02:22:01 [registry.py:430] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 08-15 02:22:01 [registry.py:430] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 08-15 02:22:01 [registry.py:430] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 08-15 02:22:01 [registry.py:430] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
INFO 08-15 02:22:13 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:22:13 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:22:13 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:22:13 [init.py:226] Platform plugin ascend is activated
INFO 08-15 02:22:13 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:22:13 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:22:13 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:22:13 [init.py:226] Platform plugin ascend is activated
INFO 08-15 02:22:14 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:22:14 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:22:14 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:22:14 [init.py:226] Platform plugin ascend is activated
INFO 08-15 02:22:14 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:22:14 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:22:14 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:22:14 [init.py:226] Platform plugin ascend is activated
WARNING 08-15 02:22:15 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-15 02:22:15 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-15 02:22:15 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-15 02:22:15 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
(VllmWorker rank=3 pid=14231) INFO 08-15 02:22:18 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_7f99a860'), local_subscribe_addr='ipc:///tmp/c423abbc-52dd-4c8e-83e4-955eb4e0cebe', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:18 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_fc1211b5'), local_subscribe_addr='ipc:///tmp/b8018fd6-05c9-4aa5-bf62-1437adbbfecc', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=2 pid=14230) INFO 08-15 02:22:18 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_58be7781'), local_subscribe_addr='ipc:///tmp/ebef15f5-db56-426b-95b2-e00df04d9740', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=14229) INFO 08-15 02:22:18 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_e5434b7e'), local_subscribe_addr='ipc:///tmp/d981a72a-8c6e-458f-83df-6ccd8d1543ee', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_718d42b0'), local_subscribe_addr='ipc:///tmp/ba08e10f-dadb-46aa-9268-358aca80cd5c', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:19 [parallel_state.py:1102] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker rank=1 pid=14229) INFO 08-15 02:22:19 [parallel_state.py:1102] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker rank=2 pid=14230) INFO 08-15 02:22:19 [parallel_state.py:1102] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
(VllmWorker rank=3 pid=14231) INFO 08-15 02:22:19 [parallel_state.py:1102] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
(VllmWorker rank=1 pid=14229) INFO 08-15 02:22:20 [model_runner_v1.py:2084] Starting to load model /home/Qwen3-8B-w8a8-MindIE/...
(VllmWorker rank=3 pid=14231) INFO 08-15 02:22:20 [model_runner_v1.py:2084] Starting to load model /home/Qwen3-8B-w8a8-MindIE/...
(VllmWorker rank=2 pid=14230) INFO 08-15 02:22:20 [model_runner_v1.py:2084] Starting to load model /home/Qwen3-8B-w8a8-MindIE/...
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:20 [model_runner_v1.py:2084] Starting to load model /home/Qwen3-8B-w8a8-MindIE/...
(VllmWorker rank=1 pid=14229) INFO 08-15 02:22:20 [quantizer.py:85] Using the vLLM Ascend Quantizer version now!
(VllmWorker rank=3 pid=14231) INFO 08-15 02:22:20 [quantizer.py:85] Using the vLLM Ascend Quantizer version now!
(VllmWorker rank=2 pid=14230) INFO 08-15 02:22:20 [quantizer.py:85] Using the vLLM Ascend Quantizer version now!
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:20 [quantizer.py:85] Using the vLLM Ascend Quantizer version now!
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:00<00:01, 1.14it/s]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:02<00:01, 1.10s/it]
(VllmWorker rank=3 pid=14231) INFO 08-15 02:22:24 [default_loader.py:262] Loading weights took 3.42 seconds
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:03<00:00, 1.20s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:03<00:00, 1.15s/it]
(VllmWorker rank=0 pid=14228)
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:24 [default_loader.py:262] Loading weights took 3.45 seconds
(VllmWorker rank=3 pid=14231) INFO 08-15 02:22:25 [model_runner_v1.py:2114] Loading model weights took 2.6379 GB
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:25 [model_runner_v1.py:2114] Loading model weights took 2.6379 GB
(VllmWorker rank=1 pid=14229) INFO 08-15 02:22:26 [default_loader.py:262] Loading weights took 4.94 seconds
(VllmWorker rank=2 pid=14230) INFO 08-15 02:22:26 [default_loader.py:262] Loading weights took 4.99 seconds
(VllmWorker rank=1 pid=14229) INFO 08-15 02:22:27 [model_runner_v1.py:2114] Loading model weights took 2.6379 GB
(VllmWorker rank=2 pid=14230) INFO 08-15 02:22:27 [model_runner_v1.py:2114] Loading model weights took 2.6379 GB
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:37 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/c249e2e960/rank_0_0/backbone for vLLM's torch.compile
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:37 [backends.py:541] Dynamo bytecode transform time: 10.02 s
(VllmWorker rank=3 pid=14231) INFO 08-15 02:22:38 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/c249e2e960/rank_3_0/backbone for vLLM's torch.compile
(VllmWorker rank=3 pid=14231) INFO 08-15 02:22:38 [backends.py:541] Dynamo bytecode transform time: 10.31 s
(VllmWorker rank=2 pid=14230) INFO 08-15 02:22:38 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/c249e2e960/rank_2_0/backbone for vLLM's torch.compile
(VllmWorker rank=2 pid=14230) INFO 08-15 02:22:38 [backends.py:541] Dynamo bytecode transform time: 10.56 s
(VllmWorker rank=1 pid=14229) INFO 08-15 02:22:39 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/c249e2e960/rank_1_0/backbone for vLLM's torch.compile
(VllmWorker rank=1 pid=14229) INFO 08-15 02:22:39 [backends.py:541] Dynamo bytecode transform time: 10.72 s
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:41 [backends.py:215] Compiling a graph for dynamic shape takes 2.19 s
(VllmWorker rank=3 pid=14231) INFO 08-15 02:22:41 [backends.py:215] Compiling a graph for dynamic shape takes 2.10 s
(VllmWorker rank=2 pid=14230) INFO 08-15 02:22:42 [backends.py:215] Compiling a graph for dynamic shape takes 2.10 s
(VllmWorker rank=1 pid=14229) INFO 08-15 02:22:42 [backends.py:215] Compiling a graph for dynamic shape takes 2.12 s
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:49 [monitor.py:34] torch.compile takes 12.21 s in total
[rank0]:[E815 02:22:49.024075817 compiler_depend.ts:429] operator():build/CMakeFiles/torch_npu.dir/compiler_depend.ts:3785 NPU function error: call aclnnAddRmsNormQuant failed, error code is 561000
[ERROR] 2025-08-15-02:22:49 (PID:14228, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
EZ9999: Inner Error!
EZ9999: [PID: 14228] 2025-08-15-02:22:49.996.317 Cannot find bin of op AddRmsNormQuant, integral key 0/1/|float16/ND/float16/ND/float16/ND/float16/ND/int8/ND/int8/ND/float16/ND/.
TraceBack (most recent call last):
Cannot find binary for op AddRmsNormQuant.
Kernel Run failed. opType: 31, AddRmsNormQuant
launch failed for AddRmsNormQuant, errno:561000.

Exception raised from operator() at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:3785 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xd4 (0xfffbf75c3ea4 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xe4 (0xfffbf7563e44 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: + 0x1b737dc (0xfffbe96e37dc in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: + 0x22887d4 (0xfffbe9df87d4 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: + 0x8fb170 (0xfffbe846b170 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: + 0x8fd504 (0xfffbe846d504 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #6: + 0x8f9e2c (0xfffbe8469e2c in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #7: + 0xd31fc (0xfffbf73d31fc in /lib/aarch64-linux-gnu/libstdc++.so.6)
frame #8: + 0x7d5b8 (0xfffc036dd5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #9: + 0xe5edc (0xfffc03745edc in /lib/aarch64-linux-gnu/libc.so.6)

(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] WorkerProc hit an exception.
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] Traceback (most recent call last):
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 541, in worker_busy_loop
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] output = func(*args, **kwargs)
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 157, in determine_available_memory
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] self.model_runner.profile_run()
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2027, in profile_run
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] hidden_states = hidden_states[logit_indices]
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnAddRmsNormQuant.
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] [ERROR] 2025-08-15-02:22:50 (PID:14228, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546]
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] Traceback (most recent call last):
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 541, in worker_busy_loop
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] output = func(*args, **kwargs)
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 157, in determine_available_memory
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] self.model_runner.profile_run()
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2027, in profile_run
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] hidden_states = hidden_states[logit_indices]
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnAddRmsNormQuant.
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] [ERROR] 2025-08-15-02:22:50 (PID:14228, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546]
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546]
ERROR 08-15 02:22:50 [core.py:632] EngineCore failed to start.
ERROR 08-15 02:22:50 [core.py:632] Traceback (most recent call last):
ERROR 08-15 02:22:50 [core.py:632] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 623, in run_engine_core
ERROR 08-15 02:22:50 [core.py:632] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 08-15 02:22:50 [core.py:632] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-15 02:22:50 [core.py:632] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 441, in init
ERROR 08-15 02:22:50 [core.py:632] super().init(vllm_config, executor_class, log_stats,
ERROR 08-15 02:22:50 [core.py:632] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 86, in init
ERROR 08-15 02:22:50 [core.py:632] self._initialize_kv_caches(vllm_config)
ERROR 08-15 02:22:50 [core.py:632] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 158, in _initialize_kv_caches
ERROR 08-15 02:22:50 [core.py:632] self.model_executor.determine_available_memory())
ERROR 08-15 02:22:50 [core.py:632] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-15 02:22:50 [core.py:632] File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
ERROR 08-15 02:22:50 [core.py:632] output = self.collective_rpc("determine_available_memory")
ERROR 08-15 02:22:50 [core.py:632] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-15 02:22:50 [core.py:632] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 237, in collective_rpc
ERROR 08-15 02:22:50 [core.py:632] result = get_response(w, dequeue_timeout)
ERROR 08-15 02:22:50 [core.py:632] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-15 02:22:50 [core.py:632] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 224, in get_response
ERROR 08-15 02:22:50 [core.py:632] raise RuntimeError(
ERROR 08-15 02:22:50 [core.py:632] RuntimeError: Worker failed with error 'The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnAddRmsNormQuant.
ERROR 08-15 02:22:50 [core.py:632] Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
ERROR 08-15 02:22:50 [core.py:632] Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
ERROR 08-15 02:22:50 [core.py:632] [ERROR] 2025-08-15-02:22:50 (PID:14228, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
ERROR 08-15 02:22:50 [core.py:632] ', please check the stack trace above for the root cause
ERROR 08-15 02:23:00 [multiproc_executor.py:140] Worker proc VllmWorker-3 died unexpectedly, shutting down executor.
Process EngineCore_0:
Traceback (most recent call last):
File "/usr/local/python3.11.13/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/local/python3.11.13/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 636, in run_engine_core
raise e
File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 623, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 441, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 86, in __init__
self._initialize_kv_caches(vllm_config)
File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 158, in _initialize_kv_caches
self.model_executor.determine_available_memory())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
output = self.collective_rpc("determine_available_memory")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 237, in collective_rpc
result = get_response(w, dequeue_timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 224, in get_response
raise RuntimeError(
RuntimeError: Worker failed with error 'The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnAddRmsNormQuant.
Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
[ERROR] 2025-08-15-02:22:50 (PID:14228, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
', please check the stack trace above for the root cause
Traceback (most recent call last):
File "/usr/local/python3.11.13/bin/vllm", line 8, in <module>
sys.exit(main())
^^^^^^
File "/vllm-workspace/vllm/vllm/entrypoints/cli/main.py", line 54, in main
args.dispatch_function(args)
File "/vllm-workspace/vllm/vllm/entrypoints/cli/serve.py", line 52, in cmd
uvloop.run(run_server(args))
File "/usr/local/python3.11.13/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
return runner.run(wrapper())
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/python3.11.13/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/python3.11.13/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1791, in run_server
await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1811, in run_server_worker
async with build_async_engine_client(args, client_config) as engine_client:
File "/usr/local/python3.11.13/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/local/python3.11.13/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 163, in from_vllm_config
return cls(
^^^^
File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 117, in __init__
self.engine_core = EngineCoreClient.make_async_mp_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 98, in make_async_mp_client
return AsyncMPClient(*client_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 677, in __init__
super().__init__(
File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 408, in __init__
with launch_core_engines(vllm_config, executor_class,
File "/usr/local/python3.11.13/lib/python3.11/contextlib.py", line 144, in __exit__
next(self.gen)
File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 697, in launch_core_engines
wait_for_engine_startup(
File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 750, in wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
[ERROR] 2025-08-15-02:23:06 (PID:13821, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
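Because `aclnnAddRmsNormQuant` is launched asynchronously, the traceback above may not point at the real failing call. As the log itself suggests, one way to get an accurate stacktrace is to rerun with synchronous op launch. A minimal sketch (the serve command is the one from this report; remember to unset the variable after debugging, since synchronous mode degrades performance):

```shell
# Force ACL ops to run synchronously so the Python stacktrace
# points at the actual failing operator (slower; debug only).
export ASCEND_LAUNCH_BLOCKING=1
echo "ASCEND_LAUNCH_BLOCKING=$ASCEND_LAUNCH_BLOCKING"

# Then rerun the same serve command and capture the new traceback:
#   vllm serve /home/Qwen3-8B-w8a8-MindIE/ --max-model-len 10240 --port 8000 -tp 4

# When done debugging:
#   unset ASCEND_LAUNCH_BLOCKING
```

Attaching the synchronous-mode stacktrace (and the CANN plog files under `~/ascend/log/`, if available) would make it easier to pinpoint which `aclnnAddRmsNormQuant` input triggers error 561000.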
