[Bug]: NPU function error: call aclnnAddRmsNormQuant failed, error code is 561000 #2387

@Pandaxia990

Description

Your current environment

NPU: Ascend 910B4 * 4

Model: Qwen3-32B-w8a8

Image: quay.io/ascend/vllm-ascend:v0.10.0rc1

Command: vllm serve /home/Qwen3-8B-w8a8-MindIE/ --max-model-len 10240 --port 8000 -tp 4

🐛 Describe the bug

INFO 08-15 02:21:19 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:21:19 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:21:19 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:21:19 [init.py:226] Platform plugin ascend is activated
WARNING 08-15 02:21:21 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 08-15 02:21:22 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 08-15 02:21:23 [registry.py:430] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 08-15 02:21:23 [registry.py:430] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 08-15 02:21:23 [registry.py:430] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 08-15 02:21:23 [registry.py:430] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 08-15 02:21:23 [registry.py:430] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 08-15 02:21:23 [registry.py:430] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 08-15 02:21:23 [registry.py:430] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
INFO 08-15 02:21:24 [api_server.py:1755] vLLM API server version 0.10.0
INFO 08-15 02:21:24 [cli_args.py:261] non-default args: {'model_tag': '/home/Qwen3-8B-w8a8-MindIE/', 'model': '/home/Qwen3-8B-w8a8-MindIE/', 'max_model_len': 10240, 'tensor_parallel_size': 4}
INFO 08-15 02:21:36 [config.py:1604] Using max model len 10240
WARNING 08-15 02:21:36 [config.py:1084] ascend quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 08-15 02:21:37 [config.py:2434] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 08-15 02:21:37 [platform.py:162] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
INFO 08-15 02:21:37 [utils.py:333] Calculated maximum supported batch sizes for ACL graph: 25
INFO 08-15 02:21:37 [utils.py:348] Adjusted ACL graph batch sizes for Qwen3ForCausalLM model (layers: 36): 67 → 25 sizes
INFO 08-15 02:21:46 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:21:46 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:21:46 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:21:46 [init.py:226] Platform plugin ascend is activated
WARNING 08-15 02:21:47 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 08-15 02:21:48 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 08-15 02:21:49 [core.py:572] Waiting for init message from front-end.
WARNING 08-15 02:21:49 [registry.py:430] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 08-15 02:21:49 [registry.py:430] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 08-15 02:21:49 [registry.py:430] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 08-15 02:21:49 [registry.py:430] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 08-15 02:21:49 [registry.py:430] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 08-15 02:21:49 [registry.py:430] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 08-15 02:21:49 [registry.py:430] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
INFO 08-15 02:21:49 [core.py:71] Initializing a V1 LLM engine (v0.10.0) with config: model='/home/Qwen3-8B-w8a8-MindIE/', speculative_config=None, tokenizer='/home/Qwen3-8B-w8a8-MindIE/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=10240, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=ascend, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/Qwen3-8B-w8a8-MindIE/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["all"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.unified_ascend_attention_with_output"],"use_inductor":false,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,488,464,448,424,400,384,360,336,312,288,272,248,224,208,184,160,136,112,96,72,48,32,8,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 08-15 02:21:49 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-15 02:21:49 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_0bff10d1'), local_subscribe_addr='ipc:///tmp/dcdeb8d0-8d1b-4684-8ad7-970b33a69de9', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 08-15 02:21:57 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:21:57 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:21:57 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:21:57 [init.py:226] Platform plugin ascend is activated
INFO 08-15 02:21:57 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:21:57 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:21:57 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:21:58 [init.py:226] Platform plugin ascend is activated
INFO 08-15 02:21:58 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:21:58 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:21:58 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:21:58 [init.py:226] Platform plugin ascend is activated
INFO 08-15 02:21:58 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:21:58 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:21:58 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:21:58 [init.py:226] Platform plugin ascend is activated
WARNING 08-15 02:21:59 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-15 02:21:59 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-15 02:21:59 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-15 02:21:59 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 08-15 02:22:00 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 08-15 02:22:00 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 08-15 02:22:00 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 08-15 02:22:00 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 08-15 02:22:00 [registry.py:430] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
WARNING 08-15 02:22:01 [registry.py:430] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 08-15 02:22:01 [registry.py:430] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 08-15 02:22:01 [registry.py:430] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 08-15 02:22:01 [registry.py:430] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 08-15 02:22:01 [registry.py:430] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 08-15 02:22:01 [registry.py:430] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 08-15 02:22:01 [registry.py:430] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
INFO 08-15 02:22:13 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:22:13 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:22:13 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:22:13 [init.py:226] Platform plugin ascend is activated
INFO 08-15 02:22:13 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:22:13 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:22:13 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:22:13 [init.py:226] Platform plugin ascend is activated
INFO 08-15 02:22:14 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:22:14 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:22:14 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:22:14 [init.py:226] Platform plugin ascend is activated
INFO 08-15 02:22:14 [init.py:38] Available plugins for group vllm.platform_plugins:
INFO 08-15 02:22:14 [init.py:40] - ascend -> vllm_ascend:register
INFO 08-15 02:22:14 [init.py:43] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 08-15 02:22:14 [init.py:226] Platform plugin ascend is activated
WARNING 08-15 02:22:15 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-15 02:22:15 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-15 02:22:15 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
WARNING 08-15 02:22:15 [_custom_ops.py:20] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
(VllmWorker rank=3 pid=14231) INFO 08-15 02:22:18 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_7f99a860'), local_subscribe_addr='ipc:///tmp/c423abbc-52dd-4c8e-83e4-955eb4e0cebe', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:18 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_fc1211b5'), local_subscribe_addr='ipc:///tmp/b8018fd6-05c9-4aa5-bf62-1437adbbfecc', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=2 pid=14230) INFO 08-15 02:22:18 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_58be7781'), local_subscribe_addr='ipc:///tmp/ebef15f5-db56-426b-95b2-e00df04d9740', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=14229) INFO 08-15 02:22:18 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_e5434b7e'), local_subscribe_addr='ipc:///tmp/d981a72a-8c6e-458f-83df-6ccd8d1543ee', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_718d42b0'), local_subscribe_addr='ipc:///tmp/ba08e10f-dadb-46aa-9268-358aca80cd5c', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:19 [parallel_state.py:1102] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker rank=1 pid=14229) INFO 08-15 02:22:19 [parallel_state.py:1102] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker rank=2 pid=14230) INFO 08-15 02:22:19 [parallel_state.py:1102] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
(VllmWorker rank=3 pid=14231) INFO 08-15 02:22:19 [parallel_state.py:1102] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
(VllmWorker rank=1 pid=14229) INFO 08-15 02:22:20 [model_runner_v1.py:2084] Starting to load model /home/Qwen3-8B-w8a8-MindIE/...
(VllmWorker rank=3 pid=14231) INFO 08-15 02:22:20 [model_runner_v1.py:2084] Starting to load model /home/Qwen3-8B-w8a8-MindIE/...
(VllmWorker rank=2 pid=14230) INFO 08-15 02:22:20 [model_runner_v1.py:2084] Starting to load model /home/Qwen3-8B-w8a8-MindIE/...
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:20 [model_runner_v1.py:2084] Starting to load model /home/Qwen3-8B-w8a8-MindIE/...
(VllmWorker rank=1 pid=14229) INFO 08-15 02:22:20 [quantizer.py:85] Using the vLLM Ascend Quantizer version now!
(VllmWorker rank=3 pid=14231) INFO 08-15 02:22:20 [quantizer.py:85] Using the vLLM Ascend Quantizer version now!
(VllmWorker rank=2 pid=14230) INFO 08-15 02:22:20 [quantizer.py:85] Using the vLLM Ascend Quantizer version now!
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:20 [quantizer.py:85] Using the vLLM Ascend Quantizer version now!
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:00<00:01, 1.14it/s]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:02<00:01, 1.10s/it]
(VllmWorker rank=3 pid=14231) INFO 08-15 02:22:24 [default_loader.py:262] Loading weights took 3.42 seconds
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:03<00:00, 1.20s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:03<00:00, 1.15s/it]
(VllmWorker rank=0 pid=14228)
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:24 [default_loader.py:262] Loading weights took 3.45 seconds
(VllmWorker rank=3 pid=14231) INFO 08-15 02:22:25 [model_runner_v1.py:2114] Loading model weights took 2.6379 GB
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:25 [model_runner_v1.py:2114] Loading model weights took 2.6379 GB
(VllmWorker rank=1 pid=14229) INFO 08-15 02:22:26 [default_loader.py:262] Loading weights took 4.94 seconds
(VllmWorker rank=2 pid=14230) INFO 08-15 02:22:26 [default_loader.py:262] Loading weights took 4.99 seconds
(VllmWorker rank=1 pid=14229) INFO 08-15 02:22:27 [model_runner_v1.py:2114] Loading model weights took 2.6379 GB
(VllmWorker rank=2 pid=14230) INFO 08-15 02:22:27 [model_runner_v1.py:2114] Loading model weights took 2.6379 GB
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:37 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/c249e2e960/rank_0_0/backbone for vLLM's torch.compile
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:37 [backends.py:541] Dynamo bytecode transform time: 10.02 s
(VllmWorker rank=3 pid=14231) INFO 08-15 02:22:38 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/c249e2e960/rank_3_0/backbone for vLLM's torch.compile
(VllmWorker rank=3 pid=14231) INFO 08-15 02:22:38 [backends.py:541] Dynamo bytecode transform time: 10.31 s
(VllmWorker rank=2 pid=14230) INFO 08-15 02:22:38 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/c249e2e960/rank_2_0/backbone for vLLM's torch.compile
(VllmWorker rank=2 pid=14230) INFO 08-15 02:22:38 [backends.py:541] Dynamo bytecode transform time: 10.56 s
(VllmWorker rank=1 pid=14229) INFO 08-15 02:22:39 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/c249e2e960/rank_1_0/backbone for vLLM's torch.compile
(VllmWorker rank=1 pid=14229) INFO 08-15 02:22:39 [backends.py:541] Dynamo bytecode transform time: 10.72 s
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:41 [backends.py:215] Compiling a graph for dynamic shape takes 2.19 s
(VllmWorker rank=3 pid=14231) INFO 08-15 02:22:41 [backends.py:215] Compiling a graph for dynamic shape takes 2.10 s
(VllmWorker rank=2 pid=14230) INFO 08-15 02:22:42 [backends.py:215] Compiling a graph for dynamic shape takes 2.10 s
(VllmWorker rank=1 pid=14229) INFO 08-15 02:22:42 [backends.py:215] Compiling a graph for dynamic shape takes 2.12 s
(VllmWorker rank=0 pid=14228) INFO 08-15 02:22:49 [monitor.py:34] torch.compile takes 12.21 s in total
[rank0]:[E815 02:22:49.024075817 compiler_depend.ts:429] operator():build/CMakeFiles/torch_npu.dir/compiler_depend.ts:3785 NPU function error: call aclnnAddRmsNormQuant failed, error code is 561000
[ERROR] 2025-08-15-02:22:49 (PID:14228, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
EZ9999: Inner Error!
EZ9999: [PID: 14228] 2025-08-15-02:22:49.996.317 Cannot find bin of op AddRmsNormQuant, integral key 0/1/|float16/ND/float16/ND/float16/ND/float16/ND/int8/ND/int8/ND/float16/ND/.
TraceBack (most recent call last):
Cannot find binary for op AddRmsNormQuant.
Kernel Run failed. opType: 31, AddRmsNormQuant
launch failed for AddRmsNormQuant, errno:561000.

Exception raised from operator() at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:3785 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xd4 (0xfffbf75c3ea4 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xe4 (0xfffbf7563e44 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: + 0x1b737dc (0xfffbe96e37dc in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: + 0x22887d4 (0xfffbe9df87d4 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: + 0x8fb170 (0xfffbe846b170 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: + 0x8fd504 (0xfffbe846d504 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #6: + 0x8f9e2c (0xfffbe8469e2c in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
frame #7: + 0xd31fc (0xfffbf73d31fc in /lib/aarch64-linux-gnu/libstdc++.so.6)
frame #8: + 0x7d5b8 (0xfffc036dd5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #9: + 0xe5edc (0xfffc03745edc in /lib/aarch64-linux-gnu/libc.so.6)

(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] WorkerProc hit an exception.
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] Traceback (most recent call last):
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 541, in worker_busy_loop
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] output = func(*args, **kwargs)
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 157, in determine_available_memory
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] self.model_runner.profile_run()
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2027, in profile_run
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] hidden_states = hidden_states[logit_indices]
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnAddRmsNormQuant.
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] [ERROR] 2025-08-15-02:22:50 (PID:14228, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546]
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] Traceback (most recent call last):
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 541, in worker_busy_loop
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] output = func(*args, **kwargs)
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 157, in determine_available_memory
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] self.model_runner.profile_run()
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 2027, in profile_run
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] hidden_states = hidden_states[logit_indices]
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnAddRmsNormQuant.
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546] [ERROR] 2025-08-15-02:22:50 (PID:14228, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546]
(VllmWorker rank=0 pid=14228) ERROR 08-15 02:22:50 [multiproc_executor.py:546]
ERROR 08-15 02:22:50 [core.py:632] EngineCore failed to start.
ERROR 08-15 02:22:50 [core.py:632] Traceback (most recent call last):
ERROR 08-15 02:22:50 [core.py:632] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 623, in run_engine_core
ERROR 08-15 02:22:50 [core.py:632] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 08-15 02:22:50 [core.py:632] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-15 02:22:50 [core.py:632] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 441, in init
ERROR 08-15 02:22:50 [core.py:632] super().init(vllm_config, executor_class, log_stats,
ERROR 08-15 02:22:50 [core.py:632] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 86, in init
ERROR 08-15 02:22:50 [core.py:632] self._initialize_kv_caches(vllm_config)
ERROR 08-15 02:22:50 [core.py:632] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 158, in _initialize_kv_caches
ERROR 08-15 02:22:50 [core.py:632] self.model_executor.determine_available_memory())
ERROR 08-15 02:22:50 [core.py:632] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-15 02:22:50 [core.py:632] File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
ERROR 08-15 02:22:50 [core.py:632] output = self.collective_rpc("determine_available_memory")
ERROR 08-15 02:22:50 [core.py:632] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-15 02:22:50 [core.py:632] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 237, in collective_rpc
ERROR 08-15 02:22:50 [core.py:632] result = get_response(w, dequeue_timeout)
ERROR 08-15 02:22:50 [core.py:632] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 08-15 02:22:50 [core.py:632] File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 224, in get_response
ERROR 08-15 02:22:50 [core.py:632] raise RuntimeError(
ERROR 08-15 02:22:50 [core.py:632] RuntimeError: Worker failed with error 'The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnAddRmsNormQuant.
ERROR 08-15 02:22:50 [core.py:632] Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
ERROR 08-15 02:22:50 [core.py:632] Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
ERROR 08-15 02:22:50 [core.py:632] [ERROR] 2025-08-15-02:22:50 (PID:14228, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
ERROR 08-15 02:22:50 [core.py:632] ', please check the stack trace above for the root cause
ERROR 08-15 02:23:00 [multiproc_executor.py:140] Worker proc VllmWorker-3 died unexpectedly, shutting down executor.
Process EngineCore_0:
Traceback (most recent call last):
File "/usr/local/python3.11.13/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/local/python3.11.13/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 636, in run_engine_core
raise e
File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 623, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 441, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 86, in __init__
self._initialize_kv_caches(vllm_config)
File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 158, in _initialize_kv_caches
self.model_executor.determine_available_memory())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
output = self.collective_rpc("determine_available_memory")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 237, in collective_rpc
result = get_response(w, dequeue_timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/v1/executor/multiproc_executor.py", line 224, in get_response
raise RuntimeError(
RuntimeError: Worker failed with error 'The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnAddRmsNormQuant.
Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
[ERROR] 2025-08-15-02:22:50 (PID:14228, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
', please check the stack trace above for the root cause
Traceback (most recent call last):
File "/usr/local/python3.11.13/bin/vllm", line 8, in <module>
sys.exit(main())
^^^^^^
File "/vllm-workspace/vllm/vllm/entrypoints/cli/main.py", line 54, in main
args.dispatch_function(args)
File "/vllm-workspace/vllm/vllm/entrypoints/cli/serve.py", line 52, in cmd
uvloop.run(run_server(args))
File "/usr/local/python3.11.13/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
return runner.run(wrapper())
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/python3.11.13/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/python3.11.13/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1791, in run_server
await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 1811, in run_server_worker
async with build_async_engine_client(args, client_config) as engine_client:
File "/usr/local/python3.11.13/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/local/python3.11.13/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 163, in from_vllm_config
return cls(
^^^^
File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 117, in __init__
self.engine_core = EngineCoreClient.make_async_mp_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 98, in make_async_mp_client
return AsyncMPClient(*client_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 677, in __init__
super().__init__(
File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 408, in __init__
with launch_core_engines(vllm_config, executor_class,
File "/usr/local/python3.11.13/lib/python3.11/contextlib.py", line 144, in __exit__
next(self.gen)
File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 697, in launch_core_engines
wait_for_engine_startup(
File "/vllm-workspace/vllm/vllm/v1/engine/utils.py", line 750, in wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
[ERROR] 2025-08-15-02:23:06 (PID:13821, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
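Because `aclnnAddRmsNormQuant` is launched asynchronously, the traceback above may not point at the real failing call. As the log itself suggests, one way to get an accurate stacktrace is to rerun with synchronous op launch. A minimal sketch (the serve command is the one from this report; remember to unset the variable after debugging, since synchronous mode degrades performance):

```shell
# Force ACL ops to run synchronously so the Python stacktrace
# points at the actual failing operator (slower; debug only).
export ASCEND_LAUNCH_BLOCKING=1
echo "ASCEND_LAUNCH_BLOCKING=$ASCEND_LAUNCH_BLOCKING"

# Then rerun the same serve command and capture the new traceback:
#   vllm serve /home/Qwen3-8B-w8a8-MindIE/ --max-model-len 10240 --port 8000 -tp 4

# When done debugging:
#   unset ASCEND_LAUNCH_BLOCKING
```

Attaching the synchronous-mode stacktrace (and the CANN plog files under `~/ascend/log/`, if available) would make it easier to pinpoint which `aclnnAddRmsNormQuant` input triggers error 561000.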
