
Conversation

@yiz-liu
Collaborator

@yiz-liu yiz-liu commented Jul 31, 2025

What this PR does / why we need it?

This PR refactors the MoE (Mixture of Experts) communication logic by introducing a strategy pattern. It defines an abstract base class, MoECommMethod, which encapsulates different communication strategies for MoE layers. By decoupling the MoE implementation from any single communication method, this change makes it simpler to add, replace, or optimize communication strategies in the future.
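For readers unfamiliar with the pattern, here is a minimal sketch of what such a strategy interface could look like; the method names and signatures below are illustrative assumptions, not the actual interface introduced by this PR.

```python
from abc import ABC, abstractmethod

import torch


class MoECommMethod(ABC):
    """Illustrative base class for MoE communication strategies."""

    @abstractmethod
    def prepare(self, hidden_states: torch.Tensor,
                topk_ids: torch.Tensor) -> torch.Tensor:
        """Dispatch tokens to the ranks that own the selected experts."""

    @abstractmethod
    def finalize(self, expert_output: torch.Tensor) -> torch.Tensor:
        """Combine expert outputs back into the original token order."""


class AllGatherImpl(MoECommMethod):
    """All-gather based strategy (sketch only)."""

    def prepare(self, hidden_states, topk_ids):
        # All-gather tokens across the expert-parallel group so each rank
        # can run its local experts over the full batch.
        raise NotImplementedError

    def finalize(self, expert_output):
        # Slice / reduce the combined output back to this rank's tokens.
        raise NotImplementedError
```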

Plan / Roadmap

  1. Introduce MoECommMethod, implement AllGatherImpl, and adapt ACL Graph handling to cover all scenarios (this PR).
  2. Implement MC2CommImpl and AllToAllCommImpl to optimize performance in specific scenarios.
  3. Enable W8A8 / Int8 models to use unified_fused_experts.

Other notes

  • Data-parallel (DP) communication currently does not work with vLLM's dispatch/combine mechanisms; an alternative approach is required to resolve this incompatibility.

Does this PR introduce any user-facing change?

None.

How was this patch tested?

Working on it.

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write a clear commit message and fill in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions

github-actions bot commented Aug 4, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@codecov

codecov bot commented Aug 7, 2025

Codecov Report

❌ Patch coverage is 24.39024% with 124 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.74%. Comparing base (1a70564) to head (7f2995e).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| vllm_ascend/distributed/moe_comm_method.py | 24.26% | 103 Missing ⚠️ |
| vllm_ascend/ops/fused_moe.py | 14.28% | 12 Missing ⚠️ |
| vllm_ascend/ascend_forward_context.py | 22.22% | 7 Missing ⚠️ |
| vllm_ascend/ops/common_fused_moe.py | 50.00% | 2 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (24.39%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2125      +/-   ##
==========================================
- Coverage   76.35%   75.74%   -0.61%     
==========================================
  Files         117      118       +1     
  Lines       13371    13525     +154     
==========================================
+ Hits        10209    10245      +36     
- Misses       3162     3280     +118     
| Flag | Coverage Δ |
|---|---|
| unittests | 75.74% <24.39%> (-0.61%) ⬇️ |

Flags with carried forward coverage won't be shown.


@Yikun added the accuracy-test label (enable all accuracy test for PR) on Aug 8, 2025
@yiz-liu changed the title from "[WIP][Feat] Support MoE models with ACL Graph and refactor MoE communication logic" to "[1/N][Feat] Support MoE models with ACL Graph and refactor MoE communication logic" on Aug 11, 2025
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Introduces a `MoECommMethod` abstract base class to encapsulate different communication strategies for Mixture of Experts layers. This change decouples the MoE implementation from the specific communication method.

Two initial strategies are provided:
- `AllGatherCommImpl`: A pure PyTorch implementation for expert parallel scenarios.
- `AllReduceCommImpl`: Utilizes NPU-specific ops for non-expert parallel cases.

The selection of the communication method is now determined at runtime based on the parallel configuration. This improves code organization and makes it easier to add or swap communication strategies in the future.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
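As a rough illustration of the runtime selection described in the commit above (the helper name and the exact criteria below are assumptions, not the code added by this commit):

```python
# Assumption: the two strategies live in the module shown in the coverage
# report; this selector is a sketch, not the actual vllm-ascend code.
from vllm_ascend.distributed.moe_comm_method import (AllGatherCommImpl,
                                                     AllReduceCommImpl)


def select_moe_comm_method(parallel_config):
    """Pick a MoE communication strategy from the parallel configuration."""
    if getattr(parallel_config, "enable_expert_parallel", False):
        return AllGatherCommImpl()  # pure-PyTorch path for expert parallelism
    return AllReduceCommImpl()      # NPU-specific ops for the non-EP case
```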
Introduces and enables the MC2 communication implementation for Mixture-of-Experts (MoE) on Ascend devices when expert parallelism is active.

This new method leverages platform-specific `npu_moe_distribute_dispatch` and `npu_moe_distribute_combine` operators to optimize communication and computation parallelism, improving performance. The implementation also adapts to different Ascend SoC versions and available features.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
This commit refactors and cleans up the Mixture-of-Experts (MoE) communication implementations for Ascend NPUs.

Key changes include:
- Renames `AllReduceCommImpl` to `AllGatherCommImpl` and updates its implementation to use `npu_moe_init_routing_v2` and `npu_moe_token_unpermute` for improved performance and correctness.
- Renames the original `AllGatherCommImpl` to `NativeAllGatherCommImpl` to clarify that it uses native PyTorch operations.
- Removes the `MC2CommImpl` and sets `AllGatherCommImpl` as the default MoE communication method.
- Adds workarounds in both `AllGatherCommImpl` and `NativeAllGatherCommImpl` to handle incorrect outputs from `npu_grouped_matmul` by zeroing out weights for invalid tokens.
- Improves documentation by adding detailed docstrings to abstract methods.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
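A simplified sketch of the zeroing workaround mentioned in the commit above; the helper name and tensor shapes are assumptions for illustration only.

```python
import torch


def mask_invalid_token_weights(topk_weights: torch.Tensor,
                               valid_token_mask: torch.Tensor) -> torch.Tensor:
    """Zero the routing weights of padded / invalid tokens.

    Rows produced by npu_grouped_matmul for invalid tokens may contain
    garbage, but multiplying them by a zero weight removes their
    contribution when the expert outputs are combined.
    """
    # topk_weights: (num_tokens, top_k); valid_token_mask: (num_tokens,)
    return torch.where(valid_token_mask.unsqueeze(-1), topk_weights,
                       torch.zeros_like(topk_weights))
```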
Adds data parallelism (DP) padding to ensure token tensors have a uniform shape across all DP ranks. This change mirrors the padding logic from the GPU model runner.

This alignment is necessary for features like ACL graphs that require consistent tensor shapes in distributed environments. The padding is calculated and applied before the model forward pass.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
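A minimal sketch of the padding computation described above, assuming a synchronous max-reduce over the DP group; the real model-runner code mirrors the GPU path and may differ in details.

```python
import torch
import torch.distributed as dist


def padded_num_tokens_across_dp(num_tokens: int, dp_group) -> int:
    """Return the token count every DP rank should pad to (sketch only)."""
    local = torch.tensor([num_tokens], dtype=torch.int64)
    # Take the maximum batch size across all DP ranks so tensor shapes match.
    dist.all_reduce(local, op=dist.ReduceOp.MAX, group=dp_group)
    return int(local.item())
```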
Pass the Hugging Face configuration object directly to the MoE communication method constructor. This allows the method to handle different attribute names for MoE parameters, such as `num_experts` and `n_routed_experts`.

This change improves robustness and makes the implementation more compatible with various MoE model configurations.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
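For example, a hedged sketch of how the constructor might read the expert count from different config flavors (the helper name and lookup order are assumptions):

```python
def get_expert_count(hf_config) -> int:
    """Read the expert count from a Hugging Face config object.

    Different MoE models expose it under different names, e.g.
    `num_experts` (Qwen3-MoE style) or `n_routed_experts` (DeepSeek style).
    """
    for name in ("num_experts", "n_routed_experts"):
        if hasattr(hf_config, name):
            return getattr(hf_config, name)
    raise AttributeError("no expert-count attribute found in the HF config")
```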
Enhances and adds docstrings across the MoE communication methods to improve clarity and provide more detailed explanations.

The docstring for `AllGatherCommImpl` is updated to reflect that it is now the default implementation and to explain a workaround for an accuracy issue.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Adds a multicard test that validates NPU-optimized MoE all-gather pre/post processing against a native reference.

Covers varied tokens, hidden sizes, global/local experts, top-k, dtypes, and expert-parallel ranks via mocked context/group.

Verifies expert token counts, permutation, and reconstruction within dtype-aware tolerances to guard against regressions in distributed kernels.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
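As an illustration of the dtype-aware tolerances mentioned above (the concrete bounds here are placeholders, not the values used in the test):

```python
import torch

# Placeholder tolerances: looser bounds for low-precision dtypes.
TOLERANCES = {
    torch.float32: dict(rtol=1e-5, atol=1e-6),
    torch.float16: dict(rtol=1e-3, atol=1e-4),
    torch.bfloat16: dict(rtol=2e-2, atol=2e-3),
}


def assert_close_for_dtype(actual: torch.Tensor,
                           expected: torch.Tensor) -> None:
    """Compare NPU kernel output against the native reference."""
    torch.testing.assert_close(actual, expected, **TOLERANCES[actual.dtype])
```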
Removes dead/commented paths in the MoE communication implementation and cleans up legacy chunking/gather remnants.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
@yiz-liu
Collaborator Author

yiz-liu commented Aug 12, 2025

@wangxiyuan @Yikun Because the test case depends on NPU-specific operations, unit testing is not feasible; only end-to-end (E2E) tests are available, thus the low coverage rate.

@yiz-liu force-pushed the refactor-moe branch 2 times, most recently from 7865127 to 4872a78, on August 12, 2025 11:45
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
@wangxiyuan merged commit 992271b into vllm-project:main on Aug 12, 2025
22 of 23 checks passed
@yiz-liu deleted the refactor-moe branch on August 13, 2025 01:23
@MengqingCao
Collaborator

Tried without the NaiveAll2AllManager, and it failed in the DP scenario: as expected, only one DP rank could process prompts:

python examples/offline_data_parallel.py \
        --model="Qwen/Qwen3-30B-A3B" \
        --dp-size=4 \
        --tp-size=1 \
        --enable-expert-parallel

Results

INFO 08-13 02:22:16 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 08-13 02:22:16 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 08-13 02:22:16 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 08-13 02:22:16 [__init__.py:232] Platform plugin ascend is activated
WARNING 08-13 02:22:18 [_custom_ops.py:20] Failed to import from vllm._C with ImportError('libnuma.so.1: cannot open shared object file: No such file or directory')
DP rank 0 needs to process 100 prompts
DP rank 1 needs to process 100 prompts
DP rank 2 needs to process 100 prompts
DP rank 3 needs to process 100 prompts
INFO 08-13 02:22:19 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 08-13 02:22:19 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 08-13 02:22:19 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 08-13 02:22:19 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
INFO 08-13 02:22:20 [utils.py:326] non-default args: {'model': '/home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B', 'enable_expert_parallel': True, 'disable_log_stats': True}
INFO 08-13 02:22:20 [utils.py:326] non-default args: {'model': '/home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B', 'enable_expert_parallel': True, 'disable_log_stats': True}
INFO 08-13 02:22:20 [utils.py:326] non-default args: {'model': '/home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B', 'enable_expert_parallel': True, 'disable_log_stats': True}
WARNING 08-13 02:22:20 [registry.py:454] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
WARNING 08-13 02:22:20 [registry.py:454] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
INFO 08-13 02:22:20 [utils.py:326] non-default args: {'model': '/home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B', 'enable_expert_parallel': True, 'disable_log_stats': True}
INFO 08-13 02:22:37 [__init__.py:702] Resolved architecture: Qwen3MoeForCausalLM
INFO 08-13 02:22:37 [__init__.py:1740] Using max model len 40960
INFO 08-13 02:22:38 [scheduler.py:237] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 08-13 02:22:38 [platform.py:162] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
INFO 08-13 02:22:38 [utils.py:337] Calculated maximum supported batch sizes for ACL graph: 39
INFO 08-13 02:22:38 [utils.py:352] Adjusted ACL graph batch sizes for Qwen3MoeForCausalLM model (layers: 48): 67 → 39 sizes
INFO 08-13 02:22:38 [__init__.py:702] Resolved architecture: Qwen3MoeForCausalLM
INFO 08-13 02:22:38 [__init__.py:1740] Using max model len 40960
INFO 08-13 02:22:38 [scheduler.py:237] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 08-13 02:22:38 [platform.py:162] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
INFO 08-13 02:22:38 [utils.py:337] Calculated maximum supported batch sizes for ACL graph: 39
INFO 08-13 02:22:38 [utils.py:352] Adjusted ACL graph batch sizes for Qwen3MoeForCausalLM model (layers: 48): 67 → 39 sizes
INFO 08-13 02:22:39 [__init__.py:702] Resolved architecture: Qwen3MoeForCausalLM
INFO 08-13 02:22:39 [__init__.py:1740] Using max model len 40960
INFO 08-13 02:22:39 [__init__.py:702] Resolved architecture: Qwen3MoeForCausalLM
INFO 08-13 02:22:39 [__init__.py:1740] Using max model len 40960
INFO 08-13 02:22:39 [scheduler.py:237] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 08-13 02:22:39 [platform.py:162] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
INFO 08-13 02:22:39 [utils.py:337] Calculated maximum supported batch sizes for ACL graph: 39
INFO 08-13 02:22:39 [utils.py:352] Adjusted ACL graph batch sizes for Qwen3MoeForCausalLM model (layers: 48): 67 → 39 sizes
INFO 08-13 02:22:39 [scheduler.py:237] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 08-13 02:22:39 [platform.py:162] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
INFO 08-13 02:22:39 [utils.py:337] Calculated maximum supported batch sizes for ACL graph: 39
INFO 08-13 02:22:39 [utils.py:352] Adjusted ACL graph batch sizes for Qwen3MoeForCausalLM model (layers: 48): 67 → 39 sizes
INFO 08-13 02:22:52 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 08-13 02:22:52 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 08-13 02:22:52 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 08-13 02:22:52 [__init__.py:232] Platform plugin ascend is activated
INFO 08-13 02:22:53 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 08-13 02:22:53 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 08-13 02:22:53 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 08-13 02:22:53 [__init__.py:232] Platform plugin ascend is activated
INFO 08-13 02:22:54 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 08-13 02:22:54 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 08-13 02:22:54 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 08-13 02:22:54 [__init__.py:232] Platform plugin ascend is activated
INFO 08-13 02:22:54 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 08-13 02:22:54 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 08-13 02:22:54 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 08-13 02:22:54 [__init__.py:232] Platform plugin ascend is activated
WARNING 08-13 02:22:55 [_custom_ops.py:20] Failed to import from vllm._C with ImportError('libnuma.so.1: cannot open shared object file: No such file or directory')
(EngineCore_1 pid=302528) INFO 08-13 02:22:55 [core.py:619] Waiting for init message from front-end.
WARNING 08-13 02:22:56 [_custom_ops.py:20] Failed to import from vllm._C with ImportError('libnuma.so.1: cannot open shared object file: No such file or directory')
(EngineCore_3 pid=302616) INFO 08-13 02:22:56 [core.py:619] Waiting for init message from front-end.
WARNING 08-13 02:22:56 [_custom_ops.py:20] Failed to import from vllm._C with ImportError('libnuma.so.1: cannot open shared object file: No such file or directory')
(EngineCore_0 pid=302624) INFO 08-13 02:22:56 [core.py:619] Waiting for init message from front-end.
WARNING 08-13 02:22:56 [_custom_ops.py:20] Failed to import from vllm._C with ImportError('libnuma.so.1: cannot open shared object file: No such file or directory')
(EngineCore_2 pid=302620) INFO 08-13 02:22:57 [core.py:619] Waiting for init message from front-end.
(EngineCore_1 pid=302528) INFO 08-13 02:22:57 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore_0 pid=302624) INFO 08-13 02:22:57 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore_2 pid=302620) INFO 08-13 02:22:57 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore_3 pid=302616) INFO 08-13 02:22:57 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore_1 pid=302528) WARNING 08-13 02:22:58 [registry.py:454] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
(EngineCore_1 pid=302528) WARNING 08-13 02:22:58 [registry.py:454] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
(EngineCore_1 pid=302528) WARNING 08-13 02:22:58 [registry.py:454] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
(EngineCore_1 pid=302528) WARNING 08-13 02:22:58 [registry.py:454] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
(EngineCore_1 pid=302528) WARNING 08-13 02:22:58 [registry.py:454] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
(EngineCore_1 pid=302528) WARNING 08-13 02:22:58 [registry.py:454] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
(EngineCore_1 pid=302528) WARNING 08-13 02:22:58 [registry.py:454] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
(EngineCore_1 pid=302528) INFO 08-13 02:22:58 [core.py:72] Initializing a V1 LLM engine (v0.9.2.dev301+g3c545c0c3) with config: model='/home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["all"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.unified_ascend_attention_with_output"],"use_inductor":false,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,496,488,472,456,440,432,416,400,384,376,360,344,328,320,304,288,272,264,248,232,224,208,192,176,168,152,136,120,112,96,80,64,56,40,24,8,4,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_2 pid=302620) WARNING 08-13 02:22:58 [registry.py:454] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
(EngineCore_2 pid=302620) WARNING 08-13 02:22:58 [registry.py:454] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
(EngineCore_2 pid=302620) WARNING 08-13 02:22:58 [registry.py:454] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
(EngineCore_2 pid=302620) WARNING 08-13 02:22:58 [registry.py:454] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
(EngineCore_2 pid=302620) WARNING 08-13 02:22:58 [registry.py:454] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
(EngineCore_2 pid=302620) WARNING 08-13 02:22:58 [registry.py:454] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
(EngineCore_2 pid=302620) WARNING 08-13 02:22:58 [registry.py:454] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
(EngineCore_2 pid=302620) INFO 08-13 02:22:58 [core.py:72] Initializing a V1 LLM engine (v0.9.2.dev301+g3c545c0c3) with config: model='/home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["all"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.unified_ascend_attention_with_output"],"use_inductor":false,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,496,488,472,456,440,432,416,400,384,376,360,344,328,320,304,288,272,264,248,232,224,208,192,176,168,152,136,120,112,96,80,64,56,40,24,8,4,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_0 pid=302624) WARNING 08-13 02:22:58 [registry.py:454] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
(EngineCore_0 pid=302624) WARNING 08-13 02:22:58 [registry.py:454] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
(EngineCore_0 pid=302624) WARNING 08-13 02:22:58 [registry.py:454] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
(EngineCore_0 pid=302624) WARNING 08-13 02:22:58 [registry.py:454] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
(EngineCore_0 pid=302624) WARNING 08-13 02:22:58 [registry.py:454] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
(EngineCore_0 pid=302624) WARNING 08-13 02:22:58 [registry.py:454] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
(EngineCore_0 pid=302624) WARNING 08-13 02:22:58 [registry.py:454] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
(EngineCore_0 pid=302624) INFO 08-13 02:22:58 [core.py:72] Initializing a V1 LLM engine (v0.9.2.dev301+g3c545c0c3) with config: model='/home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["all"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.unified_ascend_attention_with_output"],"use_inductor":false,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,496,488,472,456,440,432,416,400,384,376,360,344,328,320,304,288,272,264,248,232,224,208,192,176,168,152,136,120,112,96,80,64,56,40,24,8,4,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_3 pid=302616) WARNING 08-13 02:22:58 [registry.py:454] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
(EngineCore_3 pid=302616) WARNING 08-13 02:22:58 [registry.py:454] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
(EngineCore_3 pid=302616) WARNING 08-13 02:22:58 [registry.py:454] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
(EngineCore_3 pid=302616) WARNING 08-13 02:22:58 [registry.py:454] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
(EngineCore_3 pid=302616) WARNING 08-13 02:22:58 [registry.py:454] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3:CustomDeepseekV3ForCausalLM.
(EngineCore_3 pid=302616) WARNING 08-13 02:22:58 [registry.py:454] Model architecture Qwen3MoeForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_moe:CustomQwen3MoeForCausalLM.
(EngineCore_3 pid=302616) WARNING 08-13 02:22:58 [registry.py:454] Model architecture Qwen3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3:CustomQwen3ForCausalLM.
(EngineCore_3 pid=302616) INFO 08-13 02:22:58 [core.py:72] Initializing a V1 LLM engine (v0.9.2.dev301+g3c545c0c3) with config: model='/home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["all"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.unified_ascend_attention_with_output"],"use_inductor":false,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,496,488,472,456,440,432,416,400,384,376,360,344,328,320,304,288,272,264,248,232,224,208,192,176,168,152,136,120,112,96,80,64,56,40,24,8,4,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
INFO 08-13 02:23:17 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 08-13 02:23:17 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 08-13 02:23:17 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 08-13 02:23:17 [__init__.py:232] Platform plugin ascend is activated
INFO 08-13 02:23:17 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 08-13 02:23:17 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 08-13 02:23:17 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 08-13 02:23:17 [__init__.py:232] Platform plugin ascend is activated
INFO 08-13 02:23:17 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 08-13 02:23:17 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 08-13 02:23:17 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 08-13 02:23:17 [__init__.py:232] Platform plugin ascend is activated
INFO 08-13 02:23:18 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 08-13 02:23:18 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 08-13 02:23:18 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 08-13 02:23:18 [__init__.py:232] Platform plugin ascend is activated
WARNING 08-13 02:23:20 [_custom_ops.py:20] Failed to import from vllm._C with ImportError('libnuma.so.1: cannot open shared object file: No such file or directory')
WARNING 08-13 02:23:20 [_custom_ops.py:20] Failed to import from vllm._C with ImportError('libnuma.so.1: cannot open shared object file: No such file or directory')
WARNING 08-13 02:23:20 [_custom_ops.py:20] Failed to import from vllm._C with ImportError('libnuma.so.1: cannot open shared object file: No such file or directory')
WARNING 08-13 02:23:21 [_custom_ops.py:20] Failed to import from vllm._C with ImportError('libnuma.so.1: cannot open shared object file: No such file or directory')
(EngineCore_1 pid=302528) INFO 08-13 02:23:24 [parallel_state.py:992] Adjusting world_size=4 rank=1 distributed_init_method=tcp://127.0.0.1:59496 for DP
(EngineCore_2 pid=302620) INFO 08-13 02:23:24 [parallel_state.py:992] Adjusting world_size=4 rank=2 distributed_init_method=tcp://127.0.0.1:59496 for DP
(EngineCore_3 pid=302616) INFO 08-13 02:23:24 [parallel_state.py:992] Adjusting world_size=4 rank=3 distributed_init_method=tcp://127.0.0.1:59496 for DP
(EngineCore_0 pid=302624) INFO 08-13 02:23:25 [parallel_state.py:992] Adjusting world_size=4 rank=0 distributed_init_method=tcp://127.0.0.1:59496 for DP
(EngineCore_2 pid=302620) INFO 08-13 02:23:26 [parallel_state.py:1134] rank 2 in world size 4 is assigned as DP rank 2, PP rank 0, TP rank 0, EP rank 2
(EngineCore_1 pid=302528) INFO 08-13 02:23:26 [parallel_state.py:1134] rank 1 in world size 4 is assigned as DP rank 1, PP rank 0, TP rank 0, EP rank 1
(EngineCore_0 pid=302624) INFO 08-13 02:23:26 [parallel_state.py:1134] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_3 pid=302616) INFO 08-13 02:23:26 [parallel_state.py:1134] rank 3 in world size 4 is assigned as DP rank 3, PP rank 0, TP rank 0, EP rank 3
(EngineCore_3 pid=302616) INFO 08-13 02:23:26 [model_runner_v1.py:2097] Starting to load model /home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B...
(EngineCore_0 pid=302624) INFO 08-13 02:23:26 [model_runner_v1.py:2097] Starting to load model /home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B...
(EngineCore_1 pid=302528) INFO 08-13 02:23:26 [model_runner_v1.py:2097] Starting to load model /home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B...
(EngineCore_2 pid=302620) INFO 08-13 02:23:26 [model_runner_v1.py:2097] Starting to load model /home/xxx/cache/modelscope/models/Qwen/Qwen3-30B-A3B...
Loading safetensors checkpoint shards:   0% Completed | 0/16 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   6% Completed | 1/16 [00:00<00:09,  1.60it/s]
Loading safetensors checkpoint shards:  12% Completed | 2/16 [00:00<00:06,  2.32it/s]
Loading safetensors checkpoint shards:  19% Completed | 3/16 [00:01<00:06,  1.95it/s]
Loading safetensors checkpoint shards:  25% Completed | 4/16 [00:02<00:06,  1.76it/s]
Loading safetensors checkpoint shards:  31% Completed | 5/16 [00:02<00:06,  1.69it/s]
Loading safetensors checkpoint shards:  38% Completed | 6/16 [00:03<00:06,  1.63it/s]
Loading safetensors checkpoint shards:  44% Completed | 7/16 [00:04<00:05,  1.59it/s]
Loading safetensors checkpoint shards:  50% Completed | 8/16 [00:04<00:05,  1.54it/s]
Loading safetensors checkpoint shards:  56% Completed | 9/16 [00:05<00:04,  1.48it/s]
Loading safetensors checkpoint shards:  62% Completed | 10/16 [00:06<00:04,  1.46it/s]
Loading safetensors checkpoint shards:  69% Completed | 11/16 [00:06<00:03,  1.47it/s]
Loading safetensors checkpoint shards:  75% Completed | 12/16 [00:07<00:02,  1.49it/s]
Loading safetensors checkpoint shards:  81% Completed | 13/16 [00:08<00:01,  1.51it/s]
Loading safetensors checkpoint shards:  88% Completed | 14/16 [00:08<00:01,  1.52it/s]
Loading safetensors checkpoint shards:  94% Completed | 15/16 [00:09<00:00,  1.47it/s]
(EngineCore_1 pid=302528) INFO 08-13 02:23:37 [default_loader.py:262] Loading weights took 9.93 seconds
(EngineCore_3 pid=302616) INFO 08-13 02:23:37 [default_loader.py:262] Loading weights took 10.08 seconds
Loading safetensors checkpoint shards: 100% Completed | 16/16 [00:10<00:00,  1.44it/s]
Loading safetensors checkpoint shards: 100% Completed | 16/16 [00:10<00:00,  1.55it/s]
(EngineCore_0 pid=302624) 
(EngineCore_2 pid=302620) INFO 08-13 02:23:38 [default_loader.py:262] Loading weights took 10.22 seconds
(EngineCore_0 pid=302624) INFO 08-13 02:23:38 [default_loader.py:262] Loading weights took 10.44 seconds
(EngineCore_1 pid=302528) INFO 08-13 02:23:38 [model_runner_v1.py:2127] Loading model weights took 16.3816 GB
(EngineCore_3 pid=302616) INFO 08-13 02:23:38 [model_runner_v1.py:2127] Loading model weights took 16.3816 GB
(EngineCore_2 pid=302620) INFO 08-13 02:23:39 [model_runner_v1.py:2127] Loading model weights took 16.3816 GB
(EngineCore_0 pid=302624) INFO 08-13 02:23:39 [model_runner_v1.py:2127] Loading model weights took 16.3816 GB
(EngineCore_2 pid=302620) INFO 08-13 02:23:52 [backends.py:530] Using cache directory: /home/xxx/.cache/vllm/torch_compile_cache/549862fd88/rank_0_2/backbone for vLLM's torch.compile
(EngineCore_2 pid=302620) INFO 08-13 02:23:52 [backends.py:541] Dynamo bytecode transform time: 12.47 s
(EngineCore_3 pid=302616) INFO 08-13 02:23:52 [backends.py:530] Using cache directory: /home/xxx/.cache/vllm/torch_compile_cache/8de301bc72/rank_0_3/backbone for vLLM's torch.compile
(EngineCore_3 pid=302616) INFO 08-13 02:23:52 [backends.py:541] Dynamo bytecode transform time: 12.49 s
(EngineCore_0 pid=302624) INFO 08-13 02:23:52 [backends.py:530] Using cache directory: /home/xxx/.cache/vllm/torch_compile_cache/3c574c0f5d/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=302624) INFO 08-13 02:23:52 [backends.py:541] Dynamo bytecode transform time: 12.62 s
(EngineCore_1 pid=302528) INFO 08-13 02:23:53 [backends.py:530] Using cache directory: /home/xxx/.cache/vllm/torch_compile_cache/bb2847b99f/rank_0_1/backbone for vLLM's torch.compile
(EngineCore_1 pid=302528) INFO 08-13 02:23:53 [backends.py:541] Dynamo bytecode transform time: 12.84 s
(EngineCore_3 pid=302616) INFO 08-13 02:23:56 [backends.py:215] Compiling a graph for dynamic shape takes 3.37 s
(EngineCore_2 pid=302620) INFO 08-13 02:23:56 [backends.py:215] Compiling a graph for dynamic shape takes 3.40 s
(EngineCore_0 pid=302624) INFO 08-13 02:23:57 [backends.py:215] Compiling a graph for dynamic shape takes 3.48 s
(EngineCore_1 pid=302528) INFO 08-13 02:23:57 [backends.py:215] Compiling a graph for dynamic shape takes 3.53 s
(EngineCore_3 pid=302616) INFO 08-13 02:24:05 [monitor.py:34] torch.compile takes 15.86 s in total
(EngineCore_2 pid=302620) INFO 08-13 02:24:05 [monitor.py:34] torch.compile takes 15.87 s in total
(EngineCore_0 pid=302624) INFO 08-13 02:24:05 [monitor.py:34] torch.compile takes 16.10 s in total
(EngineCore_1 pid=302528) INFO 08-13 02:24:05 [monitor.py:34] torch.compile takes 16.36 s in total
(EngineCore_3 pid=302616) INFO 08-13 02:24:05 [worker_v1.py:186] Available memory: 39452653056, total memory: 65452113920
(EngineCore_3 pid=302616) INFO 08-13 02:24:05 [kv_cache_utils.py:829] GPU KV cache size: 401,280 tokens
(EngineCore_3 pid=302616) INFO 08-13 02:24:05 [kv_cache_utils.py:833] Maximum concurrency for 40,960 tokens per request: 9.80x
(EngineCore_2 pid=302620) INFO 08-13 02:24:06 [worker_v1.py:186] Available memory: 39460353536, total memory: 65452113920
(EngineCore_2 pid=302620) INFO 08-13 02:24:06 [kv_cache_utils.py:829] GPU KV cache size: 401,408 tokens
(EngineCore_2 pid=302620) INFO 08-13 02:24:06 [kv_cache_utils.py:833] Maximum concurrency for 40,960 tokens per request: 9.80x
(EngineCore_0 pid=302624) INFO 08-13 02:24:06 [worker_v1.py:186] Available memory: 39454647808, total memory: 65452113920
(EngineCore_0 pid=302624) INFO 08-13 02:24:06 [kv_cache_utils.py:829] GPU KV cache size: 401,280 tokens
(EngineCore_0 pid=302624) INFO 08-13 02:24:06 [kv_cache_utils.py:833] Maximum concurrency for 40,960 tokens per request: 9.80x
(EngineCore_1 pid=302528) INFO 08-13 02:24:06 [worker_v1.py:186] Available memory: 39447262720, total memory: 65452113920
(EngineCore_1 pid=302528) INFO 08-13 02:24:06 [kv_cache_utils.py:829] GPU KV cache size: 401,152 tokens
(EngineCore_1 pid=302528) INFO 08-13 02:24:06 [kv_cache_utils.py:833] Maximum concurrency for 40,960 tokens per request: 9.79x
(EngineCore_1 pid=302528) INFO 08-13 02:24:44 [model_runner_v1.py:2417] Graph capturing finished in 38 secs, took 0.45 GiB
(EngineCore_1 pid=302528) INFO 08-13 02:24:44 [core.py:199] init engine (profile, create kv cache, warmup model) took 65.88 seconds
(EngineCore_3 pid=302616) INFO 08-13 02:24:44 [model_runner_v1.py:2417] Graph capturing finished in 39 secs, took 0.45 GiB
(EngineCore_3 pid=302616) INFO 08-13 02:24:44 [core.py:199] init engine (profile, create kv cache, warmup model) took 65.76 seconds
(EngineCore_0 pid=302624) INFO 08-13 02:24:44 [model_runner_v1.py:2417] Graph capturing finished in 38 secs, took 0.45 GiB
(EngineCore_0 pid=302624) INFO 08-13 02:24:44 [core.py:199] init engine (profile, create kv cache, warmup model) took 65.21 seconds
(EngineCore_2 pid=302620) INFO 08-13 02:24:44 [model_runner_v1.py:2417] Graph capturing finished in 38 secs, took 0.45 GiB
(EngineCore_2 pid=302620) INFO 08-13 02:24:44 [core.py:199] init engine (profile, create kv cache, warmup model) took 65.28 seconds
(EngineCore_3 pid=302616) INFO 08-13 02:24:45 [platform.py:162] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
(EngineCore_3 pid=302616) INFO 08-13 02:24:45 [utils.py:337] Calculated maximum supported batch sizes for ACL graph: 39
(EngineCore_3 pid=302616) INFO 08-13 02:24:45 [utils.py:363] No adjustment needed for ACL graph batch sizes: Qwen3MoeForCausalLM model (layers: 48) with 39 sizes
(EngineCore_0 pid=302624) INFO 08-13 02:24:45 [platform.py:162] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
(EngineCore_0 pid=302624) INFO 08-13 02:24:45 [utils.py:337] Calculated maximum supported batch sizes for ACL graph: 39
(EngineCore_0 pid=302624) INFO 08-13 02:24:45 [utils.py:363] No adjustment needed for ACL graph batch sizes: Qwen3MoeForCausalLM model (layers: 48) with 39 sizes
(EngineCore_1 pid=302528) INFO 08-13 02:24:45 [platform.py:162] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
(EngineCore_1 pid=302528) INFO 08-13 02:24:45 [utils.py:337] Calculated maximum supported batch sizes for ACL graph: 39
(EngineCore_1 pid=302528) INFO 08-13 02:24:45 [utils.py:363] No adjustment needed for ACL graph batch sizes: Qwen3MoeForCausalLM model (layers: 48) with 39 sizes
(EngineCore_2 pid=302620) INFO 08-13 02:24:45 [platform.py:162] PIECEWISE compilation enabled on NPU. use_inductor not supported - using only ACL Graph mode
(EngineCore_2 pid=302620) INFO 08-13 02:24:45 [utils.py:337] Calculated maximum supported batch sizes for ACL graph: 39
(EngineCore_2 pid=302620) INFO 08-13 02:24:45 [utils.py:363] No adjustment needed for ACL graph batch sizes: Qwen3MoeForCausalLM model (layers: 48) with 39 sizes
INFO 08-13 02:24:45 [llm.py:294] Supported_tasks: ['generate']
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 1595.95it/s]
Processed prompts:   0%|                                                             | 0/100 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
INFO 08-13 02:24:45 [llm.py:294] Supported_tasks: ['generate']
Adding requests:   0%|                                                                                                                   | 0/100 [00:00<?, ?it/s]
INFO 08-13 02:24:45 [llm.py:294] Supported_tasks: ['generate']
Adding requests:   0%|                                                                                                                   | 0/100 [00:00<?, ?it/s]
INFO 08-13 02:24:45 [llm.py:294] Supported_tasks: ['generate']
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 1663.31it/s]
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 1619.46it/s]
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:00<00:00, 1559.81it/s]
Processed prompts: 100%|██████████████████████████████████████████████| 100/100 [00:01<00:00,  1.02it/s, est. speed input: 533.03 toks/s, output: 1550.60 toks/s]
Killing process 301804 that didn't stop within 5 minutes.
Killing process 301806 that didn't stop within 5 minutes.
Killing process 301808 that didn't stop within 5 minutes.

Csrayz added a commit to Csrayz/vllm-ascend that referenced this pull request Aug 13, 2025
* enable mm allreduce test (vllm-project#2192)

### What this PR does / why we need it?
This PR adds an e2e test for the npu_mm_all_reduce_base fusion kernel.
### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
not involved

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@5d5d419

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>

* [main] remove torch.cat and replace it by List[0] (vllm-project#2153)

### What this PR does / why we need it?
torch_npu.npu_grouped_matmul:

https://www.hiascend.com/document/detail/zh/Pytorch/710/apiref/torchnpuCustomsapi/context/torch_npu-npu_grouped_matmul.md

According to the documentation, when `split_item` is 2 or 3,
`torch_npu.npu_grouped_matmul` returns a list containing a single tensor.
Therefore, the `torch.cat` after `torch_npu.npu_grouped_matmul` is
unnecessary.
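In other words, a simplified before/after of the change (argument details elided; not the exact diff):

```python
# before: wrap the single-element list back into a tensor
#   out = torch.cat(torch_npu.npu_grouped_matmul(...), dim=0)

# after: with split_item 2 or 3 the op returns [tensor], so index it directly
#   out = torch_npu.npu_grouped_matmul(...)[0]
```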

### Does this PR introduce _any_ user-facing change?
not involved

### How was this patch tested?
ut and e2e covered: `tests/ut/ops/test_fused_ops.py`,
`tests/e2e/singlecard/ops/test_fused_moe.py`

**Performance** (Qwen3 30B, 2k -> 20k):
- base: Total Token throughput (tok/s): 667.76
- remove cat: Total Token throughput (tok/s): 680.82


- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@fa00c5d

Signed-off-by: huangxialu <huangxialu1@huawei.com>

* [CI][Quickfix] Fix AscendFusedMoE init error (vllm-project#2268)

### What this PR does / why we need it?
Fix the AscendFusedMoE init error. Use `super().__init__()` instead of
`super(FusedMoE, self).__init__()` so that the member variables defined in
the base class are properly initialized and accessible from the child class.
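A toy illustration of why the change matters; the class bodies below are stand-ins, not the real vLLM classes:

```python
class FusedMoE:                      # stand-in for the vLLM base layer
    def __init__(self):
        self.moe_state = {}          # member the child relies on


class AscendFusedMoE(FusedMoE):
    def __init__(self):
        # super().__init__() starts the MRO at AscendFusedMoE, so
        # FusedMoE.__init__ runs and `self.moe_state` exists.
        # super(FusedMoE, self).__init__() would start *above* FusedMoE,
        # skip its __init__, and leave `self.moe_state` undefined.
        super().__init__()
```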

### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with existing tests.


- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@766bc81

---------

Signed-off-by: MengqingCao <cmq0113@163.com>

* Fix accuracy test config and add DeepSeek-V2-Lite test (vllm-project#2261)

### What this PR does / why we need it?
This PR fix accuracy test related to
vllm-project#2073, users can now
perform accuracy tests on multiple models simultaneously and generate
different report files by running:

```bash
cd ~/vllm-ascend
pytest -sv ./tests/e2e/models/test_lm_eval_correctness.py \
          --config-list-file ./tests/e2e/models/configs/accuracy.txt
```

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
<img width="1648" height="511" alt="image"
src="https://github.com/user-attachments/assets/1757e3b8-a6b7-44e5-b701-80940dc756cd"
/>


- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@766bc81

---------

Signed-off-by: Icey <1790571317@qq.com>

* Fix accuracy test create PR (vllm-project#2274)

### What this PR does / why we need it?

Fix the PR-creation step of the accuracy test.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Local testing: nv-action/vllm-benchmarks#87

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@099c046

---------

Signed-off-by: Icey <1790571317@qq.com>

* Add ut for test_communicator.py (vllm-project#2293)

### What this PR does / why we need it?

Add ut for test_communicator.py 

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@e5ebeeb

Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>

* [CI] Fix broken CI (vllm-project#2302)

1. Disable the test_eagle_ccorrectness test; we'll re-enable it once the OOM
error is fixed.
2. Drop the transformers version limit for main, since vLLM relies on
>=4.55.0, see:
vllm-project/vllm@65552b4
3. Fix the kv_connector_output bug, see:
vllm-project/vllm@796bae0

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@d1af8b7

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [2/N][Refactor] torchair model runner refactor (vllm-project#2204)

There is a lot of torchair code in the model runner, which makes the code
hard to maintain. We'll create a new torchair_model_runner to split out the
torchair-related logic, following the workflow in vllm-project#2203.

What this PR does:

move the `torchair`-related logic into `_get_forward_metadata_across_dp` and
override it in the torchair model runner


- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@1b99028

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [core] Support capture custom ops into aclgraph (vllm-project#2113)

### What this PR does / why we need it?
Thanks to PR vllm-project#426,
vllm-ascend supports aclgraph inference to reduce host overhead. However,
the capability of aclgraph relies heavily on the functionality provided by
`torch.compile`, which is the key feature of torch 2.x. Therefore, capturing
a custom op into an aclgraph is only possible when the op can be recognized
and captured by `torch.compile`.

In this PR, we register meta implementations for the current custom ops to
enable fx graph capture. By doing so, inserting those custom ops into an
aclgraph becomes natural for the Ascend runtime.
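As a rough sketch of the general technique (registering a fake/meta implementation so `torch.compile` can trace a custom op); the op name and bodies below are placeholders, not the ops registered in this PR:

```python
import torch


@torch.library.custom_op("myplugin::rotary_embedding", mutates_args=())
def rotary_embedding(x: torch.Tensor) -> torch.Tensor:
    # Placeholder body; the real kernel would run on the NPU.
    return x.clone()


@rotary_embedding.register_fake
def _(x: torch.Tensor) -> torch.Tensor:
    # Shape/dtype propagation only, so the fx graph can be captured
    # without executing the device kernel at trace time.
    return torch.empty_like(x)
```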

### Does this PR introduce _any_ user-facing change?
No user-facing change.

### How was this patch tested?
Tested via unit test: we integrate the `rotary_embedding` op into a small
custom model and use `torch.compile` and aclgraph to capture and replay it
to verify its functionality.

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@1b99028

---------

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>

* Bump actions/download-artifact from 4 to 5 (vllm-project#2311)

Bumps
[actions/download-artifact](https://github.com/actions/download-artifact)
from 4 to 5.

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@ebf7605

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [Perf][MTP] Optimize reject sampler in greedy situation. (vllm-project#2137)

This PR ports the optimization from PR vllm-project#2002 to main and makes it cleaner.

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@afa5b7c

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>

* [3/N][Refactor] torchair model runner refactor  (vllm-project#2207)

There is a lot of torchair code in the model runner, which makes the code
hard to maintain. We'll create a new torchair_model_runner to split out the
torchair-related logic, following the workflow in vllm-project#2203.

What this PR does:

create the common functions `_build_attention_metadata` and
`_generate_dummy_run_hidden_states` for dummy_run

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@ebf7605

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [Feat] chunkprefill mla support torchair graph (vllm-project#1772)

Chunked-prefill MLA only supports eager mode now; we want to optimize it by
supporting the torchair graph. The idea is simple: when every request in the
batch is in decode, use the torchair graph; otherwise (chunked prefill or
prefill only), fall back to eager mode, as sketched below.
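
A minimal sketch of that dispatch rule (illustrative names, not the actual
attention metadata fields):

```python
def select_mla_exec_mode(num_prefill_tokens: int, torchair_graph_enabled: bool) -> str:
    # Decode-only batches can be replayed through the torchair graph; any batch
    # that still contains (chunked) prefill tokens falls back to eager mode.
    if torchair_graph_enabled and num_prefill_tokens == 0:
        return "torchair_graph"
    return "eager"
```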

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@ebf7605

Signed-off-by: haojiangzheng <justineric096@gmail.com>
Co-authored-by: haojiangzheng <justineric096@gmail.com>

* [4/N][Refactor] torchair model runner refactor (vllm-project#2208)

There is a lot of torchair code in the model runner, which makes the code hard
to maintain. We'll create a new torchair_model_runner to split out the
torchair-related logic, following the workflow in vllm-project#2203.

What this PR does:

Create a common function `_convert_torch_foramt` for initialize_kv_cache.


- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@14a5d90

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* Configure Gemini (vllm-project#2298)

### What this PR does / why we need it?
This PR requests Gemini AI to review PRs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
NA

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@14a5d90

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>

* ut: add ci guard for ut coverage (vllm-project#2317)

### What this PR does / why we need it?
Add a CI guard for UT coverage: if the UT coverage of a patch PR is below 80%,
the CI will fail. A sketch of the idea is shown below.
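
A minimal sketch of such a gate, assuming a Cobertura-style coverage.xml; the
actual guard in this repo is enforced through Codecov's patch-coverage target
rather than a script like this:

```python
import sys
import xml.etree.ElementTree as ET

def check_coverage(report_path: str = "coverage.xml", threshold: float = 80.0) -> None:
    # Cobertura reports expose overall line coverage as a fraction in [0, 1].
    line_rate = float(ET.parse(report_path).getroot().get("line-rate", "0")) * 100
    if line_rate < threshold:
        print(f"UT coverage {line_rate:.2f}% is below the {threshold:.0f}% gate")
        sys.exit(1)
    print(f"UT coverage {line_rate:.2f}% meets the {threshold:.0f}% gate")

if __name__ == "__main__":
    check_coverage(sys.argv[1] if len(sys.argv) > 1 else "coverage.xml")
```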

### Does this PR introduce _any_ user-facing change?
not involved

### How was this patch tested?
not involved

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@458e74e

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>

* [main][prefill optimization] Optimize parallel strategies to reduce communication overhead (vllm-project#2198)

### What this PR does / why we need it?
1. Shared expert sharding strategy update: switched from TP-aligned to
pure DP for shared experts, enabling more efficient execution.
2. O_Proj AllReduce → ReduceScatter: reduced communication overhead by
using ReduceScatter, made possible by pure-DP sharding (see the sketch
after this list).
3. AllGather postponed: delayed until after the QKV down projection to reduce
synchronization impact during prefill.
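
A minimal sketch of the o_proj collective change from item 2, using plain
torch.distributed collectives rather than the repo's HCCL wrappers and assuming
the token dimension divides evenly by the TP size:

```python
import torch
import torch.distributed as dist

def o_proj_allreduce(partial_out: torch.Tensor) -> torch.Tensor:
    # Old path: every TP rank ends up holding the full summed activation.
    dist.all_reduce(partial_out)
    return partial_out

def o_proj_reduce_scatter(partial_out: torch.Tensor, tp_size: int) -> torch.Tensor:
    # New path: each rank keeps only its 1/tp_size slice of the summed activation,
    # which is sufficient once the downstream shared experts run in pure DP.
    out = torch.empty(
        (partial_out.shape[0] // tp_size, *partial_out.shape[1:]),
        dtype=partial_out.dtype,
        device=partial_out.device,
    )
    dist.reduce_scatter_tensor(out, partial_out)
    return out
```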

### How was this patch tested?
Adding ut case in `tests/ut/attention/test_mla_v1.py`

#### How to run

use parameter `--additional_config='{"enable_shared_expert_dp": true}'`

##### a. How to run eager mode

e.g.:
python -m vllm.entrypoints.openai.api_server --model=/model_path
--trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002
--max-model-len 5120 --max-num-batched-tokens 16384 --enforce-eager
--disable-log-requests
--additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp":
true,"chunked_prefill_for_mla":true}'

##### b. How to run graph mode

e.g.:
python -m vllm.entrypoints.openai.api_server --model=/model_path
--trust-remote-code -tp 8 -dp 2 --enable_expert_parallel --port 8002
--max-model-len 5120 --max-num-batched-tokens 16384
--disable-log-requests
--additional_config='{"ascend_scheduler_config":{"enabled":true},"enable_shared_expert_dp":
true,"chunked_prefill_for_mla":true,"torchair_graph_config":{"enabled":true}}'


- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@9edd1db

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: SlightwindSec <slightwindsec@gmail.com>

* [Doc] Update faq (vllm-project#2334)

### What this PR does / why we need it?
  - update deterministic calculation
  - update supported devices

### Does this PR introduce _any_ user-facing change?
- Users should update ray and protobuf when using ray as the distributed
backend
- Users should switch to `export HCCL_DETERMINISTIC=true` when
enabling deterministic calculation

### How was this patch tested?
N/A

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@ea1292a

Signed-off-by: MengqingCao <cmq0113@163.com>

* [5/N][Refactor] torchair model runner refactor (vllm-project#2216)

There is a lot of torchair code in the model runner, which makes the code hard
to maintain. We'll create a new torchair_model_runner to split out the
torchair-related logic, following the workflow in vllm-project#2203.

What this PR does:

Create a common function `_capture_model` for capture_model.

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@1891a26

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [1/N][Feat] Support MoE models with ACL Graph and refactor MoE communication logic (vllm-project#2125)

### What this PR does / why we need it?
This PR refactors the MoE (Mixture of Experts) communication logic by
introducing a strategy pattern. It defines an abstract base class,
`MoECommMethod`, which encapsulates different communication strategies
for MoE layers. By decoupling the MoE implementation from any single
communication method, this change makes it simpler to add, replace, or
optimize communication strategies in the future.
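
The sketch below illustrates the strategy pattern at a high level; the
`prepare`/`finalize` method names and the `AllGatherImpl` details are
assumptions for illustration, not the actual `MoECommMethod` interface.

```python
from abc import ABC, abstractmethod

import torch
import torch.distributed as dist

class MoECommMethod(ABC):
    """One interchangeable way of moving tokens across ranks around a MoE layer."""

    @abstractmethod
    def prepare(self, hidden_states: torch.Tensor) -> torch.Tensor:
        """Dispatch/gather tokens before the expert computation."""

    @abstractmethod
    def finalize(self, expert_output: torch.Tensor) -> torch.Tensor:
        """Combine expert outputs back onto the local rank."""

class AllGatherImpl(MoECommMethod):
    def __init__(self, world_size: int, rank: int):
        self.world_size = world_size
        self.rank = rank

    def prepare(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Every rank sees the full batch before tokens are routed to experts.
        gathered = [torch.empty_like(hidden_states) for _ in range(self.world_size)]
        dist.all_gather(gathered, hidden_states)
        return torch.cat(gathered, dim=0)

    def finalize(self, expert_output: torch.Tensor) -> torch.Tensor:
        # Keep only this rank's slice of the combined expert output.
        chunk = expert_output.shape[0] // self.world_size
        return expert_output[self.rank * chunk:(self.rank + 1) * chunk]
```

A fused-MoE layer would then hold one `MoECommMethod` instance and call
`prepare`/`finalize` around the expert kernels, so later strategies such as
`MC2CommImpl` or `AllToAllCommImpl` only change which implementation gets
constructed.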

Plan / Roadmap

1. Introduce `MoECommMethod`, implement `AllGatherImpl`, and adapt ACL
Graph handling to cover all scenarios (this PR).
2. Implement `MC2CommImpl` and `AllToAllCommImpl` to optimize
performance in specific scenarios.
3. Enable W8A8 / Int8 models to use `unified_fused_experts`.

Other notes

* Data-parallel (DP) communication currently does not work with vLLM's
dispatch/combine mechanisms; an alternative approach is required to
resolve this incompatibility.

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@f7ad6a1

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>

* [Doc] Add container image save/load FAQ for offline environments (vllm-project#2347)

### What this PR does / why we need it?

Add Docker export/import guide for air-gapped environments

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

NA

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@d16aa3d

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>

* [Bugfix] fix the oom when chunkprefill with long context like 64k (vllm-project#2319)

The attention mask was declared in mla.py; we don't need the splitfuse
mask for MLA chunked prefill, and this mask causes memory problems with
long contexts such as 64k or 128k.

- vLLM version: v0.10.0
- vLLM main:
vllm-project/vllm@14a5d90

---------

Signed-off-by: haojiangzheng <justineric096@gmail.com>

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
Signed-off-by: huangxialu <huangxialu1@huawei.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: whx-sjtu <2952154980@qq.com>
Signed-off-by: haojiangzheng <justineric096@gmail.com>
Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Ronald1995 <ronaldautomobile@163.com>
Co-authored-by: huangxialu <huangxialu1@huawei.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Icey <1790571317@qq.com>
Co-authored-by: yangqinghao-cmss <yangqinghao_yewu@cmss.chinamobile.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Pleaplusone <pleaplusone.gy@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: whx <56632993+whx-sjtu@users.noreply.github.com>
Co-authored-by: zhenghaojiang <zhjoneson@163.com>
Co-authored-by: haojiangzheng <justineric096@gmail.com>
Co-authored-by: jack <QwertyJack@users.noreply.github.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: SlightwindSec <slightwindsec@gmail.com>
Co-authored-by: yiz-liu <136800916+yiz-liu@users.noreply.github.com>