
Conversation

@ZhengWG ZhengWG commented Jun 14, 2025

What this PR does / why we need it?

  • Fixed issue #122: support expert mapping with redundant experts

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tests pass when passing an expert_map with redundant_experts == 16:

export VLLM_ENABLE_MC2=1
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1

source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh

export ASCEND_LAUNCH_BLOCKING=0
export VLLM_VERSION=0.9.0
MODEL_PATH=DeepSeek-R1-W8A8-VLLM
python -m vllm.entrypoints.openai.api_server --model=$MODEL_PATH \
    --load-format=prefetch_auto \
    --quantization ascend \
    --served-model-name auto \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=8 \
    -dp=2 \
    --enable-expert-parallel \
    --max-num-seqs 24 \
    --max-model-len 2048 \
    --max-num-batched-tokens 2048 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --additional-config '{"torchair_graph_config":{"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true}, "expert_tensor_parallel_size":1, "expert_map_path": "delta_gsm8k_temp0.0_16_16.json"}' \
    --gpu-memory-utilization 0.90

TODO:

  • [☑️] Support EPLB with no redundant experts
  • [☑️] Support EPLB with redundant experts
  • [☑️] Add e2e unit test
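For illustration, an expert map with redundant experts could look roughly like the sketch below. The field names (moe_layer_count, layer_list, device_expert, …) are assumptions for illustration only; the actual schema expected by expert_map_path is defined by vllm-ascend.

```python
# Hypothetical expert-map layout: field names are assumptions, not the
# authoritative vllm-ascend schema. One MoE layer, 4 logical experts,
# 2 redundant copies spread across 2 devices.
expert_map = {
    "moe_layer_count": 1,
    "layer_list": [
        {
            "layer_id": 0,
            "device_count": 2,
            "device_list": [
                {"device_id": 0, "device_expert": [0, 1, 2]},
                # experts 2 and 0 below are redundant copies
                {"device_id": 1, "device_expert": [2, 3, 0]},
            ],
        }
    ],
}

# Sanity checks: every logical expert is placed at least once, and the
# total physical slot count exceeds the logical count by the redundancy.
placed = [e for dev in expert_map["layer_list"][0]["device_list"]
          for e in dev["device_expert"]]
print(sorted(set(placed)))  # [0, 1, 2, 3]
print(len(placed))          # 6 physical slots for 4 logical experts
```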


ZhengWG commented Jun 14, 2025

@wangxiyuan, can you help review it?

@jianzs jianzs left a comment


LGTM

Comment on lines -1095 to +1219
-    local_num_experts = torch.sum(self.expert_map != -1) \
-        if self.expert_map is not None else num_experts
+    if self.log2phy is not None:
+        local_num_experts = self.local_num_experts
+    else:
+        local_num_experts = torch.sum(self.expert_map != -1) \
+            if self.expert_map is not None else num_experts
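To make the branch above concrete: when there is no log2phy table, the local expert count is just the number of non-(-1) entries in the per-rank expert map. A minimal plain-Python sketch of that computation (illustrative data, not the PR's actual code, which uses torch.sum):

```python
# expert_map: index = global expert id, value = local slot id, -1 if the
# expert is not hosted on this rank (mirrors torch.sum(expert_map != -1)).
expert_map = [-1, -1, 0, 1, -1, 2, -1, -1]
num_experts = len(expert_map)

local_num_experts = (sum(1 for v in expert_map if v != -1)
                     if expert_map is not None else num_experts)
print(local_num_experts)  # 3
```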
A Collaborator commented:

Why not just use the self.local_num_experts value when log2phy is None? It's already set by determine_expert_map.

self.local_num_experts, self.expert_map = determine_expert_map(
self.ep_size,
get_ep_group().rank_in_group, self.global_num_experts)
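For reference, a simplified stand-in for determine_expert_map that splits the global experts into even, contiguous slices per EP rank (an assumption about its behavior for illustration; the real helper lives in vLLM and handles more cases):

```python
def determine_expert_map(ep_size, ep_rank, global_num_experts):
    """Sketch: give each EP rank a contiguous, even slice of experts and
    return (local_num_experts, expert_map). Assumes an even split."""
    local_num_experts = global_num_experts // ep_size
    expert_map = [-1] * global_num_experts          # -1: not on this rank
    start = ep_rank * local_num_experts
    for local_id in range(local_num_experts):
        expert_map[start + local_id] = local_id     # global id -> local slot
    return local_num_experts, expert_map

n, emap = determine_expert_map(ep_size=4, ep_rank=1, global_num_experts=8)
print(n)     # 2
print(emap)  # [-1, -1, 0, 1, -1, -1, -1, -1]
```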

ZhengWG (Contributor Author) replied:

Yes, it should return the same value. The current implementation intentionally preserves the original logic.


wangxiyuan commented Jun 16, 2025

You should add an e2e test for the EPLB case. I noticed there is a PR for EPLB tests (#1186); can you combine them to make sure the feature works as expected?

@songshanhu07 (Contributor) commented:

You need to check whether there is a rank with duplicate expert numbers in the JSON file. This may be the reason for your runtime error. The code changes you merged don't seem to make much sense.


ZhengWG commented Jun 16, 2025

> You need to check whether there is a rank with duplicate expert numbers in the JSON file. This may be the reason for your runtime error. The code changes you merged don't seem to make much sense.

Because when num_redundant_experts > 0, multiple experts with the same logical number might be loaded onto a single rank.
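A minimal illustration of that point with hypothetical data: redundancy makes the logical-to-physical mapping one-to-many, so a rank can legitimately host two physical copies of the same logical expert.

```python
# Hypothetical log2phy table: logical expert id -> list of physical slot ids.
# With num_redundant_experts > 0 this mapping is one-to-many.
log2phy = {0: [0, 5], 1: [1], 2: [2], 3: [3, 4]}

# Physical slots assigned to one rank (hypothetical placement).
rank_slots = {0, 3, 4}

# Logical experts served by this rank: logical expert 3 appears twice,
# because both of its physical copies (slots 3 and 4) landed here.
hosted = [lg for lg, phys in log2phy.items() for p in phys if p in rank_slots]
print(hosted)  # [0, 3, 3]
```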


ZhengWG commented Jun 16, 2025

> You should add an e2e test for the EPLB case. I noticed there is a PR for EPLB tests (#1186); can you combine them to make sure the feature works as expected?

OK, I will add it soon.

@ZhengWG ZhengWG force-pushed the eplb-fix-redunt branch 2 times, most recently from 09051ef to 689f6ed Compare June 24, 2025 06:09

ZhengWG commented Jun 24, 2025

Hi @wangxiyuan,

I've added the E2E test and verified it locally. Could you please review the changes when you have time?

Let me know if you have any questions or suggestions.

Thanks in advance!

@ZhengWG ZhengWG force-pushed the eplb-fix-redunt branch 3 times, most recently from ff5ae37 to eb29669 Compare June 24, 2025 06:31
@ZhengWG ZhengWG changed the title from "[EPLB]: Correct local expert number calculation with redundant experts" to "[EPLB]: Correct local expert number calculation with redundant experts && add e2e test" Jun 24, 2025

codecov bot commented Jun 24, 2025

Codecov Report

❌ Patch coverage is 25.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 50.73%. Comparing base (c30ddb8) to head (d74bd9f).
⚠️ Report is 304 commits behind head on main.

Files with missing lines        Patch %   Lines
vllm_ascend/ops/fused_moe.py    0.00%     3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1223       +/-   ##
===========================================
+ Coverage   27.39%   50.73%   +23.34%     
===========================================
  Files          56       77       +21     
  Lines        6191     9413     +3222     
===========================================
+ Hits         1696     4776     +3080     
- Misses       4495     4637      +142     
Flag Coverage Δ
unittests 50.73% <25.00%> (+23.34%) ⬆️



Yikun commented Jun 24, 2025

Is it ready to go? Please do a rebase.


ZhengWG commented Jun 25, 2025

> Is it ready to go? Please do a rebase.

It's ready now~ @Yikun @wangxiyuan

export VLLM_ENABLE_MC2=1
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export VLLM_VERSION=0.9.1
A Collaborator commented:

Please remove this.

Suggested change: delete the line
export VLLM_VERSION=0.9.1

def build_expert_map(expert_map_path,
                     num_redundant_expert=0,
                     num_layer=58,
                     num_device=16,
A Collaborator commented:

Actually, there are only 4 cards on CI now; please reduce num_device to make it work.
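Shrinking the test to 4 cards just redistributes the same physical experts over fewer devices. A quick sanity check, assuming DeepSeek-R1's 256 routed experts per MoE layer and the 16 redundant experts used above (numbers are assumptions about the test setup, not CI requirements):

```python
def experts_per_device(num_logical, num_redundant, num_device):
    """Physical experts (logical + redundant) split evenly per device.
    Assumes the total divides evenly, as the EPLB layouts here do."""
    total_physical = num_logical + num_redundant
    assert total_physical % num_device == 0, "uneven expert split"
    return total_physical // num_device

print(experts_per_device(256, 16, 16))  # 17 physical experts per device
print(experts_per_device(256, 16, 4))   # 68 physical experts per device
```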


@ZhengWG ZhengWG force-pushed the eplb-fix-redunt branch 3 times, most recently from 11eb86d to 0946edf Compare July 3, 2025 02:35
@github-actions github-actions bot added documentation Improvements or additions to documentation ci/build module:quantization merge-conflicts labels Jul 3, 2025

github-actions bot commented Jul 3, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

ZhengWG added 6 commits July 3, 2025 22:13
Signed-off-by: ZhengWG <zwg0606@gmail.com>
@ZhengWG ZhengWG force-pushed the eplb-fix-redunt branch from d9f76dd to d74bd9f Compare July 3, 2025 14:13
@github-actions github-actions bot removed documentation Improvements or additions to documentation ci/build module:quantization labels Jul 3, 2025

ZhengWG commented Jul 4, 2025

The same e2e test passes in my local environment but fails on CI. The root cause appears to be a CANN version mismatch affecting EP parallel execution. @MengqingCao, can you help check it? Here is my local env info:
[screenshot of local environment info]

github-actions bot commented:

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@wangxiyuan (Collaborator) commented:

EPLB will be refactored; let's close this now.

@wangxiyuan wangxiyuan closed this Aug 18, 2025

6 participants