[EPLB]: Correct local expert number calculation with redundant experts && add e2e test #1223
Conversation
|
@wangxiyuan can you help review it~ |
LGTM
- local_num_experts = torch.sum(self.expert_map != -1) \
-     if self.expert_map is not None else num_experts
+ if self.log2phy is not None:
+     local_num_experts = self.local_num_experts
+ else:
+     local_num_experts = torch.sum(self.expert_map != -1) \
+         if self.expert_map is not None else num_experts
Why not just use the self.local_num_experts value when log2phy is None? It's already set by determine_expert_map.
vllm-ascend/vllm_ascend/ops/fused_moe.py, lines 1065 to 1067 in fe0da59:
    self.local_num_experts, self.expert_map = determine_expert_map(
        self.ep_size,
        get_ep_group().rank_in_group, self.global_num_experts)
Yes, it should return the same value. The current implementation intentionally preserves the original logic.
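For reference, a rough sketch of the partitioning idea (written for illustration; an assumed approximation, not the actual vLLM determine_expert_map implementation), showing why the two values agree when there is no redundancy:

# Sketch: split the global experts across EP ranks and build a
# logical-id -> local-slot map with -1 for experts absent on this rank.
import torch

def sketch_determine_expert_map(ep_size: int, ep_rank: int, global_num_experts: int):
    base = global_num_experts // ep_size
    # last rank picks up the remainder (assumed policy for this sketch)
    local_num_experts = base if ep_rank < ep_size - 1 else global_num_experts - base * (ep_size - 1)
    start = ep_rank * base
    expert_map = torch.full((global_num_experts,), -1, dtype=torch.int32)
    expert_map[start:start + local_num_experts] = torch.arange(local_num_experts, dtype=torch.int32)
    return local_num_experts, expert_map

local_num, emap = sketch_determine_expert_map(ep_size=4, ep_rank=3, global_num_experts=64)
assert local_num == int(torch.sum(emap != -1))  # both ways of counting agree without redundancy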
|
You should add an e2e test for the EPLB case. I notice there is already a PR for an EPLB test (#1186); can you combine them to make sure the feature works as expected? |
|
You need to check whether any rank in the JSON file holds the same expert number more than once. This may be the reason for your runtime error. The code changes you merged don't seem to make much sense. |
Because when num_redundant_experts > 0, multiple experts with the same logical number might be loaded onto a single rank; see the sketch below. |
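A minimal illustration of the mismatch (hypothetical tensors, not the project's real state):

# With num_redundant_experts > 0, a rank can hold two physical copies of the
# same logical expert, so torch.sum(expert_map != -1), which counts logical
# experts present on the rank, no longer equals the physical expert count.
import torch

physical_experts_on_rank = [3, 7, 7]      # logical expert 7 is loaded twice (redundant copy)
expert_map = torch.full((8,), -1)         # logical id -> local slot, -1 if not on this rank
expert_map[3], expert_map[7] = 0, 1

print(int(torch.sum(expert_map != -1)))   # 2 logical experts
print(len(physical_experts_on_rank))      # 3 physical experts, so rely on self.local_num_experts instead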
OK, I will add it soon. |
Force-pushed from 09051ef to 689f6ed
|
Hi @wangxiyuan, I've added the E2E test and verified it locally. Could you please review the changes when you have time? Let me know if you have any questions or suggestions. Thanks in advance! |
Force-pushed from ff5ae37 to eb29669
Codecov Report
❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff            @@
##             main    #1223       +/-  ##
===========================================
+ Coverage   27.39%   50.73%   +23.34%
===========================================
  Files          56       77       +21
  Lines        6191     9413     +3222
===========================================
+ Hits         1696     4776     +3080
- Misses       4495     4637      +142
|
|
Is it ready to go? Please do a rebase. |
It's ready now~ @Yikun @wangxiyuan |
tests/e2e/run_eplb.sh (outdated)
export VLLM_ENABLE_MC2=1
export VLLM_USE_V1=1
export TASK_QUEUE_ENABLE=1
export VLLM_VERSION=0.9.1
please remove this
export VLLM_VERSION=0.9.1
tests/e2e/eplb/test_eplb_e2e.py (outdated)
def build_expert_map(expert_map_path,
                     num_redundant_expert=0,
                     num_layer=58,
                     num_device=16,
Actually, there are only 4 cards on CI now; please reduce num_device to make it work.
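A self-contained sketch of what a reduced placement could look like (the helper name, expert counts, and layout here are assumptions for illustration, not the PR's build_expert_map or the project's JSON schema):

# Spread the logical experts plus the redundant copies over num_device=4
# ranks so the e2e test fits the 4-card CI machines.
def build_expert_placement(num_experts: int = 64, num_redundant: int = 4, num_device: int = 4):
    physical = list(range(num_experts)) + list(range(num_redundant))  # redundant copies reuse logical ids
    per_device = len(physical) // num_device
    return {d: physical[d * per_device:(d + 1) * per_device] for d in range(num_device)}

placement = build_expert_placement()
assert sum(len(ids) for ids in placement.values()) == 64 + 4   # every physical expert is placed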
|
Please add |
Force-pushed from 11eb86d to 0946edf
|
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
Signed-off-by: ZhengWG <zwg0606@gmail.com>
|
The same e2e test passes successfully in my local environment but fails on CI. The root cause appears to be a CANN version mismatch affecting EP parallel execution. @MengqingCao, can you help check it? Here is my local env info: |
|
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
|
EPLB will be refactored; let's close this now. |

What this PR does / why we need it?
Corrects the local expert number calculation when redundant experts are enabled (i.e. when log2phy is set) and adds an e2e test for EPLB.
Does this PR introduce any user-facing change?
No
How was this patch tested?
The test passed when given an expert_map with redundant_experts == 16:
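For illustration, a hypothetical sanity check (the file name and JSON layout are assumptions, not the project's actual expert_map schema) that the map really carries 16 redundant experts:

import json

# Assumed layout: {"rank_id": [logical expert ids], ...}; the real format may differ.
with open("expert_map.json") as f:            # hypothetical file name
    rank_to_experts = json.load(f)

all_placements = [eid for ids in rank_to_experts.values() for eid in ids]
num_redundant = len(all_placements) - len(set(all_placements))
assert num_redundant == 16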
TODO: