minor refactor over EP #1854
Conversation
wwwjn left a comment
Nice refactor!
    1 and 2 are needed only when expert_parallel_degree > 1.
    3 is needed even for single-device computation.
    2 can be moved to ExpertParallel _token_dispatch if not coupled with 3.
    In order to use torch._grouped_mm, we need to make sure the number of …
nit: This description only talks about padding; it doesn't mention the generate_permute_indices kernel that permutes the inputs to be ordered by local experts.
This wrapper is now only responsible for padding, when EP is not used. I renamed it to make that clearer.
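For readers following along, here is a minimal sketch of what a padding-only step could look like. It is not the PR's actual implementation (torchtitan uses the generate_permute_indices kernel), and `ALIGN_SIZE_M` and `pad_tokens_per_expert` are assumed names for illustration only:

```python
import torch

ALIGN_SIZE_M = 16  # assumed alignment required by the grouped GEMM; not taken from this PR


def pad_tokens_per_expert(
    x: torch.Tensor,                      # (num_tokens, dim), already ordered by expert
    num_tokens_per_expert: torch.Tensor,  # (num_experts,)
) -> tuple[torch.Tensor, torch.Tensor]:
    """Pad each expert's token group up to a multiple of ALIGN_SIZE_M."""
    padded_sizes = (
        (num_tokens_per_expert + ALIGN_SIZE_M - 1) // ALIGN_SIZE_M * ALIGN_SIZE_M
    )
    out = x.new_zeros(int(padded_sizes.sum().item()), x.shape[-1])
    src, dst = 0, 0
    for n, n_pad in zip(num_tokens_per_expert.tolist(), padded_sizes.tolist()):
        out[dst : dst + n] = x[src : src + n]  # copy real tokens, leave the rest as zero padding
        src += n
        dst += n_pad
    return out, padded_sizes
```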
| ) | ||
|
|
||
|
|
||
| class ExpertParallel(ParallelStyle): |
I have a question about when we apply _permute() and _unpermute(); they are now applied in 2 places:
1. In ExpertParallel(), which is applied on transformer_block.moe.experts, so the input of the MoE module will be reordered by local experts.
2. When use_grouped_mm is enabled, in indices_permutation_wrapper, which will also try to permute the inputs of GroupedExperts by the order of local experts.
Why do we need to apply it twice?
They won't be applied together.
When EP is used, EP will do _permute and _unpermute.
When EP is not used, indices_padding_wrapper will do them.
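A rough sketch of that mutual exclusion is below. The function and the callable parameters are hypothetical, standing in for ExpertParallel and the renamed wrapper discussed in this PR; it only illustrates that exactly one path applies the permute/unpermute:

```python
from typing import Callable

import torch.nn as nn


def apply_expert_permutation(
    experts: nn.Module,                                   # e.g. transformer_block.moe.experts
    apply_expert_parallel: Callable[[nn.Module], None],   # applies ExpertParallel's token dispatch
    permutation_wrapper: Callable[[Callable], Callable],  # the renamed wrapper from this PR
    expert_parallel_degree: int,
    use_grouped_mm: bool,
) -> None:
    """Hypothetical wiring: pick exactly one permute/unpermute path, never both."""
    if expert_parallel_degree > 1:
        # EP path: ExpertParallel's _token_dispatch permutes the inputs to be
        # ordered by local experts and un-permutes the outputs afterwards.
        apply_expert_parallel(experts)
    elif use_grouped_mm:
        # Non-EP path: only the wrapper reorders/pads the inputs so that
        # torch._grouped_mm sees contiguous per-expert token groups.
        experts.forward = permutation_wrapper(experts.forward)
```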
This PR:
- lets `ExpertParallel` handle indices permute / unpermute when EP is used
- moves `to_local` to model code to be more explicit
- renames the `expert_parallel` wrapper which does permute / unpermute to `indices_permutation_wrapper` to be more accurate
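For context on the permute / unpermute the description refers to, here is a minimal, self-contained sketch of the idea using plain torch.argsort. The function names are illustrative; torchtitan's actual code uses the generate_permute_indices kernel and also handles padding:

```python
import torch


def permute_by_expert(x: torch.Tensor, expert_idx: torch.Tensor):
    """Sort tokens so each expert's tokens are contiguous (what a grouped GEMM needs)."""
    order = torch.argsort(expert_idx, stable=True)  # token order, grouped by expert id
    return x[order], order


def unpermute(y: torch.Tensor, order: torch.Tensor) -> torch.Tensor:
    """Restore the original token order after the expert computation."""
    out = torch.empty_like(y)
    out[order] = y
    return out


# Usage: 6 tokens routed to 3 experts
x = torch.randn(6, 4)
expert_idx = torch.tensor([2, 0, 1, 0, 2, 1])
xp, order = permute_by_expert(x, expert_idx)
num_tokens_per_expert = torch.bincount(expert_idx, minlength=3)  # group sizes for the grouped GEMM
# ... run the grouped experts on xp ...
y = unpermute(xp, order)
assert torch.equal(y, x)  # round-trips back to the original token order
```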