Support Expert Parallelism #72
Conversation
@@ -20,6 +20,7 @@ class ParallelismArgs:
     dp: Number of DP replicas
     pp: Number of PP stages
     tp: Number of TP replicas
+    expert_parallel_size: Number of expert parallel replicas (used only for MoEs)
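The docstring lines above belong to a parallelism config dataclass. As a minimal sketch of how such a config might look (field names are taken from the hunk; defaults and the dataclass shape are assumptions, not the project's actual code):

```python
from dataclasses import dataclass

@dataclass
class ParallelismArgs:
    dp: int = 1                     # number of DP (data-parallel) replicas
    pp: int = 1                     # number of PP (pipeline-parallel) stages
    tp: int = 1                     # number of TP (tensor-parallel) replicas
    expert_parallel_size: int = 1   # expert-parallel replicas (used only for MoEs)

# Example: 2-way data parallelism, 2-way tensor parallelism,
# with MoE experts split across 2 expert-parallel groups.
args = ParallelismArgs(dp=2, pp=1, tp=2, expert_parallel_size=2)
```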
Shouldn't expert_parallel_size be the number of experts per TP rank?
Not quite: expert parallelism is orthogonal to TP. For example, you can have a single expert sharded across 2 TP ranks.
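The distinction can be sketched in a few lines (hypothetical helper names, not the project's API): EP partitions *which experts* a rank owns, while TP shards *each expert's weights* across ranks, so the two axes compose independently.

```python
def assign_experts(num_experts, expert_parallel_size):
    """EP axis: split expert indices evenly across EP ranks."""
    per_rank = num_experts // expert_parallel_size
    return {ep_rank: list(range(ep_rank * per_rank, (ep_rank + 1) * per_rank))
            for ep_rank in range(expert_parallel_size)}

def shard_expert_weights(hidden_dim, tp_size):
    """TP axis: shard one expert's weight columns across TP ranks."""
    cols = hidden_dim // tp_size
    return {tp_rank: (tp_rank * cols, (tp_rank + 1) * cols)
            for tp_rank in range(tp_size)}

# The example from the comment: 1 expert (so EP size 1), sharded along 2 TP ranks.
experts = assign_experts(num_experts=1, expert_parallel_size=1)
shards = shard_expert_weights(hidden_dim=8, tp_size=2)
```

Here `experts` shows the single expert living entirely in one EP group, while `shards` shows that expert's columns split between two TP ranks, illustrating why `expert_parallel_size` is not "experts per TP rank".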
No description provided.