Enable Allgather/ReduceScatter backend for NaiveAllToAll #23964
Conversation
| # - "allgather_reducescatter": all2all implementation based on allgather and | ||
| # reducescatter |
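For context, the communication pattern the new option name refers to can be sketched in plain PyTorch. This is a single-process illustration only, not vLLM code; the per-rank linear layers stand in for local experts, and a real MoE layer would additionally mask tokens that are not routed to a rank's experts.

```python
import torch

torch.manual_seed(0)
world_size, tokens_per_rank, hidden = 4, 8, 16

# Each simulated "rank" starts with its own slice of the batch.
local_tokens = [torch.randn(tokens_per_rank, hidden) for _ in range(world_size)]

# Allgather: every rank ends up holding the full batch.
full_batch = torch.cat(local_tokens, dim=0)                  # (32, 16)

# Each rank applies only the experts it owns to the full batch,
# producing a partial result for every token.
local_experts = [torch.nn.Linear(hidden, hidden) for _ in range(world_size)]
partials = [local_experts[r](full_batch) for r in range(world_size)]

# Reduce-scatter: sum the partial results across ranks, then hand each rank
# back only the rows that correspond to its original tokens.
reduced = torch.stack(partials).sum(dim=0)                   # (32, 16)
per_rank_out = list(reduced.split(tokens_per_rank, dim=0))   # 4 tensors of (8, 16)
assert per_rank_out[0].shape == (tokens_per_rank, hidden)
```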
@tlrmchlsmth wdyt if we turn this on as the default?
I think this is a better default.
[edit] I'd recommend making this switch in a followup PR with some testing
yeah, we should make this the default
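Until that follow-up lands, the backend has to be opted into explicitly. A minimal sketch, assuming the backend is selected via the VLLM_ALL2ALL_BACKEND environment variable (the variable name is not shown in the quoted diff, so verify it, and the engine arguments mentioned in the comments, against the installed vLLM version):

```python
# Sketch only: opt into the new backend before vLLM is imported/launched.
import os

os.environ["VLLM_ALL2ALL_BACKEND"] = "allgather_reducescatter"

# ...then start the engine as usual with expert parallelism enabled, e.g.
# from vllm import LLM
# llm = LLM(model=..., enable_expert_parallel=True)
```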
varun-sundar-rabindranath
left a comment
Nice and clean change. Thanks @wenscarl
tlrmchlsmth
left a comment
Don't we need to add a prepare/finalize implementation that uses this backend?
Not at this moment, since the trtllm-moe kernel's API is quite different from what the modular kernel supports.
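For context, the modular kernel splits an all-to-all MoE layer into a dispatch ("prepare") step and a combine ("finalize") step. The skeleton below is a hypothetical, single-process illustration of that split; the class and method names are made up for this sketch and do not reflect vLLM's actual interface or the trtllm-moe API.

```python
import torch


class ToyPrepareFinalize:
    """Illustrative only: dispatch tokens to experts, then combine outputs."""

    def prepare(self, hidden_states: torch.Tensor, topk_ids: torch.Tensor):
        # Replicate each token once per selected expert and group by expert id.
        num_tokens, topk = topk_ids.shape
        token_ids = torch.arange(num_tokens).repeat_interleave(topk)
        flat_experts = topk_ids.reshape(-1)
        order = torch.argsort(flat_experts)
        return hidden_states[token_ids[order]], flat_experts[order], token_ids[order]

    def finalize(self, expert_out: torch.Tensor, token_ids: torch.Tensor,
                 num_tokens: int) -> torch.Tensor:
        # Scatter-add each expert output row back onto its source token.
        out = torch.zeros(num_tokens, expert_out.shape[-1], dtype=expert_out.dtype)
        out.index_add_(0, token_ids, expert_out)
        return out


# Usage: 6 tokens, hidden size 4, top-2 routing over 3 experts.
pf = ToyPrepareFinalize()
x = torch.randn(6, 4)
topk_ids = torch.randint(0, 3, (6, 2))
dispatched, expert_ids, token_ids = pf.prepare(x, topk_ids)
combined = pf.finalize(dispatched * 2.0, token_ids, num_tokens=6)  # 2.0x stands in for expert work
```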
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
tlrmchlsmth
left a comment
I didn't realize how this was hooked up. Tried it and LGTM!
I also switched the default All2All backend to this one.
@tlrmchlsmth the test failures should be unrelated? Can you confirm?
The failures might actually be related, since we also changed the default All2All backend. I'm on my phone, so I can't read the logs well right now.
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
If we can confirm that the performance is good, could we enable it by default so that users do not need to set it explicitly? cc @weireweire
Yes, I had this enabled by default but backed out that change in case it was breaking the CI. The distributed tests are still red though, so I'm not sure what the problem is now. I'll try to land this PR with the default left as it was.
This pull request has merge conflicts that must be resolved before it can be merged.
@wenscarl Could the failure be related? The distributed tests are green on a very recent nightly https://buildkite.com/vllm/ci/builds/30655#01994662-1fcb-4cd8-aba7-ee88e6f8608a |
Signed-off-by: Shu Wang <shuw@nvidia.com>
Head branch was pushed to by a user without write access
I didn't see significant failures. Am I missing anything?
🤞 that the distributed tests are green |
…t#23964)
Signed-off-by: Shu Wang. <shuw@nvidia.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: Shu Wang <shuw@nvidia.com>
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Hi @wenscarl, we have been running with CUDA graphs and didn't see any issues with it.
Purpose
FIX #22916
Test Plan
accuracy:
Perf:
Test Result
accuracy:
perf:
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.