[c10d] switch ProcessGroup::Work to be managed by intrusive_ptr #44046
💊 CI failures summary and remediations
As of commit 056232f (more details on the Dr. CI page):
🕵️ 2 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages:
pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1 (1/2), step: "Run tests"
…e_ptr" [ghstack-poisoned]
…e_ptr" Differential Revision: [D23632280](https://our.internmc.facebook.com/intern/diff/D23632280) [ghstack-poisoned]
@@ -119,33 +119,33 @@ class ProcessGroup {
     return size_;
   }

-  virtual std::shared_ptr<ProcessGroup::Work> broadcast(
+  virtual c10::intrusive_ptr<ProcessGroup::Work> broadcast(
This change will probably break third-party backends such as https://github.com/intel/torch-ccl/blob/master/src/ProcessGroupCCL.hpp#L136 and https://github.com/openucx/torch-ucc/blob/master/include/torch_ucc.hpp#L77.
I'm guessing this is necessary for TorchScript and there is no way around it, so should we ask the third-party libraries to make this change as well? (We can probably file issues on those repos.)
cc @agolynski Since this affects the c10d extension.
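To illustrate why a return-type change like this forces out-of-tree backends to update, here is a minimal, self-contained sketch. It does not use the real c10 headers: `RefCounted`, `IntrusivePtr`, `makeIntrusive`, `ProcessGroupSketch`, and `ThirdPartyBackend` are all hypothetical stand-ins, and the zero-argument `broadcast()` is a simplification of the real signature. The point is only that the handle type is part of the virtual signature, and smart-pointer return types are not covariant, so an override must return exactly the same handle type as the base class.

```cpp
#include <atomic>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for an intrusively counted base class (not c10's real type).
// The reference count lives inside the object, not in a separate control block.
class RefCounted {
 public:
  virtual ~RefCounted() = default;
  void retain() noexcept { count_.fetch_add(1, std::memory_order_relaxed); }
  // Returns true when the reference being dropped was the last one.
  bool release() noexcept {
    return count_.fetch_sub(1, std::memory_order_acq_rel) == 1;
  }
  std::size_t use_count() const noexcept {
    return count_.load(std::memory_order_acquire);
  }

 private:
  std::atomic<std::size_t> count_{0};
};

// Hypothetical minimal intrusive handle; T must derive from RefCounted.
template <typename T>
class IntrusivePtr {
 public:
  IntrusivePtr() = default;
  explicit IntrusivePtr(T* p) : p_(p) { if (p_) p_->retain(); }
  IntrusivePtr(const IntrusivePtr& o) : p_(o.p_) { if (p_) p_->retain(); }
  IntrusivePtr(IntrusivePtr&& o) noexcept : p_(o.p_) { o.p_ = nullptr; }
  // By-value parameter covers both copy and move assignment.
  IntrusivePtr& operator=(IntrusivePtr o) noexcept {
    std::swap(p_, o.p_);
    return *this;
  }
  ~IntrusivePtr() { if (p_ && p_->release()) delete p_; }
  T* get() const noexcept { return p_; }
  T* operator->() const noexcept { return p_; }

 private:
  T* p_ = nullptr;
};

template <typename T, typename... Args>
IntrusivePtr<T> makeIntrusive(Args&&... args) {
  return IntrusivePtr<T>(new T(std::forward<Args>(args)...));
}

// Simplified, hypothetical versions of the classes touched by the diff.
struct Work : RefCounted {
  virtual bool isCompleted() { return true; }
};

struct ProcessGroupSketch {
  virtual ~ProcessGroupSketch() = default;
  // After the change, the intrusive handle is part of the virtual signature.
  virtual IntrusivePtr<Work> broadcast() = 0;
};

// Hypothetical out-of-tree backend. If it had kept
//   std::shared_ptr<Work> broadcast() override;
// it would no longer compile: the return type differs from the base class
// and smart-pointer return types are not covariant.
struct ThirdPartyBackend : ProcessGroupSketch {
  IntrusivePtr<Work> broadcast() override { return makeIntrusive<Work>(); }
};
```

Under these assumptions, the migration for a backend is mostly mechanical: each overridden collective changes its return type, and each place that creates a `Work` object switches to the intrusive factory.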
Yeah, agreed that we should ask them to make the changes. Do you have a list of third-party backends, or are these two the only ones currently using the c10d extension?
The ones I am aware of are Intel and UCX. See:
- https://github.com/openucx/torch-ucc
- add c10d dynamic loading mechanism and unit test #28068
Could you please check with @agolynski? He might have more context.
Hello @chengjunlu @mshiryaev @Sergei-Lebedev @srinivas212
This PR will break the master -> master dependency in torch-ccl and torch-ucc.
In the near future we'll be changing the ProcessGroup API, which will break these repos as well. Would it be okay if you depend on 1.6 (and upgrade to 1.7 when it is released) rather than on master in the meantime?
@Sergei-Lebedev @srinivas212: How does torch-ucc depend on pytorch? Do you require users to install pytorch from the master branch?
Hi @agolynski @mrshenli @pritamdamania87 - after discussions with both the Torch-UCC and Torch-CCL teams, the near-term plan is to fix the issue in the third-party repos once this change lands.
In general, it is best that we keep third-party plugins in sync with master.
Going to land the stack soon. I created openucx/torch-ucc#23 and intel/torch-ccl#11 on the ucc and ccl repos to track the API migration.
Hey @wanchaol, it looks like this PR breaks master. Shall we revert?
Yes, I just reverted it. I will investigate and submit it again.
Stack from ghstack:
Differential Revision: D23632280