ncclCommSplit in non-blocking API mode #1472
Comments
I would think ncclCommSplit mimics the behavior of ncclCommInitRank. So: Q1: yes, it should. I might be missing something though; let me check internally.
Thanks @sjeaugey
Re Q1:
The "ncclComm_ 0" part seems to suggest that the value is not filled when
Re Q2:
Ok. It looks like the current implementation focuses on the parent comm and treats ncclCommSplit like an ncclAllReduce in that respect: the operation is not complete while the parent comm status is ncclInProgress, and when it becomes ncclSuccess, the newcomm is filled. We'll see if we can improve the behavior so that it stays consistent with the non-blocking semantics of the parent, but is also more consistent with the ncclCommInitRankConfig semantics. I'm thinking about making ncclCommSplit non-blocking only if both the parent comm and the new comm are non-blocking. Also, in non-blocking mode, the newcomm should be valid when we return, like for ncclCommInitRank. That seems to me to follow the constraints of both sides. Of course we'd need to make the changes to confirm I'm not missing anything.
Sounds good, thanks!
nit: in blocking mode.
Mmm, no, I meant in non-blocking mode. In blocking mode, when we return from ncclCommSplit, the new comm should be valid and fully initialized. Does that make sense?
Oh, my bad: I misunderstood "valid" and took it as "initialized". All good!
### Fix 1: Throw async error during init wait
Previously we just busy-waited for `ncclSuccess`; if the non-blocking init encountered an error, we never reported it. Added detection of async errors via `ncclGetAsyncError`.
### Fix 2: Add wait after comm split
```
// After calling ncclCommSplit in non-blocking mode, we should wait for the
// source communicator to be out of ncclInProgress state.
// Reason 1:
//   it's unsafe to call new operations on the parent comm while it's in
//   ncclInProgress state.
// Reason 2:
//   as of NCCL 2.23, the ptr value of child comm will not be filled until the
//   state of parent comm is ncclSuccess. This may change in the future. See:
//   NVIDIA/nccl#1472
```
This wait does not mean the child comm is ready for use, nor does it block until that point.
cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o
Pull Request resolved: #137741
Approved by: https://github.com/shuqiangzhang
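The two fixes described above amount to a polling loop that stops on success, surfaces an error instead of spinning forever, and gives up after a timeout. The following is only a minimal sketch of that pattern, not the code from the referenced pull request. `CommStatus`, `waitUntilNotInProgress`, the `poll` callback, and the timeout handling are all hypothetical names and choices introduced for illustration; no NCCL header is used so the snippet compiles on its own. In real code the callback would presumably wrap NCCL's async-error query on the parent communicator (spelled `ncclCommGetAsyncError` in the NCCL API) and map its result onto these three states.

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Hypothetical stand-ins for the ncclResult_t values discussed in this thread
// (ncclSuccess, ncclInProgress, and any error code).
enum class CommStatus { Success, InProgress, Error };

// Polls `poll` until it stops reporting InProgress.
// Returns normally on Success; throws if an error is reported or if the
// timeout elapses while the status is still InProgress.
inline void waitUntilNotInProgress(
    const std::function<CommStatus()>& poll,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds interval = std::chrono::milliseconds(1)) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (true) {
    const CommStatus status = poll();
    if (status == CommStatus::Success) {
      return;  // the communicator left the in-progress state cleanly
    }
    if (status == CommStatus::Error) {
      // Fix 1: report the async error instead of waiting for a success
      // that will never arrive.
      throw std::runtime_error("async error reported while waiting");
    }
    if (std::chrono::steady_clock::now() >= deadline) {
      throw std::runtime_error("timed out while still in progress");
    }
    std::this_thread::sleep_for(interval);
  }
}
```

Under the NCCL 2.23 behavior described in the commit message, a caller would run this wait on the parent comm right after a non-blocking `ncclCommSplit`, and only then read the child comm pointer. As the commit message notes, returning from this wait does not by itself mean the child comm is ready for use.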
Question 1:
When calling `ncclCommSplit` in non-blocking API mode, would the value (i.e. pointer) for `newcomm` be filled before the API returns?
Question 2:
If we want to check if the split finishes, should we call `ncclGetAsyncError` on the parent `comm` or on the `newcomm`? Or either one is fine?