Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ncclCommSplit in non-blocking API mode #1472

Open
kwen2501 opened this issue Oct 9, 2024 · 6 comments
Open

ncclCommSplit in non-blocking API mode #1472

kwen2501 opened this issue Oct 9, 2024 · 6 comments

Comments

@kwen2501
Copy link
Contributor

kwen2501 commented Oct 9, 2024

ncclResult_t ncclCommSplit(ncclComm_t comm, int color, int key, ncclComm_t* newcomm, ncclConfig_t* config)

Question 1:
When calling ncclCommSplit in non-blocking API mode, would the value (i.e. pointer) for newcomm be filled before the API returns?

Question 2:
if we want to check if the split finishes, should we call ncclGetAsyncError on the parent comm or on the newcomm? Or either one is fine?

@sjeaugey
Copy link
Member

sjeaugey commented Oct 9, 2024

I would think ncclCommSplit mimics the behavior of ncclCommInitRank. So:

Q1: yes, it should
Q2: checking the status of newcomm should work, so that you can poll on it then use it for comm operations . But indeed we would also need to wait for the split to complete until we can reuse the parent comm, hence the status of the parent comm should be ncclInProgress as well until it's safe to call new operations on the parent comm.

I might be missing something though; let me check internally.

@kwen2501
Copy link
Contributor Author

kwen2501 commented Oct 9, 2024

Thanks @sjeaugey

Re Q1:
Yeah, I had the same mental model / expectation. But I also saw some lines as below from PyTorch's log after using split in non-blocking mode:

[PG ID 1 PG GUID 1(default_pg:split:0) Rank 0] ProcessGroupNCCL created ncclComm_ 0 on CUDA device: 0

The "ncclComm_ 0" part seems to suggest that the value is not filled when ncclCommSplit returns. It may be also due to bugs in PyTorch, I can isolate further.

Re Q2:
Good point about the parent comm.

@sjeaugey
Copy link
Member

Ok. It looks like the current implementation is focusing on the parent comm and considering ncclCommSplit like a ncclAllReduce in that respect: the operation is not complete while the parent comm status is ncclInProgress and when it becomes ncclSuccess, then the newcomm is filled.

We'll see if we can improve the behavior to continue being consistent with the non-blocking semantics of the parent, but also be more consistent with the ncclCommInitRankConfig semantics.

I'm thinking about making the ncclCommSplit non-blocking only if both the parent comm and the new comm are non-blocking. Also, in non-blocking mode, the newcomm should be valid when we return, like for ncclCommInitRank. That seems to me to follow the constraints of both sides.

Of course we'd need to make the changes to confirm I'm not missing anything.

@kwen2501
Copy link
Contributor Author

Sounds good, thanks!

Also, in non-blocking mode, the newcomm should be valid when we return, like for ncclCommInitRank.

nit: in blocking mode.

@sjeaugey
Copy link
Member

nit: in blocking mode.

Mmm, no, I meant in non-blocking more. In blocking mode when we return from ncclCommSplit, the new comm should be valid and fully initialized.
In non-blocking mode, the new comm we return should be a valid object (as far as I understand, not the case currently), although probably not initialized yet (ncclInProgress).

Does that make sense?

@kwen2501
Copy link
Contributor Author

Oh, my bad understanding of "valid", I took it as "initialized". All good!

kwen2501 added a commit to pytorch/pytorch that referenced this issue Oct 14, 2024
### Fix 1: Throw async error during init wait

Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`.

### Fix 2: Add wait after comm split

```
  // After calling ncclCommSplit in non-blocking mode, we should wait for the
  // source communicator to be out of ncclInProgress state.
  // Reason 1:
  //   it's unsafe to call new operations on the parent comm while it's in
  //   ncclInProgress state.
  // Reason 2:
  //   as of NCCL 2.23, the ptr value of child comm will not be filled until the
  //   state of parent comm is ncclSuccess. This may change in the future. See:
  //   NVIDIA/nccl#1472
```
This wait does not mean the child comm is ready for use, neither does it block till that point.

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit to pytorch/pytorch that referenced this issue Oct 14, 2024
### Fix 1: Throw async error during init wait

Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`.

### Fix 2: Add wait after comm split

```
  // After calling ncclCommSplit in non-blocking mode, we should wait for the
  // source communicator to be out of ncclInProgress state.
  // Reason 1:
  //   it's unsafe to call new operations on the parent comm while it's in
  //   ncclInProgress state.
  // Reason 2:
  //   as of NCCL 2.23, the ptr value of child comm will not be filled until the
  //   state of parent comm is ncclSuccess. This may change in the future. See:
  //   NVIDIA/nccl#1472
```
This wait does not mean the child comm is ready for use, neither does it block till that point.

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit to pytorch/pytorch that referenced this issue Oct 15, 2024
### Fix 1: Throw async error during init wait

Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`.

### Fix 2: Add wait after comm split

```
  // After calling ncclCommSplit in non-blocking mode, we should wait for the
  // source communicator to be out of ncclInProgress state.
  // Reason 1:
  //   it's unsafe to call new operations on the parent comm while it's in
  //   ncclInProgress state.
  // Reason 2:
  //   as of NCCL 2.23, the ptr value of child comm will not be filled until the
  //   state of parent comm is ncclSuccess. This may change in the future. See:
  //   NVIDIA/nccl#1472
```
This wait does not mean the child comm is ready for use, neither does it block till that point.

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit to pytorch/pytorch that referenced this issue Oct 15, 2024
### Fix 1: Throw async error during init wait

Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`.

### Fix 2: Add wait after comm split

```
  // After calling ncclCommSplit in non-blocking mode, we should wait for the
  // source communicator to be out of ncclInProgress state.
  // Reason 1:
  //   it's unsafe to call new operations on the parent comm while it's in
  //   ncclInProgress state.
  // Reason 2:
  //   as of NCCL 2.23, the ptr value of child comm will not be filled until the
  //   state of parent comm is ncclSuccess. This may change in the future. See:
  //   NVIDIA/nccl#1472
```
This wait does not mean the child comm is ready for use, neither does it block till that point.

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit to pytorch/pytorch that referenced this issue Oct 15, 2024
### Fix 1: Throw async error during init wait

Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`.

### Fix 2: Add wait after comm split

```
  // After calling ncclCommSplit in non-blocking mode, we should wait for the
  // source communicator to be out of ncclInProgress state.
  // Reason 1:
  //   it's unsafe to call new operations on the parent comm while it's in
  //   ncclInProgress state.
  // Reason 2:
  //   as of NCCL 2.23, the ptr value of child comm will not be filled until the
  //   state of parent comm is ncclSuccess. This may change in the future. See:
  //   NVIDIA/nccl#1472
```
This wait does not mean the child comm is ready for use, neither does it block till that point.

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit to pytorch/pytorch that referenced this issue Oct 15, 2024
### Fix 1: Throw async error during init wait

Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`.

### Fix 2: Add wait after comm split

```
  // After calling ncclCommSplit in non-blocking mode, we should wait for the
  // source communicator to be out of ncclInProgress state.
  // Reason 1:
  //   it's unsafe to call new operations on the parent comm while it's in
  //   ncclInProgress state.
  // Reason 2:
  //   as of NCCL 2.23, the ptr value of child comm will not be filled until the
  //   state of parent comm is ncclSuccess. This may change in the future. See:
  //   NVIDIA/nccl#1472
```
This wait does not mean the child comm is ready for use, neither does it block till that point.

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Oct 15, 2024
### Fix 1: Throw async error during init wait

Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`.

### Fix 2: Add wait after comm split

```
  // After calling ncclCommSplit in non-blocking mode, we should wait for the
  // source communicator to be out of ncclInProgress state.
  // Reason 1:
  //   it's unsafe to call new operations on the parent comm while it's in
  //   ncclInProgress state.
  // Reason 2:
  //   as of NCCL 2.23, the ptr value of child comm will not be filled until the
  //   state of parent comm is ncclSuccess. This may change in the future. See:
  //   NVIDIA/nccl#1472
```
This wait does not mean the child comm is ready for use, neither does it block till that point.

Pull Request resolved: #137741
Approved by: https://github.com/shuqiangzhang
dvorjackz pushed a commit to pytorch/pytorch that referenced this issue Oct 16, 2024
### Fix 1: Throw async error during init wait

Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`.

### Fix 2: Add wait after comm split

```
  // After calling ncclCommSplit in non-blocking mode, we should wait for the
  // source communicator to be out of ncclInProgress state.
  // Reason 1:
  //   it's unsafe to call new operations on the parent comm while it's in
  //   ncclInProgress state.
  // Reason 2:
  //   as of NCCL 2.23, the ptr value of child comm will not be filled until the
  //   state of parent comm is ncclSuccess. This may change in the future. See:
  //   NVIDIA/nccl#1472
```
This wait does not mean the child comm is ready for use, neither does it block till that point.

Pull Request resolved: #137741
Approved by: https://github.com/shuqiangzhang
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants