ncclCommSplit in non-blocking API mode #1472

kwen2501 · 2024-10-09T08:18:00Z

ncclResult_t ncclCommSplit(ncclComm_t comm, int color, int key, ncclComm_t* newcomm, ncclConfig_t* config)

Question 1:
When calling ncclCommSplit in non-blocking API mode, would the value (i.e. pointer) for newcomm be filled before the API returns?

Question 2:
If we want to check whether the split has finished, should we call ncclCommGetAsyncError on the parent comm or on newcomm? Or is either one fine?


sjeaugey · 2024-10-09T13:07:23Z

I would think ncclCommSplit mimics the behavior of ncclCommInitRank. So:

Q1: yes, it should
Q2: checking the status of newcomm should work, so that you can poll on it and then use it for comm operations. But we would also need to wait for the split to complete before we can reuse the parent comm, so the status of the parent comm should also be ncclInProgress until it is safe to call new operations on it.

I might be missing something though; let me check internally.

kwen2501 · 2024-10-09T17:24:52Z

Thanks @sjeaugey

Re Q1:
Yeah, I had the same mental model / expectation. But I also saw lines like the one below in PyTorch's log after using split in non-blocking mode:

[PG ID 1 PG GUID 1(default_pg:split:0) Rank 0] ProcessGroupNCCL created ncclComm_ 0 on CUDA device: 0

The "ncclComm_ 0" part seems to suggest that the value is not filled when ncclCommSplit returns. It may also be due to a bug in PyTorch; I can isolate further.

Re Q2:
Good point about the parent comm.

sjeaugey · 2024-10-10T07:21:51Z

OK. It looks like the current implementation focuses on the parent comm and treats ncclCommSplit like an ncclAllReduce in that respect: the operation is not complete while the parent comm status is ncclInProgress, and newcomm is filled only once that status becomes ncclSuccess.

We'll see if we can improve the behavior to continue being consistent with the non-blocking semantics of the parent, but also be more consistent with the ncclCommInitRankConfig semantics.

I'm thinking about making the ncclCommSplit non-blocking only if both the parent comm and the new comm are non-blocking. Also, in non-blocking mode, the newcomm should be valid when we return, like for ncclCommInitRank. That seems to me to follow the constraints of both sides.

Of course we'd need to make the changes to confirm I'm not missing anything.
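
The current behavior described in this comment (newcomm is filled only once the parent comm reaches ncclSuccess) can be modelled in a few lines. This is a minimal C++ sketch, not NCCL code: `Status`, `FakeSplit`, and `waitForParent` are names invented here, and `FakeSplit::poll` merely stands in for a call to `ncclCommGetAsyncError` on the parent comm. It only illustrates the ordering a caller has to respect.

```cpp
#include <cassert>
#include <functional>

// Stand-in for the ncclResult_t values discussed in this thread.
// Invented for this sketch; not NCCL's own enum.
enum class Status { Success, InProgress, Error };

// Hypothetical model of a split in flight: the child handle stays null
// until the parent has reported InProgress `pollsUntilDone` times.
struct FakeSplit {
    int pollsUntilDone;
    const void* child;

    explicit FakeSplit(int n) : pollsUntilDone(n), child(nullptr) {}

    // Plays the role of ncclCommGetAsyncError(parent, &status).
    Status poll() {
        if (pollsUntilDone > 0) {
            --pollsUntilDone;
            return Status::InProgress;
        }
        child = this;  // any non-null value stands for a filled newcomm
        return Status::Success;
    }
};

// Poll the parent until it leaves InProgress or maxPolls is used up.
// Returns the last status seen, so InProgress means "still pending".
Status waitForParent(const std::function<Status()>& pollParent, int maxPolls) {
    assert(maxPolls > 0);
    Status s = Status::InProgress;
    for (int i = 0; i < maxPolls && s == Status::InProgress; ++i) {
        s = pollParent();
    }
    return s;
}
```

In real code the caller would read `*newcomm` only after the parent's status has left ncclInProgress, which is what the `child` field mimics here.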

kwen2501 · 2024-10-10T18:28:06Z

Sounds good, thanks!

> Also, in non-blocking mode, the newcomm should be valid when we return, like for ncclCommInitRank.

nit: in blocking mode.

sjeaugey · 2024-10-11T07:25:40Z

> nit: in blocking mode.

Mmm, no, I meant in non-blocking mode. In blocking mode, when we return from ncclCommSplit, the new comm should be valid and fully initialized.
In non-blocking mode, the new comm we return should be a valid object (as far as I understand, not the case currently), although probably not initialized yet (ncclInProgress).

Does that make sense?
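
The semantics proposed in this thread (not what NCCL 2.23 does) can be written as a small decision table. The sketch below is illustrative only: `ReturnedComm` and `proposedSplitResult` are hypothetical names, and the two booleans stand for the `blocking` setting of the parent's and the new communicator's config.

```cpp
// What the caller could assume about newcomm when ncclCommSplit returns,
// under the proposal discussed above. Hypothetical names, not NCCL types.
enum class ReturnedComm {
    ValidAndInitialized,   // blocking split: usable right away
    ValidMaybeInProgress   // non-blocking split: a real handle, poll before use
};

// The split itself is non-blocking only when BOTH communicators are
// non-blocking; in every other combination it behaves as a blocking call.
ReturnedComm proposedSplitResult(bool parentBlocking, bool childBlocking) {
    const bool nonBlockingSplit = !parentBlocking && !childBlocking;
    return nonBlockingSplit ? ReturnedComm::ValidMaybeInProgress
                            : ReturnedComm::ValidAndInitialized;
}
```

In both outcomes the handle is "valid" in the sense used above; only the non-blocking case may still report ncclInProgress.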

kwen2501 · 2024-10-12T00:01:02Z

Oh, my bad understanding of "valid", I took it as "initialized". All good!

### Fix 1: Throw async error during init wait

Previously we just busy-waited for `ncclSuccess`; if the non-blocking init encountered an error, we never reported it. Added detection of async errors via `ncclCommGetAsyncError`.

### Fix 2: Add wait after comm split

```
// After calling ncclCommSplit in non-blocking mode, we should wait for the
// source communicator to be out of ncclInProgress state.
// Reason 1:
// it's unsafe to call new operations on the parent comm while it's in
// ncclInProgress state.
// Reason 2:
// as of NCCL 2.23, the ptr value of child comm will not be filled until the
// state of parent comm is ncclSuccess. This may change in the future. See:
// NVIDIA/nccl#1472
```

This wait does not mean the child comm is ready for use, nor does it block until that point.

cc XilunWu H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

Pull Request resolved: #137741
Approved by: https://github.com/shuqiangzhang
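
The idea behind Fix 1, surfacing an error instead of spinning forever on a status that will never become `ncclSuccess`, might look roughly like the sketch below. It is not the PyTorch implementation: `AsyncState` and `waitUntilReady` are invented names, the `poll` callback stands in for a `ncclCommGetAsyncError` query, and the timeout is an extra assumption that the commit message does not mention.

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Stand-in for the ncclResult_t values relevant to the wait loop.
// Invented for this sketch; not NCCL's own enum.
enum class AsyncState { Success, InProgress, Failed };

// Wait until `poll` stops reporting InProgress.
// Throws if the operation reports failure or the deadline passes,
// so a failed non-blocking init is reported rather than waited on.
void waitUntilReady(const std::function<AsyncState()>& poll,
                    std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        const AsyncState s = poll();
        if (s == AsyncState::Success) {
            return;
        }
        if (s == AsyncState::Failed) {
            throw std::runtime_error("async error reported while waiting");
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            throw std::runtime_error("timed out waiting for communicator");
        }
        // Back off briefly instead of burning a core.
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}
```

The same loop serves Fix 2 when `poll` queries the parent comm after a split; as the message notes, returning from it says nothing about whether the child comm is ready.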

kwen2501 mentioned this issue Oct 14, 2024

[PGNCCL] Fix bugs in non-blocking mode pytorch/pytorch#137741

Closed
