Modify comm_split to avoid ucp #1649
This is definitely a welcome change! Looks good, just two very minor things.
update_host(&id, d_nccl_ids.data() + offset, 1, stream_);
RAFT_CUDA_TRY(cudaStreamSynchronize(stream_));
We should probably use sync_stream() here so that this is more tolerant to failed ranks.
Updated in latest push
@@ -140,48 +142,37 @@ class std_comms : public comms_iface {
RAFT_CUDA_TRY(cudaStreamSynchronize(stream_));
Now that we are using NCCL instead of UCP, we should probably avoid any potential deadlocks from dying NCCL ranks by using sync_stream here instead of the standard CUDA synchronization.
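For context, here is a minimal sketch of why a polling-style synchronization can tolerate a failed rank where a blocking cudaStreamSynchronize cannot. It is not RAFT's sync_stream implementation. The names poll_until_done, sync_status, stream_done and comm_failed are hypothetical. In real code, stream_done might wrap cudaStreamQuery and comm_failed might wrap ncclCommGetAsyncError; that mapping is an assumption here.

```cpp
#include <thread>

// Hypothetical status type for this sketch; not a RAFT type.
enum class sync_status { success, aborted };

// Poll instead of block: return as soon as the stream has drained, or bail
// out if the communicator reports a failure (for example, a peer rank died).
// stream_done and comm_failed are caller-supplied callables returning bool.
template <typename StreamDone, typename CommFailed>
sync_status poll_until_done(StreamDone stream_done, CommFailed comm_failed)
{
  while (true) {
    if (stream_done()) { return sync_status::success; }  // all queued work finished
    if (comm_failed()) { return sync_status::aborted; }  // give up instead of hanging
    std::this_thread::yield();                           // let other threads run, then retry
  }
}
```

A blocking wait has no point at which it can notice that a collective will never complete, whereas the loop above rechecks the communicator's health on every iteration.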
Updated in latest push
LGTM, thanks Chuck!
/merge
During testing of a new feature in cugraph I discovered that the method required either MPI comms or UCP. I have an application that has neither.
This PR modifies the comm_split implementation to continue using allgather when performing the split instead of using allgather followed by UCP comms.
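To illustrate the idea, here is a host-only sketch of how a rank could plan its split from allgathered data alone, with no point-to-point messaging. It is not the code from this PR. The helper plan_split and the struct split_plan are hypothetical, and using the first rank with a matching color as the source of the shared id is an assumption suggested by the `d_nccl_ids.data() + offset` read quoted in the review.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical result type for this sketch; not part of RAFT.
struct split_plan {
  std::size_t id_offset;     // index into the gathered id array whose id this rank adopts
  std::size_t subcomm_size;  // number of ranks sharing this rank's color
};

// colors[r] is the color contributed by rank r, as every rank sees it after
// an allgather. Because all ranks hold the same vector, each can compute its
// plan locally and consistently.
inline split_plan plan_split(const std::vector<int>& colors, int my_color)
{
  auto first = std::find(colors.begin(), colors.end(), my_color);
  assert(first != colors.end());  // the calling rank's own color must be present

  auto offset = static_cast<std::size_t>(std::distance(colors.begin(), first));
  auto size   = static_cast<std::size_t>(std::count(colors.begin(), colors.end(), my_color));
  return {offset, size};
}
```

With gathered colors {0, 1, 0, 1}, a rank of color 1 would adopt the id at offset 1 and join a sub-communicator of size 2. Every rank of that color reaches the same answer without exchanging further messages.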