Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify comm_split to avoid ucp #1649

Merged

Conversation

ChuckHastings
Copy link
Contributor

During testing of a new feature in cugraph I discovered that the method required either MPI comms or UCP. I have an application that has neither.

This PR modifies the comm_split implementation to continue using allgather when performing the split instead of using allgather followed by UCP comms.

@ChuckHastings ChuckHastings requested a review from a team as a code owner July 14, 2023 16:39
@ChuckHastings ChuckHastings self-assigned this Jul 14, 2023
@github-actions github-actions bot added the cpp label Jul 14, 2023
@ChuckHastings ChuckHastings added 3 - Ready for Review improvement Improvement / enhancement to an existing function non-breaking Non-breaking change cpp and removed cpp labels Jul 14, 2023
Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely a welcome change! Looks good, just two very minor things.


update_host(&id, d_nccl_ids.data() + offset, 1, stream_);

RAFT_CUDA_TRY(cudaStreamSynchronize(stream_));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably use sync_stream() here so that this is more tolerant to failed ranks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in latest push

@@ -140,48 +142,37 @@ class std_comms : public comms_iface {

RAFT_CUDA_TRY(cudaStreamSynchronize(stream_));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we are using nccl instead of ucp, we should probably avoid any potential deadlocks from dying nccl ranks by using sync_stream here instead of the standard cuda synchronization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in latest push

Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks Chuck!

@cjnolet
Copy link
Member

cjnolet commented Jul 24, 2023

/merge

@rapids-bot rapids-bot bot merged commit 422e937 into rapidsai:branch-23.08 Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review cpp improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
Development

Successfully merging this pull request may close these issues.

2 participants