Running NCCL mpi_test across multiple nodes #33
Comments
No, indeed, NCCL doesn't run across multiple nodes.
Are there any plans to add this support?
Inter-node communication has been implemented in NCCL2, which is now available at https://developer.nvidia.com/nccl.
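For reference, inter-node initialization with NCCL2 usually follows the pattern sketched below: rank 0 creates a ncclUniqueId, MPI broadcasts it, and every rank joins a single communicator with ncclCommInitRank that can span nodes. This is only a minimal sketch, assuming one MPI rank per GPU and 8 GPUs per node as in the report above; error checking is omitted and the all-reduce is just a connectivity check on uninitialized buffers.

// Minimal multi-node NCCL2 sketch (assumptions: one MPI rank per GPU,
// 8 GPUs per node; no error checking for brevity).
#include <cuda_runtime.h>
#include <mpi.h>
#include <nccl.h>
#include <cstdio>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int rank = 0, nranks = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  const int GPUS_PER_NODE = 8;            // assumption: 8 GPUs per node
  cudaSetDevice(rank % GPUS_PER_NODE);    // pin each rank to a local GPU

  // Rank 0 creates the NCCL unique id; the others receive it over MPI.
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  // One communicator over all ranks, across nodes (the NCCL2 difference).
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  // Small all-reduce purely to verify that inter-node communication works.
  const size_t count = 1 << 20;
  float *sendbuf = nullptr, *recvbuf = nullptr;
  cudaMalloc(&sendbuf, count * sizeof(float));
  cudaMalloc(&recvbuf, count * sizeof(float));
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
  cudaStreamSynchronize(stream);
  if (rank == 0) printf("all-reduce across %d ranks completed\n", nranks);

  cudaFree(sendbuf);
  cudaFree(recvbuf);
  cudaStreamDestroy(stream);
  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}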
minsii added a commit to minsii/nccl that referenced this issue on Nov 15, 2023 (earlier revisions of the same change were referenced on Nov 13 and 14, 2023):
Summary: Pull Request resolved: facebookresearch#33. When concurrent collective/p2p operations are sent via multiple NCCL communicators, the ctran mapper register/deregister/search paths can be called by multiple threads concurrently, so the global timer used for registration must be thread-safe. This patch fixes that by adding a mutex around all accesses to the shared global variables. Reviewed By: wesbland. Differential Revision: D51083701. fbshipit-source-id: ca0ba40484f9c871780fc99623e0c9d8224328e3
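For illustration, the kind of change the commit describes can be sketched as follows. The names regMutex, regTotalTime, regCount, and recordRegistration are hypothetical stand-ins rather than the actual ctran symbols; the sketch only shows the general pattern of serializing access to global registration-timer state with a single mutex.

// Hypothetical sketch of the fix described above; identifiers are invented,
// not the real ctran mapper internals.
#include <chrono>
#include <cstddef>
#include <mutex>

namespace {
std::mutex regMutex;                       // guards all global registration state
std::chrono::nanoseconds regTotalTime{0};  // global timer shared by all communicators
std::size_t regCount = 0;
}  // namespace

// Called from register/deregister/search paths, possibly from several
// threads when multiple NCCL communicators issue operations concurrently.
void recordRegistration(std::chrono::nanoseconds elapsed) {
  std::lock_guard<std::mutex> lock(regMutex);  // serialize updates to the globals
  regTotalTime += elapsed;
  regCount += 1;
}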
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Hi,
I've built and run the mpi_test on 1 node with 8 TitanX GPUs successfully. I use srun to launch the MPI test and it passes. However, the test fails when run across 2 nodes with 8 TitanX GPUs per node. I use the following command line:

The test fails with the following error:

Does NCCL run across multiple nodes?