occasional crashes when using more than one comm per GPU #11
Comments
NCCL should support multiple ranks per GPU. After a few dozen runs on my workstation, I have not been able to reproduce the bug. How frequently do you see it? Does it happen if all ranks use just a single GPU? What GPU/CPU and OS are you running on? Thanks,
Hi Nathan, I still see the seg faults. The crashes appear to be in the ncclCommInitAll routine, since I exit immediately after calling it. I am also pasting the nvidia-smi output below.
nvidia-smi
+-----------------------------------------------------------------------------+
I tried a simple loop:
for (( i=1; i < 20; i++ )); do echo $i; ./build/test/single/broadcast_test 1024 2 0 1 ; done
and I see a couple of seg faults. Pasting the errors:
*** Error in `./build/test/single/broadcast_test': free(): invalid pointer: 0x00003efff80008c0 ***
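For an intermittent crash like this, it can help to count failures over a fixed number of runs rather than reading them off the terminal. The following is a minimal sketch, assuming a POSIX system; `count_failed_runs` is a hypothetical helper name, not part of NCCL or its test suite. It runs a command through the shell with `system()` and counts every run that does not exit with status 0, which includes runs killed by a signal such as SIGSEGV.

```c
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical helper (not part of NCCL): runs `cmd` through the shell
 * `runs` times and returns how many runs did not exit cleanly with
 * status 0. Returns -1 if system() itself could not start the shell. */
int count_failed_runs(const char *cmd, int runs)
{
    int failed = 0;

    for (int i = 0; i < runs; i++) {
        int status = system(cmd);
        if (status == -1) {
            return -1; /* fork or wait failed */
        }
        /* A run counts as failed if it was killed by a signal or
         * returned a nonzero exit code. */
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
            failed++;
        }
    }
    return failed;
}
```

Called as `count_failed_runs("./build/test/single/broadcast_test 1024 2 0 1", 19)`, this would mirror the 19 iterations of the shell loop above and give a failure count that can be compared across machines.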
Noticed that on a separate machine, this doesn't happen. When I ran nvidia-smi, I see the following on the two machines:
PASSES
+-----------------------------------------------------------------------------+
DOES NOT PASS
Tue May 31 16:35:47 2016
+-----------------------------------------------------------------------------+
We have now clarified this in the NCCL documentation: "Using the same CUDA device multiple times as different ranks of the same NCCL communicator is not supported and may lead to hangs." Closing this as expected behavior. Please reopen if you have further questions.
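Given that restriction, a caller may want to reject a device list that repeats a CUDA device before creating the communicator. The sketch below is one way to do that; `find_duplicate_device` is a hypothetical helper, not an NCCL function, and it only inspects the integer array, making no NCCL or CUDA calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of NCCL): returns the index of the first
 * entry in devs[] that repeats an earlier entry, or -1 if every device
 * index is unique. The quadratic scan is fine for a handful of GPUs. */
int find_duplicate_device(const int *devs, int ndev)
{
    assert(devs != NULL || ndev == 0);

    for (int i = 1; i < ndev; i++) {
        for (int j = 0; j < i; j++) {
            if (devs[i] == devs[j]) {
                return i; /* rank i reuses the device of rank j */
            }
        }
    }
    return -1;
}
```

For the device list in the original report (0 0 0 0 0 1), the function returns 1, the position of the second rank placed on device 0. A caller could check for a non-negative result and report an error instead of calling ncclCommInitAll.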
Hi All,
I have noticed crashes when I overload a device with more than one NCCL comm. For example, below I want to use 6 instances of the comm with only two devices, 0 and 1. I see crashes even with fewer instances, e.g. two instances of the comm on each device. Does NCCL assume that only one comm is created per device? If so, that is restrictive.
./build/test/single/broadcast_test 10000000 6 0 0 0 0 0 1
INFO NCCL debug level set to INFO
INFO rank 0 using buffSize = 2097152
INFO rank 0 using device 0 (0000:03:00.0)
INFO rank 1 using buffSize = 2097152
INFO rank 1 using device 0 (0000:03:00.0)
INFO rank 2 using buffSize = 2097152
INFO rank 2 using device 0 (0000:03:00.0)
INFO rank 3 using buffSize = 2097152
INFO rank 3 using device 0 (0000:03:00.0)
INFO rank 4 using buffSize = 2097152
INFO rank 4 using device 0 (0000:03:00.0)
Segmentation fault
Amith