
occasional crashes when using more than one comm per GPU #11

Closed · amithr1 opened this issue Feb 25, 2016 · 5 comments

amithr1 commented Feb 25, 2016

Hi All,

I have noticed crashes when I overload a device with more than one NCCL comm. For example, below I use 6 comm instances with only two devices, 0 and 1. I see crashes even with fewer instances, e.g. two comm instances per device (see the sketch after the log). Does NCCL assume that only one comm is created per device? That would be restrictive if so.

./build/test/single/broadcast_test 10000000 6 0 0 0 0 0 1
INFO NCCL debug level set to INFO
INFO rank 0 using buffSize = 2097152
INFO rank 0 using device 0 (0000:03:00.0)
INFO rank 1 using buffSize = 2097152
INFO rank 1 using device 0 (0000:03:00.0)
INFO rank 2 using buffSize = 2097152
INFO rank 2 using device 0 (0000:03:00.0)
INFO rank 3 using buffSize = 2097152
INFO rank 3 using device 0 (0000:03:00.0)
INFO rank 4 using buffSize = 2097152
INFO rank 4 using device 0 (0000:03:00.0)
Segmentation fault
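
For context, here is a minimal sketch of what that command asks NCCL to do (an illustration, not the actual broadcast_test source): a single ncclCommInitAll call creating six ranks whose device list deliberately repeats GPUs 0 and 1.

```c
/* Hypothetical repro sketch (not the real broadcast_test code):
 * six NCCL ranks bound to only two physical GPUs, mirroring
 * "broadcast_test 10000000 6 0 0 0 0 0 1". */
#include <stdio.h>
#include <stdlib.h>
#include "nccl.h"

int main(void) {
  const int nRanks = 6;
  int devList[6] = {0, 0, 0, 0, 0, 1};   /* device 0 used five times */
  ncclComm_t comms[6];

  ncclResult_t res = ncclCommInitAll(comms, nRanks, devList);
  if (res != ncclSuccess) {
    fprintf(stderr, "ncclCommInitAll failed: %s\n", ncclGetErrorString(res));
    return EXIT_FAILURE;
  }

  for (int i = 0; i < nRanks; ++i) ncclCommDestroy(comms[i]);
  printf("init/teardown completed for %d ranks\n", nRanks);
  return EXIT_SUCCESS;
}
```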

Amith

nluehr (Contributor) commented Mar 9, 2016

NCCL should support multiple ranks per GPU. After a few dozen runs on my workstation, I have not been able to reproduce the bug. How frequently do you see it? Does it happen if all ranks use just a single GPU? What GPU/CPU and OS are you running on?

Thanks,
Nathan

amithr1 (Author) commented Mar 21, 2016

Hi Nathan,

I still see the segfaults. The crashes appear to happen inside ncclCommInitAll, since I exit immediately after that call (see the isolation sketch after the nvidia-smi output). I am also pasting the nvidia-smi output below.

nvidia-smi
Mon Mar 21 11:40:42 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.59     Driver Version: 352.59         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:03:00.0    Off  |                    0 |
| N/A   27C    P8    27W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:04:00.0    Off  |                    0 |
| N/A   25C    P8    28W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
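
To make that isolation concrete, here is a minimal sketch (assuming the standard ncclCommInitAll/ncclCommDestroy API, not the actual test code) that takes a device list on the command line, calls only ncclCommInitAll, and tears down immediately, so any segfault has to come from initialization:

```c
/* Hypothetical isolation sketch: parse a device list from argv (as the
 * bundled tests do), call only ncclCommInitAll, then destroy and exit,
 * so any crash must occur during communicator initialization. */
#include <stdio.h>
#include <stdlib.h>
#include "nccl.h"

int main(int argc, char* argv[]) {
  int nDev = argc - 1;
  if (nDev < 1) { fprintf(stderr, "usage: %s dev0 [dev1 ...]\n", argv[0]); return 1; }

  int* devList = (int*)malloc(nDev * sizeof(int));
  ncclComm_t* comms = (ncclComm_t*)malloc(nDev * sizeof(ncclComm_t));
  for (int i = 0; i < nDev; ++i) devList[i] = atoi(argv[i + 1]);

  ncclResult_t res = ncclCommInitAll(comms, nDev, devList);  /* crash reported here */
  if (res != ncclSuccess) {
    fprintf(stderr, "ncclCommInitAll failed: %s\n", ncclGetErrorString(res));
    return 1;
  }

  for (int i = 0; i < nDev; ++i) ncclCommDestroy(comms[i]);
  printf("init/teardown completed for %d ranks\n", nDev);
  free(comms); free(devList);
  return 0;
}
```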

amithr1 (Author) commented May 31, 2016

I tried a simple loop of:

for (( i=1; i < 20; i++ )); do echo $i; ./build/test/single/broadcast_test 1024 2 0 1 ; done

and I see a couple of segfaults. Pasting the errors below:

*** Error in `./build/test/single/broadcast_test': free(): invalid pointer: 0x00003efff80008c0 ***
======= Backtrace: =========
/lib64/power8/libc.so.6(+0x8f284)[0x3fff76a7f284]
/usr/lib/nvidia/libcuda.so.1(+0xa7f22c)[0x3fff7b04f22c]
/usr/lib/nvidia/libcuda.so.1(+0x276f50)[0x3fff7a846f50]
/usr/lib/nvidia/libcuda.so.1(+0xa7ffbc)[0x3fff7b04ffbc]
/lib64/power8/libpthread.so.0(+0x8728)[0x3fff76eb8728]
/lib64/power8/libc.so.6(clone+0x98)[0x3fff76b07ae0]
======= Memory map: ========
10000000-10050000 r-xp 00000000 00:2d 26347551 /gpfs/ess2fs0/armamida/nccl/build/test/single/broadcast_test
10050000-10060000 r--p 00040000 00:2d 26347551 /gpfs/ess2fs0/armamida/nccl/build/test/single/broadcast_test
10060000-10070000 rw-p 00050000 00:2d 26347551 /gpfs/ess2fs0/armamida/nccl/build/test/single/broadcast_test
200000000-200100000 rw-s 55fbfa0000 00:05 104789 /dev/nvidiactl
200100000-200500000 rw-s 6ceccc0000 00:05 104789 /dev/nvidiactl
200500000-200900000 rw-s 47301c0000 00:05 104789 /dev/nvidiactl

amithr1 (Author) commented May 31, 2016

I noticed that this doesn't happen on a separate machine. Comparing nvidia-smi output, the machine with persistence mode enabled crashes, while the one with persistence mode disabled passes.

PASSES
-bash-4.2$ nvidia-smi
Tue May 31 16:27:16 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.59     Driver Version: 352.59         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80          Off   | 0000:03:00.0    Off  |                    0 |
| N/A   41C    P0    59W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80          Off   | 0000:04:00.0    Off  |                    0 |
| N/A   32C    P0    72W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80          Off   | 0002:03:00.0    Off  |                    0 |
| N/A   36C    P0    59W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80          Off   | 0002:04:00.0    Off  |                    0 |
| N/A   27C    P0    73W / 149W |     55MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

DOES NOT PASS

Tue May 31 16:35:47 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.59     Driver Version: 352.59         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:03:00.0    Off  |                    0 |
| N/A   27C    P8    27W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:04:00.0    Off  |                    0 |
| N/A   25C    P8    28W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

spotluri (Collaborator) commented May 6, 2019

We have now clarified this in the NCCL documentation:
https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/usage/communicators.html

From the documentation: "Using the same CUDA device multiple times as different ranks of the same NCCL communicator is not supported and may lead to hangs."

Closing this as it is expected behavior. Please reopen if you have further questions.
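
For anyone landing here, a minimal sketch of the supported single-process pattern (an illustration written against the current API, not taken from the documentation) binds each rank of the communicator to a different CUDA device:

```c
/* Minimal sketch of the supported single-process pattern: each rank of the
 * communicator is bound to a different CUDA device, never the same one twice. */
#include <stdio.h>
#include <cuda_runtime.h>
#include "nccl.h"

#define MAX_DEV 8   /* arbitrary cap for this sketch */

int main(void) {
  int nDev = 0;
  cudaGetDeviceCount(&nDev);          /* one NCCL rank per visible GPU */
  if (nDev < 1) { fprintf(stderr, "no CUDA devices found\n"); return 1; }
  if (nDev > MAX_DEV) nDev = MAX_DEV;

  int devList[MAX_DEV];
  for (int i = 0; i < nDev; ++i) devList[i] = i;   /* distinct devices only */

  ncclComm_t comms[MAX_DEV];
  ncclResult_t res = ncclCommInitAll(comms, nDev, devList);
  if (res != ncclSuccess) {
    fprintf(stderr, "ncclCommInitAll failed: %s\n", ncclGetErrorString(res));
    return 1;
  }

  for (int i = 0; i < nDev; ++i) ncclCommDestroy(comms[i]);
  printf("one communicator rank per GPU (%d ranks)\n", nDev);
  return 0;
}
```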
