
occasional crashes when using more than one comm per GPU #11

Closed · amithr1 opened this issue Feb 25, 2016 · 5 comments

amithr1 commented Feb 25, 2016

Hi All,

I have noticed crashes when I overload a device with more than one NCCL comm. For example, below I use 6 comm instances with only two devices, 0 and 1. I see crashes even with fewer instances, e.g. two comm instances per device (see the sketch after the log). Does NCCL assume that only one comm is created per device? That would be restrictive if so.

./build/test/single/broadcast_test 10000000 6 0 0 0 0 0 1
INFO NCCL debug level set to INFO
INFO rank 0 using buffSize = 2097152
INFO rank 0 using device 0 (0000:03:00.0)
INFO rank 1 using buffSize = 2097152
INFO rank 1 using device 0 (0000:03:00.0)
INFO rank 2 using buffSize = 2097152
INFO rank 2 using device 0 (0000:03:00.0)
INFO rank 3 using buffSize = 2097152
INFO rank 3 using device 0 (0000:03:00.0)
INFO rank 4 using buffSize = 2097152
INFO rank 4 using device 0 (0000:03:00.0)
Segmentation fault
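
For context, here is a minimal sketch of what that command asks NCCL to do (an illustration, not the actual broadcast_test source): a single ncclCommInitAll call creating six ranks whose device list deliberately repeats GPUs 0 and 1.

```c
/* Hypothetical repro sketch (not the real broadcast_test code):
 * six NCCL ranks bound to only two physical GPUs, mirroring
 * "broadcast_test 10000000 6 0 0 0 0 0 1". */
#include <stdio.h>
#include <stdlib.h>
#include "nccl.h"

int main(void) {
  const int nRanks = 6;
  int devList[6] = {0, 0, 0, 0, 0, 1};   /* device 0 used five times */
  ncclComm_t comms[6];

  ncclResult_t res = ncclCommInitAll(comms, nRanks, devList);
  if (res != ncclSuccess) {
    fprintf(stderr, "ncclCommInitAll failed: %s\n", ncclGetErrorString(res));
    return EXIT_FAILURE;
  }

  for (int i = 0; i < nRanks; ++i) ncclCommDestroy(comms[i]);
  printf("init/teardown completed for %d ranks\n", nRanks);
  return EXIT_SUCCESS;
}
```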

Amith

nluehr (Contributor) commented Mar 9, 2016

NCCL should support multiple ranks per GPU. After a few dozen runs on my workstation, I have not been able to reproduce the bug. How frequently do you see it? Does it happen if all ranks use just a single GPU? What GPU/CPU and OS are you running on?

Thanks,
Nathan

amithr1 (Author) commented Mar 21, 2016

Hi Nathan,

I still see the segfaults. The crashes appear to happen inside ncclCommInitAll, since I exit immediately after that call (see the isolation sketch after the nvidia-smi output). I am also pasting the nvidia-smi output below.

nvidia-smi
Mon Mar 21 11:40:42 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.59     Driver Version: 352.59         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:03:00.0    Off  |                    0 |
| N/A   27C    P8    27W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:04:00.0    Off  |                    0 |
| N/A   25C    P8    28W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
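
To make that isolation concrete, here is a minimal sketch (assuming the standard ncclCommInitAll/ncclCommDestroy API, not the actual test code) that takes a device list on the command line, calls only ncclCommInitAll, and tears down immediately, so any segfault has to come from initialization:

```c
/* Hypothetical isolation sketch: parse a device list from argv (as the
 * bundled tests do), call only ncclCommInitAll, then destroy and exit,
 * so any crash must occur during communicator initialization. */
#include <stdio.h>
#include <stdlib.h>
#include "nccl.h"

int main(int argc, char* argv[]) {
  int nDev = argc - 1;
  if (nDev < 1) { fprintf(stderr, "usage: %s dev0 [dev1 ...]\n", argv[0]); return 1; }

  int* devList = (int*)malloc(nDev * sizeof(int));
  ncclComm_t* comms = (ncclComm_t*)malloc(nDev * sizeof(ncclComm_t));
  for (int i = 0; i < nDev; ++i) devList[i] = atoi(argv[i + 1]);

  ncclResult_t res = ncclCommInitAll(comms, nDev, devList);  /* crash reported here */
  if (res != ncclSuccess) {
    fprintf(stderr, "ncclCommInitAll failed: %s\n", ncclGetErrorString(res));
    return 1;
  }

  for (int i = 0; i < nDev; ++i) ncclCommDestroy(comms[i]);
  printf("init/teardown completed for %d ranks\n", nDev);
  free(comms); free(devList);
  return 0;
}
```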

amithr1 (Author) commented May 31, 2016

I tried a simple loop of:

for (( i=1; i < 20; i++ )); do echo $i; ./build/test/single/broadcast_test 1024 2 0 1 ; done

and I see a couple of segfaults. Pasting the errors below:

*** Error in `./build/test/single/broadcast_test': free(): invalid pointer: 0x00003efff80008c0 ***
======= Backtrace: =========
/lib64/power8/libc.so.6(+0x8f284)[0x3fff76a7f284]
/usr/lib/nvidia/libcuda.so.1(+0xa7f22c)[0x3fff7b04f22c]
/usr/lib/nvidia/libcuda.so.1(+0x276f50)[0x3fff7a846f50]
/usr/lib/nvidia/libcuda.so.1(+0xa7ffbc)[0x3fff7b04ffbc]
/lib64/power8/libpthread.so.0(+0x8728)[0x3fff76eb8728]
/lib64/power8/libc.so.6(clone+0x98)[0x3fff76b07ae0]
======= Memory map: ========
10000000-10050000 r-xp 00000000 00:2d 26347551 /gpfs/ess2fs0/armamida/nccl/build/test/single/broadcast_test
10050000-10060000 r--p 00040000 00:2d 26347551 /gpfs/ess2fs0/armamida/nccl/build/test/single/broadcast_test
10060000-10070000 rw-p 00050000 00:2d 26347551 /gpfs/ess2fs0/armamida/nccl/build/test/single/broadcast_test
200000000-200100000 rw-s 55fbfa0000 00:05 104789 /dev/nvidiactl
200100000-200500000 rw-s 6ceccc0000 00:05 104789 /dev/nvidiactl
200500000-200900000 rw-s 47301c0000 00:05 104789 /dev/nvidiactl

amithr1 (Author) commented May 31, 2016

I noticed that this doesn't happen on a separate machine. Comparing nvidia-smi output, the machine with persistence mode enabled crashes, while the one with persistence mode disabled passes.

PASSES
-bash-4.2$ nvidia-smi
Tue May 31 16:27:16 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.59     Driver Version: 352.59         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80          Off   | 0000:03:00.0    Off  |                    0 |
| N/A   41C    P0    59W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80          Off   | 0000:04:00.0    Off  |                    0 |
| N/A   32C    P0    72W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80          Off   | 0002:03:00.0    Off  |                    0 |
| N/A   36C    P0    59W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80          Off   | 0002:04:00.0    Off  |                    0 |
| N/A   27C    P0    73W / 149W |     55MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

DOES NOT PASS

Tue May 31 16:35:47 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.59     Driver Version: 352.59         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:03:00.0    Off  |                    0 |
| N/A   27C    P8    27W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:04:00.0    Off  |                    0 |
| N/A   25C    P8    28W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

spotluri (Collaborator) commented May 6, 2019

We have now clarified this in the NCCL documentation:
https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/usage/communicators.html

From the documentation: "Using the same CUDA device multiple times as different ranks of the same NCCL communicator is not supported and may lead to hangs."

Closing this as it is expected behavior. Please reopen if you have further questions.
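
For anyone landing here, a minimal sketch of the supported single-process pattern (an illustration written against the current API, not taken from the documentation) binds each rank of the communicator to a different CUDA device:

```c
/* Minimal sketch of the supported single-process pattern: each rank of the
 * communicator is bound to a different CUDA device, never the same one twice. */
#include <stdio.h>
#include <cuda_runtime.h>
#include "nccl.h"

#define MAX_DEV 8   /* arbitrary cap for this sketch */

int main(void) {
  int nDev = 0;
  cudaGetDeviceCount(&nDev);          /* one NCCL rank per visible GPU */
  if (nDev < 1) { fprintf(stderr, "no CUDA devices found\n"); return 1; }
  if (nDev > MAX_DEV) nDev = MAX_DEV;

  int devList[MAX_DEV];
  for (int i = 0; i < nDev; ++i) devList[i] = i;   /* distinct devices only */

  ncclComm_t comms[MAX_DEV];
  ncclResult_t res = ncclCommInitAll(comms, nDev, devList);
  if (res != ncclSuccess) {
    fprintf(stderr, "ncclCommInitAll failed: %s\n", ncclGetErrorString(res));
    return 1;
  }

  for (int i = 0; i < nDev; ++i) ncclCommDestroy(comms[i]);
  printf("one communicator rank per GPU (%d ranks)\n", nDev);
  return 0;
}
```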
