
Stuck when running MPI test #18

Closed
kyoungrok0517 opened this issue Feb 19, 2019 · 13 comments

@kyoungrok0517

I've compiled NCCL, then tried the following command:
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2

Then I see the following error. What's the problem?

[kyoungrok-ryzen:12576] *** Process received signal ***
[kyoungrok-ryzen:12576] Signal: Segmentation fault (11)
[kyoungrok-ryzen:12576] Signal code: Address not mapped (1)
[kyoungrok-ryzen:12576] Failing at address: 0x44000098
[kyoungrok-ryzen:12576] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f339f39c890]
[kyoungrok-ryzen:12576] [ 1] /usr/lib/x86_64-linux-gnu/libmpi.so.20(MPI_Comm_size+0x42)[0x7f33a4d353b2]
[kyoungrok-ryzen:12576] [ 2] ./build/all_reduce_perf[0x402101]
[kyoungrok-ryzen:12576] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f339e010b97]
[kyoungrok-ryzen:12576] [ 4] ./build/all_reduce_perf[0x40398a]
[kyoungrok-ryzen:12576] *** End of error message ***
[1]    12576 segmentation fault (core dumped)  ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
@sjeaugey
Member

The crash seems to happen in MPI. Which MPI are you using? Are you sure you are using the same MPI at runtime that you compiled the tests with?
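
A quick way to check this (a sketch, not part of the original reply, assuming a standard Ubuntu/Open MPI setup) is to compare the MPI the binary links against with the MPI found at runtime:

ldd ./build/all_reduce_perf | grep -i libmpi   # the libmpi.so the test will actually load
which mpirun && mpirun --version               # the MPI launcher and version on the PATH

If the two point at different MPI installations, rebuilding the tests with MPI_HOME pointing at the runtime MPI should make them consistent.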

@kyoungrok0517
Author

I'm not sure... I'm using Ubuntu 18.04. How can I check whether my MPI is sane? I was seeing a weird phenomenon where pre-compiled multi-GPU xgboost (which uses NCCL as a backend) couldn't be parallelized on the PC I'm testing, so I started compiling things myself. Now it seems there's a problem in my MPI library :(

@kyoungrok0517
Author

kyoungrok0517 commented Feb 20, 2019

Now I've reached the point where the multi-GPU command gets stuck! Here's where I am.

Command
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2

STUCK HERE

nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
kyoungrok-ryzen:10156:10156 [0] NCCL INFO NET/Socket : Using [0]enp7s0:143.248.47.222<0>
kyoungrok-ryzen:10156:10156 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
kyoungrok-ryzen:10156:10156 [0] NCCL INFO NET/IB : No device found.
NCCL version 2.4.2+cuda9.0
kyoungrok-ryzen:10156:10156 [1] NCCL INFO nranks 2
kyoungrok-ryzen:10156:10156 [0] NCCL INFO Setting affinity for GPU 0 to ffff
kyoungrok-ryzen:10156:10156 [0] NCCL INFO comm 0x563ed42dd520 rank 0 nranks 2 cudaDev 0 nvmlDev 0
kyoungrok-ryzen:10156:10156 [1] NCCL INFO Setting affinity for GPU 1 to ffff
kyoungrok-ryzen:10156:10156 [1] NCCL INFO comm 0x563ed42c2010 rank 1 nranks 2 cudaDev 1 nvmlDev 1
kyoungrok-ryzen:10156:10156 [1] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
kyoungrok-ryzen:10156:10156 [1] NCCL INFO Channel 00 :    0   1
kyoungrok-ryzen:10156:10156 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
kyoungrok-ryzen:10156:10156 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via P2P/direct pointer
# NCCL Tests compiled with NCCL 2.4
# Using devices
#   Rank  0 on kyoungrok-ryzen device  0 [0x08] GeForce GTX 1080 Ti
#   Rank  1 on kyoungrok-ryzen device  1 [0x09] GeForce GTX 1080 Ti

#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res

Oh, I just fixed something and am now getting a different error. I'm on Ubuntu 18.04, CUDA 9.0, and the latest NCCL repo.

Command 1
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2

nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
NCCL version 2.4.2+cuda9.0
# NCCL Tests compiled with NCCL 2.4
# Using devices
#   Rank  0 on kyoungrok-ryzen device  0 [0x08] GeForce GTX 1080 Ti
#   Rank  1 on kyoungrok-ryzen device  1 [0x09] GeForce GTX 1080 Ti

#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res

kyoungrok-ryzen:23056:23056 [0] enqueue.cu:74 NCCL WARN Cuda failure 'invalid device function'
NCCL failure common.cu:483 'unhandled cuda error'
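
One common cause of a CUDA 'invalid device function' error (an assumption here, not something confirmed anywhere in this thread) is that NCCL or the tests were built without device code for the GPU's compute capability. For a GTX 1080 Ti (sm_61), a rebuild along these lines might help; directory names are only a sketch:

cd nccl && make -j src.build NVCC_GENCODE="-gencode=arch=compute_61,code=sm_61"
cd ../nccl-tests && make NCCL_HOME=../nccl/build NVCC_GENCODE="-gencode=arch=compute_61,code=sm_61"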

Command 2
mpirun -np 40 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2 -c 0

nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 0
(the line above printed once per rank)
Cuda failure common.cu:681 'out of memory'
(the line above repeated roughly two dozen times across ranks)
# NCCL Tests compiled with NCCL 2.4
# Using devices
#   Rank  0 on kyoungrok-ryzen device  0 [0x08] GeForce GTX 1080 Ti
#   Rank  1 on kyoungrok-ryzen device  1 [0x09] GeForce GTX 1080 Ti

#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res
NCCL failure common.cu:483 'unhandled cuda error'
(the header block and NCCL failure line above repeated, interleaved across the remaining ranks)
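
As an aside on Command 2 (an inference, not stated in the thread): with nccl-tests under MPI, the total number of GPUs used is the number of ranks times the -g value, so -np 40 with -g 2 on a two-GPU machine has every GPU opened by many processes at once, which would explain the repeated 'out of memory' failures. The more usual invocation on a single two-GPU node is one rank per GPU:

mpirun -np 2 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1    # sketch: one MPI rank per GPU
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2                 # or one process driving both GPUs, without MPI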

Command 3 (success)
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1 (using only a single GPU)

# NCCL Tests compiled with NCCL 2.4
# Using devices
#   Rank  0 on kyoungrok-ryzen device  0 [0x08] GeForce GTX 1080 Ti

#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res
           8             2   float     sum    0.008   0.00   0.00    0e+00    0.000   0.02   0.00    0e+00
          16             4   float     sum    0.005   0.00   0.00    0e+00    0.001   0.03   0.00    0e+00
          32             8   float     sum    0.005   0.01   0.00    0e+00    0.000   0.07   0.00    0e+00
          64            16   float     sum    0.005   0.01   0.00    0e+00    0.001   0.12   0.00    0e+00
         128            32   float     sum    0.005   0.03   0.00    0e+00    0.001   0.24   0.00    0e+00
         256            64   float     sum    0.005   0.05   0.00    0e+00    0.000   0.54   0.00    0e+00
         512           128   float     sum    0.005   0.10   0.00    0e+00    0.000   1.06   0.00    0e+00
        1024           256   float     sum    0.005   0.21   0.00    0e+00    0.001   1.90   0.00    0e+00
        2048           512   float     sum    0.005   0.42   0.00    0e+00    0.000   4.26   0.00    0e+00
        4096          1024   float     sum    0.005   0.83   0.00    0e+00    0.000   8.52   0.00    0e+00
        8192          2048   float     sum    0.005   1.67   0.00    0e+00    0.001  15.27   0.00    0e+00
       16384          4096   float     sum    0.005   3.36   0.00    0e+00    0.001  30.51   0.00    0e+00
       32768          8192   float     sum    0.005   6.68   0.00    0e+00    0.000  67.93   0.00    0e+00
       65536         16384   float     sum    0.005  13.15   0.00    0e+00    0.000  138.16   0.00    0e+00
      131072         32768   float     sum    0.006  21.80   0.00    0e+00    0.000  276.58   0.00    0e+00
      262144         65536   float     sum    0.005  50.65   0.00    0e+00    0.000  602.21   0.00    0e+00
      524288        131072   float     sum    0.005  103.97   0.00    0e+00    0.000  1094.78   0.00    0e+00
     1048576        262144   float     sum    0.005  216.02   0.00    0e+00    0.000  2175.92   0.00    0e+00
     2097152        524288   float     sum    0.014  149.60   0.00    0e+00    0.000  4925.20   0.00    0e+00
     4194304       1048576   float     sum    0.026  162.39   0.00    0e+00    0.000  9920.30   0.00    0e+00
     8388608       2097152   float     sum    0.049  170.41   0.00    0e+00    0.000  19052.03   0.00    0e+00
    16777216       4194304   float     sum    0.096  174.87   0.00    0e+00    0.000  37714.32   0.00    0e+00
    33554432       8388608   float     sum    0.189  177.25   0.00    0e+00    0.000  75099.44   0.00    0e+00
    67108864      16777216   float     sum    0.376  178.27   0.00    0e+00    0.000  150182.08   0.00    0e+00
   134217728      33554432   float     sum    0.751  178.74   0.00    0e+00    0.000  301714.57   0.00    0e+00
 Out of bounds values : 0 OK
 Avg bus bandwidth    : 0

How can I give you the full error logs?

@kyoungrok0517 changed the title from "Segfault with libpthread.so" to "Stuck when testing NCCL-2.4" on Feb 20, 2019
@kyoungrok0517
Author

kyoungrok0517 commented Feb 21, 2019

I solved this by turning off VT (SVM on AMD) in the BIOS. I'm closing this issue.

@kyoungrok0517
Author

kyoungrok0517 commented Oct 21, 2020

Hello. I've run into the same issue again on another Ryzen system. It also gets stuck with the MPI test script. I turned off SVM like I did before, but in vain. Tested with NCCL 2.7.8 and CUDA 10.2.

These are the commands I've tested:

  • ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2 (hang)
  • mpirun -np 2 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1 -c 0 (hang)
  • mpirun -np 2 hostname (works)
  • ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1 (works)

Below are the messages & status I see.

[screenshot]

[screenshot]

Now this is the message from the single-GPU test:
[screenshot]

@kyoungrok0517 changed the title from "Stuck when testing NCCL-2.4" to "Stuck when running MPI test" on Oct 21, 2020
@sjeaugey
Member

sjeaugey commented Oct 21, 2020

Did you confirm that ACS was off (or that SVM was disabled on the new node; I don't know how to confirm that from the system)?

Also, could you confirm it is a GPU Direct P2P issue, and that disabling P2P (NCCL_P2P_DISABLE=1) solves the problem?
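
A sketch of how one might check both points from the shell (these commands are an assumption, not from the original reply):

sudo lspci -vvv | grep -i acsctl    # lines with SrcValid+ indicate ACS is enabled on a PCI bridge
NCCL_P2P_DISABLE=1 ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2    # re-run the hanging test with P2P off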

@kyoungrok0517
Author

Thanks for the reply. For now I've tested NCCL_P2P_DISABLE=1 and that solved the problem :) As for ACS, I can check the server's BIOS tomorrow. So what's the cause of this problem?

@sjeaugey
Member

sjeaugey commented Oct 21, 2020

GPU Direct P2P (and RDMA for that matter) relies on PCI devices communicating directly with one another. That means using PCI transactions which target another PCI device and not the CPU.

Some platforms do not support that type of PCI packet because only PCI<->CPU transactions, which are the vast majority of PCI use cases, were ever tested to work correctly. And many things can break those PCI<->PCI transactions, the main one being PCI virtualization technologies when they are enabled but not properly configured. Even when properly configured, they often impact performance negatively. That's why we advise turning off any PCI virtualization unless you need to run virtual machines (in which case you may also want to enable PCI ATS, but that's a complex subject).

Now there could also be other reasons for GPU Direct to be broken, due to some bad settings in PCI switches or CPUs, breaking those PCI P2P requests. The problem is that it's hard to debug as you would need a PCI Express analyzer to see what's going wrong (plus the PCI expertise).

And frequently, when this is broken, we do not see anything in NCCL, only that the remote write did not happen, so the remote GPU is still waiting for the message to arrive, and we can't do much more than hang.
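
One way to sanity-check GPU Direct P2P outside of NCCL (a suggestion, not part of the original reply) is the p2pBandwidthLatencyTest sample from NVIDIA's cuda-samples repository, which exercises peer-to-peer copies directly:

# build path varies with the samples version; shown here only as a sketch
./p2pBandwidthLatencyTest    # hangs or reports broken P2P bandwidth when PCI peer-to-peer is misbehaving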

@kyoungrok0517
Author

PCI virtualization

Thanks for the detailed answer. I'll now close the issue, and come back if I need further help. Thanks!

@kyoungrok0517
Author

If anyone encounters the same issue, please do the following in the BIOS:

  • Turn off SVM
  • Turn off IOMMU (under AMD CBS)

I leave this here for future reference (a sketch for verifying the result from Linux follows below).
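
A minimal sketch (not part of the original comment) for confirming from a running Linux system that the IOMMU really ended up disabled after changing those BIOS settings:

dmesg | grep -i -e iommu -e amd-vi    # AMD-Vi messages at boot mean the IOMMU was initialized
ls /sys/class/iommu                   # an empty directory also suggests no IOMMU is active

As an alternative to the BIOS switch, the IOMMU can usually also be disabled from the kernel command line with amd_iommu=off, though fixing it in the BIOS is cleaner.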

@maxhgerlach

maxhgerlach commented Dec 18, 2020

Turn off IOMMU

Thanks for leaving that advice, @kyoungrok0517. It helped a lot to make sense of an AMD-CPU-based system that I was working with.

@longkukuhi

I faced the same issue when using a Docker container in a cluster. I solved the problem by setting NCCL_P2P_DISABLE=1. Will it have a severe negative impact on training speed?
Thanks for your help.

@sjeaugey
Member

sjeaugey commented Feb 14, 2022

If you have NVLink, or more than 2 GPUs, then yes, disabling P2P will probably degrade performance significantly.
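
A quick way to see which case applies (a sketch, not from the original reply):

nvidia-smi topo -m    # NV# entries between GPU pairs indicate NVLink; PIX/PHB/NODE/SYS are PCIe-only paths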
