
Stuck when running MPI test #18

Closed
kyoungrok0517 opened this issue Feb 19, 2019 · 13 comments

@kyoungrok0517

I've compiled NCCL, then tried the following command:
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2

Then I see the following error. What's the problem?

[kyoungrok-ryzen:12576] *** Process received signal ***
[kyoungrok-ryzen:12576] Signal: Segmentation fault (11)
[kyoungrok-ryzen:12576] Signal code: Address not mapped (1)
[kyoungrok-ryzen:12576] Failing at address: 0x44000098
[kyoungrok-ryzen:12576] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f339f39c890]
[kyoungrok-ryzen:12576] [ 1] /usr/lib/x86_64-linux-gnu/libmpi.so.20(MPI_Comm_size+0x42)[0x7f33a4d353b2]
[kyoungrok-ryzen:12576] [ 2] ./build/all_reduce_perf[0x402101]
[kyoungrok-ryzen:12576] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f339e010b97]
[kyoungrok-ryzen:12576] [ 4] ./build/all_reduce_perf[0x40398a]
[kyoungrok-ryzen:12576] *** End of error message ***
[1]    12576 segmentation fault (core dumped)  ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
@sjeaugey
Member

The crash seems to happen in MPI. Which MPI are you using? Are you sure you are using the same MPI at runtime that you compiled the tests with?
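
A quick way to check this (a sketch, not part of the original reply, assuming a standard Ubuntu/Open MPI setup) is to compare the MPI the binary links against with the MPI found at runtime:

ldd ./build/all_reduce_perf | grep -i libmpi   # the libmpi.so the test will actually load
which mpirun && mpirun --version               # the MPI launcher and version on the PATH

If the two point at different MPI installations, rebuilding the tests with MPI_HOME pointing at the runtime MPI should make them consistent.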

@kyoungrok0517
Author

I'm not sure... I'm using Ubuntu 18.04. How can I check whether my MPI is sane? I was seeing a weird phenomenon where pre-compiled multi-GPU xgboost (which uses NCCL as a backend) couldn't be parallelized on the PC I'm testing, so I started compiling things myself. Now it seems there's a problem in my MPI library :(

@kyoungrok0517
Author

kyoungrok0517 commented Feb 20, 2019

Now I've reached the point where the multi-GPU command gets stuck! Here's where I am.

Command
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2

STUCK HERE

nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
kyoungrok-ryzen:10156:10156 [0] NCCL INFO NET/Socket : Using [0]enp7s0:143.248.47.222<0>
kyoungrok-ryzen:10156:10156 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
kyoungrok-ryzen:10156:10156 [0] NCCL INFO NET/IB : No device found.
NCCL version 2.4.2+cuda9.0
kyoungrok-ryzen:10156:10156 [1] NCCL INFO nranks 2
kyoungrok-ryzen:10156:10156 [0] NCCL INFO Setting affinity for GPU 0 to ffff
kyoungrok-ryzen:10156:10156 [0] NCCL INFO comm 0x563ed42dd520 rank 0 nranks 2 cudaDev 0 nvmlDev 0
kyoungrok-ryzen:10156:10156 [1] NCCL INFO Setting affinity for GPU 1 to ffff
kyoungrok-ryzen:10156:10156 [1] NCCL INFO comm 0x563ed42c2010 rank 1 nranks 2 cudaDev 1 nvmlDev 1
kyoungrok-ryzen:10156:10156 [1] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
kyoungrok-ryzen:10156:10156 [1] NCCL INFO Channel 00 :    0   1
kyoungrok-ryzen:10156:10156 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
kyoungrok-ryzen:10156:10156 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via P2P/direct pointer
# NCCL Tests compiled with NCCL 2.4
# Using devices
#   Rank  0 on kyoungrok-ryzen device  0 [0x08] GeForce GTX 1080 Ti
#   Rank  1 on kyoungrok-ryzen device  1 [0x09] GeForce GTX 1080 Ti

#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res

Oh, I just fixed something and am now getting a different error. I'm on Ubuntu 18.04, CUDA 9.0, and the latest NCCL repo.

Command 1
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2

nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
NCCL version 2.4.2+cuda9.0
# NCCL Tests compiled with NCCL 2.4
# Using devices
#   Rank  0 on kyoungrok-ryzen device  0 [0x08] GeForce GTX 1080 Ti
#   Rank  1 on kyoungrok-ryzen device  1 [0x09] GeForce GTX 1080 Ti

#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res

kyoungrok-ryzen:23056:23056 [0] enqueue.cu:74 NCCL WARN Cuda failure 'invalid device function'
NCCL failure common.cu:483 'unhandled cuda error'
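
One common cause of a CUDA 'invalid device function' error (an assumption here, not something confirmed anywhere in this thread) is that NCCL or the tests were built without device code for the GPU's compute capability. For a GTX 1080 Ti (sm_61), a rebuild along these lines might help; directory names are only a sketch:

cd nccl && make -j src.build NVCC_GENCODE="-gencode=arch=compute_61,code=sm_61"
cd ../nccl-tests && make NCCL_HOME=../nccl/build NVCC_GENCODE="-gencode=arch=compute_61,code=sm_61"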

Command 2
mpirun -np 40 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2 -c 0

nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 0
(the line above printed once per rank)
Cuda failure common.cu:681 'out of memory'
(the line above repeated roughly two dozen times across ranks)
# NCCL Tests compiled with NCCL 2.4
# Using devices
#   Rank  0 on kyoungrok-ryzen device  0 [0x08] GeForce GTX 1080 Ti
#   Rank  1 on kyoungrok-ryzen device  1 [0x09] GeForce GTX 1080 Ti

#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res
NCCL failure common.cu:483 'unhandled cuda error'
(the header block and NCCL failure line above repeated, interleaved across the remaining ranks)
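
As an aside on Command 2 (an inference, not stated in the thread): with nccl-tests under MPI, the total number of GPUs used is the number of ranks times the -g value, so -np 40 with -g 2 on a two-GPU machine has every GPU opened by many processes at once, which would explain the repeated 'out of memory' failures. The more usual invocation on a single two-GPU node is one rank per GPU:

mpirun -np 2 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1    # sketch: one MPI rank per GPU
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2                 # or one process driving both GPUs, without MPI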

Command 3 (success)
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1 (using only a single GPU)

# NCCL Tests compiled with NCCL 2.4
# Using devices
#   Rank  0 on kyoungrok-ryzen device  0 [0x08] GeForce GTX 1080 Ti

#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res
           8             2   float     sum    0.008   0.00   0.00    0e+00    0.000   0.02   0.00    0e+00
          16             4   float     sum    0.005   0.00   0.00    0e+00    0.001   0.03   0.00    0e+00
          32             8   float     sum    0.005   0.01   0.00    0e+00    0.000   0.07   0.00    0e+00
          64            16   float     sum    0.005   0.01   0.00    0e+00    0.001   0.12   0.00    0e+00
         128            32   float     sum    0.005   0.03   0.00    0e+00    0.001   0.24   0.00    0e+00
         256            64   float     sum    0.005   0.05   0.00    0e+00    0.000   0.54   0.00    0e+00
         512           128   float     sum    0.005   0.10   0.00    0e+00    0.000   1.06   0.00    0e+00
        1024           256   float     sum    0.005   0.21   0.00    0e+00    0.001   1.90   0.00    0e+00
        2048           512   float     sum    0.005   0.42   0.00    0e+00    0.000   4.26   0.00    0e+00
        4096          1024   float     sum    0.005   0.83   0.00    0e+00    0.000   8.52   0.00    0e+00
        8192          2048   float     sum    0.005   1.67   0.00    0e+00    0.001  15.27   0.00    0e+00
       16384          4096   float     sum    0.005   3.36   0.00    0e+00    0.001  30.51   0.00    0e+00
       32768          8192   float     sum    0.005   6.68   0.00    0e+00    0.000  67.93   0.00    0e+00
       65536         16384   float     sum    0.005  13.15   0.00    0e+00    0.000  138.16   0.00    0e+00
      131072         32768   float     sum    0.006  21.80   0.00    0e+00    0.000  276.58   0.00    0e+00
      262144         65536   float     sum    0.005  50.65   0.00    0e+00    0.000  602.21   0.00    0e+00
      524288        131072   float     sum    0.005  103.97   0.00    0e+00    0.000  1094.78   0.00    0e+00
     1048576        262144   float     sum    0.005  216.02   0.00    0e+00    0.000  2175.92   0.00    0e+00
     2097152        524288   float     sum    0.014  149.60   0.00    0e+00    0.000  4925.20   0.00    0e+00
     4194304       1048576   float     sum    0.026  162.39   0.00    0e+00    0.000  9920.30   0.00    0e+00
     8388608       2097152   float     sum    0.049  170.41   0.00    0e+00    0.000  19052.03   0.00    0e+00
    16777216       4194304   float     sum    0.096  174.87   0.00    0e+00    0.000  37714.32   0.00    0e+00
    33554432       8388608   float     sum    0.189  177.25   0.00    0e+00    0.000  75099.44   0.00    0e+00
    67108864      16777216   float     sum    0.376  178.27   0.00    0e+00    0.000  150182.08   0.00    0e+00
   134217728      33554432   float     sum    0.751  178.74   0.00    0e+00    0.000  301714.57   0.00    0e+00
 Out of bounds values : 0 OK
 Avg bus bandwidth    : 0

How can I give you the full error logs?

@kyoungrok0517 changed the title from "Segfault with libpthread.so" to "Stuck when testing NCCL-2.4" on Feb 20, 2019
@kyoungrok0517
Author

kyoungrok0517 commented Feb 21, 2019

I solved this by turning off VT (SVM on AMD) in the BIOS. I'm closing this issue.

@kyoungrok0517
Author

kyoungrok0517 commented Oct 21, 2020

Hello. I've run into the same issue again on another Ryzen system. It also gets stuck with the MPI test script. I turned off SVM like I did before, but in vain. Tested with NCCL 2.7.8 and CUDA 10.2.

These are the commands I've tested:

  • ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2 (hang)
  • mpirun -np 2 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1 -c 0 (hang)
  • mpirun -np 2 hostname (works)
  • ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1 (works)

Below are the messages & status I see.

[screenshot]

[screenshot]

Now this is the message from the single-GPU test:
[screenshot]

@kyoungrok0517 changed the title from "Stuck when testing NCCL-2.4" to "Stuck when running MPI test" on Oct 21, 2020
@sjeaugey
Member

sjeaugey commented Oct 21, 2020

Did you confirm that ACS was off (or that SVM was disabled on the new node; I don't know how to confirm that from the system)?

Also, could you confirm it is a GPU Direct P2P issue, and that disabling P2P (NCCL_P2P_DISABLE=1) solves the problem?
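
A sketch of how one might check both points from the shell (these commands are an assumption, not from the original reply):

sudo lspci -vvv | grep -i acsctl    # lines with SrcValid+ indicate ACS is enabled on a PCI bridge
NCCL_P2P_DISABLE=1 ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2    # re-run the hanging test with P2P off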

@kyoungrok0517
Author

Thanks for the reply. For now I've tested NCCL_P2P_DISABLE=1 and that solved the problem :) As for ACS, I can check the server's BIOS tomorrow. So what's the cause of this problem?

@sjeaugey
Member

sjeaugey commented Oct 21, 2020

GPU Direct P2P (and RDMA for that matter) relies on PCI devices communicating directly with one another. That means using PCI transactions which target another PCI device and not the CPU.

Some platforms do not support that type of PCI packet because only PCI<->CPU transactions, which are the vast majority of PCI use cases, were ever tested to work correctly. And many things can break those PCI<->PCI transactions, the main one being PCI virtualization technologies when they are enabled but not properly configured. Even when properly configured, they often impact performance negatively. That's why we advise turning off any PCI virtualization unless you need to run virtual machines (in which case you may also want to enable PCI ATS, but that's a complex subject).

Now there could also be other reasons for GPU Direct to be broken, due to some bad settings in PCI switches or CPUs, breaking those PCI P2P requests. The problem is that it's hard to debug as you would need a PCI Express analyzer to see what's going wrong (plus the PCI expertise).

And frequently, when this is broken, we do not see anything in NCCL, only that the remote write did not happen, so the remote GPU is still waiting for the message to arrive, and we can't do much more than hang.
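
One way to sanity-check GPU Direct P2P outside of NCCL (a suggestion, not part of the original reply) is the p2pBandwidthLatencyTest sample from NVIDIA's cuda-samples repository, which exercises peer-to-peer copies directly:

# build path varies with the samples version; shown here only as a sketch
./p2pBandwidthLatencyTest    # hangs or reports broken P2P bandwidth when PCI peer-to-peer is misbehaving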

@kyoungrok0517
Author

PCI virtualization

Thanks for the detailed answer. I'll now close the issue, and come back if I need further help. Thanks!

@kyoungrok0517
Author

If anyone encounters the same issue, please do the following in the BIOS:

  • Turn off SVM
  • Turn off IOMMU (under AMD CBS)

I leave this here for future reference (a sketch for verifying the result from Linux follows below).
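
A minimal sketch (not part of the original comment) for confirming from a running Linux system that the IOMMU really ended up disabled after changing those BIOS settings:

dmesg | grep -i -e iommu -e amd-vi    # AMD-Vi messages at boot mean the IOMMU was initialized
ls /sys/class/iommu                   # an empty directory also suggests no IOMMU is active

As an alternative to the BIOS switch, the IOMMU can usually also be disabled from the kernel command line with amd_iommu=off, though fixing it in the BIOS is cleaner.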

@maxhgerlach

maxhgerlach commented Dec 18, 2020

Turn off IOMMU

Thanks for leaving that advice, @kyoungrok0517. It helped a lot to make sense of an AMD-CPU-based system that I was working with.

@longkukuhi

I faced the same issue when using a Docker container in a cluster. I solved the problem by setting NCCL_P2P_DISABLE=1. Will it have a severe negative impact on training speed?
Thanks for your help.

@sjeaugey
Member

sjeaugey commented Feb 14, 2022

If you have NVLink, or more than 2 GPUs, then yes, disabling P2P will probably degrade performance significantly.
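
A quick way to see which case applies (a sketch, not from the original reply):

nvidia-smi topo -m    # NV# entries between GPU pairs indicate NVLink; PIX/PHB/NODE/SYS are PCIe-only paths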
