Stuck when running MPI test #18
The crash seems to happen in MPI. Which MPI are you using? Are you sure you are using the same MPI at runtime that you compiled the tests with?
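For reference, a quick way to sanity-check this (a generic sketch, assuming the test binary was built under ./build and an Open MPI or MPICH toolchain is installed) is to compare what the build and the launcher actually pick up:

which mpicc mpirun                            # which MPI wrappers come first in PATH
mpirun --version                              # MPI flavor/version used at launch time
ldd ./build/all_reduce_perf | grep -i mpi     # MPI library the test binary links against

If the library reported by ldd does not match the mpirun you launch with, the tests were compiled against a different MPI than the one used at runtime.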
I'm not sure... I'm using Ubuntu 18.04. How can I check that my MPI setup is sane? I was seeing odd behavior where a pre-compiled multi-GPU xgboost (which uses NCCL as a backend) wouldn't parallelize on the PC I'm testing, so I started compiling things myself. Now it looks like there's a problem in my MPI library :(
Now I've reached the point where the multi-GPU command gets stuck! Here's where I am.
Command
STUCK HERE
Oh, I just fixed something and am now getting a different error. I'm on Ubuntu 18.04, CUDA 9.0, with the latest NCCL repo.
Command 1
nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
NCCL version 2.4.2+cuda9.0
# NCCL Tests compiled with NCCL 2.4
# Using devices
# Rank 0 on kyoungrok-ryzen device 0 [0x08] GeForce GTX 1080 Ti
# Rank 1 on kyoungrok-ryzen device 1 [0x09] GeForce GTX 1080 Ti
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
kyoungrok-ryzen:23056:23056 [0] enqueue.cu:74 NCCL WARN Cuda failure 'invalid device function'
NCCL failure common.cu:483 'unhandled cuda error'

Command 2
nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 validation: 0
[the parameter line above appears 4 times in the original paste; the rest of the output is interleaved across processes]
Cuda failure common.cu:681 'out of memory'
[repeated 24 times]
# NCCL Tests compiled with NCCL 2.4
# Using devices
# Rank 0 on kyoungrok-ryzen device 0 [0x08] GeForce GTX 1080 Ti
# Rank 1 on kyoungrok-ryzen device 1 [0x09] GeForce GTX 1080 Ti
# out-of-place in-place
# bytes N type op time algbw busbw res time algbw busbw res
[the header block above is printed repeatedly, interleaved across processes]
NCCL failure common.cu:483 'unhandled cuda error'
[repeated 7 times]

Command 3 (success)
How can I give you the full error logs?
I solved this by turning off CPU virtualization (VT, called SVM on AMD) in the BIOS. Closing this issue.
Hello. I've run into the same issue again on another Ryzen system; it's also stuck on the MPI test script. I turned off SVM like I did before, but in vain. These are the commands I've tested.
Below are the messages and status I see.
Did you confirm that ACS was off (or that SVM was disabled on the new node; I don't know how to confirm that from the system)? Also, could you confirm it is a GPU Direct P2P issue, i.e. that disabling P2P (e.g. with NCCL_P2P_DISABLE=1) works around it?
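For reference, a few generic Linux checks (not specific to this thread, and their exact output varies by distribution) that can hint at whether the IOMMU/SVM and ACS are active:

dmesg | grep -i -e iommu -e amd-vi        # IOMMU (AMD-Vi) initialization messages at boot
ls /sys/kernel/iommu_groups/ | wc -l      # a non-zero group count usually means the IOMMU is enabled
sudo lspci -vvv | grep -i acsctl          # ACSCtl lines with SrcValid+ indicate ACS is active on that bridge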
Thanks for the reply. For now I tested:
GPU Direct P2P (and RDMA, for that matter) relies on PCI devices communicating directly with one another. That means using PCI transactions that target another PCI device rather than the CPU. Some platforms do not support that type of PCI packet because they only validated that PCI<->CPU transactions work correctly, which covers the vast majority of PCI use cases.

Many things can break those PCI<->PCI transactions, the main one being PCI virtualization technologies that are enabled but not properly configured. Even when properly configured, they often impact performance negatively. That's why we advise turning off any PCI virtualization unless you need to run virtual machines (in which case you may also want to enable PCI ATS, but that's a complex subject).

There could also be other reasons for GPU Direct to be broken, such as bad settings in PCI switches or CPUs that break those PCI P2P requests. The problem is that this is hard to debug, as you would need a PCI Express analyzer (plus the PCI expertise) to see what's going wrong. And frequently, when this is broken, we do not see anything in NCCL; we only see that the remote write did not happen, so the remote GPU keeps waiting for the message to arrive, and we can't do much more than hang.
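As a practical sketch of how one might test the P2P hypothesis on a setup like this (reusing the command from this thread; NCCL_P2P_DISABLE and NCCL_DEBUG are standard NCCL environment variables, and p2pBandwidthLatencyTest is a stock CUDA sample whose location depends on your CUDA install):

# Re-run the same test with GPU Direct P2P disabled; if this passes while the
# default run hangs, the problem is very likely in PCI P2P.
NCCL_P2P_DISABLE=1 ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2

# Turn on NCCL debug output to see which transport (P2P, SHM, ...) gets selected.
NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2

# Independently exercise peer-to-peer copies with the CUDA sample.
./p2pBandwidthLatencyTest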
Thanks for the detailed answer. I'll now close the issue, and come back if I need further help. Thanks!
If anyone encounters the same issue, please do the following in the BIOS:
(settings not captured in this text)
I'm leaving this here for future reference.
Thanks for leaving that advice, @kyoungrok0517. It helped a lot in making sense of an AMD-CPU-based system I was working with.
I faced the same issue when using a Docker container in a cluster, and solved it by setting NCCL_P2P_DISABLE=1. Will this have a severe negative impact on training speed?
If you have NVLink, or more than 2 GPUs, then yes, disabling P2P will probably degrade performance significantly.
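One way to quantify that on a given machine (a sketch reusing the all_reduce_perf invocation from this thread; adjust -g to the number of GPUs) is to compare the reported busbw with and without P2P:

# Baseline with P2P (and NVLink, if present) enabled.
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2

# Same run with P2P disabled; the drop in the busbw column shows the cost of
# staging transfers through host memory instead.
NCCL_P2P_DISABLE=1 ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2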
I've compiled NCCL, then tried it with the following command:
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
Then I see the following error. What's the problem?