Testing nccl with a difficult topology #19
Hi Manuel, I cannot reproduce the stall on a similar machine. Could you get a stack trace so that we get an idea of where it stalls? Also, can you look at nvidia-smi to see if one (or two) GPUs are busy?
Hi sjeaugey, I've been running and testing my configuration of the toolkit (CUDA SDK 7.5), the NCCL library, and MPI (Open MPI 1.10.2) on both my workstation and the Supermicro PC, to make sure they are all correctly installed and to rule out any bugs in the libraries. Having made sure that both computers have identical configurations and tested the NCCL examples on them, I concluded that this problem has to be related to my topology, as I don't get any problems when executing on my workstation (which for now works with two different GPUs). To put it in perspective, my testing hardware configurations as given by the lspci command are as follows:

[manuel@nhri]$ lspci -tv | grep NVIDIA
+-01.1-[02]----00.0 NVIDIA Corporation GK110GL [Tesla K20c]
+-1c.4-[10]--+-00.0 NVIDIA Corporation GM107GL [Quadro K620]
|               \-00.1  NVIDIA Corporation Device 0fbc

[r1bsl@supermicro]$ lspci -tv | grep NVIDIA
| +-02.0-[82-85]----00.0-[83-85]--+-08.0-[84]----00.0 NVIDIA Corporation GK210GL [Tesla K80]
| | \-10.0-[85]----00.0 NVIDIA Corporation GK210GL [Tesla K80]
| +-03.0-[86-89]----00.0-[87-89]--+-08.0-[88]----00.0 NVIDIA Corporation GK210GL [Tesla K80]
| | \-10.0-[89]----00.0 NVIDIA Corporation GK210GL [Tesla K80]
+-03.0-[03-06]----00.0-[04-06]--+-08.0-[05]----00.0 NVIDIA Corporation GK210GL [Tesla K80]
|                               \-10.0-[06]----00.0  NVIDIA Corporation GK210GL [Tesla K80]

In my workstation there is no PCIe switch, so the topology doesn't allow me to use P2P communications. The Supermicro configuration is more complex: I found out that to reach each GPU I have to pass through two PCIe switches. To put it in a schematic way:

CPU(0) -- switch -- K80 internal switch -- K80(0)
| | \ -- K80(1)
| \ ----- K80 internal switch -- K80(2)
| \ -- K80(3)
CPU(1) -- switch -- K80 internal switch -- K80(4)
                                         \ -- K80(5)

I have tested the MPI example and all the examples in the 'single' folder. When running the NCCL tests, I made sure no GPU was busy by using the nvidia-smi command as indicated. I have no problems executing the single and MPI tests using 3 GPUs simultaneously, with either GPUs (0,2,4) or (1,3,5); however, using all GPUs or contiguous GPUs, e.g. (0,1,5) or (0,4,5), the execution stalls. I have traced the problem a little using the MPI example, and I notice that the program always stalls after the ncclAllReduce stage: the cudaStreamSynchronize call never completes. If you have a similar machine, do you have any special configuration on it?
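(As a quick sanity check on this kind of topology, it helps to look at what the driver and runtime report about GPU-to-GPU paths before suspecting NCCL itself. A minimal sketch follows; the samples path assumes a default CUDA 7.5 install, and the availability of the nvidia-smi 'topo' subcommand depends on the driver version, so treat both as assumptions.)

# Sketch: inspect the PCIe topology and peer-to-peer capability as seen by the
# driver/runtime. 'nvidia-smi topo -m' prints a GPU interconnect matrix, and
# the simpleP2P CUDA sample exercises cudaDeviceCanAccessPeer/cudaMemcpyPeer
# between two devices.
nvidia-smi topo -m
cd /usr/local/cuda-7.5/samples/0_Simple/simpleP2P
make
./simpleP2P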
Could this be related to NVIDIA/caffe#10?
@lukeyeager It seems this problem is about a deadlock rather than a P2P bandwidth issue. @wme7 your topology doesn't seem strange or difficult to me; that's in fact a very common configuration here. Could you please run a test which stalls and, while it is stalled (do not hit Ctrl-C), run gstack on each of the test processes? Then, can you provide us with the output? That would help me understand the problem and hopefully reproduce it and fix it.
@lukeyeager I don't know about that, but I'll check my BIOS just in case. @sjeaugey I'm following your instructions now. I reset the GPUs and then ran the test:

[r1bsl@supermicro simpleMPITest]$ sudo nvidia-smi -r -i 0
GPU 0000:05:00.0 was successfully reset.
All done.
[r1bsl@supermicro simpleMPITest]$ sudo nvidia-smi -r -i 3
GPU 0000:85:00.0 was successfully reset.
All done.
[r1bsl@supermicro simpleMPITest]$ sudo nvidia-smi -r -i 1
GPU 0000:06:00.0 was successfully reset.
All done.
[r1bsl@supermicro simpleMPITest]$ sudo nvidia-smi -r -i 4
GPU 0000:88:00.0 was successfully reset.
All done.
[r1bsl@supermicro simpleMPITest]$ sudo nvidia-smi -r -i 5
GPU 0000:89:00.0 was successfully reset.
All done.
[r1bsl@supermicro simpleMPITest]$ sudo nvidia-smi -r -i 2
GPU 0000:84:00.0 was successfully reset.
All done.
[r1bsl@supermicro simpleMPITest]$ ~/openMPI/bin/mpirun -np 3 test.run 0 2 3
MPI initialized
rank 0 has device 0
rank 1 has device 2
rank 2 has device 3
nccl communicator created!
CUDA streams created!
Input values set. Starting Test:
Reduction complete:
[stall]

From another terminal window, while the test is stalled, I call nvidia-smi and get the gstack output of the running PIDs:

[r1bsl@supermicro simpleMPITest]$ nvidia-smi
Fri Apr 22 10:48:50 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.79 Driver Version: 352.79 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 0000:05:00.0 Off | 0 |
| N/A 51C P0 71W / 149W | 127MiB / 11519MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 0000:06:00.0 Off | 0 |
| N/A 30C P8 30W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 On | 0000:84:00.0 Off | 0 |
| N/A 49C P0 73W / 149W | 194MiB / 11519MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 On | 0000:85:00.0 Off | 0 |
| N/A 35C P0 85W / 149W | 194MiB / 11519MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 On | 0000:88:00.0 Off | 0 |
| N/A 33C P8 26W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 On | 0000:89:00.0 Off | 0 |
| N/A 28C P8 30W / 149W | 55MiB / 11519MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 696 C test.run 71MiB |
| 2 697 C test.run 71MiB |
| 2 698 C test.run 64MiB |
| 3 697 C test.run 64MiB |
| 3 698 C test.run 71MiB |
+-----------------------------------------------------------------------------+
[r1bsl@supermicro simpleMPITest]$ gstack 696
Thread 3 (Thread 0x7fec7072f700 (LWP 699)):
#0 0x00007fec725b0c3d in poll () from /lib64/libc.so.6
#1 0x00007fec71fd6a96 in poll_dispatch (base=0x1c4ad70, tv=0x7fec7072ee90) at ../../../../../../opal/mca/event/libevent2021/libevent/poll.c:165
#2 0x00007fec71fce8c4 in opal_libevent2021_event_base_loop (base=0x1c4ad70, flags=1) at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1633
#3 0x00007fec7228131e in orte_progress_thread_engine () from /home/r1bsl/openMPI/lib/libopen-rte.so.12
#4 0x00007fec732b1dc5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fec725bb28d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fec57bfc700 (LWP 710)):
#0 0x00007fec725b0c3d in poll () from /lib64/libc.so.6
#1 0x00007fec779b885b in ?? () from /lib64/libcuda.so.1
#2 0x00007fec7737e651 in ?? () from /lib64/libcuda.so.1
#3 0x00007fec779b91a8 in ?? () from /lib64/libcuda.so.1
#4 0x00007fec732b1dc5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007fec725bb28d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fec79754740 (LWP 696)):
#0 0x00007ffd8e97b7c2 in clock_gettime ()
#1 0x00007fec725ceedd in clock_gettime () from /lib64/libc.so.6
#2 0x00007fec779b81de in ?? () from /lib64/libcuda.so.1
#3 0x00007fec7736d7ab in ?? () from /lib64/libcuda.so.1
#4 0x00007fec7734ae33 in ?? () from /lib64/libcuda.so.1
#5 0x00007fec7734af89 in ?? () from /lib64/libcuda.so.1
#6 0x00007fec772bec87 in ?? () from /lib64/libcuda.so.1
#7 0x00007fec772970c2 in cuStreamSynchronize () from /lib64/libcuda.so.1
#8 0x00007fec780fed90 in ?? () from /usr/local/cuda-7.5/lib64/libcudart.so.7.5
#9 0x00007fec781361fd in cudaStreamSynchronize () from /usr/local/cuda-7.5/lib64/libcudart.so.7.5
#10 0x0000000000401870 in main ()
[r1bsl@supermicro simpleMPITest]$ gstack 697
Thread 4 (Thread 0x7ff42bd39700 (LWP 700)):
#0 0x00007ff42dbbac3d in poll () from /lib64/libc.so.6
#1 0x00007ff42d5e0a96 in poll_dispatch (base=0x2127d70, tv=0x7ff42bd38e90) at ../../../../../../opal/mca/event/libevent2021/libevent/poll.c:165
#2 0x00007ff42d5d88c4 in opal_libevent2021_event_base_loop (base=0x2127d70, flags=1) at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1633
#3 0x00007ff42d88b31e in orte_progress_thread_engine () from /home/r1bsl/openMPI/lib/libopen-rte.so.12
#4 0x00007ff42e8bbdc5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007ff42dbc528d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7ff4130a5700 (LWP 712)):
#0 0x00007ff42dbbac3d in poll () from /lib64/libc.so.6
#1 0x00007ff432fc285b in ?? () from /lib64/libcuda.so.1
#2 0x00007ff432988651 in ?? () from /lib64/libcuda.so.1
#3 0x00007ff432fc31a8 in ?? () from /lib64/libcuda.so.1
#4 0x00007ff42e8bbdc5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007ff42dbc528d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7ff409dfd700 (LWP 714)):
#0 0x00007ff42dbbac3d in poll () from /lib64/libc.so.6
#1 0x00007ff432fc285b in ?? () from /lib64/libcuda.so.1
#2 0x00007ff432988651 in ?? () from /lib64/libcuda.so.1
#3 0x00007ff432fc31a8 in ?? () from /lib64/libcuda.so.1
#4 0x00007ff42e8bbdc5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007ff42dbc528d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7ff434d5e740 (LWP 697)):
#0 0x00007ffd38bb77c2 in clock_gettime ()
#1 0x00007ff42dbd8edd in clock_gettime () from /lib64/libc.so.6
#2 0x00007ff432fc21de in ?? () from /lib64/libcuda.so.1
#3 0x00007ff4329777ab in ?? () from /lib64/libcuda.so.1
#4 0x00007ff432954e33 in ?? () from /lib64/libcuda.so.1
#5 0x00007ff432954f89 in ?? () from /lib64/libcuda.so.1
#6 0x00007ff4328c8c87 in ?? () from /lib64/libcuda.so.1
#7 0x00007ff4328a10c2 in cuStreamSynchronize () from /lib64/libcuda.so.1
#8 0x00007ff433708d90 in ?? () from /usr/local/cuda-7.5/lib64/libcudart.so.7.5
#9 0x00007ff4337401fd in cudaStreamSynchronize () from /usr/local/cuda-7.5/lib64/libcudart.so.7.5
#10 0x0000000000401870 in main ()
[r1bsl@supermicro simpleMPITest]$ gstack 698
Thread 4 (Thread 0x7f5016a85700 (LWP 701)):
#0 0x00007f5018906c3d in poll () from /lib64/libc.so.6
#1 0x00007f501832ca96 in poll_dispatch (base=0xf23d70, tv=0x7f5016a84e90) at ../../../../../../opal/mca/event/libevent2021/libevent/poll.c:165
#2 0x00007f50183248c4 in opal_libevent2021_event_base_loop (base=0xf23d70, flags=1) at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1633
#3 0x00007f50185d731e in orte_progress_thread_engine () from /home/r1bsl/openMPI/lib/libopen-rte.so.12
#4 0x00007f5019607dc5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f501891128d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f4ffde06700 (LWP 711)):
#0 0x00007f5018906c3d in poll () from /lib64/libc.so.6
#1 0x00007f501dd0e85b in ?? () from /lib64/libcuda.so.1
#2 0x00007f501d6d4651 in ?? () from /lib64/libcuda.so.1
#3 0x00007f501dd0f1a8 in ?? () from /lib64/libcuda.so.1
#4 0x00007f5019607dc5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f501891128d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f4ff4dff700 (LWP 713)):
#0 0x00007f5018906c3d in poll () from /lib64/libc.so.6
#1 0x00007f501dd0e85b in ?? () from /lib64/libcuda.so.1
#2 0x00007f501d6d4651 in ?? () from /lib64/libcuda.so.1
#3 0x00007f501dd0f1a8 in ?? () from /lib64/libcuda.so.1
#4 0x00007f5019607dc5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f501891128d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f501faaa740 (LWP 698)):
#0 0x00007ffea4bfd7c2 in clock_gettime ()
#1 0x00007f5018924edd in clock_gettime () from /lib64/libc.so.6
#2 0x00007f501dd0e1de in ?? () from /lib64/libcuda.so.1
#3 0x00007f501d6c37ab in ?? () from /lib64/libcuda.so.1
#4 0x00007f501d6a0e33 in ?? () from /lib64/libcuda.so.1
#5 0x00007f501d6a0f89 in ?? () from /lib64/libcuda.so.1
#6 0x00007f501d614c87 in ?? () from /lib64/libcuda.so.1
#7 0x00007f501d5ed0c2 in cuStreamSynchronize () from /lib64/libcuda.so.1
#8 0x00007f501e454d90 in ?? () from /usr/local/cuda-7.5/lib64/libcudart.so.7.5
#9 0x00007f501e48c1fd in cudaStreamSynchronize () from /usr/local/cuda-7.5/lib64/libcudart.so.7.5
#10 0x0000000000401870 in main ()
@lukeyeager I'm checking to see if ACS is disabled for the PLX PCIe switches in my motherboard (SS 7048GR-TR). I just found the following recommendation in a question on the Supermicro forum. Using lspci, we check for the PLX switches and their ACSCtl settings:

[root@supermicro manuel]# lspci | grep PLX
03:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
04:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
04:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
82:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
83:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
83:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
86:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
87:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
87:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)

[root@supermicro manuel]# lspci -vvv | grep ACSCtl
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-

Here, I notice that some (though not all) of the SrcValid flags are set, i.e. ACS is enabled on some of the bridges. Therefore, we clear the ACS Control register on each PLX switch:

[root@supermicro manuel]# sudo setpci -s 03:00.0 f2a.w=0000
[root@supermicro manuel]# sudo setpci -s 04:08.0 f2a.w=0000
[root@supermicro manuel]# sudo setpci -s 04:10.0 f2a.w=0000
[root@supermicro manuel]# sudo setpci -s 82:00.0 f2a.w=0000
[root@supermicro manuel]# sudo setpci -s 83:08.0 f2a.w=0000
[root@supermicro manuel]# sudo setpci -s 83:10.0 f2a.w=0000
[root@supermicro manuel]# sudo setpci -s 86:00.0 f2a.w=0000
[root@supermicro manuel]# sudo setpci -s 87:08.0 f2a.w=0000
[root@supermicro manuel]# sudo setpci -s 87:10.0 f2a.w=0000

Now when I check ACSCtl again I get:

[root@supermicro manuel]# sudo lspci -vvv | grep ACSCtl
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

Then I went back to test the MPI example again and found that ... it worked!

[r1bsl@supermicro simpleMPITest]$ sh run.sh
rm *.run
Building mpi_test.cu > test.run
nvcc -I/home/r1bsl/nccl/include -L/home/r1bsl/nccl/lib -I/home/r1bsl/openMPI/include -L/home/r1bsl/openMPI/lib -lmpi -gencode=arch=compute_35,code=sm_35 -O3 -lineinfo -std=c++11 -maxrregcount 96 --compiler-options "-O3 -fPIC -fvisibility=hidden" -o test.run mpi_test.cu -lnccl -L/usr/local/cuda-7.5/lib64 -lcudart -lcuda -lcurand -lnvToolsExt
MPI initialized
rank 1 has device 1
rank 2 has device 2
rank 3 has device 3
rank 4 has device 4
rank 5 has device 5
rank 0 has device 0
nccl communicator created!
CUDA streams created!
Input values set. Starting Test:
Reduction complete:
streams synchronization complete:
streams synchronization complete:
Checking results:
streams synchronization complete:
streams synchronization complete:
streams synchronization complete:
streams synchronization complete:
Test PASSED.
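For the record, the per-bridge setpci calls above can be generated automatically rather than typed one by one. A minimal sketch, assuming the ACS Control register sits at offset f2a.w on these PEX 8747 switches as in the commands above (other switch models may place the ACS capability elsewhere, so verify with lspci -vvv before relying on this):

#!/bin/bash
# Sketch: disable ACS on every PLX PEX 8747 bridge in the system.
# 10b5:8747 is the PCI vendor:device ID of the PEX 8747; adjust if your
# switches differ. Run as root.
for bdf in $(lspci -d 10b5:8747 | awk '{print $1}'); do
    echo "Clearing ACS control on $bdf"
    setpci -s "$bdf" f2a.w=0000
done
# Every ACSCtl line should now report all flags as cleared ('-').
lspci -vvv | grep ACSCtl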
Just for the record, my machine has this BIOS version:

[r1bsl@supermicro]# sudo dmidecode | less
SMBIOS 2.8 present.
132 structures occupying 6109 bytes.
Table at 0x000ED8A0.
Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
Vendor: American Megatrends Inc.
Version: 1.0b
Release Date: 01/07/2015
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 16384 kB
Characteristics:
PCI is supported
BIOS is upgradeable
BIOS shadowing is allowed
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
EDD is supported
5.25"/1.2 MB floppy services are supported (int 13h)
3.5"/720 kB floppy services are supported (int 13h)
3.5"/2.88 MB floppy services are supported (int 13h)
Print screen service is supported (int 5h)
8042 keyboard services are supported (int 9h)
Serial services are supported (int 14h)
Printer services are supported (int 17h)
ACPI is supported
USB legacy is supported
BIOS boot specification is supported
Targeted content distribution is supported
UEFI is supported
BIOS Revision: 5.6

I'll try to update my computer's BIOS, but from what I'm reading, Linux may re-enable ACS every time I reboot. Is there a simple way to make sure it stays disabled at boot time?
I'm happy to hear that helped! This issue has been reported as a deadlock before, even though it's really just a [super dramatic] reduction in communication speed.
It seems like you've already seen this comment:
If you can't get that to work, you could always set up a script to run after boot, I guess.
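A minimal sketch of such a boot-time script, assuming an rc.local-style mechanism (on a systemd distribution, /etc/rc.d/rc.local must be executable and the rc-local service active; a udev rule or a small systemd unit would work just as well, and the register offset is the one discussed above for these PEX 8747 switches):

# Sketch: re-apply the ACS fix at every boot by appending it to rc.local.
cat >> /etc/rc.d/rc.local <<'EOF'
# Disable ACS on all PLX PEX 8747 PCIe switches so GPU peer-to-peer traffic
# is not redirected through the root complex (NCCL stalls otherwise).
for bdf in $(lspci -d 10b5:8747 | awk '{print $1}'); do
    setpci -s "$bdf" f2a.w=0000
done
EOF
chmod +x /etc/rc.d/rc.local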
Dear NCCL team,
First of all, thanks very much for such a nice open-source project.
I just got to know about NCCL through the Parallel Forall blog.
Currently, I'm testing your examples on a small production PC, and I noticed that the topology I'm using is a little bit complex, namely:
As you may see, I'm working with K80-type GPUs in this machine.
I've noticed that I have no problem running your tests using one of the internal GPUs, e.g.:
However, if I want to run the test using both internal GPUs of a single K80 card, I get into trouble:
The execution stalls and I have no option but to kill it.
My question is: can NCCL handle such a complex topology? And if so, how can I modify the examples so that I can run them with all 6 of my GPUs?
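(For completeness: once ACS is disabled as discussed in the comments above, the MPI example does not need code changes to use all six GPUs; it is just a matter of launching one rank per device, following the same argument convention as the earlier test.run invocation in this thread.)

# One MPI rank per K80 GPU; the trailing arguments are the CUDA device
# ordinals assigned to each rank, as in the 3-GPU run shown earlier.
~/openMPI/bin/mpirun -np 6 test.run 0 1 2 3 4 5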