single/all_gather_test is always stuck #120

starsblinking · 2017-11-24T12:48:28Z

Hi, everyone. I've run into a problem similar to issue #19, but with some differences.
I installed CUDA 8, cuDNN 6 and Open MPI 3.0.0. My Ubuntu Server 16.04 machine has 2 GTX 1080 Ti cards.
lspci -tv | grep NVIDIA shows:

       +-03.1-[29]--+-00.0  NVIDIA Corporation Device 1b06
       |            \-00.1  NVIDIA Corporation Device 10ef
       +-03.2-[2a]--+-00.0  NVIDIA Corporation Device 1b06
       |            \-00.1  NVIDIA Corporation Device 10ef

lspci -vvv shows:

29:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1) (prog-if 00 [VGA controller])
    Subsystem: ZOTAC International (MCO) Ltd. Device 1470
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 66
    Region 0: Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
    Region 1: Memory at e0000000 (64-bit, prefetchable) [size=256M]
    Region 3: Memory at f0000000 (64-bit, prefetchable) [size=32M]
    Region 5: I/O ports at d000 [size=128]
    [virtual] Expansion ROM at f7000000 [disabled] [size=512K]
    Capabilities: <access denied>
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384

and

2a:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1) (prog-if 00 [VGA controller])
    Subsystem: ZOTAC International (MCO) Ltd. Device 1470
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 67
    Region 0: Memory at f4000000 (32-bit, non-prefetchable) [size=16M]
    Region 1: Memory at c0000000 (64-bit, prefetchable) [size=256M]
    Region 3: Memory at d0000000 (64-bit, prefetchable) [size=32M]
    Region 5: I/O ports at c000 [size=128]
    [virtual] Expansion ROM at f5000000 [disabled] [size=512K]
    Capabilities: <access denied>
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384

sudo lspci -vvv | grep ACSCtl returns nothing.
So it seems I'm not hitting the ACS problem from issue #19.
Even if I run
sudo setpci -s 2a:00.0 f2a.w=00001
sudo setpci -s 29:00.0 f2a.w=0000
nothing changes.
If I use only one GPU, single/all_gather_test does not hang:
$ ./build/test/single/all_reduce_test 10000000 1 0

 #Using devices
 # Rank  0 uses device  0 [0x29] GeForce GTX 1080 Ti

#                                            out-of-place                    in-place
#  bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res
10000000      10000000    char     sum    0.064  157.38   0.00    0e+00    0.005  2083.77   0.00    0e+00
10000000      10000000    char    prod    0.062  160.55   0.00    0e+00    0.005  2183.88   0.00    0e+00
10000000      10000000    char     max    0.064  157.28   0.00    0e+00    0.004  2284.15   0.00    0e+00
10000000      10000000    char     min    0.062  160.26   0.00    0e+00    0.005  2179.60   0.00    0e+00
10000000       2500000     int     sum    0.068  146.14   0.00    0e+00    0.015  688.37   0.00    0e+00
10000000       2500000     int    prod    0.068  147.13   0.00    0e+00    0.005  1972.39   0.00    0e+00
10000000       2500000     int     max    0.067  149.22   0.00    0e+00    0.004  2268.60   0.00    0e+00
10000000       2500000     int     min    0.068  147.63   0.00    0e+00    0.004  2268.60   0.00    0e+00
10000000       5000000    half     sum    0.068  146.18   0.00    0e+00    0.004  2273.76   0.00    0e+00
10000000       5000000    half    prod    0.067  148.40   0.00    0e+00    0.005  2079.43   0.00    0e+00
10000000       5000000    half     max    0.069  145.65   0.00    0e+00    0.004  2289.38   0.00    0e+00
10000000       5000000    half     min    0.067  149.85   0.00    0e+00    0.006  1754.08   0.00    0e+00
10000000       2500000   float     sum    0.069  145.24   0.00    0e+00    0.004  2321.26   0.00    0e+00
10000000       2500000   float    prod    0.068  147.65   0.00    0e+00    0.004  2257.85   0.00    0e+00
10000000       2500000   float     max    0.068  148.07   0.00    0e+00    0.004  2268.60   0.00    0e+00
10000000       2500000   float     min    0.067  148.82   0.00    0e+00    0.004  2273.76   0.00    0e+00
10000000       1250000  double     sum    0.067  148.44   0.00    0e+00    0.005  2213.37   0.00    0e+00
10000000       1250000  double    prod    0.067  148.75   0.00    0e+00    0.004  2227.67   0.00    0e+00
10000000       1250000  double     max    0.067  149.78   0.00    0e+00    0.005  2203.13   0.00    0e+00
10000000       1250000  double     min    0.067  149.00   0.00    0e+00    0.004  2284.15   0.00    0e+00
10000000       1250000   int64     sum    0.068  146.91   0.00    0e+00    0.005  2160.76   0.00    0e+00
10000000       1250000   int64    prod    0.068  147.19   0.00    0e+00    0.007  1446.55   0.00    0e+00
10000000       1250000   int64     max    0.069  145.12   0.00    0e+00    0.005  2151.46   0.00    0e+00
10000000       1250000   int64     min    0.066  151.25   0.00    0e+00    0.005  1934.24   0.00    0e+00
10000000       1250000  uint64     sum    0.067  148.60   0.00    0e+00    0.005  1912.05   0.00    0e+00
10000000       1250000  uint64    prod    0.068  146.29   0.00    0e+00    0.004  2258.36   0.00    0e+00
10000000       1250000  uint64     max    0.067  149.20   0.00    0e+00    0.005  2058.04   0.00    0e+00
10000000       1250000  uint64     min    0.067  149.13   0.00    0e+00    0.005  2141.79   0.00    0e+00

$ ./build/test/single/all_reduce_test 10000000 1 1

 #Using devices
 # Rank  0 uses device  1 [0x2a] GeForce GTX 1080 Ti

#                                            out-of-place                    in-place
# bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res
10000000      10000000    char     sum    0.063  158.81   0.00    0e+00    0.005  2012.48   0.00    0e+00
10000000      10000000    char    prod    0.061  162.72   0.00    0e+00    0.005  2169.67   0.00    0e+00
10000000      10000000    char     max    0.061  163.17   0.00    0e+00    0.005  2151.00   0.00    0e+00
10000000      10000000    char     min    0.062  162.27   0.00    0e+00    0.006  1814.55   0.00    0e+00
10000000       2500000     int     sum    0.067  149.04   0.00    0e+00    0.004  2326.66   0.00    0e+00
10000000       2500000     int    prod    0.067  149.22   0.00    0e+00    0.005  2115.06   0.00    0e+00
10000000       2500000     int     max    0.073  136.34   0.00    0e+00    0.005  2119.09   0.00    0e+00
10000000       2500000     int     min    0.069  145.60   0.00    0e+00    0.005  1984.13   0.00    0e+00
10000000       5000000    half     sum    0.067  148.22   0.00    0e+00    0.007  1434.10   0.00    0e+00
10000000       5000000    half    prod    0.071  140.86   0.00    0e+00    0.004  2223.21   0.00    0e+00
10000000       5000000    half     max    0.067  148.46   0.00    0e+00    0.005  2110.15   0.00    0e+00
10000000       5000000    half     min    0.068  147.17   0.00    0e+00    0.006  1602.05   0.00    0e+00
10000000       2500000   float     sum    0.068  147.30   0.00    0e+00    0.004  2337.54   0.00    0e+00
10000000       2500000   float    prod    0.067  148.82   0.00    0e+00    0.004  2243.16   0.00    0e+00
10000000       2500000   float     max    0.066  150.48   0.00    0e+00    0.006  1682.94   0.00    0e+00
10000000       2500000   float     min    0.067  148.62   0.00    0e+00    0.005  1992.43   0.00    0e+00
10000000       1250000  double     sum    0.068  146.89   0.00    0e+00    0.005  2183.88   0.00    0e+00
10000000       1250000  double    prod    0.067  149.69   0.00    0e+00    0.005  2028.81   0.00    0e+00
10000000       1250000  double     max    0.067  150.27   0.00    0e+00    0.006  1738.83   0.00    0e+00
10000000       1250000  double     min    0.067  150.36   0.00    0e+00    0.004  2258.36   0.00    0e+00
10000000       1250000   int64     sum    0.068  147.92   0.00    0e+00    0.005  2208.48   0.00    0e+00
10000000       1250000   int64    prod    0.068  147.52   0.00    0e+00    0.005  2169.67   0.00    0e+00
10000000       1250000   int64     max    0.067  149.53   0.00    0e+00    0.005  2020.61   0.00    0e+00
10000000       1250000   int64     min    0.067  148.93   0.00    0e+00    0.005  2146.38   0.00    0e+00
10000000       1250000  uint64     sum    0.067  148.58   0.00    0e+00    0.004  2321.26   0.00    0e+00
10000000       1250000  uint64    prod    0.067  150.16   0.00    0e+00    0.006  1697.50   0.00    0e+00
10000000       1250000  uint64     max    0.067  148.15   0.00    0e+00    0.005  2132.65   0.00    0e+00
10000000       1250000  uint64     min    0.067  149.31   0.00    0e+00    0.005  2132.65   0.00    0e+00

With both GPUs, the test hangs:
$ ./build/test/single/all_reduce_test 10000000

 # Using devices
 #   Rank  0 uses device  0 [0x29] GeForce GTX 1080 Ti
 #   Rank  1 uses device  1 [0x2a] GeForce GTX 1080 Ti
 
#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res
^C

While it hangs, both GPUs sit at 100% utilization:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:29:00.0 Off |                  N/A |
| 41%   54C    P2    77W / 250W |    193MiB / 11170MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:2A:00.0 Off |                  N/A |
| 33%   43C    P2    72W / 250W |    193MiB / 11172MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                            
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1032      C   ./build/test/single/all_reduce_test          183MiB |
|    1      1032      C   ./build/test/single/all_reduce_test          183MiB |
+-----------------------------------------------------------------------------+

At the same time, top shows the CPU at 200% usage.
I'm exhausted and hoping for a solution. Thank you very much.
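
To tell a genuine hang apart from a slow run, the three runs above can be wrapped in a time limit. This is a minimal sketch, assuming GNU coreutils `timeout` is available; `run_or_report_hang` is a hypothetical helper name, and the 60-second limit in the usage comment is an arbitrary choice.

```shell
# Hypothetical helper: run a command under a time limit and flag a hang.
# Assumes GNU coreutils `timeout`, which exits with status 124 when the
# time limit is reached.
run_or_report_hang() {
  limit=$1          # time limit in seconds
  shift
  timeout "$limit" "$@"
  status=$?
  if [ "$status" -eq 124 ]; then
    echo "HANG: '$*' did not finish within ${limit}s"
  fi
  return "$status"
}

# Usage, with the test arguments from the post above:
#   run_or_report_hang 60 ./build/test/single/all_reduce_test 10000000 1 0   # GPU 0 only
#   run_or_report_hang 60 ./build/test/single/all_reduce_test 10000000 1 1   # GPU 1 only
#   run_or_report_hang 60 ./build/test/single/all_reduce_test 10000000       # both GPUs
```

If only the last run reports a hang, the problem is in GPU-to-GPU communication rather than in either card on its own.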

sjeaugey · 2017-11-28T01:12:47Z

As a double check, make sure you run the command with sudo: sudo lspci -vvv | grep ACSCtl. Without root privileges, lspci does not show ACS properties.

Then, the setpci commands should be run on the PCI switches, not the GPUs.
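
The check described here can be sketched as a small filter over `lspci -vvv` output. It prints the address of every device whose ACSCtl line has at least one flag enabled (a `+`); those are the devices, typically PCI bridges or switches rather than GPUs, where ACS would need attention. `acs_enabled_devices` is a hypothetical helper name, and the sketch assumes lspci's default layout, in which each device starts with an unindented address line.

```shell
# Hypothetical helper: read `lspci -vvv` output on stdin and print the
# address of each device that has any ACS control flag enabled.
acs_enabled_devices() {
  awk '
    # Unindented device header, e.g. "00:03.1 PCI bridge: ..."
    /^[0-9a-f:]+\.[0-9a-f] / { dev = $1 }
    # ACSCtl line with at least one "+" flag, e.g. "ACSCtl: SrcValid+ ..."
    /ACSCtl:/ && /\+/        { print dev }
  '
}

# Usage (root is needed, otherwise the capabilities are hidden):
#   sudo lspci -vvv | acs_enabled_devices
```

Empty output means either that no device has ACS enabled or that no device exposes the ACS capability at all, which matches the "returns nothing" case reported above.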

starsblinking · 2017-11-28T08:14:23Z

@sjeaugey Thank you very much for your reply. sudo lspci -vvv | grep ACSCtl returns nothing either. I don't know much about PC hardware and only ran setpci as an experiment. Do you have any idea why my all_gather_test always hangs?

gzmarchenko · 2017-11-28T08:50:25Z

We're experiencing exactly the same problem (Ubuntu and video driver versions are the same). The difference is that we have 4 GTX 1080 Ti cards like this:

06:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation Device 120f
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 18
        Region 0: Memory at f0000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 2f80000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at 2f90000000 (64-bit, prefetchable) [size=32M]
        Region 5: I/O ports at b000 [size=128]
        [virtual] Expansion ROM at f1000000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <16us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [250 v1] Latency Tolerance Reporting
                Max snoop latency: 34326183936ns
                Max no snoop latency: 34326183936ns
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [420 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout+ NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900 v1] #19
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384

sudo lspci -vvv | grep ACSCtl shows that all is fine:

           ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
           ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
           ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
           ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
           ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

But if we use any two (or all four) cards with all_reduce_test simultaneously - it hangs. :(
Every single card works just fine.

I'm out of ideas.

nluehr · 2017-11-28T18:04:13Z

I've heard of similar issues being related to IOMMU / VT-d. Can you try disabling this in your BIOS?

gzmarchenko · 2017-11-29T08:33:01Z

Unfortunately, both VT-d and VT-x were already disabled on our machine. :(
So that doesn't change anything. ((

starsblinking · 2017-11-30T13:50:23Z

@nluehr Thanks for your suggestion. I have already disabled CPU virtualization.
@gzmarchenko It seems that the problem has nothing to do with CPU virtualization.
My CPU is an AMD Ryzen 7 1700X. The motherboard is an ASUS ROG C6H.

sjeaugey · 2017-11-30T17:55:43Z

Oh, you are using an AMD CPU. Then I suggest you review this post and see if it fixes your problem:
pytorch/pytorch#1637 (comment)
(basically adding iommu=soft to the kernel command line)

@gzmarchenko are you using an AMD CPU as well?

gzmarchenko · 2017-12-01T08:05:09Z

@sjeaugey
No, mine is
model name : Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz with
Manufacturer: ASUSTeK COMPUTER INC. Product Name: PRIME Z270-P
:(

starsblinking · 2017-12-01T14:53:47Z

@sjeaugey Thanks very much! Everything works now. I had assumed that the CPU virtualization option in the BIOS was the IOMMU setting, but it isn't. I couldn't find any IOMMU option in the BIOS setup. Editing the kernel command line with sudo vim /etc/default/grub and then running sudo update-grub did the trick.
Thanks, everyone!
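
The fix that worked here, adding `iommu=soft` to the kernel command line through /etc/default/grub, can be sketched as follows. This assumes the stock Ubuntu layout, where the options live in a double-quoted `GRUB_CMDLINE_LINUX_DEFAULT="..."` line, and GNU sed for in-place editing; `add_iommu_soft` is a hypothetical helper name. It is worth trying on a copy of the file first.

```shell
# Hypothetical helper: append iommu=soft to GRUB_CMDLINE_LINUX_DEFAULT
# in the given grub defaults file, unless it is already there.
# Assumes the value is wrapped in double quotes and that sed is GNU sed.
add_iommu_soft() {
  grub_file=$1
  # Do nothing if the option is already present.
  if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*iommu=soft' "$grub_file"; then
    return 0
  fi
  # Re-emit the line with " iommu=soft" added before the closing quote.
  sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 iommu=soft"/' "$grub_file"
}

# Usage, from a root shell:
#   cp /etc/default/grub /etc/default/grub.bak   # keep a backup
#   add_iommu_soft /etc/default/grub
#   update-grub                                  # regenerate the boot config
#   reboot
#   cat /proc/cmdline                            # after reboot: should list iommu=soft
```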

gzmarchenko · 2017-12-01T15:10:53Z

I'm glad to hear it helped @starsblinking, but it didn't work for us.

sjeaugey · 2017-12-01T17:12:20Z

@gzmarchenko sorry to see that it didn't work for you. Still, most likely, this is due to GPU Direct P2P not working between the GPUs which is usually caused by CPU/BIOS settings. You may want to reproduce the problem with the CUDA P2P tests and if it doesn't work, report that problem to CUDA (through developer.nvidia.com).

Also, I would advise using NCCL2 (https://developer.nvidia.com/nccl, tests at https://github.com/nvidia/nccl-tests) and, until your P2P problem is resolved, setting NCCL_P2P_DISABLE=1. Performance will be lower though.
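
A minimal sketch of the workaround above: run a command with NCCL_P2P_DISABLE=1 set for that process only, so the rest of the shell session is unaffected. `run_without_p2p` is a hypothetical helper name, and the nccl-tests binary path in the usage comment is an assumption about where the tests were built.

```shell
# Hypothetical helper: run any command with NCCL peer-to-peer transfers
# disabled, without exporting the variable into the current shell.
run_without_p2p() {
  env NCCL_P2P_DISABLE=1 "$@"
}

# Usage (binary path is an assumption; adjust to your build directory):
#   run_without_p2p ./build/all_reduce_perf -g 2
#
# To inspect how the GPUs are connected to each other:
#   nvidia-smi topo -m
```

If the test completes with P2P disabled but hangs without it, that points at GPU Direct P2P as the failing piece, as suggested above.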

gzmarchenko · 2017-12-11T17:51:42Z

@sjeaugey, we removed 2 of the 4 GPUs and moved the remaining two into x16 slots (still through x1 passive adapters, though).
With this arrangement, we were able to run them.

It's a pity that plugging in another card breaks it again. Is there anything else we should think about before changing the motherboard?

starsblinking closed this as completed Jan 29, 2018

kuzemchik mentioned this issue May 25, 2018

Horovod example is hanging on more the 1 GPU horovod/horovod#280

Closed

Luonic mentioned this issue Nov 27, 2018

train with multi-gpu with MirroredStrategy will hang-up tensorflow/tensorflow#22889

Closed

sjeaugey mentioned this issue Sep 30, 2019

AllReduce hangs #257

Closed

jayagami mentioned this issue Jan 13, 2023

NCCL all_reduce_perf test hangs with multiple RTX 4090 GPUs, works fine when I swap in 2080tis NVIDIA/nccl-tests#117

Closed
