single/all_gather_test is always stuck #120

Closed
starsblinking opened this issue Nov 24, 2017 · 12 comments

@starsblinking

starsblinking commented Nov 24, 2017

Hi, everyone. I've hit a problem similar to issue #19, but with some differences.
I installed CUDA 8, cuDNN 6, and Open MPI 3.0.0. My Ubuntu Server 16.04 machine has two GTX 1080 Ti cards.
Running lspci -tv | grep NVIDIA shows:

       +-03.1-[29]--+-00.0  NVIDIA Corporation Device 1b06
       |            \-00.1  NVIDIA Corporation Device 10ef
       +-03.2-[2a]--+-00.0  NVIDIA Corporation Device 1b06
       |            \-00.1  NVIDIA Corporation Device 10ef

Running lspci -vvv shows:

29:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1) (prog-if 00 [VGA controller])
    Subsystem: ZOTAC International (MCO) Ltd. Device 1470
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 66
    Region 0: Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
    Region 1: Memory at e0000000 (64-bit, prefetchable) [size=256M]
    Region 3: Memory at f0000000 (64-bit, prefetchable) [size=32M]
    Region 5: I/O ports at d000 [size=128]
    [virtual] Expansion ROM at f7000000 [disabled] [size=512K]
    Capabilities: <access denied>
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384

and

2a:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1) (prog-if 00 [VGA controller])
    Subsystem: ZOTAC International (MCO) Ltd. Device 1470
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 67
    Region 0: Memory at f4000000 (32-bit, non-prefetchable) [size=16M]
    Region 1: Memory at c0000000 (64-bit, prefetchable) [size=256M]
    Region 3: Memory at d0000000 (64-bit, prefetchable) [size=32M]
    Region 5: I/O ports at c000 [size=128]
    [virtual] Expansion ROM at f5000000 [disabled] [size=512K]
    Capabilities: <access denied>
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384

sudo lspci -vvv | grep ACSCtl returns nothing.
So it seems I am not hitting the ACS problem described in issue #19.
Even after running
sudo setpci -s 2a:00.0 f2a.w=00001
sudo setpci -s 29:00.0 f2a.w=0000
nothing changes.
If I use only one GPU, the test does not hang (shown here with all_reduce_test):
$ ./build/test/single/all_reduce_test 10000000 1 0

 #Using devices
 # Rank  0 uses device  0 [0x29] GeForce GTX 1080 Ti

#                                            out-of-place                    in-place
#  bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res
10000000      10000000    char     sum    0.064  157.38   0.00    0e+00    0.005  2083.77   0.00    0e+00
10000000      10000000    char    prod    0.062  160.55   0.00    0e+00    0.005  2183.88   0.00    0e+00
10000000      10000000    char     max    0.064  157.28   0.00    0e+00    0.004  2284.15   0.00    0e+00
10000000      10000000    char     min    0.062  160.26   0.00    0e+00    0.005  2179.60   0.00    0e+00
10000000       2500000     int     sum    0.068  146.14   0.00    0e+00    0.015  688.37   0.00    0e+00
10000000       2500000     int    prod    0.068  147.13   0.00    0e+00    0.005  1972.39   0.00    0e+00
10000000       2500000     int     max    0.067  149.22   0.00    0e+00    0.004  2268.60   0.00    0e+00
10000000       2500000     int     min    0.068  147.63   0.00    0e+00    0.004  2268.60   0.00    0e+00
10000000       5000000    half     sum    0.068  146.18   0.00    0e+00    0.004  2273.76   0.00    0e+00
10000000       5000000    half    prod    0.067  148.40   0.00    0e+00    0.005  2079.43   0.00    0e+00
10000000       5000000    half     max    0.069  145.65   0.00    0e+00    0.004  2289.38   0.00    0e+00
10000000       5000000    half     min    0.067  149.85   0.00    0e+00    0.006  1754.08   0.00    0e+00
10000000       2500000   float     sum    0.069  145.24   0.00    0e+00    0.004  2321.26   0.00    0e+00
10000000       2500000   float    prod    0.068  147.65   0.00    0e+00    0.004  2257.85   0.00    0e+00
10000000       2500000   float     max    0.068  148.07   0.00    0e+00    0.004  2268.60   0.00    0e+00
10000000       2500000   float     min    0.067  148.82   0.00    0e+00    0.004  2273.76   0.00    0e+00
10000000       1250000  double     sum    0.067  148.44   0.00    0e+00    0.005  2213.37   0.00    0e+00
10000000       1250000  double    prod    0.067  148.75   0.00    0e+00    0.004  2227.67   0.00    0e+00
10000000       1250000  double     max    0.067  149.78   0.00    0e+00    0.005  2203.13   0.00    0e+00
10000000       1250000  double     min    0.067  149.00   0.00    0e+00    0.004  2284.15   0.00    0e+00
10000000       1250000   int64     sum    0.068  146.91   0.00    0e+00    0.005  2160.76   0.00    0e+00
10000000       1250000   int64    prod    0.068  147.19   0.00    0e+00    0.007  1446.55   0.00    0e+00
10000000       1250000   int64     max    0.069  145.12   0.00    0e+00    0.005  2151.46   0.00    0e+00
10000000       1250000   int64     min    0.066  151.25   0.00    0e+00    0.005  1934.24   0.00    0e+00
10000000       1250000  uint64     sum    0.067  148.60   0.00    0e+00    0.005  1912.05   0.00    0e+00
10000000       1250000  uint64    prod    0.068  146.29   0.00    0e+00    0.004  2258.36   0.00    0e+00
10000000       1250000  uint64     max    0.067  149.20   0.00    0e+00    0.005  2058.04   0.00    0e+00
10000000       1250000  uint64     min    0.067  149.13   0.00    0e+00    0.005  2141.79   0.00    0e+00

$ ./build/test/single/all_reduce_test 10000000 1 1

 #Using devices
 # Rank  0 uses device  1 [0x2a] GeForce GTX 1080 Ti

#                                            out-of-place                    in-place
# bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res
10000000      10000000    char     sum    0.063  158.81   0.00    0e+00    0.005  2012.48   0.00    0e+00
10000000      10000000    char    prod    0.061  162.72   0.00    0e+00    0.005  2169.67   0.00    0e+00
10000000      10000000    char     max    0.061  163.17   0.00    0e+00    0.005  2151.00   0.00    0e+00
10000000      10000000    char     min    0.062  162.27   0.00    0e+00    0.006  1814.55   0.00    0e+00
10000000       2500000     int     sum    0.067  149.04   0.00    0e+00    0.004  2326.66   0.00    0e+00
10000000       2500000     int    prod    0.067  149.22   0.00    0e+00    0.005  2115.06   0.00    0e+00
10000000       2500000     int     max    0.073  136.34   0.00    0e+00    0.005  2119.09   0.00    0e+00
10000000       2500000     int     min    0.069  145.60   0.00    0e+00    0.005  1984.13   0.00    0e+00
10000000       5000000    half     sum    0.067  148.22   0.00    0e+00    0.007  1434.10   0.00    0e+00
10000000       5000000    half    prod    0.071  140.86   0.00    0e+00    0.004  2223.21   0.00    0e+00
10000000       5000000    half     max    0.067  148.46   0.00    0e+00    0.005  2110.15   0.00    0e+00
10000000       5000000    half     min    0.068  147.17   0.00    0e+00    0.006  1602.05   0.00    0e+00
10000000       2500000   float     sum    0.068  147.30   0.00    0e+00    0.004  2337.54   0.00    0e+00
10000000       2500000   float    prod    0.067  148.82   0.00    0e+00    0.004  2243.16   0.00    0e+00
10000000       2500000   float     max    0.066  150.48   0.00    0e+00    0.006  1682.94   0.00    0e+00
10000000       2500000   float     min    0.067  148.62   0.00    0e+00    0.005  1992.43   0.00    0e+00
10000000       1250000  double     sum    0.068  146.89   0.00    0e+00    0.005  2183.88   0.00    0e+00
10000000       1250000  double    prod    0.067  149.69   0.00    0e+00    0.005  2028.81   0.00    0e+00
10000000       1250000  double     max    0.067  150.27   0.00    0e+00    0.006  1738.83   0.00    0e+00
10000000       1250000  double     min    0.067  150.36   0.00    0e+00    0.004  2258.36   0.00    0e+00
10000000       1250000   int64     sum    0.068  147.92   0.00    0e+00    0.005  2208.48   0.00    0e+00
10000000       1250000   int64    prod    0.068  147.52   0.00    0e+00    0.005  2169.67   0.00    0e+00
10000000       1250000   int64     max    0.067  149.53   0.00    0e+00    0.005  2020.61   0.00    0e+00
10000000       1250000   int64     min    0.067  148.93   0.00    0e+00    0.005  2146.38   0.00    0e+00
10000000       1250000  uint64     sum    0.067  148.58   0.00    0e+00    0.004  2321.26   0.00    0e+00
10000000       1250000  uint64    prod    0.067  150.16   0.00    0e+00    0.006  1697.50   0.00    0e+00
10000000       1250000  uint64     max    0.067  148.15   0.00    0e+00    0.005  2132.65   0.00    0e+00
10000000       1250000  uint64     min    0.067  149.31   0.00    0e+00    0.005  2132.65   0.00    0e+00

With both GPUs, the test hangs:
$ ./build/test/single/all_reduce_test 10000000

 # Using devices
 #   Rank  0 uses device  0 [0x29] GeForce GTX 1080 Ti
 #   Rank  1 uses device  1 [0x2a] GeForce GTX 1080 Ti
 
#                                                 out-of-place                    in-place
#      bytes             N    type      op     time  algbw  busbw      res     time  algbw  busbw      res
^C

While it is hung, both GPUs sit at 100% utilization:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:29:00.0 Off |                  N/A |
| 41%   54C    P2    77W / 250W |    193MiB / 11170MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:2A:00.0 Off |                  N/A |
| 33%   43C    P2    72W / 250W |    193MiB / 11172MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                            
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1032      C   ./build/test/single/all_reduce_test          183MiB |
|    1      1032      C   ./build/test/single/all_reduce_test          183MiB |
+-----------------------------------------------------------------------------+

At the same time, top shows the process using 200% CPU.
I have run out of ideas and would appreciate a solution. Thank you very much.

@sjeaugey
Member

As a double check, make sure you run sudo lspci -vvv | grep ACSCtl with sudo. As a regular (non-root) user, you wouldn't see the ACS properties.

Then, the setpci commands should be run on the PCI switches, not the GPUs.
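
As a rough illustration of the check above, the sketch below reduces lspci -vvv output to just the devices that report an ACSCtl line, so it is easy to see which bridges or switches (rather than GPUs) expose ACS. The function name acs_report is made up for this example, and it assumes device header lines start at column 0, as they do in the dumps in this thread.

```shell
# Print each PCI device header that has an ACSCtl line, followed by that line.
# Reads `lspci -vvv` text on stdin, so it also works on saved output.
acs_report() {
    awk '
        /^[0-9a-f]/ { dev = $0 }    # device header lines start at column 0
        /ACSCtl:/   { sub(/^[ \t]+/, ""); print dev; print "    " $0 }
    '
}

# Typical use (root is needed, otherwise capabilities show as <access denied>):
#   sudo lspci -vvv | acs_report
```

In lspci output a trailing + means the flag is set and a trailing - means it is clear, so a line of all-minus flags indicates ACS is not redirecting traffic on that device.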

@starsblinking
Author

@sjeaugey Thank you very much for your reply. sudo lspci -vvv | grep ACSCtl returns nothing either. I don't know much about PC hardware and only tried setpci as an experiment. Do you know why my all_gather_test always hangs?

@gzmarchenko

gzmarchenko commented Nov 28, 2017

We're experiencing exactly the same problem (same Ubuntu release and video driver). The difference is that we have four GTX 1080 Ti cards like this:

06:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation Device 120f
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 18
        Region 0: Memory at f0000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 2f80000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at 2f90000000 (64-bit, prefetchable) [size=32M]
        Region 5: I/O ports at b000 [size=128]
        [virtual] Expansion ROM at f1000000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <16us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR+, OBFF Via message
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [250 v1] Latency Tolerance Reporting
                Max snoop latency: 34326183936ns
                Max no snoop latency: 34326183936ns
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [420 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout+ NonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900 v1] #19
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_384_drm, nvidia_384

sudo lspci -vvv | grep ACSCtl shows that all is fine:

           ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
           ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
           ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
           ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
           ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

But if we use any two (or all four) cards with all_reduce_test simultaneously - it hangs. :(
Every single card works just fine.

I'm out of ideas.

@nluehr
Contributor

nluehr commented Nov 28, 2017

I've heard of similar issues being related to IOMMU / VT-d. Can you try disabling this in your BIOS?

@gzmarchenko

gzmarchenko commented Nov 29, 2017

Unfortunately for us, both VT-d and VT-x had already been disabled. :(
So that doesn't change anything.

@starsblinking
Author

@nluehr Thanks for your suggestion. I have already disabled CPU virtualization.
@gzmarchenko It seems that the problem has nothing to do with CPU virtualization.
My CPU is an AMD Ryzen 7 1700X. The motherboard is an ASUS ROG C6H.

@sjeaugey
Member

Oh, you are using an AMD CPU. Then I suggest you review this post and see if it fixes your problem:
pytorch/pytorch#1637 (comment)
(basically adding iommu=soft to the kernel command line)
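
For readers who have not edited the kernel command line before, here is a minimal sketch of that change on an Ubuntu-style GRUB setup. The helper name add_iommu_soft is hypothetical, and the sketch assumes the file contains a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line, which is the usual layout of /etc/default/grub but may differ on other systems.

```shell
# Append iommu=soft to GRUB_CMDLINE_LINUX_DEFAULT in the given file,
# unless it is already present. Rewrites the file via a temporary copy.
add_iommu_soft() {
    file=$1
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*iommu=soft' "$file"; then
        return 0    # nothing to do
    fi
    sed 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 iommu=soft"/' "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
}

# Intended use (as root, after backing up the original file):
#   cp /etc/default/grub /etc/default/grub.bak
#   add_iommu_soft /etc/default/grub
#   update-grub      # regenerate the boot configuration, then reboot
```

Trying the function on a copy of the file first is a cheap way to confirm that the resulting line looks right before touching the real configuration.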

@gzmarchenko are you using an AMD CPU as well?

@gzmarchenko

gzmarchenko commented Dec 1, 2017

@sjeaugey
No, mine is
model name : Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz with
Manufacturer: ASUSTeK COMPUTER INC. Product Name: PRIME Z270-P
:(

@starsblinking
Author

@sjeaugey Thanks very much! Everything works now. I had assumed that the CPU virtualization option in the BIOS was the IOMMU setting, but it is not, and I couldn't find any IOMMU option in the BIOS setup at boot. Adding the parameter with sudo vim /etc/default/grub and then running sudo update-grub did work.
Thanks, everyone!
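
After rebooting, it can be worth confirming that the parameter actually reached the running kernel. The sketch below checks a kernel command line string for iommu=soft as a whole word; has_iommu_soft is a made-up helper name, and reading /proc/cmdline assumes a Linux system.

```shell
# Succeeds if the given kernel command line contains iommu=soft as a whole word.
has_iommu_soft() {
    case " $1 " in
        *" iommu=soft "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On Linux, the parameters of the running kernel are exposed in /proc/cmdline.
if [ -r /proc/cmdline ]; then
    if has_iommu_soft "$(cat /proc/cmdline)"; then
        echo "iommu=soft is active"
    else
        echo "iommu=soft is NOT active"
    fi
fi
```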

@gzmarchenko

gzmarchenko commented Dec 1, 2017

I'm glad to hear it helped @starsblinking, but it didn't work for us.

@sjeaugey
Member

sjeaugey commented Dec 1, 2017

@gzmarchenko sorry to see that it didn't work for you. Still, most likely, this is due to GPU Direct P2P not working between the GPUs, which is usually caused by CPU/BIOS settings. You may want to reproduce the problem with the CUDA P2P tests and, if they fail as well, report the problem to CUDA (through developer.nvidia.com).

Also, I would advise using NCCL2 (https://developer.nvidia.com/nccl, tests at https://github.com/nvidia/nccl-tests). Until your P2P problem is resolved, you can set NCCL_P2P_DISABLE=1, though performance will be lower.
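
A small sketch of how the variable might be applied to a single run rather than exported for the whole shell session. The wrapper name run_without_p2p is invented here, and the binary path in the usage comment assumes nccl-tests was built in its default ./build directory.

```shell
# Run any command with NCCL's peer-to-peer transport disabled for that command only.
run_without_p2p() {
    env NCCL_P2P_DISABLE=1 "$@"
}

# Example with the NCCL2 tests on two GPUs (path is an assumption):
#   run_without_p2p ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
```

Scoping the variable to one command makes it easy to compare a run with P2P disabled against a normal run in the same terminal.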

@gzmarchenko

gzmarchenko commented Dec 11, 2017

@sjeaugey, we removed 2 of the 4 GPUs and moved the other two into x16 slots (with x1 passive adapters, though).
With this arrangement, we were able to run the tests.

Unfortunately, plugging in another card makes it stop working again. Is there anything else we should consider before changing the motherboard?
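
When experimenting with slots and adapters like this, it may help to compare the link width each card is capable of with the width it actually negotiated; the lspci dump earlier in this thread shows LnkCap at x16 but LnkSta at x1. The sketch below extracts both values from lspci -vvv output. link_widths is a hypothetical helper name, and it assumes the LnkCap and LnkSta line format seen in that dump.

```shell
# For each device in `lspci -vvv` output (stdin), print capable vs. negotiated
# PCIe link width, e.g. "06:00.0 cap=x16 sta=x1".
link_widths() {
    awk '
        # Return the "xN" part of "Width xN" on a line, or "?" if absent.
        function width(line) {
            if (match(line, /Width x[0-9]+/))
                return substr(line, RSTART + 6, RLENGTH - 6)
            return "?"
        }
        /^[0-9a-f]/ { dev = $1; cap = "?" }    # new device header
        /LnkCap:/   { cap = width($0) }
        /LnkSta:/   { print dev, "cap=" cap, "sta=" width($0) }
    '
}

# Typical use (root is needed to read the capability list):
#   sudo lspci -vvv | link_widths
```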
