AllReduce hangs #257

Closed

vmarkovtsev opened this issue Sep 28, 2019 · 31 comments

@vmarkovtsev

My problem was diagnosed in tensorflow/tensorflow#32654 - please find all the info about my environment there.

I am using the master version of NCCL. I launch all_reduce_perf and it hangs, with 100% volatile GPU utilization reported.

./build/all_reduce_perf -b 8 -e 256M -f 2 -g 4
# nThread 1 nGpus 4 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid  15833 on jupyter-vmarkovtsev device  0 [0x02] GeForce GTX 1080 Ti
#   Rank  1 Pid  15833 on jupyter-vmarkovtsev device  1 [0x03] GeForce GTX 1080 Ti
#   Rank  2 Pid  15833 on jupyter-vmarkovtsev device  2 [0x82] GeForce GTX 1080 Ti
#   Rank  3 Pid  15833 on jupyter-vmarkovtsev device  3 [0x83] GeForce GTX 1080 Ti
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Bootstrap : Using [0]eth0:10.2.3.32<0>
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

jupyter-vmarkovtsev:15833:15833 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO NET/Socket : Using [0]eth0:10.2.3.32<0>
NCCL version 2.4.8+cuda10.0
jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO nranks 4
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Setting affinity for GPU 0 to ff00ff
jupyter-vmarkovtsev:15833:15833 [1] NCCL INFO Setting affinity for GPU 1 to ff00ff
jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO Channel 00 :    0   1   2   3
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
jupyter-vmarkovtsev:15833:15833 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
jupyter-vmarkovtsev:15833:15833 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/direct pointer
jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO Ring 00 : 3[3] -> 0[0] via direct shared memory
#
#                                                     out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Launch mode Group/CGMD
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680

jupyter-vmarkovtsev:15833:15833 [0] init.cc:1250 NCCL WARN Mismatched collective detected, please check your collective calls at and around rank 3. You can use NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=COLL to see the collective logs

I waited for 10 minutes; no more logs were printed.

@vmarkovtsev (Author)

Sometimes, it prints

jupyter-vmarkovtsev:15892:15892 [0] init.cc:1259 NCCL WARN Your program may be hanging, this may be caused by a collective mismatch around rank 3. Please check your collective calls at and around this rank. You can use NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=COLL to see the collective logs

@vmarkovtsev (Author)

If I insert

printf("mismatch: %d\nremoteOpCount %d\nopCount %d\n", (int)mismatch, remoteOpCount? ((int)*remoteOpCount) : -1, (int)opCount);

before if (mismatch > 20) { in primitives.h, it never stops printing; the output constantly alternates between the following two blocks:

mismatch: 21
remoteOpCount -1
opCount 0

and

mismatch: 0
remoteOpCount 0
opCount 0

@vmarkovtsev (Author)

cudaStreamQuery() always returns cudaErrorNotReady inside testStreamSynchronize(), which is called right after the warmup through TESTCHECK(completeColl(args));. Digging further...
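
For context, here is a minimal sketch of the kind of polling loop testStreamSynchronize() performs (my own simplification, not the actual nccl-tests code):

    #include <cuda_runtime.h>

    // Simplified polling loop in the spirit of testStreamSynchronize():
    // spin on cudaStreamQuery() until the stream drains or a real error shows up.
    cudaError_t pollStream(cudaStream_t stream) {
      while (true) {
        cudaError_t err = cudaStreamQuery(stream);
        if (err == cudaSuccess) return cudaSuccess;   // all queued work finished
        if (err != cudaErrorNotReady) return err;     // genuine CUDA error
        // cudaErrorNotReady: the kernel is still running. In the hang described
        // above this branch is taken forever because the NCCL kernel never exits.
      }
    }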

@vmarkovtsev (Author)

vmarkovtsev commented Sep 29, 2019

If I short-circuit ncclAsyncMode() to false in ncclEnqueueCheck() then it hangs in ncclBarrierEnqueueWait() at the very first warmup iteration for the very first device.

Going deeper: it hangs on ncclCpuBarrierOut().

Update: nah, this is a dead end: sync never calls CUDA.

@vmarkovtsev (Author)

It does not hang with -g 1. It hangs with -g 2.

@vmarkovtsev (Author)

vmarkovtsev commented Sep 29, 2019

I found that the cudaLaunchKernel() call in ncclBarrierEnqueueWait() has params->gridDim equal to -2130043008 and params->blockDim equal to 1. Nice!

Update: nah, this is for ncclComm::PARALLEL mode, mine is GROUP.

@vmarkovtsev (Author)

We call cudaLaunchCooperativeKernelMultiDevice() with gridDim 1 1 1 and blockDim 64 1 1. All is fine here.
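
(To verify the launch parameters, one can dump them right before the call; cudaLaunchParams is the standard CUDA runtime struct, while params and numDevices below are hypothetical names standing in for whatever NCCL passes at that point:)

    // Hypothetical debug dump placed just before
    // cudaLaunchCooperativeKernelMultiDevice(params, numDevices, flags).
    for (unsigned int i = 0; i < numDevices; ++i) {
      const cudaLaunchParams& p = params[i];
      printf("dev %u: gridDim %u %u %u blockDim %u %u %u sharedMem %zu\n",
             i, p.gridDim.x, p.gridDim.y, p.gridDim.z,
             p.blockDim.x, p.blockDim.y, p.blockDim.z, p.sharedMem);
    }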

@vmarkovtsev (Author)

We never leave ncclAllReduceRingLLKernel. According to a printf at the beginning of the kernel, it is executed exactly 256 times, which makes sense: 4 GPUs x 64 threads (blockDim) = 256. This is the first clue.

@vmarkovtsev (Author)

vmarkovtsev commented Sep 29, 2019

This code hangs during the first iteration of the outer and the inner for loops:

    // k-2 steps: copy to next GPU
    for (int j=1; j<nranks-1; ++j) {
      slice = ring->devUserRanks[nranks-j];
      offset = chunkOffset + slice * chunkSize;
      nelem = min(chunkSize, size-offset);

      LLprims.recvCopySend(thisOutput+offset, nelem);
    }

@vmarkovtsev (Author)

If I insert return; at the very beginning of ncclAllReduceRingLLKernel, the test passes (the result is not validated, so it considers itself successful).

@vmarkovtsev (Author)

Any call to LLprims is poisonous. With continue inserted before LLprims.send(thisInput+offset, nelem); the test passes; with it inserted after, the function returns but we still hang.

@vmarkovtsev (Author)

vmarkovtsev commented Sep 29, 2019

Inserting return; in LLGenericOp<> before the final FOR_SEND(postSend, offset); makes the test pass, provided we also continue right after LLprims.send(thisInput+offset, nelem);. Double-checked.

  template <int RECV, int SEND, int SRC, int DST>
  __device__ void LLGenericOp(const T* srcPtr, T* dstPtr, int nelem) {
    uint32_t nbytes = nelem < 0 ? 0 : nelem*sizeof(T);
    FOR_SEND(waitSend, nbytes*2);
    barrier();
    uint32_t npack = DIVUP(nbytes, sizeof(uint64_t));
    uint64_t* srcPack = (uint64_t*)srcPtr;
    uint64_t* dstPack = (uint64_t*)dstPtr;
    int offset = tid;
    // Do multiples of 64 bits
    #pragma unroll 2
    for (; offset<npack; offset+=nthreads) {
      // Recv : local, then intra-node, then inter-node
      uint64_t val = SRC ? readAL(srcPack+offset) : readLL(0, offset);
      if (RECV) {
        if (SRC) val = MULTI<FUNC, T>()(readLL(0, offset), val);
        for (int i=1; i<NRECV && i<nrecv; i++) {
          val = MULTI<FUNC, T>()(readLL(i, offset), val);
        }
      }

      // Send : inter-node, then intra-node, then local
      if (SEND) {
        for (int i=1; i<NSEND && i<nsend; i++) storeLL(sendPtr(i)+offset, val, sendFlag(i));
        storeLL(sendPtr(0)+offset, val, sendFlag(0));
      }
      if (DST) {
        if (((offset*sizeof(uint64_t)) ^ nbytes) < sizeof(uint64_t)) {
          // Last incomplete word
          storeAL(dstPack+offset, val, nbytes & 0x7);
        } else {
          storeAL(dstPack+offset, val, sizeof(uint64_t));
        }
      }
    }
    exitIfAbortLocalBarrier();
    FOR_RECV(postRecv);
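    // the return below was inserted for debugging: it skips the final FOR_SEND(postSend, offset)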
    return;
    FOR_SEND(postSend, offset);
  }
template<int UNUSED, class FUNC, typename T>
__device__ void ncclAllReduceRingLLKernel(struct CollectiveArgs* args) {
  const int tid = threadIdx.x;
  const int bid = args->bid;
  const int nthreads = args->nThreads;
  struct ncclDevComm* comm = args->comm;
  struct ncclChannel* channel = comm->channels+blockIdx.x;
  struct ncclRing* ring = &channel->ring;

  ncclLLPrimitives<T, FUNC, 1, 1> LLprims(tid, nthreads, &ring->prev, &ring->next, channel, comm, args->opCount);

  const ssize_t size = args->N;
  //const int rank = comm->rank;
  const int nranks = comm->nRanks;
  ssize_t chunkSize = NCCL_LL_SLICE_LINES * sizeof(uint64_t) / sizeof(T);
  const ssize_t loopSize = args->nChannels*nranks*chunkSize;

  // Compute pointers
  const T * __restrict__ thisInput = (const T*)args->ThisInput;
  T * __restrict__ thisOutput = (T*)args->ThisOutput;

  for (ssize_t gridOffset = 0; gridOffset < size; gridOffset += loopSize) {
    if (size-gridOffset < loopSize) {
      chunkSize = args->lastChunkSize;
    }
    ssize_t chunkOffset = gridOffset + bid*nranks*chunkSize;

    /////////////// begin AllReduce steps ///////////////
    ssize_t offset;
    int nelem;
    int slice;

    // step 0: push data to next GPU
    slice = ring->devUserRanks[nranks-1];
    offset = chunkOffset + slice * chunkSize;
    nelem = min(chunkSize, size-offset);

    LLprims.send(thisInput+offset, nelem);
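    // the continue below was inserted for debugging: it skips the remaining AllReduce steps of this iteration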
    continue;

@vmarkovtsev (Author)

vmarkovtsev commented Sep 30, 2019

If I comment out sendStep[i]++; in inline __device__ void postSend(int i, int offset), it does not hang either. Now I need to find where it gets decremented...

Update: the only other place that mutates it seems to be __device__ __forceinline__ void loadSendConn(struct ncclConnInfo* conn, int i).

loadSendConn, called from the constructor, sets it to 0.

@vmarkovtsev (Author)

So,

  • loadSendConn sets sendStep[i] to 0
  • postSend is called once for each thread
  • Each thread makes sendStep[i] equal to 1
  • saveSendConn sets sendConn[i]->step to 1
  • ncclAllReduceRingLLKernel exits
  • Next call, loadSendConn sets sendStep[i] to 1. Not clear yet why ncclAllReduceRingLLKernel is called several times, probably because of the warmup.
  • postSend checks NCCL_LL_CLEAN_MASK; this time the condition does not hold, so it does nothing extra and just increments sendStep[i]
  • saveSendConn sets sendConn[i]->step to 2
  • ncclAllReduceRingLLKernel exits
  • The cycle repeats until saveSendConn sets sendConn[i]->step to 8
  • ncclAllReduceRingLLKernel exits
  • loadSendConn sets sendStep[i] to 8
  • logs stop 🤔 No records of postSend are printed afterwards. The kernel does not actually return! I thought that it did, but I did not know that ncclAllReduceRingLLKernel is called several times.

OK, now I need to find where it enters an infinite cycle.
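
A host-side paraphrase of this bookkeeping (just the sequence described above, not the actual NCCL device code; in the real kernel postSend runs per thread, but the net effect per invocation is a single increment):

    // `step` stands for sendConn[i]->step and persists across kernel invocations.
    struct FakeConn { int step = 0; };

    void oneKernelInvocation(FakeConn& conn) {
      int sendStep = conn.step;  // loadSendConn: resume from the previous invocation
      sendStep++;                // postSend: advance the step
      conn.step = sendStep;      // saveSendConn: persist for the next invocation
    }
    // After 8 invocations conn.step == 8, which is exactly where the postSend
    // logs above stop appearing.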

@vmarkovtsev (Author)

vmarkovtsev commented Sep 30, 2019

Gotcha! We end up calling LLprims.send() with gridOffset = 0 and nelem = -224.

My current cmdline is ./build/all_reduce_perf -b 128 -e 128 -f 2 -g 4

There are two options:

  • nelem = -224 chunkSize = 128 size = 32 offset = 256
  • nelem = -96 chunkSize = 128 size = 32 offset = 128.

gridOffset was 0 in all the cases.

size is 32 because I request a reduction of 128 bytes, which is 32 4-byte numbers.

loopSize is 16384.

@vmarkovtsev (Author)

vmarkovtsev commented Sep 30, 2019

Actually, negative nelem appears on the very first invocation.

So this is definitely unrelated to my continue after LLprims.send(thisInput+offset, nelem);. Good.

@vmarkovtsev (Author)

vmarkovtsev commented Sep 30, 2019

Thus the immediate cause is this code:

    // step 0: push data to next GPU
    slice = ring->devUserRanks[nranks-1];
    offset = chunkOffset + slice * chunkSize;
    nelem = min(chunkSize, size-offset);

    LLprims.send(thisInput+offset, nelem);

When slice is 1, it gets multiplied by chunkSize, which is 128; since our size is only 32, size - offset becomes negative. I checked, and slice is sometimes 0 (no error) or 2 (nelem becomes -224).
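
Plugging the observed values into that computation (a standalone sanity check, not NCCL code):

    #include <algorithm>
    #include <cstdio>

    int main() {
      const long size = 32, chunkSize = 128, chunkOffset = 0;
      for (long slice = 0; slice < 4; ++slice) {  // slice comes from ring->devUserRanks
        long offset = chunkOffset + slice * chunkSize;
        long nelem = std::min(chunkSize, size - offset);
        printf("slice %ld -> offset %ld nelem %ld\n", slice, offset, nelem);
      }
      // slice 0 -> nelem 32, slice 1 -> nelem -96, slice 2 -> nelem -224
      return 0;
    }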

slice comes from ring->devUserRanks. Let's see what's there...

@vmarkovtsev (Author)

nRanks is 4, nChannels is 1.

ring->devUserRanks is initialized to one of [0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 0, 1], [3, 0, 1, 2].

@vmarkovtsev (Author)

Nah, another dead end: uint32_t nbytes = nelem < 0 ? 0 : nelem*sizeof(T); in LLGenericOp clamps it, so a negative nelem is not a problem by itself.

@vmarkovtsev (Author)

The actual spot where the control flow hangs is barrier();:

template <int RECV, int SEND, int SRC, int DST>
  __device__ void LLGenericOp(const T* srcPtr, T* dstPtr, int nelem) {
    uint32_t nbytes = nelem < 0 ? 0 : nelem*sizeof(T);
    FOR_SEND(waitSend, nbytes*2);
    barrier();  // <- hangs here
    uint32_t npack = DIVUP(nbytes, sizeof(uint64_t));
    uint64_t* srcPack = (uint64_t*)srcPtr;
    uint64_t* dstPack = (uint64_t*)dstPtr;

@vmarkovtsev (Author)

I should have checked dmesg much earlier. This is what I see there:

[1477882.502235] dmar_fault: 32 callbacks suppressed
[1477882.502237] DMAR: DRHD: handling fault status reg 102
[1477882.502925] DMAR: [DMA Write] Request device [02:00.0] fault addr cd139000 [fault reason 05] PTE Write access is not set
[1477884.569124] DMAR: DRHD: handling fault status reg 402
[1477884.569490] DMAR: [DMA Write] Request device [82:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set

@vmarkovtsev (Author)

Waiting for the authors to confirm that this is not a problem with NCCL; I will close this right away once they do.

@vmarkovtsev (Author)

export NCCL_P2P_DISABLE=1 fixes the hang 🎉

@cliffwoolley (Collaborator)

I see in the linked TF issue that you're running under Kubernetes. Is it also a virtualized environment?

Thanks for the detailed follow-ups; sorry you had to go so deep into the code just to discover that it was something external after all.

It would be nice if we had a way to detect and guard against whatever system configuration is causing this.

@sjeaugey (Member)

Note this workaround (NCCL_P2P_DISABLE=1) can severely degrade performance. The problem is usually caused by IO virtualization (VT-d / IOMMU) being enabled on a bare metal system, breaking CUDA p2p.

To confirm this, the first step is to run the CUDA p2pBandwidthLatencyTest sample in /usr/local/cuda/samples/1_Utilities/p2pBandwidthLatencyTest. If this test does not work, then p2p is not functional.

The solution is to either disable VT-d/IOMMU in the BIOS, or have a script disable ACS upon boot. See more information here and here.
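
(As a quicker sanity check than the full sample, a minimal peer-access probe along these lines can be used; this is only a sketch with error handling omitted, not the p2pBandwidthLatencyTest code itself:)

    #include <cuda_runtime.h>
    #include <cstdio>

    // Probe every GPU pair: report whether peer access is possible and,
    // if so, issue a small cudaMemcpyPeer between the two devices.
    int main() {
      int n = 0;
      cudaGetDeviceCount(&n);
      for (int a = 0; a < n; ++a) {
        for (int b = 0; b < n; ++b) {
          if (a == b) continue;
          int can = 0;
          cudaDeviceCanAccessPeer(&can, a, b);
          printf("GPU %d -> GPU %d: peer access %s\n", a, b, can ? "YES" : "no");
          if (!can) continue;
          cudaSetDevice(a);
          cudaDeviceEnablePeerAccess(b, 0);
          void *src = nullptr, *dst = nullptr;
          cudaMalloc(&src, 1 << 20);
          cudaSetDevice(b);
          cudaMalloc(&dst, 1 << 20);
          // When the IOMMU breaks P2P, this copy is where transfers become
          // extremely slow even though peer access is reported as available.
          cudaMemcpyPeer(dst, b, src, a, 1 << 20);
          cudaDeviceSynchronize();
          cudaFree(dst);
          cudaSetDevice(a);
          cudaFree(src);
        }
      }
      return 0;
    }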

@vmarkovtsev (Author)

We are running Kubernetes over bare metal machines. No virtualization is used. I will consult with our infra team about the details. I know that before switching to Kubernetes, we ran old-style, and cudaMemcpyPeer worked.

Trying to run p2pBandwidthLatencyTest now.

@vmarkovtsev (Author)

@sjeaugey Interesting, that sample passes without hangs:

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce GTX 1080 Ti, pciBusID: 2, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce GTX 1080 Ti, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 2, GeForce GTX 1080 Ti, pciBusID: 82, pciDeviceID: 0, pciDomainID:0
Device: 3, GeForce GTX 1080 Ti, pciBusID: 83, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=0 CANNOT Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=3
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CANNOT Access Peer Device=0
Device=3 CANNOT Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0       1     1     0     0
     1       1     1     0     0
     2       0     0     1     1
     3       0     0     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 353.51   9.48   9.66   9.63
     1   9.59 152.05   9.73   9.62
     2   9.70   9.73 354.79   9.78
     3   9.67   9.76   9.71 354.47
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3
     0 354.80   0.23   9.75   9.63
     1   0.23 381.47   9.75   9.68
     2   9.75   9.51 355.45   0.23
     3   9.69   9.63   0.23 353.51
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 357.21  16.28  17.77  17.71
     1  16.00 384.25  16.39  16.85
     2  16.39  16.43 356.20  16.21
     3  17.24  17.48  16.50 354.79
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 356.24   0.46  17.86  17.80
     1   0.46 383.72  17.58  17.57
     2  17.93  17.89 356.25   0.45
     3  17.80  17.06   0.45 354.66
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3
     0   1.29  10.68  12.44  12.65
     1  13.39   1.32  13.05  10.33
     2  11.31  13.05   1.31  13.45
     3  10.87  13.37  12.11   1.29

   CPU     0      1      2      3
     0   4.20  12.67  10.10  12.37
     1  10.98   3.86  12.25  12.02
     2   9.57  10.11   4.64  10.96
     3  12.47  12.08  11.29   4.34
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3
     0   1.30 49250.70  12.24  12.79
     1 49250.55   1.31  12.30  14.65
     2  12.78  14.55   1.60 49250.37
     3  11.51  12.72 49250.50   1.31

   CPU     0      1      2      3
     0   4.42   4.08  11.12  11.82
     1   3.00   5.00  12.49  11.23
     2   9.72   9.59   4.09   2.71
     3  11.82  11.19   3.00   5.76

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

@sjeaugey (Member)

Thanks. Interesting indeed: the CE (copy engine) copies seem not to hang, but to be really slow. Notice the 0.23 GB/s:

Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3
     0 354.80   0.23   9.75   9.63
     1   0.23 381.47   9.75   9.68
     2   9.75   9.51 355.45   0.23
     3   9.69   9.63   0.23 353.51

Let us know if performance goes back to normal with the BIOS or system change. NCCL performance and functionality should be back as well.

@vmarkovtsev (Author)

vmarkovtsev commented Oct 1, 2019

We disabled VT-d in the BIOS, and the hang still persists with the same DMA errors.

The corresponding BIOS checkbox is off, and the kernel prints kvm: disabled by bios.

@vmarkovtsev (Author)

vmarkovtsev commented Oct 2, 2019

@sjeaugey @cliffwoolley Booting the kernel with intel_iommu=off resolved all our problems. No need for NCCL_P2P_DISABLE now. Worked like a charm! This is the new output from p2pBandwidthLatencyTest:

Device: 0, GeForce GTX 1080 Ti, pciBusID: 2, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce GTX 1080 Ti, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 2, GeForce GTX 1080 Ti, pciBusID: 82, pciDeviceID: 0, pciDomainID:0
Device: 3, GeForce GTX 1080 Ti, pciBusID: 83, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=0 CANNOT Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=3
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CANNOT Access Peer Device=0
Device=3 CANNOT Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0       1     1     0     0
     1       1     1     0     0
     2       0     0     1     1
     3       0     0     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 353.83   5.98  11.22  11.15
     1  11.10 152.05  10.77  10.89
     2  11.14  10.76 355.12  11.37
     3  11.17  10.76  11.32 353.83
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3
     0 354.15  10.28  11.18  11.16
     1  10.27 152.29  11.16  11.11
     2  11.15  10.85 355.44  10.28
     3  11.15  10.94  10.28 354.15
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 356.35  19.31  19.64  19.86
     1  18.48 245.06  18.73  18.59
     2  19.82  19.18 357.36  18.23
     3  19.64  20.07  17.72 355.87
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 355.76  19.29  19.54  19.30
     1  19.30 383.34  20.06  20.02
     2  19.97  19.71 355.30  19.30
     3  19.67  19.80  19.29 354.31
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3
     0   1.25  11.90  10.32  10.33
     1  12.95   1.37  16.49  15.06
     2  11.29  11.45   1.27  12.95
     3  11.65  10.75  11.12   1.26

   CPU     0      1      2      3
     0   3.96   9.96   9.93  10.93
     1   9.40   4.33   9.26  10.34
     2   9.71   9.60   4.06   9.82
     3  10.95  10.42  10.49   3.99
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3
     0   1.25   0.99  12.84  13.39
     1   1.01   1.37  11.21  10.36
     2  12.39  12.33   1.28   1.07
     3  10.39  10.86   1.04   1.27

   CPU     0      1      2      3
     0   4.14   3.00  10.40  10.54
     1   2.79   4.32   9.66  10.71
     2   9.72  10.03   4.08   2.75
     3  10.64  11.25   3.08   4.12

Thanks for the help.

@vmarkovtsev (Author)

vmarkovtsev commented Oct 2, 2019

I can suggest searching for DMAR: IOMMU enabled in dmesg and printing a warning if it is found. I now get DMAR: IOMMU disabled. https://stackoverflow.com/questions/44286683/check-for-iommu-support-on-linux
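
A sketch of such a check (using /sys/class/iommu instead of grepping dmesg; this approach and the warning text are my suggestion, not something from NCCL):

    #include <filesystem>
    #include <system_error>
    #include <cstdio>

    // Heuristic: if /sys/class/iommu contains any entries, an IOMMU is active.
    bool iommuLooksEnabled() {
      std::error_code ec;
      for (const auto& entry :
           std::filesystem::directory_iterator("/sys/class/iommu", ec)) {
        (void)entry;
        return true;   // at least one IOMMU device is registered
      }
      return false;    // directory missing or empty: no active IOMMU
    }

    int main() {
      if (iommuLooksEnabled())
        fprintf(stderr, "WARN: IOMMU appears to be enabled; "
                        "GPU peer-to-peer transfers may hang or be very slow.\n");
      return 0;
    }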
