AllReduce hangs #257

Closed

vmarkovtsev opened this issue Sep 28, 2019 · 31 comments

@vmarkovtsev

My problem was diagnosed in tensorflow/tensorflow#32654 - please find all the info about my environment there.

I am using the master version of NCCL. I launch all_reduce_perf and it hangs, with 100% volatile GPU utilization reported.

./build/all_reduce_perf -b 8 -e 256M -f 2 -g 4
# nThread 1 nGpus 4 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid  15833 on jupyter-vmarkovtsev device  0 [0x02] GeForce GTX 1080 Ti
#   Rank  1 Pid  15833 on jupyter-vmarkovtsev device  1 [0x03] GeForce GTX 1080 Ti
#   Rank  2 Pid  15833 on jupyter-vmarkovtsev device  2 [0x82] GeForce GTX 1080 Ti
#   Rank  3 Pid  15833 on jupyter-vmarkovtsev device  3 [0x83] GeForce GTX 1080 Ti
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Bootstrap : Using [0]eth0:10.2.3.32<0>
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

jupyter-vmarkovtsev:15833:15833 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO NET/Socket : Using [0]eth0:10.2.3.32<0>
NCCL version 2.4.8+cuda10.0
jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO nranks 4
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Setting affinity for GPU 0 to ff00ff
jupyter-vmarkovtsev:15833:15833 [1] NCCL INFO Setting affinity for GPU 1 to ff00ff
jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO Channel 00 :    0   1   2   3
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/direct pointer
jupyter-vmarkovtsev:15833:15833 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
jupyter-vmarkovtsev:15833:15833 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via P2P/direct pointer
jupyter-vmarkovtsev:15833:15833 [3] NCCL INFO Ring 00 : 3[3] -> 0[0] via direct shared memory
#
#                                                     out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO Launch mode Group/CGMD
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 1 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 3 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f93b2000000 recvbuff 0x7f93a2000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09b4c43b0 [nranks=4] stream 0x55d099d151c0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f936c000000 recvbuff 0x7f935c000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff61710 [nranks=4] stream 0x55d09a4afee0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f9328000000 recvbuff 0x7f9318000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d09ff6b9f0 [nranks=4] stream 0x55d09ac521a0
jupyter-vmarkovtsev:15833:15833 [0] NCCL INFO AllReduce: opCount 4 sendbuff 0x7f92e4000000 recvbuff 0x7f92d4000000 count 67108864 datatype 7 op 0 root 0 comm 0x55d0a2e1ef20 [nranks=4] stream 0x55d09b3fb680

jupyter-vmarkovtsev:15833:15833 [0] init.cc:1250 NCCL WARN Mismatched collective detected, please check your collective calls at and around rank 3. You can use NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=COLL to see the collective logs

I waited for 10 minutes; no more logs were printed.

@vmarkovtsev (Author)

Sometimes, it prints

jupyter-vmarkovtsev:15892:15892 [0] init.cc:1259 NCCL WARN Your program may be hanging, this may be caused by a collective mismatch around rank 3. Please check your collective calls at and around this rank. You can use NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=COLL to see the collective logs

@vmarkovtsev (Author)

If I insert

printf("mismatch: %d\nremoteOpCount %d\nopCount %d\n", (int)mismatch, remoteOpCount? ((int)*remoteOpCount) : -1, (int)opCount);

before if (mismatch > 20) { in primitives.h, it never stops printing; the output constantly alternates between the following two blocks:

mismatch: 21
remoteOpCount -1
opCount 0

and

mismatch: 0
remoteOpCount 0
opCount 0

@vmarkovtsev (Author)

cudaStreamQuery() always returns cudaErrorNotReady inside testStreamSynchronize(), which is called right after the warmup through TESTCHECK(completeColl(args));. Digging further...
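
For context, here is a minimal sketch of the kind of polling loop testStreamSynchronize() performs (my own simplification, not the actual nccl-tests code):

    #include <cuda_runtime.h>

    // Simplified polling loop in the spirit of testStreamSynchronize():
    // spin on cudaStreamQuery() until the stream drains or a real error shows up.
    cudaError_t pollStream(cudaStream_t stream) {
      while (true) {
        cudaError_t err = cudaStreamQuery(stream);
        if (err == cudaSuccess) return cudaSuccess;   // all queued work finished
        if (err != cudaErrorNotReady) return err;     // genuine CUDA error
        // cudaErrorNotReady: the kernel is still running. In the hang described
        // above this branch is taken forever because the NCCL kernel never exits.
      }
    }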

@vmarkovtsev (Author)

vmarkovtsev commented Sep 29, 2019

If I short-circuit ncclAsyncMode() to false in ncclEnqueueCheck() then it hangs in ncclBarrierEnqueueWait() at the very first warmup iteration for the very first device.

Going deeper: it hangs on ncclCpuBarrierOut().

Update: nah, this is a dead end: sync never calls CUDA.

@vmarkovtsev (Author)

It does not hang with -g 1. It hangs with -g 2.

@vmarkovtsev (Author)

vmarkovtsev commented Sep 29, 2019

I found that the cudaLaunchKernel() call in ncclBarrierEnqueueWait() has params->gridDim equal to -2130043008 and params->blockDim equal to 1. Nice!

Update: nah, this is for ncclComm::PARALLEL mode, mine is GROUP.

@vmarkovtsev (Author)

We call cudaLaunchCooperativeKernelMultiDevice() with gridDim 1 1 1 and blockDim 64 1 1. All is fine here.
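
(To verify the launch parameters, one can dump them right before the call; cudaLaunchParams is the standard CUDA runtime struct, while params and numDevices below are hypothetical names standing in for whatever NCCL passes at that point:)

    // Hypothetical debug dump placed just before
    // cudaLaunchCooperativeKernelMultiDevice(params, numDevices, flags).
    for (unsigned int i = 0; i < numDevices; ++i) {
      const cudaLaunchParams& p = params[i];
      printf("dev %u: gridDim %u %u %u blockDim %u %u %u sharedMem %zu\n",
             i, p.gridDim.x, p.gridDim.y, p.gridDim.z,
             p.blockDim.x, p.blockDim.y, p.blockDim.z, p.sharedMem);
    }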

@vmarkovtsev (Author)

We never leave ncclAllReduceRingLLKernel. According to a printf at the beginning of the kernel, it is executed exactly 256 times, which makes sense: 4 GPUs x 64 threads (blockDim) = 256. This is the first clue.

@vmarkovtsev (Author)

vmarkovtsev commented Sep 29, 2019

This code hangs during the first iteration of the outer and the inner for loops:

    // k-2 steps: copy to next GPU
    for (int j=1; j<nranks-1; ++j) {
      slice = ring->devUserRanks[nranks-j];
      offset = chunkOffset + slice * chunkSize;
      nelem = min(chunkSize, size-offset);

      LLprims.recvCopySend(thisOutput+offset, nelem);
    }

@vmarkovtsev (Author)

If I insert return; at the very beginning of ncclAllReduceRingLLKernel, the test passes (the result is not validated, so it considers itself successful).

@vmarkovtsev (Author)

Any call to LLprims is poisonous. With continue inserted before LLprims.send(thisInput+offset, nelem); the test passes; with it inserted after, the function returns but we still hang.

@vmarkovtsev (Author)

vmarkovtsev commented Sep 29, 2019

Inserting return; in LLGenericOp<> before the final FOR_SEND(postSend, offset); makes the test pass, provided we also continue right after LLprims.send(thisInput+offset, nelem);. Double-checked.

  template <int RECV, int SEND, int SRC, int DST>
  __device__ void LLGenericOp(const T* srcPtr, T* dstPtr, int nelem) {
    uint32_t nbytes = nelem < 0 ? 0 : nelem*sizeof(T);
    FOR_SEND(waitSend, nbytes*2);
    barrier();
    uint32_t npack = DIVUP(nbytes, sizeof(uint64_t));
    uint64_t* srcPack = (uint64_t*)srcPtr;
    uint64_t* dstPack = (uint64_t*)dstPtr;
    int offset = tid;
    // Do multiples of 64 bits
    #pragma unroll 2
    for (; offset<npack; offset+=nthreads) {
      // Recv : local, then intra-node, then inter-node
      uint64_t val = SRC ? readAL(srcPack+offset) : readLL(0, offset);
      if (RECV) {
        if (SRC) val = MULTI<FUNC, T>()(readLL(0, offset), val);
        for (int i=1; i<NRECV && i<nrecv; i++) {
          val = MULTI<FUNC, T>()(readLL(i, offset), val);
        }
      }

      // Send : inter-node, then intra-node, then local
      if (SEND) {
        for (int i=1; i<NSEND && i<nsend; i++) storeLL(sendPtr(i)+offset, val, sendFlag(i));
        storeLL(sendPtr(0)+offset, val, sendFlag(0));
      }
      if (DST) {
        if (((offset*sizeof(uint64_t)) ^ nbytes) < sizeof(uint64_t)) {
          // Last incomplete word
          storeAL(dstPack+offset, val, nbytes & 0x7);
        } else {
          storeAL(dstPack+offset, val, sizeof(uint64_t));
        }
      }
    }
    exitIfAbortLocalBarrier();
    FOR_RECV(postRecv);
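    // the return below was inserted for debugging: it skips the final FOR_SEND(postSend, offset)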
    return;
    FOR_SEND(postSend, offset);
  }
template<int UNUSED, class FUNC, typename T>
__device__ void ncclAllReduceRingLLKernel(struct CollectiveArgs* args) {
  const int tid = threadIdx.x;
  const int bid = args->bid;
  const int nthreads = args->nThreads;
  struct ncclDevComm* comm = args->comm;
  struct ncclChannel* channel = comm->channels+blockIdx.x;
  struct ncclRing* ring = &channel->ring;

  ncclLLPrimitives<T, FUNC, 1, 1> LLprims(tid, nthreads, &ring->prev, &ring->next, channel, comm, args->opCount);

  const ssize_t size = args->N;
  //const int rank = comm->rank;
  const int nranks = comm->nRanks;
  ssize_t chunkSize = NCCL_LL_SLICE_LINES * sizeof(uint64_t) / sizeof(T);
  const ssize_t loopSize = args->nChannels*nranks*chunkSize;

  // Compute pointers
  const T * __restrict__ thisInput = (const T*)args->ThisInput;
  T * __restrict__ thisOutput = (T*)args->ThisOutput;

  for (ssize_t gridOffset = 0; gridOffset < size; gridOffset += loopSize) {
    if (size-gridOffset < loopSize) {
      chunkSize = args->lastChunkSize;
    }
    ssize_t chunkOffset = gridOffset + bid*nranks*chunkSize;

    /////////////// begin AllReduce steps ///////////////
    ssize_t offset;
    int nelem;
    int slice;

    // step 0: push data to next GPU
    slice = ring->devUserRanks[nranks-1];
    offset = chunkOffset + slice * chunkSize;
    nelem = min(chunkSize, size-offset);

    LLprims.send(thisInput+offset, nelem);
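    // the continue below was inserted for debugging: it skips the remaining AllReduce steps of this iteration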
    continue;

@vmarkovtsev (Author)

vmarkovtsev commented Sep 30, 2019

If I comment out sendStep[i]++; in inline __device__ void postSend(int i, int offset), it does not hang either. Now I need to find where it gets decremented...

Update: the only other place that mutates it seems to be __device__ __forceinline__ void loadSendConn(struct ncclConnInfo* conn, int i).

loadSendConn, called from the constructor, sets it to 0.

@vmarkovtsev (Author)

So,

  • loadSendConn sets sendStep[i] to 0
  • postSend is called once for each thread
  • Each thread makes sendStep[i] equal to 1
  • saveSendConn sets sendConn[i]->step to 1
  • ncclAllReduceRingLLKernel exits
  • Next call, loadSendConn sets sendStep[i] to 1. Not clear yet why ncclAllReduceRingLLKernel is called several times, probably because of the warmup.
  • postSend checks NCCL_LL_CLEAN_MASK; this time the condition does not hold, so it does nothing extra and just increments sendStep[i]
  • saveSendConn sets sendConn[i]->step to 2
  • ncclAllReduceRingLLKernel exits
  • The cycle repeats until saveSendConn sets sendConn[i]->step to 8
  • ncclAllReduceRingLLKernel exits
  • loadSendConn sets sendStep[i] to 8
  • logs stop 🤔 No records of postSend are printed afterwards. The kernel does not actually return! I thought that it did, but I did not know that ncclAllReduceRingLLKernel is called several times.

OK, now I need to find where it enters an infinite cycle.
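
A host-side paraphrase of this bookkeeping (just the sequence described above, not the actual NCCL device code; in the real kernel postSend runs per thread, but the net effect per invocation is a single increment):

    // `step` stands for sendConn[i]->step and persists across kernel invocations.
    struct FakeConn { int step = 0; };

    void oneKernelInvocation(FakeConn& conn) {
      int sendStep = conn.step;  // loadSendConn: resume from the previous invocation
      sendStep++;                // postSend: advance the step
      conn.step = sendStep;      // saveSendConn: persist for the next invocation
    }
    // After 8 invocations conn.step == 8, which is exactly where the postSend
    // logs above stop appearing.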

@vmarkovtsev (Author)

vmarkovtsev commented Sep 30, 2019

Gotcha! We end up calling LLprims.send() with gridOffset = 0 and nelem = -224.

My current cmdline is ./build/all_reduce_perf -b 128 -e 128 -f 2 -g 4

There are two options:

  • nelem = -224 chunkSize = 128 size = 32 offset = 256
  • nelem = -96 chunkSize = 128 size = 32 offset = 128.

gridOffset was 0 in all the cases.

size is 32 because I request a reduction of 128 bytes, which is 32 4-byte numbers.

loopSize is 16384.

@vmarkovtsev (Author)

vmarkovtsev commented Sep 30, 2019

Actually, negative nelem appears on the very first invocation.

So this is definitely unrelated to my continue after LLprims.send(thisInput+offset, nelem);. Good.

@vmarkovtsev (Author)

vmarkovtsev commented Sep 30, 2019

Thus the immediate cause is this code:

    // step 0: push data to next GPU
    slice = ring->devUserRanks[nranks-1];
    offset = chunkOffset + slice * chunkSize;
    nelem = min(chunkSize, size-offset);

    LLprims.send(thisInput+offset, nelem);

When slice is 1, it gets multiplied by chunkSize, which is 128; since our size is only 32, size - offset becomes negative. I checked, and slice is sometimes 0 (no error) or 2 (nelem becomes -224).
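
Plugging the observed values into that computation (a standalone sanity check, not NCCL code):

    #include <algorithm>
    #include <cstdio>

    int main() {
      const long size = 32, chunkSize = 128, chunkOffset = 0;
      for (long slice = 0; slice < 4; ++slice) {  // slice comes from ring->devUserRanks
        long offset = chunkOffset + slice * chunkSize;
        long nelem = std::min(chunkSize, size - offset);
        printf("slice %ld -> offset %ld nelem %ld\n", slice, offset, nelem);
      }
      // slice 0 -> nelem 32, slice 1 -> nelem -96, slice 2 -> nelem -224
      return 0;
    }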

slice comes from ring->devUserRanks. Let's see what's there...

@vmarkovtsev (Author)

nRanks is 4, nChannels is 1.

ring->devUserRanks is initialized to one of [0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 0, 1], [3, 0, 1, 2].

@vmarkovtsev (Author)

Nah, another dead end: uint32_t nbytes = nelem < 0 ? 0 : nelem*sizeof(T); in LLGenericOp clamps it, so a negative nelem is not a problem by itself.

@vmarkovtsev (Author)

The actual spot where the control flow hangs is barrier();:

template <int RECV, int SEND, int SRC, int DST>
  __device__ void LLGenericOp(const T* srcPtr, T* dstPtr, int nelem) {
    uint32_t nbytes = nelem < 0 ? 0 : nelem*sizeof(T);
    FOR_SEND(waitSend, nbytes*2);
    barrier();  // <- hangs here
    uint32_t npack = DIVUP(nbytes, sizeof(uint64_t));
    uint64_t* srcPack = (uint64_t*)srcPtr;
    uint64_t* dstPack = (uint64_t*)dstPtr;

@vmarkovtsev (Author)

I should have checked dmesg much earlier. This is what I see there:

[1477882.502235] dmar_fault: 32 callbacks suppressed
[1477882.502237] DMAR: DRHD: handling fault status reg 102
[1477882.502925] DMAR: [DMA Write] Request device [02:00.0] fault addr cd139000 [fault reason 05] PTE Write access is not set
[1477884.569124] DMAR: DRHD: handling fault status reg 402
[1477884.569490] DMAR: [DMA Write] Request device [82:00.0] fault addr f8139000 [fault reason 05] PTE Write access is not set

@vmarkovtsev (Author)

Waiting for the authors to confirm that this is not a problem with NCCL; I will close this right away once they do.

@vmarkovtsev (Author)

export NCCL_P2P_DISABLE=1 fixes the hang 🎉

@cliffwoolley (Collaborator)

I see in the linked TF issue that you're running under Kubernetes. Is it also a virtualized environment?

Thanks for the detailed follow-ups; sorry you had to go so deep into the code just to discover that it was something external after all.

It would be nice if we had a way to detect and guard against whatever system configuration is causing this.

@sjeaugey (Member)

Note this workaround (NCCL_P2P_DISABLE=1) can severely degrade performance. The problem is usually caused by IO virtualization (VT-d / IOMMU) being enabled on a bare metal system, breaking CUDA p2p.

To confirm this, the first step is to run the CUDA p2pBandwidthLatencyTest sample in /usr/local/cuda/samples/1_Utilities/p2pBandwidthLatencyTest. If this test does not work, then p2p is not functional.

The solution is to either disable VT-d/IOMMU in the BIOS, or have a script disable ACS upon boot. See more information here and here.
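
(As a quicker sanity check than the full sample, a minimal peer-access probe along these lines can be used; this is only a sketch with error handling omitted, not the p2pBandwidthLatencyTest code itself:)

    #include <cuda_runtime.h>
    #include <cstdio>

    // Probe every GPU pair: report whether peer access is possible and,
    // if so, issue a small cudaMemcpyPeer between the two devices.
    int main() {
      int n = 0;
      cudaGetDeviceCount(&n);
      for (int a = 0; a < n; ++a) {
        for (int b = 0; b < n; ++b) {
          if (a == b) continue;
          int can = 0;
          cudaDeviceCanAccessPeer(&can, a, b);
          printf("GPU %d -> GPU %d: peer access %s\n", a, b, can ? "YES" : "no");
          if (!can) continue;
          cudaSetDevice(a);
          cudaDeviceEnablePeerAccess(b, 0);
          void *src = nullptr, *dst = nullptr;
          cudaMalloc(&src, 1 << 20);
          cudaSetDevice(b);
          cudaMalloc(&dst, 1 << 20);
          // When the IOMMU breaks P2P, this copy is where transfers become
          // extremely slow even though peer access is reported as available.
          cudaMemcpyPeer(dst, b, src, a, 1 << 20);
          cudaDeviceSynchronize();
          cudaFree(dst);
          cudaSetDevice(a);
          cudaFree(src);
        }
      }
      return 0;
    }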

@vmarkovtsev (Author)

We are running Kubernetes over bare metal machines. No virtualization is used. I will consult with our infra team about the details. I know that before switching to Kubernetes, we ran old-style, and cudaMemcpyPeer worked.

Trying to run p2pBandwidthLatencyTest now.

@vmarkovtsev (Author)

@sjeaugey Interesting, that sample passes without hangs:

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, GeForce GTX 1080 Ti, pciBusID: 2, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce GTX 1080 Ti, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 2, GeForce GTX 1080 Ti, pciBusID: 82, pciDeviceID: 0, pciDomainID:0
Device: 3, GeForce GTX 1080 Ti, pciBusID: 83, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=0 CANNOT Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=3
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CANNOT Access Peer Device=0
Device=3 CANNOT Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0       1     1     0     0
     1       1     1     0     0
     2       0     0     1     1
     3       0     0     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 353.51   9.48   9.66   9.63
     1   9.59 152.05   9.73   9.62
     2   9.70   9.73 354.79   9.78
     3   9.67   9.76   9.71 354.47
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3
     0 354.80   0.23   9.75   9.63
     1   0.23 381.47   9.75   9.68
     2   9.75   9.51 355.45   0.23
     3   9.69   9.63   0.23 353.51
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 357.21  16.28  17.77  17.71
     1  16.00 384.25  16.39  16.85
     2  16.39  16.43 356.20  16.21
     3  17.24  17.48  16.50 354.79
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 356.24   0.46  17.86  17.80
     1   0.46 383.72  17.58  17.57
     2  17.93  17.89 356.25   0.45
     3  17.80  17.06   0.45 354.66
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3
     0   1.29  10.68  12.44  12.65
     1  13.39   1.32  13.05  10.33
     2  11.31  13.05   1.31  13.45
     3  10.87  13.37  12.11   1.29

   CPU     0      1      2      3
     0   4.20  12.67  10.10  12.37
     1  10.98   3.86  12.25  12.02
     2   9.57  10.11   4.64  10.96
     3  12.47  12.08  11.29   4.34
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3
     0   1.30 49250.70  12.24  12.79
     1 49250.55   1.31  12.30  14.65
     2  12.78  14.55   1.60 49250.37
     3  11.51  12.72 49250.50   1.31

   CPU     0      1      2      3
     0   4.42   4.08  11.12  11.82
     1   3.00   5.00  12.49  11.23
     2   9.72   9.59   4.09   2.71
     3  11.82  11.19   3.00   5.76

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

@sjeaugey (Member)

Thanks. Interesting indeed: the CE (copy engine) copies seem not to hang, but to be really slow. Notice the 0.23 GB/s:

Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3
     0 354.80   0.23   9.75   9.63
     1   0.23 381.47   9.75   9.68
     2   9.75   9.51 355.45   0.23
     3   9.69   9.63   0.23 353.51

Let us know if performance goes back to normal with the BIOS or system change. NCCL performance and functionality should be back as well.

@vmarkovtsev (Author)

vmarkovtsev commented Oct 1, 2019

We disabled VT-d in the BIOS, and the hang still persists with the same DMA errors.

The corresponding BIOS checkbox is off, and the kernel prints kvm: disabled by bios.

@vmarkovtsev (Author)

vmarkovtsev commented Oct 2, 2019

@sjeaugey @cliffwoolley Booting the kernel with intel_iommu=off resolved all our problems. No need for NCCL_P2P_DISABLE now. Worked like a charm! This is the new output from p2pBandwidthLatencyTest:

Device: 0, GeForce GTX 1080 Ti, pciBusID: 2, pciDeviceID: 0, pciDomainID:0
Device: 1, GeForce GTX 1080 Ti, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 2, GeForce GTX 1080 Ti, pciBusID: 82, pciDeviceID: 0, pciDomainID:0
Device: 3, GeForce GTX 1080 Ti, pciBusID: 83, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=0 CANNOT Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=3
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CANNOT Access Peer Device=0
Device=3 CANNOT Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0       1     1     0     0
     1       1     1     0     0
     2       0     0     1     1
     3       0     0     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 353.83   5.98  11.22  11.15
     1  11.10 152.05  10.77  10.89
     2  11.14  10.76 355.12  11.37
     3  11.17  10.76  11.32 353.83
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3
     0 354.15  10.28  11.18  11.16
     1  10.27 152.29  11.16  11.11
     2  11.15  10.85 355.44  10.28
     3  11.15  10.94  10.28 354.15
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 356.35  19.31  19.64  19.86
     1  18.48 245.06  18.73  18.59
     2  19.82  19.18 357.36  18.23
     3  19.64  20.07  17.72 355.87
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 355.76  19.29  19.54  19.30
     1  19.30 383.34  20.06  20.02
     2  19.97  19.71 355.30  19.30
     3  19.67  19.80  19.29 354.31
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3
     0   1.25  11.90  10.32  10.33
     1  12.95   1.37  16.49  15.06
     2  11.29  11.45   1.27  12.95
     3  11.65  10.75  11.12   1.26

   CPU     0      1      2      3
     0   3.96   9.96   9.93  10.93
     1   9.40   4.33   9.26  10.34
     2   9.71   9.60   4.06   9.82
     3  10.95  10.42  10.49   3.99
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3
     0   1.25   0.99  12.84  13.39
     1   1.01   1.37  11.21  10.36
     2  12.39  12.33   1.28   1.07
     3  10.39  10.86   1.04   1.27

   CPU     0      1      2      3
     0   4.14   3.00  10.40  10.54
     1   2.79   4.32   9.66  10.71
     2   9.72  10.03   4.08   2.75
     3  10.64  11.25   3.08   4.12

Thanks for the help.

@vmarkovtsev (Author)

vmarkovtsev commented Oct 2, 2019

I can suggest searching for DMAR: IOMMU enabled in dmesg and printing a warning if it is found. I now get DMAR: IOMMU disabled. https://stackoverflow.com/questions/44286683/check-for-iommu-support-on-linux
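
A sketch of such a check (using /sys/class/iommu instead of grepping dmesg; this approach and the warning text are my suggestion, not something from NCCL):

    #include <filesystem>
    #include <system_error>
    #include <cstdio>

    // Heuristic: if /sys/class/iommu contains any entries, an IOMMU is active.
    bool iommuLooksEnabled() {
      std::error_code ec;
      for (const auto& entry :
           std::filesystem::directory_iterator("/sys/class/iommu", ec)) {
        (void)entry;
        return true;   // at least one IOMMU device is registered
      }
      return false;    // directory missing or empty: no active IOMMU
    }

    int main() {
      if (iommuLooksEnabled())
        fprintf(stderr, "WARN: IOMMU appears to be enabled; "
                        "GPU peer-to-peer transfers may hang or be very slow.\n");
      return 0;
    }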
