AllReduce hangs #257
Sometimes, it prints: […]
If I insert `printf("mismatch: %d\nremoteOpCount %d\nopCount %d\n", (int)mismatch, remoteOpCount ? ((int)*remoteOpCount) : -1, (int)opCount);` before […] and […]:

[…]
If I short-circuit […]. Going deeper: it hangs on […]. Update: nah, this is a dead end: sync never calls CUDA.
It does not hang with […].
I found that […]. Update: nah, this is for […].
We call […].
We never leave […].
This code hangs during the first iteration of both the outer and the inner loop:

```cuda
    // k-2 steps: copy to next GPU
    for (int j=1; j<nranks-1; ++j) {
      slice = ring->devUserRanks[nranks-j];
      offset = chunkOffset + slice * chunkSize;
      nelem = min(chunkSize, size-offset);
      LLprims.recvCopySend(thisOutput+offset, nelem);
    }
```
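For orientation, here is a paraphrased sketch of the `ncclLLPrimitives` wrappers as I read them in the NCCL source of roughly this version (treat the exact signatures and flag values as approximate, not verbatim): every primitive is a thin wrapper over the same `LLGenericOp` template that appears later in this thread, only with different `<RECV, SEND, SRC, DST>` flags, so they all share one `waitSend()`/`barrier()` path.

```cuda
// Paraphrased from NCCL's prims_ll.h (approximate, for orientation only):
// each primitive is LLGenericOp<RECV, SEND, SRC, DST> with different flags.
__device__ void send(const T* src, int nelem)           { return LLGenericOp<0, 1, 1, 0>(src, NULL, nelem); }
__device__ void recvReduceSend(const T* src, int nelem) { return LLGenericOp<1, 1, 1, 0>(src, NULL, nelem); }
__device__ void recvCopySend(T* dst, int nelem)         { return LLGenericOp<1, 1, 0, 1>(NULL, dst, nelem); }
__device__ void recv(T* dst, int nelem)                 { return LLGenericOp<1, 0, 0, 1>(NULL, dst, nelem); }
```

Because of this, a deadlock in the shared path shows up regardless of which primitive happens to run first.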
If I insert […].
Any call to […].
Inserting […]:

```cuda
template <int RECV, int SEND, int SRC, int DST>
__device__ void LLGenericOp(const T* srcPtr, T* dstPtr, int nelem) {
  uint32_t nbytes = nelem < 0 ? 0 : nelem*sizeof(T);
  FOR_SEND(waitSend, nbytes*2);
  barrier();
  uint32_t npack = DIVUP(nbytes, sizeof(uint64_t));
  uint64_t* srcPack = (uint64_t*)srcPtr;
  uint64_t* dstPack = (uint64_t*)dstPtr;
  int offset = tid;
  // Do multiples of 64 bits
  #pragma unroll 2
  for (; offset<npack; offset+=nthreads) {
    // Recv : local, then intra-node, then inter-node
    uint64_t val = SRC ? readAL(srcPack+offset) : readLL(0, offset);
    if (RECV) {
      if (SRC) val = MULTI<FUNC, T>()(readLL(0, offset), val);
      for (int i=1; i<NRECV && i<nrecv; i++) {
        val = MULTI<FUNC, T>()(readLL(i, offset), val);
      }
    }
    // Send : inter-node, then intra-node, then local
    if (SEND) {
      for (int i=1; i<NSEND && i<nsend; i++) storeLL(sendPtr(i)+offset, val, sendFlag(i));
      storeLL(sendPtr(0)+offset, val, sendFlag(0));
    }
    if (DST) {
      if (((offset*sizeof(uint64_t)) ^ nbytes) < sizeof(uint64_t)) {
        // Last incomplete word
        storeAL(dstPack+offset, val, nbytes & 0x7);
      } else {
        storeAL(dstPack+offset, val, sizeof(uint64_t));
      }
    }
  }
  exitIfAbortLocalBarrier();
  FOR_RECV(postRecv);
  return;  // <- apparently the inserted statement: bail out before FOR_SEND(postSend)
  FOR_SEND(postSend, offset);
}

template<int UNUSED, class FUNC, typename T>
__device__ void ncclAllReduceRingLLKernel(struct CollectiveArgs* args) {
  const int tid = threadIdx.x;
  const int bid = args->bid;
  const int nthreads = args->nThreads;
  struct ncclDevComm* comm = args->comm;
  struct ncclChannel* channel = comm->channels+blockIdx.x;
  struct ncclRing* ring = &channel->ring;

  ncclLLPrimitives<T, FUNC, 1, 1> LLprims(tid, nthreads, &ring->prev, &ring->next, channel, comm, args->opCount);

  const ssize_t size = args->N;
  //const int rank = comm->rank;
  const int nranks = comm->nRanks;
  ssize_t chunkSize = NCCL_LL_SLICE_LINES * sizeof(uint64_t) / sizeof(T);
  const ssize_t loopSize = args->nChannels*nranks*chunkSize;

  // Compute pointers
  const T * __restrict__ thisInput = (const T*)args->ThisInput;
  T * __restrict__ thisOutput = (T*)args->ThisOutput;

  for (ssize_t gridOffset = 0; gridOffset < size; gridOffset += loopSize) {
    if (size-gridOffset < loopSize) {
      chunkSize = args->lastChunkSize;
    }
    ssize_t chunkOffset = gridOffset + bid*nranks*chunkSize;

    /////////////// begin AllReduce steps ///////////////
    ssize_t offset;
    int nelem;
    int slice;

    // step 0: push data to next GPU
    slice = ring->devUserRanks[nranks-1];
    offset = chunkOffset + slice * chunkSize;
    nelem = min(chunkSize, size-offset);
    LLprims.send(thisInput+offset, nelem);
    continue;  // <- apparently the inserted statement: skip the remaining ring steps
    // ... (remainder of the kernel not shown)
```
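For readers unfamiliar with the LL ("low latency") protocol these primitives implement, here is a minimal illustrative sketch (hypothetical names and a simplified layout, not the actual NCCL code) of why such a kernel can sit at 100% GPU utilization without making progress: each 8 bytes of payload travel together with two 4-byte flags in a 16-byte slot, and the consumer simply polls the flags until they show the expected step value. There is no timeout, so if the producer's peer-to-peer writes never become visible, the poll loop spins forever.

```cuda
#include <stdint.h>

// Illustrative sketch only (hypothetical types and names, not NCCL source):
// a 16-byte LL slot carries 8 bytes of payload interleaved with two flags.
struct LLLineSketch {
  volatile uint32_t data1, flag1, data2, flag2;
};

// Producer: data and flag are written together, so the consumer can detect
// arrival without any separate synchronization.
__device__ void storeLLSketch(LLLineSketch* line, uint64_t val, uint32_t flag) {
  line->data1 = (uint32_t)val;         line->flag1 = flag;
  line->data2 = (uint32_t)(val >> 32); line->flag2 = flag;
}

// Consumer: poll until both flags show the expected step value. If the
// producer's P2P write is dropped or misrouted (for example by PCIe ACS),
// this loop never exits; the SM stays busy, which looks like 100% GPU usage.
__device__ uint64_t readLLSketch(LLLineSketch* line, uint32_t flag) {
  uint32_t d1, f1, d2, f2;
  do {
    d1 = line->data1; f1 = line->flag1;
    d2 = line->data2; f2 = line->flag2;
  } while (f1 != flag || f2 != flag);
  return (uint64_t)d1 | ((uint64_t)d2 << 32);
}
```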
If I comment out […]. Update: the only place which mutates it seems to be […].
So, […].
OK, now I need to find where it enters an infinite loop.
Gotcha! We call […]. My current cmdline is […]. There are two options: […]
Actually, negative […]. So this is definitely unrelated to my […].
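In case "negative" here refers to the `nelem = min(chunkSize, size-offset)` computation, a worked example with made-up numbers shows why a negative count by itself cannot be the culprit:

```cuda
// Made-up numbers for illustration: say size = 1000 elements and this
// rank/block's slice starts at offset = 1200. Then
//   nelem = min(chunkSize, size - offset) = min(chunkSize, -200)   // negative
// and LLGenericOp clamps it on entry:
//   uint32_t nbytes = nelem < 0 ? 0 : nelem * sizeof(T);           // -> 0
//   uint32_t npack  = DIVUP(nbytes, sizeof(uint64_t));             // -> 0
// so the copy loop body never runs; the step only performs the
// waitSend()/barrier()/postRecv() bookkeeping with an empty payload,
// not an out-of-bounds access.
```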
Thus the immediate cause is this code:

```cuda
    // step 0: push data to next GPU
    slice = ring->devUserRanks[nranks-1];
    offset = chunkOffset + slice * chunkSize;
    nelem = min(chunkSize, size-offset);
    LLprims.send(thisInput+offset, nelem);
```

When […]:

[…]
Nah, another dead end: […].
The actual spot where the control flow hangs is this `barrier()`:

```cuda
template <int RECV, int SEND, int SRC, int DST>
__device__ void LLGenericOp(const T* srcPtr, T* dstPtr, int nelem) {
  uint32_t nbytes = nelem < 0 ? 0 : nelem*sizeof(T);
  FOR_SEND(waitSend, nbytes*2);
  barrier(); // <- hangs here
  uint32_t npack = DIVUP(nbytes, sizeof(uint64_t));
  uint64_t* srcPack = (uint64_t*)srcPtr;
  uint64_t* dstPack = (uint64_t*)dstPtr;
```
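One detail worth noting (my interpretation, not something confirmed by the NCCL authors in this thread): `barrier()` is a block-wide sync, so if the thread responsible for `waitSend()` never sees the peer's FIFO head advance, that thread spins inside `waitSend()` while every other thread parks in `barrier()`, which is exactly what a hang "at" `barrier()` looks like. A rough, hypothetical sketch of that kind of credit check:

```cuda
#include <stdint.h>

// Hypothetical sketch of a credit-based waitSend (illustrative, not NCCL code).
// The peer advances *head (over PCIe P2P) after draining a FIFO slot; the
// sender may not reuse a slot until that update becomes visible locally.
__device__ void waitSendSketch(volatile uint64_t* head,  // written remotely by the receiver
                               uint64_t stepsSent,       // slots this sender has already filled
                               int fifoSlots) {          // FIFO depth in slots
  // No timeout: if the receiver's write-back is lost or misrouted (for
  // example by ACS/IOMMU redirection), this spins forever and the subsequent
  // block-wide barrier() strands all the other threads as well.
  while (*head + fifoSlots < stepsSent + 1) {
    /* spin */
  }
}
```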
I should have checked […].
Waiting for the authors to confirm that this is not a problem with NCCL; I will close this right away once they do.
I see in the linked TF issue that you're running under Kubernetes. Is it also a virtualized environment? Thanks for the detailed follow-ups; sorry you had to go so deep into the code just to discover that it was something external after all. It would be nice if we had a way to detect and guard against whatever system configuration is causing this.
Note this workaround ([…]) […]. To confirm this, the first step is to run the CUDA […] sample. The solution is to either disable VT-d/IOMMU in the BIOS, or have a script disable ACS upon boot. See more information here and here.
We are running Kubernetes on bare-metal machines; no virtualization is used. I will consult with our infra team about the details. I know that before switching to Kubernetes, we ran old-style, and […]. Trying to run […].
@sjeaugey Interesting, that sample passes without hangs:

[…]
Thanks. Interesting indeed: CE copies seem not to hang, but are just really slow. Notice the 0.23 GB/s:

[…]

Let us know if performance goes back to normal with the BIOS or system change. NCCL performance and functionality should be back as well.
We disabled VT-d in the BIOS and the hang still persists, with the same DMA errors. The corresponding BIOS checkbox is off and the kernel prints […].
@sjeaugey @cliffwoolley Booting the kernel with […].
Thanks for the help.
I can suggest searching for […].
My problem was diagnosed in tensorflow/tensorflow#32654 - please find all the info about my environment there.
Using the `master` version of nccl. I launch `all_reduce_perf` and it hangs with 100% volatile GPU usage reported. I waited for 10 minutes; there are no more logs printed.