ncclCommAbort stuck during NCCL errors #992

pritamdamania87 · 2023-09-13T19:30:34Z

We had a process hang for a long time because the ncclCommAbort call was stuck, with the following stack trace:

Thread 3438296 (idle): "Thread-41"
    0x155555277197 (libc.so.6)
    0x15555527c6a4 (libc.so.6)
    commFree (init.cc:180)
    commCleanup (init.cc:1731)
    commReclaim (init.cc:1862)
    ncclCommAbort (init.cc:1931)
    c10d::NCCLComm::ncclCommAbort (libtorch_cuda.so)
    c10d::abortCommsFromMap (libtorch_cuda.so)
    c10d::ProcessGroupNCCL::abort (libtorch_cuda.so)

Our logic is that when we detect errors, a separate background thread calls ncclCommAbort to recover from the situation. However, this thread itself gets stuck with the stack trace above.

Would love to know how to debug/root cause this issue further.

NCCL version: 2.18.1


KaimingOuyang · 2023-09-13T21:12:22Z

Hi Pritam,
I remember you reported this issue before. NCCL currently does not handle multithreaded abort well. This issue is fixed in the upcoming NCCL 2.19.1.

pritamdamania87 · 2023-09-13T21:21:10Z

@KaimingOuyang Thanks a lot for the clarification! I do recall having a similar issue previously but wasn't sure if it was the same as this and hence I created this issue.

Reading the NCCL code, I see that we set abortFlag = 1 here: https://github.com/NVIDIA/nccl/blob/master/src/init.cc#L1974 and the ncclProxyService thread checks it here: https://github.com/NVIDIA/nccl/blob/master/src/proxy.cc#L1433. Although abortFlag is volatile (https://github.com/NVIDIA/nccl/blob/master/src/include/comm.h#L269), volatile does not provide the guarantees we need for signaling across threads.

From https://en.cppreference.com/w/c/language/volatile:

Note that volatile variables are not suitable for communication between threads; they do not offer atomicity, synchronization, or memory ordering. A read from a volatile variable that is modified by another thread without synchronization or concurrent modification from two unsynchronized threads is undefined behavior due to a data race.

From https://en.cppreference.com/w/cpp/language/cv:

Every access (read or write operation, member function call, etc.) made through a glvalue expression of volatile-qualified type is treated as a visible side-effect for the purposes of optimization (that is, within a single thread of execution, volatile accesses cannot be optimized out or reordered with another visible side effect that is sequenced-before or sequenced-after the volatile access. This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution, see std::memory_order). Any attempt to access a volatile object through a glvalue of non-volatile type (e.g. through a reference or pointer to non-volatile type) results in undefined behavior.

Is this the issue with the current code that will be fixed in the upcoming 2.19.1 release? Is there any sort of workaround in the meantime?
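To make the volatile concern above concrete, here is a minimal sketch of an abort flag that is signaled across threads with std::atomic rather than volatile. Everything in it (DemoComm, proxyLoop, abortAndJoin) is a hypothetical stand-in invented for illustration. It is not NCCL code, and it does not claim to be the cause of the hang or the fix planned for 2.19.1.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-in for a communicator; NOT NCCL's ncclComm struct.
struct DemoComm {
    std::atomic<int> abortFlag{0};    // atomic instead of `volatile int`
    std::atomic<long> iterations{0};  // progress counter, for illustration only
};

// Hypothetical stand-in for a proxy service loop: polls until abort is requested.
void proxyLoop(DemoComm& comm) {
    // The acquire load pairs with the release store in abortAndJoin().
    while (comm.abortFlag.load(std::memory_order_acquire) == 0) {
        comm.iterations.fetch_add(1, std::memory_order_relaxed);
        std::this_thread::yield();
    }
}

// Starts the loop, requests abort from the calling thread, then joins.
// Returns how many iterations the loop made before it observed the flag.
long abortAndJoin(DemoComm& comm) {
    std::thread proxy(proxyLoop, std::ref(comm));
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    comm.abortFlag.store(1, std::memory_order_release);  // signal abort
    proxy.join();  // returns once the loop has seen the flag and exited
    return comm.iterations.load(std::memory_order_relaxed);
}
```

Calling abortAndJoin on a fresh DemoComm returns after the worker observes the store. With an atomic flag the C++ memory model treats the concurrent write and reads as race-free, which is not the case for a plain volatile int.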

pritamdamania87 · 2023-09-13T21:26:32Z

@KaimingOuyang Ah I found the previous context here: pytorch/pytorch#103927. Based on pytorch/pytorch#103927 (comment), looks like we were planning to release 2.19.1 sometime around early August. Wondering if you have an updated timeline for the release? Thanks a ton!

KaimingOuyang · 2023-09-13T21:29:59Z

ok, thanks for the reference. When the thread hangs, do you see the abortFlag at https://github.com/NVIDIA/nccl/blob/master/src/proxy.cc#L1433 is still 0?

pritamdamania87 · 2023-09-13T21:34:23Z

> ok, thanks for the reference. When the thread hangs, do you see the abortFlag at https://github.com/NVIDIA/nccl/blob/master/src/proxy.cc#L1433 is still 0?

The traceback we have doesn't have any information regarding variables like abortFlag. This was just a hypothesis based on reading the code. However, the other weird thing is that in the full traceback of all threads we don't see any ncclProxyService thread running and yet it seems like the pthread_join is waiting on that thread.

KaimingOuyang · 2023-11-07T17:23:35Z

Hi Pritam,
Could you please try this branch https://github.com/NVIDIA/nccl/tree/github-abort-meta, and let me know the results? Thanks!

igozali · 2023-11-09T19:28:46Z

Hi Kaiming, I had some questions about the patch. IIUC you're using some shared memory to implement a heartbeat mechanism between local ranks in a node in the proxy service. Alternatively, wondering if we could just use TCP keepalive to check if the local peers are still alive, since TCP keepalive also uses heartbeats but the kernel manages that?
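For readers unfamiliar with the idea, a counter-based heartbeat check can be sketched as below. This is only an illustration of the general technique and assumes nothing about how the github-abort-meta branch actually implements it. HeartbeatSlot, beat, and peerAlive are hypothetical names, and the slot here lives in ordinary process memory rather than in a shared-memory segment.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical heartbeat slot. In a multi-process setup this would be placed
// in shared memory and would require a lock-free atomic.
struct HeartbeatSlot {
    std::atomic<std::uint64_t> beats{0};
};

// Called periodically by the monitored peer to show it is still making progress.
void beat(HeartbeatSlot& slot) {
    slot.beats.fetch_add(1, std::memory_order_release);
}

// Returns true if the counter advanced within `timeout`, false otherwise.
bool peerAlive(const HeartbeatSlot& slot,
               std::chrono::milliseconds timeout,
               std::chrono::milliseconds pollInterval = std::chrono::milliseconds(1)) {
    const std::uint64_t start = slot.beats.load(std::memory_order_acquire);
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (std::chrono::steady_clock::now() < deadline) {
        if (slot.beats.load(std::memory_order_acquire) != start) return true;
        std::this_thread::sleep_for(pollInterval);
    }
    // One final check in case the peer beat right at the deadline.
    return slot.beats.load(std::memory_order_acquire) != start;
}
```

The timeout is chosen by the application, which is the main difference from relying on a kernel-managed mechanism: the monitor decides how long a silent peer is tolerated.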

KaimingOuyang · 2023-11-17T17:32:18Z

No, we cannot use keepalive because it takes too long to time out (the RFC mentions the timeout threshold is at least 2 hours). In addition, I don't want this decision to be made by the OS, in case users have no control over their systems (e.g. no root).

KaimingOuyang · 2023-11-22T23:15:45Z

@pritamdamania87 Can you let me know whether it works for you?

pritamdamania87 · 2023-11-27T20:06:08Z

@igozali was looking into this and can report back on his findings.

KaimingOuyang · 2023-11-28T17:31:00Z

Hi @pritamdamania87 ,
Just want to confirm with you. How do you kill the process?
If you "kill" the process by issuing exit() in pytorch, you might face the issue I mentioned in #1013 (comment)

pritamdamania87 · 2023-11-28T18:30:41Z

@KaimingOuyang For this particular issue, we never called exit(). A process crashed fatally and then the rest of the processes called ncclCommAbort.

KaimingOuyang · 2023-11-28T18:41:15Z

Great! Thanks for the confirmation.

KaimingOuyang · 2024-01-02T22:11:14Z

@igozali Did you get any results?

KaimingOuyang mentioned this issue Nov 8, 2023

Question about ncclCommAbort stuck issue #1013