ncclCommAbort stuck during NCCL errors #992

pritamdamania87 opened this issue Sep 13, 2023 · 14 comments

@pritamdamania87

We had a process hang for a long time because the ncclCommAbort call was stuck with the following stack trace:

Thread 3438296 (idle): "Thread-41"
    0x155555277197 (libc.so.6)
    0x15555527c6a4 (libc.so.6)
    commFree (init.cc:180)
    commCleanup (init.cc:1731)
    commReclaim (init.cc:1862)
    ncclCommAbort (init.cc:1931)
    c10d::NCCLComm::ncclCommAbort (libtorch_cuda.so)
    c10d::abortCommsFromMap (libtorch_cuda.so)
    c10d::ProcessGroupNCCL::abort (libtorch_cuda.so)

The logic we have is that when we detect errors, a separate background thread calls ncclCommAbort to recover from the situation and handle the errors. However, this thread itself gets stuck with the stack trace above.

Would love to know how to debug/root cause this issue further.

NCCL version: 2.18.1
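For context, the pattern is roughly as in the following sketch. This is a hypothetical watchdog thread, not the actual ProcessGroupNCCL implementation; it just illustrates a background thread polling ncclCommGetAsyncError and calling ncclCommAbort on failure.

```cpp
// Hypothetical sketch of the recovery pattern described above (not the real
// ProcessGroupNCCL code): a background watchdog polls the communicator for
// asynchronous errors and aborts it when one is detected.
#include <nccl.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

void ncclWatchdog(ncclComm_t comm, std::atomic<bool>& shutdown) {
    while (!shutdown.load(std::memory_order_acquire)) {
        ncclResult_t asyncErr = ncclSuccess;
        if (ncclCommGetAsyncError(comm, &asyncErr) == ncclSuccess &&
            asyncErr != ncclSuccess) {
            std::fprintf(stderr, "async NCCL error: %s, aborting communicator\n",
                         ncclGetErrorString(asyncErr));
            // This is the call that ends up stuck in commFree in the trace above.
            ncclCommAbort(comm);
            return;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
}
```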

@KaimingOuyang
Collaborator

KaimingOuyang commented Sep 13, 2023

Hi Pritam,
I remember you reported this issue before. NCCL currently does not support multithreaded abort well. This issue will be fixed in the upcoming NCCL 2.19.1.

@pritamdamania87
Author

@KaimingOuyang Thanks a lot for the clarification! I do recall hitting a similar issue previously, but I wasn't sure whether it was the same as this one, hence I created this issue.

Reading the NCCL code, I see that we set abortFlag = 1 here: https://github.com/NVIDIA/nccl/blob/master/src/init.cc#L1974 and the ncclProxyService thread checks it here: https://github.com/NVIDIA/nccl/blob/master/src/proxy.cc#L1433. Although the abortFlag is volatile (https://github.com/NVIDIA/nccl/blob/master/src/include/comm.h#L269), volatile does not provide the guarantees we need for signaling across threads.

From https://en.cppreference.com/w/c/language/volatile:

Note that volatile variables are not suitable for communication between threads; they do not offer atomicity, synchronization, or memory ordering. A read from a volatile variable that is modified by another thread without synchronization or concurrent modification from two unsynchronized threads is undefined behavior due to a data race.

From https://en.cppreference.com/w/cpp/language/cv:

Every access (read or write operation, member function call, etc.) made through a glvalue expression of volatile-qualified type is treated as a visible side-effect for the purposes of optimization (that is, within a single thread of execution, volatile accesses cannot be optimized out or reordered with another visible side effect that is sequenced-before or sequenced-after the volatile access. This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution, see std::memory_order). Any attempt to access a volatile object through a glvalue of non-volatile type (e.g. through a reference or pointer to non-volatile type) results in undefined behavior.
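To make the concern concrete, here is an illustration (hypothetical names, not NCCL's actual declarations) of the difference between a volatile flag and a std::atomic flag for this kind of cross-thread signaling:

```cpp
#include <atomic>
#include <cstdint>

// volatile only keeps the compiler from caching or reordering the access
// within a single thread; unsynchronized concurrent access from another
// thread is still a data race (undefined behavior).
volatile uint32_t abortFlagVolatile = 0;   // not sufficient for threads

// std::atomic establishes a happens-before relationship between the thread
// that sets the flag and the proxy thread that polls it.
std::atomic<uint32_t> abortFlagAtomic{0};

void requestAbort() {
    abortFlagAtomic.store(1, std::memory_order_release);
}

bool abortRequested() {
    return abortFlagAtomic.load(std::memory_order_acquire) != 0;
}
```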

Is this the issue with the current code that will be fixed in the upcoming 2.19.1 release? Is there any sort of workaround in the meantime?

@pritamdamania87
Author

@KaimingOuyang Ah, I found the previous context here: pytorch/pytorch#103927. Based on pytorch/pytorch#103927 (comment), it looks like 2.19.1 was planned for release sometime around early August. Wondering if you have an updated timeline for the release? Thanks a ton!

@KaimingOuyang
Collaborator

OK, thanks for the reference. When the thread hangs, do you see that the abortFlag at https://github.com/NVIDIA/nccl/blob/master/src/proxy.cc#L1433 is still 0?

@pritamdamania87
Author

OK, thanks for the reference. When the thread hangs, do you see that the abortFlag at https://github.com/NVIDIA/nccl/blob/master/src/proxy.cc#L1433 is still 0?

The traceback we have doesn't include any information about variables like abortFlag; this was just a hypothesis based on reading the code. However, the other odd thing is that the full traceback of all threads doesn't show any ncclProxyService thread running, and yet pthread_join appears to be waiting on that thread.

@KaimingOuyang
Collaborator

Hi Pritam,
Could you please try this branch, https://github.com/NVIDIA/nccl/tree/github-abort-meta, and let me know the results? Thanks!

@igozali

igozali commented Nov 9, 2023

Hi Kaiming, I had some questions about the patch. IIUC you're using shared memory to implement a heartbeat mechanism between local ranks on a node in the proxy service. Alternatively, I'm wondering if we could just use TCP keepalive to check whether the local peers are still alive, since TCP keepalive also uses heartbeats but lets the kernel manage them.
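For concreteness, a minimal sketch of how per-socket keepalive is typically configured on Linux (hypothetical helper, not part of any NCCL patch; the TCP_KEEP* options let a socket override the 2-hour system default, but probing is still driven by the kernel):

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Hypothetical helper: enable keepalive on an already-connected TCP socket
// and tighten the Linux per-socket timers so dead peers are detected in
// roughly (idle + interval * count) seconds.
int enableKeepalive(int fd) {
    int on = 1;        // turn keepalive on
    int idle = 10;     // seconds of idle time before the first probe
    int interval = 5;  // seconds between probes
    int count = 3;     // unanswered probes before the connection is dropped
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval)) < 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) < 0) return -1;
    return 0;
}
```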

@KaimingOuyang
Collaborator

No, we cannot use keepalive because it takes too long to time out (the RFC says the default timeout threshold is at least 2 hours). In addition, I don't want this decision to be made by the OS, in case users do not have any control over the system (e.g. no root).

@KaimingOuyang
Collaborator

@pritamdamania87 Can you let me know whether it works for you?

@pritamdamania87
Author

@igozali was looking into this and can report back on his findings.

@KaimingOuyang
Collaborator

Hi @pritamdamania87 ,
Just want to confirm with you: how do you kill the process?
If you "kill" the process by issuing exit() in PyTorch, you might face the issue I mentioned in #1013 (comment).

@pritamdamania87
Author

@KaimingOuyang For this particular issue, we never called exit(). A process crashed fatally and then the rest of the processes called ncclCommAbort.

@KaimingOuyang
Collaborator

Great! Thanks for the confirmation.

@KaimingOuyang
Collaborator

@igozali Did you get any results?
