ncclCommAbort stuck during NCCL errors #992
Comments
Hi Pritam,
@KaimingOuyang Thanks a lot for the clarification! I do recall having a similar issue previously, but I wasn't sure whether it was the same as this one, hence I created this issue. Reading the NCCL code I see that we set
From https://en.cppreference.com/w/c/language/volatile:
From https://en.cppreference.com/w/cpp/language/cv:
Is this the issue with the current code that will be fixed in the upcoming 2.19.1 release? Is there any sort of workaround in the meantime?
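For readers following the volatile question above: in C and C++, volatile does not by itself synchronize accesses between threads, whereas std::atomic does. The following is a minimal sketch of an abort flag that one thread sets and another polls, assuming that is the pattern being discussed. It is not NCCL's code; progressLoop, requestAbort, and abortStopsWorker are hypothetical names made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for a progress loop (not NCCL code).
// Spins until the abort flag becomes non-zero and returns the iteration count.
std::uint64_t progressLoop(const std::atomic<std::uint32_t>& abortFlag) {
  std::uint64_t iterations = 0;
  // The acquire load pairs with the release store in requestAbort().
  while (abortFlag.load(std::memory_order_acquire) == 0) {
    ++iterations;
    std::this_thread::yield();
  }
  return iterations;
}

// Hypothetical helper: publishes the abort request to any polling thread.
void requestAbort(std::atomic<std::uint32_t>& abortFlag) {
  abortFlag.store(1, std::memory_order_release);
}

// Starts the loop on a worker thread, requests an abort from the caller,
// and returns true once the worker has exited and the flag is set.
bool abortStopsWorker() {
  std::atomic<std::uint32_t> abortFlag{0};
  std::uint64_t iterations = 0;
  std::thread worker([&] { iterations = progressLoop(abortFlag); });
  std::this_thread::sleep_for(std::chrono::milliseconds(10));
  requestAbort(abortFlag);
  worker.join();  // returns only because the worker observed the flag
  assert(!worker.joinable());
  return abortFlag.load(std::memory_order_acquire) == 1;
}
```

With std::atomic the store is guaranteed to become visible to the polling thread, so the loop terminates. A plain or volatile flag carries no such inter-thread guarantee in the language standard.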
@KaimingOuyang Ah, I found the previous context here: pytorch/pytorch#103927. Based on pytorch/pytorch#103927 (comment), it looks like we were planning to release 2.19.1 sometime around early August. Wondering if you have an updated timeline for the release? Thanks a ton!
OK, thanks for the reference. When the thread hangs, do you see that the abortFlag at https://github.com/NVIDIA/nccl/blob/master/src/proxy.cc#L1433 is still 0?
The traceback we have doesn't have any information regarding variables like
Hi Pritam,
Hi Kaiming, I had some questions about the patch. IIUC you're using some shared memory to implement a heartbeat mechanism between local ranks in a node in the proxy service. Alternatively, I'm wondering whether we could just use TCP keepalive to check if the local peers are still alive, since TCP keepalive also uses heartbeats but the kernel manages them?
No, we cannot use keepalive because it takes too long to time out (RFC 1122 requires the default keepalive interval to be at least 2 hours). In addition, I don't want to let this decision be made by the OS in case users do not have any control over their systems (e.g. no root).
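To make the heartbeat idea above concrete, here is a minimal sketch of application-level liveness detection with a timeout the application chooses itself. It is not the actual patch. HeartbeatSlot, beat, and PeerMonitor are hypothetical names, and the counter is a plain in-process atomic standing in for a slot that would live in shared memory between local ranks.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

// Hypothetical slot that a peer increments periodically. In a multi-process
// setup it would sit in shared memory; here it is an in-process atomic.
struct HeartbeatSlot {
  std::atomic<std::uint64_t> beats{0};
};

// Called by the monitored peer to signal that it is still making progress.
inline void beat(HeartbeatSlot& slot) {
  slot.beats.fetch_add(1, std::memory_order_release);
}

// Hypothetical monitor: the peer counts as alive as long as its counter
// changed less than `timeout` ago. Time is passed in explicitly so the
// logic is easy to test.
class PeerMonitor {
 public:
  PeerMonitor(const HeartbeatSlot& slot, Clock::duration timeout,
              Clock::time_point now)
      : slot_(slot),
        timeout_(timeout),
        lastSeen_(slot.beats.load(std::memory_order_acquire)),
        lastChange_(now) {
    assert(timeout > Clock::duration::zero());
  }

  // Returns true while the peer is considered alive at time `now`.
  bool alive(Clock::time_point now) {
    const std::uint64_t current = slot_.beats.load(std::memory_order_acquire);
    if (current != lastSeen_) {  // the peer made progress since the last check
      lastSeen_ = current;
      lastChange_ = now;
    }
    return now - lastChange_ < timeout_;
  }

 private:
  const HeartbeatSlot& slot_;
  Clock::duration timeout_;
  std::uint64_t lastSeen_;
  Clock::time_point lastChange_;
};
```

The point of the design is that the timeout is an ordinary application parameter (seconds rather than hours) and needs no system-level configuration.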
@pritamdamania87 Can you let me know whether it works for you?
@igozali was looking into this and can report back on his findings.
Hi @pritamdamania87,
@KaimingOuyang For this particular issue, we never called
Great! Thanks for the confirmation.
@igozali Did you get any results?
We had a process hang for a long time because the ncclCommAbort call was stuck as follows:
The logic we have is that when we detect errors, a separate background thread calls ncclCommAbort to recover from the situation and deal with the errors. However, this thread itself gets stuck with the stack trace above.
Would love to know how to debug/root cause this issue further.
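As one possible way to at least detect the situation described above, the blocking call can be run on its own thread and awaited with a deadline. This is a minimal sketch under that assumption, not a fix for the hang itself. completesWithin is a hypothetical helper, and the callable stands in for whatever wraps the abort call in the application.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical helper: runs `call` on a detached thread and reports whether
// it returned within `timeout`. If it did not, the thread is left running
// (it cannot be cancelled safely) and the caller can decide how to escalate,
// e.g. log diagnostics or terminate the process.
// Note: `call` must not throw, since an exception escaping the thread
// would call std::terminate.
bool completesWithin(std::function<void()> call,
                     std::chrono::milliseconds timeout) {
  assert(static_cast<bool>(call));
  // The promise is shared so it outlives this function if the call is slow.
  auto done = std::make_shared<std::promise<void>>();
  std::future<void> finished = done->get_future();
  std::thread([call = std::move(call), done] {
    call();
    done->set_value();
  }).detach();
  return finished.wait_for(timeout) == std::future_status::ready;
}
```

In an application, the callable would wrap the abort (for example a lambda that calls ncclCommAbort on the communicator). A false return then means the abort has not returned within the deadline, which is a natural point to capture stack traces for debugging.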
NCCL version: 2.18.1