NAMD occasionally hangs on Frontera with mpi-smp builds #2850
Can you try running with |
I don't see the hangs when I run with |
I'm not surprised there is a hang in isomalloc_sync, considering I had to fight the code a bit to get something that worked with netlrts, netlrts-smp, and multicore. Could #2838 be related? Maybe something about MPI's use of thread 0 as a comm thread instead of thread N is affecting this, though I used |. It could be that I am not driving the comm thread properly in a way that only causes an issue on this machine. The log also mentions partitions; maybe that is related, since isomalloc_sync does no special handling for partitions. For now NAMD can pass |
I would not be surprised if partitions caused the issue, or one of the things you mentioned. We could also add a build-time option, if there isn't one already, to disable isomalloc_sync entirely.
I'm seeing a similar hang in ChaNGa when built with mpi-linux-x86_64-mpicxx on a Skylake IB cluster. The hang goes away if I start up with "+noisomalloc". The hang seems to happen in CmiIsomallocSyncWait() after the call to CmiNodeReduceStruct() in CmiIsomallocInitExtent(). I can reproduce this pretty reliably if you need more information. Should I try Jaimin's PR? It seems quite old.
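For concreteness, the "+noisomalloc" workaround mentioned above goes on the application's command line like any other Converse runtime flag, after the binary. A minimal sketch, where the binary name, process count, and parameter file are all placeholders, not values from this thread:

```shell
#!/bin/sh
# Sketch only: assemble a launch line with the isomalloc workaround.
# ./ChaNGa, +p 640, and sim.param are illustrative placeholders;
# +noisomalloc is the runtime flag quoted in the comment above.
LAUNCH="./charmrun +p 640 ./ChaNGa +noisomalloc sim.param"
echo "$LAUNCH"
```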
Does the issue only happen with Intel MPI?
Has this been resolved?
Since this isn't going into 7.0, could there be a warning about hanging so that unsuspecting users know what to do? This continues to be a problem with ChaNGa on large machines.
Why don't we add a build option that ChaNGa can set to disable isomalloc entirely? Also, answering Evan's questions above from Aug 7th would help to isolate it further.
I got a traceback on 64x10 processes. Note that I'm using v6.11.0-beta1 since v7.0.0-rc1 doesn't compile on this particular machine.
Is the hang also avoided by specifying |
@trquinn Does the issue still occur with the latest |
Yes. I just got a hang with Converse/Charm++ Commit ID: v7.1.0-devel-32-ge9bca3e.
I stumbled upon this issue while doing some performance runs.
This looks like a startup hang probably caused during RTS initialization.
Running git bisect, I found that the bug was introduced by commit ecea95f.
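For anyone repeating the bisect, the workflow is roughly the following. This is a self-contained sketch on a throwaway toy repository (the commit history and the pass/fail test command are stand-ins; in the real case the run step would rebuild Charm++ and check whether a short startup completes instead of hanging):

```shell
#!/bin/sh
# Self-contained git-bisect demo in a throwaway repo. In the real
# bisect, the 'run' command would rebuild and launch a short job,
# exiting nonzero when the startup hang occurs.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name you
# Five commits; the "bug" appears at commit 4 (file content >= 4).
for i in 1 2 3 4 5; do
  echo "$i" > f
  git add f
  git commit -qm "commit $i"
done
# Mark the current tip bad and an old revision good, then let
# 'bisect run' mark each checked-out revision via the exit code
# (0 = good, nonzero = bad).
git bisect start HEAD HEAD~4 >/dev/null
git bisect run sh -c 'test "$(cat f)" -lt 4' >/dev/null
first_bad=$(git rev-parse bisect/bad)
echo "first bad: $(git log -1 --format=%s "$first_bad")"  # prints: first bad: commit 4
git bisect reset >/dev/null
```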