
NAMD occasionally hangs on Frontera with mpi-smp builds #2850

Closed
nitbhat opened this issue May 12, 2020 · 15 comments · Fixed by #3462

nitbhat commented May 12, 2020

I stumbled upon this issue while doing some performance runs.

This looks like a startup hang, probably occurring during RTS initialization.


c191-001.frontera(1060)$ ibrun /scratch1/03808/nbhat4/namd/Linux-x86_64-g++-smp-impi-prod/namd2 ++ppn 13 +pemap 4-55:2,5-55:2 +commap 0,2,1,3 julio_input/runZIKV-50M-atoms.namd
TACC:  Starting up job 821440
TACC:  Starting parallel tasks...
Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: MPI_THREAD_FUNNELED (desired: MPI_THREAD_FUNNELED)
Charm++> Running in SMP mode: 64 processes, 13 worker threads (PEs) + 1 comm threads per process, 832 PEs total
Charm++> The comm. thread both sends and receives messages
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.11.0-devel-225-g4504e8531

Running git bisect, I found that the bug was introduced by ecea95f.
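For reference, a bisect over this range looks roughly like the following (the known-good endpoint below is illustrative; each step rebuilds the mpi-smp target and reruns the job above):

$ git bisect start
$ git bisect bad HEAD          # tip of the development branch, where the hang reproduces
$ git bisect good v6.10.2      # a known-good point (illustrative tag)
# rebuild, rerun the job, then mark the result of each step:
$ git bisect good              # run completed normally
$ git bisect bad               # run hung at startup
# ...repeat until git reports ecea95f as the first bad commit
$ git bisect reset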

nitbhat added this to the 6.11 milestone on May 12, 2020

stwhite91 commented May 12, 2020

Can you try running with +no_isomalloc_sync? @evan-charmworks

nitbhat commented May 12, 2020

I don't see the hangs when I run with +no_isomalloc_sync.

evan-charmworks commented

I'm not surprised there is a hang in isomalloc_sync considering I had to fight the code a bit to get something that worked with netlrts, netlrts-smp, and multicore.

Could #2838 be related?

Maybe something about MPI's use of thread 0 as a comm thread instead of thread N is affecting this, though I used CmiInCommThread() instead of hardcoding an ID check, and we don't encounter this issue on other mpi-smp builds.

It could be that I am not driving the comm thread properly, in a way that only manifests on this machine.

The log also mentions partitions. Maybe that is related. isomalloc_sync does no special handling for partitions.

For now, NAMD users can pass +no_isomalloc_sync if they hit the issue.
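For the run reported above, that just means appending the flag to the existing command line, e.g.:

$ ibrun /scratch1/03808/nbhat4/namd/Linux-x86_64-g++-smp-impi-prod/namd2 ++ppn 13 +pemap 4-55:2,5-55:2 +commap 0,2,1,3 +no_isomalloc_sync julio_input/runZIKV-50M-atoms.namd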

stwhite91 commented

I would not be surprised if partitions, or one of the other things you mentioned, caused the issue. We could also add a build-time option, if there isn't one already, to disable isomalloc_sync entirely.

evan-charmworks commented

I don't have access to Frontera to test. @nitbhat Does Jaemin's PR #2838 help with this issue?

trquinn commented Aug 2, 2020

I'm seeing a similar hang in ChaNGa when built with mpi-linux-x86_64-mpicxx on a Skylake IB cluster. The hang goes away if I start up with "+noisomalloc". The hang seems to happen in CmiIsomallocSyncWait() after the call to CmiNodeReduceStruct() in CmiIsomallocInitExtent(). I can reproduce this pretty reliably if you need more information. Should I try Jaemin's PR? It seems quite old.
And the software stack:
icc version 18.0.2 (gcc version 6.4.0 compatibility)
Intel(R) MPI Library for Linux* OS, Version 2018 Update 2 Build 20180125 (id: 18157)

evan-charmworks commented

Does the issue only happen with Intel MPI?
What are the smallest +p ++ppn arguments that cause the issue?
If it is possible to get a ++debug-no-pause session, running thread apply all bt in each logical node's GDB instance would be helpful.
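For example, the general shape of such a session (launcher details will differ on a cluster that goes through ibrun or srun; the +p/++ppn values and simulation arguments are placeholders):

$ ./charmrun +p<N> ++ppn <k> ++debug-no-pause ./ChaNGa.smp <simulation arguments>
# once the run hangs, interrupt each logical node's GDB instance (Ctrl-C) and run:
(gdb) thread apply all bt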

ericjbohm commented

Has this been resolved?

ericjbohm modified the milestones: 7.0, 7.1 on Jul 1, 2021

trquinn commented Jul 15, 2021

Since this isn't going into 7.0, could there be a warning about the hang so that unsuspecting users know what to do? This continues to be a problem for ChaNGa on large machines.

stwhite91 commented

Why don't we add a build option that ChaNGa can set to disable isomalloc entirely?

Also, answering Evan's questions above from Aug 7th would help isolate this further.
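In the meantime, a run can already disable isomalloc entirely at startup with the runtime flag trquinn mentioned above, roughly like this (the binary name, +p value, and arguments are placeholders):

$ ./charmrun +p<N> ./ChaNGa.smp +noisomalloc <simulation arguments>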

trquinn commented Jul 18, 2021

I got a traceback on 64x10 processes. Note that I'm using v6.11.0-beta1 since v7.0.0-rc1 doesn't compile on this particular machine.

Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: MPI_THREAD_FUNNELED (desired: MPI_THREAD_FUNNELED)
Charm++> Running in SMP mode: 64 processes, 9 worker threads (PEs) + 1 comm threads per process, 576 PEs total
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.11.0-beta1-0-gee129f3
Charm++ built without optimization.
Do not use for performance benchmarking (build with --with-production to do so).
Charm++ built with internal error checking enabled.
Do not use for performance benchmarking (build without --enable-error-checking to do so).
------------- Processor 19 Exiting: Called CmiAbort ------------
Reason: CmiFree reference count was zero-- is this a duplicate free?
------------- Processor 25 Exiting: Called CmiAbort ------------
Reason: CmiFree reference count was zero-- is this a duplicate free?
[25] Stack Traceback:
[19] Stack Traceback:
  [19:0] ChaNGa.smp 0xa85ce8 CmiSendNodeReduce(CmiReduction*)
  [25:0] ChaNGa.smp 0xa85ce8 CmiSendNodeReduce(CmiReduction*)
  [19:1] ChaNGa.smp 0xa892bd CmiHandleNodeReductionMessage(void*)
  [25:1] ChaNGa.smp 0xa892bd CmiHandleNodeReductionMessage(void*)
  [19:2] ChaNGa.smp 0xa84736 CsdSchedulePoll
  [25:2] ChaNGa.smp 0xa84736 CsdSchedulePoll
  [19:3] ChaNGa.smp 0xaaa998 CmiIsomallocInit(char**)
  [25:3] ChaNGa.smp 0xaaa998 CmiIsomallocInit(char**)
  [19:4] ChaNGa.smp 0xa8d6a0 ConverseCommonInit(char**)
  [25:4] ChaNGa.smp 0xa8d6a0 ConverseCommonInit(char**)
  [19:5] ChaNGa.smp 0xa6e1f0 
  [25:5] ChaNGa.smp 0xa6e1f0 
  [19:6] ChaNGa.smp 0xa6ac00 
  [25:6] ChaNGa.smp 0xa6ac00 
  [19:7] libpthread.so.0 0x2b1a23274ea5 
  [25:7] libpthread.so.0 0x2b1a23274ea5 
  [19:8] libc.so.6 0x2b1a258699fd clone
  [25:8] libc.so.6 0x2b1a258699fd clone

evan-charmworks commented

Is the hang also avoided by specifying +skip_cpu_topology?

evan-charmworks modified the milestones: 7.1, 7.0 on Aug 5, 2021
evan-charmworks changed the title from "NAMD occasionally hangs on Frontera with mpi-smp builds using charm master" to "NAMD occasionally hangs on Frontera with mpi-smp builds" on Aug 18, 2021
evan-charmworks commented

@trquinn Does the issue still occur with the latest main branch?

trquinn commented Aug 18, 2021

Yes. I just got a hang with Converse/Charm++ Commit ID: v7.1.0-devel-32-ge9bca3e

evan-charmworks linked a pull request on Sep 12, 2021 that will close this issue

evan-charmworks commented Sep 12, 2021

Fixed by #3462 and #3481
