
NAMD occasionally hangs on Frontera with mpi-smp builds #2850

Closed
nitbhat opened this issue May 12, 2020 · 15 comments · Fixed by #3462

nitbhat commented May 12, 2020

I stumbled upon this issue while doing some performance runs.

This looks like a startup hang, probably occurring during RTS initialization.


c191-001.frontera(1060)$ ibrun /scratch1/03808/nbhat4/namd/Linux-x86_64-g++-smp-impi-prod/namd2 ++ppn 13 +pemap 4-55:2,5-55:2 +commap 0,2,1,3 julio_input/runZIKV-50M-atoms.namd
TACC:  Starting up job 821440
TACC:  Starting parallel tasks...
Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: MPI_THREAD_FUNNELED (desired: MPI_THREAD_FUNNELED)
Charm++> Running in SMP mode: 64 processes, 13 worker threads (PEs) + 1 comm threads per process, 832 PEs total
Charm++> The comm. thread both sends and receives messages
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.11.0-devel-225-g4504e8531

Running git bisect, I found that the bug was introduced by ecea95f.
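For reference, a bisect over this range looks roughly like the following (the known-good endpoint below is illustrative; each step rebuilds the mpi-smp target and reruns the job above):

$ git bisect start
$ git bisect bad HEAD          # tip of the development branch, where the hang reproduces
$ git bisect good v6.10.2      # a known-good point (illustrative tag)
# rebuild, rerun the job, then mark the result of each step:
$ git bisect good              # run completed normally
$ git bisect bad               # run hung at startup
# ...repeat until git reports ecea95f as the first bad commit
$ git bisect reset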

nitbhat added this to the 6.11 milestone on May 12, 2020

stwhite91 commented May 12, 2020

Can you try running with +no_isomalloc_sync? @evan-charmworks

nitbhat commented May 12, 2020

I don't see the hangs when I run with +no_isomalloc_sync.

evan-charmworks commented

I'm not surprised there is a hang in isomalloc_sync considering I had to fight the code a bit to get something that worked with netlrts, netlrts-smp, and multicore.

Could #2838 be related?

Maybe something about MPI's use of thread 0 as a comm thread instead of thread N is affecting this, though I used CmiInCommThread() instead of hardcoding an ID check, and we don't encounter this issue on other mpi-smp builds.

It could be that I am not driving the comm thread properly, in a way that only manifests on this machine.

The log also mentions partitions. Maybe that is related. isomalloc_sync does no special handling for partitions.

For now, NAMD users can pass +no_isomalloc_sync if they hit the issue.
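For the run reported above, that just means appending the flag to the existing command line, e.g.:

$ ibrun /scratch1/03808/nbhat4/namd/Linux-x86_64-g++-smp-impi-prod/namd2 ++ppn 13 +pemap 4-55:2,5-55:2 +commap 0,2,1,3 +no_isomalloc_sync julio_input/runZIKV-50M-atoms.namd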

stwhite91 commented

I would not be surprised if partitions, or one of the other things you mentioned, caused the issue. We could also add a build-time option, if there isn't one already, to disable isomalloc_sync entirely.

evan-charmworks commented

I don't have access to Frontera to test. @nitbhat Does Jaemin's PR #2838 help with this issue?

trquinn commented Aug 2, 2020

I'm seeing a similar hang in ChaNGa when built with mpi-linux-x86_64-mpicxx on a Skylake IB cluster. The hang goes away if I start up with "+noisomalloc". The hang seems to happen in CmiIsomallocSyncWait() after the call to CmiNodeReduceStruct() in CmiIsomallocInitExtent(). I can reproduce this pretty reliably if you need more information. Should I try Jaemin's PR? It seems quite old.
And the software stack:
icc version 18.0.2 (gcc version 6.4.0 compatibility)
Intel(R) MPI Library for Linux* OS, Version 2018 Update 2 Build 20180125 (id: 18157)

evan-charmworks commented

Does the issue only happen with Intel MPI?
What are the smallest +p ++ppn arguments that cause the issue?
If it is possible to get a ++debug-no-pause session, running thread apply all bt in each logical node's GDB instance would be helpful.
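For example, the general shape of such a session (launcher details will differ on a cluster that goes through ibrun or srun; the +p/++ppn values and simulation arguments are placeholders):

$ ./charmrun +p<N> ++ppn <k> ++debug-no-pause ./ChaNGa.smp <simulation arguments>
# once the run hangs, interrupt each logical node's GDB instance (Ctrl-C) and run:
(gdb) thread apply all bt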

ericjbohm commented

Has this been resolved?

ericjbohm modified the milestones: 7.0, 7.1 on Jul 1, 2021

trquinn commented Jul 15, 2021

Since this isn't going into 7.0, could there be a warning about the hang so that unsuspecting users know what to do? This continues to be a problem for ChaNGa on large machines.

stwhite91 commented

Why don't we add a build option that ChaNGa can set to disable isomalloc entirely?

Also, answering Evan's questions above from Aug 7th would help isolate this further.
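In the meantime, a run can already disable isomalloc entirely at startup with the runtime flag trquinn mentioned above, roughly like this (the binary name, +p value, and arguments are placeholders):

$ ./charmrun +p<N> ./ChaNGa.smp +noisomalloc <simulation arguments>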

trquinn commented Jul 18, 2021

I got a traceback on 64x10 processes. Note that I'm using v6.11.0-beta1 since v7.0.0-rc1 doesn't compile on this particular machine.

Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: MPI_THREAD_FUNNELED (desired: MPI_THREAD_FUNNELED)
Charm++> Running in SMP mode: 64 processes, 9 worker threads (PEs) + 1 comm threads per process, 576 PEs total
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v6.11.0-beta1-0-gee129f3
Charm++ built without optimization.
Do not use for performance benchmarking (build with --with-production to do so).
Charm++ built with internal error checking enabled.
Do not use for performance benchmarking (build without --enable-error-checking to do so).
------------- Processor 19 Exiting: Called CmiAbort ------------
Reason: CmiFree reference count was zero-- is this a duplicate free?
------------- Processor 25 Exiting: Called CmiAbort ------------
Reason: CmiFree reference count was zero-- is this a duplicate free?
[25] Stack Traceback:
[19] Stack Traceback:
  [19:0] ChaNGa.smp 0xa85ce8 CmiSendNodeReduce(CmiReduction*)
  [25:0] ChaNGa.smp 0xa85ce8 CmiSendNodeReduce(CmiReduction*)
  [19:1] ChaNGa.smp 0xa892bd CmiHandleNodeReductionMessage(void*)
  [25:1] ChaNGa.smp 0xa892bd CmiHandleNodeReductionMessage(void*)
  [19:2] ChaNGa.smp 0xa84736 CsdSchedulePoll
  [25:2] ChaNGa.smp 0xa84736 CsdSchedulePoll
  [19:3] ChaNGa.smp 0xaaa998 CmiIsomallocInit(char**)
  [25:3] ChaNGa.smp 0xaaa998 CmiIsomallocInit(char**)
  [19:4] ChaNGa.smp 0xa8d6a0 ConverseCommonInit(char**)
  [25:4] ChaNGa.smp 0xa8d6a0 ConverseCommonInit(char**)
  [19:5] ChaNGa.smp 0xa6e1f0 
  [25:5] ChaNGa.smp 0xa6e1f0 
  [19:6] ChaNGa.smp 0xa6ac00 
  [25:6] ChaNGa.smp 0xa6ac00 
  [19:7] libpthread.so.0 0x2b1a23274ea5 
  [25:7] libpthread.so.0 0x2b1a23274ea5 
  [19:8] libc.so.6 0x2b1a258699fd clone
  [25:8] libc.so.6 0x2b1a258699fd clone

evan-charmworks commented

Is the hang also avoided by specifying +skip_cpu_topology?

evan-charmworks modified the milestones: 7.1, 7.0 on Aug 5, 2021
evan-charmworks changed the title from "NAMD occasionally hangs on Frontera with mpi-smp builds using charm master" to "NAMD occasionally hangs on Frontera with mpi-smp builds" on Aug 18, 2021
evan-charmworks commented

@trquinn Does the issue still occur with the latest main branch?

trquinn commented Aug 18, 2021

Yes. I just got a hang with Converse/Charm++ Commit ID: v7.1.0-devel-32-ge9bca3e

evan-charmworks linked a pull request on Sep 12, 2021 that will close this issue

evan-charmworks commented Sep 12, 2021

Fixed by #3462 and #3481
