orphaned orted processes after failing mpi jobs #7790

Closed
iassiour opened this issue Jun 5, 2020 · 5 comments
Labels
RTE Issue likely is in RTE or PMIx areas

Comments

@iassiour

iassiour commented Jun 5, 2020

After mpirun jobs fail, orphaned orted processes are often left behind, each consuming 100% CPU. I am using version 4.0.1, which appears to be the first version to show this issue.

The pstack output of the hanging orted process is as follows:

Thread 2 (Thread 0x148556eb2700 (LWP 12288)):
#0  0x000014855739467d in poll () at ../sysdeps/unix/syscall-template.S:84
#1  0x0000148558465d26 in poll_dispatch ()
#2  0x000014855845bf5d in opal_libevent2022_event_base_loop () 
#3  0x000014855840566e in progress_engine () 
#4  0x000014855765b494 in start_thread (arg=0x148556eb2700) at pthread_create.c:333
#5  0x000014855739dacf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

Thread 1 (Thread 0x148558d6f040 (LWP 12285)):
#0  __GI___pthread_mutex_lock (mutex=0x23b0020) at ../nptl/pthread_mutex_lock.c:75
#1  0x0000148558459350 in opal_libevent2022_event_del () 
#2  0x000014855845ea45 in opal_libevent2022_event_base_free () 
#3  0x000014855854173e in tracker_destructor () 
#4  0x0000148558541c81 in pmix_progress_thread_stop () 
#5  0x0000148558555791 in OPAL_MCA_PMIX3X_PMIx_server_finalize () 
#6  0x00001485584b97a8 in pmix3x_server_finalize () 
#7  0x0000148558887494 in pmix_server_finalize () 
#8  0x00001485588adf17 in orte_ess_base_orted_finalize ()
#9  0x00001485588b56c9 in rte_finalize () 
#10 0x000014855885e2a0 in orte_finalize () 
#11 0x000014855887dbd5 in orte_daemon () 
#12 0x00000000004007b9 in main ()
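
For anyone who wants to check a node for leftover daemons like the one above, here is a minimal sketch. It assumes an orphaned orted has been reparented to PID 1, which may not hold on systems that use a subreaper, and it assumes a `ps` that accepts the standard `-o name=` form. The function names `find_orphans` and `list_processes` are made up for illustration and are not part of Open MPI.

```python
import os
import subprocess


def find_orphans(ps_output, name="orted", min_cpu=0.0):
    """Return (pid, cpu) tuples for processes called `name` whose parent is PID 1.

    ASSUMPTION: "parent PID is 1" is used as the definition of orphaned.
    """
    orphans = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)  # pid, ppid, pcpu, comm
        if len(fields) < 4:
            continue
        pid, ppid, cpu, comm = fields
        if comm.strip() == name and int(ppid) == 1 and float(cpu) >= min_cpu:
            orphans.append((int(pid), float(cpu)))
    return orphans


def list_processes():
    """Run ps with header-less columns: pid, ppid, %cpu, command name."""
    env = dict(os.environ, LC_ALL="C")  # keep the decimal point in %cpu
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "pcpu=", "-o", "comm="],
        capture_output=True, text=True, check=True, env=env,
    )
    return result.stdout


if __name__ == "__main__":
    for pid, cpu in find_orphans(list_processes()):
        print(f"orphaned orted pid={pid} cpu={cpu}%")
```

Passing `min_cpu=90.0` would narrow the result to daemons that are spinning, which matches the symptom described here. Note that the CPU figure reported by `ps` is averaged over the lifetime of the process, so it is only a rough indicator.
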
@jsquyres
Member

jsquyres commented Jun 5, 2020

Can you try 4.0.4rc2 (or later)?

https://www.open-mpi.org/software/ompi/v4.0/

@iassiour
Author

iassiour commented Jun 6, 2020

I will try that, but the issue is not easy to reproduce, so I would not know straight away whether the new version works.

Is this related to a known issue? If so, could you please give me some more details?

@jsquyres
Member

jsquyres commented Jun 6, 2020

No, it is not related to a known issue. I'm just asking you to update to the latest release in the series to see if it has been fixed as a matter of course. Sorry. ☹️

@iassiour
Author

iassiour commented Jun 6, 2020

It looks like the issue is fixed from 4.0.2 onward. Could it be related to this change?

  • Update embedded PMIx to 3.1.4
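
To check whether an installation is at or past the release where the problem reportedly stopped appearing, a small sketch along these lines might help. It assumes the first line of `mpirun --version` has the form `mpirun (Open MPI) X.Y.Z`. The 4.0.2 threshold is only the observation reported in this thread, not a confirmed fix version, and the helper names are hypothetical.

```python
import re


def parse_ompi_version(text):
    """Extract (major, minor, patch) from `mpirun --version` style output.

    Returns None if no Open MPI version string is found.
    """
    match = re.search(r"\(Open MPI\)\s+(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def is_at_least(text, minimum=(4, 0, 2)):
    """True if the parsed version is >= `minimum`.

    ASSUMPTION: release-candidate suffixes such as "rc2" are ignored,
    so 4.0.2rc1 is treated the same as 4.0.2.
    """
    version = parse_ompi_version(text)
    return version is not None and version >= minimum


if __name__ == "__main__":
    import subprocess

    out = subprocess.run(
        ["mpirun", "--version"], capture_output=True, text=True, check=True
    ).stdout
    print(parse_ompi_version(out), is_at_least(out))
```
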

@jjhursey jjhursey added the RTE Issue likely is in RTE or PMIx areas label Jun 8, 2020
@rhc54
Contributor

rhc54 commented Apr 23, 2021

Closing as fixed.

@rhc54 rhc54 closed this as completed Apr 23, 2021

4 participants