COMM_SPAWN is broken in 2.1.x HEAD #2030
Strangely enough, IBM MTT shows the issue on master but not on v2.x. Also, I looked at the 1.1.5rc commit and it is far from being a significant change. If this commit is the issue, it should be easy to fix. |
I'm currently unable to reproduce the problem on v2.x outside of MTT. On MTT though, it failed both with tcp and smcuda, and the error is a clear segmentation fault in orted (not the random pmix failure when there is a remaining socket in /tmp). |
@sjeaugey Can you provide a backtrace? |
@sjeaugey Also, can you force your MTT to use a tarball or git clone with that PMIx commit reverted? |
@jsquyres I still need to be able to reproduce the problem. I'm trying to launch MTT building only v2.x and only IBM dynamic/spawn, but maybe it only fails when spawn is launched in the middle of all the other tests. |
@sjeaugey You might be able to run this indirectly via MTT. E.g. (I didn't look at your MTT results; I'm guessing/assuming it's the ibm test suite):

$ client/mtt --file ... --scratch ... --verbose --mpi-get
$ client/mtt --file ... --scratch ... --verbose --mpi-install
$ client/mtt --file ... --scratch ... --verbose --test-get --section ibm
$ client/mtt --file ... --scratch ... --verbose --test-build --section ibm
$ client/mtt --file ... --scratch ... --verbose --test-run --section ibm-run-the-bad-test

I.e., make a run section (here called ibm-run-the-bad-test) that only runs the failing test. Make sense? |
@jsquyres Yes, it makes sense. Still, no luck: when I only run dynamic/spawn, the tests pass. @rhc54 I'm starting to wonder if I was just unlucky and hit the pmix problem (when the socket is already in /tmp), except that in the case of spawns, instead of the usual "Error in file orted/pmix/pmix_server.c at line 254" message, I get a segmentation fault. |
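For readers without the IBM test suite handy: the dynamic/spawn tests essentially exercise MPI_Comm_spawn followed by communication over the resulting intercommunicator. A minimal self-spawning sketch is below (illustrative only; the process count, broadcast value, and use of MPI_Comm_disconnect are assumptions, not the actual IBM test source):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, intercomm;
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* Parent: spawn two children running this same binary. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        value = 42;
        /* On an intercommunicator the root passes MPI_ROOT; other
         * members of the root group pass MPI_PROC_NULL. */
        MPI_Bcast(&value, 1, MPI_INT,
                  (0 == rank) ? MPI_ROOT : MPI_PROC_NULL, intercomm);
        MPI_Comm_disconnect(&intercomm);
    } else {
        /* Child: receive the value from rank 0 of the parent group. */
        MPI_Bcast(&value, 1, MPI_INT, 0, parent);
        printf("child %d got %d\n", rank, value);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}
```

The failures discussed in this thread show up in the orted daemons during or after such a spawn, not in the test code itself.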
I added tmp cleanup before my nightly MTT run. We'll see tomorrow if it solved the issue. |
@sjeaugey Looks like you ran clean last night - can you clarify exactly what you did? Did you just clean the tmp? Or did you also roll back the PMIx 1.1.5 commit? |
@rhc54 I wouldn't say that. I got rid of some failures (caused by remaining sockets in /tmp), but two failures remain: spawn and spawn_with_env_vars. And spawn_multiple timed out too. I'm still unable to reproduce it outside of the nightly runs, however. Working on it. |
how strange...the nightly summary didn't show those failures. Is there a core file you can look at? A line number where those daemons are crashing would really help. |
A new interesting fact: on my second node, the spawn processes are still there, stuck:
I tried to gstack them. That killed some of the processes, but some others survived and gave me a stack trace:
I'll kill those remaining processes eventually (they could be the cause of the failures), but if you want me to look at something in particular, please tell me. |
Yeah, I can believe they would be stuck - they are trying to Abort and the daemon is gone, so there is nobody they can tell. I believe we have solved that since the 1.1 series, but I'll check to ensure we do. The real question is: why are the orteds crashing? Any chance of a line number from a core file there? |
Still trying to get a core ... worst case, I'll set ulimit in my nightly script. |
@rhc54 I got a core file. Here is the orted backtrace:
|
can you print the value of "msg" and, if not NULL, the contents (*msg)? |
I followed all the code behind line 239 (a big macro calling a lot of other macros) but everything seems normal. Even looking at the assembly, the line where it crashed is:
%rbp is equal to 0x7fffb6fc9cb0, and the value of -0x30(%rbp) is accessible and equal to 0x30. I don't get it. |
For reference, the content of msg and msg->msg:
|
I see the issue - it was fixed on master. Can you try this patch?
|
Ahah ... I was reading the code of master instead of v2.x. Of course I didn't see the issue. Thanks! As for the patch, well, since I still haven't figured out how to reproduce the issue, I cannot really test it. |
Okay, I have filed a PR - if you can "review" it, then we can see if it passes MTT |
fixed in open-mpi/ompi-release#1359 |
@jsquyres please wait until tomorrow/Friday to close the bug. I couldn't check that it actually fixes the bug (because I cannot reproduce the issue); hence the need to push the fix to v2.x to see if the MTT results are better. |
Ok. |
I don't have any other output. Please also note that the Esslingen MTT is showing the exact same list of timed-out tests (c_reqops, intercomm_create, spawn, spawn_with_env_vars, spawn_multiple). |
Move to 2.0.2 since spawn appears to be still broken on 2.x |
I am running it. I will try to reproduce it outside of MTT. |
I was able to get no-disconnect to hang when run on more than one node, so I can make that happen at will. I still cannot get any other dynamic test in that suite to fail or hang. no-disconnect does not hang on master, so this appears to be something specific to v2.x. However, I am getting an error message on master from the TCP btl during finalize at the very end of the test:
Adding an assert in that spot generates the following stacktrace:
So it could be that we have a race condition in finalize that is causing the problem. |
I was not able to reproduce this problem using uGNI BTL. |
This commit fixes an abort during finalize because pending events were removed from the list twice. References open-mpi#2030 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Per the call today: for the TCP segv, @hjelmn thinks that the items are being removed from the list twice (e.g., in the destructor) in debug builds. He just filed #2077 to fix this; that will fix the segv. But there's also a hang issue that has not yet been fully diagnosed. Suggestion:
|
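To make the diagnosis concrete: the abort pattern being described is an object that is taken off a pending list once in the component-close path and then again in its destructor. The following self-contained sketch (plain C with a toy list, not the actual OPAL/btl-tcp code) shows why a debug build blows up on the second removal:

```c
/* Toy illustration of the double-removal bug pattern (NOT the actual
 * OPAL/btl-tcp code): the close path drains a pending list, and the
 * destructor of each element removes it from the list again. */
#include <assert.h>

struct item {
    struct item *prev, *next;
    int on_list;                      /* debug bookkeeping */
};

struct list { struct item head; };    /* sentinel-based doubly-linked list */

static void list_init(struct list *l)
{
    l->head.prev = l->head.next = &l->head;
}

static void list_append(struct list *l, struct item *it)
{
    it->prev = l->head.prev;
    it->next = &l->head;
    l->head.prev->next = it;
    l->head.prev = it;
    it->on_list = 1;
}

static void list_remove(struct item *it)
{
    /* In a debug build this is the check that aborts (or warns) when an
     * item that is no longer on any list is removed a second time. */
    assert(it->on_list && "removing an item that is not on a list");
    it->prev->next = it->next;
    it->next->prev = it->prev;
    it->on_list = 0;
}

/* The element's destructor also removes it from the pending list. */
static void item_destruct(struct item *it)
{
    list_remove(it);
}

int main(void)
{
    struct list pending;
    struct item ev;

    list_init(&pending);
    list_append(&pending, &ev);

    list_remove(&ev);      /* component close drains the pending list ...   */
    item_destruct(&ev);    /* ... then destructs the event: second removal, */
    return 0;              /* which trips the assert in debug builds.       */
}
```

Per the commit message above, the fix is to make sure the pending events are removed from the list in only one place.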
This commit fixes an abort during finalize because pending events were removed from the list twice. References open-mpi#2030 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Update on where we are on this issue:
|
This may not be fixed in 2.1.0 if we go with the external PMIx 2 solution.
@hjelmn I just checked the v2.0.x branch again, and no-disconnect now hangs in the BTL finalize for the TCP BTL:

424         while (lock->u.lock == OPAL_ATOMIC_LOCKED) {
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.163-3.el7.x86_64 elfutils-libs-0.163-3.el7.x86_64 glibc-2.17-106.el7_2.8.x86_64 libattr-2.4.46-12.el7.x86_64 libcap-2.22-8.el7.x86_64 libibverbs-1.1.8-8.el7.x86_64 libnl3-3.2.21-10.el7.x86_64 numactl-libs-2.0.9-6.el7_2.x86_64 systemd-libs-219-19.el7_2.13.x86_64 xz-libs-5.1.2-12alpha.el7.x86_64 zlib-1.2.7-15.el7.x86_64
(gdb) where
#0 0x00007f7865817a1d in opal_atomic_lock (lock=0x7f7865a268c4 <mca_btl_tcp_component+548>) at ../../../../opal/include/opal/sys/atomic_impl.h:424
#1 0x00007f7865817c5d in opal_mutex_atomic_lock (m=0x7f7865a26860 <mca_btl_tcp_component+448>) at ../../../../opal/threads/mutex_unix.h:183
#2 0x00007f786581813d in mca_btl_tcp_event_destruct (event=0x1b73180) at btl_tcp_component.c:194
#3 0x00007f7865817b99 in opal_obj_run_destructors (object=0x1b73180) at ../../../../opal/class/opal_object.h:460
#4 0x00007f7865818ee7 in mca_btl_tcp_component_close () at btl_tcp_component.c:418
#5 0x00007f786fce4ba9 in mca_base_component_close (component=0x7f7865a266a0 <mca_btl_tcp_component>, output_id=-1) at mca_base_components_close.c:53
#6 0x00007f786fce4c69 in mca_base_components_close (output_id=-1, components=0x7f786ffa4250 <opal_btl_base_framework+80>, skip=0x0)
at mca_base_components_close.c:85
#7 0x00007f786fce4c10 in mca_base_framework_components_close (framework=0x7f786ffa4200 <opal_btl_base_framework>, skip=0x0) at mca_base_components_close.c:65
#8 0x00007f786fd07d64 in mca_btl_base_close () at base/btl_base_frame.c:158
#9 0x00007f786fcf22c5 in mca_base_framework_close (framework=0x7f786ffa4200 <opal_btl_base_framework>) at mca_base_framework.c:214
#10 0x00007f7870920bd2 in mca_bml_base_close () at base/bml_base_frame.c:130
#11 0x00007f786fcf22c5 in mca_base_framework_close (framework=0x7f7870bbc4c0 <ompi_bml_base_framework>) at mca_base_framework.c:214
#12 0x00007f78708b6161 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:419
#13 0x00007f78708dda05 in PMPI_Finalize () at pfinalize.c:45
#14 0x00000000004013c9 in main (argc=1, argv=0x7ffffe18d1b8) at no-disconnect.c:130

I can't get any other dynamic test to fail - any ideas why the TCP BTL is locking up? |
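One way to read that backtrace: the event destructor is spinning on the TCP component's lock, and if the path that runs the destructor already holds that same non-recursive lock, it will spin forever. Here is a self-contained sketch of that self-deadlock pattern (an assumed scenario with illustrative names, not the actual btl/tcp code):

```c
/* Self-deadlock sketch (assumed scenario, illustrative names only): a
 * close path holds a non-recursive spin lock and then runs a destructor
 * that tries to take the same lock, so it spins forever - which is what
 * a hang inside opal_atomic_lock() looks like in a backtrace. */
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag component_lock = ATOMIC_FLAG_INIT;

static void spin_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set(l)) {
        /* busy-wait, like: while (lock->u.lock == OPAL_ATOMIC_LOCKED) */
    }
}

static void spin_unlock(atomic_flag *l)
{
    atomic_flag_clear(l);
}

static void event_destruct(void)
{
    spin_lock(&component_lock);      /* never returns: lock already held */
    /* ... unlink the event from the component's pending list ... */
    spin_unlock(&component_lock);
}

static void component_close(void)
{
    spin_lock(&component_lock);      /* close path takes the lock ...    */
    event_destruct();                /* ... and then destructs an event  */
    spin_unlock(&component_lock);
}

int main(void)
{
    component_close();               /* hangs by construction            */
    printf("never reached\n");
    return 0;
}
```

Whether this is exactly what happens in btl/tcp is not established above; it is just one way to produce the spin shown in frame #0.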
I thought I fixed this in 2.0.x. There was an easily identifiable bug in btl/tcp. Let me see if the commit came over. |
Bring this over. |
will do - will report back later. Thx! |
np |
Okay, we have 2.0.x fixed, but not 2.x - sorry for the confusion. On 2.x, we are getting a warning about removing an item that is no longer on a list:

Warning :: opal_list_remove_item - the item 0xf30630 is not on the list 0x7fbb6497c8c8

This comes at the end of no-disconnect, and is likely again from the TCP btl. I'll try to find out where. |
Same bug, different symptom. |
Except that the patch you suggested is already there. |
@rhc54 Sorry; I still saw COMM_SPAWN failures over the weekend in the v2.0.x branch. |
This has been fixed. |
Recent Cisco and NVIDIA MTT results are showing COMM_SPAWN and related dynamic operations are failing.
We need to find out if it was broken in v2.0.0 and see if this is a regression. This will determine whether it's a v2.0.1 blocker or not (because we'd really like to release the other important fixes in v2.0.1 ASAP, and start working on the short timeline for v2.1.0 and v2.0.2, etc.).
@sjeaugey is going to test on his tree and see if the recent v2.x commit(s) about bringing over PMIx 1.1.5 are the root of this particular problem.