OpenMPI 2.0.0rc2 with openib/iWARP btl stalls running OSU/IMB benchmarks #1664
I built openmpi 2.0.0rc2 with the following configure command: `./configure --prefix=/usr/local/ --enable-mpi-thread-multiple`. It does not matter whether mpi-thread-multiple is enabled; I tried it both ways.
When stalled, we see one rank with a pending SEND WR. The HW has consumed the WR and it appears the node has transmitted the SEND to the peer.
We started bisecting the ompi-release git repo, beginning with commits from before March.
Only osu_bcast and osu_bw have this issue; the other OSU tests run fine.
@jsquyres, I think this is a blocker for 2.0.0. Chelsio is working this as a top priority. I'm also pushing Chelsio to regression test new RCs as they are released. This issue should not have made it to rc2, IMO...
@larrystevenwise Yes, it would have been great to know that this bug existed before now. No one else in the community has Chelsio equipment with which to test.
@jsquyres, sorry about that. I can ask Chelsio to give some cards to someone who would test OMPI/openib/cxgb4 regularly.
I think the community would prefer if you or Chelsio did regular testing. 😄
Agreed... :) I've recommended they test rc1 of each release and then regression test each rc after that.
FWIW, it would be best if they could test more than that -- even submitting nightly (or even weekly?) MTT runs to the community database would be a good heads-up to catch problems before RC releases. Absoft, for example, does this.
I've passed this along. Thanks, Jeff.
I installed OFED-3.18-2-rc1 and the stall goes away. So: OMPI-2.0.0rc2, RHEL 6.5 kernel, OFED-3.18-1, Chelsio's latest drivers: BAD. It could be that OMPI configures differently against OFED-3.18-1 vs. OFED-3.18-2 and the change in behavior is due to different code running in OMPI itself, or it could be the library/driver differences between OFED-3.18-1 and OFED-3.18-2...
git bisect results: here is the culprit commit of the ompi-release repo (repo: https://github.com/open-mpi/ompi-release.git, branch v2.X):

```
[root@maneybhanjang ompi-release]# git bisect bad
d7c21da1b835be1f7f035d33550587e3e1954b0b is the first bad commit
commit d7c21da1b835be1f7f035d33550587e3e1954b0b
Author: Jeff Squyres
Date:   Wed Feb 3 18:47:51 2016 -0800

    ompi_mpi_params.c: set mpi_add_procs_cutoff default to 0

    Decrease the default value of the "mpi_add_procs_cutoff" MCA param
    from 1024 to 0.

    (cherry picked from commit open-mpi/ompi@902b477aac2063976578f031aa079185104e07e0)

:040000 040000 62e9175129fcd9e581a813451fcc948d70298456 18bec256d8f39734aacf0c1d1f952c8f289a6ee6 M ompi
```

We are re-verifying this. I still don't know how this relates to OFED-3.18-1 or -2 causing different behavior...
If this is a bug in OFED-3.18-1, should we care?
We don't care if the problem doesn't exist in OFED-3.18-2, but I would still like to understand the root cause. The problem is also avoided by backing out the identified commit. Also: adding `--mca mpi_add_procs_cutoff 1024` to the mpirun command line works around the problem. Jeff, can you please explain what mpi_add_procs_cutoff does and how it might affect the openib btl? Any ideas?
@larrystevenwise There is not a short explanation of `mpi_add_procs_cutoff`, unfortunately. In brief, it controls whether peers are added to the BTLs all at once during MPI_INIT or lazily, on first contact.
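A rough illustration of the idea behind the cutoff (a hypothetical helper, not the actual Open MPI code): jobs smaller than the cutoff wire every peer into the BTLs during MPI_INIT, while larger jobs, or any job with the new default cutoff of 0, add peers lazily when they are first contacted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch of the mpi_add_procs_cutoff decision (invented
 * helper; the real logic lives elsewhere in Open MPI): jobs smaller
 * than the cutoff add every peer up front, everything else takes the
 * lazy path. */
static bool add_all_procs_eagerly(size_t job_size, size_t cutoff)
{
    /* cutoff == 0 (the new default) means the lazy path is always taken */
    return job_size < cutoff;
}
```

With the default dropped to 0, the lazy path is always taken, which is what later turns out to expose the openib BTL's handling of local peers.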
Since there is an easy workaround, I'm not concerned about holding the release (which you wouldn't do anyway, apparently). I've raised the issue of spending more resources on OMPI with Chelsio, and I'll work to get them more involved and doing regular regression testing.
What we see when looking at the 2 ranks that are stalled is that one rank has posted a SEND WR which has been transmitted and processed by the peer rank, but the CQE for that SEND has never been polled. And I think there's another RECV CQE sitting in the CQ that the rank hasn't polled either. So it seems the rank has stopped polling the CQ for some reason. Debugging further.
Both the RCQ and SCQ for the stuck rank have never been successfully polled. The RCQ has a handful of completions and the SCQ has a send completion. It seems like the rank never got going on polling the CQs...
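For context, a completion such as the stranded SEND CQE only becomes visible to the software once something drains the CQ with `ibv_poll_cq()`. Below is a minimal sketch of that verbs-level pattern; it assumes `cq` is a valid `struct ibv_cq *` and is only an illustration, not the openib BTL's actual polling loop.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

/* Drain whatever completions are currently queued on `cq`.  Returns the
 * number of completions consumed, or -1 on a failed WR or poll error. */
static int drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n, total = 0;

    while ((n = ibv_poll_cq(cq, 1, &wc)) > 0) {
        if (IBV_WC_SUCCESS != wc.status) {
            fprintf(stderr, "WR %llu failed: %s\n",
                    (unsigned long long) wc.wr_id,
                    ibv_wc_status_str(wc.status));
            return -1;
        }
        total++;   /* a SEND or RECV completion was consumed */
    }
    return (n < 0) ? -1 : total;
}
```

If nothing ever runs a loop like this over the stuck rank's CQs, the completions described above simply sit there, which is consistent with the rank having stopped (or never started) polling.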
The stuck rank seems to be doing this and never proceeding:

```
#0  0x00007fff56f40952 in vgetns (ts=0x2af6dab28d70) at arch/x86/vdso/vclock_gettime.c:42
#1  do_monotonic (ts=0x2af6dab28d70) at arch/x86/vdso/vclock_gettime.c:81
#2  0x00007fff56f40a47 in __vdso_clock_gettime (clock=, ts=) at arch/x86/vdso/vclock_gettime.c:124
#3  0x0000003ef6803e46 in clock_gettime () from /lib64/librt.so.1
#4  0x00002af6d99f4647 in gettime (base=0x1f2b1c0, tp=0x1f2b350) at event.c:372
#5  0x00002af6d99f8961 in update_time_cache (base=0x1f2b1c0, flags=1) at event.c:430
#6  opal_libevent2022_event_base_loop (base=0x1f2b1c0, flags=1) at event.c:1639
#7  0x00002af6da6dffd9 in progress_engine (obj=0x1f2b1c0) at src/util/progress_threads.c:49
#8  0x0000003ef5c079d1 in start_thread () from /lib64/libpthread.so.0
#9  0x0000003ef58e89dd in clone () from /lib64/libc.so.6
```
Is it not progressing past that point, or is it endlessly looping in the event base loop in the progress thread? I note that that is not the main MPI thread.
Some more data points: with the same OFED-3.18-1 code, mlx4 runs OK regardless of the mpi_add_procs_cutoff value. cxgb4 stalls if the value is <= 2.
@bharatpotnuri is working on this. Not sure if he has access to comment on these issues?
The main thread of rank1 is looping endlessly at opal_progress(), as it doesn't find registered callbacks for progressing further. Below is the snippet of code where it fails to get a registered callback:
Below is the back-trace of the main thread:
EDIT: Added verbatim blocks
@bharatpotnuri @larrystevenwise GitHub pro tip: use three back ticks on a line to start and end verbatim blocks. See https://help.github.com/articles/basic-writing-and-formatting-syntax/ for more GitHub markdown formatting.
So just to be clear: you're saying that opal_progress() is spinning without ever invoking the openib BTL's progress function? Hmm. I'm not sure how this could happen -- that the openib BTL was evidently selected for use, but then didn't have its progress function registered with the main progression engine (that should be done up in ob1, actually).
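To make "progress function registered with the main progression engine" concrete, here is a simplified sketch of a callback-driven progress loop. The names are hypothetical (the real pieces are opal_progress(), opal_progress_register(), and the BTL component progress functions), so treat it as an illustration rather than the OPAL source.

```c
#include <stddef.h>

#define MAX_CALLBACKS 64

/* Components that want to be progressed register a callback; the
 * progress loop calls each registered callback and sums the events
 * they report. */
typedef int (*progress_callback_t)(void);

static progress_callback_t callbacks[MAX_CALLBACKS];
static size_t callbacks_len = 0;

static int progress_register(progress_callback_t cb)
{
    if (callbacks_len >= MAX_CALLBACKS) {
        return -1;
    }
    callbacks[callbacks_len++] = cb;
    return 0;
}

static int progress(void)
{
    int events = 0;

    for (size_t i = 0; i < callbacks_len; ++i) {
        /* e.g. the openib BTL's component progress function, which is
         * what would drain its send/recv CQs */
        events += callbacks[i]();
    }
    return events;   /* stays 0 forever if the needed callback was never registered */
}
```

If the openib BTL is selected but its progress function never makes it into such a list, its CQs are never drained, which matches the stuck completions described earlier.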
Yes. The difference in code flow between the working case (...) and the failing case: following the code back, nprocs is propagated through ...
This is the test command which stalls:
@jsquyres Could you please tell me what to check for in the MPI code for your two questions above? I couldn't figure it out from the ompi code.
I chatted with @hjelmn about this in IM. He's thinking that the issue might be in bml/r2 somewhere (r2 is the code that ob1 uses to multiplex over all of its BTLs), because r2 is where the BTL component progress functions are supposed to be registered. He's poking into this today to see what's going on...
Perhaps this issue has something to do with the fact that cxgb4/iWARP is using the rdmacm CPC, and thus the CTS message? I see that with openib/mlx4, which does not stall, the rdmacm CPC is not being used.
Nah. This is caused by your network not supporting loopback. In that case openib is disqualified on the initial add_procs (local procs only) but then selected on a future add_procs call. mlnx isn't affected because the openib btl is not disqualified for local procs.
cxgb4 does support hw loopback, but maybe the openib btl doesn't know that?
Ah. From mca_btl_openib_add_procs():
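The block in question is the iWARP local-peer exclusion, i.e. the same block that is removed in the diff further below:

```c
#if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
        /* Most current iWARP adapters (June 2008) cannot handle
           talking to other processes on the same host (!) -- so mark
           them as unreachable (need to use sm). So for the moment,
           we'll just mark any local peer on an iWARP NIC as
           unreachable. See trac ticket #1352. */
        if (IBV_TRANSPORT_IWARP == openib_btl->device->ib_dev->transport_type &&
            OPAL_PROC_ON_LOCAL_NODE(proc->proc_flags)) {
            continue;
        }
#endif
```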
cxgb3 did not support hw loopback. cxgb4 and beyond do.
I think @hjelmn will reply with detail shortly about what he just found. @larrystevenwise Feel free to submit a PR to amend that.
@jsquyres I'll look into a way to detect hw loopback support, but the OpenFabrics Verbs API doesn't advertise this capability from what I can tell. The 'self' btl was included in the mpirun, though, so maybe openib shouldn't be disqualified?
One gross way to detect hw loopback in openib is to attempt a hw loopback connection as part of module init...
We already do that in the openib add_procs. You can remove the check. I am working on a fix so it will not hang when local communication is not available. Should be ready shortly.
Removing the 2 blocks in btl_openib.c that excluded iWARP devices works around the issue.

```diff
diff --git a/opal/mca/btl/openib/btl_openib.c b/opal/mca/btl/openib/btl_openib.c
index 53d8c81..2e4211f 100644
--- a/opal/mca/btl/openib/btl_openib.c
+++ b/opal/mca/btl/openib/btl_openib.c
@@ -1065,18 +1065,6 @@ int mca_btl_openib_add_procs(
         struct opal_proc_t* proc = procs[i];
         mca_btl_openib_proc_t* ib_proc;
-#if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
-        /* Most current iWARP adapters (June 2008) cannot handle
-           talking to other processes on the same host (!) -- so mark
-           them as unreachable (need to use sm). So for the moment,
-           we'll just mark any local peer on an iWARP NIC as
-           unreachable. See trac ticket #1352. */
-        if (IBV_TRANSPORT_IWARP == openib_btl->device->ib_dev->transport_type &&
-            OPAL_PROC_ON_LOCAL_NODE(proc->proc_flags)) {
-            continue;
-        }
-#endif
-
         if(NULL == (ib_proc = mca_btl_openib_proc_get_locked(proc)) ) {
             /* if we don't have connection info for this process, it's
              * okay because some other method might be able to reach it,
@@ -1138,18 +1126,6 @@ int mca_btl_openib_add_procs(
         opal_output(-1, "add procs: adding proc %d", i);
-#if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
-        /* Most current iWARP adapters (June 2008) cannot handle
-           talking to other processes on the same host (!) -- so mark
-           them as unreachable (need to use sm). So for the moment,
-           we'll just mark any local peer on an iWARP NIC as
-           unreachable. See trac ticket #1352. */
-        if (IBV_TRANSPORT_IWARP == openib_btl->device->ib_dev->transport_type &&
-            OPAL_PROC_ON_LOCAL_NODE(proc->proc_flags)) {
-            continue;
-        }
-#endif
-
         if(NULL == (ib_proc = mca_btl_openib_proc_get_locked(proc)) ) {
             /* if we don't have connection info for this process, it's
              * okay because some other method might be able to reach it,
```

EDIT: Added diff syntax highlighting.
@larrystevenwise Another GitHub pro tip: you can tell GitHub how it should syntax highlight your verbatim blocks. For example, trail the 3 back ticks with "diff" (without the quotes) and it tells GitHub to render that verbatim block with diff syntax highlighting.
Nifty!
By removing the 2 blocks in btl_openib.c, the main thread is now waiting at ompi-release/opal/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1231. I am running the test with git.
Please ignore my previous comment.
Great! @hjelmn please PR to v2.x if you have not done so already.
So should I close this issue out now?
I have installed this OpenMPI on my lab machines. I find that the osu_latency test runs OK. However, the osu_bw test stalls. Upon doing some more investigation, I find that the calls to MPI_Isend on the first node all complete and the calls to MPI_Irecv on the second node all complete. However, when the nodes go into MPI_Waitall, both nodes stall.
It is important that we identify where the problem is before Open MPI releases 2.0 as final.
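For reference, here is a minimal sketch of the communication pattern described above: a window of non-blocking sends on one rank, matching non-blocking receives on the other, then MPI_Waitall on both. It is only an illustration of the reported pattern, not the actual osu_bw source; the window and message sizes are arbitrary.

```c
#include <mpi.h>
#include <stdlib.h>

#define WINDOW  64
#define MSGSIZE (1 << 20)

int main(int argc, char **argv)
{
    int rank, i;
    char *buf;
    MPI_Request reqs[WINDOW];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc((size_t) WINDOW * MSGSIZE);

    /* Ranks other than 0 and 1 post nothing; null requests keep
     * MPI_Waitall well-defined for them. */
    for (i = 0; i < WINDOW; i++) {
        reqs[i] = MPI_REQUEST_NULL;
    }

    if (0 == rank) {
        for (i = 0; i < WINDOW; i++) {
            MPI_Isend(buf + (size_t) i * MSGSIZE, MSGSIZE, MPI_CHAR,
                      1, 100, MPI_COMM_WORLD, &reqs[i]);
        }
    } else if (1 == rank) {
        for (i = 0; i < WINDOW; i++) {
            MPI_Irecv(buf + (size_t) i * MSGSIZE, MSGSIZE, MPI_CHAR,
                      0, 100, MPI_COMM_WORLD, &reqs[i]);
        }
    }

    /* The reported hang: the Isend/Irecv calls all return, but both
     * ranks block here and never come back. */
    MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);

    free(buf);
    MPI_Finalize();
    return 0;
}
```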