
OpenMPI 2.0.0rc2 with openib/iWARP btl stalls running OSU/IMB benchmarks #1664

Closed
larrystevenwise opened this issue May 10, 2016 · 46 comments

@larrystevenwise

I have installed this OpenMPI on my lab machines. I find that the
osu_latency test runs OK, but the osu_bw test stalls. Upon further
investigation, I find that the calls to MPI_Isend on the first node all
complete and the calls to MPI_Irecv on the second node all complete. However,
when the nodes go into MPI_Waitall, both nodes stall.

It is important that we identify where the problem is before Open MPI releases
2.0 as final.

@larrystevenwise
Author

larrystevenwise commented May 10, 2016

I built OpenMPI 2.0.0rc2 with the following configure command:

./configure --prefix=/usr/local/ --enable-mpi-thread-multiple
--enable-mpirun-prefix-by-default --enable-debug --enable-mpi-cxx

The mpi-thread-multiple setting does not matter; I tried it both ways.

@larrystevenwise
Author

When stalled, we see one rank with a pending SEND WR. The HW has consumed the WR, and it appears the node has transmitted the SEND to the peer.

@larrystevenwise
Author

We started bisecting the ompi-release git repo. For ompi commits before March,
the OSU tests work fine. We will update with the culprit commit.

@larrystevenwise
Author

Only osu_bcast and osu_bw have this issue; the other OSU tests run fine.
The affected tests hang with no traffic running. IMB tests are also not working on
OpenMPI 2.0.0rc2.

@larrystevenwise
Author

@jsquyres, I think this is a blocker for 2.0.0. Chelsio is working this as top priority. I'm also pushing on Chelsio to regression test new RCs as they release. This issue should not have made it to rc2 IMO...

@jsquyres
Member

@larrystevenwise Yes, would have been great to know that this bug existed before now. No one else in the community has Chelsio equipment with which to test.

@larrystevenwise
Author

@jsquyres, sorry about that. I can ask Chelsio to give some cards to someone who would test OMPI/openib/cxgb4 regularly.

@jsquyres
Member

I think the community would prefer if you or Chelsio did regular testing. 😄

@larrystevenwise
Author

agreed... :)

I've recommended they test rc1 of each release and then regression test each rc after that.

@jsquyres
Member

FWIW, it would be best if they could test more than that -- even submitting nightly (or even weekly?) MTT runs to the community database would be a good heads-up to catch problems before RC releases.

Absoft, for example, does this.

@larrystevenwise
Author

I've passed this along. Thanks Jeff.

@larrystevenwise
Author

I installed OFED-3.18-2-rc1 and the stall goes away. So:

OMPI-2.0.0rc2, RHEL 6.5 kernel, OFED-3.18-1, Chelsio's latest drivers: BAD
OMPI-2.0.0rc2, RHEL 6.5 kernel, OFED-3.18-2-rc1, Chelsio's latest drivers: GOOD

It could be that OMPI configures differently against OFED-3.18-1 vs. OFED-3.18-2, so different code runs in OMPI itself, or it could be the library/driver differences between OFED-3.18-1 and OFED-3.18-2...

@larrystevenwise
Author

git bisect results:

Here is the culprit commit in the ompi-release repo; from this commit onward,
the bw test fails with an rdma_cm error or stalls.

repo: https://github.com/open-mpi/ompi-release.git
branch v2.X 
[root@maneybhanjang ompi-release]# git bisect bad
d7c21da1b835be1f7f035d33550587e3e1954b0b is the first bad commit
commit d7c21da1b835be1f7f035d33550587e3e1954b0b
Author: Jeff Squyres 
Date:   Wed Feb 3 18:47:51 2016 -0800
    ompi_mpi_params.c: set mpi_add_procs_cutoff default to 0
    Decrease the default value of the "mpi_add_procs_cutoff" MCA param
    from 1024 to 0.
    (cherry picked from commit open-mpi/ompi@902b477aac2063976578f031aa079185104e07e0)
:040000 040000 62e9175129fcd9e581a813451fcc948d70298456 18bec256d8f39734aacf0c1d1f952c8f289a6ee6 M      ompi

We are re-verifying this. I still don't know how this relates to OFED-3.18-1 or -2 causing different behavior...

@jsquyres
Member

If this is a bug in OFED-3.18-1, should we care?

@larrystevenwise
Author

We don't care if the problem doesn't exist in OFED-3.18-2, but I would still like to understand the root cause. The problem is also avoided by backing out the identified commit. Also, running with '--mca mpi_add_procs_cutoff 1024' added to the mpirun command line works around the problem.

Jeff, can you please explain what mpi_add_procs_cutoff does and how it might affect the openib btl? Any ideas?

@jsquyres
Member

@larrystevenwise The behavior of mpi_add_procs_cutoff was discussed by the community literally for months. I'm actually pretty annoyed that Chelsio is coming in literally at the last second and calling for a full stop, despite the fact that they completely ignored the community for the entire development cycle. 😠

There is not a short explanation of mpi_add_procs_cutoff. We fundamentally changed the behavior of how btl_add_procs is invoked: it's only invoked for a given peer the first time we communicate with that peer. I'm sorry; I don't have time to explain further (I have other deadlines this week -- your poor prior planning does not constitute an emergency on my part). Go read the code and git logs.
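To illustrate the effect of the cutoff, here is a minimal standalone sketch (made-up names, not the actual OMPI code) of how the parameter selects eager vs. lazy add_procs behavior:

#include <stdio.h>
#include <stddef.h>

/* Hypothetical sketch of the mpi_add_procs_cutoff behavior described above;
   the function and variable names are illustrative, not OMPI symbols. */
static void add_procs_sketch(size_t job_size, size_t add_procs_cutoff)
{
    if (job_size <= add_procs_cutoff) {
        /* Eager: every BTL's add_procs() sees the full proc list at MPI_Init
           (the old default cutoff was 1024). */
        printf("job of %zu procs: add all procs at init\n", job_size);
    } else {
        /* Lazy (2.0.0 default, cutoff = 0): add_procs() is invoked for a peer
           only the first time we communicate with that peer. */
        printf("job of %zu procs: add each proc on first contact\n", job_size);
    }
}

int main(void)
{
    add_procs_sketch(2, 1024);  /* pre-change default: eager */
    add_procs_sketch(2, 0);     /* new default: lazy         */
    return 0;
}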

@larrystevenwise
Author

Since there is an easy workaround, I'm not concerned about holding the release (which you wouldn't do anyway, apparently). I've raised the issue of spending more resources on OMPI with Chelsio, and I'll work to get them more involved and doing regular regression testing.

@larrystevenwise
Author

What we see when looking at the 2 ranks that are stalled is that one rank has posted a SEND WR which has been transmitted and processed by the peer rank, but the CQE for that SEND has never been polled. And I think there's another RECV CQE sitting in the CQ that the rank hasn't polled either. So it seems the rank has stopped polling the CQ for some reason. Debugging further.

@larrystevenwise
Author

Both the RCQ and SCQ for the stuck rank have never been successfully polled. The RCQ has a handful of completions and the SCQ has a send completion. Seems like the rank never got going on polling the CQs...
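For reference, this is the generic verbs CQ-polling pattern that a BTL progress function performs (a standalone sketch, not OMPI code); if the progress function is never invoked, completions like the ones described above just sit in the CQ:

#include <stdio.h>
#include <infiniband/verbs.h>

/* Standalone sketch: open the first RDMA device, create a CQ, and drain it
   the way a BTL progress function would.  Illustrative only. */
static int drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc[16];
    int n, total = 0;

    while ((n = ibv_poll_cq(cq, 16, wc)) > 0) {
        for (int i = 0; i < n; ++i) {
            if (IBV_WC_SUCCESS != wc[i].status) {
                fprintf(stderr, "WR %llu failed: %s\n",
                        (unsigned long long) wc[i].wr_id,
                        ibv_wc_status_str(wc[i].status));
            }
            /* a real BTL would hand the completion to the upper layer here */
        }
        total += n;
    }
    return (n < 0) ? n : total;
}

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (NULL == devs || 0 == num) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_cq *cq = ctx ? ibv_create_cq(ctx, 128, NULL, NULL, 0) : NULL;
    if (NULL == cq) {
        fprintf(stderr, "could not open device / create CQ\n");
        return 1;
    }
    printf("drained %d completions\n", drain_cq(cq));
    ibv_destroy_cq(cq);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}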

@larrystevenwise
Author

larrystevenwise commented May 13, 2016

The stuck rank seems to be spinning here and never proceeding:

#0  0x00007fff56f40952 in vgetns (ts=0x2af6dab28d70) at arch/x86/vdso/vclock_gettime.c:42
#1  do_monotonic (ts=0x2af6dab28d70) at arch/x86/vdso/vclock_gettime.c:81
#2  0x00007fff56f40a47 in __vdso_clock_gettime (clock=, ts=) at arch/x86/vdso/vclock_gettime.c:124
#3  0x0000003ef6803e46 in clock_gettime () from /lib64/librt.so.1
#4  0x00002af6d99f4647 in gettime (base=0x1f2b1c0, tp=0x1f2b350) at event.c:372
#5  0x00002af6d99f8961 in update_time_cache (base=0x1f2b1c0, flags=1) at event.c:430
#6  opal_libevent2022_event_base_loop (base=0x1f2b1c0, flags=1) at event.c:1639
#7  0x00002af6da6dffd9 in progress_engine (obj=0x1f2b1c0) at src/util/progress_threads.c:49
#8  0x0000003ef5c079d1 in start_thread () from /lib64/libpthread.so.0
#9  0x0000003ef58e89dd in clone () from /lib64/libc.so.6

@jsquyres
Member

Is it not progressing past that point, or is it endlessly looping in the event base loop in the progress thread?

I note that that is not the main MPI thread.

@larrystevenwise
Author

Some more data points: with the same OFED-3.18-1 code, mlx4 runs OK regardless of the mpi_add_procs_cutoff value; cxgb4 stalls if the value is <= 2.

@larrystevenwise
Author

@bharatpotnuri is working on this. I'm not sure whether he has access to comment on this issue?

@bharatpotnuri
Contributor

bharatpotnuri commented May 18, 2016

The main thread of rank 1 is looping endlessly in opal_progress(), as it doesn't find a registered callback that makes further progress.

Below is the snippet of code where it fails to find the registered callback, at opal/runtime/opal_progress.c:187:

187     /* progress all registered callbacks */
188     for (i = 0 ; i < callbacks_len ; ++i) {
189         events += (callbacks[i])();
190     } 

Below is the back-trace of main thread:

0x00002b08047dd081 in opal_timer_base_get_usec_clock_gettime () at timer_linux_component.c:184
184             return (tp.tv_sec * 1e6 + tp.tv_nsec/1000);
(gdb) bt
#0  0x00002b08047dd081 in opal_timer_base_get_usec_clock_gettime () at timer_linux_component.c:184
#1  0x00002b0804729abc in opal_progress () at runtime/opal_progress.c:161
#2  0x00002b080409f590 in opal_condition_wait (c=0x2b08043e3380, m=0x2b08043e3300) at ../opal/threads/condition.h:72
#3  0x00002b080409fbd7 in ompi_request_default_wait_all (count=64, requests=0x602b80, statuses=0x608940) at request/req_wait.c:272
#4  0x00002b08041067f4 in PMPI_Waitall (count=64, requests=0x602b80, statuses=0x608940) at pwaitall.c:76
#5  0x00000000004010b4 in main (argc=1, argv=0x7fff35570348) at osu_bw.c:121

EDIT: Added verbatim blocks
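To make the failure mode concrete, here is a minimal standalone mock of that callback loop (made-up names, not OMPI code): if a component's progress function is never registered, the loop keeps reporting zero events, which matches the stall described above.

#include <stdio.h>

/* Mock of an opal_progress-style callback table; names are illustrative. */
typedef int (*progress_cb_t)(void);

static progress_cb_t callbacks[8];
static int callbacks_len = 0;

static void register_progress(progress_cb_t cb)
{
    callbacks[callbacks_len++] = cb;
}

/* Stand-in for a BTL component progress function that polls its CQs and
   returns the number of completions handled. */
static int fake_btl_progress(void)
{
    return 1;
}

static int progress_once(void)
{
    int events = 0;
    for (int i = 0; i < callbacks_len; ++i) {
        events += callbacks[i]();
    }
    return events;
}

int main(void)
{
    /* Stall scenario: the BTL was selected, but its progress function was
       never registered, so every pass reports zero events. */
    printf("before registration: events = %d\n", progress_once());

    register_progress(fake_btl_progress);
    printf("after registration:  events = %d\n", progress_once());
    return 0;
}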

@jsquyres
Member

@bharatpotnuri @larrystevenwise Github pro tip: use three back ticks on a line to start and end verbose blocks. See https://help.github.com/articles/basic-writing-and-formatting-syntax/ for more github markdown formatting.

@jsquyres
Member

So just to be clear: you're saying that opal_progress() is not ending up calling btl_openib_progress() in its loop over the callbacks[] array?

Hmm. I'm not sure how this could happen -- that the openib BTL was evidently selected for use, but then didn't have its progress function registered with the main progression engine (that should be done up in ob1, actually).

  • What message is being sent by the openib BTL? Is it an MPI message, or some type of control / initialization message?
  • Is the openib BTL being selected for use?

@bharatpotnuri
Contributor

bharatpotnuri commented May 18, 2016

Yes, opal_progress() is not calling btl_openib_component_progress().

The difference in code flow between the working case (mpi_add_procs_cutoff > 3, np=2) and the stall case (mpi_add_procs_cutoff = 0, np=2) is the nprocs value: it is 1 in the stall case and 2 in the working case.

Following the code back, nprocs is propagated through mca_pml_ob1_add_procs(). I am checking further how it gets called.
I am also working out answers to your two questions. I will rerun the test with a few more prints and let you know soon.
Thanks.

@bharatpotnuri
Contributor

This is the test command which stalls:

/usr/mpi/gcc/openmpi-2.0.0/bin/mpirun -np 2 --host peer1,peer2 --allow-run-as-root --output-filename /tmp/mpi --mca btl openib,sm,self /usr/mpi/gcc/openmpi-2.0.0/tests/OSU/osu_bw

@bharatpotnuri
Contributor

  • What message is being sent by the openib BTL? Is it an MPI message, or some type of control / initialization message?
  • Is the openib BTL being selected for use?

@jsquyres Could you please tell me what to check for in the MPI code for your two questions above? I couldn't figure it out from the ompi code.

@jsquyres
Member

I chatted with @hjelmn about this in IM. He's thinking that the issue might be in bml/r2 somewhere (r2 is the code that ob1 uses to multiplex over all of its BTLs), because r2 is where the BTL component progress functions are supposed to be registered.

He's poking into this today to see what's going on...

@larrystevenwise
Author

Perhaps this issue has something to do with the fact that cxgb4/iwarp is using the rdmacm CPC, and thus the CTS message? I see with openib/mlx4, which does not stall, the rdmacm CPC is not being used.

[stevo2.asicdesigners.com:27045] rdmacm CPC only supported when the first QP is a PP QP; skipped
[stevo2.asicdesigners.com:27045] openib BTL: rdmacm CPC unavailable for use on mlx4_0:1; skipped

@hjelmn
Member

hjelmn commented May 18, 2016

Nah. This is caused by your network not supporting loopback. In that case, openib is disqualified only on the initial add_procs (local procs) but is then selected on a future add_procs call. Mellanox isn't affected because the openib btl is not disqualified on local procs.

@larrystevenwise
Author

cxgb4 does support hw loopback, but maybe the openib btl doesn't know that?

@larrystevenwise
Author

Ah, from mca_btl_openib_add_procs():

#if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
        /* Most current iWARP adapters (June 2008) cannot handle
           talking to other processes on the same host (!) -- so mark
           them as unreachable (need to use sm).  So for the moment,
           we'll just mark any local peer on an iWARP NIC as
           unreachable.  See trac ticket #1352. */
        if (IBV_TRANSPORT_IWARP == openib_btl->device->ib_dev->transport_type &&
            OPAL_PROC_ON_LOCAL_NODE(proc->proc_flags)) {
            continue;
        }
#endif

@larrystevenwise
Author

cxgb3 did not support hw loopback. cxgb4 and beyond do.

@jsquyres
Member

jsquyres commented May 18, 2016

I think @hjelmn will reply with detail shortly about what he just found.

@larrystevenwise Feel free to submit a PR to amend that #if block you just found. It needs to be a generic solution, though -- we have avoided "if device X, do Y" kinds of hard-coding in the openib BTL whenever possible.

@larrystevenwise
Author

@jsquyres I'll look into a way to detect hw loopback support, but from what I can tell, the OpenFabrics Verbs API doesn't advertise this capability.

Also, the 'self' btl was included in the mpirun command, so maybe openib shouldn't be disqualified?
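As a quick check of what the verbs API does expose, here is a small standalone program (a sketch, not OMPI code) that lists each device's transport type; there is no corresponding capability bit for hardware loopback:

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (NULL == devs) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (int i = 0; i < num; ++i) {
        const char *transport =
            (IBV_TRANSPORT_IWARP == devs[i]->transport_type) ? "iWARP" :
            (IBV_TRANSPORT_IB    == devs[i]->transport_type) ? "IB"    :
                                                               "unknown";
        /* The transport type is visible, but nothing here says whether the
           adapter can do hardware loopback. */
        printf("%s: transport = %s\n", ibv_get_device_name(devs[i]), transport);
    }
    ibv_free_device_list(devs);
    return 0;
}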

@larrystevenwise
Author

One gross way to detect hw loopback in openib is to attempt a hw loopback connection as part of module init...

@hjelmn
Member

hjelmn commented May 18, 2016

We already do that in the openib add_procs. You can remove the #if block and see if it is working with your hardware.

I am working on a fix so it will not hang when local communication is not available. Should be ready shortly.

@larrystevenwise
Author

larrystevenwise commented May 18, 2016

Removing the 2 blocks in btl_openib.c that excluded iWARP devices works around the issue.

diff --git a/opal/mca/btl/openib/btl_openib.c b/opal/mca/btl/openib/btl_openib.c
index 53d8c81..2e4211f 100644
--- a/opal/mca/btl/openib/btl_openib.c
+++ b/opal/mca/btl/openib/btl_openib.c
@@ -1065,18 +1065,6 @@ int mca_btl_openib_add_procs(
         struct opal_proc_t* proc = procs[i];
         mca_btl_openib_proc_t* ib_proc;

-#if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
-        /* Most current iWARP adapters (June 2008) cannot handle
-           talking to other processes on the same host (!) -- so mark
-           them as unreachable (need to use sm).  So for the moment,
-           we'll just mark any local peer on an iWARP NIC as
-           unreachable.  See trac ticket #1352. */
-        if (IBV_TRANSPORT_IWARP == openib_btl->device->ib_dev->transport_type &&
-            OPAL_PROC_ON_LOCAL_NODE(proc->proc_flags)) {
-            continue;
-        }
-#endif
-
         if(NULL == (ib_proc = mca_btl_openib_proc_get_locked(proc)) ) {
             /* if we don't have connection info for this process, it's
              * okay because some other method might be able to reach it,
@@ -1138,18 +1126,6 @@ int mca_btl_openib_add_procs(

         opal_output(-1, "add procs: adding proc %d", i);

-#if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
-        /* Most current iWARP adapters (June 2008) cannot handle
-           talking to other processes on the same host (!) -- so mark
-           them as unreachable (need to use sm).  So for the moment,
-           we'll just mark any local peer on an iWARP NIC as
-           unreachable.  See trac ticket #1352. */
-        if (IBV_TRANSPORT_IWARP == openib_btl->device->ib_dev->transport_type &&
-            OPAL_PROC_ON_LOCAL_NODE(proc->proc_flags)) {
-            continue;
-        }
-#endif
-
         if(NULL == (ib_proc = mca_btl_openib_proc_get_locked(proc)) ) {
             /* if we don't have connection info for this process, it's
              * okay because some other method might be able to reach it,

EDIT: Added diff syntax highlighting

@jsquyres
Member

@larrystevenwise Another GitHub pro tip: you can tell GitHub how it should syntax highlight your verbatim blocks. For example, trail the 3 back ticks with "diff" (without the quotes) and it tells GitHub to render that verbatim block with diff syntax highlighting.

@larrystevenwise
Author

nifty!

@bharatpotnuri
Contributor

Removing the 2 blocks in btl_openib.c that excluded iWARP devices works around the issue. The test seems to run, but it fails to complete:

# OSU MPI Bi-Directional Bandwidth Test v5.0
# Size      Bandwidth (MB/s)
1                       0.53
2                       1.07
4                       2.14
8                       4.24
16                      8.53
32                     16.97
64                     32.76
128                    63.48
256                   121.18
512                   242.11
1024                  441.96
2048                  752.99
4096                 1051.12
8192                 1385.62
16384                3153.99
32768                4147.69
65536                4857.31
131072               5495.65
262144               5690.68
524288               5768.05
1048576              5777.00
2097152              6132.86
4194304              6336.14

The main thread is waiting at rdmacm_endpoint_finalize(). Below is the back-trace of rank 1:

0x0000003ef5c0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x0000003ef5c0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b38408a66e1 in rdmacm_endpoint_finalize (endpoint=0x81b710) at connect/btl_openib_connect_rdmacm.c:1231
#2  0x00002b3840891839 in mca_btl_openib_endpoint_destruct (endpoint=0x81b710) at btl_openib_endpoint.c:338
#3  0x00002b384087a998 in opal_obj_run_destructors (object=0x81b710) at ../../../../opal/class/opal_object.h:455
#4  0x00002b3840880a66 in mca_btl_openib_finalize_resources (btl=0x7d63c0) at btl_openib.c:1627
#5  0x00002b38408812e1 in mca_btl_openib_finalize (btl=0x7d63c0) at btl_openib.c:1703
#6  0x00002b3832a9f812 in mca_btl_base_close () at base/btl_base_frame.c:153
#7  0x00002b3832a89299 in mca_base_framework_close (framework=0x2b3832d3e5a0) at mca_base_framework.c:214
#8  0x00002b383243d5f6 in mca_bml_base_close () at base/bml_base_frame.c:130
#9  0x00002b3832a89299 in mca_base_framework_close (framework=0x2b38326e91a0) at mca_base_framework.c:214
#10 0x00002b38323bec35 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:418
#11 0x00002b38323eba6d in PMPI_Finalize () at pfinalize.c:45
#12 0x0000000000401184 in main (argc=1, argv=0x7fff781a1ee8) at osu_bw.c:136

ompi-release/opal/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1231

1228     /* Now wait for all the disconnect callbacks to occur */
1229     pthread_mutex_lock(&rdmacm_disconnect_lock);
1230     while (opal_list_get_size (&contents->ids)) {
1231         pthread_cond_wait (&rdmacm_disconnect_cond, &rdmacm_disconnect_lock);
1232     }

I am running the test with the git ompi-release repo at commit a7e27ef6 ("Merge pull request #1162 from abouteiller/bug/v2.x/mt_lazy_addproc"). Looking further.

@bharatpotnuri
Contributor

bharatpotnuri commented May 19, 2016

Please ignore my previous comment.
I have tested with commit a679cc072b9458957671886062c643f6a40b0351 ("bml/r2: always add btl progress function"); it works fine with iWARP and fixes issue #1664.
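For context, here is a rough standalone sketch of the idea behind that commit (stand-in names, not the actual patch): register each selected BTL's progress function with the progress engine even when an add_procs pass finds no reachable peers for it, so a later lazy add_procs is still driven by opal_progress().

#include <stdbool.h>
#include <stdio.h>

typedef int (*progress_fn_t)(void);

/* Stand-in for a BTL component; fields and names are illustrative only. */
struct btl_component {
    const char   *name;
    progress_fn_t progress;
    bool          progress_registered;
};

static int openib_progress(void) { return 0; /* would poll CQs */ }

static void maybe_register_progress(struct btl_component *btl,
                                    int reachable_peers,
                                    bool always_register)
{
    if (!btl->progress_registered &&
        (always_register || reachable_peers > 0)) {
        btl->progress_registered = true;  /* stand-in for registering btl->progress */
        printf("registered progress for %s\n", btl->name);
    }
}

int main(void)
{
    struct btl_component openib = { "openib", openib_progress, false };

    /* First add_procs pass sees only the local peer; the iWARP device is
       marked unreachable, so reachable_peers == 0.  With the old logic the
       callback is never registered and opal_progress() can spin forever. */
    maybe_register_progress(&openib, 0, false);
    printf("old logic registered? %s\n", openib.progress_registered ? "yes" : "no");

    /* With the fixed logic, registration happens regardless. */
    maybe_register_progress(&openib, 0, true);
    printf("new logic registered? %s\n", openib.progress_registered ? "yes" : "no");
    return 0;
}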

@jsquyres
Member

Great! @hjelmn please PR to v2.x if you have not done so already.

@larrystevenwise
Author

So should I close this issue out now?
