
OpenMPI 2.0.0rc2 with openib/iWARP btl stalls running OSU/IMB benchmarks #1664

Closed
larrystevenwise opened this issue May 10, 2016 · 46 comments

@larrystevenwise

I have installed this OpenMPI on my lab machines. I find that the
osu_latency test runs OK, but the osu_bw test stalls. Upon further
investigation, I find that the calls to MPI_Isend on the first node all
complete and the calls to MPI_Irecv on the second node all complete. However,
when the nodes go into MPI_Waitall, both nodes stall.

It is important that we identify where the problem is before Open MPI releases
2.0 as final.

@larrystevenwise
Author

larrystevenwise commented May 10, 2016

I built OpenMPI 2.0.0rc2 with the following configure command:

./configure --prefix=/usr/local/ --enable-mpi-thread-multiple
--enable-mpirun-prefix-by-default --enable-debug --enable-mpi-cxx

The mpi-thread-multiple setting does not matter; I tried it both ways.

@larrystevenwise
Author

When stalled, we see one rank with a pending SEND WR. The HW has consumed the WR, and it appears the node has transmitted the SEND to the peer.

@larrystevenwise
Author

We started bisecting the ompi-release git repo. For ompi commits before March,
the OSU tests work fine. We will update with the culprit commit.

@larrystevenwise
Author

Only osu_bcast and osu_bw have this issue; the other OSU tests run fine.
The affected tests hang with no traffic running. IMB tests are also not working on
OpenMPI 2.0.0rc2.

@larrystevenwise
Author

@jsquyres, I think this is a blocker for 2.0.0. Chelsio is working this as top priority. I'm also pushing on Chelsio to regression test new RCs as they release. This issue should not have made it to rc2 IMO...

@jsquyres
Member

@larrystevenwise Yes, would have been great to know that this bug existed before now. No one else in the community has Chelsio equipment with which to test.

@larrystevenwise
Author

@jsquyres, sorry about that. I can ask Chelsio to give some cards to someone who would test OMPI/openib/cxgb4 regularly.

@jsquyres
Member

I think the community would prefer if you or Chelsio did regular testing. 😄

@larrystevenwise
Author

agreed... :)

I've recommended they test rc1 of each release and then regression test each rc after that.

@jsquyres
Member

FWIW, it would be best if they could test more than that -- even submitting nightly (or even weekly?) MTT runs to the community database would be a good heads-up to catch problems before RC releases.

Absoft, for example, does this.

@larrystevenwise
Author

I've passed this along. Thanks Jeff.

@larrystevenwise
Author

I installed OFED-3.18-2-rc1 and the stall goes away. So:

OMPI-2.0.0rc2, RHEL 6.5 kernel, OFED-3.18-1, Chelsio's latest drivers: BAD
OMPI-2.0.0rc2, RHEL 6.5 kernel, OFED-3.18-2-rc1, Chelsio's latest drivers: GOOD

It could be that OMPI configures differently against OFED-3.18-1 vs. OFED-3.18-2, so different code runs in OMPI itself, or it could be the library/driver differences between OFED-3.18-1 and OFED-3.18-2...

@larrystevenwise
Author

git bisect results:

Here is the culprit commit in the ompi-release repo; from this commit onward,
the bw test fails with an rdma_cm error or stalls.

repo: https://github.com/open-mpi/ompi-release.git
branch v2.X 
[root@maneybhanjang ompi-release]# git bisect bad
d7c21da1b835be1f7f035d33550587e3e1954b0b is the first bad commit
commit d7c21da1b835be1f7f035d33550587e3e1954b0b
Author: Jeff Squyres 
Date:   Wed Feb 3 18:47:51 2016 -0800
    ompi_mpi_params.c: set mpi_add_procs_cutoff default to 0
    Decrease the default value of the "mpi_add_procs_cutoff" MCA param
    from 1024 to 0.
    (cherry picked from commit open-mpi/ompi@902b477aac2063976578f031aa079185104e07e0)
:040000 040000 62e9175129fcd9e581a813451fcc948d70298456 18bec256d8f39734aacf0c1d1f952c8f289a6ee6 M      ompi

We are re-verifying this. I still don't know how this relates to OFED-3.18-1 or -2 causing different behavior...

@jsquyres
Member

If this is a bug in OFED-3.18-1, should we care?

@larrystevenwise
Author

We don't care if the problem doesn't exist in OFED-3.18-2, but I would still like to understand the root cause. The problem is also avoided by backing out the identified commit. Also, running with '--mca mpi_add_procs_cutoff 1024' added to the mpirun command line works around the problem.

Jeff, can you please explain what mpi_add_procs_cutoff does and how it might affect the openib btl? Any ideas?

@jsquyres
Member

@larrystevenwise The behavior of mpi_add_procs_cutoff was discussed by the community literally for months. I'm actually pretty annoyed that Chelsio is coming in literally at the last second and calling for a full stop, despite the fact that they completely ignored the community for the entire development cycle. 😠

There is not a short explanation of mpi_add_procs_cutoff. We fundamentally changed the behavior of how btl_add_procs is invoked: it's only invoked for a given peer the first time we communicate with that peer. I'm sorry; I don't have time to explain further (I have other deadlines this week -- your poor prior planning does not constitute an emergency on my part). Go read the code and git logs.
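To illustrate the effect of the cutoff, here is a minimal standalone sketch (made-up names, not the actual OMPI code) of how the parameter selects eager vs. lazy add_procs behavior:

#include <stdio.h>
#include <stddef.h>

/* Hypothetical sketch of the mpi_add_procs_cutoff behavior described above;
   the function and variable names are illustrative, not OMPI symbols. */
static void add_procs_sketch(size_t job_size, size_t add_procs_cutoff)
{
    if (job_size <= add_procs_cutoff) {
        /* Eager: every BTL's add_procs() sees the full proc list at MPI_Init
           (the old default cutoff was 1024). */
        printf("job of %zu procs: add all procs at init\n", job_size);
    } else {
        /* Lazy (2.0.0 default, cutoff = 0): add_procs() is invoked for a peer
           only the first time we communicate with that peer. */
        printf("job of %zu procs: add each proc on first contact\n", job_size);
    }
}

int main(void)
{
    add_procs_sketch(2, 1024);  /* pre-change default: eager */
    add_procs_sketch(2, 0);     /* new default: lazy         */
    return 0;
}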

@larrystevenwise
Author

Since there is an easy workaround, I'm not concerned about holding the release (which you wouldn't do anyway, apparently). I've raised the issue of spending more resources on OMPI with Chelsio, and I'll work to get them more involved and doing regular regression testing.

@larrystevenwise
Author

What we see when looking at the 2 ranks that are stalled is that one rank has posted a SEND WR which has been transmitted and processed by the peer rank, but the CQE for that SEND has never been polled. And I think there's another RECV CQE sitting in the CQ that the rank hasn't polled either. So it seems the rank has stopped polling the CQ for some reason. Debugging further.

@larrystevenwise
Author

Both the RCQ and SCQ for the stuck rank have never been successfully polled. The RCQ has a handful of completions and the SCQ has a send completion. Seems like the rank never got going on polling the CQs...
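For reference, this is the generic verbs CQ-polling pattern that a BTL progress function performs (a standalone sketch, not OMPI code); if the progress function is never invoked, completions like the ones described above just sit in the CQ:

#include <stdio.h>
#include <infiniband/verbs.h>

/* Standalone sketch: open the first RDMA device, create a CQ, and drain it
   the way a BTL progress function would.  Illustrative only. */
static int drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc[16];
    int n, total = 0;

    while ((n = ibv_poll_cq(cq, 16, wc)) > 0) {
        for (int i = 0; i < n; ++i) {
            if (IBV_WC_SUCCESS != wc[i].status) {
                fprintf(stderr, "WR %llu failed: %s\n",
                        (unsigned long long) wc[i].wr_id,
                        ibv_wc_status_str(wc[i].status));
            }
            /* a real BTL would hand the completion to the upper layer here */
        }
        total += n;
    }
    return (n < 0) ? n : total;
}

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (NULL == devs || 0 == num) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_cq *cq = ctx ? ibv_create_cq(ctx, 128, NULL, NULL, 0) : NULL;
    if (NULL == cq) {
        fprintf(stderr, "could not open device / create CQ\n");
        return 1;
    }
    printf("drained %d completions\n", drain_cq(cq));
    ibv_destroy_cq(cq);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}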

@larrystevenwise
Author

larrystevenwise commented May 13, 2016

The stuck rank seems to be spinning here and never proceeding:

#0  0x00007fff56f40952 in vgetns (ts=0x2af6dab28d70) at arch/x86/vdso/vclock_gettime.c:42
#1  do_monotonic (ts=0x2af6dab28d70) at arch/x86/vdso/vclock_gettime.c:81
#2  0x00007fff56f40a47 in __vdso_clock_gettime (clock=, ts=) at arch/x86/vdso/vclock_gettime.c:124
#3  0x0000003ef6803e46 in clock_gettime () from /lib64/librt.so.1
#4  0x00002af6d99f4647 in gettime (base=0x1f2b1c0, tp=0x1f2b350) at event.c:372
#5  0x00002af6d99f8961 in update_time_cache (base=0x1f2b1c0, flags=1) at event.c:430
#6  opal_libevent2022_event_base_loop (base=0x1f2b1c0, flags=1) at event.c:1639
#7  0x00002af6da6dffd9 in progress_engine (obj=0x1f2b1c0) at src/util/progress_threads.c:49
#8  0x0000003ef5c079d1 in start_thread () from /lib64/libpthread.so.0
#9  0x0000003ef58e89dd in clone () from /lib64/libc.so.6

@jsquyres
Member

Is it not progressing past that point, or is it endlessly looping in the event base loop in the progress thread?

I note that that is not the main MPI thread.

@larrystevenwise
Author

Some more data points: with the same OFED-3.18-1 code, mlx4 runs OK regardless of the mpi_add_procs_cutoff value; cxgb4 stalls if the value is <= 2.

@larrystevenwise
Author

@bharatpotnuri is working on this. I'm not sure whether he has access to comment on this issue?

@bharatpotnuri
Contributor

bharatpotnuri commented May 18, 2016

The main thread of rank 1 is looping endlessly in opal_progress(), as it doesn't find a registered callback that makes further progress.

Below is the snippet of code where it fails to find the registered callback, at opal/runtime/opal_progress.c:187:

187     /* progress all registered callbacks */
188     for (i = 0 ; i < callbacks_len ; ++i) {
189         events += (callbacks[i])();
190     } 

Below is the back-trace of main thread:

0x00002b08047dd081 in opal_timer_base_get_usec_clock_gettime () at timer_linux_component.c:184
184             return (tp.tv_sec * 1e6 + tp.tv_nsec/1000);
(gdb) bt
#0  0x00002b08047dd081 in opal_timer_base_get_usec_clock_gettime () at timer_linux_component.c:184
#1  0x00002b0804729abc in opal_progress () at runtime/opal_progress.c:161
#2  0x00002b080409f590 in opal_condition_wait (c=0x2b08043e3380, m=0x2b08043e3300) at ../opal/threads/condition.h:72
#3  0x00002b080409fbd7 in ompi_request_default_wait_all (count=64, requests=0x602b80, statuses=0x608940) at request/req_wait.c:272
#4  0x00002b08041067f4 in PMPI_Waitall (count=64, requests=0x602b80, statuses=0x608940) at pwaitall.c:76
#5  0x00000000004010b4 in main (argc=1, argv=0x7fff35570348) at osu_bw.c:121

EDIT: Added verbatim blocks
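To make the failure mode concrete, here is a minimal standalone mock of that callback loop (made-up names, not OMPI code): if a component's progress function is never registered, the loop keeps reporting zero events, which matches the stall described above.

#include <stdio.h>

/* Mock of an opal_progress-style callback table; names are illustrative. */
typedef int (*progress_cb_t)(void);

static progress_cb_t callbacks[8];
static int callbacks_len = 0;

static void register_progress(progress_cb_t cb)
{
    callbacks[callbacks_len++] = cb;
}

/* Stand-in for a BTL component progress function that polls its CQs and
   returns the number of completions handled. */
static int fake_btl_progress(void)
{
    return 1;
}

static int progress_once(void)
{
    int events = 0;
    for (int i = 0; i < callbacks_len; ++i) {
        events += callbacks[i]();
    }
    return events;
}

int main(void)
{
    /* Stall scenario: the BTL was selected, but its progress function was
       never registered, so every pass reports zero events. */
    printf("before registration: events = %d\n", progress_once());

    register_progress(fake_btl_progress);
    printf("after registration:  events = %d\n", progress_once());
    return 0;
}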

@jsquyres
Member

@bharatpotnuri @larrystevenwise Github pro tip: use three back ticks on a line to start and end verbose blocks. See https://help.github.com/articles/basic-writing-and-formatting-syntax/ for more github markdown formatting.

@jsquyres
Member

So just to be clear: you're saying that opal_progress() is not ending up calling btl_openib_progress() in its loop over the callbacks[] array?

Hmm. I'm not sure how this could happen -- that the openib BTL was evidently selected for use, but then didn't have its progress function registered with the main progression engine (that should be done up in ob1, actually).

  • What message is being sent by the openib BTL? Is it an MPI message, or some type of control / initialization message?
  • Is the openib BTL being selected for use?

@bharatpotnuri
Contributor

bharatpotnuri commented May 18, 2016

Yes, opal_progress() is not calling btl_openib_component_progress().

The difference in code flow between the working case (mpi_add_procs_cutoff > 3, np=2) and the stall case (mpi_add_procs_cutoff = 0, np=2) is the nprocs value: it is 1 in the stall case and 2 in the working case.

Following the code back, nprocs is propagated through mca_pml_ob1_add_procs(). I am checking further how it gets called.
I am also working out answers to your two questions. I will rerun the test with a few more prints and let you know soon.
Thanks.

@bharatpotnuri
Contributor

This is the test command which stalls:

/usr/mpi/gcc/openmpi-2.0.0/bin/mpirun -np 2 --host peer1,peer2 --allow-run-as-root --output-filename /tmp/mpi --mca btl openib,sm,self /usr/mpi/gcc/openmpi-2.0.0/tests/OSU/osu_bw

@bharatpotnuri
Contributor

  • What message is being sent by the openib BTL? Is it an MPI message, or some type of control / initialization message?
  • Is the openib BTL being selected for use?

@jsquyres Could you please tell me what to check for in the MPI code for your two questions above? I couldn't figure it out from the ompi code.

@jsquyres
Member

I chatted with @hjelmn about this in IM. He's thinking that the issue might be in bml/r2 somewhere (r2 is the code that ob1 uses to multiplex over all of its BTLs), because r2 is where the BTL component progress functions are supposed to be registered.

He's poking into this today to see what's going on...

@larrystevenwise
Author

Perhaps this issue has something to do with the fact that cxgb4/iwarp is using the rdmacm CPC, and thus the CTS message? I see with openib/mlx4, which does not stall, the rdmacm CPC is not being used.

[stevo2.asicdesigners.com:27045] rdmacm CPC only supported when the first QP is a PP QP; skipped
[stevo2.asicdesigners.com:27045] openib BTL: rdmacm CPC unavailable for use on mlx4_0:1; skipped

@hjelmn
Member

hjelmn commented May 18, 2016

Nah. This is caused by your network not supporting loopback. In that case, openib is disqualified only on the initial add_procs (local procs) but is then selected on a future add_procs call. Mellanox isn't affected because the openib btl is not disqualified on local procs.

@larrystevenwise
Author

cxgb4 does support hw loopback, but maybe the openib btl doesn't know that?

@larrystevenwise
Author

Ah, from mca_btl_openib_add_procs():

#if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
        /* Most current iWARP adapters (June 2008) cannot handle
           talking to other processes on the same host (!) -- so mark
           them as unreachable (need to use sm).  So for the moment,
           we'll just mark any local peer on an iWARP NIC as
           unreachable.  See trac ticket #1352. */
        if (IBV_TRANSPORT_IWARP == openib_btl->device->ib_dev->transport_type &&
            OPAL_PROC_ON_LOCAL_NODE(proc->proc_flags)) {
            continue;
        }
#endif

@larrystevenwise
Author

cxgb3 did not support hw loopback. cxgb4 and beyond do.

@jsquyres
Member

jsquyres commented May 18, 2016

I think @hjelmn will reply with detail shortly about what he just found.

@larrystevenwise Feel free to submit a PR to amend that #if block you just found. It needs to be a generic solution, though -- we have avoided "if device X, do Y" kinds of hard-coding in the openib BTL whenever possible.

@larrystevenwise
Author

@jsquyres I'll look into a way to detect hw loopback support, but from what I can tell, the OpenFabrics Verbs API doesn't advertise this capability.

Also, the 'self' btl was included in the mpirun command, so maybe openib shouldn't be disqualified?
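As a quick check of what the verbs API does expose, here is a small standalone program (a sketch, not OMPI code) that lists each device's transport type; there is no corresponding capability bit for hardware loopback:

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (NULL == devs) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (int i = 0; i < num; ++i) {
        const char *transport =
            (IBV_TRANSPORT_IWARP == devs[i]->transport_type) ? "iWARP" :
            (IBV_TRANSPORT_IB    == devs[i]->transport_type) ? "IB"    :
                                                               "unknown";
        /* The transport type is visible, but nothing here says whether the
           adapter can do hardware loopback. */
        printf("%s: transport = %s\n", ibv_get_device_name(devs[i]), transport);
    }
    ibv_free_device_list(devs);
    return 0;
}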

@larrystevenwise
Author

One gross way to detect hw loopback in openib is to attempt a hw loopback connection as part of module init...

@hjelmn
Member

hjelmn commented May 18, 2016

We already do that in the openib add_procs. You can remove the #if block and see if it is working with your hardware.

I am working on a fix so it will not hang when local communication is not available. Should be ready shortly.

@larrystevenwise
Author

larrystevenwise commented May 18, 2016

Removing the 2 blocks in btl_openib.c that excluded iWARP devices works around the issue.

diff --git a/opal/mca/btl/openib/btl_openib.c b/opal/mca/btl/openib/btl_openib.c
index 53d8c81..2e4211f 100644
--- a/opal/mca/btl/openib/btl_openib.c
+++ b/opal/mca/btl/openib/btl_openib.c
@@ -1065,18 +1065,6 @@ int mca_btl_openib_add_procs(
         struct opal_proc_t* proc = procs[i];
         mca_btl_openib_proc_t* ib_proc;

-#if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
-        /* Most current iWARP adapters (June 2008) cannot handle
-           talking to other processes on the same host (!) -- so mark
-           them as unreachable (need to use sm).  So for the moment,
-           we'll just mark any local peer on an iWARP NIC as
-           unreachable.  See trac ticket #1352. */
-        if (IBV_TRANSPORT_IWARP == openib_btl->device->ib_dev->transport_type &&
-            OPAL_PROC_ON_LOCAL_NODE(proc->proc_flags)) {
-            continue;
-        }
-#endif
-
         if(NULL == (ib_proc = mca_btl_openib_proc_get_locked(proc)) ) {
             /* if we don't have connection info for this process, it's
              * okay because some other method might be able to reach it,
@@ -1138,18 +1126,6 @@ int mca_btl_openib_add_procs(

         opal_output(-1, "add procs: adding proc %d", i);

-#if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
-        /* Most current iWARP adapters (June 2008) cannot handle
-           talking to other processes on the same host (!) -- so mark
-           them as unreachable (need to use sm).  So for the moment,
-           we'll just mark any local peer on an iWARP NIC as
-           unreachable.  See trac ticket #1352. */
-        if (IBV_TRANSPORT_IWARP == openib_btl->device->ib_dev->transport_type &&
-            OPAL_PROC_ON_LOCAL_NODE(proc->proc_flags)) {
-            continue;
-        }
-#endif
-
         if(NULL == (ib_proc = mca_btl_openib_proc_get_locked(proc)) ) {
             /* if we don't have connection info for this process, it's
              * okay because some other method might be able to reach it,

EDIT: Added diff syntax highlighting

@jsquyres
Member

@larrystevenwise Another GitHub pro tip: you can tell GitHub how it should syntax highlight your verbatim blocks. For example, trail the 3 back ticks with "diff" (without the quotes) and it tells GitHub to render that verbatim block with diff syntax highlighting.

@larrystevenwise
Author

nifty!

@bharatpotnuri
Contributor

Removing the 2 blocks in btl_openib.c that excluded iWARP devices works around the issue. The test seems to run, but it fails to complete:

# OSU MPI Bi-Directional Bandwidth Test v5.0
# Size      Bandwidth (MB/s)
1                       0.53
2                       1.07
4                       2.14
8                       4.24
16                      8.53
32                     16.97
64                     32.76
128                    63.48
256                   121.18
512                   242.11
1024                  441.96
2048                  752.99
4096                 1051.12
8192                 1385.62
16384                3153.99
32768                4147.69
65536                4857.31
131072               5495.65
262144               5690.68
524288               5768.05
1048576              5777.00
2097152              6132.86
4194304              6336.14

The main thread is waiting at rdmacm_endpoint_finalize(). Below is the back-trace of rank 1:

0x0000003ef5c0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x0000003ef5c0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b38408a66e1 in rdmacm_endpoint_finalize (endpoint=0x81b710) at connect/btl_openib_connect_rdmacm.c:1231
#2  0x00002b3840891839 in mca_btl_openib_endpoint_destruct (endpoint=0x81b710) at btl_openib_endpoint.c:338
#3  0x00002b384087a998 in opal_obj_run_destructors (object=0x81b710) at ../../../../opal/class/opal_object.h:455
#4  0x00002b3840880a66 in mca_btl_openib_finalize_resources (btl=0x7d63c0) at btl_openib.c:1627
#5  0x00002b38408812e1 in mca_btl_openib_finalize (btl=0x7d63c0) at btl_openib.c:1703
#6  0x00002b3832a9f812 in mca_btl_base_close () at base/btl_base_frame.c:153
#7  0x00002b3832a89299 in mca_base_framework_close (framework=0x2b3832d3e5a0) at mca_base_framework.c:214
#8  0x00002b383243d5f6 in mca_bml_base_close () at base/bml_base_frame.c:130
#9  0x00002b3832a89299 in mca_base_framework_close (framework=0x2b38326e91a0) at mca_base_framework.c:214
#10 0x00002b38323bec35 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:418
#11 0x00002b38323eba6d in PMPI_Finalize () at pfinalize.c:45
#12 0x0000000000401184 in main (argc=1, argv=0x7fff781a1ee8) at osu_bw.c:136

ompi-release/opal/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1231

1228     /* Now wait for all the disconnect callbacks to occur */
1229     pthread_mutex_lock(&rdmacm_disconnect_lock);
1230     while (opal_list_get_size (&contents->ids)) {
1231         pthread_cond_wait (&rdmacm_disconnect_cond, &rdmacm_disconnect_lock);
1232     }

I am running the test with the git ompi-release repo at commit a7e27ef6 ("Merge pull request #1162 from abouteiller/bug/v2.x/mt_lazy_addproc"). Looking further.

@bharatpotnuri
Contributor

bharatpotnuri commented May 19, 2016

Please ignore my previous comment.
I have tested with commit a679cc072b9458957671886062c643f6a40b0351 ("bml/r2: always add btl progress function"); it works fine with iWARP and fixes issue #1664.
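For context, here is a rough standalone sketch of the idea behind that commit (stand-in names, not the actual patch): register each selected BTL's progress function with the progress engine even when an add_procs pass finds no reachable peers for it, so a later lazy add_procs is still driven by opal_progress().

#include <stdbool.h>
#include <stdio.h>

typedef int (*progress_fn_t)(void);

/* Stand-in for a BTL component; fields and names are illustrative only. */
struct btl_component {
    const char   *name;
    progress_fn_t progress;
    bool          progress_registered;
};

static int openib_progress(void) { return 0; /* would poll CQs */ }

static void maybe_register_progress(struct btl_component *btl,
                                    int reachable_peers,
                                    bool always_register)
{
    if (!btl->progress_registered &&
        (always_register || reachable_peers > 0)) {
        btl->progress_registered = true;  /* stand-in for registering btl->progress */
        printf("registered progress for %s\n", btl->name);
    }
}

int main(void)
{
    struct btl_component openib = { "openib", openib_progress, false };

    /* First add_procs pass sees only the local peer; the iWARP device is
       marked unreachable, so reachable_peers == 0.  With the old logic the
       callback is never registered and opal_progress() can spin forever. */
    maybe_register_progress(&openib, 0, false);
    printf("old logic registered? %s\n", openib.progress_registered ? "yes" : "no");

    /* With the fixed logic, registration happens regardless. */
    maybe_register_progress(&openib, 0, true);
    printf("new logic registered? %s\n", openib.progress_registered ? "yes" : "no");
    return 0;
}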

@jsquyres
Member

Great! @hjelmn please PR to v2.x if you have not done so already.

@larrystevenwise
Author

So should I close this issue out now?
