Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New vader SEGV possibly PR 5829 #5842

Closed
gpaulsen opened this issue Oct 4, 2018 · 16 comments
Closed

New vader SEGV possibly PR 5829 #5842

gpaulsen opened this issue Oct 4, 2018 · 16 comments

Comments

@gpaulsen
Copy link
Member

gpaulsen commented Oct 4, 2018

@amckinstry commented on #5829

Original issue was: #5638

Since that was merged to all release branches, this is a blocker on all release branches.

Unfortunately we now see a related crash on other codes (lammps):

#0 0x0000000000000000 in ()
#1 0x00007f7a700f8a5f in mca_btl_vader_poll_handle_frag (hdr=0x7f7a6a1bf049, endpoint=endpoint@entry=0x55e02df87bd0) at btl_vader_component.c:603
#2 0x00007f7a700f8f83 in mca_btl_vader_check_fboxes () at btl_vader_fbox.h:225

#3 0x00007f7a700f8f83 in mca_btl_vader_component_progress () at btl_vader_component.c:702
#4 0x00007f7a828fb1cc in opal_progress () at runtime/opal_progress.c:228
#5 0x00007f7a8d816ebd in ompi_request_wait_completion (req=0x55e02df8d900) at ../ompi/request/request.h:413
#6 0x00007f7a8d816ebd in ompi_request_default_wait (req_ptr=0x7fffa70c3708, status=0x7fffa70c3710) at request/req_wait.c:42
#7 0x00007f7a8d86ec41 in ompi_coll_base_sendrecv_actual (sendbuf=sendbuf@entry=0x55e02df97de0, scount=scount@entry=1, sdatatype=sdatatype@entry=0x55e02cace1e0 <ompi_mpi_int>, dest=dest@entry=0, stag=stag@entry=-12, recvbuf=recvbuf@entry=0x7fffa70c3934, rcount=1, rdatatype=0x55e02cace1e0 <ompi_mpi_int>, source=0, rtag=-12, comm=0x55e02cacf700 <ompi_mpi_comm_world>, status=0x0) at base/coll_base_util.c:59

it looks like hdr->tag is invaliid, hence segdfault
@gpaulsen
Copy link
Member Author

gpaulsen commented Oct 4, 2018

@amckinstry, can you please confirm that you have commit dfa8d3a in your master build that you saw this with? There were a few other vader related fixes that went in around #5829

@gpaulsen
Copy link
Member Author

gpaulsen commented Oct 4, 2018

@amckinstry - @hjelmn asked what platforms this affects. If it only affects i386, we may not have a developer to fix this.

@jsquyres
Copy link
Member

jsquyres commented Oct 4, 2018

I don't agree with that sentiment. If we know it's broken on a platform, we need to make it right. We either fix it on that platform or we disable the functionality (i.e., vader fastboxes) on that platform. For core functionality like this, we can't just knowingly have a bad bug like this.

@hjelmn
Copy link
Member

hjelmn commented Oct 4, 2018

@jsquyres My time is at a premium. We don't run i386 so unless I am told to fix this at work I will not spend any time on it. Just the reality right now. Easy enough to disable fast boxes on i386. I can make a PR for that.

@bwbarrett
Copy link
Member

What evidence do we have the bug is i386? The last two have been bugs in general, but are more likely to happen on i386. Really think we should consider disabling fastboxes in general until we can go through it in more detail.

I’m happy to do a review, but would love a high level description of the fastbox design first; there are some assumptions in the code I didn’t quite follow last week.

@hjelmn
Copy link
Member

hjelmn commented Oct 4, 2018

@bwbarrett Read my comment. If this is only happening on i386 I won't have time for it. If it happens elsewhere (x86_64, power, arm64) then I can justify the time.

@jsquyres
Copy link
Member

jsquyres commented Oct 4, 2018

@hjelmn I don't really disagree with you. Indeed, I don't care about i386, either. We're all busy -- who's got time to fix i386? Not me. But regardless as a community, we need to decide which of these to do:

  1. Fix the bug
  2. Disable the problematic platform(s) altogether
  3. Disable the problematic functionality (probably vader fastboxes) on problematic platform(s) (i.e., so that you can still get correct runs on these platform(s))

In short: it would be super lame for us to keep shipping code that configures/builds/installs/runs on these problematic platforms when we seem to have indications that there's a fairly big bug in core functionality.

And I think we all agree: we don't know that it's (only) i386. We need to hear two things back from @amckinstry:

  1. To double check that he had all the relevant patches
  2. To find out what platform(s) failed this time

@hjelmn
Copy link
Member

hjelmn commented Oct 4, 2018

@jsquyres Option 2 would be my preference but I don't know if we are ready to kill support a platform that was effectively killed in the early 2000's (when x86_64 came about). So, option 3 is the way to go.

Once we hear back from the user we will see what the path forward is.

@amckinstry
Copy link

@bwbarrett The original bug(s) was on multiple platforms: (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=907407) arm, i386, powerpc, including mips64el and ppc64el so not just 32-bit; i386 was just the easiest to debug. It could be seen in test case issue 46 on "arpack", but we also had hangs in "liggghts" and "elpa".
With the current patch we see segfaults on amd64 on lammps, and elpa.

This is very definitely a release blocker for Debian/Ubuntu; we need to either fix or disable functionality somehow. Whatever I can do to help, I will.

@hjelmn
Copy link
Member

hjelmn commented Oct 5, 2018

Will look today.

@hjelmn
Copy link
Member

hjelmn commented Oct 5, 2018

@amckinstry Which LAMMPS problem shows the issue? Is OpenMP in use?

@amckinstry
Copy link

Version lammps-0~20180822.gitb47e49223
You can browser the build source here: https://salsa.debian.org/science-team/lammps
The relevant fragment is:

        mkdir test; cp examples/crack/* test/
        cd test; mpirun -np 2 --allow-run-as-root --oversubscribe $(BUILDDIR)/lmp < in.crack

OpenMP not enabled I think; it appears to be stochastic, hanging about half the time, crashing the rest (with trace above).

@hjelmn
Copy link
Member

hjelmn commented Oct 5, 2018

@amckinstry Thanks. That will help me narrow down where the problem may be.

@amckinstry
Copy link

For the record in case of mistake the patch I am running on openmpi-3.1.2 is attached

Index: openmpi-3.1.2/opal/mca/btl/vader/btl_vader_fbox.h
===================================================================
--- openmpi-3.1.2.orig/opal/mca/btl/vader/btl_vader_fbox.h
+++ openmpi-3.1.2/opal/mca/btl/vader/btl_vader_fbox.h
@@ -50,11 +50,23 @@ void mca_btl_vader_poll_handle_frag (mca
 static inline void mca_btl_vader_fbox_set_header (mca_btl_vader_fbox_hdr_t *hdr, uint16_t tag,
                                                   uint16_t seq, uint32_t size)
 {
-    mca_btl_vader_fbox_hdr_t tmp = {.data = {.tag = tag, .seq = seq, .size = size}};
-    opal_atomic_wmb ();
+    mca_btl_vader_fbox_hdr_t tmp = {.data = {.tag = 0, .seq = seq, .size = size}};
     hdr->ival = tmp.ival;
+    opal_atomic_wmb ();
+    hdr->data.tag = tag;
+}
+
+static inline mca_btl_vader_fbox_hdr_t mca_btl_vader_fbox_read_header (mca_btl_vader_fbox_hdr_t *hdr)
+{
+    mca_btl_vader_fbox_hdr_t tmp;
+    uint16_t tag = hdr->data.tag;
+    opal_atomic_rmb ();
+    tmp.ival = hdr->ival;
+    tmp.data.tag = tag;
+    return tmp;
 }

+
 /* attempt to reserve a contiguous segment from the remote ep */
 static inline bool mca_btl_vader_fbox_sendi (mca_btl_base_endpoint_t *ep, unsigned char tag,
                                              void * restrict header, const size_t header_size,
@@ -138,9 +150,6 @@ static inline bool mca_btl_vader_fbox_se
         memcpy (data + header_size, payload, payload_size);
     }
-    /* write out part of the header now. the tag will be written when the data is available */
-    mca_btl_vader_fbox_set_header (MCA_BTL_VADER_FBOX_HDR(dst), tag, ep->fbox_out.seq++, data_size);
-
     end += size;

     if (OPAL_UNLIKELY(fbox_size == end)) {
@@ -152,6 +161,9 @@ static inline bool mca_btl_vader_fbox_se
         MCA_BTL_VADER_FBOX_HDR(ep->fbox_out.buffer + end)->ival = 0;
     }

+    /* write out part of the header now. the tag will be written when the data is available */
+    mca_btl_vader_fbox_set_header (MCA_BTL_VADER_FBOX_HDR(dst), tag, ep->fbox_out.seq++, data_size);
+
     /* align the buffer */
     ep->fbox_out.end = ((uint32_t) hbs << 31) | end;
     opal_atomic_wmb ();
@@ -174,7 +186,7 @@ static inline bool mca_btl_vader_check_f

         for (poll_count = 0 ; poll_count <= MCA_BTL_VADER_POLL_COUNT ; ++poll_count) {
-            const mca_btl_vader_fbox_hdr_t hdr = {.ival = MCA_BTL_VADER_FBOX_HDR(ep->fbox_in.buffer + start)->ival};
+            const mca_btl_vader_fbox_hdr_t hdr = mca_btl_vader_fbox_read_header (MCA_BTL_VADER_FBOX_HDR(ep->fbox_in.buffer + start));

             /* check for a valid tag a sequence number */
             if (0 == hdr.data.tag || hdr.data.seq != ep->fbox_in.seq) {

@amckinstry
Copy link

I've tested the fix in #5852
This solves all the issues seen (regression on amd64, original hangs and segfaults on i386)

@gpaulsen
Copy link
Member Author

gpaulsen commented Oct 23, 2018

This has been merged to v3.0.x, v3.1.x, and v4.0.x. It looks like this functionality never existed on v2.1.x or earlier.
Closing this Issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants