btl/vader: ensure the fast box tag is always read first #5829

hjelmn · 2018-10-02T21:55:05Z

On some platforms reading a 64-bit value is non-atomic and it is
possible that the two 32-bit values are read in the wrong order. To
ensure the tag is always read first this commit reads the tag before
reading the full 64-bit value.

Signed-off-by: Nathan Hjelm hjelmn@lanl.gov
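
A minimal sketch of the tag-first idea, not the actual vader code. The type and function names (`demo_fbox_hdr_t`, `demo_fbox_publish`, `demo_fbox_poll`) and the header layout are hypothetical. The sketch also swaps in C11 atomics on two separate 32-bit words where the commit reads the tag and then the full 64-bit value. The assumption is that a non-zero tag marks a complete header, so the reader loads the tag first and touches the other half only once the tag is visible.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fast box header: two 32-bit words that together make up the
 * 64-bit header. This is an illustration, not the real vader layout. */
typedef struct demo_fbox_hdr {
    _Atomic uint32_t tag_word;   /* assumed: non-zero once the header is complete */
    _Atomic uint32_t size_word;  /* assumed: payload size in bytes */
} demo_fbox_hdr_t;

/* Writer side: store the size first and publish the tag last. */
static void demo_fbox_publish(demo_fbox_hdr_t *hdr, uint32_t tag, uint32_t size)
{
    atomic_store_explicit(&hdr->size_word, size, memory_order_relaxed);
    /* Release store: the size written above becomes visible before the tag. */
    atomic_store_explicit(&hdr->tag_word, tag, memory_order_release);
}

/* Reader side: load the tag first, then the rest of the header. Returns
 * false if no complete header is present yet. */
static bool demo_fbox_poll(demo_fbox_hdr_t *hdr, uint32_t *tag, uint32_t *size)
{
    /* Acquire load: later loads cannot be reordered before this one. */
    uint32_t t = atomic_load_explicit(&hdr->tag_word, memory_order_acquire);
    if (0 == t) {
        return false;
    }
    *tag = t;
    *size = atomic_load_explicit(&hdr->size_word, memory_order_relaxed);
    return true;
}
```

With this ordering the reader never acts on a size that belongs to a tag it has not yet observed, which is the failure mode the commit message describes for platforms that split a 64-bit load into two 32-bit loads.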

hjelmn · 2018-10-02T21:55:31Z

@jsquyres In theory this will fix the remaining vader issue.

hjelmn · 2018-10-02T22:05:00Z

@amckinstry Please verify.

jsquyres · 2018-10-02T22:11:15Z

@amckinstry This is in reference to #5638 -- the hang with vader on some architectures.

amckinstry · 2018-10-04T15:21:04Z

Unfortunately we now see a related crash on other codes (lammps):

#0 0x0000000000000000 in ()
#1 0x00007f7a700f8a5f in mca_btl_vader_poll_handle_frag (hdr=0x7f7a6a1bf049, endpoint=endpoint@entry=0x55e02df87bd0) at btl_vader_component.c:603
#2 0x00007f7a700f8f83 in mca_btl_vader_check_fboxes () at btl_vader_fbox.h:225
#3 0x00007f7a700f8f83 in mca_btl_vader_component_progress () at btl_vader_component.c:702
#4 0x00007f7a828fb1cc in opal_progress () at runtime/opal_progress.c:228
#5 0x00007f7a8d816ebd in ompi_request_wait_completion (req=0x55e02df8d900) at ../ompi/request/request.h:413
#6 0x00007f7a8d816ebd in ompi_request_default_wait (req_ptr=0x7fffa70c3708, status=0x7fffa70c3710) at request/req_wait.c:42
#7 0x00007f7a8d86ec41 in ompi_coll_base_sendrecv_actual (sendbuf=sendbuf@entry=0x55e02df97de0, scount=scount@entry=1, sdatatype=sdatatype@entry=0x55e02cace1e0 <ompi_mpi_int>, dest=dest@entry=0, stag=stag@entry=-12, recvbuf=recvbuf@entry=0x7fffa70c3934, rcount=1, rdatatype=0x55e02cace1e0 <ompi_mpi_int>, source=0, rtag=-12, comm=0x55e02cacf700 <ompi_mpi_comm_world>, status=0x0) at base/coll_base_util.c:59

It looks like hdr->tag is invalid, hence the segfault.
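
For readers following the backtrace: frame #0 at address 0 is consistent with a call through a NULL function pointer, which is what an unregistered or garbage tag could produce if the tag is used to look up a receive callback. The sketch below shows that failure mode and a defensive check. It is an illustration under that assumption, not the vader dispatch code, and every name in it (`demo_callbacks`, `demo_dispatch`, `demo_record_len`, `DEMO_TAG_MAX`) is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_TAG_MAX 256  /* hypothetical number of tag slots */

typedef void (*demo_recv_cb_t)(const void *payload, size_t len);

/* Hypothetical tag-indexed callback table; unregistered slots stay NULL. */
static demo_recv_cb_t demo_callbacks[DEMO_TAG_MAX];

/* Example callback that only records the length it was given. */
static size_t demo_last_len;
static void demo_record_len(const void *payload, size_t len)
{
    (void) payload;
    demo_last_len = len;
}

/* Dispatch a fragment by tag. Returns 0 on success, or -1 when the tag is
 * out of range or has no registered callback. Without this check, calling
 * demo_callbacks[tag] for an empty slot would jump to address 0. */
static int demo_dispatch(uint32_t tag, const void *payload, size_t len)
{
    if (tag >= DEMO_TAG_MAX || NULL == demo_callbacks[tag]) {
        return -1;
    }
    demo_callbacks[tag](payload, len);
    return 0;
}
```

A check like this would turn the crash into a reportable error, but it would not explain why the tag was invalid in the first place.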

gpaulsen · 2018-10-04T17:50:01Z

@amckinstry, I've opened #5842 to track this related crash you're seeing.

hjelmn · 2018-10-04T18:31:28Z

@amckinstry What platforms?

gpaulsen · 2018-10-04T19:29:25Z

@hjelmn In @amckinstry's original Issue, he mentioned i386 + Debian: #5638

hjelmn · 2018-10-04T19:32:48Z

@gpaulsen Yeah, just want to see if that is still the case. If it is just i386 I can not justify spending any more time at work on the issue. Someone else will need to look at it.

gpaulsen · 2018-10-04T20:54:13Z

Ok thanks for mentioning that.

amckinstry · 2018-10-05T08:11:38Z

@hjelmn @gpaulsen The regression (crash) is on amd64 at least, all platforms I think.

hjelmn · 2018-10-05T12:00:54Z

Ok, that I can spend time on. Will take a look today.

hjelmn added bug Target: main labels Oct 2, 2018

jsquyres mentioned this pull request Oct 2, 2018

Hangs in mca_btl_vader_component_progress on multiple archs #5638

Closed

bwbarrett approved these changes Oct 3, 2018

hjelmn merged commit 66a7dc4 into open-mpi:master Oct 3, 2018

gpaulsen mentioned this pull request Oct 4, 2018

New vader SEGV possibly PR 5829 #5842

Closed

gpaulsen added Target: v2.x Target: v3.0.x and removed Target: v2.x Target: v3.0.x labels Oct 4, 2018
