btl/vader: ensure the fast box tag is always read first #5829

Merged: 1 commit into open-mpi:master, Oct 3, 2018

hjelmn
Member

@hjelmn hjelmn commented Oct 2, 2018

On some platforms, reading a 64-bit value is non-atomic, and it is
possible that the two 32-bit halves are read in the wrong order. To
ensure the tag is always read first, this commit reads the tag before
reading the full 64-bit value.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
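
For readers unfamiliar with the hazard, here is a minimal sketch of the "read the tag first" idea. It is not the vader code: the `demo_hdr_t` layout, the `demo_publish` and `demo_poll` helpers, and the use of C11 atomics with release/acquire ordering are all assumptions made for illustration. The sender writes the tag last, and the receiver loads the tag on its own before touching the other half of the header, so it can never pair a new tag with a stale size.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical header: two 32-bit halves that together occupy 64 bits.
 * This layout is an assumption, not the real vader fast box header. */
typedef struct {
    _Atomic uint32_t tag;   /* 0 means "slot empty"; written last by the sender */
    _Atomic uint32_t size;  /* payload size; written before the tag */
} demo_hdr_t;

/* Sender: store the size first, then publish the tag with release ordering
 * so the size is visible to any reader that observes the new tag. */
static void demo_publish(demo_hdr_t *hdr, uint32_t tag, uint32_t size)
{
    atomic_store_explicit(&hdr->size, size, memory_order_relaxed);
    atomic_store_explicit(&hdr->tag, tag, memory_order_release);
}

/* Receiver: load the tag first (acquire). Only if it is non-zero is the
 * rest of the header read. Returns 1 when a message is ready, else 0. */
static int demo_poll(demo_hdr_t *hdr, uint32_t *tag, uint32_t *size)
{
    uint32_t t = atomic_load_explicit(&hdr->tag, memory_order_acquire);
    if (0 == t) {
        return 0;  /* nothing published yet; do not look at size */
    }
    *tag = t;
    *size = atomic_load_explicit(&hdr->size, memory_order_relaxed);
    return 1;
}
```

A single 64-bit load of the whole header would be simpler, but on a platform where that load is split into two 32-bit loads the halves can be observed in either order, which is the situation the commit message describes.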

@hjelmn
Member Author

hjelmn commented Oct 2, 2018

@jsquyres In theory this will fix the remaining vader issue.

@hjelmn
Member Author

hjelmn commented Oct 2, 2018

@amckinstry Please verify.

@jsquyres
Member

jsquyres commented Oct 2, 2018

@amckinstry This is in reference to #5638 -- the hang with vader on some architectures.

@hjelmn hjelmn merged commit 66a7dc4 into open-mpi:master Oct 3, 2018
@amckinstry

Unfortunately, we now see a related crash in other codes (LAMMPS):

#0 0x0000000000000000 in ()
#1 0x00007f7a700f8a5f in mca_btl_vader_poll_handle_frag (hdr=0x7f7a6a1bf049, endpoint=endpoint@entry=0x55e02df87bd0) at btl_vader_component.c:603
#2 0x00007f7a700f8f83 in mca_btl_vader_check_fboxes () at btl_vader_fbox.h:225

#3 0x00007f7a700f8f83 in mca_btl_vader_component_progress () at btl_vader_component.c:702
#4 0x00007f7a828fb1cc in opal_progress () at runtime/opal_progress.c:228
#5 0x00007f7a8d816ebd in ompi_request_wait_completion (req=0x55e02df8d900) at ../ompi/request/request.h:413
#6 0x00007f7a8d816ebd in ompi_request_default_wait (req_ptr=0x7fffa70c3708, status=0x7fffa70c3710) at request/req_wait.c:42
#7 0x00007f7a8d86ec41 in ompi_coll_base_sendrecv_actual (sendbuf=sendbuf@entry=0x55e02df97de0, scount=scount@entry=1, sdatatype=sdatatype@entry=0x55e02cace1e0 <ompi_mpi_int>, dest=dest@entry=0, stag=stag@entry=-12, recvbuf=recvbuf@entry=0x7fffa70c3934, rcount=1, rdatatype=0x55e02cace1e0 <ompi_mpi_int>, source=0, rtag=-12, comm=0x55e02cacf700 <ompi_mpi_comm_world>, status=0x0) at base/coll_base_util.c:59

It looks like hdr->tag is invalid, hence the segfault.
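
Frame #0 at address 0 is what a call through a null function pointer looks like, which would fit a handler being looked up by tag when the tag is garbage. The sketch below illustrates that failure mode and a defensive check. It is an assumption about the mechanism, not the actual vader dispatch code; `demo_callbacks`, `demo_dispatch`, and `demo_count_cb` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_TAG_MAX 256

typedef void (*demo_cb_t)(const void *payload, size_t len);

/* Hypothetical callback table indexed by tag. Slots that were never
 * registered stay NULL because the array has static storage duration. */
static demo_cb_t demo_callbacks[DEMO_TAG_MAX];

/* Counter used only to observe whether a callback actually ran. */
static int demo_calls;

static void demo_count_cb(const void *payload, size_t len)
{
    (void) payload;
    (void) len;
    ++demo_calls;
}

/* Look up the handler for a tag and call it. Without the NULL check, a
 * bogus tag would jump to address 0, matching frame #0 in the backtrace.
 * Returns 0 on success, -1 if no callback is registered for the tag. */
static int demo_dispatch(uint8_t tag, const void *payload, size_t len)
{
    demo_cb_t cb = demo_callbacks[tag];  /* uint8_t keeps the index in range */
    if (NULL == cb) {
        return -1;
    }
    cb(payload, len);
    return 0;
}
```

A check like this only turns the crash into a reportable error; it would not explain why the tag was read incorrectly in the first place.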

@gpaulsen
Member

gpaulsen commented Oct 4, 2018

@amckinstry, I've opened #5842 to track this related crash you're seeing.

@hjelmn
Member Author

hjelmn commented Oct 4, 2018

@amckinstry What platforms?

@gpaulsen
Member

gpaulsen commented Oct 4, 2018

@hjelmn In @amckinstry's original Issue, he mentioned i386 + Debian: #5638

@hjelmn
Member Author

hjelmn commented Oct 4, 2018

@gpaulsen Yeah, I just want to see if that is still the case. If it is just i386, I cannot justify spending any more time at work on the issue. Someone else will need to look at it.

@gpaulsen
Member

gpaulsen commented Oct 4, 2018

Ok thanks for mentioning that.

@amckinstry

@hjelmn @gpaulsen The regression (crash) occurs on amd64 at least, and on all platforms, I think.

@hjelmn
Member Author

hjelmn commented Oct 5, 2018

Ok, that I can spend time on. Will take a look today.
