osc/rdma: rework locking code to improve behavior of unlock #3783

Merged Aug 4, 2017

hjelmn
Member

@hjelmn hjelmn commented Jun 28, 2017

This commit changes the locking code to allow the lock release to be
non-blocking. This helps with releasing the accumulate lock which may
occur in a BTL callback.

Fixes #3616

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
(cherry picked from commit 022c658)
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
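
To illustrate what a non-blocking lock release means in general, here is a minimal sketch using C11 atomics. It is not the osc/rdma code: the type `sketch_lock_t`, the `sketch_*` functions, and the single-word lock layout are all hypothetical, and a local atomic stands in for whatever remote atomic operation the real component issues. The point it shows is that acquiring may need to retry, while releasing is one atomic operation with no wait loop, so it can be called from a completion callback.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lock word: 0 = free, 1 = held exclusively. */
typedef struct {
    atomic_uint_fast64_t word;
} sketch_lock_t;

static void sketch_lock_init(sketch_lock_t *lock)
{
    atomic_init(&lock->word, 0);
}

/* One acquisition attempt; never waits. Returns true on success. */
static bool sketch_lock_try_acquire(sketch_lock_t *lock)
{
    uint_fast64_t expected = 0;
    return atomic_compare_exchange_strong(&lock->word, &expected, 1);
}

/* Blocking acquire: retries until the lock word is free. */
static void sketch_lock_acquire(sketch_lock_t *lock)
{
    while (!sketch_lock_try_acquire(lock)) {
        /* A real implementation would drive communication progress here. */
    }
}

/* Non-blocking release: a single atomic exchange, no retry or wait loop. */
static void sketch_lock_release(sketch_lock_t *lock)
{
    uint_fast64_t prev = atomic_exchange(&lock->word, 0);
    assert(prev == 1); /* releasing a lock that was not held is a bug */
    (void) prev;
}

/* Hypothetical completion callback: because the release cannot block,
 * calling it from callback context cannot stall the caller. */
static void sketch_on_accumulate_complete(void *context)
{
    sketch_lock_release((sketch_lock_t *) context);
}
```

In this sketch, a caller would take the lock with `sketch_lock_acquire`, start the operation, and register `sketch_on_accumulate_complete` with the lock as its context, so the lock is dropped when the operation completes.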

@hjelmn hjelmn added the bug label Jun 28, 2017
@hjelmn hjelmn added this to the v2.1.2 milestone Jun 28, 2017
@hjelmn hjelmn requested a review from markalle June 28, 2017 15:47
@jsquyres
Member

jsquyres commented Aug 3, 2017

@markalle Can you please review this PR?

@markalle
Contributor

markalle commented Aug 3, 2017

I was able to build it and ran the mt_1sided test, but I still got failures.

$MPI_ROOT/bin/mpicc -o mtx \
  /path/to/ompi-tests/ibm/onesided/mt_1sided.c \
  /path/to/ompi-tests/ibm/onesided/mt_1sided_td1.c \
  /path/to/ompi-tests/ibm/onesided/mt_1sided_td2.c

$MPI_ROOT/bin/mpirun -mca osc rdma -host hostA:4 ./mtx
$MPI_ROOT/bin/mpirun -mca osc rdma -host hostA:1,hostB:1 ./mtx

The -np 4 single-host run hit a wrong answer; the 2-host version hung.

I don't think those problems are caused by this commit though.

If it helps I can shrink the number of things the testcase is doing quite a bit and still reproduce the wrong answer in the -np 4 run. Here's a gist of a version that only uses MPI_Win_fence and contiguous MPI_Get of size 200000:
https://gist.github.com/markalle/a8da31fbf4a7b25d79501232e6e8f2af
It's the 1sided.c code that's included by the above mt_1sided*.

@markalle
Contributor

markalle commented Aug 3, 2017

Oh, ignore that. That's not even what this checkin is about. This is MPI_Win_lock/unlock stuff. Okay, let me start over.

@markalle
Contributor

markalle commented Aug 3, 2017

Good news, it passed everything I threw at it. There was one failure from MPI_Accumulate where the datatypes were weird (negative lower bound, that sort of thing) which I'm just sure we addressed a while back with that big flurry of datatype related checkins. Anyway that part isn't related to this commit, and I'll just have to double-check that later. Overall the new lock/unlock seems to be working great.

@markalle
Contributor

markalle commented Aug 3, 2017

Oh wait: one of the runs hung... I'll have to investigate.

@markalle
Contributor

markalle commented Aug 4, 2017

So, as soon as I switched to building with CFLAGS=-g, even the simplest 1sided tests failed. Every call kept seeing the remote peer's window size as 0 or 1 and thus was reporting an ERR_RMA_RANGE. I can't completely rule out that I built wrong, but I did a complete rebuild and kept hitting the above. It makes me suspect that some address is being accessed incorrectly.

Anyway I was building debug to try to get more info about my hang. Without debug I can't say much about the hang.
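
As an illustration of why a remote window size read as 0 or 1 would turn nearly every access into a range error, here is a minimal sketch of a bounds check. It is not taken from osc/rdma: the function `sketch_access_in_range` and its parameters are hypothetical, and it ignores overflow of the displacement multiplication for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if an access of len bytes starting at
 * target_disp * disp_unit lies entirely inside a window of win_size bytes.
 * An access that fails this kind of check would be reported as a range error. */
static bool sketch_access_in_range(uint64_t target_disp, uint64_t disp_unit,
                                   uint64_t len, uint64_t win_size)
{
    assert(disp_unit > 0); /* a displacement unit of 0 is not meaningful here */

    /* Sketch only: overflow of this multiplication is not handled. */
    uint64_t offset = target_disp * disp_unit;

    /* Written as a subtraction so that offset + len cannot overflow. */
    return offset <= win_size && len <= win_size - offset;
}
```

With a correctly reported window size, an access that fits passes the check; if the size is wrongly seen as 0 or 1, any access longer than one byte fails it, whatever the displacement.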

@markalle
Contributor

markalle commented Aug 4, 2017

For me the new rdma seems to not work for even simple tests if I use mixed hosts:

https://gist.github.com/markalle/9688f91e479e854328f9fb7b42959e9d

mpicc -o x simple.c
mpirun -mca osc rdma -host hostA:3 ./x       : is fine
mpirun -mca osc rdma -host hostA:1,hostB:1 ./x       : is fine
mpirun -mca osc rdma -host hostA:2,hostB:1 ./x      : fails with

[[11927,1],2][btl_openib_component.c:3529:handle_wc] from hostB to: hostA error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 25954b8 opcode -1  vendor error 136 qp_idx 3

So I guess I can only confirm this commit is working single host and at hostA:1,hostB:1, and none of the problems I've seen appear to be this commit's fault.

@hppritcha
Member

Strange, why does GitHub still show this as needing a review?

@jsquyres
Member

jsquyres commented Aug 4, 2017

Because @markalle doesn't have write access to the repo. So I'll add a token review.

Member

@jsquyres jsquyres left a comment


Approved because @markalle approved.

@hppritcha hppritcha merged commit c2559ff into open-mpi:v2.x Aug 4, 2017