-
Notifications
You must be signed in to change notification settings - Fork 878
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
osc/rdma: rework locking code to improve behavior of unlock #3783
Conversation
This commit changes the locking code to allow the lock release to be non-blocking. This helps with releasing the accumulate lock which may occur in a BTL callback. Fixes open-mpi#3616 Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov> (cherry picked from commit 022c658) Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
@markalle Can you please review this PR? |
I was able to build and tried running it with the mt_1sided test and I did get failures still.
The -np 4 single host run hit a wrong answer, the 2-host version hung. I don't think those problems are caused by this commit though. If it helps I can shrink the number of things the testcase is doing quite a bit and still reproduce the wrong answer in the -np 4 run. Here's a gist of a version that only uses MPI_Win_fence and contiguous MPI_Get of size 200000: |
Oh, ignore that. That's not even what this checkin is about. This is MPI_Win_lock/unlock stuff. Okay, let me start over. |
Good news, it passed everything I threw at it. There was one failure from MPI_Accumulate where the datatypes were weird (negative lower bound, that sort of thing) which I'm just sure we addressed a while back with that big flurry of datatype related checkins. Anyway that part isn't related to this commit, and I'll just have to double-check that later. Overall the new lock/unlock seems to be working great. |
Oh wait: one of the runs hung... I'll have to investigate. |
So, as soon as I switched to building with CFLAGS=-g even the simplest 1sided tests failed. Every call kept seeing the remote peer's window size as 0 or 1 and thus was reporting an ERR_RMA_RANGE. I can't completely rule out that I built wrong, but I did a complete rebuild and kept hitting the above. It makes me suspicious of some addess being accessed wrong. Anyway I was building debug to try to get more info about my hang. Without debug I can't say much about the hang. |
For me the new rdma seems to not work for even simple tests if I use mixed hosts: https://gist.github.com/markalle/9688f91e479e854328f9fb7b42959e9d
So I guess I can only confirm this commit is working single host and at hostA:1,hostB:1, and none of the problems I've seen appear to be this commit's fault. |
strange why does github still show this as needing a review? |
Because @markalle doesn't have write access to the repo. So I'll add a token review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved because @markalle approved.
This commit changes the locking code to allow the lock release to be
non-blocking. This helps with releasing the accumulate lock which may
occur in a BTL callback.
Fixes #3616
Signed-off-by: Nathan Hjelm hjelmn@lanl.gov
(cherry picked from commit 022c658)
Signed-off-by: Nathan Hjelm hjelmn@lanl.gov