Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCX osc: mixing exclusive and shared locks leads to lockup #6931

Closed
devreal opened this issue Aug 27, 2019 · 2 comments · Fixed by #6933
Closed

UCX osc: mixing exclusive and shared locks leads to lockup #6931

devreal opened this issue Aug 27, 2019 · 2 comments · Fixed by #6933

Comments

@devreal
Copy link
Contributor

devreal commented Aug 27, 2019

Looking at the code in osc_ucx_passive_target.c I found that the implementation of end_exclusive is flawed, leading to a lock-up if ranks try to take a shared lock while one rank holds an exclusive lock on the same target. Simply replacing the value in the lock's memory with TARGET_LOCK_UNLOCKED overwrites any changes to that value made by other processes trying to acquire a shared lock, causing the lock to get out of sync. Instead, the value of TARGET_LOCK_EXCLUSIVE should be subtracted from the lock to release it and to not interfere with the attempts of other ranks.

While debugging, I also found that some of the asserts in these code paths are overly strict and trigger even when they should not.

I will post PRs for master, v4.0.x, and v3.1.x soon.

Potentially related: #6549

@hppritcha
Copy link
Member

@devreal note we are in the release cycle for 4.0.2 so if you want a fix in 4.0.2 you'll need to open a PR this week or next week at latest.

@devreal
Copy link
Contributor Author

devreal commented Aug 27, 2019

@hppritcha The PR for 4.0.x is at #6934. I hope this will still make it into 4.0.2 :) I will wait for feedback on these two PRs before posting the PR to 3.1.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants