Looking at the code in `osc_ucx_passive_target.c`, I found that the implementation of `end_exclusive` is flawed, leading to a lock-up if ranks try to take a shared lock while one rank holds an exclusive lock on the same target. Simply replacing the value in the lock's memory with `TARGET_LOCK_UNLOCKED` overwrites any changes to that value made by other processes trying to acquire a shared lock, causing the lock to get out of sync. Instead, the value of `TARGET_LOCK_EXCLUSIVE` should be subtracted from the lock to release it, so that the release does not interfere with the attempts of other ranks.
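To illustrate the race, here is a minimal single-process sketch that replays the problematic interleaving with C11 atomics in place of the remote atomic operations. It is not the code from `osc_ucx_passive_target.c`: the function `simulate_release` is hypothetical, the numeric values of the two constants are assumptions, and so is the shared-lock protocol shown here (increment the lock word, then back off with a decrement if an exclusive holder is detected).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed values, chosen for illustration only. */
#define TARGET_LOCK_UNLOCKED  ((uint64_t)0)
#define TARGET_LOCK_EXCLUSIVE ((uint64_t)0x0000000100000000ULL)

/*
 * Hypothetical helper: replays one interleaving on a local lock word and
 * returns the final value of that word.
 *   use_subtract != 0: release by subtracting TARGET_LOCK_EXCLUSIVE
 *   use_subtract == 0: release by overwriting with TARGET_LOCK_UNLOCKED
 */
static uint64_t simulate_release(int use_subtract)
{
    _Atomic uint64_t lock = TARGET_LOCK_UNLOCKED;

    /* Rank A takes the exclusive lock via compare-and-swap. */
    uint64_t expected = TARGET_LOCK_UNLOCKED;
    int acquired = atomic_compare_exchange_strong(&lock, &expected,
                                                  TARGET_LOCK_EXCLUSIVE);
    assert(acquired);
    (void)acquired;

    /* Rank B attempts a shared lock: it increments the word and inspects
     * the previous value to see whether an exclusive holder exists. */
    uint64_t seen = atomic_fetch_add(&lock, 1);

    /* Rank A releases before rank B has backed off. */
    if (use_subtract) {
        atomic_fetch_sub(&lock, TARGET_LOCK_EXCLUSIVE); /* keeps B's +1 */
    } else {
        atomic_store(&lock, TARGET_LOCK_UNLOCKED);      /* erases B's +1 */
    }

    /* Rank B saw the exclusive holder, so it undoes its increment. */
    if (seen >= TARGET_LOCK_EXCLUSIVE) {
        atomic_fetch_sub(&lock, 1);
    }

    return atomic_load(&lock);
}
```

Under these assumptions, `simulate_release(1)` leaves the word at `TARGET_LOCK_UNLOCKED`, while `simulate_release(0)` wraps it around to `UINT64_MAX`. A word in that state looks exclusively locked to every later shared attempt and can never match the unlocked value expected by an exclusive attempt, which would be consistent with the observed lock-up.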
While debugging, I also found that some of the asserts in these code paths are overly strict and trigger even when they should not.
I will post PRs for master, v4.0.x, and v3.1.x soon.
@hppritcha The PR for v4.0.x is at #6934. I hope this will still make it into 4.0.2 :) I will wait for feedback on these two PRs before posting the PR for v3.1.x.
Potentially related: #6549