Fix race in acquire #68
Conversation
It would be good to run a few long-running benchmarks to make sure this fixes the issue.
```cpp
} else {
  // item is being moved - wait for completion
  return handleWithWaitContextForMovingItem(*it);
  while (true) {
```
Inside this spin loop, it is good practice to issue pause instructions. folly already has a platform-agnostic function for this, `asm_volatile_pause()`. There is also a `folly::detail::Sleeper` abstraction, but the problem is that it is not a public interface.
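The suggested pattern can be sketched as below. This is a minimal, portable stand-in using only compiler builtins; `cpuPause` and `spinUntil` are invented names for illustration, while the real patch could simply call folly's `asm_volatile_pause()` inside the `while (true)` loop:

```cpp
#include <atomic>
#include <thread>

// Stand-in for folly::asm_volatile_pause(): a CPU hint that lowers
// power use and eases pipeline contention inside busy-wait loops.
inline void cpuPause() {
#if defined(__x86_64__) || defined(__i386__)
  __builtin_ia32_pause();     // the x86 PAUSE instruction
#else
  std::this_thread::yield();  // conservative fallback elsewhere
#endif
}

// Spin until `flag` becomes true, pausing on every iteration.
inline void spinUntil(std::atomic<bool>& flag) {
  while (!flag.load(std::memory_order_acquire)) {
    cpuPause();
  }
}
```

A `folly::detail::Sleeper`-style abstraction would additionally back off to a real sleep after a bounded number of pauses, which matters if the wait can be long.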
I think this loop is actually guaranteed to run at most 2 iterations. I'll try to verify this experimentally, but here's my thinking:

`acquire()` is always called under the Access Container lock, so there is only one scenario (1.a.) where the race can happen on the item being evicted:

1. `tryEvictToNextMemoryTier` failed, or we are evicting from the last tier:
   a. `markForEvictionWhenMoving` is called just after we got `incFailedMoving`: if under the MoveLock we realize that the item is no longer moving, it has to be marked for eviction, so we just return NULL (we loop once and fail with `incFailedEviction`). There's no use-after-free on the item because `acquire` holds the AC lock, so `findEviction` will block on `unlinkItemForEviction`.
   b. in all other cases we synchronize on the AC lock, or the item is already marked for eviction (and we just return NULL).
2. the item is successfully moved between tiers:
   a. `unmarkMoving` is called just after we got `incFailedMoving`? This cannot happen because `unmarkMoving` is called AFTER `AC->replaceIf` (which takes the AC lock), which means we wouldn't enter `acquire` with a pointer to the item that is being evicted.

We also need to consider `newItemHdl`. If `find` or `insertOrReplace` gets a `newItem` that is still marked moving (`replaceIf` in `moveRegularItemWithSync` was executed but `unmarkMoving` was not), and the `newItem` is no longer moving under the MoveLock, then we just increment the item's ref count (this works since it's no longer moving).

If someone tries to evict/move this new item before `acquire` gets a chance to `incRef` successfully, `acquire` will surely succeed in creating a wait context (because `acquire()` is under the AC lock, so `findEviction` cannot `replaceIf` or remove from the access container, which is always done before unmarking).
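The at-most-two-iterations argument can be modeled with a toy state machine. All names below are hypothetical simplifications for illustration, not CacheLib's actual refcount API; scenario 1.a is modeled by flipping the item from "moving" to "marked for eviction" on the first failed attempt:

```cpp
#include <cassert>

// Hypothetical simplified results of incrementing an item's refcount.
enum class IncResult { kOk, kFailedMoving, kFailedEviction };

struct FakeItem {
  bool moving = false;
  bool markedForEviction = false;
  IncResult incRef() {
    if (markedForEviction) return IncResult::kFailedEviction;
    if (moving) return IncResult::kFailedMoving;
    return IncResult::kOk;
  }
};

// Sketch of the acquire() retry loop: because acquire() runs under the
// Access Container lock, a "moving" item can transition at most once
// (to marked-for-eviction) before the recheck, so the loop is bounded.
bool tryAcquire(FakeItem& it) {
  int iterations = 0;
  while (true) {
    ++iterations;
    assert(iterations <= 2);  // the invariant argued above
    auto res = it.incRef();
    if (res == IncResult::kOk) return true;
    if (res == IncResult::kFailedEviction) return false;  // give up: NULL
    // kFailedMoving: under the MoveLock the item may turn out to be no
    // longer moving; model scenario 1.a by marking it for eviction.
    it.moving = false;
    it.markedForEviction = true;
  }
}
```

The interesting path is an item that starts as moving: the first iteration fails with `kFailedMoving`, the second observes the eviction mark and returns NULL, never a third.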
```cpp
return handleWithWaitContextForMovingItem(*it);
while (true) {
  // TODO: do not block incRef for child items to avoid deadlock
  auto failIfMoving = getNumTiers() > 1 && !it->isChainedItem();
```
Can it be calculated outside the loop?
Sure, done.
The assumption for moving items was that once an item is unmarked, no one can add new waiters for that item. However, since incrementing the item's ref count was not done under the MoveMap lock, there was a race: the item could have been unmarked right after `incRef` returned `incFailedMoving`.
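The invariant the fix restores can be sketched with a toy model (hypothetical names throughout; the point is only that the moving check plus waiter registration and the unmark plus notification are serialized by the same lock, so no waiter can be added after unmarking):

```cpp
#include <functional>
#include <mutex>
#include <vector>

// Toy model of one entry guarded by the MoveMap lock.
struct MoveMapEntry {
  std::mutex lock;  // stands in for the MoveMap lock
  bool moving = true;
  std::vector<std::function<void()>> waiters;

  // Called by acquire() after incRef returned incFailedMoving. The
  // moving flag is rechecked under the lock, so a waiter is only
  // registered while the item is still genuinely moving.
  bool addWaiter(std::function<void()> w) {
    std::lock_guard<std::mutex> g(lock);
    if (!moving) return false;  // raced with unmark: caller retries
    waiters.push_back(std::move(w));
    return true;
  }

  // Called when the move completes. Unmarking happens under the same
  // lock, so no waiter can slip in afterwards and be lost.
  void unmarkAndNotify() {
    std::vector<std::function<void()>> toNotify;
    {
      std::lock_guard<std::mutex> g(lock);
      moving = false;
      toNotify.swap(waiters);
    }
    for (auto& w : toNotify) w();  // notify outside the lock
  }
};
```

Without the locked recheck in `addWaiter`, a waiter registered just after `unmarkAndNotify` drained the list would never be woken, which is the hang this PR fixes.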
I can confirm no impact on performance for leader and follower workloads.
@guptask could you please check this patch with your DSA branch and confirm that it fixes the hang that you reported recently?
So, in that case, I believe we can merge this PR?