-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Threading manager stresstest and fixes #15470
Conversation
Was hoping to find the cause of the issue I looked at in #15431
if (us == 0) | ||
return false; | ||
std::unique_lock<std::mutex> lock(mutex_); | ||
cond_.wait_for(lock, std::chrono::microseconds(us), [&] { return !triggered_; }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I can't believe I didn't notice that. I'm sure I copied it from Event and changed the line, but even so.
-[Unknown]
std::unique_lock<std::mutex> lock(mutex_); | ||
if (!triggered_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, just wondering, what's the race here?
Considering the old code:
- If triggered is true, no lock and it bails. It can't be un-triggered, so this is safe.
- If triggered is false and stays false, then we lock and check triggered again (within the cond predicate), and then wait. When trigger happens, we wake.
- If triggered is false and changes immediately before the lock, we still lock and check triggered again in the pred. This happens before waiting, within the lock, so triggered is now true and we don't wait.
- If triggered is false but doesn't change until after the lock, then Notify() will be blocked until we enter the wait, and then the notify_all() will wake and recheck pred.
As long as Notify locks, in theory we could wrap all these functions in if (!triggered_) { ... }
as they are now, without a lock, because it gets rechecked again and can only become true once. Trying to figure out where the race condition is that I'm missing?
-[Unknown]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be a race with destruction. If you go back to the old code and run it (on a machine with enough cores) it reproduces fairly easily. I think it's something like:
Notify gets called, locks, and reaches triggered_ = true;
The original thread enters Wait, and immediately exits and before the first thread has unlocked the mutex, it deletes the LimitedWaitable.
Thus we get a crash on "mutex deleted when locked".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, ok. I guess could just put a lock in the destructor instead.
-[Unknown]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, could be a possibility. Feel free to make a PR :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, 12 threads must not be enough to reproduce I guess (at least with unittest threadmanager), but I think there's additional safety to be had in the destruct case, so I'll open.
-[Unknown]
Trying to investigate the problem from #15431 by stressing the thread manager and thread primitives.
Turns out there was both misuse of cond.wait (the callback should return whether to stop waiting, not whether to continue to do so), and additionally there was a subtle race condition between Notify and Wait which I fixed by locking the mutex.
The old "Event" structure was also affected by these bugs.