
Threading manager stresstest and fixes #15470

Merged: hrydgard merged 5 commits into master from threading-manager-stresstest on Apr 9, 2022
Conversation

hrydgard (Owner) commented on Apr 8, 2022:

Trying to investigate the problem from #15431 by stressing the thread manager and thread primitives.

Turns out there was both a misuse of cond.wait (the callback should return whether to stop waiting, not whether to continue waiting), and a subtle race condition between Notify and Wait, which I fixed by locking the mutex.

The old "Event" structure was also affected by these bugs.
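
For context, a minimal sketch of the corrected pattern described above (illustrative only, not the PR's exact code; the member names mutex_, cond_, triggered_ match the snippets quoted below):

    #include <condition_variable>
    #include <mutex>

    class LimitedWaitable {  // sketch; the real class lives in the PPSSPP sources
    public:
        void Notify() {
            // Taking the lock here, not just flipping the flag, is what
            // closes the Notify/Wait race mentioned above.
            std::unique_lock<std::mutex> lock(mutex_);
            triggered_ = true;
            cond_.notify_all();
        }

        void Wait() {
            std::unique_lock<std::mutex> lock(mutex_);
            // The predicate returns true when waiting should STOP, i.e. once
            // triggered_ is set. Inverting it was the cond.wait misuse.
            cond_.wait(lock, [&] { return triggered_; });
        }

    private:
        std::condition_variable cond_;
        std::mutex mutex_;
        bool triggered_ = false;
    };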

hrydgard added this to the v1.13.0 milestone on Apr 8, 2022
hrydgard marked this pull request as ready for review on Apr 8, 2022, 10:35
hrydgard changed the title from "Threading manager stresstest" to "Threading manager stresstest and fixes" on Apr 8, 2022
if (us == 0)
    return false;
std::unique_lock<std::mutex> lock(mutex_);
cond_.wait_for(lock, std::chrono::microseconds(us), [&] { return !triggered_; });
Collaborator:
Oops, I can't believe I didn't notice that. I'm sure I copied it from Event and changed the line, but even so.

-[Unknown]
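
Presumably the landed fix flips the predicate's polarity, per the description above (sketch):

    // Stop waiting once triggered_ becomes true (was: return !triggered_;).
    cond_.wait_for(lock, std::chrono::microseconds(us), [&] { return triggered_; });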

Comment on lines +15 to +16
std::unique_lock<std::mutex> lock(mutex_);
if (!triggered_) {
Collaborator:

Hm, just wondering, what's the race here?

Considering the old code:

  • If triggered is true, it bails without taking the lock. It can't be un-triggered, so this is safe.
  • If triggered is false and stays false, then we lock, check triggered again (within the cond predicate), and then wait. When the trigger happens, we wake.
  • If triggered is false and changes immediately before the lock, we still lock and check triggered again in the predicate. This happens before waiting, within the lock, so triggered is now true and we don't wait.
  • If triggered is false but doesn't change until after the lock, then Notify() will block until we enter the wait, and then the notify_all() will wake us and recheck the predicate.

As long as Notify locks, in theory we could wrap all these functions in if (!triggered_) { ... } as they are now, without a lock, because the flag gets rechecked under the lock and can only become true once. Where is the race condition I'm missing?

-[Unknown]
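
A sketch of the double-checked shape described in this comment, assuming triggered_ only ever goes from false to true and Notify() takes the lock (a strictly data-race-free version would make triggered_ a std::atomic<bool>):

    void Wait() {
        if (!triggered_) {  // unlocked fast path
            std::unique_lock<std::mutex> lock(mutex_);
            // Rechecked under the lock: if Notify() ran in between, the
            // predicate is already true and we never block.
            cond_.wait(lock, [&] { return triggered_; });
        }
    }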

Owner Author:

There seems to be a race with destruction. If you go back to the old code and run it (on a machine with enough cores) it reproduces fairly easily. I think it's something like:

  • Notify gets called, locks, and reaches triggered_ = true;
  • The original thread enters Wait, exits immediately (triggered_ is already true), and deletes the LimitedWaitable before the first thread has unlocked the mutex.
  • Thus we get a crash on "mutex deleted when locked".
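
To make the interleaving concrete, a reconstruction of the old fast path and the racy schedule (a sketch, not the exact old code):

    void Wait() {
        if (triggered_)  // unlocked check: can return while another thread is
            return;      // still inside Notify(), holding mutex_
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [&] { return triggered_; });
    }

    // Racy schedule:
    //   thread A: Notify()  locks mutex_, sets triggered_ = true, notify_all()
    //   thread B: Wait()    sees triggered_ == true, returns without locking
    //   thread B: delete waitable;  // destroys mutex_ while A still holds it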

Collaborator:

Hm, ok. I guess we could just put a lock in the destructor instead.

-[Unknown]
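
The suggestion as code (a sketch; it helps only if nothing touches the object after the destructor's unlock):

    ~LimitedWaitable() {
        // Briefly take the lock so any Notify() still inside its critical
        // section finishes before the mutex is destroyed.
        std::unique_lock<std::mutex> lock(mutex_);
    }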

Owner Author:

Yeah, could be a possibility. Feel free to make a PR :)

Collaborator:

Well, 12 threads must not be enough to reproduce it, I guess (at least with the threadmanager unittest), but I think there's additional safety to be had in the destruction case, so I'll open a PR.

-[Unknown]

hrydgard merged commit 5b58b69 into master on Apr 9, 2022
hrydgard deleted the threading-manager-stresstest branch on Apr 9, 2022, 07:21