messager: do not lock when testing for openness #9134
Conversation
Thanks for looking into this! We don't have our own fork/build system, so I pushed a copy of your branch into this repo and triggered a build for it. Once that's done, I'll roll it out later today to see if it solves the deadlock.
I ended up cherry-picking on top of the v12 release branch and deploying that: https://github.com/vitessio/vitess/tree/release-12.0-messager-deadlock. It has been running for 20-30 minutes now, and I've performed 32 reparents so far without failure, which is far more than I've been able to do since upgrading past v6. I'll let it soak for a little while, but it's looking promising.
After some more reparenting, I was able to trigger a similar but new failure mode. It appears to have still deadlocked. Full logs and goroutine dumps are attached. Here's the most relevant part of the logs. It's hard to tell whether the timing of the snapshot table lock caused the deadlock or whether it just happened to be the next thing to run after the deadlock had already occurred.
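(Aside for readers: goroutine dumps like the ones mentioned above can be captured from any Go process via the standard runtime/pprof package; this is a generic Go sketch, not Vitess-specific tooling.)

```go
package main

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes the stacks of all live goroutines to stderr.
// A debug level of 2 selects the full, panic-style stack format, which
// is the most useful view when hunting a stuck mutex.
func dumpGoroutines() {
	pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}
```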
There's still a pile-up. In any event, this is still behaving much better than before, and it feels like we're on the right track.
I also see that the rogue goroutines in question appear to be using …
After a day, things seem to have devolved back to roughly where they were before, with similar deadlock rates, though all of them are the new failure mode. I still believe this is heading in the right direction.
Back from my holiday. I'm looking at this again today and trying to come up with a more comprehensive fix.
vmg force-pushed the branch from 8ee4b0d to 26a7cf0
@derekperkins OK, I've just pushed an alternative and more comprehensive fix that removes the cyclic inter-dependency between the two objects. Can you give it a test run?
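(For readers following the fix: a minimal sketch of what removing a cyclic inter-dependency between two objects can look like in Go. The type and function names below are hypothetical illustrations, not the actual diff.)

```go
// Hypothetical "before": Engine held a *TabletServer and TabletServer
// held a *Engine, each calling into the other while holding its own
// lock, which is a classic recipe for a lock-ordering deadlock.
//
// Hypothetical "after": the dependency runs one way only. TabletServer
// receives a small function value at startup and never calls back into
// Engine's lock-guarded methods.
package messager

type QueryGenerator func(table string) (string, error)

type TabletServer struct {
	genQuery QueryGenerator // injected; no pointer back to Engine
}

// InitQueryGenerator installs the callback once during single-threaded
// startup, so the field itself needs no locking.
func (ts *TabletServer) InitQueryGenerator(g QueryGenerator) {
	ts.genQuery = g
}
```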
Yep, I'll roll it out today. Thanks again for looking into it!
It's been running for almost 24 hours now with no issues in message processing and no deadlocks, including 16 reparent operations under peak load, so this appears very promising. The code changes are pretty simple and look great.
🎉 @deepthi I think we have a fix for @derekperkins' issue. Care to review?
Very clever. LGTM
@derekperkins should we backport to release-12.0?
Yeah. I already had a branch targeting v12, which is what I'm running in prod, so I just opened a backport PR: #9210
Description
When the messager Engine is shutting down and has been closed, other
goroutines may still be attempting to access it to generate queries.
Because the Engine takes its own internal mutex while closing itself
down, these goroutines can deadlock it by calling into any of the
Engine's methods.
To prevent this deadlock, we now use an atomic openness test that fails
any engine operation fast as soon as the engine has started shutting
down, without touching the mutex.
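The pattern is roughly the following; this is a minimal sketch with hypothetical names (and Go 1.19's atomic.Bool), not the PR diff itself:

```go
package messager

import (
	"errors"
	"sync"
	"sync/atomic"
)

var errEngineClosed = errors.New("messager engine is closed")

// Engine is a hypothetical stand-in for the real messager engine; the
// names here are assumptions for illustration, not the actual code.
type Engine struct {
	mu     sync.Mutex
	isOpen atomic.Bool // Go 1.19+; older code would use an atomic helper type
}

// Close flips the openness flag before taking the mutex, so goroutines
// arriving after shutdown has begun fail fast instead of queueing on mu
// while Close holds it.
func (e *Engine) Close() {
	e.isOpen.Store(false)
	e.mu.Lock()
	defer e.mu.Unlock()
	// ... tear down internal state ...
}

// GenerateQuery tests openness atomically, without acquiring the mutex,
// so a caller racing with Close gets a clean error rather than a deadlock.
func (e *Engine) GenerateQuery() (string, error) {
	if !e.isOpen.Load() {
		return "", errEngineClosed
	}
	e.mu.Lock()
	defer e.mu.Unlock()
	// ... build the query under the lock ...
	return "SELECT ...", nil
}
```

The key ordering property is that Close stores the flag before acquiring the mutex, so the atomic load in GenerateQuery never blocks behind a closing engine.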
Signed-off-by: Vicent Marti vmg@strn.cat
This is a potential fix for #8909, which has proven very hard to reproduce.
cc @derekperkins -- could you please try this in your system and see if the deadlocking stops?
cc @aquarapid @deepthi
Related Issue(s)
#8909
Checklist
Deployment Notes