vttablet: apparent deadlock when running PlannedReparentShard #8909
When this happens, we have an outage. The resolution is:
While this results in an outage of a few minutes (longer for keyspaces with more shards), at least there is no data loss or errant GTIDs.
My initial theory is that it's caused by those table locks. We could look at drilling down there, or we could use a different approach: we don't really need to lock tables any more. Once you start a snapshotted transaction, the data in the table won't change within that transaction, so there's no need to lock the table. Option 2 would be to shut down all streams before demoting, but I think that's not necessary (if we solve the problem by avoiding locks).
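A minimal sketch of the snapshot idea above, assuming a plain MySQL connection via go-sql-driver (this is not the actual Vitess vstreamer code; the DSN and table name are placeholders):

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	ctx := context.Background()
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/mydb")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Pin a single connection so the explicit transaction statements all run
	// on the same MySQL session.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// The snapshot is established here; no LOCK TABLES needed.
	if _, err := conn.ExecContext(ctx, "START TRANSACTION WITH CONSISTENT SNAPSHOT"); err != nil {
		log.Fatal(err)
	}

	// Reads inside the transaction see the data as of the snapshot, even if
	// other sessions keep writing to the table.
	var count int
	if err := conn.QueryRowContext(ctx, "SELECT COUNT(*) FROM my_table").Scan(&count); err != nil {
		log.Fatal(err)
	}
	fmt.Println("rows at snapshot:", count)

	if _, err := conn.ExecContext(ctx, "COMMIT"); err != nil {
		log.Fatal(err)
	}
}
```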
Here's a full debugging dump from a tablet that got stuck.
There seems to be an awful lot of goroutines running here. How many vreplication flows were actually running at the time of this snapshot?
Only 2-3 vreplication workflows running, but there are 27 message tables open, each of which starts a VStream connection.
OK, I think I have tracked this one down. It seems the situation for the vttablet is like this:
So, it seems we have a deadlock.
Thanks for tracking this down @aquarapid! That matches the behavior I see, where it continues to loop trying to purge, always failing because it's in read-only.
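To illustrate the looping behavior described above, here is a minimal Go sketch under the assumption that a purge routine simply retries on error; this is not the actual TableGC/messager code. The point is that if nothing in the loop reacts to demotion or shutdown, a read-only tablet will retry forever:

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// purgeLoop retries a purge attempt on every tick. Without the ctx.Done()
// case, a tablet demoted to read-only would keep failing and retrying
// forever, consistent with the looping behavior described above.
func purgeLoop(ctx context.Context, purge func() error) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := purge(); err != nil {
				log.Printf("purge failed, will retry: %v", err)
			}
		}
	}
}

func main() {
	// Simulate a demotion/shutdown deadline of a few seconds.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	// Simulate a server stuck in read-only mode: every purge attempt fails.
	purgeLoop(ctx, func() error { return errors.New("server is read-only") })
}
```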
I upgraded us to v12 today, and unsurprisingly, this minor PR (#8965) did not fix the problem.
Ostensibly fixed. :) |
I think this is still happening, just less regularly, both on PlannedReparentShard and on container shutdown. I'll try upgrading to v13 to see if I still see it.
Here's a new gist with updated goroutine/block/mutex pprof profile dumps.
Looking at this again. 🥲 ✨
@derekperkins: is the block profile gist complete? These are the only possible culprits for a deadlock:
Everything else looks like noise.
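For reference on how such a profile gets populated, a sketch of enabling and capturing Go block/mutex profiles with the standard runtime and net/http/pprof packages (vttablet's own debug endpoints may be wired differently). Note that these profiles only record contention that occurs after the rate/fraction is set, so a short capture window can look incomplete:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"runtime"
)

func main() {
	// Record every blocking event (rate 1 means no sampling); contention
	// that happened before this call is not in the profile.
	runtime.SetBlockProfileRate(1)
	// Optionally also record mutex contention.
	runtime.SetMutexProfileFraction(1)

	// Capture with, for example:
	//   go tool pprof http://localhost:6060/debug/pprof/block
	//   go tool pprof http://localhost:6060/debug/pprof/mutex
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```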
FWIW this appears to be a bug unrelated to what we were seeing before, as there are no
Thanks for looking into this again, I really appreciate it.
I'm not 100% sure how the block profile gets populated. I enabled the blocking profile and then grabbed the dump, but I didn't wait very long since it was an outage, maybe 30 seconds tops, so if it takes longer than that to fully populate, it may have been a partial.

I have a sneaking suspicion that it was related to the lack of an index, causing the vstream to block somewhere. Over the weekend we had a 10x spike in our workload, causing all of our queues to back up into the millions of rows. A few weeks ago, the messager just stopped running entirely when a queue got too long, and after a lot of debugging, I ran an EXPLAIN on the vstream query. I expected it to use an index, but it was doing a full table scan.

Fast forward a few weeks: I hadn't gone through and added that same index to all our other message tables, and I had queues not processing that were otherwise behaving just like the original deadlocking issue here, where > 50% of shards would fail to reparent, stopping at the same place in the logs as before. After adding that index to all the tables, the "deadlocks" / "blocking" stopped happening, and now I can't seem to reproduce it.

(Query setup and vstream execution snippets omitted.)
As I'm typing this, I'm remembering that at least some of the time while the large queues were blocked, they would still process streaming/ongoing messages. So maybe polling for backed-up messages wasn't working, but messages would still come through the live stream... I'm not super clear on exactly how the vstream mechanism works, so I'm not sure my mental model is correct.
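A hypothetical illustration of the missing-index situation described above; the table name, column names, and query shape are placeholders, not the actual Vitess message-table schema or the exact query the messager/vstream issues. The idea is that EXPLAIN on the polling-style query shows a full table scan until a matching index is added:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/mydb")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Hypothetical polling-style query; the real predicate and the type of
	// time_next depend on the actual schema. With millions of unacked rows
	// and no matching index, the plan shows a full table scan, which can be
	// slow enough to look like a hang.
	var plan string
	if err := db.QueryRow(
		`EXPLAIN FORMAT=JSON
		 SELECT id FROM my_message_table
		 WHERE time_acked IS NULL AND time_next <= ?
		 ORDER BY priority, time_next LIMIT 100`,
		1234567890,
	).Scan(&plan); err != nil {
		log.Fatal(err)
	}
	fmt.Println(plan) // look for a full table scan vs. an index lookup

	// Hypothetical fix: add an index that matches the poll predicate.
	if _, err := db.Exec(
		"ALTER TABLE my_message_table ADD INDEX poll_idx (time_acked, priority, time_next)"); err != nil {
		log.Fatal(err)
	}
}
```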
I'm wondering if what was actually happening was the same as #9827. The query timed out because of the full table scan, then deadlocked. In your list of possible culprits, 2 of them are inside that
Ah yes, @rohit-nayak-ps's #9833 could indeed be a fix for this. Worth trying in production before we dig further.
I deployed yesterday from this branch, which has Rohit's commit from #9833, but today still saw it deadlock and stop at the same log entry on 10 of 16 shards in one keyspace.
I dumped the debugging profiles again, and this time let it sit longer, in case I missed any blocking profiles last time. Trying to eliminate the noise: these are very similar to the ones you called out earlier.
I'm not sure what is happening, but this has been exponentially worse since rolling to v13. I can't keep our cluster up and running. Here's a new batch of dumps: this time I took them from a tablet where messages were no longer being delivered, but before calling PlannedReparentShard, where the tablet gets stuck shutting down. I also included a full log dump.
@deepthi @rohit-nayak-ps: we need to talk about this in the weekly sync. There's a regression in v13 and it's not related to the previous bug at all. It looks like we just introduced a different issue here.
I'm re-closing this and have moved the ongoing discussion/work here: #9936. There's a lot of history and assorted info in this issue that is making it harder to follow.
Overview of the Issue
We had successfully executed thousands of PlannedReparentShard commands before upgrading to v11, but now they are very unreliable (25-100% failure rate, with 0% recovery on successive attempts) on tablets with many concurrent vstreams. I can't prove that VStream is causing it, but my guess is that's where we differ from other Vitess clusters. We also have several Materialize workflows running cross-keyspace.
I have linked to full log output (https://gist.github.com/derekperkins/299519b0bd9da3618645de908a37ca0c), but this seems to be where things are getting locked:
When I do a reparent on another keyspace that doesn't have so many streams, this is what I see immediately following "TableGC: closing". It looks like there must be some race condition with how the Messager is closing, leading to a deadlock.

I turned off 3 Materialize workflows (16 shards for 48 vreplication rows) and things seem to have improved. I do need those running and have many others that I plan on starting, so I am really hoping to get this resolved in time for v12.
Operating system and Environment details
GKE 1.20.9
vttablet: 11.0.0
vtctld: 11.0.0