Bug Report: race condition with PlannedReparentShard can leave a shard with no PRIMARY #9819
Comments
@darenseagrave can you give us the specific Vitess version / commit SHA you ran into this with?
Hi @deepthi, we've hit the bug in Vitess 10, but I've confirmed this is also an issue on main. The log snippet is from our branch, but the code snippets I referenced are from main.
How low is your ticks interval set to? /assign @vitessio/cluster-management
This happened to us over the weekend. I don't have logs to prove that it was this specifically, but after we had some flapping on a specific shard (#8909 has resurfaced for us), we ended up with no primaries.
Checked the code and the order of operations for
So any replication manager tick that runs after the first step and before the fourth step is going to cause this issue. A possible fix is to stop the replication manager much sooner in the
I think the very first thing we should do in PromoteReplica is to stop the replication manager. We may need to add a deferred call to restart it if we fail to promote and return an error.
Overview of the Issue
A race condition between the reparenting code and the replManager can cause a newly promoted PRIMARY to change tablet type to REPLICA and reconnect to the old PRIMARY, causing chained replication as well as no PRIMARY in the shard.
This is caused by repairReplication being called, which runs:
In the following code:
We change to PRIMARY and at the very last step call:
At this point it calls:
Which then:
If, however, you run a very low ticks interval, you risk triggering this logic during the PlannedReparentShard call.
Either replManager needs to be stopped pre-reparent on the newly elected PRIMARY, told sooner (via ts.tm.replManager.SetTabletType), or locking needs to be used to prevent replManager doing anything until ChangeTabletType can call its SetTabletType function.
Reproduction Steps
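The race can be illustrated outside Vitess with a small deterministic simulation. This is not the reporter's reproduction and it does not use Vitess code: the `tablet` type, its fields, and the `tick`, `promote`, and `run` functions are all hypothetical. It assumes the window described above, where the tablet record already says PRIMARY but the replication manager has not yet been told, and it shows how holding a lock across that window makes the tick skip its repair. It needs Go 1.18 or later for `sync.Mutex.TryLock`.

```go
package main

import (
	"fmt"
	"sync"
)

// tablet is a hypothetical model of the state involved in the race.
type tablet struct {
	mu              sync.Mutex // held for the whole promotion when locking is enabled
	tabletType      string     // type recorded for the tablet
	rmType          string     // type the replication manager still believes
	replicatingFrom string     // current replication source, "" if none
}

// tick mimics one replication manager tick: if the manager still thinks the
// tablet is a REPLICA and replication is stopped, it "repairs" replication.
func (t *tablet) tick(oldPrimary string) {
	if !t.mu.TryLock() {
		return // a promotion is in progress, skip this tick
	}
	defer t.mu.Unlock()
	if t.rmType == "REPLICA" && t.replicatingFrom == "" {
		t.tabletType = "REPLICA"
		t.replicatingFrom = oldPrimary
	}
}

// promote mimics the promotion; betweenSteps marks the window in which a tick
// may fire before the replication manager learns the new type.
func (t *tablet) promote(useLock bool, betweenSteps func()) {
	if useLock {
		t.mu.Lock()
		defer t.mu.Unlock()
	}
	t.replicatingFrom = ""   // stop replicating from the old PRIMARY
	t.tabletType = "PRIMARY" // the tablet record now says PRIMARY
	betweenSteps()           // a tick with a very low interval lands here
	t.rmType = "PRIMARY"     // last step: the manager is finally told
}

// run performs one promotion with a tick injected into the window and returns
// the tablet type the shard ends up with.
func run(useLock bool) string {
	t := &tablet{tabletType: "REPLICA", rmType: "REPLICA", replicatingFrom: "old-primary"}
	t.promote(useLock, func() { t.tick("old-primary") })
	return t.tabletType
}

func main() {
	fmt.Println("without lock:", run(false)) // without lock: REPLICA
	fmt.Println("with lock:   ", run(true))  // with lock:    PRIMARY
}
```

Skipping the tick with `TryLock` is one design choice among several; a blocking `Lock` in the tick would also work if ticks run in their own goroutine, at the cost of delaying the tick until the promotion finishes.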
Binary Version
Operating System and Environment details
Log Fragments