Bug Report: vtorc goes from a DeadPrimary -> ClusterHasNoPrimary; promotes final replica #13284
Comments
Could you share the VTOrc and vttablet logs? I couldn't reproduce the problem locally following the steps listed.
vtorc logs: https://gist.github.com/olyazavr/a42483e8755add6a6ccee3aee153682a (omitted slack hook content for brevity)
cc @vitessio/vtorc
@olyazavr I don't have time right now to fix this issue, but if you want to take it up, I'd be happy to review the PR. I can also review any pseudocode or ideas that you want to talk about.
Overview of the Issue
When you have a primary and a replica and you kill that primary, vtorc first detects DeadPrimary. It does not intervene with ERS at that point, because it cannot find a candidate that would still satisfy the semi_sync durability policy. Shortly afterwards (likely once the primary server has terminated completely) it detects ClusterHasNoPrimary instead and performs PRS on the final replica. You are then left with one primary and no replicas, which is not a functional setup when running with semi_sync. After this, vtorc detects LockedSemiSyncPrimary and cannot get out of it.
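To make the failure case and the guard proposed in the workaround that follows concrete, here is a minimal, self-contained Go sketch. It is not the Vitess implementation: the `tablet` type and the `choosePrimaryCandidate` helper are hypothetical names invented for illustration, and picking the first valid tablet is a placeholder for whatever candidate selection the real code performs. Only the condition `len(validTablets) == 1 && !currentlyHasAPrimary` comes from this report.

```go
package main

import "fmt"

// tablet is a hypothetical, simplified stand-in for a tablet record.
type tablet struct {
	alias string
}

// choosePrimaryCandidate is a hypothetical helper, not a Vitess function.
// It returns a candidate and true when promotion should go ahead, or a
// zero value and false when the caller should not promote anything.
func choosePrimaryCandidate(validTablets []tablet, currentlyHasAPrimary bool) (tablet, bool) {
	if len(validTablets) == 0 {
		return tablet{}, false
	}
	// Guard from the workaround: promoting the only remaining tablet when
	// there is no primary would leave a primary with no replicas, which
	// cannot work under semi_sync.
	if len(validTablets) == 1 && !currentlyHasAPrimary {
		return tablet{}, false
	}
	// Placeholder selection (assumption): take the first valid tablet.
	return validTablets[0], true
}

func main() {
	lone := []tablet{{alias: "replica-1"}}
	pair := []tablet{{alias: "replica-1"}, {alias: "replica-2"}}

	_, ok := choosePrimaryCandidate(lone, false)
	fmt.Println("promote lone replica with no primary:", ok)

	c, ok := choosePrimaryCandidate(pair, false)
	fmt.Println("promote with two valid tablets:", ok, c.alias)
}
```

With this guard the lone-replica case returns no candidate, so no promotion happens and the cluster stays in the DeadPrimary state rather than moving on to LockedSemiSyncPrimary.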
One workaround I have found is to introduce a patch that blocks PRS from choosing a new primary, similar to this check, where we check
if len(validTablets) == 1 && !currentlyHasAPrimary
and if so, return nothing.
Reproduction Steps
mysql -- pkill -9 mysqld
and then shut down the server abruptly)
Binary Version
Operating System and Environment details
Log Fragments
No response