Backport: Only start SQL thread temporarily to WaitForPosition if needed #10123
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
After #9512 we always attempted to start the replication
SQL_Thread
(s) when waiting for a given position. The problem with this, however, is that if theSQL_Thread
is running but theIO_Thread
is not then the tablet repair does not try and start replication on a replica tablet. So in certain states such as when initializing a shard, replication may end up in a non-healthy state and never be repaired.This changes the behavior so that:
SQL_Thread
(s) if it's not already runningSQL_Thread(s)
then we also explicitly reset it to what it was (stopped) as we exit the callBecause the caller should be/have a TabletManager which has a mutex, this should ensure that the replication manager calls are serialized and because we are resetting the replication state after mutating it, everything should work as it did before #9512 with the exception being that when waiting we ensure that the replica at least has the possibility of catching up.
Related Issue(s)
Checklist