Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport: Only start SQL thread temporarily to WaitForPosition if needed #10123

Merged
merged 2 commits into from
Apr 24, 2022

Conversation

mattlord
Copy link
Contributor

@mattlord mattlord commented Apr 21, 2022

⚠️ This does not require/warrant another 13.0 patch release but rather should be included in the next patch release whenever it's made ⚠️

Description

After #9512 we always attempted to start the replication SQL_Thread(s) when waiting for a given position. The problem with this, however, is that if the SQL_Thread is running but the IO_Thread is not then the tablet repair does not try and start replication on a replica tablet. So in certain states such as when initializing a shard, replication may end up in a non-healthy state and never be repaired.

This changes the behavior so that:

  1. We only attempt to start the SQL_Thread(s) if it's not already running
  2. If we explicitly start the SQL_Thread(s) then we also explicitly reset it to what it was (stopped) as we exit the call

Because the caller should be/have a TabletManager which has a mutex, this should ensure that the replication manager calls are serialized and because we are resetting the replication state after mutating it, everything should work as it did before #9512 with the exception being that when waiting we ensure that the replica at least has the possibility of catching up.

Related Issue(s)

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests are not required
  • Documentation is not required

…sio#10104)

After vitessio#9512 we always attempted to start the replication SQL_Thread(s) when waiting for a given position. The problem with this, however, is that if the SQL_Thread is running but the IO_Thread is not then the tablet repair does not try and start replication on a replica tablet. So in certain states such as when initializing a shard, replication may end up in a non-healthy state and never be repaired.

This changes the behavior so that:
  1. We only attempt to start the SQL_Thread(s) if it's not already running
  2. If we explicitly start the SQL_Thread(s) then we also explicitly reset it to what it was (stopped) as we exit the call

Because the caller should be/have a TabletManager which has a mutex, this should ensure that the replication manager calls are serialized and because we are resetting the replication state after mutating it, everything should work as it did before vitessio#9512 with the exception being that when waiting we ensure that the replica at least has the possibility of catching up.

Signed-off-by: Matt Lord <mattalord@gmail.com>
Copy link
Member

@GuptaManan100 GuptaManan100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything else looks good to me

go/vt/mysqlctl/replication.go Outdated Show resolved Hide resolved
As release-13.0 does not have this:
  vitessio#9853

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord merged commit d685b18 into vitessio:release-13.0 Apr 24, 2022
@mattlord mattlord deleted the backport10104_v13 branch April 24, 2022 00:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants