Backport: Only start SQL thread temporarily to WaitForPosition if needed #10123

mattlord · 2022-04-21T15:11:57Z

⚠️ This does not require/warrant another 13.0 patch release but rather should be included in the next patch release whenever it's made ⚠️

Description

After #9512 we always attempted to start the replication SQL_Thread(s) when waiting for a given position. The problem is that if the SQL_Thread is running but the IO_Thread is not, the tablet repair does not try to start replication on a replica tablet. As a result, in certain states, such as when initializing a shard, replication may end up in an unhealthy state and never be repaired.

This changes the behavior so that:

1. We only attempt to start the SQL_Thread(s) if they are not already running.
2. If we explicitly start the SQL_Thread(s), we also explicitly reset them to their prior state (stopped) as we exit the call.

Because the caller should be (or have) a TabletManager, which holds a mutex, the replication manager calls should be serialized. And because we reset the replication state after mutating it, everything should work as it did before #9512, with one exception: when waiting, we ensure that the replica at least has the possibility of catching up.

Related Issue(s)

Follow-up to: Fix ERS to work when the primary candidate's replication is stopped #9512
Backport-of: Only start SQL thread temporarily to WaitForPosition if needed #10104

Checklist

"Backport me!" label has been added if this change should be backported
Tests are not required
Documentation is not required

…sio#10104)

After vitessio#9512 we always attempted to start the replication SQL_Thread(s) when waiting for a given position. The problem with this, however, is that if the SQL_Thread is running but the IO_Thread is not then the tablet repair does not try and start replication on a replica tablet. So in certain states such as when initializing a shard, replication may end up in a non-healthy state and never be repaired.

This changes the behavior so that:
1. We only attempt to start the SQL_Thread(s) if it's not already running
2. If we explicitly start the SQL_Thread(s) then we also explicitly reset it to what it was (stopped) as we exit the call

Because the caller should be/have a TabletManager which has a mutex, this should ensure that the replication manager calls are serialized and because we are resetting the replication state after mutating it, everything should work as it did before vitessio#9512 with the exception being that when waiting we ensure that the replica at least has the possibility of catching up.

Signed-off-by: Matt Lord <mattalord@gmail.com>

GuptaManan100

Everything else looks good to me

go/vt/mysqlctl/replication.go


mattlord added the Type: Bug, Component: Query Serving, Backport (This is a backport), and release notes labels Apr 21, 2022

mattlord requested review from deepthi, harshit-gangal and systay as code owners April 21, 2022 15:11

mattlord requested review from GuptaManan100 and removed request for systay and harshit-gangal April 21, 2022 15:12

GuptaManan100 approved these changes Apr 23, 2022


Use older replication status interface

96fd29a

As release-13.0 does not have this: vitessio#9853 Signed-off-by: Matt Lord <mattalord@gmail.com>

mattlord merged commit d685b18 into vitessio:release-13.0 Apr 24, 2022

mattlord deleted the backport10104_v13 branch April 24, 2022 00:25
