-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not mutate replication state in WaitSourcePos and ignore tablets with SQL_Thread stopped in ERS #10148
Conversation
This results in race conditions with tablet repairs that can leave replication broken and requiring manual repair using: START SLAVE. Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
…topped Signed-off-by: Manan Gupta <manan@planetscale.com>
315dbc0
to
3e128dd
Compare
6a8ade8
to
3e128dd
Compare
…cation status Signed-off-by: Manan Gupta <manan@planetscale.com>
e4030b1
to
347b17e
Compare
…ich is correct Signed-off-by: Manan Gupta <manan@planetscale.com>
@@ -141,7 +141,7 @@ func TestEmergencyReparentShardSlow(t *testing.T) { | |||
}, | |||
"zone1-0000000200": { | |||
StopStatus: &replicationdatapb.StopReplicationStatus{ | |||
Before: &replicationdatapb.Status{}, | |||
Before: &replicationdatapb.Status{IoState: int32(mysql.ReplicationStateRunning), SqlState: int32(mysql.ReplicationStateRunning)}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're changing this — so that it's correct — all over because we now use it in the tests. Before it was unused.
} | ||
|
||
replStatus := mysql.ProtoToReplicationStatus(stopStatus.Before) | ||
return replStatus.SQLState == mysql.ReplicationStateRunning, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be backwards compatible once #10167 is merged.
Before: &replicationdatapb.Status{Position: "MySQL56/3E11FA47-71CA-11E1-9E33-C80AA9429100:1-5", IoState: int32(mysql.ReplicationStateRunning), SqlState: int32(mysql.ReplicationStateRunning)}, | ||
After: &replicationdatapb.Status{Position: "MySQL56/3E11FA47-71CA-11E1-9E33-C80AA9429100:1-9"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I like the explicitness of it.
LGTM. Thanks, @GuptaManan100 ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
…ith SQL_Thread stopped in ERS (#10148) - Release 13.0 (#10175) * Do not mutate replication state in WaitSourcePos This results in race conditions with tablet repairs that can leave replication broken and requiring manual repair using: START SLAVE. Signed-off-by: Matt Lord <mattalord@gmail.com> feat: throw error for tablets that have replication stopped in ERS Signed-off-by: Manan Gupta <manan@planetscale.com> test: fix reparent tests to reflect the new change Signed-off-by: Manan Gupta <manan@planetscale.com> test: fix end to end test expectations and add a new one Signed-off-by: Manan Gupta <manan@planetscale.com> feat: mark tablets as not reachable if the sql thread is stopped Signed-off-by: Manan Gupta <manan@planetscale.com> test: augment the end to end test to also check for just sql thread stopped Signed-off-by: Manan Gupta <manan@planetscale.com> feat: change sqlthreadwasrunning to first convert from proto to replication status Signed-off-by: Manan Gupta <manan@planetscale.com> test: fix stopReplicationAndGetStatusMaps test to provide gtid set which is correct Signed-off-by: Manan Gupta <manan@planetscale.com> * test: add missing defer teardown in initialization tests Signed-off-by: Manan Gupta <manan@planetscale.com> Co-authored-by: Matt Lord <mattalord@gmail.com>
Description
We had added the code that started the
SQL_Thread
inWaitSourcePos
to handle the case where a tablet with replication stopped in ERS would have a chance to succeed as we then wait for it to apply its relay logs.But, mutating the state there in
WaitSourcePos
results in race conditions with tablet repairs that can leave replication broken and requiring manual repair using usingSTART SLAVE
.So, we have decided to revert that mutation change and instead in ERS we treat tablets that have their
SQL_Thread
stopped as though they were unreachable. If this prevents ERS from running a failover entirely — because of multiple failure detections — the user will need to either explicitly ignore the tablet or manually start replication on it. This guarantees safety since the responsibility of what to do in this scenario resides with the user, and this change will also play well withvtorc
since in this case it will start the replication on the tablet eventually and call ERS again.This reverts the state mutating parts from #9512 and the follow-up #10104 and fixes ERS as described above.
It leaves the other helper functions in place as they could be useful for other purposes at some point. They are effectively unused though so we could remove them as well since it's currently dead code.
Related Issue(s)
Checklist