Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not mutate replication state in WaitSourcePos and ignore tablets with SQL_Thread stopped in ERS #10148

Merged
merged 8 commits into from
Apr 29, 2022

Conversation

mattlord
Copy link
Contributor

@mattlord mattlord commented Apr 27, 2022

Description

We had added the code that started the SQL_Thread in WaitSourcePos to handle the case where a tablet with replication stopped in ERS would have a chance to succeed as we then wait for it to apply its relay logs.

But, mutating the state there in WaitSourcePos results in race conditions with tablet repairs that can leave replication broken and requiring manual repair using using START SLAVE.

So, we have decided to revert that mutation change and instead in ERS we treat tablets that have their SQL_Thread stopped as though they were unreachable. If this prevents ERS from running a failover entirely — because of multiple failure detections — the user will need to either explicitly ignore the tablet or manually start replication on it. This guarantees safety since the responsibility of what to do in this scenario resides with the user, and this change will also play well with vtorc since in this case it will start the replication on the tablet eventually and call ERS again.

This reverts the state mutating parts from #9512 and the follow-up #10104 and fixes ERS as described above.

It leaves the other helper functions in place as they could be useful for other purposes at some point. They are effectively unused though so we could remove them as well since it's currently dead code.

Related Issue(s)

Checklist

  • "Backport me!" label has been added if this change should be backported (this should be backported to release-13.0)
  • Tests were added
  • Documentation is not required (internal)

This results in race conditions with tablet repairs that can leave
replication broken and requiring manual repair using: START SLAVE.

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
…topped

Signed-off-by: Manan Gupta <manan@planetscale.com>
…cation status

Signed-off-by: Manan Gupta <manan@planetscale.com>
…ich is correct

Signed-off-by: Manan Gupta <manan@planetscale.com>
@GuptaManan100 GuptaManan100 changed the title Do not mutate replication state in WaitSourcePos Do not mutate replication state in WaitSourcePos and ignore tablets with SQL_Thread stopped in ERS Apr 29, 2022
@mattlord mattlord marked this pull request as ready for review April 29, 2022 16:16
@@ -141,7 +141,7 @@ func TestEmergencyReparentShardSlow(t *testing.T) {
},
"zone1-0000000200": {
StopStatus: &replicationdatapb.StopReplicationStatus{
Before: &replicationdatapb.Status{},
Before: &replicationdatapb.Status{IoState: int32(mysql.ReplicationStateRunning), SqlState: int32(mysql.ReplicationStateRunning)},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're changing this — so that it's correct — all over because we now use it in the tests. Before it was unused.

}

replStatus := mysql.ProtoToReplicationStatus(stopStatus.Before)
return replStatus.SQLState == mysql.ReplicationStateRunning, nil
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be backwards compatible once #10167 is merged.

Comment on lines +299 to +300
Before: &replicationdatapb.Status{Position: "MySQL56/3E11FA47-71CA-11E1-9E33-C80AA9429100:1-5", IoState: int32(mysql.ReplicationStateRunning), SqlState: int32(mysql.ReplicationStateRunning)},
After: &replicationdatapb.Status{Position: "MySQL56/3E11FA47-71CA-11E1-9E33-C80AA9429100:1-9"},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I like the explicitness of it.

@mattlord
Copy link
Contributor Author

LGTM. Thanks, @GuptaManan100 !

Copy link
Member

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@deepthi deepthi merged commit ce883ae into vitessio:main Apr 29, 2022
@deepthi deepthi deleted the ro_state_waitsourcepos branch April 29, 2022 18:00
deepthi pushed a commit that referenced this pull request Apr 29, 2022
…ith SQL_Thread stopped in ERS (#10148) - Release 13.0 (#10175)

* Do not mutate replication state in WaitSourcePos

This results in race conditions with tablet repairs that can leave
replication broken and requiring manual repair using: START SLAVE.

Signed-off-by: Matt Lord <mattalord@gmail.com>

feat: throw error for tablets that have replication stopped in ERS

Signed-off-by: Manan Gupta <manan@planetscale.com>

test: fix reparent tests to reflect the new change

Signed-off-by: Manan Gupta <manan@planetscale.com>

test: fix end to end test expectations and add a new one

Signed-off-by: Manan Gupta <manan@planetscale.com>

feat: mark tablets as not reachable if the sql thread is stopped

Signed-off-by: Manan Gupta <manan@planetscale.com>

test: augment the end to end test to also check for just sql thread stopped

Signed-off-by: Manan Gupta <manan@planetscale.com>

feat: change sqlthreadwasrunning to first convert from proto to replication status

Signed-off-by: Manan Gupta <manan@planetscale.com>

test: fix stopReplicationAndGetStatusMaps test to provide gtid set which is correct

Signed-off-by: Manan Gupta <manan@planetscale.com>

* test: add missing defer teardown in initialization tests

Signed-off-by: Manan Gupta <manan@planetscale.com>

Co-authored-by: Matt Lord <mattalord@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants