Handle SQL thread crash in vt/vttablet/tabletserver/repltracker #7157

Closed
6 changes: 5 additions & 1 deletion go/mysql/flavor.go
@@ -299,8 +299,12 @@ func parseReplicationStatus(fields map[string]string) ReplicationStatus {
status.MasterPort = int(parseInt)
parseInt, _ = strconv.ParseInt(fields["Connect_Retry"], 10, 0)
status.MasterConnectRetry = int(parseInt)
-parseUint, _ := strconv.ParseUint(fields["Seconds_Behind_Master"], 10, 0)
+parseUint, _ := strconv.ParseUint(fields["Last_SQL_Errno"], 10, 0)
+status.LastSQLErrno = uint(parseUint)
+parseUint, _ = strconv.ParseUint(fields["Seconds_Behind_Master"], 10, 0)
status.SecondsBehindMaster = uint(parseUint)
+parseUint, _ = strconv.ParseUint(fields["Skip_Counter"], 10, 0)
+status.SkipCounter = uint(parseUint)
parseUint, _ = strconv.ParseUint(fields["Master_Server_Id"], 10, 0)
status.MasterServerID = uint(parseUint)
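
The parsing pattern in this hunk discards the error from `strconv.ParseUint`, so a field that is missing or non-numeric silently becomes 0. A minimal sketch of that behaviour follows; `parseUintField` and the sample row are hypothetical and not part of the PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseUintField is a hypothetical helper (not in the PR). It mirrors the
// diff's pattern of ignoring the strconv error: on a syntax error,
// ParseUint returns 0, so unparsable input collapses to 0.
func parseUintField(fields map[string]string, key string) uint {
	v, _ := strconv.ParseUint(fields[key], 10, 0)
	return uint(v)
}

func main() {
	// Made-up sample row, shaped like SHOW SLAVE STATUS output.
	fields := map[string]string{
		"Last_SQL_Errno":        "1032",
		"Skip_Counter":          "0",
		"Seconds_Behind_Master": "NULL",
	}
	fmt.Println(parseUintField(fields, "Last_SQL_Errno"))        // numeric field
	fmt.Println(parseUintField(fields, "Seconds_Behind_Master")) // non-numeric field
	fmt.Println(parseUintField(fields, "Master_Server_Id"))      // missing field
}
```

One consequence of this design is that a zero `LastSQLErrno` or `SkipCounter` cannot be told apart from a field that was absent from the row.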

12 changes: 12 additions & 0 deletions go/mysql/replication_status.go
@@ -36,7 +36,9 @@ type ReplicationStatus struct {
MasterServerID uint
IOThreadRunning bool
SQLThreadRunning bool
LastSQLErrno uint
SecondsBehindMaster uint
SkipCounter uint
MasterHost string
MasterPort int
MasterConnectRetry int
@@ -49,6 +51,14 @@ func (s *ReplicationStatus) ReplicationRunning() bool {
return s.IOThreadRunning && s.SQLThreadRunning
}

// HasReplicationSQLThreadError returns true if the replication SQL thread stopped on an
// error (i.e. Slave_SQL_Running: No and Last_SQL_Errno > 0) and sql_slave_skip_counter is
// disabled (0). This suggests the replication SQL thread stopped on an error that can't
// be recovered/skipped automatically. A node in this state may have inconsistent data.
Review comment (Contributor):

The skip counter is human-controlled AFAIK. MySQL will not change the value of the skip counter. I'm just pointing this out as I'm not sure whether the logic is expected to respond to manual human changes. Either way, the logic is good.

Reply (Contributor Author) @timvaillancourt, Jan 19, 2022:

@shlomi-noach correct, sql_slave_skip_counter can be overridden on the fly using a SET operation, although it defaults to 0 👍

If .Status (from go/vt/vttablet/tabletserver/repltracker/poller.go) is evaluated periodically, a manual change to the sql_slave_skip_counter value shouldn't cause unexpected behaviour for more than one polling "cycle".

If, for example, the SQL thread has crashed and a user changes the value from 0 -> 1, the tablet would become "healthy" again. Today, a tablet with a crashed SQL thread continues to report as "healthy" here until replication has lagged significantly.

func (s *ReplicationStatus) HasReplicationSQLThreadError() bool {
return !s.SQLThreadRunning && s.LastSQLErrno > 0 && s.SkipCounter == 0
}

// ReplicationStatusToProto translates a Status to proto3.
func ReplicationStatusToProto(s ReplicationStatus) *replicationdatapb.Status {
return &replicationdatapb.Status{
@@ -59,6 +69,8 @@ func ReplicationStatusToProto(s ReplicationStatus) *replicationdatapb.Status {
MasterServerId: uint32(s.MasterServerID),
IoThreadRunning: s.IOThreadRunning,
SqlThreadRunning: s.SQLThreadRunning,
LastSqlErrno: uint32(s.LastSQLErrno),
SkipCounter: uint32(s.SkipCounter),
SecondsBehindMaster: uint32(s.SecondsBehindMaster),
MasterHost: s.MasterHost,
MasterPort: int32(s.MasterPort),
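
To make the skip-counter scenario from the review thread concrete, here is a small sketch of the predicate evaluated over a few states. `replStatus` is a trimmed, hypothetical stand-in for `ReplicationStatus` and carries only the three fields the check reads; error number 1032 is reused from the PR's test.

```go
package main

import "fmt"

// replStatus is a hypothetical, trimmed stand-in for ReplicationStatus.
type replStatus struct {
	SQLThreadRunning bool
	LastSQLErrno     uint
	SkipCounter      uint
}

// hasSQLThreadError uses the same condition as HasReplicationSQLThreadError in the diff.
func (s replStatus) hasSQLThreadError() bool {
	return !s.SQLThreadRunning && s.LastSQLErrno > 0 && s.SkipCounter == 0
}

func main() {
	cases := []struct {
		name string
		in   replStatus
		want bool
	}{
		{"stopped on error, skip counter 0", replStatus{LastSQLErrno: 1032}, true},
		{"stopped on error, skip counter 1", replStatus{LastSQLErrno: 1032, SkipCounter: 1}, false},
		{"stopped without an error", replStatus{}, false},
		{"SQL thread running", replStatus{SQLThreadRunning: true}, false},
	}
	for _, c := range cases {
		fmt.Printf("%-34s got=%v want=%v\n", c.name, c.in.hasSQLThreadError(), c.want)
	}
}
```

The second case is the one raised in the thread: once an operator sets the skip counter to a non-zero value, the predicate stops reporting an error even though the SQL thread is still stopped.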
11 changes: 11 additions & 0 deletions go/mysql/replication_status_test.go
@@ -42,6 +42,17 @@ func TestStatusIOThreadNotRunning(t *testing.T) {
}
}

func TestStatusHasReplicationSQLThreadError(t *testing.T) {
input := &ReplicationStatus{
LastSQLErrno: 1032,
SQLThreadRunning: false,
}
want := true
if got := input.HasReplicationSQLThreadError(); got != want {
t.Errorf("%#v.HasReplicationSQLThreadError() = %v, want %v", input, got, want)
}
}

func TestStatusSQLThreadNotRunning(t *testing.T) {
input := &ReplicationStatus{
IOThreadRunning: true,
79 changes: 49 additions & 30 deletions go/vt/proto/replicationdata/replicationdata.pb.go

Some generated files are not rendered by default.

3 changes: 3 additions & 0 deletions go/vt/vttablet/tabletserver/repltracker/poller.go
@@ -46,6 +46,9 @@ func (p *poller) Status() (time.Duration, error) {
return 0, err
}

if status.HasReplicationSQLThreadError() {
return 0, vterrors.Errorf(vtrpcpb.Code_UNAVAILABLE, "replication sql thread error")
}
if !status.ReplicationRunning() {
if p.timeRecorded.IsZero() {
return 0, vterrors.Errorf(vtrpcpb.Code_UNAVAILABLE, "replication is not running")
2 changes: 2 additions & 0 deletions proto/replicationdata.proto
@@ -37,6 +37,8 @@ message Status {
string file_relay_log_position = 10;
uint32 master_server_id = 11;
string master_uuid = 12;
uint32 last_sql_errno = 13;
uint32 skip_counter = 14;
}

// StopReplicationStatus represents the replication status before calling StopReplication, and the replication status collected immediately after