-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle SQL thread crash in vt/vttablet/tabletserver/repltracker #7157
Handle SQL thread crash in vt/vttablet/tabletserver/repltracker #7157
Conversation
Signed-off-by: Tim Vaillancourt <timvaillancourt@github.com>
Signed-off-by: Tim Vaillancourt <timvaillancourt@github.com>
e85d5b7
to
7900bdf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding this to one of the end-to-end tests would be helpful.
// HasReplicationSQLThreadError returns true if the replication SQL thread stopped on an | ||
// error (ie: Slave_SQL_Running: no + Last_SQL_Errno > 0) and sql_slave_skip_counter is | ||
// disabled (0). This suggests the replication SQL thread stopped on an error that can't | ||
// be recovered/skipped automatically. A node in this state may have inconsistent data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The skip counter is human controlled AFAIK. MySQL will not change the value of the skip counter. I'm just pointing this our as I'm not sure if the logic is expected to respond to manual human changes? Either way the logic is good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shlomi-noach correct, sql_slave_skip_counter
could be overridden on the fly using a SET
operation although it defaults to 0
👍
If .Status
(from go/vt/vttablet/tabletserver/repltracker/poller.go
) is evaluated periodically a manual change to the sql_slave_skip_counter
value shouldn't cause unexpected behaviour for more than one polling "cycle"
If, for example, the SQL thread has crashed and a user changed the value from 0
-> 1
the tablet would become "healthy" again. Today a tablet with a crashed SQL thread will continue to report as "healthy" here until replication has lagged significantly
bump Hi @timvaillancourt! Are you still interested in getting this merged? Going over old PRs to check which are still active. |
Hi @timvaillancourt! I'm going to close this for now as I believe the underlying problems have since been address in #9308 and #9853. I apologize — I hadn't seen this PR before or I would have raised the issues with you earlier. Please let me know if you think I'm missing or misunderstanding anything though and we we can re-open this issue and go from there. Thank you! |
Signed-off-by: Tim Vaillancourt timvaillancourt@github.com
Replaces PR #7156
Backport
NO
Status
DRAFT
Description
This PR causes
vt/vttablet/tabletserver/repltracker/poller.go
to return anreplication sql thread error
error if a replica's SQL thread crashed on an unrecoverable error. This is to address delays in the system noticing a replica is unhealthy (and potentially inconsistent) and to aidgithub.com/github/freno
in understanding if replication is broken vs lagging via thevtctld
APIThe criteria for the error is:
sql_slave_skip_counter=0
, ie: SQL thread errors are not skippedThe drawback to this change is
Last_SQL_Errno
is not reset to zero unless aRESET MASTER
orRESET SLAVE
is ran following an SQL thread crashAlthough somewhat unlikely, a user that manually resolves the SQL error and restarts the SQL thread without running
RESET [MASTER|SLAVE]
will continue to have aLast_SQL_Errno
that is greater-than zero. In this situation this new error could be returned if the SQL thread were to be stopped. As MySQL doesn't have a "current" SQL error number I don't have a good solution for this at the momentBecause restoring from a good backup is usually the norm in this situation (in my experience) I wonder if this is an acceptable limitation or more logic is required 🤔
cc @sougou / @shlomi-noach / @deepthi / @tomkrouper / @drogart for thoughts on the above
Related Issue(s)
List related PRs against other branches:
Todos
Deployment Notes
Notes regarding deployment of the contained body of work. These should note any
db migrations, etc.
Impacted Areas in Vitess
List general components of the application that this PR will affect: