Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle SQL thread crash in vt/vttablet/tabletserver/repltracker #7157

Conversation

timvaillancourt
Copy link
Contributor

@timvaillancourt timvaillancourt commented Dec 10, 2020

Signed-off-by: Tim Vaillancourt timvaillancourt@github.com

Replaces PR #7156

Backport

NO

Status

DRAFT

Description

This PR causes vt/vttablet/tabletserver/repltracker/poller.go to return an replication sql thread error error if a replica's SQL thread crashed on an unrecoverable error. This is to address delays in the system noticing a replica is unhealthy (and potentially inconsistent) and to aid github.com/github/freno in understanding if replication is broken vs lagging via the vtctld API

The criteria for the error is:

  1. The SQL thread is not running
  2. The SQL thread error number is > 0
  3. sql_slave_skip_counter=0, ie: SQL thread errors are not skipped

The drawback to this change is Last_SQL_Errno is not reset to zero unless a RESET MASTER or RESET SLAVE is ran following an SQL thread crash

Although somewhat unlikely, a user that manually resolves the SQL error and restarts the SQL thread without running RESET [MASTER|SLAVE] will continue to have a Last_SQL_Errno that is greater-than zero. In this situation this new error could be returned if the SQL thread were to be stopped. As MySQL doesn't have a "current" SQL error number I don't have a good solution for this at the moment

Because restoring from a good backup is usually the norm in this situation (in my experience) I wonder if this is an acceptable limitation or more logic is required 🤔

cc @sougou / @shlomi-noach / @deepthi / @tomkrouper / @drogart for thoughts on the above

Related Issue(s)

List related PRs against other branches:

Todos

  • Tests
  • Documentation

Deployment Notes

Notes regarding deployment of the contained body of work. These should note any
db migrations, etc.

Impacted Areas in Vitess

List general components of the application that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build

Signed-off-by: Tim Vaillancourt <timvaillancourt@github.com>
Signed-off-by: Tim Vaillancourt <timvaillancourt@github.com>
@deepthi deepthi requested review from aquarapid and removed request for systay, sougou, shlomi-noach and harshit-gangal August 25, 2021 22:03
Copy link
Contributor

@aquarapid aquarapid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this to one of the end-to-end tests would be helpful.

// HasReplicationSQLThreadError returns true if the replication SQL thread stopped on an
// error (ie: Slave_SQL_Running: no + Last_SQL_Errno > 0) and sql_slave_skip_counter is
// disabled (0). This suggests the replication SQL thread stopped on an error that can't
// be recovered/skipped automatically. A node in this state may have inconsistent data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The skip counter is human controlled AFAIK. MySQL will not change the value of the skip counter. I'm just pointing this our as I'm not sure if the logic is expected to respond to manual human changes? Either way the logic is good.

Copy link
Contributor Author

@timvaillancourt timvaillancourt Jan 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shlomi-noach correct, sql_slave_skip_counter could be overridden on the fly using a SET operation although it defaults to 0 👍

If .Status (from go/vt/vttablet/tabletserver/repltracker/poller.go) is evaluated periodically a manual change to the sql_slave_skip_counter value shouldn't cause unexpected behaviour for more than one polling "cycle"

If, for example, the SQL thread has crashed and a user changed the value from 0 -> 1 the tablet would become "healthy" again. Today a tablet with a crashed SQL thread will continue to report as "healthy" here until replication has lagged significantly

@systay
Copy link
Collaborator

systay commented Mar 28, 2022

bump

Hi @timvaillancourt!

Are you still interested in getting this merged? Going over old PRs to check which are still active.

@deepthi deepthi requested a review from mattlord March 31, 2022 22:19
@mattlord
Copy link
Contributor

Hi @timvaillancourt!

I'm going to close this for now as I believe the underlying problems have since been address in #9308 and #9853. I apologize — I hadn't seen this PR before or I would have raised the issues with you earlier.

Please let me know if you think I'm missing or misunderstanding anything though and we we can re-open this issue and go from there.

Thank you!

@mattlord mattlord closed this Mar 31, 2022
@timvaillancourt timvaillancourt deleted the repltracker-sql-thread-error branch May 19, 2023 18:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants