Fail VReplication workflows on errors that persist and unrecoverable errors #10429
Conversation
Review Checklist
Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request:
General
Bug fixes
Non-trivial changes
New/Existing features
Backward compatibility
@ajm188 what is the guidance on new command-line flags added to binaries? Dash or underscore?
retryDelay = flag.Duration("vreplication_retry_delay", 5*time.Second, "delay before retrying a failed binlog connection")
retryDelay = flag.Duration("vreplication_retry_delay", 5*time.Second, "delay before retrying a failed workflow event in the replication phase")

maxTimeToRetryErrors = flag.Duration("vreplication_max_time_to_retry_errors", 15*time.Minute, "stop trying to retry after this time")
Suggested change:
maxTimeToRetryErrors = flag.Duration("vreplication_max_time_to_retry_errors", 15*time.Minute, "stop trying to retry after this time")
maxTimeToRetryErrors = flag.Duration("vreplication_max_time_to_retry_errors", 15*time.Minute, "stop retrying after this time")
Should be dashes, but I also think that consistency is important and that we should migrate all Vitess flag names from underscores to dashes, either early in 15.0 or in 16.0.
That makes sense. We can do them all at once (and allow both versions for at least one release).
I had a few questions/nits, but I LOVE this! We can then store this information e.g. in the `_vt.vdiff` table too.
The only thing that kept me from approving -- as most of my comments are not blockers and subjective preference -- is that I think using `vterrors` would be better. What do you think?
Thanks!
ℹ️ Note: I think that we should backport this to 14.0.0, but no further. I added the label and updated the description accordingly, but if you disagree I can undo those changes (already discussed with @deepthi).
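For reference, the kind of `vterrors`-based tracking being suggested would look roughly like the sketch below. The import paths are the Vitess ones; the exact comparison semantics of `vterrors.Equals` (code plus message) are my reading of that package, so treat the expected outputs as illustrative rather than authoritative.

```go
package main

import (
	"fmt"

	vtrpcpb "vitess.io/vitess/go/vt/proto/vtrpc"
	"vitess.io/vitess/go/vt/vterrors"
)

func main() {
	// Two occurrences of what is conceptually "the same" workflow error.
	prev := vterrors.New(vtrpcpb.Code_FAILED_PRECONDITION, "unsupported DDL for vreplication")
	curr := vterrors.New(vtrpcpb.Code_FAILED_PRECONDITION, "unsupported DDL for vreplication")

	// Comparing via vterrors lets the "is this still the same error?" check
	// rely on error codes and messages instead of ad-hoc string handling.
	fmt.Println(vterrors.Equals(prev, curr)) // expected: true
	fmt.Println(vterrors.Code(curr))         // expected: FAILED_PRECONDITION
}
```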
retryDelay = flag.Duration("vreplication_retry_delay", 5*time.Second, "delay before retrying a failed binlog connection")
retryDelay = flag.Duration("vreplication_retry_delay", 5*time.Second, "delay before retrying a failed workflow event in the replication phase")

maxTimeToRetryErrors = flag.Duration("vreplication_max_time_to_retry_errors", 15*time.Minute, "stop trying to retry after this time")
Suggestion:
stop automatically retrying when we've had consecutive failures with the same error for this long after the first occurrence
ct.blpStats.ErrorCounts.Add([]string{"Stream Error"}, 1)
binlogplayer.LogError(fmt.Sprintf("error in stream %v, retrying after %v", ct.id, *retryDelay), err)
log.Flush()
Log flushes can potentially be slow and IMO we should avoid them in the hot path when possible. Are you adding these for local debugging during feature development? It also shouldn't be necessary, I don't think (and also may not be an issue to flush), as I believe STDERR is used for the error log and that is unbuffered by default anyway.
Though I'm not sure why `flush` is used here in particular, I think this point of the code is infrequent enough that it can justify a `flush`.
Was leftover from a debug session, removed
ct.lastWorkflowError.record(err)
if isUnrecoverableError(err) /* mysql error that we know needs manual intervention */ ||
	!ct.lastWorkflowError.canRetry() /* cannot detect if this is recoverable, but it is persisting too long */ {
Nit, but I think this method should be called `shouldRetry()` rather than `canRetry()`.
type lastError struct {
	name string
	lastError error
Can we make this a `vterror` instead? That way we can enforce a meaningful error type and code as part of the structure. That should make it much easier to classify different ones based on the error code.
*/

type lastError struct {
	name string
Is this intended to be a short summary or just a key?
	lastError error
	lastErrorStartTime time.Time
	lastErrorMu sync.Mutex
The type is `lastError`, so using lastError in the field names feels redundant. Can we call them something like:
error vterror
firstSeen time.Time
mu sync.Mutex
le.lastError = err
}

func (le *lastError) canRetry() bool {
Again, IMO we should call this `shouldRetry()` as we always can. It's an (annoying) nit though, I know.
le.lastErrorMu.Lock()
defer le.lastErrorMu.Unlock()
if !time.Time.IsZero(le.lastErrorStartTime) && time.Since(le.lastErrorStartTime) > le.maxTimeInError {
	log.Errorf("Got same error since %s, will not retry anymore: you will need to manually restart workflow once error '%s' is fixed",
log.Errorf("The same error has been seen continuously since %s, we will assume this is a non-recoverable error and will not retry anymore; the workflow will need to be manually restarted once error '%s' has been addressed",
This looks great! The code change is small and simple and solves a big problem. My only request for change is dealing with unexpected input values for `--vreplication_max_time_to_retry_errors`.
@rohit-nayak-ps I tried to address my own nits and comments here: 6e26308. I can revert some or all of it if you disagree, but I thought you'd be fine with them so attempted to save you some time. 🙂 @shlomi-noach for the input validation, were you thinking something like this? Or did I misunderstand? Thanks!
Sure! Although, I notice this adds a new dependency on …
All, let's pursue this PR? As far as I'm concerned, there is just the remaining issue of validating the input; but I'm also OK to merge this as-is to get this fix sooner into …
Agreed that we should merge this soon. I will take a final look at this later today to review the newer commits and aim to push this today itself.
Flag value validation added in bd1ba9b.
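For context on what flag value validation can mean here, the sketch below shows one way to range-check a duration flag at startup. The helper name and the lower bound are illustrative assumptions; they are not taken from bd1ba9b.

```go
package main

import (
	"flag"
	"fmt"
	"log"
	"time"
)

var maxTimeToRetryErrors = flag.Duration("vreplication_max_time_to_retry_errors", 15*time.Minute,
	"stop automatically retrying when we've had consecutive failures with the same error for this long after the first occurrence")

// validateMaxTimeToRetry rejects values that would make the retry window
// meaningless, e.g. shorter than a couple of retry attempts.
func validateMaxTimeToRetry(d time.Duration) error {
	const minAllowed = 10 * time.Second // illustrative lower bound, not the PR's
	if d < minAllowed {
		return fmt.Errorf("--vreplication_max_time_to_retry_errors must be at least %v, got %v", minAllowed, d)
	}
	return nil
}

func main() {
	flag.Parse()
	if err := validateMaxTimeToRetry(*maxTimeToRetryErrors); err != nil {
		log.Fatal(err)
	}
	log.Printf("max time to retry errors: %v", *maxTimeToRetryErrors)
}
```

Note that, per the later commits, similar min/max validation on the retry delay was reverted because unit tests set it to very small values.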
Had a very small nit, but LGTM!
@@ -161,6 +163,7 @@ func isUnrecoverableError(err error) bool {
	mysql.ERInvalidCastToJSON,
	mysql.ERJSONValueTooBig,
	mysql.ERJSONDocumentTooDeep:
	log.Errorf("got unrecoverable error: %v", sqlErr)
Nit, but we should capitalize the beginning of log messages.
I thought it was the other way around? That we had to uncapitalize everything, because those messages are preceded by a prefix like Error:?
Capitalized per request.
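To make the hunk above easier to read in isolation, here is a self-contained sketch of the shape of `isUnrecoverableError`. The `SQLError` type and the `ER*` constants are simplified stand-ins for Vitess's internal mysql package, not the real definitions.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Stand-in error numbers; the real code uses the constants from the mysql package.
const (
	ERInvalidCastToJSON = iota + 1
	ERJSONValueTooBig
	ERJSONDocumentTooDeep
)

// SQLError is a simplified stand-in for the MySQL server error type.
type SQLError struct {
	Num     int
	Message string
}

func (e *SQLError) Error() string { return fmt.Sprintf("error %d: %s", e.Num, e.Message) }

// isUnrecoverableError reports whether err is a MySQL error that retrying will
// never fix, so the workflow should be moved to the Error state instead.
func isUnrecoverableError(err error) bool {
	var sqlErr *SQLError
	if !errors.As(err, &sqlErr) {
		return false
	}
	switch sqlErr.Num {
	case ERInvalidCastToJSON,
		ERJSONValueTooBig,
		ERJSONDocumentTooDeep:
		log.Printf("Got unrecoverable error: %v", sqlErr) // capitalized, per the review nit
		return true
	}
	return false
}

func main() {
	fmt.Println(isUnrecoverableError(&SQLError{Num: ERJSONValueTooBig, Message: "json value too big"})) // true
	fmt.Println(isUnrecoverableError(errors.New("connection refused")))                                 // false
}
```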
…errors (vitessio#10429)
* Fail workflow if same error persists too long. Fail for unrecoverable errors also in non-online ddl workflows
* Update max time default to 15m, was 1m for testing purposes
* Leverage vterrors for Equals; attempt to address my own nits
* sanity: validate range of vreplication_retry_delay and of vreplication_max_time_to_retry_on_error
* Fix flags test
* Remove leftover log.Flush()
* Revert validations min/max settings on retry delay since it is breaking unit tests that set the value to a very small value
* captilize per request
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Co-authored-by: Matt Lord <mattalord@gmail.com>
Co-authored-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Backport: #10573
Description
As part of an initial design decision, VReplication workflows always retry when they encounter an error, after sleeping for 5 seconds. The reasoning was that, for large reshards/migrations and perpetual Materialize workflows, we could often encounter recoverable errors like PRS, restarts of vttablets/mysql servers, network partitions, etc. So rather than erroring out and waiting for an operator to manually restart workflows, we decided to keep retrying.
Since we only retried every five seconds, any resource wastage due to continuously retrying unrecoverable workflows would be small, and in most cases we would transparently recover and make forward progress with minimal downtime. This is especially important for Materialize workflows, where the user is expecting near-realtime performance.
Usually the vreplication workflows would be set up manually, and the possibility of errors due to schema issues was minimal, so this approach worked well. However, with the introduction of vreplication-based online DDL workflows we see a lot of automated use, where user-specified DDLs are directly used to configure vreplication workflows. Incorrect DDLs can thus result in unrecoverable errors that lead to prolonged retries.
Error reporting in VReplication is also not great: we update the `message` column in the `_vt.vreplication` table, but that can get overwritten when we retry. We also log errors in the `_vt.vreplication_log` table.
A change was introduced recently in Online DDL workflows to mitigate this: we look up the error against a set of MySQL errors that we know are not recoverable, and in that case we put the workflow in an error state. There are then no more automated retries, and a manual restart after fixing the error is expected.
However, there are still unrecoverable schema-related errors that are not yet mapped, or do not map cleanly, to MySQL errors. There could also be misconfigured workflows (for example: no replicas in a keyspace when the tablet type is set to only replicas, incorrect cell settings, etc.). Continuously retrying workflows in such cases can delay detecting them.
This PR:
- Introduces a new flag, `--vreplication_max_time_to_retry_errors` (default: 15 minutes), which limits how long a workflow keeps retrying the same error.
- For the above cases it directly moves the workflow to the `Error` state, which is then reported in `Workflow Show` (see the sketch below).
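To summarize the behavior described above, here is a condensed, illustrative sketch of the decision now made on every error; the function and variable names are placeholders, not the actual controller API.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative stand-in for the --vreplication_max_time_to_retry_errors flag.
var maxTimeToRetryErrors = 15 * time.Minute

// decideOnError returns "retry" for transient errors and "error" when the
// workflow should stop retrying: either the error is known to be unrecoverable,
// or the same error has now persisted longer than maxTimeToRetryErrors.
func decideOnError(unrecoverable bool, firstSeen time.Time) string {
	if unrecoverable || (!firstSeen.IsZero() && time.Since(firstSeen) > maxTimeToRetryErrors) {
		return "error" // surfaced via Workflow Show; needs a manual restart once fixed
	}
	return "retry" // sleep for the retry delay (5s by default), then try again
}

func main() {
	fmt.Println(decideOnError(false, time.Now()))                 // retry: transient error, just started
	fmt.Println(decideOnError(false, time.Now().Add(-time.Hour))) // error: same error has persisted too long
	fmt.Println(decideOnError(true, time.Now()))                  // error: unrecoverable (e.g. bad DDL)
}
```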
Checklist