Fail VReplication workflows on errors that persist and unrecoverable errors #10429
Conversation
Review Checklist
Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request:
General
Bug fixes
Non-trivial changes
New/Existing features
Backward compatibility
@ajm188 what is the guidance on new command-line flags added to binaries? Dash or underscore?
retryDelay = flag.Duration("vreplication_retry_delay", 5*time.Second, "delay before retrying a failed binlog connection")
retryDelay = flag.Duration("vreplication_retry_delay", 5*time.Second, "delay before retrying a failed workflow event in the replication phase")

maxTimeToRetryErrors = flag.Duration("vreplication_max_time_to_retry_errors", 15*time.Minute, "stop trying to retry after this time")
Suggested change:
maxTimeToRetryErrors = flag.Duration("vreplication_max_time_to_retry_errors", 15*time.Minute, "stop trying to retry after this time")
maxTimeToRetryErrors = flag.Duration("vreplication_max_time_to_retry_errors", 15*time.Minute, "stop retrying after this time")
Should be dashes, but I also think that consistency is important and that we should migrate all Vitess flag names from underscores to dashes, either early in 15.0 or in 16.0.
That makes sense. We can do them all at once (and allow both versions for at least one release).
I had a few questions/nits, but I LOVE this! We can then store this information e.g. in the `_vt.vdiff` table too.
The only thing that kept me from approving -- as most of my comments are not blockers and subjective preference -- is that I think using `vterrors` would be better. What do you think?
Thanks!
ℹ️ Note: I think that we should backport this to 14.0.0, but no further. I added the label and updated the description accordingly, but if you disagree I can undo those changes (already discussed with @deepthi).
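For reference, the kind of `vterrors`-based tracking being suggested would look roughly like the sketch below. The import paths are the Vitess ones; the exact comparison semantics of `vterrors.Equals` (code plus message) are my reading of that package, so treat the expected outputs as illustrative rather than authoritative.

```go
package main

import (
	"fmt"

	vtrpcpb "vitess.io/vitess/go/vt/proto/vtrpc"
	"vitess.io/vitess/go/vt/vterrors"
)

func main() {
	// Two occurrences of what is conceptually "the same" workflow error.
	prev := vterrors.New(vtrpcpb.Code_FAILED_PRECONDITION, "unsupported DDL for vreplication")
	curr := vterrors.New(vtrpcpb.Code_FAILED_PRECONDITION, "unsupported DDL for vreplication")

	// Comparing via vterrors lets the "is this still the same error?" check
	// rely on error codes and messages instead of ad-hoc string handling.
	fmt.Println(vterrors.Equals(prev, curr)) // expected: true
	fmt.Println(vterrors.Code(curr))         // expected: FAILED_PRECONDITION
}
```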
retryDelay = flag.Duration("vreplication_retry_delay", 5*time.Second, "delay before retrying a failed binlog connection")
retryDelay = flag.Duration("vreplication_retry_delay", 5*time.Second, "delay before retrying a failed workflow event in the replication phase")

maxTimeToRetryErrors = flag.Duration("vreplication_max_time_to_retry_errors", 15*time.Minute, "stop trying to retry after this time")
Suggestion:
stop automatically retrying when we've had consecutive failures with the same error for this long after the first occurrence
ct.blpStats.ErrorCounts.Add([]string{"Stream Error"}, 1)
binlogplayer.LogError(fmt.Sprintf("error in stream %v, retrying after %v", ct.id, *retryDelay), err)
log.Flush()
Log flushes can potentially be slow and IMO we should avoid them in the hot path when possible. Are you adding these for local debugging during feature development? It also shouldn't be necessary, I don't think (and also may not be an issue to flush), as I believe STDERR is used for the error log and that is unbuffered by default anyway.
Though I'm not sure why `flush` is used here in particular, I think this point of the code is infrequent enough that it can justify a `flush`.
Was leftover from a debug session, removed
ct.lastWorkflowError.record(err)
if isUnrecoverableError(err) /* mysql error that we know needs manual intervention */ ||
	!ct.lastWorkflowError.canRetry() /* cannot detect if this is recoverable, but it is persisting too long */ {
Nit, but I think this method should be called `shouldRetry()` rather than `canRetry()`.
type lastError struct {
	name string
	lastError error
Can we make this a `vterror` instead? That way we can enforce a meaningful error type and code as part of the structure. That should make it much easier to classify different ones based on the error code.
*/

type lastError struct {
	name string
Is this intended to be a short summary or just a key?
	lastError error
	lastErrorStartTime time.Time
	lastErrorMu sync.Mutex
The type is `lastError`, so using lastError in the field names feels redundant. Can we call them something like:
error vterror
firstSeen time.Time
mu sync.Mutex
le.lastError = err
}

func (le *lastError) canRetry() bool {
Again, IMO we should call this `shouldRetry()` as we always can. It's an (annoying) nit though, I know.
le.lastErrorMu.Lock()
defer le.lastErrorMu.Unlock()
if !time.Time.IsZero(le.lastErrorStartTime) && time.Since(le.lastErrorStartTime) > le.maxTimeInError {
	log.Errorf("Got same error since %s, will not retry anymore: you will need to manually restart workflow once error '%s' is fixed",
log.Errorf("The same error has been seen continuously since %s, we will assume this is a non-recoverable error and will not retry anymore; the workflow will need to be manually restarted once error '%s' has been addressed",
This looks great! The code change is small and simple and solves a big problem. My only request for change is dealing with unexpected input values for `--vreplication_max_time_to_retry_errors`.
@rohit-nayak-ps I tried to address my own nits and comments here: 6e26308. I can revert some or all of it if you disagree, but I thought you'd be fine with them so attempted to save you some time. 🙂 @shlomi-noach for the input validation, were you thinking something like this? Or did I misunderstand? Thanks!
Sure! Although, I notice this adds a new dependency on …
All, let's pursue this PR? As far as I'm concerned, there is just the remaining issue of validating the input; but I'm also OK to merge this as-is to get this fix sooner into …
Agreed that we should merge this soon. I will take a final look at this later today to review the newer commits and aim to push this today itself.
Flag value validation added in bd1ba9b.
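For context on what flag value validation can mean here, the sketch below shows one way to range-check a duration flag at startup. The helper name and the lower bound are illustrative assumptions; they are not taken from bd1ba9b.

```go
package main

import (
	"flag"
	"fmt"
	"log"
	"time"
)

var maxTimeToRetryErrors = flag.Duration("vreplication_max_time_to_retry_errors", 15*time.Minute,
	"stop automatically retrying when we've had consecutive failures with the same error for this long after the first occurrence")

// validateMaxTimeToRetry rejects values that would make the retry window
// meaningless, e.g. shorter than a couple of retry attempts.
func validateMaxTimeToRetry(d time.Duration) error {
	const minAllowed = 10 * time.Second // illustrative lower bound, not the PR's
	if d < minAllowed {
		return fmt.Errorf("--vreplication_max_time_to_retry_errors must be at least %v, got %v", minAllowed, d)
	}
	return nil
}

func main() {
	flag.Parse()
	if err := validateMaxTimeToRetry(*maxTimeToRetryErrors); err != nil {
		log.Fatal(err)
	}
	log.Printf("max time to retry errors: %v", *maxTimeToRetryErrors)
}
```

Note that, per the later commits, similar min/max validation on the retry delay was reverted because unit tests set it to very small values.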
Had a very small nit, but LGTM!
@@ -161,6 +163,7 @@ func isUnrecoverableError(err error) bool {
	mysql.ERInvalidCastToJSON,
	mysql.ERJSONValueTooBig,
	mysql.ERJSONDocumentTooDeep:
	log.Errorf("got unrecoverable error: %v", sqlErr)
Nit, but we should capitalize the beginning of log messages.
I thought it was the other way around? That we had to uncapitalize everything, because those messages are preceded by a prefix like Error:?
Capitalized per request.
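To make the hunk above easier to read in isolation, here is a self-contained sketch of the shape of `isUnrecoverableError`. The `SQLError` type and the `ER*` constants are simplified stand-ins for Vitess's internal mysql package, not the real definitions.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Stand-in error numbers; the real code uses the constants from the mysql package.
const (
	ERInvalidCastToJSON = iota + 1
	ERJSONValueTooBig
	ERJSONDocumentTooDeep
)

// SQLError is a simplified stand-in for the MySQL server error type.
type SQLError struct {
	Num     int
	Message string
}

func (e *SQLError) Error() string { return fmt.Sprintf("error %d: %s", e.Num, e.Message) }

// isUnrecoverableError reports whether err is a MySQL error that retrying will
// never fix, so the workflow should be moved to the Error state instead.
func isUnrecoverableError(err error) bool {
	var sqlErr *SQLError
	if !errors.As(err, &sqlErr) {
		return false
	}
	switch sqlErr.Num {
	case ERInvalidCastToJSON,
		ERJSONValueTooBig,
		ERJSONDocumentTooDeep:
		log.Printf("Got unrecoverable error: %v", sqlErr) // capitalized, per the review nit
		return true
	}
	return false
}

func main() {
	fmt.Println(isUnrecoverableError(&SQLError{Num: ERJSONValueTooBig, Message: "json value too big"})) // true
	fmt.Println(isUnrecoverableError(errors.New("connection refused")))                                 // false
}
```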
…errors (vitessio#10429)
* Fail workflow if same error persists too long. Fail for unrecoverable errors also in non-online ddl workflows
* Update max time default to 15m, was 1m for testing purposes
* Leverage vterrors for Equals; attempt to address my own nits
* sanity: validate range of vreplication_retry_delay and of vreplication_max_time_to_retry_on_error
* Fix flags test
* Remove leftover log.Flush()
* Revert validations min/max settings on retry delay since it is breaking unit tests that set the value to a very small value
* captilize per request
Signed-off-by: Rohit Nayak <rohit@planetscale.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Co-authored-by: Matt Lord <mattalord@gmail.com>
Co-authored-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
Backport: #10573
Description
As part of an initial design decision, VReplication workflows always retry when they encounter an error, after sleeping for 5 seconds. The reasoning was that, for large reshards/migrations and perpetual Materialize workflows, we could often encounter recoverable errors like PRS, restarts of vttablets/mysql servers, network partitions, etc. So rather than erroring out and waiting for an operator to manually restart workflows, we decided to keep retrying.
Since we only retried every five seconds, any resource wastage due to continuously retrying unrecoverable workflows would be small, and in most cases we would transparently recover and make forward progress with minimal downtime. This is especially important for Materialize workflows, where the user is expecting near-realtime performance.
Usually the vreplication workflows would be set up manually, and the possibility of errors due to schema issues was minimal, so this approach worked well. However, with the introduction of vreplication-based online DDL workflows we see a lot of automated use, where user-specified DDLs are directly used to configure vreplication workflows. Incorrect DDLs can thus result in unrecoverable errors that lead to prolonged retries.
Error reporting in VReplication is also not great: we update the `message` column in the `_vt.vreplication` table, but that can get overwritten when we retry. We also log errors in the `_vt.vreplication_log` table.
A change was introduced recently in Online DDL workflows to mitigate this: we look up the error against a set of MySQL errors that we know are not recoverable, and in that case we put the workflow in an error state. There are then no more automated retries, and a manual restart after fixing the error is expected.
However, there are still unrecoverable schema-related errors that are not yet mapped, or do not map cleanly, to MySQL errors. There could also be misconfigured workflows (for example: no replicas in a keyspace when the tablet type is set to only replicas, incorrect cell settings, etc.). Continuously retrying workflows in such cases can delay detecting them.
This PR:
- Introduces a new flag, `--vreplication_max_time_to_retry_errors` (default: 15 minutes), which limits how long a workflow keeps retrying the same error.
- For the above cases it directly moves the workflow to the `Error` state, which is then reported in `Workflow Show` (see the sketch below).
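To summarize the behavior described above, here is a condensed, illustrative sketch of the decision now made on every error; the function and variable names are placeholders, not the actual controller API.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative stand-in for the --vreplication_max_time_to_retry_errors flag.
var maxTimeToRetryErrors = 15 * time.Minute

// decideOnError returns "retry" for transient errors and "error" when the
// workflow should stop retrying: either the error is known to be unrecoverable,
// or the same error has now persisted longer than maxTimeToRetryErrors.
func decideOnError(unrecoverable bool, firstSeen time.Time) string {
	if unrecoverable || (!firstSeen.IsZero() && time.Since(firstSeen) > maxTimeToRetryErrors) {
		return "error" // surfaced via Workflow Show; needs a manual restart once fixed
	}
	return "retry" // sleep for the retry delay (5s by default), then try again
}

func main() {
	fmt.Println(decideOnError(false, time.Now()))                 // retry: transient error, just started
	fmt.Println(decideOnError(false, time.Now().Add(-time.Hour))) // error: same error has persisted too long
	fmt.Println(decideOnError(true, time.Now()))                  // error: unrecoverable (e.g. bad DDL)
}
```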
Checklist