Don't abort restore if master is unreachable #5254

deepthi · 2019-09-30T22:15:11Z

This is a follow-up to #5000. The code introduced in that PR makes it possible for a cluster that is being restored from backups to get into a crash loop. This PR fixes that.

Signed-off-by: deepthi <deepthi@planetscale.com>

enisoc · 2019-09-30T22:42:46Z

go/vt/vttablet/tabletmanager/restore.go

-		return vterrors.Wrap(err, "can't get master replication position")
+		// It is possible that though MasterAlias is set, the master tablet is unreachable
+		// Log a warning and let tablet restore in that case
+		// If we had instead considered this fatal, all tablets would crash-loop


Can we change one of the e2e test cases to take the master down before restoring one of the tablets? Would that have caught this?

I was able to reproduce with a unit test, and verified that the fix works.
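
For readers following along, the behaviour change in the diff above can be sketched roughly as follows. This is not the Vitess code: `positionAfterRestore`, `unreachableMaster`, and `errUnreachable` are hypothetical names, and the standard library `log` package stands in for the project's own logger. The point is only that a failure to reach the master is logged and swallowed rather than returned.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errUnreachable is a hypothetical error used to simulate a master that
// cannot be contacted.
var errUnreachable = errors.New("master unreachable")

// unreachableMaster is a hypothetical stand-in for the call that asks the
// master tablet for its replication position.
func unreachableMaster() (string, error) { return "", errUnreachable }

// positionAfterRestore is a hypothetical helper. If the master position
// cannot be fetched, it logs a warning and reports success so that the
// restore is not aborted (and the tablet does not crash-loop).
func positionAfterRestore(fetch func() (string, error)) (string, error) {
	pos, err := fetch()
	if err != nil {
		log.Printf("WARNING: can't get master replication position after restore: %v", err)
		return "", nil
	}
	return pos, nil
}

func main() {
	pos, err := positionAfterRestore(unreachableMaster)
	fmt.Printf("pos=%q err=%v\n", pos, err)
}
```

A unit test for this shape of code only needs to inject a failing fetch function and assert that no error comes back.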

enisoc · 2019-09-30T22:55:44Z

go/vt/vttablet/tabletmanager/restore.go

+		// If we had instead considered this fatal, all tablets would crash-loop
+		// until a master appears, which would make it impossible to elect a master.
+		log.Warningf("Can't get master replication position after restore: %v", err)
+		return nil


I can't leave a line comment down there, so leaving it here.

The loop on line 248 seems like it will hot-loop indefinitely if replication never starts. Could we add a 1s delay between retries, and check if the context has been cancelled before each iteration?

Good point. Done.
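
A minimal sketch of the retry shape requested above, assuming a generic polling helper rather than the actual code in the PR. `pollUntilChanged`, `demo`, and `demoCancelled` are hypothetical names, and positions are plain strings here for illustration. The loop checks the context before each iteration and sleeps between attempts instead of spinning.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pollUntilChanged is a hypothetical helper. It calls read until the value
// differs from start, sleeping for delay between attempts, and stops as
// soon as it observes that ctx has been cancelled.
func pollUntilChanged(ctx context.Context, start string, delay time.Duration, read func() (string, error)) (string, error) {
	for {
		// Check for cancellation before each iteration.
		if err := ctx.Err(); err != nil {
			return "", err
		}
		pos, err := read()
		if err != nil {
			return "", err
		}
		if pos != start {
			return pos, nil
		}
		// Pause before retrying so the loop does not hot-loop.
		time.Sleep(delay)
	}
}

// demo simulates a position that changes on the third read.
func demo() (string, int, error) {
	calls := 0
	read := func() (string, error) {
		calls++
		if calls < 3 {
			return "pos-1", nil
		}
		return "pos-2", nil
	}
	pos, err := pollUntilChanged(context.Background(), "pos-1", time.Millisecond, read)
	return pos, calls, err
}

// demoCancelled reports whether an already-cancelled context stops the loop
// with context.Canceled even though the position never changes.
func demoCancelled() bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	_, err := pollUntilChanged(ctx, "pos-1", time.Millisecond, func() (string, error) {
		return "pos-1", nil
	})
	return errors.Is(err, context.Canceled)
}

func main() {
	pos, calls, err := demo()
	fmt.Println(pos, calls, err)
	fmt.Println(demoCancelled())
}
```

With a 1s delay, as suggested in the review, cancellation is noticed at most one delay late; that trade-off keeps the loop body simple.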

enisoc

LGTM other than some optional comments.

enisoc · 2019-10-01T18:00:33Z

go.mod

@@ -21,6 +21,7 @@ require (
 	github.com/golang/mock v1.3.1
 	github.com/golang/protobuf v1.3.2
 	github.com/golang/snappy v0.0.0-20170215233205-553a64147049
+	github.com/google/btree v1.0.0 // indirect


Do these new entries persist after go mod tidy? I don't quite understand what's happening, but I've noticed the go tool adding some things that we don't actually need.

I hadn't intended to commit a new go.mod :(
But this one diff does persist after go mod tidy

enisoc · 2019-10-01T18:04:23Z

go/vt/mysqlctl/builtinbackupengine.go

-				newPos := status.Position
-				if !newPos.Equal(replicationPosition) {
-					break
+				select {


I don't think you need a select to check the context if you're not waiting on anything else at the same time. I usually just do:

if err := ctx.Err(); err != nil { return err }
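
To illustrate the point, here is a small sketch comparing the two ways of polling a context without blocking. The function names `checkWithSelect`, `checkWithErr`, and `compare` are made up for this example; only the `context` package calls are real.

```go
package main

import (
	"context"
	"fmt"
)

// checkWithSelect polls the context with a select that has a default case,
// so it never blocks.
func checkWithSelect(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	default:
		return nil
	}
}

// checkWithErr performs the same non-blocking check with a single call.
func checkWithErr(ctx context.Context) error {
	return ctx.Err()
}

// compare reports whether both checks agree on a live context and on a
// cancelled one.
func compare() (liveAgree, cancelledAgree bool) {
	ctx, cancel := context.WithCancel(context.Background())
	liveAgree = checkWithSelect(ctx) == nil && checkWithErr(ctx) == nil
	cancel()
	cancelledAgree = checkWithSelect(ctx) == context.Canceled &&
		checkWithErr(ctx) == context.Canceled
	return liveAgree, cancelledAgree
}

func main() {
	live, cancelled := compare()
	fmt.Println(live, cancelled)
}
```

The select form earns its keep only when another channel, such as a timer, is being waited on in the same statement.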

Don't abort restore if master is unreachable

0d1ebf0

Signed-off-by: deepthi <deepthi@planetscale.com>

deepthi requested a review from sougou as a code owner September 30, 2019 22:15

deepthi requested a review from enisoc September 30, 2019 22:15

enisoc reviewed Sep 30, 2019

implement delay between retries of attempting to get mysql slave stat…

e6c4a09

…us, add unit test Signed-off-by: deepthi <deepthi@planetscale.com>

enisoc approved these changes Oct 1, 2019

cleanup per review comments

1e1ef87

Signed-off-by: deepthi <deepthi@planetscale.com>

deepthi merged commit 169331a into vitessio:master Oct 1, 2019

deepthi deleted the ds-fix-restore-crashloop branch October 1, 2019 21:43

spark4 mentioned this pull request Nov 12, 2019

Serry deploy tinyspeck/vitess#140

Closed

spark4 mentioned this pull request Nov 22, 2019

Slack sync upstream 2019 11 09.r0 tinyspeck/vitess#142

Merged

rafael mentioned this pull request Dec 11, 2019

Slack sync upstream 2019 12 11.r0 tinyspeck/vitess#143

Merged
