Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't abort restore if master is unreachable #5254

Merged
merged 3 commits into from
Oct 1, 2019

Conversation

deepthi
Copy link
Member

@deepthi deepthi commented Sep 30, 2019

This is a follow up to #5000. With the code introduced in that PR, it is possible for a cluster that is being restored from backups to get into a crash loop situation. This PR fixes that.

Signed-off-by: deepthi deepthi@planetscale.com

Signed-off-by: deepthi <deepthi@planetscale.com>
return vterrors.Wrap(err, "can't get master replication position")
// It is possible that though MasterAlias is set, the master tablet is unreachable
// Log a warning and let tablet restore in that case
// If we had instead considered this fatal, all tablets would crash-loop
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change one of the e2e test cases to take the master down before restoring one of the tablets? Would that have caught this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to reproduce with a unit test, and verified that the fix works.

// If we had instead considered this fatal, all tablets would crash-loop
// until a master appears, which would make it impossible to elect a master.
log.Warningf("Can't get master replication position after restore: %v", err)
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't leave a line comment down there, so leaving it here.

The loop on line 248 seems like it will hot-loop indefinitely if replication never starts. Could we add a 1s delay between retries, and check if the context has been cancelled before each iteration?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Done.

…us, add unit test

Signed-off-by: deepthi <deepthi@planetscale.com>
Copy link
Member

@enisoc enisoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than some optional comments.

@@ -21,6 +21,7 @@ require (
github.com/golang/mock v1.3.1
github.com/golang/protobuf v1.3.2
github.com/golang/snappy v0.0.0-20170215233205-553a64147049
github.com/google/btree v1.0.0 // indirect
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these new entries persist after go mod tidy? I don't quite understand what's happening, but I've noticed the go tool adding some things that we don't actually need.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't intended to commit a new go.mod :(
But this one diff does persist after go mod tidy

newPos := status.Position
if !newPos.Equal(replicationPosition) {
break
select {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need a select to check the context if you're not waiting on anything else at the same time. I usually just do:

if err := ctx.Err(); err != nil {
  return err
}

Signed-off-by: deepthi <deepthi@planetscale.com>
@deepthi deepthi merged commit 169331a into vitessio:master Oct 1, 2019
@deepthi deepthi deleted the ds-fix-restore-crashloop branch October 1, 2019 21:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants