-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't abort restore if master is unreachable #5254
Conversation
Signed-off-by: deepthi <deepthi@planetscale.com>
return vterrors.Wrap(err, "can't get master replication position") | ||
// It is possible that though MasterAlias is set, the master tablet is unreachable | ||
// Log a warning and let tablet restore in that case | ||
// If we had instead considered this fatal, all tablets would crash-loop |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change one of the e2e test cases to take the master down before restoring one of the tablets? Would that have caught this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to reproduce with a unit test, and verified that the fix works.
// If we had instead considered this fatal, all tablets would crash-loop | ||
// until a master appears, which would make it impossible to elect a master. | ||
log.Warningf("Can't get master replication position after restore: %v", err) | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't leave a line comment down there, so leaving it here.
The loop on line 248 seems like it will hot-loop indefinitely if replication never starts. Could we add a 1s delay between retries, and check if the context has been cancelled before each iteration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Done.
…us, add unit test Signed-off-by: deepthi <deepthi@planetscale.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM other than some optional comments.
@@ -21,6 +21,7 @@ require ( | |||
github.com/golang/mock v1.3.1 | |||
github.com/golang/protobuf v1.3.2 | |||
github.com/golang/snappy v0.0.0-20170215233205-553a64147049 | |||
github.com/google/btree v1.0.0 // indirect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these new entries persist after go mod tidy
? I don't quite understand what's happening, but I've noticed the go
tool adding some things that we don't actually need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't intended to commit a new go.mod
:(
But this one diff does persist after go mod tidy
newPos := status.Position | ||
if !newPos.Equal(replicationPosition) { | ||
break | ||
select { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need a select
to check the context if you're not waiting on anything else at the same time. I usually just do:
if err := ctx.Err(); err != nil {
return err
}
Signed-off-by: deepthi <deepthi@planetscale.com>
This is a follow up to #5000. With the code introduced in that PR, it is possible for a cluster that is being restored from backups to get into a crash loop situation. This PR fixes that.
Signed-off-by: deepthi deepthi@planetscale.com