-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow retry of interrupted restores #4909
Conversation
Signed-off-by: deepthi <deepthi@planetscale.com>
After testing this PR on our cluster, we last landed on the following behaviour: Restarted pods (i.e. interrupted restore) do correctly identify a
This causes the Meanwhile,
|
…even if some data exists Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
24247e3
to
feef084
Compare
@enisoc thank you for the detailed review. I will address the suggested changes. |
…case to actually test interrupted restore Signed-off-by: deepthi <deepthi@planetscale.com>
…to allow multiple restores of same vttablet Signed-off-by: deepthi <deepthi@planetscale.com>
This seems pretty stable in our testing! Restores are recovering from interruption at any stage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall! Just a few last questions.
…ing files Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
@enisoc a more fundamental question comes to mind. The way I have coded this, we delete the |
Ah that's a good point. We also should consider the restore incomplete if we haven't run mysql_upgrade and populated metadata yet. I don't think we need to lift the check all the way up to tabletmanager though. Do you think it's enough if we create the state file before running I think we're still ok if a failure occurs after that point. The tabletmanager-level stuff should fix itself either on the next vttablet restart, or when the RestoreFromBackup RPC is retried. Do you agree? |
That sounds reasonable. I will make the change. |
Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
if err := removeExistingFiles(cnf); err != nil { | ||
return mysql.Position{}, err | ||
// mark restore as in progress | ||
if err = createStateFile(cnf); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we remove the state file one level up in mysqlctl.Restore()
, I was thinking we'd also create it there so all the logic is in one place. That does mean that we'd potentially mark the data dir invalid just because we failed to find a valid backup, but that may be easy to avoid now that you've abstracted findBackupToRestore()
. We could call that from mysqlctl.Restore()
instead of here, and then create the state file there as well if we find one, before passing the found backup to ExecuteRestore()
. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be ideal, but there is a little bit of difficulty with moving the call to findBackupToRestore
into mysqlctl.Restore
. The manifest format is different for the two backup methods - builtin and xtrabackup, and we are passing in the appropriate one to findBackupToRestore
when we call it. Any suggestions on how we could work around that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I didn't notice that there's an implementation-specific arg to findBackupToRestore
. In that case, it's probably not worth doing surgery on it in this PR.
Fixes #4821
The strategy here is to create a temporary file at the beginning of the restore process that is deleted at the end.
If the file exists at the beginning of the restore process, we assume that a previous restore failed and skip the checkNoDB check.
This requires some other changes:
In addition, changes have been made to the code so that terminating a grpc client that started the restore will cause the restore to terminate as well, so effectively we are undo-ing most of #2310.
The terminate_restore test has been changed to test the new behavior.
Backup tests had to be re-organized because now we have so many cases that if they are all in one test, it times out.