Add flag to wait for backup instead of starting up empty. #4929
Conversation
Signed-off-by: Anthony Yeh <enisoc@planetscale.com>
the tablet will start up empty. This behavior allows you to bootstrap a new
shard before any backups exist.

If the `-wait_for_backup_interval` flag is set to a value greater than zero,
Suggested change:
- If the `-wait_for_backup_interval` flag is set to a value greater than zero,
+ If the `-init_from_backup_retry_interval` flag is set to a value greater than zero,
The name "retry" is not accurate here. The current behavior of -restore_from_backup is that "no backup exists; starting up empty" is considered a normal, non-error result. It doesn't make sense that adding a retry interval would change this semantic, because retries only apply to errors.
But you're retrying until success; that's generally what retries are for. This is changing the behavior of -restore_from_backup, which is only a one-time special case that happens to start up empty.
> but you're retrying until success

The behavior I'm adding is not "retrying until success". If the backup doesn't exist, "success" is currently defined as "start up empty". There is no need nor expectation to retry anything. The thing you tried (-restore_from_backup, which implicitly means "restore from backup if there is one, or else start up empty") succeeded.

I need a name that indicates I'm redefining what success means, not just adding a retry until the existing definition of success is met.

> this is changing the behavior of restore_from_backup, which only is a single time special case, and happens to startup from empty.

I don't follow what this means. Can you reword it if I've missed the point?
I must be misunderstanding the code, but it seems to look for backups until it finds one. What else does init_from_backup_retry_interval mean? What other success is there? If it can't find a backup, it keeps retrying, right? It doesn't start up empty anymore.
If you look at the new behavior in isolation, the word "retry" makes sense. However, if you take into account the context of how the previous behavior worked -- what users have had to understand since -restore_from_backup was first introduced -- then the word "retry" is inaccurate and misleading. In other words, we can't call this "retry" because of legacy.

I'm open to other names that are better than wait_for_backup, but not if the name actively misleads users.
	restoreFromBackup     = flag.Bool("restore_from_backup", false, "(init restore parameter) will check BackupStorage for a recent backup at startup and start there")
	restoreConcurrency    = flag.Int("restore_concurrency", 4, "(init restore parameter) how many concurrent files to restore at once")
	waitForBackupInterval = flag.Duration("wait_for_backup_interval", 0, "(init restore parameter) if this is greater than 0, instead of starting up empty when no backups are found, keep checking at this interval for a backup to appear")
Suggested change:
- waitForBackupInterval = flag.Duration("wait_for_backup_interval", 0, "(init restore parameter) if this is greater than 0, instead of starting up empty when no backups are found, keep checking at this interval for a backup to appear")
+ restoreFromBackupRetryInterval = flag.Duration("init_from_backup_retry_interval", 0, "(init restore parameter) when greater than 0, continuously retry restoring backup until success before starting up")
If we want to keep with the bad name restore_from_backup, we can also name this restore_from_backup_retry_interval.
			break
		}

		log.Infof("No backup found. Waiting %v (from -wait_for_backup_interval flag) to check again.", waitForBackupInterval)
Suggested change:
- log.Infof("No backup found. Waiting %v (from -wait_for_backup_interval flag) to check again.", waitForBackupInterval)
+ log.Infof("No backup found. Retrying in %v (from -wait_for_backup_interval flag).", waitForBackupInterval)
@@ -131,7 +131,7 @@ func (agent *ActionAgent) RestoreFromBackup(ctx context.Context, logger logutil.
	l := logutil.NewTeeLogger(logutil.NewConsoleLogger(), logger)

	// now we can run restore
-	err = agent.restoreDataLocked(ctx, l, true /* deleteBeforeRestore */)
+	err = agent.restoreDataLocked(ctx, l, 0 /* waitForBackupInterval */, true /* deleteBeforeRestore */)
Suggested change:
- err = agent.restoreDataLocked(ctx, l, 0 /* waitForBackupInterval */, true /* deleteBeforeRestore */)
+ retryInterval := 0 * time.Second
+ err = agent.restoreDataLocked(ctx, l, retryInterval, true /* deleteBeforeRestore */)
Note that user documentation under …

Ah thanks for the heads-up @sougou. I'll open a PR against vitessio/website after this is approved.
It seems that with the flag set, we will keep looking for a backup at intervals until the context times out or we find a good backup (whichever comes first). Has there been any thought or discussion as to whether the timeout should be shorter than what is passed in?
@deepthi Do you mean, voluntarily set a shorter timeout for the "wait for backup" stage so we have a chance to still do something after that times out, before the overall context times out? Normally that's a good practice, but I don't think it applies in this case. The intended meaning of "wait for backup" is "never proceed to start up until you successfully restore from a backup"; the idea is to provide a guarantee that the tablet will never start up empty. So if we time out while waiting for the backup, we're not allowed to do anything else besides quit anyway.
This allows shard bootstrap workflows that start up tablets concurrently while uploading an initial backup. Once the initial backup is complete, the tablets will restore from it and start serving. Without this, the workflow has to wait for the backup to complete before launching any tablets, which makes it slower and more complex.
Signed-off-by: Anthony Yeh enisoc@planetscale.com