
Add flag to wait for backup instead of starting up empty. #4929

Merged: 1 commit merged into vitessio:master from the wait-for-backup branch on Jun 19, 2019

Conversation

@enisoc (Member) commented Jun 14, 2019

This allows shard bootstrap workflows that start up tablets concurrently while uploading an initial backup. Once the initial backup is complete, the tablets will restore from it and start serving. Without this, the workflow has to wait for the backup to complete before launching any tablets, which makes it slower and more complex.

Signed-off-by: Anthony Yeh enisoc@planetscale.com

@enisoc enisoc requested a review from deepthi June 14, 2019 05:05
@enisoc enisoc requested a review from sougou as a code owner June 14, 2019 05:05
Reviewed documentation text:

…the tablet will start up empty. This behavior allows you to bootstrap a new shard before any backups exist.

If the `-wait_for_backup_interval` flag is set to a value greater than zero, …
Contributor:

Suggested change:
- If the `-wait_for_backup_interval` flag is set to a value greater than zero,
+ If the `-init_from_backup_retry_interval` flag is set to a value greater than zero,

enisoc (Member, Author):

The name "retry" is not accurate here. The current behavior of -restore_from_backup is that "no backup exists; starting up empty" is considered a normal, non-error result. It doesn't make sense that adding a retry interval will change this semantic because retries only apply to errors.

Contributor:

But you're retrying until success; that's generally what retries are for. This is changing the behavior of restore_from_backup, which is only a single-time special case that happens to start up empty.

enisoc (Member, Author):

> But you're retrying until success.

The behavior I'm adding is not "retrying until success". If the backup doesn't exist, "success" is currently defined as "start up empty". There is no need nor expectation to retry anything. The thing you tried (-restore_from_backup, which implicitly means "restore from backup if there is one or else start up empty") succeeded.

I need a name that indicates I'm redefining what success means, not just adding a retry until the existing definition of success is met.

> This is changing the behavior of restore_from_backup, which is only a single-time special case that happens to start up empty.

I don't follow what this means. Can you reword it if I've missed the point?

Contributor:

I must be misunderstanding the code, but it seems to look for backups until it finds one. What else does init_from_backup_retry_interval mean? What other success is there? If it can't find a backup, it keeps retrying, right? It doesn't start up empty anymore.

enisoc (Member, Author):

If you look at the new behavior in isolation, the word "retry" makes sense. However, if you take into account the context of how the previous behavior worked -- what users have had to understand since -restore_from_backup was first introduced -- then the word "retry" is inaccurate and misleading. In other words, we can't call this "retry" because of that legacy behavior.

I'm open to other names that are better than wait_for_backup, but not if the name actively misleads users.

restoreConcurrency = flag.Int("restore_concurrency", 4, "(init restore parameter) how many concurrent files to restore at once")
restoreFromBackup = flag.Bool("restore_from_backup", false, "(init restore parameter) will check BackupStorage for a recent backup at startup and start there")
waitForBackupInterval = flag.Duration("wait_for_backup_interval", 0, "(init restore parameter) if this is greater than 0, instead of starting up empty when no backups are found, keep checking at this interval for a backup to appear")
Contributor:

Suggested change:
- waitForBackupInterval = flag.Duration("wait_for_backup_interval", 0, "(init restore parameter) if this is greater than 0, instead of starting up empty when no backups are found, keep checking at this interval for a backup to appear")
+ restoreFromBackupRetryInterval = flag.Duration("init_from_backup_retry_interval", 0, "(init restore parameter) when greater than 0, continuously retry restoring backup until success before starting up")

If we want to stick with the bad name restore_from_backup, we can also name this restore_from_backup_retry_interval.

break
}

log.Infof("No backup found. Waiting %v (from -wait_for_backup_interval flag) to check again.", waitForBackupInterval)
Contributor:

Suggested change:
- log.Infof("No backup found. Waiting %v (from -wait_for_backup_interval flag) to check again.", waitForBackupInterval)
+ log.Infof("No backup found. Retrying in %v (from -wait_for_backup_interval flag).", waitForBackupInterval)

@@ -131,7 +131,7 @@ func (agent *ActionAgent) RestoreFromBackup(ctx context.Context, logger logutil.
l := logutil.NewTeeLogger(logutil.NewConsoleLogger(), logger)

// now we can run restore
- err = agent.restoreDataLocked(ctx, l, true /* deleteBeforeRestore */)
+ err = agent.restoreDataLocked(ctx, l, 0 /* waitForBackupInterval */, true /* deleteBeforeRestore */)
Contributor:

Suggested change:
- err = agent.restoreDataLocked(ctx, l, 0 /* waitForBackupInterval */, true /* deleteBeforeRestore */)
+ retryInterval := 0 * time.Second
+ err = agent.restoreDataLocked(ctx, l, retryInterval, true /* deleteBeforeRestore */)

@sougou (Contributor) commented Jun 17, 2019

Note that user documentation under /doc is deprecated; doc changes should all go to vitessio/website. I intend to delete most files in that directory except for design docs that may be relevant to contributors.

@enisoc (Member, Author) commented Jun 17, 2019

Ah thanks for the heads-up @sougou. I'll open a PR against vitessio/website after this is approved.

@deepthi (Member) left a comment:

It seems that with the flag set, we will keep looking for a backup at intervals until the context times out or we find a good backup (whichever comes first). Has there been any thought or discussion as to whether the timeout should be shorter than what is passed in?

@enisoc (Member, Author) commented Jun 18, 2019

@deepthi Do you mean, voluntarily set a shorter timeout for the "wait for backup" stage so we have a chance to still do something after that times out, before the overall context times out?

Normally that's a good practice, but I don't think it applies in this case. The intended meaning of "wait for backup" is "never proceed to start up until you successfully restore from a backup"; the idea is to provide a guarantee that the tablet will never start up empty. So if we time out while waiting for the backup, we're not allowed to do anything else besides quit anyway.
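To make the question concrete: with the flag set, the wait is bounded only by whatever deadline the caller's context already carries. A stage-specific, shorter timeout, if one were wanted, could be layered on roughly as follows, continuing the hypothetical sketch above (the 30-minute value is arbitrary and purely illustrative). As the reply above notes, though, timing out in this stage leaves nothing to do but exit, since starting up empty is exactly what the flag is meant to rule out.

```go
// restoreWithDeadline continues the sketch above (same package). It bounds
// the wait-for-backup stage with its own timeout, shorter than the caller's
// overall deadline. The 30-minute value is an arbitrary illustration.
func restoreWithDeadline(parent context.Context, interval time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, 30*time.Minute)
	defer cancel()

	backup, err := waitForBackup(ctx, interval)
	if err != nil {
		// Timing out here means no backup ever appeared. Because the whole
		// point of -wait_for_backup_interval is to guarantee the tablet never
		// starts up empty, the only remaining option is to give up and exit.
		return err
	}
	_ = backup // restore from the backup here, then start serving
	return nil
}
```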

@enisoc enisoc merged commit 1c323c3 into vitessio:master Jun 19, 2019
@enisoc enisoc deleted the wait-for-backup branch June 19, 2019 00:01
deepthi added a commit to planetscale/vitess that referenced this pull request Jun 19, 2019