Add flag to wait for backup instead of starting up empty. #4929
Conversation
Signed-off-by: Anthony Yeh <enisoc@planetscale.com>
the tablet will start up empty. This behavior allows you to bootstrap a new
shard before any backups exist.

If the `-wait_for_backup_interval` flag is set to a value greater than zero,
Suggested change:
- If the `-wait_for_backup_interval` flag is set to a value greater than zero,
+ If the `-init_from_backup_retry_interval` flag is set to a value greater than zero,
The name "retry" is not accurate here. The current behavior of -restore_from_backup is that "no backup exists; starting up empty" is considered a normal, non-error result. It doesn't make sense that adding a retry interval would change this semantic, because retries only apply to errors.
But you're retrying until success; that's generally what retries are for. This is changing the behavior of -restore_from_backup, which is only a one-time special case that happens to start up empty.
> but you're retrying until success

The behavior I'm adding is not "retrying until success". If the backup doesn't exist, "success" is currently defined as "start up empty". There is no need nor expectation to retry anything. The thing you tried (-restore_from_backup, which implicitly means "restore from backup if there is one, or else start up empty") succeeded.

I need a name that indicates I'm redefining what success means, not just adding a retry until the existing definition of success is met.

> this is changing the behavior of restore_from_backup, which only is a single time special case, and happens to startup from empty.

I don't follow what this means. Can you reword it if I've missed the point?
I must be misunderstanding the code, but it seems to look for backups until it finds one. What else does init_from_backup_retry_interval mean? What other success is there? If it can't find a backup, it keeps retrying, right? It doesn't start up empty anymore.
If you look at the new behavior in isolation, the word "retry" makes sense. However, if you take into account the context of how the previous behavior worked -- what users have had to understand since -restore_from_backup was first introduced -- then the word "retry" is inaccurate and misleading. In other words, we can't call this "retry" because of legacy.

I'm open to other names that are better than wait_for_backup, but not if the name actively misleads users.
	restoreFromBackup     = flag.Bool("restore_from_backup", false, "(init restore parameter) will check BackupStorage for a recent backup at startup and start there")
	restoreConcurrency    = flag.Int("restore_concurrency", 4, "(init restore parameter) how many concurrent files to restore at once")
	waitForBackupInterval = flag.Duration("wait_for_backup_interval", 0, "(init restore parameter) if this is greater than 0, instead of starting up empty when no backups are found, keep checking at this interval for a backup to appear")
Suggested change:
- waitForBackupInterval = flag.Duration("wait_for_backup_interval", 0, "(init restore parameter) if this is greater than 0, instead of starting up empty when no backups are found, keep checking at this interval for a backup to appear")
+ restoreFromBackupRetryInterval = flag.Duration("init_from_backup_retry_interval", 0, "(init restore parameter) when greater than 0, continuously retry restoring backup until success before starting up")
If we want to keep with the bad name restore_from_backup, we can also name this restore_from_backup_retry_interval.
			break
		}

		log.Infof("No backup found. Waiting %v (from -wait_for_backup_interval flag) to check again.", waitForBackupInterval)
Suggested change:
- log.Infof("No backup found. Waiting %v (from -wait_for_backup_interval flag) to check again.", waitForBackupInterval)
+ log.Infof("No backup found. Retrying in %v (from -wait_for_backup_interval flag).", waitForBackupInterval)
@@ -131,7 +131,7 @@ func (agent *ActionAgent) RestoreFromBackup(ctx context.Context, logger logutil.
	l := logutil.NewTeeLogger(logutil.NewConsoleLogger(), logger)

	// now we can run restore
-	err = agent.restoreDataLocked(ctx, l, true /* deleteBeforeRestore */)
+	err = agent.restoreDataLocked(ctx, l, 0 /* waitForBackupInterval */, true /* deleteBeforeRestore */)
Suggested change:
- err = agent.restoreDataLocked(ctx, l, 0 /* waitForBackupInterval */, true /* deleteBeforeRestore */)
+ retryInterval := 0 * time.Second
+ err = agent.restoreDataLocked(ctx, l, retryInterval, true /* deleteBeforeRestore */)
Note that user documentation under …

Ah thanks for the heads-up @sougou. I'll open a PR against vitessio/website after this is approved.
It seems that with the flag set, we will keep looking for a backup at intervals until the context times out or we find a good backup (whichever comes first). Has there been any thought or discussion as to whether the timeout should be shorter than what is passed in?
@deepthi Do you mean, voluntarily set a shorter timeout for the "wait for backup" stage so we have a chance to still do something after that times out, before the overall context times out? Normally that's a good practice, but I don't think it applies in this case. The intended meaning of "wait for backup" is "never proceed to start up until you successfully restore from a backup"; the idea is to provide a guarantee that the tablet will never start up empty. So if we time out while waiting for the backup, we're not allowed to do anything else besides quit anyway.
This allows shard bootstrap workflows that start up tablets concurrently while uploading an initial backup. Once the initial backup is complete, the tablets will restore from it and start serving. Without this, the workflow has to wait for the backup to complete before launching any tablets, which makes it slower and more complex.
Signed-off-by: Anthony Yeh enisoc@planetscale.com