vtbackup: Clean up and add policy enforcement #1

enisoc · 2019-06-05T17:52:29Z

@dkhenry This builds on your WIP vtbackup PR, adding some features that will be used as part of my design for automated backups. The doc comment in vtbackup.go explains how this will be used.

The last thing left is to fix the integration test. Since vtbackup has to run from an empty dir, we'll need to change the test flow a bit so that we don't reuse the data dir of a real tablet. We could either first take a backup off a real tablet and then use vtbackup to update it, or we could exercise vtbackup's initial_backup mode to avoid that.

Signed-off-by: Anthony Yeh <enisoc@planetscale.com>

The lag value can lie. Also, we're guaranteed to hit the snapshot position eventually as long as replication is making progress. If our goal is a lag value, we may never catch up if replication is slower than the rate of new transactions on the master.

Signed-off-by: Anthony Yeh <enisoc@planetscale.com>
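
The reasoning in that commit message can be sketched roughly as follows. This is a minimal illustration, not the vtbackup code: `position`, `waitForPosition`, and `fakeReplica` are hypothetical names, and an integer stands in for a real replication position. The goal is captured once, so it cannot move while the replica catches up, which is what makes the loop terminate as long as replication makes progress.

```go
package main

import (
	"errors"
	"fmt"
)

// position is a hypothetical stand-in for a replication position.
// A real GTID set is more complex; an integer is enough to show the idea.
type position int

// atLeast reports whether p has reached or passed goal.
func (p position) atLeast(goal position) bool { return p >= goal }

// waitForPosition polls current until it reaches goal or maxPolls is used up.
// A failed poll is skipped and retried rather than treated as fatal.
func waitForPosition(goal position, maxPolls int, current func() (position, error)) (position, error) {
	for i := 0; i < maxPolls; i++ {
		pos, err := current()
		if err != nil {
			continue
		}
		if pos.atLeast(goal) {
			return pos, nil
		}
	}
	return 0, errors.New("gave up before reaching the goal position")
}

// fakeReplica returns a poll function that advances by step on every call.
func fakeReplica(start, step position) func() (position, error) {
	pos := start
	return func() (position, error) {
		pos += step
		return pos, nil
	}
}

func main() {
	// The goal is snapshotted once; later writes on the master do not move it.
	goal := position(10)
	pos, err := waitForPosition(goal, 100, fakeReplica(0, 3))
	fmt.Println(pos, err)
}
```

A lag-based goal would instead compare against a target that keeps moving with every new transaction, so the loop could run forever on a slow replica.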

deepthi · 2019-06-05T20:23:11Z

I haven't looked at this in detail, but did we decide to keep this proprietary? What happens to vitessio#4858?

enisoc · 2019-06-05T22:03:18Z

@deepthi This is a PR targeting @dkhenry's branch. I did it this way since vitessio#4858 isn't merged yet. If this gets merged here, these commits will add onto vitessio#4858.

deepthi · 2019-06-05T22:04:20Z

Oh, I missed that. Makes sense now.

sverch

Just some questions here. Also it would be nice to have some test cases, but I can understand how that might require a lot of scaffolding.

But even if you don't have golang tests, some instructions or scripts for how a user could quickly test that backup/restore is working properly would help.

sverch · 2019-06-06T18:27:05Z

go/cmd/vtbackup/vtbackup.go

+* Old backups for the shard are removed.
+
+Whatever system launches vtbackup is responsible for the following:
+* Running vtbackup with similar flags that would be used for a vttablet and


Are there public docs for vtbackup?

enisoc · Jun 7, 2019

vtbackup doesn't exist yet. This comment is the first thing anyone has written about it, so it is essentially the design doc. We can add user docs after we verify that this is a good way to go.

sverch · 2019-06-06T18:29:54Z

go/cmd/vtbackup/vtbackup.go

- log.Fatalf("Error Starting Replication %v", err)
+ // In initial_backup mode, just take a backup of this empty database.
+ if *initialBackup {
+ // Take a backup of this empty DB without restoring anything.


Why do we want to take a backup of the empty database?

dkhenry · Jun 6, 2019

"Empty database" isn't a good way to describe this. The database starts empty, but it will catch up on replication, so it won't be empty when we take the backup. Maybe this should read "take a backup starting from an empty database" or something like that.

enisoc · Jun 7, 2019

The initialBackup mode I'm adding here really does intend to take a backup of an empty database. I'm going to upload this empty backup to the shard's backup location before launching any tablets at all. In this way, I can bootstrap a new shard without ever having to take down any tablet to tell it to take a backup.

Without this, you have to take down a tablet once, when you first deploy the shard. That may not be possible if the user asked for a small number of replicas (e.g. you can't take a backup on a master if they only asked for 1 tablet). Doing it this way also ensures that once the shard is up, it's up for good; we don't have to flap from available to unavailable (to take the initial backup) and then back again, which users might notice if they're starting to test things out. And we can assume going forward that no tablet will ever go down for backup, ever, which simplifies our automation and operations.
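
A rough sketch of how seeding an empty backup could be made safe to re-run is shown below. It is illustrative only: `backupStore` and `seedInitialBackup` are hypothetical names, and skipping the upload when any backup already exists is an assumption about what idempotence would mean here, not a description of the actual implementation.

```go
package main

import "fmt"

// backupStore is a hypothetical stand-in for a shard's backup location.
type backupStore struct {
	backups []string
}

// seedInitialBackup records an empty backup only if the shard has none yet
// and reports whether it did so. Skipping when a backup already exists is an
// assumption about how re-runs are kept harmless.
func seedInitialBackup(store *backupStore, name string) bool {
	if len(store.backups) > 0 {
		return false
	}
	store.backups = append(store.backups, name)
	return true
}

func main() {
	store := &backupStore{}
	fmt.Println(seedInitialBackup(store, "initial")) // first run seeds the shard
	fmt.Println(seedInitialBackup(store, "initial")) // re-run changes nothing
	fmt.Println(len(store.backups))
}
```

With a seed backup in place, every tablet launched afterwards can restore from it and start replicating, so no tablet has to be stopped to produce the first backup.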

sverch · 2019-06-06T18:31:51Z

go/cmd/vtbackup/vtbackup.go

+ dbName = fmt.Sprintf("vt_%s", *initKeyspace)
+ }
+
+ log.Infof("Restoring latest backup from directory %v", backupDir)


Is this trying to restore from a backup in order to bootstrap replication? Is this logic duplicated in the normal startup process for a new replica tablet?

enisoc · Jun 7, 2019

Yes, this is the same logic a new replica tablet would execute to get the baseline data needed to start replication. MySQL itself doesn't support "give me everything you have from the beginning of time". You have to already have all the stuff that has fallen out of the binlogs due to rotation.

dkhenry · 2019-06-07T04:55:03Z

go/cmd/vtbackup/vtbackup.go

- // replication reporter may restart replication at the
- // next health check if it thinks it should. We do not
- // alter replication here.
+ return fmt.Errorf("can't run vtbackup because data directory is not empty")


Would there ever be a time when we want to allow a restore on a non-empty tablet? I am thinking of someone who makes a PV and always wants to reuse that PV so they don't have to copy all the data to it.

enisoc · Jun 7, 2019

That might make sense as an additional feature someday. For now, I needed to enforce this invariant to make sure vtbackup is safe, by allowing the assumption that it always starts empty.

dkhenry · 2019-06-07T05:01:18Z

go/cmd/vtbackup/vtbackup.go

+ // to the goal position).
+ backupTime := time.Now()
+
+ if restorePos.Equal(masterPos) {


Just to double check: this will never result in removing your last backup as long as minRetentionCount is at least 1, correct? It looks that way.

enisoc · Jun 7, 2019

Right. The pruning is orthogonal to whether you just took a new backup or not. It's not going to get tricked into thinking we just created a new backup, because it looks at the full list of backups right before making its decisions.

dkhenry · 2019-06-07T05:02:05Z

go/cmd/vtbackup/vtbackup.go

+ log.Warningf("Error getting replication status: %v", statusErr)
+ continue
+ }
+ if status.Position.AtLeast(masterPos) {


enisoc · 2019-06-07T17:41:56Z

Also it would be nice to have some test cases

The best way to test this type of logic is by exercising actual backup flows. @dkhenry already started the scaffolding for those new tests in vitessio#4858. We just need to update them to work with the new vtbackup flags, and add some more scenarios for the new features.

enisoc added 6 commits June 4, 2019 16:18

vtbackup: Document and clean up.

af81ebc

Signed-off-by: Anthony Yeh <enisoc@planetscale.com>

vtbackup: Incorporate mysqlctld behavior so that's not needed.

9a43d10

Signed-off-by: Anthony Yeh <enisoc@planetscale.com>

vtbackup: Add to vitess/lite docker images.

e89e429

Signed-off-by: Anthony Yeh <enisoc@planetscale.com>

vtbackup: Add -initial_backup flag for seeding an empty backup.

e10c624

Signed-off-by: Anthony Yeh <enisoc@planetscale.com>

vtbackup: Add backup interval and pruning with retention policy.

22bccdd

Signed-off-by: Anthony Yeh <enisoc@planetscale.com>

enisoc requested a review from dkhenry June 5, 2019 17:52

enisoc requested a review from sougou as a code owner June 5, 2019 17:52

sverch reviewed Jun 6, 2019

dkhenry reviewed Jun 7, 2019

enisoc added 2 commits June 7, 2019 12:41

vtbackup: Make -initial_backup mode idempotent.

7e9e3ba

Signed-off-by: Anthony Yeh <enisoc@planetscale.com>

vtbackup: Add docker/k8s/vtbackup image.

62fd75c

Signed-off-by: Anthony Yeh <enisoc@planetscale.com>

dkhenry merged commit 3151386 into dk-backup-only Jun 10, 2019

enisoc deleted the enisoc-vtbackup branch June 11, 2019 00:15

systay pushed a commit that referenced this pull request Jan 21, 2021

Merge pull request #1 from systay/frouioui/master

2d57dab

Updated test
