Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vtbackup: Clean up and add policy enforcement #1

Merged
merged 8 commits into from
Jun 10, 2019

Conversation

enisoc
Copy link

@enisoc enisoc commented Jun 5, 2019

@dkhenry This builds on your WIP vtbackup PR, adding some features that will be used as part of my design for automated backups. The doc comment in vtbackup.go explains how this will be used.

The last thing left is to fix the integration test. Since vtbackup has to run from an empty dir, we'll need to change the test flow a bit so that we don't reuse the data dir of a real tablet. We could either first take a backup off a real tablet and then use vtbackup to update it, or we could exercise vtbackup's initial_backup mode to avoid that.

Signed-off-by: Anthony Yeh <enisoc@planetscale.com>
Signed-off-by: Anthony Yeh <enisoc@planetscale.com>
Signed-off-by: Anthony Yeh <enisoc@planetscale.com>
The lag value can lie. Also we're guaranteed to hit the snapshot
position eventually as long as replication is making progress. If our
goal is a lag value, we may never catch up if replication is slower than
the rate of new transactions on the master.

Signed-off-by: Anthony Yeh <enisoc@planetscale.com>
Signed-off-by: Anthony Yeh <enisoc@planetscale.com>
Signed-off-by: Anthony Yeh <enisoc@planetscale.com>
@enisoc enisoc requested a review from dkhenry June 5, 2019 17:52
@enisoc enisoc requested a review from sougou as a code owner June 5, 2019 17:52
@deepthi
Copy link
Collaborator

deepthi commented Jun 5, 2019

I haven't looked at this in detail, but did we decide to keep this proprietary? What happens to vitessio#4858?

@enisoc
Copy link
Author

enisoc commented Jun 5, 2019

@deepthi This is a PR targeting @dkhenry's branch. I did it this way since vitessio#4858 isn't merged yet. If this gets merged here, these commits will add onto vitessio#4858.

@deepthi
Copy link
Collaborator

deepthi commented Jun 5, 2019

@deepthi This is a PR targeting @dkhenry's branch. I did it this way since vitessio#4858 isn't merged yet. If this gets merged here, these commits will add onto vitessio#4858.

Oh, I missed that. Makes sense now.

Copy link

@sverch sverch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some questions here. Also it would be nice to have some test cases, but I can understand how that might require a lot of scaffolding.

But even if you don't have golang tests, some instructions or scripts for how a user could quickly test that backup/restore is working properly would help.

* Old backups for the shard are removed.

Whatever system launches vtbackup is responsible for the following:
* Running vtbackup with similar flags that would be used for a vttablet and
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there public docs for vtbackup?

Copy link
Author

@enisoc enisoc Jun 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vtbackup doesn't exist yet. This comment is the first time anything has been written about it. This comment is essentially the design doc. We can add user docs after we verify that this is a good way to go.

go/cmd/vtbackup/vtbackup.go Show resolved Hide resolved
log.Fatalf("Error Starting Replication %v", err)
// In initial_backup mode, just take a backup of this empty database.
if *initialBackup {
// Take a backup of this empty DB without restoring anything.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want to take a backup of the empty database?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty Database isn't a good way to categorize this. The database starts empty, but it will catch up on replication, so it won't be empty when we take the backup. Maybe this should read take a backup starting from an empty database or some thing like that

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initialBackup mode I'm adding here really does intend to take a backup of an empty database. I'm going to upload this empty backup to the shard's backup location before launching any tablets at all. In this way, I can bootstrap a new shard without ever having to take down any tablet to tell it to take a backup.

Without this, you have to take down a tablet once, when you first deploy the shard. That may not be possible if the user asked for a small number of replicas (e.g. you can't take a backup on a master if they only asked for 1 tablet). Doing it this way also ensures that once the shard is up, it's up for good; we don't have to flap from available to unavailable (to take the initial backup) and then back again, which users might notice if they're starting to test things out. And we can assume going forward that no tablet will ever go down for backup, ever, which simplifies our automation and operations.

dbName = fmt.Sprintf("vt_%s", *initKeyspace)
}

log.Infof("Restoring latest backup from directory %v", backupDir)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is trying to restore from backup to bootstrap replication? Is this logic duplicated in the normal new replica tablet startup process?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is the same logic a new replica tablet would execute to get the baseline data needed to start replication. MySQL itself doesn't support "give me everything you have from the beginning of time". You have to already have all the stuff that has fallen out of the binlogs due to rotation.

// replication reporter may restart replication at the
// next health check if it thinks it should. We do not
// alter replication here.
return fmt.Errorf("can't run vtbackup because data directory is not empty")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would there ever be a time we want to enable restore from a non-empty tablet. I am thinking if someone makes a PV and they just want to always use that PV to not have to copy all the data to it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might make sense as an additional feature someday. For now, I needed to enforce this invariant to make sure vtbackup is safe, by allowing the assumption that it always starts empty.

// to the goal position).
backupTime := time.Now()

if restorePos.Equal(masterPos) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to double check, this will never result in you removing your last backup as long as minRetentionCount is 1 ( it looks to be that way )

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. The pruning is orthogonal to whether you just took a new backup or not. It's not going to get tricked into thinking we just created a new backup, because it looks at the full list of backups right before making its decisions.

log.Warningf("Error getting replication status: %v", statusErr)
continue
}
if status.Position.AtLeast(masterPos) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@enisoc
Copy link
Author

enisoc commented Jun 7, 2019

Also it would be nice to have some test cases

The best way to test this type of logic is by exercising actual backup flows. @dkhenry already started the scaffolding for those new tests in vitessio#4858. We just need to update them to work with the new vtbackup flags, and add some more scenarios for the new features.

Signed-off-by: Anthony Yeh <enisoc@planetscale.com>
Signed-off-by: Anthony Yeh <enisoc@planetscale.com>
@dkhenry dkhenry merged commit 3151386 into dk-backup-only Jun 10, 2019
@enisoc enisoc deleted the enisoc-vtbackup branch June 11, 2019 00:15
systay pushed a commit that referenced this pull request Jan 21, 2021
harshit-gangal pushed a commit that referenced this pull request Sep 7, 2022
* decouple olap tx timeout from oltp tx timeout

Since workload=olap bypasses the query timeouts
(--queryserver-config-query-timeout) and also row limits, the natural
assumption is that it also bypasses the transaction timeout.

This is not the case, e.g. for a tablet where the
--queryserver-config-transaction-timeout is 10.

This commit:

 * Adds new CLI flag and YAML field to independently configure TX
   timeouts for OLAP workloads (--queryserver-config-olap-transaction-timeout).
 * Decouples TX kill interval from OLTP TX timeout via new CLI flag and
   YAML field (--queryserver-config-transaction-killer-interval).

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #1

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #2 consolidate timeout logic in sc

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: remove unused tx killer flag

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: update 15_0_0_summary.md

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: fix race cond

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #3 -txProps.timeout, +sc.expiryTime

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #4 -atomic.Value for expiryTime

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: fix race cond (without atomic.Value)

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #5 -unused funcs, fix comments, set ticks interval once

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #5 +txkill tests

Signed-off-by: Max Englander <max@planetscale.com>

* revert fmt changes

Signed-off-by: Max Englander <max@planetscale.com>

* implement pr review suggestion

Signed-off-by: Max Englander <max@planetscale.com>

Signed-off-by: Max Englander <max@planetscale.com>
notfelineit pushed a commit that referenced this pull request Sep 21, 2022
* decouple olap tx timeout from oltp tx timeout

Since workload=olap bypasses the query timeouts
(--queryserver-config-query-timeout) and also row limits, the natural
assumption is that it also bypasses the transaction timeout.

This is not the case, e.g. for a tablet where the
--queryserver-config-transaction-timeout is 10.

This commit:

 * Adds new CLI flag and YAML field to independently configure TX
   timeouts for OLAP workloads (--queryserver-config-olap-transaction-timeout).
 * Decouples TX kill interval from OLTP TX timeout via new CLI flag and
   YAML field (--queryserver-config-transaction-killer-interval).

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #1

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #2 consolidate timeout logic in sc

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: remove unused tx killer flag

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: update 15_0_0_summary.md

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: fix race cond

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #3 -txProps.timeout, +sc.expiryTime

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #4 -atomic.Value for expiryTime

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: fix race cond (without atomic.Value)

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #5 -unused funcs, fix comments, set ticks interval once

Signed-off-by: Max Englander <max@planetscale.com>

* decouple ol{a,t}p tx timeouts: pr comments #5 +txkill tests

Signed-off-by: Max Englander <max@planetscale.com>

* fix flags

Signed-off-by: Max Englander <max@planetscale.com>

Signed-off-by: Max Englander <max@planetscale.com>
mattlord added a commit that referenced this pull request Nov 17, 2022
1. When opening the engine, restart any vdiffs that are in the
   started state as this indicates it did not complete and was
   unable to save the final state and must be restarted.
2. When a vdiff run fails, retry saving the error state with an
   exponential backoff until the engine shuts down. This way
   the normal retry mechanism will kick in OR #1 will kick in
   when the engine is next opened on the primary tablet.

Signed-off-by: Matt Lord <mattalord@gmail.com>
ajm188 pushed a commit that referenced this pull request Dec 5, 2022
* Prevent orphaned VDiffs in two ways...

1. When opening the engine, restart any vdiffs that are in the
   started state as this indicates it did not complete and was
   unable to save the final state and must be restarted.
2. When a vdiff run fails, retry saving the error state with an
   exponential backoff until the engine shuts down. This way
   the normal retry mechanism will kick in OR #1 will kick in
   when the engine is next opened on the primary tablet.

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Handle failures before vdiff_table records are created

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Add more ephemeral client errors

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Show vdiff state of error even if no vdiff_table records

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Minor cleanup

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Add vdiff2 unit tests

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Add unit test for retry

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Small cleanup

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Addressing review comments and other improvements

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Use warning log for ... warnings :-)

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Minor touch ups

Signed-off-by: Matt Lord <mattalord@gmail.com>

Signed-off-by: Matt Lord <mattalord@gmail.com>
mattlord added a commit that referenced this pull request Dec 12, 2022
* Prevent orphaned VDiffs in two ways...

1. When opening the engine, restart any vdiffs that are in the
   started state as this indicates it did not complete and was
   unable to save the final state and must be restarted.
2. When a vdiff run fails, retry saving the error state with an
   exponential backoff until the engine shuts down. This way
   the normal retry mechanism will kick in OR #1 will kick in
   when the engine is next opened on the primary tablet.

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Handle failures before vdiff_table records are created

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Add more ephemeral client errors

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Show vdiff state of error even if no vdiff_table records

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Minor cleanup

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Add vdiff2 unit tests

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Add unit test for retry

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Small cleanup

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Addressing review comments and other improvements

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Use warning log for ... warnings :-)

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Minor touch ups

Signed-off-by: Matt Lord <mattalord@gmail.com>

Signed-off-by: Matt Lord <mattalord@gmail.com>
shlomi-noach pushed a commit that referenced this pull request Dec 14, 2022
…vitessio#11768) (vitessio#11943)

* Prevent orphaned VDiffs in two ways...

1. When opening the engine, restart any vdiffs that are in the
   started state as this indicates it did not complete and was
   unable to save the final state and must be restarted.
2. When a vdiff run fails, retry saving the error state with an
   exponential backoff until the engine shuts down. This way
   the normal retry mechanism will kick in OR #1 will kick in
   when the engine is next opened on the primary tablet.

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Handle failures before vdiff_table records are created

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Add more ephemeral client errors

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Show vdiff state of error even if no vdiff_table records

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Minor cleanup

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Add vdiff2 unit tests

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Add unit test for retry

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Small cleanup

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Addressing review comments and other improvements

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Use warning log for ... warnings :-)

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Minor touch ups

Signed-off-by: Matt Lord <mattalord@gmail.com>

Signed-off-by: Matt Lord <mattalord@gmail.com>

Signed-off-by: Matt Lord <mattalord@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants