Add flag to wait for backup instead of starting up empty. #4929

Merged (1 commit) on Jun 19, 2019
12 changes: 10 additions & 2 deletions doc/BackupAndRestore.md
@@ -168,8 +168,16 @@ to restore a backup to that tablet.

As noted in the [Prerequisites](#prerequisites) section, the flag is
generally enabled all of the time for all of the tablets in a shard.
If Vitess cannot find a backup in the Backup Storage system, it just
starts the vttablet as a new tablet.
By default, if Vitess cannot find a backup in the Backup Storage system,
the tablet will start up empty. This behavior allows you to bootstrap a new
shard before any backups exist.

If the `-wait_for_backup_interval` flag is set to a value greater than zero,
Contributor:

Suggested change:
- If the `-wait_for_backup_interval` flag is set to a value greater than zero,
+ If the `-init_from_backup_retry_interval` flag is set to a value greater than zero,

Member Author:

The name "retry" is not accurate here. The current behavior of -restore_from_backup is that "no backup exists; starting up empty" is considered a normal, non-error result. It doesn't make sense that adding a retry interval will change this semantic because retries only apply to errors.

Contributor:

But you're retrying until success. That's generally what retries are for. This is changing the behavior of restore_from_backup, which is only a single-time special case and happens to start up empty.

Member Author:

> But you're retrying until success

The behavior I'm adding is not "retrying until success". If the backup doesn't exist, "success" is currently defined as "start up empty". There is no need nor expectation to retry anything. The thing you tried (-restore_from_backup, which implicitly means "restore from backup if there is one or else start up empty") succeeded.

I need a name that indicates I'm redefining what success means, not just adding a retry until the existing definition of success is met.

> This is changing the behavior of restore_from_backup, which is only a single-time special case and happens to start up empty.

I don't follow what this means. Can you reword it if I've missed the point?

Contributor:

I must misunderstand the code, but it seems to look for backups until it finds one. What else does init_from_backup_retry_interval mean? What other success is there? If it can't find a backup, it keeps retrying, right? It doesn't start up empty anymore.

Member Author:

If you look at the new behavior in isolation, the word "retry" makes sense. However, if you take into account the context of how the previous behavior worked -- what users have had to understand since -restore_from_backup was first introduced -- then the word "retry" is inaccurate and misleading. In other words, we can't call this "retry" because of legacy.

I'm open to other names that are better than wait_for_backup, but not if the name actively misleads users.

the tablet will instead keep checking for a backup to appear at that interval.
This can be used to ensure tablets launched concurrently while an initial backup
is being seeded for the shard (e.g. uploaded from cold storage or created by
another tablet) will wait until the proper time and then pull the new backup
when it's ready.

``` sh
vttablet ... -backup_storage_implementation=file \
  ...
```
4 changes: 4 additions & 0 deletions go/vt/mysqlctl/backup.go
@@ -55,6 +55,10 @@ var (
// ErrNoBackup is returned when there is no backup.
ErrNoBackup = errors.New("no available backup")

// ErrNoCompleteBackup is returned when there is at least one backup,
// but none of them are complete.
ErrNoCompleteBackup = errors.New("backup(s) found but none are complete")

// ErrExistingDB is returned when there's already an active DB.
ErrExistingDB = errors.New("skipping restore due to existing database")

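This PR replaces an inline `errors.New` in the backup engines with the exported sentinel `ErrNoCompleteBackup`, alongside the existing `ErrNoBackup`, so callers can branch on the result with a simple equality check instead of matching error strings. A minimal caller-side sketch of that pattern; `handleRestoreErr` and the `main` function here are hypothetical, not part of the PR:

```go
package main

import (
	"fmt"

	"vitess.io/vitess/go/vt/mysqlctl"
)

// handleRestoreErr is a hypothetical helper showing how a caller can tell the
// "no usable backup yet" sentinels apart from genuine restore failures.
func handleRestoreErr(err error) {
	switch err {
	case nil:
		fmt.Println("restore succeeded")
	case mysqlctl.ErrNoBackup, mysqlctl.ErrNoCompleteBackup:
		// Nothing usable in backup storage yet: safe to wait and check again,
		// or to start up empty if no wait interval was requested.
		fmt.Println("no usable backup:", err)
	default:
		// Anything else is a real failure and should not be retried blindly.
		fmt.Println("restore failed:", err)
	}
}

func main() {
	handleRestoreErr(mysqlctl.ErrNoCompleteBackup)
}
```

The restore loop added in restore.go further down relies on exactly this kind of equality check to decide which results are worth waiting out.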
3 changes: 1 addition & 2 deletions go/vt/mysqlctl/builtinbackupengine.go
@@ -20,7 +20,6 @@ import (
"bufio"
"context"
"encoding/json"
"errors"
"fmt"
"io"
"io/ioutil"
@@ -543,7 +542,7 @@ func (be *BuiltinBackupEngine) ExecuteRestore(
// There is at least one attempted backup, but none could be read.
// This implies there is data we ought to have, so it's not safe to start
// up empty.
return mysql.Position{}, errors.New("backup(s) found but none could be read, unsafe to start up empty, restart to retry restore")
return mysql.Position{}, ErrNoCompleteBackup
}

// Starting from here we won't be able to recover if we get stopped by a cancelled
3 changes: 1 addition & 2 deletions go/vt/mysqlctl/xtrabackupengine.go
@@ -20,7 +20,6 @@ import (
"bufio"
"context"
"encoding/json"
"errors"
"flag"
"io"
"io/ioutil"
@@ -268,7 +267,7 @@ func (be *XtrabackupEngine) ExecuteRestore(
// There is at least one attempted backup, but none could be read.
// This implies there is data we ought to have, so it's not safe to start
// up empty.
return zeroPosition, errors.New("backup(s) found but none could be read, unsafe to start up empty, restart to retry restore")
return zeroPosition, ErrNoCompleteBackup
}

// Starting from here we won't be able to recover if we get stopped by a cancelled
4 changes: 2 additions & 2 deletions go/vt/vttablet/tabletmanager/action_agent.go
@@ -303,9 +303,9 @@ func NewActionAgent(
// - restoreFromBackup is not set: we initHealthCheck right away
if *restoreFromBackup {
go func() {
// restoreFromBackup wil just be a regular action
// restoreFromBackup will just be a regular action
// (same as if it was triggered remotely)
if err := agent.RestoreData(batchCtx, logutil.NewConsoleLogger(), false /* deleteBeforeRestore */); err != nil {
if err := agent.RestoreData(batchCtx, logutil.NewConsoleLogger(), *waitForBackupInterval, false /* deleteBeforeRestore */); err != nil {
println(fmt.Sprintf("RestoreFromBackup failed: %v", err))
log.Exitf("RestoreFromBackup failed: %v", err)
}
35 changes: 29 additions & 6 deletions go/vt/vttablet/tabletmanager/restore.go
@@ -19,6 +19,7 @@ package tabletmanager
import (
"flag"
"fmt"
"time"

"vitess.io/vitess/go/vt/vterrors"

@@ -37,26 +38,27 @@ import (
// It is only enabled if restore_from_backup is set.

var (
restoreFromBackup = flag.Bool("restore_from_backup", false, "(init restore parameter) will check BackupStorage for a recent backup at startup and start there")
restoreConcurrency = flag.Int("restore_concurrency", 4, "(init restore parameter) how many concurrent files to restore at once")
restoreFromBackup = flag.Bool("restore_from_backup", false, "(init restore parameter) will check BackupStorage for a recent backup at startup and start there")
restoreConcurrency = flag.Int("restore_concurrency", 4, "(init restore parameter) how many concurrent files to restore at once")
waitForBackupInterval = flag.Duration("wait_for_backup_interval", 0, "(init restore parameter) if this is greater than 0, instead of starting up empty when no backups are found, keep checking at this interval for a backup to appear")
Contributor:

Suggested change:
- waitForBackupInterval = flag.Duration("wait_for_backup_interval", 0, "(init restore parameter) if this is greater than 0, instead of starting up empty when no backups are found, keep checking at this interval for a backup to appear")
+ restoreFromBackupRetryInterval = flag.Duration("init_from_backup_retry_interval", 0, "(init restore parameter) when greater than 0, continuously retry restoring backup until success before starting up")

If we want to keep with the bad name restore_from_backup, we can also name this restore_from_backup_retry_interval.

)

// RestoreData is the main entry point for backup restore.
// It will either work, fail gracefully, or return
// an error in case of a non-recoverable error.
// It takes the action lock so no RPC interferes.
func (agent *ActionAgent) RestoreData(ctx context.Context, logger logutil.Logger, deleteBeforeRestore bool) error {
func (agent *ActionAgent) RestoreData(ctx context.Context, logger logutil.Logger, waitForBackupInterval time.Duration, deleteBeforeRestore bool) error {
if err := agent.lock(ctx); err != nil {
return err
}
defer agent.unlock()
if agent.Cnf == nil {
return fmt.Errorf("cannot perform restore without my.cnf, please restart vttablet with a my.cnf file specified")
}
return agent.restoreDataLocked(ctx, logger, deleteBeforeRestore)
return agent.restoreDataLocked(ctx, logger, waitForBackupInterval, deleteBeforeRestore)
}

func (agent *ActionAgent) restoreDataLocked(ctx context.Context, logger logutil.Logger, deleteBeforeRestore bool) error {
func (agent *ActionAgent) restoreDataLocked(ctx context.Context, logger logutil.Logger, waitForBackupInterval time.Duration, deleteBeforeRestore bool) error {
// change type to RESTORE (using UpdateTabletFields so it's
// always authorized)
var originalType topodatapb.TabletType
@@ -80,7 +82,28 @@ func (agent *ActionAgent) restoreDataLocked(ctx context.Context, logger logutil.
localMetadata := agent.getLocalMetadataValues(originalType)
tablet := agent.Tablet()
dir := fmt.Sprintf("%v/%v", tablet.Keyspace, tablet.Shard)
pos, err := mysqlctl.Restore(ctx, agent.Cnf, agent.MysqlDaemon, dir, *restoreConcurrency, agent.hookExtraEnv(), localMetadata, logger, deleteBeforeRestore, topoproto.TabletDbName(tablet))

// Loop until a backup exists, unless we were told to give up immediately.
var pos mysql.Position
var err error
for {
pos, err = mysqlctl.Restore(ctx, agent.Cnf, agent.MysqlDaemon, dir, *restoreConcurrency, agent.hookExtraEnv(), localMetadata, logger, deleteBeforeRestore, topoproto.TabletDbName(tablet))
if waitForBackupInterval == 0 {
break
}
// We only retry a specific set of errors. The rest we return immediately.
if err != mysqlctl.ErrNoBackup && err != mysqlctl.ErrNoCompleteBackup {
break
}

log.Infof("No backup found. Waiting %v (from -wait_for_backup_interval flag) to check again.", waitForBackupInterval)
Contributor:

Suggested change:
- log.Infof("No backup found. Waiting %v (from -wait_for_backup_interval flag) to check again.", waitForBackupInterval)
+ log.Infof("No backup found. Retrying in %v (from -wait_for_backup_interval flag).", waitForBackupInterval)

select {
case <-ctx.Done():
return ctx.Err()
case <-time.After(waitForBackupInterval):
}
}

switch err {
case nil:
// Starting from here we won't be able to recover if we get stopped by a cancelled
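The loop above combines two common Go idioms: treating only specific sentinel errors as "not ready yet, keep waiting," and sleeping between checks with a select so a cancelled context ends the wait immediately. A standalone sketch of that polling pattern under assumed names (`pollUntilReady` and `errNotReady` are illustrative, not Vitess APIs):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNotReady stands in for sentinels like mysqlctl.ErrNoBackup: the check did
// not fail, the thing we are waiting for simply does not exist yet.
var errNotReady = errors.New("not ready yet")

// pollUntilReady keeps calling check at the given interval until it returns
// nil or a non-retryable error, or until ctx is cancelled. With interval == 0
// it gives up after the first attempt, mirroring the flag's default behavior.
func pollUntilReady(ctx context.Context, interval time.Duration, check func() error) error {
	for {
		err := check()
		if interval == 0 {
			// Single attempt, whatever its result.
			return err
		}
		if err != errNotReady {
			// nil (success) or a real failure: stop polling either way.
			return err
		}
		// Sleep for one interval, but wake immediately if ctx is cancelled.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
}

func main() {
	start := time.Now()
	err := pollUntilReady(context.Background(), 100*time.Millisecond, func() error {
		if time.Since(start) < 300*time.Millisecond {
			return errNotReady // not found yet; keep checking
		}
		return nil // the "backup" appeared
	})
	fmt.Println("done:", err)
}
```

Waiting with `time.After` inside a select, rather than `time.Sleep`, is what lets a caller pass a cancellable context and have a pending wait abort promptly.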
2 changes: 1 addition & 1 deletion go/vt/vttablet/tabletmanager/rpc_backup.go
@@ -131,7 +131,7 @@ func (agent *ActionAgent) RestoreFromBackup(ctx context.Context, logger logutil.
l := logutil.NewTeeLogger(logutil.NewConsoleLogger(), logger)

// now we can run restore
err = agent.restoreDataLocked(ctx, l, true /* deleteBeforeRestore */)
err = agent.restoreDataLocked(ctx, l, 0 /* waitForBackupInterval */, true /* deleteBeforeRestore */)
Contributor:

Suggested change:
- err = agent.restoreDataLocked(ctx, l, 0 /* waitForBackupInterval */, true /* deleteBeforeRestore */)
+ retryInterval := 0*time.Seconds
+ err = agent.restoreDataLocked(ctx, l, 0 retryInterval, true /* deleteBeforeRestore */)


// re-run health check to be sure to capture any replication delay
agent.runHealthCheckLocked()
2 changes: 1 addition & 1 deletion go/vt/wrangler/testlib/backup_test.go
@@ -178,7 +178,7 @@ func TestBackupRestore(t *testing.T) {
RelayLogInfoPath: path.Join(root, "relay-log.info"),
}

if err := destTablet.Agent.RestoreData(ctx, logutil.NewConsoleLogger(), false /* deleteBeforeRestore */); err != nil {
if err := destTablet.Agent.RestoreData(ctx, logutil.NewConsoleLogger(), 0 /* waitForBackupInterval */, false /* deleteBeforeRestore */); err != nil {
t.Fatalf("RestoreData failed: %v", err)
}

30 changes: 27 additions & 3 deletions test/backup.py
@@ -227,6 +227,24 @@ def _restore(self, t, tablet_type='replica'):
t.check_db_var('rpl_semi_sync_slave_enabled', 'OFF')
t.check_db_status('rpl_semi_sync_slave_status', 'OFF')

def _restore_wait_for_backup(self, t, tablet_type='replica'):
"""Erase mysql/tablet dir, then start tablet with wait_for_restore_interval."""
self._reset_tablet_dir(t)

xtra_args = [
'-db-credentials-file', db_credentials_file,
'-wait_for_backup_interval', '1s',
]
if use_xtrabackup:
xtra_args.extend(xtrabackup_args)

t.start_vttablet(wait_for_state=None,
init_tablet_type=tablet_type,
init_keyspace='test_keyspace',
init_shard='0',
supports_backups=True,
extra_args=xtra_args)

def _reset_tablet_dir(self, t):
"""Stop mysql, delete everything including tablet dir, restart mysql."""
extra_args = ['-db-credentials-file', db_credentials_file]
@@ -330,17 +348,22 @@ def _test_backup(self, tablet_type):
test_backup will:
- create a shard with master and replica1 only
- run InitShardMaster
- bring up tablet_replica2 concurrently, telling it to wait for a backup
- insert some data
- take a backup
- insert more data on the master
- bring up tablet_replica2 after the fact, let it restore the backup
- wait for tablet_replica2 to become SERVING
- check all data is right (before+after backup data)
- list the backup, remove it

Args:
tablet_type: 'replica' or 'rdonly'.
"""

# bring up another replica concurrently, telling it to wait until a backup
# is available instead of starting up empty.
self._restore_wait_for_backup(tablet_replica2, tablet_type=tablet_type)

# insert data on master, wait for slave to get it
tablet_master.mquery('vt_test_keyspace', self._create_vt_insert_test)
self._insert_data(tablet_master, 1)
@@ -358,8 +381,9 @@ def _test_backup(self, tablet_type):
# insert more data on the master
self._insert_data(tablet_master, 2)

# now bring up the other slave, letting it restore from backup.
self._restore(tablet_replica2, tablet_type=tablet_type)
# wait for tablet_replica2 to become serving (after restoring)
utils.pause('wait_for_backup')
tablet_replica2.wait_for_vttablet_state('SERVING')

# check the new slave has the data
self._check_data(tablet_replica2, 2, 'replica2 tablet getting data')