Allow retry of interrupted restores #4909

deepthi · 2019-06-05T22:10:12Z

Fixes #4821
The strategy here is to create a temporary file at the beginning of the restore process that is deleted at the end.
If the file exists at the beginning of the restore process, we assume that a previous restore failed and skip the checkNoDB check.

This requires some other changes:

don't wait on mysqld if we are retrying a restore
mysqlctld should initialize mysqld config, but not try to bring it up if it detects an interrupted restore

In addition, changes have been made to the code so that terminating a grpc client that started the restore will cause the restore to terminate as well, so effectively we are undo-ing most of #2310.

The terminate_restore test has been changed to test the new behavior.
Backup tests had to be re-organized because now we have so many cases that if they are all in one test, it times out.

Signed-off-by: deepthi <deepthi@planetscale.com>

msolters · 2019-06-10T14:25:47Z

After testing this PR on our cluster, we last landed on the following behaviour:

Restarted pods (i.e. interrupted restore) do correctly identify a .restore file and attempt to restart the backup restore. However, depending on how much of the data restore has completed, the mysql service (in our case, managed by mysqlctl) may crash on boot:

2019-06-06T20:51:33.993228Z mysqld_safe Logging to '/vtdataroot/tabletdata/error.log'.
2019-06-06T20:51:34.092999Z mysqld_safe Starting mysqld daemon with databases from /vtdataroot/tabletdata/data
2019-06-06T20:51:35.177106Z 0 [Warning] [MY-011068] [Server] The syntax 'expire-logs-days' is deprecated and will be removed in a future release. Please use binlog_expire_logs_seconds instead.
2019-06-06T20:51:35.177348Z 0 [Warning] [MY-010915] [Server] 'NO_ZERO_DATE', 'NO_ZERO_IN_DATE' and 'ERROR_FOR_DIVISION_BY_ZERO' sql modes should be used with strict mode. They will be merged with strict mode in a future release.
2019-06-06T20:51:35.177456Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.16) starting as process 1245
2019-06-06T20:51:35.184752Z 0 [Warning] [MY-010091] [Server] Can't create test file /vtdataroot/tabletdata/data/app-web-app-x38-40x-replica-2.lower-test
2019-06-06T20:51:35.185089Z 0 [Warning] [MY-010159] [Server] Setting lower_case_table_names=2 because file system for /vtdataroot/tabletdata/data/ is case insensitive
2019-06-06T20:51:35.185166Z 0 [ERROR] [MY-013276] [Server] Failed to set datadir to '/vtdataroot/tabletdata/data/' (OS errno: 2 - No such file or directory)
2019-06-06T20:51:35.185735Z 0 [ERROR] [MY-010119] [Server] Aborting
2019-06-06T20:51:35.241353Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.16)  MySQL Community Server - GPL.
2019-06-06T20:51:35.465138Z mysqld_safe mysqld from pid file /vtdataroot/tabletdata/mysql.pid ended

This causes the mysqlctl container to basically restart every 5 minutes as MySQL never successfully comes up.

Meanwhile, vttablet is attempting to restart the restore. However mysqlctl's unavailability can break this process, as GRPC communication with mysqlctl seems to be a requirement of the restore process:

I0606 20:15:49.230822       1 state_change.go:169] Running tablet callback because: restore from backup
I0606 20:15:49.230847       1 state_change.go:300] Disabling query service on type change, reason: not a serving tablet type(RESTORE)
I0606 20:15:49.230873       1 state_change.go:158] Done with post-action state refresh
I0606 20:15:49.230910       1 backup.go:229] Restore: Checking if a restore is in progress
I0606 20:15:49.230944       1 backup.go:253] Restore: looking for a suitable backup to restore
I0606 20:15:49.240403       1 service_map.go:64] Registering updatestream for grpc, disable it with -grpc-updatestream service_map parameter
I0606 20:15:49.240471       1 grpc_server.go:212] Listening for gRPC calls on port 16002
I0606 20:15:49.240620       1 unix_socket.go:36] Not listening on socket file
I0606 20:15:50.436138       1 backupengine.go:79] Restore: found backup web_app/38-40 2019-05-15.012358.app-0481768300 to restore
I0606 20:15:50.436199       1 mysqld.go:298] Waiting for mysqld socket file (/vtdataroot/tabletdata/mysql.sock) to be ready...
I0606 20:16:20.465369       1 backupengine.go:105] Restore: deleting existing files
I0606 20:16:20.465461       1 backup.go:200] Restore: skipping removal of nonexistent InnodbLogGroupHomeDir (/vtdataroot/tabletdata/innodb/logs)
I0606 20:16:20.465472       1 backup.go:185] Restore: removing files in RelayLogPath.* (/vtdataroot/tabletdata/relay-logs/vt-0104873702-relay-bin.*)
I0606 20:16:20.465609       1 backup.go:200] Restore: skipping removal of nonexistent RelayLogIndexPath (/vtdataroot/tabletdata/relay-logs/vt-0104873702-relay-bin.index)
I0606 20:16:20.465620       1 backup.go:200] Restore: skipping removal of nonexistent RelayLogInfoPath (/vtdataroot/tabletdata/relay-logs/relay-log.info)
I0606 20:16:20.465627       1 backup.go:185] Restore: removing files in BinLogPath.* (/vtdataroot/tabletdata/bin-logs/vt-0104873702-bin.*)
I0606 20:16:20.465732       1 backup.go:203] Restore: removing files in DataDir (/vtdataroot/tabletdata/data)
I0606 20:16:21.044877       1 backup.go:200] Restore: skipping removal of nonexistent InnodbDataHomeDir (/vtdataroot/tabletdata/innodb/data)
I0606 20:16:21.045023       1 backupengine.go:110] Restore: reinit config file
I0606 20:16:21.045036       1 mysqld.go:738] Mysqld.ReinitConfig
I0606 20:16:21.045043       1 mysqld.go:742] executing Mysqld.ReinitConfig() remotely via mysqlctld server: /vtdataroot/mysqlctl.sock
W0606 20:16:21.045648       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...
W0606 20:16:22.046225       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...
W0606 20:16:23.686121       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...
W0606 20:16:26.701888       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...
W0606 20:16:31.136101       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...
W0606 20:16:36.905676       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...
W0606 20:16:46.264571       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...
W0606 20:17:01.494523       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...
W0606 20:17:25.213006       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...
W0606 20:18:02.537421       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...

…even if some data exists Signed-off-by: deepthi <deepthi@planetscale.com>

Signed-off-by: deepthi <deepthi@planetscale.com>

deepthi · 2019-06-11T21:27:22Z

@enisoc this is ready for review. We will wait to merge until @msolters is finished with testing, but for now the basic tests work.

go/cmd/mysqlctld/mysqlctld.go

go/vt/mysqlctl/backup.go

go/vt/mysqlctl/backupengine.go

go/vt/mysqlctl/builtinbackupengine.go

go/vt/mysqlctl/backupengine.go

test/backup.py

test/backup_transform.py

deepthi · 2019-06-12T19:10:05Z

@enisoc thank you for the detailed review. I will address the suggested changes.

…case to actually test interrupted restore Signed-off-by: deepthi <deepthi@planetscale.com>

…to allow multiple restores of same vttablet Signed-off-by: deepthi <deepthi@planetscale.com>

msolters · 2019-06-18T21:10:50Z

This seems pretty stable in our testing! Restores are recovering from interruption at any stage.

enisoc

Looks good overall! Just a few last questions.

go/vt/mysqlctl/backup.go

go/vt/mysqlctl/backupengine.go

go/vt/mysqlctl/builtinbackupengine.go

…ing files Signed-off-by: deepthi <deepthi@planetscale.com>

Signed-off-by: deepthi <deepthi@planetscale.com>

deepthi · 2019-06-19T16:55:35Z

@enisoc a more fundamental question comes to mind. The way I have coded this, we delete the restore_in_progress file as soon as we have completed copying the files, but before restarting mysqld. Should it instead move to https://github.com/vitessio/vitess/blob/master/go/vt/vttablet/tabletmanager/restore.go#L154? You could argue that the restore is not complete until then.

enisoc · 2019-06-19T18:10:09Z

Ah that's a good point. We also should consider the restore incomplete if we haven't run mysql_upgrade and populated metadata yet.

I don't think we need to lift the check all the way up to tabletmanager though. Do you think it's enough if we create the state file before running ExecuteRestore in mysqlctl.Restore, and then remove it at the end of mysqlctl.Restore?

I think we're still ok if a failure occurs after that point. The tabletmanager-level stuff should fix itself either on the next vttablet restart, or when the RestoreFromBackup RPC is retried. Do you agree?

deepthi · 2019-06-19T18:59:02Z

I don't think we need to lift the check all the way up to tabletmanager though. Do you think it's enough if we create the state file before running ExecuteRestore in mysqlctl.Restore, and then remove it at the end of mysqlctl.Restore?

I think we're still ok if a failure occurs after that point. The tabletmanager-level stuff should fix itself either on the next vttablet restart, or when the RestoreFromBackup RPC is retried. Do you agree?

That sounds reasonable. I will make the change.

Signed-off-by: deepthi <deepthi@planetscale.com>

enisoc · 2019-06-19T22:48:30Z

go/vt/mysqlctl/builtinbackupengine.go

-	if err := removeExistingFiles(cnf); err != nil {
-		return mysql.Position{}, err
+	// mark restore as in progress
+	if err = createStateFile(cnf); err != nil {


Now that we remove the state file one level up in mysqlctl.Restore(), I was thinking we'd also create it there so all the logic is in one place. That does mean that we'd potentially mark the data dir invalid just because we failed to find a valid backup, but that may be easy to avoid now that you've abstracted findBackupToRestore(). We could call that from mysqlctl.Restore() instead of here, and then create the state file there as well if we find one, before passing the found backup to ExecuteRestore(). What do you think?

That would be ideal, but there is a little bit of difficulty with moving the call to findBackupToRestore into mysqlctl.Restore. The manifest format is different for the two backup methods - builtin and xtrabackup, and we are passing in the appropriate one to findBackupToRestore when we call it. Any suggestions on how we could work around that?

Ah I didn't notice that there's an implementation-specific arg to findBackupToRestore. In that case, it's probably not worth doing surgery on it in this PR.

restore: refactor common code, add functions

beb8415

Signed-off-by: deepthi <deepthi@planetscale.com>

deepthi requested a review from sougou as a code owner June 5, 2019 22:10

deepthi requested a review from dkhenry June 5, 2019 22:12

deepthi added 3 commits June 11, 2019 14:20

restore: allow retry of interrupted restores to start from beginning …

7cac07f

…even if some data exists Signed-off-by: deepthi <deepthi@planetscale.com>

restore: integration test for interrupted restore

e523966

Signed-off-by: deepthi <deepthi@planetscale.com>

restore: reorganize backup/restore tests to avoid timing out

feef084

Signed-off-by: deepthi <deepthi@planetscale.com>

deepthi force-pushed the ds-restore-check branch from 24247e3 to feef084 Compare June 11, 2019 21:26

deepthi requested a review from enisoc June 11, 2019 21:26

enisoc reviewed Jun 12, 2019

View reviewed changes

deepthi added 2 commits June 12, 2019 18:48

restore: refactor, rename, cleanup based on review comments. fix test…

d135722

…case to actually test interrupted restore Signed-off-by: deepthi <deepthi@planetscale.com>

Restore: no need to use background context. Fix tests and xtrabackup …

ed23916

…to allow multiple restores of same vttablet Signed-off-by: deepthi <deepthi@planetscale.com>

deepthi changed the title ~~WIP: Allow retry of interrupted restores~~ Allow retry of interrupted restores Jun 18, 2019

enisoc reviewed Jun 18, 2019

View reviewed changes

go/vt/mysqlctl/backup.go Outdated Show resolved Hide resolved

go/vt/mysqlctl/backupengine.go Outdated Show resolved Hide resolved

go/vt/mysqlctl/builtinbackupengine.go Show resolved Hide resolved

deepthi added 2 commits June 18, 2019 14:56

restore: change file name, create it before stopping mysqld and delet…

72bf2c1

…ing files Signed-off-by: deepthi <deepthi@planetscale.com>

restore: fix comments

712ad3f

Signed-off-by: deepthi <deepthi@planetscale.com>

deepthi added 3 commits June 19, 2019 12:07

restore: only delete progress file at the of mysqlctl.Restore

8970c94

Signed-off-by: deepthi <deepthi@planetscale.com>

Merge branch 'master' into ds-restore-check

eb2ea30

restore: carry over change from vitessio#4929 into refactored function

c325bc4

Signed-off-by: deepthi <deepthi@planetscale.com>

enisoc reviewed Jun 19, 2019

View reviewed changes

enisoc approved these changes Jun 20, 2019

View reviewed changes

deepthi merged commit 6ad520e into vitessio:master Jun 20, 2019

enisoc deleted the ds-restore-check branch June 20, 2019 17:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow retry of interrupted restores #4909

Allow retry of interrupted restores #4909

deepthi commented Jun 5, 2019 •

edited

Loading

msolters commented Jun 10, 2019

deepthi commented Jun 11, 2019

deepthi commented Jun 12, 2019

msolters commented Jun 18, 2019

enisoc left a comment

deepthi commented Jun 19, 2019

enisoc commented Jun 19, 2019

deepthi commented Jun 19, 2019

enisoc Jun 19, 2019

deepthi Jun 19, 2019

enisoc Jun 20, 2019

Allow retry of interrupted restores #4909

Allow retry of interrupted restores #4909

Conversation

deepthi commented Jun 5, 2019 • edited Loading

msolters commented Jun 10, 2019

deepthi commented Jun 11, 2019

deepthi commented Jun 12, 2019

msolters commented Jun 18, 2019

enisoc left a comment

Choose a reason for hiding this comment

deepthi commented Jun 19, 2019

enisoc commented Jun 19, 2019

deepthi commented Jun 19, 2019

enisoc Jun 19, 2019

Choose a reason for hiding this comment

deepthi Jun 19, 2019

Choose a reason for hiding this comment

enisoc Jun 20, 2019

Choose a reason for hiding this comment

deepthi commented Jun 5, 2019 •

edited

Loading