Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow retry of interrupted restores #4909

Merged
merged 11 commits into from
Jun 20, 2019

Conversation

deepthi
Copy link
Member

@deepthi deepthi commented Jun 5, 2019

Fixes #4821
The strategy here is to create a temporary file at the beginning of the restore process that is deleted at the end.
If the file exists at the beginning of the restore process, we assume that a previous restore failed and skip the checkNoDB check.

This requires some other changes:

  • don't wait on mysqld if we are retrying a restore
  • mysqlctld should initialize mysqld config, but not try to bring it up if it detects an interrupted restore

In addition, changes have been made to the code so that terminating a grpc client that started the restore will cause the restore to terminate as well, so effectively we are undo-ing most of #2310.

The terminate_restore test has been changed to test the new behavior.
Backup tests had to be re-organized because now we have so many cases that if they are all in one test, it times out.

Signed-off-by: deepthi <deepthi@planetscale.com>
@deepthi deepthi requested a review from sougou as a code owner June 5, 2019 22:10
@deepthi deepthi requested a review from dkhenry June 5, 2019 22:12
@msolters
Copy link
Contributor

After testing this PR on our cluster, we last landed on the following behaviour:

Restarted pods (i.e. interrupted restore) do correctly identify a .restore file and attempt to restart the backup restore. However, depending on how much of the data restore has completed, the mysql service (in our case, managed by mysqlctl) may crash on boot:

2019-06-06T20:51:33.993228Z mysqld_safe Logging to '/vtdataroot/tabletdata/error.log'.
2019-06-06T20:51:34.092999Z mysqld_safe Starting mysqld daemon with databases from /vtdataroot/tabletdata/data
2019-06-06T20:51:35.177106Z 0 [Warning] [MY-011068] [Server] The syntax 'expire-logs-days' is deprecated and will be removed in a future release. Please use binlog_expire_logs_seconds instead.
2019-06-06T20:51:35.177348Z 0 [Warning] [MY-010915] [Server] 'NO_ZERO_DATE', 'NO_ZERO_IN_DATE' and 'ERROR_FOR_DIVISION_BY_ZERO' sql modes should be used with strict mode. They will be merged with strict mode in a future release.
2019-06-06T20:51:35.177456Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.16) starting as process 1245
2019-06-06T20:51:35.184752Z 0 [Warning] [MY-010091] [Server] Can't create test file /vtdataroot/tabletdata/data/app-web-app-x38-40x-replica-2.lower-test
2019-06-06T20:51:35.185089Z 0 [Warning] [MY-010159] [Server] Setting lower_case_table_names=2 because file system for /vtdataroot/tabletdata/data/ is case insensitive
2019-06-06T20:51:35.185166Z 0 [ERROR] [MY-013276] [Server] Failed to set datadir to '/vtdataroot/tabletdata/data/' (OS errno: 2 - No such file or directory)
2019-06-06T20:51:35.185735Z 0 [ERROR] [MY-010119] [Server] Aborting
2019-06-06T20:51:35.241353Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.16)  MySQL Community Server - GPL.
2019-06-06T20:51:35.465138Z mysqld_safe mysqld from pid file /vtdataroot/tabletdata/mysql.pid ended

This causes the mysqlctl container to basically restart every 5 minutes as MySQL never successfully comes up.

Meanwhile, vttablet is attempting to restart the restore. However mysqlctl's unavailability can break this process, as GRPC communication with mysqlctl seems to be a requirement of the restore process:

I0606 20:15:49.230822       1 state_change.go:169] Running tablet callback because: restore from backup
I0606 20:15:49.230847       1 state_change.go:300] Disabling query service on type change, reason: not a serving tablet type(RESTORE)
I0606 20:15:49.230873       1 state_change.go:158] Done with post-action state refresh
I0606 20:15:49.230910       1 backup.go:229] Restore: Checking if a restore is in progress
I0606 20:15:49.230944       1 backup.go:253] Restore: looking for a suitable backup to restore
I0606 20:15:49.240403       1 service_map.go:64] Registering updatestream for grpc, disable it with -grpc-updatestream service_map parameter
I0606 20:15:49.240471       1 grpc_server.go:212] Listening for gRPC calls on port 16002
I0606 20:15:49.240620       1 unix_socket.go:36] Not listening on socket file
I0606 20:15:50.436138       1 backupengine.go:79] Restore: found backup web_app/38-40 2019-05-15.012358.app-0481768300 to restore
I0606 20:15:50.436199       1 mysqld.go:298] Waiting for mysqld socket file (/vtdataroot/tabletdata/mysql.sock) to be ready...
I0606 20:16:20.465369       1 backupengine.go:105] Restore: deleting existing files
I0606 20:16:20.465461       1 backup.go:200] Restore: skipping removal of nonexistent InnodbLogGroupHomeDir (/vtdataroot/tabletdata/innodb/logs)
I0606 20:16:20.465472       1 backup.go:185] Restore: removing files in RelayLogPath.* (/vtdataroot/tabletdata/relay-logs/vt-0104873702-relay-bin.*)
I0606 20:16:20.465609       1 backup.go:200] Restore: skipping removal of nonexistent RelayLogIndexPath (/vtdataroot/tabletdata/relay-logs/vt-0104873702-relay-bin.index)
I0606 20:16:20.465620       1 backup.go:200] Restore: skipping removal of nonexistent RelayLogInfoPath (/vtdataroot/tabletdata/relay-logs/relay-log.info)
I0606 20:16:20.465627       1 backup.go:185] Restore: removing files in BinLogPath.* (/vtdataroot/tabletdata/bin-logs/vt-0104873702-bin.*)
I0606 20:16:20.465732       1 backup.go:203] Restore: removing files in DataDir (/vtdataroot/tabletdata/data)
I0606 20:16:21.044877       1 backup.go:200] Restore: skipping removal of nonexistent InnodbDataHomeDir (/vtdataroot/tabletdata/innodb/data)
I0606 20:16:21.045023       1 backupengine.go:110] Restore: reinit config file
I0606 20:16:21.045036       1 mysqld.go:738] Mysqld.ReinitConfig
I0606 20:16:21.045043       1 mysqld.go:742] executing Mysqld.ReinitConfig() remotely via mysqlctld server: /vtdataroot/mysqlctl.sock
W0606 20:16:21.045648       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...
W0606 20:16:22.046225       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...
W0606 20:16:23.686121       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...
W0606 20:16:26.701888       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...
W0606 20:16:31.136101       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...
W0606 20:16:36.905676       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...
W0606 20:16:46.264571       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...
W0606 20:17:01.494523       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...
W0606 20:17:25.213006       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...
W0606 20:18:02.537421       1 clientconn.go:1160] grpc: addrConn.createTransport failed to connect to {/vtdataroot/mysqlctl.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /vtdataroot/mysqlctl.sock: connect: connection refused". Reconnecting...

…even if some data exists

Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
@deepthi
Copy link
Member Author

deepthi commented Jun 11, 2019

@enisoc this is ready for review. We will wait to merge until @msolters is finished with testing, but for now the basic tests work.

go/cmd/mysqlctld/mysqlctld.go Outdated Show resolved Hide resolved
go/cmd/mysqlctld/mysqlctld.go Outdated Show resolved Hide resolved
go/vt/mysqlctl/backup.go Outdated Show resolved Hide resolved
go/vt/mysqlctl/backup.go Outdated Show resolved Hide resolved
go/vt/mysqlctl/backupengine.go Show resolved Hide resolved
go/vt/mysqlctl/builtinbackupengine.go Outdated Show resolved Hide resolved
go/vt/mysqlctl/backupengine.go Outdated Show resolved Hide resolved
test/backup.py Outdated Show resolved Hide resolved
test/backup.py Outdated Show resolved Hide resolved
test/backup_transform.py Show resolved Hide resolved
@deepthi
Copy link
Member Author

deepthi commented Jun 12, 2019

@enisoc thank you for the detailed review. I will address the suggested changes.

…case to actually test interrupted restore

Signed-off-by: deepthi <deepthi@planetscale.com>
…to allow multiple restores of same vttablet

Signed-off-by: deepthi <deepthi@planetscale.com>
@deepthi deepthi changed the title WIP: Allow retry of interrupted restores Allow retry of interrupted restores Jun 18, 2019
@msolters
Copy link
Contributor

This seems pretty stable in our testing! Restores are recovering from interruption at any stage.

Copy link
Member

@enisoc enisoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall! Just a few last questions.

go/vt/mysqlctl/backup.go Outdated Show resolved Hide resolved
go/vt/mysqlctl/backupengine.go Outdated Show resolved Hide resolved
go/vt/mysqlctl/builtinbackupengine.go Show resolved Hide resolved
…ing files

Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
@deepthi
Copy link
Member Author

deepthi commented Jun 19, 2019

@enisoc a more fundamental question comes to mind. The way I have coded this, we delete the restore_in_progress file as soon as we have completed copying the files, but before restarting mysqld. Should it instead move to https://github.com/vitessio/vitess/blob/master/go/vt/vttablet/tabletmanager/restore.go#L154? You could argue that the restore is not complete until then.

@enisoc
Copy link
Member

enisoc commented Jun 19, 2019

Ah that's a good point. We also should consider the restore incomplete if we haven't run mysql_upgrade and populated metadata yet.

I don't think we need to lift the check all the way up to tabletmanager though. Do you think it's enough if we create the state file before running ExecuteRestore in mysqlctl.Restore, and then remove it at the end of mysqlctl.Restore?

I think we're still ok if a failure occurs after that point. The tabletmanager-level stuff should fix itself either on the next vttablet restart, or when the RestoreFromBackup RPC is retried. Do you agree?

@deepthi
Copy link
Member Author

deepthi commented Jun 19, 2019

I don't think we need to lift the check all the way up to tabletmanager though. Do you think it's enough if we create the state file before running ExecuteRestore in mysqlctl.Restore, and then remove it at the end of mysqlctl.Restore?

I think we're still ok if a failure occurs after that point. The tabletmanager-level stuff should fix itself either on the next vttablet restart, or when the RestoreFromBackup RPC is retried. Do you agree?

That sounds reasonable. I will make the change.

if err := removeExistingFiles(cnf); err != nil {
return mysql.Position{}, err
// mark restore as in progress
if err = createStateFile(cnf); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we remove the state file one level up in mysqlctl.Restore(), I was thinking we'd also create it there so all the logic is in one place. That does mean that we'd potentially mark the data dir invalid just because we failed to find a valid backup, but that may be easy to avoid now that you've abstracted findBackupToRestore(). We could call that from mysqlctl.Restore() instead of here, and then create the state file there as well if we find one, before passing the found backup to ExecuteRestore(). What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be ideal, but there is a little bit of difficulty with moving the call to findBackupToRestore into mysqlctl.Restore. The manifest format is different for the two backup methods - builtin and xtrabackup, and we are passing in the appropriate one to findBackupToRestore when we call it. Any suggestions on how we could work around that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I didn't notice that there's an implementation-specific arg to findBackupToRestore. In that case, it's probably not worth doing surgery on it in this PR.

@deepthi deepthi merged commit 6ad520e into vitessio:master Jun 20, 2019
@enisoc enisoc deleted the ds-restore-check branch June 20, 2019 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Interrupted Restores Fail/Please Sanity Check Restores
3 participants