Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backups intermittently fail #2588

Closed
adkhare opened this issue Feb 23, 2017 · 4 comments
Closed

Backups intermittently fail #2588

adkhare opened this issue Feb 23, 2017 · 4 comments

Comments

@adkhare
Copy link
Contributor

adkhare commented Feb 23, 2017

Following is the lock when the backups fail:

I0221 15:29:33.089957   31505 mysqld.go:339] Mysqld.Shutdown
I0221 15:29:33.090018   31505 mysqld.go:377] No mysqld_shutdown hook, running mysqladmin directly
I0221 15:29:33.090093   31505 mysqld.go:440] execCmd: /usr/bin/mysqladmin /usr/bin/mysqladmin [--defaults-extra-file=/tmp/example331428697 shutdown]
I0221 15:29:37.096281   31505 mysqld.go:454] execCmd: /usr/bin/mysqladmin output:
I0221 15:29:37.096299   31505 mysqld.go:416] Mysqld.Shutdown: waiting for socket file (/opt/data/vitess-data/vt_1025632011/mysql.sock) and pid file (/opt/data/vitess-data/vt_1025632011/mysql.pid) to disappear
I0221 15:29:37.097312   31505 backup.go:356] found 149 files to backup
I0221 15:29:53.624134   31505 run.go:37] Entering lameduck mode for at least 50ms
I0221 15:29:53.624149   31505 run.go:38] Firing asynchronous OnTerm hooks
I0221 15:29:53.624154   31505 servenv.go:139] Firing synchronous OnTermSync hooks and waiting up to 10s for them
I0221 15:29:53.624173   31505 healthcheck.go:138] Stopping periodic health check timer
I0221 15:30:03.624326   31505 servenv.go:155] OnTermSync hooks timed out
I0221 15:30:03.624343   31505 run.go:47] Shutting down gracefully
I0221 15:30:03.624458   31505 tabletserver.go:231] TabletServer state: NOT_SERVING -> SHUTTING_DOWN
I0221 15:30:03.624479   31505 tabletserver.go:454] Executing graceful transition to NotServing
I0221 15:30:03.624486   31505 tabletserver.go:463] Shutting down query service

Following is the successful backup log:

I0221 15:22:41.563032   22790 mysqld.go:222] Mysqld.Start(1487690561): No mysqld_start hook, running mysqld_safe directly
I0221 15:22:41.563054   22790 mysqld.go:239] Mysqld.Start(1487690561) &exec.Cmd{Path:"/usr/bin/mysqld_safe", Args:[]string{"/usr/bin/mysqld_safe", "--defaults-file=/opt/data/vitess-data/vt_1025631981/my.cnf"}, Env:[]string{"LD_LIBRARY_PATH=/usr/lib/mysql"}, Dir:"/usr", Stdin:io.Reader(nil), Stdout:io.Writer(nil), Stderr:io.Writer(nil), ExtraFiles:[]*os.File(nil), SysProcAttr:(*syscall.SysProcAttr)(nil), Process:(*os.Process)(nil), ProcessState:(*os.ProcessState)(nil), ctx:context.Context(nil), lookPathErr:error(nil), finished:false, childFiles:[]*os.File(nil), closeAfterStart:[]io.Closer(nil), closeAfterWait:[]io.Closer(nil), goroutine:[]func() error(nil), errch:(chan error)(nil), waitDone:(chan struct {})(nil)}
I0221 15:22:41.567153   22790 mysqld.go:306] Waiting for mysqld socket file (/opt/data/vitess-data/vt_1025631981/mysql.sock) to be ready...
I0221 15:22:41.927235   22790 mysqld.go:257] Mysqld.Start(1487690561) stdout: 170221 15:22:41 mysqld_safe Logging to '/opt/data/vitess-data/vt_1025631981/error.log'.
I0221 15:22:41.943440   22790 mysqld.go:257] Mysqld.Start(1487690561) stdout: 170221 15:22:41 mysqld_safe Starting mysqld daemon with databases from /opt/data/vitess-data/vt_1025631981/data
I0221 15:22:43.083917   22790 backup.go:322] restoring semi-sync settings from before backup: master=false, slave=true
I0221 15:22:43.084007   22790 replication.go:351] Setting semi-sync mode: master=false, slave=true
I0221 15:22:43.084433   22790 query.go:53] exec SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 1
I0221 15:22:43.084556   22790 backup.go:329] restarting mysql replication
I0221 15:22:43.085110   22790 query.go:53] exec START SLAVE
I0221 15:22:43.087813   22790 hook.go:154] postflight_start_slave hook doesn't exist
I0221 15:22:43.088442   22790 backup.go:341] resetting mysqld read-only to true
I0221 15:22:43.088874   22790 query.go:53] exec SET GLOBAL read_only = ON
I0221 15:22:43.105865   22790 state_change.go:117] Executing post-action state refresh: after backup
I0221 15:22:43.121555   22790 state_change.go:149] Running tablet callback because: after backup
I0221 15:22:43.122793   22790 tabletserver.go:231] TabletServer state: NOT_SERVING -> NOT_SERVING
I0221 15:22:43.122804   22790 tabletserver.go:231] TabletServer state: NOT_SERVING -> SERVING
I0221 15:22:43.122812   22790 updatestreamctl.go:195] Enabling update stream, dbname: vt_assets
I0221 15:22:43.122849   22790 state_change.go:138] Done with post-action state refresh
I0221 15:22:43.123880   22790 rpc_server.go:71] TabletManager.Backup(concurrency:8 )(on production-1025631981 from ): <nil>

As a workaround, we have currently added retry for backups in case of failures, but the failures are extremely intermittent at this point.

It almost feels like there is a clean shutdown being triggered during backups.

@michael-berlin
Copy link
Contributor

Pavel, do you have time to look into this?

@adkhare
Copy link
Contributor Author

adkhare commented Feb 23, 2017

It would also be great if we can have some kind of alerting mechanism to or and API of some kind to show that backups have failed

@pivanof
Copy link
Contributor

pivanof commented Mar 1, 2017

I don't understand this issue. The first log excerpt shows how mysqld was shutdown and backup has started. I don't see any failure message. The second log excerpt doesn't show this part, but looks like it's showing how mysqld is started after the backup and how the RPC TabletManager.Backup is succeeded.
Did you skip some parts of the log? I don't see any problem yet.

@adkhare
Copy link
Contributor Author

adkhare commented Mar 21, 2017

This was a problem on our infrastructure end. Since, during backups, health checks show unhealthy, after certain time, tablets were getting restarted causing backups to fail. we have fixed this by changing our health checks to not use /health but use TabletState from /debug/vars as health check

@adkhare adkhare closed this as completed Mar 21, 2017
frouioui pushed a commit to planetscale/vitess that referenced this issue Nov 21, 2023
* backport of 2588

* muysqlctl: Ensure to implement server side API

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>

---------

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
Co-authored-by: Dirkjan Bussink <d.bussink@gmail.com>
frouioui pushed a commit to planetscale/vitess that referenced this issue Mar 26, 2024
If this errors, we can't derefence v and read this value. This was
broken in vitessio#13449

Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants