Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XtraBackup: Tablet goes unhealthy during backup #5062

Closed
enisoc opened this issue Aug 8, 2019 · 4 comments · Fixed by #5066
Closed

XtraBackup: Tablet goes unhealthy during backup #5062

enisoc opened this issue Aug 8, 2019 · 4 comments · Fixed by #5066

Comments

@enisoc
Copy link
Member

enisoc commented Aug 8, 2019

@deepthi Am I correct in assuming it should be possible to let the tablet keep serving while XtraBackup is taking a snapshot? If so, it seems we need to fix an interaction between backups and the tablet health check.

During a backup, the health check result is not being updated, perhaps due to the tabletmanager action lock. As a result, /healthz and vtgate start seeing the tablet as unhealthy due to last health check is too old: 9m24.152112392s > 15s.

@deepthi
Copy link
Member

deepthi commented Aug 8, 2019

You are correct. this seems to have been overlooked when the feature was developed.

@deepthi
Copy link
Member

deepthi commented Aug 9, 2019

At first glance it seemed like it would be a good idea to fix this by not acquiring the action lock (agent.actionMutex). However, that lock exists to ensure that only one action is performed on a tablet at one time.

	// actionMutex is there to run only one action at a time. If
	// both agent.actionMutex and agent.mutex needs to be taken,
	// take actionMutex first.
	actionMutex sync.Mutex

there is another lock that protects fields

	// mutex protects all the following fields (that start with '_'),
	// only hold the mutex to update the fields, nothing else.
	mutex sync.Mutex

So to fix this properly, we will need to understand why healthcheck doesn't run when an action is in progress, and whether that is the correct behavior.

cc @dweitzman @sougou @derekperkins

@enisoc
Copy link
Member Author

enisoc commented Aug 9, 2019

It might be because the health check is actually more than a check; it also tries to take action, such as trying to start the queryservice if it's stopped. Maybe we can address this differently by fixing the health check to do its checking part without needing the action lock, but don't try to take any action without the action lock?

This assumes there isn't any downside to publishing up-to-date health status during any of the other actions. I can't think of a case where having more accurate, up-to-date information (without taking action) would be a bad thing.

It's sort of orthogonal though that xtrabackup is a special kind of action that's both long-running and non-exclusive. It made sense to disallow other actions while built-in backup was running, because it takes mysqld down so you can't do anything else useful with the tablet at the same time. However, since xtrabackup is intended to run in the background while the tablet is still serving, I think it's pretty important that we allow other actions to occur because being "serving" means more than just serving SQL queries; the tablet should also be responsive to management commands (tabletmanager RPC actions). If it turns out we can't safely allow such management commands while running xtrabackup, we probably should go back to requiring that a tablet go non-serving during that time, rather than being in a precarious half-serving state.

@deepthi
Copy link
Member

deepthi commented Aug 9, 2019

After discussing with @sougou here's how we can handle this situation:

  • don't take actionMutex lock while running xtrabackup
  • introduce a boolean backupIsRunning
  • only allow a backup to start if !backupIsRunning
  • enclose access to boolean using the mutex lock
  • export this boolean to debug/vars so that there is a way for people to know that a backup is running

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants