XtraBackup: Tablet goes unhealthy during backup #5062
Comments
You are correct. This seems to have been overlooked when the feature was developed.
At first glance it seemed like it would be a good idea to fix this by not acquiring the action lock (agent.actionMutex). However, that lock exists to ensure that only one action is performed on a tablet at one time.
There is another lock that protects fields.
So to fix this properly, we will need to understand why the health check doesn't run when an action is in progress, and whether that is the correct behavior.
It might be because the health check is actually more than a check; it also tries to take action, such as trying to start the queryservice if it's stopped. Maybe we can address this differently: fix the health check so that it does its checking part without needing the action lock, but doesn't try to take any action without the action lock. This assumes there isn't any downside to publishing up-to-date health status during any of the other actions. I can't think of a case where having more accurate, up-to-date information (without taking action) would be a bad thing.

It's sort of orthogonal, though, that xtrabackup is a special kind of action that's both long-running and non-exclusive. It made sense to disallow other actions while the built-in backup was running, because it takes mysqld down, so you can't do anything else useful with the tablet at the same time. However, since xtrabackup is intended to run in the background while the tablet is still serving, I think it's pretty important that we allow other actions to occur, because being "serving" means more than just serving SQL queries; the tablet should also be responsive to management commands (tabletmanager RPC actions).

If it turns out we can't safely allow such management commands while running xtrabackup, we should probably go back to requiring that a tablet go non-serving during that time, rather than being in a precarious half-serving state.
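The split proposed above (always check, only act when the action lock is free) could look roughly like the sketch below. This is not Vitess code: the `agent` type, `runHealthCheck`, the `probe`/`repair` callbacks, and `errStopped` are hypothetical names for illustration, and only `actionMutex` is taken from the discussion. It also assumes Go 1.18 or later for `sync.Mutex.TryLock`.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// errStopped is a hypothetical probe failure used for illustration.
var errStopped = errors.New("queryservice stopped")

// agent is a hypothetical stand-in for the tablet agent.
type agent struct {
	actionMutex sync.Mutex // serializes tablet actions (held during a backup)

	mu        sync.Mutex // protects the health fields below
	lastCheck time.Time
	healthErr error
}

// runHealthCheck always records a fresh observation, but attempts a repair
// only if no other action currently holds the action lock.
// It reports whether a repair was attempted.
func (a *agent) runHealthCheck(probe func() error, repair func()) bool {
	err := probe() // read-only check; does not need the action lock

	a.mu.Lock()
	a.lastCheck = time.Now()
	a.healthErr = err
	a.mu.Unlock()

	if err == nil {
		return false // healthy, nothing to repair
	}
	if !a.actionMutex.TryLock() {
		return false // another action (e.g. a backup) is running; report only
	}
	defer a.actionMutex.Unlock()
	repair()
	return true
}
```

With this shape, a long-running backup that holds `actionMutex` would no longer stop `lastCheck` from advancing; it would only postpone repairs until the lock is free.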
After discussing with @sougou here's how we can handle this situation:
@deepthi Am I correct in assuming it should be possible to let the tablet keep serving while XtraBackup is taking a snapshot? If so, it seems we need to fix an interaction between backups and the tablet health check.
During a backup, the health check result is not being updated, perhaps due to the tabletmanager action lock. As a result, /healthz and vtgate start seeing the tablet as unhealthy due to:
last health check is too old: 9m24.152112392s > 15s