tablet should stay healthy while running xtrabackup #5066
Conversation
…ents tablet from updating its replication lag Signed-off-by: deepthi <deepthi@planetscale.com>
@@ -32,11 +32,6 @@ import (

// Backup takes a db backup and sends it to the BackupStorage
func (agent *ActionAgent) Backup(ctx context.Context, concurrency int, logger logutil.Logger, allowMaster bool) error {
	if err := agent.lock(ctx); err != nil {
Just had a thought: Should we have a separate lock to allow only one backup at a time? I guess technically it might work to have multiple xtrabackups running, but it's probably not a good idea?
Would not locking allow it to be promoted to master? Is that ok?
+1 for cluster backup lock. It is unlikely you would want a new backup if one is already running, and limiting it helps ensure service only degrades so much.
Hm as far as the tablet is concerned, I think that would be acceptable. It's better than blocking promotion to master because xtrabackup is running; presumably you are promoting this one because the current master is in bad shape anyway. WDYT?
I don't know for sure if XtraBackup will still produce a consistent snapshot in that case, but I would expect it to.
I don't know a specific reason for it not to be promoted to master, just bringing it up for discussion. Would the tablet type still read BACKUP? My guess is that would prevent most workloads from choosing it to fail over to, even though they probably should, like you said.
We are not changing the tablet type to BACKUP while xtrabackup is running, which means a REPLICA would be eligible for master promotion.
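To make the distinction concrete, here is a minimal, self-contained sketch of the behavior described above; the agent type, runBackup, and takeBackup are hypothetical stand-ins for illustration, not the actual ActionAgent code in this PR.

```go
package main

import (
	"context"
	"fmt"
)

// Hypothetical stand-ins for the real Vitess types; sketch only.
type tabletType string

const (
	typeReplica tabletType = "REPLICA"
	typeBackup  tabletType = "BACKUP"
)

type agent struct {
	tabletType tabletType
}

// runBackup illustrates the branching discussed above: an online
// (xtrabackup) backup leaves the tablet type alone, so the tablet keeps
// reporting replication lag and stays eligible for master promotion,
// while an offline (builtin) backup switches the tablet to BACKUP first.
func (a *agent) runBackup(ctx context.Context, online bool) error {
	if online {
		// Tablet type stays REPLICA; health checks keep running.
		return a.takeBackup(ctx)
	}
	prev := a.tabletType
	a.tabletType = typeBackup // taken out of the serving rotation
	defer func() { a.tabletType = prev }()
	return a.takeBackup(ctx)
}

func (a *agent) takeBackup(ctx context.Context) error {
	fmt.Println("backup running with tablet type", a.tabletType)
	return nil
}

func main() {
	a := &agent{tabletType: typeReplica}
	_ = a.runBackup(context.Background(), true)
}
```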
That makes sense. Will there be any knowledge that a backup is running? I would imagine that when selecting a tablet for master promotion, you'd want to prefer a replica not currently running a backup. I would hope to get that info in wr.ShardReplicationStatuses or something like it.
There should be, but I don't think there is now. Another thing we should fix.
I tried out this patch with a set of 250G shards that take 1.5 hrs to back up, and the tablets stayed healthy.
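For illustration, a candidate filter of the kind being discussed could look like the sketch below. This is purely hypothetical: replicaStatus and preferNonBackingUp are made-up names, and this backup-running information is not currently exposed through wr.ShardReplicationStatuses.

```go
package main

import "fmt"

// Hypothetical replica status; in practice the data would come from the
// shard replication status plus some backup-is-running signal.
type replicaStatus struct {
	alias         string
	backupRunning bool
}

// preferNonBackingUp keeps candidates that are not currently running a
// backup, falling back to all replicas if every one of them is busy.
func preferNonBackingUp(replicas []replicaStatus) []replicaStatus {
	var idle []replicaStatus
	for _, r := range replicas {
		if !r.backupRunning {
			idle = append(idle, r)
		}
	}
	if len(idle) == 0 {
		return replicas
	}
	return idle
}

func main() {
	replicas := []replicaStatus{
		{alias: "zone1-101", backupRunning: true},
		{alias: "zone1-102", backupRunning: false},
	}
	fmt.Println(preferNonBackingUp(replicas)) // prefers zone1-102
}
```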
…r online backup is running Signed-off-by: deepthi <deepthi@planetscale.com>
Testing update:
Signed-off-by: deepthi <deepthi@planetscale.com>
One nit. I'll let @enisoc do the full review.
@@ -74,7 +77,20 @@ func (agent *ActionAgent) Backup(ctx context.Context, concurrency int, logger lo
		if err := agent.refreshTablet(ctx, "before backup"); err != nil {
			return err
		}
	} else {
		if agent._isOnlineBackupRunning
This should be done while holding the lock. Same thing while exporting stats. And the same comments apply in Backup.
+1, but be careful to release the lock before returning. I recommend a pair of helpers like agent.beginOnlineBackup() and agent.endOnlineBackup() to handle the locking and stats update. Then you can use defer agent.mutex.Unlock() inside each helper, and I think you can also defer agent.endOnlineBackup() here so it's guaranteed for any return path.
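A minimal sketch of the begin/end helper pattern being recommended here, using the beginBackup/endBackup names mentioned in the next comment; the struct fields and error are placeholders, not the exact code in this PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Sketch of the begin/end helper pattern; names are placeholders.
type agent struct {
	mutex            sync.Mutex
	_isBackupRunning bool
}

var errBackupRunning = errors.New("a backup is already running")

// beginBackup flips the flag under the lock and rejects a concurrent backup.
func (a *agent) beginBackup() error {
	a.mutex.Lock()
	defer a.mutex.Unlock() // lock scope stays inside the helper
	if a._isBackupRunning {
		return errBackupRunning
	}
	a._isBackupRunning = true
	return nil
}

// endBackup clears the flag under the lock.
func (a *agent) endBackup() {
	a.mutex.Lock()
	defer a.mutex.Unlock()
	a._isBackupRunning = false
}

// Backup shows the defer pattern: endBackup runs on every return path.
func (a *agent) Backup(ctx context.Context) error {
	if err := a.beginBackup(); err != nil {
		return err
	}
	defer a.endBackup()
	fmt.Println("running backup...")
	return nil
}

func main() {
	a := &agent{}
	_ = a.Backup(context.Background())
}
```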
I have called them beginBackup and endBackup. We set _isBackupRunning to true or false and then export the right value in the stats so that, regardless of online/offline, we can see the stats in /debug/vars or Prometheus metrics.
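As a rough sketch of exporting that flag so it shows up in /debug/vars, using the standard library's expvar package as a stand-in for Vitess's own stats package; the variable name BackupIsRunning is illustrative, not necessarily the name used in the PR.

```go
package main

import (
	"expvar"
	"net/http"
	"sync"
)

// Illustrative exported variable; expvar stands in for the Vitess stats
// package, which also surfaces values via /debug/vars and Prometheus.
var backupIsRunning = expvar.NewInt("BackupIsRunning")

type agent struct {
	mu        sync.Mutex
	isRunning bool
}

// setBackupRunning updates the flag and the exported stat under the lock,
// as discussed in the review comments above.
func (a *agent) setBackupRunning(running bool) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.isRunning = running
	v := int64(0)
	if running {
		v = 1
	}
	backupIsRunning.Set(v)
}

func main() {
	a := &agent{}
	a.setBackupRunning(true)
	// expvar registers the /debug/vars handler on http.DefaultServeMux.
	_ = http.ListenAndServe(":8080", nil)
}
```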
… to protect all access to _isBackupRunning and the stats variable Signed-off-by: deepthi <deepthi@planetscale.com>
lgtm
Fixes #5062.
As documented in the issue, here's how we address this: the tablet type is left unchanged while an xtrabackup (online) backup runs, so the tablet keeps reporting replication lag and stays healthy, and an _isBackupRunning flag, set and cleared under the agent lock, is exported as a stat so it is visible in /debug/vars.
Signed-off-by: deepthi <deepthi@planetscale.com>