Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tablet should stay healthy while running xtrabackup #5066

Merged
merged 7 commits into from
Aug 15, 2019

Conversation

deepthi
Copy link
Member

@deepthi deepthi commented Aug 9, 2019

Fixes #5062.
As documented in the issue, here's how we address this:

  • don't take actionMutex lock while running xtrabackup
  • introduce a boolean _isBackupRunning
  • only allow a backup to start if _isBackupRunning is false
  • protect updates to boolean using the mutex lock
  • export this boolean to debug/vars so that there is a way for people to know that a backup is running

Signed-off-by: deepthi deepthi@planetscale.com

…ents tablet from updating its replication lag

Signed-off-by: deepthi <deepthi@planetscale.com>
@deepthi deepthi requested a review from sougou as a code owner August 9, 2019 00:33
@deepthi deepthi requested a review from enisoc August 9, 2019 00:35
Signed-off-by: deepthi <deepthi@planetscale.com>
@@ -32,11 +32,6 @@ import (

// Backup takes a db backup and sends it to the BackupStorage
func (agent *ActionAgent) Backup(ctx context.Context, concurrency int, logger logutil.Logger, allowMaster bool) error {
if err := agent.lock(ctx); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just had a thought: Should we have a separate lock to allow only one backup at a time? I guess technically it might work to have multiple xtrabackups running, but it's probably not a good idea?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would not locking allow it to be promoted to master? Is that ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for cluster backup lock. It is unlikely you would want a new backup if one is already running, and limiting it helps ensure service only degrades so much.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm as far as the tablet is concerned, I think that would be acceptable. It's better than blocking promotion to master because xtrabackup is running; presumably you are promoting this one because the current master is in bad shape anyway. WDYT?

I don't know for sure if XtraBackup will still produce a consistent snapshot in that case, but I would expect it to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know a specific reason for it to not be promoted to master, just bringing it up for discussion. Would the tablet type still read BACKUP? My guess is that would prevent most workloads from choosing it to failover to, even though they probably should like you said.

Copy link
Member Author

@deepthi deepthi Aug 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are not changing the tablet type to BACKUP while xtrabackup is running, which means a REPLICA would be eligible for master promotion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Will there be any knowledge that a backup is running? I would imagine that when selecting a tablet for master promotion, you'd want to prefer a replica not currently running a backup. I would hope to get that info in wr.ShardReplicationStatuses or something like it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should be, but I don't think there is now. Another thing we should fix.

@enisoc
Copy link
Member

enisoc commented Aug 9, 2019

I tried out this patch with a set of 250G shards that take 1.5hrs to back up, and the tablets stayed healthy.

@deepthi
Copy link
Member Author

deepthi commented Aug 9, 2019

I think we'll need to rework this. Apparently the actionMutex that agent.lock() acquires a lock on is the right one, because it prevents two actions from running in parallel. This means that we need to find out why healthcheck is blocked when the actionMutex is locked.

deepthi added 2 commits August 9, 2019 21:23
…r online backup is running

Signed-off-by: deepthi <deepthi@planetscale.com>
Signed-off-by: deepthi <deepthi@planetscale.com>
@deepthi deepthi changed the title WIP: don't take the action lock while running xtrabackup WIP: tablet should stay healthy while running xtrabackup Aug 10, 2019
@deepthi
Copy link
Member Author

deepthi commented Aug 14, 2019

Testing update:

  • unit test for stats - turns out there is a reason we can't unit test action_agent stats. It is because stats vars are global and we create >1 TestActionAgent in the tests.
  • test that tablet actually stays healthy - verified by @enisoc
  • test that a call to Backup fails if a backup is already running - verified on local cluster
  • sanity test of debug/vars - verified on local cluster

@deepthi deepthi changed the title WIP: tablet should stay healthy while running xtrabackup tablet should stay healthy while running xtrabackup Aug 14, 2019
@deepthi
Copy link
Member Author

deepthi commented Aug 14, 2019

@enisoc @sougou this is ready for review. Let me know if you feel that we need an integ test for multiple backups running at one time.
cc @rafael

Signed-off-by: deepthi <deepthi@planetscale.com>
Copy link
Contributor

@sougou sougou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit. I'll let @enisoc do the full review.

@@ -74,7 +77,20 @@ func (agent *ActionAgent) Backup(ctx context.Context, concurrency int, logger lo
if err := agent.refreshTablet(ctx, "before backup"); err != nil {
return err
}
} else {
if agent._isOnlineBackupRunning {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done while holding the lock. Same thing while exporting stats. And same comments in Backup

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 but be careful to release the lock before returning. I recommend a pair of helpers like agent.beginOnlineBackup() and agent.endOnlineBackup() to handle the locking and stats update. Then you can use defer agent.mutex.Unlock() inside each helper, and I think you can also defer agent.endOnlineBackup() here so it's guaranteed for any return path.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have called them beginBackup and endBackup. we set _isBackupRunning to true or false and then export the right value in the stats so that regardless of online/offline, we can see the stats in /debug/vars or prometheus metrics.

@@ -74,7 +77,20 @@ func (agent *ActionAgent) Backup(ctx context.Context, concurrency int, logger lo
if err := agent.refreshTablet(ctx, "before backup"); err != nil {
return err
}
} else {
if agent._isOnlineBackupRunning {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 but be careful to release the lock before returning. I recommend a pair of helpers like agent.beginOnlineBackup() and agent.endOnlineBackup() to handle the locking and stats update. Then you can use defer agent.mutex.Unlock() inside each helper, and I think you can also defer agent.endOnlineBackup() here so it's guaranteed for any return path.

… to protect all access to _isBackupRunning and the stats variable

Signed-off-by: deepthi <deepthi@planetscale.com>
Copy link
Member

@enisoc enisoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

XtraBackup: Tablet goes unhealthy during backup
5 participants