
Fix bugs in restic repository ensurer #1235

Merged: 1 commit, Feb 27, 2019

Conversation

@skriss (Contributor) commented Feb 25, 2019

Fixes #1233

@nrb @carlisia I'd like to include this in v0.11, and also to backport it to v0.10.x.

I've been able to reproduce the bug consistently, and can confirm this fixes it. OP also confirmed that an earlier draft of this code fixed the issue for him.

@skriss skriss added this to the v0.11.0 milestone Feb 25, 2019
@skriss skriss requested review from carlisia and nrb February 25, 2019 21:18
@skriss (Contributor, Author) commented Feb 25, 2019

For context: the main issue was that if you had more than one pod in a namespace with restic backups, and you restored into a new cluster (i.e. one that didn't yet have the ResticRepository CRD), you'd hit a race condition where two ResticRepository CR instances would be created for the same volume namespace + backup location pair (there should be exactly one). This happened because EnsureRepo was called multiple times concurrently: each execution would first check whether a ResticRepository existed, see that it didn't, and then create one using a GenerateName, which meant the repo names wouldn't collide and cause one of the creations to fail with an error.

The creation of two repos would then trigger the second bug (send on a closed channel) which caused the observed panic.
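To illustrate the second bug, here is a minimal Go sketch (hypothetical names, not the actual Ark/Velero code) of why a duplicate "repository ready" signal panics, and one generic way to make such a signal idempotent with sync.Once. Note the merged fix addresses the root cause instead, by preventing the duplicate repositories from being created at all:

```go
package main

import (
	"fmt"
	"sync"
)

// repoResult is a hypothetical per-repository "ready" signal. Closing an
// already-closed channel (or sending on a closed one) panics at runtime;
// guarding the close with sync.Once makes a duplicate completion a no-op.
type repoResult struct {
	ready chan struct{}
	once  sync.Once
}

func (r *repoResult) markReady() {
	r.once.Do(func() { close(r.ready) })
}

func main() {
	r := &repoResult{ready: make(chan struct{})}
	r.markReady()
	r.markReady() // without the sync.Once guard, this second close would panic
	<-r.ready     // returns immediately: the channel is closed
	fmt.Println("ready")
}
```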

    backupLocation: backupLocation,
    }

    r.repoLocks[repoKey].Unlock()
Contributor commented:

Does this need to be wrapped by r.repoLocksLock, since it's technically a mutation?

@skriss (Contributor, Author) replied:

Yeah, it does need a read lock... I need to do a little re-jiggering to avoid deadlocks.

@nrb (Contributor) commented Feb 25, 2019:

Is it a read lock to call the underlying Unlock mechanism? I'd think it would be a write lock, since state in the key's mutex is changing.

@skriss (Contributor, Author) commented Feb 25, 2019:

OK, the latest commit should clean this up. repoLocksLock (now repoLocksMu) synchronizes reads and writes to the repoLocks map, since maps are not thread-safe. Each mutex in the map serializes runs of EnsureRepo for a given namespace + location pair. The previous method structure interleaved the two mutexes, which could lead to deadlocks; that's no longer the case.

@PeterGrace commented:

I'll retest if there's a new image, @skriss? Let me know the tag to pull.

@skriss (Contributor, Author) replied:

@PeterGrace I pushed steveheptio/ark:v0.10.2-alpha.2 with the latest code

@nrb (Contributor) commented Feb 26, 2019

Code looks good, probably warrants a release note.

Signed-off-by: Steve Kriss <krisss@vmware.com>
@skriss (Contributor, Author) commented Feb 26, 2019

squashed + added release note

@nrb (Contributor) commented Feb 26, 2019

While I still think this code is fine, I found out about the sync.Map type introduced in Go 1.9; I think it could apply here if we have to do much more synchronization logic in the RepositoryEnsurer.
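A hypothetical sketch of that suggestion (not the merged code): sync.Map's LoadOrStore atomically returns the existing per-key mutex or stores a fresh one, so no separate mutex is needed around the map itself:

```go
package main

import (
	"fmt"
	"sync"
)

// syncMapEnsurer sketches the sync.Map alternative. LoadOrStore does the
// fetch-or-create atomically, replacing the explicit map-guarding mutex.
type syncMapEnsurer struct {
	repoLocks sync.Map // effectively map[string]*sync.Mutex
}

func (e *syncMapEnsurer) repoLock(key string) *sync.Mutex {
	lock, _ := e.repoLocks.LoadOrStore(key, new(sync.Mutex))
	return lock.(*sync.Mutex)
}

func main() {
	e := &syncMapEnsurer{}
	a := e.repoLock("ns/default")
	b := e.repoLock("ns/default")
	fmt.Println("same mutex:", a == b) // prints "same mutex: true"
}
```

One caveat: LoadOrStore as written allocates a new mutex on every call even when the key already exists; a Load-first double check avoids that, though for this low-frequency path the allocation is negligible.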

@PeterGrace commented:
Tested the image that Steve pushed to steveheptio/ark:v0.10.2-alpha.2, it worked as expected. Thanks @skriss!

@skriss (Contributor, Author) commented Feb 26, 2019

@carlisia PTAL

@carlisia (Contributor) left a review:

lgtm 👍

@skriss (Contributor, Author) commented Feb 27, 2019

@carlisia or @nrb pls merge!

@nrb nrb merged commit 783c7d8 into vmware-tanzu:master Feb 27, 2019
@skriss skriss deleted the restic-race-fix branch February 28, 2019 15:12
Successfully merging this pull request may close these issues.

Go crash when restoring a schedule backup
4 participants