Fix bugs in restic repository ensurer #1235
For context, the main issue was that if you had >1 pod in a namespace with restic backups, and you went to restore into a new cluster, i.e. one that didn't have the
The creation of two repos would then trigger the second bug (send on a closed channel) which caused the observed panic.
pkg/restic/repository_ensurer.go
Outdated
		backupLocation: backupLocation,
	}

	r.repoLocks[repoKey].Unlock()
Does this need to be wrapped by r.repoLocksLock, since it's technically a mutation?
Yeah, it does need a read lock... I need to do a little re-jiggering to avoid deadlocks.
Is it a read lock to call the underlying Unlock mechanism? I'd think it would be a write, since state in the key's mutex is changing.
OK, the latest commit should clean this up. repoLocksLock (now repoLocksMu) synchronizes writes/reads to the repoLocks map, since maps are not threadsafe. And each mutex in the map synchronizes runs of EnsureRepo for a given namespace + location pair. The previous method structure had interleaved mutexes, which was incorrect because it led to deadlocks. That's no longer the case.
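For readers following along, here is a minimal sketch of the two-level locking described above. It is not the code from this PR: `keyedLocker`, `lockFor`, `ensure`, and `countCreates` are hypothetical names, and the "create a repo" step is simulated with a counter. The point it tries to show is that the map mutex is released before the per-key mutex is taken, so the two are never interleaved.

```go
package main

import (
	"fmt"
	"sync"
)

// keyedLocker is a hypothetical stand-in for the locking part of the
// repository ensurer; it is an illustration, not Ark's actual type.
type keyedLocker struct {
	mu    sync.Mutex             // guards the locks map only
	locks map[string]*sync.Mutex // one mutex per namespace + location key
}

func newKeyedLocker() *keyedLocker {
	return &keyedLocker{locks: make(map[string]*sync.Mutex)}
}

// lockFor returns the mutex for key, creating it on first use. The map
// mutex is released when lockFor returns, before the caller locks the
// per-key mutex, so the two are never held at the same time.
func (k *keyedLocker) lockFor(key string) *sync.Mutex {
	k.mu.Lock()
	defer k.mu.Unlock()
	if _, ok := k.locks[key]; !ok {
		k.locks[key] = new(sync.Mutex)
	}
	return k.locks[key]
}

// ensure runs fn while holding the per-key mutex: calls for the same key
// are serialized, calls for different keys can run concurrently.
func (k *keyedLocker) ensure(key string, fn func()) {
	l := k.lockFor(key)
	l.Lock()
	defer l.Unlock()
	fn()
}

// countCreates simulates several concurrent callers per key that each try
// to create a repo only if none exists yet, and reports creates per key.
func countCreates(keys []string, callers int) map[string]int {
	k := newKeyedLocker()
	var resMu sync.Mutex // guards creates, which is shared across keys
	creates := make(map[string]int)
	var wg sync.WaitGroup
	for _, key := range keys {
		for i := 0; i < callers; i++ {
			wg.Add(1)
			go func(key string) {
				defer wg.Done()
				k.ensure(key, func() {
					resMu.Lock()
					defer resMu.Unlock()
					if creates[key] == 0 { // "repo does not exist yet"
						creates[key]++
					}
				})
			}(key)
		}
	}
	wg.Wait()
	return creates
}

func main() {
	got := countCreates([]string{"ns1/default", "ns2/default"}, 8)
	fmt.Println(got["ns1/default"], got["ns2/default"])
}
```

In this sketch each key ends up with exactly one simulated create no matter how many callers race on it, which is the property the per-key mutex is meant to give EnsureRepo.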
I'll retest if there's a new image, @skriss? Let me know the tag to pull.
@PeterGrace I pushed steveheptio/ark:v0.10.2-alpha.2 with the latest code
Code looks good, probably warrants a release note.
Signed-off-by: Steve Kriss <krisss@vmware.com>
37fae4b to e3e76c2
squashed + added release note
While I still think this code is fine, I found out about the sync.Map type that came about in Go 1.9; I think it could apply here if we have to do much more sync logic with the RepositoryEnsurer.
Tested the image that Steve pushed to
@carlisia PTAL
lgtm 👍
Fixes #1233
@nrb @carlisia I'd like to include this in v0.11, and also to backport it to v0.10.x.
I've been able to reproduce the bug consistently, and can confirm this fixes it. OP also confirmed that an earlier draft of this code fixed the issue for him.