
Fix bugs in restic repository ensurer #1235

Merged: 1 commit, Feb 27, 2019

Conversation

@skriss (Contributor) commented Feb 25, 2019

Fixes #1233

@nrb @carlisia I'd like to include this in v0.11, and also to backport it to v0.10.x.

I've been able to reproduce the bug consistently, and can confirm this fixes it. OP also confirmed that an earlier draft of this code fixed the issue for him.

@skriss skriss added this to the v0.11.0 milestone Feb 25, 2019
@skriss skriss requested review from carlisia and nrb February 25, 2019 21:18
@skriss (Contributor, Author) commented Feb 25, 2019

For context: the main issue was that if you had more than one pod in a namespace with restic backups, and you restored into a new cluster (i.e. one that didn't yet have the ResticRepository CRD), you'd hit a race condition where two ResticRepository CR instances would be created for the same volume namespace + backup location pair (there should be exactly one). This happened because EnsureRepo was called multiple times concurrently: each execution would first check whether a ResticRepository existed, see that it didn't, and then create one using a GenerateName, which meant the repo names wouldn't collide and cause one of the creations to fail with an error.

The creation of two repos would then trigger the second bug (send on a closed channel) which caused the observed panic.
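To illustrate the second bug, here is a minimal Go sketch (hypothetical names, not the actual Ark/Velero code) of why a duplicate "repository ready" signal panics, and one generic way to make such a signal idempotent with sync.Once. Note the merged fix addresses the root cause instead, by preventing the duplicate repositories from being created at all:

```go
package main

import (
	"fmt"
	"sync"
)

// repoResult is a hypothetical per-repository "ready" signal. Closing an
// already-closed channel (or sending on a closed one) panics at runtime;
// guarding the close with sync.Once makes a duplicate completion a no-op.
type repoResult struct {
	ready chan struct{}
	once  sync.Once
}

func (r *repoResult) markReady() {
	r.once.Do(func() { close(r.ready) })
}

func main() {
	r := &repoResult{ready: make(chan struct{})}
	r.markReady()
	r.markReady() // without the sync.Once guard, this second close would panic
	<-r.ready     // returns immediately: the channel is closed
	fmt.Println("ready")
}
```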

    backupLocation: backupLocation,
    }

    r.repoLocks[repoKey].Unlock()
Contributor commented:

Does this need to be wrapped by r.repoLocksLock, since it's technically a mutation?

@skriss (Contributor, Author) replied:

Yeah, it does need a read lock... I need to do a little re-jiggering to avoid deadlocks.

@nrb (Contributor) commented Feb 25, 2019:

Is it a read lock to call the underlying Unlock mechanism? I'd think it would be a write lock, since state in the key's mutex is changing.

@skriss (Contributor, Author) commented Feb 25, 2019:

OK, the latest commit should clean this up. repoLocksLock (now repoLocksMu) synchronizes reads and writes to the repoLocks map, since maps are not thread-safe. Each mutex in the map serializes runs of EnsureRepo for a given namespace + location pair. The previous method structure interleaved the two mutexes, which could lead to deadlocks; that's no longer the case.

@PeterGrace commented:

I'll retest if there's a new image, @skriss? Let me know the tag to pull.

@skriss (Contributor, Author) replied:

@PeterGrace I pushed steveheptio/ark:v0.10.2-alpha.2 with the latest code

@nrb (Contributor) commented Feb 26, 2019

Code looks good, probably warrants a release note.

Signed-off-by: Steve Kriss <krisss@vmware.com>
@skriss (Contributor, Author) commented Feb 26, 2019

squashed + added release note

@nrb (Contributor) commented Feb 26, 2019

While I still think this code is fine, I found out about the sync.Map type introduced in Go 1.9; I think it could apply here if we have to do much more synchronization logic in the RepositoryEnsurer.
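A hypothetical sketch of that suggestion (not the merged code): sync.Map's LoadOrStore atomically returns the existing per-key mutex or stores a fresh one, so no separate mutex is needed around the map itself:

```go
package main

import (
	"fmt"
	"sync"
)

// syncMapEnsurer sketches the sync.Map alternative. LoadOrStore does the
// fetch-or-create atomically, replacing the explicit map-guarding mutex.
type syncMapEnsurer struct {
	repoLocks sync.Map // effectively map[string]*sync.Mutex
}

func (e *syncMapEnsurer) repoLock(key string) *sync.Mutex {
	lock, _ := e.repoLocks.LoadOrStore(key, new(sync.Mutex))
	return lock.(*sync.Mutex)
}

func main() {
	e := &syncMapEnsurer{}
	a := e.repoLock("ns/default")
	b := e.repoLock("ns/default")
	fmt.Println("same mutex:", a == b) // prints "same mutex: true"
}
```

One caveat: LoadOrStore as written allocates a new mutex on every call even when the key already exists; a Load-first double check avoids that, though for this low-frequency path the allocation is negligible.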

@PeterGrace commented:
Tested the image that Steve pushed to steveheptio/ark:v0.10.2-alpha.2, it worked as expected. Thanks @skriss!

@skriss (Contributor, Author) commented Feb 26, 2019

@carlisia PTAL

@carlisia (Contributor) left a review:

lgtm 👍

@skriss (Contributor, Author) commented Feb 27, 2019

@carlisia or @nrb pls merge!

@nrb nrb merged commit 783c7d8 into vmware-tanzu:master Feb 27, 2019
@skriss skriss deleted the restic-race-fix branch February 28, 2019 15:12
Successfully merging this pull request may close these issues.

Go crash when restoring a schedule backup
4 participants