Fix possible deadlock during server close #14813
Merged
Recently, while trying to merge a PR of mine, the unit tests timed out, as they sometimes do.

After doing some digging I found that `lib/srv/regular.(*Server).close()` can deadlock, because it holds its own lock while closing its session registry, which closes the session, which emits a session leave event, which calls back out to `lib/srv/regular.(*Server).getAdvertiseAddr()`, which tries to grab the lock.

"Why are we holding the lock during the whole `close()` call?" I wondered.

Turns out that in #14173 I "fixed" a segfault by blindly adding a lock in `Server.Close()` where it was setting `s.heartbeat` and `s.users` to `nil`.

I had a look here, and it really seems like there's no reason to set `s.heartbeat` and `s.users` to `nil` while closing, and thus no need to protect those with the lock. Seems like it was originally done to prevent a double-close, but that's really not necessary, since they both just cancel a context.

My solution here is to stop setting these to `nil`, and lose the locking in `s.close()` and `s.startPeriodicOperations()`. It's fine if `s.startPeriodicOperations()` is called after `s.close()` (the original bug in #14173) because the context will be cancelled and everything will stop right away.

I ran all the `lib/web` unit tests 387 times overnight (because that's where I saw this deadlock manifest) and was not able to reproduce this deadlock. I did repro it in 70 attempts off of master. There are other flaky tests in this package, but I saw the same set of flakes on master and this branch, minus the deadlock fixed here.