Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix possible deadlock during server close #14813

Merged
merged 2 commits into from
Jul 23, 2022

Conversation

nklaassen
Copy link
Contributor

Recently while trying to merge a PR of mine the unit tests timed out, as they sometimes do.

After doing some digging I found that lib/srv/regular.(*Server).close() can deadlock, because it holds its own lock while closing its session registry, which closes the session, which emits a session leave event, which calls back out to lib/srv/regular.(*Server).getAdvertiseAddr(), which tries to grab the lock.

"Why are we holding the lock during the whole close() call?" I wondered.

Turns out that in #14173 I "fixed" a segfault by blindly adding a lock in Server.Close() where it was setting s.heartbeat and s.users to nil.

I had a look here, and it really seems like there's no reason to set s.heartbeat and s.users to nil while closing, and thus no need to protect those with the lock. Seems like it was originally done to prevent a double-close, but that's really not necessary, they both just cancel a context.

My solution here is to stop setting these to nil, and lose the locking in s.close() and s.startPeriodicOperations(). It's fine if s.startPeriodicOperations() is called after s.close() (the original bug in #14173) because the context will be cancelled and everything will stop right away.

I ran all the lib/web unit tests 387 times overnight (because that's where I saw this deadlock manifest) and was not able to reproduce this deadlock. I did repro it in 70 attempts off of master. There are other flaky tests in this package, but I saw the same set of flakes on master and this branch, minus the deadlock fixed here.

@jakule
Copy link
Contributor

jakule commented Jul 23, 2022

You killed me with that comment 😆

Recently while trying to merge a PR of mine the unit tests timed out, as they sometimes do.

ut-fail

Thanks for fixing that.

@nklaassen nklaassen enabled auto-merge (squash) July 23, 2022 03:08
@nklaassen nklaassen merged commit 59baa34 into master Jul 23, 2022
@github-actions
Copy link

@nklaassen See the table below for backport results.

Branch Result
branch/v10 Create PR
branch/v9 Failed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants