Fix for VideoRoom race condition (see #3124 and #3154) #3167
This is a first attempt to fix a nasty (and rare) race condition that can occur in the VideoRoom, and that has been discovered thanks to logs provided in #3154: there's a very good chance that #3124 is a duplicate of the same issue, which is why I'm grouping them together here.
The issue that happens is basically the following:
`destroy_session` and `hangup_media` are called, which do nothing since at that point the handle is still neither a publisher nor a subscriber.

The problem is clearly the race condition between the "join" processing and the detach, which causes a `janus_videoroom_subscriber` instance to keep on living even though its session is gone, while still referring to a session pointer that's now garbage. It's not something that can be easily fixed with just an additional reference: that may solve the crash, but would leave a leak. The subscriber instance is still in the publisher's list of recipients, and that subscriber can only be removed by a `destroy_session` and/or `hangup_media`, but those happened already and are not coming back (the session is over).

As such, the fix this patch attempts is using
`session_mutex` on a "join" (for both publishers and subscribers) as a way to ensure the race condition doesn't happen. In fact, `session_mutex` is what we use to protect the sessions hashtable, which means it's used when a session is marked as destroyed: we already had checks in "join" for whether the session was destroyed or not, which could be bypassed in case of race conditions like the one above, and now should instead do their job properly.

Marking this PR as a draft now, so that those who reported the problem can test it, and ensure there's no regression. From a quick check I don't think this new mutex usage should cause any lock order inversion problems, but some more testing should help verify no deadlock will occur now. The PR is also a draft just in case it turns out a different mutex may be a better option (e.g., `session->mutex`). As usual, feedback is more than welcome: I hope that, considering many people use the VideoRoom, we'll get more feedback than usual.
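
For those who want to reason about the locking idea without digging into the plugin, here is a minimal sketch of the pattern described above. It is not the actual patch: the `session` struct, `sessions_lock`, `join_sketch` and `destroy_session_sketch` are hypothetical names made up for illustration, and plain pthreads stand in for the mutex wrappers the plugin really uses. The only point it shows is that the "destroyed" check and the subscriber registration happen under the same lock that teardown takes.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a plugin session; not the real Janus structure. */
typedef struct session {
    bool destroyed;       /* set once the session has been torn down */
    bool has_subscriber;  /* true while a subscriber is registered for it */
} session;

/* Plays the role of the global mutex protecting the sessions table (assumed name). */
static pthread_mutex_t sessions_lock = PTHREAD_MUTEX_INITIALIZER;

/* Teardown: mark the session destroyed and drop whatever was registered so far. */
void destroy_session_sketch(session *s) {
    assert(s != NULL);
    pthread_mutex_lock(&sessions_lock);
    s->destroyed = true;
    s->has_subscriber = false;
    pthread_mutex_unlock(&sessions_lock);
}

/* "join": the destroyed check and the registration are one critical section.
 * Returns 0 on success, -1 if the session was already destroyed. */
int join_sketch(session *s) {
    assert(s != NULL);
    pthread_mutex_lock(&sessions_lock);
    if (s->destroyed) {
        /* Teardown won the race: do not register anything. */
        pthread_mutex_unlock(&sessions_lock);
        return -1;
    }
    s->has_subscriber = true;
    pthread_mutex_unlock(&sessions_lock);
    return 0;
}
```

With both functions serialized on the same lock, only two orderings are possible: either the join completes first and the teardown then sees and removes the subscriber, or the teardown completes first and the join bails out on the destroyed check. The trade-off is the one mentioned above: a global lock is held for longer, so lock ordering against other mutexes needs to be checked.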