Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for VideoRoom race condition (see #3124 and 3154) #3167

Merged
merged 4 commits into from
Apr 11, 2023
Merged

Conversation

lminiero
Copy link
Member

This is a first attempt to fix a nasty (and rare) race condition that can occur in the VideoRoom, and that has been discovered thanks to logs provided in #3154: there's very good chances that #3124 is a duplicate of the same issue, which is why I'm grouping them together here.

The issue that happens is basically the following:

  1. a new handle is attached to the VideoRoom;
  2. a "join" as a subscriber is sent on that handle, which is queued for processing;
  3. in the meanwhile, the handle is detached, so destroy_session and hangup_media are called, which do nothing since at that point the handle is still neither a publisher nor a subscriber;
  4. in parallel, the "join" request is processed and goes through, adding the subscriber as a recipient for the specified publisher(s), but the session is now freed;
  5. some time later, the publisher goes away, and when accessing the session for that subscriber, there's a crash since session has been freed already.

The problem is clearly the race condition between the "join" processing and the detach, which causes a janus_videoroom_subscriber instance to keep on living, even though its session has gone, and refering a session pointer that's now garbage. It's not something that can be easily fixed with just an additional reference, since that may solve the crash, but would leave a leak: in fact, the problem is that the subscriber instance is still in the publisher's list of recipients, and that subscriber can only be removed by a destroy_session and/or hangup_media, but those happened already and are not coming back (the session is over).

As such, the fix this patch attempts is using session_mutex on a "join" (for both publishers and subscribers) as a way to ensure the race condition doesn't happen. In fact, session_mutex is what we use to protect the sessions hashtable, which means it's used when a session is marked as destroyed: we already had checks in "join" for whether the session was destroyed or not, which could be bypassed in case of race conditions like the one above, and now should instead do their job properly.

Marking this PR as a draft now, so that those who reported the problem can test it, and ensure there's no regression. From a quick check I don't think this new mutex usage should cause any instance of inverse lock order problems, but some more testing should help verify no deadlock will occur now. The PR is also a draft just in case it turns out a different mutex may be a better option (e.g., session->mutex). As usual, feedback is more than welcome: I hope that, considering many people use the VideoRoom, we'll get more feedback than usual.

@lminiero lminiero added the multistream Related to Janus 1.x label Feb 17, 2023
@lminiero lminiero marked this pull request as ready for review March 8, 2023 14:13
@lminiero
Copy link
Member Author

While there still are some race conditions to address, they seem to be happening in different places for now. As such, I'll start merging these fixes as they help greatly already. We'll address the other problems #3154 spotted in a new PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multistream Related to Janus 1.x
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants